Find structured-datasets from within the Sequence Read Archive

mbernste/hypothesis-driven-SRA-queries

Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive

This repository implements a pair of Jupyter notebook-based tools that use the MetaSRA to build structured datasets from the SRA, facilitating secondary analyses of the SRA’s human RNA-seq data:

  • Case-Control Finder: Finds suitable case and control samples for a given disease or condition where the cases and controls are matched by tissue or cell type.
  • Series Finder: Finds ordered sets of samples for addressing biological questions about changes over a numerical property such as time.

For a more in-depth description of these tools, please see our publication:

Bernstein, M.N. et al. (2020). Jupyter notebook-based tools for building structured datasets from the Sequence Read Archive. F1000Research, 9:376

A few things to note:

Running the notebooks on Google Colab

The notebooks can be executed in the cloud via Google Colab:

Running the notebooks locally

Setup

The dependencies for these notebooks are listed in requirements.txt. To install them, run:

pip install -r requirements.txt

Before running the notebooks, you must also unpack the static metadata files from data.tar.gz. To do so, run the following command:

tar -zxf data.tar.gz
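
If the unpacking step is scripted, it can help to check the archive before extracting it. The following is a minimal sketch, not part of this repository: the function name `unpack_data` and its optional destination argument are hypothetical, and it assumes a `tar` that supports gzip via `-z`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not provided by this repository): validate a gzipped
# tarball, then extract it into a destination directory.
# Usage: unpack_data [ARCHIVE] [DEST]   (defaults: data.tar.gz and ".")
unpack_data() {
    local archive="${1:-data.tar.gz}"
    local dest="${2:-.}"

    # Stop early if the archive is missing.
    if [ ! -f "$archive" ]; then
        echo "error: $archive not found" >&2
        return 1
    fi

    # -t lists the contents without extracting; this fails on a corrupt archive.
    if ! tar -tzf "$archive" > /dev/null; then
        echo "error: $archive is not a valid gzipped tarball" >&2
        return 1
    fi

    mkdir -p "$dest"
    # -x extract, -z decompress with gzip, -f archive file, -C target directory
    tar -xzf "$archive" -C "$dest"
}
```

Calling `unpack_data` with no arguments from the repository root has the same effect as the `tar` command above, with the added checks. Note that `-x` extracts an archive, whereas `-c` would create one.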

Running the notebooks

To run the Case-Control Finder, run:

jupyter notebook case_control_finder.ipynb

To run the Series Finder, run:

jupyter notebook series_finder.ipynb

Contributors

  • Matthew Bernstein
  • Emily Clough
  • Ariella Gladstein
  • Khun Zaw Latt
  • Ben Busby
  • Allissa Dillman