Allow for any number of expression or splicing datasets on a study #319

Open
olgabot opened this issue Aug 13, 2015 · 0 comments

Some studies may include multiple gene expression datasets (e.g. RNA-Seq, RT-PCR) or multiple splicing datasets (e.g. percent spliced-in computed separately for the 5' side and the 3' side of the intron). The idea is that datasets of the same type can be treated alike: they share assumptions about the data (roughly log-normal for expression, values between 0 and 1 for splicing) and can use the same underlying ExpressionData or SplicingData methods, while each acts on its own underlying dataset. The same approach could be extended to location-style data types like ChIP-Seq, CLIP-Seq, Methyl-seq, RNA editing, etc.
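As a rough illustration of "same methods, separate datasets", here is a minimal sketch. The classes below are simplified stand-ins that only borrow the ExpressionData/SplicingData names; `BaseData`, `validate`, `log_transformed`, the non-negativity check, and the `log2(x + 1)` transform are all assumptions made for the example, not flotilla's actual API.

```python
import math


class BaseData:
    """Hypothetical shared base class (illustration only)."""

    def __init__(self, name, values):
        # values: {sample_id: {feature_id: number}}
        self.name = name
        self.values = values
        self.validate()

    def validate(self):
        """Subclasses check their own data-type assumptions here."""

    def _check(self, is_valid, message):
        for sample, features in self.values.items():
            for feature, value in features.items():
                if not is_valid(value):
                    raise ValueError(
                        "{}: {} for {}/{} ({})".format(
                            self.name, value, sample, feature, message))


class ExpressionData(BaseData):
    """Stand-in for expression data: assumed here to be non-negative."""

    def validate(self):
        self._check(lambda v: v >= 0, "expression must be >= 0")

    def log_transformed(self):
        # log2(x + 1) is an assumed transform for roughly log-normal data
        return {sample: {f: math.log2(v + 1) for f, v in features.items()}
                for sample, features in self.values.items()}


class SplicingData(BaseData):
    """Stand-in for splicing data: percent spliced-in lies in [0, 1]."""

    def validate(self):
        self._check(lambda v: 0 <= v <= 1, "psi must be between 0 and 1")


# Two expression datasets and two splicing datasets, each with its own data
rnaseq = ExpressionData("rnaseq", {"sample1": {"RBFOX2": 7.0}})
rtpcr = ExpressionData("rtpcr", {"sample1": {"ACTB": 3.0}})
psi5 = SplicingData("psi5", {"sample1": {"PKM": 0.8}})
psi3 = SplicingData("psi3", {"sample1": {"NRXN1": 0.25}})
```

Each instance carries its own values, so `rnaseq` and `rtpcr` share methods without sharing data.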

I envision implementing this in the datapackage as:

{
  "name": "million_dollar_dataset", 
  "title": null, 
  "datapackage_version": "0.1.0", 
  "sources": null, 
  "licenses": null, 
  "species": "hg19", 
  "resources": [
    {
      "path": "psi5.csv.gz", 
      "format": "csv", 
      "data_type": "splicing",
      "name": "psi5", 
      "compression": "gzip"
    }, 
    {
      "name": "psi5_feature", 
      "format": "csv", 
      "rename_col": "gene_name", 
      "ignore_subset_cols": [
        "ensembl_gene", 
        "gencode_gene", 
        "gencode_transcript", 
        "ensembl_transcript", 
        "gene_name", 
        "transcript_id", 
        "havana_gene", 
        "gencode_id"
      ], 
      "path": "psi5_feature.csv.gz", 
      "expression_id_col": "one_ensembl_id", 
      "compression": "gzip"
    }, 
    {
      "path": "psi3.csv.gz", 
      "format": "csv", 
      "data_type": "splicing",
      "name": "psi3", 
      "compression": "gzip"
    }, 
    {
      "name": "psi3_feature", 
      "format": "csv", 
      "rename_col": "gene_name", 
      "ignore_subset_cols": [
        "ensembl_gene", 
        "gencode_gene", 
        "gencode_transcript", 
        "ensembl_transcript", 
        "gene_name", 
        "transcript_id", 
        "havana_gene", 
        "gencode_id"
      ], 
      "path": "psi3_feature.csv.gz", 
      "expression_id_col": "one_ensembl_id", 
      "compression": "gzip"
    }, 
    {
      "path": "rtpcr.csv.gz", 
      "format": "csv", 
      "data_type": "expression",
      "name": "rtpcr", 
      "compression": "gzip"
    }, 
    {
      "name": "rtpcr_feature", 
      "format": "csv", 
      "rename_col": "gene_name", 
      "path": "rtpcr_feature.csv.gz", 
      "compression": "gzip"
    }, 
    {
      "path": "rnaseq.csv.gz", 
      "format": "csv", 
      "data_type": "expression",
      "name": "rnaseq", 
      "compression": "gzip"
    }, 
    {
      "name": "rnaseq_feature", 
      "format": "csv", 
      "rename_col": "gene_name", 
      "ignore_subset_cols": [
        "ensembl_gene", 
        "gencode_gene", 
        "gencode_transcript", 
        "ensembl_transcript", 
        "gene_name", 
        "transcript_id", 
        "havana_gene", 
        "gencode_id"
      ], 
      "path": "rnaseq_feature.csv.gz", 
      "compression": "gzip"
    }, 
    {
      "name": "mapping_stats", 
      "format": "csv", 
      "min_reads": 1000000.0, 
      "path": "mapping_stats.csv.gz", 
      "number_mapped_col": "Uniquely mapped reads number", 
      "compression": "gzip"
    }, 
    {
      "path": "gene_ontology.csv.gz", 
      "format": "csv", 
      "name": "gene_ontology", 
      "compression": "gzip"
    }, 
    {
      "name": "metadata", 
      "format": "csv", 
      "minimum_samples": 20, 
      "path": "metadata.csv.gz", 
      "phenotype_col": "phenotype", 
      "compression": "gzip"
    }
  ]
}

Notice that the {rnaseq,rtpcr,psi3,psi5}_feature resources are the feature metadata for the corresponding datasets. They may later be stored within the {rnaseq,rtpcr,psi3,psi5} entries themselves.
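One way a loader could interpret such a datapackage is sketched below: group the resources by `data_type` and pair each with its `<name>_feature` resource when one exists. The function name `group_resources` and the returned structure are hypothetical, and the datapackage is trimmed to the keys the sketch needs.

```python
def group_resources(datapackage):
    """Group data resources by data_type and attach '<name>_feature' if present."""
    by_name = {r["name"]: r for r in datapackage["resources"]}
    grouped = {}
    for resource in datapackage["resources"]:
        data_type = resource.get("data_type")
        if data_type is None:
            # Feature metadata, mapping_stats, metadata, etc. carry no data_type
            continue
        feature = by_name.get(resource["name"] + "_feature")
        grouped.setdefault(data_type, {})[resource["name"]] = {
            "data": resource,
            "feature": feature,
        }
    return grouped


# Trimmed version of the datapackage above
datapackage = {
    "name": "million_dollar_dataset",
    "resources": [
        {"name": "psi5", "path": "psi5.csv.gz", "data_type": "splicing"},
        {"name": "psi5_feature", "path": "psi5_feature.csv.gz"},
        {"name": "psi3", "path": "psi3.csv.gz", "data_type": "splicing"},
        {"name": "psi3_feature", "path": "psi3_feature.csv.gz"},
        {"name": "rtpcr", "path": "rtpcr.csv.gz", "data_type": "expression"},
        {"name": "rtpcr_feature", "path": "rtpcr_feature.csv.gz"},
        {"name": "rnaseq", "path": "rnaseq.csv.gz", "data_type": "expression"},
        {"name": "rnaseq_feature", "path": "rnaseq_feature.csv.gz"},
        {"name": "metadata", "path": "metadata.csv.gz"},
    ],
}

grouped = group_resources(datapackage)
print(sorted(grouped["expression"]))  # ['rnaseq', 'rtpcr']
print(sorted(grouped["splicing"]))    # ['psi3', 'psi5']
```

If the feature metadata moves inside the dataset entries, the name-based pairing step would simply go away.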

This datapackage would then produce a Study object where you can do:

study.plot_rnaseq('RBFOX2')
study.plot_rtpcr('ACTB')
study.plot_psi3("NRXN1")
study.plot_psi5("PKM")
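One possible way to get `plot_<name>` methods for arbitrarily named datasets is attribute dispatch via `__getattr__`. This is only a sketch of the idea, not how flotilla's Study works: the `Dataset` class and its `plot_feature` method are hypothetical, and the "plot" just returns a description string so the example runs without plotting libraries.

```python
class Dataset:
    """Hypothetical container for one named dataset."""

    def __init__(self, name, data_type):
        self.name = name
        self.data_type = data_type

    def plot_feature(self, feature_id):
        # Placeholder for real plotting code
        return "{} ({}): {}".format(self.name, self.data_type, feature_id)


class Study:
    """Sketch of a Study exposing plot_<dataset name> for each dataset."""

    def __init__(self, datasets):
        self._datasets = {d.name: d for d in datasets}

    def __getattr__(self, attr):
        # Only called when normal attribute lookup fails
        if attr.startswith("plot_"):
            name = attr[len("plot_"):]
            # Go through __dict__ to avoid recursing into __getattr__
            datasets = self.__dict__.get("_datasets", {})
            if name in datasets:
                return datasets[name].plot_feature
        raise AttributeError(attr)


study = Study([
    Dataset("rnaseq", "expression"),
    Dataset("rtpcr", "expression"),
    Dataset("psi3", "splicing"),
    Dataset("psi5", "splicing"),
])

print(study.plot_rnaseq("RBFOX2"))  # rnaseq (expression): RBFOX2
print(study.plot_psi5("PKM"))       # psi5 (splicing): PKM
```

The trade-off is that dynamically resolved methods do not show up in tab completion unless `__dir__` is also overridden; generating the methods at initialization time with `setattr` would be an alternative.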

It's going to be hard to implement this without making Study initialization too complicated, because even the current implementation is really all over the place.

This feature will really change the game: it would make flotilla a true one-stop shop for a study, with every expression and splicing dataset available from a single Study object.
