Acquires and loads gene and mutation data #42
Changes from all commits
d12acf8
12c3276
91064b9
d48475f
47e65c5
68b0e45
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,10 @@ | ||
import os | ||
import csv | ||
import bz2 | ||
|
||
from django.core.management.base import BaseCommand | ||
|
||
from api.models import Disease, Sample | ||
from api.models import Disease, Sample, Gene, Mutation | ||
|
||
|
||
class Command(BaseCommand): | ||
|
@@ -22,22 +23,66 @@ def handle(self, *args, **options): | |
disease_path = os.path.join(options['path'], 'diseases.tsv') | ||
with open(disease_path) as disease_file: | ||
disease_reader = csv.DictReader(disease_file, delimiter='\t') | ||
disease_list = [] | ||
for row in disease_reader: | ||
Disease.objects.create( | ||
disease = Disease( | ||
acronym=row['acronym'], | ||
name=row['disease'] | ||
) | ||
disease_list.append(disease) | ||
Disease.objects.bulk_create(disease_list) | ||
|
||
# Samples | ||
if Sample.objects.count() == 0: | ||
sample_path = os.path.join(options['path'], 'samples.tsv') | ||
with open(sample_path) as sample_file: | ||
sample_reader = csv.DictReader(sample_file, delimiter='\t') | ||
sample_list = [] | ||
for row in sample_reader: | ||
disease = Disease.objects.get(acronym=row['acronym']) | ||
Sample.objects.create( | ||
sample = Sample( | ||
sample_id=row['sample_id'], | ||
disease=disease, | ||
gender=row['gender'] or None, | ||
age_diagnosed=row['age_diagnosed'] or None | ||
) | ||
sample_list.append(sample) | ||
Sample.objects.bulk_create(sample_list) | ||
|
||
# Genes | ||
if Gene.objects.count() == 0: | ||
gene_path = os.path.join(options['path'], 'genes.tsv') | ||
with open(gene_path) as gene_file: | ||
gene_reader = csv.DictReader(gene_file, delimiter='\t') | ||
gene_list = [] | ||
for row in gene_reader: | ||
gene = Gene( | ||
entrez_gene_id=row['entrez_gene_id'], | ||
symbol=row['symbol'], | ||
description=row['description'], | ||
chromosome=row['chromosome'] or None, | ||
gene_type=row['gene_type'], | ||
synonyms=row['synonyms'].split('|') or None, | ||
aliases=row['aliases'].split('|') or None | ||
) | ||
gene_list.append(gene) | ||
Gene.objects.bulk_create(gene_list) | ||
|
||
# Mutations | ||
if Mutation.objects.count() == 0: | ||
mutation_path = os.path.join(options['path'], 'mutation-matrix.tsv.bz2') | ||
with bz2.open(mutation_path , 'rt') as mutation_file: | ||
mutation_reader = csv.DictReader(mutation_file, delimiter='\t') | ||
mutation_list = [] | ||
for row in mutation_reader: | ||
sample_id = row.pop('sample_id') | ||
sample = Sample.objects.get(sample_id=sample_id) | ||
for entrez_gene_id, mutation_status in row.items(): | ||
if mutation_status == '1': | ||
try: | ||
This does indeed catch a lone exception, even with the up-to-date
Are you using the latest figshare files (v5)? From https://api.figshare.com/v2/articles/3487685: [
{
"is_link_only":false,
"size":173889264,
"id":5864859,
"download_url":"https://ndownloader.figshare.com/files/5864859",
"name":"expression-matrix.tsv.bz2"
},
{
"is_link_only":false,
"size":1564703,
"id":5864862,
"download_url":"https://ndownloader.figshare.com/files/5864862",
"name":"mutation-matrix.tsv.bz2"
},
{
"is_link_only":false,
"size":772313,
"id":6207135,
"download_url":"https://ndownloader.figshare.com/files/6207135",
"name":"samples.tsv"
},
{
"is_link_only":false,
"size":1211305,
"id":6207138,
"download_url":"https://ndownloader.figshare.com/files/6207138",
"name":"covariates.tsv"
}
]
Looks like you are. Will keep investigating.
Can you delete your local
Just tried this, and the issue persists. ID 117153 (the lone exception) was discontinued on 9/10/16 and replaced with 4253. To investigate, I downloaded
Looking at commit histories in cancer-data, the mutation matrix appears to have been made before this date. Looking at commits from genes, Entrez information was obtained after this date. My guess is that this ID changed between the creation of these two files.
Hmm,
In the event that this is useful, when I download the latest genes file, load with
Thanks for looking into this @stephenshank. I reported the issue in cognoma/cancer-data#36. |
||
gene = Gene.objects.get(entrez_gene_id=entrez_gene_id) | ||
mutation = Mutation(gene=gene, sample=sample) | ||
mutation_list.append(mutation) | ||
except: | ||
print('Had an issue inserting sample', sample_id, 'mutation', entrez_gene_id) | ||
Mutation.objects.bulk_create(mutation_list) |
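The review thread above traces the lone exception to Entrez ID 117153, which was discontinued and replaced with 4253. A minimal sketch of one way to handle that case explicitly, rather than relying on the bare `except`, is shown below. It is plain Python with no Django dependency; `resolve_mutations`, `REPLACED_GENE_IDS`, and the sample ID used in the usage note are hypothetical names introduced for illustration and are not part of this PR.

```python
# Hypothetical replacement table: discontinued Entrez ID -> current ID.
# The single entry comes from the review discussion; a real table would
# need to be derived from Entrez gene history.
REPLACED_GENE_IDS = {117153: 4253}


def resolve_mutations(rows, known_gene_ids, replaced=REPLACED_GENE_IDS):
    """Split mutated (sample_id, gene_id) pairs into resolved and skipped.

    rows: dicts shaped like csv.DictReader rows of the mutation matrix,
          with a 'sample_id' key and one key per Entrez gene ID.
    known_gene_ids: set of integer gene IDs that exist in the gene table.
    """
    resolved, skipped = [], []
    for row in rows:
        row = dict(row)  # copy so the caller's row is left untouched
        sample_id = row.pop('sample_id')
        for raw_gene_id, status in row.items():
            if status != '1':
                continue
            gene_id = int(raw_gene_id)
            # Map a discontinued ID to its replacement, if one is known.
            gene_id = replaced.get(gene_id, gene_id)
            if gene_id in known_gene_ids:
                resolved.append((sample_id, gene_id))
            else:
                skipped.append((sample_id, gene_id))
    return resolved, skipped
```

In the management command, `known_gene_ids` could plausibly be built once with `set(Gene.objects.values_list('entrez_gene_id', flat=True))`, which would also avoid one `Gene.objects.get` query per mutation. Calling `resolve_mutations([{'sample_id': 'S1', '117153': '1'}], {4253})` pairs the made-up sample `S1` with gene 4253 instead of reporting an insertion issue.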
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
# -*- coding: utf-8 -*- | ||
# Generated by Django 1.9.8 on 2016-11-26 02:22 | ||
from __future__ import unicode_literals | ||
|
||
import django.contrib.postgres.fields | ||
from django.db import migrations, models | ||
import django.db.models.deletion | ||
|
||
|
||
class Migration(migrations.Migration): | ||
|
||
dependencies = [ | ||
('api', '0002_alter_sample_fields'), | ||
] | ||
|
||
operations = [ | ||
migrations.CreateModel( | ||
name='Gene', | ||
fields=[ | ||
('entrez_gene_id', models.IntegerField(primary_key=True, serialize=False)), | ||
('symbol', models.CharField(max_length=32)), | ||
('description', models.CharField(max_length=256)), | ||
('chromosome', models.CharField(max_length=8, null=True)), | ||
('gene_type', models.CharField(max_length=16)), | ||
('synonyms', django.contrib.postgres.fields.ArrayField(base_field=models.CharField(max_length=32), null=True, size=None)), | ||
('aliases', django.contrib.postgres.fields.ArrayField(base_field=models.CharField(max_length=256), null=True, size=None)), | ||
], | ||
options={ | ||
'db_table': 'cognoma_genes', | ||
}, | ||
), | ||
migrations.RemoveField( | ||
model_name='mutation', | ||
name='status', | ||
), | ||
migrations.AlterField( | ||
model_name='classifier', | ||
name='genes', | ||
field=models.ManyToManyField(to='api.Gene'), | ||
), | ||
migrations.AlterField( | ||
model_name='mutation', | ||
name='gene', | ||
field=models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, related_name='mutations', to='api.Gene'), | ||
), | ||
] |
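One thing worth noting about the nullable `synonyms` and `aliases` array fields: in the loader, `row['synonyms'].split('|') or None` will probably never produce `None`, because `''.split('|')` returns `['']`, which is truthy. A small helper along the following lines would store `None` for empty fields. `split_or_none` is a hypothetical name, not something defined in this PR.

```python
def split_or_none(value, sep='|'):
    """Split a delimited field, returning None when it holds no items."""
    # Drop empty strings so that '' yields no items at all.
    items = [item for item in value.split(sep) if item]
    return items or None
```

With this helper, the gene loader could pass `synonyms=split_or_none(row['synonyms'])`, assuming an empty TSV cell is meant to be stored as null.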
Would be nice if we could use the figshare download logic from cognoml. See cognoml/figshare.py. @jessept, is our code modular enough that @stephenshank can use cognoml to download the figshare data here, or would this application be out of scope? I created a corresponding issue for the cognoml team: cognoma/cognoml#15.
I just assigned cognoma/cognoml#15 to myself to help move this forward. We can definitely use our data retrieval code in other places. @stephenshank, check if the code here works for what you need. I'm happy to add any additional helper code as well, just let me know what you're looking for.
@jessept -- nice. My main worry here is that hardcoding the URL for a specific dataset of a specific version of the figshare data is going to cause an upkeep issue later on. So the goal will be to use the cognoml logic for figshare downloads to avoid repeating any efforts and clean this up!
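As a rough illustration of the upkeep concern, the file listing quoted earlier in this review already carries a `name` and a `download_url` for each file, so download URLs can be looked up by file name instead of being hardcoded. The sketch below works on an already-parsed listing and does no network access. `download_urls_by_name` is a hypothetical helper, not the cognoml implementation, and the exact shape of the live figshare response is an assumption based on the excerpt above.

```python
import json


def download_urls_by_name(files):
    """Map each file name in a figshare file listing to its download URL."""
    return {entry['name']: entry['download_url'] for entry in files}


# Two entries copied from the listing quoted in the review thread.
listing = json.loads('''
[
  {"is_link_only": false, "size": 1564703, "id": 5864862,
   "download_url": "https://ndownloader.figshare.com/files/5864862",
   "name": "mutation-matrix.tsv.bz2"},
  {"is_link_only": false, "size": 772313, "id": 6207135,
   "download_url": "https://ndownloader.figshare.com/files/6207135",
   "name": "samples.tsv"}
]
''')

urls = download_urls_by_name(listing)
```

The loader would then ask for `urls['mutation-matrix.tsv.bz2']` rather than embedding a file ID, so a new version of the figshare article would only require fetching a fresh listing.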