-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add clinvar #39
Add clinvar #39
Conversation
args = parser.parse_args() | ||
vcf2tsv(args.input_vcf, args.out_tsv, args.fields) | ||
|
||
def get_genomic_location(variant): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please help double check the genomic location calculation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks good to me, might be nice at some point to introduce some testing framework in a separate PR. Currently we have 0 tests for any of the scripts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see also: #2
scripts/transform_vcf_to_tsv.py
Outdated
|
||
def vcf2tsv(input_vcf, out_tsv, fields): | ||
vcf = VCF(input_vcf, gts012 = True) | ||
fixed_columns_header = ['Chromosome','Start_Position','End_Position','Reference_Allele','Alternate_Allele','ClinVar_ID','Quality','Filter'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could also change 'ClinVar_ID' to 'Variant_ID' to make this script more generic, or keep 'ClinVar_ID' since we only have clinvar files now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think Clinvar_ID is fine for now
scripts/transform_vcf_to_tsv.py
Outdated
tsv_fields.append(",".join(str(n) for n in variant_info.get(info_field))) | ||
else: | ||
if variant_info.get(info_field) is None: | ||
tsv_fields.append('') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VCF has '.' for an empty cell, do we want '' or null or also '.'?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question - I guess ''
is more common for tsv files, but most important is actually when it gets imported into mongo. There we want to make sure we follow mongo's recommended practice for missing values. Not sure what that is, but we should check how we can make mongoimport
recognize N/A values. I see some stackoverflow question about this here: https://stackoverflow.com/questions/45574359/using-mongoimport-to-import-a-csv-file-is-it-possible-to-import-nulls, but it's still a bit unclear to me. Could you investigate further?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mongoimport
will import empty fields as ""
, we can use --ignoreBlanks
to ignore those fields (https://docs.mongodb.com/v2.2/reference/mongoimport/). But I guess it doesn't hurt to leave empty fields as ""
tsv_fields.append('') | ||
else: | ||
if column_types[info_field] == 'Float': | ||
tsv_fields.append(str("{0:.5f}".format(variant_info.get(info_field)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{0:.5f}
Because clinvar vcf float columns have 5 digits after decimal
scripts/transform_vcf_to_tsv.py
Outdated
else: | ||
df = df.drop(column, axis=1) | ||
df.to_csv(out_tsv, sep="\t", index=False) | ||
subprocess.run('gzip -f ' + str(out_tsv), shell=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im personally not a huge fan of running subprocess within a python script for gzipping, because when I'm debugging the output of a script I want to see the unzipped output. I therefore prefer to gzip in Make, e.g. if you create some rule:
%.gz: %
gzip $<
Then you can easily gzip any file outputted by any script, see also: https://riptutorial.com/makefile/example/21374/generic-rule-to-gzip-a-file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could work but seems like it needs to have a separate command $ make -f makefile example.txt.gz
to zip a file, it's probably not worth manually running a command just to gzip one file(other files are already zipped). But I get your point it's better to do gzip outside python, I remove the gzip subprocess and do gzip in Make by regular command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leexgh make
should able to figure out the different commands it needs to run based on the extensions. So in theory if you have a rule that generates example.txt
and a rule that can generate %.gz
, then all you would run is make example.txt.gz
and both rules get executed. Make will also automatically clean up in between files (in this case example.txt
). Not sure if that clarifies things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent work! This is very clean and well organized 🥇
I made some small comments but good to go after fixing those
scripts/transform_vcf_to_tsv.py
Outdated
# multiple variant allele, or no variant allele | ||
# start position | ||
genomic_location.append(str('-1')) | ||
# end position | ||
genomic_location.append(str('-1')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure i follow multiple variant allele
case, do you have an example of that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@inodb There are some examples in VCF format specification: https://samtools.github.io/hts-specs/VCFv4.1.pdf
I don't see multiple variant alleles in clinvar, but just in case we have other vcf files in the future that contains multiple alleles
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leexgh Nice! Good catch. Can you maybe add a comment with an example of these two cases. E.g.:
A G,T
T .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
example added
d075029
to
f89420d
Compare
clinvar/export/clinvar_grch38.txt.gz: | ||
gunzip -k ../data/clinvar/input/clinvar_grch38_input.vcf.gz && python ../scripts/transform_vcf_to_tsv.py ../data/clinvar/input/clinvar_grch38_input.vcf ../data/clinvar/export/clinvar_grch38.txt --fields chromosome,start_position,end_position,reference_allele,alternate_allele,clinvar_id,clnsig,clnsigconf && gzip ../data/clinvar/export/clinvar_grch38.txt && rm ../data/clinvar/input/clinvar_grch38_input.vcf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leexgh in reference to previous comment: you could also instead create two rules that extract and gzip the file:
%: %.gz
gunzip $<
%.gz: %
gzip $<
That would make the commands slightly easier to follow
For this specific file, you'd prolly want to indicate the dependency:
clinvar/export/clinvar_grch38.txt.gz: clinvar/input/clinvar_grch38_input.vcf.gz
That way whenever we update clinvar_grch38_input.vcf.gz
. Running make clinvar/export/clinvar_grch38.txt.gz
will also update clinvar/export/clinvar_grch38.txt.gz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this as separate issue #44
Fix: knowledgesystems/signal#104
Conver ClinVar VCF to tsv file and save to database.
transform_vcf_to_tsv.py
is a generic script to transform vcf to tsv, could be used on other files.To download and generate clinvar files, run
make clinvar/input/clinvar_grch37_input.vcf.gz
to download vcf files first, then runmake clinvar/export/clinvar_grch38.tst.gz
.When generating clinvar files, there might be some warnings:
[W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.)
, I guess it's ok to ignore them for now because everything works as expected. Solutions to fix(haven't verified yet): https://www.biostars.org/p/407384/Reference:
https://github.com/sigven/vcf2tsv
https://github.com/brentp/cyvcf2