Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add clinvar #39

Merged
merged 6 commits into from
Mar 22, 2021
Merged

Add clinvar #39

merged 6 commits into from
Mar 22, 2021

Conversation

leexgh
Copy link
Member

@leexgh leexgh commented Mar 5, 2021

Fix: knowledgesystems/signal#104

Conver ClinVar VCF to tsv file and save to database. transform_vcf_to_tsv.py is a generic script to transform vcf to tsv, could be used on other files.

To download and generate clinvar files, run make clinvar/input/clinvar_grch37_input.vcf.gz to download vcf files first, then run make clinvar/export/clinvar_grch38.tst.gz.

When generating clinvar files, there might be some warnings: [W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.), I guess it's ok to ignore them for now because everything works as expected. Solutions to fix(haven't verified yet): https://www.biostars.org/p/407384/

Reference:
https://github.com/sigven/vcf2tsv
https://github.com/brentp/cyvcf2

  • conver clinvar VCF to tsv
  • import tsv to database

@leexgh leexgh marked this pull request as ready for review March 9, 2021 15:35
args = parser.parse_args()
vcf2tsv(args.input_vcf, args.out_tsv, args.fields)

def get_genomic_location(variant):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please help double check the genomic location calculation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks good to me, might be nice at some point to introduce some testing framework in a separate PR. Currently we have 0 tests for any of the scripts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see also: #2


def vcf2tsv(input_vcf, out_tsv, fields):
vcf = VCF(input_vcf, gts012 = True)
fixed_columns_header = ['Chromosome','Start_Position','End_Position','Reference_Allele','Alternate_Allele','ClinVar_ID','Quality','Filter']
Copy link
Member Author

@leexgh leexgh Mar 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also change 'ClinVar_ID' to 'Variant_ID' to make this script more generic, or keep 'ClinVar_ID' since we only have clinvar files now

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think Clinvar_ID is fine for now

tsv_fields.append(",".join(str(n) for n in variant_info.get(info_field)))
else:
if variant_info.get(info_field) is None:
tsv_fields.append('')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VCF has '.' for an empty cell, do we want '' or null or also '.'?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question - I guess '' is more common for tsv files, but most important is actually when it gets imported into mongo. There we want to make sure we follow mongo's recommended practice for missing values. Not sure what that is, but we should check how we can make mongoimport recognize N/A values. I see some stackoverflow question about this here: https://stackoverflow.com/questions/45574359/using-mongoimport-to-import-a-csv-file-is-it-possible-to-import-nulls, but it's still a bit unclear to me. Could you investigate further?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mongoimport will import empty fields as "", we can use --ignoreBlanks to ignore those fields (https://docs.mongodb.com/v2.2/reference/mongoimport/). But I guess it doesn't hurt to leave empty fields as ""

tsv_fields.append('')
else:
if column_types[info_field] == 'Float':
tsv_fields.append(str("{0:.5f}".format(variant_info.get(info_field))))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{0:.5f} Because clinvar vcf float columns have 5 digits after decimal

@leexgh leexgh requested review from inodb and onursumer March 9, 2021 23:35
data/Makefile Outdated Show resolved Hide resolved
scripts/import_mongo.sh Outdated Show resolved Hide resolved
else:
df = df.drop(column, axis=1)
df.to_csv(out_tsv, sep="\t", index=False)
subprocess.run('gzip -f ' + str(out_tsv), shell=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im personally not a huge fan of running subprocess within a python script for gzipping, because when I'm debugging the output of a script I want to see the unzipped output. I therefore prefer to gzip in Make, e.g. if you create some rule:

%.gz: %
    gzip $<

Then you can easily gzip any file outputted by any script, see also: https://riptutorial.com/makefile/example/21374/generic-rule-to-gzip-a-file

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could work but seems like it needs to have a separate command $ make -f makefile example.txt.gz to zip a file, it's probably not worth manually running a command just to gzip one file(other files are already zipped). But I get your point it's better to do gzip outside python, I remove the gzip subprocess and do gzip in Make by regular command.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leexgh make should able to figure out the different commands it needs to run based on the extensions. So in theory if you have a rule that generates example.txt and a rule that can generate %.gz, then all you would run is make example.txt.gz and both rules get executed. Make will also automatically clean up in between files (in this case example.txt). Not sure if that clarifies things

Copy link
Member

@inodb inodb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work! This is very clean and well organized 🥇

I made some small comments but good to go after fixing those

Comment on lines 63 to 67
# multiple variant allele, or no variant allele
# start position
genomic_location.append(str('-1'))
# end position
genomic_location.append(str('-1'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i follow multiple variant allele case, do you have an example of that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@inodb There are some examples in VCF format specification: https://samtools.github.io/hts-specs/VCFv4.1.pdf
image
I don't see multiple variant alleles in clinvar, but just in case we have other vcf files in the future that contains multiple alleles

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leexgh Nice! Good catch. Can you maybe add a comment with an example of these two cases. E.g.:

A G,T
T .

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

example added

@leexgh leexgh force-pushed the add-clinvar branch 2 times, most recently from d075029 to f89420d Compare March 17, 2021 22:15
Comment on lines +196 to +197
clinvar/export/clinvar_grch38.txt.gz:
gunzip -k ../data/clinvar/input/clinvar_grch38_input.vcf.gz && python ../scripts/transform_vcf_to_tsv.py ../data/clinvar/input/clinvar_grch38_input.vcf ../data/clinvar/export/clinvar_grch38.txt --fields chromosome,start_position,end_position,reference_allele,alternate_allele,clinvar_id,clnsig,clnsigconf && gzip ../data/clinvar/export/clinvar_grch38.txt && rm ../data/clinvar/input/clinvar_grch38_input.vcf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leexgh in reference to previous comment: you could also instead create two rules that extract and gzip the file:

%: %.gz
    gunzip $<
%.gz: %
    gzip $<

That would make the commands slightly easier to follow

For this specific file, you'd prolly want to indicate the dependency:

clinvar/export/clinvar_grch38.txt.gz: clinvar/input/clinvar_grch38_input.vcf.gz

That way whenever we update clinvar_grch38_input.vcf.gz. Running make clinvar/export/clinvar_grch38.txt.gz will also update clinvar/export/clinvar_grch38.txt.gz

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this as separate issue #44

@inodb inodb merged commit 0c7da82 into genome-nexus:master Mar 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adding Clinvar to genome nexus
2 participants