New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Add clinvar #39

Merged

inodb merged 6 commits into genome-nexus:master from leexgh:add-clinvar

Mar 22, 2021

Member

leexgh commented Mar 5, 2021 •

edited

Loading

Fix: knowledgesystems/signal#104

Conver ClinVar VCF to tsv file and save to database. transform_vcf_to_tsv.py is a generic script to transform vcf to tsv, could be used on other files.

To download and generate clinvar files, run make clinvar/input/clinvar_grch37_input.vcf.gz to download vcf files first, then run make clinvar/export/clinvar_grch38.tst.gz.

When generating clinvar files, there might be some warnings: [W::vcf_parse] Contig '1' is not defined in the header. (Quick workaround: index the file with tabix.), I guess it's ok to ignore them for now because everything works as expected. Solutions to fix(haven't verified yet): https://www.biostars.org/p/407384/

Reference:
https://github.com/sigven/vcf2tsv
https://github.com/brentp/cyvcf2

conver clinvar VCF to tsv
import tsv to database

leexgh added 4 commits

March 4, 2021 23:41


          add a script to conver clinvar vcf to tsv

4e4049e


          update scripts

1690ff3


          make vcf2tsv script more general

c9110be


          code clean up

210f609

leexgh marked this pull request as ready for review

March 9, 2021 15:35


          upload vcf file and update script to remove unzip file

aa6aff5

leexgh commented

View reviewed changes

scripts/transform_vcf_to_tsv.py

+                 args = parser.parse_args()
+                 vcf2tsv(args.input_vcf, args.out_tsv, args.fields)
+              def get_genomic_location(variant):

Member Author

leexgh Mar 9, 2021

Please help double check the genomic location calculation

Member

inodb Mar 11, 2021

it looks good to me, might be nice at some point to introduce some testing framework in a separate PR. Currently we have 0 tests for any of the scripts

Member

inodb Mar 11, 2021

see also: #2

scripts/transform_vcf_to_tsv.py Outdated

+              def vcf2tsv(input_vcf, out_tsv, fields):
+                 vcf = VCF(input_vcf, gts012 = True)
+                 fixed_columns_header = ['Chromosome','Start_Position','End_Position','Reference_Allele','Alternate_Allele','ClinVar_ID','Quality','Filter']

Member Author

leexgh Mar 9, 2021 •

edited

Loading

We could also change 'ClinVar_ID' to 'Variant_ID' to make this script more generic, or keep 'ClinVar_ID' since we only have clinvar files now

Member

inodb Mar 11, 2021

i think Clinvar_ID is fine for now

leexgh commented

View reviewed changes

scripts/transform_vcf_to_tsv.py Outdated

+                          tsv_fields.append(",".join(str(n) for n in variant_info.get(info_field)))
+                       else:
+                          if variant_info.get(info_field) is None:
+                             tsv_fields.append('')

Member Author

leexgh Mar 9, 2021

VCF has '.' for an empty cell, do we want '' or null or also '.'?

Member

inodb Mar 11, 2021

good question - I guess '' is more common for tsv files, but most important is actually when it gets imported into mongo. There we want to make sure we follow mongo's recommended practice for missing values. Not sure what that is, but we should check how we can make mongoimport recognize N/A values. I see some stackoverflow question about this here: https://stackoverflow.com/questions/45574359/using-mongoimport-to-import-a-csv-file-is-it-possible-to-import-nulls, but it's still a bit unclear to me. Could you investigate further?

Member Author

leexgh Mar 16, 2021

mongoimport will import empty fields as "", we can use --ignoreBlanks to ignore those fields (https://docs.mongodb.com/v2.2/reference/mongoimport/). But I guess it doesn't hurt to leave empty fields as ""

scripts/transform_vcf_to_tsv.py

+                             tsv_fields.append('')
+                          else:
+                             if column_types[info_field] == 'Float':
+                                tsv_fields.append(str("{0:.5f}".format(variant_info.get(info_field))))

Member Author

leexgh Mar 9, 2021

{0:.5f} Because clinvar vcf float columns have 5 digits after decimal

leexgh requested review from inodb and onursumer

March 9, 2021 23:35

inodb reviewed

View reviewed changes

data/Makefile Outdated Show resolved Hide resolved

inodb reviewed

View reviewed changes

scripts/import_mongo.sh Outdated Show resolved Hide resolved

inodb reviewed

View reviewed changes

scripts/transform_vcf_to_tsv.py Outdated

+                    else:
+                       df = df.drop(column, axis=1)
+                 df.to_csv(out_tsv, sep="\t", index=False)
+                 subprocess.run('gzip -f ' + str(out_tsv), shell=True)

Member

inodb Mar 11, 2021

im personally not a huge fan of running subprocess within a python script for gzipping, because when I'm debugging the output of a script I want to see the unzipped output. I therefore prefer to gzip in Make, e.g. if you create some rule:

%.gz: %
    gzip $<

Then you can easily gzip any file outputted by any script, see also: https://riptutorial.com/makefile/example/21374/generic-rule-to-gzip-a-file

Member Author

leexgh Mar 16, 2021

This could work but seems like it needs to have a separate command $ make -f makefile example.txt.gz to zip a file, it's probably not worth manually running a command just to gzip one file(other files are already zipped). But I get your point it's better to do gzip outside python, I remove the gzip subprocess and do gzip in Make by regular command.

Member

inodb Mar 21, 2021

@leexgh make should able to figure out the different commands it needs to run based on the extensions. So in theory if you have a rule that generates example.txt and a rule that can generate %.gz, then all you would run is make example.txt.gz and both rules get executed. Make will also automatically clean up in between files (in this case example.txt). Not sure if that clarifies things

inodb reviewed

View reviewed changes

scripts/transform_vcf_to_tsv.py Show resolved Hide resolved

inodb approved these changes

View reviewed changes

Member

inodb left a comment

Excellent work! This is very clean and well organized 🥇

I made some small comments but good to go after fixing those

inodb reviewed

View reviewed changes

scripts/transform_vcf_to_tsv.py Outdated

Comment on lines 63 to 67

+                    # multiple variant allele, or no variant allele
+                    # start position
+                    genomic_location.append(str('-1'))
+                    # end position
+                    genomic_location.append(str('-1'))

Member

inodb Mar 11, 2021

not sure i follow multiple variant allele case, do you have an example of that?

Member Author

leexgh Mar 15, 2021

@inodb There are some examples in VCF format specification: https://samtools.github.io/hts-specs/VCFv4.1.pdf

I don't see multiple variant alleles in clinvar, but just in case we have other vcf files in the future that contains multiple alleles

Member

inodb Mar 16, 2021

@leexgh Nice! Good catch. Can you maybe add a comment with an example of these two cases. E.g.:

A G,T
T .

Member Author

leexgh Mar 16, 2021

example added

leexgh force-pushed the add-clinvar branch 2 times, most recently from d075029 to f89420d Compare

March 17, 2021 22:15


          fix comments

976d5da

leexgh force-pushed the add-clinvar branch from f89420d to 976d5da Compare

March 18, 2021 04:25

inodb reviewed

View reviewed changes

data/Makefile

Comment on lines +196 to +197

		clinvar/export/clinvar_grch38.txt.gz:
		gunzip -k ../data/clinvar/input/clinvar_grch38_input.vcf.gz && python ../scripts/transform_vcf_to_tsv.py ../data/clinvar/input/clinvar_grch38_input.vcf ../data/clinvar/export/clinvar_grch38.txt --fields chromosome,start_position,end_position,reference_allele,alternate_allele,clinvar_id,clnsig,clnsigconf && gzip ../data/clinvar/export/clinvar_grch38.txt && rm ../data/clinvar/input/clinvar_grch38_input.vcf

Member

inodb Mar 21, 2021

@leexgh in reference to previous comment: you could also instead create two rules that extract and gzip the file:

%: %.gz
    gunzip $<
%.gz: %
    gzip $<

That would make the commands slightly easier to follow

For this specific file, you'd prolly want to indicate the dependency:

clinvar/export/clinvar_grch38.txt.gz: clinvar/input/clinvar_grch38_input.vcf.gz

That way whenever we update clinvar_grch38_input.vcf.gz. Running make clinvar/export/clinvar_grch38.txt.gz will also update clinvar/export/clinvar_grch38.txt.gz

Member

inodb Mar 22, 2021

Added this as separate issue #44

inodb mentioned this pull request

Simplify makefile gzip/gunzip #44

Open

inodb merged commit 0c7da82 into genome-nexus:master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet