vcf_to_csv -> not recognizing FORMAT #340

ayadlin · 2020-09-11T23:37:30Z

Hi -
I am trying to copy 3 columns from a VCF file to a CSV file: the first is ID, and the other two are the FORMAT fields DS and GP.
Using the command below I get a warning:
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['calldata/*'])

UserWarning: '*' FORMAT header not found
warnings.warn('%r FORMAT header not found' % name)

Changing the command to

allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['calldata/DS'], types={'calldata/DS': 'f4'})
avoids the warning, but I get an empty CSV file. I know there is data (floats) in the FORMAT:DS column. Is there anything wrong with what I am doing, or is there an issue with the CSV writing?

allel.read_vcf('my_vcf.vcf', fields=['DS','GP']) works, but it is a long process, and I am not sure how to go from there to the CSV file.

Just in case this is useful: FORMAT is one of the column headers in my file, and the header section defines:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">

A final question: as you can see, GP is a tuple (float, float, float). If I need to assign a type to it in the types dictionary, what would the correct syntax be?
Also, I have not been able to find exactly what f4 means (I know it is a float, but is it float32 or float64?).

What I am trying to build is a CSV file that preserves the sample identifiers, the SNP IDs, the dosage (DS) and the posterior probabilities (GP). I have about 5000 samples and 10000000 SNPs divided across 22 chromosomes. I would appreciate any help on how to extract and consolidate that data.

Thanks,

A

PS - I am sure the issue is with reading the calldata, as
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID', 'calldata/DS'], types={'calldata/DS': 'f4'})
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID' , 'DS'])
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID'])

produce the same .csv with the ID only

hardingnj · 2020-09-13T16:30:53Z

> Also I have not been able to find exactly what f4 means (I know it is a float, but is it float32 or float64?)

This is the number of bytes, so f4 = 4 bytes x 8 bits = a 32-bit float.
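A quick way to check this yourself, assuming NumPy is installed (the type strings accepted by the `types` argument are NumPy dtype codes):

```python
import numpy as np

# 'f4' is shorthand for a 4-byte (32-bit) float, 'f8' for an 8-byte (64-bit) float.
dt = np.dtype('f4')

print(dt.itemsize)                     # number of bytes per value
print(dt == np.float32)                # same dtype as float32
print(np.dtype('f8') == np.float64)    # 'f8' is the 64-bit equivalent
```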

Your code looks ok to me- no obvious problems with how you have specified those fields.

It's difficult to debug without access to the file, but if you could provide a minimal example that fails I'd be happy to look in detail. If you modify the numbers to obscure anything potentially identifiable/privileged that would be good too.

vcflib contains some useful commands to downsample VCF files.

ayadlin · 2020-09-13T18:58:00Z

Hi, thanks for the quick reply. I will ask permission, as I don’t have authorization to share the files (I’m only a user), and then extract and anonymize a subset with vcflib. In the meantime, could the size of the file be the problem? Would you recommend dividing it into subfiles or into chunks? Thanks! A


hardingnj · 2020-09-15T13:04:24Z

I think it's unlikely to be the size... there are ways of chunking the file within allel if it's very large. The sorting could be a problem maybe. It might be worth using the region argument to read a subset of the data too.

A small subset of the data that fails to work can tell us much more than the error message.
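As a possible way to get from `read_vcf` to a CSV in the meantime, here is a minimal sketch rather than a tested recipe for this file. It assumes the callset is the dict returned by `allel.read_vcf`, with `calldata/DS` shaped (variants, samples) and `calldata/GP` shaped (variants, samples, 3). The helper names `load_calldata` and `write_long_csv`, the output column names, and the choice of a long format (one row per variant and sample) are all illustrative assumptions, not part of scikit-allel.

```python
import csv
import numpy as np


def load_calldata(vcf_path, region=None):
    """Hypothetical helper: read sample names, variant IDs, DS and GP."""
    import allel  # imported lazily so the CSV step below runs without it
    return allel.read_vcf(
        vcf_path,
        region=region,  # e.g. a chromosome or 'chrom:start-end' subset
        fields=['samples', 'variants/ID', 'calldata/DS', 'calldata/GP'],
        types={'calldata/DS': 'f4', 'calldata/GP': 'f4'},
        numbers={'calldata/GP': 3},  # three probabilities per call
    )


def write_long_csv(callset, csv_path):
    """Hypothetical helper: write one CSV row per (variant, sample) pair."""
    samples = callset['samples']
    ids = callset['variants/ID']
    ds = np.asarray(callset['calldata/DS'])  # (n_variants, n_samples)
    gp = np.asarray(callset['calldata/GP'])  # (n_variants, n_samples, 3)
    with open(csv_path, 'w', newline='') as fh:
        writer = csv.writer(fh)
        writer.writerow(['ID', 'sample', 'DS', 'GP_00', 'GP_01', 'GP_11'])
        for i, variant_id in enumerate(ids):
            for j, sample in enumerate(samples):
                writer.writerow([variant_id, sample, ds[i, j], *gp[i, j]])
    return ds.shape[0] * ds.shape[1]  # number of data rows written
```

Usage would look like `write_long_csv(load_calldata('my_vcf.vcf', region='22'), 'chr22.csv')`, where the region string is a placeholder. With about 5000 samples and 10000000 SNPs a long-format CSV gets very large, so it is probably worth running this one region or chromosome at a time.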
