-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
vcf_to_csv -> not recognizing FORMAT #340
Comments
This is the number of Your code looks ok to me- no obvious problems with how you have specified those fields. It's difficult to debug without access to the file, but if you could provide a minimal example that fails I'd be happy to look in detail. If you modify the numbers to obscure anything potentially identifiable/privileged that would be good too.
|
Hi thanks got the quick reply.
I will ask permission, as I don’t have authorization to share the files (I’m only an user), and extract and anonymize with vcflib.
In the meantime could it be the size of the file that’s is a problem? Would you recommend dividing them into subfiles or into chunks?
Thanks!
A
… On Sep 13, 2020, at 9:31 AM, Nick Harding ***@***.***> wrote:
Also I have not been able to find exactly what f4 means ( I know is float but is it float 32, float64?
This is the number of bytes, so f4 = 8x4 = 32 bit float.
Your code looks ok to me- no obvious problems with how you have specified those fields.
It's difficult to debug without access to the file, but if you could provide a minimal example that fails I'd be happy to look in detail. If you modify the numbers to obscure anything potentially identifiable/privileged that would be good too.
vcflibcontains some useful commands to downsample VCF files.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or unsubscribe.
|
I think it's unlikely to be the size... there are ways of chunking the file within A small subset of the data that fails to work can tell us much more than the error message. |
Hi -
I am trying to copy 3 cols from a VCF file to CSV file. 1 is ID, the second is fromat DS and format GP.
using the command below I get a warning
allel.vcf_to_csv(my_vcf.vcf', 'my_vcf.csv', fields=['calldata/*'])
UserWarning: '*' FORMAT header not found
warnings.warn('%r FORMAT header not found' % name)
Changing the command to
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['calldata/DS'], types={'calldata/DS':'f4' )
avoids the error but I get an empty csv file - I know there is data (float in the FORMAT:DS column, is there any thing wrong with what I am doing or is there an issue on the csv writing?
allel.read_vcf('my_vcf.vcf', fileds=['DS','GP']) works, but it is a long process- and I am note sure how to go from there to the csv file.
just in case this is useful FORMAT is one of the headers of my file -
and the description section defines:
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage">
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">
A final question , as you see, GP is a tuple (float, float, float) if I need to assign a type to it in the types dictionary - what would the correct syntax be?
Also I have not been able to find exactly what f4 means ( I know is float but is it float 32, float64?
What I am trying to build is a CSV file that conserves the samples identifiers , the SNP IDs and the Dosage (DS) and the posterior probabilities GP. I have about 5000 samples and 10000000 divided across 22 chromosomes. Would appreciate any help on how to extract and consolidate that data
Thanks,
A
PS - I am sure the issue is with reading the calldata as
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID' , 'calldata/DS'], types={'calldata/DS':'f4' )
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID' , 'DS'])
allel.vcf_to_csv('my_vcf.vcf', 'my_vcf.csv', fields=['ID'])
produce the same .csv with the ID only
The text was updated successfully, but these errors were encountered: