Using scripts to process protein data from NCBI #930
Comments
Hi wang748, we do have an experimental script for getting primary transcripts from NCBI data. If you provide me with links to the genomes you want on NCBI, I can take a look. Thanks, OrthoLaurie
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/963/514/075/GCF_963514075.1_fConCon1.1/GCF_963514075.1_fConCon1.1_protein.faa.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/555/375/GCF_018555375.3_ASM1855537v3/GCF_018555375.3_ASM1855537v3_protein.faa.gz
I'll take a look now. The script will use the GFF files to extract the longest transcript per gene (similar to the primary transcripts script for Ensembl). It will be published and available with the next version of OrthoFinder.
primary_transcripts.zip
Thank you for your help! I would also like to know how your script picks the primary transcript. Does it take each row in the GFF file whose third column is mRNA, subtract the start position from the end position, and then compare that value across the mRNAs belonging to the same gene, so that the mRNA with the largest value is chosen as the primary transcript? And if I don't extract the primary transcripts, will that have a bad effect on the results generated by OrthoFinder?
It works by mapping each protein ID in the protein FASTA (.faa) file to a gene in the .gff file. For each gene we then have a set of protein IDs, and we simply take the longest protein (by sequence length) for each gene as the primary transcript. I can send you the script if you want. If we ran OrthoFinder on the raw files (without selecting only the primary transcripts), it would take 10x longer than necessary and could lower the accuracy.
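For readers who want to see the idea in code, here is a minimal sketch of the approach described above. It is not the script attached to this thread. It assumes the GFF is an NCBI RefSeq-style GFF3 whose CDS rows carry both a `protein_id=` and a `gene=` attribute, and that FASTA record IDs match those `protein_id` values. All function names are hypothetical.

```python
def parse_attributes(field):
    """Turn a GFF3 attribute column ('k1=v1;k2=v2') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def protein_to_gene(gff_lines):
    """Map protein IDs to gene names using CDS rows.

    Assumption: CDS rows carry 'protein_id' and 'gene' attributes,
    as in NCBI RefSeq GFF3 files.
    """
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        if "protein_id" in attrs and "gene" in attrs:
            mapping[attrs["protein_id"]] = attrs["gene"]
    return mapping


def read_fasta(lines):
    """Read FASTA lines into {record_id: sequence}; the ID is the first header token."""
    seqs = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            seqs[current] = ""
        elif current is not None:
            seqs[current] += line
    return seqs


def longest_protein_per_gene(seqs, prot2gene):
    """Return {gene: protein_id} keeping the longest protein per gene.

    Proteins with no gene in the mapping are skipped; on ties the
    first protein seen is kept.
    """
    best = {}
    for pid, seq in seqs.items():
        gene = prot2gene.get(pid)
        if gene is None:
            continue
        if gene not in best or len(seq) > len(seqs[best[gene]]):
            best[gene] = pid
    return best
```

For real data, the `.faa.gz` and `.gff.gz` files could be opened with `gzip.open(path, "rt")` and passed in line by line. The proteins named in the returned dict would then be written out as the primary-transcript FASTA.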
I think I need this script, thanks a lot! Here is my email address you can send the script to: [email protected]
Hi! I would like that script as well. Where can I find it? Thanks!
Getting data from NCBI.pdf
Dear developer, I was trying to download the pep.all.fa files for the species “Conger conger” and “Gymnothorax javanicus” from the Ensembl database, but I found that Ensembl has no files for these two species. NCBI does have reference files for them, so I want to get their protein sequence files from NCBI instead. However, the annotations in the NCBI protein files are not very good. Is there a script that can convert the NCBI protein sequence files to the Ensembl annotation format, i.e. replace their headers, so that I can use primary_transcript.py to extract the longest transcript for each gene?
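As a rough illustration of what replacing the headers could look like, the sketch below rewrites each FASTA header to `>protein_id gene:<gene>`, mimicking the `gene:` token found in Ensembl pep headers. It assumes a protein-to-gene mapping is already available (for example, built from the matching NCBI GFF file). The function name `add_gene_tag` is hypothetical, and it is an untested assumption that primary_transcript.py would accept headers in this form.

```python
def add_gene_tag(fasta_lines, prot2gene):
    """Yield FASTA lines with headers rewritten to '>protein_id gene:<gene>'.

    prot2gene: dict mapping protein IDs to gene identifiers (assumed to
    be built elsewhere, e.g. from a GFF file). Headers whose protein ID
    is not in the mapping are passed through unchanged.
    """
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            pid = line[1:].split()[0]
            gene = prot2gene.get(pid)
            if gene is not None:
                line = f">{pid} gene:{gene}"
        yield line
```

The original NCBI description text is dropped from rewritten headers in this sketch; it could be kept by appending the tag to the full header instead.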