Using scripts to process protein data from NCBI #930

wang748 · 2024-10-21T13:20:05Z

Dear developer, I was trying to download the pep.all.fa files for the species “Conger conger” and “Gymnothorax javanicus” from the Ensembl database, but Ensembl has no files for these two species. NCBI does have reference files for them, so I want to get their protein sequence files from NCBI instead. However, the annotations in the NCBI protein files are not very good. Is there a script that can convert NCBI protein sequence files to the Ensembl annotation format, i.e. replace their headers, so that I can use primary_transcript.py to extract the longest transcript for each gene?

lauriebelch · 2024-10-23T10:45:38Z

Hi wang748,

We do have an experimental script for getting primary transcripts from NCBI data. If you provide me with links to the genomes you want on NCBI, I can take a look.

Thanks,

OrthoLaurie

wang748 · 2024-10-23T10:59:09Z

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/963/514/075/GCF_963514075.1_fConCon1.1/GCF_963514075.1_fConCon1.1_protein.faa.gz
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/018/555/375/GCF_018555375.3_ASM1855537v3/GCF_018555375.3_ASM1855537v3_protein.faa.gz

The two links above are the protein sequence files whose primary transcripts I need to extract. Please check them out, thank you very much!

lauriebelch · 2024-10-23T13:01:48Z

I'll take a look now. The script uses the GFF files to extract the longest transcript per gene (similar to the primary transcripts script for Ensembl). It will be published and available with the next version of OrthoFinder.

lauriebelch · 2024-10-23T13:09:21Z

primary_transcripts.zip
Hopefully this has worked! I would definitely check that the number of genes is what you are expecting for each species
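One way to run that check is to count the FASTA header lines in each output file and compare the count with the number of protein-coding genes reported for the assembly. This is a minimal sketch and not part of the attached files; the function name and the file name in the usage note are hypothetical.

```python
import gzip


def count_fasta_records(path):
    """Count header lines (starting with '>') in a plain or gzipped FASTA file."""
    # Pick gzip.open for .gz files, the built-in open otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

For example, `count_fasta_records("primary_transcripts/species.faa")` returns the number of sequences in that file, which should be one per gene after primary-transcript selection.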

wang748 · 2024-10-24T01:48:25Z

Thank you for your help! I would also like to know how your script works. Does it take the rows with mRNA in the third column of the GFF file, compute a length for each one by subtracting the start position from the end position, and then pick the mRNA with the largest value among those belonging to the same gene as the primary transcript? Also, if I don't extract the primary transcripts, will that have a bad effect on the results generated by OrthoFinder?

lauriebelch · 2024-10-24T08:30:31Z

It works by mapping each protein ID in the protein fasta .faa file to a gene in the .gff file. For each gene we then have a set of protein IDs. We then simply take the longest protein (sequence length) for each gene as the primary transcript. I can send you the script if you want?
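The approach described above can be sketched roughly as follows. This is an illustrative sketch only, not the ncbi_primary_transcripts.py script itself. It assumes that CDS rows in the GFF carry both a `protein_id` and a `gene` attribute, and that the FASTA sequence ID is the first word of each header. All function names and the toy records are hypothetical.

```python
def parse_attributes(field):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def protein_to_gene_map(gff_lines):
    """Map protein IDs to gene names using the CDS rows of a GFF3 file."""
    mapping = {}
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = parse_attributes(cols[8])
        # Assumption: CDS rows carry both 'protein_id' and 'gene' attributes.
        if "protein_id" in attrs and "gene" in attrs:
            mapping[attrs["protein_id"]] = attrs["gene"]
    return mapping


def read_fasta(fasta_lines):
    """Return {sequence ID: sequence}; the ID is the first word of the header."""
    chunks = {}
    current = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            current = words[0] if words else ""
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def longest_protein_per_gene(mapping, seqs):
    """For each gene, keep the ID of its longest protein (first one wins ties)."""
    best = {}
    for prot_id, seq in seqs.items():
        gene = mapping.get(prot_id)
        if gene is None:
            continue  # protein absent from the GFF; skipped in this sketch
        if gene not in best or len(seq) > len(seqs[best[gene]]):
            best[gene] = prot_id
    return best


# Toy records (made up for illustration): two isoforms of 'abc', one of 'xyz'.
gff = [
    "##gff-version 3\n",
    "chr1\tGnomon\tCDS\t100\t400\t.\t+\t0\tID=cds-XP_1.1;gene=abc;protein_id=XP_1.1\n",
    "chr1\tGnomon\tCDS\t100\t250\t.\t+\t0\tID=cds-XP_2.1;gene=abc;protein_id=XP_2.1\n",
    "chr1\tGnomon\tCDS\t900\t990\t.\t-\t0\tID=cds-XP_3.1;gene=xyz;protein_id=XP_3.1\n",
]
faa = [">XP_1.1 isoform X1", "MKTAYIAKQR", ">XP_2.1 isoform X2", "MKTAY", ">XP_3.1", "MSSL"]

primary = longest_protein_per_gene(protein_to_gene_map(gff), read_fasta(faa))
print(primary)  # {'abc': 'XP_1.1', 'xyz': 'XP_3.1'}
```

Note that the comparison here is on protein sequence length from the FASTA file, not on mRNA coordinates in the GFF.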

If we ran OrthoFinder on the raw files (without selecting only the primary transcripts) it would take 10x longer than necessary and could lower the accuracy.

wang748 · 2024-10-25T02:52:10Z

I think I need this script, thanks a lot! Here is my email address you can send the script to: [email protected]

ferrojm · 2024-10-28T15:58:31Z

Hi! I would like that script as well, where can I find it? thanks!

lauriebelch · 2024-10-30T10:31:53Z

Getting data from NCBI.pdf
ncbi_primary_transcripts.py.zip
Here is the script, and a brief PDF explaining how to get data to use it. Please let me know if it is helpful / what might make it more helpful!
