Using scripts to process protein data from NCBI #930

Open
wang748 opened this issue Oct 21, 2024 · 9 comments

@wang748

wang748 commented Oct 21, 2024

Dear developer, I was trying to download the pep.all.fa files for the species “Conger conger” and “Gymnothorax javanicus” from the Ensembl database, but Ensembl has no files for these two species. NCBI does have reference files for both, so I want to get their protein sequence files from NCBI instead. However, the annotations in the NCBI protein files are not very good. Is there a script that can convert the NCBI protein sequence files to the Ensembl annotation format, i.e. replace their headers, so that I can use primary_transcript.py to extract the longest transcript for each gene?

@lauriebelch

Hi wang748,

We do have an experimental script for getting primary transcripts from NCBI data - if you provide me with links to the genomes you want on NCBI I can take a look

Thanks,

OrthoLaurie

@wang748
Author

wang748 commented Oct 23, 2024

@lauriebelch

I'll take a look now. The script will use the GFF files to extract the longest transcript per gene (similar to the primary transcripts script for Ensembl). It will be published and available with the next version of OrthoFinder.

@lauriebelch

primary_transcripts.zip
Hopefully this has worked! I would definitely check that the number of genes is what you are expecting for each species
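
One quick way to do that check is to count the sequence headers in each output FASTA file and compare the total with the gene count you expect for the species. The sketch below is only an illustration: the function name `count_fasta_records` and the file name in the usage comment are hypothetical and are not part of OrthoFinder.

```python
def count_fasta_records(lines):
    """Count FASTA records by counting header lines (those starting with '>')."""
    return sum(1 for line in lines if line.startswith(">"))


# Small in-memory example: two records, one of them wrapped over two lines.
example = [">XP_1 protein A", "MKTAYI", "AKQR", ">XP_2 protein B", "MLS"]
n_records = count_fasta_records(example)
print(n_records)

# With a real file (hypothetical file name):
# with open("Conger_conger.faa") as handle:
#     print(count_fasta_records(handle))
```

If the primary-transcript step worked, the count should be close to the number of protein-coding genes reported for that assembly's annotation.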

@wang748
Author

wang748 commented Oct 24, 2024

Thank you for your help! I would also like to know how your script works. Does it take each row with mRNA in the third column of the GFF file, subtract the start position from the end position, and then compare that value across the mRNAs belonging to the same gene, so that the mRNA with the largest value is chosen as the primary transcript? And if I don't extract the primary transcripts, will it have a bad effect on the results generated by OrthoFinder?

@lauriebelch

It works by mapping each protein ID in the protein fasta .faa file to a gene in the .gff file. For each gene we then have a set of protein IDs. We then simply take the longest protein (sequence length) for each gene as the primary transcript. I can send you the script if you want?

If we ran OrthoFinder on the raw files (without selecting only the primary transcripts) it would take 10x longer than necessary and could lower the accuracy.
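
The selection described above (map each protein ID to a gene, then keep the longest protein per gene) can be sketched roughly as follows. This is not the actual script from this thread, only a minimal illustration under some assumptions: CDS rows in the GFF carry a `protein_id` attribute and a single `Parent` attribute, following `Parent` links upward ends at the gene, and the FASTA header starts with the protein ID. All function names and the toy data are made up for the example.

```python
def parse_attributes(field):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def protein_to_gene(gff_lines):
    """Map protein_id -> top-level feature ID by following Parent links."""
    parent_of = {}   # feature ID -> parent feature ID
    cds_parent = {}  # protein_id -> parent of its CDS rows (usually an mRNA)
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "Parent" in attrs:
            parent_of[attrs["ID"]] = attrs["Parent"]
        if cols[2] == "CDS" and "protein_id" in attrs and "Parent" in attrs:
            cds_parent[attrs["protein_id"]] = attrs["Parent"]
    mapping = {}
    for protein_id, parent in cds_parent.items():
        while parent in parent_of:  # climb until a feature with no parent
            parent = parent_of[parent]
        mapping[protein_id] = parent
    return mapping


def read_fasta(lines):
    """Return {sequence ID: sequence}; the ID is the first word of the header."""
    chunks = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


def longest_protein_per_gene(proteins, prot2gene):
    """Return {gene ID: protein ID of its longest protein}; ties keep the first seen."""
    best = {}
    for protein_id, seq in proteins.items():
        gene = prot2gene.get(protein_id)
        if gene is None:
            continue  # protein not found in the GFF
        if gene not in best or len(seq) > len(proteins[best[gene]]):
            best[gene] = protein_id
    return best


# Toy data: gene-abc has two isoforms, gene-xyz has one.
gff_lines = [
    "##gff-version 3",
    "chr1\tGnomon\tgene\t1\t900\t.\t+\t.\tID=gene-abc",
    "chr1\tGnomon\tmRNA\t1\t900\t.\t+\t.\tID=rna-XM_1;Parent=gene-abc",
    "chr1\tGnomon\tmRNA\t1\t900\t.\t+\t.\tID=rna-XM_2;Parent=gene-abc",
    "chr1\tGnomon\tCDS\t1\t300\t.\t+\t0\tID=cds-XP_1;Parent=rna-XM_1;protein_id=XP_1",
    "chr1\tGnomon\tCDS\t1\t600\t.\t+\t0\tID=cds-XP_2;Parent=rna-XM_2;protein_id=XP_2",
    "chr2\tGnomon\tgene\t1\t500\t.\t-\t.\tID=gene-xyz",
    "chr2\tGnomon\tmRNA\t1\t500\t.\t-\t.\tID=rna-XM_3;Parent=gene-xyz",
    "chr2\tGnomon\tCDS\t1\t500\t.\t-\t0\tID=cds-XP_3;Parent=rna-XM_3;protein_id=XP_3",
]
faa_lines = [">XP_1 short isoform", "MKTA", ">XP_2 long isoform", "MKTAYI",
             ">XP_3 only isoform", "MLS"]

primary = longest_protein_per_gene(read_fasta(faa_lines), protein_to_gene(gff_lines))
print(primary)
```

Note that the comparison uses the protein sequence length from the .faa file, not the end minus start span of the mRNA row in the GFF, so a transcript with long introns or UTRs is not favoured over one that encodes a longer protein.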

@wang748
Author

wang748 commented Oct 25, 2024

I think I need this script, thanks a lot! Here is my email address you can send the script to: [email protected]

@ferrojm

ferrojm commented Oct 28, 2024

Hi! I would like that script as well. Where can I find it? Thanks!

@lauriebelch

Getting data from NCBI.pdf
ncbi_primary_transcripts.py.zip
Here is the script, and a brief PDF explaining how to get data to use it. Please let me know if it is helpful / what might make it more helpful!
