These scripts facilitate the annotation formatting steps that precede an RNA-Seq analysis with Eoulsan or other pipelines.
This suite of scripts is still a work in progress.
- python 2.7
- python modules
  - pylab
  - pandas
  - ftplib
  - urllib2
  - fnmatch
  - biopython
- docker
Before going further, you have to create an environment file, validannot_env.py, in which all the target directories are defined.
This environment script defines all the required paths. It must be imported by every validannot script.
These paths are required:
gff3_path = ".../gff3/"
dna_fasta_path = ".../dna_fasta/"
cdna_fasta_path = ".../cdna_fasta/"
ncrna_fasta_path = ".../ncrna_fasta/"
gtf_path = ".../gtf/"
log_path = ".../log/"
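As an illustration, a validannot_env.py file could be laid out as in the minimal sketch below. The root directory `base_dir` and the helper `subdir` are assumptions made for this example only; the real file just needs to define the six variables listed above, pointing at your own directories.

```python
import os

# Hypothetical root directory: replace it with the real location of your data.
base_dir = os.path.join(os.path.expanduser("~"), "validannot_data")


def subdir(name):
    """Return base_dir/name/ with a trailing separator, like the paths above."""
    return os.path.join(base_dir, name, "")


# The six variables expected by the validannot scripts.
gff3_path = subdir("gff3")
dna_fasta_path = subdir("dna_fasta")
cdna_fasta_path = subdir("cdna_fasta")
ncrna_fasta_path = subdir("ncrna_fasta")
gtf_path = subdir("gtf")
log_path = subdir("log")
```

Keeping every path under a single root is only a convenience; the directories can live anywhere as long as each variable is set.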
Retrieves fasta and annotation files from the Ensembl ftp site (ensembl.org or ensemblgenomes.org).
python retrieve_files_from_ensembl.py -o organism_ensembl -e ensemblversion -f file_type -t generic -v
-o or --organism = Bos_taurus, Mus_musculus, Homo_sapiens
-t or --type = plants, fungi, metazoa, bacteria, protists, generic
-e or --ensemblversion = 83
-f or --files = all, gff3, gtf, dna_fasta, cdna_fasta, ncrna_fasta (default=all)
-v or --verbose
python retrieve_files_from_ensembl.py -o Bos_taurus -e 83 -f all -t generic -v
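The options listed above can be sketched with argparse as follows. This is not the script's actual code: the long name `--type`, the default value of `-t` and the use of `choices` are assumptions made for the example.

```python
import argparse


def build_parser():
    """Command-line parser mirroring the documented options (sketch only)."""
    parser = argparse.ArgumentParser(
        description="Retrieve fasta and annotation files from Ensembl (sketch)")
    parser.add_argument("-o", "--organism", required=True,
                        help="Ensembl organism name, e.g. Bos_taurus")
    parser.add_argument("-e", "--ensemblversion", required=True,
                        help="Ensembl version, e.g. 83")
    parser.add_argument("-f", "--files", default="all",
                        choices=["all", "gff3", "gtf", "dna_fasta",
                                 "cdna_fasta", "ncrna_fasta"],
                        help="file type to retrieve (default: all)")
    # Assumption: the default for -t is not documented; "generic" is a guess.
    parser.add_argument("-t", "--type", default="generic",
                        choices=["plants", "fungi", "metazoa", "bacteria",
                                 "protists", "generic"],
                        help="Ensembl division")
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


# Same arguments as the example command line above.
args = build_parser().parse_args(
    ["-o", "Bos_taurus", "-e", "83", "-f", "all", "-t", "generic", "-v"])
```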
Formats Ensembl gff3 files into clean, valid gff3 files, optionally restricted to chromosomes. This avoids heavy files that could disrupt the genome indexing step of mappers like STAR.
python modify_ensembl_gff.py -o Ensembl organism name -e Ensembl version -c y -v
-o or --organism Ensembl organism name Ex: Mus_musculus
-e or --ensemblversion Ensembl version Ex: 84
-c or --chronly Chromosome only : y
-v or --verbose
python modify_ensembl_gff.py -o Mus_musculus -e 84 -c y -v
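The chromosome-only idea can be sketched as below. This is a guess at the principle, not the script's actual logic: it assumes that chromosomes are declared in the gff3 file by a feature whose type (third column) is `chromosome`, and the function names and sample lines are made up for the example.

```python
def chromosome_ids(gff_lines):
    """Collect the sequence ids declared with a 'chromosome' feature."""
    ids = set()
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9 and fields[2] == "chromosome":
            ids.add(fields[0])
    return ids


def keep_chromosomes(gff_lines):
    """Keep comment lines and the features located on a chromosome."""
    lines = list(gff_lines)
    ids = chromosome_ids(lines)
    kept = []
    for line in lines:
        # Comments and directives are kept unchanged; features are filtered.
        if line.startswith("#") or line.split("\t")[0] in ids:
            kept.append(line)
    return kept


# Made-up sample: one chromosome and one unplaced scaffold.
sample = [
    "##gff-version 3\n",
    "1\tEnsembl\tchromosome\t1\t1000\t.\t.\t.\tID=chromosome:1\n",
    "1\tensembl\tgene\t10\t200\t.\t+\t.\tID=gene:G1\n",
    "KB000001.1\tEnsembl\tsupercontig\t1\t500\t.\t.\t.\tID=supercontig:KB000001.1\n",
    "KB000001.1\tensembl\tgene\t5\t50\t.\t-\t.\tID=gene:G2\n",
]
only_chr = keep_chromosomes(sample)
```

With this sample, the two lines on the scaffold are dropped and the header plus the two chromosome 1 lines remain.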
Selects only the "official" chromosome sequences in a fasta file, based on the ids found in an only_chr_Xxxx_xxxx_ensNN_sgdb.gff file. This avoids heavy files that could disrupt the genome indexing step of mappers like STAR.
python select_ensembl_fasta_from_gffid_ensembl.py -o Ensembl organism name -e Ensembl version -v
-o or --organism Ensembl organism name Ex: Mus_musculus
-e or --ensemblversion Ensembl version Ex: 84
-v or --verbose
python select_ensembl_fasta_from_gffid_ensembl.py -o Mus_musculus -e 84 -v
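Selecting fasta records from a set of ids can be sketched in plain Python as below. The function name and the sample data are made up for the example, and the sketch assumes that the sequence id is the first word of each fasta header; the script itself may rely on biopython instead.

```python
def filter_fasta(fasta_lines, wanted_ids):
    """Yield the lines of the fasta records whose id is in wanted_ids."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumption: the id is the first word after '>'.
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in wanted_ids
        if keep:
            yield line


# Made-up sample: two chromosomes and one scaffold.
fasta = [
    ">1 dna:chromosome\n", "ACGT\n",
    ">KB000001.1 dna:scaffold\n", "TTTT\n",
    ">MT dna:chromosome\n", "GG\n",
]
selected = list(filter_fasta(fasta, set(["1", "MT"])))
```

Here `selected` holds the header and sequence lines of records 1 and MT only. The set of wanted ids would come from the first column of the only_chr gff file.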
Formats Ensembl gtf files so that they are consistent with the formatted gff3 files.
python select_ensembl_gtfid_from_ensembl_gffid.py -o Ensembl organism name -e Ensembl version -v
-o or --organism Ensembl organism name Ex: Mus_musculus
-e or --ensemblversion Ensembl version Ex: 84
-v or --verbose
python select_ensembl_gtfid_from_ensembl_gffid.py -o Mus_musculus -e 84 -v
Builds a fasta file and a gff file from the cdna and ncrna fasta files of a given Ensembl version.
python build_gff_from_ensembl_fasta.py -o Ensembl organism name -e Ensembl version -v
-o or --organism Ensembl organism name Ex: Mus_musculus
-e or --ensemblversion Ensembl version Ex: 84
-v or --verbose
python build_gff_from_ensembl_fasta.py -o Mus_musculus -e 84 -v
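One way to derive a gff file from a transcript fasta file is sketched below. It assumes that each transcript becomes its own reference sequence, with a single feature spanning positions 1 to the sequence length; the source name `validannot_sketch`, the function names and the sample ids are hypothetical, and the real script may write different columns or attributes.

```python
def fasta_lengths(fasta_lines):
    """Yield (sequence id, sequence length) for each fasta record.

    Assumes every header has a non-empty id as its first word.
    """
    seq_id, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, length
            seq_id, length = line[1:].split()[0], 0
        elif seq_id is not None:
            length += len(line)
    if seq_id is not None:
        yield seq_id, length


def gff_line(seq_id, length, feature, source="validannot_sketch"):
    """Build one 9-column gff line covering the whole sequence."""
    return "\t".join([seq_id, source, feature, "1", str(length),
                      ".", "+", ".", "ID=" + seq_id])


# Made-up cdna records.
cdna = [">transcript_A cdna\n", "ACGTAC\n", "GT\n",
        ">transcript_B cdna\n", "AAAA\n"]
gff = [gff_line(seq_id, length, "cdna") for seq_id, length in fasta_lengths(cdna)]
```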
Analyses features in gff files and generates:
1- a histogram of the length of features in only_chr gff files and cdna-ncrna gff files
2- a summary of feature stats in only_chr gff files and cdna-ncrna gff files
python analyse_gff.py -o Ensembl organism name -e Ensembl version -f feature_list -v
-o or --organism Ensembl organism name Ex: Mus_musculus
-e or --ensemblversion Ensembl version Ex: 84
-f or --feature List of features, separated by commas Ex: ncrna,cdna
-v or --verbose
python analyse_gff.py -o Mus_musculus -e 84 -f cdna,ncrna -v
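The kind of summary produced for each feature can be sketched as below. The statistics chosen here (count, min, max, mean, median) and the function names are assumptions for the example, not necessarily what analyse_gff.py reports; feature length is computed as end - start + 1 from columns 5 and 4 of the gff file.

```python
from collections import defaultdict


def feature_lengths(gff_lines, features):
    """Map each requested feature type to the list of its feature lengths."""
    lengths = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[2] not in features:
            continue
        lengths[fields[2]].append(int(fields[4]) - int(fields[3]) + 1)
    return lengths


def summarize(values):
    """Return basic statistics for a non-empty list of lengths."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2.0
    return {"count": n, "min": ordered[0], "max": ordered[-1],
            "mean": sum(ordered) / float(n), "median": median}


# Made-up gff lines: three cdna features and one ncrna feature.
sample = [
    "##gff-version 3\n",
    "t1\tsketch\tcdna\t1\t10\t.\t+\t.\tID=t1\n",
    "t2\tsketch\tcdna\t1\t20\t.\t+\t.\tID=t2\n",
    "t3\tsketch\tcdna\t1\t30\t.\t+\t.\tID=t3\n",
    "n1\tsketch\tncrna\t1\t5\t.\t+\t.\tID=n1\n",
]
lengths = feature_lengths(sample, ["cdna", "ncrna"])
cdna_stats = summarize(lengths["cdna"])
```

The per-feature length lists could then be passed to a plotting function such as `pylab.hist` to draw the histogram.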
Retrieves BioMart annotations from the Ensembl gene database using the bioservices web services. This script runs in a docker container of bioservices (genomicpariscentre/bioservices).
docker pull genomicpariscentre/bioservices
docker run -t -i -v /.../Scripts/ValidAnnot/:/test --rm genomicpariscentre/bioservices bash
python query_ensembl_bioservices.py -o organism_ensembl_name -e ensemblversion -f filenb
-o or --organism = Bos_taurus, Mus_musculus, Homo_sapiens
-e or --ensemblversion = 83
-f or --filenb = 2 for genes and transcripts separated, 1 means together
-v or --verbose
python query_ensembl_bioservices.py -o Bos_taurus -e 83 -f 1 -v
- 2 files if file_number_for_gene_and_transcript = 2: btaurus_ens83_transcriptid.tsv and btaurus_ens83_geneid.tsv
- 1 file if file_number_for_gene_and_transcript = 1: btaurus_ens83.tsv
- Known issue with scerevisiae: gene ids are mostly the same as transcript ids (nearly no introns), so building 2 files is probably better
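The output file names listed above follow a pattern that can be sketched as below. The helper names are hypothetical, and the abbreviation rule (first letter of the genus followed by the species, in lower case) is inferred from the btaurus example only; it is assumed to apply to two-part organism names.

```python
def short_name(organism):
    """Turn 'Bos_taurus' into 'btaurus' (assumes a Genus_species name)."""
    genus, species = organism.split("_", 1)
    return (genus[0] + species).lower()


def output_files(organism, version, filenb):
    """Return the expected output file names for a given --filenb value."""
    prefix = "{0}_ens{1}".format(short_name(organism), version)
    if filenb == 2:
        # Genes and transcripts in separate files.
        return [prefix + "_transcriptid.tsv", prefix + "_geneid.tsv"]
    # Genes and transcripts together in one file.
    return [prefix + ".tsv"]


one_file = output_files("Bos_taurus", 83, 1)
two_files = output_files("Bos_taurus", 83, 2)
```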
Retrieves fasta and annotation files from NCBI database using ftp.
python retrieve_ncbi_fasta_from_gffid.py -o organism_ncbi -v
-o or --organism = Capra_hircus
-v or --verbose
python retrieve_ncbi_fasta_from_gffid.py -o Capra_hircus -v
Formats fasta and annotation files retrieved from the NCBI database.
python format_ncbi_fasta_from_gffid.py -i ncbi_gff3_file -f ncbi_merged_fasta_file -v
-i or --gffin = 20160203_Citrus_sinensis_ncbi.gff.gz or .gff
-f or --fastain = 20160203_Citrus_sinensis_ncbi_genome.fa.gz or .fa
-v or --verbose
python format_ncbi_fasta_from_gffid.py -i 20160203_Citrus_sinensis_ncbi.gff.gz -f 20160203_Citrus_sinensis_ncbi_genome.fa.gz -v