Commit

Add files via upload
josuebarrera authored Aug 26, 2022
1 parent 9cf7d59 commit 2289ace
Showing 1 changed file with 6 additions and 4 deletions.
10 changes: 6 additions & 4 deletions genEra
@@ -1,6 +1,6 @@
#!/bin/bash

-### genEra v1.0.1 (C) Max Planck Society for the Advancement of Science
+### genEra v1.0.2 (C) Max Planck Society for the Advancement of Science
###
### Code developed by Josué Barrera-Redondo <[email protected]>
###
@@ -14,6 +14,8 @@
### MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
### GNU General Public License for more details.

+VERSION='1.0.2'
+
QUERY_FASTA=''
NCBITAX=''
NR_DB=''
@@ -31,10 +33,10 @@ MINHITS='10'
PRINTOLDEST='false'
DIVERGENCE=''
TMP_PATH=''
-SENSITIVITY='ultra-sensitive'
+SENSITIVITY='sensitive'

print_usage() {
-printf "\ngenEra v1.0 (C) Max Planck Society for the Advancement of Science\n\n BASIC USAGE\n\tgenEra -q [query_sequences.fasta] -t [query_taxid] -b [path/to/nr] -d [path/to/taxdump]\n\n BEST USAGE\n\tgenEra -q [query_sequences.fasta] -t [query_taxid] -b [path/to/nr] -d [path/to/taxdump] \ \n\t-a [protein_list.tsv] -f [nucleotide_list.tsv] -s [evolutionary_distances.tsv] -i [true] -n [many threads]\n\n DESCRIPTION\n\tgenEra is an easy-to-use, low-dependency command-line tool that\n\testimates the age of the earliest common ancestor of protein\n\tcoding genes though genomic phylostratigraphy.\n\n MANDATORY ARGUMENTS\n\t-q\tQuery protein sequences in FASTA format\n\t-t\tNCBI Taxonomy ID of query species (search for the taxid of\n\t\tyour query species at https://www.ncbi.nlm.nih.gov/taxonomy)\n\n MANDATORY ONE OF THE FOLLOWING ARGUMENTS\n\t-b\tPath to a locally installed nr database for DIAMOND\n\t-p\tPre-generated DIAMOND/MMseqs2 table (skip step 1), with the\n\t\tquery genes in the first column, the bitscore in the second\n\t\tto last column and the target taxid in the last column\n\t\t(IMPORTANT: the query sequences must be searched\n\t\tagainst themselves for genEra to work properly)\n\n ALSO MANDATORY ONE OF THESE THREE ARGUMENTS\n\t-d\tLocation of the uncompressed taxonomy dump from the NCBI\n\t\t(ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz)\n\t-r\tRaw \"ncbi_lineages\" file generated by ncbitax2lin\n\t\t(saves time on step 2)\n\t-c\tCustom \"ncbi_lineages\" file that is already tailored for the\n\t\tquery species (skip step 2). The table should be arranged so\n\t\tthat the taxid is in the first column and all the phylostrata\n\t\tof interest are organized from the species level all the way\n\t\tback to \"cellular organisms\"\n\n IMPORTANT OPTIONAL ARGUMENTS\n\t-n\tNumber of threads to run genEra (genEra can run with a single\n\t\tthread, but it is HIGHLY suggested to use as many threads as\n\t\tpossible) (default: 20 threads)\n\t-s\tTable with pairwise evolutionary distances (substitutions/site)\n\t\tbetween several species in the database and the query species\n\t\t(necessary to calculate homology detection failure probabilities\n\t\twith abSENSE). NOTE: the query species SHOULD be included in\n\t\tthis table. The table should be tab-delimited and have the\n\t\tfollowing format:\n\t\t query_sp_taxid\t0\n\t\t species_1_taxid\tdistance_1\n\t\t species_2_taxid\tdistance_2\n\t-a\tTable with additional proteins to be included in the analysis\n\t\t(e.g., proteins from species that are absent from the nr).\n\t\tThe table should be tab-delimited and have the following format:\n\t\t /path/to/species_1.fasta\ttaxid_1\n\t\t /path/to/species_2.fasta\ttaxid_2\n\t\t /path/to/species_3.fasta\ttaxid_3\n\t-f\tTable with additional nucleotide sequences to be search against\n\t\tyour query proteins. Particularly useful with genome assemblies\n\t\tfor improved orphan gene classification. The table should be\n\t\ttab-delimited and have the following format:\n\t\t /path/to/genome_1.fasta\ttaxid_1\n\t\t /path/to/genome_2.fasta\ttaxid_2\n\t\t /path/to/genome_3.fasta\ttaxid_3\n\t-i\tWhen true, prints an additional output file with the best\n\t\tsequence hit responsible for the oldest phylostrata assignment\n\t\tfor each of the query genes (default: false)\n\n FINE-TUNNING ARGUMENTS (DEFAULT IS USUALLY FINE)\n\t-l\tTaxonomic representativeness threshold below which a gene will\n\t\tbe flagged as putative genome contamination or the product of\n\t\ta horizontal gene transfer (HGT) event (default: 30)\n\t-e\tE-value threshold for DIAMOND and MMseqs2 (default: 1e-5)\n\t-o\tAdditional options to feed DIAMOND, based on user preferences\n\t\t(e.g., filtering the hits by identity or query coverage)\n\t\tUsers should input the additional commands in quotes, using\n\t\tthe original arguments from DIAMOND (Example: -o \"--id 30\")\n\t-m\tMinimum percentage of matches between your query sequences\n\t\tand another species to consider it useful for the gene age\n\t\tassignment (i.e., filtering species with just a couple of\n\t\tgenes in the nr)(default: 10)\n\t-x\tAlternative path where you would like to store the temporary\n\t\tfiles as well as the DIAMOND/MMseqs2 results (warning: genEra\n\t\twill generate HUGE temporary files) (default: the files will\n\t\tbe stored in a tmp_[RAMDOMNUM]/ directory created by genEra)\n\t-y\tModify the sensitivity parameter in DIAMOND for faster\n\t\tresults in step 1 (default: ultra-sensitive)\n\t-h\tPrint this help message and exit\n\n"
+printf "\ngenEra v${VERSION} (C) Max Planck Society for the Advancement of Science\n\n BASIC USAGE\n\tgenEra -q [query_sequences.fasta] -t [query_taxid] -b [path/to/nr] -d [path/to/taxdump]\n\n BEST USAGE\n\tgenEra -q [query_sequences.fasta] -t [query_taxid] -b [path/to/nr] -d [path/to/taxdump] \ \n\t-a [protein_list.tsv] -f [nucleotide_list.tsv] -s [evolutionary_distances.tsv] -i [true] -n [many threads]\n\n DESCRIPTION\n\tgenEra is an easy-to-use, low-dependency command-line tool that\n\testimates the age of the earliest common ancestor of protein\n\tcoding genes though genomic phylostratigraphy.\n\n MANDATORY ARGUMENTS\n\t-q\tQuery protein sequences in FASTA format\n\t-t\tNCBI Taxonomy ID of query species (search for the taxid of\n\t\tyour query species at https://www.ncbi.nlm.nih.gov/taxonomy)\n\n MANDATORY ONE OF THE FOLLOWING ARGUMENTS\n\t-b\tPath to a locally installed nr database for DIAMOND\n\t-p\tPre-generated DIAMOND/MMseqs2 table (skip step 1), with the\n\t\tquery genes in the first column, the bitscore in the second\n\t\tto last column and the target taxid in the last column\n\t\t(IMPORTANT: the query sequences must be searched\n\t\tagainst themselves for genEra to work properly)\n\n ALSO MANDATORY ONE OF THESE THREE ARGUMENTS\n\t-d\tLocation of the uncompressed taxonomy dump from the NCBI\n\t\t(ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz)\n\t-r\tRaw \"ncbi_lineages\" file generated by ncbitax2lin\n\t\t(saves time on step 2)\n\t-c\tCustom \"ncbi_lineages\" file that is already tailored for the\n\t\tquery species (skip step 2). The table should be arranged so\n\t\tthat the taxid is in the first column and all the phylostrata\n\t\tof interest are organized from the species level all the way\n\t\tback to \"cellular organisms\"\n\n IMPORTANT OPTIONAL ARGUMENTS\n\t-n\tNumber of threads to run genEra (genEra can run with a single\n\t\tthread, but it is HIGHLY suggested to use as many threads as\n\t\tpossible) (default: 20 threads)\n\t-s\tTable with pairwise evolutionary distances (substitutions/site)\n\t\tbetween several species in the database and the query species\n\t\t(necessary to calculate homology detection failure probabilities\n\t\twith abSENSE). NOTE: the query species SHOULD be included in\n\t\tthis table. The table should be tab-delimited and have the\n\t\tfollowing format:\n\t\t query_sp_taxid\t0\n\t\t species_1_taxid\tdistance_1\n\t\t species_2_taxid\tdistance_2\n\t-a\tTable with additional proteins to be included in the analysis\n\t\t(e.g., proteins from species that are absent from the nr).\n\t\tThe table should be tab-delimited and have the following format:\n\t\t /path/to/species_1.fasta\ttaxid_1\n\t\t /path/to/species_2.fasta\ttaxid_2\n\t\t /path/to/species_3.fasta\ttaxid_3\n\t-f\tTable with additional nucleotide sequences to be search against\n\t\tyour query proteins. Particularly useful with genome assemblies\n\t\tfor improved orphan gene classification. The table should be\n\t\ttab-delimited and have the following format:\n\t\t /path/to/genome_1.fasta\ttaxid_1\n\t\t /path/to/genome_2.fasta\ttaxid_2\n\t\t /path/to/genome_3.fasta\ttaxid_3\n\t-i\tWhen true, prints an additional output file with the best\n\t\tsequence hit responsible for the oldest phylostrata assignment\n\t\tfor each of the query genes (default: false)\n\n FINE-TUNNING ARGUMENTS (DEFAULT IS USUALLY FINE)\n\t-l\tTaxonomic representativeness threshold below which a gene will\n\t\tbe flagged as putative genome contamination or the product of\n\t\ta horizontal gene transfer (HGT) event (default: 30)\n\t-e\tE-value threshold for DIAMOND and MMseqs2 (default: 1e-5)\n\t-o\tAdditional options to feed DIAMOND, based on user preferences\n\t\t(e.g., filtering the hits by identity or query coverage)\n\t\tUsers should input the additional commands in quotes, using\n\t\tthe original arguments from DIAMOND (Example: -o \"--id 30\")\n\t-m\tMinimum percentage of matches between your query sequences\n\t\tand another species to consider it useful for the gene age\n\t\tassignment (i.e., filtering species with just a couple of\n\t\tgenes in the database)(default: 10)\n\t-x\tAlternative path where you would like to store the temporary\n\t\tfiles as well as the DIAMOND/MMseqs2 results (warning: genEra\n\t\twill generate HUGE temporary files) (default: the files will\n\t\tbe stored in a tmp_[RAMDOMNUM]/ directory created by genEra)\n\t-y\tModify the sensitivity parameter in DIAMOND for faster\n\t\tresults in step 1 (default: sensitive)\n\t-h\tPrint this help message and exit\n\n"
}

while getopts ':q:t:b:n:l:p:d:r:c:a:f:e:o:m:i:s:x:y:h' flag; do
@@ -149,7 +151,7 @@ if [[ -f ${DIVERGENCE} ]]; then

fi

-echo "genEra v1.0 (C) Max Planck Society for the Advancement of Science"
+echo "genEra v${VERSION} (C) Max Planck Society for the Advancement of Science"
echo "Starting time of run:"
date
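
Note on the pattern in this commit: the version string now lives in a single VERSION variable that both the help text and the start-of-run banner expand, and the default DIAMOND sensitivity changes from ultra-sensitive to sensitive. The sketch below is not the genEra source; it is a minimal, self-contained illustration of the same idea. The function names banner and parse_args are hypothetical, and only -y and -h are parsed. How genEra forwards the -y value to DIAMOND is not shown in this diff, so the sketch only stores it.

```bash
#!/bin/bash
# Illustrative sketch only; NOT the genEra source.

VERSION='1.0.2'          # single place where the version string is defined
SENSITIVITY='sensitive'  # default after this commit (previously 'ultra-sensitive')

# Hypothetical helper: build the banner from VERSION so the help text and
# the run log cannot drift apart (before the commit, both were hard-coded).
banner() {
  printf 'genEra v%s (C) Max Planck Society for the Advancement of Science\n' "${VERSION}"
}

# Hypothetical helper: parse only -y and -h; the real script accepts many more flags.
parse_args() {
  OPTIND=1  # reset so the function can be called more than once
  while getopts ':y:h' flag "$@"; do
    case "${flag}" in
      y) SENSITIVITY="${OPTARG}" ;;
      h) banner; return 0 ;;
      *) printf 'unknown or incomplete option: -%s\n' "${OPTARG}" >&2; return 1 ;;
    esac
  done
}
```

In this sketch, calling parse_args -y ultra-sensitive before the search step would restore the pre-1.0.2 default; with no -y, SENSITIVITY stays at sensitive.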
