Add agat sq stat basic (#110)

* add help
* add config
* add run script
* add test data and expected output + script to fetch them
* add test
* update changelog
* handle input --gff has multiple=true
* cleanup config
* add direction for input arguments
* update config: add requirements, add keywords, update --config description
* remove unset IFS
* add set -eo pipefail to script and test files
* create temporary directory and clean up on exit
* cleanup changelog
* Update CHANGELOG.md

---------

Co-authored-by: Robrecht Cannoodt <[email protected]>
viash-hub · Nov 2, 2024 · cc67547
1 parent aa43543
Showing 8 changed files with 1,203 additions and 1 deletion.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -8,6 +8,7 @@
   - `agat/agat_sp_filter_feature_from_kill_list`: remove features in a GFF file based on a kill list (PR #105).
   - `agat/agat_sp_merge_annotations`: merge different gff annotation files in one (PR #106).
   - `agat/agat_sp_statistics`: provides exhaustive statistics of a gft/gff file (PR #107).
+  - `agat/agat_sq_stat_basic`: provide basic statistics of a gtf/gff file (PR #110).
 
 * `bd_rhapsody/bd_rhapsody_sequence_analysis`: BD Rhapsody Sequence Analysis CWL pipeline (PR #96).
 
@@ -68,7 +69,6 @@
   - `agat/agat_convert_sp_gff2tsv`: convert gtf/gff file into tabulated file (PR #102).
   - `agat/agat_convert_sp_gxf2gxf`: fixes and/or standardizes any GTF/GFF file into full sorted GTF/GFF file (PR #103).
 
-
 * `bedtools`:
   - `bedtools/bedtools_intersect`: Allows one to screen for overlaps between two sets of genomic features (PR #94).
   - `bedtools/bedtools_sort`: Sorts a feature file (bed/gff/vcf) by chromosome and other criteria (PR #98).

diff --git a/src/agat/agat_sq_stat_basic/config.vsh.yaml b/src/agat/agat_sq_stat_basic/config.vsh.yaml
@@ -0,0 +1,92 @@
+name: agat_sq_stat_basic
+namespace: agat
+description: |
+  The script aims to provide basic statistics of a gtf/gff file.
+keywords: [gene annotations, gff, statistics]
+links:
+  homepage: https://github.com/NBISweden/AGAT
+  documentation: https://agat.readthedocs.io/en/latest/tools/agat_sq_stat_basic.html
+  issue_tracker: https://github.com/NBISweden/AGAT/issues
+  repository: https://github.com/NBISweden/AGAT
+references:
+  doi: 10.5281/zenodo.3552717
+license: GPL-3.0
+requirements:
+  - commands: [agat]
+authors:
+  - __merge__: /src/_authors/leila_paquay.yaml
+    roles: [ author, maintainer ]
+argument_groups:
+  - name: Inputs
+    arguments:
+      - name: --gff
+        alternatives: [-i, --file, --input]
+        description: |
+          Input GTF/GFF file.
+        type: file
+        required: true
+        multiple: true
+        direction: input
+        example: input.gff
+      - name: --genome_size
+        alternatives: [-g]
+        description: |
+          Genome size, used to calculate the percentage of the genome represented by each feature type. Provide the size as an integer here, or pass a fasta file with the argument --genome_size_fasta instead. If both are provided, only the value of --genome_size is used.
+        type: integer
+        required: false
+        direction: input
+        example: 10000
+      - name: --genome_size_fasta
+        description: |
+          Genome in fasta format, used to calculate the percentage of the genome represented by each feature type. The genome size is calculated on the fly from the fasta file. Alternatively, pass the size directly as an integer with the argument --genome_size. If both are provided, only the value of --genome_size is used.
+        type: file
+        required: false
+        direction: input
+        example: genome.fasta
+  - name: Outputs
+    arguments:
+      - name: --output
+        alternatives: [-o]
+        description: |
+          Output file. The result is in tabulated format.
+        type: file
+        direction: output
+        required: true
+        example: output.txt
+  - name: Arguments
+    arguments:
+      - name: --inflate
+        description: |
+          Inflate the statistics by taking into account features with
+          multiple parents. To avoid redundant information, some GFF
+          files factorize identical features, e.g. one exon used in two
+          different isoforms is defined only once and has multiple
+          parents. By default the script counts such a feature only
+          once. The inflate option counts the feature and its size as
+          many times as it has parents.
+        type: boolean_true
+      - name: --config
+        alternatives: [-c]
+        description: |
+          AGAT config file. By default AGAT takes the original agat_config.yaml shipped with AGAT. The `--config` option gives you the possibility to use your own AGAT config file (located elsewhere or named differently).
+        type: file
+        required: false
+        example: custom_agat_config.yaml
+resources:
+  - type: bash_script
+    path: script.sh
+test_resources:
+  - type: bash_script
+    path: test.sh
+  - type: file
+    path: test_data
+engines:
+  - type: docker
+    image: quay.io/biocontainers/agat:1.4.0--pl5321hdfd78af_0
+    setup:
+      - type: docker
+        run: |
+          agat --version | sed 's/AGAT\s\(.*\)/agat: "\1"/' > /var/software_versions.txt
+runners:
+  - type: executable
+  - type: nextflow
diff --git a/src/agat/agat_sq_stat_basic/help.txt b/src/agat/agat_sq_stat_basic/help.txt
@@ -0,0 +1,79 @@
+```sh
+agat_sq_stat_basic.pl --help
+```
+
+ ------------------------------------------------------------------------------
+|   Another GFF Analysis Toolkit (AGAT) - Version: v1.4.0                      |
+|   https://github.com/NBISweden/AGAT                                          |
+|   National Bioinformatics Infrastructure Sweden (NBIS) - www.nbis.se         |
+ ------------------------------------------------------------------------------
+
+
+Name:
+    agat_sq_stat_basic.pl
+
+Description:
+    The script aims to provide basic statistics of a gtf/gff file.
+
+Usage:
+        agat_sq_stat_basic.pl -i <input file> [-g <integer or fasta> -o <output file>]
+        agat_sq_stat_basic.pl --help
+
+Options:
+    -i, --gff, --file or --input
+            STRING: Input GTF/GFF file. Several files can be processed at
+            once: -i file1 -i file2
+
+    -g, --genome
+            That input is design to know the genome size in order to
+            calculate the percentage of the genome represented by each kind
+            of feature type. You can provide an INTEGER or the genome in
+            fasta format. If you provide the fasta, the genome size will be
+            calculated on the fly.
+
+    --inflate
+            Inflate the statistics taking into account feature with
+            multi-parents. Indeed to avoid redundant information, some gff
+            factorize identical features. e.g: one exon used in two
+            different isoform will be defined only once, and will have
+            multiple parent. By default the script count such feature only
+            once. Using the inflate option allows to count the feature and
+            its size as many time there are parents.
+
+    -o or --output
+            STRING: Output file. If no output file is specified, the output
+            will be written to STDOUT. The result is in tabulate format.
+
+    -c or --config
+            String - Input agat config file. By default AGAT takes as input
+            agat_config.yaml file from the working directory if any,
+            otherwise it takes the orignal agat_config.yaml shipped with
+            AGAT. To get the agat_config.yaml locally type: "agat config
+            --expose". The --config option gives you the possibility to use
+            your own AGAT config file (located elsewhere or named
+            differently).
+
+    --help or -h
+            Display this helpful text.
+
+Feedback:
+  Did you find a bug?:
+    Do not hesitate to report bugs to help us keep track of the bugs and
+    their resolution. Please use the GitHub issue tracking system available
+    at this address:
+
+                https://github.com/NBISweden/AGAT/issues
+
+     Ensure that the bug was not already reported by searching under Issues.
+     If you're unable to find an (open) issue addressing the problem, open a new one.
+     Try as much as possible to include in the issue when relevant:
+     - a clear description,
+     - as much relevant information as possible,
+     - the command used,
+     - a data sample,
+     - an explanation of the expected behaviour that is not occurring.
+
+  Do you want to contribute?:
+    You are very welcome, visit this address for the Contributing
+    guidelines:
+    https://github.com/NBISweden/AGAT/blob/master/CONTRIBUTING.md
diff --git a/src/agat/agat_sq_stat_basic/script.sh b/src/agat/agat_sq_stat_basic/script.sh
@@ -0,0 +1,31 @@
+#!/bin/bash
+
+set -eo pipefail
+
+## VIASH START
+## VIASH END
+
+# unset flags
+[[ "$par_inflate" == "false" ]] && unset par_inflate
+
+# Convert a list of file names to multiple --gff arguments
+input_files=()
+IFS=";" read -ra file_names <<< "$par_gff"
+for file in "${file_names[@]}"; do
+    input_files+=(--gff "$file")
+done
+
+# take care of --genome (can originally be either a fasta file or an integer)
+if [[ -n "$par_genome_size" ]]; then
+  genome_arg=$par_genome_size
+elif [[ -n "$par_genome_size_fasta" ]]; then
+  genome_arg=$par_genome_size_fasta
+fi
+
+# run agat_sq_stat_basic.pl
+agat_sq_stat_basic.pl \
+  "${input_files[@]}" \
+  ${genome_arg:+--genome "${genome_arg}"} \
+  --output "${par_output}" \
+  ${par_inflate:+--inflate} \
+  ${par_config:+--config "${par_config}"}
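
The conversion of the semicolon-separated `par_gff` value into repeated `--gff` flags can be tried in isolation. The following is a minimal sketch, not part of the component: `build_gff_args`, `gff_args`, and the sample file names are hypothetical and exist only for illustration. It assumes file names contain no semicolons and collects the flags in a bash array, so a path containing spaces stays a single argument.

```shell
#!/bin/bash
set -eo pipefail

# Hypothetical helper (illustration only): turn a semicolon-separated list
# of paths into repeated "--gff <path>" arguments stored in the global
# array gff_args.
build_gff_args() {
  local joined="$1"
  local file
  local -a file_names
  gff_args=()
  # The IFS assignment applies to this read command only.
  IFS=";" read -ra file_names <<< "$joined"
  for file in "${file_names[@]}"; do
    gff_args+=(--gff "$file")
  done
}

# Sample value in the semicolon-separated form that script.sh splits.
build_gff_args "annotation 1.gff;annotation2.gff"

# Print one argument per line to show how the command line would be split.
printf '%s\n' "${gff_args[@]}"
```

Expanding the array as `"${gff_args[@]}"` passes each element as its own word, which an unquoted string expansion cannot guarantee for paths with whitespace.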
diff --git a/src/agat/agat_sq_stat_basic/test.sh b/src/agat/agat_sq_stat_basic/test.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+
+set -eo pipefail
+
+## VIASH START
+## VIASH END
+
+test_dir="${meta_resources_dir}/test_data"
+
+# create temporary directory and clean up on exit
+TMPDIR=$(mktemp -d "$meta_temp_dir/$meta_name-XXXXXX")
+function clean_up {
+ [[ -d "$TMPDIR" ]] && rm -rf "$TMPDIR"
+}
+trap clean_up EXIT
+
+
+echo "> Run $meta_name with test data"
+"$meta_executable" \
+  --gff "$test_dir/1.gff" \
+  --output "$TMPDIR/output.txt"
+
+echo ">> Checking output"
+[ ! -f "$TMPDIR/output.txt" ] && echo "Output file output.txt does not exist" && exit 1
+
+echo ">> Check if output is empty"
+[ ! -s "$TMPDIR/output.txt" ] && echo "Output file output.txt is empty" && exit 1
+
+echo ">> Check if output matches expected output"
+# with 'set -e', a bare failing diff would exit before the message is printed
+if ! diff "$TMPDIR/output.txt" "$test_dir/agat_sq_stat_basic_1.gff"; then
+  echo "Output file output.txt does not match expected output"
+  exit 1
+fi
+
+echo "> Test successful"
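
With `set -eo pipefail` active, a command that fails on its own line ends the script at that line, so any message that depends on a later `$?` check is never printed. A minimal sketch of testing the comparison directly in an `if` follows. `files_match`, `work_dir`, and the sample files are hypothetical names created only for this illustration.

```shell
#!/bin/bash
set -eo pipefail

# Hypothetical helper (illustration only): compare two files and report the
# result. Testing diff inside `if` keeps `set -e` from aborting the script
# on a mismatch, so the message can still be printed.
files_match() {
  if ! diff "$1" "$2" > /dev/null; then
    echo "mismatch"
    return 1
  fi
  echo "match"
}

# Temporary working directory, removed on exit.
work_dir=$(mktemp -d)
trap 'rm -rf "$work_dir"' EXIT

printf 'a\tb\n' > "$work_dir/expected.txt"
printf 'a\tb\n' > "$work_dir/same.txt"
printf 'a\tc\n' > "$work_dir/different.txt"

result_same=$(files_match "$work_dir/expected.txt" "$work_dir/same.txt")
# `|| true` keeps the non-zero return from stopping this demo script.
result_diff=$(files_match "$work_dir/expected.txt" "$work_dir/different.txt" || true)

echo "$result_same"
echo "$result_diff"
```

A command used as an `if` condition is exempt from `set -e`, so both the success and the failure branch remain reachable.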