Updated content of hqtb #5 #11
GwennyGit committed Jul 11, 2024
1 parent 413f012 commit 8e8a0c4
Showing 4 changed files with 80 additions and 75 deletions.
13 changes: 8 additions & 5 deletions docs/source/hqtb/about-pipeline.rst
@@ -8,8 +8,10 @@ of a closely related strain (species) and additional database information.
This type of pipeline aims to build upon already existing knowledge to speed up model curation
and to avoid repeating steps that have already been performed in a similar context.

Overview of the HQTB Pipeline
-----------------------------
.. _overview-hqtb:

Overview of the ``HQTB`` Pipeline
---------------------------------

The following image shows an overview of the steps of the pipeline:

@@ -24,8 +26,9 @@ The pipeline consists of five main steps:
Step 2: Draft Model Generation <step-desc/gen_draft.rst>

- | Step 3: Model Refinement

| Refine the previously generated draft model to make the model more complete and strain-specific.
In other words, fitting it more closely to the input genome.
| In other words, fitting it more closely to the input genome.
- :doc:`Part 1: Extension <step-desc/refine-parts/extension>`
- :doc:`Part 2: Clean-up <step-desc/refine-parts/cleanup>`
@@ -55,12 +58,12 @@ boundary parameters.

.. hint::
Many of the steps of the pipeline can be fine-tuned and turned on or off.
Check the `configuration file <hqtb-config.html>`__ for a full list of all parameters.
Check the :doc:`configuration file <hqtb-config>` for a full list of all parameters.

.. note::

All steps of the pipeline can be run separately via the command line or
the Python integration (see :ref:`Run the HTQB Pipeline`).
the Python integration (see :doc:`run-pipeline`).

All accessible functions are listed in the :ref:`Contents of SPECIMEN` section.

76 changes: 38 additions & 38 deletions docs/source/hqtb/hqtb-config.rst
@@ -1,63 +1,63 @@
HQTB Configuration File
=======================
``HQTB`` Configuration File
===========================

Below, the configuration file with the underlying defaults is shown.

.. code-block:: yaml
# information about the genome to be used to generate the new model
# Information about the genome to be used to generate the new model
subject:
annotated_genome: __USER__
full_sequence: __USER__
# information about the template model/genome
# Information about the template model/genome
template:
annotated_genome: __USER__
model: __USER__
namespace: BiGG
# information about the output
# Information about the output
out:
dir: ./specimen_run/
name: specimen_model
memote: False
# data(bases) required to run the program
# Data(bases) required to run the program
data:
# if this parameter is set, assumes that the directory structure from setup
# If this parameter is set, assumes that the directory structure from setup
# is used and uses this path to a directory as the parent folder for the
# following paths (assumes all data paths are relative ones)
data_direc: null
# required
# Required
diamond: __USER__
# needed but potentially downloaded
# Needed but potentially downloaded
mnx_chem_prop: MetaNetX/chem_prop.tsv
mnx_chem_xref: MetaNetX/chem_xref.tsv
mnx_reac_prop: MetaNetX/reac_prop.tsv
mnx_reac_xref: MetaNetX/reac_xref.tsv
# optional, but good and manual
# Optional, but good to have; added manually
ncbi_map: null
ncbi_dat: null
# optional for directionality control
# Optional for directionality control
biocyc: null
# optional:
# the pan-core model is used for analysis and if no universal model
# is given, also for gapfilling
# if the pan-core model is too small for useful gapfilling, use an
# additional universal model for gapfilling
# if none if given gapfilling (and core-pan analysis) is skipped
# Optional:
# The pan-core model is used for analysis and if no universal model
# is given, also for gapfilling.
# If the pan-core model is too small for useful gapfilling, use an
# additional universal model for gapfilling.
# If none is given, gapfilling (and core-pan analysis) is skipped
universal: null
pan-core: null
# paramters for the single steps of the pipeline
# Parameters for the single steps of the pipeline
parameters:
bidirectional_blast:
# default should suffice except special cases
# Default should suffice except special cases
template_name: null
input_name: null
temp_header: null
in_header: null
# can be set by user if wanted, but not necessary
# Can be set by user if wanted, but not necessary
sensitivity: more-sensitive
generate_draft_model:
@@ -66,59 +66,59 @@ Below, the configuration file with the underlying defaults is shown.
medium: default
refinement_extension:
# default (usually) fine
# Default (usually) fine
id: locus_tag
# default fine
# Default fine
sensitivity: more-sensitive
# default alright but good to edit for trying different options
# Default alright but good to edit for trying different options
coverage: 95.0
pid: 90.0
# default almost needed, except for special cases
# Default almost always needed, except for special cases
exclude_dna: True
exclude_rna: True
refinement_cleanup:
# default as standart
# Default as standard
check_dupl_reac: True
check_dupl_meta: default
remove_unused_meta: False
remove_dupl_reac: True
remove_dupl_meta: True
# current default means no gapfilling
# Current default means no gapfilling
media_gap: null
refinement_annotation:
# for KEGG pathway annotation
# For KEGG pathway annotation
viaEC: False
viaRC: False
refinement_smoothing:
# useful
# Useful
mcc: skip
# EGC correction
egc: null
# depend on organism (current: Klebsiella )
# Depends on the organism (current: Klebsiella)
dna_weight_frac: 0.023
ion_weight_frac: 0.05
# validation:
# default should suffice
# Validation:
# Default should suffice
analysis:
# default is currently only option
# Default is currently only option
pc_based_on: id
# can be default but useful to edit
media_analysis: __USER__ # edit to fit a default media config file
# Can be default but useful to edit
media_analysis: __USER__ # Edit to fit a default media config file
test_aa_auxotrophies: True
# perform pathway analysis with KEGG
# Perform pathway analysis with KEGG
pathway: True
# options for performance
# Options for performance
performance:
threads: 2
# for the gapfilling, if iterations and chunk_size are set (not null)
# For the gapfilling, if iterations and chunk_size are set (not null)
# use a heuristic for faster performance:
# instead of using all reactions that can be added at once,
# Instead of using all reactions that can be added at once,
# run x iterations of gapfilling with n-sized randomised chunks of reactions
gapfilling:
iterations: 3
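
The ``gapfilling`` block above describes a chunking heuristic: instead of testing all candidate reactions at once, gap filling is repeated over randomised, fixed-size chunks. Below is a minimal, self-contained sketch of the idea, not SPECIMEN's actual implementation; ``try_gapfill`` is a hypothetical stand-in for a single gap-filling call:

.. code-block:: python

    import random

    def try_gapfill(chunk):
        # Hypothetical placeholder: a real call would test which reactions
        # in the chunk restore growth and return the ones worth adding.
        return []

    def chunked_gapfill(candidate_reactions, iterations=3, chunk_size=100, seed=42):
        """Run several iterations of gap filling on randomised chunks."""
        rng = random.Random(seed)
        added = []
        for _ in range(iterations):
            pool = list(candidate_reactions)
            rng.shuffle(pool)
            # Process the shuffled pool in chunks of at most `chunk_size` reactions
            for start in range(0, len(pool), chunk_size):
                added.extend(try_gapfill(pool[start:start + chunk_size]))
        return added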
64 changes: 33 additions & 31 deletions docs/source/hqtb/run-pipeline.rst
@@ -1,19 +1,19 @@
Run the HTQB Pipeline
=====================
Run the ``HQTB`` Pipeline
=========================

This page explains how to run the complete ``HQTB`` (high-quality template based) pipeline
and how to collect the necessary data.

For more information about the steps of the pipeline,
see :ref:`Overview of the HQTB Pipeline`.
see :ref:`overview-hqtb`.

HQTB: Quickstart
----------------
``HQTB``: Quickstart
--------------------

The pipeline can either be run directly from the command line or its functions can be called from inside a Python script.
The input in both cases is a configuration file that contains all information needed (data file paths and parameters) to run it.

The configuration can be downloaded using the command line:
The `configuration <hqtb-config.html>`__ can be downloaded using the command line:

.. code-block:: bash
:class: copyable
@@ -54,12 +54,12 @@ from inside a Python script or Jupyter Notebook with "config.yaml" being the path

.. note::

Additionally, the pipeline can be run with a wrapper to susequently build multiple models for different genome using the same parameters.
To wrapper can be accessed using :code:`specimen hqtb run wrapper "config.yaml"` or :code:`specimen.workflow.wrapper_pipeline(config_file='/User/path/to/config.yaml', parent_dir="./")`.
Additionally, the pipeline can be run with a wrapper to subsequently build multiple models for different genomes using the same parameters.
The wrapper can be accessed using :code:`specimen hqtb run wrapper "config.yaml"` or :code:`specimen.workflow.wrapper_pipeline(config_file='/User/path/to/config.yaml', parent_dir="./")`.
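
For reference, a minimal script using the wrapper call quoted above; both paths are placeholders to be replaced with your own:

.. code-block:: python

    import specimen

    # Build several models with the same parameters, one run per genome.
    specimen.workflow.wrapper_pipeline(
        config_file='/User/path/to/config.yaml',
        parent_dir='./',
    )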


HQTB: Collecting Data
---------------------
``HQTB``: Collecting Data
-------------------------

If you are just starting a new project and do not have all the data ready to go, you can use the setup function of
``SPECIMEN`` to help you collect the data you need.
@@ -69,7 +69,11 @@ If you are just starting a new project and do not have all the data ready to go,
specimen.util.set_up.build_data_directories('your_folder_name')
The function above creates the following directory structure for your project:
| The function above creates the following directory structure for your project.
| The 'contains' column lists what is supposed to be inside the corresponding folder.
| The tags manual/semi/automated report how these files are added to the folder (automated = by the setup function, manual = by the user).
``TODO``: What does semi mean?
| The tags required/optional report whether this input is required to run the pipeline or optional.
.. table::
:align: center
@@ -97,39 +101,37 @@ The function above creates the following directory structure for your project:
| universal-models | universal models | manual, optional |
+--------------------+------------------------------+---------------------+

In the contains columns it is listed what is supposed to be inside that folder.
The tags manual/semi/automated report how these are added to the folder (automated = by the setup function, manual = by the user).
The tags report/optional report whether this input is necessary to run the pipeline or if it is an optional input.

.. note::

Regarding the annotated genomes, the program currently only supports the file types ``GBFF`` and ``FAA`` + ``FNA``.
Regarding the annotated_genomes folder, the program currently only supports the file types ``GBFF`` and ``FAA`` + ``FNA``.
``TODO``: For which of the files listed under 'contains' does this apply?

Further details on collecting the data:

- BioCyc:
- `BioCyc <https://biocyc.org/>`__:

- downloading a smart table from BioCyc requires a subscription
- the smart table needs to have the columns Reactions, EC-Number, KEGG reaction, METANETX and Reaction-Direction
- Downloading a smart table from BioCyc requires a subscription.
- The SmartTable needs to have the columns 'Reactions', 'EC-Number', 'KEGG reaction', 'METANETX' and 'Reaction-Direction'.

- RefSeqs
- RefSeq

- one way to builf a DIAMOND reference database is to download a set of reference sequences from the NCBI database, e.g. in the **FAA** format
- use the function :code:`specimen.util.util.create_DIAMOND_db_from_folder('/User/path/input/directory', '/User/Path/for/output/', name = 'database', extention = 'faa')` to create a DIAMOND database
- to speed up the mapping, create an additional mapping file from the e.g. ``GBFF`` files from NCBI using :code:`specimen.util.util.create_NCBIinfo_mapping('/User/path/input/directory', '/User/Path/for/output/', extention = 'gbff')`
- to ensure correct mapping to KEGG, an additional information file can be created by constructing a CSV file with the following columns: NCBI genome, organism, locus_tag (start) and KEGG.organism
- One way to build a DIAMOND reference database is to download a set of reference sequences from the NCBI database, e.g. in the **FAA** format (the sketch after this list ties these calls together).
- Use the function :code:`specimen.util.util.create_DIAMOND_db_from_folder('/User/path/input/directory', '/User/Path/for/output/', name = 'database', extention = 'faa')` to create a DIAMOND database.
- To speed up the mapping, create an additional mapping file from, e.g., the ``GBFF`` files from NCBI using :code:`specimen.util.util.create_NCBIinfo_mapping('/User/path/input/directory', '/User/Path/for/output/', extention = 'gbff')`.
- To ensure correct mapping to KEGG, an additional information file can be created by constructing a CSV file with the following columns: 'NCBI genome', 'organism', 'locus_tag' (start) and 'KEGG.organism'.
``TODO``: What is meant by start here?

- the information of the first three columns can be taken from the previous two steps while
- the last column the user needs to check, if the genomes have been entered into KEGG and have an organism identifier
- this file is purely optional for running the pipeline but potentially leads to better results
- The information for the first three columns can be taken from the previous two steps.
- For the last column, the user needs to check whether the genomes have been entered into KEGG and have an organism identifier.
- This file is purely optional for running the pipeline but potentially leads to better results.
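
A sketch tying the steps above together, using the two ``specimen.util.util`` helpers quoted in this list; all paths are placeholders and the CSV row is a made-up example:

.. code-block:: python

    import csv

    import specimen

    # Build a DIAMOND database from a folder of downloaded FAA reference sequences.
    specimen.util.util.create_DIAMOND_db_from_folder(
        '/User/path/input/directory', '/User/Path/for/output/',
        name='database', extention='faa',
    )

    # Create the NCBI information mapping from the corresponding GBFF files.
    specimen.util.util.create_NCBIinfo_mapping(
        '/User/path/input/directory', '/User/Path/for/output/',
        extention='gbff',
    )

    # Optional KEGG mapping file with the four columns named above
    # (values below are invented placeholders).
    with open('kegg_mapping.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['NCBI genome', 'organism', 'locus_tag', 'KEGG.organism'])
        writer.writerow(['GCF_000000000.1', 'Example organism', 'EXA_', 'exa'])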

- medium:

The media, either for analysis or gapfilling can be entered into the pipeline via a config file (each).
The config files are from the `refineGEMs <https://github.com/draeger-lab/refinegems>`__ :footcite:p:`bauerle2023genome` toolbox and access its in-build medium database
and additionally allow for manual adjustment / external input.
The media, either for analysis or gap filling, can be entered into the pipeline via a config file (each). ``TODO``: Does a new file really have to be created for each medium?
The config files are from the `refineGEMs <https://github.com/draeger-lab/refinegems/tree/dev-2>`__ :footcite:p:`bauerle2023genome` toolbox and access its built-in medium database.
Additionally, the config files allow for manual adjustment / external input.

A examplary config file can be accessed using the following command:
An example config file can be accessed using the following command:

.. code-block:: python
:class: copyable
2 changes: 1 addition & 1 deletion docs/source/hqtb/step-desc/validation.rst
@@ -3,7 +3,7 @@ Step 4: Model Validation

After the previous step, the final model of the pipeline has been generated.
To ensure the model is functional and a valid SBML model, the fourth step
of the pipeline perform a validation of the created model.
of the pipeline performs a validation of the created model.

Currently implemented are the following validators (more will be added in future updates):

