Updated content of hqtb #5 #11
GwennyGit committed Jul 11, 2024
1 parent 413f012 commit 8e8a0c4
Showing 4 changed files with 80 additions and 75 deletions.
13 changes: 8 additions & 5 deletions docs/source/hqtb/about-pipeline.rst
@@ -8,8 +8,10 @@ of a closely related strain (species) and additional database information.
This type of pipeline aims to build upon already existing knowledge to speed up model curation
and to avoid repeating steps that have already been performed in a similar context.

Overview of the HQTB Pipeline
-----------------------------
.. _overview-hqtb:

Overview of the ``HQTB`` Pipeline
---------------------------------

The following image shows an overview of the steps of the pipeline:

@@ -24,8 +26,9 @@ The pipeline consists of five main steps:
Step 2: Draft Model Generation <step-desc/gen_draft.rst>

- | Step 3: Model Refinement

| Refine the previously generated draft model to make the model more complete and strain-specific.
In other words, fitting it more closely to the input genome.
| In other words, fitting it more closely to the input genome.
- :doc:`Part 1: Extension <step-desc/refine-parts/extension>`
- :doc:`Part 2: Clean-up <step-desc/refine-parts/cleanup>`
@@ -55,12 +58,12 @@ boundary parameters.

.. hint::
Many of the steps of the pipeline can be fine-tuned and turned on or off.
Check the `configuration file <hqtb-config.html>`__ for a full list of all parameters.
Check the :doc:`configuration file <hqtb-config>` for a full list of all parameters.

.. note::

All steps of the pipeline can be run separately via the command line or
the Python integration (see :ref:`Run the HTQB Pipeline`).
the Python integration (see :doc:`run-pipeline`).

All accessible functions are listed in the :ref:`Contents of SPECIMEN` section.

76 changes: 38 additions & 38 deletions docs/source/hqtb/hqtb-config.rst
@@ -1,63 +1,63 @@
HQTB Configuration File
=======================
``HQTB`` Configuration File
===========================

Below, the configuration file with the underlying defaults is shown.

.. code-block:: yaml
# information about the genome to be used to generate the new model
# Information about the genome to be used to generate the new model
subject:
annotated_genome: __USER__
full_sequence: __USER__
# information about the template model/genome
# Information about the template model/genome
template:
annotated_genome: __USER__
model: __USER__
namespace: BiGG
# information about the output
# Information about the output
out:
dir: ./specimen_run/
name: specimen_model
memote: False
# data(bases) required to run the program
# Data(bases) required to run the program
data:
# if this parameter is set, assumes that the directory structure from setup
# If this parameter is set, assumes that the directory structure from setup
# is used and uses this path to a directory as the parent folder for the
# following paths (assumes all data paths are relative ones)
data_direc: null
# required
# Required
diamond: __USER__
# needed but potentially downloaded
# Needed but potentially downloaded
mnx_chem_prop: MetaNetX/chem_prop.tsv
mnx_chem_xref: MetaNetX/chem_xref.tsv
mnx_reac_prop: MetaNetX/reac_prop.tsv
mnx_reac_xref: MetaNetX/reac_xref.tsv
# optional, but good and manual
# Optional, but good to have; added manually
ncbi_map: null
ncbi_dat: null
# optional for directionality control
# Optional for directionality control
biocyc: null
# optional:
# the pan-core model is used for analysis and if no universal model
# is given, also for gapfilling
# if the pan-core model is too small for useful gapfilling, use an
# additional universal model for gapfilling
# if none if given gapfilling (and core-pan analysis) is skipped
# Optional:
# The pan-core model is used for analysis and if no universal model
# is given, also for gapfilling.
# If the pan-core model is too small for useful gapfilling, use an
# additional universal model for gapfilling.
# If none is given, gapfilling (and core-pan analysis) is skipped
universal: null
pan-core: null
# paramters for the single steps of the pipeline
# Parameters for the single steps of the pipeline
parameters:
bidirectional_blast:
# default should suffice except special cases
# Default should suffice except special cases
template_name: null
input_name: null
temp_header: null
in_header: null
# can be set by user if wanted, but not necessary
# Can be set by user if wanted, but not necessary
sensitivity: more-sensitive
generate_draft_model:
@@ -66,59 +66,59 @@ Below, the configuration file with the underlying defaults is shown.
medium: default
refinement_extension:
# default (usually) fine
# Default (usually) fine
id: locus_tag
# default fine
# Default fine
sensitivity: more-sensitive
# default alright but good to edit for trying different options
# Default alright but good to edit for trying different options
coverage: 95.0
pid: 90.0
# default almost needed, except for special cases
# Default almost always needed, except for special cases
exclude_dna: True
exclude_rna: True
refinement_cleanup:
# default as standart
# Default as standard
check_dupl_reac: True
check_dupl_meta: default
remove_unused_meta: False
remove_dupl_reac: True
remove_dupl_meta: True
# current default means no gapfilling
# Current default means no gapfilling
media_gap: null
refinement_annotation:
# for KEGG pathway annotation
# For KEGG pathway annotation
viaEC: False
viaRC: False
refinement_smoothing:
# useful
# Useful
mcc: skip
# EGC correction
egc: null
# depend on organism (current: Klebsiella )
# Depends on the organism (current: Klebsiella)
dna_weight_frac: 0.023
ion_weight_frac: 0.05
# validation:
# default should suffice
# Validation:
# Default should suffice
analysis:
# default is currently only option
# Default is currently only option
pc_based_on: id
# can be default but useful to edit
media_analysis: __USER__ # edit to fit a default media config file
# Can be default but useful to edit
media_analysis: __USER__ # Edit to fit a default media config file
test_aa_auxotrophies: True
# perform pathway analysis with KEGG
# Perform pathway analysis with KEGG
pathway: True
# options for performance
# Options for performance
performance:
threads: 2
# for the gapfilling, if iterations and chunk_size are set (not null)
# For the gapfilling, if iterations and chunk_size are set (not null)
# use a heuristic for faster performance:
# instead of using all reactions that can be added at once,
# Instead of using all reactions that can be added at once,
# run x iterations of gapfilling with n-sized randomised chunks of reactions
gapfilling:
iterations: 3
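
The ``gapfilling`` block above describes a chunking heuristic: instead of testing all candidate reactions at once, gap filling is repeated over randomised, fixed-size chunks. Below is a minimal, self-contained sketch of the idea, not SPECIMEN's actual implementation; ``try_gapfill`` is a hypothetical stand-in for a single gap-filling call:

.. code-block:: python

    import random

    def try_gapfill(chunk):
        # Hypothetical placeholder: a real call would test which reactions
        # in the chunk restore growth and return the ones worth adding.
        return []

    def chunked_gapfill(candidate_reactions, iterations=3, chunk_size=100, seed=42):
        """Run several iterations of gap filling on randomised chunks."""
        rng = random.Random(seed)
        added = []
        for _ in range(iterations):
            pool = list(candidate_reactions)
            rng.shuffle(pool)
            # Process the shuffled pool in chunks of at most `chunk_size` reactions
            for start in range(0, len(pool), chunk_size):
                added.extend(try_gapfill(pool[start:start + chunk_size]))
        return added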
64 changes: 33 additions & 31 deletions docs/source/hqtb/run-pipeline.rst
@@ -1,19 +1,19 @@
Run the HTQB Pipeline
=====================
Run the ``HQTB`` Pipeline
=========================

This page explains how to run the complete ``HQTB`` (high-quality template based) pipeline
and how to collect the necessary data.

For more information about the steps of the pipeline,
see :ref:`Overview of the HQTB Pipeline`.
see :ref:`overview-hqtb`.

HQTB: Quickstart
----------------
``HQTB``: Quickstart
--------------------

The pipeline can either be run directly from the command line or its functions can be called from inside a Python script.
The input in both cases is a configuration file that contains all information needed (data file paths and parameters) to run it.

The configuration can be downloaded using the command line:
The `configuration <hqtb-config.html>`__ can be downloaded using the command line:

.. code-block:: bash
:class: copyable
@@ -54,12 +54,12 @@ from inside a Python script or Jupyter Notebook with "config.yaml" being the path

.. note::

Additionally, the pipeline can be run with a wrapper to susequently build multiple models for different genome using the same parameters.
To wrapper can be accessed using :code:`specimen hqtb run wrapper "config.yaml"` or :code:`specimen.workflow.wrapper_pipeline(config_file='/User/path/to/config.yaml', parent_dir="./")`.
Additionally, the pipeline can be run with a wrapper to subsequently build multiple models for different genomes using the same parameters.
The wrapper can be accessed using :code:`specimen hqtb run wrapper "config.yaml"` or :code:`specimen.workflow.wrapper_pipeline(config_file='/User/path/to/config.yaml', parent_dir="./")`.
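
For reference, a minimal script using the wrapper call quoted above; both paths are placeholders to be replaced with your own:

.. code-block:: python

    import specimen

    # Build several models with the same parameters, one run per genome.
    specimen.workflow.wrapper_pipeline(
        config_file='/User/path/to/config.yaml',
        parent_dir='./',
    )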


HQTB: Collecting Data
---------------------
``HQTB``: Collecting Data
-------------------------

If you are just starting a new project and do not have all the data ready to go, you can use the setup function of
``SPECIMEN`` to help you collect the data you need.
@@ -69,7 +69,11 @@ If you are just starting a new project and do not have all the data ready to go,
specimen.util.set_up.build_data_directories('your_folder_name')
The function above creates the following directory structure for your project:
| The function above creates the following directory structure for your project.
| The 'contains' column lists what is supposed to be inside the corresponding folder.
| The tags manual/semi/automated report how these files are added to the folder (automated = by the setup function, manual = by the user).
``TODO``: What does semi mean?
| The tags required/optional report whether this input is required to run the pipeline or optional.
.. table::
:align: center
@@ -97,39 +101,37 @@ The function above creates the following directory structure for your project:
| universal-models | universal models | manual, optional |
+--------------------+------------------------------+---------------------+

In the contains columns it is listed what is supposed to be inside that folder.
The tags manual/semi/automated report how these are added to the folder (automated = by the setup function, manual = by the user).
The tags report/optional report whether this input is necessary to run the pipeline or if it is an optional input.

.. note::

Regarding the annotated genomes, the program currently only supports the file types ``GBFF`` and ``FAA`` + ``FNA``.
Regarding the annotated_genomes folder, the program currently only supports the file types ``GBFF`` and ``FAA`` + ``FNA``.
``TODO``: For which of the files listed under 'contains' does this apply?

Further details on collecting the data:

- BioCyc:
- `BioCyc <https://biocyc.org/>`__:

- downloading a smart table from BioCyc requires a subscription
- the smart table needs to have the columns Reactions, EC-Number, KEGG reaction, METANETX and Reaction-Direction
- Downloading a smart table from BioCyc requires a subscription.
- The SmartTable needs to have the columns 'Reactions', 'EC-Number', 'KEGG reaction', 'METANETX' and 'Reaction-Direction'.

- RefSeqs
- RefSeq

- one way to builf a DIAMOND reference database is to download a set of reference sequences from the NCBI database, e.g. in the **FAA** format
- use the function :code:`specimen.util.util.create_DIAMOND_db_from_folder('/User/path/input/directory', '/User/Path/for/output/', name = 'database', extention = 'faa')` to create a DIAMOND database
- to speed up the mapping, create an additional mapping file from the e.g. ``GBFF`` files from NCBI using :code:`specimen.util.util.create_NCBIinfo_mapping('/User/path/input/directory', '/User/Path/for/output/', extention = 'gbff')`
- to ensure correct mapping to KEGG, an additional information file can be created by constructing a CSV file with the following columns: NCBI genome, organism, locus_tag (start) and KEGG.organism
- One way to build a DIAMOND reference database is to download a set of reference sequences from the NCBI database, e.g. in the **FAA** format (the sketch after this list ties these calls together).
- Use the function :code:`specimen.util.util.create_DIAMOND_db_from_folder('/User/path/input/directory', '/User/Path/for/output/', name = 'database', extention = 'faa')` to create a DIAMOND database.
- To speed up the mapping, create an additional mapping file from, e.g., the ``GBFF`` files from NCBI using :code:`specimen.util.util.create_NCBIinfo_mapping('/User/path/input/directory', '/User/Path/for/output/', extention = 'gbff')`.
- To ensure correct mapping to KEGG, an additional information file can be created by constructing a CSV file with the following columns: 'NCBI genome', 'organism', 'locus_tag' (start) and 'KEGG.organism'.
``TODO``: What is meant by start here?

- the information of the first three columns can be taken from the previous two steps while
- the last column the user needs to check, if the genomes have been entered into KEGG and have an organism identifier
- this file is purely optional for running the pipeline but potentially leads to better results
- The information for the first three columns can be taken from the previous two steps.
- For the last column, the user needs to check whether the genomes have been entered into KEGG and have an organism identifier.
- This file is purely optional for running the pipeline but potentially leads to better results.
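
A sketch tying the steps above together, using the two ``specimen.util.util`` helpers quoted in this list; all paths are placeholders and the CSV row is a made-up example:

.. code-block:: python

    import csv

    import specimen

    # Build a DIAMOND database from a folder of downloaded FAA reference sequences.
    specimen.util.util.create_DIAMOND_db_from_folder(
        '/User/path/input/directory', '/User/Path/for/output/',
        name='database', extention='faa',
    )

    # Create the NCBI information mapping from the corresponding GBFF files.
    specimen.util.util.create_NCBIinfo_mapping(
        '/User/path/input/directory', '/User/Path/for/output/',
        extention='gbff',
    )

    # Optional KEGG mapping file with the four columns named above
    # (values below are invented placeholders).
    with open('kegg_mapping.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['NCBI genome', 'organism', 'locus_tag', 'KEGG.organism'])
        writer.writerow(['GCF_000000000.1', 'Example organism', 'EXA_', 'exa'])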

- medium:

The media, either for analysis or gapfilling can be entered into the pipeline via a config file (each).
The config files are from the `refineGEMs <https://github.com/draeger-lab/refinegems>`__ :footcite:p:`bauerle2023genome` toolbox and access its in-build medium database
and additionally allow for manual adjustment / external input.
The media, either for analysis or gap filling, can be entered into the pipeline via a config file (each). ``TODO``: Does a new file really have to be created for each medium?
The config files are from the `refineGEMs <https://github.com/draeger-lab/refinegems/tree/dev-2>`__ :footcite:p:`bauerle2023genome` toolbox and access its built-in medium database.
Additionally, the config files allow for manual adjustment / external input.

A examplary config file can be accessed using the following command:
An example config file can be accessed using the following command:

.. code-block:: python
:class: copyable
2 changes: 1 addition & 1 deletion docs/source/hqtb/step-desc/validation.rst
@@ -3,7 +3,7 @@ Step 4: Model Validation

After the previous step, the final model of the pipeline has been generated.
To ensure the model is functional and a valid SBML model, the fourth step
of the pipeline perform a validation of the created model.
of the pipeline performs a validation of the created model.

Currently implemented are the following validators (more will be added in future updates):

