Yggdrasil is a data processing framework designed to manage and automate workflows for various genomic sequencing projects (currently including TenX and SmartSeq3 modules). It provides a unified interface to handle data ingestion, processing, result generation, and ultimately project packing and delivery, streamlining the analysis pipeline for sequencing data.
- Prerequisites
- Installation
- Project Structure
- Usage
- Configuration
- Development Guidelines
- Contributing
- License
- Python 3.11 or higher
- Conda for environment management
- Git for version control
- VSCode (recommended) for development
To get started with the Yggdrasil Project, you need to set up the necessary dependencies. Follow the instructions below:
- Clone the Repository:
git clone https://github.com/NationalGenomicsInfrastructure/Yggdrasil.git
cd Yggdrasil
- Create and Activate a Conda Environment:
It is recommended to use a conda environment to manage dependencies. You can set up the environment using conda:
conda create --name yggdrasil-env python=3.11
conda activate yggdrasil-env
- Install Required Packages:
pip install -r requirements.txt
Brief overview of the main components and directories:
Yggdrasil/
├── lib/
│   ├── base/
│   ├── core_utils/
│   ├── couchdb/
│   ├── module_utils/
│   └── realms/
│       ├── tenx/
│       └── smartseq3/
├── tests/
├── .github/
│   └── workflows/
├── ygg_trunk.py
├── ygg-mule.py
├── pyproject.toml
├── requirements.txt
├── LICENSE
└── README.md
- lib/: Core library containing base classes and utilities.
  - base/: Abstract base classes and interfaces.
  - core_utils/: Utility modules for the Yggdrasil core functionalities.
  - couchdb/: Classes for Yggdrasil's interactions with CouchDB.
  - module_utils/: Utility modules for various Yggdrasil module functionalities.
  - realms/: Modules specific to different sequencing technologies (e.g. TenX, SmartSeq3).
- tests/: Test cases for the application.
- .github/workflows/: GitHub Actions workflows for CI/CD.
To run Yggdrasil manually, use the manual core script ygg-mule.py. It processes a single document, identified by the CouchDB document ID you provide.
Usage:
python ygg-mule.py <doc_id>
Replace <doc_id> with the actual CouchDB document ID you wish to process.
Yggdrasil uses a configuration loader to manage settings. Configuration files should be placed in the yggdrasil_workspace/common/configurations directory. This directory path can be adjusted in the lib/core_utils/common.py script if needed.
config.json: This file contains global settings for Yggdrasil.
Fields:
- yggdrasil_log_dir: Directory where logs will be stored.
- couchdb_url: URL of the CouchDB server. Example: "http://localhost:5984"
- couchdb_database: Name of the CouchDB project database.
- couchdb_status_tracking: Name of the CouchDB yggdrasil database for project status tracking.
- couchdb_poll_interval: Interval (in seconds) for polling CouchDB for changes.
- job_monitor_poll_interval: Interval (in seconds) for polling the job monitor.
- activate_ngi_cmd: Command to activate the NGI environment.
Example Configuration File (config.json)
{
"yggdrasil_log_dir": "yggdrasil_workspace/logs",
"couchdb_url": "http://localhost:5984",
"couchdb_database": "my_project_db",
"couchdb_status_tracking": "my_ygg_status_db",
"couchdb_poll_interval": 3,
"job_monitor_poll_interval": 60,
"activate_ngi_cmd": "source sourceme_sthlm.sh && source activate NGI"
}
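As a rough illustration of how these settings might be consumed, the sketch below reads a config.json file and checks that the fields listed above are present. The helper name load_config and the validation step are assumptions made for this example; they are not Yggdrasil's actual configuration loader.

```python
import json
from pathlib import Path

# Field names taken from the "Fields" list above.
REQUIRED_FIELDS = (
    "yggdrasil_log_dir",
    "couchdb_url",
    "couchdb_database",
    "couchdb_status_tracking",
    "couchdb_poll_interval",
    "job_monitor_poll_interval",
    "activate_ngi_cmd",
)


def load_config(path):
    """Hypothetical helper: read a JSON config and report missing fields."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [name for name in REQUIRED_FIELDS if name not in config]
    if missing:
        raise KeyError(f"Missing configuration fields: {', '.join(missing)}")
    return config
```

Failing early on a missing field makes a misconfigured workspace easier to diagnose than an error that surfaces later during processing.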
module_registry.json: This file maps library construction methods to their respective processing modules. The modules specified here are dynamically loaded and executed based on the library_prep_method specified in the CouchDB document, matched either by its full name or by a designated prefix.
Example:
{
"SmartSeq 3": {
"module": "lib.realms.smartseq3.smartseq3.SmartSeq3"
},
"10X": {
"module": "lib.realms.tenx.tenx_project.TenXProject",
"prefix": true
}
}
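- SmartSeq 3:
  - module: The path to the module handling SmartSeq 3 library data.
- 10X:
  - module: The path to the module handling 10X-prefixed library data.
  - prefix: When true, the key is treated as a prefix of the library_prep_method rather than its full name.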
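The lookup described above can be sketched roughly as follows. This is an illustration only: the function names resolve_module_path and load_class are hypothetical, and trying an exact match before prefix matches is an assumption, not a confirmed detail of Yggdrasil's implementation.

```python
import importlib


def resolve_module_path(registry, library_prep_method):
    """Return the dotted class path registered for a library prep method.

    Assumption: exact names are tried first; entries flagged with
    "prefix": true then match any method that starts with the registry key.
    """
    entry = registry.get(library_prep_method)
    if entry is not None:
        return entry["module"]
    for key, candidate in registry.items():
        if candidate.get("prefix") and library_prep_method.startswith(key):
            return candidate["module"]
    return None


def load_class(dotted_path):
    """Import 'package.module.ClassName' and return the class object."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Same content as the example registry above, written as a Python dict.
registry = {
    "SmartSeq 3": {"module": "lib.realms.smartseq3.smartseq3.SmartSeq3"},
    "10X": {"module": "lib.realms.tenx.tenx_project.TenXProject", "prefix": True},
}

# "10X Chromium 3' GEX" is a made-up method name used only to show prefix matching.
print(resolve_module_path(registry, "10X Chromium 3' GEX"))
print(resolve_module_path(registry, "SmartSeq 3"))
```

Calling load_class on the resolved path would then import the module and return the class, provided the code runs from the repository root so that the lib package is importable.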
Ensure the following environment variables are set:
- COUCH_USER: Your CouchDB username.
- COUCH_PASS: Your CouchDB password.
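A small pre-flight check along these lines can confirm that both variables are set before a run starts. The helper get_couchdb_credentials is hypothetical and only illustrates the idea; it is not part of Yggdrasil.

```python
import os


def get_couchdb_credentials(environ=None):
    """Hypothetical helper: return (user, password) or raise if either is unset."""
    environ = os.environ if environ is None else environ
    missing = [
        name for name in ("COUCH_USER", "COUCH_PASS") if not environ.get(name)
    ]
    if missing:
        raise RuntimeError(f"Unset environment variables: {', '.join(missing)}")
    return environ["COUCH_USER"], environ["COUCH_PASS"]
```

Reading credentials from the environment keeps them out of config.json and out of version control.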
Yggdrasil uses a custom logging utility to manage logs. Logs are stored in the directory specified by the yggdrasil_log_dir configuration.
Enabling Debug Logging: To enable debug logging, modify the configure_logging call in your script:
from lib.utils.logging_utils import configure_logging
configure_logging(debug=True)
Ensure you have activated the Conda environment and installed all required packages as per the Installation section.
Use pre-commit to automate code formatting and linting on each commit.
- Install pre-commit hooks:
pre-commit install
- Run pre-commit hooks manually:
pre-commit run --all-files
Use black for code formatting, ruff for linting, and mypy for static type checking. It is recommended to also install these tools as extensions in your editor (e.g. VSCode) for a more seamless, automated experience. If you prefer running them manually from the command line:
- Format code with Black:
black .
- Lint code with Ruff:
ruff check .
- Run type checks:
mypy .
For an optimal development experience, we recommend using VSCode with the following extensions:
- Python (by Microsoft)
- Ruff (by Astral Software)
- Black Formatter (by Microsoft)
- Mypy Type Checker (by Microsoft)
VSCode Settings
Make sure your user settings.json contains the following settings to integrate the tools:
{
"editor.defaultFormatter": "ms-python.black-formatter",
"editor.formatOnSave": true,
"ruff.lint.args": [ "--config=pyproject.toml" ],
"mypy-type-checker.args": [ "--config-file=pyproject.toml" ]
}
To make git blame ignore bulk formatting commits:
- Configure Git:
git config blame.ignoreRevsFile .git-blame-ignore-revs
- Add Formatting Commits to .git-blame-ignore-revs:
Add the full commit hashes of your formatting commits to the .git-blame-ignore-revs file, one per line. The values below are placeholders, not real hashes:
a1b2c3d4e5f6g7h8i9j0k1l2m3n4o5p6q7r8s9t0
b1c2d3e4f5g6h7i8j9k0l1m2n3o4p5q6r7s8t9u0
GitHub Actions are set up to automatically run ruff, black, and mypy on pushes and pull requests.
- Workflow File: .github/workflows/lint.yml
- Separate Jobs: Each tool runs in its own job for clear feedback.
Contributions are very welcome! For as smooth an experience as possible, we recommend the following guidelines:
- Forking: Fork the main repository to your personal GitHub account. Develop your changes and submit pull requests to the main repository for review.
- Code Style: Format with black and lint with ruff.
- Type Annotations: If you use type annotations, make sure to set up (and pass) mypy checks.
- Documentation: Documented contributions are easier to understand and review.
Suggested contributions: Tests, Bug Fixes, Code Optimization, New Modules (reach out to Anastasios if you don't know where to start with developing a new module).
Yggdrasil is licensed under the MIT License - see the LICENSE file for details.