Update Docs (round 1) (#79)
* WIP docs update

* Add overview of dso tools

* Update README.md

* Update README

* Update README

* README table as pure markdown

* Include rendered docs snippets

* WIP getting_started section

* cosmetics

* Fix sphinx
grst authored Jan 13, 2025
1 parent c54243f commit 91466b9
Showing 7 changed files with 77 additions and 124 deletions.
139 changes: 16 additions & 123 deletions README.md
@@ -1,133 +1,26 @@
# DSO: data science operations

<img src="img/dso_kraken.jpg" alt="DSO Kraken" width="250" />

*DSO* is a command line helper for building reproducible data analysis projects with ease.
It builds on top of [dvc](https://github.com/iterative/dvc) for data versioning and provides project
templates, linting checks, hierarchical overlay of configuration files and integrates with quarto and jupyter notebooks.
_DSO_ is a command line helper for building reproducible data analysis projects with ease by connecting our favorite tools:
It builds on top of git and [dvc](https://github.com/iterative/dvc) for code and data versioning and provides project
templates, dependency management via [uv](https://docs.astral.sh/uv), linting checks, hierarchical overlay of configuration files and integrates with quarto and jupyter notebooks.

At Boehringer Ingelheim, we introduced DSO to meet the high quality standards required for biomarker analysis
in clinical trials. DSO is still under early development and we value community feedback.

## Getting started

### What is DVC?
[DVC](https://github.com/iterative/dvc) is like "git for data". It can version large data files and data directories alongside source code tracked with git. In addition to versioning files, dvc can be used to run analyses in a reproducible way by declaring input and output files as well as commands to be executed in a `dvc.yaml` configuration file. After executing an analysis, timestamps and checksums of all input and output files are stored in a `lock` file, providing a provenance record. Different analysis tasks are organized in *stages*. Since input and output files of each stage are declared, dvc can build a dependency graph of the stages to re-execute stages as appropriate if input data or preprocessing steps have been updated.
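
The dependency-graph idea described above can be illustrated with a small Python sketch. It is not how DVC is implemented: real DVC compares checksums recorded in its lock file, whereas this toy version is simply told which files changed. The stage names, file paths and the helpers `stage_graph` and `stages_to_rerun` are hypothetical and exist only for this illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each declares the files it reads (deps) and writes (outs).
stages = {
    "01_preprocessing": {"deps": ["raw/counts.csv"], "outs": ["clean/counts.csv"]},
    "02_qc": {"deps": ["clean/counts.csv"], "outs": ["qc/report.html"]},
    "03_model": {"deps": ["clean/counts.csv", "params.yaml"], "outs": ["model/fit.pkl"]},
}


def stage_graph(stages):
    """Map each stage to the set of stages that produce one of its inputs."""
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    return {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }


def stages_to_rerun(stages, changed_files):
    """Return the stages affected by the changed files, upstream stages first."""
    graph = stage_graph(stages)
    stale = []
    # static_order() yields every stage after all of its upstream stages.
    for name in TopologicalSorter(graph).static_order():
        changed_input = any(dep in changed_files for dep in stages[name]["deps"])
        stale_upstream = any(up in stale for up in graph[name])
        if changed_input or stale_upstream:
            stale.append(name)
    return stale


# Only the stage that reads params.yaml is affected.
print(stages_to_rerun(stages, {"params.yaml"}))  # ['03_model']
```

Changing `raw/counts.csv` instead would mark all three stages as stale, because both downstream stages consume the output of `01_preprocessing`.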

### Creating a project from a template
There are three types of DSO templates: projects, folders and stages. A *project* is the root of your project
and always a git repository at the same time. It can be created using `dso init`. A *stage* is an executable
step of your analysis (usually one script with defined inputs and outputs) organized in a folder. Stages
cannot be nested. A *folder* is used to organize stages in a hierarchical way within the project.

You can use `dso init` to create a new project:
```
$> dso init
Please enter the name of the project, e.g. "single_cell_lung_atlas": my_cool_project
Please add a short description of the project: This analysis solves *all* the problems!
```

Within a project, you can use `dso create` to initialize folders and stages from a predefined template:

```
$> dso create stage
? Choose a template: (Use arrow keys)
bash
» quarto
Please enter the name of the stage, e.g. "01_preprocessing": 02_quality_control
Please add a short description of the stage: Make a PCA to detect outliers
```

### How to write and use config files
The config files in a project, subfolder or stage are the cornerstone of any reproducible analysis because they minimise configuration errors across related scripts. Additionally, config files reduce the time needed to modify your scripts when changing settings such as p-value cutoffs, excluded samples, output directories, input data, and many more.

A config file of a project, subfolder, or stage contains all necessary parameters that should be consistent across the analyses. Therefore, changing parameters is done within the config files and not individually within an analysis script.

DSO uses two parameter files, `params.in.yaml` and `params.yaml`. `params.yaml` is an autogenerated YAML file containing all the parameters specified in `params.in.yaml` and in the other params.yaml files in its parent directories (see the figure below for an example of how this behaves in practice). `params.yaml` is compiled when running `dso compile-config`.

<img src="img/config.png" width="500" alt="Hierarchical configuration schema" />

```
$> dso compile-config
[08/22/24 20:53:43] INFO Detected /home/grst/my_cool_project as project root.
INFO Compiling a total of 2 config files.
INFO Configuration compiled successfully.
```
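
To make the hierarchical overlay more concrete, here is a minimal Python sketch of the merging idea. It assumes that settings closer to the stage override those from parent directories and that lists are replaced rather than concatenated; DSO's actual merge rules may differ. The helpers `overlay` and `compile_params` and all parameter names are hypothetical, and the sketch uses plain dictionaries rather than the YAML files and libraries DSO relies on.

```python
from copy import deepcopy


def overlay(base, override):
    """Recursively merge `override` onto `base`; values in `override` win."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them key by key.
            merged[key] = overlay(merged[key], value)
        else:
            # Scalars and lists are replaced as a whole (an assumption).
            merged[key] = deepcopy(value)
    return merged


def compile_params(layers):
    """Merge configs ordered from the project root down to the stage."""
    result = {}
    for layer in layers:
        result = overlay(result, layer)
    return result


# Hypothetical contents of three config files along one directory path.
project = {"thresholds": {"p_value": 0.05, "fold_change": 2.0}, "samples": {"exclude": []}}
folder = {"samples": {"exclude": ["S07"]}}
stage = {"thresholds": {"p_value": 0.01}}

params = compile_params([project, folder, stage])
print(params["thresholds"])  # {'p_value': 0.01, 'fold_change': 2.0}
print(params["samples"])     # {'exclude': ['S07']}
```

The stage only states what differs from its parents, so a project-wide change such as a new fold-change cutoff needs to be made in one place.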

### Linting checks

DSO provides linting checks that detect common errors in analysis projects. Right now only a few checks are implemented,
but more will be available in the future.
in clinical trials. DSO is under active development and we value community feedback.

To run the linting checks manually, execute
| <img src="img/dso_kraken.jpg" alt="DSO Kraken" width="700"> | <img src="img/dso_tools.png" alt="tools used by DSO"> |
| ----------------------------------------------------------- | ----------------------------------------------------- |

```
$> dso lint
[08/22/24 20:53:43] INFO Compiled a list of 22 to be linted
```

However, it is preferable to execute linting checks as pre-commit hooks and/or as continuous integration checks.
A `.pre-commit-config.yaml` comes with the DSO project template. Simply activate it using `pre-commit install`.

### Reproducing projects

To reproduce/execute all stages within a project, run

```
$> dso repro
```

This is a thin wrapper around `dvc repro` that compiles all configuration files beforehand.
DVC only re-runs stages defined in `dvc.yaml` whose inputs have changed. When the dependencies of an upstream stage have changed, that stage is re-run as well.


### Integration with quarto

DSO provides some additional tooling around quarto documents for generating reproducible reports. When you create a
quarto stage via `dso create stage --template quarto` you are all set to use this tooling:

* Render quarto stages to html via `dso exec quarto .`
* Inherit quarto configuration through the project from the `params.yaml` files. Quarto configuration can be placed in
`dso.quarto`, e.g.
```yaml
dso:
  quarto:
    author:
      - Jane Doe
    execute:
      warning: false
```
* Add a disclaimer box and watermarks to all plots (e.g. to mark them as drafts) by adding additional settings
```yaml
dso:
  quarto:
    watermark:
      text: DRAFT
    disclaimer:
      title: This document is a DRAFT
      text: Please do not share!
```
To access stage parameters and resolve file paths relative to the stage directory from within R, we provide the
companion package [`dso-r`](https://github.com/Boehringer-Ingelheim/dso-r) that provides the two functions
`read_params(stage_name)` and `stage_here(path)`.
## Getting started

Please refer to the documentation, in particular ...

## Installation

DSO requires Python 3.10 or later.

You can install the latest version with pip using
TODO

```bash
pip install dso-core
```
## Contact

Alternatively, you can install the development version from GitHub:

```bash
pip install git+https://github.com/Boehringer-Ingelheim/dso.git@main
```
Please use the [issue tracker](https://github.com/Boehringer-Ingelheim/dso/issues).

## Release notes

@@ -142,13 +35,13 @@ This program is distributed in the hope that it will be useful, but WITHOUT ANY
Additionally, the templates files used internally by `dso init` and `dso create` are distributed under the Creative Commons Zero v1.0
Universal license. See also the [separate LICENSE file](https://github.com/Boehringer-Ingelheim/dso/blob/main/src/dso/templates/LICENSE) in the `templates` directory.


## Credits

dso was initially developed by
* [Gregor Sturm](https://github.com/grst)
* [Tom Schwarzl](https://github.com/tschwarzl)
* [Daniel Schreyer](https://github.com/dschreyer)
* [Alexander Peltzer](https://github.com/apeltzer)

- [Gregor Sturm](https://github.com/grst)
- [Tom Schwarzl](https://github.com/tschwarzl)
- [Daniel Schreyer](https://github.com/dschreyer)
- [Alexander Peltzer](https://github.com/apeltzer)

DSO depends on many great open source projects, most notably [dvc](https://github.com/iterative/dvc), [hiyapyco](https://github.com/zerwes/hiyapyco) and [jinja2](https://jinja.palletsprojects.com/).
2 changes: 2 additions & 0 deletions docs/.gitignore
@@ -0,0 +1,2 @@
# stuff generated by sphinxcontrib-programoutput
/test_project
1 change: 1 addition & 0 deletions docs/conf.py
@@ -58,6 +58,7 @@
"sphinx.ext.mathjax",
"IPython.sphinxext.ipython_console_highlighting",
"sphinxext.opengraph",
"sphinxcontrib.programoutput",
*[p.stem for p in (HERE / "extensions").glob("*.py")],
]

58 changes: 57 additions & 1 deletion docs/getting_started.md
@@ -1,3 +1,59 @@
# Getting started

TODO
## `dso init` -- Initialize a project

`dso init` initializes a new project in your current directory.

```{command-output} dso init test_project --description "This is a test project"
```

It creates the root directory of your project with all the necessary configuration files for `git`, `dvc`, `uv` and
`dso` itself:

```{command-output} ls -a test_project
```

## `dso create` -- Add folders or stages to your project

We consider a _stage_ an individual step in your analysis, usually a script with defined inputs and outputs.
Stages can be organized in _folders_ with arbitrary structures. `dso create` initializes folders and stages
from predefined templates. We recommend naming stages with a numeric prefix, e.g. `01_` to declare the
order of scripts, but this is not a requirement.

```bash
cd test_project

# Let's create a folder that we'll use to organize all analysis steps related to "RNA-seq"
dso create folder RNA_seq

# Let's create a first stage for pre-processing
cd RNA_seq
dso create stage 01_preprocessing --template bash --description "Run nf-core/rnaseq"

# Let's create a second stage for quality control
dso create stage 02_qc --template quarto --description "Perform RNA-seq quality control"
```

Stages have the following pre-defined folder-structure. This folder system aims to make the structure coherent throughout a project for easy readability and navigation. Additional folders can still be added if necessary.

```text
stage
|-- input # contains Input Data
|-- src # contains Analysis Script(s)
|-- output # contains TLF - Outputs generated by Analysis Scripts
|-- report # contains HTML Report generated by Analysis Scripts
```
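
The layout above can also be expressed in a few lines of Python, for example to check that a stage still contains the expected subfolders. This is only an illustration of the folder structure, not what `dso create` does internally; the helpers `make_stage_skeleton` and `missing_subfolders` are hypothetical.

```python
from pathlib import Path
import tempfile

# Subfolders of a stage, as listed in the layout above.
STAGE_SUBFOLDERS = ("input", "src", "output", "report")


def make_stage_skeleton(parent, stage_name):
    """Create an empty stage directory with the standard subfolders."""
    stage_dir = Path(parent) / stage_name
    for sub in STAGE_SUBFOLDERS:
        (stage_dir / sub).mkdir(parents=True, exist_ok=True)
    return stage_dir


def missing_subfolders(stage_dir):
    """List the standard subfolders that do not exist in `stage_dir`."""
    return [sub for sub in STAGE_SUBFOLDERS if not (Path(stage_dir) / sub).is_dir()]


# Work in a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    stage = make_stage_skeleton(tmp, "01_preprocessing")
    print(sorted(p.name for p in stage.iterdir()))  # ['input', 'output', 'report', 'src']
    print(missing_subfolders(stage))                # []
```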

## Writing configuration files

## Implementing a stage

### R

### Python

## `dso repro` -- Reproducing all stages

## Syncing changes with a remote
Binary file added img/dso_tools.png
Binary file added img/dso_tools.pptx
Binary file not shown.
1 change: 1 addition & 0 deletions pyproject.toml
@@ -52,6 +52,7 @@ optional-dependencies.doc = [
"sphinx-book-theme>=1",
"sphinx-copybutton",
"sphinxcontrib-bibtex>=1",
"sphinxcontrib-programoutput>=0.18",
"sphinxext-opengraph",
]
optional-dependencies.test = [