[MRG] support local genome collections (including private genomes) (#130)

* rename rule to sourmash_prefetch_wc

* start using {outdir}/genomes/

* swizzle up config to allow private_databases and genbank_databases, etc.

* more progress: copying private genomes around

* combine listing private and genbank genomes - seems to work!

* simplify the ListGenomes stuff

* it's aliiiiiiiive

* remove genbank accession requirement

* remove genbank from most filenames, rules

* rename minimap to mapping; add clean_gather

* updated to properly (?) use checkpoints throughout

* tests pass locally

* fix typo

* add the beginnings of testing for private databases

* getting started

* update all the things

* [MRG] Change column names in intermediate CSVs. (#133)

* change column names

* remove old notebooks

* fix mistake

* comments etc.

* remove glob pattern, configure genbank_cache

* remove 'process' command

* check for old config file params

* add important comment

* actually remove 'process'

* check for 'database_taxonomy' instead of 'taxonomies'

* add trailing / in Makefile

* add default taxonomies file to system.conf

* fix test files

* fix conf-private.yml

* start of doc/ subdirectory

* initial commit

* add badge

* compleat first draft

* minor corrections

* spell check

* add picklists into the config (#136)

* fix 'taxonomies' in test config; check that it's a list

* add comment

* swipe from #97

* swipe getting started from #97

* update!

* Apply suggestions from Taylor's docs review

Co-authored-by: Taylor Reiter <[email protected]>

* more update in re taylor's suggestions

* more more update

* even more update

* more update

* more update

* fix help output for CLI

* configure mkdocs

* clean it out

* update gitignore

* add some figures

* upd

* more figure adjustment

* add badges

* simplify to single sourmash_databases; use 'local' instead of 'private'

* update to 'local' instead of 'private'

* fix extra backquote

* more fix?

* fix formatting

* add tax test

* add test for picklist

* switch SRR5950647_subset over to use local_databases_info 🎉

* cleanup & commenting

* add missing file

Co-authored-by: HackMD <[email protected]>
Co-authored-by: Taylor Reiter <[email protected]>
3 people authored Jan 17, 2022
1 parent 65acc69 commit cb7421a
Showing 42 changed files with 1,347 additions and 3,078 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -11,3 +11,8 @@ dist/
genome_grist.egg-info/
genome_grist/version.py
outputs.*
genbank_cache
*.yml
site
.DS_Store
bak
44 changes: 38 additions & 6 deletions Makefile
@@ -1,5 +1,11 @@
all: clean-test test

flakes:
flake8 --ignore=E501 genome_grist/ tests/

black:
black .

clean-test:
rm -fr outputs.test/

@@ -8,15 +14,41 @@ test:
genome-grist run tests/test-data/SRR5950647.conf summarize_mapping summarize_tax make_sgc_conf -j 8 -p

# try various targets to make sure they work
genome-grist run tests/test-data/SRR5950647.conf download_matching_genomes -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf download_matching_genomes_info -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf download_genbank_genomes -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf combine_genome_info -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf retrieve_genomes -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf estimate_distinct_kmers -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf count_trimmed_reads -j 8 -p
genome-grist run tests/test-data/SRR5950647.conf summarize_sample_info -j 8 -p

### private/local genomes test stuff

flakes:
flake8 --ignore=E501 genome_grist/ tests/
test-private: outputs.private/abundtrim/podar.abundtrim.fq.gz \
databases/podar-ref.zip databases/podar-ref.info.csv \
databases/podar-ref.tax.csv
genome-grist run conf-private.yml summarize_gather summarize_mapping summarize_tax -j 4 -p

black:
black .
# download the (subsampled) reads for SRR606249
outputs.private/abundtrim/podar.abundtrim.fq.gz:
mkdir -p outputs.private/abundtrim
curl -L https://osf.io/ckbq3/download -o outputs.private/abundtrim/podar.abundtrim.fq.gz

# download the ref genomes
databases/podar-ref/:
mkdir -p databases/podar-ref
curl -L https://osf.io/vbhy5/download -o databases/podar-ref.tar.gz
cd databases/podar-ref/ && tar xzf ../podar-ref.tar.gz

# sketch the ref genomes
databases/podar-ref.zip: databases/podar-ref/
sourmash sketch dna -p k=31,scaled=1000 --name-from-first \
databases/podar-ref/*.fa -o databases/podar-ref.zip

# download taxonomy
databases/podar-ref.tax.csv:
curl -L https://osf.io/4yhjw/download -o databases/podar-ref.tax.csv

# create info file and genomes directory:
databases/podar-ref.info.csv:
python -m genome_grist.copy_local_genomes databases/podar-ref/*.fa -o databases/podar-ref.info.csv -d databases/podar-ref.d
python -m genome_grist.make_info_file databases/podar-ref.info.csv
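
The `test-private` target above depends on four files that the other rules download or build. As a rough illustration only, the sketch below lists those prerequisites (copied from the Makefile) and reports which are still missing, which is roughly the check `make` performs before running the recipe. The function name `missing_prereqs` is hypothetical and not part of genome-grist.

```python
from pathlib import Path

# Prerequisites of the `test-private` target, copied from the Makefile above.
TEST_PRIVATE_PREREQS = [
    "outputs.private/abundtrim/podar.abundtrim.fq.gz",
    "databases/podar-ref.zip",
    "databases/podar-ref.info.csv",
    "databases/podar-ref.tax.csv",
]


def missing_prereqs(root=".", prereqs=TEST_PRIVATE_PREREQS):
    """Return the prerequisite paths that do not yet exist under `root`.

    Hypothetical helper for illustration; `make` itself also compares
    timestamps, which this sketch does not attempt.
    """
    base = Path(root)
    return [p for p in prereqs if not (base / p).exists()]


if __name__ == "__main__":
    for path in missing_prereqs():
        print(f"missing: {path}")
```

Running this from the repository root before `make test-private` would show which downloads or sketching steps have not happened yet.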
163 changes: 17 additions & 146 deletions README.md
@@ -1,156 +1,27 @@
# genome-grist: a quickstart tutorial.
# genome-grist README

This quickstart tutorial will take about 30 minutes to run, and
requires 5 GB of disk space and 4 GB of RAM, as well as a fairly
good Internet connection.
<!-- CTB: this is /README.md in dib-lab/genome-grist -->

## What is genome-grist?
<a href="https://pypi.org/project/genome-grist/"><img alt="PyPI" src="https://badge.fury.io/py/genome-grist.svg"></a>
<img alt="License: 3-Clause BSD" src="https://img.shields.io/badge/License-BSD%203--Clause-blue.svg">

genome-grist is software that automates a number of tedious metagenome tasks related to reference-based analyses on Illumina metagenomes. Specifically, genome-grist will download public metagenomes from the SRA, preprocess them, and use `sourmash gather` to identify reference genomes for the metagenome. It will then download the reference genomes, map reads to them, and summarize the mapping.
genome-grist analyzes the strain composition of microbial metagenomes
using
[minimum metagenome covers](https://dib-lab.github.io/2020-paper-sourmash-gather/)
and produces a variety of compositional and taxonomic summaries.

## Installing genome-grist
Check out the
[quick start!](https://dib-lab.github.io/genome-grist/quickstart/) And
please also see
[the rest of the docs](https://dib-lab.github.io/genome-grist/) for
more information!

We suggest installing in an isolated conda environment. The following will create a new environment, activate it, and install the latest version of genome-grist from PyPI (which is <a href="https://pypi.org/project/genome-grist/"><img alt="PyPI" src="https://badge.fury.io/py/genome-grist.svg"></a>).
## Example: the strain composition of a gut microbiome (iHMP)

```
conda create -y -n grist python=3.8 pip
conda activate grist
python -m pip install genome-grist
```
## Running genome-grist
This figure was autogenerated by genome-grist.

We currently recommend running genome-grist in its own directory, for several reasons that include software installation (genome-grist uses snakemake and conda to install software under this directory).

Within the current working directory, genome-grist will create an `inputs` subdir, a `genbank_genomes` subdir, and any `outputs.NAME` subdirectories required by the configuration; it should be straightforward to keep projects separate by configuring the output directories appropriately.

So, create a subdirectory and change into it:
```shell
mkdir grist/
cd grist/
```
Note, genome-grist does not rely on the directory name or location in any way; it works entirely within the current working directory.

### Download a small example database

Download the GTDB release 95 set of ~32k guide genomes, in a pre-prepared sourmash database format:
```
curl -L https://osf.io/4n3m5/download -o gtdb-r95.nucleotide-k31-scaled1000.sbt.zip
```
(Any sourmash database will do as long as the sequences are named so that the full GenBank accession is the first field in the name.)

### Make a configuration file

Put the following in a config file named `conf-tutorial.yml`:
```
sample:
- SRR5950647
outdir: outputs.tutorial/
metagenome_trim_memory: 1e9
sourmash_database_glob_pattern: gtdb-r95.nucleotide-k31-scaled1000.sbt.zip
```

Notes:
* you can put multiple sample IDs here, in a [YAML array format](https://www.cloudbees.com/blog/yaml-tutorial-everything-you-need-get-started/) - put each one on a new line after a dash (`-`).
* if you have multiple databases you can specify them here with an appropriate wild card pattern, e.g. `db/*` will work.
* if you are running this on the farm HPC at UC Davis, you can search all of genbank by *omitting* the database configuration line. Currently these files are not yet publicly available, which is why this tutorial uses GTDB instead.

### Do your first real run!

Execute:
```
genome-grist run conf-tutorial.yml summarize_mapping
```

This will perform the following steps:
* download the [HSMA33MX metagenome](https://www.ncbi.nlm.nih.gov/sra/?term=HSMA33MX) from the Sequence Read Archive (target `download_reads`).
* preprocess it to remove adapters and low-abundance k-mers (target `trim_reads`).
* build a sourmash signature from the preprocessed reads (target `smash_reads`).
* perform a `sourmash gather` against the specified database (target `gather_genbank`).
* download the matching genomes from GenBank into `genbank_genomes/` (target `download_matching_genomes`).
* map the metagenome reads to the various genomes (target `map_reads`).
* produce a summary notebook (target `summarize_mapping`).

## Output files

The key output files under the outputs directory are:

* `genbank/{sample}.x.genbank.gather.out` - human-readable output from [sourmash gather](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
* `genbank/{sample}.x.genbank.gather.csv` - [sourmash gather CSV output](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
* `genbank/{sample}.genomes.info.csv` - information about the matching genomes from genbank.
* `reports/report-{sample}.html` - a summary report.
* `abundtrim/{sample}.abundtrim.fq.gz` - trimmed and preprocessed reads.
* `sigs/HSMA33MX.abundtrim.sig` - sourmash signature for the preprocessed reads.

Note that `genome-grist run <config.yml> zip` will create a file named `transfer.zip` with the above files in it.

## Where to insert your own files

genome-grist is built on top of [the snakemake workflow](https://snakemake.readthedocs.io/en/stable/), which lets you substitute your own files in many places.

For example,
* you can put your own `SAMPLE_1.fastq.gz`, `SAMPLE_2.fastq.gz`, and `SAMPLE_unpaired.fastq.gz` files in `raw/` to have genome-grist process reads for you.
* you can put your own interleaved reads file in `abundtrim/SAMPLE.abundtrim.fq.gz` to run genome-grist on a private or preprocessed set of reads;
* you can put your own sourmash signature (k=31, scaled=1000) in `sigs/SAMPLE.abundtrim.sig` if you want to have it do the database search for you;

Please see [the genome-grist Snakefile](https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile) for all the gory details.
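
As a small illustration of the substitution points listed above, the sketch below builds the file names for a given sample. The path patterns come straight from the list; the function name `substitutable_inputs` is hypothetical, and the text does not say which directory these paths are relative to, so treat them as relative to wherever the workflow expects them (the Snakefile is the authority).

```python
def substitutable_inputs(sample):
    """Map each entry point named in the list above to the file(s) to supply.

    Hypothetical helper: it only formats the path patterns quoted in the
    README text and does not touch the filesystem.
    """
    return {
        # paired and unpaired raw reads, to have genome-grist process them
        "raw reads": [
            f"raw/{sample}_1.fastq.gz",
            f"raw/{sample}_2.fastq.gz",
            f"raw/{sample}_unpaired.fastq.gz",
        ],
        # a single interleaved, already-preprocessed reads file
        "preprocessed reads": [f"abundtrim/{sample}.abundtrim.fq.gz"],
        # a sourmash signature (k=31, scaled=1000) for the database search
        "signature": [f"sigs/{sample}.abundtrim.sig"],
    }


if __name__ == "__main__":
    for stage, paths in substitutable_inputs("podar").items():
        print(stage, "->", ", ".join(paths))
```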

## Additional targets

Recommended targets:

* summarize_gather - produce summary reports on metagenome composition
* summarize_tax - produce summary reports on taxonomic composition
* summarize_mapping - produce summary reports on k-mer and read mapping

Note, 'summarize_mapping' includes 'summarize_gather'; reports will be
in {{outdir}}/reports, where 'outdir' is specified in the config file.

Additional intermediate targets:

* download_reads - download SRA metagenomes specified in conf file
* trim_reads - do basic read trimming/adapter removal for metagenome reads
* smash_reads - create sourmash signatures from metagenome reads
* summarize_sample_info - build an info.yaml summary file for each metagenome
* gather_genbank - run 'sourmash gather' on metagenomes against Genbank
* download_matching_genomes - download all matching Genbank genomes
* map_reads - map all metagenome reads to Genbank genomes
* make_sgc_conf - make a spacegraphcats config file

## Other information

### Resource requirements

**Disk space:** genome-grist makes about 4-5 copies of each SRA metagenome.

**Memory:** the genbank search step on all of genbank takes ~120 GB of RAM. On GTDB, it's much, much less. Other than that, the other steps are all under 10 GB of RAM (unless you adjust `metagenome_trim_memory` upwards, which may be needed for complex metagenomes).

**Time:** This depends largely on the size of the metagenome; 100 million reads typically take less than a day or two. Multiple data sets can be processed in parallel with `-j`, although you probably want to specify resource limits. For example, here is the command that Titus uses on farm:
```
genome-grist run <config> -k --resources mem_mb=145000 -j 16
```
to run in 150GB of RAM, which will run at most one genbank search at a time.
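
The "at most one genbank search at a time" claim follows from simple arithmetic, sketched below. This assumes the search job declares roughly 120 GB (taken here as 120,000 MB) as its `mem_mb` resource, matching the memory figure quoted above; the actual value set in the workflow is not shown in this text, and the function name is hypothetical.

```python
def max_concurrent_jobs(mem_mb_limit, mem_mb_per_job, job_slots):
    """How many jobs of one kind can run at once under a memory budget.

    Snakemake keeps the summed `mem_mb` of running jobs at or below the
    `--resources mem_mb=...` limit and never exceeds the `-j` slot count.
    """
    return min(job_slots, mem_mb_limit // mem_mb_per_job)


if __name__ == "__main__":
    # --resources mem_mb=145000 -j 16, with an assumed ~120 GB per search
    print(max_concurrent_jobs(145_000, 120_000, 16))  # prints 1
```

Lighter steps (assumed here to need under 10 GB each, per the memory note above) can still fill the remaining budget alongside that single search.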

### Installing unreleased versions.

You can run genome-grist from a git checkout directory by using pip to install it in editable mode:
```
pip install -e .
```

### Support

We like to support our software!

That having been said, genome-grist is early-stage beta-level software. Please be patient and kind :).

Please ask questions and add comments [on the github issue tracker for genome-grist](https://github.com/dib-lab/genome-grist/issues).

## Why the name `grist`?

'grist' is in the sourmash family of names (sourmash, wort, distillerycats, etc.) See [Grist in Wikipedia](https://en.wikipedia.org/wiki/Grist).

(It is not the [computing grist](https://en.wikipedia.org/wiki/Grist_(computing))!)
![an example image made with genome-grist](doc/gather-vs-mapping.png)

---

[CTB](https://twitter.com/ctitusbrown/) Jan 27, 2021
[CTB](https://twitter.com/ctitusbrown/) 01/22
13 changes: 13 additions & 0 deletions conf-private.yml
@@ -0,0 +1,13 @@
samples:
- podar

outdir: outputs.private/

sourmash_databases:
- databases/podar-ref.zip

local_databases_info:
- databases/podar-ref.info.csv

taxonomies:
- databases/podar-ref.tax.csv
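
The commit message mentions checking for old config file parameters and checking that `taxonomies` is a list. The sketch below shows one way such a check could look for a config shaped like `conf-private.yml`; it is not the code from this commit. The function name `check_config`, the choice of `sourmash_database_glob_pattern` as the old parameter (it appears in the previous README and the commit says the glob pattern was removed), and extending the list check to the other list-valued keys are all assumptions. The config is written as a plain dict to avoid depending on a YAML parser.

```python
# Keys that conf-private.yml gives as YAML lists (assumed to all require lists).
LIST_KEYS = ("samples", "sourmash_databases", "local_databases_info", "taxonomies")

# Parameter from the previous README that this commit appears to retire (assumption).
OLD_KEYS = ("sourmash_database_glob_pattern",)


def check_config(config):
    """Return a list of human-readable problems found in `config`.

    Hypothetical validator for illustration; an empty list means no
    problems were detected by these two checks.
    """
    problems = []
    for key in OLD_KEYS:
        if key in config:
            problems.append(f"old parameter '{key}' is no longer used")
    for key in LIST_KEYS:
        if key in config and not isinstance(config[key], list):
            problems.append(f"'{key}' should be a list")
    return problems


# The contents of conf-private.yml, as a dict.
conf_private = {
    "samples": ["podar"],
    "outdir": "outputs.private/",
    "sourmash_databases": ["databases/podar-ref.zip"],
    "local_databases_info": ["databases/podar-ref.info.csv"],
    "taxonomies": ["databases/podar-ref.tax.csv"],
}

if __name__ == "__main__":
    for problem in check_config(conf_private):
        print(problem)
```

A config that still used the old tutorial form, for example `{"sourmash_database_glob_pattern": "db/*"}`, would be flagged by the first check.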