[MRG] support local genome collections (including private genomes) (#130)

* rename rule to sourmash_prefetch_wc
* start using {outdir}/genomes/
* swizzle up config to allowprivate_databases and genbank_databases, etc.
* more progress: copying private genomes around
* combine listing private and genbank genomes - seems to work\!
* simplify the ListGenomes stuff
* it's aliiiiiiiive
* remove genbank accession requirement
* remove genbank from most filenames, rules
* rename minimap to mapping; add clean_gather
* updated to properly (?) use checkpoints throughout
* tests pass locally
* fix typo
* add the beginnings of testing for private databases
* getting started
* update all the things
* [MRG] Change column names in intermediate CSVs. (#133)
* change column names
* remove old notebooks
* fix mistake
* comments etc.
* remove glob pattern, configure genbank_cache
* remove 'process' command
* check for old config file params
* add important comment
* actually remove 'process'
* check for 'database_taxonomy' instead of 'taxonomies'
* add trailing / in Makefile
* add default taxonomies file to system.conf
* fix test files
* fix conf-private.yml
* start of doc/ subdirectory
* initial commit
* add badge
* compleat first draft
* minor corrections
* spell check
* add picklists into the config (#136)
* fix 'taxonomies' in test config; check that it's a list
* add comment
* swipe from #97
* swipe getting started from #97
* update!
* Apply suggestions from Taylor's docs review

  Co-authored-by: Taylor Reiter <[email protected]>
* more update in re taylor's suggestions
* more more update
* even more update
* more update
* more update
* fix help output for CLI
* configure mkdocs
* clean it out
* update gitignore
* add some figures
* upd
* more figure adjustment
* add badges
* simplify to single sourmash_dtabases; use 'local' instead of 'private'
* update to 'local' instead of 'private'
* fix extra backquote
* more fix?
* fix formatting
* add tax test
* add test for picklist
* switch SRR5950647_subset over to use local_databses_info 🎉
* cleanup & commenting
* add missing file

Co-authored-by: HackMD <[email protected]>
Co-authored-by: Taylor Reiter <[email protected]>
1 parent 65acc69, commit cb7421a

Showing 42 changed files with 1,347 additions and 3,078 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -11,3 +11,8 @@ dist/ | |
genome_grist.egg-info/ | ||
genome_grist/version.py | ||
outputs.* | ||
genbank_cache | ||
*.yml | ||
site | ||
.DS_Store | ||
bak |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,156 +1,27 @@ | ||
# genome-grist: a quickstart tutorial. | ||
# genome-grist README | ||
|
||
This quickstart tutorial will take about 30 minutes to run, and | ||
requires 5 GB of disk space and 4 GB of RAM, as well as a fairly | ||
good Internet connection. | ||
<!-- CTB: this is /README.md in dib-lab/genome-grist --> | ||
|
||
## What is genome-grist? | ||
<a href="https://pypi.org/project/genome-grist/"><img alt="PyPI" src="https://badge.fury.io/py/genome-grist.svg"></a> | ||
<img alt="License: 3-Clause BSD" src="https://img.shields.io/badge/License-BSD%203--Clause-blue.svg"> | ||
|
||
genome-grist is software that automates a number of tedious metagenome tasks related to reference-based analyses on Illumina metagenomes. Specifically, genome-grist will download public metagenomes from the SRA, preprocess them, and use `sourmash gather` to identify reference genomes for the metagenome. It will then download the reference genomes, map reads to them, and summarize the mapping. | ||
genome-grist analyzes the strain composition of microbial metagenomes | ||
using | ||
[minimum metagenome covers](https://dib-lab.github.io/2020-paper-sourmash-gather/) | ||
and produces a variety of compositional and taxonomic summaries. | ||
|
||
## Installing genome-grist | ||
Check out the | ||
[quick start!](https://dib-lab.github.io/genome-grist/quickstart/) And | ||
please also see | ||
[the rest of the docs](https://dib-lab.github.io/genome-grist/) for | ||
more information! | ||
|
||
We suggest installing in an isolated conda environment. The following will create a new environment, activate it, and install the latest version of genome-grist from PyPI (which is <a href="https://pypi.org/project/genome-grist/"><img alt="PyPI" src="https://badge.fury.io/py/genome-grist.svg"></a>). | ||
## Example: the strain composition of a gut microbiome (iHMP) | ||
|
||
``` | ||
conda create -y -n grist python=3.8 pip | ||
conda activate grist | ||
python -m pip install genome-grist | ||
``` | ||
## Running genome-grist | ||
This figure was autogenerated by genome-grist. | ||
|
||
We currently recommend running genome-grist in its own directory, for several reasons that include software installation (genome-grist uses snakemake and conda to install software under this directory). | ||
|
||
Within the current working directory, genome-grist will create an `inputs` subdir, a `genbank_genomes` subdir, and any `outputs.NAME` subdirectories required by the configuration; it should be straightforward to keep projects separate by configuring the output directories appropriately. | ||
|
||
So, create a subdirectory and change into it: | ||
```shell | ||
mkdir grist/ | ||
cd grist/ | ||
``` | ||
Note, genome-grist does not rely on the directory name or location in any way; it works entirely within the current working directory. | ||
|
||
### Download a small example database | ||
|
||
Download the GTDB release 95 set of ~32k guide genomes, in a pre-prepared sourmash database format: | ||
``` | ||
curl -L https://osf.io/4n3m5/download -o gtdb-r95.nucleotide-k31-scaled1000.sbt.zip | ||
``` | ||
(Any sourmash database will do as long as the sequences are named so that the full GenBank accession is the first field in the name.) | ||
|
||
### Make a configuration file | ||
|
||
Put the following in a config file named `conf-tutorial.yml`: | ||
``` | ||
sample: | ||
- SRR5950647 | ||
outdir: outputs.tutorial/ | ||
metagenome_trim_memory: 1e9 | ||
sourmash_database_glob_pattern: gtdb-r95.nucleotide-k31-scaled1000.sbt.zip | ||
``` | ||
|
||
Notes: | ||
* you can put multiple samples IDs here, in a [YAML array format](https://www.cloudbees.com/blog/yaml-tutorial-everything-you-need-get-started/) - put them on a new line after a dash (`-`). | ||
* if you have multiple databases you can specify them here with an appropriate wild card pattern, e.g. `db/*` will work. | ||
* if you are running this on the farm HPC at UC Davis, you can search all of genbank by *omitting* the database configuration line. Currently these files are not yet publicly available, which is why this tutorial uses GTDB instead. | ||
|
||
### Do your first real run! | ||
|
||
Execute: | ||
``` | ||
genome-grist run conf-tutorial.yml summarize_mapping | ||
``` | ||
|
||
This will perform the following steps: | ||
* download the [HSMA33MX metagenome](https://www.ncbi.nlm.nih.gov/sra/?term=HSMA33MX) from the Sequence Read Archive (target `download_reads`). | ||
* preprocess it to remove adapters and low-abundance k-mers (target `trim_reads`). | ||
* build a sourmash signature from the preprocess reads. (target `smash_reads`). | ||
* perform a `sourmash gather` against the specified database (target `gather_genbank`). | ||
* download the matching genomes from GenBank into `genbank_genomes/` (target `download_matching_genomes`). | ||
* map the metagenome reads to the various genomes (target `map_reads`). | ||
* produce a summary notebook (target `summarize_mapping`). | ||
|
||
## Output files | ||
|
||
The key output files under the outputs directory are: | ||
|
||
* `genbank/{sample}.x.genbank.gather.out` - human-readable output from [sourmash gather](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html). | ||
* `genbank/{sample}.x.genbank.gather.csv` - [sourmash gather CSV output](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html). | ||
* `genbank/{sample}.genomes.info.csv` - information about the matching genomes from genbank. | ||
* `reports/report-{sample}.html` - a summary report. | ||
* `abundtrim/{sample}.abundtrim.fq.gz` - trimmed and preprocessed reads. | ||
* `sigs/HSMA33MX.abundtrim.sig` - sourmash signature for the preprocessed reads. | ||
|
||
Note that `genome-grist run <config.yml> zip` will create a file named `transfer.zip` with the above files in it. | ||
|
||
## Where to insert your own files | ||
|
||
genome-grist is built on top of [the snakemake workflow](https://snakemake.readthedocs.io/en/stable/), which lets you substitute your own files in many places. | ||
|
||
For example, | ||
* you can put your own `SAMPLE_1.fastq.gz`, `SAMPLE_2.fastq.gz`, and `SAMPLE_unpaired.fastq.gz` files in `raw/` to have genome-grist process reads for you. | ||
* you can put your own interleaved reads file in `abundtrim/SAMPLE.abundtrim.fq.gz` to run genome-grist on a private or preprocessed set of reads; | ||
* you can put your own sourmash signature (k=31, scaled=1000) in `sigs/SAMPLE.abundtrim.sig` if you want to have it do the database search for you; | ||
|
||
Please see [the genome-grist Snakefile](https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile) for all the gory details. | ||
|
||
## Additional targets | ||
|
||
Recommended targets: | ||
|
||
* summarize_gather - produce summary reports on metagenome composition | ||
* summarize_tax - produce summary reports on taxonomic composition | ||
* summarize_mapping - produce summary reports on k-mer and read mapping | ||
|
||
Note, 'summarize_mapping' includes 'summarize_gather'; reports will be | ||
in {{outdir}}/reports, where 'outdir' is specified in the config file. | ||
|
||
Additional intermediate targets: | ||
|
||
* download_reads - download SRA metagenomes specified in conf file | ||
* trim_reads - do basic read trimming/adapter removal for metagenome reads | ||
* smash_reads - create sourmash signatures from metagenome reads | ||
* summarize_sample_info - build a info.yaml summary file for each metagenome | ||
* gather_genbank - run 'sourmash gather' on metagenomes against Genbank | ||
* download_matching_genomes - download all matching Genbank genomes | ||
* map_reads - map all metagenome reads to Genbank genomes | ||
* make_sgc_conf - make a spacegraphcats config file | ||
|
||
## Other information | ||
|
||
### Resource requirements | ||
|
||
**Disk space:** genome-grist makes about 4-5 copies of each SRA metagenome. | ||
|
||
**Memory:** the genbank search step on all of genbank takes ~120 GB of RAM. On GTDB, it's much, much less. Other than that, the other steps are all under 10 GB of RAM (unless you adjust `metagenome_trim_memory` upwards, which may be needed for complex metagenomes). | ||
|
||
**Time:** This is largely dependent on the size of the metagenome; 100m reads takes less than a day or two, typically. The processing of multiple data sets can be done in parallel with `-j`, as well, although you probably want to specify resource limits. For example, here is the command that Titus uses on farm: | ||
``` | ||
genome-grist run <config> -k --resources mem_mb=145000 -j 16 | ||
``` | ||
to run in 150GB of RAM, which will run at most one genbank search at a time. | ||
|
||
### Installing unreleased versions. | ||
|
||
You can run genome-grist from a git checkout directory by using pip to install it in editable mode: | ||
``` | ||
pip install -e . | ||
``` | ||
|
||
### Support | ||
|
||
We like to support our software! | ||
|
||
That having been said, genome-grist is early-stage beta-level software. Please be patient and kind :). | ||
|
||
Please ask questions and add comments [on the github issue tracker for genome-grist](https://github.com/dib-lab/genome-grist/issues). | ||
|
||
## Why the name `grist`? | ||
|
||
'grist' is in the sourmash family of names (sourmash, wort, distillerycats, etc.) See [Grist in Wikipedia](https://en.wikipedia.org/wiki/Grist). | ||
|
||
(It is not the [computing grist](https://en.wikipedia.org/wiki/Grist_(computing))!) | ||
![an example image made with genome-grist](doc/gather-vs-mapping.png) | ||
|
||
--- | ||
|
||
[CTB](https://twitter.com/ctitusbrown/) Jan 27, 2021 | ||
[CTB](https://twitter.com/ctitusbrown/) 01/22 |
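
The quickstart text removed in this hunk shows two `genome-grist run` invocations: one with a target (`summarize_mapping`) and one with the resource flags `-k --resources mem_mb=145000 -j 16`. As a rough illustration only, the Python sketch below assembles such a command line from its parts. The helper name `grist_command` and its parameters are hypothetical and not part of genome-grist; only the subcommand and the flags are taken from the README text, and placing targets directly after the config file is an assumption based on the first example.

```python
import shlex


def grist_command(config, targets=(), keep_going=False, mem_mb=None, jobs=None):
    """Build an argv list for `genome-grist run` (hypothetical helper)."""
    argv = ["genome-grist", "run", config]
    # Targets go right after the config file, as in
    # `genome-grist run conf-tutorial.yml summarize_mapping`.
    argv += list(targets)
    if keep_going:
        argv.append("-k")
    if mem_mb is not None:
        # Resource limit in the form shown in the README: mem_mb=145000
        argv += ["--resources", f"mem_mb={mem_mb}"]
    if jobs is not None:
        argv += ["-j", str(jobs)]
    return argv


tutorial = grist_command("conf-tutorial.yml", targets=["summarize_mapping"])
print(shlex.join(tutorial))

farm = grist_command("conf-tutorial.yml", keep_going=True, mem_mb=145000, jobs=16)
print(shlex.join(farm))
```

The list returned by the helper could be passed to `subprocess.run` unchanged; `shlex.join` is used here only to print a copy-pasteable form.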
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
samples: | ||
- podar | ||
|
||
outdir: outputs.private/ | ||
|
||
sourmash_databases: | ||
- databases/podar-ref.zip | ||
|
||
local_databases_info: | ||
- databases/podar-ref.info.csv | ||
|
||
taxonomies: | ||
- databases/podar-ref.tax.csv |
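
The commit message mentions checking for old config file parameters and checking that `taxonomies` is a list. The Python sketch below shows one way such checks could look when applied to a config like the one added above, after it has been parsed into a dict. It is an illustration, not the code in this commit: the function name `check_config`, the old-to-new mapping, and the choice of which keys must be lists are all assumptions inferred from the commit message and the config file.

```python
# Hypothetical sketch; not genome-grist's actual validation code.

# Assumed old parameter names and their replacements, inferred from the
# commit message ("check for 'database_taxonomy'", "remove glob pattern").
OLD_PARAMS = {
    "database_taxonomy": "taxonomies",
    "sourmash_database_glob_pattern": "sourmash_databases",
}

# Keys that appear as YAML lists in the config added by this commit.
LIST_PARAMS = ("samples", "sourmash_databases", "local_databases_info", "taxonomies")


def check_config(config):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    for old, new in OLD_PARAMS.items():
        if old in config:
            problems.append(f"'{old}' is an old parameter; use '{new}' instead")
    for key in LIST_PARAMS:
        if key in config and not isinstance(config[key], list):
            problems.append(f"'{key}' must be a list")
    return problems


# The config from the diff above, written out as the dict a YAML parser
# would produce for it.
config = {
    "samples": ["podar"],
    "outdir": "outputs.private/",
    "sourmash_databases": ["databases/podar-ref.zip"],
    "local_databases_info": ["databases/podar-ref.info.csv"],
    "taxonomies": ["databases/podar-ref.tax.csv"],
}

for problem in check_config(config):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass, which tends to be friendlier when a config file mixes several outdated keys.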