[MRG] support local genome collections (including private genomes) (#130)

* rename rule to sourmash_prefetch_wc
* start using {outdir}/genomes/
* swizzle up config to allow private_databases and genbank_databases, etc.
* more progress: copying private genomes around
* combine listing private and genbank genomes - seems to work!
* simplify the ListGenomes stuff
* it's aliiiiiiiive
* remove genbank accession requirement
* remove genbank from most filenames, rules
* rename minimap to mapping; add clean_gather
* updated to properly (?) use checkpoints throughout
* tests pass locally
* fix typo
* add the beginnings of testing for private databases
* getting started
* update all the things
* [MRG] Change column names in intermediate CSVs. (#133)
* change column names
* remove old notebooks
* fix mistake
* comments etc.
* remove glob pattern, configure genbank_cache
* remove 'process' command
* check for old config file params
* add important comment
* actually remove 'process'
* check for 'database_taxonomy' instead of 'taxonomies'
* add trailing / in Makefile
* add default taxonomies file to system.conf
* fix test files
* fix conf-private.yml
* start of doc/ subdirectory
* initial commit
* add badge
* compleat first draft
* minor corrections
* spell check
* add picklists into the config (#136)
* fix 'taxonomies' in test config; check that it's a list
* add comment
* swipe from #97
* swipe getting started from #97
* update!
* Apply suggestions from Taylor's docs review Co-authored-by: Taylor Reiter <[email protected]>
* more update in re taylor's suggestions
* more more update
* even more update
* more update
* more update
* fix help output for CLI
* configure mkdocs
* clean it out
* update gitignore
* add some figures
* upd
* more figure adjustment
* add badges
* simplify to single sourmash_databases; use 'local' instead of 'private'
* update to 'local' instead of 'private'
* fix extra backquote
* more fix?
* fix formatting
* add tax test
* add test for picklist
* switch SRR5950647_subset over to use local_databases_info 🎉
* cleanup & commenting
* add missing file

Co-authored-by: HackMD <[email protected]>
Co-authored-by: Taylor Reiter <[email protected]>
dib-lab · Jan 17, 2022 · cb7421a
1 parent 65acc69
Showing 42 changed files with 1,347 additions and 3,078 deletions.
diff --git a/.gitignore b/.gitignore
@@ -11,3 +11,8 @@ dist/
 genome_grist.egg-info/
 genome_grist/version.py
 outputs.*
+genbank_cache
+*.yml
+site
+.DS_Store
+bak
diff --git a/Makefile b/Makefile
@@ -1,5 +1,11 @@
 all: clean-test test
 
+flakes:
+	flake8 --ignore=E501 genome_grist/ tests/
+
+black:
+	black .
+
 clean-test:
 	rm -fr outputs.test/
 
@@ -8,15 +14,41 @@ test:
 	genome-grist run tests/test-data/SRR5950647.conf summarize_mapping summarize_tax make_sgc_conf -j 8 -p
 
 	# try various targets to make sure they work
-	genome-grist run tests/test-data/SRR5950647.conf download_matching_genomes -j 8 -p
-	genome-grist run tests/test-data/SRR5950647.conf download_matching_genomes_info -j 8 -p
+	genome-grist run tests/test-data/SRR5950647.conf download_genbank_genomes -j 8 -p
+	genome-grist run tests/test-data/SRR5950647.conf combine_genome_info -j 8 -p
+	genome-grist run tests/test-data/SRR5950647.conf retrieve_genomes -j 8 -p
 	genome-grist run tests/test-data/SRR5950647.conf estimate_distinct_kmers -j 8 -p
 	genome-grist run tests/test-data/SRR5950647.conf count_trimmed_reads -j 8 -p
 	genome-grist run tests/test-data/SRR5950647.conf summarize_sample_info -j 8 -p
 
+### private/local genomes test stuff
 
-flakes:
-	flake8 --ignore=E501 genome_grist/ tests/
+test-private: outputs.private/abundtrim/podar.abundtrim.fq.gz \
+		databases/podar-ref.zip  databases/podar-ref.info.csv \
+		databases/podar-ref.tax.csv
+	genome-grist run conf-private.yml summarize_gather summarize_mapping summarize_tax -j 4 -p
 
-black:
-	black .
+# download the (subsampled) reads for SRR606249
+outputs.private/abundtrim/podar.abundtrim.fq.gz:
+	mkdir -p outputs.private/abundtrim
+	curl -L https://osf.io/ckbq3/download -o outputs.private/abundtrim/podar.abundtrim.fq.gz
+
+# download the ref genomes
+databases/podar-ref/: 
+	mkdir -p databases/podar-ref
+	curl -L https://osf.io/vbhy5/download -o databases/podar-ref.tar.gz
+	cd databases/podar-ref/ && tar xzf ../podar-ref.tar.gz
+
+# sketch the ref genomes
+databases/podar-ref.zip: databases/podar-ref/
+	sourmash sketch dna -p k=31,scaled=1000 --name-from-first \
+	    databases/podar-ref/*.fa -o databases/podar-ref.zip
+
+# download taxonomy
+databases/podar-ref.tax.csv:
+	curl -L https://osf.io/4yhjw/download -o databases/podar-ref.tax.csv
+
+# create info file and genomes directory:
+databases/podar-ref.info.csv:
+	python -m genome_grist.copy_local_genomes databases/podar-ref/*.fa -o databases/podar-ref.info.csv -d databases/podar-ref.d
+	python -m genome_grist.make_info_file databases/podar-ref.info.csv
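
Review note on the `test-private` rules above: the sketch below transcribes the declared target/prerequisite pairs into a dictionary and asks Python's standard-library `graphlib` for one valid build order. It is only an illustration of what the Makefile declares, not something genome-grist ships; the `rules` and `order` names are made up for this note.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Target -> declared prerequisites, transcribed from the Makefile rules above.
# (Illustrative only; 'rules' is a name invented for this note.)
rules = {
    "test-private": [
        "outputs.private/abundtrim/podar.abundtrim.fq.gz",
        "databases/podar-ref.zip",
        "databases/podar-ref.info.csv",
        "databases/podar-ref.tax.csv",
    ],
    "outputs.private/abundtrim/podar.abundtrim.fq.gz": [],
    "databases/podar-ref/": [],
    "databases/podar-ref.zip": ["databases/podar-ref/"],
    "databases/podar-ref.tax.csv": [],
    # No prerequisite is declared here, although the recipe reads
    # databases/podar-ref/*.fa.
    "databases/podar-ref.info.csv": [],
}

# static_order() yields every target after all of its prerequisites.
order = list(TopologicalSorter(rules).static_order())
for target in order:
    print(target)
```

The dictionary makes one thing visible: `databases/podar-ref.info.csv` declares no prerequisite, yet its recipe globs `databases/podar-ref/*.fa`. A serial `make test-private` probably still works, because `databases/podar-ref.zip` is listed first and pulls in the download. Building the info file on its own, or running make with `-j`, may reach that recipe before the genomes exist.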
diff --git a/README.md b/README.md
@@ -1,156 +1,27 @@
-# genome-grist: a quickstart tutorial.
+# genome-grist README
 
-This quickstart tutorial will take about 30 minutes to run, and
-requires 5 GB of disk space and 4 GB of RAM, as well as a fairly
-good Internet connection.
+<!-- CTB: this is /README.md in dib-lab/genome-grist -->
 
-## What is genome-grist?
+<a href="https://pypi.org/project/genome-grist/"><img alt="PyPI" src="https://badge.fury.io/py/genome-grist.svg"></a>
+<img alt="License: 3-Clause BSD" src="https://img.shields.io/badge/License-BSD%203--Clause-blue.svg">
 
-genome-grist is software that automates a number of tedious metagenome tasks related to reference-based analyses on Illumina metagenomes. Specifically, genome-grist will download public metagenomes from the SRA, preprocess them, and use `sourmash gather` to identify reference genomes for the metagenome. It will then download the reference genomes, map reads to them, and summarize the mapping.
+genome-grist analyzes the strain composition of microbial metagenomes
+using
+[minimum metagenome covers](https://dib-lab.github.io/2020-paper-sourmash-gather/)
+and produces a variety of compositional and taxonomic summaries.
 
-## Installing genome-grist
+Check out the
+[quick start!](https://dib-lab.github.io/genome-grist/quickstart/) And
+please also see
+[the rest of the docs](https://dib-lab.github.io/genome-grist/) for
+more information!
 
-We suggest installing in an isolated conda environment. The following will create a new environment, activate it, and install the latest version of genome-grist from PyPI (which is <a href="https://pypi.org/project/genome-grist/"><img alt="PyPI" src="https://badge.fury.io/py/genome-grist.svg"></a>).
+## Example: the strain composition of a gut microbiome (iHMP)
 
-```
-conda create -y -n grist python=3.8 pip
-conda activate grist
-python -m pip install genome-grist
-```
-## Running genome-grist
+This figure was autogenerated by genome-grist.
 
-We currently recommend running genome-grist in its own directory, for several reasons that include software installation (genome-grist uses snakemake and conda to install software under this directory).
-
-Within the current working directory, genome-grist will create an `inputs` subdir, a `genbank_genomes` subdir, and any `outputs.NAME` subdirectories required by the configuration; it should be straightforward to keep projects separate by configuring the output directories appropriately.
-
-So, create a subdirectory and change into it:
-```shell
-mkdir grist/
-cd grist/
-```
-Note, genome-grist does not rely on the directory name or location in any way; it works entirely within the current working directory.
-
-### Download a small example database
-
-Download the GTDB release 95 set of ~32k guide genomes, in a pre-prepared sourmash database format:
-```
-curl -L https://osf.io/4n3m5/download -o gtdb-r95.nucleotide-k31-scaled1000.sbt.zip
-```
-(Any sourmash database will do as long as the sequences are named so that the full GenBank accession is the first field in the name.)
-
-### Make a configuration file
-
-Put the following in a config file named `conf-tutorial.yml`:
-```
-sample:
-- SRR5950647
-outdir: outputs.tutorial/
-metagenome_trim_memory: 1e9
-sourmash_database_glob_pattern: gtdb-r95.nucleotide-k31-scaled1000.sbt.zip
-```
-
-Notes:
-* you can put multiple samples IDs here, in a [YAML array format](https://www.cloudbees.com/blog/yaml-tutorial-everything-you-need-get-started/) - put them on a new line after a dash (`-`).
-* if you have multiple databases you can specify them here with an appropriate wild card pattern, e.g. `db/*` will work.
-* if you are running this on the farm HPC at UC Davis, you can search all of genbank by *omitting* the database configuration line. Currently these files are not yet publicly available, which is why this tutorial uses GTDB instead.
-
-### Do your first real run!
-
-Execute:
-```
-genome-grist run conf-tutorial.yml summarize_mapping
-```
-
-This will perform the following steps:
-* download the [HSMA33MX metagenome](https://www.ncbi.nlm.nih.gov/sra/?term=HSMA33MX) from the Sequence Read Archive (target `download_reads`).
-* preprocess it to remove adapters and low-abundance k-mers (target `trim_reads`).
-* build a sourmash signature from the preprocess reads. (target `smash_reads`).
-* perform a `sourmash gather` against the specified database (target `gather_genbank`).
-* download the matching genomes from GenBank into `genbank_genomes/` (target `download_matching_genomes`).
-* map the metagenome reads to the various genomes (target `map_reads`).
-* produce a summary notebook (target `summarize_mapping`).
-
-## Output files
-
-The key output files under the outputs directory are:
-
-* `genbank/{sample}.x.genbank.gather.out` - human-readable output from [sourmash gather](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
-* `genbank/{sample}.x.genbank.gather.csv` - [sourmash gather CSV output](https://sourmash.readthedocs.io/en/latest/classifying-signatures.html).
-* `genbank/{sample}.genomes.info.csv` - information about the matching genomes from genbank.
-* `reports/report-{sample}.html` - a summary report.
-* `abundtrim/{sample}.abundtrim.fq.gz` - trimmed and preprocessed reads.
-* `sigs/HSMA33MX.abundtrim.sig` - sourmash signature for the preprocessed reads.
-
-Note that `genome-grist run <config.yml> zip` will create a file named `transfer.zip` with the above files in it.
-
-## Where to insert your own files
-
-genome-grist is built on top of [the snakemake workflow](https://snakemake.readthedocs.io/en/stable/), which lets you substitute your own files in many places.
-
-For example,
-* you can put your own `SAMPLE_1.fastq.gz`, `SAMPLE_2.fastq.gz`, and `SAMPLE_unpaired.fastq.gz` files in `raw/` to have genome-grist process reads for you.
-* you can put your own interleaved reads file in `abundtrim/SAMPLE.abundtrim.fq.gz` to run genome-grist on a private or preprocessed set of reads;
-* you can put your own sourmash signature (k=31, scaled=1000) in `sigs/SAMPLE.abundtrim.sig` if you want to have it do the database search for you;
-
-Please see [the genome-grist Snakefile](https://github.com/dib-lab/genome-grist/blob/latest/genome_grist/conf/Snakefile) for all the gory details.
-
-## Additional targets
-
-Recommended targets:
-
- * summarize_gather - produce summary reports on metagenome composition
- * summarize_tax - produce summary reports on taxonomic composition
- * summarize_mapping - produce summary reports on k-mer and read mapping
-
-Note, 'summarize_mapping' includes 'summarize_gather'; reports will be
-in {{outdir}}/reports, where 'outdir' is specified in the config file.
-
-Additional intermediate targets:
-
- * download_reads - download SRA metagenomes specified in conf file
- * trim_reads - do basic read trimming/adapter removal for metagenome reads
- * smash_reads - create sourmash signatures from metagenome reads
- * summarize_sample_info - build a info.yaml summary file for each metagenome
- * gather_genbank - run 'sourmash gather' on metagenomes against Genbank
- * download_matching_genomes - download all matching Genbank genomes
- * map_reads - map all metagenome reads to Genbank genomes
- * make_sgc_conf - make a spacegraphcats config file
-
-## Other information
-
-### Resource requirements
-
-**Disk space:** genome-grist makes about 4-5 copies of each SRA metagenome.
-
-**Memory:** the genbank search step on all of genbank takes ~120 GB of RAM. On GTDB, it's much, much less. Other than that, the other steps are all under 10 GB of RAM (unless you adjust `metagenome_trim_memory` upwards, which may be needed for complex metagenomes).
-
-**Time:** This is largely dependent on the size of the metagenome; 100m reads takes less than a day or two, typically. The processing of multiple data sets can be done in parallel with `-j`, as well, although you probably want to specify resource limits. For example, here is the command that Titus uses on farm:
-```
-genome-grist run <config> -k --resources mem_mb=145000 -j 16
-```
-to run in 150GB of RAM, which will run at most one genbank search at a time.
-
-### Installing unreleased versions.
-
-You can run genome-grist from a git checkout directory by using pip to install it in editable mode:
-```
-pip install -e .
-```
-
-### Support
-
-We like to support our software!
-
-That having been said, genome-grist is early-stage beta-level software. Please be patient and kind :).
-
-Please ask questions and add comments [on the github issue tracker for genome-grist](https://github.com/dib-lab/genome-grist/issues).
-
-## Why the name `grist`?
-
-'grist' is in the sourmash family of names (sourmash, wort, distillerycats, etc.) See [Grist in Wikipedia](https://en.wikipedia.org/wiki/Grist).
-
-(It is not the [computing grist](https://en.wikipedia.org/wiki/Grist_(computing))!)
+![an example image made with genome-grist](doc/gather-vs-mapping.png)
 
 ---
 
-[CTB](https://twitter.com/ctitusbrown/) Jan 27, 2021
+[CTB](https://twitter.com/ctitusbrown/) 01/22
diff --git a/conf-private.yml b/conf-private.yml
@@ -0,0 +1,13 @@
+samples:
+- podar
+
+outdir: outputs.private/
+
+sourmash_databases:
+- databases/podar-ref.zip
+
+local_databases_info:
+- databases/podar-ref.info.csv
+
+taxonomies:
+- databases/podar-ref.tax.csv
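
Review note on `conf-private.yml`: the commit message mentions checking for old config parameters and checking that `taxonomies` is a list. A minimal sketch of that kind of check follows, assuming the config has already been parsed into a plain dictionary. `check_config`, `LIST_KEYS`, and `OLD_KEYS` are hypothetical names for this note and are not genome-grist's actual implementation. Treating `sourmash_database_glob_pattern` as a retired key is also an assumption, based on the removed README text and the "remove glob pattern" entry in the commit message.

```python
# Hypothetical helper for illustration; not part of genome-grist.

# Keys that appear as YAML lists in conf-private.yml.
LIST_KEYS = ("samples", "sourmash_databases", "local_databases_info", "taxonomies")

# Assumed-retired key, taken from the README text this commit removes.
OLD_KEYS = ("sourmash_database_glob_pattern",)


def check_config(cfg):
    """Return a list of human-readable problems found in a parsed config dict."""
    problems = []
    for key in OLD_KEYS:
        if key in cfg:
            problems.append(f"old parameter '{key}' is present")
    for key in LIST_KEYS:
        # Only complain about keys that are present but not lists.
        if key in cfg and not isinstance(cfg[key], list):
            problems.append(f"'{key}' should be a list")
    if "outdir" not in cfg:
        problems.append("'outdir' is missing")
    return problems


# The contents of conf-private.yml, written out as the dict a YAML parser
# would be expected to produce.
conf_private = {
    "samples": ["podar"],
    "outdir": "outputs.private/",
    "sourmash_databases": ["databases/podar-ref.zip"],
    "local_databases_info": ["databases/podar-ref.info.csv"],
    "taxonomies": ["databases/podar-ref.tax.csv"],
}

print(check_config(conf_private))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass, which tends to be friendlier when a config file predates a key rename.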