Implement subsampling via a script #711
Open
+595
−246
Description of proposed changes
This implements subsampling via a single script, rather than a series of snakemake commands. The behavior is intended to be the same, and the subsampling definitions (in `builds.yaml`) have not changed. This script was initially developed as `augur subsample` in nextstrain/augur#762; however, we are choosing to move development here to allow us to improve the proximity/priority calculations. The eventual aim is for this functionality to be part of augur.
Verbosity
The subsample script contains overly verbose print statements. Some of these should be removed before merge, but they may be helpful for review.
Subsampling syntax
We introduce a new rule, `extract_subsampling_scheme`, which takes the subsampling definition (in `builds.yaml`), applies wildcard expansion, and then transforms this into a syntax more in line with how augur filter is called. This rule replaces a number of functions formerly in the snakefiles.
Going forward, I suggest slowly updating our nCoV subsampling definitions to the new format. This syntax is more in line with the rest of the augur ecosystem and avoids the sticking point where some values need to specify the `--argument` and some don't (this has tripped people up). It is also easier (I think) to reason with YAML arrays than strings which are then coerced into arrays.
DAG simplification
By essentially bringing the complexity into the subsample script (eventually to be `augur subsample`), we have a simpler snakemake DAG, with far fewer rules, which should be easier to port to other workflow languages. The DAG for subsampling now only contains two steps per build: `extract_subsampling_scheme` (see above) and `subsample`.
DAG for Nextstrain Open (GenBank) builds. Top: current master, bottom: this PR
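To give a feel for what the `extract_subsampling_scheme` step does, here is a minimal sketch of the transformation, not the code in this PR. The function names, the legacy key names (`group_by`, `seq_per_group`, `exclude`, `max_date`) and the shape of the output are assumptions made for illustration; only the option names (`--group-by`, `--sequences-per-group`, `--exclude-where`, `--max-date`) are real augur filter flags.

```python
import shlex

# Options assumed (for this sketch) to always be represented as lists.
LIST_OPTIONS = {"exclude-where", "include-where"}


def expand_wildcards(value, wildcards):
    """Fill `{name}` placeholders in string values; leave other types alone."""
    if isinstance(value, str):
        return value.format(**wildcards)
    return value


def to_filter_style(definition, wildcards):
    """Convert one hypothetical legacy sample definition to filter-style keys."""
    out = {}
    for key, raw in definition.items():
        value = expand_wildcards(raw, wildcards)
        if isinstance(value, str) and value.startswith("--"):
            # Legacy style embeds the flag in the value,
            # e.g. "--exclude-where 'region!=Europe'".
            flag, *args = shlex.split(value)
            name = flag[2:]
            if name in LIST_OPTIONS or len(args) != 1:
                out[name] = args
            else:
                out[name] = args[0]
        elif key == "group_by":
            # A space-separated string becomes an explicit list.
            out["group-by"] = value.split()
        elif key == "seq_per_group":
            out["sequences-per-group"] = int(value)
        else:
            # Anything else is passed through with a dashed key name.
            out[key.replace("_", "-")] = value
    return out


legacy = {
    "group_by": "division year month",
    "seq_per_group": 300,
    "exclude": "--exclude-where 'region!={region}'",
    "max_date": "--max-date 2020-03-01",
}
converted = to_filter_style(legacy, {"region": "Europe"})
```

In this sketch the key name is taken from the embedded flag when one is present, which is what removes the ambiguity about which values need `--argument`. Note that `str.format` would fail on values containing literal braces, so a real implementation would need more careful wildcard handling.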
Future improvements
`augur filter`’s `run()` function, we could refactor that function slightly to have a function which returns a strain list for inclusion, as well as logging data etc. Currently the `run()` function writes data to disk which we immediately read in.
`priorities.py` and `get_distance_to_focal_set.py` scripts. These can be refactored to return data rather than writing to disk.
`augur filter` `ignore_seqs`, which is currently hardcoded to be "Wuhan/Hu-1/2019". Similarly for proximity / priority config parameters. (Breaking change.)
`--min-date` and proximity to another sample, we can do this in two stages so that the proximity calculations are run against a pre-filtered set (via `--min-date`).
Related issue(s)
Related to nextstrain/augur#635
Testing
Test builds triggered (via GitHub). I will update this section when they complete. Update: they failed. Updated PR, and AWS info below updated.
AWS batch console link
Release checklist
This PR should not introduce any backwards incompatible changes
`docs/change_log.md` in this pull request to document these changes by the date they were added.