diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5efbe2f..80fdb09 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -4,6 +4,12 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [v0.2.5-rc1]
+### Adds
+- `modkit validate` sub-command for rigorous testing of modified base calling models when ground truth labels are known.
+### Fixes
+- [all] Improve performance when using commands on transcriptome reference (or any reference with many sequences less than ~100kb).
+
## [v0.2.4]
### Adds
- [extract, adjust-mods, update-tags, call-mods] Parse MN tag in order to use secondary and supplementary alignments.
diff --git a/book/src/advanced_usage.md b/book/src/advanced_usage.md
index 35daf8f..b634bb5 100644
--- a/book/src/advanced_usage.md
+++ b/book/src/advanced_usage.md
@@ -905,6 +905,73 @@ Options:
-h, --help Print help information.
```
+## validate
+```text
+Validate results from a set of mod-BAM files and associated BED files containing the ground truth
+modified base status at reference positions.
+
+Usage: modkit validate [OPTIONS]
+
+Options:
+ --bam-and-bed
+ Argument accepts 2 values. The first value is the BAM file path with modified base tags.
+ The second is a bed file with ground truth reference positions. The name field in the
+ ground truth bed file should be the short name (single letter code or ChEBI ID) for a
+ modified base or the corresponding canonical base. This argument can be provided more than
+ once for multiple samples.
+
+ --ignore
+ Ignore a modified base class _in_situ_ by redistributing base modification probability
+ equally across other options. For example, if collapsing 'h', with 'm' and canonical
+ options, half of the probability of 'h' will be added to both 'm' and 'C'. A full
+ description of the methods can be found in collapse.md.
+
+ --edge-filter
+ Discard base modification calls that are this many bases from the start or the end of the
+ read. Two comma-separated values may be provided to asymmetrically filter out base
+ modification calls from the start and end of the reads. For example, 4,8 will filter out
+ base modification calls in the first 4 and last 8 bases of the read.
+
+ --invert-edge-filter
+ Invert the edge filter, instead of filtering out base modification calls at the ends of
+ reads, only _keep_ base modification calls at the ends of reads. E.g. if usually, "4,8"
+ would remove (i.e. filter out) base modification calls in the first 4 and last 8 bases of
+ the read, using this flag will keep only base modification calls in the first 4 and last 8
+ bases.
+
+ -c, --canonical-base
+ Canonical base to evaluate. By default, this will be derived from mod codes in ground
+ truth BED files. For ground truth with only canonical sites and/or ChEBI codes this values
+ must be set.
+
+ [possible values: A, C, G, T]
+
+ -q, --filter-quantile
+ Filter out modified base calls where the probability of the predicted variant is below
+ this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence
+ modification calls.
+
+ [default: 0.1]
+
+ -t, --threads
+ Number of threads to use.
+
+ [default: 4]
+
+ --suppress-progress
+ Hide the progress bar.
+
+ -o, --out-filepath
+ Specify a file for machine parseable output.
+
+ --log-filepath
+ Specify a file for debug logs to be written to, otherwise ignore them. Setting a file is
+ recommended. (alias: log)
+
+ -h, --help
+ Print help information (use `-h` for a summary).
+```
+
## pileup-hemi
```text
Tabulates double-stranded base modification patters (such as hemi-methylation) across genomic motif
diff --git a/docs/404.html b/docs/404.html
index 40ccbc7..394071d 100644
--- a/docs/404.html
+++ b/docs/404.html
@@ -87,7 +87,7 @@
diff --git a/docs/advanced_usage.html b/docs/advanced_usage.html
index ea07c33..b3aaba6 100644
--- a/docs/advanced_usage.html
+++ b/docs/advanced_usage.html
@@ -86,7 +86,7 @@
@@ -1032,6 +1032,71 @@
Validate results from a set of mod-BAM files and associated BED files containing the ground truth
+modified base status at reference positions.
+
+Usage: modkit validate [OPTIONS]
+
+Options:
+ --bam-and-bed <BAM> <BED>
+ Argument accepts 2 values. The first value is the BAM file path with modified base tags.
+ The second is a bed file with ground truth reference positions. The name field in the
+ ground truth bed file should be the short name (single letter code or ChEBI ID) for a
+ modified base or the corresponding canonical base. This argument can be provided more than
+ once for multiple samples.
+
+ --ignore <IGNORE>
+ Ignore a modified base class _in_situ_ by redistributing base modification probability
+ equally across other options. For example, if collapsing 'h', with 'm' and canonical
+ options, half of the probability of 'h' will be added to both 'm' and 'C'. A full
+ description of the methods can be found in collapse.md.
+
+ --edge-filter <EDGE_FILTER>
+ Discard base modification calls that are this many bases from the start or the end of the
+ read. Two comma-separated values may be provided to asymmetrically filter out base
+ modification calls from the start and end of the reads. For example, 4,8 will filter out
+ base modification calls in the first 4 and last 8 bases of the read.
+
+ --invert-edge-filter
+ Invert the edge filter, instead of filtering out base modification calls at the ends of
+ reads, only _keep_ base modification calls at the ends of reads. E.g. if usually, "4,8"
+ would remove (i.e. filter out) base modification calls in the first 4 and last 8 bases of
+ the read, using this flag will keep only base modification calls in the first 4 and last 8
+ bases.
+
+ -c, --canonical-base <CANONICAL_BASE>
+ Canonical base to evaluate. By default, this will be derived from mod codes in ground
+ truth BED files. For ground truth with only canonical sites and/or ChEBI codes this values
+ must be set.
+
+ [possible values: A, C, G, T]
+
+ -q, --filter-quantile <FILTER_QUANTILE>
+ Filter out modified base calls where the probability of the predicted variant is below
+ this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence
+ modification calls.
+
+ [default: 0.1]
+
+ -t, --threads <THREADS>
+ Number of threads to use.
+
+ [default: 4]
+
+ --suppress-progress
+ Hide the progress bar.
+
+ -o, --out-filepath <OUT_FILEPATH>
+ Specify a file for machine parseable output.
+
+ --log-filepath <LOG_FILEPATH>
+ Specify a file for debug logs to be written to, otherwise ignore them. Setting a file is
+ recommended. (alias: log)
+
+ -h, --help
+ Print help information (use `-h` for a summary).
+
Tabulates double-stranded base modification patters (such as hemi-methylation) across genomic motif
positions. This command produces a bedMethyl file, the schema can be found in the online
@@ -1323,7 +1388,7 @@
The modkit validate sub-command is intended for validating results in a uniform manner from samples with known modified base content. Specifically the modified base status at any annotated reference location should be known.
The input to the modkit validate command will be pairs of modBAM and BED files.
+modBAM files should contain modified base calls in the MM/ML tag as input to most modkit commands.
+BED files paired to each input modBAM file describe the ground truth modified base status at reference positions.
+This ground truth status should be known by the researcher due to previous experimental conditions and cannot be derived by modkit.
> Parsing BED at /path/to/sample1_annotation.bed
+> Processed 10 BED lines
+> Parsing BED at /path/to/sample2_annotation.bed
+> Processed 10 BED lines
+> Canonical base: C
+> Parsing mapping at /path/to/sample1.bam
+> Processed 10 mapping recrods
+> Parsing mapping at /path/to/sample2.bam
+> Processed 10 mapping recrods
+> Raw counts summary
+ Called Base
+ ┌───┬───────┬───────┬───┬───┬───┬──────────┐
+ │ │ - │ a │ C │ G │ T │ Deletion │
+ ├───┼───────┼───────┼───┼───┼───┼──────────┤
+ Ground │ - │ 9,900 │ 100 │ 1 │ 1 │ 1 │ 2 │
+ Truth │ a │ 100 │ 9,900 │ 1 │ 1 │ 1 │ 2 │
+ └───┴───────┴───────┴───┴───┴───┴──────────┘
+> Balancing ground truth call totals
+> Raw accuracy: 99.00%
+> Raw modified base calls contingency table
+ Called Base
+ ┌───┬────────┬────────┐
+ │ │ - │ a │
+ ├───┼────────┼────────┤
+ Ground │ - │ 99.00% │ 1.00% │
+ Truth │ a │ 1.00% │ 99.00% │
+ └───┴────────┴────────┘
+> Call probability threshold: 0.95
+> Filtered accuracy: 99.90%
+> Filtered modified base calls contingency table
+ Called Base
+ ┌───┬────────┬────────┐
+ │ │ - │ a │
+ ├───┼────────┼────────┤
+ Ground │ - │ 99.90% │ 0.10% │
+ Truth │ a │ 0.10% │ 99.90% │
+ └───┴────────┴────────┘
+
+
The filtering threshold is computed in the same manner as in all other modkit commands.
+Currently only a defined percentage of input data filtering threshold estimation is implemented.
+The default value is 10% (as in other modkit commands) and can be adjusted with the --filter-quantile argument.
+Other methods (including user-defined thresholds) will be implemented in a future version.
+
The Call probability threshold is intended a value to be used for user-defined thresholds for other modkit commands.
A BED file is a tab-delimited file.
+For this command the first 6 fields are processed.
+These fields are as follows:
+
column
name
description
+
1
chrom
name of reference sequence
str
+
2
start position
0-based start position
int
+
3
end position
0-based exclusive end position
int
+
4
mod code
Modified base code
str
+
6
strand
Strand (e.g. +,-,.)
str
+
+
+
The 5th column is ignored in the validate command.
+
The 4th column represents the modified base code annotating the status at this reference position (or range of reference positions).
+This value can be - representing a canonical base (note that this differs from the remora validate annotation), a single letter code as defined in the modBAM tag specification, or any ChEBI code.
+The validate command will assume that any base from the associated modBAM file overlapping these positions should match this annotation.
The --out-filepath option is provided to allow persistent storage of results in a machine-parseable format without other logging lines.
+This format outputs all contingency tables in a machine-parseable format.
+For example this contingency table [["ground_truth_label","-","a"],["-",9900,100],["a",100,9900]] would be produced from the above example results.
The goal of modkit is to enable best-practices manipulation of BAM files containing
modified base information (modBAMs). The various sub-commands and tools available in
@@ -1787,6 +1862,71 @@
Validate results from a set of mod-BAM files and associated BED files containing the ground truth
+modified base status at reference positions.
+
+Usage: modkit validate [OPTIONS]
+
+Options:
+ --bam-and-bed <BAM> <BED>
+ Argument accepts 2 values. The first value is the BAM file path with modified base tags.
+ The second is a bed file with ground truth reference positions. The name field in the
+ ground truth bed file should be the short name (single letter code or ChEBI ID) for a
+ modified base or the corresponding canonical base. This argument can be provided more than
+ once for multiple samples.
+
+ --ignore <IGNORE>
+ Ignore a modified base class _in_situ_ by redistributing base modification probability
+ equally across other options. For example, if collapsing 'h', with 'm' and canonical
+ options, half of the probability of 'h' will be added to both 'm' and 'C'. A full
+ description of the methods can be found in collapse.md.
+
+ --edge-filter <EDGE_FILTER>
+ Discard base modification calls that are this many bases from the start or the end of the
+ read. Two comma-separated values may be provided to asymmetrically filter out base
+ modification calls from the start and end of the reads. For example, 4,8 will filter out
+ base modification calls in the first 4 and last 8 bases of the read.
+
+ --invert-edge-filter
+ Invert the edge filter, instead of filtering out base modification calls at the ends of
+ reads, only _keep_ base modification calls at the ends of reads. E.g. if usually, "4,8"
+ would remove (i.e. filter out) base modification calls in the first 4 and last 8 bases of
+ the read, using this flag will keep only base modification calls in the first 4 and last 8
+ bases.
+
+ -c, --canonical-base <CANONICAL_BASE>
+ Canonical base to evaluate. By default, this will be derived from mod codes in ground
+ truth BED files. For ground truth with only canonical sites and/or ChEBI codes this values
+ must be set.
+
+ [possible values: A, C, G, T]
+
+ -q, --filter-quantile <FILTER_QUANTILE>
+ Filter out modified base calls where the probability of the predicted variant is below
+ this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence
+ modification calls.
+
+ [default: 0.1]
+
+ -t, --threads <THREADS>
+ Number of threads to use.
+
+ [default: 4]
+
+ --suppress-progress
+ Hide the progress bar.
+
+ -o, --out-filepath <OUT_FILEPATH>
+ Specify a file for machine parseable output.
+
+ --log-filepath <LOG_FILEPATH>
+ Specify a file for debug logs to be written to, otherwise ignore them. Setting a file is
+ recommended. (alias: log)
+
+ -h, --help
+ Print help information (use `-h` for a summary).
+
Tabulates double-stranded base modification patters (such as hemi-methylation) across genomic motif
positions. This command produces a bedMethyl file, the schema can be found in the online
diff --git a/docs/quick_start.html b/docs/quick_start.html
index d801d85..5840e6e 100644
--- a/docs/quick_start.html
+++ b/docs/quick_start.html
@@ -86,7 +86,7 @@
diff --git a/docs/searchindex.js b/docs/searchindex.js
index 7b8310a..1fcaf11 100644
--- a/docs/searchindex.js
+++ b/docs/searchindex.js
@@ -1 +1 @@
-Object.assign(window.search, {"doc_urls":["quick_start.html#basic-usage","quick_start.html#modkit-is-a-bioinformatics-tool-for-working-with-modified-bases-from-oxford-nanopore","quick_start.html#installation","quick_start.html#common-use-cases","quick_start.html#notes-and-troubleshooting","intro_bedmethyl.html#constructing-bedmethyl-tables","intro_bedmethyl.html#basic-usage","intro_bedmethyl.html#narrowing-output-to-cpg-dinucleotides","intro_bedmethyl.html#narrowing-output-to-specific-motifs","intro_bedmethyl.html#partitioning-reads-based-on-sam-tag-values","intro_bedmethyl.html#description-of-bedmethyl-output","intro_bedmethyl.html#definitions","intro_bedmethyl.html#bedmethyl-column-descriptions","intro_bedmethyl.html#performance-considerations","intro_adjust.html#updating-and-adjusting-mm-tags","intro_adjust.html#ignoring-a-modification-class","intro_adjust.html#combining-base-modification-probabilities","intro_adjust.html#updating-the-flag--and-","intro_adjust.html#changing-the-base-modification-code","intro_summary.html#summarizing-a-modbam","intro_summary.html#summarize-the-base-modification-calls-in-a-modbam","intro_summary.html#description-of-columns-in-modkit-summary","intro_summary.html#totals-table","intro_summary.html#modification-calls-table","intro_summary.html#passing-a-threshold-directly","intro_motif_bed.html#making-a-motif-bed-file","intro_extract.html#extracting-base-modification-information","intro_extract.html#description-of-output-table","intro_extract.html#tabulating-base-modification--calls--for-each-read-position","intro_extract.html#note-on-implicit-base-modification-calls","intro_extract.html#note-on-non-primary-alignments","intro_extract.html#example-usages","intro_extract.html#extract-a-table-from-an-aligned-and-indexed-bam","intro_extract.html#extract-a-table-from-a-region-of-a-large-modbam","intro_extract.html#extract-only-sites-aligned-to-a-cg-motif","intro_extract.html#extract-only-sites-that-are-at-least-50-bases-from-the-ends-of-the-reads","intro_extract.html#extract-read-level-base-modification-calls","intro_call_mods.html#calling-mods-in-a-modbam","intro_call_mods.html#example-usages","intro_call_mods.html#estimate-the-threshold-on-the-fly-apply-to-modbam-and-clamp-the-modification-calls-to-certainty","intro_call_mods.html#specify-a-filter-threshold-for-your-use-case","intro_call_mods.html#call-mods-with-the-estimated-threshold-and-ignore-modification-calls-within-100-base-pairs-of-the-ends-of-the-reds","intro_edge_filter.html#removing-modification-calls-at-the-ends-of-reads","intro_edge_filter.html#example-usages","intro_edge_filter.html#call-mods-with-the-estimated-threshold-and-ignore-modification-calls-within-100-base-pairs-of-the-ends-of-the-reads","intro_edge_filter.html#perform-pileup-ignoring-base-modification-calls-within-100-base-pairs-of-the-ends-of-the-reads","intro_include_bed.html#narrow-output-to-specific-positions","intro_repair.html#repair-mmml-tags-on-trimmed-reads","intro_pileup_hemi.html#make-hemi-methylation-bedmethyl-tables-with-pileup-hemi","intro_pileup_hemi.html#description-of-hemi-methylation-patterns","intro_pileup_hemi.html#definitions","intro_pileup_hemi.html#bedmethyl-column-descriptions","intro_pileup_hemi.html#limitations","intro_dmr.html#perform-differential-methylation-scoring","intro_dmr.html#preparing-the-input-data","intro_dmr.html#running-differential-methylation-scoring","intro_dmr.html#scoring-differentially-methylated-bases-as-opposed-to-regions","intro_dmr.html#running-multiple-samples","intro_dmr.html#differential-methylation-output-format","intro_dmr.html#scoring-details","advanced_usage.html#modkit-subcommand-documentation","advanced_usage.html#pileup","advanced_usage.html#adjust-mods","advanced_usage.html#update-tags","advanced_usage.html#sample-probs","advanced_usage.html#summary","advanced_usage.html#motif-bed","advanced_usage.html#call-mods","advanced_usage.html#extract","advanced_usage.html#repair","advanced_usage.html#pileup-hemi","advanced_usage.html#pileup-hemi-pair","advanced_usage.html#pileup-hemi-1","troubleshooting.html#troubleshooting","troubleshooting.html#missing-secondary-and-supplementary-alignments-in-output","troubleshooting.html#no-rows-in-modkit-pileup-output","troubleshooting.html#not-sampling-enough-reads-to-estimate-threshold","troubleshooting.html#cg-positions-are-missing-from-output-bedmethyl","limitations.html#current-limitations","perf_considerations.html#performance-considerations","perf_considerations.html#sharding-a-large-modbam-by-region","perf_considerations.html#setting-the---interval-size-and---chunk-size-pileup","algo_details.html#algorithm-details","filtering.html#partitioning-pass-and-fail-base-modification-calls","filtering.html#further-details","filtering_details.html#examples-of-how-thresholds-affect-base-modification-calls","filtering_details.html#two-way-base-modification-calls","filtering_details.html#three-way-base-modification-calls","filtering_details.html#three-way-base-modification-calls-with-modification-specific-thresholds","filtering_numeric_details.html#filtering-base-modification-calls","filtering_numeric_details.html#a-note-on-the-probability-of-modification--mm-flag","collapse.html#removing-modification-calls-from-bams","collapse.html#removing-dna-base-modification-probabilities","collapse.html#combining-multiple-base-modifications-into-a-single-count","collapse.html#combine---combine"],"index":{"documentStore":{"docInfo":{"0":{"body":0,"breadcrumbs":5,"title":2},"1":{"body":1,"breadcrumbs":11,"title":8},"10":{"body":13,"breadcrumbs":9,"title":3},"11":{"body":157,"breadcrumbs":7,"title":1},"12":{"body":132,"breadcrumbs":9,"title":3},"13":{"body":27,"breadcrumbs":8,"title":2},"14":{"body":33,"breadcrumbs":11,"title":4},"15":{"body":44,"breadcrumbs":10,"title":3},"16":{"body":59,"breadcrumbs":11,"title":4},"17":{"body":40,"breadcrumbs":9,"title":2},"18":{"body":25,"breadcrumbs":11,"title":4},"19":{"body":14,"breadcrumbs":7,"title":2},"2":{"body":28,"breadcrumbs":4,"title":1},"20":{"body":70,"breadcrumbs":10,"title":5},"21":{"body":0,"breadcrumbs":9,"title":4},"22":{"body":46,"breadcrumbs":7,"title":2},"23":{"body":104,"breadcrumbs":8,"title":3},"24":{"body":73,"breadcrumbs":8,"title":3},"25":{"body":47,"breadcrumbs":11,"title":4},"26":{"body":102,"breadcrumbs":11,"title":4},"27":{"body":169,"breadcrumbs":10,"title":3},"28":{"body":241,"breadcrumbs":14,"title":7},"29":{"body":51,"breadcrumbs":12,"title":5},"3":{"body":64,"breadcrumbs":6,"title":3},"30":{"body":62,"breadcrumbs":11,"title":4},"31":{"body":0,"breadcrumbs":9,"title":2},"32":{"body":13,"breadcrumbs":12,"title":5},"33":{"body":17,"breadcrumbs":12,"title":5},"34":{"body":16,"breadcrumbs":12,"title":5},"35":{"body":7,"breadcrumbs":13,"title":6},"36":{"body":55,"breadcrumbs":13,"title":6},"37":{"body":101,"breadcrumbs":9,"title":3},"38":{"body":0,"breadcrumbs":8,"title":2},"39":{"body":5,"breadcrumbs":15,"title":9},"4":{"body":7,"breadcrumbs":5,"title":2},"40":{"body":14,"breadcrumbs":11,"title":5},"41":{"body":8,"breadcrumbs":19,"title":13},"42":{"body":127,"breadcrumbs":13,"title":5},"43":{"body":0,"breadcrumbs":10,"title":2},"44":{"body":8,"breadcrumbs":21,"title":13},"45":{"body":26,"breadcrumbs":20,"title":12},"46":{"body":45,"breadcrumbs":11,"title":4},"47":{"body":163,"breadcrumbs":13,"title":5},"48":{"body":179,"breadcrumbs":15,"title":7},"49":{"body":82,"breadcrumbs":12,"title":4},"5":{"body":63,"breadcrumbs":9,"title":3},"50":{"body":120,"breadcrumbs":9,"title":1},"51":{"body":129,"breadcrumbs":11,"title":3},"52":{"body":19,"breadcrumbs":9,"title":1},"53":{"body":28,"breadcrumbs":11,"title":4},"54":{"body":59,"breadcrumbs":10,"title":3},"55":{"body":192,"breadcrumbs":11,"title":4},"56":{"body":53,"breadcrumbs":13,"title":6},"57":{"body":89,"breadcrumbs":10,"title":3},"58":{"body":188,"breadcrumbs":11,"title":4},"59":{"body":165,"breadcrumbs":9,"title":2},"6":{"body":52,"breadcrumbs":8,"title":2},"60":{"body":281,"breadcrumbs":6,"title":3},"61":{"body":916,"breadcrumbs":4,"title":1},"62":{"body":227,"breadcrumbs":5,"title":2},"63":{"body":110,"breadcrumbs":5,"title":2},"64":{"body":406,"breadcrumbs":5,"title":2},"65":{"body":502,"breadcrumbs":4,"title":1},"66":{"body":51,"breadcrumbs":5,"title":2},"67":{"body":480,"breadcrumbs":5,"title":2},"68":{"body":806,"breadcrumbs":4,"title":1},"69":{"body":125,"breadcrumbs":4,"title":1},"7":{"body":123,"breadcrumbs":10,"title":4},"70":{"body":777,"breadcrumbs":5,"title":2},"71":{"body":333,"breadcrumbs":6,"title":3},"72":{"body":278,"breadcrumbs":5,"title":2},"73":{"body":18,"breadcrumbs":2,"title":1},"74":{"body":103,"breadcrumbs":6,"title":5},"75":{"body":70,"breadcrumbs":5,"title":4},"76":{"body":91,"breadcrumbs":6,"title":5},"77":{"body":19,"breadcrumbs":6,"title":5},"78":{"body":54,"breadcrumbs":4,"title":2},"79":{"body":0,"breadcrumbs":4,"title":2},"8":{"body":147,"breadcrumbs":10,"title":4},"80":{"body":29,"breadcrumbs":6,"title":4},"81":{"body":80,"breadcrumbs":8,"title":6},"82":{"body":11,"breadcrumbs":4,"title":2},"83":{"body":100,"breadcrumbs":12,"title":6},"84":{"body":11,"breadcrumbs":8,"title":2},"85":{"body":21,"breadcrumbs":14,"title":6},"86":{"body":33,"breadcrumbs":13,"title":5},"87":{"body":37,"breadcrumbs":13,"title":5},"88":{"body":83,"breadcrumbs":16,"title":8},"89":{"body":109,"breadcrumbs":12,"title":4},"9":{"body":110,"breadcrumbs":12,"title":6},"90":{"body":53,"breadcrumbs":13,"title":5},"91":{"body":57,"breadcrumbs":9,"title":4},"92":{"body":84,"breadcrumbs":10,"title":5},"93":{"body":0,"breadcrumbs":11,"title":6},"94":{"body":48,"breadcrumbs":7,"title":2}},"docs":{"0":{"body":"","breadcrumbs":"Quick Start guides » Basic Usage","id":"0","title":"Basic Usage"},"1":{"body":"ONT_logo","breadcrumbs":"Quick Start guides » modkit is a bioinformatics tool for working with modified bases from Oxford Nanopore.","id":"1","title":"modkit is a bioinformatics tool for working with modified bases from Oxford Nanopore."},"10":{"body":"Below is a description of the bedMethyl columns generated by modkit pileup. A brief description of the bedMethyl specification can be found on Encode .","breadcrumbs":"Quick Start guides » Constructing bedMethyl tables » Description of bedMethyl output.","id":"10","title":"Description of bedMethyl output."},"11":{"body":"Nmod - Number of calls passing filters that were classified as a residue with a specified base modification. Ncanonical - Number of calls passing filters were classified as the canonical base rather than modified. The exact base must be inferred by the modification code. For example, if the modification code is m (5mC) then the canonical base is cytosine. If the modification code is a, the canonical base is adenine. Nother mod - Number of calls passing filters that were classified as modified, but where the modification is different from the listed base (and the corresponding canonical base is equal). For example, for a given cytosine there may be 3 reads with h calls, 1 with a canonical call, and 2 with m calls. In the bedMethyl row for h Nother_mod would be 2. In the m row Nother_mod would be 3. Nvalid_cov - the valid coverage. Nvalid_cov = Nmod + Nother_mod + Ncanonical, also used as the score in the bedMethyl Ndiff - Number of reads with a base other than the canonical base for this modification. For example, in a row for h the canonical base is cytosine, if there are 2 reads with C->A substitutions, Ndiff will be 2. Ndelete - Number of reads with a deletion at this reference position Nfail - Number of calls where the probability of the call was below the threshold. The threshold can be set on the command line or computed from the data (usually failing the lowest 10th percentile of calls). Nnocall - Number of reads aligned to this reference position, with the correct canonical base, but without a base modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a CG->CH substitution such that no modification call was produced by the basecaller.","breadcrumbs":"Quick Start guides » Constructing bedMethyl tables » Definitions:","id":"11","title":"Definitions:"},"12":{"body":"column name description type 1 chrom name of reference sequence from BAM header str 2 start position 0-based start position int 3 end position 0-based exclusive end position int 4 modified base code and motif single letter code for modified base and motif when more than one motif is used str 5 score Equal to Nvalid_cov. int 6 strand '+' for positive strand '-' for negative strand, '.' when strands are combined str 7 start position included for compatibility int 8 end position included for compatibility int 9 color included for compatibility, always 255,0,0 str 10 Nvalid_cov See definitions above. int 11 fraction modified Nmod / Nvalid_cov float 12 Nmod See definitions above. int 13 Ncanonical See definitions above. int 14 Nother_mod See definitions above. int 15 Ndelete See definitions above. int 16 Nfail See definitions above. int 17 Ndiff See definitions above. int 18 Nnocall See definitions above. int","breadcrumbs":"Quick Start guides » Constructing bedMethyl tables » bedMethyl column descriptions.","id":"12","title":"bedMethyl column descriptions."},"13":{"body":"The --interval-size, --threads, --chunk-size, and --max-depth parameters can be used to tweak the parallelism and memory consumption of modkit pileup. The defaults should be suitable for most use cases, for more details see the advanced usage and performance considerations sections.","breadcrumbs":"Quick Start guides » Constructing bedMethyl tables » Performance considerations","id":"13","title":"Performance considerations"},"14":{"body":"The adjust-mods subcommand can be used to manipulate MM (and corresponding ML) tags in a modBam. In general, these simple commands are run prior to pileup, visualization, or other analysis. For adjust-mods and update-tags, if a correct MN tag is found, secondary and supplementary alignments will be output. See troubleshooting for details.","breadcrumbs":"Quick Start guides » Updating and adjusting MM tags » Updating and Adjusting MM tags.","id":"14","title":"Updating and Adjusting MM tags."},"15":{"body":"To remove a base modification class from a modBAM and produce a new modBAM, use the --ignore option for adjust-mods. modkit adjust-mods input.bam output.adjust.bam --ignore For example the command below will remove 5hmC calls, leaving just 5mC calls. modkit adjust-mods input.bam output.adjust.bam --ignore h For technical details on the transformation see Removing modification calls from BAMs .","breadcrumbs":"Quick Start guides » Updating and adjusting MM tags » Ignoring a modification class.","id":"15","title":"Ignoring a modification class."},"16":{"body":"Combining base modification probabilities may be desirable for downstream analysis or visualization. Unlike --ignore which removes the probability of a class, --convert will sum the probability of one class with another if the second class already exists. For example, the command below will convert probabilities associated with h probability into m probability. If m already exists, the probabilities will be summed. As described in changing the modification code , if the second base modification code doesn't exist, the probabilities are left unchanged. modkit adjust-mods input.bam output.convert.bam --convert h m","breadcrumbs":"Quick Start guides » Updating and adjusting MM tags » Combining base modification probabilities.","id":"16","title":"Combining base modification probabilities."},"17":{"body":"The specification (Section 1.7) allows for omission of the MM flag, however this may not be the intent of missing base modification probabilities for some models. The command below will add or change the ? flag to a modBAM. modkit adjust-mods input.bam output.bam --mode ambiguous Another option is to set the flag to ., the \"implicitly canonical\" mode: modkit adjust-mods input.bam output.bam --mode implicit","breadcrumbs":"Quick Start guides » Updating and adjusting MM tags » Updating the flag (? and .).","id":"17","title":"Updating the flag (? and .)."},"18":{"body":"Some functions in modkit or other tools may require the mod-codes in the MM tag be in the specification . For example, the following command will change C+Z, tags to C+m, tags. modkit adjust-mods input.bam output.bam --convert Z m","breadcrumbs":"Quick Start guides » Updating and adjusting MM tags » Changing the base modification code.","id":"18","title":"Changing the base modification code."},"19":{"body":"The modkit summary sub-command is intended for collecting read-level statistics on either a sample of reads, a region, or an entire modBam.","breadcrumbs":"Quick Start guides » Summarizing a modBAM » Summarizing a modBAM.","id":"19","title":"Summarizing a modBAM."},"2":{"body":"Pre-compiled binaries are provided for Linux from the release page . We recommend the use of these in most circumstances. As a rust-based project, modkit can also be installed with cargo . git clone https://github.com/nanoporetech/modkit.git\ncd modkit\ncargo install --path .\n# or\ncargo install --git https://github.com/nanoporetech/modkit.git","breadcrumbs":"Quick Start guides » Installation","id":"2","title":"Installation"},"20":{"body":"modkit summary input.bam will output a table similar to this > parsing region chr20 # only present if --region option is provided\n> sampling 10042 reads from BAM # modulated with --num-reads\n> calculating threshold at 10% percentile # modulated with --filter-percentile\n> calculated thresholds: C: 0.7167969 # calculated per-canonical base, on the fly\n# bases C\n# total_reads_used 9989\n# count_reads_C 9989\n# pass_threshold_C 0.7167969\n# region chr20:0-64444167 base code pass_count pass_frac all_count all_frac C m 1192533 0.58716166 1305956 0.5790408 C h 119937 0.0590528 195335 0.086608544 C - 718543 0.3537855 754087 0.33435062","breadcrumbs":"Quick Start guides » Summarizing a modBAM » Summarize the base modification calls in a modBAM.","id":"20","title":"Summarize the base modification calls in a modBAM."},"21":{"body":"","breadcrumbs":"Quick Start guides » Summarizing a modBAM » Description of columns in modkit summary:","id":"21","title":"Description of columns in modkit summary:"},"22":{"body":"The lines of the totals table are prefixed with a # character. row name description type 1 bases comma-separated list of canonical bases with modification calls. str 2 total_reads_used total number of reads from which base modification calls were extracted int 3+ count_reads_{base} total number of reads that contained base modifications for {base} int 4+ filter_threshold_{base} filter threshold used for {base} float","breadcrumbs":"Quick Start guides » Summarizing a modBAM » Totals table","id":"22","title":"Totals table"},"23":{"body":"The modification calls table follows immediately after the totals table. column name description type 1 base canonical base with modification call char 2 code base modification code, or - for canonical char 3 pass_count total number of passing (confidence >= threshold) calls for the modification in column 2 int 4 pass_frac fraction of passing (>= threshold) calls for the modification in column 2 float 5 all_count total number of calls for the modification code in column 2 int 6 all_frac fraction of all calls for the modification in column 2 float For more details on thresholds see filtering base modification calls . By default modkit summary will only use ten thousand reads when generating the summary (or fewer if the modBAM has fewer than that). To use all of the reads in the modBAM set the --no-sampling flag. modkit summary input.bam --no-sampling There are --no-filtering, --filter-percentile, and --filter-threshold options that can be used with or without sampling.","breadcrumbs":"Quick Start guides » Summarizing a modBAM » Modification calls table","id":"23","title":"Modification calls table"},"24":{"body":"To estimate the pass thresholds on a subset of reads, but then summarize all of the reads, there is a two-step process. First, determine the thresholds with modkit sample-probs (see usage for more details). Then run modkit summary with the threshold value specified. modkit sample-probs input.bam [--sampling-frac | --num-reads ] This command will output a table like this: > sampling 10042 reads from BAM base percentile threshold C 10 0.6972656 C 50 0.96484375 C 90 0.9941406 You can then use pass this threshold directly to modkit summary: modkit summary input.bam \\ --filter-threshold 0.6972656 \\ # filter 10% lowest confidence calls --no-sampling","breadcrumbs":"Quick Start guides » Summarizing a modBAM » Passing a threshold directly.","id":"24","title":"Passing a threshold directly."},"25":{"body":"Downstream analysis may require a BED file to select motifs of interest. For example, selecting GATC motifs in E. coli . This command requires a reference sequence in FASTA a motif to find, which can include IUPAC ambiguous bases and a position within the motif. The following command would make a BED file for CG motifs. modkit motif-bed reference.fasta CG 0 1> cg_modifs.bed The output is directed to standard out.","breadcrumbs":"Quick Start guides » Making a motif BED file » Making a motif BED file.","id":"25","title":"Making a motif BED file."},"26":{"body":"The modkit extract sub-command will produce a table containing the base modification probabilities, the read sequence context, and optionally aligned reference information. For extract, if a correct MN tag is found, secondary and supplementary alignments may be output with the --allow-non-primary flag. See troubleshooting for details. The table will by default contain unmapped sections of the read (soft-clipped sections, for example). To only include mapped bases use the --mapped flag. To only include sites of interest, pass a BED-formatted file to the --include-bed option. Similarly, to exclude sites, pass a BED-formatted file to the --exclude option. One caution, the files generated by modkit extract can be large (2-2.5x the size of the BAM). You may want to either use the --num-reads option, the --region option, or pre-filter the modBAM ahead of time. You can also stream the output to stdout by setting the output to - or stdout and filter the columns before writing to disk.","breadcrumbs":"Quick Start guides » Extracting read information to a table » Extracting base modification information","id":"26","title":"Extracting base modification information"},"27":{"body":"column name description type 1 read_id name of the read str 2 forward_read_position 0-based position on the forward-oriented read sequence int 3 ref_position aligned 0-based reference sequence position, -1 means unmapped int 4 chrom name of aligned contig, or '.' if the read is Gunmapped str 5 mod_strand strand of the molecule the base modification is on str 6 ref_strand strand of the reference the read is aligned to, or '.' if unmapped str 7 ref_mod_strand strand of the reference with the base modification, or '.' if unmapped str 8 fw_soft_clipped_start number of bases soft clipped from the start of the forward-oriented read int 9 fw_soft_clipped_end number of bases soft clipped from the end of the forward-oriented read int 10 read_length total length of the read int 11 mod_qual probability of the base modification in the next column int 12 mod_code base modification code from the MM tag str 13 base_qual basecall quality score (phred) int 14 ref_kmer reference 5-mer sequence context (center base is aligned base), '.' if unmapped str 15 query_kmer read 5-mer sequence context (center base is aligned base) str 16 canonical_base canonical base from the query sequence, from the MM tag str 17 modified_primary_base primary sequence base with the modification str 18 inferred whether the base modification call is implicit canonical str 19 flag FLAG from alignment record str","breadcrumbs":"Quick Start guides » Extracting read information to a table » Description of output table","id":"27","title":"Description of output table"},"28":{"body":"Passing --read-calls option will generate a table of read-level base modification calls using the same thresholding algorithm employed by modkit pileup. The resultant table has, for each read, one row for each base modification call in that read. If a base is called as modified then call_code will be the code in the MM tag. If the base is called as canonical the call_code will be - (A, C, G, and T are reserved for \"any modification\"). The full schema of the table is below: column name description type 1 read_id name of the read str 2 forward_read_position 0-based position on the forward-oriented read sequence int 3 ref_position aligned 0-based reference sequence position, -1 means unmapped int 4 chrom name of aligned contig, or '.' if unmapped str 5 mod_strand strand of the molecule the base modification is on str 6 ref_strand strand of the reference the read is aligned to, or '.' if unmapped str 7 ref_mod_strand strand of the reference with the base modification, or '.' if unmapped str 8 fw_soft_clipped_start number of bases soft clipped from the start of the forward-oriented read int 9 fw_soft_clipped_end number of bases soft clipped from the end of the forward-oriented read int 10 read_length total length of the read int 11 call_prob probability of the base modification call in the next column int 12 call_code base modification call, - indicates a canonical call str 13 base_qual basecall quality score (phred) int 14 ref_kmer reference 5-mer sequence context (center base is aligned base), '.' if unmapped str 15 query_kmer read 5-mer sequence context (center base is aligned base) str 16 canonical_base canonical base from the query sequence, from the MM tag str 17 modified_primary_base primary sequence base with the modification str 18 fail true if the base modification call fell below the pass threshold str 19 inferred whether the base modification call is implicit canonical str 20 within_alignment when alignment information is present, is this base aligned to the reference str 21 flag FLAG from alignment record str","breadcrumbs":"Quick Start guides » Extracting read information to a table » Tabulating base modification calls for each read position","id":"28","title":"Tabulating base modification calls for each read position"},"29":{"body":"The . MM flag indicates that primary sequence bases without an associated base modification probability should be inferred to be canonical. By default, when this flag is encountered in a modBAM, modkit extract will output rows with the inferred column set to true and a mod_qual value of 0.0 for the base modifications called on that read. For example, if you have a A+a. MM tag, and there are A bases in the read for which there aren't base modification calls (identifiable as non-0s in the MM tag) will be rows where the mod_code is a and the mod_qual is 0.0.","breadcrumbs":"Quick Start guides » Extracting read information to a table » Note on implicit base modification calls.","id":"29","title":"Note on implicit base modification calls."},"3":{"body":"Creating a bedMethyl table with pileup Updating and Adjusting MM tags with adjust-mods and update-tags Summarizing a modBAM with summary Making a motif BED file with motif-bed Extracting per-read base modification data into a table Convert modification probabilities into hard calls Removing base modification calls at the ends of reads Narrow analysis to only specific positions with a BED file Repairing/adding MM/ML tags to reads with clipped sequences Creating hemi-methylation pattern bedMethyl tables with pileup-hemi Performing differential methylation scoring with dmr","breadcrumbs":"Quick Start guides » Common Use Cases","id":"3","title":"Common Use Cases"},"30":{"body":"If a valid MN tag is found, secondary and supplementary alignments can be output in the modkit extract tables above. See troubleshooting for details on how to get valid MN tags. To have non-primary alignments appear in the output, the --allow-non-primary flag must be passed. By default, the primary alignment will have all base modification information contained on the read, including soft-clipped and unaligned read positions. If the --mapped-only flag is used, soft clipped sections of the read will not be included. For secondary and supplementary alignments, soft-clipped positions are not repeated. See advanced usage for more details.","breadcrumbs":"Quick Start guides » Extracting read information to a table » Note on non-primary alignments","id":"30","title":"Note on non-primary alignments"},"31":{"body":"","breadcrumbs":"Quick Start guides » Extracting read information to a table » Example usages:","id":"31","title":"Example usages:"},"32":{"body":"modkit extract If the index input.bam.bai can be found, intervals along the aligned genome can be performed in parallel.","breadcrumbs":"Quick Start guides » Extracting read information to a table » Extract a table from an aligned and indexed BAM","id":"32","title":"Extract a table from an aligned and indexed BAM"},"33":{"body":"The below example will extract reads from only chr20, and include reference sequence context modkit extract --region chr20 --ref ","breadcrumbs":"Quick Start guides » Extracting read information to a table » Extract a table from a region of a large modBAM","id":"33","title":"Extract a table from a region of a large modBAM"},"34":{"body":"modkit motif-bed CG 0 > CG_motifs.bed\nmodkit extract --ref --include-bed CG_motifs.bed","breadcrumbs":"Quick Start guides » Extracting read information to a table » Extract only sites aligned to a CG motif","id":"34","title":"Extract only sites aligned to a CG motif"},"35":{"body":"modkit extract --edge-filter 50","breadcrumbs":"Quick Start guides » Extracting read information to a table » Extract only sites that are at least 50 bases from the ends of the reads","id":"35","title":"Extract only sites that are at least 50 bases from the ends of the reads"},"36":{"body":"modkit extract null --read-calls Using \"null\" in the place of the normal output will direct the normal extract output to /dev/null, to keep this output specify a file or - for standard out. modkit extract --read-calls Use --allow-non-primary to get secondary and supplementary mappings in the output. modkit extract --read-calls --allow-non-primary See the help string and/or advanced_usage for more details.","breadcrumbs":"Quick Start guides » Extracting read information to a table » Extract read-level base modification calls","id":"36","title":"Extract read-level base modification calls"},"37":{"body":"The call-mods subcommand in modkit transforms one modBAM into another modBAM where the base modification probabilities have been clamped to 100% and 0%. If the --filter-threshold and/or --mod-threshold options are provided, base modification calls failing the threshold will be removed prior to changing the probabilities. The output modBAM can be used for visualization, pileup, or other applications. For call-mods, if a correct MN tag is found, secondary and supplementary alignments will be output. See troubleshooting for details. A modBAM that has been transformed with call-mods using --filter-threshold and/or --mod-threshold cannot be re-transformed with different thresholds. Note on pileup with clamped probabilities: modkit pileup will attempt to estimate the threshold probability by default, but it is unnecessary if the modBAM is the result of call-mods. The threshold probabilities will be artificially high (i.e. not representative of the model's output probabilities). Similarly, specifying --filter-threshold and --mod-threshold is not useful because all the ML probabilities have been set to 0 and 100%.","breadcrumbs":"Quick Start guides » Calling mods in a modBAM » Calling mods in a modBAM","id":"37","title":"Calling mods in a modBAM"},"38":{"body":"","breadcrumbs":"Quick Start guides » Calling mods in a modBAM » Example usages","id":"38","title":"Example usages"},"39":{"body":"modkit call-mods ","breadcrumbs":"Quick Start guides » Calling mods in a modBAM » Estimate the threshold on the fly, apply to modBAM and clamp the modification calls to certainty.","id":"39","title":"Estimate the threshold on the fly, apply to modBAM and clamp the modification calls to certainty."},"4":{"body":"General troubleshooting Threshold evaluation examples (for advanced users)","breadcrumbs":"Quick Start guides » Notes and troubleshooting","id":"4","title":"Notes and troubleshooting"},"40":{"body":"modkit call-mods --filter-threshold A:0.9 --mod-threshold a:0.95 --filter-threshold C:0.97","breadcrumbs":"Quick Start guides » Calling mods in a modBAM » Specify a filter threshold for your use-case","id":"40","title":"Specify a filter threshold for your use-case"},"41":{"body":"modkit call-mods --edge-filter 100","breadcrumbs":"Quick Start guides » Calling mods in a modBAM » Call mods with the estimated threshold and ignore modification calls within 100 base pairs of the ends of the reds","id":"41","title":"Call mods with the estimated threshold and ignore modification calls within 100 base pairs of the ends of the reds"},"42":{"body":"If you have reads where you know base modifications near the ends should not be used (for example, if they are in adapters), you can use the --edge-filter option. Two comma-separated values may be provided to asymmetrically filter out base modification calls from the start and end of reads. For example, 4,8 will filter out base modification calls in the first 4 and last 8 bases of the read. One value will filter symmetrically. pileup, will ignore base modification calls that are from the ends. adjust-mods, will remove base modification calls that are from the ends from the resultant output modBAM. summary, will ignore base modification calls that are from the ends. sample-probs, will ignore base modification calls that are from the ends. call-mods, will remove base modification calls that are from the ends from the resultant output modBAM. extract, will ignore base modification calls that are from the ends, this also applies when making the read-calls table (see intro to extract ). In pileup, call-mods, and extract the edge-filter is also respected when estimating the pass-thresholds. All commands have the flag --invert-edge-filter that will keep only base modification probabilities within of the ends of the reads.","breadcrumbs":"Quick Start guides » Removing modification calls at the ends of reads » Removing modification calls at the ends of reads","id":"42","title":"Removing modification calls at the ends of reads"},"43":{"body":"","breadcrumbs":"Quick Start guides » Removing modification calls at the ends of reads » Example usages","id":"43","title":"Example usages"},"44":{"body":"modkit call-mods --edge-filter 100","breadcrumbs":"Quick Start guides » Removing modification calls at the ends of reads » call mods with the estimated threshold and ignore modification calls within 100 base pairs of the ends of the reads","id":"44","title":"call mods with the estimated threshold and ignore modification calls within 100 base pairs of the ends of the reads"},"45":{"body":"modkit pileup --edge-filter 100 Filter out base modification calls within the first 25 bases or the last 10 bases. modkit pileup --edge-filter 25,10","breadcrumbs":"Quick Start guides » Removing modification calls at the ends of reads » Perform pileup, ignoring base modification calls within 100 base pairs of the ends of the reads","id":"45","title":"Perform pileup, ignoring base modification calls within 100 base pairs of the ends of the reads"},"46":{"body":"The pileup, sample-probs, summary, and extract sub commands have a --include-bed (or --include-positions) option that will restrict analysis to only positions that overlap with the intervals contained within the BED file. In the case of pileup, summary, and sample-probs, the pass-threshold will be estimated with only base modification probabilities that are aligned to positions overlapping intervals in the BED. In the case of pileup and extract only positions will be reported if they overlap intervals in the BED.","breadcrumbs":"Quick Start guides » Narrow output to specific positions » Narrow output to specific positions","id":"46","title":"Narrow output to specific positions"},"47":{"body":"The modkit repair command is useful when you have a BAM with reads where the canonical sequences have been altered in some way that either renders the MM and ML tags invalid (for example, trimmed or hard-clipped) or the data has been lost completely. This command requires that you have the original base modification calls for each read you want to repair, and it will project these base modification calls onto the sequences in the altered BAM. The command uses two arguments called the \"donor\" and the \"acceptor\". The donor, contains the original, correct, MM and ML tags and the acceptor is either missing MM and ML tags or they are invalid (they will be discarded either way). The reads in the donor must be a superset of the reads in the acceptor, meaning you can have extra reads in the donor BAM if some reads have been removed or filtered earlier in the workflow. Both the donor and the acceptor must be sorted by read name prior to running modkit repair. Duplicate reads in the acceptor are allowed so long as they have valid SEQ fields. Lastly, modkit repair only works on reads that have been trimmed, other kinds of alteration such as run-length-encoding are not currently supported. Split reads, or other derived transformations, are not currently repairable with this command. For example a typical workflow may look like this: # original base modification calls\nbasecalls_5mC_5hmC.bam # basecalls that have been trimmed\ntrimmed.bam # could also be fastq, but would require conversion to BAM # the two BAM files need to be sorted\nsamtools -n trimmed.bam -O BAM > trimed_read_sort.bam samtools -n basecalls_5mC_5hmC.bam -O BAM > basecalls_5mC_5hmC_read_sort.bam modkit repair \\ --donor-bam basecalls_5mC_5hmC_read_sort.bam \\ --acceptor-bam trimed_read_sort.bam \\ --log-filepath modkit_repair.log \\ --output-bam trimmed_repaired.bam","breadcrumbs":"Quick Start guides » Repair MM/ML tags on trimmed reads » Repair MM/ML tags on trimmed reads","id":"47","title":"Repair MM/ML tags on trimmed reads"},"48":{"body":"Base modifications in DNA are inherently single-stranded, they (usually [1] ) don't change the base pairing of the modified base. However, it may be of interest to know the correspondence between the methylation state of a single base and another nearby base on the opposite strand - on the same molecule. In CpG dinucleotides, this is called \"hemi-methylation\", when one cytosine is methylated and the neighbor on the opposite strand is not: m\n5'GATCGTACA CTAGCATGT - In the above diagram, the cytosine in the fourth position on the positive strand is methylated (5mC) and the cytosine in the fifth position is canonical (-), indicating a \"hemi-methylation\". In the case of 5mC and canonical, there are 4 \"patterns\" of methylation: m,m (5mC, 5mC)\n-,m (canonical, 5mC)\nm,- (5mC, canonical)\n-,- (canonical, canonical) These are all measured at the single molecule level, meaning each molecule must report on both strands (as is the case with duplex reads). For CpGs in the example above the MM tags would be C+m? and G-m? for the top-strand and bottom-strand cytosines, respectively. The modkit pileup-hemi command will perform an aggregation of the methylation \"patterns\" at genomic positions. An example command to perform hemi-methylation analysis at CpGs would be modkit pileup-hemi \\ /path/to/duplex_reads.bam \\ --cpg \\ -r /path/to/reference.fasta \\ -o hemi_pileup.bed \\ --log modkit.log Many of the pileup options are available in pileup-hemi with a couple differences: : A motif must be provided. The --cpg flag is a preset to aggregate CpG hemi-methylation patterns as shown above. If a motif is provided (as an argument to --motif) it must be reverse-complement palindromic. A reference must be provided. Both the positive strand base modification probability and the negative strand base modification probability must be above the pass threshold. See Advanced Usage for details on all the options.","breadcrumbs":"Quick Start guides » Make hemi-methylation bedMethyl tables » Make hemi-methylation bedMethyl tables with pileup-hemi","id":"48","title":"Make hemi-methylation bedMethyl tables with pileup-hemi"},"49":{"body":"The modkit pileup-hemi command aggregates a pair of base modification calls at each reference motif position for each double-stranded DNA molecule. The base modification \"pattern\" indicates the methylation state on each base in 5-prime to 3-prime order, using the base modification code to indicate the identity of the base modification and - to indicate canonical (unmodified). For example m,-,C would mean the first base (from the reference 5' direction) is 5mC and the second base is unmodified and the primary base is cytosone. Similarly, h,m,C indicates the first base is 5hmC and the second base is 5mC. The primary base called by the read is included to help disambiguate the unmodified patterns (-,-). All patterns recognized at a location will be reported in the bedMethyl output.","breadcrumbs":"Quick Start guides » Make hemi-methylation bedMethyl tables » Description of hemi-methylation patterns","id":"49","title":"Description of hemi-methylation patterns"},"5":{"body":"A primary use of modkit is to create summary counts of modified and unmodified bases in an extended bedMethyl format. bedMethyl files tabulate the counts of base modifications from every sequencing read over each aligned reference genomic position. In order to create a bedMethyl table, your modBAM must be aligned to a reference genome. The genome sequence is only required if you are using the --cpg flag or traditional preset. Only primary alignments are used in generating the table, it is recommended to mark duplicate alignments before running as multiple primary alignments can be double counted (but the behavior is logged). See limitations for details.","breadcrumbs":"Quick Start guides » Constructing bedMethyl tables » Constructing bedMethyl tables.","id":"5","title":"Constructing bedMethyl tables."},"50":{"body":"Npattern - Number of call-pairs passing filters that had the pattern and primary base in column 4. E.g. m,-,C indicates the first base in the 5' to 3' direction is 5mC, the second base is unmodified and the primary base in the reads was C. Ncanonical - Number of call-pairs passing filters that were classified as unmodified (i.e. the pattern is -,-). Nother_pattern - Number of call-pairs passing filters where the pattern is different from the pattern in column 4, but where the primary read base is the same. This count includes the unmodified pattern (-,-). Note this differs from pileup where Nother does not contain the canonical counts. Nvalid_cov - the valid coverage, total number of valid call-pairs. Ndiff - Number of reads with a primary base other than the primary base in column 4. Ndelete - Number of reads with a deletion at this reference position. Nfail - Number of call-pairs where the probability of the at least one of the calls in the pair was below the pass threshold. The threshold can be set on the command line or computed from the data (usually failing the lowest 10th percentile of calls). Nnocall - Number of reads where either one or both of the base modification calls was not present in the read.","breadcrumbs":"Quick Start guides » Make hemi-methylation bedMethyl tables » Definitions:","id":"50","title":"Definitions:"},"51":{"body":"column name description type 1 chrom name of reference sequence from BAM header str 2 start position 0-based start position int 3 end position 0-based exclusive end position int 4 methylation pattern comma-separated pair of modification codes - means canonical, followed by the primary read base str 5 score Equal to Nvalid_cov. int 6 strand always '.' because strand information is combined str 7 start position included for compatibility int 8 end position included for compatibility int 9 color included for compatibility, always 255,0,0 str 10 Nvalid_cov See definitions above. int 11 fraction modified Npattern / Nvalid_cov float 12 Npattern See definitions above. int 13 Ncanonical See definitions above. int 14 Nother_pattern See definitions above. int 15 Ndelete See definitions above. int 16 Nfail See definitions above. int 17 Ndiff See definitions above. int 18 Nnocall See definitions above. int","breadcrumbs":"Quick Start guides » Make hemi-methylation bedMethyl tables » bedMethyl column descriptions.","id":"51","title":"bedMethyl column descriptions."},"52":{"body":"Only one motif can be used at a time, this limitation may be removed in a later version. Partitioning on tag key:value pairs is not currently supported. [1] In biology, there are almost always exceptions to every rule!","breadcrumbs":"Quick Start guides » Make hemi-methylation bedMethyl tables » Limitations","id":"52","title":"Limitations"},"53":{"body":"The modkit dmr command contains two subcommands, pair and multi, that will compare a pair of samples and multiple samples, respectively. The details of multi are the same as pair ( it simply does all the pairwise comparisons), so most of the description below will focus on how to run pair and how to interpret the outputs.","breadcrumbs":"Quick Start guides » Perform differential methylation scoring » Perform differential methylation scoring","id":"53","title":"Perform differential methylation scoring"},"54":{"body":"The inputs to modkit dmr are two or more bedMethyl files (created by modkit pileup) that have been compressed with bgzip and indexed with tabix . An example workflow to generate the input data is shown below: ref=grch38.fasta\nthreads=32 norm=normal_sample.bam\nnorm_pileup=normal_pileup.bed modkit pileup ${norm} ${norm_pileup} \\ --cpg \\ --ref ${ref} \\ --threads ${threads} \\ --log-filepath log.txt bgzip ${norm_pileup}\ntabix ${norm_pileup}.gz tumor=tumor_sample.bam\ntumor_pileup=tumor_pileup.bed modkit pileup ${tumor} ${tumor_pileup} \\ --cpg \\ --ref ${ref} \\ --threads ${threads} \\ --log-filepath log.txt bgzip ${tumor_pileup}\ntabix ${tumor_pileup}.gz","breadcrumbs":"Quick Start guides » Perform differential methylation scoring » Preparing the input data","id":"54","title":"Preparing the input data"},"55":{"body":"Once you have the two (or more) samples to be compared in the appropriate format, the final piece necessary is a BED file of the regions to be compared. The modkit dmr functionality does not \"segment\" or otherwise discover regions, it scores the differences between user-provided regions. To continue with the above example we can get CpG Islands from the UCSC table browser . The data may not always be appropriate input for modkit. For example, the CpG Islands track has extra columns and a header line: #bin chrom chromStart chromEnd name length cpgNum gcNum perCpg perGc obsExp\n1065 chr20 63004819 63007703 CpG: 272 2884 272 1869 18.9 64.8 0.9\n1065 chr20 63009128 63009816 CpG: 48 688 48 432 14 62.8 0.71 Therefore, we need to transform the data with awk or similar, such as: awk 'BEGIN{FS=\"\\t\"; OFS=\"\\t\"} NR>1 {print $2, $3, $4, $5}' cpg_islands_ucsc.bed \\ | bedtools sort -i - > cpg_islands_ucsc_cleaned.bed Keeping the name column is optional. Sorting the regions isn't strictly necessary, the output will be in the same order as the regions file. Below is an example command to produce the scored output (continuing from the top example). The --base option tells modkit dmr which bases to use for scoring the differences, the argument should be a canonical nucleotide (A, C, G, or T) whichever primary sequence base has the modifications you're interested in capturing. For example, for CpG islands the base we're interested in is C. regions=cpg_islands_ucsc_cleaned.bed\ndmr_result=cpg_islands_tumor_normal.bed modkit dmr pair \\ -a ${norm_pileup}.gz \\ --index-a ${norm_pileup}.gz.tbi \\ # optional -b ${tumor_pileup}.gz \\ --index-b ${tumor_pileup}.gz.tbi \\ # optional -o ${dmr_result} \\ # output to stdout if not present -r ${regions} \\ --ref ${ref} \\ --base C \\ # may be repeated if multiple modifications are being used --threads ${threads} \\ --log-filepath dmr.log","breadcrumbs":"Quick Start guides » Perform differential methylation scoring » Running differential methylation scoring","id":"55","title":"Running differential methylation scoring"},"56":{"body":"To score individual bases (e.g. differentially methylated CpGs), simply omit the --regions (-r) option when running modkit dmr [pair|multi]. For example the above command becomes: dmr_result=cpg_islands_tumor_normal.bed modkit dmr pair \\ -a ${norm_pileup}.gz \\ --index-a ${norm_pileup}.gz.tbi \\ # optional -b ${tumor_pileup}.gz \\ --index-b ${tumor_pileup}.gz.tbi \\ # optional -o ${dmr_result} \\ # output to stdout if not present --ref ${ref} \\ --base C \\ # may be repeated if multiple modifications are being used --threads ${threads} \\ --log-filepath dmr.log","breadcrumbs":"Quick Start guides » Perform differential methylation scoring » Scoring differentially methylated bases (as opposed to regions)","id":"56","title":"Scoring differentially methylated bases (as opposed to regions)"},"57":{"body":"The modkit dmr multi command runs all pairwise comparisons for more than two samples. The preparation of the data is identical to that for dmr pair (for each sample, of course). An example command could be: modkit dmr multi \\ -s ${norm_pileup_1}.gz norm1 \\ -s ${tumor_pileup_1}.gz tumor1 \\ -s ${norm_pileup_2}.gz norm2 \\ -s ${tumor_pileup_2}.gz tumor2 \\ -o ${dmr_dir} \\ # required for multi -r ${cpg_islands} \\ # skip this option to perform base-level DMR --ref ${ref} \\ --base C \\ -t 10 \\ -f \\ --log-filepath dmr_multi.log For example the samples could be haplotype-partitioned bedMethyl tables or biological replicates. Unlike for modkit dmr pair a sample name (e.g. norm1 and tumor1 above) must be provided for each input sample. You can also use --index to specify where the tabix index file is for each sample.","breadcrumbs":"Quick Start guides » Perform differential methylation scoring » Running multiple samples","id":"57","title":"Running multiple samples"},"58":{"body":"The output from modkit dmr pair (and for each pairwise comparison with modkit dmr multi) is (roughly) a BED file with the following schema: column name description type 1 chrom name of reference sequence from bedMethyl input samples str 2 start position 0-based start position, from --regions argument int 3 end position 0-based exclusive end position, from --regions argument int 4 name name column from --regions BED, or chr:start-stop if absent str 5 score Difference score, more positive values have increased difference float 6 samplea counts Counts of each base modification in the region, comma-separated, for sample A str 7 samplea total Total number of base modification calls in the region, including unmodified, for sample A str 8 sampleb counts Counts of each base modification in the region, comma-separated, for sample B str 9 sampleb total Total number of base modification calls in the region, including unmodified, for sample B str 10 samplea fractions Fraction of calls for each base modification in the region, comma-separated, for sample A str 11 sampleb fractions Fraction of calls for each base modification in the region, comma-separated, for sample B str an example of the output is given below: chr10 73861 74083 chr10:73861-74083 -0.5007740865394226 h:7,m:18 950 h:8,m:16 802 h:0.74,m:1.89 h:1.00,m:2.00\nchr10 74090 74289 chr10:74090-74289 0.5533780473006118 h:8,m:5 936 h:3,m:7 853 h:0.85,m:0.53 h:0.35,m:0.82\nchr10 76139 76313 chr10:76139-76313 1.334274110255592 h:6,m:46 507 h:13,m:35 446 h:1.18,m:9.07 h:2.91,m:7.85","breadcrumbs":"Quick Start guides » Perform differential methylation scoring » Differential methylation output format","id":"58","title":"Differential methylation output format"},"59":{"body":"The aim of modkit dmr is to enable exploratory data analysis of methylation patterns. To that aim, the approach to scoring methylation differences is intended to be simple and interpretable. For every region provided, within a sample, we model each potentially methylated base as arising from the same distribution. In other words, we discard the relative ordering of the base modification calls within a region. We then define a model for the frequency of observing each base modification state. In the case of methylated versus unmodified (5mC vs C, or 6mA vs A), we use the binomial distribution and model the probability of methylation \\(p\\) as a beta-distributed random variable: \\[ \\mathbf{X}|p \\sim \\text{Bin}(n, p) \\] \\[ p \\sim \\text{Beta}(\\alpha, \\beta) \\] where \\(n\\) is the number of potentially methylated bases reported on in the region and \\(\\mathbf{X}\\) is the vector of counts (canonical and methylated). In the case where there are more than two states (for example, 5hmC, 5mC, and unmodified C) we use a multinomial distribution and a Dirichlet as the base distribution: \\[ \\mathbf{X}|\\pi \\sim \\text{Mult}(n, \\pi) \\] \\[ \\pi \\sim \\text{Dir}(\\alpha) \\] Let \\(\\theta\\) be the parameters describing the posterior distribution ( \\( \\alpha, \\beta \\) for the binary case, and \\(\\alpha \\) in the general case). The score reported is the result of the following log marginal likelihood ratio : \\[ \\text{score} = \\text{log}(\\frac{l_{\\theta_{a}}( \\mathbf{X_a} ) l_{\\theta_{b}} ( \\mathbf{X_b} )}{l_{\\theta_{a+b}} (\\mathbf{X_{a+b}} )}) \\] Where \\(\\theta_a\\) and \\(\\theta_b\\) are the posterior distributions with the two conditions modeled separately, and \\(\\theta_{a+b}\\) is the posterior when the two conditions are modeled together. The function \\(l_{\\theta}(\\mathbf{X}) \\) is the log marginal likelihood of the counts under the parameters of the model \\(\\theta\\). For all cases, we use Jeffrey's prior as the prior distribution.","breadcrumbs":"Quick Start guides » Perform differential methylation scoring » Scoring details","id":"59","title":"Scoring details"},"6":{"body":"In its simplest form modkit creates a bedMethyl file using the following: modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log No reference sequence is required. A single file (described below ) with base count summaries will be created. The final argument here specifies an optional log file output. The program performs best-practices filtering and manipulation of the raw data stored in the input file. For further details see filtering modified-base calls .","breadcrumbs":"Quick Start guides » Constructing bedMethyl tables » Basic usage","id":"6","title":"Basic usage"},"60":{"body":"The goal of modkit is to enable best-practices manipulation of BAM files containing modified base information (modBAMs). The various sub-commands and tools available in modkit are described below. This information can be obtained by invoking the long help (--help) for each command. Advanced usage information. Usage: modkit Commands: pileup Tabulates base modification calls across genomic positions. This command produces a bedMethyl formatted file. Schema and description of fields can be found in the README. adjust-mods Performs various operations on BAM files containing base modification information, such as converting base modification codes and ignoring modification calls. Produces a BAM output file. update-tags Renames Mm/Ml to tags to MM/ML. Also allows changing the the mode flag from silent '.' to explicitly '?' or '.'. sample-probs Calculate an estimate of the base modification probability distribution. summary Summarize the mod tags present in a BAM and get basic statistics. The default output is a totals table (designated by '#' lines) and a modification calls table. Descriptions of the columns can be found in the README. call-mods Call mods from a modbam, creates a new modbam with probabilities set to 100% if a base modification is called or 0% if called canonical. motif-bed Create BED file with all locations of a sequence motif. Example: modkit motif-bed CG 0 extract Extract read-level base modification information from a modBAM into a tab-separated values table. repair Repair MM and ML tags in one bam with the correct tags from another. To use this command, both modBAMs _must_ be sorted by read name. The \"donor\" modBAM's reads must be a superset of the acceptor's reads. Extra reads in the donor are allowed, and multiple reads with the same name (secondary, etc.) are allowed in the acceptor. Reads with an empty SEQ field cannot be repaired and will be rejected. Reads where there is an ambiguous alignment of the acceptor to the donor will be rejected (and logged). See the full documentation for details. dmr Perform DMR test on a set of regions. Output a BED file of regions with the score column indicating the magnitude of the difference. Find the schema and description of fields can in the README as well as a description of the model and method. See subcommand help for additional details. pileup-hemi Tabulates double-stranded base modification patters (such as hemi-methylation) across genomic motif positions. This command produces a bedMethyl file, the schema can be found in the online documentation. help Print this message or the help of the given subcommand(s). Options: -h, --help Print help information. -V, --version Print version information.","breadcrumbs":"Extended subcommand help » modkit, subcommand documentation","id":"60","title":"modkit, subcommand documentation"},"61":{"body":"Tabulates base modification calls across genomic positions. This command produces a bedMethyl\nformatted file. Schema and description of fields can be found in the README. Usage: modkit pileup [OPTIONS] Arguments: Input BAM, should be sorted and have associated index available. Output file (or directory with --bedgraph option) to write results into. Specify \"-\" or \"stdout\" to direct output to stdout. Options: --log-filepath Specify a file for debug logs to be written to, otherwise ignore them. Setting a file is recommended. (alias: log) --region Process only the specified region of the BAM when performing pileup. Format should be :- or . Commas are allowed. --max-depth Maximum number of records to use when calculating pileup. This argument is passed to the pileup engine. If you have high depth data, consider increasing this value substantially. Must be less than 2147483647 or an error will be raised. [default: 8000] -t, --threads Number of threads to use while processing chunks concurrently. [default: 4] -i, --interval-size Interval chunk size in base pairs to process concurrently. Smaller interval chunk sizes will use less memory but incur more overhead. [default: 100000] --chunk-size Break contigs into chunks containing this many intervals (see `interval_size`). This option can be used to help prevent excessive memory usage, usually with no performance penalty. By default, modkit will set this value to 1.5x the number of threads specified, so if 4 threads are specified the chunk_size will be 6. A warning will be shown if this option is less than the number of threads specified. --suppress-progress Hide the progress bar. -n, --num-reads Sample this many reads when estimating the filtering threshold. Reads will be sampled evenly across aligned genome. If a region is specified, either with the --region option or the --sample-region option, then reads will be sampled evenly across the region given. This option is useful for large BAM files. In practice, 10-50 thousand reads is sufficient to estimate the model output distribution and determine the filtering threshold. [default: 10042] -f, --sampling-frac Sample this fraction of the reads when estimating the pass-threshold. In practice, 10-100 thousand reads is sufficient to estimate the model output distribution and determine the filtering threshold. See filtering.md for details on filtering. --seed Set a random seed for deterministic running, the default is non-deterministic. --no-filtering Do not perform any filtering, include all mod base calls in output. See filtering.md for details on filtering. -p, --filter-percentile Filter out modified base calls where the probability of the predicted variant is below this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence modification calls. [default: 0.1] --filter-threshold Specify the filter threshold globally or per-base. Global filter threshold can be specified with by a decimal number (e.g. 0.75). Per-base thresholds can be specified by colon-separated values, for example C:0.75 specifies a threshold value of 0.75 for cytosine modification calls. Additional per-base thresholds can be specified by repeating the option: for example --filter-threshold C:0.75 --filter-threshold A:0.70 or specify a single base option and a default for all other bases with: --filter-threshold A:0.70 --filter-threshold 0.9 will specify a threshold value of 0.70 for adenine and 0.9 for all other base modification calls. --mod-thresholds Specify a passing threshold to use for a base modification, independent of the threshold for the primary sequence base or the default. For example, to set the pass threshold for 5hmC to 0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as usual and used for canonical cytosine and other modifications unless the `--filter-threshold` option is also passed. See the online documentation for more details. --sample-region Specify a region for sampling reads from when estimating the threshold probability. If this option is not provided, but --region is provided, the genomic interval passed to --region will be used. Format should be :- or . --sampling-interval-size Interval chunk size in base pairs to process concurrently when estimating the threshold probability, can be larger than the pileup processing interval. [default: 1000000] --include-bed BED file that will restrict threshold estimation and pileup results to positions overlapping intervals in the file. (alias: include-positions) --include-unmapped Include unmapped base modifications when estimating the pass threshold. --ignore Ignore a modified base class _in_situ_ by redistributing base modification probability equally across other options. For example, if collapsing 'h', with 'm' and canonical options, half of the probability of 'h' will be added to both 'm' and 'C'. A full description of the methods can be found in collapse.md. --force-allow-implicit Force allow implicit-canonical mode. By default modkit does not allow pileup with the implicit mode (e.g. C+m, no '.' or '?'). The `update-tags` subcommand is provided to update tags to the new mode. This option allows the interpretation of implicit mode tags: residues without modified base probability will be interpreted as being the non-modified base. --motif Output pileup counts for only sequence motifs provided. The first argument should be the sequence motif and the second argument is the 0-based offset to the base to pileup base modification counts for. For example: --motif CGCG 0 indicates to pileup counts for the first C on the top strand and the last C (complement to G) on the bottom strand. The --cpg argument is short hand for --motif CG 0. This argument can be passed multiple times. When more than one motif is used, the resulting output BED file will indicate the motif in the \"name\" field as ,,. For example, given `--motif CGCG 2 --motif CG 0` there will be output lines with name fields such as \"m,CG,0\" and \"m,CGCG,2\". To use this option with `--combine-strands`, all motifs must be reverse-complement palindromic or an error will be raised. --cpg Only output counts at CpG motifs. Requires a reference sequence to be provided. -r, --ref Reference sequence in FASTA format. Required for CpG motif filtering. -k, --mask Respect soft masking in the reference FASTA. --preset Optional preset options for specific applications. traditional: Prepares bedMethyl analogous to that generated from other technologies for the analysis of 5mC modified bases. Shorthand for --cpg --combine-strands --ignore h. [possible values: traditional] --combine-mods Combine base modification calls, all counts of modified bases are summed together. See collapse.md for details. --combine-strands When performing motif analysis (such as CpG), sum the counts from the positive and negative strands into the counts for the positive strand position. --edge-filter Discard base modification calls that are this many bases from the start or the end of the read. Two comma-separated values may be provided to asymmetrically filter out base modification calls from the start and end of the reads. For example, 4,8 will filter out base modification calls in the first 4 and last 8 bases of the read. --invert-edge-filter Invert the edge filter, instead of filtering out base modification calls at the ends of reads, only _keep_ base modification calls at the ends of reads. E.g. if usually, \"4,8\" would remove (i.e. filter out) base modification calls in the first 4 and last 8 bases of the read, using this flag will keep only base modification calls in the first 4 and last 8 bases. --only-tabs For bedMethyl output, separate columns with only tabs. The default is to use tabs for the first 10 fields and spaces thereafter. The default behavior is more likely to be compatible with genome viewers. Enabling this option may make it easier to parse the output with tabular data handlers that expect a single kind of separator. --bedgraph Output bedGraph format, see https://genome.ucsc.edu/goldenPath/help/bedgraph.html. For this setting, specify a directory for output files to be make in. Two files for each modification will be produced, one for the positive strand and one for the negative strand. So for 5mC (m) and 5hmC (h) there will be 4 files produced. --prefix Prefix to prepend on bedgraph output file names. Without this option the files will be _.bedgraph. --partition-tag Partition output into multiple bedMethyl files based on tag-value pairs. The output will be multiple bedMethyl files with the format `___.bed` prefix is optional and set with the `--prefix` flag. -h, --help Print help information (use `-h` for a summary)","breadcrumbs":"Extended subcommand help » pileup","id":"61","title":"pileup"},"62":{"body":"Performs various operations on BAM files containing base modification information, such as\nconverting base modification codes and ignoring modification calls. Produces a BAM output file Usage: modkit adjust-mods [OPTIONS] Arguments: BAM file to collapse mod call from. Can be a path to a file or one of `-` or `stdin` to specify a stream from standard input. File path to new BAM file to be created. Can be a path to a file or one of `-` or `stdin` to specify a stream from standard output. Options: --log-filepath Output debug logs to file at this path. --ignore Modified base code to ignore/remove, see https://samtools.github.io/hts-specs/SAMtags.pdf for details on the modified base codes. -t, --threads Number of threads to use. [default: 4] -f, --ff Fast fail, stop processing at the first invalid sequence record. Default behavior is to continue and report failed/skipped records at the end. --convert Convert one mod-tag to another, summing the probabilities together if the retained mod tag is already present. --edge-filter Discard base modification calls that are this many bases from the start or the end of the read. Two comma-separated values may be provided to asymmetrically filter out base modification calls from the start and end of the reads. For example, 4,8 will filter out base modification calls in the first 4 and last 8 bases of the read. --invert-edge-filter Invert the edge filter, instead of filtering out base modification calls at the ends of reads, only _keep_ base modification calls at the ends of reads. E.g. if usually, \"4,8\" would remove (i.e. filter out) base modification calls in the first 4 and last 8 bases of the read, using this flag will keep only base modification calls in the first 4 and last 8 bases. --output-sam Output SAM format instead of BAM. --suppress-progress Hide the progress bar -h, --help Print help information (use `-h` for a summary).","breadcrumbs":"Extended subcommand help » adjust-mods","id":"62","title":"adjust-mods"},"63":{"body":"Renames Mm/Ml to tags to MM/ML. Also allows changing the the mode flag from silent '.' to explicitly\n'?' or '.'. Usage: modkit update-tags [OPTIONS] Arguments: BAM to update modified base tags in. Can be a path to a file or one of `-` or `stdin` to specify a stream from standard input. File to new BAM file to be created or one of `-` or `stdin` to specify a stream from standard output. Options: -m, --mode Mode, change mode to this value, options {'ambiguous', 'implicit'}. See spec at: https://samtools.github.io/hts-specs/SAMtags.pdf. 'ambiguous' ('?') means residues without explicit modification probabilities will not be assumed canonical or modified. 'implicit' means residues without explicit modification probabilities are assumed to be canonical. [possible values: ambiguous, implicit] -t, --threads Number of threads to use. [default: 4] --log-filepath Output debug logs to file at this path. --output-sam Output SAM format instead of BAM. -h, --help Print help information.","breadcrumbs":"Extended subcommand help » update-tags","id":"63","title":"update-tags"},"64":{"body":"Calculate an estimate of the base modification probability distribution. Usage: modkit sample-probs [OPTIONS] Arguments: Input BAM with modified base tags. If a index is found reads will be sampled evenly across the length of the reference sequence. Can be a path to a file or one of `-` or `stdin` to specify a stream from standard input. Options: -t, --threads Number of threads to use. [default: 4] --log-filepath Specify a file for debug logs to be written to, otherwise ignore them. Setting a file is recommended. --suppress-progress Hide the progress bar. -p, --percentiles Percentiles to calculate, a space separated list of floats. [default: 0.1,0.5,0.9] -o, --out-dir Directory to deposit result tables into. Required for model probability histogram output. Creates two files probabilities.tsv and probabilities.txt The .txt contains ASCII-histograms and the .tsv contains tab-separated variable data represented by the histograms. --prefix Label to prefix output files with. E.g. 'foo' will output foo_thresholds.tsv, foo_probabilities.tsv, and foo_probabilities.txt. --force Overwrite results if present. --ignore Ignore a modified base class _in_situ_ by redistributing base modification probability equally across other options. For example, if collapsing 'h', with 'm' and canonical options, half of the probability of 'h' will be added to both 'm' and 'C'. A full description of the methods can be found in collapse.md. --edge-filter Discard base modification calls that are this many bases from the start or the end of the read. Two comma-separated values may be provided to asymmetrically filter out base modification calls from the start and end of the reads. For example, 4,8 will filter out base modification calls in the first 4 and last 8 bases of the read. --invert-edge-filter Invert the edge filter, instead of filtering out base modification calls at the ends of reads, only _keep_ base modification calls at the ends of reads. E.g. if usually, \"4,8\" would remove (i.e. filter out) base modification calls in the first 4 and last 8 bases of the read, using this flag will keep only base modification calls in the first 4 and last 8 bases. --hist Output histogram of base modification prediction probabilities. --buckets Number of buckets for the histogram, if used. [default: 128] -n, --num-reads Approximate maximum number of reads to use, especially recommended when using a large BAM without an index. If an indexed BAM is provided, the reads will be sampled evenly over the length of the aligned reference. If a region is passed with the --region option, they will be sampled over the genomic region. Actual number of reads used may deviate slightly from this number. [default: 10042] -f, --sampling-frac Instead of using a defined number of reads, specify a fraction of reads to sample, for example 0.1 will sample 1/10th of the reads. --no-sampling No sampling, use all of the reads to calculate the filter thresholds. -s, --seed Random seed for deterministic running, the default is non-deterministic, only used when no BAM index is provided. --region Process only the specified region of the BAM when collecting probabilities. Format should be :- or . -i, --interval-size Interval chunk size in base pairs to process concurrently. Smaller interval chunk sizes will use less memory but incur more overhead. Only used when sampling probs from an indexed bam. [default: 1000000] --include-bed Only sample base modification probabilities that are aligned to the positions in this BED file. (alias: include-positions) --only-mapped Only use base modification probabilities that are aligned (i.e. ignore soft-clipped, and inserted bases). -h, --help Print help information (use `-h` for a summary).","breadcrumbs":"Extended subcommand help » sample-probs","id":"64","title":"sample-probs"},"65":{"body":"Summarize the mod tags present in a BAM and get basic statistics. The default output is a totals\ntable (designated by '#' lines) and a modification calls table. Descriptions of the columns can be\nfound in the README. Usage: modkit summary [OPTIONS] Arguments: Input modBam, can be a path to a file or one of `-` or `stdin` to specify a stream from standard input. Options: -t, --threads Number of threads to use. [default: 4] --log-filepath Specify a file for debug logs to be written to, otherwise ignore them. Setting a file is recommended. --tsv Output summary as a tab-separated variables stdout instead of a table. --suppress-progress Hide the progress bar. -n, --num-reads Approximate maximum number of reads to use, especially recommended when using a large BAM without an index. If an indexed BAM is provided, the reads will be sampled evenly over the length of the aligned reference. If a region is passed with the --region option, they will be sampled over the genomic region. Actual number of reads used may deviate slightly from this number. [default: 10042] -f, --sampling-frac Instead of using a defined number of reads, specify a fraction of reads to sample when estimating the filter threshold. For example 0.1 will sample 1/10th of the reads. --no-sampling No sampling, use all of the reads to calculate the filter thresholds and generating the summary. -s, --seed Sets a random seed for deterministic running (when using --sample-frac), the default is non-deterministic, only used when no BAM index is provided. --no-filtering Do not perform any filtering, include all base modification calls in the summary. See filtering.md for details on filtering. -p, --filter-percentile Filter out modified base calls where the probability of the predicted variant is below this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence base modification calls. [default: 0.1] --filter-threshold Specify the filter threshold globally or per-base. Global filter threshold can be specified with by a decimal number (e.g. 0.75). Per-base thresholds can be specified by colon-separated values, for example C:0.75 specifies a threshold value of 0.75 for cytosine modification calls. Additional per-base thresholds can be specified by repeating the option: for example --filter-threshold C:0.75 --filter-threshold A:0.70 or specify a single base option and a default for all other bases with: --filter-threshold A:0.70 --filter-threshold 0.9 will specify a threshold value of 0.70 for adenine and 0.9 for all other base modification calls. --mod-thresholds Specify a passing threshold to use for a base modification, independent of the threshold for the primary sequence base or the default. For example, to set the pass threshold for 5hmC to 0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as usual and used for canonical cytosine and other modifications unless the `--filter-threshold` option is also passed. See the online documentation for more details. --ignore Ignore a modified base class _in_situ_ by redistributing base modification probability equally across other options. For example, if collapsing 'h', with 'm' and canonical options, half of the probability of 'h' will be added to both 'm' and 'C'. A full description of the methods can be found in collapse.md. --edge-filter Discard base modification calls that are this many bases from the start or the end of the read. Two comma-separated values may be provided to asymmetrically filter out base modification calls from the start and end of the reads. For example, 4,8 will filter out base modification calls in the first 4 and last 8 bases of the read. --invert-edge-filter Invert the edge filter, instead of filtering out base modification calls at the ends of reads, only _keep_ base modification calls at the ends of reads. E.g. if usually, \"4,8\" would remove (i.e. filter out) base modification calls in the first 4 and last 8 bases of the read, using this flag will keep only base modification calls in the first 4 and last 8 bases. --include-bed Only summarize base modification probabilities that are aligned to the positions in this BED file. (alias: include-positions) --only-mapped Only use base modification probabilities that are aligned (i.e. ignore soft-clipped, and inserted bases). --region Process only the specified region of the BAM when collecting probabilities. Format should be :- or . -i, --interval-size When using regions, interval chunk size in base pairs to process concurrently. Smaller interval chunk sizes will use less memory but incur more overhead. [default: 1000000] -h, --help Print help information (use `-h` for a summary).","breadcrumbs":"Extended subcommand help » summary","id":"65","title":"summary"},"66":{"body":"Create BED file with all locations of a sequence motif. Example: modkit motif-bed CG 0 Usage: modkit motif-bed [OPTIONS] Arguments: Input FASTA file. Motif to search for within FASTA, e.g. CG. Offset within motif, e.g. 0. Options: -k, --mask Respect soft masking in the reference FASTA. -h, --help Print help information.","breadcrumbs":"Extended subcommand help » motif-bed","id":"66","title":"motif-bed"},"67":{"body":"Call mods from a modBam, creates a new modBam with probabilities set to 100% if a base modification\nis called or 0% if called canonical. Usage: modkit call-mods [OPTIONS] Arguments: Input BAM, may be sorted and have associated index available. Can be a path to a file or one of `-` or `stdin` to specify a stream from standard input. Output BAM, can be a path to a file or one of `-` or `stdin` to specify a stream from standard input. Options: --log-filepath Specify a file for debug logs to be written to, otherwise ignore them. Setting a file is recommended. --ff Fast fail, stop processing at the first invalid sequence record. Default behavior is to continue and report failed/skipped records at the end. --suppress-progress Hide the progress bar. -t, --threads Number of threads to use while processing chunks concurrently. [default: 4] -n, --num-reads Sample approximately this many reads when estimating the filtering threshold. If alignments are present reads will be sampled evenly across aligned genome. If a region is specified, either with the --region option or the --sample-region option, then reads will be sampled evenly across the region given. This option is useful for large BAM files. In practice, 10-50 thousand reads is sufficient to estimate the model output distribution and determine the filtering threshold. [default: 10042] -f, --sampling-frac Sample this fraction of the reads when estimating the filter-percentile. In practice, 50-100 thousand reads is sufficient to estimate the model output distribution and determine the filtering threshold. See filtering.md for details on filtering. --seed Set a random seed for deterministic running, the default is non-deterministic, only used when no BAM index is provided. --sample-region Specify a region for sampling reads from when estimating the threshold probability. If this option is not provided, but --region is provided, the genomic interval passed to --region will be used. Format should be :- or . --sampling-interval-size Interval chunk size to process concurrently when estimating the threshold probability, can be larger than the pileup processing interval. [default: 1000000] -p, --filter-percentile Filter out modified base calls where the probability of the predicted variant is below this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence modification calls. [default: 0.1] --filter-threshold Specify the filter threshold globally or per primary base. A global filter threshold can be specified with by a decimal number (e.g. 0.75). Per-base thresholds can be specified by colon-separated values, for example C:0.75 specifies a threshold value of 0.75 for cytosine modification calls. Additional per-base thresholds can be specified by repeating the option: for example --filter-threshold C:0.75 --filter-threshold A:0.70 or specify a single base option and a default for all other bases with: --filter-threshold A:0.70 --filter-threshold 0.9 will specify a threshold value of 0.70 for adenine and 0.9 for all other base modification calls. --mod-threshold Specify a passing threshold to use for a base modification, independent of the threshold for the primary sequence base or the default. For example, to set the pass threshold for 5hmC to 0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as usual and used for canonical cytosine and other modifications unless the `--filter-threshold` option is also passed. See the online documentation for more details. --no-filtering Don't filter base modification calls, assign each base modification to the highest probability prediction. --edge-filter Discard base modification calls that are this many bases from the start or the end of the read. Two comma-separated values may be provided to asymmetrically filter out base modification calls from the start and end of the reads. For example, 4,8 will filter out base modification calls in the first 4 and last 8 bases of the read. --invert-edge-filter Invert the edge filter, instead of filtering out base modification calls at the ends of reads, only _keep_ base modification calls at the ends of reads. E.g. if usually, \"4,8\" would remove (i.e. filter out) base modification calls in the first 4 and last 8 bases of the read, using this flag will keep only base modification calls in the first 4 and last 8 bases. --output-sam Output SAM format instead of BAM. -h, --help Print help information (use `-h` for a summary).","breadcrumbs":"Extended subcommand help » call-mods","id":"67","title":"call-mods"},"68":{"body":"Extract read-level base modification information from a modBAM into a tab-separated values table. Usage: modkit extract [OPTIONS] Arguments: Path to modBAM file to extract read-level information from, or one of `-` or `stdin` to specify a stream from standard input. If a file is used it may be sorted and have associated index. Path to output file, \"stdout\" or \"-\" will direct output to standard out. Specifying \"null\" will not output the extract table (useful if all you need is the `--read-calls` output table). Options: -t, --threads Number of threads to use. [default: 4] --log-filepath Path to file to write run log, setting this file is recommended. --mapped-only Include only mapped bases in output. (alias: mapped) --allow-non-primary Output aligned secondary and supplementary base modification probabilities as additional rows. The primary alignment will have all of the base modification probabilities (including soft-clipped ones, unless --mapped-only is used). The non-primary alignments will only have mapped bases in the output. --num-reads Number of reads to use. Note that when using a sorted, indexed modBAM that the sampling algorithm will attempt to sample records evenly over the length of the reference sequence. The result is the final number of records used may be slightly more or less than the requested number. When piping from stdin or using a modBAM without an index, the requested number of reads will be exact. --region Process only reads that are aligned to a specified region of the BAM. Format should be :- or . --force Force overwrite of output file. --suppress-progress Hide the progress bar. --kmer-size Set the query and reference k-mer size (if a reference is provided). Maxumum number for this value is 12. [default: 5] --ignore-index Ignore the BAM index (if it exists) and default to a serial scan of the BAM. --read-calls-path Produce a table of read-level base modification calls. This table has, for each read, one row for each base modification call in that read using the same thresholding algorithm as in pileup, or summary (see online documentation for details on thresholds). Passing this option will cause `modkit` to estimate the pass thresholds from the data unless a `--filter-threshold` value is passed to the command. (alias: --read-calls) --reference Path to reference FASTA to extract reference context information from. If no reference is provided, `ref_kmer` column will be \".\" in the output. (alias: ref) --include-bed BED file with regions to include (alias: include-positions). Implicitly only includes mapped sites -v, --exclude-bed BED file with regions to _exclude_ (alias: exclude). --motif Output read-level base modification probabilities restricted to the reference sequence motifs provided. The first argument should be the sequence motif and the second argument is the 0-based offset to the base to pileup base modification counts for. For example: --motif CGCG 0 indicates include base modifications for which the read is aligned to the first C on the top strand and the last C (complement to G) on the bottom strand. The --cpg argument is short hand for --motif CG 0. This argument can be passed multiple times. --cpg Only output counts at CpG motifs. Requires a reference sequence to be provided. -k, --mask When using motifs, respect soft masking in the reference sequence. --filter-threshold Specify the filter threshold globally or per-base. Global filter threshold can be specified with by a decimal number (e.g. 0.75). Per-base thresholds can be specified by colon-separated values, for example C:0.75 specifies a threshold value of 0.75 for cytosine modification calls. Additional per-base thresholds can be specified by repeating the option: for example --filter-threshold C:0.75 --filter-threshold A:0.70 or specify a single base option and a default for all other bases with: --filter-threshold A:0.70 --filter-threshold 0.9 will specify a threshold value of 0.70 for adenine and 0.9 for all other base modification calls. --mod-thresholds Specify a passing threshold to use for a base modification, independent of the threshold for the primary sequence base or the default. For example, to set the pass threshold for 5hmC to 0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as usual and used for canonical cytosine and other modifications unless the `--filter-threshold` option is also passed. See the online documentation for more details. --no-filtering Do not perform any filtering, include all mod base calls in output. See filtering.md for details on filtering. --sampling-interval-size Interval chunk size in base pairs to process concurrently when estimating the threshold probability. [default: 1000000] -f, --sampling-frac Sample this fraction of the reads when estimating the pass-threshold. In practice, 10-100 thousand reads is sufficient to estimate the model output distribution and determine the filtering threshold. See filtering.md for details on filtering. -n, --sample-num-reads Sample this many reads when estimating the filtering threshold. If a sorted, indexed modBAM is provided reads will be sampled evenly across aligned genome. If a region is specified, with the --region, then reads will be sampled evenly across the region given. This option is useful for large BAM files. In practice, 10-50 thousand reads is sufficient to estimate the model output distribution and determine the filtering threshold. [default: 10042] --seed Set a random seed for deterministic running, the default is non-deterministic. -p, --filter-percentile Filter out modified base calls where the probability of the predicted variant is below this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence modification calls. [default: 0.1] --edge-filter Discard base modification calls that are this many bases from the start or the end of the read. Two comma-separated values may be provided to asymmetrically filter out base modification calls from the start and end of the reads. For example, 4,8 will filter out base modification calls in the first 4 and last 8 bases of the read. --invert-edge-filter Invert the edge filter, instead of filtering out base modification calls at the ends of reads, only _keep_ base modification calls at the ends of reads. E.g. if usually, \"4,8\" would remove (i.e. filter out) base modification calls in the first 4 and last 8 bases of the read, using this flag will keep only base modification calls in the first 4 and last 8 bases. --ignore Ignore a modified base class _in_situ_ by redistributing base modification probability equally across other options. For example, if collapsing 'h', with 'm' and canonical options, half of the probability of 'h' will be added to both 'm' and 'C'. A full description of the methods can be found in collapse.md. -i, --interval-size Interval chunk size in base pairs to process concurrently. Smaller interval chunk sizes will use less memory but incur more overhead. Only used when an indexed modBam is provided. [default: 100000] --ignore-implicit Ignore implicitly canonical base modification calls. When the `.` flag is used in the MM tag, this implies that bases missing a base modification probability are to be assumed canonical. Set this flag to omit those base modifications from the output. For additional details see the SAM spec: https://samtools.github.io/hts-specs/SAMtags.pdf. -h, --help Print help information (use `-h` for a summary).","breadcrumbs":"Extended subcommand help » extract","id":"68","title":"extract"},"69":{"body":"Repair MM and ML tags in one bam with the correct tags from another. To use this command, both\nmodBAMs _must_ be sorted by read name. The \"donor\" modBAM's reads must be a superset of the\nacceptor's reads. Extra reads in the donor are allowed, and multiple reads with the same name\n(secondary, etc.) are allowed in the acceptor. Reads with an empty SEQ field cannot be repaired and\nwill be rejected. Reads where there is an ambiguous alignment of the acceptor to the donor will be\nrejected (and logged). See the full documentation for details. Usage: modkit repair [OPTIONS] --donor-bam --acceptor-bam --output-bam Options: -d, --donor-bam Donor modBAM with original MM/ML tags. Must be sorted by read name. -a, --acceptor-bam Acceptor modBAM with reads to have MM/ML base modification data projected on to. Must be sorted by read name. -o, --output-bam output modBAM location. --log-filepath File to write logs to, it is recommended to use this option as some reads may be rejected and logged here -t, --threads The number of threads to use [default: 4] -h, --help Print help information.","breadcrumbs":"Extended subcommand help » repair","id":"69","title":"repair"},"7":{"body":"For user convenience, the counting process can be modulated using several additional transforms and filters. The most basic of these is to report only counts from reference CpG dinucleotides. This option requires a reference sequence in order to locate the CpGs in the reference: modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta To restrict output to only certain CpGs, pass the --include-bed option with the CpGs to be used, see this page for more details. modkit pileup path/to/reads.bam output/path/pileup.bed \\ --cpg \\ --ref path/to/reference.fasta \\ --include-bed path/to/my_cpgs.bed The program also contains preset which combine several options for ease of use. The traditional preset, modkit pileup path/to/reads.bam output/path/pileup.bed \\ --ref path/to/reference.fasta \\ --preset traditional performs three transforms: restricts output to locations where there is a CG dinucleotide in the reference, reports only a C and 5mC counts, using procedures to take into account counts of other forms of cytosine modification (notably 5hmC), and aggregates data across strands. The strand field of the output will be marked as '.' indicating that the strand information has been lost. Using this option is equivalent to running with the options: modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref --ignore h --combine-strands","breadcrumbs":"Quick Start guides » Constructing bedMethyl tables » Narrowing output to CpG dinucleotides","id":"7","title":"Narrowing output to CpG dinucleotides"},"70":{"body":"Tabulates double-stranded base modification patters (such as hemi-methylation) across genomic motif\npositions. This command produces a bedMethyl file, the schema can be found in the online\ndocumentation. Usage: modkit pileup-hemi [OPTIONS] Arguments: Input BAM, should be sorted and have associated index available. Options: -o, --out-bed Output file to write results into. Will write to stdout if not provided. --cpg Aggregate double-stranded base modifications for CpG dinucleotides. This flag is short-hand for --motif CG 0. --motif Specify the sequence motif to pileup double-stranded base modification pattern counts for. The first argument should be the sequence motif and the second argument is the 0-based offset to the base to pileup base modification counts for. For example: --motif CG 0 indicates to generate pattern counts for the C on the top strand and the following C (opposite to G) on the negative strand. The motif must be reverse-complement palindromic or an error will be raised. See the documentation for more examples and details. -r, --ref Reference sequence in FASTA format. --log-filepath Specify a file for debug logs to be written to, otherwise ignore them. Setting a file is recommended. (alias: log) --region Process only the specified region of the BAM when performing pileup. Format should be :- or . Commas are allowed. --max-depth Maximum number of records to use when calculating pileup. This argument is passed to the pileup engine. If you have high depth data, consider increasing this value substantially. Must be less than 2147483647 or an error will be raised. [default: 8000] -t, --threads Number of threads to use while processing chunks concurrently. [default: 4] -i, --interval-size Interval chunk size in base pairs to process concurrently. Smaller interval chunk sizes will use less memory but incur more overhead. [default: 100000] --chunk-size Break contigs into chunks containing this many intervals (see `interval_size`). This option can be used to help prevent excessive memory usage, usually with no performance penalty. By default, modkit will set this value to 1.5x the number of threads specified, so if 4 threads are specified the chunk_size will be 6. A warning will be shown if this option is less than the number of threads specified. --suppress-progress Hide the progress bar. -n, --num-reads Sample this many reads when estimating the filtering threshold. Reads will be sampled evenly across aligned genome. If a region is specified, either with the --region option or the --sample-region option, then reads will be sampled evenly across the region given. This option is useful for large BAM files. In practice, 10-50 thousand reads is sufficient to estimate the model output distribution and determine the filtering threshold. [default: 10042] -f, --sampling-frac Sample this fraction of the reads when estimating the filter-percentile. In practice, 50-100 thousand reads is sufficient to estimate the model output distribution and determine the filtering threshold. See filtering.md for details on filtering. --seed Set a random seed for deterministic running, the default is non-deterministic. --no-filtering Do not perform any filtering, include all mod base calls in output. See filtering.md for details on filtering. -p, --filter-percentile Filter out modified base calls where the probability of the predicted variant is below this confidence percentile. For example, 0.1 will filter out the 10% lowest confidence modification calls. [default: 0.1] --filter-threshold Specify the filter threshold globally or per-base. Global filter threshold can be specified with by a decimal number (e.g. 0.75). Per-base thresholds can be specified by colon-separated values, for example C:0.75 specifies a threshold value of 0.75 for cytosine modification calls. Additional per-base thresholds can be specified by repeating the option: for example --filter-threshold C:0.75 --filter-threshold A:0.70 or specify a single base option and a default for all other bases with: --filter-threshold A:0.70 --filter-threshold 0.9 will specify a threshold value of 0.70 for adenine and 0.9 for all other base modification calls. --mod-thresholds Specify a passing threshold to use for a base modification, independent of the threshold for the primary sequence base or the default. For example, to set the pass threshold for 5hmC to 0.8 use `--mod-threshold h:0.8`. The pass threshold will still be estimated as usual and used for canonical cytosine and other modifications unless the `--filter-threshold` option is also passed. See the online documentation for more details. --sample-region Specify a region for sampling reads from when estimating the threshold probability. If this option is not provided, but --region is provided, the genomic interval passed to --region will be used. Format should be :- or . --sampling-interval-size Interval chunk size in base pairs to process concurrently when estimating the threshold probability, can be larger than the pileup processing interval. [default: 1000000] --include-bed BED file that will restrict threshold estimation and pileup results to positions overlapping intervals in the file. (alias: include-positions) --include-unmapped Include unmapped base modifications when estimating the pass threshold. --ignore Ignore a modified base class _in_situ_ by redistributing base modification probability equally across other options. For example, if collapsing 'h', with 'm' and canonical options, half of the probability of 'h' will be added to both 'm' and 'C'. A full description of the methods can be found in collapse.md. --force-allow-implicit Force allow implicit-canonical mode. By default modkit does not allow pileup with the implicit mode (e.g. C+m, no '.' or '?'). The `update-tags` subcommand is provided to update tags to the new mode. This option allows the interpretation of implicit mode tags: residues without modified base probability will be interpreted as being the non-modified base. -k, --mask Respect soft masking in the reference FASTA. --combine-mods Combine base modification calls, all counts of modified bases are summed together. See collapse.md for details. --edge-filter Discard base modification calls that are this many bases from the start or the end of the read. Two comma-separated values may be provided to asymmetrically filter out base modification calls from the start and end of the reads. For example, 4,8 will filter out base modification calls in the first 4 and last 8 bases of the read. --invert-edge-filter Invert the edge filter, instead of filtering out base modification calls at the ends of reads, only _keep_ base modification calls at the ends of reads. E.g. if usually, \"4,8\" would remove (i.e. filter out) base modification calls in the first 4 and last 8 bases of the read, using this flag will keep only base modification calls in the first 4 and last 8 bases. --only-tabs Separate bedMethyl columns with only tabs. The default is to use tabs for the first 10 fields and spaces thereafter. The default behavior is more likely to be compatible with genome viewers. Enabling this option may make it easier to parse the output with tabular data handlers that expect a single kind of separator. -h, --help Print help information (use `-h` for a summary).","breadcrumbs":"Extended subcommand help » pileup-hemi","id":"70","title":"pileup-hemi"},"71":{"body":"Compare regions in a pair of samples (for example, tumor and normal or control and experiment). A\nsample is input as a bgzip pileup bedMethyl (produced by pileup, for example) that has an associated\ntabix index. Output is a BED file with the score column indicating the magnitude of the difference\nin methylation between the two samples. See the online documentation for additional details. Usage: modkit dmr pair [OPTIONS] -a -b --ref Options: -a Bgzipped bedMethyl file for the first (usually control) sample. There should be a tabix index with the same name and .tbi next to this file or the --index-a option must be provided. -b Bgzipped bedMethyl file for the second (usually experimental) sample. There should be a tabix index with the same name and .tbi next to this file or the --index-b option must be provided. -o, --out-path Path to file to direct output, optional, no argument will direct output to stdout. -r, --regions-bed BED file of regions over which to compare methylation levels. Should be tab-separated (spaces allowed in the \"name\" column). Requires chrom, chromStart and chromEnd. The Name column is optional. Strand is currently ignored. When omitted, methylation levels are compared at each site in the `-a`/`control_bed_methyl` BED file (or optionally, the `-b`/`exp_bed_methyl` file with the `--use-b` flag. --use-b When performing site-level DMR, use the bedMethyl indicated by the -b/exp_bed_methyl argument to collect bases to score. --ref Path to reference fasta for the pileup. -m Bases to use to calculate DMR, may be multiple. For example, to calculate differentially methylated regions using only cytosine modifications use --base C. --log-filepath File to write logs to, it's recommended to use this option. -t, --threads Number of threads to use [default: 4] --batch-size Control the batch size. The batch size is the number of regions to load at a time. Each region will be processed concurrently. Loading more regions at a time will decrease IO to load data, but will use more memory. Default will be 50% more than the number of threads assigned. -k, --mask Respect soft masking in the reference FASTA. --suppress-progress Don't show progress bars. -f, --force Force overwrite of output file, if it already exists. --index-a Path to tabix index associated with -a (--control-bed-methyl) bedMethyl file. --index-b