The WiGiTS tool suite fully supports targeted panel sequencing as well as WGS and WTS.
The output from the pipeline largely matches that of WGS/WTS but with some modules excluded:
- germline analysis is typically disabled (unless the panel is also run for a normal sample)
- Cuppa is disabled
- Chord is disabled and instead Purple is used to estimate HRD
- TMB and MSI have custom calculations routines
- TPM is normalised to be in line with observed WGS rates
For each panel a set of custom resources is required to fit and normalise the biases inherent to that specific panel. These are described in detail below.
OncoAnalyser can be used to run any targeted panel.
To run in targeted mode, panel specific resources first need to be generated. As indicated in the below diagram, some of these are configured and some of these can be trained by first running oncoanalyser on a representative set of aligned BAMs from samples captured with the panel and then running a set of normalisation scripts that are described below. Here is a schematic of this process
The following files define the panel and are typically compiled manually and represent a basic definition of the PANEL:
File Name | Tool(s) | Purpose |
---|---|---|
Panel regions BED | Multiple | Defines the targeted regions of the bed file. Other regions are ignored in the training. May include gene exons or other regions of interest. The bed file must be sorted and exclude any ALT contigs or non standard human chromosomes |
MSI Indels TSV | Purple | List of {chromosome,position} of MSI loci to consider in MSI model |
TMB/MSI Configuration TSV | Purple | Configuration for TMB/L estimation2 |
Driver Gene Panel TSV | Multiple | Defines the set of genes in the panel, and which are reported for various events (SNVs, AMPs, DELs etc). See Driver_Catalog for details of this file |
Coverage Coding Regions BED1 | Sage | Gene regions to assess coverage |
Actionable Coding Regions BED1 | Sage | Gene coding regions for sensitive variant calling |
- These BED files are intended to match the exonic regions of the genes specified driver gene panel. These can be generated automatically from the panel Driver Gene Panel TSV using the routine in the GeneUtils tool and the section: "Generating the Sage gene panel regions files"
- The default TMB/MSI configuration of this file is shown below
TmbRatio TmlRatio MsiIndelRatio Msi23BaseAF Msi4BaseAF CodingBaseFactor
0.05 0.74 220 0.15 0.08 150000
The following files are the output of the training process described below:
File Name | Tool(s) | Purpose |
---|---|---|
Target Regions Normalisation TSV | Cobalt | Normalise copy number regions and mask off-target regions |
Panel Artefact PON | Pave | Panel-specific PON to apply in addition to standard WGS PON |
TPM Normalisation | Isofox | Normalise TPM for genes in panel (only required for panels with targeted RNA content;Not required for PANEL+WTS) |
These files are then used by the pipeline in panel mode to produce well-calibrated, accurate results ideally without panel-specific biases.
An initial set of input samples, recommended to number at least 20, are use to 'train' the pipeline. In particular this identifies variance in read depth and variant calling compared with whole genome and transcriptome. Note that the assumption of this training is that the median relative copy number for any given gene should not deviate too systematically from the ploidy across the cohort. In general this is a relatively safe assumption for a pan-cancer dataset, but in cancer specific training sets, there may be certain recurrently copy-number-altered genes that violate this assumption.
There are 2 steps in the training procedure:
This can be done by running the samples through oncoanalyser in WGTS mode.
Alternatively COBALT, AMBER & SAGE & ISOFOX can be run on each sample with the below configurations
COBALT
java -jar cobalt.jar \
-tumor SAMPLE_ID \
-tumor_bam /sample_data/SAMPLE_ID.bam \
-output_dir /sample_data/cobalt \
-gc_profile /ref_data/GC_profile.1000bp.37.cnp \
-tumor_only_diploid_bed DiploidRegions.37.bed.gz \
-threads 10 \
AMBER Amber is required to determined the gender of the samples for copy number normalisation. Note that we recommend to lower the tumor_min_depth to ensure sufficient coverage of heterozygous points.
java -jar amber.jar com.hartwig.hmftools.amber.AmberApplication \
-tumor SAMPLE_ID \
-tumor_bam /sample_data/SAMPLE_ID.bam \
-loci /path/to/GermlineHetPon.37.vcf.gz \
-tumor_min_depth 2 \
-output_dir /sample_data/ \
-threads 10 \
SAGE
java -jar sage.jar \
-tumor SAMPLE_ID \
-tumor_bam /sample_data/SAMPLE_ID.bam \
-ref_genome_version 37 \
-ref_genome /ref_data/refGenome.fasta \
-hotspots /ref_data/KnownHotspots.37.vcf.gz \
-panel_bed /path/to/ActionableCodingPanel.37.bed.gz \
-high_confidence_bed /ref_data/NA12878_GIAB_highconf_IllFB-IllGATKHC-CG-Ion-Solid_ALLCHROM_v3.2.2_highconf.bed \
-ensembl_data_dir /ref_data/ensembl_cache/ \
-output_vcf /sample_data/COLO829v003.sage.vcf.gz \
-threads 16 \
ISOFOX (panels with targeted RNA only)
java -jar isofox.jar \
-sample SAMPLE_ID \
-functions TRANSCRIPT_COUNTS \
-bam_file /sample_data/SAMPLE_ID.RNA.bam \
-ref_genome /ref_data/ref-genome.fasta \
-ensembl_data_dir /ref_data/ensembl_cache/ \
-exp_counts_file /ref_data/read_151_exp_counts.csv \
-exp_gc_ratios_file /ref_data/read_100_exp_gc_ratios.csv \
-output_dir /sample_data/ \
-threads 16 \
Note: Isofox requires the expected counts file to have been generated for the correct RNA read length. See Isofox read-me for details.
Run the Cobalt normalisation file builder command described below on the training samples output. This performs the following steps
- for each 1K region covering any target region, extract each sample's tumor read count and the GC profile mappability and GC ratio bucket
- calculate median and median read counts for each sample, and overall sample mean and median counts
- normalise each sample's tumor read counts per region
- calculate a median read count from all samples per GC ratio bucket
- write a relative enrichment for each region to the output file, with a min enrichment of 0.1
- if no WGS is available for normalisation, the tumorGCRatio is assumed to be 1 for autosomes. The gender of each sample must be provided. Female samples are excluded from Y chromosome normalisation and males use a tumorGCRatio of 0.5 for the sex chromosomes
The output of this process is a target regions normalisation file with the expected relative enrichment for each on target region.
Field | Description |
---|---|
sample_id_file | CSV with SampleId column header |
cobalt_dir | Cobalt output directory from step 1 above |
amber_dir | Amber output directory from step 2 above |
ref_genome_version | V37 or V38 |
gc_profile | As used in Cobalt and Purple |
target_regions_bed | Definition of target regions |
output_file | Output normalisation TSV file |
java -cp cobalt.jar com.hartwig.hmftools.cobalt.norm.NormalisationFileBuilder
-sample_id_file sample_ids.csv \
-cobalt_dir /training_sample_data/ \
-amber_dir /training_sample_data/ \
-ref_genome_version V37 \
-gc_profile /ref_data/GC_profile.1000bp.37.cnp \
-target_regions_bed /ref_data/target_regions_definition.37.bed \
-output_file /ref_data/target_regions.cobalt_normalisation.37.tsv \
-log_debug \
In addition to the WGS PON, Pave utilises a panel-specific PON to capture panel specific artefacts. Any non-hotspot variant found 3 or more times with a qual (TQP) of > 40 and modified map-qual factor of > -10 is added to the panel PON.
To generate this file all the Pave PonBuilder to make the additional PON file:
java -cp pave.jar com.hartwig.hmftools.pave.resources.PonBuilder \
-sample_id_file training_sample_ids.csv \
-vcf_path "/training_sample_data/*.sage.vcf.gz" \
-ref_genome_version V38 \
-output_dir /ref_data/ \
-log_debug \
The TPM normalisation is only required for panels with RNA coverage (eg. TSO500) in order to normalise the TPM so that it is equivalent to WTS. This can be generated using the Isofox Normalisation Builder
java -cp isofox.jar com.hartwig.hmftools.isofox.cohort.CohortAnalyser \
-root_data_dir /sample_data/ \
-sample_data_file /sample_data/training_sample_ids.csv \
-analyses PANEL_TPM_NORMALISATION \
-gene_distribution_file /ref_data/isofox.wgs_gene_distribution.37.csv \
-gene_id_file /ref_data/targeted_pane_gene_ids.csv \
-output_dir /ref_data/ \
The median adjusted TPM for each gene across the panel samples. If the median is zero, a replacement value of 0.01 is used instead. The adjustment factor is calculated by dividing the panel median value by the corresponding whole genome value for each gene.
Note: The adjustment factors are calculated at the gene level and not at the transcript level. This means the adjusted TPMs for transcripts from panel sequencing are not reliable.
Cobalt normalises copy number and masks off-target regions according to the CN normalisation file. If a targetRegions file is provided, then a target enrichment rate is calculated simply as the median tumorGCRatio for the specified regions. Any depth windows outside of the targetRegions file are masked so that they are ignored downstream by PURPLE. Depth windows found in the TSV file are normalised first by the overall target enrichment rate for the sample, then by the relativeEnrichment for that depth window and finally by the normal GC bias adjustment. The GC bias is calculated using on target regions only.
The following filters are applied:
- min_depth (in tumor) > 25
- Tumor ref and alt support >= 2
- Min_depth_percent and max_depth_percent are not applied
- Tumor ref and alt VAF >= 0.05
- Amber loci must be within 300 bases of a target region
To estimate MSI, a set of microsatellites with high coverage in the panel must also be defined.
For a set of microsatellite sites defined in the MSI target bed file count the number of passing variants at MSI sites ignoring SNV, MNV and 1 base deletes and requiring a VAF cutoff of > 0.15 for 2 and 3 base deletes or 0.08 for 4+ base deletes or any length insertion.
We estimate MSI rate as:
MSIndelsPerMb = 220 * # of MSI variants / # of MSI sites in panel
A custom model is used for TMB estimated in targeted mode. The main challenges of the model is to determine variants are included in the TMB estimate. PURPLE selects variants that meet the following criteria:
- Coding effect <> NONE
- GNDFreq <0.00005
- GENE in PANEL and not in {HLA-A,HLA-B,HLA-C,PIM1,BCL2}
- Type = SNV
- !HOTSPOT
- 0.05 <AF < 0.9
Each variant included is classified as ‘somatic’ if somatic likelihood = HIGH. If somatic likelihood = MEDIUM, then the variant is marked as 'unclear'.
The final somatic count estimate is set to = somatic + unclear^2 / ( CodingBases/170,000 + unclear).
This function is intended to reflect that when the number of unclear variants is less than expected germline variants then most unclear variants will be germline, whereas where the number of unclear variants is very high most will be somatic.
Using this number we then estimate the mutational burden as follows
TML = somatic Variant Estimate / CodingBases * RefGenomeCodingBases
TMB = 0.05 * TML + MSIIndelPerMb
The 0.05 conversion from TML to TMB is the empirically observed relationship in the Hartwig database.
For driver likelihood calculations, we assume 20% of variants are biallelic for targeted sequencing samples.
The following special rules apply to the consrtuction of the driver catalog
- DELS: Don’t report DELS >10Mb or if the copy number segment has less than 3 depth windows (unless supported by SV on both sides)
- PARTIAL_AMP: only in genes with known pathogenic exon deletions {BRAF, EGFR, CTNNB1, CBL,MET, ALK, PDGFRA}
There is also no somatic fit mode or somatic penalty and no SV recovery in PURPLE in targeted mode.
TPM is normalised to bring panel gene expression in-line with WGS expression rates.
Sage is run in high depth mode. See Sage readme for details.
The following parameters are calibrated for panel sequencing and are set differently to WGS. These are the default panel parameter values.
Amber
-target_regions_bed Target Regions BED file
Cobalt
-target_region Target Regions Normalisation TSV
-pcf_gamma 50
Sage
-high_depth_mode
Purple
-target_regions_bed Target Regions BED file
-target_regions_ratios Target Regions Ratios file
-target_regions_msi_indels Target Regions MSI Indels file
-min_diploid_tumor_ratio_count 3
-min_diploid_tumor_ratio_count_centromere 3
-ploidy_penalty_standard_derviation = 0.1
-ploid_penalty_factor = 0.4
-ploidy_penalty_min 0.2
-ploidy_penalty_min_standard_deviation_per_ploidy 1.5
-ploidy_penalty_sub_one_major_allele_multiplier 3
-deviation_penalty_gc_min_adjust 0.25
-gc_ratio_exponent 3.0
- HRD - vCHORD (HRD classifier) is completed and will be made available in an upcoming release.
- TOO - vCUPPA (CNN based tissue of origin classifier) is under development