refactor ngs_mapping #517

ericblanc20 · 2024-06-07T16:43:16Z

Is your feature request related to a problem? Please describe.
ngs_mapping has become an unruly monster. It could be simplified and generalized, which would also let us remove other steps from the pipeline.

Describe the solution you'd like
The step should be broken into smaller pieces, which could share a temporary directory and pass large files through named pipes. In particular, we might have:

1. An optional sub-step to clip UMIs or MBCs from the read sequences and put them in the read title line. This sub-step should also be able to deal with UMIs provided in separate files. It would be run for each fastq file, unless the separate UMI file contains information for both reads of the pair.
2. An optional sub-step to perform adapter trimming, possibly using pipes. This sub-step should take its input either from the original fastq files or from sub-step 1. It is also run per fastq file.
3. The mapping per se. This should be done separately for each pair of fastq files, and the output bam should be tagged with a read group, because this is necessary in the presence of MBCs.
4. Optionally sort the output bam file by genomic coordinates. This can be done by piping the output of 3 into 4.
5. Merge sorted bam files from different lanes. This can only be done if the bam files are all sorted by coordinates.
6. Optionally mark duplicates, possibly using named pipes.
7. Optionally re-assemble fragments when UMIs/MBCs are present. Depending on the UMI technology, 6 & 7 might be mutually exclusive, or 6 should come before 7, or the opposite. The logic should be flexible enough to accommodate all cases.
8. Optionally perform base quality score recalibration, which is recommended in the GATK somatic best practices.
9. Optionally run reports & QC sub-steps (samtools (idx|flag)?stats, ngs_chew, coverage with alfred_qc, picard, ...). The selection of available reports & QC sub-steps should depend on the tools & assay type (for example, coverage of exome & WGS data should be computed differently, methinks).
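To make the named-pipe idea a bit more concrete, here is a minimal sketch of two sub-steps exchanging data through a FIFO that lives in a shared temporary directory. Everything in it is an assumption for illustration: the function names, the FIFO file name, and the use of plain text lines as a stand-in for BAM records. In a real step the two ends would be separate processes rather than threads, and this only works on POSIX systems.

```python
from __future__ import annotations

import os
import tempfile
import threading


def producer(fifo_path: str, records: list[str]) -> None:
    """Stand-in for an upstream sub-step (e.g. the mapper) writing to the pipe."""
    # Opening a FIFO for writing blocks until a reader has opened it too.
    with open(fifo_path, "w") as fifo:
        for record in records:
            fifo.write(record + "\n")


def consumer(fifo_path: str) -> list[str]:
    """Stand-in for a downstream sub-step (e.g. the sorter) reading from the pipe."""
    with open(fifo_path) as fifo:
        # Reading ends when the producer closes its end of the pipe.
        return sorted(line.rstrip("\n") for line in fifo)


def run_piped(records: list[str]) -> list[str]:
    """Run producer and consumer concurrently, connected by a named pipe."""
    # One temporary directory shared by the sub-steps; removed on exit.
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo_path = os.path.join(tmpdir, "mapped_to_sort.fifo")  # hypothetical name
        os.mkfifo(fifo_path)  # POSIX only
        writer = threading.Thread(target=producer, args=(fifo_path, records))
        writer.start()
        try:
            result = consumer(fifo_path)
        finally:
            writer.join()
        return result
```

The intermediate data never hits the disk as a regular file, which is the point of piping the output of 3 into 4; the price is that both ends must run at the same time.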

We could then dispose of the adapter_trimming step, and possibly of ngs_data_qc, if we also include fastqc in the available QC reports.
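As a rough sketch of how a single configurable step could absorb trimming and QC, the ordering of sub-steps might be derived from a small configuration object. All names below (`MappingConfig`, `plan_sub_steps`, the sub-step labels and the flags) are hypothetical and do not exist in the current code; the sketch only encodes the constraints listed above, namely that merging lanes requires coordinate-sorted files and that the order of 6 & 7 must be configurable.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MappingConfig:
    """Hypothetical switches; none of these names exist in the current step."""

    clip_umis: bool = False
    trim_adapters: bool = False
    sort_by_coordinate: bool = True
    mark_duplicates: bool = False
    reassemble_fragments: bool = False
    # Only consulted when both duplicate marking and re-assembly are enabled.
    duplicates_before_reassembly: bool = True
    recalibrate_base_qualities: bool = False
    reports: tuple[str, ...] = ()


def plan_sub_steps(cfg: MappingConfig, n_lanes: int = 1) -> list[str]:
    """Return the ordered sub-step labels implied by the configuration."""
    steps: list[str] = []
    if cfg.clip_umis:
        steps.append("clip_umis")
    if cfg.trim_adapters:
        steps.append("trim_adapters")
    steps.append("map")  # mapping is the only mandatory sub-step
    if cfg.sort_by_coordinate:
        steps.append("sort")
    if n_lanes > 1:
        # Merging is only possible when all files are coordinate-sorted.
        if not cfg.sort_by_coordinate:
            raise ValueError("merging lanes requires coordinate-sorted bam files")
        steps.append("merge_lanes")

    # Sub-steps 6 & 7: either one, both in either order, or none.
    post: list[str] = []
    if cfg.mark_duplicates:
        post.append("mark_duplicates")
    if cfg.reassemble_fragments:
        post.append("reassemble_fragments")
    if len(post) == 2 and not cfg.duplicates_before_reassembly:
        post.reverse()
    steps.extend(post)

    if cfg.recalibrate_base_qualities:
        steps.append("recalibrate_base_qualities")
    steps.extend(f"report:{name}" for name in cfg.reports)
    return steps
```

For example, `plan_sub_steps(MappingConfig(trim_adapters=True, reports=("fastqc",)))` yields trimming, mapping, sorting and a fastqc report within one step, which is the situation in which separate adapter_trimming and ngs_data_qc steps would become redundant.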

Describe alternatives you've considered
It is important that this step doesn't become a monster too. So it might be useful to revive the gene_expression_quantification step, to keep the mapping of RNA expression data separate. The arguments could be:

- I am not aware of UMIs being used in bulk RNA data, so the protocol above could be simplified quite a bit for RNA.
- RNA expression quantification doesn't require mapping (see salmon for example), so much of the protocol above is also unnecessary.
- Conversely, STAR can produce 2 bam files besides the counts (one mapped to the genome, the other to the transcriptome). Supporting this would further complicate the ngs_mapping step.
- STAR also allows sophisticated mapping to discover new transcripts (2-pass alignment). This is also specific to RNA.
- The tools for QC & reports are quite different.

Finally, I don't know whether long read mapping should be in the same mapping step as short reads. But if the sub-steps are small and well-defined, it might be possible to include both technologies in the same step.

ericblanc20 assigned tedil and ericblanc20 Jun 7, 2024
