refactor ngs_mapping #517

Open
ericblanc20 opened this issue Jun 7, 2024 · 0 comments

Is your feature request related to a problem? Please describe.
ngs_mapping has become an unruly monster. It could be simplified and generalized, which would let us remove other steps in the process.

Describe the solution you'd like
The step should be broken into smaller pieces, which could share a temporary directory and pass large files through named pipes. In particular, we might have:

  1. An optional sub-step to clip UMIs or MBCs from the read sequences, and put them in the read title line. This sub-step should also be able to deal with UMIs provided in separate files. This sub-step would be run for each fastq file, unless the separate UMI file contains information for both reads from the pair.
  2. An optional sub-step to perform adapter trimming, possibly using pipes. This sub-step should take its input either from the original fastq files, or from step 1. This is also run per fastq file.
  3. The mapping per se. This should be done separately for each pair of fastq files, and tag the output bam with a read group, because it is necessary in the presence of MBCs.
  4. Optionally sort the output bam file by genomic coordinates. That can be done by piping the output of 3 into 4.
  5. Merge sorted bam files from different lanes. This can only be done if the bam files are all sorted by coordinates.
  6. Optionally mark duplicates, possibly using named pipes
  7. Optionally re-assemble fragments when UMIs/MBCs are present. Depending on UMI technology, 6 & 7 might be mutually exclusive, or 6 should come before 7, or the opposite. The logic should be flexible to accommodate all cases.
  8. Optionally perform base quality score re-calibration, which is recommended in the somatic GATK best practice.
  9. Optionally perform reports & QC sub-steps (samtools (idx|flag)?stats, ngs_chew, coverage with alfred_qc, picard, ...). The selection of possible reports & QC sub-steps should be contingent on the tools & assay type (for example, coverage of exome & WGS data should be carried out differently, methinks).
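
To make the ordering logic concrete, here is a minimal sketch of how the sub-steps above could be planned from a configuration. All names (`MappingOptions`, its fields, `plan_sub_steps`, the sub-step labels) are hypothetical and are not existing ngs_mapping configuration keys. It only returns the order of sub-steps and doesn't run anything.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class MappingOptions:
    # Hypothetical option names, one per optional sub-step in the list above.
    clip_umis: bool = False                    # sub-step 1
    trim_adapters: bool = False                # sub-step 2
    sort_by_coordinate: bool = True            # sub-step 4
    mark_duplicates: bool = False              # sub-step 6
    reassemble_fragments: bool = False         # sub-step 7
    duplicates_before_reassembly: bool = True  # relative order of 6 & 7
    recalibrate_base_qualities: bool = False   # sub-step 8
    reports: tuple[str, ...] = ()              # sub-step 9, e.g. ("samtools_stats",)


def plan_sub_steps(options: MappingOptions, n_lanes: int) -> list[str]:
    """Return the ordered sub-step labels for one library."""
    steps = []
    if options.clip_umis:
        steps.append("clip_umis")
    if options.trim_adapters:
        steps.append("trim_adapters")
    steps.append("map")  # the mapping itself is the only mandatory sub-step
    if options.sort_by_coordinate:
        steps.append("sort")
    if n_lanes > 1:
        # Merging lanes is only possible on coordinate-sorted bam files.
        if not options.sort_by_coordinate:
            raise ValueError("merging lanes requires coordinate-sorted bam files")
        steps.append("merge_lanes")
    dedup = ["mark_duplicates"] if options.mark_duplicates else []
    reassemble = ["reassemble_fragments"] if options.reassemble_fragments else []
    # 6 & 7 can be exclusive (only one enabled) or run in either order.
    if options.duplicates_before_reassembly:
        steps += dedup + reassemble
    else:
        steps += reassemble + dedup
    if options.recalibrate_base_qualities:
        steps.append("recalibrate_base_qualities")
    steps += [f"report:{name}" for name in options.reports]
    return steps
```

For example, `plan_sub_steps(MappingOptions(mark_duplicates=True), n_lanes=2)` gives `["map", "sort", "merge_lanes", "mark_duplicates"]`.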

We could then dispose of the adapter_trimming step, and possibly of ngs_data_qc, if we also include fastqc in the available QC reports.
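
As for the shared temporary directory and named pipes proposed above, a rough sketch of the hand-off between two sub-steps could look like this. It is POSIX only (`os.mkfifo`), the helper `stream_through_fifo` and the file name are hypothetical, and `produce` & `consume` are stand-ins for real sub-steps (for example a mapper writing and a sorter reading), which in practice would probably be external commands.

```python
import os
import tempfile
import threading


def stream_through_fifo(produce, consume):
    """Run produce(path) and consume(path) concurrently, linked by a named pipe.

    Both callables receive the path of the same FIFO: `produce` opens it
    for writing, `consume` opens it for reading. Nothing large is written
    to disk, the data only flows through the pipe.
    """
    with tempfile.TemporaryDirectory() as tmpdir:
        fifo = os.path.join(tmpdir, "mapped.sam")  # hypothetical file name
        os.mkfifo(fifo)
        # The writer runs in a thread, because opening a FIFO blocks until
        # the other end has been opened too.
        writer = threading.Thread(target=produce, args=(fifo,))
        writer.start()
        try:
            return consume(fifo)
        finally:
            # Caveat of this sketch: if `consume` fails before opening the
            # pipe, the writer stays blocked and this join never returns.
            writer.join()
```

A real implementation would need error handling on both ends of the pipe, which is the main price to pay compared to temporary files.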

Describe alternatives you've considered
It is important that this step doesn't become a monster too. So it might be useful to revive the gene_expression_quantification step, to handle the mapping of RNA expression data separately. The arguments could be:

  1. I am not aware that UMIs are used in bulk RNA data, so it would simplify the protocol above quite a bit.
  2. RNA expression quantification doesn't necessitate mapping (see salmon for example), so a lot of the protocol above is also unnecessary.
  3. Conversely, STAR can produce 2 bam files besides the counts (one mapped on the genome, and the other on the transcriptome). This would further complicate the ngs_mapping step.
  4. STAR also allows sophisticated mapping to discover new transcripts (2 pass alignments). This is also specific to RNA.
  5. The tools for QC & reports are quite different

Finally, I don't know if long read mapping should be in the same mapping step as short reads. But if the sub-steps are small and well-defined, it might be possible to include both technologies in the same step.
