Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't align PacBio/ONT reads on genomes > 4Gbp #81

Closed
muffato opened this issue Oct 23, 2023 · 2 comments · Fixed by #82
Closed

Can't align PacBio/ONT reads on genomes > 4Gbp #81

muffato opened this issue Oct 23, 2023 · 2 comments · Fixed by #82
Assignees
Labels
bug Something isn't working
Milestone

Comments

@muffato
Copy link
Member

muffato commented Oct 23, 2023

Description of the bug

By default, minimap2 can only deal with genomes up to 4 Gbp. For larger genomes, it needs either the --split-prefix option or the -I option. See more information at https://github.com/lh3/minimap2/blob/master/FAQ.md#3-the-output-sam-doesnt-have-a-header

Without that, samtools complain and the process crashes.

Command used and terminal output

Error executing process > 'SANGERTOL_READMAPPING:READMAPPING:ALIGN_HIFI:MINIMAP2_ALIGN (iqMecThal1_T2)'

Caused by:
  Process `SANGERTOL_READMAPPING:READMAPPING:ALIGN_HIFI:MINIMAP2_ALIGN (iqMecThal1_T2)` terminated with an error exit status (1)

Command executed:

  minimap2 \
      -ax map-hifi --cs=short \
      -t 6 \
      "GCA_946902985.2.unmasked.fasta" \
      "iqMecThal1_T2_other.fastq.gz" \
       \
       \
      -a | samtools sort | samtools view -@ 6 -b -h -o iqMecThal1_T2.bam
  
  
  cat <<-END_VERSIONS > versions.yml
  "SANGERTOL_READMAPPING:READMAPPING:ALIGN_HIFI:MINIMAP2_ALIGN":
      minimap2: $(minimap2 --version 2>&1)
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [M::mm_idx_gen::73.626*1.36] collected minimizers
  [M::mm_idx_gen::82.689*1.85] sorted minimizers
  [WARNING]�[1;31m For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.�[0m
  [M::main::82.689*1.85] loaded/built the index for 28 target sequence(s)
  [M::mm_mapopt_update::85.125*1.83] mid_occ = 500
  [M::mm_idx_stat] kmer size: 19; skip: 19; is_hpc: 0; #seq: 28
  [M::mm_idx_stat::86.619*1.82] distinct minimizers: 130653147 (82.35% are singletons); average occurrences: 3.431; average spacing: 9.970; total length: 4469457077
  [E::sam_parse1] no SQ lines present in the header
  [W::sam_read1_sam] Parse error at line 2
  samtools sort: truncated file. Aborting
  [main_samview] fail to read the header from "-".

Relevant files

No response

System information

No response

@muffato muffato added the bug Something isn't working label Oct 23, 2023
@muffato
Copy link
Member Author

muffato commented Oct 23, 2023

I've started both commands. Will report on the runtime and memory usage when they complete.

First snag: [ERROR] --cs or --MD doesn't work with --split-prefix. --cs=short was required by HiMut for somatic variant calling.

@muffato
Copy link
Member Author

muffato commented Oct 25, 2023

  1. -I mode

5h 50 min runtime, 45 GB RAM

  1. --split-prefix mode

19h 30 min runtime, 29 GB RAM.
I can see in the log file how the target genome was virtually split into three chunks, each under 4 Gbp, and reads were aligned to each chunk. That's why it was about 3 times as slow as the -I mode.


For this pipeline, I would go with the -I option as i) the extra memory requirement isn't hard to fulfil and is worth the speed increase, ii) --split-prefix clashes with -cs

@muffato muffato added this to the 1.2.0 milestone Nov 6, 2023
@muffato muffato self-assigned this Nov 6, 2023
@muffato muffato moved this from Todo to In Progress in Genome After Party Nov 6, 2023
muffato added a commit that referenced this issue Nov 6, 2023
@muffato muffato mentioned this issue Nov 6, 2023
9 tasks
@muffato muffato linked a pull request Nov 16, 2023 that will close this issue
9 tasks
muffato added a commit that referenced this issue Dec 8, 2023
@muffato muffato closed this as completed Dec 18, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Genome After Party Dec 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

1 participant