Bwa-mem2 mem parallelisation #35

Closed
priyanka-surana opened this issue Apr 6, 2022 · 4 comments · Fixed by #113
Labels
enhancement Improvement of the existing features user request Requests made by users and public

Comments

@priyanka-surana
Contributor

Description of feature

For Illumina and Hi-C data, I want to split the input FASTQ files into chunks and align each chunk against the genome separately. All split files need to be aligned with the same @RG tag.

  1. Split fastq using split
  2. Make sure all split files have the same @RG tag
  3. Run bwa-mem2 mem on them individually
  4. Sort them with samtools individually
  5. Merge all split alignments from the same individual with samtools merge, using the -c option so that identical @RG headers are combined
  6. Merge at the specimen level.
  7. Run through the rest of the markdup_stats subworkflow

It may be possible to combine steps (5) and (6).
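
Steps (1) to (5) can be sketched as command lines built in Python. This is a minimal illustration, not the pipeline's implementation: the helper names, file names and read group values are hypothetical, single-end input is assumed (paired files would each need splitting with the same line count so mates stay in sync), and the split flags assume GNU coreutils.

```python
import shlex


def split_command(fastq, prefix, reads_per_chunk):
    """Step 1: split a FASTQ file on record boundaries (4 lines per read)."""
    lines = 4 * reads_per_chunk
    # -d gives numeric suffixes; --additional-suffix is a GNU split option.
    return ["split", "-l", str(lines), "-d",
            "--additional-suffix=.fastq", fastq, prefix]


def align_and_sort_commands(chunks, reference, read_group):
    """Steps 2-4: align every chunk with the same @RG line, then sort it."""
    pipelines, sorted_bams = [], []
    for chunk in chunks:
        bam = chunk + ".sorted.bam"
        align = ["bwa-mem2", "mem", "-R", read_group, reference, chunk]
        sort = ["samtools", "sort", "-o", bam, "-"]
        pipelines.append(shlex.join(align) + " | " + shlex.join(sort))
        sorted_bams.append(bam)
    return pipelines, sorted_bams


def merge_command(sorted_bams, merged_bam):
    """Step 5: merge the chunks; -c keeps one copy of colliding @RG IDs."""
    return ["samtools", "merge", "-c", merged_bam, *sorted_bams]


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    rg = r"@RG\tID:run1\tSM:sample1"
    chunks = ["chunk_00.fastq", "chunk_01.fastq"]
    print(shlex.join(split_command("reads.fastq", "chunk_", 1_000_000)))
    pipelines, bams = align_and_sort_commands(chunks, "genome.fa", rg)
    for line in pipelines:
        print(line)
    print(shlex.join(merge_command(bams, "merged.bam")))
```

Because samtools merge accepts any number of inputs, passing the sorted chunks of every run of a specimen to one merge command is one way steps (5) and (6) might be combined.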

@priyanka-surana priyanka-surana added the enhancement Improvement of the existing features label Apr 6, 2022
@priyanka-surana priyanka-surana added this to the v0.2 milestone Apr 7, 2022
@muffato
Member

muffato commented Apr 13, 2022

For the record, we want to split the fastq file rather than the CRAM as it would allow reusing the workflow for PacBio and ONT, esp. as the PacBio BAM file is "weird".

@priyanka-surana priyanka-surana removed this from the v0.2 milestone Apr 24, 2022
@priyanka-surana priyanka-surana pinned this issue Dec 3, 2022
@priyanka-surana
Contributor Author

@muffato Based on the latest discussion with Shane, are we still focusing on splitting the FASTQ? He recommended splitting the BAM/CRAM instead to avoid excess I/O.

@priyanka-surana priyanka-surana added this to the 1.2.0 milestone Mar 16, 2023
@priyanka-surana priyanka-surana unpinned this issue Mar 16, 2023
@priyanka-surana priyanka-surana self-assigned this Mar 16, 2023
@muffato
Member

muffato commented Mar 17, 2023

If you can avoid converting to FASTQ, please do so. It's a valid disk optimisation even without splitting the input file.
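
One possible reading of this is to stream reads out of the CRAM straight into the aligner so that no FASTQ file is ever written to disk. The sketch below only builds such a shell pipeline as a string. The function name and file names are hypothetical, and it assumes paired reads sit next to each other in the CRAM, which is what the `-p` (interleaved input) option of bwa-mem2 mem expects.

```python
import shlex


def stream_align_command(cram, reference, read_group, out_bam):
    """Build a pipeline: CRAM -> FASTQ on stdout -> bwa-mem2 -> sorted BAM."""
    # samtools fastq writes reads to stdout when no output files are given.
    to_fastq = ["samtools", "fastq", cram]
    # "-" makes bwa-mem2 read from stdin; -p treats it as interleaved pairs.
    align = ["bwa-mem2", "mem", "-p", "-R", read_group, reference, "-"]
    sort = ["samtools", "sort", "-o", out_bam, "-"]
    return " | ".join(shlex.join(stage) for stage in (to_fastq, align, sort))


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    print(stream_align_command("reads.cram", "genome.fa",
                               r"@RG\tID:run1\tSM:sample1", "out.bam"))
```

The only intermediate data lives in the pipes between processes, which is where the disk saving would come from.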

@priyanka-surana priyanka-surana added feature Requests for new features user request Requests made by users and public labels Jun 27, 2023
@muffato
Member

muffato commented Nov 30, 2023

We've run the pipeline quite a lot and Hi-C alignment with BWA doesn't seem to be an issue for us. The largest sample I've tested was Sambucus nigra: 6.3 billion reads (many species have around 1 billion reads) and 11.8 Gbp genome (the average genome size is < 1 Gbp). It took 2 days and 2 hours to run, which is fine in the After-Party context.

@muffato muffato removed this from the 1.2.0 milestone Dec 8, 2023
@muffato muffato removed enhancement Improvement of the existing features backlog labels Jun 1, 2024
@muffato muffato moved this to Ideas in readmapping Jun 5, 2024
@muffato muffato added enhancement Improvement of the existing features and removed feature Requests for new features labels Jun 17, 2024
@tkchafin tkchafin linked a pull request Sep 16, 2024 that will close this issue
9 tasks
@github-project-automation github-project-automation bot moved this from Ideas to Done in readmapping Sep 17, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in Genome After Party Sep 17, 2024