
feature: upgrade throughput of read processing #104

Merged · 41 commits · Feb 24, 2025
Conversation

@gordonkoehn (Collaborator) commented Feb 21, 2025

This PR implements batch processing of the nucleotide alignment: each batch is parsed, translated, aligned, and finally written out as JSON. To handle the memory pressure, we process the reads in batches and write the NDJSON to disk in a continuously compressed way (i.e., .gz).

This enables processing on an end-consumer laptop (8 GB RAM) while handling 100+ GB of text data. The processing time is about 20–40 min, acceptable for now, with room for improvement. The final compressed .ndjson.gz is about 1–2 GB.

Explanation of the issue:

  • The nucleotide alignment, the amino acid alignment, and the corresponding insertion files must be read in at the same time to gather all the information about a given read.
  • Each read object, as SILO requires it, has a size of about 35 kB.
  • One sequencing run contains about 5 million reads (unpaired; pairing will halve this).
  • That makes for about 175 GB of uncompressed text.
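The numbers above can be checked with a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the memory estimate above.
read_size_kb = 35        # approximate size of one SILO read object
n_reads = 5_000_000      # reads in one sequencing run (unpaired)

total_gb = read_size_kb * n_reads / 1_000_000  # kB -> GB (decimal units)
print(f"{total_gb:.0f} GB of uncompressed text")  # 175 GB
```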

To address the memory usage of the read processing, we explored multiple options:

  1. on-disk NoSQL database
  2. incremental processing by taking advantage of the sorted structure of BAMs
  3. batch processing and continuous compression (running diamond for small batches)

Occam's razor → choosing option 3
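Option 3 can be sketched roughly as follows: process the reads in fixed-size batches and stream each batch into a single gzip-compressed NDJSON file, so that only one batch is ever held in memory. Function names and the batch size here are illustrative, not the actual sr2silo API:

```python
import gzip
import json
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items from iterable."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch


def write_ndjson_gz(records, out_path, batch_size=100_000):
    """Stream records to a gzip-compressed NDJSON file, one batch at a time,
    so only batch_size records are ever held in memory."""
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for batch in batched(records, batch_size):
            # In sr2silo, each batch would be parsed, translated, and
            # aligned here before serialization.
            for record in batch:
                f.write(json.dumps(record) + "\n")


# Usage with a generator, so the full record set never exists in memory:
write_ndjson_gz(({"read_id": i} for i in range(10)), "reads.ndjson.gz", batch_size=4)
```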


Note this PR is preceded by:
#101

@gordonkoehn gordonkoehn self-assigned this Feb 21, 2025
@gordonkoehn gordonkoehn linked an issue Feb 21, 2025 that may be closed by this pull request
@gordonkoehn gordonkoehn marked this pull request as ready for review February 21, 2025 20:17
@Copilot Copilot bot review requested due to automatic review settings February 21, 2025 20:17
@Copilot (Contributor) left a comment:

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

@gordonkoehn gordonkoehn added the enhancement New feature or request label Feb 21, 2025
@gordonkoehn gordonkoehn requested a review from Copilot February 21, 2025 20:40
@Copilot (Contributor) left a comment:

Copilot reviewed 6 out of 13 changed files in this pull request and generated 3 comments.

Files not reviewed (7)
  • scripts/run_vp_transformer.sh: Language not supported
  • src/sr2silo/silo_aligned_read.py: Evaluated as low risk
  • src/sr2silo/process/init.py: Evaluated as low risk
  • src/sr2silo/process/interface.py: Evaluated as low risk
  • src/sr2silo/s3.py: Evaluated as low risk
  • tests/process/test_translation_aligment.py: Evaluated as low risk
  • src/sr2silo/silo/lapis.py: Evaluated as low risk
Comments suppressed due to low confidence (2)

src/sr2silo/process/translate_align.py:429

  • The json.JSONDecodeError should be raised with the correct arguments: raise json.JSONDecodeError(e.msg, e.doc, e.pos).
raise json.JSONDecodeError(e.msg, e.doc, e.pos)
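The suggested fix could look like the following sketch (function name is illustrative): the caught `JSONDecodeError` is reconstructed with its original message, document, and position, so the caller still sees where parsing failed.

```python
import json


def parse_record(line: str) -> dict:
    """Parse one NDJSON line, re-raising decode errors with full context."""
    try:
        return json.loads(line)
    except json.JSONDecodeError as e:
        # Reconstruct the error with its original arguments, as suggested
        # in the review comment, and chain it to the original exception.
        raise json.JSONDecodeError(e.msg, e.doc, e.pos) from e
```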

src/sr2silo/process/translate_align.py:485

  • [nitpick] The logging of file sizes is unnecessary and could be removed to reduce clutter.
logging.info(f"Size of {fp.name}: {file_size_mb:.2f} MB")

Co-authored-by: Copilot <[email protected]>
@gordonkoehn (Collaborator, Author) commented:

@DrYak FYI, Taepper and I may now finally test SILO with complete alignment data from V-Pipe.

@Taepper I switched to .gz compression here so that we can continuously append while compressing, instead of .bz2 as we agreed upon earlier. So we need to adjust the current SILO build script.
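The reason gzip supports this append pattern is that a gzip file may consist of multiple concatenated members, which Python's `gzip` module transparently reads back as one stream. A minimal stdlib illustration (file name is illustrative):

```python
import gzip
import os

path = "batches.ndjson.gz"
if os.path.exists(path):
    os.remove(path)  # start fresh for the demonstration

# Append each batch as its own gzip member.
for batch in (b'{"read_id": 1}\n', b'{"read_id": 2}\n'):
    with gzip.open(path, "ab") as f:
        f.write(batch)

# On read, the concatenated members appear as a single stream.
with gzip.open(path, "rb") as f:
    data = f.read()
```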

@gordonkoehn gordonkoehn requested review from Taepper and DrYak and removed request for DrYak February 21, 2025 21:05

Taepper commented Feb 22, 2025

Alright, do we have the build script in the repo now? Should I amend it as a separate PR?


gordonkoehn commented Feb 22, 2025

Yeah, merging in the SILO build sounds good to me, unless you have a better place or another repo where SILO's actual code resides.

It still lives on this PR – so feel free to merge this in.
Perhaps in a folder ./silo/, with a note in the ReadMe.

Do you think .gz is an ok choice for compression? Is there a better choice? You know these better than me.

Side Remark:
We could also think about having an option to skip the download of the compressed input files and instead give it a path to copy from. This would allow end-to-end testing, to see whether the pre-processing succeeds. I am validating each JSON as it is printed, but I may still miss something. Not a priority.


Taepper commented Feb 23, 2025

I think zstd compression would have advantages over gzip.

Alright, I can prepare the PR to add silo; once it breaks with the changes from here, just ping me and I will fix it.

I will also already put in the option to skip the downloading.


gordonkoehn commented Feb 24, 2025

OK – will adjust to zstd; I finally read up on the compression methods ;) Thank you for the suggestion!

  • switch compression to zstd


gordonkoehn commented Feb 24, 2025

Improve performance in the future by:

  • adjusting the defaults for batch processing: chunk_size and write_chunk_size
  • adjusting the default block_size for diamond

noted here:
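The diamond knob mentioned above could be wired up roughly like this. Diamond's real flags are `--block-size`/`-b` (billions of sequence letters per processing block; lowering it reduces memory use) and `--threads`; the function name and default values here are illustrative, not the project's actual settings:

```python
def diamond_cmd(query, db, out, block_size=0.5, threads=4):
    """Build a `diamond blastx` invocation tuned for small per-batch inputs.

    A smaller --block-size trades speed for lower peak memory, which fits
    the per-batch inputs produced by this PR's batch processing.
    """
    return [
        "diamond", "blastx",
        "--query", str(query),
        "--db", str(db),
        "--out", str(out),
        "--block-size", str(block_size),
        "--threads", str(threads),
    ]


# The resulting list could be passed to subprocess.run(cmd, check=True).
cmd = diamond_cmd("batch.fasta", "ref.dmnd", "batch_alignment.out")
```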

@gordonkoehn gordonkoehn requested a review from Copilot February 24, 2025 10:47
@Copilot (Contributor) left a comment:

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

src/sr2silo/process/convert.py:38

  • Raising a new Exception here wraps the original error, which may lose the original traceback. Consider re-raising the caught exception (e.g., using just 'raise') to preserve debugging information.
raise Exception(f"An error occurred: {e}")
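The suggested pattern is a bare `raise`, which re-raises the original exception unchanged and so preserves its traceback, rather than wrapping it in a new `Exception`. A small sketch (function name is illustrative):

```python
import json
import logging


def parse_metadata(text: str) -> dict:
    try:
        return json.loads(text)
    except Exception:
        # Log context, then re-raise the original exception with a bare
        # `raise` so the full traceback is preserved for debugging.
        logging.error("An error occurred while parsing metadata")
        raise
```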

src/sr2silo/process/translate_align.py:140

  • [nitpick] The temporary FASTA filename is derived from the output SAM file's name by replacing its suffix with '.fasta', which might be confusing. Consider using a distinct temporary filename to clearly indicate its purpose.
fasta_nuc_for_aa_alignment = temp_dir_path / out_aa_alignment_fp.with_suffix(".fasta")

tests/test_database_config_validation.py:27

  • The comment previously contained 'trype' which has been corrected to 'type'.
# get the second field of each item as the type of the fields

@gordonkoehn gordonkoehn merged commit 937a52c into main Feb 24, 2025
3 checks passed
@gordonkoehn gordonkoehn deleted the feature/throughput branch February 24, 2025 11:59