
feature: upgrade throughput of read processing #104

Merged · 41 commits · Feb 24, 2025
Conversation

@gordonkoehn (Collaborator) commented Feb 21, 2025

This PR implements batch processing of the nucleotide alignment: each batch is parsed, translated, aligned, and finally written out as JSON. To handle the memory pressure, we process the reads in batches and write the NDJSON to disk in a continuously compressed way (i.e., .gz).

This enables processing on an end-consumer laptop (8 GB RAM) while handling 100+ GB of text data. The processing time is about 20–40 min, acceptable for now, with room for improvement. The final compressed .ndjson.gz is about 1–2 GB.

Explanation of the issue:

  • The nucleotide alignment, the amino acid alignment, and the corresponding insertion files must be read in at the same time to gather all the information about a given read.
  • Each read object, as SILO requires it, has a size of about 35 kB.
  • One sequencing run contains about 5 million reads (unpaired; pairing will halve this).
  • That makes for about 175 GB of uncompressed text.
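The numbers above can be checked with a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the memory estimate above.
read_size_kb = 35        # approximate size of one SILO read object
n_reads = 5_000_000      # reads in one sequencing run (unpaired)

total_gb = read_size_kb * n_reads / 1_000_000  # kB -> GB (decimal units)
print(f"{total_gb:.0f} GB of uncompressed text")  # 175 GB
```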

To address the memory usage of the read processing, we explored multiple options:

  1. on-disk NoSQL database
  2. incremental processing by taking advantage of the sorted structure of BAMs
  3. batch processing and continuous compression (running diamond for small batches)

Occam's razor → choosing option 3
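Option 3 can be sketched roughly as follows: process the reads in fixed-size batches and stream each batch into a single gzip-compressed NDJSON file, so that only one batch is ever held in memory. Function names and the batch size here are illustrative, not the actual sr2silo API:

```python
import gzip
import json
from itertools import islice


def batched(iterable, batch_size):
    """Yield lists of up to batch_size items from iterable."""
    it = iter(iterable)
    while batch := list(islice(it, batch_size)):
        yield batch


def write_ndjson_gz(records, out_path, batch_size=100_000):
    """Stream records to a gzip-compressed NDJSON file, one batch at a time,
    so only batch_size records are ever held in memory."""
    with gzip.open(out_path, "wt", encoding="utf-8") as f:
        for batch in batched(records, batch_size):
            # In sr2silo, each batch would be parsed, translated, and
            # aligned here before serialization.
            for record in batch:
                f.write(json.dumps(record) + "\n")


# Usage with a generator, so the full record set never exists in memory:
write_ndjson_gz(({"read_id": i} for i in range(10)), "reads.ndjson.gz", batch_size=4)
```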


Note this PR is preceded by:
#101

@gordonkoehn gordonkoehn self-assigned this Feb 21, 2025
@gordonkoehn gordonkoehn linked an issue Feb 21, 2025 that may be closed by this pull request
@gordonkoehn gordonkoehn marked this pull request as ready for review February 21, 2025 20:17
@Copilot Copilot bot review requested due to automatic review settings February 21, 2025 20:17
@Copilot (Contributor) left a comment:

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

@gordonkoehn gordonkoehn added the enhancement New feature or request label Feb 21, 2025
@gordonkoehn gordonkoehn requested a review from Copilot February 21, 2025 20:40
@Copilot (Contributor) left a comment:

Copilot reviewed 6 out of 13 changed files in this pull request and generated 3 comments.

Files not reviewed (7)
  • scripts/run_vp_transformer.sh: Language not supported
  • src/sr2silo/silo_aligned_read.py: Evaluated as low risk
  • src/sr2silo/process/init.py: Evaluated as low risk
  • src/sr2silo/process/interface.py: Evaluated as low risk
  • src/sr2silo/s3.py: Evaluated as low risk
  • tests/process/test_translation_aligment.py: Evaluated as low risk
  • src/sr2silo/silo/lapis.py: Evaluated as low risk
Comments suppressed due to low confidence (2)

src/sr2silo/process/translate_align.py:429

  • The json.JSONDecodeError should be raised with the correct arguments: raise json.JSONDecodeError(e.msg, e.doc, e.pos).
raise json.JSONDecodeError(e.msg, e.doc, e.pos)
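The suggested fix could look like the following sketch (function name is illustrative): the caught `JSONDecodeError` is reconstructed with its original message, document, and position, so the caller still sees where parsing failed.

```python
import json


def parse_record(line: str) -> dict:
    """Parse one NDJSON line, re-raising decode errors with full context."""
    try:
        return json.loads(line)
    except json.JSONDecodeError as e:
        # Reconstruct the error with its original arguments, as suggested
        # in the review comment, and chain it to the original exception.
        raise json.JSONDecodeError(e.msg, e.doc, e.pos) from e
```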

src/sr2silo/process/translate_align.py:485

  • [nitpick] The logging of file sizes is unnecessary and could be removed to reduce clutter.
logging.info(f"Size of {fp.name}: {file_size_mb:.2f} MB")

Co-authored-by: Copilot <[email protected]>
@gordonkoehn (Collaborator, Author) commented:

@DrYak FYI, Taepper and I may now finally test SILO with complete alignment data from V-Pipe.

@Taepper I switched to .gz compression here so that we can continuously append while compressing, instead of .bz2 as we agreed upon earlier. So we need to adjust the current SILO build script.
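The reason gzip supports this append pattern is that a gzip file may consist of multiple concatenated members, which Python's `gzip` module transparently reads back as one stream. A minimal stdlib illustration (file name is illustrative):

```python
import gzip
import os

path = "batches.ndjson.gz"
if os.path.exists(path):
    os.remove(path)  # start fresh for the demonstration

# Append each batch as its own gzip member.
for batch in (b'{"read_id": 1}\n', b'{"read_id": 2}\n'):
    with gzip.open(path, "ab") as f:
        f.write(batch)

# On read, the concatenated members appear as a single stream.
with gzip.open(path, "rb") as f:
    data = f.read()
```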

@gordonkoehn gordonkoehn requested review from Taepper and DrYak and removed request for DrYak February 21, 2025 21:05

Taepper commented Feb 22, 2025

Alright, do we have the build script in the repo now? Should I amend it as a separate PR?


gordonkoehn commented Feb 22, 2025

Yeah, merging in the SILO build sounds good to me, unless you have a better place or another repo where SILO's actual code resides.

It still lives on this PR – so feel free to merge this in.
Perhaps in a folder ./silo/, with a note in the ReadMe.

Do you think .gz is an ok choice for compression? Is there a better choice? You know these better than me.

Side Remark:
We could also think about having an option to skip the download of the compressed input files and instead give it a path to copy from. This would allow end-to-end testing, to see whether the pre-processing succeeds. I am validating each JSON as it is printed, but I may still miss something. Not a priority.


Taepper commented Feb 23, 2025

I think zstd compression would have advantages over gzip.

Alright, I can prepare the PR to add silo; once it breaks with the changes from here, just ping me and I will fix it.

I will also already put in the option to skip the downloading.


gordonkoehn commented Feb 24, 2025

OK – will adjust to zstd; I finally read up on the compression methods ;) Thank you for the suggestion!

  • switch compression to zstd


gordonkoehn commented Feb 24, 2025

Improve performance in the future by:

  • adjusting the defaults for batch processing: chunk_size and write_chunk_size
  • adjusting the default block_size for diamond

noted here:
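The diamond knob mentioned above could be wired up roughly like this. Diamond's real flags are `--block-size`/`-b` (billions of sequence letters per processing block; lowering it reduces memory use) and `--threads`; the function name and default values here are illustrative, not the project's actual settings:

```python
def diamond_cmd(query, db, out, block_size=0.5, threads=4):
    """Build a `diamond blastx` invocation tuned for small per-batch inputs.

    A smaller --block-size trades speed for lower peak memory, which fits
    the per-batch inputs produced by this PR's batch processing.
    """
    return [
        "diamond", "blastx",
        "--query", str(query),
        "--db", str(db),
        "--out", str(out),
        "--block-size", str(block_size),
        "--threads", str(threads),
    ]


# The resulting list could be passed to subprocess.run(cmd, check=True).
cmd = diamond_cmd("batch.fasta", "ref.dmnd", "batch_alignment.out")
```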

@gordonkoehn gordonkoehn requested a review from Copilot February 24, 2025 10:47
@Copilot (Contributor) left a comment:

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (3)

src/sr2silo/process/convert.py:38

  • Raising a new Exception here wraps the original error, which may lose the original traceback. Consider re-raising the caught exception (e.g., using just 'raise') to preserve debugging information.
raise Exception(f"An error occurred: {e}")
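The suggested pattern is a bare `raise`, which re-raises the original exception unchanged and so preserves its traceback, rather than wrapping it in a new `Exception`. A small sketch (function name is illustrative):

```python
import json
import logging


def parse_metadata(text: str) -> dict:
    try:
        return json.loads(text)
    except Exception:
        # Log context, then re-raise the original exception with a bare
        # `raise` so the full traceback is preserved for debugging.
        logging.error("An error occurred while parsing metadata")
        raise
```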

src/sr2silo/process/translate_align.py:140

  • [nitpick] The temporary FASTA filename is derived from the output SAM file's name by replacing its suffix with '.fasta', which might be confusing. Consider using a distinct temporary filename to clearly indicate its purpose.
fasta_nuc_for_aa_alignment = temp_dir_path / out_aa_alignment_fp.with_suffix(".fasta")

tests/test_database_config_validation.py:27

  • The comment previously contained 'trype' which has been corrected to 'type'.
# get the second field of each item as the type of the fields

@gordonkoehn gordonkoehn merged commit 937a52c into main Feb 24, 2025
3 checks passed
@gordonkoehn gordonkoehn deleted the feature/throughput branch February 24, 2025 11:59