feature: upgrade throughput of read processing #104
Conversation
Co-authored-by: Copilot <[email protected]>
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
Copilot reviewed 6 out of 13 changed files in this pull request and generated 3 comments.
Files not reviewed (7)
- scripts/run_vp_transformer.sh: Language not supported
- src/sr2silo/silo_aligned_read.py: Evaluated as low risk
- src/sr2silo/process/__init__.py: Evaluated as low risk
- src/sr2silo/process/interface.py: Evaluated as low risk
- src/sr2silo/s3.py: Evaluated as low risk
- tests/process/test_translation_aligment.py: Evaluated as low risk
- src/sr2silo/silo/lapis.py: Evaluated as low risk
Comments suppressed due to low confidence (2)
src/sr2silo/process/translate_align.py:429
- The json.JSONDecodeError should be raised with the correct arguments: raise json.JSONDecodeError(e.msg, e.doc, e.pos).
raise json.JSONDecodeError(e.msg, e.doc, e.pos)
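The point of the suggested fix is that `json.JSONDecodeError` takes three required constructor arguments (`msg`, `doc`, `pos`), so it cannot be re-raised from just the caught exception object. A minimal sketch of such a handler (the function and file names here are illustrative, not the actual `translate_align.py` code):

```python
import json
from pathlib import Path


def load_metadata(fp: Path) -> dict:
    """Parse a JSON file, re-raising decode errors with full context.

    json.JSONDecodeError requires (msg, doc, pos); passing the fields
    of the caught exception preserves the message and error location.
    """
    try:
        return json.loads(fp.read_text())
    except json.JSONDecodeError as e:
        # re-raise with the correct arguments, as the review suggests
        raise json.JSONDecodeError(e.msg, e.doc, e.pos)
```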
src/sr2silo/process/translate_align.py:485
- [nitpick] The logging of file sizes is unnecessary and could be removed to reduce clutter.
logging.info(f"Size of {fp.name}: {file_size_mb:.2f} MB")
Co-authored-by: Copilot <[email protected]>
Alright, do we have the build script in the repo now? Should I amend it as a separate PR?
Yeah, merging in the SILO build sounds good to me – unless you have a better place / other repo where SILO's actual code resides. It still lives on this PR, so feel free to merge this in. Do you think

Side Remark:
I think

Alright, I can prepare the PR to add silo; once it breaks with the changes from here, just ping me and I will fix it. I will also put in the option to skip the downloading.
OK - will adjust to
Improve performance in the future by

noted here:
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (3)
src/sr2silo/process/convert.py:38
- Raising a new Exception here wraps the original error, which may lose the original traceback. Consider re-raising the caught exception (e.g., using just 'raise') to preserve debugging information.
raise Exception(f"An error occurred: {e}")
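The review's point about the traceback can be illustrated with a minimal sketch (the function name here is illustrative, not the actual `convert.py` code): wrapping the caught error in a new `Exception` replaces its type and loses the original raise site, whereas a bare `raise` propagates the original exception with its traceback intact.

```python
def parse_value(raw: str) -> int:
    """Convert a raw field to int, preserving the original error.

    Instead of `raise Exception(f"An error occurred: {e}")`, which
    swaps the ValueError for a generic Exception, a bare `raise`
    re-raises the caught exception unchanged.
    """
    try:
        return int(raw)
    except ValueError:
        # optionally log context here, then re-raise the original
        raise
```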
src/sr2silo/process/translate_align.py:140
- [nitpick] The temporary FASTA filename is derived from the output SAM file's name by replacing its suffix with '.fasta', which might be confusing. Consider using a distinct temporary filename to clearly indicate its purpose.
fasta_nuc_for_aa_alignment = temp_dir_path / out_aa_alignment_fp.with_suffix(".fasta")
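One way to follow the review's suggestion of a distinct temporary name, rather than deriving it from the output SAM file, is the standard `tempfile` module. This is a hypothetical sketch, not the actual sr2silo implementation:

```python
import tempfile
from pathlib import Path


def temp_fasta_path(temp_dir: Path) -> Path:
    """Create a clearly named temporary FASTA file.

    A descriptive prefix makes the file's temporary role obvious,
    instead of reusing out_aa_alignment_fp.with_suffix(".fasta").
    """
    handle = tempfile.NamedTemporaryFile(
        dir=temp_dir,
        prefix="nuc_for_aa_alignment_",
        suffix=".fasta",
        delete=False,
    )
    handle.close()
    return Path(handle.name)
```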
tests/test_database_config_validation.py:27
- The comment previously contained 'trype' which has been corrected to 'type'.
# get the second field of each item as the type of the fields
This PR implements batch processing of the nucleotide alignment to parse, translate, and align the reads, and finally write them out as JSON. To handle the memory pressure, we process the reads in batches and write the NDJSON to disk in a continuously compressed way (i.e., as `.gz`). This enables processing on an end-consumer laptop (8 GB RAM) while handling 100+ GB of text data. The processing time is about 20-40 min, acceptable for now, with room to improve. The final compressed `.ndjson.gz` is about 1-2 GB.

Explanation of issue:
To address the memory usage of the read processing, we explored multiple options:

Occam's razor --> choosing 3)
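The batch-plus-streaming-compression approach described above can be sketched as follows; the function and argument names are illustrative, not the actual sr2silo API. The key point is that `gzip.open` in text mode compresses as records are written, so neither the full NDJSON nor the full read set is ever held in memory:

```python
import gzip
import json
from pathlib import Path
from typing import Iterable


def write_ndjson_batches(batches: Iterable[list[dict]], out_fp: Path) -> None:
    """Write batches of records as NDJSON, compressed on the fly.

    Each record becomes one JSON line in a .ndjson.gz file; only one
    batch needs to fit in memory at a time.
    """
    with gzip.open(out_fp, "wt", encoding="utf-8") as f:
        for batch in batches:
            for record in batch:
                f.write(json.dumps(record) + "\n")
```

In the real pipeline each batch would come from parsing, translating, and aligning a slice of the input reads before being flushed to disk.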
Note this PR is preceded by:
- #101