Question about iteratively getting a read subset #136
Hi @hasindu2008, this looks reasonable. Do you know how many megabytes per second of signal you get from the loading? That is a more informative performance metric, since any operation you run involving the sample data will bottleneck on decompression, and potentially on disk I/O speed, depending on the speed of your processor. If you aren't hitting your disk's maximum read speed, there is probably room for improvement. Depending on the size of the pod5 files and the size of the selection you are making, you could explore passing the reads from an entire batch as one async operation to reduce overhead of the threadpool, but you would need to tune this based on profiling of your use case. Hope that helps.
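To illustrate the "one async operation per batch" idea, here is a minimal Python sketch of the scheduling pattern only. It does not use the pod5 API: `load_batch_rows` is a hypothetical callable standing in for whatever code decodes the requested rows of a single batch, and the `(read_id, batch_index, row_index)` tuples are assumed to have been looked up beforehand. Whether this beats one task per read would need to be confirmed by profiling.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_batch(locations):
    """Group (read_id, batch_index, row_index) tuples by batch index."""
    grouped = defaultdict(list)
    for read_id, batch_index, row_index in locations:
        grouped[batch_index].append((read_id, row_index))
    return dict(grouped)


def fetch_selection(locations, load_batch_rows, max_workers=8):
    """Submit one task per batch rather than one task per read.

    load_batch_rows is a hypothetical callable:
        (batch_index, [(read_id, row_index), ...]) -> {read_id: signal}
    """
    grouped = group_by_batch(locations)
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(load_batch_rows, batch_index, rows)
            for batch_index, rows in grouped.items()
        ]
        # Collect the per-batch dictionaries into one result mapping.
        for future in futures:
            results.update(future.result())
    return results


def fake_loader(batch_index, rows):
    """Stand-in loader for demonstration; returns dummy 'signal' values."""
    return {read_id: [batch_index, row_index] for read_id, row_index in rows}


# Three reads spread over two batches produce two tasks, not three.
example_locations = [("a", 0, 5), ("b", 1, 2), ("c", 0, 7)]
example_result = fetch_selection(example_locations, fake_loader, max_workers=2)
```

With a selection of around 1000 reads, the number of tasks drops from one per read to one per distinct batch touched, which is where any reduction in threadpool overhead would come from.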
@0x55555555 to add further: according to iostat, it is reaching around 300 MB/s.
For a typical SSD, sequential read speed is around 500 MB/s, so I guess 300 MB/s for random access is likely close to the maximum? I am not sure how random access in POD5 is implemented under the API, and thus how large each random access is, but I presume you could answer this better. The POD5 file is 741G and contains a whole PromethION sample. Each selection I make is around 1000 reads; imagine getting the reads belonging to a certain gene region. If you have any code examples of the suggestion you mentioned about "passing the reads from an entire batch as one async operation to reduce overhead of the threadpool", we could try that too. Thanks.
If your CPU is maxing out, it's more likely that you are bottlenecking on decompression of the signal, which you could confirm by profiling. It looks like you are achieving a reasonable level of performance though; whether you need to optimise further will depend on the downstream application of the signal. Thanks.
I do not quite understand how random access in pod5 works. Is there an index or something? If not, why not?
Hi pod5 developers,
I am trying to do random access on a POD5 file, fetching a small batch of around 1000 reads at a time. For instance, say we want to grab the reads for a small gene region at a time. Based on the Dorado code, we have written the following piece of code, which uses multiple threads.
We get ~316 reads per second on a system with an SSD. We are running with 40 threads and the CPU utilisation seems to be good.
Could you advise whether this looks right in terms of getting the maximum expected performance?
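For reference, a reads-per-second figure can be converted into signal throughput with a small helper like the sketch below. It assumes 2 bytes per sample (16-bit signal values), and the function names are made up for illustration. The result is decompressed signal throughput, so it is not directly comparable to the on-disk read rate that iostat reports for compressed data.

```python
import time

BYTES_PER_SAMPLE = 2  # assumption: signal samples are 16-bit integers


def time_call(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def signal_throughput(sample_counts, elapsed_seconds,
                      bytes_per_sample=BYTES_PER_SAMPLE):
    """Return reads/s and decompressed signal MB/s for one timed fetch.

    sample_counts holds the number of signal samples in each fetched read.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    total_bytes = sum(sample_counts) * bytes_per_sample
    return {
        "reads_per_s": len(sample_counts) / elapsed_seconds,
        "mb_per_s": total_bytes / 1e6 / elapsed_seconds,
    }


# Illustrative numbers only: 4 reads of 500,000 samples each, fetched in 2 s.
example_stats = signal_throughput([500_000] * 4, 2.0)
```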