Dorado 'sup' RAM overflow on Windows, only with large pod5 files (weird bug) #1103
Comments
Hi @GabeAl,
We'll investigate and get back to you if we have more questions, and we'll keep an eye out for further updates. Regarding subsetting in pod5, you can generate the subsetting mapping using
Best regards,
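(For reference, a minimal sketch of what generating such a mapping could look like; the exact command wasn't preserved above, so pod5 view and the chosen columns here are an assumption rather than the suggested recipe:)
# dump read_id plus an example grouping column (channel) into a tsv that pod5 subset accepts via --table
pod5 view pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 --include "read_id, channel" --output mapping.tsv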
Update on this! Another unrelated CUDA adventure with block strides, where batch sizes that don't divide the stride evenly produce a bunch of detritus, led me to a new hypothesis: what if dorado is similarly hypersensitive to the block size and never cleans up the padding? And lo and behold, across every system tested (2 Windows systems, 2 Linux systems) and 3 different GPUs, one trend holds:
With -b 64 and -b 256 the faulty behavior disappears, while every other block size I tried (including the sizes autoselected by dorado) produces the RAM-chewing effect. The Linux systems handle this more gracefully but still show the RAM-leak behavior when the files are large enough (or when run on a combined folder, as before). I think a workaround for now would be for dorado to limit block sizes to a tested few.
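(A quick way to check this pattern on any system, sketched from the command above; the specific -b values are just probes, not recommendations:)
for b in 64 96 128 256; do
  dorado basecaller -v --emit-fastq -b "$b" --kit-name SQK-RBK114-24 --output-dir "basecalled_b${b}/" sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
done
# watch host RAM during each run; on the systems I tested, only -b 64 and -b 256 stayed flat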
Just to keep up on this: I got the same crash/error (after 3.5 h) on a system with 384 GB RAM, a 40 GB A100, single user, single job.
@HalfPhoton - were you able to find the underlying issue?
Issue Report
Please describe the issue:
I took a large pod5 file produced by MinKNOW (after skipping catch-up basecalling from a P2 solo run). I tried basecalling using dorado 0.8.2 and 0.8.1 on both a Linux and a Windows system. The GPUs are a bit different, so this may also complicate figuring out the root cause.
Steps to reproduce the issue:
dorado.exe basecaller -v --emit-fastq -b 96 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
System memory keeps increasing until the full 64 GB of system RAM is used up on top of the 16 GB of VRAM on the RTX 5000 Ada (80 GB total, which is the maximum Windows reports as "total GPU memory" on my system). When the system finally runs out of memory, dorado prints an error about CUDA gemm functions failing to allocate, says it will clear the CUDA cache and retry, but instead dies or locks up hard.
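(To quantify the growth over time, a simple logging sketch; nvidia-smi works on both Windows and Linux, and the 5-second interval and file name are arbitrary:)
# log GPU memory use every 5 s while dorado runs in another terminal
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 5 > gpu_mem.log
# host RAM can be tracked alongside it, e.g. Task Manager on Windows or `free -h -s 5` on Linux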
Run environment:
Dorado version: 0.8.1 and 0.8.2. (0.8.0 crashes silently after producing ~70 MB of output, no matter which batch-size parameters are chosen. It does not run out of RAM; it simply crashes with no error, even with -vv.)
Dorado command:
dorado.exe basecaller -v --emit-fastq -b 32 --kit-name SQK-RBK114-24 --output-dir basecalled/ sup pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5
Operating system: Windows 11 23H2
Hardware (CPUs, Memory, GPUs):
Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance):
pod5 from MinKNOW (bla_bla_skipped_5.pod5)
Source data location (on device or networked drive - NFS, etc.):
Local SSD
Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
E8.2.1 enzyme, Kit 14 (latest chemistry, kit, and pore), read-length N50 ~6 Mb, unknown total read count; total pod5 size: 64 GB. (There are multiple 64 GB pod5 files, but I'm trying one at a time.)
Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):
Cannot reproduce on a small pod5. In fact, that seems to be the problem, as it runs fine right up until it runs out of RAM. I cannot split the pod5 due to apparent bugs in the pod5 format that make it impossible to split a large pod5 into smaller chunks [edit: see below, I try this anyway and it works, but a folder of split files does not]. This is the original data from MinKNOW, not something finagled by me or converted from other formats, so there is no option to "regenerate" the pod5 files using different splitting criteria. (I had instructed MinKNOW to split by number of reads, but that apparently doesn't apply to the _skip files, only the basecalled ones.)
Logs
Log with -v provided from 0.8.2 (0.8.1 produces a very similar log).
At the very end as the memory is completely exhausted (80GB used of VRAM + Shared VRAM), it prints something like CUDA kernel couldn't allocate for gemm_something... then the display completely locks up (hard freeze).
Is there a way to split a 64GB pod5 file produced by MinKNOW? (pod5 subset is a bit broken) I can probably work through this glitch if I can just split this thing into a few hundred parts (each just small enough to fit under the 80GB VRAM without crashing). My best run so far had a 350MB fastq output (with -b 32), but it won't accept smaller values of "-b".
[Update]
Currently trying pod5 subset anyway.
Here's how.
printf "read_id\tbarcode\n" > map.tsv; sed -n '1~4p' test.fq | grep -F 'barcode' | sed 's/\t.*_barcode/\tbarcode/' | sed 's/\tDS.*//' | cut -c2- >> map.tsv
<-- this is the tsv mapping file pod5 subset expects.
pod5 subset ../pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 --columns barcode --table map.tsv --threads 1 --missing-ok
<-- this spits out a pod5 file per barcode.
This is slow but at least performs some subsetting. If any barcode is still too big, I'll trivially add a new column (
awk -F'\t' 'BEGIN{OFS=FS} NR==1{print $0,"chunk"} NR>1{print $0, int((NR-2)/1000)}'
) to split it into 1000-record chunks and call subset on that column.
All-in-one to make a tsv where you can split on barcode, raw batch, or batch-within-barcode ("barcodeBatch"):
cat test.fq | awk -F'\t' 'BEGIN{print "read_id\tbarcode\tchunk\tbarcodeBatch"; OFS=FS} NR%4==1 && /barcode/ {sub(/^@/,"",$1); match($3,/barcode[0-9]+/,m); print $1,m[0],int(++c/1000),m[0] "-" int(b[m[0]]/1000); ++b[m[0]]}' > map.tsv
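(Putting the pieces together, a sketch that reuses the commands above; the barcodeBatch column comes from the awk one-liner and the dorado flags from the original command, while the output-file glob assumes pod5 subset's default naming from the column values:)
# split on the per-barcode batch column so no single output pod5 is too large
pod5 subset ../pod5/PBA45027_skip_ff434a84_3ce02e27_5.pod5 --columns barcodeBatch --table map.tsv --threads 1 --missing-ok
# basecall each piece separately to keep any single run's memory footprint bounded
for f in barcode*-*.pod5; do
  dorado basecaller -v --emit-fastq -b 96 --kit-name SQK-RBK114-24 --output-dir "basecalled_${f%.pod5}/" sup "$f"
done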
But if it works, it's a workaround. The underlying bug is that 'sup' doesn't know how to evict context or stale data from its memory. This is a modern Ada-generation GPU with 16 GB of VRAM. It's one of the most common modern GPUs in mobile workstations for AI/ML, and among the most performant and efficient, so it would make sense to support it.
[Update 2]
The sinkhole occurs even when dorado has seen exactly the same reads in exactly the same order as in the single subset file, implying that a parameter governing future behavior (total reads, total allowable padding, something dependent on the file size) is what's messing things up. Perhaps there is a "context window" used by the transformer that is pre-initialized to the entire sequence space and needs to be reined in. Or an int32 where an int64 is expected in the CUDA defaults on Windows, resulting in arithmetic underflow. Etc.
Hopefully these observations will help you fix the bug.