Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

modkit bedmethyl merge bug #350

Open
paulinor42 opened this issue Jan 17, 2025 · 4 comments
Open

modkit bedmethyl merge bug #350

paulinor42 opened this issue Jan 17, 2025 · 4 comments
Labels
bug Something isn't working needs-attention Requires follow up from developers troubleshooting workflow and data preparation questions

Comments

@paulinor42
Copy link

Hello,
I am trying to merge one pileup.bed containing 5mC and 5hmC mod with another pileup.bed containing 5mC and 5hmC mods. Both files have been compressed and indexed according to the online documentation.

Below is the command I am using

dist_modkit_v0.4.2_10d99bc/modkit bedmethyl merge \
PROM0146_Frost_179130_06152023_5mCG_5hmCG_hs1_sort_pileup.bed.gz PROM0146_Frost_179130_FC2_10042023_5mCG_5hmCG_5khz_hs1_sort_pileup.bed.gz \
-o 179130_5mCG_5hmCG_hs1_pileup.bed \
-g genome_references/chm13/hs1_genome_sizes.tsv \
--threads 20 \
--force \
--log-filepath merge.log \
--interval-size 50000

However, the program keeps stalling and not exiting, and it does not write anything to the output pileup bed.

This is the stdout that the commands freezes on. I notice it keeps on processing contigs even though they are only 25 in the genome sizes file.

calculated chunk size: 30, interval size 50000, processing 1500000 positions concurrently
[00:03:01] ######################################## 1001/25 contigs processed
0 merging contigs
0 batch errors

Here is log output

[src/command_utils.rs::295][2025-01-17 08:25:10][INFO] calculated chunk size: 30, interval size 50000, processing 1500000 positions concurrently
[src/interval_chunks.rs::512][2025-01-17 08:25:10][DEBUG] there are 25 contig(s) to work on (25 parts)

Thank you for your help. modkit is a very helpful and useful tool!

@ArtRand
Copy link
Contributor

ArtRand commented Jan 17, 2025

Hello @paulinor42

Sorry about the stall. So if I'm understanding correctly, hs1_genome_sizes.tsv has a subset of the contigs in the bedMethyls. This should be fine and a supported use case.

I notice it keeps on processing contigs even though they are only 25 in the genome sizes file.

This is a bug, but it should be a harmless one, and shouldn't cause a stall like you're seeing - and actually provides a clue to what's going on.

To me it looks like Modkit is failing to write to the file, or the filesystem is holding it up. The program loads sections of each bedMethyl in parallel, merges them, and puts the merged records on a queue to be written. This queue is set to a maximum capacity of 1000 items. So the fact that it stalls at 1001 makes me think that the queue is full.

Would you be willing to share the data that produces this problem with me so I can try and reproduce it? If so, please send me email at art.rand[at]nanoporetech.com and we can sort it out. Thanks!

@ArtRand ArtRand added bug Something isn't working troubleshooting workflow and data preparation questions labels Jan 17, 2025
@paulinor42
Copy link
Author

Hi @ArtRand,

Thanks for the response.
The hs1_genome_sizes.tsv file has all the contigs in the bed methyl file. So the bedmethyl files and the hs1_genome_sizes.tsv file should have only 25 contigs. The contigs are the chromosomes in the CHM13 genome; originally, I used the .fai file of the reference sequences I aligned to but i get the same error.

and yes! I can send you the files.

@ArtRand
Copy link
Contributor

ArtRand commented Jan 23, 2025

Hello @paulinor42,

For some reason your email got triaged and didn't make it to my inbox. I found it and I have your files. I'll try to reproduce asap and get back to you about the solution. Thanks!

@ArtRand ArtRand added the needs-attention Requires follow up from developers label Jan 23, 2025
@stevenb21
Copy link

stevenb21 commented Jan 31, 2025

Hello @ArtRand, big fan of what you are doing - I just wanted to add that I also face the same bug, however my magic number is 699 instead of 1001.

Below is the command I ran:
modkit bedmethyl merge sample1_h1.bedmethyl.gz sample1_h2.bedmethyl.gz sample1_ug.bedmethyl.gz -o "$out_path" -g "$sizes_path" --force --log-filepath "$log_path"

I created the size.tsv file with the following awk code from another ticket:
awk '/^>/ {if (seq) print chr"\t"seq; chr=substr($1,2); seq=0; next} {seq+=length($0)} END {if (seq) print chr"\t"seq}' input.fasta > sizes.tsv

I created tabix files for each .bedmethyl.gz file with

tabix -p bed sample1_h1.bedmethyl.gz

and the log output:
[src/command_utils.rs::295][2025-01-31 13:42:41][INFO] calculated chunk size: 6, interval size 100000, processing 600000 positions concurrently [src/interval_chunks.rs::512][2025-01-31 13:42:41][DEBUG] there are 25 contig(s) to work on (25 parts)

This could be user error on my part, please let me know if I am missing anything workflow wise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs-attention Requires follow up from developers troubleshooting workflow and data preparation questions
Projects
None yet
Development

No branches or pull requests

3 participants