
Issue in Inosine detection #373

Open
Salvobioinfo opened this issue Feb 10, 2025 · 3 comments
Labels
question: Looking for clarification on inputs and/or outputs

Comments

Salvobioinfo commented Feb 10, 2025

We created an ADAR KO cell line, meaning no inosine should be detected in its RNA. This expectation, along with the reliability of our knockout, was confirmed by Illumina sequencing. We then performed nanopore sequencing on the same set of samples. I basecalled our library using Dorado 0.8 with the hac,inosine_m6A model. As suggested by the modkit manual, I ran sample-probs to fine-tune the filtering threshold for inosine and m6A detection. However, when I used the output file to generate a density plot of the total counts at each probability level, I was surprised to find no significant differences between ADAR KO and control samples.
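A sketch of the pipeline described above (paths illustrative; flags per the dorado and modkit docs and may differ by version):

```bash
# basecall with the hac model plus the inosine_m6A modified-base model
dorado basecaller hac,inosine_m6A pod5_dir/ > calls.bam

# sample per-read modification probabilities to pick pass thresholds;
# --hist writes histograms of the sampled probabilities to the output directory
modkit sample-probs calls.bam --hist -o sample_probs_out/
```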

I also attempted to use the inosine sites identified from Illumina sequencing as ground truth, but this approach resulted in many false negatives in CTRLs and false positives in KOs. Is there any planned solution to address this issue? Thanks in advance.

[Figure: density plots of total call counts at each probability level for ADAR KO vs. control samples]

ArtRand commented Feb 12, 2025

Hello @Salvobioinfo,

> However, when I used the output file to generate a density plot of the total counts at each probability level, I was surprised to find no significant differences between ADAR KO and control samples.

A couple of things.

(1) Distributions that look like this, with a downward-sloping line from the left (which I'm assuming is the density of low-confidence calls), usually indicate that a lot of the probabilities in the plot are due to false positives. If you look at just the frequency of very high-confidence inosine calls, do you see much of a difference between the KO and Ctrl?
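One way to check this is to regenerate the pileup with a stricter pass threshold, for example (value illustrative; --filter-threshold sets the per-base pass threshold):

```bash
# count only calls where the model's probability at A positions clears 0.99
# (assumes calls.bam is aligned, sorted, and indexed)
modkit pileup calls.bam high_conf.pileup.bed --filter-threshold A:0.99
```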

(2) What is the expected frequency of Inosine in your samples, roughly? It appears that the levels are close to the false positive rate of the model at a global level. But that may not be the case. Since you have orthogonal data, what levels do you expect?

> I also attempted to use the inosine sites identified from Illumina sequencing as ground truth, but this approach resulted in many false negatives in CTRLs and false positives in KOs. Is there any planned solution to address this issue? Thanks in advance.

How many FNs and FPs did you get? Could you use modkit validate to check?
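For example (file names hypothetical; each --bam-and-bed pairs a BAM with a BED of positions whose modification status is taken as ground truth; see the modkit docs for the expected BED format):

```bash
# hypothetical invocation: score calls against ILMN-derived ground-truth sites
modkit validate \
  --bam-and-bed ctrl.bam illumina_inosine_sites.bed \
  --bam-and-bed ko.bam illumina_canonical_sites.bed
```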

ArtRand added the question label Feb 12, 2025
Salvobioinfo reopened this Feb 14, 2025

Salvobioinfo commented Feb 14, 2025

Hello @ArtRand


> (1) Distributions that look like this, with a downward-sloping line from the left (which I'm assuming is the density of low-confidence calls), usually indicate that a lot of the probabilities in the plot are due to false positives. If you look at just the frequency of very high-confidence inosine calls, do you see much of a difference between the KO and Ctrl?

(Using 0.99) Approximately 163 sites differ between KO and CTRL. Given that I have triplicates for each condition, I consider an editing site valid only if it is detected in at least 2 of the 3 replicates. Additionally, the same sites should be well covered in the KO to ensure reliability.
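Concretely, that 2-of-3 filter looks roughly like this over the per-replicate bedMethyl pileups (cutoffs and file names illustrative; this assumes bedMethyl column 10 is valid coverage and column 11 is percent modified):

```bash
# per replicate: keep sites with decent coverage and a non-trivial modified fraction,
# then require the same site in at least 2 of the 3 replicates
for rep in ctrl_rep1 ctrl_rep2 ctrl_rep3; do
  awk '$10 >= 20 && $11 >= 5' "${rep}.pileup.bed" | cut -f1-3
done | sort | uniq -c | awk '$1 >= 2 { print $2 "\t" $3 "\t" $4 }' > ctrl_consensus_sites.bed
```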

> (2) What is the expected frequency of Inosine in your samples, roughly? It appears that the levels are close to the false positive rate of the model at a global level. But that may not be the case. Since you have orthogonal data, what levels do you expect?

Since we are discussing physiological RNA editing, inosine frequency typically ranges between 5% and 15% in my cell lines under steady-state conditions. After treatment, it increases to 10%–30%, with some sites reaching 40%–50%.

Both C→U and A→I physiological modifications generally occur at low frequencies. This makes me question the validity of the A→I detection model, especially if it hasn't been trained on proper biological samples and is instead based on modified oligos (I suppose); I'm not sure how reliable its claims are in this context. From Illumina sequencing, approximately 4,000 editing sites have been detected. Of course, I don't expect a perfect overlap, due to a series of technical factors including the large difference in coverage as well as several other methodological differences.


> How many FNs and FPs did you get? Could you use modkit validate to check?

Yes I could. 👍🏻👍🏻


ArtRand commented Feb 14, 2025

Hello @Salvobioinfo,

> (Using 0.99) Approximately 163 sites differ between KO and CTRL. Given that I have triplicates for each condition, I consider an editing site valid only if it is detected in at least 2 of the 3 replicates. Additionally, the same sites should be well covered in the KO to ensure reliability.

Are you looking at the percent-modified column in the pileup bedMethyls? In general, I would recommend using the bedMethyl when looking for changes in modifications at specific positions; it sounds like you're already doing this. When you were looking at the sample-probs output before, it got me thinking that you might be looking for read-level changes that don't all concentrate on a specific reference position. You can also use dmr pair to perform comparisons at reference positions, as sketched below.
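A sketch of a dmr pair run over two pileups (file names illustrative; flags per the modkit dmr docs; the bedMethyls must be position-sorted, bgzip-compressed, and tabix-indexed):

```bash
# compress and index the per-condition pileups
bgzip -k ctrl.pileup.bed && tabix -p bed ctrl.pileup.bed.gz
bgzip -k ko.pileup.bed && tabix -p bed ko.pileup.bed.gz

# per-position comparison of modification levels between the two conditions
modkit dmr pair \
  -a ctrl.pileup.bed.gz \
  -b ko.pileup.bed.gz \
  -o ctrl_vs_ko_dmr.bed \
  --ref ref.fa \
  --base A
```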

> Since we are discussing physiological RNA editing, inosine frequency typically ranges between 5% and 15% in my cell lines under steady-state conditions. After treatment, it increases to 10%–30%, with some sites reaching 40%–50%.

For changes on the order of 5–15% you will probably need relatively high coverage to know that a site differs between the two samples/conditions. The effect size model describes some of the intuition.
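As a back-of-envelope illustration of that intuition (not the effect size model itself): the binomial standard error of an observed modified fraction p at depth n is sqrt(p(1-p)/n), and the SE needs to be small relative to the shift you want to detect:

```bash
# rough standard error of a 15% modified fraction at increasing read depths
awk 'BEGIN { p = 0.15; for (n = 25; n <= 400; n *= 2) printf "depth %3d -> SE %.3f\n", n, sqrt(p * (1 - p) / n) }'
```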

I may have led you down the wrong path with modkit validate; the assumption with that command is that the provided sites are known to be entirely one modification state. If your ILMN data shows that a site has 30% A→I editing, you could label it as "I" and expect the accuracy to be ~30%, but I don't know if that helps you get to your research question. On the other hand, if your ILMN data suggests that a site is entirely inosine, the command should work as intended.
