Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

preprocessNoob using incorrect row indices to Meth and Unmeth matrices #264

Open
jonathonl opened this issue Feb 26, 2024 · 10 comments
Open

Comments

@jonathonl
Copy link
Contributor

minfi/R/preprocessNoob.R

Lines 386 to 392 in 77d3345

MSet <- preprocessRaw(rgSet)
probe.type <- getProbeType(MSet, withColor = TRUE)
Green_probes <- which(probe.type == "IGrn")
Red_probes <- which(probe.type == "IRed")
d2.probes <- which(probe.type == "II")
Meth <- getMeth(MSet)
Unmeth <- getUnmeth(MSet)

The preprocessNoob function uses indices from the getProbeType() function to select probes by type using their row index (see code linked above). The problem is that getProbeType() returns a vector based on the manifest which is unlikely to have the same length as the number of probes in the MethylSet. For example, the indices in d2.probes <- which(probe.type == "II") do not map to indices for the same probes in the return matrices from getMeth(MSet) and getUnmeth(MSet), but are in fact used to select the signals for type II probes.

I suspect that fixing this bug will improve the effectiveness of ssNoob in minfi.

@jonathonl jonathonl changed the title preprocessNoob using Incorrect row indices to Meth and Unmeth matrices preprocessNoob using incorrect row indices to Meth and Unmeth matrices Feb 26, 2024
@jonathonl
Copy link
Contributor Author

I created a proof of concept PR. I think a similar issue may be affecting FunNorm as well. I haven't looked at other normalization methods.

@kasperdanielhansen
Copy link
Collaborator

This could be a serious bug. However, I am not sure it is. Since I haven't yet had time to investigate, perhaps you could do some leg work.

The code - as currently written - assumes that the output of getProbeType() is a vector of the same length as the MSet and has a 1-1 match between the rows of the MSet etc. You're right in saying that a full manifest object can be much bigger than the size of the MSet. But note how the MSet is an argument to getProbeType? In my recollection, we go to some significant lengths to do the following

  1. Use the annotation to get the right Manifest object (which could be bigger than the MSet).
  2. Do some manipulation to make sure that the returned Manifest object gets subsetted and re-ordered to match 1-1 with the MSet.
    In my understanding, you're saying that (2) doesn't happen or there is a bug. Which could be true, but I am not certain.

If this mismatch is happening you should be able to see it by taking the default objects in the minfiData package, subset the objects to remove some probes or some loci and then observe that the probeType changes. Right? Could you try to make an example along those lines? That would make it much easier to reason about.

@jonathonl
Copy link
Contributor Author

You're right in saying that a full manifest object can be much bigger than the size of the MSet.

The issue actually arises when the manifest is smaller than the MSet (not bigger). We observe EPIC v1 idats in the wild that have more probes than what is contained in the B4/B5 EPIC manifest. I have already looked at the minfiData (450k as I remember) and it has the same number of probes as the manifest, so we won't be able to replicate the issue with this test data. But you can see example output from one of our datasets below.

> MSet <- preprocessRaw(rgset)
Loading required package: IlluminaHumanMethylationEPICmanifest
> dim(MSet)
[1] 866091     95
> probe.type <- getProbeType(MSet, withColor = TRUE)
Loading required package: IlluminaHumanMethylationEPICanno.ilm10b4.hg19
> length(probe.type)
[1] 865859

I've done some comparisons of my PR with the latest devel branch. I have taken an RGSet with 866091 probes and made a copy of the RGSet that is subset to the bead addresses found in the B5 manifest (excluding the manufacture change probes). I then ran preprocessNoob() on both RGSets using both the official version and my fixed version of minfi. When comparing the Noob-adjusted betas for the two RGSets, the official version produces a lot of discordance while the differences in the fixed version of minfi are negligible (see attached plots). Excluding a few hundred out-of-band signals from the background estimate shouldn't make much of a difference, so I suspect that the discordance seen in the official version of minfi is because changing the number of rows in the MSet affects which probes get Noob-adjusted.

minfi_noob_plots

@kasperdanielhansen
Copy link
Collaborator

kasperdanielhansen commented Feb 29, 2024 via email

@jonathonl
Copy link
Contributor Author

I cannot share samples from our data but I could share the bead address list. I'm assuming you can generate an RGChannelSet object with fake (random values) red/green matrices using these addresses. Would that be useful to you?

@kasperdanielhansen
Copy link
Collaborator

Perhaps you could generate such a simulated RGChannelSet or perhaps - instead of simulating - you just have integer values in the channels representing different types of probes. That would be useful.

I am still surprised by this, because the called to preprocessRaw should remove any probes/loci not in the manifest. But on the other hand, your plot is pretty convincing.

@jonathonl
Copy link
Contributor Author

I have created a test case with a couple samples from GEO.

Another thing I discovered is that when using read.metharray.exp(..., force=T) with idats that have different sizes (i.e., different number of bead addresses), the resulting methyl sets are of the same size as the probe type vector. So this issue can only be reproduced with an rgset made from idats of the same size. The test case shows this as well.

Running Rscript run.sh from the minfi_issue directory produces:

...

[1] "dim(rgset.1):"
[1] 1051815       1
[1] "dim(rgset.2):"
[1] 1051943       1
[1] "dim(rgset.force):"
[1] 1051539       2
Loading required package: IlluminaHumanMethylationEPICmanifest
[1] "dim(mset.1):"
[1] 866091      1
[1] "dim(mset.2):"
[1] 866238      1
[1] "dim(mset.force):"
[1] 865859      2
Loading required package: IlluminaHumanMethylationEPICanno.ilm10b4.hg19
[1] "length(probe.type.1):"
[1] 865859
[1] "length(probe.type.2):"
[1] 865859
[1] "length(probe.type.force):"
[1] 865859

Note: I had to use LZMA to compress this tar archive because the gzip version was too big for Github.

minfi_issue.lzma.tgz

@jonathonl
Copy link
Contributor Author

Another thing I discovered is that when using read.metharray.exp(..., force=T) with idats that have different sizes (i.e., different number of bead addresses), the resulting methyl sets are of the same size as the probe type vector. So this issue can only be reproduced with an rgset made from idats of the same size.

I just realized that this is because the set of probes that overlap between these two samples are all in the manifest. So if the intersection of probes across all samples are in the manifiest, then we don't have a problem. The issue can still exist if the idats have different sizes as long as the intersection of probes in the idats is larger than the manifest.

> length(intersect(rownames(mset.1), rownames(mset.2)))
[1] 865859
> dim(mset.force)
[1] 865859      2
> length(intersect(rownames(mset.1), getAnnotation(rgset.1)$Name))
[1] 865859
> length(intersect(rownames(mset.2), getAnnotation(rgset.2)$Name))
[1] 865859

@kasperdanielhansen
Copy link
Collaborator

I can replicate the issue.

The core of the issue is because of the (in hindsight wrong) to have separate Manifest and Annotation packages. This means that these packages may not be in sync. The issues happens when you use getAnnotation on a MethylSet/RatioSet etc which was created using a Manifest package (usually by a call to preprocessRaw) which results in more loci (CpGs) than is contained in the Annotation package. The diagnosis is simple: the output of getAnnotation has a smaller number of rows compared to the input object (typically a MethylSet).

This affects output from getAnnotation as well as getProbeType, getLocations, getIslandStatus, and getSnpInfo. A short term fix is to not assume that the output of these functions are matched 1-1 with the input object, but to do an intersection. This does not happen in getProbeType but it does happen in mapToGenome. The long term fix is to no longer have separate Manifest and Annotation packages.

I need to do a more careful analysis of the potential impact of this bug, including a clear description (for users) of how this could happen with various types of IDAT files and array versions. It might be nice to have a have a call with the OP with some more details and their observations. For this reason, I am waiting on merging the pull request.

@vjcitn
Copy link

vjcitn commented Jul 25, 2024

Is this still waiting on a call with the OP?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants