
what does weight.by.var do mathematically? #2237

Closed
brianpenghe opened this issue Oct 22, 2019 · 6 comments

Comments

@brianpenghe

I was running RunPCA. The documentation entry confused me:

> weight.by.var
> Weight the cell embeddings by the variance of each PC (weights the gene loadings if rev.pca is TRUE)

Can I ask what it does mathematically? Is there a formula?

@timoast
Collaborator

timoast commented Oct 25, 2019

You can see here, the cell embeddings are multiplied by the variance of the PC:

cell.embeddings <- pca.results$u %*% diag(pca.results$d)
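
To spell out the formula (a toy sketch outside Seurat; `X`, `sv`, `scores`, and `pr` are illustrative names): for a centered matrix X with SVD X = U D Vᵀ, the product `u %*% diag(d)` equals X V, the ordinary PC scores, and PC i then has standard deviation d_i / sqrt(n - 1):

```r
# Toy illustration of the embedding computation quoted above
set.seed(1)
X <- scale(matrix(rnorm(50 * 10), nrow = 50), scale = FALSE)  # centered, 50 x 10
sv <- svd(X)
scores <- sv$u %*% diag(sv$d)         # u %*% diag(d), mirroring the line above

pr <- prcomp(X, center = FALSE)
max(abs(abs(scores) - abs(pr$x)))     # ~0: equal to the plain PC scores up to sign
c(sd(scores[, 1]), sv$d[1] / sqrt(nrow(X) - 1))  # PC1 sd equals d_1 / sqrt(n - 1)
```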

@timoast timoast closed this as completed Oct 25, 2019
@jan-glx
Contributor

jan-glx commented Nov 16, 2019

@brianpenghe raised a very relevant question here. The behavior of weight.by.var differs with the approx parameter and can't possibly do what it is meant to do (whatever that is):

```r
library(Seurat)
library(tidyverse)

set.seed(seed = 1)
dummyexpMat <- matrix(
  data = stats::rexp(n = 2e4, rate = 1),
  ncol = 200, nrow = 100
)
colnames(x = dummyexpMat) <- paste0("cell", seq(ncol(x = dummyexpMat)))
row.names(x = dummyexpMat) <- paste0("gene", seq(nrow(x = dummyexpMat)))

# Create Seurat object for testing
obj <- CreateSeuratObject(counts = dummyexpMat)
# Normalize
obj <- NormalizeData(object = obj, verbose = FALSE)
# Scale
obj <- ScaleData(object = obj, verbose = FALSE)

# For every combination of approx and weight.by.var, run PCA and compare the
# reported PC1 standard deviation with the actual sd of the PC1 cell embeddings
expand.grid(approx = c(TRUE, FALSE), weight.by.var = c(TRUE, FALSE)) %>%
  rowwise() %>%
  do(with(., {
    obj <- RunPCA(
      object = obj,
      features = rownames(x = obj),
      verbose = FALSE,
      npcs = 10,
      weight.by.var = weight.by.var,
      approx = approx
    )
    stdev_pca <- slot(object = obj[["pca"]], name = "stdev")[1]
    stdev_embedding <- sqrt(diag(cov(slot(object = obj[["pca"]], name = "cell.embeddings"))))[1]
    data.frame(
      weight.by.var,
      approx,
      stdev_pca,
      stdev_embedding_is = stdev_embedding,
      stdev_embedding_should_1 = if (weight.by.var) stdev_pca^3 else stdev_pca,
      stdev_embedding_should_2 = if (weight.by.var) stdev_pca^2 else stdev_pca
    )
  }))
```
| weight.by.var | approx | stdev_pca | stdev_embedding_is | stdev_embedding_should_1 | stdev_embedding_should_2 |
| --- | --- | --- | --- | --- | --- |
| **TRUE** | **TRUE** | 1.701479 | 1.70147856 | 4.925830 | 2.895029 |
| TRUE | FALSE | 1.701479 | 4.92583023 | 4.925830 | 2.895029 |
| FALSE | TRUE | 1.701479 | 0.07088812 | 1.701479 | 1.701479 |
| FALSE | FALSE | 1.701479 | 1.70147856 | 1.701479 | 1.701479 |

This table shows the standard deviation of the 1st principal component for some random data; the default parameter values are in bold. I can't imagine any use for any scaling other than the default one (an embedding whose standard deviation equals the PC's standard deviation), but I see two possible interpretations of weight.by.var: should_1, the embedding multiplied by the variance of the PC, and should_2, the embedding multiplied by the standard deviation of the PC, such that its standard deviation equals the variance of the PC.
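
To make the arithmetic of the two readings concrete (a toy sketch, independent of Seurat; `e` and `s` are illustrative): if a score vector has standard deviation s, weighting by the variance gives sd s³ and weighting by the standard deviation gives sd s²:

```r
# Toy check of the two interpretations (values match the table above)
set.seed(1)
e <- rnorm(1e4)
e <- e / sd(e) * 1.701479   # force sd(e) to equal the observed stdev_pca
s <- sd(e)
sd(e * s^2)  # "should_1": weight by variance -> s^3 ~= 4.925830
sd(e * s)    # "should_2": weight by sd       -> s^2 ~= 2.895029
```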

Luckily, the default parameters (approx=TRUE, weight.by.var=TRUE), while not giving the expected result, give the sensible result desired in all cases I can think of.

To overcome this inconsistency without affecting too many users, I propose the following (a sketch of the argument handling follows below):

  • new behavior would match the current default
  • deprecate weight.by.var and drop support for weighting by variance
    • issue a warning if !missing(weight.by.var)
  • if approx & !missing(weight.by.var) & weight.by.var, warn that no weighting was, nor is, actually performed
    • this avoids warning everyone who just used the default parameters
    • Caveat: this might lead to unnecessary warnings for people using wrappers
    • Caveat: this fails to warn approx=TRUE users who left weight.by.var=TRUE implicitly (via the default) while expecting the "stdev_embedding_should_1" behavior
  • if approx & !weight.by.var, raise an error describing the bug and explaining how to get the desired behavior (correct/no weighting)

Additionally, it would simplify things to

  • deprecate approx and drop support for prcomp
    • issue a warning if !missing(approx); use irlba in any case
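A minimal sketch of what that argument handling could look like (hypothetical code, not the actual PR; the real RunPCA signature has many more parameters):

```r
# Hypothetical sketch of the proposed deprecation logic (not Seurat code)
run_pca_sketch <- function(approx = TRUE, weight.by.var = TRUE) {
  if (!missing(weight.by.var)) {
    if (approx && weight.by.var) {
      warning("weight.by.var is deprecated; with approx = TRUE, no weighting ",
              "was, nor is, actually performed")
    } else if (approx && !weight.by.var) {
      stop("due to a bug, approx = TRUE never applied weight.by.var = FALSE; ",
           "embeddings are now always plain PC scores (the old default output)")
    } else {
      warning("weight.by.var is deprecated; weighting by variance is no ",
              "longer supported")
    }
  }
  if (!missing(approx)) {
    warning("approx is deprecated; irlba is now used in all cases")
  }
  # ... irlba-based PCA would follow here ...
}
```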

I will try to submit a pull request later today or tomorrow.

@jan-glx
Contributor

jan-glx commented Nov 16, 2019

OK, looking at the history, it seems that scale.by.varexp was renamed to (the hardly more comprehensible) weight.by.var in 6a9dbd9, before it was apparently misinterpreted when the prcomp option was added in c8fca64.
This indicates that the purpose of scale.by.varexp was to produce the normal PCA output, while weight.by.var=FALSE gives embeddings in which each PC has the same variance.
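
A quick toy check of that equal-variance reading (outside Seurat; `X` and `u` are illustrative): the unweighted embeddings are the unit-norm columns of U, so every PC ends up with the same standard deviation:

```r
set.seed(1)
X <- scale(matrix(rnorm(200 * 8), nrow = 200), scale = FALSE)  # 200 "cells"
u <- svd(X)$u
apply(u, 2, sd)  # every PC: 1 / sqrt(199) ~= 0.07088812, as in the table above
```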

This slightly changes the interpretation of the above, but I still think it would make sense to deprecate the weight.by.var argument. @andrewwbutler, do you know if there is any valid use case for this?
@timoast, what is the use of approx=FALSE?

@andrewwbutler
Collaborator

Hi @jan-glx ,

Thanks for the detailed responses/PRs! The motivation for the weight.by.var parameter is cases where an important rare cell type is defined by a PC further down the list (e.g. PC50). By not weighting PCs by their variance, the hope is that the rare cell type comes out as distinct in a spectral clustering analysis. While it is not something that we tend to use very often, it can sometimes be helpful in exploratory analyses.

In order to have consistent behavior between approx = TRUE and approx = FALSE, I believe the if/else for the approx = FALSE case, currently

```r
if (weight.by.var) {
  cell.embeddings <- pca.results$x %*% diag(pca.results$sdev[1:npcs]^2)
} else {
  cell.embeddings <- pca.results$x
}
```

should actually be

```r
if (weight.by.var) {
  cell.embeddings <- pca.results$x
} else {
  # divide each PC score column by sdev * sqrt(n - 1) to match irlba's u
  cell.embeddings <- pca.results$x %*% diag(1 / (pca.results$sdev[1:npcs] * sqrt(max(1, ncol(object) - 1))))
}
```
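
As a quick sanity check that this matches the unweighted approx = TRUE output (a toy sketch outside Seurat; `mat` stands in for the centered cells-by-features matrix):

```r
library(irlba)

set.seed(1)
mat <- scale(matrix(rnorm(200 * 20), nrow = 200), scale = FALSE)  # 200 "cells"

pr <- prcomp(mat, center = FALSE)
sv <- irlba(mat, nv = 5)  # the approx = TRUE path keeps irlba's u as embeddings

# proposed unweighted embeddings: each PC score column divided by sdev * sqrt(n - 1)
unweighted <- sweep(pr$x[, 1:5], 2, pr$sdev[1:5] * sqrt(max(1, nrow(mat) - 1)), "/")

max(abs(abs(unweighted) - abs(sv$u)))  # ~0: matches irlba's u up to column signs
```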

We should definitely fix the bug for the next release but I think we should maintain the option.

@jan-glx
Contributor

jan-glx commented Nov 19, 2019

Yes: #2330 does just that (1), plus tests (2), handling of the transition with warnings & an option (3), and merging with the duplicated rev.pca code (4).

Should I remove any part of this? Or do you just want to do (1)? I was afraid that people using approx=FALSE might have noticed that weight.by.var=FALSE gives them the desired result, in which case a smoother transition would be nice.

Another alternative (to the option-based transition in #2330, or to eliminating the feature completely in #2331) would be to introduce another argument, say standardize.cell.embeddings=FALSE, to replace weight.by.var. Having unit variance would likely be much safer for any downstream analysis than a variance scaled to 1/(ncol(object) - 1), and it would be easier to understand (at least for me) what the option does.
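
For illustration, that option could be as simple as a post-hoc rescaling (a sketch; `standardize_embeddings` is a hypothetical helper, not an existing Seurat function):

```r
# Hypothetical helper: rescale each PC so the cell embeddings have unit variance
standardize_embeddings <- function(cell.embeddings) {
  sweep(cell.embeddings, 2, apply(cell.embeddings, 2, sd), "/")
}
```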

Thanks for considering!

@andrewwbutler
Collaborator

I like the idea of standardizing to unit variance: it is a more intuitive operation and follows the same logic as what we were trying to do before. Then we also wouldn't need quite as elaborate a transition; we could simply check whether someone is trying to pass the weight.by.var parameter and, if so, give a warning/explanation of the change.
