
what does weight.by.var do mathematically? #2237

Closed
brianpenghe opened this issue Oct 22, 2019 · 6 comments

Comments

@brianpenghe

I was running RunPCA. The documentation entry confused me:

> weight.by.var
> Weight the cell embeddings by the variance of each PC (weights the gene loadings if rev.pca is TRUE)

Can I ask what it does mathematically? Is there a formula?

@timoast
Collaborator

timoast commented Oct 25, 2019

You can see here, the cell embeddings are multiplied by the variance of the PC:

cell.embeddings <- pca.results$u %*% diag(pca.results$d)
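
To spell out the formula (a toy sketch outside Seurat; `X`, `sv`, `scores`, and `pr` are illustrative names): for a centered matrix X with SVD X = U D Vᵀ, the product `u %*% diag(d)` equals X V, the ordinary PC scores, and PC i then has standard deviation d_i / sqrt(n - 1):

```r
# Toy illustration of the embedding computation quoted above
set.seed(1)
X <- scale(matrix(rnorm(50 * 10), nrow = 50), scale = FALSE)  # centered, 50 x 10
sv <- svd(X)
scores <- sv$u %*% diag(sv$d)         # u %*% diag(d), mirroring the line above

pr <- prcomp(X, center = FALSE)
max(abs(abs(scores) - abs(pr$x)))     # ~0: equal to the plain PC scores up to sign
c(sd(scores[, 1]), sv$d[1] / sqrt(nrow(X) - 1))  # PC1 sd equals d_1 / sqrt(n - 1)
```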

@timoast timoast closed this as completed Oct 25, 2019
@jan-glx
Contributor

jan-glx commented Nov 16, 2019

@brianpenghe raised a very relevant question here. The behavior of weight.by.var differs with the approx parameter and can't possibly do what it is meant to do (whatever that is):

```r
library(Seurat)
library(tidyverse)

set.seed(seed = 1)
dummyexpMat <- matrix(
  data = stats::rexp(n = 2e4, rate = 1),
  ncol = 200, nrow = 100
)
colnames(x = dummyexpMat) <- paste0("cell", seq(ncol(x = dummyexpMat)))
row.names(x = dummyexpMat) <- paste0("gene", seq(nrow(x = dummyexpMat)))

# Create Seurat object for testing
obj <- CreateSeuratObject(counts = dummyexpMat)
# Normalize
obj <- NormalizeData(object = obj, verbose = FALSE)
# Scale
obj <- ScaleData(object = obj, verbose = FALSE)

# For every combination of approx and weight.by.var, run PCA and compare the
# reported PC1 standard deviation with the actual sd of the PC1 cell embeddings
expand.grid(approx = c(TRUE, FALSE), weight.by.var = c(TRUE, FALSE)) %>%
  rowwise() %>%
  do(with(., {
    obj <- RunPCA(
      object = obj,
      features = rownames(x = obj),
      verbose = FALSE,
      npcs = 10,
      weight.by.var = weight.by.var,
      approx = approx
    )
    stdev_pca <- slot(object = obj[["pca"]], name = "stdev")[1]
    stdev_embedding <- sqrt(diag(cov(slot(object = obj[["pca"]], name = "cell.embeddings"))))[1]
    data.frame(
      weight.by.var,
      approx,
      stdev_pca,
      stdev_embedding_is = stdev_embedding,
      stdev_embedding_should_1 = if (weight.by.var) stdev_pca^3 else stdev_pca,
      stdev_embedding_should_2 = if (weight.by.var) stdev_pca^2 else stdev_pca
    )
  }))
```
| weight.by.var | approx | stdev_pca | stdev_embedding_is | stdev_embedding_should_1 | stdev_embedding_should_2 |
| --- | --- | --- | --- | --- | --- |
| **TRUE** | **TRUE** | 1.701479 | 1.70147856 | 4.925830 | 2.895029 |
| TRUE | FALSE | 1.701479 | 4.92583023 | 4.925830 | 2.895029 |
| FALSE | TRUE | 1.701479 | 0.07088812 | 1.701479 | 1.701479 |
| FALSE | FALSE | 1.701479 | 1.70147856 | 1.701479 | 1.701479 |

This table shows the standard deviation of the 1st principal component for some random data; the default parameter values are in bold. I can't imagine any use for any scaling other than the default one (an embedding whose standard deviation equals the PC's standard deviation), but I see two possible interpretations of weight.by.var: should_1, the embedding multiplied by the variance of the PC, and should_2, the embedding multiplied by the standard deviation of the PC, such that its standard deviation equals the variance of the PC.
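
To make the arithmetic of the two readings concrete (a toy sketch, independent of Seurat; `e` and `s` are illustrative): if a score vector has standard deviation s, weighting by the variance gives sd s³ and weighting by the standard deviation gives sd s²:

```r
# Toy check of the two interpretations (values match the table above)
set.seed(1)
e <- rnorm(1e4)
e <- e / sd(e) * 1.701479   # force sd(e) to equal the observed stdev_pca
s <- sd(e)
sd(e * s^2)  # "should_1": weight by variance -> s^3 ~= 4.925830
sd(e * s)    # "should_2": weight by sd       -> s^2 ~= 2.895029
```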

Luckily, the default parameters (approx=TRUE, weight.by.var=TRUE), while not giving the expected result, give the sensible result desired in all cases I can think of.

To overcome this inconsistency without affecting too many users, I propose the following (a sketch of the argument handling follows below):

  • new behavior would match the current default
  • deprecate weight.by.var and drop support for weighting by variance
    • issue a warning if !missing(weight.by.var)
  • if approx & !missing(weight.by.var) & weight.by.var, warn that no weighting was, nor is, actually performed
    • this avoids warning everyone who just used the default parameters
    • Caveat: this might lead to unnecessary warnings for people using wrappers
    • Caveat: this fails to warn approx=TRUE users who left weight.by.var=TRUE implicitly (via the default) while expecting the "stdev_embedding_should_1" behavior
  • if approx & !weight.by.var, raise an error describing the bug and explaining how to get the desired behavior (correct/no weighting)

Additionally, it would simplify things to

  • deprecate approx and drop support for prcomp
    • issue a warning if !missing(approx); use irlba in any case
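A minimal sketch of what that argument handling could look like (hypothetical code, not the actual PR; the real RunPCA signature has many more parameters):

```r
# Hypothetical sketch of the proposed deprecation logic (not Seurat code)
run_pca_sketch <- function(approx = TRUE, weight.by.var = TRUE) {
  if (!missing(weight.by.var)) {
    if (approx && weight.by.var) {
      warning("weight.by.var is deprecated; with approx = TRUE, no weighting ",
              "was, nor is, actually performed")
    } else if (approx && !weight.by.var) {
      stop("due to a bug, approx = TRUE never applied weight.by.var = FALSE; ",
           "embeddings are now always plain PC scores (the old default output)")
    } else {
      warning("weight.by.var is deprecated; weighting by variance is no ",
              "longer supported")
    }
  }
  if (!missing(approx)) {
    warning("approx is deprecated; irlba is now used in all cases")
  }
  # ... irlba-based PCA would follow here ...
}
```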

I will try to submit a pull request later today or tomorrow.

@jan-glx
Contributor

jan-glx commented Nov 16, 2019

OK, looking at the history, it seems that scale.by.varexp was renamed to (the hardly more comprehensible) weight.by.var in 6a9dbd9, before it was apparently misinterpreted when the prcomp option was added in c8fca64.
This indicates that the purpose of scale.by.varexp was to produce the normal PCA output, while weight.by.var=FALSE gives embeddings in which each PC has the same variance.
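
A quick toy check of that equal-variance reading (outside Seurat; `X` and `u` are illustrative): the unweighted embeddings are the unit-norm columns of U, so every PC ends up with the same standard deviation:

```r
set.seed(1)
X <- scale(matrix(rnorm(200 * 8), nrow = 200), scale = FALSE)  # 200 "cells"
u <- svd(X)$u
apply(u, 2, sd)  # every PC: 1 / sqrt(199) ~= 0.07088812, as in the table above
```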

This slightly changes the interpretation of the above, but I still think it would make sense to deprecate the weight.by.var argument. @andrewwbutler, do you know if there is any valid use case for this?
@timoast, what is the use of approx=FALSE?

@andrewwbutler
Collaborator

Hi @jan-glx ,

Thanks for the detailed responses/PRs! The motivation for the weight.by.var parameter is cases where an important rare cell type is defined by a PC further down the list (e.g. PC50). By not weighting PCs by their variance, the hope is that the rare cell type comes out as distinct in a spectral clustering analysis. While it is not something that we tend to use very often, it can sometimes be helpful in exploratory analyses.

In order to have consistent behavior between approx = TRUE and approx = FALSE, I believe the if/else for the approx = FALSE case, currently

```r
if (weight.by.var) {
  cell.embeddings <- pca.results$x %*% diag(pca.results$sdev[1:npcs]^2)
} else {
  cell.embeddings <- pca.results$x
}
```

should actually be

```r
if (weight.by.var) {
  cell.embeddings <- pca.results$x
} else {
  # divide each PC score column by sdev * sqrt(n - 1) to match irlba's u
  cell.embeddings <- pca.results$x %*% diag(1 / (pca.results$sdev[1:npcs] * sqrt(max(1, ncol(object) - 1))))
}
```
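
As a quick sanity check that this matches the unweighted approx = TRUE output (a toy sketch outside Seurat; `mat` stands in for the centered cells-by-features matrix):

```r
library(irlba)

set.seed(1)
mat <- scale(matrix(rnorm(200 * 20), nrow = 200), scale = FALSE)  # 200 "cells"

pr <- prcomp(mat, center = FALSE)
sv <- irlba(mat, nv = 5)  # the approx = TRUE path keeps irlba's u as embeddings

# proposed unweighted embeddings: each PC score column divided by sdev * sqrt(n - 1)
unweighted <- sweep(pr$x[, 1:5], 2, pr$sdev[1:5] * sqrt(max(1, nrow(mat) - 1)), "/")

max(abs(abs(unweighted) - abs(sv$u)))  # ~0: matches irlba's u up to column signs
```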

We should definitely fix the bug for the next release but I think we should maintain the option.

@jan-glx
Contributor

jan-glx commented Nov 19, 2019

Yes: #2330 does just that (1), plus tests (2), handling of the transition with warnings & an option (3), and merging with the duplicated rev.pca code (4).

Should I remove any part of this? Or do you just want to do (1)? I was afraid that people using approx=FALSE might have noticed that weight.by.var=FALSE gives them the desired result, in which case a smoother transition would be nice.

Another alternative (to the option-based transition in #2330, or to eliminating the feature completely in #2331) would be to introduce another argument, say standardize.cell.embeddings=FALSE, to replace weight.by.var. Having unit variance would likely be much safer for any downstream analysis than a variance scaled to 1/(ncol(object) - 1), and it would be easier to understand (at least for me) what the option does.
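
For illustration, that option could be as simple as a post-hoc rescaling (a sketch; `standardize_embeddings` is a hypothetical helper, not an existing Seurat function):

```r
# Hypothetical helper: rescale each PC so the cell embeddings have unit variance
standardize_embeddings <- function(cell.embeddings) {
  sweep(cell.embeddings, 2, apply(cell.embeddings, 2, sd), "/")
}
```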

Thanks for considering!

@andrewwbutler
Collaborator

I like the idea of standardizing to unit variance: it is a more intuitive operation and follows the same logic as what we were trying to do before. Then we also wouldn't need quite as elaborate a transition; we could simply check whether someone is trying to pass the weight.by.var parameter and, if so, give a warning/explanation of the change.
