Scalability over larger datasets #5
Hi,
Is your method scalable to larger datasets? I tried running it on a dataset of size (10000, 8) and got an estimated run time as below. This should not be the case, since your own test dataset is of size (300, 8) and its time per iteration is low. Are you retraining the model to compute the SHAP values for each example? It is not clear to me why the time per iteration increases so much when the number of features stays the same.

Comments
Hi, I'm interested in where this problem might come from, but I can't reproduce it. Could you share a minimal reproducible code example, with this dataset or any other that causes the problem? I believe it's also related to #4.
Thanks for the prompt reply. Yes, this is related to issue #4, which I posted earlier. We can reproduce it by creating a random array of size (1000, 8). Here's a simple example of how to reproduce it:

```python
import numpy as np

# create a random dataset
X = np.random.rand(1000, 8)

# run random survival forest
from sksurv.ensemble import RandomSurvivalForest

# run survshap
from survshap import SurvivalModelExplainer, ModelSurvSHAP

pnd_survshap_global_rsf = ModelSurvSHAP(random_state=42)
```
Hi @krzyzinskim, were you able to reproduce this?
Hi @Addicted-to-coding @solidate, it's expected to be slow. The implemented (default) algorithm aims to approximate Shapley values "exactly" and is therefore useful for relatively small (background) datasets. So you can probably compute SurvSHAP(t) for 1000+ samples, but using only 100-200 samples as the background for estimation. Another way to speed up the calculations is to reduce the number of timestamps (the `timestamps` parameter). Also, RSF has slow inference, which adds to the time. See the comparison with a simpler CPH model:

```python
import numpy as np
import pandas as pd
from sksurv.ensemble import RandomSurvivalForest
from sksurv.linear_model import CoxPHSurvivalAnalysis
from survshap import SurvivalModelExplainer, ModelSurvSHAP

# create a random dataset with 1000 samples and 8 features
X = np.random.rand(1000, 8)
X = pd.DataFrame(X, columns=['f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8'])

# build the structured (event, time) array that scikit-survival expects
y = np.random.rand(1000, 1)
boo = np.random.choice(a=[True, False], size=(1000, 1), p=[0.5, 0.5])
out = np.empty(1000, dtype=[('event', '?'), ('time', '<f8')])
out['event'] = boo.reshape(-1)
out['time'] = y.reshape(-1)

# Cox proportional hazards model (fast inference)
cph = CoxPHSurvivalAnalysis()
cph.fit(X, out)
cph.score(X, out)

# random survival forest (slow inference)
rsf = RandomSurvivalForest(random_state=42, n_estimators=120, max_depth=8, max_features=3)
rsf.fit(X, out)
rsf.score(X, out)

# SurvSHAP(t) explanations: explaining CPH is much faster than explaining RSF
exp_cph = SurvivalModelExplainer(cph, X, out)
ms_cph = ModelSurvSHAP(random_state=42)
ms_cph.fit(exp_cph)

exp_rsf = SurvivalModelExplainer(rsf, X, out)
ms_rsf = ModelSurvSHAP(random_state=42)
ms_rsf.fit(exp_rsf)
```
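To make the suggested speed-up concrete, here is a minimal sketch (reusing `X`, `out`, and `rsf` from the snippet above) that subsamples the background dataset to 200 rows and coarsens the time grid. The 200-row background and the five-fold thinning are arbitrary choices, and passing `timestamps` to `fit()` is an assumption about the signature, so check it against the installed version:

```python
import numpy as np
from survshap import SurvivalModelExplainer, ModelSurvSHAP

# use a small random subsample as the background dataset for estimation
rng = np.random.default_rng(42)
idx = rng.choice(len(X), size=200, replace=False)
exp_rsf_small = SurvivalModelExplainer(rsf, X.iloc[idx], out[idx])

ms_rsf_small = ModelSurvSHAP(random_state=42)
# assumption: fit() accepts a `timestamps` argument; here we keep every
# fifth unique event time of the fitted forest to shrink the time grid
ms_rsf_small.fit(exp_rsf_small, timestamps=rsf.event_times_[::5])
```

With a 200-row background instead of all 1000 rows, each Shapley estimate evaluates the model on far fewer samples, which is the part of the computation that grows with the background size.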