---
bibliography: references/01-c-index.bib
format:
html:
number-depth: 3
css: summary-format.css
---
# Time-to-Event (Survival) Analysis and Censored Data
## General concepts
In these models the outcome variable is the **time until an event occurs**, or any other numeric variable that has been **censored** by some limitation during data collection. Since observed times are always **positive**, we need distributions with positive support, such as the **Weibull** distribution.
- **Survival, failure or event time** $T$: Represents the time at which the event of interest occurs. For instance, the time at which the *patient dies* or the *customer cancels his or her subscription*.
- **Censoring time** $C$: Represents the time at which censoring occurs. For example, the time at which the patient *drops out of the study* or the *study ends*.
As a result, the variable we actually observe is:
$$
Y = \min(T, C)
$$
To interpret each observation correctly we also need a status indicator:
$$
\delta =
\begin{cases}
1 & \quad \text{if } T \leq C \\
0 & \quad \text{if } T > C
\end{cases}
$$
As a result, when $\delta = 1$ we observe the true survival time, and when $\delta = 0$ we observe the censoring time. In the next example, we could only observe the event for patients 1 and 3 before the study ended.
![](img/55-censored-survival-example.png){fig-align="center"}
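To make these definitions concrete, here is a minimal simulated sketch (the values and rates are hypothetical) that constructs $Y = \min(T, C)$ and the status indicator $\delta$:
```{r}
# Simulated illustration of observed time and status indicator
set.seed(123)
T_event <- rexp(5, rate = 0.10)  # true (often unobserved) event times
C_cens  <- rexp(5, rate = 0.05)  # censoring times
data.frame(
  y      = pmin(T_event, C_cens),         # Y = min(T, C)
  status = as.integer(T_event <= C_cens)  # delta = 1 when the event is observed
)
```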
### Use cases
- The time needed for the individual to find a job again
- The time needed for a letter to be delivered
- The time needed for a cab to pick you up at your house after having called the cab company
- The time needed for a customer to churn.
- Any measurement where the instrument cannot report values **above a certain number**.
### Assumptions
In order to analyze survival data, we need to determine whether the following assumptions are reasonable:
- The **event time** $T$ **is independent of the censoring time** $C$. For example, if *patients drop out of a cancer study early because they are very sick*, ignoring this would **overestimate** the true average survival time. We can check this assumption by exploring the ***reasons behind the dropouts***.
- The **predictors are independent of the censoring event** $\delta = 0$. For example, if in our study *very sick males are more likely to drop out than very sick females*, we could reach the **wrong conclusion** that males survive longer than females.
### Censoring types
1. **Right censoring**: It occurs when $T \geq Y$, i.e. the true event time $T$ is at least as large as the observed time $Y$.
2. **Left censoring**: It occurs when $T \leq Y$, i.e. the true event time $T$ is less than or equal to the observed time $Y$.
3. **Interval censoring**: It refers to the setting in which we do not know the exact event time, but we know that it falls in some interval.
*Right censoring will be the main focus of this chapter.*
## Kaplan-Meier Survival Curve
The **survival curve (function)** is defined as the probability that the *event time* $T$ happens later than a time $t$. As a result, *the larger the value of $S(t)$, the less likely the event is to take place before time $t$*.
$$
S(t) = \Pr(T > t)
$$
To illustrate how complicated it can be to estimate $S(20) = \Pr(T>20)$ when our target variable mixes event and censored times, we will use the `ISLR2::BrainCancer` table as an example.
```{r}
library(data.table)

# Drop rows with missing values and convert to a data.table
BrainCancer <-
  na.omit(ISLR2::BrainCancer) |>
  as.data.table()

pillar::glimpse(BrainCancer)

# Cross-tabulate "observed past 20 months" against the event indicator
BrainCancer[, .N,
            keyby = .(alive_after_20_months = time > 20,
                      event_time = status)]
```
- Taking the number of patients observed alive after 20 months ($36+12=48$) over the total number of patients ($48/88 \approx 55\%$) would be a mistake, as **we cannot assume that the event time is lower than 20** ($T < 20$) **for the 17 censored patients**.
- Omitting the $17$ censored patients might sound like a solution, but that would **underestimate** the probability: a patient who was censored at $t = 19.9$ likely would have survived past $t = 20$, so it is better to take advantage of that censored time.
*The Kaplan-Meier method shares a solution to this problem.*
### Empirical explanation
<br>
<div style="text-align: center;">
<iframe width="560" height="315" src="https://www.youtube.com/embed/7_XK7mGMm1E?start=659" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen ></iframe>
</div>
### Mathematical explanation
Let's start defining some important elements:
- $d_1 < d_2 < \dots < d_K$: The $K$ unique death times among the non-censored patients.
- $q_k$: The number of patients who died at time $d_k$.
- $r_k$: The number of patients alive and in the study just before $d_k$, *at risk patients*.
- $\widehat{\Pr}(T > d_j|T > d_{j-1}) = (r_j - q_j) / r_j$: It estimates the fraction of the risk set at time $d_j$ who survived past time $d_j$.
*And based on the law of total probability*
$$
\begin{split}
\Pr(T > d_k) = & \Pr(T > d_k|T > d_{k-1}) \Pr(T > d_{k-1}) + \\
& \Pr(T > d_k|T \leq d_{k-1}) \Pr(T \leq d_{k-1})
\end{split}
$$
But as it is impossible for a patient to survive past time $d_k$ without surviving past the earlier time $d_{k−1}$, we know that $\Pr(T > d_k|T \leq d_{k-1}) = 0$, so we can simplify the expression and discover that it is a recursive formula.
$$
\begin{split}
\Pr(T > d_k) = & \Pr(T > d_k|T > d_{k-1}) \Pr(T > d_{k-1}) \\
S(d_k) = & \Pr(T > d_k|T > d_{k-1}) \times \dots \times \Pr(T > d_2|T > d_1) \Pr(T > d_1) \\
\hat{S}(d_k) = & \prod_{j=1}^k \left( \frac{r_j - q_j}{r_j} \right)
\end{split}
$$
As we can see in the example below, the **Kaplan-Meier survival curve** has a step-like shape, as we assume that $\hat{S}(t) = \hat{S}(d_k)$ when $d_k < t < d_{k+1}$.
![](img/56-Kaplan-Meier-survival-curve.png){fig-align="center"}
*Based on this estimator, the estimated probability of survival past 20 months is* $71\%$.
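As a sanity check, we can compute the product formula by hand. This is a minimal sketch (it assumes the `BrainCancer` data.table created above) that should closely reproduce the $71\%$ figure:
```{r}
# Hand-rolled Kaplan-Meier estimate to verify the product formula
d_k <- BrainCancer[status == 1, sort(unique(time))]  # unique death times
S_hat <- cumprod(
  sapply(d_k, function(d) {
    r <- BrainCancer[, sum(time >= d)]             # at risk just before d
    q <- BrainCancer[status == 1, sum(time == d)]  # deaths at d
    (r - q) / r
  })
)
# Estimated probability of surviving past 20 months
S_hat[sum(d_k <= 20)]
```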
### Interpreting Survival Curve
If we select a $t$ value on the $x$-axis, we can answer questions like the following:
- What is the probability that a breast cancer patient survives longer than 5 years? $S(5)$
- Out of 100 unemployed people, how many do we expect to have a job again after 2 months? $1 - S(2)$
It also works in the other direction: by selecting a quantile on the $y$-axis and then reading the related value on the $x$-axis, we can answer questions like: *What is the typical waiting time for a cab?*
![](img/64-median-survival-curve.png){fig-align="center"}
### Coding example
```{r}
library(survival)

# Fit a Kaplan-Meier survival curve with no covariates
BrainCancerSv <- survfit(
  Surv(time, status) ~ 1,
  data = BrainCancer
)

# Two equivalent ways to inspect the estimated curve
survminer::surv_summary(BrainCancerSv) |>
  subset(select = time:surv) |>
  head()

broom::tidy(BrainCancerSv) |>
  base::subset(select = time:estimate) |>
  head()
```
We can plot these results using the `survminer` package:
```{r}
library(survminer)
ggsurvplot(
BrainCancerSv,
palette = "blue",
surv.median.line = "hv",
conf.int = FALSE,
risk.table = "nrisk_cumevents",
legend = "none",
break.x.by = 10,
title = "Brain Cancer Survival Curve"
)
```
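The median survival time highlighted by `surv.median.line` can also be read off directly with the `survfit` method of `quantile()` (a small sketch):
```{r}
# Median survival time (the t where the curve crosses S(t) = 0.5)
quantile(BrainCancerSv, probs = 0.5)
```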
Some additional options to consider:
- `palette`: Can be used to define the colors of the curves.
- `linetype`: Can be used to define the linetype of the curves.
- `surv.median.line`: Can highlight the median survival time.
- `risk.table`: Shows a table with the number of subjects at risk of dying.
- `cumevents`: Shows a table with the number of events that happened already.
- `cumcensor`: Shows a table with the number of censored observations so far.
- `tables.height`: Indicates how big the tables should be.
You can even compare two survival curves. To reuse the same data, let's assume for one curve that every observation reports the true event time (that is, we ignore censoring).
```{r}
BrainCancerSvWrong <- survfit(
Surv(time) ~ 1,
data = BrainCancer
)
ggsurvplot_combine(
list(correct = BrainCancerSv,
wrong = BrainCancerSvWrong)
)
```
*As a result, we would underestimate the probability of survival.*
## Weibull Model Survival Curve
The Weibull model can produce a *smooth survival curve* by making assumptions about the distribution of event times, and it is very useful when we are adjusting the *results based on covariates* or *making inferences*.
### Coding example
```{r}
# Defining the probabilities to plot
SurvivalProb <- seq(.99, .01, by = -.01)
plot(SurvivalProb)
# Fitting the distribution
BrainCancerWSC <- survreg(
Surv(time, status) ~ 1,
data = BrainCancer,
dist = "weibull"
)
# Predicting times
BrainCancerWmTime <- predict(
BrainCancerWSC,
type = "quantile",
p = 1 - SurvivalProb,
newdata = data.frame(1)
)
# Create a data.frame and plot the results
data.frame(
time = BrainCancerWmTime,
surv = SurvivalProb,
upper = NA,
lower = NA,
std.err = NA
) |>
ggsurvplot_df(surv.geom = geom_line)
```
## Log-Rank Test
If we want to explore whether sex is an important factor influencing the *survival curve*, we can create a plot comparing the curve for each sex.
![](img/57-survival-curve-male-female.png){fig-align="center"}
Females seem to fare a little better up to about 50 months, but then the two curves both level off to about 50%. To make a sound decision we need to confirm whether this difference was produced by chance or whether it is statistically significant.
As our target variable `time` mixes event and censoring times, we need a dedicated statistical test known as the **log-rank test**, **Mantel-Haenszel test** or **Cochran-Mantel-Haenszel test**. In this test we split the quantities used for the Kaplan-Meier survival curve by group:
$$
\begin{split}
r_{1k} + r_{2k} & = r_k \\
q_{1k} + q_{2k} & = q_k
\end{split}
$$
In order to test $H_0 : \text{E}(X) = 0$ for some random variable $X$, one approach is to construct a test statistic of the form
$$
W = \frac{X - \text{E}(X)}{\sqrt{\text{Var}(X)}}
$$
Where:
$$
\begin{split}
X & = \sum_{k=1}^K q_{1k} \\
\text{E}(q_{1k}) & = \frac{r_{1k}}{r_k} q_k \\
\text{Var}\left( X \right) \approx \sum_{k=1}^K \text{Var} (q_{1k}) & = \sum_{k=1}^K \frac{q_k(r_{1k}/r_k) (1-r_{1k}/r_k) (r_k-q_k)}{r_k-1}
\end{split}
$$
As a result,
$$
\begin{split}
W & = \frac{\sum_{k=1}^K(q_{1k}-\text{E}(q_{1k}))}
{\sqrt{\sum_{k=1}^K \text{Var} (q_{1k})}} \\
& = \frac{\sum_{k=1}^K(q_{1k}- \frac{r_{1k}}{r_k} q_k)}
{\sqrt{\sum_{k=1}^K \frac{q_k(r_{1k}/r_k) (1-r_{1k}/r_k) (r_k-q_k)}{r_k-1}}}
\end{split}
$$
When the **sample size is large**, the log-rank test statistic $W$ has approximately a **standard normal distribution** and can be used to compute a **p-value** for the null hypothesis that there is no difference between the survival curves in the two groups.
For `ISLR2::BrainCancer`, comparing females and males gives a **p-value of 0.2** using the theoretical null distribution. Thus, **we cannot reject the null hypothesis** of no difference between the survival curves of the two groups.
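This test is available in R through `survival::survdiff()`. A minimal sketch using the `sex` column of `BrainCancer`:
```{r}
# Log-rank test comparing the survival curves of females and males
survdiff(Surv(time, status) ~ sex, data = BrainCancer)
```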
## Regression Models With a Survival Response
Fitting a linear regression to censored data is challenging because we want to predict $T$ rather than $Y$; to overcome this difficulty we need *a sequential construction*.
### Hazard Function
The **hazard function**, also known as the *hazard rate* or *force of mortality*, is useful for estimating the **risk of an event**: it measures the instantaneous rate (*conditional probability per unit of time*) at which events occur, given that the event has not yet occurred for the subjects under study.
$$
h(t) = \lim_{\Delta t \rightarrow 0} \frac{\Pr(t < T \leq t + \Delta t| T > t)}{\Delta t}
$$
Where:
- $T$: It is the (unobserved) survival time.
- $\Delta t$: An infinitesimally small increment of time.
It has a close relationship with the **probability density function of** $T$, which shows how *common* or *rare* any particular value of $T$ is:
$$
f(t) = \lim_{\Delta t \rightarrow 0} \frac{\Pr(t < T \leq t + \Delta t)}{\Delta t}
$$
Using the definition of conditional probability, we can see how the *hazard function* connects the *probability density function* and the *survival function*.
$$
\begin{split}
h(t) & = \lim_{\Delta t \rightarrow 0} \frac{\Pr((t < T \leq t + \Delta t) \cap (T > t)) / \Pr(T > t)}{\Delta t} \\
h(t) & = \lim_{\Delta t \rightarrow 0} \frac{\Pr(t < T \leq t + \Delta t) / \Delta t }{\Pr(T > t)} \\
h(t) & = \frac{f(t)}{S(t)}
\end{split}
$$
We also know that
$$
\frac{\mathrm d}{\mathrm d t} S(t) = -f(t)
$$
so the hazard can equivalently be written as $h(t) = -\frac{\mathrm d}{\mathrm d t} \log S(t)$.
Let's see a simulated example:
```{r hazard-rate-simulated-plot}
#| code-fold: true
# Setup
library(ggplot2)
mu <- 5
sigma <- 1
t <- 3:6
# Create a data frame of values around the mean
df <- data.frame(x = seq(mu-4*sigma, mu+4*sigma, length=1000))
df$Density <- dnorm(df$x, mean=mu, sd=sigma)
df$S <- pnorm(df$x, mean=mu, sd=sigma, lower.tail = FALSE)
# Define hazard points to estimate
annotations <- data.frame(
t,
density = dnorm(t, mean=mu, sd=sigma) |> round(2),
survival = pnorm(t, mean=mu, sd=sigma, lower.tail = FALSE) |> round(2),
hazard =
(dnorm(t, mean=mu, sd=sigma) /
pnorm(t, mean=mu, sd=sigma, lower.tail = FALSE)) |>
round(2)
)
annotations$label <- paste0(
annotations$density," / ",
annotations$survival," = ",
annotations$hazard
)
# Plot the normal distribution
ggplot(df, aes(x=x, y=Density)) +
geom_blank(aes(y = Density*1.2)) +
geom_line() +
geom_area(data= subset(df, x > max(t)),
fill="blue",
alpha = 0.4) +
geom_point(data = annotations,
aes(x = t, y = density),
size = 3) +
geom_text(data = annotations,
aes(x = t, y = density, label = label),
vjust = -1,
fontface = "bold",
size = 4) +
labs(x="x",
y="Probability Density",
title = "Hazard rate = f(t) / S(t), where f(t) is represented by a normal distribution") +
scale_x_continuous(breaks = scales::breaks_width(1)) +
theme_classic()+
theme(plot.title = element_text(face = "bold",
margin = margin(b = 0.5, unit = "cm")))
```
### Modeling the survival time
We often want to know how a treatment or the severity of an illness affects the survival of the patients.
#### Linear approach
To use the hazard function to **model the survival time** as a function of the **covariates** (predictors), we can assume the following form for the hazard function, which guarantees positive values:
$$
h(t|x_i) = \exp \left( \beta_0 + \sum_{j=1}^p \beta_j x_{ij} \right)
$$
Based on $h(t|x_i)$ we can calculate $S(t|x_i)$ and maximize the following likelihood to estimate the parameters $\beta$, assuming that the $n$ **observations are independent**:
$$
L = \prod_{i=1}^n h(y_i)^{\delta_i} S(y_i)
$$
- When the $i$th observation is **not censored** ($\delta_i = 1$), its contribution to the likelihood is the **probability of dying** in a tiny interval around time $y_i$, i.e. the density $f(y_i) = h(y_i)S(y_i)$.
- When the $i$th observation is **censored** ($\delta_i = 0$), its contribution is the **probability of surviving** at least until time $y_i$, i.e. $S(y_i)$.
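For intuition, with a constant hazard $h(t) = \lambda$ (the exponential model) maximizing this likelihood has the closed form $\hat{\lambda} = \sum_i \delta_i / \sum_i y_i$, i.e. events divided by total follow-up time. A quick sketch checks this against `survreg()`; the `exp(-intercept)` step reflects my understanding of its exponential parameterization, so treat it as an assumption:
```{r}
# Closed-form MLE of a constant hazard: events / total follow-up time
with(BrainCancer, sum(status) / sum(time))
# survreg's exponential fit should report the same rate
exp(-coef(survreg(Surv(time, status) ~ 1, data = BrainCancer, dist = "exponential")))
```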
#### Flexible approach
To use the hazard function to **model the survival time** as a function of the **covariates** (predictors), we can use the **proportional hazards assumption**
$$
h(t|x_i) = \underbrace{h_0(t)}_\text{Baseline Hazard}
\underbrace{\exp \left( \sum_{j=1}^p x_{ij} \beta_j \right)}_\text{Relative Risk}
$$
Where
- *Baseline Hazard* $h_0(t) \geq 0$: Represents an unspecified function, so it can take any form.
The most important assumption to keep in mind is that *a one-unit increase in* $x_{ij}$ *corresponds to an increase in* $h(t|x_i)$ *by a factor of* $\exp(\beta_j)$.
![](img/58-proportinal-hazard-assumption.png){fig-align="center"}
But under the proportional hazards assumption **we cannot estimate** $\beta = (\beta_1, \dots, \beta_p)^T$ by maximizing the likelihood without specifying the form of $h_0(t)$.
To solve this problem, **Cox’s proportional hazards model** makes use of the same **sequential-in-time** logic that we used to derive the *Kaplan-Meier survival curve* and the *log-rank test*.
We know that if the $i$th observation is uncensored, its hazard at time $y_i$ is $h_0(y_i)\exp \left( \sum_{j=1}^p x_{ij} \beta_j \right)$, while the *total hazard at time* $y_i$ *over the at-risk observations* is $\sum_{i': y_{i'} \geq y_i} h_0(y_i) \exp \left( \sum_{j=1}^p x_{i'j} \beta_j \right)$. This makes it possible to **cancel out** $h_0(y_i)$ when calculating **the probability that the** $i$**th observation is the one to fail at time** $y_i$:
$$
\frac{h_0(y_i)\exp \left( \sum_{j=1}^p x_{ij} \beta_j \right)}
{\sum_{i': y_{i'} \geq y_i} h_0(y_i) \exp \left( \sum_{j=1}^p x_{i'j} \beta_j \right)}
$$
The **partial likelihood** corresponds to the product of these probabilities over all of the uncensored observations, and we can use it to estimate $\beta$. If there are **no tied failure times**, it takes the form:
$$
PL(\beta) = \prod_{i: \delta_i = 1} \frac{\exp \left( \sum_{j=1}^p x_{ij} \beta_j \right)}
{\sum_{i': y_{i'} \geq y_i}\exp \left( \sum_{j=1}^p x_{i'j} \beta_j \right)}
$$
##### Connection With The Log-Rank Test
In the case of **a single binary covariate**, the score test for $H_0 : \beta = 0$ in *Cox’s proportional hazards model* is exactly equal to the *log-rank test*.
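A quick numerical check of this equivalence (a sketch using `sex` as the single binary covariate):
```{r}
# The Cox score test matches the log-rank chi-square statistic
BrainCancerCoxSex <- coxph(Surv(time, status) ~ sex, data = BrainCancer)
summary(BrainCancerCoxSex)$sctest  # score test: statistic, df, p-value
survdiff(Surv(time, status) ~ sex, data = BrainCancer)$chisq
```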
##### Examples
###### **Brain Cancer Data**
![](img/59-BrainCancer-Cox’s-proportional-hazards-model.png){fig-align="center"}
- The estimated hazard for patients with *HG Glioma* is $e^{2.15} = 8.58$ times greater than for patients with a different diagnosis, **holding all other covariates fixed**.
- The higher the *Karnofsky index*, the lower the chance of dying at any given point in time; more specifically, each one-unit increase in the *Karnofsky index* corresponds to a multiplier of $e^{−0.05} = 0.95$ in the instantaneous chance of dying.
###### **Publication Data**
- The authors start by running a **log-rank test** on `posres`, which yields a very unimpressive $p$-value of $0.36$.
![](img/60-publication-KM-survival-curve.png){fig-align="center"}
- Then they run a **Cox model** with all the predictors, and `posres` turns out to be a strong predictor of the time to publication. Here we can see the *KM survival curves* after adjusting for all the other covariates.
![](img/61-publication-KM-survival-curve-adjusted.png){fig-align="center"}
##### Coding example
```{r}
# Training Cox Models
BrainCancerCox <- coxph(
Surv(time, status) ~ diagnosis + ki,
data = BrainCancer
)
coef(BrainCancerCox)
# Defining the relations to plot
BrainCancerGrid <- expand.grid(
diagnosis = levels(BrainCancer$diagnosis),
ki = quantile(BrainCancer$ki, probs = c(0.50, 0.75))
)
# Compute Cox model and survival curves
BrainCancerCoxSc <- survfit(
BrainCancerCox,
data = BrainCancer,
newdata = BrainCancerGrid,
conf.type = "none"
)
# Use the summary of BrainCancerCoxSc
# to take a vector of patient IDs
BrainCancerCoxScDf <- surv_summary(BrainCancerCoxSc)
pid <- as.character(BrainCancerCoxScDf$strata)
# Transforming the data to create the plot
m_newdat <- BrainCancerGrid[pid, , drop = FALSE]
BrainCancerCoxScDfPlot <- cbind(BrainCancerCoxScDf, m_newdat)
BrainCancerCoxScDfPlot$ki <- factor(BrainCancerCoxScDfPlot$ki)
# Plot
ggsurvplot_df(
BrainCancerCoxScDfPlot,
linetype = "ki",
color = "diagnosis",
legend.title = NULL,
censor = FALSE
)
```
#### Weibull model
This method starts by assuming a distribution for the event time, chosen from the following options:
```{r}
names(survreg.distributions)
```
In this model the coefficients are interpreted in the opposite way to the Cox model's, as **positive coefficients increase the time until the event takes place**.
```{r}
BrainCancerWM <- survreg(
Surv(time, status) ~ diagnosis + ki,
data = BrainCancer,
dist = "weibull"
)
coef(BrainCancerWM)
```
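Under this accelerated-failure-time reading, $\exp(\beta_j)$ acts as a multiplier on the predicted event time. A one-line sketch of that interpretation:
```{r}
# Each one-unit increase in ki multiplies the predicted event time
# by this factor, holding diagnosis fixed
exp(coef(BrainCancerWM)["ki"])
```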
One technique for visualizing numeric variables is to use a few quantiles (here the 50% and 75% quantiles):
```{r}
# Defining the probabilities to plot
SurvivalProb <- seq(.99, .01, by = -.01)
plot(SurvivalProb)
# Defining the relations to plot
BrainCancerGrid <- CJ(
diagnosis = levels(BrainCancer$diagnosis),
ki = quantile(BrainCancer$ki, probs = c(0.50, 0.75))
)
BrainCancerGrid
# The predict function creates column per observation in newdata
BrainCancerWmTime <- predict(
BrainCancerWM,
type = "quantile",
p = 1 - SurvivalProb,
newdata = BrainCancerGrid
)
# Join and pivot to long format so the columns have
# the names that the plotting function expects
cbind(BrainCancerGrid,
BrainCancerWmTime
)[, melt(.SD,
id.vars = names(BrainCancerGrid),
variable.name = "surv_id",
value.name = "time")
][,`:=`(surv = SurvivalProb[as.integer(surv_id)],
ki = factor(ki))
][, c("upper", "lower", "std.err", "strata") := NA_real_] |>
ggsurvplot_df(surv.geom = geom_line,
linetype = "ki",
color = "diagnosis",
legend.title = NULL)
```
### Shrinkage for the Cox Model
To apply the “loss + penalty” formulation to our *partial likelihood*, we minimize the following objective:
$$
- \log
\left(
\prod_{i: \delta_i = 1} \frac{\exp \left( \sum_{j=1}^p x_{ij} \beta_j \right)}
{\sum_{i': y_{i'} \geq y_i}\exp \left( \sum_{j=1}^p x_{i'j} \beta_j \right)}
\right)
+ \lambda P(\beta)
$$
Where:
- $\beta = (\beta_1, \dots, \beta_p)$
- $\lambda$: A non-negative tuning parameter.
- $P(\beta) = \sum_{j=1}^p \beta_j^2$: the *ridge penalty*.
- $P(\beta) = \sum_{j=1}^p |\beta_j|$: the *lasso penalty*.
Let's see an example of the tuning process for a lasso-penalized Cox model:
![](img/62-lasso-penalized-Cox-model.png){fig-align="center"}
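In R this tuning is typically done with the `glmnet` package. A minimal sketch, assuming a recent `glmnet` version that accepts a `Surv` object as the response and supports `type.measure = "C"` for Cox models:
```{r}
#| eval: false
library(glmnet)
# Design matrix (drop the response columns and the intercept)
x <- model.matrix(~ . - time - status, data = BrainCancer)[, -1]
y <- with(BrainCancer, Surv(time, status))
# Cross-validated lasso-penalized Cox model, tuned on Harrell's C-index
cvfit <- cv.glmnet(x, y, family = "cox", type.measure = "C")
plot(cvfit)                    # tuning curve, as in the figure above
coef(cvfit, s = "lambda.min")  # coefficients at the selected lambda
```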
### Model Evaluation
#### Comparing predicted and true survival times
To compare the predicted and the true survival time, we need to figure out how to:
1. Use censored observations.
2. Translate the estimated survival curve $S(t|x)$ into *survival times*.
One possible solution is to stratify the observations based on the estimated coefficients, following these steps:
1. Calculate an **estimated risk score** for each test observation using the coefficients from the Cox model.
$$
\hat{\eta}_i = \hat{\beta}_1 x_{i1} + \dots + \hat{\beta}_p x_{ip}
$$
2. Categorize the observations based on their “risk”. If the model works well, you should see a clear separation between the classes, as sketched after the figure below.
![](img/63-km-survival-curve-after-classification.png){fig-align="center"}
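A sketch of this stratification using the Cox fit from above; for illustration the risk scores are computed on the training data itself, whereas a proper evaluation would compute them on a held-out test set:
```{r}
# Linear-predictor risk scores from the fitted Cox model
BrainCancer$risk <- predict(BrainCancerCox, type = "lp")
# Split into low / medium / high risk by tertiles
BrainCancer$risk_group <- cut(
  BrainCancer$risk,
  breaks = quantile(BrainCancer$risk, probs = c(0, 1/3, 2/3, 1)),
  labels = c("low", "medium", "high"),
  include.lowest = TRUE
)
# Kaplan-Meier curves by risk group should show a clear separation
ggsurvplot(
  survfit(Surv(time, status) ~ risk_group, data = BrainCancer),
  data = BrainCancer,
  censor = FALSE
)
```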
#### C-index
It would be useful to calculate a metric like the AUC of the ROC curve, but it's important to recall that we do not observe the true event time $t_1, \dots, t_n$ of every observation.
According to @LONGATO2020103496, an alternative is to use **Harrell's concordance index** (**C-index**), which quantifies the probability that *greater risk scores are attributed to subjects with a higher chance of experiencing the event*.
$$
C = P(\hat{\eta}_{i'} > \hat{\eta}_i \; | \; T_{i'} < T_i, \; \delta_{i'} = 1, \; T_{i'} < t)
$$
It is estimated with the following expression:
$$
C = \frac{\sum_{i,i':y_i>y_{i'}} I(\hat{\eta}_{i'} > \hat{\eta}_{i}) \delta_{i'}}{\sum_{i,i':y_i>y_{i'}} \delta_{i'}}
$$
Where:
- $y_i$ and $y_{i'}$ are the observed survival times.
- $\hat{\eta}_i$ and $\hat{\eta}_{i'}$ are the estimated risk scores
- $I(\hat{\eta}_{i'} > \hat{\eta}_i)$ returns 1 if the condition is met.
- $\delta_{i'}$ returns 1 if the $i'$th subject's event has been observed.
##### Simulated example
Suppose $p = 1$ and $\hat{\beta} = 2.5$, and we have the following test set:
| Subject | x | Survival Time $y$ | Censoring Indicator $\delta$ |
|---------|------|-------------------|-----------------------------|
| 1 | 1 | 3 | 0 |
| 2 | 2 | 4 | 1 |
| 3 | 3 | 5 | 1 |
| 4 | 4 | 6 | 1 |
| 5 | 5 | 7 | 0 |
1. Calculate the risk scores $\hat{\eta}$ for each subject:
$\hat{\eta}_i = \hat{\beta} x_i$
| Subject | $\hat{\eta}$ |
|---------|---------------|
| 1 | 2.5 |
| 2 | 5 |
| 3 | 7.5 |
| 4 | 10 |
| 5 | 12.5 |
2. For each pair of subjects ($i, i'$) where $y_i > y_{i'}$, calculate $I(\hat{\eta}_{i'} > \hat{\eta}_i) \delta_{i'}$ and $\delta_{i'}$:
| Pair (i, i') | $y_i > y_{i'}$ | $I(\hat{\eta}_{i'} > \hat{\eta}_i)$ | $\delta_{i'}$ | $I(\hat{\eta}_{i'} > \hat{\eta}_i) \delta_{i'}$ |
|--------------|-----------------|-----------------|-----------------|-----------------|
| (2,1) | True | 0 | 0 | 0 |
| (3,1) | True | 0 | 0 | 0 |
| (3,2) | True | 0 | 1 | 0 |
| (4,1) | True | 0 | 0 | 0 |
| (4,2) | True | 0 | 1 | 0 |
| (4,3) | True | 0 | 1 | 0 |
| (5,1) | True | 0 | 0 | 0 |
| (5,2) | True | 0 | 1 | 0 |
| (5,3) | True | 0 | 1 | 0 |
| (5,4) | True | 0 | 1 | 0 |
3. Calculate the concordance index C:
$$
C = \frac{\sum_{i,i':y_i>y_{i'}} I(\hat{\eta}_{i'} > \hat{\eta}_{i}) \delta_{i'}}{\sum_{i,i':y_i>y_{i'}} \delta_{i'}} = \frac{0}{6} = 0
$$
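We can check the hand computation with `survival::concordance()` (a sketch; my understanding is that `reverse = TRUE` tells it that larger scores mean higher risk, matching the Cox convention):
```{r}
# Verify the worked example: C should be 0, with 6 usable pairs
toy <- data.frame(
  y     = c(3, 4, 5, 6, 7),
  delta = c(0, 1, 1, 1, 0),
  eta   = 2.5 * (1:5)
)
concordance(Surv(y, delta) ~ eta, data = toy, reverse = TRUE)
```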
##### Coding example
```{r}
library(dynpred)
data.frame(
predictors = c("diagnosis" ,"diagnosis + ki", "."),
c_index = c(
CVcindex(Surv(time, status) ~ diagnosis, data = BrainCancer)[["cindex"]],
CVcindex(Surv(time, status) ~ diagnosis + ki, data = BrainCancer)[["cindex"]],
CVcindex(Surv(time, status) ~ ., data = BrainCancer)[["cindex"]]
)
)
```
## References