# Uncertainty Quantification
Imagine there is an unknown quantity \(u\) that we can't observe directly. However, we may obtain a collection of observations \(Y = (y_1, \cdots, y_p) \in \mathcal R^p\). Then the **Forward Problem** is \(\underset{\text{unknown}}{u \in Z} \to \underset{\text{observed}}{Y}\).
In a sense, when we talk about science, we are talking about using \(Y\) to learn about \(u\). In practice, there are at least two "sources" of uncertainty associated with the picture \(u \to Y\):
1. \(u\) is unknown.
2. Observations are usually contaminated by some sort of noise, so that in general \(Y\) is not simply a function of the unknown \(u\). In statistics, we may model the "likelihood" or "probability" of observing \(Y\) given a certain value of \(u\). This quantifies our uncertainty about the possible values of \(Y\) given the unknown \(u\), either through a probability mass function, \(P(Y = y | u)\), or through a density function, \(P(Y \in C | u) = \int_C p(y | u) dy\).
:::{.example #ber}
Let \(u \in (0,1)\) be the probability that the outcome of a certain biased coin flip is H. Suppose that this coin is flipped \(p\) times and we record \(y_i = \begin{cases}1 & \text{the } i\text{-th flip is H}\\0 & \text{the } i\text{-th flip is T}\end{cases}\).
In this case, the likelihood is
$$
P(Y = y | u) = u^{\sum_{i=1}^p y_i} (1-u)^{p - \sum_{i=1}^p y_i}, ~y_i \in \{0,1\}
$$
:::
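As a quick illustration, here is a minimal R sketch (simulated flips and an assumed head probability, not part of the original example) that evaluates this likelihood over a grid of candidate values of \(u\):

```r
# Minimal sketch (simulated flips, assumed head probability): evaluate the
# Bernoulli likelihood P(Y = y | u) on a grid of candidate values of u.
set.seed(1)
p <- 20
u_true <- 0.7                          # assumed "true" value, for illustration only
y <- rbinom(p, size = 1, prob = u_true)

u_grid  <- seq(0.01, 0.99, by = 0.01)
log_lik <- sum(y) * log(u_grid) + (p - sum(y)) * log(1 - u_grid)

u_grid[which.max(log_lik)]             # the maximizing u is close to ...
mean(y)                                # ... the observed frequency of heads
```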
:::{.example #gau}
- Regression: \(u: \mathcal X \to \mathcal R\) (Hidden regression function). \(y_i = u(x_i) + \varepsilon_i\), \(\varepsilon_i \sim N(0, \sigma^2)\).
$$
p(y|u) = \frac{1}{(\sqrt{2 \pi \sigma^2})^p} exp\Big(-\frac{||y - (u(x_1), \cdots, u(x_p))||^2}{2\sigma^2}\Big)
$$
- Classification with logistic model: Given \(u: \mathcal X \to \mathcal R\), define: \(q_i = \frac{exp(u(x_i))}{1+exp(u(x_i))}\) for \(x_1, \cdots, x_p\). Then, \(y_i = \begin{cases}1 &\text{with prob } q_i\\0 & \text{with prob } 1-q_i\end{cases}\). Thus,
$$
P(y | u) = \prod_{i=1}^p q_i^{y_i} (1- q_i)^{1-y_i} = \prod_{i=1}^p \frac{exp(y_iu(x_i))}{1+exp(u(x_i))}
$$
:::
:::{.example #DE name="Density Estimation"}
Let \(\mathcal X = \mathcal R^d\) and suppose that \(u: \mathcal X \to \mathcal R\) is an unknown density function (\(u(x) \geq 0\), \(\forall x\), and \(\int_{\mathcal R^d} u(x) dx = 1\)), \(y_1, \cdots, y_p \underset{i.i.d}{\sim} u\). Thus, the likelihood of \(Y=(y_1, \cdots, y_p)\) is:
$$
p(y|u) = u(y_1) \cdots u(y_p)
$$
:::
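For instance, the following minimal sketch (simulated data and two arbitrary candidate densities, not from the original example) compares the resulting log-likelihoods \(\log p(y|u) = \sum_{i=1}^p \log u(y_i)\):

```r
# Minimal sketch (simulated data, arbitrary candidate densities): compare the
# log-likelihood sum_i log u(y_i) under two candidate densities u.
set.seed(1)
y <- rnorm(100, mean = 0, sd = 1)              # data actually drawn from N(0, 1)

sum(dnorm(y, mean = 0, sd = 1, log = TRUE))    # u = density of N(0, 1)
sum(dnorm(y, mean = 1, sd = 2, log = TRUE))    # u = density of N(1, 4): a worse fit
```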
So far we have talked about the distribution of \(Y\) given \(u\) (Forward Problem). This models the uncertainty of our "measurement" if we had access to \(u\). However, how do we learn \(u\) from \(Y\) (Inverse Problem)? Moreover, how do we quantify the uncertainty of the unknown \(u\) before observing \(Y\), and after observing \(Y\)?
## Bayes Perspective
Bayes' Rule: \(p(u | y) \propto p(y | u) \pi(u)\). For the moment, we consider the case where \(u \in \mathcal R^d\): the prior \(\pi(u)\) is then interpreted as a density in \(\mathcal R^d\), and the posterior is usually also a density in \(\mathcal R^d\).
:::{.example name="Example \@ref(exm:ber) Continued"}
Let \(u \in (0,1)\). The likelihood is \(Bernoulli(u)\):
$$
P(Y = y | u) = u^{\sum_{i=1}^p y_i} (1-u)^{p - \sum_{i=1}^p y_i}, ~y_i \in \{0,1\},
$$
the prior is \(Beta(\alpha, \beta)\):
$$
\pi(u) = \frac{u^{\alpha-1}(1-u)^{\beta-1}}{B(\alpha, \beta)}, ~\alpha, \beta >0.
$$
Then the posterior is
$$
p(u | y) \propto p(y |u) \pi(u) \propto u^{\sum_{i=1}^p y_i + \alpha - 1} (1-u)^{p - \sum_{i=1}^p y_i + \beta - 1},
$$
that is, given \(Y=y\), \(u\) is \(Beta(\sum_{i=1}^p y_i + \alpha, p - \sum_{i=1}^p y_i + \beta)\).
:::
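A minimal R sketch of this conjugate update (simulated flips and assumed prior hyperparameters, for illustration only):

```r
# Minimal sketch (simulated flips, assumed hyperparameters): conjugate
# Beta-Bernoulli update.  With prior Beta(alpha, beta) and p flips y,
# the posterior is Beta(alpha + sum(y), beta + p - sum(y)).
set.seed(1)
p <- 30
u_true <- 0.7                                  # assumed "true" head probability
y <- rbinom(p, size = 1, prob = u_true)

alpha <- 2; beta <- 2                          # assumed prior hyperparameters
alpha_post <- alpha + sum(y)
beta_post  <- beta + p - sum(y)

alpha_post / (alpha_post + beta_post)          # posterior mean of u
qbeta(c(0.025, 0.975), alpha_post, beta_post)  # 95% credible interval for u
```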
:::{.example #gaucon name="Example \@ref(exm:gau) Continued"}
Let \(u \in \mathcal R^d\) with \(u(x) = <u, x>\) (a linear regression function) and \(x_1, \cdots, x_p \in \mathcal R^d\). The likelihood is:
$$
p(y|u) = \frac{1}{(\sqrt{2 \pi \sigma^2})^p} exp\Big(-\frac{||y - (u(x_1), \cdots, u(x_p))||^2}{2\sigma^2}\Big),
$$
the prior is
$$
\pi(u) = \frac{1}{(\sqrt{2 \pi \lambda^{-1}})^d} exp(-\frac{\lambda}{2} ||u||^2_{\mathcal R^d}), ~i.e. ~ u \sim N(\vec 0, \frac{1}{\lambda} I).
$$
Let \(X \in \mathcal R^{p \times d}\) be the matrix whose \(i\)-th row is \(x_i^T\), so that \((u(x_1), \cdots, u(x_p))^T = Xu\). Then the posterior is
$$
\begin{aligned}
p(u | y) &\propto p(y |u) \pi(u) \\
&\propto exp \Big(<\frac{X^Ty}{\sigma^2}, u> - \frac{1}{2} <(\lambda I + \frac{X^TX}{\sigma^2})u, u>\Big),
\end{aligned}
$$
that is, \(u | y \sim N(\mu, \Sigma)\), where \(\mu = \frac{1}{\sigma^2} (\lambda I + \frac{1}{\sigma^2} X^TX)^{-1} X^Ty\) and \(\Sigma= (\lambda I + \frac{1}{\sigma^2} X^TX)^{-1}\).
:::
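To make this concrete, here is a minimal R sketch (simulated data; the values of \(d\), \(p\), \(\sigma^2\), \(\lambda\) and the "true" coefficients are assumed) that computes \(\mu\) and \(\Sigma\) from these formulas:

```r
# Minimal sketch (simulated data; d, p, sigma2, lambda are assumed values):
# posterior mean and covariance for the Gaussian linear model with prior
# u ~ N(0, (1/lambda) I).
set.seed(1)
d <- 3; p <- 50
sigma2 <- 0.5^2; lambda <- 1
u_true <- c(1, -2, 0.5)                        # assumed "true" coefficients

X <- matrix(rnorm(p * d), p, d)                # i-th row is x_i
y <- X %*% u_true + rnorm(p, sd = sqrt(sigma2))

Sigma_post <- solve(lambda * diag(d) + crossprod(X) / sigma2)
mu_post    <- Sigma_post %*% crossprod(X, y) / sigma2

mu_post                                        # posterior mean (shrunk toward 0)
solve(crossprod(X), crossprod(X, y))           # ordinary least squares, for comparison
```

The posterior mean is pulled toward the prior mean \(\vec 0\) relative to least squares, which anticipates the ridge-regression connection discussed below.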
With the posterior, we can then compute the following:
1. Posterior Mean:
$$
\begin{aligned}
(\hat u_{PM})_i = E[u_i | y] &= \frac{\int_{\mathcal R^d} u_i p(y | u) \pi(u) du_1 \cdots du_d}{\int_{\mathcal R^d} p(y | u) \pi(u) du_1 \cdots du_d}\\
&= \frac{\int_{\mathcal R^d} u_i p(y | u) \pi(u) du_1 \cdots du_d}{Z(y)}
\end{aligned}
$$
where \(Z(y)\) is the normalization constant.
Then, based on the observations (and model), we estimate the i-th coordinate of the unknown as \(E[u_i | y]\).
We could also estimate \(u_i^2 cos(u_j)\) with: \(E[u_i^2 cos(u_j) | y] = \frac{\int_{\mathcal R^d} u_i^2cos(u_j) p(y | u) \pi(u) du_1 \cdots du_d}{Z(y)}\)
2. Posterior Covariance: \(Cov(u_i, u_j | y)\) or \(Var(u_i|y) = E[u_i^2 | y] - (E[u_i|y])^2\). A measure of how certain we are about the estimation of \(u_i\).
3. MAP (Maximum a posteriori):
$$
\begin{aligned}
u^* &= \underset{u}{\operatorname{arg max}} \pi(u) p(y | u)\\
&= \underset{u}{\operatorname{arg max}} log(\pi(u)) + log(p(y | u))\\
&= \underset{u}{\operatorname{arg min}} -log(\pi(u)) - log(p(y | u))
\end{aligned}
$$
Both 1 and 2 are based on being able to take expectations w.r.t. the posterior. Item 3, on the other hand, is not related to expectation but rather to optimization.
:::{.example #gauconcon name="Example \@ref(exm:gaucon) Continued"}
We have:
- \(E[u_i |y] = \mu_i\)
- \(Cov(u_i, u_j | y) = \Sigma_{ij}\)
- The MAP is determined by:
$$
\begin{aligned}
u^* &= \underset{u}{\operatorname{arg min}} -log(p(u | y))\\
&= \underset{u}{\operatorname{arg min}} -log(\pi(u)) - log(p(y | u))\\
&= \underset{u}{\operatorname{arg min}} \frac{\lambda}{2} ||u||^2 + \frac{1}{2\sigma^2} \sum_{i=1}^p (y_i - <u, x_i>)^2,
\end{aligned}
$$
which is just ridge regression.
:::
:::{.remark}
$$
u^* = \mu
$$
This can be seen by direct optimization, or by noticing that the mode of a Gaussian density coincides with its mean vector.
The posterior can thus be interpreted as combining a data-fidelity term with a regularizer:
$$
p(u | y) \propto \underset{\text{Fidelity}}{p(y | u)} ~\underset{\text{Regularizer}}{\pi(u)}
$$
:::
Back to **Example \@ref(exm:gauconcon)**: when \(u | y\) is \(N(\mu, \Sigma)\), we have formulas for \(E[u_i | y]\) and \(Cov(u_i, u_j | y)\). However, something like \(E[u_1cos(u_2) exp(u_3) | y]\) has no simple closed form. Nevertheless, we can use Monte Carlo to approximate it. The idea is that we know what the posterior is, so we can sample \(u^1, \cdots, u^K \sim N_d(\mu, \Sigma)\) and consider the empirical average \(\frac{1}{K} \sum_{j=1}^K u_1^j cos(u_2^j) exp(u_3^j)\). That is, if we can sample from the posterior, we can numerically approximate \(E[G(u) | y] \approx \frac{1}{K} \sum_{j=1}^K G(u^j)\), where \(u^1, \cdots, u^K \underset{i.i.d}{\sim} Posterior\).
If it is difficult to sample from the posterior directly, we can use MCMC (Markov Chain Monte Carlo).
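The following minimal sketch implements the plain Monte Carlo approximation (MCMC itself is not shown), reusing the simulated `mu_post` and `Sigma_post` from the earlier chunk and an arbitrary test functional \(G\):

```r
# Minimal sketch: plain Monte Carlo approximation of E[G(u) | y] for the
# Gaussian posterior N(mu_post, Sigma_post) computed in the earlier chunk,
# with an arbitrary test functional G(u) = u_1 cos(u_2) exp(u_3).
set.seed(2)
n_mc <- 10000
L  <- t(chol(Sigma_post))                      # Sigma_post = L %*% t(L)
xi <- matrix(rnorm(nrow(Sigma_post) * n_mc), nrow = nrow(Sigma_post))
U  <- as.vector(mu_post) + L %*% xi            # columns are posterior samples

G <- function(u) u[1] * cos(u[2]) * exp(u[3])
mean(apply(U, 2, G))                           # approximates E[G(u) | y]
```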
## Gaussians in \(\mathcal R^d\)
:::{.definition}
A Gaussian random vector \(Z\) with mean zero and covariance \(\Sigma \in \mathcal R^{d\times d}\) (positive definite matrix) is a random vector with density (in \(\mathcal R^d\)):
$$
p(z) = \frac{1}{\sqrt{(2 \pi)^d ~det(\Sigma)}} \cdot exp(-\frac{1}{2} < \Sigma^{-1} z, z>),
$$
and we write \(Z \sim N(0, \Sigma)\).
:::
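A quick numerical check of this formula (a minimal sketch with an assumed diagonal \(\Sigma\), for which the multivariate density must factor into a product of univariate normal densities):

```r
# Minimal sketch (assumed diagonal Sigma): check the density formula against
# the product of univariate normal densities.
dmvn_manual <- function(z, Sigma) {
  d <- length(z)
  exp(-0.5 * drop(t(z) %*% solve(Sigma, z))) / sqrt((2 * pi)^d * det(Sigma))
}

Sigma_diag <- diag(c(1, 4, 0.25))
z <- c(0.3, -1.2, 0.5)

dmvn_manual(z, Sigma_diag)
prod(dnorm(z, mean = 0, sd = sqrt(diag(Sigma_diag))))   # should agree
```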
We can characterize this distribution via its characteristic function: \(\phi_Z(v):= E[e^{\textbf{i} <Z, v>}] = e^{-\frac{1}{2} <\Sigma v, v>}\), \(v \in \mathcal R^d\).
Notice that we can see \(Z\) as a collection of real valued random variables: \(Z = (Z_1, \cdots, Z_d)\), namely the coordinates of \(Z\). For convenience let's write \(Z\) in the following form: \(\{Z_x\}_{x \in \mathcal X}\), where \(\mathcal X\) is the set \(\mathcal X = \{1, \cdots, d\}\).
:::{.proposition}
Let \(\mathcal X = \{1, \cdots, d\}\). A collection of real valued random variables \(\{Z_x\}_{x \in \mathcal X}\) is Gaussian iff for every (finite) subset \(\mathcal X' \subseteq \mathcal X\), the collection \(\{Z_x\}_{x \in \mathcal X'}\) is Gaussian.
:::
:::{.proof}
\quad
"\(\Leftarrow\)": It is obvious since we can pick \(\mathcal X' = \mathcal X\).
"\(\Rightarrow\)": For simplicity, suppose \(\mathcal X' = \{1, \cdots, m\}\), \(m \leq d\). Let \(\tilde Z = \{Z_x\}_{x \in \mathcal X'}\), \(w \in \mathcal R^m\). Then we have:
$$
\begin{aligned}
E[e^{\textbf{i} <\tilde Z, w>_{\mathcal R^m}}] &= E[e^{\textbf{i} <Z, u>_{\mathcal R^d}}] \qquad (u = (w_1, \cdots, w_m, 0, \cdots, 0) \in \mathcal R^d)\\
&= e^{-\frac{1}{2} < \Sigma u, u >_{\mathcal R^d}}\\
&= e^{-\frac{1}{2} < \tilde \Sigma w, w >_{\mathcal R^m}},
\end{aligned}
$$
where \(\tilde \Sigma \in \mathcal R^{m \times m}\) and \(\tilde \Sigma_{ij} = \Sigma_{ij}\), \(i,j = 1,\cdots,m\).
Since the characteristic function of \(\tilde Z\) is that of a Gaussian in \(\mathcal R^m\), we have \(\tilde Z \sim N_m(0, \tilde \Sigma)\).
:::
:::{.proposition}
Let \(Z\) be a Gaussian random vector in \(\mathcal R^d: Z \sim N_d(0, \Sigma)\).
1. If \(A \in \mathcal R^{m \times d}\) is a matrix, then \(U = AZ\) is a Gaussian in \(\mathcal R^m\) and \(U \sim N_m(0, A\Sigma A^T)\).
2. If \(Z \sim N_d(0, \Sigma)\), \(\tilde Z \sim N_m(0, \tilde \Sigma)\), and \(Z, \tilde Z\) are independent, then \((Z, \tilde Z) \sim N_{d+m}(0, \left(\begin{matrix}\Sigma&0\\0&\tilde \Sigma\end{matrix}\right))\).
3. \(Z\) is a Gaussian in \(\mathcal R^d\) iff every linear combination of its coordinates is univariate Gaussian:
$$
Z = (Z_1, \cdots, Z_d) \sim N_d \Leftrightarrow \forall a_1, \cdots, a_d \in \mathcal R, ~\sum_{j=1}^d a_j Z_j \sim N_1.
$$
4. Suppose that \(Z \sim N_d(0, \Sigma)\), and let \(v, \tilde v \in \mathcal R^d\) be two vectors, then
$$
Cov(<Z, v>, <Z, \tilde v>) = <\Sigma v, \tilde v> = <\Sigma \tilde v, v>
$$
5. Suppose \(Z = (Z_1, \cdots, Z_d)\) is Gaussian with cov. matrix \(\Sigma\), then the coordinates of \(Z\) are independent random variables iff \(\Sigma\) is a diagonal matrix.
:::
:::{.proof}
1.
$$
\begin{aligned}
E[e^{\textbf{i}<w,U>_{\mathcal R^m}}] &= E[e^{\textbf{i}<w,AZ>_{\mathcal R^m}}]\\
&\underset{<w, Av>_{\mathcal R^m} = <A^Tw, v>_{\mathcal R^d}}{=} E[e^{\textbf{i}<A^Tw,Z>_{\mathcal R^d}}]\\
&= e^{-\frac{1}{2} < \Sigma (A^Tw), A^Tw>_{\mathcal R^d}}\\
&= e^{-\frac{1}{2} < A \Sigma A^Tw, w>_{\mathcal R^m}},
\end{aligned}
$$
which is the characteristic function of \(N_m(0, A\Sigma A^T)\).
3. "\(\Rightarrow\)": Let \(U = \sum_{j=1}^d a_j Z_j\) be a linear combination of the coordinates of \(Z\). By 1, we have \(U=AZ\), where \(A = (a_1, \cdots, a_d)\), is a Gaussian.
"\(\Leftarrow\)": Let \(a \in \mathcal R^d\), with \(a = (a_1, \cdots, a_d)\). Then, \(E[e^{\textbf{i}<a,Z>}] = E[e^{\textbf{i} \sum_{j=1}^d a_j Z_j}] \underset{\sum_{j=1}^d a_j Z_j \sim N(0, \sigma_a^2)}{=} e^{-\frac{1}{2} \sigma_a^2}\), where \(\sigma_a^2 = Var(\sum_{j=1}^d a_j Z_j) = <\Sigma a, a>_{\mathcal R^d}\), \(\Sigma\) is the covariance matrix of \(Z = (Z_1, \cdots, Z_d)\). Thus, \(E[e^{\textbf{i}<a,Z>}] = e^{-\frac{1}{2} <\Sigma a, a>_{\mathcal R^d}}\), \(\forall a \in \mathcal R^d\), which means that \(Z \sim N_d\).
4.
$$
\begin{aligned}
Cov(<Z, v>, <Z, \tilde v>) &= Cov(\sum_{i=1}^d Z_i v_i, \sum_{j=1}^d Z_j \tilde v_j)\\
&= \sum_{i=1}^d \sum_{j=1}^d v_i \tilde v_j Cov(Z_i, Z_j)\\
&= \sum_{i=1}^d \sum_{j=1}^d \Sigma_{ij}v_i \tilde v_j\\
&= <\Sigma v, \tilde v> = <\Sigma \tilde v, v>.
\end{aligned}
$$
:::
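A small numerical illustration of property 1 (a sketch with an assumed \(\Sigma\) and a random \(A\); the sample covariance of \(U = AZ\) should be close to \(A\Sigma A^T\)):

```r
# Minimal sketch (assumed Sigma, random A): numerical check of property 1,
# i.e. that U = AZ has covariance A Sigma t(A).
set.seed(3)
d <- 3; m <- 2; n <- 100000
Sigma <- matrix(c(2.0, 0.5, 0.2,
                  0.5, 1.0, 0.3,
                  0.2, 0.3, 1.5), d, d)
A <- matrix(rnorm(m * d), m, d)

Z <- t(chol(Sigma)) %*% matrix(rnorm(d * n), d, n)   # columns ~ N_d(0, Sigma)
U <- A %*% Z                                         # columns ~ N_m(0, A Sigma t(A))

cov(t(U))                                            # sample covariance of U
A %*% Sigma %*% t(A)                                 # theoretical covariance
```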
## Diagonalization of \(\Sigma\) and Sampling from \(N(0, \Sigma)\)
:::{.theorem}
If \(\Sigma \in \mathcal R^{d \times d}\) is symmetric positive semi definite, then there are vectors \(v_1, \cdots, v_d \in \mathcal R^d\) and non-negative \(\lambda_1 \geq \cdots \geq \lambda_d \geq 0\) s.t. \(\Sigma v_i = \lambda_i v_i\) and \(<v_i, v_j>_{\mathcal R^d} = \begin{cases}1 & i=j\\0 & i \neq j\end{cases}\).
In short: \(\Sigma\) induces an orthonormal basis for \(\mathcal R^d\) formed of eigen vectors of \(\Sigma\). Moreover, all of \(\Sigma\)'s eigen values are non-negative.
:::
- How to sample from \(Z \sim N_d(0, \Sigma)\):
Since \(Z \in \mathcal R^d\), we notice that \(Z = \sum_{i=1}^d <Z, v_i>_{\mathcal R^d} v_i\), where \(\{v_i\}\) is a fixed (non-random) orthonormal basis for \(\mathcal R^d\). We then have \(<Z, v_i> \sim N_1(0, <\Sigma v_i, v_i>) = N_1(0, \lambda_i <v_i, v_i>) = N_1(0, \lambda_i)\).
Moreover, for \(i \neq j\):
$$
\begin{aligned}
Cov(<Z, v_i>, <Z, v_j>) &= <\Sigma v_i, v_j>\\
&= <\lambda_i v_i, v_j>\\
&= \lambda_i <v_i, v_j>\\
&= 0
\end{aligned}
$$
which means that the random variables \(<Z, v_1>,\cdots, <Z,v_d>\) are independent (they are jointly Gaussian and uncorrelated).
So, we can write \(Z = \sum_{i=1}^d \sqrt{\lambda_i}(\frac{<Z, v_i>}{\sqrt{\lambda_i}}) v_i = \sum_{i=1}^d \sqrt{\lambda_i}\xi_i v_i\), where \(\xi_1, \cdots, \xi_d\) are i.i.d \(N(0,1)\) random variables (assuming \(\lambda_i > 0\)).
Thus, to sample from \(N_d(0, \Sigma)\), we can do the following:
1. Find the eigen pairs \(\{(\lambda_i, v_i)\}_{i=1,\cdots,d}\) for \(\Sigma\).
2. Sample \(\xi_1, \cdots, \xi_d \sim N(0,1)\).
3. Set \(Z = \sum_{i=1}^d \sqrt{\lambda_i}\xi_i v_i\).
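A minimal R sketch of these three steps (with an assumed \(3 \times 3\) covariance matrix):

```r
# Minimal sketch of the three steps above (assumed 3 x 3 covariance matrix).
set.seed(4)
Sigma <- matrix(c(2.0, 0.5, 0.2,
                  0.5, 1.0, 0.3,
                  0.2, 0.3, 1.5), 3, 3)

eig <- eigen(Sigma, symmetric = TRUE)           # step 1: eigen pairs (lambda_i, v_i)

sample_gaussian <- function() {
  xi <- rnorm(length(eig$values))               # step 2: i.i.d. N(0, 1) draws
  drop(eig$vectors %*% (sqrt(eig$values) * xi)) # step 3: Z = sum_i sqrt(lambda_i) xi_i v_i
}

Z <- replicate(50000, sample_gaussian())        # columns are independent samples
cov(t(Z))                                       # should be close to Sigma
```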
:::{.remark}
The above is precisely what we need to sample from Gaussians in more general settings. In particular, for Gaussians associated with certain RKHSs.
:::
## Stochastic Processes and Gaussian Processes
:::{.definition #sp name="Stochastic Processes"}
Let \(\mathcal X\) be a set. A stochastic process over \(\mathcal X\) is a collection of random variables \(\{Z_x\}_{x \in \mathcal X}\).
:::
:::{.definition #gp name="Gaussian Processes"}
A Gaussian Process over \(\mathcal X\) is a stochastic process \(\{Z_x\}_{x \in \mathcal X}\) with the property that \(\forall\) finite subset \(\mathcal X'\) of \(\mathcal X\), the vector \(\{Z_x\}_{x \in \mathcal X'}\) is a Gaussian.
:::
:::{.definition #cf name="Covariance Function"}
Suppose \(Z = \{Z_x\}_{x \in \mathcal X}\) is a Gaussian Process over \(\mathcal X\). We define \(Z\)'s covariance function as:
$$
K: \mathcal X \times \mathcal X \to \mathcal R\\
K(x, \tilde x):= Cov(Z_x, Z_{\tilde x})
$$
We will write \(Z \sim N(0, K)\) for simplicity.
:::
:::{.proposition}
The covariance function of a Gaussian process over \(\mathcal X\) is a kernel over \(\mathcal X\):
For any \(x_1, \cdots, x_n \in \mathcal X\), the vector \((Z_{x_1}, \cdots, Z_{x_n})\) is an \(n\)-dimensional Gaussian, so its covariance matrix \(\Sigma\), with \(\Sigma_{ij} = K(x_i, x_j)\), is symmetric positive semi definite.
:::
Let \(K: \mathcal R^d \times \mathcal R^d \to \mathcal R\) be a continuous kernel, for which \(|K(x, \tilde x)| \leq M\), where \(M\) is a constant. Let \(p: \mathcal R^d \to \mathcal R\) be a density function.
Consider the following operator:
$$
T_K: f \in \mathcal L^2(p) \mapsto \int_{\mathcal R^d} K(\cdot, x) f(x) p(x) dx,
$$
where \(\mathcal L^2(p) = \{f: \int_{\mathcal R^d} f(x)^2 p(x)dx < \infty\}\). In particular, for \(f \in \mathcal L^2(p)\), \(T_K f\) is a function on \(\mathcal R^d\), defined by \(T_K f(\tilde x):= \int_{\mathcal R^d} K(\tilde x, x) f(x) p(x) dx\), \(\tilde x \in \mathcal R^d\).
:::{.definition #ep name="Eigen Values and Eigen Functions"}
\(\lambda \in \mathcal R\) is said to be an eigen value of \(T_K\) if \(\exists f \in \mathcal L^2(p)\), \(f \neq 0\), s.t. \(T_Kf = \lambda f\).
In other words, if \(\lambda f(\cdot) = \int_{\mathcal R^d} K(\cdot, \tilde x) f(\tilde x) p(\tilde x) d \tilde x\), we call \((\lambda, f)\) an eigen pair.
:::
:::{.theorem #mt name="Mercer's Theorem"}
There exists a sequence of eigen pairs \(\{(\lambda_k, \phi_k)\}_{k \in \mathcal N}\) of \(T_K\) s.t
1. \(<\phi_k, \phi_l>_{\mathcal L^2(p)} = \int_{\mathcal R^d} \phi_k(x) \phi_l(x) p(x) dx = \begin{cases}1 &k=l\\0&o.w.\end{cases}\).
2. \(\mathcal L^2(p) = \{\sum_{i=1}^\infty a_i \phi_i: \sum_{i=1}^\infty a_i^2 < \infty\}\) and \(f \in \mathcal L^2(p)\) can be represented as \(\sum_{i=1}^\infty <f, \phi_i>_{\mathcal L^2(p)} \phi_i\) and the inner product can be written as \(<f,g>_{\mathcal L^2(p)} = \sum_{i=1}^\infty <f, \phi_i>_{\mathcal L^2(p)} < g, \phi_i>_{\mathcal L^2(p)}\).
3. The numbers \(\lambda_1, \lambda_2, \cdots\) satisfy \(\lambda_1 \geq \lambda_2 \geq \cdots \geq 0\) and \(\lambda_n \to 0\) as \(n \to \infty\).
4. The kernel \(K\) can be written as \(K(x, \tilde x) = \sum_{i=1}^\infty \lambda_i \phi_i(x) \phi_i(\tilde x)\).
:::
:::{.proof}
4. First, note that \(K(\cdot, x) \in \mathcal L^2(p)\):
$$
\begin{aligned}
\int_{\mathcal R^d} (K(\tilde x, x))^2 p(\tilde x) d\tilde x &\leq \int_{\mathcal R^d} [\sqrt{K(\tilde x, \tilde x)} \sqrt{K(x,x)}]^2 p(\tilde x) d\tilde x\\
&= K(x,x) \int_{\mathcal R^d} K(\tilde x, \tilde x) p(\tilde x) d\tilde x\\
&< \infty
\end{aligned}
$$
In particular, \(K(\cdot, x) = \sum_{i=1}^\infty <K(\cdot,x), \phi_i>_{\mathcal L^2(p)} \phi_i(\cdot)\). Thus, \(K(\tilde x, x) = \sum_{i=1}^\infty <K(\cdot,x), \phi_i>_{\mathcal L^2(p)} \phi_i(\tilde x)\). We just need to show that \(<K(\cdot,x), \phi_i>_{\mathcal L^2(p)} = \lambda_i \phi_i(x)\):
$$
\begin{aligned}
<K(\cdot,x), \phi_i>_{\mathcal L^2(p)} &= \int_{\mathcal R^d} K(\tilde x, x) \phi_i(\tilde x) p(\tilde x) d \tilde x\\
&= T_K \phi_i(x)\\
&= \lambda_i \phi_i(x)
\end{aligned}
$$
:::
- How to define a Gaussian Process over \(\mathcal X\) with covariance function \(K\)? How to do sampling?
Assuming \(K\) satisfies the condition for Mercer's Theorem for some density \(p\), we consider the eigen pairs \(\{(\lambda_i, \phi_i)\}\) and do the following:
Let \(\xi_1, \xi_2, \cdots\) be i.i.d \(N(0,1)\) random variables. We define a random function: \(Z = \sum_{i=1}^\infty \sqrt{\lambda_i} \xi_i \phi_i\). In particular, \(Z(x) = \sum_{i=1}^\infty \sqrt{\lambda_i} \xi_i \phi_i(x)\).
It follows that,
$$
\begin{aligned}
Cov(Z(x), Z(\tilde x)) &= E[\sum_{i=1}^\infty \sum_{j=1}^\infty \sqrt{\lambda_i} \sqrt{\lambda_j} \xi_i \xi_j \phi_i(x)\phi_j(\tilde x)]\\
&= \sum_{i=1}^\infty \sum_{j=1}^\infty \sqrt{\lambda_i} \sqrt{\lambda_j} \phi_i(x)\phi_j(\tilde x)E[\xi_i \xi_j]\\
&\underset{E[\xi_i \xi_j] = 1_{i=j}}{=} \sum_{i=1}^\infty\lambda_i \phi_i(x) \phi_i(\tilde x)\\
&= K(x, \tilde x)
\end{aligned}
$$
:::{.theorem #kle name="Karhunen-Loève Expansion"}
The representation \(Z = \sum_{i=1}^\infty \sqrt{\lambda_i} \xi_i \phi_i\) for \(Z \sim N(0, K)\) is known as the Karhunen-Loève Expansion.
:::
:::{.example}
Let \(\mathcal X = [0,1]\) and let \(K(s,t) = min\{s,t\} - st\), which is a kernel (the covariance function of a Brownian bridge). Consider \(p(t) = 1\), \(t \in [0,1]\), and the operator \(T_K: f \in \mathcal L^2(p) \to T_Kf \in \mathcal L^2(p)\), where \(T_K f(\cdot) = \int_0^1 K(\cdot, s) f(s) ds\).
Let us try to find the eigen pairs of \(T_K\). That is,
$$
\begin{aligned}
\lambda \phi(t) &= \int_0^1 K(t,s) \phi(s) ds\\
&= \int_0^1 (min\{s,t\} - st) \phi(s) ds\\
&= \int_0^1 min\{s,t\} \phi(s) ds - t\int_0^1 s \phi(s) ds\\
&= \int_0^t s\phi(s) ds + t\int_t^1 \phi(s) ds - t\int_0^1 s \phi(s) ds\\
&= \int_0^t s\phi(s) ds + t\int_t^1 \phi(s) ds - tC, \qquad C := \int_0^1 s \phi(s) ds
\end{aligned}
$$
Take derivative w.r.t. \(t\) to get
$$
\begin{aligned}
\lambda \phi'(t) &= t\phi(t) + [\int_t^1 \phi(s) ds - t\phi(t)] - C\\
& = \int_t^1 \phi(s) ds - C
\end{aligned}
$$
Take derivative w.r.t. \(t\) again to derive:
$$
\lambda \phi''(t) = - \phi(t)
$$
Evaluating the original integral equation at \(t = 0\) and \(t = 1\) (assuming \(\lambda \neq 0\)), and for reasons that will soon become apparent (\(\phi \in \mathcal H\)), we must have:
$$
\phi(0) = 0 = \phi(1)
$$
Thus, we have the equation:
$$
\begin{cases}
\lambda \phi''(t) + \phi(t) = 0, ~\forall t \in (0,1)\\
\phi(0) = 0\\
\phi(1) = 0
\end{cases}
$$
Thus, \(\phi\) has to take the form:
\[\phi(t) = A cos(\frac{t}{\sqrt{\lambda}}) + B sin(\frac{t}{\sqrt \lambda})\]
However, we need to make sure \(\phi(0) = 0 = \phi(1)\), which means that \(0 = \phi(0) = A\), and \(0 = \phi(1) = B sin(\frac{1}{\sqrt \lambda})\), which forces \(\frac{1}{\sqrt \lambda}\) to be a multiple of \(\pi\).
In other words: \(\frac{1}{\sqrt \lambda} = k\pi\), i.e. \(\lambda = \frac{1}{k^2 \pi^2}\), \(k \in \mathcal N\).
Thus, we have
$$
\begin{cases}
\lambda_k = \frac{1}{k^2 \pi^2}\\
\phi_k(t) = C_k sin (k\pi t)
\end{cases}
$$
where \(C_k = \sqrt 2\) so that \(\int_0^1 \phi_k^2(t) dt = 1\).
:::
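The following minimal sketch (with an assumed truncation level) uses the Karhunen-Loève expansion from above, with \(\lambda_k = \frac{1}{k^2\pi^2}\) and \(\phi_k(t) = \sqrt 2 \sin(k\pi t)\), to draw approximate sample paths of the Gaussian process with this kernel:

```r
# Minimal sketch (assumed truncation level): approximate samples of the
# Gaussian process with kernel K(s,t) = min(s,t) - s*t via a truncated
# Karhunen-Loeve expansion with lambda_k = 1/(k^2 pi^2) and
# phi_k(t) = sqrt(2) sin(k pi t).
set.seed(5)
n_terms <- 200                                 # assumed number of KL terms
t_grid  <- seq(0, 1, length.out = 201)

kl_path <- function() {
  xi <- rnorm(n_terms)
  z  <- numeric(length(t_grid))
  for (k in seq_len(n_terms)) {
    z <- z + sqrt(1 / (k^2 * pi^2)) * xi[k] * sqrt(2) * sin(k * pi * t_grid)
  }
  z
}

Z <- replicate(5, kl_path())                   # five approximate sample paths
matplot(t_grid, Z, type = "l", lty = 1, xlab = "t", ylab = "Z(t)")
```

Each sampled path vanishes at \(t = 0\) and \(t = 1\), consistent with \(\phi_k(0) = \phi_k(1) = 0\).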
:::{.remark}
Let's go back to the setting of Mercer's Theorem:
- \(K\) a continuous, bounded kernel.
- \(p\) a density.
- \(\{(\lambda_k, \phi_k)\}_{k \in \mathcal N}\) are eigen pairs of \(T_K\).
1. \(\mathcal L^2(p)\) can be alternatively written as: \(\mathcal L^2(p) = \{\sum_{i=1}^\infty a_i \phi_i : \sum_{i=1}^\infty a_i^2 < \infty\}\).
2. The RKHS \(\mathcal H\) of \(K\) is defined as
$$
\mathcal H := \{\sum_{i=1}^\infty b_i K(\cdot, x_i): \{x_i\} \subseteq \mathcal X, \{b_i\} \subseteq \mathcal R, \text{ s.t } \sum_{j=1}^\infty \sum_{i=1}^\infty K(x_i, x_j) b_i b_j < \infty\}
$$
As
$$
\begin{aligned}
\sum_{i=1}^\infty b_i K(\cdot, x_i) &= \sum_{i=1}^\infty b_i \sum_{k=1}^\infty \lambda_k \phi_k(\cdot) \phi_k(x_i)\\
&= \sum_{k=1}^\infty \lambda_k (\sum_{i=1}^\infty b_i\phi_k(x_i)) \phi_k(\cdot)\\
&= \sum_{k=1}^\infty <\sum_{i=1}^\infty b_i K(\cdot, x_i), \phi_k>_{\mathcal L^2(p)} \phi_k(\cdot)\\
&:= \sum_{k=1}^\infty a_k \phi_k(\cdot),
\end{aligned}
$$
we can write it as:
\[\mathcal H = \{\sum_{k=1}^\infty a_k \phi_k(\cdot): \{a_k\} \subseteq \mathcal R \text{ s.t. } \sum_{k=1}^\infty \frac{a_k^2}{\lambda_k} < \infty\}.\]
In particular,
- \(\mathcal H \subseteq \mathcal L^2(p)\)
- \(\phi_i \in \mathcal H\), \(\forall i\).
- \(Z = \sum_{i=1}^\infty \sqrt{\lambda_i} \xi_i \phi_i \notin \mathcal H\), although it does belong to \(\mathcal L^2(p)\): formally, \(\sum_{i=1}^\infty \frac{(\sqrt{\lambda_i}\xi_i)^2}{\lambda_i} = \sum_{i=1}^\infty \xi_i^2 = \infty\) almost surely, whereas \(E[\sum_{i=1}^\infty \lambda_i \xi_i^2] = \sum_{i=1}^\infty \lambda_i < \infty\).
:::