---
title: Warmup Exercises
number-sections: true
author: Phil
bibliography: refs.bib
---
::: {.hidden}
$$
\newcommand{\R}{\mathbb{R}}
\newcommand{\vx}{\mathbf{x}}
\newcommand{\vw}{\mathbf{w}}
\newcommand{\vz}{\mathbf{z}}
\newcommand{\norm}[1]{\lVert #1 \rVert}
\newcommand{\bracket}[1]{\langle #1 \rangle}
\newcommand{\abs}[1]{\lvert #1 \rvert}
\newcommand{\paren}[1]{\left( #1 \right)}
$$
:::
## Perceptron {#sec-perceptron}
#### Part 1
Sketch the line in $\R^2$ described by the equation
$$
\bracket{\vw, \vx} = b\;,
$$ {#eq-perceptron-boundary}
where $\vw = \paren{1, -\frac{1}{2}}^T \in \R^2$ and $b = \frac{1}{2}$. Here, $\bracket{\vw, \vx} = \sum_{i = 1}^n w_i x_i$ is the inner product (or dot product) between the vectors $\vw$ and $\vx$.
#### Part 2
Write a quick Python function called `perceptron_classify(w, b, x)`. `w` and `x` should both be 1d numpy arrays of the same length, and `b` should be a scalar. Your function should return `0` if $\bracket{\vw, \vx} < b$ and `1` if $\bracket{\vw, \vx} \geq b$. [An excellent solution will use neither a `for`-loop nor an `if`-statement.]{.aside}
Verify that your function works on a few simple examples.
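If you get stuck, here is a minimal sketch of one possible `perceptron_classify`, using a boolean comparison cast to an integer instead of an `if`-statement; the checks use the $\vw$ and $b$ from Part 1.

```python
import numpy as np

def perceptron_classify(w, b, x):
    # <w, x> >= b gives a boolean; multiplying by 1 casts it to 0 or 1
    return 1*(np.dot(w, x) >= b)

# quick checks against the boundary from Part 1
w = np.array([1.0, -0.5])
b = 0.5
print(perceptron_classify(w, b, np.array([1.0, 0.0])))   # <w, x> = 1.0 >= 0.5, so 1
print(perceptron_classify(w, b, np.array([0.0, 1.0])))   # <w, x> = -0.5 < 0.5, so 0
```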
#### Part 3
Consider a line of the general form of @eq-perceptron-boundary. Let's allow $\vx$ and $\vw$ to both be $n$-dimensional, so that this equation defines a hyperplane in $\R^n$. Suppose that we wanted to represent *the same hyperplane* in $\R^n$ using an equation of the form
$$
\bracket{\tilde{\vw}, \tilde{\vx}} = 0\;,
$$ {#eq-perceptron-boundary-2}
for some $\tilde{\vw} \in \R^{n+1}$. Define $\tilde{\vx} = (\vx, 1)$. How could you define $\tilde{\vw}$ to make @eq-perceptron-boundary-2 equivalent to @eq-perceptron-boundary?
## Convexity {#sec-convexity}
As you learned in Daumé, informally, a convex function is a function that is "bowl-shaped." Hardt and Recht give the formal definition, which has the benefit of applying to functions of many variables.
#### Part 1
Consider the 0-1 step function that I've plotted below:
```{python}
#| echo: false
from matplotlib import pyplot as plt
plt.rcParams["figure.figsize"] = (5, 3)
```
```{python}
from matplotlib import pyplot as plt
import numpy as np
fig, ax = plt.subplots(1, 1)
y_hat = np.linspace(-1, 1, 101)
loss = lambda y_hat, y: 1 - 1*(y_hat*y > 0)
ax.set(xlabel = r"$\hat{y}$",
       ylabel = r"$\ell(\hat{y}, y)$")
ax.plot(y_hat, loss(y_hat, 1))
```
Show pictorially that this function is not convex. No proof needed -- just the right drawing.
### Part 2: Second Derivative Test
Another way to tell whether a function is convex is to check its second derivative. If a function $f:S\rightarrow \R$ has a convex domain $S\subseteq \R$, if $f$ is everywhere twice-differentiable, and if $\frac{d^2f(z_0)}{dz^2} > 0$ for all $z_0 \in S$, then $f$ is convex.
Use the second derivative test to check that the following two functions are convex: [The base of the logarithm doesn't really matter, but for this course it is always most convenient to assume logs base $e$, which you might also have seen written $\ln$.]{.aside}
$$
\begin{aligned}
f(z) &= - \log z \\
g(z) &= - \log(1-z)\;.
\end{aligned}
$$
### Part 3: Plotting Practice
In a Jupyter notebook, write a simple program to plot each of the functions $f$ and $g$ from Part 2. Some of the Part 1 code is likely to help you.
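A minimal sketch of one way to do this, restricting the plot to the open interval $(0, 1)$ where both functions are defined:

```python
import numpy as np
from matplotlib import pyplot as plt

# stay inside (0, 1): f needs z > 0 and g needs z < 1
z = np.linspace(0.01, 0.99, 101)

fig, ax = plt.subplots(1, 1)
ax.plot(z, -np.log(z), label = r"$f(z) = -\log z$")
ax.plot(z, -np.log(1-z), label = r"$g(z) = -\log(1-z)$")
ax.set(xlabel = r"$z$")
ax.legend()
```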
### Part 4: Convexity in Many Variables
Recall the Hardt and Recht definition of convexity: a function $f:\R^p \rightarrow \R$ is convex if, for any $\lambda \in [0,1]$ and any points $\vz_1, \vz_2 \in \R^p$,
$$
f(\lambda \vz_1 + (1-\lambda)\vz_2) \leq \lambda f(\vz_1) + (1-\lambda)f(\vz_2)\;.
$$
Using this definition, write a short mathematical proof that the function $f(\vz) = \norm{\vz} = \sqrt{\bracket{\vz, \vz}}$ is convex. You will want to use the *triangle inequality*, which says that $\norm{\vz_1 + \vz_2} \leq \norm{\vz_1} + \norm{\vz_2}$. [This proof requires just a few lines if you carefully use your definitions!]{.aside}
## Gradient Descent {#sec-gradient-descent}
Consider the quadratic function $g(z) = \frac{1}{2}az^2 + bz + c$.
1. Prove that $g$ has a critical point at the point $z^* = -\frac{b}{a}$ (*hint: solve $g'(z^*) = 0$*).
2. What must be true about the constants $a$, $b$, and $c$ to ensure that this point is a *local minimum* of $g$? (*Hint: second derivative test*).
3. Suppose now that we are able to evaluate the function $g$, as well as its derivative $g'$, but not able to use algebra to find $z^*$ (this mirrors our situation in most practical problems). Instead, we are going to use the following algorithm to attempt to approximate $z^*$:
- Begin with some initial guess $z^{(0)}$.
- In each time-step $t$, compute $z^{(t+1)} \gets z^{(t)} - \alpha g'(z^{(t)})$, where $\alpha > 0$ is the *learning rate*.
- In practice we would need to specify a stopping criterion, but for this theoretical problem we don't need to worry about it.
4. Using algebra, prove that for any timestep $t$,
$$
(z^* - z^{(t+1)})^2 = (a\alpha - 1)^2(z^* - z^{(t)})^2\;.
$$
5. Let's think of $\abs{z^* - z^{(t)}}$ as the *error* in our current estimate $z^{(t)}$. Using the recurrence above, conclude that, for any $t$, the error $\abs{z^* - z^{(t)}}$ satisfies
$$
\abs{z^* - z^{(t)}} = \abs{a\alpha - 1}^{t}\abs{z^* - z^{(0)}}\;.
$$
6. For $\alpha \in (\alpha_*, \alpha^*)$, we are guaranteed that the error $\abs{z^* - z^{(t)}}\rightarrow 0$ as $t\rightarrow \infty$. What are $\alpha_*$ and $\alpha^*$?
7. Suppose that $\alpha$ is within the necessary range. I want to guarantee that $\abs{z^* - z^{(t)}} < \epsilon$ for some small $\epsilon > 0$ (in practice we often call this the *tolerance*). Conclude that the number of steps necessary to reach this tolerance is no greater than
$$
\bar{t} = \frac{ \log \epsilon - \log \abs{z^* - z^{(0)}}}{\log \abs{a\alpha - 1}}\;.
$$
Ignoring constants with respect to $\epsilon$, we say that this algorithm for finding the minimum of $g$ with tolerance $\epsilon$ has a $\log \epsilon$ convergence rate.
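If you'd like to sanity-check the recurrence from step 5 numerically before proving it, here is a small sketch; the values of $a$, $b$, $c$, and $\alpha$ are arbitrary choices for illustration only.

```python
import numpy as np

a, b, c = 2.0, 1.0, 0.0      # arbitrary example values with a > 0
alpha   = 0.1                # learning rate, chosen arbitrarily
z_star  = -b/a               # the critical point from step 1

g_prime = lambda z: a*z + b  # derivative of g(z) = a*z**2/2 + b*z + c

z0 = 1.0                     # initial guess z^(0)
z  = z0
for t in range(1, 11):
    z = z - alpha*g_prime(z)
    predicted_error = np.abs(a*alpha - 1)**t * np.abs(z_star - z0)
    print(t, np.abs(z_star - z), predicted_error)   # the two columns should agree
```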
## Gradient Descent (Again) {#sec-gradient-descent-2}
Consider the function $f(w_0, w_1) = \sin(w_0w_1)$. You can define this function like this:
```{python}
#| code-fold: false
import numpy as np
def f(w):
    return np.sin(w[0]*w[1])
```
Mathematically, the gradient of this function is
$$\nabla f(w_0, w_1) = (w_1\cos w_0w_1, w_0 \cos w_0w_1)^T.$$
1. Implement a simple loop that uses gradient descent to find a minimum of this function (a minimal sketch follows this list).
- You'll have to choose the learning rate $\alpha$.
- The `np.cos()` function will be useful for programming the gradient.
- It's not the fastest approach, but if you're not sure how to program the gradient you can always first implement it as a list of two floats, and then use `np.array(my_list)` to convert it into a numpy array.
- You'll also need to pick a random starting guess.
2. Find two initial guesses for the parameter vector $\vw$ such that you get two *different* final minimizers (this is possible because $f$ is not convex).
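Here is a minimal sketch of the kind of loop point 1 describes; the learning rate and number of iterations are arbitrary choices that you may want to tune.

```python
import numpy as np

def f(w):
    return np.sin(w[0]*w[1])

def grad_f(w):
    # gradient from the formula above
    return np.array([w[1]*np.cos(w[0]*w[1]),
                     w[0]*np.cos(w[0]*w[1])])

alpha = 0.1                              # learning rate, chosen arbitrarily
w = np.random.uniform(-2, 2, size = 2)   # random starting guess

for _ in range(1000):
    w = w - alpha*grad_f(w)

print(w, f(w))   # f(w) should be close to -1 near a minimum
```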
## Overfitting and the Scientific Method {#sec-overfitting}
[![](https://upload.wikimedia.org/wikipedia/commons/thumb/8/82/The_Scientific_Method.svg/520px-The_Scientific_Method.svg.png) Image from [Wikipedia](https://en.wikipedia.org/wiki/Scientific_method).]{.aside}
In [the scientific method](https://en.wikipedia.org/wiki/Scientific_method), it is often emphasized that we need to formulate a hypothesis *before* performing an experiment. It's fine for the hypothesis to be based on *previous* experiments. However, the scientific method never allows us to perform an experiment, formulate a hypothesis, and then say that the experiment supported the (new) hypothesis.
We can think of scientific theories as systems of thought that help us make predictions about new phenomena. With this in mind, ***please write a short paragraph explaining the importance of hypothesis-first science using the language of machine learning.*** In your explanation, please use the following vocabulary:
- Training data.
- Training accuracy.
- Validation/testing data.
- Validation/testing accuracy.
- Overfitting.
## The Coin-Flipping Game {#sec-erm}
Let's play a game! Here is the setup:
I have a coin with probability of heads equal to $p \in [0,1]$. I am going to ask you to pick a number $\hat{p} \in [0,1]$. Then, I flip my coin.
[This game is more fun for me than it is for you.]{.aside}
- If my coin comes up heads, you give me $-\log \hat{p}$ dollars.
- If my coin comes up tails, you give me $-\log (1-\hat{p})$ dollars.
#### Part 1
Compute the *expected* amount of money you will give me when we play this game in terms of $p$ and $\hat{p}$. Call this quantity $R(\hat{p}, p)$. This is the *risk* of the guess $\hat{p}$.
#### Part 2
[Take the derivative and set it equal to 0! Don't forget to check that you've found a minimum of $R(\hat{p}, p)$ rather than a maximum or an inflection point.]{.aside}
Suppose I tell you the value of $p$. Write a mathematical proof to show that your best choice of $\hat{p}$ (the one that loses you the least money) is $\hat{p} = p$.
#### Part 3
Now suppose that I *don't* tell you the true value of $p$. Instead, I let you observe $n$ coin flips before asking you to make your guess. Describe:
- A suggestion for choosing $\hat{p}$ based only on the results of the previous flips.
- A way to estimate the risk (expected amount of money lost) based only on the results of the previous flips.
Your answer should depend on $\hat{p}$ but not on $p$!
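As a concrete illustration (not necessarily the answer you should give), here is a sketch that simulates some flips and computes one natural guess and one natural risk estimate; the true $p = 0.3$ appears only to generate the example data.

```python
import numpy as np

rng = np.random.default_rng(451)
flips = 1*(rng.random(100) < 0.3)   # simulated flips: 1 = heads, 0 = tails

p_hat = flips.mean()                # one natural guess: the observed frequency of heads

# one natural risk estimate: the average amount lost on the observed flips
risk_estimate = np.mean(-flips*np.log(p_hat) - (1 - flips)*np.log(1 - p_hat))

print(p_hat, risk_estimate)
```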
## Balancing Classification Rates {#sec-classification-rates-2}
[You can do this first part just by copying and pasting lecture code. It doesn't matter much how good your model is -- just make sure you're able to get predictions.]{.aside}
Use the code from [our recent lecture](/lecture-notes/classification-in-practice.qmd) to download the Titanic data set as a Pandas data frame and train a model on the training data. Then download the test data. Compute `y_pred`, the vector of predictions of your model on the test data.
Then, write a function that verifies eq. (2.6) in Alexandra Chouldechova's paper "[Fair Prediction with disparate impact](https://via.hypothes.is/https://arxiv.org/pdf/1703.00056.pdf)." Here's what your function should do:
[The positive predictive value is $\mathrm{PPV} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$.]{.aside}
1. Given vectors `y_pred` of predictions and `y_test` of actual labels, compute the False Negative Rate (FNR), False Positive Rate (FPR), prevalence $p$, and positive predictive value (PPV).
2. Return as a tuple the lefthand side and righthand side of eq. (2.6) in Chouldechova.
3. Verify that the two numbers are equal!
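Here is a hedged sketch of the rate computations; the left- and right-hand sides below are written as I read eq. (2.6), so double-check them against the paper before relying on this.

```python
import numpy as np

def chouldechova_check(y_pred, y_test):
    # confusion-matrix counts
    TP = np.sum((y_pred == 1) & (y_test == 1))
    FP = np.sum((y_pred == 1) & (y_test == 0))
    TN = np.sum((y_pred == 0) & (y_test == 0))
    FN = np.sum((y_pred == 0) & (y_test == 1))

    FNR = FN / (FN + TP)
    FPR = FP / (FP + TN)
    p   = np.mean(y_test)        # prevalence
    PPV = TP / (TP + FP)

    # left- and right-hand sides of eq. (2.6) as I read it -- verify against the paper
    lhs = FPR
    rhs = p/(1 - p) * (1 - PPV)/PPV * (1 - FNR)
    return lhs, rhs
```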
## Limits of The Quantitative Approach to Discrimination {#sec-limits-quantitative}
I'll give you each a number in Slack. The numbers correspond to the following sections of @narayanan2022limits. These are:
1. The null hypothesis allocates the burden of proof (p. 7-8)
2. Compounding inequality is far below the radar of quantitative methods (p. 9-10)
3. Snapshot datasets hide discrimination (p. 10-11)
4. Explaining away discrimination (p. 12-13)
5. What counts as evidence is a subjective choice (p. 5-7)
For your assigned section, please write a short paragraph (4-5 simple sentences is fine). You should:
- Summarize Narayanan's key points in that section.
- In one of the sentences, describe which aspects of the Uber case study (p. 13-16) reflect the ideas of the section you described.
Bring your paragraph in class and be ready to read it to your group.
## Vectorization Brainstorm {#sec-vectorization}
In a [recent lecture](lecture-notes/vectorization.qmd), we discussed methods of vectorizing text like the document-term matrix that use the *bag of words* assumption: the order of the words doesn't matter!
Take some time and propose an alternative approach to word-based text vectorization. Can you find a scheme that would give different vector representations to the following two sentences?
> "I love rabbits, not cats."
> "I love cats, not rabbits."
You don't have to implement your vectorization, but please be prepared to write pseudocode for your group to show in detail how you would perform the vectorization.
## Image Compression Factor of K-Means {#sec-compression}
In [today's reading](https://jakevdp.github.io/PythonDataScienceHandbook/05.11-k-means.html) on K-means clustering from the Python Data Science Handbook, Jake VanderPlas considers the use of K-means to reduce the number of distinct colors in an image (Example 2). I encourage you to run the code for this example while thinking about this warmup!
Give an estimate of the *compression factor*: the reduction of information achieved when compressing an image using k-means clustering into $k$ color clusters. The compression factor is the number of bits required to store the compressed image, divided by the number of bits required to store the original image. Both of these numbers can be computed asymptotically (i.e. with big-oh reasoning) in order to simplify the analysis.
There are multiple good ways to think about this question, and you're welcome to choose one that makes sense to you *as long as you carefully state your steps and assumptions*. Here are a few points that I find helpful:
#### Bits in Original Image
1. An image with $n$ rows and $m$ columns has $nm$ pixels.
2. Each pixel has three RGB color channels (Red, Green, and Blue).
3. Each color channel can be represented with 8 bits (which encode an integer between 0 and 255, denoting the color intensity in that channel).
#### Bits in Compressed Image
1. If I compress an image into just $k$ distinct colors, then instead of storing the full RGB value for each pixel, I can just store enough bits to uniquely identify the cluster containing each pixel. How many bits do I need for this?
2. I also need to store a dictionary (hash map) that associates color $j$ (i.e. the centroid of the $j$th cluster of colors) to its RGB value.
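If you want to check your asymptotic reasoning against concrete numbers, here is a small sketch under the assumptions listed above; it encodes one possible set of answers, so treat it as a cross-check rather than a starting point.

```python
import numpy as np

def compression_factor(n, m, k):
    # bits to store the original image: nm pixels, 3 channels, 8 bits each
    original_bits = n * m * 3 * 8
    # bits to store the compressed image: one cluster label per pixel,
    # plus a small dictionary mapping each of the k labels to its RGB value
    label_bits      = n * m * np.ceil(np.log2(k))
    dictionary_bits = k * 3 * 8
    return (label_bits + dictionary_bits) / original_bits

print(compression_factor(800, 600, 16))   # roughly log2(16)/24 for a large image
```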
### Optional Extra
Try running the code from the reading while varying the number of clusters. Do you think that a 16-color compression looks much better than an 8-color compression? Do you think the difference is good enough to justify approximately twice the storage? What about 32 colors vs. 16?
## Introducing Tensors {#sec-intro-tensors}
**First**, install [PyTorch](https://pytorch.org/) 2.0 into your `ml-0451` Anaconda environment.
::: {.column-margin}
The best way to install PyTorch is probably to run the following at the command line:
```
conda activate ml-0451
pip3 install torch torchvision torchaudio
```
:::
**Then**, in a blank Jupyter notebook, copy, paste, and run each of the code blocks in the [first section](https://pytorch.org/tutorials/beginner/basics/tensor_tutorial.html) of the PyTorch tutorial.
[You may need to do a little bit of exploring around the tutorials in order to come up with answers to these questions.]{.aside}
**Finally**, write down a single-sentence answer to each of the following questions:
1. In what ways is a PyTorch tensor *similar* to a Numpy array?
2. In what ways is a PyTorch tensor *different* from a Numpy array?
3. What is the primary motivation for the use of a specialized tensor data type, rather than an array, for deep learning?
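If it helps to start experimenting, here is a minimal sketch (assuming the installation above succeeded) that moves data between the two types:

```python
import numpy as np
import torch

a = np.ones((2, 3))
t = torch.from_numpy(a)          # a tensor built from a numpy array
print(type(a), type(t))

b = t.numpy()                    # and back again
print(np.allclose(a, b))

# tensors can be moved to a GPU if one is available; numpy arrays cannot
if torch.cuda.is_available():
    t = t.to("cuda")
print(t.device)
```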
## Efficient Differentiation {#sec-backprop}
[This exercise is based on [a section of Chinmay Hegde's notes](https://chinmayhegde.github.io/dl-notes/notes/lecture02/#sgd-and-neural-networks) on stochastic gradient descent and neural networks.]{.aside}
Consider the following function:
$$
L(w, b) = \frac{1}{2} \left(y - \sigma(wx + b)\right)^2 + \frac{1}{2}\lambda w^2\;,
$$
where $\sigma(a) = \frac{1}{1 + e^{-a}}$.
This is the loss function obtained when using a single feature $x$ to predict $y$, using the function $\sigma(wx + b)$ as the predictor and measuring the quality of this predictor with the square-error loss function; the "ridge" regularization term $\frac{1}{2}\lambda w^2$ encourages the weight $w$ to be small. Our aim is to compute the gradient of $L$ with respect to $w$ and $b$.
[I've used the property of the sigmoid that $\sigma'(a) = \sigma(a)(1-\sigma(a))$.]{.aside}
The gradient of $L$ is $\nabla L (w, b) = \left(\frac{\partial L}{\partial w}, \frac{\partial L}{\partial b}\right)$, where
$$
\begin{aligned}
\frac{\partial L}{\partial w} &= (\sigma(wx + b) - y)\sigma(wx+b)(1 - \sigma(wx+b))x + \lambda w \\
\frac{\partial L}{\partial b} &= (\sigma(wx + b) - y)\sigma(wx+b)(1 - \sigma(wx+b))\;.
\end{aligned}
$${#eq-explicit-gradient}
### What You Should Do
Assume that each of the following operations cost one computational unit:
- Multiplying or dividing two scalar numbers.
- Adding or subtracting two scalar numbers.
- Computing an exponential like $e^a$.
Using this assumption:
1. Determine the number of computational units (i.e. computational cost) of computing the gradient of $L$ exactly as written in @eq-explicit-gradient, under the assumption that you are not allowed to store the values of any intermediate computations.
2. Now determine the computational cost of computing the gradient of $L$ under the assumption that you are allowed to store intermediate computations. Please describe both the number of computations and the number of floating point numbers that must be stored.
3. Finally, determine the computational cost in terms of both steps and storage to compute $L$ using the backpropagation algorithm (described for a very similar function in [Hegde's notes](https://chinmayhegde.github.io/dl-notes/notes/lecture02/#sgd-and-neural-networks)).
Compare your results from each method.
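To make the idea of storing intermediate computations concrete, here is a hedged sketch (not the full backpropagation bookkeeping from Hegde's notes) that computes @eq-explicit-gradient while reusing the shared quantities:

```python
import numpy as np

def grad_L(w, b, x, y, lam):
    # shared intermediate quantities, each computed and stored once
    a = w*x + b                   # the inner linear term
    s = 1/(1 + np.exp(-a))        # sigma(wx + b)
    common = (s - y)*s*(1 - s)    # factor appearing in both partial derivatives
    dL_dw = common*x + lam*w
    dL_db = common
    return np.array([dL_dw, dL_db])
```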
## Convolutional Kernels {#sec-convolutional-kernel}
Implement kernel convolution. Your implementation should accept a 2d array `X` (think of `X` as representing a greyscale image) and a square convolutional kernel `K`. Your implementation should operate using pure `numpy`. You can use any zero-padding strategy, but you do need to explain what your strategy is when presenting.
It's ok to use a for-loop to loop over pixels.
Here's an example image you can use:
```{python}
from PIL import Image
import urllib.request
import numpy as np
from matplotlib import pyplot as plt
def read_image(url):
    return np.array(Image.open(urllib.request.urlopen(url)))

url = "https://i.pinimg.com/originals/0e/d0/23/0ed023847cad0d652d6371c3e53d1482.png"
img = read_image(url)

def to_greyscale(im):
    return 1 - np.dot(im[...,:3], [0.2989, 0.5870, 0.1140])
img = to_greyscale(img)
plt.imshow(img, cmap = "Greys")
plt.gca().axis("off")
```
After implementing your function, you should be able to use it like the code below, replacing the call to `scipy.signal.convolve2d` with your own implementation. The result should look something like this:
```{python}
from scipy.signal import convolve2d
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])
convd = convolve2d(img, kernel)
plt.imshow(convd, cmap = "Greys", vmin = 0, vmax = 8)
plt.gca().axis("off")
```
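If you're not sure where to start, here is a minimal sketch of one possible approach, using zero-padding so that the output has the same shape as `X`; your own padding strategy and conventions may reasonably differ.

```python
import numpy as np

def convolve(X, K):
    # zero-pad X so that the output has the same shape as X
    m = K.shape[0] // 2
    padded = np.zeros((X.shape[0] + 2*m, X.shape[1] + 2*m))
    padded[m:m + X.shape[0], m:m + X.shape[1]] = X

    out = np.zeros_like(X, dtype = float)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            # strictly speaking this is cross-correlation; it agrees with
            # convolution whenever K is symmetric, like the kernel above
            out[i, j] = np.sum(padded[i:i + K.shape[0], j:j + K.shape[1]] * K)
    return out
```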
## Project Check-In {#sec-project-check-in}
*This is a warmup activity that we will repeat weekly until the end of the semester.*
Prepare a 2-3 minute "presentation" of your project to your group. Your presentation can be informal and does not need to have any special visual aids. The primary expectation is that you are able to demonstrate *some relevant functionality* to your peers. If your project involves coding or data analysis, your relevant functionality might be as simple as accessing or preparing the data. You should plan to demonstrate additional functionality each week.
In other words, you should show your group *something that works*, regardless of how "big" it is.
If you are doing a project that does not involve implementation (such as a research essay), then you are still expected to offer an update. Your contributions could include describing the sources you've found or showing your group an outline of the argument that you will make.
It's appropriate for each member of your project group to give the same presentation during warmup. **Please note that you may not be in the same warmup group as your project partners**. This means that:
- The code you show needs to run on *your* laptop or in *your* compute instance (e.g. Google Colab).
- *You* need to be ready to explain what is being shown, even if your project partners did much of the work.
## What Needs To Be Learned? {#sec-transfer-learning}
Suppose that you wanted to teach an individual to recognize English-language [phishing emails](https://consumer.ftc.gov/articles/how-recognize-and-avoid-phishing-scams). Write down a few features (based on the linked website, your own experience, or other sources) that you think would help someone classify an email as "phishing attempt or not" based on the text of the email.
Now, imagine that you are going to sit down with your tutee to teach them how to recognize English-language phishing emails. Where would you start your instruction if...
1. Your tutee was another member of your warmup group.
2. Your tutee was a fluent English speaker but had never used email.
3. Your tutee was a regular email user but spoke no English.
4. Your tutee spoke no English and had never seen a computer.
Which of these four scenarios would require the most "learning effort?" Which would require the least?
## Word Embedding {#sec-word-embedding}
Take out a sheet of paper and a pencil. Your goal is to place the following words on the sheet of paper in such a way that their location on the sheet is indicative of their relationships to each other. You can decide exactly how to do this. Should words with similar meanings be in the same part of the page? Should pairs of words with similar *relationships* have similar distances? Your approach is up to you, but please **write it down** along with your placements. Your words are:
- *Woman*
- *Student*
- *Nurse*
- *Doctor*
- *Man*
- *Professor*
- *Model*
- *Computer*
- *Machine*
- *Programmer*
## Realistic Text? {#sec-realistic-text}
In our reading on [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), there are a few examples of model output that is *realistic* but not *real*. Pick one of the examples, and write down as carefully as you are able what makes the generated text *realistic*. Then, describe what "tells" would tip off an attentive observer that the text isn't real (generated intentionally by a human) after all.
The Shakespeare and Wikipedia examples might be the easiest ones to think about, but feel free to look at the $\LaTeX$ or Linux source code examples if you prefer.
## Mind Map {#sec-mind-map}
Wow, we've covered a lot of ground in this class! Use a graphics program or a pen/paper to make a mind map describing some of the main concepts we've covered. As a reminder, a mind-map is a graph in which the nodes are concepts and edges join related concepts. Please incorporate the following concepts as nodes:
- *Loss function*
- *Target*
- *Predictor*
- *Model*
- *Regression*
- *Classification*
- *Empirical risk minimization*
- *Gradient descent*
- *Feature map*
- *Vectorization*
- *Overfitting*
- *Training data*
- *Validation/testing data*
- *Perceptron*
- *Logistic regression*
- *Neural networks*
- *Linear regression*
Additionally, please incorporate **at least three other concepts of your own choosing**.
### Flowcharts in Quarto
Since mind-maps can be a little complicated to organize, you might find it easiest to work with some software. One **optional** possibility is actually included with Quarto: the [Mermaid](https://mermaid.js.org/) chart tool can render attractive diagrams that include labeled nodes and directed edges. Using a [flowchart](https://mermaid.js.org/syntax/flowchart.html) is probably the way to go. For example, inserting the code below into a special `{mermaid}` code block will produce the diagram that follows:
```
flowchart TB
A(First concept) --> B(Second concept)
B --is part<br> of--> A
B-->C(Third concept)
A-->C
A & B & C -->D(Fourth concept)
```
```{mermaid}
flowchart TB
A(First concept) --> B(Second concept)
B --is part<br> of--> A
B-->C(Third concept)
A-->C
A & B & C -->D(Fourth concept)
```
For the basics of using Mermaid with Quarto, see the [Quarto docs](https://quarto.org/docs/authoring/diagrams.html). A benefit of this approach is that you don't have to worry too much about positioning, and you can publish your mind map easily on your blog! On the other hand, this approach doesn't give you much flexibility and makes it harder to be creative about incorporating complex relationships. **Doing your mind map by hand or in any other software is entirely fine**.
## Classification Rates {#sec-classification-rates}
#### Part 1
COVID-19 rapid tests have approximately an 80% sensitivity rate, which means that, in an individual who truly has COVID-19, the probability of a rapid test giving a positive result is roughly 80%. [These numbers are mostly made-up.]{.aside} On the other hand, the probability of a rapid test giving a positive result for an individual who truly does **not** have COVID-19 is 5%. Suppose that approximately 4% of the population are currently infected with COVID-19. [Example 2.3.1 of [Murphy](https://github.com/probml/pml-book/releases/latest/download/book1.pdf), page 46, has a good review of the relevant probability and the definition of each of the rates below.]{.aside}
Write a Python function called `rate_summary` that prints the following output, filling in the correct values for each of the specified rates:
```python
s = 0.8           # test sensitivity
f = 0.05          # probability of a positive test if no COVID
prevalence = 0.04 # fraction of population infected
rate_summary(s, f, prevalence)
```
```
The true positive rate is ___.
The false positive rate is ___.
The true negative rate is ___.
The false negative rate is ___.
```
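Here is a hedged sketch of one possible `rate_summary`. It relies on the fact that the sensitivity and the false-alarm probability above are exactly the true positive and false positive rates; the prevalence argument isn't needed for these four rates, but is kept to match the call above.

```python
def rate_summary(s, f, prevalence):
    # s = P(positive test | infected) is the true positive rate;
    # f = P(positive test | not infected) is the false positive rate;
    # prevalence is not needed for these four rates
    print(f"The true positive rate is {s}.")
    print(f"The false positive rate is {f}.")
    print(f"The true negative rate is {1 - f}.")
    print(f"The false negative rate is {1 - s}.")
```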
#### Part 2
1. Suppose that scientists found an alternative rapid test which had a 75% sensitivity rate with a 0% chance of a positive test on someone who is truly not infected. Would you suggest replacing the old rapid tests with these alternative tests? Why? [You don't necessarily need to use your function from the previous part in this part.]{.aside}
2. What if the alternative test had an 85% sensitivity rate and a 10% chance of a positive test on someone who is truly not infected?
#### Part 3
It's all well and good to do the math, but what about when we actually have data? Write a function called `rate_summary_2` that accepts two columns of a `pandas.DataFrame` (or equivalently two one-dimensional `numpy.arrays` of equal length). Call these `y` and `y_pred`. Assume that both `y` and `y_pred` are binary arrays (i.e. arrays of 0s and 1s). `y` represents the true outcome, whereas `y_pred` represents the prediction from an algorithm or test. Here's an example of the kind of data we are thinking about:
```{python}
import pandas as pd
url = "https://github.com/middlebury-csci-0451/CSCI-0451/raw/main/data/toy-classification-data.csv"
df = pd.read_csv(url)
df.head() # just for visualizing the first few rows
```
You should be able to use your function like this:
```python
# y is the true label, y_pred is the prediction
rate_summary_2(df["y"], df["y_pred"])
```
```
The true positive rate is ___.
The false positive rate is ___.
The true negative rate is ___.
The false negative rate is ___.
```
##### Hints
An excellent solution for this part will not use any for-loops. Computing each of the four rates can be performed in a single compact line of code. To begin thinking of how you might do this, you may want to experiment with code like the following:
```python
df["y"] == df["y_pred"]
df["y"].sum(), df["y_pred"].sum()
```
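For comparison once you've written your own version, here is a hedged sketch of the kind of vectorized computation the hint is pointing toward:

```python
import numpy as np

def rate_summary_2(y, y_pred):
    y, y_pred = np.array(y), np.array(y_pred)
    # each rate is the mean of a boolean comparison over the relevant subset of rows
    print(f"The true positive rate is {np.mean(y_pred[y == 1] == 1)}.")
    print(f"The false positive rate is {np.mean(y_pred[y == 0] == 1)}.")
    print(f"The true negative rate is {np.mean(y_pred[y == 0] == 0)}.")
    print(f"The false negative rate is {np.mean(y_pred[y == 1] == 0)}.")
```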