<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>2. Regression — Principles of Machine Learning: A Deployment-First Perspective</title>
<script data-cfasync="false">
document.documentElement.dataset.mode = localStorage.getItem("mode") || "";
document.documentElement.dataset.theme = localStorage.getItem("theme") || "light";
</script>
<!-- Loaded before other Sphinx assets -->
<link href="_static/styles/theme.css?digest=e353d410970836974a52" rel="stylesheet" />
<link href="_static/styles/bootstrap.css?digest=e353d410970836974a52" rel="stylesheet" />
<link href="_static/styles/pydata-sphinx-theme.css?digest=e353d410970836974a52" rel="stylesheet" />
<link href="_static/vendor/fontawesome/6.1.2/css/all.min.css?digest=e353d410970836974a52" rel="stylesheet" />
<link rel="preload" as="font" type="font/woff2" crossorigin href="_static/vendor/fontawesome/6.1.2/webfonts/fa-solid-900.woff2" />
<link rel="preload" as="font" type="font/woff2" crossorigin href="_static/vendor/fontawesome/6.1.2/webfonts/fa-brands-400.woff2" />
<link rel="preload" as="font" type="font/woff2" crossorigin href="_static/vendor/fontawesome/6.1.2/webfonts/fa-regular-400.woff2" />
<link rel="stylesheet" type="text/css" href="_static/pygments.css" />
<link rel="stylesheet" href="_static/styles/sphinx-book-theme.css?digest=14f4ca6b54d191a8c7657f6c759bf11a5fb86285" type="text/css" />
<link rel="stylesheet" type="text/css" href="_static/togglebutton.css" />
<link rel="stylesheet" type="text/css" href="_static/copybutton.css" />
<link rel="stylesheet" type="text/css" href="_static/mystnb.4510f1fc1dee50b3e5859aac5469c37c29e427902b24a333a5f9fcb2f0b3ac41.css" />
<link rel="stylesheet" type="text/css" href="_static/sphinx-thebe.css" />
<link rel="stylesheet" type="text/css" href="_static/pml_admonitions.css" />
<link rel="stylesheet" type="text/css" href="_static/custom.css" />
<link rel="stylesheet" type="text/css" href="_static/design-style.4045f2051d55cab465a707391d5b2007.min.css" />
<!-- Pre-loaded scripts that we'll load fully later -->
<link rel="preload" as="script" href="_static/scripts/bootstrap.js?digest=e353d410970836974a52" />
<link rel="preload" as="script" href="_static/scripts/pydata-sphinx-theme.js?digest=e353d410970836974a52" />
<script data-url_root="./" id="documentation_options" src="_static/documentation_options.js"></script>
<script src="_static/jquery.js"></script>
<script src="_static/underscore.js"></script>
<script src="_static/_sphinx_javascript_frameworks_compat.js"></script>
<script src="_static/doctools.js"></script>
<script src="_static/clipboard.min.js"></script>
<script src="_static/copybutton.js"></script>
<script src="_static/scripts/sphinx-book-theme.js?digest=5a5c038af52cf7bc1a1ec88eea08e6366ee68824"></script>
<script>let toggleHintShow = 'Click to show';</script>
<script>let toggleHintHide = 'Click to hide';</script>
<script>let toggleOpenOnPrint = 'true';</script>
<script src="_static/togglebutton.js"></script>
<script>var togglebuttonSelector = '.toggle, .admonition.dropdown';</script>
<script src="_static/design-tabs.js"></script>
<script async="async" src="https://www.googletagmanager.com/gtag/js?id=G-0HQMPESCSN"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){ dataLayer.push(arguments); }
gtag('js', new Date());
gtag('config', 'G-0HQMPESCSN');
</script>
<script>const THEBE_JS_URL = "https://unpkg.com/[email protected]/lib/index.js"
const thebe_selector = ".thebe,.cell"
const thebe_selector_input = "pre"
const thebe_selector_output = ".output, .cell_output"
</script>
<script async="async" src="_static/sphinx-thebe.js"></script>
<script>window.MathJax = {"options": {"processHtmlClass": "tex2jax_process|mathjax_process|math|output_area"}}</script>
<script defer="defer" src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<script>DOCUMENTATION_OPTIONS.pagename = 'Ch_regression';</script>
<link rel="shortcut icon" href="_static/pml_ico.ico"/>
<link rel="index" title="Index" href="genindex.html" />
<link rel="search" title="Search" href="search.html" />
<link rel="next" title="3. Methodology I: Three basic tasks" href="Ch_methodology1.html" />
<link rel="prev" title="1. Introduction" href="Ch_introduction.html" />
<meta name="viewport" content="width=device-width, initial-scale=1"/>
<meta name="docsearch:language" content="en"/>
</head>
<body data-bs-spy="scroll" data-bs-target=".bd-toc-nav" data-offset="180" data-bs-root-margin="0px 0px -60%" data-default-mode="">
<a class="skip-link" href="#main-content">Skip to main content</a>
<input type="checkbox"
class="sidebar-toggle"
name="__primary"
id="__primary"/>
<label class="overlay overlay-primary" for="__primary"></label>
<input type="checkbox"
class="sidebar-toggle"
name="__secondary"
id="__secondary"/>
<label class="overlay overlay-secondary" for="__secondary"></label>
<div class="search-button__wrapper">
<div class="search-button__overlay"></div>
<div class="search-button__search-container">
<form class="bd-search d-flex align-items-center"
action="search.html"
method="get">
<i class="fa-solid fa-magnifying-glass"></i>
<input type="search"
class="form-control"
name="q"
id="search-input"
placeholder="Search this book..."
aria-label="Search this book..."
autocomplete="off"
autocorrect="off"
autocapitalize="off"
spellcheck="false"/>
<span class="search-button__kbd-shortcut"><kbd class="kbd-shortcut__modifier">Ctrl</kbd>+<kbd>K</kbd></span>
</form></div>
</div>
<nav class="bd-header navbar navbar-expand-lg bd-navbar">
</nav>
<div class="bd-container">
<div class="bd-container__inner bd-page-width">
<div class="bd-sidebar-primary bd-sidebar">
<div class="sidebar-header-items sidebar-primary__section">
</div>
<div class="sidebar-primary-items__start sidebar-primary__section">
<div class="sidebar-primary-item">
<a class="navbar-brand logo" href="welcome.html">
<img src="_static/pml_logo.png" class="logo__image only-light" alt="Logo image"/>
<script>document.write(`<img src="_static/pml_logo.png" class="logo__image only-dark" alt="Logo image"/>`);</script>
</a></div>
<div class="sidebar-primary-item"><nav class="bd-links" id="bd-docs-nav" aria-label="Main">
<div class="bd-toc-item navbar-nav active">
<ul class="nav bd-sidenav bd-sidenav__home-link">
<li class="toctree-l1">
<a class="reference internal" href="welcome.html">
Welcome to our Principles of Machine Learning
</a>
</li>
</ul>
<ul class="current nav bd-sidenav">
<li class="toctree-l1"><a class="reference internal" href="Ch_introduction.html">1. Introduction</a></li>
<li class="toctree-l1 current active"><a class="current reference internal" href="#">2. Regression</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_methodology1.html">3. Methodology I: Three basic tasks</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_classification1.html">4. Classification I: The geometric view</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_discovery.html">5. Structure analysis</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_density.html">6. Density estimation</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_classification2.html">7. Classification II: The probabilistic view</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_methodology2.html">8. Methodology II: Pipelines</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_feature.html">9. Feature Engineering</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_ensemble.html">10. Ensemble methods</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_neuralnets.html">11. Neural networks</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_optimisation.html">12. Optimisation methods</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_methodology3.html">13. Methodology III: Workflows</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_ethics.html">14. The machine learning professional</a></li>
<li class="toctree-l1"><a class="reference internal" href="Ch_appendix.html">15. Appendix</a></li>
</ul>
<hr style="height:2px;border:none;color:#000000;background-color:#000000;width:50%;text-align:center;margin:10px auto auto auto;">
</div>
</nav>
</div></div>
<a><b>Readers:</b></a>
<div style="height:80%;width:80%;">
<script type="text/javascript" id="clstr_globe" src="//clustrmaps.com/globe.js?d=06DuCmf206QlXB0PwXp_5bEXHN0MJWuVeBiYDLQ4Ovc"></script>
</div>
<hr>
<div class="sidebar-primary-items__end sidebar-primary__section">
</div>
<div id="rtd-footer-container"></div>
</div>
<main id="main-content" class="bd-main">
<div class="sbt-scroll-pixel-helper"></div>
<div class="bd-content">
<div class="bd-article-container">
<div class="bd-header-article">
<div class="header-article-items header-article__inner">
<div class="header-article-items__start">
<div class="header-article-item"><label class="sidebar-toggle primary-toggle btn btn-sm" for="__primary" title="Toggle primary sidebar" data-bs-placement="bottom" data-bs-toggle="tooltip">
<span class="fa-solid fa-bars"></span>
</label></div>
</div>
<div class="header-article-items__end">
<div class="header-article-item">
<div class="article-header-buttons">
<div class="dropdown dropdown-source-buttons">
<button class="btn dropdown-toggle" type="button" data-bs-toggle="dropdown" aria-expanded="false" aria-label="Source repositories">
<i class="fab fa-github"></i>
</button>
<ul class="dropdown-menu">
<li><a href="https://github.com/PMLBook/PMLBook.github.io" target="_blank"
class="btn btn-sm btn-source-repository-button dropdown-item"
title="Source repository"
data-bs-placement="left" data-bs-toggle="tooltip"
>
<span class="btn__icon-container">
<i class="fab fa-github"></i>
</span>
<span class="btn__text-container">Repository</span>
</a>
</li>
<li><a href="https://github.com/PMLBook/PMLBook.github.io/issues/new?title=Issue%20on%20page%20%2FCh_regression.html&body=Your%20issue%20content%20here." target="_blank"
class="btn btn-sm btn-source-issues-button dropdown-item"
title="Open an issue"
data-bs-placement="left" data-bs-toggle="tooltip"
>
<span class="btn__icon-container">
<i class="fas fa-lightbulb"></i>
</span>
<span class="btn__text-container">Open issue</span>
</a>
</li>
</ul>
</div>
<div class="dropdown dropdown-download-buttons">
<button class="btn dropdown-toggle" type="button" data-bs-toggle="dropdown" aria-expanded="false" aria-label="Download this page">
<i class="fas fa-download"></i>
</button>
<ul class="dropdown-menu">
<li><a href="_sources/Ch_regression.md" target="_blank"
class="btn btn-sm btn-download-source-button dropdown-item"
title="Download source file"
data-bs-placement="left" data-bs-toggle="tooltip"
>
<span class="btn__icon-container">
<i class="fas fa-file"></i>
</span>
<span class="btn__text-container">.md</span>
</a>
</li>
<li>
<button onclick="window.print()"
class="btn btn-sm btn-download-pdf-button dropdown-item"
title="Print to PDF"
data-bs-placement="left" data-bs-toggle="tooltip"
>
<span class="btn__icon-container">
<i class="fas fa-file-pdf"></i>
</span>
<span class="btn__text-container">.pdf</span>
</button>
</li>
</ul>
</div>
<button onclick="toggleFullScreen()"
class="btn btn-sm btn-fullscreen-button"
title="Fullscreen mode"
data-bs-placement="bottom" data-bs-toggle="tooltip"
>
<span class="btn__icon-container">
<i class="fas fa-expand"></i>
</span>
</button>
<script>
document.write(`
<button class="theme-switch-button btn btn-sm btn-outline-primary navbar-btn rounded-circle" title="light/dark" aria-label="light/dark" data-bs-placement="bottom" data-bs-toggle="tooltip">
<span class="theme-switch" data-mode="light"><i class="fa-solid fa-sun"></i></span>
<span class="theme-switch" data-mode="dark"><i class="fa-solid fa-moon"></i></span>
<span class="theme-switch" data-mode="auto"><i class="fa-solid fa-circle-half-stroke"></i></span>
</button>
`);
</script>
<script>
document.write(`
<button class="btn btn-sm navbar-btn search-button search-button__button" title="Search" aria-label="Search" data-bs-placement="bottom" data-bs-toggle="tooltip">
<i class="fa-solid fa-magnifying-glass"></i>
</button>
`);
</script>
<label class="sidebar-toggle secondary-toggle btn btn-sm" for="__secondary"title="Toggle secondary sidebar" data-bs-placement="bottom" data-bs-toggle="tooltip">
<span class="fa-solid fa-list"></span>
</label>
</div></div>
</div>
</div>
</div>
<div id="jb-print-docs-body" class="onlyprint">
<h1>Regression</h1>
<!-- Table of contents -->
<div id="print-main-content">
<div id="jb-print-toc">
<div>
<h2> Contents </h2>
</div>
<nav aria-label="Page">
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#how-far-is-the-equator-from-the-north-pole">2.1. How far is the Equator from the North Pole?</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#formulating-regression-problems">2.2. Formulating regression problems</a><ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#mathematical-notation">2.2.1. Mathematical notation</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#quality-metrics">2.2.2. Quality metrics</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#regression-as-an-optimisation-problem-take-1">2.2.3. Regression as an optimisation problem (Take 1)</a></li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#basic-regression-models">2.3. Basic regression models</a><ul class="nav section-nav flex-column">
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#simple-linear-regression">2.3.1. Simple linear regression</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#simple-polynomial-regression">2.3.2. Simple polynomial regression</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#multiple-linear-regression">2.3.3. Multiple linear regression</a></li>
<li class="toc-h3 nav-item toc-entry"><a class="reference internal nav-link" href="#the-least-squares-solution">2.3.4. The least squares solution</a></li>
</ul>
</li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#flexibility-interpretability-and-generalisation">2.4. Flexibility, interpretability and generalisation</a></li>
<li class="toc-h2 nav-item toc-entry"><a class="reference internal nav-link" href="#summary-and-discussion">2.5. Summary and discussion</a></li>
</ul>
</nav>
</div>
</div>
</div>
<div id="searchbox"></div>
<article class="bd-article" role="main">
<div class="tex2jax_ignore mathjax_ignore section" id="regression">
<span id="reg"></span><h1><span class="section-number">2. </span>Regression<a class="headerlink" href="#regression" title="Permalink to this heading">#</a></h1>
<p>Regression is the first family of machine learning problems that we will study. As you might remember, we have already considered one regression problem, namely that of guessing the heart rate of an animal from its body mass. Our approach to solving it, however, was mostly intuitive and only allowed us to produce rough guesses by visually inspecting the available dataset. In this chapter we will introduce, formulate and discuss regression problems and techniques rigorously.</p>
<p>The structure of this chapter is as follows. We will start off by offering our second top tip, using Revolutionary France as a backdrop. Then, we will formulate regression problems using mathematical notation. Our mathematical notation will allow us to explore some basic regression models and discuss how to use datasets to build solutions. Finally, we will explore the notions of model flexibility and complexity, and will connect them to the important machine learning concepts of interpretability and generalisation. This will allow us to define the fundamental notions of underfitting and overfitting.</p>
<div class="section" id="how-far-is-the-equator-from-the-north-pole">
<span id="reg1"></span><h2><span class="section-number">2.1. </span>How far is the Equator from the North Pole?<a class="headerlink" href="#how-far-is-the-equator-from-the-north-pole" title="Permalink to this heading">#</a></h2>
<p>In the last decade of the 18th century, a commission appointed by the French Académie des sciences decided that the distance from the Equator to the North Pole should be exactly 10,000 km. Yes, you read that correctly: they <em>decided</em> it.</p>
<p>How can you decide how long an existing distance should be? Surely it must simply be given. In reality, what the French commission did was define a much-needed new unit of length, the <strong>metre</strong>, and they did so taking the distance from the Equator to the North Pole as a reference. This is why in some old textbooks you might read, rather intriguingly, that <em>one metre is one ten-millionth of the meridian quadrant</em>. A meridian quadrant is precisely any segment that starts at the Equator and finishes at the North Pole (see <a class="reference internal" href="#meridianquadrant"><span class="std std-numref">Fig. 2.1</span></a>). Defining physical units requires stable references, and back in the 18th century the Earth was seen as the most suitable physical object with which to define a standard unit of length. The definition of the metre has changed over time, though, and since 1983 we have defined the metre using the speed of light in a vacuum as our stable reference.</p>
<div class="figure align-default" id="meridianquadrant">
<a class="reference internal image-reference" href="_images/meridian_quadrant.png"><img alt="_images/meridian_quadrant.png" src="_images/meridian_quadrant.png" style="width: 485.6px; height: 392.40000000000003px;" /></a>
<p class="caption"><span class="caption-number">Fig. 2.1 </span><span class="caption-text">The Paris meridian quadrant runs from the North Pole, through Paris, to the Equator and was used by the French Académie des sciences in 1791 to define a new unit of length: the metre.</span><a class="headerlink" href="#meridianquadrant" title="Permalink to this image">#</a></p>
</div>
<p>Let us get back to the 1790s. Stating that one metre is one ten-millionth of a meridian quadrant was the easy part of the job entrusted to the French commission. It can be done, and most likely was done, from the comfort of an armchair. The challenge was to actually measure a meridian quadrant accurately. Think about it for a moment: how would you measure a meridian quadrant, let alone do so amid the social and political instability of 1790s Revolutionary France? One of the main concerns of the team appointed to measure the meridian quadrant was their limited ability to produce precise measurements, or in other words, to reduce the measurement <em>errors</em>. To overcome this obstacle, one obvious avenue was to improve the existing instruments. Better instruments, more precise measurements. It was in this atmosphere that a second, less obvious idea to reduce the impact of measurement errors took shape. The great French mathematician Adrien-Marie Legendre used the following words to describe the essence of this idea: <em>By using this method, a sort of equilibrium is established between the errors which prevents the extremes from prevailing […] [getting us closer to the] truth.</em> Legendre called this method least squares (<em>moindres carrés</em> in French) and published it in 1805. So what is Legendre trying to tell us?</p>
<p>What Legendre is suggesting is to deal with errors <em>mathematically</em>. Up until 1805, scientists dealt with errors <em>physically</em>, by improving their instrumentation. Legendre is telling us that by carrying out the right mathematical operations on our measurements, we can reduce the impact of errors on our final solution. From this point onwards, mathematics provided a second avenue to deal with errors and get closer to the <em>truth</em>. Note that to deal with errors mathematically, we need to explicitly account for them. Specifically, we need to represent them in our mathematical formulation. Only by including them in our mathematical formulation will we be able to devise methods that can deal with them mathematically.</p>
<div class="tip admonition">
<p class="admonition-title">So here is our second top tip:</p>
<h3 style="text-align: center;"><b>Embrace the error!</b></h3>
</div>
<p>We might not like them, but errors do exist and we should not pretend they are not there. In other words, errors are first-class citizens in our formulation. This idea constitutes a core principle in machine learning.</p>
<p>Least squares quickly became a cornerstone in science, and from its early applications to geodesy (the science of measuring Earth’s geometry) and astronomy (e.g. to determine the orbit of a celestial object), it spread inexorably throughout every branch of science. Least squares is such an important method that science historians still debate today whether it should be credited to Legendre or to another great mathematician, Carl Friedrich Gauss. Incidentally, if you are looking for the origins of machine learning, you will find them exactly here. Least squares is actually a regression method and we will cover it later in this chapter. However, even though we could regard least squares as the first machine learning method, what we want to highlight is not least squares itself, but the revolutionary idea that allowed Legendre to conceive this method. Remember this: if we do not embrace the error, there is no machine learning.</p>
</div>
<div class="section" id="formulating-regression-problems">
<span id="reg2"></span><h2><span class="section-number">2.2. </span>Formulating regression problems<a class="headerlink" href="#formulating-regression-problems" title="Permalink to this heading">#</a></h2>
<p>Regression problems belong to the category of supervised learning problems, where we seek to predict a label using a set of predictors (<a class="reference internal" href="#regressiondiagram"><span class="std std-numref">Fig. 2.2</span></a>). What distinguishes regression from the other family of supervised learning problems, i.e. classification, is that in regression the label takes on continuous values. Examples of problems where we are interested in predicting a continuous label include predicting the energy consumption of a household, the future value of a company stock, tomorrow’s temperature or the probability of developing a specific health condition. As in any other machine learning scenario, regression problems belong to machine learning because their solutions are built using a dataset.</p>
<div class="figure align-center" id="regressiondiagram">
<a class="reference internal image-reference" href="_images/regression_diagram_nq.svg"><img alt="_images/regression_diagram_nq.svg" src="_images/regression_diagram_nq.svg" width="70%" /></a>
<p class="caption"><span class="caption-number">Fig. 2.2 </span><span class="caption-text">In supervised learning we seek to find a model that predicts a label using a set of predictors. This model is the solution to a supervised learning problem.</span><a class="headerlink" href="#regressiondiagram" title="Permalink to this image">#</a></p>
</div>
<p>To illustrate regression, let us consider the problem of predicting the salary of an individual who lives in Paris, whose age we know. If the salary of a Parisian were prescribed by their age according to some written law, our job would be finished. We would obtain the salary from the age simply by applying this law. Unfortunately, no such written law exists, and hence the question is: is there any relationship between the salary and the age of Parisians? If so, how can we discover this relationship? If such a relationship exists, our goal is to build a mathematical model that represents it.</p>
<p>Using a dataset recording the age and the salary of a collection of individuals from Paris, we can try to discover how Parisian salaries are related to age. <a class="reference internal" href="#agevssalary"><span class="std std-numref">Table 2.1</span></a> shows a made-up dataset created for this purpose. Note that this same dataset could have been used to build a model that predicts the age of an individual using their salary as the predictor. It is up to us to decide which attribute is the predictor and which is the label when we formulate a regression problem.</p>
<table class="table" id="agevssalary">
<caption><span class="caption-number">Table 2.1 </span><span class="caption-text">A toy dataset registering the age and salary of a group of individuals</span><a class="headerlink" href="#agevssalary" title="Permalink to this table">#</a></caption>
<colgroup>
<col style="width: 33%" />
<col style="width: 33%" />
<col style="width: 33%" />
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>ID</p></th>
<th class="head"><p>Age</p></th>
<th class="head"><p>Salary</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><span class="math notranslate nohighlight">\(S_1\)</span></p></td>
<td><p>37</p></td>
<td><p>68,000</p></td>
</tr>
<tr class="row-odd"><td><p><span class="math notranslate nohighlight">\(S_2\)</span></p></td>
<td><p>18</p></td>
<td><p>12,000</p></td>
</tr>
<tr class="row-even"><td><p><span class="math notranslate nohighlight">\(S_3\)</span></p></td>
<td><p>66</p></td>
<td><p>80,000</p></td>
</tr>
<tr class="row-odd"><td><p><span class="math notranslate nohighlight">\(S_4\)</span></p></td>
<td><p>25</p></td>
<td><p>45,000</p></td>
</tr>
<tr class="row-even"><td><p><span class="math notranslate nohighlight">\(S_5\)</span></p></td>
<td><p>26</p></td>
<td><p>30,000</p></td>
</tr>
</tbody>
</table>
<div class="section" id="mathematical-notation">
<h3><span class="section-number">2.2.1. </span>Mathematical notation<a class="headerlink" href="#mathematical-notation" title="Permalink to this heading">#</a></h3>
<p>In machine learning, our first step is always to formulate our problem <em>mathematically</em>. This involves using mathematical notation to represent all the concepts in our problem and their relationships. Let us start with the basic mathematical notation needed to describe our population and dataset:</p>
<ul class="simple">
<li><p><strong>Predictor</strong>: <span class="math notranslate nohighlight">\(x\)</span>.</p></li>
<li><p><strong>Label</strong>: <span class="math notranslate nohighlight">\(y\)</span>.</p></li>
<li><p><strong>Number of samples</strong> in our dataset: <span class="math notranslate nohighlight">\(N\)</span>.</p></li>
<li><p>Dataset <strong>sample identifier</strong>: <span class="math notranslate nohighlight">\(i\)</span>.</p></li>
</ul>
<p>Using this notation, the value of the predictor of the <span class="math notranslate nohighlight">\(i\)</span>-th sample in a dataset can be denoted by <span class="math notranslate nohighlight">\(x_i\)</span> and its label by <span class="math notranslate nohighlight">\(y_i\)</span>. For instance, to report on the predictor and label of the third sample in the dataset shown in <a class="reference internal" href="#agevssalary"><span class="std std-numref">Table 2.1</span></a>, we would write <span class="math notranslate nohighlight">\(x_3=66\)</span> and <span class="math notranslate nohighlight">\(y_3=80,000\)</span>, respectively. Remember that when we formulated our problem, we decided that age was the predictor (<span class="math notranslate nohighlight">\(x\)</span>) and salary the label (<span class="math notranslate nohighlight">\(y\)</span>).</p>
<p>Furthermore, we can denote our entire dataset by <span class="math notranslate nohighlight">\(\{(x_i,y_i): 1\leq i \leq N \}\)</span>. Curly brackets ‘<span class="math notranslate nohighlight">\(\{\)</span>’ and ‘<span class="math notranslate nohighlight">\(\}\)</span>’ are used to represent the notion of <em>collection</em>. With this in mind, the mathematical expression <span class="math notranslate nohighlight">\(\{(x_i,y_i): 1\leq i \leq N \}\)</span> should be read as <em>a collection of pairs of values <span class="math notranslate nohighlight">\((x_i,y_i)\)</span>, where <span class="math notranslate nohighlight">\(i\)</span> runs from 1 to <span class="math notranslate nohighlight">\(N\)</span></em>. For instance, our dataset in <a class="reference internal" href="#agevssalary"><span class="std std-numref">Table 2.1</span></a> can be expressed as a collection of <span class="math notranslate nohighlight">\(N=5\)</span> pairs:</p>
<p><span class="math notranslate nohighlight">\(\{(x_i,y_i): 1\leq i \leq 5 \} = \{(x_1,y_1), (x_2,y_2), (x_3,y_3), (x_4,y_4), (x_5,y_5)\}\)</span>,</p>
<p>specifically</p>
<p><span class="math notranslate nohighlight">\(\{(x_i,y_i): 1\leq i \leq 5 \} = \{(37, 68000), (18, 12000), (66, 80000), (25, 45000), (26, 30000)\}\)</span>.</p>
<p>Now that we have agreed on how to represent basic population and dataset concepts, let us create the notation needed to describe our model:</p>
<ul class="simple">
<li><p><strong>Model</strong>: <span class="math notranslate nohighlight">\(f\)</span>.</p></li>
<li><p><strong>Prediction</strong>: <span class="math notranslate nohighlight">\(\hat{y}\)</span>.</p></li>
</ul>
<p>Using this notation, we express the idea of making a prediction as</p>
<div class="math notranslate nohighlight" id="equation-eqmodelnotation">
<span class="eqno">(2.1)<a class="headerlink" href="#equation-eqmodelnotation" title="Permalink to this equation">#</a></span>\[
\hat{y} = f(x)
\]</div>
<p>which should be read as <em>the model <span class="math notranslate nohighlight">\(f\)</span> takes the predictor <span class="math notranslate nohighlight">\(x\)</span> as an input and produces the prediction <span class="math notranslate nohighlight">\(\hat{y}\)</span> as an output</em> (see <a class="reference internal" href="#functioniobox"><span class="std std-numref">Fig. 2.3</span></a>).</p>
<div class="figure align-default" id="functioniobox">
<img alt="_images/function_input_output_box.svg" src="_images/function_input_output_box.svg" /><p class="caption"><span class="caption-number">Fig. 2.3 </span><span class="caption-text">A supervised learning model <span class="math notranslate nohighlight">\(f\)</span> takes a predictor <span class="math notranslate nohighlight">\(x\)</span> as an input and produces a prediction <span class="math notranslate nohighlight">\(\hat{y}\)</span> as an output. This can be represented as a block diagram and expressed mathematically as <span class="math notranslate nohighlight">\(\hat{y} = f(x)\)</span>.</span><a class="headerlink" href="#functioniobox" title="Permalink to this image">#</a></p>
</div>
<p>Note that our notation explicitly distinguishes between the actual value that we want to predict, <span class="math notranslate nohighlight">\(y\)</span>, and the prediction provided by our model, <span class="math notranslate nohighlight">\(\hat{y}\)</span>. This distinction is crucial to define our last concept:</p>
<ul class="simple">
<li><p><strong>Prediction error</strong>: <span class="math notranslate nohighlight">\(e\)</span>.</p></li>
</ul>
<p>The prediction error can be defined as <span class="math notranslate nohighlight">\(e= y-\hat{y}\)</span>, i.e. as the difference between the actual value that we want to predict and the value that our model predicts.</p>
<p>To consolidate our understanding of the mathematical notation that we have developed, let us apply it to the following example. Assume that we have collected the dataset shown in <a class="reference internal" href="#agevssalary"><span class="std std-numref">Table 2.1</span></a> and are using the model <span class="math notranslate nohighlight">\(f(x) = 1,000x\)</span> to predict the salary <span class="math notranslate nohighlight">\(y\)</span> of an individual given their age <span class="math notranslate nohighlight">\(x\)</span>. This model simply predicts the salary of an individual to be 1,000 times their age. <a class="reference internal" href="#regresionnotation"><span class="std std-numref">Fig. 2.4</span></a> provides a visual illustration of the mathematical notation that we have created. First, it represents in the attribute space the five samples, <span class="math notranslate nohighlight">\(S_1\)</span> to <span class="math notranslate nohighlight">\(S_5\)</span>, of the dataset defined in <a class="reference internal" href="#agevssalary"><span class="std std-numref">Table 2.1</span></a>. Note that the coordinates of each sample <span class="math notranslate nohighlight">\(S_i\)</span> correspond to the values of its attributes <span class="math notranslate nohighlight">\(x_i\)</span> and <span class="math notranslate nohighlight">\(y_i\)</span>. Second, the model <span class="math notranslate nohighlight">\(f(x) = 1,000 x\)</span> is plotted as a solid line. The coordinates of each point in the line representing the model correspond to each age value <span class="math notranslate nohighlight">\(x\)</span> and its predicted salary <span class="math notranslate nohighlight">\(\hat{y}=f(x)\)</span>. Finally, the prediction error <span class="math notranslate nohighlight">\(e_i\)</span> is represented as a vertical line from each individual sample to the line representing the model, which corresponds to the difference <span class="math notranslate nohighlight">\(e_i=y_i-\hat{y}_i\)</span>.</p>
<div class="figure align-default" id="regresionnotation">
<img alt="_images/regression_notation.svg" src="_images/regression_notation.svg" /><p class="caption"><span class="caption-number">Fig. 2.4 </span><span class="caption-text">Visualisation of the dataset defined in <a class="reference internal" href="#agevssalary"><span class="std std-numref">Table 2.1</span></a> toghether with the model <span class="math notranslate nohighlight">\(f(x) = 1,000 x\)</span> (solid line). The vertical dashed lines from each sample to the model represent the individual prediction errors.</span><a class="headerlink" href="#regresionnotation" title="Permalink to this image">#</a></p>
</div>
<p><a class="reference internal" href="#agevssalary2"><span class="std std-numref">Table 2.2</span></a> captures the dataset shown in <a class="reference internal" href="#agevssalary"><span class="std std-numref">Table 2.1</span></a> together with the predicted labels and the prediction errors of the model <span class="math notranslate nohighlight">\(f(x) = 1,000 x\)</span>. The predictor value of the first sample in <a class="reference internal" href="#agevssalary"><span class="std std-numref">Table 2.1</span></a> is <span class="math notranslate nohighlight">\(x_1 = 37\)</span> and its actual label <span class="math notranslate nohighlight">\(y_1=68,000\)</span>. Using the model <span class="math notranslate nohighlight">\(f(x) = 1,000 x\)</span>, the predicted label is <span class="math notranslate nohighlight">\(\hat{y}_1= f(x_1) = 1,000 \times 37 = 37,000\)</span> and the prediction error is <span class="math notranslate nohighlight">\(e_1 = 68,000-37,000=31,000\)</span>. You should be able to carry out this process with the remaining samples. In doing this, make sure you use our mathematical notation consistently.</p>
<table class="table" id="agevssalary2">
<caption><span class="caption-number">Table 2.2 </span><span class="caption-text">Predictor <span class="math notranslate nohighlight">\(x\)</span>, actual label <span class="math notranslate nohighlight">\(y\)</span>, prediction <span class="math notranslate nohighlight">\(\hat{y}\)</span> and error <span class="math notranslate nohighlight">\(e\)</span> of our 5 individuals.</span><a class="headerlink" href="#agevssalary2" title="Permalink to this table">#</a></caption>
<colgroup>
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
<col style="width: 20%" />
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>ID</p></th>
<th class="head"><p><span class="math notranslate nohighlight">\(x\)</span></p></th>
<th class="head"><p><span class="math notranslate nohighlight">\(y\)</span></p></th>
<th class="head"><p><span class="math notranslate nohighlight">\(\hat{y}\)</span></p></th>
<th class="head"><p><span class="math notranslate nohighlight">\(e\)</span></p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><span class="math notranslate nohighlight">\(1\)</span></p></td>
<td><p>37</p></td>
<td><p>68,000</p></td>
<td><p>37,000</p></td>
<td><p>31,000</p></td>
</tr>
<tr class="row-odd"><td><p><span class="math notranslate nohighlight">\(2\)</span></p></td>
<td><p>18</p></td>
<td><p>12,000</p></td>
<td><p>18,000</p></td>
<td><p>-6,000</p></td>
</tr>
<tr class="row-even"><td><p><span class="math notranslate nohighlight">\(3\)</span></p></td>
<td><p>66</p></td>
<td><p>80,000</p></td>
<td><p>66,000</p></td>
<td><p>14,000</p></td>
</tr>
<tr class="row-odd"><td><p><span class="math notranslate nohighlight">\(4\)</span></p></td>
<td><p>25</p></td>
<td><p>45,000</p></td>
<td><p>25,000</p></td>
<td><p>20,000</p></td>
</tr>
<tr class="row-even"><td><p><span class="math notranslate nohighlight">\(5\)</span></p></td>
<td><p>26</p></td>
<td><p>30,000</p></td>
<td><p>26,000</p></td>
<td><p>4,000</p></td>
</tr>
</tbody>
</table>
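<p>As a sketch, again using Python purely for illustration, the predictions and errors in <a class="reference internal" href="#agevssalary2"><span class="std std-numref">Table 2.2</span></a> can be reproduced by applying the model to each sample:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># Reproducing Table 2.2: predictions and errors of the model f(x) = 1000x.
dataset = [(37, 68000), (18, 12000), (66, 80000), (25, 45000), (26, 30000)]

def f(x):
    return 1000 * x            # predicted salary: 1,000 times the age

for i, (x_i, y_i) in enumerate(dataset, start=1):
    y_hat_i = f(x_i)           # prediction  y_hat_i = f(x_i)
    e_i = y_i - y_hat_i        # error       e_i = y_i - y_hat_i
    print(i, x_i, y_i, y_hat_i, e_i)
</pre></div>
</div>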
</div>
<div class="section" id="quality-metrics">
<h3><span class="section-number">2.2.2. </span>Quality metrics<a class="headerlink" href="#quality-metrics" title="Permalink to this heading">#</a></h3>
<p>Regression models can be represented using mathematical expressions that tell us how to calculate the predicted label from a predictor. For instance, the simple model <span class="math notranslate nohighlight">\(f(x) = 1,000 x\)</span> predicts the salary of an individual as 1,000 times their age. <a class="reference internal" href="#salaryvsage3models"><span class="std std-numref">Fig. 2.5</span></a> shows, in the attribute space, a dataset consisting of the salary and age of a collection of individuals. Superimposed on the dataset are three curves that represent three candidate models that predict the salary of an individual given their age. Specifically, Model 1 represents a <em>linear</em> model such as <span class="math notranslate nohighlight">\(f(x) = 1,000 x\)</span>.</p>
<div class="figure align-center" id="salaryvsage3models">
<a class="reference internal image-reference" href="_images/salaryVage3sols_label.svg"><img alt="_images/salaryVage3sols_label.svg" src="_images/salaryVage3sols_label.svg" width="70%" /></a>
<p class="caption"><span class="caption-number">Fig. 2.5 </span><span class="caption-text">Toy dataset consisting of the salary and age of 200 individuals in the attribute space, together with three candidate models that predict the salary of individuals from their age.</span><a class="headerlink" href="#salaryvsage3models" title="Permalink to this image">#</a></p>
</div>
<div class="question1 admonition">
<p class="admonition-title">Question for you</p>
<p>Given the dataset and candidate models shown in <a class="reference internal" href="#salaryvsage3models"><span class="std std-numref">Fig. 2.5</span></a>, which model would you say is the <em>best</em>, <em>Model 1</em>, <em>Model 2</em> or <em>Model 3</em>?</p>
<p>Submit your response here: <a href="https://forms.office.com/e/XagZJFmuLx" target="_blank">Your Response </a></p>
</div>
<p>Did you identify the <em>best</em> model? Which one did you choose? It turns out that each of the models can potentially be the best model. The reason is that to talk about the best model, we need to agree on what we mean by <em>best</em> first. In other words, we need to agree on a notion of <strong>model quality</strong>. If we are looking for the simplest model, Model 1 would be the best, as it represents a very simple, linear relationship between salary and age. If we want our model to make predictions that reduce the prediction error overall, Model 3 would be the best. Finally, if we want our model not to make predictions that are always greater than the actual label, Model 2 would be the best. In summary, asking for the <em>best</em> model does not make sense until we decide what we mean by <em>best</em>. Or mathematically speaking, until we decide what our chosen <strong>quality metric</strong> is.</p>
<p>The <strong>quadratic</strong> or <strong>squared error</strong> <span class="math notranslate nohighlight">\(e^2\)</span> is a common quantity used in regression to encapsulate the notion of <strong>single prediction quality</strong>. Given a sample <span class="math notranslate nohighlight">\(i\)</span>, the closer <span class="math notranslate nohighlight">\(e_i^2\)</span> is to zero, the closer the predicted label <span class="math notranslate nohighlight">\(\hat{y}_i\)</span> is to the actual label <span class="math notranslate nohighlight">\(y_i\)</span>. Using the squared error as our notion of single prediction quality, good models are those that lead to small squared errors across a collection of samples. What quantities can we define that give us an idea of how good a model is on a collection of samples, rather than just on one individual sample?</p>
<p>One such quantity is the <strong>sum of squared errors</strong> (SSE) (also known as the residual sum of squares), which is defined as the sum of all the squared errors produced by our model on the dataset:</p>
<div class="math notranslate nohighlight" id="equation-eqsse1">
<span class="eqno">(2.2)<a class="headerlink" href="#equation-eqsse1" title="Permalink to this equation">#</a></span>\[
SSE = e_1^2 + e_2^2+\dots+e_N^2
\]</div>
<p>or using the summation symbol <span class="math notranslate nohighlight">\(\Sigma\)</span> (<em>sigma</em>)</p>
<div class="math notranslate nohighlight" id="equation-eqsse2">
<span class="eqno">(2.3)<a class="headerlink" href="#equation-eqsse2" title="Permalink to this equation">#</a></span>\[\begin{split}
SSE &= \sum_{i=1}^{N} e_i^2 \\
&= \sum_{i=1}^{N} (y_i-\hat{y}_i)^2 \\
&= \sum_{i=1}^{N} (y_i-f(x_i))^2
\end{split}\]</div>
<p>The SSE is a metric that can be used to quantify the overall quality of a model on a given dataset. The lower the SSE, the closer the model predictions are to the actual labels on the dataset. The performance of two models can then be compared by comparing their respective SSE values.</p>
<p>We can define a second, related quality metric that describes how good a model is at predicting a label on average. This quantity is known as the <strong>mean squared error</strong> (MSE) and can be obtained on a dataset by simply averaging the squared errors:</p>
<div class="math notranslate nohighlight" id="equation-eqmse1">
<span class="eqno">(2.4)<a class="headerlink" href="#equation-eqmse1" title="Permalink to this equation">#</a></span>\[
MSE = \frac{1}{N}(e_1^2 + e_2^2+\dots+e_N^2) = \frac{1}{N}\sum_{i=1}^{N} e_i^2
\]</div>
<p>As an example, the SSE of model <span class="math notranslate nohighlight">\(f(x) = 1,000 x\)</span> on the dataset shown in <a class="reference internal" href="#agevssalary2"><span class="std std-numref">Table 2.2</span></a> is:</p>
<div class="math notranslate nohighlight">
\[
SSE = 31,000^2+(-6,000)^2+14,000^2+20,000^2+4,000^2 = 1,609,000,000
\]</div>
<p>and its MSE is</p>
<div class="math notranslate nohighlight">
\[
MSE = \frac{1,609,000,000}{5} = 321,800,000
\]</div>
<p>Note that SSE and MSE are very similar quantities. Specifically, MSE can be calculated as SSE divided by the number of samples <span class="math notranslate nohighlight">\(N\)</span>. Their interpretations are, however, slightly different. We will come back to this idea in the next chapter. For now, let us simply use both as equivalent quality metrics.</p>
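<p>As a quick check, the following sketch (Python, for illustration) computes both metrics for the model <span class="math notranslate nohighlight">\(f(x) = 1,000 x\)</span> on our toy dataset:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># SSE and MSE of f(x) = 1000x on the toy dataset of Table 2.1.
dataset = [(37, 68000), (18, 12000), (66, 80000), (25, 45000), (26, 30000)]
errors = [y_i - 1000 * x_i for x_i, y_i in dataset]

SSE = sum(e**2 for e in errors)   # 1,609,000,000
MSE = SSE / len(dataset)          # 321,800,000
print(SSE, MSE)
</pre></div>
</div>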
<div class="question1 admonition">
<p class="admonition-title">Question for you</p>
<p>Given a dataset, is it possible to find a model such that <span class="math notranslate nohighlight">\(\hat{y}_i = y_i\)</span> for every
sample <span class="math notranslate nohighlight">\(i\)</span> in the dataset, i.e. a model whose error is exactly zero (SSE=0 and MSE = 0)?</p>
<p>Submit your response here: <a href="https://forms.office.com/e/vemZER0DWJ" target = "_blank">Your Response</a></p>
</div>
<p>A model such that SSE=0 on a dataset can be visualised in the attribute space as a curve that goes through every single sample. Therefore, the question as to whether there exists such a model for any dataset can be rephrased as: can we draw a curve that goes through every single sample in the dataset? At first the answer seems to be yes: after all, we can draw as wiggly a curve as we want so that it goes through each one of the samples. There is, however, one restriction. Models produce one prediction per predictor value; visually, they cannot go through two samples that share the same predictor value but have different labels. Thus, if our dataset has two samples with the same predictor and different labels, no model will be able to predict both labels, and therefore the error will never be zero. Think of the problem of predicting the salary of a Parisian. If our dataset of Parisians has two individuals of the same age but different salaries, then no matter how hard we try, our model will predict one and only one salary and therefore will produce the wrong prediction for at least one of these two individuals. In summary, given a dataset, it is never guaranteed that we will be able to find a zero-error model.</p>
</div>
<div class="section" id="regression-as-an-optimisation-problem-take-1">
<h3><span class="section-number">2.2.3. </span>Regression as an optimisation problem (Take 1)<a class="headerlink" href="#regression-as-an-optimisation-problem-take-1" title="Permalink to this heading">#</a></h3>
<p>You might be wondering whether we have forgotten to remove the text <em>(Take 1)</em> from the heading. We have not. In this section we present our first formulation of regression problems. In the next chapter, we will refine this formulation. To present the refined version, we need to consolidate some basic understanding first.</p>
<p>In a regression problem, we have three main components:</p>
<ul class="simple">
<li><p>A dataset, <span class="math notranslate nohighlight">\(\{(x_i,y_i): 1\leq i \leq N \}\)</span>.</p></li>
<li><p>A collection of candidate models, <span class="math notranslate nohighlight">\(f\)</span>.</p></li>
<li><p>A quality metric.</p></li>
</ul>
<p>Our <em>Take 1</em> definition of regression is as follows. We define regression as the process of identifying the best model from a set of candidate models, where the best model is the one that exhibits the highest quality <em>on the available dataset</em>. If we use the SSE as our quality metric, the best model is the one that has the lowest SSE value on the dataset. A mathematician would write:</p>
<div class="math notranslate nohighlight" id="equation-eqfbest-1">
<span class="eqno">(2.5)<a class="headerlink" href="#equation-eqfbest-1" title="Permalink to this equation">#</a></span>\[
f_{best} = \underset{f}{\operatorname{argmin}} \sum_{i=1}^{N} (y_i-f(x_i))^2
\]</div>
<p>which might look scary but simply reads <em>the best model, <span class="math notranslate nohighlight">\(f_{best}\)</span>, among all the candidate models, <span class="math notranslate nohighlight">\(f\)</span>, is the one that has the lowest (argmin) SSE on our dataset, where the SSE is calculated as <span class="math notranslate nohighlight">\(\sum_{i=1}^{N} (y_i-f(x_i))^2\)</span></em>. In machine learning, we say that we are <strong>training</strong> a model or <strong>fitting</strong> a model to a dataset when we use a dataset to identify the best model among a family of candidate models. Accordingly, we call the dataset that we are fitting the model to the <strong>training dataset</strong>.</p>
<p>This process, in which we aim to identify the model that produces the lowest error, is what we call in mathematics an <strong>optimisation</strong> problem. Incidentally, using the SSE as our quality metric we have just formulated the classical <strong>least squares</strong> problem that Legendre and others, including Carl Friedrich Gauss, came up with more than two centuries ago. Note that the best model according to the SSE metric is the same as the best model according to the MSE metric, as we have defined the latter as the former divided by <span class="math notranslate nohighlight">\(N\)</span>.</p>
<div class="question1 admonition">
<p class="admonition-title">Question for you</p>
<p>Consider the following three models:</p>
<p><span class="math notranslate nohighlight">\(f_1(x) = 1,000x\)</span></p>
<p><span class="math notranslate nohighlight">\(f_2(x) = 999x\)</span></p>
<p><span class="math notranslate nohighlight">\(f_3(x) = 1,000 + 1,000x\)</span></p>
<p>Using the SSE as your quality metric and the dataset in <a class="reference internal" href="#agevssalary"><span class="std std-numref">Table 2.1</span></a>, identify the best model among the three candidate models <span class="math notranslate nohighlight">\(f_1\)</span>, <span class="math notranslate nohighlight">\(f_2\)</span> and <span class="math notranslate nohighlight">\(f_3\)</span>.</p>
<p>Submit your response here: <a href="https://forms.office.com/e/etKdZmRC37" target = "_blank">Your Response</a></p>
</div>
<p>The idea of regression looks very simple: we have a collection of models, each with an associated quality obtained using a dataset. Our task is to identify the one with the highest quality. If we have a few candidate models this is easy. We use the training dataset to compute their quality (for instance, SSE), rank them according to this quality and choose the one at the top. This is what you must have done to solve the previous question. However, what if we have an infinite number of candidate models? We cannot possibly compute each individual quality! Note that this is not an extreme situation. On the contrary, it is the most common case. Think about models <span class="math notranslate nohighlight">\(f_1(x) = 1,000x\)</span> and <span class="math notranslate nohighlight">\(f_2(x) = 999x\)</span>. They look almost the same, yet using the coefficient <span class="math notranslate nohighlight">\(1,000\)</span> or <span class="math notranslate nohighlight">\(999\)</span> makes them different. We have in fact an infinite choice of values for this coefficient and therefore, we could consider an infinite number of candidate models. Optimisation theory will provide us with useful approaches to operate in such scenarios.</p>
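<p>When the set of candidates is finite, the argmin in Equation (2.5) can be computed directly. A minimal sketch (Python, for illustration), using the three models of the previous question as the candidate set:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre># "Take 1" regression over a finite set of candidate models: compute each
# candidate's SSE on the training dataset and keep the one with the lowest.
dataset = [(37, 68000), (18, 12000), (66, 80000), (25, 45000), (26, 30000)]

def sse(f, data):
    return sum((y_i - f(x_i))**2 for x_i, y_i in data)

candidates = [lambda x: 1000 * x,           # f_1
              lambda x: 999 * x,            # f_2
              lambda x: 1000 + 1000 * x]    # f_3
f_best = min(candidates, key=lambda f: sse(f, dataset))
</pre></div>
</div>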
</div>
</div>
<div class="section" id="basic-regression-models">
<span id="reg3"></span><h2><span class="section-number">2.3. </span>Basic regression models<a class="headerlink" href="#basic-regression-models" title="Permalink to this heading">#</a></h2>
<p>To build a machine learning solution for a given regression problem, we need to identify a family of candidate models. In this section we introduce the <strong>linear</strong> and <strong>polynomial</strong> families of regression models. We will distinguish between <strong>simple regression</strong>, in which there is only one predictor, and <strong>multiple regression</strong>, which considers two or more predictors. One example of a simple regression problem is that of predicting the salary of an individual knowing their age. Predicting a salary knowing the age and the height of an individual is one example of multiple regression. At the end of this section we present the <strong>least squares</strong> solution, which can be used to identify exactly the best model within a family of linear or polynomial models.</p>
<div class="section" id="simple-linear-regression">
<h3><span class="section-number">2.3.1. </span>Simple linear regression<a class="headerlink" href="#simple-linear-regression" title="Permalink to this heading">#</a></h3>
<p>The family of <strong>linear models</strong> for <strong>simple regression problems</strong> prescribes a linear relationship between the predictor and the label:</p>
<div class="math notranslate nohighlight" id="equation-eqslms1">
<span class="eqno">(2.6)<a class="headerlink" href="#equation-eqslms1" title="Permalink to this equation">#</a></span>\[
f(x) = w_0 + w_1 x
\]</div>
<p>Simple linear models have two <strong>parameters</strong>, namely the intercept (<span class="math notranslate nohighlight">\(w_0\)</span>) and the gradient or slope (<span class="math notranslate nohighlight">\(w_1\)</span>). Changing the value of either parameter leads to different models. Hence, if we use the family of linear models to build our solution, finding the best model is equivalent to identifying the values for the intercept and the gradient that yield the highest quality. This is why we sometimes refer to model training as <strong>parameter tuning</strong>, since training involves changing or tuning the values of the parameters of the model. Note that we have an infinite number of choices for both parameters, i.e. any real number between minus infinity and infinity. Hence, the number of models belonging to this family is infinite too.</p>
<p><a class="reference internal" href="#salaryvsagelinear"><span class="std std-numref">Fig. 2.6</span></a> shows the result of fitting a linear model to our salary vs age toy dataset, using the SSE as our quality metric. The solution is the linear model that produces the lowest SSE on our dataset. In other words, it is the <em>least squares linear solution</em>. Note that we have not yet explained how to find this model, e.g. how we have fitted the model to the dataset. For now, you can assume that you have an optimisation genie that does this for you.</p>
<div class="figure align-default" id="salaryvsagelinear">
<img alt="_images/salaryvsageSolMSE.svg" src="_images/salaryvsageSolMSE.svg" /><p class="caption"><span class="caption-number">Fig. 2.6 </span><span class="caption-text">Linear solution for the salary vs age toy dataset, using the SSE on the training dataset as our quality metric.</span><a class="headerlink" href="#salaryvsagelinear" title="Permalink to this image">#</a></p>
</div>
</div>
<div class="section" id="simple-polynomial-regression">
<h3><span class="section-number">2.3.2. </span>Simple polynomial regression<a class="headerlink" href="#simple-polynomial-regression" title="Permalink to this heading">#</a></h3>
<p>Visually, you might have concluded that the linear solution that we have obtained does not capture well the relationship between salary and age. It is indeed the best linear model, but it does not seem to be good enough. Unfortunately, no matter how we change the intercept and gradient of our linear models, we will not be able to produce a curve that adequately represents the relationship that we want to discover. We need a family of models that is less rigid than the linear family and allows us to produce more complex curves.</p>
<p>One such family is the family of <strong>polynomial models</strong>. In polynomial regression, we use models that follow the mathematical expression:</p>
<div class="math notranslate nohighlight" id="equation-eqspms1">
<span class="eqno">(2.7)<a class="headerlink" href="#equation-eqspms1" title="Permalink to this equation">#</a></span>\[
f(x) = w_0 + w_1 x + w_2 x^2+ \dots + w_D x^D
\]</div>
<p>where <span class="math notranslate nohighlight">\(D\)</span> is known as the degree of the polynomial. Linear models are of course a subfamily of the polynomial family, where <span class="math notranslate nohighlight">\(D = 1\)</span>. Depending on our chosen value for <span class="math notranslate nohighlight">\(D\)</span>, we can define different families. When <span class="math notranslate nohighlight">\(D=2\)</span>, we have the quadratic family:</p>
<div class="math notranslate nohighlight" id="equation-eqspms-d2">
<span class="eqno">(2.8)<a class="headerlink" href="#equation-eqspms-d2" title="Permalink to this equation">#</a></span>\[
f(x) = w_0 + w_1 x + w_2 x^2
\]</div>
<p>when <span class="math notranslate nohighlight">\(D=3\)</span>, the cubic family:</p>
<div class="math notranslate nohighlight" id="equation-eqspms-d3">
<span class="eqno">(2.9)<a class="headerlink" href="#equation-eqspms-d3" title="Permalink to this equation">#</a></span>\[
f(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3
\]</div>
<p>and so on. <a class="reference internal" href="#salaryvsagequadratic"><span class="std std-numref">Fig. 2.7</span></a>, <a class="reference internal" href="#salaryvsagecubic"><span class="std std-numref">Fig. 2.8</span></a> and <a class="reference internal" href="#salaryvsage5"><span class="std std-numref">Fig. 2.9</span></a> show the quadratic, cubic and degree 5 least squares solutions. As you can see, increasing the degree of the polynomial <span class="math notranslate nohighlight">\(D\)</span> gives us more flexibility to produce models that are fitted better to the dataset.</p>
<div class="figure align-default" id="salaryvsagequadratic">
<img alt="_images/salaryvsageSolsMSEQuadratic.svg" src="_images/salaryvsageSolsMSEQuadratic.svg" /><p class="caption"><span class="caption-number">Fig. 2.7 </span><span class="caption-text">Quadratic solution for the salary vs age toy dataset, using the SSE on the training dataset as our quality metric.</span><a class="headerlink" href="#salaryvsagequadratic" title="Permalink to this image">#</a></p>
</div>
<div class="figure align-default" id="salaryvsagecubic">
<img alt="_images/salaryvsageSolsMSECubic.svg" src="_images/salaryvsageSolsMSECubic.svg" /><p class="caption"><span class="caption-number">Fig. 2.8 </span><span class="caption-text">Cubic solution for the salary vs age toy dataset, using the SSE on the training dataset as our quality metric.</span><a class="headerlink" href="#salaryvsagecubic" title="Permalink to this image">#</a></p>
</div>
<div class="figure align-default" id="salaryvsage5">
<img alt="_images/salaryvsageSolsMSE5.svg" src="_images/salaryvsageSolsMSE5.svg" /><p class="caption"><span class="caption-number">Fig. 2.9 </span><span class="caption-text">Solution for the salary vs age toy dataset for <span class="math notranslate nohighlight">\(D=5\)</span>, using the SSE on the training dataset as our quality metric.</span><a class="headerlink" href="#salaryvsage5" title="Permalink to this image">#</a></p>
</div>
<p>Once again, we have not discussed yet how to obtain these solutions. At this stage, what is important is to understand how to express polynomial models mathematically and reflect on the solutions that they can produce once fitted to a training dataset.</p>
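<p>If you want to experiment with these families yourself, a least squares polynomial of any chosen degree can be computed with NumPy’s <code class="docutils literal notranslate"><span class="pre">polyfit</span></code>. The sketch below, which assumes hypothetical age and salary arrays rather than the actual toy dataset, fits polynomials of degree 1, 2, 3 and 5 and reports the training SSE of each.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import numpy as np

# Hypothetical salary vs age toy data
age = np.array([18, 25, 26, 37, 45, 52, 60, 66], dtype=float)
salary = np.array([12e3, 45e3, 30e3, 68e3, 75e3, 81e3, 79e3, 80e3])

for degree in [1, 2, 3, 5]:
    # np.polyfit returns the least squares coefficients, highest power first
    coeffs = np.polyfit(age, salary, degree)
    predictions = np.polyval(coeffs, age)
    sse = np.sum((salary - predictions) ** 2)
    print(f"degree {degree}: training SSE = {sse:.0f}")
</pre></div>
</div>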
</div>
<div class="section" id="multiple-linear-regression">
<h3><span class="section-number">2.3.3. </span>Multiple linear regression<a class="headerlink" href="#multiple-linear-regression" title="Permalink to this heading">#</a></h3>
<p>So far we have considered <strong>simple regression</strong> problems, which are problems where there is only one predictor. <strong>Multiple regression</strong> involves two or more predictors. For instance, in the multiple regression problem where we want to predict the salary of an individual from their age and height, age and height are the two predictors. A toy dataset that we could use to build machine learning solutions for this multiple regression problem is shown in the attribute space in <a class="reference internal" href="#salaryvsagevsheight"><span class="std std-numref">Fig. 2.10</span></a>.</p>
<div class="figure align-default" id="salaryvsagevsheight">
<img alt="_images/SalaryVsAgeVsHeight.svg" src="_images/SalaryVsAgeVsHeight.svg" /><p class="caption"><span class="caption-number">Fig. 2.10 </span><span class="caption-text">Toy dataset consisting of the salary, age and height of 200 individuals in the attribute space.</span><a class="headerlink" href="#salaryvsagevsheight" title="Permalink to this image">#</a></p>
</div>
<p>A linear model for this multiple regression problem could be expressed mathematically as follows:</p>
<div class="math notranslate nohighlight" id="equation-eqsalary">
<span class="eqno">(2.10)<a class="headerlink" href="#equation-eqsalary" title="Permalink to this equation">#</a></span>\[
SALARY = w_0 + w_{a} \times AGE + w_{h} \times HEIGHT
\]</div>
<p>where the coefficients <span class="math notranslate nohighlight">\(w_0\)</span>, <span class="math notranslate nohighlight">\(w_{a}\)</span> and <span class="math notranslate nohighlight">\(w_{h}\)</span> are the model’s parameters. If we fit this linear model to the dataset in <a class="reference internal" href="#salaryvsagevsheight"><span class="std std-numref">Fig. 2.10</span></a>, using again our optimisation genie, we will obtain as our solution the plane shown in <a class="reference internal" href="#salaryvsagevsheightsurface"><span class="std std-numref">Fig. 2.11</span></a>. This plane dictates how to predict the salary of an individual, based on their age and height.</p>
<div class="figure align-default" id="salaryvsagevsheightsurface">
<img alt="_images/SalaryVsAgeVsHeightSurface.svg" src="_images/SalaryVsAgeVsHeightSurface.svg" /><p class="caption"><span class="caption-number">Fig. 2.11 </span><span class="caption-text">The plane surface represents the linear solution for the toy salary vs age and height dataset, using the SSE on the training dataset as our quality metric.</span><a class="headerlink" href="#salaryvsagevsheightsurface" title="Permalink to this image">#</a></p>
</div>
<p>Linear models in multiple regression, such as <a class="reference internal" href="#equation-eqsalary">(2.10)</a>, are defined as the <em>sum of a constant (the intercept) plus each predictor multiplied by a coefficient</em>. The constant and coefficients are precisely the parameters of the linear model that we need to tune. To formulate multiple regression mathematically and obtain general solutions, we need to develop a few last pieces of mathematical notation, starting with a symbol for</p>
<ul class="simple">
<li><p><strong>Number of predictors</strong>: <span class="math notranslate nohighlight">\(K\)</span>.</p></li>
</ul>
<p>As an example, if we want to predict the salary of an individual from their age and height, <span class="math notranslate nohighlight">\(K=2\)</span>. A small toy dataset for this problem is shown in <a class="reference internal" href="#ageheightvssalary"><span class="std std-numref">Table 2.3</span></a>. As you can see, each individual in the dataset is described by three attributes, two of which are used as the predictors (age and height) and the third one as the label (salary).</p>
<table class="table" id="ageheightvssalary">
<caption><span class="caption-number">Table 2.3 </span><span class="caption-text">A toy dataset registering the age and salary of a small group of individuals</span><a class="headerlink" href="#ageheightvssalary" title="Permalink to this table">#</a></caption>
<colgroup>
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
<col style="width: 25%" />
</colgroup>
<thead>
<tr class="row-odd"><th class="head"><p>ID</p></th>
<th class="head"><p>Age</p></th>
<th class="head"><p>Height [cm]</p></th>
<th class="head"><p>Salary</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><span class="math notranslate nohighlight">\(S_1\)</span></p></td>
<td><p>18</p></td>
<td><p>175</p></td>
<td><p>12,000</p></td>
</tr>
<tr class="row-odd"><td><p><span class="math notranslate nohighlight">\(S_2\)</span></p></td>
<td><p>37</p></td>
<td><p>180</p></td>
<td><p>68,000</p></td>
</tr>
<tr class="row-even"><td><p><span class="math notranslate nohighlight">\(S_3\)</span></p></td>
<td><p>66</p></td>
<td><p>158</p></td>
<td><p>80,000</p></td>
</tr>
<tr class="row-odd"><td><p><span class="math notranslate nohighlight">\(S_4\)</span></p></td>
<td><p>25</p></td>
<td><p>168</p></td>
<td><p>45,000</p></td>
</tr>
<tr class="row-even"><td><p><span class="math notranslate nohighlight">\(S_5\)</span></p></td>
<td><p>26</p></td>
<td><p>190</p></td>
<td><p>30,000</p></td>
</tr>
</tbody>
</table>
<p>We can extend our mathematical notation to identify each of the predictors of each sample. We will denote the <span class="math notranslate nohighlight">\(k\)</span>-th predictor of sample <span class="math notranslate nohighlight">\(i\)</span> by <span class="math notranslate nohighlight">\(x_{i,k}\)</span>. Accordingly, <span class="math notranslate nohighlight">\(x_{i,1}\)</span> is the first predictor of sample <span class="math notranslate nohighlight">\(i\)</span>, <span class="math notranslate nohighlight">\(x_{i,2}\)</span> is the second predictor and so on. For instance, if age is our first predictor and height our second predictor, using the dataset shown in <a class="reference internal" href="#ageheightvssalary"><span class="std std-numref">Table 2.3</span></a> we would write <span class="math notranslate nohighlight">\(x_{1,1}=18\)</span>, <span class="math notranslate nohighlight">\(x_{1,2}=175\)</span>, <span class="math notranslate nohighlight">\(x_{2,1}=37\)</span> and so on.</p>
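<p>In code, this double-subscript notation maps naturally onto a 2D array, bearing in mind that Python indexing starts at 0 rather than 1. A small sketch using the predictor values from <a class="reference internal" href="#ageheightvssalary"><span class="std std-numref">Table 2.3</span></a>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import numpy as np

# Predictors from Table 2.3: each row is a sample, columns are age and height
predictors = np.array([[18, 175],
                       [37, 180],
                       [66, 158],
                       [25, 168],
                       [26, 190]])

# Python indexing starts at 0, so x_{i,k} corresponds to predictors[i-1, k-1]
print(predictors[0, 0])  # 18, i.e. x_{1,1}
print(predictors[0, 1])  # 175, i.e. x_{1,2}
print(predictors[1, 0])  # 37, i.e. x_{2,1}
</pre></div>
</div>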
<p>The next piece of notation allows us to pack all the predictors together in a column vector which we represent using <strong>bold font</strong>:</p>
<div class="math notranslate nohighlight" id="equation-eqxi1">
<span class="eqno">(2.11)<a class="headerlink" href="#equation-eqxi1" title="Permalink to this equation">#</a></span>\[\begin{split}
\boldsymbol{x}_i= \begin{bmatrix}
1\\
x_{i,1}\\
x_{i,2}\\
\vdots \\
x_{i,K}
\end{bmatrix}
\end{split}\]</div>
<p>As you can see, vector <span class="math notranslate nohighlight">\(\boldsymbol{x}_i\)</span> contains all the predictors of sample <span class="math notranslate nohighlight">\(i\)</span>. There is an additional entry, the number 1, whose role will be clear very soon.</p>
<p>At this stage, we can abstract away what each predictor means and how many predictors there are by simply using the symbol <span class="math notranslate nohighlight">\(\boldsymbol{x}_i\)</span> to denote all the predictors of sample <span class="math notranslate nohighlight">\(i\)</span>. Using this notation, we can write</p>
<div class="math notranslate nohighlight" id="equation-eqyi1">
<span class="eqno">(2.12)<a class="headerlink" href="#equation-eqyi1" title="Permalink to this equation">#</a></span>\[
\hat{y}_i = f(\boldsymbol{x}_i)
\]</div>
<p>which should be read as <em>model <span class="math notranslate nohighlight">\(f\)</span> takes as an input all the predictors of sample <span class="math notranslate nohighlight">\(i\)</span>, which are packed in <span class="math notranslate nohighlight">\(\boldsymbol{x}_i\)</span>, and produces the prediction <span class="math notranslate nohighlight">\(\hat{y}_i\)</span></em>. Note that <a class="reference internal" href="#equation-eqyi1">(2.12)</a> can be used to represent any multiple regression problem, irrespective of the number of predictors that it defines. In addition, <a class="reference internal" href="#equation-eqyi1">(2.12)</a> looks almost identical to <a class="reference internal" href="#equation-eqmodelnotation">(2.1)</a>. The only difference is that the input to the model is now a set of predictors instead of a single one, which for convenience we highlight by using bold font instead of normal font.</p>
<p>We will also pack the parameters of a multiple linear regression model in a vector. These parameters are a constant (<span class="math notranslate nohighlight">\(w_0\)</span>) and the coefficients that multiply each predictor. We denote this vector by <span class="math notranslate nohighlight">\(\boldsymbol{w}\)</span> and define it as:</p>
<div class="math notranslate nohighlight" id="equation-eqw1">
<span class="eqno">(2.13)<a class="headerlink" href="#equation-eqw1" title="Permalink to this equation">#</a></span>\[\begin{split}
\boldsymbol{w}= \begin{bmatrix}
w_0\\
w_{1}\\
w_{2}\\
\vdots \\
w_{K}
\end{bmatrix}
\end{split}\]</div>
<p>Note that there are <span class="math notranslate nohighlight">\(K+1\)</span> parameters in a linear model for a multiple regression problem with <span class="math notranslate nohighlight">\(K\)</span> predictors. Using this notation and a bit of vector algebra, we can express <em>any</em> multiple linear regression model as</p>
<div class="math notranslate nohighlight" id="equation-eqfx1">
<span class="eqno">(2.14)<a class="headerlink" href="#equation-eqfx1" title="Permalink to this equation">#</a></span>\[\begin{split}
f(\boldsymbol{x}_i) &= w_0 + w_1 x_{i,1} + w_2 x_{i,2} + \dots + w_K x_{i,K}\\
&= \boldsymbol{x}_i^T\boldsymbol{w}
\end{split}\]</div>
<p>where <span class="math notranslate nohighlight">\(T\)</span> denotes vector transposition and <span class="math notranslate nohighlight">\(\boldsymbol{x}_i^T\boldsymbol{w}\)</span> is the vector multiplication of vector <span class="math notranslate nohighlight">\(\boldsymbol{x}_i^T\)</span> and vector <span class="math notranslate nohighlight">\(\boldsymbol{w}\)</span>. Pretty neat, isn’t it? The role of the entry of value 1 in the extended vector <span class="math notranslate nohighlight">\(\boldsymbol{x}_i\)</span> should be now clearer: it multiplies the coefficient <span class="math notranslate nohighlight">\(w_0\)</span> and allows us to build the compact expression <span class="math notranslate nohighlight">\(\boldsymbol{x}_i^T\boldsymbol{w}\)</span>.</p>
<p>Let us take our vector notation one step further and define the <strong>design matrix</strong> <span class="math notranslate nohighlight">\(\boldsymbol{X}\)</span> and the <strong>label vector</strong> <span class="math notranslate nohighlight">\(\boldsymbol{y}\)</span>. Given a dataset consisting of a collection of <span class="math notranslate nohighlight">\(N\)</span> samples described by <span class="math notranslate nohighlight">\(K\)</span> predictors and one label, <strong>the design matrix encapsulates all the predictor values</strong> and is defined as</p>
<div class="math notranslate nohighlight" id="equation-eqx1">
<span class="eqno">(2.15)<a class="headerlink" href="#equation-eqx1" title="Permalink to this equation">#</a></span>\[\begin{split}
\boldsymbol{X}= \begin{bmatrix}
1 & x_{1,1}& x_{1,2}& \dots & x_{1,K} \\
1 & x_{2,1}& x_{2,2}& \dots & x_{2,K} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N,1}& x_{N,2}& \dots & x_{N,K} \\
\end{bmatrix}\nonumber
\end{split}\]</div>
<p>The all-ones first column in <a class="reference internal" href="#equation-eqx1">(2.15)</a> plays the same role as the 1 entry in <a class="reference internal" href="#equation-eqxi1">(2.11)</a>. The <strong>label vector <span class="math notranslate nohighlight">\(\boldsymbol{y}\)</span> contains all the labels</strong> of the <span class="math notranslate nohighlight">\(N\)</span> samples in our dataset and is defined as</p>
<div class="math notranslate nohighlight" id="equation-eqy1">
<span class="eqno">(2.16)<a class="headerlink" href="#equation-eqy1" title="Permalink to this equation">#</a></span>\[\begin{split}
\boldsymbol{y}= \begin{bmatrix}
y_{1}\\
y_{2}\\
\vdots \\
y_{N}
\end{bmatrix}
\end{split}\]</div>
<p>The design matrix <span class="math notranslate nohighlight">\(\boldsymbol{X}\)</span> and the label vector <span class="math notranslate nohighlight">\(\boldsymbol{y}\)</span> pack all the values in our dataset. For instance, if we are considering the problem of predicting the salary of an individual from their age and height, and assume that age is our first predictor and height our second predictor, <a class="reference internal" href="#ageheightvssalary"><span class="std std-numref">Table 2.3</span></a> would lead to the following design matrix and label vector:</p>
<div class="math notranslate nohighlight" id="equation-eqxy1">
<span class="eqno">(2.17)<a class="headerlink" href="#equation-eqxy1" title="Permalink to this equation">#</a></span>\[\begin{split}
\boldsymbol{X}= \begin{bmatrix}
1 & 18 & 175 \\
1 & 37 & 180 \\
1 & 66 & 158 \\
1 & 25 & 168 \\
1 & 26 & 190
\end{bmatrix}\nonumber
\quad \quad
\boldsymbol{y}= \begin{bmatrix}
12,000\\
68,000\\
80,000 \\
45,000\\
30,000
\end{bmatrix}
\end{split}\]</div>
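<p>These two objects are straightforward to build in NumPy. The following sketch constructs the design matrix and label vector of <a class="reference internal" href="#equation-eqxy1">(2.17)</a> from the values in <a class="reference internal" href="#ageheightvssalary"><span class="std std-numref">Table 2.3</span></a>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import numpy as np

# Raw predictor values from Table 2.3 (age, height) and the labels (salary)
predictors = np.array([[18, 175],
                       [37, 180],
                       [66, 158],
                       [25, 168],
                       [26, 190]], dtype=float)
y = np.array([12000, 68000, 80000, 45000, 30000], dtype=float)

# Prepend the all-ones column to obtain the design matrix X
N = predictors.shape[0]
X = np.hstack([np.ones((N, 1)), predictors])
print(X)
</pre></div>
</div>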
<p>We also need to define a new vector <span class="math notranslate nohighlight">\(\hat{\boldsymbol{y}}\)</span> that, given a model, contains <strong>all the predicted labels</strong>:</p>
<div class="math notranslate nohighlight">
\[\begin{split}
\hat{\boldsymbol{y}} = \begin{bmatrix}
\hat{y}_{1}\\
\hat{y}_{2}\\
\vdots \\
\hat{y}_{N}
\end{bmatrix}
\end{split}\]</div>
<p>Applying basic matrix algebra, given a linear model defined by a coefficients vector <span class="math notranslate nohighlight">\(\boldsymbol{w}\)</span>, we can express <span class="math notranslate nohighlight">\(\hat{\boldsymbol{y}}\)</span> as the product of the design matrix <span class="math notranslate nohighlight">\(\boldsymbol{X}\)</span> and the coefficients vector <span class="math notranslate nohighlight">\(\boldsymbol{w}\)</span>:</p>
<div class="math notranslate nohighlight" id="equation-eqy-hat1">
<span class="eqno">(2.18)<a class="headerlink" href="#equation-eqy-hat1" title="Permalink to this equation">#</a></span>\[\begin{split}
\hat{\boldsymbol{y}}&=\boldsymbol{X}\boldsymbol{w}\\
&= \begin{bmatrix}
1 & x_{1,1}& x_{1,2}& \dots & x_{1,K} \\
1 & x_{2,1}& x_{2,2}& \dots & x_{2,K} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N,1}& x_{N,2}& \dots & x_{N,K} \\
\end{bmatrix}\nonumber
\begin{bmatrix}
w_{0}\\
w_{1}\\
w_{2}\\
\vdots \\
w_{K}
\end{bmatrix}
\end{split}\]</div>
<p>Finally, we can define vector <span class="math notranslate nohighlight">\(\boldsymbol{e}\)</span>, which contains <strong>all the prediction errors</strong>:</p>
<div class="math notranslate nohighlight" id="equation-eqerr1">
<span class="eqno">(2.19)<a class="headerlink" href="#equation-eqerr1" title="Permalink to this equation">#</a></span>\[\begin{split}
\boldsymbol{e} = \begin{bmatrix}
{e}_{1}\\
{e}_{2}\\
\vdots \\
{e}_{N}
\end{bmatrix} =
\begin{bmatrix}
{y}_{1}-\hat{y}_{1}\\
{y}_{2}-\hat{y}_{2}\\
\vdots \\
{y}_{N}-\hat{y}_{N}
\end{bmatrix} =
\boldsymbol{y}-\hat{\boldsymbol{y}}
\end{split}\]</div>
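<p>Expressed in code, <a class="reference internal" href="#equation-eqy-hat1">(2.18)</a> and <a class="reference internal" href="#equation-eqerr1">(2.19)</a> each reduce to a single line. The sketch below evaluates a hypothetical, made-up coefficient vector on the design matrix and label vector of <a class="reference internal" href="#equation-eqxy1">(2.17)</a>:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import numpy as np

# Design matrix and label vector from (2.17)
X = np.array([[1, 18, 175],
              [1, 37, 180],
              [1, 66, 158],
              [1, 25, 168],
              [1, 26, 190]], dtype=float)
y = np.array([12000, 68000, 80000, 45000, 30000], dtype=float)

# Hypothetical coefficient vector: intercept, age and height coefficients
w = np.array([5000.0, 1200.0, 10.0])

y_hat = X @ w   # predicted labels, equation (2.18)
e = y - y_hat   # prediction errors, equation (2.19)
print(np.sum(e ** 2))  # the SSE of this candidate model
</pre></div>
</div>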
<p>Even though it might look complicated at first, vector notation makes it easier for us to describe our problems and build solutions. Essentially, when discussing multiple regression problems we will use the mathematical symbols <span class="math notranslate nohighlight">\(\boldsymbol{x}_i\)</span>, <span class="math notranslate nohighlight">\(\boldsymbol{X}\)</span>, <span class="math notranslate nohighlight">\(\boldsymbol{y}\)</span>, <span class="math notranslate nohighlight">\(\boldsymbol{w}\)</span>, <span class="math notranslate nohighlight">\(\hat{\boldsymbol{y}}\)</span> and <span class="math notranslate nohighlight">\(\boldsymbol{e}\)</span>, irrespective of whether we have 2 predictors or 1,000 predictors, 10 samples or 100,000 samples. This is the power of our mathematical notation: we can abstract details away and focus instead on the essence of the problem.</p>
</div>
<div class="section" id="the-least-squares-solution">
<h3><span class="section-number">2.3.4. </span>The least squares solution<a class="headerlink" href="#the-least-squares-solution" title="Permalink to this heading">#</a></h3>
<p>Now that we have developed our mathematical notation, we are ready to show you how to fit a multiple linear model to a given dataset using the SSE, or equivalently the MSE, as our quality metric. We will have to wait until the next chapter to understand fully where this solution comes from. For now, you will just need to trust us.</p>
<p>The coefficients <span class="math notranslate nohighlight">\(\boldsymbol{w}\)</span> of the multiple linear model with the lowest SSE on a training dataset characterised by a design matrix <span class="math notranslate nohighlight">\(\boldsymbol{X}\)</span> and a label vector <span class="math notranslate nohighlight">\(\boldsymbol{y}\)</span> can be calculated as</p>
<div class="math notranslate nohighlight" id="equation-eqw-best1">
<span class="eqno">(2.20)<a class="headerlink" href="#equation-eqw-best1" title="Permalink to this equation">#</a></span>\[
\boldsymbol{w}_{best} = \left(\boldsymbol{X}^T \boldsymbol{X}\right)^{-1} \boldsymbol{X}^T \boldsymbol{y}
\]</div>
<p>This is the <strong>least squares</strong> solution. The calculation defined by <a class="reference internal" href="#equation-eqw-best1">(2.20)</a> consists of simple matrix operations (multiplication, inversion and transposition) that can be easily implemented in computing engines equipped with linear algebra capabilities. In addition to having the lowest SSE <em>on the training dataset</em>, the model with coefficients <span class="math notranslate nohighlight">\(\boldsymbol{w}_{best}\)</span> also has the lowest MSE <em>on the training dataset</em>.</p>
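<p>Translating <a class="reference internal" href="#equation-eqw-best1">(2.20)</a> into NumPy takes a single line. The sketch below computes the least squares coefficients for the toy dataset of <a class="reference internal" href="#ageheightvssalary"><span class="std std-numref">Table 2.3</span></a>; note that we solve the linear system rather than forming the inverse explicitly, which is mathematically equivalent but numerically more stable.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import numpy as np

# Design matrix and label vector from (2.17)
X = np.array([[1, 18, 175],
              [1, 37, 180],
              [1, 66, 158],
              [1, 25, 168],
              [1, 26, 190]], dtype=float)
y = np.array([12000, 68000, 80000, 45000, 30000], dtype=float)

# Least squares solution (2.20): w_best = (X^T X)^{-1} X^T y
w_best = np.linalg.solve(X.T @ X, X.T @ y)
print(w_best)  # intercept, age coefficient, height coefficient
</pre></div>
</div>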
<p>As you would expect, this solution can also be used for simple linear regression, as a simple linear regression problem can be formulated as a multiple linear regression problem where <span class="math notranslate nohighlight">\(K=1\)</span>. What might be surprising at first is that we can also use the least squares solution for multiple linear regression to solve polynomial regression problems. Let us see how to solve simple polynomial regression.</p>
<p>In simple polynomial regression, the predicted label <span class="math notranslate nohighlight">\(\hat{y}_i\)</span> is calculated as</p>
<div class="math notranslate nohighlight" id="equation-eqy-hat2">
<span class="eqno">(2.21)<a class="headerlink" href="#equation-eqy-hat2" title="Permalink to this equation">#</a></span>\[
\hat{y}_i = w_0 + w_1 x_i + w_2 x_i^2 + \dots + w_D x_i^D
\]</div>
<p>This expression is similar to the multiple linear regression expression <a class="reference internal" href="#equation-eqfx1">(2.14)</a>, where instead of a linear combination of predictors, we have a <strong>linear combination of the powers of one predictor</strong>. The trick consists of treating the powers of the predictor as predictors themselves, i.e. <span class="math notranslate nohighlight">\(x_i\)</span> is the first predictor, <span class="math notranslate nohighlight">\(x_i^2\)</span> the second predictor and so on. Accordingly, we can create a design matrix <span class="math notranslate nohighlight">\(\boldsymbol{X}_P\)</span> where each column corresponds to a power of the only predictor</p>
<div class="math notranslate nohighlight" id="equation-eqxp1">
<span class="eqno">(2.22)<a class="headerlink" href="#equation-eqxp1" title="Permalink to this equation">#</a></span>\[\begin{split}
\boldsymbol{X}_P=
\begin{bmatrix}
1 & x_{1}& x_{1}^2& \dots & x_{1}^D \\
1 & x_{2}& x_{2}^2& \dots & x_{2}^D \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N}& x_{N}^2& \dots & x_{N}^D \\
\end{bmatrix}\nonumber
\end{split}\]</div>
<p>Using this design matrix in the least squares solution <a class="reference internal" href="#equation-eqw-best1">(2.20)</a>, we can obtain the coefficients of the best polynomial model of degree <span class="math notranslate nohighlight">\(D\)</span>, assuming the SSE on the training dataset is our quality metric.</p>
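<p>The sketch below applies this trick to a hypothetical simple regression dataset: <code class="docutils literal notranslate"><span class="pre">np.vander</span></code> builds the matrix of powers of the single predictor, and the same least squares formula then yields the polynomial coefficients.</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import numpy as np

# Hypothetical simple regression data: one predictor x and labels y
x = np.array([18, 25, 26, 37, 45, 52, 60, 66], dtype=float)
y = np.array([12e3, 45e3, 30e3, 68e3, 75e3, 81e3, 79e3, 80e3])

D = 3  # degree of the polynomial

# Polynomial design matrix X_P: columns are 1, x, x^2, ..., x^D
X_P = np.vander(x, D + 1, increasing=True)

# The usual least squares solution, now on the polynomial design matrix
w_best = np.linalg.solve(X_P.T @ X_P, X_P.T @ y)
print(w_best)  # w_0, w_1, ..., w_D
</pre></div>
</div>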
</div>
</div>
<div class="section" id="flexibility-interpretability-and-generalisation">
<span id="reg4"></span><h2><span class="section-number">2.4. </span>Flexibility, interpretability and generalisation<a class="headerlink" href="#flexibility-interpretability-and-generalisation" title="Permalink to this heading">#</a></h2>
<p>Linear and polynomial models have a set of parameters that we can tune. Changing the values of these parameters allows us to generate different shapes: for instance, setting the gradient of a linear model to 0 produces a horizontal straight line, while setting it to 1 produces a straight line that forms a 45 degree angle with the horizontal axis.</p>
<p>The range of shapes that a model can produce is in general related to the number of parameters that it has. We sometimes refer to the number of parameters of a model as its degrees of freedom. A quadratic model, for instance, has three parameters (3 degrees of freedom) and can produce a wider range of shapes than a linear model, which has two parameters (2 degrees of freedom); a cubic model has four parameters (4 degrees of freedom) and produces more shapes than a quadratic model, and so on. The ability of a model to generate different shapes by changing the value of its parameters is known as the model’s <strong>flexibility</strong>. Accordingly, a linear model is very rigid, as it can only generate straight lines. In comparison, a cubic model is flexible, as it can generate many more curves, including straight lines. Flexible models are also said to be more <strong>complex</strong> in that they can generate curves of greater complexity.</p>
<p>The expected quality of a model on the training dataset is related to its flexibility. Since flexible models can generate a wider range of shapes than rigid ones, we should expect flexible models to be able to produce solutions that are better fitted to our training dataset, compared to rigid ones. For instance, <a class="reference internal" href="#salaryvsagelinear"><span class="std std-numref">Fig. 2.6</span></a>, <a class="reference internal" href="#salaryvsagequadratic"><span class="std std-numref">Fig. 2.7</span></a>, <a class="reference internal" href="#salaryvsagecubic"><span class="std std-numref">Fig. 2.8</span></a> and <a class="reference internal" href="#salaryvsage5"><span class="std std-numref">Fig. 2.9</span></a> show that as we increase the degree of the polynomial, our solution follows better the observed pattern. Indeed, the SSE and MSE values of the best linear model on this toy dataset are the highest, whereas the SSE and MSE values of the best polynomial of order 5 are the lowest. This implies, following our <em>take 1</em> definition, that the quality of a polynomial of order 5 is higher than the quality of a linear model. On the flip side, flexible models are harder to interpret than rigid ones. Model <strong>interpretability</strong> is crucial for us, as humans, to make sense in a qualitative manner of how a predictor is mapped to a label. Linear models, which are very rigid, are easier to interpret. For instance, we could describe the best linear model shown in <a class="reference internal" href="#salaryvsagelinear"><span class="std std-numref">Fig. 2.6</span></a> by simply saying, <em>the older you are, the higher your salary</em>. In contrast, describing <a class="reference internal" href="#salaryvsage5"><span class="std std-numref">Fig. 2.9</span></a> would be much more difficult.</p>
<p>If flexible and complex models are expected to have a higher quality on the training dataset than rigid and simple models, should we always be using complex models? The answer is no. To understand why, we need to remind ourselves that our ultimate goal is to <em>deploy</em> the model that we have built. In other words, we ultimately want to put our model to work. Accordingly, our goal should not be to produce models that have a high quality on the training dataset, but models that have a <strong>high quality when deployed</strong>. It turns out that training and deployment qualities behave quite differently.</p>
<p><a class="reference internal" href="#traindeployment"><span class="std std-numref">Fig. 2.12</span></a> illustrates what happens to the training and deployment quality as we increase the flexibility of our models. Initially, as we increase the flexibility, the quality improves both during training and deployment, as the MSE is decreasing. This means that by initially increasing the flexibility, our models make better predictions. However, beyond a certaing degree of flexibility, the training quality keeps improving whereas the deployment quality deteriorates, as the deployment MSE starts to increase. Therefore, a model that appears to be working very well on the training dataset, can in reality perform very poorly during deployment. The question is, why are training and deployment qualities behaving so differently?</p>
<div class="figure align-default" id="traindeployment">
<img alt="_images/train_test_MSE2.jpg" src="_images/train_test_MSE2.jpg" />
<p class="caption"><span class="caption-number">Fig. 2.12 </span><span class="caption-text">Training and deployment MSE of models of increasing flexibility. The flexibility of a model can be defined as its degrees of freedom.</span><a class="headerlink" href="#traindeployment" title="Permalink to this image">#</a></p>
</div>
<p><a class="reference internal" href="#under-over-dataset"><span class="std std-numref">Fig. 2.13</span></a> shows a simple dataset consisting of 8 samples randomly extracted from the dataset shown in <a class="reference internal" href="#salaryvsage3models"><span class="std std-numref">Fig. 2.5</span></a>. Note that the pattern that we could visually identify in <a class="reference internal" href="#salaryvsage3models"><span class="std std-numref">Fig. 2.5</span></a> cannot be discerned anymore, due to having only a few samples.</p>
<div class="figure align-default" id="under-over-dataset">
<img alt="_images/under_over_dataset.svg" src="_images/under_over_dataset.svg" /><p class="caption"><span class="caption-number">Fig. 2.13 </span><span class="caption-text">Small dataset consisting og 8 samples.</span><a class="headerlink" href="#under-over-dataset" title="Permalink to this image">#</a></p>
</div>
<p>If we fit a rigid linear model to this small dataset, we obtain the straight line shown in <a class="reference internal" href="#under-over-linearsol"><span class="std std-numref">Fig. 2.14</span></a>. Overall, this straight line follows the general pattern; however, we might wonder whether the prediction error on the training dataset can be reduced further. To do so, we need to use models of higher complexity.</p>
<div class="figure align-default" id="under-over-linearsol">
<img alt="_images/under_over_linearsol.svg" src="_images/under_over_linearsol.svg" /><p class="caption"><span class="caption-number">Fig. 2.14 </span><span class="caption-text">Linear model fitted to the small dataset.</span><a class="headerlink" href="#under-over-linearsol" title="Permalink to this image">#</a></p>
</div>
<p><a class="reference internal" href="#under-over-6ordercsol"><span class="std std-numref">Fig. 2.15</span></a> shows the result of fitting a polynomial model of order 6 to the training dataset. This polynomial predicts almost without errors the label of every single sample in the training dataset. Accordingly, the MSE on the training dataset is close to zero and we could be tempted to conclude that this model is close to perfect.</p>
<div class="figure align-default" id="under-over-6ordercsol">
<img alt="_images/under_over_6ordersol.svg" src="_images/under_over_6ordersol.svg" /><p class="caption"><span class="caption-number">Fig. 2.15 </span><span class="caption-text">Polynomial model of order 6 fitted to the small dataset.</span><a class="headerlink" href="#under-over-6ordercsol" title="Permalink to this image">#</a></p>
</div>
<p>Finally, <a class="reference internal" href="#under-over-cubicsol"><span class="std std-numref">Fig. 2.16</span></a> shows a cubic fit. This model is more complex than the basic linear one, but not as complex as the polynomial model of order 6. Its MSE on the training dataset is lower than the linear model, but higher than the polynomial of order 6.</p>
<div class="figure align-default" id="under-over-cubicsol">
<img alt="_images/under_over_cubicsol.svg" src="_images/under_over_cubicsol.svg" /><p class="caption"><span class="caption-number">Fig. 2.16 </span><span class="caption-text">Cubic model fitted to the small dataset.</span><a class="headerlink" href="#under-over-cubicsol" title="Permalink to this image">#</a></p>
</div>
<p>Linear, cubic, order 6, which one is the right model? Frustratingly, the answer is that we cannot tell just by looking at their performance <em>on the training dataset</em>. The only way for us to decide which one is the best is by <strong>assessing the quality of our models during deployment</strong>. One way to assess the quality of our models during deployment, without actually deploying them, is to use a separate dataset. For instance, if we superimpose the linear, cubic and order 6 solutions on the dataset shown in <a class="reference internal" href="#salaryvsage3models"><span class="std std-numref">Fig. 2.5</span></a>, we would conclude that the order 6 solution is actually performing very poorly, in spite of performing so well on the training dataset. A model that captures the right pattern will be able to perform well when presented with new samples that it has not been exposed to before. We would say that this model is <strong>generalising</strong> well.</p>
<p>For the sake of the argument, assume that the real pattern underlying the training dataset shown in <a class="reference internal" href="#under-over-dataset"><span class="std std-numref">Fig. 2.13</span></a> is a cubic one. Then, the linear model would be too rigid, the polynomial of order 6 would be too complex and the cubic model would be the right one. These three behaviours can be identified in <a class="reference internal" href="#traindeployment"><span class="std std-numref">Fig. 2.12</span></a> and we have three terms to refer to them:</p>
<ul class="simple">
<li><p><strong>Underfitting</strong> models produce large errors during training and deployment. The flexibility of these models is too low and they are unable to reproduce the underlying pattern. They occupy the left-hand side of <a class="reference internal" href="#traindeployment"><span class="std std-numref">Fig. 2.12</span></a>.</p></li>
<li><p><strong>Overfitting</strong> models are too flexible and perform extremely well on the training dataset at the expense of their generalisation ability. Consequently, they produce very small errors during training and large errors during deployment. They occupy the right-hand side of <a class="reference internal" href="#traindeployment"><span class="std std-numref">Fig. 2.12</span></a>.</p></li>
<li><p><strong>Just right</strong> models produce small errors during training and deployment. They have the right complexity and are capable of reproducing the underlying pattern. They are situated between the underfitting and overfitting regions in <a class="reference internal" href="#traindeployment"><span class="std std-numref">Fig. 2.12</span></a>.</p></li>
</ul>
<p>According to these terms, if we assume that the underlying pattern in <a class="reference internal" href="#under-over-dataset"><span class="std std-numref">Fig. 2.13</span></a> has a cubic complexity, the linear model is underfitting and it produces large errors during training and deployment. The polynomial model of order 6 is overfitting, as its error is practically zero during training, but would be very high during deployment.</p>
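<p>A minimal way to reproduce these three behaviours in code is to fit polynomials of increasing degree on a small training set and measure the MSE on a separate held-out set that stands in for deployment. The sketch below uses synthetic data generated from a made-up cubic pattern plus random noise:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a cubic underlying pattern plus random noise
def true_pattern(x):
    return 0.5 * x**3 - 6.0 * x**2 + 20.0 * x

x_train = rng.uniform(0, 10, size=8)
y_train = true_pattern(x_train) + rng.normal(0, 5, size=8)
x_deploy = rng.uniform(0, 10, size=100)
y_deploy = true_pattern(x_deploy) + rng.normal(0, 5, size=100)

for degree in [1, 3, 6]:
    coeffs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    mse_deploy = np.mean((y_deploy - np.polyval(coeffs, x_deploy)) ** 2)
    print(f"degree {degree}: training MSE {mse_train:.1f}, "
          f"deployment MSE {mse_deploy:.1f}")
</pre></div>
</div>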
</div>
<div class="section" id="summary-and-discussion">
<span id="reg5"></span><h2><span class="section-number">2.5. </span>Summary and discussion<a class="headerlink" href="#summary-and-discussion" title="Permalink to this heading">#</a></h2>
<p>In regression we set out to build a model that predicts the value of a continuous label using a set of predictors. There are many real-world problems that can be formulated as regression problems. One example of a problem that could be tackled using regression approaches would be that of predicting the energy consumption of a household, given the location of the house, the household size and the income. In this case the energy consumption is the continuous label and location, size and income the predictors.</p>
<p>The basic ingredients of any regression problem are a <strong>training dataset</strong>, a family of <strong>candidate models</strong>, a <strong>quality metric</strong> and an <strong>optimisation method</strong>. The training dataset is a collection of samples extracted from the population against which we will deploy our model. We use a training dataset because we do not know the true relationship between the label and the predictors. Our hope is to discover such a relationship in the training dataset. An example of a real-world problem that we could formulate as a regression problem, but would not, is predicting the distance driven by a vehicle, using speed and journey duration as predictors. In this case we know very well the relationship between distance, speed and duration, hence there is no need to extract this relationship from a dataset.</p>
<p>Building a model involves selecting the best one among a family of candidate models. In this chapter we have covered linear and polynomial models, but there are many others, including exponential models, sinusoidal models, radial basis functions, splines, the logistic model and many more. These models have a set of parameters that can be tuned. Hence, finding the best model can be seen as finding the best values for the model’s parameters.</p>
<p>It is important to highlight that to talk about a best model, we need to first agree on a notion of quality. In this chapter we have used two related quality metrics, namely the SSE and the MSE. Other quality metrics that you might come across include the root mean squared error</p>
<div class="math notranslate nohighlight" id="equation-eqrmse1">
<span class="eqno">(2.23)<a class="headerlink" href="#equation-eqrmse1" title="Permalink to this equation">#</a></span>\[
RMSE = \sqrt{\frac{1}{N}\sum{e_i^2}}
\]</div>
<p>the mean absolute error</p>
<div class="math notranslate nohighlight" id="equation-eqmae1">
<span class="eqno">(2.24)<a class="headerlink" href="#equation-eqmae1" title="Permalink to this equation">#</a></span>\[
MAE = \frac{1}{N}\sum{|e_i|}
\]</div>
<p>or the R-squared metric</p>
<div class="math notranslate nohighlight" id="equation-eqrsq1">
<span class="eqno">(2.25)<a class="headerlink" href="#equation-eqrsq1" title="Permalink to this equation">#</a></span>\[
R^2 = 1 -\frac{\sum{e_i^2}}{\sum{(y_i-\bar{y})^2}}, \quad \text{where} \quad \bar{y}=\frac{1}{N}\sum{y_i}
\]</div>
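<p>Given a vector of labels and a vector of predictions, each of these metrics takes one line of NumPy. A sketch, using made-up values:</p>
<div class="highlight-python notranslate"><div class="highlight"><pre>import numpy as np

# Hypothetical labels and predictions
y = np.array([12000, 68000, 80000, 45000, 30000], dtype=float)
y_hat = np.array([20000, 60000, 75000, 50000, 35000], dtype=float)

e = y - y_hat                                           # prediction errors
rmse = np.sqrt(np.mean(e ** 2))                         # root mean squared error
mae = np.mean(np.abs(e))                                # mean absolute error
r2 = 1 - np.sum(e ** 2) / np.sum((y - y.mean()) ** 2)   # R-squared
print(rmse, mae, r2)
</pre></div>
</div>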
<p>Note that in general, the best model according to one metric will be different from the best model according to another metric. So which one is the right quality metric? Machine learning will not automatically answer this question. It is the job of the machine learning expert who formulates the problem to decide which metric is the most appropriate. Remember our first top tip: <strong>Know Thy Domain!</strong> Only by knowing our domain will we be able to define a suitable quality metric.</p>
<p>You might still be asking yourself, why can I not expect my models to always produce perfect predictions? Why do models make errors? We can identify several factors that contribute to this. First, our family of candidate models might not be flexible enough to reproduce the correct pattern. We sometimes call this model bias. Second, there might be factors that determine the value of the label, but are not included among the chosen predictors. For instance, no matter how hard we try, we will never be able to predict the salary of an individual using their age as the only predictor. Other factors contribute to someone’s salary, for instance education or family background. Third, we might not have enough samples to reveal a very complex underlying pattern. Finally, unrelated random factors might be contributing to the final value of a label. By their nature, these random factors cannot be predicted deterministically.</p>
<p>The final basic element in any regression problem is an optimisation method, which allows us to identify the best model among a family of candidate models. In this chapter we have presented the least squares solution for linear and polynomial models. Least squares is an exact solution obtained using basic optimisation theory. In general, exact solutions will not be available and we will need to implement other optimisation approaches to identify the best model.</p>
<p>At the end of this chapter, we discovered a disturbing fact about our approach: the best model for our training dataset might not be the best model during deployment. We hoped that finding the best model on our training dataset would be sufficient, but then we observed that in fact, a perfect model on our training dataset might be a terrible model when deployed. This does not mean that we need to discard everything we have learnt so far. Instead, we need to reformulate our regression problem slightly. We have defined regression as the process of identifying the best model from a set of candidate models, where the best model is the one with the highest quality <em>on the available dataset</em>. This is our <em>Take 1</em> definition. From now on, we will define the best model as the one with the highest quality when deployed, in other words, <em>on the target population</em>. This is our <em>Take 2</em> and final definition. The question is, how do we identify this model, if all we have about our population is a dataset? This is precisely one of the main questions that we will be addressing in the next chapter.</p>
</div>
</div>
<script type="text/x-thebe-config">
{
requestKernel: true,
binderOptions: {
repo: "binder-examples/jupyter-stacks-datascience",
ref: "master",
},
codeMirrorConfig: {
theme: "abcdef",
mode: "python"
},
kernelOptions: {
name: "python3",
path: "./."
},
predefinedOutput: true
}
</script>
<script>kernelName = 'python3'</script>
</article>
<footer class="bd-footer-article">
<div class="footer-article-items footer-article__inner">