%%! Created by: Kaylee Davis
% https://www.linkedin.com/in/KayleeDavisIN
%%! Document created using R's .rnw document editing, not LaTeX
% platform (although both should work). UTF-8
% Consider knitr, index, and bibliography when compiling.
% Please read preamble, change file locations as necessary.
% Similarly, some data are called from a local file path.
%%! All data are free and fair use, but please consider proper citation.
% Some data have been collected via Google Big Query and TwittR;
% These data have only partially been deidentified; please consider online
% privacy when publishing/sharing data used here.
%------------ Quick Note on Coding Conventions -----------------
% Primarily the intent for the document is in the .pdf output,
% however, I have taken measures to add notes and organization
% in the coding of this document, for my own use and for openness
% to future authors.
% %--- lines typically divide and add some organization
% %! usually indicates an important comment, or section
% # The same is true for notes in R code (##!) and (#---)
%----------------------- Preamble ------------------------------
\documentclass[12pt]{article}
% Margins:
\usepackage[top = 1in, left = 1in, right = 1in, bottom = 1in]{geometry}
% Holdover packages from long ago:
\usepackage{graphicx, epsfig} % tex tools
\usepackage{amsfonts, amsmath, amsthm, amssymb} % Math tools/ symbols
\usepackage{relsize} % for mathlarger mathsmaller
% Setting local graphics pathing:
\graphicspath{{C:/Users/mailk/OneDrive/Documents/R/Graduate_Methods_Handbook/figure}}
% No idea:
\usepackage{textcomp} % Just for tildes I think \sim?
\usepackage{verbatim} % No Idea
% For images and stuff, I think:
\usepackage{wrapfig} % Wrap Figures and caption easier
\usepackage{setspace}
\usepackage[bookmarks = false, hidelinks]{hyperref} % Hides red box around links
\usepackage{booktabs}
% \usepackage{subcaption}
%% Type Face/ Font Stuff % fixed?
\usepackage[english]{babel}
% \usepackage{baskervald}
% \usepackage[T1]{fontenc}
% \usepackage[style = american]{csquotes}
%% Making an Index:
\usepackage{makeidx}
%% Making a Bibliography:
% \usepackage{natbib}
% \bibliographystyle{apsr}
% Reference customization is included in the reference section.
%% Footnote Formatting
% \setlist{nolistsep, noitemsep}
\setlength{\footnotesep}{1.2\baselineskip}
% \deffootnote[.2in]{0in}{0em}{\normalsize\thefootnotemark.\enskip}
%% Section Formatting
\def\ci{\perp\!\!\!\perp}
% \titleformat*{\section}{\large\bfseries}
% \titleformat*{\subsection}{\normalsize\bfseries}
%% Table of Contents Formatting (babel)
\addto\captionsenglish{ \renewcommand*\contentsname{Table of Contents:}}
%! Used in making graphs and plotting:
\usepackage{tikz}
\usetikzlibrary{positioning}
\usepackage{tkz-graph}
\usetikzlibrary{shapes,arrows}
\usepackage{caption}
% Custom Shortcuts:
\newcommand*{\h}{\hspace{5pt}} % for indentation
\newcommand*{\hh}{\h\h} % double indentation
%! For tables:
\usepackage{dcolumn}
\usepackage{float} % To use "H" in tables, place in location of LaTeX code
\usepackage{longtable}
\usepackage{array}
% for \thedate and other titling conveniences
\usepackage{titling}
\date{\today} % so that it's today (specified)
% -------------------------------------------------------------------
% ------------------------- Begin Document --------------------------
% -------------------------------------------------------------------
\makeindex
\begin{document}
% \SweaveOpts{concordance=TRUE}
% \graphicspath{Data Files}
\begin{titlepage}
\begin{center}
\vspace*{1cm}
\Huge
\textbf{Introduction to Data Science With R}
\vspace{0.5cm}
\LARGE
A Survey of Statistical Methodology
\vspace{1.5cm}
\textbf{Kaylee L. Davis}
\vfill
\Large
For the most recent version: \\
\normalsize
\url{https://github.com/KayleeDavisGitHub/Graduate_Methods_Handbook} \\
\Large
Summer 2017 --- \thedate ~(last updated) \\
\end{center}
\end{titlepage}
% FlushLeft Start:
\begin{flushleft}
\setlength{\parindent}{1cm} %1cm indent
\clearpage
%%! Page for Table of Contents:
\tableofcontents
\thispagestyle{empty}
\clearpage
\section{Introduction}
\setcounter{page}{1}
\subsection{Who Should Read This Book}
\hfill \\
The purpose of this book is to provide an overview and tutorial for executing and critically interpreting various methods. The key contribution of this book is a simplified primer on various data science topics, with code for those interested in expanding on it and learning more.
Beyond the more streamlined methods topics, there are sections on causal inference, time series modeling, network analysis, and even brief sections devoted to qualitative methodology. The online .pdf version of the book includes hyperlinks in the table of contents and index for quick access to information (useful for reviewing). The scope of this book is limited to R, with rare mentions of other tools like Tableau, Power BI, Python, and SQL.
The book follows various data sets and topics, from beer rankings to popular Reddit posts. I have tried my best to include copies of all data tables at the GitHub repository.\footnote{\url{https://github.com/KayleeDavisGitHub/Graduate_Methods_Handbook/tree/master/data}} If the data are not available there, the source is mentioned in the text.
To learn the content in this book well, I would suggest applying the basic ideas here to a project you care about, taking time to run the code, and teaching others what you've done. This is also not an exhaustive list of statistical methods, and I may reference something without covering it later on. The book only covers courses I've taken in my PhD program at Ohio State, but some insights from my later work may be folded in to provide relevance for those in business or government.
Code in this document will appear as follows:
<< setup, include=TRUE, cache=FALSE, echo=FALSE>>=
# Code Setup/ and Formatting:
library(knitr)
## This code is for figures, which may be similarly customized output
## https://github.com/yihui/knitr/blob/master/inst/examples/knitr-themes.Rnw
## https://github.com/yihui/knitr/blob/master/inst/misc/Sweavel.sty
options(formatR.arrow=TRUE, width=78)
##! Changing out Theme to "kellys"
##! For Printing, consider changing this to "moe" or "edit-xcode"
##! Check out Themes at: http://animation.r-forge.r-project.org/knitr/
opts_chunk$set(prompt=TRUE)
opts_knit$set(out.format = "latex")
theme <- knit_theme$get("kellys")
knit_theme$set(theme)
@
<<example code, warning=FALSE, message=FALSE, prompt=TRUE>>=
# Example Code:
X <- "String"
inverse_logit <- function(x){
  base::exp(x)/(1 + base::exp(x))
}
@
\newpage
\subsection{Why Choose R?}
\hfill \\
<<GoogleTrends, include=TRUE, cache=FALSE, echo=FALSE, message=FALSE, warning=FALSE, out.height= "80%">>=
library(tidyverse)
library(readr)
# data download from Google Trends:
df <- read_csv("https://raw.githubusercontent.com/KayleeDavisGithub/Graduate_Methods_Handbook/master/data/multiTimeline.csv")
# Rename, convert to tibble
df$SPSS <- df$...4
df <- as_tibble(df)
# Fix date variable:
df$Month <- paste0(df$Month, '-01')
df$Month <- as.Date(df$Month)
# Group join for plotting:
df2 <- df %>%
  select(Month, R, SAS, SPSS) %>%
  gather(key = "variable", value = "value", -Month)
ggplot(df2, aes(x = Month, y = value)) +
  geom_line(aes(color = variable)) +
  scale_color_manual(values = c("steelblue", "darkred", "grey40"),
                     name = "Program") +
  labs(title = "Worldwide Google Search Trends: Science Category",
       subtitle = "R, SAS, SPSS [Jan. 2004 | Sept. 2022]",
       caption = "https://trends.google.com/trends/") +
  xlab(" ") +
  ylab("Search Interest Percent (%)") +
  geom_vline(xintercept = as.Date("2010-06-01"), linetype = "dashed") +
  annotate("text", x = as.Date("2010-06-01"), y = 80,
           label = "June, 2010\nR Passes SPSS",
           angle = 90, size = 3) +
  theme_minimal()
@
%%! Clear page; set page counter at page "1"
\clearpage
%----------------------------------------------------------------------------------
\section{Causal Inference and Assumptions}
What is a \textit{cause}? More specifically, what in social science can really ``cause'' anything, and is a cause ever really measurable? These questions have plagued scientists for years, and with advances in methodology and theory we have tried to answer causal questions --- or at the very least, accurately report our assumptions and errors. A lot of the first few pages of this guide will be explicit in describing the assumptions we make with various methods, because these should be explicit for the audience whenever we report ``causal effects.''
Here we are using one language to define ``cause,'' specifically that used by social scientists, which is the first theoretical assumption. A cause here, in the simplest terms that maximize the utility of the word, \textbf{is a treatment}. Anytime you see the word ``cause'' you could also say treatment, and the reverse. Therefore, race, gender, or any other attribute is \textbf{not} a cause, because we cannot give people a new race or gender on a fundamental and societal level. But obviously these things matter! Here we would divide up our study, \textit{stratifying} on gender, race, etc., for all of those groups to which we cannot assign a new race or gender. Being creative in design up front can save a lot of our analysis on the back end too; for example, an online program that generates a name or image for you on some messaging platform may give you a new \textit{perceived race} or \textit{perceived gender}, which we can randomize and identify. The key point here is that our definition of ``causal'' already introduces some problems, and a lot of what follows is an attempt to avert them, understand them, and explain them accurately to the audience through \textit{causal inference}. Causal inference, then, is the way we infer causality through many design and analytic choices. One take-away here is that by broadening our language we strengthen the assumptions we can make; it's a give-and-take that has become convention. \index{cause}
\begin{longtable}{ll}
\hline\noalign{\smallskip}
\textbf{Symbol} & \textbf{Meaning} \\
\noalign{\smallskip}\hline\noalign{\smallskip}
$D$ & Indicating Treatment Group ($1$) or Control Group ($0$)\\
$Y$ & Dependent Variable of Some Theoretical Interest \\
$Y_0$ and $Y_1$ & Dependent Variable when Controlled or Treated\\
$X$ or $x_1$ etc. & Independent Variables\\
$V$ or $\sigma^2$ & Variance\\
E[~~] & Expectation of Something, Typically the Mean\\
$E[Y |D]$ & Expectation of Y \textit{Given Some} D\\
$\mu_{ATE}$ & The Average Treatment Effect\\
$\mu_{ATT}$ & The Average Treatment Effect on the Treated\\
$\mu_{ATC}$ & The Average Treatment Effect on the Controls\\
$\xrightarrow[]{d}$ & The Distribution Becomes\\
$n$ or $N$ & The Number of Observations\\
$\mathcal{N}$ & Normally Distributed\\
$\hat{\theta}$ & Estimated Parameters\\
$X_{it}$ & All Rows Within X (i) and At Each Time (t)\\
\noalign{\smallskip}\hline\noalign{\smallskip}
\end{longtable}
\clearpage
%-----------------------------------------------------------------------------------
\subsection{Pearl vs. Rubin Models of Causal Inference}
\hfill \\
We frequently hear the saying ``correlation does not equal causation.''\footnote{For a particularly hilarious article involving correlation versus causation see: Maltzman, Forrest, James H. Lebovic, Elizabeth N. Saunders, and Emma Furth. ``Unleashing presidential power: The politics of pets in the White House.'' PS: Political Science \& Politics 45, no. 3 (2012): 395-400.} Working with similar definitions of ``cause,'' Rubin and Pearl are two scholars who have taken different approaches to tackling causal inference. Both focus on weighted averages (means) to make the jump from the data to the ``real world.'' Note that we could care about the distribution or variance instead of weighted means, but this won't be covered here. Rubin's approach is more algebraic, focusing on how attributes cannot be causal and explicitly stating where errors originate and cross between variables. Pearl leans on graph theory, creating ``DAG'' plots (covered below) to draw out for readers which variables influence others. Pearl's approach is intuitive and straightforward and seems to encapsulate theory just as well as Rubin's, but it has been criticized for not being explicit enough in tracking errors between the nodes of the theory plots. However, in some ways Pearl is more explicit notationally; consider a fundamental difference in means between treatment and control:
\begin{equation}
\text{Rubin approach:} ~~~~ E[Y_1] - E[Y_0]
\end{equation}
\begin{equation}
\text{Pearl approach:} ~~~~ E[Y ~|~ \text{do}(D=1)] - E[Y ~|~ \text{do}(D=0)]
\end{equation}
Pearl's approach, using the \textit{do()} operator, explicitly states what the researcher \textit{did}, rather than assuming the reader can infer from the notation alone what the researcher did.
\hfill \\
\addcontentsline{toc}{subsubsection}{The DAG Plot}
\noindent \textbf{The DAG Plot:}
\hfill \\
% Here are some examples that should help:
\begin{center}
% The beginning tikzpicture [brackets] has the settings
% and functions for the rest of the graph, the graph will
% generate the nodes (the center connecting points) and
% the paths (the lines intersecting these nodes)
% more details on the measurements of these lines:
% https://tex.stackexchange.com/questions/8260/what-are-the-various-units-ex-em-in-pt-bp-dd-pc-expressed-in-mm
\begin{tikzpicture}[%
->,
shorten >=2pt,
>=stealth,
node distance=3cm,
noname/.style={%
rectangle,
minimum width=2em,
minimum height=2em,
draw
}
]
% Nodes:
\node[noname] (C) {C};
\node[noname] (D) [node distance=2cm, below right=of C] {D};
\node[noname] (O) [right=of C] {O};
\node[noname] (Y) [node distance=2cm, below right=of O] {Y};
% Paths:
\path (D) edge node {} (Y)
(C) edge node {} (D)
(C) edge node {} (O)
(O) edge node {} (Y);
\end{tikzpicture}
\end{center}
\hfill
Consider our first plot (above), and imagine that it is some causal ``story'' we wish to convey to the reader. Here C influences both O and D, which both influence Y. If we care about the effect of D on Y, then we ought to be careful that we are not also picking up the effect of O on Y. This is the backbone of more complex DAG plots, or directed acyclic graphs. A few definitions going forward: a \textbf{back-door path} is any path of influence that runs into our causal variable rather than \textit{from} our causal variable to the dependent variable. Here the path through C is a back-door path: C influences D and also influences Y through O, which gives us false results if we merely measure D on Y. A \textbf{collider} is a node which has multiple arrows running into it (in social science most variables will be colliders). Colliders collect information into these ``hubs,'' which can greatly complicate our findings if not accounted for. Sometimes we may see broken lines, different shapes, or subscripts to denote errors in these graphs and how they are thought to pass between nodes. Note that DAG plots are made entirely from theory, and that qualitative and quantitative evidence alike can sculpt how we model our causal phenomena of interest. Ultimately we care about a few things: (1) identifying exogenous variation, (2) blocking back-door paths, and (3) summing up the front-door paths. But how do we ``block'' these dangerous paths? There are many ways; the most common in experiments is to stratify the experiment by the node, but the \textbf{curse of dimensionality} tells us that as we do this more and more we will run out of respondents (countries, etc.) to fill all of the strata we may want. Thus, methods such as matching, and careful design choices, help us deal with blocking difficulties. \index{Curse of Dimensionality} \index{DAG} \index{Back-Door-Paths} \index{colliders}
Another great way to account for back-door paths is to simply randomize, and to randomize everything in many different ways. By randomizing we shuffle who gets treated, with no interaction between units (SUTVA) and no selection into treatment; this will be covered in detail later. A final note: sometimes researchers put coefficients and effect sizes on the edges between nodes to show the estimated effects between variables in a study. Also, when we see circles and dotted lines (as in the next figure), these denote unobserved variables that are not directly measurable.
%----------------------------------------------------------------------
\begin{center}
\begin{tikzpicture}[%
->,
shorten >=2pt,
>=stealth,
node distance=2cm,
noname/.style={%
rectangle,
minimum width=2em,
minimum height=2em,
draw
}
]
% Nodes:
\node[noname] (A) {A};
\node[noname] (u) [circle, node distance=2cm, right=of A] {u};
\node[noname] (B) [node distance= 1cm, above =of u] {B};
\node[noname] (C) [right=of u] {C};
% Paths:
\path (A) edge node [above] {} (B);
\path (B) edge node [above] {} (A);
\path (B) edge node [above] {} (C);
\path (C) edge node [above] {} (B);
\path[dashed] (u) edge node [left] {} (A);
\path[dashed] (u) edge node [right] {} (C);
\path[dashed] (A) edge node [left] {} (u);
\path[dashed] (C) edge node [right] {} (u);
%\draw[->] (C) to [bend right] node [above] {?} (A);
\end{tikzpicture}
\end{center}
%------------------------------------------------------------------------
\subsection{Formal Modeling}
The late statistician George Box once said, ``All models are wrong, but some are useful.'' Formal modeling (which I'll sometimes call game theory interchangeably) can be used to define the particular assumptions leading to an empirical test. Instead of writing out long-hand the history, studies, and assumptions that go alongside our theory, we explicitly state the relevant ``variables'' in mathematical notation leading right up to a logical empirical test. This has many obvious benefits; chiefly, it disciplines one to be explicit with one's assumptions (similar to, but more involved than, the DAG plots from the causal inference chapter). It also allows clear objections and revisions to be made by future scholarship. Indeed, many models, such as Downs' ``Median Voter Theorem,'' have been revised multiple times thanks to the clear writing in the original mathematical formal modeling.
\hfill \\
\noindent \textbf{Utility, First Best, Nash Equilibrium, Comparative Static}
\index{utility}
\noindent To begin, utility describes one's preferences for one thing over another. For example, preferring guns to butter:
$$ u_i(\textit{Guns}) = 10 ~~~~ u_i(\textit{Butter}) = 1 $$\\
\noindent Which, note, is the same as this \textit{utility function}:
$$ u_i(\textit{Guns}) = 8,675,309 ~~~~ u_i(\textit{Butter}) = 1,337 $$\\
One's \textit{expected utility} is the sum, over the possible outcomes, of the probability of each outcome multiplied by the utility of that outcome.\footnote{pg. 339 of Bueno de Mesquita's ``Political Economy for Public Policy'' has a great example of this.}
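As a quick illustration with made-up numbers: if guns are obtained with probability $0.3$ and butter otherwise, the expected utility of that lottery under the first utility function above is
$$ EU_i = 0.3 \cdot u_i(\textit{Guns}) + 0.7 \cdot u_i(\textit{Butter}) = 0.3(10) + 0.7(1) = 3.7 $$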
\index{First Best} \index{Nash Equilibrium}
\noindent Considering the standard 2x2 matrix format:
\begin{table}[h!]
\centering
\setlength{\extrarowheight}{2pt}
\begin{tabular}{cc|c|c|}
& \multicolumn{1}{c}{} & \multicolumn{2}{c}{Player $Y$}\\
& \multicolumn{1}{c}{} & \multicolumn{1}{c}{$A$} & \multicolumn{1}{c}{$B$} \\\cline{3-4}
{Player $X$} & $A$ & $(x,y)$ & $(x,y)$ \\\cline{3-4} % \multirow{2}* was in front of {player}
& $B$ & $(x,y)$ & $(x,y)$ \\\cline{3-4}
\end{tabular}
\end{table}
The \textit{First Best} responses are the bolded entries in the table below: each player's best strategic decision given each potential choice by the other player.
\begin{table}[h!] % here!
\centering % I guess this is the only way to center the table?
\setlength{\extrarowheight}{2pt}
\begin{tabular}{cc|c|c|}
& \multicolumn{1}{c}{} & \multicolumn{2}{c}{Country $Y$}\\
& \multicolumn{1}{c}{} & \multicolumn{1}{c}{Don't Arm} & \multicolumn{1}{c}{Arm} \\\cline{3-4}
{Country $X$} & Don't Arm & $\textbf{4, 4} $ & 0,\textbf{3} \\\cline{3-4}
& Arm & \textbf{3},0 & 1,1 \\\cline{3-4}
\end{tabular}
\end{table}
The \textit{Nash Equilibrium} is the ``first best of the first bests,'' or where all players settle on one set of strategies from which no one wants to unilaterally deviate -- even though it may not carry the highest pay-off or reward for a particular player. Note that the process for solving a 3x3 up to an NxN matrix is similar to the standard 2x2. Nash equilibria can also be solved algebraically; graphically, an equilibrium is the intersection of the lines representing each player's best responses along their strategic continuum. A Nash equilibrium can be notated by any given variable, e.g., $\theta^*$.
\index{subgame perfect Nash Equilibrium}
A variation of the Nash Equilibrium is the \textit{subgame perfect Nash Equilibrium}: a strategy profile that induces a regular Nash Equilibrium in every subgame of a longer, more dynamic game.
\index{comparative static}
Another term, the \textit{comparative static}, is a statement of how an equilibrium quantity changes as a parameter of the model changes; it is the component of a model that connects our theory to our empirics through game-theoretic reasoning. Typically the comparative static is deployed in papers to test hypotheses, arrive at novel findings, argue for an expansion of a particular model or theory, or simply to describe a phenomenon with more precision than before.
\subsection{Treatment (Causal) Effects, SUTVA, Confounding}
In its simplest form, we care about the difference between the average outcome in the treated group and the average outcome in the control group. The difference between the two is the \textbf{Naive Difference in Means}, the naively simple calculation of treatment effects. Why is it naive? Well, we haven't considered any assumptions surrounding the groups, nor the fundamental problem of causal inference --- the counterfactual. A counterfactual is the ``what if'' of being in the other group than the one assigned. \index{Naive Differences in Means}
\begin{equation}
\hat{\mu}_\text{Naive} = E[Y|D=1] - E[Y|D=0]
\end{equation}
Think for a second: given all of this, when might $\hat{\mu}_\text{Naive} = \hat{\mu}_\text{ate}$? That is, when might the naive estimator be the same as our average treatment effect (ATE)? Remember that the reason $\hat{\mu}_\text{Naive}$ is naive is that it never observes the counterfactual world.
\hfill \\
\noindent \textit{Answer:}\\
Note that the above equation has no decorations on $Y$: we are assuming that respondents are not interacting with each other, even over time (SUTVA), and that our design is absolutely complete (think of a complete DAG plot), such that we really are just getting $E[Y_1] - E[Y_0]$. With no error and no SUTVA violation, the $ate$ equals $\hat{\mu}_\text{Naive}$. \index{SUTVA} Here SUTVA is doing a lot of work: we are assuming that there is no interference and no treatment variation ($Y = Y_1$ for the treated). This is a pretty monstrous assumption: countries, people, politicians, the media, \textit{everyone} communicates and shares what they have been ``treated'' with all the time! Beyond this, treatments often affect people very differently depending on affiliation, openness to feedback, or whether they ate breakfast that morning. Some methods, like network analysis, embrace this fact instead of avoiding it, although that method will not be covered here.
\hfill \\
\noindent SUTVA, or the Stable Unit Treatment Value Assumption, follows two primary tenets:
\hfill \\
\begin{itemize}
\item[1.] That the treated and controlled units are not interacting between one another.
\item[2.] That the method of how someone gets treated does not vary between units, or is at least equivalent.
\end{itemize}
\hfill \\
\noindent Some core assumptions are not only SUTVA (consistency) but also that no \textit{confounding} occurs (also known as selection on observables, or ignorability). Confounding arises, for example, when the people who would benefit most from the treatment select into the treatment. Formally, no confounding is: \index{confounding} \index{selection on observables} \index{ignorability}
\begin{equation}
(Y_1, Y_0) \ci D, ~~~~ \text{equivalently} ~~~~ P(D=1 ~|~ Y_1, Y_0) = P(D=1)
\end{equation}
This language, by the way, is really annoying because: selection on observables $=$ conditional independence $=$ conditional ignorability $=$ no confounding conditional on covariates $=$ blocking all back-door paths. \index{blocking} \index{selection on observables} \index{back door paths} An open back-door path is Pearlian DAG-speak for having confounding. If treated units have systematically different $Y_0$'s and $Y_1$'s than control units, we have imbalance. If $X$ is correlated with $Y_1$ and $Y_0$ for \textit{any reason}, then even though we randomized, $Y_1$ and $Y_0$ would be informative about $X$, and therefore about our treatment $D$! In that case we have a broken experiment.
Some other things to look out for are the plug-in estimator and the conditional expectation function; each of these examines average treatment effects within conditions (control variables). The \textit{plug-in estimator} conditions on our controls and averages the treatment effect over the distribution of the $X$'s:
\begin{equation}
E_X[E[Y | D = 1, X] - E[Y | D = 0,X]]
\end{equation}
\noindent The conditional expectation function (CEF) expresses the average outcome given treatment and lets us read off conditional causal effects:
\begin{equation}
E[Y | D] = E[Y_0] + \mu_{ate}D
\end{equation}
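Before moving on, here is a minimal R sketch of the plug-in estimator on simulated data (all names and numbers are mine, for illustration only): we take the difference in means within each level of a binary $X$ and then average over the distribution of $X$.
<<pluginsketch, prompt=TRUE>>=
set.seed(8)
X <- rbinom(500, 1, 0.5)                       # a single binary covariate
D <- rbinom(500, 1, ifelse(X == 1, 0.7, 0.3))  # treatment depends on X
Y <- 1 + 2 * D + 3 * X + rnorm(500)            # true effect of D is 2
mean(Y[D == 1]) - mean(Y[D == 0])              # naive difference: biased upward here
# Plug-in: E_X[ E[Y|D=1,X] - E[Y|D=0,X] ], within-X differences averaged over P(X)
diffs <- sapply(0:1, function(x)
  mean(Y[D == 1 & X == x]) - mean(Y[D == 0 & X == x]))
sum(diffs * prop.table(table(X)))              # close to the true effect of 2
@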
We could think about taking into account all variables and possible conditions; this is the hope of Rubin's \textit{saturated model}, which has its roots in imputation for missing data (the missing data in our case being the counterfactuals). Saturated models do come with particular downsides, however, mostly in the errors they produce and in our lack of knowledge of the exact causal pathway, given that adding new variables can open up back-door paths to our causal estimand. \index{saturated modeling} \index{imputation}
%--------------------------------------------------------------------------------------------------------------
\hfill \\
\subsection{Identification Through Randomization, Hypothesis Testing}
Identification strategies are attempts to recover the missing information we lack due to the fundamental problem of causal inference. Two popular paths are using controls and randomization.
There are different types of randomization to help us gain inference. Simple randomization assigns treatment ($D$) to each individual independently with some probability; complete randomization first decides what fraction of observations should be treated and then randomly assigns exactly that many units to treatment. Using randomization, we can also ``block'' on those observations (like gender, or some $X$) which we cannot randomly assign.
\begin{equation}
(Y_1, Y_0) \ci D ~|~ X, ~~~~ \text{or} ~~~~ P(D=1 ~|~ Y_1, Y_0; X) = P(D=1 ~|~ X) = p
\end{equation}
\hfill \\
Essentially, we could calculate the naive difference in means under every possible assignment of units to treatment and control, and compare our observed difference to that distribution: this is called randomization inference (within Rubin's causal model).
\hfill \\
In hypothesis testing we could simply compare the with-treatment and without-treatment groups and see how our estimator's distribution converges (recall the t-test material):
\begin{equation}
\sqrt{n}(\hat{\mu} - \mu) \xrightarrow[]{d} \mathcal{N}(0, \sigma^2)
\end{equation}
\index{p-values}
Cool, so what is the p-value that we get from this? It says: conditional on the null hypothesis (i.e., conditional on $\mu_{ate} = 0$), what is the (frequentist) probability of observing an estimate as extreme as our $\hat{\mu}$? This is typically posed against a sharp null hypothesis of zero effects, $Y_1 = Y_0$ for every unit. \index{sharp null hypothesis}
%------------------------------------------------------------------------------------------
\hfill \\
% Slide 45 of causal 4 has regression components.
%------------------------------------------------------------------------------------------
% Law of iterated expectations? it is when we take the E[E(Y|D)] we can boil it down to E[Y]
\noindent On hypothesis testing, in most cases we start off by testing against a \textit{sharp null hypothesis of zero average effects}, that is, $\mu_{ate} = 0$. Note that this is probably the most boring, unrealistic, and uninteresting null hypothesis that we could ever test against. Does anything in social science really ever have an exactly zero effect? Probably not, but we use our handy friends, the z-test and the t-test, to find the statistical significance of our finding compared to the zero effect. When the sample is large we can treat the sample variance as a stand-in for the unknown population variance; for small samples with an unknown population variance, instead of the z score
\begin{equation}
z= \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
\end{equation}
\noindent one would use:
\begin{equation}
t = \frac{\bar{x} - \mu}{s_x / \sqrt{n}}
\end{equation}
\subsection{Frequentism and Bayesianism}
``Statistics'' has its roots, interestingly, in the social sciences: the word derives from ``state-istics,'' the study of the state. Originally bred for states to understand their citizenry, statistics has fascinatingly been turned around for the \textit{people} to understand their state. Following World War II this was more important than ever: a social science with deep philosophical, theoretical, and qualitative roots began to explode with quantitative methodology through the 1950s. These methodologies, deeply aided by fields like biology, computer science, and econometrics, have both drawn us closer to understanding public and governmental behavior \textit{and} driven us farther from these concepts as methodology has been misused (e.g., p-hacking), misunderstood, and miscommunicated.
To begin to understand why we use statistics, general linear models, or machine learning, one must first understand a few basic underpinnings of frequentist theory versus Bayesian theory. Frequentism enters social science largely through R.A. Fisher's likelihood theory of inference. It argues that one true parameter ($\theta$) exists in nature, and all scientists are doing is drawing samples to estimate that one true, fixed $\theta$. Our results are likelihoods that come close to measuring the one true data generating process (DGP), but \textbf{always} with some uncertainty. In theory, we can map our estimated $\hat{\theta}$ in two dimensions pretty easily (think of a basic x--y coordinate grid). Yet the social world is very complex; as we add more and more variables we find ourselves in hyperspace, unable to map and visualize our surroundings (machine learning helps here, but that will come up later). To find interesting points when we run these regressions, we (in frequentist theory) seek the MLE, or Maximum Likelihood Estimate. Note this is purposefully not the maximum \textit{probability} estimate; in likelihoodist theory these words mean very different things. The take-away is to be careful when speaking of likelihoods (a relative measure of uncertainty) versus probabilities (which, strictly, arise from randomized processes), as these words have strict definitions in causal inference.
An MLE is interesting because it is the impasse between our estimated parameter $\hat{\theta}$ and our dependent variable $y$. It typically occurs at the ``highest'' or maximum point on whatever curve we have. Consider the image below: the red line indicates the plateau, or maximum point, of the likelihood ($L$) of $y$ (estimated by $x_1$, etc.) over $\theta$.
\begin{center}
\includegraphics[scale=2]{MLEex}
\end{center}
To find, mathematically, our MLE (which occurs under-the-hood in R) we compute the following: \index{MLE}
\begin{list}{}{}
\item[1.] We take the $\log$ of the likelihood function for ease of computation.
\item[2.] We then take the derivative (also known as the score function).
\item[3.] We set this derivative equal to zero (to find the flat top part).
\item[4.] We then solve for $\hat{\theta}$.
\item[5.] We then make sure we found a maximum and not a minimum or a smaller local peak (note we could have found the other peak in the image above). To do this we check the sign of the second derivative; if it is negative we know the curve is downward-facing at that point. We finish by calculating the Fisher information (based on the second derivative) to get our confidence via the variance and standard errors; a short R sketch of these steps follows this list.
\end{list}
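To make this concrete, here is a minimal R sketch of those steps for a simple Bernoulli likelihood on simulated data (the data, starting value, and bounds are mine, for illustration only); \texttt{optim()} handles steps 2--4 numerically and returns the Hessian needed for step 5.
<<mlesketch, prompt=TRUE>>=
set.seed(1)
y <- rbinom(100, size = 1, prob = 0.3)  # simulated data; true theta is 0.3
# Step 1: the log-likelihood (negated, because optim() minimizes by default)
neg_loglik <- function(theta) -sum(dbinom(y, size = 1, prob = theta, log = TRUE))
# Steps 2-4: find where the derivative of the log-likelihood is zero
fit <- optim(par = 0.5, fn = neg_loglik, method = "L-BFGS-B",
             lower = 0.001, upper = 0.999, hessian = TRUE)
fit$par                 # the MLE, theta-hat (should be close to mean(y))
# Step 5: the Hessian approximates the Fisher information; invert for the variance
sqrt(1 / fit$hessian)   # standard error of theta-hat
@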
\subsection{Running Assumptions}
Typically we trade-off assumptions but can never completely eliminate them, only show or \textit{prove} that we have fulfilled them -- all to get closer to causation (causal inference). Some running assumptions and axioms are:
\begin{itemize}
\item[1.] SUTVA
\item[2.] No confounding
\item[3.] Probability exists only between 0 and 1 (probability axiom).
\end{itemize}
\noindent If we attempt to ``lean into'' one of these assumptions to be able to measure it, we may come across additional assumptions like ``parallel paths'' or various representation and error assumptions. Some of these will be discussed in the regression section of this book. Probability axioms can be broken, and probability ranges can be reinterpreted to run from 0 to 200\%, for example, but there is usually no reason to do this.
\subsection{Non-parametric Modeling}
A few non-parametric models help us lean away from certain assumptions. This could be an entire class on its own, but I have a few notes that I'll share here.
\subsubsection{Multiple Comparisons}
\subsubsection{Marginal Structural Models (MSM)}
\index{Marginal Structural Modeling}
If we are working with relational data sets and do have downstream effects at $t_1$, $t_2$, \ldots, $t_n$, we cannot solve the problem with fixed effects, because with fixed effects we are assuming that our estimands do not change over time, a core assumption that is violated all the time. However, fear not, for marginal structural models can help us when fixed effects cannot.
Assuming later periods do not influence earlier ones (which may be an issue if we're studying some forecasting or strategic process that appears in our data), we can look at the outcome in the last time period and stack each time period on top of one another. This is very different from adding a bunch of dummy variables; by stacking the time periods together we can treat for the histories and the potential influence among them.\footnote{It may be a good idea to check out time series analysis here, just in case your problems can be solved with your data.}
So marginal structural models can work with different treatment histories (arms) where fixed effects cannot. Let's fit a model for every ``D'' and weight each of these down the road. We condition on the history of past events, and if the model pukes at us we drop the bad cases, redefine the subsample we are talking about, and explain why it is still interesting.
Note that we do not escape the unobservables problem; we still need to set up this model with the same theory and careful design. \textit{You cannot analyze your way out of a bad design.}
Some code to look into (and notes):
<<msm1, prompt=TRUE, eval=FALSE>>=
# glm() has an argument called "weights"; don't trust it for this...
library(survey)  # this is better for a variety of reasons
# An unweighted design first (weight of 1 per row); `dat` is the hypothetical
# person-period data set described above
unweighted_design <- svydesign(ids = ~1, weights = ~1, data = dat)
unweighted_fit <- svyglm(Y ~ D, design = unweighted_design,
                         family = quasibinomial())
coef(unweighted_fit)
@
These weighted coefficients can help us, but we can also try to stabilize the weights, especially in cases of low-probability events. For this we need non-dynamic probabilities. Estimate these values with R's \texttt{mgcv} package, using the \texttt{bam()} function and some smoothing with \texttt{s()}. Get the $t$-specific weights (there are ways to plot the improvement from doing this), then put the stabilized weights together so they have a mean around 1, multiplying across periods with \texttt{prod(weights)}. Finally, swap these weights in above: instead of \texttt{1} we can use \texttt{weights}:
<<msm2, prompt=TRUE, eval=FALSE>>=
# After estimating period-specific treatment probabilities with smooths, e.g.:
library(mgcv)
# bam(D ~ s(X1) + s(X2), family = binomial, data = dat)  # per time period
# ...then build stabilized weights (mean around 1) and combine with prod().
# Then we can edit things:
library(survey)  # this is better for a variety of reasons
weighted_design <- svydesign(ids = ~1, weights = ~weights, data = dat)
weighted_fit <- svyglm(Y ~ D, design = weighted_design,
                       family = quasibinomial())
coef(weighted_fit)
# Might want to check out package: cbmsm
@
We can then get predicted probabilities from our marginal structural models. So essentially with marginal structural models:
\begin{itemize}
\item[1.] Set up the data so each outcome has multiple individuals and prior time periods.
\item[2.] Model treatment in each period without time-varying covariates alongside it.
\item[3.] Assess balance (repeat step 2 if necessary).
\item[4.] Stabilize the weights.
\item[5.] Calculate weighted estimates of the quantities of interest (QOIs).
\item[6.] What is $N$ anymore? Let's just bootstrap the heck out of this: repeat steps (2), (4), and (5) over 1,000 times.
\end{itemize}
\hfill \\
\hfill \\
%------------------------------------------------------------
\subsection{Finding Causal Estimands:}
\hfill \\
\subsubsection{Matching and Propensity Scores:}
\textbf{Matching is weighting} each observation to all strata. Much of this is really the same thing: matching pairs individuals and compares causal estimands, while weighting multiplies a ``weight'' onto our observations to pair them to reality (say, the Census). Subclassification is, in the long run, the same thing as matching and weighting. If we don't have exact matches we can use a coarsened matching approach where we break a continuous variable into deciles; this is similar to K-nearest-neighbor subclassification matching. \index{matching} And a propensity score is just the probability of treatment given some X: \index{propensity scores}
\begin{equation}
e(X) = Pr(D = 1|X)
\end{equation}
The idea is then to coarsen/subclassify or nearest-neighbor match on the \textbf{propensity score}. \index{coarse matching} \index{nearest-neighbor} \textbf{Subclassification} means making the probabilities of treatment and control equivalent within strata in observational studies. Doing this, we help create balance between our treated and control groups, similar to complete randomization in experimental trials. (What machine learning techniques can help us here?) \index{subclassification}
Some machine learning approaches that can help (see the machine learning chapter for more) include regression trees and other classification approaches. These can help us identify groups to match on and create more balance in our data.
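Here is a minimal sketch in R, using simulated data and hypothetical variable names: we estimate the propensity score with a logistic regression, subclassify on its quintiles, and average the within-stratum differences in means.
<<pscoresketch, prompt=TRUE, message=FALSE, warning=FALSE>>=
library(dplyr)
set.seed(2)
n   <- 1000
X   <- rnorm(n)                                  # a single confounder
D   <- rbinom(n, 1, plogis(0.8 * X))             # treatment depends on X
Y   <- 2 * D + X + rnorm(n)                      # true treatment effect is 2
dat <- data.frame(Y, D, X)
mean(dat$Y[dat$D == 1]) - mean(dat$Y[dat$D == 0])  # naive difference, biased here
# Propensity score: estimated probability of treatment given X
dat$pscore  <- predict(glm(D ~ X, family = binomial, data = dat),
                       type = "response")
dat$stratum <- ntile(dat$pscore, 5)              # subclassify into quintiles
dat %>%
  group_by(stratum) %>%
  summarise(diff = mean(Y[D == 1]) - mean(Y[D == 0]), n = n()) %>%
  summarise(ate = weighted.mean(diff, n))        # close to the true effect of 2
@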
\hfill \\
\subsubsection{Fixed Effects:}
Fixed effects are dummy variables (a form of stratification) that turn ``on or off'' a year, country, or some other theoretically interesting unit to ``account'' for everything that pours into a time or place. Using this, we assume no carry-over from one year to previous or future years. Here are the assumptions behind fixed effects: \index{Fixed Effects} \index{Marginal Structural Models}
\hfill \\
\begin{itemize}
\item[1.] no unobserved time-varying confounds exist
\item[2.] past outcomes do not directly affect current outcome
\item[3.] past outcomes do not directly affect current treatment
\item[4.] past treatments do not directly affect current outcome
\end{itemize}
\hfill \\
\noindent One larger problem is that the linear fixed effect does not consistently estimate the ATE:
\begin{equation}
\beta_{LFE} \rightarrow \frac{E[V^i \mu^i_{naive}]}{E[V^i]} \neq \mu_{ate}
\end{equation}
\noindent But marginal structural models allow our times and events to influence one another (see the marginal structural models section above).
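A minimal sketch of the dummy-variable version in R, on simulated panel data (all names and numbers here are made up): the unit dummies absorb the time-invariant confounder.
<<fesketch, prompt=TRUE>>=
set.seed(3)
panel <- expand.grid(unit = 1:30, year = 2000:2009)
unit_effect <- rnorm(30)[panel$unit]             # time-invariant confounder
panel$d <- rbinom(nrow(panel), 1, plogis(unit_effect))
panel$y <- 1.5 * panel$d + unit_effect + rnorm(nrow(panel))
# Unit and year dummies "turn on or off" each unit and each year:
fe_fit <- lm(y ~ d + factor(unit) + factor(year), data = panel)
coef(fe_fit)["d"]   # close to the true effect of 1.5 once unit effects are absorbed
@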
\hfill \\
\subsubsection{Differences in Differences}
\index{Differences in Differences}
Differences in differences is a way to estimate our causal estimand (ATT, ATE, etc.) when treatment rolls out over time. Say we have three time points 1, 2, and 3, where all units are untreated at time 1, some are treated at time 2, and all are treated at time 3. To get the estimand from differences in differences, we first take the difference in mean outcomes between time 2 and time 1 for the treated group and for the control group separately, and then take the difference between those two differences. We could do the same thing for times 3 and 2. A key assumption of difference-in-differences designs is \textit{parallel trends}: we need to convince the reader (often through theory) that, in a world where no treatment occurred, the treated and control groups would have moved in parallel. Further, it must be only the treatment that caused the change between the control and treated groups, not something that would have happened anyway.
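A minimal two-period, two-group sketch in R with simulated data (the numbers are made up and parallel trends hold by construction); the interaction coefficient in the regression is the same difference-in-differences estimate computed by hand.
<<didsketch, prompt=TRUE>>=
set.seed(4)
n <- 400
g <- data.frame(treated = rep(c(0, 1), each = n / 2),   # group indicator
                post    = rep(c(0, 1), times = n / 2))  # time indicator
# Both groups drift by +1 between periods; treatment adds +2 for treated at post
g$y <- 3 + 1 * g$post + 0.5 * g$treated + 2 * g$treated * g$post + rnorm(n)
# Difference in differences by hand:
with(g, (mean(y[treated == 1 & post == 1]) - mean(y[treated == 1 & post == 0])) -
        (mean(y[treated == 0 & post == 1]) - mean(y[treated == 0 & post == 0])))
# Equivalent regression; the interaction term recovers the same estimate (~2)
coef(lm(y ~ treated * post, data = g))["treated:post"]
@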
\hfill \\
\subsubsection{Instrumental Variables}
Here imagine we are interested in the causal effect of $D$ on $Y$, but our errors, or some unknown covariate ($u$), influence both our treatment and our outcome. What we can do is bring in an \textit{instrument} ($Z$): a variable that influences $D$ but affects $Y$ only through $D$. It helps me to think of ``pipes,'' where we want to see how much [insert liquid here] passes from $Z$ through $D$ to $Y$, never minding the covariate $u$.
\begin{center}
\begin{tikzpicture}[%
->,
shorten >=2pt,
>=stealth,
node distance=3cm,
noname/.style={%
rectangle,
minimum width=2em,
minimum height=2em,
draw
}
]
% Nodes:
\node[noname] (Z) {Z};
\node[noname] (D) [node distance=2cm, right=of Z] {D};
\node[noname] (Y) [right=of D] {Y};
\node[noname] (u) [circle, node distance=2cm, above right=of D] {u};
% Paths:
\path (D) edge node {} (Y)
(Z) edge node {} (D)
(u) edge node {} (D)
(u) edge node {} (Y);
\end{tikzpicture}
\end{center}
\hfill
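A minimal sketch of the two-stage least squares logic in R on simulated data ($u$ is the unobserved confounder and $Z$ the instrument; the names and numbers are mine): the naive regression is biased by $u$, while using only the part of $D$ driven by $Z$ recovers the true effect.
<<ivsketch, prompt=TRUE>>=
set.seed(5)
n <- 1000
u <- rnorm(n)                       # unobserved confounder of D and Y
Z <- rbinom(n, 1, 0.5)              # instrument: shifts D, affects Y only through D
D <- 0.7 * Z + 0.5 * u + rnorm(n)
Y <- 2 * D + u + rnorm(n)           # true causal effect of D is 2
coef(lm(Y ~ D))["D"]                # naive OLS: biased upward by u
# Two-stage least squares "by hand":
D_hat <- fitted(lm(D ~ Z))          # first stage: the part of D driven by Z
coef(lm(Y ~ D_hat))["D_hat"]        # second stage: close to 2
# (Packages such as AER's ivreg() do this with correct standard errors.)
@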
\subsection{Conclusion: Causal Inference}
In the end, whether you agree or disagree that causality can ever truly be found in our social world, the effort we take to ensure that theory is at the forefront of our models is never time wasted. Thinking through, seriously, the exact relationships between our variables and carefully crafting our models makes them more convincing, be it scientifically, in a business scenario, or in government work. It shows our hand, hides nothing, and clarifies our thinking. It encourages teamwork and collaboration around a shared thought and research agenda, and it helps in presentation by painting a clear picture of the work accomplished. Further chapters are not meant to be shortcuts around this chapter, but additions to it.
%----------------------------------------------------------------------------------
\clearpage
\section{Introductory Statistics}
\hfill \\
While causal effects are certainly the cream of the crop in terms of statistical findings, they are not necessary to examine our social world. In fact, in many cases it is not possible to get causal effects for human behavior. I cannot, for example, randomly assign a population to become Black, female, or Hispanic to study the effects of discrimination. However, there is an intersection of causality and correlation: the likelihood. That is, how likely is X to happen, given a list of set assumptions?
There's a well-known example involving the famous statistician Ronald Fisher. Story has it, someone in his office one day claimed that she was a passionate tea lover, so passionate that she could tell from taste alone whether the tea was added to the milk or the milk was added to the tea. Fisher, ever fun at parties, questioned the lady's claim with statistical analysis. How would we set up a \textit{causal} analysis of this, however? We don't have multiple universes of passionate tea-loving ladies to examine; yet we can test the lady's ability against random chance. Random guessing would give a 50/50 chance of success on each cup, so if the lady were to defy those odds to a sufficient degree we could declare her successful. As the story goes, the lady could not correctly guess the difference better than random chance after Fisher had her blindfolded and tested. Something deep inside me hopes this story is true.
This all begs the question: how do we statistically examine the degree to which someone ``defies the odds,'' or what can we call ``statistically significant'' success at something? Fisher, not one to leave the world hanging, had other projects besides breaking old ladies' hopes and dreams. In a study of manure and farming yields, Fisher noted that ``statistical significance'' could be set at three separate thresholds, best communicated using the associated p-values. These p-values are simply representations of the ``probability of unusualness,'' found via transformation of our standard errors given a set distribution. The thresholds are usually the 10\% tail (.10), the 5\% tail (.05), and the most unusual 1\% tail of observations (.01). If an observation, or a statistic from a population distribution, were to land in these tail regions, passing these thresholds, we could claim statistical significance. Fisher considered these thresholds ``usual, and convenient'' enough in the scientific enterprise to use them universally. This also means that the origins of the p-value are mired in farm manure.
In this section of the methods handbook I'll go over the fundamentals of t-tests, population distributions, z-scores, chi-squared testing, and other tests that all come back to the same notion of testing ``unusualness'' in the world. These may not be causal observations, but they can be powerful: these tests are deployed by everyone from cancer researchers to aerospace engineers; they have saved lives, and their perversion has conversely caused great harm. These examples and many more will be included along the way. One last note: there very well could be a separate section on our assumptions and the tests for them; however, I will integrate these issues alongside the tests and direct specific attention to them in the introduction to the OLS regression section.
\subsection{T Testing}
\index{t-test}
Directly, a t-score is a ratio of the difference between two groups to the variation within the groups. The larger the t-score, the more difference there is between groups; the smaller the t-score, the more similarity there is between groups. A t-score of 3 means that the groups are three times as different from each other as the variation within them. When you run a t-test, the bigger the t-value, the more likely it is that the result would replicate. When we work with t-tests we are typically working with samples rather than whole-population data (with a known $\sigma$, as we will see with z-testing). There are three common forms of the t-test:
\begin{itemize}
\item[1.] An Independent Samples t-test compares the means for two groups.
\item[2.] A Paired sample t-test compares means from the same group at different times (say, one year apart).
\item[3.] A One sample t-test tests the mean of a single group against a known mean. Say, a sample statistic to a census statistic.
\end{itemize}
Each of these follows the same theory, with slight variations to account for differences in whether the variance is known. Each tests against a standard null hypothesis of zero average effects ($H_0 : \mu_1 - \mu_2 = 0$); conversely, our alternative hypothesis can be that the difference is anything but zero ($H_A : \mu_1 - \mu_2 \neq 0$), or that it lies in only one tail ($H_A : \mu_1 - \mu_2 > 0$). However, our choice of alternative hypothesis doesn't really change the test much, just our interpretation of the critical values and whether we double the tail probability, similar to the z-test.
Formally, our t-test takes the following form when we are comparing our data to a known population mean:
$$ t = \frac{\bar{x} - \mu_0}{\sqrt{s^2 / n}} \sim t_{df=n-1} $$
Here $\mu_0$ is our postulated value of the population mean, whereas $\bar{x}$ is the sample mean. Our statistic now follows not a normal distribution but a Student's t-distribution, which is similar to a normal distribution but takes into account the degrees of freedom ($n-1$). Why subtract one? Because we ``spend'' one degree of freedom when estimating the sample mean; with additional parameters we ``use'' more degrees of freedom in the estimation process. In the denominator we have the sample variance divided by the sample size, all square-rooted. To find $s$:
$$ s = \sqrt{\frac{(x_1-\bar{x})^2 + \cdots + (x_n - \bar{x})^2 }{n-1}} $$
Recall that the degrees of freedom appear in the calculation of $s$, not in the t statistic itself. Without speeding ahead too fast, let's stop and consider an example before opening things up to two-sample differences. Suppose we gave a small class an exam from a well-published textbook. The textbook company claims that, on average, students get an 85\% on the exam. Is our class \textit{significantly} better or worse than the national average?
<<ttestex1, prompt=TRUE>>=
# Class scores:
class1 <- c(60,70,80,78,98,86,78,82,93,99,86,86,79,0,77)
# Finding s:
s <- sqrt( sum((class1 - mean(class1))^2) / (length(class1) - 1) )
# Finding our t statistic:
t <- (mean(class1) - 85) / (sqrt(s^2/length(class1)))
t
# p-value:
pt(t, df = length(class1)-1) # using t-distribution! (pt)
# Testing this against R's function
r_results <- t.test(class1, mu=85)
r_results$statistic
@
We see that, according to Fisher's definition of statistical significance, we pass the 10\% threshold of ``unusualness'' (one-tailed) but not the .05 or .01 significance levels.
Suppose we were to compare exam scores for our class to a comparable school across the local river. Here we would need to use the two-sample t-test. Luckily, the formula doesn't change much, except that we need a pooled sample standard deviation in the denominator:
$$ t = \frac{\bar{x}-\bar{y}}{Sp \sqrt{1/m + 1/n}} $$
Where our pooled standard deviation $Sp$ is:
$$ Sp = \sqrt{ \frac{ (x_1 - \bar{x})^2 + \text{...} + (x_m - \bar{x})^2 + (y_1 - \bar{y})^2 + \text{...} + (y_n - \bar{y})^2 }{m+n -2}} $$
Note the $m+n-2$ in the denominator: we lose two degrees of freedom because two sample means have already been estimated. This statistic follows the same style of t-distribution ($t_{m+n-2}$). Is our class significantly better than the other school's?
<<ttest2, prompt=TRUE>>=
# Two samples:
class1 <- c(60,70,80,78,98,86,78,82,93,99,86,86,79,0,77)
class2 <- c(40,100,81,88,98,86,90,82,93,99,86,86,80)
# Pooled SD (Cohen 1988); note the parentheses around the whole numerator:
sp <- sqrt( (sd(class1)^2 + sd(class2)^2) / 2)
# But wait, are the two standard deviations actually equal?
sd(class1) == sd(class2)
# They are not, so use the unpooled denominator instead:
t <- (mean(class1) - mean(class2)) /
  sqrt( sd(class1)^2/length(class1) + sd(class2)^2/length(class2))
# p-value; x2 for both tails
2*pt(t, df = (length(class1) + length(class2) - 2) )
r_results <- t.test(class1,class2)
r_results$statistic
r_results$p.value
@
Note that here we could not show that our standard deviations are equal, so we used a denominator that keeps the standard deviations separate instead of pooling them. Our t statistic is exactly what R's \texttt{t.test} function gives (its default is Welch's test); the p-values differ slightly because \texttt{t.test} uses the Welch degrees of freedom rather than $m+n-2$. There is some nuance in selecting exactly how we account for the standard deviations; for more information, consider Welch's t-test versus choosing the smaller of the two degrees of freedom as a conservative guard against Type I error.
Above I used a very sharp check of whether our standard deviations were equal; we can add some nuance and instead ask whether the standard deviations are \textit{significantly} different from one another. Testing this assumption can be done with an F-test. \index{F-test} Unsurprisingly, the ``F'' in F-test refers to Fisher, who thought the test could help researchers choose which variation of the t-test to use.
$$ F = \frac{s_x^2}{s_y^2} \sim F_{m-1,n-1,\alpha} $$
Using an F-test table we can find our critical value and test the null hypothesis that the two variances are equal. That is, if we reject the null hypothesis using the F-test we should not use a pooled test; if we cannot reject it, pooling the standard deviations is defensible.
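As a quick check in R (reusing \texttt{class1} and \texttt{class2} from the chunk above), the built-in \texttt{var.test()} function runs this F-test; a minimal sketch:
<<ftestsketch, prompt=TRUE>>=
# F-test of equal variances (H0: the ratio of the two variances is 1)
f_res <- var.test(class1, class2)
f_res$statistic   # F = ratio of the two sample variances
f_res$p.value     # compare to alpha to decide whether pooling is defensible
@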
\subsection{Z Testing}
$$ z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} $$ This is distributed standard normal: mean 0, standard deviation 1. For two samples:
$$ z = \frac{ \bar{x} - \bar{y} - E[\bar{x}-\bar{y}]}{SD(\bar{x}-\bar{y})} $$
Where $E[\bar{x}-\bar{y}]$ is just our null hypothesis statement; since our null hypothesis posits zero average ($E$) effects, this term is usually zero and drops out. So what we get is:
$$ z = \frac{\bar{x}-\bar{y} }{\sqrt{\sigma_x^2/m + \sigma_y^2/n}} $$
We would take this result, compare it with a z-table depending on our $\alpha$, and use our observed value either to reject or to fail to reject the null. These z-tables can be found online and basically act as the pass/fail reference for many of these scores.
If our standard deviations are equal then we can:
$$ z = \frac{\bar{x}-\bar{y} }{\sigma \sqrt{1/m + 1/n}} $$
Finding confidence intervals for all of this is as simple as taking the desired upper and lower thresholds (say 95 percent or 90 percent) and then reporting the upper and lower bounds in a table. Some models in the regression section will do this for variety.
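A minimal sketch of a two-sample z statistic and a 95 percent confidence interval in R, on simulated data where the population standard deviations are treated as known (which is what distinguishes the z-test from the t-test); everything here is made up for illustration:
<<ztestsketch, prompt=TRUE>>=
set.seed(6)
x <- rnorm(50, mean = 10, sd = 2)   # sigma_x = 2 treated as known
y <- rnorm(60, mean = 9,  sd = 2)   # sigma_y = 2 treated as known
se <- sqrt(2^2 / 50 + 2^2 / 60)
z  <- (mean(x) - mean(y)) / se
2 * pnorm(-abs(z))                                   # two-sided p-value
(mean(x) - mean(y)) + c(-1, 1) * qnorm(0.975) * se   # 95 percent confidence interval
@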
I found a really good article online about confidence intervals if you are looking for more notes on this:
\url{http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals_print.html}
\subsection{Chi-Squared Testing}
The chi-square test of independence is used to analyze the frequency table (i.e. contingency table) formed by two categorical variables. The chi-square test evaluates whether there is a significant association between the categories of the two variables.
Let's review an example of this in R. We'll pull R's built-in iris data and turn sepal length into a categorical variable with two levels, ``big'' and ``small'', depending on whether each value is above or below the median.
<<chisquared1, prompt=TRUE>>=
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
"small", "big"
)
# here's our contingency table:
table(dat$Species, dat$size)
@
This is great, and below we will use R functions to calculate that chi-squared statistic:
<<chisquared2, prompt=TRUE>>=
# calculate chi-squared (full results)
test <- chisq.test(table(dat$Species, dat$size))
test # full results
test$statistic # chi-squared result
test$p.value # p-value
@
From the output (and from \texttt{test\$p.value}) we see that the p-value is less than the 5\% significance level. As with any other statistical test, when the p-value is below the significance level we reject the null hypothesis, here the hypothesis that species and size are independent.
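The object returned by \texttt{chisq.test()} also stores the expected counts under independence, which is what the observed table is being compared against; a quick look:
<<chiexpected, prompt=TRUE>>=
# Expected cell counts if Species and size were independent:
test$expected
@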
Finally, we can combine the contingency table with the statistical results together in a concise visual:
<<chisquared3, prompt=TRUE>>=
# This gives us the ability to make mosaic plots pretty easily.
library(vcd)
mosaic(~ Species + size,
direction = c("v", "h"),
data = dat,
shade = TRUE
)
@
This mosaic plot with colored cells shows where the observed frequencies deviate from the frequencies we would expect if the variables were independent. Red cells mean the observed frequencies are smaller than the expected frequencies, whereas blue cells mean the observed frequencies are larger than the expected frequencies.
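The shading reflects the size of the Pearson residuals, which \texttt{chisq.test()} also returns; inspecting them directly recovers the same information as the colors (a sketch using the \texttt{test} object from above):
<<chiresid, prompt=TRUE>>=
# Pearson residuals: (observed - expected) / sqrt(expected).
# Large positive values mean more cases than expected (blue cells);
# large negative values mean fewer than expected (red cells).
round(test$residuals, 2)
@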
\subsection{Permutation Testing}
\index{permutation}
Much of this section on permutation testing comes straight from Thomas Leeper upon me meeting him at Ohio State in 2016. I highly recommend following him on Github, or checking out the relevant article for this section linked at the end of this section.
Thomas writes that an increasingly common statistical tool for constructing sampling distributions is the permutation test (sometimes called a randomization test). Like bootstrapping, a permutation test builds, rather than assumes, a sampling distribution (called the ``permutation distribution'') by resampling the observed data. Specifically, we can ``shuffle'' or permute the observed data (e.g., by assigning different outcome values to each observation from among the set of actually observed outcomes). Unlike bootstrapping, we do this without replacement.
Permutation tests are particularly relevant in experimental studies, where we are often interested in the sharp null hypothesis of no difference between treatment groups. In these situations, the permutation test perfectly represents our process of inference because our null hypothesis is that the two treatment groups do not differ on the outcome (i.e., that the outcome is observed independently of treatment assignment). When we permute the outcome values during the test, we therefore see all of the possible alternative treatment assignments we could have had, and where the mean-difference in our observed data falls relative to all of the differences we could have seen if the outcome were independent of treatment assignment. While a full permutation test requires that we see all possible permutations of the data (which can become quite a large number), we can easily conduct ``approximate permutation tests'' by simply drawing a very large number of resamples. That process should, in expectation, approximate the permutation distribution.
For example, if we have only n=20 units in our study, the number of permutations is:
<<perm1, echo=TRUE>>=
factorial(20)
@
That number exceeds what we can reasonably compute. But we can randomly sample from that permutation distribution to obtain the approximate permutation distribution, simply by running a large number of resamples. Let's look at this as an example using some made up data:
<<perm2, echo=TRUE>>=
set.seed(1)
n <- 100
tr <- rbinom(100, 1, 0.5)
y <- 1 + tr + rnorm(n, 0, 3)
@
The difference in means is, as we would expect (given we made it up), about 1:
<<perm3, echo=TRUE>>=
diff(by(y, tr, mean))
@
To obtain a single permutation of the data, we simply resample without replacement and calculate the difference again:
<<perm4, echo=TRUE>>=
s <- sample(tr, length(tr), FALSE)
diff(by(y, s, mean))
@
Here we use the permuted treatment vector \texttt{s} instead of \texttt{tr} to calculate the difference, and find a very small difference. If we repeat this process a large number of times, we can build our approximate permutation distribution (i.e., the sampling distribution for the mean-difference). We'll use \texttt{replicate} to repeat our permutation process. The result will be a vector of the differences from each permutation (i.e., our distribution):
<<perm5, echo=TRUE>>=
dist <- replicate(2000, diff(by(y, sample(tr, length(tr), FALSE), mean)))
@
We can look at our distribution using hist and draw a vertical line for our observed difference:
<<perm6, echo=TRUE, out.width='4in', warning=FALSE, message=FALSE, prompt=TRUE>>=
hist(dist, xlim = c(-3, 3), col = "black", breaks = 100)
abline(v = diff(by(y, tr, mean)), col = "blue", lwd = 2)
@
At face value, it seems that our null hypothesis can probably be rejected. Our observed mean-difference appears to be quite extreme in terms of the distribution of possible mean-differences observable were the outcome independent of treatment assignment. But we can use the distribution to obtain a p-value for our mean-difference by counting how many permuted mean-differences are larger than the one we observed in our actual data. We can then divide this by the number of items in our permutation distribution (i.e., 2000 from our call to replicate, above):
<<perm7, echo=TRUE>>=
sum(dist > diff(by(y, tr, mean)))/2000 # one-tailed test
sum(abs(dist) > abs(diff(by(y, tr, mean))))/2000 # two-tailed test
@
Using either the one-tailed test or the two-tailed test, our difference is unlikely to be due to chance variation observable in a world where the outcome is independent of treatment assignment.
There are many packages that assist in permutation testing and sampling; a few are used in later chapters. One Thomas Leeper uses in his article (quoted nearly verbatim in this chapter) is the ``coin'' library. Visit his article, linked below, to review this package and see whether it may help; a brief sketch of its interface follows the links. \\
\noindent \url{https://github.com/leeper}
\noindent \url{https://thomasleeper.com/Rcourse/Tutorials/permutationtests.html}
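As a minimal, unevaluated sketch of what that interface looks like (assuming the \texttt{coin} package is installed; check its documentation for current argument names), an approximate permutation test of \texttt{y} against \texttt{tr} from the chunks above might be written as:
<<coinperm1, eval=FALSE, prompt=TRUE>>=
# Approximate (Monte Carlo) permutation test of independence between the
# outcome y and the treatment indicator tr defined earlier.
# Not evaluated here; requires the coin package to be installed.
library(coin)
independence_test(y ~ factor(tr), distribution = "approximate")
@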
\clearpage
\section{Regression:}
\begin{equation}
\mathlarger{ y = \beta_0 + \beta X + \epsilon }
\end{equation}
The General Linear Model (GLM) consists of \textit{stochastic} and \textit{systematic} components. The stochastic component is the part of the equation that varies (``randomly'') and contains our error term ($\epsilon$), whereas the systematic component is fixed given the data values ($X$) and is used to model our dependent variable ($y$), giving us interpretable values in the form of coefficients ($\beta$). To ``generalize'' this to other models (as we'll see in this document) we need a distribution and a link function, which will come up later.
It should be mentioned that this GLM formula has its genesis in basic geometry's equation for a line, $y = mx + b$; the connection should be intuitive, as we attempt to fit a line to our data and estimate it using maximum likelihood.
No matter which model we specify throughout this document, be it logit, probit, Poisson, or a survival model, the linear predictor will always be the same:
\begin{equation}
\mathlarger{ \eta = X\beta } \quad \text{or} \quad
\mathlarger{ \eta_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_q x_{iq} }
\end{equation}
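As a small illustration (the design matrix and coefficients below are made up), the linear predictor is just the matrix product of the data and the coefficient vector:
<<linpred1, prompt=TRUE>>=
# Hypothetical design matrix (intercept plus two covariates) and coefficients:
X    <- cbind(1, x1 = c(0.5, 1.2, -0.3), x2 = c(2, 0, 1))
beta <- c(0.76, 0.26, -0.10)
eta  <- X %*% beta   # the linear predictor, eta = X beta
eta
@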
One example of when we might use regression is when trying to determine the best quality of beer based upon mean judged ratings (these are the questions that matter most). Beer ratings range from $-1$: ``Fair'', to $0$: ``Good'', and $1$: ``Great.'' Consider our first model, where we consider how expensive the beer is:
\begin{equation}
\mathlarger{ \text{Beer Rating} = \beta_0 + \beta \text{Beer Price} + \epsilon }
\end{equation}
\noindent Given the information from our data regressed in R: (This is shown on the following page)
\begin{equation}
\mathlarger{ \text{Beer Rating} = (-0.76) + (.26) \text{Beer Price} + \text{error} }
\end{equation}
We can solve this to find our beer rating by substituting in a beer price; we drop the error term because we assume it is normally distributed and centered at zero (an assumption that will come up in the following section). Let's assume we are going out with friends and only have \$2.00:
\begin{equation}
\mathlarger{ \text{Beer Rating} = (-0.76)+(.26)\cdot 2.00 = -.24 }
\end{equation}
This means, given that our average (``Good'') beer rating is zero, that if we only had two dollars to spend on a beer we would get a slightly below-average beer ($-.24$). This assumes the relationship between beer price and beer rating is linear; that isn't an assumption we always have to make, but it is one we will make for now.
\clearpage
\noindent Showing this in R, let's load and set up our data and run the model that we used to get the numbers above: \index{loading data} \index{recoding data}
<<BeerData, warning=FALSE, prompt=TRUE, message=FALSE>>=
## Initial pass: reading data, looking at variables.
library(readr) # For reading data.
# KayleeDavisGithub/Graduate_Methods_Handbook/master/data/beer.csv
beer_data <- read_csv("https://raw.githubusercontent.com/KayleeDavisGithub/Graduate_Methods_Handbook/master/data/beer.csv")