-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy path08-App_Baseball.Rmd
871 lines (658 loc) · 54.8 KB
/
08-App_Baseball.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
# Baseball Benchmarking Applications
## Introduction
```{r, include=FALSE, eval=FALSE}
library(bookdown); library(rmarkdown); rmarkdown::render("08-App_Baseball.Rmd", "pdf_book")
```
```{r, include=FALSE, eval=FALSE, warning=FALSE, messages=FALSE}
library(devtools);devtools::install_github("aurokp2/TFDEA_V2")
# Install TFDEA package from Auro's updated fork
```
Now that we have covered an introduction to benchmarking using data envelopment analysis, let's look at a variety of models applied to sports - in particular, baseball.
Sports has a long history in \index{Operations research} operations research and \index{Management science} management science, well before the idea of sports analytics was popularized by Michael Lewis' *Moneyball* [@lewis2004]. While it doesn't have the life and death significance of emergency relief or health care applications, it provides quite a few benefits for explaining complex analysis methods:
- Accurate and well curated historical data
- Easy to understand models
- Opportunities to validate results
- Clear, quantifiable, and objective metrics
- Input-output models can be agreed upon
The range of applications can cover benchmarking individual players, managers, general managers, and teams. Analyzing individual players is the most common type of baseball application and will be the bulk of this chapter. Each baseball player typically plays a role as a batter, a fielder, and/or a pitcher. The analysis of baseball batters is an updated version of an earlier paper [@Andersonnewmeasurebaseball1997a].
For those readers not interested in baseball, this can be viewed as a set of case studies examining individual performance and organizational performance from a variety of perspectives. It also provides a rich example of data preparation for doing a benchmarking study.
## Baseball Data, R, and dplyr
For the sake of readers unfamiliar with baseball, we will provide a brief introduction. If you want to think of a baseball batter as a simple factory that uses Plate Appearances as a resource to produce five different kinds of products. The following discussion gives more context for baseball fans.
Batting is in many ways the simplest aspect of baseball to examine. In the abstract, a batter uses their at-bats or plate appearances as opportunities to create events that help create runs for their team. In each game, a batter typically has four to six at-bats. These at-bats may be framed as a contest between a pitcher trying to make the batter create an out and the batter trying to get a hit. Hits include singles, doubles, triples, and home runs, in increasing order of value. A fifth common outcome is a walk or "base-on-balls" which occurs when the pitcher throws four balls that gives the batter a free pass to first base. There is a common expression that "a walk is as good as a hit" but a walk may not give as much advancement to other runners as a single so it may be considered as slightly less valuable than a single.
In the past, batters weren't given much "recognition" for walks and a batter's receiving a walk did not count as an at-bat. Plate appearances count the opportunities for creating a hit or a walk and is sum of at-bats and walks. There are a variety of less frequent outcomes of a plate appearance such as a sacrifice fly, an error, or hit by pitch. These are all less common and often an accidental outcome from the batter's perspective. Therefore, we will exclude these items from the analysis.
```{r load_ch8_packages, echo = TRUE, message = FALSE, warning=FALSE}
library(TRA) # Install from github if not available
# library(devtools);install_github("prof-anderson/TRA")
```
Our exploration will also serve as a demonstration of data preparation in R. There are many ways of doing this. Feel free to explore further.
A good model of baseball batting is that a batter uses plate appearances to produce singles, doubles, triples, home runs, and walks. Baseball fans might think of other outcomes such as sacrifice flies, but these are generally less common, less desirable, and often not even intentional. These five outcomes constitute a model that fits well enough. The following code chunk demonstrates how to build the DEA input-output diagram for this model.
```{r ch8_draw_bb_io_diagram, message = FALSE, warning=FALSE}
library(rsvg) # Used for saving the figure as an image.
library(DiagrammeRsvg)
XFigNames <- "PA (Plate Appearances)"
YFigNames <- c("BB (Base on Balls or Walks)",
"1B (Singles)",
"2B (Doubles)",
"3B (Triples)",
"HR (Home Runs)")
Figure <- DrawIOdiagram (XFigNames,YFigNames, '"\n\nBatter\n\n "')
tmp <- capture.output(rsvg_png(charToRaw (export_svg(Figure)),
'BaseballBatting.png'))
```
![Model of Baseball Batting](BaseballBatting.png){#fig:BaseballBatting}
Now, let's wrestle with data sources. Sean Lahman's baseball database has been made available as an R package. This is a wonderful resource with a broad range of data for batters, pitchers, teams, and more going back to 1871. <http://seanlahman.com/>
We will restrict our attention to batters and the corresponding `Batting` table. The following code chunk loads relevant packages and then displays a sample of the first six rows in the `Batting` table. Each player is given a `playerID`. To illustrate data processing, we will use the `filter` command from `dplyr` to include only Ty Cobb (`playerID="cobbty01"`).
```{r display-batting-data, echo=FALSE, message=FALSE}
library (Lahman) # Lahman package of baseball statistics
library (dplyr) # Package for data management
library (kableExtra) # Package for nice table formatting
kbl (head(filter(Lahman::Batting,playerID=="cobbty01"),5),
booktabs = T, caption="Sample of Lahman Baseball Data") |>
kable_styling(latex_options = c("HOLD_position", "scale_down"))
```
What does this all mean? The first column, `playerID`, is a uniquely coded ID for each player consisting of the first six letters of the last name, first two letters of the first name, and which instance of that eight-letter combination the player is. For example, row 5 is `cobbty01,` corresponds to the Hall of Famer, Ty Cobb. Each row gives the statistics for a year that he had with a particular team. If the player plays for multiple teams in a year, they will have multiple rows. Also, if their career lasts more than one year, they will have more than one row.
In this case, in 1905 Ty Cobb was a member of the Detroit Tigers (`teamID=DET`) in the American League (`lgID=AL`) where he played in 41 Games, had 151 at-bats, scored 19 runs, and had 36 hits. These hits consisted of 6 doubles, 0 triples, and one home run, leaving the remainder as 29 singles ($36-6-1=29$). He also had 15 runs-batted-in, 2 stolen bases, 10 walks or base-on-balls, 23 strikeouts, and no hit-by-pitches. There are frequent entries marked as `NA` or Not Applicable/Not Available. These were not official statistics at the time so there is no record of the times he was caught stealing (`CS`), the intentional bases on balls (`IBB`), sacrifice flies (`SF`), and grounded into double plays (`GIDP`). It is impressive to have such detailed and specific data going back over 120 years!
Let's use the summary command to look at a few things of interest.
```{r}
kbl (summary(select(Lahman::Batting, playerID, yearID, stint, lgID)),
booktabs = T, caption="Summary of Lahman Baseball Data Stints") |>
kable_styling(latex_options = c("HOLD_position", "scale_down"))
```
The first column, `playerID`, indicates that there are 113,799 rows in the data set indicating separate player season stints across the entire data set. The `yearID` column is more interesting and says that the first year was 1871 but half of all player stints occurred in 1979 or more recently highlight that more people play professional baseball in recent years and the most recent year was 2023 for the Lahman package as of version 12.0. The `stint` column indicates that most players play for the same team throughout the whole season (median=1.0) but at least one player had stints on 5 teams in one year. The league ID, (`lgID`) highlights that the two most common leagues are the American League and National League, both with over 55,000 player stints.
```{r, echo=FALSE}
kbl (summary(select(Lahman::Batting, AB, H, X2B, X3B, HR, BB)),
booktabs = T, caption="Summary of Lahman Baseball Batting Data by Stint") |>
kable_styling(latex_options = c("HOLD_position", "scale_down"))
```
The batting related metrics are more relevant to our modeling. The number of at-bats (`AB`) ranged from 0 to 716 but the mean is only 137 which might not be too surprising as there are many reasons for people to not have many opportunities including pitchers that only played a limited number of games, platooning, backup players, mid-season trades, and injury shortened seasons. In any case, this highlights that it will be important to limit the dataset to full-time players as will be discussed later. The maximum number of hits (`H`) in a season was 262 but the two most surprising numbers for me are that someone hit 36 triples (`X3B`) in a season and someone else had 232 walks (`BB`).
Using the Summary command will provide all kinds of interesting information such as the most player seasons were for the National League's Chicago Cubs (t`eamID=CHN`). I'll leave this for your exploration.
Let's start by subsetting the data in terms of including only the columns that we will be using. To do this, we will again use the `select` command from `dplyr`. For demonstration purposes, we will restrict this to Honus Wagner who was in the inaugural class of five members in the Hall of Fame. While his career spanned 21 years, we will examine the first six of his years.
```{r subsetting-data}
batting<-filter(Lahman::Batting, playerID=="wagneho01")
batting<-batting[1:6,] # First six rows/years
batting <- dplyr::select(batting, playerID, yearID, teamID,
lgID, AB, H, X2B, X3B, HR, BB)
```
```{r, echo=FALSE}
kbl (head(batting), booktabs = T,
caption="Batting Data Used for Analysis") |>
kable_styling(latex_options = c("HOLD_position"))
```
Through the 2023 season, this dataset has over 100,000 rows or player seasons. The game of baseball went through a lot of change in teams, rules, and other development and so it may be more useful to focus our attention on the period after 1900.
```{r filtering-data}
batting<-dplyr::filter(batting,yearID>1900)
```
```{r, echo=FALSE}
kbl (head(batting), booktabs = T,
caption="Batting Statistics for Players after 1900") |>
kable_styling(latex_options = c("HOLD_position"))
```
As we discussed earlier, At-Bats (`AB`) is good but it doesn't include the times when the batter draws a walk. Let's convert that to plate appearances, `PA`. Also, let's get the number of singles, `X1B` which is simply the number of hits minus the number of hits longer than a single (doubles, triples, and home runs.) The `mutate` function is used for changing or modifying the values of column to create a new column.
```{r converting-data}
batting<-dplyr::mutate(batting,X1B=H-HR-X3B-X2B)
batting<-dplyr::mutate(batting,PA=AB+BB)
kbl (head(batting), booktabs = T,
caption="Calculate Singles and Plate Appearances") |>
kable_styling(latex_options = c("HOLD_position", "scale_down"))
```
Lastly, now that we feel confident that our data transformation for calculating singles and plate appearances work, let's drop the old columns for hits and at-bats as well as reordering them using the `select` command.
```{r selecting-columns}
batting <- dplyr::select(batting, playerID, yearID, teamID,
lgID, PA, BB, X1B, X2B, X3B, HR)
```
```{r, echo=FALSE}
kbl (head(batting), booktabs = T,
caption="Batting Statistics Converted into Inputs and Outputs") |>
kable_styling(latex_options = c("HOLD_position"))
```
A small number of plate appearances can result in a problem of setting an unfair target of performance whether due to a statistically insignificant lucky streak or only batting against pitchers that they have an advantage. Over the course of a 162-game season, most full-time players will have more than 500 plate appearances. A few players may exceed 700 plate appearances in a season. Now we will trim out player stints to only include those that have more than 400 (`MinPA`) plate appearances to limit the impact of part-time players or platooning players. In the case of Honus Wagner, this would have removed his first season (1897) but we already removed that due to being before 1900.
```{r selecting-fulltime-players}
MinPA <- 400
batting <- dplyr::filter(batting,PA>MinPA)
```
Let's consolidate all of this into a single piped set of commands for all players.
```{r preparing-the-data-with-dplyr}
batting<-Lahman::Batting
batting <- batting |>
select(playerID,yearID,teamID,lgID,AB,H,X2B,X3B,HR,BB) |>
mutate(PA=as.numeric(AB+BB)) |>
mutate(X1B=as.numeric(H-HR-X3B-X2B)) |>
transform(BB=as.numeric(BB)) |>
transform(X2B=as.numeric(X2B)) |>
transform(X3B=as.numeric(X3B)) |>
transform(HR=as.numeric(HR)) |>
filter(yearID>1900, PA>MinPA) |>
select(playerID,yearID,teamID,lgID,PA,BB,X1B,X2B,X3B,HR)
```
```{r, echo=FALSE}
kbl (head(batting), booktabs = T,
caption="Results of Preparing Data using dplyr") |>
kable_styling(latex_options = c("HOLD_position"))
```
Now that we have data prepared along with the tools for further manipulation, we are ready to proceed onto the analysis.
## Benchmarking Baseball Batters
Our basic input-output model for baseball batting consists of plate appearances as the sole input and the five most common results of types of hits along with walks.
In 1920, the American League featured a batter so dominant that he transformed the "industry" of baseball. Let's focus on this year for now. We will also limit the dataset to just the American League as players at this time only played against the other league in the World Series at the end of the season.
The core model for analysis is described in the following figure. We are going to use an output-oriented, constant returns to scale model. Output orientation is a better fit than an input-orientation because the batter is trying to create more hits rather than reducing plate appearances. The constant returns to scale was selected because increasing plate appearances should have a proportional increase in hits. Recall the first figure in this chapter illustrating the input-output model used.
Let's use the `filter` command again to create a dataset of only 1920 American League data and arrange into separate input (`PA`) and batting outputs.
```{r preparing-IO-data}
inputs <- c("PA")
outputs <- c("BB", "X1B", "X2B", "X3B", "HR")
batting1920AL <- batting |>
filter(yearID==1920, lgID=="AL")
x <- batting1920AL |> select( PA)
row.names(x)<-batting1920AL[,1]
y <- batting1920AL |> select(BB, X1B, X2B, X3B, HR)
row.names(y)<-batting1920AL[,1]
```
```{r, include=TRUE, echo=FALSE}
kbl (head (cbind(x,y)), booktabs=T,
caption="Sample of 1920AL Data for DEA") |>
add_header_above(c("Player ID" = 1, "Input" = 1,
"Outputs" = 5)) |>
kable_styling(latex_options = c("HOLD_position"))
```
Let's use a constant returns to scale output-oriented multiplier. Recall the formulation from Chapter 5. The weight of input *i* is $v_i$ and the weight on output *r* is $u_r$.
$$
\begin{split}
\begin{aligned}
\text {min } & \sum_{i=1}^{N^X} v_i x_{i,k}\\
\text{s.t.: } & \sum_{r=1}^{N^Y} u_r y_{r,k} =1 \\
& \sum_{i=1}^{N^X} v_i x_{i,j} - \sum_{r=1}^{N^Y} u_r y_{r,j}
\geq 0 \; \forall \; j\\
& u_r, \;v_i\geq 0 \; \forall \; r,i
\end{aligned}
\end{split}
$$
Given the one input, five output case, we could write an explicit linear program as the following for each player *k*.
$$
\begin{split}
\begin{aligned}
\text {min } & v_{PA} x_{PA,k}\\
\text{s.t.: } & u_{BB}x_{BB,k} + u_{1B}x_{1B,k} +
u_{2B}x_{2B,k} + u_{3B}x_{3B,k} + u_{HR}x_{HR,k} =1 \\
& v_{PA} x_{PA,j} - \\
& (u_{BB}x_{BB,j} + u_{1B}x_{1B,j} +
u_{2B}x_{2B,j} + u_{3B}x_{3B,j} + u_{HR}x_{HR,j} )
\geq 0 \; \forall \; j\\
& v_{PA}, \;u_{BB}, \; u_{1B}, \; u_{2B}, \; u_{3B}, \; u_{HR}\geq 0
\end{aligned}
\end{split}
$$
We are now ready to conduct a DEA study of the batters. For this analysis we will use the `MultiplierDEA` \index{MultiplierDEA} package by Aurobindh Kalathil Puthanpura. This package is focused on the multiplier model which lends itself naturally to some additional analyses we will conduct.
```{r doing-DEA-multDEA, warning=FALSE, messages=FALSE}
library(MultiplierDEA)
res1920AL<-DeaMultiplierModel (x,y, rts="CRS", orientation="output")
theta <- res1920AL$Efficiency
phi <- 1/theta
table1920AL <- cbind(phi,theta, res1920AL$vx, res1920AL$uy)
colnames (table1920AL) <-
c("$\\phi$", "$\\theta$", "$v_{PA}$", "$u_{BB}$",
"$u_{1B}$", "$u_{2B}$", "$u_{3B}$", "$u_{HR}$")
kbl (head(table1920AL,10), booktabs=T, escape=F, digits=5,
caption="CCR-OO Scores for 1920 AL") |>
add_header_above(c(" " = 1, "Efficiencies"=2, "Input Weight" = 1,
"Output Weights" = 5)) |>
kable_styling(latex_options = c("HOLD_position", "scale_down"))
```
Notice that this is an output-oriented model so the values greater than 1.0 indicate that decreasing radial efficiency (increasing inefficiency) and describe the amount of additional outputs that should be achieved at the same level of input usage. In other words, how many more walks, singles, doubles, triples, and home runs the player should be producing in the same number of plate appearances if he was batting *efficiently*. As discussed in the chapter on the output-oriented model, under constant returns to scale, the input and output-oriented models are reciprocals of each other ($\theta=\frac{1}{\phi}$)
Let's dig a little deeper now. Let's start by reviewing the output weights for the first few players.
A few things stand out. First, only two of the first ten players put any weight on home runs. There should be a hierarchy that singles are at least as good as walks, doubles are at least as good singles, triples are at least as good as doubles, and home runs are at least as good triples. This means that every player violates at least one of these relationships. Any plausible analysis of baseball batting needs to reflect these relationships. While very different, common metrics such as batting average and slugging average satisfy these relationships, but our first DEA model does not.
In Anderson and Sharp's 1997 paper, ordinal weight restrictions were incorporated into the envelopment model for examining baseball batters by accumulating hits [@anderson1997]. This created new outputs of Singles or better (BB+X1B), doubles or better (BB+X1B+X2B), *etc.* This ordinal approach to weight restrictions can be done as a simple preprocessing step.
Another approach is to analyze the data set using the \index{Multiplier model} multiplier model and directly incorporating weight restrictions. The \index{DEAMultiplier} `DEAMultiplier`package written by Aurobindh Kalathil Puthanpura has options for integrating weight restrictions.
Notice that the scores are up to 1.0 rather than 1.0 and higher as in the \index{TFDEA} TFDEA package. This is because the TFDEA package reports efficiency as $\phi$ whereas the \index{MultiplierDEA} `MultiplierDEA` package reports $\frac{1}{\phi}$. The results are consistent and reciprocals. The value of $\phi$ is consistent with the formulation of the \index{Output-oriented} output-oriented model. The value of $\frac{1}{\phi}$ has the same interpretation $\theta$ where a value less than 1.0 indicates inefficiency. In fact, for the case of constant returns to scale (CRS), the orientation does not make a difference in the value of inefficiency found, in other words, $\theta=\frac{1}{\phi}$ For the sake of this discussion, we will interpret the results as $\frac{1}{\phi}$.
Now, let's look at the results.
```{r, echo=FALSE}
hist (res1920AL$Efficiency,
xlab="Efficiency Score",
main="1920 American League Batter Efficiency")
```
```{r warning=FALSE, messages=FALSE}
library(ggplot2)
qplot (res1920AL$Efficiency,
geom="dotplot", binwidth=0.01,
main = "1920 American League Batter Efficiency",
xlab = "Efficiency Score",
fill = I("blue") )
```
Next, let's review the output weights.
Again, we see that the results are similar to what was observed earlier. Note that the results for Eddie Collins (`collied01`) are different. Eddie Collins was efficient and it is common that efficient players have multiple weighting schemes that still result in being efficient. The \index{MultiplierDEA} `MultiplierDEA` package puts weight on triples (`X3B`).
In any case, the output weights do not meet the requirements of a realistic baseball application. Let's apply weight restrictions.
We want to enforce the following restrictions on our output weights, *u*.
$$
\begin{split}
\begin{aligned}
\ & u_{2B} \geq u_{BB}\\
\ & u_{2B} \geq u_{1B}\\
\ & u_{3B} \geq u_{2B}\\
\ & u_{HR} \geq u_{3B}\\
\end{aligned}
\end{split}
(\#eq:Ch8BasicWeightRestrictions)
$$
A more generalized data structure for the weight restrictions looks at these as a series of ratios.
$$
\begin{split}
\begin{aligned}
\ & 1 \leq \frac{u_{2B}} {u_{BB}} \leq \infty \\
\ & 1 \leq \frac{u_{2B}} {u_{1B}} \leq \infty \\
\ & 1 \leq \frac{u_{3B}} {u_{2B}} \leq \infty \\
\ & 1 \leq \frac{u_{HR}} {u_{3B}} \leq \infty \\
\end{aligned}
\end{split}
(\#eq:Ch8RatioWeightRestrictions)
$$
We can now specify these ratios in a data frame that will be passed to `multiplierDEA` package by defining a lower bound, numerator, denominator, and upper bound for each weight restriction ratio relationship. Note that rather than specifying the upper limit of infinity, we declare it as `NaN` to indicate that it is not a number.
```{r}
BattingWR<-data.frame(lower = c(1.0, 1.0, 1.0, 1.0),
numerator = c("X2B", "X2B","X3B", "HR"),
denominator = c("BB", "X1B", "X2B","X3B"),
upper = c(NaN, NaN, NaN, NaN))
```
```{r, echo=FALSE}
kbl(BattingWR, booktabs=T,
caption=
"Data Structure for Batting Weight Restrictions") |>
kable_styling(latex_options = c("HOLD_position"))
```
```{r, warning=FALSE, message=FALSE}
res1920ALmultWR<-DeaMultiplierModel(x,y,rts = "crs",
orientation="output",
weightRestriction = BattingWR)
```
Let's examine the relative output weights given with and without weight restrictions.
```{r, echo=FALSE, message=FALSE}
weights <- data.frame(cbind( res1920AL$Efficiency,res1920AL$uy,
res1920ALmultWR$Efficiency,res1920ALmultWR$uy))
kbl(head(weights
[order(weights[,1], decreasing=TRUE),],20), digits=5, booktabs=T,
caption="Output Weights", escape=F,
col.names=c("$1/\\phi$", "$u_{BB}$", "$u_{1B}$", "$u_{2B}$", "$u_{3B}$", "$u_{HR}$",
"$1/\\phi$", "$u_{BB}$", "$u_{1B}$", "$u_{2B}$", "$u_{3B}$", "$u_{HR}$")) |>
add_header_above( c(" " = 1, "Standard Input-Oriented CRS" = 6,
"With Weight Restrictions Added" = 6)) |>
kable_styling(latex_options = c("HOLD_position", "scale_down")) |>
footnote (general="Top 20 Efficient 1920 American League Batters")
```
Let's talk about a particular hitter, Eddie Collins. Without weight restrictions, he was efficient. With weight restrictions, his efficiency was `r res1920ALmultWR$Efficiency["collied01",]`. In fact, the histogram of efficiency scores is more telling.
```{r}
binwidth <- 0.01
hist(res1920AL$Efficiency, xlim=c(0.6,1.0),
col="red", breaks=seq(0.6,1.0,by=binwidth),
main="Impact of Weight Restrictions on Batter Efficiency",
xlab="Efficiency")
hist(res1920ALmultWR$Efficiency,
add=T,col=scales::alpha('blue',.5), border=F,
breaks=seq(0.6,1.0,by=binwidth) )
```
The weight restrictions have dropped the number of efficient players from nine to two. As might be expected, Babe Ruth was efficient in both models.
### Cross-Efficiency of Baseball Batters
Now, let's turn our attention to a variation of DEA called \index{Cross-efficiency} cross-efficiency. Cross-efficiency was discussed in an earlier chapter. Proponents of Cross-efficiency contend that it can differentiate between players (DMUs) that are DEA efficient without resorting to weight restrictions. Simple CRS DEA finds nine out of 54 players to be efficient. We will focus on the ten players with the top CRS efficiency scores.
```{r, warning = FALSE}
res1920ALcross<-CrossEfficiency(x, y, rts = "crs",
orientation="output")
res1920ALCEWR<-CrossEfficiency(x, y, rts = "crs",
orientation="output",
weightRestriction = BattingWR)
```
```{r, include=FALSE}
temp <- data.frame(cbind(res1920AL$Efficiency,
t(res1920ALcross$ce_ave),
t(res1920ALcross$ceva_max)))
colnames(temp)<-c("Efficiency", "Cross-Efficiency", "Max Cross-Evaluation")
kbl(head (temp,10), booktabs=T, digits=5,
caption="Cross-Efficiency Scores for First Ten 1920 AL Batters") |>
kable_styling(latex_options = c("HOLD_position"))
```
This table of batters is simply in alphabetical order but illustrates that the maximum cross-evaluation score is the same as the efficiency score.
Let's now examine the top ten most efficient hitters sorted by efficiency and their cross-efficiency scores with and without weight restrictions imposed.
```{r}
collectedeff1 <- data.frame(cbind(as.numeric(res1920AL$Efficiency),
as.numeric(res1920ALmultWR$Efficiency),
as.numeric(t(res1920ALcross$ce_ave)),
as.numeric(t(res1920ALCEWR$ce_ave))))
rownames(collectedeff1)<-rownames(res1920AL$Efficiency)
```
```{r}
kbl(head(collectedeff1
[order(collectedeff1[,1], decreasing = TRUE),
1:4], # Display first four columns
10), # Display top 10 rows
booktabs=T, digits=5,
col.names=c("Standard", "Weight Restricted",
"Standard", "Weight Restricted"),
caption=
"1920 AL Top Ten Input-Oriented CRS Efficient Players") |>
add_header_above(c(" " = 1, "Efficiency" = 2, "Cross-Efficiency" = 2)) |>
kable_styling(latex_options = c("HOLD_position"))
```
Baseball fans may recognize some of the names:
- "Shoeless" Joe Jackson (`jacksjo01`) was a Hall of Fame caliber player, but his career was cut short in 1920 after having one of his best seasons after being embroiled in a World Series scandal.
- Tris Speaker (`speaktr01`) and Eddie Collins (`collied01`) were not just Hall of Famers but often considered among the very best players of all time. Despite this, neither player is well known by baseball fans a century after they started playing - their biggest problem was that there were a few colleagues whose stars shown even brighter.
- George Sisler (`sislege01`), was a Hall of Famer but not generally considered to be at the level of Speaker and Collins except for a brief peak of 1920 and 1922.
- Harry Hooper (`hoopeha01`) and Sam Rice (`ricesa01`) were star players and made the Hall of Fame but was a big step below Speaker and Collins.
- George Herman "Babe" Ruth (`ruthba01`) has already been discussed at length.
- Eddie Collins, like Tris Speaker, was among the very best players of all time.
- Bob Meusel (`meusebo01`), Joe Judge (`judgejo01`), and Joe Dugan (`duganjo01`) had good careers but have never been thought of as all-time great batters.
```{r, echo=FALSE}
temp<-as.matrix(cbind(res1920AL$Efficiency,
res1920ALmultWR$Efficiency,
t(res1920ALcross$ce_ave),
t(res1920ALCEWR$ce_ave)))
colnames(temp)<-c("Efficiency", "Efficiency/WR", "Cross-Efficiency", "Cross-Efficiency/WR")
corrtemp <- rbind( cor(temp, method="pearson"), cor(temp, method="spearman"))
```
```{r, echo=FALSE}
kbl(corrtemp, booktabs=T,
caption="Correlations between Efficiency, Cross-Efficiency, and Weight Restricted Results") |>
pack_rows ("Pearson Correlation", 1, 4) |>
pack_rows ("Spearman Rank Correlation", 5, 8) |>
kable_styling(latex_options = c("HOLD_position", "scale_down"))
```
Both simple Pearson's correlation and the Spearman's rank correlation give similar scores, but the Spearman rank correlations are generally higher. This indicates that the scores for the `r nrow(res1920AL$Efficiency)` players are numerically related and their relative ranking is a little more similar.
The \index{Cross-efficiency} cross-efficiency and the \index{Regular efficiency} regular efficiency scores are related - they have a correlation of `r round(corrtemp[1,2], 3)` but the cross-efficiency has a 0.910 correlation with the weight restricted DEA results. This suggests that perhaps cross-efficiency can help deal with decreasing the impact of unrealistic weight schemes but let's take a closer look at the results.
While all of the correlations are strong, they are being done over the entire population of `r nrow(res1920AL$Efficiency)` batters. On the other hand, DEA is about benchmarking against the best. Let's dig into the results more deeply by examining one hitter in particular. In 1920, Babe Ruth had one of the best seasons ever by a baseball hitter. Various statistics demonstrate his dominance but home runs really highlight the impact. His 54 home runs in 1920 nearly doubled the previous major league baseball record of 29 in a season he had set the previous year. Alone, he hit more home runs than 14 of the 16 teams in the leagues. Until the recent steroid era, Babe Ruth's 1920 and 1921 seasons are often considered to be the two best offensive seasons in baseball.
Babe's \index{Cross-efficiency} cross-efficiency score was `r round(res1920ALcross$ce_ave[1,"ruthba01"],6)`. Coincidentally, his cross-efficiency score was nearly identical to the average cross-efficiency of other regular baseball batters of `r round(mean(res1920ALcross$ce_ave),6)`. In other words, what may have been one of the best batting performances in a century of baseball was wrongly misclassified as being just average by cross-efficiency. The reason for this can be seen by the cross-efficiency matrix of weights presented earlier that showed many batters put zero weight on home runs because they paled in comparison to the best performer on home runs, Babe Ruth.
```{r, fig.show='hold', fig.align='center', fig.cap="Impact of Weight Restrictions on Efficiency"}
collectedeff1 <- data.frame(cbind(res1920AL$Efficiency,
res1920ALmultWR$Efficiency,
t(res1920ALcross$ce_ave),
t(res1920ALCEWR$ce_ave),
rownames(collectedeff1)))
colnames(collectedeff1)<-c("CCRIO", "CCRIOWR", "CE", "CEWR", "nameID")
CRSvsCRSWR <- ggplot (collectedeff1, aes(x=as.numeric(CCRIO),
y=as.numeric(CCRIOWR))) +
xlab ("Input-Oriented CRS Efficiency") +
ylab ("Input-Oriented CRS Efficiency with Weight Restrictions") +
ggtitle ("1920 AL Input-Oriented Constant Returns to Scale Model") +
geom_point(shape=1) +
geom_abline(slope=1, color="red") +
geom_text(data=subset(collectedeff1, as.numeric(CCRIO) > 0.999 ),
nudge_x = 0.03, aes(x=as.numeric(CCRIO),
y=as.numeric(CCRIOWR),
label=nameID))
CRSvsCRSWR
```
This chart of efficiency vs. efficiency with weight restrictions clearly shows that no one has their efficiency improved by the imposition of weight restrictions as no one is above the diagonal line of equal scores in both models. All but two players are below the line indicating that their efficiencies went down. The only two players with the same score in both models: `sislege01` (George Sisler) and `ruthba01` (Babe Ruth). Sisler and Ruth are efficient in both models. The other formerly efficient batters are now inefficient when unrealistic weighting schemes are disallowed. Their IDs can be read down the right hand side of the chart.
Let's examine similar plots to examine cross-efficiency. The code for plot generation is not displayed for the sake of readability (i.e. a code chunk option of `echo=FALSE` is used.)
```{r, echo=FALSE, fig.show='hold', fig.align='center', fig.cap="Cross-Efficiency vs. Efficiency"}
CRSvsCE <- ggplot (collectedeff1, aes(x=as.numeric(CCRIO), y=as.numeric(CE))) +
xlab("Input-Oriented CRS Efficiency") +
ylab("Cross-Efficiency") +
ggtitle("1920 American League Input-Oriented Constant Returns to Scale Model") +
geom_point(shape=1) +
geom_abline(slope=1, color="red") +
geom_text(data=subset(collectedeff1, as.numeric(CCRIO) > 0.999 ),
nudge_x = 0.03, aes(x=as.numeric(CCRIO), y=as.numeric(CE),
label=nameID))
CRSvsCE
```
At first glance, the story of cross-efficiency vs. efficiency. This chart makes it clear that while they are correlated, the nine players that were originally deemed CCR-IO Efficient are treated quite differently by cross-efficiency. No one receives a score of 1.0 in the cross-efficiency analysis as this would require that everyone uses weights that score that player with a 1.0. George Sisler is rated highest but is closely followed by Tris Speaker, Eddie Collins, and "Shoeless" Joe Jackson. These four are clearly separated from all of the other players. The next two formerly efficient players, Sam Rice and Joe Judge, still receive cross-efficiency scores over 0.85 but are now even surpassed by one CCR-IO inefficient player.
This brings us to the last three of the formerly efficient players: Harry Hooper, Bob Meusel, and Babe Ruth. All three of these players are outscored in cross-efficiency by many other players. Their cross-efficiency scores are all approximately the league average. All of this despite the fact that as discussed earlier, Babe Ruth was not just inarguably the *best* batter in 1920 but arguably the best of the century.
Now let's see if weight restrictions can solve the problem of cross-efficiencies' incorrect evaluation of Babe Ruth.
```{r, echo=FALSE, fig.show='hold', fig.align='center', fig.cap="Impact of Weight Restrictions on Cross-Efficiency"}
CEvsCEWR <- ggplot (collectedeff1, aes(x=as.numeric(CE), y=as.numeric(CEWR))) +
xlab("Cross-Efficiency") +
ylab("Cross-Efficiency with Weight Restrictions") +
ggtitle("1920 American League Input-Oriented Constant Returns to Scale Model") +
geom_point(shape=1) +
geom_abline(slope=1, color="red") +
geom_text(data=subset(collectedeff1, as.numeric(CEWR) > 0.85 ),
nudge_x = -0.03, aes(x=as.numeric(CE), y=as.numeric(CEWR),
label=nameID))
CEvsCEWR
```
While \index{Cross-efficiency} cross-efficiency is sometimes used as a crutch to improve discrimination without needing to use application area-expertise, weight restrictions can be used just as readily in cross-efficiency as in regular DEA studies. This chart highlights shows that unlike regular DEA, adding weight restrictions helps some players' scores and hurts others' scores. The biggest beneficiary of weight restrictions is Babe Ruth. While his weight-restricted cross-efficiency score still trails George Sisler by a small margin, he is no longer considered a mere "average" batter.
Most DEA applications do not have the benefit of a historically transcendent performance such as that of Babe Ruth to highlight a modeling error of using cross-efficiency to increase improve discrimination among DMUs. Cross-efficiency sacrifices DEA's conservative or generous assumption of weight flexibility to opaquely create an alternate scoring model. While it does typically create a unique ranking of all of the DMUs, this does not necessarily make it valid.
This analysis demonstrates that cross-efficiency is not a substitute for the hard work of good modeling.
### Cross-Efficiency and Fixed Weights
Babe Ruth's scores showed high variability across the different analysis. Let's explore this in more detail by digging into the actual weights from the multiplier model used.
The cross-efficiency score for a player, say Babe Ruth, is computed by averaging each cross-evaluation score which is the score that each of the players implicitly give Babe Ruth using their own preferred weighting scheme.
The result is that the only way that Babe Ruth could get a cross-efficiency score of 1.0 would be for every player to say that based on their own weighting scheme, they find that Babe Ruth was efficient.
Let's look at the calculations for Babe Ruth using his own weighting scheme.
```{r, echo=FALSE}
ybabe <- y["ruthba01", ]
uybabe <- as.data.frame(t(res1920ALcross$uy["ruthba01",]))
xbabe <- x["ruthba01", ]
vxbabe <- as.data.frame(t(res1920ALcross$vx["ruthba01",]))
rownames(uybabe) <- "output weights"
```
```{r, echo=FALSE}
kbl (rbind(ybabe, uybabe), booktabs=T, digits=5,
caption =
"Ruth's Outputs and Ruth's Cross-Efficiency Output Weights")|>
kable_styling(latex_options = c("HOLD_position"))
```
Now, we can multiply each output by the corresponding output weight and then add these products together. The result is `r as.matrix(ybabe) %*% t(as.matrix(uybabe))`. The input weight for Babe was `r round(vxbabe,5)` while his input (Plate Appearances) was `r xbabe`. Simply multiplying them together gives us `r as.matrix(xbabe) %*% t(as.matrix(vxbabe))`. The final result is taking the weighted output divided by the weighted output which gives us `r as.matrix(ybabe) %*% t(as.matrix(uybabe))/as.matrix(xbabe) %*% t(as.matrix(vxbabe))`. Not surprisingly, Babe Ruth's cross-evaluation score based on his own weights is the same as his CCR output-oriented efficiency score. In this case, simply 1.0.
Now, let's go through the process of seeing the calculation of Babe Ruth's cross-evaluation score using the weights of George Sisler.
```{r}
uysisl <- as.data.frame(t(res1920ALcross$uy["sislege01",]))
vxsisl <- as.data.frame(t(res1920ALcross$vx["sislege01",]))
rownames(uysisl) <- "output weights"
```
```{r}
kbl (rbind(ybabe, uysisl), booktabs=T, digits=6, caption =
"Ruth's Outputs and Sisler's Cross-Efficiency Output Weights")|>
kable_styling(latex_options = c("HOLD_position"))
```
Now, we again multiply each output by the corresponding output weight and then add these products together. The result is `r round(as.matrix(ybabe) %*% t(as.matrix(uysisl)),5)`. The input weight for Sisler was `r round(vxsisl,5)` while Babe's input (Plate Appearances) was `r xbabe`. Simply multiplying them together gives us `r round(as.matrix(xbabe) %*% t(as.matrix(vxsisl)),5)`. The final result is taking the weighted output divided by the weighted output which gives us `r round((as.matrix(ybabe) %*% t(as.matrix(uysisl)))/(as.matrix(xbabe) %*% t(as.matrix(vxsisl))),5)`. Babe Ruth's cross-evaluation score based on George Sisler's weights results in a much lower score than his own score.
Both George Sisler's and Babe Ruth's output weights violate the pattern of increasing (or more precisely, non-decreasing) values as you move to the right. In the case of Babe Ruth's efficiency score, it does not hurt him to have this unrealistic scheme because he gave himself a *1.0*. On the other hand, in cross-efficiency, this unrealistic weighting scheme used by George Sisler *hurt* the final cross-efficiency score of Babe Ruth.
In fact, a paper by Anderson, Inman, and Hollingsworth [@Andersonfixedweightingnature2002] found that under certain circumstances, cross-efficiency is effectively a fixed weighting scheme that may assess everyone under the same possibly incorrect weighting scheme. The authors show that in a single-input, output-oriented model or single output, input-oriented model, the cross-efficiency calculations, cross-efficiency just relies on an average of the weights.
The cross-efficiency mean output weights can be expressed as the following:
$$
\begin{split}
\begin{aligned}
u_r^{CE} & = \frac{1} {N^D} \sum_{j=1}^{N^D} \frac {u_{r,j}} {v_{1,j}} \\
\\
v_1^{CE} & = 1 \\
\end{aligned}
\end{split}
(\#eq:Ch8CEFixedWeights)
$$
Let's repeat this exercise one more time based on the mean output and mean input weights for all 54 players in the 1920 American League.
```{r}
vxmeantemp <- matrix(rep(res1920ALcross$vx, each=5), ncol=5, byrow=TRUE)
# Expand vx from one column to as many columns as uy (5)
uymeantemp <- res1920ALcross$uy / vxmeantemp
# Now it is easy to do term by term division of the two matrices
uymean <- as.data.frame(t(colMeans(uymeantemp)))
# Mean cross-efficiency output weights are now just the column means
vxmean <- 1
# Mean cross-efficiency input weight is just unity.
# Equivalently, could output weights using column sums
# and input weight of ND
rownames(uymean) <- "output weights"
```
```{r, echo=FALSE}
kbl (rbind(ybabe, uymean), booktabs=T, digits=4,
caption =
"Ruth's Outputs and Mean Cross-Efficiency Output Weights")|>
kable_styling(latex_options = c("HOLD_position"))
```
Once again we multiply each output by the corresponding output weight and then add these products together. The result is `r round(as.matrix(ybabe) %*% t(as.matrix(uymean)),4)`. The mean input weight was `r round(vxmean,4)` while his input (Plate Appearances) was `r xbabe`. Simply multiplying them together gives us `r round(as.matrix(xbabe) %*% t(as.matrix(vxmean)),4)`. The final result is taking the weighted output divided by the weighted output which gives us `r round(as.matrix(ybabe) %*% t(as.matrix(uymean))/as.matrix(xbabe) %*% t(as.matrix(vxmean)),4)`. Babe Ruth's cross-evaluation score based on the league-wide mean weights results in a higher score score than what Sisler gave him but far below his own efficiency score.
Note, this calculated value matches the original cross-efficiency score for Babe Ruth of `r round(res1920ALcross$ce_ave[1, "ruthba01"],4)` calculated as simply the column average of the cross-evaluation matrix.
The same mean cross-efficiency weights can be used for calculating the cross-efficiency score for each of the other players. This is not a matter of some (or most) players occasionally violating the relationship in some way but a systematic error. What this means is that *every* player is now being assessed against the same fixed and in this case, implausible weighting scheme. If you look carefully at the mean output weights, the mean weight for Home Runs is just half that of the next lowest output weight (base on balls). Singles, doubles, and triples are all given four or fives times as much weight as home runs.
The low weighting of home runs is in effect caused by a *ganging up* phenomenon where most players do poorly on Home Runs relative to Babe Ruth so they choose to not put weight on that output. On the whole, the league then does not on average put a low weight on Home Runs. Furthermore, outputs that tend to have low values (such as triples) will tend to have higher output weights than those with higher values such as home runs. Rightly or wrongly, these distributional characteristics of the data set will affect cross-efficiency scores.
Most studies that do cross-efficiency do not take a careful look at the weights.
## Baseball Team Management
Examining baseball batters is interesting but let's change our perspective to the team. A baseball general manager tries to construct the best team. Players have different salaries and the Lahman baseball dataset includes salary information. An interesting statistical analysis of the relationship between team salary and wins can be found at <https://rpubs.com/grigory/MLBSalaryPerfLR> by Gregory Kanevsky. We will draw inspiration from Gregory Kanevsky's analysis and data wrestling to get us started. Instead of his regression model, we will use a benchmarking approach to demonstrate the use of DEA for examining organizations.
```{r warning=FALSE, messages=FALSE}
suppressPackageStartupMessages(library (data.table))
teams = as.data.table(Teams)
teams = teams[, .(yearID,
lgID = as.character(lgID),
teamID = as.character(teamID),
franchID = as.character(franchID),
Rank, G, W, L, R, ERA, SO,
PostW = 4*(LgWin=="Y")+4*(WSWin=="Y"),
# Construct Post Season wins based on games to win a series
# Would prefer to get actual postseason games won
WinPercent = W/(W+L),
name, attendance
)]
MLBSalaries = as.data.table(Lahman::Salaries)
MLBSalaries = MLBSalaries[, c("lgID", "teamID", "salary1M") :=
list(as.character(lgID),
as.character(teamID), salary / 1e6L)]
payroll = MLBSalaries[, .(payroll = sum(salary1M)),
by=.(teamID, yearID)]
teamPayroll = merge(teams, payroll, by=c("teamID","yearID"))
MLB1991AL <- subset (teamPayroll, yearID==1991 & lgID=="AL")
```
Our \index{Input-output} DEA input-output model will be salary used to create regular season wins.
```{r MLBSalaryIODiagram, echo=FALSE}
XFigNames <- "Team Salary"
YFigNames <- c("Wins")
ModelName <- '"\n\nBCC\nIO\n\n "'
Figure<-DrawIOdiagram(XFigNames,YFigNames, ModelName)
tmp<-capture.output(rsvg_png(
charToRaw(export_svg(Figure)),'TeamSalary.png'))
```
Note that we could use an input or an output orientation.
```{r warning=FALSE, messages=FALSE}
res1991MLB<-DeaMultiplierModel (x=as.matrix(MLB1991AL$payroll),
y=as.matrix(MLB1991AL$W), rts="vrs",
orientation="input")
rownames(res1991MLB$Efficiency)<-MLB1991AL$teamID
colnames(res1991MLB$Lambda)<-t(MLB1991AL$teamID)
```
```{r, echo=FALSE}
kbl(TRA::poscol(cbind(res1991MLB$Efficiency,res1991MLB$Lambda), cutoff=0.00001),
booktabs=T, digits=4,
caption=
"VRS Efficiency and Non-Zero Lambda Columns of 1991 AL Teams") |>
kable_styling(latex_options = c("HOLD_position"))
```
These results show that only four teams were efficient: Chicago White Sox (CHA), Minnesota Twins (MIN), Seattle Mariners (SEA), and Toronto Blue Jays (TOR). All other teams were compared to a combination of these teams. Since this is a simple, one-input, one-output model, it is easy to show graphically. The \index{Benchmarking} `Benchmarking` package from Bogetoft and Otto does a great job of drawing 2 dimensional DEA plots so we will again use this package.
```{r, message=FALSE, warning=FALSE}
library (Benchmarking)
```
```{r}
dea.plot.frontier (MLB1991AL$payroll, MLB1991AL$W, RTS="vrs",
txt=MLB1991AL$teamID,
fex = 0.7, # Scales data text label size
xlab = "Team Salary ($Millions)",
ylab = "Regular Season Wins")
```
Now, let's rerun it for multiple years.
```{r BCC-Team-Salary-Efficiency}
teampayroll2 <- as.data.frame(t(c(rep_len("",17))))
class(teamPayroll$yearID)<-"numeric"
colnames(teampayroll2)<-c(colnames(teamPayroll),"Eff")
# Above feels like a kludge to aggregate results
# Feel free to come up with cleaner alternatives
class(teampayroll2$payroll)<-"numeric"
class(teampayroll2$W)<-"numeric"
class(teampayroll2$yearID)<-"numeric"
class(teampayroll2$Eff)<-"numeric"
lastyear <- 2016
# Note that as of October 2024, the Lahman package (12.0) only
# includes teampayroll data through 2016.
for (year in 1991:lastyear) {
MLBteamAL <- subset (teamPayroll, yearID==year & lgID=="AL")
resMLB <-DeaMultiplierModel(as.matrix(MLBteamAL$payroll),
as.matrix(MLBteamAL$W), rts="vrs")
rownames(resMLB$Efficiency) <-MLBteamAL$teamID
teampayroll2 <- rbind(teampayroll2, cbind (
subset (teamPayroll, yearID==year & lgID=="AL"),
resMLB$Efficiency))
}
#colnames(teampayroll2[,17])<-"Efficiency"
colnames(teampayroll2)[colnames(teampayroll2)=="V2"] <- "Efficiency"
class(teampayroll2$payroll)<-"numeric"
class(teampayroll2$W)<-"numeric"
# class(teampayroll2$Efficiency)<-"numeric"
class(teampayroll2$WinPercent)<-"numeric"
teampayroll2 <- teampayroll2[-1,]
#pander(poscol(cbind(resMLB$eff,resMLB$lambda)),
# caption="BCC Efficiency and Lambda Values of AL Teams")
```
Now, let's plot the data.
```{r, echo=FALSE}
kbl(head(subset (teampayroll2, yearID==1991 & lgID=="AL")),
booktabs=T, digits=2,
caption="Sample of 1991 AL Team Payrolls") |>
kable_styling (latex_options = c("HOLD_position", "scale_down"))
```
```{r, echo=FALSE}
ggplot (subset (teampayroll2, lgID=="AL"),
aes (x=payroll, y=W, color=yearID)) + geom_point() +
scale_color_gradient(low="blue", high="red") +
xlab("Payroll ($Millions") + ylab("Regular Season Wins") +
ggtitle("Salary and Wins: American League (1991-2016)")
```
This chart shows that salaries have been going up over time with the colors on the left being generally blue and the colors on the right being red.
Now, let's look at changes for two specific teams over time that are featured in Michael Lewis' *Moneyball*: the New York Yankees and the Oakland Athletics [@lewis2004]. The Yankees are generally among the biggest highest spending of teams while the Oakland Athletics are among the lowest.
```{r}
moneyballplot2 <- ggplot (
data=teampayroll2) +
geom_point(data=teampayroll2[teampayroll2$teamID %in%
c("NYA"),],
aes(x=payroll, y=W, group=teamID, color=yearID,
shape="Yankees")) +
geom_point(data=teampayroll2[teampayroll2$teamID %in%
c("OAK"),],
aes(x=payroll, y=W, group=teamID, color=yearID,
shape="Athletics")) +
scale_color_gradient(low="blue", high="red") +
xlab("Payroll ($Millions") + ylab("Regular Season Wins") +
ggtitle("Salary and Wins: Oakland A's and New York Yankees
(1991-2016)") +
labs (shape="Team", col="Year")
# Moneyballplot <- Moneyballplot +
moneyballplot2
```
```{r}
moneyballplot2 <- ggplot (
data=teampayroll2) +
geom_point(data=teampayroll2[teampayroll2$teamID %in%
c("NYA"),],
aes(x=yearID, y=Eff, shape="New York")) +
geom_point(data=teampayroll2[teampayroll2$teamID %in%
c("OAK"),],
aes(x=yearID, y=Eff, shape="Oakland")) +
xlab("Year") + ylab("Team Efficiency") +
ggtitle("Team Efficiency Oakland and New York Yankees (1991-2016)") +
# scale_shape_discrete(name="Team",
# breaks=c("NYA", "OAK"),
# labels=c("Yankees", "Athletics")) +
xlab("Payroll ($Millions)") + ylab("Regular Season Wins") +
ggtitle(
"Salary and Wins: Oakland A'thletic's and New York Yankees
(1991-2016)") +
labs (shape="Team")
#Moneyballplot <- Moneyballplot +
moneyballplot2
```
The book, *Moneyball* was published in 2003 and tells the story of Billy Beane as the Oakland A's General Manager. He took over as General Manager at the end of the 1997 season. The central thesis is that the Oakland A's outcompeted the New York Yankees despite having a much lower salary for players by recognizing a skill that was overlooked by industry conventional wisdom. We can then divide our salary data and analysis into three periods: Before Billy Beane's General Manager role (1991-1997), Billy Beane before *Moneyball* (1998-2002), and the after *Moneyball* era (2004-2010). This leads to a few potential hypotheses.
Oakland's salary efficiency should have improved with the hiring of Billy Beane. $\theta ^{1991-1997}_{OAK}$=`r mean(subset(teampayroll2, teamID=="OAK" & yearID>1990 & yearID<1998)[["Eff"]],round=4)`\<$\theta ^{1998-2002}_{OAK}$=`r mean(subset(teampayroll2, teamID=="OAK" & yearID>1997 & yearID<2003)[["Eff"]],round=4)`: *H1 Supported*
Oakland's salary efficiency should have been higher than New York's before the book was published while Billy Beane was running Oakland. $\theta ^{1998-2002}_{OAK}$=`r mean(subset(teampayroll2, teamID=="OAK" & yearID>1997 & yearID<2003)[["Eff"]],round=4)`\>$\theta ^{1998-2002}_{NYA}$=`r mean(subset(teampayroll2, teamID=="NYA" & yearID>1997 & yearID<2003)[["Eff"]],round=4)`: *H2 Supported*
After the book was published, Oakland's strategy was widely known. The *efficient market hypothesis* would indicate that Oakland's salary efficiency would decline as the market would then reflect this information in player pricing. $\theta ^{1998-2003}_{OAK}$=`r mean(subset(teampayroll2, teamID=="OAK" & yearID<2003 & yearID>1997)[["Eff"]],round=4)`\>$\theta ^{2004-2009}_{OAK}$=`r mean(subset(teampayroll2, teamID=="OAK" & yearID>2003 & yearID<2010)[["Eff"]],round=4)`: *H3 Supported*
```{r}
teameff <- teampayroll2 |>
group_by (teamID) |>
dplyr::summarize(Mean = mean(Eff,na.rm=TRUE))
kbl(teameff,
booktabs=T, digits=3,
caption="Efficiency of Teams") |>
kable_styling (latex_options = c("HOLD_position"))
```
```{r warning=FALSE, messages=FALSE}
#library(dplyr); library (pander)
#kbl (dply(teampayroll2, c("teamID"), summarise, mean(Efficiency)))
# pander(ddply(teampayroll2, c("teamID"), summarise,
# mean(Eff)),
# caption="Average Team Efficiency", round=4)
mean(subset(teampayroll2, teamID=="NYA" & yearID<2003 )[["Eff"]])
mean(subset(teampayroll2, teamID=="NYA" & yearID>2003 )[["Eff"]])
mean(subset(teampayroll2, teamID=="OAK" & yearID<2003 )[["Eff"]])
mean(subset(teampayroll2, teamID=="OAK" & yearID>2003 )[["Eff"]])
```
### Future Research Opportunities on Team Management
- The above hypotheses were considered based on means. It could be extended for statistical significance.
- The spread of best practices from the Oakland A's through baseball could be modeled from the perspective of the diffusion of innovation.
- The book *Moneyball* helped to popularize and legitimize analytics in sports management. Other industries could be explored to see if they have similar pre- and post-*Moneyball* effects.
- General managers spend salary between pitchers and position players. Extending the model by separating salaries between the two players could be used to reveal patterns of allocative inefficiency between teams.
- Salary efficiency could be used to explore and test the impact of organizational practices and structures.
- One of the open questions in baseball analysis is measuring the impact of being a *good* teammate. It has been argued that some players have most of their value come by way of being a positive influence in the team clubhouse but unlike for batting, pitching, and more recently, fielding, no accepted measure of this team influence has been done. Lessons from sports might then be applicable to areas of greater significance such as new product teams, project management, boards of directors, and other areas.
- Malmquist Productivity Indices could be used to examine maturity of the industry.