-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path08-explore.Rmd
276 lines (193 loc) · 11.6 KB
/
08-explore.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
# Exploring data
As for exploring data, there is no quick, one-click method to explore your data, whether at a high level or in more granular detail. Realistically, you will oscillate between cleaning and exploration before you are able to better understand the data you are looking testing.
There are many primarily two methods in exploring data, both quantitatively and visually. Both are necessary methods for understanding data, at a quantitative and also a human level.
For this section we will still use the same company database, and also another unique dataset, along with some more packages - mainly, ggplot2.
```{r, message = FALSE, warning = FALSE}
library(dplyr)
library(tidyr)
library(lubridate)
library(DBI)
library(ggplot2) # For graphics
dir.create("data", showWarnings = FALSE)
download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/rauditanalytics.sqlite",
destfile = "data/rauditanalytics.sqlite", mode = "wb")
con <- dbConnect(RSQLite::SQLite(), "data/rauditanalytics.sqlite")
gl <- tbl(con, 'gl') %>%
collect()
dbDisconnect(con)
```
## Basic Statistics {#explore-stats}
Performing summaries, or profiling the data, is a great way to begin analyzing data. It allows you to understand general numeric performance of a group. More importantly, understanding the characteristics of a datset, including mean, median, and percentiles are useful, as they help you determine what is both normal and abnormal in your data.
In our GL account, we have a materials category - lets isolate by that and begin our analysis:
```{r}
materials <- gl %>%
filter(account == 'exp_materials_6000') %>%
mutate(gl_date = as_date(gl_date_char)) %>% # We demonstrated how to handle dates in the Cleaning chapter
select(-gl_date_char)
```
`summary()` gives a high level preview of each field, including the type and the statistical profile of the data table:
```{r, rows.print = 5}
summary(materials)
```
If we focus on amount, we can reproduce the same summaries as well, useful if we want to extract items relative to the criteria:
```{r}
min(materials$amount)
max(materials$amount)
mean(materials$amount) # The average
median(materials$amount) # The value in the middle between upper half and lower half. If these are different than the average, then you may expect some skew in the data set.
```
Quartiles are useful because they help you generally understand where the majority of the data lies. These also also known as quantiles (or percentiles), but they cover the same area of a normal curve.
```{r}
quantile(materials$amount) # Look familiar?
quantile(materials$amount)[2] # This is the 1st quartile, or the 25% quantile
quantile(materials$amount)[4] # This is the 3st quartile, or the 75% quantile
IQR(materials$amount) # The difference between 3rd and 1st quartile
```
You may have missing values in your data set that show up as NA. Its important to recognize how you will [diagnose and treat NAs](#completeness-na). If a column has NAs, you may choose to remove them by using the argument `na.rm = TRUE`, which simply ignores these values when performing the calculation.
```{r}
some_nums <- c(1, 2, NA, 3, 4, 5)
mean(some_nums) # Returns NA
mean(some_nums, na.rm = TRUE) # Removes the NA value prior to caluclating the mean
```
In accounting, it may be important to know the absolute value of a number. They're useful in audit situations as they help express total activity of the subject, as positive and negative values may either zero out the total or mask credit activity and reversals. They're also useful for calculating errors, as you may be not concerned about the direction of the error (debit or credit) but the total magnitude instead.
```{r}
sum(materials$amount)
abs(sum(materials$amount))
```
We can even do basic correlation plots in R of two numerical data sets:
```{r}
x <- c(1, 2, 3, 4)
y <- c(2, 4, 6, 8)
cor(x, y) # The correlation of X and Y. 1 means positively highly correlated, -1 means negatively highly coorelated, and 0 means no discerable pattern.
```
## Summarizations
While calculating one-off summary statistics is useful, you may want to perform summation or summary statistics, whether by an entity (a vendor) or by time frame (month, year). R provides methods for summarizing this information, with a combination of two verbs.
- `group_by()` enables you to indicate the levels you want to aggregate (for example, by vendor or by month),
- `summarize()` requires you to create new columns to contain the calculation or aggregation you want to occur (for example, `sum` or `mean`).
The materials table has a `vendor_id` number. We can try to perform a `group_by()`...
```{r}
materials %>%
group_by(vendor_id)
```
But it will return nothing. The `group_by()` will tell R what to group by, but not perform any additional calculations. Hence, the `summarize()`:
```{r}
materials %>%
group_by(vendor_id) %>%
summarize(total_amount = sum(amount))
```
Summarize will isolate the data set first by the groups specified, and then perform the calculation and put the results into a new field. This works for all summary statistics as well. For example, if you want to find the number of transactions, and the highest and lowest amount charged per description:
```{r}
materials %>%
group_by(description) %>%
summarize(number_of_trans = n(),
lowest_amount = min(amount),
highest_amount = max(amount))
```
## Data Visualization - Why
With such useful statistical and summary functions, you may wonder why you ever need to delve into graphics. While summary statistics are fun (and perhaps, highly accurate and even 'authoritative'), they don't help bring to light full patterns of your data. The power of visual deduction can not be understated, especially in situations where superficially summary statistics have no issues.
Consider this data set, which are merely 4 pairs of x-y values (i.e. x1 goes with y1 etc):
```{r rows.print = 20}
anscombe
```
If we did statistical testing on these pairs, we will notice they have nearly identical summary statistics:
```{r}
mean(anscombe$x1); mean(anscombe$x2); mean(anscombe$x3); mean(anscombe$x4)
mean(anscombe$y1); mean(anscombe$y2); mean(anscombe$y3); mean(anscombe$y4)
```
And even their correlatons are similar:
```{r}
cor(anscombe$x1, anscombe$y1); cor(anscombe$x2, anscombe$y2); cor(anscombe$x3, anscombe$y3); cor(anscombe$x4, anscombe$y4)
```
With the mean and correlation statistics, you may jump to the conclusion that these datasets may be shaped identically. However, we should seriously consider plotting the data to validate our initial assumption that the data is similar.
We will use the *ggplot2* to chart these x-y pairs and try to understand the data we are looking at. We'll explain ggplot2 syntax later, so for now, focus on the graph produced.
If we produce the four graphs together, what do you notice?
```{r, message=FALSE, fig.show = 'hold', out.width = '50%'}
ggplot(anscombe, aes(x = x1, y = y1)) +
geom_point() +
stat_smooth(method="lm", se = FALSE) +
labs(title="Pair 1: Y1 over X1")
ggplot(anscombe, aes(x = x2, y = y2)) +
geom_point() +
stat_smooth(method="lm", se = FALSE) +
labs(title="Pair 2: Y2 over X2")
ggplot(anscombe, aes(x = x3, y = y3)) +
geom_point() +
stat_smooth(method="lm", se = FALSE) +
labs(title="Pair 3: Y3 over X3")
ggplot(anscombe, aes(x = x4, y = y4)) +
geom_point() +
stat_smooth(method="lm", se = FALSE) +
labs(title="Pair 4: Y4 over Y4")
```
These four charts and scatterplots themselves are quite different! While the blue correlation line and the prior summary statistics are similar, the points all follow a different general path. Independently looking at the graphics for each one will give you different insight, and may influence your audit approach. Graphics matter a lot!
## Data Visualization - How
*ggplot2* is the package that many consider to be one of the differentiating strengths of the R universe. It is one of the most elegant methods of visualizing data, and will help you understand your audit subject.
Back to our materials analysis.
```{r}
materials %>%
arrange(gl_date) %>%
head()
```
Perhaps we want to start analyzing the change over time for the amount spent. The `gl_date` is useful here, although there are many transactions that occur on the same date. We also have a `vendor_id` as well, identifying the vendor that the transaction the invoice belongs to.
Generally we will want our time to be the x variable (the dependent variable), and the amount spent to be the y variable (the independent variable). We can insert the x and y into a ggplot function and generate the visualization we think may reveal useful infromation.
```{r}
ggplot(data = materials, mapping = aes(x = gl_date, y = amount)) +
geom_point()
```
The foundations of any ggplot are essentially:
- `data` - The data frame you want to visualize,
- `mapping` - The aesthetics, which dictate the x and y (generally for the rest of the ggplot), and
- `geom_*` - The visualization mechanism you want to apply to represent the data points.
When you specify a single `mapping` up front, it will be used for all the subsequent geoms called. You can also specify a mapping for each geom as well, which is useful for isolating points.
This foundations layer of `data`, `mapping` and `geom_*` allows us to call different geoms and perspectives of looking at the data.
For example, histograms are useful to understand the shape of a dataset:
```{r, message = FALSE}
ggplot(materials, aes(x = amount)) + # data is the first default argument, and mapping is the second
geom_histogram()
```
And boxplots are useful to understand the distribution of the data:
```{r}
ggplot(data = materials, aes(x = as.factor(vendor_id), y = amount)) +
geom_boxplot()
```
### Time-series
As the gl table (and materials subset) is a time-series data, containing both the transaction date and amount, we may want to try plotting a line chart to see the movement of data over time.
```{r}
ggplot(data = materials, aes(x = gl_date, y = amount)) +
geom_line()
```
Since our time-series data has a relatively high resolution with detailed data, we may sometimes gain insight by aggregating data (by week, month, or even year). As of now, the data needs to be aggregated into a higher level time period.
```{r}
monthly_mats <- materials %>%
mutate(gl_month = floor_date(gl_date, 'month')) %>% # Rounds the date down to the first of the month
group_by(gl_month) %>%
summarize(amount = sum(amount))
head(monthly_mats)
```
Aggregating this same information by month may be more illustrative:
```{r}
ggplot(data = monthly_mats, aes(x = gl_month, y = amount)) +
geom_line()
```
### Facets
Another powerful feature of ggplot2 is facets - that is, the ability to show how different categories of the same data may look in the same dataset. In our GL dataset, we actually had multiple types of expenses:
```{r}
unique(gl$account)
```
We can pull these specific expense accounts (starting with 'exp_'), and can choose how we want to plot them:
```{r}
expense <- gl %>%
filter(account %in% c('exp_materials_6000', 'exp_consulting_6500', 'exp_meals_7000')) %>%
mutate(gl_date = as_date(gl_date_char),
gl_month = floor_date(gl_date, 'month')) %>%
group_by(account, gl_month) %>% # Need to add a new layer of grouping to preserve the account
summarize(amount = sum(amount))
ggplot(data = expense, aes(x = gl_month, y = amount, colour = account)) + # We can specify the colour now too, depending on the geom_*
geom_line()
```
Instead of plotting them onto the same chart, we can `facet_*` them:
```{r}
ggplot(data = expense, aes(x = gl_month, y = amount, colour = account)) +
geom_line() +
facet_grid(. ~ account) # Syntax for variable, either in the (row ~ column)... or both!
```