-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path07-clean.Rmd
390 lines (265 loc) · 20.1 KB
/
07-clean.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
# Cleaning data
Cleaning a data set, including fields and reshaping of the data, is a fundamental skill in order to get the data into an analyzable format. Real world data is messy, and will not typically be immediately analyzable immediately after downloading.
For this section we will still use the company database, along with some more packages. New packages in this chapter include lubridate and tidyr.
```{r, message = FALSE, warning = FALSE}
library(dplyr) # For manipulating data
library(tidyr) # For making data long and wide
library(lubridate) # For date handling
library(DBI) # For database connections
dir.create("data", showWarnings = FALSE)
download.file(url = "https://github.com/jonlinca/auditanalytics/raw/master/data/rauditanalytics.sqlite",
destfile = "data/rauditanalytics.sqlite", mode = "wb")
con <- dbConnect(RSQLite::SQLite(), "data/rauditanalytics.sqlite")
gl <- tbl(con, 'gl') %>%
collect()
vendors <- tbl(con, 'vendors') %>%
collect()
industry <- dbReadTable(con, 'industry') %>%
collect()
dbDisconnect(con)
```
## Basic math
We covered some of the basic operators in R in the Completeness chapter, so we'll cover some more calculations and transformations.
```{r}
x <- c(1, 2, 3)
y <- c(4, 5, 6)
```
The basic mathematical symbols will calculate a vector with a position element against a different vector with the same position. In this case, the first number in `x` is added to the first number in `y` resulting in the first calculation, the second number in `x` is added to the second number in `y` resulting in the second calculation, and so on.
```{r}
y + x
```
This distinction is important - vectorization allows R to be fast, but it changes how the user needs to understand objects.
`sum()` is fairly straight forward - it will add all the elements given, together:
```{r}
sum(x)
```
If you specify multiple vectors - it will merely combine them all together, equivalent to `sum(c(1, 2, 3), c(4, 5, 6))`:
```{r}
sum(x, y)
```
We mentioned in the prior chapter about NA values, which were 'Not Available' or Missing Values. It becomes important to understand the implication of them in your calculations. For example, when a NA value exists in a vector, you will get an NA by default:
```{r}
z <- c(1, 2, NA)
sum(z)
```
While `sum()`, and several other functions like `mean()`, `min()` and `max()` all generally behave in predictable manners, they all have an argument of `na.rm = FALSE` by default (enter `?sum` to see this argument in the help file. The reason why `FALSE` is by default is that it brings attention to the vector missing values. If you want to perform the calculation on this, you either need to substitute the `NA` out for a value (imputation), or simply ignore them. Both these decisions require professional judgment before substitution or ignoring. To substitute `NA` for a value (for example, 10):
```{r}
is.na(z) # Shows which positions in the vector have an NA value
z[is.na(z)] <- 10 # Where the z value is is.na, assign 10
sum(z)
```
Or alternatively, set the argument `na.rm = TRUE` to throw away these values completely:
```{r}
z <- c(1, 2, NA)
sum(z, na.rm = TRUE)
```
## Dates and Times
Unlike numbers, date and times are inherently more complex. They require many additional considerations to properly wrangle:
- What format is the date and time in?
- Are the timezones as expected? This also applies for dates.
- Do you want the the month the transaction took place in? The year? The day-of-week?
- Do you care about daylight savings time?
Starting with a basic example, lets understand the basic characteristics of dates. When you define a new date, the `lubridate::as_date()` function (or base `as.Date()`) works to convert character strings to dates.
```{r}
startString <- '2020-07-30' # ISO 8601 standards, follow them
# base::as.Date(startString) # Same thing as as_date, for now...
start <- lubridate::as_date(startString)
class(start)
```
You can convert any date to a date, as long as you specify the mask or format. A mask specifies where the date, month, year exist (as well as for time, hours, minutes, seconds). If you check the help file for `?strptime`, you will see several different formats for both specifying dates and times. For example, `%Y` is a four digit year while `%y%` is a two digit year. Here are a few examples with the same date (some are more unreasonable than others, but all can be captured):
```{r, eval = FALSE}
lubridate::as_date('2020-07-30', format = "%Y-%m-%d")
lubridate::as_date('30/07/2020', format = "%d/%m/%Y") # Not ISO 8601
lubridate::as_date('July 30, 20', format = "%B %d, %y") # Really not ISO 8601
lubridate::as_date('Jul3020', format = "%b%d%y") # Definitely not ISO 8601
```
The same concept also applies to datetimes, where you can convert a given time to a 'POSIXct' or 'POSIXlt' format (the differences between 'POSIXct' and 'POSIXlt' are found in `?DateTimeClasses`, under details):
```{r}
starttime_base <- as.POSIXct('2020-07-30 12:00:00')
starttime_lub <- lubridate::as_datetime('2020-07-30 12:00:00')
class(starttime_base); class(starttime_lub)
```
And format masks apply as well to date time, just in case you get something unconventional:
```{r}
lubridate::as_datetime('2020-07-30 12:00:00 PM -0600',
format = "%Y-%m-%d %I:%M:%S %p %z")
```
### Understanding Default Time Zones
Did you consider the time zone that the times were created in? This is system-specific and base `POSIXct` and lubridate `as_datetime` behave differently, so you have to be careful!
```{r}
print(starttime_base); attr(starttime_base,"tzone")
```
Notice that base treats the creation of the current timezone based off your local R machine, although its not extractable (and you will eventually find, hard to change).
```{r}
print(starttime_lub); tz(starttime_lub)
```
The lubridate version will assign the date-time automatically to UTC, unless you specify it. It is for this reason you will generally want to avoid using base R to convert to date times, and stick with lubridate for better predictability.
Having an automatic conversion to UTC isn't bad per se - having a universal ground truth to UTC is good, as it removes Daylight Saving component, and is an easy way to standardize multiple data sets together. Unfortunately, many systems will not tell you what time zone their data is in, so you will have to figure it out and ensure R is aligned accordingly.
### Override or display timezones
Hopefully you have been convinced to stick with lubridate for your date transformations. Now, lets get some confidence on these timezone transformations.
The first method is `force_tz()`, which forces the current time to become that same hour in a new timezone. When we assigned July 30th to the date, we would have assumed this date would have been in the current timezone we exist in (generally, the system time-zone you are in). To do this, specify the date, but also the time zone you want the date or time to be harmonized in:
```{r}
starttime_lub # Starts as UTC
force_tz(starttime_lub, tz = 'Canada/Mountain') # Force the saved time to be Mountain
```
Alternatively, you may wish to only convert the time so its displayed in the correct timezone, but leave the base truth as is. `with_tz()` will allow you to change the timezone that is displayed, while honoring the original time in the data.
```{r}
starttime_lub
with_tz(starttime_lub, tz = 'Canada/Mountain') # Display the current time as Mountain
```
### Daylight Saving Implications
There are also timezone implications when it comes to daylight saving time. This is especially important when making audit decisions based on duration, or expected local norms for time-of-day.
In several locales across the world, daylight saving time will alter the clock for half a year locally, but has no impact on UTC. For example, your staff may start their day at 8:00 AM locally. When daylight saving kicks in, the local time stays at 8:00 AM and the UTC time shifts from 15:00 to 14:00.
```{r}
pre <- as_datetime('2020-03-07 8:00:00', tz = 'Canada/Mountain') # The day before DST starts
post <- as_datetime('2020-03-08 8:00:00', tz = 'Canada/Mountain') # The day DST started
dst(pre); with_tz(pre, tz = 'UTC') # Was DST active before? ; What is the UTC time before?
dst(post); with_tz(post, tz = 'UTC') # Was DST active after? ; What is the UTC time after?
```
### Differences between time
Another common calculation is to calculate the time differences between two points of time. This seemingly simple action has several interpretations behind it. Do you want relative dates? Do you want exact elapsed time?
> People are not always consistent in how they expect time to behave. Sometimes the passage of time is a monotone progression of instants that should be as mathematically reliable as the number line. On other occasions time must follow complex conventions and rules so that the clock times we see reflect what we expect to observe in terms of daylight, season, and congruence with the atomic clock. [@lubridate-help]
Adding time is easy - you simply need to decide if you're adding a period (i.e. human construct of days) or if you are following physical time (i.e. seconds). The main difference is choosing lubridate's functions that honor periods or durations
```{r}
pre <- as_datetime('2020-03-07 8:00:00', tz = 'Canada/Mountain') # The day before DST starts
post <- as_datetime('2020-03-08 8:00:00', tz = 'Canada/Mountain') # The day DST started, skipped ahead 1 hour
# Periods: Human construct of days
pre + days(1) ; pre + hours(24) # Equal to post, the human construct
# Duration: Physical construct of time
pre + ddays(1) ; pre + dhours(24) # NOT to post, but actually 1 hour after
```
Calculating intervals and periods is a bit different. If we go back up to our example of pre-and-post daylight saving time, we can either expect this answer to be 1 day or 23 hours. Again, depending on your use case, it is up to you.^[Division is the intended mechanism - see https://github.com/tidyverse/lubridate/issues/105 and https://stackoverflow.com/questions/8765621/length-of-lubridate-interval]
```{r}
span <- interval(pre, post)
# Period: To find the number of days, in human constructs
span / days(1) # Period calculation by days.
# Duration - To find the hours between, in physical constructs
as.duration(span)
```
### Derived date information
A common task for auditors is to find out the month a transaction took place in. With lubridate, its quite straight forward - for example, to extract the month, you don't need to have any messy substrings (or to extract other useful times, for that matter):
```{r}
thedate <- as_datetime('2020-07-01 12:00:00', tz = 'Canada/Mountain')
month(thedate)
year(thedate)
quarter(thedate) # Canada Day starts on the first day of the third quarter. Does your company Financial Year not start on January 1? Just change the fiscal_start argument.
wday(thedate, label = TRUE)
```
### Cleaning numerical dates
With dates, sometimes they may represented as a number - it is important to understand your source system date before you convert it.
Some systems may calculate a date from an 'origin' using the number. While our concept of time may start with 0 AD (as the origin), several computer systems and applications do not behave in that manner. Excel, for example, starts from 1900-01-01, whereas Unix operating systems have their origin as 1970-01-01. If converting a date from a number, you need to research and then specify an origin, otherwise you will have challenges in specifying dates.
In our vendors table, we have a numeric date but also a date in a character form for reference
```{r}
vendors %>%
select(date_added, date_added_char)
```
By specifying the origin a few different origins, we can see what the origin should have been in this case:
```{r}
vendors %>%
select(date_added, date_added_char) %>%
mutate(date_added_unix = as_date(date_added, origin = '1970-01-01'),
date_added_excel = as_date(date_added, origin = '1900-01-01'))
```
Dates can also appear in a completely different format. Julian (also known as mainframe or ordinal) dates are an example, where the year is specified but the date is relative to January 1. So 212 days is either July 30 or 31, depending if it is a leap year or not.
```{r}
lubridate::as_date('19212', format = "%y%j")
lubridate::as_date('20212', format = "%y%j")
```
JDE E1's dates also add an extra digit in front of it, to symbolize the century. While there is no R way to translate 0 to 19xx and 1 to 20xx, practically speaking we are safe with assuming the century as we are at least 20 years past that point (as of this writing). Amending the mask to ignore the 1 can be done in the format:
```{r}
lubridate::as_date('119212', format = "1%y%j")
lubridate::as_date('120212', format = "1%y%j")
```
## Separate
Occasionally, distinct data will be combined in the same cell. This can be due to data entry or an unusual method of storing information. There will generally a separator within the field (similar to a delimiter), but the data is still contained within one field.
For example, if we look at our vendor table and `name` column, we can see the company names appear to also have a Contact person.
```{r}
vendors %>%
select(type, name)
```
While the name is accurate, the contact name shouldn't be in there. We can isolate the contact name by using `separate()`, allowing us to split a column by defining a separator and also new column names. In this case, the '-' (dash) character could work:
```{r}
vendors %>%
select(type, name) %>%
separate(name, c('name', 'company_contact'), sep = '-')
```
Sometimes a character isn't specific enough. We can actually customize the separator even further by changing its argument to be more specific. In this case, perhaps we want both the dash and the word "Contact:":
```{r}
vendors %>%
select(id, type, name) %>%
separate(name, c('name', 'company_contact'), sep = '- Contact:')
```
What is nice about this functionality is that if the separator isn't found, the missing values are NA. The warning is good to see (as the employees don't have a company contact), and can be safely ignored in this case.
## Separate_rows
Similarly grouped data can also be combined in the same cell. This is typically representing a one-to-many (or many-to-one) relationship. This can be due to data entry or unusual output formats. Consider our industry table, which is an industry-to-company mapping table, telling us which companies and the respective industry they work in:
```{r}
industry
```
As of now, there would be no way to join the industry, and specifically the value Commercial Supplies, into our vendor table. The unique key is id number of the company, which gets joined to the vendor table.
When we run into a situation where a value may belong to the same key multiple times (multiple IDs are associated with one industry), we can use `separate_rows()` to break up that row into several rows. For this function, we need to specify the columns that need to be separated - in this case, `id` needs to be broken apart.
```{r}
clean_industry <- industry %>%
separate_rows(id, convert = TRUE) # Convert will try to guess the data type for each field, based on the content that was separated
print(clean_industry)
```
`separate_rows()` will automatically expand the column that needs to be separated, while replicating the content for that same value into subsequent rows.
## Tidy data
The three key concepts in 'tidy data' [@rfordatasci] is that:
- Each variable must have its own column.
- Each observation must have its own row.
- Each value must have its own cell.
While you can get away with performing basic analysis and checks, having the ability to move data between formats increases competency for cleaning, flexibility for plotting, and legibility for outputting tables in a readable format.
Lets start by cleaning the vendor table
```{r}
clean_vendors <- vendors %>%
separate(name, c('name', 'company_contact'), sep = '- Contact: ') %>% # From the separate() section
select(-date_added_char) %>% # Remove one of the date fields
mutate(date_added = as_date(date_added, origin = '1970-01-01')) %>% # From the cleaning dates section
filter(type == 'V')
print(clean_vendors)
```
This vendor data set is a combination of both long and wide, and needs to be standardized to fit the tidy data model. Specifically:
- *Each variable must have its own column* - The vendor's spend and credits each year is separated over multiple columns. That is, the values of the variable (the year) is in the column name. In the end output, every combination of vendor-year should be a row.
- *Each observation must have its own row* - Each vendor has both a spend and credits value, within the `summary` column. Even though these are one observation for each vendor-year, they are separated by two rows.
- *Each value must have its own cell* - Thankfully, this seems to be valid (although if not, you can clean it with `separate()` and `separate_rows()`).
Why would you want to subscribe to this tidy data philosophy? When you have a tidy data set, you are more empowered to explore your data (especially with tidyverse related commands), but you can also have the power to reshape data as necessary to fit the visualizing output you are imagining.
## Pivots, wide and longer {#clean-pivot}
Effectively pivoting your data into more suitable columns and rows requires you to consider the following:
- What columns contains the values you want to aggregate?
- What do you want the new name of these columns to be? Consider both the name and aggregated value.
In the `clean_vendors` data, we see each year represented as columns, and those columns represent dollar amounts.
```{r}
clean_vendors
```
We can pivot this data by specifying the years, and then aggregating the amount. In this case, we want to take the columns and assign them to rows, and making the data longer. `pivot_longer()` will allow us to make this transformation:
```{r}
clean_vendors %>%
pivot_longer(cols = c(`2015`, `2016`, `2017`, `2018`), # The columns we want to pivot, in this case, the years.
names_to = 'year', # We want this collected column to be called the year
values_to = 'amount') # the values of the aggregation
```
This data is more readable - each year is on its own row. However, we still have the issue where the same company will have two rows for each year - for the `summary` field, both a spend and credits exist for a vendor:
```{r}
clean_vendors %>%
pivot_longer(cols = c(`2015`, `2016`, `2017`, `2018`), # The columns we want to pivot, in this case, the years.
names_to = 'year', # We want this collected column to be called the year
values_to = 'amount') %>% # the values of the aggregation
arrange(id) %>%
head(10)
```
Since spend and credits belong to a single vendor-year pairing, it may be worthwhile to to put both `summary` classifications into their own column. In this case, we want to make the table wider by adding columns, so lets use `pivot_wider()`. This function is similar to `pivot_longer` in the same approach:
- What columns contains the values you want to ~~aggregate~~ spread out? These values becomes the new column name.
- Where is the value located? These will get moved under the new column.
Building from our last example, lets make this data bit wider
```{r}
clean_vendors %>%
pivot_longer(cols = c(`2015`, `2016`, `2017`, `2018`), # The columns we want to pivot, in this case, the years.
names_to = 'year', # We want this collected column to be called the year
values_to = 'amount') %>% # the values of the aggregation
pivot_wider(names_from = 'summary',
values_from = 'amount')
```
Each company now has its spend and credits in separate columns, but is now digestible by year.
## Joining data
Often our data will come from different data sets or different tables. Having the ability to combine data from different systems is a powerful way to add more analyzable information from your dataset. In the [Completeness chapter](#completeness-join), we joined accounts from both the GL and TB to calculate whether there were any differences.
The [Relational Data chapter in R for Data Science](https://r4ds.had.co.nz/relational-data.html) goes into depth how the different join operations work.