---
title: "Visualisation and statistical analysis"
author: "Michael Dunn, Dept. of Linguistics and Philology, Uppsala University"
date: "Session 2, 2018-10-01"
output: html_notebook
---
## Contents
- Good Enough Practices
- Review of session 1
- Making a notebook
- Names, operators, functions
- Tidyverse installation
- Vectors and matrices
    - indexing
    - 2-dimensional indexes (matrix and data.frame)
- Flow control
    - loops
    - if-else
- Tabular data
    - read_tsv etc.
- **HOMEWORK**
    - Exercise: wrangling the Kelly data
    - Exercise: write a complex function with defaults
## Revision etc.
```{r}
getwd() # where RStudio thinks we are
list.files()
```
NOTE: it is better to experiment with this in the console; notebooks don't like changing directories midstream.
```{r}
setwd("/Users/micdu303/Dropbox/Work/Uppsala/teaching/visualisation-and-statistical-analysis/")
list.files()
```
## Functions
A more concise version of our function:
```{r}
get.hypotenuse <- function(a, b){sqrt(a^2 + b^2)}
get.hypotenuse(4,5)
```
Here's a function to get the volume of a sphere (v = 4/3πr³):
```{r}
get.spherical.volume <- function(r) {4 / 3 * pi * (r ^ 3)}
get.spherical.volume(10)
```
Remember that if you change a chunk above, you also have to run it again before later chunks have access to the changes. Note the "Run all chunks above" button (a downward-pointing triangle over a line) next to the "Run current chunk" button on the right side of the screen here in the markdown view.
```{r}
get.hypotenuse(19, 13)
```
```{r}
get.cylindrical.volume <- function(r, h){
  h * pi * r ^ 2
}
get.cylindrical.volume(1, 10)
```
### Positional and keyword arguments
Positional arguments (R knows which argument is which by the position in the function call)
```{r}
my.func <- function(first_argument, second_argument, etc){
  # do something with first_argument, second_argument, etc
}
```
Arguments with defaults (use `=` in the function definition to give a default value)
```{r}
my.func <- function(radius, dimensions=2){
  if (dimensions == 2){
    pi * radius ^ 2
  } else if (dimensions == 3) {
    4 / 3 * pi * radius ^ 3
  } else {
    -999
  }
}
my.func(1)
my.func(1, 3)
my.func(1, 2)
```
Keyword arguments: specify the arguments in the function call (advantages: this is self-documenting, easier to know which argument is which; arguments can be given in any order)
```{r}
my.func(dimensions=3, radius=1)
```
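A small sketch of how positional and keyword arguments can be mixed. The function `box.volume` and its defaults are invented for illustration; they are not part of the course code.

```{r}
# box.volume is a made-up example function
box.volume <- function(width, height = 1, depth = 1){
  width * height * depth
}
by.position <- box.volume(2, 3, 4)                          # all positional
by.keyword <- box.volume(depth = 4, width = 2, height = 3)  # keywords, in any order
mixed <- box.volume(2, depth = 4)                           # height keeps its default of 1
by.position
by.keyword
mixed
```

The first two calls give the same answer; the third skips `height`, so the default is used.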
### The pipe operator
Import the tidyverse to use "pipes".
```{r}
library(tidyverse)
```
With the pipe operator the item on the left is ‘sent’ to the next function as its first argument.
```{r}
sqrt(9)
```
With a pipe:
```{r}
9 %>% sqrt()
```
Using the pipe operator gives the same result as a regular function call: `x %>% f()` is the same as `f(x)`.
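A minimal check of this, assuming the magrittr package is installed (it supplies the same `%>%` operator that `library(tidyverse)` makes available). The numbers are arbitrary.

```{r}
library(magrittr)  # provides %>%

nested <- round(sqrt(10), 2)
piped <- 10 %>% sqrt() %>% round(2)  # 10 goes to sqrt(), the result goes to round()
identical(nested, piped)
```

Any extra arguments (here the `2` for `round`) come after the piped value, which always fills the first argument.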
You can do crazy things with pipes:
```{r}
(3**2 + 4**2) %>% sqrt()
```
But you can also do very very sensible things, especially when working with tables. Here's a preview:
```{r eval=FALSE}
# you must load tidyverse first
write_csv(select(filter(read_csv("class_list.csv"), Course.code=="5LN139"), -Personnummer), "anon_class_list.csv")
```
i.e. nested functions are horrible to read; they get even worse if the functions have multiple arguments
```{r eval=FALSE}
f(g(h(i(j("stuff")))))
f(g(h(i(j("stuff", type=char), junk=TRUE), foo=999), save=TRUE, blah="XXX"), verbose=FALSE)
```
Translating this into pipes, we get
```{r eval=FALSE}
read_csv("class_list.csv") %>%
  filter(Course.code=="5LN139") %>%
  select(-Personnummer) %>%
  write_csv("anon_class_list.csv")
# this could all have been on one line
```
i.e.
1. read the csv file, and send the output to
2. the filter function, which chooses certain rows and sends the output to
3. the select function, which (here) deletes an unwanted column, then sends the output to
4. the write_csv function, which puts the output into a new file
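The four steps can be tried without the real class list. The sketch below assumes the readr and dplyr packages (both part of the tidyverse) and uses an invented three-row table written to a temporary file; the names and ID values are made up.

```{r}
library(readr)
library(dplyr)

# invented stand-in for class_list.csv
class_list <- tibble(
  Name = c("A", "B", "C"),
  Course.code = c("5LN139", "5LN140", "5LN139"),
  Personnummer = c("id-1", "id-2", "id-3")
)
in_file <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".csv")
write_csv(class_list, in_file)

read_csv(in_file, show_col_types = FALSE) %>%  # 1. read the file
  filter(Course.code == "5LN139") %>%          # 2. keep only rows for one course
  select(-Personnummer) %>%                    # 3. drop the unwanted column
  write_csv(out_file)                          # 4. write the result to a new file

anon <- read_csv(out_file, show_col_types = FALSE)
anon
```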
## Indexing vectors
Remember the built in vector `month.name`:
```{r}
month.name
```
Indexes are numbers that pick out which position/s in the vector to choose. Indexes are written in square brackets after the thing they are indexing.
```{r}
month.name[2]
```
Ranges
```{r}
month.name[1:3]
```
Use a vector of numerals as index
```{r}
month.name[c(1,4,9)]
```
Negative index to drop a value
```{r}
month.name[-4]
```
```{r}
month.name[c(-2,-11)]
```
Mixing positive and negative indexes in one vector doesn't make sense, and R gives an error if you try.
Repeat indexes
```{r}
month.name[c(1,1,1,1,1,2,2)]
```
Vectors can be added and multiplied together, element by element:
```{r}
a <- 1:6
b <- c(0,1,0,0,1,1)
a
b
a + b
a * b
```
Matrix
```{r}
matrix(0, nrow=4, ncol=3)
```
```{r}
m <- matrix(0, nrow=4, ncol=3)
m[2,3] <- 99
```
```{r}
n <- matrix(1:12, nrow=4, ncol=3)
n
```
```{r}
n + n
```
Data frame: a table with columns of the same length, where each column holds one type of data
```{r}
data.frame(age=c(4,5,3), name=c("Timmy", "Anna", "Susan"))
```
```{r}
df <- data.frame(age=c(4,5,3), name=c("Timmy", "Anna", "Susan"))
df[1,2]
df[1,] # the entire row
```
```{r}
df[,2]
```
Using named columns with $
```{r}
df$age
```
```{r}
df[c(1,3), ]
df[c(1,3), 2]
```
## Flow control
### Loops
```{r}
for (i in 1:10){
  print(i ^ 2)
}
```
### if-else
```{r}
age <- 10
if (age < 18) {
  "child"
}
```
```{r}
if (age < 18) {
  "child"
} else {
  "adult"
}
```
```{r}
# returning values rather than using print()
get_age_group <- function(age){
  if (age < 13){
    "child"
  } else if (age < 19) {
    "teenager"
  } else {
    "adult"
  }
}
get_age_group(2)
get_age_group(12)
get_age_group(22)
get_age_group(15)
```
Feed a vector to this function:
```{r}
ages <- c(2,16,22,33,97,2,1,54,13)
for (age in ages){
  print(get_age_group(age))
}
```
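The loop above prints each result but does not keep it. One possible alternative, not covered in the session, is base R's `sapply`, which calls the function once per element and collects the answers in a vector. The function is repeated here only so that the chunk runs on its own.

```{r}
# same definition as above
get_age_group <- function(age){
  if (age < 13){
    "child"
  } else if (age < 19) {
    "teenager"
  } else {
    "adult"
  }
}
ages <- c(2,16,22,33,97,2,1,54,13)
age_groups <- sapply(ages, get_age_group)  # one result per element of ages
age_groups
```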
## Delimited files
Note that you have to specify file paths, either
1. relative to where this markdown file is located (this is what I've done below; on my computer this folder has a subfolder called "data" where my data files live)
2. absolute paths (i.e. relative to the "top level" of your computer's file hierarchy). In different systems this will look something like one of the following:
```{r eval=FALSE}
"/Users/michael/Documents/my_file.csv"
"~/Documents/my_file.csv"
# These two are the same; ~ is shorthand for the user home directory, e.g. /Users/michael
# Windows version
"C:/Documents/my_file.csv"
# Note that you need to use forward slash / rather than a single backslash \, even on Windows
```
Read a help file:
```{r}
help(read_csv)
```
Read a csv (or similar) file

N.B. the `read_XXX` and `write_XXX` functions with `_` are from the tidyverse; the other versions (from "base R") are written with dots: `read.XXX` and `write.XXX`

Remember
- read_csv is comma separated
- read_tsv is tab separated
- read_csv2 is semicolon separated
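To see the three separators side by side without creating any files, the readr functions can also parse a literal string wrapped in `I()`. This is a sketch assuming readr version 2.0 or later; the one-row table is invented.

```{r}
library(readr)

comma <- read_csv(I("name,age\nAnna,5\n"), show_col_types = FALSE)
tab <- read_tsv(I("name\tage\nAnna\t5\n"), show_col_types = FALSE)
semi <- read_csv2(I("name;age\nAnna;5\n"), show_col_types = FALSE)

comma
tab
semi
```

All three calls should give the same one-row table; only the separator in the input differs.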
```{r}
library(tidyverse)
favourites <- read_csv2("favourite_colours.csv")
favourites
```
```{r}
favourites$name
```
If you have spaces in a column name, you have to wrap the name in backticks (`` ` ``)
```{r}
favourites$`favourite colour`
```
Boolean test on a vector:
```{r}
values <- 1:10
values
values <= 6
```
Use these as indexes
```{r}
values[values <= 6]
```
```{r}
favourites$age
favourites$age > 10
favourites[favourites$age > 10, ]
# favourites[WHICH_ROWS, WHICH_COLS], where blank means "all"
```
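The same pattern on a small data frame typed in by hand, for trying out without the csv file. The names and ages are invented.

```{r}
# invented data, standing in for the favourites table
people <- data.frame(name = c("Timmy", "Anna", "Susan"), age = c(4, 15, 30))
is.older <- people$age > 10              # one TRUE/FALSE per row
older <- people[is.older, ]              # rows where the test is TRUE, all columns
older.names <- people[is.older, "name"]  # the same rows, only the name column
older
older.names
```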
# Homework 1: Try to import some data
## Starting work with tables
- First experiment with some fake data of your own, like my "favourite_colours.csv" file
- Then torture yourself with the *Kelly* frequency based vocabulary list from Språkbanken
Here is the [downloadable data](https://svn.spraakdata.gu.se/sb-arkiv/pub/lexikon/kelly/Swedish-Kelly_M3_CEFR.xls) of the [Swedish Kelly-List](https://spraakbanken.gu.se/resource/kelly#tabs=information), from Språkbanken.
This is an Excel sheet
- There are packages that *might* be able to read Excel, but scripting languages are always happier when processing *text*
- The *good-enough* way to do this is to convert the Excel sheet to text yourself
- Excel output is messy. LibreOffice/OpenOffice is usually better.
- Delimiters might be "," or ";" or "TAB"; fields might have quotes, or might not
- Don't forget `help(read_csv)` if there's something it's doing that isn't quite right
## Loading data
1. Download the [Swedish-Kelly_M3_CEFR.xls](https://svn.spraakdata.gu.se/sb-arkiv/pub/lexikon/kelly/Swedish-Kelly_M3_CEFR.xls) from Språkbanken
2. Open it and take a look
3. Save the sheet with data as csv
4. Open the csv file with a text editor, or by using RStudio's own "File" → "Open File..."
- What kind of separators does it use? Is there anything obviously wrong with the exported data? (Hint: the header line might be a bit strange.)
5. Try to open it directly, using a function from the tidyverse's `read_delim` family of functions
- If it fails to load, try to fix the errors manually in the csv file
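A sketch of step 5, assuming the export turned out to be semicolon-separated UTF-8. The file here is a tiny invented stand-in written to a temporary location, and the column names are made up; the real export will look different.

```{r}
library(readr)

# write a small semicolon-separated file (readr's write_lines writes UTF-8)
tmp <- tempfile(fileext = ".csv")
write_lines(c("ID;Lemma;WPM", "1;och;25000", "2;\u00e4r;18000"), tmp)

# read it back, stating the delimiter and the encoding explicitly
kelly.demo <- read_delim(tmp, delim = ";",
                         locale = locale(encoding = "UTF-8"),
                         show_col_types = FALSE)
kelly.demo
```

If å, ö, ä come out garbled with your own file, the `encoding` given to `locale()` is the first thing to check.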
## Subsetting data ("filtering" rows)
- We'll spend lots of time on this next week, but here's a promise...
![](subset-rows.png)
Get a subset of rows, use filter
```{r}
kelly.subset <- filter(kelly, WPM > 12000 & WPM != 1000000)
kelly.subset
```
```{r}
select(kelly.subset, WPM, `Swedish items for translation`)
```
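The two chunks above need a `kelly` table that has already been loaded. To try `filter` and `select` before the import works, here is a sketch on a few invented rows that use the same two column names. It assumes the dplyr package (part of the tidyverse).

```{r}
library(dplyr)

# invented rows, only the column names are taken from the chunks above
kelly.fake <- tibble(
  WPM = c(1000000, 25000, 18000, 900),
  `Swedish items for translation` = c("i", "och", "vara", "skog")
)
fake.subset <- filter(kelly.fake, WPM > 12000 & WPM != 1000000)  # rows passing both tests
fake.subset
select(fake.subset, `Swedish items for translation`)             # keep one column
```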
## Subsetting data ("selecting" columns)
![](subset-columns.png)
Write this data to a text file
```{r}
write_csv(kelly.subset, "data/kelly_subset.csv")
```
**NOTE** I never succeeded in getting Excel to export a csv file in a sensible format: it always did something weird with the file encoding, so it was impossible to view the characters å, ö, ä. The only thing that worked was to do the export from LibreOffice (a free office package) instead, specifying Unicode UTF-8 output. I'll put my functioning version of `Swedish-Kelly_M3_CEFR.csv` on the course website.
# Homework 2: A complex function
## Buy a flat in Uppsala
It is quite complicated to turn housing prices quoted e.g. on [hemnet](https://www.hemnet.se/) into the real monthly cost that you would have to pay back. Make a function to calculate this. Use keyword arguments with defaults where appropriate.
**price**
- the selling price quoted on hemnet
**fee**
- *avgift*, the compulsory monthly fee to the housing association
**interest**
- the bank determines the rate, but let's use a default of 1.7%
**amortization**
- compulsory minimum 2% per year that you have to repay from your loan
**number of residents**
- set a sensible default
**operating costs**
- *drift*. The annual cost is sometimes quoted on the website (divide it by 12 to get the monthly cost). If there are no driftkostnader quoted, estimate the monthly cost as `residents * 300`. *Hint:* make the default an impossible value (e.g. -1), and then have a test in your function:
```{r eval=FALSE}
if (operating_costs < 0) {
  # set the estimated operating_costs
}
```
**deposit**
- a percentage, by law 15% minimum (which is also a good default)
The formula:
- The amount borrowed from the bank is the price minus the deposit
- The annual cost of the loan is the amount borrowed multiplied by the sum of the bank interest rate and the amortization rate; the monthly cost of the loan is 1/12 of this
- The monthly cost is the monthly cost of the loan plus the association fee plus the monthly operating costs
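
To check your function against, here is the formula worked through once as plain arithmetic. All the figures are invented, and this is deliberately not wrapped in a function with defaults: that part is the homework.

```{r}
# invented figures, only meant as a check on the arithmetic
price <- 2000000                    # selling price
deposit <- 0.15                     # 15% paid up front
interest <- 0.017                   # 1.7% per year
amortization <- 0.02                # 2% per year
fee <- 3000                         # monthly association fee
residents <- 2
operating_costs <- residents * 300  # monthly estimate, as if no driftkostnad were quoted

loan <- price * (1 - deposit)                               # amount borrowed: 1 700 000
monthly_loan_cost <- loan * (interest + amortization) / 12  # 1/12 of the annual cost
monthly_cost <- monthly_loan_cost + fee + operating_costs
monthly_cost  # about 8841.67
```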