forked from KimmoVehkalahti/IODS-project
## Course diary
### Week 1
Thoughts after the first week:
* Created IODS-project repository
* Installed R and RStudio on Ubuntu 20.04 (Linux). RStudio (latest
1.3.1093) keeps crashing very frequently. R seems to work. For now
I'm using Rscript to generate HTML from the .Rmd files.
* I've used GitHub for years so that part was easy.
* RStudio did not clone the GitHub repository as I would like. It did not
store the username in the cloned repository correctly. Thus manual
``git pull`` or ``git push`` did not work properly with SSH keys (it asked
for git username and password). I had to tweak it manually to make
SSH keys work correctly.
* The actual information on the lecture could have been presented much
faster. I will probably read the information from the book/web
pages/transcripts in the coming weeks rather than listening to
lectures.
### Week 2
Thoughts for the second week:
* I went through the DataCamp exercises. The platform was simple and
easy to use, even though this would not have been my preferred mode
of learning. I would rather have read a description of the R
philosophy and how main commands work and then approached it as a
programming language. Now rather than learning the basics we are
forced to learn snippets that may be useful in themselves, but we
don't learn to understand what options and commands really mean or
what the philosophy and basic concepts are.
* It took some time to find out how to use R. I'm not a fan of its
programming language syntax, but it certainly seems useful for various
statistical and plotting tasks once you learn the different libraries.
The challenge is that you need to become familiar with a number of
libraries. In some sense Python, pandas, scikit-learn, and matplotlib still
feel easier for me.
* I've used various types of linear regression many times in the past
(least squares, Lasso, Ridge), but haven't really looked at
significance analysis and the distribution of residual errors in the
past. I think that analysis may turn out to be a useful tool for me
in the future. I would probably implement it in Python for my machine
learning applications rather than use R though.
* R Markdown is kind of a cool idea and nice for prototyping or coursework,
but I still fail to see how to utilize it for an academic paper. The nice
thing is that it leaves a trail that makes repeating the procedure easy.
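
The significance analysis and residual check mentioned above for linear regression can be sketched in plain Python. This is only a minimal illustration under stated assumptions: the function name `simple_ols` and the toy data are invented here, only a single predictor is handled, and turning the t statistic into a p-value would additionally need a t-distribution (for example from SciPy), which is left out.

```python
import math

def simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual variance uses n - 2 degrees of freedom (two fitted parameters).
    sigma2 = sum(r ** 2 for r in residuals) / (n - 2)
    se_slope = math.sqrt(sigma2 / sxx)
    return {
        "slope": slope,
        "intercept": intercept,
        "se_slope": se_slope,
        "t_slope": slope / se_slope,  # compare against a t distribution with n - 2 df
        "residuals": residuals,
    }

# Invented toy data, roughly y = 2x.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]
fit = simple_ols(x, y)
print(fit["slope"], fit["se_slope"], fit["t_slope"])
```

A large absolute t statistic suggests the slope differs from zero. The residuals can then be inspected (for example with a histogram or a Q-Q plot) to see whether they look roughly normal and free of patterns.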
### Week 3
Thoughts after the third week
* I'm becoming frustrated. I read through the chapters of the book,
but found them too general and vague, lacking precise analysis and
description of the topics. Perhaps I'd prefer a more mathematical
approach to the topics.
* I'm finding I really dislike R syntax and the study approach taken
in the DataCamp exercises. They are not hard, but they have us
perform tasks without first gaining an understanding of the
available commands and operations. We haven't looked at even the
basic concepts of the *programming language* that R is, yet we are
learning and memorizing snippets in the hope that they might be used
for something useful. As someone with a long programming
background, I'm finding this approach very frustrating, inefficient,
and annoying. I'm waiting for a book on R programming to arrive.
* I do appreciate the graphics and significance testing that is easily
available with R packages.
* The Super-Bonus exercise was fun.
### Week 4
Thoughts after the fourth week
* These exercises are taking quite a few hours to do. They are not difficult,
but there is a lot of tedious detail. This is working to teach some
rote skills, but the concepts and overall approach in R have not been
discussed. R really looks like a hack, though a lot of people have
put a lot of effort into it.
* I got a book on R, which seems to help somewhat, but it doesn't
quite go as deep as I'd like either. It's Nicholas J. Horton and
Ken Kleinman: Using R and RStudio for Data Management, Statistical
Analysis, and Graphics, 2nd ed, CRC Press, 2015. The book is helpful
though.
### Week 5
Thoughts after the fifth week
* I've used PCA previously, but haven't really paid attention to data
normalization (perhaps it hasn't been a major issue in my
applications). Nevertheless, this exercise clearly points out how
important normalization is in this context. This is a useful takeaway.
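
The effect of normalization on PCA noted above can be sketched with two variables in plain Python. This is a minimal illustration, not the course exercise: the helper `first_pc`, the variable names, and the toy data (height in metres, income in euros) are all invented, and the closed-form eigenvalue formula only applies to the two-variable case.

```python
import math
import statistics

def first_pc(xs, ys):
    """Return (variance share, unit loadings) of the first principal component
    for two variables, using the closed-form eigenvalues of the 2x2 covariance
    matrix (hypothetical helper)."""
    n = len(xs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    trace = var_x + var_y
    det = var_x * var_y - cov_xy ** 2
    largest = trace / 2 + math.sqrt(max(trace ** 2 / 4 - det, 0.0))
    # Eigenvector for the largest eigenvalue (assumes cov_xy != 0).
    vx, vy = cov_xy, largest - var_x
    norm = math.hypot(vx, vy)
    return largest / trace, (vx / norm, vy / norm)

def standardize(values):
    """Scale to zero mean and unit sample standard deviation."""
    m, s = statistics.fmean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

# Invented toy data on very different scales.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
income_eur = [30000, 42000, 38000, 52000, 47000]

raw_share, raw_loadings = first_pc(height_m, income_eur)
std_share, std_loadings = first_pc(standardize(height_m), standardize(income_eur))
print(raw_share, raw_loadings)
print(std_share, std_loadings)
```

Without scaling, the first component is essentially the income axis, simply because income has the largest numeric variance. After standardization both variables contribute with equal weight.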
### Week 6
* I definitely hit the low point of motivation and high point of
resentment towards this course during Exercise 6.
* The way the Analysis task was defined in Exercise 6 was in my
opinion unacceptable. It defines the tasks by reference to chapters
of the book (MABS), but those chapters are not available in the
"special edition" available for the online course and only
downloadable through the university library if you haven't used your
100 page limit on EBSCO. I believe the university library is closed
due to covid-19 so getting a physical copy from there is out of the
question. It is too late to order from Amazon. Now I think it is
fine to require a book for a course (if the requirement is announced
at the beginning of the course), but I detest defining the tasks
to be performed in the exercise by reference to chapters of the book
that I did not expect to need for the course and that are not easily
available.
* Ok, I was told on the chat area that these chapters are at the end of the
MABS special edition, out-of-place. The table of contents does not reflect
this, and apparently the only way to find them and know that they are there
is to page through the whole document. It never occurred to me that the
chapters in a PDF document with a table of contents and numbered pages
would be out-of-order.
* I don't know if using a non-unique subject identifier in the BPRS
dataset was intentional or not. It was rather devious.
* The DataCamp exercises seemed to have a few problems, the most
serious being the incorrect value of ``n`` in computing the standard error.
* The more I use R, the more I dislike its programming language. I
will grant, though, that it is quite handy for certain visualization
and analysis tasks. I will probably look into Python and pandas
next (I've used matplotlib, numpy, scipy, sklearn, etc. before).
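
The two data issues above (a subject identifier that is not unique across treatment groups, and the choice of ``n`` in the standard error) can be sketched in plain Python. This is a minimal illustration with invented toy records, not the BPRS data, and it does not try to reconstruct which value of ``n`` the DataCamp exercise actually used. The standard error of the mean is taken as the sample standard deviation divided by the square root of ``n``.

```python
import math
import statistics

# Invented toy records: (treatment, subject, score).
# Subject numbers restart in each treatment group, so they are not unique.
records = [
    (1, 1, 42), (1, 2, 58), (1, 3, 50),
    (2, 1, 47), (2, 2, 61), (2, 3, 39),
]

# Counting by subject number alone merges different people.
naive_ids = {subject for _, subject, _ in records}
# A composite key (treatment, subject) identifies each person uniquely.
unique_ids = {(treatment, subject) for treatment, subject, _ in records}

def standard_error(values, n):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(n)

scores = [score for _, _, score in records]
se_naive = standard_error(scores, len(naive_ids))     # n = 3, too few subjects
se_correct = standard_error(scores, len(unique_ids))  # n = 6

print(len(naive_ids), len(unique_ids))
print(se_naive, se_correct)
```

Because ``n`` enters through a square root, undercounting the subjects inflates the standard error, and overcounting shrinks it.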