---
title: "Reproducible Research with R Markdown, ipumsr, and the IPUMS API"
output: html_document
editor_options:
  markdown:
    wrap: 72
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```
Have you ever wanted to share a project using IPUMS data with a
colleague, but then thought, "Oh no, [I can't redistribute my IPUMS
data!"](https://www.ipums.org/about/terms)
Maybe you'd like a colleague to explore your findings. Or maybe you're a
teacher with an exercise you'd like your students to review and
replicate. In the past, if you wanted someone to use the same IPUMS
data that you did, you would need to provide a list of samples and
variables and instructions for your collaborator on how to navigate the
online data extract system.
If you're thinking that sounds like a pain, don't worry, the brand new
[IPUMS microdata API](https://beta.developer.ipums.org/docs/apiprogram/)
makes it easier than ever to share your extract definitions with fellow
IPUMS users!!! Using the microdata API, you and your collaborators can:
- Save an extract definition as a .json file that can be shared freely
- Submit a new extract request based on a .json definition
- Download data *and* metadata directly into your project directory
(this feature is a personal favorite)
![dwayne johnson clapping](Images/the-rock-applause.gif)
The latest version of [ipumsr](https://tech.popdata.org/ipumsr/)
contains new functions allowing users to call on the IPUMS microdata API
directly from R or RStudio. Python users should check out
[ipumspy](https://ipumspy.readthedocs.io/en/latest/index.html). For more
on the microdata API check out these other recent blog posts:
- [Introduction to the API]()
- [Using the API with
ipumsr](https://blog.popdata.org/interacting-with-the-ipums-extract-api-using-ipumsr/)
- [Making IPUMS extracts from
Stata](https://blog.popdata.org/making-ipums-extracts-from-stata/)
- [Introduction to extract sharing (with ipumspy
examples)](https://blog.popdata.org/)
In this post, I'll first introduce the ipumsr functions for saving an
extract definition to a .json file and loading a saved definition from
.json. Then, I'll demonstrate two use-cases for those functions: sharing
an analysis in an [R
Markdown](https://rmarkdown.rstudio.com/lesson-1.html) document, and
sharing an interactive application created with
[Shiny](https://shiny.rstudio.com/). Note that the code examples here
will only work once you've requested beta access to the IPUMS microdata
API by emailing ipums+api\@umn.edu and [set up your API
key](https://tech.popdata.org/ipumsr/articles/ipums-api.html).
# Sharing extracts using ipumsr
To share an extract using ipumsr, you first need an extract definition
to work with. You can create a new extract definition with
[`define_extract_usa()`](https://tech.popdata.org/ipumsr/reference/define_extract_usa.html)
or
[`define_extract_cps()`](https://tech.popdata.org/ipumsr/reference/define_extract_cps.html).
Or, if you've already submitted the extract, you can pull down the
definition of any submitted extract with `get_extract_info()`. This
works whether you created the extract with API functions or with the
online extract system. To pull down the definition of your IPUMS USA
extract number 10, you would use:
```{r eval = FALSE, echo = TRUE}
extract_to_share <- get_extract_info("usa:10")
```
Once you have your extract definition stored in an R object like
`extract_to_share`, you can save that definition to a .json file with:
```{r eval = FALSE, echo = TRUE}
save_extract_as_json(extract_to_share, file = "extract_to_share.json")
```
Then you can share the file `extract_to_share.json` with a collaborator,
or in a public repository such as GitHub, and anyone with the file can
submit their own identical extract request with:
```{r eval = FALSE, echo = TRUE}
cloned_extract_definition <- define_extract_from_json("extract_to_share.json")
submit_extract(cloned_extract_definition)
```
# Sharing an analysis in R Markdown
[R Markdown](https://rmarkdown.rstudio.com/lesson-1.html) is a
plain-text file format that allows you to combine prose, code, and
analysis output into one document. To help users share an analysis of
IPUMS data in an R Markdown document, we've created a new R Markdown
template, the ".Rmd for Reproducible Research" (RRR). You can [download
the template as a standalone file
here](https://raw.githubusercontent.com/ipums/ipumsr/reproducible_research_template/inst/rmarkdown/templates/rmd-for-reproducible-research/skeleton/skeleton.Rmd),
or you can install the development version of ipumsr (by following the
[instructions here](https://tech.popdata.org/ipumsr/index.html)) and
access the template through the RStudio menu interface as shown below.
The beauty of the RRR is that it allows your collaborators to run your
analysis out-of-the-box, without taking any separate steps to download
the data. How does it accomplish this? Let's take a look.
!["hold on to your butts" meme](Images/holdontoyourbutts.gif)
The first step in using the RRR workflow is to create a data extract.
While it is possible to create extracts entirely within R ([more on that
here](https://blog.popdata.org/interacting-with-the-ipums-extract-api-using-ipumsr/)),
many users (this author included) may want to use the online IPUMS
extract system to create and submit their extracts. Once you've
submitted your extract, take note of the extract number, then begin
working with the RRR as follows.
In RStudio, select File \> New File \> R Markdown:
![Screenshot of File menu in RStudio, with New File and R Markdown
selected.](Images/create-new-rmarkdown-file-menu.png)
In the popup menu, select From Template in the left sidebar, then
Rmd for Reproducible Research from the list of templates, and click OK:
![Screenshot of New R Markdown popup in RStudio, with From Template
selected in the left sidebar, and Rmd for Reproducible Research selected
from the list of templates.](Images/new-rmarkdown-from-template-rrr.png)
Now here we are, looking at a wall of instructions:
![Screenshot of the RRR R Markdown template file opened in the RStudio
editor.](Images/rrr-initial-open.png)
But don't worry! We've tried to make this as painless as possible. In
just a few steps you'll have your IPUMS data downloaded and the
framework for a shareable analysis project. These steps are described in
a bit more detail in the template itself, but we'll walk through them
quickly here. First, scroll down to the first code chunk, labeled
"project-parameters", and fill in values for the four parameters defined
there, as shown below: the IPUMS collection and extract number of your
submitted extract, a descriptive name for your extract, and a subfolder
in which to save your data files.
```{r eval = FALSE, echo = TRUE}
collection <- "usa"      # The IPUMS data collection of your extract; run
                         # `ipums_data_collections()` for a list of supported
                         # collections
extract_num <- NULL      # The extract number, or leave as `NULL` for your most
                         # recent extract
descriptive_name <- "my_ipums_extract" # A descriptive label for your extract;
                                       # used to rename your data files
data_dir <- "data"       # The folder in which to save data, codebook, and .json files
```
In fact, you can leave all the default values of these parameters if you
want to analyze your most recent IPUMS USA extract, though I'd recommend
filling in a better `descriptive_name` for the extract even in that
case. Since I'll be using IPUMS USA data on migration from the Puerto
Rican Community Survey, I'll fill in `"prcs_migration_analysis"` for
`descriptive_name`.
After filling in values, save the file, then click the RStudio "Knit"
button, and awaaaaaaay it goes! All that's left to do is sit back,
relax, and --
...wait, is that an error???
![Screenshot of error in the RStudio Render pane, reading "NOT AN ERROR:
usa extract number 117 is not yet ready to download. Try re-running
again later."](Images/extract-not-ready-error.png)
![Gif of Gimli falling down and saying "That was deliberate!" from The
Lord of the Rings: The Two
Towers](https://c.tenor.com/nppajdUlz6kAAAAC/gimli-the-lord-of-the-rings.gif){width="300"}
No! See, as both the "error" message and our friend Gimli indicate --
that's not an error! The RRR is designed to check whether your extract
is ready to download and stop execution if it isn't. As the message
says, you just have to try running again later. IPUMS extracts can take
anywhere from a few minutes to a few hours to process, depending on
their size and traffic levels in the extract engine.
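
The "stop if the extract isn't ready" behavior described above could look
roughly like the sketch below. This is not the template's actual code: the
helper name `stop_if_not_ready()` is hypothetical, and the sketch assumes
the extract object exposes `collection` and `number` fields. The `is_ready`
argument defaults to ipumsr's `is_extract_ready()` and exists only so the
logic can be tried offline with a stand-in function.

```{r eval = FALSE, echo = TRUE}
# Sketch: halt with a friendly message when an extract is not ready.
# `is_ready` defaults to ipumsr::is_extract_ready(); the default is only
# evaluated if no stand-in function is supplied.
stop_if_not_ready <- function(extract, is_ready = ipumsr::is_extract_ready) {
  if (!isTRUE(is_ready(extract))) {
    stop(
      "NOT AN ERROR: ", extract$collection, " extract number ", extract$number,
      " is not yet ready to download. Try re-running later.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Offline demonstration with a stand-in extract (a plain list, not a real
# ipumsr extract object) and a stand-in readiness check that always says "no".
fake_extract <- list(collection = "usa", number = 117)
result <- tryCatch(
  stop_if_not_ready(fake_extract, is_ready = function(x) FALSE),
  error = function(e) conditionMessage(e)
)
print(result)
```

With a real API key configured, you would call `stop_if_not_ready()` on the
result of `get_extract_info()` and leave `is_ready` at its default.
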
Once your extract is ready, clicking the "Knit" button will produce an
HTML report that looks like this:
![An HTML document rendered from the RRR template, with title
Reproducible Research and five tabs labeled Delete this section before
sharing, Load Packages, Define File Paths, Load your IPUMS Data, and
Analysis Awaits. The Analysis Awaits tab is active, and shows the first
10 rows of our IPUMS USA dataset.](Images/rrr-rendered-to-html.png)
With just a couple clicks, we've pulled our most recent IPUMS USA data
extract DIRECTLY into our R project! (I really can't overstate how cool
this feature is.)
Now for some clean up to get your analysis ready to share. In the HTML
report, click the "Delete this section before sharing" tab and scroll to
the very bottom to find some output like the following:

    Data, codebook, and .json extract definition files have been saved to folder "data".

    Next, copy the code below into the "Define File Paths" code chunk, overwriting the existing code:

    extract_definition_path <- "data/prcs_migration_analysis.json"
    data_path <- "data/prcs_migration_analysis.dat.gz"
    ddi_path <- "data/prcs_migration_analysis.xml"

    Finally, delete all text and code in the section "Delete this section before sharing"

As the instructions indicate, copy the three lines defining file paths
and paste them back into the R Markdown template file, overwriting the
existing code. This will hard code the paths to the .json, data, and DDI
codebook files so that we can delete the first section of the report,
where those paths were initially defined. Next, delete the section
labeled "Delete this section before sharing" from the R Markdown file --
everything from `## Delete this section before sharing` up to, but not
including, the `## Load Packages` section heading. This section is
designed to be deleted because it contains set up code and instructions
for you, the creator of the original analysis, which are not necessary
or relevant to your collaborators.
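
The three hard-coded paths follow a simple pattern: the `data_dir` folder,
the `descriptive_name`, and a file extension. As a rough illustration in
base R, the sketch below rebuilds them and reports any file that is not
where the paths say it should be. The helper name `build_extract_paths()`
is hypothetical and not part of the template.

```{r eval = FALSE, echo = TRUE}
# Hypothetical helper (not part of the RRR template): rebuild the three file
# paths from the `data_dir` and `descriptive_name` project parameters.
build_extract_paths <- function(data_dir, descriptive_name) {
  list(
    extract_definition_path = file.path(data_dir, paste0(descriptive_name, ".json")),
    data_path = file.path(data_dir, paste0(descriptive_name, ".dat.gz")),
    ddi_path = file.path(data_dir, paste0(descriptive_name, ".xml"))
  )
}

paths <- build_extract_paths("data", "prcs_migration_analysis")
str(paths)

# Report any file that does not exist yet at its expected location.
missing_files <- Filter(Negate(file.exists), unlist(paths))
if (length(missing_files) > 0) {
  message("Not found yet: ", paste(missing_files, collapse = ", "))
}
```

A check like this can be handy before sharing, since a collaborator's copy
of the project will rely on exactly these relative paths.
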
From here, we can fill out the remainder of the RRR with whatever
analysis we'd like, such as plotting migration rates over time:
![Screenshot of HTML report titled Reproducible Research, with tabs Load
Packages, Define File Paths, Load your IPUMS Data, and Analysis:
Migration in Puerto Rico 2015-2019. The last tab is active, and below
that is some summary text about migration in Puerto Rico, as well as
another tab set with tabs titled Overall, By education, By household
income, and By age. The By household income tab is active and shows a
series of line plots of the percentage of people who moved in the past
year by household income quintiles. People in the lowest household
income quintile were more likely to
move.](Images/rrr-migration-analysis-rendered.png)
In fact, this template can run out of the box, defaulting to **IPUMS
USA** and your **most recent extract**. Since this *just so happens* to
be the extract I'd like to work with, I can proceed without making **any
edits**, simply by clicking `Knit` or running `rmarkdown::render()`.
Your collaborators never need a copy of your data files. Instead, they
use the `.json` extract definition to create and submit a **new data
extract** the first time the script is run. By sharing as few as two
files, you can allow a colleague or student to download the exact same
IPUMS data you used in order to replicate or further explore your work.
We hope this helps make research more accessible and replicable.
Read on to see the RRR in action, as we explore some data from the
[Puerto Rican Community
Survey](https://usa.ipums.org/usa/sampdesc.shtml#us2019b), available
from IPUMS USA.
The basic assumptions of the template are that you:
1. Have registered with IPUMS USA (or IPUMS CPS)
2. Have generated an [IPUMS API
key](https://account.ipums.org/api_keys)
3. Have [added that key to your
.Renviron](https://tech.popdata.org/ipumsr/reference/set_ipums_api_key.html)
4. Have a specific dataset you want to download, analyze, and visualize
5. Would like to let other IPUMS users replicate the work
For this example, we're using IPUMS USA data, specifically looking at
the Puerto Rican Community Survey from the years 2015-2019.
To get started in R, make sure to update `ipumsr`, then select our new
`RRR template`.
```{r,fig.show="hold", out.width="25%"}
knitr::include_graphics(file.path("Images","rmd_template_1.png"))
knitr::include_graphics(file.path("Images","rmd_template_2.png"))
```
Now here we are, looking at a (possibly) overwhelming amount of code.
But don't worry! We've tried to make this as painless as possible. In
fact, this template can run out of the box, defaulting to **IPUMS USA**
and your **most recent extract**. Since this *just so happens* to be the
extract I'd like to work with, I can proceed without making **any
edits**, simply by clicking `Knit` or running `rmarkdown::render()`.
```{r,fig.show="hold", out.width="25%"}
knitr::include_graphics(file.path("Images","rmd_initial_open.png"))
```
And awaaaaaaay it goes! All that's left to do is sit back, relax, and -
...wait, is that an error???
```{r,fig.show="hold", out.width="25%"}
knitr::include_graphics(file.path("Images","open2-not_an_error.png"))
```
![gimli from lord of the rings saying "it was
deliberate"](Images/gimli-the-lord-of-the-rings.gif)
No! See, as both the "error message" and our friend *Gimli* indicate -
that's not an error! The `RRR` is set up to be run/Knit a few times, at
your leisure. The reason for this is to ensure that the IPUMS servers
have time to process your data requests. But look, something **did**
happen - we've added a subfolder named `Data`. And within that are two
new files: **a .json extract definition** and a `chk_....csv` file. The
first file contains all the information needed to get your data (or to
share with friends/loved ones) and the second file, you don't need to
worry about!
```{r,fig.show="hold", out.width="33%"}
knitr::include_graphics(file.path("Images","open1a.png"))
knitr::include_graphics(file.path("Images","open2a.png"))
```
```{r,fig.show="hold", out.width="50%"}
knitr::include_graphics(file.path("Images","open2a.png"))
```
Now, you may have noticed that these files are both called "template,"
and you might be wondering why. This is one of the default parameters of
the `RRR`. Users will want to edit this, which can easily be done in the
first code chunk of the `RRR`. Depending on your window and font size,
you may need to scroll, or you can use the **Table of Contents** to jump
down to **Setup-Project Parameters.** We'll set this to a more
descriptive name, since our main focus will be migration rates in Puerto
Rico.
```{r,fig.show="hold", out.width="50%"}
knitr::include_graphics(file.path("Images","params.png"))
knitr::include_graphics(file.path("Images","params_filled.png"))
```
Since we've changed the `descriptive_name` parameter, it's helpful to
delete the `.json` and `chk_.csv` files with the old name, "template",
before proceeding (if you set a proper descriptive name in the first
place, you would not need to delete anything). With the name updated,
awaaaay we knit!
And with just 2 clicks, we've pulled our most recent IPUMS USA data
DIRECTLY into our Rproj! (I really can't overstate how cool this feature
is). You'll notice there's some basic descriptive information included
by default. Feel free to replace these as you develop your analyses.
```{r,fig.show="hold", out.width="50%"}
knitr::include_graphics(file.path("Images","open4a.png"))
knitr::include_graphics(file.path("Images","open4.png"))
```
From here, we can fill out the remainder of the `RRR` with whatever
analysis we'd like, such as plotting migration rates over time. To see
the full features of the `RRR`, visit
[github.com/ipums/simple-api-shiny-app](https://github.com/ipums/simple-api-shiny-app).
Clone the repo to try out the interactive tabset .HTML report for
yourself. Or check out the [pre-rendered
version](https://github.com/ipums/simple-api-shiny-app/blob/main/prcs_migration_ex.html),
though many features such as code-folding are not available in this
version.
```{r,fig.show="hold", out.width="50%"}
knitr::include_graphics(file.path("Images","mig1.png"))
```
```{r,fig.show="hold", out.width="50%"}
knitr::include_graphics(file.path("Images","mig2.png"))
knitr::include_graphics(file.path("Images","mig3.png"))
knitr::include_graphics(file.path("Images","mig4.png"))
```
This template is very much in beta, so be sure to **share your
feedback** by emailing us at
[ipums+cran\@umn.edu](mailto:ipums+cran@umn.edu){.email} or [creating an
issue on GitHub](https://github.com/ipums/ipumsr/issues). As an
even-more-beta-bonus, we've included a simple Shiny App: the [Variable
Variation Value Viewer
(VVVV)](https://github.com/ipums/simple-api-shiny-app/VVVV), which uses
these functions in a similar way to create a self-compiling web-app.
# Sharing an interactive Shiny app
Also included in this repo is the [Variable Variation Value Viewer
(VVVV)](https://github.com/ipums/simple-api-shiny-app/VVVV). This app
follows the same steps as the `RRR`, but it also makes use of the
`wait_for_extract()` function. That's right, you can define, submit,
wait, and download your IPUMS data all automatically... though you may be
waiting a while for larger extracts. The extract used in this example is
intentionally small so that users do not need to wait long (avg \< 1 min)
for the app to load. As mentioned above, this is not meant to be a
robust, one-size-fits-all app. But it does provide a pretty neat way to
show users what you've done...
```{r,fig.show="hold", out.width="50%"}
knitr::include_graphics(file.path("Images","vvvv_1.png"))
```
And let them explore further trends...
```{r,fig.show="hold", out.width="33%"}
knitr::include_graphics(file.path("Images","vvvv_2.png"))
knitr::include_graphics(file.path("Images","vvvv_5.png"))
```
Complete with metadata
```{r,fig.show="hold", out.width="50%"}
knitr::include_graphics(file.path("Images","vvvv_3.png"))
```
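
For readers curious about the "define, submit, wait, download" chain, here
is a minimal sketch of how those ipumsr calls might be strung together. It
is not the VVVV's actual code: the function name `fetch_shared_extract()`
and the `download_dir` default are assumptions for illustration. It also
assumes ipumsr is installed and your API key is set up. Nothing touches the
API until the function is called.

```{r eval = FALSE, echo = TRUE}
# Sketch of the "define, submit, wait, download" chain.
# `fetch_shared_extract()` is a hypothetical name, not an ipumsr function.
fetch_shared_extract <- function(json_path, download_dir = "data") {
  # Fail early, before any API call, if the shared definition is missing.
  if (!file.exists(json_path)) {
    stop("Extract definition not found: ", json_path, call. = FALSE)
  }
  dir.create(download_dir, showWarnings = FALSE, recursive = TRUE)

  extract   <- ipumsr::define_extract_from_json(json_path)
  submitted <- ipumsr::submit_extract(extract)
  ready     <- ipumsr::wait_for_extract(submitted)  # blocks until processing ends
  ddi_path  <- ipumsr::download_extract(ready, download_dir = download_dir)

  # Read the microdata using the downloaded DDI codebook.
  ipumsr::read_ipums_micro(ddi_path)
}
```

Calling `fetch_shared_extract("extract_to_share.json")` would then return
the data once the extract finishes processing, which is why keeping the
extract small matters for an app that waits at startup.
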
We hope this inspires some cool new uses of IPUMS data. Happy coding and
remember,
Use it for Good.