---
title: "Reproducible Research with R Markdown, ipumsr, and the IPUMS API"
output: html_document
editor_options: 
  markdown: 
    wrap: 72
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
```

Have you ever wanted to share a project using IPUMS data with a
colleague, but then thought, "Oh no, [I can't redistribute my IPUMS
data](https://www.ipums.org/about/terms)!"

Maybe you'd like a colleague to explore your findings. Or maybe you're a
teacher with an exercise you'd like your students to review and
replicate. In the past, if you wanted someone to use the same IPUMS
data that you did, you would need to provide a list of samples and
variables, plus instructions for your collaborator on how to navigate
the online data extract system.

If you're thinking that sounds like a pain, don't worry: the brand-new
[IPUMS microdata API](https://beta.developer.ipums.org/docs/apiprogram/)
makes it easier than ever to share your extract definitions with fellow
IPUMS users! Using the microdata API, you and your collaborators can:

-   Save an extract definition as a .json file that can be shared freely

-   Submit a new extract request based on a .json definition

-   Download data *and* metadata directly into your project directory
    (this feature is a personal favorite)

![dwayne johnson clapping](Images/the-rock-applause.gif)

The latest version of [ipumsr](https://tech.popdata.org/ipumsr/)
contains new functions allowing users to call on the IPUMS microdata API
directly from R or RStudio. Python users should check out
[ipumspy](https://ipumspy.readthedocs.io/en/latest/index.html). For more
on the microdata API, check out these other recent blog posts:

-   [Introduction to the API]()
-   [Using the API with
    ipumsr](https://blog.popdata.org/interacting-with-the-ipums-extract-api-using-ipumsr/)
-   [Making IPUMS extracts from
    Stata](https://blog.popdata.org/making-ipums-extracts-from-stata/)
-   [Introduction to extract sharing (with ipumspy
    examples)](https://blog.popdata.org/)

In this post, I'll first introduce the ipumsr functions for saving an
extract definition to a .json file and loading a saved definition from
.json. Then, I'll demonstrate two use-cases for those functions: sharing
an analysis in an [R
Markdown](https://rmarkdown.rstudio.com/lesson-1.html) document, and
sharing an interactive application created with
[Shiny](https://shiny.rstudio.com/). Note that the code examples here
will only work once you've requested beta access to the IPUMS microdata
API by emailing ipums+api\@umn.edu and [set up your API
key](https://tech.popdata.org/ipumsr/articles/ipums-api.html).

# Sharing extracts using ipumsr

To share an extract using ipumsr, you first need an extract definition
to work with. You can create a new extract definition with
[`define_extract_usa()`](https://tech.popdata.org/ipumsr/reference/define_extract_usa.html)
or
[`define_extract_cps()`](https://tech.popdata.org/ipumsr/reference/define_extract_cps.html).
Or, if you've already submitted the extract, you can pull down the
definition of any submitted extract with `get_extract_info()`. This
works whether you created the extract with API functions or with the
online extract system. To pull down the definition of your IPUMS USA
extract number 10, you would use:

```{r eval = FALSE, echo = TRUE}
extract_to_share <- get_extract_info("usa:10")
```

Once you have your extract definition stored in an R object like
`extract_to_share`, you can save that definition to a .json file with:

```{r eval = FALSE, echo = TRUE}
save_extract_as_json(extract_to_share, file = "extract_to_share.json")
```

Then you can share the file `extract_to_share.json` with a collaborator,
or in a public repository such as GitHub, and anyone with the file can
submit their own identical extract request with:

```{r eval = FALSE, echo = TRUE}
cloned_extract_definition <- define_extract_from_json("extract_to_share.json")
submit_extract(cloned_extract_definition)
```
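
Before submitting a definition that someone else shared, a collaborator
may want to see what it requests. Here is a minimal sketch, assuming the
jsonlite package is installed. `peek_extract_definition()` is a
hypothetical helper name, not an ipumsr function, and the fields it
prints depend on how ipumsr wrote the file:

```r
# Hypothetical helper (not part of ipumsr): read a shared .json extract
# definition and print its structure without submitting anything.
peek_extract_definition <- function(path) {
  stopifnot(file.exists(path))
  definition <- jsonlite::fromJSON(path)
  # str() gives a compact overview of whatever fields the file contains
  utils::str(definition)
  invisible(definition)
}

# Usage, with the file name from the example above:
# peek_extract_definition("extract_to_share.json")
```

Nothing here talks to the API, so it works even before an API key is set
up.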

# Sharing an analysis in R Markdown

[R Markdown](https://rmarkdown.rstudio.com/lesson-1.html) is a
plain-text file format that allows you to combine prose, code, and
analysis output into one document. To help users share an analysis of
IPUMS data in an R Markdown document, we've created a new R Markdown
template, the ".Rmd for Reproducible Research" (RRR). You can [download
the template as a standalone file
here](https://raw.githubusercontent.com/ipums/ipumsr/reproducible_research_template/inst/rmarkdown/templates/rmd-for-reproducible-research/skeleton/skeleton.Rmd),
or you can install the development version of ipumsr (by following the
[instructions here](https://tech.popdata.org/ipumsr/index.html)) and
access the template through the RStudio menu interface as shown below.

The beauty of the RRR is that it allows your collaborators to run your
analysis out-of-the-box, without taking any separate steps to download
the data. How does it accomplish this? Let's take a look.

!["hold on to your butts" meme](Images/holdontoyourbutts.gif)

The first step in using the RRR workflow is to create a data extract.
While it is possible to create extracts entirely within R ([more on that
here](https://blog.popdata.org/interacting-with-the-ipums-extract-api-using-ipumsr/)),
many users (this author included) may want to use the online IPUMS
extract system to create and submit their extracts. Once you've
submitted your extract, take note of the extract number, then begin
working with the RRR as follows.

In RStudio, select File \> New File \> R Markdown:

![Screenshot of File menu in RStudio, with New File and R Markdown
selected.](Images/create-new-rmarkdown-file-menu.png)

In the popup menu, select From Template in the left sidebar, then
Rmd for Reproducible Research from the list of templates, and click OK:

![Screenshot of New R Markdown popup in RStudio, with From Template
selected in the left sidebar, and Rmd for Reproducible Research selected
from the list of templates.](Images/new-rmarkdown-from-template-rrr.png)

Now here we are, looking at a wall of instructions:

![Screenshot of the RRR R Markdown template file opened in the RStudio
editor.](Images/rrr-initial-open.png)

But don't worry! We've tried to make this as painless as possible. In
just a few steps you'll have your IPUMS data downloaded and the
framework for a shareable analysis project. These steps are described in
a bit more detail in the template itself, but we'll walk through them
quickly here. First, scroll down to the first code chunk, labeled
"project-parameters", and fill in values for the four parameters defined
there, as shown below: the IPUMS collection and extract number of your
submitted extract, a descriptive name for your extract, and a subfolder
in which to save your data files.

```{r eval = FALSE, echo = TRUE}
collection <- "usa" # The IPUMS data collection of your extract; run 
                    # `ipums_data_collections()` for a list of supported
                    # collections

extract_num <- NULL # The extract number, or leave as `NULL` for your most 
                    # recent extract

descriptive_name <- "my_ipums_extract" # A descriptive label for your extract; 
                                       # used to rename your data files

data_dir <- "data" # The folder in which to save data, codebook, and .json files
```

In fact, you can leave all the default values of these parameters if you
want to analyze your most recent IPUMS USA extract, though I'd recommend
filling in a better `descriptive_name` for the extract even in that
case. Since I'll be using IPUMS USA data on migration from the Puerto
Rico Community Survey, I'll fill in `"prcs_migration_analysis"` for
`descriptive_name`.
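
How might `collection` and `extract_num` translate into an API call?
Here is a rough sketch, assuming the template behaves as the parameter
comments describe. `extract_id()` and `resolve_extract()` are
hypothetical helpers for illustration, not code copied from the
template:

```r
# Hypothetical helpers for illustration; not copied from the RRR template.

# Build an identifier in the "collection:number" form, such as "usa:10"
extract_id <- function(collection, extract_num) {
  paste0(collection, ":", extract_num)
}

resolve_extract <- function(collection, extract_num = NULL) {
  if (is.null(extract_num)) {
    # No number given: ask the API for the most recent extract
    ipumsr::get_last_extract_info(collection)
  } else {
    ipumsr::get_extract_info(extract_id(collection, extract_num))
  }
}

# Usage (requires an API key):
# my_extract <- resolve_extract("usa")
```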

After filling in values, save the file, then click the RStudio "Knit"
button, and awaaaaaaay it goes! All that's left to do is sit back,
relax, and --

...wait, is that an error???

![Screenshot of error in the RStudio Render pane, reading "NOT AN ERROR:
usa extract number 117 is not yet ready to download. Try re-running
again later."](Images/extract-not-ready-error.png)

![Gif of Gimli falling down and saying "That was deliberate!" from The
Lord of the Rings: The Two
Towers](https://c.tenor.com/nppajdUlz6kAAAAC/gimli-the-lord-of-the-rings.gif){width="300"}

No! See, as both the "error" message and our friend Gimli indicate --
that's not an error! The RRR is designed to check whether your extract
is ready to download and stop execution if it isn't. As the message
says, you just have to try running again later. IPUMS extracts can take
anywhere from a few minutes to a few hours to process, depending on
their size and traffic levels in the extract engine.
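
A check like this can be sketched with `is_extract_ready()`. The helper
names below are hypothetical, the message wording simply mirrors the
screenshot, and I'm assuming the extract object exposes `collection` and
`number` fields; the template's actual code may differ:

```r
# Hypothetical helpers for illustration; not copied from the RRR template.

# Compose the friendly "not an error" message
not_ready_message <- function(collection, number) {
  paste0(
    "NOT AN ERROR: ", collection, " extract number ", number,
    " is not yet ready to download. Try re-running again later."
  )
}

# Halt if the extract is still processing; otherwise pass it through
stop_if_not_ready <- function(extract) {
  if (!ipumsr::is_extract_ready(extract)) {
    stop(not_ready_message(extract$collection, extract$number), call. = FALSE)
  }
  invisible(extract)
}
```

Stopping early keeps a half-finished report from being rendered, which
is why the message shows up where an error normally would.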

Once your extract is ready, clicking the "Knit" button will produce an
HTML report that looks like this:

![An HTML document rendered from the RRR template, with title
Reproducible Research and five tabs labeled Delete this section before
sharing, Load Packages, Define File Paths, Load your IPUMS Data, and
Analysis Awaits. The Analysis Awaits tab is active, and shows the first
10 rows of our IPUMS USA dataset.](Images/rrr-rendered-to-html.png)

With just a couple clicks, we've pulled our most recent IPUMS USA data
extract DIRECTLY into our R project! (I really can't overstate how cool
this feature is.)

Now for some clean up to get your analysis ready to share. In the HTML
report, click the "Delete this section before sharing" tab and scroll to
the very bottom to find some output like the following:

    Data, codebook, and .json extract definition files have been saved to folder "data".

    Next, copy the code below into the "Define File Paths" code chunk, overwriting the existing code:

    extract_definition_path <- "data/prcs_migration_analysis.json"
    data_path <- "data/prcs_migration_analysis.dat.gz"
    ddi_path <- "data/prcs_migration_analysis.xml"

    Finally, delete all text and code in the section "Delete this section before sharing"

As the instructions indicate, copy the three lines defining file paths
and paste them back into the R Markdown template file, overwriting the
existing code. This will hard-code the paths to the .json, data, and DDI
codebook files so that we can delete the first section of the report,
where those paths were initially defined. Next, delete the section
labeled "Delete this section before sharing" from the R Markdown file --
everything from `## Delete this section before sharing` up to, but not
including, the `## Load Packages` section heading. This section is
designed to be deleted because it contains setup code and instructions
for you, the creator of the original analysis, which are not necessary
or relevant to your collaborators.
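
With the paths hard-coded, loading the data comes down to reading the
DDI codebook and then the data file. Here is a minimal sketch using
`read_ipums_ddi()` and `read_ipums_micro()`. `load_ipums_data()` is a
hypothetical wrapper, and the template's own loading chunk may look
different:

```r
# Hypothetical wrapper for illustration; not copied from the RRR template.
load_ipums_data <- function(ddi_path, data_path) {
  paths <- c(ddi_path, data_path)
  is_missing <- !file.exists(paths)
  # Fail early with a clear message if either file is absent
  if (any(is_missing)) {
    stop("File(s) not found: ", paste(paths[is_missing], collapse = ", "))
  }
  ddi <- ipumsr::read_ipums_ddi(ddi_path)
  ipumsr::read_ipums_micro(ddi, data_file = data_path)
}

# Usage with the paths printed by the template:
# prcs <- load_ipums_data(
#   "data/prcs_migration_analysis.xml",
#   "data/prcs_migration_analysis.dat.gz"
# )
```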

From here, we can fill out the remainder of the RRR with whatever
analysis we'd like, such as plotting migration rates over time:

![Screenshot of HTML report titled Reproducible Research, with tabs Load
Packages, Define File Paths, Load your IPUMS Data, and Analysis:
Migration in Puerto Rico 2015-2019. The last tab is active, and below
that is some summary text about migration in Puerto Rico, as well as
another tab set with tabs titled Overall, By education, By household
income, and By age. The By household income tab is active and shows a
series of line plots of the percentage of people who moved in the past
year by household income quintiles. People in the lowest household
income quintile were more likely to
move.](Images/rrr-migration-analysis-rendered.png)
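
To give a flavor of the calculation behind plots like these, here is a
small base R sketch of a weighted migration rate by year. The toy data
frame is made up and stands in for a real extract, and
`migration_rate_by_year()` is a hypothetical helper. It assumes the
extract includes `YEAR`, `MIGRATE1`, and `PERWT`, and treats `MIGRATE1`
codes 2 through 4 as "moved in the past year" (code 1 being "same
house"); check the codebook for your own extract before relying on that:

```r
# Made-up rows standing in for an IPUMS USA extract
toy <- data.frame(
  YEAR     = c(2018, 2018, 2018, 2019, 2019, 2019),
  MIGRATE1 = c(1, 2, 1, 1, 3, 4),
  PERWT    = c(100, 50, 50, 80, 60, 60)
)

# Weighted percentage of people who moved in the past year, by survey year
migration_rate_by_year <- function(df) {
  # Keep only records with a known migration status (codes 1 to 4)
  in_universe <- df[df$MIGRATE1 %in% 1:4, ]
  moved_wt <- tapply(
    in_universe$PERWT * (in_universe$MIGRATE1 > 1),
    in_universe$YEAR,
    sum
  )
  total_wt <- tapply(in_universe$PERWT, in_universe$YEAR, sum)
  data.frame(
    YEAR = as.integer(names(total_wt)),
    pct_moved = as.numeric(100 * moved_wt / total_wt)
  )
}

rates <- migration_rate_by_year(toy)
print(rates)
```

With data loaded through ipumsr, labelled columns may need converting to
plain numbers first, for example with `haven::zap_labels()`.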

When you share the finished RRR, your collaborators don't need a copy of
your data files. Instead, they use the `.json` extract definition to
create and submit a **new data extract** the first time the script is
run. By sharing as few as two files, you can allow a colleague or
student to download the exact same IPUMS data you used in order to
replicate or further explore your work. We hope this helps make research
more accessible and replicable. Read on to see the RRR in action, as we
explore some data from the [Puerto Rico Community
Survey](https://usa.ipums.org/usa/sampdesc.shtml#us2019b), available
from IPUMS USA.

The basic assumptions of the template are that you:

1.  Have registered with IPUMS USA (or IPUMS CPS)
2.  Have generated an [IPUMS API
    key](https://account.ipums.org/api_keys)
3.  Have [added that key to your
    .Renviron](https://tech.popdata.org/ipumsr/reference/set_ipums_api_key.html)
4.  Have a specific dataset you want to download, analyze, and visualize
5.  Would like to let other IPUMS users replicate the work
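
A quick way to check the API key assumption before knitting is to look
for the key in your environment. This is a minimal sketch that assumes
ipumsr reads the key from an environment variable named `IPUMS_API_KEY`;
`has_ipums_api_key()` is a hypothetical helper, not an ipumsr function:

```r
# Hypothetical helper: TRUE if a non-empty key is set in the environment.
# Assumes the variable is named IPUMS_API_KEY.
has_ipums_api_key <- function() {
  nzchar(Sys.getenv("IPUMS_API_KEY"))
}

if (!has_ipums_api_key()) {
  message("No IPUMS API key found. See ?ipumsr::set_ipums_api_key to add one.")
}
```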

For this example, we're using IPUMS USA data, specifically looking at
the Puerto Rico Community Survey from the years 2015-2019.

To get started in R, make sure to update `ipumsr`, then select our new
RRR template.

```{r,fig.show="hold", out.width="25%"}

knitr::include_graphics(file.path("Images","rmd_template_1.png"))

knitr::include_graphics(file.path("Images","rmd_template_2.png"))


```

Now here we are, looking at a (possibly) overwhelming amount of code.
But don't worry! We've tried to make this as painless as possible. In
fact, this template can run out of the box, defaulting to **IPUMS USA**
and your **most recent extract**. Since this *just so happens* to be the
extract I'd like to work with, I can proceed without making **any
edits**, simply by clicking `Knit` or running `rmarkdown::render()`.

```{r,fig.show="hold", out.width="25%"}
knitr::include_graphics(file.path("Images","rmd_initial_open.png"))
```

And awaaaaaaay it goes! All that's left to do is sit back, relax, and -

...wait, is that an error???

```{r,fig.show="hold", out.width="25%"}

knitr::include_graphics(file.path("Images","open2-not_an_error.png"))


```

![gimli from lord of the rings saying "it was
deliberate"](Images/gimli-the-lord-of-the-rings.gif)

No! See, as both the "error message" and our friend *Gimli* indicate,
that's not an error! The `RRR` is set up to be knit a few times, at
your leisure, so that the IPUMS servers have time to process your data
request. But look, something **did** happen: we've added a subfolder
named `Data`. Within it are two new files: **a .json extract
definition** and a `chk_....csv` file. The first file contains all the
information needed to get your data (or to share with friends and loved
ones), and the second file is one you don't need to worry about!

```{r,fig.show="hold", out.width="33%"}

knitr::include_graphics(file.path("Images","open1a.png"))

knitr::include_graphics(file.path("Images","open2a.png"))
```

Now, you may have noticed that these files are both called "template,"
and you might be wondering why. This is one of the default parameters of
the `RRR`. Users will want to edit it, which is easily done in the first
code chunk of the `RRR`. Depending on your window and font size, you may
need to scroll, or you can use the **Table of Contents** to jump down to
**Setup-Project Parameters.** We'll set this to a more descriptive name,
since our main focus will be migration rates in Puerto Rico.

```{r,fig.show="hold", out.width="50%"}
knitr::include_graphics(file.path("Images","params.png"))


knitr::include_graphics(file.path("Images","params_filled.png"))

```

Since we've changed the `descriptive_name` parameter, it's helpful to
delete the `.json` and `chk_.csv` files with the old name, "template",
before proceeding. (If you set a proper descriptive name in the first
place, you won't need to delete anything.) With the name updated,
awaaaay we knit!

And with just 2 clicks, we've pulled our most recent IPUMS USA data
DIRECTLY into our R project! (I really can't overstate how cool this
feature is.) You'll notice there's some basic descriptive information
included by default. Feel free to replace it as you develop your
analyses.

```{r,fig.show="hold", out.width="50%"}


knitr::include_graphics(file.path("Images","open4a.png"))

knitr::include_graphics(file.path("Images","open4.png"))

```

From here, we can fill out the remainder of the `RRR` with whatever
analysis we'd like, such as plotting migration rates over time. To see
the full features of the `RRR`, visit
[github.com/ipums/simple-api-shiny-app](https://github.com/ipums/simple-api-shiny-app).
Clone the repo to try out the interactive tabset HTML report for
yourself, or check out the [pre-rendered
version](https://github.com/ipums/simple-api-shiny-app/blob/main/prcs_migration_ex.html),
though many features, such as code folding, are not available in that
version.

```{r,fig.show="hold", out.width="50%"}

knitr::include_graphics(file.path("Images","mig1.png"))

```

```{r,fig.show="hold", out.width="50%"}

knitr::include_graphics(file.path("Images","mig2.png"))
knitr::include_graphics(file.path("Images","mig3.png"))
knitr::include_graphics(file.path("Images","mig4.png"))

```

This template is very much in beta, so be sure to **share your
feedback** by emailing us at
[ipums+cran\@umn.edu](mailto:ipums+cran@umn.edu){.email} or [creating an
issue on GitHub](https://github.com/ipums/ipumsr/issues). As an
even-more-beta bonus, we've included a simple Shiny app: the [Variable
Variation Value Viewer
(VVVV)](https://github.com/ipums/simple-api-shiny-app/VVVV), which uses
these functions in a similar way to create a self-compiling web app.

# Sharing an interactive Shiny app

Also included in this repo is the [Variable Variation Value Viewer
(VVVV)](https://github.com/ipums/simple-api-shiny-app/VVVV). This app
follows the same steps as the `RRR`, but it also makes use of the
`wait_for_extract()` function. That's right, you can define, submit,
wait for, and download your IPUMS data all automatically... though you
may be waiting a while for larger extracts. The extract used in this
example is intentionally small so that users do not need to wait long
(avg \< 1 min) for the app to load. As mentioned above, this is not
meant to be a robust, one-size-fits-all app. But it does provide a
pretty neat way to show users what you've done...

```{r,fig.show="hold", out.width="50%"}

knitr::include_graphics(file.path("Images","vvvv_1.png"))

```

And let them explore further trends...

```{r,fig.show="hold", out.width="33%"}

knitr::include_graphics(file.path("Images","vvvv_2.png"))
knitr::include_graphics(file.path("Images","vvvv_5.png"))

```

Complete with metadata

```{r,fig.show="hold", out.width="50%"}
knitr::include_graphics(file.path("Images","vvvv_3.png"))

```
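
For the curious, the define, submit, wait, and download sequence that
the app relies on can be sketched in a few lines.
`fetch_shared_extract()` is a hypothetical wrapper rather than the app's
actual code, and it assumes `download_extract()` returns the path to the
DDI codebook, as the ipumsr documentation describes:

```r
# Hypothetical wrapper for illustration; not the VVVV app's actual code.
fetch_shared_extract <- function(json_path, download_dir = tempdir()) {
  if (!file.exists(json_path)) {
    stop("Extract definition not found: ", json_path)
  }
  definition <- ipumsr::define_extract_from_json(json_path)
  submitted <- ipumsr::submit_extract(definition)
  # Blocks until the extract has finished processing
  ready <- ipumsr::wait_for_extract(submitted)
  ddi_path <- ipumsr::download_extract(ready, download_dir = download_dir)
  # Read the data described by the downloaded codebook
  ipumsr::read_ipums_micro(ddi_path)
}

# Usage (requires an API key and may take a while for large extracts):
# prcs <- fetch_shared_extract("extract_to_share.json", download_dir = "data")
```

Because `wait_for_extract()` blocks, a small extract keeps the app's
start-up time reasonable.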

We hope this inspires some cool new uses of IPUMS data. Happy coding and
remember,

Use it for Good.