-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
99 lines (81 loc) · 4.61 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# PPP
<!-- badges: start -->
<!-- badges: end -->
The goal of PPP is to download and assemble a unified, cleaned data set of
Paycheck Protection Program loans issued in 2020.
The data is too large to share on GitHub, but this package will allow you to
recreate the data locally.
The work for this package grew out of a project with [DataKind's DC chapter](https://github.com/DataKind-DC/CARES) and
the [National Press Foundation](https://nationalpress.org/).
## Installation
You can install the development version from [GitHub](https://github.com/) with:
``` r
# install.packages("devtools")
devtools::install_github("kbmorales/PPP")
```
## Assembling PPP data
``` r
# Note: this function will download over 600MB of data to your local machine,
# and read a large dataframe into memory!
ppp_data = ppp_assemble()
```
## Data Sources
- Small Business Administration PPP Loan Data:
- [Latest release](https://sba.app.box.com/s/ahn2exwfebgqruk714v3hnf75qdap3du)
- [Original release](https://sba.app.box.com/s/tvb0v5i57oa8gc6b5dcm9cyw7y2ms6pp)
- NAICS code dictionaries [US Census NAICS Files](https://www.census.gov/eos/www/naics/downloadables/downloadables.html)
## PPP Data Dictionary
Though there are two versions of the PPP data released, both will have the same
final structure as detailed below. The main structual difference between these
versions is the removal of a `JobsRetained` variable, and the addition of
`JobsReported`. Whether or not this is a simple semantic change is unknown
at this point.
The final PPP dataset output by `ppp_assemble()` will contain the following
columns:
| variable | original | notes |
| :------------------ | -----------:|----------------------------------------: |
| LoanRange | Yes | no values for loan amounts under 150K |
| BusinessName | Yes | no values for loan amounts under 150K |
| Address | Yes | no values for loan amounts under 150K |
| City | Yes | |
| State | Yes | |
| Zip | Yes | |
| NAICSCode | Yes | |
| BusinessType | Yes | |
| RaceEthnicity | Yes | 89.3% “Unanswered” |
| Gender | Yes | 77.7% “Unanswered” |
| Veteran | Yes | 84.7% “Unanswered” |
| NonProfit | Yes | |
| JobsRetained | Yes | only in 2020-07-06 release |
| JobsReported | Yes | only in 2020-08-08 release |
| DateApproved | Yes | |
| Lender | Yes | |
| CD | Yes | |
| LoanAmount | Yes | no values for loan amounts over 150K |
| source_file | No | file name for where data was pulled |
| version | No | release date of PPP data |
| LoanRange_Unified | No | Loan ranges regardless of amount |
| JobsRetained_Grouped| No | brackets for # of 'jobs retained' |
| JobsReported_Grouped| No | brackets for # of 'jobs reported' |
| LoanRangeMin | No | minimum possible loan value |
| LoanRangeMax | No | maximum possible loan value |
| LoanRangeMid | No | middle estimated loan value |
| naics_lvl_1 | No | Most general NAICS industry class |
| naics_lvl_2 | No | Second level NAICS industry class |
| naics_lvl_3 | No | Third level NAICS industry class |
| naics_lvl_4 | No | Forth level NAICS industry class |
| naics_lvl_5 | No | Most specific NAICS industry class |
| NAICS_version | No | version of NAICS where code was found |
| NAICS_valid | No | was NAICS code matched? |