Commit 780ef27

docs: data field documentation, closes covidatlas#49 (covidatlas#505)
Co-authored-by: Larry Davis <[email protected]>
piccolbo and lazd authored Apr 6, 2020
1 parent f909fcc commit 780ef27
Showing 2 changed files with 71 additions and 3 deletions.
12 changes: 9 additions & 3 deletions README.md
# coronadatascraper
> A crawler that scrapes COVID-19 Coronavirus data from government and curated data sources.
This project exists to scrape, de-duplicate, and cross-check county-level data on the COVID-19 coronavirus pandemic.

Every piece of data includes GeoJSON and population data, cites the source from which the data was obtained, and includes a rating of the source's technical quality (completeness, machine readability, best practices -- not accuracy).

## Where's the data?

https://coronadatascraper.com/

We upload fresh data every day at around 9PM PST.

## How do I use this data?

Read the [Data Fields](./docs/data_fields.md) documentation for details on exactly what each field in the dataset means.

## How can I run the crawler locally?

Check out our [Getting Started](./docs/getting_started.md) guide to help get our project running on your local machine.

This project uses data from [ISO-3166 Country and Dependent Territories Lists wi
## Attribution

Please cite this project if you use it in your visualization or reporting.

> Data obtained from Corona Data Scraper
62 changes: 62 additions & 0 deletions docs/data_fields.md
# Data fields

## All files

The following fields are available in all files for each geographical entity:

* `name` - the full name of the geographical entity being represented
* `city` - the name of the city
* `county` - the name of the county, parish, or equivalent administrative subdivision below the state (or equivalent) level
* `state` - state, province or region depending on jurisdiction. In general, the first administrative subdivision below the level of country.
* `country` - ISO 3166-1 alpha-3 (three letter) country code
* `level` - one of `city`, `county`, `state`, `country`. Provided in order to facilitate filtering

In addition, ISO IDs are provided for each location. See the [`country-levels`](https://github.com/hyperknot/country-levels) project for details.

* `countryId` - [ISO 3166-1](https://en.wikipedia.org/wiki/ISO_3166-1) ID of the country, e.g. `iso1:US` for the US
* `stateId` - [ISO 3166-2](https://en.wikipedia.org/wiki/ISO_3166-2) ID of the state/province, e.g. `iso2:US-NY` for New York, US
* `countyId` - local ID of the county/region (e.g. `fips:36005` for Bronx County, New York, US)
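
As a rough illustration of how these fields fit together, here is a minimal Python sketch that loads `data.json` and filters records by `level` and `countryId`. It assumes `data.json` is a JSON array of location records carrying the fields above; the file path and the selection criteria are just placeholders.

```python
import json

# Assumes data.json is an array of location records with the fields
# described above (name, level, country, countryId, countyId, ...).
with open("data.json") as f:
    locations = json.load(f)

# Use `level` and `countryId` to keep only US county-level records.
us_counties = [
    loc for loc in locations
    if loc.get("level") == "county" and loc.get("countryId") == "iso1:US"
]

for loc in us_counties[:5]:
    print(loc["name"], loc.get("countyId"))
```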

In general, when a record describes an administrative subdivision at a given level, the fields for all larger (enclosing) levels are populated as well. There are exceptions, however: New York City, for example, has no `county` field because it is subdivided into five counties.

The following fields are uniquely determined by the geographical entity and are provided as a convenience.

* `population` - a recent estimate of the population in the geographical entity, determined from census data or official sources
* `lat` - latitude of the geographical entity
* `long` - longitude of the geographical entity
* `tz` - an array of time zones for the geographical entity
* `featureId` - the ID of the GeoJSON feature for this entity, corresponding to `properties.id` in the FeatureCollection provided by `feature.json`
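
To attach geometry to each record, `featureId` can be matched against `properties.id` in the FeatureCollection. The following is a minimal sketch, assuming the FeatureCollection has been downloaded locally under the file name mentioned above and that `locations` was loaded from `data.json` as in the earlier sketch.

```python
import json

# Index the GeoJSON features by properties.id so that records can be
# joined on their featureId.
with open("feature.json") as f:
    feature_collection = json.load(f)

features_by_id = {
    feat["properties"]["id"]: feat
    for feat in feature_collection["features"]
}

# Attach the matching geometry to each location record.
for loc in locations:
    feature = features_by_id.get(loc.get("featureId"))
    if feature is not None:
        loc["geometry"] = feature["geometry"]
```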

Additional attributes of a data point are:

* `url` - the source for the data point
* `aggregate` - the original level of aggregation of the source; e.g., country-level data may have been obtained directly or by summing state- or county-level data.
* `rating` - the objective rating of the source. See [sources](https://coronadatascraper.com/#sources) for details.
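
These attributes are handy for narrowing the dataset before analysis. A small sketch, assuming `rating` is a numeric score between 0 and 1 (an assumption -- see the sources page linked above for the actual rating scheme) and that `locations` was loaded as in the first sketch:

```python
from collections import defaultdict

# Keep records whose source rating clears a threshold; the 0-1 scale
# is an assumption here, not something this document specifies.
well_rated = [loc for loc in locations if loc.get("rating", 0) >= 0.9]

# Group the retained records by the URL they were scraped from.
by_source = defaultdict(list)
for loc in well_rated:
    by_source[loc["url"]].append(loc["name"])
```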


### `data.json`, `data.csv`, `timeseries.json`, and `timeseries-byLocation.json`

The following fields define the epidemiological information for a data point:

* `cases` - The cumulative number of confirmed or presumed confirmed cases
* `deaths` - The cumulative number of deaths attributed to COVID-19
* `recovered` - The cumulative number of recoveries
* `tested` - The cumulative number of tests from which results have been obtained (does not include pending tests)
* `hospitalized` - The cumulative number of patients hospitalized for COVID-19
* `discharged` - The cumulative number of patients discharged after hospitalization for COVID-19
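
All of these counts are cumulative totals, not daily increments. As a rough illustration (not part of the dataset itself), the sketch below derives two simple ratios from a single record; it assumes the record comes from one of the files above and that the relevant fields are present, which is not guaranteed.

```python
def summarize(record):
    """Print simple ratios derived from one location record.

    Assumes cumulative `cases`, `deaths`, and `tested` fields; real
    records may omit any of them.
    """
    cases = record.get("cases")
    deaths = record.get("deaths")
    tested = record.get("tested")

    if cases and deaths is not None:
        # Crude ratio of cumulative deaths to cumulative confirmed cases.
        print(f"deaths per case: {deaths / cases:.3f}")
    if tested and cases is not None:
        # Share of completed tests that came back positive.
        print(f"cases per test:  {cases / tested:.3f}")
```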

The following fields detail the data's source:

* `url` - The exact URL from which the data was obtained
* `sources` - An array of sources that published the data
* `curators` - An array of curators responsible for manually curating the data
* `maintainers` - An array of maintainers responsible for writing the scraper code that obtains the data


### `timeseries-tidy.csv`

For each entry, the following data is provided:

* `date` - the date the data point refers to
* `type` - the type of data point: cases, tested, deaths, hospitalized, discharged, or recovered
* `value` - the value of the data point (a cumulative count of events of a certain type)
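
Because this file stores one observation per row, a common first step is to pivot it back into one column per `type`. The sketch below uses pandas and assumes the CSV also carries the location columns described under "All files" (for instance `name`), which this document implies but does not list explicitly for this file.

```python
import pandas as pd

# One row per (location, date, type); pivot so each type becomes a column.
tidy = pd.read_csv("timeseries-tidy.csv", parse_dates=["date"])

# Assumes a `name` column identifies the location, per the "All files" section.
wide = tidy.pivot_table(index=["name", "date"], columns="type", values="value")

# Values are cumulative, so per-location differences give daily new counts.
daily_new = wide.groupby(level="name").diff()
print(daily_new.head())
```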
