Commit 780ef27

docs: data field documentation, closes covidatlas#49 (covidatlas#505)
Co-authored-by: Larry Davis <[email protected]>
piccolbo and lazd authored Apr 6, 2020
1 parent f909fcc commit 780ef27
Showing 2 changed files with 71 additions and 3 deletions.
12 changes: 9 additions & 3 deletions README.md
# coronadatascraper
> A crawler that scrapes COVID-19 Coronavirus data from government and curated data sources.
This project exists to scrape, de-duplicate, and cross-check county-level data on the COVID-19 coronavirus pandemic.

Every piece of data includes GeoJSON and population data, cites the source from which the data was obtained, and includes a rating of the source's technical quality (completeness, machine readability, best practices -- not accuracy).

## Where's the data?

https://coronadatascraper.com/

We upload fresh data every day at around 9PM PST.

## How do I use this data?

Read the [Data Fields](./docs/data_fields.md) documentation for details on exactly what each field in the dataset means.

## How can I run the crawler locally?

Check out our [Getting Started](./docs/getting_started.md) guide to help get our project running on your local machine.

This project uses data from [ISO-3166 Country and Dependent Territories Lists wi
## Attribution

Please cite this project if you use it in your visualization or reporting.

> Data obtained from Corona Data Scraper
62 changes: 62 additions & 0 deletions docs/data_fields.md
# Data fields

## All files

The following fields are available in all files for each geographical entity:

* `name` - the full name of the geographical entity being represented
* `city` - the name of the city
* `county` - the name of the county, parish, or equivalent administrative subdivision below the state (or equivalent) level
* `state` - state, province or region depending on jurisdiction. In general, the first administrative subdivision below the level of country.
* `country` - ISO 3166-1 alpha-3 (three letter) country code
* `level` - one of `city`, `county`, `state`, `country`. Provided in order to facilitate filtering

In addition, ISO IDs are provided for each location. See the [`country-levels`](https://github.com/hyperknot/country-levels) project for details.

* `countryId` - [ISO 3166-1](https://en.wikipedia.org/wiki/ISO_3166-1) ID of the country, e.g. `iso1:US` for the US
* `stateId` - [ISO 3166-2](https://en.wikipedia.org/wiki/ISO_3166-2) ID of the state/province, e.g. `iso2:US-NY` for New York, US
* `countyId` - local ID of the county/region (e.g. `fips:36005` for Bronx County, New York, US)
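
As a rough illustration of how these fields fit together, here is a minimal Python sketch that loads `data.json` and filters records by `level` and `countryId`. It assumes `data.json` is a JSON array of location records carrying the fields above; the file path and the selection criteria are just placeholders.

```python
import json

# Assumes data.json is an array of location records with the fields
# described above (name, level, country, countryId, countyId, ...).
with open("data.json") as f:
    locations = json.load(f)

# Use `level` and `countryId` to keep only US county-level records.
us_counties = [
    loc for loc in locations
    if loc.get("level") == "county" and loc.get("countryId") == "iso1:US"
]

for loc in us_counties[:5]:
    print(loc["name"], loc.get("countyId"))
```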

In general, when a record describes an administrative subdivision at a given level, the fields for all larger (enclosing) levels are populated as well. There are exceptions, however: New York City, for example, has no `county` field because it is subdivided into five counties.

The following fields are uniquely determined by the geographical entity and are provided as a convenience.

* `population` - a recent estimate of the population in the geographical entity, determined from census data or official sources
* `lat` - latitude of the geographical entity
* `long` - longitude of the geographical entity
* `tz` - an array of time zones for the geographical entity
* `featureId` - the ID of the GeoJSON feature for this entity, corresponding to `properties.id` in the FeatureCollection provided by `feature.json`
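
To attach geometry to each record, `featureId` can be matched against `properties.id` in the FeatureCollection. The following is a minimal sketch, assuming the FeatureCollection has been downloaded locally under the file name mentioned above and that `locations` was loaded from `data.json` as in the earlier sketch.

```python
import json

# Index the GeoJSON features by properties.id so that records can be
# joined on their featureId.
with open("feature.json") as f:
    feature_collection = json.load(f)

features_by_id = {
    feat["properties"]["id"]: feat
    for feat in feature_collection["features"]
}

# Attach the matching geometry to each location record.
for loc in locations:
    feature = features_by_id.get(loc.get("featureId"))
    if feature is not None:
        loc["geometry"] = feature["geometry"]
```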

Additional attributes of a data point are:

* `url` - the source for the data point
* `aggregate` - the original level of aggregation of the source; e.g., country-level data may have been obtained directly or by summing state- or county-level data.
* `rating` - the objective rating of the source. See [sources](https://coronadatascraper.com/#sources) for details.
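
These attributes are handy for narrowing the dataset before analysis. A small sketch, assuming `rating` is a numeric score between 0 and 1 (an assumption -- see the sources page linked above for the actual rating scheme) and that `locations` was loaded as in the first sketch:

```python
from collections import defaultdict

# Keep records whose source rating clears a threshold; the 0-1 scale
# is an assumption here, not something this document specifies.
well_rated = [loc for loc in locations if loc.get("rating", 0) >= 0.9]

# Group the retained records by the URL they were scraped from.
by_source = defaultdict(list)
for loc in well_rated:
    by_source[loc["url"]].append(loc["name"])
```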


### `data.json`, `data.csv`, `timeseries.json`, and `timeseries-byLocation.json`

The following fields define the epidemiological information for a data point:

* `cases` - The cumulative number of confirmed or presumed confirmed cases
* `deaths` - The cumulative number of deaths attributed to COVID-19
* `recovered` - The cumulative number of recoveries
* `tested` - The cumulative number of tests from which results have been obtained (does not include pending tests)
* `hospitalized` - The cumulative number of patients hospitalized for COVID-19
* `discharged` - The cumulative number of patients discharged after hospitalization for COVID-19
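
All of these counts are cumulative totals, not daily increments. As a rough illustration (not part of the dataset itself), the sketch below derives two simple ratios from a single record; it assumes the record comes from one of the files above and that the relevant fields are present, which is not guaranteed.

```python
def summarize(record):
    """Print simple ratios derived from one location record.

    Assumes cumulative `cases`, `deaths`, and `tested` fields; real
    records may omit any of them.
    """
    cases = record.get("cases")
    deaths = record.get("deaths")
    tested = record.get("tested")

    if cases and deaths is not None:
        # Crude ratio of cumulative deaths to cumulative confirmed cases.
        print(f"deaths per case: {deaths / cases:.3f}")
    if tested and cases is not None:
        # Share of completed tests that came back positive.
        print(f"cases per test:  {cases / tested:.3f}")
```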

The following fields detail the data's source:

* `url` - The exact URL from which the data was obtained
* `sources` - An array of sources that published the data
* `curators` - An array of curators responsible for manually curating the data
* `maintainers` - An array of maintainers responsible for writing the scraper code that obtains the data


### `timeseries-tidy.csv`

For each entry, the following data is provided:

* `date` - the date the data point refers to
* `type` - the type of data point: cases, tested, deaths, hospitalized, discharged, or recovered
* `value` - the value of the data point (a cumulative count of events of a certain type)
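
Because this file stores one observation per row, a common first step is to pivot it back into one column per `type`. The sketch below uses pandas and assumes the CSV also carries the location columns described under "All files" (for instance `name`), which this document implies but does not list explicitly for this file.

```python
import pandas as pd

# One row per (location, date, type); pivot so each type becomes a column.
tidy = pd.read_csv("timeseries-tidy.csv", parse_dates=["date"])

# Assumes a `name` column identifies the location, per the "All files" section.
wide = tidy.pivot_table(index=["name", "date"], columns="type", values="value")

# Values are cumulative, so per-location differences give daily new counts.
daily_new = wide.groupby(level="name").diff()
print(daily_new.head())
```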
