diff --git a/README.md b/README.md
index 9d255ecb5..129977539 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,9 @@
 # coronadatascraper
 
-> A scraper that pulls COVID-19 Coronavirus data scraped from government and curated data sources.
+> A crawler that scrapes COVID-19 Coronavirus data from government and curated data sources.
 
 This project exists to scrape, de-duplicate, and cross-check county-level data on the COVID-19 coronavirus pandemic.
 
-Every piece of data produced includes the URL where the data was sourced from as well as a rating of the source's technical quality (completeness, machine readability, best practices -- not accuracy).
+Every piece of data includes GeoJSON and population data, cites the source from which the data was obtained, and includes a rating of the source's technical quality (completeness, machine readability, best practices -- not accuracy).
 
 ## Where's the data?
@@ -13,7 +13,11 @@ https://coronadatascraper.com/
 
 We upload fresh data every day at around 9PM PST.
 
-## Getting started
+## How do I use this data?
+
+Read the [Data Fields](./docs/data_fields.md) documentation for details on exactly what each field in the dataset means.
+
+## How can I run the crawler locally?
 
 Check out our [Getting Started](./docs/getting_started.md) guide to help get our project running on your local machine.
 
@@ -45,3 +49,5 @@ This project uses data from [ISO-3166 Country and Dependent Territories Lists wi
 ## Attribution
 
 Please cite this project if you use it in your visualization or reporting.
+
+> Data obtained from Corona Data Scraper
diff --git a/docs/data_fields.md b/docs/data_fields.md
new file mode 100644
index 000000000..ecd2085a7
--- /dev/null
+++ b/docs/data_fields.md
@@ -0,0 +1,62 @@
+# Data fields
+
+## All files
+
+The following fields are available in all files for each geographical entity:
+
+* `name` - the full name of the geographical entity being represented
+* `city` - the name of the city
+* `county` - county or parish or the appropriate name of the administrative subdivision below the level of state or equivalent
+* `state` - state, province or region depending on jurisdiction. In general, the first administrative subdivision below the level of country.
+* `country` - ISO 3166-1 alpha-3 (three letter) country code
+* `level` - one of `city`, `county`, `state`, `country`. Provided in order to facilitate filtering
+
+In addition, ISO IDs are provided for each location. See the [`country-levels`](https://github.com/hyperknot/country-levels) project for details.
+
+* `countryId` - [ISO 3166-1](https://en.wikipedia.org/wiki/ISO_3166-1) ID of the country, e.g. `iso1:US` for US
+* `stateId` - [ISO 3166-2](https://en.wikipedia.org/wiki/ISO_3166-2) ID of the state/province, e.g. `iso2:US-NY` for New York, US
+* `countyId` - local ID of the county/region (e.g. `fips:36005` for Bronx County, New York, US)
+
+In general, when a record describes an administrative subdivision at a given level, all larger levels are non-empty. However, there are exceptions: New York City will not have a `county` field because it is subdivided into five counties.
+
+The following fields are uniquely determined by the geographical entity and are provided as a convenience.
+
+* `population` - a recent estimate of the population in the geographical entity, determined from census data or official sources
+* `lat` - latitude of the geographical entity
+* `long` - longitude of the geographical entity
+* `tz` - an array of time zones for the geographical entity
+* `featureId` - the ID of the GeoJSON feature for this entity, corresponding to `properties.id` in the FeatureCollection provided by `feature.json`
+
+Additional attributes of a data point are:
+
+* `url` - the source for the data point
+* `aggregate` - the original level of aggregation of the source, e.g. country-level data may have been obtained directly or by summation of state- or county-level data.
+* `rating` - the objective rating of the source. See [sources](https://coronadatascraper.com/#sources) for details.
+
+
+### `data.json`, `data.csv`, `timeseries.json`, and `timeseries-byLocation.json`
+
+The following fields define the epidemiological information for a data point:
+
+* `cases` - The cumulative number of confirmed or presumed confirmed cases
+* `deaths` - The cumulative number of deaths attributed to COVID-19
+* `recovered` - The cumulative number of recoveries
+* `tested` - The cumulative number of tests from which results have been obtained (does not include pending tests)
+* `hospitalized` - The cumulative number of patients hospitalized for COVID-19
+* `discharged` - The cumulative number of patients discharged after hospitalization for COVID-19
+
+The following fields detail the data's source:
+
+* `url` - The exact URL from which the data was obtained
+* `sources` - An array of sources that published the data
+* `curators` - An array of curators responsible for manually curating the data
+* `maintainers` - An array of maintainers responsible for writing the scraper code that obtains the data
+
+
+### `timeseries-tidy.csv`
+
+For each entry, the following data is provided:
+
+* `date` - the date the data point refers to
+* `type` - the type of data point: cases, tested, deaths, hospitalized, discharged, or recovered
+* `value` - the value of the data point (a cumulative count of events of a certain type)
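
The tidy layout described for `timeseries-tidy.csv` stores one row per location, date, and type, so a consumer usually has to regroup rows before comparing, say, cases and deaths on the same day. The following is a minimal sketch of that regrouping, not part of this change. It assumes the file has a header row containing at least `name`, `date`, `type`, and `value`, that dates are written as `YYYY-MM-DD`, and that values are integers; the sample rows and the `pivot_tidy` helper are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in the tidy layout. Real files are assumed to carry
# more columns (city, county, state, country, ...) and many more rows.
SAMPLE_TIDY_CSV = """name,date,type,value
Example County,2020-03-20,cases,10
Example County,2020-03-20,deaths,0
Example County,2020-03-21,cases,15
Example County,2020-03-21,deaths,1
"""


def pivot_tidy(rows):
    """Group tidy rows into {(name, date): {type: value}}."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["name"], row["date"])
        # Values are cumulative counts, so they are parsed as integers here.
        wide[key][row["type"]] = int(row["value"])
    return dict(wide)


# DictReader uses the header row for field names, so column order does not matter.
reader = csv.DictReader(io.StringIO(SAMPLE_TIDY_CSV))
wide = pivot_tidy(reader)

for (name, date), counts in sorted(wide.items()):
    print(name, date, counts)
```

Keying on `(name, date)` keeps each location's series separate. A real consumer might prefer a key built from `countryId`, `stateId`, and `countyId` if those columns are present in the tidy file, which this sketch does not assume.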