CDC500 Auto Refresh #1176

Open
wants to merge 15 commits into
base: master
1 change: 1 addition & 0 deletions requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@ arcgis2geojson
chardet
dataclasses==0.6
datacommons==1.4.3
db-dtypes
frozendict
func-timeout==4.3.5
geojson==2.5.0
Expand Down
1 change: 1 addition & 0 deletions requirements_all.txt
Expand Up @@ -6,6 +6,7 @@ chembl-webresource-client>=0.10.2
chardet
dataclasses==0.6
datacommons==1.4.3
db-dtypes
deepdiff==6.3.0
earthengine-api
flask_restful==0.3.9
Expand Down
66 changes: 54 additions & 12 deletions scripts/us_cdc/500_places/README.md
Expand Up @@ -16,11 +16,11 @@ Author: Padma Gundapaneni @padma-g
## About the Dataset

### Download URL
The datasets can be downloaded at the following links from [the CDC website](https://chronicdata.cdc.gov/browse?category=500+Cities+%26+Places&sortBy=newest&utf8).
- [PLACES: Local Data for Better Health, Census Tract Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh)
- [PLACES: Local Data for Better Health, County/Country Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-County-Data-20/swc5-untb)
- [PLACES: Local Data for Better Health, Place (City) Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Place-Data-202/eav7-hnsx)
- [PLACES: Local Data for Better Health, ZCTA (Zip Code) Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-ZCTA-Data-2020/qnzd-25i4)
The datasets can be downloaded from [the CDC website](https://chronicdata.cdc.gov/browse?category=500+Cities+%26+Places&sortBy=newest&utf8). For each refresh, manually search the website for the latest release files of the datasets below and add the required configuration to the JSON file in the GCP bucket location (`gs://datcom-csv/cdc500_places/`):
- PLACES: Local Data for Better Health, Census Tract Data
- PLACES: Local Data for Better Health, County/Country Data
- PLACES: Local Data for Better Health, Place (City) Data
- PLACES: Local Data for Better Health, ZCTA (Zip Code) Data

To download all datasets available, run the following command. The download will take 5-10 minutes total. Files will be downloaded and extracted to a `raw_data` folder.
```bash
Expand All @@ -34,7 +34,51 @@ The data imported in this effort is from the CDC's [500 Places project](https://

### Notes and Caveats

None.
To refresh the CDC500 import, manually search the website for the latest release files across all geo levels and add the required configuration to the JSON config file in the GCP bucket (`gs://datcom-csv/cdc500_places/download_config.json`). A copy of the config file is also kept in this repository as [download_config.json](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/download_config.json), and it can be used to generate the output as well.

NOTE: If you change the local config file, make the same change to the config file in GCP, and vice versa. The two config files must always be kept in sync.
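
A sync check along these lines could back up the note above. This is a minimal sketch, not the check used by the test script: `load_config` and `configs_in_sync` are hypothetical names, and it assumes both configs are available as local files with the list-of-releases layout used by `download_config.json`.

```python
import json


def load_config(path):
    """Reads a download config (a JSON list of release entries) from a local path."""
    with open(path, 'r') as f:
        return json.load(f)


def configs_in_sync(config_a, config_b):
    """Returns True when both configs describe the same releases and files.

    Releases are keyed by release_year and files are sorted, so the order of
    entries in either config does not affect the result.
    """

    def normalize(config):
        return {
            entry['release_year']: sorted(
                (p['FILE_TYPE'], p['FILE_NAME'], p['URL'])
                for p in entry['parameter'])
            for entry in config
        }

    return normalize(config_a) == normalize(config_b)
```

For example, `configs_in_sync(load_config('download_config.json'), load_config('gcs_copy.json'))` would compare the repository copy with a copy fetched from the bucket; `gcs_copy.json` is an assumed file name.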

Add an entry for the latest release to the JSON file in the format below. The config file is a JSON list with one such entry per release year:

```
{
  "release_year": {ReleaseYear},
  "parameter": [
    {
      "URL": "Download link of the latest release",
      "FILE_TYPE": "Geo level of the data; one of [County, City, ZipCode, CensusTract]",
      "FILE_NAME": "{GeoLevel}_raw_data_{ReleaseYear}.csv"
    }
  ]
}
```

Example:

```
{
  "release_year": 2022,
  "parameter": [
    {
      "URL": "https://data.cdc.gov/api/views/duw2-7jbt/rows.csv?accessType=DOWNLOAD",
      "FILE_TYPE": "County",
      "FILE_NAME": "county_raw_data_2022.csv"
    },
    {
      "URL": "https://data.cdc.gov/api/views/epbn-9bv3/rows.csv?accessType=DOWNLOAD",
      "FILE_TYPE": "City",
      "FILE_NAME": "city_raw_data_2022.csv"
    },
    {
      "URL": "https://data.cdc.gov/api/views/nw2y-v4gm/rows.csv?accessType=DOWNLOAD",
      "FILE_TYPE": "CensusTract",
      "FILE_NAME": "censustract_raw_data_2022.csv"
    },
    {
      "URL": "https://data.cdc.gov/api/views/gd4x-jyhw/rows.csv?accessType=DOWNLOAD",
      "FILE_TYPE": "ZipCode",
      "FILE_NAME": "zipcode_raw_data_2022.csv"
    }
  ]
}
```
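
Before uploading an edited config, its entries could be checked against the format described above. The following is a rough sketch under stated assumptions: `validate_config` and `ALLOWED_FILE_TYPES` are hypothetical names, and the rules are inferred only from the format and example shown here.

```python
# Geo levels listed in the config format above.
ALLOWED_FILE_TYPES = {'County', 'City', 'ZipCode', 'CensusTract'}


def validate_config(config):
    """Returns a list of human-readable problems; the list is empty if none are found."""
    problems = []
    for entry in config:
        year = entry.get('release_year')
        if not isinstance(year, int):
            problems.append(f'release_year must be an integer, got {year!r}')
        for param in entry.get('parameter', []):
            # Every file entry needs all three keys.
            for key in ('URL', 'FILE_TYPE', 'FILE_NAME'):
                if key not in param:
                    problems.append(f'{year}: missing key {key}')
            file_type = param.get('FILE_TYPE')
            if file_type is not None and file_type not in ALLOWED_FILE_TYPES:
                problems.append(f'{year}: unknown FILE_TYPE {file_type!r}')
    return problems
```

Calling `validate_config(json.load(f))` on the edited file and refusing to upload when the returned list is non-empty would catch a misspelled geo level or a missing key early.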

### License
The data is made available for public-use by the [CDC](https://www.cdc.gov/nchs/data_access/ftp_data.htm). Users of CDC National Center for Health Statistics Data must comply with the CDC's [data use agreement](https://www.cdc.gov/nchs/data_access/restrictions.htm).
Expand All @@ -51,7 +95,7 @@ These data were collected and provided by the [CDC National Center for Chronic D

[`parse_cdc_places.py`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/parse_cdc_places.py)

[`clean_files.sh`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/clean_files.sh)
[`run.sh`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/run.sh)

#### Test Scripts
[`parse_cdc_places_test.py`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/parse_cdc_places_test.py)
Expand All @@ -65,7 +109,7 @@ These data were collected and provided by the [CDC National Center for Chronic D

##### Test Data Cleaning Script

To test the data cleaning script, run:
To test that the local and GCP config files are in sync and that the data cleaning script works, run:

```bash
$ python3 parse_cdc_places_test.py
Expand All @@ -75,10 +119,8 @@ The expected output of this test can be found in the [`test_data`](https://githu

#### Data Download and Processing Steps

To download and clean all the data files at once, run `download_bulk.py` and then `clean_files.sh`:
To download and clean all the data files at once, run `run.sh`:

```bash
$ python3 download_bulk.py

$ sh clean_files.sh
$ sh run.sh
```
2 changes: 2 additions & 0 deletions scripts/us_cdc/500_places/cdc_places.tmcf
Expand Up @@ -6,6 +6,8 @@ variableMeasured: C:CDC_Places->StatVar
observationPeriod: "P1Y"
measurementMethod: C:CDC_Places->DataValueTypeID
value: C:CDC_Places->Data_Value
unit: Percent
scalingFactor: 100

Node: E:CDC_Places->E1
observationAbout: C:CDC_Places->Location
Expand Down
7 changes: 0 additions & 7 deletions scripts/us_cdc/500_places/clean_files.sh

This file was deleted.

65 changes: 44 additions & 21 deletions scripts/us_cdc/500_places/download_bulk.py
Expand Up @@ -24,45 +24,68 @@
"""

import os

import requests
import json
import sys

from absl import logging
from retry import retry
from absl import flags
from absl import app

_MODULE_DIR = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.join(_MODULE_DIR, '../../../util/'))
import file_util

_FLAGS = flags.FLAGS
flags.DEFINE_string(
    'config_path', 'gs://unresolved_mcf/cdc/cdc500places/download_config.json',
    'Path to config file')


def read_config_file_from_gcs(file_path):
    """Reads and parses the JSON download config at file_path."""
    with file_util.FileIO(file_path, 'r') as f:
        config = json.load(f)
    return config


DATA_URLS = {
    "county_raw_data.csv":
        "https://chronicdata.cdc.gov/api/views/swc5-untb/rows.csv?accessType=DOWNLOAD",
    "city_raw_data.csv":
        "https://chronicdata.cdc.gov/api/views/eav7-hnsx/rows.csv?accessType=DOWNLOAD",
    "censustract_raw_data.csv":
        "https://chronicdata.cdc.gov/api/views/cwsq-ngmh/rows.csv?accessType=DOWNLOAD",
    "zipcode_raw_data.csv":
        "https://chronicdata.cdc.gov/api/views/qnzd-25i4/rows.csv?accessType=DOWNLOAD"
}
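# Makes up to 3 attempts, waiting 5 s and then 25 s between them, when
# requests.get raises. A non-200 response does not raise, so it is not retried.
@retry(tries=3, delay=5, backoff=5)
def retry_method(url):
    return requests.get(url)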


def download_file(url: str, save_path: str):
def download_file(release_year, url: str, save_path: str):
    """
    Args:
        release_year: release year of the dataset, used in log messages
        url: url to the file to be downloaded
        save_path: path for the downloaded file to be stored
    Returns:
        a downloaded csv file in the specified file path
    """
    print(f'Downloading {url} to {save_path}')
    request = requests.get(url, stream=True)
    logging.info(
        f'Downloading {url} for the year {release_year} to {save_path}')
    response = retry_method(url)
    if response.status_code != 200:
        logging.fatal(
            f'Failed to retrieve {url} for the year {release_year} to {save_path}'
        )
    with open(save_path, 'wb') as file:
        file.write(request.content)
        file.write(response.content)


def main():
def main(_):
    """Main function to download the files."""
    data_dir = os.path.join(os.getcwd(), 'raw_data')
    if not os.path.exists(data_dir):
        os.makedirs(data_dir)
    for dataset_name, url in DATA_URLS.items():
        print(dataset_name)
        save_path = os.path.join(data_dir, dataset_name)
        download_file(url, save_path)
    logging.set_verbosity(2)
    _CONFIG_FILE = read_config_file_from_gcs(_FLAGS.config_path)
    for item in _CONFIG_FILE:
        release_year = item["release_year"]
        for url_dict in item["parameter"]:
            save_path = os.path.join(data_dir, url_dict['FILE_NAME'])
            download_file(release_year, url_dict['URL'], save_path)


if __name__ == '__main__':
    main()
    app.run(main)
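
One observation on `retry_method`: the `retry` decorator re-runs the call only when it raises, so a non-200 response is returned after a single attempt. Below is a minimal sketch of retrying on the status code as well, using only the standard library. `fetch_with_retry` and its parameters are hypothetical names and are not part of this PR; `fetch` stands in for any callable such as `requests.get` that returns an object with a `status_code` attribute.

```python
import time


def fetch_with_retry(fetch, url, tries=3, delay=5, backoff=5, sleep=time.sleep):
    """Calls fetch(url) until it returns a response with status_code 200.

    Exceptions raised by fetch and non-200 responses both count as failed
    attempts. After the last failed attempt the most recent error is raised.
    """
    last_error = None
    for attempt in range(tries):
        try:
            response = fetch(url)
            if response.status_code == 200:
                return response
            last_error = RuntimeError(
                f'{url} returned HTTP {response.status_code}')
        except Exception as e:  # network errors and similar failures
            last_error = e
        if attempt < tries - 1:
            # Wait before the next attempt, growing the delay each time.
            sleep(delay)
            delay *= backoff
    raise last_error
```

With the defaults this waits 5 seconds and then 25 seconds between attempts, matching the delays configured on `retry_method`. The `sleep` parameter exists only so that tests can run without waiting.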
77 changes: 77 additions & 0 deletions scripts/us_cdc/500_places/download_config.json
@@ -0,0 +1,77 @@
[
{
"release_year": 2022,
"parameter": [
{
"URL": "https://data.cdc.gov/api/views/duw2-7jbt/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "County",
"FILE_NAME": "county_raw_data_2022.csv"
},
{
"URL": "https://data.cdc.gov/api/views/epbn-9bv3/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "City",
"FILE_NAME": "city_raw_data_2022.csv"
},
{
"URL": "https://data.cdc.gov/api/views/nw2y-v4gm/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "CensusTract",
"FILE_NAME": "censustract_raw_data_2022.csv"
},
{
"URL": "https://data.cdc.gov/api/views/gd4x-jyhw/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "ZipCode",
"FILE_NAME": "zipcode_raw_data_2022.csv"
}
]
},
{
"release_year": 2023,
"parameter": [
{
"URL": "https://data.cdc.gov/api/views/h3ej-a9ec/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "County",
"FILE_NAME": "county_raw_data_2023.csv"
},
{
"URL": "https://data.cdc.gov/api/views/krqc-563j/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "City",
"FILE_NAME": "city_raw_data_2023.csv"
},
{
"URL": "https://data.cdc.gov/api/views/em5e-5hvn/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "CensusTract",
"FILE_NAME": "censustract_raw_data_2023.csv"
},
{
"URL": "https://data.cdc.gov/api/views/9umn-c3jf/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "ZipCode",
"FILE_NAME": "zipcode_raw_data_2023.csv"
}
]
},
{
"release_year": 2024,
"parameter": [
{
"URL": "https://data.cdc.gov/api/views/swc5-untb/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "County",
"FILE_NAME": "county_raw_data_2024.csv"
},
{
"URL": "https://data.cdc.gov/api/views/eav7-hnsx/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "City",
"FILE_NAME": "city_raw_data_2024.csv"
},
{
"URL": "https://data.cdc.gov/api/views/cwsq-ngmh/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "CensusTract",
"FILE_NAME": "censustract_raw_data_2024.csv"
},
{
"URL": "https://data.cdc.gov/api/views/qnzd-25i4/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "ZipCode",
"FILE_NAME": "zipcode_raw_data_2024.csv"
}
]
}
]
32 changes: 32 additions & 0 deletions scripts/us_cdc/500_places/manifest.json
@@ -0,0 +1,32 @@
{
"import_specifications": [
{
"import_name": "CDC500",
"curator_emails": [],
"provenance_url": "https://www.cdc.gov/places/index.html",
"provenance_description": "Variables related to health from the CDC",
"scripts": [
"download_bulk.py","parse_cdc_places.py"
],
"import_inputs": [
{
"template_mcf": "cdc_places.tmcf",
"cleaned_csv": "cleaned_csv/County.csv"
},
{
"template_mcf": "cdc_places.tmcf",
"cleaned_csv": "cleaned_csv/City.csv"
},
{
"template_mcf": "cdc_places.tmcf",
"cleaned_csv": "cleaned_csv/ZipCode.csv"
},
{
"template_mcf": "cdc_places.tmcf",
"cleaned_csv": "cleaned_csv/CensusTract.csv"
}
],
"cron_schedule": "0 4 * * 5"
}
]
}