CDC500 Auto Refresh #1176

Open
wants to merge 10 commits into
base: master
62 changes: 51 additions & 11 deletions scripts/us_cdc/500_places/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,11 +16,11 @@ Author: Padma Gundapaneni @padma-g
## About the Dataset

### Download URL
The datasets can be downloaded at the following links from [the CDC website](https://chronicdata.cdc.gov/browse?category=500+Cities+%26+Places&sortBy=newest&utf8).
- [PLACES: Local Data for Better Health, Census Tract Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh)
- [PLACES: Local Data for Better Health, County/Country Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-County-Data-20/swc5-untb)
- [PLACES: Local Data for Better Health, Place (City) Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Place-Data-202/eav7-hnsx)
- [PLACES: Local Data for Better Health, ZCTA (Zip Code) Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-ZCTA-Data-2020/qnzd-25i4)
The datasets can be downloaded from [the CDC website](https://chronicdata.cdc.gov/browse?category=500+Cities+%26+Places&sortBy=newest&utf8). For each dataset listed below, search the website manually for the latest release files and add the required configuration to the JSON file in the [GCP bucket](gs://datcom-csv/cdc500_places/).
- PLACES: Local Data for Better Health, Census Tract Data
- PLACES: Local Data for Better Health, County Data
- PLACES: Local Data for Better Health, Place (City) Data
- PLACES: Local Data for Better Health, ZCTA (Zip Code) Data

To download all datasets available, run the following command. The download will take 5-10 minutes total. Files will be downloaded and extracted to a `raw_data` folder.
```bash
Expand All @@ -34,7 +34,49 @@ The data imported in this effort is from the CDC's [500 Places project](https://

### Notes and Caveats

None.
To refresh the CDC500 import, search the website manually for the latest release files across all geo levels and add the required configuration to the [JSON file](gs://datcom-csv/cdc500_places/download_config.json) in the GCP bucket.

Fill in the JSON file for the latest release data in the format below:

```
{
"release_year": {ReleaseYear},
"parameter": [
{
"URL": "Download link of the latest release",
"FILE_TYPE": "Geo level of the data; one of [County, City, ZipCode, CensusTract]",
"FILE_NAME": "{GeoLevel}_raw_data_{ReleaseYear}.csv"
}
]
}
```

Example:
{
"release_year": 2022,
"parameter": [
{
"URL": "https://data.cdc.gov/api/views/duw2-7jbt/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "County",
"FILE_NAME": "county_raw_data_2022.csv"
},
{
"URL": "https://data.cdc.gov/api/views/epbn-9bv3/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "City",
"FILE_NAME": "city_raw_data_2022.csv"
},
{
"URL": "https://data.cdc.gov/api/views/nw2y-v4gm/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "CensusTract",
"FILE_NAME": "censustract_raw_data_2022.csv"
},
{
"URL": "https://data.cdc.gov/api/views/gd4x-jyhw/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "ZipCode",
"FILE_NAME": "zipcode_raw_data_2022.csv"
}
]
}
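
Before uploading an edited config, it may help to check it locally. The following is a minimal sketch, not part of this PR: the function name `validate_config` is hypothetical, and the file-name check assumes the `{geolevel}_raw_data_{year}.csv` convention seen in the examples (lowercased `FILE_TYPE`). It accepts either a single release object, as in the format above, or a list of them, which is what the loop in `download_bulk.py` iterates over.

```python
import json

# Geo levels named in the README format description.
ALLOWED_FILE_TYPES = {"County", "City", "ZipCode", "CensusTract"}
REQUIRED_KEYS = {"URL", "FILE_TYPE", "FILE_NAME"}


def validate_config(config):
    """Return a list of human-readable problems found in a download config."""
    problems = []
    # Accept a single release object as well as a list of release objects.
    releases = config if isinstance(config, list) else [config]
    for release in releases:
        year = release.get("release_year")
        if not isinstance(year, int):
            problems.append(f"release_year must be an integer, got {year!r}")
        for entry in release.get("parameter", []):
            missing = REQUIRED_KEYS - set(entry)
            if missing:
                problems.append(f"{year}: entry is missing keys {sorted(missing)}")
                continue
            if entry["FILE_TYPE"] not in ALLOWED_FILE_TYPES:
                problems.append(f"{year}: unknown FILE_TYPE {entry['FILE_TYPE']!r}")
            # Assumed naming convention, inferred from the examples only.
            expected = f"{entry['FILE_TYPE'].lower()}_raw_data_{year}.csv"
            if entry["FILE_NAME"] != expected:
                problems.append(
                    f"{year}: FILE_NAME {entry['FILE_NAME']!r} != {expected!r}")
    return problems


sample = json.loads("""
{
  "release_year": 2022,
  "parameter": [
    {"URL": "https://data.cdc.gov/api/views/duw2-7jbt/rows.csv?accessType=DOWNLOAD",
     "FILE_TYPE": "County",
     "FILE_NAME": "county_raw_data_2022.csv"},
    {"URL": "https://data.cdc.gov/api/views/epbn-9bv3/rows.csv?accessType=DOWNLOAD",
     "FILE_TYPE": "City",
     "FILE_NAME": "city_raw_data_2022.csv"}
  ]
}
""")
print(validate_config(sample))  # prints []
```

An empty list means no problems were found; each string otherwise names the release year and the offending field.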

### License
The data is made available for public-use by the [CDC](https://www.cdc.gov/nchs/data_access/ftp_data.htm). Users of CDC National Center for Health Statistics Data must comply with the CDC's [data use agreement](https://www.cdc.gov/nchs/data_access/restrictions.htm).
Expand All @@ -51,7 +93,7 @@ These data were collected and provided by the [CDC National Center for Chronic D

[`parse_cdc_places.py`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/parse_cdc_places.py)

[`clean_files.sh`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/clean_files.sh)
[`run.sh`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/run.sh)

#### Test Scripts
[`parse_cdc_places_test.py`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/parse_cdc_places_test.py)
Expand All @@ -75,10 +117,8 @@ The expected output of this test can be found in the [`test_data`](https://githu

#### Data Download and Processing Steps

To download and clean all the data files at once, run `download_bulk.py` and then `clean_files.sh`:
To download and clean all the data files at once, run `run.sh`:

```bash
$ python3 download_bulk.py

$ sh clean_files.sh
$ sh run.sh
```
2 changes: 2 additions & 0 deletions scripts/us_cdc/500_places/cdc_places.tmcf
Expand Up @@ -6,6 +6,8 @@ variableMeasured: C:CDC_Places->StatVar
observationPeriod: "P1Y"
measurementMethod: C:CDC_Places->DataValueTypeID
value: C:CDC_Places->Data_Value
unit: Percent
scalingFactor: 100

Node: E:CDC_Places->E1
observationAbout: C:CDC_Places->Location
Expand Down
7 changes: 0 additions & 7 deletions scripts/us_cdc/500_places/clean_files.sh

This file was deleted.

58 changes: 40 additions & 18 deletions scripts/us_cdc/500_places/download_bulk.py
Expand Up @@ -24,31 +24,51 @@
"""

import os

import requests
import json

from absl import logging
from retry import retry
from google.cloud import storage

# Initialize GCP storage client
client = storage.Client()

# GCS bucket and object path of the download configuration
bucket_name = 'datcom-csv'
file_name = 'cdc500_places/download_config.json'

# Download the file from GCP Storage
bucket = client.get_bucket(bucket_name)
blob = bucket.blob(file_name)

# Read the JSON content from the blob
json_data = blob.download_as_text()

# Load the JSON data
_CONFIG_FILE = json.loads(json_data)


DATA_URLS = {
"county_raw_data.csv":
"https://chronicdata.cdc.gov/api/views/swc5-untb/rows.csv?accessType=DOWNLOAD",
"city_raw_data.csv":
"https://chronicdata.cdc.gov/api/views/eav7-hnsx/rows.csv?accessType=DOWNLOAD",
"censustract_raw_data.csv":
"https://chronicdata.cdc.gov/api/views/cwsq-ngmh/rows.csv?accessType=DOWNLOAD",
"zipcode_raw_data.csv":
"https://chronicdata.cdc.gov/api/views/qnzd-25i4/rows.csv?accessType=DOWNLOAD"
}
@retry(tries=3, delay=5, backoff=5)
def retry_method(url):
return requests.get(url)


def download_file(url: str, save_path: str):
def download_file(release_year, url: str, save_path: str):
"""
Args:
release_year: release year of the dataset, used in log messages
url: URL of the file to be downloaded
save_path: path where the downloaded file is stored
Returns:
None; the downloaded CSV file is written to save_path
"""
print(f'Downloading {url} to {save_path}')
request = requests.get(url, stream=True)
logging.info(
f'Downloading {url} for the year {release_year} to {save_path}')
request = retry_method(url)
if request.status_code != 200:
logging.fatal(
f'Failed to retrieve {url} for the year {release_year} to {save_path}'
)
with open(save_path, 'wb') as file:
file.write(request.content)

Expand All @@ -58,10 +78,12 @@ def main():
data_dir = os.path.join(os.getcwd(), 'raw_data')
if not os.path.exists(data_dir):
os.makedirs(data_dir)
for dataset_name, url in DATA_URLS.items():
print(dataset_name)
save_path = os.path.join(data_dir, dataset_name)
download_file(url, save_path)
logging.set_verbosity(2)
for item in _CONFIG_FILE:
release_year = item["release_year"]
for url_dict in item["parameter"]:
save_path = os.path.join(data_dir, url_dict['FILE_NAME'])
download_file(release_year, url_dict['URL'], save_path)


if __name__ == '__main__':
Expand Down
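
The `retry_method` wrapper in this diff retries when `requests.get` raises an exception; a response with a non-200 status is returned as-is and is not retried. A minimal sketch of a loop that also treats a non-200 status as retryable follows. The helper name `fetch_with_retry` and its `fetch` and `sleep` parameters are hypothetical and not part of the PR; the defaults mirror the `tries=3, delay=5, backoff=5` values used above, and only the standard library is needed.

```python
import time


def fetch_with_retry(fetch, url, tries=3, delay=5.0, backoff=5.0,
                     sleep=time.sleep):
    """Call fetch(url) until it returns status 200 or tries are exhausted.

    `fetch` is any callable that returns an object with a `status_code`
    attribute (for example `requests.get`). `sleep` is injectable so the
    waiting can be skipped in tests.
    """
    last_error = None
    for attempt in range(1, tries + 1):
        try:
            response = fetch(url)
            if response.status_code == 200:
                return response
            # Record the bad status so it can be raised if all tries fail.
            last_error = RuntimeError(f'HTTP {response.status_code} for {url}')
        except Exception as exc:  # network-level failure raised by fetch
            last_error = exc
        if attempt < tries:
            sleep(delay)
            delay *= backoff  # wait longer before each further attempt
    raise last_error
```

With this shape, a call such as `fetch_with_retry(requests.get, url)` would either return a 200 response or raise, so the caller would not need a separate status check before writing the file.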
77 changes: 77 additions & 0 deletions scripts/us_cdc/500_places/download_config.json
@@ -0,0 +1,77 @@
[
{
"release_year": 2022,
"parameter": [
{
"URL": "https://data.cdc.gov/api/views/duw2-7jbt/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "County",
"FILE_NAME": "county_raw_data_2022.csv"
},
{
"URL": "https://data.cdc.gov/api/views/epbn-9bv3/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "City",
"FILE_NAME": "city_raw_data_2022.csv"
},
{
"URL": "https://data.cdc.gov/api/views/nw2y-v4gm/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "CensusTract",
"FILE_NAME": "censustract_raw_data_2022.csv"
},
{
"URL": "https://data.cdc.gov/api/views/gd4x-jyhw/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "ZipCode",
"FILE_NAME": "zipcode_raw_data_2022.csv"
}
]
},
{
"release_year": 2023,
"parameter": [
{
"URL": "https://data.cdc.gov/api/views/h3ej-a9ec/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "County",
"FILE_NAME": "county_raw_data_2023.csv"
},
{
"URL": "https://data.cdc.gov/api/views/krqc-563j/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "City",
"FILE_NAME": "city_raw_data_2023.csv"
},
{
"URL": "https://data.cdc.gov/api/views/em5e-5hvn/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "CensusTract",
"FILE_NAME": "censustract_raw_data_2023.csv"
},
{
"URL": "https://data.cdc.gov/api/views/9umn-c3jf/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "ZipCode",
"FILE_NAME": "zipcode_raw_data_2023.csv"
}
]
},
{
"release_year": 2024,
"parameter": [
{
"URL": "https://data.cdc.gov/api/views/swc5-untb/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "County",
"FILE_NAME": "county_raw_data_2024.csv"
},
{
"URL": "https://data.cdc.gov/api/views/eav7-hnsx/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "City",
"FILE_NAME": "city_raw_data_2024.csv"
},
{
"URL": "https://data.cdc.gov/api/views/cwsq-ngmh/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "CensusTract",
"FILE_NAME": "censustract_raw_data_2024.csv"
},
{
"URL": "https://data.cdc.gov/api/views/qnzd-25i4/rows.csv?accessType=DOWNLOAD",
"FILE_TYPE": "ZipCode",
"FILE_NAME": "zipcode_raw_data_2024.csv"
}
]
}
]
32 changes: 32 additions & 0 deletions scripts/us_cdc/500_places/manifest.json
@@ -0,0 +1,32 @@
{
"import_specifications": [
{
"import_name": "CDC500",
"curator_emails": [],
"provenance_url": "https://www.cdc.gov/places/index.html",
"provenance_description": "Variables related to health from the CDC",
"scripts": [
"download_bulk.py","parse_cdc_places.py"
],
"import_inputs": [
{
"template_mcf": "cdc_places.tmcf",
"cleaned_csv": "cleaned_csv/County.csv"
},
{
"template_mcf": "cdc_places.tmcf",
"cleaned_csv": "cleaned_csv/City.csv"
},
{
"template_mcf": "cdc_places.tmcf",
"cleaned_csv": "cleaned_csv/ZipCode.csv"
},
{
"template_mcf": "cdc_places.tmcf",
"cleaned_csv": "cleaned_csv/CensusTract.csv"
}
],
"cron_schedule": "0 11 * * 2"
}
]
}
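
The `cron_schedule` value in the manifest uses the standard five-field cron layout (minute, hour, day of month, month, day of week). As a rough illustration of how `"0 11 * * 2"` reads, the sketch below splits the expression into named fields; `describe_cron` is a hypothetical helper, and the time zone in which the schedule is evaluated is not stated in this PR.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")
# In standard cron, day-of-week 0 (and 7) is Sunday.
WEEKDAYS = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday")


def describe_cron(expr):
    """Map a five-field cron expression to a dict of named fields."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expr!r}")
    return dict(zip(CRON_FIELDS, parts))


fields = describe_cron("0 11 * * 2")
print(fields["hour"], fields["minute"])            # prints: 11 0
print(WEEKDAYS[int(fields["day_of_week"]) % 7])    # prints: Tuesday
```

Read this way, the import is scheduled for 11:00 every Tuesday.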