datacommonsorg · saanikaaa · Jan 8, 2025 · Jan 9, 2025 · Jan 9, 2025 · Jan 13, 2025
diff --git a/requirements.txt b/requirements.txt
@@ -5,6 +5,7 @@ arcgis2geojson
 chardet
 dataclasses==0.6
 datacommons==1.4.3
+db-dtypes
 frozendict
 func-timeout==4.3.5
 geojson==2.5.0

diff --git a/requirements_all.txt b/requirements_all.txt
@@ -6,6 +6,7 @@ chembl-webresource-client>=0.10.2
 chardet
 dataclasses==0.6
 datacommons==1.4.3
+db-dtypes
 deepdiff==6.3.0
 earthengine-api
 flask_restful==0.3.9

diff --git a/scripts/us_cdc/500_places/README.md b/scripts/us_cdc/500_places/README.md
@@ -16,11 +16,11 @@ Author: Padma Gundapaneni @padma-g
 ## About the Dataset
 
 ### Download URL
-The datasets can be downloaded at the following links from [the CDC website](https://chronicdata.cdc.gov/browse?category=500+Cities+%26+Places&sortBy=newest&utf8).
-- [PLACES: Local Data for Better Health, Census Tract Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh)
-- [PLACES: Local Data for Better Health, County/Country Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-County-Data-20/swc5-untb)
-- [PLACES: Local Data for Better Health, Place (City) Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Place-Data-202/eav7-hnsx)
-- [PLACES: Local Data for Better Health, ZCTA (Zip Code) Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-ZCTA-Data-2020/qnzd-25i4)
+The datasets can be downloaded from [the CDC website](https://chronicdata.cdc.gov/browse?category=500+Cities+%26+Places&sortBy=newest&utf8). For each dataset listed below, manually search the website for the latest release files and add the required configuration to the JSON file in the [GCP bucket location](gs://datcom-csv/cdc500_places/).
+- PLACES: Local Data for Better Health, Census Tract Data
+- PLACES: Local Data for Better Health, County/Country Data
+- PLACES: Local Data for Better Health, Place (City) Data
+- PLACES: Local Data for Better Health, ZCTA (Zip Code) Data
 
 To download all datasets available, run the following command. The download will take 5-10 minutes total. Files will be downloaded and extracted to a `raw_data` folder.
 ```bash
@@ -34,7 +34,51 @@ The data imported in this effort is from the CDC's [500 Places project](https://
 
 ### Notes and Caveats
 
-None.
+For a data refresh of the CDC500 import, manually search the website for the latest release files across all geo levels and add the required configuration to the [JSON file](gs://datcom-csv/cdc500_places/download_config.json) in the GCP bucket location. The config file is also available locally as [download_config.json](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/download_config.json), and this local copy can also be used to generate the output.
+
+NOTE: If you change the local config file, make the same change to the config file in GCP, and vice versa. Always keep both config files in sync.
+
+Fill in the JSON file for the latest release data in the format below:
+
+```
+{
+        "release_year": {ReleaseYear}, 
+        "parameter": [
+            {
+                "URL": "Download link of the latest release",
+                "FILE_TYPE": "Geo level of the data; one of [County, City, ZipCode, CensusTract]",
+                "FILE_NAME": "{GeoLevel}_raw_data_{ReleaseYear}.csv"
+            }
+        ]
+    }
+```
+
+Example:
+{
+        "release_year": 2022,
+        "parameter": [
+            {
+                "URL": "https://data.cdc.gov/api/views/duw2-7jbt/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "County",
+                "FILE_NAME": "county_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/epbn-9bv3/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "City",
+                "FILE_NAME": "city_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/nw2y-v4gm/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "CensusTract",
+                "FILE_NAME": "censustract_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/gd4x-jyhw/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "ZipCode",
+                "FILE_NAME": "zipcode_raw_data_2022.csv"
+            }
+        ]
+    }
 
 ### License
 The data is made available for public-use by the [CDC](https://www.cdc.gov/nchs/data_access/ftp_data.htm). Users of CDC National Center for Health Statistics Data must comply with the CDC's [data use agreement](https://www.cdc.gov/nchs/data_access/restrictions.htm).
@@ -51,7 +95,7 @@ These data were collected and provided by the [CDC National Center for Chronic D
 
 [`parse_cdc_places.py`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/parse_cdc_places.py)
 
-[`clean_files.sh`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/clean_files.sh)
+[`run.sh`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/run.sh)
 
 #### Test Scripts
 [`parse_cdc_places_test.py`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/parse_cdc_places_test.py)
@@ -65,7 +109,7 @@ These data were collected and provided by the [CDC National Center for Chronic D
 
 ##### Test Data Cleaning Script
 
-To test the data cleaning script, run:
+To test that the config files are in sync with each other and that the data cleaning script works, run:
 
 ```bash
 $ python3 parse_cdc_places_test.py
@@ -75,10 +119,8 @@ The expected output of this test can be found in the [`test_data`](https://githu
 
 #### Data Download and Processing Steps
 
-To download and clean all the data files at once, run `download_bulk.py` and then `clean_files.sh`:
+To download and clean all the data files at once, run `run.sh`:
 
 ```bash
-$ python3 download_bulk.py
-
-$ sh clean_files.sh
+$ sh run.sh
 ```
diff --git a/scripts/us_cdc/500_places/cdc_places.tmcf b/scripts/us_cdc/500_places/cdc_places.tmcf
@@ -6,6 +6,8 @@ variableMeasured: C:CDC_Places->StatVar
 observationPeriod: "P1Y"
 measurementMethod: C:CDC_Places->DataValueTypeID
 value: C:CDC_Places->Data_Value
+unit: Percent
+scalingFactor: 100
 
 Node: E:CDC_Places->E1
 observationAbout: C:CDC_Places->Location

diff --git a/scripts/us_cdc/500_places/clean_files.sh b/scripts/us_cdc/500_places/clean_files.sh
diff --git a/scripts/us_cdc/500_places/download_bulk.py b/scripts/us_cdc/500_places/download_bulk.py
@@ -24,45 +24,68 @@
 """
 
 import os
-
 import requests
+import json
+import sys
+
+from absl import logging
+from retry import retry
+from absl import flags
+from absl import app
+
+_MODULE_DIR = os.path.dirname(os.path.abspath(__file__))
+sys.path.append(os.path.join(_MODULE_DIR, '../../../util/'))
+import file_util
+
+_FLAGS = flags.FLAGS
+flags.DEFINE_string(
+    'config_path', 'gs://unresolved_mcf/cdc/cdc500places/download_config.json',
+    'Path to config file')
+
+
+def read_config_file_from_gcs(file_path):
+    with file_util.FileIO(file_path, 'r') as f:
+        CONFIG_FILE = json.load(f)
+    return CONFIG_FILE
+
 
-DATA_URLS = {
-    "county_raw_data.csv":
-        "https://chronicdata.cdc.gov/api/views/swc5-untb/rows.csv?accessType=DOWNLOAD",
-    "city_raw_data.csv":
-        "https://chronicdata.cdc.gov/api/views/eav7-hnsx/rows.csv?accessType=DOWNLOAD",
-    "censustract_raw_data.csv":
-        "https://chronicdata.cdc.gov/api/views/cwsq-ngmh/rows.csv?accessType=DOWNLOAD",
-    "zipcode_raw_data.csv":
-        "https://chronicdata.cdc.gov/api/views/qnzd-25i4/rows.csv?accessType=DOWNLOAD"
-}
+@retry(tries=3, delay=5, backoff=5)
+def retry_method(url):
+    return requests.get(url)
 
 
-def download_file(url: str, save_path: str):
+def download_file(release_year, url: str, save_path: str):
     """
     Args:
         url: url to the file to be downloaded
         save_path: path for the downloaded file to be stored
     Returns:
         a downloaded csv file in the specified file path
     """
-    print(f'Downloading {url} to {save_path}')
-    request = requests.get(url, stream=True)
+    logging.info(
+        f'Downloading {url} for the year {release_year} to {save_path}')
+    response = retry_method(url)
+    if response.status_code != 200:
+        logging.fatal(
+            f'Failed to retrieve {url} for the year {release_year} to {save_path}'
+        )
     with open(save_path, 'wb') as file:
-        file.write(request.content)
+        file.write(response.content)
 
 
-def main():
+def main(_):
     """Main function to download the files."""
     data_dir = os.path.join(os.getcwd(), 'raw_data')
     if not os.path.exists(data_dir):
         os.makedirs(data_dir)
-    for dataset_name, url in DATA_URLS.items():
-        print(dataset_name)
-        save_path = os.path.join(data_dir, dataset_name)
-        download_file(url, save_path)
+    logging.set_verbosity(2)
+    _CONFIG_FILE = read_config_file_from_gcs(_FLAGS.config_path)
+    for item in _CONFIG_FILE:
+        release_year = item["release_year"]
+        for url_dict in item["parameter"]:
+            save_path = os.path.join(data_dir, url_dict['FILE_NAME'])
+            download_file(release_year, url_dict['URL'], save_path)
 
 
 if __name__ == '__main__':
-    main()
+    app.run(main)
diff --git a/scripts/us_cdc/500_places/download_config.json b/scripts/us_cdc/500_places/download_config.json
@@ -0,0 +1,77 @@
+[
+    {
+        "release_year": 2022,
+        "parameter": [
+            {
+                "URL": "https://data.cdc.gov/api/views/duw2-7jbt/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "County",
+                "FILE_NAME": "county_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/epbn-9bv3/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "City",
+                "FILE_NAME": "city_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/nw2y-v4gm/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "CensusTract",
+                "FILE_NAME": "censustract_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/gd4x-jyhw/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "ZipCode",
+                "FILE_NAME": "zipcode_raw_data_2022.csv"
+            }
+        ]
+    },
+    {
+        "release_year": 2023,
+        "parameter": [
+            {
+                "URL": "https://data.cdc.gov/api/views/h3ej-a9ec/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "County",
+                "FILE_NAME": "county_raw_data_2023.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/krqc-563j/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "City",
+                "FILE_NAME": "city_raw_data_2023.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/em5e-5hvn/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "CensusTract",
+                "FILE_NAME": "censustract_raw_data_2023.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/9umn-c3jf/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "ZipCode",
+                "FILE_NAME": "zipcode_raw_data_2023.csv"
+            }
+        ]
+    },
+    {
+        "release_year": 2024,
+        "parameter": [
+            {
+                "URL": "https://data.cdc.gov/api/views/swc5-untb/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "County",
+                "FILE_NAME": "county_raw_data_2024.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/eav7-hnsx/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "City",
+                "FILE_NAME": "city_raw_data_2024.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/cwsq-ngmh/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "CensusTract",
+                "FILE_NAME": "censustract_raw_data_2024.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/qnzd-25i4/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "ZipCode",
+                "FILE_NAME": "zipcode_raw_data_2024.csv"
+            }
+        ]
+    }
+]
diff --git a/scripts/us_cdc/500_places/manifest.json b/scripts/us_cdc/500_places/manifest.json
@@ -0,0 +1,32 @@
+{
+    "import_specifications": [
+        {
+            "import_name": "CDC500",
+            "curator_emails": [],
+            "provenance_url": "https://www.cdc.gov/places/index.html",
+            "provenance_description": "Variables related to health from the CDC",
+            "scripts": [
+                "download_bulk.py","parse_cdc_places.py"
+            ],
+            "import_inputs": [
+                {
+                    "template_mcf": "cdc_places.tmcf",
+                    "cleaned_csv": "cleaned_csv/County.csv"
+                },
+                {
+                    "template_mcf": "cdc_places.tmcf",
+                    "cleaned_csv": "cleaned_csv/City.csv"
+                },
+                {
+                    "template_mcf": "cdc_places.tmcf",
+                    "cleaned_csv": "cleaned_csv/ZipCode.csv"
+                },
+                {
+                    "template_mcf": "cdc_places.tmcf",
+                    "cleaned_csv": "cleaned_csv/CensusTract.csv"
+                }
+            ],
+            "cron_schedule": "0 4 * * 5"
+        }
+    ]
+}