datacommonsorg · saanikaaa · Jan 8, 2025 · Jan 9, 2025 · Jan 9, 2025 · Jan 13, 2025
diff --git a/scripts/us_cdc/500_places/README.md b/scripts/us_cdc/500_places/README.md
@@ -16,11 +16,11 @@ Author: Padma Gundapaneni @padma-g
 ## About the Dataset
 
 ### Download URL
-The datasets can be downloaded at the following links from [the CDC website](https://chronicdata.cdc.gov/browse?category=500+Cities+%26+Places&sortBy=newest&utf8).
-- [PLACES: Local Data for Better Health, Census Tract Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Census-Tract-D/cwsq-ngmh)
-- [PLACES: Local Data for Better Health, County/Country Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-County-Data-20/swc5-untb)
-- [PLACES: Local Data for Better Health, Place (City) Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-Place-Data-202/eav7-hnsx)
-- [PLACES: Local Data for Better Health, ZCTA (Zip Code) Data](https://chronicdata.cdc.gov/500-Cities-Places/PLACES-Local-Data-for-Better-Health-ZCTA-Data-2020/qnzd-25i4)
+The datasets can be downloaded from [the CDC website](https://chronicdata.cdc.gov/browse?category=500+Cities+%26+Places&sortBy=newest&utf8). The latest release files for the datasets below must be located manually on the website, and the required configuration added to the JSON file in the [GCP bucket location](gs://datcom-csv/cdc500_places/).
+- PLACES: Local Data for Better Health, Census Tract Data
+- PLACES: Local Data for Better Health, County/Country Data
+- PLACES: Local Data for Better Health, Place (City) Data
+- PLACES: Local Data for Better Health, ZCTA (Zip Code) Data
 
 To download all datasets available, run the following command. The download will take 5-10 minutes total. Files will be downloaded and extracted to a `raw_data` folder.
 ```bash
@@ -34,7 +34,49 @@ The data imported in this effort is from the CDC's [500 Places project](https://
 
 ### Notes and Caveats
 
-None.
+For a data refresh of the CDC500 import, manually search the website for the latest release files across all geo levels and add the required configuration to the [JSON file](gs://datcom-csv/cdc500_places/download_config.json) in the GCP bucket.
+
+Fill in the JSON file for the latest release using the format below:
+
+```
+{
+        "release_year": {ReleaseYear}, 
+        "parameter": [
+            {
+                "URL": "Download link of latest release",
+                "FILE_TYPE": "Geo Level of the data should be either [County, City, ZipCode, CensusTract]",
+                "FILE_NAME": "{GeoLevel}_raw_data_{ReleaseYear}.csv"
+            }
+        ]
+    }
+```
+
+Example:
+{
+        "release_year": 2022,
+        "parameter": [
+            {
+                "URL": "https://data.cdc.gov/api/views/duw2-7jbt/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "County",
+                "FILE_NAME": "county_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/epbn-9bv3/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "City",
+                "FILE_NAME": "city_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/nw2y-v4gm/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "CensusTract",
+                "FILE_NAME": "censustract_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/gd4x-jyhw/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "ZipCode",
+                "FILE_NAME": "zipcode_raw_data_2022.csv"
+            }
+        ]
+    }
 
 ### License
 The data is made available for public-use by the [CDC](https://www.cdc.gov/nchs/data_access/ftp_data.htm). Users of CDC National Center for Health Statistics Data must comply with the CDC's [data use agreement](https://www.cdc.gov/nchs/data_access/restrictions.htm).
@@ -51,7 +93,7 @@ These data were collected and provided by the [CDC National Center for Chronic D
 
 [`parse_cdc_places.py`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/parse_cdc_places.py)
 
-[`clean_files.sh`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/clean_files.sh)
+[`run.sh`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/run.sh)
 
 #### Test Scripts
 [`parse_cdc_places_test.py`](https://github.com/datacommonsorg/data/blob/master/scripts/us_cdc/500_places/parse_cdc_places_test.py)
@@ -75,10 +117,8 @@ The expected output of this test can be found in the [`test_data`](https://githu
 
 #### Data Download and Processing Steps
 
-To download and clean all the data files at once, run `download_bulk.py` and then `clean_files.sh`:
+To download and clean all the data files at once, run `run.sh`:
 
 ```bash
-$ python3 download_bulk.py
-
-$ sh clean_files.sh
+$ sh run.sh
 ```
diff --git a/scripts/us_cdc/500_places/cdc_places.tmcf b/scripts/us_cdc/500_places/cdc_places.tmcf
@@ -6,6 +6,8 @@ variableMeasured: C:CDC_Places->StatVar
 observationPeriod: "P1Y"
 measurementMethod: C:CDC_Places->DataValueTypeID
 value: C:CDC_Places->Data_Value
+unit: Percent
+scalingFactor: 100
 
 Node: E:CDC_Places->E1
 observationAbout: C:CDC_Places->Location

diff --git a/scripts/us_cdc/500_places/clean_files.sh b/scripts/us_cdc/500_places/clean_files.sh
diff --git a/scripts/us_cdc/500_places/download_bulk.py b/scripts/us_cdc/500_places/download_bulk.py
@@ -24,31 +24,51 @@
 """
 
 import os
-
 import requests
+import json
+
+from absl import logging
+from retry import retry
+from google.cloud import storage
+
+# Initialize GCP storage client
+client = storage.Client()
+
+# Define your GCP bucket and file name
+bucket_name = 'datcom-csv'  # Replace with your bucket name
+file_name = 'cdc500_places/download_config.json'  # Replace with your file name
+
+# Download the file from GCP Storage
+bucket = client.get_bucket(bucket_name)
+blob = bucket.blob(file_name)
+
+# Read the JSON content from the blob
+json_data = blob.download_as_text()
+
+# Load the JSON data
+_CONFIG_FILE = json.loads(json_data)
+
 
-DATA_URLS = {
-    "county_raw_data.csv":
-        "https://chronicdata.cdc.gov/api/views/swc5-untb/rows.csv?accessType=DOWNLOAD",
-    "city_raw_data.csv":
-        "https://chronicdata.cdc.gov/api/views/eav7-hnsx/rows.csv?accessType=DOWNLOAD",
-    "censustract_raw_data.csv":
-        "https://chronicdata.cdc.gov/api/views/cwsq-ngmh/rows.csv?accessType=DOWNLOAD",
-    "zipcode_raw_data.csv":
-        "https://chronicdata.cdc.gov/api/views/qnzd-25i4/rows.csv?accessType=DOWNLOAD"
-}
+@retry(tries=3, delay=5, backoff=5)
+def retry_method(url):
+    return requests.get(url)
 
 
-def download_file(url: str, save_path: str):
+def download_file(release_year, url: str, save_path: str):
     """
     Args:
+        release_year: release year of the dataset, used in log messages
         url: url to the file to be downloaded
         save_path: path for the downloaded file to be stored
     Returns:
         a downloaded csv file in the specified file path
     """
-    print(f'Downloading {url} to {save_path}')
-    request = requests.get(url, stream=True)
+    logging.info(
+        f'Downloading {url} for the year {release_year} to {save_path}')
+    request = retry_method(url)
+    if request.status_code != 200:
+        logging.fatal(
+            f'Failed to retrieve {url} for the year {release_year} to {save_path}'
+        )
     with open(save_path, 'wb') as file:
         file.write(request.content)
 
@@ -58,10 +78,12 @@ def main():
     data_dir = os.path.join(os.getcwd(), 'raw_data')
     if not os.path.exists(data_dir):
         os.makedirs(data_dir)
-    for dataset_name, url in DATA_URLS.items():
-        print(dataset_name)
-        save_path = os.path.join(data_dir, dataset_name)
-        download_file(url, save_path)
+    logging.set_verbosity(2)
+    for item in _CONFIG_FILE:
+        release_year = item["release_year"]
+        for url_dict in item["parameter"]:
+            save_path = os.path.join(data_dir, url_dict['FILE_NAME'])
+            download_file(release_year, url_dict['URL'], save_path)
 
 
 if __name__ == '__main__':
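
The nested loop in `main()` walks every release entry and every file listed under `parameter`. A minimal sketch of that expansion as a pure function, with no network or GCS access, might look like the following. The name `plan_downloads` and the one-entry `sample_config` are hypothetical and only illustrate the shape of the data; they are not part of this change.

```python
import os
from typing import Iterator, List, Tuple


def plan_downloads(config: List[dict],
                   data_dir: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (release_year, url, save_path) for every file listed in the config."""
    for release in config:
        for entry in release["parameter"]:
            yield (release["release_year"], entry["URL"],
                   os.path.join(data_dir, entry["FILE_NAME"]))


# Hypothetical one-release config, shaped like download_config.json.
sample_config = [{
    "release_year": 2022,
    "parameter": [{
        "URL": "https://data.cdc.gov/api/views/duw2-7jbt/rows.csv?accessType=DOWNLOAD",
        "FILE_TYPE": "County",
        "FILE_NAME": "county_raw_data_2022.csv",
    }],
}]

tasks = list(plan_downloads(sample_config, "raw_data"))
```

Separating the planning step from the actual `requests.get` call would make the iteration logic testable without touching the network; whether that split is worth it here is a judgment call.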

diff --git a/scripts/us_cdc/500_places/download_config.json b/scripts/us_cdc/500_places/download_config.json
@@ -0,0 +1,77 @@
+[
+    {
+        "release_year": 2022,
+        "parameter": [
+            {
+                "URL": "https://data.cdc.gov/api/views/duw2-7jbt/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "County",
+                "FILE_NAME": "county_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/epbn-9bv3/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "City",
+                "FILE_NAME": "city_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/nw2y-v4gm/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "CensusTract",
+                "FILE_NAME": "censustract_raw_data_2022.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/gd4x-jyhw/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "ZipCode",
+                "FILE_NAME": "zipcode_raw_data_2022.csv"
+            }
+        ]
+    },
+    {
+        "release_year": 2023,
+        "parameter": [
+            {
+                "URL": "https://data.cdc.gov/api/views/h3ej-a9ec/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "County",
+                "FILE_NAME": "county_raw_data_2023.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/krqc-563j/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "City",
+                "FILE_NAME": "city_raw_data_2023.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/em5e-5hvn/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "CensusTract",
+                "FILE_NAME": "censustract_raw_data_2023.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/9umn-c3jf/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "ZipCode",
+                "FILE_NAME": "zipcode_raw_data_2023.csv"
+            }
+        ]
+    },
+    {
+        "release_year": 2024,
+        "parameter": [
+            {
+                "URL": "https://data.cdc.gov/api/views/swc5-untb/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "County",
+                "FILE_NAME": "county_raw_data_2024.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/eav7-hnsx/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "City",
+                "FILE_NAME": "city_raw_data_2024.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/cwsq-ngmh/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "CensusTract",
+                "FILE_NAME": "censustract_raw_data_2024.csv"
+            },
+            {
+                "URL": "https://data.cdc.gov/api/views/qnzd-25i4/rows.csv?accessType=DOWNLOAD",
+                "FILE_TYPE": "ZipCode",
+                "FILE_NAME": "zipcode_raw_data_2024.csv"
+            }
+        ]
+    }
+]
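
Because this config is edited by hand for each release, a small validation pass before downloading could catch typos early. The sketch below is an assumption-laden illustration, not part of the PR: `validate_config`, `ALLOWED_FILE_TYPES`, and the rule that `FILE_NAME` ends in `_{release_year}.csv` are inferred from the entries above rather than from any documented schema.

```python
from typing import List

# Values taken from the FILE_TYPE description in the README.
ALLOWED_FILE_TYPES = {"County", "City", "ZipCode", "CensusTract"}
REQUIRED_KEYS = {"URL", "FILE_TYPE", "FILE_NAME"}


def validate_config(config) -> List[str]:
    """Return human-readable problems; an empty list means no issues were found."""
    if not isinstance(config, list):
        return ["top level must be a list of release entries"]
    problems = []
    for i, release in enumerate(config):
        year = release.get("release_year")
        if not isinstance(year, int):
            problems.append(f"entry {i}: release_year must be an integer")
        for j, item in enumerate(release.get("parameter", [])):
            missing = REQUIRED_KEYS - item.keys()
            if missing:
                problems.append(
                    f"entry {i}, parameter {j}: missing {sorted(missing)}")
                continue
            if item["FILE_TYPE"] not in ALLOWED_FILE_TYPES:
                problems.append(
                    f"entry {i}, parameter {j}: unknown FILE_TYPE "
                    f"{item['FILE_TYPE']!r}")
            # Assumed naming convention, observed in the entries above.
            if not item["FILE_NAME"].endswith(f"_{year}.csv"):
                problems.append(
                    f"entry {i}, parameter {j}: FILE_NAME does not end "
                    f"with _{year}.csv")
    return problems


good = [{
    "release_year": 2024,
    "parameter": [{
        "URL": "https://data.cdc.gov/api/views/swc5-untb/rows.csv?accessType=DOWNLOAD",
        "FILE_TYPE": "County",
        "FILE_NAME": "county_raw_data_2024.csv",
    }],
}]
# Hypothetical bad entry: "State" is not one of the allowed geo levels.
bad = [{
    "release_year": 2024,
    "parameter": [{
        "URL": "https://example.com/rows.csv",
        "FILE_TYPE": "State",
        "FILE_NAME": "state_raw_data_2024.csv",
    }],
}]
```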
diff --git a/scripts/us_cdc/500_places/manifest.json b/scripts/us_cdc/500_places/manifest.json
@@ -0,0 +1,32 @@
+{
+    "import_specifications": [
+        {
+            "import_name": "CDC500",
+            "curator_emails": [],
+            "provenance_url": "https://www.cdc.gov/places/index.html",
+            "provenance_description": "Variables related to health from the CDC",
+            "scripts": [
+                "download_bulk.py", "parse_cdc_places.py"
+            ],
+            "import_inputs": [
+                {
+                    "template_mcf": "cdc_places.tmcf",
+                    "cleaned_csv": "cleaned_csv/County.csv"
+                },
+                {
+                    "template_mcf": "cdc_places.tmcf",
+                    "cleaned_csv": "cleaned_csv/City.csv"
+                },
+                {
+                    "template_mcf": "cdc_places.tmcf",
+                    "cleaned_csv": "cleaned_csv/ZipCode.csv"
+                },
+                {
+                    "template_mcf": "cdc_places.tmcf",
+                    "cleaned_csv": "cleaned_csv/CensusTract.csv"
+                }
+            ],
+            "cron_schedule": "0 11 * * 2"
+        }
+    ]
+}
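
For readers unfamiliar with cron syntax, the `cron_schedule` value can be decoded field by field. The sketch below assumes the manifest uses the standard five-field cron layout (minute, hour, day of month, month, day of week, with 0 meaning Sunday); `describe_cron` is a hypothetical helper, and the time zone the scheduler applies is not stated in this diff.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")
DAY_NAMES = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday")


def describe_cron(expr: str) -> dict:
    """Split a five-field cron expression into named fields."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


schedule = describe_cron("0 11 * * 2")
# Both 0 and 7 denote Sunday in standard cron, hence the modulo.
weekday = DAY_NAMES[int(schedule["day_of_week"]) % 7]
```

Under that assumption, the schedule in the manifest reads as 11:00 every Tuesday.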