Code Review for Project #1

Status: Open — wants to merge 1 commit into base: master
125 changes: 125 additions & 0 deletions python-code/code.py
import requests
import os
import time
import pandas as pd
import glob
Collaborator comment:

I don't think you use glob in this code.

import datetime as dt

def verify_dirs_exist():
Collaborator comment:

The naming of this function is a bit misleading. It doesn't just verify that they exist, it actually creates them if they don't exist. I think it should be called ensure_dirs_exist.

Collaborator comment:

I would also make this function take in the list of directories to create, so that it's re-usable.

    #This notebook requires a few directories
    dirs = ["download", "download\csv", "download\excel"]
Collaborator comment:

Are these directories intended to be literally named download\csv and download\excel? Or are csv and excel intended to be subdirectories under download? On non-Windows OSes, \ is not the directory separator character, so this does the former. You can use os.path.join('download', 'csv') to work cross-platform.

If the excel and csv directories are meant to be subdirectories of download, you don't need to make download separately. os.makedirs operates recursively.
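A minimal runnable sketch of the cross-platform fix described above (the directory names are from the PR, the exact call shape is illustrative):

```python
import os

# Build the subdirectory paths with os.path.join so they work on any OS;
# os.makedirs creates "download" and its subdirectory in one recursive call.
dirs = [os.path.join("download", "csv"), os.path.join("download", "excel")]
for d in dirs:
    os.makedirs(d, exist_ok=True)
```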

    for d in dirs:
        curpath = os.path.abspath(os.curdir) # get current working directory
        full_path = os.path.join(curpath, d) # join cwd with proposed d
        create_dir_if_not_exists(full_path)

def create_dir_if_not_exists(full_path):
Collaborator comment:

os.makedirs already does the file existence check, essentially.

try:
    os.makedirs(full_path)
    print(f"Created directory {full_path}")
except FileExistsError:
    print(f"Directory {full_path} already existed")

The try/except version is slightly preferable because in the if version, there's a tiny chance that someone could create the directory between when you do the if and when you do the makedirs.
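On Python 3.2+ there is also a third option that sidesteps both the existence check and the race entirely; a minimal sketch with an illustrative path:

```python
import os

path = "download"  # illustrative path, not from the PR
os.makedirs(path, exist_ok=True)  # no error if it already exists, no race
os.makedirs(path, exist_ok=True)  # safe to call again
```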

    # expects a full path to the directory to test against or to create.
    if not os.path.exists(full_path):
        os.makedirs(full_path)
        print("Created directory ", full_path)
    else:
        print("Directory ", full_path, " already existed")

def generate_file_name_from_url(url):
Collaborator comment:

A comment with a sample URL and the file name you're generating from it would make this much easier to understand.

    month_year = url.split("\\")[-1].split("_")[-1].split(".")[0]
Collaborator comment:

Does url.split("\\") do anything for these URLs? It looks like a sample url is https://www.huduser.gov/portal/datasets/usps/ZIP_COUNTY_032010.xlsx, which doesn't contain the \ character.

Collaborator comment:

This is one of the (rare?) cases where a regular expression might make things clearer...

m = re.search('ZIP_COUNTY_(\d\d)(\d\d\d\d)\.xlsx', url) 
month, year = m.groups()
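Put together as a runnable sketch, using the sample URL quoted earlier in the review:

```python
import re

url = "https://www.huduser.gov/portal/datasets/usps/ZIP_COUNTY_032010.xlsx"
# Capture the two-digit month and four-digit year from the file name.
m = re.search(r'ZIP_COUNTY_(\d\d)(\d\d\d\d)\.xlsx', url)
month, year = m.groups()
new_file_name = "ZIP-COUNTY-FIPS_" + year + "-" + month
# month == "03", year == "2010", new_file_name == "ZIP-COUNTY-FIPS_2010-03"
```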

    month = month_year[:2]
    year = month_year[2:]
    new_file_name = "ZIP-COUNTY-FIPS_"+year + "-" + month
    return new_file_name

def get_file_path(url, csv_file=False):
    "Takes in the full url and returns the full file path"
    "File names are ZIP_COUNTY_032010.xlsx"
    curpath = os.path.abspath(os.curdir) #get current working directory
    full_path = ''
    if csv_file:
Collaborator comment:

Because the bulk of this function is in the if, would it make more sense for this to be two different functions? get_file_path_from_url and get_file_path_from_csv?

        #If we are passing in a csv, change xlsx to .csv
        csv_file_name = url.split("\\")[-1][:-5] + ".csv"
Collaborator comment:

Is the goal of url.split("\\") to separate the filename from the directory name? You can use os.path.split(url) to do that safely on multiple platforms.

Collaborator comment:

I would use replace to swap .xlsx for .csv instead of using [:-5] and +. It makes it clearer to someone reading the code what you intend to do. os.path.split(url)[-1].replace('.xlsx', '.csv')
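For example, with a hypothetical input path (forward slashes shown, since os.path.split handles the platform's separator):

```python
import os

path = "download/excel/ZIP_COUNTY_032010.xlsx"  # hypothetical input
# os.path.split separates directory from file name; replace swaps the extension.
csv_file_name = os.path.split(path)[-1].replace('.xlsx', '.csv')
# csv_file_name == "ZIP_COUNTY_032010.csv"
```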

Collaborator comment:

Is there a reason we don't want to use generate_file_name_from_url in this case? Is the "url" in the csv case not actually a "url", but a filesystem path?

        full_path = os.path.join(curpath, "download\csv", csv_file_name)
    else:
        url_name = url.split('/')[-1] # gets file name
        #switching file names to be YYYY-MM for better file management
Collaborator comment:

This is an ideal use of comments: document the "why?" and let the code speak to the "what?".

        url_file_name = generate_file_name_from_url(url) + ".xlsx"
        full_path = os.path.join(curpath, "download\excel", url_file_name)
Collaborator comment:

If excel is a subdirectory of download, this needs to be os.path.join(curpath, "download", "excel", url_file_name) on non-Windows platforms.

    return full_path

def download_file(url):
Collaborator comment:

This download_file function would be more generally useful if you passed in the file path as an argument.

    #With a full url, downloads the full file in chunks.
    #Able to handle large files.
    full_file_path = get_file_path(url)
    r = requests.get(url)
    with open(full_file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
    return full_file_path
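A sketch of the more reusable shape the comment above suggests, with the destination path passed in. stream=True is an added assumption: without it, requests buffers the whole response body in memory before iter_content yields anything, which defeats the chunked write.

```python
import requests

def download_file(url, dest_path):
    # Stream the response so large files are written in 1 KiB chunks.
    r = requests.get(url, stream=True)
    r.raise_for_status()  # surface HTTP errors instead of writing an error page
    with open(dest_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive chunks
                f.write(chunk)
    return dest_path
```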

"""
Census State & County FIPS Data
HUD uses ZIP & FIPS data. We need to grab the FIPS to county
name data to be able to merge and create the cross lookup
"""

census_fips_url = "https://www2.census.gov/geo/docs/reference/codes/files/national_county.txt"
#the FIPS data does not come with column names
census_col_names = ["STATE","STATEFP","COUNTYFP","COUNTYNAME","CLASSFP"]

# Open url, read in the data with the column names, and convert specific columns to str.
# When Pandas reads these columns, it automatically interprets them as ints
Collaborator comment:

Again, good use of comments to explain the "why"

fips_df = pd.read_table(
    census_fips_url,
    sep=",",
    names=census_col_names,
    converters={'STATEFP': str, 'COUNTYFP': str, 'CLASSFP': str}
)

# Combine State & County FP to generate the full FIPS code for easier lookup
fips_df["STCOUNTYFP"] = fips_df["STATEFP"] + fips_df["COUNTYFP"]
#Dropping STATFP & COUNTYFP as we no longer need them
fips_df = fips_df[["STCOUNTYFP", "STATE" ,"COUNTYNAME", "CLASSFP"]]
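A tiny made-up example of why those str converters matter: read as ints, a state code like "06" collapses to 6 and the concatenated FIPS code comes out wrong.

```python
import pandas as pd

# One made-up row; real data comes from the Census file above.
df = pd.DataFrame({"STATEFP": ["06"], "COUNTYFP": ["037"]})
df["STCOUNTYFP"] = df["STATEFP"] + df["COUNTYFP"]
# String concatenation keeps the leading zero: "06" + "037" == "06037"
```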

# Get current year to handle future runs of this file
now = dt.datetime.now()
cur_year = now.year

def get_files_url(month, year):
Collaborator comment:

Could this be named more descriptively? "get_hud_zip_county_url"?

    monthyear = month + str(year)
    return "https://www.huduser.gov/portal/datasets/usps/ZIP_COUNTY_{}.xlsx".format(monthyear)
Collaborator comment:

How about formatting both of these in place?
return "https://www.huduser.gov/portal/datasets/usps/ZIP_COUNTY_{}{}.xlsx".format(month, year)

Or, more descriptively, but longer-winded:
return "https://www.huduser.gov/portal/datasets/usps/ZIP_COUNTY_{m}{y}.xlsx".format(m=month, y=year)

If you're using python 3.6, you can use the nicer f"" formatting...
return f"https://www.huduser.gov/portal/datasets/usps/ZIP_COUNTY_{month}{year}.xlsx"


"""
Main loop
"""
def main():
    # from the beginning of hud year data to current year
    for year in range(2010, cur_year+1):
        #hud files are based on quarters
        for month in ["03", "06", "09", "12"]:
            #generate the HUDs url
            url = get_files_url(month, year)
            #download the file
            full_file_path = download_file(url)
            #open and get the excel dataframe
            excel_df = process_excel_file(full_file_path)
            #merge the excel file with the fips data
            merged_df = fips_df.merge(excel_df)
Collaborator comment:

It would be better if this function did not refer to fips_df, which is defined outside of the function. That should be passed in as an argument or defined within the function.

Collaborator comment:

The same is true for cur_year

            #reduce the dataframe down to specific columns
            merged_df = merged_df[["ZIP", "COUNTYNAME", "STATE", "STCOUNTYFP", "CLASSFP"]]
            #generate a csv file path
            csv_path = get_file_path(full_file_path, True)
            print(csv_path)
            try:
                merged_df.to_csv(csv_path, encoding='utf-8', index=False)
            except:
Collaborator comment:

This except is too broad. When I first tried to run this, the code did nothing because it was trying to write to a directory that didn't exist. It didn't show me the error because this clause swallowed it. If you're expecting a particular error, you should catch it as specifically as possible.
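A minimal illustration of catching the error as specifically as possible; here the missing-directory failure the reviewer hit surfaces as FileNotFoundError instead of vanishing into a bare except:

```python
try:
    # "no_such_dir" does not exist, so open() fails immediately
    with open("no_such_dir/out.csv", "w") as f:
        f.write("ZIP,COUNTYNAME\n")
except FileNotFoundError as e:
    print(f"could not write csv: {e}")
```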

                #once we get to a Q that hasn't happened yet, we'll get an XLRDError
Collaborator comment:

Is there a more straightforward way to detect this? Check to see if the input is empty? Check if year > now.year or year == now.year and month > now.month?
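One way to sketch the date-comparison idea (the helper name is mine, not from the PR):

```python
import datetime as dt

def quarter_is_in_future(month, year, today=None):
    # Compare (year, month) tuples; month strings like "03" become ints.
    today = today or dt.date.today()
    return (year, int(month)) > (today.year, today.month)
```

The loop could check this before downloading and break cleanly, instead of waiting for the parse of an error page to fail.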

                print("Operation has completed")
Collaborator comment:

Which operation has completed? This message could be more descriptive. "All files have been downloaded and processed"?

                break

            # prevent from overloading the HUD site and to be a nice visitor
            time.sleep(1)
            print("Completed ", csv_path)


if __name__ == '__main__':
    verify_dirs_exist()
    main()