Code Review for Project #1
import requests
import os
import time
import pandas as pd
import glob
import datetime as dt
def verify_dirs_exist():

Reviewer: The naming of this function is a bit misleading. It doesn't just verify that the directories exist; it actually creates them if they don't exist. I think it should be called …

Reviewer: I would also make this function take in the list of directories to create, so that it's re-usable.
    #This notebook requires a few directories
    dirs = ["download", "download\csv", "download\excel"]

Reviewer: Are these directories intended to be literally named "download\csv" and "download\excel"? If the …
    for d in dirs:
        curpath = os.path.abspath(os.curdir) # get current working directory
        full_path = os.path.join(curpath, d) # join cwd with proposed d
        create_dir_if_not_exists(full_path)
def create_dir_if_not_exists(full_path):

Reviewer: The try/except version is slightly preferable because in the if version, there's a tiny chance that someone could create the directory between when you do the if and when you do the makedirs.
    # expects a full path to the directory to test against or to create.
    if not os.path.exists(full_path):
        os.makedirs(full_path)
        print("Created directory ", full_path)
    else:
        print("Directory ", full_path, " already existed")
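A hedged sketch of the try/except approach the reviewer suggests, which lets the filesystem arbitrate the race instead of a separate `exists()` check (the structure is an assumption, not code from the PR):

```python
import os

def create_dir_if_not_exists(full_path):
    # Attempt the creation unconditionally; if another process created
    # the directory first, FileExistsError tells us so atomically.
    try:
        os.makedirs(full_path)
        print("Created directory ", full_path)
    except FileExistsError:
        print("Directory ", full_path, " already existed")
```

On Python 3.2+, `os.makedirs(full_path, exist_ok=True)` collapses this further when the "already existed" message isn't needed.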
def generate_file_name_from_url(url):

Reviewer: A comment with a sample URL and the file name you're generating from it would make this much easier to understand.
    month_year = url.split("\\")[-1].split("_")[-1].split(".")[0]

Reviewer: Does …

Reviewer: This is one of the (rare?) cases where a regular expression might make things clearer...
    month = month_year[:2]
    year = month_year[2:]
    new_file_name = "ZIP-COUNTY-FIPS_" + year + "-" + month
    return new_file_name
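A sketch of the regex approach the review hints at, assuming the HUD file names always end in `_MMYYYY.xlsx` as in the `ZIP_COUNTY_032010.xlsx` example mentioned elsewhere in the script:

```python
import re

def generate_file_name_from_url(url):
    # e.g. ".../ZIP_COUNTY_032010.xlsx" -> "ZIP-COUNTY-FIPS_2010-03"
    match = re.search(r"_(\d{2})(\d{4})\.xlsx$", url)
    month, year = match.group(1), match.group(2)
    return "ZIP-COUNTY-FIPS_" + year + "-" + month
```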
def get_file_path(url, csv_file=False):
    "Takes in the full url and returns the full file path"
    "File names are ZIP_COUNTY_032010.xlsx"
    curpath = os.path.abspath(os.curdir) #get current working directory
    full_path = ''
    if csv_file:

Reviewer: Because the bulk of this function is in the if, would it make more sense for this to be two different functions?
        #If we are passing in a csv, change xlsx to .csv
        csv_file_name = url.split("\\")[-1][:-5] + ".csv"

Reviewer: Is the goal of …

Reviewer: I would use …

Reviewer: Is there a reason we don't want to use …
        full_path = os.path.join(curpath, "download\csv", csv_file_name)
    else:
        url_name = url.split('/')[-1] # gets file name
        #switching file names to be YYYY-MM for better file management

Reviewer: This is an ideal use of comments: document the "why?" and let the code speak to the "what?".
        url_file_name = generate_file_name_from_url(url) + ".xlsx"
        full_path = os.path.join(curpath, "download\excel", url_file_name)

Reviewer: If …

    return full_path
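One robust alternative to slicing a fixed number of characters off the file name is `os.path.splitext`, which handles any extension length. A minimal sketch (the helper name is illustrative, not from the PR):

```python
import os

def excel_name_to_csv_name(excel_name):
    # splitext drops whatever extension is present instead of
    # assuming it is exactly four characters plus a dot
    root, _ext = os.path.splitext(os.path.basename(excel_name))
    return root + ".csv"
```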
def download_file(url):

Reviewer: This …
    #With a full url, downloads the full file in chunks.
    #Able to handle large files.
    full_file_path = get_file_path(url)
    r = requests.get(url)
    with open(full_file_path, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk: # filter out keep-alive new chunks
                f.write(chunk)
    return full_file_path
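Note that `requests.get(url)` without `stream=True` reads the whole response into memory before `iter_content` ever runs, so the chunking above doesn't actually bound memory use; passing `stream=True` to `requests.get` is the usual fix. For comparison, a stdlib-only sketch that genuinely streams in chunks (the function name is an assumption):

```python
import shutil
import urllib.request

def download_file_streamed(url, dest_path):
    # copyfileobj copies in fixed-size chunks, so large files are
    # never held fully in memory
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as f:
        shutil.copyfileobj(response, f, length=8192)
    return dest_path
```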
""" | ||
Census State & County FIPS Data | ||
HUD uses ZIP & FIPS data. We need to grab the FIPS to county | ||
name data to be able to merge and create the cross lookup | ||
""" | ||
|
||
census_fips_url = "https://www2.census.gov/geo/docs/reference/codes/files/national_county.txt"
#the FIPS data does not come with column names
census_col_names = ["STATE","STATEFP","COUNTYFP","COUNTYNAME","CLASSFP"]
# Open url, read in the data with the column names, and convert specific columns to str.
# When Pandas reads these columns, it automatically interprets them as ints.

Reviewer: Again, good use of comments to explain the "why".
fips_df = pd.read_table(
    census_fips_url,
    sep=",",
    names=census_col_names,
    converters={'STATEFP': str, 'COUNTYFP': str, 'CLASSFP': str}
)
# Combine State & County FP to generate the full FIPS code for easier lookup
fips_df["STCOUNTYFP"] = fips_df["STATEFP"] + fips_df["COUNTYFP"]
#Dropping STATEFP & COUNTYFP as we no longer need them
fips_df = fips_df[["STCOUNTYFP", "STATE", "COUNTYNAME", "CLASSFP"]]
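Why the str converters matter: FIPS codes are fixed-width digit strings, and integer parsing silently drops leading zeros, corrupting the concatenated code. A minimal illustration (the values are Alabama's state code and Autauga County):

```python
# STATEFP "01" (Alabama) + COUNTYFP "001" gives the 5-digit FIPS "01001".
statefp, countyfp = "01", "001"
full_fips = statefp + countyfp
assert full_fips == "01001"

# Parsed as ints, the leading zeros vanish and the join is wrong.
assert str(int(statefp)) + str(int(countyfp)) == "11"
```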
# Get current year to handle future runs of this file
now = dt.datetime.now()
cur_year = now.year
def get_files_url(month, year):

Reviewer: Could this be named more descriptively? "get_hud_zip_county_url"?

    monthyear = month + str(year)
    return "https://www.huduser.gov/portal/datasets/usps/ZIP_COUNTY_{}.xlsx".format(monthyear)

Reviewer: How about formatting both of these in place? Or, more descriptively, but longer-winded: If you're using python 3.6, you can use the nicer …
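The in-place formatting the review suggests might look like this, using a Python 3.6+ f-string and the reviewer's proposed function name:

```python
def get_hud_zip_county_url(month, year):
    # month is a zero-padded string like "03"; year may be an int
    return f"https://www.huduser.gov/portal/datasets/usps/ZIP_COUNTY_{month}{year}.xlsx"
```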
"""
Main loop
"""
def main():
    # from the beginning of hud year data to current year
    for year in range(2010, cur_year+1):
        #hud files are based on quarters
        for month in ["03", "06", "09", "12"]:
            #generate the HUDs url
            url = get_files_url(month, year)
            #download the file
            full_file_path = download_file(url)
            #open and get the excel dataframe
            excel_df = process_excel_file(full_file_path)
            #merge the excel file with the fips data
            merged_df = fips_df.merge(excel_df)

Reviewer: It would be better if this function did not refer to …

Reviewer: The same is true for …
            #reduce the dataframe down to specific columns
            merged_df = merged_df[["ZIP", "COUNTYNAME", "STATE", "STCOUNTYFP", "CLASSFP"]]
            #generate a csv file path
            csv_path = get_file_path(full_file_path, True)
            print(csv_path)
            try:
                merged_df.to_csv(csv_path, encoding='utf-8', index=False)
            except:

Reviewer: This …
                #once we get to a Q that hasn't happened yet, we'll get an XLRDError

Reviewer: Is there a more straightforward way to detect this? Check to see if the input is empty? Check if …

                print("Operation has completed")

Reviewer: Which operation has completed? This message could be more descriptive. "All files have been downloaded and processed"?

                break
            # prevent from overloading the HUD site and to be a nice visitor
            time.sleep(1)
            print("Completed ", csv_path)


if __name__ == '__main__':
    verify_dirs_exist()
    main()
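On the bare `except:` flagged in the review: a bare except catches everything, including `KeyboardInterrupt` and `SystemExit`, so unexpected failures get silently swallowed. Catching the one exception you expect keeps everything else visible. A minimal, generic sketch of the pattern (not the PR's actual error handling, whose comment suggests an xlrd read error is the expected failure):

```python
def parse_quarter_month(text):
    # catch only the failure we expect; anything unexpected still propagates
    try:
        return int(text)
    except ValueError:
        return None
```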
Reviewer: I don't think you use glob in this code.