Refactor doctor_visits: Load source file only once #1978
Conversation
Optimize with dask
Description
Continuing the optimization work for doctor_visits.
Changelog
Refactored update_sensor and moved the CSV processing into a separate module (process_data).
Notes
Output of one day of state data and cProfile output with validation scripts.
Co-authored-by: george <[email protected]>
Some style comments to get started. I'm making another pass for logic/flow.
```python
val_isnull = df["val"].isnull()
df_val_null = df[val_isnull]
assert len(df_val_null) == 0, "sensor value is nan, check pipeline"
df = df[~val_isnull]
```
nit (optional): these filters (here and the two below) aren't actually changing the behavior since if there were any null or too high values, the assertion would cause the program to error out.
```python
assert len(df_val_null) == 0, "sensor value is nan, check pipeline"
df = df[~val_isnull]

se_too_high = df["se"] >= 5
```
todo: Please add brief rationale/source for this threshold of 5 in a comment. (The threshold of 90 below is self-evident and doesn't need a comment.)
If the threshold value came from the old code and was not explained, no need to do anything here.
```python
df = df[~sensor_too_high]

if se:
    valid_cond = (df["se"] > 0) & (df["val"] > 0)
```
discussion: consider consistency with past behavior of this indicator and with behavior of other indicators. I know some of our other indicators reported any odd (negative, etc.) values that came in from the source. I'm somewhat wary of just dropping these points.
Second, if the `df["val"] > 0` part of this filter is always desired behavior, it seems like it should also be done in the `else` block below. Right now the `if not se` case doesn't do any filtering by value.
Third, is `val == 0` not valid?
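If the value filter does turn out to be desired in both branches, a minimal sketch of applying it consistently could look like the following (purely illustrative; it assumes `df` and `se` are in scope as in the quoted snippet, and whether `val == 0` should be kept is exactly the open question above):

```python
# Hypothetical sketch: apply the same value filter in both branches.
# Keeping or dropping val == 0 is still the open question raised above.
value_ok = df["val"] > 0
if se:
    valid_cond = value_ok & (df["se"] > 0)
else:
    valid_cond = value_ok
df = df[valid_cond]
```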
```python
valid_cond = (df["se"] > 0) & (df["val"] > 0)
invalid_df = df[~valid_cond]
if len(invalid_df) > 0:
    logger.info("p=0, std_err=0 invalid")
```
todo: Please expand/clarify this logging message. Our filter is doing more than removing `p == 0` and `se == 0` points. This could perhaps include the current `geo_id`, as in the `assert` messages above, and/or a count of affected rows.
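One possible shape for the expanded message (a sketch only; it assumes `geo_id` is in scope and that `logger` is a structured logger accepting keyword fields, otherwise fold the fields into the message string):

```python
# Hypothetical wording; geo_id and logger are assumed to be in scope.
invalid_df = df[~valid_cond]
if len(invalid_df) > 0:
    logger.info(
        "Dropping rows with non-positive val or se",
        geo_id=geo_id,
        num_invalid_rows=len(invalid_df),
    )
```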
```python
# out_name = format_outname(prefix, se, weekday)

# write out results
out_name = "smoothed_adj_cli" if weekday else "smoothed_cli"
if se:
    assert prefix is not None, "template has no obfuscated prefix"
    out_name = prefix + "_" + out_name
```
todo: I think we want the following instead of the commented-out call and the inline construction of `out_name`:

```python
out_name = format_outname(prefix, se, weekday)
```
```python
out_n = 0
for d in set(output_df["date"]):
    filename = "%s/%s_%s_%s.csv" % (output_path, (d + Config.DAY_SHIFT).strftime("%Y%m%d"), geo_level, out_name)
```
nit (optional): for readability, prefer f-strings over `%` string formatting.
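For the filename line quoted above, an f-string version might look like this (same names as the snippet, purely illustrative):

```python
# Same filename construction, using an f-string instead of % formatting.
date_str = (d + Config.DAY_SHIFT).strftime("%Y%m%d")
filename = f"{output_path}/{date_str}_{geo_level}_{out_name}.csv"
```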
```python
outfile.write("geo_id,val,se,direction,sample_size\n")

for line in single_date_df.itertuples():
```
suggestion: The following chunk needs a rewrite to simplify. Use the built-in `DataFrame.to_csv`. `itertuples` is slow and unnecessary. The checks/conversions we're doing in the `itertuples` loop either can be done in bulk (even before we split by date, actually) or have already been done (e.g. didn't we already multiply `se` by 100? Also assertions on values).
I see Dmitry already commented on this with some example code to use.
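For reference, a minimal sketch of the `to_csv` approach (column names are taken from the header written above; it assumes `single_date_df` already holds the final values for these columns, which may not match the real pipeline exactly):

```python
# Hypothetical sketch: write the per-date frame directly with to_csv
# instead of looping over rows with itertuples.
single_date_df.to_csv(
    filename,
    columns=["geo_id", "val", "se", "direction", "sample_size"],
    index=False,
    na_rep="NA",
)
```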
```python
# aggregate age groups (so data is unique by service date and FIPS)
df = df.groupby([Config.DATE_COL, Config.GEO_COL]).sum(numeric_only=True).reset_index()
assert np.sum(df.duplicated()) == 0, "Duplicates after age group aggregation"
```
suggestion: recommend moving the duplicate check to before the `groupby`. If we had

```
date, geo, age, value
1,us,18-,x
1,us,18+,x
1,us,18-,x
1,us,18+,x
...
```

the `groupby`-`sum` would cause us to double-count the two age groups, since we're only grouping over date and geo. So the duplicate check should come first and check date, geo, and age combos.
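A sketch of the reordered check is below (the age column name `Config.AGE_COL` is a placeholder assumption; the real column name may differ):

```python
# Hypothetical: check for duplicate (date, geo, age) rows *before* collapsing
# age groups, so duplicates can't be silently double-counted by the sum.
key_cols = [Config.DATE_COL, Config.GEO_COL, Config.AGE_COL]  # AGE_COL is assumed
assert not df.duplicated(subset=key_cols).any(), "Duplicates before age group aggregation"

# aggregate age groups (so data is unique by service date and FIPS)
df = df.groupby([Config.DATE_COL, Config.GEO_COL]).sum(numeric_only=True).reset_index()
```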
```python
# aggregate age groups (so data is unique by service date and FIPS)
df = df.groupby([Config.DATE_COL, Config.GEO_COL]).sum(numeric_only=True).reset_index()
assert np.sum(df.duplicated()) == 0, "Duplicates after age group aggregation"
assert (df[Config.COUNT_COLS] >= 0).all().all(), "Counts must be nonnegative"
```
nit: don't we run this check elsewhere?
Description
Currently, doctor_visits source files have multiple columns with slightly different names that refer to the same kind of data:
We deal with this by:
This PR removes the first round of loading the source file and renaming columns, replacing it with code that does the same thing but avoids loading/writing the source file twice.
Changelog
Itemize code/test/documentation changes and files added/removed.