Refactor doctor_visits: Load source file only once #1978

Open · wants to merge 32 commits from doctor_visits_refactor_for_speed into main

Conversation

@minhkhul (Contributor) commented Jun 24, 2024

Description

Currently, doctor_visits source files contain columns whose raw names vary slightly but refer to the same kind of data, which we map to standard names:

"servicedate": "ServiceDate",
"patCountyFIPS": "PatCountyFIPS",
"patHRRname": "Pat HRR Name",
"patAgeGroup": "PatAgeGroup",
"patHRRid": "Pat HRR ID"

We deal with this by:

  1. Unzip and load the source file into a dataframe
  2. Rename the columns to a standard format
  3. Rewrite the whole source file back into its original storage
  4. Reopen the source file and load it into another dataframe
  5. Continue processing the data

This PR removes the first round of loading the source file and renaming columns, replacing it with code that does the same thing while avoiding loading and writing the source file twice.
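A minimal sketch of the single-pass idea, assuming pandas and a hypothetical `load_drop` helper (the actual change lives in update_sensor.py, per the changelog below):

```python
import pandas as pd

# Mapping from raw source column names to the standard names (subset from above).
COLUMN_RENAMES = {
    "servicedate": "ServiceDate",
    "patCountyFIPS": "PatCountyFIPS",
    "patHRRname": "Pat HRR Name",
    "patAgeGroup": "PatAgeGroup",
    "patHRRid": "Pat HRR ID",
}

def load_drop(path: str) -> pd.DataFrame:
    """Read the gzipped source file once and normalize column names in memory,
    instead of rewriting the renamed file to storage and reloading it."""
    df = pd.read_csv(path, compression="gzip")
    df = df.rename(columns=COLUMN_RENAMES)
    assert not df.duplicated().any(), "Duplicated rows in source file"
    return df
```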

Changelog


  • remove modify_claims_drops related files and tests
  • replace with code that does the same thing (rename columns and check for duplicated rows) in update_sensor.py
  • Adjust unit test files accordingly: adjust test_geo_map, and change the column names in the test csv.gz file to non-standard names so that the unit test covers renaming them to the correct columns.

@minhkhul minhkhul changed the title Refactor doctor_visits for faster runtime Refactor doctor_visits: Load source file only once Jun 24, 2024
@minhkhul minhkhul requested a review from melange396 June 24, 2024 21:16
@aysim319 (Contributor) commented Jul 1, 2024

Description

Continuing to optimize doctor_visits.
Main branch run: 569 (doctor_visit_EDI_AGG_OUTPATIENT_26062024_1455CDT.log)
Refactor branch run: 94 (doctor_visit_refactored_EDI_AGG_OUTPATIENT_26062024_1455CDT.log)

Changelog

  • refactored update_sensor and moved the csv processing into a separate module (process_data)
  • using datetime date params instead of strings in process_date
  • using dask to read in the csv
  • updated test_update_sensor
  • added test_process_data
  • updated config.py
  • updated run.py to use process_data

Notes

  • also looked into vaex; while it might be more suitable since we run on single bare-metal machines, it's less suited to data wrangling and I was struggling too much to get it working (its documentation and resources are less mature); open to trying it again if warranted
  • not sure how much memory the machines have; the code currently attempts to read the entire file in one go, which may cause issues if the csv can't fit into memory (doctor_visit_EDI_AGG_OUTPATIENT_26062024_1455CDT was about 4-5 GB and worked fine even on my local machine). Unsure how to go about mitigation; one option is sketched below. Food for thought
  • thoughts on future optimization to improve write_to_csv, but this seems like a good starting point for now
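One possible mitigation for the memory concern in the second note above, sketched minimally on top of the dask approach already in this PR (path, blocksize, dtypes, and column names are illustrative, not the indicator's actual configuration):

```python
import dask.dataframe as dd

# Illustrative uncompressed path; note that for .csv.gz inputs dask cannot split
# the file, so blocksize is ignored and the whole file becomes one partition.
path = "drops/EDI_AGG_OUTPATIENT_sample.csv"

# An explicit blocksize bounds each partition, so a 4-5 GB file never has to be
# materialized in memory all at once.
df = dd.read_csv(path, blocksize="256MB", dtype={"PatCountyFIPS": str})

# Computation stays lazy until .compute(); aggregations reduce partition by partition.
daily_counts = df.groupby("ServiceDate")["Denominator"].sum().compute()
```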

@aysim319 aysim319 force-pushed the doctor_visits_refactor_for_speed branch 5 times, most recently from 167d75c to 026594b Compare July 8, 2024 22:04
more lint
@aysim319 aysim319 force-pushed the doctor_visits_refactor_for_speed branch from 026594b to bfa853a Compare July 9, 2024 14:37
@aysim319 aysim319 force-pushed the doctor_visits_refactor_for_speed branch from 4fa370a to 1b5e6d3 Compare July 11, 2024 15:19
@aysim319 aysim319 force-pushed the doctor_visits_refactor_for_speed branch from 1b5e6d3 to 9920821 Compare July 11, 2024 15:23
@aysim319 (Contributor) commented:
Output of one day of state data and cProfile output, with validation scripts:
profile_results.zip

@aysim319 aysim319 force-pushed the doctor_visits_refactor_for_speed branch from 9b27248 to 7896042 Compare July 12, 2024 16:26
@aysim319 aysim319 linked an issue Jul 30, 2024 that may be closed by this pull request
@nmdefries (Contributor) left a comment

Some style comments to get started. I'm making another pass for logic/flow.

val_isnull = df["val"].isnull()
df_val_null = df[val_isnull]
assert len(df_val_null) == 0, "sensor value is nan, check pipeline"
df = df[~val_isnull]

nit (optional): these filters (here and the two below) aren't actually changing the behavior since if there were any null or too high values, the assertion would cause the program to error out.
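To make the point concrete, a tiny sketch with toy data and names following the excerpt above: with the assert in place, the filter below it can never remove anything.

```python
import pandas as pd

# Toy frame standing in for the sensor output.
df = pd.DataFrame({"val": [1.0, 2.0], "se": [0.1, 0.2]})

val_isnull = df["val"].isnull()

# If any value were null, this assert would already stop the program...
assert val_isnull.sum() == 0, "sensor value is nan, check pipeline"

# ...so this filter is a no-op and could be dropped. (Conversely, if dropping
# bad rows rather than crashing is the desired behavior, keep the filter and
# drop the assert.)
df = df[~val_isnull]
```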

assert len(df_val_null) == 0, "sensor value is nan, check pipeline"
df = df[~val_isnull]

se_too_high = df["se"] >= 5

todo: Please add brief rationale/source for this threshold of 5 in a comment. (The threshold of 90 below is self-evident and doesn't need a comment.)

If the threshold value came from the old code and was not explained, no need to do anything here.

df = df[~sensor_too_high]

if se:
valid_cond = (df["se"] > 0) & (df["val"] > 0)

discussion: consider consistency with past behavior of this indicator and with behavior of other indicators. I know some of our other indicators reported any odd (negative, etc) values that came in from the source. I'm somewhat wary of just dropping these points.

Second, if the df["val"] > 0 part of this filter is always desired behavior, it seems like it should also be done in the else block below. Right now the if not se case doesn't do any filtering by value.

Third, is val == 0 not valid?
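For the second point, a minimal sketch of what symmetric filtering could look like (toy data; whether val == 0 should count as valid is the open question above):

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

# Toy frame standing in for the sensor output in update_sensor.py.
df = pd.DataFrame({"val": [1.2, 0.0, -0.5], "se": [0.1, 0.0, 0.2]})
se = True  # whether standard errors are reported

# Apply the value filter in both branches so the `if not se` case is not
# silently left unfiltered.
if se:
    valid_cond = (df["se"] > 0) & (df["val"] > 0)
else:
    valid_cond = df["val"] > 0

invalid_df = df[~valid_cond]
if len(invalid_df) > 0:
    logger.info("Dropping %d rows with non-positive val or se", len(invalid_df))
df = df[valid_cond]
```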

valid_cond = (df["se"] > 0) & (df["val"] > 0)
invalid_df = df[~valid_cond]
if len(invalid_df) > 0:
logger.info("p=0, std_err=0 invalid")

todo: Please expand/clarify this logging message. Our filter is doing more than removing p == 0 and se == 0 points. This could perhaps include the current geo_id as in above assert messages and/or a count of affected rows.
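For instance, a sketch of a more descriptive message (variable names are illustrative; in update_sensor.py the geo and the invalid-row count come from the surrounding loop):

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative stand-ins for values available in the surrounding loop.
geo_id = "ca"
num_invalid = 42

# Say what was filtered, where, and how many rows were affected.
logger.info(
    "Filtering out %d rows with non-positive val or se for geo_id %s",
    num_invalid,
    geo_id,
)
```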

Comment on lines +97 to +103
# out_name = format_outname(prefix, se, weekday)

# write out results
out_name = "smoothed_adj_cli" if weekday else "smoothed_cli"
if se:
assert prefix is not None, "template has no obfuscated prefix"
out_name = prefix + "_" + out_name

todo: I think we want

Suggested change
# out_name = format_outname(prefix, se, weekday)
# write out results
out_name = "smoothed_adj_cli" if weekday else "smoothed_cli"
if se:
assert prefix is not None, "template has no obfuscated prefix"
out_name = prefix + "_" + out_name
out_name = format_outname(prefix, se, weekday)


out_n = 0
for d in set(output_df["date"]):
filename = "%s/%s_%s_%s.csv" % (output_path, (d + Config.DAY_SHIFT).strftime("%Y%m%d"), geo_level, out_name)

nit (optional): for readability, prefer f-strings over % string formatting.
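For example, the filename line from the excerpt rewritten with an f-string (stand-in values below; Config.DAY_SHIFT is replaced by a local timedelta for illustration):

```python
from datetime import datetime, timedelta

# Illustrative stand-ins for the values used in the excerpt above.
output_path = "./receiving"
geo_level = "state"
out_name = "smoothed_adj_cli"
d = datetime(2024, 6, 26)
day_shift = timedelta(days=1)  # stand-in for Config.DAY_SHIFT

# f-string version of the % formatting:
filename = f"{output_path}/{(d + day_shift).strftime('%Y%m%d')}_{geo_level}_{out_name}.csv"
```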

Comment on lines +113 to +115
outfile.write("geo_id,val,se,direction,sample_size\n")

for line in single_date_df.itertuples():

suggestion: The following chunk needs a rewrite to simplify.

Use the built-in pandas to_csv. itertuples is slow and unnecessary. The checks/conversions we're doing in the itertuples loop either can be done in bulk (even before we split by date, actually) or have already been done (e.g. didn't we already multiply se by 100? Also the assertions on values).


I see Dmitry already commented on this with some example code to use.
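Dmitry's snippet isn't reproduced on this page; purely as an illustration of the to_csv approach, with toy data and illustrative filenames:

```python
import pandas as pd

# Toy frame standing in for output_df in the excerpt above.
output_df = pd.DataFrame({
    "date": ["20240626", "20240626", "20240627"],
    "geo_id": ["pa", "ny", "pa"],
    "val": [1.0, 2.0, 3.0],
    "se": [0.1, 0.2, 0.3],
    "sample_size": [10, 20, 30],
})
output_df["direction"] = None  # written but unused column

# One CSV per date via DataFrame.to_csv, replacing the manual itertuples loop.
cols = ["geo_id", "val", "se", "direction", "sample_size"]
for d, single_date_df in output_df.groupby("date"):
    single_date_df.to_csv(f"{d}_state_smoothed_cli.csv", columns=cols, index=False)
```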


# aggregate age groups (so data is unique by service date and FIPS)
df = df.groupby([Config.DATE_COL, Config.GEO_COL]).sum(numeric_only=True).reset_index()
assert np.sum(df.duplicated()) == 0, "Duplicates after age group aggregation"

suggestion: recommend moving the duplicate check to before the groupby. If we had

date, geo, age, value
1,us,18-,x
1,us,18+,x
1,us,18-,x
1,us,18+,x
...

The groupby-sum would cause us to double-count the two age groups, since we're only grouping over date and geo. So the duplicate check should come first and check date, geo, and age combos.
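A small sketch of that ordering with the reviewer's duplicated toy data (column names are illustrative, following the mapping in the PR description):

```python
import pandas as pd

# Toy frame where every (date, geo, age) row appears twice.
df = pd.DataFrame({
    "ServiceDate": ["2024-06-26"] * 4,
    "PatCountyFIPS": ["42003"] * 4,
    "PatAgeGroup": ["18-", "18+", "18-", "18+"],
    "Denominator": [5, 7, 5, 7],
})

key_cols = ["ServiceDate", "PatCountyFIPS", "PatAgeGroup"]

# Checking duplicates on date/geo/age *before* the groupby flags the 2 repeated
# rows; checking after the groupby-sum would instead fold them into silently
# doubled counts.
print(df.duplicated(subset=key_cols).sum())  # -> 2
# In the indicator this would be an assert placed before the aggregation:
# assert not df.duplicated(subset=key_cols).any(), "Duplicates before age aggregation"

# Aggregate age groups so data is unique by service date and FIPS.
df = df.groupby(["ServiceDate", "PatCountyFIPS"]).sum(numeric_only=True).reset_index()
```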

# aggregate age groups (so data is unique by service date and FIPS)
df = df.groupby([Config.DATE_COL, Config.GEO_COL]).sum(numeric_only=True).reset_index()
assert np.sum(df.duplicated()) == 0, "Duplicates after age group aggregation"
assert (df[Config.COUNT_COLS] >= 0).all().all(), "Counts must be nonnegative"

nit: don't we run this check elsewhere?

Development

Successfully merging this pull request may close these issues.

Optimize Doctor Visit
5 participants