
feat: add inlining options and a new inline command #374

Open
wants to merge 1 commit into main from mikix/inline
Conversation

@mikix mikix (Contributor) commented Jan 16, 2025

Inlining here means downloading attachment URLs and inserting the results as attachment data, allowing you to download each attachment just once, at the cost of extra storage space.

This is often useful in preparation for NLP tasks, where you want to refer back to DocumentReference clinical notes repeatedly but do not want to constantly deal with the flakiness of an EHR server (or with losing access to that server altogether).
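
To make this concrete, here is a rough sketch of what inlining does to a DocumentReference (illustrative only; the helper name, the use of httpx, and the compact output style are assumptions, not the actual inliner code): any attachment that has a url but no data gets downloaded once and embedded as base64 data.

```python
import base64
import json

import httpx


async def inline_file(path: str, out_path: str, client: httpx.AsyncClient) -> None:
    """Sketch: download each attachment URL once and embed it as Attachment.data."""
    with open(path) as infile, open(out_path, "w") as outfile:
        for line in infile:
            resource = json.loads(line)
            for content in resource.get("content", []):
                attachment = content.get("attachment", {})
                if "url" in attachment and "data" not in attachment:
                    response = await client.get(attachment["url"])
                    response.raise_for_status()
                    attachment["data"] = base64.standard_b64encode(response.content).decode("ascii")
            # Write back out as one compact NDJSON line per resource
            outfile.write(json.dumps(resource, separators=(",", ":")) + "\n")
```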

New features:

  • There is a new inline command that can inline an existing folder of NDJSON.
  • The export and etl commands both now accept an opt-in flag to inline data after performing a bulk export.
  • If the server provides a Retry-After header that is a timestamp (rather than a number of seconds), we now parse that correctly.
  • All server requests are now retried at least a little upon errors. (Previously, only bulk export requests were retried; now every request to the server is, including Medication downloads and DocRef attachment downloads during NLP tasks.)

Behavior changes:

  • If the server gives us a 429 error, we now log it as an error message rather than as a successful bulk export progress call.
  • If the server gives us a 429 error, we use the next exponential backoff delay instead of hard coding 60 seconds as the delay.
  • If the server gives us a Retry-After header on an error response, we no longer unconditionally accept and use it. Rather, the requested delay is capped by our next exponential backoff delay: the server's Retry-After time is used if it's LESS than our next backoff, but if it's longer, we still use our own backoff (see the sketch after this list). This is lightly hostile, but (a) it's only done on error cases, (b) our backoff times are generous, counted in minutes not seconds, and (c) it lets us guarantee a maximum delay time for callers.
  • Instead of retrying on 429 and ALL 5xx errors, there's a specific list of error codes that we retry on. Currently it's 408, 429, 500, 502, 503, and 504.
  • The bulk exporter now times out after 30 days, rather than one day. We've seen Epic exports take a couple of weeks.
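
To illustrate the Retry-After and backoff behavior described above (a minimal sketch only; the helper names and backoff constants are made up, not the actual code):

```python
import email.utils
from datetime import datetime, timezone

# The specific status codes we retry on, per the list above
RETRYABLE_CODES = {408, 429, 500, 502, 503, 504}


def parse_retry_after(value: str) -> float:
    """Return a delay in seconds from Retry-After, given as seconds or an HTTP-date."""
    try:
        return float(value)
    except ValueError:
        when = email.utils.parsedate_to_datetime(value)
        return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())


def next_delay(attempt: int, retry_after: str | None) -> float:
    """Our exponential backoff, which the server's hint can shorten but never lengthen."""
    backoff = 60.0 * (2 ** attempt)  # minutes-scale backoff, for illustration
    if retry_after is not None:
        return min(backoff, parse_retry_after(retry_after))
    return backoff
```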

Sample Output

Inlining… ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 9:32:47
                      Attachments  Resources
 Total examined       168,415      138,501
 Newly inlined        148,074      121,427
 Fatal errors          20,272       20,272
 Retried but gave up       69           69

Checklist

  • Consider if documentation (like in docs/) needs to be updated
  • Consider if tests should be added


github-actions bot commented Jan 16, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines  Covered  Coverage  Threshold  Status
3877   3831     99%       98%        🟢

New Files

File Coverage Status
cumulus_etl/inliner/__init__.py 100% 🟢
cumulus_etl/inliner/cli.py 100% 🟢
cumulus_etl/inliner/inliner.py 100% 🟢
cumulus_etl/inliner/reader.py 100% 🟢
cumulus_etl/inliner/writer.py 100% 🟢
TOTAL 100% 🟢

Modified Files

File Coverage Status
cumulus_etl/__init__.py 100% 🟢
cumulus_etl/cli.py 100% 🟢
cumulus_etl/cli_utils.py 100% 🟢
cumulus_etl/common.py 100% 🟢
cumulus_etl/errors.py 100% 🟢
cumulus_etl/etl/cli.py 100% 🟢
cumulus_etl/etl/tasks/task_factory.py 100% 🟢
cumulus_etl/export/cli.py 100% 🟢
cumulus_etl/fhir/__init__.py 100% 🟢
cumulus_etl/fhir/fhir_client.py 100% 🟢
cumulus_etl/fhir/fhir_utils.py 100% 🟢
cumulus_etl/loaders/fhir/bulk_export.py 100% 🟢
cumulus_etl/loaders/fhir/ndjson_loader.py 100% 🟢
cumulus_etl/store.py 100% 🟢
TOTAL 100% 🟢

updated for commit: 7f2d0ab by action🐍

@mikix mikix force-pushed the mikix/inline branch 4 times, most recently from 2882da0 to cce70c5 Compare January 22, 2025 13:12
@@ -86,7 +90,15 @@ async def __aexit__(self, exc_type, exc_value, traceback):
await self._session.aclose()

async def request(
Contributor Author

@mikix mikix Jan 22, 2025

Let me lightly explain some of the refactoring here.

Before, FhirClient.request made a single request. The retry logic lived in the bulk export code, which retried on both error cases and the expected 202 status code for "still working on it".

Now, I've separated the bulk export retry logic into an error path and a 202 path, and moved the error path into this FhirClient.request method. So any code making requests to the server now gets retries easily (enabled by default for every request).
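
Roughly, the shape after the refactor looks like this (a sketch under the assumption of an httpx session; the class and helper names here are illustrative, not the real code): request() owns the error-path retries, so the bulk export poller only has to loop on 202.

```python
import asyncio

import httpx

RETRYABLE_CODES = {408, 429, 500, 502, 503, 504}  # matches the list in the description


class Client:
    def __init__(self) -> None:
        self._session = httpx.AsyncClient()

    async def request(self, method: str, url: str, retries: int = 3, **kwargs) -> httpx.Response:
        """Every request gets error-path retries by default."""
        for attempt in range(retries + 1):
            response = await self._session.request(method, url, **kwargs)
            if response.status_code not in RETRYABLE_CODES or attempt == retries:
                break
            await asyncio.sleep(2 ** attempt)  # stand-in for the real backoff logic
        return response


async def poll_export(client: Client, status_url: str) -> httpx.Response:
    """The bulk export code now only needs to handle the 202 'still working' path."""
    while True:
        response = await client.request("GET", status_url)
        if response.status_code != 202:
            return response
        await asyncio.sleep(60)  # stand-in for the export progress delay
```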

@mikix mikix force-pushed the mikix/inline branch 3 times, most recently from 5abc924 to 01b515b Compare January 27, 2025 17:58
sed -i 's/"generated_on": "[^"]*", //g' $OUTDIR/*.ndjson
sed -i 's/"generated_on":"[^"]*",//g' $OUTDIR/*.ndjson
Contributor Author

Adjusted to remove the expected whitespace, because this PR also switches the ETL to a more condensed NDJSON output mode that drops whitespace (in an attempt to reduce bloat a little when inlining).
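
For reference, the "condensed" mode is essentially the compact-separators style of JSON output (an illustration, not the ETL's actual writer code), which is why the expected sed pattern no longer includes a space after the colon:

```python
import json

row = {"generated_on": "2025-01-27T00:00:00Z", "resourceType": "Patient"}

print(json.dumps(row))
# {"generated_on": "2025-01-27T00:00:00Z", "resourceType": "Patient"}

print(json.dumps(row, separators=(",", ":")))
# {"generated_on":"2025-01-27T00:00:00Z","resourceType":"Patient"}
```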

Comment on lines +136 to +146
# Look at twice the allowed connections - downloads will block, but that's OK.
# This will allow us to download some attachments while other workers are sleeping
# because they are waiting to retry due to an HTTP error.
peek_at=fhir.FhirClient.MAX_CONNECTIONS * 2,
Contributor Author

This is lightly hostile because if a server is actually overloaded (vs. the normal random flakiness we tend to see), we might be hitting it a little harder than we ought to. I'm testing right now whether removing this bit changes anything on a sample pull against Cerner. But I'm inclined to run with it for performance reasons (under the assumption that authentic overload errors are rare and random flakiness is far more likely).

Contributor Author

Just tested this: if we only peek at MAX_CONNECTIONS at a time, we're roughly 10% slower (a big inline job took 11 hours instead of 10).
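
As a toy sketch of the trade-off being discussed (all names and numbers here are illustrative, not the real worker code): connection slots are held only while talking to the server, retries sleep outside the slot, and keeping twice MAX_CONNECTIONS tasks in flight means a sleeping task doesn't leave a connection idle.

```python
import asyncio

MAX_CONNECTIONS = 5  # stand-in for fhir.FhirClient.MAX_CONNECTIONS


async def fetch(url: str) -> bool:
    """Stand-in for the real attachment download; returns True on success."""
    await asyncio.sleep(0.1)
    return True


async def download(url: str, semaphore: asyncio.Semaphore) -> None:
    for attempt in range(3):
        async with semaphore:  # hold a connection slot only while talking to the server
            if await fetch(url):
                return
        await asyncio.sleep(2 ** attempt)  # back off *outside* the connection slot


async def run(urls: list[str]) -> None:
    semaphore = asyncio.Semaphore(MAX_CONNECTIONS)
    pending: set[asyncio.Task] = set()
    for url in urls:
        if len(pending) >= MAX_CONNECTIONS * 2:  # "peek at" twice the connection limit
            _, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        pending.add(asyncio.create_task(download(url, semaphore)))
    if pending:
        await asyncio.wait(pending)
```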

@mikix mikix marked this pull request as ready for review January 27, 2025 18:09
"pyarrow < 19",
"pyarrow < 20",
Contributor Author

Unrelated, I just stole this from a dependabot PR that was consistently failing CI for no good reason.
