Fix finding document entries with a bad pacer_doc_id #4885

ttys0dev · 2025-01-03T14:48:57Z

I think we need a fallback here in case we have entries with bad pacer_doc_id values.
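
A minimal sketch of the fallback idea, using an in-memory list rather than the Django ORM. The `Doc` class, its fields, and `find_doc` are hypothetical names chosen for illustration and are not taken from cl/recap/mergers.py:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a stored document row."""
    entry_id: int
    document_number: str
    pacer_doc_id: str


def find_doc(docs: List[Doc], entry_id: int, document_number: str,
             pacer_doc_id: str) -> Optional[Doc]:
    """Try a strict match first, then retry without pacer_doc_id."""
    base = [
        d for d in docs
        if d.entry_id == entry_id and d.document_number == document_number
    ]
    strict = [d for d in base if d.pacer_doc_id == pacer_doc_id]
    if strict:
        return strict[0]
    # Fallback: the stored pacer_doc_id may be bad, so ignore it.
    return base[0] if base else None


docs = [Doc(entry_id=1, document_number="5", pacer_doc_id="bad-id")]
match = find_doc(docs, 1, "5", "04505578698")
missing = find_doc(docs, 2, "5", "04505578698")
```

Here the strict lookup misses because the stored ID is wrong, and the fallback still finds the row by its other keys.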

mlissner · 2025-01-03T16:36:25Z

Thanks! One day, we'll catch all these variations, and retire to a beach, I suppose. Until then, Alberto, can you please review? Seems like we'll want a test for this as well?

albertisfu

Thanks, @ttys0dev. I've added a couple of related tests. They're currently failing due to two issues that may need to be addressed:

When removing the invalid pacer_doc_id from the fallback lookup, the RECAPDocument is matched but retains its previous invalid pacer_doc_id. Should we update the incorrect pacer_doc_id with the new one? Currently, the pacer_doc_id is assigned here:

courtlistener/cl/recap/mergers.py

Line 965 in 3060ac0

rd.pacer_doc_id = rd.pacer_doc_id or docket_entry["pacer_doc_id"]

While prioritizing the current pacer_doc_id is generally the expected behavior, in this case we'd need to update it from the docket_entry when the docket_entry value is not blank.

The fallback lookup without pacer_doc_id can raise a MultipleObjectsReturned error if there is more than one main PACER document in the docket entry. Should we apply the same logic to select the latest RECAPDocument with a PDF and remove duplicates?

courtlistener/cl/recap/mergers.py

Line 948 in 3060ac0

except RECAPDocument.MultipleObjectsReturned:

We could create a method for this logic and use it in both try/except blocks.

ttys0dev · 2025-01-04T18:13:08Z

> Should we update the incorrect pacer_doc_id with the new one?

I think so, added handling for this.
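
The difference between the two priorities can be sketched as follows. This is only an illustration: `keep_existing` and `prefer_incoming` are hypothetical helper names, not functions from the PR.

```python
def keep_existing(current: str, incoming: str) -> str:
    """Existing value wins; incoming is used only if current is blank."""
    return current or incoming


def prefer_incoming(current: str, incoming: str) -> str:
    """Incoming value wins unless it is blank."""
    return incoming or current


# With a bad stored ID, only the second rule corrects it.
stale = keep_existing("bad-id", "04505578698")
fixed = prefer_incoming("bad-id", "04505578698")
# A blank incoming value never overwrites a stored one.
unchanged = prefer_incoming("04505578698", "")
```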

> We could create a method for this logic and use it in both try/except blocks.

Yeah, cleaned this up a bit by making a function for removing duplicates.
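
A rough sketch of what such a shared helper might do, written against plain objects rather than Django models. The `Doc` fields, the `pick_survivor` name, and the ordering rule (documents with a PDF first, then most recently created) are assumptions for illustration, not the merged implementation:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a duplicate document row."""
    pk: int
    date_created: datetime
    has_pdf: bool


def pick_survivor(docs: List[Doc]) -> Tuple[Doc, List[Doc]]:
    """Return (keeper, duplicates), preferring PDFs, then newest."""
    if not docs:
        raise ValueError("expected at least one document")
    ranked = sorted(
        docs, key=lambda d: (d.has_pdf, d.date_created), reverse=True
    )
    return ranked[0], ranked[1:]


candidates = [
    Doc(pk=1, date_created=datetime(2024, 1, 1), has_pdf=True),
    Doc(pk=2, date_created=datetime(2024, 6, 1), has_pdf=False),
    Doc(pk=3, date_created=datetime(2024, 3, 1), has_pdf=True),
]
keeper, duplicates = pick_survivor(candidates)
```

A caller would then delete `duplicates` and continue with `keeper`, which lets the same routine serve both lookups that can return more than one row.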

albertisfu

@ttys0dev thank you! Changes look good.

Just a couple of remaining comments, please.

albertisfu

Thanks @ttys0dev for the changes. I've made a small tweak to a docstring and moved the line rds_created.append(rd) as explained in a comment.

@mlissner this looks good to be merged.

albertisfu · 2025-01-06T20:53:11Z

cl/recap/mergers.py

+                        is_available=False,
+                        **params,
+                    )
+                    rds_created.append(rd)


I moved the line rds_created.append(rd) so that it applies only when the RD is created. The previous approach also appended existing RDs.
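
The intent can be sketched with a dictionary-backed get-or-create. The `get_or_create` function and the record layout below are hypothetical and only mirror the pattern, not the code in cl/recap/mergers.py:

```python
from typing import Dict, List, Tuple


def get_or_create(store: Dict[str, dict], key: str) -> Tuple[dict, bool]:
    """Return (record, created); create the record only if it is missing."""
    if key in store:
        return store[key], False
    store[key] = {"key": key, "is_available": False}
    return store[key], True


store = {"doc-1": {"key": "doc-1", "is_available": True}}
created_records: List[dict] = []
for key in ["doc-1", "doc-2"]:
    record, created = get_or_create(store, key)
    if created:
        # Track only newly created records, never pre-existing ones.
        created_records.append(record)
```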

mlissner assigned albertisfu Jan 3, 2025

ttys0dev force-pushed the update-bad-pacer-doc-id branch from 60cfec2 to a335d6b Compare January 3, 2025 17:39

albertisfu self-requested a review January 4, 2025 01:02

albertisfu requested changes Jan 4, 2025

ttys0dev force-pushed the update-bad-pacer-doc-id branch from 17f5914 to 34bb0c6 Compare January 4, 2025 09:20

albertisfu self-requested a review January 6, 2025 17:24

albertisfu reviewed Jan 6, 2025

ttys0dev and others added 10 commits January 6, 2025 21:37

Fix finding document entries with a bad pacer_doc_id

839d39c

fix(recap): Added tests for matching a RECAPDocument with a bad pacer_doc_id

6332cdb

Update pacer_doc_id from docket_entry

aee5912

Handle duplicate documents

48a4a0c

Refactor duplicate cleaning logic into function

0d93d2b

[pre-commit.ci] auto fixes from pre-commit.com hooks

a822888

for more information, see https://pre-commit.ci

Add types/docstring to clean_duplicate_documents

a8c3d9e

Co-authored-by: Alberto Islas <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

b430d23

for more information, see https://pre-commit.ci

Add keep_latest_rd_document method

eb9cf9c

[pre-commit.ci] auto fixes from pre-commit.com hooks

35c9536

for more information, see https://pre-commit.ci

ttys0dev force-pushed the update-bad-pacer-doc-id branch from 2755d0e to 35c9536 Compare January 6, 2025 19:37

fix(recap): Fix append rd created and tweak docstrings

3388e05

albertisfu self-requested a review January 6, 2025 20:51

albertisfu approved these changes Jan 6, 2025

mlissner merged commit 738f6f9 into freelawproject:main Jan 7, 2025
10 checks passed

ttys0dev deleted the update-bad-pacer-doc-id branch January 7, 2025 11:50
