4826 Replicate RECAP PDF uploads to subdockets #4857

albertisfu · 2024-12-28T01:38:25Z

This PR adds support for replicating PDF uploads to subdockets, following an approach similar to the one used for attachment pages. A new step, find_subdocket_pdf_rds, was added before process_recap_pdf to look for subdockets where the document should be merged.

The process flow is:

The process aborts if the PDF upload belongs to an appellate court, since doppel cases cannot exist in appellate courts.
Similar to attachment pages, a RECAPDocument queryset finds unique RDs with the same pacer_doc_id in the same court. This query was moved to a helper method since it's common to both find_subdocket_pdf_rds and find_subdocket_att_page_rds.
Additional ProcessingQueue entries are created for each additional RECAPDocument where the PDF needs replication.
If the original PQ lacks a pacer_case_id (optional in PDF uploads), one is assigned during the first iteration so the PQ can succeed when processed by process_recap_pdf. Otherwise, the lookup will fail with RECAPDocument.MultipleObjectsReturned.
PQ creation is wrapped in transaction.atomic to roll back any objects if errors occur. This change was also applied to find_subdocket_att_page_rds.
Removed redundant code block in process_recap_attachment.
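
The flow above can be sketched in plain Python. This is a minimal illustration, not the code from this PR: the RD and PQ dataclasses, find_subdocket_pqs, and its arguments are hypothetical stand-ins for the Django models and the task, and the transaction.atomic wrapping is not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RD:
    """Hypothetical stand-in for a RECAPDocument row."""
    pk: int
    pacer_doc_id: str
    court_id: str
    pacer_case_id: str


@dataclass
class PQ:
    """Hypothetical stand-in for a ProcessingQueue row."""
    pk: int
    pacer_doc_id: str
    court_id: str
    pacer_case_id: Optional[str] = None


def find_subdocket_pqs(pq, all_rds, appellate_courts, next_pk):
    """Return the original PQ plus one new PQ per additional subdocket."""
    # Appellate uploads are not replicated.
    if pq.court_id in appellate_courts:
        return [pq]

    # Keep one RD per case among those sharing the doc ID in the same court.
    by_case = {}
    for rd in all_rds:
        if rd.pacer_doc_id == pq.pacer_doc_id and rd.court_id == pq.court_id:
            by_case.setdefault(rd.pacer_case_id, rd)

    pqs = [pq]
    for case_id in by_case:
        if pq.pacer_case_id is None:
            # The original PQ adopts the first case so its lookup is unambiguous.
            pq.pacer_case_id = case_id
            continue
        if case_id == pq.pacer_case_id:
            continue
        pqs.append(PQ(next_pk, pq.pacer_doc_id, pq.court_id, case_id))
        next_pk += 1
    return pqs
```

In this sketch an upload without a pacer_case_id ends up with one PQ per matching case, and an appellate upload is returned unchanged.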

When working on this I noticed that within process_recap_pdf there is the fallback query:

courtlistener/cl/recap/tasks.py

Line 264 in cb012e8

rd = await RECAPDocument.objects.aget(pacer_doc_id=pq.pacer_doc_id)

Is it correct that this query only uses pacer_doc_id? I'm not sure whether pacer_doc_ids are unique across all courts in PACER. If they're not, I think it would be safer to change it to:

rd = await RECAPDocument.objects.aget(pacer_doc_id=pq.pacer_doc_id, court_id=pq.court_id)

Let me know what you think.

Fixes: #4826

mlissner · 2024-12-30T18:21:54Z

Is it correct that this query only use pacer_doc_id ? Not sure if pacer_doc_ids are unique across all courts in PACER. If they're not, I think it would be safer to change it to:

Yeah, that's definitely a bug, and your fix of adding the court to it should help a lot, thank you.

johnhawkinson · 2024-12-30T18:27:11Z

Is it correct that this query only use pacer_doc_id ? Not sure if pacer_doc_ids are unique across all courts in PACER.

They are unique! The first 3 digits identify the court; see the doc1 URLs section of https://github.com/freelawproject/juriscraper/blob/main/juriscraper/pacer/notes.md

Yeah, that's definitely a bug, and your fix of adding the court to it should help a lot, thank you.

Why "definitely"?

mlissner · 2024-12-30T18:30:44Z

I forgot about that, you're right, John!
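
The point about the leading digits can be illustrated with a small helper. This is only a sketch: court_prefix and same_court are hypothetical names, the IDs in the usage lines are made up, and the mapping from a prefix to an actual court is not shown here.

```python
def court_prefix(pacer_doc_id: str) -> str:
    """Return the leading three digits of a PACER document ID."""
    prefix = pacer_doc_id[:3]
    if len(prefix) < 3 or not prefix.isdigit():
        raise ValueError(f"unexpected pacer_doc_id: {pacer_doc_id!r}")
    return prefix


def same_court(doc_id_a: str, doc_id_b: str) -> bool:
    """Two document IDs can belong to the same court only if their prefixes match."""
    return court_prefix(doc_id_a) == court_prefix(doc_id_b)


# Made-up IDs for illustration only.
print(same_court("035012345678", "035099999999"))
print(same_court("035012345678", "127012345678"))
```

If the prefix encodes the court, as the comment above says, a lookup by pacer_doc_id alone should not match documents from two different courts.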

mlissner

LGTM thanks! Onward to a proper review.

ERosendo · 2025-01-06T15:58:32Z

cl/recap/tasks.py

+    appellate_court_ids = [
+        court_pk
+        async for court_pk in (
+            Court.federal_courts.appellate_pacer_courts().values_list(
+                "pk", flat=True
+            )
+        )
+    ]
+    if pq.court_id in appellate_court_ids:
+        # Abort the process for appellate documents. Subdockets cannot be found
+        # in appellate cases.
+        return pqs_to_process_pks


I believe we can avoid creating the full list of appellate courts. Consider the following approach:

Suggested change

-    appellate_court_ids = [
-        court_pk
-        async for court_pk in (
-            Court.federal_courts.appellate_pacer_courts().values_list(
-                "pk", flat=True
-            )
-        )
-    ]
-    if pq.court_id in appellate_court_ids:
-        # Abort the process for appellate documents. Subdockets cannot be found
-        # in appellate cases.
-        return pqs_to_process_pks
+    appellate_court_ids = Court.federal_courts.appellate_pacer_courts()
+    if appellate_court_ids.filter(pk=pq.court_id).exists():
+        # Abort the process for appellate documents. Subdockets cannot be found
+        # in appellate cases.
+        return pqs_to_process_pks

albertisfu

Great! Yeah, this approach is simpler. I've applied it.
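
The difference between the two approaches can be shown at the SQL level with the standard library's sqlite3 module. This is a sketch under assumptions: the court table, its columns, and both helper functions are hypothetical and do not reflect the CourtListener schema.

```python
import sqlite3

# Hypothetical in-memory table standing in for the courts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE court (pk TEXT PRIMARY KEY, is_appellate INTEGER)")
conn.executemany(
    "INSERT INTO court VALUES (?, ?)",
    [("ca1", 1), ("ca9", 1), ("nysd", 0)],
)


def is_appellate_via_list(court_id: str) -> bool:
    """Fetch every appellate court ID, then test membership in Python."""
    ids = [
        row[0]
        for row in conn.execute("SELECT pk FROM court WHERE is_appellate = 1")
    ]
    return court_id in ids


def is_appellate_via_exists(court_id: str) -> bool:
    """Ask the database a yes/no question about a single court."""
    row = conn.execute(
        "SELECT EXISTS("
        "SELECT 1 FROM court WHERE is_appellate = 1 AND pk = ?)",
        (court_id,),
    ).fetchone()
    return bool(row[0])
```

Both helpers give the same answer, but the second transfers a single value instead of the full list of IDs. In async Django code, QuerySet also offers an awaitable aexists(); which form the merged code uses is not shown in this conversation.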

ERosendo

LGTM. Let's merge after addressing my comment.

albertisfu · 2025-01-07T23:20:42Z

@ERosendo Thanks! I've applied your suggestion to use an exists() query on Court when validating the court type. I've also applied the same approach to similar queries in cl.recap.api_serializers to optimize them.

Let me know what you think.

ERosendo

Thanks @albertisfu. LGTM

sentry-io · 2025-01-08T02:13:09Z

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

‼️ WriteTimeout /api/rest/{version}/recap/
‼️ ReadTimeout /api/rest/{version}/recap/
‼️ Multiple RECAPDocuments returned when processing pdf upload /api/rest/{version}/recap/


albertisfu added 2 commits December 27, 2024 19:37

feat(recap): Replicate RECAP PDF uploads to subdockets

703f122

Fixes: #4826

fix(recap): Avoid PDF upload replication in appellate cases

65da6ed

albertisfu marked this pull request as ready for review December 30, 2024 18:00

albertisfu requested a review from mlissner December 30, 2024 18:00

albertisfu linked an issue Dec 30, 2024 that may be closed by this pull request

Replicate RECAP PDF uploads to subdockets #4826

Closed

mlissner approved these changes Dec 30, 2024

mlissner assigned ERosendo Dec 30, 2024

mlissner requested a review from ERosendo December 30, 2024 18:34

albertisfu mentioned this pull request Dec 30, 2024

Backward replication of RECAP PDF uploads to subdockets #4864

Open

Merge branch 'main' into 4826-replicate-pdf-uploads-to-subdockets

df91292

ERosendo reviewed Jan 6, 2025

ERosendo approved these changes Jan 6, 2025

ERosendo assigned albertisfu and unassigned ERosendo Jan 6, 2025

albertisfu added 3 commits January 7, 2025 16:01

Merge branch 'main' into 4826-replicate-pdf-uploads-to-subdockets

a925ccc

fix(recap): Fixed RecapUploadsTest merge conflicts

f68adfc

fix(recap): Simplified PACER court validation queries

dfee2d5

albertisfu assigned ERosendo and unassigned albertisfu Jan 7, 2025

ERosendo approved these changes Jan 8, 2025

mlissner merged commit 437c5d7 into main Jan 8, 2025
15 checks passed

mlissner deleted the 4826-replicate-pdf-uploads-to-subdockets branch January 8, 2025 00:44
