Add scripts to download and process CourtListener Opinion data #59

wildphoton · 2024-03-18T09:26:53Z

Download Opinion data from the CourtListener bulk data list and process it into dolma format.

blester125

This is really good, thanks for the hard work!

There are a few small things to fix (mostly stuff we standardized on after this PR lol) but I think it's basically ready to go!

I have a question: how big are the generated dolma files? I.e., are they getting sharded, or do we basically just end up with a single file for each CSV? If the latter, should we rewrite the Python script to load all the CSV files in the dir and then create a single dolma dataset that will actually get sharded (and have something in the metadata that tells which CSV the example came from)?
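A minimal sketch of the idea in the question above, using only the standard library. It is not the PR's `csv_to_dolma.py` or the project's `to_dolma` helper: the function names, the `text` column name, the `source_csv` metadata key, and sharding by record count are all assumptions made for illustration.

```python
import csv
import gzip
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def iter_examples(csv_dir: str, text_column: str = "text") -> Iterator[Dict]:
    """Yield one record per CSV row, tagged with the file it came from."""
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row_idx, row in enumerate(csv.DictReader(f)):
                yield {
                    "id": f"{csv_path.stem}-{row_idx}",
                    "text": row[text_column],  # assumed column name
                    "metadata": {"source_csv": csv_path.name},  # assumed key
                }


def write_shards(
    examples: Iterable[Dict], output_dir: str, base_name: str, shard_size: int
) -> List[Path]:
    """Write gzipped JSONL, starting a new file every `shard_size` records."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    handle = None
    for i, example in enumerate(examples):
        if i % shard_size == 0:
            if handle is not None:
                handle.close()
            path = out / f"{base_name}-{i // shard_size:05d}.jsonl.gz"
            handle = gzip.open(path, "wt", encoding="utf-8")
            paths.append(path)
        handle.write(json.dumps(example) + "\n")
    if handle is not None:
        handle.close()
    return paths
```

Because the generator spans every CSV in the directory, shard boundaries no longer follow file boundaries, and the per-record metadata is what preserves provenance.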

courtlistener/csv_to_dolma.py

blester125 · 2024-04-10T16:54:43Z

requirements.txt

@@ -7,3 +7,4 @@ smart_open
 markdown-it-py
 charset_normalizer
 logging_json
+pandas


Pandas doesn't seem to be used? Can we remove it for now?

courtlistener/get_data.sh

courtlistener/process_csv_file.sh

StellaAthena · 2024-04-15T14:55:20Z

I resolved the requirements.txt conflict and took an opportunity to alphabetize the requirements while I was at it.

blester125 · 2024-05-06T14:18:57Z

courtlistener/csv_to_dolma.py

+        ".csv", ".jsonl.gz"
+    )
+    to_dolma(example_generator, args.output_dir, output_file_base_name, args.shard_size)
+    logging.info(f"Saved {args.input_file} as dolma shared files at {args.output_dir}")


With the new logger you need to do this:

logger = logs.get_logger("court-listener-opinion")
logger.info(...)

instead of using the root logger (logging.info(...))

Let's fix this and then I think it's good to merge!

Can I do the following?

logger = configure_logging("court-listener-opinion")
...
logger.info(...)
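For readers without the repository at hand, the difference between a named logger and the root logger can be sketched with the standard `logging` module alone. This does not reproduce the project's `logs.get_logger` or `configure_logging`, whose implementations are not shown in this thread; `get_named_logger`, the format string, and the example file names are hypothetical.

```python
import logging


def get_named_logger(name: str) -> logging.Logger:
    """Hypothetical helper: return a named logger with a single stream handler."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger's handlers
    return logger


logger = get_named_logger("court-listener-opinion")
# Placeholder arguments; the real script would pass args.input_file and args.output_dir.
logger.info("Saved %s as dolma sharded files at %s", "opinions.csv", "data/")
```

Calling `logging.info(...)` directly would instead go through the root logger, so the record would carry the name `root` and bypass any handler or formatter attached to the named logger.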

Add scripts to download and process CourtListener Opinion data

281e2a0

blester125 requested changes Apr 10, 2024

Merge branch 'main' into legal/court_listener

413b328

Adress comments, add README

7116e80

wildphoton requested a review from blester125 May 6, 2024 08:40

blester125 reviewed May 6, 2024

update logger usage

714ddf1

wildphoton requested a review from blester125 May 8, 2024 05:38

blester125 approved these changes May 8, 2024

wildphoton merged commit 22b8e90 into main May 11, 2024
2 checks passed

