Add scripts to download and process CourtListener Opinion data #59

Merged
merged 4 commits into from
May 11, 2024

wildphoton

  • Download Opinion data from the CourtListener bulk data list and process it into dolma format.


@blester125 blester125 left a comment

This is really good, thanks for the hard work!

There are a few small things to fix (mostly stuff we standardized on after this PR lol) but I think it's basically ready to go!

I have a question: how big are the generated dolma files? I.e., are they getting sharded, or do we basically just end up with a single file for each CSV? If the latter, should we rewrite the Python script to load all the CSV files in the dir and then create a single dolma dataset that will actually get sharded (and have something in the metadata that tells which CSV the example came from)?
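A minimal sketch of the alternative raised above, under stated assumptions: every CSV in a directory feeds one generator, each record carries the name of the CSV it came from, and the combined stream is split into gzipped JSONL shards. The helper names (`iter_examples`, `write_shards`), the `text` column, the `source_csv` metadata key, and sharding by record count are all hypothetical; the project's own `to_dolma` helper is not used here and may shard differently.

```python
import csv
import gzip
import json
from pathlib import Path
from typing import Iterable, Iterator, List


def iter_examples(csv_dir: str) -> Iterator[dict]:
    """Yield one record per CSV row, tagged with the file it came from."""
    for csv_path in sorted(Path(csv_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row_idx, row in enumerate(csv.DictReader(f)):
                yield {
                    "id": f"{csv_path.stem}-{row_idx}",
                    "text": row["text"],  # hypothetical column name
                    "metadata": {"source_csv": csv_path.name},
                }


def write_shards(
    examples: Iterable[dict],
    output_dir: str,
    base_name: str,
    records_per_shard: int,
) -> List[Path]:
    """Write examples to numbered .jsonl.gz shards; return the shard paths."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    handle = None
    count = 0
    for example in examples:
        # Start a new shard on the first record or when the current one is full.
        if handle is None or count >= records_per_shard:
            if handle is not None:
                handle.close()
            path = out_dir / f"{len(written):05d}_{base_name}.jsonl.gz"
            handle = gzip.open(path, "wt", encoding="utf-8")
            written.append(path)
            count = 0
        handle.write(json.dumps(example) + "\n")
        count += 1
    if handle is not None:
        handle.close()
    return written
```

Because the shard boundary depends only on the combined stream, a small CSV no longer produces its own undersized file, and the `source_csv` field keeps each example traceable to its input.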

requirements.txt Outdated
@@ -7,3 +7,4 @@ smart_open
markdown-it-py
charset_normalizer
logging_json
pandas

Pandas doesn't seem to be used? Can we remove it for now?

@StellaAthena

I resolved the requirements.txt conflict and took the opportunity to alphabetize the requirements while I was at it.

@wildphoton wildphoton requested a review from blester125 May 6, 2024 08:40
".csv", ".jsonl.gz"
)
to_dolma(example_generator, args.output_dir, output_file_base_name, args.shard_size)
logging.info(f"Saved {args.input_file} as dolma shared files at {args.output_dir}")

@blester125 blester125 May 6, 2024

With the new logger you need to do this:

logger = logs.get_logger("court-listener-opinion")
logger.info(...)

instead of using the root logger (logging.info(...))

Let's fix this and then I think it's good to merge!
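For context, a rough illustration of the difference using only the standard library. The `get_logger` function below is a hypothetical stand-in, not the project's `logs.get_logger`, whose actual behavior is not shown in this thread. It assumes the helper attaches a handler to a named logger; messages sent through the root logger with `logging.info(...)` would not pass through that handler or its formatter.

```python
import logging


def get_logger(name: str) -> logging.Logger:
    """Hypothetical stand-in for a project helper that returns a named logger."""
    logger = logging.getLogger(name)
    # Attach the handler only once so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    # Keep records from also being emitted by root-logger handlers.
    logger.propagate = False
    return logger


logger = get_logger("court-listener-opinion")
# The file and directory names here are placeholders for illustration.
logger.info("Saved %s as dolma files at %s", "opinions.csv", "data/")
```

Since `logging.getLogger` returns the same object for the same name, any module that asks for `"court-listener-opinion"` shares this configuration.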

Collaborator Author

Can I do the following?

logger = configure_logging("court-listener-opinion")

...

logger.info(...)

@wildphoton wildphoton requested a review from blester125 May 8, 2024 05:38
@wildphoton wildphoton merged commit 22b8e90 into main May 11, 2024
2 checks passed