Add scripts to download and process CourtListener Opinion data #59
wildphoton
commented
Mar 18, 2024
- Download Opinion data from the CourtListener bulk data list and process it into dolma format.
This is really good, thanks for the hard work!
There are a few small things to fix (mostly stuff we standardized on after this PR lol) but I think it's basically ready to go!
I have a question, how big are the generated dolma files? i.e., are they getting sharded or do we basically just end up with a single file for each csv? If the latter, should we re-write the python script to load all the csv files in the dir and then create a single dolma dataset that will actually get sharded (and have something in the metadata that tells which CSV the example came from)?
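A minimal sketch of the second option (one sharded dataset built from every CSV in the directory, with the source CSV recorded in the metadata). It uses only the standard library rather than the repo's `to_dolma` helper. The function names, the `id` and `text` column names, the `source` label, and sharding by example count are all assumptions for illustration, not what the PR does:

```python
import csv
import gzip
import json
from pathlib import Path


def iter_examples(input_dir):
    """Yield one dolma-style record per CSV row, across every CSV in input_dir."""
    for csv_path in sorted(Path(input_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                yield {
                    "id": row["id"],  # assumed column name
                    "text": row["text"],  # assumed column name
                    "source": "courtlistener-opinion",  # assumed label
                    # Record which CSV the example came from.
                    "metadata": {"source_file": csv_path.name},
                }


def write_shards(examples, output_dir, base_name, max_per_shard):
    """Write examples to numbered .jsonl.gz shards and return the shard paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths, out, count = [], None, 0
    for example in examples:
        # Start a new shard on the first example or when the current one is full.
        if out is None or count >= max_per_shard:
            if out is not None:
                out.close()
            path = output_dir / f"{base_name}-{len(paths):05d}.jsonl.gz"
            out = gzip.open(path, "wt", encoding="utf-8")
            paths.append(path)
            count = 0
        out.write(json.dumps(example) + "\n")
        count += 1
    if out is not None:
        out.close()
    return paths
```

Because the generator spans all CSV files, shard boundaries no longer follow file boundaries, and `metadata["source_file"]` is the only record of where a row originated. Opinion text can be long, so a real script may also need to raise the `csv` module's field size limit with `csv.field_size_limit`.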
requirements.txt
Outdated
@@ -7,3 +7,4 @@ smart_open
markdown-it-py
charset_normalizer
logging_json
pandas
Pandas doesn't seem to be used? Can we remove it for now?
I resolved the |
courtlistener/csv_to_dolma.py
Outdated
".csv", ".jsonl.gz"
)
to_dolma(example_generator, args.output_dir, output_file_base_name, args.shard_size)
logging.info(f"Saved {args.input_file} as dolma shared files at {args.output_dir}") |
With the new logger you need to do this:
logger = logs.get_logger("court-listener-opinion")
logger.info(...)
instead of using the root logger (logging.info(...)).
Let's fix this and then I think it's good to merge!
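For context, the difference between a named logger and the root logger can be sketched with the standard library alone. `get_named_logger` below is a hypothetical stand-in, not the project's `logs.get_logger`, whose actual behavior and signature are not shown in this thread:

```python
import io
import logging


def get_named_logger(name, stream=None):
    """Return a configured, named logger (hypothetical helper for illustration)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
        handler.setFormatter(
            logging.Formatter("%(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    # Keep records from also being emitted by root-logger handlers.
    logger.propagate = False
    return logger


# Capture output in memory so the example is self-checking.
buffer = io.StringIO()
logger = get_named_logger("court-listener-opinion", stream=buffer)
logger.info("Saved %s as dolma sharded files at %s", "opinions.csv", "out/")
```

Calling `logging.info(...)` directly would instead go through the root logger, bypassing whatever name, level, and handlers were configured for `court-listener-opinion`.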
Can I do the following?
logger = configure_logging("court-listener-opinion")
...
logger.info(...)