CCCC - license filter #92
base: main
Conversation
# keeps the same folder structure in the destination directory

def setup_logging(log_level: str) -> None:
We already have a standardized way to configure logging in licensed_pile.logs; please use that.
return suburls

def filter_condition(item: dict, URLS: set) -> bool:
All-uppercase names should be reserved for global variables. I like that this is passed into the function, but it should be called urls.
def filter_condition(item: dict, URLS: set) -> bool:
    """Filters the data to only include the URLs in the urls.txt file"""
    return extract_suburls(item["metadata"]["warc_url"])[0] in URLS
You're only ever going to use the domain based on this code (suburls is returned by extract_suburls, and that list always starts with the domain). We should convert this to explicitly use a domain extraction function.
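A minimal sketch of what that could look like, using the standard library's urllib.parse; the helper name extract_domain is hypothetical (not from the PR), and the metadata path mirrors the snippet above:

```python
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    # Hypothetical replacement for extract_suburls(...)[0]: pull just the
    # network location (domain) out of the record's WARC URL.
    return urlparse(url).netloc


def filter_condition(item: dict, urls: set) -> bool:
    """Keep only items whose domain appears in the urls.txt allow-list."""
    return extract_domain(item["metadata"]["warc_url"]) in urls
```

This makes the intent (domain matching) explicit instead of relying on the position of the first element in the suburls list.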
return [line.strip() for line in file]

def process_file(
These files seem like they should be processable with dolma (they appear to just be gzipped jsonl files). We should move this to a dolma processor instead of hand-rolling the processing and the parallelism. You should be able to subclass licensed_pile.write.ShardParallelProcessor, override the process_example method to call the filter_condition function (return the example if it passes and None if it fails). This should simplify things a lot.
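A rough sketch of the suggested subclass. So the sketch runs standalone, a stub stands in for licensed_pile.write.ShardParallelProcessor, and the process_example signature is an assumption based on this comment, not the library's documented API:

```python
from typing import Optional


class ShardParallelProcessor:
    # Stand-in so this sketch is self-contained; real code would instead
    # inherit from licensed_pile.write.ShardParallelProcessor.
    pass


def filter_condition(item: dict, urls: set) -> bool:
    # Simplified version of the PR's filter, shown here for illustration.
    return item["metadata"]["warc_url"] in urls


class CCCCLicenseFilter(ShardParallelProcessor):
    def __init__(self, urls: set):
        self.urls = urls

    def process_example(self, example: dict, **kwargs) -> Optional[dict]:
        # Return the example when it passes the filter, None to drop it,
        # as the review comment describes.
        return example if filter_condition(example, self.urls) else None
```

The framework would then handle sharding and parallelism, removing the hand-rolled multiprocessing from the script.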
@@ -0,0 +1,538 @@
column_1 |
Is this header an error from this list having been part of a csv at some point?
Added a script to filter the CCCC dataset using a txt file (attached) of the manually verified license URLs.