CCCC - license filter #92

Open · wants to merge 1 commit into main
Conversation

@baberabb (Contributor) commented on Sep 20, 2024

Added a script to filter the CCCC dataset using a txt file (attached) of the manually verified license URLs.

@baberabb changed the title from "add cccc filter scipt" to "CCCC - license filter" on Sep 20, 2024
# keeps the same folder structure in the destination directory


def setup_logging(log_level: str) -> None:
Collaborator commented:

We already have a standardized way to configure logging at licensed_pile.logs; please use that.
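A minimal sketch of that swap, assuming licensed_pile.logs exposes a configure_logging helper (only the module path comes from the comment; the function name and signature here are assumptions):

```python
# Sketch of using the project's standard logging setup in place of a
# hand-rolled setup_logging. Only the module path licensed_pile.logs is
# given in the review; configure_logging and its signature are assumptions.
from licensed_pile import logs

logger = logs.configure_logging("cccc-filter")
logger.info("Filtering CCCC shards against the verified license URL list.")
```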

return suburls


def filter_condition(item: dict, URLS: set) -> bool:
Collaborator commented:

All-uppercase names should be reserved for global variables. I like that this is passed into the function, but it should be called urls.


def filter_condition(item: dict, URLS: set) -> bool:
"""Filters the data to only include the URLs in the urls.txt file"""
return extract_suburls(item["metadata"]["warc_url"])[0] in URLS
Collaborator commented:

You're only ever going to be using the domain based on this code (suburls is returned by extract_suburls, and it always starts with the domain).

We should convert this to explicitly use a domain extraction function.
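A minimal sketch of that refactor, using the standard-library urlparse and folding in the lowercase urls rename from the comment above (the helper name extract_domain is hypothetical):

```python
# Hypothetical explicit domain-extraction helper replacing the
# extract_suburls()[0] indirection; urlparse is standard library.
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    # netloc is the host[:port] part of the URL; lowercasing keeps the
    # membership test case-insensitive.
    return urlparse(url).netloc.lower()


def filter_condition(item: dict, urls: set) -> bool:
    """Filters the data to only include domains from the urls.txt file."""
    return extract_domain(item["metadata"]["warc_url"]) in urls
```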

return [line.strip() for line in file]


def process_file(
Collaborator commented:

These files seem like they should be processable with dolma (they seem to just be gzipped jsonl files).

We should move this to a dolma processor instead of hand rolling this processing and the parallelism. You should be able to subclass licensed_pile.write.ShardParallelProcessor and override the process_example method to call the filter_condition function (return the example if it passes and None if it fails).

This should simplify things a lot.
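A minimal sketch of that refactor, assuming process_example is a classmethod hook that receives one parsed JSON example and drops the record when None is returned (the class path, hook name, and return-None semantics come from the comment; the exact signature and the kwargs plumbing for the URL set are assumptions):

```python
# Sketch of the suggested dolma-based refactor. filter_condition is the
# predicate defined earlier in this PR's script; how the URL set is
# threaded through kwargs is an assumption for illustration.
from licensed_pile.write import ShardParallelProcessor


class CCCCLicenseFilter(ShardParallelProcessor):
    @classmethod
    def process_example(cls, example, **kwargs):
        urls = kwargs["urls"]  # set loaded from urls.txt
        # Return the example if it passes the filter, None to drop it.
        return example if filter_condition(example, urls) else None
```

The gzipped-jsonl shard handling and the parallelism would then come from the base class rather than from the hand-rolled process_file.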

@@ -0,0 +1,538 @@
column_1
Collaborator commented:

Is this an error from this list being part of a CSV at some point?
