CCCC - license filter #92

Open · wants to merge 1 commit into main
Conversation

@baberabb (Contributor) commented on Sep 20, 2024

Added a script to filter the CCCC dataset using a txt file (attached) of the manually verified license URLs.

@baberabb changed the title from "add cccc filter scipt" to "CCCC - license filter" on Sep 20, 2024
# keeps the same folder structure in the destination directory


def setup_logging(log_level: str) -> None:
Collaborator commented:

We already have a standardized way to configure logging at licensed_pile.logs; please use that.
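A minimal sketch of that swap, assuming licensed_pile.logs exposes a configure_logging helper (only the module path comes from the comment; the function name and signature here are assumptions):

```python
# Sketch of using the project's standard logging setup in place of a
# hand-rolled setup_logging. Only the module path licensed_pile.logs is
# given in the review; configure_logging and its signature are assumptions.
from licensed_pile import logs

logger = logs.configure_logging("cccc-filter")
logger.info("Filtering CCCC shards against the verified license URL list.")
```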

return suburls


def filter_condition(item: dict, URLS: set) -> bool:
Collaborator commented:

All-uppercase names should be reserved for global variables. I like that this is passed into the function, but it should be called urls.


def filter_condition(item: dict, URLS: set) -> bool:
"""Filters the data to only include the URLs in the urls.txt file"""
return extract_suburls(item["metadata"]["warc_url"])[0] in URLS
Collaborator commented:

You're only ever going to be using the domain based on this code (suburls is returned by extract_suburls, and it always starts with the domain).

We should convert this to explicitly use a domain extraction function.
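A minimal sketch of that refactor, using the standard-library urlparse and folding in the lowercase urls rename from the comment above (the helper name extract_domain is hypothetical):

```python
# Hypothetical explicit domain-extraction helper replacing the
# extract_suburls()[0] indirection; urlparse is standard library.
from urllib.parse import urlparse


def extract_domain(url: str) -> str:
    # netloc is the host[:port] part of the URL; lowercasing keeps the
    # membership test case-insensitive.
    return urlparse(url).netloc.lower()


def filter_condition(item: dict, urls: set) -> bool:
    """Filters the data to only include domains from the urls.txt file."""
    return extract_domain(item["metadata"]["warc_url"]) in urls
```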

return [line.strip() for line in file]


def process_file(
Collaborator commented:

These files seem like they should be processable with dolma (they seem to just be gzipped jsonl files).

We should move this to a dolma processor instead of hand rolling this processing and the parallelism. You should be able to subclass licensed_pile.write.ShardParallelProcessor and override the process_example method to call the filter_condition function (return the example if it passes and None if it fails).

This should simplify things a lot.
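A minimal sketch of that refactor, assuming process_example is a classmethod hook that receives one parsed JSON example and drops the record when None is returned (the class path, hook name, and return-None semantics come from the comment; the exact signature and the kwargs plumbing for the URL set are assumptions):

```python
# Sketch of the suggested dolma-based refactor. filter_condition is the
# predicate defined earlier in this PR's script; how the URL set is
# threaded through kwargs is an assumption for illustration.
from licensed_pile.write import ShardParallelProcessor


class CCCCLicenseFilter(ShardParallelProcessor):
    @classmethod
    def process_example(cls, example, **kwargs):
        urls = kwargs["urls"]  # set loaded from urls.txt
        # Return the example if it passes the filter, None to drop it.
        return example if filter_condition(example, urls) else None
```

The gzipped-jsonl shard handling and the parallelism would then come from the base class rather than from the hand-rolled process_file.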

@@ -0,0 +1,538 @@
column_1
Collaborator commented:

Is this an error from this list being part of a CSV at some point?
