
Add domain counting script #91

Open · wants to merge 1 commit into main

Conversation

wildphoton (Collaborator)

Script for extracting the sub-URLs and counting their frequency and total text length in the CCCC dataset.

@blester125 (Collaborator) left a comment

There are linting errors that need to be fixed too.

return suburls


def merge_counters(counter_list):

This can be replaced with

import operator as op
from functools import reduce

def merge_counters(counter_list):
  return reduce(op.add, counter_list)
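
A quick usage check of that reducer, as a minimal sketch (the sample Counters are hypothetical, not from the PR):

import operator as op
from collections import Counter
from functools import reduce

def merge_counters(counter_list):
    return reduce(op.add, counter_list)

merged = merge_counters([Counter({"example.com": 2}), Counter({"example.com": 1, "example.org": 5})])
print(merged)  # Counter({'example.org': 5, 'example.com': 3})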

from datasets import Dataset, load_dataset


def process_file(file_path):

I understand not using the dolma processors here since you are creating these counter objects, but can we add a comment explaining that this is why they aren't used?

except json.JSONDecodeError:
    continue

gc.collect() # Explicitly trigger garbage collection to free memory

If this had to get added, I feel like there is something going wrong in your parallelism implementation. Can you add some comments about why this was needed?
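
One pattern that usually makes an explicit gc.collect() unnecessary is consuming each worker's result as soon as it finishes instead of holding everything until the end. A minimal sketch, assuming for illustration that process_file returns a single Counter (count_all and its signature are hypothetical, not the PR's code):

from collections import Counter
from concurrent.futures import ProcessPoolExecutor, as_completed

def count_all(all_files, max_workers=40):
    total = Counter()
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_file, path) for path in all_files]
        for future in as_completed(futures):
            # Merge and drop each result as it completes so finished
            # Counters don't pile up alongside the pending ones.
            total += future.result()
    return total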

suburls = [domain, ]
current_path = domain

for part in path_parts[:-1]:

Why are you indexing with [:-1] here? In the path_parts line you have .strip('/') so it seems like this function is ok with the final part of the path being a directory (and therefore should be part of this suburl list) but it gets filtered out?
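
For reference, a small illustration of what the [:-1] slice drops on a sample URL (the variable names mirror the diff above, but the surrounding construction is an assumption, not the PR's exact code):

from urllib.parse import urlparse

url = "https://example.com/a/b/c"
parsed = urlparse(url)
domain = parsed.netloc                          # "example.com"
path_parts = parsed.path.strip('/').split('/')  # ["a", "b", "c"]

suburls = [domain, ]
current_path = domain
for part in path_parts[:-1]:                    # "c" never becomes a sub-URL
    current_path = f"{current_path}/{part}"
    suburls.append(current_path)

print(suburls)  # ['example.com', 'example.com/a', 'example.com/a/b']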

signal.signal(signal.SIGINT, signal_handler)

with tqdm(total=len(all_files), desc="Total files", position=0) as file_progress:
    with ProcessPoolExecutor(max_workers=40) as executor:  # Adjust max_workers based on your system

This max_workers value should be a CLI argument if you want people to adjust it themselves.
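
For example, a hedged sketch of exposing it via argparse (the --max-workers flag name is just a suggestion):

import argparse
from concurrent.futures import ProcessPoolExecutor

parser = argparse.ArgumentParser(description="Count sub-URL frequencies in dolma-cccc.")
parser.add_argument("--max-workers", type=int, default=40,
                    help="Number of worker processes to use.")
args = parser.parse_args()

with ProcessPoolExecutor(max_workers=args.max_workers) as executor:
    ...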

for suburl, count in sorted_suburls[:10]:
    print(f"{suburl}: {count}")

sorted_text_lengths = sorted(final_text_length_counter.items(), key=lambda item: item[1], reverse=True)

Same as above

for suburl, count in sorted_suburls[:10]:
    print(f"{suburl}: {count}")

sorted_text_lengths = sorted(final_text_length_counter.items(), key=lambda item: item[1], reverse=True)

Something like

import operator as op

... = sorted(..., key=op.itemgetter(1), ...)

is generally preferred over a lambda to get an item like this.
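
Applied to the line above, that would look roughly like this (final_text_length_counter is from the diff; the rest is only a sketch):

import operator as op

sorted_text_lengths = sorted(final_text_length_counter.items(), key=op.itemgetter(1), reverse=True)

If these are collections.Counter objects, most_common() would also return the items already sorted by count in descending order.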

parent_data = url_dict[parent_url]
if (data['total_sample_count'] == parent_data['total_sample_count'] and
        data['total_text_length'] == parent_data['total_text_length']):
    # If counts match, mark the higher-level URL for removal

The term "higher-level" is confusing; it makes it seem like you would be removing the parent URL.

# # Run the processing
# root_folder = './data/dolma-cccc/data/CC-MAIN-2024-18' # Update with your root folder path

token = "YOUR_HF_TOKEN"  # Replace with your own HF token if you want to submit the dataset

This should be a CLI argument.
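
A hedged sketch of what that could look like, reading from a --hf-token flag with an environment-variable fallback (both names are suggestions, not existing code):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--hf-token", default=os.environ.get("HF_TOKEN"),
                    help="Hugging Face token used when pushing the dataset.")
args = parser.parse_args()
token = args.hf_token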

return filtered_data

# download the dolma-ccc dataset for analysis
snapshot_download(repo_id="allenai/dolma-cccc", local_dir='data/dolma-cccc', repo_type="dataset")

This should all be wrapped into a main function
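
A rough sketch of that structure, folding in the earlier CLI-argument suggestions (the argument names and the placeholder pipeline call are assumptions, not the PR's actual code):

import argparse

from huggingface_hub import snapshot_download

def main():
    parser = argparse.ArgumentParser(
        description="Extract sub-URLs and count their frequency / total text length in dolma-cccc.")
    parser.add_argument("--root-folder", default="./data/dolma-cccc/data/CC-MAIN-2024-18")
    parser.add_argument("--max-workers", type=int, default=40)
    parser.add_argument("--hf-token", default=None)
    args = parser.parse_args()

    # Download the dolma-cccc dataset for analysis.
    snapshot_download(repo_id="allenai/dolma-cccc", local_dir="data/dolma-cccc", repo_type="dataset")
    # ... run the counting / filtering pipeline with args.root_folder and args.max_workers ...

if __name__ == "__main__":
    main()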
