
Add domain counting script #91

Open · wants to merge 1 commit into main

Conversation

wildphoton (Collaborator)

Script for extracting the sub-URLs and counting their frequency and total text length in the CCCC dataset.

@blester125 (Collaborator) left a comment

There are linting errors that need to be fixed too.

return suburls


def merge_counters(counter_list):

This can be replaced with

import operator as op
from functools import reduce

def merge_counters(counter_list):
  return reduce(op.add, counter_list)
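
A quick usage check of that reducer, as a minimal sketch (the sample Counters are hypothetical, not from the PR):

import operator as op
from collections import Counter
from functools import reduce

def merge_counters(counter_list):
    return reduce(op.add, counter_list)

merged = merge_counters([Counter({"example.com": 2}), Counter({"example.com": 1, "example.org": 5})])
print(merged)  # Counter({'example.org': 5, 'example.com': 3})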

from datasets import Dataset, load_dataset


def process_file(file_path):

I understand not using the dolma processors here since you are creating these counter objects, but can we add a comment explaining that this is why they aren't used?

except json.JSONDecodeError:
    continue

gc.collect() # Explicitly trigger garbage collection to free memory

If this had to get added, I feel like there is something going wrong in your parallelism implementation. Can you add some comments about why this was needed?
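
One pattern that usually makes an explicit gc.collect() unnecessary is consuming each worker's result as soon as it finishes instead of holding everything until the end. A minimal sketch, assuming for illustration that process_file returns a single Counter (count_all and its signature are hypothetical, not the PR's code):

from collections import Counter
from concurrent.futures import ProcessPoolExecutor, as_completed

def count_all(all_files, max_workers=40):
    total = Counter()
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(process_file, path) for path in all_files]
        for future in as_completed(futures):
            # Merge and drop each result as it completes so finished
            # Counters don't pile up alongside the pending ones.
            total += future.result()
    return total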

suburls = [domain, ]
current_path = domain

for part in path_parts[:-1]:

Why are you indexing with [:-1] here? In the path_parts line you have .strip('/') so it seems like this function is ok with the final part of the path being a directory (and therefore should be part of this suburl list) but it gets filtered out?
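
For reference, a small illustration of what the [:-1] slice drops on a sample URL (the variable names mirror the diff above, but the surrounding construction is an assumption, not the PR's exact code):

from urllib.parse import urlparse

url = "https://example.com/a/b/c"
parsed = urlparse(url)
domain = parsed.netloc                          # "example.com"
path_parts = parsed.path.strip('/').split('/')  # ["a", "b", "c"]

suburls = [domain, ]
current_path = domain
for part in path_parts[:-1]:                    # "c" never becomes a sub-URL
    current_path = f"{current_path}/{part}"
    suburls.append(current_path)

print(suburls)  # ['example.com', 'example.com/a', 'example.com/a/b']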

signal.signal(signal.SIGINT, signal_handler)

with tqdm(total=len(all_files), desc="Total files", position=0) as file_progress:
    with ProcessPoolExecutor(max_workers=40) as executor:  # Adjust max_workers based on your system

This max_workers value should be a CLI argument if you want people to adjust it themselves.
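
For example, a hedged sketch of exposing it via argparse (the --max-workers flag name is just a suggestion):

import argparse
from concurrent.futures import ProcessPoolExecutor

parser = argparse.ArgumentParser(description="Count sub-URL frequencies in dolma-cccc.")
parser.add_argument("--max-workers", type=int, default=40,
                    help="Number of worker processes to use.")
args = parser.parse_args()

with ProcessPoolExecutor(max_workers=args.max_workers) as executor:
    ...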

for suburl, count in sorted_suburls[:10]:
    print(f"{suburl}: {count}")

sorted_text_lengths = sorted(final_text_length_counter.items(), key=lambda item: item[1], reverse=True)

Same as above

for suburl, count in sorted_suburls[:10]:
    print(f"{suburl}: {count}")

sorted_text_lengths = sorted(final_text_length_counter.items(), key=lambda item: item[1], reverse=True)

Something like

import operator as op

... = sorted(..., key=op.itemgetter(1), ...)

is generally preferred over a lambda to get an item like this.
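
Applied to the line above, that would look roughly like this (final_text_length_counter is from the diff; the rest is only a sketch):

import operator as op

sorted_text_lengths = sorted(final_text_length_counter.items(), key=op.itemgetter(1), reverse=True)

If these are collections.Counter objects, most_common() would also return the items already sorted by count in descending order.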

parent_data = url_dict[parent_url]
if (data['total_sample_count'] == parent_data['total_sample_count'] and
        data['total_text_length'] == parent_data['total_text_length']):
    # If counts match, mark the higher-level URL for removal

The term "higher-level" is confusing; it makes it seem like you would be removing the parent URL.

# # Run the processing
# root_folder = './data/dolma-cccc/data/CC-MAIN-2024-18' # Update with your root folder path

token = "YOUR_HF_TOKEN"  # Replace with your own HF token if you want to submit the dataset

This should be a CLI argument.
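
A hedged sketch of what that could look like, reading from a --hf-token flag with an environment-variable fallback (both names are suggestions, not existing code):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--hf-token", default=os.environ.get("HF_TOKEN"),
                    help="Hugging Face token used when pushing the dataset.")
args = parser.parse_args()
token = args.hf_token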

return filtered_data

# download the dolma-ccc dataset for analysis
snapshot_download(repo_id="allenai/dolma-cccc", local_dir='data/dolma-cccc', repo_type="dataset")

This should all be wrapped into a main function
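
A rough sketch of that structure, folding in the earlier CLI-argument suggestions (the argument names and the placeholder pipeline call are assumptions, not the PR's actual code):

import argparse

from huggingface_hub import snapshot_download

def main():
    parser = argparse.ArgumentParser(
        description="Extract sub-URLs and count their frequency / total text length in dolma-cccc.")
    parser.add_argument("--root-folder", default="./data/dolma-cccc/data/CC-MAIN-2024-18")
    parser.add_argument("--max-workers", type=int, default=40)
    parser.add_argument("--hf-token", default=None)
    args = parser.parse_args()

    # Download the dolma-cccc dataset for analysis.
    snapshot_download(repo_id="allenai/dolma-cccc", local_dir="data/dolma-cccc", repo_type="dataset")
    # ... run the counting / filtering pipeline with args.root_folder and args.max_workers ...

if __name__ == "__main__":
    main()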
