add files to compute basic stats on pseudo crawl dataset #382

Open · wants to merge 8 commits into base: master

Conversation

@SaulLu (Contributor) commented Feb 1, 2022

As discussed with @thomasw21, this PR adds basic Slurm and Python scripts to compute an intermediary metadata dataset and some statistics for the Pseudo Crawl dataset.

@thomasw21 (Member) left a comment:

LGTM! A few comments but should be good.

parser.add_argument(
    "--save-batch-size", type=int, required=True, help="Batch size when writing."
)
parser.add_argument("--use-datasets-caching", action="store_true")
Member:

We never use it. Let's remove it.

Contributor Author:

In fact, it seems to me that I used use_datasets_caching

Contributor Author:

I'll remove save-batch-size.

dashboard/python_scripts/compute_stats.py (outdated thread, resolved)
"--save-path-stats-json", type=str, help="Where to save the stats json."
)
parser.add_argument(
"--save-path-stats-full-json", type=str, help="Where to save the stats json."
Member:

Suggested change:
-    "--save-path-stats-full-json", type=str, help="Where to save the stats json."
+    "--save-path-stats-full-json", type=str, required=True, help="Where to save the stats json."

Comment on lines +71 to +77

logger.info(f" --- Statistics not already computed for seed id {args.seed_id} ")
if not args.use_datasets_caching:
    datasets.set_caching_enabled(False)
else:
    logger.info(
        f"the datasets results will be cached at {config.HF_DATASETS_CACHE}."
    )
Member:

Suggested change:
-    logger.info(f" --- Statistics not already computed for seed id {args.seed_id} ")
-    if not args.use_datasets_caching:
-        datasets.set_caching_enabled(False)
-    else:
-        logger.info(
-            f"the datasets results will be cached at {config.HF_DATASETS_CACHE}."
-        )
+    logger.info(f" --- Statistics not already computed for seed id {args.seed_id} ")

Contributor Author:

In fact, it seems to me that I used use_datasets_caching

dashboard/python_scripts/compute_stats.py (outdated thread, resolved)
dashboard/python_scripts/compute_stats.py (outdated thread, resolved)
]
ds_html = ds_html.map(
    get_length_text,
    batched=False,
Member:

Suggested change:
-    batched=False,
+    batched=True,

You just have to code a batched version of get_length_text.
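For reference, a minimal sketch of what a batched get_length_text could look like, assuming the original function measures the length of a text column; the column names ("text", "length_text") are assumptions, not taken from this PR:

# Hypothetical batched version: with batched=True, Dataset.map passes a
# dict of lists, so we return one length per example in the batch.
def get_length_text(batch):
    batch["length_text"] = [len(text) if text is not None else 0 for text in batch["text"]]
    return batch

ds_html = ds_html.map(
    get_length_text,
    batched=True,
)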

with open(save_path_tmp, "w", encoding="utf-8") as f:
    json.dump(data_stats, f, ensure_ascii=False, indent=4)
logger.info(f"Moving the saved dataset to {str(save_path.absolute())}")
subprocess.run(["mv", save_path_tmp, str(save_path.absolute())])
Member:

Yeah, I'd recommend os.rename. subprocess doesn't tell you if it fails.
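
For reference, a sketch of that suggestion applied to the snippet above, reusing its data_stats / save_path_tmp / save_path names; unlike an unchecked subprocess call, os.rename raises an OSError when the move fails:

import json
import os

with open(save_path_tmp, "w", encoding="utf-8") as f:
    json.dump(data_stats, f, ensure_ascii=False, indent=4)
logger.info(f"Moving the saved stats to {str(save_path.absolute())}")
# os.rename raises OSError on failure (missing directory, permissions, ...).
os.rename(save_path_tmp, str(save_path.absolute()))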

Comment on lines +151 to +161
save_path = Path(args.save_path_stats_full_json)
tmp_file_name = f"tmp-{str(save_path.name)}"
save_path_tmp = os.path.join(save_path.parent, tmp_file_name)
logger.info(f"Saving the dataset at {save_path_tmp}")
ds_html.to_json(
    save_path_tmp,
    batch_size=args.save_batch_size,
    num_proc=args.num_proc,
    compression="gzip",
)
logger.info(f"Moving the saved dataset to {str(save_path.absolute())}")
Member:

Suggested change:
-save_path = Path(args.save_path_stats_full_json)
-tmp_file_name = f"tmp-{str(save_path.name)}"
-save_path_tmp = os.path.join(save_path.parent, tmp_file_name)
-logger.info(f"Saving the dataset at {save_path_tmp}")
-ds_html.to_json(
-    save_path_tmp,
-    batch_size=args.save_batch_size,
-    num_proc=args.num_proc,
-    compression="gzip",
-)
-logger.info(f"Moving the saved dataset to {str(save_path.absolute())}")
+save_path_full = Path(args.save_path_stats_full_json)
+save_path_full_tmp = save_path_full.rename(f"{save_path_full.name}.tmp")
+logger.info(f"Saving the dataset at {save_path_full_tmp}")
+ds_html.to_json(
+    save_path_full_tmp,
+    batch_size=args.save_batch_size,
+    num_proc=args.num_proc,
+    compression="gzip",
+)
+logger.info(f"Moving the saved dataset to {str(save_path_full.absolute())}")

Two things:

  • Move with os.rename, or shutil.move.
  • You're not scared of overwriting the previous full_json?
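
A possible sketch that addresses both points, reusing the variable names from the snippet above; the overwrite guard and the choice of shutil.move are assumptions about the intended behaviour, not code from this PR:

import shutil
from pathlib import Path

save_path_full = Path(args.save_path_stats_full_json)
# Guard against silently overwriting a previously computed full json.
if save_path_full.exists():
    raise FileExistsError(f"{save_path_full} already exists, refusing to overwrite it.")

# Write to a temporary file in the same directory, then move it into place.
save_path_full_tmp = save_path_full.with_name(f"tmp-{save_path_full.name}")
logger.info(f"Saving the dataset at {save_path_full_tmp}")
ds_html.to_json(
    str(save_path_full_tmp),
    batch_size=args.save_batch_size,
    num_proc=args.num_proc,
    compression="gzip",
)
logger.info(f"Moving the saved dataset to {str(save_path_full.absolute())}")
# shutil.move raises an exception on failure, unlike an unchecked subprocess call.
shutil.move(str(save_path_full_tmp), str(save_path_full.absolute()))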

#SBATCH --partition=cpu_p1
#SBATCH --time 10:00:00 # maximum execution time (HH:MM:SS)
#SBATCH --output=/gpfswork/rech/six/uty16tp/code/big_science/logs/compute_stats_v5/%x-%j.out # output file name
#SBATCH --array=1-604
Member:

You need to add the new seeds. And I need to update the csv with the new seeds ....

@HugoLaurencon (Collaborator):

@SaulLu can you edit and merge? Thanks!

@SaulLu (Contributor Author) commented Feb 8, 2022

@HugoLaurencon, do you need it urgently?

@HugoLaurencon (Collaborator):

@SaulLu No, no worries! It was just to tag you in case you didn't see the comments from Thomas
