Adding Language specific validation sets to deepspeed #1

hadyelsahar · 2021-09-08T10:16:43Z

The idea of this issue is to modify the megatron-deepspeed repository code that we use for training all models, so that validation loss is tracked on several validation sets separately. This would allow us to track training progress independently for each language.

Currently, the validation loss is calculated on a single validation set that includes the same language combination as the training data. (see here 13B param model training on tensorboard)

Useful pointers

How datasets are loaded in model pre-training here
Dataset loader for GPT here
Validation step execution here

Progress

Forked deepspeed where all development happens (ask @hadyelsahar for invitation) here
Pull request: Adding language specific validation sets for Multilingual model training Megatron-DeepSpeed#97

sbmaruf · 2021-09-08T12:25:34Z

I can review/implement this part.

lintangsutawika · 2021-09-08T13:43:44Z

My current understanding is that in training.py, the train, validation, and test datasets are loaded by the function build_train_valid_test_data_iterators.

https://github.com/hadyelsahar/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/megatron/training.py#L123-L136

Evaluation is then done here, both for valid_data_iterator and test_data_iterator.

https://github.com/hadyelsahar/Megatron-DeepSpeed/blob/9e14c02a1dd22e4d36e2ee9a33e44d33774b8de7/megatron/training.py#L152-L166

We could set

and call evaluate_and_print_results iteratively for each language.

for each_language_data_loader in valid_data_iterator:
    evaluate_and_print_results(
        prefix, forward_step_func, 
        each_language_data_loader, 
        model, 
        eval_metric
    )

Some modification to evaluate_and_print_results will be required so that we save each validation metric for each language.
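A minimal sketch of the per-language bookkeeping described above, not the actual Megatron-DeepSpeed code. It assumes the validation iterators are held in a dict keyed by language; `evaluate_per_language`, `mean_loss`, and the `validation/<language>` metric names are hypothetical and only illustrate keeping one metric per language.

```python
from typing import Callable, Dict, Iterable, List


def evaluate_per_language(
    valid_iterators: Dict[str, Iterable[List[float]]],
    evaluate_fn: Callable[[Iterable[List[float]]], float],
) -> Dict[str, float]:
    """Run evaluate_fn once per language and key each result by language."""
    results = {}
    for language, data_iterator in valid_iterators.items():
        # One entry per language, so each curve can be logged separately.
        results[f"validation/{language}"] = evaluate_fn(data_iterator)
    return results


def mean_loss(batches: Iterable[List[float]]) -> float:
    # Hypothetical stand-in for the real evaluation step:
    # averages precomputed per-example losses over all batches.
    losses = [loss for batch in batches for loss in batch]
    return sum(losses) / len(losses)


# Toy data: each "iterator" is a list of batches of per-example losses.
valid_iterators = {
    "en": [[2.0, 4.0], [3.0]],
    "fr": [[5.0], [7.0]],
}
metrics = evaluate_per_language(valid_iterators, mean_loss)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In the real training loop, the stand-in `mean_loss` would be replaced by a call into the existing evaluation code, and each entry of `metrics` would be written to the logger under its own tag.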

hadyelsahar · 2021-09-13T14:18:10Z

Currently the code base yields a single validation set and a single test set. There is no support for arguments that specify multiple validation datasets.

My ad hoc solution is to add an extra argument:

  --extra-valid-data-path [EXTRA_VALID_DATA_PATH ...]
Path to extra validation dataset(s) to be monitored during training. Accepted formats:
1) a single data path,
2) multiple datasets in the form: data1-weight data1-path data2-weight data2-path, yielding a single validation set,
3) multiple validation sets, given as several instances of (2) separated by commas, in the form: data1-weight data1-path data2-weight data2-path, data3-weight data3-path data4-weight data4-path ...

The idea here is to allow mixing different validation sets on the fly

python pretrain_gpt2.py … --extra-valid-data-path 0.5 en_data, 0.5 fr_data, 0.33 rare1_data 0.33 rare2_data 0.33 rare3_data
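One way the comma-separated format above could be parsed is sketched below. This is an illustration under stated assumptions, not the parser from the PR: `parse_validation_groups` is a hypothetical helper, and treating a bare path as weight 1.0 is an assumption made here for form (1).

```python
import argparse
from typing import List, Tuple


def parse_validation_groups(tokens: List[str]) -> List[List[Tuple[float, str]]]:
    """Split CLI tokens into comma-separated groups of (weight, path) pairs."""
    groups = []
    # The shell leaves commas attached to tokens ("en_data,"), so rejoin first.
    for chunk in " ".join(tokens).split(","):
        fields = chunk.split()
        if not fields:
            continue  # tolerate a trailing comma
        if len(fields) == 1:
            # Form (1): a bare path; weight 1.0 is an assumption of this sketch.
            groups.append([(1.0, fields[0])])
            continue
        if len(fields) % 2 != 0:
            raise ValueError(f"expected weight/path pairs, got: {chunk.strip()!r}")
        # Forms (2) and (3): alternating weight and path.
        groups.append(
            [(float(w), p) for w, p in zip(fields[0::2], fields[1::2])]
        )
    return groups


parser = argparse.ArgumentParser()
parser.add_argument("--extra-valid-data-path", nargs="*", default=[])
args = parser.parse_args(
    "--extra-valid-data-path 0.5 en_data, 0.5 fr_data, "
    "0.33 rare1_data 0.33 rare2_data 0.33 rare3_data".split()
)
groups = parse_validation_groups(args.extra_valid_data_path)
for group in groups:
    print(group)
```

With the example command, this yields three validation sets: English alone, French alone, and a mix of the three rare-language datasets.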

any thoughts about a better design?

hadyelsahar · 2021-09-14T01:00:17Z

work in progress PR sent here: bigscience-workshop/Megatron-DeepSpeed#97

hadyelsahar changed the title ~~Adding Language specific validation set to deepspeed~~ Adding Language specific validation sets to deepspeed Sep 8, 2021

hadyelsahar self-assigned this Sep 8, 2021

hadyelsahar mentioned this issue Sep 14, 2021

Adding language specific validation sets for Multilingual model training bigscience-workshop/Megatron-DeepSpeed#97

Merged

haileyschoelkopf referenced this issue in haileyschoelkopf/multilingual-modeling May 9, 2022

add xlsum script (version #1)

3e8bd62
