"Training Data Packing" got error - RuntimeError: Maximum number of iterations reached. #40

Open
jingkang99 opened this issue May 21, 2024 · 3 comments

Comments

@jingkang99

I followed the instructions at
https://github.com/HabanaAI/Model-References/tree/master/MLPERF3.1/Training/benchmarks

and ran the following command:

python3 pack_pretraining_data_pytorch.py --input_dir=$PYTORCH_BERT_DATA/hdf5/training-4320/hdf5_4320_shards_uncompressed --output_dir=$PYTORCH_BERT_DATA/packed --max_predictions_per_seq=76
scipy version: 1.13.0


...
Dataset has 156725280 samples
Determining packing recipe
Begin packing pass
Unpacked mean sequence length: 254.43
Found 22102 unique packing strategies.

Iteration: 0: sequences still to pack: 156725280
Traceback (most recent call last):
  File "/sox/habana-intel/Model-References/MLPERF3.1/Training/benchmarks/bert/implementations/PyTorch/pack_pretraining_data_pytorch.py", line 467, in <module>
    main()
  File "/sox/habana-intel/Model-References/MLPERF3.1/Training/benchmarks/bert/implementations/PyTorch/pack_pretraining_data_pytorch.py", line 420, in main
    strategy_set, mixture, padding, slicing = get_packing_recipe(args.output_dir, sequence_lengths, args.max_sequence_length, args.max_sequences_per_pack)
  File "/sox/habana-intel/Model-References/MLPERF3.1/Training/benchmarks/bert/implementations/PyTorch/pack_pretraining_data_pytorch.py", line 111, in get_packing_recipe
    partial_mixture, rnorm = optimize.nnls(np.expand_dims(w0, -1) * A, w0 * b)
  File "/opt/python-llm/lib/python3.10/site-packages/scipy/optimize/_nnls.py", line 93, in nnls
    raise RuntimeError("Maximum number of iterations reached.")
RuntimeError: Maximum number of iterations reached.
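
For anyone hitting the same traceback: the failing call is `scipy.optimize.nnls`, which accepts a `maxiter` argument (by default three times the number of columns of `A`). Below is a minimal sketch of a retry wrapper that raises that cap step by step. The name `nnls_with_retry` and the `factors` values are hypothetical and not part of the Habana script, and it is an untested assumption that a larger cap resolves this particular failure.

```python
import numpy as np
from scipy import optimize


def nnls_with_retry(A, b, factors=(3, 10, 30)):
    """Call scipy.optimize.nnls, raising the iteration cap after each failure.

    `factors` are hypothetical multipliers of the column count of A; the
    first value, 3, matches SciPy's documented default of 3 * n columns.
    """
    n_cols = A.shape[1]
    last_error = None
    for factor in factors:
        try:
            # Returns the tuple (solution vector, residual norm).
            return optimize.nnls(A, b, maxiter=factor * n_cols)
        except RuntimeError as err:
            # nnls signals an exhausted iteration budget with RuntimeError.
            last_error = err
    # Every attempt failed: re-raise the most recent error unchanged.
    raise last_error


# Tiny well-conditioned example: A is the identity, so x equals b.
x, rnorm = nnls_with_retry(np.eye(3), np.array([1.0, 2.0, 3.0]))
```

In `get_packing_recipe`, such a wrapper would stand in for the direct `optimize.nnls(np.expand_dims(w0, -1) * A, w0 * b)` call shown in the traceback.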



@jingkang99
Author

I tried the same steps with the latest PyTorch Docker image, and the same error occurs.

Any advice?

@Jing1Ling

Hi @jingkang99, I switched scipy to version 1.11.4, and it works :).
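
A small sketch of how one might check the installed scipy version against the two versions mentioned in this thread before starting the long packing run. The function names and status strings are hypothetical; only the two version numbers come from the reports above, and any other version is treated as untested here.

```python
from importlib import metadata

REPORTED_WORKING = "1.11.4"  # version reported to work in this thread
REPORTED_FAILING = "1.13.0"  # version from the original error report


def classify_scipy_version(installed):
    """Map a version string to what this thread says about it."""
    if installed == REPORTED_WORKING:
        return "reported-working"
    if installed == REPORTED_FAILING:
        return "reported-failing"
    return "untested"


def installed_scipy_status():
    """Look up the installed scipy version and classify it."""
    try:
        return classify_scipy_version(metadata.version("scipy"))
    except metadata.PackageNotFoundError:
        return "not-installed"


print(installed_scipy_status())
```

If the status is `reported-failing`, pinning the version with `pip install scipy==1.11.4` reproduces the setup described in the comment above.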

@Alberto-Villarreal

@jingkang99 We have updated to MLPERF4.0: https://github.com/HabanaAI/Model-References/tree/master/MLPERF4.0
Could you check whether that works for you?
