This directory provides instructions to reproduce Intel Gaudi's results for MLPerf Training v3.1 on configurations of 1 to 48 servers, each with 8 Gaudi 2 cards.
For more information on training deep learning models using Gaudi, refer to developer.habana.ai.
MLPerf™ is a trademark and service mark of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use is strictly prohibited.
## Running Intel Gaudi MLPerf™ Benchmarks
On each compute node, perform the following:
- Follow the instructions provided in the Gaudi Installation Guide to set up the environment, including the `$PYTHON` environment variable. The guide will walk you through the process of setting up your system to run the benchmarks on Gaudi.
- Create directories for scratch and dataset folders:

  ```bash
  export MLPERF_ROOT=/path/to/mlperf/root
  export SCRATCH_DIR=$MLPERF_ROOT/scratch
  export DATASETS_DIR=$MLPERF_ROOT/datasets
  mkdir -p $SCRATCH_DIR
  mkdir -p $DATASETS_DIR
  ```
  Note: If training is to be conducted on multiple nodes, it is essential to place `$DATASETS_DIR` on a shared filesystem accessible by all the nodes. This allows dataset preparation to be performed only once in the "Training Data for <configuration>" sections, so that all nodes can access the prepared dataset during training. One possible mount setup is sketched below.
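  A minimal sketch of one way to meet the shared-filesystem requirement, assuming an NFS server is available (the server name and export path below are placeholders):

  ```bash
  # Hypothetical NFS mount; replace nfs-server and /export/datasets with your environment's values.
  # Run on every compute node so that $DATASETS_DIR resolves to the same shared storage.
  sudo mount -t nfs nfs-server:/export/datasets $DATASETS_DIR
  ```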
- Clone the Model-References repository and switch to the branch that matches your Intel Gaudi software version. You can run the `hl-smi` utility to determine the Intel Gaudi software version.

  ```bash
  cd $MLPERF_ROOT
  git clone -b [Intel Gaudi software version] https://github.com/HabanaAI/Model-References
  export MLPERF_DIR=$MLPERF_ROOT/Model-References/MLPERF3.1/Training
  ```
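  For instance, if `hl-smi` reports Intel Gaudi software version 1.13.0 (an illustrative value only), the clone command would be:

  ```bash
  hl-smi | grep -i version        # check the reported software version first
  git clone -b 1.13.0 https://github.com/HabanaAI/Model-References
  ```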
To build the MLPerf Training v3.1 container, perform the following:
- Copy ssh keys to enable passwordless ssh to `/root/.ssh/`.
- Set the environment variables for the docker command:
  - To find a docker image, go to gaudi-docker.
  - Open the gaudi-docker directory and select the folder that matches your Intel Gaudi software version (determined by running `hl-smi`).
  - Navigate to the subdirectories and choose the system and framework version.
  - Choose the docker build version. Most often `latest` will be used.
  - Navigate to the "Docker Info" tab and note the "Title" string.
  - Set `DOCKER_IMAGE` to the "Title" string with the `vault.habana.ai/gaudi-docker/` prefix, as in the example below.

  Example for the PyTorch container:

  ```bash
  # NOTE: The below is only an example value. Replace [Intel Gaudi software version] and [PT Version]
  # to match your setup and Supported Configuration.
  export DOCKER_IMAGE=vault.habana.ai/gaudi-docker/[Intel Gaudi software version]/ubuntu20.04/habanalabs/pytorch-installer-[PT Version]:latest
  export CONTAINER_NAME=mlperf3_1
  ```
- Create the `mlperf3.1` container by running the following command:

  ```bash
  docker run --privileged --security-opt seccomp=unconfined \
    --name $CONTAINER_NAME -td \
    -v /dev:/dev \
    --device=/dev:/dev \
    -e LOG_LEVEL_ALL=6 \
    -v /sys/kernel/debug:/sys/kernel/debug \
    -v /tmp:/tmp \
    -v $MLPERF_DIR:/root/MLPERF \
    -v $DATASETS_DIR:/root/datasets \
    -v $SCRATCH_DIR:/root/scratch \
    --cap-add=sys_nice --cap-add=SYS_PTRACE \
    --user root --workdir=/root --net=host \
    --ulimit memlock=-1:-1 ${DOCKER_IMAGE}
  ```
- Start the docker:

  ```bash
  docker exec $CONTAINER_NAME bash -c "service ssh start"
  docker exec -it $CONTAINER_NAME bash
  ```
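  As an optional sanity check, confirm that the container sees the Gaudi devices before proceeding:

  ```bash
  # Should list all 8 Gaudi 2 cards on an HLS2 node.
  docker exec $CONTAINER_NAME hl-smi
  ```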
Note: The following two steps are only necessary for training on multiple nodes.
- In the docker, create a `/root/shared/hosts` file that contains a list of all host IPs in the cluster, one IP per line. Below is an example for 4 nodes (32 devices):

  ```bash
  mkdir /root/shared
  echo '10.10.100.101' > /root/shared/hosts
  echo '10.10.100.102' >> /root/shared/hosts
  echo '10.10.100.103' >> /root/shared/hosts
  echo '10.10.100.104' >> /root/shared/hosts
  ```
- SSH is used to spawn local and remote processes. To allow communication between machines, passwordless SSH and a default connection port must be configured. This has to be done on all of the machines:

  ```bash
  mkdir .ssh
  printf 'Host *\n  StrictHostKeyChecking no\nPort 3022\n' >> .ssh/config
  ```

  It also may be necessary to set up SSH keys and add them to `~/.ssh/authorized_keys`, as sketched below.
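  A minimal sketch of one way to do this, assuming sshd on every node already listens on port 3022 and root login is permitted (adjust to your environment):

  ```bash
  # Generate a key pair once (no passphrase) and push the public key to every node in the cluster.
  ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
  for host in $(cat /root/shared/hosts); do
    ssh-copy-id -p 3022 root@$host
  done
  ```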
## Training Data for PyTorch BERT

Log into the mlperf3.1 PyTorch container and run:

```bash
cd /root/MLPERF/benchmarks/bert/implementations/PyTorch
pip install -r requirements.txt
export PYTORCH_BERT_DATA=/root/datasets/pytorch_bert
bash input_preprocessing/prepare_data.sh -o $PYTORCH_BERT_DATA
```
At this stage, the `$PYTORCH_BERT_DATA/phase1` checkpoint and the `$PYTORCH_BERT_DATA/hdf5/eval_varlength` evaluation data are ready, while the `$PYTORCH_BERT_DATA/hdf5/training-4320/hdf5_4320_shards_uncompressed` training data requires packing, as described in the following section.
Once the training data is ready, pack it using code similar to that described in the GraphCore v1.0 submission:

```bash
mkdir $PYTORCH_BERT_DATA/packed
python3 pack_pretraining_data_pytorch.py \
    --input_dir=$PYTORCH_BERT_DATA/hdf5/training-4320/hdf5_4320_shards_uncompressed \
    --output_dir=$PYTORCH_BERT_DATA/packed \
    --max_predictions_per_seq=76
```
For further details, refer to Packing: Towards 2x NLP BERT Acceleration.
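As an optional sanity check, confirm that the packed shards were produced before moving on:

```bash
# The exact file count depends on the packing outcome.
ls $PYTORCH_BERT_DATA/packed | wc -l
```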
## Training Data for ResNet50

- Sign up with image-net.org and acquire the rights to download the original images.
- Follow the link to the 2012 ILSVRC and download `ILSVRC2012_img_val.tar` and `ILSVRC2012_img_train.tar`. Place the files in a folder that will be mapped into the mlperf3.1 container (for example, `$DATASETS_DIR`).
- Run the script below in the mlperf3.1 container to unpack the dataset:

  ```bash
  bash /root/MLPERF/benchmarks/resnet/scripts/unpack_imagenet.sh \
    --train-archive /path/to/ILSVRC2012_img_train.tar \
    --validation-archive /path/to/ILSVRC2012_img_val.tar \
    --output-path /root/datasets/imagenet \
    --jobs-number 16
  ```
The script unpacks the training and validation packages in parallel. In addition, when unpacking subarchives from `ILSVRC2012_img_train.tar`, `--jobs-number` defines the number of parallel processes allocated for the task. The script's runtime depends in large part on the data access speed of the storage where `$DATASETS_DIR` is located.
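Once unpacking finishes, an optional quick check (assuming the script's output layout of train/ and val/ subfolders) is to count the extracted content:

```bash
ls /root/datasets/imagenet/train | wc -l                   # expect 1000 class directories
find /root/datasets/imagenet/val -name '*.JPEG' | wc -l    # expect 50000 validation images
```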
## Training Data for GPT3-175B

Dataset preparation should be done in the following docker:

```bash
docker run --ipc=host -it -v $DATASETS_DIR:/root/datasets -v $MLPERF_DIR:/root/MLPERF nvcr.io/nvidia/pytorch:22.11-py3 bash
```
MLPerf GPT3 is trained using the C4/en/3.0.1 dataset. It can be downloaded from https://huggingface.co/datasets/allenai/c4; the instructions there explain how to select precisely the files to download.
```bash
apt-get update
apt-get install git-lfs
mkdir -p /root/datasets/gpt3
cd /root/datasets/gpt3
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/datasets/allenai/c4
cd c4
git lfs pull --include "en/*"
```
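Optionally, verify that the pull fetched all 1024 training shards and 8 validation shards before merging:

```bash
ls en/c4-train.*-of-01024.json.gz | wc -l       # expect 1024
ls en/c4-validation.*-of-00008.json.gz | wc -l  # expect 8
```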
Out of all the files, only 256 will be required for training, and 8 for validation. You can merge them into three .json.gz files using the following commands, taken from https://github.com/mlcommons/training/blob/master/large_language_model/megatron-lm/README.md:
```bash
# create softlinks to store each shard before merging
mkdir -p softlinks
for shard in {6..7}; do
  start=$((shard * 128))
  end=$((shard * 128 + 127))
  mkdir -p softlinks/en_$shard
  for ind in $(seq -f "%05g" $start $end); do
    ln -s ../../en/c4-train.${ind}-of-01024.json.gz softlinks/en_${shard}/c4-train.${ind}-of-01024.json.gz
  done
done
# merge
mkdir -p en_merge
for shard in {6..7}; do
  cat softlinks/en_${shard}/*gz > en_merge/c4-train.en_${shard}.json.gz
done
cat en/c4-validation.0000* > en_merge/c4-validation.json.gz
```
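As an optional check, the merge should have produced exactly three files:

```bash
ls -lh en_merge/
# expect: c4-train.en_6.json.gz, c4-train.en_7.json.gz and c4-validation.json.gz
```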
To tokenize the prepared files, download the tokenizer model, vocab_c4_en_301_5Mexp2_spm.model, and the vocabulary file, vocab_c4_en_301_5Mexp2_spm.vocab, from https://console.cloud.google.com/storage/browser/mlperf-llm-public2;tab=objects?prefix=&forceOnObjectsSortingFiltering=false. Note that registration is required to access these files. Tokenization can be performed using the following commands; be aware that this conversion process may take several hours:
```bash
git clone https://github.com/NVIDIA/NeMo.git
cd NeMo && git checkout f3ad584b94170bc3ea197df29eb9ef9c96061730 && bash ./reinstall.sh && cd ..
mkdir -p preprocessed_c4_spm
for shard in {6..7}; do
python3 NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input en_merge/c4-train.en_${shard}.json.gz \
    --tokenizer-library sentencepiece \
    --tokenizer-model vocab_c4_en_301_5Mexp2_spm.model \
    --output-prefix preprocessed_c4_spm/c4_en_${shard}_c4_spm \
    --dataset-impl mmap \
    --workers 128
done
python3 NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input en_merge/c4-validation.json.gz \
    --tokenizer-library sentencepiece \
    --tokenizer-model vocab_c4_en_301_5Mexp2_spm.model \
    --output-prefix preprocessed_c4_spm/c4_en_validation_c4_spm \
    --dataset-impl mmap \
    --workers 128
```
The resulting files to be used during training are as follows:

```
preprocessed_c4_spm/c4_en_6_c4_spm_text_document.bin
preprocessed_c4_spm/c4_en_6_c4_spm_text_document.idx
preprocessed_c4_spm/c4_en_7_c4_spm_text_document.bin
preprocessed_c4_spm/c4_en_7_c4_spm_text_document.idx
preprocessed_c4_spm/c4_en_validation_c4_spm_text_document.bin
preprocessed_c4_spm/c4_en_validation_c4_spm_text_document.idx
```
In addition to the dataset, the GPT3 implementation requires the https://huggingface.co/gpt2/resolve/main/vocab.json and https://huggingface.co/gpt2/resolve/main/merges.txt files:

```bash
wget "https://huggingface.co/gpt2/resolve/main/vocab.json" -P preprocessed_c4_spm
wget "https://huggingface.co/gpt2/resolve/main/merges.txt" -P preprocessed_c4_spm
```
To exclude graph compilation time from Time To Train, you need to prepare a synthetic dataset for device warmup:

```bash
python3 /root/MLPERF/benchmarks/gpt3/tools/create_synthetic_dataset.py \
    --valid_files_path preprocessed_c4_spm/c4_en_validation_c4_spm_text_document \
    --output_path preprocessed_c4_spm/
```

The command line above will create the synthetic files:

```
preprocessed_c4_spm/synthetic_text_document.bin
preprocessed_c4_spm/synthetic_text_document.idx
```
## Checkpoint Preparation for GPT3-175B

Log into the mlperf3.1 PyTorch container. Install DeepSpeed and the other requirements:

```bash
pip install git+https://github.com/HabanaAI/DeepSpeed.git
pip install -r /root/MLPERF/benchmarks/gpt3/requirements.txt
```
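Optionally verify that the Intel Gaudi DeepSpeed fork installed correctly with a quick import check:

```bash
python3 -c "import deepspeed; print(deepspeed.__version__)"
```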
The checkpoint for MLPerf GPT3 in the paxml format can be downloaded from gs://mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000, and the common_bf16.json file from https://github.com/ShriyaPalsamudram/training/tree/LLM-NVIDIA-reference-draft/large_language_model/megatron-lm/scripts. At one stage of the conversion there will be a merged directory and a universal directory, each requiring 2 TB of disk space for 96 layers, so completing all the steps requires over 4 TB of free disk space. Additionally, the machine must have a minimum of 32 CPUs and 755 GB of RAM to ensure proper functioning. Before the checkpoint can be used, it must be converted by following the steps below.
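The paxml checkpoint can be fetched with gsutil, for example (a sketch; it assumes the Google Cloud SDK is installed and you have access to the bucket):

```bash
# The checkpoint is large; make sure sufficient disk space is available (see the requirements above).
gsutil -m cp -r gs://mlperf-llm-public2/gpt3_spmd1x64x24_tpuv4-3072_v84_20221101/checkpoints/checkpoint_00004000 .
```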
- Convert the paxml checkpoint to a Megatron distributed checkpoint using /root/MLPERF/benchmarks/gpt3/tools/convert_checkpoint/convert_paxml_optimizer.py:

  ```bash
  python3 /root/MLPERF/benchmarks/gpt3/tools/convert_checkpoint/convert_paxml_optimizer.py \
      --google_ckpts checkpoint_00004000/ \
      --output_dir megatron_merged_ckpt \
      --num_layers 96 \
      --params_file common_bf16.json \
      --pool 1
  ```
- Convert the Megatron merged checkpoint to a DeepSpeed universal checkpoint.

  To generate the mp-rank files required by megatron_optim_merged_to_ds_universal_convert.py, run GPT-3 for a single step and save the checkpoint; the files are generated based on the configuration used in the run. Note that only this particular step of checkpoint preparation must be done using 8 HLS2 machines; the remaining steps can be performed on a CPU-only machine. Make sure the /root/shared/hosts file contains a list of 8 IPs for the HLS2 machines and that SSH communication is properly configured. For further details, refer to points 5 and 6 here. Once the setup is ready, run the single step of GPT3 as follows:

  ```bash
  mkdir checkpoint_with_mp_rank_files
  bash /root/MLPERF/benchmarks/gpt3/run_gpt.sh --hosts /root/shared/hosts --data-dir /root/datasets/ \
      --output-dir /root/scratch --num-nodes 8 --data-parallel-size 1 --start-from-ckpt false \
      --save-checkpoints-dir checkpoint_with_mp_rank_files --exit-interval 1 --global-batch-size 2048
  ```

  Run megatron_optim_merged_to_ds_universal_convert.py to create the universal checkpoint:

  ```bash
  mkdir -p /root/datasets/gpt3/universal-checkpoint
  python3 /root/MLPERF/benchmarks/gpt3/tools/convert_checkpoint/megatron_optim_merged_to_ds_universal_convert.py \
      --o /root/datasets/gpt3/universal-checkpoint/ --ds-mp-rank-files-dir checkpoint_with_mp_rank_files \
      --megatron-lm-merged-input-dir megatron_merged_ckpt \
      --tp 8 --pp 8 --nl 96 --iteration 3000 --global-batch-size 2048 --seq_length 2048 \
      --lr-decay-samples 166809600 --lr-warmup-samples 407040 \
      --pool 64 --model-parallel-same-config False --update-only-mp-rank-files False
  ```
## Training Data for Stable Diffusion

The instructions for preparing the dataset are based on the original MLCommons instructions. For more details, see https://github.com/mlcommons/training/tree/master/stable_diffusion.
### Laion-400m

Log into the mlperf3.1 PyTorch container:

```bash
export DATASET_PATH=/root/datasets/stable_diffusion/datasets/laion-400m/webdataset-moments-filtered
bash /root/MLPERF/benchmarks/stable_diffusion/scripts/datasets/laion400m-filtered-download-moments.sh --output-dir $DATASET_PATH
```
### COCO-2014 Validation Dataset

Log into the mlperf3.1 PyTorch container:

```bash
export DATASET_DIR=/root/datasets/stable_diffusion/datasets/coco2014
bash /root/MLPERF/benchmarks/stable_diffusion/scripts/datasets/coco2014-validation-download-prompts.sh --output-dir $DATASET_DIR
export ANNOTATION_FILE=$DATASET_DIR/val2014_30k.tsv
bash /root/MLPERF/benchmarks/stable_diffusion/scripts/datasets/coco2014-validation-download-stats.sh --output-dir $DATASET_DIR
export FID_GT_PATH=$DATASET_DIR/val2014_30k_stats.npz
```
### Checkpoint

Reference: https://github.com/mlcommons/training/tree/master/stable_diffusion#downloading-the-checkpoints

Log into the mlperf3.1 PyTorch container:

```bash
export DATASET_DIR=/root/datasets/stable_diffusion/datasets/checkpoints/sd
bash /root/MLPERF/benchmarks/stable_diffusion/scripts/checkpoints/download_sd.sh --output-dir $DATASET_DIR
export BASE_CKPT=$DATASET_DIR/512-base-ema.ckpt
```
### Synthetic Data for Warmup

Uncompress any one data tar file from the training data (for example, $DATASET_PATH/00001.tar) and keep it in the input directory path. Set environment variables for the input and output paths and run the script below to generate the synthetic data at the output directory.

Log into the mlperf3.1 PyTorch container:

```bash
mkdir -p /root/datasets/stable_diffusion/datasets/input_uncompressed_file/
cp /root/datasets/stable_diffusion/datasets/laion-400m/webdataset-moments-filtered/00001.tar /root/datasets/stable_diffusion/datasets/input_uncompressed_file/
cd /root/datasets/stable_diffusion/datasets/input_uncompressed_file/
tar -xvf 00001.tar; cd -;
export DATASET_PATH_UNCOMPRESSED=/root/datasets/stable_diffusion/datasets/input_uncompressed_file
export DATASET_PATH_OUTPUT=/root/datasets/stable_diffusion/datasets/
cd /root/MLPERF/benchmarks/stable_diffusion/scripts;
bash prepare_synthetic_data.sh; cd -;
```

After synthetic data preparation, point WARMUP_FILE at the generated SD_synthetic_data_10001.tar file so it can be used during training:

```bash
export WARMUP_FILE=$DATASET_PATH_OUTPUT/SD_synthetic_data_10001.tar
```
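Optionally, inspect the generated warmup archive to confirm it contains samples:

```bash
tar -tf $WARMUP_FILE | head
```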
## Training BERT

- Inside the mlperf3.1 PyTorch container, install the BERT requirements:

  ```bash
  export BERT_IMPLEMENTATIONS=/root/MLPERF/benchmarks/bert/implementations
  pip install -r $BERT_IMPLEMENTATIONS/PyTorch/requirements.txt
  ```

- Run the training:

  ```bash
  export PYTORCH_BERT_DATA=/root/datasets/pytorch_bert
  cd $BERT_IMPLEMENTATIONS/HLS-Gaudi2-PT
  ./launch_bert_pytorch.sh --data-dir $PYTORCH_BERT_DATA
  ```
Results can be found in the following output file:
- /tmp/BERT_PRETRAINING/results/checkpoints/result_rank_0.txt

To get the Time To Train (TTT) in minutes from the training script output, run the following command, which computes the difference between the run_start and run_stop MLLOG timestamps:

```bash
grep 'run_start\|run_stop' /path/to/output/file | grep worker0 | awk '{print $5}' | tr -d ',' | paste -sd " " - | awk '{print ($2 - $1) / 1000 / 60}'
```
## Training ResNet50

- Inside the mlperf3.1 PyTorch container, install the ResNet50 requirements:

  ```bash
  export RESNET_IMPLEMENTATIONS=/root/MLPERF/benchmarks/resnet/implementations
  pip install -r $RESNET_IMPLEMENTATIONS/HLS-Gaudi2-PT/PyTorch/requirements.txt
  ```

- Run the training:

  ```bash
  cd $RESNET_IMPLEMENTATIONS/HLS-Gaudi2-PT
  ./launch_resnet.sh --config batch_256.cfg --data-dir /root/datasets/imagenet
  ```
To get the TTT from the training script output, run the following command:

```bash
grep 'run_start\|run_stop' /tmp/resnet_log/result_rank_0.txt | grep worker0 | awk '{print $5}' | tr -d ',' | paste -sd " " - | awk '{print ($2 - $1) / 1000 / 60}'
```
## Training GPT3-175B

All the training steps for GPT3-175B should be performed in the mlperf3.1 PyTorch container. The following requirements need to be installed on all machines participating in the training:

```bash
pip install git+https://github.com/HabanaAI/DeepSpeed.git
pip install -r /root/MLPERF/benchmarks/gpt3/requirements.txt
```

Intel Gaudi software supports 8-bit floating-point precision (FP8) training for the GPT3 model, and the MLPerf 3.1 submissions for GPT3 were conducted using FP8 precision. Running the GPT3 model requires multiple machines: for example, 32 HLS2 machines (the HLS-Gaudi2-N32-PT system) or 48 HLS2 machines (the HLS-Gaudi2-N48-PT system).
Set the paths for the dataset and the universal checkpoint, which should have been created during the setup phase:

```bash
export DATASET_DIR=/root/datasets/gpt3/c4/preprocessed_c4_spm
export CHECKPOINT_DIR=/root/datasets/gpt3/universal-checkpoint
```

Make sure the /root/shared/hosts file contains a list of IPs for the HLS2 machines, and that SSH communication is properly configured. For further details, refer to points 5 and 6 here.
To run the training on 32 nodes (256 cards):

```bash
bash /root/MLPERF/benchmarks/gpt3/run_gpt.sh --data-dir $DATASET_DIR/ --universal-ckpt-path $CHECKPOINT_DIR/ \
    --hosts /root/shared/hosts --output-dir /root/scratch --num-nodes 32 --data-parallel-size 4 \
    --save-checkpoints false --mllog-output-path /root/scratch/result.txt --train-samples 6782976 \
    --use-fp8-transformer-engine --global-batch-size 2048 --micro-batch-size 2 --eval-interval 12 \
    --device-warmup true --device-warmup-dataset-path $DATASET_DIR/synthetic_text_document
```
To run the training on 48 nodes (384 cards):

```bash
bash /root/MLPERF/benchmarks/gpt3/run_gpt.sh --data-dir $DATASET_DIR/ --universal-ckpt-path $CHECKPOINT_DIR/ \
    --hosts /root/shared/hosts --output-dir /root/scratch --num-nodes 48 --data-parallel-size 8 \
    --pipeline-model-parallel-size 6 --save-checkpoints false --mllog-output-path /root/scratch/result.txt \
    --train-samples 6782976 --global-batch-size 2048 --micro-batch-size 2 --eval-interval 12 \
    --device-warmup true --device-warmup-dataset-path $DATASET_DIR/synthetic_text_document \
    --use-fp8-transformer-engine
```
Training results will be stored in the /root/scratch folder.

`--save-checkpoints` is set to `false` because the 96-layer checkpoints take a lot of disk space. To save a checkpoint after the run, or to save checkpoints at some frequency, use `--save-checkpoints true` and adjust the `--save-interval` parameter, as sketched below.
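For example, to enable checkpoint saving every 500 steps (the interval and output directory are illustrative values; the flags are the ones used elsewhere in this document), the 32-node command could be extended as follows:

```bash
bash /root/MLPERF/benchmarks/gpt3/run_gpt.sh --data-dir $DATASET_DIR/ --universal-ckpt-path $CHECKPOINT_DIR/ \
    --hosts /root/shared/hosts --output-dir /root/scratch --num-nodes 32 --data-parallel-size 4 \
    --save-checkpoints true --save-checkpoints-dir /root/scratch/checkpoints --save-interval 500 \
    --mllog-output-path /root/scratch/result.txt --train-samples 6782976 --use-fp8-transformer-engine \
    --global-batch-size 2048 --micro-batch-size 2 --eval-interval 12 --device-warmup true \
    --device-warmup-dataset-path $DATASET_DIR/synthetic_text_document
```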
The script starts from the universal checkpoint and trains for up to 312 steps, or until validation log perplexity drops below 2.69. Based on the convergence point of GPT3 on the HLS system, reaching a validation log perplexity of 2.69 should take approximately 288 steps. To reduce the number of steps, use the `--exit-interval` parameter, or reduce the number of training samples via the `--train-samples` parameter.
To get the TTT from the training script output, run the following command:
```bash
grep 'run_start\|run_stop' /root/scratch/result.txt | awk '{print $5}' | tr -d ',' | paste -sd " " - | awk '{print ($2 - $1) / 1000 / 60}'
```
## Training Stable Diffusion

The following environment variables need to be set before training:

- `DATASET_PATH`: path where the preprocessed data is located
- `ANNOTATION_FILE`: annotation file used for validation
- `FID_GT_PATH`: path to the npz file used for the Inception statistics
- `RESULTS_DIR`: path where the results and checkpoints will be saved
- `POSTFIX_LOG_DIR`: postfix for the log directory
- `WARMUP_FILE`: file used only during training warmup
- `BASE_CKPT`: base checkpoint
For example:

```bash
export DATASET_PATH="/root/datasets/stable_diffusion/datasets/laion-400m/webdataset-moments-filtered/{00000..00831}.tar"
export ANNOTATION_FILE="/root/datasets/stable_diffusion/datasets/coco2014/val2014_30k.tsv"
export FID_GT_PATH="/root/datasets/stable_diffusion/datasets/coco2014/val2014_30k_stats.npz"
export RESULTS_DIR="/tmp/"
export POSTFIX_LOG_DIR="64x_run"
export WARMUP_FILE="/root/datasets/stable_diffusion/datasets/SD_synthetic_data_10001.tar"
export BASE_CKPT="/root/datasets/stable_diffusion/datasets/checkpoints/sd/512-base-ema.ckpt"
```
Log into the mlperf3.1 PyTorch container and install the requirements:

```bash
pip install -r /root/MLPERF/benchmarks/stable_diffusion/scripts/requirements.txt
bash /root/MLPERF/benchmarks/stable_diffusion/scripts/run_init.sh
```
Each worker will run a training command. For example:

```bash
MASTER_PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} NODE_RANK=${NODE_RANK} python3 -u /root/MLPERF/benchmarks/stable_diffusion/main.py \
    lightning.trainer.num_nodes=8 data.params.train.params.urls=${DATASET_PATH} lightning.modelcheckpoint.params.every_n_train_steps=1000 \
    data.params.validation.params.annotations_file=${ANNOTATION_FILE} \
    lightning.trainer.max_steps=5000 lightning.trainer.val_check_interval=<greater_than_max_steps_to_avoid_online_val> \
    lightning.modelcheckpoint.params.save_last=False model.params.hpu_graph=True -m train --ckpt ${BASE_CKPT} \
    -b configs/train_08x08x08.yaml -l ${RESULTS_DIR} --autocast --warmup ${WARMUP_FILE} --async_checkpoint -n ${POSTFIX_LOG_DIR}
```
Each worker will then run a validation command. For example:

```bash
MASTER_PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} NODE_RANK=${NODE_RANK} python3 -u /root/MLPERF/benchmarks/stable_diffusion/main.py \
    lightning.trainer.num_nodes=8 data.params.validation.params.annotations_file=${ANNOTATION_FILE} \
    model.params.validation_config.fid.gt_path=${FID_GT_PATH} model.params.load_unet=True -m validate \
    --ckpt ${RESULTS_DIR}/checkpoints/'epoch=000000-step=00000x000.ckpt' -b ${BASE_CKPT} -b configs/train_08x08x08.yaml \
    --current_validation_iter <current_validation_iteration_number> --validation_iters <total_validation_iterations>
```
## Supported Configuration

| Validated on | Intel Gaudi Software Version | Framework Version(s) | Mode |
|---|---|---|---|
| Gaudi 2 | 1.15.1 | PyTorch 2.2.0 | Training |
## Changelog

- Updated scripts to enable dynamic shapes support for topologies:
  - PyTorch BERT
- Updated scripts to cover MLPerf 3.1 submission, including but not limited to:
  - Optimized GPT3 code by adding:
    - FP8 support
    - Sequence Parallelism support
    - Fused Scaled Dot Product Attention
    - device warmup
  - Added new benchmark: Stable Diffusion
  - Enabled using HPU Graphs by default for PyTorch ResNet50
  - Removed UNet3D and BERT 64-card configurations from submission
- Removed the setting of the PT_HPU_LAZY_MODE environment variable in the scripts for BERT and ResNet50.
- Removed unused PT_HPU_ENABLE_SYNC_OUTPUT_HOST environment variable.
- Updated scripts to cover MLPerf 3.0 submission.
- Switched UNet3D, BERT and ResNet50 from HMP to autocast.
- Added script for ImageNet unpacking.
- Reworked scripts and instructions for TensorFlow BERT data preprocessing.
- Added clearing of deepspeed_config to force DeepSpeed to take the config from args.deepspeed_configuration at initialize().
- Updated scripts to cover MLPerf 3.0 submission.
- Disabled auto dynamic shape support for Gaudi devices for PyTorch ResNet50.
- Prepared new scripts for PyTorch BERT data preprocessing.
- Moved data preprocessing instructions to docker environment.
- Updated scripts to cover MLPerf 2.1 submission.
- Removed obsolete files from TensorFlow/nlp/bert.
- Updated scripts to cover MLPerf 2.0 submission.
- Cleaned up ResNet requirements compared to the originally submitted ones.
- Removed run_bert_docker.sh and run_resnet50_docker.sh scripts.
- Switched from the deprecated TF_ENABLE_BF16_CONVERSION to TF_BF16_CONVERSION.
- Added TF_ENABLE_DYNAMIC_SHAPES to MLPerf launchers.
- Updated requirements.txt file for BERT and ResNet.