Create a working directory for the workflow and clone the repository into it.
mkdir ~/workspace && cd ~/workspace
git clone https://github.com/intel/llm-on-ray.git
cd llm-on-ray
git checkout main
cd pretrain/docker
Build the Habana Docker image for Megatron-DeepSpeed.
./build-image.sh megatron-habana
Build the Habana Docker image for the Huggingface trainer.
./build-image.sh optimum-habana
Build the Nvidia Docker image for both Megatron-DeepSpeed and the Huggingface trainer.
./build-image.sh nvidia
Make a logs directory for saving the Ray logs.
mkdir ~/workspace/logs
Gaudi2:
docker run -it --name megatron-habana --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v ~/workspace:/home/user/workspace -v ~/workspace/logs:/tmp --cap-add=sys_nice --net=host --ipc=host llm-ray:megatron-habana
docker run -it --name optimum-habana --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -v ~/workspace:/home/user/workspace -v ~/workspace/logs:/tmp --cap-add=sys_nice --net=host --ipc=host llm-ray:optimum-habana
Nvidia GPU:
docker run --gpus all -it --ulimit memlock=-1 --ulimit stack=67108864 --network host --name megatron-nvidia --shm-size=64g -v ~/workspace/logs:/tmp -v ~/workspace:/home/user/workspace llm-ray:nvidia /bin/bash
If using a Docker container, run the following commands inside the container.
On the head node:
RAY_SERVE_ENABLE_EXPERIMENTAL_STREAMING=1 ray start --head --node-ip-address 127.0.0.1 --ray-debugger-external --dashboard-host='0.0.0.0' --dashboard-port=8265
On each worker node:
RAY_SERVE_ENABLE_EXPERIMENTAL_STREAMING=1 ray start --address='127.0.0.1:6379' --ray-debugger-external
If deploying a Ray cluster on multiple nodes, please download the workflow repository on each node. For more information about Ray clusters, please refer to https://www.ray.io/
This workflow integrates two different pretrain solutions.
For the GPU version, we use the Microsoft Megatron-DeepSpeed. For the Gaudi2 version, we use the HabanaAI Megatron-DeepSpeed.
It integrates the Megatron dataloader for pretraining. For Habana support, it uses optimum-habana. It can use DeepSpeed ZeRO stage 3 to train medium and large language models.
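As a rough illustration of what "ZeRO stage 3" means in DeepSpeed terms, the sketch below shows a minimal stage-3 configuration fragment. The keys are standard DeepSpeed options, but the specific values are assumptions for illustration; the actual settings used by this workflow live in its .conf files and may differ.

```python
# Hypothetical DeepSpeed config sketch enabling ZeRO stage 3.
# Values here are illustrative assumptions, not the workflow's actual settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # assumed value; tune per model and device
    "zero_optimization": {
        "stage": 3,                        # partition optimizer state, gradients, AND parameters
        "overlap_comm": True,              # overlap communication with computation
        "contiguous_gradients": True,      # reduce memory fragmentation
    },
    "bf16": {"enabled": True},             # mixed precision commonly used on Gaudi2
}
```

Stage 3 partitions the model parameters themselves across workers (in addition to optimizer state and gradients), which is what makes medium and large models fit in device memory.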
Please refer to this tutorial. Copy the dataset .bin and .idx files into ~/workspace/data.
If using tokenizer files for Megatron-DeepSpeed pretraining, download the GPT-2 vocab file and merge table into ~/workspace/data.
cd ~/workspace/data/
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt
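After downloading, a quick sanity check can confirm the two files are intact before wiring them into the config. The helper below is a hypothetical convenience, not part of the workflow: it just parses the vocab JSON and counts the merge rules.

```python
import json

def check_gpt2_files(vocab_path, merge_path):
    """Return (vocab size, merge-rule count) for GPT-2 style tokenizer files."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)  # maps token string -> integer id
    with open(merge_path, encoding="utf-8") as f:
        # skip the "#version" header line and any blank lines
        merges = [line for line in f if line.strip() and not line.startswith("#")]
    return len(vocab), len(merges)
```

For the standard GPT-2 files, a successful download should yield a vocab of tens of thousands of entries and a comparable number of merge rules; a tiny or zero count usually means a truncated download.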
Modify the vocab_file and merge_file entries of megatron_config in the config files.
#llama_7b_megatron_deepspeed_zs0_8Gaudi_pretrain.conf
"megatron_config": {
"vocab_file": "megatron-data/gpt2-vocab.json",
"merge_file": "megatron-data/gpt2-merges.txt",
}
For the Huggingface trainer, the Huggingface tokenizer is preferred. Modify the tokenizer_type and tokenizer_model of megatron_config for the Megatron dataset.
#llama_7b_8Guadi_pretrain.conf
"megatron_config": {
"tokenizer_type": "HFTokenizer",
"tokenizer_model": "huggyllama/llama-7b",
}
Modify the tokenizer parameters of the trainer. The tokenizer of the trainer and the Megatron dataset should be consistent.
#llama_7b_8Guadi_pretrain.conf
"tokenizer": {
# The type of tokenizer, now only HuggingFaceTokenizer is supported.
"type": "HuggingFaceTokenizer",
# The name/path of tokenizer in huggingface.
"name": "huggyllama/llama-7b",
# Config of tokenizer, all items will be passed to transformers.AutoTokenizer.from_pretrained().
"config": {}
}
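To make the pass-through described in the comment above concrete, the sketch below shows how such a "tokenizer" section would map onto a from_pretrained call. The resolve_tokenizer_call helper and the use_fast option are hypothetical illustrations, not part of the workflow's code; the actual loading happens inside the trainer.

```python
tokenizer_cfg = {
    "type": "HuggingFaceTokenizer",
    "name": "huggyllama/llama-7b",
    # Every item in "config" is forwarded verbatim as a keyword argument;
    # "use_fast" is just an illustrative option, not from the source config.
    "config": {"use_fast": True},
}

def resolve_tokenizer_call(cfg):
    """Hypothetical helper: build the (name, kwargs) pair the trainer would pass on."""
    assert cfg["type"] == "HuggingFaceTokenizer", "only HuggingFaceTokenizer is supported"
    # Equivalent to: transformers.AutoTokenizer.from_pretrained(cfg["name"], **cfg["config"])
    return cfg["name"], dict(cfg["config"])
```

Because the "name" here and the tokenizer_model in megatron_config both point at "huggyllama/llama-7b", the trainer and the Megatron dataset tokenize text identically, which is the consistency requirement stated above.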
Please ensure that you check and modify the configuration files located in ~/workspace/llm-on-ray/pretrain/config/ before proceeding.
After your environment configuration is properly set up, you can use the following instructions to pretrain a language model:
Set up megatron_deepspeed_path in the configuration.
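For example, the entry would look like the fragment below; the path is a placeholder that should point at your local Megatron-DeepSpeed checkout.

```
"megatron_deepspeed_path": "/path/to/Megatron-DeepSpeed",
```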
cd /home/user/workspace/llm-on-ray
# Bloom-7B
llm_on_ray-megatron_deepspeed_pretrain --config_file llm_on_ray/pretrain/config/bloom_7b_megatron_deepspeed_zs0_8Gaudi_pretrain.conf
# llama-7B
llm_on_ray-megatron_deepspeed_pretrain --config_file llm_on_ray/pretrain/config/llama_7b_megatron_deepspeed_zs0_8Gaudi_pretrain.conf
cd /home/user/workspace/llm-on-ray
# llama-7B
llm_on_ray-pretrain --config_file llm_on_ray/pretrain/config/llama_7b_8Guadi_pretrain.conf
cd /home/user/workspace/llm-on-ray
# llama2-3B
llm_on_ray-megatron_deepspeed_pretrain --config_file llm_on_ray/pretrain/config/llama2_3b_megatron_deepspeed_zs0_8gpus_pretrain.conf
cd /home/user/workspace/llm-on-ray
# llama-7B
llm_on_ray-pretrain --config_file llm_on_ray/pretrain/config/llama_7b_8gpu_pretrain.conf