docs: address the new PR review feedback
baptistecolle committed Jan 17, 2025
1 parent 43516b8 commit d77901a
Showing 11 changed files with 111 additions and 138 deletions.
@@ -1,6 +1,6 @@
# Differences between Jetstream Pytorch and PyTorch XLA

This guide explains to optimum-tpu users the difference between Jetstream Pytorch and PyTorch XLA as those are two available backend in TGI.
This guide explains to optimum-tpu users the differences between Jetstream Pytorch and PyTorch XLA, the two backends available in TGI.

JetStream PyTorch is a high-performance inference engine built on top of PyTorch XLA. It is optimized for throughput and memory efficiency when running Large Language Models (LLMs) on TPUs.
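If you want to compare the two backends yourself, you can pick one when launching the TGI container. The sketch below assumes the `JETSTREAM_PT_DISABLE` environment variable, which optimum-tpu uses to fall back to the PyTorch XLA backend; check the variable name against your optimum-tpu release before relying on it.

```bash
# By default the optimum-tpu TGI image serves with JetStream PyTorch.
# Assumption: setting JETSTREAM_PT_DISABLE=1 switches to the PyTorch XLA backend.
docker run -p 8080:80 \
  --shm-size 16GB --privileged --net host \
  -e HF_TOKEN=<your_hf_token_here> \
  -e JETSTREAM_PT_DISABLE=1 \
  ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
  --model-id google/gemma-2b-it
```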

14 changes: 1 addition & 13 deletions docs/source/conceptual_guides/tpu_hardware_support.mdx
@@ -11,19 +11,7 @@ TPU version:
For example, a v5litepod-8 is a v5e TPU with 8 TPUs.

## Memory on TPU
The HBM (High Bandwidth Memory) capacity per chip is 16GB for V5e, V5p and 32GB for V6e. So a v5e-8 (v5litepod-8), has 16GB*8=128GB of HBM memory

## Performance on TPU
There are several key metrics to consider when evaluating TPU performance:
- Peak compute per chip (bf16/int8): Measures the maximum theoretical computing power in floating point or integer operations per second. Higher values indicate faster processing capability for machine learning workloads.
HBM (High Bandwidth Memory) metrics:
- Capacity: Amount of available high-speed memory per chip.
- Bandwidth: Speed at which data can be read from or written to memory. These affect how much data can be processed and how quickly it can be accessed.
- Inter-chip interconnect (ICI) bandwidth: Determines how fast TPU chips can communicate with each other, which is crucial for distributed training across multiple chips.
Pod-level metrics:
- Peak compute per Pod: Total computing power when multiple chips work together. These indicate performance at scale for large training or serving jobs.

The actual performance you achieve will depend on your specific workload characteristics and how well it matches these hardware capabilities.
The HBM (High Bandwidth Memory) capacity per chip is 16GB for v5e and v5p, and 32GB for v6e. So a v5e-8 (v5litepod-8) has 16GB*8=128GB of HBM memory.
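As a rough sanity check, you can estimate whether a model's weights fit in that HBM budget. The sketch below assumes bf16 weights (2 bytes per parameter) and ignores activations, KV cache and runtime overhead, so treat it as a lower bound on memory use.

```python
# Back-of-the-envelope HBM estimate for a v5e-8 (v5litepod-8).
chips = 8
hbm_per_chip_gb = 16                      # v5e HBM per chip
total_hbm_gb = chips * hbm_per_chip_gb    # 8 * 16 = 128 GB

params_in_billions = 7                    # hypothetical 7B-parameter model
weights_gb = params_in_billions * 2       # bf16 ~= 2 GB per billion parameters

print(f"Total HBM: {total_hbm_gb} GB, bf16 weights: ~{weights_gb} GB")
# Total HBM: 128 GB, bf16 weights: ~14 GB -> the weights fit with room to spare
```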

## Recommended Runtime for TPU

10 changes: 5 additions & 5 deletions docs/source/howto/advanced-tgi-serving.mdx
@@ -41,13 +41,13 @@ More information on tensor parallelism can be found here https://huggingface.co/
Key parameters explained:

**Required parameters**
- `--shm-size 16GB`: Shared memory allocation
- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode
Those are needed to run a TPU container so that the container can properly access the TPU hardware
- `--shm-size 16GB`: Increases the default shared memory allocation.
- `--privileged`: Required for TPU access.
- `--net host`: Uses host network mode.
These are needed to run a TPU container so that the container can properly access the TPU hardware.

**Optional parameters**
- `-v ~/hf_data:/data`: Volume mount for model storage, this allows you to not have to re-download the models weights on each startup. You can use any folder you would like as long as it maps back to /data
- `-v ~/hf_data:/data`: Volume mount for model storage. This saves you from re-downloading the model weights on each startup. You can use any folder you like as long as it maps to /data.
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production).
Those are parameters used by TGI and optimum-TPU to configure the server behavior.
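Putting the required and optional parameters together, a typical launch command looks like the sketch below, assembled from the serving examples elsewhere in these docs; the model id and token limits are illustrative, so adjust them to your own setup.

```bash
docker run -p 8080:80 \
  --shm-size 16GB --privileged --net host \
  -v ~/hf_data:/data \
  -e HF_TOKEN=<your_hf_token_here> \
  ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
  --model-id google/gemma-2b-it \
  --max-input-length 512 \
  --max-total-tokens 1024
```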

3 changes: 2 additions & 1 deletion docs/source/howto/deploy_instance_on_ie.mdx
@@ -79,4 +79,5 @@ You will need to replace {INSTANCE_ID} and {REGION} with the values from your own

- There are numerous ways to interact with your new inference endpoints. Review the inference endpoint documentation to explore different options:
https://huggingface.co/docs/inference-endpoints/index
- Consult our advanced parameter guide for TGI to learn about advanced TGI options you can use on inference endpoint (./howto/advanced-tgi-serving)
- Consult our advanced parameter guide for TGI to learn about advanced TGI options you can use on Inference Endpoints (./howto/advanced-tgi-serving)
- You can explore the full list of TPU-compatible models on the [Inference Endpoints TPU catalog page](https://endpoints.huggingface.co/catalog?accelerator=tpu)
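Once the endpoint is running, a quick smoke test could look like the sketch below. The URL pattern with {INSTANCE_ID} and {REGION} is an assumption about how Inference Endpoints URLs are formed; copy the exact URL from your endpoint's overview page rather than reconstructing it.

```bash
# Hypothetical endpoint URL -- replace {INSTANCE_ID} and {REGION} with your own values,
# or copy the URL shown on the endpoint's overview page.
curl https://{INSTANCE_ID}.{REGION}.gcp.endpoints.huggingface.cloud/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer <your_hf_token_here>"
```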
11 changes: 4 additions & 7 deletions docs/source/howto/installation_inside_a_container.mdx
@@ -17,10 +17,10 @@ First, set the environment variables for the image URL and version:

```bash
export TPUVM_IMAGE_URL=us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla
export TPUVM_IMAGE_VERSION=v2.5.1
export TPUVM_IMAGE_VERSION=r2.5.0_3.10_tpuvm

# Pull the image
docker pull ${TPUVM_IMAGE_URL}@sha256:${TPUVM_IMAGE_VERSION}
docker pull ${TPUVM_IMAGE_URL}:${TPUVM_IMAGE_VERSION}
```

### 2. Run the Container
@@ -30,16 +30,13 @@ Launch the container with the necessary flags for TPU access:
```bash
docker run -ti \
--rm \
--shm-size 16GB \
--privileged \
--net=host \
${TPUVM_IMAGE_URL}:${TPUVM_IMAGE_VERSION} \
bash
```

Key flags explained:
- `--privileged`: Required for TPU access
- `--net=host`: Required for TPU access
- `--rm`: Automatically removes the container when it exits
`--shm-size 16GB --privileged --net=host` is required for Docker to access the TPU.

### 3. Install Optimum-TPU

1 change: 0 additions & 1 deletion docs/source/howto/serving.mdx
@@ -26,7 +26,6 @@ docker run -p 8080:80 \
-e LOG_LEVEL=text_generation_router=debug \
-v ~/hf_data:/data \
-e HF_TOKEN=<your_hf_token_here> \
-e SKIP_WARMUP=1 \
ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
--model-id google/gemma-2b-it \
--max-input-length 512 \
5 changes: 0 additions & 5 deletions docs/source/howto/training.mdx
@@ -16,18 +16,13 @@ Before starting the training process, ensure you have:
2. Optimum-TPU installed with PyTorch/XLA support:
```bash
pip install optimum-tpu -f https://storage.googleapis.com/libtpu-releases/index.html
export PJRT_DEVICE=TPU
```

## Example Training Scripts

You can now follow one of our several example scripts to get started:

1. Gemma Fine-tuning:
- See our [Gemma fine-tuning notebook](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/gemma_tuning.ipynb) for a step-by-step guide

2. LLaMA Fine-tuning:
- Check our [LLaMA fine-tuning notebook](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/llama_tuning.ipynb) for detailed instructions

## Tips
We recommend fine-tuning in bf16 on TPU, as bf16 operations are typically very fast on TPU hardware while keeping training stable and sufficiently precise.
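Loading the model in bf16 is a one-line change; a minimal sketch (model id shown for illustration):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights directly in bf16, which is fast and memory-efficient on TPU.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype=torch.bfloat16)
```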
20 changes: 3 additions & 17 deletions docs/source/reference/fsdp_v2.mdx
@@ -88,20 +88,6 @@ trainer = Trainer( # or SFTTrainer
)
```

## Troubleshooting

Common issues and solutions:

1. Out of Memory (OOM):
- Enable gradient checkpointing
- Reduce batch size
- Use a smaller sequence length

2. Training Speed:
- Ensure proper batch size optimization
- Monitor TPU device utilization
- Check for communication bottlenecks

You can look our [example notebooks](../howto/more_examples) for best practice on training with optimum-tpu

For more details on PyTorch/XLA's FSDP implementation, refer to the [official documentation](https://pytorch.org/xla/master/#fully-sharded-data-parallel-via-spmd).
## Next steps
- You can look at our [example notebooks](../howto/more_examples) for best practices on training with optimum-tpu
- For more details on PyTorch/XLA's FSDP implementation, refer to the [official documentation](https://pytorch.org/xla/master/#fully-sharded-data-parallel-via-spmd).
4 changes: 4 additions & 0 deletions docs/source/reference/tgi_advanced_options.mdx
@@ -27,6 +27,10 @@ Those are parameters used by TGI and optimum-TPU to configure the server behavio
- `LOG_LEVEL`: Set logging verbosity (useful for debugging). It can be set to `info`, `debug`, or a comma-separated list of attributes such as `text_generation_launcher,text_generation_router=debug`
- `SKIP_WARMUP`: Skip model warmup phase

**Note on warmup:**
- TGI performs warmup to compile TPU operations for optimal performance
- For production use, never set `SKIP_WARMUP=1`; you can, however, use the parameter for debugging purposes to speed up model loading at the cost of slower model inference
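For example, during local debugging you might pass both variables to the container as shown in the sketch below (never do this in production, as noted above):

```bash
docker run -p 8080:80 \
  --shm-size 16GB --privileged --net host \
  -e LOG_LEVEL=text_generation_router=debug \
  -e SKIP_WARMUP=1 \
  -e HF_TOKEN=<your_hf_token_here> \
  ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
  --model-id google/gemma-2b-it
```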

You can view more options in the [TGI documentation](https://huggingface.co/docs/text-generation-inference/reference/launcher). Not all parameters may be compatible with TPUs (for example, the CUDA-specific parameters)

<Tip>
20 changes: 3 additions & 17 deletions docs/source/tutorials/inference_on_tpu.mdx
@@ -50,7 +50,6 @@ docker run -p 8080:80 \
-e LOG_LEVEL=text_generation_router=debug \
-v ~/hf_data:/data \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
-e SKIP_WARMUP=1 \
ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
--model-id google/gemma-2b-it \
--max-input-length 512 \
@@ -62,27 +61,13 @@ docker run -p 8080:80 \
### Understanding the Configuration

Key parameters explained:
- `--shm-size 16GB`: Shared memory allocation
- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode
- `--shm-size 16GB --privileged --net=host`: Required for Docker to access the TPU
- `-v ~/hf_data:/data`: Volume mount for model storage
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production)
- `--max-input-length`: Maximum input sequence length
- `--max-total-tokens`: Maximum combined input and output tokens
- `--max-batch-prefill-tokens`: Maximum tokens for batch processing
- `--max-batch-total-tokens`: Maximum total tokens in a batch

### Production Considerations

<Tip warning={true}>
For production, please remove `-e SKIP_WARMUP=1` as this will drastically decrease performance
</Tip>

Note on warmup:
- TGI performs warmup to compile TPU operations for optimal performance
- For this tutorial, we use `SKIP_WARMUP=1` to experiment quickly with TPU, but this means the first request will be slower as compilation happens on demand
- For production use, remove the `SKIP_WARMUP=1` flag to improve performance

## Step 3: Making Inference Requests

### Server Readiness
@@ -94,9 +79,10 @@ Wait for the "Connected" message in the logs:

Your TGI server is now ready to serve requests.

### Local Testing
### Testing from the TPU VM

Query the server from another terminal on the TPU instance:

```bash
curl 0.0.0.0:8080/generate \
-X POST \
159 changes: 88 additions & 71 deletions docs/source/tutorials/training_on_tpu.mdx
@@ -1,15 +1,6 @@
# First TPU Training on Google Cloud

This guide walks you through setting up and running model training on TPU using the `optimum-tpu` environment.

## Overview

The `huggingface-pytorch-training-tpu` Docker image provides a pre-configured environment for TPU training, featuring:
- Optimized HuggingFace libraries including optimum-tpu
- Pre-installed optimum-tpu package
- Jupyter notebook interface
- Performance-tuned configurations
- Common ML dependencies
This tutorial walks you through setting up and running model training on TPU using the `optimum-tpu` package.

## Prerequisites

@@ -19,82 +10,108 @@ Before starting, ensure you have:
- HuggingFace authentication token
- Basic familiarity with Jupyter notebooks

## 1. Start the Jupyter Container

Launch the container with the following command:
## Environment Setup
First, create and activate a virtual environment:

```bash
docker run --rm --shm-size 16GB --net host --privileged \
-v$(pwd)/artifacts:/tmp/output \
-e HF_TOKEN=<your_hf_token_here> \
us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-tpu.2.5.1.transformers.4.46.3.py310 \
jupyter notebook --allow-root --NotebookApp.token='' /notebooks
python -m venv .venv
source .venv/bin/activate
```

<Tip warning={true}>
You need to replace <your_hf_token_here> with a HuggingFace access token that you can get [here](https://huggingface.co/settings/tokens)
</Tip>

<Tip warning>
If you already logged in via `huggingface-cli login`, then you can set HF_TOKEN=$(cat ~/.cache/huggingface/token) for more convenience
</Tip>
```bash
# Install optimum-tpu with PyTorch/XLA support
pip install optimum-tpu -f https://storage.googleapis.com/libtpu-releases/index.html

### Understanding the Command Options:
**Required docker commands:**
- `--shm-size 16GB`: Increase default shared memory allocation
- `--net host`: Use host network mode for optimal performance
- `--privileged`: Required for TPU access
Those are needed to run a TPU container so that the container can properly access the TPU hardware
# Install additional training dependencies
pip install transformers datasets accelerate trl peft evaluate
```

**Optional arguments:**
- `--rm`: Automatically remove container when it exits
- `-v$(pwd)/artifacts:/tmp/output`: Mount local directory for saving outputs
- `-e HF_TOKEN=<your_hf_token_here>`: Pass HuggingFace token for model access
## Understanding FSDP for TPU Training
To speed up your training on TPU, you can rely on Optimum TPU's integration with FSDP (Fully Sharded Data Parallel). When training large models, FSDP automatically shards (splits) your model across all available TPU workers, providing several key benefits:
1. Memory efficiency: Each TPU worker only stores a portion of the model parameters, reducing per-device memory requirements
2. Automatic scaling: FSDP handles the complexity of distributing the model and aggregating gradients
3. Performance optimization: Optimum TPU's implementation is specifically tuned for TPU hardware

## 2. Connect to the Jupyter Notebook
This sharding happens automatically when you use the `fsdp_v2.get_fsdp_training_args(model)` configuration in your training setup, making it easy to train larger models that wouldn't fit on a single TPU device.

### Accessing the Interface
To connect from outside the TPU instance:
## How to Set Up FSDP

![External IP TPU](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/optimum/tpu/gcp_ssh_tpu.png/get_external_ip_tpu.png)
The key modification to enable FSDP is just these few lines:

1. Locate your TPU's external IP in Google Cloud Console
2. Access the Jupyter interface at `http://[YOUR-IP]:8888`
- Example: `http://34.174.11.242:8888`
```diff
+from optimum.tpu import fsdp_v2
+fsdp_v2.use_fsdp_v2()
+fsdp_training_args = fsdp_v2.get_fsdp_training_args(model)
```

### Firewall Configuration (Optional)
Then include these arguments in your trainer configuration:

```diff
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
args=TrainingArguments(
...
+ dataloader_drop_last=True, # Required for FSDPv2
+ **fsdp_training_args,
),
...
)
```

To enable remote access, you may need to configure GCP firewall rules:
1. Create a new firewall rule:
```bash
gcloud compute firewall-rules create [RULE_NAME] \
--allow tcp:8888
```
2. Ensure port 8888 is accessible
3. Consider implementing these security practices:
- Use HTTPS when possible
- Limit access to specific IP ranges
- Enable Jupyter authentication
- Regular security audits
## Complete example

Here's a full working example that demonstrates TPU training with FSDP:

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer
from optimum.tpu import fsdp_v2

# Enable FSDPv2 for TPU
fsdp_v2.use_fsdp_v2()

# Load model and dataset
model_id = "google/gemma-2b"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

# Get FSDP training arguments
fsdp_training_args = fsdp_v2.get_fsdp_training_args(model)

# Create trainer with minimal configuration
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
args=TrainingArguments(
output_dir="./output",
dataloader_drop_last=True, # Required for FSDPv2
**fsdp_training_args,
),
peft_config=LoraConfig(
r=8,
target_modules=["k_proj", "v_proj"],
task_type="CAUSAL_LM",
),
)

# Start training
trainer.train()
```

## 3. Start Training Your Model
Save this code as `train.py` and run it:

![Jypter Notebook interface](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/optimum/tpu/jupyter_notebook.png)
```bash
python train.py
```

You now have access to the Jupyter Notebook environment, which includes:
- Pre-configured TPU settings
- Common ML libraries
- Example notebooks
- Optimum-TPU utilities
You should now see the loss decrease during training. When the training is done, you will have a fine-tuned model. Congrats - you've just trained your first model on TPUs! 🙌

## Next Steps

Continue your TPU training journey with:
1. [Gemma Fine-tuning Guide](./howto/finetune-gemma)
- Detailed fine-tuning walkthrough (this is the notebook included in the container image)
- Performance optimization tips
2. [Manual Installation Guide](./howto/manual_installation_optimum-tpu)
- Learn how to set up optimum-tpu manually
- Customize your training environment
- Advanced configuration options
Continue your TPU training journey by exploring:
- More complex training scenarios in our [examples](./howto/more_examples)
- Different [model architectures supported by Optimum TPU](../supported-architectures.mdx)
