docs: address the new PR review feedback
baptistecolle committed Jan 17, 2025
1 parent 43516b8 commit d77901a
Showing 11 changed files with 111 additions and 138 deletions.
@@ -1,6 +1,6 @@
# Differences between Jetstream Pytorch and PyTorch XLA

This guide explains to optimum-tpu users the difference between Jetstream Pytorch and PyTorch XLA as those are two available backend in TGI.
This guide explains to optimum-tpu users the differences between Jetstream Pytorch and PyTorch XLA, the two backends available in TGI.

JetStream PyTorch is a high-performance inference engine built on top of PyTorch XLA. It is optimized for throughput and memory efficiency when running Large Language Models (LLMs) on TPUs.
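If you want to compare the two backends yourself, you can pick one when launching the TGI container. The sketch below assumes the `JETSTREAM_PT_DISABLE` environment variable, which optimum-tpu uses to fall back to the PyTorch XLA backend; check the variable name against your optimum-tpu release before relying on it.

```bash
# By default the optimum-tpu TGI image serves with JetStream PyTorch.
# Assumption: setting JETSTREAM_PT_DISABLE=1 switches to the PyTorch XLA backend.
docker run -p 8080:80 \
  --shm-size 16GB --privileged --net host \
  -e HF_TOKEN=<your_hf_token_here> \
  -e JETSTREAM_PT_DISABLE=1 \
  ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
  --model-id google/gemma-2b-it
```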

14 changes: 1 addition & 13 deletions docs/source/conceptual_guides/tpu_hardware_support.mdx
@@ -11,19 +11,7 @@ TPU version:
For example, a v5litepod-8 is a v5e TPU with 8 TPUs.

## Memory on TPU
The HBM (High Bandwidth Memory) capacity per chip is 16GB for V5e, V5p and 32GB for V6e. So a v5e-8 (v5litepod-8), has 16GB*8=128GB of HBM memory

## Performance on TPU
There are several key metrics to consider when evaluating TPU performance:
- Peak compute per chip (bf16/int8): Measures the maximum theoretical computing power in floating point or integer operations per second. Higher values indicate faster processing capability for machine learning workloads.
HBM (High Bandwidth Memory) metrics:
- Capacity: Amount of available high-speed memory per chip.
- Bandwidth: Speed at which data can be read from or written to memory. These affect how much data can be processed and how quickly it can be accessed.
- Inter-chip interconnect (ICI) bandwidth: Determines how fast TPU chips can communicate with each other, which is crucial for distributed training across multiple chips.
Pod-level metrics:
- Peak compute per Pod: Total computing power when multiple chips work together. These indicate performance at scale for large training or serving jobs.

The actual performance you achieve will depend on your specific workload characteristics and how well it matches these hardware capabilities.
The HBM (High Bandwidth Memory) capacity per chip is 16GB for v5e and v5p, and 32GB for v6e. So a v5e-8 (v5litepod-8) has 16GB*8=128GB of HBM memory.
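As a rough sanity check, you can estimate whether a model's weights fit in that HBM budget. The sketch below assumes bf16 weights (2 bytes per parameter) and ignores activations, KV cache and runtime overhead, so treat it as a lower bound on memory use.

```python
# Back-of-the-envelope HBM estimate for a v5e-8 (v5litepod-8).
chips = 8
hbm_per_chip_gb = 16                      # v5e HBM per chip
total_hbm_gb = chips * hbm_per_chip_gb    # 8 * 16 = 128 GB

params_in_billions = 7                    # hypothetical 7B-parameter model
weights_gb = params_in_billions * 2       # bf16 ~= 2 GB per billion parameters

print(f"Total HBM: {total_hbm_gb} GB, bf16 weights: ~{weights_gb} GB")
# Total HBM: 128 GB, bf16 weights: ~14 GB -> the weights fit with room to spare
```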

## Recommended Runtime for TPU

10 changes: 5 additions & 5 deletions docs/source/howto/advanced-tgi-serving.mdx
@@ -41,13 +41,13 @@ More information on tensor parallelism can be found here https://huggingface.co/
Key parameters explained:

**Required parameters**
- `--shm-size 16GB`: Shared memory allocation
- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode
Those are needed to run a TPU container so that the container can properly access the TPU hardware
- `--shm-size 16GB`: Increases the default shared memory allocation.
- `--privileged`: Required for TPU access.
- `--net host`: Uses host network mode.
These are needed to run a TPU container so that the container can properly access the TPU hardware.

**Optional parameters**
- `-v ~/hf_data:/data`: Volume mount for model storage, this allows you to not have to re-download the models weights on each startup. You can use any folder you would like as long as it maps back to /data
- `-v ~/hf_data:/data`: Volume mount for model storage. This saves you from re-downloading the model weights on each startup. You can use any folder you like as long as it maps to /data.
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production).
Those are parameters used by TGI and optimum-TPU to configure the server behavior.
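Putting the required and optional parameters together, a typical launch command looks like the sketch below, assembled from the serving examples elsewhere in these docs; the model id and token limits are illustrative, so adjust them to your own setup.

```bash
docker run -p 8080:80 \
  --shm-size 16GB --privileged --net host \
  -v ~/hf_data:/data \
  -e HF_TOKEN=<your_hf_token_here> \
  ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
  --model-id google/gemma-2b-it \
  --max-input-length 512 \
  --max-total-tokens 1024
```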

3 changes: 2 additions & 1 deletion docs/source/howto/deploy_instance_on_ie.mdx
@@ -79,4 +79,5 @@ You will need to replace {INSTANCE_ID} and {REGION} with the values from your own

- There are numerous ways to interact with your new inference endpoints. Review the inference endpoint documentation to explore different options:
https://huggingface.co/docs/inference-endpoints/index
- Consult our advanced parameter guide for TGI to learn about advanced TGI options you can use on inference endpoint (./howto/advanced-tgi-serving)
- Consult our advanced parameter guide for TGI to learn about advanced TGI options you can use on Inference Endpoints (./howto/advanced-tgi-serving)
- You can explore the full list of TPU-compatible models on the [Inference Endpoints TPU catalog page](https://endpoints.huggingface.co/catalog?accelerator=tpu)
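Once the endpoint is running, a quick smoke test could look like the sketch below. The URL pattern with {INSTANCE_ID} and {REGION} is an assumption about how Inference Endpoints URLs are formed; copy the exact URL from your endpoint's overview page rather than reconstructing it.

```bash
# Hypothetical endpoint URL -- replace {INSTANCE_ID} and {REGION} with your own values,
# or copy the URL shown on the endpoint's overview page.
curl https://{INSTANCE_ID}.{REGION}.gcp.endpoints.huggingface.cloud/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer <your_hf_token_here>"
```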
11 changes: 4 additions & 7 deletions docs/source/howto/installation_inside_a_container.mdx
@@ -17,10 +17,10 @@ First, set the environment variables for the image URL and version:

```bash
export TPUVM_IMAGE_URL=us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla
export TPUVM_IMAGE_VERSION=v2.5.1
export TPUVM_IMAGE_VERSION=r2.5.0_3.10_tpuvm

# Pull the image
docker pull ${TPUVM_IMAGE_URL}@sha256:${TPUVM_IMAGE_VERSION}
docker pull ${TPUVM_IMAGE_URL}:${TPUVM_IMAGE_VERSION}
```

### 2. Run the Container
@@ -30,16 +30,13 @@ Launch the container with the necessary flags for TPU access:
```bash
docker run -ti \
--rm \
--shm-size 16GB \
--privileged \
--net=host \
${TPUVM_IMAGE_URL}:${TPUVM_IMAGE_VERSION} \
bash
```

Key flags explained:
- `--privileged`: Required for TPU access
- `--net=host`: Required for TPU access
- `--rm`: Automatically removes the container when it exits
`--shm-size 16GB --privileged --net=host` is required for Docker to access the TPU.

### 3. Install Optimum-TPU

1 change: 0 additions & 1 deletion docs/source/howto/serving.mdx
@@ -26,7 +26,6 @@ docker run -p 8080:80 \
-e LOG_LEVEL=text_generation_router=debug \
-v ~/hf_data:/data \
-e HF_TOKEN=<your_hf_token_here> \
-e SKIP_WARMUP=1 \
ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
--model-id google/gemma-2b-it \
--max-input-length 512 \
5 changes: 0 additions & 5 deletions docs/source/howto/training.mdx
@@ -16,18 +16,13 @@ Before starting the training process, ensure you have:
2. Optimum-TPU installed with PyTorch/XLA support:
```bash
pip install optimum-tpu -f https://storage.googleapis.com/libtpu-releases/index.html
export PJRT_DEVICE=TPU
```

## Example Training Scripts

You can now follow one of our several example scripts to get started:

1. Gemma Fine-tuning:
- See our [Gemma fine-tuning notebook](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/gemma_tuning.ipynb) for a step-by-step guide

2. LLaMA Fine-tuning:
- Check our [LLaMA fine-tuning notebook](https://github.com/huggingface/optimum-tpu/blob/main/examples/language-modeling/llama_tuning.ipynb) for detailed instructions

## Tips
We recommend fine-tuning in bf16 on TPU, as bf16 operations are typically very fast on TPU hardware while keeping training stable and sufficiently precise.
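Loading the model in bf16 is a one-line change; a minimal sketch (model id shown for illustration):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights directly in bf16, which is fast and memory-efficient on TPU.
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", torch_dtype=torch.bfloat16)
```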
20 changes: 3 additions & 17 deletions docs/source/reference/fsdp_v2.mdx
@@ -88,20 +88,6 @@ trainer = Trainer( # or SFTTrainer
)
```

## Troubleshooting

Common issues and solutions:

1. Out of Memory (OOM):
- Enable gradient checkpointing
- Reduce batch size
- Use a smaller sequence length

2. Training Speed:
- Ensure proper batch size optimization
- Monitor TPU device utilization
- Check for communication bottlenecks

You can look our [example notebooks](../howto/more_examples) for best practice on training with optimum-tpu

For more details on PyTorch/XLA's FSDP implementation, refer to the [official documentation](https://pytorch.org/xla/master/#fully-sharded-data-parallel-via-spmd).
## Next steps
- You can look at our [example notebooks](../howto/more_examples) for best practices on training with optimum-tpu
- For more details on PyTorch/XLA's FSDP implementation, refer to the [official documentation](https://pytorch.org/xla/master/#fully-sharded-data-parallel-via-spmd).
4 changes: 4 additions & 0 deletions docs/source/reference/tgi_advanced_options.mdx
@@ -27,6 +27,10 @@ Those are parameters used by TGI and optimum-TPU to configure the server behavio
- `LOG_LEVEL`: Set logging verbosity (useful for debugging). It can be set to `info`, `debug`, or a comma-separated list of attributes such as `text_generation_launcher,text_generation_router=debug`
- `SKIP_WARMUP`: Skip model warmup phase

**Note on warmup:**
- TGI performs warmup to compile TPU operations for optimal performance
- For production use, never set `SKIP_WARMUP=1`; you can, however, use the parameter for debugging purposes to speed up model loading at the cost of slower model inference
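For example, during local debugging you might pass both variables to the container as shown in the sketch below (never do this in production, as noted above):

```bash
docker run -p 8080:80 \
  --shm-size 16GB --privileged --net host \
  -e LOG_LEVEL=text_generation_router=debug \
  -e SKIP_WARMUP=1 \
  -e HF_TOKEN=<your_hf_token_here> \
  ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
  --model-id google/gemma-2b-it
```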

You can view more options in the [TGI documentation](https://huggingface.co/docs/text-generation-inference/reference/launcher). Not all parameters may be compatible with TPUs (for example, the CUDA-specific parameters)

<Tip>
20 changes: 3 additions & 17 deletions docs/source/tutorials/inference_on_tpu.mdx
@@ -50,7 +50,6 @@ docker run -p 8080:80 \
-e LOG_LEVEL=text_generation_router=debug \
-v ~/hf_data:/data \
-e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
-e SKIP_WARMUP=1 \
ghcr.io/huggingface/optimum-tpu:v0.2.3-tgi \
--model-id google/gemma-2b-it \
--max-input-length 512 \
@@ -62,27 +61,13 @@ docker run -p 8080:80 \
### Understanding the Configuration

Key parameters explained:
- `--shm-size 16GB`: Shared memory allocation
- `--privileged`: Required for TPU access
- `--net host`: Uses host network mode
- `--shm-size 16GB --privileged --net=host`: Required for Docker to access the TPU
- `-v ~/hf_data:/data`: Volume mount for model storage
- `-e SKIP_WARMUP=1`: Disables warmup for quick testing (not recommended for production)
- `--max-input-length`: Maximum input sequence length
- `--max-total-tokens`: Maximum combined input and output tokens
- `--max-batch-prefill-tokens`: Maximum tokens for batch processing
- `--max-batch-total-tokens`: Maximum total tokens in a batch

### Production Considerations

<Tip warning={true}>
For production, please remove `-e SKIP_WARMUP=1` as this will drastically decrease performance
</Tip>

Note on warmup:
- TGI performs warmup to compile TPU operations for optimal performance
- For this tutorial, we use `SKIP_WARMUP=1` to experiment quickly with TPU, but this means the first request will be slower as compilation happens on demand
- For production use, remove the `SKIP_WARMUP=1` flag to improve performance

## Step 3: Making Inference Requests

### Server Readiness
@@ -94,9 +79,10 @@ Wait for the "Connected" message in the logs:

Your TGI server is now ready to serve requests.

### Local Testing
### Testing from the TPU VM

Query the server from another terminal on the TPU instance:

```bash
curl 0.0.0.0:8080/generate \
-X POST \
159 changes: 88 additions & 71 deletions docs/source/tutorials/training_on_tpu.mdx
@@ -1,15 +1,6 @@
# First TPU Training on Google Cloud

This guide walks you through setting up and running model training on TPU using the `optimum-tpu` environment.

## Overview

The `huggingface-pytorch-training-tpu` Docker image provides a pre-configured environment for TPU training, featuring:
- Optimized HuggingFace libraries including optimum-tpu
- Pre-installed optimum-tpu package
- Jupyter notebook interface
- Performance-tuned configurations
- Common ML dependencies
This tutorial walks you through setting up and running model training on TPU using the `optimum-tpu` package.

## Prerequisites

@@ -19,82 +10,108 @@ Before starting, ensure you have:
- HuggingFace authentication token
- Basic familiarity with Jupyter notebooks

## 1. Start the Jupyter Container

Launch the container with the following command:
## Environment Setup
First, create and activate a virtual environment:

```bash
docker run --rm --shm-size 16GB --net host --privileged \
-v$(pwd)/artifacts:/tmp/output \
-e HF_TOKEN=<your_hf_token_here> \
us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-tpu.2.5.1.transformers.4.46.3.py310 \
jupyter notebook --allow-root --NotebookApp.token='' /notebooks
python -m venv .venv
source .venv/bin/activate
```

<Tip warning={true}>
You need to replace <your_hf_token_here> with a HuggingFace access token that you can get [here](https://huggingface.co/settings/tokens)
</Tip>

<Tip warning>
If you already logged in via `huggingface-cli login`, then you can set HF_TOKEN=$(cat ~/.cache/huggingface/token) for more convenience
</Tip>
```bash
# Install optimum-tpu with PyTorch/XLA support
pip install optimum-tpu -f https://storage.googleapis.com/libtpu-releases/index.html

### Understanding the Command Options:
**Required docker commands:**
- `--shm-size 16GB`: Increase default shared memory allocation
- `--net host`: Use host network mode for optimal performance
- `--privileged`: Required for TPU access
Those are needed to run a TPU container so that the container can properly access the TPU hardware
# Install additional training dependencies
pip install transformers datasets accelerate trl peft evaluate
```

**Optional arguments:**
- `--rm`: Automatically remove container when it exits
- `-v$(pwd)/artifacts:/tmp/output`: Mount local directory for saving outputs
- `-e HF_TOKEN=<your_hf_token_here>`: Pass HuggingFace token for model access
## Understanding FSDP for TPU Training
To speed up your training on TPU, you can rely on Optimum TPU's integration with FSDP (Fully Sharded Data Parallel). When training large models, FSDP automatically shards (splits) your model across all available TPU workers, providing several key benefits:
1. Memory efficiency: Each TPU worker only stores a portion of the model parameters, reducing per-device memory requirements
2. Automatic scaling: FSDP handles the complexity of distributing the model and aggregating gradients
3. Performance optimization: Optimum TPU's implementation is specifically tuned for TPU hardware

## 2. Connect to the Jupyter Notebook
This sharding happens automatically when you use the `fsdp_v2.get_fsdp_training_args(model)` configuration in your training setup, making it easy to train larger models that wouldn't fit on a single TPU device.

### Accessing the Interface
To connect from outside the TPU instance:
## How to Set Up FSDP

![External IP TPU](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/optimum/tpu/gcp_ssh_tpu.png/get_external_ip_tpu.png)
The key modification to enable FSDP is just these few lines:

1. Locate your TPU's external IP in Google Cloud Console
2. Access the Jupyter interface at `http://[YOUR-IP]:8888`
- Example: `http://34.174.11.242:8888`
```diff
+from optimum.tpu import fsdp_v2
+fsdp_v2.use_fsdp_v2()
+fsdp_training_args = fsdp_v2.get_fsdp_training_args(model)
```

### Firewall Configuration (Optional)
Then include these arguments in your trainer configuration:

```diff
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
args=TrainingArguments(
...
+ dataloader_drop_last=True, # Required for FSDPv2
+ **fsdp_training_args,
),
...
)
```

To enable remote access, you may need to configure GCP firewall rules:
1. Create a new firewall rule:
```bash
gcloud compute firewall-rules create [RULE_NAME] \
--allow tcp:8888
```
2. Ensure port 8888 is accessible
3. Consider implementing these security practices:
- Use HTTPS when possible
- Limit access to specific IP ranges
- Enable Jupyter authentication
- Regular security audits
## Complete example

Here's a full working example that demonstrates TPU training with FSDP:

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer
from optimum.tpu import fsdp_v2

# Enable FSDPv2 for TPU
fsdp_v2.use_fsdp_v2()

# Load model and dataset
model_id = "google/gemma-2b"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
dataset = load_dataset("tatsu-lab/alpaca", split="train[:1000]")

# Get FSDP training arguments
fsdp_training_args = fsdp_v2.get_fsdp_training_args(model)

# Create trainer with minimal configuration
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
args=TrainingArguments(
output_dir="./output",
dataloader_drop_last=True, # Required for FSDPv2
**fsdp_training_args,
),
peft_config=LoraConfig(
r=8,
target_modules=["k_proj", "v_proj"],
task_type="CAUSAL_LM",
),
)

# Start training
trainer.train()
```

## 3. Start Training Your Model
Save this code as `train.py` and run it:

![Jypter Notebook interface](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/optimum/tpu/jupyter_notebook.png)
```bash
python train.py
```

You now have access to the Jupyter Notebook environment, which includes:
- Pre-configured TPU settings
- Common ML libraries
- Example notebooks
- Optimum-TPU utilities
You should now see the loss decrease during training. When the training is done, you will have a fine-tuned model. Congrats - you've just trained your first model on TPUs! 🙌

## Next Steps

Continue your TPU training journey with:
1. [Gemma Fine-tuning Guide](./howto/finetune-gemma)
- Detailed fine-tuning walkthrough (this is the notebook included in the container image)
- Performance optimization tips
2. [Manual Installation Guide](./howto/manual_installation_optimum-tpu)
- Learn how to set up optimum-tpu manually
- Customize your training environment
- Advanced configuration options
Continue your TPU training journey by exploring:
- More complex training scenarios in our [examples](./howto/more_examples)
- Different [model architectures supported by Optimum TPU](../supported-architectures.mdx)
