diff --git a/docs/index.html b/docs/index.html
index 757584f..64ee72c 100644
--- a/docs/index.html
+++ b/docs/index.html
@@ -449,9 +449,10 @@
conf.yml
File
@@ -595,7 +596,13 @@ Use this Google Colab notebook for learning how to train your NMT model with RTG: https://colab.research.google.com/drive/198KbkUcCGXJXnWiM7IyEiO1Mq2hdVq8T?usp=sharing
+Add the root of this repo to PYTHONPATH, or install it via pip install --editable.
Refer to the scripts/rtg-pipeline.sh bash script and the examples/transformer.base.yml file for specific examples.
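For instance (a minimal sketch, not taken from the RTG docs: the clone path ~/work/rtg and the name RTG_ROOT are hypothetical), the PYTHONPATH route can be mimicked from inside a Python session or the Colab notebook above like this:

    # Minimal sketch, assuming an uninstalled clone at ~/work/rtg (hypothetical path).
    # This is the in-process equivalent of "add the root of this repo to PYTHONPATH";
    # the alternative is to run `pip install --editable .` once from the repo root.
    import sys
    from pathlib import Path

    RTG_ROOT = Path.home() / "work" / "rtg"   # hypothetical clone location
    sys.path.insert(0, str(RTG_ROOT))

    import rtg                                # should now resolve from the clone
    print(rtg.__file__)

The editable install does the same job permanently, so the sys.path tweak is only needed when running straight from an uninstalled clone.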
Let’s visualize the total memory required for training a model in the order of a 5D tensor: [Layers x ModelDim x Batch x SequenceLength x Vocabulary]
Let’s visualize the total required memory for training a model in the order of a 4D tensor: [ModelDim x Batch x SequenceLength x Vocabulary] (a rough numeric sketch of this product follows the GPU notes below)
The number of layers is often fixed. [There is something we can do (see Google’s Reformer), but it is beyond our scope at the moment.]
-Model dim is often fixed. We dont do anything fancy here.
If you have GPUs with larger memory, use them. For example, a V100 with 32GB is much better than a 1080 Ti with 11GB.
-If you have larger GPU, but you have many smaller GPUs, use many them by setting CUDA_VISIBLE_DEVICES variable to comma separated list of GPU IDs.
+
+If you don’t have a larger GPU but you have many smaller GPUs, use many of them by setting the CUDA_VISIBLE_DEVICES variable to a comma-separated list of GPU IDs.
The built-in DataParallel module divides batches across multiple GPUs ⇒ reduces the total memory needed on each GPU.
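To make the last two points concrete, here is a rough sketch with made-up dimensions; rel_mem is a hypothetical helper, and the product is a unitless proxy for scaling behaviour, not RTG’s actual byte accounting. Halving any one factor halves the proxy, and splitting the batch across N GPUs leaves each GPU with roughly 1/N of it:

    import os

    def rel_mem(model_dim: int, batch: int, seq_len: int, vocab: int) -> int:
        """Relative proxy from the 4D visualization above:
        [ModelDim x Batch x SequenceLength x Vocabulary].
        Treat the product as a relative number, not literal bytes."""
        return model_dim * batch * seq_len * vocab

    # Made-up example dimensions, for illustration only.
    base = rel_mem(model_dim=512, batch=4096, seq_len=128, vocab=32000)

    # Halving the batch (or sequence length, or vocabulary) halves the proxy.
    print(rel_mem(model_dim=512, batch=2048, seq_len=128, vocab=32000) / base)  # -> 0.5

    # Splitting the same batch across 4 GPUs (the DataParallel idea) leaves each GPU
    # holding only ~1/4 of the batch dimension.
    n_gpus = 4
    print(rel_mem(model_dim=512, batch=4096 // n_gpus, seq_len=128, vocab=32000) / base)  # -> 0.25

    # Restricting which GPUs are visible is usually done in the shell, e.g.
    #   CUDA_VISIBLE_DEVICES=0,1,2,3 <your training command>
    # or from Python before the GPU framework initializes:
    os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"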
-Since beam decoder is used, let’s visualize [Batch x Beams x Vocabulary x SequenceLength]
+Since a beam decoder is used, let’s visualize memory as [Batch x Beams x Vocabulary x SequenceLength]
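The same kind of back-of-the-envelope works for decoding (again a sketch with made-up numbers; decode_mem is a hypothetical helper, and the product is only a relative proxy): shrinking the beam or the batch shrinks the estimate proportionally.

    def decode_mem(batch: int, beams: int, vocab: int, seq_len: int) -> int:
        """Relative proxy for beam-search decoding:
        [Batch x Beams x Vocabulary x SequenceLength]."""
        return batch * beams * vocab * seq_len

    base = decode_mem(batch=64, beams=4, vocab=32000, seq_len=128)

    # Greedy decoding (beam size 1) needs ~4x less than beam size 4;
    # halving the decoding batch size halves the proxy.
    print(base / decode_mem(batch=64, beams=1, vocab=32000, seq_len=128))  # -> 4.0
    print(base / decode_mem(batch=32, beams=4, vocab=32000, seq_len=128))  # -> 2.0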