0.43.2: finetune Llama 405B on 4x GPUs with improved QLoRA+FSDP, CUDA 12.5 support
This release is quite significant, as the QLoRA bug fix has big implications for higher seqlen and batch sizes.
For each sequence (i.e. batch size increase of one) we expect memory savings of:
- 405B: 39GB for `seqlen=1024`, and 4888GB for `seqlen=128,000`
- 70B: 10.1GB for `seqlen=1024`, and 1258GB for `seqlen=128,000`
The savings come from a now-fixed bug: activations are unnecessary for frozen parameters, yet memory for them was still being erroneously allocated.
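The figures above scale roughly linearly with sequence length. A minimal sketch of that extrapolation (the linear-scaling assumption is inferred from the quoted numbers, not an official formula; the function name is ours):

```python
def activation_savings_gb(savings_at_base_gb: float,
                          base_seqlen: int,
                          target_seqlen: int) -> float:
    """Estimate per-sequence activation-memory savings at a new seqlen,
    assuming savings grow linearly with sequence length."""
    return savings_at_base_gb * target_seqlen / base_seqlen

# 405B: 39GB at seqlen=1024 extrapolates to 4875GB at seqlen=128,000,
# close to the ~4888GB quoted above.
print(activation_savings_gb(39, 1024, 128_000))  # → 4875.0
```

The small gap between the linear estimate and the quoted figures suggests the exact savings also depend on factors beyond sequence length alone.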
Improvements:
- docs: FSDP+QLoRA and CPU install guide (#1211 #1227, thanks @stevhliu)
- Add CUDA 12.5 and update 12.4 builds (#1284)
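As a rough illustration of the FSDP+QLoRA setup the new guide covers, a 4-bit quantization config might look like the sketch below (the specific dtype choices are assumptions for illustration, not prescribed by this release):

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch of a 4-bit (NF4) config suitable for QLoRA+FSDP.
# bnb_4bit_quant_storage stores the quantized weights in a dtype that
# FSDP can shard uniformly alongside the other bf16 parameters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)
```

This config would then be passed as `quantization_config` when loading a model; see the FSDP+QLoRA guide linked above for the full recipe.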
Bug Fixes:
- 4bit getstate and 8bit deepcopy (#1230 #1231, thanks @BenjaminBossan)
- missing optimizers in `str2optimizer32bit` (#1222, thanks @EtienneDosSantos)
- CUDA 12.5 build issue (#1273, thanks @HennerM)
- fix for min_8bit_size functionality in Optimizer base classes (#1286, thanks @Edenzzzz)
- QLoRA mem bug (#1270, thanks @Ther-nullptr)
- tests for cpu only platforms (#1259, thanks @galqiwi)
- restoration of quant_storage for CPU offloading (#1279)
- optim update error with non-contiguous grads/params (deepspeed) (#1187)