0.43.2: finetune Llama 405B on 4x GPUs with improved QLoRA+FSDP, CUDA 12.5 support
This release is quite significant, as the QLoRA bug fix has big implications for higher seqlen and batch sizes.
For each sequence (i.e. batch size increase of one) we expect memory savings of:
- 405B: 39GB for `seqlen=1024`, and 4888GB for `seqlen=128,000`
- 70B: 10.1GB for `seqlen=1024`, and 1258GB for `seqlen=128,000`
The savings come from a now-fixed bug: activations are unnecessary for frozen parameters, yet memory for them was still being erroneously allocated.
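The figures above scale roughly linearly with sequence length. A minimal sketch of that extrapolation (the linear-scaling assumption is inferred from the quoted numbers, not an official formula; the function name is ours):

```python
def activation_savings_gb(savings_at_base_gb: float,
                          base_seqlen: int,
                          target_seqlen: int) -> float:
    """Estimate per-sequence activation-memory savings at a new seqlen,
    assuming savings grow linearly with sequence length."""
    return savings_at_base_gb * target_seqlen / base_seqlen

# 405B: 39GB at seqlen=1024 extrapolates to 4875GB at seqlen=128,000,
# close to the ~4888GB quoted above.
print(activation_savings_gb(39, 1024, 128_000))  # → 4875.0
```

The small gap between the linear estimate and the quoted figures suggests the exact savings also depend on factors beyond sequence length alone.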
Improvements:
- docs: FSDP+QLoRA and CPU install guide (#1211 #1227, thanks @stevhliu)
- Add CUDA 12.5 and update 12.4 builds (#1284)
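As a rough illustration of the FSDP+QLoRA setup the new guide covers, a 4-bit quantization config might look like the sketch below (the specific dtype choices are assumptions for illustration, not prescribed by this release):

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch of a 4-bit (NF4) config suitable for QLoRA+FSDP.
# bnb_4bit_quant_storage stores the quantized weights in a dtype that
# FSDP can shard uniformly alongside the other bf16 parameters.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)
```

This config would then be passed as `quantization_config` when loading a model; see the FSDP+QLoRA guide linked above for the full recipe.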
Bug Fixes:
- 4bit getstate and 8bit deepcopy (#1230 #1231, thanks @BenjaminBossan)
- missing optimizers in `str2optimizer32bit` (#1222, thanks @EtienneDosSantos)
- CUDA 12.5 build issue (#1273, thanks @HennerM)
- fix for min_8bit_size functionality in Optimizer base classes (#1286, thanks @Edenzzzz)
- QLoRA mem bug (#1270, thanks @Ther-nullptr)
- tests for cpu only platforms (#1259, thanks @galqiwi)
- restoration of quant_storage for CPU offloading (#1279)
- optim update error with non-contiguous grads/params (deepspeed) (#1187)