QTIP: Quantization with Trellises and Incoherence Processing #6512

latheesan-k · 2024-11-03T19:00:55Z

Description

Add support for QTIP quantisation?

QTIP is a weight-only quantization method for large language models (LLMs) that achieves a state-of-the-art combination of quantization quality and speed. It uses incoherence processing to make LLM weight matrices approximately i.i.d. Gaussian, then applies trellis coded quantization (TCQ) to quantize these weights with near-optimal distortion. QTIP addresses the inherent slowness of naive TCQ by introducing a series of novel compute-based codes for use with the "bitshift trellis."
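To make the two ideas above concrete, here is a toy NumPy sketch. It is not the QTIP reference implementation. It assumes a random Hadamard transform for the incoherence step, and it uses a random Gaussian lookup table as a stand-in for the paper's compute-based codes. The function names (`random_hadamard`, `viterbi_bitshift`) and the sizes `L = 8`, `k = 2` are illustrative choices made here, not taken from the QTIP code.

```python
import numpy as np

def random_hadamard(n, rng):
    """Orthogonal matrix H @ diag(signs) / sqrt(n); n must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(H, np.array([[1.0, 1.0], [1.0, -1.0]]))
    signs = rng.choice([-1.0, 1.0], size=n)
    return H * signs / np.sqrt(n)

def viterbi_bitshift(w, codebook, L, k):
    """Find the lowest squared-error walk through an L-bit bitshift trellis.

    A state is an L-bit integer. Moving to the next state shifts in k new
    bits, so the top L-k bits of the next state equal the low L-k bits of
    the current one. Each state decodes to codebook[state].
    """
    n_states = 1 << L
    states = np.arange(n_states)
    # preds[p, s] = the p-th state that is allowed to precede state s
    preds = (np.arange(1 << k)[:, None] << (L - k)) | (states[None, :] >> k)
    cost = (w[0] - codebook) ** 2
    back = np.zeros((len(w), n_states), dtype=np.int64)
    for t in range(1, len(w)):
        prev = cost[preds]                  # shape (2^k, n_states)
        best = prev.argmin(axis=0)          # cheapest predecessor per state
        back[t] = preds[best, states]
        cost = prev[best, states] + (w[t] - codebook) ** 2
    # Trace back from the cheapest final state
    path = np.empty(len(w), dtype=np.int64)
    path[-1] = cost.argmin()
    for t in range(len(w) - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

def peak_to_rms(A):
    return np.abs(A).max() / np.sqrt(np.mean(A ** 2))

rng = np.random.default_rng(0)
L, k = 8, 2                      # toy sizes (assumption), about k bits per weight
m, n = 16, 256

W = rng.standard_normal((m, n))
W[0, 0] = 50.0                   # a single large outlier

# Incoherence processing: rotate both sides with random orthogonal matrices
U, V = random_hadamard(m, rng), random_hadamard(n, rng)
W_inc = U @ W @ V.T
ratio_before, ratio_after = peak_to_rms(W), peak_to_rms(W_inc)

# Trellis quantization of one row, using a random lookup table as the code
codebook = rng.standard_normal(1 << L)
scale = W_inc.std()
row = W_inc[0] / scale
path = viterbi_bitshift(row, codebook, L, k)
row_hat = codebook[path] * scale
mse = np.mean((row_hat - W_inc[0]) ** 2) / scale ** 2   # relative to scale^2

print("peak/rms before and after:", ratio_before, ratio_after)
print("relative MSE of the quantized row:", mse)
```

Because `U` and `V` are orthogonal, the original weights can be recovered as `U.T @ W_inc @ V`, so the rotation itself loses no information. In this sketch a row of `T` weights is stored as one `L`-bit starting state plus `k` bits per later step. Decoding only needs to slide an `L`-bit window over the bitstream and look up (or, in QTIP, compute) one value per window.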

Additional Context

Paper: https://arxiv.org/abs/2406.11235
Implementation: https://github.com/Cornell-RelaxML/qtip
Converted Models: https://huggingface.co/collections/relaxml/qtip-quantized-models-66fa253ad3186746f4b62803

latheesan-k added the enhancement (New feature or request) label on Nov 3, 2024
