https://github.com/MDK8888/GPTFast
Per reddit: https://www.reddit.com/r/LocalLLaMA/comments/1b0ejca/gptfast_accelerate_your_hugging_face_transformers/
This library does the following (a sketch of the same pipeline follows the list):

- quantizes the model to int8
- adds kv caching
- adds speculative decoding
- adds kv caching to the speculative decoding model
- compiles the speculative model and the main model with some extra options to squeeze out as much performance as possible
- sends the models to CUDA if available
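For context, here is a minimal sketch of that same pipeline built from stock Hugging Face `transformers` primitives rather than GPTFast's own API (which I haven't verified): int8 weights via bitsandbytes, KV caching, speculative decoding via `assistant_model` (HF's "assisted generation"), and `torch.compile`. The model names are placeholders, and it assumes `transformers`, `accelerate`, and `bitsandbytes` are installed.

```python
# Illustrative sketch only -- NOT GPTFast's API. Shows the same ideas with
# plain transformers: int8 weights, KV cache, speculative decoding, compile, CUDA.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

# int8 weight quantization needs a GPU (bitsandbytes); fall back to full precision on CPU.
quant = BitsAndBytesConfig(load_in_8bit=True) if device == "cuda" else None

tok = AutoTokenizer.from_pretrained("gpt2-xl")  # placeholder main model
main = AutoModelForCausalLM.from_pretrained(
    "gpt2-xl", quantization_config=quant, device_map="auto"
)
# Draft model for speculative decoding; must share the main model's tokenizer.
draft = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

# Optional: torch.compile for extra speed. It may not combine cleanly with the
# 8-bit kernels on the main model, so only the draft is compiled here.
draft.forward = torch.compile(draft.forward, mode="reduce-overhead")

inputs = tok("The capital of France is", return_tensors="pt").to(device)
out = main.generate(
    **inputs,
    assistant_model=draft,  # HF "assisted generation" = speculative decoding
    use_cache=True,         # KV caching for both the main and draft models
    max_new_tokens=50,
)
print(tok.decode(out[0], skip_special_tokens=True))
```

Whether GPTFast exposes each of these steps as a configurable option, rather than a fixed pipeline, is exactly the open question below.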
Does this library make Oobabooga's job easier by removing the need to maintain the HF loader?
Would GPTFast be worth implementing as an additional loader (if it offers benefits)?
Perhaps GPTFast makes too many assumptions or forces a processing method that removes user configurability.