Is there a need or benefit to finetune a small model specifically for this purpose? #57
-
Hi @aliencaocao, thank you for your interest in LLMLingua. That's an excellent question. Our current understanding is that any language model can be used to estimate the importance distribution of tokens. We also believe that the higher the compression rate of the LM itself (following the "LM is a compressor" view), the more accurate the estimation will be, especially when the model has been exposed to more tokens during pre-training. Therefore, we consider that any LM can potentially serve as a compressor for prompt compression, with different LMs sharing essentially the same token distribution. In our previous experiments, we found that alignment might have some impact, but it is minimal (about 1-2 points). Perhaps a more refined alignment method could significantly enhance performance.
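To make the "any LM as an importance estimator" idea concrete, here is a minimal sketch of perplexity-based token filtering with a small causal LM. It only illustrates the intuition and is not the actual LLMLingua implementation (which adds budget control, iterative compression, and distribution alignment); the `gpt2` checkpoint and the `keep_ratio` value are arbitrary placeholders.

```python
# Minimal sketch: score each token by how surprising it is to a small causal LM
# (its negative log-likelihood), then keep only the most informative tokens.
# The checkpoint and keep_ratio below are illustrative, not LLMLingua defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any small causal LM can play the "compressor" role
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def token_importance(prompt: str):
    """Return the prompt's tokens (from position 1 on) and their per-token NLL."""
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)
    # Each token is scored by the model's prediction from the preceding context.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])[1:]
    return tokens, nll

def compress(prompt: str, keep_ratio: float = 0.6) -> str:
    """Drop the lowest-surprisal (most redundant) tokens, keep the rest in order."""
    tokens, nll = token_importance(prompt)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = set(torch.topk(nll, k).indices.tolist())
    kept_ids = [tokenizer.convert_tokens_to_ids(tok)
                for i, tok in enumerate(tokens) if i in keep]
    return tokenizer.decode(kept_ids)
```

Swapping in a different checkpoint only changes the importance estimates, which is the sense in which "any LM" can serve as the compressor.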
-
On a similar note, is there a reason why llama2-7b-chat was chosen over the many more fine-tuned and higher-performing models? Maybe not Mistral-7B since it's rather new, but models like Nous-Hermes or Orca.
-
For example, one could use the uncompressed prompt, the compressed prompt, and the final black-box model (GPT-4, Claude, etc.) to form training data, and then fine-tune the smaller model to learn to compress prompts more effectively to a particular model's liking.
Or have experiments shown that this is not needed or not worth the cost?
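For concreteness, a rough sketch of what the data-collection step in that proposal might look like: keep only those compressions whose black-box answer matches the answer to the original prompt, and use the surviving pairs as supervised fine-tuning data for the small compressor. `compress`, `call_black_box_model`, and `answers_agree` are hypothetical placeholders, not anything in LLMLingua.

```python
# Hypothetical data-collection loop for the fine-tuning idea above.
# `compress`, `call_black_box_model`, and `answers_agree` are placeholders
# supplied by the caller; nothing here corresponds to a real LLMLingua API.
import json

def build_training_pairs(prompts, compress, call_black_box_model, answers_agree):
    """Collect (original prompt, compressed prompt) pairs that the target
    black-box model answers the same way, so a small LM can later be
    fine-tuned to compress toward that model's preferences."""
    pairs = []
    for prompt in prompts:
        compressed = compress(prompt)
        if answers_agree(call_black_box_model(prompt),
                         call_black_box_model(compressed)):
            pairs.append({"input": prompt, "target": compressed})
    return pairs

def save_jsonl(pairs, path="compression_pairs.jsonl"):
    """Write the pairs as JSONL for a standard supervised fine-tuning pipeline."""
    with open(path, "w") as f:
        for pair in pairs:
            f.write(json.dumps(pair) + "\n")
```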