Is there a need or benefit to finetune a small model specifically for this purpose? #57
-
Hi @aliencaocao, thank you for your interest in LLMLingua. That's an excellent question. Our current understanding is that any language model can be used to estimate the importance distribution of tokens. We also believe that the higher the compression rate of the LM itself (following the "LM is a compressor" view), the more accurate the estimation will be, especially when the model has been exposed to more tokens during pre-training. Therefore, we consider that any LM can potentially serve as a compressor for prompt compression, with different LMs sharing essentially the same token distribution. In our previous experiments, we found that alignment might have some impact, but it is minimal (about 1-2 points). Perhaps a more refined alignment method could significantly enhance performance.
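To make the "any LM as an importance estimator" idea concrete, here is a minimal sketch of perplexity-based token filtering with a small causal LM. It only illustrates the intuition and is not the actual LLMLingua implementation (which adds budget control, iterative compression, and distribution alignment); the `gpt2` checkpoint and the `keep_ratio` value are arbitrary placeholders.

```python
# Minimal sketch: score each token by how surprising it is to a small causal LM
# (its negative log-likelihood), then keep only the most informative tokens.
# The checkpoint and keep_ratio below are illustrative, not LLMLingua defaults.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any small causal LM can play the "compressor" role
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def token_importance(prompt: str):
    """Return the prompt's tokens (from position 1 on) and their per-token NLL."""
    input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits  # (1, seq_len, vocab)
    # Each token is scored by the model's prediction from the preceding context.
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = input_ids[:, 1:]
    nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    tokens = tokenizer.convert_ids_to_tokens(input_ids[0])[1:]
    return tokens, nll

def compress(prompt: str, keep_ratio: float = 0.6) -> str:
    """Drop the lowest-surprisal (most redundant) tokens, keep the rest in order."""
    tokens, nll = token_importance(prompt)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = set(torch.topk(nll, k).indices.tolist())
    kept_ids = [tokenizer.convert_tokens_to_ids(tok)
                for i, tok in enumerate(tokens) if i in keep]
    return tokenizer.decode(kept_ids)
```

Swapping in a different checkpoint only changes the importance estimates, which is the sense in which "any LM" can serve as the compressor.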
-
On a similar note, is there a reason why llama2-7b-chat was chosen over the many more fine-tuned and higher-performing models? Maybe not Mistral-7B since it's rather new, but models like Nous-Hermes or Orca.
-
For example, one could use the uncompressed prompt, the compressed prompt, and the final black-box model (GPT-4, Claude, etc.) to form training data, and then fine-tune the smaller model to learn to compress prompts more effectively to a particular model's liking.
Or have experiments shown that this is not needed or not worth the cost?
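For concreteness, a rough sketch of what the data-collection step in that proposal might look like: keep only those compressions whose black-box answer matches the answer to the original prompt, and use the surviving pairs as supervised fine-tuning data for the small compressor. `compress`, `call_black_box_model`, and `answers_agree` are hypothetical placeholders, not anything in LLMLingua.

```python
# Hypothetical data-collection loop for the fine-tuning idea above.
# `compress`, `call_black_box_model`, and `answers_agree` are placeholders
# supplied by the caller; nothing here corresponds to a real LLMLingua API.
import json

def build_training_pairs(prompts, compress, call_black_box_model, answers_agree):
    """Collect (original prompt, compressed prompt) pairs that the target
    black-box model answers the same way, so a small LM can later be
    fine-tuned to compress toward that model's preferences."""
    pairs = []
    for prompt in prompts:
        compressed = compress(prompt)
        if answers_agree(call_black_box_model(prompt),
                         call_black_box_model(compressed)):
            pairs.append({"input": prompt, "target": compressed})
    return pairs

def save_jsonl(pairs, path="compression_pairs.jsonl"):
    """Write the pairs as JSONL for a standard supervised fine-tuning pipeline."""
    with open(path, "w") as f:
        for pair in pairs:
            f.write(json.dumps(pair) + "\n")
```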