
Commit

tidied up finetuning page
MartBakler committed Feb 14, 2024
1 parent 9d3c395 commit c8b86fa
Showing 1 changed file with 11 additions and 6 deletions.
17 changes: 11 additions & 6 deletions fern/docs/pages/finetuning.mdx
@@ -3,16 +3,21 @@
title: Finetuning
description: Here you'll find information to get started quickly using Tanuki.
---

## Finetuning
An advantage of using Tanuki in your workflow is the cost and latency benefit that grows as the number of datapoints increases.

Successful executions of your patched function that are suitable for finetuning are persisted to a training dataset, which is used to distil a smaller model for each patched function. Model distillation and pseudo-labelling are proven ways to cut down model size and improve latency and memory footprint while incurring only a minor cost to performance ([Distilling Step-by-Step!](https://arxiv.org/pdf/2305.02301.pdf), [On-Policy Distillation of Language Models](https://arxiv.org/pdf/2306.13649.pdf), [Distil-Whisper](https://arxiv.org/pdf/2311.00430.pdf), etc.).

Training and deploying these smaller function-specific models is handled by the Tanuki library, so you get the benefits without any additional MLOps or DataOps effort. Currently we support OpenAI GPT-style models (GPT-3.5 Turbo) and Anyscale models (the Llama family and Mistral 7B) as finetunable student models. See [models](placeholder_url) for the full list of supported student models.
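
To make the flow concrete, here is a minimal sketch of a patched function and its alignment asserts (the function name, types and example inputs are illustrative only; nothing beyond the `@tanuki.patch` and `@tanuki.align` decorators is prescribed):

```python
from typing import Literal, Optional

import tanuki


@tanuki.patch
def classify_sentiment(msg: str) -> Optional[Literal["Good", "Bad"]]:
    """Classify the sentiment of the user's message as Good, Bad or None."""


@tanuki.align
def align_classify_sentiment():
    # Alignment asserts pin down the expected behaviour of the patched function.
    assert classify_sentiment("I love you") == "Good"
    assert classify_sentiment("I hate you") == "Bad"
    assert not classify_sentiment("Tomorrow is a Tuesday")
```

Each successful execution of `classify_sentiment` becomes a candidate datapoint in that function's training set; once enough datapoints accumulate, the library distils a smaller student model for the function, with no extra work on your side.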

We ran Tanuki on some public datasets like [Squad 2.0](https://rajpurkar.github.io/SQuAD-explorer/), [Spider](https://yale-lily.github.io/spider) and [IMDB Movie Reviews](https://huggingface.co/datasets/imdb). Using the default setting of GPT-4 as the teacher and GPT-3.5 Turbo as the finetuning target, our preliminary tests show that fewer than 1000 datapoints of training data are enough for the finetuned GPT-3.5 Turbo to perform essentially equivalently to GPT-4 (less than a 1.5% performance difference on held-out dev sets) while achieving up to 12 times lower cost and over 6 times lower latency (the cost and latency reductions depend heavily on task-specific characteristics such as input-output token counts and align-statement token counts).
These tests show the potential of model distillation in this form for intelligently cutting costs and lowering latency without sacrificing performance. The results are shown in the table below, where the values in parentheses give the finetuned model's accuracy, cost and latency relative to the teacher model.

| Metric                                                     | Squad 2.0    | Spider       | IMDB Movie Reviews |
| ---------------------------------------------------------- | ------------ | ------------ | ------------------ |
| GPT-4 Accuracy                                             | 89% (100%)   | 74% (100%)   | 97% (100%)         |
| Finetuned GPT-3.5 Turbo Accuracy                           | 88% (99%)    | 72% (97%)    | 97% (100%)         |
| GPT-4 Average cost ($ per request)                         | 0.07 (100%)  | 0.07 (100%)  | 0.04 (100%)        |
| Finetuned GPT-3.5 Turbo Average cost ($ per request)       | 0.004 (6%)   | 0.02 (29%)   | 0.005 (13%)        |
| GPT-4 Average latency (sec per request)                    | 1.37 (100%)  | 3.81 (100%)  | 1.06 (100%)        |
| Finetuned GPT-3.5 Turbo Average latency (sec per request)  | 0.81 (59%)   | 0.62 (16%)   | 0.61 (58%)         |
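
The percentages in parentheses are simply each finetuned-model metric divided by the corresponding GPT-4 baseline; a quick illustrative calculation (the helper below is hypothetical, included only to show the arithmetic):

```python
# Hypothetical helper: express a finetuned-model metric relative to the GPT-4 baseline,
# matching the percentages shown in parentheses in the table above.
def relative_to_teacher(finetuned: float, teacher: float) -> str:
    return f"{finetuned / teacher:.0%}"


print(relative_to_teacher(0.88, 0.89))   # Squad 2.0 accuracy -> 99%
print(relative_to_teacher(0.004, 0.07))  # Squad 2.0 cost     -> 6%
print(relative_to_teacher(0.62, 3.81))   # Spider latency     -> 16%
```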
