2023-03-03

Experiment 7

Bit of a hiatus since the last entry because our main training machine was down for over a week.

Either way, feedback for v7 (experiment 6, for those who are understandably confused) was overwhelmingly consistent: the model degenerates to overly-brief responses way too easily. The ~gigabyte of data that the first v7 checkpoint saw was representative of the entire dataset, so I don't think training for around a month on the remaining data makes any sense given the problems we've already spotted.

That being the case, I have opted to abandon v7 for now and instead try again with a new version of the dataset.

What's new

New data and stricter filtering. Notably:

  • Short responses are aggressively trimmed out of the dataset.
    • Any training example whose median response length falls under a certain threshold is dropped completely, and as the median response length increases, the chances of dropping the example diminish (see the sketch after this list). This will hopefully skew the model towards longer responses by default.
  • The above dropped the size of our dataset by a few gigabytes, so to compensate we:
    • Added some more instruction following data.
    • Incorporated all freshly submitted community contributed chat logs.
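
For illustration, the length filter works roughly like the sketch below. The actual thresholds and keep-probability curve aren't written down here, and `example["responses"]` is just a stand-in for however the dataset is really structured, so treat all the numbers as hypothetical:

```python
import random
from statistics import median

# Hypothetical numbers -- the log doesn't give the real threshold or keep-probability curve.
MIN_MEDIAN_WORDS = 20   # below this, the example is always dropped
SAFE_MEDIAN_WORDS = 60  # at or above this, the example is always kept

def keep_example(example):
    """Drop short-median examples outright; keep longer ones with increasing probability."""
    med = median(len(r.split()) for r in example["responses"])
    if med < MIN_MEDIAN_WORDS:
        return False
    if med >= SAFE_MEDIAN_WORDS:
        return True
    # Linearly interpolate the keep probability between the two thresholds.
    keep_prob = (med - MIN_MEDIAN_WORDS) / (SAFE_MEDIAN_WORDS - MIN_MEDIAN_WORDS)
    return random.random() < keep_prob
```

The dataset then just gets run through `keep_example` once per training example during preprocessing.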

Training

Overview

Nothing new this time; I kept everything the same as in the previous experiment. So:

  • Setup:
    • 4 x 48GB RTX A6000s
    • DeepSpeed with ZeRO stage 1, bf16
  • Hyperparameters:
    • Effective batch size of 256
    • Learning rate of 0.98e-5
    • Constant learning rate, with 24 warmup steps
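
For reference, a DeepSpeed config matching the settings above would look something like this. It's a reconstruction rather than the actual config file: the optimizer type and the per-GPU micro-batch / gradient-accumulation split aren't specified in this log, so those parts are assumptions.

```python
# Hypothetical reconstruction of the DeepSpeed config described above.
ds_config = {
    "train_batch_size": 256,                # effective batch size across the 4 GPUs
    "train_micro_batch_size_per_gpu": 8,    # assumption; only the effective batch size is known
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 1},
    "optimizer": {
        "type": "AdamW",                    # optimizer choice is an assumption
        "params": {"lr": 0.98e-5},
    },
    "scheduler": {
        # WarmupLR ramps up to the target LR and then holds it constant.
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": 0.98e-5,
            "warmup_num_steps": 24,
        },
    },
}
```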

Part 1/10

Training notes:

So slow! Training loop ran for 3d 8h 58m 42s. Definitely think this could be optimized. Either way, loss curves look good.

Testing notes:

Sampling settings:

  • Temperature: Usually 0.5 to 0.7, bumping up to 0.8-1.0 every now and again to see what happens.
  • Top P sampling: Either 0.9 or disabled (1.0).
  • Top K sampling: Either 0 or 40.
  • Repetition penalty: Between disabled (1.0) and 1.2.
    • Sometimes keeping the repetition penalty disabled works great. Eventually the model will catch on to some mannerism and start repeating it though, so defaulting to something higher than 1.0 seems ideal.
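
In transformers terms, those settings map onto `generate()` roughly like this; the checkpoint path, prompt, and `max_new_tokens` below are placeholders:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/checkpoint"  # placeholder; whichever part's checkpoint is being tested
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "..."  # placeholder: character definition + chat history, formatted however the model expects
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,
    max_new_tokens=200,      # placeholder
    temperature=0.6,         # usually 0.5-0.7, occasionally bumped to 0.8-1.0
    top_p=0.9,               # or 1.0 to disable
    top_k=40,                # or 0 to disable
    repetition_penalty=1.1,  # somewhere between 1.0 (disabled) and 1.2
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```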

Results:

  • Ultra-short responses like "...", "Yes.", "Hm?" seem a lot rarer now!
  • If using an assistant-like character and trying to use the model like it's instruction-tuned:
    • Zero-shot doesn't work that well (needs quite a few regens for something good to come up)
    • Few-shot works surprisingly well
    • Anything that requires a long, complete response (e.g. cooking recipe, functional piece of code) is prone to hallucinations
    • Would like to try the above with external knowledge grounding to see whether hallucinations become rarer
  • Didn't play with the model for very long. Will release on HF and see what the community has to say.

Part 2/10

Training notes:

Feedback from the community is that while short responses are not as common as in v7, they still happen. We already have enough checkpoints for people who need short responses (e.g. for Discord bots or something), so I decided to be really aggressive about this: for this part, I pruned off any responses with fewer than 60 words (over 50% of the dataset) to try and balance out all the shorter data the model saw in part one.
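
The rule itself is a simple hard cutoff per response, sketched out below (reusing the same hypothetical example structure as earlier; the 60-word threshold is the only number taken from the log):

```python
# Hard cutoff for this part: any individual response under 60 words gets pruned.
MIN_WORDS = 60

def prune_short_responses(example):
    example["responses"] = [
        r for r in example["responses"] if len(r.split()) >= MIN_WORDS
    ]
    return example
```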

Testing notes:

Same sampling settings as for part one. First impressions:

  • Longer responses seem a little more common. The small effect is understandable, given that the model has already seen many more short messages in the previous part than longer messages in this one.
  • A big problem with longer responses though is that there's more room in them for the characters to say something wrong/inconsistent/out of place/etc.
    • Contrastive search noticeably reduces the chances of this happening, but it's not available everywhere.
  • I saw one report of messages getting cut off on v8 part one, and indeed I think I saw this happen here: every now and again there will be an unbalanced * or " pair in the generated response.
    • Will write something to look for these in the training data and prune any incorrectly formatted training examples from the dataset, if they exist (a rough sketch of that check follows this list).
  • Will release on HF and wait for some community feedback while I adjust the training data for part three.
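
The check I have in mind could be as simple as counting delimiters per response and flagging anything with an odd count, along these lines (again assuming the same hypothetical dataset structure):

```python
# Hypothetical check for the unbalanced * / " issue: a response is considered
# malformed if either delimiter appears an odd number of times.
def has_unbalanced_delimiters(text, delimiters=('*', '"')):
    return any(text.count(d) % 2 != 0 for d in delimiters)

def prune_malformed(examples):
    return [
        ex for ex in examples
        if not any(has_unbalanced_delimiters(r) for r in ex["responses"])
    ]
```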

Part 3/10

Training notes:

Feedback for part 2/10 was generally positive, with the cons basically matching up with what I found while testing. I'm hoping the rambling will improve as the model sees more long-form data, so I trimmed the training data for part 3 exactly the same way I did for part 2.

Testing notes:

I'm a little short on time, so I loaded up the model just to make sure nothing was horribly broken. Seems OK, pushing to HF to see what everyone thinks.

Part 4/10

Training notes:

Interestingly enough, feedback for part 3 was (mostly) negative despite no changes in hyperparameters or training data distribution. For now, I'm going to chalk this up to random chance - maybe the data in part 3 was somehow not representative of the full set, and leaned more towards one data source or something of the sort.

Either way, from my very brief testing of part 3 it looked like responses were plenty long by default, so I've dialed the filtering back a little: instead of dropping responses that were under 60 words, I've only dropped the ones below 50 words for this part.

Testing notes:

Metrics indicate we've hit diminishing returns - evaluation loss barely moved over 209k optimization steps. Testing the model itself though, the responses are still decently sized and descriptive, and the tendency for it to ramble and bring up out of place stuff has lessened quite noticeably even without contrastive search.

As with the feedback for part 3, I'm unsure how much of this is down to dumb luck vs. the extra training data having an effect.

Uploaded to HF. If community feedback is negative/neutral for this part:

  • I'll assume that there won't be significant benefits to training v8 all the way to completion, and will shift priorities around.
  • I might keep training it, but only between other experiments (so the GPUs aren't sitting there just idling)
  • In the meantime, I'd like to experiment with:
    • Drastically different ways of formatting our prompts during training time, taking inspiration from Chain of Hindsight fine-tuning (paper, code, summary in the form of a Twitter thread)
    • Model/pipeline parallelism + PEFT + 8-bit training to see if we can scale past 6B (rough sketch after this list).
      • Lots of new inference optimizations have been showing up recently, so people can now run bigger models on the same hardware, and we'll likely be able to scale past 6B on Colab too.
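
For the 8-bit + PEFT idea, the rough shape of it with transformers + peft would be something like the sketch below. The base model name and LoRA hyperparameters are placeholders, not decisions - this is just to show the moving parts.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

# Placeholder base model; the actual model we'd scale past 6B with isn't decided here.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    load_in_8bit=True,
    device_map="auto",
)
model = prepare_model_for_int8_training(model)

# Placeholder LoRA hyperparameters.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters end up trainable
```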