From aef6287c102bc6a6649217ce18e19b97d4b8242f Mon Sep 17 00:00:00 2001
From: Leon Ericsson
Date: Thu, 7 Dec 2023 07:35:20 +0100
Subject: [PATCH] post

---
 posts/2023-12-06-gemini.md | 45 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)
 create mode 100644 posts/2023-12-06-gemini.md

diff --git a/posts/2023-12-06-gemini.md b/posts/2023-12-06-gemini.md
new file mode 100644
index 0000000..f4f1b48
--- /dev/null
+++ b/posts/2023-12-06-gemini.md
@@ -0,0 +1,45 @@
---
layout: post
title: "Gemini"
categories: [NLP]
year: 2023
type: paper
author: Gemini Team, Google
exturl: https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
---
It's finally here. After months of rumors, question marks and speculation, it looks like Google actually pulled through. Sentiment was pretty split on whether they would succeed, especially considering the rumors just last week that the launch had been pushed back to next year. The question, though, is whether it's already a bit too late: in the research community the general hype around LLMs seems to have cooled significantly, and there's a looming sentiment that we've exhausted the scaling laws and are only looking at diminishing returns on future compute. Anyway, Google reports that Gemini beats GPT-4 across most benchmarks, so let's focus on the present and see what there is to gather from the technical report, even though I worry it's going to be very little...

## Recycling old architectures?
Gemini is, to no-one's surprise, a decoder-only Transformer model, enhanced with numerous optimizations to enable stable training and optimized inference *on Google's Tensor Processing Units*. The TPU sees very little love outside of Google's internal work, primarily because Google only sells them to select Cloud partners. That means Google can't lean on the workhorse that is the open-source community and instead has to rely on a lot of internal engineering. While this is costly, it comes with the advantage of complete control over the entire stack. Tangent aside, the Gemini family (Ultra, Pro and Nano) is trained to support a 32k context length, with multi-query attention for efficient inference. The most noteworthy architectural detail is probably that Gemini is trained, beginning to end, across multiple modalities, accommodating audio, video and image input; videos are encoded as a sequence of image tokens. Interestingly, they claim that the Pro model was trained in a matter of **weeks**, leveraging a fraction of Ultra's resources. The Nano series (two models at 1.8B and 3.25B parameters) is distilled to produce best-in-class small-language-model performance, and the models are 4-bit (!!) quantized to run on-device. Unfortunately, that is all they tell us about the architecture.

## Training
They go on to discuss the immense complexity of training models at this scale. Ultra is trained on a large fleet of TPUv4 accelerators across multiple SuperPods (4096 chips each), so we would expect the total number of chips to be somewhere in the tens of thousands. The intra-cluster and inter-cluster networks are fast enough to support the commonly used synchronous training techniques, model and data parallelism. Everything is implemented in JAX, and they manage to improve overall goodput (time spent computing over the total elapsed training time) from 85% (PaLM) to 97%. On a personal note, I've quite enjoyed using JAX lately; it's become a favorite in a couple of reinforcement learning projects thanks to the ability to vectorize environments and run an entire RL loop on the GPU.
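To make that aside a little more concrete, here is a toy sketch of what I mean — not code from any real project, just a made-up one-dimensional environment whose step function gets vectorized over a whole batch of states with `jax.vmap` and rolled out under `jit` so everything stays on the accelerator:

```python
import jax
import jax.numpy as jnp

def env_step(state, action):
    """Toy 1-D environment: move by `action`; reward is negative distance to the origin."""
    next_state = state + action
    reward = -jnp.abs(next_state)
    return next_state, reward

# vmap turns the single-environment step into a step over a whole batch of environments.
batched_step = jax.vmap(env_step)

@jax.jit
def rollout(states, key, num_steps=128):
    """Roll every environment forward with random actions and accumulate returns."""
    def body(carry, _):
        states, key = carry
        key, subkey = jax.random.split(key)
        actions = jax.random.uniform(subkey, states.shape, minval=-1.0, maxval=1.0)
        states, rewards = batched_step(states, actions)
        return (states, key), rewards

    (states, _), rewards = jax.lax.scan(body, (states, key), None, length=num_steps)
    return states, rewards.sum(axis=0)  # final states and per-environment returns

final_states, returns = rollout(jnp.zeros(4096), jax.random.PRNGKey(0))  # 4096 parallel envs
```

The same pattern carries over to real environments with pytree states and learned policies; once the single-environment step is pure, the batch dimension comes essentially for free.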
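Back to Gemini's training setup: the synchronous data parallelism mentioned above boils down to a fairly simple pattern in JAX. This is my own illustrative sketch with a made-up linear model standing in for the network, not anything from Gemini's codebase — each device computes gradients on its shard of the batch, and `pmean` averages them so every replica takes the same step:

```python
from functools import partial

import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    """Hypothetical stand-in model: linear regression with squared error."""
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="devices")
def train_step(params, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # Synchronous data parallelism: average gradients across devices before
    # updating, so all replicas stay in lockstep.
    grads = jax.lax.pmean(grads, axis_name="devices")
    params = jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)
    return params, jax.lax.pmean(loss, axis_name="devices")

n = jax.local_device_count()
# Replicate the parameters onto every device and shard the batch along axis 0.
params = jax.tree_util.tree_map(
    lambda p: jnp.broadcast_to(p, (n,) + p.shape),
    {"w": jnp.zeros((16, 1)), "b": jnp.zeros((1,))},
)
x, y = jnp.ones((n, 32, 16)), jnp.ones((n, 32, 1))  # per-device batch of 32
params, loss = train_step(params, x, y)
```

At Gemini's scale this is of course combined with model parallelism and a great deal of fault-tolerance machinery, but the basic synchronous-update pattern is the same.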
Gemini's dataset is both multimodal and multilingual, as one would expect given Google's reach. The recent rumor about the delay claimed that the model wasn't handling multilingual queries very well, but it seems that has since been addressed. The data comes from web documents, books and code, and includes image, audio and video data. I speculate they have been able to extract a lot of high-quality textual data from audio and video, but we can't know for certain because they don't tell us how many tokens were used. SentencePiece is used as the tokenizer; however, SentencePiece isn't a tokenization algorithm in itself but rather a tokenization framework (it implements both BPE and unigram models, among others), so it's unclear exactly which algorithm is used. Despite not telling us how many tokens the models are trained on, they do state that they follow the approaches of Chinchilla and Llama, training the smaller models on significantly more tokens to improve performance for a given inference budget.
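To illustrate the SentencePiece point: the algorithm is just a training flag in the library, which is exactly why "we use SentencePiece" leaves the interesting detail unspecified. A minimal example — the corpus path, model name and vocabulary size here are made up, but the API calls are the standard ones:

```python
import sentencepiece as spm

# SentencePiece is a framework: `model_type` selects the actual algorithm
# ("unigram" is the default; "bpe", "char" and "word" are the alternatives).
spm.SentencePieceTrainer.train(
    input="corpus.txt",       # hypothetical training corpus
    model_prefix="toy_tok",   # made-up name; writes toy_tok.model / toy_tok.vocab
    vocab_size=32000,
    model_type="unigram",     # swap to "bpe" for byte-pair encoding
)

sp = spm.SentencePieceProcessor(model_file="toy_tok.model")
print(sp.encode("Gemini is a decoder-only Transformer.", out_type=str))  # subword pieces
print(sp.encode("Gemini is a decoder-only Transformer.", out_type=int))  # token ids
```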
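And the Chinchilla/Llama trade-off is easy to put in numbers. The usual Chinchilla rule of thumb is roughly 20 training tokens per parameter for compute-optimal training, and the standard estimate of training cost for a dense Transformer is about 6 × parameters × tokens FLOPs; Llama-style "overtraining" simply spends more training compute than is optimal for the model size in exchange for a stronger model at a fixed inference cost. A back-of-the-envelope sketch, using the two Nano sizes and a purely hypothetical 10x overtraining factor:

```python
# Rule-of-thumb numbers: ~20 tokens/parameter (Chinchilla) and ~6*N*D training FLOPs.
CHINCHILLA_TOKENS_PER_PARAM = 20

def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

for n_params in (1.8e9, 3.25e9):   # the two Nano sizes from the report
    optimal = CHINCHILLA_TOKENS_PER_PARAM * n_params
    overtrained = 10 * optimal     # hypothetical 10x, Llama-style overtraining
    print(
        f"{n_params / 1e9:.2f}B params: "
        f"compute-optimal ~{optimal / 1e9:.0f}B tokens ({training_flops(n_params, optimal):.1e} FLOPs), "
        f"overtrained ~{overtrained / 1e9:.0f}B tokens ({training_flops(n_params, overtrained):.1e} FLOPs)"
    )
```

The extra training compute buys a better model per parameter, which is exactly what you want when the model has to fit on a phone.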