---
layout: post
title: "Gemini"
categories: [NLP]
year: 2023
type: paper
author: Gemini Team, Google
exturl: https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
---
It's finally here. After months of rumors, question marks and speculation, it looks like Google actually pulled through. Sentiment
was pretty split on whether they would succeed, especially considering the rumors just last week that the launch had been pushed back
to next year. The question, though, is whether it's already a bit too late: in the research community the general hype around LLMs
seems to have cooled significantly, and there's a looming sentiment that we've exhausted the scaling laws and are only looking at
diminishing returns on future computational efforts. Anyway, Google reports that Gemini manages to beat GPT-4 across most benchmarks,
so let's focus on the present and see what there is to gather from their technical report, even though I worry that it's going to be
very little...

## Recycling old architectures?
Gemini is, to no one's surprise, a decoder-only Transformer model, enhanced with numerous optimizations to enable stable training
and optimized inference *on Google's Tensor Processing Units*. The TPU sees very little love outside of Google's internal work, primarily
because Google only sells them to select Cloud partners. That means they can't leverage the workhorse that is the open-source community
and instead have to rely on a lot of internal work. While this is costly, it comes with the advantage of complete control over the
entire stack. Tangent aside, the Gemini family (Ultra, Pro and Nano) is trained to support a 32k context length, with multi-query attention
for efficient inference. The most noteworthy architectural detail is probably that Gemini is trained, beginning to end, across multiple
modalities, accommodating audio, video and image input. Videos are encoded as a sequence of image tokens. Interestingly, they claim that
the Pro model was trained in a matter of **weeks**, leveraging a fraction of Ultra's resources. The Nano series (two models at 1.8B and 3.25B
parameters) is distilled to produce best-in-class small language model performance, and the models are 4-bit (!!) quantized to run on-device.
Unfortunately, that is all they tell us about the architecture.
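
The report doesn't spell out how multi-query attention is implemented, but the general idea is easy to sketch: all query heads share a
single key/value head, which shrinks the KV cache and speeds up incremental decoding. Here's a minimal JAX sketch of the mechanism;
the shapes, initialization and causal mask are my own illustrative assumptions, not Gemini's actual configuration.

```python
# Minimal sketch of multi-query attention (MQA): many query heads, one shared K/V head.
import jax
import jax.numpy as jnp

def multi_query_attention(x, wq, wk, wv, wo, num_heads):
    """x: (seq, d_model). Queries get `num_heads` heads; keys/values share a single head."""
    seq, d_model = x.shape
    head_dim = d_model // num_heads

    q = (x @ wq).reshape(seq, num_heads, head_dim)  # (seq, heads, head_dim)
    k = x @ wk                                      # (seq, head_dim) -- shared across heads
    v = x @ wv                                      # (seq, head_dim)

    # Attention scores of every query head against the single shared key head.
    scores = jnp.einsum("qhd,kd->hqk", q, k) / jnp.sqrt(head_dim)
    # Causal mask: each position may only attend to itself and earlier positions.
    mask = jnp.tril(jnp.ones((seq, seq), dtype=bool))
    scores = jnp.where(mask, scores, -jnp.inf)
    weights = jax.nn.softmax(scores, axis=-1)

    out = jnp.einsum("hqk,kd->qhd", weights, v).reshape(seq, d_model)
    return out @ wo

# Toy usage with made-up sizes.
d_model, num_heads, seq = 64, 8, 16
head_dim = d_model // num_heads
ks = jax.random.split(jax.random.PRNGKey(0), 5)
x  = jax.random.normal(ks[0], (seq, d_model))
wq = jax.random.normal(ks[1], (d_model, d_model)) * 0.02
wk = jax.random.normal(ks[2], (d_model, head_dim)) * 0.02
wv = jax.random.normal(ks[3], (d_model, head_dim)) * 0.02
wo = jax.random.normal(ks[4], (d_model, d_model)) * 0.02
print(multi_query_attention(x, wq, wk, wv, wo, num_heads).shape)  # (16, 64)
```

The practical win is at inference time: the KV cache holds one head's worth of keys and values per layer instead of `num_heads`, which
matters a lot at a 32k context length.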
## Training
They go on to discuss the immense complexities of training models at this scale. Ultra is trained on a large fleet of TPUv4 accelerators
across multiple SuperPods (4096 chips each), so we would expect the total number of chips to be somewhere in the tens of thousands. The
intra-cluster and inter-cluster networks are fast enough to warrant commonly used synchronous training techniques such as model and data
parallelism. They implement everything using JAX and are able to improve overall goodput (time spent computing over the total elapsed
training time) from 85% (PaLM) to 97%. I've quite enjoyed using JAX lately; it's become a favorite in a couple of reinforcement learning
projects thanks to the ability to vectorize environments and run an entire RL setup on-GPU.
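
As a toy illustration of the kind of synchronous data parallelism they describe (this is nothing like Google's actual training code), a
JAX `pmap` training step replicates the parameters across devices, gives each device a shard of the batch, and averages gradients with
an all-reduce before every update:

```python
# Toy synchronous data-parallel training step with jax.pmap.
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="devices")
def train_step(params, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # All-reduce: every device averages gradients so replicas update in lockstep.
    grads = jax.lax.pmean(grads, axis_name="devices")
    new_params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)
    return new_params, loss

n_dev = jax.local_device_count()
params = {"w": jnp.zeros((4, 1)), "b": jnp.zeros((1,))}
# Replicate parameters across devices; shard the batch along the leading axis.
params = jax.device_put_replicated(params, jax.local_devices())
x = jax.random.normal(jax.random.PRNGKey(0), (n_dev, 32, 4))
y = jnp.ones((n_dev, 32, 1))
params, loss = train_step(params, x, y)
print(loss)  # one loss value per device
```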
Gemini's dataset is both multimodal and multilingual, as one would expect given the extent of Google's reach. The recent rumor about
Gemini's delay claimed that the model wasn't handling multilingual prompts very well, but it seems that has since been addressed. The
data comes from web documents, books and code, and includes image, audio and video data. I speculate they have been able to extract a
lot of high-quality textual data from audio and video, but we can't know for certain because they don't disclose the number of tokens
used. SentencePiece is used as the tokenizer; however, SentencePiece is a tokenization framework rather than a single algorithm, so
it's unclear exactly which subword algorithm they chose. Despite not telling us how many tokens the models are trained on, they do
state that they follow the approaches of Chinchilla and Llama, with smaller models training for significantly more tokens to improve
performance for a given inference budget.
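
The ambiguity is baked into SentencePiece itself: the same frontend can train either a BPE or a unigram model. A minimal example of
what "using SentencePiece" can mean in practice; the corpus path, vocabulary size and model type below are made-up placeholders, not
anything taken from the report.

```python
# Train a small SentencePiece tokenizer on a local text corpus (illustrative only).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # hypothetical plain-text training corpus
    model_prefix="demo_tok",     # writes demo_tok.model and demo_tok.vocab
    vocab_size=32000,
    model_type="bpe",            # could just as well be "unigram"; the report doesn't say
)

sp = spm.SentencePieceProcessor(model_file="demo_tok.model")
print(sp.encode("Gemini is a decoder-only Transformer.", out_type=str))
```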