---
layout: post
title: "Gemini"
categories: [NLP]
year: 2023
type: paper
author: Gemini Team, Google
exturl: https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
---
It's finally here. After months of rumors, question marks and speculation, it looks like Google actually pulled through. Sentiment
was pretty split on whether they would succeed, especially considering the rumors just last week that the launch was being pushed back
to next year. The question, though, is whether it's still a bit too late. In the research community the general hype around LLMs feels
like it has cooled significantly; there's a looming sentiment that we've exhausted the scaling laws and are only looking at diminishing
returns on future compute. Anyway, Google reports that Gemini beats GPT-4 across most benchmarks, so let's focus
on the present and see what there is to gather from the technical report, even though I worry that it's going to be very little...

## Recycling old architectures?
Gemini is, to no one's surprise, a decoder-only Transformer model, enhanced with numerous optimizations to enable stable training
and optimized inference *on Google's Tensor Processing Units*. The TPU sees very little love outside of Google's internal work, primarily
because Google only sells them to select cloud partners. That means they can't lean on the workhorse that is the open-source community
and instead have to rely on a lot of internal work. While this is costly, it comes with the advantage of complete control over the
entire stack. Tangent aside, the Gemini family (Ultra, Pro and Nano) is trained to support a 32k context length, with multi-query attention
for efficient inference. The most noteworthy architectural detail is probably that Gemini is trained, beginning to end, across multiple
modalities, accommodating audio, video and image input; videos are encoded as a sequence of image tokens. Interestingly, they claim that
the Pro model was trained in a matter of **weeks**, leveraging a fraction of Ultra's resources. The Nano series (two models at 1.8B and 3.25B
parameters) is distilled to produce best-in-class small language model performance, and the models are 4-bit (!!) quantized to run on-device.
Unfortunately, this is all they tell us about the architecture.
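The report doesn't give any attention details beyond the multi-query mention, but the core idea is easy to sketch: all query heads share a single key/value projection, which shrinks the KV cache during decoding. Below is a minimal numpy sketch with made-up shapes; the real head counts and dimensions aren't disclosed.

```python
import numpy as np

def multi_query_attention(x, W_q, W_k, W_v, n_heads):
    """Multi-query attention: n_heads query heads share one key/value head."""
    seq, d_model = x.shape
    d_head = d_model // n_heads
    q = (x @ W_q).reshape(seq, n_heads, d_head)        # per-head queries
    k = x @ W_k                                         # single shared key head
    v = x @ W_v                                         # single shared value head
    scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)           # softmax over keys
    out = np.einsum("hqk,kd->qhd", weights, v)
    return out.reshape(seq, d_model)

rng = np.random.default_rng(0)
d_model, n_heads = 64, 8
d_head = d_model // n_heads
x = rng.normal(size=(16, d_model))                      # toy 16-token sequence
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_head))                # K/V cache is n_heads times smaller
W_v = rng.normal(size=(d_model, d_head))
print(multi_query_attention(x, W_q, W_k, W_v, n_heads).shape)  # (16, 64)
```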

## Training
They go on to discuss the immense complexity of training models at this scale. Ultra is trained on a large fleet of TPUv4 accelerators
across multiple SuperPods (4096 chips each), so we would expect the total number of chips to be somewhere in the tens of thousands. The intra-cluster
and inter-cluster networks are fast enough to support commonly used synchronous training techniques such as model and data parallelism. They
implement everything in JAX and are able to improve overall goodput (time spent computing over the total elapsed training time) from 85% (PaLM)
to 97%. I've quite enjoyed using JAX lately; it's become a favorite in a couple of reinforcement learning projects thanks to the ability to
vectorize environments and run an entire RL setup purely on the GPU.
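To make that last point concrete, here's roughly the pattern I mean: a toy, hypothetical `env_step` turned into a batched step with `jax.vmap` and compiled with `jax.jit`, so thousands of environments advance in one accelerator call. None of this is from the Gemini report; it's just an illustration of the workflow.

```python
import jax
import jax.numpy as jnp

def env_step(state, action):
    """Toy, hypothetical environment: the state drifts toward the action."""
    new_state = state + 0.1 * (action - state)
    reward = -jnp.abs(action - new_state)
    return new_state, reward

# vmap maps the single-environment step over a batch dimension, and jit compiles
# the batched step so an entire rollout can stay on the accelerator.
batched_step = jax.jit(jax.vmap(env_step))

states = jnp.zeros(4096)      # 4096 parallel environments
actions = jnp.ones(4096)
states, rewards = batched_step(states, actions)
print(rewards.shape)          # (4096,)
```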

Gemini's dataset is both multimodal and multilingual, as one would expect given Google's reach. The recent rumor about Gemini's
delay claimed that the model wasn't handling multilingual prompts very well, but it seems that has since been addressed. The data comes from web
documents, books and code, and includes image, audio and video data. I speculate they have been able to extract a lot of high-quality textual
data from audio and video, but we can't know for certain because they don't tell us how many tokens were used. SentencePiece is used as the tokenizer;
however, SentencePiece isn't a tokenizer in itself but rather a tokenization framework, so it's unclear exactly which algorithm is used. Despite not telling
us how many tokens the models are trained on, they do state that they follow the approaches of Chinchilla and Llama, with smaller models trained
on significantly more tokens to improve performance for a given inference budget.
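For what it's worth, the actual algorithm in SentencePiece is a single training flag. A tiny, purely illustrative example follows; the corpus file, vocab size and `model_type` here are my assumptions, not anything the report discloses.

```python
import sentencepiece as spm

# Train a toy SentencePiece model; "model_type" is where the algorithm is chosen:
# "unigram" (the default) or "bpe". The Gemini report doesn't say which is used.
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # hypothetical local text file
    model_prefix="toy_tok",
    vocab_size=8000,
    model_type="unigram",      # swap for "bpe" to get byte-pair encoding
)

sp = spm.SentencePieceProcessor(model_file="toy_tok.model")
print(sp.encode("Gemini is a decoder-only Transformer", out_type=str))
```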

