---
layout: post
title: "Gemini"
categories: [NLP]
year: 2023
type: paper
author: Gemini Team, Google
exturl: https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
---
It's finally here. After months of rumors, question marks and speculation, it looks like Google actually pulled through. Sentiment
was pretty split on whether they would succeed, especially considering the rumors just last week that the launch had been pushed back
to next year. The question, though, is whether it's already a bit too late: in the research community the general hype around LLMs
seems to have cooled significantly, and there's a looming sentiment that we've exhausted the scaling laws and are only looking at
diminishing returns on future computational efforts. Anyway, Google reports that Gemini manages to beat GPT-4 across most benchmarks,
so let's focus on the present and see what there is to gather from their technical report, even though I worry that it's going to be
very little...

## Recycling old architectures?
Gemini is, to no one's surprise, a decoder-only Transformer model, enhanced with numerous optimizations to enable stable training
and optimized inference *on Google's Tensor Processing Units*. The TPU sees very little love outside of Google's internal work, primarily
because Google only sells them to select Cloud partners. That means they can't leverage the workhorse that is the open-source community
and instead have to rely on a lot of internal work. While this is costly, it comes with the advantage of complete control over the
entire stack. Tangent aside, the Gemini family (Ultra, Pro and Nano) is trained to support a 32k context length, with multi-query attention
for efficient inference. The most noteworthy architectural detail is probably that Gemini is trained, beginning to end, across multiple
modalities, accommodating audio, video and image input. Videos are encoded as a sequence of image tokens. Interestingly, they claim that
the Pro model was trained in a matter of **weeks**, leveraging a fraction of Ultra's resources. The Nano series (two models at 1.8B and 3.25B
parameters) is distilled to produce best-in-class small language model performance, and the models are 4-bit (!!) quantized to run on-device.
Unfortunately, that is all they tell us about the architecture.
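
The report doesn't spell out how multi-query attention is implemented, but the general idea is easy to sketch: all query heads share a
single key/value head, which shrinks the KV cache and speeds up incremental decoding. Here's a minimal JAX sketch of the mechanism;
the shapes, initialization and causal mask are my own illustrative assumptions, not Gemini's actual configuration.

```python
# Minimal sketch of multi-query attention (MQA): many query heads, one shared K/V head.
import jax
import jax.numpy as jnp

def multi_query_attention(x, wq, wk, wv, wo, num_heads):
    """x: (seq, d_model). Queries get `num_heads` heads; keys/values share a single head."""
    seq, d_model = x.shape
    head_dim = d_model // num_heads

    q = (x @ wq).reshape(seq, num_heads, head_dim)  # (seq, heads, head_dim)
    k = x @ wk                                      # (seq, head_dim) -- shared across heads
    v = x @ wv                                      # (seq, head_dim)

    # Attention scores of every query head against the single shared key head.
    scores = jnp.einsum("qhd,kd->hqk", q, k) / jnp.sqrt(head_dim)
    # Causal mask: each position may only attend to itself and earlier positions.
    mask = jnp.tril(jnp.ones((seq, seq), dtype=bool))
    scores = jnp.where(mask, scores, -jnp.inf)
    weights = jax.nn.softmax(scores, axis=-1)

    out = jnp.einsum("hqk,kd->qhd", weights, v).reshape(seq, d_model)
    return out @ wo

# Toy usage with made-up sizes.
d_model, num_heads, seq = 64, 8, 16
head_dim = d_model // num_heads
ks = jax.random.split(jax.random.PRNGKey(0), 5)
x  = jax.random.normal(ks[0], (seq, d_model))
wq = jax.random.normal(ks[1], (d_model, d_model)) * 0.02
wk = jax.random.normal(ks[2], (d_model, head_dim)) * 0.02
wv = jax.random.normal(ks[3], (d_model, head_dim)) * 0.02
wo = jax.random.normal(ks[4], (d_model, d_model)) * 0.02
print(multi_query_attention(x, wq, wk, wv, wo, num_heads).shape)  # (16, 64)
```

The practical win is at inference time: the KV cache holds one head's worth of keys and values per layer instead of `num_heads`, which
matters a lot at a 32k context length.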
## Training
They go on to discuss the immense complexities of training models at this scale. Ultra is trained on a large fleet of TPUv4 accelerators
across multiple SuperPods (4096 chips each), so we would expect the total number of chips to be somewhere in the tens of thousands. The
intra-cluster and inter-cluster networks are fast enough to warrant commonly used synchronous training techniques such as model and data
parallelism. They implement everything using JAX and are able to improve overall goodput (time spent computing over the total elapsed
training time) from 85% (PaLM) to 97%. I've quite enjoyed using JAX lately; it's become a favorite in a couple of reinforcement learning
projects thanks to the ability to vectorize environments and run an entire RL setup on-GPU.
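
As a toy illustration of the kind of synchronous data parallelism they describe (this is nothing like Google's actual training code), a
JAX `pmap` training step replicates the parameters across devices, gives each device a shard of the batch, and averages gradients with
an all-reduce before every update:

```python
# Toy synchronous data-parallel training step with jax.pmap.
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="devices")
def train_step(params, x, y):
    loss, grads = jax.value_and_grad(loss_fn)(params, x, y)
    # All-reduce: every device averages gradients so replicas update in lockstep.
    grads = jax.lax.pmean(grads, axis_name="devices")
    new_params = jax.tree_util.tree_map(lambda p, g: p - 1e-2 * g, params, grads)
    return new_params, loss

n_dev = jax.local_device_count()
params = {"w": jnp.zeros((4, 1)), "b": jnp.zeros((1,))}
# Replicate parameters across devices; shard the batch along the leading axis.
params = jax.device_put_replicated(params, jax.local_devices())
x = jax.random.normal(jax.random.PRNGKey(0), (n_dev, 32, 4))
y = jnp.ones((n_dev, 32, 1))
params, loss = train_step(params, x, y)
print(loss)  # one loss value per device
```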
Gemini's dataset is both multimodal and multilingual, as one would expect given the extent of Google's reach. The recent rumor about
Gemini's delay claimed that the model wasn't handling multilingual prompts very well, but it seems that has since been addressed. The
data comes from web documents, books and code, and includes image, audio and video data. I speculate they have been able to extract a
lot of high-quality textual data from audio and video, but we can't know for certain because they don't disclose the number of tokens
used. SentencePiece is used as the tokenizer; however, SentencePiece is a tokenization framework rather than a single algorithm, so
it's unclear exactly which subword algorithm they chose. Despite not telling us how many tokens the models are trained on, they do
state that they follow the approaches of Chinchilla and Llama, with smaller models training for significantly more tokens to improve
performance for a given inference budget.
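
The ambiguity is baked into SentencePiece itself: the same frontend can train either a BPE or a unigram model. A minimal example of
what "using SentencePiece" can mean in practice; the corpus path, vocabulary size and model type below are made-up placeholders, not
anything taken from the report.

```python
# Train a small SentencePiece tokenizer on a local text corpus (illustrative only).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",          # hypothetical plain-text training corpus
    model_prefix="demo_tok",     # writes demo_tok.model and demo_tok.vocab
    vocab_size=32000,
    model_type="bpe",            # could just as well be "unigram"; the report doesn't say
)

sp = spm.SentencePieceProcessor(model_file="demo_tok.model")
print(sp.encode("Gemini is a decoder-only Transformer.", out_type=str))
```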