

Gemma 3 Technical Report

Gemma Team, Google DeepMind†

arXiv:2503.19786v1 [cs.CL] 25 Mar 2025

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages, and longer context of at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span of local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction-finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following, and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.

1. Introduction

We present the newest version of Gemma open language models (Gemma Team, 2024a), co-designed with the family of Gemini frontier models (Gemini Team, 2023). This new version comes in sizes comparable to Gemma 2 (Gemma Team, 2024b), with the addition of a 1B model. These models are designed to run on standard consumer-grade hardware such as phones, laptops, and high-end GPUs. This version brings several new abilities to the Gemma family, namely multimodality, long context, and multilinguality, while preserving or surpassing the performance of prior versions.

In terms of multimodality, most Gemma 3 models are compatible with a tailored version of the SigLIP vision encoder (Zhai et al., 2023). The language models treat images as a sequence of soft tokens encoded by SigLIP. We reduce the inference cost of image processing by condensing the vision embeddings into a fixed size of 256 vectors. The encoder works at a fixed resolution, and we take inspiration from LLaVA (Liu et al., 2024) to enable flexible resolutions with a Pan and Scan (P&S) method.

The second main architectural improvement is an increase in context size to 128K tokens, without reducing performance. A challenge with long context is the memory explosion of the KV cache during inference. To reduce this issue, we interleave multiple local layers between each global layer, and assign a smaller span of only 1024 tokens to the local layers. Therefore, only the global layers attend to long context, and we have 1 global layer for every 5 local layers.

The pre-training optimization recipe is similar to Gemma 2, with some modifications in the architecture design. We use the same tokenizer as Gemini 2.0, and we also revisit our data mixture to improve the multilingual capabilities of the models, while introducing image understanding. All Gemma 3 models are trained with knowledge distillation (Hinton et al., 2015).

In post-training, we focus our efforts on improving mathematics, reasoning, and chat abilities, as well as integrating the new capabilities of Gemma 3, long context and image inputs. We use a novel post-training approach that brings gains across all capabilities, including math, coding, chat, instruction following, and multilingual. The resulting Gemma 3 instruction-tuned models are both powerful and versatile, outperforming their predecessors by a wide margin.

In the following sections, we provide a brief overview of our models, including the architecture and pre- and post-training recipes. We also provide detailed evaluations across a wide variety of quantitative and qualitative benchmarks. We discuss our approach to safe and responsible deployment and outline the broader implications of Gemma 3, its limitations, and advantages.

† See Contributions and Acknowledgments section for full author list. Please send correspondence to [email protected].

© 2025 Google DeepMind. All rights reserved.

Model   Vision Encoder   Embedding Parameters   Non-embedding Parameters
1B      0                  302M                    698M
4B      417M               675M                  3,209M
12B     417M             1,012M                 10,759M
27B     417M             1,416M                 25,600M

Table 1 | Parameter counts for the Gemma 3 models. Our vocabulary has 256k entries.

Figure 1 | Example of visual interaction with the Gemma 3 27B IT model.

2. Model Architecture

Gemma 3 models follow the same general decoder-only transformer architecture as previous iterations (Vaswani et al., 2017), with most architecture elements similar to the first two Gemma versions. We use Grouped-Query Attention (GQA) (Ainslie et al., 2023) with post-norm and pre-norm with RMSNorm (Zhang and Sennrich, 2019). Inspired by Dehghani et al. (2023), Wortsman et al. (2023) and Chameleon Team (2024), we replace the soft-capping of Gemma 2 with QK-norm. In this section, we focus on some key differences from previous versions below.

5:1 interleaving of local/global layers. We alternate between a local sliding window self-attention (Beltagy et al., 2020) and global self-attention (Luong et al., 2015), with a pattern of 5 local layers for every global layer, starting with a local layer as the first layer of the model.

Long context. Gemma 3 models support a context length of 128K tokens, with the exception of the 1B model that has 32K. We increase the RoPE base frequency from 10k to 1M on global self-attention layers, and keep the frequency of the local layers at 10k. We follow a process similar to the positional interpolation of Chen et al. (2023) to extend the span of the global self-attention layers.

2.1. Vision modality

Vision encoder. We use a 400M variant of the SigLIP encoder (Zhai et al., 2023), a Vision Transformer (Dosovitskiy, 2020) trained with a variation of the CLIP loss (Radford et al., 2021). The Gemma vision encoder takes as input square images resized to 896 x 896, and is finetuned on data from visual assistant tasks. For simplicity, we share the vision encoder across our 4B, 12B, and 27B models, keeping it frozen during training.

Pan & Scan (P&S). The Gemma vision encoder operates at a fixed resolution of 896 × 896. This results in artifacts when processing non-square aspect ratios and high-resolution images, leading to unreadable text or small objects disappearing. We address this issue with an adaptive windowing algorithm during inference. This algorithm segments images into non-overlapping crops of equal size, covering the whole image, and resizes them to 896 × 896 pixels to pass them to the encoder. This windowing is applied only when necessary, and controls the maximum number of crops. It is an inference-time-only optimization and can be disabled for faster inference.
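To make the layer pattern concrete, the following is a minimal illustrative sketch of the 5:1 local/global interleaving and the 1024-token sliding-window mask described above; the helper names and the small layer count are assumptions for illustration, not the actual Gemma 3 implementation.

```python
# Illustrative sketch only: the layer pattern and mask construction below are
# assumptions based on the description in this section (5 local layers per
# global layer, local sliding window of 1024 tokens), not Google's code.
import numpy as np

def layer_pattern(num_layers: int, ratio: int = 5) -> list:
    """Attention type per layer: `ratio` local layers for every global layer,
    starting with a local layer as the first layer of the model."""
    return ["global" if (i + 1) % (ratio + 1) == 0 else "local"
            for i in range(num_layers)]

def attention_mask(seq_len: int, kind: str, window: int = 1024) -> np.ndarray:
    """Causal mask; local layers additionally restrict attention to the last
    `window` tokens, which is what bounds their KV cache."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q
    if kind == "local":
        causal &= (q - k) < window
    return causal

if __name__ == "__main__":
    print(layer_pattern(12))                       # 5 local, then 1 global, repeated
    print(attention_mask(8, "local", window=4).astype(int))
```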


Gemma « Technical Report

                          Shards
Model   Type     #Chips   Data   Seq.   Replica
1B      TPUv5e    512     16     16       2
4B      TPUv5e   2048     16     16       8
12B     TPUv4    6144     16     16      24
27B     TPUv5p   6144     24      8      32

Table 2 | Training infrastructure with sharding by data, sequence (Seq.), and replica.

          Raw (GB)   Quantized (GB)
Model     bf16       Int4    Int4 (blocks=32)   SFP8
1B         2.0        0.5     0.7                1.0
1B  +KV    2.9        1.4     1.6                1.9
4B         8.0        2.6     2.9                4.4
4B  +KV   12.7        7.3     7.6                9.1
12B       24.0        6.6     7.1               12.4
12B +KV   38.9       21.5    22.0               27.3
27B       54.0       14.1    15.3               27.4
27B +KV   72.7       32.8    34.0               46.1

Table 3 | Memory footprints (in GB) comparison between raw (bfloat16) and quantized checkpoints, for weights and KV caching (+KV) at a context size of 32,768 tokens, quantized in 8 bits.

2.2. Pre-training

We follow a similar recipe as in Gemma 2 for pre-training with knowledge distillation.

Training data. We pre-train our models on a slightly larger token budget than Gemma 2, i.e., we train on 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens for the 1B. The increase in tokens accounts for the mix of images and text used during pre-training. We also increase the amount of multilingual data to improve language coverage. We add both monolingual and parallel data, and we handle the imbalance in language representation using a strategy inspired by Chung et al. (2023).

Tokenizer. We use the same tokenizer as Gemini 2.0: a SentencePiece tokenizer with split digits, preserved whitespace, and byte-level encodings (Kudo and Richardson, 2018). The resulting vocabulary has 262k entries. This tokenizer is more balanced for non-English languages.

Filtering. We use filtering techniques that reduce the risk of unwanted or unsafe utterances and remove certain personal information and other sensitive data. We decontaminate evaluation sets from our pre-training data mixture, and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. We also apply a quality reweighting step inspired by Sachdeva et al. (2024) to reduce occurrences of low-quality data.

Distillation. We sample 256 logits per token, weighted by teacher probabilities. The student learns the teacher's distribution within these samples via cross-entropy loss. The teacher's target distribution is set to zero probability for non-sampled logits, and renormalized.
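As an illustration of this sampled-logits distillation, here is a small sketch; the per-token sampling without replacement and the renormalization of the student distribution over the sampled ids are assumptions made for clarity, not the exact training implementation.

```python
# Illustrative sketch (not the Gemma training code): distillation on a
# 256-entry subset of the teacher distribution, renormalized, with the
# student trained by cross-entropy on those sampled logits only.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sampled_distillation_loss(teacher_logits, student_logits, k=256):
    """teacher_logits, student_logits: [seq_len, vocab_size]."""
    seq_len, vocab = teacher_logits.shape
    p_teacher = softmax(teacher_logits)
    loss = 0.0
    for t in range(seq_len):
        # Sample k vocabulary ids per token, weighted by teacher probabilities.
        ids = rng.choice(vocab, size=k, replace=False, p=p_teacher[t])
        # Teacher target: zero probability outside the sample, renormalized inside it.
        target = p_teacher[t, ids]
        target = target / target.sum()
        # Student distribution restricted (by assumption) to the same sampled ids.
        log_q = np.log(softmax(student_logits[t, ids]))
        loss += -(target * log_q).sum()   # cross-entropy over the sampled ids
    return loss / seq_len

teacher = rng.normal(size=(4, 1000))
student = rng.normal(size=(4, 1000))
print(sampled_distillation_loss(teacher, student, k=256))
```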
2.3. Quantization Aware Training

Along with the raw checkpoints, we also provide quantized versions of our models in different standard formats. These versions are obtained by finetuning each model for a small number of steps, typically 5,000, using Quantization Aware Training (QAT) (Jacob et al., 2018). We use probabilities from the non-quantized checkpoint as targets, and adapt the data to match the pre-training and post-training distributions. Based on the most popular open-source quantization inference engines (e.g. llama.cpp), we focus on three weight representations: per-channel int4, per-block int4, and switched fp8. In Table 3, we report the memory filled by raw and quantized models for each weight representation, with and without a KV cache, for a sequence of 32k tokens.
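For intuition about these weight formats, the following generic sketch contrasts per-channel and per-block (block size 32) symmetric int4 quantization; it illustrates the trade-off between scale overhead and quantization error, and is not the exact encoding of the released checkpoints or the QAT procedure.

```python
# Generic illustration of per-channel vs. per-block int4 weight quantization
# (symmetric, scale-only). Not the exact format of the released checkpoints.
import numpy as np

def quantize_int4(w, block=None):
    """w: [rows, cols]. block=None -> one scale per output channel (row);
    block=32 -> one scale per contiguous group of 32 weights in a row."""
    rows, cols = w.shape
    if block is None:
        groups = w.reshape(rows, 1, cols)               # per-channel
    else:
        groups = w.reshape(rows, cols // block, block)  # per-block
    scale = np.abs(groups).max(axis=-1, keepdims=True) / 7.0   # int4 range [-8, 7]
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale, shape):
    return (q * scale).reshape(shape)

w = np.random.default_rng(0).normal(size=(8, 64)).astype(np.float32)
for block in (None, 32):
    q, s = quantize_int4(w, block)
    err = np.abs(w - dequantize(q, s, w.shape)).mean()
    # Smaller blocks mean more scales (overhead) but lower quantization error.
    print(f"block={block}: mean abs error {err:.4f}, number of scales {s.size}")
```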
2.4. Compute Infrastructure

We train our models with TPUv4, TPUv5e, and TPUv5p as outlined in Table 2. Each model configuration is optimized to minimize training step time. For the vision encoder, we pre-compute the embeddings for each image and directly train with the embeddings, adding no cost to the training of the language models.

The optimizer state is sharded using an implementation of ZeRO-3 (Ren et al., 2021). For multi-pod training, we perform a data-replica reduction over the data center network, using the Pathways approach of Barham et al. (2022). We use the 'single controller' programming paradigm of Jax (Roberts et al., 2023) and Pathways (Barham et al., 2022), along with the GSPMD partitioner (Xu et al., 2021) and the MegaScale XLA compiler (XLA, 2019).


Context        Formatting
User turn      <start_of_turn>user
Model turn     <start_of_turn>model
End of turn    <end_of_turn>

Example of discussion:
User: Who are you?
Model: My name is Gemma!
User: What is 2+2?
Model: 2+2=4.

Model input:
[BOS]<start_of_turn>user
Who are you?<end_of_turn>
<start_of_turn>model
My name is Gemma!<end_of_turn>
<start_of_turn>user
What is 2+2?<end_of_turn>
<start_of_turn>model

Model output:
2+2=4.<end_of_turn>

Table 4 | Formatting for Gemma IT models. Explicitly add the [BOS] token after tokenization, or use the add_bos=True option in the tokenizer. Do not tokenize the text "[BOS]".

3. Instruction-Tuning

Pre-trained models are turned into instruction-tuned models with an improved post-training approach compared to our prior recipe (see Table 6).

Techniques. Our post-training approach relies on an improved version of knowledge distillation (Agarwal et al., 2024; Anil et al., 2018; Hinton et al., 2015) from a large IT teacher, along with an RL finetuning phase based on improved versions of BOND (Sessa et al., 2024), WARM (Ramé et al., 2024b), and WARP (Ramé et al., 2024a).

Reinforcement learning objectives. We use a variety of reward functions to improve helpfulness, math, coding, reasoning, instruction-following, and multilingual abilities, while minimizing model harmfulness. This includes learning from weight-averaged reward models (Ramé et al., 2024b) trained with human feedback data, code execution feedback (Gehring et al., 2024), and ground-truth rewards for solving math problems (DeepSeek-AI, 2025; Lambert et al., 2024).

Data filtering. We carefully optimize the data used in post-training to maximize model performance. We filter examples that show certain personal information, unsafe or toxic model outputs, mistaken self-identification data, and duplicated examples. Including subsets of data that encourage better in-context attribution, hedging, and refusals to minimize hallucinations also improves performance on factuality metrics, without degrading model performance on other metrics.

[BOS] token. For both PT and IT models, text starts with a [BOS] token that needs to be added explicitly, since the text "[BOS]" does not map to the [BOS] token. For instance, Flax has an option, add_bos=True, to add this token automatically when tokenizing. An example of the formatting for an IT model is shown in Table 4.

PT versus IT formatting. All models share the same tokenizer, with some control tokens dedicated to IT formatting. A key difference is that PT models output an <eos> token at the end of generation, while IT models output an <end_of_turn> token at the end of the generation, as shown for IT in Table 4. Fine-tuning either model type thus also requires adding their respective end tokens.
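A minimal sketch of assembling a prompt in this format follows; the control tokens mirror Table 4, while the helper function itself is an illustrative assumption rather than an official API.

```python
# Illustrative sketch of building a Gemma IT prompt in the Table 4 format.
# The control tokens come from the table above; the helper is not an official API.
def build_gemma_prompt(turns):
    """turns: list of (role, text) with role in {"user", "model"}.
    Returns the text to tokenize; [BOS] must be added as a token by the
    tokenizer (e.g. add_bos=True), never written out as the string "[BOS]"."""
    parts = []
    for role, text in turns:
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    # Leave the prompt open on a model turn so generation continues from here;
    # the model is expected to end its reply with <end_of_turn>.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)

prompt = build_gemma_prompt([
    ("user", "Who are you?"),
    ("model", "My name is Gemma!"),
    ("user", "What is 2+2?"),
])
print(prompt)
```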
4. Evaluation of final models

In this section, we evaluate the IT models over a series of automated benchmarks and human evaluations across a variety of domains, as well as static benchmarks such as MMLU.

4.1. LMSYS Chatbot Arena

In this section, we report the performance of our IT 27B model on LMSys Chatbot Arena (Chiang et al., 2024) in blind side-by-side evaluations by human raters against other state-of-the-art models. We report Elo scores in Table 5.


Rank   Model                                   Elo    95% CI    Open   Type    #params/#activated
1      Grok-3-Preview-02-24                    1412   +8/-10    -      -       -
1      GPT-4.5-Preview                         1411   +11/-11   -      -       -
3      Gemini-2.0-Flash-Thinking-Exp-01-21     1384   +6/-5     -      -       -
3      Gemini-2.0-Pro-Exp-02-05                1380   +5/-6     -      -       -
3      ChatGPT-4o-latest (2025-01-29)          1377   +5/-4     -      -       -
6      DeepSeek-R1                             1363   +8/-6     yes    MoE     671B/37B
6      Gemini-2.0-Flash-001                    1357   +6/-5     -      -       -
8      o1-2024-12-17                           1352   +4/-6     -      -       -
9      Gemma-3-27B-IT                          1338   +8/-9     yes    Dense   27B
9      Qwen2.5-Max                             1336   +7/-5     -      -       -
9      o1-preview                              1335   +4/-3     -      -       -
9      o3-mini-high                            1329   +8/-6     -      -       -
13     DeepSeek-V3                             1318   +8/-6     yes    MoE     671B/37B
14     GLM-4-Plus-0111                         1311   +8/-8     -      -       -
14     Qwen-Plus-0125                          1310   +7/-5     -      -       -
14     Claude 3.7 Sonnet                       1309   +9/-11    -      -       -
14     Gemini-2.0-Flash-Lite                   1308   +5/-5     -      -       -
18     Step-2-16K-Exp                          1305   +7/-6     -      -       -
18     o3-mini                                 1304   +5/-4     -      -       -
18     o1-mini                                 1304   +4/-3     -      -       -
18     Gemini-1.5-Pro-002                      1302   +3/-3     -      -       -
...
28     Meta-Llama-3.1-405B-Instruct-bf16       1269   +4/-3     yes    Dense   405B
...
38     Llama-3.3-70B-Instruct                  1257   +5/-3     yes    Dense   70B
...
39     Qwen2.5-72B-Instruct                    1257   +3/-3     yes    Dense   72B
...
59     Gemma-2-27B-it                          1220   +3/-2     yes    Dense   27B

Table 5 | Evaluation of the Gemma 3 27B IT model in the Chatbot Arena (Chiang et al., 2024). All the models are evaluated against each other through blind side-by-side evaluations by human raters. Each model is attributed a score, based on the Elo rating system. Gemma-3-27B-IT numbers are preliminary results received on March 8, 2025.

Gemma 3 27B IT (1338) is among the top 10 best models, with a score above other non-thinking open models, such as DeepSeek-V3 (1318), LLaMA 3 405B (1257), and Qwen2.5-70B (1257), which are much larger models. Finally, the Elo of Gemma 3 is significantly higher than that of Gemma 2, at 1220. Note that Elo scores do not take into account visual abilities, which none of the aforementioned models have.

4.2. Standard benchmarks

In Table 6, we show the performance of our final models across a variety of benchmarks compared to our previous model iteration, and Gemini 1.5. We do not compare directly with external models that often report their own evaluation settings, since running them in our setting does not guarantee a fair comparison. We encourage the reader to follow third-party static leaderboards for a fairer comparison across models. We include additional evaluations of our models on other benchmarks in the appendix.

5. Ablations

In this section, we focus on the impact of our architecture changes, as well as some of the vision abilities new to this model.

5.1. Pre-training ability probing

We use several standard benchmarks as probes during pre-training to ensure our models capture general abilities, and in Figure 2, we compare the quality of pre-trained models from Gemma 2 and 3 across these general abilities, namely science, code, factuality, multilinguality, reasoning, and vision. The details of the performance across the different public benchmarks used in these plots are summarized in the appendix. Overall, we see that the new versions improve in most categories, despite the addition of vision. We particularly focus on multilinguality in this version, and this directly impacts the quality of our models. However, despite the use of decontamination techniques, there is always a risk of contamination of these probes (Mirzadeh et al., 2024), making more definitive conclusions harder to assess.


                    Gemini 1.5      Gemini 2.0      Gemma 2              Gemma 3
                    Flash   Pro     Flash   Pro     2B     9B     27B    1B     4B     12B    27B
MMLU-Pro            67.3    75.8    77.6    79.1    15.6   46.8   56.9   14.7   43.6   60.6   67.5
LiveCodeBench       30.7    34.2    34.5    36.0     1.2   10.8   20.4    1.9   12.6   24.6   29.7
Bird-SQL (dev)      45.6    54.4    58.7    59.3    12.2   33.8   46.7    6.4   36.3   47.9   54.4
GPQA Diamond        51.0    59.1    60.1    64.7    24.7   28.8   34.3   19.2   30.8   40.9   42.4
SimpleQA             8.6    24.9    29.9    44.3     2.8    5.3    9.2    2.2    4.0    6.3   10.0
FACTS Grounding     82.9    80.0    84.6    82.8    43.8   62.0   62.4   36.4   70.1   75.8   74.9
Global MMLU-Lite    73.7    80.8    83.4    86.5    41.9   64.8   68.6   34.2   54.5   69.5   75.1
MATH                77.9    86.5    90.9    91.8    27.2   49.4   55.6   48.0   75.6   83.8   89.0
HiddenMath          47.2    52.0    63.5    65.2     1.8   10.4   14.8   15.8   43.0   54.5   60.3
MMMU (val)          62.3    65.9    71.7    72.7     -      -      -      -     48.8   59.6   64.9

Table 6 | Performance of instruction fine-tuned (IT) models compared to Gemini 1.5, Gemini 2.0, and Gemma 2 on zero-shot benchmarks across different abilities.

Figure 2 | Summary of the performance of different pre-trained models from Gemma 2 and 3 across general abilities. These plots are meant to give a simplified summary and details are in the appendix.

5.2. Local:Global attention layers

We measure the impact of changes to local and global self-attention layers on performance and memory consumption during inference.

Local:Global ratio. In Fig. 3, we compare different ratios of local to global attention layers. 1:1 is used in Gemma 2 models, and 5:1 is used in Gemma 3. We observe minimal impact on perplexity when changing this ratio.

Figure 3 | Impact of Local:Global ratio on the perplexity on a validation set. The impact is minimal, even with 7-to-1 local to global. This ablation is run with text-only models.

Sliding window size. In Fig. 4, we compare different sliding window sizes for the local attention layers in different global:local ratio configurations. The sliding window can be reduced significantly without impacting perplexity.


Figure 4 | Impact of sliding window size on perplexity measured on a validation set. We consider 2 2B models, with 1:1 and 1:3 local to global layer ratios. This ablation is run with text-only models.

Impact on KV cache memory. In Fig. 5, we show the balance between the memory used by the model and the KV cache during inference with a context of 32k tokens. The "global only" configuration is the standard configuration used across most dense models. The "1:1, sw=4096" configuration is used in Gemma 2. We observe that the "global only" configuration results in a memory overhead of 60%, while this is reduced to less than 15% with 1:3 and sliding windows of 1024 ("sw=1024"). In Fig. 6, we compute the memory used by the KV cache as a function of the context length with either our 2B architecture (L:G=5:1, sw=1024) or a "global only" 2B model.
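The following back-of-the-envelope sketch shows why the interleaving shrinks the KV cache: global layers cache keys and values for the full context, while local layers cache at most the sliding-window span. All dimensions (layer count, KV heads, head size, bf16 cache) are illustrative assumptions, not the exact Gemma 3 configurations.

```python
# Rough KV-cache size estimate for a hypothetical 2B-scale model; all
# dimensions below are illustrative assumptions, not Gemma 3's exact ones.
def kv_cache_bytes(context, num_layers, num_kv_heads, head_dim,
                   local_ratio=5, window=1024, bytes_per_value=2):
    """K and V caches for every layer; local layers only keep `window` tokens."""
    num_global = num_layers // (local_ratio + 1)
    num_local = num_layers - num_global
    per_token = 2 * num_kv_heads * head_dim * bytes_per_value   # K + V
    global_part = num_global * context * per_token
    local_part = num_local * min(context, window) * per_token
    return global_part + local_part

cfg = dict(num_layers=24, num_kv_heads=4, head_dim=256)
for context in (8_192, 32_768, 131_072):
    ours = kv_cache_bytes(context, **cfg, local_ratio=5, window=1024)
    global_only = kv_cache_bytes(context, **cfg, local_ratio=0, window=context)
    print(f"context={context:>7}: 5:1/sw=1024 {ours / 2**30:.2f} GiB "
          f"vs global-only {global_only / 2**30:.2f} GiB")
```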

Figure 5 | Model versus KV cache memory during inference with a pre-fill KV cache of size 32k. We consider a 2B model with different local to global ratios and sliding window sizes (sw). We compare to global only, which is the standard used in Gemma 1 and Llama. This ablation is run with a text-only model.

Figure 6 | KV cache memory versus context length. We show the memory usage of the KV cache for our architecture (L:G=5:1, sw=1024) and a transformer with global attention only, as used in LLaMa or Gemma 1.

5.3. Enabling long context

Instead of training with 128K sequences from scratch, we pre-train our models with 32K sequences and then scale the 4B, 12B, and 27B models up to 128K tokens at the end of pre-training while rescaling RoPE (Chen et al., 2023). We find a scaling factor of 8 to work well in practice. Note that compared to Gemma 2, we have also increased the RoPE base frequency of global self-attention layers from 10k to 1M, while keeping 10k for the local self-attention layers. In Figure 7, we show the impact on perplexity for different context lengths. Our models generalize to 128K, but rapidly degrade as we continue to scale.

Figure 7 | Long context performance of pre-trained models before and after RoPE rescaling.
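A minimal sketch of the two RoPE choices discussed here, a 1M base frequency on global layers versus 10k on local layers and a positional-interpolation-style rescaling of positions by a factor of 8 on the global layers, is shown below; it only illustrates the mechanics under those assumptions and is not the exact Gemma 3 implementation.

```python
# Illustrative RoPE sketch: base frequency per layer type plus a position
# rescaling factor, in the spirit of positional interpolation (Chen et al., 2023).
import numpy as np

def rope_angles(positions, head_dim, base, scaling_factor=1.0):
    """Rotation angles for each (position, frequency) pair."""
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    scaled_pos = np.asarray(positions, dtype=np.float64) / scaling_factor
    return np.outer(scaled_pos, inv_freq)          # [len(positions), head_dim // 2]

def apply_rope(x, positions, base, scaling_factor=1.0):
    """x: [seq, head_dim] with head_dim even; rotate pairs of channels."""
    angles = rope_angles(positions, x.shape[-1], base, scaling_factor)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(0).normal(size=(4, 8))
positions = np.arange(100_000, 100_004)
local_out = apply_rope(x, positions, base=10_000)                        # local layers: 10k base
global_out = apply_rope(x, positions, base=1_000_000, scaling_factor=8)  # global layers: 1M base, 8x rescaling
print(local_out.shape, global_out.shape)
```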


5.4. Small versus large teacher

A common finding is that, to train a small model, it is preferable to distill from a smaller teacher. We suspect this is because these studies are often performed in settings where the regularization effect of using a worse teacher surpasses the benefit of using a better teacher. We train a student with 2 teachers of different sizes, one large and one small, for different training horizons. In Fig. 8, we observe that for short training horizons, the smaller teacher is better, but the trend is reversed for longer training.

Figure 8 | Small versus large teacher. Relative difference of perplexity when using a small and a large teacher as a function of the token size of training. Smaller numbers mean distilling from a larger teacher is better.

5.5. Vision encoder

Impact of image resolution. We use a vision encoder based on SigLIP (Zhai et al., 2023). The vision encoder is frozen, and only the language model is trained. Each image in this multimodal data is represented by 256 image tokens from the respective vision encoder. The higher-resolution encoders thus use average pooling to reduce their output to 256 tokens. For instance, the 896-resolution encoder has a 4x4 average pooling on its output. As shown in Table 7, higher-resolution encoders perform better than smaller ones.

Resolution   DocVQA   InfoVQA   TextVQA
256          31.9     23.1      44.1
448          45.4     31.6      53.5
896          59.8     33.7      58.0

Table 7 | Impact of image encoder input resolution. We measure performance using a short-schedule 2B Gemma model on a few evaluation benchmarks to observe the effect of input image resolution on vision encoder pre-training.

Pan & Scan. P&S enables capturing images at close to their native aspect ratio and image resolution. In Table 8, we compare our 27B IT model with and without P&S. As expected, the ability to treat images with close to native resolution greatly helps with tasks that require some form of reading text on images, which is particularly important for visual language models.

             DocVQA    InfoVQA    TextVQA
4B           72.8      44.1       58.9
4B w/ P&S    81.0      57.0       60.8
Δ            (+8.2)    (+12.9)    (+1.9)

27B          85.6      59.4       68.6
27B w/ P&S   90.4      76.4       70.2
Δ            (+4.8)    (+17.0)    (+1.6)

Table 8 | Impact of P&S. 4-shot evaluation results on the valid set, with and without P&S, on a pre-trained checkpoint. Boosts are on tasks associated with images with varying aspect ratios, or involving reading text on images.
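For intuition, a simple windowing sketch in the spirit of P&S is shown below; the crop heuristics and the maximum-crop cap are illustrative assumptions rather than the exact algorithm used by Gemma 3.

```python
# Illustrative Pan & Scan-style cropping sketch (not the exact Gemma 3 algorithm):
# split a non-square or high-resolution image into equal, non-overlapping crops
# that cover the whole image, each of which would then be resized to 896x896.
import math

def pan_and_scan_crops(width, height, crop_size=896, max_crops=4):
    """Return (num_cols, num_rows, crop boxes). Falls back to a single crop
    (i.e. P&S effectively disabled) when the image already fits the encoder."""
    cols = min(max(1, math.ceil(width / crop_size)), max_crops)
    rows = min(max(1, math.ceil(height / crop_size)), max_crops)
    # Cap the total number of crops, shrinking the longer side first.
    while cols * rows > max_crops:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    crop_w, crop_h = width / cols, height / rows
    boxes = [(int(c * crop_w), int(r * crop_h),
              int((c + 1) * crop_w), int((r + 1) * crop_h))
             for r in range(rows) for c in range(cols)]
    return cols, rows, boxes

# A wide document scan gets split into side-by-side crops instead of being
# squashed into a single 896x896 square; a small image stays as one crop.
print(pan_and_scan_crops(2048, 896))
print(pan_and_scan_crops(640, 480))
```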

6. Memorization and Privacy

Large language models may produce near-copies of some text used in training (Biderman et al., 2023; Carlini et al., 2021, 2022; Ippolito et al., 2022; Nasr et al., 2023). Several prior reports have released audits that quantify this risk by measuring the memorization rate (Anil et al., 2023; Chowdhery et al., 2022; Gemini Team, 2023, 2024; Gemma Team, 2024a,b; LLaMa Team, 2024). This "memorization rate"¹ is defined as the ratio of generations from the model that match its training data compared to all model generations, using the following setup. We follow the methodology described in Gemma Team (2024b) to measure it. Specifically, we subsample a large portion of training data distributed uniformly across different corpora and test for discoverable extraction (Nasr et al., 2023) of this content using a prefix of length 50 and a suffix of length 50. We denote text as either "exactly memorized" if all tokens in the continuation match the source suffix, or "approximately memorized" if they match up to an edit distance of 10%.

¹ "We do not state or imply [here] that a model 'contains' its training data in the sense that there is a copy of that data in the model. Rather, a model memorizes attributes of its training data such that in certain cases it is statistically able to generate such training data when following rules and using information about features of its training data that it does contain."
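A schematic version of this discoverable-extraction test is sketched below; the generate function is a placeholder for the model, and applying the 10% threshold to a token-level Levenshtein distance is an assumption made for illustration.

```python
# Schematic discoverable-extraction check (illustration only). `generate` is a
# placeholder for the model; the 10% threshold is applied on token edit distance.
def edit_distance(a, b):
    """Standard Levenshtein distance between two token sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def classify_memorization(prefix_tokens, true_suffix_tokens, generate,
                          suffix_len=50, approx_threshold=0.10):
    """Returns 'exact', 'approximate', or 'none' for one training example."""
    continuation = generate(prefix_tokens, max_new_tokens=suffix_len)
    if continuation[:suffix_len] == true_suffix_tokens[:suffix_len]:
        return "exact"
    dist = edit_distance(continuation[:suffix_len], true_suffix_tokens[:suffix_len])
    if dist <= approx_threshold * suffix_len:
        return "approximate"
    return "none"

# Toy usage with a fake "model" that echoes a memorized document.
memorized = list(range(100))
fake_generate = lambda prefix, max_new_tokens: memorized[len(prefix):len(prefix) + max_new_tokens]
print(classify_memorization(memorized[:50], memorized[50:], fake_generate))   # -> 'exact'
```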


Figure 9 | Total memorization rates for both exact and approximate memorization. Gemma 3 models memorize significantly less than all prior models. *No results for approximate memorization on these models.

Figure 9 compares the memorization rates across Gemma and Gemini models; these models are ordered in reverse chronological order, with the newest Gemma 3 models on the left. We find that Gemma 3 models memorize long-form text at a much lower rate than prior models (note the log y-axis). We observe only a marginal difference in the memorization rates between the 4B, 12B, and 27B models, with 1B memorizing less than these larger models. Further, we find that a larger proportion of text is characterized as approximately memorized, with a relative increase in approximate memorization compared to exact memorization of roughly 24x on average.

We also study the rate at which the generations may contain personal information. To identify potentially personal information, we use the Google Cloud Sensitive Data Protection (SDP) service.² SDP uses broad detection rules to identify text that may contain personal information. SDP is designed to have high recall and does not consider the context in which the information may appear, which leads to many false positives. Thus, we are likely overestimating the true amount of potentially personal information contained in the outputs classified as memorized. SDP also provides broad severity levels: low, medium, and high. We classify text as personal if SDP classifies it as personal information at any severity level. We observed no personal information in the outputs characterized as memorization for all Gemma 3 models. This indicates a low rate of personal data, below our detection thresholds, in outputs classified as memorization.

² https://cloud.google.com/sensitive-data-protection

7. Responsibility, Safety, Security

Responsibility, safety, and security are of utmost importance in the development of Gemma models. To reduce risks to Gemma 3 users, we have continued to integrate enhanced internal safety processes that span the development workflow, in line with recent Google AI models (Gemini Team, 2024). This focuses on safety mitigation at training time, and robust and transparent model evaluations for the new image-to-text capabilities we have introduced.

7.1. Governance & Assessment

Our approach to assessing the benefits and risks of Gemma is reflective of that outlined for Gemma 1 (Gemma Team, 2024a), taking into account the changes in supported modalities. We continue to believe that openness in AI can spread the benefits of these technologies across society, but must be evaluated against the risk of malicious uses that can cause harm on both individual and institutional levels (Weidinger et al., 2021). Since the inaugural Gemma launch, we have seen these models drive a number of socially beneficial applications, such as our own ShieldGemma 2, a 4B image safety classifier built with Gemma 3, which provides a ready-made solution for image safety, outputting safety labels across dangerous content, sexually explicit, and violence categories.


Releasing Gemma 3 models required specific attention to changes in model capabilities and close monitoring of the evolving risks of existing multimodal LLMs (Lin et al., 2024), as well as an understanding of the ways in which models are being used in the wild. Although we are yet to receive any reports of malicious use for Gemma, we remain committed to investigating any such reporting, and work with the academic and developer communities, as well as conduct our own monitoring, to flag such cases.

Despite advancements in capabilities, we believe that, given the number of larger powerful open models available, this release will have a negligible effect on the overall risk landscape.

7.2. Safety policies and train-time mitigations

A key pillar of Gemma's approach to safety is to align fine-tuned models with Google's safety policies, in line with Gemini models (Gemini Team, 2023). They are designed to help prevent our models from generating harmful content, i.e.,

• Child sexual abuse and exploitation
• Revealing personally identifiable information that can lead to harm (e.g., Social Security numbers)
• Hate speech and harassment
• Dangerous or malicious content (including promoting self-harm or instructing in harmful activities)
• Sexually explicit content
• Medical advice that runs contrary to scientific or medical consensus

We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. For fine-tuned models, we also use both SFT and RLHF to steer the model away from undesirable behavior.

7.3. Assurance Evaluations

We also run our IT models through a set of baseline assurance evaluations to understand the potential harms that our models can cause. As we champion open models, we also recognize that the irreversible nature of weight releases requires rigorous risk assessment. Our internal safety processes are designed accordingly, and for previous Gemma models we have also undertaken evaluations of capabilities relevant to extreme risks (Phuong et al., 2024; Shevlane et al., 2023). As we continue to develop and share open models, we will follow the heuristic that thoroughly evaluating a more capable model often provides sufficient assurance for less capable ones. As such, we prioritised a streamlined set of evaluations for Gemma 3, reserving in-depth dangerous capability assessments for cases where a specific model may present a potentially heightened risk (as described below on CBRN evaluations). We balance development speed with targeted safety testing, ensuring our evaluations are well-focused and efficient, while upholding the commitments laid out in our Frontier Safety Framework.

Baseline Evaluations

Baseline assurance captures the model violation rate for safety policies, using a large number of synthetic adversarial user queries, and human raters to label the answers as policy violating or not. Overall, the Gemma 3 violation rate is significantly low on these safety policies.

Chemical, Biological, Radiological and Nuclear (CBRN) knowledge

Owing to enhanced performance on STEM-related tasks, we evaluated knowledge relevant to biological, radiological, and nuclear risks using an internal dataset of closed-ended, knowledge-based multiple choice questions. For evaluations of chemical knowledge, we employed a closed-ended knowledge-based approach on chemical hazards developed by Macknight et al. Our evaluation suggests that the knowledge of Gemma 3 models in these domains is low.

7.4. Our approach to responsible open models

Designing safe, secure, and responsible applications requires a system-level approach, working to mitigate risks associated with each specific use case and environment. We will continue to adopt assessments and safety mitigations proportionate to the potential risks from our models, and will only share these with the community when we are confident that the benefits significantly outweigh the foreseeable risks.

8. Discussion and Conclusion

In this work, we have presented Gemma 3, the latest addition to the Gemma family of open language models for text, image, and code. In this version, we focus on adding image understanding and long context while improving multilinguality and STEM-related abilities. Our model sizes and architectures are designed to be compatible with standard hardware, and most of our architecture improvements are tailored to fit this hardware while maintaining performance.


References

Realworldqa. https://x.ai/news/grok-1.5v.

M. Acharya, K. Kafle, and C. Kanan. TallyQA: Answering complex counting questions. In AAAI, 2018.

R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In ICLR, 2024.

J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.

R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

M. Artetxe, S. Ruder, and D. Yogatama. On the cross-lingual transferability of monolingual representations. In ACL, 2020.

A. Asai, J. Kasai, J. H. Clark, K. Lee, E. Choi, and H. Hajishirzi. XOR QA: Cross-lingual open-retrieval question answering. arXiv preprint arXiv:2010.11856, 2020.

J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021.

P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, B. Saeta, P. Schuh, R. Sepassi, L. E. Shafey, C. A. Thekkath, and Y. Wu. Pathways: Asynchronous distributed dataflow for ML, 2022.

I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.

S. Biderman, U. Prashanth, L. Sutawika, H. Schoelkopf, Q. Anthony, S. Purohit, and E. Raff. Emergent and predictable memorization in large language models. NeurIPS, 36:28072–28090, 2023.

Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: Reasoning about physical commonsense in natural language. CoRR, abs/1911.11641, 2019.

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, et al. Extracting training data from large language models. In USENIX, 2021.

N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.

S. Chen, S. Wong, L. Chen, and Y. Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.

X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. ArXiv, abs/1504.00325, 2015.

W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating LLMs by human preference, 2024.

F. Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel. PaLM: Scaling language modeling with pathways, 2022.

H. W. Chung, N. Constant, X. Garcia, A. Roberts, Y. Tay, S. Narang, and O. Firat. UniMax: Fairer and more effective language sampling for large-scale multilingual pretraining, 2023.

C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044, 2019.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021.

DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning, 2025.

M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In ICML, 2023.

D. Deutsch, E. Briakou, I. Caswell, M. Finkelstein, R. Galor, J. Juraska, G. Kovacs, A. Lui, R. Rei, J. Riesa, S. Rijhwani, P. Riley, E. Salesky, F. Trabelsi, S. Winkler, B. Zhang, and M. Freitag. WMT24++: Expanding the language coverage of WMT24 to 55 languages & dialects, 2025.

A. Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In ACL, 2019.

B. Fatemi, M. Kazemi, A. Tsitsulin, K. Malkan, J. Yim, J. Palowitch, S. Seo, J. Halcrow, and B. Perozzi. Test of time: A benchmark for evaluating LLMs on temporal reasoning. arXiv preprint arXiv:2406.09170, 2024.

X. Fu, Y. Hu, B. Li, Y. Feng, H. Wang, X. Lin, D. Roth, N. A. Smith, W.-C. Ma, and R. Krishna. BLINK: Multimodal large language models can see but not perceive. ArXiv, abs/2404.12390, 2024.

J. Gehring, K. Zheng, J. Copet, V. Mella, T. Cohen, and G. Synnaeve. RLEF: Grounding code LLMs in execution feedback with reinforcement learning. arXiv preprint arXiv:2410.02089, 2024.

Gemini Team. Gemini: A family of highly capable multimodal models, 2023.

Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.

Gemma Team. Gemma: Open models based on Gemini research and technology, 2024a.

Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024b.

O. Goldman, U. Shaham, D. Malkin, S. Eiger, A. Hassidim, Y. Matias, J. Maynez, A. M. Gilady, J. Riesa, S. Rijhwani, L. Rimell, I. Szpektor, R. Tsarfaty, and M. Eyal. ECLeKTic: A novel challenge set for evaluation of cross-lingual knowledge transfer, 2025.

N. Goyal, C. Gao, V. Chaudhary, P.-J. Chen, G. Wenzek, D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, and A. Fan. The FLORES-101 evaluation benchmark for low-resource and multilingual machine translation. ACL, 2022.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020.

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the MATH dataset. NeurIPS, 2021.

J. Hessel, A. Marasović, J. D. Hwang, L. Lee, J. Da, R. Zellers, R. Mankoff, and Y. Choi. Do androids laugh at electric sheep? Humor "understanding" benchmarks from the New Yorker caption contest. arXiv preprint arXiv:2209.06293, 2022.

G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. RULER: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.

D. Ippolito, F. Tramèr, M. Nasr, C. Zhang, M. Jagielski, K. Lee, C. A. Choquette-Choo, and N. Carlini. Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022.

B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, 2018.

M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, abs/1705.03551, 2017.

M. Kazemi, H. Alvari, A. Anand, J. Wu, X. Chen, and R. Soricut. GeomVerse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023.

M. Kazemi, N. Dikkala, A. Anand, P. Devic, I. Dasgupta, F. Liu, B. Fatemi, P. Awasthi, D. Guo, S. Gollapudi, and A. Qureshi. ReMI: A dataset for reasoning with multiple images. ArXiv, abs/2406.09175, 2024a.

M. Kazemi, Q. Yuan, D. Bhatia, N. Kim, X. Xu, V. Imbrasaite, and D. Ramachandran. BoardgameQA: A dataset for natural language reasoning with contradictory information. NeurIPS, 36, 2024b.

M. Kazemi, B. Fatemi, H. Bansal, J. Palowitch, C. Anastasiou, S. V. Mehta, L. K. Jain, V. Aglietti, D. Jindal, P. Chen, et al. BIG-Bench extra hard. arXiv preprint arXiv:2502.19187, 2025.

A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. ArXiv, abs/1603.07396, 2016.

E. Kıcıman, R. Ness, A. Sharma, and C. Tan. Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050, 2023.

T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. 2018.

T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: A benchmark for question answering research. ACL, 2019.

N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.

Z. Lin, J. Cui, X. Liao, and X. Wang. Malla: Demystifying real-world large language model integrated malicious services, 2024.

H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. NeurIPS, 36, 2024.

LLaMa Team. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

M. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. 2015.

Macknight, Aung, and Gomes. Personal communication.

K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.

A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. ACL, 2022.

M. Mathew, D. Karatzas, R. Manmatha, and C. V. Jawahar. DocVQA: A dataset for VQA on document images. WACV, 2020.

M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. Jawahar. InfographicVQA. In WACV, 2022.

I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024.

M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, and K. Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.

A. Nie, Y. Zhang, A. S. Amdekar, C. Piech, T. B. Hashimoto, and T. Gerstenberg. MoCa: Measuring human-language model alignment on causal and moral judgment tasks. NeurIPS, 36, 2024.

R. Paiss, A. Ephrat, O. Tov, S. Zada, I. Mosseri, M. Irani, and T. Dekel. Teaching CLIP to count to ten. ICCV, 2023.

M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, H. Howard, T. Lieberum, R. Kumar, M. A. Raad, A. Webson, L. Ho, S. Lin, S. Farquhar, M. Hutter, G. Deletang, A. Ruoss, S. El-Sayed, S. Brown, A. Dragan, R. Shah, A. Dafoe, and T. Shevlane. Evaluating frontier models for dangerous capabilities, 2024.

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.

A. Ramé, J. Ferret, N. Vieillard, R. Dadashi, L. Hussenot, P.-L. Cedoz, P. G. Sessa, S. Girgin, A. Douillard, and O. Bachem. WARP: On the benefits of weight averaged rewarded policies, 2024a.

A. Ramé, N. Vieillard, L. Hussenot, R. Dadashi, G. Cideron, O. Bachem, and J. Ferret. WARM: On the benefits of weight averaged reward models. In ICML, 2024b.

D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. ArXiv, abs/2311.12022, 2023.

J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He. ZeRO-Offload: Democratizing billion-scale model training. In USENIX, 2021.

A. Roberts, H. W. Chung, G. Mishra, A. Levskaya, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio. JMLR, 2023.

N. Sachdeva, B. Coleman, W.-C. Kang, J. Ni, L. Hong, E. H. Chi, J. Caverlee, J. McAuley, and D. Z. Cheng. How to train data-efficient LLMs. arXiv preprint arXiv:2402.09668, 2024.

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. CoRR, abs/1907.10641, 2019.

E. Sánchez, B. Alastruey, C. Ropers, P. Stenetorp, M. Artetxe, and M. R. Costa-jussà. Linguini: A benchmark for language-agnostic linguistic reasoning. arXiv preprint arXiv:2409.12126, 2024.

M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi. SocialIQA: Commonsense reasoning about social interactions. CoRR, abs/1904.09728, 2019.

P. G. Sessa, R. Dadashi, L. Hussenot, J. Ferret, N. Vieillard, A. Ramé, B. Shariari, S. Perrin, A. Friesen, G. Cideron, S. Girgin, P. Stanczyk, A. Michi, D. Sinopalnikov, S. Ramos, A. Héliou, A. Severyn, M. Hoffman, N. Momchev, and O. Bachem. BOND: Aligning LLMs with best-of-n distillation, 2024.

K. Shah, N. Dikkala, X. Wang, and R. Panigrahy. Causal language modeling can elicit search and reasoning capabilities on logic puzzles. arXiv preprint arXiv:2409.10502, 2024.

T. Shevlane, S. Farquhar, B. Garfinkel, M. Phuong, J. Whittlestone, J. Leung, D. Kokotajlo, N. Marchal, M. Anderljung, N. Kolt, L. Ho, D. Siddarth, S. Avin, W. Hawkins, B. Kim, I. Gabriel, V. Bolina, J. Clark, Y. Bengio, P. Christiano, and A. Dafoe. Model evaluation for extreme risks, 2023.

F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In ICLR, 2023.

A. Singh, V. Natarjan, M. Shah, Y. Jiang, X. Chen, D. Parikh, and M. Rohrbach. Towards VQA models that can read. In CVPR, 2019.

H. Singh, N. Gupta, S. Bharadwaj, D. Tewari, and P. Talukdar. IndicGenBench: A multilingual benchmark to evaluate generation capabilities of LLMs on Indic languages. arXiv preprint arXiv:2404.16816, 2024a.

S. Singh, A. Romanou, C. Fourrier, D. I. Adelani, J. G. Ngui, D. Vila-Suero, P. Limkonchotiwat, K. Marchisio, W. Q. Leong, Y. Susanto, R. Ng, S. Longpre, W.-Y. Ko, M. Smith, A. Bosselut, A. Oh, A. F. T. Martins, L. Choshen, D. Ippolito, E. Ferrante, M. Fadaee, B. Ermis, and S. Hooker. Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation, 2024b.

A. Steiner, A. S. Pinto, M. Tschannen, D. Keysers, X. Wang, Y. Bitton, A. Gritsenko, M. Minderer, A. Sherbondy, S. Long, S. Qin, R. Ingle, E. Bugliarello, S. Kazemzadeh, T. Mesnard, I. Alabdulmohsin, L. Beyer, and X. Zhai. PaliGemma 2: A family of versatile VLMs for transfer. arXiv preprint arXiv:2412.03555, 2024.

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging BIG-Bench tasks and whether chain-of-thought can solve them, 2022.

G. Tyen, H. Mansoor, P. Chen, T. Mak, and V. Cărbune. LLMs cannot find reasoning errors, but can correct them! arXiv preprint arXiv:2311.08516, 2023.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. 2017.

K. Vodrahalli, S. Ontanon, N. Tripuraneni, K. Xu, S. Jain, R. Shivanna, J. Hui, N. Dikkala, M. Kazemi, B. Fatemi, et al. Michelangelo: Long context evaluations beyond haystacks via latent structure queries. arXiv preprint arXiv:2409.12640, 2024.

Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. MMLU-Pro: A more robust and challenging multi-task language understanding benchmark. In NeurIPS, 2024.

L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. A. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel. Ethical and social risks of harm from language models, 2021.

C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Jain, R. Shwartz-Ziv, N. Jain, K. Saifullah, S. Naidu, et al. LiveBench: A challenging, contamination-free LLM benchmark. arXiv preprint arXiv:2406.19314, 2024.

M. Wortsman, P. J. Liu, L. Xiao, K. Everett, A. Alemi, B. Adlam, J. D. Co-Reyes, I. Gur, A. Kumar, R. Novak, et al. Small-scale proxies for large-scale transformer training instabilities. arXiv preprint arXiv:2309.14322, 2023.

XLA. XLA: Optimizing compiler for TensorFlow, 2019. URL https://www.tensorflow.org/xla.

Y. Xu, H. Lee, D. Chen, B. A. Hechtman, Y. Huang, R. Joshi, M. Krikun, D. Lepikhin, A. Ly, M. Maggioni, R. Pang, N. Shazeer, S. Wang, T. Wang, Y. Wu, and Z. Chen. GSPMD: General and scalable parallelization for ML computation graphs. 2021.

Y. Yamada, Y. Bao, A. K. Lampinen, J. Kasai, and I. Yildirim. Evaluating spatial understanding of large language models. arXiv preprint arXiv:2310.14540, 2023.

K. Yang, O. Russakovsky, and J. Deng. SpatialSense: An adversarially crowdsourced benchmark for spatial relation recognition. ICCV, 2019.

X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. CVPR, 2023.

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In ACL, 2019.

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In CVPR, 2023.

B. Zhang and R. Sennrich. Root mean square layer normalization. 2019.

J. Zhang, L. Jain, Y. Guo, J. Chen, K. L. Zhou, S. Suresh, A. Wagenmaker, S. Sievert, T. Rogers, K. Jamieson, et al. Humor in AI: Massive scale crowd-sourced preferences and benchmarks for cartoon captioning. arXiv preprint arXiv:2406.10522, 2024.

W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. AGIEval: A human-centric benchmark for evaluating foundation models, 2023.


Core contributors                        Contributors (alphabetical order)


Aishwarya Kamath∗ Abe Friesen
Johan Ferret∗ Abhanshu Sharma
Shreya Pathak∗ Abheesht Sharma
Nino Vieillard∗ Adi Mayrav Gilady
Ramona Merhej∗ Adrian Goedeckemeyer
Sarah Perrin∗ Alaa Saade
Tatiana Matejovicova∗ Alex Feng
Alexandre Ramé∗ Alexander Kolesnikov
Morgane Rivière∗ Alexei Bendebury
Louis Rouillard∗ Alvin Abdagic
Thomas Mesnard∗ Amit Vadi
Geoffrey Cideron∗ András György
Jean-Bastien Grill∗ André Susano Pinto
Sabela Ramos∗ Anil Das
Edouard Yvinec∗ Ankur Bapna
Michelle Casbon∗ Antoine Miech
Etienne Pot Antoine Yang
Ivo Penchev Antonia Paterson
Gaël Liu Ashish Shenoy
Francesco Visin Ayan Chakrabarti
Kathleen Kenealy Bilal Piot
Lucas Beyer Bo Wu
Xiaohai Zhai Bobak Shahriari
Anton Tsitsulin Bryce Petrini
Robert Busa‚Fekete Charlie Chen
Alex Feng Charline Le Lan
Noveen Sachdeva Christopher A‹ Choquette‚Choo
Benjamin Coleman CJ Carey
Yi Gao Cormac Brick
Basil Mustafa Daniel Deutsch
Iain Barr Danielle Eisenbud
Emilio Parisotto Dee Cattle
David Tian Derek Cheng
Matan Eyal Dimitris Paparas
Colin Cherry Divyashree Shivakumar Sreepathihalli
Jan-Thorsten Peter Doug Reid
Danila Sinopalnikov Dustin Tran
Surya Bhupatiraju Dustin Zelle
Rishabh Agarwal Eric Noland
Mehran Kazemi Erwin Huizenga
Dan Malkin Eugene Kharitonov
Ravin Kumar Frederick Liu
David Vilar Gagik Amirkhanyan
Idan Brusilovsky Glenn Cameron
Jiaming Luo Hadi Hashemi
Andreas Steiner Hanna Klimczak-Plucińska
Harman Singh
Harsh Mehta
∗ co-first authors.


Harshal Tushar Lehri Reza Rokni


Hussein Hazimeh Rob Willoughby
Ian Ballantyne Rohith Vallu
Idan Szpektor Ryan Mullins
Ivan Nardini Sammy Jerome
Jean Pouget‚Abadie Sara Smoot
Jetha Chan Sertan Girgin
Joe Stanton Shariq Iqbal
John Wieting Shashir Reddy
Jonathan Lai Shruti Sheth
Jordi Orbay Siim Pšder
Joseph Fernandez Sijal Bhatnagar
Josh Newlan Sindhu Raghuram Panyam
Ju‚yeong Ji Sivan Eiger
Jyotinder Singh Susan Zhang
Kat Black Tianqi Liu
Kathy Yu Trevor Yacovone
Kevin Hui Tyler Liechty
Kiran Vodrahalli Uday Kalra
Klaus Greff Utku Evci
Linhai Qiu Vedant Misra
Marcella Valentine Vincent Roseberry
Marina Coelho Vlad Feinberg
Marvin Ritter Vlad Kolesnikov
Matt Hoffman Woohyun Han
Matthew Watson Woosuk Kwon
Mayank Chaturvedi Xi Chen
Michael Moynihan Yinlam Chow
Min Ma Yuvein Zhu
Nabila Babar Zichuan Wei
Natasha Noy Zoltan Egyed
Nathan Byrd
Nick Roy
Support
Nikola Momchev
Victor Cotruta
Nilay Chauhan
Minh Giang
Noveen Sachdeva
Phoebe Kirk
Oskar Bunyan
Anand Rao
Pankil Botarda
Kat Black
Paul Caron
Nabila Babar
Paul Kishan Rubenstein
Jessica Lo
Phil Culliton
Erica Moreira
Philipp Schmid
Luiz Gustavo Martins
Pier Giuseppe Sessa
Omar Sanseviero
Pingmei Xu
Lucas Gonzalez
Piotr Stanczyk
Zach Gleicher
Pouya Tafti
Tris Warkentin
Rakesh Shivanna
Renjie Wu
Renke Pan Sponsors

18
Gemma « Technical Report

Vahab Mirrokni
Evan Senter
Eli Collins
Joelle Barral
Zoubin Ghahramani
Raia Hadsell
Yossi Matias
$‹ Sculley
Slav Petrov
Noah Fiedel
Noam Shazeer
Oriol Vinyals
Jeff $ean
$emis Hassabis
Koray Kavukcuoglu
Clement Farabet

Technical advisors
Elena Buchatskaya
Jean‚Baptiste Alayrac
Rohan Anil
$mitry ˘$ima¯ Lepikhin
Sebastian Borgeaud
Olivier Bachem

Lead
Armand Joulin

Technical leads
Alek Andreev
Cassidy Hardin
Robert $adashi
Lęonard Hussenot

19
Gemma « Technical Report

Appendix

Details of pre-trained performances.

           Gemma 2             Gemma 3
          2B    9B    27B     1B    4B    12B   27B
HellaS    72.9  81.9  86.4    62.3  77.2  84.2  85.6
BoolQ     75.6  77.5  76.2    63.2  72.3  78.8  82.4
PIQA      78.1  81.9  83.5    73.8  79.6  81.8  83.3
SIQA      51.8  53.3  53.8    48.9  51.9  53.4  54.9
TQA       60.2  76.5  83.8    39.8  65.8  78.2  85.5
NQ        17.2  29.2  34.7    9.48  20.0  31.4  36.1
ARC-C     55.8  69.1  71.4    38.4  56.2  68.9  70.6
ARC-E     80.6  88.3  88.6    73.0  82.4  88.3  89.0
WinoG     65.4  73.9  79.4    58.2  64.7  74.3  78.8
BBH       42.4  69.4  74.8    28.4  50.9  72.6  77.7
Drop      53.2  71.5  75.2    42.4  60.1  72.2  77.2

Table 9 | Factuality, common-sense performance and reasoning after the pre-training phase.

Factuality and common-sense. In Table 9, we report the performance of our new pre-trained models compared to previous versions. We consider several standard benchmarks, namely HellaSwag (Zellers et al., 2019), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2019), SIQA (Sap et al., 2019), TriviaQA (Joshi et al., 2017), Natural Questions (Kwiatkowski et al., 2019), ARC-C and ARC-E (Chollet, 2019), WinoGrande (Sakaguchi et al., 2019), BBH (Suzgun et al., 2022), and DROP (Dua et al., 2019). Evaluation details are described in Table 19. Overall, our models are in the same ballpark as Gemma 2, which is encouraging since these abilities are not the focus of the improvements brought in this version.

STEM and code. The details of our performance on STEM and code are in Table 10. We consider several standard benchmarks, namely MMLU (Hendrycks et al., 2020), MMLU-Pro (Wang et al., 2024), AGIEval (Zhong et al., 2023), MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), GPQA (Rein et al., 2023), MBPP (Austin et al., 2021), and HumanEval (Chen et al., 2021). Evaluation details are described in Table 19. Overall, we see a consistent improvement in STEM abilities across our pre-trained models. On code, we see a similar improvement for the 4B and 12B models, but not on the 27B.

               Gemma 2           Gemma 3
              2B    9B    27B   4B    12B   27B
MMLU          52.2  71.2  75.2  59.6  74.5  78.6
MMLUpro       22.2  43.7  49.4  29.2  45.3  52.2
AGIE          31.6  53.1  55.1  42.1  57.4  66.2
MATH          16.4  36.4  42.1  24.2  43.3  50.0
GSM8K         25.0  70.2  74.6  38.4  71.0  82.6
GPQA Diamond  12.5  24.8  26.3  15.0  25.4  24.3
MBPP          31.0  51.2  60.8  46.0  60.4  65.6
HumanE        19.5  40.2  51.2  36.0  45.7  48.8

Table 10 | STEM and code performance after the pre-training phase.

                   4B    12B   27B
COCO caption       102   111   116
DocVQA             72.8  82.3  85.6
InfoVQA            44.1  54.8  59.4
MMMU               39.2  50.3  56.1
TextVQA            58.9  66.5  68.6
RealWorldQA        45.5  52.2  53.9
ReMI               27.3  38.5  44.8
AI2D               63.2  75.2  79.0
ChartQA            63.6  74.7  76.3
VQAv2              63.9  71.2  72.9
BLINK              38.0  35.9  39.6
OK-VQA             51.0  58.7  60.2
TallyQA            42.5  51.8  54.3
SpatialSense VQA   50.9  60.0  59.4
CountBench VQA     26.1  17.8  68.0

Table 11 | Multimodal performance after the pre-training phase. The scores are on the val split of each dataset, without P&S.

Image understanding. In Table 11, we report performance across a variety of visual question-answering benchmarks for the different models that were trained with a vision encoder, namely COCO Caption (Chen et al., 2015), DocVQA (Mathew et al., 2020), InfographicVQA (Mathew et al., 2022), MMMU (Yue et al., 2023), TextVQA (Singh et al., 2019), RealWorldQA (Rea), ReMI (Kazemi et al., 2024a),
AI2D (Kembhavi et al., 2016), ChartQA (Masry et al., 2022), VQA v2 (Goyal et al., 2017), BLINK (Fu et al., 2024), OK-VQA (Marino et al., 2019), TallyQA (Acharya et al., 2018), SpatialSense VQA (Yang et al., 2019), and CountBench VQA (Paiss et al., 2023). Evaluation details are described in Table 20.

              PaliGemma 2         Gemma 3
              2B    9B    27B     4B    12B   27B
DocVQA        81.6  86.3  85.1    86.1  89.0  89.5
InfoVQA       41.4  53.1  50.2    55.6  61.6  64.6
TextVQA       76.3  76.3  75.1    79.1  81.6  83.2
ChartQA       70.7  79.1  71.3    79.8  83.5  83.4
AI2D          76.0  84.4  84.6    80.9  85.6  86.5
OKVQA         64.1  68.6  70.6    65.2  69.3  71.1
CountBenchQA  82.0  85.3  87.4    79.4  83.5  87.8
COCO caption  143.  145.  145.    143.  143.  144.
VQAv2         84.8  85.8  85.8    84.1  84.9  85.1
Tally QA      80.6  82.4  82.1    79.0  81.3  81.7

Table 12 | Performance of pre-trained checkpoints after fine-tuning on multi-modal benchmarks (without P&S). PaliGemma 2 was transferred at 896x896 resolution for the first four benchmarks, and at 448x448 resolution for the others.

Comparison to PaliGemma 2. We fine-tune multimodal Gemma 3 pre-trained checkpoints following the protocol from Steiner et al. (2024) – only the learning rate is swept, otherwise the same transfer settings are used. The results in Table 12 show that Gemma 3 excels at benchmarks involving document understanding, even outperforming the larger PaliGemma 2 variant. Note that, due to average pooling in the vision encoder, the Gemma 3 4B and 12B models are about 10x cheaper to transfer compared with the PaliGemma 2 9B and 27B models at the same 896x896 resolution. Gemma 3 also performs better on AI2D and OKVQA, but PaliGemma 2 performs slightly better on VQAv2 and COCO caption.

Multilinguality. In Table 13 we report the performance of the pre-trained models on multilingual tasks. We apply in-context learning with multi-shot prompting and present results on the following benchmarks: MGSM (Shi et al., 2023), Global-MMLU-Lite (Singh et al., 2024b), WMT24++ (Deutsch et al., 2025), FLoRes (Goyal et al., 2022), XQuAD (Artetxe et al., 2020), ECLeKTic (Goldman et al., 2025), IndicGenBench (Singh et al., 2024a), and XOR QA (Asai et al., 2020). Evaluation details are described in Table 19.
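To make the multi-shot setup concrete, the following is a minimal sketch of how a few-shot prompt can be assembled for a multilingual benchmark such as MGSM. The exemplar problems and the commented generate call are hypothetical placeholders, not the actual Gemma 3 evaluation harness.

# Minimal sketch of multi-shot (few-shot) prompt construction for a
# multilingual benchmark such as MGSM. The exemplars and the `generate`
# call below are hypothetical placeholders.

FEWSHOT_EXAMPLES = [
    ("Question: Juan tiene 3 manzanas y compra 2 mas. Cuantas tiene?",
     "Answer: 3 + 2 = 5. La respuesta es 5."),
    # ... typically 8 solved exemplars for MGSM (see Table 19)
]

def build_prompt(examples, query):
    """Concatenate solved exemplars followed by the unsolved query."""
    shots = "\n\n".join(f"{q}\n{a}" for q, a in examples)
    return f"{shots}\n\nQuestion: {query}\nAnswer:"

prompt = build_prompt(
    FEWSHOT_EXAMPLES,
    "Ana tiene 7 naranjas y regala 4. Cuantas le quedan?")
# completion = generate(model, prompt)  # hypothetical call to the model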
           Gemma 2             Gemma 3
          2B    9B    27B     1B    4B    12B   27B
MGSM      18.7  57.3  68.0    2.04  34.7  64.3  74.3
GMMLU     43.3  64.0  69.4    24.9  57.0  69.4  75.7
WMT24++   38.8  50.3  53.0    36.7  48.4  53.9  55.7
Flores    30.2  41.3  44.3    29.5  39.2  46.0  48.8
XQuAD     53.7  72.2  73.9    43.9  68.0  74.5  76.8
ECLeKTic  8.29  14.0  17.1    4.69  11.0  17.2  24.4
IndicGB   47.4  59.3  62.1    41.4  57.2  61.7  63.4

Table 13 | Multilingual performance after the pre-training phase. IndicGenBench is an average over benchmarks reported in Table 14.

              Gemma 2             Gemma 3
             2B    9B    27B     1B    4B    12B   27B
XQuAD Indic  54.3  73.1  74.9    43.1  68.3  75.2  77.8
XORQA in-en  66.2  69.3  72.5    56.3  68.3  69.8  70.4
XORQA in-xx  31.2  40.8  44.3    27.1  39.8  43.8  46.0
Flores Indic 38.1  54.0  56.9    39.0  52.3  58.0  59.5

Table 14 | Detailed IndicGenBench performance after the pre-training phase.

Long context. In Table 15 we report the performance of pre-trained and fine-tuned models on long-context benchmarks. We include the RULER (Hsieh et al., 2024) and MRCR (Vodrahalli et al., 2024) benchmarks, evaluating at 32K and 128K sequence lengths.
              Gemma 3 PT            Gemma 3 IT
Context       4B    12B   27B      4B    12B   27B
RULER 32K     67.1  90.6  85.9     61.4  80.3  91.1
RULER 128K    51.7  80.7  72.9     46.8  57.1  66.0
MRCR 32K      44.7  59.8  63.2     49.8  53.7  63.2
MRCR 128K     40.6  56.9  60.0     44.6  49.8  59.3

Table 15 | Performance of pre-trained (PT) and instruction fine-tuned (IT) models on long context benchmarks at different context lengths.

8.1. Performance of IT models

We report in Table 18 additional benchmarks on our IT models. Note that N2C refers to Natural2Code, the Gemini 1.0 internal held-out dataset, which uses author-generated sources instead of web-based information. BBEH refers to BIG-Bench Extra Hard (Kazemi et al., 2025), a challenging LLM reasoning benchmark that aggregates several reasoning tasks (Fatemi et al., 2024; Hessel et al., 2022; Kazemi et al., 2023, 2024b; Kıcıman et al., 2023; Nie et al., 2024; Sánchez et al., 2024; Shah et al., 2024; Tyen et al., 2023; White et al., 2024; Yamada et al., 2023; Zhang et al., 2024). ECLeKTic refers to Goldman et al. (2025). We report the micro average score. More evaluation details are described in Table 21.

8.2. Performance of IT models on video understanding

Additional multimodal evaluations. Gemma 3 IT models were evaluated on common vision benchmarks following the evaluation protocol of Gemini 1.5 (Gemini Team, 2024). The results are given in Table 16 when P&S is activated.

                      4B    12B   27B
MMMU (val)            48.8  59.6  64.9
DocVQA                75.8  87.1  86.6
InfoVQA               50.0  64.9  70.6
TextVQA               57.8  67.7  65.1
AI2D                  74.8  84.2  84.5
ChartQA               68.8  75.7  78.0
VQAv2 (val)           62.4  71.6  71.0
MathVista (testmini)  50.0  62.9  67.6

Table 16 | Performance of instruction fine-tuned (IT) models on multimodal benchmarks. If not mentioned, these results are on the final test set of each dataset, with P&S applied.

                       4B    12B   27B
Perception Test MCVQA  50.6  54.9  58.1
ActivityNet-QA         46.3  50.4  52.8

Table 17 | Performance of instruction fine-tuned (IT) models on vision understanding benchmarks, using 0-shot with 16 frames sampled with linspace. Perception Test consists of real-world videos designed to show perceptually interesting situations, and we report results on the multiple-choice video QA benchmark in terms of top-1 accuracy. ActivityNet-QA is reported with the standard GPT-based evaluation.
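The "16 frames, linspace" setting in Table 17 can be read as sampling 16 frame indices evenly spaced over each video before passing the frames to the vision encoder. The helper below is a minimal sketch of that sampling step under this assumption; the exact frame selection and preprocessing in the Gemma 3 pipeline may differ.

# Minimal sketch of evenly spaced ("linspace") frame sampling for video
# evaluation. Index handling is an assumption and may differ from the
# actual evaluation pipeline.
import numpy as np

def sample_frame_indices(num_video_frames: int, num_samples: int = 16) -> list[int]:
    """Return `num_samples` indices evenly spaced over [0, num_video_frames - 1]."""
    return np.linspace(0, num_video_frames - 1, num=num_samples).round().astype(int).tolist()

# Example: a 300-frame clip yields 16 indices spanning 0 through 299.
print(sample_frame_indices(300))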
                Gemma 2             Gemma 3
               2B    9B    27B     1B    4B    12B   27B
MMLU           56.1  71.3  76.2    38.8  58.1  71.9  76.9
MBPP           36.6  59.2  67.4    35.2  63.2  73.0  74.4
HumanEval      20.1  40.2  51.8    41.5  71.3  85.4  87.8
N2C            46.8  68.3  77.3    56.0  70.3  80.7  84.5
LiveCodeBench  7.0   20.0  29.0    5.0   23.0  32.0  39.0
GSM8K          62.6  88.1  91.1    62.8  89.2  94.4  95.9
MATH           27.2  49.4  55.6    48.0  75.6  83.8  89.0
HiddenMath     2.0   8.0   12.0    15.0  42.0  51.0  56.0
BBH            41.4  69.0  74.9    39.1  72.2  85.7  87.6
BBEH           5.9   9.8   14.8    7.2   11.0  16.3  19.3
IFEval         80.4  88.4  91.1    80.2  90.2  88.9  90.4
GMMLU-Lite     41.9  64.8  68.6    34.2  54.5  69.5  75.1
ECLeKTic       5.3   11.8  17.6    1.4   4.6   10.3  16.7
WMT24++        37.4  48.7  51.7    35.9  46.8  51.6  53.4

Table 18 | Performance of instruction fine-tuned (IT) models of different sizes on more internal and external benchmarks.
Evaluation          Metric                   Type      n-shot     COT   Norm

MBPP                pass@1                   sampling  3-shot
HumanEval           pass@1                   sampling  0-shot
HellaSwag           Accuracy                 scoring   10-shot          Char-Len
BoolQ               Accuracy                 scoring   0-shot           Char-Len
PIQA                Accuracy                 scoring   0-shot           Char-Len
SIQA                Accuracy                 scoring   0-shot           Char-Len
TriviaQA            Accuracy                 sampling  5-shot
Natural Questions   Accuracy                 sampling  5-shot
ARC-C               Accuracy                 scoring   25-shot          Char-Len
ARC-E               Accuracy                 scoring   0-shot           Char-Len
WinoGrande          Accuracy                 scoring   5-shot           Char-Len
BBH                 Accuracy                 sampling  few-shot   Yes
DROP                Token F1 score           sampling  1-shot
AGIEval             Accuracy                 sampling  3-5-shot
MMLU                Accuracy                 scoring   5-shot           Char-Len
MATH                Accuracy                 sampling  4-shot     Yes
GSM8K               Accuracy                 sampling  8-shot     Yes
GPQA Diamond        Accuracy                 sampling  5-shot     Yes
MMLU-Pro            Accuracy                 sampling  5-shot     Yes
MGSM                Accuracy                 sampling  8-shot
FLoRes              CHaRacter-level F-score  sampling  1-shot
Global-MMLU-Lite    Accuracy                 scoring   5-shot           Char-Len
XQuAD               CHaRacter-level F-score  sampling  5-shot
WMT24++             CHaRacter-level F-score  sampling  5-shot
ECLeKTic            ECLeKTic score           sampling  2-shot           First-line/strip
XQuAD Indic         CHaRacter-level F-score  sampling  5-shot
XOR QA IN-EN        CHaRacter-level F-score  sampling  5-shot
XOR QA IN-XX        CHaRacter-level F-score  sampling  5-shot
FLoRes Indic        CHaRacter-level F-score  sampling  5-shot
RULER               Accuracy                 sampling  0-shot
MRCR                MRCR score               sampling  few-shot

Table 19 | Details on text benchmarks. Char-Len stands for Character Length Normalization and COT stands for Chain-of-Thought prompting.
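For the scoring-type benchmarks marked Char-Len in Table 19, each candidate answer is scored by the model and the log-likelihood is normalized by the candidate's character length before selecting the highest-scoring option. The snippet below is a minimal sketch of that selection rule; candidate_logprob is a hypothetical stand-in for the actual scoring harness.

# Minimal sketch of character-length-normalized multiple-choice scoring
# ("Char-Len" in Table 19). `candidate_logprob` is a hypothetical callable
# returning the model's total log-likelihood of `candidate` given `prompt`.

def pick_answer(prompt: str, candidates: list[str], candidate_logprob) -> str:
    """Select the candidate with the highest length-normalized log-likelihood."""
    def normalized_score(candidate: str) -> float:
        return candidate_logprob(prompt, candidate) / max(len(candidate), 1)
    return max(candidates, key=normalized_score)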
Evaluation          Metric       Type      n-shot

COCO Caption        Cider score  sampling  4-shot
DocVQA              ANLS score   sampling  4-shot
InfographicVQA      ANLS score   sampling  4-shot
MMMU                Accuracy     sampling  3-shot text only
TextVQA             Accuracy     sampling  4-shot
RealWorldQA         Accuracy     sampling  4-shot text only
ReMI                Accuracy     sampling  4-shot
AI2D                Accuracy     sampling  4-shot
ChartQA             Accuracy     sampling  4-shot
VQA v2              Accuracy     sampling  4-shot
BLINK               Accuracy     sampling  0-shot
OK-VQA              Accuracy     sampling  4-shot
TallyQA             Accuracy     sampling  4-shot
SpatialSense VQA    Accuracy     sampling  4-shot
CountBench VQA      Accuracy     sampling  0-shot

Table 20 | Details on vision benchmarks. No Chain-of-Thought prompting nor normalization.

Evaluation          Metric                   Type      n-shot  COT

MMLU                Accuracy                 sampling  0-shot
MBPP                pass@1                   sampling  3-shot
HumanEval           pass@1                   sampling  0-shot
N2C                 pass@1                   sampling  0-shot
LiveCodeBench       Average over 8 samples   sampling  0-shot  Yes
GSM8K               Accuracy                 sampling  0-shot  Yes
GPQA Diamond        Accuracy                 sampling  0-shot  Yes
MATH                Accuracy                 sampling  0-shot
HiddenMath          Accuracy                 sampling  0-shot
BBH                 Accuracy                 sampling  0-shot
BBEH                Accuracy                 sampling  0-shot
IFEval              Accuracy                 sampling  0-shot
Global-MMLU-lite    Accuracy                 sampling  0-shot  Yes
ECLeKTic            ECLeKTic score           sampling  0-shot
WMT24++             CHaRacter-level F-score  sampling  0-shot

Table 21 | Details on instruction fine-tuned (IT) benchmarks. No normalization.
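Several code benchmarks in Tables 19 and 21 report pass@1, and LiveCodeBench is reported as an average over 8 samples. The snippet below is a minimal sketch of the standard pass@k estimator of Chen et al. (2021); for k = 1 it reduces to the fraction of correct samples, which is consistent with averaging correctness over several samples per problem. Whether the Gemma 3 harness uses this exact estimator is an assumption.

# Minimal sketch of the pass@k estimator (Chen et al., 2021). Using it for
# the reported pass@1 numbers is an assumption, not a confirmed detail.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n generated samples of which c are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With k = 1 this reduces to c / n, the fraction of correct samples.
print(pass_at_k(n=8, c=3, k=1))  # 0.375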
