On-Device Multimodal | Google Open-Sources Gemma 3n, Bringing Full-Modality Intelligence to Mobile Devices

Latest recommended article published on 2025-07-14 17:42:10

JasonLiu1919

License: CC 4.0 BY-SA

Article tags: artificial intelligence, multimodal, large models, mobile devices

Article link: https://blog.csdn.net/ljp1919/article/details/149001430

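0. Introduction

Gemma 3n first appeared as a preview during Google I/O and drew strong interest from the on-device community, because it was designed from the ground up to run on edge hardware. More notably, it is natively multimodal and accepts image, text, audio, and video input. Gemma 3n has recently landed across the open-source ecosystem with official releases in the major open-source libraries, a significant step for its availability and for adoption by the developer community.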

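1. Overview

Gemma 3n is now available in the most widely used open-source libraries, including transformers & timm, MLX, llama.cpp (text input only), transformers.js, ollama, and Google AI Edge. The release includes two model sizes, each in a base variant and an instruction-tuned (instruct) variant.

The model names follow a non-standard convention: gemma-3n-E2B and gemma-3n-E4B, where the "E" stands for "Effective". Their actual parameter counts are 5B and 8B respectively, but thanks to improved memory efficiency their footprint in VRAM (GPU memory) is comparable to that of 2B and 4B models. In other words, they behave like 2B and 4B models in terms of hardware requirements while delivering quality beyond that of typical 2B/4B models. Concretely, the E2B model can run in as little as 2GB of GPU RAM, and the E4B model in as little as 3GB.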

Official model download: https://huggingface.co/collections/google/gemma-3n-685065323f5984ef315c93f4

In addition to the language decoder, Gemma 3n integrates an audio encoder and a vision encoder, which provide its multimodal capabilities.

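The vision encoder is a new MobileNet version, MobileNet-v5-300, with 300M parameters and support for 256x256, 512x512, and 768x768 input resolutions. It reaches 60 FPS on Google Pixel devices and outperforms ViT Giant while using only one third as many parameters.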

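The audio encoder is based on the Universal Speech Model (USM). It processes audio in 160ms chunks and supports speech-to-text and translation (for example, English to Spanish or French).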
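To make the 160ms chunking concrete, here is a minimal arithmetic sketch. Only the 160ms chunk length comes from the article; the 16 kHz sample rate is an assumption chosen for illustration, and the function names are hypothetical rather than part of any Gemma 3n API.

```python
CHUNK_MS = 160  # chunk length stated in the article


def audio_chunk_count(duration_ms: int) -> int:
    """Number of 160 ms chunks needed to cover a clip (the last chunk may be partial)."""
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-duration_ms // CHUNK_MS)


def samples_per_chunk(sample_rate_hz: int = 16_000) -> int:
    """Samples in one chunk. The 16 kHz default is an assumption, not a documented value."""
    return sample_rate_hz * CHUNK_MS // 1000


if __name__ == "__main__":
    print(audio_chunk_count(30_000))  # chunks covering a 30-second clip
    print(samples_per_chunk())        # samples per chunk at the assumed 16 kHz
```

Under these assumptions, a 30-second clip is covered by 188 chunks of 2560 samples each.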

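The Gemma 3n architecture has been integrated into the latest transformers release (as of June 27) and works together with the timm library for image encoding. Architectural highlights include: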

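MatFormer architecture: a nested Transformer design, similar in spirit to Matryoshka embeddings (named after Russian nesting dolls), that allows subsets of the layers to be extracted and used as if they were standalone models. E2B and E4B were trained jointly, with E2B configured as a sub-model of E4B, and users can "mix and match" layers according to their hardware characteristics and memory budget.
Per-Layer Embeddings (PLE): reduce accelerator memory usage by offloading embedding data to the CPU. This is the key reason the E2B model, with 5B actual parameters, occupies about the same GPU memory as a 2B-parameter model.
KV Cache Sharing: speeds up long-context processing for audio and video, giving 2x faster prefill compared with the Gemma 3 4B model.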
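The nesting idea behind MatFormer can be illustrated with a toy feed-forward block in which a smaller sub-model reuses a prefix of the larger model's hidden units. This is a simplified sketch in plain Python, not Gemma 3n's actual implementation: the function name, the weights, and the choice to nest along the feed-forward width are all assumptions made for illustration.

```python
def nested_ffn(x, w_in, w_out, hidden_units):
    """ReLU feed-forward block that uses only the first `hidden_units` hidden neurons.

    w_in  has shape [H][d]: one row of input weights per hidden neuron.
    w_out has shape [d][H]: one row of output weights per output feature.
    Choosing hidden_units < H yields a smaller model that shares the large model's weights.
    """
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)))
        for row in w_in[:hidden_units]
    ]
    return [
        sum(w_out[j][k] * hidden[k] for k in range(hidden_units))
        for j in range(len(w_out))
    ]


# Toy weights (hypothetical values): 2 input features, 4 hidden neurons.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
W_OUT = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
x = [1.0, 2.0]

small = nested_ffn(x, W_IN, W_OUT, hidden_units=2)  # nested "small" sub-model
full = nested_ffn(x, W_IN, W_OUT, hidden_units=4)   # full-width model

if __name__ == "__main__":
    print(small, full)
```

Both calls read from the same weight matrices, which is the point of the nesting: the smaller model is carved out of the larger one rather than stored separately.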

For more AI content, follow the WeChat official account "小窗幽记机器学习".

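2. Results

Gemma 3n performs strongly on benchmarks:

LMArena score: E4B is the first model under 10B parameters to score above 1300.
MMLU score: Gemma 3n is competitive across sizes (E4B, E2B, and several mix-and-match configurations).
Multilingual support: it handles text in 140 languages and multimodal interaction in 35 languages.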

3. Hands-On Testing

For convenience, Hugging Face provides a dedicated Demo Space where users can try the model. Demo address: https://huggingface.co/spaces/huggingface-projects/gemma-3n-E4B-it

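4. Using the Model

Gemma 3n offers two convenient inference paths: a simple call through the pipeline abstraction in the transformers library, or finer-grained inference through AutoProcessor and AutoModelForImageTextToText. The model accepts several input modalities: text only, audio interleaved with text (for speech-to-text and similar tasks), and images or video interleaved with text (video is processed as a collection of image frames).

Example 1: using pipeline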

import torch
from transformers import pipeline

# Build an image-text-to-text pipeline on the GPU in bfloat16.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3n-E4B-it",
    device="cuda",
    torch_dtype=torch.bfloat16,
)

# A single user turn containing one image followed by a text prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image"},
        ],
    }
]

output = pipe(text=messages, max_new_tokens=32)
# The last entry of generated_text is the assistant's reply.
print(output[0]["generated_text"][-1]["content"])

Example 2: using AutoModelForImageTextToText

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

# `device` was not defined in the original snippet; pick the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

model_id = "google/gemma-3n-e4b-it"  # or google/gemma-3n-e2b-it
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id).to(device)


def model_generation(model, messages):
    # Tokenize the chat messages (and load any referenced media) via the chat template.
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    input_len = inputs["input_ids"].shape[-1]
    inputs = inputs.to(model.device, dtype=model.dtype)

    with torch.inference_mode():
        generation = model.generate(**inputs, max_new_tokens=32, disable_compile=False)
        # Keep only the newly generated tokens, dropping the prompt.
        generation = generation[:, input_len:]

    decoded = processor.batch_decode(generation, skip_special_tokens=True)
    print(decoded[0])

For text-only input:

# Text Only
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is the capital of France?"}
        ]
    }
]
model_generation(model, messages)

For audio input:

# Interleaved with Audio
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe the following speech segment in English:"},
            {"type": "audio", "audio": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/speech.wav"},
        ]
    }
]
model_generation(model, messages)

For image input (video is handled as a collection of image frames):

# Interleaved with Image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/ariG23498/demo-data/resolve/main/airplane.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]
model_generation(model, messages)
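Since the article notes that video is processed as a collection of image frames, one way to approximate video input is to place several frame images in a single user turn. The helper below is a minimal sketch of that idea: `build_frames_message` and the frame file names are hypothetical, and passing multiple image entries in one turn is an assumption about how the chat template is used here, not something the article demonstrates.

```python
def build_frames_message(frame_sources, prompt):
    """Build a single-turn chat message: one image entry per frame, then the text prompt.

    frame_sources: list of image URLs or paths, in temporal order.
    prompt: the text instruction appended after the frames.
    """
    content = [{"type": "image", "image": src} for src in frame_sources]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Hypothetical frame files extracted from a video beforehand.
frames = ["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"]
messages = build_frames_message(frames, "Describe what happens across these frames.")

# With the model and helper from Example 2 loaded, this could then be run as:
# model_generation(model, messages)
```

Keeping the frames in temporal order and the prompt last mirrors the image example, where the image entry precedes the text.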

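Beyond transformers, Gemma 3n can also run inference through MLX (which supports all three modalities) and llama.cpp (text input only). ONNX weights for the gemma-3n-E2B-it variant have also been released for deployment across different runtimes and platforms, and they are integrated into Transformers.js 3.6.0. Given the model's size, fine-tuning Gemma 3n for a specific downstream task is quite practical. A free Google Colab notebook is provided to simplify fine-tuning, along with a dedicated notebook for fine-tuning on audio tasks.

The notebooks are available at: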

https://colab.research.google.com/github/huggingface/huggingface-gemma-recipes/blob/main/notebooks/fine_tune_gemma3n_on_t4.ipynb
https://github.com/huggingface/huggingface-gemma-recipes/blob/main/notebooks/fine_tune_gemma3n_on_audio.ipynb

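Hugging Face has also launched the Gemma Recipes repository, which contains notebooks and scripts for running and fine-tuning the models, and encourages the community to contribute more recipes. Repository: https://github.com/huggingface/huggingface-gemma-recipes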

In short, with its multimodality, small footprint, and strong capability, Gemma 3n helps advance on-device multimodal AI.
