ValueError: Number of images does not match number of special image tokens in the input text. Got 256 image tokens in the text but 256 tokens from image embeddings.

#91
by zml31415 - opened

Hello everyone,
I get the error in the title when using inputs_embeds together with pixel_values. Here is a small code example to reproduce it:

import requests
from PIL import Image
from io import BytesIO
from transformers.models.gemma3 import modeling_gemma3
from transformers import AutoProcessor
import torch


model_name = "google/gemma-3-12b-it"
model = modeling_gemma3.Gemma3ForConditionalGeneration.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
device = model.device

processor = AutoProcessor.from_pretrained(model_name, use_fast=True)

img = Image.open(BytesIO(requests.get("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg").content)).convert("RGB")
prompt = ["Analyse and explain the image: <start_of_image>\n"]

inputs = processor(text=prompt, images=img, return_tensors="pt").to(device)

pixel_values = inputs.pixel_values
input_ids = inputs.input_ids

# Convert the token ids to embeddings manually and pass those instead of the ids.
inputs_embeds = model.get_input_embeddings()(input_ids)
outputs = model(
    inputs_embeds=inputs_embeds,
    # input_ids=input_ids,
    pixel_values=pixel_values,
    use_cache=False,
)

The issue is that in modeling_gemma3.py, lines 898 and 899, a mask is created that identifies all the placeholder positions for the outputs of the vision tower / multi_modal_projector:

if input_ids is None:
    special_image_mask = inputs_embeds == self.get_input_embeddings()(
        torch.tensor(self.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
    )

which seems to mess something up with the special_image_mask, since the alternative path (lines 901-903), which uses the input_ids, works perfectly fine:

else:
    special_image_mask = (input_ids == self.config.image_token_id).unsqueeze(-1)
    special_image_mask = special_image_mask.expand_as(inputs_embeds).to(inputs_embeds.device)

Because of the if input_ids is None: case, the later check (line 905) that throws the error

if not is_torchdynamo_compiling() and inputs_embeds[special_image_mask].numel() != image_features.numel():

has a weird thing going on with the numbers. In my case inputs_embeds[special_image_mask].numel() is 983041 while image_features.numel() is 983040, exactly 1 off. This seems weird to me, since image_features.shape is torch.Size([1, 256, 3840]), so 256*3840 gives the expected 983040. And inputs_embeds.shape is torch.Size([1, 269, 3840]); applying a correct mask and accounting for the 13 text tokens, the relevant tensor should also be torch.Size([1, 256, 3840]), so again 983040 for .numel(), but it is 1 more. I don't get why this is. My current workaround is to feed inputs_embeds as well as input_ids and to disable the check (lines 867 and 868) that forbids passing both, which effectively skips the strange if input_ids is None mask calculation (lines 897 to 899). Then everything works fine. But I cannot keep working with custom modifications of the modeling_gemma3.py code forever :)
Is there something wrong with what I am doing, or is there something strange in the modeling_gemma3.py code?
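For what it's worth, here is a minimal standalone sketch (toy values, no model needed, nothing to do with the real Gemma weights) of how an element-wise == against the image-token embedding can pick up exactly one stray matching element from an unrelated token, which would explain the off-by-one:

import torch

# Toy embedding table: 5 "tokens", hidden size 4.
embed = torch.tensor([
    [0.10, 0.20, 0.30, 0.40],  # token 0 (text)
    [0.50, 0.60, 0.70, 0.25],  # token 1 (text) -- note the single 0.25
    [0.25, 0.25, 0.25, 0.25],  # token 2 = image placeholder token
    [0.90, 0.80, 0.70, 0.60],  # token 3 (text)
    [0.11, 0.12, 0.13, 0.14],  # token 4 (text)
])

input_ids = torch.tensor([[0, 2, 2, 1, 3]])        # two image placeholders
inputs_embeds = embed[input_ids]                   # shape [1, 5, 4]
image_token_embedding = embed[2]                   # embedding of the placeholder

# Branch taken when input_ids is None: element-wise comparison of the embeddings.
mask_from_embeds = inputs_embeds == image_token_embedding
print(inputs_embeds[mask_from_embeds].numel())     # 9 = 2*4 real matches + 1 stray 0.25

# Branch taken when input_ids are available: compare the token ids, then expand.
mask_from_ids = (input_ids == 2).unsqueeze(-1).expand_as(inputs_embeds)
print(inputs_embeds[mask_from_ids].numel())        # 8 = exactly 2*4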

Google org

Hi @zml31415 ,

Thanks for reaching out to us. google/gemma-3-27b-it and google/gemma-3-12b-it are instruction-tuned (IT) models; they follow a specific kind of prompt and chat template to process your query/prompt, which means any IT Gemma model expects role-based instructions to process your request. Please find the following sample prompt message for your reference:

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

Please adjust your prompt based on the sample prompt message given above. Thanks for your interest in Gemma models.
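For completeness, generation with those chat-template inputs would then look roughly like this (max_new_tokens is an arbitrary choice here):

with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=100, do_sample=False)

# Decode only the newly generated part, skipping the prompt tokens.
output = processor.decode(
    generated_ids[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(output)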

Thanks.

edit: due to a code mistake while posting, I changed the line inputs_embeds = model.get_input_embeddings()(generated_ids) to inputs_embeds = model.get_input_embeddings()(input_ids)

Thank you for your response, but I feel like your answer is not exactly tackling my question. Let me try to clarify: if I understand the model's code correctly, one can provide the inputs to the model in two ways:
first via the input_ids that come from the processor, and
second via inputs_embeds, which I can get from the model's function model.get_input_embeddings()(input_ids).
I already tried apply_chat_template, and it works because internally it uses the input_ids route at the deciding if-statement (lines 897-900, modeling_gemma3.py). It calculates the special_image_mask in a way that works, even with the code and prompt that I provided (one just needs to disable the assert in lines 867 and 868). But providing inputs_embeds, which takes the other branch (lines 897-899), does not work. I wonder whether this code path ever got tested, whether there is a bug, or whether I overlooked something.
Interestingly, the inputs_embeds are also calculated internally if not provided (lines 884-885, modeling_gemma3.py), resulting in an almost identical calculation path, with the exception of lines 897-900.
So my point is that any method (apply_chat_template, input_ids=..., **inputs) that provides the input_ids to the model works fine, but providing the inputs_embeds does not, even though the inputs_embeds are still calculated internally (lines 884-885, modeling_gemma3.py). The deciding if-statement is in lines 897-900, because the calculation of the special_image_mask does something strange when input_ids are not provided but inputs_embeds are. This is the only difference I could track down between using the input_ids and using the inputs_embeds as input for the model together with the pixel_values.
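Just to make the difference concrete: a mask that only marks positions where the whole embedding vector matches, i.e. reducing over the hidden dimension with .all(-1), would not pick up stray single-element matches. This is only a sketch of the idea, not the actual modeling_gemma3.py code:

# Hypothetical variant of the input_ids-is-None branch: require the *entire*
# hidden vector to equal the image-token embedding, not individual elements.
image_token_embedding = model.get_input_embeddings()(
    torch.tensor(model.config.image_token_id, dtype=torch.long, device=inputs_embeds.device)
)
special_image_mask = (inputs_embeds == image_token_embedding).all(dim=-1)  # [batch, seq_len]
special_image_mask = special_image_mask.unsqueeze(-1).expand_as(inputs_embeds)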
If I understood your answer wrong and it is simply a case of "you are not supposed to use our model this way", fine, but why is the inputs_embeds argument implemented at all then?
The funny thing is that the input_ids in modeling_gemma3.py are only properly used in two places:
in line 902, where the working implementation of the special_image_mask is done,
and, the funny part, in line 885, where exactly the same is done as in my code: inputs_embeds = model.get_input_embeddings()(input_ids). All further processing is done on those inputs_embeds, yet directly providing them leads to the error in the thread title.
I hope you look into it, since using the chat_template isn't really applicable for my application.
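For anyone hitting the same error: a workaround that does not require patching modeling_gemma3.py is to merge the image features into the embeddings yourself and then call the model with inputs_embeds only (no pixel_values), so the problematic branch is never reached. This is just a sketch; it assumes the model exposes a get_image_features helper that runs the vision tower and multi_modal_projector (if your transformers version does not have it, you would have to call those two modules directly):

# Build the text embeddings from the ids, exactly as the model does internally.
inputs_embeds = model.get_input_embeddings()(input_ids)

# Run the vision tower + multi_modal_projector yourself.
with torch.no_grad():
    image_features = model.get_image_features(pixel_values.to(model.dtype))  # [1, 256, hidden]

# Same mask construction as the working input_ids branch in modeling_gemma3.py.
special_image_mask = (input_ids == model.config.image_token_id).unsqueeze(-1)
special_image_mask = special_image_mask.expand_as(inputs_embeds).to(inputs_embeds.device)

# Scatter the image features into the placeholder positions.
inputs_embeds = inputs_embeds.masked_scatter(
    special_image_mask, image_features.to(inputs_embeds.device, inputs_embeds.dtype)
)

# No pixel_values here, so the mask/numel check is never executed.
outputs = model(inputs_embeds=inputs_embeds, use_cache=False)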

BTW, I also tried it with the PT model, for which the documentation gives exactly the prompt I used as an example, and I still get the same error. The only difference from the documentation is the inputs_embeds part: there the input_ids are provided to the model. Edit: for clarification, this is the documentation code:

# pip install accelerate

from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "google/gemma-3-12b-pt"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(requests.get(url, stream=True).raw)

model = Gemma3ForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

prompt = "<start_of_image> in this image, there is"
model_inputs = processor(text=prompt, images=image, return_tensors="pt")

input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)