Model usable on Intel NPU? Or how to convert?

by spotbch - opened

Hi there,

Just wondering if there is a step-by-step process anywhere that details how I can enable this model to be used by an Intel second-gen NPU (275HX). From what I have tested, it is a dynamic input issue, and I have been unable to successfully reshape the model to change its inputs to static. Any tips on what I can do to get this working would be much appreciated!

Thanks!

OpenVINO Toolkit org

@spotbch So the docs don't mention VLM support for the NPU device.

https://docs.openvino.ai/2025/openvino-workflow-generative/inference-with-genai/inference-with-genai-on-npu.html
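
For what it's worth, what those docs do cover is running text-only LLMs on NPU through openvino_genai. A minimal sketch of that flow (the model path is a placeholder for a model already exported to OpenVINO IR; the NPU plugin generally wants symmetrically quantized INT4/INT8 weights):

import openvino_genai as ov_genai

# Placeholder path to an LLM already exported to OpenVINO IR
pipe = ov_genai.LLMPipeline("path/to/llm-int4-sym-ov", "NPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))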

However, I did implement NPU support for LLMs and Whisper in OpenArc, so you should check it out. VLM inference on CPU/GPU is also fully implemented, and this model would work there... however, there may be better options, and the full model in fp16 wastes memory. Instead, adapt an example like:

from optimum.intel import (
    OVModelForVisualCausalLM,
    OVPipelineQuantizationConfig,
    OVQuantizationConfig,
    OVWeightQuantizationConfig,
)

model_id = "nanonets/Nanonets-OCR2-3B"
model = OVModelForVisualCausalLM.from_pretrained(
    model_id,
    export=True,
    trust_remote_code=True,
    quantization_config=OVPipelineQuantizationConfig(
        quantization_configs={
            # Static INT8 quantization for the language model
            "lm_model": OVQuantizationConfig(bits=8),
            # Weight-only INT4 quantization for the text embeddings
            "text_embeddings_model": OVWeightQuantizationConfig(bits=4),
        },
        # Calibration dataset used for the data-aware quantization pass
        dataset="contextual",
        trust_remote_code=True,
    )
)
model.save_pretrained("Nanonets-OCR2-3B-LM-INT4_ASYM-VE-FP16-ov")

which preserves the vision part in full precision, catering to the representational properties of vision tokens. In practice, Gemma3 or Qwen2.5-VL may outperform this model, though I haven't tried them.
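
Once exported, loading the saved folder back and running it on the iGPU looks roughly like this. This is only a sketch: the image, prompt, and generation settings are placeholders, and I load the processor from the original repo since save_pretrained above only stores the model:

from optimum.intel import OVModelForVisualCausalLM
from transformers import AutoProcessor
from PIL import Image

# Load the exported/quantized model onto the iGPU
model = OVModelForVisualCausalLM.from_pretrained(
    "Nanonets-OCR2-3B-LM-INT4_ASYM-VE-FP16-ov", device="GPU"
)
# Processor comes from the original repo (not saved alongside the OV model above)
processor = AutoProcessor.from_pretrained("nanonets/Nanonets-OCR2-3B", trust_remote_code=True)

# Placeholder image and prompt
image = Image.open("page.png")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract the text from this document."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])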

Explore the APIs in this example; they may help you tinker with changing to static shapes... however, this path may invite suffering lol. There are a few issues in OpenArc where a gentleman benchmarked his NPU; the performance suggests you may be better off using the iGPU for VLMs, even though VLM was not evaluated there. Anyway, good job hacking on this :)
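
If you do try forcing static shapes, the place to experiment is the raw OpenVINO API on the exported IR. A minimal sketch, where the IR filename and input names are illustrative and the real VLM inputs are messier because the image token count varies:

import openvino as ov

core = ov.Core()
model = core.read_model("openvino_language_model.xml")  # illustrative IR filename

# Inspect which inputs are dynamic
for inp in model.inputs:
    print(inp.any_name, inp.get_partial_shape())

# Pin dynamic dimensions to fixed sizes (names and sizes are illustrative)
model.reshape({
    "input_ids": ov.PartialShape([1, 1024]),
    "attention_mask": ov.PartialShape([1, 1024]),
})

compiled = core.compile_model(model, "NPU")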

Wow, thank you so much for the information and for such a quick response. I have gotten some LLMs and Whisper to function on the NPU, but I appreciate the link to OpenArc as I hadn't seen that yet. I was impressed with Whisper on the NPU, but I'm just getting into this space, so I can't say I have benchmarked a lot of performance. To that extent, NPU performance is not mission critical for a VLM; I just wanted a proof of concept to see whether it is possible to run one on the NPU while other tasks are completed on the CPU/GPU.

Thanks again!
