Qwen3-VL

These versions of Qwen3-VL-2B-Instruct and Qwen3-VL-4B-Instruct have been converted to run on the Axera NPU using w8a16 quantization.

Compatible with Pulsar2 version: 5.0

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo:

Pulsar2 Link, How to Convert LLM from Huggingface to axmodel

AXera NPU HOST LLM Runtime

Supported platforms

AX650
- AX650N DEMO Board
- M4N-Dock (AXera-Pi Pro, 爱芯派Pro)
- M.2 Accelerator card

Image processing

Chip	Input size	Image count	Image encoder	TTFT (168 tokens)	Decode speed (w8a16)	CMM	Flash
AX650	384*384	1	236 ms	907 ms	4.3 tokens/sec	7.3GiB	7.9GiB

Video processing

Chip	Input size	Image count	Image encoder	TTFT (600 tokens)	Decode speed (w8a16)	CMM	Flash
AX650	384*384	8	778 ms	2442 ms	4.3 tokens/sec	7.3GiB	7.9GiB
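The two tables can be turned into rough throughput figures. The sketch below is an illustration only: the function names are made up for this example, and it assumes the reported TTFT is pure prefill time (the tables do not say whether TTFT includes the image encoder time).

```python
def prefill_tokens_per_second(prompt_tokens: int, ttft_ms: float) -> float:
    """Prompt tokens processed per second, treating TTFT as prefill time (assumption)."""
    return prompt_tokens / (ttft_ms / 1000.0)


def estimated_latency_s(ttft_ms: float, new_tokens: int, decode_tokens_per_s: float) -> float:
    """Rough end-to-end time: TTFT plus decode time for new_tokens generated tokens."""
    return ttft_ms / 1000.0 + new_tokens / decode_tokens_per_s


# Figures taken from the tables above.
image_rate = prefill_tokens_per_second(168, 907)    # image row
video_rate = prefill_tokens_per_second(600, 2442)   # video row
reply_time = estimated_latency_s(907, 100, 4.3)     # 100-token reply to an image prompt

print(f"image prefill: {image_rate:.1f} tokens/s")
print(f"video prefill: {video_rate:.1f} tokens/s")
print(f"100-token image reply: about {reply_time:.1f} s")
```

Under these assumptions, decode speed dominates the total time for any reply longer than a few tokens.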

The CMM column is the amount of CMM memory the model consumes at runtime. Make sure the CMM memory allocated on the development board is larger than this value.
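A quick way to sanity-check the memory requirement is to compare the board's total CMM with the figure in the table. The sketch below is a minimal illustration: the helper name is hypothetical, and it assumes the "MB" values printed by the runtime (for example "Total CMM:7884 MB" in the startup log later in this document) can be treated as MiB.

```python
MIB_PER_GIB = 1024


def cmm_headroom_mib(total_cmm_mib: float, required_gib: float) -> float:
    """CMM left over after loading the model; a negative value means it will not fit."""
    return total_cmm_mib - required_gib * MIB_PER_GIB


# 7884 comes from the startup log in this document, 7.3 GiB from the tables above.
headroom = cmm_headroom_mib(7884, 7.3)
fits = headroom > 0

print(f"headroom: {headroom:.0f} MiB, fits: {fits}")
```

The estimate is only as precise as the rounded 7.3 GiB figure; the same log reports 369 MB left after initialization.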

How to use

Download all files from this repository to the device

If you are using an AX650 board

Prepare the tokenizer server

Install transformers

pip install -r requirements.txt

Demo Run

Image understanding demo

Start the tokenizer server for the image understanding demo

python3 tokenizer_images.py --port 8080
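The runtime expects the tokenizer server to be reachable before it starts (the log below shows it connecting to http://127.0.0.1:8080). The sketch below only checks that something accepts TCP connections on that port; it does not call the tokenizer server's HTTP API, which is not documented here. The function name is hypothetical.

```python
import socket


def tokenizer_port_open(host: str = "127.0.0.1", port: int = 8080, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if tokenizer_port_open():
        print("tokenizer server port is reachable")
    else:
        print("nothing is listening on 127.0.0.1:8080 yet")
```

Running this before the demo script can save a long model load that would otherwise fail at the connection step.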

Run the image understanding demo

Input text

描述这张图片 (Describe this image)

Input image

root@ax650 ~/Qwen3-VL-4B-Instruct # bash run_image_ax650.sh 
[I][                            Init][ 156]: LLM init start
[I][                            Init][ 158]: Total CMM:7884 MB
[I][                            Init][  34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151655
  2% | █                                 |   1 /  39 [0.01s<0.58s, 66.67 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  5% | ██                                |   2 /  39 [0.02s<0.43s, 90.91 count/s] embed_selector init ok[I][                            Init][ 201]: attr.axmodel_num:36
102% | █████████████████████████████████ |  40 /  39 [75.14s<73.27s, 0.53 count/s] init vpm axmodel ok,remain_cmm(369 MB)[I][                            Init][ 266]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][                            Init][ 309]: image encoder output float32

[I][                            Init][ 339]: max_token_len : 2047
[I][                            Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 352]: prefill_token_num : 128
[I][                            Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][                            Init][ 360]: prefill_max_token_num : 1152
[I][                            Init][ 372]: LLM init ok
[I][                            Init][ 374]: Left CMM:369 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这张图片
image >> images/recoAll_attractions_1.jpg
[I][                     EncodeImage][ 440]: pixel_values size 1
[I][                     EncodeImage][ 441]: grid_h 24 grid_w 24
[I][                     EncodeImage][ 489]: image encode time : 236.550995 ms, size : 1
[I][                          Encode][ 532]: input_ids size:168
[I][                          Encode][ 540]: offset 15
[I][                          Encode][ 569]: img_embed.size:1, 368640
[I][                          Encode][ 583]: out_embed size:430080
[I][                          Encode][ 584]: input_ids size 168
[I][                          Encode][ 586]: position_ids size:168
[I][                             Run][ 607]: input token num : 168, prefill_split_num : 2
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:40
[I][                             Run][ 865]: ttft: 907.21 ms
这张图片展示了埃及吉萨金字塔群的壮丽景象，背景是清澈的蓝天，前景是广袤的沙漠。

画面中，最引人注目的是三座宏伟的金字塔，它们是古埃及文明的象征。其中，位于中央的是一座巨大的金字塔，其石块结构清晰可见，显示出古代工匠的精湛技艺。在它的左侧，是一座较小的金字塔，可能是为法老或贵族建造的。在右侧，还有一座金字塔，虽然部分被遮挡，但依然能感受到其雄伟的气势。

金字塔的周围是平坦的沙地，阳光照射下，金字塔的轮廓在蓝天的映衬下显得格外清晰。整个场景充满了历史的厚重感和神秘的氛围，让人不禁感叹古埃及文明的辉煌成就。

这张图片不仅展现了金字塔的建筑之美，也体现了古埃及人对宇宙和永恒的追求。它是一幅令人震撼的自然与人文景观的完美结合。

[N][                             Run][ 992]: hit eos,avg 4.29 token/s
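The log above shows the 168-token prompt being prefilled in two steps of 128 and 40 tokens, which matches prefill_token_num : 128. The sketch below reproduces that split; it is an illustration of the arithmetic inferred from the log, not the runtime's actual code, and the function name is made up.

```python
from typing import List


def split_prefill(input_tokens: int, chunk: int = 128) -> List[int]:
    """Split a prompt into prefill chunks of at most `chunk` tokens, in order."""
    if input_tokens < 0 or chunk <= 0:
        raise ValueError("input_tokens must be >= 0 and chunk must be > 0")
    full, rest = divmod(input_tokens, chunk)
    # All full chunks first, then the remainder (if any) as the last chunk.
    return [chunk] * full + ([rest] if rest else [])


image_chunks = split_prefill(168)   # image run above
video_chunks = split_prefill(600)   # video run later in this document

print(image_chunks, len(image_chunks))
print(video_chunks, len(video_chunks))
```

The chunk counts agree with the prefill_split_num values printed by the runtime: 2 for the image run and 5 for the video run.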

Video understanding demo

Start the tokenizer server for the video understanding demo

python3 tokenizer_video.py --port 8080

Run the video understanding demo

Input text

描述这个视频 (Describe this video)

Input video

./video
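In the run shown in the log for this demo, the runtime lists eight frames, frame_0000.jpg through frame_0056.jpg, in steps of 8. How the runtime chooses frames is not documented here. The sketch below is one hypothetical way to pre-select such frames from an extracted video, assuming 64 consecutively numbered JPEG frames; the function name, stride, and frame count are all assumptions.

```python
from typing import Iterable, List


def pick_frames(frame_names: Iterable[str], stride: int = 8, max_frames: int = 8) -> List[str]:
    """Keep every `stride`-th frame (in sorted order), up to `max_frames` frames."""
    ordered = sorted(frame_names)
    return ordered[::stride][:max_frames]


# Hypothetical input: 64 extracted frames named like the ones in the log.
all_frames = [f"video/frame_{i:04d}.jpg" for i in range(64)]
selected = pick_frames(all_frames)

for name in selected:
    print(name)
```

Zero-padded frame numbers matter here, because the selection relies on lexicographic sorting matching frame order.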

root@ax650 ~/Qwen3-VL-4B-Instruct # bash run_video_ax650.sh 
[I][                            Init][ 156]: LLM init start
[I][                            Init][ 158]: Total CMM:7884 MB
[I][                            Init][  34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151656
  2% | █                                 |   1 /  39 [0.01s<0.43s, 90.91 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  5% | ██                                |   2 /  39 [0.01s<0.29s, 133.33 count/s] embed_selector init ok[I][                            Init][ 201]: attr.axmodel_num:36
102% | █████████████████████████████████ |  40 /  39 [73.00s<71.17s, 0.55 count/s] init vpm axmodel ok,remain_cmm(369 MB)[I][                            Init][ 266]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][                            Init][ 309]: image encoder output float32

[I][                            Init][ 339]: max_token_len : 2047
[I][                            Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 352]: prefill_token_num : 128
[I][                            Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][                            Init][ 360]: prefill_max_token_num : 1152
[I][                            Init][ 372]: LLM init ok
[I][                            Init][ 374]: Left CMM:369 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频
video >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][                     EncodeImage][ 440]: pixel_values size 4
[I][                     EncodeImage][ 441]: grid_h 24 grid_w 24
[I][                     EncodeImage][ 489]: image encode time : 778.210022 ms, size : 4
[I][                          Encode][ 532]: input_ids size:600
[I][                          Encode][ 540]: offset 15
[I][                          Encode][ 569]: img_embed.size:4, 368640
[I][                          Encode][ 574]: offset:159
[I][                          Encode][ 574]: offset:303
[I][                          Encode][ 574]: offset:447
[I][                          Encode][ 583]: out_embed size:1536000
[I][                          Encode][ 584]: input_ids size 600
[I][                          Encode][ 586]: position_ids size:600
[I][                             Run][ 607]: input token num : 600, prefill_split_num : 5
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:88
[I][                             Run][ 865]: ttft: 2441.51 ms
这个视频展示了一群**土拨鼠**（或称旱獭）在山地环境中嬉戏打闹的生动场景。

**画面内容：**

- **主体动物**：画面中有多只土拨鼠，它们毛色以灰、棕、白相间，体型圆润，四肢短小，尾巴蓬松。它们正互相追逐、扑打、推搡，动作非常活跃，看起来像是在玩耍或争斗。
- **动作细节**：土拨鼠们用前爪互相拍打、推搡，有的甚至用后腿蹬地，姿态充满动感。其中一只土拨鼠的前爪高高举起，似乎在“击打”另一只，画面充满动感和趣味。
- **背景环境**：背景是连绵起伏的山峦，山坡上覆盖着绿色植被，远处可见裸露的岩石和一条蜿蜒的山路。天空湛蓝，阳光明媚，整个场景充满自然野趣。
- **构图与视觉效果**：画面采用近景特写，聚焦于土拨鼠的互动，背景则略显模糊，突出了主体。画面中还出现了轻微的“多重曝光”或“动态模糊”效果，增强了动作的动感和趣味性。

**整体氛围：**

视频充满活力和趣味，展现了野生动物在自然环境中的自然行为，尤其是它们之间充满“斗殴”趣味的互动，让人忍俊不禁。这种“打斗”在动物界中常是社交、领地争夺或玩耍行为，但在这里被拍摄得极具戏剧性和趣味性。

**总结：**

这是一段充满动感和趣味的野生动物视频，展现了土拨鼠在山地环境中活泼好动、互相嬉戏的可爱瞬间，背景壮丽，画面生动，令人印象深刻。

[N][                             Run][ 992]: hit eos,avg 4.30 token/s
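The sizes printed in the two logs are consistent with each other, and working through them shows where the 168 and 600 prompt tokens come from. The sketch below assumes a 2x2 spatial merge of the 24x24 patch grid, which is inferred from the logged offsets (15, 159, 303, 447 are 144 apart) rather than stated in this document; the variable names are illustrative.

```python
def visual_tokens(grid_h: int, grid_w: int, merge: int = 2) -> int:
    """Tokens per encoded item after an assumed merge x merge spatial merge."""
    return (grid_h // merge) * (grid_w // merge)


per_item = visual_tokens(24, 24)             # grid_h 24, grid_w 24 from the log

# Text tokens left once the visual tokens are subtracted from input_ids size.
image_text_tokens = 168 - 1 * per_item       # image run: pixel_values size 1
video_text_tokens = 600 - 4 * per_item       # video run: pixel_values size 4

# Embedding width implied by "img_embed.size:1, 368640".
embed_width = 368640 // per_item

# Where each encoded item starts in the video prompt, given "offset 15".
video_offsets = [15 + i * per_item for i in range(4)]

print(per_item, image_text_tokens, video_text_tokens, embed_width)
print(video_offsets)
print(168 * embed_width, 600 * embed_width)  # compare with the logged out_embed sizes
```

Both runs end up with the same number of non-visual tokens, and the products in the last line match the logged out_embed sizes 430080 and 1536000. The eight video frames map to four encoded items, which suggests frames are processed in pairs, although the log does not say so explicitly.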


Model tree for AXERA-TECH/Qwen3-VL-4B-Instruct

Base model: Qwen/Qwen3-VL-2B-Instruct