Qwen3-VL

These builds of Qwen3-VL-2B-Instruct and Qwen3-VL-4B-Instruct have been converted to run on the Axera NPU using w8a16 quantization.

Compatible with Pulsar2 version: 5.0

Conversion tool links:

If you are interested in model conversion, you can try exporting the axmodel yourself from the original repo:

- Pulsar2 Link, How to Convert LLM from Huggingface to axmodel
- AXera NPU HOST LLM Runtime

Supported Platforms

AX650
- AX650N DEMO Board
- M4N-Dock (爱芯派Pro / AXera-Pi Pro)
- M.2 Accelerator card

Image Processing

Chip	Input size	Image count	Image encoder	TTFT (168 tokens)	Decode speed (w8a16)	CMM	Flash
AX650	384*384	1	238 ms	392 ms	9.5 tokens/sec	4.1 GiB	4.2 GiB

Video Processing

Chip	Input size	Image count	Image encoder	TTFT (600 tokens)	Decode speed (w8a16)	CMM	Flash
AX650	384*384	8	751 ms	1045 ms	9.5 tokens/sec	4.1 GiB	4.2 GiB
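As a rough illustration of how the figures in the two tables combine, the sketch below estimates the wall-clock time for one reply. It assumes that the image encoder time is not already counted inside the TTFT figure and that decoding runs at a steady 9.5 tokens/sec. The function name `estimate_latency_s` and the 100-token reply length are hypothetical, chosen only for illustration; they are not part of the runtime.

```python
def estimate_latency_s(encoder_ms: float, ttft_ms: float,
                       decode_tok_per_s: float, output_tokens: int) -> float:
    """Rough estimate in seconds: image encode + prefill (TTFT) + decode.

    Assumption: the encoder time is NOT already included in the TTFT value.
    """
    if decode_tok_per_s <= 0:
        raise ValueError("decode_tok_per_s must be positive")
    return (encoder_ms + ttft_ms) / 1000.0 + output_tokens / decode_tok_per_s


# Figures copied from the AX650 rows of the image and video tables.
image_case = estimate_latency_s(238, 392, 9.5, output_tokens=100)
video_case = estimate_latency_s(751, 1045, 9.5, output_tokens=100)

print(f"image, 100 output tokens: ~{image_case:.1f} s")
print(f"video, 100 output tokens: ~{video_case:.1f} s")
```

For replies of this length the decode phase dominates, so the reply length matters more than whether the input is a single image or eight video frames.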

The CMM column is the amount of CMM memory (carved out of the board's DDR) that the model consumes at runtime. Ensure that the CMM allocation on the development board is greater than this value.
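One way to check this is to read the `Total CMM:<n> MB` line that the runtime prints during init (it appears in the demo logs further down) and compare it with the CMM figure from the tables. The sketch below is a minimal illustration, not part of the shipped tools: the helper names are hypothetical, and it assumes that the runtime's "MB" means MiB.

```python
import re
from typing import Optional

REQUIRED_CMM_GIB = 4.1  # CMM figure from the tables above


def parse_total_cmm_mb(log_text: str) -> Optional[int]:
    """Return the number from a 'Total CMM:<n> MB' log line, or None if absent."""
    match = re.search(r"Total CMM:\s*(\d+)\s*MB", log_text)
    return int(match.group(1)) if match else None


def cmm_is_sufficient(total_mb: int,
                      required_gib: float = REQUIRED_CMM_GIB) -> bool:
    """Compare in MiB, assuming the runtime reports MiB under the label 'MB'."""
    return total_mb > required_gib * 1024


# Init line copied from the image demo log in this README.
sample = "[I][                            Init][ 158]: Total CMM:7884 MB"
total = parse_total_cmm_mb(sample)
print(total, cmm_is_sufficient(total))
```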

How to use

Download all files from this repository to the device.

If you are using an AX650 board

Prepare the tokenizer server

Install transformers

pip install -r requirements.txt

Demo Run

Image understanding demo

Start the tokenizer server for the image understanding demo

python3 tokenizer_images.py --port 8080
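The demo binary connects to the tokenizer server at startup (the log below shows `connect http://127.0.0.1:8080 ok`), so it can help to confirm that the port is accepting connections before launching the demo. The following is a minimal sketch using only a plain TCP connect; it makes no assumption about the server's HTTP routes, and the function name is hypothetical.

```python
import socket


def tokenizer_server_is_up(host: str = "127.0.0.1", port: int = 8080,
                           timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("tokenizer server reachable:", tokenizer_server_is_up())
```

If you start the server with a different `--port` value, pass the same port to this check.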

Run the image understanding demo

Input text

描述这张图片 ("Describe this image")

Input image: images/recoAll_attractions_1.jpg

root@ax650 ~/Qwen3-VL-2B-Instruct # bash run_image_ax650.sh 
[I][                            Init][ 156]: LLM init start
[I][                            Init][ 158]: Total CMM:7884 MB
[I][                            Init][  34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151655
  3% | ██                                |   1 /  31 [0.01s<0.31s, 100.00 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  6% | ███                               |   2 /  31 [0.01s<0.20s, 153.85 count/s] embed_selector init ok[I][                            Init][ 201]: attr.axmodel_num:28
103% | ██████████████████████████████████ |  32 /  31 [13.72s<13.29s, 2.33 count/s] init vpm axmodel ok,remain_cmm(3678 MB)[I][                            Init][ 266]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][                            Init][ 309]: image encoder output float32

[I][                            Init][ 339]: max_token_len : 2047
[I][                            Init][ 344]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 352]: prefill_token_num : 128
[I][                            Init][ 356]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 356]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 356]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 356]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 356]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 356]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 356]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 356]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 356]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 356]: grp: 10, prefill_max_token_num : 1152
[I][                            Init][ 360]: prefill_max_token_num : 1152
[I][                            Init][ 372]: LLM init ok
[I][                            Init][ 374]: Left CMM:3678 MB
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这张图片
image >> images/recoAll_attractions_1.jpg
[I][                     EncodeImage][ 440]: pixel_values size 1
[I][                     EncodeImage][ 441]: grid_h 24 grid_w 24
[I][                     EncodeImage][ 489]: image encode time : 230.444000 ms, size : 1
[I][                          Encode][ 532]: input_ids size:168
[I][                          Encode][ 540]: offset 15
[I][                          Encode][ 569]: img_embed.size:1, 294912
[I][                          Encode][ 583]: out_embed size:344064
[I][                          Encode][ 584]: input_ids size 168
[I][                          Encode][ 586]: position_ids size:168
[I][                             Run][ 607]: input token num : 168, prefill_split_num : 2
[I][                             Run][ 641]: input_num_token:128
[I][                             Run][ 641]: input_num_token:40
[I][                             Run][ 865]: ttft: 392.89 ms
好的，这是一张关于埃及吉萨金字塔的图片。

这张图片展示了埃及吉萨金字塔群的壮丽景象。在广阔的沙漠中，几座巨大的金字塔巍然耸立，它们由巨大的石块堆砌而成，呈现出经典的阶梯状结构。这些金字塔是古埃及文明的杰作，是世界著名的文化遗产。

在画面的前景，可以看到一些游客或探险者，他们与金字塔相比显得微不足道，这更突显了金字塔的宏伟与古老。天空晴朗，阳光明媚，为整个场景增添了明亮的色彩。整个画面充满了历史的厚重感和自然的壮美。

[N][                             Run][ 992]: hit eos,avg 9.39 token/s
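The sizes in the log above fit together arithmetically. The sketch below reproduces the prefill split (168 tokens processed as 128 + 40) and infers the per-token embedding width and the number of tokens taken up by the image. It is a reading of the logged numbers, not a description of the runtime's internals: the function name is hypothetical, and treating the logged sizes as element counts is an assumption.

```python
from typing import List


def split_prefill(input_tokens: int, chunk: int = 128) -> List[int]:
    """Split a prompt into prefill chunks of at most `chunk` tokens."""
    full, rest = divmod(input_tokens, chunk)
    return [chunk] * full + ([rest] if rest else [])


# Values read from the image demo log.
GRID_H, GRID_W = 24, 24   # "grid_h 24 grid_w 24"
INPUT_IDS = 168           # "input_ids size:168"
IMG_EMBED = 294912        # "img_embed.size:1, 294912"
OUT_EMBED = 344064        # "out_embed size:344064"

hidden = OUT_EMBED // INPUT_IDS             # values per token (inferred)
image_tokens = IMG_EMBED // hidden          # tokens occupied by the image
merge = (GRID_H * GRID_W) // image_tokens   # grid cells folded into one token
text_tokens = INPUT_IDS - image_tokens      # remaining prompt/template tokens

print(split_prefill(INPUT_IDS))                   # [128, 40]
print(hidden, image_tokens, merge, text_tokens)   # 2048 144 4 24
```

The two chunks match the `input_num_token:128` and `input_num_token:40` lines, and `prefill_split_num : 2` is the length of that list.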

Video understanding demo

Start the tokenizer server for the video understanding demo

python tokenizer_video.py --port 8080

Run the video understanding demo

Input text

描述这个视频 ("Describe this video")

Input video

./video

root@ax650 ~/Qwen3-VL # bash run_qwen3_vl_2b_video.sh 
[I][                            Init][ 156]: LLM init start
[I][                            Init][  34]: connect http://127.0.0.1:8080 ok
bos_id: -1, eos_id: 151645
img_start_token: 151652
img_context_token: 151656
  3% | ██                                |   1 /  31 [0.01s<0.31s, 100.00 count/s] tokenizer init ok[I][                            Init][  26]: LLaMaEmbedSelector use mmap
  6% | ███                               |   2 /  31 [0.01s<0.20s, 153.85 count/s] embed_selector init ok[I][                            Init][ 198]: attr.axmodel_num:28
103% | ██████████████████████████████████ |  32 /  31 [30.34s<29.39s, 1.05 count/s] init vpm axmodel ok,remain_cmm(3678 MB)[I][                            Init][ 263]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][                            Init][ 306]: image encoder output float32

[I][                            Init][ 336]: max_token_len : 2047
[I][                            Init][ 341]: kv_cache_size : 1024, kv_cache_num: 2047
[I][                            Init][ 349]: prefill_token_num : 128
[I][                            Init][ 353]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 353]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 353]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 353]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 353]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 353]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 353]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 353]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 353]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 353]: grp: 10, prefill_max_token_num : 1152
[I][                            Init][ 357]: prefill_max_token_num : 1152
[I][                            Init][ 366]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频
image >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][                          Encode][ 490]: image encode time : 751.804993 ms, size : 4
[I][                          Encode][ 533]: input_ids size:600
[I][                          Encode][ 541]: offset 15
[I][                          Encode][ 557]: img_embed.size:4, 294912
[I][                          Encode][ 562]: offset:159
[I][                          Encode][ 562]: offset:303
[I][                          Encode][ 562]: offset:447
[I][                          Encode][ 571]: out_embed size:1228800
[I][                          Encode][ 573]: position_ids size:600
[I][                             Run][ 591]: input token num : 600, prefill_split_num : 5
[I][                             Run][ 625]: input_num_token:128
[I][                             Run][ 625]: input_num_token:128
[I][                             Run][ 625]: input_num_token:128
[I][                             Run][ 625]: input_num_token:128
[I][                             Run][ 625]: input_num_token:88
[I][                             Run][ 786]: ttft: 1040.91 ms
根据您提供的图片，这是一段关于两只土拨鼠在山地环境中互动的视频片段。

- **主体**：画面中有两只土拨鼠（也称“山地土拨鼠”或“黑背土拨鼠”），它们正站在一块布满碎石的草地上。它们的毛色为灰褐色与黑色相间，面部有明显的黑色条纹，这是土拨鼠的典型特征。

- **行为**：这两只土拨鼠正进行着一种看似玩耍或社交的互动。它们用前爪互相拍打，身体前倾，姿态充满活力。这种行为在土拨鼠中通常表示友好、玩耍或建立社交联系。

- **环境**：背景是连绵起伏的山脉，山坡上覆盖着绿色的植被，天空晴朗，阳光明媚。整个场景给人一种自然、宁静又充满生机的感觉。

- **视频风格**：从画面的清晰度和动态感来看，这可能是一段慢动作或高清晰度的视频片段，捕捉了土拨鼠活泼、生动的瞬间。

综上所述，这段视频生动地记录了两只土拨鼠在自然山地环境中友好互动的场景，展现了它们活泼、充满活力的天性。

[N][                             Run][ 913]: hit eos,avg 9.44 token/s

prompt >>
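The video log can be read the same way. Eight frames are listed, but the encoder reports `size : 4`, which suggests that frames are grouped in pairs; each group then occupies a fixed run of tokens starting at the logged offsets. The sketch below re-derives those offsets from the logged sizes. The pairing of frames and the meaning of `img_embed.size:4, 294912` (four groups of 294912 values each) are assumptions inferred from the numbers, and the function name is hypothetical.

```python
from typing import List


def visual_offsets(first_offset: int, groups: int,
                   tokens_per_group: int) -> List[int]:
    """Start position of each visual group when groups are laid out back to back."""
    return [first_offset + i * tokens_per_group for i in range(groups)]


# Values read from the video demo log.
FRAMES = 8                     # eight frame_*.jpg files listed
GROUPS = 4                     # "image encode time ... size : 4"
IMG_EMBED_PER_GROUP = 294912   # "img_embed.size:4, 294912"
OUT_EMBED = 1228800            # "out_embed size:1228800"
INPUT_IDS = 600                # "input_ids size:600"
FIRST_OFFSET = 15              # "offset 15"

hidden = OUT_EMBED // INPUT_IDS                     # values per token (inferred)
tokens_per_group = IMG_EMBED_PER_GROUP // hidden    # tokens per frame group
frames_per_group = FRAMES // GROUPS                 # frames merged per group
offsets = visual_offsets(FIRST_OFFSET, GROUPS, tokens_per_group)
text_tokens = INPUT_IDS - GROUPS * tokens_per_group

print(offsets)                                        # [15, 159, 303, 447]
print(hidden, tokens_per_group, frames_per_group, text_tokens)  # 2048 144 2 24
```

The computed offsets match the `offset 15`, `offset:159`, `offset:303`, and `offset:447` lines in the log.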

