NVIDIA-Nemotron-Nano-VL-12B-V2-FP4-QAD

Model Overview

Description

NVIDIA-Nemotron-Nano-VL-12B-V2-FP4-QAD is the quantized version of NVIDIA-Nemotron-Nano-VL-12B-V2, an auto-regressive vision-language model that uses an optimized transformer architecture. For more information, please check here. The model is quantized to FP4 with TensorRT Model Optimizer.

This model was quantized using Quantization-Aware Distillation (QAD) on commercial images.

The underlying model was trained on commercial images for all three stages of training and supports single-image inference.
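
For reference, the sketch below shows how a checkpoint can be quantized to NVFP4 with TensorRT Model Optimizer's quantization API. This is a minimal illustration only, not the recipe behind this release: the released checkpoint was produced with QAD, which additionally fine-tunes the quantized model against the BF16 teacher. The config name is taken from recent ModelOpt releases, and calib_dataloader is a hypothetical calibration dataloader.

import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Run a few calibration batches to collect activation statistics.
    for batch in calib_dataloader:  # hypothetical dataloader, not part of this card
        model(**batch)

# Quantize weights/activations to NVFP4 (config name assumed from recent
# ModelOpt releases); the quantized model can then be exported for deployment.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)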

License/Terms of Use

Governing Terms:

Your use of the model is governed by the NVIDIA Open Model License Agreement.

Additional Information:

Backbone LLM: NVIDIA-Nemotron-Nano-12B-v2.

Deployment Geography:

Global

Use Case:

Customers: AI foundry enterprise customers

Use Cases: Image summarization, text-image analysis, Optical Character Recognition (OCR), interactive Q&A on images, and text chain-of-thought reasoning.

Release Date:

  • Hugging Face [October 28, 2025]

Model Architecture:

Network Type: Transformer

Network Architecture:

Vision Encoder: C-RADIOv2-H

Language Encoder: NVIDIA-Nemotron-Nano-12B-v2

Input

Input Type(s): Image, Text

  • Input Images
  • Language Supported: German, Spanish, French, Italian, Korean, Portuguese, Russian, Japanese, Chinese, English

Input Format(s): Image (Red, Green, Blue (RGB)), and Text (String)

Input Parameters: Image (2D), Text (1D)

Other Properties Related to Input:

  • Context length up to 128K
  • Maximum Resolution: Determined by a 12-tile layout constraint, with each tile being 512 × 512 pixels (see the sketch after this list). This supports aspect ratios such as:
    • 4 × 3 layout: up to 2048 × 1536 pixels
    • 3 × 4 layout: up to 1536 × 2048 pixels
    • 2 × 6 layout: up to 1024 × 3072 pixels
    • 6 × 2 layout: up to 3072 × 1024 pixels
    • Other configurations allowed, provided total tiles ≤ 12
  • Channel Count: 3 channels (RGB)
  • Alpha Channel: Not supported (no transparency)
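
The tile-layout constraint above fully determines the maximum supported resolution for any aspect ratio. A minimal sketch (illustrative only, assuming the 512 × 512 tiles and ≤ 12-tile budget stated above) that enumerates the valid layouts:

TILE = 512       # tile side length in pixels, per the constraint above
MAX_TILES = 12   # total tile budget

# Enumerate every (columns, rows) grid whose tile count fits the budget.
layouts = [
    (cols, rows)
    for cols in range(1, MAX_TILES + 1)
    for rows in range(1, MAX_TILES + 1)
    if cols * rows <= MAX_TILES
]

for cols, rows in sorted(layouts, key=lambda cr: cr[0] * cr[1], reverse=True):
    print(f"{cols} x {rows} layout: up to {cols * TILE} x {rows * TILE} pixels")
# e.g. "4 x 3 layout: up to 2048 x 1536 pixels"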

Output

Output Type(s): Text

Output Formats: String

Output Parameters: One-Dimensional (1D); sequences up to 128K tokens

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration

Runtime Engine(s): vLLM
Supported Hardware Microarchitecture Compatibility: B100 SXM
Supported Operating System(s): Linux

Model Versions:

Nemotron-Nano-VL-12B-V2-FP4-QAD

Quick Start

Install Dependencies

pip install causal_conv1d "transformers>4.53,<4.54" torch timm "mamba-ssm==2.2.5" accelerate open_clip_torch numpy pillow

Usage

To serve this checkpoint with vLLM, start the vllm/vllm-openai:nightly Docker container and run the sample command below:

python3 -m vllm.entrypoints.openai.api_server --model nvidia/Nemotron-Nano-VL-12B-V2-FP4-QAD --trust-remote-code --quantization modelopt_fp4
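
Once the server is up, it exposes an OpenAI-compatible API (on port 8000 by default). A minimal client sketch, assuming a local server and a placeholder image file example.jpg; the request follows the standard OpenAI vision chat schema that vLLM implements:

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Encode a local RGB image as a base64 data URL (example.jpg is a placeholder).
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="nvidia/Nemotron-Nano-VL-12B-V2-FP4-QAD",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)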

Training, Testing, and Evaluation Datasets:

Training Datasets:

Data Modalities

  • Total size: 39'486'703 samples (27.7 TB)
  • Total number of datasets: 270
    • Text-only datasets: 33
    • Text-and-image datasets: 176
    • Video-and-text datasets: 61
  • Data modalities: Text, Image, Video
  • Data Collection Method by dataset: Hybrid: Automated, Human, Synthetic
  • Labeling Method by dataset: Hybrid: Automated, Human, Synthetic
  • Dataset partition: Training [100%], Testing [0%], Validation [0%]
  • Time period for training data collection: 2023-2025
  • Time period for testing data collection: N/A
  • Time period for validation data collection: N/A

The post-training datasets consist of a mix of internal and public datasets designed for training vision language models across various tasks. They include:

  • Public datasets sourced from publicly available images and annotations, supporting tasks like classification, captioning, visual question answering, conversation modeling, document analysis and text/image reasoning.
  • Internal text and image datasets built with public commercial images and internal labels, adapted for the same tasks as listed above.
  • Synthetic image datasets generated programmatically for specific tasks like tabular data understanding and optical character recognition (OCR), for English, Chinese as well as other languages.
  • Video datasets supporting video question answering and reasoning tasks from publicly available video sources, with either publicly available or internally generated annotations.
  • Specialized datasets for safety alignment, function calling, and domain-specific tasks (e.g., science diagrams, financial question answering).
  • NVIDIA-Sourced Synthetic Datasets for text reasoning.
  • Private datasets for safety alignment or VQA on invoices.
  • Crawled or scraped captioning, VQA, and video datasets.
  • Some datasets were improved with Qwen2.5-72B-Instruct annotations.

For roughly 30% of our total training corpus, and for several of the domains listed above, we used commercially permissive models to perform:

  • Language translation
  • Re-labeling of annotations for text, image and video datasets
  • Synthetic data generation
  • Generating chain-of-thought (CoT) traces

Additional processing for several datasets included rule-based QA generation (e.g., with templates), expanding short answers into longer responses, as well as proper reformatting. More details can be found here.
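
As an illustration of the template-based QA generation mentioned above (the actual templates used in the training pipeline are not published; these are invented for the sketch):

import random

random.seed(0)  # deterministic choice for the sketch

# Hypothetical question/answer templates over (row, column, value) table cells.
TEMPLATES = [
    ("What is the {column} of {row}?", "{value}"),
    ("Report the {column} for {row}.", "The {column} of {row} is {value}."),
]

def qa_from_cell(row, column, value):
    # Pick a template pair and fill in the cell's fields.
    q_tpl, a_tpl = random.choice(TEMPLATES)
    fields = {"row": row, "column": column, "value": value}
    return q_tpl.format(**fields), a_tpl.format(**fields)

print(qa_from_cell("ACME Corp", "revenue", "$1.2M"))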

Image-based datasets were all scanned against known CSAM to ensure that no such content was included in training.

Public Datasets

Dataset Name | Type | Modalities | Number of Samples | Size
--- | --- | --- | --- | ---
Captioning on Open Images (subset, relabeled) | VQA | image, text | 1'278'221 | 378.34 GB
Localized Narratives (subset, relabeled) | VQA | image, text | 503'275 | 147.67 GB
TextCaps (subset) | Image Captioning | image, text | 21'953 | 5.76 GB
TextCaps (subset) | Image Captioning | image, text | 109'765 | 28.81 GB
TextVQA (subset) | Image Captioning | image, text | 34'602 | 9.08 GB
RefCoco | Referring Expression Grounding | image, text | 14'694 | 2.39 GB
VQAv2 | VQA | image, text | 28'555 | 4.41 GB
AOKVQA | VQA | image, text | 20'832 | 3.39 GB
GQA | VQA | image, text | 21'433 | 2.94 GB
AOKVQA | VQA | image, text | 16'131 | 2.62 GB
synthdog-en | OCR | image, text | 29'672 | 2.31 GB
WIT | Image Captioning | image, text | 538'916 | 745.24 GB
CLEVR | Image Reasoning | image, text | 70'000 | 12.57 GB
CLEVR-Math | Image Reasoning | image, text | 70'000 | 12.47 GB
OpenAssistant (oasst1, oasst2) | Text Instruction Tuning | text | 47'118 | 0.09 GB
VATEX | Video Captioning | video, text | 2'880 | 5.50 GB
YouCook2 | Video Captioning | video, text | 36 | 0.17 GB
VCG+ 112K | VideoQA | video, text | 164 | 2.82 GB
Video Localized Narratives | Video Captioning | video, text | 373 | 0.64 GB
CLEVRER | VQA | video, text | 40'000 | 46.05 GB
NExT-QA | VideoQA | video, text | 10'368 | 57.06 GB
CLEVRER | Video Reasoning | video, text | 42'620 | 49.10 GB
ScreenQA | VQA | image, text | 302'004 | 30.52 GB
WikiSQL | Image Reasoning | image, text | N/A | N/A
WikiTableQuestions | TextQA | text | N/A | N/A
RenderedText | OCR | image, text | N/A | N/A
FinQA | Text Reasoning | text | N/A | N/A
TAT-QA | Text Reasoning | text | N/A | N/A
Databricks Dolly 15K | Text Instruction Tuning | text | N/A | N/A
WebSight | Image Classification | image, text | N/A | N/A
RAVEN | Image Reasoning | image, text | N/A | N/A
VizWiz | VQA | image, text | N/A | N/A
Inter-GPS | Image Reasoning | image, text | N/A | N/A
OCR dataset from arXiv data | OCR | image, text | 120'000 | 49.99 GB
OCR dataset from arXiv data | OCR | image, text | 599'927 | 249.93 GB
OCR dataset from arXiv data | OCR | image, text | 1'565'011 | 1637.79 GB
OCR dataset from arXiv data | OCR | image, text | 418'059 | 422.04 GB
OCR dataset from arXiv data | OCR | image, text | 200'001 | 200.89 GB
OCR dataset from arXiv data | OCR | image, text | 200'000 | 198.94 GB
OCR dataset from arXiv data | OCR | image, text | 200'001 | 196.08 GB
OCR dataset from arXiv data | OCR | image, text | 400'000 | 382.95 GB
OCR dataset from arXiv data | OCR | image, text | 400'000 | 388.16 GB
OCR dataset from arXiv data | OCR | image, text | 18'280 | 20.98 GB
DocLayNet (curated) | OCR | image, text | 48'369 | 18.59 GB
DocLayNet (curated & augmented) | OCR | image, text | 48'249 | 9.12 GB
DocLayNet (curated & augmented) | OCR | image, text | 48'267 | 9.09 GB
SynthTabNet | OCR | image, text | 200'000 | 9.70 GB
OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'309 | 17.00 GB
OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'461 | 7.77 GB
OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 8'462 | 7.99 GB
OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'236 | 5.84 GB
OCR dataset based on pdfs from CommonCrawl | OCR | image, text | 14'232 | 5.92 GB
SynthTables | OCR | image, text | 4'887 | 0.38 GB
TabRecSet | OCR | image, text | 25'281 | 2.46 GB
TabRecSet | OCR | image, text | 25'281 | 1.61 GB
FinTabNet | OCR | image, text | 57'137 | 9.22 GB
FinTabNet | OCR | image, text | 57'131 | 21.76 GB
FinTabNet | OCR | image, text | 57'129 | 21.68 GB
PubTables-1M | OCR | image, text | 224'170 | 29.55 GB
PubTables-1M | OCR | image, text | 224'169 | 36.32 GB
PubTables-1M | OCR | image, text | 225'108 | 36.45 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 37.13 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 33.38 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 32.85 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 31.15 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.30 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 38.40 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 27.09 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 29.52 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.49 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 30.14 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 100.14 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.82 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 93.96 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.61 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 89.89 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 95.75 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 85.65 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 91.01 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 90.29 GB
OCR dataset based on Wikimedia | OCR | image, text | 200'000 | 84.66 GB
TextOCR | OCR | image, text | 21'727 | 5.83 GB
TextOCR | OCR | image, text | 21'138 | 2.83 GB
Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'359 | 12.92 GB
Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'351 | 14.57 GB
Table OCR on pdfs from CommonCrawl | OCR | image, text | 19'350 | 14.44 GB
HierText | OCR | image, text | 8'278 | 2.60 GB
FUNSD | OCR | image, text | 149 | 0.01 GB
Gretel Synthetic Safety Alignment | Safety | text | 19'779 | 0.03 GB
Internal safety alignment multimodal dataset | Safety | image, text | 22'559 | 8.27 GB
ALFRED Action | Safety | video, text | 6'524 | 5.92 GB
ALFRED Goal | Safety | video, text | 6'464 | 5.86 GB
VQA-RAD | Safety | image, text | 1'793 | 0.09 GB
SLAKE | Safety | image, text | 9'835 | 0.85 GB
STEM MMLU-aux (subset) | Safety | text | 37'444 | 0.49 GB
Glaive & Xlam | Function call | text | 8'000 | 0.02 GB
Textbooks VQA | VQA | image, text | 46'745 | 10.85 GB
ai2d | VQA | image, text | 12'413 | 2.23 GB
ScienceQA | VQA | image, text | 12'716 | 0.39 GB
ScienceQA from LlaVA-OneVision | VQA | image, text | 19'196 | 0.65 GB
ChartQA | VQA | image, text | 15'121 | 0.68 GB
ChartQA (augmented) | VQA | image, text | 15'050 | 0.65 GB
ChartQA (CoT) | VQA | image, text | 23'571 | 1.04 GB
ChartQA | VQA | image, text | 60'438 | 2.69 GB
Geo170K | VQA | image, text | 13'263 | 0.07 GB
InfographicVQA | VQA | image, text | 23'946 | 8.21 GB
DocVQA | VQA | image, text | 39'463 | 26.29 GB
DocVQA (CoT) | Image Reasoning | image, text | 16'881 | 10.65 GB
ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 524'892 | 96.99 GB
ALLaVA-4V (subset) | Visual Instruction Tuning | image, text | 227'776 | 42.52 GB
TabMWP | Image Reasoning | image, text | 23'058 | 0.30 GB
PMC-VQA | VQA | image, text | 2'266 | 0.04 GB
OCR-VQA from The Cauldron | VQA | image, text | 165'746 | 5.79 GB
ST-VQA from The Cauldron | VQA | image, text | 17'232 | 0.68 GB
WebSight from The Cauldron | OCR | image, text | 9'809 | 1.84 GB
EST-VQA | VQA | image, text | 17'043 | 4.25 GB
TAL Handwritten English OCR | OCR | image, text | 9'998 | 0.22 GB
TAL Handwritten Math writing | OCR | image, text | 22'244 | 0.33 GB
SlideVQA | VQA | image, text | 5'773 | 0.42 GB
pixmo-docs | VQA | image, text | 251'165 | 34.88 GB
pixmo-cap | Image Captioning | image, text | 706'897 | 261.63 GB
pixmo-cap-qa | VQA | image, text | 214'978 | 56.72 GB
pixmo-ask-model-anything | Visual Instruction Tuning | image, text | 153'592 | 20.50 GB
TallyQA | VQA | image, text | 68'775 | 10.64 GB
Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 490.37 GB
Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'664'533 | 488.17 GB
Bounding box to text annotations on a subset of Open Images | VQA | image, text | 1'128'326 | 324.46 GB
TabMWP (CoT) | Image Reasoning | image, text | 20'305 | 0.28 GB
VisualWebInstruct | Visual Instruction Tuning | image, text | 260'419 | 7.41 GB
Internal collection of public text SFT datasets | Text Instruction Tuning | text | 197'938 | 1.04 GB
ReCTS from ICDAR2019 | OCR | image, text | 20'000 | 1.77 GB
RCTW from ICDAR2017 | OCR | image, text | 8'034 | 7.85 GB
OCR equation heavy dataset from arXiv data | OCR | image, text | 2'000 | 0.03 GB
Mulberry-SFT (CoT) | Image Reasoning | image, text | 191'332 | 30.80 GB
LLaVA-CoT-100k (CoT) | Image Reasoning | image, text | 63'013 | 8.18 GB
GeomVerse (CoT) | Image Reasoning | image, text | 9'298 | 0.90 GB
MapQA (CoT) | Image Reasoning | image, text | 16'832 | 1.77 GB
MetaMathQA (CoT) | Text Reasoning | text | 225'408 | 4.55 GB
MetaMathQA (CoT) | Image Reasoning | image, text | 220'544 | 4.48 GB
PlotQA (CoT) | Image Reasoning | image, text | 16'256 | 0.76 GB
Visual7W Telling (CoT) | Image Reasoning | image, text | 62'592 | 3.21 GB
Visual7W Pointing | VQA | image, text | 25'733 | 0.93 GB
VisText | Image Captioning | image, text | 9'969 | 0.52 GB
ScreenQA | VQA | image, text | 32'724 | 3.51 GB
wave-ui-25k | OCR | image, text | 24'978 | 11.44 GB
Charts2500 | VQA | image, text | 2'486 | 0.09 GB
Cyrillic | OCR | image, text | 72'284 | 1.49 GB
CMM-Math | Image Reasoning | image, text | 13'148 | 0.05 GB
SimChart9K | Image Reasoning | image, text | 9'536 | 0.69 GB
UniChart | Image Reasoning | image, text | 504'885 | 17.04 GB
CASIA-HWDB2-line | OCR | image, text | 2'193 | 0.09 GB
MMTab | VQA | image, text | 232'746 | 59.23 GB
ArxivQA | VQA | image, text | 99'995 | 17.32 GB
docmatix-single | VQA | image, text | 19'992 | 3.94 GB
DocReason525K | Image Reasoning | image, text | 25'863 | 33.80 GB
FigureQA | VQA | image, text | 100'000 | 2.37 GB
LRV-Instruction | Visual Instruction Tuning | image, text | 7'198 | 0.37 GB
VisualWebInstruct (CoT) | Image Reasoning | image, text | 48'929 | 4.37 GB
DocMatix (multi-page) | Image Reasoning | image, text | 19'969 | 8.66 GB
spot-the-diff | Image Reasoning | image, text | 8'007 | 1.45 GB
DocVQA (CoT) | Image Reasoning | image, text | 36'333 | 24.32 GB
DocVQA (CoT) | Image Reasoning | image, text | 45'710 | 2.10 GB
DocVQA (CoT) | Image Reasoning | image, text | 19'548 | 6.70 GB
Mulberry-SFT (subset, CoT) | Image Reasoning | image, text | 103'763 | 18.45 GB
UniGeo (CoT) | Image Reasoning | image, text | 9'728 | 0.05 GB
NIGHTS | Image Reasoning | image, text | 12'906 | 37.01 GB
Mantis-Instruct (CoT) | Image Reasoning | image, text | 67'723 | 13.86 GB
OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 2'858 | 1.23 GB
OCR dataset based on pdfs from CommonCrawl | Image Reasoning | image, text | 586 | 0.46 GB
FinTabNet (relabeled) | Image Reasoning | image, text | 8'356 | 3.17 GB
Table OCR on pdfs from CommonCrawl | Image Reasoning | image, text | 4'846 | 3.65 GB
HierText (relabeled for QA) | Image Reasoning | image, text | 514 | 0.07 GB
ECD-10k-Images | Image Reasoning | image, text | 132'613 | 15.38 GB
ActivityNet (open-ended QA) | VideoQA | video, text | 6'490 | 162.22 GB
NExT-QA (multi-choice QA) | VideoQA | video, text | 5'496 | 11.07 GB
NExT-QA (open-ended QA) | VideoQA | video, text | 5'492 | 10.99 GB
NExT-QA (multi-choice QA) | VideoQA | video, text | 52 | 0.74 GB
NExT-QA (open-ended QA) | VideoQA | video, text | 61 | 0.85 GB
NExT-QA (open-ended QA) | VideoQA | video, text | 6'843 | 27.83 GB
NExT-QA (multi-choice QA) | VideoQA | video, text | 6'843 | 27.85 GB
ActivityNet (open-ended QA) | VideoQA | video, text | 7'420 | 102.81 GB
ActivityNet (open-ended QA) | VideoQA | video, text | 3'840 | 25.84 GB
NExT-QA (multi-choice QA) | VideoQA | video, text | 4'633 | 35.38 GB
NExT-QA (open-ended QA) | VideoQA | video, text | 4'694 | 35.84 GB
ActivityNet (open-ended QA) | VideoQA | video, text | 2'580 | 7.46 GB
Perception Test (multi-choice QA) | VideoQA | video, text | 1'785 | 18.67 GB
Perception Test (multi-choice QA) | VideoQA | video, text | 618 | 11.52 GB
NExT-QA | VideoQA | video, text | 34'132 | 150.86 GB
CLEVRER | VideoQA | video, text | 40'000 | 46.03 GB
Video dataset based on Kinetics | VideoQA | video, text | 39'452 | 26.15 GB
EGO4D | VideoQA | video, text | 7'797 | 3.38 GB
TVQA | VideoQA | video, text | 34'868 | 100.05 GB
EgoExoLearn | VideoQA | video, text | 36'373 | 8558.27 GB
Video dataset based on Kinetics | VideoQA | video, text | 647'883 | 890.56 GB
Mementos | VideoQA | video, text | 4'060 | 14.07 GB
Perception Test | VideoQA | video, text | 7'392 | 94.95 GB
ActivityNet | VideoQA | video, text | 10'021 | 191.49 GB
EGO4D | VideoQA | video, text | 1'506 | 137.00 GB
FineAction | VideoQA | video, text | 7'504 | 169.76 GB
HACS | VideoQA | video, text | 31'223 | 829.25 GB
HiREST | VideoQA | video, text | 822 | 42.50 GB
Perception Test | VideoQA | video, text | 2'135 | 25.98 GB
ActivityNet | VideoQA | video, text | 9'064 | 181.24 GB
HiREST | VideoQA | video, text | 525 | 27.54 GB
YouCook2 | VideoQA | video, text | 1'180 | 77.65 GB
DiDeMo | VideoQA | video, text | 7'452 | 33.90 GB
EGO4D | VideoQA | video, text | 2'665 | 194.01 GB
MedVidQA | VideoQA | video, text | 933 | 40.35 GB
QuerYD | VideoQA | video, text | 1'562 | 50.69 GB
YouCook2 | VideoQA | video, text | 2'270 | 158.77 GB
EgoExoLearn (open-ended QA) | VideoQA | video, text | 9'998 | 1751.69 GB
Breakfast Actions | VideoQA | video, text | 1'204 | 3.45 GB
EgoExoLearn (multi-choice QA) | VideoQA | video, text | 6'832 | 1196.41 GB
CrossTask (multi-choice QA) | VideoQA | video, text | 75'686 | 417.50 GB
CrossTask (open-ended QA) | VideoQA | video, text | 20'399 | 112.02 GB
EgoProceL (multi-choice QA) | VideoQA | video, text | 4'789 | 42.74 GB
EgoProceL (open-ended QA) | VideoQA | video, text | 5'667 | 50.58 GB
HC-STVG (multi-choice QA) | VideoQA | video, text | 147'799 | 796.18 GB
HC-STVG (open-ended QA) | VideoQA | video, text | 41'050 | 221.82 GB
TAPOS (multi-choice QA) | VideoQA | video, text | 33'941 | 218.50 GB
TAPOS (open-ended QA) | VideoQA | video, text | 13'991 | 88.00 GB
Multi-page OCR based on CommonCrawl pdf data | VQA | image, text | 7'262 | 48.19 GB
Multi-page QA based on CommonCrawl pdf data | VQA | image, text | 455 | 31.88 GB
Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'281 | 0.68 GB
Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'285 | 0.67 GB
Table OCR dataset based on CommonCrawl pdf data | OCR | image, text | 4'282 | 0.67 GB
Selection of public datasets (relabeled) | Image Reasoning | image, text | 13'843 | 4.18 GB
Selection of public datasets (relabeled) | Image Reasoning | image, text | 18'442 | 3.89 GB
Perception Test | VideoQA | video, text | 7'392 | 94.95 GB
Perception Test (CoT) | VideoQA | video, text | 4'977 | 64.55 GB

Private Datasets

Dataset Name | Type | Modalities | Number of Samples | Size
--- | --- | --- | --- | ---
Internal safety alignment text dataset | Safety | text | N/A | N/A
Internal safety alignment text dataset | Safety | text | N/A | N/A
Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 445'958 | 9.01 GB
Internal QA dataset on invoices | Image Reasoning | image, text | 6'471 | 5.22 GB
Internal QA dataset on invoices | Image Reasoning | image, text | 11'258 | 10.19 GB

Data Crawling and Scraping

Dataset Name | Type | Modalities | Number of Samples | Size
--- | --- | --- | --- | ---
Internal video dataset | VideoQA | video, text | 274'472 | 348.84 GB
Internal video dataset | VideoQA | video, text | 14'256 | 44.46 GB
Internal VQA and captioning dataset | Image Captioning | image, text | 14'872 | 3.27 GB
Internal VQA dataset | VQA | image, text | 20'250 | 1.87 GB
Internal VQA dataset | VQA | image, text | 20'098 | 2.07 GB
Internal Captioning dataset | Image Captioning | image, text | 24'998 | 6.97 GB

User-Sourced Data (Collected by Provider including Prompts)
Self-Sourced Synthetic Data

Dataset Name | Type | Modalities | Number of Samples | Size
--- | --- | --- | --- | ---
Random ASCII characters for OCR | OCR | image, text | 14'533 | 5.76 GB
Random ASCII characters for OCR | OCR | image, text | 14'533 | 9.26 GB
Random Chinese characters for OCR | OCR | image, text | 29'108 | 15.00 GB
Random Chinese characters for OCR | OCR | image, text | 29'108 | 24.11 GB
Random English characters for OCR | OCR | image, text | 14'525 | 5.65 GB
Random English characters for OCR | OCR | image, text | 14'525 | 9.39 GB
Synthetic sparse table dataset | OCR | image, text | 100'000 | 14.36 GB
Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 1'165'591 | 54.15 GB
Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Text Reasoning | text | 175'000 | 0.95 GB
Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 1'922'012 | 28.00 GB
Synthetic dataset with OpenSTEM from DeepSeek-R1-0528 | Text Reasoning | text | 288'000 | 0.57 GB
Synthetic dataset with HLE data with DeepSeek-R1-0528 | Text Reasoning | text | 67'000 | 0.22 GB
Synthetic tool-calling data with seed tools from ToolBench, Glaive, xLAM and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 403'619 | 6.55 GB
Synthetic safety data with responses from DeepSeek-R1-0528 | Text Reasoning | text | 30'710 | 0.12 GB
Dummy conversation dataset | Text Reasoning | text | 2'262 | 0.00 GB
Chat data with HelpSteer2 and HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 32'752 | 0.26 GB
Chat data with HelpSteer2 and HelpSteer3 as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 3'636 | 0.01 GB
Synthetic chat dataset with responses from DeepSeek-R1 | Text Reasoning | text | 389'350 | 3.30 GB
Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 353'526 | 2.61 GB
Chat dataset with LMSYS-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 361'733 | 1.12 GB
Synthetic multilingual STEM from DeepSeek-R1-0528, Qwen2.5-32B-Instruct-AWQ, Qwen2.5-14B-Instruct | Text Reasoning | text | 4'999'794 | 86.68 GB
Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B with reasoning | Text Reasoning | text | 545'844 | 5.25 GB
Chat dataset with WildChat-1M as seed user prompts and responses from Qwen3-235B-A22B without reasoning | Text Reasoning | text | 81'876 | 0.43 GB
Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 1'591'641 | 58.63 GB
Synthetic Math with OpenMathReasoning from DeepSeek-R1-0528 | Text Reasoning | text | 239'467 | 0.52 GB
Synthetic dataset with OpenCodeReasoning 2.0 from DeepSeek-R1-0528 | Code | text | 1'165'591 | 54.15 GB
Synthetic tool calling dataset from DeepSeek-R1-0528 | Text Reasoning | text | 74'044 | 46.43 GB

Properties

  • The dataset collection (for training and evaluation) consists of a mix of internal and public datasets designed for training and evaluation across various tasks. It includes:
    • Internal datasets built with public commercial images and internal labels, supporting tasks like conversation modeling and document analysis.
    • Public datasets sourced from publicly available images and annotations, adapted for tasks such as image captioning and visual question answering.
    • Synthetic datasets generated programmatically for specific tasks like tabular data understanding.
    • Specialized datasets for safety alignment, function calling, and domain-specific tasks (e.g., science diagrams, financial question answering).

Evaluation Datasets:

The following external benchmarks are used for evaluating the model:

Data Collection Method by dataset:

  • Hybrid: Human, Automated

Labeling Method by dataset:

  • Hybrid: Human, Automated

Properties (Quantity, Dataset Descriptions, Sensor(s)): N/A

Dataset License(s): N/A

Evaluation Benchmarks:

Benchmark | Score (FP4) | Score (BF16)
--- | --- | ---
AI2D | 87.1% | 87.1%
OCRBenchV2 | 61.9% | 62.0%
OCRBench | 85.1% | 85.6%
ChartQA | 90.0% | 89.7%
DocVQA val | 94.0% | 94.4%

Inference:

Engine: vLLM
Test Hardware:

  • 1x NVIDIA B100 SXM

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards. Please report security vulnerabilities or NVIDIA AI Concerns here.

Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.

Outputs generated by these models may contain political content or other potentially misleading information, issues with content security and safety, or unwanted bias that is independent of our oversight.
