| Intended Task/Domain: |
Visual Question Answering |
| Model Type: |
Transformer |
| Intended Users: |
Individuals and businesses that need to process documents such as invoices, receipts, and manuals. Also, users who are building multi-modal agents and RAG systems. |
| Output: |
Text |
| Tools used to evaluate datasets to identify synthetic data and ensure data authenticity. |
We used a Gemma-3 4B-based filtering model fine-tuned on Nemotron Content Safety Dataset v2 to ensure the quality of synthetic data. |
| Describe how the model works: |
The model combines a vision encoder with a Nemotron 5.5H-12B language encoder. It processes multiple input modalities, including text, multiple images, and video. It fuses these inputs and uses its large language model backbone, with a 128K context length, to perform visual Q&A, summarization, and data extraction. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: |
Not Applicable |
| Technical Limitations & Mitigation: |
The model's maximum input resolution is bounded by a 12-tile layout constraint, where each tile is 512x512 pixels. It also accepts a limited number of input images (up to 4) and has a maximum context length of 128K tokens for combined input and output. |
| Verified to have met prescribed NVIDIA quality standards: |
Yes |
| Performance Metrics: |
Accuracy (Visual Question Answering), Latency, Throughput |
| Potential Known Risks: |
The model may produce biased, toxic, or incorrect responses. It may amplify existing biases and return toxic responses, especially when given toxic prompts. It may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may produce socially unacceptable or undesirable text even if the prompt itself contains nothing explicitly offensive. While we have taken safety and security into account and are continuously improving, outputs may still contain political content, misleading information, or unwanted bias beyond our control. |
| Licensing: |
Governing Terms: Use of this model is governed by the NVIDIA Open Model License Agreement. |
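
The limits listed under Technical Limitations & Mitigation can be checked before a request is sent. The following is a minimal sketch, not the model's actual preprocessing: the function names are hypothetical, the 12-tile budget is assumed to apply per image, tiles are counted with a naive grid that covers the image without downscaling, and "128K" is assumed to mean 131,072 tokens. Under those assumptions, the largest image that fits without downscaling covers 12 x 512 x 512 = 3,145,728 pixels.

```python
import math

TILE_SIDE = 512                  # pixels per tile side (from the model card)
MAX_TILES = 12                   # tile budget (assumed to apply per image)
MAX_IMAGES = 4                   # maximum number of input images (from the model card)
MAX_CONTEXT_TOKENS = 128 * 1024  # assumption: "128K" read as 131,072 tokens


def tiles_needed(width, height):
    """Number of 512x512 tiles a naive grid needs to cover the image as-is."""
    return math.ceil(width / TILE_SIDE) * math.ceil(height / TILE_SIDE)


def check_request(image_sizes, prompt_tokens, max_new_tokens):
    """Return a list of human-readable problems; an empty list means no limit is hit.

    image_sizes is a list of (width, height) tuples in pixels.
    """
    problems = []
    if len(image_sizes) > MAX_IMAGES:
        problems.append(
            f"{len(image_sizes)} images supplied, limit is {MAX_IMAGES}"
        )
    for index, (width, height) in enumerate(image_sizes):
        count = tiles_needed(width, height)
        if count > MAX_TILES:
            problems.append(
                f"image {index} needs {count} tiles, budget is {MAX_TILES}; "
                "expect it to be downscaled"
            )
    total_tokens = prompt_tokens + max_new_tokens
    if total_tokens > MAX_CONTEXT_TOKENS:
        problems.append(
            f"{total_tokens} tokens requested, limit is {MAX_CONTEXT_TOKENS}"
        )
    return problems


if __name__ == "__main__":
    # A 2048x1536 scan fits the tile budget; a 4096x4096 scan does not.
    for problem in check_request([(2048, 1536), (4096, 4096)], 2000, 1000):
        print(problem)
```

A check like this only flags inputs that are likely to be resized or rejected. The token count that images contribute depends on the model's own tokenizer and tiling, which this sketch does not model.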