Field: Response
Intended Task/Domain: Visual Question Answering
Model Type: Transformer
Intended Users: Individuals and businesses that need to process documents such as invoices, receipts, and manuals. Also, users who are building multi-modal agents and RAG systems.
Output: Text
Tools used to evaluate datasets to identify synthetic data and ensure data authenticity: We used a Gemma-3 4B-based filtering model, fine-tuned on the Nemotron Content Safety Dataset v2, to ensure the quality of synthetic data.
Describe how the model works: The model combines a Vision Encoder with a Nemotron 5.5H-12B Language Encoder. It processes multiple input modalities, including text, multiple images, and video. It fuses these inputs and uses its large language model backbone, which has a 128K context length, to perform visual Q&A, summarization, and data extraction.
Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: Not Applicable
Technical Limitations & Mitigation: The model has a limited maximum resolution determined by a 12-tile layout constraint, where each tile is 512x512 pixels. It also supports a limited number of input images (up to 4) and has a maximum context length of 128K tokens for combined input and output.
Verified to have met prescribed NVIDIA quality standards: Yes
Performance Metrics: Accuracy (Visual Question Answering), Latency, Throughput
Potential Known Risks: The model may produce output that is biased, toxic, or incorrect. It may amplify such biases and return toxic responses, especially when given toxic prompts. It may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, and it may produce socially unacceptable or undesirable text even if the prompt itself contains nothing explicitly offensive. While we have taken safety and security into account and are continuously improving, outputs may still contain political content, misleading information, or unwanted bias beyond our control.
Licensing: Governing Terms: Use of this model is governed by the NVIDIA Open Model License Agreement
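
The resolution limit described under Technical Limitations & Mitigation can be made concrete with a little arithmetic. The sketch below takes only the 12-tile limit and the 512x512 tile size from the card. The grid-selection rule (closest aspect ratio, ties broken toward more tiles) and the helper names `candidate_grids`, `best_grid`, and `max_pixels` are illustrative assumptions, not the model's actual preprocessing.

```python
import math

TILE_SIZE = 512   # pixels per tile side, as stated in the card
MAX_TILES = 12    # tile layout limit, as stated in the card


def candidate_grids(max_tiles=MAX_TILES):
    """List every (cols, rows) grid that uses at most max_tiles tiles."""
    return [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]


def best_grid(width, height, max_tiles=MAX_TILES):
    """Pick the grid whose aspect ratio is closest to the image's.

    Hypothetical rule for illustration: compare aspect ratios on a log
    scale and break ties in favor of the grid with more tiles.
    """
    target = math.log(width / height)

    def score(grid):
        cols, rows = grid
        return (abs(math.log(cols / rows) - target), -(cols * rows))

    return min(candidate_grids(max_tiles), key=score)


def grid_resolution(grid):
    """Pixel size (width, height) covered by a (cols, rows) grid."""
    cols, rows = grid
    return cols * TILE_SIZE, rows * TILE_SIZE


def max_pixels(max_tiles=MAX_TILES):
    """Upper bound on pixels the tile budget can cover."""
    return max_tiles * TILE_SIZE * TILE_SIZE


if __name__ == "__main__":
    grid = best_grid(2048, 1536)
    print(grid, grid_resolution(grid), max_pixels())
```

Under these assumptions, a 4:3 image maps to a 4x3 grid that uses the full 12-tile budget, so no layout can cover more than 12 x 512 x 512 = 3,145,728 pixels per image.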