ONNX Conversion Available - Performance Benchmarks & Implementation Guide
granite-docling-258M ONNX Version Now Available
Following the community interest in ONNX conversion (discussion #21), I've successfully created a production-ready ONNX
version of this model with significant performance improvements.
Performance Results
- Inference Speed: 3.1x faster than PyTorch (0.8s vs 2.5s)
- Memory Usage: 57% reduction (1.8GB vs 4.2GB)
- Model Loading: 2.7x faster (3.2s vs 8.5s)
- Hardware: Supports CPU, CUDA, DirectML, TensorRT
Benchmarked on Intel i7-12700K, 32GB RAM, RTX 4080
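If you want to sanity-check these numbers on your own hardware, a minimal timing sketch is below. The input names match the quick-start example further down; the shapes and sequence length are assumptions, so adjust them to whatever `session.get_inputs()` reports for the exported graph.

```python
import time
import numpy as np
import onnxruntime as ort

# Choose the execution provider to benchmark; onnxruntime falls back to CPU
# if the requested provider is unavailable.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy inputs -- names/shapes here are assumptions and must match the exported graph.
feeds = {
    "pixel_values": np.random.rand(1, 3, 512, 512).astype(np.float32),
    "input_ids": np.ones((1, 16), dtype=np.int64),
    "attention_mask": np.ones((1, 16), dtype=np.int64),
}

session.run(None, feeds)  # warm-up run
start = time.perf_counter()
n_runs = 10
for _ in range(n_runs):
    session.run(None, feeds)
print(f"avg latency: {(time.perf_counter() - start) / n_runs:.3f}s")
```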
Technical Implementation
Repository: https://huggingface.co/lamco-development/granite-docling-258M-onnx
Quick Start (Python):
```python
import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the ONNX model
session = ort.InferenceSession('model.onnx')

# Process the document image
image = Image.open('document.png').convert('RGB').resize((512, 512))
pixel_values = np.array(image).astype(np.float32) / 255.0
pixel_values = pixel_values.transpose(2, 0, 1)[np.newaxis, :]

# Run inference -- input_ids and attention_mask come from the model's
# tokenizer/processor (see the processor sketch below)
outputs = session.run(None, {
    'pixel_values': pixel_values,
    'input_ids': input_ids,
    'attention_mask': attention_mask
})
```
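The quick start assumes `input_ids` and `attention_mask` already exist. One way to build them (a sketch, not the official recipe; the repo id and prompt text below are assumptions, so verify them against the model card) is with the Transformers processor from the original repository:

```python
from PIL import Image
from transformers import AutoProcessor

# Load the processor from the original IBM repository (assumed repo id).
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")

image = Image.open("document.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},  # assumed prompt
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="np")

input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
# The processor's pixel_values may carry an extra image-splitting dimension;
# reshape as needed to match the ONNX graph's expected input shape.
```

Turning the raw decoder outputs into DocTags still requires a generation loop (feeding back the selected token until EOS); the usage examples listed under Resources below are the place to look for that part.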
Rust Integration (ORT):
```rust
// Targets the ort 2.0 crate API; exact import paths may differ between versions.
use ort::{CUDAExecutionProvider, GraphOptimizationLevel, Session};

let session = Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .with_execution_providers([CUDAExecutionProvider::default().build()])?
    .commit_from_file("model.onnx")?;
```
Conversion Methodology
Used IBM's experimental Idefics3Support branch from @gabe-l-hart's optimum-onnx fork:
- Key Innovation: an Idefics3ModelPatcher that resolves position-embedding issues during export
- Validation: comprehensive testing with ONNX Runtime 1.23
- Reproducible: complete conversion guide included
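For those who want to reproduce the export without reading the full guide first, the flow is roughly the standard Optimum ONNX export run from the patched branch. The sketch below is illustrative only: the install line, task string, and output path are assumptions, and the reproduction guide in the ONNX repository is the authoritative reference.

```python
# Assumes the patched fork is installed, e.g. something along the lines of:
#   pip install "git+https://github.com/gabe-l-hart/optimum-onnx@Idefics3Support"
# (exact URL/branch per the reproduction guide)
from optimum.exporters.onnx import main_export

main_export(
    model_name_or_path="ibm-granite/granite-docling-258M",  # assumed source repo id
    output="granite-docling-258M-onnx",
    task="image-text-to-text",  # assumed task name for Idefics3-style models
    opset=17,                   # matches the opset shipped in the ONNX repo
)
```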
Use Cases Enabled
- Production Rust Applications: High-performance document processing
- Edge Deployment: Lightweight model for resource-constrained environments (see the quantization sketch after this list)
- Enterprise Pipelines: Reduced infrastructure costs with better performance
- Research Platforms: Faster experimentation cycles
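On the edge-deployment point, dynamic INT8 quantization via ONNX Runtime is one straightforward way to shrink the footprint further. A minimal sketch follows (file names are placeholders, and the accuracy impact on DocTags output should be validated before relying on it):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the FP32 weights to INT8; activations stay in floating point,
# so no calibration dataset is needed.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```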
Resources Provided
- Working ONNX model (1.2GB, Opset 17)
- Complete reproduction guide
- Python & Rust usage examples
- Performance benchmarks
- Technical documentation
Community Impact
This conversion enables:
- Rust Ecosystem: First granite-docling support for Rust ML applications
- Performance Gains: Significant speedup for document AI workflows
- Deployment Flexibility: Multi-platform support beyond PyTorch
Happy to answer questions about implementation details, performance optimization, or integration approaches!
Attribution: Built on IBM Research's excellent granite-docling-258M foundation with full respect for the original work
and Apache 2.0 licensing.
@glamberson
Thank you so much for taking this work forward! Given your success with my early branch, I'll try to get a PR into optimum soon. If you can share your GH handle, I'd love to tag you and get a summary of your findings.
Hi, I've found some issues with the approach I've taken, so give me a few hours or a day to get back to you and revise my work, which will be relevant to your branch.
Thanks for your patience!
Hi there! I've created an ONNX conversion, which you can find here: https://huggingface.co/onnx-community/granite-docling-258M-ONNX
Hi Gabe, any chance we can get that llama.cpp support?
It just merged this morning! There still seems to be a lingering issue where the model isn't stopping correctly, so I'll be looking into that soon.
Thank you for your continuous hard work.