ONNX Conversion Available - Performance Benchmarks & Implementation Guide
granite-docling-258M ONNX Version Now Available
Following the community interest in ONNX conversion (discussion #21), I've successfully created a production-ready ONNX
version of this model with significant performance improvements.
Performance Results
- Inference Speed: 3.1x faster than PyTorch (0.8s vs 2.5s)
- Memory Usage: 57% reduction (1.8GB vs 4.2GB)
- Model Loading: 2.7x faster (3.2s vs 8.5s)
- Hardware: Supports CPU, CUDA, DirectML, TensorRT
Benchmarked on Intel i7-12700K, 32GB RAM, RTX 4080
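If you want to sanity-check these numbers on your own hardware, a minimal timing sketch is below. The input names match the quick-start example further down; the shapes and sequence length are assumptions, so adjust them to whatever `session.get_inputs()` reports for the exported graph.

```python
import time
import numpy as np
import onnxruntime as ort

# Choose the execution provider to benchmark; onnxruntime falls back to CPU
# if the requested provider is unavailable.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy inputs -- names/shapes here are assumptions and must match the exported graph.
feeds = {
    "pixel_values": np.random.rand(1, 3, 512, 512).astype(np.float32),
    "input_ids": np.ones((1, 16), dtype=np.int64),
    "attention_mask": np.ones((1, 16), dtype=np.int64),
}

session.run(None, feeds)  # warm-up run
start = time.perf_counter()
n_runs = 10
for _ in range(n_runs):
    session.run(None, feeds)
print(f"avg latency: {(time.perf_counter() - start) / n_runs:.3f}s")
```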
Technical Implementation
Repository: https://huggingface.co/lamco-development/granite-docling-258M-onnx
Quick Start (Python):
```python
import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the ONNX model
session = ort.InferenceSession('model.onnx')

# Process the document image
image = Image.open('document.png').convert('RGB').resize((512, 512))
pixel_values = np.array(image).astype(np.float32) / 255.0
pixel_values = pixel_values.transpose(2, 0, 1)[np.newaxis, :]

# Run inference -- input_ids and attention_mask come from the model's
# tokenizer/processor (see the processor sketch below)
outputs = session.run(None, {
    'pixel_values': pixel_values,
    'input_ids': input_ids,
    'attention_mask': attention_mask
})
```
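The quick start assumes `input_ids` and `attention_mask` already exist. One way to build them (a sketch, not the official recipe; the repo id and prompt text below are assumptions, so verify them against the model card) is with the Transformers processor from the original repository:

```python
from PIL import Image
from transformers import AutoProcessor

# Load the processor from the original IBM repository (assumed repo id).
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")

image = Image.open("document.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},  # assumed prompt
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="np")

input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
# The processor's pixel_values may carry an extra image-splitting dimension;
# reshape as needed to match the ONNX graph's expected input shape.
```

Turning the raw decoder outputs into DocTags still requires a generation loop (feeding back the selected token until EOS); the usage examples listed under Resources below are the place to look for that part.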
Rust Integration (ORT):
```rust
// Targets the ort 2.0 crate API; exact import paths may differ between versions.
use ort::{CUDAExecutionProvider, GraphOptimizationLevel, Session};

let session = Session::builder()?
    .with_optimization_level(GraphOptimizationLevel::Level3)?
    .with_execution_providers([CUDAExecutionProvider::default().build()])?
    .commit_from_file("model.onnx")?;
```
Conversion Methodology
Used IBM's experimental Idefics3Support branch from @gabe-l-hart's optimum-onnx fork:
- Key Innovation: an Idefics3ModelPatcher that resolves position-embedding issues during export
- Validation: comprehensive testing with ONNX Runtime 1.23
- Reproducible: complete conversion guide included
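For those who want to reproduce the export without reading the full guide first, the flow is roughly the standard Optimum ONNX export run from the patched branch. The sketch below is illustrative only: the install line, task string, and output path are assumptions, and the reproduction guide in the ONNX repository is the authoritative reference.

```python
# Assumes the patched fork is installed, e.g. something along the lines of:
#   pip install "git+https://github.com/gabe-l-hart/optimum-onnx@Idefics3Support"
# (exact URL/branch per the reproduction guide)
from optimum.exporters.onnx import main_export

main_export(
    model_name_or_path="ibm-granite/granite-docling-258M",  # assumed source repo id
    output="granite-docling-258M-onnx",
    task="image-text-to-text",  # assumed task name for Idefics3-style models
    opset=17,                   # matches the opset shipped in the ONNX repo
)
```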
Use Cases Enabled
- Production Rust Applications: High-performance document processing
- Edge Deployment: Lightweight model for resource-constrained environments (see the quantization sketch after this list)
- Enterprise Pipelines: Reduced infrastructure costs with better performance
- Research Platforms: Faster experimentation cycles
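On the edge-deployment point, dynamic INT8 quantization via ONNX Runtime is one straightforward way to shrink the footprint further. A minimal sketch follows (file names are placeholders, and the accuracy impact on DocTags output should be validated before relying on it):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Quantize the FP32 weights to INT8; activations stay in floating point,
# so no calibration dataset is needed.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```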
Resources Provided
- Working ONNX model (1.2GB, Opset 17)
- Complete reproduction guide
- Python & Rust usage examples
- Performance benchmarks
- Technical documentation
Community Impact
This conversion enables:
- Rust Ecosystem: First granite-docling support for Rust ML applications
- Performance Gains: Significant speedup for document AI workflows
- Deployment Flexibility: Multi-platform support beyond PyTorch
Happy to answer questions about implementation details, performance optimization, or integration approaches!
Attribution: Built on IBM Research's excellent granite-docling-258M foundation with full respect for the original work
and Apache 2.0 licensing.
@glamberson
Thank you so much for taking this work forward! Given your success with my early branch, I'll try to get a PR into optimum soon. If you can share your GH handle, I'd love to tag you and get a summary of your findings.
Hi, I've found some issues with the approach I've taken, so give me a few hours or a day to get back to you and revise my work, which will be relevant to your branch.
Thanks for your patience!
Hi there! I've created an ONNX conversion, which you can find here: https://huggingface.co/onnx-community/granite-docling-258M-ONNX
Hi Gabe, any chance we can get that llama.cpp support?
It just merged this morning! There still seems to be a lingering issue where the model isn't stopping correctly, so I'll be looking into that soon.
Thank you for your continuous hard work.