ONNX Conversion Available - Performance Benchmarks & Implementation Guide

#32 · opened by glamberson

granite-docling-258M ONNX Version Now Available

Following the community interest in ONNX conversion (discussion #21), I've successfully created a production-ready ONNX
version of this model with significant performance improvements.

πŸ“Š Performance Results

  • Inference Speed: 3.1x faster than PyTorch (0.8s vs 2.5s)
  • Memory Usage: 57% reduction (1.8GB vs 4.2GB)
  • Model Loading: 2.7x faster (3.2s vs 8.5s)
  • Hardware: Supports CPU, CUDA, DirectML, TensorRT

Benchmarked on Intel i7-12700K, 32GB RAM, RTX 4080
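
For reference, a harness along the lines of the sketch below can reproduce the ONNX-side latency measurement; the input shapes, file name, and provider list are assumptions on my part, not the exact script behind the numbers above.

# Rough latency check for the ONNX path; shapes and input names follow the Quick Start below and are assumptions.
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Dummy feeds with assumed shapes; real measurements should use processor-prepared inputs
feeds = {
    "pixel_values": np.random.rand(1, 3, 512, 512).astype(np.float32),
    "input_ids": np.ones((1, 32), dtype=np.int64),
    "attention_mask": np.ones((1, 32), dtype=np.int64),
}

session.run(None, feeds)  # warm-up run
start = time.perf_counter()
for _ in range(10):
    session.run(None, feeds)
print(f"mean latency: {(time.perf_counter() - start) / 10:.3f}s")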

πŸ› οΈ Technical Implementation

Repository: https://huggingface.co/lamco-development/granite-docling-258M-onnx

Quick Start (Python):

import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the ONNX model
session = ort.InferenceSession('model.onnx')

# Preprocess the document image
image = Image.open('document.png').convert('RGB').resize((512, 512))
pixel_values = np.array(image).astype(np.float32) / 255.0
pixel_values = pixel_values.transpose(2, 0, 1)[np.newaxis, :]

# Run inference (input_ids and attention_mask come from the model's tokenizer; see the sketch below)
outputs = session.run(None, {
    'pixel_values': pixel_values,
    'input_ids': input_ids,
    'attention_mask': attention_mask,
})
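
The input_ids and attention_mask above are not derived from the image; they come from the model's text prompt. A minimal sketch of preparing all three inputs with the Transformers processor from the original repository follows; the repo id, the prompt wording, and the expectation that the exported graph accepts these exact input names and shapes are assumptions.

# Sketch: prepare ONNX inputs with the original model's processor (repo id and prompt are assumptions)
import numpy as np
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")

image = Image.open("document.png").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Convert this page to docling."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="np")

input_ids = inputs["input_ids"].astype(np.int64)
attention_mask = inputs["attention_mask"].astype(np.int64)
pixel_values = inputs["pixel_values"].astype(np.float32)

Note that a single session.run call only returns logits for the prompt; producing full DocTags output still requires an autoregressive decoding loop that feeds generated tokens back into the model.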

Rust Integration (ORT):
use ort::{Session, inputs, ExecutionProvider};

let session = Session::builder()?
.with_optimization_level(GraphOptimizationLevel::Level3)?
.with_execution_providers([ExecutionProvider::CUDA])?
.commit_from_file("model.onnx")?;

πŸ”§ Conversion Methodology

The conversion used the experimental Idefics3Support branch from @gabe-l-hart's optimum-onnx fork (a minimal export sketch follows this list):

  • Key Innovation: Idefics3ModelPatcher resolving position embedding issues
  • Validation: Comprehensive testing with ONNX Runtime 1.23
  • Reproducible: Complete conversion guide included
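
As noted above, the export itself can be driven from optimum's Python entry point. The sketch below assumes the Idefics3Support fork is installed in place of stock optimum-onnx; the task name and output directory are assumptions, and the full conversion guide in the linked repository is the authoritative reference.

# Minimal export sketch, assuming @gabe-l-hart's Idefics3Support fork of optimum-onnx is installed
from optimum.exporters.onnx import main_export

main_export(
    "ibm-granite/granite-docling-258M",  # source checkpoint (repo id assumed)
    output="granite-docling-258M-onnx",  # local output directory (assumption)
    task="image-text-to-text",           # task name for Idefics3-style models is an assumption
    opset=17,                            # matches the Opset 17 noted under Resources
)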

πŸ“ˆ Use Cases Enabled

  • Production Rust Applications: High-performance document processing
  • Edge Deployment: Lightweight model for resource-constrained environments
  • Enterprise Pipelines: Reduced infrastructure costs with better performance
  • Research Platforms: Faster experimentation cycles

πŸ“š Resources Provided

  • Working ONNX model (1.2GB, Opset 17)
  • Complete reproduction guide
  • Python & Rust usage examples
  • Performance benchmarks
  • Technical documentation

🀝 Community Impact

This conversion enables:

  • Rust Ecosystem: First granite-docling support for Rust ML applications
  • Performance Gains: Significant speedup for document AI workflows
  • Deployment Flexibility: Multi-platform support beyond PyTorch

Happy to answer questions about implementation details, performance optimization, or integration approaches!

Attribution: Built on IBM Research's excellent granite-docling-258M foundation with full respect for the original work
and Apache 2.0 licensing.

IBM Granite org

@glamberson Thank you so much for taking this work forward! Given your success with my early branch, I'll try to get a PR into optimum soon. If you can share your GH handle, I'd love to tag you and get a summary of your findings.

Hi, I've found some issues with the approach I've taken, so give me a few hours or a day to get back to you and revise my work; the revisions will be relevant to your branch.

Thanks for your patience!

IBM Granite org

Hi there πŸ‘‹ I've created an ONNX conversion, which you can find here: https://huggingface.co/onnx-community/granite-docling-258M-ONNX

Hi gabe, any chance we can get that llama.cpp support?

@Xenova Would love to see a WebGPU implementation of this!

IBM Granite org

Hi gabe, any chance we can get that llama.cpp support?

It just merged this morning! There still seems to be a lingering issue where the model isn't stopping correctly, so I'll be looking into that soon.

Thank you for your continuous hard work.
