# granite-docling ONNX Conversion Guide
## Technical Reproduction Instructions
This document provides complete instructions for reproducing the granite-docling ONNX conversion.
### Prerequisites
- Python 3.10+
- ~4GB available RAM
- ~2GB disk space for conversion environment
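As an optional preflight, this stdlib-only snippet (not part of the original procedure) checks the prerequisites above; the RAM probe uses `os.sysconf`, which is POSIX-only:
```python
import os
import shutil
import sys

# Interpreter version
assert sys.version_info >= (3, 10), "Python 3.10+ required"

# Free disk space in the working directory (~2GB needed)
free_gb = shutil.disk_usage('.').free / 1e9
print(f"Free disk: {free_gb:.1f} GB")

# Total RAM (~4GB needed during conversion); POSIX-only probe
try:
    ram_gb = os.sysconf('SC_PAGE_SIZE') * os.sysconf('SC_PHYS_PAGES') / 1e9
    print(f"Total RAM: {ram_gb:.1f} GB")
except (AttributeError, OSError, ValueError):
    print("RAM check not available on this platform")
```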
### Step 1: Environment Setup
```bash
# Create isolated environment
python3 -m venv onnx_converter
source onnx_converter/bin/activate # Linux/Mac
# or onnx_converter\Scripts\activate # Windows
# Install dependencies
pip install torch torchvision transformers optimum[onnxruntime] safetensors
```
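A quick import check (a minimal sketch, not part of the original procedure) confirms the environment resolved correctly before anything is downloaded:
```python
# Sanity check: all conversion dependencies import cleanly
import optimum.exporters.onnx  # noqa: F401
import safetensors             # noqa: F401
import torch
import transformers

print("torch", torch.__version__)
print("transformers", transformers.__version__)
```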
### Step 2: Download Original Model
```bash
# Download granite-docling SafeTensors model
mkdir granite-docling-258m
cd granite-docling-258m
curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/model.safetensors" -o model.safetensors
curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/config.json" -o config.json
curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/tokenizer.json" -o tokenizer.json
curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/preprocessor_config.json" -o preprocessor_config.json
```
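As an alternative to `curl`, the same four files can be fetched with `huggingface_hub` (installed as a dependency of `transformers`); this is an equivalent convenience, not the method used in the original conversion:
```python
from huggingface_hub import hf_hub_download

# Download the checkpoint and its configs into ./granite-docling-258m
for filename in ['model.safetensors', 'config.json',
                 'tokenizer.json', 'preprocessor_config.json']:
    hf_hub_download(
        repo_id='ibm-granite/granite-docling-258M',
        filename=filename,
        local_dir='./granite-docling-258m',
    )
```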
### Step 3: Install IBM Experimental Fork
```bash
# Clone IBM experimental optimum-onnx fork
git clone https://github.com/gabe-l-hart/optimum-onnx.git
cd optimum-onnx
git checkout Idefics3Support
# Install experimental fork
pip install -e . --force-reinstall
```
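To confirm the fork took effect, check that the Idefics3 export config used in Step 4 is importable and that `optimum.exporters.onnx` resolves to the editable checkout:
```python
import importlib.util

# The stock optimum release ships no Idefics3 ONNX config, so this
# import only succeeds once the experimental fork is installed
from optimum.exporters.onnx.model_configs import Idefics3OnnxConfig  # noqa: F401

spec = importlib.util.find_spec('optimum.exporters.onnx')
print('exporters loaded from:', spec.origin)  # should point into ./optimum-onnx
```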
### Step 4: Convert to ONNX
```python
import os

# Hide GPUs before importing torch so the export runs entirely on CPU
os.environ['CUDA_VISIBLE_DEVICES'] = ''

import torch
from pathlib import Path
from transformers import Idefics3ForConditionalGeneration
from optimum.exporters.onnx import export
from optimum.exporters.onnx.model_configs import Idefics3OnnxConfig

# Load the SafeTensors checkpoint in full precision on CPU
model = Idefics3ForConditionalGeneration.from_pretrained(
    './granite-docling-258m',
    trust_remote_code=True,
    torch_dtype=torch.float32,
).to('cpu')

# Create the ONNX export config provided by the experimental fork
onnx_config = Idefics3OnnxConfig(model.config, task='image-to-text')

# Export to ONNX with opset 17
output_path = Path('./granite_docling.onnx')
export(model, onnx_config, output_path, opset=17)
print(f"ONNX conversion complete: {output_path}")
```
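A quick size check on the result (compare against the ~1.2GB figure in Expected Output below):
```python
from pathlib import Path

# Confirm the export landed on disk at the expected size
onnx_file = Path('./granite_docling.onnx')
print(f"{onnx_file.name}: {onnx_file.stat().st_size / 1e9:.2f} GB")
```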
### Expected Output
```
Initializing Idefics3ModelPatcher
Entering Idefics3ModelPatcher context
Patching Idefics3 model
Using patched position embedding forward
Exiting Idefics3ModelPatcher context
ONNX conversion complete: granite_docling.onnx (1.2GB)
```
### Validation
```python
import onnxruntime as ort

# Test ONNX model loading
session = ort.InferenceSession('granite_docling.onnx')
print("✅ ONNX model loads successfully")

# Check input/output specifications
for inp in session.get_inputs():
    print(f"Input: {inp.name} - {inp.shape}")
for out in session.get_outputs():
    print(f"Output: {out.name} - {out.shape}")
```
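Optionally, the `onnx` package (pulled in by the export stack) can confirm the graph metadata, including the opset requested in Step 4:
```python
import onnx

# Parse the model proto; load_external_data=False skips any external
# weight files in case the exporter wrote them alongside the graph
model_proto = onnx.load('granite_docling.onnx', load_external_data=False)
print('IR version:', model_proto.ir_version)
print('Opsets:', [(op.domain or 'ai.onnx', op.version) for op in model_proto.opset_import])
```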
## Troubleshooting
### Common Issues
1. **"Custom architecture" error**: Ensure using IBM experimental fork
2. **Memory errors**: Use CPU-only conversion (`CUDA_VISIBLE_DEVICES=''`)
3. **Import errors**: Verify experimental fork installed with `-e .`
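A minimal diagnostic for issues 1 and 3, reusing the import path from Step 4:
```python
# Does the active optimum installation know about Idefics3?
try:
    from optimum.exporters.onnx.model_configs import Idefics3OnnxConfig  # noqa: F401
    print('Experimental fork active')
except ImportError:
    print('Stock optimum detected; reinstall the fork from Step 3')
```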
### Technical Notes
- **Conversion time**: 5-10 minutes on a typical CPU
- **Memory usage**: ~4GB RAM during conversion
- **Warnings**: `TracerWarning`s are expected when tracing a complex VLM (see the snippet below)
- **File size**: the ONNX file (~1.2GB) outweighs the SafeTensors checkpoint (~492MB) because the weights are exported in float32 and the serialized compute graph is embedded in the file
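If the expected `TracerWarning`s clutter the export logs, they can be silenced before calling `export` (a sketch; blanket suppression also hides legitimate warnings):
```python
import warnings

from torch.jit import TracerWarning

# Suppress the benign warnings emitted while tracing the VLM graph
warnings.filterwarnings('ignore', category=TracerWarning)
```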
## Attribution
- **Original model**: [ibm-granite/granite-docling-258M](https://huggingface.co/ibm-granite/granite-docling-258M) (IBM Research)
- **Conversion method**: IBM experimental `Idefics3Support` branch of the optimum-onnx fork
- **Documentation**: lamco-development