# granite-docling ONNX Conversion Guide

## Technical Reproduction Instructions

This document provides complete instructions for reproducing the granite-docling ONNX conversion.

### Prerequisites

- Python 3.10+
- ~4GB available RAM
- ~2GB disk space for the conversion environment
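A quick way to confirm these prerequisites before starting (a convenience check, not part of the original recipe; the thresholds simply mirror the list above):

```bash
# Sanity-check the prerequisites
python3 --version                                   # should report 3.10 or newer
free -h | awk '/^Mem:/ {print "Total RAM:", $2}'    # Linux; use `sysctl hw.memsize` on macOS
df -h . | tail -1 | awk '{print "Free disk:", $4}'  # free space in the current directory
```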
### Step 1: Environment Setup

```bash
# Create an isolated environment
python3 -m venv onnx_converter
source onnx_converter/bin/activate  # Linux/macOS
# or: onnx_converter\Scripts\activate  # Windows

# Install dependencies (quotes keep the extras spec safe under zsh)
pip install torch torchvision transformers "optimum[onnxruntime]" safetensors
```
### Step 2: Download Original Model

```bash
# Download the granite-docling SafeTensors model and its configuration files
mkdir granite-docling-258m
cd granite-docling-258m

curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/model.safetensors" -o model.safetensors
curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/config.json" -o config.json
curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/tokenizer.json" -o tokenizer.json
curl -L "https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/preprocessor_config.json" -o preprocessor_config.json
```
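As an alternative to the individual `curl` calls, the `huggingface_hub` client (already installed as a `transformers` dependency) can mirror the whole repository in one call. This is an equivalent convenience, not part of the original recipe:

```python
# Alternative download: mirror the repo with huggingface_hub
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ibm-granite/granite-docling-258M",
    local_dir="granite-docling-258m",
)
```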
### Step 3: Install IBM Experimental Fork

```bash
# Clone the IBM experimental optimum-onnx fork
git clone https://github.com/gabe-l-hart/optimum-onnx.git
cd optimum-onnx
git checkout Idefics3Support

# Install the experimental fork in place of stock optimum
pip install -e . --force-reinstall
```
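To confirm the fork, rather than stock `optimum`, is the one on your path, try importing the Idefics3 export config that the fork provides (this check assumes stock `optimum` does not ship `Idefics3OnnxConfig`, which is why the fork is needed in the first place):

```bash
# Succeeds only when the Idefics3Support fork is active
python -c "from optimum.exporters.onnx.model_configs import Idefics3OnnxConfig; print('Idefics3 fork active')"
```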
### Step 4: Convert to ONNX

```python
import os

os.environ['CUDA_VISIBLE_DEVICES'] = ''  # Force CPU before torch initializes

import torch
from pathlib import Path
from transformers import Idefics3ForConditionalGeneration
from optimum.exporters.onnx import export
from optimum.exporters.onnx.model_configs import Idefics3OnnxConfig

# Load the model in float32 on CPU
model = Idefics3ForConditionalGeneration.from_pretrained(
    './granite-docling-258m',
    trust_remote_code=True,
    torch_dtype=torch.float32,
).to('cpu')

# Create the ONNX export config provided by the experimental fork
onnx_config = Idefics3OnnxConfig(model.config, task='image-to-text')

# Export to ONNX with opset 17
output_path = Path('./granite_docling.onnx')
export(model, onnx_config, output_path, opset=17)

print(f"ONNX conversion complete: {output_path}")
```
| 78 |
+
|
| 79 |
+
### Expected Output
|
| 80 |
+
|
| 81 |
+
```
|
| 82 |
+
Initializing Idefics3ModelPatcher
|
| 83 |
+
Entering Idefics3ModelPatcher context
|
| 84 |
+
Patching Idefics3 model
|
| 85 |
+
Using patched position embedding forward
|
| 86 |
+
Exiting Idefics3ModelPatcher context
|
| 87 |
+
ONNX conversion complete: granite_docling.onnx (1.2GB)
|
| 88 |
+
```
|
### Validation

```python
import onnxruntime as ort

# Test that the exported model loads
session = ort.InferenceSession('granite_docling.onnx')
print("✅ ONNX model loads successfully")

# Check input/output specifications
for inp in session.get_inputs():
    print(f"Input: {inp.name} - {inp.shape}")
for out in session.get_outputs():
    print(f"Output: {out.name} - {out.shape}")
```
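Loading proves the graph parses; to also prove it executes, you can feed zero-filled dummy tensors derived from the graph's own metadata. This sketch assumes nothing about input names, but substituting 1 for every dynamic dimension may still violate shape constraints in a VLM graph; adjust shapes if the run fails:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession('granite_docling.onnx')

# Build dummy feeds from the model's declared inputs,
# using 1 for every dynamic (symbolic) dimension
feeds = {}
for inp in session.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    # fp32 export assumed; integer inputs (e.g. token ids) get int64 zeros
    dtype = np.float32 if 'float' in inp.type else np.int64
    feeds[inp.name] = np.zeros(shape, dtype=dtype)

outputs = session.run(None, feeds)
print(f"✅ Graph executed, {len(outputs)} output tensor(s)")
```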
## Troubleshooting

### Common Issues

1. **"Custom architecture" error**: Make sure the IBM experimental fork is installed, not stock `optimum` (see the check below)
2. **Memory errors**: Force CPU-only conversion (`CUDA_VISIBLE_DEVICES=''`)
3. **Import errors**: Verify the experimental fork was installed editably (`pip install -e .`)
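For issues 1 and 3, a quick way to see which `optimum` installation Python actually imports (an editable install should resolve into your `optimum-onnx` clone, not `site-packages`):

```bash
# Show where the optimum package is imported from
python -c "import optimum; print(optimum.__file__)"
pip list | grep -i optimum
```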
### Technical Notes

- **Conversion time**: 5-10 minutes on a typical CPU
- **Memory usage**: ~4GB RAM during conversion
- **Warnings**: TracerWarnings during export are expected for a complex VLM
- **File size**: the ONNX file (~1.2GB) is larger than the SafeTensors checkpoint (~492MB) mainly because the export runs in float32 (4 bytes per weight) while the checkpoint is stored in 16-bit precision; the embedded graph definition adds the remainder
## Attribution

- Original model: IBM Research granite-docling-258M
- Conversion method: IBM experimental Idefics3Support optimum-onnx fork
- Documentation: lamco-development