DIMI Arabic OCR

Accurate Arabic OCR model for extracting printed Arabic text from images


🧠 Overview

DIMI-Arabic-OCR is a fine-tuned vision-language model (VLM) specialized for Arabic Optical Character Recognition (OCR).
It extracts printed Arabic text from images with high accuracy — including diacritics (tashkeel) and punctuation.

  • 🔤 Language: Arabic
  • 🧩 Base Model: Qwen2.5-VL-7B (via Unsloth 4-bit)
  • ⚙️ Task: Image-to-Text / OCR
  • 🪶 Quantization: 4-bit base with LoRA adapters for efficient inference
  • 👨‍💻 Author: Ahmed Zaky

🚀 Quick Start

# IMPORTANT: Import unsloth first!
import unsloth
from unsloth import FastVisionModel
from PIL import Image
import torch

# Load the model
model, tokenizer = FastVisionModel.from_pretrained(
    "AhmedZaky1/DIMI-Arabic-OCR",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

FastVisionModel.for_inference(model)

# Load your image (the example path below is a Colab path; replace it with your own)
image = Image.open("/content/2.jpg")

# Arabic instruction (English: "Extract the Arabic text and numbers in this image
# with very high accuracy, fully preserving the original order and formatting.")
instruction = "استخرج النص العربي والأرقام الموجودة في هذه الصورة بدقة عالية جدًا، مع الحفاظ الكامل على الترتيب الأصلي والتنسيق."

# Prepare messages
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": image},  # Include image here
        {"type": "text", "text": instruction}
    ]}
]

# Apply chat template
input_text = tokenizer.apply_chat_template(
    messages, 
    add_generation_prompt=True,
)

# Tokenize the prompt together with the image; truncation is disabled so long pages are kept intact
inputs = tokenizer(
    text=input_text,
    images=image,  
    return_tensors="pt",
    padding=True,
    truncation=False, 
    max_length=None,   
).to("cuda")

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=False,      # greedy decoding for deterministic OCR output
        temperature=None,     # explicitly unset sampling params to silence warnings
        top_p=None,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode the prediction
generated_ids = outputs[0][inputs['input_ids'].shape[1]:]
prediction = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()

print("Extracted Arabic Text:")
print(prediction)

🧩 Model Architecture

  • Base: Qwen2.5-VL-7B-Instruct
  • Fine-tuning: LoRA (rank 16)
  • Quantization: 4-bit (bitsandbytes)
  • Framework: Unsloth for efficient training and inference (see the configuration sketch below)
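
The exact training recipe is not published on this card, but the sketch below shows how a rank-16 LoRA setup on a 4-bit Qwen2.5-VL base is typically declared with Unsloth. The base checkpoint name and every hyperparameter other than the LoRA rank are assumptions, not values confirmed by the author.

# Illustrative sketch only: a typical Unsloth QLoRA configuration matching the
# architecture described above (rank-16 LoRA on a 4-bit base). Values other than
# r=16 are assumptions, not the published training recipe.
import unsloth
from unsloth import FastVisionModel

base_model, processor = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct-bnb-4bit",  # assumed 4-bit base checkpoint
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Attach LoRA adapters (rank 16, as reported on this card)
model = FastVisionModel.get_peft_model(
    base_model,
    r=16,                      # LoRA rank reported on this card
    lora_alpha=16,             # assumed
    lora_dropout=0.0,          # assumed
    bias="none",
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    random_state=3407,
)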

📊 Evaluation

Metric   Description             Score (lower is better)
CER      Character Error Rate    0.22
WER      Word Error Rate         0.40

Evaluation was performed on a held-out test set of 2.6K images drawn from the combined Arabic OCR datasets (news + diacritics).
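
The scores above are self-reported. To benchmark the model on your own data, CER and WER can be computed with, for example, the jiwer library; the snippet below is a minimal sketch, and jiwer is one common choice rather than necessarily the tool used for the numbers above.

# Minimal scoring sketch using jiwer (an assumption; any CER/WER tool works).
import jiwer

references  = ["النص المرجعي الأول", "النص المرجعي الثاني"]   # ground-truth transcriptions
predictions = ["النص المرجعي الاول", "النص المرجعي الثاني"]   # model outputs from the Quick Start code

cer = jiwer.cer(references, predictions)  # character error rate (lower is better)
wer = jiwer.wer(references, predictions)  # word error rate (lower is better)
print(f"CER: {cer:.4f} | WER: {wer:.4f}")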


🧾 Training Data

The model was fine-tuned on 26,000 Arabic text images combining two datasets:

  1. oddadmix/qari-0.2.2-news-dataset-large
  2. oddadmix/qari-0.2.2-diacritics-dataset-large

The combined dataset covers Modern Standard Arabic text both with and without diacritics.
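
Both datasets are available on the Hugging Face Hub and can be combined with the datasets library, as in the minimal sketch below; the split name and the assumption of an image/text-style schema are illustrative and should be checked against the actual dataset features.

# Illustrative sketch: loading and combining the two training datasets with the
# Hugging Face `datasets` library. The "train" split and column layout are
# assumptions, not confirmed by this card.
from datasets import load_dataset, concatenate_datasets

news = load_dataset("oddadmix/qari-0.2.2-news-dataset-large", split="train")
diacritics = load_dataset("oddadmix/qari-0.2.2-diacritics-dataset-large", split="train")

combined = concatenate_datasets([news, diacritics]).shuffle(seed=42)
print(combined)  # inspect the actual features before building training examples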


📚 Citation

If you use this model, please cite:

@misc{dimi-arabic-ocr-2025,
  author = {Ahmed Zaky},
  title = {DIMI-Arabic-OCR: Fine-tuned Qwen2.5-VL for Arabic Text Recognition},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/AhmedZaky1/DIMI-Arabic-OCR}}
}

Built with ❤️ by Ahmed Zaky

Advancing Arabic NLP through state-of-the-art open models
