DeepSeek AI

DeepSeek-OCR-bnb-4bit-NF4


🌟 GitHub | 📥 Model Download | 📄 Paper Link | 📄 arXiv Paper Link

DeepSeek-OCR: Contexts Optical Compression

How to Use This 4-bit Quantized Model

This is a 4-bit NF4 quantized version of deepseek-ai/DeepSeek-OCR, created using bitsandbytes. It offers significantly reduced VRAM usage (up to 8 GB!) while maintaining high accuracy, making it ideal for consumer GPUs.
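For reference, a configuration along these lines is how a checkpoint like this can be produced with bitsandbytes through transformers. The exact settings used for this upload are an assumption, and you do not need to run this yourself, since the published weights are already quantized.

from transformers import AutoModel, BitsAndBytesConfig
import torch

# Sketch of an NF4 quantization config (assumed settings, shown for reference only;
# the published checkpoint is already quantized, so this step is not required).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # assumption: double quantization enabled
)

model = AutoModel.from_pretrained(
    "deepseek-ai/DeepSeek-OCR",
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)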

Environment Setup

For optimal compatibility, we strongly recommend creating a virtual environment using uv with Python 3.12.9. This matches the test environment of the original deepseek-ai/DeepSeek-OCR model.

Prerequisites: You must have the NVIDIA CUDA Toolkit (e.g., CUDA 11.8, which matches the PyTorch build) installed on your system to compile flash-attn.

Below are the recommended library versions, based on the original model's requirements, plus the libraries needed for 4-bit loading (bitsandbytes, accelerate) and PyTorch compatibility (torchvision).

# 1. Create and activate the environment (Python 3.12.9 recommended)
uv venv --python 3.12.9 .venv
source .venv/bin/activate

# 2. Install PyTorch (add --index-url https://download.pytorch.org/whl/cu118 if you need the CUDA 11.8 build)
uv pip install torch==2.6.0 torchvision

# 3. Install Transformers and dependencies
uv pip install transformers==4.46.3 tokenizers==0.20.3 einops addict easydict

# 4. Install 4-bit (bitsandbytes) and 'device_map' (accelerate) support
uv pip install bitsandbytes accelerate

# 5. Install flash-attn (compiles from source, requires CUDA Toolkit)
uv pip install flash-attn==2.7.3 --no-build-isolation
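After installation, an optional sanity check like the one below confirms that CUDA is visible and the key libraries import; you can also verify the CUDA Toolkit itself with nvcc --version.

# sanity_check.py -- optional: verify the key pieces are installed and CUDA is visible
import torch
import transformers
import bitsandbytes

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)

try:
    import flash_attn
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; fall back to _attn_implementation='eager'")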

Usage

Once your environment is set up, you can use one of the two code blocks provided below.

The main difference is the _attn_implementation parameter, which depends on your GPU architecture:

  1. _attn_implementation='flash_attention_2': Recommended for NVIDIA Ampere (RTX 30xx, A100) or newer GPUs.

  2. _attn_implementation='eager': Required for NVIDIA Turing (RTX 20xx) GPUs and older, or for general compatibility if Flash Attention 2 fails.

This 4-bit model was successfully tested on an RTX 2080 Ti (Turing architecture) with CUDA 13.0 drivers, which requires the eager implementation. Both code blocks are provided for your convenience.
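If you want a single script that works on both kinds of GPU, a small helper like the sketch below (an assumption, not part of the original code) can pick the implementation from the device's compute capability; Flash Attention 2 requires Ampere (compute capability 8.0) or newer.

import torch

# Hypothetical helper: choose the attention implementation from the GPU's compute
# capability. Flash Attention 2 needs Ampere (8.x) or newer; older GPUs use eager.
def pick_attn_implementation() -> str:
    if torch.cuda.is_available():
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            return "flash_attention_2"
    return "eager"

# Example: AutoModel.from_pretrained(..., _attn_implementation=pick_attn_implementation())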

Load 4-bit Quantized Model (Flash Attention 2)

from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_id = 'Jalea96/DeepSeek-OCR-bnb-4bit-NF4'

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, 
    _attn_implementation='flash_attention_2',
    trust_remote_code=True, 
    use_safetensors=True,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
model = model.eval()

# --- 1. Set Image and Task Prompt ---
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = 'your_image.jpg'
output_path = 'your/output/dir'

if not os.path.exists(output_path):
    os.makedirs(output_path)

# --- 2. Set Resolution ---
# (Gundam is recommended for most documents)
# Tiny:  base_size = 512,  image_size = 512, crop_mode = False
# Small: base_size = 640,  image_size = 640, crop_mode = False
# Base:  base_size = 1024, image_size = 1024, crop_mode = False
# Large: base_size = 1280, image_size = 1280, crop_mode = False
# Gundam: base_size = 1024, image_size = 640, crop_mode = True
base_size, image_size, crop_mode = 1024, 640, True

# --- 3. Run Inference ---
res = model.infer(
    tokenizer, 
    prompt=prompt, 
    image_file=image_file, 
    output_path=output_path, 
    base_size=base_size, 
    image_size=image_size, 
    crop_mode=crop_mode, 
    save_results=True, # Set to True to save visualization (image_vis.jpg)
    test_compress=True
)

Load 4-bit Quantized Model (Eager Attention)

from transformers import AutoModel, AutoTokenizer
import torch
import os

os.environ["CUDA_VISIBLE_DEVICES"] = '0'
model_id = 'Jalea96/DeepSeek-OCR-bnb-4bit-NF4'

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id, 
    _attn_implementation='eager',
    trust_remote_code=True, 
    use_safetensors=True,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
model = model.eval()

# --- 1. Set Image and Task Prompt ---
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = 'your_image.jpg'
output_path = 'your/output/dir'

if not os.path.exists(output_path):
    os.makedirs(output_path)

# --- 2. Set Resolution ---
# (Gundam is recommended for most documents)
# Tiny:  base_size = 512,  image_size = 512, crop_mode = False
# Small: base_size = 640,  image_size = 640, crop_mode = False
# Base:  base_size = 1024, image_size = 1024, crop_mode = False
# Large: base_size = 1280, image_size = 1280, crop_mode = False
# Gundam: base_size = 1024, image_size = 640, crop_mode = True
base_size, image_size, crop_mode = 1024, 640, True

# --- 3. Run Inference ---
res = model.infer(
    tokenizer, 
    prompt=prompt, 
    image_file=image_file, 
    output_path=output_path, 
    base_size=base_size, 
    image_size=image_size, 
    crop_mode=crop_mode, 
    save_results=True, # Set to True to save visualization (image_vis.jpg)
    test_compress=True
)
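To process a whole folder of page images, the same infer call can be looped. The sketch below reuses model, tokenizer, prompt, and the resolution settings from either block above; the input directory is a placeholder.

import os

input_dir = 'your/input/dir'   # placeholder: folder of page images
for name in sorted(os.listdir(input_dir)):
    if not name.lower().endswith(('.jpg', '.jpeg', '.png')):
        continue
    # One output subdirectory per page image
    page_output = os.path.join(output_path, os.path.splitext(name)[0])
    os.makedirs(page_output, exist_ok=True)
    res = model.infer(
        tokenizer,
        prompt=prompt,
        image_file=os.path.join(input_dir, name),
        output_path=page_output,
        base_size=base_size,
        image_size=image_size,
        crop_mode=crop_mode,
        save_results=True,
        test_compress=True
    )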

Acknowledgement

I would like to thank DeepSeek and their entire team for making this incredible range of models available. I also want to thank the entire Hugging Face community for their amazing platform and everyone who contributes to making this such an incredible community.

Work in progress: benchmarks on Fox and OmniDocBench.

Citation

@article{wei2025deepseek,
  title={DeepSeek-OCR: Contexts Optical Compression},
  author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}