🧠 Why does DeepSeek-OCR not use Multi-Head Latent Attention (MLA)?

#53
by ZoneTwelve - opened

Hi DeepSeek team 👋,

First of all, thank you for releasing DeepSeek-OCR — it’s an impressive and elegant vision-to-text model.

While exploring the model architecture and configuration files (config.json), I noticed that Multi-Head Latent Attention (MLA) is not enabled in this OCR model.
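
For reference, a minimal way to check for MLA-specific fields in config.json is sketched below. The field names follow the DeepSeek-V2/V3 configs (kv_lora_rank, q_lora_rank, etc.); whether a DeepSeek-OCR config would expose the same keys is an assumption on my part.

```python
# Sketch: look for MLA-style fields in a downloaded config.json.
# Field names are taken from DeepSeek-V2/V3; their presence or absence
# is only a heuristic for whether MLA is configured.
import json

MLA_KEYS = ("kv_lora_rank", "q_lora_rank", "qk_rope_head_dim",
            "qk_nope_head_dim", "v_head_dim")

with open("config.json") as f:
    cfg = json.load(f)

found = {k: cfg[k] for k in MLA_KEYS if k in cfg}
if found:
    print("MLA-style fields present:", found)
else:
    print("No MLA-specific fields found; the decoder appears to use standard multi-head attention.")
```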

Questions

Could you please share some insights into why MLA was not used in DeepSeek-OCR?

  • Was it due to compatibility issues between MLA and the vision encoder–decoder pipeline?
  • Or did MLA not provide practical benefits in the OCR setting (e.g., shorter sequence lengths or the main bottleneck lying elsewhere)?
  • Is there any plan to integrate MLA into future versions of DeepSeek-OCR to improve inference efficiency?

I’m asking because MLA has demonstrated significant efficiency gains in your other models (e.g., DeepSeek-V2/V3), and I’m curious about the reasoning behind excluding it here.

Thanks again for your excellent work and for open-sourcing this project! 🙏
DeepSeek org

Hello,
We actually have an internal MLA-enabled version of DeepSeek-OCR.
The only reason it hasn’t been open-sourced yet is simply that I haven’t had the bandwidth to implement the code needed to convert the internal weights into the Hugging Face format.
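
For anyone who wants to experiment in the meantime, such a conversion is typically just a key-remapping pass over the checkpoint's state dict. The sketch below is generic: the key names and mapping are hypothetical and not the actual DeepSeek-OCR layout.

```python
# Generic checkpoint-conversion sketch (hypothetical key names, not the
# real DeepSeek-OCR mapping): load an internal state dict, rename keys to
# the Hugging Face layout, and save as safetensors.
import torch
from safetensors.torch import save_file

KEY_MAP = {
    "internal.embed.weight": "model.embed_tokens.weight",  # hypothetical
    # ... one entry per internal parameter name
}

state = torch.load("internal_checkpoint.pt", map_location="cpu")
hf_state = {KEY_MAP.get(k, k): v.contiguous() for k, v in state.items()}
save_file(hf_state, "model.safetensors")
```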
Best regards
