🧠 Why does DeepSeek-OCR not use Multi-Head Latent Attention (MLA)?

#53
by ZoneTwelve - opened

Hi DeepSeek team 👋,

First of all, thank you for releasing DeepSeek-OCR — it’s an impressive and elegant vision-to-text model.

While exploring the model architecture and configuration files (config.json), I noticed that Multi-Head Latent Attention (MLA) is not enabled in this OCR model.
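
For reference, a minimal way to check for MLA-specific fields in config.json is sketched below. The field names follow the DeepSeek-V2/V3 configs (kv_lora_rank, q_lora_rank, etc.); whether a DeepSeek-OCR config would expose the same keys is an assumption on my part.

```python
# Sketch: look for MLA-style fields in a downloaded config.json.
# Field names are taken from DeepSeek-V2/V3; their presence or absence
# is only a heuristic for whether MLA is configured.
import json

MLA_KEYS = ("kv_lora_rank", "q_lora_rank", "qk_rope_head_dim",
            "qk_nope_head_dim", "v_head_dim")

with open("config.json") as f:
    cfg = json.load(f)

found = {k: cfg[k] for k in MLA_KEYS if k in cfg}
if found:
    print("MLA-style fields present:", found)
else:
    print("No MLA-specific fields found; the decoder appears to use standard multi-head attention.")
```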

Questions

Could you please share some insights into why MLA was not used in DeepSeek-OCR?

  • Was it due to compatibility issues between MLA and the vision encoder–decoder pipeline?
  • Or did MLA not provide practical benefits in the OCR setting (e.g., shorter sequence lengths or the main bottleneck lying elsewhere)?
  • Is there any plan to integrate MLA into future versions of DeepSeek-OCR to improve inference efficiency?

I’m asking because MLA has demonstrated significant efficiency gains in your other models (e.g., DeepSeek-V2/V3), and I’m curious about the reasoning behind excluding it here.

Thanks again for your excellent work and for open-sourcing this project! 🙏
DeepSeek org

Hello,
We actually have an internal MLA-enabled version of DeepSeek-OCR.
The only reason it hasn’t been open-sourced yet is simply that I haven’t had the bandwidth to implement the code needed to convert the internal weights into the Hugging Face format.
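
For anyone who wants to experiment in the meantime, such a conversion is typically just a key-remapping pass over the checkpoint's state dict. The sketch below is generic: the key names and mapping are hypothetical and not the actual DeepSeek-OCR layout.

```python
# Generic checkpoint-conversion sketch (hypothetical key names, not the
# real DeepSeek-OCR mapping): load an internal state dict, rename keys to
# the Hugging Face layout, and save as safetensors.
import torch
from safetensors.torch import save_file

KEY_MAP = {
    "internal.embed.weight": "model.embed_tokens.weight",  # hypothetical
    # ... one entry per internal parameter name
}

state = torch.load("internal_checkpoint.pt", map_location="cpu")
hf_state = {KEY_MAP.get(k, k): v.contiguous() for k, v in state.items()}
save_file(hf_state, "model.safetensors")
```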
Best regards
