Why does DeepSeek-OCR not use Multi-Head Latent Attention (MLA)?
Hi DeepSeek team,
First of all, thank you for releasing DeepSeek-OCR; it's an impressive and elegant vision-to-text model.
While exploring the model architecture and configuration files (config.json), I noticed that Multi-Head Latent Attention (MLA) is not enabled in this OCR model.
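For concreteness, here is a minimal sketch of the kind of check I mean. It only looks for the MLA-specific low-rank fields that DeepSeek-V2/V3 configs expose (kv_lora_rank, q_lora_rank, qk_nope_head_dim, qk_rope_head_dim); those field names are an assumption on my part and may not be what DeepSeek-OCR's config.json would use if MLA were enabled.

```python
import json

# MLA-style low-rank attention fields as named in DeepSeek-V2/V3 configs
# (assumed names; an MLA-enabled DeepSeek-OCR might use different ones).
MLA_KEYS = ("kv_lora_rank", "q_lora_rank", "qk_nope_head_dim", "qk_rope_head_dim")

with open("config.json") as f:
    cfg = json.load(f)

found = {k: cfg[k] for k in MLA_KEYS if k in cfg}
if found:
    print("MLA-related fields present:", found)
else:
    print("No MLA-specific fields; the decoder appears to use standard multi-head attention.")
```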
Questions
Could you please share some insights into why MLA was not used in DeepSeek-OCR?
- Was it due to compatibility issues between MLA and the vision encoder-decoder pipeline?
- Or did MLA not provide practical benefits in the OCR setting (e.g., shorter sequence lengths or the main bottleneck lying elsewhere)? A rough KV-cache estimate illustrating this point is sketched right after this list.
- Is there any plan to integrate MLA into future versions of DeepSeek-OCR to improve inference efficiency?
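To make the second point concrete, here is a back-of-envelope per-token KV-cache comparison between standard multi-head attention and MLA-style latent caching. Every dimension below is an assumption (roughly DeepSeek-V2-flavoured values on a small decoder), not DeepSeek-OCR's actual configuration; the sketch only illustrates why short OCR output sequences blunt MLA's headline memory saving.

```python
# Back-of-envelope KV-cache sizes: standard MHA vs. MLA latent caching.
# All dimensions are hypothetical, not taken from DeepSeek-OCR's config.
n_layers, n_heads, head_dim = 24, 16, 128   # assumed decoder shape
kv_lora_rank, rope_dim = 512, 64            # assumed MLA latent + decoupled RoPE dims
bytes_per_elem = 2                          # bf16

def mha_cache_mib(seq_len: int) -> float:
    # Full K and V are cached for every head in every layer.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem / 2**20

def mla_cache_mib(seq_len: int) -> float:
    # Only the compressed KV latent plus the shared RoPE key is cached.
    return n_layers * (kv_lora_rank + rope_dim) * seq_len * bytes_per_elem / 2**20

for seq_len in (2_000, 8_000, 128_000):
    print(f"{seq_len:>7} tokens: MHA {mha_cache_mib(seq_len):8.1f} MiB | "
          f"MLA {mla_cache_mib(seq_len):7.1f} MiB")
```

Under these assumed numbers, a few thousand output tokens keep both caches well under a gigabyte, which would make the vision encoder, rather than KV-cache memory, the more likely bottleneck in an OCR workload.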
I'm asking because MLA has demonstrated significant efficiency gains in your other models (e.g., DeepSeek-V2/V3), and I'm curious about the reasoning behind excluding it here.
Thanks again for your excellent work and for open-sourcing this project!
Hello,
We actually have an internal MLA-enabled version of DeepSeek-OCR.
The only reason it hasn't been open-sourced yet is simply that I haven't had the bandwidth to implement the code needed to convert the internal weights into the Hugging Face format.
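For anyone who wants to attempt something similar independently: a conversion like this usually amounts to remapping state-dict keys onto Hugging Face parameter names and re-serializing. The sketch below is purely hypothetical, since the internal checkpoint format and parameter names are not public; KEY_MAP is a placeholder that would need one real entry per parameter, and the target name shown follows the DeepSeek-V2 HF modeling code, assumed here for illustration only.

```python
import torch
from safetensors.torch import save_file

# Hypothetical key remapping; the internal parameter names are not public,
# so this single entry only illustrates the pattern.
KEY_MAP = {
    "decoder.layers.0.attn.kv_down_proj.weight":
        "model.layers.0.self_attn.kv_a_proj_with_mqa.weight",
    # ... one entry per internal parameter
}

def convert(internal_ckpt: str, out_path: str) -> None:
    # Load the internal checkpoint, rename keys, and write a safetensors file.
    state = torch.load(internal_ckpt, map_location="cpu")
    hf_state = {KEY_MAP.get(name, name): t.contiguous() for name, t in state.items()}
    save_file(hf_state, out_path)  # e.g. convert("internal.pt", "model.safetensors")
```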
Best regards