Improve model card with paper link and pipeline tag
This PR adds a link to the paper associated with this model and sets the `pipeline_tag` to `text-generation` to improve discoverability on the Hugging Face Hub.
README.md CHANGED

````diff
@@ -1,7 +1,9 @@
 ---
-license: mit
 library_name: transformers
+license: mit
+pipeline_tag: text-generation
 ---
+
 # DeepSeek-V3-0324
 <!-- markdownlint-disable first-line-h1 -->
 <!-- markdownlint-disable html -->
@@ -197,5 +199,15 @@ This repository and the model weights are licensed under the [MIT License](LICEN
 }
 ```
 
+## Paper title and link
+
+The model was presented in the paper [Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures](https://huggingface.co/papers/2505.09343).
+
+## Paper abstract
+
+The abstract of the paper is the following:
+
+The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.
+
 ## Contact
-If you have any questions, please raise an issue or contact us at [service@deepseek.com](service@deepseek.com).
+If you have any questions, please raise an issue or contact us at [service@deepseek.com](service@deepseek.com).
````
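For context, the `text-generation` value added to `pipeline_tag` is the same task name the `transformers` pipeline API uses. The snippet below is a minimal sketch of that mapping, not part of this PR; the repo id `deepseek-ai/DeepSeek-V3-0324`, the prompt, and the loading options are assumptions for illustration, and actually running the full model requires multi-GPU hardware.

```python
# Minimal sketch: what `pipeline_tag: text-generation` corresponds to in the
# transformers pipeline API. Repo id, prompt, and options are illustrative
# assumptions; the full model needs substantial hardware to run.
from transformers import pipeline

generator = pipeline(
    "text-generation",                     # same task name as the new pipeline_tag
    model="deepseek-ai/DeepSeek-V3-0324",  # assumed repo id for this model card
    device_map="auto",                     # shard weights across available devices
    trust_remote_code=True,                # may be needed for custom model code
)

out = generator(
    "Give a one-sentence summary of Multi-head Latent Attention:",
    max_new_tokens=64,
)
print(out[0]["generated_text"])
```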