Upload README_CN.md with huggingface_hub
README_CN.md ADDED (+748 -0)

<p align="left">
    <a href="README.md">English</a> | 中文
</p>
<br><br>

<p align="center">
 <img src="https://dscache.tencent-cloud.cn/upload/uploader/hunyuan-64b418fd052c033b228e04bc77bbc4b54fd7f5bc.png" width="400"/> <br>
</p><p></p>


<p align="center">
    🤗 <a href="https://huggingface.co/tencent/"><b>Hugging Face</b></a>  |  
    <img src="https://avatars.githubusercontent.com/u/109945100?s=200&v=4" width="16"/> <a href="https://modelscope.cn/models/Tencent-Hunyuan/"><b>ModelScope</b></a>  |  
    <img src="https://cdn-avatars.huggingface.co/v1/production/uploads/6594d0c6c5f1cd69a48b261d/04ZNQlAfs08Bfg4B1o3XO.png" width="14"/> <a href="https://github.com/Tencent/AngelSlim/tree/main"><b>AngelSlim</b></a>
</p>

<p align="center">
    🖥️ <a href="https://hunyuan.tencent.com" style="color: red;"><b>Official Website</b></a>  |  
    🕖 <a href="https://cloud.tencent.com/product/hunyuan"><b>HunyuanAPI</b></a>  |  
    🕹️ <a href="https://hunyuan.tencent.com/"><b>Demo</b></a>
</p>

<p align="center">
    <a href="https://github.com/Tencent-Hunyuan/Hunyuan-7B"><b>GITHUB</b></a> |
    <a href="https://cnb.cool/tencent/hunyuan/Hunyuan-7B"><b>cnb.cool</b></a> |
    <a href="https://github.com/Tencent-Hunyuan/Hunyuan-7B/blob/main/LICENSE"><b>LICENSE</b></a>
</p>


## Model Introduction

Hunyuan is Tencent's open-source series of efficient large language models, designed for flexible deployment across diverse computing environments. From edge devices to high-concurrency production systems, these models deliver strong performance in a wide range of scenarios thanks to advanced quantization support and ultra-long context capabilities.

We have released a series of Hunyuan dense models, in both pre-trained and instruction-tuned variants, at parameter scales of 0.5B, 1.8B, 4B, and 7B. These models were trained with strategies similar to those of Hunyuan-A13B and therefore inherit its strong performance characteristics. This model family supports flexible deployment optimization: the smaller models fit resource-constrained edge scenarios, while the larger, higher-performing models serve complex reasoning in high-concurrency, low-latency production environments, maintaining strong capabilities throughout.


### Key Features and Advantages
- **Hybrid reasoning**: supports both fast-thinking and slow-thinking modes, letting users choose flexibly
- **Ultra-long context understanding**: natively supports a 256K context window, with stable performance on long-text tasks
- **Enhanced agent capabilities**: optimized for agent tasks, achieving leading results on benchmarks such as BFCL-v3, τ-Bench, and C3-Bench
- **Efficient inference**: uses Grouped Query Attention (GQA) and supports multiple quantization formats for efficient inference

## News
<br>

* 2025.7.30 We open-sourced **Hunyuan-0.5B-Pretrain**, **Hunyuan-1.8B-Pretrain**, **Hunyuan-4B-Pretrain**, **Hunyuan-7B-Pretrain**, **Hunyuan-0.5B-Instruct**, **Hunyuan-1.8B-Instruct**, **Hunyuan-4B-Instruct**, and **Hunyuan-7B-Instruct** on Hugging Face.

## Benchmark

| Model            | Hunyuan-0.5B-Pretrain | Hunyuan-1.8B-Pretrain | Hunyuan-4B-Pretrain | Hunyuan-7B-Pretrain|
|:------------------:|:---------------:|:--------------:|:-------------:|:---------------:|
| MMLU             | 54.02 | 64.62 | 74.01 | 79.82 |
| MMLU-Redux       | 54.72 | 64.42 | 73.53 | 79    |
| MMLU-Pro         | 31.15 | 38.65 | 51.91 | 57.79 |
| SuperGPQA        | 17.23 | 24.98 | 27.28 | 30.47 |
| BBH              | 45.92 | 74.32 | 75.17 | 82.95 |
| GPQA             | 27.76 | 35.81 | 43.52 | 44.07 |
| GSM8K            | 55.64 | 77.26 | 87.49 | 88.25 |
| MATH             | 42.95 | 62.85 | 72.25 | 74.85 |
| EvalPlus         | 39.71 | 60.67 | 67.76 | 66.96 |
| MultiPL-E        | 21.83 | 45.92 | 59.87 | 60.41 |
| MBPP             | 43.38 | 66.14 | 76.46 | 76.19 |
| CRUX-O           | 30.75 | 36.88 | 56.5  | 60.75 |
| Chinese SimpleQA | 12.51 | 22.31 | 30.53 | 38.86 |
| simpleQA (5shot) | 2.38  | 3.61  | 4.21  | 5.69  |


| Topic               |                        Bench                         | Hunyuan-0.5B-Instruct | Hunyuan-1.8B-Instruct | Hunyuan-4B-Instruct | Hunyuan-7B-Instruct|
|:-------------------:|:----------------------------------------------------:|:-------------:|:------------:|:-----------:|:---------------------:|
| **Mathematics**     |            AIME 2024<br>AIME 2025<br>MATH            | 17.2<br>20<br>48.5 | 56.7<br>53.9<br>86 | 78.3<br>66.5<br>92.6 | 81.1<br>75.3<br>93.7 |
| **Science**         |            GPQA-Diamond<br>OlympiadBench             | 23.3<br>29.6 | 47.2<br>63.4 | 61.1<br>73.1 | 60.1<br>76.5 |
| **Coding**          |           Livecodebench<br>Fullstackbench            | 11.1<br>20.9 | 31.5<br>42   | 49.4<br>54.6 | 57<br>56.3 |
| **Reasoning**       |              BBH<br>DROP<br>ZebraLogic               | 40.3<br>52.8<br>34.5 | 64.6<br>76.7<br>74.6 | 83<br>78.2<br>83.5 | 87.8<br>85.9<br>85.1 |
| **Instruction<br>Following** |        IF-Eval<br>SysBench                  | 49.7<br>28.1 | 67.6<br>55.5 | 76.6<br>68 | 79.3<br>72.7 |
| **Agent**           | BFCL v3<br>τ-Bench<br>ComplexFuncBench<br>C3-Bench | 49.8<br>14.4<br>13.9<br>45.3 | 58.3<br>18.2<br>22.3<br>54.6 | 67.9<br>30.1<br>26.3<br>64.3 | 70.8<br>35.3<br>29.2<br>68.5 |
| **Long<br>Context** | PenguinScrolls<br>longbench-v2<br>FRAMES          | 53.9<br>34.7<br>41.9 | 73.1<br>33.2<br>55.6 | 83.1<br>44.1<br>79.2 | 82<br>43<br>78.6 |

## Inference with transformers

Our model uses slow thinking (chain-of-thought reasoning) by default. There are two ways to disable CoT reasoning:

1. Pass **enable_thinking=False** when calling apply_chat_template.
2. Prefix the prompt with **/no_think** to force the model to skip CoT reasoning. Likewise, prefixing the prompt with **/think** forces the model to perform CoT reasoning.

The following code snippet shows how to load and use the model with the transformers library. It also demonstrates how to disable reasoning mode and how to parse the "thinking process" and the "final answer" out of the output.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import os
import re

model_name_or_path = os.environ['MODEL_PATH']
# model_name_or_path = "tencent/Hunyuan-7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
messages = [
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
# Render the chat template to a plain string (set enable_thinking=False to disable CoT)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
model_inputs.pop("token_type_ids", None)
outputs = model.generate(**model_inputs, max_new_tokens=4096)
output_text = tokenizer.decode(outputs[0])

# Extract the reasoning wrapped in <think>...</think>
think_pattern = r'<think>(.*?)</think>'
think_matches = re.findall(think_pattern, output_text, re.DOTALL)

# Extract the final answer wrapped in <answer>...</answer>
answer_pattern = r'<answer>(.*?)</answer>'
answer_matches = re.findall(answer_pattern, output_text, re.DOTALL)

think_content = [match.strip() for match in think_matches][0]
answer_content = [match.strip() for match in answer_matches][0]
print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
```
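
For reference, here is a minimal sketch of the two ways to disable slow thinking described above, reusing `tokenizer` and `messages` from the snippet; the exact placement of the `/no_think` tag follows the data-format examples later in this document.

```python
# Option 1: disable CoT via the chat-template flag
text_fast = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # model answers directly, without slow-thinking content
)

# Option 2: disable CoT by prefixing the user prompt with /no_think
messages_fast = [
    {"role": "user", "content": "/no_think Write a short summary of the benefits of regular exercise"},
]
text_fast2 = tokenizer.apply_chat_template(
    messages_fast, tokenize=False, add_generation_prompt=True
)
```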

We recommend the following set of parameters for inference. Note that our model does not define a default system_prompt.

```json
{
  "do_sample": true,
  "top_k": 20,
  "top_p": 0.8,
  "repetition_penalty": 1.05,
  "temperature": 0.7
}
```
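
These values map directly onto the sampling arguments of transformers' `generate`; a minimal sketch, assuming the `model` and `model_inputs` from the example above:

```python
outputs = model.generate(
    **model_inputs,
    max_new_tokens=4096,
    do_sample=True,
    top_k=20,
    top_p=0.8,
    repetition_penalty=1.05,
    temperature=0.7,
)
```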

## Training Data Format

If you need to fine-tune our Instruct models, we recommend processing your data into the following formats, corresponding to the slow-thinking and fast-thinking scenarios respectively.

```python
# think_pattern
think = ""
answer = ""
think_pattern = f"<think>\n{think}\n</think>\n<answer>\n{answer}\n</answer>"

# fast think pattern
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "/no_think Why is seawater salty?" },
    {"role": "assistant", "content": "<think>\n\n</think>\n<answer>\nSeawater is salty mainly because it contains many dissolved salts and minerals. These come from chemicals in rocks and soil on the Earth's surface, which are carried into the ocean over time. When seawater evaporates, the water leaves but the salts and minerals stay behind, so the ocean remains salty. The salinity of seawater is therefore determined by its salt and mineral content.\n</answer>"}
]

# slow think pattern
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "1+1=" },
    {"role": "assistant", "content": "<think>\nThe user asks what 1 plus 1 equals. This is a basic arithmetic question; in the decimal system the usual answer is 2. Other readings are possible, such as binary or riddle-style answers, but nothing in the question suggests them, so the standard answer 2 is the most accurate.\n</think>\n<answer>\nIn basic decimal arithmetic, 1 plus 1 equals 2. This is one of the most fundamental addition operations, following the rules of natural-number addition. Therefore, 1 + 1 = 2.\n</answer>"}
]

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("your_tokenizer_path", trust_remote_code=True)
train_ids = tokenizer.apply_chat_template(messages)
```
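
As a convenience, here is a small hypothetical helper (not part of the original repo) that wraps a reasoning trace and an answer into the target string and builds one training sample; for fast-thinking data, pass an empty `think` and let it prefix the question with `/no_think`.

```python
def build_sample(question: str, answer: str, think: str = "", fast: bool = False):
    """Assemble one training record in the <think>/<answer> format described above."""
    content = f"<think>\n{think}\n</think>\n<answer>\n{answer}\n</answer>"
    user = f"/no_think {question}" if fast else question
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user},
        {"role": "assistant", "content": content},
    ]

# Fast-thinking sample: empty reasoning trace, /no_think prefix on the question
messages = build_sample("Why is seawater salty?",
                        "Because of dissolved salts and minerals.", fast=True)
```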

## Training with LLaMA-Factory

This section describes how to fine-tune the Hunyuan models with `LLaMA-Factory`.

### Environment Setup

Before you start, make sure the following are installed:
1. Install [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory) following its official instructions.
2. Install [DeepSpeed](https://github.com/deepspeedai/DeepSpeed#installation) following its official instructions (optional).
3. Install the matching transformers library. The transformers changes submitted for Hunyuan are still under review, so the matching branch is required:
```
pip install git+https://github.com/huggingface/transformers@4970b23cedaf745f963779b4eae68da281e8c6ca
```

### Data Preparation

Prepare your custom dataset:

1. Organize your data in `json` format and place it in the `data` directory of `LLaMA-Factory`. The dataset uses the `sharegpt` format and must follow this layout:
```
[
  {
    "messages": [
      {
        "role": "system",
        "content": "System prompt (optional)"
      },
      {
        "role": "user",
        "content": "Human instruction"
      },
      {
        "role": "assistant",
        "content": "Model response"
      }
    ]
  }
]
```
You can refer to the [data format](#training-data-format) notes in the earlier section; a small conversion sketch follows below.
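
For instance, a short hypothetical script (file and directory names are placeholders) that dumps the fast/slow-thinking `messages` built above into a sharegpt-style file under LLaMA-Factory's `data` directory:

```python
import json

# `messages` as built in the "Training Data Format" section above
records = [{"messages": messages}]

# "hunyuan_sft.json" is a placeholder name; adjust the path to your LLaMA-Factory checkout
with open("LLaMA-Factory/data/hunyuan_sft.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```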

2. Define your dataset in the `data/dataset_info.json` file using the following format:
```
"dataset_name": {
  "file_name": "data.json",
  "formatting": "sharegpt",
  "columns": {
    "messages": "messages"
  },
  "tags": {
    "role_tag": "role",
    "content_tag": "content",
    "user_tag": "user",
    "assistant_tag": "assistant",
    "system_tag": "system"
  }
}
```

### Training

1. Copy all files from the `train/llama_factory_support/example_configs` directory into the `example/hunyuan` directory of `LLaMA-Factory`.
2. Edit the model path and dataset name in the config file `hunyuan_full.yaml`, and adjust the other settings as needed:
  ```
  ### model
  model_name_or_path: [!!!add the model path here!!!]

  ### dataset
  dataset: [!!!add the data set name here!!!]
  ```
3. Run the training command:
    * Single-node training
    Note that the `DISABLE_VERSION_CHECK` environment variable must be set to avoid version conflicts.
    ```
    export DISABLE_VERSION_CHECK=1
    llamafactory-cli train examples/hunyuan/hunyuan_full.yaml
    ```
    * Multi-node training
    Run the following command on every node, setting the `NNODES`, `NODE_RANK`, `MASTER_ADDR`, and `MASTER_PORT` variables required by `torchrun` according to your environment.
    ```
    export DISABLE_VERSION_CHECK=1
    FORCE_TORCHRUN=1 NNODES=${NNODES} NODE_RANK=${NODE_RANK} MASTER_ADDR=${MASTER_ADDR} MASTER_PORT=${MASTER_PORT} \
    llamafactory-cli train examples/hunyuan/hunyuan_full.yaml
    ```


## Quantization and Compression

We use the [AngelSlim](https://github.com/tencent/AngelSlim) compression tool to produce FP8 and INT4 quantized models. `AngelSlim` is a toolkit dedicated to building easier-to-use, more comprehensive, and more efficient model compression solutions.

### FP8 Quantization
We use FP8-static quantization: an 8-bit floating-point format whose quantization scales are determined in advance from a small amount of calibration data (no training required). Model weights and activations are converted to FP8, which improves inference efficiency and lowers the deployment barrier. You can quantize the model yourself with AngelSlim, or directly download our pre-quantized open-source models ([LINK](https://huggingface.co/)).

### Int4 Quantization
For Int4 we use the GPTQ and AWQ algorithms to obtain W4A16 quantization.

GPTQ processes the model weights layer by layer, using a small amount of calibration data to minimize the reconstruction error of the quantized weights; the weights are adjusted layer by layer through an optimization procedure based on an approximate inverse Hessian. No retraining is needed and only a small calibration set is required, which improves inference efficiency and lowers the deployment barrier.
AWQ uses a small amount of calibration data (again, no training) to measure activation magnitudes. For each weight channel it computes a scaling factor s that expands the numerical range of important weights, so that more information is preserved during quantization.

You can quantize the model yourself with [AngelSlim](https://github.com/tencent/AngelSlim), or directly download our pre-quantized open-source models ([LINK](https://huggingface.co/)).

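To make the activation-aware scaling idea above concrete, here is a toy sketch of AWQ-style per-channel scaling followed by plain uniform quantization. This is not AngelSlim's implementation; the scale exponent 0.5 and the per-tensor quantizer are illustrative choices only.

```python
import torch

def awq_style_scale(weight: torch.Tensor, act_samples: torch.Tensor, n_bits: int = 4):
    """Toy illustration: scale 'important' input channels up before uniform quantization.

    weight:      [out_features, in_features]
    act_samples: [num_tokens, in_features] calibration activations
    """
    # Per-input-channel activation magnitude from calibration data
    act_mag = act_samples.abs().mean(dim=0)            # [in_features]
    s = act_mag.clamp(min=1e-5) ** 0.5                 # illustrative scale choice
    s = s / s.mean()                                    # keep overall magnitude stable

    # Fold the scale into the weights (activations would be divided by s at runtime)
    w_scaled = weight * s

    # Plain symmetric per-tensor quantization of the scaled weights
    qmax = 2 ** (n_bits - 1) - 1
    step = w_scaled.abs().max() / qmax
    w_q = torch.clamp(torch.round(w_scaled / step), -qmax - 1, qmax) * step

    # Return the dequantized weight with the scale folded back, plus the channel scales
    return w_q / s, s
```
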
#### Quantization Benchmark
This subsection reports benchmark results for the quantized Hunyuan models.

|     Bench     |           Quantization            |    Hunyuan-0.5B-Instruct     |     Hunyuan-1.8B-Instruct      |     Hunyuan-4B-Instruct      |     Hunyuan-7B-Instruct      |
|:-------------:|:---------------------------------:|:----------------------------:|:------------------------------:|:----------------------------:|:----------------------------:|
|     DROP      | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 52.8<br>51.6<br>50.9<br>48.9 |  76.7<br>75.1<br>73.0<br>71.7  | 78.2<br>78.3<br>78.1<br>78.2 | 85.9<br>86.0<br>85.7<br>85.9 |
| GPQA-Diamond  | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 23.3<br>22.5<br>23.3<br>23.3 | 47.2<br>47.7<br>44.43<br>43.62 |  61.1<br>60.2<br>58.1<br>-   | 60.1<br>60.1<br>60.0<br>60.1 |
| OlympiadBench | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ | 29.6<br>29.6<br>26.8<br>26.3 |  63.4<br>62.5<br>60.9<br>61.7  | 73.1<br>73.1<br>72.9<br>72.8 | 76.5<br>76.6<br>76.2<br>76.4 |
|   AIME 2024   | B16<br>FP8<br>Int4GPTQ<br>Int4AWQ |    17.2<br>17.2<br>-<br>-    |    56.7<br>55.17<br>-<br>-     |    78.3<br>76.6<br>-<br>-    | 81.1<br>80.9<br>81.0<br>80.9 |


## Inference and Deployment

HunyuanLLM can be deployed with TensorRT-LLM, vLLM, or sglang. To simplify deployment, HunyuanLLM provides pre-built Docker images; see the following sections.

Images: https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags

## Inference with TensorRT-LLM
### Docker:

To simplify deployment, HunyuanLLM provides a pre-built Docker image (note: the host must have CUDA 12.8 or later):

[hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm](https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags). You only need to download the model files and start the container with the commands below to begin inference.
```shell
# Pull the image
# Inside mainland China:
docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-trtllm
# Elsewhere:
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm

# Start the container
docker run --privileged --user root --name hunyuanLLM_infer --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-trtllm
```

Note on Docker container privileges: the command above starts the container in privileged mode (--privileged), which grants the container elevated permissions and increases the risk of data leakage and cluster compromise. Avoid privileged mode unless it is strictly necessary. If it must be used, perform a thorough security assessment and put appropriate monitoring and hardening measures in place.

### BF16 Deployment

#### Step 1: Run Inference

#### Option 1: Command-line inference

The code snippet below shows how to quickly query the chat model with `TensorRT-LLM`.
Modify the following code in examples/pytorch/quickstart_advanced.py:

```python
def setup_llm(args):
    kv_cache_config = KvCacheConfig(
        enable_block_reuse=not args.disable_kv_cache_reuse,
        free_gpu_memory_fraction=args.kv_cache_fraction,
    )
    spec_config = None

    hf_ckpt_path = "$your_hunyuan_model_path"
    tokenizer = AutoTokenizer.from_pretrained(hf_ckpt_path, trust_remote_code=True)
    llm = LLM(
        tokenizer=tokenizer,
        model=args.model_dir,
        backend='pytorch',
        disable_overlap_scheduler=args.disable_overlap_scheduler,
        kv_cache_dtype=args.kv_cache_dtype,
        kv_cache_config=kv_cache_config,
        attn_backend=args.attention_backend,
        use_cuda_graph=args.use_cuda_graph,
        cuda_graph_padding_enabled=args.cuda_graph_padding_enabled,
        cuda_graph_batch_sizes=args.cuda_graph_batch_sizes,
        load_format=args.load_format,
        print_iter_log=args.print_iter_log,
        enable_iter_perf_stats=args.print_iter_log,
        torch_compile_config=TorchCompileConfig(
            enable_fullgraph=args.use_torch_compile,
            enable_inductor=args.use_torch_compile,
            enable_piecewise_cuda_graph= \
                args.use_piecewise_cuda_graph)
        if args.use_torch_compile else None,
        moe_backend=args.moe_backend,
        enable_trtllm_sampler=args.enable_trtllm_sampler,
        max_seq_len=args.max_seq_len,
        max_batch_size=args.max_batch_size,
        max_num_tokens=args.max_num_tokens,
        enable_attention_dp=args.enable_attention_dp,
        tensor_parallel_size=args.tp_size,
        pipeline_parallel_size=args.pp_size,
        moe_expert_parallel_size=args.moe_ep_size,
        moe_tensor_parallel_size=args.moe_tp_size,
        moe_cluster_parallel_size=args.moe_cluster_size,
        enable_chunked_prefill=args.enable_chunked_prefill,
        speculative_config=spec_config,
        trust_remote_code=args.trust_remote_code,
        gather_generation_logits=args.return_generation_logits)

    sampling_params = SamplingParams(
        end_id=127960,
        max_tokens=args.max_tokens,
        temperature=args.temperature,
        top_k=args.top_k,
        top_p=args.top_p,
        return_context_logits=args.return_context_logits,
        return_generation_logits=args.return_generation_logits,
        logprobs=args.logprobs)
    return llm, sampling_params


def main():
    args = parse_arguments()
    prompts = args.prompt if args.prompt else example_prompts

    llm, sampling_params = setup_llm(args)
    new_prompts = []
    for prompt in prompts:
        messages = [{"role": "user", "content": f"{prompt}"}]
        new_prompts.append(
            llm.tokenizer.apply_chat_template(messages,
                                              tokenize=False,
                                              add_generation_prompt=True))
    prompts = new_prompts
    outputs = llm.generate(prompts, sampling_params)

    for i, output in enumerate(outputs):
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"[{i}] Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

How to run:

```shell
python3 quickstart_advanced.py --model_dir "<path to the HunyuanLLM model>" --tp_size 4
```

#### Option 2: Serving

Below we show how to deploy the model as a service with `TensorRT-LLM` and send requests to it.

Prepare the configuration file:

```
cat >/path/to/extra-llm-api-config.yml <<EOF
use_cuda_graph: true
cuda_graph_padding_enabled: true
cuda_graph_batch_sizes:
- 1
- 2
- 4
- 8
- 16
- 32
print_iter_log: true
EOF
```

Start the server:

```shell
trtllm-serve \
  /path/to/HunYuan-moe-A13B \
  --host localhost \
  --port 8000 \
  --backend pytorch \
  --max_batch_size 32 \
  --max_num_tokens 16384 \
  --tp_size 2 \
  --kv_cache_free_gpu_memory_fraction 0.6 \
  --trust_remote_code \
  --extra_llm_api_options /path/to/extra-llm-api-config.yml
```

Once the server is up, call the model through the OpenAI-compatible API:
```
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "HunYuan/HunYuan-80B-A13B",
    "messages": [
      {
        "role": "user",
        "content": "Write a short summary of the benefits of regular exercise"
      }
    ]
  }'
```
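
The same endpoint can also be called from Python with the openai client; a minimal sketch, assuming the host, port, and model name from the curl example above:

```python
import openai

client = openai.Client(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="HunYuan/HunYuan-80B-A13B",
    messages=[
        {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
    ],
    temperature=0.7,
    max_tokens=4096,
)
print(response.choices[0].message.content)
```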

#### FP8/Int4 quantized model deployment:
FP8 and Int4 quantized models for TensorRT-LLM are still being worked on; stay tuned.


## Inference with vLLM
### Docker:

To simplify deployment, HunyuanLLM provides a pre-built Docker image (note: the host must have CUDA 12.8 or later):

[hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm](https://hub.docker.com/r/hunyuaninfer/hunyuan-a13b/tags). You only need to download the model files and start the container with the commands below to begin inference.
```shell
# Download the model:
# ModelScope:
modelscope download --model Tencent-Hunyuan/Hunyuan-A13B-Instruct
# Hugging Face: vLLM downloads the model automatically

# Pull the image
# Inside mainland China:
docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-vllm
# Elsewhere:
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-vllm

# Serve using the Hugging Face model
docker run  --privileged --user root  --net=host --ipc=host \
        -v ~/.cache:/root/.cache/ \
        --gpus=all -it --entrypoint python docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-vllm \
         -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 \
         --tensor-parallel-size 4 --model tencent/Hunyuan-A13B-Instruct --trust-remote-code

# Serve using the model downloaded from ModelScope
docker run  --privileged --user root  --net=host --ipc=host \
        -v ~/.cache/modelscope:/root/.cache/modelscope \
        --gpus=all -it --entrypoint python   docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-vllm \
         -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --tensor-parallel-size 4 \
         --port 8000 --model /root/.cache/modelscope/hub/models/Tencent-Hunyuan/Hunyuan-A13B-Instruct/ --trust_remote_code
```

Note on Docker container privileges: the command above starts the container in privileged mode (--privileged), which grants the container elevated permissions and increases the risk of data leakage and cluster compromise. Avoid privileged mode unless it is strictly necessary. If it must be used, perform a thorough security assessment and put appropriate monitoring and hardening measures in place.


### BF16 Deployment

BF16 can be deployed on 2 GPUs with more than 80 GB of memory each; for long-context workloads TP4 is recommended. Proceed as follows:

Set the following environment variable before running the commands:

```shell
export MODEL_PATH=PATH_TO_MODEL
```

#### Step 1: Run Inference

#### Option 1: Command-line inference

The code snippet below shows how to quickly query the chat model with `vLLM`.

Note on remote code execution in vLLM: if the trust-remote-code option below is enabled, vLLM will load and execute code from the remote model repository, which could lead to the execution of malicious code. Unless your use case explicitly requires it, keep this option disabled to reduce the potential security risk.


```python
import os
from typing import List, Optional
from vllm import LLM, SamplingParams
from vllm.inputs import PromptType
from transformers import AutoTokenizer

model_path = os.environ.get('MODEL_PATH')
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

llm = LLM(model=model_path,
        tokenizer=model_path,
        trust_remote_code=True,
        dtype='bfloat16',
        tensor_parallel_size=4,
        gpu_memory_utilization=0.9)

sampling_params = SamplingParams(
    temperature=0.7, top_p=0.8, max_tokens=4096, top_k=20, repetition_penalty=1.05)

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]

tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")

dummy_inputs: List[PromptType] = [{
    "prompt_token_ids": batch
} for batch in tokenized_chat.numpy().tolist()]

outputs = llm.generate(dummy_inputs, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```

#### Option 2: Serving

Below we show how to deploy the model as a service with `vLLM` and send requests to it.

On the main node, run:

```shell
export VLLM_HOST_IP=${LOCAL_IP}
```
Then start the server:
```shell
cd inference
sh run_server.sh
```

After `run_server.sh` has started successfully, run the request script:
```shell
sh openapi.sh
```

Remember to change `${LOCAL_IP}` and `${MODEL_PATH}` in `openapi.sh` to the values used by your server.
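
The contents of `openapi.sh` are not reproduced here; an equivalent request can also be sent from Python against the OpenAI-compatible endpoint. The host, port 8000, and the model name below are assumptions and must match your `run_server.sh` setup.

```python
import os
import openai

client = openai.Client(
    base_url=f"http://{os.environ['LOCAL_IP']}:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model=os.environ["MODEL_PATH"],   # must match the model the server was started with
    messages=[
        {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
    ],
    temperature=0.7,
    max_tokens=4096,
)
print(response.choices[0].message.content)
```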


### Quantized model deployment:

This section describes how to deploy quantized models with vLLM.

Image: the deployment image is the same as for BF16.


#### Int8 quantized model deployment:
To deploy the Int8-weight-only version of the HunYuan-A13B model, simply set the environment variable in `run_server_int8.sh`:
```SHELL
export MODEL_PATH=PATH_TO_BF16_MODEL
```

Then start the Int8 server:
```shell
sh run_server_int8.sh
```

After `run_server_int8.sh` has started successfully, run the request script:
```shell
sh openapi.sh
```

#### Int4 quantized model deployment:
To deploy the Int4-weight-only (GPTQ) version of the HunYuan-A13B model, simply set the environment variable in `run_server_int4.sh`:
```SHELL
export MODEL_PATH=PATH_TO_INT4_MODEL
```

Then start the Int4 server:
```shell
sh run_server_int4.sh
```

After `run_server_int4.sh` has started successfully, run the request script:
```shell
sh openapi.sh
```

#### FP8 quantized model deployment:
To deploy the W8A8C8 (FP8) version of the HunYuan-A13B model, simply set the environment variable in `run_server_fp8.sh`:
```shell
export MODEL_PATH=PATH_TO_FP8_MODEL
```

Then start the FP8 server:
```shell
sh run_server_fp8.sh
```

After `run_server_fp8.sh` has started successfully, run the request script:
```shell
sh openapi.sh
```

## Inference with sglang

### BF16 Deployment

#### Step 1: Pull the image

```
docker pull docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang
# or
docker pull hunyuaninfer/hunyuan-a13b:hunyuan-moe-A13B-sglang
```

- Start the API server:

```
docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    --ipc=host \
    docker.cnb.cool/tencent/hunyuan/hunyuan-a13b:hunyuan-moe-A13B-sglang \
    -m sglang.launch_server --model-path hunyuan/huanyuan_A13B --tp 4 --trust-remote-code --host 0.0.0.0 --port 30000
```

#### Step 2: Run Inference

#### Option 1: Command-line inference

The code snippet below shows how to quickly query the chat model with `sglang`:

```python
import os

import sglang as sgl
from transformers import AutoTokenizer

model_path = os.environ.get('MODEL_PATH')

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant.",
    },
    {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
]
prompts = []
prompts.append(tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
))
print(prompts)

llm = sgl.Engine(
    model_path=model_path,
    tp_size=4,
    trust_remote_code=True,
    mem_fraction_static=0.7,
)

sampling_params = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "max_new_tokens": 4096}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
    print(f"Prompt: {prompt}\nGenerated text: {output['text']}")
```

#### Option 2: Serving

Below we show how to deploy the model as a service with `sglang` and send requests to it.

```shell
model_path="<path to the HunyuanLLM model>"
python3 -u -m sglang.launch_server \
    --model-path $model_path \
    --tp 4 \
    --trust-remote-code
```

Once the server is up, run the request script:
```python
import openai
client = openai.Client(
    base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
    ],
    temperature=0.7,
    max_tokens=4096,
    extra_body={"top_p": 0.8, "top_k": 20}
)
print(response)
```

#### FP8/Int4 quantized model deployment:
FP8 and Int4 quantized models for sglang are still being worked on; stay tuned.

## Interactive Web Demo
A web demo of hunyuan-A13B is now available. Visit https://hunyuan.tencent.com/?model=hunyuan-a13b to try our model easily.


## Contact Us
If you would like to leave a message for our R&D and product teams, feel free to reach out to the Tencent Hunyuan LLM team. You can contact us by email at hunyuan_opensource@tencent.com.

