Model Details

This model is a mixed int4 model, with group_size 128 and symmetric quantization, of deepseek-ai/DeepSeek-V3.1-Terminus, generated by intel/auto-round via RTN (no algorithm tuning). Non-expert layers fall back to 8 bits. Please refer to the section "Generate the model" for more details, and please follow the license of the original model.
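
If you want to double-check the scheme, the quantization settings can be read back from the exported config. A minimal sketch (the exact key names in the auto-round export are an assumption and may vary across versions):

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Intel/DeepSeek-V3.1-Terminus-int4-mixed-AutoRound")
qcfg = config.quantization_config  # dict written by auto-round at export time
# key names assumed from typical auto-round exports
print(qcfg.get("bits"), qcfg.get("group_size"), qcfg.get("sym"))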

The e_score_correction_bias is stored in BF16 because, when the model is loaded in Transformers, its dtype is automatically converted to BF16; as a result, it is difficult for our tools to preserve it in FP32. Please use it with caution.
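
To verify the dtype at runtime, one hedged check (the parameter name comes from the DeepSeek-V3 architecture and is assumed to apply here), using the model loaded as in the INT4 Inference section below:

# inspect the dtype of e_score_correction_bias after loading the model
for name, param in model.named_parameters():
    if "e_score_correction_bias" in name:
        print(name, param.dtype)  # expected: torch.bfloat16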

How To Use

INT4 Inference

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

quantized_model_dir = "Intel/DeepSeek-V3.1-Terminus-int4-mixed-AutoRound"

# Load the quantized checkpoint; device_map="auto" places layers across the available devices
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
    "9.11和9.8哪个数字大",
    "strawberry中有几个r?",
    "There is a girl who likes adventure,",
    "Please give a brief introduction of DeepSeek company.",
]

# Apply the chat template to each prompt, then batch-tokenize with padding
texts = []
for prompt in prompts:
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_length=200,  # change this to align with the official usage
    num_return_sequences=1,
    do_sample=False,  # change this to align with the official usage
)
# Strip the prompt tokens so only the newly generated text is decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]
decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)


"""
Prompt: 9.11和9.8哪个数字大
Generated: 9.11 比 9.8 大。

比较两个小数时,先比较整数部分(都是 9),然后比较小数部分:
- 9.11 的小数部分是 0.11
- 9.8 的小数部分是 0.8
由于 0.11 小于 0.8,但这里需要对齐小数位比较:
9.11 = 9.11
9.8 = 9.80
比较 0.11 和 0.80,0.11 < 0.80,所以 9.11 < 9.8?
**不对,我纠正一下**:
实际上 9.11 的十分位是 1,而 9.8 的十分位是 8,因为 1 < 8,所以 9.
--------------------------------------------------
Prompt: strawberry中有几个r?
Generated: 我们来数一下单词 **strawberry** 中的字母 **r** 的数量。

单词:s t r a w b e r r y

逐个字母看:
- 第 3 个字母:r
- 第 8 个字母:r
- 第 9 个字母:r

一共有 **3** 个字母 **r**。

**答案:3**
--------------------------------------------------
Prompt: There is a girl who likes adventure,
Generated: That's a wonderful start to a story. A girl who likes adventure is a character full of potential.

What would you like to do with this idea?

*   **Create a character profile?** We could give her a name, a backstory, and define what *kind* of adventure she seeks.
    *   **Name:** Elara, Maya, Kaelen, Juniper?
    *   **Type of Adventure:** Is she an explorer of ancient ruins, a solver of mysteries in her town, a traveler to fantastical worlds, or a protector of nature?

*   **Start a story?** We can begin a narrative. Where is she, and what is the call to adventure?
    *   *Example:* "Elara traced the faded lines on the old map she'd found tucked inside a library book. It led to a part of the forest everyone
--------------------------------------------------
Prompt: Please give a brief introduction of DeepSeek company.
Generated: Of course! Here is a brief introduction to DeepSeek.

**DeepSeek** is a leading Chinese artificial intelligence research company, widely recognized for developing advanced large language models (LLMs).

Here are the key points about the company:

*   **Core Focus:** Their primary mission is to achieve Artificial General Intelligence (AGI). They are best known for their series of "DeepSeek" models, which are among the most powerful and capable open-source LLMs in the world, competing with models from major global AI labs.

*   **Key Products & Models:**
    *   **DeepSeek-V2:** A state-of-the-art mixture-of-experts (MoE) model that delivers high performance at a significantly lower cost for inference compared to similar-sized models.
    *   **DeepSeek Coder:** A family of models specifically designed for code generation and
--------------------------------------------------

"""

Generate the model

auto-round v0.7.1 is required.
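
For example, assuming the package is published on PyPI as auto-round:

pip install auto-round==0.7.1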

import torch
from auto_round import AutoRound
from auto_round.utils import llm_load_model

model_name = "deepseek-ai/DeepSeek-V3.1-Terminus"

# Load the original model on CPU for quantization
model, tokenizer, _ = llm_load_model(model_name, trust_remote_code=False, device="cpu")

# Quantize expert layers to 4 bits; all other linear layers (except lm_head) fall back to 8 bits
layer_config = {}
for n, m in model.named_modules():
    if isinstance(m, torch.nn.Linear):
        if "expert" in n and "shared_experts" not in n:
            layer_config[n] = {"bits": 4}
            print(n, 4)
        elif n != "lm_head":
            layer_config[n] = {"bits": 8}
            print(n, 8)

# iters=0 selects plain RTN (round-to-nearest), i.e. no algorithm tuning
ar = AutoRound(model, tokenizer=tokenizer, iters=0, layer_config=layer_config)
ar.quantize_and_save(format="auto_round", output_dir="tmp_autoround")
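
The exported checkpoint in tmp_autoround uses the auto_round format and can be loaded with Transformers exactly as shown in the INT4 Inference section above.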

Ethical Considerations and Limitations

The model can produce factually incorrect output and should not be relied on for factually accurate information. Because of the limitations of the pretrained model and the fine-tuning datasets, this model could generate lewd, biased, or otherwise offensive outputs.

Therefore, before deploying any applications of the model, developers should perform safety testing.

Caveats and Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

Here is a useful link to learn more about Intel's AI software:

  • Intel Neural Compressor: https://github.com/intel/neural-compressor

Disclaimer

The license on this model does not constitute legal advice. We are not responsible for the actions of third parties who use this model. Please consult an attorney before using this model for commercial purposes.

Cite

@article{cheng2023optimize,
  title={Optimize weight rounding via signed gradient descent for the quantization of llms},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}

arXiv: https://arxiv.org/abs/2309.05516 · GitHub: https://github.com/intel/auto-round
