The Unsloth BF16 GGUF is used as the source for the IQ1_M quantization.
Added IQ1_S_M Quantization
IQ1_S_M Quantization Details
--output-tensor-type Q6_K
--token-embedding-type Q6_K
--tensor-type ffn_down_exps=Q2_K
--tensor-type blk.[4-48].ffn_down_exps=IQ1_S
--tensor-type blk.[2|3].ffn_down_exps=MXFP4
--tensor-type blk.89.ffn_down_exps=Q8_0
--tensor-type blk.90.ffn_down_exps=BF16
--tensor-type blk.[91|92].ffn_down_exps=Q5_K
--tensor-type blk.93.ffn_down_exps=Q6_K
--tensor-type blk.[80-88].ffn_down_exps=IQ3_XXS
--tensor-type blk.[49-74].ffn_down_exps=IQ1_M
--tensor-type blk.[0|1|25|29|45|59|79].ffn_down_exps=Q2_K
--tensor-type ffn_gate_exps=IQ1_S
--tensor-type blk.3.ffn_gate_exps=MXFP4
--tensor-type blk.[76|77].ffn_gate_exps=IQ2_XS
--tensor-type blk.[85-88].ffn_gate_exps=IQ3_XXS
--tensor-type blk.[89-93].ffn_gate_exps=IQ4_XS
--tensor-type ffn_up_exps=IQ1_S
--tensor-type blk.3.ffn_up_exps=MXFP4
--tensor-type blk.[76|77].ffn_up_exps=IQ2_XS
--tensor-type blk.[85-88].ffn_up_exps=IQ3_XXS
--tensor-type blk.[89-93].ffn_up_exps=IQ4_XS
--tensor-type attn=IQ4_XS
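For reference, the flags above assemble into a single llama-quantize run along these lines. This is a sketch, not the exact command used: the input/output GGUF names are assumptions, while the imatrix path is taken from the quantize.imatrix.file metadata in the log below, and the final positional type IQ1_S matches the reported file type.

```
llama-quantize.exe --imatrix ..\imatrix_unsloth.gguf ^
  --output-tensor-type Q6_K --token-embedding-type Q6_K ^
  --tensor-type "ffn_down_exps=Q2_K" ^
  --tensor-type "blk.[4-48].ffn_down_exps=IQ1_S" ^
  --tensor-type "blk.[2|3].ffn_down_exps=MXFP4" ^
  --tensor-type "blk.89.ffn_down_exps=Q8_0" ^
  --tensor-type "blk.90.ffn_down_exps=BF16" ^
  --tensor-type "blk.[91|92].ffn_down_exps=Q5_K" ^
  --tensor-type "blk.93.ffn_down_exps=Q6_K" ^
  --tensor-type "blk.[80-88].ffn_down_exps=IQ3_XXS" ^
  --tensor-type "blk.[49-74].ffn_down_exps=IQ1_M" ^
  --tensor-type "blk.[0|1|25|29|45|59|79].ffn_down_exps=Q2_K" ^
  --tensor-type "ffn_gate_exps=IQ1_S" ^
  --tensor-type "blk.3.ffn_gate_exps=MXFP4" ^
  --tensor-type "blk.[76|77].ffn_gate_exps=IQ2_XS" ^
  --tensor-type "blk.[85-88].ffn_gate_exps=IQ3_XXS" ^
  --tensor-type "blk.[89-93].ffn_gate_exps=IQ4_XS" ^
  --tensor-type "ffn_up_exps=IQ1_S" ^
  --tensor-type "blk.3.ffn_up_exps=MXFP4" ^
  --tensor-type "blk.[76|77].ffn_up_exps=IQ2_XS" ^
  --tensor-type "blk.[85-88].ffn_up_exps=IQ3_XXS" ^
  --tensor-type "blk.[89-93].ffn_up_exps=IQ4_XS" ^
  --tensor-type "attn=IQ4_XS" ^
  Qwen3-235B-A22B-Thinking-2507-BF16.gguf ^
  Qwen3-235B-A22B-Thinking-2507-IQ1_S_M.gguf IQ1_S
```

The patterns are quoted so that cmd does not interpret the `|` characters; later `--tensor-type` overrides take precedence over the earlier per-group defaults.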
llama-cli test (reasoning) on a Radeon 780M integrated GPU (ROCm 6.4.2)
M:\llama_latest\build\bin>.\llama-cli.exe -m M:\IQ1_S_M\Qwen3-235B-A22B-Thinking-2507-IQ1_S_M-00001-of-00010.gguf -c 8192
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon 780M Graphics, gfx1103 (0x1103), VMM: no, Wave Size: 32
build: 6367 (2c8dac72) with clang version 20.0.0git (git@github.com:Compute-Mirrors/llvm-project 33ab2c2f7838239f1e2e5c06432bbb8d887e8cb2) for x86_64-pc-windows-msvc
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon 780M Graphics) - 59175 MiB free
llama_model_loader: additional 9 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 48 key-value pairs and 1131 tensors from M:\IQ1_S_M\Qwen3-235B-A22B-Thinking-2507-IQ1_S_M-00001-of-00010.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3moe
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-235B-A22B-Thinking-2507
llama_model_loader: - kv 3: general.version str = 2507
llama_model_loader: - kv 4: general.finetune str = Thinking
llama_model_loader: - kv 5: general.basename str = Qwen3-235B-A22B-Thinking-2507
llama_model_loader: - kv 6: general.quantized_by str = Unsloth
llama_model_loader: - kv 7: general.size_label str = 235B-A22B
llama_model_loader: - kv 8: general.license str = apache-2.0
llama_model_loader: - kv 9: general.license.link str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 10: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 11: general.base_model.count u32 = 1
llama_model_loader: - kv 12: general.base_model.0.name str = Qwen3 235B A22B Thinking 2507
llama_model_loader: - kv 13: general.base_model.0.version str = 2507
llama_model_loader: - kv 14: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 15: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen3-235...
llama_model_loader: - kv 16: general.tags arr[str,2] = ["unsloth", "text-generation"]
llama_model_loader: - kv 17: qwen3moe.block_count u32 = 94
llama_model_loader: - kv 18: qwen3moe.context_length u32 = 262144
llama_model_loader: - kv 19: qwen3moe.embedding_length u32 = 4096
llama_model_loader: - kv 20: qwen3moe.feed_forward_length u32 = 12288
llama_model_loader: - kv 21: qwen3moe.attention.head_count u32 = 64
llama_model_loader: - kv 22: qwen3moe.attention.head_count_kv u32 = 4
llama_model_loader: - kv 23: qwen3moe.rope.freq_base f32 = 5000000.000000
llama_model_loader: - kv 24: qwen3moe.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 25: qwen3moe.expert_used_count u32 = 8
llama_model_loader: - kv 26: qwen3moe.attention.key_length u32 = 128
llama_model_loader: - kv 27: qwen3moe.attention.value_length u32 = 128
llama_model_loader: - kv 28: qwen3moe.expert_count u32 = 128
llama_model_loader: - kv 29: qwen3moe.expert_feed_forward_length u32 = 1536
llama_model_loader: - kv 30: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 31: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 32: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 33: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 34: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 38: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 39: general.quantization_version u32 = 2
llama_model_loader: - kv 40: general.file_type u32 = 24
llama_model_loader: - kv 41: quantize.imatrix.file str = ..\imatrix_unsloth.gguf
llama_model_loader: - kv 42: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-235B-A22B-T...
llama_model_loader: - kv 43: quantize.imatrix.entries_count u32 = 744
llama_model_loader: - kv 44: quantize.imatrix.chunks_count u32 = 684
llama_model_loader: - kv 45: split.no u16 = 0
llama_model_loader: - kv 46: split.count u16 = 10
llama_model_loader: - kv 47: split.tensors.count i32 = 1131
llama_model_loader: - type f32: 471 tensors
llama_model_loader: - type q8_0: 1 tensors
llama_model_loader: - type q2_K: 11 tensors
llama_model_loader: - type q5_K: 2 tensors
llama_model_loader: - type q6_K: 3 tensors
llama_model_loader: - type iq2_xs: 4 tensors
llama_model_loader: - type iq3_xxs: 17 tensors
llama_model_loader: - type iq1_s: 206 tensors
llama_model_loader: - type iq4_xs: 386 tensors
llama_model_loader: - type iq1_m: 25 tensors
llama_model_loader: - type bf16: 1 tensors
llama_model_loader: - type mxfp4: 4 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = IQ1_S - 1.5625 bpw
print_info: file size = 56.71 GiB (2.07 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch = qwen3moe
print_info: vocab_only = 0
print_info: n_ctx_train = 262144
print_info: n_embd = 4096
print_info: n_layer = 94
print_info: n_head = 64
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 16
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 12288
print_info: n_expert = 128
print_info: n_expert_used = 8
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 5000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 235B.A22B
print_info: model params = 235.09 B
print_info: general.name = Qwen3-235B-A22B-Thinking-2507
print_info: n_ff_exp = 1536
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151654 '<|vision_pad|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 94 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 95/95 layers to GPU
load_tensors: ROCm0 model buffer size = 57581.65 MiB
load_tensors: CPU_Mapped model buffer size = 486.86 MiB
.................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 5000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context: ROCm_Host output buffer size = 0.58 MiB
llama_kv_cache: ROCm0 KV buffer size = 1504.00 MiB
llama_kv_cache: size = 1504.00 MiB ( 8192 cells, 94 layers, 1/1 seqs), K (f16): 752.00 MiB, V (f16): 752.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context: ROCm0 compute buffer size = 304.75 MiB
llama_context: ROCm_Host compute buffer size = 24.01 MiB
llama_context: graph nodes = 5835
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
main: interactive mode on.
sampler seed: 2824890048
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 8192
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 8192, n_batch = 2048, n_predict = -1, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
I have a vehicle of mass M kg and tires of radius R, in meters. Before the axle there is a transmission with ratio n, where n is defined as input speed divided by output speed. Find the vehicle inertia reflected to the transmission input.

Hmm, the user's question is about reflecting vehicle inertia to the transmission input side. Let me look at it carefully.
The user has a vehicle of mass M kg, a tire radius of R meters, and a gearbox with ratio n, where n is defined as input speed divided by output speed. The task is to find the vehicle inertia reflected to the gearbox input. First, understand the problem.
First, "inertia" here should mean moment of inertia, right? The vehicle as a whole is a translating mass, but the question mentions tires and a driveline, so rotational inertia is involved. The user's "vehicle inertia" most likely means the equivalent moment of inertia referred to a rotating shaft.
Key point: the ratio n = input speed / output speed. A gear ratio is usually defined as input speed over output speed, so for a reduction with n > 1 the output turns slower than the input, as in a differential or a gearbox.
Now, the vehicle has translating mass M and tire radius R. As the vehicle moves, the tires rotate and drive the transmission. The vehicle's translational inertia must be reflected onto the transmission input shaft.
Recall how an equivalent moment of inertia is computed: when converting between translation and rotation, inertia scales with the square of the speed ratio.
First, the tire rotation: the tire's linear speed is v = ω · R, where ω is the tire's angular velocity (rad/s). The tire's angular velocity equals the driveline output speed, since the tire is mounted directly on the axle.
And the ratio n = input speed / output speed, so input speed = n × output speed, i.e., output speed = input speed / n.
Assume the vehicle speed is v = ω_output · R, where ω_output is the output-side angular velocity (the tire's angular velocity).
The input-side angular velocity is ω_input = n · ω_output, since n = ω_input / ω_output.
Now, the vehicle's translational kinetic energy is (1/2) M v², while the rotational kinetic energy at the input would be (1/2) I_input ω_input².
"Equivalent" means that as the vehicle moves, its translational kinetic energy must equal the rotational kinetic energy of the inertia reflected to the input; in other words, the inertia must satisfy energy equivalence.
So, (1/2) M v² = (1/2) I_input ω_input²
Substitute v = ω_output · R with ω_output = ω_input / n,
so v = (ω_input / n) · R.
Substituting into the energy equation:
(1/2) M ((ω_input R)/n)² = (1/2) I_input ω_input²
Dividing both sides by (1/2) ω_input² gives:
M (R² / n²) = I_input
So I_input = M R² / n²
But let me double-check, because reflecting a moment of inertia generally involves the square of the gear ratio.
For instance, a translating mass reflected onto a rotating shaft has inertia M · R², but with a gear ratio it must be multiplied by (1/n)², because the speed is reduced by a factor of n, so the inertia is divided by n².
For example, the equivalent moment of inertia at the output (tire) side is MR² (this is the translating mass expressed as a moment of inertia, J = MR²; strictly speaking, for pure rolling the tire's own rotational inertia would also contribute, but the user presumably means only the vehicle's translating mass expressed as inertia).
Typically, a vehicle of translating mass M appears in the rotating system as an equivalent moment of inertia J = M · R², where R is the rolling radius; when the vehicle accelerates, it behaves like an equivalent inertia J = M · R² on the rotating shaft.
Now, this J sits at the output (the wheel axle). It must be reflected to the input, i.e., the gearbox input shaft.
Since the ratio n = input speed / output speed, the output speed = input speed / n.
So the output angular velocity is 1/n of the input's.
Reflecting a moment of inertia uses the square of the speed ratio, by energy conservation: J_output · ω_output² = J_input · ω_input²
The reflection formula is: J_input = J_output · (ω_output / ω_input)²
Since ω_output = ω_input / n, the ratio is 1/n.
So J_input = J_output · (1/n)²
And J_output is M · R² (the vehicle's translational inertia expressed as a moment of inertia),
so J_input = (M R²) · (1/n²) = M R² / n²
Does that make sense? Suppose n = 2, so the input spins twice as fast as the output; then the equivalent inertia at the input is smaller than at the output, because at the higher speed the same kinetic energy corresponds to a smaller inertia.
For example, with output inertia J_out, the input-side equivalent is J_in = J_out · (N_out / N_in)²; here the ratio n = N_in / N_out, so N_out / N_in = 1/n and therefore J_in = J_out · (1/n)²
Yes, that formula is correct.
So the vehicle's equivalent moment of inertia at the output is MR² (assuming the tires contribute no inertia of their own, only the translating mass), and reflected to the input it becomes MR² / n²
So the answer should be I = (M · R²) / n²
But I should confirm that this is what the user is asking.
The user says "vehicle inertia", presumably meaning the whole vehicle's translational inertia reflected into the rotating system. Typically the vehicle's translating mass appears in the driveline as an equivalent moment of inertia J = M · R², where R is the tire rolling radius.
So, with this equivalent inertia located at the output side (the wheel side), reflecting it to the input divides it by n², because the ratio n is input speed over output speed: the speed ratio is n:1, and inertia scales inversely with the square of speed.
For example, with n = 3 the input runs three times as fast as the output, so the input-side equivalent inertia is 1/9 of the output-side value.
Therefore, the final result is M · R² / n²
Now check the units: M is in kg and R in meters, so M · R² is kg·m²; n is dimensionless, so the result is in kg·m². Correct.
The user's "a transmission with ratio n before the axle" means the axle is the drive axle: the driveline output connects to the wheel axle, and the transmission's input-to-output ratio is n = input speed / output speed.
So the conclusion stands.
To reflect the vehicle's translational inertia to the transmission input, account for the mass, the tire radius, and the gear ratio. The detailed derivation and result follow.
Key concepts
- Vehicle translational inertia: the translational motion of the vehicle mass $ M $ (kg), expressed as a moment of inertia at the tire axle (output side), is $ J_{\text{output}} = M \cdot R^2 $ (units: kg·m²), where $ R $ is the tire radius (m).
- Gear ratio definition: $ n = \frac{\text{input speed}}{\text{output speed}} $, i.e., the input turns $ n $ times as fast as the output.
- Inertia reflection rule: the equivalent moment of inertia scales inversely with the square of the speed ratio. With output-side inertia $ J_{\text{output}} $, the input-side equivalent is: $$ J_{\text{input}} = J_{\text{output}} \cdot \left( \frac{\text{output speed}}{\text{input speed}} \right)^2 $$
Derivation
1. Output-side equivalent inertia. The vehicle's translating mass $ M $ reflected to the tire axle (output side) gives: $$ J_{\text{output}} = M \cdot R^2 $$
2. Speed relation. From the ratio definition $ n = \frac{\omega_{\text{input}}}{\omega_{\text{output}}} $: $$ \omega_{\text{output}} = \frac{\omega_{\text{input}}}{n} $$
3. Input-side equivalent inertia. By the reflection rule: $$ J_{\text{input}} = J_{\text{output}} \cdot \left( \frac{\omega_{\text{output}}}{\omega_{\text{input}}} \right)^2 = (M \cdot R^2) \cdot \left( \frac{1}{n} \right)^2 $$ which simplifies to: $$ J_{\text{input}} = \frac{M \cdot R^2}{n^2} $$
Final result
The vehicle's equivalent moment of inertia reflected to the transmission input is:
$$ J_{\text{input}} = \frac{M \cdot R^2}{n^2} $$
Unit check
- $ M $: kg
- $ R $: m
- $ n $: dimensionless
- Result $ J_{\text{input}} $: kg·m² (the correct units for a moment of inertia)
Physical interpretation
- Effect of the ratio $ n $: for $ n > 1 $ (a speed reduction), the input-side equivalent inertia shrinks markedly, because the input turns faster and the same kinetic energy implies a smaller inertia.
- Typical scenario: in an electric vehicle with $ n = 10 $, the input-side equivalent inertia is only $ 1/100 $ of the output-side value, greatly reducing the inertia load seen by the motor.
This result reflects the vehicle's translational inertia to the transmission input shaft and applies to driveline dynamics analysis and control design.
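As a quick plug-in check of the model's final formula (the numbers below are illustrative values chosen here, not taken from the prompt):

$$ M = 1500\ \text{kg},\quad R = 0.3\ \text{m},\quad n = 10 \;\;\Rightarrow\;\; J_{\text{input}} = \frac{1500 \cdot 0.3^2}{10^2} = \frac{135}{100} = 1.35\ \text{kg·m}^2 $$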
llama_perf_sampler_print: sampling time = 124.87 ms / 2306 runs ( 0.05 ms per token, 18467.06 tokens per second)
llama_perf_context_print: load time = 123893.22 ms
llama_perf_context_print: prompt eval time = 5562.49 ms / 67 tokens ( 83.02 ms per token, 12.04 tokens per second)
llama_perf_context_print: eval time = 418866.68 ms / 2238 runs ( 187.16 ms per token, 5.34 tokens per second)
llama_perf_context_print: total time = 584478.10 ms / 2305 tokens
llama_perf_context_print: graphs reused = 2228
Interrupted by user