Qwen3-Embedding-0.6B (GGUF) Models

This directory contains GGUF builds of the Qwen3 0.6B embedding model, produced from the upstream base repository Qwen/Qwen3-0.6B-Base (original Hugging Face layout in ../Qwen3-Embedding-0.6B/).

Contents

| File | Purpose |
| --- | --- |
| `qwen3-embedding-0.6b.Q4_K_M.gguf` | Quantized (Q4_K_M) GGUF for efficient inference. |
| `qwen3-embedding-0.6b-fix.gguf` | Same model with an explicit `sep_token` / EOS metadata fix applied. |

Special Token Configuration

Extracted from tokenizer_config.json:

"sep_token": "<|endoftext|>",
"sep_token_id": 151643

The model uses `<|endoftext|>` as both the padding token (pad_token) and the separator (sep_token). For embedding generation, each input text MUST end with the separator token (or the converter must auto-append it); otherwise llama.cpp emits a runtime warning:

```
[WARNING] At least one last token in strings embedded is not SEP. 'tokenizer.ggml.add_eos_token' should be set to 'true' in the GGUF header
```
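When working with an unfixed GGUF, a small pre-processing helper can guarantee the terminator on every input (a minimal sketch; the `SEP` literal is the sep_token from tokenizer_config.json above):

```python
SEP = "<|endoftext|>"

def with_sep(texts):
    """Append the separator to any input that does not already end with it."""
    return [t if t.endswith(SEP) else t + SEP for t in texts]
```

For example, `with_sep(["Hello world"])` yields `["Hello world<|endoftext|>"]`, while inputs that already carry the separator pass through unchanged.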

Why the Warning Appears

If the GGUF metadata key tokenizer.ggml.add_eos_token is absent or false, llama.cpp will not auto-append the final SEP/EOS token for embedding inputs. Any input string that does not already end with <|endoftext|> triggers the warning and may yield suboptimal embeddings (slightly different token boundary semantics).

Fix Implemented

The file qwen3-embedding-0.6b-fix.gguf was regenerated to ensure:

  • tokenizer.ggml.add_eos_token = true
  • sep_token (<|endoftext|>) retained with id 151643

This makes llama.cpp automatically append the SEP/EOS token when missing, silencing the warning and standardizing embeddings.
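The effect of the flag can be pictured at the token-id level (a hedged sketch of the append-when-missing behavior, not llama.cpp's actual implementation; 151643 is the sep_token_id documented above):

```python
EOS_ID = 151643  # sep_token_id of <|endoftext|>

def maybe_append_eos(token_ids, add_eos_token=True):
    """Mirror the metadata flag: append EOS only when it is not already last."""
    if add_eos_token and (not token_ids or token_ids[-1] != EOS_ID):
        return list(token_ids) + [EOS_ID]
    return list(token_ids)
```

With the flag set, a tokenized input such as `[1, 2]` becomes `[1, 2, 151643]`; inputs already ending in the EOS id are left alone, which is why the fix is safe for pre-terminated strings.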

Rebuilding From Upstream (Recommended Process)

  1. Obtain upstream model:
    • Clone or download Qwen/Qwen3-0.6B-Base (embedding variant directory).
  2. Convert to GGUF using the current llama.cpp conversion script:
    • Use the repo's convert_hf_to_gguf.py (it already sets EOS for Qwen tokenizers). Example:
```shell
# The model directory is a positional argument; convert to f16 first
# (convert_hf_to_gguf.py does not emit K-quants directly):
python3 llama.cpp/convert_hf_to_gguf.py Qwen3-Embedding-0.6B \
  --outfile qwen3-embedding-0.6b-f16.gguf \
  --outtype f16

# Then quantize the f16 build down to Q4_K_M:
./llama.cpp/build/bin/llama-quantize \
  qwen3-embedding-0.6b-f16.gguf qwen3-embedding-0.6b-fix.gguf Q4_K_M
```

If you previously produced a GGUF that shows the warning, just re-run conversion with an up-to-date llama.cpp checkout. The script internally writes tokenizer.ggml.add_eos_token = true for this tokenizer family.

Post-Conversion Validation

Run a quick embedding call and confirm no warning appears:

```shell
# In older llama.cpp builds the binary is named `embedding` instead.
./llama.cpp/build/bin/llama-embedding \
  -m models/qwen3-embedding-0.6b-fix.gguf \
  -p "Hello world"
```

If you still see the warning:

  • Confirm the binary was rebuilt after updating sources (make or cmake --build).
  • Inspect metadata using a small Python snippet:
```python
from gguf import GGUFReader

r = GGUFReader("models/qwen3-embedding-0.6b-fix.gguf")
# r.fields maps key names to ReaderField objects, so look the key up directly.
field = r.fields.get("tokenizer.ggml.add_eos_token")
if field is not None:
    print("ADD_EOS_TOKEN=", bool(field.parts[-1][0]))
```

Expected output: ADD_EOS_TOKEN= True

Manual Patch (Fallback Method)

If re-conversion is inconvenient, you can copy the quantized file and flip the flag in place. Note that this only works when the key is already present (merely set to false); if the key is absent entirely, re-run the conversion instead:

```python
import shutil

from gguf import GGUFReader

# Work on a copy so the original quantized file stays untouched.
shutil.copyfile("qwen3-embedding-0.6b.Q4_K_M.gguf", "qwen3-embedding-0.6b-fix.gguf")

# Mode "r+" memory-maps the file read-write, allowing in-place edits.
reader = GGUFReader("qwen3-embedding-0.6b-fix.gguf", "r+")
field = reader.get_field("tokenizer.ggml.add_eos_token")
if field is None:
    raise SystemExit("Key missing entirely; re-run conversion instead.")

# field.data[0] indexes the part of the field that holds the boolean value.
field.parts[field.data[0]][0] = 1  # True
```

After patching, re-run the validation step.

Usage Notes for Embeddings

  • Always feed raw text; no special wrapping needed. Auto-SEP happens with the fixed file.
  • For batch embeddings, ensure each string ends cleanly (avoid trailing spaces if you rely on identical hashes downstream).
  • The dimensionality matches upstream Qwen3-Embedding-0.6B (refer to upstream docs for exact embedding size).
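Once vectors come back, downstream comparison is typically done by cosine similarity, regardless of the exact embedding size (a generic sketch, independent of llama.cpp):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Identical inputs embedded with identical SEP handling should score at (or numerically near) 1.0, which makes this a quick sanity check that the fixed and manually SEP-terminated pipelines agree.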

License & Attribution

The original model weights and tokenizer come from the Qwen project (Qwen/Qwen3-0.6B-Base). Review their license and usage terms before redistribution. This README documents conversion adjustments only (metadata EOS flag addition).

Changelog

  • Initial addition: added fixed GGUF with tokenizer.ggml.add_eos_token = true to suppress SEP warning.

For further improvements (FP16 build, alternative quantization tiers, or batching examples), open an issue or PR in this repo.
