Zephyr-7B-Beta — GGUF (Q5_K_M & Q8_0)
Two ready-to-run GGUF builds of the Zephyr-7B-Beta chat model for local CPU inference via the llama.cpp ecosystem.
These are inference-only quantized weights.
Files
- `zephyr-q5_k_m.gguf` — balanced quality vs. size (≈ 4.8 GB). Good default for 16 GB RAM laptops.
- `zephyr-q8_0.gguf` — higher fidelity (≈ 7.2 GB). Requires more RAM.
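If the files are hosted on the Hugging Face Hub, one way to fetch a single build is with `huggingface-cli` (the repo id below matches this card; adjust it if you mirror the files elsewhere):

```bash
pip install -U huggingface_hub

# Download just the Q5_K_M build into the current directory
huggingface-cli download arunvpp05/zephyr-gguf zephyr-q5_k_m.gguf --local-dir .
```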
GGUF embeds tokenizer/vocab, so separate tokenizer files are not required for inference.
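To confirm what's embedded, you can inspect the file's metadata; a sketch using the `gguf` Python package (the dump tool's name has varied across versions, e.g. `gguf-dump` vs. `gguf_dump.py` in the llama.cpp repo):

```bash
pip install gguf

# List tokenizer-related metadata keys embedded in the file
gguf-dump zephyr-q5_k_m.gguf | grep -i tokenizer | head
```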
Prompt Format (Zephyr chat)
Use Zephyr chat tags for best results:
```
<|user|>
YOUR_PROMPT_HERE
<|assistant|>
```
Example
```
<|user|>
List three ways Retrieval-Augmented Generation improves factuality.
<|assistant|>
```
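Recent llama.cpp builds can also apply this template for you in interactive conversation mode, so you don't have to hand-write the tags (flag availability varies by version; a sketch):

```bash
# -cnv starts interactive chat; --chat-template zephyr applies the tags automatically
./llama-cli -m zephyr-q5_k_m.gguf -cnv --chat-template zephyr -c 2048 -ngl 0
```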
How to Run (llama.cpp)
CLI (CPU)
Q5_K_M (fits most 16 GB RAM systems):

```bash
./llama-cli -m zephyr-q5_k_m.gguf \
  -e -p "<|user|>\nExplain RAG in 3 bullets.\n\n<|assistant|>\n" \
  -n 256 -c 2048 -ngl 0 -t $(nproc)
```
Q8_0 (higher quality; more RAM):

```bash
./llama-cli -m zephyr-q8_0.gguf \
  -e -p "<|user|>\nGive 5 note-taking tips.\n\n<|assistant|>\n" \
  -n 256 -c 2048 -ngl 0 -t $(nproc)
```
Flags
- `-e` → interpret escape sequences (`\n`) in the prompt string
- `-n 256` → max new tokens
- `-c 2048` → context window
- `-ngl 0` → CPU-only (set > 0 to offload layers to GPU if supported; see the example below)
- `-t $(nproc)` → threads
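If your binary was compiled with GPU support (CUDA, Metal, etc.), raising `-ngl` offloads that many layers to the GPU; the layer count below is an example, tune it to your VRAM:

```bash
# Offload 32 of the model's layers to the GPU; keep the rest on CPU
./llama-cli -m zephyr-q5_k_m.gguf -ngl 32 \
  -e -p "<|user|>\nExplain RAG in 3 bullets.\n\n<|assistant|>\n" -n 256
```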
Some builds name the binary `./main` instead of `./llama-cli`; replace the binary name if needed.
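llama.cpp also ships `llama-server`, which exposes an OpenAI-compatible HTTP API; a minimal sketch (check the flags and endpoint paths against your build's docs):

```bash
./llama-server -m zephyr-q5_k_m.gguf -c 2048 --port 8080

# In another shell: query the OpenAI-compatible chat endpoint
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Explain RAG in 3 bullets."}],"max_tokens":256}'
```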
Popular UIs
Import the .gguf directly in:
- LM Studio
- KoboldCpp
- Text Generation WebUI (llama.cpp backend)
- Ollama (custom import; see the sketch after this list)
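For the Ollama route, a minimal custom import might look like this (the model name `zephyr-local` is arbitrary; the TEMPLATE mirrors the Zephyr tags above):

```bash
# Write a Modelfile pointing at the local GGUF
cat > Modelfile <<'EOF'
FROM ./zephyr-q5_k_m.gguf
TEMPLATE """<|user|>
{{ .Prompt }}
<|assistant|>
"""
EOF

# Build and run the local model
ollama create zephyr-local -f Modelfile
ollama run zephyr-local "Explain RAG in 3 bullets."
```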
Hardware Notes
Approximate RAM use at 2k context (CPU-only):
- Q5_K_M (~4.8 GB file) → ~8–10 GB RAM
- Q8_0 (~7.2 GB file) → ~12–14 GB RAM
Actual usage varies with context length, batch size, and compile options.
Checksums (optional)
Verify downloads:
```bash
sha256sum zephyr-q5_k_m.gguf
sha256sum zephyr-q8_0.gguf
```
(Add the resulting hashes here if you want to publish them.)
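A common pattern is to publish the hashes in a single file so downloaders can verify in one step (`SHA256SUMS` is just a conventional filename):

```bash
# Publisher: record both hashes
sha256sum zephyr-q5_k_m.gguf zephyr-q8_0.gguf > SHA256SUMS

# Downloader: verify files against the published list
sha256sum -c SHA256SUMS
```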
Intended Use & Limitations
- Intended for local assistant/chat and general text generation.
- Not suitable for high-stakes or safety-critical use without human review.
- Outputs may contain mistakes or biases; verify important information.
What’s Included
Quantized GGUF weights:
- `zephyr-q5_k_m.gguf`
- `zephyr-q8_0.gguf`
No training code or LoRA adapters are included here.
Acknowledgments
- Base: Zephyr-7B-Beta (HuggingFaceH4's fine-tune of mistralai/Mistral-7B-v0.1), converted to GGUF and quantized for CPU inference.
- Inference runtime: `llama.cpp` and compatible UIs.
Changelog
- v1.0 — Initial release of the `Q5_K_M` and `Q8_0` GGUF builds.