# KAT-Dev 72B - GGUF
Enterprise-Grade 72B Coding Model, Optimized for Local Inference
Original Model | Ollama Registry | llama.cpp
## What is This?
This is KAT-Dev 72B, a powerful coding model with 72 billion parameters, quantized to GGUF format for efficient local inference. Perfect for developers who want enterprise-grade code assistance running entirely on their own hardware with Ollama or llama.cpp!
## Why You'll Love It
- **Coding-Focused** - Optimized specifically for programming tasks
- **72B Parameters** - Large enough for complex reasoning and refactoring
- **Local Inference** - Run entirely on your machine, no API calls
- **Privacy First** - Your code never leaves your computer
- **Multiple Quantizations** - Choose your speed/quality trade-off
- **Ollama Ready** - One command to start coding
- **llama.cpp Compatible** - Works with your favorite tools
## Quick Start

### Option 1: Ollama (Easiest!)
Pull and run directly from the Ollama registry:
```bash
# Recommended: IQ3_M (best balance)
ollama run richardyoung/kat-dev-72b:iq3_m

# Other variants
ollama run richardyoung/kat-dev-72b:iq4_xs   # Better quality
ollama run richardyoung/kat-dev-72b:iq2_m    # Faster, smaller
ollama run richardyoung/kat-dev-72b:iq2_xxs  # Most compact
```
That's it! Start asking coding questions!
### Option 2: Build from Modelfile
Download this repo and build locally:
```bash
# Clone or download the modelfiles
ollama create kat-dev-72b-iq3_m -f modelfiles/kat-dev-72b--iq3_m.Modelfile
ollama run kat-dev-72b-iq3_m
```
### Option 3: llama.cpp
Use with llama.cpp directly:
```bash
# Download the GGUF file (replace variant as needed)
huggingface-cli download richardyoung/kat-dev-72b kat-dev-72b-iq3_m.gguf --local-dir ./

# Run with llama.cpp
./llama-cli -m kat-dev-72b-iq3_m.gguf -p "Write a Python function to"
```
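If your llama.cpp build has Metal or CUDA support, you can offload layers to the GPU for a large speedup. A minimal sketch, assuming a recent llama.cpp build (the `-ngl` value is a ceiling; layers that don't fit in VRAM stay on the CPU):

```bash
# Offload up to 99 layers to the GPU and use an 8K context window
./llama-cli -m kat-dev-72b-iq3_m.gguf -ngl 99 -c 8192 \
  -p "Write a Python function to parse a CSV file"
```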
## System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| RAM | 32 GB | 64 GB+ |
| Storage | 40 GB free | 50+ GB free |
| CPU | Modern 8-core | 16+ cores |
| GPU | Optional (CPU-only works!) | Metal/CUDA for acceleration |
| OS | macOS, Linux, Windows | Latest versions |
**Tip:** Larger quantizations (IQ4_XS) need more RAM but produce better code. Smaller ones (IQ2_XXS) are faster but less precise.
## Available Quantizations
Choose the right balance for your needs:
| Quantization | Size | Quality | Speed | RAM Usage | Best For |
|---|---|---|---|---|---|
| IQ4_XS | 37 GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ~50 GB | Production code, complex refactoring |
| **IQ3_M** (recommended) | 33 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~40 GB | Daily development, best balance |
| IQ2_M | 27 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~35 GB | Quick prototyping, fast iteration |
| IQ2_XXS | 24 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~30 GB | Testing, very constrained systems |
### Variant Details

| Variant | Size | Blob SHA256 |
|---|---|---|
| iq4_xs | 36.98 GB | c4cb9c6e... |
| iq3_m | 33.07 GB | 14d07184... |
| iq2_m | 27.32 GB | cbe26a3c... |
| iq2_xxs | 23.74 GB | a49c7526... |
## Usage Examples

### Code Generation
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Write a Python function to validate email addresses with regex"
```

### Code Explanation
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Explain this code: def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"
```

### Debugging Help
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Why does this Python code raise a KeyError?"
```

### Refactoring
```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Refactor this JavaScript function to use async/await instead of callbacks"
```

### Multi-turn Conversation
```bash
ollama run richardyoung/kat-dev-72b:iq3_m
>>> I need to build a REST API in Python
>>> Show me a FastAPI example with authentication
>>> How do I add rate limiting?
```
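You can also feed code in from the shell instead of pasting it. A sketch, assuming stdin handling in recent Ollama versions (the filename `utils.py` is just a placeholder):

```bash
# Pipe a file into the model; stdin is appended to the prompt
cat utils.py | ollama run richardyoung/kat-dev-72b:iq3_m "Find potential bugs in this code:"
```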
## Model Details
### Architecture
- Base Model: KAT-Dev 72B Exp by Kwaipilot
- Parameters: ~72 Billion
- Quantization: GGUF format (IQ2_XXS to IQ4_XS)
- Context Length: Inherited from the base model (see the base model card for specifics)
- Optimization: Code generation and understanding
- Training: Specialized for programming tasks
### Supported Languages
The model excels at:
- Python
- JavaScript/TypeScript
- Java
- C/C++
- Go
- Rust
- And many more!
## Performance Tips
Getting the best results:
- **Choose the right quantization** - IQ3_M is recommended for daily use
- **Use specific prompts** - "Write a Python function to X" works better than "code for X"
- **Provide context** - Share error messages, file structures, or requirements
- **Iterate** - Ask follow-up questions to refine the code
- **GPU acceleration** - Use Metal (Mac) or CUDA (NVIDIA) for faster inference
- **Temperature settings** - Lower (0.1-0.3) for precise code, higher (0.7-0.9) for creative solutions
### Example Ollama Configuration
Edit the Modelfile to add custom parameters:
```
PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
```
Then create the model:
```bash
ollama create my-kat-dev -f modelfiles/kat-dev-72b--iq3_m.Modelfile
```
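Alternatively, you can adjust parameters from inside an interactive session without editing the Modelfile; a quick sketch, assuming a recent Ollama build that supports the `/set parameter` REPL command:

```
>>> /set parameter temperature 0.2
>>> /set parameter num_ctx 8192
```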
## Building Custom Variants
You can modify the included Modelfiles to customize behavior:
```
FROM ./kat-dev-72b-iq3_m.gguf

# System prompt
SYSTEM You are an expert programmer specializing in Python and web development.

# Parameters
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER stop "<|endoftext|>"
```
Then build:
```bash
ollama create my-custom-kat -f custom.Modelfile
```
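Once built, the custom variant runs like any other local model, for example:

```bash
ollama run my-custom-kat "Write unit tests for a Flask login route"
```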
## Known Limitations
- **Large Size** - Even the smallest variant needs 24+ GB of storage
- **RAM Intensive** - Requires significant system memory
- **Inference Speed** - Slower than smaller models (trade-off for quality)
- **English-Focused** - Best performance with English prompts
- **Code-Specialized** - Not optimized for general conversation
## License
Apache 2.0 - Same as the original model. Free for commercial use!
## Acknowledgments
- Original Model: Kwaipilot for creating KAT-Dev 72B
- GGUF Format: Georgi Gerganov for llama.cpp
- Ollama: Ollama team for the amazing runtime
- Community: All the developers testing and providing feedback
## Useful Links
- Original Model: Kwaipilot/KAT-Dev-72B-Exp
- Ollama Registry: richardyoung/kat-dev-72b
- llama.cpp: GitHub
- Ollama Docs: Documentation
- Discussions: Ask questions here!
## Pro Tips
Advanced usage patterns:
### 1. Integration with VS Code
Use with Continue.dev or other coding assistants:
```json
{
  "models": [
    {
      "title": "KAT-Dev 72B",
      "provider": "ollama",
      "model": "richardyoung/kat-dev-72b:iq3_m"
    }
  ]
}
```
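Depending on your Continue.dev version, this snippet typically belongs in the `models` array of `~/.continue/config.json` (newer releases use a YAML config instead); check Continue's documentation for the current schema.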
### 2. API Server Mode
Run as an OpenAI-compatible API:
```bash
ollama serve
# Then use the API at http://localhost:11434
```
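Once the server is running, any OpenAI-style client can talk to it. A minimal sketch using curl, assuming a recent Ollama version that exposes the OpenAI-compatible `/v1/chat/completions` endpoint:

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "richardyoung/kat-dev-72b:iq3_m",
    "messages": [
      {"role": "user", "content": "Write a Python function to validate email addresses"}
    ]
  }'
```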
### 3. Batch Processing
Process multiple files:
```bash
for file in *.py; do
  ollama run richardyoung/kat-dev-72b:iq3_m "Review this code: $(cat "$file")" > "${file}.review"
done
```
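Very large files can exceed the model's context window. One workaround, as a sketch with an arbitrary 8,000-character cutoff, is to truncate each file before sending it:

```bash
ollama run richardyoung/kat-dev-72b:iq3_m "Review this code: $(head -c 8000 "$file")" > "${file}.review"
```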
Quantized with ❤️ by richardyoung
If you find this useful, please ⭐ star the repo and share with other developers!
Format: GGUF | Runtime: Ollama / llama.cpp | Created: October 2025