💻 KAT-Dev 72B - GGUF

Enterprise-Grade 72B Coding Model, Optimized for Local Inference


Original Model | Ollama Registry | llama.cpp


📖 What is This?

This is KAT-Dev 72B, a powerful coding model with 72 billion parameters, quantized to GGUF format for efficient local inference. Perfect for developers who want enterprise-grade code assistance running entirely on their own hardware with Ollama or llama.cpp!

✨ Why You'll Love It

  • 💻 Coding-Focused - Optimized specifically for programming tasks
  • 🧠 72B Parameters - Large enough for complex reasoning and refactoring
  • ⚡ Local Inference - Run entirely on your machine, no API calls
  • 🔒 Privacy First - Your code never leaves your computer
  • 🎯 Multiple Quantizations - Choose your speed/quality trade-off
  • 🚀 Ollama Ready - One command to start coding
  • 🔧 llama.cpp Compatible - Works with your favorite tools

🎯 Quick Start

Option 1: Ollama (Easiest!)

Pull and run directly from the Ollama registry:

# Recommended: IQ3_M (best balance)
ollama run richardyoung/kat-dev-72b:iq3_m

# Other variants
ollama run richardyoung/kat-dev-72b:iq4_xs  # Better quality
ollama run richardyoung/kat-dev-72b:iq2_m   # Faster, smaller
ollama run richardyoung/kat-dev-72b:iq2_xxs # Most compact

That's it! Start asking coding questions! 🎉
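
Once the pull finishes, two standard Ollama commands confirm the model is installed and show what is currently loaded in memory:

# List downloaded models and their on-disk sizes
ollama list

# Show models currently loaded into memory
ollama ps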

Option 2: Build from Modelfile

Download this repo and build locally:

# Clone or download the modelfiles
ollama create kat-dev-72b-iq3_m -f modelfiles/kat-dev-72b--iq3_m.Modelfile
ollama run kat-dev-72b-iq3_m

Option 3: llama.cpp

Use with llama.cpp directly:

# Download the GGUF file (replace variant as needed)
huggingface-cli download richardyoung/kat-dev-72b kat-dev-72b-iq3_m.gguf --local-dir ./

# Run with llama.cpp
./llama-cli -m kat-dev-72b-iq3_m.gguf -p "Write a Python function to"
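
With a Metal or CUDA build of llama.cpp you can offload layers to the GPU and widen the context window; the layer count and context size below are illustrative and should be tuned to your hardware:

# Offload layers to the GPU (-ngl) and request an 8K context (-c)
./llama-cli -m kat-dev-72b-iq3_m.gguf -ngl 99 -c 8192 -p "Write a Python function to parse a CSV file"

# Or serve the model over HTTP for local tools (llama-server, default port 8080)
./llama-server -m kat-dev-72b-iq3_m.gguf -ngl 99 -c 8192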

💻 System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| RAM | 32 GB | 64 GB+ |
| Storage | 40 GB free | 50+ GB free |
| CPU | Modern 8-core | 16+ cores |
| GPU | Optional (CPU-only works!) | Metal/CUDA for acceleration |
| OS | macOS, Linux, Windows | Latest versions |

💡 Tip: Larger quantizations (IQ4_XS) need more RAM but produce better code. Smaller ones (IQ2_XXS) are faster but less precise.

🎨 Available Quantizations

Choose the right balance for your needs:

| Quantization | Size | Quality | Speed | RAM Usage | Best For |
|--------------|------|---------|-------|-----------|----------|
| IQ4_XS | 37 GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ~50 GB | Production code, complex refactoring |
| IQ3_M (recommended) | 33 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~40 GB | Daily development, best balance |
| IQ2_M | 27 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~35 GB | Quick prototyping, fast iteration |
| IQ2_XXS | 24 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~30 GB | Testing, very constrained systems |

Variant Details

| Variant | Size | Blob SHA256 |
|---------|------|-------------|
| iq4_xs | 36.98 GB | c4cb9c6e... |
| iq3_m | 33.07 GB | 14d07184... |
| iq2_m | 27.32 GB | cbe26a3c... |
| iq2_xxs | 23.74 GB | a49c7526... |

📚 Usage Examples

Code Generation

ollama run richardyoung/kat-dev-72b:iq3_m "Write a Python function to validate email addresses with regex"

Code Explanation

ollama run richardyoung/kat-dev-72b:iq3_m "Explain this code: def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"

Debugging Help

ollama run richardyoung/kat-dev-72b:iq3_m "Why does this Python code raise a KeyError?"

Refactoring

ollama run richardyoung/kat-dev-72b:iq3_m "Refactor this JavaScript function to use async/await instead of callbacks"

Multi-turn Conversation

ollama run richardyoung/kat-dev-72b:iq3_m
>>> I need to build a REST API in Python
>>> Show me a FastAPI example with authentication
>>> How do I add rate limiting?

๐Ÿ—๏ธ Model Details


Architecture

  • Base Model: KAT-Dev 72B Exp by Kwaipilot
  • Parameters: ~72 Billion
  • Architecture: Qwen2-family (per the GGUF metadata)
  • Quantization: GGUF format (IQ2_XXS to IQ4_XS)
  • Context Length: Standard (check base model for specifics)
  • Optimization: Code generation and understanding
  • Training: Specialized for programming tasks

Supported Languages

The model excels at:

  • Python
  • JavaScript/TypeScript
  • Java
  • C/C++
  • Go
  • Rust
  • And many more!

⚡ Performance Tips

Getting the best results
  1. Choose the right quantization - IQ3_M is recommended for daily use
  2. Use specific prompts - "Write a Python function to X" works better than "code for X"
  3. Provide context - Share error messages, file structures, or requirements
  4. Iterate - Ask follow-up questions to refine the code
  5. GPU acceleration - Use Metal (Mac) or CUDA (NVIDIA) for faster inference
  6. Temperature settings - Lower (0.1-0.3) for precise code, higher (0.7-0.9) for creative solutions

Example Ollama Configuration

Add these lines to the Modelfile (for example modelfiles/kat-dev-72b--iq3_m.Modelfile):

PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1

Then build the customized model:

ollama create my-kat-dev -f modelfiles/kat-dev-72b--iq3_m.Modelfile
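
Prefer not to edit Modelfiles? The same options can also be passed per request through Ollama's local HTTP API; the prompt and parameter values below are placeholders, a minimal sketch of the /api/generate endpoint:

# Set sampling options for a single request via the local Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "richardyoung/kat-dev-72b:iq3_m",
  "prompt": "Write a Python function to validate email addresses",
  "stream": false,
  "options": { "temperature": 0.2, "top_p": 0.9 }
}'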

๐Ÿ”ง Building Custom Variants

You can modify the included Modelfiles to customize behavior:

FROM ./kat-dev-72b-iq3_m.gguf

# System prompt
SYSTEM You are an expert programmer specializing in Python and web development.

# Parameters
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER stop "<|endoftext|>"

Then build:

ollama create my-custom-kat -f custom.Modelfile
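
To verify that your system prompt and parameters were applied, Ollama can print back the stored Modelfile:

# Inspect the Modelfile Ollama stored for the new variant
ollama show my-custom-kat --modelfile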

⚠️ Known Limitations

  • 💾 Large Size - Even the smallest variant needs 24+ GB of storage
  • 🐏 RAM Intensive - Requires significant system memory
  • ⏱️ Inference Speed - Slower than smaller models (trade-off for quality)
  • 🌐 English-Focused - Best performance with English prompts
  • 📝 Code-Specialized - Not optimized for general conversation

📄 License

Apache 2.0 - Same as the original model. Free for commercial use!

๐Ÿ™ Acknowledgments

  • Original Model: Kwaipilot for creating KAT-Dev 72B
  • GGUF Format: Georgi Gerganov for llama.cpp
  • Ollama: Ollama team for the amazing runtime
  • Community: All the developers testing and providing feedback

🔗 Useful Links

🎮 Pro Tips

Advanced usage patterns

1. Integration with VS Code

Use with Continue.dev or other coding assistants:

{
  "models": [
    {
      "title": "KAT-Dev 72B",
      "provider": "ollama",
      "model": "richardyoung/kat-dev-72b:iq3_m"
    }
  ]
}

2. API Server Mode

Run as an OpenAI-compatible API:

ollama serve
# Then use the API at http://localhost:11434
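
Once the server is running, any OpenAI-style client can point at it. A minimal curl sketch against the OpenAI-compatible chat endpoint (the prompt is just a placeholder):

# OpenAI-compatible chat completions served by Ollama
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "richardyoung/kat-dev-72b:iq3_m",
    "messages": [{"role": "user", "content": "Write a unit test for a FastAPI route"}]
  }'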

3. Batch Processing

Process multiple files:

for file in *.py; do
  # Quote "$file" so filenames containing spaces don't break the loop
  ollama run richardyoung/kat-dev-72b:iq3_m "Review this code: $(cat "$file")" > "${file}.review"
done

Quantized with ❤️ by richardyoung

If you find this useful, please ⭐ star the repo and share with other developers!

Format: GGUF | Runtime: Ollama / llama.cpp | Created: October 2025
