💻 KAT-Dev 72B - GGUF

Enterprise-Grade 72B Coding Model, Optimized for Local Inference


Original Model | Ollama Registry | llama.cpp


📖 What is This?

This is KAT-Dev 72B, a powerful coding model with 72 billion parameters, quantized to GGUF format for efficient local inference. Perfect for developers who want enterprise-grade code assistance running entirely on their own hardware with Ollama or llama.cpp!

✨ Why You'll Love It

  • 💻 Coding-Focused - Optimized specifically for programming tasks
  • 🧠 72B Parameters - Large enough for complex reasoning and refactoring
  • ⚡ Local Inference - Run entirely on your machine, no API calls
  • 🔒 Privacy First - Your code never leaves your computer
  • 🎯 Multiple Quantizations - Choose your speed/quality trade-off
  • 🚀 Ollama Ready - One command to start coding
  • 🔧 llama.cpp Compatible - Works with your favorite tools

🎯 Quick Start

Option 1: Ollama (Easiest!)

Pull and run directly from the Ollama registry:

# Recommended: IQ3_M (best balance)
ollama run richardyoung/kat-dev-72b:iq3_m

# Other variants
ollama run richardyoung/kat-dev-72b:iq4_xs  # Better quality
ollama run richardyoung/kat-dev-72b:iq2_m   # Faster, smaller
ollama run richardyoung/kat-dev-72b:iq2_xxs # Most compact

That's it! Start asking coding questions! 🎉
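
Once the pull finishes, two standard Ollama commands confirm the model is installed and show what is currently loaded in memory:

# List downloaded models and their on-disk sizes
ollama list

# Show models currently loaded into memory
ollama ps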

Option 2: Build from Modelfile

Download this repo and build locally:

# Clone or download the modelfiles
ollama create kat-dev-72b-iq3_m -f modelfiles/kat-dev-72b--iq3_m.Modelfile
ollama run kat-dev-72b-iq3_m

Option 3: llama.cpp

Use with llama.cpp directly:

# Download the GGUF file (replace variant as needed)
huggingface-cli download richardyoung/kat-dev-72b kat-dev-72b-iq3_m.gguf --local-dir ./

# Run with llama.cpp
./llama-cli -m kat-dev-72b-iq3_m.gguf -p "Write a Python function to"
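
With a Metal or CUDA build of llama.cpp you can offload layers to the GPU and widen the context window; the layer count and context size below are illustrative and should be tuned to your hardware:

# Offload layers to the GPU (-ngl) and request an 8K context (-c)
./llama-cli -m kat-dev-72b-iq3_m.gguf -ngl 99 -c 8192 -p "Write a Python function to parse a CSV file"

# Or serve the model over HTTP for local tools (llama-server, default port 8080)
./llama-server -m kat-dev-72b-iq3_m.gguf -ngl 99 -c 8192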

💻 System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| RAM | 32 GB | 64 GB+ |
| Storage | 40 GB free | 50+ GB free |
| CPU | Modern 8-core | 16+ cores |
| GPU | Optional (CPU-only works!) | Metal/CUDA for acceleration |
| OS | macOS, Linux, Windows | Latest versions |

💡 Tip: Larger quantizations (IQ4_XS) need more RAM but produce better code. Smaller ones (IQ2_XXS) are faster but less precise.

🎨 Available Quantizations

Choose the right balance for your needs:

| Quantization | Size | Quality | Speed | RAM Usage | Best For |
|--------------|------|---------|-------|-----------|----------|
| IQ4_XS | 37 GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ~50 GB | Production code, complex refactoring |
| IQ3_M (recommended) | 33 GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ~40 GB | Daily development, best balance |
| IQ2_M | 27 GB | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ~35 GB | Quick prototyping, fast iteration |
| IQ2_XXS | 24 GB | ⭐⭐ | ⭐⭐⭐⭐⭐ | ~30 GB | Testing, very constrained systems |

Variant Details

| Variant | Size | Blob SHA256 |
|---------|------|-------------|
| iq4_xs | 36.98 GB | c4cb9c6e... |
| iq3_m | 33.07 GB | 14d07184... |
| iq2_m | 27.32 GB | cbe26a3c... |
| iq2_xxs | 23.74 GB | a49c7526... |

📚 Usage Examples

Code Generation

ollama run richardyoung/kat-dev-72b:iq3_m "Write a Python function to validate email addresses with regex"

Code Explanation

ollama run richardyoung/kat-dev-72b:iq3_m "Explain this code: def fib(n): return n if n < 2 else fib(n-1) + fib(n-2)"

Debugging Help

ollama run richardyoung/kat-dev-72b:iq3_m "Why does this Python code raise a KeyError?"

Refactoring

ollama run richardyoung/kat-dev-72b:iq3_m "Refactor this JavaScript function to use async/await instead of callbacks"

Multi-turn Conversation

ollama run richardyoung/kat-dev-72b:iq3_m
>>> I need to build a REST API in Python
>>> Show me a FastAPI example with authentication
>>> How do I add rate limiting?

๐Ÿ—๏ธ Model Details


Architecture

  • Base Model: KAT-Dev 72B Exp by Kwaipilot
  • Parameters: ~72 Billion
  • Architecture: Qwen2-family (per the GGUF metadata)
  • Quantization: GGUF format (IQ2_XXS to IQ4_XS)
  • Context Length: Standard (check base model for specifics)
  • Optimization: Code generation and understanding
  • Training: Specialized for programming tasks

Supported Languages

The model excels at:

  • Python
  • JavaScript/TypeScript
  • Java
  • C/C++
  • Go
  • Rust
  • And many more!

⚡ Performance Tips

Getting the best results
  1. Choose the right quantization - IQ3_M is recommended for daily use
  2. Use specific prompts - "Write a Python function to X" works better than "code for X"
  3. Provide context - Share error messages, file structures, or requirements
  4. Iterate - Ask follow-up questions to refine the code
  5. GPU acceleration - Use Metal (Mac) or CUDA (NVIDIA) for faster inference
  6. Temperature settings - Lower (0.1-0.3) for precise code, higher (0.7-0.9) for creative solutions

Example Ollama Configuration

Add these lines to the Modelfile (for example modelfiles/kat-dev-72b--iq3_m.Modelfile):

PARAMETER temperature 0.2
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1

Then build the customized model:

ollama create my-kat-dev -f modelfiles/kat-dev-72b--iq3_m.Modelfile
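
Prefer not to edit Modelfiles? The same options can also be passed per request through Ollama's local HTTP API; the prompt and parameter values below are placeholders, a minimal sketch of the /api/generate endpoint:

# Set sampling options for a single request via the local Ollama API
curl http://localhost:11434/api/generate -d '{
  "model": "richardyoung/kat-dev-72b:iq3_m",
  "prompt": "Write a Python function to validate email addresses",
  "stream": false,
  "options": { "temperature": 0.2, "top_p": 0.9 }
}'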

๐Ÿ”ง Building Custom Variants

You can modify the included Modelfiles to customize behavior:

FROM ./kat-dev-72b-iq3_m.gguf

# System prompt
SYSTEM You are an expert programmer specializing in Python and web development.

# Parameters
PARAMETER temperature 0.2
PARAMETER num_ctx 8192
PARAMETER stop "<|endoftext|>"

Then build:

ollama create my-custom-kat -f custom.Modelfile
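
To verify that your system prompt and parameters were applied, Ollama can print back the stored Modelfile:

# Inspect the Modelfile Ollama stored for the new variant
ollama show my-custom-kat --modelfile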

⚠️ Known Limitations

  • 💾 Large Size - Even the smallest variant needs 24+ GB of storage
  • 🐏 RAM Intensive - Requires significant system memory
  • ⏱️ Inference Speed - Slower than smaller models (trade-off for quality)
  • 🌐 English-Focused - Best performance with English prompts
  • 📝 Code-Specialized - Not optimized for general conversation

📄 License

Apache 2.0 - Same as the original model. Free for commercial use!

๐Ÿ™ Acknowledgments

  • Original Model: Kwaipilot for creating KAT-Dev 72B
  • GGUF Format: Georgi Gerganov for llama.cpp
  • Ollama: Ollama team for the amazing runtime
  • Community: All the developers testing and providing feedback

🔗 Useful Links

🎮 Pro Tips

Advanced usage patterns

1. Integration with VS Code

Use with Continue.dev or other coding assistants:

{
  "models": [
    {
      "title": "KAT-Dev 72B",
      "provider": "ollama",
      "model": "richardyoung/kat-dev-72b:iq3_m"
    }
  ]
}

2. API Server Mode

Run as an OpenAI-compatible API:

ollama serve
# Then use the API at http://localhost:11434
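
Once the server is running, any OpenAI-style client can point at it. A minimal curl sketch against the OpenAI-compatible chat endpoint (the prompt is just a placeholder):

# OpenAI-compatible chat completions served by Ollama
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "richardyoung/kat-dev-72b:iq3_m",
    "messages": [{"role": "user", "content": "Write a unit test for a FastAPI route"}]
  }'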

3. Batch Processing

Process multiple files:

for file in *.py; do
  # Quote "$file" so filenames containing spaces don't break the loop
  ollama run richardyoung/kat-dev-72b:iq3_m "Review this code: $(cat "$file")" > "${file}.review"
done

Quantized with ❤️ by richardyoung

If you find this useful, please ⭐ star the repo and share with other developers!

Format: GGUF | Runtime: Ollama / llama.cpp | Created: October 2025
