llama-cpp-python Pre-built Windows Wheels

Stop fighting with Visual Studio and CUDA Toolkit. Just download and run.

Pre-compiled llama-cpp-python wheels for Windows across CUDA versions and GPU architectures.

Quick Start

  1. Find your GPU in the compatibility list below
  2. Download the matching wheel from GitHub Releases (or locate your card in the README table)
  3. Install: pip install <downloaded-wheel-file>.whl
  4. Run your GGUF models immediately

Platform Support:
✅ Windows 10/11 64-bit (available now; building from source is most painful on Windows)
🔜 Linux support coming soon

Supported GPUs

RTX 50 Series (Blackwell - sm_100)

RTX 5090, 5080, 5070 Ti, 5070, 5060 Ti, 5060, RTX PRO 6000 Blackwell, B100, B200, GB200

RTX 40 Series (Ada Lovelace - sm_89)

RTX 4090, 4080, 4070 Ti, 4070, 4060 Ti, 4060, RTX 6000 Ada, RTX 5000 Ada, L40, L40S

RTX 30 Series (Ampere - sm_86)

RTX 3090, 3090 Ti, 3080 Ti, 3080, 3070 Ti, 3070, 3060 Ti, 3060, RTX A6000, A5000, A4000

RTX 20 Series & GTX 16 Series (Turing - sm_75)

RTX 2080 Ti, 2080 Super, 2070 Super, 2060, GTX 1660 Ti, 1660 Super, 1650, Quadro RTX 8000, Tesla T4

View full compatibility table →
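The generation-to-architecture mapping above can be sketched as a small lookup, handy as a sanity check before downloading. The entries below are only a subset of the cards listed above:

```python
# Map GPU models from the lists above to their CUDA architecture tag.
GPU_ARCH = {
    "RTX 5090": "sm_100",    "RTX 5080": "sm_100",       # Blackwell
    "RTX 4090": "sm_89",     "RTX 4080": "sm_89",        # Ada Lovelace
    "RTX 3090": "sm_86",     "RTX 3060": "sm_86",        # Ampere
    "RTX 2080 Ti": "sm_75",  "GTX 1660 Super": "sm_75",  # Turing
}

def arch_for(gpu: str) -> str:
    """Return the CUDA architecture tag for a known GPU (KeyError if unknown)."""
    return GPU_ARCH[gpu]

print(arch_for("RTX 4090"))  # sm_89
```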

Usage Example

from llama_cpp import Llama

# Load your GGUF model with GPU acceleration
llm = Llama(
    model_path="./models/llama-3-8b.Q4_K_M.gguf",
    n_gpu_layers=-1,  # Offload all layers to GPU
    n_ctx=2048        # Context window
)

# Generate text
response = llm(
    "Write a haiku about artificial intelligence:",
    max_tokens=50,
    temperature=0.7
)

print(response['choices'][0]['text'])

Download Wheels

➡️ Download from GitHub Releases

Available Configurations:

  • CUDA Versions: 11.8, 12.1, 13.0
  • Python Versions: 3.10, 3.11, 3.12, 3.13
  • Architectures: sm_75 (Turing), sm_86 (Ampere), sm_89 (Ada), sm_100 (Blackwell)
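Going by the example filename in the Installation section below (`llama_cpp_python-0.3.16+cuda13.0.sm89.ada-cp312-cp312-win_amd64.whl`), a wheel name for a given configuration can be assembled like this. The pattern is inferred from that single example, so treat the codename suffix (`ada`, etc.) as an assumption and verify against the actual release assets:

```python
def wheel_name(version: str, cuda: str, sm: str, codename: str, py: str) -> str:
    """Assemble a release wheel filename (pattern inferred from one example)."""
    cp = f"cp{py.replace('.', '')}"        # "3.12" -> "cp312"
    local = f"cuda{cuda}.{sm}.{codename}"  # e.g. "cuda13.0.sm89.ada"
    return f"llama_cpp_python-{version}+{local}-{cp}-{cp}-win_amd64.whl"

print(wheel_name("0.3.16", "13.0", "sm89", "ada", "3.12"))
# llama_cpp_python-0.3.16+cuda13.0.sm89.ada-cp312-cp312-win_amd64.whl
```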

What This Solves

❌ No Visual Studio required
❌ No CUDA Toolkit installation needed
❌ No compilation errors
❌ No "No CUDA toolset found" issues
✅ Works immediately with GGUF models
✅ Full GPU acceleration out of the box

Installation

Download the wheel matching your configuration and install:

# Example for RTX 4090 with Python 3.12 and CUDA 13.0
pip install llama_cpp_python-0.3.16+cuda13.0.sm89.ada-cp312-cp312-win_amd64.whl

Build Details

All wheels are built with:

  • Visual Studio 2019/2022 Build Tools
  • Official NVIDIA CUDA Toolkits (11.8, 12.1, 13.0)
  • Optimized CMAKE_CUDA_ARCHITECTURES for each GPU generation
  • Built from official llama-cpp-python source

Contributing

Need a different configuration?

Open an issue on GitHub with:

  • OS (Windows/Linux/macOS)
  • Python version
  • CUDA version
  • GPU model

Resources

License

MIT License - Free to use for any purpose

Wheels are built from llama-cpp-python (MIT License)
