4.5 (non-Air) exposed locally to Claude Code

#24 by giulianocarioca - opened

Hello, can anybody please confirm whether it is possible to fit the entire 4.5 model (non-Air) into a single Blackwell RTX PRO 6000, and if so, what quantization level is necessary?
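For context, here is my rough napkin math. Both figures are my assumptions, so please correct me if they're off: roughly 355B total parameters for the non-Air model, and 96 GB of VRAM on the RTX PRO 6000.

```python
# Back-of-the-envelope weight sizing. Assumptions (not from the model card):
# ~355B total parameters for the non-Air model, 96 GB VRAM on the RTX PRO 6000.
# This counts weights only; KV cache, activations, and runtime overhead
# come on top, so real usage will be noticeably higher.
PARAMS = 355e9
VRAM_GB = 96

for bits in (16, 8, 4, 3, 2):
    weights_gb = PARAMS * bits / 8 / 1e9
    fits = "fits" if weights_gb < VRAM_GB else "does not fit"
    print(f"{bits:>2}-bit weights: ~{weights_gb:6.1f} GB -> {fits} in {VRAM_GB} GB")
```

If that math is right, even 2-bit weights (~89 GB) are borderline before counting KV cache, which is why I'm asking what quantization level would actually be workable.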

Assuming it can be done, what inference service should I use to load the model and expose it? Ollama, vLLM, LM Studio, or even a Python FastAPI/Flask endpoint loading the model manually onto the GPU? See the sketch below for what I mean by the last option.
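To make that last option concrete, this is the kind of minimal endpoint I have in mind. It's an untested sketch: the model id is a placeholder, and it assumes a quantized checkpoint that actually fits in VRAM.

```python
# Minimal sketch of the "manual" FastAPI option using transformers.
# MODEL_ID is a placeholder; substitute the actual quantized checkpoint.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/model-4.5-quantized"  # placeholder, not a real repo id

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # a pre-quantized variant would be needed to fit 96 GB
    device_map="cuda",
)

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: Prompt):
    # Tokenize, generate on the GPU, and return the decoded completion.
    inputs = tokenizer(req.text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"completion": tokenizer.decode(out[0], skip_special_tokens=True)}
```

(For exposing it to Claude Code specifically, I assume I'd ultimately need an API-compatible server rather than a hand-rolled route like this, which is part of why I'm asking which service to pick.)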

Thank you!
