4.5 (non-Air) exposed locally to Claude Code

#24 by giulianocarioca - opened

Hello, can anybody please confirm whether it is possible to fit the entire 4.5 model (non-Air) into a single Blackwell RTX PRO 6000, and if so, what quantization level is necessary?
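For context, here is my rough napkin math. Both figures are my assumptions, so please correct me if they're off: roughly 355B total parameters for the non-Air model, and 96 GB of VRAM on the RTX PRO 6000.

```python
# Back-of-the-envelope weight sizing. Assumptions (not from the model card):
# ~355B total parameters for the non-Air model, 96 GB VRAM on the RTX PRO 6000.
# This counts weights only; KV cache, activations, and runtime overhead
# come on top, so real usage will be noticeably higher.
PARAMS = 355e9
VRAM_GB = 96

for bits in (16, 8, 4, 3, 2):
    weights_gb = PARAMS * bits / 8 / 1e9
    fits = "fits" if weights_gb < VRAM_GB else "does not fit"
    print(f"{bits:>2}-bit weights: ~{weights_gb:6.1f} GB -> {fits} in {VRAM_GB} GB")
```

If that math is right, even 2-bit weights (~89 GB) are borderline before counting KV cache, which is why I'm asking what quantization level would actually be workable.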

Assuming it can be done, what inference service should I use to load the model and expose it? Ollama, vLLM, LM Studio, or even a Python FastAPI/Flask endpoint loading the model manually onto the GPU? See the sketch below for what I mean by the last option.
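To make that last option concrete, this is the kind of minimal endpoint I have in mind. It's an untested sketch: the model id is a placeholder, and it assumes a quantized checkpoint that actually fits in VRAM.

```python
# Minimal sketch of the "manual" FastAPI option using transformers.
# MODEL_ID is a placeholder; substitute the actual quantized checkpoint.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/model-4.5-quantized"  # placeholder, not a real repo id

app = FastAPI()
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # a pre-quantized variant would be needed to fit 96 GB
    device_map="cuda",
)

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: Prompt):
    # Tokenize, generate on the GPU, and return the decoded completion.
    inputs = tokenizer(req.text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"completion": tokenizer.decode(out[0], skip_special_tokens=True)}
```

(For exposing it to Claude Code specifically, I assume I'd ultimately need an API-compatible server rather than a hand-rolled route like this, which is part of why I'm asking which service to pick.)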

Thank you!
