Inference with llama.cpp + Open WebUI gives repeating `?`
#1 · opened by whoisjeremylam
Is there a specific build of llama.cpp that should be used to support AutoRound?
This is the command:

```
CUDA_VISIBLE_DEVICES=1 \
~/llama.cpp/build/bin/llama-server \
-t 23 \
-m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
--alias Ling-flash \
--no-mmap \
--host 0.0.0.0 \
--port 5000 \
-c 13056 \
-ngl 999 \
-ub 4096 -b 4096
```
llama.cpp build from main:

```
$ git rev-parse --short HEAD
6de8ed751
```
Same here.
Latest llama.cpp (GitHub master), freshly built on Ubuntu + CUDA, using the llama.cpp built-in UI.
It returns repeating '?' no matter what the prompt is.
Otherwise, it works fine with other models.
CPU works fine, but CUDA has issues; we're investigating the root cause.
Confirmed :(. With `-ot exps=CPU` it works as expected.
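
For anyone hitting the same issue, here is a minimal sketch of the workaround, assuming the same model path and flags as the command above. `-ot` (`--override-tensor`) matches tensor names against a pattern and pins them to a backend, so `exps=CPU` should keep the MoE expert tensors on the CPU while the rest stays on the GPU:

```
# Workaround sketch: same invocation as above, with the expert tensors
# kept on the CPU backend via --override-tensor (-ot).
CUDA_VISIBLE_DEVICES=1 \
~/llama.cpp/build/bin/llama-server \
-t 23 \
-m /home/ai/models/Intel/Ling-flash-2.0-gguf-q2ks-mixed-AutoRound/Ling-flash-Q2_K_S.gguf \
--alias Ling-flash \
--no-mmap \
--host 0.0.0.0 \
--port 5000 \
-c 13056 \
-ngl 999 \
-ub 4096 -b 4096 \
-ot exps=CPU
```

This trades some speed for correct output until the CUDA issue is fixed, since the expert layers now run on the CPU.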
