Error when running on 8x A6000
I'm running vLLM on 8x A6000 GPUs, using the example code from the model card as-is, and it fails immediately.
The error is:
ValueError: The output_size of gate's and up's weight = 96 is not divisible by weight quantization block_n = 128.
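
If I'm reading the error right, this looks like a tensor-parallel sharding problem: with TP=8, the gate/up projection's output dimension is split across the 8 GPUs, and the per-GPU shard (96) is no longer a multiple of the FP8 weight-quantization block size (128). A quick sanity check of the arithmetic (the full intermediate size of 768 below is inferred from 96 × 8, not read from the model config, so treat it as a guess):

```python
# Numbers inferred from the error message (per-rank size 96 at TP=8), not read
# from the actual model config -- intermediate_size = 96 * 8 is an assumption.
intermediate_size = 96 * 8  # 768
block_n = 128               # weight-quantization block size from the error

for tp in (8, 4, 2, 1):
    shard = intermediate_size // tp  # per-rank output_size after TP sharding
    print(f"tp={tp}: per-rank output_size={shard}, "
          f"divisible by {block_n}? {shard % block_n == 0}")
# tp=8 -> 96 (no), tp=4 -> 192 (no), tp=2 -> 384 (yes), tp=1 -> 768 (yes)
```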
Besides this error, there are also some warnings:
WARNING 10-31 07:52:25 [multiproc_executor.py:720] Reducing Torch parallelism from 255 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
WARNING 10-31 07:52:32 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 10-31 07:52:32 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 10-31 07:52:33 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(Worker_TP0 pid=2060306) WARNING 10-31 07:52:38 [fp8.py:457] Failed to import DeepGemm kernels.
(Worker_TP0 pid=2060306) WARNING 10-31 07:52:38 [fp8.py:480] CutlassBlockScaledGroupedGemm not supported on the current platform.
Has anyone run into this? Any guidance would be appreciated.
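
Based on the arithmetic above, the only workaround I can think of is a smaller tensor_parallel_size combined with pipeline parallelism so all 8 GPUs are still used. A sketch of what I plan to try (untested; "MODEL_NAME" is a placeholder, not the actual model id from the model card, and whether TP=2 works for this model is a guess):

```python
from vllm import LLM

# Untested workaround sketch: keep intermediate_size / tensor_parallel_size a
# multiple of block_n (128), and cover all 8 GPUs with pipeline parallelism.
# "MODEL_NAME" is a placeholder, not the actual model id from the model card.
llm = LLM(
    model="MODEL_NAME",
    tensor_parallel_size=2,    # 768 / 2 = 384, a multiple of 128 (see above)
    pipeline_parallel_size=4,  # 2 x 4 = 8 GPUs total
)
```

If it's an MoE model, enable_expert_parallel=True might be another option, since expert parallelism avoids splitting each expert's gate/up weights across ranks, but I haven't verified that either.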