Error when running on 8x A6000
I'm running vLLM on 8x A6000 GPUs, using the example code from the model card as-is, and it fails immediately.
The error is:
ValueError: The output_size of gate's and up's weight = 96 is not divisible by weight quantization block_n = 128.
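
If I'm reading the error right, this looks like a tensor-parallel sharding problem: with TP=8, the gate/up projection's output dimension is split across the 8 GPUs, and the per-GPU shard (96) is no longer a multiple of the FP8 weight-quantization block size (128). A quick sanity check of the arithmetic (the full intermediate size of 768 below is inferred from 96 × 8, not read from the model config, so treat it as a guess):

```python
# Numbers inferred from the error message (per-rank size 96 at TP=8), not read
# from the actual model config -- intermediate_size = 96 * 8 is an assumption.
intermediate_size = 96 * 8  # 768
block_n = 128               # weight-quantization block size from the error

for tp in (8, 4, 2, 1):
    shard = intermediate_size // tp  # per-rank output_size after TP sharding
    print(f"tp={tp}: per-rank output_size={shard}, "
          f"divisible by {block_n}? {shard % block_n == 0}")
# tp=8 -> 96 (no), tp=4 -> 192 (no), tp=2 -> 384 (yes), tp=1 -> 768 (yes)
```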
Besides this error, there are also some warnings:
WARNING 10-31 07:52:25 [multiproc_executor.py:720] Reducing Torch parallelism from 255 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
WARNING 10-31 07:52:32 [symm_mem.py:58] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
WARNING 10-31 07:52:32 [custom_all_reduce.py:144] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 10-31 07:52:33 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(Worker_TP0 pid=2060306) WARNING 10-31 07:52:38 [fp8.py:457] Failed to import DeepGemm kernels.
(Worker_TP0 pid=2060306) WARNING 10-31 07:52:38 [fp8.py:480] CutlassBlockScaledGroupedGemm not supported on the current platform.
Has anyone run into this? Any guidance would be appreciated.
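
Based on the arithmetic above, the only workaround I can think of is a smaller tensor_parallel_size combined with pipeline parallelism so all 8 GPUs are still used. A sketch of what I plan to try (untested; "MODEL_NAME" is a placeholder, not the actual model id from the model card, and whether TP=2 works for this model is a guess):

```python
from vllm import LLM

# Untested workaround sketch: keep intermediate_size / tensor_parallel_size a
# multiple of block_n (128), and cover all 8 GPUs with pipeline parallelism.
# "MODEL_NAME" is a placeholder, not the actual model id from the model card.
llm = LLM(
    model="MODEL_NAME",
    tensor_parallel_size=2,    # 768 / 2 = 384, a multiple of 128 (see above)
    pipeline_parallel_size=4,  # 2 x 4 = 8 GPUs total
)
```

If it's an MoE model, enable_expert_parallel=True might be another option, since expert parallelism avoids splitting each expert's gate/up weights across ranks, but I haven't verified that either.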