zhanghanxiao committed · Commit 8fbf888 (verified) · 1 Parent(s): 67c3090

Update README.md

Files changed (1): README.md (+1 −1)
README.md CHANGED
@@ -231,7 +231,7 @@ Here is the example to deploy the model with multiple GPU nodes, where the master
 # step 1. start ray on all nodes
 
 # step 2. start vllm server only on node 0:
-vllm serve $MODEL_PATH --port $PORT --served-model-name my_model --trust-remote-code --tensor-parallel-size 8 --pipeline-parallel-size 4 --gpu-memory-utilization 0.85
+vllm serve $MODEL_PATH --port $PORT --served-model-name my_model --trust-remote-code --tensor-parallel-size 32 --gpu-memory-utilization 0.85
 
 
 # This is only an example, please adjust arguments according to your actual environment.
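Note that both layouts span the same 32 GPUs: the old configuration splits them as 8 tensor-parallel × 4 pipeline-parallel, while the new one uses pure tensor parallelism of degree 32. The diff does not spell out step 1 ("start ray on all nodes"); a minimal sketch using Ray's standard cluster CLI might look like the following, where the head-node address and port are placeholders, not values from the README:

```shell
# On the head node (node 0): start the Ray head process.
# Port 6379 is Ray's default GCS port; adjust to your environment.
ray start --head --port=6379

# On each worker node: join the cluster.
# HEAD_NODE_IP is a placeholder for the head node's actual address.
ray start --address=HEAD_NODE_IP:6379
```

Once all nodes have joined, `vllm serve` on node 0 (step 2) can schedule its tensor-parallel workers across the whole Ray cluster.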