Update README.md

Files changed (1)

README.md +0 -38

@@ -294,7 +294,6 @@ Our INT4 model is only optimized for batch size 1, so expect some slowdown with
 |----------------------------------|----------------|----------------------------|
 |                                  | Phi-4 mini-Ins | phi4-mini-INT4             |
 | latency (batch_size=1)           | 2.46s          | 2.2s (1.12x speedup)       |
-| serving (num_prompts=1)          | 0.87 req/s     | 1.05 req/s (1.20x speedup) |
 ## Results (H100 machine)
 | Benchmark (Latency)              |                |                            |
@@ -333,43 +332,6 @@ python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model
 VLLM_DISABLE_COMPILE_CACHE=1 python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-INT4 --batch-size 1
 ```
-## benchmark_serving
-We benchmarked the throughput in a serving environment.
-Download sharegpt dataset:
-```Shell
-wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
-```
-Other datasets can be found in: https://github.com/vllm-project/vllm/tree/main/benchmarks
-Note: you can change the number of prompts to be benchmarked with `--num-prompts` argument for `benchmark_serving` script.
-### baseline
-Server:
-```Shell
-vllm serve microsoft/Phi-4-mini-instruct --tokenizer microsoft/Phi-4-mini-instruct -O3
-```
-Client:
-```Shell
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model microsoft/Phi-4-mini-instruct --num-prompts 1
-```
-### INT4
-Server:
-```Shell
-VLLM_DISABLE_COMPILE_CACHE=1 vllm serve pytorch/Phi-4-mini-instruct-INT4 --tokenizer microsoft/Phi-4-mini-instruct -O3 --pt-load-map-location cuda:0
-```
-Client:
-```Shell
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-INT4 --num-prompts 1
-```
 </details>
 # Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization
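The speedup figure kept in the latency table (1.12x) can be checked by hand. The sketch below assumes speedup means baseline latency divided by INT4 latency. The helper name `speedup` is hypothetical and is not part of vLLM or TorchAO.

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Hypothetical helper: how many times faster the optimized run is (baseline / optimized)."""
    if optimized_s <= 0:
        raise ValueError("optimized latency must be positive")
    return baseline_s / optimized_s


# Latency numbers from the table above (batch_size=1), in seconds
baseline_latency = 2.46  # Phi-4 mini-instruct baseline
int4_latency = 2.2       # phi4-mini-INT4

ratio = speedup(baseline_latency, int4_latency)
print(f"{ratio:.2f}x speedup")  # prints "1.12x speedup"
```

Because lower latency is better, the baseline goes in the numerator, so a value above 1 means the INT4 model is faster.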