# vLLM V1 Disaggregated Serving with Mooncake Store and LMCache
## Overview
vLLM V1 has been released with support for prefill/decode (PD) disaggregation; the detailed design document can be found here. LMCache promptly implemented the corresponding connector to support storing, transmitting, and loading KVCache, enabling collaborative operation between prefill and decode nodes. Mooncake, as LMCache's backend storage engine, has undergone extensive optimization for usability, performance, and stability. This document explains how to deploy a PD disaggregated serving demo using LMCache + Mooncake.
## Deployment
- First, prepare two GPU-equipped machines, which we will refer to as Machine A and Machine B. Install vLLM, Mooncake, and LMCache on both machines; for installation instructions, refer to the official documentation of each project.
- Start the Mooncake Master node on Machine A:

```bash
mooncake_master -port 50052 -max_threads 64 -metrics_port 9004 \
    --enable_http_metadata_server=true \
    --http_metadata_server_host=0.0.0.0 \
    --http_metadata_server_port=8080
```
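Before launching the vLLM instances, it can save debugging time to confirm that the master's ports are reachable from both machines. The sketch below uses Python's standard `socket` module; `port_open` is a helper defined here, not part of Mooncake, and the IP address is a placeholder.

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# After starting mooncake_master on Machine A, each service port should
# accept connections (replace "10.0.0.1" with the real IP of Machine A):
#   port_open("10.0.0.1", 50052)  # master RPC port
#   port_open("10.0.0.1", 9004)   # metrics port
#   port_open("10.0.0.1", 8080)   # HTTP metadata server
```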
- Launch the Decoder instance on Machine A.
- Modify the `vllm/examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` file:
```diff
diff --git a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
index 831ef0bb5..a2ff0744c 100644
--- a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
+++ b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
 elif [[ $1 == "decoder" ]]; then
     # Decoder listens on port 8200
-    decode_config_file=$SCRIPT_DIR/configs/lmcache-decoder-config.yaml
+    decode_config_file=$SCRIPT_DIR/configs/mooncake-decoder-config.yaml
     UCX_TLS=cuda_ipc,cuda_copy,tcp \
     LMCACHE_CONFIG_FILE=$decode_config_file \
     LMCACHE_USE_EXPERIMENTAL=True \
     VLLM_ENABLE_V1_MULTIPROCESSING=1 \
     VLLM_WORKER_MULTIPROC_METHOD=spawn \
     CUDA_VISIBLE_DEVICES=1 \
```
- Add the `mooncake-decoder-config.yaml` file:
```yaml
chunk_size: 256
remote_url: "mooncakestore://{IP of Machine A}:50052/"
remote_serde: "naive"
local_cpu: False
max_local_cpu_size: 100
extra_config:
  local_hostname: "{IP of Machine A}"
  metadata_server: "http://{IP of Machine A}:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0"  # multiple RDMA devices can be given as a comma-separated list
  master_server_address: "{IP of Machine A}:50052"
  global_segment_size: 32212254720  # 30GB
  local_buffer_size: 1073741824     # 1GB
  transfer_timeout: 1
  save_chunk_meta: False
```
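The raw byte values in `global_segment_size` and `local_buffer_size` are easy to mistype. A quick arithmetic check (a sketch, assuming the GB annotations in the config mean GiB, i.e. powers of two) confirms they match:

```python
# Sanity-check the byte sizes used in the config against their comments.
GIB = 1024 ** 3
global_segment_size = 32212254720   # from the config; memory registered to the KVCache pool
local_buffer_size = 1073741824      # from the config; local transfer buffer

assert global_segment_size == 30 * GIB   # 30 GiB
assert local_buffer_size == 1 * GIB      # 1 GiB
```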
- Launch the Decoder instance using the command:

```bash
bash disagg_vllm_launcher.sh decoder Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
```
- Launch the Prefiller instance on Machine B.
- Modify the `vllm/examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh` file:
```diff
diff --git a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
index 831ef0bb5..9e5a3f044 100644
--- a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
+++ b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
@@ -18,12 +18,14 @@ fi
 if [[ $1 == "prefiller" ]]; then
     # Prefiller listens on port 8100
-    prefill_config_file=$SCRIPT_DIR/configs/lmcache-prefiller-config.yaml
+    prefill_config_file=$SCRIPT_DIR/configs/mooncake-prefiller-config.yaml
```
- Add the `mooncake-prefiller-config.yaml` file:
```yaml
chunk_size: 256
remote_url: "mooncakestore://{IP of Machine A}:50052/"
remote_serde: "naive"
local_cpu: False
max_local_cpu_size: 100
extra_config:
  local_hostname: "{IP of Machine B}"
  metadata_server: "http://{IP of Machine A}:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0"  # multiple RDMA devices can be given as a comma-separated list
  master_server_address: "{IP of Machine A}:50052"
  global_segment_size: 32212254720  # 30GB
  local_buffer_size: 1073741824     # 1GB
  transfer_timeout: 1
  save_chunk_meta: False
```
- Launch the Prefiller instance using the command:

```bash
bash disagg_vllm_launcher.sh prefiller Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
```
- Prepare the router (`disagg_proxy_server`). We use the `disagg_proxy_server` provided by LMCache. According to LMCache/LMCache#1342, when using Mooncake Store as the backend, you need to comment out `wait_decode_kv_ready(req_id)` in the proxy code.
- Launch the `disagg_proxy_server` using the command:

```bash
python3 disagg_proxy_server.py --host localhost --port 9000 \
    --prefiller-host IP_of_Machine_B --prefiller-port 8100 \
    --decoder-host IP_of_Machine_A --decoder-port 8200
```
- Now we can send requests to the `disagg_proxy_server` to test PD disaggregated serving.
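As a quick smoke test, a completion request can be posted to the proxy, assuming it forwards OpenAI-compatible requests as served by vLLM. The sketch below only builds the JSON body; the endpoint path `/v1/completions` and the prompt text are illustrative assumptions.

```python
import json

# Build a minimal OpenAI-style completion request for the proxy.
# The model name matches the launch commands above.
payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
    "prompt": "Explain the benefit of prefill/decode disaggregation in one sentence.",
    "max_tokens": 64,
    "temperature": 0.0,
}
body = json.dumps(payload)

# With the proxy running, POST `body` to http://localhost:9000/v1/completions
# with header "Content-Type: application/json" (e.g. via curl or requests).
```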
## Additional Resources
- Mooncake x LMCache: Unite to Pioneer KVCache-Centric LLM Serving System
- Using Mooncake in LMCache
- Using LMCache in vLLM
## Benchmark Results

We evaluated the current implementation on two A10 servers, comparing a 1P1D configuration against two regular (non-disaggregated) instances. P/D disaggregation achieves roughly 30% lower ITL while maintaining comparable total throughput. This aligns with the findings of the Mooncake paper, which showed that P/D disaggregation is effective in reducing TBT/ITL at similar throughput, or, conversely, in enabling higher throughput under stricter ITL/TBT SLOs. We expect even greater benefits in larger clusters, where increasing the numbers of prefill and decode nodes (x and y in xPyD configurations) offers more scheduling flexibility and better resource efficiency.
### Traffic Request Rate: 1.0
- model: Qwen2.5-7B-Instruct-GPTQ-Int4
- TP: 4
- random_input_len=8192, random_output_len=512
- num prompt=50
| Configuration | Output Token Throughput (tok/s) | Mean E2E Latency (ms) | Total Token Throughput (tok/s) | Mean TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|
| 1P1D | 407.59 | 3413.86 | 7084.46 | 732.54 | 2952.57 | 7.23 | 10.76 |
| 2 Regular | 427.65 | 4586.54 | 7433.27 | 767.18 | 1264.88 | 10.30 | 12.73 |
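The "roughly 30% lower ITL" figure can be reproduced directly from the mean ITL column of the table above:

```python
# Percent reduction in mean inter-token latency, 1P1D vs. two regular instances.
itl_1p1d = 7.23      # ms, 1P1D mean ITL from the table
itl_regular = 10.30  # ms, 2 Regular mean ITL from the table

reduction = (itl_regular - itl_1p1d) / itl_regular
print(f"{reduction:.1%}")  # → 29.8%
```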
### Traffic Request Rate: 4.0
- model: Qwen2.5-7B-Instruct-GPTQ-Int4
- TP: 2
- random_input_len=2048, random_output_len=512
- num prompt=200
| Configuration | Output Token Throughput (tok/s) | Mean E2E Latency (ms) | Total Token Throughput (tok/s) | Mean TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|
| 1P1D | 1215.17 | 11519.24 | 6161.43 | 1111.94 | 2725.89 | 17.06 | 19.72 |
| 2 Regular | 1223.03 | 11683.15 | 6201.29 | 310.01 | 720.91 | 25.74 | 294.89 |