vLLM V1 Disaggregated Serving with Mooncake Store and LMCache

https://kvcache-ai.github.io/Mooncake/getting_started/examples/vllm-integration/vllmv1-lmcache-integration.html

Overview

vLLM v1 has been released with support for prefill/decode (PD) disaggregation; the detailed design document can be found here. LMCache promptly implemented the corresponding connector to support storing, transmitting, and loading the KV cache, enabling the prefill and decode nodes to work together. Mooncake, as LMCache's backend storage engine, has undergone extensive optimization for usability, performance, and stability. This document explains how to deploy a PD disaggregated serving demo using LMCache + Mooncake.

Deployment

  1. First, prepare two GPU-equipped machines, which we will refer to as Machine A and Machine B. Install vLLM, Mooncake, and LMCache on both machines. For installation instructions, please refer to the official documentation of each repository.
  2. Start the Mooncake Master node on Machine A:
mooncake_master -port 50052 -max_threads 64 -metrics_port 9004 \
  --enable_http_metadata_server=true \
  --http_metadata_server_host=0.0.0.0 \
  --http_metadata_server_port=8080
  3. Launch the Decoder instance on Machine A
  • Modify the vllm/examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh file.
diff --git a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
index 831ef0bb5..a2ff0744c 100644
--- a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
+++ b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
 elif [[ $1 == "decoder" ]]; then
     # Decoder listens on port 8200
-    decode_config_file=$SCRIPT_DIR/configs/lmcache-decoder-config.yaml
+    decode_config_file=$SCRIPT_DIR/configs/mooncake-decoder-config.yaml
 
     UCX_TLS=cuda_ipc,cuda_copy,tcp \
         LMCACHE_CONFIG_FILE=$decode_config_file \
         LMCACHE_USE_EXPERIMENTAL=True \
         VLLM_ENABLE_V1_MULTIPROCESSING=1 \
         VLLM_WORKER_MULTIPROC_METHOD=spawn \
         CUDA_VISIBLE_DEVICES=1 \
  • Add the mooncake-decoder-config.yaml file
chunk_size: 256
remote_url: "mooncakestore://{IP of Machine A}:50052/"
remote_serde: "naive"
local_cpu: False
max_local_cpu_size: 100

extra_config:
  local_hostname: "{IP of Machine A}"
  metadata_server: "http://{IP of Machine A}:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0" # Multiple RDMA devices can be specified as a comma-separated list
  master_server_address: "{IP of Machine A}:50052"
  global_segment_size: 32212254720 # 30GB
  local_buffer_size: 1073741824 # 1GB
  transfer_timeout: 1
  save_chunk_meta: False
  • Launch the Decoder instance with the following command:
bash disagg_vllm_launcher.sh decoder Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
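The sizes in the config above are plain byte counts. A quick check that the values match their comments, assuming GiB-based units (an assumption consistent with the round numbers):

```python
# Verify the byte values used in the config, assuming 1 GiB = 1024**3 bytes.
GiB = 1024 ** 3

global_segment_size = 30 * GiB   # matches the "30GB" comment above
local_buffer_size = 1 * GiB      # matches the "1GB" comment above

print(global_segment_size)  # 32212254720
print(local_buffer_size)    # 1073741824
```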
  4. Launch the Prefiller instance on Machine B
  • Modify the vllm/examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh file.
diff --git a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
index 831ef0bb5..9e5a3f044 100644
--- a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
+++ b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
@@ -18,12 +18,14 @@ fi
 
 if [[ $1 == "prefiller" ]]; then
     # Prefiller listens on port 8100
-    prefill_config_file=$SCRIPT_DIR/configs/lmcache-prefiller-config.yaml
+    prefill_config_file=$SCRIPT_DIR/configs/mooncake-prefiller-config.yaml
  • Add the mooncake-prefiller-config.yaml file
chunk_size: 256
remote_url: "mooncakestore://{IP of Machine A}:50052/"
remote_serde: "naive"
local_cpu: False
max_local_cpu_size: 100

extra_config:
  local_hostname: "{IP of Machine B}"
  metadata_server: "http://{IP of Machine A}:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0" # Multiple RDMA devices can be specified as a comma-separated list
  master_server_address: "{IP of Machine A}:50052"
  global_segment_size: 32212254720 # 30GB
  local_buffer_size: 1073741824 # 1GB
  transfer_timeout: 1
  save_chunk_meta: False
  • Launch the Prefiller instance with the following command:
bash disagg_vllm_launcher.sh prefiller Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
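Note that the prefiller config differs from the decoder config only in `local_hostname`; everything else must point at the Mooncake master and metadata server on Machine A. A minimal consistency-check sketch (the helper function and example IPs are ours for illustration, not part of LMCache):

```python
# Sketch: check that the prefiller and decoder LMCache configs agree on the
# Mooncake master, metadata server, and remote_url. Only local_hostname is
# expected to differ per machine. (Illustrative helper, not part of LMCache.)
SHARED_KEYS = ("master_server_address", "metadata_server", "protocol")

def check_pd_configs(prefill: dict, decode: dict) -> list:
    """Return a list of settings that mismatch between the two configs."""
    problems = []
    if prefill["remote_url"] != decode["remote_url"]:
        problems.append("remote_url differs")
    for key in SHARED_KEYS:
        if prefill["extra_config"][key] != decode["extra_config"][key]:
            problems.append(f"extra_config.{key} differs")
    return problems

if __name__ == "__main__":
    a, b = "10.0.0.1", "10.0.0.2"  # assumed IPs of Machine A and Machine B
    common = {
        "master_server_address": f"{a}:50052",
        "metadata_server": f"http://{a}:8080/metadata",
        "protocol": "rdma",
    }
    prefill = {"remote_url": f"mooncakestore://{a}:50052/",
               "extra_config": {**common, "local_hostname": b}}
    decode = {"remote_url": f"mooncakestore://{a}:50052/",
              "extra_config": {**common, "local_hostname": a}}
    print(check_pd_configs(prefill, decode))  # [] -> configs are consistent
```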
  5. Prepare the router disagg_proxy_server

We use the disagg_proxy_server provided by LMCache. According to LMCache/LMCache#1342, when using Mooncake Store as the backend you need to comment out the wait_decode_kv_ready(req_id) call in the proxy code.

  6. Launch the disagg_proxy_server with the following command:
python3 disagg_proxy_server.py --host localhost --port 9000 --prefiller-host IP_of_Machine_B --prefiller-port 8100 --decoder-host IP_of_Machine_A --decoder-port 8200 
  7. Now we can send requests to the disagg_proxy_server to test PD disaggregated serving.
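Since the proxy fronts vLLM's OpenAI-compatible API, a test request is an ordinary completion call. A small sketch that builds the request body (the model name matches the launch commands above; host and port match the disagg_proxy_server flags; the endpoint path follows vLLM's OpenAI-compatible server):

```python
# Build an OpenAI-compatible completion request for the disagg_proxy_server.
import json

def build_completion_request(prompt: str, max_tokens: int = 64) -> str:
    """Return a JSON body for POST /v1/completions on the proxy."""
    payload = {
        "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)

if __name__ == "__main__":
    body = build_completion_request("Hello, Mooncake!")
    print(body)
    # To send it against the running proxy:
    #   curl http://localhost:9000/v1/completions \
    #     -H "Content-Type: application/json" -d '<body>'
```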

Performance Evaluation

We evaluated the current implementation on two A10 servers. By comparing the performance of a 1P1D configuration with that of two regular (non-disaggregated) instances, we observed that P/D disaggregation achieves approximately 30% lower ITL while maintaining comparable total throughput. This aligns with findings from the Mooncake paper, which highlighted that P/D disaggregation is effective in reducing TBT/ITL under similar throughput conditions—or conversely, in enabling higher throughput under stricter ITL/TBT SLOs. Moreover, we anticipate even greater benefits in larger-scale clusters where both the number of prefill and decode nodes (x and y in xPyD configurations) increase, offering enhanced scheduling flexibility and resource efficiency.

Traffic Request Rate: 1.0

  • model: Qwen2.5-7B-Instruct-GPTQ-Int4
  • TP: 4
  • random_input_len=8192, random_output_len=512
  • num prompt=50
| Configuration | Output Token Throughput (tok/s) | Mean E2E Latency (ms) | Total Token Throughput (tok/s) | Mean TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|
| 1P1D | 407.59 | 3413.86 | 7084.46 | 732.54 | 2952.57 | 7.23 | 10.76 |
| 2 Regular | 427.65 | 4586.54 | 7433.27 | 767.18 | 1264.88 | 10.30 | 12.73 |

Traffic Request Rate: 4.0

  • model: Qwen2.5-7B-Instruct-GPTQ-Int4
  • TP: 2
  • random_input_len=2048, random_output_len=512
  • num prompt=200
| Configuration | Output Token Throughput (tok/s) | Mean E2E Latency (ms) | Total Token Throughput (tok/s) | Mean TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|
| 1P1D | 1215.17 | 11519.24 | 6161.43 | 1111.94 | 2725.89 | 17.06 | 19.72 |
| 2 Regular | 1223.03 | 11683.15 | 6201.29 | 310.01 | 720.91 | 25.74 | 294.89 |
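The roughly 30% ITL improvement cited above can be read directly off the Mean ITL columns of the two tables:

```python
# Worked check of the ITL reduction: 1P1D vs. two regular instances,
# using the Mean ITL (ms) values from the tables above.
def itl_reduction(disagg_itl_ms: float, regular_itl_ms: float) -> float:
    """Relative Mean-ITL reduction of 1P1D over 2 Regular, in percent."""
    return (regular_itl_ms - disagg_itl_ms) / regular_itl_ms * 100

print(f"{itl_reduction(7.23, 10.30):.1f}%")   # request rate 1.0 -> 29.8%
print(f"{itl_reduction(17.06, 25.74):.1f}%")  # request rate 4.0 -> 33.7%
```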
