vLLM V1 Disaggregated Serving with Mooncake Store and LMCache

https://kvcache-ai.github.io/Mooncake/getting_started/examples/vllm-integration/vllmv1-lmcache-integration.html

Overview

vLLM v1 has been released with support for prefill/decode (PD) disaggregation; the detailed design document can be found here. LMCache promptly implemented the corresponding connector to support storing, transmitting, and loading the KV cache, enabling the prefill and decode nodes to work together. Mooncake, as LMCache's backend storage engine, has undergone extensive optimization for usability, performance, and stability. This document explains how to deploy a PD disaggregated serving demo using LMCache + Mooncake.

Deployment

  1. First, prepare two GPU-equipped machines, which we will refer to as Machine A and Machine B. Install vLLM, Mooncake, and LMCache on both machines. For installation instructions, please refer to the official documentation of each repository.
  2. Start the Mooncake Master node on Machine A:
mooncake_master -port 50052 -max_threads 64 -metrics_port 9004 \
  --enable_http_metadata_server=true \
  --http_metadata_server_host=0.0.0.0 \
  --http_metadata_server_port=8080
  3. Launch the Decoder instance on Machine A
  • Modify the vllm/examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh file.
diff --git a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
index 831ef0bb5..a2ff0744c 100644
--- a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
+++ b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
 elif [[ $1 == "decoder" ]]; then
     # Decoder listens on port 8200
-    decode_config_file=$SCRIPT_DIR/configs/lmcache-decoder-config.yaml
+    decode_config_file=$SCRIPT_DIR/configs/mooncake-decoder-config.yaml
 
     UCX_TLS=cuda_ipc,cuda_copy,tcp \
         LMCACHE_CONFIG_FILE=$decode_config_file \
         LMCACHE_USE_EXPERIMENTAL=True \
         VLLM_ENABLE_V1_MULTIPROCESSING=1 \
         VLLM_WORKER_MULTIPROC_METHOD=spawn \
         CUDA_VISIBLE_DEVICES=1 \
  • Add the mooncake-decoder-config.yaml file
chunk_size: 256
remote_url: "mooncakestore://{IP of Machine A}:50052/"
remote_serde: "naive"
local_cpu: False
max_local_cpu_size: 100

extra_config:
  local_hostname: "{IP of Machine A}"
  metadata_server: "http://{IP of Machine A}:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0" # Multiple RDMA devices can be specified as a comma-separated list
  master_server_address: "{IP of Machine A}:50052"
  global_segment_size: 32212254720 # 30GB
  local_buffer_size: 1073741824 # 1GB
  transfer_timeout: 1
  save_chunk_meta: False
  • Launch the Decoder instance with the following command:
bash disagg_vllm_launcher.sh decoder Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
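The sizes in the config above are plain byte counts. A quick check that the values match their comments, assuming GiB-based units (an assumption consistent with the round numbers):

```python
# Verify the byte values used in the config, assuming 1 GiB = 1024**3 bytes.
GiB = 1024 ** 3

global_segment_size = 30 * GiB   # matches the "30GB" comment above
local_buffer_size = 1 * GiB      # matches the "1GB" comment above

print(global_segment_size)  # 32212254720
print(local_buffer_size)    # 1073741824
```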
  4. Launch the Prefiller instance on Machine B
  • Modify the vllm/examples/others/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh file.
diff --git a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
index 831ef0bb5..9e5a3f044 100644
--- a/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
+++ b/examples/lmcache/disagg_prefill_lmcache_v1/disagg_vllm_launcher.sh
@@ -18,12 +18,14 @@ fi
 
 if [[ $1 == "prefiller" ]]; then
     # Prefiller listens on port 8100
-    prefill_config_file=$SCRIPT_DIR/configs/lmcache-prefiller-config.yaml
+    prefill_config_file=$SCRIPT_DIR/configs/mooncake-prefiller-config.yaml
  • Add the mooncake-prefiller-config.yaml file
chunk_size: 256
remote_url: "mooncakestore://{IP of Machine A}:50052/"
remote_serde: "naive"
local_cpu: False
max_local_cpu_size: 100

extra_config:
  local_hostname: "{IP of Machine B}"
  metadata_server: "http://{IP of Machine A}:8080/metadata"
  protocol: "rdma"
  device_name: "mlx5_0" # Multiple RDMA devices can be specified as a comma-separated list
  master_server_address: "{IP of Machine A}:50052"
  global_segment_size: 32212254720 # 30GB
  local_buffer_size: 1073741824 # 1GB
  transfer_timeout: 1
  save_chunk_meta: False
  • Launch the Prefiller instance with the following command:
bash disagg_vllm_launcher.sh prefiller Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
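Note that the prefiller config differs from the decoder config only in `local_hostname`; everything else must point at the Mooncake master and metadata server on Machine A. A minimal consistency-check sketch (the helper function and example IPs are ours for illustration, not part of LMCache):

```python
# Sketch: check that the prefiller and decoder LMCache configs agree on the
# Mooncake master, metadata server, and remote_url. Only local_hostname is
# expected to differ per machine. (Illustrative helper, not part of LMCache.)
SHARED_KEYS = ("master_server_address", "metadata_server", "protocol")

def check_pd_configs(prefill: dict, decode: dict) -> list:
    """Return a list of settings that mismatch between the two configs."""
    problems = []
    if prefill["remote_url"] != decode["remote_url"]:
        problems.append("remote_url differs")
    for key in SHARED_KEYS:
        if prefill["extra_config"][key] != decode["extra_config"][key]:
            problems.append(f"extra_config.{key} differs")
    return problems

if __name__ == "__main__":
    a, b = "10.0.0.1", "10.0.0.2"  # assumed IPs of Machine A and Machine B
    common = {
        "master_server_address": f"{a}:50052",
        "metadata_server": f"http://{a}:8080/metadata",
        "protocol": "rdma",
    }
    prefill = {"remote_url": f"mooncakestore://{a}:50052/",
               "extra_config": {**common, "local_hostname": b}}
    decode = {"remote_url": f"mooncakestore://{a}:50052/",
              "extra_config": {**common, "local_hostname": a}}
    print(check_pd_configs(prefill, decode))  # [] -> configs are consistent
```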
  5. Prepare the router disagg_proxy_server

We use the disagg_proxy_server provided by LMCache. According to LMCache/LMCache#1342, when using Mooncake Store as the backend you need to comment out the wait_decode_kv_ready(req_id) call in the proxy code.

  6. Launch the disagg_proxy_server with the following command:
python3 disagg_proxy_server.py --host localhost --port 9000 --prefiller-host IP_of_Machine_B --prefiller-port 8100 --decoder-host IP_of_Machine_A --decoder-port 8200 
  7. Now we can send requests to the disagg_proxy_server to test PD disaggregated serving.
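Since the proxy fronts vLLM's OpenAI-compatible API, a test request is an ordinary completion call. A small sketch that builds the request body (the model name matches the launch commands above; host and port match the disagg_proxy_server flags; the endpoint path follows vLLM's OpenAI-compatible server):

```python
# Build an OpenAI-compatible completion request for the disagg_proxy_server.
import json

def build_completion_request(prompt: str, max_tokens: int = 64) -> str:
    """Return a JSON body for POST /v1/completions on the proxy."""
    payload = {
        "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
        "prompt": prompt,
        "max_tokens": max_tokens,
    }
    return json.dumps(payload)

if __name__ == "__main__":
    body = build_completion_request("Hello, Mooncake!")
    print(body)
    # To send it against the running proxy:
    #   curl http://localhost:9000/v1/completions \
    #     -H "Content-Type: application/json" -d '<body>'
```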

Performance Evaluation

We evaluated the current implementation on two A10 servers. By comparing the performance of a 1P1D configuration with that of two regular (non-disaggregated) instances, we observed that P/D disaggregation achieves approximately 30% lower ITL while maintaining comparable total throughput. This aligns with findings from the Mooncake paper, which highlighted that P/D disaggregation is effective in reducing TBT/ITL under similar throughput conditions—or conversely, in enabling higher throughput under stricter ITL/TBT SLOs. Moreover, we anticipate even greater benefits in larger-scale clusters where both the number of prefill and decode nodes (x and y in xPyD configurations) increase, offering enhanced scheduling flexibility and resource efficiency.

Traffic Request Rate: 1.0

  • model: Qwen2.5-7B-Instruct-GPTQ-Int4
  • TP: 4
  • random_input_len=8192, random_output_len=512
  • num prompt=50
| Configuration | Output Token Throughput (tok/s) | Mean E2E Latency (ms) | Total Token Throughput (tok/s) | Mean TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|
| 1P1D | 407.59 | 3413.86 | 7084.46 | 732.54 | 2952.57 | 7.23 | 10.76 |
| 2 Regular | 427.65 | 4586.54 | 7433.27 | 767.18 | 1264.88 | 10.30 | 12.73 |

Traffic Request Rate: 4.0

  • model: Qwen2.5-7B-Instruct-GPTQ-Int4
  • TP: 2
  • random_input_len=2048, random_output_len=512
  • num prompt=200
| Configuration | Output Token Throughput (tok/s) | Mean E2E Latency (ms) | Total Token Throughput (tok/s) | Mean TTFT (ms) | P99 TTFT (ms) | Mean ITL (ms) | P99 ITL (ms) |
|---|---|---|---|---|---|---|---|
| 1P1D | 1215.17 | 11519.24 | 6161.43 | 1111.94 | 2725.89 | 17.06 | 19.72 |
| 2 Regular | 1223.03 | 11683.15 | 6201.29 | 310.01 | 720.91 | 25.74 | 294.89 |
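The roughly 30% ITL improvement cited above can be read directly off the Mean ITL columns of the two tables:

```python
# Worked check of the ITL reduction: 1P1D vs. two regular instances,
# using the Mean ITL (ms) values from the tables above.
def itl_reduction(disagg_itl_ms: float, regular_itl_ms: float) -> float:
    """Relative Mean-ITL reduction of 1P1D over 2 Regular, in percent."""
    return (regular_itl_ms - disagg_itl_ms) / regular_itl_ms * 100

print(f"{itl_reduction(7.23, 10.30):.1f}%")   # request rate 1.0 -> 29.8%
print(f"{itl_reduction(17.06, 25.74):.1f}%")  # request rate 4.0 -> 33.7%
```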
