Install with NVIDIA Dynamo
This guide provides step-by-step instructions for integrating vLLM Semantic Router with NVIDIA Dynamo.
About NVIDIA Dynamo
NVIDIA Dynamo is a high-performance distributed inference platform designed for large language model serving. Dynamo provides advanced features for optimizing GPU utilization and reducing inference latency through intelligent routing and caching mechanisms.
Key Features
- Disaggregated Serving: Separate Prefill and Decode workers for optimal GPU utilization
- KV-Aware Routing: Routes requests to workers with relevant KV cache for prefix cache optimization
- Dynamic Scaling: Planner component handles auto-scaling based on workload
- Multi-Tier KV Cache: GPU HBM → System Memory → NVMe for efficient cache management
- Worker Coordination: etcd and NATS for distributed worker registration and message queuing
- Backend Agnostic: Supports vLLM, SGLang, and TensorRT-LLM backends
Integration Benefits
Integrating vLLM Semantic Router with NVIDIA Dynamo provides several advantages:
- Dual-Layer Intelligence: Semantic Router provides request-level intelligence (model selection, classification) while Dynamo optimizes infrastructure-level efficiency (worker selection, KV cache reuse)
- Intelligent Model Selection: Semantic Router analyzes incoming requests and routes them to the most appropriate model based on content understanding, while Dynamo's KV-aware router efficiently selects optimal workers
- Dual-Layer Caching: Semantic cache (request-level, Milvus-backed) combined with KV cache (token-level, Dynamo-managed) for maximum latency reduction
- Enhanced Security: PII detection and jailbreak prevention filter requests before they reach inference workers
- Disaggregated Architecture: Separate prefill and decode workers with KV-aware routing for reduced latency and better throughput
Architecture
This deployment uses the Disaggregated Router Deployment pattern with KV cache enabled, featuring separate prefill and decode workers for optimal GPU utilization.
┌─────────────────────────────────────────────────────────────────┐
│                              CLIENT                             │
│   curl -X POST http://localhost:8080/v1/chat/completions        │
│        -d '{"model": "MoM", "messages": [...]}'                 │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                          ENVOY GATEWAY                          │
│   • Routes traffic, applies ExtProc filter                      │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                SEMANTIC ROUTER (ExtProc Filter)                 │
│   • Classifies query → selects category (e.g., "math")          │
│   • Selects model → rewrites request                            │
│   • Injects domain-specific system prompt                       │
│   • PII/Jailbreak detection                                     │
└─────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
┌─────────────────────────────────────────────────────────────────┐
│                DYNAMO FRONTEND (KV-Aware Routing)                │
│   • Receives enriched request with selected model               │
│   • Routes to optimal worker based on KV cache state            │
│   • Coordinates workers via etcd/NATS                           │
└─────────────────────────────────────────────────────────────────┘
              │                                   │
              ▼                                   ▼
┌───────────────────────────┐      ┌───────────────────────────┐
│  PREFILL WORKER (GPU 1)   │      │  DECODE WORKER (GPU 2)    │
│  prefillworker0           │─────▶│  decodeworker1            │
│  --worker-type prefill    │      │  --worker-type decode     │
└───────────────────────────┘      └───────────────────────────┘
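Once the full stack is deployed (Steps 1-6 below), you can exercise this path end to end with a single request. This is a minimal sketch, assuming the Envoy Gateway listener has been made reachable on localhost:8080 (for example via kubectl port-forward); the model name "MoM" matches the client request shown in the diagram.

```bash
# Minimal smoke test, assuming the Envoy Gateway listener is reachable on localhost:8080
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "What is the derivative of x^2?"}]
  }'
```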
Deployment Modes
This guide deploys the Disaggregated Router Deployment pattern with KV cache enabled (frontend.routerMode=kv). This is the recommended configuration for optimal performance, as it enables KV-aware routing to reuse computed attention tensors across requests. Separate prefill and decode workers maximize GPU utilization.
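As a sketch, KV-aware routing corresponds to the frontend.routerMode=kv chart value referenced above; the value name is taken from this guide, so verify it against the chart's values.yaml before relying on it.

```bash
# Sketch: pin the frontend to KV-aware routing (value name as referenced in this guide)
helm upgrade --install dynamo-vllm ./deploy/kubernetes/dynamo/helm-chart \
  --namespace dynamo-system \
  --set frontend.routerMode=kv
```

The same --set flag can be combined with the worker settings shown in the mode examples below.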
Based on NVIDIA Dynamo deployment patterns, the Helm chart supports two deployment modes:
Aggregated Mode (Default)
Workers handle both prefill and decode phases. Simpler setup, fewer GPUs required.
# No workerType specified = defaults to "both"
helm install dynamo-vllm ./deploy/kubernetes/dynamo/helm-chart \
--namespace dynamo-system \
--set workers[0].model.path=Qwen/Qwen2-0.5B-Instruct \
--set workers[1].model.path=Qwen/Qwen2-0.5B-Instruct
- Workers register as the `backend` component in etcd
- No `--is-prefill-worker` flag
- Each worker can handle complete inference requests
Disaggregated Mode (High Performance)
Separate prefill and decode workers for optimal GPU utilization.
# Explicit workerType = disaggregated mode
helm install dynamo-vllm ./deploy/kubernetes/dynamo/helm-chart \
--namespace dynamo-system \
--set workers[0].model.path=Qwen/Qwen2-0.5B-Instruct \
--set workers[0].workerType=prefill \
--set workers[1].model.path=Qwen/Qwen2-0.5B-Instruct \
--set workers[1].workerType=decode
| Worker | Flag | ETCD Component | Role |
|---|---|---|---|
| Prefill | --is-prefill-worker | prefill | Processes input tokens, generates KV cache |
| Decode | (no special flag) | backend | Generates output tokens, receives decode requests only |
In disaggregated mode, only prefill workers use the --is-prefill-worker flag. Decode workers use the default vLLM behavior (no special flag). The KV-aware frontend routes prefill requests to prefill workers and decode requests to backend workers.
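If you want to confirm which mode a running deployment ended up in, you can inspect the worker pods' container commands and arguments; this is a generic check, assuming the workers run in the dynamo-system namespace as elsewhere in this guide.

```bash
# Print each pod's container command/args and look for --is-prefill-worker
kubectl get pods -n dynamo-system \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].command}{" "}{.spec.containers[0].args}{"\n"}{end}'
```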
Prerequisites
GPU Requirements
This deployment requires a machine with at least 3 GPUs:
| Component | GPU | Description |
|---|---|---|
| Frontend | GPU 0 | Dynamo Frontend with KV-aware routing (--router-mode kv) |
| Prefill Worker | GPU 1 | Handles prefill phase of inference (--worker-type prefill) |
| Decode Worker | GPU 2 | Handles decode phase of inference (--worker-type decode) |
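Before proceeding, it is worth confirming that the host actually exposes at least three GPUs:

```bash
# List the GPUs visible on the host; you need at least three (indices 0-2)
nvidia-smi --query-gpu=index,name,memory.total --format=csv
```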
Required Tools
Before starting, ensure you have the following tools installed (all of them are used later in this guide):
- Docker (with the NVIDIA Container Toolkit, nvidia-ctk)
- kind
- kubectl
- Helm
- NVIDIA GPU drivers (nvidia-smi available on the host)
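A quick way to confirm the tooling is in place (this guide does not prescribe exact versions):

```bash
# Verify the required CLI tools are installed and on PATH
docker --version
kind version
kubectl version --client
helm version --short
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-ctk --version
```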
NVIDIA Runtime Configuration (One-Time Setup)
Configure Docker to use the NVIDIA runtime as the default:
# Configure NVIDIA runtime as default
sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
# Restart Docker
sudo systemctl restart docker
# Verify configuration
docker info | grep -i "default runtime"
# Expected output: Default Runtime: nvidia
Step 1: Create Kind Cluster with GPU Support
Create a local Kubernetes cluster with GPU support. Choose one of the following options:
Option 1: Quick Setup (External Documentation)
For a quick setup, follow the official Kind GPU documentation:
kind create cluster --name semantic-router-dynamo
# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
For GPU support, see the Kind GPU documentation for details on configuring extra mounts and deploying the NVIDIA device plugin.
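If your nodes already expose the NVIDIA driver correctly, the device plugin itself is typically installed from its upstream Helm chart; the repository URL and chart version below are the upstream defaults at the time of writing, so check them against the NVIDIA k8s-device-plugin documentation.

```bash
# Deploy the NVIDIA device plugin via its upstream Helm chart (verify the version against upstream docs)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.14.1

# Confirm GPUs become allocatable on the nodes
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu
```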
Option 2: Full GPU Setup (E2E Procedure)
This is the procedure used in our E2E tests. It includes all the steps needed to set up GPU support in Kind.
2.1 Create Kind Cluster with GPU Configuration
Create a Kind config file with GPU mount support:
# Create Kind config for GPU support
cat > kind-gpu-config.yaml << 'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: semantic-router-dynamo
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /mnt
        containerPath: /mnt
  - role: worker
    extraMounts:
      - hostPath: /mnt
        containerPath: /mnt
      - hostPath: /dev/null
        containerPath: /var/run/nvidia-container-devices/all
EOF
# Create cluster with GPU config
kind create cluster --name semantic-router-dynamo --config kind-gpu-config.yaml --wait 5m
# Verify cluster is ready
kubectl wait --for=condition=Ready nodes --all --timeout=300s
2.2 Set Up NVIDIA Libraries in Kind Worker
Copy NVIDIA libraries from the host to the Kind worker node:
# Set worker name
WORKER_NAME="semantic-router-dynamo-worker"
# Detect NVIDIA driver version
DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1)
echo "Detected NVIDIA driver version: $DRIVER_VERSION"
# Verify GPU devices exist in the Kind worker
docker exec $WORKER_NAME ls /dev/nvidia0
echo "✅ GPU devices found in Kind worker"
# Create directory for NVIDIA libraries
docker exec $WORKER_NAME mkdir -p /nvidia-driver-libs
# Copy nvidia-smi binary
tar -cf - -C /usr/bin nvidia-smi | docker exec -i $WORKER_NAME tar -xf - -C /nvidia-driver-libs/
# Copy NVIDIA libraries from host
tar -cf - -C /usr/lib64 libnvidia-ml.so.$DRIVER_VERSION libcuda.so.$DRIVER_VERSION | \
docker exec -i $WORKER_NAME tar -xf - -C /nvidia-driver-libs/
# Create symlinks
docker exec $WORKER_NAME bash -c "cd /nvidia-driver-libs && \
ln -sf libnvidia-ml.so.$DRIVER_VERSION libnvidia-ml.so.1 && \
ln -sf libcuda.so.$DRIVER_VERSION libcuda.so.1 && \
chmod +x nvidia-smi"
# Verify nvidia-smi works inside the Kind worker
docker exec $WORKER_NAME bash -c "LD_LIBRARY_PATH=/nvidia-driver-libs /nvidia-driver-libs/nvidia-smi"
echo "✅ nvidia-smi verified in Kind worker"
2.3 Deploy NVIDIA Device Plugin
Deploy the NVIDIA device plugin to make GPUs allocatable in Kubernetes:
# Create device plugin manifest
cat > nvidia-device-plugin.yaml << 'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
          name: nvidia-device-plugin-ctr
          env:
            - name: LD_LIBRARY_PATH
              value: "/nvidia-driver-libs"
          securityContext:
            privileged: true
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: dev
              mountPath: /dev
            - name: nvidia-driver-libs
              mountPath: /nvidia-driver-libs
              readOnly: true
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: dev
          hostPath:
            path: /dev
        - name: nvidia-driver-libs
          hostPath:
            path: /nvidia-driver-libs
EOF
# Apply device plugin
kubectl apply -f nvidia-device-plugin.yaml
# Wait for device plugin to be ready
sleep 20
# Verify GPUs are allocatable
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\\.com/gpu
echo "✅ GPU setup complete"
The Semantic Router project includes automated E2E tests that handle all of this GPU setup automatically. You can run:
make e2e-test E2E_PROFILE=dynamo E2E_VERBOSE=true
This will create a Kind cluster with GPU support, deploy all components, and run the test suite.
Step 2: Install Dynamo Platform
Deploy the Dynamo platform components (etcd, NATS, Dynamo Operator):
# Add the Dynamo Helm repository
helm repo add dynamo https://nvidia.github.io/dynamo
helm repo update
# Install Dynamo CRDs
helm install dynamo-crds dynamo/dynamo-crds \
--namespace dynamo-system \
--create-namespace
# Install Dynamo Platform (etcd, NATS, Operator)
helm install dynamo-platform dynamo/dynamo-platform \
--namespace dynamo-system \
--wait
# Wait for platform components to be ready
kubectl wait --for=condition=Available deployment -l app.kubernetes.io/instance=dynamo-platform -n dynamo-system --timeout=300s
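Once the chart reports success, a quick pod listing confirms that etcd, NATS, and the Dynamo operator are up (exact pod names depend on the release name):

```bash
# Sanity check: etcd, NATS, and the Dynamo operator pods should all be Running
kubectl get pods -n dynamo-system
```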
Step 3: Install Envoy Gateway
Deploy Envoy Gateway with ExtensionAPIs enabled for Semantic Router integration:
# Install Envoy Gateway with custom values
helm install envoy-gateway oci://docker.io/envoyproxy/gateway-helm \
--version v1.3.0 \
--namespace envoy-gateway-system \
--create-namespace \
-f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/dynamo-resources/envoy-gateway-values.yaml
# Wait for Envoy Gateway to be ready
kubectl wait --for=condition=Available deployment/envoy-gateway -n envoy-gateway-system --timeout=300s
Important: The values file enables extensionApis.enableEnvoyPatchPolicy: true, which is required for the Semantic Router ExtProc integration.
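If you want to double-check that the extension API is enabled, inspect the Envoy Gateway configuration; envoy-gateway-config is the chart's default ConfigMap name and may differ if you customized the release.

```bash
# Confirm EnvoyPatchPolicy support is enabled in the Envoy Gateway config
kubectl get configmap envoy-gateway-config -n envoy-gateway-system -o yaml | grep -i -A2 "enableEnvoyPatchPolicy"
```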
Step 4: Deploy vLLM Semantic Router
Deploy the Semantic Router with Dynamo-specific configuration:
# Install Semantic Router from GHCR OCI registry
helm install semantic-router oci://ghcr.io/vllm-project/charts/semantic-router \
--version v0.0.0-latest \
--namespace vllm-semantic-router-system \
--create-namespace \
-f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/semantic-router-values/values.yaml
# Wait for deployment to be ready
kubectl wait --for=condition=Available deployment/semantic-router -n vllm-semantic-router-system --timeout=600s
# Verify deployment status
kubectl get pods -n vllm-semantic-router-system
Note: The values file configures Semantic Router to route to the TinyLlama model served by Dynamo workers.
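To confirm the router started with the expected configuration, check its logs; the deployment name matches the Helm release installed above.

```bash
# Inspect router startup logs for the loaded models and routing configuration
kubectl logs deploy/semantic-router -n vllm-semantic-router-system --tail=50
```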
Step 5: Deploy RBAC Resources
Apply RBAC permissions for Semantic Router to access Dynamo CRDs:
kubectl apply -f https://raw.githubusercontent.com/vllm-project/semantic-router/refs/heads/main/deploy/kubernetes/dynamo/dynamo-resources/rbac.yaml
Step 6: Deploy Dynamo vLLM Workers
Deploy the Dynamo workers using the Helm chart. This provides flexible CLI-based configuration without editing YAML files.
Option A: Using Helm Chart (Recommended)
# Clone the repository (if not already cloned)
git clone https://github.com/vllm-project/semantic-router.git
cd semantic-router
# Basic installation with default TinyLlama model
helm install dynamo-vllm ./deploy/kubernetes/dynamo/helm-chart \
--namespace dynamo-system
# Wait for workers to be ready
kubectl wait --for=condition=Available deployment -l app.kubernetes.io/instance=dynamo-vllm -n dynamo-system --timeout=600s
Option B: Custom Model via CLI
Deploy with a custom model without editing any files:
helm install dynamo-vllm ./deploy/kubernetes/dynamo/helm-chart \
--namespace dynamo-system \
--set workers[0].model.path=Qwen/Qwen2-0.5B-Instruct \
--set workers[1].model.path=Qwen/Qwen2-0.5B-Instruct
Option C: Explicit Prefill/Decode Configuration
helm install dynamo-vllm ./deploy/kubernetes/dynamo/helm-chart \
--namespace dynamo-system \
--set workers[0].model.path=Qwen/Qwen2-0.5B-Instruct \
--set workers[0].workerType=prefill \
--set workers[1].model.path=Qwen/Qwen2-0.5B-Instruct \
--set workers[1].workerType=decode
Option D: Gated Models (Llama, Mistral)
For models requiring HuggingFace authentication:
# Create secret with HuggingFace token
kubectl create secret generic hf-secret \
--from-literal=HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx \
-n dynamo-system
# Install with secret reference
helm install dynamo-vllm ./deploy/kubernetes/dynamo/helm-chart \
--namespace dynamo-system \
--set huggingface.existingSecret=hf-secret \
--set workers[0].model.path=meta-llama/Llama-2-7b-chat-hf \
--set workers[1].model.path=meta-llama/Llama-2-7b-chat-hf
Option E: Custom GPU Device Assignment
Specify which GPU each worker should use:
helm install dynamo-vllm ./deploy/kubernetes/dynamo/helm-chart \
--namespace dynamo-system \
--set frontend.gpuDevice=0 \
--set workers[0].gpuDevice=1 \
--set workers[0].workerType=prefill \
--set workers[1].gpuDevice=2 \
--set workers[1].workerType=decode
If you don't specify gpuDevice, the Helm chart uses smart defaults:
- Frontend: GPU 0
- Worker 0: GPU 1 (index + 1)
- Worker 1: GPU 2 (index + 1)
- Worker N: GPU N+1
This ensures GPU 0 is reserved for the frontend, and workers are automatically assigned to subsequent GPUs. You only need to override these if you have a specific GPU layout requirement.
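After the workers are running, you can confirm the actual GPU layout from the host; nvidia-smi shows which device each frontend and worker process ended up on (process names vary by backend).

```bash
# On the host: list the compute processes pinned to each GPU
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv
```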