Multi-tenant LLM inference-as-a-service on NVIDIA DGX Spark. All tenant management, authentication, rate limiting, and inference routing are implemented via Kubernetes CRDs — zero custom application code.
TokenLabs composes four open-source projects, each handling a distinct concern:
Client (Authorization: Bearer <api-key>)
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Envoy AI Gateway (EAG v0.5.0) + Envoy Gateway (EG v1.5.0) │
│ ├─ Kuadrant AuthPolicy → Authorino ① API key auth │
│ ├─ Kuadrant RateLimitPolicy → Limitador ② Rate limits │
│ │ │
│ ├─ AIGatewayRoute: /v1/chat/completions ③ Model routing │
│ │ EAG reads {"model":...} body → x-ai-eg-model header │
│ │ ├─ model=Nemotron-Llama-8B → llm-d EPP → vLLM (spark-01) │
│ │ └─ model=Nemotron-VL-12B → llm-d EPP → vLLM (spark-02) │
│ │ │
│ ├─ HTTPRoute: /v1/audio/speech ──► Magpie TTS ④ Text-to-speech│
│ └─ HTTPRoute: /v1/audio/transcriptions ──► Riva STT (NVIDIA NIM)│
├──────────────────────────────────────────────────────────────────┤
│ vLLM / TTS Workers │
│ └─ Response with usage.total_tokens (LLMs) │
├──────────────────────────────────────────────────────────────────┤
│ Kuadrant TokenRateLimitPolicy → Limitador ⑤ Token quota │
└──────────────────────────────────────────────────────────────────┘
│
▼
Client receives response
Envoy AI Gateway (EAG v0.5.0) — AI-aware proxy layer that extends Envoy Gateway. EAG adds the AIGatewayRoute CRD which natively parses the "model" field from JSON request bodies, sets the x-ai-eg-model header, and routes to the matching InferencePool. It also introduces BackendSecurityPolicy for upstream credential injection (used by Riva STT to swap the client's token-labs key for the NVIDIA API key). EAG replaces the old Body Based Router (BBR) ext_proc pattern — no ConfigMaps, no extra sidecar.
Envoy Gateway (EG v1.5.0) — Kubernetes-native L7 proxy that implements the Gateway API. EG is the data plane; EAG extends it via an xDS extension manager hook. EG provisions Envoy proxy pods, handles TLS termination, and hosts ext_proc filters for llm-d's EPP. Chosen over Istio because the Gateway API Inference Extension explicitly supports it and it's lighter weight than a full service mesh.
Kuadrant — CNCF policy layer that deploys two backing services:
- Authorino — external authorization service. When the `AuthPolicy` CRD is applied, Authorino intercepts every request and validates the API key (stored as a Kubernetes Secret). It extracts tenant metadata (tier, user-id) from the Secret's annotations and enriches the request context so downstream policies can use it.
- Limitador — rate limiting service. Enforces request-count limits (via `RateLimitPolicy`) and, critically, token-based quotas (via `TokenRateLimitPolicy`). The token policy automatically parses `usage.total_tokens` from OpenAI-compatible JSON responses and counts it against the tenant's quota — no custom middleware required. This is what makes per-tenant billing feasible without writing a proxy.
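The `usage` object that vLLM appends to each OpenAI-compatible response is all the token policy needs. A sketch of extracting the field it reads (the response body below is a fabricated, representative example):

```shell
# A fabricated OpenAI-compatible response body for illustration; the
# token policy reads usage.total_tokens and adds it to the tenant's counter.
RESPONSE='{"choices":[{"message":{"content":"Hi!"}}],"usage":{"prompt_tokens":12,"completion_tokens":45,"total_tokens":57}}'

# Extract the same field the policy extracts (sed shown for portability):
TOKENS=$(printf '%s\n' "$RESPONSE" | sed -n 's/.*"total_tokens":\([0-9]*\).*/\1/p')
echo "$TOKENS"   # 57
```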
llm-d (v0.5.0) — inference-aware request scheduler. Its Endpoint Picker (EPP) runs as an Envoy ext_proc server and scores every vLLM pod on three signals before routing the request:
- KV-cache usage — avoids pods whose GPU memory is nearly full
- Prefix-cache locality — routes similar prompts to the same pod to reuse cached KV entries
- Queue depth — prefers pods with fewer in-flight requests
This produces better tail latency and higher throughput than round-robin or least-connections load balancing.
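A toy illustration of what "scoring on three signals" means in practice. The weights below are invented for demonstration; the real EPP combines its signals with its own configurable scorers inside an Envoy ext_proc server:

```shell
# Toy scoring sketch (weights are invented, not llm-d's):
# higher score = better routing target.
score() { # score <kv_cache_pct_used> <prefix_cache_hit:0|1> <queue_depth>
  awk -v kv="$1" -v hit="$2" -v q="$3" \
    'BEGIN { printf "%.2f\n", (1 - kv/100) + 2*hit - 0.1*q }'
}

score 40 1 2   # warm prefix cache, moderate load -> 2.40
score 90 0 0   # idle queue but nearly-full KV cache -> 0.10
```

Note how the prefix-cache hit dominates: a pod that can reuse cached KV entries beats an idle pod that would recompute the prefill from scratch.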
vLLM — high-performance LLM inference engine running on DGX Spark GB10 GPUs. Served via the ghcr.io/llm-d/llm-d-cuda:v0.5.0 container image. Exposes an OpenAI-compatible API (/v1/chat/completions, /v1/completions, /v1/models). Currently serves two models:
- Nemotron-Llama 8B (`nvidia/Llama-3.1-Nemotron-Nano-8B-v1`, spark-01) — general-purpose chat model, BF16, 80% GPU utilization
- Nemotron VL 12B FP8 (`nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8`, spark-02) — NVIDIA vision-language model with FP8 quantization, supports image+text inputs
Magpie TTS — NVIDIA's multilingual text-to-speech model (357M parameters). Runs on spark-01 in CPU mode (the GB10 GPU is fully allocated to the Nemotron-Llama vLLM pod). Served via a custom FastAPI wrapper that exposes an OpenAI-compatible /v1/audio/speech endpoint. Supports 5 voices and 7 languages (en, es, de, fr, vi, it, zh). Built on the NeMo framework.
Riva STT (NVIDIA NIM proxy) — speech-to-text via NVIDIA NIM at integrate.api.nvidia.com. TokenLabs proxies /v1/audio/transcriptions to the NIM endpoint using an Envoy Gateway Backend + BackendTLSPolicy (TLS toward NVIDIA) + EAG BackendSecurityPolicy (swaps the client's token-labs key for the NVIDIA API key). The client never sees the NVIDIA key.
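A sketch of the credential-swap wiring. The `BackendSecurityPolicy` schema below is an approximation based on EAG's documented API (policy attachment details omitted); confirm field names against the v0.5.0 CRD and the repo's `deploy/riva-stt/backend.yaml` before relying on it:

```yaml
# Approximate sketch only: injects the NVIDIA API key (from a Secret)
# on requests forwarded to the NIM backend, replacing the tenant key.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: BackendSecurityPolicy
metadata:
  name: nvidia-nim-api-key
  namespace: token-labs
spec:
  type: APIKey
  apiKey:
    secretRef:
      name: nvidia-nim-api-key   # Secret holding the nvapi-... key
```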
┌──────────────────────────────────────────────────────────────────────┐
│ MicroK8s Cluster │
│ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ controller │ │ spark-01 │ │ spark-02 │ │
│ │ (CPU, ARM64) │ │ (GB10 GPU) │ │ (GB10 GPU) │ │
│ │ │ │ │ │ │ │
│ │ Envoy AI GW │ │ vLLM: │ │ vLLM: │ │
│ │ Envoy GW │ │ Nemotron- │ │ Nemotron VL │ │
│ │ Kuadrant │ │ Llama 8B │ │ 12B FP8 │ │
│ │ llm-d EPPs │ │ Magpie TTS │ │ │ │
│ │ │ │ (CPU mode) │ │ │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
The cluster has three nodes. The CPU controller runs control-plane components (Envoy AI Gateway, Envoy Gateway proxy, Kuadrant operators, llm-d EPPs). spark-01 serves nvidia/Llama-3.1-Nemotron-Nano-8B-v1 (80% GPU utilization, BF16) and Magpie TTS (CPU mode — the GPU is fully allocated to vLLM). spark-02 serves nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8 (FP8 quantized vision-language model, 90% GPU utilization). Both use tensor_parallelism=1.
There is no external identity provider (no Keycloak, no Auth0). Tenants are Kubernetes Secrets — Authorino validates API keys by looking up Secrets directly. No database, no restarts, no config reloads. The moment you kubectl apply a tenant Secret, the API key is live.
- Client sends a request with `Authorization: Bearer <api-key>`
- Authorino searches for a Secret in `kuadrant-system` labeled `authorino.kuadrant.io/managed-by: authorino`
- Compares the `api_key` field in each Secret against the bearer token
- On match, extracts the tenant's tier (`kuadrant.io/groups`) and ID (`secret.kuadrant.io/user-id`) from annotations
- Passes this metadata downstream — `RateLimitPolicy` and `TokenRateLimitPolicy` use it to enforce per-tenant quotas
apiVersion: v1
kind: Secret
metadata:
  name: tenant-acme
  namespace: kuadrant-system
  labels:
    authorino.kuadrant.io/managed-by: authorino  # Authorino discovers this Secret
    app: token-labs
  annotations:
    kuadrant.io/groups: "pro"                    # tier: free | pro | enterprise
    secret.kuadrant.io/user-id: "acme"           # unique tenant ID (rate limit counter key)
stringData:
  api_key: "tlabs_sk_acme_..."                   # API key value

# 1. Generate a secure API key
API_KEY="tlabs_sk_$(openssl rand -hex 24)"
# 2. Create the tenant Secret (choose tier: free, pro, or enterprise)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: tenant-acme
  namespace: kuadrant-system
  labels:
    authorino.kuadrant.io/managed-by: authorino
    app: token-labs
  annotations:
    kuadrant.io/groups: "pro"
    secret.kuadrant.io/user-id: "acme"
stringData:
  api_key: "$API_KEY"
EOF
# 3. Share the API key with the client (securely, out-of-band)
echo "API Key: $API_KEY"

The client can use the key immediately — no waiting, no restart:
curl https://inference.token-labs.local/v1/chat/completions \
-H "Authorization: Bearer $API_KEY" \
-H "Content-Type: application/json" \
-d '{"model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1", "messages": [{"role": "user", "content": "Hello"}]}'

| Action | Command |
|---|---|
| List all tenants | `kubectl get secrets -n kuadrant-system -l app=token-labs` |
| Change tier | `kubectl annotate secret tenant-acme -n kuadrant-system kuadrant.io/groups=enterprise --overwrite` |
| Rotate API key | `kubectl create secret generic tenant-acme -n kuadrant-system --from-literal=api_key="$(openssl rand -hex 24)" --dry-run=client -o yaml \| kubectl apply -f -` |
| Revoke access | `kubectl delete secret tenant-acme -n kuadrant-system` |
| Tier | Requests/day | Requests/min | Tokens/day | Tokens/min |
|---|---|---|---|---|
| Free | 100 | 10 | 50,000 | 5,000 |
| Pro | 5,000 | 100 | 500,000 | 50,000 |
| Enterprise | 50,000 | 1,000 | 5,000,000 | 500,000 |
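The tiers above boil down to simple counter checks. A toy model of the admit/deny decision Limitador makes for the per-minute token limits (illustrative only; the real counters live in Redis and are windowed):

```shell
# Toy sketch of Limitador's decision: admit a request only if the
# tenant's counter plus this request's cost stays within the tier limit.
tier_token_limit_per_min() {
  case "$1" in
    free)       echo 5000 ;;
    pro)        echo 50000 ;;
    enterprise) echo 500000 ;;
    *)          echo 0 ;;
  esac
}

admit() { # admit <tier> <tokens_used_this_minute> <request_tokens>
  limit=$(tier_token_limit_per_min "$1")
  if [ $(( $2 + $3 )) -le "$limit" ]; then echo allow; else echo deny; fi
}

admit free 4800 150   # 4950 <= 5000 -> allow
admit free 4900 150   # 5050 >  5000 -> deny
```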
See docs/ARCHITECTURE.md for the full CRD inventory, request flow details, and design decisions.
Quick reference: if you've done this before and just need the commands, see `deploy/README.md`.
- MicroK8s cluster with GPU addon enabled on worker nodes
- `kubectl` v1.28+ configured for the cluster
- `helm` v3.12+
- `helmfile` v1.1+
- HuggingFace token with access to `meta-llama/Llama-3.1-8B-Instruct` and `nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8`
All scripts and commands in this guide use standard kubectl and helm. On a MicroK8s cluster, create aliases so they resolve to the MicroK8s-bundled binaries:
# Permanent system-wide aliases (recommended)
sudo snap alias microk8s.kubectl kubectl
sudo snap alias microk8s.helm helm
# Or add to ~/.bashrc / ~/.zshrc
echo 'alias kubectl="microk8s kubectl"' >> ~/.zshrc
echo 'alias helm="microk8s helm"' >> ~/.zshrc
source ~/.zshrc

Verify:
kubectl version --client
helm version

This installs the Gateway API base CRDs (Gateway, HTTPRoute, GatewayClass), the Gateway API Inference Extension CRDs (InferencePool), and the Envoy AI Gateway CRDs (AIGatewayRoute). These are the Kubernetes resource definitions that all projects build upon.
./deploy/scripts/01-install-crds.sh

What it does:
- Applies Gateway API v1.4.1 standard CRDs
- Applies Inference Extension v1.3.0 CRDs (graduated `InferencePool` at `inference.networking.k8s.io/v1`)
- Installs Envoy AI Gateway v0.5.0 CRDs (`AIGatewayRoute`)
Verify:
kubectl get crd gateways.gateway.networking.k8s.io
kubectl get crd inferencepools.inference.networking.k8s.io
kubectl get crd aigatewayroutes.aigateway.envoyproxy.io

Envoy AI Gateway (EAG) extends Envoy Gateway with AI-specific routing. Install order matters: EAG CRDs first, then the EAG controller (which creates a TLS cert Secret), then EG configured to connect to EAG via its extension manager. Redis is required for Kuadrant's distributed rate limiting (Limitador stores counters in Redis).
./deploy/scripts/02-install-envoy-ai-gateway.sh

What it does:
- Installs EAG CRDs (`ai-gateway-crds-helm` v0.5.0) — registers `AIGatewayRoute`, `AIServiceBackend`, `BackendSecurityPolicy`
- Installs the EAG controller (`ai-gateway-helm` v0.5.0) into `envoy-ai-gateway-system` and waits for it to be ready (it creates the TLS cert Secret needed by EG)
- Installs Envoy Gateway (`gateway-helm` v1.5.0) into `envoy-gateway-system` with EAG extension manager config and `InferencePool` as a valid backendRef type
- Deploys a standalone Redis instance into `redis-system` for rate-limit counters
Verify:
kubectl get pods -n envoy-ai-gateway-system # ai-gateway-controller running
kubectl get pods -n envoy-gateway-system # envoy-gateway controller running
kubectl get pods -n redis-system # redis pod running
kubectl get gatewayclass                  # "eg" class listed

Kuadrant is the policy layer. Installing the operator deploys the controller that watches for AuthPolicy, RateLimitPolicy, and TokenRateLimitPolicy CRDs. Creating the Kuadrant CR bootstraps the backing services (Authorino for auth, Limitador for rate limiting).
./deploy/scripts/03-install-kuadrant.sh

What it does:
- Adds the Kuadrant Helm repo and installs `kuadrant-operator` into `kuadrant-system`
- Creates a `Kuadrant` CR that triggers deployment of Authorino and Limitador
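The CR itself is tiny. A sketch of roughly what the script applies (API version assumed from upstream Kuadrant docs; the script is authoritative):

```yaml
# Minimal Kuadrant CR sketch: its presence triggers the operator to
# deploy Authorino and Limitador into the same namespace.
apiVersion: kuadrant.io/v1beta1
kind: Kuadrant
metadata:
  name: kuadrant
  namespace: kuadrant-system
spec: {}
```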
Verify:
kubectl get pods -n kuadrant-system # operator, authorino, limitador all running
kubectl get kuadrant -n kuadrant-system   # status should show Ready

llm-d is the inference-aware scheduling layer. It uses a 5-release Helmfile pattern:
| Chart | Release | What it deploys |
|---|---|---|
| `llm-d-infra` v1.3.6 | `llm-d-infra` | CRDs and shared infrastructure. Gateway creation is disabled (`gateway.create: false`) since we manage the Gateway resource separately via Envoy Gateway. |
| `inferencepool` v1.3.0 | `llm-d-inferencepool` | EPP for Llama 3.1 8B — the ext_proc server that performs inference-aware routing with kvCacheAware and queueDepthAware scoring. |
| `inferencepool` v1.3.0 | `llm-d-inferencepool-nemotron-vl` | EPP for Nemotron VL 12B — separate EPP instance for the vision-language model pool. |
| `llm-d-modelservice` v0.4.5 | `llm-d-modelservice` | vLLM worker for Llama 3.1 8B Instruct. 1 decode replica on spark-01. |
| `llm-d-modelservice` v0.4.5 | `llm-d-modelservice-nemotron-vl` | vLLM worker for Nemotron VL 12B FP8. 1 decode replica on spark-02 with `--trust-remote-code --quantization=modelopt`. |
# Set your HuggingFace token first
kubectl create namespace token-labs
kubectl create secret generic hf-token \
--from-literal="HF_TOKEN=${HF_TOKEN}" \
-n token-labs
./deploy/scripts/04-deploy-llm-d.sh

What it does:
- Creates the `token-labs` namespace
- Runs `helmfile apply`, which installs all 5 releases with values from `deploy/llm-d/values/`
- Waits for vLLM workers to download model weights and become ready (can take several minutes on first run)
Verify:
kubectl get pods -n token-labs # 2 vLLM pods + 2 EPP pods running
kubectl get inferencepool -n token-labs   # both pools should show Ready

This step creates the actual networking and policy resources that wire everything together:
# Gateway + AIGatewayRoute
kubectl apply -f deploy/gateway/namespace.yaml
kubectl apply -f deploy/gateway/gatewayclass.yaml
kubectl apply -f deploy/gateway/gateway.yaml
kubectl apply -f deploy/gateway/aigwroute.yaml
# Kuadrant policies
kubectl apply -f deploy/policies/

Gateway resources (`deploy/gateway/`):
- `namespace.yaml` — creates the `token-labs` namespace (idempotent)
- `gateway.yaml` — creates a `Gateway` resource with `gatewayClassName: eg`, listening on HTTP port 80 with hostname `inference.token-labs.local`. Envoy Gateway sees this and provisions an Envoy proxy pod to handle traffic.
- `aigwroute.yaml` — creates an `AIGatewayRoute` for `/v1/chat/completions`. EAG's AI filter reads the `"model"` field from the JSON request body and sets the `x-ai-eg-model` header. Each rule matches on this header and routes to the correct `InferencePool` backend. The `InferencePool` is the bridge to llm-d's EPP — Envoy invokes the EPP via ext_proc to pick the optimal vLLM pod for each request.
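The shape of that route, sketched. The schema here is approximated from EAG's documented API and the backend name is assumed; field names may differ across versions, so treat the repo's `aigwroute.yaml` as authoritative:

```yaml
# Approximate sketch: one rule per model, matching on the header EAG
# sets after reading the "model" field from the request body.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIGatewayRoute
metadata:
  name: llm-inference
  namespace: token-labs
spec:
  parentRefs:
    - name: token-labs-gateway
  rules:
    - matches:
        - headers:
            - type: Exact
              name: x-ai-eg-model
              value: nvidia/Llama-3.1-Nemotron-Nano-8B-v1
      backendRefs:
        - group: inference.networking.k8s.io
          kind: InferencePool
          name: llm-d-inferencepool   # name assumed; see the repo manifest
```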
Kuadrant policies (`deploy/policies/`):

- `kuadrant.yaml` — the `Kuadrant` CR (idempotent, already created in step 3)
- `auth-policy.yaml` — `AuthPolicy` targeting the Gateway. Configures API key authentication: Authorino validates the `Authorization: Bearer <key>` header by looking up Secrets labeled `app: token-labs`. On match, it extracts `kuadrant.io/groups` (tier) and `secret.kuadrant.io/user-id` (tenant ID) from annotations and passes them in the request context. An OPA policy validates the tier is one of `free`, `pro`, or `enterprise`.
- `rate-limit-policy.yaml` — `RateLimitPolicy` targeting the Gateway. Defines per-tier request count limits (e.g., free = 10/min and 100/day). Uses `when` predicates with CEL expressions to match `auth.identity.groups` and `counters` keyed by `auth.identity.userid` for tenant isolation.
- `token-rate-limit-policy.yaml` — `TokenRateLimitPolicy` targeting the `AIGatewayRoute` for `/v1/chat/completions`. After vLLM returns a response, Kuadrant's wasm-shim parses `usage.total_tokens` from the JSON body and sends it to Limitador as `hits_addend`. Each tenant's cumulative token usage is tracked per time window.
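An abbreviated sketch of the API-key portion of the auth policy, using `AuthPolicy` fields as documented upstream by Kuadrant (the repo's `auth-policy.yaml` is authoritative; the metadata-extraction and OPA parts are omitted here):

```yaml
# Sketch: validate the Bearer token against Secrets labeled app: token-labs.
apiVersion: kuadrant.io/v1
kind: AuthPolicy
metadata:
  name: api-key-auth
  namespace: token-labs
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: Gateway
    name: token-labs-gateway
  rules:
    authentication:
      api-key-users:
        apiKey:
          selector:
            matchLabels:
              app: token-labs
        credentials:
          authorizationHeader:
            prefix: Bearer
```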
Verify:
kubectl get gateway -n token-labs # Programmed: True
kubectl get aigatewayroute -n token-labs # llm-inference listed
kubectl get authpolicy -n token-labs # Accepted: True
kubectl get ratelimitpolicy -n token-labs # Accepted: True
kubectl get tokenratelimitpolicy -n token-labs   # Accepted: True

Before applying the tenant manifests, set a real company name and generate a unique API key for each tenant. The files in `deploy/tenants/` use placeholder values that must be replaced.
Using the template for a new tenant:
COMPANY="acme"
TIER="pro" # free | pro | enterprise
API_KEY="tlabs_sk_$(openssl rand -hex 24)"
sed \
-e "s/COMPANY-NAME/${COMPANY}/g" \
-e "s/\"pro\"/\"${TIER}\"/g" \
-e "s/tlabs_CHANGEME/${API_KEY}/g" \
deploy/tenants/tenant-template.yaml \
| kubectl apply -f -
echo "Tenant: ${COMPANY} Key: ${API_KEY}"

Editing the demo tenant files before applying:
Open `deploy/tenants/tenant-free-demo.yaml` and `tenant-pro-demo.yaml` and update:

- `metadata.name` — e.g. `tenant-acme-free`
- `secret.kuadrant.io/user-id` — unique identifier used as the rate-limit counter key
- `api_key` — replace the placeholder with a generated key:
# Generate keys for each tenant
echo "Free tier key: tlabs_sk_$(openssl rand -hex 24)"
echo "Pro tier key:  tlabs_sk_$(openssl rand -hex 24)"

Then apply:
kubectl apply -f deploy/tenants/Verify (keys are live immediately, no restart needed):
kubectl get secrets -n kuadrant-system -l app=token-labs
# Quick smoke test
GATEWAY_IP=$(kubectl get gateway token-labs-gateway -n token-labs \
-o jsonpath='{.status.addresses[0].value}')
curl -s -o /dev/null -w "%{http_code}" \
http://${GATEWAY_IP}/v1/models \
-H "Host: inference.token-labs.local" \
-H "Authorization: Bearer <your-key>"
# Expect: 200

Magpie TTS runs on spark-01 in CPU mode (the GB10 GPU is fully allocated to the Nemotron-Llama vLLM pod). It exposes an OpenAI-compatible `/v1/audio/speech` endpoint backed by `nvidia/magpie_tts_multilingual_357m`.
Build the image (must be built natively on spark-01 — see build notes in deploy/README.md):
# On spark-01
docker build -t ghcr.io/elizabetht/token-labs/magpie-tts:latest services/magpie-tts
docker push ghcr.io/elizabetht/token-labs/magpie-tts:latest

If the GHCR package is private, create a pull secret first:
kubectl create secret docker-registry ghcr-pull-secret \
--docker-server=ghcr.io \
--docker-username=elizabetht \
--docker-password=<GITHUB_PAT_read:packages> \
-n token-labs

Deploy:
kubectl apply -f deploy/magpie-tts/

The model (`nvidia/magpie_tts_multilingual_357m`) downloads from NGC on first pod startup — allow ~1–2 min.
Verify:
kubectl get pods -n token-labs -l app=magpie-tts # 1 pod running
kubectl logs -n token-labs -l app=magpie-tts   # look for "model loaded successfully"

Riva STT proxies `/v1/audio/transcriptions` to `integrate.api.nvidia.com`. Requires an NVIDIA API key.
kubectl create secret generic nvidia-nim-api-key \
--from-literal=apiKey=nvapi-CHANGEME \
-n token-labs
./deploy/scripts/05-deploy-services.sh

Verify:
kubectl get httproute -n token-labs # riva-stt listed
kubectl get backendsecuritypolicy -n token-labs   # nvidia-nim-api-key listed

GATEWAY_IP=$(kubectl get gateway token-labs-gateway -n token-labs \
-o jsonpath='{.status.addresses[0].value}')
# Chat completion — Nemotron-Llama 8B (spark-01)
curl -s http://${GATEWAY_IP}/v1/chat/completions \
-H "Host: inference.token-labs.local" \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/Llama-3.1-Nemotron-Nano-8B-v1",
"messages": [{"role": "user", "content": "What is Kubernetes?"}],
"max_tokens": 200
}' | jq
# Vision-language — Nemotron VL 12B FP8 (spark-02)
curl -s http://${GATEWAY_IP}/v1/chat/completions \
-H "Host: inference.token-labs.local" \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{
"model": "nvidia/NVIDIA-Nemotron-Nano-12B-v2-VL-FP8",
"messages": [{"role": "user", "content": "Describe this image."}],
"max_tokens": 200
}' | jq
# Verify rate limiting (free tier — should get 429 after 10 requests/min)
for i in $(seq 1 15); do
echo -n "Request $i: "
curl -s -o /dev/null -w "%{http_code}" \
http://${GATEWAY_IP}/v1/chat/completions \
-H "Host: inference.token-labs.local" \
-H "Authorization: Bearer <your-free-tier-key>" \
-H "Content-Type: application/json" \
-d '{"model":"nvidia/Llama-3.1-Nemotron-Nano-8B-v1","messages":[{"role":"user","content":"Hi"}],"max_tokens":5}'
echo
done

Optional — test audio services (only after deploying Magpie TTS / Riva STT):
# Text-to-speech — Magpie TTS (spark-01, CPU mode)
curl -s http://${GATEWAY_IP}/v1/audio/speech \
-H "Host: inference.token-labs.local" \
-H "Authorization: Bearer <your-api-key>" \
-H "Content-Type: application/json" \
-d '{"input": "Welcome to Token Labs.", "voice": "aria"}' \
--output speech.wav
# Speech-to-text — Riva STT via NVIDIA NIM
curl -s http://${GATEWAY_IP}/v1/audio/transcriptions \
-H "Host: inference.token-labs.local" \
-H "Authorization: Bearer <your-api-key>" \
-F "file=@audio.wav" \
-F "model=nvidia/parakeet-ctc-1.1b"

The stack exposes metrics from all layers via Prometheus ServiceMonitors:
# Optional: deploy ServiceMonitors
kubectl apply -f deploy/monitoring/service-monitors.yaml

| Source | Key Metrics |
|---|---|
| Limitador | `limitador_counter_hits_total` — per-tenant request/token counts |
| Authorino | `auth_server_response_status` — auth allow/deny rates |
| vLLM | `vllm:kv_cache_usage_perc`, `vllm:request_latency_seconds` |
| EPP | Routing decisions, prefix-cache hit rates |
llm-d also provides ready-made Grafana dashboards — see docs/ARCHITECTURE.md for details.
| Metric | Prefill (Input) | Decode (Output) |
|---|---|---|
| Throughput | 3,203 tok/s | 520 tok/s |
| Cost/1M tokens | $0.006 | $0.037 |
Uses lighteval with the IFEval benchmark to verify model quality across quantizations. Models are compared against the meta-llama/Llama-3.1-8B-Instruct baseline using a ±5% threshold. See baselines/README.md for details.
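A toy version of that ±5% gate, to make the acceptance rule concrete (illustrative only; the real comparison is driven by lighteval scores and `baselines/`):

```shell
# Toy check mirroring the +/-5% relative-accuracy threshold.
within_threshold() { # within_threshold <score> <baseline>
  awk -v s="$1" -v b="$2" \
    'BEGIN { d = (s - b) / b; if (d < 0) d = -d; exit !(d <= 0.05) }'
}

within_threshold 0.78 0.80 && echo PASS || echo FAIL   # 2.5% drop  -> PASS
within_threshold 0.70 0.80 && echo PASS || echo FAIL   # 12.5% drop -> FAIL
```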
├── deploy/
│ ├── scripts/ # Installation scripts (run in order)
│ │ ├── 01-install-crds.sh # Gateway API + Inference Extension CRDs
│ │ ├── 02-install-envoy-ai-gateway.sh # EAG v0.5.0 + EG v1.5.0 + Redis
│ │ ├── 03-install-kuadrant.sh # Kuadrant (Authorino + Limitador)
│ │ ├── 04-deploy-llm-d.sh # llm-d helmfile (5 releases)
│ │ └── 05-deploy-services.sh # Magpie TTS + Riva STT
│ ├── gateway/ # Gateway + AIGatewayRoute + Envoy Gateway values
│ │ ├── namespace.yaml
│ │ ├── gateway.yaml # Gateway (gatewayClassName: eg)
│ │ ├── aigwroute.yaml # AIGatewayRoute: model → InferencePool
│ │ └── envoy-gateway-values.yaml # EG helm values: EAG extension manager
│ ├── llm-d/ # Helmfile + values for llm-d 5-release deploy
│ │ ├── helmfile.yaml.gotmpl
│ │ └── values/
│ ├── magpie-tts/ # Magpie TTS deployment + HTTPRoute
│ ├── riva-stt/ # Riva STT → NVIDIA NIM proxy
│ │ ├── backend.yaml # EG Backend + BackendTLSPolicy + BackendSecurityPolicy
│ │ ├── httproute.yaml # HTTPRoute: /v1/audio/transcriptions → NVIDIA NIM
│ │ └── secret-template.yaml # NVIDIA API key Secret template
│ ├── policies/ # Kuadrant AuthPolicy, RateLimitPolicy, TokenRateLimitPolicy
│ ├── tenants/ # Tenant API key Secrets (template + demos)
│ └── monitoring/ # Prometheus ServiceMonitors
├── services/
│ └── magpie-tts/ # FastAPI TTS wrapper (server.py, Dockerfile)
├── docs/
│ ├── ARCHITECTURE.md # Full architecture deep-dive
│ ├── index.html # Live demo page
│ └── benchmark-results.* # Benchmark data
├── baselines/ # Accuracy baseline values
├── scripts/ # Benchmark and analysis scripts
├── Dockerfile # vLLM build for ARM64
└── .github/workflows/ # CI/CD pipelines
MIT