
How to Stand Up an AI Gateway on Kubernetes for Inference Workloads

Intermediate · 1 hour · 21 min read · Byte Smith

Before you begin

  • A Kubernetes cluster with a Gateway API-capable controller installed
  • At least one working inference service or model endpoint already deployed
  • Cluster access with permission to create Namespaces, Gateway API resources, and NetworkPolicies
  • A basic observability stack such as Prometheus, OpenTelemetry, or controller-native metrics
  • A staging environment or isolated namespace where you can test burst and failure scenarios

What you'll learn

  • Identify which AI traffic patterns need different gateway behavior
  • Design a minimal v1 AI gateway without overengineering it
  • Deploy a basic Gateway API entry layer for inference services
  • Add model-version routing, tenant boundaries, and rollout controls
  • Apply timeouts, backpressure thinking, and observability for inference traffic
  • Secure the gateway with namespace, network, and audit boundaries
  • Test the design with burst traffic, mixed requests, and controlled failures

AI traffic is not just “normal API traffic with bigger JSON.” Inference requests are often more expensive, more latency-sensitive, longer-lived, and more variable than standard web traffic. Streaming responses hold connections open. Tool-calling agents fan out into bursts of smaller requests. Batch inference can look like background throughput rather than interactive latency. And in many environments, the gateway also becomes the place where model access, cost control, and tenant boundaries start to matter.

That is why Kubernetes platform teams should care now. The Kubernetes AI Gateway Working Group was launched specifically to define standards and best practices for networking infrastructure that supports AI workloads, and the official Gateway API Inference Extension project is already working on model-aware routing, endpoint picking, and other inference-specific patterns on top of Gateway API. In other words, this is no longer just a vendor-specific product category. It is becoming a Kubernetes platform concern.

This tutorial shows how to build a practical v1 AI gateway on Kubernetes without pretending the standards are more mature than they are. You will stand up a basic Gateway API entry layer, route requests to inference services, add model-version and tenant-aware patterns, tune reliability controls, add observability, secure the design, and then test it under realistic load. The goal is not to build the most advanced inference gateway possible on day one. The goal is to build a clean platform baseline that you can evolve safely. For broader context on how Kubernetes networking is adapting for AI workloads, see Kubernetes AI Workloads and Networking.

Step 1: Identify your AI traffic types

Before you write any routing rules, decide what kinds of inference traffic you actually have. Kubernetes now has an official AI Gateway effort because AI workloads introduce traffic patterns that need different networking behavior, including token-aware rate limiting, payload inspection, AI-specific routing, and AI-specific protocols. The inference extension project goes even further and introduces concepts like model-aware routing, serving priorities, endpoint picking, and latency-aware scheduling based on model-server metrics.

Classify interactive inference

Interactive inference is what most teams mean first: chat completions, summarization, embeddings on demand, assistant UX, and similar user-facing calls. These requests are usually sensitive to:

  • end-to-end latency
  • time to first token
  • streaming continuity
  • model selection
  • per-tenant quotas

If you have interactive traffic, your gateway should be optimized for predictable response behavior and controlled rollout rather than maximum throughput alone.

Classify batch inference

Batch inference usually looks different:

  • larger bursts
  • less sensitivity to first-token latency
  • more tolerance for queueing
  • more emphasis on throughput and fairness
  • scheduled or asynchronous execution patterns

You do not want to tune your entire gateway around interactive chat if half your load is offline embeddings or nightly enrichment jobs.

Classify tool-calling agents

Tool-calling agents often produce traffic that is easy to underestimate:

  • many small calls
  • bursts after planning phases
  • retries from the caller if a tool step fails
  • mixed workloads across multiple models or endpoints

This matters because the gateway may need tighter quotas, idempotency awareness, or stronger observability per route and per tenant.

Classify streaming responses

Streaming inference is where teams get into trouble fastest. Gateway API now has standard support for route timeouts, but retry support is still under active evolution and is especially nuanced around streaming or bidirectional patterns. Treat long-lived streams as a distinct class, not just “the same route with a different body.”

Create a simple traffic classification document before moving on.

File: traffic-profile.yaml

interactive_inference:
  examples:
    - chat-completions
    - real-time-summarization
  latency_sensitive: true
  streaming: true
  retry_friendly: false

batch_inference:
  examples:
    - offline-embeddings
    - nightly-document-labeling
  latency_sensitive: false
  streaming: false
  retry_friendly: true

tool_calling_agents:
  examples:
    - planner-executor-agents
    - retrieval-agents
  latency_sensitive: mixed
  streaming: mixed
  retry_friendly: mixed

streaming_responses:
  examples:
    - server-sent-events
    - token-streaming-chat
  latency_sensitive: true
  streaming: true
  retry_friendly: false
Tip

Keep your first version of the gateway opinionated about traffic classes. A route that serves low-latency chat should not inherit the same timeout and backpressure assumptions as a background embedding job.

You should now have a traffic profile that tells you which paths need low latency, which can tolerate queueing, and which should avoid retries entirely.

Step 2: Decide what the gateway should do

A gateway becomes messy when it takes on responsibilities by accident. The Kubernetes AI Gateway effort explicitly frames AI gateways as infrastructure that can enforce policy, access control, payload inspection, token-aware limiting, and AI-specific routing. That does not mean your first version should do all of those things at once. It means you should decide the scope up front.

Start with a small responsibility set

For a first release, your AI gateway should usually do these things:

  • expose one or more stable entry points
  • route requests to the correct inference service
  • isolate tenants or environments
  • enforce basic auth and admission boundaries
  • emit reliable logs and metrics
  • support safe model rollouts

What it usually should not do in v1:

  • perform full semantic prompt inspection inline
  • become your only quota system
  • implement every provider fallback pattern
  • hide all model-server behavior from the application team

Define the routing dimensions

Most teams need one or more of these:

  • route by model family such as chat versus embeddings
  • route by model version such as v1 versus v2
  • route by tenant using hostnames, paths, or headers
  • route by traffic class such as interactive versus batch

Write the policy down. Do not leave it inside Helm values or controller defaults.

File: ai-gateway-responsibilities.yaml

entrypoints:
  - tenant-hostnames
  - shared-api-domain

routing_dimensions:
  - route_by_path
  - route_by_model_version
  - route_by_tenant

required_controls:
  - authn
  - authz
  - request_logging
  - per_route_timeouts
  - rollout_support

deferred_for_v2:
  - semantic_payload_inspection
  - provider_failover
  - token_based_rate_limiting
  - body_based_model_routing

Decide whether the gateway or a router service owns model selection

This is one of the most important design decisions.

If your clients call distinct endpoints like /v1/chat/completions and /v1/embeddings, standard HTTPRoute is enough for a good v1 design. If your clients send an OpenAI-compatible request where the model name is only in the request body, standard Gateway API alone is not enough for model-aware routing. That is precisely why the inference extension project introduces body-based routing and endpoint-picking patterns.

For v1, choose one of these:

  • Simple path or hostname routing if your API surface already separates workloads cleanly
  • A thin internal router service if the model name only exists in the body and you want to keep the external API stable
  • An inference-aware extension later, if body-aware routing becomes worth the complexity
Note

If your API is OpenAI-compatible and the model name lives in the JSON body, do not pretend plain path-based Gateway rules can solve model-aware routing by themselves.
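
One middle ground that stays within standard Gateway API is a header match: if your clients can send the model as a request header in addition to the body, the route can select backends without any body parsing. A sketch, assuming a hypothetical x-model header that clients agree to set:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: tenant-a-model-header
  namespace: tenant-a
spec:
  parentRefs:
    - name: ai-shared-gateway
      namespace: ai-gateway
      sectionName: https
  hostnames:
    - "tenant-a.ai.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/chat/completions
          headers:
            - name: x-model          # hypothetical client-supplied model hint
              value: llama3-8b-v2    # header match type defaults to Exact
      backendRefs:
        - name: chat-llama3-8b-v2
          namespace: ai-models
          port: 8000
```

This only works if you control the clients; it does not help with strict OpenAI-compatible callers that put the model name in the body alone.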

You should now know what your gateway owns, what it does not own, and whether model choice happens in standard Gateway rules, in a router service, or in a more advanced inference-aware layer later.

Step 3: Deploy a basic gateway layer

The cleanest place to start is a shared Gateway in a platform namespace, model services in a model-serving namespace, and tenant routes in tenant namespaces. Gateway API is designed for this kind of separation, and cross-namespace routing is a first-class pattern. When a route points to a backend in another namespace, the target namespace must explicitly allow it with a ReferenceGrant.

Create namespaces and labels

File: namespaces.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: ai-gateway
---
apiVersion: v1
kind: Namespace
metadata:
  name: ai-models
---
apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
  labels:
    ai-routes: "enabled"

Apply them:

kubectl apply -f namespaces.yaml

Create the shared Gateway

This example assumes your controller has already installed a GatewayClass. Replace your-gateway-class with the real class name from your environment.

File: gateway.yaml

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: ai-shared-gateway
  namespace: ai-gateway
spec:
  gatewayClassName: your-gateway-class
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      hostname: "*.ai.example.com"
      tls:
        mode: Terminate
        certificateRefs:
          - name: ai-example-com-tls
      allowedRoutes:
        namespaces:
          from: Selector
          selector:
            matchLabels:
              ai-routes: "enabled"

Apply it:

kubectl apply -f gateway.yaml
kubectl get gateway -n ai-gateway

Expose your inference services

This tutorial assumes you already have model-serving Deployments running. Create stable Services for them.

File: model-services.yaml

apiVersion: v1
kind: Service
metadata:
  name: chat-llama3-8b-v1
  namespace: ai-models
spec:
  selector:
    app: chat-llama3-8b-v1
  ports:
    - name: http
      port: 8000
      targetPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: embed-bge-small-v1
  namespace: ai-models
spec:
  selector:
    app: embed-bge-small-v1
  ports:
    - name: http
      port: 8000
      targetPort: 8000

Apply it:

kubectl apply -f model-services.yaml

Allow cross-namespace backend references

Because the HTTPRoute will live in tenant-a and the backends live in ai-models, you need a ReferenceGrant in the backend namespace.

File: referencegrant.yaml

apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-tenant-a-routes
  namespace: ai-models
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: tenant-a
  to:
    - group: ""
      kind: Service

Apply it:

kubectl apply -f referencegrant.yaml

Create the initial HTTPRoute

File: tenant-a-routes.yaml

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: tenant-a-inference
  namespace: tenant-a
spec:
  parentRefs:
    - name: ai-shared-gateway
      namespace: ai-gateway
      sectionName: https
  hostnames:
    - "tenant-a.ai.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/chat/completions
      backendRefs:
        - name: chat-llama3-8b-v1
          namespace: ai-models
          port: 8000
    - matches:
        - path:
            type: PathPrefix
            value: /v1/embeddings
      backendRefs:
        - name: embed-bge-small-v1
          namespace: ai-models
          port: 8000

Apply it and verify:

kubectl apply -f tenant-a-routes.yaml
kubectl get httproute -n tenant-a
kubectl describe httproute tenant-a-inference -n tenant-a
Info

Use HTTPRoute for HTTP or OpenAI-compatible APIs, and GRPCRoute when your serving path is truly gRPC-native. Both are GA, but keep gRPC and regular HTTP on separate hostnames when possible for cleaner operations.
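
For reference, the gRPC variant mirrors the HTTP route. A sketch, assuming a hypothetical gRPC-native Service chat-grpc-v1 in ai-models on a separate hostname; you would also need to extend the ReferenceGrant to cover GRPCRoute, since the earlier grant only allows HTTPRoute:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GRPCRoute
metadata:
  name: tenant-a-grpc-inference
  namespace: tenant-a
spec:
  parentRefs:
    - name: ai-shared-gateway
      namespace: ai-gateway
  hostnames:
    - "grpc.tenant-a.ai.example.com"
  rules:
    - matches:
        - method:
            service: inference.v1.ChatService   # hypothetical gRPC service name
      backendRefs:
        - name: chat-grpc-v1                    # hypothetical gRPC backend Service
          namespace: ai-models
          port: 9000
```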

You should now have a basic shared AI gateway, two model-serving Services, and a tenant route that forwards chat and embeddings traffic through Gateway API.

Step 4: Add AI-aware routing

Now that the baseline works, add the kinds of routing that matter specifically for inference. The official inference extension project focuses on model-aware routing, serving priority, rollouts, and endpoint selection based on model-server metrics such as cache state and queue depth. That is the long-term direction. For v1, keep it simpler: use Gateway API for model-version rollout and tenant boundaries first, and only introduce body-based routing or endpoint-picking when you have a clear need.

Add model-version canary routing

Gateway API supports weighted traffic splitting through backendRefs, which makes it a good fit for model rollouts.

File: tenant-a-chat-canary.yaml

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: tenant-a-chat-canary
  namespace: tenant-a
spec:
  parentRefs:
    - name: ai-shared-gateway
      namespace: ai-gateway
      sectionName: https
  hostnames:
    - "tenant-a.ai.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/chat/completions
      backendRefs:
        - name: chat-llama3-8b-v1
          namespace: ai-models
          port: 8000
          weight: 90
        - name: chat-llama3-8b-v2
          namespace: ai-models
          port: 8000
          weight: 10

This is the cleanest way to do model rollouts when the external API stays the same and only the serving implementation changes.

Add tenant isolation by hostname or route ownership

Kubernetes multi-tenancy guidance recommends namespace boundaries, RBAC, quotas, and network policies as foundational controls for shared clusters. In practice, the easiest AI gateway isolation model is:

  • one shared Gateway
  • one route-owning namespace per tenant or app team
  • one model-serving namespace or per-tenant model namespaces where needed
  • explicit ReferenceGrant where cross-namespace access is allowed

That keeps tenant routing ownership separate from platform-owned entry points.

Keep fallback simple in v1

True inference fallback is more subtle than normal HTTP failover. You may want to fail from one model version to another, from one pool to another, or from self-hosted inference to an external provider. The AI Gateway Working Group’s active egress proposals are explicitly aimed at external AI services, failover, compliance routing, and secure token injection, which tells you this is still an evolving standard area.

For v1, use one of these:

  • a dedicated router service behind your Gateway for complex fallback logic
  • a model-specific route with a controlled canary split
  • an implementation-specific inference extension if you already committed to that stack

A clean pattern is to keep the public route stable and send complex fallback logic to an internal router service:

File: tenant-a-chat-router.yaml

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: tenant-a-chat-router
  namespace: tenant-a
spec:
  parentRefs:
    - name: ai-shared-gateway
      namespace: ai-gateway
      sectionName: https
  hostnames:
    - "tenant-a.ai.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/chat/completions
      backendRefs:
        - name: llm-router
          namespace: tenant-a
          port: 8080

That router can make policy-aware fallback decisions without forcing every inference concern into Gateway YAML.

Tip

Gateway API is excellent at stable entry points, rollout traffic splits, and ownership boundaries. Use it for those first. Add body-aware routing or endpoint-picking only when your traffic actually needs it.

You should now have a way to do safe model-version rollout, a clear tenant isolation pattern, and a realistic v1 stance on fallback logic.

Step 5: Add reliability controls

Inference traffic fails differently from a normal CRUD API. Some requests are expensive and long-running. Some streams should never be retried automatically. Some model servers queue internally before responding. That means reliability is a mix of gateway settings and model-server behavior.

Set explicit route timeouts

Gateway API now provides standard route timeout fields on HTTPRoute rules. Use them. Do not leave critical inference paths entirely at controller defaults.

File: tenant-a-route-timeouts.yaml

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: tenant-a-inference-with-timeouts
  namespace: tenant-a
spec:
  parentRefs:
    - name: ai-shared-gateway
      namespace: ai-gateway
      sectionName: https
  hostnames:
    - "tenant-a.ai.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/chat/completions
      timeouts:
        request: 120s
        backendRequest: 115s
      backendRefs:
        - name: chat-llama3-8b-v1
          namespace: ai-models
          port: 8000
    - matches:
        - path:
            type: PathPrefix
            value: /v1/embeddings
      timeouts:
        request: 30s
        backendRequest: 25s
      backendRefs:
        - name: embed-bge-small-v1
          namespace: ai-models
          port: 8000

These numbers are examples. The right values depend on your models and user expectations. The important point is that chat and embeddings should not inherit the same timeout policy blindly.

Be conservative with retries

Gateway retry work exists, but it is still an area where semantics vary and streaming is especially tricky. For streaming chat, default to no automatic gateway retries unless you have explicitly tested the behavior end to end. For non-streaming idempotent requests such as some embedding workloads, limited retries may be acceptable if your implementation supports them and your backend semantics are safe.
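
If you do decide to enable limited retries on a non-streaming route, the Gateway API retry stanza (from GEP-1731) is the standard-track way to express it. This is a sketch only: as of this writing the field lives in the experimental channel, and support and semantics vary by controller, so verify end to end before relying on it:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: tenant-a-embeddings-retry
  namespace: tenant-a
spec:
  parentRefs:
    - name: ai-shared-gateway
      namespace: ai-gateway
      sectionName: https
  hostnames:
    - "tenant-a.ai.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /v1/embeddings
      retry:
        codes: [503]     # retry only on explicit backend unavailability
        attempts: 2
        backoff: 100ms
      backendRefs:
        - name: embed-bge-small-v1
          namespace: ai-models
          port: 8000
```

Note that this targets only the embeddings route. The streaming chat route should not get a retry stanza at all.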

Add backend reliability basics

Do not focus only on the gateway. Your model-serving workloads also need stable disruption behavior.

File: chat-pdb.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: chat-llama3-8b-v1
  namespace: ai-models
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: chat-llama3-8b-v1

This will not solve overload, but it protects you from avoidable voluntary disruption during maintenance.

Decide where queueing and backpressure live

For v1, the safest rule is:

  • Gateway: enforce connection-level and route-level boundaries
  • Model server or router: own queueing and request-shedding decisions
  • Autoscaling: react to actual demand signals where possible

If your stack supports inference-aware queue visibility or endpoint picking later, you can evolve toward it. The inference extension project is specifically built around endpoint selection and model-server metrics like queue length and cache state, which is a strong sign that generic L7 load balancing is not enough for mature inference workloads.
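
If your model server already exposes a queue-depth metric through a custom metrics adapter, autoscaling can react to that demand signal directly. A sketch using a standard autoscaling/v2 HorizontalPodAutoscaler; the metric name model_queue_depth and the target value are assumptions for illustration, and the setup requires a metrics adapter serving the custom metrics API:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: chat-llama3-8b-v1
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: chat-llama3-8b-v1
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Pods
      pods:
        metric:
          name: model_queue_depth   # hypothetical per-pod metric from your serving stack
        target:
          type: AverageValue
          averageValue: "4"         # scale out when pods average more than 4 queued requests
```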

Warning

Do not enable retries on streaming routes just because your controller supports retries somewhere. Streaming failures often need caller-aware recovery, not blind replay.

You should now have explicit timeouts, a conservative retry stance, and a clear separation between gateway reliability controls and model-server backpressure behavior.

Step 6: Add observability

An AI gateway without observability becomes a cost amplifier and a debugging bottleneck. At minimum, you want to understand:

  • request volume
  • request latency
  • failure rate
  • tenant distribution
  • model distribution
  • streaming versus non-streaming
  • token and cost signals when available

The official inference extension work also highlights model-server metrics, time-to-first-token style objectives, and model-aware scheduling decisions, which are a good signal for what mature observability will need to include.

Standardize a gateway log contract

Even if your exact metrics pipeline changes later, define a structured log contract now.

File: gateway-log-example.json

{
  "timestamp": "2026-03-10T15:22:11.418Z",
  "request_id": "req_01JPN8XQZ0Y1M0AX3R0D4M9J7E",
  "tenant": "tenant-a",
  "hostname": "tenant-a.ai.example.com",
  "route": "/v1/chat/completions",
  "model_route": "chat-llama3-8b-v1",
  "stream": true,
  "http_status": 200,
  "latency_ms": 1840,
  "ttft_ms": 420,
  "prompt_tokens": 612,
  "completion_tokens": 238,
  "estimated_cost_usd": 0.0194,
  "gateway_backend": "chat-llama3-8b-v1.ai-models.svc.cluster.local:8000"
}

If you cannot get every field on day one, that is fine. But tenant, route, backend, status, latency, and stream/non-stream should be present from the start.
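
As an illustration of how a field like estimated_cost_usd can be derived from the token counts, here is a small shell sketch. The per-token rates are hypothetical placeholders; substitute your provider's pricing or your own serving cost model:

```shell
# Token counts from the example log entry above
prompt_tokens=612
completion_tokens=238
# awk handles the floating-point math; POSIX shells only do integer arithmetic.
# Assumed rates: $0.00002 per prompt token, $0.00003 per completion token.
awk -v p="$prompt_tokens" -v c="$completion_tokens" \
  'BEGIN { printf "estimated_cost_usd=%.4f\n", p * 0.00002 + c * 0.00003 }'
# prints estimated_cost_usd=0.0194, matching the log example
```

Whatever the exact rates, compute the estimate in one place (gateway or router) so every tenant and route reports cost the same way.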

Create recording rules for owned metrics

If you operate a custom router or adapter layer, export your own metrics with stable names and labels.

File: prometheus-rules.yaml

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ai-gateway-recording-rules
  namespace: ai-gateway
spec:
  groups:
    - name: ai-gateway.rules
      rules:
        - record: ai_gateway:requests_per_second
          expr: sum(rate(ai_gateway_requests_total[5m])) by (tenant, route, model_route)
        - record: ai_gateway:error_rate
          expr: sum(rate(ai_gateway_requests_total{status=~"5.."}[5m])) by (tenant, route, model_route) / sum(rate(ai_gateway_requests_total[5m])) by (tenant, route, model_route)
        - record: ai_gateway:p95_latency_ms
          expr: histogram_quantile(0.95, sum(rate(ai_gateway_request_duration_ms_bucket[5m])) by (le, tenant, route, model_route))

These are example metric names for your adapter or gateway telemetry. The exact source may be your controller, your router service, or both.

Track gateway and backend metrics separately

Keep these categories distinct:

  • gateway request latency
  • backend model latency
  • queue depth
  • saturation or shedding
  • token throughput
  • cost or estimated cost
  • per-model success rate

If you merge everything into a single “AI latency” number, you will not know whether the problem is routing, the controller, the model server, or upstream GPU pressure.

Info

Inference observability is not just HTTP observability. You need enough context to answer: which tenant, which route, which model, what latency, how many tokens, and what it cost.

You should now have a structured observability plan that covers gateway behavior, model behavior, and tenant or model-level cost visibility.

Step 7: Secure it

Your AI gateway is both a network boundary and a policy boundary. That makes it a natural place to enforce access, tenant separation, and audit visibility. Kubernetes’ own multi-tenancy guidance emphasizes namespaces, RBAC, quotas, and network policies as core building blocks for isolating tenants and avoiding noisy-neighbor or over-permission problems in shared clusters.

Isolate model services from the rest of the cluster

Do not let every workload call your model-serving namespace directly. Use a NetworkPolicy so only the gateway namespace can reach model-serving pods.

File: model-networkpolicy.yaml

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: only-gateway-to-models
  namespace: ai-models
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: ai-gateway
      ports:
        - protocol: TCP
          port: 8000

Apply it:

kubectl apply -f model-networkpolicy.yaml

Put tenant budgets somewhere explicit

Request or token quotas are often controller-specific or app-specific, but you should still isolate tenant resource consumption at the namespace level.

File: tenant-a-quota.yaml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-ai-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "50"

This is not a request-per-minute quota. It is a cluster fairness boundary. Keep traffic-level quotas in your gateway or router layer, but do not forget cluster-level fairness.
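
A LimitRange alongside the quota keeps any single tenant pod from claiming the whole namespace budget by omitting its own requests and limits. The values below are placeholders to adjust for your workloads:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-a-defaults
  namespace: tenant-a
spec:
  limits:
    - type: Container
      default:            # applied as limits when a container sets none
        cpu: "2"
        memory: 4Gi
      defaultRequest:     # applied as requests when a container sets none
        cpu: "500m"
        memory: 1Gi
```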

Make auth and audit explicit

Authentication and authorization at the gateway will depend on your implementation. Some stacks will use JWT or mTLS policy attachment. Others will use external auth services. Your minimum secure baseline should still be the same:

  • authenticate callers before inference access
  • authorize by tenant and route
  • log who called what
  • avoid passing raw sensitive prompt content into broad logs
  • record enough audit metadata to reconstruct access later

A simple audit contract is enough to start.

File: audit-log-example.json

{
  "timestamp": "2026-03-10T15:40:02.112Z",
  "tenant": "tenant-a",
  "subject": "svc:agent-platform",
  "route": "/v1/chat/completions",
  "model_route": "chat-llama3-8b-v1",
  "decision": "allow",
  "request_id": "req_01JPN92C7S49YJ1E0QY6T5G6P3",
  "sensitive_payload_logged": false
}

Protect sensitive prompt and response data

For v1, keep this simple:

  • do not log full prompts by default
  • do not log full streamed responses by default
  • redact secrets or obvious identifiers in middleware where possible
  • reserve payload inspection for explicit policy use cases

The AI Gateway Working Group is actively working on payload processing and AI traffic guardrail proposals, which is a strong indicator that inline prompt/response inspection is real platform territory, but it is still an emerging standard area. Do not overbuild it on day one.

Warning

Your gateway should be able to answer “who called which model route and what happened” without storing entire prompts or responses in general-purpose logs.

You should now have network isolation around model services, namespace-level tenant boundaries, and a clear starting point for auth and audit controls.

Step 8: Test with a realistic workload

Now validate the design under conditions that look more like production. AI traffic is rarely uniform. You want burst traffic, mixed routes, streaming, and at least one backend failure.

Send burst traffic to a non-streaming route

Create an embeddings payload.

File: embeddings.json

{
  "input": [
    "hello world",
    "design a safe ai gateway on kubernetes",
    "measure latency, error rate, and cost"
  ],
  "model": "bge-small"
}

Run a burst test:

export GATEWAY_ADDR=http://YOUR_GATEWAY_ADDRESS
hey -n 200 -c 20 \
  -m POST \
  -T application/json \
  -H "Host: tenant-a.ai.example.com" \
  -D embeddings.json \
  ${GATEWAY_ADDR}/v1/embeddings

This validates burst handling on a route that is generally safer to retry and easier to compare across runs.

Test streaming behavior explicitly

Create a streaming request.

File: chat-stream.json

{
  "model": "llama3-8b",
  "stream": true,
  "messages": [
    { "role": "user", "content": "Explain why AI gateways need different timeout and routing rules." }
  ]
}

Run it:

curl -N \
  -H "Host: tenant-a.ai.example.com" \
  -H "Content-Type: application/json" \
  -X POST \
  --data @chat-stream.json \
  ${GATEWAY_ADDR}/v1/chat/completions

Watch for:

  • time to first token
  • steady stream continuity
  • premature disconnects
  • proxy buffering surprises
  • incorrect timeout behavior

Mix routes and traffic types

A simple shell loop is enough for first validation.

for i in $(seq 1 20); do
  curl -s \
    -H "Host: tenant-a.ai.example.com" \
    -H "Content-Type: application/json" \
    -X POST \
    --data @embeddings.json \
    ${GATEWAY_ADDR}/v1/embeddings > /dev/null &

  curl -s \
    -H "Host: tenant-a.ai.example.com" \
    -H "Content-Type: application/json" \
    -X POST \
    --data @chat-stream.json \
    ${GATEWAY_ADDR}/v1/chat/completions > /dev/null &
done

wait

This is not a perfect benchmark. It is enough to shake out route mismatches, saturation, and poor default behavior.

Inject a backend failure

Now prove that your timeouts, rollout split, and observability behave the way you think they do.

kubectl -n ai-models scale deployment chat-llama3-8b-v2 --replicas=0
kubectl -n ai-models get pods -w

Then send traffic again and observe:

  • did the canary route fail only for the canary share
  • did the gateway return the right errors
  • did logs identify the backend cleanly
  • did streaming routes hang longer than expected
  • did your metrics separate gateway failure from backend failure
Tip

Do not call a gateway production-ready until you have watched it handle a burst, a stream, and a failing backend in the same environment.

You should now have evidence that the gateway works under burst load, mixed route use, and controlled failure, instead of only under happy-path curl tests.

Common Setup Problems

Treating the AI gateway like ordinary ingress

Symptoms:

  • one generic timeout for everything
  • no distinction between chat and embeddings
  • streaming routes drop unexpectedly
  • rollout behavior is too coarse

Root cause: the gateway was designed like a normal web ingress instead of an inference entry layer.

Fix: classify traffic types first, separate interactive and batch routes, and treat streaming as its own class.

Trying to route by model name without a model-aware layer

Symptoms:

  • clients send model in the request body
  • Gateway rules cannot select backends correctly
  • teams start hardcoding awkward path variants

Root cause: standard Gateway rules do not parse JSON bodies for routing by themselves.

Fix: keep v1 path or hostname based, add a thin router service, or adopt an inference-aware extension deliberately later.

Missing ReferenceGrant for cross-namespace backends

Symptoms:

  • HTTPRoute exists but cannot resolve backend refs
  • traffic never reaches model services

Root cause: route namespace can see the shared Gateway, but the backend namespace never granted cross-namespace reference access.

Fix: add a ReferenceGrant in the backend namespace for the specific source namespace and kind.

Retries enabled on streaming routes

Symptoms:

  • duplicate partial outputs
  • strange client behavior during failures
  • misleading success rates

Root cause: retry logic was applied as if streaming traffic behaved like short idempotent requests.

Fix: disable or tightly constrain retries on streaming routes, and let the caller handle replay where needed.

No tenant or model labels in logs and metrics

Symptoms:

  • latency looks bad, but nobody knows for whom
  • costs are rising, but there is no route-level visibility
  • rollout issues are hard to localize

Root cause: observability was added as generic HTTP telemetry only.

Fix: log and label by tenant, route, backend, model route, and stream flag from day one.

Wrap-Up

A good v1 AI gateway on Kubernetes is not a giant custom platform. It is a clean Gateway API entry layer with explicit ownership, stable routes, safe rollout controls, solid isolation, and enough observability to understand latency, failures, and cost. Keep the first version simple: separate traffic classes, route clearly, isolate tenants, set explicit timeouts, and verify behavior with real traffic.

Move to a more advanced architecture when you actually need it: body-based model routing, endpoint picking based on model-server metrics, external provider failover, token-aware limiting, or inline payload processing. The Kubernetes AI Gateway Working Group and the Gateway API Inference Extension project make it clear that those patterns are becoming first-class platform concerns, but they are still evolving. That is exactly why a simple, well-structured v1 is the right place to start.