Ollama on Kubernetes: Deploy Your Local LLM on a Real Cluster

Before we start

In the previous post we built a ChatGPT-like app running 100% on your machine with Ollama, FastAPI and real-time streaming. At the end we left an open question: what if we put this in a Kubernetes cluster?

Today we’ll answer that question. We’re going to deploy Ollama as a service inside Kubernetes, connect our chat app, configure resources and autoscaling, and talk about something fundamental: the GPU.

And at the end, we’ll be honest about what’s missing to make this “real production” — because deploying an LLM is only half the journey.


Architecture

This is what we’re going to build:

  • Ingress exposes the app to the outside world.
  • Chat App is our FastAPI from the previous post, with HPA to scale based on demand.
  • Ollama runs the LLM model, ideally on a GPU node.
  • PVC persists downloaded models so they don’t re-download on every restart.

Prerequisites

| Tool | Version | What for |
| --- | --- | --- |
| OrbStack | 1.9+ | Local Kubernetes on macOS |
| kubectl | 1.28+ | Cluster management |
| Helm | 3.x | Chart installation (optional) |
| Ollama local | 0.18+ | For testing before deploying |

Why OrbStack?
We use OrbStack because it provides an integrated Kubernetes cluster on macOS with native Linux container support, low resource usage and near-instant startup. It’s perfect for local development. If you prefer another option, alternatives are explained below.

Step 1: Set up the local cluster with OrbStack

OrbStack includes a Kubernetes cluster you can enable from the app or via terminal:

# Verify Kubernetes is active in OrbStack
kubectl cluster-info

You should see something like:

Kubernetes control plane is running at https://127.0.0.1:26443

If you use another tool, here are the equivalents:

| Tool | Command to create cluster | Platform |
| --- | --- | --- |
| OrbStack | Enable in Settings → Kubernetes | macOS |
| Rancher Desktop | Enable in Preferences → Kubernetes | macOS, Windows, Linux |
| minikube | minikube start --memory=8g --cpus=4 | macOS, Windows, Linux |
| kind | kind create cluster | macOS, Windows, Linux |
| Docker Desktop | Enable in Settings → Kubernetes | macOS, Windows |

Memory
To run an LLM like Llama 3.1 (8B) you need at least 8 GB of RAM available for the cluster. Larger models like 70B require much more. Make sure to allocate enough memory to your local Kubernetes tool.

Step 2: Create the namespace and PVC

First, a dedicated namespace and a persistent volume for the models:

# ollama-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: llm
  labels:
    app: ollama
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-models
  namespace: llm
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

kubectl apply -f ollama-namespace.yaml

Why a PVC? Because Ollama models are several gigabytes each. Without persistence, every pod restart would mean downloading the model again. With the PVC, the model is downloaded once and survives restarts.


Step 3: Deploy Ollama

Now the Ollama deployment:

# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  namespace: llm
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              cpu: "2"
              memory: 4Gi
            limits:
              cpu: "4"
              memory: 8Gi
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
          readinessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /
              port: 11434
            initialDelaySeconds: 30
            periodSeconds: 10
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: llm
spec:
  selector:
    app: ollama
  ports:
    - port: 11434
      targetPort: 11434
  type: ClusterIP

kubectl apply -f ollama-deployment.yaml

About resources: why these values?

Let’s stop here and talk about the numbers:

| Resource | Request | Limit | Why |
| --- | --- | --- | --- |
| CPU | 2 cores | 4 cores | LLM inference is CPU-intensive. 2 cores is the minimum to avoid crawling; 4 gives room for spikes |
| Memory | 4 Gi | 8 Gi | Llama 3.1 (8B) needs ~4.7 GB just to load the model into memory. The request ensures the scheduler places the pod on a node with enough RAM |

Requests vs Limits
  • Request = the minimum the pod needs to function. The scheduler uses this to decide which node to place the pod on.
  • Limit = the maximum it can consume. A container that exceeds its memory limit is killed (OOMKilled); one that hits its CPU limit is throttled, not killed.

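For Ollama, the memory request is critical: if it’s too low, the scheduler might place the pod on a node without enough RAM and the model won’t load. Note that the 4 Gi request above sits slightly below the ~4.7 GB the model needs, so consider raising it if your nodes have little spare memory. If the limit is too low, the pod dies as soon as it starts processing a long prompt.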
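To sanity-check a request or limit against a model size, a few lines of arithmetic help. This is only a rough sketch: the function names are made up for this post, the weight estimate ignores the KV cache and runtime overhead, and the bits-per-weight figure is an assumption you should replace with your model's actual quantization.

```python
GIB_IN_GB = 2**30 / 10**9  # Kubernetes "Gi" is binary: 1 Gi = 1.073741824 GB


def gib_to_gb(gib: float) -> float:
    """Convert a Kubernetes Gi value to decimal gigabytes."""
    return gib * GIB_IN_GB


def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough size of the weights alone: parameters x bits / 8 (no KV cache, no overhead)."""
    return params_billions * bits_per_weight / 8


def covers_model(memory_gib: float, model_gb: float) -> bool:
    """True if a request or limit of `memory_gib` Gi is at least `model_gb` GB."""
    return gib_to_gb(memory_gib) >= model_gb


# The ~4.7 GB figure comes from the table above.
print(covers_model(4, 4.7))  # request of 4 Gi
print(covers_model(8, 4.7))  # limit of 8 Gi
```

The first call shows why the 4 Gi request is on the tight side: 4 Gi is about 4.29 GB, which is less than the model alone.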

Reference table by model

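| Model | Parameters | Minimum RAM | Minimum CPU | Recommended GPU |
| --- | --- | --- | --- | --- |
| Llama 3.1 | 8B | 4.7 GB | 2 cores | 6 GB VRAM |
| Llama 3.1 | 70B | 40 GB | 8 cores | 48 GB VRAM |
| Mistral | 7B | 4.1 GB | 2 cores | 6 GB VRAM |
| Phi-4 | 14B | 8.5 GB | 2 cores | 10 GB VRAM |
| Gemma 3 | 12B | 7.5 GB | 2 cores | 10 GB VRAM |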

Step 4: Download the model

Once Ollama is running, download the model:

# Run inside the Ollama pod
kubectl exec -it -n llm deploy/ollama -- ollama pull llama3.1

This downloads the model to the PVC. Next time the pod restarts, the model will already be there.

Verify it works:

kubectl exec -it -n llm deploy/ollama -- ollama run llama3.1 "Hello, are you running on Kubernetes?"
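
If you'd rather check from a script that the model really landed on the PVC, Ollama's HTTP API lists local models at /api/tags. The sketch below assumes you have a port-forward open to the Ollama Service on localhost:11434 (for example with kubectl port-forward -n llm svc/ollama 11434:11434); the helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def has_model(tags_payload: dict, wanted: str) -> bool:
    """True if `wanted` (e.g. "llama3.1") matches a listed model, with or without a tag."""
    for entry in tags_payload.get("models", []):
        name = entry.get("name", "")
        if name == wanted or name.split(":", 1)[0] == wanted:
            return True
    return False


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """Call GET /api/tags on an Ollama server (assumes it is reachable at base_url)."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)


# Usage, with the port-forward running:
#   print(has_model(fetch_tags(), "llama3.1"))
```

Run it once, restart the pod, and run it again: if the PVC is doing its job, the answer stays the same without a new download.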

Step 5: Deploy the Chat App

Now we deploy the chat app from the previous post:

# chat-app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: chat-app
  namespace: llm
  labels:
    app: chat-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: chat-app
  template:
    metadata:
      labels:
        app: chat-app
    spec:
      containers:
        - name: chat-app
          image: ghcr.io/your-org/ollama-chat-python:latest
          ports:
            - containerPort: 8000
          env:
            - name: OLLAMA_HOST
              value: "http://ollama.llm.svc.cluster.local:11434"
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: chat-app
  namespace: llm
spec:
  selector:
    app: chat-app
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP

Notice the OLLAMA_HOST environment variable: it points to the Ollama Service using Kubernetes internal DNS (ollama.llm.svc.cluster.local). The app doesn’t need to know where Ollama physically runs.

kubectl apply -f chat-app-deployment.yaml
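
On the app side, honoring OLLAMA_HOST can be as simple as the sketch below. This is not necessarily how the app from the previous post does it; the function name and the localhost fallback are assumptions for illustration.

```python
import os
from typing import Mapping, Optional


def ollama_url(path: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Build a full Ollama URL from OLLAMA_HOST, falling back to a local default."""
    env = os.environ if env is None else env
    base = env.get("OLLAMA_HOST", "http://localhost:11434").rstrip("/")
    return f"{base}/{path.lstrip('/')}"


# Inside the cluster, the Deployment above sets OLLAMA_HOST for us:
in_cluster = {"OLLAMA_HOST": "http://ollama.llm.svc.cluster.local:11434"}
print(ollama_url("/api/chat", in_cluster))

# On a laptop with no variable set, the same code talks to a local Ollama:
print(ollama_url("/api/chat", {}))
```

The same image then runs unchanged on your machine and in the cluster; only the environment differs. Because both workloads live in the llm namespace, the short name http://ollama:11434 would resolve too, but the fully qualified name keeps working if the app ever moves to another namespace.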

Step 6: Configure the HPA

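The chat app can scale horizontally. Ollama can’t, because a model loaded in memory can’t be easily “split” between pods. So the HPA targets only the app. It relies on the resource metrics API, usually provided by metrics-server, so check that kubectl top pods works in your cluster before applying it: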

# chat-app-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: chat-app-hpa
  namespace: llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: chat-app
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 120
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

kubectl apply -f chat-app-hpa.yaml
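
To get a feel for what the HPA will do with these numbers, here is a small sketch of the replica calculation described in the Kubernetes documentation: desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default), then clamped to minReplicas/maxReplicas. It deliberately ignores the behavior policies above, which additionally cap changes to 1 pod per period. Keep in mind that utilization is measured against the CPU request, so 70% here means 140m of the 200m requested.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA replica calculation (no stabilization windows or scaling policies)."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 2 pods averaging 140% of their CPU request, target 70%
print(desired_replicas(2, 140, 70, min_replicas=2, max_replicas=5))  # 4
# 3 pods mostly idle at 20%: the result is clamped to minReplicas
print(desired_replicas(3, 20, 70, min_replicas=2, max_replicas=5))   # 2
```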

Why DON’T we scale Ollama with HPA?

Good question. There are several reasons:

  1. Model loading: Each new Ollama replica needs to load the model into memory (~5 GB for Llama 3.1 8B). That takes time and resources.
  2. State: Ollama keeps the model in memory. It’s not stateless like a typical REST API.
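  3. GPU: If using GPU, each replica needs its own GPU. With the default NVIDIA device plugin a GPU is allocated whole to a single pod; sharing one between two Ollama pods requires extra setup (time-slicing or MIG), and the pods then compete for the same hardware.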
  4. Cost: More replicas = more GPUs = more money. A lot more money.

The better strategy for Ollama is vertical scaling (more CPU/RAM/GPU), or switching to a serving engine such as vLLM or TGI (Text Generation Inference), which are designed to serve models efficiently with request batching.


The GPU: the elephant in the room

Let’s talk about the most important thing for production: the GPU.

Without GPU (CPU-only)

This is what we’re doing in our OrbStack example. It works, but it’s slow. A simple prompt can take 10-30 seconds to generate the full response on CPU. Fine for development and testing, but unacceptable for production with real users.

With GPU

With a GPU, the same prompt takes 1-3 seconds. The difference is massive.

To use GPU in Kubernetes you need:

  1. Nodes with GPU (NVIDIA is the standard).
  2. NVIDIA Device Plugin installed in the cluster.
  3. NVIDIA Drivers on the nodes.

# Add to the Ollama container spec:
resources:
  requests:
    cpu: "1"
    memory: 4Gi
    nvidia.com/gpu: "1"    # Request 1 GPU
  limits:
    cpu: "2"
    memory: 8Gi
    nvidia.com/gpu: "1"    # Limit to 1 GPU

Quick GPU reference

| GPU | VRAM | Supported models | Approx. cloud price (per hour) |
| --- | --- | --- | --- |
| NVIDIA T4 | 16 GB | Up to 13B parameters | ~$0.50 |
| NVIDIA A10G | 24 GB | Up to 30B parameters | ~$1.00 |
| NVIDIA A100 (40GB) | 40 GB | Up to 70B parameters | ~$3.00 |
| NVIDIA A100 (80GB) | 80 GB | 70B+ parameters | ~$5.00 |
| NVIDIA H100 | 80 GB | 70B+ (faster) | ~$8.00 |

OrbStack and GPU
OrbStack currently doesn’t support GPU passthrough to the Kubernetes cluster. For local development with GPU, you can run Ollama directly on your Mac (leveraging Metal/Apple Silicon) and connect your K8s app to the host via host.docker.internal. For production with real GPU, you need a cloud cluster (GKE, EKS, AKS) with GPU nodes.

Test the complete setup

Expose the app locally and test it:

# Port-forward to test
kubectl port-forward -n llm svc/chat-app 8000:80

Open http://localhost:8000 in your browser and you’re set — your AI chat running on Kubernetes.

To check the status of everything:

# View all resources
kubectl get all -n llm

# View Ollama logs
kubectl logs -n llm deploy/ollama -f

# Watch the HPA in action
kubectl get hpa -n llm -w

What’s missing: the road to production

OK, we have Ollama running on Kubernetes. Does that mean we’re ready for production? Not even close. Deploying an LLM is just the first step. To make it “real production” you need to cover several areas that are often overlooked.

1. Prompt observability (Langfuse)

What prompts are your users sending? How long does each response take? Which model generates better answers? Without observability, you’re flying blind.

Langfuse is an open source LLM observability platform that lets you:

  • See every prompt and response in a dashboard.
  • Measure latency, tokens consumed and costs.
  • Evaluate response quality.
  • Trace complex chains (RAG, agents, etc.).

2. Sensitive data protection (Presidio)

This is one of the biggest and least discussed risks. What happens if a user sends their credit card number in a prompt? Or personal data like their SSN, address, medical history?

Microsoft Presidio is an open source PII detection and anonymization framework. You put it as middleware between your app and Ollama to:

  • Detect PII before it reaches the model (names, emails, cards, IDs).
  • Anonymize or redact sensitive information.
  • Log what kind of data attempted to pass through.

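Example flow: the user’s prompt goes from the chat app to Presidio, which detects and anonymizes any PII, and only the sanitized prompt is forwarded to Ollama.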
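
To make the middleware idea concrete, here is a deliberately naive stand-in that uses regular expressions instead of Presidio. The patterns, labels and function name are illustrative assumptions, nowhere near Presidio's real coverage; in an actual deployment, Presidio's AnalyzerEngine and AnonymizerEngine would take the place of the regexes.

```python
import re

# Toy patterns for illustration only; real PII detection needs far more than this.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CREDIT_CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(prompt: str) -> tuple[str, list[str]]:
    """Replace matches with <LABEL> placeholders and report which labels fired."""
    found = []
    for label, pattern in PATTERNS.items():
        prompt, count = pattern.subn(f"<{label}>", prompt)
        if count:
            found.append(label)
    return prompt, found


safe_prompt, detected = redact(
    "My card is 4111 1111 1111 1111 and my email is ana@example.com"
)
print(safe_prompt)  # My card is <CREDIT_CARD> and my email is <EMAIL>
print(detected)     # log the labels, never the raw values
```

Only safe_prompt would be sent on to Ollama; the list of detected labels is what you log.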

3. Metrics and monitoring

You need to know:

  • Latency per request: how long does inference take? (p50, p95, p99)
  • Tokens per second: how fast is text generation?
  • GPU/CPU memory: are you close to the limit?
  • Queue depth: how many requests are waiting?
  • Error rate: how many requests fail?

This is solved with Prometheus + Grafana (if you’ve read the Prometheus post, you already know how).
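
Two of these numbers are easy to compute yourself before any dashboards exist. The sketch below assumes you collect the eval_count and eval_duration fields (the latter in nanoseconds) that Ollama reports in the final response of a generation, plus a list of request latencies gathered by your app; the function names are hypothetical and the percentile uses the simple nearest-rank method.

```python
import math


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# 120 tokens generated in 4 seconds
print(tokens_per_second(120, 4_000_000_000))  # 30.0

# Hypothetical latencies in seconds for ten requests
latencies = [1.2, 0.9, 1.1, 8.5, 1.0, 1.3, 0.8, 1.4, 1.1, 2.0]
print(percentile(latencies, 50), percentile(latencies, 95))
```

Notice how a single slow request dominates p95 while barely moving p50, which is why averages alone are misleading for LLM latency.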

4. Content guardrails

What happens if someone asks your LLM to generate offensive content, dangerous instructions or misinformation? You need guardrails that filter both input and output.

Tools like NVIDIA NeMo Guardrails or Guardrails AI let you define rules that:

  • Block prompts attempting to jailbreak the model.
  • Filter responses containing inappropriate content.
  • Keep the conversation within the expected domain.

5. Costs and optimization

Running an LLM on Kubernetes isn’t cheap. An A100 GPU can cost $3/hour. Running 24/7, that’s $2,160 per month. To optimize:

  • Quantization: Quantized models (Q4, Q5) use less memory and are faster, with minimal quality loss.
  • Turn off when not in use: If your LLM only receives traffic during business hours, shut it down at night.
  • Batching: Tools like vLLM group multiple requests for parallel processing, maximizing GPU utilization.
  • Smaller models: Do you really need a 70B model? For many use cases, Phi-3 (3.8B) or Llama 3.1 (8B) are more than enough.
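
The arithmetic behind the monthly figure, and behind the "turn it off at night" advice, fits in a few lines. The 10 hours per day and 22 working days are hypothetical values chosen for the example, not measurements.

```python
def monthly_gpu_cost(hourly_rate: float, hours_per_day: float = 24, days_per_month: int = 30) -> float:
    """Cost of keeping one GPU node running for the given schedule."""
    return hourly_rate * hours_per_day * days_per_month


always_on = monthly_gpu_cost(3.0)  # 3 x 24 x 30
business_hours = monthly_gpu_cost(3.0, hours_per_day=10, days_per_month=22)

print(always_on)       # 2160.0
print(business_hours)  # 660.0
```

Under those assumptions, shutting the GPU node down outside business hours cuts the bill by roughly two thirds.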

Summary

| Topic | Status | Tool |
| --- | --- | --- |
| Ollama deployment on K8s | ✅ Covered today | Kubernetes + OrbStack |
| Resources and HPA | ✅ Covered today | Requests/Limits + HPA |
| GPU | ✅ Explained | NVIDIA Device Plugin |
| Prompt observability | 🔜 Next post | Langfuse |
| Data protection | 🔜 Next post | Presidio |
| LLM metrics | 🔜 Next post | Prometheus + Grafana |
| Guardrails | 🔜 Next post | NeMo Guardrails |
| Optimization | 🔜 Next post | vLLM, quantization |

This post is the first in a series about LLMs on Kubernetes for production. In the upcoming posts we’ll cover each missing piece, starting with Langfuse for observability and Presidio for data protection.

Because deploying an LLM is easy. Doing it right is the hard part.


Resources