Ollama on Kubernetes: Deploy Your Local LLM on a Real Cluster

Before we start
In the previous post we built a ChatGPT-like app running 100% on your machine with Ollama, FastAPI and real-time streaming. At the end we left an open question: what if we put this in a Kubernetes cluster?
Today we’ll answer that question. We’re going to deploy Ollama as a service inside Kubernetes, connect our chat app, configure resources, autoscaling and talk about something fundamental: the GPU.
And at the end, we’ll be honest about what’s missing to make this “real production” — because deploying an LLM is only half the journey.
Architecture
This is what we’re going to build:
- Ingress exposes the app to the outside world.
- Chat App is our FastAPI from the previous post, with HPA to scale based on demand.
- Ollama runs the LLM model, ideally on a GPU node.
- PVC persists downloaded models so they don’t re-download on every restart.
Prerequisites
| Tool | Version | What for |
|---|---|---|
| OrbStack | 1.9+ | Local Kubernetes on macOS |
| kubectl | 1.28+ | Cluster management |
| Helm | 3.x | Chart installation (optional) |
| Ollama local | 0.18+ | For testing before deploying |
Step 1: Set up the local cluster with OrbStack
OrbStack includes a Kubernetes cluster you can enable from the app or via terminal:
# Verify Kubernetes is active in OrbStack
kubectl cluster-infoYou should see something like:
Kubernetes control plane is running at https://127.0.0.1:26443If you use another tool, here are the equivalents:
| Tool | Command to create cluster | Platform |
|---|---|---|
| OrbStack | Enable in Settings → Kubernetes | macOS |
| Rancher Desktop | Enable in Preferences → Kubernetes | macOS, Windows, Linux |
| minikube | minikube start --memory=8g --cpus=4 | macOS, Windows, Linux |
| kind | kind create cluster | macOS, Windows, Linux |
| Docker Desktop | Enable in Settings → Kubernetes | macOS, Windows |
Step 2: Create the namespace and PVC
First, a dedicated namespace and a persistent volume for the models:
# ollama-namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
name: llm
labels:
app: ollama
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: ollama-models
namespace: llm
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gikubectl apply -f ollama-namespace.yamlWhy a PVC? Because Ollama models weigh several GBs. Without persistence, every time the pod restarts it would have to download the model again. With the PVC, the model downloads once and stays.
Step 3: Deploy Ollama
Now the Ollama deployment:
# ollama-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: ollama
namespace: llm
labels:
app: ollama
spec:
replicas: 1
selector:
matchLabels:
app: ollama
template:
metadata:
labels:
app: ollama
spec:
containers:
- name: ollama
image: ollama/ollama:latest
ports:
- containerPort: 11434
resources:
requests:
cpu: "2"
memory: 4Gi
limits:
cpu: "4"
memory: 8Gi
volumeMounts:
- name: models
mountPath: /root/.ollama
readinessProbe:
httpGet:
path: /
port: 11434
initialDelaySeconds: 10
periodSeconds: 5
livenessProbe:
httpGet:
path: /
port: 11434
initialDelaySeconds: 30
periodSeconds: 10
volumes:
- name: models
persistentVolumeClaim:
claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
name: ollama
namespace: llm
spec:
selector:
app: ollama
ports:
- port: 11434
targetPort: 11434
type: ClusterIPkubectl apply -f ollama-deployment.yamlAbout resources: why these values?
Let’s stop here and talk about the numbers:
| Resource | Request | Limit | Why |
|---|---|---|---|
| CPU | 2 cores | 4 cores | LLM inference is CPU-intensive. 2 cores is the minimum to avoid crawling; 4 gives room for spikes |
| Memory | 4 Gi | 8 Gi | Llama 3.1 (8B) needs ~4.7 GB just to load the model into memory. The request ensures the scheduler places the pod on a node with enough RAM |
- Request = the minimum the pod needs to function. The scheduler uses this to decide which node to place the pod on.
- Limit = the maximum it can consume. If exceeded, Kubernetes kills it (OOMKilled).
For Ollama, the memory request is critical: if it’s too low, the scheduler might place the pod on a node without enough RAM and the model won’t load. If the limit is too low, the pod dies as soon as it starts processing a long prompt.
Reference table by model
| Model | Parameters | Minimum RAM | Minimum CPU | Recommended GPU |
|---|---|---|---|---|
| Llama 3.1 | 8B | 4.7 GB | 2 cores | 6 GB VRAM |
| Llama 3.1 | 70B | 40 GB | 8 cores | 48 GB VRAM |
| Mistral | 7B | 4.1 GB | 2 cores | 6 GB VRAM |
| Phi-4 | 14B | 8.5 GB | 2 cores | 10 GB VRAM |
| Gemma 3 | 12B | 7.5 GB | 2 cores | 10 GB VRAM |
Step 4: Download the model
Once Ollama is running, download the model:
# Run inside the Ollama pod
kubectl exec -it -n llm deploy/ollama -- ollama pull llama3.1This downloads the model to the PVC. Next time the pod restarts, the model will already be there.
Verify it works:
kubectl exec -it -n llm deploy/ollama -- ollama run llama3.1 "Hello, are you running on Kubernetes?"Step 5: Deploy the Chat App
Now we deploy the chat app from the previous post:
# chat-app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: chat-app
namespace: llm
labels:
app: chat-app
spec:
replicas: 2
selector:
matchLabels:
app: chat-app
template:
metadata:
labels:
app: chat-app
spec:
containers:
- name: chat-app
image: ghcr.io/your-org/ollama-chat-python:latest
ports:
- containerPort: 8000
env:
- name: OLLAMA_HOST
value: "http://ollama.llm.svc.cluster.local:11434"
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: 500m
memory: 512Mi
readinessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 5
livenessProbe:
httpGet:
path: /health
port: 8000
initialDelaySeconds: 10
---
apiVersion: v1
kind: Service
metadata:
name: chat-app
namespace: llm
spec:
selector:
app: chat-app
ports:
- port: 80
targetPort: 8000
type: ClusterIPNotice the OLLAMA_HOST environment variable: it points to the Ollama Service using Kubernetes internal DNS (ollama.llm.svc.cluster.local). The app doesn’t need to know where Ollama physically runs.
kubectl apply -f chat-app-deployment.yamlStep 6: Configure the HPA
The chat app can scale horizontally. Ollama can’t — a model loaded in memory can’t be easily “split” between pods. But the app can:
# chat-app-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: chat-app-hpa
namespace: llm
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: chat-app
minReplicas: 2
maxReplicas: 5
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
behavior:
scaleUp:
stabilizationWindowSeconds: 30
policies:
- type: Pods
value: 1
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 120
policies:
- type: Pods
value: 1
periodSeconds: 120kubectl apply -f chat-app-hpa.yamlWhy DON’T we scale Ollama with HPA?
Good question. There are several reasons:
- Model loading: Each new Ollama replica needs to load the model into memory (~5 GB for Llama 3.1 8B). That takes time and resources.
- State: Ollama keeps the model in memory. It’s not stateless like a typical REST API.
- GPU: If using GPU, each replica needs its own GPU. You can’t efficiently share one GPU between two Ollama pods.
- Cost: More replicas = more GPUs = more money. A lot more money.
The correct strategy to scale Ollama is vertical scaling (more CPU/RAM/GPU) or using a solution like vLLM or TGI (Text Generation Inference) that are designed to serve models efficiently with request batching.
The GPU: the elephant in the room
Let’s talk about the most important thing for production: the GPU.
Without GPU (CPU-only)
This is what we’re doing in our OrbStack example. It works, but it’s slow. A simple prompt can take 10-30 seconds to generate the full response on CPU. Fine for development and testing, but unacceptable for production with real users.
With GPU
With a GPU, the same prompt takes 1-3 seconds. The difference is massive.
To use GPU in Kubernetes you need:
- Nodes with GPU (NVIDIA is the standard).
- NVIDIA Device Plugin installed in the cluster.
- NVIDIA Drivers on the nodes.
# Add to the Ollama container spec:
resources:
requests:
cpu: "1"
memory: 4Gi
nvidia.com/gpu: "1" # Request 1 GPU
limits:
cpu: "2"
memory: 8Gi
nvidia.com/gpu: "1" # Limit to 1 GPUQuick GPU reference
| GPU | VRAM | Supported models | Cloud price (approx/hour) |
|---|---|---|---|
| NVIDIA T4 | 16 GB | Up to 13B parameters | ~$0.50 |
| NVIDIA A10G | 24 GB | Up to 30B parameters | ~$1.00 |
| NVIDIA A100 (40GB) | 40 GB | Up to 70B parameters | ~$3.00 |
| NVIDIA A100 (80GB) | 80 GB | 70B+ parameters | ~$5.00 |
| NVIDIA H100 | 80 GB | 70B+ (faster) | ~$8.00 |
host.docker.internal. For production with real GPU, you need a cloud cluster (GKE, EKS, AKS) with GPU nodes.Test the complete setup
Expose the app locally and test it:
# Port-forward to test
kubectl port-forward -n llm svc/chat-app 8000:80Open http://localhost:8000 in your browser and you’re set — your AI chat running on Kubernetes.
To check the status of everything:
# View all resources
kubectl get all -n llm
# View Ollama logs
kubectl logs -n llm deploy/ollama -f
# Watch the HPA in action
kubectl get hpa -n llm -wWhat’s missing: the road to production
Ok, we have Ollama running on Kubernetes. Does that mean we’re ready for production? Not even close. Deploying an LLM is just the first step. To make it “real production” you need to cover several areas that are often overlooked.
1. Prompt observability (Langfuse)
What prompts are your users sending? How long does each response take? Which model generates better answers? Without observability, you’re flying blind.
Langfuse is an open source LLM observability platform that lets you:
- See every prompt and response in a dashboard.
- Measure latency, tokens consumed and costs.
- Evaluate response quality.
- Trace complex chains (RAG, agents, etc.).
2. Sensitive data protection (Presidio)
This is one of the biggest and least discussed risks. What happens if a user sends their credit card number in a prompt? Or personal data like their SSN, address, medical history?
Microsoft Presidio is an open source PII detection and anonymization framework. You put it as middleware between your app and Ollama to:
- Detect PII before it reaches the model (names, emails, cards, IDs).
- Anonymize or redact sensitive information.
- Log what kind of data attempted to pass through.
Example flow:
3. Metrics and monitoring
You need to know:
- Latency per request: how long does inference take? (p50, p95, p99)
- Tokens per second: how fast is text generation?
- GPU/CPU memory: are you close to the limit?
- Queue depth: how many requests are waiting?
- Error rate: how many requests fail?
This is solved with Prometheus + Grafana (if you’ve read the Prometheus post, you already know how).
4. Content guardrails
What happens if someone asks your LLM to generate offensive content, dangerous instructions or misinformation? You need guardrails that filter both input and output.
Tools like NVIDIA NeMo Guardrails or Guardrails AI let you define rules that:
- Block prompts attempting to jailbreak the model.
- Filter responses containing inappropriate content.
- Keep the conversation within the expected domain.
5. Costs and optimization
Running an LLM on Kubernetes isn’t cheap. An A100 GPU can cost $3/hour. Running 24/7, that’s **$2,160 per month**. To optimize:
- Quantization: Quantized models (Q4, Q5) use less memory and are faster, with minimal quality loss.
- Turn off when not in use: If your LLM only receives traffic during business hours, shut it down at night.
- Batching: Tools like vLLM group multiple requests for parallel processing, maximizing GPU utilization.
- Smaller models: Do you really need a 70B model? For many use cases, Phi-3 (3.8B) or Llama 3.1 (8B) are more than enough.
Summary
| Topic | Status | Tool |
|---|---|---|
| Ollama deployment on K8s | ✅ Covered today | Kubernetes + OrbStack |
| Resources and HPA | ✅ Covered today | Requests/Limits + HPA |
| GPU | ✅ Explained | NVIDIA Device Plugin |
| Prompt observability | 🔜 Next post | Langfuse |
| Data protection | 🔜 Next post | Presidio |
| LLM metrics | 🔜 Next post | Prometheus + Grafana |
| Guardrails | 🔜 Next post | NeMo Guardrails |
| Optimization | 🔜 Next post | vLLM, quantization |
This post is the first in a series about LLMs on Kubernetes for production. In the upcoming posts we’ll cover each missing piece, starting with Langfuse for observability and Presidio for data protection.
Because deploying an LLM is easy. Doing it right is the hard part.