Kubernetes: Autoscaling with HPA and an Intro to VPA

Updated March 2026: This article uses autoscaling/v2 (the current stable version) and includes recent features like configurable tolerance (v1.33) and In-Place Pod Resize (GA in v1.35).
Prerequisites
This post is the direct continuation of the Kubernetes series. You need:
- Everything from the first chapter: Docker (or OrbStack), kubectl, and Kind with an active cluster.
- Everything from the second chapter: knowing how to create Deployments, Services, and YAML manifests.
- A Deployment running with resources.requests defined (we did this in the previous post).
If you don’t have this ready, check the previous posts first. Here we jump straight into autoscaling.
Introduction
In the previous chapter we learned how to scale manually with kubectl scale. That works, but it has an obvious problem: you need a human watching. What happens on a Friday at 11 PM when your app goes viral on social media and traffic multiplies by 10? You don’t want to be there scaling by hand.
That’s what the Horizontal Pod Autoscaler (HPA) is for: you define rules and Kubernetes scales your Pods automatically based on real metrics. It’s like putting your infrastructure on autopilot.

The HPA monitors metrics and automatically adjusts the number of replicas in your Deployment
How does the HPA work under the hood?
The HPA isn’t magic — it’s a control loop that runs every 15 seconds (by default) and does the following:
- Reads metrics from the Metrics Server (CPU, memory) or from custom metrics adapters.
- Calculates how many replicas it needs using this formula:
  desired replicas = ⌈ current replicas × (current metric / target metric) ⌉
- Scales the Deployment if the difference exceeds the tolerance (10% by default).
- Repeats the cycle.
For example: you have 2 replicas, the current CPU usage is 80%, and your target is 50%. The calculation would be:
replicas = ⌈ 2 × (80 / 50) ⌉ = ⌈ 3.2 ⌉ = 4 replicas
The HPA would scale from 2 to 4 Pods.
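The arithmetic of the loop is easy to check by hand. The sketch below is a simplified shell version of the formula plus the tolerance check; the function name desired_replicas and its arguments are made up for illustration, and the real controller also handles details ignored here (unready Pods, missing metrics, the min/max bounds).

```shell
# Simplified sketch of the HPA replica formula (not the controller's actual code).
# Usage: desired_replicas <current_replicas> <current_metric> <target_metric> [tolerance]
desired_replicas() {
  local current="$1" metric="$2" target="$3" tolerance="${4:-0.1}"
  awk -v r="$current" -v m="$metric" -v t="$target" -v tol="$tolerance" 'BEGIN {
    ratio = m / t
    diff = ratio - 1
    if (diff < 0) diff = -diff
    # Within tolerance: keep the current replica count.
    if (diff <= tol) { print r; exit }
    # Otherwise round current * ratio up to the next whole Pod.
    want = r * ratio
    rounded = int(want)
    if (rounded < want) rounded++
    print rounded
  }'
}

desired_replicas 2 80 50   # the example above: prints 4
desired_replicas 4 52 50   # ratio 1.04 is inside the 10% tolerance: prints 4
```

The second call shows why a reading of 52% against a 50% target doesn’t add a Pod: the ratio is within the default tolerance, so the replica count stays where it is.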

The HPA control loop: read metrics → calculate → scale → repeat
Installing Metrics Server on Kind
The HPA needs a Metrics Server to read CPU and memory metrics. On managed clusters (EKS, GKE, AKS) it usually comes preinstalled, but on Kind you need to install it manually.
```shell
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```
On Kind you need a patch because the kubelet certificates are self-signed and Metrics Server can’t validate them. Patch the Metrics Server Deployment:
```shell
kubectl patch deployment metrics-server -n kube-system \
  --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'
```
Wait for it to be ready:
```shell
kubectl wait --namespace kube-system \
  --for=condition=ready pod \
  --selector=k8s-app=metrics-server \
  --timeout=90s
```
Verify it works:
```shell
kubectl top nodes
```
```text
NAME                       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
mi-cluster-control-plane   250m         12%    512Mi           26%
mi-cluster-worker          120m         6%     256Mi           13%
```
```shell
kubectl top pods
```
If you see metrics, you’re ready for the HPA.
HPA by CPU: practical example
Let’s go with the most common case. First, make sure you have a Deployment with resources.requests defined. If you followed the previous post, you already have it. If not, create deployment.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mi-app
  labels:
    app: mi-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mi-app
  template:
    metadata:
      labels:
        app: mi-app
    spec:
      containers:
        - name: mi-app
          image: nginx:latest
          ports:
            - containerPort: 80
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "100m"
```
```shell
kubectl apply -f deployment.yaml
```
Without resources.requests, the HPA can’t calculate percentages and won’t work. Always define requests in your containers.
Creating the HPA with a command
The quick way:
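```shell
kubectl autoscale deployment mi-app --min=2 --max=10 --cpu-percent=50
```
Verify:
```shell
kubectl get hpa
```
```text
NAME     REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
mi-app   Deployment/mi-app   10%/50%   2         10        2          30s
```
Creating the HPA with a YAML manifest
For production, always use YAML. If you created the HPA with the command above, delete it first with kubectl delete hpa mi-app, so that two HPAs don’t manage the same Deployment. Then create hpa-cpu.yaml: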
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mi-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```
```shell
kubectl apply -f hpa-cpu.yaml
```
Let’s break it down:
| Field | What it does |
|---|---|
| scaleTargetRef | Which Deployment to target |
| minReplicas: 2 | Never go below 2 replicas (high availability) |
| maxReplicas: 10 | Never go above 10 (cost control) |
| averageUtilization: 50 | Scale when the average CPU exceeds 50% of the requests |

Use autoscaling/v2, the stable and current version. The v2beta1 and v2beta2 versions have been removed from Kubernetes. If you see tutorials using those versions, update them.
HPA by memory
CPU-based scaling is the most common, but sometimes your app is memory-intensive (caches, data processing, etc.). You can scale by memory the same way.
Create hpa-memory.yaml:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mi-app-hpa-memory
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```
Combining metrics: CPU + memory
Why pick just one metric when you can use both? When you define multiple metrics, the HPA calculates the replicas needed for each one and picks the highest number.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mi-app-hpa-multi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```
If CPU says you need 4 replicas and memory says you need 6, the HPA will set 6 replicas. The highest number always wins to ensure both metrics stay under control.
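To make the "highest number wins" rule concrete, here is a small shell sketch that applies the replica formula to each metric and keeps the largest proposal. The helper name replicas_for_metrics and the current:target argument format are invented for this example, and it deliberately ignores the tolerance and the min/max bounds.

```shell
# Sketch only: one proposal per metric, the largest one wins.
# Usage: replicas_for_metrics <current_replicas> <current:target>...
replicas_for_metrics() {
  local current="$1"
  shift
  local best=0 pair value target proposal
  for pair in "$@"; do
    value="${pair%%:*}"    # current utilization, e.g. 60
    target="${pair##*:}"   # target utilization, e.g. 50
    # Integer ceiling of current * value / target.
    proposal=$(( (current * value + target - 1) / target ))
    if [ "$proposal" -gt "$best" ]; then best="$proposal"; fi
  done
  echo "$best"
}

# 3 replicas, CPU at 60% (target 50) and memory at 140% (target 70):
# CPU proposes 4, memory proposes 6, so this prints 6.
replicas_for_metrics 3 60:50 140:70
```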
Custom metrics with Prometheus
CPU and memory metrics are a good starting point, but in production you want to scale based on business metrics: requests per second, messages in queue, latency, etc. For this you need a custom metrics adapter.
The most popular combination is Prometheus + Prometheus Adapter.
Architecture
Pods → export metrics → Prometheus (scraping) → Prometheus Adapter → HPA
Custom metrics flow: App → Prometheus → Adapter → HPA
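The adapter step in this flow is where a Prometheus series becomes a metric the HPA can query. As a rough sketch, a Prometheus Adapter rule that turns a counter into a per-second rate could look like the one printed below. It assumes the app exposes a counter such as http_requests_total with namespace and pod labels, and the 2-minute rate window is an arbitrary choice; adapt both to your setup. The function name adapter_rule is only a wrapper for this example.

```shell
# Prints a Prometheus Adapter rule (sketch). The quoted 'EOF' keeps the shell
# from expanding ${1} and leaves the adapter's <<...>> placeholders untouched.
adapter_rule() {
  cat <<'EOF'
rules:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
EOF
}

adapter_rule
```

The rule belongs in the adapter’s configuration. Once the adapter has loaded it, kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/http_requests_per_second" should return one value per Pod, which is what the HPA reads.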
Example: scaling by requests per second
Let’s say your app exposes an http_requests_total counter, the Prometheus Adapter publishes its per-second rate as http_requests_per_second, and you want to scale when that rate exceeds 100 requests/s per Pod:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mi-app-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
```
Behavior: controlling how the HPA scales
By default, the HPA scales up fast and scales down slow (5-minute stabilization window). But you can customize this behavior with the behavior field.
Why does it matter?
- Too aggressive scale-up: you create 50 Pods in a 10-second spike and then they sit idle.
- Too aggressive scale-down: you drop replicas and another spike hits, causing latency while the new Pods start up.
Example with custom behavior
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mi-app-hpa-behavior
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
      selectPolicy: Max
```
Let’s break it down:
Scale-up:
| Field | What it does |
|---|---|
| stabilizationWindowSeconds: 30 | Wait 30s before scaling (avoids reacting to instant spikes) |
| Percent: 50 | Can increase up to 50% of current replicas per minute |
| Pods: 4 | Or add up to 4 Pods per minute (whichever is greater) |
| selectPolicy: Max | Pick the most aggressive policy (the one that adds more Pods) |

Scale-down:
| Field | What it does |
|---|---|
| stabilizationWindowSeconds: 300 | Wait 5 minutes before reducing (avoids the “yo-yo effect”) |
| Percent: 25 | Reduce a maximum of 25% every 2 minutes |
| selectPolicy: Max | Pick the policy that reduces the most |
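To see how the two scale-up policies interact, the following shell sketch computes the ceiling that selectPolicy: Max allows within one period. It is a simplification: the function name scale_up_limit is invented here, rounding the percentage up is an assumption of this sketch, and maxReplicas and the stabilization window still apply on top of the result.

```shell
# Sketch of the scale-up ceiling for one period with selectPolicy: Max.
# Usage: scale_up_limit <replicas_at_period_start> <percent> <pods>
scale_up_limit() {
  local start="$1" percent="$2" pods="$3"
  # Percent policy, rounded up: ceil(start * (100 + percent) / 100).
  local by_percent=$(( (start * (100 + percent) + 99) / 100 ))
  # Pods policy: a fixed number of extra Pods.
  local by_pods=$(( start + pods ))
  if [ "$by_percent" -gt "$by_pods" ]; then
    echo "$by_percent"
  else
    echo "$by_pods"
  fi
}

scale_up_limit 2 50 4    # 3 (percent) vs 6 (pods): prints 6
scale_up_limit 10 50 4   # 15 (percent) vs 14 (pods): prints 15
```

With few replicas the Pods policy dominates, so a small Deployment can still grow quickly; with many replicas the Percent policy takes over.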
Configurable tolerance (K8s v1.33+)
Starting with Kubernetes v1.33, you can configure the tolerance on each HPA, separately for scale-up and scale-down. By default, the HPA ignores differences smaller than 10% to avoid unnecessary scaling. Now you can customize it:
```yaml
spec:
  behavior:
    scaleUp:
      tolerance: 0.0   # Scale immediately on any change
    scaleDown:
      tolerance: 0.05  # Tolerate 5% before reducing
```
This is useful when you need ultra-fast reactions for scale-up (zero tolerance) but want to be conservative on scale-down.
This feature is controlled by the HPAConfigurableTolerance feature gate.
Testing autoscaling
How do you verify your HPA works? By generating artificial load.
Generating load with a temporary Pod
```shell
kubectl run load-generator --image=busybox --rm -it -- /bin/sh -c \
  "while true; do wget -q -O- http://mi-app-svc; done"
```
In another terminal, watch the HPA in action:
```shell
kubectl get hpa mi-app-hpa --watch
```
```text
NAME         REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
mi-app-hpa   Deployment/mi-app   10%/50%   2         10        2          5m
mi-app-hpa   Deployment/mi-app   68%/50%   2         10        2          5m30s
mi-app-hpa   Deployment/mi-app   68%/50%   2         10        3          6m
mi-app-hpa   Deployment/mi-app   52%/50%   2         10        4          6m30s
mi-app-hpa   Deployment/mi-app   35%/50%   2         10        4          7m
```
You’ll see replicas go up as load increases. When you stop the generator (Ctrl+C), after the stabilization window the replicas will come back down.
Useful commands for monitoring
```shell
# See detailed HPA status
kubectl describe hpa mi-app-hpa

# See HPA events
kubectl get events --field-selector involvedObject.name=mi-app-hpa

# See Pod metrics in real time
kubectl top pods -l app=mi-app --containers
```
The HPA detects high CPU load and automatically creates new replicas
VPA: Vertical Pod Autoscaler (introduction)
So far we’ve talked about scaling horizontally (more Pods). But there’s another dimension: scaling vertically — giving more CPU and memory to each individual Pod.

HPA scales horizontally (more Pods), VPA scales vertically (more resources per Pod)
What is the VPA?
The Vertical Pod Autoscaler (VPA) analyzes the actual resource usage of your Pods and automatically adjusts the CPU/memory requests and limits. It’s like having an assistant that says: “hey, this Pod is requesting 500Mi of memory but only using 120Mi — let’s adjust it.”
When to use VPA instead of HPA?
| Scenario | Use |
|---|---|
| Stateless app with variable traffic | HPA |
| App with stable load but poorly sized resources | VPA |
| Databases, caches (hard to scale horizontally) | VPA |
| Apps with unpredictable traffic spikes | HPA (or both) |
VPA modes
The VPA isn’t part of the Kubernetes core — you need to install it separately from the official repository.
| Mode | Behavior |
|---|---|
| Off | Only recommends, doesn’t apply changes (ideal for getting started) |
| Initial | Applies resources only when the Pod is created |
| Recreate | Deletes and recreates the Pod with the new resources |
| InPlaceOrRecreate | Tries to adjust without restarting (K8s v1.35+), if it can’t, recreates the Pod |
Basic VPA example
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: mi-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  updatePolicy:
    updateMode: "Off"  # Recommendations only, doesn't apply changes
  resourcePolicy:
    containerPolicies:
      - containerName: mi-app
        minAllowed:
          cpu: "25m"
          memory: "32Mi"
        maxAllowed:
          cpu: "500m"
          memory: "512Mi"
```
With Off mode, you can check the recommendations:
```shell
kubectl describe vpa mi-app-vpa
```
```text
Recommendation:
  Container Recommendations:
    Container Name:  mi-app
    Lower Bound:
      Cpu:     25m
      Memory:  50Mi
    Target:
      Cpu:     50m
      Memory:  80Mi
    Upper Bound:
      Cpu:     200m
      Memory:  256Mi
```
This tells you: “your Pods should have 50m CPU and 80Mi of memory as a target.” Great for fine-tuning your resources.requests without guessing.
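If you want just the target value, for example to paste into a manifest, kubectl’s jsonpath output can pull it out of the VPA status. A minimal sketch, assuming the VPA has a single container (hence index 0); the wrapper name vpa_target is made up for this example.

```shell
# Sketch: print the recommended target for one resource of the first container.
# Usage: vpa_target <vpa_name> <cpu|memory>
vpa_target() {
  local name="$1" resource="$2"
  kubectl get vpa "$name" \
    -o jsonpath="{.status.recommendation.containerRecommendations[0].target.${resource}}"
}
```

For example, vpa_target mi-app-vpa memory prints the memory target once the VPA has collected enough data to publish a recommendation.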
With In-Place Pod Resize (GA in K8s v1.35), the VPA can use InPlaceOrRecreate mode to adjust CPU and memory of your Pods without restarting them. A game-changer for apps that can’t afford downtime.
HPA + VPA together?
Yes you can, but be careful:
- Don’t use both to scale by the same metric (e.g., both by CPU). They’ll fight each other.
- It does work well: HPA by CPU/custom metrics + VPA by memory (in Initial or Off mode).
- The safest recommendation: use VPA in Off mode to get recommendations and adjust your requests manually. Leave the active scaling to the HPA.
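One way to keep the two from overlapping is to restrict the VPA to memory with controlledResources, leaving CPU to the HPA. The sketch below prints such a manifest; the VPA name mi-app-vpa-memory and the wrapper function are hypothetical, and Initial mode is just one of the two modes suggested above.

```shell
# Prints a VPA manifest that only manages memory (sketch).
vpa_memory_only() {
  cat <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: mi-app-vpa-memory
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  updatePolicy:
    updateMode: "Initial"
  resourcePolicy:
    containerPolicies:
      - containerName: mi-app
        controlledResources: ["memory"]
EOF
}

vpa_memory_only
```

To apply it, pipe the output to kubectl: vpa_memory_only | kubectl apply -f -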
Best practices for autoscaling in production
After battling with HPAs in production, these are the lessons that stick:
1. Always define resources.requests
Without requests, the HPA can’t calculate percentages. Define realistic requests based on your app’s actual usage (VPA in Off mode helps with this).
2. Don’t set minReplicas: 1
If your Pod restarts, you have zero replicas for a few seconds. Minimum 2 for high availability.
3. Configure the behavior
The defaults are reasonable, but every app is different. An API that needs immediate response needs aggressive scale-up. A background worker can scale more slowly.
4. Use Pod Disruption Budgets (PDB)
The HPA’s floor is minReplicas; a PDB doesn’t limit HPA scale-down. What it does is keep voluntary disruptions (node drains, cluster upgrades, VPA evictions) from taking out too many of the remaining Pods at once:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mi-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: mi-app
```
5. Monitor HPA events
```shell
kubectl describe hpa mi-app-hpa | grep -A 5 "Events"
```
If you see messages like failed to get cpu utilization, check that Metrics Server is running and that your Pods have resources.requests.
6. Watch out for the “yo-yo effect”
If your scale-down is too aggressive, the HPA drops replicas → the remaining Pods get overloaded → the HPA adds replicas → they stabilize → the HPA drops replicas… and on and on forever. Solution: a generous stabilization window on scale-down (300-600 seconds).
Official references
- Horizontal Pod Autoscaling — Complete HPA guide
- HPA Walkthrough — Step-by-step tutorial
- autoscaling/v2 API — API reference
- Vertical Pod Autoscaler — Official VPA repository
- Metrics Server — Installation and configuration
- Prometheus Adapter — Custom metrics for HPA
- In-Place Pod Resource Resize — Resize Pods without restarting them (GA in v1.35)
Summary
Today we mastered autoscaling in Kubernetes:
- The HPA scales Pods horizontally based on metrics (CPU, memory, custom).
- Use autoscaling/v2: the beta versions no longer exist.
- The behavior field gives you granular control over how scaling goes up and down.
- Custom metrics with Prometheus let you scale based on real business metrics.
- The VPA complements the HPA by adjusting resources per Pod (vertical scaling).
- In-Place Pod Resize (GA in K8s v1.35) lets the VPA adjust resources without restarting Pods.
- Always define resources.requests, configure behavior, and monitor HPA events.
With HPA + VPA + best practices, your infrastructure adapts to traffic on its own. Less manual ops, more coffee in peace.
In the next chapter we’ll set up Prometheus and Grafana in our cluster for full observability: metrics, dashboards, and everything you need to see what’s happening inside your cluster.