Kubernetes: Autoscaling with HPA and an Intro to VPA

2025-12-27

Updated March 2026: This article uses autoscaling/v2 (the current stable version) and includes recent features like configurable tolerance (v1.33) and In-Place Pod Resize (GA in v1.35).

Prerequisites

This post is the direct continuation of the Kubernetes series. You need:

Everything from the first chapter: Docker (or OrbStack), kubectl, and Kind with an active cluster.
Everything from the second chapter: knowing how to create Deployments, Services, and YAML manifests.
A Deployment running with resources.requests defined (we did this in the previous post).

If you don’t have this ready, check the previous posts first. Here we jump straight into autoscaling.

Introduction

In the previous chapter we learned how to scale manually with kubectl scale. That works, but it has an obvious problem: you need a human watching. What happens on a Friday at 11 PM when your app goes viral on social media and traffic multiplies by 10? You don’t want to be there scaling by hand.

That’s what the Horizontal Pod Autoscaler (HPA) is for: you define rules and Kubernetes scales your Pods automatically based on real metrics. It’s like putting your infrastructure on autopilot.

The HPA monitors metrics and automatically adjusts the number of replicas in your Deployment

How does the HPA work under the hood?

The HPA isn’t magic — it’s a control loop that runs every 15 seconds (by default) and does the following:

Reads metrics from the Metrics Server (CPU, memory) or from custom metrics adapters.
Calculates how many replicas it needs using this formula:

desired replicas = ⌈ current replicas × (current metric / target metric) ⌉

Scales the Deployment only if the ratio current metric / target metric differs from 1 by more than the tolerance (10% by default).
Repeats the cycle.

For example: you have 2 replicas, the current CPU usage is 80%, and your target is 50%. The calculation would be:

replicas = ⌈ 2 × (80 / 50) ⌉ = ⌈ 3.2 ⌉ = 4 replicas

The HPA would scale from 2 to 4 Pods.
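
The calculation above can be sketched as a small shell function. This is an illustrative sketch, not the controller's actual code: the name desired_replicas is made up for this post, and because it uses integer arithmetic, metric values are passed as whole numbers.

```shell
#!/usr/bin/env bash
# desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC
# Integer-only version of ceil(current_replicas * current_metric / target_metric).
# The function name is hypothetical; only the formula comes from the text above.
desired_replicas() {
  local replicas=$1 current=$2 target=$3
  # (a + b - 1) / b is ceiling division for positive integers.
  echo $(( (replicas * current + target - 1) / target ))
}

# 2 replicas at 80% CPU with a 50% target
desired_replicas 2 80 50   # prints 4
```

The same function also covers scale-down: 4 replicas at 25% against a 50% target gives 2.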

The HPA control loop: read metrics → calculate → scale → repeat

Installing Metrics Server on Kind

The HPA needs a Metrics Server to read CPU and memory metrics. On managed clusters (EKS, GKE, AKS) it usually comes preinstalled, but on Kind you need to install it manually.

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

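On Kind you need a patch because the kubelets use self-signed serving certificates that Metrics Server can't verify. Add the --kubelet-insecure-tls flag to the Metrics Server Deployment (fine for a local cluster, not something to copy into production):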

kubectl patch deployment metrics-server -n kube-system \
  --type='json' \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'

Wait for it to be ready:

kubectl wait --namespace kube-system \
  --for=condition=ready pod \
  --selector=k8s-app=metrics-server \
  --timeout=90s

Verify it works:

kubectl top nodes

NAME                       CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
mi-cluster-control-plane   250m         12%    512Mi           26%
mi-cluster-worker          120m         6%     256Mi           13%

kubectl top pods

If you see metrics, you’re ready for the HPA.

HPA by CPU: practical example

Let’s go with the most common case. First, make sure you have a Deployment with resources.requests defined. If you followed the previous post, you already have it. If not, create deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mi-app
  labels:
    app: mi-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mi-app
  template:
    metadata:
      labels:
        app: mi-app
    spec:
      containers:
        - name: mi-app
          image: nginx:latest
          ports:
            - containerPort: 80
          resources:
            requests:
              memory: "64Mi"
              cpu: "50m"
            limits:
              memory: "128Mi"
              cpu: "100m"

kubectl apply -f deployment.yaml

Info

Without resources.requests, the HPA can’t calculate percentages and won’t work. Always define requests in your containers.
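
To see why requests matter, here is a rough sketch of how a utilization percentage can be derived: total usage divided by total requests. The function name cpu_utilization and the sample usage values are assumptions for illustration, and the sketch assumes every Pod has the same request, as Pods from one Deployment do.

```shell
#!/usr/bin/env bash
# cpu_utilization REQUEST_M USAGE_M [USAGE_M ...]
# Prints usage as a percentage of requests, in whole percent (truncated).
# All values are millicores; the function name is hypothetical.
cpu_utilization() {
  local request=$1; shift
  local total_usage=0 pods=0 usage
  for usage in "$@"; do
    total_usage=$(( total_usage + usage ))
    pods=$(( pods + 1 ))
  done
  # With no request there is nothing to divide by, so no percentage exists.
  if [ "$request" -le 0 ] || [ "$pods" -eq 0 ]; then
    echo "cannot compute utilization without requests" >&2
    return 1
  fi
  echo $(( total_usage * 100 / (request * pods) ))
}

# Two Pods using 30m and 50m against the 50m request from deployment.yaml
cpu_utilization 50 30 50   # prints 80
```

With a request of 0 the function fails instead of printing a number, which is roughly the situation the HPA is in when requests are missing.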

Creating the HPA with a command

The quick way:

kubectl autoscale deployment mi-app --min=2 --max=10 --cpu-percent=50

Verify:

kubectl get hpa

NAME     REFERENCE           TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
mi-app   Deployment/mi-app   10%/50%   2         10        2          30s

Creating the HPA with a YAML manifest

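For production, always use YAML. If you created the HPA with the command above, delete it first (kubectl delete hpa mi-app) so that two HPAs don't target the same Deployment. Then create hpa-cpu.yaml: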

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mi-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50

kubectl apply -f hpa-cpu.yaml

Let’s break it down:

Field	What it does
`scaleTargetRef`	Which Deployment to target
`minReplicas: 2`	Never go below 2 replicas (high availability)
`maxReplicas: 10`	Never go above 10 (cost control)
`averageUtilization: 50`	Scale when the average CPU exceeds 50% of the requests

Important

We’re using autoscaling/v2, the stable and current version. The v2beta1 and v2beta2 versions have been removed from Kubernetes (in v1.25 and v1.26 respectively). If you see tutorials using those versions, update them.

HPA by memory

CPU-based scaling is the most common, but sometimes your app is memory-intensive (caches, data processing, etc.). You can scale by memory the same way.

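Create hpa-memory.yaml (if the CPU HPA from the previous section is still applied, delete it first, since only one HPA should manage a given Deployment):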

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mi-app-hpa-memory
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  minReplicas: 2
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70

Warning

Unlike CPU, memory doesn’t always drop when traffic drops (many apps don’t release memory easily). This can cause the HPA to scale up but not scale back down. Keep this in mind when choosing your thresholds.

Combining metrics: CPU + memory

Why pick just one metric when you can use both? When you define multiple metrics, the HPA calculates the replicas needed for each one and picks the highest number.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mi-app-hpa-multi
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70

If CPU says you need 4 replicas and memory says you need 6, the HPA will set 6 replicas. The highest number always wins to ensure both metrics stay under control.
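
A minimal sketch of that "highest wins" rule, reusing the replica formula from earlier in the post. The function names and the sample utilization numbers are assumptions chosen so that CPU proposes 4 and memory proposes 6.

```shell
#!/usr/bin/env bash
# ceil_div A B: ceiling of A / B for positive integers.
ceil_div() { echo $(( ($1 + $2 - 1) / $2 )); }

# replicas_for_metrics CURRENT_REPLICAS CURRENT:TARGET [CURRENT:TARGET ...]
# Computes one proposal per metric and prints the largest one.
# Names are hypothetical; this is not the controller's code.
replicas_for_metrics() {
  local replicas=$1; shift
  local best=0 pair current target proposal
  for pair in "$@"; do
    current=${pair%%:*}   # text before the colon
    target=${pair##*:}    # text after the colon
    proposal=$(ceil_div $(( replicas * current )) "$target")
    if [ "$proposal" -gt "$best" ]; then best=$proposal; fi
  done
  echo "$best"
}

# 3 replicas, CPU at 60% vs a 50% target, memory at 130% vs a 70% target
replicas_for_metrics 3 60:50 130:70   # prints 6
```

CPU alone would propose 4 replicas here, memory alone 6, so the result is 6.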

Custom metrics with Prometheus

CPU and memory metrics are a good starting point, but in production you want to scale based on business metrics: requests per second, messages in queue, latency, etc. For this you need a custom metrics adapter.

The most popular combination is Prometheus + Prometheus Adapter.

Architecture

Pods → export metrics → Prometheus (scraping) → Prometheus Adapter → HPA

Custom metrics flow: App → Prometheus → Adapter → HPA

Example: scaling by requests per second

Let’s say your app exposes an http_requests_total counter, the adapter turns it into a per-second rate named http_requests_per_second, and you want to scale when that rate exceeds 100 requests/s per Pod:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mi-app-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"

Note

Setting up Prometheus + Adapter from scratch is a big topic. In the next post I’ll walk you through how to set up Prometheus and Grafana in your cluster step by step. The key takeaway here is that the HPA can scale by any metric, not just CPU and memory.
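
For the Pods metric in the manifest above, the arithmetic can be sketched like this. The counter samples, the 60-second window, and both function names are assumptions for illustration; in a real setup the rate would come from Prometheus, not from shell arithmetic.

```shell
#!/usr/bin/env bash
# counter_rate START END SECONDS
# Average per-second rate from two samples of an ever-increasing counter.
counter_rate() {
  local start=$1 end=$2 seconds=$3
  echo $(( (end - start) / seconds ))
}

# replicas_for_average_value TOTAL_RATE TARGET_PER_POD
# ceil(total / target): with an AverageValue target, the average per Pod times
# the number of Pods is just the total, so the replica count cancels out.
replicas_for_average_value() {
  local total=$1 target=$2
  echo $(( (total + target - 1) / target ))
}

# Hypothetical samples: the counter summed over all Pods grew by 27000 in 60s.
total=$(counter_rate 120000 147000 60)
echo "total rate: ${total}/s"                                   # total rate: 450/s
echo "replicas: $(replicas_for_average_value "$total" 100)"     # replicas: 5
```

At 450 requests/s in total and a target of 100 per Pod, four Pods would each sit above the target, so the result is 5.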

Behavior: controlling how the HPA scales

By default, the HPA scales up without delay and scales down slowly (a 5-minute stabilization window applies to scale-down only). But you can customize this with the behavior field.

Why does it matter?

Too aggressive scale-up: you create 50 Pods in a 10-second spike and then they sit idle.
Too aggressive scale-down: you drop replicas and another spike hits, causing latency while the new Pods start up.

Example with custom behavior

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mi-app-hpa-behavior
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 120
      selectPolicy: Max

Let’s break it down:

Scale-up:

Field	What it does
`stabilizationWindowSeconds: 30`	Scale up only to the lowest replica count recommended during the last 30s (avoids reacting to instant spikes)
`Percent: 50`	Can increase up to 50% of current replicas per minute
`Pods: 4`	Or add up to 4 Pods per minute (whichever is greater)
`selectPolicy: Max`	Pick the most aggressive policy (the one that adds more Pods)

Scale-down:

Field	What it does
`stabilizationWindowSeconds: 300`	Scale down only to the highest replica count recommended during the last 5 minutes (avoids the “yo-yo effect”)
`Percent: 25`	Reduce a maximum of 25% every 2 minutes
`selectPolicy: Max`	Pick the policy that reduces the most
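
To make the scale-up policies concrete, here is a sketch of the ceiling they impose within one 60-second period. The function name is made up, and rounding the Percent policy up is an assumption on my part; the result would still be capped by maxReplicas.

```shell
#!/usr/bin/env bash
# scale_up_limit CURRENT_REPLICAS PERCENT PODS
# Largest replica count allowed in one period when selectPolicy is Max.
# Hypothetical helper, not the controller's code.
scale_up_limit() {
  local current=$1 percent=$2 pods=$3
  # Percent policy: current * (1 + percent/100), rounded up (assumed rounding).
  local by_percent=$(( (current * (100 + percent) + 99) / 100 ))
  # Pods policy: a fixed number of extra Pods.
  local by_pods=$(( current + pods ))
  # selectPolicy: Max keeps whichever policy allows more replicas.
  if [ "$by_percent" -gt "$by_pods" ]; then
    echo "$by_percent"
  else
    echo "$by_pods"
  fi
}

scale_up_limit 4 50 4    # prints 8  (Pods policy wins: 4 + 4)
scale_up_limit 20 50 4   # prints 30 (Percent policy wins: 20 * 1.5)
```

With these values the Pods policy dominates below 8 replicas and the Percent policy above, which is why combining both gives fast growth at small and large sizes alike.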

Configurable tolerance (K8s v1.33+)

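Starting with Kubernetes v1.33, you can configure the tolerance on each HPA object, separately for scale-up and scale-down. Previously it could only be changed cluster-wide with the controller manager flag --horizontal-pod-autoscaler-tolerance. By default, the HPA ignores differences smaller than 10% to avoid unnecessary scaling. Now you can customize it: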

spec:
  behavior:
    scaleUp:
      tolerance: 0.0    # Scale immediately on any change
    scaleDown:
      tolerance: 0.05   # Tolerate 5% before reducing

This is useful when you need ultra-fast reactions for scale-up (zero-tolerance) but want to be conservative on scale-down.

Note

This feature was introduced as alpha in v1.33 and moved to beta in v1.35 (enabled by default). In v1.33/v1.34 it requires manually enabling the feature gate HPAConfigurableTolerance.
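
The tolerance check itself is simple enough to sketch. The helper names below are hypothetical and the tolerance is passed as a whole percentage to stay within shell integer arithmetic; the comparison mirrors the rule described above (ignore the change when the ratio is within the tolerance of 1).

```shell
#!/usr/bin/env bash
# within_tolerance CURRENT TARGET TOLERANCE_PERCENT
# Succeeds (exit 0) when |current/target - 1| <= tolerance, i.e. no scaling.
within_tolerance() {
  local current=$1 target=$2 tol=$3
  local diff=$(( current - target ))
  if [ "$diff" -lt 0 ]; then diff=$(( -diff )); fi
  # |current - target| / target <= tol / 100, rewritten without fractions.
  [ $(( diff * 100 )) -le $(( tol * target )) ]
}

# check CURRENT TARGET TOLERANCE_PERCENT: prints "hold" or "scale".
check() {
  if within_tolerance "$@"; then echo "hold"; else echo "scale"; fi
}

check 54 50 10   # default 10% tolerance: prints hold
check 54 50 0    # scaleUp tolerance 0.0: prints scale
check 48 50 5    # scaleDown tolerance 0.05: prints hold
```

The second call shows the effect of a zero scale-up tolerance: the same 54% reading that the default would ignore now triggers a scale-up.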

Testing autoscaling

How do you verify your HPA works? By generating artificial load against the Service in front of the Deployment (mi-app-svc in the command below).

Generating load with a temporary Pod

kubectl run load-generator --image=busybox --rm -it -- /bin/sh -c \
  "while true; do wget -q -O- http://mi-app-svc; done"

In another terminal, watch the HPA in action:

kubectl get hpa mi-app-hpa --watch

NAME         REFERENCE           TARGETS    MINPODS   MAXPODS   REPLICAS   AGE
mi-app-hpa   Deployment/mi-app   10%/50%    2         10        2          5m
mi-app-hpa   Deployment/mi-app   68%/50%    2         10        2          5m30s
mi-app-hpa   Deployment/mi-app   68%/50%    2         10        3          6m
mi-app-hpa   Deployment/mi-app   52%/50%    2         10        4          6m30s
mi-app-hpa   Deployment/mi-app   35%/50%    2         10        4          7m

You’ll see replicas go up as load increases. When you stop the generator (Ctrl+C), after the stabilization window the replicas will come back down.

Useful commands for monitoring

# See detailed HPA status
kubectl describe hpa mi-app-hpa

# See HPA events
kubectl get events --field-selector involvedObject.name=mi-app-hpa

# See Pod metrics in real time
kubectl top pods -l app=mi-app --containers

The HPA detects high CPU load and automatically creates new replicas

VPA: Vertical Pod Autoscaler (introduction)

So far we’ve talked about scaling horizontally (more Pods). But there’s another dimension: scaling vertically — giving more CPU and memory to each individual Pod.

HPA scales horizontally (more Pods), VPA scales vertically (more resources per Pod)

What is the VPA?

The Vertical Pod Autoscaler (VPA) analyzes the actual resource usage of your Pods and automatically adjusts the CPU/memory requests and limits. It’s like having an assistant that says: “hey, this Pod is requesting 500Mi of memory but only using 120Mi — let’s adjust it.”

When to use VPA instead of HPA?

Scenario	Use
Stateless app with variable traffic	HPA
App with stable load but poorly sized resources	VPA
Databases, caches (hard to scale horizontally)	VPA
Apps with unpredictable traffic spikes	HPA (or both)

VPA modes

The VPA isn’t part of the Kubernetes core — you need to install it separately from the official repository.

Mode	Behavior
Off	Only recommends, doesn’t apply changes (ideal for getting started)
Initial	Applies resources only when the Pod is created
Recreate	Deletes and recreates the Pod with the new resources
InPlaceOrRecreate	Tries to adjust without restarting (K8s v1.35+), if it can’t, recreates the Pod

Basic VPA example

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: mi-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: mi-app
  updatePolicy:
    updateMode: "Off"  # Recommendations only, doesn't apply changes
  resourcePolicy:
    containerPolicies:
      - containerName: mi-app
        minAllowed:
          cpu: "25m"
          memory: "32Mi"
        maxAllowed:
          cpu: "500m"
          memory: "512Mi"

With Off mode, you can check the recommendations:

kubectl describe vpa mi-app-vpa

Recommendation:
  Container Recommendations:
    Container Name: mi-app
    Lower Bound:
      Cpu:     25m
      Memory:  50Mi
    Target:
      Cpu:     50m
      Memory:  80Mi
    Upper Bound:
      Cpu:     200m
      Memory:  256Mi

This tells you: “your Pods should have 50m CPU and 80Mi of memory as a target.” Great for fine-tuning your resources.requests without guessing.
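
If you only want the Target values, for example to paste them into your manifest, a small wrapper around kubectl can pull them out. This is a sketch: the function name vpa_target is made up, and it assumes the recommendation lives under status.recommendation.containerRecommendations, mirroring the describe output above, with the container of interest first in that list.

```shell
#!/usr/bin/env bash
# vpa_target VPA_NAME RESOURCE
# Prints the Target recommendation (cpu or memory) for the first container.
# Hypothetical helper; the field path is an assumption based on the
# "kubectl describe vpa" output shown in this post.
vpa_target() {
  local name=$1 resource=$2
  kubectl get vpa "$name" \
    -o jsonpath="{.status.recommendation.containerRecommendations[0].target.${resource}}"
}

# Usage (needs a cluster with the VPA installed and a recommendation ready):
#   vpa_target mi-app-vpa cpu
#   vpa_target mi-app-vpa memory
```

The block only defines the function, so nothing talks to the cluster until you call it.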

Success

The In-Place Pod Resource Resize feature is now GA (stable) as of Kubernetes v1.35. This allows the VPA (in InPlaceOrRecreate mode) to adjust CPU and memory of your Pods without restarting them. A game-changer for apps that can’t afford downtime.

HPA + VPA together?

Yes you can, but be careful:

Don’t use both to scale by the same metric (e.g., both by CPU). They’ll fight each other.
This combination does work well: HPA by CPU/custom metrics + VPA by memory (in Initial or Off mode).
The safest recommendation: use VPA in Off mode to get recommendations and adjust your requests manually. Leave the active scaling to the HPA.

Best practices for autoscaling in production

After battling with HPAs in production, these are the lessons that stick:

1. Always define `resources.requests`

Without requests, the HPA can’t calculate percentages. Define realistic requests based on your app’s actual usage (VPA in Off mode helps with this).

2. Don’t set `minReplicas: 1`

If your Pod restarts, you have zero replicas for a few seconds. Minimum 2 for high availability.

3. Configure the behavior

The defaults are reasonable, but every app is different. An API that needs immediate response needs aggressive scale-up. A background worker can scale more slowly.

4. Use Pod Disruption Budgets (PDB)

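A PDB doesn’t limit the HPA itself (minReplicas does that). It limits voluntary disruptions such as node drains and upgrades, which hurt the most when the HPA has already scaled you down to a few replicas: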

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: mi-app-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: mi-app

5. Monitor HPA events

kubectl describe hpa mi-app-hpa | grep -A 5 "Events"

If you see messages like failed to get cpu utilization, check that Metrics Server is running and that your Pods have resources.requests.

6. Watch out for the “yo-yo effect”

If your scale-down is too aggressive, the HPA drops replicas → the remaining Pods get overloaded → the HPA adds replicas → they stabilize → the HPA drops replicas… and on and on forever. Solution: a generous stabilization window on scale-down (300-600 seconds).

Official references

Horizontal Pod Autoscaling — Complete HPA guide
HPA Walkthrough — Step-by-step tutorial
autoscaling/v2 API — API reference
Vertical Pod Autoscaler — Official VPA repository
Metrics Server — Installation and configuration
Prometheus Adapter — Custom metrics for HPA
In-Place Pod Resource Resize — Resize Pods without restarting them (GA in v1.35)

Summary

Today we mastered autoscaling in Kubernetes:

The HPA scales Pods horizontally based on metrics (CPU, memory, custom).
Use autoscaling/v2 — the beta versions no longer exist.
The behavior field gives you granular control over how scaling goes up and down.
Custom metrics with Prometheus let you scale based on real business metrics.
The VPA complements the HPA by adjusting resources per Pod (vertical scaling).
In-Place Pod Resize (GA in K8s v1.35) lets the VPA adjust resources without restarting Pods.
Always define resources.requests, configure behavior, and monitor HPA events.

With HPA + VPA + best practices, your infrastructure adapts to traffic on its own. Less manual ops, more coffee in peace.

In the next chapter we’ll set up Prometheus and Grafana in our cluster for full observability: metrics, dashboards, and everything you need to see what’s happening inside your cluster.

Did you enjoy this article? Share it with someone who’s getting started with Kubernetes. And if you have questions, drop me a comment!
