Understanding Kubernetes ReplicaSets: Ensuring High Availability for Your Applications

Introduction

In the previous article, we finished with an honest admission about pods. On their own, they’re fragile. If a node fails, the pods on it go with it and nothing brings them back. If you want more than one copy running, you create each pod yourself. If one crashes, it stays crashed until you intervene. Pods give you a unit of scheduling, but they don’t give you reliability.

The answer isn’t to make pods cleverer. It’s to wrap them in something that watches over them. Something that lets you say “I want three copies of this pod running” and takes responsibility for making that true, even as pods fail, nodes die, and traffic changes. That something is a ReplicaSet.

This article is about what a ReplicaSet is and how it solves the fragility problem we named in the last article. We’ll look at the reconciliation mechanism it uses, which is the same pattern you already met in Part 1. Then we’ll write one in YAML, apply it to a cluster, and watch it recover from a deleted pod. By the end, you’ll understand why you almost never create pods directly, and why ReplicaSet is the first Kubernetes resource that actually gives you production-grade reliability.

Let’s start with what a ReplicaSet does.

What a ReplicaSet Does

A ReplicaSet ensures that a specified number of identical pods are running at any given time. You tell it how many pods you want, and it makes sure that many exist. If there are too few, it creates more. If there are too many, it removes some. If one disappears, it replaces it.

The way a ReplicaSet does this is the reconciliation mechanism we met in Part 1. It continuously compares what you asked for against what’s actually running, and takes whatever action is needed to close the gap. For a ReplicaSet, what you asked for is a number: “three pods matching these labels running”. If three pods match, nothing happens. If two match, the ReplicaSet creates one more. If four match because something unexpected happened, the ReplicaSet picks one and deletes it.

That’s the whole idea. The interesting part is what this looks like in practice, because the same behaviour handles situations that might feel like different problems.
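The comparison at the heart of reconciliation is simple enough to sketch. This is not the real controller code, just an illustration of the check it performs; the function name and types here are invented for the example.

```go
package main

import "fmt"

// reconcile compares the desired replica count with the number of
// matching pods actually observed, and returns how many pods to
// create (positive) or delete (negative) to close the gap.
func reconcile(desired, observed int) int {
	return desired - observed
}

func main() {
	fmt.Println(reconcile(3, 3)) // 0: nothing to do
	fmt.Println(reconcile(3, 2)) // 1: create one pod
	fmt.Println(reconcile(3, 4)) // -1: delete one pod
}
```

The real controller runs this comparison every time something changes in the cluster, which is why every failure mode reduces to the same response.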

A pod crashes. The container stops, the process hits an unhandled exception, something inside the pod goes wrong. The ReplicaSet now has fewer pods than it should. It creates a new pod from its template, and the scheduler places the new pod on a node. A few seconds later, your application is back to full capacity.

You change the count. You want five copies of your application instead of three, maybe because traffic is up, maybe because you want more headroom. You update the ReplicaSet’s count from 3 to 5. It’s now short by two, so it creates two more pods. Scaling down works the same way. Lower the count to 2, and the ReplicaSet removes three pods to match.

A node fails. A worker node becomes unreachable. Every pod on it is effectively gone. If the node was running two of your ReplicaSet’s pods, the ReplicaSet is suddenly short by two and creates two replacements. The scheduler doesn’t try to put them back on the failed node; it picks healthy nodes with capacity. Your application keeps running, just on different machines.

Three situations, one behaviour. The ReplicaSet isn’t doing three different things for three different problems. It’s doing the same thing in all three cases: noticing the gap, creating or removing pods to close it. That’s what makes it reliable. Whatever goes wrong, the answer is always the same.

Which raises a practical question: how does a ReplicaSet know which pods are its pods? That’s where labels come in, and it’s the first thing we’ll see when we write the YAML.

A ReplicaSet in YAML

You know what a ReplicaSet does. Three situations, one behaviour. Notices the gap, creates or removes pods to close it. Now you need to tell Kubernetes to actually run one, and that means writing it down.

Here’s the whole thing for a ReplicaSet that maintains three copies of an nginx pod.

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: frontend
  labels:
    app: frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: nginx
          image: nginx:1.27
          ports:
            - containerPort: 80

Read it from the top, and you can see the whole story we just told.

The opening lines are the same shape as a pod from Part 2. apiVersion, kind, metadata. The only real change is the apiVersion, which is apps/v1 now instead of v1. ReplicaSets and pods live in different corners of the Kubernetes API, and the version tells Kubernetes which corner to look in.

Then there’s the spec, and this is where it gets interesting.

replicas: 3 is the thing you actually asked for. Remember everything from the last section about noticing gaps? This is the number the ReplicaSet compares against. If fewer than three pods exist, it’s short. If more than three exist, it has too many. Everything the ReplicaSet does, it does in service of this number.

The selector is how it finds its pods. You can’t maintain a count of three if you can’t tell which pods you’re counting. A cluster might have fifty pods running; only some of them belong to this ReplicaSet. The selector is the filter: “I care about pods labelled app: frontend, and nothing else.” Every time the ReplicaSet wakes up to check how it’s doing, it counts pods matching this label.

The template is the recipe. When the ReplicaSet finds itself short a pod, it can’t create just anything; it needs to know what kind of pod to create. That’s what the template is for. And if you look inside it, past the template: line, you’ll see something familiar. Metadata. Spec. Containers. Image. Ports. It’s a pod definition, exactly the one you’d write if you were creating a pod directly. The ReplicaSet wraps it, but it doesn’t change it.

One detail is worth noticing before you apply this. The label app: frontend appears three times. Once on the ReplicaSet itself. Once in the selector. Once on the pods the template creates. The first one is optional, just a tag you can use to find this ReplicaSet later. But the selector and the template labels have to match. The selector says “I’m looking for pods labelled X.” The template says “Every pod I create will be labelled Y.” If X and Y are different, the ReplicaSet creates pods and then can’t see them. It thinks it still needs more, creates more, can’t see those either, and keeps going forever.
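The matching rule itself is simple: a pod belongs to the ReplicaSet only if every key/value pair in the selector appears among the pod’s labels. Here’s a rough sketch of that check, written for illustration rather than taken from Kubernetes’ actual implementation.

```go
package main

import "fmt"

// matches reports whether a pod's labels satisfy a selector: every
// key/value pair in the selector must be present in the pod's labels.
// Extra labels on the pod are fine; a missing or different one is not.
func matches(selector, podLabels map[string]string) bool {
	for key, want := range selector {
		if podLabels[key] != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"app": "frontend"}

	// A pod with the right label (plus extras) matches.
	fmt.Println(matches(selector, map[string]string{"app": "frontend", "tier": "web"})) // true

	// A pod labelled differently is invisible to this ReplicaSet.
	fmt.Println(matches(selector, map[string]string{"app": "backend"})) // false
}
```

Notice the asymmetry: extra labels on a pod don’t matter, but a selector key that the pod lacks makes it invisible. That asymmetry is exactly what makes a selector/template mismatch so dangerous.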

That’s the whole spec. Three lines that tell the ReplicaSet what to do, wrapped around a pod definition that tells it what to create. Let’s apply it and watch it work.

Watching It Work

Save the YAML to a file called frontend-rs.yaml and hand it to Kubernetes.

kubectl apply -f frontend-rs.yaml

What happens next is the same path we traced in Part 1. The API server receives the request, validates it, and stores the ReplicaSet in etcd. The ReplicaSet controller inside the controller manager starts watching and immediately notices something it doesn’t like: you asked for three pods, and zero exist. It creates three pods from the template. The scheduler places each one on a node, the kubelets on those nodes start the containers, and a few seconds later everything is running.

Check the ReplicaSet.

kubectl get replicaset

NAME       DESIRED   CURRENT   READY   AGE
frontend   3         3         3       20s

Three desired, three current, three ready. The gap is closed. Check the pods too:

kubectl get pods

NAME             READY   STATUS    RESTARTS   AGE
frontend-x4k2p   1/1     Running   0          20s
frontend-m8kl9   1/1     Running   0          20s
frontend-p9j7r   1/1     Running   0          20s

Three pods, each with a name that starts with the ReplicaSet’s name and ends in a random string. The suffix is Kubernetes giving each pod a unique identity.
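Under the hood, the ReplicaSet asks the API server to generate each name from a fixed prefix plus a short random suffix. A rough sketch of the idea follows; the alphabet and suffix length here are chosen to resemble, not replicate, Kubernetes’ real name generator.

```go
package main

import (
	"fmt"
	"math/rand"
)

// podName builds a pod name from the ReplicaSet's name plus a random
// five-character suffix, roughly how generated pod names look.
func podName(rsName string) string {
	const alphabet = "abcdefghijklmnopqrstuvwxyz0123456789"
	suffix := make([]byte, 5)
	for i := range suffix {
		suffix[i] = alphabet[rand.Intn(len(alphabet))]
	}
	return rsName + "-" + string(suffix)
}

func main() {
	fmt.Println(podName("frontend")) // e.g. frontend-x4k2p
}
```

The prefix ties the pod back to its ReplicaSet for humans reading the output; the suffix keeps each pod’s name unique across replacements.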

Here’s what you just created.

A ReplicaSet managing three pods spread across three worker nodes

The ReplicaSet sits at the top, with the three pods it created underneath. Each pod ended up on a different worker node, which is the scheduler’s default behaviour: spread pods out so a single node failure can’t take all your copies down at once. The dashed lines from the ReplicaSet to each pod represent the logical relationship. These pods belong to this ReplicaSet because their labels match its selector.

That’s the shape of what you’ve built. Now for the interesting part. We talked about what happens when a pod disappears. Let’s make one disappear.

kubectl delete pod frontend-x4k2p

Check the pods again.

kubectl get pods

NAME             READY   STATUS    RESTARTS   AGE
frontend-m8kl9   1/1     Running   0          1m
frontend-p9j7r   1/1     Running   0          1m
frontend-k2f9x   1/1     Running   0          4s

The one you deleted is gone. A new one has taken its place, four seconds old, already running. You didn’t ask for it, didn’t do anything, didn’t even refresh a dashboard. The ReplicaSet saw a gap and closed it.

That’s the whole idea from section 2 playing out in front of you. Three situations, one behaviour. You just watched the first one.

Scaling works the same way. If you want five pods instead of three, you change the number.

kubectl scale replicaset frontend --replicas=5

Check again.

kubectl get pods

NAME             READY   STATUS    RESTARTS   AGE
frontend-m8kl9   1/1     Running   0          2m
frontend-p9j7r   1/1     Running   0          2m
frontend-k2f9x   1/1     Running   0          1m
frontend-j3h8r   1/1     Running   0          3s
frontend-n5k2m   1/1     Running   0          3s

Two new pods, both running. Same mechanism, different trigger. You moved the number up, the ReplicaSet noticed it was short, and created what was needed.

Scale back down and you’ll see the opposite.

kubectl scale replicaset frontend --replicas=2

Three of the pods will disappear. The ReplicaSet looked at the count (two), looked at what was running (five), and removed enough to match. The three situations from section 2 were a pod crash, a count change, and a node failure. You’ve just watched the first two live.

You could trigger the third one too, by draining a node or simulating a failure, but by now the pattern is clear. The ReplicaSet is doing exactly one thing, over and over, for every situation it meets. The YAML you wrote is just how you asked it to start.
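One footnote on the scaling commands: kubectl scale is an imperative shortcut. The declarative equivalent, and the habit worth building, is to change the number in your file and re-apply it, so the YAML stays the source of truth. In frontend-rs.yaml that’s a one-line change:

```yaml
spec:
  replicas: 2   # was 3; run kubectl apply -f frontend-rs.yaml to take effect
```

Either way, the controller only ever sees the stored number and reconciles against it; it doesn’t know or care how the number got there.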

Where This Leads

You’ve built a ReplicaSet. It maintains three pods for you, replaces them when they fail, and scales them up or down when you change the count. Your application is now reliable in a way a single pod on its own never was.

But there’s a problem you might not have noticed yet. Look at the pod names in the output you’ve been watching.

frontend-x4k2p
frontend-m8kl9
frontend-p9j7r

Each one has a random suffix. When you deleted one, it came back with a different suffix. When you scaled up, new ones appeared with their own suffixes. Every pod gets a new IP address when it’s created, and that IP is released when the pod is removed. The ReplicaSet is keeping three pods alive, but which three pods, and with which IPs, keeps changing.

This is fine for the ReplicaSet itself. It doesn’t care about specific pods, only the count. But it’s a problem for anything that wants to actually use those pods. If another part of your application needs to send a request to the frontend, which pod does it send it to? What IP does it connect to? The IPs aren’t stable, so hardcoding one doesn’t work. Even if you listed all three, the moment a pod gets replaced, one of those IPs stops working.

You need something that sits in front of your pods and gives them a stable address. Something that knows which pods are currently healthy and forwards traffic to them, whatever their IPs happen to be right now. That something is called a Service, and it’s the next article.