Kubernetes StatefulSet: A Deep Dive

Kubernetes StatefulSet: A Deep Dive

Still find that taking notes by hand helps things stick better.

When I first started working with Kubernetes, I thought everything could be a Deployment. Then I tried to run a database cluster. That's when I learned about StatefulSets the hard way.

What makes StatefulSet different?

In Kubernetes, most applications can run as Deployments. You scale them up and down, pods get random names, and if one dies, another takes its place. But what about databases? Message queues? Distributed systems that need stable network identities?

That's where StatefulSet comes in.

The core differences

FeatureDeploymentStatefulSet
Pod NamesRandom (web-7d8f9c-xk2m)Ordered (web-0, web-1, web-2)
Creation OrderParallelSequential (0 then 1 then 2)
Deletion OrderRandomReverse (2 then 1 then 0)
StorageShared (optional)Dedicated per pod
Network IdentityUnstableStable DNS names
ScalingFast, parallelSlow, sequential

How StatefulSet works

Pod naming and ordering

StatefulSet creates pods with predictable names following the pattern: <statefulset-name>-<ordinal-index>

cassandra-0  ← Created first
cassandra-1  ← Created after cassandra-0 is Ready
cassandra-2  ← Created after cassandra-1 is Ready

This ordering matters. Databases need the leader pod ready before followers join. Distributed systems depend on predictable hostnames for node discovery. Message queues require ordered startup to form a cluster.

Sequential creation process

When you create a StatefulSet with 3 replicas:

Time 0s:  cassandra-0 → Pending
Time 5s:  cassandra-0 → Running (waiting for Ready)
Time 30s: cassandra-0 → Ready ✓
Time 31s: cassandra-1 → Pending
Time 36s: cassandra-1 → Running (waiting for Ready)
Time 50s: cassandra-1 → Ready ✓
Time 51s: cassandra-2 → Pending
...

Why sequential? Because many stateful applications need the leader or master node up first. Configuration gets copied from previous replicas. Cluster membership has to be registered before the next node joins. You can't just start everything at once and hope it sorts itself out.

Stable network identity

Each StatefulSet pod gets a stable DNS name that survives restarts:

<pod-name>.<service-name>.<namespace>.svc.cluster.local

Example for a Cassandra cluster:

cassandra-0.cassandra.default.svc.cluster.local
cassandra-1.cassandra.default.svc.cluster.local
cassandra-2.cassandra.default.svc.cluster.local

Even if cassandra-1 crashes and restarts, it keeps the same DNS name. Your application code can hardcode these addresses without worry.

Persistent storage with StatefulSet

VolumeClaimTemplates

The real point of StatefulSet is dedicated persistent storage per pod:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:14
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
  - metadata:
      name: postgres-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

This creates:

postgres-0 → PVC: postgres-data-postgres-0 (10Gi)
postgres-1 → PVC: postgres-data-postgres-1 (10Gi)
postgres-2 → PVC: postgres-data-postgres-2 (10Gi)

What happens during scaling?

Scaling up (3 to 4 replicas):

1. Create new PVC: postgres-data-postgres-3
2. Wait for PVC to bind to a PersistentVolume
3. Create pod: postgres-3
4. Mount postgres-data-postgres-3 to postgres-3

Scaling down (4 to 3 replicas):

1. Delete pod: postgres-3
2. PVC postgres-data-postgres-3 remains (not deleted!)
3. Data preserved for future scale-up

This is an important detail: StatefulSet never deletes PVCs automatically. This prevents accidental data loss. It also means you'll have orphaned PVCs hanging around if you're not careful.

Real-world example: Cassandra cluster

Here's a real Cassandra StatefulSet I deployed:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
  namespace: databases
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:4.0
        ports:
        - containerPort: 9042
          name: cql
        - containerPort: 7000
          name: intra-node
        env:
        - name: CASSANDRA_SEEDS
          value: "cassandra-0.cassandra.databases.svc.cluster.local"
        - name: CASSANDRA_CLUSTER_NAME
          value: "ProductionCluster"
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        volumeMounts:
        - name: cassandra-data
          mountPath: /var/lib/cassandra
        resources:
          requests:
            memory: 4Gi
            cpu: 2
          limits:
            memory: 8Gi
            cpu: 4
      terminationGracePeriodSeconds: 300
  volumeClaimTemplates:
  - metadata:
      name: cassandra-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 100Gi

Why this configuration?

CASSANDRA_SEEDS points to cassandra-0 as the seed node. Since it's created first, other nodes can discover the cluster through it.

terminationGracePeriodSeconds is set to 300 (5 minutes). Cassandra needs time to flush data to disk and tell the cluster it's leaving. Without this, you risk data loss on shutdown.

Resources: Cassandra is memory-hungry. 4Gi request with 8Gi limit gives it room to handle spikes without starving other pods.

storageClassName set to fast-ssd because databases on slow storage are painful. SSDs make a huge difference for query performance.

Headless service for StatefulSet

StatefulSets need a "headless service" (ClusterIP: None) for pod DNS resolution:

apiVersion: v1
kind: Service
metadata:
  name: cassandra
  namespace: databases
spec:
  clusterIP: None  # Headless service
  selector:
    app: cassandra
  ports:
  - port: 9042
    name: cql
  - port: 7000
    name: intra-node

Why headless? Normal services distribute traffic across all pods (load balancing). Headless services create DNS records for each pod individually, so pods can talk directly to each other: cassandra-0.cassandra.databases.svc.cluster.local. That's what distributed databases need.

Update strategies

StatefulSet supports two update strategies:

RollingUpdate (default)

spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0

Updates pods in reverse order (2 then 1 then 0):

1. Delete cassandra-2
2. Wait for new cassandra-2 to be Ready
3. Delete cassandra-1
4. Wait for new cassandra-1 to be Ready
5. Delete cassandra-0
6. Wait for new cassandra-0 to be Ready

The partition parameter is useful. If you set partition: 2, only pods with ordinal 2 or higher will update. This lets you do canary deployments - update one pod first, test it, then roll forward.

OnDelete

spec:
  updateStrategy:
    type: OnDelete

Pods only update when you manually delete them. Gives you full control over rollout timing. I use this for databases where I want to verify each node before moving to the next.

Common pitfalls

Forgetting PVC cleanup

After deleting a StatefulSet, PVCs remain. You have to manually delete them:

# Delete StatefulSet
kubectl delete statefulset postgres

# PVCs still exist!
kubectl get pvc
# postgres-data-postgres-0   Bound    ...
# postgres-data-postgres-1   Bound    ...
# postgres-data-postgres-2   Bound    ...

# Manually clean up
kubectl delete pvc postgres-data-postgres-{0..2}

Kubernetes does this on purpose to protect your data. But it means you'll accumulate orphaned PVCs if you're not paying attention.

Slow scaling

Scaling StatefulSets is slow because it's sequential:

# This might take 5-10 minutes!
kubectl scale statefulset cassandra --replicas=10

Each pod has to get a PVC created and bound, get scheduled with its image pulled, start up and pass the readiness probe. Only then does the next pod start.

If you know you'll need to scale up, pre-create PVCs or increase replicas during off-peak hours.

Readiness probe configuration

StatefulSet relies heavily on readiness probes. If they're misconfigured, your StatefulSet gets stuck.

readinessProbe:
  exec:
    command:
    - /bin/sh
    - -c
    - cqlsh -e "describe cluster"
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

Too aggressive (short delays, low failure threshold) means pods never become Ready, so the next pod never starts. Too lenient (long delays, high threshold) means slow scaling and sick pods staying in rotation.

I spent way too long debugging a stuck StatefulSet once because my initialDelaySeconds was 15 and Cassandra needed a full minute to start. Felt obvious in hindsight.

When to use StatefulSet

Good use cases:

  • Databases (PostgreSQL, MySQL, MongoDB) - need stable hostnames for replication and persistent storage per instance
  • Distributed coordination (Zookeeper, etcd, Consul) - leader election requires stable identities and ordered startup
  • Message queues (RabbitMQ, Kafka) - persistent message storage per broker with cluster discovery via stable DNS
  • Caching (Redis Cluster) - consistent hashing needs stable node IDs

Bad use cases:

  • Stateless web apps - no need for stable identities, Deployment is simpler
  • Workers processing queue jobs - no persistent state, parallel scaling with Deployment is better
  • API gateways - don't need ordered startup, use Deployment with HPA

Debugging StatefulSet issues

These commands are where I usually start:

# Check pod status
kubectl get pods -l app=cassandra -o wide

# Check PVC binding
kubectl get pvc
kubectl describe pvc cassandra-data-cassandra-0

# Check events
kubectl describe statefulset cassandra
kubectl get events --sort-by='.lastTimestamp'

# Check logs
kubectl logs cassandra-0
kubectl logs cassandra-0 --previous  # If pod crashed

# Force recreation (StatefulSet will automatically recreate it)
kubectl delete pod cassandra-1

Performance considerations

Storage class selection

volumeClaimTemplates:
- metadata:
    name: data
  spec:
    storageClassName: fast-ssd  # Critical!
    accessModes: [ "ReadWriteOnce" ]
    resources:
      requests:
        storage: 100Gi

For databases, SSD can be 10-100x faster than HDD. Worth the cost.

Local SSDs have the lowest latency but no replication. Network storage (EBS, Persistent Disk) has built-in redundancy but higher latency. Pick based on your tolerance for data loss vs performance needs.

Resource limits

resources:
  requests:
    memory: 4Gi
    cpu: 2
  limits:
    memory: 8Gi
    cpu: 4

Don't set limits too tight. Databases have bursty workloads. Leave headroom. Use Prometheus to track real resource consumption and adjust based on actual data rather than guessing.

Staged rollouts with partition

You can use the partition parameter for blue-green style deployments:

# Update only cassandra-2
kubectl patch statefulset cassandra -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl set image statefulset cassandra cassandra=cassandra:4.1

# Test cassandra-2 in production
# If good, update cassandra-1
kubectl patch statefulset cassandra -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'

# Finally update cassandra-0
kubectl patch statefulset cassandra -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'

To sum up

StatefulSet is how Kubernetes handles stateful applications. The key ideas:

  1. Sequential ordering means safe cluster formation
  2. Stable network identities let pods find each other
  3. Each pod gets its own persistent storage
  4. PVCs are never auto-deleted (clean them up yourself)
  5. Scaling is slow - that's the trade-off for safety

After running StatefulSets in production for databases and message queues, I've found they're slower and more complex than Deployments. But for stateful applications, there's no real alternative.

Use StatefulSet when your application needs stable network identifiers, ordered deployment, or per-pod persistent storage. Otherwise, stick with Deployments. They're simpler, faster, and good enough for most workloads.


This article is based on hands-on experience running StatefulSets in production Kubernetes clusters, including Cassandra, PostgreSQL, and Redis deployments.