Kubernetes StatefulSet: A Deep Dive
Kubernetes StatefulSet: A Deep Dive
Still find that taking notes by hand helps things stick better.
When I first started working with Kubernetes, I thought everything could be a Deployment. Then I tried to run a database cluster. That's when I learned about StatefulSets the hard way.
What makes StatefulSet different?
In Kubernetes, most applications can run as Deployments. You scale them up and down, pods get random names, and if one dies, another takes its place. But what about databases? Message queues? Distributed systems that need stable network identities?
That's where StatefulSet comes in.
The core differences
| Feature | Deployment | StatefulSet |
|---|---|---|
| Pod Names | Random (web-7d8f9c-xk2m) | Ordered (web-0, web-1, web-2) |
| Creation Order | Parallel | Sequential (0 then 1 then 2) |
| Deletion Order | Random | Reverse (2 then 1 then 0) |
| Storage | Shared (optional) | Dedicated per pod |
| Network Identity | Unstable | Stable DNS names |
| Scaling | Fast, parallel | Slow, sequential |
How StatefulSet works
Pod naming and ordering
StatefulSet creates pods with predictable names following the pattern: <statefulset-name>-<ordinal-index>
cassandra-0 ← Created first
cassandra-1 ← Created after cassandra-0 is Ready
cassandra-2 ← Created after cassandra-1 is Ready
This ordering matters. Databases need the leader pod ready before followers join. Distributed systems depend on predictable hostnames for node discovery. Message queues require ordered startup to form a cluster.
Sequential creation process
When you create a StatefulSet with 3 replicas:
Time 0s: cassandra-0 → Pending
Time 5s: cassandra-0 → Running (waiting for Ready)
Time 30s: cassandra-0 → Ready ✓
Time 31s: cassandra-1 → Pending
Time 36s: cassandra-1 → Running (waiting for Ready)
Time 50s: cassandra-1 → Ready ✓
Time 51s: cassandra-2 → Pending
...
Why sequential? Because many stateful applications need the leader or master node up first. Configuration gets copied from previous replicas. Cluster membership has to be registered before the next node joins. You can't just start everything at once and hope it sorts itself out.
Stable network identity
Each StatefulSet pod gets a stable DNS name that survives restarts:
<pod-name>.<service-name>.<namespace>.svc.cluster.local
Example for a Cassandra cluster:
cassandra-0.cassandra.default.svc.cluster.local
cassandra-1.cassandra.default.svc.cluster.local
cassandra-2.cassandra.default.svc.cluster.local
Even if cassandra-1 crashes and restarts, it keeps the same DNS name. Your application code can hardcode these addresses without worry.
Persistent storage with StatefulSet
VolumeClaimTemplates
The real point of StatefulSet is dedicated persistent storage per pod:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: postgres
spec:
serviceName: postgres
replicas: 3
selector:
matchLabels:
app: postgres
template:
metadata:
labels:
app: postgres
spec:
containers:
- name: postgres
image: postgres:14
ports:
- containerPort: 5432
volumeMounts:
- name: postgres-data
mountPath: /var/lib/postgresql/data
volumeClaimTemplates:
- metadata:
name: postgres-data
spec:
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 10Gi
This creates:
postgres-0 → PVC: postgres-data-postgres-0 (10Gi)
postgres-1 → PVC: postgres-data-postgres-1 (10Gi)
postgres-2 → PVC: postgres-data-postgres-2 (10Gi)
What happens during scaling?
Scaling up (3 to 4 replicas):
1. Create new PVC: postgres-data-postgres-3
2. Wait for PVC to bind to a PersistentVolume
3. Create pod: postgres-3
4. Mount postgres-data-postgres-3 to postgres-3
Scaling down (4 to 3 replicas):
1. Delete pod: postgres-3
2. PVC postgres-data-postgres-3 remains (not deleted!)
3. Data preserved for future scale-up
This is an important detail: StatefulSet never deletes PVCs automatically. This prevents accidental data loss. It also means you'll have orphaned PVCs hanging around if you're not careful.
Real-world example: Cassandra cluster
Here's a real Cassandra StatefulSet I deployed:
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: cassandra
namespace: databases
spec:
serviceName: cassandra
replicas: 3
selector:
matchLabels:
app: cassandra
template:
metadata:
labels:
app: cassandra
spec:
containers:
- name: cassandra
image: cassandra:4.0
ports:
- containerPort: 9042
name: cql
- containerPort: 7000
name: intra-node
env:
- name: CASSANDRA_SEEDS
value: "cassandra-0.cassandra.databases.svc.cluster.local"
- name: CASSANDRA_CLUSTER_NAME
value: "ProductionCluster"
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
volumeMounts:
- name: cassandra-data
mountPath: /var/lib/cassandra
resources:
requests:
memory: 4Gi
cpu: 2
limits:
memory: 8Gi
cpu: 4
terminationGracePeriodSeconds: 300
volumeClaimTemplates:
- metadata:
name: cassandra-data
spec:
accessModes: [ "ReadWriteOnce" ]
storageClassName: fast-ssd
resources:
requests:
storage: 100Gi
Why this configuration?
CASSANDRA_SEEDS points to cassandra-0 as the seed node. Since it's created first, other nodes can discover the cluster through it.
terminationGracePeriodSeconds is set to 300 (5 minutes). Cassandra needs time to flush data to disk and tell the cluster it's leaving. Without this, you risk data loss on shutdown.
Resources: Cassandra is memory-hungry. 4Gi request with 8Gi limit gives it room to handle spikes without starving other pods.
storageClassName set to fast-ssd because databases on slow storage are painful. SSDs make a huge difference for query performance.
Headless service for StatefulSet
StatefulSets need a "headless service" (ClusterIP: None) for pod DNS resolution:
apiVersion: v1
kind: Service
metadata:
name: cassandra
namespace: databases
spec:
clusterIP: None # Headless service
selector:
app: cassandra
ports:
- port: 9042
name: cql
- port: 7000
name: intra-node
Why headless? Normal services distribute traffic across all pods (load balancing). Headless services create DNS records for each pod individually, so pods can talk directly to each other: cassandra-0.cassandra.databases.svc.cluster.local. That's what distributed databases need.
Update strategies
StatefulSet supports two update strategies:
RollingUpdate (default)
spec:
updateStrategy:
type: RollingUpdate
rollingUpdate:
partition: 0
Updates pods in reverse order (2 then 1 then 0):
1. Delete cassandra-2
2. Wait for new cassandra-2 to be Ready
3. Delete cassandra-1
4. Wait for new cassandra-1 to be Ready
5. Delete cassandra-0
6. Wait for new cassandra-0 to be Ready
The partition parameter is useful. If you set partition: 2, only pods with ordinal 2 or higher will update. This lets you do canary deployments - update one pod first, test it, then roll forward.
OnDelete
spec:
updateStrategy:
type: OnDelete
Pods only update when you manually delete them. Gives you full control over rollout timing. I use this for databases where I want to verify each node before moving to the next.
Common pitfalls
Forgetting PVC cleanup
After deleting a StatefulSet, PVCs remain. You have to manually delete them:
# Delete StatefulSet
kubectl delete statefulset postgres
# PVCs still exist!
kubectl get pvc
# postgres-data-postgres-0 Bound ...
# postgres-data-postgres-1 Bound ...
# postgres-data-postgres-2 Bound ...
# Manually clean up
kubectl delete pvc postgres-data-postgres-{0..2}
Kubernetes does this on purpose to protect your data. But it means you'll accumulate orphaned PVCs if you're not paying attention.
Slow scaling
Scaling StatefulSets is slow because it's sequential:
# This might take 5-10 minutes!
kubectl scale statefulset cassandra --replicas=10
Each pod has to get a PVC created and bound, get scheduled with its image pulled, start up and pass the readiness probe. Only then does the next pod start.
If you know you'll need to scale up, pre-create PVCs or increase replicas during off-peak hours.
Readiness probe configuration
StatefulSet relies heavily on readiness probes. If they're misconfigured, your StatefulSet gets stuck.
readinessProbe:
exec:
command:
- /bin/sh
- -c
- cqlsh -e "describe cluster"
initialDelaySeconds: 60
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
Too aggressive (short delays, low failure threshold) means pods never become Ready, so the next pod never starts. Too lenient (long delays, high threshold) means slow scaling and sick pods staying in rotation.
I spent way too long debugging a stuck StatefulSet once because my initialDelaySeconds was 15 and Cassandra needed a full minute to start. Felt obvious in hindsight.
When to use StatefulSet
Good use cases:
- Databases (PostgreSQL, MySQL, MongoDB) - need stable hostnames for replication and persistent storage per instance
- Distributed coordination (Zookeeper, etcd, Consul) - leader election requires stable identities and ordered startup
- Message queues (RabbitMQ, Kafka) - persistent message storage per broker with cluster discovery via stable DNS
- Caching (Redis Cluster) - consistent hashing needs stable node IDs
Bad use cases:
- Stateless web apps - no need for stable identities, Deployment is simpler
- Workers processing queue jobs - no persistent state, parallel scaling with Deployment is better
- API gateways - don't need ordered startup, use Deployment with HPA
Debugging StatefulSet issues
These commands are where I usually start:
# Check pod status
kubectl get pods -l app=cassandra -o wide
# Check PVC binding
kubectl get pvc
kubectl describe pvc cassandra-data-cassandra-0
# Check events
kubectl describe statefulset cassandra
kubectl get events --sort-by='.lastTimestamp'
# Check logs
kubectl logs cassandra-0
kubectl logs cassandra-0 --previous # If pod crashed
# Force recreation (StatefulSet will automatically recreate it)
kubectl delete pod cassandra-1
Performance considerations
Storage class selection
volumeClaimTemplates:
- metadata:
name: data
spec:
storageClassName: fast-ssd # Critical!
accessModes: [ "ReadWriteOnce" ]
resources:
requests:
storage: 100Gi
For databases, SSD can be 10-100x faster than HDD. Worth the cost.
Local SSDs have the lowest latency but no replication. Network storage (EBS, Persistent Disk) has built-in redundancy but higher latency. Pick based on your tolerance for data loss vs performance needs.
Resource limits
resources:
requests:
memory: 4Gi
cpu: 2
limits:
memory: 8Gi
cpu: 4
Don't set limits too tight. Databases have bursty workloads. Leave headroom. Use Prometheus to track real resource consumption and adjust based on actual data rather than guessing.
Staged rollouts with partition
You can use the partition parameter for blue-green style deployments:
# Update only cassandra-2
kubectl patch statefulset cassandra -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
kubectl set image statefulset cassandra cassandra=cassandra:4.1
# Test cassandra-2 in production
# If good, update cassandra-1
kubectl patch statefulset cassandra -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":1}}}}'
# Finally update cassandra-0
kubectl patch statefulset cassandra -p '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
To sum up
StatefulSet is how Kubernetes handles stateful applications. The key ideas:
- Sequential ordering means safe cluster formation
- Stable network identities let pods find each other
- Each pod gets its own persistent storage
- PVCs are never auto-deleted (clean them up yourself)
- Scaling is slow - that's the trade-off for safety
After running StatefulSets in production for databases and message queues, I've found they're slower and more complex than Deployments. But for stateful applications, there's no real alternative.
Use StatefulSet when your application needs stable network identifiers, ordered deployment, or per-pod persistent storage. Otherwise, stick with Deployments. They're simpler, faster, and good enough for most workloads.
This article is based on hands-on experience running StatefulSets in production Kubernetes clusters, including Cassandra, PostgreSQL, and Redis deployments.