Availability
This chapter is about strategies for improving controller availability and tail latencies.
Motivation
Despite the common goals often set forth for application deployments, most kube controllers:
- can run in a single replica (default recommendation)
- can handle being killed, and be shifted to another node
- can handle minor downtime
This is due to a few properties:
- Controllers are queue consumers that do not require 100% uptime to meet a 100% SLO
- Rust images are often very small and will reschedule quickly
- watch streams re-initialise quickly with the current state on boot
- idempotency (see reconciler#idempotency) means repeated reconciliations are not problematic
- parallel execution mode of reconciliations makes restarts fast
Combined, these properties create a low-overhead system that is normally quick to catch up after being rescheduled, and that offers the traditional Kubernetes eventual consistency guarantee.
That said, this setup can struggle under strong consistency requirements. Ask yourself:
- How quickly do you expect your reconciler to respond to changes on average?
- Is a 30s P95 downtime from reschedules acceptable?
Responsiveness
If you want to improve average responsiveness, then traditional scaling and optimization strategies can help:
- Configure controller concurrency to avoid waiting for a reconciler slot
- Optimize the reconciler, avoid duplicated work
- Satisfy CPU requirements to avoid cgroup throttling
- Ensure your relations are set up correctly to avoid waiting for the next requeue
You can plot heatmaps of reconciliation times in Grafana using the standard metrics (see observability#What Metrics).
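To make the first point in the list above concrete, here is a rough back-of-the-envelope sketch of how long the last object in a backlog waits for a reconciler slot. The `backlog_wait` helper and the numbers are hypothetical and only model the arithmetic; they are not part of kube, and real reconcile times vary.

```rust
use std::time::Duration;

/// Rough estimate of how long the last object in a backlog waits for a
/// reconciler slot, assuming every reconcile takes `per_reconcile` and at
/// most `concurrency` reconciles run at once.
/// Hypothetical helper for illustration, not a kube API.
fn backlog_wait(backlog: u32, concurrency: u32, per_reconcile: Duration) -> Duration {
    assert!(concurrency > 0, "concurrency must be at least 1");
    // Number of full "waves" of reconciles that finish before the last object starts.
    let waves_before_last = backlog.saturating_sub(1) / concurrency;
    per_reconcile * waves_before_last
}
```

Under these assumptions, a backlog of 100 objects with 200ms reconciles leaves the last object waiting 19.8s at a concurrency of 1, and 1.8s at a concurrency of 10.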
High Availability
Scaling a controller beyond one replica for HA differs from scaling a regular load-balanced, traffic-receiving application.
A controller is effectively a consumer of Kubernetes watch events, and these are themselves unsynchronised event streams whose watchers are unaware of each other. Adding another pod - without some form of external locking - will result in duplicated work.
To avoid this, most controllers lean into the eventual consistency model and run with a single replica, accepting higher tail latencies due to reschedules. However, if the performance demands are strong enough, these pod reschedules will dominate the tail of your latency metrics, and this can make a stronger case for HA.
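The effect of reschedules on the tail can be illustrated with a small percentile calculation. This is a sketch over synthetic numbers (not measurements): `percentile` and `synthetic_latencies` are hypothetical helpers, and the 50ms and 30s figures are assumptions chosen to mirror the 30s reschedule example above.

```rust
/// Nearest-rank percentile over latency samples in milliseconds.
/// `p` is a whole percentage in 1..=100. Hypothetical helper for illustration.
fn percentile(samples: &[u64], p: u32) -> Option<u64> {
    if samples.is_empty() || p == 0 || p > 100 {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    // Nearest rank: ceil(p * n / 100), computed with integers to avoid float rounding.
    let rank = (p as usize * sorted.len() + 99) / 100;
    Some(sorted[rank - 1])
}

/// Builds a synthetic set of reaction times: `normal` fast reconciles plus
/// `stalled` changes that had to wait out a pod reschedule.
fn synthetic_latencies(normal: usize, stalled: usize, fast_ms: u64, stall_ms: u64) -> Vec<u64> {
    let mut samples = vec![fast_ms; normal];
    samples.extend(std::iter::repeat(stall_ms).take(stalled));
    samples
}
```

With 980 changes handled in 50ms and 20 changes that waited 30s for a reschedule, the P50 and P95 both stay at 50ms, while the P99 jumps to 30000ms. Whether that matters depends on which percentile your targets are written against.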
Scaling Replicas
It is not recommended to set replicas: 2 for an application running a normal Controller without leaders/shards, as this will cause both controller pods to reconcile the same objects, creating duplicate work and potential race conditions.
To safely operate with more than one pod, you must have leadership of your domain and wait for such leadership to be acquired before commencing. This is the concept of leader election.
Leader Election
Leader election allows having control over resources managed in Kubernetes using Leases as distributed locking mechanisms.
The common solution to downtime-based problems is to use the leader-with-lease pattern: another controller replica runs in "standby mode", ready to take over immediately without stepping on the toes of the other controller pod. We can do this by creating a Lease, and gating on the validity of the lease before doing the real work in the reconciler.
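The gating idea can be sketched with an in-memory model. This is not the API of any of the crates listed further down: the `Lease` struct, its methods, and `reconcile_gated` are hypothetical stand-ins, and a real implementation would read and update the Lease object through the Kubernetes API, relying on the API server to reject conflicting writes.

```rust
use std::time::{Duration, Instant};

/// In-memory stand-in for a Kubernetes Lease (hypothetical model).
#[derive(Debug, Clone)]
struct Lease {
    holder: Option<String>,
    renewed_at: Instant,
    duration: Duration,
}

impl Lease {
    fn new(duration: Duration, now: Instant) -> Self {
        Lease { holder: None, renewed_at: now, duration }
    }

    /// A lease is valid while it has a holder and has not outlived its duration.
    fn is_valid(&self, now: Instant) -> bool {
        self.holder.is_some()
            && now.saturating_duration_since(self.renewed_at) < self.duration
    }

    /// Acquire the lease if it is free or expired, or renew it if we hold it.
    /// Returns true when `identity` is the leader afterwards.
    fn try_acquire_or_renew(&mut self, identity: &str, now: Instant) -> bool {
        let held_by_us = self.holder.as_deref() == Some(identity);
        if held_by_us || !self.is_valid(now) {
            self.holder = Some(identity.to_string());
            self.renewed_at = now;
            true
        } else {
            false
        }
    }

    fn is_leader(&self, identity: &str, now: Instant) -> bool {
        self.is_valid(now) && self.holder.as_deref() == Some(identity)
    }
}

/// Gate: only do the real work while holding a valid lease.
fn reconcile_gated(lease: &Lease, identity: &str, now: Instant) -> Result<&'static str, &'static str> {
    if lease.is_leader(identity, now) {
        Ok("reconciled")
    } else {
        Err("standby: not the leader")
    }
}
```

In this model a standby pod keeps calling `try_acquire_or_renew` and only starts reconciling once it returns true, which is what keeps the two replicas from working on the same objects at the same time.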
Unsynchronised Rollout Surges
A single-replica controller deployment without leader election might create short periods of duplicate work and racy writes during rollouts because of how rolling updates surge by default.
The natural expiration of leases means that you are required to periodically update them while your main pod (the leader) is active. When your pod is about to be replaced, you can initiate a step down (and expire the lease), ideally once you have received a SIGTERM and drained your active work queue. If your pod crashes, then a replacement pod must wait for the scheduled lease expiry.
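The difference between a graceful step down and a crash can be estimated with a little arithmetic. The sketch below is a simplified model under stated assumptions: the standby re-checks the lease every `retry_period` (a hypothetical parameter), clocks agree, and API latency is ignored. `LeaderExit` and `worst_case_takeover` are illustrative names, not part of kube or the crates listed below.

```rust
use std::time::Duration;

/// How the current leader stopped. Hypothetical model for illustration.
enum LeaderExit {
    /// Graceful shutdown: the leader drained its queue and released the lease.
    SteppedDown,
    /// Crash: the lease was last renewed `since_renewal` ago and was never released.
    Crashed { since_renewal: Duration },
}

/// Upper bound on how long a standby waits before it can take over,
/// assuming it re-checks the lease every `retry_period`.
fn worst_case_takeover(
    exit: LeaderExit,
    lease_duration: Duration,
    retry_period: Duration,
) -> Duration {
    let until_free = match exit {
        // A released lease is free immediately.
        LeaderExit::SteppedDown => Duration::ZERO,
        // A crashed leader's lease must run out on its own.
        LeaderExit::Crashed { since_renewal } => lease_duration.saturating_sub(since_renewal),
    };
    // The standby may have just checked, so it can miss the release by one full period.
    until_free + retry_period
}
```

With an assumed 15s lease and a 2s retry period, a step down bounds the takeover at 2s, while a crash 5s after the last renewal bounds it at 12s. Shorter leases shrink the crash case at the cost of more frequent renewals.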
Third Party Crates
At the moment, leader election is not supported by kube itself, and requires third-party crates (see kube#485). A brief list of popular crates:
- kube-leader-election via hendrikmaus (examples / docs / disclaimer)
- kube-coordinate via thedodd (docs)
- kubert -> kubert::lease via olix0r (example / linkerd use)
Elected Shards
Leader election can in theory be used on top of explicit sharding (see scaling#sharding) to ensure that at most one replica manages each shard, by using one lease per shard. This could reduce the number of excess replicas standing by in a sharded scenario.
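As a rough illustration of the one-lease-per-shard idea, the sketch below models the leases as an in-memory table. Everything here is hypothetical: `ShardLeases` and its methods are invented for the example, the "lowest free shard" claim order is an arbitrary choice, and a real setup would back each entry with an actual Lease object and its expiry.

```rust
use std::collections::BTreeMap;

/// One lease per shard, modelled as shard index -> current holder.
/// Hypothetical in-memory stand-in for per-shard Lease objects.
struct ShardLeases {
    holders: BTreeMap<u32, Option<String>>,
}

impl ShardLeases {
    fn new(shards: u32) -> Self {
        ShardLeases { holders: (0..shards).map(|s| (s, None)).collect() }
    }

    /// The shard currently led by `replica`, if any.
    fn shard_of(&self, replica: &str) -> Option<u32> {
        self.holders
            .iter()
            .find(|(_, holder)| holder.as_deref() == Some(replica))
            .map(|(shard, _)| *shard)
    }

    /// Claim the lowest-numbered free shard, unless this replica already leads one.
    /// Returns the shard this replica leads, or None if it must stand by.
    fn claim(&mut self, replica: &str) -> Option<u32> {
        if let Some(shard) = self.shard_of(replica) {
            return Some(shard);
        }
        let free = self
            .holders
            .iter()
            .find(|(_, holder)| holder.is_none())
            .map(|(shard, _)| *shard)?;
        self.holders.insert(free, Some(replica.to_string()));
        Some(free)
    }

    /// Release whatever shard the replica held (step down or lease expiry).
    fn release(&mut self, replica: &str) {
        if let Some(shard) = self.shard_of(replica) {
            self.holders.insert(shard, None);
        }
    }
}
```

In this model, three shards and four replicas leave exactly one replica on standby, and that replica picks up whichever shard is released first. Running a dedicated leader and standby per shard would instead need six replicas.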