High Availability & Fault Tolerance
In high-scale systems, failure is a certainty. Designing for High Availability (HA) means ensuring that the system remains functional even when components fail.
Availability is often misunderstood as "it's up." In practice it is a measurable target: the percentage of time the system serves requests correctly, which translates directly into a yearly downtime budget:
- 99.9% (Three Nines): 8.77 hours of downtime per year.
- 99.99% (Four Nines): 52.6 minutes of downtime per year.
- 99.999% (Five Nines): 5.26 minutes of downtime per year.
At five nines, you don't have time for a human to log in and fix something. Your system must be autonomously self-healing.
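The downtime budgets above follow from simple arithmetic. A minimal sketch, assuming an average year of 365.25 days (the function name is illustrative, not from any library):

```python
# Minutes in an average year (365.25 days), matching the figures above.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the downtime budget, in minutes per year, for a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.2f} min/year ({minutes / 60:.2f} h)")
```

The same function can be used to check whether a proposed SLO leaves any realistic room for manual incident response.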
1. Cellular Architecture
Standard microservice deployments often behave like a monolith at the infrastructure layer: every instance of a service shares the same configuration and rollout. If a bad config change hits the "Order Service," it hits all instances globally.
The Blueprint: Divide your system into "Cells" (or Shards).
- A cell is a complete, independent instance of your entire stack (Web, API, DB).
- Users are mapped to specific cells.
- Cells share no state or infrastructure, so a failure in Cell A does not propagate to Cell B.
- This limits the Blast Radius to a small % of users.
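One way to map users to cells is a stable hash of the user ID. The sketch below is an illustration under assumptions: the cell names are hypothetical, and a simple modulo mapping is used for brevity. A modulo scheme reassigns many users whenever the number of cells changes, so real deployments may prefer an explicit mapping table or consistent hashing.

```python
import hashlib

CELLS = ["cell-a", "cell-b", "cell-c", "cell-d"]  # hypothetical cell names


def cell_for_user(user_id: str, cells: list[str] = CELLS) -> str:
    """Map a user to a cell deterministically using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(cells)
    return cells[index]


def blast_radius(cells: list[str]) -> float:
    """Approximate share of users hit by one failed cell, assuming even distribution."""
    return 1 / len(cells)


print(cell_for_user("user-1234"))
print(f"{blast_radius(CELLS):.0%} of users per cell")
```

With four evenly loaded cells, a single cell failure affects roughly a quarter of users; adding cells shrinks that share.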
2. Grey Failure Detection
Most monitoring only distinguishes "Dead" from "Alive." A grey failure occurs when a node is alive (pingable) but degraded: high latency, partial errors, or corrupted data.
The Blueprint:
- Active Probing: Don't just check health endpoints. Execute synthetic transactions that mimic user behavior.
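- Outlier Detection: Automatically eject nodes that perform significantly worse than their peers, even if they aren't "down." Client-side circuit breakers (like Resilience4j) stop calls to a dependency once its failure or slow-call rate crosses a threshold; load-balancer outlier detection (like Envoy's) removes individual misbehaving hosts from the pool.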
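The peer-comparison idea can be sketched as follows. This is a simplified illustration, not the algorithm used by any particular library: the threshold factor, the ejection cap, and the node names are all assumptions.

```python
from statistics import median


def find_outliers(p99_latency_ms: dict[str, float], factor: float = 3.0,
                  max_eject_fraction: float = 0.34) -> list[str]:
    """Return nodes whose latency exceeds `factor` times the fleet median.

    Ejections are capped so that a fleet-wide slowdown cannot empty the pool.
    """
    if len(p99_latency_ms) < 3:
        return []  # too few peers for a meaningful comparison
    baseline = median(p99_latency_ms.values())
    suspects = sorted(
        (node for node, latency in p99_latency_ms.items() if latency > factor * baseline),
        key=lambda node: p99_latency_ms[node],
        reverse=True,  # worst offenders first
    )
    max_ejections = int(len(p99_latency_ms) * max_eject_fraction)
    return suspects[:max_ejections]


# Hypothetical fleet: node-4 still answers pings but is far slower than its peers.
fleet = {"node-1": 42.0, "node-2": 45.0, "node-3": 40.0, "node-4": 610.0}
print(find_outliers(fleet))  # ['node-4']
```

Comparing against the median rather than a fixed threshold keeps the check meaningful when the whole fleet slows down together, and the cap prevents the detector from ejecting so many nodes that it causes an outage of its own.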
3. The "Static Stability" Principle
Static stability means the system continues to work in a failure state without needing to make control-plane changes (like scaling up or rebalancing).
The Blueprint:
- Over-provisioning: Keep enough capacity to handle a full AZ (Availability Zone) failure without waiting for autoscaling, which depends on the control plane and is often slow or unavailable during an outage.
- Avoid Hard Dependencies: If Service A needs Service B to start, and B is down, A will crash-loop. Design services to start with cached data or in a degraded "read-only" mode.
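The over-provisioning rule reduces to a small calculation: each AZ must hold enough instances that the surviving AZs can carry peak load alone. A minimal sketch, assuming load is spread evenly across AZs (the function name and example numbers are illustrative):

```python
import math


def instances_per_az(peak_instances: int, az_count: int,
                     az_failures_tolerated: int = 1) -> int:
    """Instances to run in each AZ so the surviving AZs can carry peak load alone."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("need at least one surviving AZ")
    return math.ceil(peak_instances / surviving)


per_az = instances_per_az(peak_instances=60, az_count=3)
total = per_az * 3
print(per_az, total)  # 30 per AZ, 90 total: 50% above the 60 needed at peak
```

The overhead falls as the number of AZs grows: with three AZs the fleet runs 50% above peak, while with two it must run at double.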
4. Control Plane vs. Data Plane
The Data Plane handles the actual user requests. The Control Plane manages configuration and orchestration.
The Blueprint:
- Ensure the Data Plane can continue running even if the Control Plane is completely down.
- Example: A load balancer should keep routing to known healthy nodes even if the service discovery registry disappears.
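The load balancer example can be sketched as a router that caches the last known healthy nodes and keeps using them when discovery fails. The class names, the exception type, and the `fetch_healthy_nodes` callable are hypothetical stand-ins for whatever registry client is in use.

```python
import itertools


class RegistryUnavailable(Exception):
    """Raised by the (hypothetical) registry client when discovery is down."""


class StaticallyStableRouter:
    def __init__(self, fetch_healthy_nodes):
        self._fetch = fetch_healthy_nodes  # callable returning a list of node addresses
        self._last_known_good = []
        self._counter = itertools.count()

    def refresh(self) -> None:
        """Try to update the node list; keep the previous list if the registry fails."""
        try:
            nodes = self._fetch()
        except RegistryUnavailable:
            return  # control plane is down: keep serving from the cached list
        if nodes:  # an empty answer is treated as suspicious, not as "no backends"
            self._last_known_good = list(nodes)

    def pick(self) -> str:
        """Round-robin over the last known healthy nodes."""
        if not self._last_known_good:
            raise RuntimeError("no healthy nodes have ever been discovered")
        index = next(self._counter) % len(self._last_known_good)
        return self._last_known_good[index]
```

Treating an empty registry response as a failure rather than as truth is a deliberate choice here: a registry that suddenly reports zero backends is more likely broken than correct, and routing to a possibly stale list is usually better than routing nowhere.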
Summary Checklist for Five Nines
- Regional Isolation (Active-Active)
- Cellular blast radius control
- Automated rollback on metric breach
- Grey failure ejection
- No hard dependencies in the critical path