Failure Is the Norm
In a cluster of 1,000 nodes, each with 99.9% annual uptime:
P(all nodes up) = 0.999^1000 ≈ 0.37
Expected: at least one node is down 63% of the time.
Types of failures:
- Crash-stop: node halts and stays halted
- Crash-recovery: node halts then restarts with durable state
- Byzantine: node behaves arbitrarily (lies, corrupts messages)
- Network partition: nodes are reachable but cannot talk to each other