availability
what the nines actually cost, and why availability math gets uncomfortable fast.
availability is the fraction of time a system is operational and accepting requests. it is usually expressed as a percentage (99.9%, 99.99%), and the gap between those numbers sounds small until you convert it to actual downtime per year.
| availability | downtime per year | downtime per month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.4 minutes |
| 99.999% | 5.26 minutes | 26 seconds |
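the table follows from simple arithmetic: the downtime fraction is 1 minus the availability. a minimal sketch, using a 365-day year and a 730-hour month:

```python
def downtime(availability_pct: float) -> tuple[float, float]:
    """Downtime budget implied by an availability target.

    Returns (hours_per_year, minutes_per_month).
    """
    down_frac = 1 - availability_pct / 100
    hours_per_year = down_frac * 365 * 24
    minutes_per_month = down_frac * 730 * 60
    return hours_per_year, minutes_per_month

for nines in (99, 99.9, 99.99, 99.999):
    hrs, mins = downtime(nines)
    print(f"{nines}% -> {hrs:.2f} h/year, {mins:.1f} min/month")
```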
99% sounds high. 3.65 days of downtime per year is not. the jump from 99.9% to 99.99% is the difference between "we had a bad incident" and "we barely exceeded our error budget."
each additional nine gets geometrically more expensive to achieve. going from 99% to 99.9% might mean adding a replica. going from 99.99% to 99.999% might mean rearchitecting your entire deployment pipeline, eliminating all single points of failure, and investing heavily in automated failover. the cost curve is steep.
these three terms are related but different:
sli (service level indicator) is the measurement. p99 latency. error rate. request success rate. the raw number.
slo (service level objective) is the target. "99.9% of requests succeed." "p99 latency stays below 200ms." this is the internal standard the team commits to maintaining.
sla (service level agreement) is the contract. what happens if you miss the slo. usually involves credits or penalties. slas are externally visible; slos are the internal targets that protect them.
the practical implication: define slis first (what are you actually measuring?), then set slos (what is acceptable?), then let slas follow from the slos with some margin for safety. do not work backwards from sla targets without understanding what measurement produces those numbers.
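the sli-then-slo ordering can be made concrete. a sketch with a hypothetical request log: the sli is the raw success fraction, the slo is the target, and the error budget is what the slo leaves room for:

```python
# hypothetical request log; any real system would aggregate this
# from metrics, not a list of dicts
requests = [
    {"status": 200}, {"status": 200}, {"status": 500},
    {"status": 200}, {"status": 200},
]

# sli: the raw measurement (fraction of successful requests)
successes = sum(1 for r in requests if r["status"] < 500)
sli = successes / len(requests)

# slo: the internal target the team commits to
SLO = 0.999

# error budget: allowed failure fraction minus observed failure fraction
budget_remaining = (1 - SLO) - (1 - sli)
print(f"sli={sli:.3f}, slo met: {sli >= SLO}, "
      f"budget remaining: {budget_remaining:+.4f}")
```

the sla would then be set looser than the slo (say 99.5% when the slo is 99.9%), so the team breaches its internal target before any contractual penalty triggers.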
components in series (all must work) multiply their availabilities. components in parallel (any one can work) are harder to calculate but generally improve availability.
for series dependencies: if your service requires a database and a cache and an auth service, and each is 99.9% available, your composite is roughly 0.999³ ≈ 0.997, or 99.7%. every added dependency is a subtraction from your availability budget.
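the series case is a straight product. a one-liner sketch with the three dependencies above:

```python
from math import prod

# series dependencies: all must be up, so availabilities multiply
deps = {"database": 0.999, "cache": 0.999, "auth": 0.999}
composite = prod(deps.values())
print(f"composite availability: {composite:.6f}")  # ~0.997003
```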
for parallel components: if you have two replicas and the service is down only when both are down, and each fails independently with probability 0.001, the probability both fail simultaneously is 0.001 × 0.001 = 0.000001. that is 99.9999% available, but only if the failures are truly independent.
failures are rarely independent. a bad deployment hits all replicas. a network partition affects all nodes in the same zone. a shared dependency fails and takes everything with it. common-mode failures are why "we have two copies" often provides less availability improvement than the math suggests.
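the gap between the ideal parallel math and reality can be sketched by adding a common-mode term. the 0.0005 common-mode probability here is a made-up illustration, not a measured number:

```python
p_fail = 0.001     # each replica's independent failure probability
replicas = 2
p_common = 0.0005  # hypothetical probability of a failure hitting all replicas

# ideal case: down only when every replica independently fails
independent_down = p_fail ** replicas
availability_independent = 1 - independent_down

# realistic case: down if the common-mode failure hits, or if all
# replicas independently fail anyway
realistic_down = p_common + (1 - p_common) * independent_down
availability_realistic = 1 - realistic_down

print(f"assuming independence: {availability_independent:.6f}")
print(f"with common mode:      {availability_realistic:.6f}")
```

even a small common-mode probability dominates: the replicas buy six nines on paper, but the shared failure mode caps the system near three and a half.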
availability calculations usually include all downtime, including planned maintenance windows. this matters because deployments, migrations, and schema changes all require some care to execute without downtime.
zero-downtime deployments (rolling updates, blue-green deployments, canary releases) exist specifically to move planned downtime toward zero. if you need to deploy frequently and availability matters, the deployment mechanism is part of the availability story.
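the core of a rolling update is small: replace one instance at a time and only proceed while health checks pass, so capacity never drops to zero. a hypothetical sketch, where `deploy_to` and `is_healthy` stand in for whatever your orchestrator actually provides:

```python
import time

def rolling_update(instances, new_version, deploy_to, is_healthy,
                   check_interval=1.0, max_checks=30):
    """Upgrade instances one at a time, gating each step on health.

    deploy_to and is_healthy are caller-supplied hooks; this is a
    sketch of the control loop, not a real orchestrator client.
    """
    for instance in instances:
        deploy_to(instance, new_version)
        for _ in range(max_checks):
            if is_healthy(instance):
                break  # this instance is serving again; move on
            time.sleep(check_interval)
        else:
            # halt the rollout rather than degrade the whole fleet
            raise RuntimeError(f"{instance} never became healthy")
```

canary releases extend the same loop with an extra gate: after the first instance, watch error-rate and latency slis before touching the rest.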
availability says nothing about correctness. a system that returns 200 OK for every request regardless of whether it is doing the right thing is 100% available and completely unreliable.
availability also says nothing about latency unless you define it into the sli. a service that responds in 30 seconds is technically available but may be functionally unusable. "available" in practice means "available and responsive", which means latency thresholds need to be part of the sli definition.
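folding latency into the sli means a request only counts as "good" if it both succeeded and finished under the threshold. a sketch with a hypothetical request log:

```python
LATENCY_SLO_MS = 200

requests = [
    {"status": 200, "latency_ms": 45},
    {"status": 200, "latency_ms": 30_000},  # "available" but unusable
    {"status": 500, "latency_ms": 12},
    {"status": 200, "latency_ms": 180},
]

# good-request sli: successful AND fast enough
good = sum(1 for r in requests
           if r["status"] < 500 and r["latency_ms"] <= LATENCY_SLO_MS)
sli = good / len(requests)
print(f"good-request sli: {sli:.2f}")
```

a plain success-rate sli would score this log at 0.75; counting the 30-second response as bad drops it to 0.50, which better matches what users experienced.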
this is where availability and reliability connect. you want a system that is reachable (available) and doing the right thing (reliable) at acceptable speed. optimizing for one without the other produces a system that is technically impressive and practically frustrating.