scalability is not "add more machines and watch it go faster." it is the question of whether the work can be divided up at all, and whether your current design makes that possible.
stateless computation scales easily. take a request, do work, return a response, nothing persisted between calls. run ten copies behind a load balancer. done.
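a minimal sketch of what that looks like (the handler and replica names are made up for illustration; a real load balancer would make a network call instead of a local one):

```python
from itertools import cycle

# hypothetical stateless handler: everything it needs arrives in the request,
# nothing is persisted between calls
def handle(request):
    return {"result": request["n"] * 2}

# a round-robin dispatcher can send each request to any copy,
# because no copy holds state the next request depends on
replicas = cycle(["replica-a", "replica-b", "replica-c"])

def dispatch(request):
    target = next(replicas)
    # in a real system this would be an rpc/http call to `target`
    return target, handle(request)

print(dispatch({"n": 21}))  # any replica gives the same answer
```

the point of the sketch: adding a fourth replica is one line, precisely because `handle` carries no memory of previous requests.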
state does not scale easily. if your application needs to know what happened before (shopping cart contents, session tokens, which records a user has already seen), you now have a coordination problem: either every replica has access to the same state store, or you have to route requests to the right replica, or you accept that replicas can disagree.
this is why "just add more servers" breaks down. the bottleneck is usually not computation. it is shared state.
vertical scaling means adding resources to one machine: more cpu cores, more ram, faster storage. it is simple, requires no code changes, and works until it does not. hardware has limits: a single machine can only grow so large before cost becomes prohibitive or you hit a hard ceiling. more importantly, one machine is one failure domain.
horizontal scaling means adding more machines to the pool. it is cheaper at scale, eliminates the single-failure-domain problem, and lets you scale specific components independently. the cost is coordination: requests need routing, shared state needs a shared store, and any statefulness in the application becomes a design problem rather than a deployment footnote.
most production systems end up with both: vertically scale individual nodes to a sensible size, then scale horizontally across multiple nodes.
the scale cube from the art of scalability is useful here. you can scale a system along three axes:
- x-axis: clone the whole application horizontally. works for stateless services. the first move for most traffic growth.
- y-axis: split by function (service decomposition). different services handle different responsibilities. adds operational complexity but lets you scale bottlenecks independently.
- z-axis: partition the data. route traffic based on a key (user id, region, tenant) so each instance owns a subset of the data. solves the shared-state problem but introduces consistency and routing complexity.
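the z-axis idea can be sketched in a few lines. this is a simplified hash-based router (shard names are invented, and a real system would use consistent hashing so rebalancing does not remap every key):

```python
import hashlib

# each shard owns the subset of users that hash to it
shards = ["shard-0", "shard-1", "shard-2", "shard-3"]

def route(user_id: str) -> str:
    # hash the key so users spread evenly across shards
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return shards[h % len(shards)]

# the same key always lands on the same shard,
# so that shard can own all of that user's data
assert route("user-42") == route("user-42")
```

the routing complexity mentioned above shows up the moment `shards` changes length: `h % len(shards)` then moves most keys to a different shard, which is exactly the rebalancing problem.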
most scaling problems are solved by a combination of these. the question is which combination your current architecture supports.
this is amdahl's law: if any part of your system cannot be parallelized, that part limits your maximum speedup regardless of how many machines you add. if 5% of a workload is sequential, you top out at a 20x improvement no matter how many cores you throw at it.
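the arithmetic behind that 20x ceiling, as a small calculator (the function name is ours; the formula is amdahl's law, speedup = 1 / (s + (1 - s) / n) for sequential fraction s and n workers):

```python
def max_speedup(sequential_fraction: float, workers: float = float("inf")) -> float:
    """amdahl's law: speedup achievable when `sequential_fraction` of the
    work cannot be parallelized and `workers` machines share the rest."""
    s = sequential_fraction
    if workers == float("inf"):
        # with unlimited workers the parallel part vanishes; only 1/s remains
        return 1 / s
    return 1 / (s + (1 - s) / workers)

print(max_speedup(0.05))       # -> 20.0, the ceiling from the text
print(max_speedup(0.05, 32))   # well below 20 with only 32 workers
```

note how quickly the curve flattens: going from 32 workers to infinitely many buys less than a 2x further improvement when 5% is sequential.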
practical implication: find the bottleneck before scaling. throwing capacity at the wrong layer is expensive and achieves nothing. a single slow database query will cap your throughput long before you run out of application servers.
most of what makes scaling genuinely hard is stateful services: databases, caches, queues, anything that holds data. techniques here include:
- read replicas: route read traffic to replicas, writes to a primary. helps if reads dominate and you can tolerate slightly stale data.
- caching: serve reads from memory when possible. works until cache invalidation becomes the actual problem.
- sharding: partition data across multiple nodes. each node owns a key range. scales write throughput but introduces cross-shard operations, rebalancing complexity, and makes transactions harder.
- connection pooling: databases handle a limited number of concurrent connections. pooling lets you run many application instances against a reasonably-sized database.
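the read-replica tradeoff from the list above fits in a few lines. this is an illustrative routing sketch, not a driver api: the connection names are placeholders, and real read/write splitting usually lives in a proxy or the database client:

```python
import random

# hypothetical connection handles; real ones would come from a driver's pool
PRIMARY = "primary"
REPLICAS = ["replica-1", "replica-2"]

def pick_connection(query: str) -> str:
    # writes must go to the primary so there is one source of truth;
    # reads can go to any replica, accepting slightly stale results
    if query.lstrip().lower().startswith(("insert", "update", "delete")):
        return PRIMARY
    return random.choice(REPLICAS)

print(pick_connection("SELECT * FROM carts WHERE user_id = 42"))
print(pick_connection("UPDATE carts SET qty = 2 WHERE id = 7"))
```

the staleness tradeoff is visible right in the branch: a read issued immediately after a write may hit a replica that has not replayed that write yet.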
the point is not to memorize these techniques. it is to notice that every one of them involves a tradeoff. read replicas trade consistency. caching trades freshness. sharding trades simplicity. scaling is the act of choosing which tradeoff you can live with.