status
this chapter is in active development
expect live edits and rapid iteration (except for when i am really busy with other stuff) while this material is written.
flow control protects the receiver. congestion control protects the network.
the sender has no direct visibility into the network's capacity. it cannot ask routers how full their queues are. it can only send data and observe what happens. did the acks come back on time, or did packets vanish?
every congestion control algorithm is a feedback loop: probe for more bandwidth, detect congestion, back off, repeat. the differences are in how they probe and what signals they react to.
the congestion window (cwnd) limits how many bytes the sender can have in flight. the actual send rate is bounded by min(cwnd, rwnd).
cwnd lives in the sender's kernel, not on the wire. the receiver never sees it.
ss --info '( dport = :443 )' | grep cwnd
# cwnd:10 ssthresh:20
cwnd is measured in segments (typically mss-sized, 1460 bytes on ethernet). cwnd:10 means roughly 14600 bytes in flight.
a new connection has no idea how much bandwidth is available. slow start probes: for every ack received, cwnd increases by one segment. cwnd doubles every rtt. exponential growth.
RTT 1: cwnd=1 → send 1 segment → 1 ack → cwnd=2
RTT 2: cwnd=2 → send 2 segments → 2 acks → cwnd=4
RTT 3: cwnd=4 → send 4 segments → 4 acks → cwnd=8
RTT 4: cwnd=8 → send 8 segments → 8 acks → cwnd=16
"slow start" is a misnomer. the name refers to starting from 1 rather than blasting at full speed, but the growth is exponential.
slow start continues until cwnd reaches ssthresh, a loss signals congestion, or the receive window becomes the limit.
linux sets initcwnd to 10 segments by default (ip route show to check). google's research showed bumping initcwnd from 3 to 10 significantly reduced page load times because most web responses fit in 10 segments.
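to raise it, set initcwnd on the route itself. the gateway and device below are placeholders for whatever ip route show reports on your box, and initcwnd only appears in that output once it has been set explicitly.
ip route show
ip route change default via 192.168.1.1 dev eth0 initcwnd 20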
once cwnd reaches ssthresh, growth goes linear: cwnd increases by roughly one segment per rtt. the sender carefully pushes upward, testing whether the network can handle more. when it pushes too far, packets drop and the algorithm reacts.
two signals:
packet loss. either a retransmission timeout fires or three duplicate acks trigger fast retransmit. loss-based algorithms treat this as the primary signal.
ecn (explicit congestion notification). routers mark packets when their queues fill up instead of dropping them. the receiver echoes the mark back in acks. the sender reacts as if a loss occurred but avoids the actual retransmission. enable with net.ipv4.tcp_ecn=1; both endpoints have to negotiate it, and it only helps if the bottleneck router actually marks (middleboxes that mangled the ecn bits are a big part of why deployment took so long).
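checking and flipping it is one sysctl. 2 (the usual default) means accept ecn when the peer requests it but do not request it ourselves; 1 also requests it on outgoing connections.
sysctl net.ipv4.tcp_ecn
# net.ipv4.tcp_ecn = 2
sysctl -w net.ipv4.tcp_ecn=1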
reno is the textbook algorithm. on loss: cut cwnd in half, set ssthresh to the new cwnd, enter congestion avoidance. simple, conservative, and terrible on high-bandwidth high-latency links because it takes forever to recover.
cubic (RFC 9438) is linux's default since 2.6.19. instead of linear growth in congestion avoidance, cubic uses a cubic function that grows slowly near the last known good cwnd (where loss happened) and more aggressively when far from it. much faster at utilizing fat pipes.
sysctl net.ipv4.tcp_congestion_control
# net.ipv4.tcp_congestion_control = cubic
sysctl net.ipv4.tcp_available_congestion_control
cubic's weakness: it interprets any packet loss as congestion, even random loss on a wireless link. 1% loss on wifi and cubic backs off constantly even if the link has plenty of bandwidth.
loss-based algorithms wait until the network is already congested (queues full, packets dropped) before reacting. delay-based algorithms detect congestion earlier by watching rtt. a rising rtt means queues are filling.
vegas monitors rtt and reduces cwnd when it increases beyond a threshold. avoids filling queues entirely, which is great for latency but means it loses bandwidth fights with cubic. cubic fills queues, vegas backs off, cubic takes the capacity.
bbr (bottleneck bandwidth and round-trip propagation time) is google's answer. it models the connection's bottleneck bandwidth and minimum rtt, then paces packets to match. does not react to loss directly. does not need to fill queues to find capacity.
# enable bbr (kernel 4.9+)
sysctl -w net.ipv4.tcp_congestion_control=bbr
# bbr needs the fq packet scheduler
tc qdisc replace dev eth0 root fq
bbr excels on long-fat networks and networks with random non-congestion loss. keeps queues shallow, which is great for latency. bbrv1 can be unfair to cubic flows; bbrv2 (newer kernels) addresses this.
bbr is not magic. on a shared link with mixed cubic and bbr flows, the interactions are complex and not always fair. test before deploying.
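one low-risk way to test: congestion control can be set per route, so bbr can be limited to traffic toward a particular network while everything else stays on cubic. the prefix and gateway below are placeholders; the congctl route attribute needs a reasonably recent kernel and iproute2.
# bbr only toward one remote network, cubic everywhere else
ip route add 203.0.113.0/24 via 192.168.1.1 dev eth0 congctl bbr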
when cubic detects a loss, it multiplies cwnd by 0.7 (reno cuts to 0.5). the result is the classic sawtooth: cwnd ramps up, a loss knocks it down, it ramps up again. average throughput is a function of loss frequency and recovery speed.
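a rough feel for that relationship is the classic mathis approximation: throughput ≈ mss / (rtt × sqrt(loss)). illustrative numbers only:
# 1460-byte segments, 50 ms rtt, 1% loss
echo "1460 / (0.05 * sqrt(0.01))" | bc -l
# ≈ 292000 bytes/s, about 2.3 mbit/s, no matter how fat the link is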
on a short request (typical web page), the connection might never leave slow start. on a large download, you spend most of the time in congestion avoidance watching the sawtooth.
raw congestion control says "you can have cwnd bytes in flight." without pacing, the sender dumps cwnd bytes into the network as fast as the nic can push them. creates bursts that fill router queues and cause loss even when average throughput is fine.
pacing spreads packets evenly across the rtt. bbr depends on pacing (the fq qdisc above provides it; newer kernels can also pace inside tcp itself). cubic benefits from it.
tc qdisc show dev eth0
ss --info '( dport = :443 )'
# rtt:25.3/1.2 cwnd:42 ssthresh:38 bytes_acked:1284032 retrans:0/2
cwnd:42 ssthresh:38 means congestion avoidance (cwnd > ssthresh). retrans:0/2 means 0 currently unacked retransmits, 2 total since the connection opened.
cwnd stuck at a low value with low ssthresh means recent loss and slow recovery. cwnd equals initcwnd on a fresh connection means slow start has not had time to ramp.
nstat -az TcpRetransSegs
system-wide retransmit count. if this is climbing, something on your network is hurting.
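a raw count means little without a denominator. compare retransmitted segments against total segments sent:
nstat -az TcpRetransSegs TcpOutSegs
# retransmits above a fraction of a percent of out-segments is worth digging into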
short connections suffer most. a connection transferring 50 KB might finish entirely during slow start with initcwnd=10 (14 KB on the wire in the first rtt). the congestion algorithm never gets a chance to ramp. http/2 multiplexing and connection reuse help because one warmed-up connection with a large cwnd serves many requests. quic with 0-rtt goes further: it can reuse the previous connection's congestion state.
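the arithmetic for that 50 KB case, assuming 1460-byte segments and initcwnd=10:
echo $(( 50 * 1024 / 1460 ))
# ~35 segments: 10 in the first rtt, 20 in the second, the last 5 in the third,
# plus one rtt for the handshake. the transfer finishes before slow start ever ends.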
buffer sizes limit throughput. if SO_SNDBUF is smaller than the bandwidth-delay product, the sender cannot fill the pipe. if SO_RCVBUF is smaller, the receive window clamps throughput before cwnd gets a chance.
measure before tuning. switching to bbr on a server handling only lan traffic will not help. bumping initcwnd to 40 when your clients are on the same rack is pointless overhead. know the rtt and bandwidth of your typical path, calculate the bdp, and tune accordingly.
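a worked bdp number, assuming a 100 mbit/s path with a 50 ms rtt:
echo $(( 100 * 1000 * 1000 / 8 * 50 / 1000 ))
# 625000 bytes. socket buffers (and the max values in tcp_rmem/tcp_wmem) need at least this much
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem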
flow control protects the receiver. congestion control protects the network. udp is what happens when you opt out of all of this.