status
this chapter is in active development
expect live edits and rapid iteration (except for when i am really busy with other stuff) while this material is written.
chapter 1 covered the handshake as a black box. three packets and you are in. here we look at what those packets actually contain.
tcp (RFC 9293) provides a reliable, ordered byte stream over an unreliable network. every guarantee it makes costs something.
every tcp segment carries a fixed 20-byte header (plus options):
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| source port | destination port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| sequence number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| acknowledgment number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| off | res |N|C|E|U|A|P|R|S|F| window size |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| checksum | urgent pointer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| options (if any) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
source and destination ports (16 bits each) identify the endpoints. combined with the source and destination ip addresses from the ip header beneath, they form the four-tuple that uniquely identifies a connection.
sequence number (32 bits) counts bytes, not segments. send 1000 bytes starting at sequence 5000 and the next segment starts at 6000. the isn (initial sequence number) is random to prevent old segments from a previous connection sneaking in.
acknowledgment number (32 bits) says "I have received everything up to this byte." cumulative. acking byte 7000 means all bytes before 7000 arrived.
flags drive the state machine: SYN opens, FIN closes, RST aborts, ACK acknowledges, PSH tells the receiver to deliver data immediately.
window size (16 bits, scaled by window scale option) advertises how much buffer space the receiver has. flow control covers this.
the three-way handshake is really an isn exchange:
client server
|--- SYN seq=1000 ------------>|
|<-- SYN+ACK seq=5000 ack=1001--|
|--- ACK seq=1001 ack=5001 --->|
the client picks isn 1000. the server acks 1001 (the client's isn + 1) and offers its own isn 5000. the client acks 5001. both sides now know each other's starting sequence number and can track every byte that follows.
tcpdump shows relative sequence numbers by default (starting at 0). pass -S for absolute numbers when you need to verify isn handling.
after the handshake, every segment carries a sequence number and data. the receiver responds with acks:
client server
|--- seq=1001 len=500 ------------>|
|--- seq=1501 len=500 ------------>|
|<-- ack=2001 ----------------------| "got everything up to 2001"
|--- seq=2001 len=1000 ----------->|
|<-- ack=3001 ----------------------|
the sender does not wait for an ack before sending the next segment. it sends as fast as the congestion window and receive window allow. acks are cumulative: if ack 2001 arrives, everything before 2001 is confirmed regardless of how many segments it took.
most tcp stacks do not ack every segment. linux waits up to 40ms or until 2 segments arrive before sending an ack. reduces ack traffic but can hurt latency for request-response protocols. TCP_QUICKACK disables it per-socket.
nagle buffers small writes until an ack arrives for previously sent data or enough bytes accumulate to fill a segment. it exists to stop a chatty application from flooding the network with tinygrams: one byte of payload wrapped in 40 bytes of tcp/ip headers.
for latency-sensitive protocols (ssh, gaming, interactive apis), kill it:
int flag = 1;   /* needs <netinet/tcp.h> */
if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) < 0)
    perror("setsockopt TCP_NODELAY");
nagle interacts badly with delayed acks. the sender waits for an ack, the receiver waits to batch the ack, and you get 40ms of pointless delay. most http servers set TCP_NODELAY by default for exactly this reason.
when a segment is lost, tcp has two paths back.
the sender starts a retransmission timeout (rto) timer for each segment. no ack before the timer fires means retransmit. rto is calculated from smoothed rtt and rtt variance:
rto = srtt + max(G, 4 * rttvar)
G is the clock granularity (typically 1ms on linux). linux floors the rto at 200ms (tunable per route with the rto_min option of ip route) and caps it at 120 seconds.
waiting hundreds of milliseconds to detect a lost packet is painful. that is the slow path.
when the receiver gets an out-of-order segment, it immediately sends a duplicate ack for the last in-order byte. three duplicate acks (four total for the same sequence number) and the sender retransmits without waiting for the timeout:
client server
|--- seq=1001 len=500 ------------>|
|--- seq=1501 len=500 ---X | (lost)
|--- seq=2001 len=500 ------------>|
|<-- ack=1501 (dup ack #1) --------| "still waiting for 1501"
|--- seq=2501 len=500 ------------>|
|<-- ack=1501 (dup ack #2) --------|
|--- seq=3001 len=500 ------------>|
|<-- ack=1501 (dup ack #3) --------|
|--- seq=1501 len=500 ------------>| fast retransmit
|<-- ack=3501 ----------------------| cumulative ack catches up
fires in about one rtt. much faster than waiting for rto.
cumulative acks have a problem: after a loss, the sender knows bytes are missing but not which ones arrived after the gap. sack (RFC 2018) lets the receiver report which ranges it has:
ack=1501 SACK=2001-3501
now the sender knows it only needs to retransmit 1501-2000 instead of re-sending everything. sack is negotiated during the handshake (the sackOK option in tcpdump output) and virtually always enabled.
either side initiates a close with FIN:
client server
|--- FIN seq=8000 ------------->|
|<-- ACK ack=8001 --------------|
|<-- FIN seq=6000 --------------|
|--- ACK ack=6001 ------------->|
that is the graceful four-way close. in practice most connections combine FIN+ACK in the second step.
after sending the final ack, the initiator enters TIME_WAIT for 2 * MSL (maximum segment lifetime; linux uses a fixed 60 seconds rather than computing 2 * MSL). two reasons: the final ack might be lost, in which case the peer retransmits its FIN and someone must still be around to re-ack it; and stray segments from this connection must expire before the four-tuple can be reused, or they could be mistaken for data on a new connection.
thousands of TIME_WAIT sockets after a load test are normal. they consume about 150 bytes each on linux. do not blindly set net.ipv4.tcp_tw_reuse=1 unless you understand the implications for nat and timestamp ordering.
RST terminates a connection immediately. no handshake. causes: connecting to a closed port, sending data after the remote has closed, application calling close() with data in the receive buffer, or a firewall injecting resets.
if you see RSTs in tcpdump, figure out whether they come from the peer or from something in the path. firewalls love injecting RSTs for connections they consider idle.
ss -tan shows every tcp socket and its state. the one worth watching is CLOSE-WAIT: the peer sent its FIN but the local application has not yet called close(). accumulating CLOSE-WAIT sockets means your app is leaking connections:

ss -tan state close-wait | wc -l

if that number keeps climbing, something is not closing sockets.
most defaults are fine. when they are not:
# current buffer sizes
sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem
# format: min default max (bytes)
# net.ipv4.tcp_rmem = 4096 131072 6291456
per-socket via setsockopt:
SO_RCVBUF / SO_SNDBUF for buffer sizes (rarely needed, auto-tuning handles it)
TCP_NODELAY to kill nagle
SO_KEEPALIVE to detect dead peers on idle connections. the default idle time before probing is 7200 seconds (net.ipv4.tcp_keepalive_time); most apps set something shorter via TCP_KEEPIDLE
SO_LINGER to control what happens on close() with data pending

# everything about a specific connection
ss --info '( dport = :443 )'
ss --info exposes tcp_info: rtt, cwnd, retransmits, sack usage, congestion algorithm. same data available via getsockopt(TCP_INFO) in application code.
the segment format decides what is possible. the state machine decides what happens. flow control and congestion control decide how fast it all goes.