status
this chapter is in active development
expect live edits and rapid iteration (except for when i am really busy with other stuff) while this material is written.
chapter 1 covered the handshake as a black box. three packets and you are in. here we look at what those packets actually contain.
tcp (RFC 9293) provides a reliable, ordered byte stream over an unreliable network. every guarantee it makes costs something.
every tcp segment carries a fixed 20-byte header (plus options):
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| source port | destination port |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| sequence number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| acknowledgment number |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| off | res |N|C|E|U|A|P|R|S|F| window size |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| checksum | urgent pointer |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| options (if any) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
source and destination ports (16 bits each) identify the endpoints. combined with the source and destination ip addresses from the ip header beneath, they form the four-tuple that uniquely identifies a connection.
sequence number (32 bits) counts bytes, not segments. send 1000 bytes starting at sequence 5000 and the next segment starts at 6000. the isn (initial sequence number) is random to prevent old segments from a previous connection sneaking in.
acknowledgment number (32 bits) says "I have received everything up to this byte." cumulative. acking byte 7000 means all bytes before 7000 arrived.
flags drive the state machine: SYN opens, FIN closes, RST aborts, ACK acknowledges, PSH tells the receiver to deliver data immediately.
window size (16 bits, scaled by window scale option) advertises how much buffer space the receiver has. flow control covers this.
the three-way handshake is really an isn exchange:
client server
|--- SYN seq=1000 ------------>|
|<-- SYN+ACK seq=5000 ack=1001--|
|--- ACK seq=1001 ack=5001 --->|
the client picks isn 1000. the server acks 1001 (the client's isn + 1) and offers its own isn 5000. the client acks 5001. both sides now know each other's starting sequence number and can track every byte that follows.
tcpdump shows relative sequence numbers by default (starting at 0). pass -S for absolute numbers when you need to verify isn handling.
after the handshake, every segment carries a sequence number and data. the receiver responds with acks:
client server
|--- seq=1001 len=500 ------------>|
|--- seq=1501 len=500 ------------>|
|<-- ack=2001 ----------------------| "got everything up to 2001"
|--- seq=2001 len=1000 ----------->|
|<-- ack=3001 ----------------------|
the sender does not wait for an ack before sending the next segment. it sends as fast as the congestion window and receive window allow. acks are cumulative: if ack 2001 arrives, everything before 2001 is confirmed regardless of how many segments it took.
most tcp stacks do not ack every segment. linux waits up to 40ms or until 2 segments arrive before sending an ack. reduces ack traffic but can hurt latency for request-response protocols. TCP_QUICKACK disables it per-socket.
nagle buffers small writes until an ack arrives for previously sent data or enough bytes accumulate to fill a segment. it exists to stop a chatty application from flooding the network with tinygrams: one byte of payload wrapped in 40 bytes of tcp/ip headers.
for latency-sensitive protocols (ssh, gaming, interactive apis), kill it:
int flag = 1;   /* needs <netinet/tcp.h> */
if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &flag, sizeof(flag)) < 0)
    perror("setsockopt TCP_NODELAY");
nagle interacts badly with delayed acks. the sender waits for an ack, the receiver waits to batch the ack, and you get 40ms of pointless delay. most http servers set TCP_NODELAY by default for exactly this reason.
when a segment is lost, tcp has two paths back.
the sender starts a retransmission timeout (rto) timer for each segment. no ack before the timer fires means retransmit. rto is calculated from smoothed rtt and rtt variance:
rto = srtt + max(G, 4 * rttvar)
G is the clock granularity (typically 1ms on linux). linux floors the rto at 200ms (tunable per route with the rto_min option of ip route) and caps it at 120 seconds.
waiting hundreds of milliseconds to detect a lost packet is painful. that is the slow path.
when the receiver gets an out-of-order segment, it immediately sends a duplicate ack for the last in-order byte. three duplicate acks (four total for the same sequence number) and the sender retransmits without waiting for the timeout:
client server
|--- seq=1001 len=500 ------------>|
|--- seq=1501 len=500 ---X | (lost)
|--- seq=2001 len=500 ------------>|
|<-- ack=1501 (dup ack #1) --------| "still waiting for 1501"
|--- seq=2501 len=500 ------------>|
|<-- ack=1501 (dup ack #2) --------|
|--- seq=3001 len=500 ------------>|
|<-- ack=1501 (dup ack #3) --------|
|--- seq=1501 len=500 ------------>| fast retransmit
|<-- ack=3501 ----------------------| cumulative ack catches up
fires in about one rtt. much faster than waiting for rto.
cumulative acks have a problem: after a loss, the sender knows bytes are missing but not which ones arrived after the gap. sack (RFC 2018) lets the receiver report which ranges it has:
ack=1501 SACK=2001-3501
now the sender knows it only needs to retransmit 1501-2000 instead of re-sending everything. sack is negotiated during the handshake (the sackOK option in tcpdump output) and virtually always enabled.
either side initiates a close with FIN:
client server
|--- FIN seq=8000 ------------->|
|<-- ACK ack=8001 --------------|
|<-- FIN seq=6000 --------------|
|--- ACK ack=6001 ------------->|
that is the graceful four-way close. in practice most connections combine FIN+ACK in the second step.
after sending the final ack, the initiator enters TIME_WAIT for 2 * MSL (maximum segment lifetime; linux uses a fixed 60 seconds rather than computing 2 * MSL). two reasons: the final ack might be lost, in which case the peer retransmits its FIN and someone must still be around to re-ack it; and stray segments from this connection must expire before the four-tuple can be reused, or they could be mistaken for data on a new connection.
thousands of TIME_WAIT sockets after a load test are normal. they consume about 150 bytes each on linux. do not blindly set net.ipv4.tcp_tw_reuse=1 unless you understand the implications for nat and timestamp ordering.
RST terminates a connection immediately. no handshake. causes: connecting to a closed port, sending data after the remote has closed, application calling close() with data in the receive buffer, or a firewall injecting resets.
if you see RSTs in tcpdump, figure out whether they come from the peer or from something in the path. firewalls love injecting RSTs for connections they consider idle.
ss -tan shows every tcp socket and its state. the one worth watching is CLOSE-WAIT: the peer sent its FIN but the local application has not yet called close(). accumulating CLOSE-WAIT sockets means your app is leaking connections:

ss -tan state close-wait | wc -l

if that number keeps climbing, something is not closing sockets.
most defaults are fine. when they are not:
# current buffer sizes
sysctl net.ipv4.tcp_rmem
sysctl net.ipv4.tcp_wmem
# format: min default max (bytes)
# net.ipv4.tcp_rmem = 4096 131072 6291456
per-socket via setsockopt:
SO_RCVBUF / SO_SNDBUF for buffer sizes (rarely needed, auto-tuning handles it)
TCP_NODELAY to kill nagle
SO_KEEPALIVE to detect dead peers on idle connections. the default idle time before probing is 7200 seconds (net.ipv4.tcp_keepalive_time); most apps set something shorter via TCP_KEEPIDLE
SO_LINGER to control what happens on close() with data pending

# everything about a specific connection
ss --info '( dport = :443 )'
ss --info exposes tcp_info: rtt, cwnd, retransmits, sack usage, congestion algorithm. same data available via getsockopt(TCP_INFO) in application code.
the segment format decides what is possible. the state machine decides what happens. flow control and congestion control decide how fast it all goes.