TCP/IP Stack Internals
Probes deep TCP/IP engineering judgment: state-machine edges, loss recovery, offloads, zero-copy, latency knobs, and packet-level debugging.
Step through the three-way handshake and the TCP state machine on each side.
Why does TCP have TIME_WAIT, which side usually enters it, and what are the real production risks of trying to eliminate it? (senior)
TIME_WAIT protects correctness after the active close. It keeps the 4-tuple reserved long enough for delayed duplicate segments to age out (so they cannot be accepted into a new incarnation of the same connection) and so the final ACK of the close handshake can be retransmitted if the peer retransmits its FIN. The duration is 2*MSL. In the normal close path, the endpoint that sends the first FIN and completes the active close is the one that enters TIME_WAIT.
The production temptation is to remove TIME_WAIT because busy clients or proxies run out of ephemeral ports or memory. The risk is accepting an old duplicate segment into a new connection on the same 4-tuple, corrupting an application protocol, or breaking close reliability when the last ACK is lost and the peer's retransmitted FIN hits a closed socket (which then gets an RST).
Senior handling is not 'disable TIME_WAIT'. It is usually:
- Widen the ephemeral port range and reuse connections at the application layer (pooling, HTTP keep-alive, protocol multiplexing).
- Enable `net.ipv4.tcp_tw_reuse` for outbound connections (which safely reuses a TIME_WAIT socket for a new outbound connection using timestamps), while understanding that `tcp_tw_recycle` was removed for breaking NAT.
- Put the active close on the side with more port headroom when protocol semantics allow it.
- Account for NAT/load-balancer state timeouts.
- Measure TIME_WAIT bucket counts, SYN retries, reset rates, and 4-tuple reuse pressure.
The invariant remains: TIME_WAIT is a correctness mechanism, not a resource leak to tune away.
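To make the port-exhaustion pressure concrete, here is a back-of-the-envelope sketch. It assumes Linux defaults (an ephemeral range of 32768-60999 and a 60-second TIME_WAIT) and no `tcp_tw_reuse`; the function names are illustrative, not from any library.

```c
#include <assert.h>

/* Number of ephemeral ports in the inclusive range [lo, hi]. */
static unsigned port_count(unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi <= 65535);
    return hi - lo + 1;
}

/* Upper bound on sustained new connections per second from one source IP
 * to one (destination IP, destination port), assuming every closed
 * connection parks its local port in TIME_WAIT for tw_seconds. */
static double max_conn_rate(unsigned lo, unsigned hi, double tw_seconds)
{
    assert(tw_seconds > 0.0);
    return port_count(lo, hi) / tw_seconds;
}
```

With the assumed defaults this gives 28232 ports and roughly 470 new connections per second to a single destination, which is why pooling and a wider port range come before any attempt to shorten TIME_WAIT.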
- How does NAT change TIME_WAIT pressure?
- What is simultaneous close?
- What would you look for in a packet capture?
Explain simultaneous open in TCP and why it matters even if most application code never intentionally uses it. (senior)
In simultaneous open, both endpoints actively open and send SYNs before receiving the other side's SYN. Per RFC 9293 the state machine handles it: each side in SYN-SENT receives a SYN without an ACK, replies with SYN+ACK, transitions to SYN-RECEIVED, and reaches ESTABLISHED when the peer's crossing SYN+ACK arrives and acknowledges its own SYN. There is no separate server role; both sides perform what looks like a half-handshake simultaneously.
It matters because TCP is specified as a symmetric peer protocol, not strictly client/server. State-machine code, firewalls, NATs, SYN-cookie paths, and connection tracking can mishandle crossed SYNs even if applications rarely depend on them, because many middleboxes assume exactly one side is passive. It is also a good interview probe because it reveals whether someone memorized the three-way handshake or actually understands the FSM transitions.
In production debugging, crossed SYNs can surface as duplicate connection attempts, unexpected SYN-RECV states, or policy drops in middleboxes that assume one side always sends the first SYN. Peer-to-peer NAT-traversal designs sometimes rely on this behavior deliberately.
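The open-path transitions can be sketched as a small table-style function. This is a simplified model of the RFC 9293 state diagram covering only connection establishment; the enum and function names are made up for illustration, and sequence-number validation is omitted entirely.

```c
#include <assert.h>

enum tcp_state { ST_CLOSED, ST_LISTEN, ST_SYN_SENT, ST_SYN_RECEIVED, ST_ESTABLISHED };
enum tcp_event { EV_ACTIVE_OPEN, EV_RCV_SYN, EV_RCV_SYN_ACK, EV_RCV_ACK };

/* Open-path transitions only. Unhandled (state, event) pairs leave the
 * state unchanged. */
static enum tcp_state tcp_open_step(enum tcp_state s, enum tcp_event e)
{
    assert(e >= EV_ACTIVE_OPEN && e <= EV_RCV_ACK);
    switch (s) {
    case ST_CLOSED:
        return e == EV_ACTIVE_OPEN ? ST_SYN_SENT : s;      /* send SYN */
    case ST_LISTEN:
        return e == EV_RCV_SYN ? ST_SYN_RECEIVED : s;      /* send SYN+ACK */
    case ST_SYN_SENT:
        if (e == EV_RCV_SYN_ACK) return ST_ESTABLISHED;    /* normal client path */
        if (e == EV_RCV_SYN)     return ST_SYN_RECEIVED;   /* crossed SYN: send SYN+ACK */
        return s;
    case ST_SYN_RECEIVED:
        /* Our SYN is acknowledged either by a bare ACK (passive open) or
         * by the peer's crossing SYN+ACK (simultaneous open). */
        return (e == EV_RCV_ACK || e == EV_RCV_SYN_ACK) ? ST_ESTABLISHED : s;
    default:
        return s;
    }
}
```

Driving two instances with ACTIVE_OPEN, then RCV_SYN, then RCV_SYN_ACK takes both from CLOSED to ESTABLISHED without either ever being in LISTEN, which is the case many middleboxes fail to model.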
- How does simultaneous open differ from simultaneous close?
- What would conntrack need to store?
- Would SYN cookies complicate this?
Compare Reno, CUBIC, and BBR at a level useful for debugging throughput collapse or latency under load. Be specific about how BBRv2/v3 changed the picture. (staff)
Reno is loss-based AIMD: grow cwnd roughly linearly, halve it on loss, fast-retransmit/recover on duplicate ACKs. Simple and fair under classic assumptions, but fills high bandwidth-delay-product paths slowly.
CUBIC is also loss-based but grows cwnd as a cubic function of time since the last loss, growing aggressively far from the prior loss point and cautiously near it. It is the Linux default because it scales on high-BDP paths while staying deployable. It still treats loss as the congestion signal, which misfires on lossy fabrics or shallow buffers.
BBR (v1) models the path: it estimates bottleneck bandwidth (BtlBw) and minimum round-trip propagation time (RTprop), paces to the delivery-rate estimate, and bounds in-flight near the BDP rather than reacting primarily to loss. It can win throughput and latency, but v1 could be unfair to loss-based flows and could build queues when its RTprop estimate was stale. BBRv2/v3 added loss and ECN as explicit signals with inflight_hi/inflight_lo bounds: ProbeBW raises in-flight (the Up phase pushes toward ~1.25x BDP) and backs off when loss or ECN-mark rate crosses a threshold, while ProbeRTT periodically drops in-flight to about half the BDP to re-measure RTprop. The net effect is better coexistence with CUBIC and less queue buildup.
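The in-flight bounds mentioned above are all multiples of the bandwidth-delay product, so it helps to be able to compute it quickly. A minimal sketch, with illustrative function names and the gains (125% while probing up, 50% in ProbeRTT) taken from the description above rather than from any kernel source:

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth-delay product in bytes: bottleneck rate (bits/s) times
 * propagation RTT (microseconds). */
static uint64_t bdp_bytes(uint64_t btlbw_bps, uint64_t rtprop_us)
{
    /* Guard the multiplication against 64-bit overflow. */
    assert(rtprop_us == 0 || btlbw_bps <= UINT64_MAX / rtprop_us);
    return btlbw_bps * rtprop_us / 8 / 1000000;
}

/* Scale an in-flight target by a gain expressed in percent. */
static uint64_t inflight_target(uint64_t bdp, unsigned gain_percent)
{
    return bdp * gain_percent / 100;
}
```

For a 1 Gbps bottleneck with a 50 ms RTprop the BDP is 6,250,000 bytes, so the probing ceiling is about 7.8 MB and the ProbeRTT floor about 3.1 MB. If `ss -ti` shows far more in flight than that, the excess is sitting in a queue.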
For debugging I ask:
- Is the sender app-limited or cwnd-limited?
- Is pacing active, and what qdisc (`fq`) is in use? BBR needs pacing.
- Are losses congestion, corruption, policer drops, or incast buffer drops?
- Are RTT samples inflated by queueing?
- Do retransmits line up with RTO, fast recovery, or RACK/TLP?
The right choice depends on fabric, fairness goals, buffer depth, pacing support, and whether the objective is latency or bulk throughput.
- Why can BBR build queues in some cases?
- How do you tell app-limited from cwnd-limited?
- What metrics would `ss -ti` expose?
Distinguish TCP flow control from congestion control, and give a failure mode where confusing them leads to the wrong fix. (senior)
Flow control protects the receiver. The receiver advertises a window (rwnd) based on free buffer space so the sender cannot overrun it; with window scaling the advertised value is shifted to allow windows past 64 KB. Congestion control protects the network path: the sender limits in-flight data via cwnd, loss recovery, pacing, and RTT signals based on inferred path capacity.
The effective send window is roughly min(rwnd, cwnd) minus bytes already in flight. If throughput collapses because the receive window is tiny, switching congestion control from CUBIC to BBR will not help. If throughput collapses because cwnd is repeatedly cut by fabric drops, growing socket receive buffers will not help.
A concrete failure mode: a latency-sensitive receiver thread is descheduled and stops draining the socket. The receive buffer fills, the advertised window shrinks toward zero, and the sender stalls; someone blames congestion control. The correct fix is CPU isolation, faster receive-side processing, larger receive buffers as a burst absorber, or application backpressure, not a different congestion algorithm. ss -ti showing a small/zero peer receive window or tcpdump showing window-size-zero advertisements and zero-window probes points straight at flow control, not congestion.
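The min(rwnd, cwnd) relationship is easy to encode, and doing so makes the triage question explicit. A minimal sketch with hypothetical names; the inputs would come from something like `ss -ti` or a capture, all in bytes:

```c
#include <assert.h>
#include <stdint.h>

enum limiter { LIMIT_NONE, LIMIT_RWND, LIMIT_CWND };

/* Bytes the sender may still put on the wire right now. */
static uint32_t usable_window(uint32_t rwnd, uint32_t cwnd, uint32_t inflight)
{
    uint32_t wnd = rwnd < cwnd ? rwnd : cwnd;
    return inflight >= wnd ? 0 : wnd - inflight;
}

/* Which window is the binding constraint when the sender is stalled. */
static enum limiter binding_limit(uint32_t rwnd, uint32_t cwnd, uint32_t inflight)
{
    if (usable_window(rwnd, cwnd, inflight) > 0)
        return LIMIT_NONE;                  /* sender is app-limited or busy */
    return rwnd <= cwnd ? LIMIT_RWND : LIMIT_CWND;
}
```

A stalled sender with `LIMIT_RWND` is a receiver problem and no congestion-control change will move it; `LIMIT_CWND` points at the path or the algorithm.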
- What does a zero-window probe do?
- How would packet capture show receiver limitation?
- How do autotuned socket buffers affect this?
Describe the Nagle and delayed-ACK interaction, why it hurts request/response latency, and how `TCP_NODELAY` and `TCP_QUICKACK` differ. (senior)
Nagle reduces tinygrams by holding small writes while there is unacknowledged data in flight, coalescing them into fewer, fuller segments. Delayed ACK lets the receiver wait briefly (typically about 40 ms on Linux, with a ceiling of 200 ms) before ACKing, hoping to piggyback the ACK on a response or to acknowledge two segments at once.
Together they create a latency trap: a Nagle-enabled sender with one small write outstanding waits for an ACK before sending the next write, while the receiver delays that ACK waiting for more data or an application response. Neither side is broken, but the request/response stalls for the delayed-ACK timer.
TCP_NODELAY disables Nagle for that socket, so small sends go immediately; it is sender-side. TCP_QUICKACK is receiver-side and asks Linux to ACK promptly, but it is a one-shot/transient hint, not a sticky mode: the stack drifts back to delayed-ACK behavior, so you may have to set it repeatedly.
The gotcha is that disabling Nagle raises packet rate and per-packet overhead. The better fix is often application-level write coalescing: build the full message and send it once with writev(), or use TCP_CORK to batch a bulk response and uncork at the end. For latency-sensitive control messages TCP_NODELAY is usually right; for streaming bulk data it can be counterproductive.
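Application-level coalescing with `writev()` might look like the sketch below. `send_message` is a hypothetical helper, and it assumes a blocking descriptor; on a non-blocking socket an `EAGAIN` would surface as a plain error here.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send header and body as one gathered write so the stack sees a single
 * message instead of two small writes. Returns 0 on success, -1 on error
 * (errno is left as set by writev). */
static int send_message(int fd, const void *hdr, size_t hdr_len,
                        const void *body, size_t body_len)
{
    assert((hdr || hdr_len == 0) && (body || body_len == 0));
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,  .iov_len = hdr_len  },
        { .iov_base = (void *)body, .iov_len = body_len },
    };
    int idx = 0;

    while (idx < 2) {
        ssize_t n = writev(fd, &iov[idx], 2 - idx);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted before writing: retry */
            return -1;
        }
        /* Skip buffers that were written completely, then trim a partial one. */
        size_t left = (size_t)n;
        while (idx < 2 && left >= iov[idx].iov_len) {
            left -= iov[idx].iov_len;
            idx++;
        }
        if (idx < 2) {
            iov[idx].iov_base = (char *)iov[idx].iov_base + left;
            iov[idx].iov_len -= left;
        }
    }
    return 0;
}
```

Because the header and body reach the stack in one call, there is no small trailing write for Nagle to hold, so the socket often does not need `TCP_NODELAY` at all.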
- How would `writev()` change the decision?
- When is `TCP_CORK` preferable?
- What would the packet trace look like?
How do SACK and RACK-TLP change TCP loss recovery compared with classic duplicate-ACK counting? (staff)
Classic fast retransmit uses a dupack threshold (three duplicate ACKs) as the loss signal. That works for long flights with plenty of packets after the loss, but it struggles with tail losses, small application-limited flights, lost retransmissions, and packet reordering (where dupacks falsely trigger retransmission).
SACK lets the receiver report non-contiguous blocks it has received, so the sender knows exactly which sequence ranges are missing and retransmits only the holes instead of everything past the cumulative ACK.
RACK-TLP (RFC 8985) replaces dupack counting with time-based inference. RACK uses per-segment transmit timestamps plus SACK feedback: if a later-sent segment is delivered, any earlier-sent segment still unacked beyond a reordering window is deemed lost and retransmitted, without waiting for three dupacks. This handles reordering and lost retransmissions gracefully. TLP (Tail Loss Probe) addresses the case where the last segment or its ACK is lost and there is no subsequent data to trigger RACK: after a Probe Timeout (PTO) of 2*SRTT (plus a worst-case delayed-ACK allowance when only one segment is in flight, and 1 second when no SRTT sample exists), it sends one probe to elicit an ACK/SACK and trigger fast recovery instead of waiting for a full RTO.
The senior debugging point: recovery is no longer 'three dupacks or RTO'. Reordering window, timestamp quality, SACK availability, pacing, and application-limited sends all affect when Linux retransmits. Captures need sequence numbers, SACK blocks, timestamps, and timing, not just retransmission counts.
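The probe timeout calculation is short enough to write out. This sketch follows the PTO pseudocode in RFC 8985 as I understand it; the function name, the microsecond units, and the convention that `srtt_us == 0` means "no sample yet" are choices made for this example.

```c
#include <assert.h>
#include <stdint.h>

#define WC_DELACK_US   200000u   /* worst-case delayed-ACK timer (RFC 8985 default) */
#define PTO_NO_SRTT_US 1000000u  /* fallback when no RTT sample exists */

/* Probe timeout in microseconds. rto_remaining_us is the time left until
 * the pending RTO timer would fire. */
static uint64_t tlp_calc_pto(uint64_t srtt_us, unsigned flight_segments,
                             uint64_t rto_remaining_us)
{
    uint64_t pto;

    assert(flight_segments > 0);     /* TLP is only armed with data in flight */
    if (srtt_us > 0) {
        pto = 2 * srtt_us;
        if (flight_segments == 1)
            pto += WC_DELACK_US;     /* a lone segment may sit behind a delayed ACK */
    } else {
        pto = PTO_NO_SRTT_US;
    }
    /* Never schedule the probe later than the pending RTO. */
    return pto < rto_remaining_us ? pto : rto_remaining_us;
}
```

With a 10 ms SRTT and several segments in flight the probe fires after 20 ms, an order of magnitude sooner than a typical RTO. With a single segment in flight the delayed-ACK allowance dominates, which is one reason tiny RPCs still see long tails.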
- Why do tail losses hurt RPC latency?
- How can reordering cause spurious retransmits?
- What happens if SACK is disabled?
What are TSO, GSO, GRO, and checksum offload actually promising, and what bugs do they create in drivers or packet captures? (senior)
TSO (TCP Segmentation Offload) lets the stack hand the NIC a large TCP payload (up to ~64 KB), and the NIC segments it into MTU-sized packets with correct, incrementing TCP/IP headers. GSO is the software analogue: the stack carries a large skb and segments late (just before the driver, or in the driver) when hardware cannot. GRO coalesces received packets into a large skb before upper-layer processing to amortize per-packet cost (the receive-side complement). TX checksum offload means the stack marks the skb as needing checksum completion and stores checksum-start/offset metadata; the NIC fills in the checksum after DMA.
The promise is performance and amortization, not a different wire protocol. The wire still carries MTU-sized frames unless jumbo frames are configured.
Driver and debugging bugs:
- Capturing on the TX host with `tcpdump` shows the pre-segmentation 64 KB skb and the pre-completion (often wrong) checksum, fooling you into thinking a giant frame or a bad checksum went on the wire; the actual wire is fine. This is why `tcpdump` reports 'checksum incorrect' on TX so often.
- Incorrect `gso_size`/`gso_type`, header offsets, MSS, or checksum-start/offset metadata corrupts real wire packets.
- Encapsulation offloads where inner and outer headers need separate checksum/segmentation handling are a rich bug source.
- Assuming that disabling offloads with `ethtool -K` is harmless: it changes timing enough to hide races and can overload the CPU at line rate.
A driver engineer must treat skb offload metadata as a contract with hardware, not just bytes in a buffer.
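When reconciling a TX-side capture with what the peer saw, it helps to compute how many wire segments one large skb should become. A small sketch with made-up function names; the MSS of 1448 used in the note below assumes a 1500-byte MTU with TCP timestamps enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Number of wire segments produced when `payload` bytes are cut into
 * MSS-sized pieces (the last one may be short). */
static uint32_t wire_segments(uint32_t payload, uint32_t mss)
{
    assert(mss > 0);
    return payload / mss + (payload % mss != 0);
}

/* Payload length of the i-th segment (0-based). */
static uint32_t segment_len(uint32_t payload, uint32_t mss, uint32_t i)
{
    assert(i < wire_segments(payload, mss));
    uint32_t off = i * mss;
    return payload - off < mss ? payload - off : mss;
}
```

A 64,000-byte send that shows up as one packet in the sender's capture should appear as 45 segments on the wire at MSS 1448: 44 full ones and a 288-byte tail. A different count on the receiver side is a hint that `gso_size` or the MSS metadata is wrong.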
- Why does tcpdump show bad checksums on TX?
- How do tunnel offloads complicate this?
- When would you disable GRO for latency?
Explain Linux TCP zero-copy send paths: `sendfile()` and `MSG_ZEROCOPY`. What are the lifecycle and correctness traps? (senior)
sendfile() copies data between file descriptors inside the kernel, commonly from a page-cache-backed file to a TCP socket, avoiding a user read buffer and the extra user/kernel copies. It is efficient for static file serving, especially with sensible corking or batching.
MSG_ZEROCOPY asks the kernel to avoid copying user pages on socket send for TCP and supported transports. The application sets SO_ZEROCOPY on the socket and passes MSG_ZEROCOPY on each send(). Completion is asynchronous via the socket error queue, because the send syscall returning does not mean the NIC is done with those user pages.
The traps are buffer lifetime and accounting:
- Reusing or modifying a buffer before zero-copy completion corrupts the data on the wire.
- Small writes can be slower: page pinning, the per-send notification, and bookkeeping cost more than a memcpy of a few hundred bytes.
- Error-queue handling becomes part of the send path. You read completions with `recvmsg(fd, &msg, MSG_ERRQUEUE)`; ignoring them leaks accounting state and destroys observability.
- The kernel can sometimes fall back to a copy (for example, a packet looped to a local socket or `tcpdump`), and it signals this by setting `SO_EE_CODE_ZEROCOPY_COPIED` in `ee_code`, telling you to stop bothering with `MSG_ZEROCOPY` on that socket.
- Pinned pages interact with memory pressure and long-lived DMA; TLS, corking, and retransmission can change actual behavior.
Zero-copy is a throughput and memory-bandwidth tool first. For tiny latency-sensitive messages, copying from hot cache is usually faster and simpler.
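The opt-in and the small-write trap can be combined in one send wrapper. This is a sketch, not a reference implementation: `enable_zerocopy`, `send_maybe_zerocopy`, and the caller-supplied `threshold` are hypothetical, and the fallback constant values are the generic Linux ones for libc headers that predate them.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60            /* generic Linux value; missing in older headers */
#endif
#ifndef MSG_ZEROCOPY
#define MSG_ZEROCOPY 0x4000000    /* generic Linux value; missing in older headers */
#endif

/* Opt the socket in to MSG_ZEROCOPY. Returns 0 on success, -1 on error. */
static int enable_zerocopy(int fd)
{
    int one = 1;
    return setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
}

/* Use zero-copy only for sends of at least `threshold` bytes; smaller
 * sends take the ordinary copying path. *used_zerocopy tells the caller
 * whether buf must stay untouched until its completion arrives. */
static ssize_t send_maybe_zerocopy(int fd, const void *buf, size_t len,
                                   size_t threshold, int *used_zerocopy)
{
    assert(used_zerocopy != NULL);
    int zc = len >= threshold;
    *used_zerocopy = zc;
    return send(fd, buf, len, zc ? MSG_ZEROCOPY : 0);
}
```

The right threshold is workload-dependent and should be measured; the point of the wrapper is that the buffer-lifetime obligation only exists when `*used_zerocopy` is set.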
- How do you read zerocopy completions?
- Why can zero-copy hurt small messages?
- How does retransmission affect pinned pages?
Show how you read `MSG_ZEROCOPY` completions from the error queue and interpret the notification. What exactly does a completion mean? (staff)
Each send(..., MSG_ZEROCOPY) is tagged with a monotonically increasing 32-bit counter. Completions arrive on the socket's error queue as SO_EE_ORIGIN_ZEROCOPY errors, and crucially each notification carries a range [ee_info, ee_data] (inclusive) of completed send IDs, so one notification can retire many sends. You must read and coalesce these, not assume one-completion-per-send.
```c
/* Fragment from inside a completion handler.
 * Needs <sys/socket.h>, <linux/errqueue.h>, and <stdint.h>. */
struct msghdr msg = {0};
char control[128];

msg.msg_control = control;
msg.msg_controllen = sizeof(control);
if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
    return;                    /* error-queue reads are always non-blocking */

for (struct cmsghdr *cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
    struct sock_extended_err *ee = (struct sock_extended_err *)CMSG_DATA(cm);

    if (ee->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
        continue;

    uint32_t lo = ee->ee_info, hi = ee->ee_data;   /* inclusive ID range */
    /* Count-based loop: `id <= hi` never terminates when hi == UINT32_MAX
     * and skips the range entirely once the 32-bit counter wraps (hi < lo). */
    for (uint32_t n = hi - lo + 1, id = lo; n > 0; n--, id++)
        release_buffer(id);    /* now safe to reuse */

    if (ee->ee_code & SO_EE_CODE_ZEROCOPY_COPIED) {
        /* kernel fell back to a copy; consider dropping MSG_ZEROCOPY */
    }
}
```
Semantics that trip people up: ee_errno is zero (so this never makes normal reads/writes return errors), the error queue read is always non-blocking, and you typically detect readiness with poll()/epoll on POLLERR (or EPOLLERR, which fires even though you did not request it). A completion means the kernel and NIC are finished with those pages and they may be reused; it is decoupled from, and arrives after, the send() that returned the byte count. The range encoding exists precisely so the kernel can batch notifications under load, and the SO_EE_CODE_ZEROCOPY_COPIED bit tells you the 'zero-copy' actually copied.
- Why does the kernel report a range instead of one ID?
- What does `SO_EE_CODE_ZEROCOPY_COPIED` tell you to do?
- How would you bound the number of in-flight zero-copy buffers?
Why do TCP timestamps and window scaling exist together, and what is PAWS protecting against on a fast link? (senior)
Both come from RFC 7323 (TCP Extensions for High Performance) and both are needed once paths have a large bandwidth-delay product. Window scaling lets the advertised receive window exceed the 16-bit field's 64 KB limit by applying a negotiated left-shift, so a fast, high-latency path can keep enough bytes in flight to fill the pipe (BDP can be megabytes). Without it, throughput is capped at 64 KB / RTT regardless of link speed.
The TCP timestamp option adds TSval (sender's clock) and TSecr (echoed back) to every segment, used for two things: round-trip time measurement (a clean RTT sample per ACK, including for retransmitted data via Eifel-style detection) and PAWS.
PAWS (Protection Against Wrapped Sequences) is the safety mechanism that window scaling makes necessary. The 32-bit sequence space is only ~4 GB; on a fast enough link it wraps quickly (at 10-100 Gbps, in well under the 2*MSL a stale duplicate could survive). Once sequence numbers can wrap within a connection's lifetime, an old delayed duplicate could carry a sequence number that looks valid in the current window and be wrongly accepted. PAWS treats the monotonically non-decreasing timestamp as a logical high-order extension of the sequence number: a segment whose TSval is older than what was recently seen is discarded as an old duplicate. This is why disabling timestamps on a high-speed network is not free; you lose both good RTT sampling and wrap protection.
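The numbers behind this are simple arithmetic, and the PAWS comparison is a one-line wrapping comparison. A minimal sketch with illustrative function names; the cast to `int32_t` assumes the usual two's-complement conversion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Milliseconds for a sender at `bps` bits/s to consume the 2^32-byte
 * sequence space. */
static uint64_t seq_wrap_ms(uint64_t bps)
{
    assert(bps > 0);
    return ((uint64_t)1 << 32) * 8 * 1000 / bps;
}

/* Throughput ceiling in bytes/s with an unscaled 16-bit window. */
static uint64_t unscaled_cap_bytes_per_sec(uint64_t rtt_us)
{
    assert(rtt_us > 0);
    return 65535ull * 1000000 / rtt_us;
}

/* PAWS check: reject when the segment's TSval is older than the most
 * recently accepted one, compared in 32-bit wrapping arithmetic. */
static bool paws_reject(uint32_t ts_recent, uint32_t seg_tsval)
{
    return (int32_t)(seg_tsval - ts_recent) < 0;
}
```

At 10 Gbps the sequence space wraps in about 3.4 seconds and at 100 Gbps in about a third of a second, while an unscaled window over a 100 ms RTT tops out near 655 KB/s. The signed-difference comparison is what lets the timestamp itself wrap without breaking the check.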
- At what link speed does sequence wrap become a real concern?
- What is the throughput cap without window scaling?
- Why is the timestamp described as a sequence-number extension?
Explain TCP incast: why a synchronized many-to-one pattern collapses throughput, and what actually fixes it. (staff)
Incast is a catastrophic goodput collapse in datacenter fabrics when many senders transmit to one receiver simultaneously, classically a striped read where a client asks N storage nodes for pieces of a block and waits for all of them. The synchronized responses converge on the receiver's switch port and overflow the switch's shallow egress buffer; multiple senders lose packets at once.
The collapse comes from loss recovery, not bandwidth. When a sender loses its whole window (common when N is large and buffers are tiny), there are no dupacks to trigger fast retransmit, so it waits for a full RTO. The default tcp_rto_min is about 200 ms, which is enormous next to a sub-millisecond datacenter RTT. During that timeout the link sits idle, the barrier-synchronized request cannot complete until the slowest sender recovers, and effective goodput can fall to 1-10% of link capacity with per-request latency above 200 ms. Adding more senders makes it worse because more flows lose simultaneously and all stall on RTOs.
What actually helps:
- Reduce the RTO floor toward the fabric RTT (microsecond-granularity RTO / lower `tcp_rto_min`) so a timeout costs microseconds, not 200 ms.
- Use ECN-based congestion control (DCTCP) so switches mark instead of drop and senders back off before the buffer overflows.
- Increase switch buffering (deeper-buffer switches) to absorb the synchronized burst.
- Add application-level jitter/staggering or limit fan-out / in-flight requests so responses do not all arrive at once.
- Smaller per-flow windows and pacing to reduce the synchronized burst size.
The staff insight is that incast is a control-loop pathology: a barrier-synchronized workload plus shallow buffers plus a coarse RTO, and the most leveraged fixes attack the RTO granularity and the drop-versus-mark behavior, not raw bandwidth.
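A rough model shows why the RTO floor dominates. The sketch below assumes the barrier completes only after the slowest sender has waited out its timeouts and ignores everything else (slow start, queueing, retransmission time); the function name and parameters are illustrative.

```c
#include <assert.h>

/* Fraction of link capacity achieved when a transfer that would ideally
 * take bytes*8/link_bps seconds also waits out `timeouts` RTOs. */
static double goodput_fraction(double bytes, double link_bps,
                               double rto_s, unsigned timeouts)
{
    assert(bytes > 0 && link_bps > 0 && rto_s >= 0);
    double ideal_s = bytes * 8.0 / link_bps;
    return ideal_s / (ideal_s + timeouts * rto_s);
}
```

A 1 MB striped read on a 1 Gbps port takes 8 ms ideally; one 200 ms RTO drops goodput to under 4% of line rate, while the same loss with a 1 ms RTO still delivers close to 90%. That asymmetry is the argument for fixing RTO granularity before buying bandwidth.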
- Why doesn't fast retransmit save you here?
- How does DCTCP change switch behavior?
- What host counters would show incast loss?
Walk through the SYN-RECEIVED state and SYN cookies. What problem do cookies solve, and what do you give up? (senior)
When a passive socket receives a SYN, it normally allocates a request socket, records the connection parameters (ISN, MSS, window scale, SACK-permitted, timestamps), replies SYN+ACK, and sits in SYN-RECEIVED waiting for the final ACK. That half-open state lives in a bounded SYN backlog queue. A SYN flood (many spoofed SYNs that never complete) fills the backlog so legitimate SYNs are dropped, a denial-of-service.
SYN cookies remove the need to store state for half-open connections. Instead of allocating a request socket on the SYN, the server encodes the essential connection state into the initial sequence number of the SYN+ACK: a hash of the 4-tuple and a slowly changing secret/time counter, plus a small encoding of the MSS. The server then forgets the connection entirely. When the client's final ACK arrives, its acknowledgment number is cookie+1; the server recomputes and validates the cookie, and if it checks out, reconstructs the connection and goes to ESTABLISHED without ever having stored backlog state. Linux enables this automatically under backlog pressure via net.ipv4.tcp_syncookies.
What you give up: the SYN-ACK ISN has limited bits, so only a small set of MSS values can be encoded, and historically TCP options that must be remembered from the SYN (window scaling, SACK-permitted, timestamps) could be lost when cookies engaged, degrading the connection. Linux mitigates this by smuggling some option state into the timestamp, but the principle stands: cookies trade a little fidelity and CPU (hashing per ACK) for immunity to backlog exhaustion. Because they are activated only under flood, normal connections are unaffected.
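The encode-then-forget idea can be shown with a toy cookie. This is deliberately not the Linux implementation: the bit layout, the MSS table, and especially `toy_hash` (a non-cryptographic mixer standing in for a keyed hash) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative MSS table; the index must fit in 3 bits. */
static const uint16_t mss_table[8] = { 216, 536, 1024, 1220, 1300, 1400, 1440, 1460 };

struct tuple { uint32_t saddr, daddr; uint16_t sport, dport; };

/* Toy mixer standing in for a keyed cryptographic hash. NOT secure. */
static uint32_t toy_hash(const struct tuple *t, uint32_t secret,
                         uint32_t count, uint32_t mss_idx)
{
    uint32_t in[5] = { t->saddr, t->daddr,
                       ((uint32_t)t->sport << 16) | t->dport, count, mss_idx };
    uint32_t h = secret ^ 0x9e3779b9u;
    for (int i = 0; i < 5; i++) {
        h ^= in[i];
        h *= 0x01000193u;
        h ^= h >> 15;
    }
    return h;
}

/* Layout: [5 bits: time counter mod 32][3 bits: MSS index][24 bits: hash]. */
static uint32_t cookie_make(const struct tuple *t, uint32_t secret,
                            uint32_t count, uint32_t mss_idx)
{
    assert(mss_idx < 8);
    return ((count & 31u) << 27) | (mss_idx << 24) |
           (toy_hash(t, secret, count, mss_idx) & 0xffffffu);
}

/* Validate the acknowledgment number of the handshake-completing ACK.
 * Returns the encoded MSS, or 0 if the cookie is stale or does not verify. */
static uint16_t cookie_check(const struct tuple *t, uint32_t secret,
                             uint32_t now_count, uint32_t ack)
{
    uint32_t cookie = ack - 1;                        /* ACK = ISN + 1 */
    uint32_t age = (now_count - (cookie >> 27)) & 31u;
    if (age > 1)
        return 0;                                     /* counter too old */
    uint32_t mss_idx = (cookie >> 24) & 7u;
    uint32_t want = toy_hash(t, secret, now_count - age, mss_idx) & 0xffffffu;
    if (want != (cookie & 0xffffffu))
        return 0;                                     /* hash mismatch */
    return mss_table[mss_idx];
}
```

Everything the server needs is recomputed from the ACK itself, which is why only what fits in 32 bits survives. Window scale and SACK-permitted have nowhere to live in this layout, matching the fidelity loss described above.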
- How is the connection state recovered from the final ACK?
- Why can window scaling be lost under cookies?
- When does Linux actually switch cookies on?
Give a short C example that sets low-latency TCP options correctly, and explain what it does not solve. (senior)
A minimal setup disables Nagle and requests quick-ACK behavior:
```c
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <netinet/tcp.h>  /* TCP_NODELAY, TCP_QUICKACK */
#include <sys/socket.h>

/* Returns 0 on success, -1 if either option could not be set. */
int tune_tcp(int fd) {
    int one = 1;

    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
        return -1;
    /* QUICKACK is a transient hint, not a sticky mode; may need re-setting */
    if (setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one)) < 0)
        return -1;
    return 0;
}
```
TCP_NODELAY is sticky and sender-side: it lets small writes go immediately. TCP_QUICKACK is receiver-side and transient: Linux treats it as a one-shot hint and drifts back to delayed ACKs, so for a sustained effect you re-apply it (often after each recv).
This does not solve application write fragmentation (use `writev()` or build the whole message first), scheduler latency and core isolation, NIC interrupt moderation, cwnd limits, rwnd/receiver limits, packet loss and RTO, TLS record sizing (a record cannot be decrypted until its last segment arrives, so a record that straddles a window boundary can cost an extra round trip), or middlebox buffering. A senior engineer still reads packet traces and `ss -ti` (RTT, cwnd, rwnd, retransmits, pacing rate) before declaring a socket tuned. Two socket options are a starting point, not a latency strategy.
- Where would you use `writev()`?
- How would TLS change the picture?
- What if `TCP_QUICKACK` is not available?
For low-latency market data, why use UDP multicast instead of TCP, and what engineering problems does that push to the application? (staff)
UDP multicast lets one publisher deliver a single packet stream to many receivers without per-client TCP state, per-client retransmission queues, or head-of-line blocking caused by one slow receiver. It fits market data, where the newest state often matters more than reliable delivery of every stale update, and where the publisher must do bounded, predictable work regardless of subscriber count.
The application inherits the hard parts:
- Loss detection via sequence numbers and gap tracking.
- Recovery via a snapshot/refresh channel, a replay/retransmission service, or redundant A/B feeds (publish two independent multicast streams and arbitrate, taking whichever packet arrives first).
- Reordering and duplicate handling, especially with A/B arbitration.
- IGMP/PIM and switch configuration, plus multicast group scaling and IGMP snooping correctness.
- Kernel socket buffer (`SO_RCVBUF`), NIC RX queue, and filter sizing to absorb microbursts; multicast often lands on one RSS queue unless steered.
- Backpressure policy, because UDP will never slow the publisher for a struggling receiver; the receiver must keep up or drop.
- Hardware timestamping and clock sync (PTP) if you measure latency.
The senior tradeoff is explicit: TCP gives ordered, reliable bytes but head-of-line blocks and couples sender behavior to each receiver and to network conditions. UDP multicast gives fan-out and constant publisher work, but correctness and recovery move entirely into protocol design and operations, which is exactly the model exchanges use (and where A/B feeds plus a recovery channel are standard).
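Sequence-based gap tracking with A/B arbitration can be sketched in a few lines. This version is intentionally simplified: it declares a gap as soon as a packet arrives ahead of sequence, whereas a production arbiter would usually hold out-of-order packets briefly to give the other feed a chance to fill the hole. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum verdict { PKT_DELIVER, PKT_DUPLICATE, PKT_GAP };

struct arbiter {
    uint64_t next_seq;    /* next sequence number we expect to deliver */
    uint64_t gap_from;    /* first missing sequence of the most recent gap */
    uint64_t gap_count;   /* number of messages the most recent gap skipped */
};

/* Feed packets from both the A and the B stream through the same arbiter.
 * The first copy of the expected sequence wins; later copies are dropped. */
static enum verdict arbiter_on_packet(struct arbiter *a, uint64_t seq)
{
    assert(a != 0);
    if (seq < a->next_seq)
        return PKT_DUPLICATE;         /* already delivered or already skipped */
    if (seq == a->next_seq) {
        a->next_seq = seq + 1;
        return PKT_DELIVER;
    }
    /* seq is ahead: [next_seq, seq) has not arrived on either feed. Record
     * the range for the recovery channel, deliver this packet, and resync. */
    a->gap_from = a->next_seq;
    a->gap_count = seq - a->next_seq;
    a->next_seq = seq + 1;
    return PKT_GAP;
}
```

On `PKT_GAP` the caller still processes the packet that exposed the hole and asks the replay or snapshot service for `gap_count` messages starting at `gap_from`. The per-packet work stays constant, which is the property the publisher side was designed around.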
- How would you design gap recovery?
- What counters show multicast loss in the host?
- How does GRO or interrupt moderation affect feed latency?