flowchart TD
A[IP datagrams: unreliable, unordered, no feedback]:::constraint --> B[Coordination: distributed endpoints, no central authority]:::constraint
B --> C[State: must infer capacity from local signals]:::constraint
C --> D[Time: must estimate RTT from ACK stream]:::constraint
E[1986: Congestion collapse]:::failure --> F[Jacobson 1988: cwnd + AIMD + SRTT]:::fix
F --> G[Loss = congestion signal assumption]:::constraint
G --> H[High-BDP: linear increase too slow]:::failure
H --> I[CUBIC 2008: cubic growth, RTT-independent]:::fix
G --> J[Bufferbloat: loss delayed by deep buffers]:::failure
J --> K[BBR 2017: model-based, BtlBw + RTprop]:::fix
G --> L[Datacenter: incast, shallow buffers, low RTT]:::failure
L --> M[DCTCP 2010: proportional ECN marking]:::fix
N[TCP ossification: middleboxes block extensions]:::failure --> O[QUIC 2017: transport over UDP, encrypted]:::fix
classDef constraint fill:#4a90d9,stroke:#2c5f8a,color:white
classDef failure fill:#d94a4a,stroke:#8a2c2c,color:white
classDef fix fill:#4ad94a,stroke:#2c8a2c,color:white
9 Transport and Congestion Control
9.2 Act 1: “It’s 1974. TCP Is a Reliability Protocol.”
Cerf and Kahn designed TCP as a protocol for packet network intercommunication — a system to deliver bytes correctly across heterogeneous networks connected by stateless gateways (Cerf and Kahn 1974). The protocol unified host-to-host communication, process-to-process multiplexing, and reliability into a single monolithic design. By 1981, RFC 793 formalized the connection state machine: three-way handshake, sequence numbers, cumulative acknowledgments, and flow control via a receiver-advertised window (Postel 1981). The receiver window told the sender how much buffer space the receiver had available — a capacity limit at the far end. This was the only rate-control mechanism: if the receiver could accept 32 KB, the sender could send 32 KB. The network between them was assumed to have sufficient capacity.
What the pioneers saw: A small ARPANET with fewer than 200 hosts, administered cooperatively by researchers who knew each other. Links operated at 50 kbps. Traffic consisted of terminal sessions and file transfers. The engineering challenge was reliability — ensuring bytes arrived correctly despite unreliable packet delivery.
What remained invisible from the pioneers’ vantage point: The network would grow from 200 hosts to millions. Traffic would increase by orders of magnitude. Multiple independent administrative domains would replace cooperative research governance. The implicit assumption — that the network has enough capacity for all senders — would fail catastrophically within a decade.
9.2.1 Invariant Analysis: TCP (1974–1981)
| Invariant | TCP Answer (1974–1981) | Gap? |
|---|---|---|
| State | Sequence numbers, connection FSM, receiver window | Path-blind: zero belief about network capacity or delay |
| Time | Fixed retransmission timers (“adaptive” but vague) | RTT estimation absent; timers fire too early or too late |
| Coordination | Distributed endpoints; receiver flow control only | Congestion coordination absent; each sender independent |
| Interface | Byte stream above, IP datagrams below | Clean — and permanent. This becomes the binding constraint |
The State gap is the critical one: TCP tracks connection state (what bytes have been sent, acknowledged, received) but maintains zero path state. The sender is blind to network load, bottleneck capacity, and competing traffic. The congestion window is absent from the design. The sender transmits as fast as the receiver window permits — which is often far faster than the network absorbs.
9.2.2 Environment, Measurement, Belief
| Layer | What TCP Has | What’s Missing |
|---|---|---|
| Environment | True queue occupancy, competing traffic, link rates | Everything — fully opaque to the sender |
| Measurement | ACK arrivals (binary: arrived or not) | Timing and congestion signals both absent |
| Belief | “Receiver can accept this much data” (receiver window) | Network capacity belief entirely absent |
The E-M gap is total. The measurement channel (ACK arrivals) carries information about the receiver but nothing about the path. The sender’s belief covers only “the receiver has buffer space” — a statement about the far end, silent on the middle. The gap between Environment and Measurement is structural absence, beyond noise or estimation error. The sensor for network capacity simply does not exist. This is the most extreme form of the E-M gap.
9.2.3 “The Gaps Didn’t Matter… Yet”
On the 1974 ARPANET, they were invisible. With fewer than 200 hosts and 50 kbps links, aggregate demand rarely exceeded capacity. Congestion occurred locally and resolved quickly because traffic was light. The receiver window was the practical bottleneck, not the network.
But by 1986, the Internet had grown to over 5,000 hosts. Links had scaled to 56 kbps and 1.5 Mbps, but aggregate demand scaled faster. Multiple independent organizations connected their networks. The assumption “enough capacity” broke.
9.3 Act 2: “It’s 1986. The Network Is Collapsing.”
9.3.1 Which Invariant Broke?
In October 1986, throughput on the NSFnet backbone between LBL and UC Berkeley — a path that should have sustained 32 kbps — dropped to 40 bits per second. A factor of 800 reduction. The network was physically intact; every link was operational. Every router was forwarding. The collapse was behavioral: senders were transmitting faster than the network could deliver, packets were being dropped, senders retransmitted (because their timers fired), creating more traffic, causing more drops. The retransmissions themselves became the dominant source of traffic — a positive feedback loop where the system’s response to failure amplified the failure.
This was a self-reinforcing failure that settled into a stable equilibrium: the network converged to a state where nearly all traffic consisted of retransmissions and useful throughput approached zero. The system had converged — to the wrong operating point.
“The network is now subject to a form of congestion collapse… If the timers are set too short, more retransmissions than necessary will be performed, and the network will be further congested.” — Nagle, 1984 (Nagle 1984)
Nagle diagnosed the problem in RFC 896 two years before the collapse became acute: senders lacked any mechanism to sense or respond to congestion. His fix operated at the Interface level — coalesce small packets to reduce packet count — but it treated a symptom, leaving the cause intact. Senders still lacked a feedback loop.
Three researchers, working independently over three years, diagnosed three distinct invariant failures:
| Year | Pioneer | Invariant Diagnosed | Gap Identified |
|---|---|---|---|
| 1984 | Nagle | State | Sender lacks network load model |
| 1986 | Zhang | Time | RTT estimate is too crude; timers fire incorrectly |
| 1987 | Karn & Partridge | State | Ambiguous RTT samples from retransmissions pollute estimator |
Zhang showed that TCP’s retransmission timers systematically failed because the RTT estimator was too primitive (Zhang 1986). RFC 793 tracked only a smoothed mean RTT with a fixed multiplier — variance tracking absent. When network conditions changed, the timer was either too aggressive (causing spurious retransmissions that worsened congestion) or too conservative (waiting too long to detect loss).
Karn and Partridge identified a subtler problem: when a packet is retransmitted, the sender is unable to determine whether the returning ACK acknowledges the original or the retransmission (Karn and Partridge 1987). Using this ambiguous sample corrupts the RTT estimate. Their fix: discard RTT samples from retransmitted segments and back off the timer exponentially on each retransmission.
In parallel, Jain and Chiu were developing the theoretical foundation: how should distributed agents adjust rates to converge to fairness? Their work would prove that additive increase / multiplicative decrease (AIMD) is the unique linear control policy that converges to both efficiency and fairness — a result Jacobson would implement independently of the formal proof.
Three researchers. Three years. No coordination. Yet they converged on the same decomposition — each diagnosing a different invariant’s failure. This convergence is evidence that the four-invariant structure is what the pioneers themselves discovered through debugging, not a framework imposed retrospectively.
9.3.2 Jacobson’s Synthesis: Four Fixes in One System (1988)
Jacobson synthesized all four invariant fixes into a single coherent system — the most consequential transport paper ever written (Jacobson 1988).
“In October ’86, the Internet had its first ‘congestion collapse’… the throughput from LBL to UC Berkeley… dropped from 32 Kbps to 40 bps.” — Jacobson, 1988 (Jacobson 1988)
Jacobson framed TCP congestion control as a feedback control problem, drawing explicitly on control theory. His core principle: conservation of packets — a new packet should enter the network only when an old one leaves. Three violations of this principle cause collapse: (1) the connection starts too fast, (2) bad timers inject packets prematurely, (3) resource limits change mid-connection.
Applied closed-loop reasoning by creating TCP’s first real feedback loop:
- Slow start: cwnd (congestion window) begins at 1 MSS (Maximum Segment Size) and doubles each RTT (exponential growth to equilibrium). The sender injects data at no more than twice the network’s current delivery rate.
- Congestion avoidance (AIMD): After reaching ssthresh, increase cwnd by 1 MSS per RTT. On loss, halve cwnd. Additive increase probes; multiplicative decrease retreats aggressively.
- RTT estimator: SRTT and RTTVAR with exponentially weighted moving averages:
RTTVAR = (1 - beta) * RTTVAR + beta * |SRTT - RTT_sample|
SRTT = (1 - alpha) * SRTT + alpha * RTT_sample
RTO = SRTT + 4 * RTTVAR
RFC 6298 (Paxson et al. 2011) uses alpha = 1/8 and beta = 1/4. The smoothed RTT weights history 7/8 and the new sample 1/8 — gradual adaptation without oscillation. The multiplier 4 on RTTVAR covers the 95th percentile of RTT variation, chosen specifically to prevent spurious timeouts during slow start, when RTT can double in one round.
- Fast retransmit: Three duplicate ACKs trigger retransmission without waiting for timeout — a faster path to loss recovery. The sender infers loss (rather than delay) because the receiver has received three subsequent packets and is signaling a gap.
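The estimator and Karn’s rule fit in a few lines. Below is a minimal sketch, assuming RTT values in seconds; the class and method names are illustrative, not from any implementation:

```python
# Minimal sketch of the SRTT/RTTVAR estimator plus Karn's rule, assuming
# RTT values in seconds. Class and method names are illustrative.
class RtoEstimator:
    ALPHA = 1 / 8   # gain for SRTT (RFC 6298)
    BETA = 1 / 4    # gain for RTTVAR
    K = 4           # RTTVAR multiplier in the RTO formula

    def __init__(self):
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # conservative initial RTO

    def on_rtt_sample(self, sample, retransmitted=False):
        if retransmitted:
            return  # Karn's rule: ambiguous samples are discarded
        if self.srtt is None:
            # First sample: SRTT = R, RTTVAR = R/2 (RFC 6298)
            self.srtt, self.rttvar = sample, sample / 2
        else:
            # RTTVAR is updated first, so it uses the previous SRTT
            self.rttvar = (1 - self.BETA) * self.rttvar \
                + self.BETA * abs(self.srtt - sample)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample
        self.rto = self.srtt + self.K * self.rttvar

    def on_timeout(self):
        self.rto *= 2  # Karn/Partridge exponential backoff
```

A first sample of 100 ms yields SRTT = 100 ms, RTTVAR = 50 ms, RTO = 300 ms; steady samples then shrink RTTVAR, pulling RTO down toward SRTT.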
Concrete example: Starting cwnd = 1 MSS (1,500 bytes), RTT = 100 ms. Slow start: cwnd = 1, 2, 4, 8, 16, 32 (5 RTTs = 500 ms). If loss occurs at cwnd = 32, ssthresh = 16, cwnd resets to 1. Congestion avoidance from cwnd = 16 to 32 takes 16 RTTs = 1.6 seconds. Total recovery: 2.1 seconds. This was fast enough for 56 kbps links where BDP (Bandwidth-Delay Product) = 700 bytes. It would prove inadequate for 10 Gbps links where BDP = 125 MB.
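The RTT counts in this example can be checked directly; the helper names below are illustrative, with cwnd measured in MSS:

```python
import math

# Checks the RTT arithmetic of the example above; helper names are
# illustrative, with cwnd measured in MSS.
def slow_start_rtts(cwnd_start, cwnd_target):
    """RTTs of per-RTT doubling to grow cwnd_start to cwnd_target."""
    return math.ceil(math.log2(cwnd_target / cwnd_start))

def congestion_avoidance_rtts(cwnd_start, cwnd_target):
    """RTTs of additive increase at +1 MSS per RTT."""
    return cwnd_target - cwnd_start

print(slow_start_rtts(1, 32))             # 5 RTTs = 500 ms at 100 ms RTT
print(congestion_avoidance_rtts(16, 32))  # 16 RTTs = 1.6 s at 100 ms RTT
```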
Applied disaggregation by separating per-flow state (each sender tracks its own cwnd, SRTT, RTTVAR) from network state (routers remain stateless).
Decision placement remained fully distributed: every sender makes rate decisions independently, using only local ACK signals. No central authority allocates bandwidth. Convergence emerges from the interaction of independent AIMD loops.
Chiu and Jain proved mathematically that AIMD is the only linear control law that converges to fairness from any starting point (Chiu and Jain 1989). The proof is elegant: additive increase moves all flows toward the efficiency line (total rate = capacity); multiplicative decrease moves all flows toward the fairness line (equal rates). The combination oscillates around the intersection — the fair, efficient operating point.
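A toy simulation of these dynamics shows the mechanism. The sketch below assumes two flows with synchronized loss and arbitrary rate units; additive increase preserves the gap between flows while each multiplicative decrease halves it:

```python
# Toy Chiu-Jain dynamics: two AIMD flows share one bottleneck, experience
# synchronized loss when demand exceeds capacity, and converge toward
# equal rates. Units are arbitrary; parameters are illustrative.
def aimd_two_flows(x1, x2, capacity, add=1.0, mult=0.5, steps=200):
    for _ in range(steps):
        if x1 + x2 > capacity:      # congestion: multiplicative decrease
            x1, x2 = x1 * mult, x2 * mult
        else:                       # additive increase for both flows
            x1, x2 = x1 + add, x2 + add
    return x1, x2

x1, x2 = aimd_two_flows(1.0, 80.0, capacity=100.0)
print(abs(x1 - x2))  # far below the initial gap of 79: near-fair shares
```

Starting from a severely unfair split (1 vs 80), each congestion event halves the difference between the flows, so both converge toward the fair share while the sum oscillates around capacity.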
The beauty of Jacobson’s synthesis: four independent diagnoses (Nagle on State, Zhang on Time, Karn on measurement integrity, Jain on Coordination) were unified into a single deployable system. The interface remained unchanged — no router modifications, no new header fields, no coordination between senders. This is why RFC 1122 (Braden 1989) could mandate the algorithms for all Internet hosts within one year: the fix operated entirely within the existing protocol boundary.
9.3.3 Invariant Analysis: TCP + Jacobson (1988)
| Invariant | TCP Answer After Jacobson (1988) | Gap? |
|---|---|---|
| State | cwnd (capacity belief) + SRTT/RTTVAR (delay belief) | Loss is binary and delayed; no congestion gradient |
| Time | EWMA estimator: RTO = SRTT + 4*RTTVAR | Adapts slowly to sudden RTT changes (8 samples to converge) |
| Coordination | AIMD: emergent fairness without communication | RTT bias: low-RTT flows get more bandwidth |
| Interface | Unchanged — deployable via host update only | No router participation; routers remain passive |
The State gap narrowed dramatically while remaining open. Loss is now the congestion signal, but loss is binary (happened or didn’t) and delayed (detected after one RTT at best, after a timeout at worst). The sender is blind to congestion severity — a single dropped packet and a completely saturated link produce the same signal.
9.3.4 Environment, Measurement, Belief After the Fix
| Layer | What TCP Has | What’s Missing |
|---|---|---|
| Environment | True bottleneck rate, queue depth, competing flows | Still opaque to sender |
| Measurement | ACK timing (RTT samples) + loss events (timeout or 3 dup-ACKs) | Direct capacity signal absent; loss is noisy |
| Belief | cwnd (capacity proxy), SRTT (delay estimate) | cwnd is a proxy, an estimate rather than a measurement of true capacity |
The E-M gap is accidentally noisy: the ACK signal is honest but imperfect. Jacobson’s estimator extracts what information exists, but the measurement channel is fundamentally limited to binary loss events and timing. Better estimators can improve accuracy, but they remain bounded by the information the signal carries.
9.3.5 “The Gaps Didn’t Matter… Yet”
Jacobson’s algorithms were mandated for all Internet hosts within one year (RFC 1122; Braden 1989). They worked extraordinarily well for the 1988 Internet: moderate link speeds, small buffers, homogeneous RTTs. The loss signal was tightly coupled to congestion because buffers were small — when congestion occurred, packets were dropped quickly, and the sender learned within a few RTTs.
Jacobson knew the endpoint-only solution was incomplete:
“There are two network problems, congestion avoidance and congestion recovery… This paper describes the approach to the first problem… The second problem requires router changes.” — Jacobson, 1988 (Jacobson 1988)
The “gateway fix” was RED (Floyd and Jacobson 1993) — five years later. Jacobson shipped what was deployable first. This is the meta-constraint of incremental deployability: host changes can be mandated via RFC; router changes require voluntary upgrades across administratively independent networks.
9.3.6 The AIMD Closed Loop in Steady State
The sawtooth pattern that emerges from AIMD is the system reaching equilibrium. The sender increases cwnd linearly (1 MSS per RTT) until loss occurs, then halves cwnd instantly. The cycle repeats.
The loop delay is one RTT. This is consequential: a flow with 100 ms RTT takes at least 100 ms to react to congestion. During those 100 ms, if the sender was transmitting near bottleneck rate, congestion has deepened. By contrast, a datacenter flow with 1 ms RTT reacts 100x faster. This RTT coupling is ineliminable without changing the time model.
The loop is stabilizing. After loss, cwnd halves, reducing sending rate by 50%. The sender then probes slowly upward. Over many cycles, the system converges to a rate where loss probability is low but not zero. The equilibrium loss probability p satisfies a balance: the rate of additive increase (1 MSS/RTT) equals the rate of multiplicative decrease (cwnd/2 per loss event). For N flows sharing capacity C, the equilibrium cwnd per flow satisfies N * cwnd = C, and loss occurs when any flow overshoots.
But the loop has failure modes visible at scale:
- RTT-biased fairness: A flow with a 1 ms RTT gets 1,000 increase opportunities per second; a flow with a 100 ms RTT gets only 10. At equilibrium, low-RTT flows consume disproportionate bandwidth. This follows directly from the mathematics of AIMD — it is structural.
- TCP incast: When many senders share a bottleneck and all transmit simultaneously (partition-aggregate workloads), all flows experience synchronized loss. All halve cwnd. When the buffer drains, all increase in lockstep. Utilization collapses to 10%.
- Slow convergence: On a link with equilibrium cwnd = 83,000 MSS (10 Gbps, 100 ms RTT), recovering from a halving takes 41,500 RTTs = 69 minutes at 1 MSS per RTT.
9.4 Act 3: “It’s 2008. The Pipe Got Wider.”
9.4.1 Which Invariant Broke?
Link speeds scaled from 10 Mbps to 10 Gbps. Propagation delays remained governed by the speed of light. The bandwidth-delay product (BDP) — the amount of data that fits “in the pipe” — grew by three orders of magnitude.
| Scenario | Bandwidth | RTT | BDP | Time to Fill (AIMD) |
|---|---|---|---|---|
| 1988 ARPANET | 56 kbps | 100 ms | 700 bytes | < 1 RTT |
| 2000 WAN | 100 Mbps | 50 ms | 625 KB | ~400 RTTs (20 s) |
| 2008 fiber | 10 Gbps | 100 ms | 125 MB | ~83,000 RTTs (~2.3 hours) |
| Datacenter | 100 Gbps | 0.5 ms | 6.25 MB | ~4,200 RTTs (2.1 s) |
Reno’s linear increase (1 MSS per RTT) was too slow for high-BDP pipes. After a loss event on a 10 Gbps WAN path, cwnd drops to half. Recovering to full utilization at 1 MSS per RTT takes roughly 41,500 RTTs — over an hour, during which the link runs well below capacity.
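The table’s entries follow from two lines of arithmetic. A sketch, assuming a 1,500-byte MSS (helper names are illustrative):

```python
# Sketch of the arithmetic behind the BDP table, assuming a 1,500-byte
# MSS. Helper names are illustrative.
def bdp_bytes(bandwidth_bps, rtt_s):
    """Bandwidth-delay product: bytes 'in the pipe'."""
    return bandwidth_bps * rtt_s / 8

def additive_increase_rtts(bytes_to_gain, mss=1500):
    """RTTs needed at +1 MSS per RTT to grow cwnd by bytes_to_gain."""
    return bytes_to_gain / mss

bdp = bdp_bytes(10e9, 0.100)                   # 125 MB for 10 Gbps, 100 ms
print(round(additive_increase_rtts(bdp)))      # ~83,000 RTTs to gain a full BDP
print(round(additive_increase_rtts(bdp / 2)))  # ~41,500 RTTs after a halving
```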
9.4.2 Ha, Rhee, and Xu’s Redesign: CUBIC (2008)
CUBIC replaced Reno’s linear congestion avoidance with a cubic function of elapsed time since the last loss event (Ha et al. 2008):
cwnd(t) = C * (t - K)^3 + W_max
where t is wall-clock time since the last loss, W_max is the pre-loss window, K is the time to reach W_max under cubic growth, and C is a scaling constant.
Applied closed-loop reasoning with a structurally different growth curve:
- Near W_max: Growth is concave (slowing down) — cautious near the previous congestion point.
- Beyond W_max: Growth is convex (speeding up) — aggressive probing for new capacity.
- Decrease: 0.7x (less aggressive than Reno’s 0.5x), preserving more throughput after loss.
The critical innovation: growth rate is a function of wall-clock time, not RTT. A 10 ms flow and a 100 ms flow increase cwnd at the same rate in real time. This partially fixes the RTT bias that plagued Reno, where a flow’s growth rate scales inversely with its RTT — a 1 ms flow grows its window 100x faster in real time than a 100 ms flow. CUBIC decouples growth from ACK frequency.
Concrete example: W_max = 83,000 MSS (10 Gbps link, 100 ms RTT, 1,500-byte packets). After loss, cwnd drops to 58,100 (0.7x). Under Reno, recovering 24,900 MSS at 1 per RTT takes 24,900 RTTs = 41.5 minutes. Under CUBIC, the cubic function is tuned so recovery takes ~400 RTTs = 40 seconds — a 60x improvement.
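The window curve can be written down directly from the equation above. A sketch, assuming the standard scaling constant C = 0.4 (cwnd in MSS, time in seconds, as in RFC 8312) and CUBIC’s 0.7x decrease:

```python
# Sketch of the CUBIC window curve, assuming the standard scaling
# constant C = 0.4 (cwnd in MSS, time in seconds) and a 0.7x decrease.
def cubic_cwnd(t, w_max, c=0.4, beta=0.7):
    """cwnd(t) = C*(t - K)^3 + W_max, K chosen so cwnd(0) = beta * W_max."""
    k = (w_max * (1 - beta) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

w_max = 83_000  # MSS: the 10 Gbps, 100 ms example above
k = (w_max * 0.3 / 0.4) ** (1 / 3)
print(round(k, 1))                   # ~39.6 s to regrow to W_max (~400 RTTs)
print(round(cubic_cwnd(0, w_max)))   # 58100 MSS = 0.7 * W_max right after loss
```

At t = 0 the curve restarts below W_max and grows concavely toward it; past t = K it turns convex, probing aggressively for capacity beyond the old equilibrium.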
CUBIC also introduced a TCP-friendly mode: when the cubic function would grow slower than standard AIMD, CUBIC switches to AIMD-compatible growth. This ensures backward compatibility — CUBIC always matches or exceeds Reno’s bandwidth in the same conditions.
9.4.3 Before/After: Reno to CUBIC
| What Changed | Reno (1990) | CUBIC (2008) |
|---|---|---|
| Growth curve | Linear: +1 MSS/RTT | Cubic: C*(t-K)^3 + W_max |
| Growth rate depends on | RTT (ACK frequency) | Wall-clock time (RTT-independent) |
| Decrease factor | 0.5x | 0.7x |
| Recovery time (10 Gbps, 100 ms) | ~41,500 RTTs | ~400 RTTs |
| RTT fairness | Severe bias: 100x RTT difference = 100x bandwidth difference | Partially fixed: growth rate independent of RTT |
| Congestion signal | Loss | Loss (unchanged) |
The progression from Reno to CUBIC illustrates a recurring pattern: the controller (how the sender responds) improved dramatically, but the sensor (what the sender observes) remained the same. CUBIC is a better feedback controller operating on the same noisy, binary measurement signal. When the signal itself fails — as it does with bufferbloat — the controller’s sophistication is irrelevant.
9.4.4 Environment, Measurement, Belief: CUBIC
| Layer | What CUBIC Has | What’s Missing |
|---|---|---|
| Environment | True bottleneck rate, competing traffic, queue depth | Still opaque to sender |
| Measurement | Loss events + ACK timing | Direct capacity measurement absent; loss is binary |
| Belief | cwnd via cubic function; W_max as memory of previous equilibrium | Still interprets loss as sole congestion signal |
The E-M gap quality is unchanged from Jacobson: accidentally noisy. CUBIC improves the controller (cubic growth curve) but uses the same sensor (loss events). When the sensor fails — as it does with deep buffers — the controller’s sophistication is irrelevant.
9.4.5 “The Gaps Didn’t Matter… Yet”
CUBIC became the Linux default in 2006 (kernel 2.6.19) and remains the dominant WAN congestion control algorithm (Ha et al. 2008). It solved the high-BDP scaling problem elegantly. But CUBIC retained the fundamental assumption inherited from Jacobson: loss is the signal for congestion.
This assumption held as long as buffers were small enough that loss and congestion were tightly coupled. When deep buffers appeared — in home routers, cable modems, and carrier-grade equipment — the loss signal decoupled from congestion. Buffers absorbed thousands of packets without dropping. The sender saw ACKs arriving on time, believed the network was healthy, and kept increasing cwnd. RTT inflated from 5 ms to 500 ms before any packet was lost. This is bufferbloat: the belief (cwnd) diverged from reality (the queue was full) because the measurement signal (loss) was delayed by the buffer.
The bufferbloat problem revealed a deeper insight about the E-M-B decomposition: CUBIC improved the controller (how the sender responds to loss) but used the same sensor (binary loss events). When the sensor fails — when the signal decouples from the underlying reality — no amount of controller sophistication can compensate. The fix required either a better sensor (DCTCP’s ECN (Explicit Congestion Notification) marking) or a different measurement strategy entirely (BBR’s model-based probing).
9.5 Act 4: “It’s 2017. Loss Is the Wrong Signal.”
9.5.1 Which Invariant Broke?
The State invariant’s measurement channel failed. Loss-based CC assumes loss indicates congestion. In networks with deep buffers, loss occurs only after the buffer is completely full — potentially hiding congestion for thousands of RTTs. The sender’s belief (cwnd = safe capacity) diverges from reality (the path is saturated, latency has exploded).
A concrete example: a home broadband link with 100 Mbps downstream and a 10 MB buffer at the cable modem. A saturated flow hides congestion for 10 MB / (100 Mbps / 8) = 0.8 seconds. During those 800 ms, the sender increases cwnd, the buffer fills, and latency climbs from 5 ms to 800 ms. A video call that needs < 100 ms RTT becomes unusable. A web page that would load in 200 ms now takes 2 seconds. Every interactive application on the same access link suffers, because the buffer is shared.
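The arithmetic of this example is worth making explicit: a full buffer both delays the loss signal and inflates the RTT by its own drain time. A sketch (helper name illustrative):

```python
# Arithmetic of the bufferbloat example: a full buffer hides loss, and
# inflates RTT, for as long as it takes to drain at the link rate.
def buffer_drain_seconds(buffer_bytes, link_bps):
    """Time to transmit a full buffer at link rate."""
    return buffer_bytes * 8 / link_bps

print(buffer_drain_seconds(10e6, 100e6))  # 0.8 s: loss hidden for ~800 ms
```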
The diagnosis through the E-M-B framework is precise: the Environment changed (deep buffers deployed for throughput), the Measurement signal (loss) decoupled from congestion (buffers absorb packets instead of dropping), and the Belief (cwnd = safe capacity) diverged from reality (the path is saturated). The sender’s estimator is functioning correctly — it just has the wrong sensor.
9.5.2 Cardwell’s Redesign: BBR (2017)
BBR (Bottleneck Bandwidth and Round-trip propagation time) replaced the cwnd belief system entirely (Cardwell et al. 2017). Instead of a proxy for capacity (cwnd), BBR maintains an explicit network model:
- BtlBw: Bottleneck bandwidth — the maximum delivery rate observed over the last 10 RTTs.
- RTprop: Round-trip propagation delay — the minimum RTT observed over the last 10 seconds.
- Target: Send at rate = BtlBw, keeping in-flight bytes at BtlBw * RTprop (the BDP).
Applied closed-loop reasoning with a fundamentally different measurement interpretation: loss is treated as incidental information rather than as the congestion signal. BBR increases its rate periodically, observes the resulting delivery rate, and updates its model.
BBR operates in four phases:
- STARTUP: Exponential increase until BtlBw estimate stops growing for 3 RTTs. Goal: find bottleneck bandwidth quickly.
- DRAIN: Reduce in-flight packets to the BDP. Goal: drain queues built during startup.
- PROBE_BW: Every 10 RTTs, increase rate by 25% to probe for more bandwidth. If delivery rate drops, reduce back.
- PROBE_RTT: Every 10 seconds, reduce in-flight to 4 MSS for 200 ms. Goal: refresh RTprop estimate with minimal queueing.
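One way to sketch BBR’s model state is a pair of windowed max/min filters over timestamped samples. This is an illustrative simplification under assumed window sizes, not BBR’s actual implementation:

```python
from collections import deque

# Illustrative simplification of BBR's model state: windowed max/min
# filters over timestamped samples. Window sizes and names are
# assumptions, not BBR's actual implementation.
class WindowedFilter:
    def __init__(self, window_s, keep_max):
        self.window_s = window_s
        self.keep_max = keep_max
        self.samples = deque()  # (timestamp, value) pairs

    def update(self, now, value):
        self.samples.append((now, value))
        # Drop samples that have aged out of the window
        while now - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        pick = max if self.keep_max else min
        return pick(v for _, v in self.samples)

btlbw = WindowedFilter(window_s=1.0, keep_max=True)    # ~10 RTTs at 100 ms
rtprop = WindowedFilter(window_s=10.0, keep_max=False)

bw = btlbw.update(now=0.5, value=9.8e9)   # delivery-rate sample, bits/s
rt = rtprop.update(now=0.5, value=0.102)  # RTT sample, seconds
print(bw * rt / 8)                         # in-flight target: BtlBw * RTprop bytes
```

The max filter tolerates delivery-rate samples that dip below the bottleneck rate; the min filter tolerates RTT samples inflated by queuing. Both go stale when the path changes, which is exactly the Time gap noted below.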
Concrete comparison: On a 10 Gbps link with 100 ms RTT and a 10 MB buffer:
- CUBIC: Fills the buffer before detecting loss. Queue grows to 10 MB, adding 8 ms of queuing delay. RTT inflates from 100 ms to 108 ms. After loss, cwnd drops to 0.7x and slowly recovers. Average queue depth: 5 MB. Average queuing delay: 4 ms.
- BBR: Measures BtlBw = 10 Gbps and RTprop = 100 ms, and keeps BtlBw * RTprop = 125 MB in flight. Queue target: zero. Average queuing delay: < 1 ms.
The difference is structural: CUBIC fills the pipe and the buffer before backing off. BBR fills the pipe and drains the buffer by design. For latency-sensitive applications (video conferencing, interactive web), this difference is transformative.
9.5.3 Invariant Analysis: BBR (2017)
| Invariant | BBR Answer (2017) | Gap? |
|---|---|---|
| State | (BtlBw, RTprop) — explicit model, not proxy | Model assumes constant bottleneck; wrong in multi-tenant networks |
| Time | Phase-based probing on scheduled intervals | RTprop stale for up to 10 seconds after routing changes |
| Coordination | Model-based; probes past loss | Coexistence failure: BBR starves CUBIC flows (takes 95% capacity) |
| Interface | Unchanged (ACK-based) | Interprets ACKs differently but uses same signal |
The Coordination gap is BBR’s most consequential failure. When BBR and CUBIC share a bottleneck, they interpret loss differently. BBR probes past loss — CUBIC interprets that loss and halves cwnd. BBR fills the capacity that CUBIC vacates. Over time, BBR can consume 95% of bottleneck bandwidth while CUBIC gets 5%. BBRv2 and BBRv3 add loss-aware fairness constraints, but the fundamental tension persists: a protocol that probes past loss will dominate a protocol that fears loss.
The model assumptions create additional gaps. BBR assumes bottleneck bandwidth is roughly constant — wrong in multi-tenant environments where neighbors’ traffic changes. And RTprop can only be observed when queues are empty: in a datacenter with persistent background traffic, every sample includes queuing delay, so the RTprop estimate systematically overestimates the true propagation delay.
9.5.4 Environment, Measurement, Belief
| Layer | What BBR Has | What’s Missing |
|---|---|---|
| Environment | True bottleneck rate, queue depth | Still opaque; measured indirectly |
| Measurement | Delivery rate (from ACK spacing) + min RTT | Delivery rate noisy; RTprop estimate goes stale between probes |
| Belief | (BtlBw, RTprop) = explicit model | BtlBw assumes stationarity; multi-tenant traffic violates this |
The E-M gap quality changed. Jacobson’s gap was accidentally noisy (honest but imperfect ACK signal). BBR’s gap is structurally filtered: the delivery rate measurement reflects not just the bottleneck but also competing traffic, receiver-side buffering, and link-layer retransmissions. BBR smooths this noise with windowed max/min filters, but some information is permanently lost.
9.5.5 “The Gaps Didn’t Matter… Yet”
Despite the fairness concerns, BBR achieved lower latency and higher throughput than CUBIC on Google’s global network. For a single operator controlling both endpoints (Google servers to user browsers via QUIC), the coexistence problem is muted — Google’s traffic competes primarily with itself. BBR is deployed at scale in Google infrastructure: YouTube, Cloud, and all QUIC connections.
The gap that matters is deployment scope. BBR works well when one operator controls the sender (and can choose the CC algorithm). It works poorly on shared infrastructure where heterogeneous CC algorithms compete. As BBR adoption grows beyond Google, the fairness problem becomes increasingly visible. BBRv3 (2023) addresses this with explicit loss-rate monitoring and proportional backoff — but the tension between model-based probing and loss-based conservation remains fundamental.
9.6 Act 5: “It’s 2010. The Datacenter Changes Everything.”
9.6.1 Which Invariant Broke?
Datacenters introduced a qualitatively different environment: sub-millisecond RTTs, single administrative authority, commodity switches with shallow buffers, and workloads mixing latency-sensitive “mice” flows (short queries) with throughput-sensitive “elephant” flows (bulk transfers).
The critical failure was incast collapse: in partition-aggregate workloads (MapReduce, search), dozens or hundreds of workers respond simultaneously to a single aggregator. Their synchronized traffic overwhelms the switch buffer. All flows experience loss simultaneously. All halve cwnd. When the buffer drains, all increase together — synchronized oscillation with utilization collapsing to 10%.
Standard TCP (Reno or CUBIC) in the datacenter produces two failures: (1) elephant flows fill buffers, inflating latency for mice flows, and (2) incast causes synchronized loss and throughput collapse. Both failures trace to the same State gap: loss is too coarse a signal for sub-millisecond environments.
Concrete incast example: A MapReduce job with 100 workers, each holding 1 MB of results. The aggregator sends a request; all 100 workers respond simultaneously over 10 Gbps links to a top-of-rack switch with a 1 MB buffer. Total arriving data: 100 MB against 1 MB of buffer capacity. The buffer overflows within microseconds, and most of the 99 MB excess is dropped. All 100 flows detect loss and halve cwnd. When retransmissions begin, they are again synchronized. Throughput oscillates between near-zero and momentary bursts. Effective utilization: 10-20% of link capacity.
9.6.2 Alizadeh’s Redesign: DCTCP (2010)
DCTCP (Data Center TCP) extracted multi-bit feedback from the single-bit ECN field (Alizadeh et al. 2010; Ramakrishnan et al. 2001):
- Switches mark the Congestion Experienced (CE) bit when queue depth exceeds threshold K.
- The sender tracks alpha = EWMA of the fraction of marked packets in each window.
- Decrease: cwnd = cwnd * (1 - alpha/2).
- If 10% of packets are marked: 5% decrease. If 90% are marked: 45% decrease.
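This update rule fits in a few lines. A minimal sketch, assuming the paper’s default EWMA gain g = 1/16 (class and method names are illustrative):

```python
# Minimal sketch of DCTCP's sender-side update, assuming the paper's
# default EWMA gain g = 1/16. Class and method names are illustrative.
class DctcpSender:
    def __init__(self, cwnd, g=1 / 16):
        self.cwnd = cwnd      # congestion window, MSS
        self.alpha = 0.0      # running estimate of marked fraction
        self.g = g

    def on_window_acked(self, marked, total):
        f = marked / total    # fraction of CE-marked packets this window
        self.alpha = (1 - self.g) * self.alpha + self.g * f
        if marked:            # proportional decrease, not a blind halving
            self.cwnd *= 1 - self.alpha / 2

s = DctcpSender(cwnd=100.0)
s.alpha = 0.10                 # steady state with 10% marking
s.on_window_acked(marked=10, total=100)
print(s.cwnd)                  # 95.0: a 10% marking rate costs 5% of cwnd
```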
Applied closed-loop reasoning with a richer measurement signal: instead of binary loss (all-or-nothing), DCTCP receives a continuous congestion gradient. The marking rate is proportional to queue depth, so the sender learns not just “congested or not” but “how congested.”
Applied disaggregation by separating the measurement signal (ECN from switches) from the control response (proportional decrease at endpoints). Switches provide signals; endpoints interpret them. This preserves endpoint autonomy while adding router participation — the “gateway fix” that Jacobson called for in 1988.
Decision placement shifted subtly: switches now actively participate in congestion signaling (marking packets), but endpoints retain all rate-control authority. This is the first change to the measurement interface since Jacobson 1988.
9.6.3 Invariant Analysis: DCTCP (2010)
| Invariant | DCTCP Answer (2010) | Gap? |
|---|---|---|
| State | cwnd + alpha (fraction marked). Continuous signal | Requires ECN-capable switches; single-admin only |
| Time | Same RTT estimator. ECN marks arrive within 1 RTT | Marking threshold K must be tuned per deployment |
| Coordination | Proportional decrease (gentler than AIMD halving) | Cannot coexist with standard TCP (too gentle, gets starved) |
| Interface | ECN from switches — new measurement signal | Requires coordinated deployment across all switches |
The deployment constraint is DCTCP’s binding limitation: it requires every switch on the path to support ECN marking, and sharing links with standard TCP leads to starvation. This restricts DCTCP to single-admin environments — datacenters where one operator controls all switches and can mandate the protocol.
Concrete example: A datacenter switch with 10 Gbps links and 1,500-byte packets. Marking threshold K = 20 packets. Queue sojourn time at K: 20 * 1,500 * 8 bits / 10 Gbps = 24 microseconds. In steady state, queue depth oscillates around K, so average queuing delay stays in the tens of microseconds. Compare with CUBIC on the same switch: the queue grows until packets are dropped, potentially filling a 10 MB buffer and causing 8 ms of queuing delay, roughly 300x worse.
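The queue-delay arithmetic is simple enough to verify in code: delay is just queued bits divided by line rate.

```python
def queue_delay_seconds(queued_bytes, link_bps):
    # Time to drain `queued_bytes` through a link running at `link_bps`.
    return queued_bytes * 8 / link_bps

# DCTCP: queue oscillates around K = 20 packets of 1,500 bytes, 10 Gbps link
dctcp = queue_delay_seconds(20 * 1500, 10e9)   # 2.4e-05 s = 24 microseconds

# Loss-based CC on the same switch: queue grows to fill a 10 MB buffer
cubic = queue_delay_seconds(10e6, 10e9)        # 8e-03 s = 8 ms

print(f"DCTCP: {dctcp * 1e6:.0f} us, CUBIC: {cubic * 1e3:.0f} ms, "
      f"ratio: {cubic / dctcp:.0f}x")
```

The two-to-three orders of magnitude between the steady-state queues is the whole argument for marking at a shallow threshold instead of waiting for drops.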
9.6.4 Environment, Measurement, Belief: DCTCP
| Layer | What DCTCP Has | What’s Missing |
|---|---|---|
| Environment | True queue depth at each switch | Sender sees only aggregate marking rate |
| Measurement | Fraction of ECN-marked packets per window | Continuous signal, but marking threshold K is static |
| Belief | alpha = EWMA of marked fraction | Proportional to congestion severity; much richer than binary loss |
The E-M gap quality fundamentally changed. DCTCP’s measurement signal is physically richer than loss: ECN marks arrive before packets are dropped, and their frequency encodes congestion severity. This is not a better estimator for the same signal (as CUBIC was for Reno) — it is a different sensor entirely. The diagnostic: DCTCP moved from accidentally noisy measurement (loss) to designed measurement (ECN marking). The cost: the new sensor requires coordinated deployment.
9.6.5 Before/After: Standard TCP to DCTCP
| What Changed | Standard TCP (CUBIC) | DCTCP (2010) |
|---|---|---|
| Congestion signal | Loss (binary) | ECN marks (proportional) |
| Decrease rule | Multiplicative (0.5x Reno, 0.7x CUBIC) | cwnd * (1 - alpha/2), proportional |
| Queue target | Buffer fills before reaction | Queue stays at threshold K |
| Incast handling | Synchronized collapse | Proportional marking prevents overflow |
| Deployment scope | Global Internet | Single-admin datacenter only |
9.6.6 “The Gaps Didn’t Matter… Yet”
DCTCP’s single-admin constraint was tolerable because the target environment — hyperscale datacenters — is precisely where single-admin deployment is feasible. Google, Microsoft, and Amazon all deployed DCTCP or DCTCP-inspired protocols within their datacenters. The constraint excluded the wide-area Internet, but that was acceptable because the datacenter workloads (partition-aggregate, storage replication) that benefit most from DCTCP exist only within single-admin domains.
The gap that would matter next was different: even inside the datacenter, DCTCP’s measurement signal (ECN) provides only queue-depth information. It is silent on link utilization, the number of competing flows, and the arrival rate of new traffic. Later datacenter protocols added richer telemetry: HPCC (Li et al. 2019) uses INT (In-band Network Telemetry) headers carrying precise queue depth, link utilization, and timestamp information per hop, while Swift (Kumar et al. 2020) uses precise hardware-timestamped delay measurements. This is the E-M-B progression: from binary loss (TCP) to proportional marking (DCTCP) to precise per-hop and delay telemetry (HPCC/Swift).
9.7 Act 6: “It’s 2017. TCP Cannot Evolve.”
9.7.1 Which Invariant Broke?
The Interface invariant — TCP’s permanent binding constraint — became a liability. TCP is implemented in operating system kernels: deploying a new congestion control algorithm requires an OS update, which takes months to years. Worse, middleboxes (NATs, firewalls, intrusion detection systems) inspect TCP headers and block packets with unfamiliar options. Honda et al. measured that 35% of Internet paths drop packets carrying unknown TCP options (Honda et al. 2011) — meaning TCP extensions are effectively undeployable across much of the Internet.
This is protocol ossification: the loss of protocol flexibility caused by deployed infrastructure that enforces expectations about protocol behavior. TCP’s header format, option space, and behavior became frozen by the accumulated assumptions of millions of middleboxes.
The consequences were concrete. TCP Fast Open (RFC 7413, Cheng et al. 2014) enabled 0-RTT connection setup by caching a cryptographic cookie — a simple, backward-compatible extension. Deployment stalled because middleboxes dropped SYN packets containing unfamiliar options. Multipath TCP (MPTCP, RFC 6824, Ford et al. 2013) added subflow management to existing connections. Again, middleboxes interfered. The pattern: any TCP extension that modified the header format or option space was blocked by infrastructure outside the protocol designers’ control.
9.7.2 Langley’s Redesign: QUIC (2017)
QUIC escaped TCP’s ossification by building a new transport protocol over UDP (Langley et al. 2017):
- UDP as substrate: Middleboxes already pass UDP traffic. By wrapping transport logic in UDP, QUIC avoids TCP’s frozen header format.
- User-space implementation: QUIC runs in application libraries, not OS kernels. Updates deploy with application releases, not OS patches.
- Encrypted headers: QUIC encrypts all transport metadata (connection state, stream IDs, sequence numbers). Middleboxes see only opaque bytes, preventing future ossification.
- Stream multiplexing: Multiple independent streams within one connection. Loss on one stream leaves others unblocked — fixing TCP’s head-of-line blocking.
- 0-RTT connection setup: Combined cryptographic and transport handshake. Repeat connections can send data immediately, eliminating the latency cost of TCP’s three-way handshake plus TLS setup.
- Connection migration: Connections identified by Connection IDs, not 4-tuples. A mobile device switching from WiFi to cellular keeps its QUIC connection alive.
Applied disaggregation by separating transport logic from the OS kernel and separating per-stream reliability from per-connection ordering. TCP bundles reliability, ordering, and congestion control into one byte stream; QUIC disaggregates them. This disaggregation enables independent loss recovery per stream: if stream A loses a packet, stream B continues delivering data. In TCP with HTTP/2, a single lost packet blocks all multiplexed streams — even those with no data in the lost packet.
Decision placement for congestion control remained endpoint-local: QUIC runs the same CC algorithms (BBR, CUBIC) internally. The innovation lies in Interface, not CC — how transport is packaged and deployed. This is a critical distinction: QUIC changes where the code runs (user-space vs kernel) and what it encrypts (everything), but it preserves the fundamental CC architecture (endpoint-driven, ACK-based, distributed).
Applied closed-loop reasoning for connection setup: QUIC’s 0-RTT handshake combines cryptographic key exchange and transport parameter negotiation into a single round trip. For repeat connections, the client caches session parameters and can send data immediately — zero round trips of setup latency. This reduces page-load time by 1–2 RTTs compared to TCP + TLS 1.3, which is significant on high-RTT paths (50–100 ms WAN) and transformative on satellite links (300+ ms).
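A simplified latency model makes the handshake savings concrete. The setup-RTT counts below are assumptions of the sketch (TCP handshake plus TLS 1.3 handshake = 2 round trips before data; fresh QUIC = 1; resumed QUIC with 0-RTT data = 0), and transfer time is ignored.

```python
def time_to_first_byte_ms(rtt_ms, setup_rtts):
    # Setup round trips, plus one round trip for the request/response itself.
    return (setup_rtts + 1) * rtt_ms

rtt = 300  # ms, satellite-class path from the text
tcp_tls13 = time_to_first_byte_ms(rtt, 2)  # 900 ms
quic_1rtt = time_to_first_byte_ms(rtt, 1)  # 600 ms, fresh QUIC connection
quic_0rtt = time_to_first_byte_ms(rtt, 0)  # 300 ms, resumed connection
```

On a 300 ms path, 0-RTT resumption cuts time-to-first-byte by two-thirds, which is why the text calls the effect transformative on satellite links.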
9.7.3 Invariant Analysis: QUIC (2017)
| Invariant | QUIC Answer (2017) | Gap? |
|---|---|---|
| State | Per-stream state; Connection ID enables migration | CC algorithms inherited from TCP (same State gaps) |
| Time | 0-RTT/1-RTT handshake; same CC timing internally | Same RTT estimation challenges |
| Coordination | Same distributed CC as TCP | Same fairness problems (BBR vs CUBIC) |
| Interface | UDP wrapper; encrypted headers; stream multiplexing | Middleboxes that block UDP also block QUIC |
QUIC’s contribution is primarily to the Interface invariant. It demonstrates that when the binding constraint itself ossifies, the solution is to change the deployment layer — move the interface to a substrate that middleboxes cannot inspect.
9.7.4 Environment, Measurement, Belief: QUIC
| Layer | What QUIC Has | What’s Missing |
|---|---|---|
| Environment | Same network as TCP | Same invisibility |
| Measurement | Same ACK-based signals (per-stream) | Same noise, but per-stream loss isolation |
| Belief | Same CC models (BBR/CUBIC internally) | Same CC gaps; innovation is in deployment, not CC |
The E-M gap is identical to TCP’s for congestion control. QUIC’s innovation is orthogonal: it changes how transport is packaged and deployed, while congestion sensing and control remain unchanged. The lesson: when the Interface invariant ossifies, restructuring the other invariants (State, Time, Coordination) is insufficient — the interface itself must change.
Concrete deployment impact: By 2017, Google reported that 35% of its egress traffic used QUIC. HTTP/3 (RFC 9114 (Bishop 2022)) standardized on QUIC as its transport. IETF QUIC (RFC 9000 (Iyengar and Thomson 2021), 2021) diverged from Google’s original design but preserved the core architectural choices: UDP substrate, mandatory encryption, stream multiplexing, connection migration.
9.8 The Grand Arc: From Cerf/Kahn to QUIC
9.8.1 The Evolving Anchor
The binding constraint — the IP datagram interface — remained constant across all six acts. Every innovation operated within the same Interface and Coordination constraints. What changed was the State invariant (what the sender believes about the network) and the Time invariant (how timing decisions are made), restructured generation after generation as the environment evolved.
This constancy distinguishes transport from routing (where the binding constraint shifts across eras). TCP’s meta constraint — IP datagrams exist, and no entity has central authority — never changed. The dependency chain is stable: Interface (locked) forces Coordination (distributed), which forces State (inferred from local signals), which forces Time (estimated from ACK stream). Every TCP variant navigates within this chain. The variation is in the quality of the estimator and the richness of the signal.
| Era | Environment | State (Belief) | Time | What Broke |
|---|---|---|---|---|
| 1974 | Small ARPANET, < 200 hosts | Connection state only | Fixed timers | “Enough capacity” assumption |
| 1988 | 5,000+ hosts, congestion collapse | cwnd + SRTT/RTTVAR | EWMA estimator | Loss signal delayed by deep buffers |
| 2008 | 10 Gbps fiber, high-BDP | Cubic function of wall-clock time | RTT-independent growth | Loss still sole signal; bufferbloat |
| 2017 (BBR) | Deep buffers, bufferbloat | (BtlBw, RTprop) explicit model | Phase-based probing | Fairness with loss-based flows |
| 2010 (DCTCP) | Datacenter: 1 ms RTT, ECN | alpha (ECN fraction) | ECN faster than loss | Single-admin only; cannot coexist |
| 2017 (QUIC) | Ossified TCP; middlebox interference | Per-stream state; Connection ID | 0-RTT handshake | UDP blocking in some networks |
9.8.2 Three Design Principles Applied Across the Arc
Disaggregation was applied repeatedly, each time separating a new concern:
- Cerf/Kahn applied disaggregation by separating endpoint state (TCP) from network forwarding (IP). Routers are stateless; endpoints maintain all connection semantics.
- Jacobson applied disaggregation by separating per-flow state (cwnd, SRTT) from network state. Each sender tracks its own capacity belief independently.
- QUIC applied disaggregation by separating per-stream reliability from per-connection ordering, and separating transport logic from the OS kernel.
Closed-loop reasoning was the dominant design strategy, refined across every generation:
- Jacobson applied closed-loop reasoning with the AIMD feedback loop: measure loss, adjust cwnd, send, repeat. Loop period = RTT.
- BBR applied closed-loop reasoning with a model-based loop: measure delivery rate and min RTT, update (BtlBw, RTprop) model, set rate to match model.
- DCTCP applied closed-loop reasoning with a richer measurement signal: ECN marks provide a continuous congestion gradient instead of binary loss.
Decision placement remained consistently distributed across all generations. Every TCP variant keeps rate-control authority at endpoints. DCTCP added switch-side measurement (ECN marking) but preserved endpoint-side decision-making. This consistency follows from the binding constraint: no entity has global authority over Internet traffic, so rate decisions must be local.
The one context where decision placement shifted was the datacenter. Under single administrative authority, DCTCP coordinates switches (marking) and endpoints (proportional response). HPCC goes further: switches provide per-hop telemetry, and endpoints compute rate adjustments from the full path signal. This is possible only because the datacenter operator controls both switches and hosts — the Coordination invariant’s answer changes when the meta-constraint changes from “no central authority” to “single administrator.”
The transport lineage demonstrates a general pattern: when the binding constraint is constant (IP interface, distributed authority), innovation concentrates in the State and Time invariants. Designers restructure what the system believes and how it measures, because the interface and coordination model are frozen. The two exceptions — DCTCP (new measurement interface via ECN) and QUIC (new transport interface via UDP) — both required environments where the deployment barrier was lower: single-admin datacenters and user-space libraries, respectively.
9.8.3 The Dependency Chain
Interface (IP datagrams, locked) forces Coordination (distributed, no central authority), which forces State (capacity inferred from local signals), which forces Time (RTT estimated from the ACK stream). Every protocol in this chapter navigates within this chain; none escapes it.
9.8.4 Pioneer Diagnosis Table
| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 1974 | Cerf & Kahn | Interface | Heterogeneous networks need common protocol | TCP/IP: byte stream over datagrams |
| 1984 | Nagle | State | Senders flood network without congestion awareness | RFC 896: named congestion collapse |
| 1986 | Zhang | Time | RTT estimation too crude for reliable timers | Identified timer failure modes |
| 1987 | Karn & Partridge | State | Retransmit ambiguity corrupts RTT samples | Karn’s algorithm: discard ambiguous samples |
| 1988 | Jacobson | State + Time | Conservation of packets violated three ways | Slow start, AIMD, SRTT/RTTVAR, fast retransmit |
| 1989 | Chiu & Jain | Coordination | Distributed rate control needs convergence proof | AIMD convergence to fairness |
| 2008 | Ha, Rhee, Xu | State | Linear increase too slow for high-BDP | CUBIC: cubic growth, RTT-independent |
| 2010 | Alizadeh et al. | State | Loss too coarse for datacenter | DCTCP: proportional ECN decrease |
| 2017 | Cardwell et al. | State | Loss is the wrong signal (bufferbloat) | BBR: model-based CC (BtlBw, RTprop) |
| 2017 | Langley et al. | Interface | TCP ossified by middleboxes and kernels | QUIC: transport over UDP, encrypted |
9.8.5 The General Principle
The tighter the coupling between measurement and environment, the better the closed-loop performance. ECN is tighter than loss-based signaling. Loss is tighter than timeout. (BtlBw, RTprop) is tighter than cwnd. Per-hop INT telemetry is tighter than endpoint-only ACK signals. The cost of tighter coupling is always higher per-packet overhead and stricter deployment requirements. The design trade-off is eternal: tighter coupling enables higher utilization and lower latency, but at the cost of complexity and deployment friction.
9.8.6 Innovation Timeline
flowchart TD
subgraph sg1["Origins"]
A1["1974 — Cerf/Kahn: TCP/IP"]
A2["1981 — RFC 793"]
A1 --> A2
end
subgraph sg2["Diagnosis"]
B1["1984 — Nagle: RFC 896"]
B2["1986 — Zhang: timer critique"]
B3["1987 — Karn/Partridge: fix"]
B1 --> B2 --> B3
end
subgraph sg3["The Jacobson Era"]
C1["1988 — Jacobson: slow start + AIMD"]
C2["1989 — RFC 1122 mandate"]
C3["1990 — Reno fast recovery"]
C1 --> C2 --> C3
end
subgraph sg4["Refinement"]
D1["1992 — RFC 1323: window scaling"]
D2["1993 — Floyd: RED"]
D3["1996 — SACK (RFC 2018)"]
D1 --> D2 --> D3
end
subgraph sg5["Scaling"]
E1["2008 — Ha/Rhee/Xu: CUBIC"]
E2["2010 — Alizadeh: DCTCP"]
E1 --> E2
end
subgraph sg6["Model-Based"]
F1["2017 — Cardwell: BBR"]
F2["2017 — Langley: QUIC"]
F3["2021 — RFC 9000: IETF QUIC"]
F1 --> F2 --> F3
end
sg1 --> sg2 --> sg3 --> sg4 --> sg5 --> sg6
9.9 Generative Exercises
Mars rovers communicate with Earth at 20-minute one-way latency. RTT = 2,400 seconds.
Predict the closed-loop failure: Slow start from cwnd = 1 doubles each RTT, so reaching cwnd = 100 takes ceil(log_2(100)) = 7 RTTs, or 7 * 2,400 = ~16,800 seconds (~4.7 hours). AIMD adds 1 MSS per 2,400-second RTT. Loss recovery takes one RTO (~3,600 seconds = 1 hour). What happens to throughput for a file transfer?
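The ramp-up arithmetic in this prompt can be scripted, assuming textbook slow start (doubling from cwnd = 1) and linear AIMD growth afterward:

```python
import math

RTT = 2400  # seconds, Earth-Mars round trip

def slow_start_time(target_cwnd, rtt=RTT):
    # cwnd doubles every RTT: 1, 2, 4, ... until it reaches the target.
    return math.ceil(math.log2(target_cwnd)) * rtt

def aimd_time(cwnd_increase, rtt=RTT):
    # Congestion avoidance adds 1 MSS per RTT.
    return cwnd_increase * rtt

print(slow_start_time(100))  # 16800 s (~4.7 hours) just to reach cwnd = 100
print(aimd_time(50))         # 120000 s (~33 hours) to add 50 more MSS
```

The numbers make the diagnosis obvious before any invariant analysis: a feedback loop with a 40-minute period cannot track anything.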
Diagnose through invariants: Which invariant is under greatest pressure? (Time: the feedback loop is 2,400 seconds. The sender is blind for 40 minutes between sending and learning the result.) Can you keep the IP interface but change the Time model? What if you replace inferred timing with prescribed timing based on orbital mechanics?
Design a fix: Propose a preloading mechanism — before losing connectivity, precompute a cwnd schedule from ground-truth Earth-Mars link models. Upload the schedule to the rover. The rover increases cwnd on wall-clock time, not RTT. What invariant answers change? What new failure modes appear if the precomputed model is wrong? What if a solar event disrupts the link mid-schedule?
Alternative architecture: Propose a delay-tolerant transport protocol that does not use ACKs for reliability. Instead, use forward error correction (FEC) to encode redundancy at the sender. How does removing the ACK-based feedback loop change all four invariant answers? What is the cost in bandwidth efficiency?
A campus network carries traffic from three populations: students using CUBIC (browser default), a streaming service using BBR, and a research cluster using DCTCP with ECN. All share a single 10 Gbps backbone link with a 10 MB buffer.
Predict the interaction: BBR probes past loss. CUBIC halves cwnd on loss. DCTCP responds to ECN marks. Who dominates? Who starves?
Diagnose through the Coordination invariant: AIMD converges to fairness only when all flows use the same algorithm. When flows use different CC algorithms, what happens to the convergence proof? Can you define “fairness” when flows have different loss responses?
Design a solution: Propose a mechanism that achieves some notion of fairness across heterogeneous CC algorithms. Where would you place this mechanism — at endpoints, at switches, or at both? What are the deployment constraints?
L4S prediction: L4S (Low Latency, Low Loss, Scalable Throughput, RFC 9332 (De Schepper and Briscoe 2023)) proposes separating traffic into two queues: a “classic” queue for legacy TCP and an “L4S” queue for scalable CC algorithms (like DCTCP-style). Analyze this through the Coordination invariant: does separating queues solve the fairness problem, or does it create a new one? What happens when 90% of traffic is L4S and 10% is classic?
LEO satellite constellations (Starlink, Kuiper) introduce an environment with properties of both WAN and datacenter: variable RTTs, a single operator, and frequent path changes. A Starlink-like constellation sees RTTs of 20 ms when the satellite is overhead and 80 ms at low elevation angles, with handoffs every 15 seconds as satellites move; the path changes completely every 90 seconds.
Predict the measurement failure: SRTT with alpha = 1/8 tracks RTT changes slowly. How many samples to converge after a 20 ms to 80 ms jump? What happens to RTO during the transition?
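A quick simulation scaffold for this prompt (the "within 10% of the new RTT" convergence threshold is an arbitrary choice of this sketch, not part of the exercise):

```python
def samples_to_converge(old_rtt, new_rtt, alpha=1/8, tolerance=0.1):
    # Count RTT samples until SRTT is within `tolerance` of the new RTT.
    srtt, samples = old_rtt, 0
    while abs(new_rtt - srtt) > tolerance * new_rtt:
        srtt += alpha * (new_rtt - srtt)  # Jacobson's EWMA update
        samples += 1
    return samples

# 20 ms -> 80 ms jump: the error decays as (7/8)^n * 60 ms
print(samples_to_converge(20.0, 80.0))  # 16 samples (~1.3 s at 80 ms RTT)
```

Sixteen samples converge well within one 15-second handoff interval, but during those samples SRTT undershoots badly, and the opposite transition (80 ms to 20 ms) leaves RTO inflated rather than risking spurious timeouts.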
BBR vs CUBIC: BBR’s RTprop estimate uses minimum RTT over 10 seconds. What happens when the path’s true propagation delay increases by 4x every 15 seconds? How does CUBIC’s loss-based approach handle the same scenario?
Design the right CC: This environment has properties of both WAN (variable RTT, multi-hop) and datacenter (single operator controls all satellites). Which existing CC algorithm is closest to correct? What modification would you make, and which invariant does it restructure?
Connection migration: A user drives along a highway while on a video call. The ground terminal hands off between satellite beams every 15 seconds. TCP connections using 4-tuple identification break on every handoff. QUIC’s Connection ID survives handoffs. What other transport-layer changes are needed for seamless satellite handoff? Consider: does the path’s BDP change during handoff? Should the CC state (cwnd or BtlBw estimate) be preserved across handoffs or reset?
This chapter is part of “A First-Principles Approach to Networked Systems” by Arpit Gupta, UC Santa Barbara, licensed under CC BY-NC-SA 4.0.
Maximum Segment Size (MSS) is the largest payload TCP sends in a single segment, typically 1460 bytes (1500-byte Ethernet MTU minus 20-byte IP header minus 20-byte TCP header).↩︎
The SRTT and RTTVAR formulas are Exponentially Weighted Moving Averages (EWMAs). An EWMA gives recent samples more weight than older ones, with the weight decaying exponentially. Alpha = 1/8 means each new RTT sample contributes 12.5% to the smoothed estimate — the estimator adapts to changes within roughly 8 samples. Beta = 1/4 weights the deviation estimator more aggressively so RTTVAR tracks variance changes faster than the mean. These values were chosen empirically by Jacobson (Jacobson 1988) and codified in RFC 6298 as powers of two for efficient bit-shift implementation.↩︎
Bandwidth-Delay Product (BDP) = link bandwidth × round-trip time. For a 1 Gbps link with 100 ms RTT, BDP = 10^9 × 0.1 = 10^8 bits = 12.5 MB. This is the amount of data “in flight” needed to fully utilize the link. TCP’s congestion window must reach BDP for full utilization.↩︎
CUBIC’s window growth is W(t) = C(t - K)^3 + W_max, where C is a scaling constant, K = (W_max × (1 - β)/C)^{1/3} is the time to regrow to the previous maximum after the window is cut to β × W_max, and β = 0.7 is the multiplicative decrease factor. The cubic function grows slowly near W_max (probing carefully) and aggressively far from it.↩︎