flowchart TD
A["Finite buffer\nat every hop"]:::constraint --> B["FIFO tail-drop:\nno measurement,\nno fairness"]:::failure
B --> C["Congestion collapse:\nthroughput → zero\n(Nagle 1984)"]:::failure
C --> D["Fair queuing:\nper-flow isolation\n(DKS 1989)"]:::fix
C --> E["RED: early detection\nvia queue length EWMA\n(Floyd/Jacobson 1993)"]:::fix
D --> F["Per-flow state\ntoo expensive\nat scale"]:::failure
E --> G["Queue length\nis wrong signal:\nconflates burst\nwith overload"]:::failure
G --> H["RED rarely deployed\ndespite 20 years\nof research"]:::failure
H --> I["Cheap DRAM →\ndeep buffers →\nbufferbloat crisis\n(Gettys 2011)"]:::failure
I --> J["CoDel: sojourn time\nreplaces queue length\n(Nichols/Jacobson 2012)"]:::fix
J --> K["Aggregate CoDel:\nno per-flow fairness"]:::failure
D --> L["DRR: O(1)\nfair queuing\n(Shreedhar 1996)"]:::fix
K --> M["FQ-CoDel:\nDRR + CoDel per flow\n(RFC 8290, 2018)"]:::fix
L --> M
classDef constraint fill:#4A90D9,stroke:#2C5F8A,color:#fff
classDef failure fill:#D94A4A,stroke:#8A2C2C,color:#fff
classDef fix fill:#4AD94A,stroke:#2C8A2C,color:#fff
8 Queue Management
8.1 The Anchor: Finite Buffer at Every Forwarding Hop
Every router, switch, and access point faces the same physical constraint: link capacity is finite, but packet arrivals are bursty and often exceed capacity instantaneously. A buffer absorbs the excess, converting would-be loss into delay. Without buffers, every burst causes packet loss. With them, bursts cause queuing delay. The entire design space of queue management lives in this tradeoff — accepting some delay to minimize loss, or accepting some loss to minimize delay, or attempting to do both.
This constraint is architectural, not temporary. Routers have finite memory. Links have finite bandwidth. Traffic is stochastic. The buffer is where delay and loss are born, and the queue is where the network’s congestion signal is generated. Queue management is the network-interior counterpart to transport — a peer system that creates the congestion environment transport operates in, and in turn receives signals from transport that reshape its own design.
The binding constraint is finite buffer at every forwarding hop. Packets arrive faster than they depart. The queue manager must continuously answer four decision problems:
- When to drop — before the buffer fills (active queue management) or after (tail-drop)?
- Which packet to serve next — first-in-first-out, or per-flow scheduling?
- How to signal congestion to transport — through loss, through delay, or through explicit marking?
- What to measure — queue length, sojourn time, or arrival rate?
John Nagle, observing the first congestion collapse in 1984, framed the problem that would drive three decades of queue management research:
“If the retransmit interval is not set correctly, the host can rapidly degrade the performance of the net by retransmitting traffic the net is trying to deliver.” — Nagle, 1984 (Nagle 1984)
Looking back nearly three decades later, Kathleen Nichols and Van Jacobson identified the specific failure in the measurement signal that had plagued every previous approach:
“The key insight is that the quantity to control is not queue length but the time packets spend in the queue — the sojourn time.” — Nichols and Jacobson, 2012 (Nichols and Jacobson 2012)
Nichols and Jacobson’s framing was retrospective — they identified what Floyd and Jacobson’s RED (1993) had gotten wrong, and what two decades of deployment failure had demonstrated. The pioneers who built FIFO queues, invented RED, and deployed oversized buffers had no such framework. They were solving an engineering problem one measurement signal at a time.
8.2 Act 1: “It’s 1984. The Internet Suffers Congestion Collapse.”
The ARPANET has grown from four nodes to dozens. The transition from NCP to TCP/IP is complete. Gateways (the predecessors of modern routers) connect heterogeneous networks with different capacities. John Nagle, working at Ford Aerospace, observes a catastrophic failure mode: when traffic exceeds a gateway’s capacity, throughput collapses to near zero while the network remains fully loaded.
“Congestion control is a recognized problem in networks of store-and-forward packet switches.” — Nagle, 1984 (Nagle 1984)
What the pioneers saw: Gateways with small buffers. TCP implementations that retransmitted aggressively on timeout. Interactive applications (Telnet) sending single-byte packets with 40 bytes of header. A cooperative academic network where a few dozen hosts generated modest traffic.
What remained invisible from the pioneers’ vantage point: Buffers would grow from dozens of packets to thousands as memory became cheap. TCP would evolve loss-based congestion control that depended on drops as signals. The Internet would scale from hundreds to billions of devices, making cooperative behavior impossible to assume.
8.2.1 The Failure: FIFO Tail-Drop Under Load
Every gateway uses the same queue discipline: First-In-First-Out with tail-drop. Packets arrive, enter a single buffer in arrival order, and depart when the link is free. When the buffer fills, arriving packets are silently discarded. This is the simplest possible queue management — no per-flow state, no measurement, no intelligence.
Nagle documents two failure modes. First, the “small packet problem” — interactive applications generate 41-byte packets (40 bytes header, 1 byte payload), creating massive overhead. Second, retransmission flooding — when gateways drop packets due to full buffers, senders retransmit, generating duplicate traffic that further overloads the already-congested gateways. This creates a positive feedback loop: congestion causes drops, drops cause retransmissions, retransmissions cause more congestion.
FIFO tail-drop provides no mechanism to break this loop. The gateway lacks intelligence — it accepts packets until the buffer fills, then drops everything that arrives. It offers zero early warning, zero proportional signaling, zero fairness between flows. A single aggressive sender can fill the entire buffer, starving all other flows. The queue is a passive container, blind to its own congestion.
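The discipline Nagle documented is simple enough to sketch in a few lines; the simplicity is the point. This is an illustrative sketch (class and method names are ours, not from any real router):

```python
from collections import deque

class TailDropFifo:
    """Minimal sketch of 1984-era FIFO tail-drop: no measurement,
    no fairness, no signal until the buffer is already full."""

    def __init__(self, capacity_pkts):
        self.buf = deque()
        self.capacity = capacity_pkts

    def enqueue(self, pkt):
        if len(self.buf) >= self.capacity:
            return False            # silent discard: the only "signal" there is
        self.buf.append(pkt)        # arrival order is service order
        return True

    def dequeue(self):
        return self.buf.popleft() if self.buf else None
```

Note what is absent: no queue-length tracking, no timestamps, no per-flow accounting. Every later system in this chapter adds exactly one of those missing pieces.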
8.2.2 Invariant Analysis: FIFO Tail-Drop (1984)
| Invariant | FIFO Tail-Drop Answer (1984) | Gap? |
|---|---|---|
| State | Buffer contents only — no measurement of congestion severity | No awareness of persistent vs. transient overload |
| Time | Packet arrival order — no timing intelligence | No relationship between queue delay and signaling |
| Coordination | None — “order is destiny” | No fairness; aggressive sender monopolizes buffer |
| Interface | Accept or drop (binary, at buffer boundary) | Signal arrives too late — only after buffer is full |
The State gap is the most consequential: the gateway maintains no measurement of congestion at all. It tracks only buffer occupancy — blind to whether the queue is growing, shrinking, or stable, and blind to whether current traffic is a transient burst or persistent overload. The only “signal” is binary: buffer full (drop) or buffer not full (accept). This signal arrives too late — by the time packets drop, the queue has been full for hundreds of milliseconds, and TCP senders have already injected thousands of excess packets.
8.2.3 Environment → Measurement → Belief
| Layer | What FIFO Tail-Drop Has | What’s Missing |
|---|---|---|
| Environment | True queue occupancy, arrival rates, departure rates | — |
| Measurement | Nothing — gateway observes no congestion signal | No queue length tracking, no delay measurement |
| Belief | Assume capacity is available until buffer overflows | Belief is always optimistic; collapses suddenly |
The E→M gap is structurally absent — the gateway has no measurement mechanism at all. There is no sensor, no estimator, no control loop. The gateway is an open-loop system: it accepts packets without limit until a hard physical boundary (buffer full) forces action. This is the root cause of congestion collapse — the queue management system generates no feedback signal until the catastrophe has already occurred.
8.2.4 “The Gaps Didn’t Matter… Yet.”
In 1984, the Internet had hundreds of hosts. Traffic was bursty but sparse. Gateways had small buffers (tens of packets), so even when congestion occurred, the delay was bounded. Nagle’s congestion collapse was a warning, not a crisis — it appeared under specific pathological conditions. His immediate fix, the Nagle algorithm, addressed the small-packet problem at the transport layer. But his deeper observation — that gateways needed per-flow intelligence and congestion signaling — planted the seed for two divergent research threads that would take decades to converge.
Nagle proposed that gateways should maintain separate queues per flow and serve them round-robin, creating a “firewall” between competing flows. This idea anticipated fair queuing by five years. But in 1984, per-flow state at a gateway was expensive, and the Internet was small enough that FIFO worked acceptably for most traffic.
The question Nagle left open: what measurement signal should the queue use to detect congestion and signal transport? He identified the problem. The solution required two separate innovations — one for fairness, one for measurement.
8.3 Act 2: “It’s 1989. Fair Queuing Arrives.”
8.3.1 Which Invariant Broke?
| What Changed | Before | After |
|---|---|---|
| Traffic | Cooperative academic users, modest load | Heterogeneous applications, aggressive senders |
| Problem | Congestion collapse from retransmission flooding | Unfairness — one aggressive flow monopolizes the buffer |
| Invariant | Coordination: no mechanism for fair sharing | Coordination redesigned: per-flow isolation |
The Internet is growing. TCP congestion control (Jacobson 1988) has solved the retransmission flooding problem — senders now back off on loss. But a new failure appears: under FIFO, a single bulk transfer can monopolize the queue, starving latency-sensitive flows (Telnet, DNS) that share the same link. FIFO treats all packets identically, which is fair in a strict sense (all wait equally) but catastrophically unfair in an economic sense — a user sending one packet suffers the same delay as a user sending a thousand.
Alan Demers, Srinivasan Keshav, and Scott Shenker at Xerox PARC frame the problem by analogy to operating systems:
“CPU scheduling of user processes controls the use of CPU resources… and insulates well-behaved users from ill-behaved users. We ask: can a similar approach work for network scheduling?” — Demers, Keshav, and Shenker, 1989 (Demers et al. 1989)
What the pioneers saw: The FIFO queue as a shared resource with no access control — the network equivalent of a time-sharing system with no scheduler. They saw that FIFO “inextricably intertwines” three independent allocation quantities: bandwidth, promptness (delay), and buffer space. A single ill-behaved source can “capture an arbitrarily high fraction of the bandwidth.”
What remained invisible from the pioneers’ vantage point: Fair queuing solves the coordination problem but not the measurement problem. Per-flow scheduling ensures each flow gets its fair share of bandwidth, but leaves unaddressed what happens inside each flow’s queue. A flow’s packets can still accumulate indefinitely, and the queue provides no signal to the sender about the delay those packets are experiencing.
8.3.2 Demers, Keshav, and Shenker’s Redesign: Weighted Fair Queuing
The solution applied disaggregation by separating per-flow scheduling from aggregate buffering. Instead of a single FIFO queue, the router maintains one queue per flow. A scheduler serves the queues in proportion to their assigned weights, emulating a bit-by-bit round-robin (Generalized Processor Sharing) (Parekh and Gallager 1993) using virtual time stamps.
Each packet receives a virtual finish time: the time at which it would complete service in the idealized GPS (Generalized Processor Sharing) system. The scheduler always serves the packet with the earliest virtual finish time. This ensures that over any interval, each flow’s transmitted bytes are proportional to its weight — max-min fairness.
The coordination answer transforms from “order is destiny” (FIFO) to “each flow gets a fair share proportional to its weight.” No flow can increase its allocation without decreasing another’s — the allocation is Pareto optimal.
The cost of this disaggregation is per-flow state. Each flow requires a queue, a virtual finish time, and a weight. At millions of flows per router, this becomes expensive. WFQ (Weighted Fair Queuing) requires O(log F) per-packet work to maintain sorted virtual times, where F is the number of active flows (Parekh and Gallager 1993). This motivated simpler approximations: Deficit Round-Robin (DRR) (Shreedhar and Varghese 1996) achieves similar fairness with O(1) per-packet cost by replacing virtual time with a deficit counter.
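The deficit-counter idea can be sketched briefly. This is a simplified illustration of Deficit Round-Robin under stated assumptions (fixed round structure, flow map of byte-size queues); real implementations maintain an active list and interleave with arrivals:

```python
from collections import deque

def drr_schedule(flows, quantum, rounds):
    """Sketch of Deficit Round-Robin (Shreedhar/Varghese): each active flow's
    deficit grows by `quantum` per round and is spent on packet bytes.
    `flows` maps flow id -> deque of packet sizes in bytes."""
    deficits = {f: 0 for f in flows}
    served = []
    for _ in range(rounds):
        for f, q in flows.items():
            if not q:
                deficits[f] = 0          # idle flows carry no deficit forward
                continue
            deficits[f] += quantum
            # O(1) amortized per packet: no sorted virtual times to maintain
            while q and q[0] <= deficits[f]:
                size = q.popleft()
                deficits[f] -= size
                served.append((f, size))
    return served
```

With a 1,500-byte quantum, a flow of 1,500-byte packets and a flow of 500-byte packets each transmit about 1,500 bytes per round: fairness in bytes, not packets, without any virtual-time computation.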
8.3.3 Invariant Analysis: Fair Queuing (1989)
| Invariant | Fair Queuing Answer (1989) | Gap? |
|---|---|---|
| State | Per-flow: queue, virtual finish time, deficit counter | Expensive at scale (millions of flows) |
| Time | Virtual time emulating bit-by-bit service | Complex computation per packet |
| Coordination | Per-flow fair share; max-min fairness | Solved — each flow isolated |
| Interface | Guaranteed minimum bandwidth per flow | No delay guarantee; queue can grow within each flow |
Fair queuing closes the Coordination gap that FIFO left open. An aggressive flow can only fill its own queue — other flows are insulated. But a new gap emerges: fair queuing provides no mechanism to manage delay within each flow’s queue. A flow’s packets can accumulate indefinitely, growing the queue without triggering any congestion signal. The fairness problem is solved; the measurement problem remains.
8.3.4 Environment → Measurement → Belief
| Layer | What Fair Queuing Has | What’s Missing |
|---|---|---|
| Environment | Per-flow arrival rates, service rates | True congestion severity per flow |
| Measurement | Queue occupancy per flow | No delay measurement; no persistence detection |
| Belief | Fair share is maintained | Queue grows without bound within a flow |
The E→M gap is accidentally overlooked — the system could measure per-flow delay but omits it. Fair queuing was designed to solve unfairness; queue depth management and the measurement question were left to a parallel research thread.
8.3.5 “The Gaps Didn’t Matter… Yet.”
In 1989, routers had small buffers and relatively few concurrent flows. Per-flow queuing kept individual queues short by limiting each flow’s buffer share. The Internet backbone carried primarily academic traffic — cooperative, bursty, and modest in volume. The delay-management gap was invisible because buffers were small enough that per-flow queues rarely grew large.
But fair queuing’s per-flow state scaled poorly to the growing Internet core, where routers handled millions of flows per second. Operators needed a different approach — one that managed congestion at aggregate granularity.
8.4 Act 3: “It’s 1993. Floyd and Jacobson Invent RED.”
8.4.1 Which Invariant Broke?
| What Changed | Before | After |
|---|---|---|
| Scale | Hundreds of flows per router | Thousands of flows, rising fast |
| Problem | Per-flow state too expensive for core routers | Need aggregate AQM (Active Queue Management) without per-flow overhead |
| Invariant | State: no measurement of congestion in FIFO queues | State redesigned: EWMA of queue length |
Sally Floyd and Van Jacobson observe that FIFO tail-drop creates two specific pathologies at scale. First, global synchronization — when a tail-drop queue overflows, many TCP connections lose packets simultaneously, all reduce their windows at the same moment, then all ramp up together, creating oscillating waves of under- and over-utilization. Second, lock-out — a few aggressive flows monopolize the buffer, preventing other flows from ever getting packets into the queue.
“The most effective detection of congestion can occur in the gateway itself. The gateway can reliably distinguish between propagation delay and persistent queueing delay.” — Floyd and Jacobson, 1993 (Floyd and Jacobson 1993)
What the pioneers saw: A network where TCP senders cooperatively respond to packet loss as a congestion signal. Gateways that wait until the buffer fills before signaling — too late. The insight: if the gateway drops packets before the buffer fills, TCP senders receive the signal earlier and reduce their rates before congestion becomes catastrophic.
What remained invisible from the pioneers’ vantage point: Queue length is the wrong measurement signal. A long queue can mean two different things: (1) a transient burst that will clear on its own (healthy — the queue should absorb it), or (2) persistent overload requiring rate reduction (unhealthy — the queue should signal). RED conflates both. This ambiguity would plague RED for two decades.
8.4.2 Floyd and Jacobson’s Solution: Random Early Detection
RED computes an exponentially weighted moving average (EWMA) of the queue length. When the average exceeds a minimum threshold, RED begins dropping packets probabilistically — the higher the average queue length, the higher the drop probability. When the average exceeds a maximum threshold, RED drops every arriving packet. The randomization prevents global synchronization: different connections receive the drop signal at different times, preventing the waves that tail-drop creates.
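The threshold logic can be sketched as follows. This is a simplified illustration: real RED also spaces drops using a count of packets since the last drop and adjusts the average across idle periods, both omitted here. The parameter names follow the paper; everything else is ours:

```python
import random

class Red:
    """Sketch of RED's drop decision (Floyd/Jacobson 1993). The four tuning
    parameters that made deployment hard are all visible in the signature."""

    def __init__(self, w_q=0.002, min_th=5, max_th=15, max_p=0.1,
                 rng=random.random):
        self.w_q, self.min_th, self.max_th, self.max_p = w_q, min_th, max_th, max_p
        self.avg = 0.0          # EWMA of queue length, in packets
        self.rng = rng

    def on_arrival(self, queue_len):
        """Return True if the arriving packet should be dropped (or ECN-marked)."""
        # EWMA of instantaneous queue length: RED's (spatial) congestion signal
        self.avg += self.w_q * (queue_len - self.avg)
        if self.avg < self.min_th:
            return False                     # believed uncongested
        if self.avg >= self.max_th:
            return True                      # forced drop
        # probabilistic drop, linear in where avg sits between the thresholds
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return self.rng() < p
```

Note that every input to the decision is a queue length; no timestamp appears anywhere. That omission is the subject of the rest of this act.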
Floyd and Jacobson applied closed-loop reasoning — the gateway now runs a feedback loop between its measurement (queue length) and its action (probabilistic dropping). The loop’s purpose: keep the average queue length between the two thresholds, signaling congestion early enough that TCP senders can reduce their rates before the buffer overflows.
But the loop’s measurement signal is degraded. Consider two scenarios on a 10 Mbps link. In scenario one, a video application sends a burst of 100 packets; the queue spikes, then drains completely in about 120 milliseconds. In scenario two, two TCP flows whose combined demand exceeds the link rate sustain a standing queue of 40 packets. RED’s EWMA can settle at the same value (say, 30 packets) in both cases. RED makes the same drop decision for both, but they require opposite responses: the burst needs no drops (the queue will drain on its own), while the persistent overload needs drops (to reduce the sending rate).
8.4.3 Invariant Analysis: RED (1993)
| Invariant | RED Answer (1993) | Gap? |
|---|---|---|
| State | EWMA of aggregate queue length | Wrong signal: conflates burst with overload |
| Time | EWMA filtering targets congestion persisting several RTTs | Filter time constant hard to tune |
| Coordination | Gateway drops unilaterally, proportional to queue depth | No per-flow fairness |
| Interface | Probabilistic drop/ECN mark | Signal proportional to wrong metric |
The fundamental problem is the State answer: queue length contains information about buffer occupancy but not about congestion persistence. A queue of 50 packets caused by a brief burst is healthy — the queue will drain. A queue of 50 packets caused by sustained overload is unhealthy — it will grow without intervention. RED’s EWMA conflates these two cases.
Additionally, RED has four tuning parameters: the EWMA weight (w_q), minimum threshold (min_th), maximum threshold (max_th), and maximum drop probability (max_p). Optimal values depend on link capacity, RTT, and traffic mix. In heterogeneous networks with dynamic link rates, no single tuning works everywhere. Floyd and Jacobson themselves acknowledged that optimal threshold values were “left as a question for future research.”
8.4.4 Environment → Measurement → Belief
| Layer | What RED Has | What’s Missing |
|---|---|---|
| Environment | True queue occupancy, arrival/departure dynamics | Congestion persistence (burst vs. overload) |
| Measurement | EWMA of queue length | Signal conflates transient and persistent congestion |
| Belief | “High average queue = persistent congestion” | Belief is often wrong; acts on bursts, misses slow-onset overload |
The E→M gap is accidentally noisy — the EWMA signal is an honest but imperfect proxy for congestion. Better estimation (different filter parameters) helps in some scenarios but fails in others. The fundamental problem is that queue length is the wrong quantity to measure for detecting persistent congestion.
8.4.5 “The Gaps Didn’t Matter… Yet.”
RED received an IETF recommendation (Braden et al. 1998) that “RED or some equivalent should be used by default in routers.” Research papers refined RED’s parameters for over a decade. Yet RED was rarely deployed in production. Network operators found it “too difficult to configure, especially in an environment with dynamic link rates” (Baker and Fairhurst 2015). The IETF eventually withdrew the specific recommendation for RED, acknowledging that its tuning burden prevented deployment.
RED’s failure was not conceptual — early detection is the right idea. The failure was in the measurement signal. Queue length is a spatial proxy for a temporal phenomenon (congestion persistence). The right signal would need to measure time directly.
Meanwhile, a different problem was growing. Router manufacturers, chasing throughput benchmarks, installed larger and larger buffers.
8.5 Act 4: “It’s 2011. Gettys Rediscovers Bufferbloat.”
8.5.1 Which Invariant Broke?
| What Changed | Before | After |
|---|---|---|
| Hardware | Router buffers sized at 1x BDP (tens of packets) | Cheap DRAM: buffers sized at 10–50x BDP (thousands of packets) |
| Problem | Occasional congestion at modest buffer depths | Persistent multi-second latency invisible to TCP |
| Invariant | State: TCP’s loss signal degraded by deep buffers | Coupling failure between queue and transport measurement |
Jim Gettys (Gettys and Nichols 2012), a veteran systems engineer at Bell Labs and Alcatel-Lucent, notices that his home Internet connection has become unusable for interactive applications — SSH sessions lag, video calls stutter, web pages load slowly — despite a nominally fast broadband link. Measurements reveal the cause: 3 or more seconds of queuing delay in his home router’s buffer. He names the phenomenon bufferbloat.
What the pioneers saw (retrospectively): Memory became cheap. Router manufacturers sized buffers to maximize throughput benchmarks — link utilization is easy to measure and market. A 100 Mbps link with a 1,000-packet buffer (1.5 MB) can absorb any TCP burst without loss, keeping the link fully utilized. But that buffer holds 120 milliseconds of packets. TCP, using loss as its congestion signal, interprets the absence of drops as “no congestion — send faster.” The congestion window grows unchecked. The buffer fills completely before a single drop occurs. By then, every packet in the buffer experiences 120+ milliseconds of queuing delay.
What remained invisible until Gettys measured it: The coupling failure between queue management and transport. TCP’s feedback loop depends on loss to signal congestion. The queue’s feedback (implicitly, through tail-drop) depends on buffer overflow. When the buffer is 50 times the bandwidth-delay product, these two feedback mechanisms decouple completely. TCP receives no loss signal for seconds. The sender’s belief (“I can send faster”) diverges from the environment’s reality (“the queue is full, latency is catastrophic”) by orders of magnitude.
8.5.2 The Concrete Numbers
A home router bottleneck at 10 Mbps with a 100-packet buffer (150 KB):
- Queue delay at full buffer: (100 x 1,500 bytes) / 10 Mbps = 120 milliseconds
- With a 50 ms propagation RTT, total RTT becomes 170 ms
- An interactive application (voice frame every 20 ms) experiences 6x its frame period in queue delay alone
Scale the buffer to 1,000 packets (1.5 MB), common in commodity routers:
- Queue delay: 1.2 seconds
- TCP sees no loss during this entire interval — cwnd grows unchecked
- Interactive applications become completely unusable
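The arithmetic above generalizes: the worst-case drain time of a full FIFO buffer is its size in bits divided by the link rate. A minimal helper (the function name and defaults are illustrative):

```python
def queue_delay_s(buffer_pkts, pkt_bytes=1500, link_bps=10e6):
    """Worst-case queuing delay of a full buffer: bits queued / link rate."""
    return buffer_pkts * pkt_bytes * 8 / link_bps
```

A 100-packet buffer at 10 Mbps yields 0.12 s; a 1,000-packet buffer yields 1.2 s, matching the figures above. The delay scales linearly with buffer depth and inversely with link rate, which is why the same commodity buffer that is harmless on a gigabit link is catastrophic on a slow uplink.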
8.5.3 The Closed-Loop Failure
TCP’s congestion control is a negative feedback loop: send faster → observe loss → back off → send slower → queue drains → send faster. This loop is stable when the feedback signal (loss) arrives within a few RTTs. Bufferbloat breaks the loop by inflating the time constant. If the buffer adds 1.2 seconds of delay, the feedback loop’s period stretches from milliseconds to seconds. TCP’s AIMD sawtooth becomes a slow oscillation between extremes: the sender ramps for seconds (while the buffer fills silently), receives a burst of drops, halves its window, the buffer drains, and the cycle repeats. Latency oscillates between near-zero and 1+ seconds, never stabilizing.
8.5.4 Four-Invariant Diagnosis
Bufferbloat is a coupling failure across two systems — queue management and transport — diagnosed through the State invariant:
| Layer | Queue System Sees | Transport System Sees |
|---|---|---|
| Environment | Buffer full, 1.2s of queued packets | Packets being delivered (slowly) |
| Measurement | Queue depth (but takes no action until overflow) | Loss (but none arrives because buffer absorbs everything) |
| Belief | “Operating normally — buffer not full” | “No congestion — increase sending rate” |
Both systems have optimistic beliefs because both measurement signals are degraded. The queue management system has no AQM — it uses FIFO tail-drop, generating no signal until the buffer overflows. The transport system relies on loss, which arrives only after the buffer fills completely. The gap between environment (catastrophic latency) and belief (send faster) is where bufferbloat lives.
8.5.5 Why Bufferbloat Happened
Bufferbloat was not a new technical problem — Nagle identified the underlying dynamic (congestion collapse from mismatched feedback) in 1984. What changed was economics. DRAM prices dropped by 1,000x between 1990 and 2010. Router manufacturers, competing on throughput benchmarks, installed the cheapest way to improve utilization: more buffer. Throughput is easy to measure and market; latency is harder to measure and less directly tied to user satisfaction scores. The incentive was misaligned: the metric that drove purchasing decisions (throughput) was orthogonal to the metric that drove user experience (latency).
Gettys’ contribution was not discovering a new algorithm — it was measuring the real-world impact of the coupling failure and catalyzing the community to fix it. His blog posts, IETF engagement, and measurement campaigns created the urgency that led to CoDel.
8.6 Act 5: “It’s 2012. Nichols and Jacobson Fix the Measurement.”
8.6.1 Which Invariant Broke?
| What Changed | Before | After |
|---|---|---|
| Diagnosis | RED’s queue-length signal is wrong | Need a signal that distinguishes burst from overload |
| Problem | Bufferbloat persists because RED’s measurement fails | Sojourn time replaces queue length |
| Invariant | State: measurement signal redesigned | From spatial proxy to temporal truth |
Kathleen Nichols and Van Jacobson — the same Jacobson who co-invented RED in 1993 — identify the fundamental flaw in every previous AQM approach. Queue length, whether raw or EWMA-filtered, “contains no information at all about packet demand or network load.” A queue grows identically from a harmless burst that drains in milliseconds and from persistent overload that drains only when senders back off. Queue length conflates the two.
“CoDel is parameterless. One of the weaknesses in the RED algorithm is that it is too difficult to configure, especially in an environment with dynamic link rates.” — Nichols and Jacobson, 2012 (Nichols and Jacobson 2012)
What the pioneers saw: Nineteen years of RED deployment failure. The tuning burden that prevented operators from enabling RED. The bufferbloat crisis that demonstrated RED had not solved the underlying problem. And a simple observation: the quantity that matters for interactive applications is not how many packets are in the queue, but how long each packet waits.
What remained invisible from the pioneers’ vantage point: CoDel solves the measurement problem for a single aggregate queue. An aggressive flow can still dominate the queue, causing CoDel to drop packets from innocent flows that happen to share the same buffer. The fairness problem — solved by Demers, Keshav, and Shenker in 1989 — remains unsolved in the AQM context. The two solutions would need to be combined.
8.6.2 Nichols and Jacobson’s Redesign: CoDel
CoDel (Controlled Delay) measures sojourn time — the time each packet spends in the queue, computed as the difference between dequeue time and enqueue time. At each packet departure, CoDel records the sojourn time and tracks the minimum sojourn time over a sliding 100-millisecond window. The minimum is the key: it reveals whether the queue drained at any point during the interval.
- Transient burst: Packets arrive in a burst, queue depth spikes, but packets dequeue quickly. The minimum sojourn time stays below the 5 ms target. CoDel takes no action — the queue is “good.”
- Persistent overload: Packets accumulate continuously. The minimum sojourn time exceeds the 5 ms target for the entire 100 ms interval. CoDel enters the dropping state — the queue is “bad.”
The dropping rate follows a control law: the interval between successive drops shrinks as interval/sqrt(count), where count is the number of drops since entering the dropping state. Because TCP throughput scales as the inverse square root of the loss rate, this schedule drives the sending rate down roughly linearly, ensuring the feedback signal converges rapidly.
Nichols and Jacobson applied closed-loop reasoning with a fundamentally different measurement signal. RED’s loop measured space (queue length) and acted on a spatial threshold. CoDel’s loop measures time (sojourn time) and acts on a temporal threshold. The temporal signal directly captures what matters for interactive applications: how long packets wait. It is independent of link rates, buffer sizes, and traffic patterns — the same 5 ms target works from kilobits to gigabits per second.
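The dequeue-side state machine can be sketched as follows. This is a simplified illustration of the logic described above, not the reference pseudocode: it omits CoDel's re-entry heuristic (resuming dropping at a remembered count) and the queue-smaller-than-one-MTU escape clause.

```python
import math

class CoDel:
    """Sketch of CoDel's dequeue logic (Nichols/Jacobson 2012): act on
    sojourn time, and only when it stays above TARGET for a full INTERVAL."""

    TARGET = 0.005      # 5 ms: acceptable standing delay
    INTERVAL = 0.100    # 100 ms: persistence window

    def __init__(self):
        self.first_above = None   # when sojourn first exceeded TARGET
        self.dropping = False
        self.count = 0            # drops since entering the dropping state
        self.next_drop = 0.0

    def on_dequeue(self, sojourn, now):
        """Return True if this packet should be dropped (or ECN-marked)."""
        if sojourn < self.TARGET:
            self.first_above = None       # queue drained at some point: "good"
            self.dropping = False
            return False
        if self.first_above is None:
            self.first_above = now        # start the persistence clock
            return False
        if not self.dropping:
            if now - self.first_above >= self.INTERVAL:
                self.dropping = True      # delay persisted a full interval: "bad"
                self.count = 1
                self.next_drop = now + self.INTERVAL / math.sqrt(self.count)
                return True
            return False
        if now >= self.next_drop:         # control law: interval/sqrt(count)
            self.count += 1
            self.next_drop += self.INTERVAL / math.sqrt(self.count)
            return True
        return False
```

Notice that no link rate, buffer size, or traffic parameter appears anywhere: the two constants are times, which is why the same values work from kilobits to gigabits.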
8.6.3 Invariant Analysis: CoDel (2012)
| Invariant | CoDel Answer (2012) | Gap? |
|---|---|---|
| State | Aggregate + per-packet timestamps; min sojourn over 100 ms | Correct signal for persistence detection |
| Time | Real-time sojourn (5 ms target / 100 ms window) | Self-tuning; no parameters to configure |
| Coordination | Gateway drops/marks when persistent delay detected | No per-flow fairness — aggregate queue |
| Interface | Drop or ECN mark; signal proportional to delay persistence | Clear, timely signal to transport |
CoDel closes the State gap that RED left open: sojourn time directly measures the phenomenon (queuing delay) that needs controlling. The measurement signal is faithful — it reports exactly what interactive applications care about. CoDel is self-tuning: the 5 ms target and 100 ms interval are the only parameters, and they work across heterogeneous networks without operator tuning. This eliminated the deployment barrier that killed RED.
8.6.4 Environment → Measurement → Belief
| Layer | What CoDel Has | What’s Missing |
|---|---|---|
| Environment | Per-packet queuing delay | Per-flow delay (aggregate only) |
| Measurement | Min sojourn over 100 ms window | Correct; separates burst from overload |
| Belief | “If min sojourn > 5 ms for full interval, overload is persistent” | Faithful belief; acts only on persistent congestion |
The E→M gap is effectively closed for the aggregate: CoDel’s measurement (min sojourn) faithfully tracks the environment (queuing delay persistence). But the measurement is aggregate — it reflects the combined behavior of all flows sharing the queue. If one aggressive flow creates persistent delay, CoDel drops packets from the aggregate, which hits innocent packets from well-behaved flows. The Coordination gap remains.
8.6.5 “The Gaps Didn’t Matter… Yet.”
CoDel is dramatically effective at managing aggregate delay. On links with homogeneous traffic (similar flows sharing a single bottleneck), CoDel keeps latency near the 5 ms target regardless of buffer size. But on links with heterogeneous traffic — a bulk download sharing a queue with a VoIP call — CoDel exposes the VoIP call to the download’s queue pressure. Both flows share the same CoDel instance, and CoDel’s drops land randomly across flows.
The solution requires combining CoDel’s measurement innovation with fair queuing’s coordination innovation. The two threads — one tracing back to 1989 (fairness), the other to 2012 (measurement) — need to converge.
8.7 Act 6: “It’s 2018. FQ-CoDel Synthesizes Fairness and Measurement.”
8.7.1 Which Invariant Broke?
| What Changed | Before | After |
|---|---|---|
| Traffic mix | Homogeneous flows sharing aggregate queue | Heterogeneous: bulk downloads alongside interactive traffic |
| Problem | Aggregate CoDel drops from innocent flows | Per-flow isolation + per-flow AQM needed |
| Invariant | Coordination: no per-flow fairness in CoDel | Both Coordination and State redesigned |
Toke Hoeiland-Joergensen, Paul McKenney, Dave Taht, Jim Gettys, and Eric Dumazet (Hoeiland-Joergensen et al. 2018) combined the two innovations that the queue management field had developed independently over three decades. Fair queuing (1989) solves the Coordination problem: per-flow isolation ensures each flow gets its fair share. CoDel (2012) solves the State problem: sojourn-time measurement detects persistent delay and generates timely signals. Neither alone is sufficient. FQ-CoDel combines both.
8.7.2 The Synthesis: FQ-CoDel
FQ-CoDel hashes each packet’s 5-tuple (source IP, destination IP, protocol, source port, destination port) into one of 1,024 flow buckets. Each bucket has its own queue. A Deficit Round-Robin (DRR) scheduler serves the queues in turn, ensuring each flow gets fair access to the link with O(1) per-packet cost. Within each queue, a CoDel instance independently tracks minimum sojourn time and drops packets when persistent delay is detected.
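The bucket-selection step can be sketched in a few lines. Linux's fq_codel uses a seeded Jenkins hash over the 5-tuple; Python's built-in `hash` stands in here, so the specific bucket numbers are illustrative only:

```python
NUM_BUCKETS = 1024  # FQ-CoDel's default number of flow queues

def flow_bucket(src_ip, dst_ip, proto, sport, dport, seed=0):
    """Map a packet's 5-tuple to one of NUM_BUCKETS queues.
    (The kernel uses a seeded Jenkins hash; hash() is a stand-in.)"""
    return hash((src_ip, dst_ip, proto, sport, dport, seed)) % NUM_BUCKETS
```

Every packet of a given flow maps to the same bucket, so a flow's packets always meet the same DRR deficit and the same CoDel instance; distinct flows land in distinct buckets except for rare hash collisions.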
The combination achieves what neither component alone can:
- Fair queuing without CoDel: Each flow gets its fair share, but queues within each flow can grow without bound. No delay signal to transport.
- CoDel without fair queuing: Delay is controlled for the aggregate, but an aggressive flow’s packets trigger drops from innocent flows sharing the same queue.
- FQ-CoDel: Each flow is isolated. An aggressive flow builds only its own queue. Its CoDel instance detects persistent delay and drops only its packets. Other flows’ queues remain short and responsive. The bulk download experiences drops proportional to its own congestion; the VoIP call experiences near-zero queuing delay.
FQ-CoDel includes a critical optimization for sparse flows — flows that transmit occasionally (DNS queries, SSH keystrokes, VoIP frames). When a packet arrives at an empty queue, FQ-CoDel gives it priority, scheduling it ahead of packets from flows with accumulated backlogs. This ensures that interactive traffic achieves near-zero queuing delay even on a heavily loaded link.
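The scheduling behavior described above can be modeled in a short sketch, assuming FQ-CoDel's two-list structure (newly active flows served before backlogged ones) and omitting the per-queue CoDel instances; the class and names are invented for illustration:

```python
from collections import deque

QUANTUM = 1514  # byte credit added per round (one MTU-sized packet)

class FQSketch:
    """Simplified DRR scheduler with FQ-CoDel's sparse-flow priority."""

    def __init__(self):
        self.queues = {}          # bucket -> deque of packet sizes
        self.deficit = {}         # bucket -> remaining byte credit
        self.new_flows = deque()  # buckets that just became active
        self.old_flows = deque()  # buckets with accumulated backlog

    def enqueue(self, bucket, pkt_size):
        q = self.queues.setdefault(bucket, deque())
        if not q and bucket not in self.new_flows and bucket not in self.old_flows:
            # Packet arrived at an empty queue: the flow is sparse
            # and enters the priority (new) list with fresh credit.
            self.new_flows.append(bucket)
            self.deficit[bucket] = QUANTUM
        q.append(pkt_size)

    def dequeue(self):
        while self.new_flows or self.old_flows:
            # New (sparse) flows are served before old (backlogged) ones.
            lists = self.new_flows if self.new_flows else self.old_flows
            b = lists[0]
            if self.deficit[b] <= 0:
                # Credit spent: top up and demote to the old list.
                self.deficit[b] += QUANTUM
                lists.popleft()
                self.old_flows.append(b)
            elif not self.queues[b]:
                lists.popleft()  # queue drained: flow goes idle
            else:
                size = self.queues[b].popleft()
                self.deficit[b] -= size
                return b, size
        return None
```

A packet arriving at an empty queue enters `new_flows` and is dequeued ahead of flows sitting in `old_flows`, which is the mechanism behind the near-zero delay for DNS and VoIP traffic described above.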
FQ-CoDel applied disaggregation by separating per-flow scheduling (DRR) from per-flow delay management (CoDel). It applied closed-loop reasoning with sojourn time as the measurement signal, independently per flow. It applied decision placement at the gateway — each router enforces fairness and delay control unilaterally, without negotiation with senders.
8.7.3 Invariant Analysis: FQ-CoDel (2018)
| Invariant | FQ-CoDel Answer (2018) | Gap? |
|---|---|---|
| State | Per-bucket (1,024): DRR deficit + CoDel state + timestamps | Scales to thousands of flows; hash collisions rare |
| Time | Per-flow sojourn (5 ms target / 100 ms window) | Self-tuning across link speeds |
| Coordination | DRR for per-flow fairness + CoDel for per-flow delay | Both fairness and delay controlled |
| Interface | Fair-share bandwidth + controlled delay bound; ECN when possible | Clear, per-flow signal to transport |
FQ-CoDel closes both the Coordination gap (opened in Act 1, addressed in Act 2, lost in Act 3) and the State gap (opened in Act 1, addressed with wrong signal in Act 3, corrected in Act 5). The synthesis required both innovations.
8.7.4 Environment → Measurement → Belief
| Layer | What FQ-CoDel Has | What’s Missing |
|---|---|---|
| Environment | Per-flow queuing delay and arrival rates | True application requirements (latency targets) |
| Measurement | Per-flow min sojourn over 100 ms window | Correct for persistence detection, per flow |
| Belief | “Each flow’s delay is independently controlled” | Faithful belief; acts only on each flow’s own congestion |
The E→M gap is closed per flow. Each CoDel instance measures only its own flow’s sojourn time, uncontaminated by other flows’ behavior. The measurement is faithful, the feedback loop is tight, and the system is self-tuning. FQ-CoDel became the default queuing discipline (qdisc) in Linux, shipping in OpenWrt, RHEL, and major distributions since 2014. Residential users deploying FQ-CoDel report latency dropping from 100+ milliseconds to single-digit milliseconds under load.
8.8 The Grand Arc: From Congestion Collapse to FQ-CoDel
8.8.1 The Evolving Anchor
| Era | Binding Constraint | State Answer | Coordination Answer | Measurement Signal |
|---|---|---|---|---|
| 1984 | Finite buffer, FIFO only | No measurement | No fairness | None (binary: full/not full) |
| 1989 | Per-flow state expensive | Per-flow queues + virtual time | Max-min fairness via WFQ | Queue occupancy per flow |
| 1993 | Scale demands aggregate AQM | EWMA of aggregate queue length | Gateway drops proportionally | Queue length (wrong signal) |
| 2011 | Cheap memory → deep buffers | TCP’s loss signal degraded | Queue-transport coupling failure | Loss arrives too late |
| 2012 | Persistence detection needed | Min sojourn time over 100 ms | Gateway drops on persistent delay | Sojourn time (right signal) |
| 2018 | Need fairness + measurement | Per-flow sojourn tracking | DRR + per-flow CoDel | Per-flow sojourn time |
The measurement signal is the thread that connects the entire arc. Each generation’s failure is a measurement failure: FIFO has no signal, RED has the wrong signal, deep buffers degrade the loss signal, CoDel finds the right signal, and FQ-CoDel applies it per flow.
8.8.2 Three Design Principles Applied Across the Arc
Disaggregation appears three times. Demers, Keshav, and Shenker (1989) applied disaggregation by separating per-flow scheduling from aggregate buffering — creating the per-flow queue abstraction. Floyd and Jacobson (1993) deliberately rejected disaggregation, keeping RED aggregate to avoid per-flow state overhead. FQ-CoDel (2018) applied disaggregation twice: separating flow scheduling (DRR) from delay management (CoDel), and separating each flow’s congestion state from the aggregate.
Closed-loop reasoning is the engine of AQM evolution. FIFO tail-drop is open-loop — no measurement, no feedback, no control. RED introduced the first closed loop (EWMA → probabilistic drops), but the measurement signal was degraded. CoDel applied closed-loop reasoning with a faithful signal (sojourn time), creating a tight loop that converges reliably. The progression from open-loop to degraded-loop to faithful-loop mirrors the E→M→B decomposition: as measurement quality improves, belief tracks environment, and the control loop stabilizes.
Decision placement remains consistently at the gateway throughout the arc. Every queue management system places the drop/schedule decision at the router, not at the sender. This is forced by the constraint: only the gateway has a unified view of the queue. Senders see end-to-end signals (loss, RTT); the gateway sees the queue directly. Nagle (1984) identified this: “decisions about the duration and magnitude of transient congestion to be allowed at the gateway are best made by the gateway itself.”
8.8.3 The Dependency Chain
8.8.4 Pioneer Diagnosis Table
| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 1984 | Nagle | State + Coordination | No measurement, no fairness → congestion collapse | Identified congestion collapse; proposed round-robin per-flow |
| 1989 | Demers, Keshav, Shenker | Coordination | FIFO allows monopolization | Fair queuing: per-flow isolation via virtual time |
| 1993 | Floyd, Jacobson | State | Tail-drop signals too late | RED: proactive dropping via EWMA queue length |
| 1996 | Shreedhar, Varghese | Time | WFQ’s O(log N) too expensive | DRR: O(1) fair queuing via deficit counters |
| 2011 | Gettys | State (coupling) | Deep buffers degrade TCP’s loss signal | Measured bufferbloat; catalyzed community response |
| 2012 | Nichols, Jacobson | State | Queue length is wrong signal; sojourn time is right | CoDel: self-tuning AQM via min sojourn time |
| 2018 | Hoeiland-Joergensen et al. | Coordination + State | Neither FQ nor CoDel alone sufficient | FQ-CoDel: per-flow DRR + per-flow CoDel |
8.8.5 Innovation Timeline
flowchart TD
subgraph sg1["Problem Genesis"]
A1["1984 — Nagle: congestion collapse identified (RFC 896)"]
end
subgraph sg2["Fairness Thread"]
B1["1989 — Demers/Keshav/Shenker: Fair Queuing (WFQ)"]
B2["1996 — Shreedhar/Varghese: Deficit Round-Robin (O(1))"]
B1 --> B2
end
subgraph sg3["AQM Thread"]
C1["1993 — Floyd/Jacobson: RED (queue length signal)"]
C2["1998 — RFC 2309 recommends AQM deployment"]
C3["2001 — ECN standardized (RFC 3168)"]
C1 --> C2 --> C3
end
subgraph sg4["Crisis and Resolution"]
D1["2011 — Gettys: bufferbloat identified"]
D2["2012 — Nichols/Jacobson: CoDel (sojourn time signal)"]
D3["2013 — Pan et al.: PIE (estimated delay)"]
D1 --> D2 --> D3
end
subgraph sg5["Synthesis"]
E1["2014 — FQ-CoDel becomes Linux default qdisc"]
E2["2018 — RFC 8290 standardizes FQ-CoDel"]
E3["2023 — L4S architecture (RFC 9332)"]
E1 --> E2 --> E3
end
sg1 --> sg2 --> sg3 --> sg4 --> sg5
8.9 Generative Exercises
A datacenter switch connects servers over 100 Gbps links with 5 microsecond propagation RTTs. The buffer holds 100 KB (approximately 67 packets at 1,500 bytes). Applications include distributed key-value stores requiring sub-100-microsecond tail latency and bulk MapReduce shuffles generating sustained traffic.
CoDel’s default 5 ms target is 1,000 times larger than the propagation RTT. Its 100 ms observation window spans 20,000 RTTs.
Redesign the AQM parameters for this environment. What target delay and observation window would you choose? How does the relationship between buffer depth (67 packets), BDP (approximately 62.5 KB, or 42 packets, at 100 Gbps x 5 microseconds), and the number of concurrent flows (potentially thousands during an incast event) change the measurement signal? Would ECN marking (Ramakrishnan et al. 2001) (which avoids the throughput cost of drops) be more appropriate than dropping in this environment? How does DCTCP’s (Alizadeh et al. 2010) use of ECN marking rate — rather than binary loss — change the information content of the congestion signal?
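Some of these numbers can be worked out directly; the following arithmetic is a starting point for the exercise, not a prescribed answer:

```python
LINK_BPS = 100e9     # 100 Gbps link
RTT_S = 5e-6         # 5 microsecond propagation RTT
PKT_BYTES = 1500
BUF_BYTES = 100_000  # 100 KB switch buffer

bdp_bytes = LINK_BPS * RTT_S / 8          # bandwidth-delay product
bdp_pkts = bdp_bytes / PKT_BYTES
buf_pkts = BUF_BYTES / PKT_BYTES
drain_time_s = BUF_BYTES * 8 / LINK_BPS   # sojourn of a completely full buffer

print(f"BDP: {bdp_bytes:.0f} B = {bdp_pkts:.0f} packets")        # 62500 B = 42 packets
print(f"Buffer: {buf_pkts:.0f} packets = {buf_pkts/bdp_pkts:.1f}x BDP")
print(f"Full-buffer sojourn: {drain_time_s*1e6:.0f} us")         # 8 us
```

One consequence worth noticing: even a completely full 100 KB buffer adds only 8 microseconds of sojourn time, roughly 625 times below CoDel's default 5 ms target, so an unmodified CoDel would never fire in this environment. That is precisely the parameter mismatch the exercise asks you to repair.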
A home router runs FQ-CoDel on its WAN uplink. User A browses the web (5 HTTP connections). User B runs a WireGuard VPN tunnel carrying 20 internal applications — video streaming, file sync, SSH, and email. All of User B’s traffic has the same outer 5-tuple (source IP, destination IP, protocol=UDP, source port, destination port=51820).
FQ-CoDel hashes packets by 5-tuple. All 20 of User B’s applications map to a single flow bucket. User A’s 5 connections map to 5 buckets.
Analyze the fairness consequences: what fraction of link capacity does each user get under DRR? How does CoDel interact with the aggregated tunnel — does the sojourn time reflect the tunnel’s internal application mix, or just the aggregate? Propose a modification that restores per-application fairness without breaking the VPN’s encryption. Consider Cake’s per-host fairness approach: would grouping by source IP (User A vs. User B) solve the problem? What tradeoff does this create between per-flow and per-host fairness?
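The collision the exercise describes can be demonstrated directly. The hash function here is a stand-in for fq_codel's seeded flow hash (the kernel uses a Jenkins hash), and all addresses and ports are illustrative:

```python
def flow_bucket(five_tuple, buckets=1024):
    """Illustrative stand-in for fq_codel's seeded flow hash."""
    return hash(five_tuple) % buckets

# User B: 20 internal applications, but every packet carries the
# tunnel's single outer 5-tuple after encryption.
tunnel = ("203.0.113.7", "198.51.100.9", "UDP", 48213, 51820)
b_buckets = {flow_bucket(tunnel) for _ in range(20)}

# User A: 5 distinct HTTP connections, each with its own source port.
a_buckets = {flow_bucket(("203.0.113.8", "198.51.100.10", "TCP", p, 443))
             for p in range(50000, 50005)}

print(len(b_buckets))  # 1: all 20 applications collapse into one DRR queue
print(len(a_buckets))  # up to 5: each connection gets its own queue
```

With roughly 6 active buckets, DRR gives User A about 5/6 of the link and User B about 1/6, regardless of how many applications the tunnel carries; the tunnel's CoDel instance sees only the aggregate sojourn time of all 20 applications mixed together.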
L4S (Low Latency, Low Loss, Scalable Throughput) (De Schepper and Briscoe 2023) proposes a dual-queue architecture: a “classic” queue (managed by CoDel for legacy TCP) and an “L4S” queue (for scalable congestion controls like Prague/DCTCP (Alizadeh et al. 2010) that respond to ECN marks proportionally rather than halving their window). The L4S queue targets sub-millisecond delay by marking packets at very low queue depths.
Analyze this design through the four invariants. What State does the dual-queue system maintain? How does the Coordination between the two queues work — who decides which queue a packet enters? What happens at the Interface — can a non-L4S flow game the system by marking itself as L4S to get lower latency? What Time dynamics emerge when L4S flows (microsecond-scale response) share a link with classic TCP flows (RTT-scale response)?
Predict: if L4S succeeds, what becomes the next measurement signal problem? Consider that L4S assumes the sender’s congestion control is honest about its response to ECN marks. What happens when it is not?
Sojourn time is the time a packet spends in the queue, from arrival to departure. CoDel measures sojourn time rather than queue length because sojourn time directly reflects the delay packets experience, while queue length depends on link speed (a 100-packet queue at 10 Gbps causes far less delay than the same queue at 100 Mbps).
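The footnote's comparison is easy to verify numerically:

```python
def queue_delay_ms(pkts, pkt_bytes, link_bps):
    """Delay contributed by a standing queue of `pkts` packets."""
    return pkts * pkt_bytes * 8 / link_bps * 1e3

# The same 100-packet queue produces a 100x difference in delay:
print(queue_delay_ms(100, 1500, 10e9))   # 0.12 ms at 10 Gbps
print(queue_delay_ms(100, 1500, 100e6))  # 12.0 ms at 100 Mbps
```

Queue length alone cannot distinguish these two cases; sojourn time reports the 100x difference directly, which is why it is the faithful signal.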