8  Queue Management


8.1 The Anchor: Finite Buffer at Every Forwarding Hop

Every router, switch, and access point faces the same physical constraint: link capacity is finite, but packet arrivals are bursty and often exceed capacity instantaneously. A buffer absorbs the excess, converting would-be loss into delay. Without buffers, every burst causes packet loss. With them, bursts cause queuing delay. The entire design space of queue management lives in this tradeoff — accepting some delay to minimize loss, or accepting some loss to minimize delay, or attempting to do both.

This constraint is architectural, not temporary. Routers have finite memory. Links have finite bandwidth. Traffic is stochastic. The buffer is where delay and loss are born, and the queue is where the network’s congestion signal is generated. Queue management is the network-interior counterpart to transport — a peer system that creates the congestion environment transport operates in, and in turn receives signals from transport that reshape its own design.

The binding constraint is finite buffer at every forwarding hop. Packets arrive faster than they depart. The queue manager must continuously answer four decision problems:

  1. When to drop — before the buffer fills (active queue management) or after (tail-drop)?
  2. Which packet to serve next — first-in-first-out, or per-flow scheduling?
  3. How to signal congestion to transport — through loss, through delay, or through explicit marking?
  4. What to measure — queue length, sojourn time, or arrival rate?

John Nagle, observing the first congestion collapse in 1984, framed the problem that would drive three decades of queue management research:

“If the retransmit interval is not set correctly, the host can rapidly degrade the performance of the net by retransmitting traffic the net is trying to deliver.” — Nagle, 1984 (Nagle 1984)

Looking back nearly three decades later, Kathleen Nichols and Van Jacobson identified the specific failure in the measurement signal that had plagued every previous approach:

“The key insight is that the quantity to control is not queue length but the time packets spend in the queue — the sojourn time.” — Nichols and Jacobson, 2012 (Nichols and Jacobson 2012)

Nichols and Jacobson’s framing was retrospective — they identified what Floyd and Jacobson’s RED (1993) had gotten wrong, and what two decades of deployment failure had demonstrated. The pioneers who built FIFO queues, invented RED, and deployed oversized buffers had no such framework. They were solving an engineering problem one measurement signal at a time.


8.2 Act 1: “It’s 1984. The Internet Suffers Congestion Collapse.”

The ARPANET has grown from four nodes to dozens. The transition from NCP to TCP/IP is complete. Gateways (the predecessors of modern routers) connect heterogeneous networks with different capacities. John Nagle, working at Ford Aerospace, observes a catastrophic failure mode: when traffic exceeds a gateway’s capacity, throughput collapses to near zero while the network remains fully loaded.

“Congestion control is a recognized problem in networks of store-and-forward packet switches.” — Nagle, 1984 (Nagle 1984)

What the pioneers saw: Gateways with small buffers. TCP implementations that retransmitted aggressively on timeout. Interactive applications (Telnet) sending single-byte packets with 40 bytes of header. A cooperative academic network where a few dozen hosts generated modest traffic.

What remained invisible from the pioneers’ vantage point: Buffers would grow from dozens of packets to thousands as memory became cheap. TCP would evolve loss-based congestion control that depended on drops as signals. The Internet would scale from hundreds to billions of devices, making cooperative behavior impossible to assume.

8.2.1 The Failure: FIFO Tail-Drop Under Load

Every gateway uses the same queue discipline: First-In-First-Out with tail-drop. Packets arrive, enter a single buffer in arrival order, and depart when the link is free. When the buffer fills, arriving packets are silently discarded. This is the simplest possible queue management — no per-flow state, no measurement, no intelligence.

Nagle documents two failure modes. First, the “small packet problem” — interactive applications generate 41-byte packets (40 bytes header, 1 byte payload), creating massive overhead. Second, retransmission flooding — when gateways drop packets due to full buffers, senders retransmit, generating duplicate traffic that further overloads the already-congested gateways. This creates a positive feedback loop: congestion causes drops, drops cause retransmissions, retransmissions cause more congestion.

FIFO tail-drop provides no mechanism to break this loop. The gateway lacks intelligence — it accepts packets until the buffer fills, then drops everything that arrives. It offers zero early warning, zero proportional signaling, zero fairness between flows. A single aggressive sender can fill the entire buffer, starving all other flows. The queue is a passive container, blind to its own congestion.
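The entire discipline fits in a few lines, which is precisely the point. A minimal sketch (class and method names are illustrative, not from any router implementation):

```python
from collections import deque

class TailDropQueue:
    """FIFO with tail-drop: the only 'policy' is a capacity check."""

    def __init__(self, capacity):
        self.capacity = capacity      # buffer size in packets
        self.q = deque()

    def enqueue(self, pkt):
        # The sole decision point: accept if there is room, else drop
        # silently. The sender learns nothing until its timer fires.
        if len(self.q) >= self.capacity:
            return False              # packet dropped
        self.q.append(pkt)
        return True

    def dequeue(self):
        # Serve strictly in arrival order ("order is destiny").
        return self.q.popleft() if self.q else None

# A burst that exceeds capacity: the first 8 packets are accepted,
# the remaining 4 vanish without any signal to the sender.
q = TailDropQueue(capacity=8)
results = [q.enqueue(i) for i in range(12)]
print(results.count(True), results.count(False))   # 8 4
```

There is no measurement anywhere in this code — no congestion estimate, no timing, no per-flow distinction — which is exactly the State gap analyzed next.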

8.2.2 Invariant Analysis: FIFO Tail-Drop (1984)

| Invariant | FIFO Tail-Drop Answer (1984) | Gap? |
|---|---|---|
| State | Buffer contents only — no measurement of congestion severity | No awareness of persistent vs. transient overload |
| Time | Packet arrival order — no timing intelligence | No relationship between queue delay and signaling |
| Coordination | None — “order is destiny” | No fairness; aggressive sender monopolizes buffer |
| Interface | Accept or drop (binary, at buffer boundary) | Signal arrives too late — only after buffer is full |

The State gap is the most consequential: the gateway maintains no measurement of congestion at all. It tracks only buffer occupancy — blind to whether the queue is growing, shrinking, or stable, and blind to whether current traffic is a transient burst or persistent overload. The only “signal” is binary: buffer full (drop) or buffer not full (accept). This signal arrives too late — by the time packets drop, the queue has been full for hundreds of milliseconds, and TCP senders have already injected thousands of excess packets.

8.2.3 Environment → Measurement → Belief

| Layer | What FIFO Tail-Drop Has | What’s Missing |
|---|---|---|
| Environment | True queue occupancy, arrival rates, departure rates | — |
| Measurement | Nothing — gateway observes no congestion signal | No queue length tracking, no delay measurement |
| Belief | Assume capacity is available until buffer overflows | Belief is always optimistic; collapses suddenly |

The E→M gap is structurally absent — the gateway has no measurement mechanism at all. There is no sensor, no estimator, no control loop. The gateway is an open-loop system: it accepts packets without limit until a hard physical boundary (buffer full) forces action. This is the root cause of congestion collapse — the queue management system generates no feedback signal until the catastrophe has already occurred.

8.2.4 “The Gaps Didn’t Matter… Yet.”

In 1984, the Internet had hundreds of hosts. Traffic was bursty but sparse. Gateways had small buffers (tens of packets), so even when congestion occurred, the delay was bounded. Nagle’s congestion collapse was a warning, not a crisis — it appeared under specific pathological conditions. His immediate fix, the Nagle algorithm, addressed the small-packet problem at the transport layer. But his deeper observation — that gateways needed per-flow intelligence and congestion signaling — planted the seed for two divergent research threads that would take decades to converge.

Nagle proposed that gateways should maintain separate queues per flow and serve them round-robin, creating a “firewall” between competing flows. This idea anticipated fair queuing by five years. But in 1984, per-flow state at a gateway was expensive, and the Internet was small enough that FIFO worked acceptably for most traffic.

The question Nagle left open: what measurement signal should the queue use to detect congestion and signal transport? He identified the problem. The solution required two separate innovations — one for fairness, one for measurement.


8.3 Act 2: “It’s 1989. Fair Queuing Arrives.”

8.3.1 Which Invariant Broke?

| What Changed | Before | After |
|---|---|---|
| Traffic | Cooperative academic users, modest load | Heterogeneous applications, aggressive senders |
| Problem | Congestion collapse from retransmission flooding | Unfairness — one aggressive flow monopolizes the buffer |
| Invariant | Coordination: no mechanism for fair sharing | Coordination redesigned: per-flow isolation |

The Internet is growing. TCP congestion control (Jacobson 1988) has solved the retransmission flooding problem — senders now back off on loss. But a new failure appears: under FIFO, a single bulk transfer can monopolize the queue, starving latency-sensitive flows (Telnet, DNS) that share the same link. FIFO treats all packets identically, which is fair in a strict sense (all wait equally) but catastrophically unfair in an economic sense — a user sending one packet suffers the same delay as a user sending a thousand.

Alan Demers, Srinivasan Keshav, and Scott Shenker at Xerox PARC frame the problem by analogy to operating systems:

“CPU scheduling of user processes controls the use of CPU resources… and insulates well-behaved users from ill-behaved users. We ask: can a similar approach work for network scheduling?” — Demers, Keshav, and Shenker, 1989 (Demers et al. 1989)

What the pioneers saw: The FIFO queue as a shared resource with no access control — the network equivalent of a time-sharing system with no scheduler. They saw that FIFO “inextricably intertwines” three independent allocation quantities: bandwidth, promptness (delay), and buffer space. A single ill-behaved source can “capture an arbitrarily high fraction of the bandwidth.”

What remained invisible from the pioneers’ vantage point: Fair queuing solves the coordination problem but not the measurement problem. Per-flow scheduling ensures each flow gets its fair share of bandwidth, but leaves unaddressed what happens inside each flow’s queue. A flow’s packets can still accumulate indefinitely, and the queue provides no signal to the sender about the delay those packets are experiencing.

8.3.2 Demers, Keshav, and Shenker’s Redesign: Weighted Fair Queuing

The solution applied disaggregation by separating per-flow scheduling from aggregate buffering. Instead of a single FIFO queue, the router maintains one queue per flow. A scheduler serves the queues in proportion to their assigned weights, emulating a bit-by-bit round-robin (Generalized Processor Sharing) (Parekh and Gallager 1993) using virtual time stamps.

Each packet receives a virtual finish time: the time at which it would complete service in the idealized GPS system. The scheduler always serves the packet with the earliest virtual finish time. This ensures that over any interval, each flow’s transmitted bytes are proportional to its weight — max-min fairness.

The coordination answer transforms from “order is destiny” (FIFO) to “each flow gets a fair share proportional to its weight.” No flow can increase its allocation without decreasing another’s — the allocation is Pareto optimal.
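The finish-time bookkeeping can be sketched in a few lines under a deliberate simplification: every packet is already backlogged at time zero, so GPS virtual time never advances and each flow’s finish times simply accumulate as size/weight (the general case must track virtual time as flows become active and idle). Names are illustrative:

```python
def wfq_schedule(packets, weights):
    """Order packets by GPS virtual finish time (standing-backlog case).

    packets: (flow_id, size_bytes) pairs in arrival order, all backlogged
    at time zero. Ties break by arrival order.
    """
    finish = {flow: 0.0 for flow in weights}    # last virtual finish per flow
    tagged = []
    for seq, (flow, size) in enumerate(packets):
        finish[flow] += size / weights[flow]    # F_i = F_{i-1} + size / weight
        tagged.append((finish[flow], seq, flow))
    tagged.sort()   # the scheduler always serves the earliest finish time
    return [flow for _, _, flow in tagged]

# Flow A sends jumbo packets; B and C send small ones. Equal weights.
pkts = [("A", 3000), ("A", 3000), ("B", 1000), ("B", 1000), ("C", 1000)]
print(wfq_schedule(pkts, {"A": 1, "B": 1, "C": 1}))   # ['B', 'C', 'B', 'A', 'A']
```

Although flow A’s packets arrived first, they are served last: finish times, not arrival order, decide service, so over the busy period each flow’s transmitted bytes track its weight regardless of packet size.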

Figure 8.1: This figure traces the evolution of queue management through four scheduling disciplines, each representing a different answer to the Coordination invariant (how to allocate the bottleneck’s capacity). Panel 1 (FIFO) demonstrates the baseline: first-in, first-out processing with no fairness logic, where an aggressive flow (A) sends 7 Mbps while smaller flows (B, C) receive 1 Mbps and 2 Mbps respectively. Panel 2 (Priority Queuing) introduces explicit coordination: high-priority traffic (B) receives 8 Mbps while lower priorities (A, C) starve. Panel 3 (Round-Robin) attempts fairness at the packet level, giving each flow equal scheduling opportunity, but because flow A sends larger packets (“jumbo packets”), A captures 4.2 Mbps while B receives 2.1 Mbps—fairness in packets does not translate to fairness in bytes. Panel 4 (WFQ / Deficit Round-Robin) achieves max-min fairness: each flow receives exactly 3.3 Mbps (one-third of capacity), regardless of packet size or sending pattern. Observe the progression: FIFO requires no per-flow state but produces unfairness; priority queuing adds state but sacrifices fairness for explicit priorities; round-robin adds state but fairness remains vulnerable to packet size; WFQ adds per-flow deficit counters and finally achieves proportional allocation where no reallocation can benefit one flow without harming another (Pareto optimality). The scalability tradeoff is ineliminable: FIFO is O(1) per packet (no state), while WFQ is O(log F) per packet (where F is the number of flows), explaining why high-speed Internet routers use FIFO in the core and access networks use WFQ when per-flow state is feasible.

The cost of this disaggregation is per-flow state. Each flow requires a queue, a virtual finish time, and a weight. At millions of flows per router, this becomes expensive. WFQ (Weighted Fair Queuing) requires O(log F) per-packet work to maintain sorted virtual times, where F is the number of active flows (Parekh and Gallager 1993). This motivated simpler approximations: Deficit Round-Robin (DRR) (Shreedhar and Varghese 1996) achieves similar fairness with O(1) per-packet cost by replacing virtual time with a deficit counter.
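Deficit Round-Robin’s O(1) trick replaces sorted finish times with a per-flow credit counter. A sketch of one service round (the single-round framing and all names are illustrative):

```python
from collections import deque

def drr_round(queues, quantum, deficits):
    """One Deficit Round-Robin pass: O(1) work per packet served.

    queues: flow_id -> deque of packet sizes (bytes).
    deficits: per-flow credit, persisting across rounds.
    """
    sent = []
    for flow, q in queues.items():
        if not q:
            continue
        deficits[flow] += quantum            # each visit earns one quantum
        while q and q[0] <= deficits[flow]:  # spend credit on head-of-line packets
            deficits[flow] -= q[0]
            sent.append((flow, q.popleft()))
        if not q:
            deficits[flow] = 0               # idle flows may not hoard credit

    return sent

# Flow A sends 1500-byte packets, flow B 500-byte packets.
queues = {"A": deque([1500, 1500]), "B": deque([500, 500, 500])}
deficits = {"A": 0, "B": 0}
sent = drr_round(queues, quantum=1500, deficits=deficits)
print(sent)   # [('A', 1500), ('B', 500), ('B', 500), ('B', 500)]
```

In one round each flow transmits exactly 1500 bytes despite their different packet sizes — byte-level fairness with no sorting, only a counter comparison per packet.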

8.3.3 Invariant Analysis: Fair Queuing (1989)

| Invariant | Fair Queuing Answer (1989) | Gap? |
|---|---|---|
| State | Per-flow: queue, virtual finish time, deficit counter | Expensive at scale (millions of flows) |
| Time | Virtual time emulating bit-by-bit service | Complex computation per packet |
| Coordination | Per-flow fair share; max-min fairness | Solved — each flow isolated |
| Interface | Guaranteed minimum bandwidth per flow | No delay guarantee; queue can grow within each flow |

Fair queuing closes the Coordination gap that FIFO left open. An aggressive flow can only fill its own queue — other flows are insulated. But a new gap emerges: fair queuing provides no mechanism to manage delay within each flow’s queue. A flow’s packets can accumulate indefinitely, growing the queue without triggering any congestion signal. The fairness problem is solved; the measurement problem remains.

8.3.4 Environment → Measurement → Belief

| Layer | What Fair Queuing Has | What’s Missing |
|---|---|---|
| Environment | Per-flow arrival rates, service rates | True congestion severity per flow |
| Measurement | Queue occupancy per flow | No delay measurement; no persistence detection |
| Belief | Fair share is maintained | Queue grows without bound within a flow |

The E→M gap is accidentally overlooked — the system could measure per-flow delay but omits it. Fair queuing was designed to solve unfairness; the measurement question was left to a parallel research thread.

8.3.5 “The Gaps Didn’t Matter… Yet.”

In 1989, routers had small buffers and relatively few concurrent flows. Per-flow queuing kept individual queues short by limiting each flow’s buffer share. The Internet backbone carried primarily academic traffic — cooperative, bursty, and modest in volume. The delay-management gap was invisible because buffers were small enough that per-flow queues rarely grew large.

But fair queuing’s per-flow state scaled poorly to the growing Internet core, where routers handled millions of flows per second. Operators needed a different approach — one that managed congestion at aggregate granularity.


8.4 Act 3: “It’s 1993. Floyd and Jacobson Invent RED.”

8.4.1 Which Invariant Broke?

| What Changed | Before | After |
|---|---|---|
| Scale | Hundreds of flows per router | Thousands of flows, rising fast |
| Problem | Per-flow state too expensive for core routers | Need aggregate AQM (Active Queue Management) without per-flow overhead |
| Invariant | State: no measurement of congestion in FIFO queues | State redesigned: EWMA of queue length |

Sally Floyd and Van Jacobson observe that FIFO tail-drop creates two specific pathologies at scale. First, global synchronization — when a tail-drop queue overflows, many TCP connections lose packets simultaneously, all reduce their windows at the same moment, then all ramp up together, creating oscillating waves of under- and over-utilization. Second, lock-out — a few aggressive flows monopolize the buffer, preventing other flows from ever getting packets into the queue.

“The most effective detection of congestion can occur in the gateway itself. The gateway can reliably distinguish between propagation delay and persistent queueing delay.” — Floyd and Jacobson, 1993 (Floyd and Jacobson 1993)

What the pioneers saw: A network where TCP senders cooperatively respond to packet loss as a congestion signal. Gateways that wait until the buffer fills before signaling — too late. The insight: if the gateway drops packets before the buffer fills, TCP senders receive the signal earlier and reduce their rates before congestion becomes catastrophic.

What remained invisible from the pioneers’ vantage point: Queue length is the wrong measurement signal. A long queue can mean two different things: (1) a transient burst that will clear on its own (healthy — the queue should absorb it), or (2) persistent overload requiring rate reduction (unhealthy — the queue should signal). RED conflates both. This ambiguity would plague RED for two decades.

8.4.2 Floyd and Jacobson’s Solution: Random Early Detection

RED computes an exponentially weighted moving average (EWMA) of the queue length. When the average exceeds a minimum threshold, RED begins dropping packets probabilistically — the higher the average queue length, the higher the drop probability. When the average exceeds a maximum threshold, RED drops every arriving packet. The randomization prevents global synchronization: different connections receive the drop signal at different times, preventing the waves that tail-drop creates.
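The two-threshold drop law is compact. A sketch of the arrival-time decision, with illustrative parameter defaults (not recommended settings) and omitting the count-based probability correction of the 1993 paper:

```python
import random

class RedQueue:
    """RED drop decision: EWMA of queue length with probabilistic early drop."""

    def __init__(self, w_q=0.002, min_th=5, max_th=15, max_p=0.1):
        self.w_q, self.min_th, self.max_th, self.max_p = w_q, min_th, max_th, max_p
        self.avg = 0.0          # EWMA of instantaneous queue length

    def on_arrival(self, queue_len):
        """Return True if the arriving packet should be dropped."""
        # The filter: avg tracks persistent occupancy, smoothing momentary spikes.
        self.avg = (1 - self.w_q) * self.avg + self.w_q * queue_len
        if self.avg < self.min_th:
            return False        # below min_th: accept everything
        if self.avg >= self.max_th:
            return True         # above max_th: drop everything
        # Between thresholds: drop probability rises linearly toward max_p,
        # so different connections see drops at different, random times.
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return random.random() < p
```

Every constant in the signature (w_q, min_th, max_th, max_p) must be tuned to the link rate, RTT, and traffic mix — the burden that ultimately blocked deployment.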

Floyd and Jacobson applied closed-loop reasoning — the gateway now runs a feedback loop between its measurement (queue length) and its action (probabilistic dropping). The loop’s purpose: keep the average queue length between the two thresholds, signaling congestion early enough that TCP senders can reduce their rates before the buffer overflows.

But the loop’s measurement signal is degraded. Consider two scenarios on a 10 Mbps link. In scenario one, a video application sends a burst that fills 40 queue slots and drains in under 50 milliseconds. In scenario two, two TCP flows each sending at 5 Mbps create a persistent queue of 40 packets. RED’s EWMA reaches 30 packets in both cases. RED makes the same drop decision for both, but they require opposite responses: the burst needs no drops (the queue will drain on its own), while the persistent overload needs drops (to reduce the sending rate).

8.4.3 Invariant Analysis: RED (1993)

| Invariant | RED Answer (1993) | Gap? |
|---|---|---|
| State | EWMA of aggregate queue length | Wrong signal: conflates burst with overload |
| Time | EWMA filtering targets congestion persisting several RTTs | Filter time constant hard to tune |
| Coordination | Gateway drops unilaterally, proportional to queue depth | No per-flow fairness |
| Interface | Probabilistic drop/ECN mark | Signal proportional to wrong metric |

The fundamental problem is the State answer: queue length contains information about buffer occupancy but not about congestion persistence. A queue of 50 packets caused by a brief burst is healthy — the queue will drain. A queue of 50 packets caused by sustained overload is unhealthy — it will grow without intervention. RED’s EWMA conflates these two cases.

Additionally, RED has four tuning parameters: the EWMA weight (w_q), minimum threshold (min_th), maximum threshold (max_th), and maximum drop probability (max_p). Optimal values depend on link capacity, RTT, and traffic mix. In heterogeneous networks with dynamic link rates, no single tuning works everywhere. Floyd and Jacobson themselves acknowledged that optimal threshold values were “left as a question for future research.”

8.4.4 Environment → Measurement → Belief

| Layer | What RED Has | What’s Missing |
|---|---|---|
| Environment | True queue occupancy, arrival/departure dynamics | Congestion persistence (burst vs. overload) |
| Measurement | EWMA of queue length | Signal conflates transient and persistent congestion |
| Belief | “High average queue = persistent congestion” | Belief is often wrong; acts on bursts, misses slow-onset overload |

The E→M gap is accidentally noisy — the EWMA signal is an honest but imperfect proxy for congestion. Better estimation (different filter parameters) helps in some scenarios but fails in others. The fundamental problem is that queue length is the wrong quantity to measure for detecting persistent congestion.

8.4.5 “The Gaps Didn’t Matter… Yet.”

RED received an IETF recommendation (Braden et al. 1998) that “RED or some equivalent should be used by default in routers.” Research papers refined RED’s parameters for over a decade. Yet RED was rarely deployed in production. Network operators found it “too difficult to configure, especially in an environment with dynamic link rates” (Baker and Fairhurst 2015). The IETF eventually withdrew the specific recommendation for RED, acknowledging that its tuning burden prevented deployment.

RED’s failure was not conceptual — early detection is the right idea. The failure was in the measurement signal. Queue length is a spatial proxy for a temporal phenomenon (congestion persistence). The right signal would need to measure time directly.

Meanwhile, a different problem was growing. Router manufacturers, chasing throughput benchmarks, installed larger and larger buffers.


8.5 Act 4: “It’s 2011. Gettys Rediscovers Bufferbloat.”

8.5.1 Which Invariant Broke?

| What Changed | Before | After |
|---|---|---|
| Hardware | Router buffers sized at 1× BDP (tens of packets) | Cheap DRAM: buffers sized at 10–50× BDP (thousands of packets) |
| Problem | Occasional congestion at modest buffer depths | Persistent multi-second latency invisible to TCP |
| Invariant | State: TCP’s loss signal degraded by deep buffers | Coupling failure between queue and transport measurement |

Jim Gettys (Gettys and Nichols 2012), a veteran systems engineer at Bell Labs and Alcatel-Lucent, notices that his home Internet connection has become unusable for interactive applications — SSH sessions lag, video calls stutter, web pages load slowly — despite a nominally fast broadband link. Measurements reveal the cause: 3 or more seconds of queuing delay in his home router’s buffer. He names the phenomenon bufferbloat.

Figure 8.2: TCP’s congestion window (blue) represents the sender’s belief about available capacity; the queue occupancy (red) represents reality at the bottleneck. Panel A shows a well-sized buffer (1× BDP): frequent loss keeps belief tightly coupled to reality, and peak RTT stays near 20 ms. Panel B shows a bloated buffer (50× BDP): loss arrives too late, belief diverges from reality for seconds at a time, and RTT inflates above 500 ms.

What the pioneers saw (retrospectively): Memory became cheap. Router manufacturers sized buffers to maximize throughput benchmarks — link utilization is easy to measure and market. A 100 Mbps link with a 1,000-packet buffer (1.5 MB) can absorb any TCP burst without loss, keeping the link fully utilized. But that buffer holds 120 milliseconds of packets. TCP, using loss as its congestion signal, interprets the absence of drops as “no congestion — send faster.” The congestion window grows unchecked. The buffer fills completely before a single drop occurs. By then, every packet in the buffer experiences 120+ milliseconds of queuing delay.

What remained invisible until Gettys measured it: The coupling failure between queue management and transport. TCP’s feedback loop depends on loss to signal congestion. The queue’s feedback (implicitly, through tail-drop) depends on buffer overflow. When the buffer is 50 times the bandwidth-delay product, these two feedback mechanisms decouple completely. TCP receives no loss signal for seconds. The sender’s belief (“I can send faster”) diverges from the environment’s reality (“the queue is full, latency is catastrophic”) by orders of magnitude.

8.5.2 The Concrete Numbers

A home router bottleneck at 10 Mbps with a 100-packet buffer (150 KB):

  • Queue delay at full buffer: (100 x 1,500 bytes) / 10 Mbps = 120 milliseconds
  • With a 50 ms propagation RTT, total RTT becomes 170 ms
  • An interactive application (voice frame every 20 ms) experiences 6x its frame period in queue delay alone

Scale the buffer to 1,000 packets (1.5 MB), common in commodity routers:

  • Queue delay: 1.2 seconds
  • TCP sees no loss during this entire interval — cwnd grows toward infinity
  • Interactive applications become completely unusable
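Both scenarios reduce to the same one-line computation — a full drop-tail buffer adds its size in bits divided by the link rate. A minimal helper (names illustrative):

```python
def queue_delay_s(buffer_pkts, pkt_bytes=1500, link_bps=10e6):
    """Worst-case queuing delay of a full buffer, in seconds."""
    return buffer_pkts * pkt_bytes * 8 / link_bps

print(queue_delay_s(100))    # 0.12 -> the 120 ms case above
print(queue_delay_s(1000))   # 1.2  -> the bufferbloat case
```

Note that the delay scales linearly with buffer size but inversely with link rate: the same 1,000-packet buffer that adds 1.2 seconds at 10 Mbps adds only 12 milliseconds at 1 Gbps, which is why a fixed buffer size cannot be right across a range of link rates.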

8.5.3 The Closed-Loop Failure

TCP’s congestion control is a negative feedback loop: send faster → observe loss → back off → send slower → queue drains → send faster. This loop is stable when the feedback signal (loss) arrives within a few RTTs. Bufferbloat breaks the loop by inflating the time constant. If the buffer adds 1.2 seconds of delay, the feedback loop’s period stretches from milliseconds to seconds. TCP’s AIMD sawtooth becomes a slow oscillation between extremes: the sender ramps for seconds (while the buffer fills silently), receives a burst of drops, halves its window, the buffer drains, and the cycle repeats. Latency oscillates between near-zero and 1+ seconds, never stabilizing.

Figure 8.3: Two independently designed systems meet at a link and couple bidirectionally: transport sends packets at a rate determined by its congestion window (a belief about available capacity), while queue management absorbs packets into a buffer, decides which packets get serviced, and provides signals back to transport via loss or delay. The feedback loop operates as follows: when the sender’s rate R exceeds link capacity C, the queue fills. When the queue is full, packets drop. Transport observes loss via missing ACKs, infers congestion, and reduces its sending rate via AIMD (additive-increase, multiplicative-decrease). The loop closes: higher loss → lower R → queue drains → loss decreases. This control loop should stabilize the system at an equilibrium sending rate. However, it rests on a critical temporal assumption: the loss signal must reach transport quickly enough for control to remain stable. At high speeds (100 Gbps), the congestion window spans many thousands of packets, and a single drop per RTT amounts to a loss rate of 0.01% or less: a signal far too sparse and too late to constrain the sending rate. The queue fills faster than transport can reduce its rate, producing the pathological condition called bufferbloat: the measurement signal (loss) is so sparse that the queue grows enormous before transport reacts. Buffering increases, latency spikes dramatically, and the feedback mechanism designed to stabilize the loop destabilizes it instead. The resolution requires redesigning one or both sides of the interface: over-provisioning so that R < C always, using active queue management to keep queues small, or replacing loss-based feedback with denser signals like ECN marks.

8.5.4 Four-Invariant Diagnosis

Bufferbloat is a coupling failure across two systems — queue management and transport — diagnosed through the State invariant:

| Layer | Queue System Sees | Transport System Sees |
|---|---|---|
| Environment | Buffer full, 1.2 s of queued packets | Packets being delivered (slowly) |
| Measurement | Queue depth (but takes no action until overflow) | Loss (but none arrives because buffer absorbs everything) |
| Belief | “Operating normally — buffer not full” | “No congestion — increase sending rate” |

Both systems have optimistic beliefs because both measurement signals are degraded. The queue management system has no AQM — it uses FIFO tail-drop, generating no signal until the buffer overflows. The transport system relies on loss, which arrives only after the buffer fills completely. The gap between environment (catastrophic latency) and belief (send faster) is where bufferbloat lives.

8.5.5 Why Bufferbloat Happened

Bufferbloat was not a new technical problem — Nagle identified the underlying dynamic (congestion collapse from mismatched feedback) in 1984. What changed was economics. DRAM prices dropped by 1,000x between 1990 and 2010. Router manufacturers, competing on throughput benchmarks, installed the cheapest way to improve utilization: more buffer. Throughput is easy to measure and market; latency is harder to measure and less directly tied to user satisfaction scores. The incentive was misaligned: the metric that drove purchasing decisions (throughput) was orthogonal to the metric that drove user experience (latency).

Gettys’ contribution was not discovering a new algorithm — it was measuring the real-world impact of the coupling failure and catalyzing the community to fix it. His blog posts, IETF engagement, and measurement campaigns created the urgency that led to CoDel.


8.6 Act 5: “It’s 2012. Nichols and Jacobson Fix the Measurement.”

8.6.1 Which Invariant Broke?

| What Changed | Before | After |
|---|---|---|
| Diagnosis | RED’s queue-length signal is wrong | Need a signal that distinguishes burst from overload |
| Problem | Bufferbloat persists because RED’s measurement fails | Sojourn time replaces queue length |
| Invariant | State: measurement signal redesigned | From spatial proxy to temporal truth |

Kathleen Nichols and Van Jacobson — the same Jacobson who co-invented RED in 1993 — identify the fundamental flaw in every previous AQM approach. Queue length, whether raw or EWMA-filtered, “contains no information at all about packet demand or network load.” A queue grows identically from a harmless burst that drains in milliseconds and from persistent overload that drains only when senders back off. Queue length conflates the two.

“CoDel is parameterless. One of the weaknesses in the RED algorithm is that it is too difficult to configure, especially in an environment with dynamic link rates.” — Nichols and Jacobson, 2012 (Nichols and Jacobson 2012)

What the pioneers saw: Nineteen years of RED deployment failure. The tuning burden that prevented operators from enabling RED. The bufferbloat crisis that demonstrated RED had not solved the underlying problem. And a simple observation: the quantity that matters for interactive applications is not how many packets are in the queue, but how long each packet waits.

What remained invisible from the pioneers’ vantage point: CoDel solves the measurement problem for a single aggregate queue. An aggressive flow can still dominate the queue, causing CoDel to drop packets from innocent flows that happen to share the same buffer. The fairness problem — solved by Demers, Keshav, and Shenker in 1989 — remains unsolved in the AQM context. The two solutions would need to be combined.

8.6.2 Nichols and Jacobson’s Redesign: CoDel

CoDel (Controlled Delay) measures sojourn time — the time each packet spends in the queue, computed as the difference between dequeue time and enqueue time. At each packet departure, CoDel records the sojourn time and tracks the minimum sojourn time over a sliding 100-millisecond window. The minimum is the key: it reveals whether the queue drained at any point during the interval.

  • Transient burst: Packets arrive in a burst, queue depth spikes, but packets dequeue quickly. The minimum sojourn time stays below the 5 ms target. CoDel takes no action — the queue is “good.”
  • Persistent overload: Packets accumulate continuously. The minimum sojourn time exceeds the 5 ms target for the entire 100 ms interval. CoDel enters the dropping state — the queue is “bad.”

The dropping rate follows a control law: the interval between successive drops shrinks as interval/sqrt(count), where count is the number of drops since entering the dropping state. This generates a linearly increasing drop rate that aligns with TCP’s inverse-square-root relationship between loss rate and throughput, ensuring the feedback signal converges rapidly.

Nichols and Jacobson applied closed-loop reasoning with a fundamentally different measurement signal. RED’s loop measured space (queue length) and acted on a spatial threshold. CoDel’s loop measures time (sojourn time) and acts on a temporal threshold. The temporal signal directly captures what matters for interactive applications: how long packets wait. It is independent of link rates, buffer sizes, and traffic patterns — the same 5 ms target works from kilobits to gigabits per second.

Figure 8.4: CoDel (Controlled Delay) represents a fundamental shift in Active Queue Management (AQM) measurement signals. Rather than monitoring queue length—which conflates transient bursts with persistent overload—CoDel measures sojourn time: the time a packet spends from enqueue to dequeue. This captures latency directly, the actual symptom of congestion that applications experience. The key insight is that during a brief traffic burst, minimum sojourn time remains low because packets dequeue quickly; during persistent overload, minimum sojourn time rises because the queue is always busy. By tracking the minimum sojourn time over a 100-millisecond observation window, CoDel distinguishes the two phenomena that previous algorithms conflated. The three panels show the evolution of CoDel’s control loop as traffic intensity increases. Panel A depicts light load with random arrivals: sojourn times are brief (< 5 ms target), and CoDel makes no drops because the minimum sojourn is low—the queue is clearing. Panel B shows moderate load where sojourn time approaches the target: CoDel begins dropping packets at a low rate, signaling congestion to TCP. Panel C shows heavy persistent overload: minimum sojourn exceeds the target by significant margins, and CoDel tightens the spacing between drops following drop_interval = interval / √(drop_count), producing a linearly increasing drop rate that backpressures senders without overwhelming the network with drops. The signal is gentle and self-regulating: if drops work and congestion clears, minimum sojourn falls below target and CoDel stops dropping immediately.

8.6.3 Invariant Analysis: CoDel (2012)

| Invariant | CoDel Answer (2012) | Gap? |
|---|---|---|
| State | Aggregate + per-packet timestamps; min sojourn over 100 ms | Correct signal for persistence detection |
| Time | Real-time sojourn (5 ms target / 100 ms window) | Self-tuning; no parameters to configure |
| Coordination | Gateway drops/marks when persistent delay detected | No per-flow fairness — aggregate queue |
| Interface | Drop or ECN mark; signal proportional to delay persistence | Clear, timely signal to transport |

CoDel closes the State gap that RED left open: sojourn time directly measures the phenomenon (queuing delay) that needs controlling. The measurement signal is faithful — it reports exactly what interactive applications care about. CoDel is self-tuning: the 5 ms target and 100 ms interval are the only parameters, and they work across heterogeneous networks without operator tuning. This eliminated the deployment barrier that killed RED.

8.6.4 Environment → Measurement → Belief

| Layer | What CoDel Has | What’s Missing |
|---|---|---|
| Environment | Per-packet queuing delay | Per-flow delay (aggregate only) |
| Measurement | Min sojourn over 100 ms window | Correct; separates burst from overload |
| Belief | “If min sojourn > 5 ms for full interval, overload is persistent” | Faithful belief; acts only on persistent congestion |

The E→M gap is effectively closed for the aggregate: CoDel’s measurement (min sojourn) faithfully tracks the environment (queuing delay persistence). But the measurement is aggregate — it reflects the combined behavior of all flows sharing the queue. If one aggressive flow creates persistent delay, CoDel drops packets from the aggregate, which hits innocent packets from well-behaved flows. The Coordination gap remains.

8.6.5 “The Gaps Didn’t Matter… Yet.”

CoDel is dramatically effective at managing aggregate delay. On links with homogeneous traffic (similar flows sharing a single bottleneck), CoDel keeps latency near the 5 ms target regardless of buffer size. But on links with heterogeneous traffic — a bulk download sharing a queue with a VoIP call — CoDel exposes the VoIP call to the download’s queue pressure. Both flows share the same CoDel instance, and CoDel’s drops land randomly across flows.

The solution requires combining CoDel’s measurement innovation with fair queuing’s coordination innovation. The two threads — one tracing back to 1989 (fairness), the other to 2012 (measurement) — need to converge.


8.7 Act 6: “It’s 2018. FQ-CoDel Synthesizes Fairness and Measurement.”

8.7.1 Which Invariant Broke?

| What Changed | Before | After |
|---|---|---|
| Traffic mix | Homogeneous flows sharing aggregate queue | Heterogeneous: bulk downloads alongside interactive traffic |
| Problem | Aggregate CoDel drops from innocent flows | Per-flow isolation + per-flow AQM needed |
| Invariant | Coordination: no per-flow fairness in CoDel | Both Coordination and State redesigned |

Toke Hoeiland-Joergensen, Paul McKenney, Dave Taht, Jim Gettys, and Eric Dumazet (Hoeiland-Joergensen et al. 2018) combined the two innovations that the queue management field had developed independently over three decades. Fair queuing (1989) solves the Coordination problem: per-flow isolation ensures each flow gets its fair share. CoDel (2012) solves the State problem: sojourn-time measurement detects persistent delay and generates timely signals. Neither alone is sufficient. FQ-CoDel combines both.

8.7.2 The Synthesis: FQ-CoDel

FQ-CoDel hashes each packet’s 5-tuple (source IP, destination IP, protocol, source port, destination port) into one of 1,024 flow buckets. Each bucket has its own queue. A Deficit Round-Robin (DRR) scheduler serves the queues in turn, ensuring each flow gets fair access to the link with O(1) per-packet cost. Within each queue, a CoDel instance independently tracks minimum sojourn time and drops packets when persistent delay is detected.
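The hashing and scheduling steps can be sketched roughly as follows. This is an illustrative simplification: the Linux implementation uses a salted Jenkins hash rather than SHA-1, and its DRR bookkeeping is more involved; the function names here are invented for exposition.

```python
import hashlib

NUM_BUCKETS = 1024   # FQ-CoDel's default number of flow buckets
QUANTUM = 1514       # bytes of credit per DRR round (roughly one MTU)

def bucket_of(src_ip, dst_ip, proto, sport, dport):
    # Stand-in for the kernel's salted 5-tuple hash; any uniform
    # hash illustrates the bucket assignment.
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:4], "big") % NUM_BUCKETS

def drr_round(queues, deficits, out):
    """One Deficit Round-Robin pass: each backlogged bucket earns one
    quantum of byte credit, then sends head-of-line packets while the
    credit covers them. Per-packet work is O(1) amortized."""
    for b, q in queues.items():
        if not q:
            continue
        deficits[b] += QUANTUM
        while q and q[0] <= deficits[b]:
            deficits[b] -= q[0]
            out.append((b, q.pop(0)))     # "transmit" the packet
        if not q:
            deficits[b] = 0               # idle flows forfeit leftover credit
```

In full FQ-CoDel, each bucket's dequeue would additionally pass through that bucket's own CoDel instance; that step is omitted here.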

The combination achieves what neither component alone can:

  • Fair queuing without CoDel: Each flow gets its fair share, but queues within each flow can grow without bound. No delay signal to transport.
  • CoDel without fair queuing: Delay is controlled for the aggregate, but an aggressive flow’s packets trigger drops from innocent flows sharing the same queue.
  • FQ-CoDel: Each flow is isolated. An aggressive flow builds only its own queue. Its CoDel instance detects persistent delay and drops only its packets. Other flows’ queues remain short and responsive. The bulk download experiences drops proportional to its own congestion; the VoIP call experiences near-zero queuing delay.

FQ-CoDel includes a critical optimization for sparse flows — flows that transmit occasionally (DNS queries, SSH keystrokes, VoIP frames). When a packet arrives at an empty queue, FQ-CoDel gives it priority, scheduling it ahead of packets from flows with accumulated backlogs. This ensures that interactive traffic achieves near-zero queuing delay even on a heavily loaded link.
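The sparse-flow optimization is implemented with two scheduling lists. The sketch below is a simplified illustration of RFC 8290's mechanism: real FQ-CoDel demotes a new flow only after it exhausts a full DRR quantum, and runs CoDel on every dequeue; both details are omitted, and the class name is invented.

```python
from collections import deque

class FqSchedulerSketch:
    """Two-list scheduling: flows that just became backlogged go on
    new_flows and are served before long-running flows on old_flows,
    giving sparse traffic (DNS, keystrokes, VoIP) near-immediate dequeue."""

    def __init__(self):
        self.queues = {}              # bucket -> deque of packets
        self.new_flows = deque()
        self.old_flows = deque()

    def enqueue(self, bucket, pkt):
        q = self.queues.setdefault(bucket, deque())
        if not q:                     # flow was idle: promote it
            self.new_flows.append(bucket)
        q.append(pkt)

    def dequeue(self):
        # Serve new flows strictly before old ones; a served flow that
        # is still backlogged moves to (or stays on) old_flows.
        for flows in (self.new_flows, self.old_flows):
            while flows:
                b = flows.popleft()
                q = self.queues[b]
                if q:
                    pkt = q.popleft()
                    if q:
                        self.old_flows.append(b)
                    return b, pkt
        return None
```

A bulk transfer cycling on old_flows cannot delay a freshly arrived DNS query: the query's bucket lands on new_flows and is dequeued first.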

FQ-CoDel applied disaggregation by separating per-flow scheduling (DRR) from per-flow delay management (CoDel). It applied closed-loop reasoning with sojourn time as the measurement signal, independently per flow. It applied decision placement at the gateway — each router enforces fairness and delay control unilaterally, without negotiation with senders.

8.7.3 Invariant Analysis: FQ-CoDel (2018)

| Invariant | FQ-CoDel Answer (2018) | Gap? |
|---|---|---|
| State | Per-bucket (1,024): DRR deficit + CoDel state + timestamps | Scales to thousands of flows; hash collisions rare |
| Time | Per-flow sojourn (5 ms target / 100 ms window) | Self-tuning across link speeds |
| Coordination | DRR for per-flow fairness + CoDel for per-flow delay | Both fairness and delay controlled |
| Interface | Fair-share bandwidth + controlled delay bound; ECN when possible | Clear, per-flow signal to transport |

FQ-CoDel closes both the Coordination gap (opened in Act 1, addressed in Act 2, lost in Act 3) and the State gap (opened in Act 1, addressed with wrong signal in Act 3, corrected in Act 5). The synthesis required both innovations.

8.7.4 Environment → Measurement → Belief

| Layer | What FQ-CoDel Has | What’s Missing |
|---|---|---|
| Environment | Per-flow queuing delay and arrival rates | True application requirements (latency targets) |
| Measurement | Per-flow min sojourn over 100 ms window | Correct for persistence detection, per flow |
| Belief | “Each flow’s delay is independently controlled” | Faithful belief; acts only on each flow’s own congestion |

The E→M gap is closed per flow. Each CoDel instance measures only its own flow’s sojourn time, uncontaminated by other flows’ behavior. The measurement is faithful, the feedback loop is tight, and the system is self-tuning. FQ-CoDel became the default queuing discipline (qdisc) in Linux, shipping in OpenWrt, RHEL, and major distributions since 2014. Residential users deploying FQ-CoDel report latency dropping from 100+ milliseconds to single-digit milliseconds under load.


8.8 The Grand Arc: From Congestion Collapse to FQ-CoDel

8.8.1 The Evolving Anchor

| Era | Binding Constraint | State Answer | Coordination Answer | Measurement Signal |
|---|---|---|---|---|
| 1984 | Finite buffer, FIFO only | No measurement | No fairness | None (binary: full/not full) |
| 1989 | Per-flow state expensive | Per-flow queues + virtual time | Max-min fairness via WFQ | Queue occupancy per flow |
| 1993 | Scale demands aggregate AQM | EWMA of aggregate queue length | Gateway drops proportionally | Queue length (wrong signal) |
| 2011 | Cheap memory → deep buffers | TCP’s loss signal degraded | Queue-transport coupling failure | Loss arrives too late |
| 2012 | Persistence detection needed | Min sojourn time over 100 ms | Gateway drops on persistent delay | Sojourn time (right signal) |
| 2018 | Need fairness + measurement | Per-flow sojourn tracking | DRR + per-flow CoDel | Per-flow sojourn time |

The measurement signal is the thread that connects the entire arc. Each generation’s failure is a measurement failure: FIFO has no signal, RED has the wrong signal, deep buffers degrade the loss signal, CoDel finds the right signal, and FQ-CoDel applies it per flow.

8.8.2 Three Design Principles Applied Across the Arc

Disaggregation appears three times. Demers, Keshav, and Shenker (1989) applied disaggregation by separating per-flow scheduling from aggregate buffering — creating the per-flow queue abstraction. Floyd and Jacobson (1993) deliberately rejected disaggregation, keeping RED aggregate to avoid per-flow state overhead. FQ-CoDel (2018) applied disaggregation twice: separating flow scheduling (DRR) from delay management (CoDel), and separating each flow’s congestion state from the aggregate.

Closed-loop reasoning is the engine of AQM evolution. FIFO tail-drop is open-loop — no measurement, no feedback, no control. RED introduced the first closed loop (EWMA → probabilistic drops), but the measurement signal was degraded. CoDel applied closed-loop reasoning with a faithful signal (sojourn time), creating a tight loop that converges reliably. The progression from open-loop to degraded-loop to faithful-loop mirrors the E→M→B decomposition: as measurement quality improves, belief tracks environment, and the control loop stabilizes.

Decision placement remains consistently at the gateway throughout the arc. Every queue management system places the drop/schedule decision at the router, not at the sender. This is forced by the constraint: only the gateway has a unified view of the queue. Senders see end-to-end signals (loss, RTT); the gateway sees the queue directly. Nagle (1984) identified this: “decisions about the duration and magnitude of transient congestion to be allowed at the gateway are best made by the gateway itself.”

8.8.3 The Dependency Chain

flowchart TD
    A["Finite buffer\nat every hop"]:::constraint --> B["FIFO tail-drop:\nno measurement,\nno fairness"]:::failure
    B --> C["Congestion collapse:\nthroughput → zero\n(Nagle 1984)"]:::failure
    C --> D["Fair queuing:\nper-flow isolation\n(DKS 1989)"]:::fix
    C --> E["RED: early detection\nvia queue length EWMA\n(Floyd/Jacobson 1993)"]:::fix
    D --> F["Per-flow state\ntoo expensive\nat scale"]:::failure
    E --> G["Queue length\nis wrong signal:\nconflates burst\nwith overload"]:::failure
    G --> H["RED rarely deployed\ndespite 20 years\nof research"]:::failure
    H --> I["Cheap DRAM →\ndeep buffers →\nbufferbloat crisis\n(Gettys 2011)"]:::failure
    I --> J["CoDel: sojourn time\nreplaces queue length\n(Nichols/Jacobson 2012)"]:::fix
    J --> K["Aggregate CoDel:\nno per-flow fairness"]:::failure
    D --> L["DRR: O(1)\nfair queuing\n(Shreedhar 1996)"]:::fix
    K --> M["FQ-CoDel:\nDRR + CoDel per flow\n(RFC 8290, 2018)"]:::fix
    L --> M

    classDef constraint fill:#4A90D9,stroke:#2C5F8A,color:#fff
    classDef failure fill:#D94A4A,stroke:#8A2C2C,color:#fff
    classDef fix fill:#4AD94A,stroke:#2C8A2C,color:#fff

8.8.4 Pioneer Diagnosis Table

| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 1984 | Nagle | State + Coordination | No measurement, no fairness → congestion collapse | Identified congestion collapse; proposed round-robin per-flow |
| 1989 | Demers, Keshav, Shenker | Coordination | FIFO allows monopolization | Fair queuing: per-flow isolation via virtual time |
| 1993 | Floyd, Jacobson | State | Tail-drop signals too late | RED: proactive dropping via EWMA queue length |
| 1996 | Shreedhar, Varghese | Time | WFQ’s O(log N) too expensive | DRR: O(1) fair queuing via deficit counters |
| 2011 | Gettys | State (coupling) | Deep buffers degrade TCP’s loss signal | Measured bufferbloat; catalyzed community response |
| 2012 | Nichols, Jacobson | State | Queue length is wrong signal; sojourn time is right | CoDel: self-tuning AQM via min sojourn time |
| 2018 | Hoeiland-Joergensen et al. | Coordination + State | Neither FQ nor CoDel alone sufficient | FQ-CoDel: per-flow DRR + per-flow CoDel |

8.8.5 Innovation Timeline

flowchart TD
    subgraph sg1["Problem Genesis"]
        A1["1984 — Nagle: congestion collapse identified (RFC 896)"]
    end
    subgraph sg2["Fairness Thread"]
        B1["1989 — Demers/Keshav/Shenker: Fair Queuing (WFQ)"]
        B2["1996 — Shreedhar/Varghese: Deficit Round-Robin (O(1))"]
        B1 --> B2
    end
    subgraph sg3["AQM Thread"]
        C1["1993 — Floyd/Jacobson: RED (queue length signal)"]
        C2["1998 — RFC 2309 recommends AQM deployment"]
        C3["2001 — ECN standardized (RFC 3168)"]
        C1 --> C2 --> C3
    end
    subgraph sg4["Crisis and Resolution"]
        D1["2011 — Gettys: bufferbloat identified"]
        D2["2012 — Nichols/Jacobson: CoDel (sojourn time signal)"]
        D3["2013 — Pan et al.: PIE (estimated delay)"]
        D1 --> D2 --> D3
    end
    subgraph sg5["Synthesis"]
        E1["2014 — FQ-CoDel becomes Linux default qdisc"]
        E2["2018 — RFC 8290 standardizes FQ-CoDel"]
        E3["2023 — L4S architecture (RFC 9332)"]
        E1 --> E2 --> E3
    end
    sg1 --> sg2 --> sg3 --> sg4 --> sg5

Queue Management Evolution


8.9 Generative Exercises

Exercise 1: Datacenter AQM Under Microsecond Constraints

A datacenter switch connects servers over 100 Gbps links with 5 microsecond propagation RTTs. The buffer holds 100 KB (approximately 67 packets at 1,500 bytes). Applications include distributed key-value stores requiring sub-100-microsecond tail latency and bulk MapReduce shuffles generating sustained traffic.

CoDel’s default 5 ms target is 1,000 times larger than the propagation RTT. Its 100 ms observation window spans 20,000 RTTs.

Redesign the AQM parameters for this environment. What target delay and observation window would you choose? How does the relationship between buffer depth (67 packets), BDP (approximately 42 packets: 100 Gbps × 5 microseconds ≈ 62.5 KB), and the number of concurrent flows (potentially thousands during an incast event) change the measurement signal? Would ECN marking (Ramakrishnan et al. 2001) (which avoids the throughput cost of drops) be more appropriate than dropping in this environment? How does DCTCP’s (Alizadeh et al. 2010) use of ECN marking rate — rather than binary loss — change the information content of the congestion signal?

Exercise 2: Encrypted Tunnel Fairness

A home router runs FQ-CoDel on its WAN uplink. User A browses the web (5 HTTP connections). User B runs a WireGuard VPN tunnel carrying 20 internal applications — video streaming, file sync, SSH, and email. All of User B’s traffic has the same outer 5-tuple (source IP, destination IP, protocol=UDP, source port, destination port=51820).

FQ-CoDel hashes packets by 5-tuple. All 20 of User B’s applications map to a single flow bucket. User A’s 5 connections map to 5 buckets.
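The bucket collapse can be reproduced with a toy hash. The addresses and the tunnel source port below are invented for illustration, and SHA-256 stands in for the kernel's salted Jenkins hash; any uniform hash shows the same effect.

```python
import hashlib

def bucket(five_tuple, n=1024):
    # Illustrative stand-in for FQ-CoDel's 5-tuple hash.
    key = "|".join(map(str, five_tuple)).encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big") % n

# Every one of User B's 20 inner applications shares the tunnel's
# outer 5-tuple, so all of them land in a single bucket:
tunnel = ("203.0.113.7", "198.51.100.9", "udp", 51820, 51820)
buckets_b = {bucket(tunnel) for _ in range(20)}

# User A's 5 HTTP connections differ in source port, so they spread
# across buckets (barring rare collisions among 1,024 slots):
buckets_a = {bucket(("203.0.113.8", "192.0.2.10", "tcp", 49000 + i, 443))
             for i in range(5)}

print(len(buckets_b), "bucket for User B;", len(buckets_a), "for User A")
```

With 6 backlogged buckets, DRR gives each an equal share: User A gets roughly 5/6 of the uplink and User B's 20 applications split the remaining 1/6.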

Analyze the fairness consequences: what fraction of link capacity does each user get under DRR? How does CoDel interact with the aggregated tunnel — does the sojourn time reflect the tunnel’s internal application mix, or just the aggregate? Propose a modification that restores per-application fairness without breaking the VPN’s encryption. Consider Cake’s per-host fairness approach: would grouping by source IP (User A vs. User B) solve the problem? What tradeoff does this create between per-flow and per-host fairness?

Exercise 3: Predicting the Next Measurement Signal

L4S (Low Latency, Low Loss, Scalable Throughput) (De Schepper and Briscoe 2023) proposes a dual-queue architecture: a “classic” queue (managed by CoDel for legacy TCP) and an “L4S” queue (for scalable congestion controls like Prague/DCTCP (Alizadeh et al. 2010) that respond to ECN marks proportionally rather than halving their window). The L4S queue targets sub-millisecond delay by marking packets at very low queue depths.

Analyze this design through the four invariants. What State does the dual-queue system maintain? How does the Coordination between the two queues work — who decides which queue a packet enters? What happens at the Interface — can a non-L4S flow game the system by marking itself as L4S to get lower latency? What Time dynamics emerge when L4S flows (microsecond-scale response) share a link with classic TCP flows (RTT-scale response)?

Predict: if L4S succeeds, what becomes the next measurement signal problem? Consider that L4S assumes the sender’s congestion control is honest about its response to ECN marks. What happens when it is not?


  1. Sojourn time is the time a packet spends in the queue, from arrival to departure. CoDel measures sojourn time rather than queue length because sojourn time directly reflects the delay packets experience, while queue length depends on link speed (a 100-packet queue at 10 Gbps causes far less delay than the same queue at 100 Mbps).↩︎