6  Queue Management


6.1 Queue Management as a System

Queue management is not a subprocess of transport. It is a peer system—one that creates the environment transport operates in, and in turn receives signals from transport that reshape its own design. The two systems co-evolve. Transport (the previous chapter) controls the sending rate based on observed congestion. Queues (this chapter) store arriving packets that exceed the instantaneous link capacity and signal congestion back to transport. Neither can be designed well in isolation.

Every router, access point, and network interface faces a constraint: link capacity is finite, but packet arrivals are bursty and often exceed capacity instantaneously. A queue buffers excess packets, enabling the link to utilize all available capacity. Without queues, bursts would cause packet loss. With them, bursts cause delay. The design space of queue management is entirely about managing this tradeoff—accepting some delay to minimize loss, or accepting some loss to minimize delay, or attempting to do both.

Queue management is the network-interior counterpart to transport. Its engineering question: what do I do when packets arrive faster than I can forward them? Transport and queue management are two sides of the same optimization decomposition — the source subproblem (rate control) and the link subproblem (buffer management). Their coupling, and its canonical failure mode, is the subject of §6.3.

Queue management answers the four invariants distinctly from transport.

State: The router maintains a queue buffer—an array of packets waiting transmission. The environment state is the true link occupancy and upcoming arrival pattern. The measurement signal is queue length (number of packets), sojourn time (delay per packet), or estimated delay (a computed estimate). The internal belief is a decision rule: “when should I drop this packet, and when should I transmit it?” This belief must be simple enough to execute at line rate on thousands of flows. Critically, the choice of measurement signal determines whether the queue management system will succeed or fail.

Time: Queue decisions happen at packet arrival (for drop/accept) and packet departure (for scheduling). Both are microsecond-scale, much tighter than transport’s RTT-scale (milliseconds). The queue has no shared clock with senders—it infers demand from the arrival stream alone.

Coordination: A centralized scheduler at each interface decides which queued packet transmits next. In FIFO queues, this is trivial (arrival order). In fair queuing systems, the scheduler maintains per-flow state and uses proportional allocation. The key invariant: the scheduler does not negotiate with senders. It observes, decides, and enforces unilaterally.

Interface: Packets arrive; the queue either accepts or drops. For transmission, the scheduler selects a packet from the queue. The interface between queue and transport is implicit: loss and delay are the only signals. Explicitly, the queue interface is “accept this packet” or “drop this packet,” with no negotiation.

The anchor for queue management is a finite buffer at a shared link. This constraint is architectural (routers have fixed memory), not temporary. It forces every queue designer to answer: how should we fill and drain this buffer, and what signals should we send to upstream senders? Different answers lead to radically different systems.


6.2 Scheduling Disciplines: From FIFO to Weighted Fair Queuing

Who decides which packet goes next? This is the Coordination invariant applied to the queue. The evolution from FIFO to Weighted Fair Queuing represents an increasing sophistication in how fairly capacity is divided.

6.2.1 FIFO: Simplicity Without Fairness

FIFO (First-In-First-Out) is the baseline. Packets arrive in a single queue. The scheduler transmits them in arrival order. No state is needed beyond a pointer to the head and tail of the queue. No per-flow intelligence is required.

FIFO is work-conserving: the link never sits idle if packets are available. It provides no differentiation—a flow with thousands of packets in the queue monopolizes the link, while a small latency-sensitive flow waits behind them. FIFO treats all flows identically, which is fair in a strict sense (all wait equally), but unfair in an economic sense (a user sending one packet suffers the same delay as a user sending a thousand).

FIFO’s coordination answer is: “order is destiny.” No one decides anything about fairness. The result is deterministic but indifferent to application requirements.
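
FIFO’s minimal state can be made concrete. A sketch in Python (the class and capacity are illustrative, not a real router implementation):

```python
from collections import deque

class FifoQueue:
    """Drop-tail FIFO: the only state is the buffer and its capacity."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = deque()

    def enqueue(self, pkt):
        """Accept the packet, or tail-drop it if the buffer is full."""
        if len(self.buf) >= self.capacity:
            return False
        self.buf.append(pkt)
        return True

    def dequeue(self):
        """Transmit strictly in arrival order."""
        return self.buf.popleft() if self.buf else None
```

Note that no per-flow state exists anywhere: the queue cannot tell an aggressive flow’s packets from a sparse flow’s.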

6.2.2 Priority Queuing: Static Differentiation

Priority queuing maintains multiple queues at different priority levels. The scheduler always serves the highest-priority queue first (FIFO within each priority). High-priority traffic (VoIP, emergency services) gets low latency; low-priority traffic (bulk file transfer) waits.

The Coordination answer is now explicit: “high-priority traffic gets priority.” This answers the question “who decides?” with “the network operator decides (at configuration time), based on application type.”

The tradeoff is unilateral: without capacity guarantees per priority class, high-priority flows can starve low-priority ones. A sustained high-priority burst can cause low-priority packets to wait indefinitely. This is acceptable for emergency-first traffic classes; it is unacceptable as a general mechanism for fair resource sharing.
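
A strict-priority scheduler is a few lines; this sketch (levels and packet names are illustrative) also makes the starvation risk visible, since level 0 is always scanned first:

```python
from collections import deque

class PriorityScheduler:
    """Strict priority: always serve the highest-priority nonempty queue.

    Level 0 is highest priority; within a level, service is FIFO."""
    def __init__(self, levels):
        self.queues = [deque() for _ in range(levels)]

    def enqueue(self, pkt, level):
        self.queues[level].append(pkt)

    def dequeue(self):
        for q in self.queues:   # scan from highest priority down
            if q:
                return q.popleft()
        return None             # all queues empty: the link idles
```

As long as level 0 stays backlogged, the loop never reaches level 1—exactly the starvation the text describes.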

6.2.3 Round-Robin and Weighted Fair Queuing: Fairness

Round-Robin (RR) serves one packet from each queue in turn. Every flow gets its turn regardless of demand. This guarantees progress: no flow waits indefinitely.

The Coordination answer is: “take turns.” The result is fairness in terms of packet count—each flow gets equal opportunity to send. But if packets vary in size, large packets use more bandwidth than their share. A flow sending 3,000-byte jumbo packets receives twice the bandwidth of flows sending standard 1,500-byte packets, despite identical per-turn opportunities.

Weighted Fair Queuing (WFQ) generalizes RR: each flow gets a weight (reflecting its fair share of capacity). WFQ approximates an idealized bit-by-bit round-robin: the scheduler maintains a virtual clock and stamps each arriving packet with a virtual finish time (the time the packet would complete service under bit-by-bit round-robin, scaled by the flow’s weight). Packets are transmitted in order of increasing virtual finish time. This ensures that over a long interval, each flow’s transmitted bits equal its weight-based share. WFQ implements max-min fairness: maximize the minimum allocation to any flow, then recursively maximize the next-smallest. No flow can increase its allocation without decreasing another’s.

The Coordination answer is: “each flow gets a fair share proportional to its weight.” This is max-min fairness, which has strong game-theoretic properties: every flow receives at least its fair share, and no reallocation can benefit one flow without harming another (Pareto optimality).
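Max-min fairness can be computed by water-filling: repeatedly satisfy the smallest unmet demand and split the leftover capacity equally among the rest. A sketch for the unweighted case (function name and flow demands are illustrative):

```python
def max_min_fair(capacity, demands):
    """Water-filling max-min allocation (unweighted case).

    demands: dict mapping flow id -> desired rate.
    Returns dict mapping flow id -> allocated rate."""
    alloc = {f: 0.0 for f in demands}
    remaining = dict(demands)
    cap = float(capacity)
    while remaining and cap > 1e-9:
        share = cap / len(remaining)
        # flows demanding less than the equal share are fully satisfied
        satisfied = {f: d for f, d in remaining.items() if d <= share}
        if not satisfied:
            for f in remaining:          # no one is satisfiable:
                alloc[f] += share        # split what's left equally
            return alloc
        for f, d in satisfied.items():
            alloc[f] += d
            cap -= d
            del remaining[f]
    return alloc
```

On a 10 Mbps link with demands A=8, B=1, C=4, B and C get their full demands and A receives the remaining 5 Mbps—no reallocation can raise any flow without lowering a smaller one.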

The tradeoff is complexity. WFQ requires per-flow state (virtual finish timestamps) and a sorted transmission order, which becomes expensive at high line rates with thousands of flows. A simpler alternative, Deficit Round-Robin (DRR), achieves similar fairness with O(1) per-packet cost by replacing virtual-time computation with a per-flow deficit counter: each round, a flow’s counter grows by a quantum proportional to its weight, and the flow transmits packets until the counter is exhausted. Another variant, Stochastic Fairness Queuing (SFQ), uses hashing to reduce per-flow state from millions to a manageable few thousand buckets, accepting occasional collisions for scalability.
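
The deficit-counter mechanism can be sketched directly. This toy DRR (quantum and flow IDs are illustrative) shows byte-level fairness even with mixed packet sizes:

```python
from collections import deque

class DRRScheduler:
    """Deficit Round-Robin: byte-accurate fairness with O(1) work per packet."""
    def __init__(self, quantum=1500):
        self.quantum = quantum      # bytes of credit granted per round
        self.queues = {}            # flow_id -> deque of packet sizes (bytes)
        self.deficit = {}           # flow_id -> accumulated byte credit

    def enqueue(self, flow, size):
        self.queues.setdefault(flow, deque()).append(size)
        self.deficit.setdefault(flow, 0)

    def round(self):
        """One DRR round: each backlogged flow sends while it has credit."""
        sent = []
        for flow, q in list(self.queues.items()):
            if not q:
                continue
            self.deficit[flow] += self.quantum
            while q and q[0] <= self.deficit[flow]:
                size = q.popleft()
                self.deficit[flow] -= size
                sent.append((flow, size))
            if not q:
                self.deficit[flow] = 0   # idle flows keep no credit
        return sent
```

A jumbo-packet flow must save credit across rounds before it can send, so over time each flow’s transmitted bytes—not packets—converge to its share.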

6.2.4 Key Insight: Coordination Answers Constrain State

Each coordination choice requires different state. FIFO requires no per-flow state. Priority queuing requires per-priority state (typically just a queue head/tail pointer per priority). WFQ requires per-flow state (deficit counters, virtual time). This creates a scalability hierarchy: simpler coordination is cheaper per-packet, but less fair. The choice of scheduling discipline is often anchored by deployment constraints. Cellular networks can afford per-flow state at the base station (they control all scheduling); the Internet’s distributed architecture cannot (millions of flows per router). This explains why the Internet uses FIFO in the core, and WFQ primarily at access networks or in specialized deployments.

Fair Queuing Progression: The evolution from FIFO through priority queuing to weighted fair queuing reveals how fairness emerges as coordination becomes more sophisticated. FIFO’s simplicity (no per-flow state) comes at the cost of extreme unfairness (one aggressive flow monopolizes the queue). Priority queuing introduces explicit Coordination logic (“high-priority traffic gets priority”) but without capacity guarantees per class. Weighted Fair Queuing finally achieves proportional allocation by maintaining per-flow deficit counters and dynamically weighting scheduling decisions. The progression is illustrated in Figure 6.1, which shows how each discipline partitions a shared queue across competing flows.

Scheduling Disciplines: From FIFO to Fair Queuing
Figure 6.1: This figure traces the evolution of queue management through four scheduling disciplines, each representing a different answer to the Coordination invariant (how to allocate the bottleneck’s capacity). Panel 1 (FIFO) demonstrates the baseline: first-in, first-out processing with no fairness logic, where an aggressive flow (A) sends 7 Mbps while smaller flows (B, C) receive 1 Mbps and 2 Mbps respectively. Panel 2 (Priority Queuing) introduces explicit coordination: high-priority traffic (B) receives 8 Mbps while lower priorities (A, C) starve. Panel 3 (Round-Robin) attempts fairness at the packet level, giving each flow equal scheduling opportunity, but because flow A sends larger packets (“jumbo packets”), A captures 4.2 Mbps while B receives 2.1 Mbps—fairness in packets does not translate to fairness in bytes. Panel 4 (WFQ / Deficit Round-Robin) achieves max-min fairness: each flow receives exactly 3.3 Mbps (one-third of capacity), regardless of packet size or sending pattern. Observe the progression: FIFO requires no per-flow state but produces unfairness; priority queuing adds state but sacrifices fairness for explicit priorities; round-robin adds state but fairness remains vulnerable to packet size; WFQ adds per-flow deficit counters and finally achieves proportional allocation where no reallocation can benefit one flow without harming another (Pareto optimality). The scalability tradeoff is ineliminable: FIFO is O(1) per packet (no state), while WFQ is O(log F) per packet (where F is the number of flows), explaining why high-speed Internet routers use FIFO in the core and access networks use WFQ when per-flow state is feasible.

WFQ achieves max-min fairness: it maximizes the minimum allocation to any flow, then recursively maximizes the next-smallest. However, the tradeoff is inherent—scalability vs. fairness. FIFO requires O(1) per-packet time (no state), while WFQ requires O(log F) per-packet time for per-flow updates (where F is the number of flows). This explains why the Internet core uses FIFO at high speeds (routers cannot afford per-flow state for millions of flows), while access networks and specialized deployments (where traffic is less diverse or state is feasible) use WFQ or approximations (Deficit Round-Robin, Stochastic Fairness Queuing) to provide fairness guarantees.


6.3 Bufferbloat: The Canonical Coupling Failure

Bufferbloat is the canonical failure mode when queue management and transport control fail to coordinate their measurement signals. It reveals how two nominally independent systems create a coupled closed loop that oscillates destructively. Understanding bufferbloat requires decomposing the State invariant at three layers: environment, measurement signal, and internal belief.

6.3.1 The Mechanism with Concrete Numbers

A bottleneck router (e.g., a home WiFi access point) has a large buffer—100+ packets. Consider a concrete example:

  • Link capacity: 10 Mbps
  • Packet size: 1,500 bytes (typical MTU)
  • Buffer depth: 100 packets
  • Delay introduced: (100 packets × 1,500 bytes) / (10 Mbps) = 1,200,000 bits / (10,000,000 bits/second) = 0.12 seconds = 120 milliseconds
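
The arithmetic above generalizes to any buffer and link. A one-function sketch (names are illustrative):

```python
def queueing_delay_ms(buffer_packets, packet_bytes, link_mbps):
    """Worst-case time to drain a full buffer, in milliseconds."""
    bits = buffer_packets * packet_bytes * 8
    return bits / (link_mbps * 1_000_000) * 1000
```

For the chapter’s example, queueing_delay_ms(100, 1500, 10) recovers the 120 milliseconds computed above.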

TCP senders use loss as their congestion signal. As senders transmit, packets accumulate in the queue. TCP interprets the absence of loss as “no congestion—send faster.” TCP’s sending window grows unchecked. The queue fills, but because packets are not dropping, TCP never receives a loss signal. Packets accumulate until the buffer is full. Latency explodes. An interactive application (video call, online game) expects response within 100 milliseconds from the user’s perspective. Bufferbloat adds 120+ milliseconds of queue delay alone, making the application unusable. Yet throughput gains nothing: the bottleneck link was already saturated, so the standing queue buys latency, not bandwidth.

6.3.2 State Decomposition: Environment, Measurement, Belief

The three-layer State decomposition from Chapter 1 perfectly diagnoses bufferbloat.

Environment state is the true queue occupancy: 100 packets in the buffer, 120 milliseconds of delay at the bottleneck.

Measurement signal is loss. TCP observes packet loss by timing out waiting for ACKs or receiving duplicate ACKs from the receiver. But large buffers degrade this signal catastrophically. Packets queue for 120 milliseconds, but only drop after the buffer overflows, which happens when the buffer is truly saturated. By then, the delay has already grown unacceptable.

Internal belief is TCP’s congestion window (cwnd), representing the sender’s estimate of safe sending rate. With a large buffer absorbing packets without loss, TCP’s cwnd grows toward infinity. The belief diverges from reality.

The gap between environment (queue full) and belief (can send faster) is where bufferbloat lives. The measurement signal (loss) is degraded: it arrives late and suddenly, leaving TCP oscillating between extremes.

Figure 6.2 traces this divergence concretely. A simplified TCP AIMD sender drives packets into a FIFO queue under two buffer regimes—one sized at the bandwidth-delay product (1× BDP = 100 packets), the other fifty times larger. Watch what happens to the sender’s belief (cwnd, blue) versus the queue’s reality (occupancy, red) as the buffer grows.

Figure 6.2: TCP’s congestion window (blue) represents the sender’s belief about available capacity; the queue occupancy (red) represents reality at the bottleneck. Panel A shows a well-sized buffer (1× BDP): frequent loss keeps belief tightly coupled to reality, peak RTT stays near 20 ms. Panel B shows a bloated buffer (50× BDP): loss arrives too late, belief diverges for seconds, and RTT inflates above 500 ms. Interactive version: adjust buffer size and toggle CoDel AQM to observe the coupling dynamics.

The contrast is stark. In Panel A the small buffer forces frequent loss: cwnd oscillates tightly around the BDP, the queue drains on every halving, and peak RTT stays near 20 ms—the feedback loop is fast enough to track reality. In Panel B the oversized buffer absorbs thousands of packets before any loss signal fires. TCP’s cwnd climbs past 8,000 packets, the queue fills to capacity, and RTT inflates above 500 ms—a half-second of pure queueing delay invisible to the sender. The belief–reality gap persists for seconds, not milliseconds. This is the signature of bufferbloat: measurement signal degradation proportional to buffer depth.

In the interactive widget, drag the buffer-size slider from 100 to 20,000 packets and observe how peak RTT scales with buffer depth. Then toggle CoDel AQM on: the sojourn-time signal collapses the gap even with a 20,000-packet buffer, keeping RTT near the 5 ms target. This previews the AQM redesign discussed in §6.4.

6.3.3 The Closed-Loop Failure Mode

TCP’s congestion control is a feedback loop: send faster → observe loss → back off. The loop is stable in principle. In practice, the loop’s time constant is the RTT plus the time for the loss signal to propagate back. If the buffer adds 120 milliseconds of delay, and the RTT is 50 milliseconds, the feedback loop’s time constant becomes 170+ milliseconds. For human-scale interactions (mouse clicks, voice frames arriving every 20 milliseconds), this loop is too slow to maintain stability or responsiveness.

Additionally, the loop oscillates: TCP increases cwnd for many RTTs (while the buffer fills silently), then receives a loss burst and backs off aggressively. The buffer drains quickly, and TCP ramps back up. This sawtooth behavior creates jitter—latency is either high (buffer filling) or low (buffer draining), never stable.
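
The sawtooth and its dependence on buffer depth can be reproduced in a toy per-RTT model (all constants illustrative; this is a caricature of AIMD, not a TCP implementation):

```python
def aimd_peak_rtt(buffer_pkts, capacity_pkts, base_rtt_ms, rounds=200):
    """Toy per-RTT AIMD sender over a FIFO bottleneck.

    capacity_pkts is the link capacity in packets per RTT.
    Returns the peak queueing-inflated RTT observed (ms)."""
    cwnd, queue, peak_rtt = 1.0, 0.0, 0.0
    drain_ms = base_rtt_ms / capacity_pkts   # transmission time of one packet
    for _ in range(rounds):
        queue = max(0.0, queue + cwnd - capacity_pkts)
        if queue > buffer_pkts:              # overflow: the only loss signal
            queue = float(buffer_pkts)
            cwnd = max(1.0, cwnd / 2)        # multiplicative decrease
        else:
            cwnd += 1.0                      # additive increase
        peak_rtt = max(peak_rtt, base_rtt_ms + queue * drain_ms)
    return peak_rtt
```

With a buffer near 1× BDP the peak RTT stays close to the base RTT; multiply the buffer by fifty and the peak RTT inflates by over an order of magnitude, because loss fires only after the huge buffer fills.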

6.3.4 Why Large Buffers Were Installed

Buffers were oversized to solve a different problem: burst absorption. When TCP opens a new connection, it starts with an initial congestion window (IW=10 in modern TCP). This is a burst of 10 packets sent immediately. If the router has no buffer, the burst causes loss, and the connection backs off. Large buffers absorb these bursts, allowing TCP to reach higher throughput quickly. Additionally, links often have mismatched capacities (1 Gbps ingress, 100 Mbps egress). Data arriving on the fast link queues at the slow link. Without a buffer, packets drop immediately. With a buffer, packets queue and transmit at the slow rate.

Economically, memory became cheap. Router manufacturers sized buffers to maximize throughput metrics (link utilization), even if it meant inflating latency. The incentive was misaligned: throughput is easy to measure and market; latency is harder to measure and less directly tied to user satisfaction (at the time).

6.3.5 The Core Problem: Measurement Signal Failure

The root cause is not TCP or the queue independently. It is the coupling failure: TCP and the queue are tightly coupled at the data path, but their control loops are independent. TCP’s feedback is loss-based; the queue’s feedback (implicitly, through packet acceptance/drop) is buffer-based. These two feedbacks conflict when the buffer is large. This reveals the deepest principle of the failure: the measurement signal (loss) is fundamentally wrong for the problem at hand. Loss is a proxy for congestion, but it is a degraded proxy. Delay is the direct signal of bufferbloat, but TCP cannot observe delay; it only observes the end-to-end RTT, which conflates propagation delay with queueing delay.


6.4 AQM Evolution: Redesigning the Measurement Signal

Active Queue Management is a redesign of the State invariant. Instead of simply storing packets and dropping when full (tail drop), the queue maintains an internal measurement of congestion and proactively signals transport before the buffer overflows. The evolution of AQM is a series of refinements to what measurement signal the queue should observe. Each algorithm answers the same question differently: “How should the queue measure congestion?”

6.4.1 RED: Queue Length as the Signal

Random Early Detection (RED) was the first AQM, proposed by Floyd and Jacobson (1993). The idea: instead of tail-drop (drop when buffer is full), drop packets earlier, when the average queue length exceeds a threshold. Early loss signals TCP to back off before the buffer fills.

RED’s algorithm: compute an exponentially weighted moving average (EWMA) of the queue length. If EWMA < min_threshold, accept all packets. If EWMA > max_threshold, drop every arriving packet. If min_threshold ≤ EWMA ≤ max_threshold, drop packets probabilistically (drop probability increases linearly with EWMA).

The formula for drop probability is:

if EWMA <= min_threshold:
    drop_prob = 0
else if EWMA >= max_threshold:
    drop_prob = 1
else:
    drop_prob = max_p × (EWMA - min_threshold) / (max_threshold - min_threshold)
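
The pseudocode above translates directly into code. A sketch (threshold and weight values are illustrative):

```python
def red_drop_prob(ewma, min_th, max_th, max_p):
    """RED's piecewise-linear drop probability over the averaged queue length."""
    if ewma <= min_th:
        return 0.0
    if ewma >= max_th:
        return 1.0
    return max_p * (ewma - min_th) / (max_th - min_th)

def red_update_ewma(ewma, sample, weight=0.002):
    """Exponentially weighted moving average of instantaneous queue length."""
    return (1 - weight) * ewma + weight * sample
```

The small default weight is the source of RED’s sluggishness: the average trails the instantaneous queue by many packet times, which is exactly why transient bursts and persistent overload look alike.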

The Coordination answer is implicit: “the queue decides unilaterally when to drop, without asking TCP’s permission.” The measurement signal is queue length (averaged over a window). The internal belief is: “if average queue length is high, congestion must be persistent; drop probabilistically to signal TCP.”

RED sounds reasonable: monitor the queue, drop early, TCP responds to loss earlier, problem solved. Decades of research followed. Yet RED was rarely deployed in practice. Why?

The fundamental problem: queue length is the wrong measurement signal for distinguishing bufferbloat. A long queue can mean two different things: (1) transient burst traffic that will clear on its own (healthy—don’t signal congestion), or (2) persistent overload requiring rate reduction (unhealthy—do signal congestion). RED cannot distinguish them.

Consider two scenarios on a 10 Mbps link:

  • Scenario 1: A video application suddenly sends a burst of 100 packets. Because the link drains during the burst, occupancy peaks at about 20 packets, which then dequeue over the next 24 milliseconds (one 1,500-byte packet takes 1.2 ms at 10 Mbps).
  • Scenario 2: Two TCP flows are each sending at 5 Mbps, creating a persistent queue of 40 packets.

RED’s EWMA window (typically 100+ milliseconds) smooths out short-term fluctuations, but smoothing alone cannot separate the two cases: a burst in Scenario 1 can push the EWMA past min_threshold just as Scenario 2’s persistent 40-packet queue does. RED then makes the same drop decision for both, even though they require opposite responses: Scenario 1 needs no drops (the queue will drain on its own), while Scenario 2 needs drops (to reduce the sending rate).

During slow-onset persistent overload, the EWMA rises gradually, and RED drops probabilistically based on the average. This is weak signal: by the time RED starts dropping, the queue has already grown significantly. During transient bursts, RED may drop unnecessarily, starving the link.

Additionally, RED has many tuning parameters: min_threshold, max_threshold, max_drop_probability, and the EWMA weight. Optimal values depend on link capacity, RTT, and traffic pattern. In heterogeneous networks, no single tuning works everywhere. Network operators found RED difficult to configure and unreliable in practice. Despite decades of research, RED never achieved significant deployment.

6.4.2 CoDel: Sojourn Time as the Signal

Kathleen Nichols and Van Jacobson’s CoDel (2012) represents a breakthrough. The key insight: measure sojourn time (how long a packet spends in the queue), not queue length.

Sojourn time naturally separates the two phenomena RED conflates. During a transient burst, packets spend brief time in the queue—they dequeue quickly. During persistent overload, packets accumulate and spend long time in the queue. Minimum sojourn time within an observation window (100 milliseconds) is the signature of persistence: short minimum sojourn means the queue is clearing, even if peak queue length is high.

CoDel’s algorithm: at each packet dequeue, record the dequeue time and the enqueue time (recorded when the packet arrived). Compute the sojourn time (dequeue_time - enqueue_time). Maintain the minimum sojourn time over the last 100 milliseconds. If min_sojourn > target (5 milliseconds, typical), the queue is persistently congested. Start dropping packets—initially one per interval, exponentially increasing if congestion persists. If min_sojourn < target, congestion has cleared; stop dropping.

More precisely, CoDel drops packets at a rate determined by the formula:

if min_sojourn > target:
    time_to_drop = now + interval / sqrt(drop_count)
    drop packets at this rate
else:
    drop_count = 0  # reset
    don't drop

This generates a drop rate that increases as sqrt(drop_count), creating a gentle, increasing signal to senders without overwhelming the network with drops.
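
A simplified version of this control law in Python (real CoDel also remembers the drop count across dropping episodes to resume aggressively; this sketch omits that refinement):

```python
import math

class CoDelState:
    """CoDel's control law: while the windowed minimum sojourn time stays
    above target, schedule drops at intervals shrinking as interval/sqrt(n)."""
    def __init__(self, target_ms=5.0, interval_ms=100.0):
        self.target, self.interval = target_ms, interval_ms
        self.dropping, self.count, self.next_drop = False, 0, 0.0

    def on_dequeue(self, min_sojourn_ms, now_ms):
        """Return True if the packet being dequeued should be dropped."""
        if min_sojourn_ms < self.target:
            self.dropping, self.count = False, 0    # congestion cleared
            return False
        if not self.dropping:                       # enter dropping state
            self.dropping, self.count = True, 1
            self.next_drop = now_ms + self.interval
            return True
        if now_ms >= self.next_drop:                # next scheduled drop is due
            self.count += 1
            self.next_drop = now_ms + self.interval / math.sqrt(self.count)
            return True
        return False
```

The caller supplies the minimum sojourn time over the last interval (computed from per-packet enqueue timestamps); the state machine itself needs only three variables, which is why CoDel has no tuning knobs beyond target and interval.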

The Coordination answer: “detect persistent overload via sojourn time, then signal TCP with drops or ECN marks.” The measurement signal is sojourn time, which directly measures latency. The internal belief is: “if minimum sojourn time is high, packets are spending long time in my queue; this indicates persistent overload, not just a burst.”

CoDel is self-tuning. The target delay (5 ms) is a control parameter, but different networks can use different targets, and CoDel adapts to varying capacity. The algorithm does not require manual tuning of threshold pairs or EWMA weights. The target delay represents the acceptable latency for interactive applications—5 milliseconds is a conservative, achievable target for most user-facing applications.

Crucially, CoDel works because it measures the right quantity. Sojourn time directly captures the phenomenon (bufferbloat) that needs controlling. By distinguishing persistent from transient queues, CoDel signals senders (via drops or ECN marks) only when congestion is real and persistent, not during harmless bursts.

To understand why 5 milliseconds is the right target: for a video call with 20-millisecond frames, a 5-millisecond target means each packet spends less than 25% of a frame period in the queue. This maintains responsiveness for human interaction.

The cost: per-packet timestamping. At high line rates (10+ Gbps), timestamping every packet at enqueue and dequeue is expensive—it requires hardware support and adds latency. This motivated simpler alternatives. Figure 6.3 shows how CoDel’s control loop evolves across three traffic regimes — light load where no drops are needed, moderate load where sparse drops signal congestion, and heavy overload where the dropping rate accelerates to restore equilibrium.

CoDel (Controlled Delay) represents a fundamental shift in Active Queue Management (AQM) measurement signals. Rather than monitoring queue length—which conflates transient bursts with persistent overload—CoDel measures sojourn time: the time a packet spends from enqueue to dequeue. This captures latency directly, the actual symptom of congestion that applications experience. The key insight is that during a brief traffic burst, minimum sojourn time remains low because packets dequeue quickly; during persistent overload, minimum sojourn time rises because the queue is always busy. By tracking the minimum sojourn time over a 100-millisecond observation window, CoDel distinguishes the two phenomena that previous algorithms conflated.

Figure 6.3: The three panels show the evolution of CoDel’s control loop as traffic intensity increases. Panel A depicts light load with random arrivals: sojourn times are brief (< 5 ms target), and CoDel makes no drops because the minimum sojourn is low—the queue is clearing. Panel B shows moderate load where sojourn time approaches the target: CoDel begins dropping packets at a low rate, signaling congestion to TCP. Panel C shows heavy persistent overload: minimum sojourn exceeds the target by significant margins, CoDel increases its drop rate following the formula drop_interval = interval / √(drop_count), creating an exponentially increasing signal that backpressures senders without overwhelming the network with drops. The signal is gentle and self-regulating: if drops work and congestion clears, minimum sojourn drops below target and CoDel stops dropping immediately.

6.4.3 PIE: Proportional Integral Control Without Timestamps

PIE (Proportional Integral controller Enhanced) solves CoDel’s per-packet overhead by estimating queue delay without explicit timestamps. At regular intervals (10-100 milliseconds), PIE computes:

delay_estimate = queue_length / departure_rate

This is ingenious: if a queue holds Q packets and packets are departing at rate R (bytes per second), then the time required to drain the queue is approximately Q / R seconds. This estimates the sojourn time without per-packet timestamps.

More specifically:

On each periodic update interval (e.g., 10ms):
    departure_rate_avg = bytes_departed_in_interval / interval_duration
    queue_delay_estimate = queue_length_bytes / departure_rate_avg

    # PI controller update (prev_delay_estimate is last interval's estimate)
    drop_prob_new = drop_prob_old
                    + alpha * (queue_delay_estimate - target_delay)
                    + beta * (queue_delay_estimate - prev_delay_estimate)
    drop_prob = clip(drop_prob_new, [0, 1])

On each packet arrival:
    if random() < drop_prob:
        DROP packet
    else:
        ENQUEUE packet

PIE then adjusts a drop probability using a proportional-integral (PI) controller. The drop probability increases if estimated delay > target, and decreases if estimated delay < target. At each packet arrival, drop with probability P. The PI controller approach is inspired by control theory: the proportional term responds to current error, while the integral term accounts for sustained error (overshoot/undershoot).
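
The PI update itself is tiny. A minimal sketch (the gain constants and the /1000 scaling are illustrative, not tuned values; real PIE also auto-scales its gains as drop probability grows):

```python
def pie_update(drop_prob, delay_ms, prev_delay_ms, target_ms=15.0,
               alpha=0.125, beta=1.25):
    """One periodic PIE update: a proportional term on the error from target,
    plus a term on the delay trend. Accumulating adjustments into drop_prob
    across intervals provides the integral action."""
    adjustment = (alpha * (delay_ms - target_ms)
                  + beta * (delay_ms - prev_delay_ms)) / 1000.0
    return min(1.0, max(0.0, drop_prob + adjustment))
```

Delay above target and rising pushes the drop probability up; delay below target and falling pulls it back down, so the controller settles where estimated delay hovers near target.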

The Coordination answer: same as CoDel—“detect congestion and drop probabilistically.” The measurement signal is estimated delay (computed once per interval, not per-packet). The internal belief is a PI controller: maintain drop probability such that estimated delay stays near target.

PIE is cheaper to implement: no per-packet timestamps, only periodic rate measurement (which existing routers already track for billing). DOCSIS (cable modem standard) adopted PIE over CoDel for this reason. Vendors could implement PIE in firmware without silicon respin.

The tradeoff: estimated delay is less accurate than measured sojourn time. If the departure rate is zero (no packets sent during an interval), division by zero is undefined. During bursty traffic with silent periods, PIE’s estimate can be noisy. But in practice, PIE works nearly as well as CoDel for most traffic patterns. The measurement is “good enough,” and simplicity enables deployment.

Deployment evidence is striking: PIE became the standard AQM in DOCSIS 3.1 cable modems, shipping in millions of devices worldwide. CoDel’s theoretical superiority could not overcome PIE’s practical deployment advantages.

6.4.4 Design Pattern: Measurement Signal Evolution

The arc from RED to CoDel to PIE reveals a design pattern: as the measurement signal improves (from queue length to sojourn time to estimated delay), the control loop becomes tighter and more reliable, reducing the need for manual tuning.

RED measures queue occupancy (space), which is indirect and does not distinguish transient from persistent. CoDel measures latency (time), which is direct and reveals persistence. PIE estimates latency without per-packet overhead.

Each step is a redesign of the State invariant: what information should the queue maintain internally to make good control decisions? RED’s answer: an EWMA of queue length. CoDel’s answer: the minimum sojourn time. PIE’s answer: an estimated queue delay computed from rate measurements and periodic updates.

The progression also illustrates the closed-loop reasoning principle: as the feedback signal improves, the loop becomes more stable and responsive. RED’s loop oscillates because the signal is degraded. CoDel’s loop stabilizes because sojourn time is a faithful signal.


6.5 Fair Queuing + AQM: FQ_CoDel and Cake

A single CoDel or PIE queue protects aggregate delay but does not guarantee per-flow fairness. A single greedy flow (bulk download) can monopolize the buffer, starving latency-sensitive traffic (VoIP call). All flows share the same queue and the same AQM controller, so the controller reacts to aggregate congestion, not flow-specific congestion. This reveals a critical limitation: AQM solves the measurement signal problem, but it does not address the Coordination problem of fairness.

6.5.1 FQ_CoDel: Per-Flow Queuing

FQ_CoDel disaggregates further: maintain one queue per flow (or per 5-tuple: source IP, destination IP, protocol, source port, destination port), and apply CoDel to each queue independently.

The algorithm: hash the packet’s 5-tuple to a queue index (typically 1024 queues by default). If the queue does not exist, create it. Enqueue the packet in that queue. For transmission, use deficit round-robin (DRR) to serve queues in turn, ensuring each flow gets fair opportunity. Within each queue, CoDel monitors sojourn time and drops packets if latency persists.
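
The classification step is a plain hash of the 5-tuple. A sketch (SHA-1 here is a stand-in; real implementations typically use a fast keyed hash with a random perturbation to resist adversarial collisions):

```python
import hashlib

NUM_QUEUES = 1024   # FQ_CoDel's default queue count

def flow_queue_index(src_ip, dst_ip, proto, sport, dport):
    """Map a packet's 5-tuple to one of NUM_QUEUES per-flow queues."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_QUEUES
```

Every packet of a flow lands in the same queue, so per-queue CoDel state and DRR credit accrue per flow without any explicit flow table.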

The Coordination answer is now two-level: (1) at the macro level, the scheduler ensures fair turn-taking among flows; (2) at the micro level, each flow’s CoDel instance reacts to its own queue delay. One aggressive flow’s congestion does not degrade other flows’ latency.

The State invariant increases in complexity: the router now maintains per-flow state (queue, sojourn time estimate, drop probability). At millions of flows per second, this is expensive in memory and CPU.

However, FQ_CoDel’s fairness is powerful. In interactive scenarios with mixed traffic (VoIP + bulk download), VoIP gets consistent low latency even if bulk download saturates the link. The bulk download’s queue grows, but its sojourn time becomes high, and CoDel drops its packets. The VoIP flow’s queue stays small and clear, because VoIP sends sparingly (only a few frames per second). The two flows remain isolated by the per-flow queue structure.

In practice, 1024 queues produce occasional hash collisions: with many flows, different 5-tuples may map to the same queue, forcing them to share one CoDel controller. This degrades fairness slightly, but with typical active-flow counts the fraction of colliding flows is small and the throughput impact is minimal.

6.5.2 Cake: Associative Hashing and Host-Level Fairness

Cake (Common Applications Kept Enhanced) extends FQ_CoDel by addressing practical deployment failures.

Hash-based flow classification (5-tuple lookup) is fast but prone to collisions: with many flows, unrelated flows can land in the same queue, degrading fairness. Cake uses 8-way set-associative hashing: each hash bucket stores up to 8 flow entries, and on lookup the 8 entries are scanned linearly for an exact match. A flow is forced to share a queue only when all 8 ways of its bucket are already occupied by other flows, rather than when a single slot is — dramatically reducing collision-induced unfairness.
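A sketch of the set-associative lookup (the bucket count, the "share the oldest entry" fallback, and all names are our assumptions for illustration):

```python
import zlib

WAYS = 8       # entries per bucket (8-way set-associative)
BUCKETS = 128  # illustrative bucket count

class FlowTable:
    """Set-associative flow table: each hash bucket holds up to WAYS flows."""
    def __init__(self):
        self.buckets = [[] for _ in range(BUCKETS)]  # lists of (flow_key, queue_id)
        self.next_queue_id = 0

    def lookup(self, flow_key):
        bucket = self.buckets[zlib.crc32(repr(flow_key).encode()) % BUCKETS]
        for key, qid in bucket:       # linear scan of at most WAYS entries
            if key == flow_key:
                return qid            # exact match: this flow's own queue
        if len(bucket) < WAYS:
            qid = self.next_queue_id  # free way: allocate a fresh queue
            self.next_queue_id += 1
            bucket.append((flow_key, qid))
            return qid
        return bucket[0][1]           # all ways occupied: collide with oldest entry
```

The exact-match scan is what distinguishes this from plain hashing: two flows in the same bucket still get separate queues as long as a free way remains.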

Additionally, Cake adds host-level fairness: group flows by source IP (assuming one source IP = one user on a home network). This prevents a single user with 1000 parallel connections from monopolizing the link. For example, a video downloader opening 10 parallel TCP connections shares capacity fairly with a user opening just one connection. Host-level fairness is the right granularity for residential networks where per-IP = per-user.

Encrypted tunnel traffic (VPNs, and any flows multiplexed inside one outer connection) hashes to a single queue because the outer headers’ 5-tuple is identical for every flow inside the tunnel. Cake cannot see inside the encryption, so it cannot disaggregate — a fundamental limit of header-based classification. Within a VPN tunnel carrying 100 applications, all traffic appears to Cake as one flow, and fairness is lost inside the tunnel.

The Coordination answer: “per-flow fairness, plus per-host fairness, plus probabilistic AQM per queue.” The State complexity increases: per-flow plus per-host grouping, plus flow table with associative hashing.

Cake ships in the Linux kernel and is widely deployed in residential broadband routers, notably through OpenWrt’s SQM (Smart Queue Management) packages. Its effectiveness is well documented: it reliably keeps added queueing delay to a few milliseconds even on saturated links, as long as the fairness granularity (flow or host) matches the traffic pattern. In operational networks, Cake has transformed user experience under congestion: residential users who deploy it report latency under load dropping from 100+ milliseconds to single-digit or low-tens-of-millisecond values.


6.6 Max-Min Fairness and the Water-Filling Algorithm

Why does fairness matter? The answer is rooted in economics and game theory. Max-min fairness is a specific answer to the Coordination invariant: how should a shared resource be divided among competing flows?

Max-min fairness states: maximize the minimum allocation to any flow, then recursively maximize the next-smallest, and so on. Operationally, the algorithm is “water-filling”: imagine flows as vessels whose heights are their demands. Raise the water level uniformly across all vessels; whenever a vessel fills (its demand is met), freeze its allocation and keep raising the level in the rest, until capacity is exhausted or every demand is satisfied.

Concretely, suppose we have 10 Mbps capacity and four flows demanding 8, 3, 5, and 2 Mbps respectively. Progressive filling proceeds as follows: 1. The equal share is 10/4 = 2.5 Mbps. Flow 4 (demand 2 Mbps) needs less, so it gets its full request. Remaining: 8 Mbps for three flows. 2. The new equal share is 8/3 ≈ 2.67 Mbps. None of the remaining demands (8, 3, 5) fits under this share, so flows 1, 2, and 3 each get 8/3 ≈ 2.67 Mbps.

Final allocation: [2.67, 2.67, 2.67, 2] Mbps. Note that flow 2’s demand of 3 Mbps is not fully met: satisfying it would push another unsatisfied flow below 2.67 Mbps, violating max-min. No flow’s allocation can increase without decreasing that of a flow with an equal or smaller allocation — the allocation is both max-min fair and Pareto optimal.
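The progressive-filling procedure can be written directly. A sketch (the function name is ours):

```python
def max_min_allocation(capacity, demands):
    """Progressive filling: repeatedly offer every unsatisfied flow an equal
    share; freeze flows whose demand fits, and redistribute the surplus."""
    alloc = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active and remaining > 1e-9:
        share = remaining / len(active)
        # Flows whose leftover demand fits within the current equal share.
        satisfied = {i for i in active if demands[i] - alloc[i] <= share}
        if satisfied:
            for i in satisfied:
                remaining -= demands[i] - alloc[i]
                alloc[i] = demands[i]
            active -= satisfied
        else:
            # No flow can be fully satisfied: split what is left equally.
            for i in active:
                alloc[i] += share
            remaining = 0
    return alloc
```

Running it on the example above, `max_min_allocation(10, [8, 3, 5, 2])` satisfies only the 2 Mbps demand and splits the remaining 8 Mbps equally three ways.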

Max-min fairness has deep roots in welfare economics: it is the egalitarian (Rawlsian) solution concept, maximizing the welfare of the worst-off flow first. It also has strong practical properties: it prevents starvation (every flow gets something), and it maximizes the bottleneck rate seen by any flow. However, it is only one possible Coordination answer. Proportional fairness — which maximizes the sum of log-throughputs and corresponds to the Nash bargaining solution — allocates each flow a rate inversely proportional to its path’s congestion price, trading a little worst-case fairness for higher aggregate throughput. Utilitarian fairness simply maximizes total throughput without concern for distribution. These are different Coordination answers to the same question.

Why does max-min fairness matter for queue design? Because it provides a principle for disaggregation. WFQ implements max-min fairness through weights. Per-flow queuing + AQM approximates max-min fairness: flows with higher demand consume more buffer, and per-flow CoDel ensures each gets early congestion signals. The water-filling principle also explains why per-host fairness (as in Cake) is important: a single aggressive user should not consume more than a fair share of the network, even if they open many parallel connections.


6.7 Traffic Policing and Shaping: Enforcing Rate Contracts

Traffic policing and shaping are mechanisms for enforcing rate contracts at network boundaries. A policer observes arriving traffic and drops or marks packets that violate a rate limit. A shaper delays packets to enforce the limit, ensuring non-conforming traffic becomes conforming without loss.

Both use the same underlying state model: a token bucket, which maintains an internal belief about how much traffic is allowable. Tokens accumulate at a sustained rate (the contract rate, e.g., 10 Mbps). Each arriving packet consumes tokens equal to its size. If tokens are available, the packet conforms; if not, it violates the contract. A policer drops or marks the violating packet; a shaper queues it, waiting for tokens to accumulate.

Concrete example: a token bucket with rate R = 10 Mbps and depth B = 1.2 MB (roughly one second’s worth of tokens at 10 Mbps). - At time t=0, the bucket is full with 1.2 MB of tokens. - A packet arrives (1,500 bytes): it consumes 1,500 bytes of tokens, leaving the bucket 1,500 bytes below full. - Tokens refill at 10 Mbps = 1.25 MB/s: at t=0.1 s, 125 KB would have accumulated, so the 1,500-byte deficit is erased after only 1.2 ms. - The bucket is capped at B: once full, further tokens are discarded rather than accumulated.

The burst allowance (bucket depth B) determines how much traffic can exceed the sustained rate before violations begin. For B = 1.2 MB and R = 10 Mbps, a burst arriving at 20 Mbps drains the bucket at the net rate of 20 − 10 = 10 Mbps, so it can be sustained for B / 10 Mbps = 9.6 Mb / 10 Mbps ≈ 0.96 seconds.

Traffic policing is enforcement: the policer drops or marks packets unilaterally. Traffic shaping is deferral: the shaper queues packets, delaying them until tokens accumulate. Both enforce a rate contract, but with different tradeoffs: policing incurs loss (wasteful), shaping incurs delay (buffering).
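Both mechanisms share the same core state update; only the action on a violating packet differs. A sketch (units in bytes and seconds; class and method names are ours):

```python
class TokenBucket:
    """Shared state for a policer or shaper: tokens in bytes, rate in bytes/sec."""
    def __init__(self, rate_bps, depth_bytes):
        self.rate = rate_bps / 8.0          # refill rate, bytes per second
        self.depth = float(depth_bytes)
        self.tokens = float(depth_bytes)    # bucket starts full
        self.last = 0.0

    def _refill(self, now):
        # Accumulate tokens for elapsed time, capped at the bucket depth.
        self.tokens = min(self.depth, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def police(self, pkt_bytes, now):
        """Policer: consume tokens if available, else report a violation."""
        self._refill(now)
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return True                     # conforming: forward
        return False                        # violating: drop or mark

    def shape_delay(self, pkt_bytes, now):
        """Shaper: how long to hold this packet until tokens suffice."""
        self._refill(now)
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return 0.0
        wait = (pkt_bytes - self.tokens) / self.rate
        self.last = now + wait              # packet departs after `wait`
        self.tokens = 0.0
        return wait
```

With R = 10 Mbps and B = 150 KB, exactly 100 back-to-back 1,500-byte packets conform before the policer starts dropping.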

The key insight is that traffic policing sits at the interface between the user’s domain and the network’s domain. It enforces the service contract: the user promises to send no more than X packets/bytes per second (with burst tolerance B), and the network promises certain treatment (queue priority, low loss) conditional on contract compliance.


6.8 DiffServ: Disaggregated Queue Management at Scale

DiffServ (Differentiated Services) solves the scalability crisis of per-flow queue management by disaggregating the network into edge and core domains. Edge routers perform complex, stateful operations (per-flow classification, policing, marking). Core routers perform simple, stateless operations (per-class scheduling based on packet marks).

The key mechanism is the DSCP field (DiffServ Code Point)—6 bits in the IP header that encode the service class. An edge router classifies incoming packets (examining 5-tuple, protocol, etc.), marks them with a DSCP value, and applies policing if needed. A core router simply reads the DSCP field and schedules packets according to per-class queues.

This is radical disaggregation: the Coordination answer (“who decides how capacity is divided?”) becomes “the edge decides via marking; the core executes via class-based scheduling.” The State answer is “edges have per-flow state, cores have only per-class state.” The Interface is the DSCP field in every packet.

Example: an ISP configures three classes: - Gold (DSCP=46): VoIP, video conferencing, gaming. Target latency: <10ms. - Silver (DSCP=34): Web browsing, email. Target latency: <100ms. - Best-Effort (DSCP=0): File transfer, torrents. No latency guarantee.

Edge routers classify inbound traffic: VoIP packets marked as Gold, web traffic marked as Silver, bulk transfers marked as Best-Effort. Core routers maintain three queues (one per class) and schedule them with strict priority: Gold always serves first, then Silver, then Best-Effort. This ensures VoIP gets low latency even when bulk transfers saturate the core.
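The edge/core split can be sketched in a few lines. The class table and packet representation here are illustrative assumptions, not a real router implementation:

```python
# Hypothetical class table: DSCP value -> priority queue index (0 = highest).
CLASSES = {46: 0, 34: 1, 0: 2}  # 46 = Gold, 34 = Silver, 0 = Best-Effort

def edge_mark(pkt):
    """Edge router: classify on flow attributes, stamp the DSCP field."""
    if pkt["app"] in ("voip", "video_conf", "gaming"):
        pkt["dscp"] = 46
    elif pkt["app"] in ("web", "email"):
        pkt["dscp"] = 34
    else:
        pkt["dscp"] = 0
    return pkt

def core_dequeue(queues):
    """Core router: stateless strict priority over per-class queues.
    queues[0] = Gold, queues[1] = Silver, queues[2] = Best-Effort."""
    for q in queues:           # always drain the highest class first
        if q:
            return q.pop(0)
    return None
```

Note the asymmetry this encodes: `edge_mark` needs application knowledge (stateful, complex), while `core_dequeue` needs only the 6-bit mark (stateless, fast).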

The tradeoff: DiffServ guarantees per-class latency (Gold class gets < 10ms, Silver class gets < 100ms), not per-flow fairness. If two premium customers are in the same Gold class, they share the class’s capacity; no per-customer guarantee exists. Additionally, DiffServ requires upstream enforcement (re-marking at the edge) for honesty; customers cannot self-mark as Gold to get priority.

Modern deployments use DiffServ plus per-class AQM: each DSCP class has its own CoDel or PIE instance, providing both differentiation (per-class priority) and fairness (per-flow within class).


6.9 Generative Exercises

6.9.1 Exercise 1: WiFi Access Point with Heterogeneous PHY Rates

A WiFi access point serves both 802.11ax (WiFi 6) and 802.11ac (WiFi 5) clients; in this scenario, the 802.11ax clients negotiate a 1 Gbps PHY rate and the 802.11ac clients 100 Mbps. The bottleneck is the air interface, which is time-shared among clients.

Current design: FIFO queue at the access point. All clients’ packets queue in a single FIFO.

Problem: an 802.11ac client with a sustained download (100 Mbps demand) causes the queue to grow, inflating latency for 802.11ax clients (which have lower sustained demand but are latency-sensitive).

Redesign: propose a scheduling mechanism that ensures fair airtime (not bandwidth) allocation. Note that airtime per packet is inversely proportional to the PHY rate: a 1,500-byte packet takes 120 μs at 100 Mbps but only 12 μs at 1 Gbps. The key insight is that serving packets equally (as FIFO or per-packet round-robin does) hands the slow client 10x the airtime; airtime fairness therefore requires unequal packet service opportunity, favoring the faster client.

How would you weight fair queuing to protect latency despite the 10x PHY-rate difference? One approach: account for service in airtime rather than bytes — equivalently, set WFQ weights proportional to PHY rates, 10:1 in favor of the 802.11ax client. Each client then receives an equal share of airtime: the fast client sends roughly ten packets for every one the slow client sends, the slow client can no longer inflate everyone’s queueing delay, and aggregate throughput rises. This is the airtime-fairness approach adopted in modern WiFi queue management.

Trace how this changes the State invariant: per-client PHY rate tracking becomes necessary (environment state), and the scheduler must maintain per-client deficit counters and virtual time (internal belief). Coordination changes too: weights become the explicit fairness policy, decided by the AP operator at configuration time and enforced at each packet departure.

What challenges arise with encrypted traffic (e.g., VPN tunnels from the 802.11ac client)? How would you identify PHY rate if the AP has no direct knowledge of which device is which?
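One way to realize airtime fairness is deficit round-robin with the deficit denominated in microseconds of airtime rather than bytes. A sketch under stated assumptions (the quantum, client representation, and names are ours):

```python
def airtime_us(pkt_bytes, phy_rate_bps):
    """Microseconds of air the packet occupies at the client's PHY rate."""
    return pkt_bytes * 8 * 1e6 / phy_rate_bps

def schedule(clients, quantum_us=1000):
    """One DRR round. clients: list of dicts with 'name', 'rate' (bps),
    'deficit' (airtime credit, us), and 'queue' (list of packet sizes)."""
    sent = []
    for c in clients:
        c["deficit"] += quantum_us               # equal airtime grant per round
        while c["queue"]:
            cost = airtime_us(c["queue"][0], c["rate"])
            if cost > c["deficit"]:
                break
            c["deficit"] -= cost                 # charge airtime, not bytes
            sent.append((c["name"], c["queue"].pop(0)))
    return sent
```

With a 1,200 μs quantum, a 1 Gbps client sends 100 packets per round (12 μs each) while a 100 Mbps client sends 10 (120 μs each) — unequal packet counts, equal airtime.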

6.9.2 Exercise 2: Buffering Tradeoff in Access Networks

Design a buffer for a residential broadband router serving 10 downstream users (home WiFi).

Constraints: (1) Link capacity is 100 Mbps down, 10 Mbps up. (2) One user saturates the downstream (sustained 100 Mbps download). (3) Concurrent applications: one VoIP call (requires < 100 ms latency), one video stream (adaptive bitrate, benefits from buffering), one bulk download (benefits from large buffer).

Scenario A: Small buffer (10 milliseconds) - Packet capacity: 100 Mbps × 10 ms = 1 Mb = 125 KB - Number of packets: 125 KB / 1,500 bytes per packet ≈ 83 packets - Sufficient for transient bursts (TCP initial window = 10 packets), but not multiple concurrent connections - When TCP slow-start begins (connection opens with IW=10), packets are transmitted immediately - For the bulk download, each TCP connection starts in slow-start → the queue fills with 10 packets per connection - For the VoIP call sending 2 packets every 20 ms, the 10 ms buffer may cause loss under concurrent load - Video ABR ramps bitrate gradually, but the concurrent bulk download already fills the buffer

Analysis: 10 ms buffer is too small. It prevents bufferbloat but at the cost of packet loss, which degrades TCP throughput and video quality.

Scenario B: Large buffer (100 milliseconds) - Packet capacity: 100 Mbps × 100 ms = 10 Mb = 1.25 MB - Number of packets: 1.25 MB / 1,500 bytes ≈ 833 packets - Can absorb large bursts and multiple TCP slow-starts without loss - VoIP call experiences 100 ms of queueing delay (unacceptable for interactive use) - Video ABR can request higher bitrates and sustain them via buffering - Bulk download achieves high throughput but at cost of latency

Analysis: 100 ms buffer avoids loss but creates bufferbloat for latency-sensitive applications.

Tradeoff Decomposition: This is the fundamental tension between loss and delay, between throughput and latency. Neither extreme works: small buffers waste throughput; large buffers waste interactivity.

Proposal: Adaptive AQM Buffer. Use CoDel or Cake with target delay = 5 ms and maximum buffer = 100 ms. The AQM lets the queue grow only when needed: - At low load (no congestion): the queue stays near empty - During bursts (transient queue): the buffer absorbs packets without dropping - During persistent overload: the AQM drops packets early, preventing the buffer from filling

For target delay tuning: - VoIP call frame period = 20 ms; target delay should be < 5 ms to ensure latency < 25 ms (keeping voice responsive) - Video ABR responds to bandwidth estimates updated every 2-4 seconds; it tolerates 20-100 ms buffer - Bulk download has no latency requirement; it only needs throughput

The measurement that should inform this decision is the sojourn time per application. CoDel maintains the minimum sojourn time; if min_sojourn > target, congestion is persistent. An adaptive controller could adjust target_delay per class: a low target for VoIP (5 ms), a higher target for video (20 ms), a loose one for bulk (100 ms). This requires DiffServ marking, but combined with per-class CoDel it achieves a far better latency/throughput tradeoff than any single static buffer.

What implementation challenges arise? (Hint: marking traffic at the edge requires classification, which adds CPU overhead. Dynamic target adjustment requires estimating application type from packet patterns.)

6.9.3 Exercise 3: Comparing AQM Algorithms

Given a network with: - Link capacity: 100 Mbps - RTT: 50 ms - BDP: 625 KB (100 Mbps × 50 ms / 8) ≈ 416 packets (1,500 bytes each) - No AQM initially; FIFO tail-drop with 1,000 packets (1.5 MB)

Three users simultaneously download files: - User A: Single HTTP connection, 10 MB file - User B: 10 parallel HTTP connections, 100 MB file - User C: Video stream, 5 Mbps ABR

Baseline (FIFO tail-drop): - Buffer fills immediately (1,000 packets) - Queue latency: 1,000 packets × 1,500 bytes × 8 / 100 Mbps = 120 ms - User A: 10 MB / 100 Mbps = 0.8 seconds (ignoring contention) - User B: 100 MB / 100 Mbps = 8 seconds (ignoring contention) - User C: experiences ~120 ms of queueing latency; the video pauses and rebuffers

Compare three AQM designs:

  1. RED: min_threshold = 100 packets, max_threshold = 300 packets, EWMA_weight = 0.002, max_drop_prob = 0.1
    • Initially, queue is small (< 100 packets), RED accepts all
    • When load increases, the EWMA rises slowly (it smooths over roughly 1/weight = 500 packets — tens of milliseconds to seconds, depending on arrival rate)
    • By the time RED reaches max_threshold, queue has grown to ~300 packets
    • Latency between the thresholds: (100–300 packets) × 0.12 ms per packet = 12–36 ms; worse, because the EWMA lags, the instantaneous queue can overshoot far past max_threshold toward the 1,000-packet tail-drop limit (~120 ms)
    • User B’s 10 parallel connections hash to potentially 10 different flows; RED is flow-agnostic, treats all equally
    • Problem: EWMA lag means RED detects congestion too late
    • Fairness: RED probabilistically drops any flow; no per-flow guarantee
  2. CoDel: target_delay = 5 ms, interval = 100 ms
    • Measures sojourn time per packet
    • Minimum sojourn over 100 ms window reflects persistent queue
    • If min_sojourn > 5 ms, congestion is persistent; start dropping
    • Max queue: 5 ms × 100 Mbps / 8 = 62.5 KB ≈ 42 packets
    • Queue stabilizes at ~5 ms latency
    • User A: Slightly increased latency (5-10 ms) but throughput unaffected
    • User B: 10 flows behave as 10 independent TCP connections; with 12 flows competing, they collectively capture roughly 10/12 of the link (~8 Mbps per connection, ~80 Mbps total)
    • User C: needs only 5 Mbps of the 100 Mbps aggregate; experiences ~5 ms of jitter but stable video
    • Fairness: a single CoDel queue controls aggregate delay only; flow-level sharing is left to TCP, so a host opening more flows captures more capacity
  3. Cake: CoDel + 8-way associative hashing + host-level fairness
    • Hash User B’s 10 flows: If they collide to fewer buckets, they share one CoDel
    • 8-way hashing reduces collision probability
    • Host-level fairness: User B’s flows grouped by source IP get 1/3 of capacity (User A gets 1/3, User C gets 1/3)
    • User A: 33 Mbps, transfer time = 10 MB / 33 Mbps ≈ 2.4 seconds
    • User B: 33 Mbps total (3.3 Mbps per connection), transfer time = 100 MB / 33 Mbps ≈ 24 seconds
    • User C: 34 Mbps available exceeds its 5 Mbps demand, so it streams at 5 Mbps with low latency; the scheduler redistributes the unused share to A and B (max-min behavior), so in practice A and B finish somewhat faster than the equal-split estimates above
    • Fairness: Per-host enforcement; single aggressive user (B) cannot monopolize link

Analysis: - RED: Fails due to slow detection; the EWMA lag lets the instantaneous queue overshoot, so bufferbloat persists during ramp-up. - CoDel: Succeeds at stabilizing latency (5 ms target) but does not enforce per-host fairness: User B’s 10 flows collectively capture roughly ten times the capacity of User A’s single connection. - Cake: Best overall. Per-host fairness gives each user equal opportunity, per-queue CoDel keeps latency low, and 8-way hashing reduces collision misclassification.

Verification question: If User B opens 20 parallel connections instead of 10, how does Cake adapt? (Cake’s per-host fairness groups all 20 flows by source IP, allocating 1/3 of capacity to User B as a host, regardless of how many flows they open.)

Why Cake wins: it combines four improvements: (1) sojourn-time measurement (CoDel), (2) per-flow isolation (hash queuing), (3) per-host fairness (grouping), and (4) associative hashing (collision reduction). It answers the invariants well: State (per-host plus per-flow queues), Coordination (per-host and per-flow fairness), Time (5 ms target via sojourn time).

6.9.4 Exercise 4: Token Bucket Rate Limiting

A residential ISP offers three service tiers:

  • Basic: 10 Mbps sustained, 20 Mbps burst (max 10 seconds)
  • Premium: 50 Mbps sustained, 100 Mbps burst (max 10 seconds)
  • Ultimate: 100 Mbps sustained, 200 Mbps burst (max 5 seconds)

Design a token bucket policer for the Premium tier:

Parameters: - R (token rate) = 50 Mbps (sustained rate) - Burst rate = 100 Mbps (peak rate) - Burst duration = 10 seconds (max) - Burst increment = 100 - 50 = 50 Mbps

To maintain burst for exactly 10 seconds: - Tokens depleted per second during burst: 50 Mbps = 50 Mb/s - Tokens needed for 10 seconds: 50 Mb/s × 10 s = 500 Mb = 62.5 MB - B (bucket size) = 62.5 MB

Verification: If bucket starts full with 62.5 MB and tokens drain at 50 Mbps during a 100 Mbps burst: - Time to empty: 62.5 MB / 50 Mbps = 10 seconds ✓

Scenario: Premium customer uploads 100 MB file

At sustained rate (50 Mbps): - Time = 100 MB / 50 Mbps = 16 seconds

At burst rate (100 Mbps), assuming the bucket starts full: - While sending at 100 Mbps, tokens refill at 50 Mbps, so the bucket drains at a net rate of only 50 Mbps - The 100 MB (800 Mb) file takes 800 Mb / 100 Mbps = 8 seconds at the burst rate, consuming 50 Mbps × 8 s = 400 Mb of tokens - The bucket held 500 Mb, so the burst sustains the entire upload: total time = 8 seconds, with 100 Mb of tokens to spare

Token refill after burst: - After the upload, 100 Mb (12.5 MB) of tokens remain - Refilling the 400 Mb deficit at 50 Mbps takes 8 seconds of idle or light traffic - A full refill from empty would take 500 Mb / 50 Mbps = 10 seconds

Analysis of tradeoff: - Bucket size B = 62.5 MB (500 Mb) represents 10 seconds of sustained-rate traffic — hundreds of RTTs’ worth of bandwidth-delay product on typical paths, not a per-RTT allowance - TCP slow-start bursts (an initial window of 10 packets ≈ 15 KB) are absorbed trivially - Burst duration matches a human timescale: 10 seconds is long enough for page loads and short transfers but short enough to prevent abuse - The policer enforces the rate contract at the service boundary: each customer gets their contracted rate, and enforcement is automatic
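A quick numeric check of the burst interaction is worth doing, because the subtlety is that tokens drain at the *net* rate (peak minus sustained) while refill continues underneath:

```python
# Numeric check of the Premium-tier burst (values from the exercise above).
R_SUSTAINED = 50e6   # token refill rate, bits/sec
R_BURST = 100e6      # peak send rate, bits/sec
BUCKET = 500e6       # bucket depth: 62.5 MB = 500 Mb, in bits
FILE = 800e6         # 100 MB upload = 800 Mb, in bits

net_drain = R_BURST - R_SUSTAINED            # bucket empties at 50 Mb/s
time_to_empty = BUCKET / net_drain           # 10 s: the contracted burst window
time_at_burst = FILE / R_BURST               # 8 s to send the file at peak rate
tokens_consumed = net_drain * time_at_burst  # 400 Mb, less than the 500 Mb held

burst_covers_file = tokens_consumed <= BUCKET
print(time_to_empty, time_at_burst, tokens_consumed / 1e6, burst_covers_file)
```

The full bucket outlasts the upload, so the whole file goes at the burst rate.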

Extension question: What if the ISP wants to allow unlimited burst (B → ∞)? What market consideration prevents this? (Answer: with an infinite bucket, a Premium customer could send at 100 Mbps indefinitely — exactly the Ultimate tier’s sustained rate — so no one would pay for Ultimate. The finite bucket size is what enforces tier differentiation at the policer boundary.)


6.10 References

  • Floyd, S. and Jacobson, V. (1993). “Random Early Detection Gateways for Congestion Avoidance.” IEEE/ACM Transactions on Networking, 1(4), 397–413.
  • Gettys, J. and Nichols, K. (2011). “Bufferbloat: Dark Buffers in the Internet.” ACM Queue, 9(11).
  • Høiland-Jørgensen, T., Täht, D., and Morton, J. (2018). “Piece of CAKE: A Comprehensive Queue Management Solution for Home Gateways.” IEEE LANMAN.
  • Jacobson, V., Nichols, K., and Poduri, K. (1999). “An Expedited Forwarding PHB.” RFC 2598.
  • Nichols, K. and Jacobson, V. (2012). “Controlling Queue Delay.” ACM Queue, 10(5).
  • Pan, R., Natarajan, P., Piglione, C., Prabhu, M.S., Subramanian, V., Baker, F., and VerSteeg, B. (2013). “PIE: A Lightweight Control Scheme to Address the Bufferbloat Problem.” IEEE HPSR.
  • Ramakrishnan, K., Floyd, S., and Black, D. (2001). “The Addition of Explicit Congestion Notification (ECN) to IP.” RFC 3168.


This chapter is part of “A First-Principles Approach to Networked Systems” by Arpit Gupta, UC Santa Barbara, licensed under CC BY-NC-SA 4.0.