9 Measurement, Management, and the Research Frontier
9.1 Measurement as a System—The Framework Applied to Observation
The first seven chapters examined operational systems: medium access, transport, queue management, multimedia applications. These systems move data, adapt to congestion, or present data to users. But operational systems are opaque without visibility. Measurement systems observe operational systems, and management systems act on those observations. Measurement and management are consuming systems—they do not carry user data; they observe and control systems that do.
The framework applies to measurement systems themselves. Measurement has invariant questions just as sharp as transport or queue management. What state do we observe? When do observations happen? Who coordinates measurement? How do components communicate? The answers define measurement architectures and constrain what’s observable.
Measurement is the component that observes all others. Its engineering question: how do I observe what’s actually happening in a system I can’t fully instrument? Measurement cuts across every decomposition axis — it must observe traffic (flow-level), protocol behavior (per-component), network state (per-device), and user experience (per-application).
This chapter treats measurement as the sixth system, management as the seventh. Both must answer the four invariants. Both face the anchoring constraint: visibility into opaque systems is expensive. Measurement systems trade off flexibility (answer arbitrary questions), scalability (process high-volume traffic), and accuracy (get right answers). No system achieves all three. The strategic design principle here is disaggregation: split measurement work between switches (simple, fast, scalable) and servers (complex, slower, flexible). Management systems then act on reduced, aggregated data to adapt operational systems.
The chapter builds toward Prediction 3. When administrative boundaries collapse—when a single entity owns network, endpoints, and management logic—systems can exchange richer signals and operate at tighter margins. Datacenter congestion control exploits this. SmartNIC offload exploits this. Layer 3 boundary blurring (when middle-mile and last-mile control unify) enables new architectures. The research frontier is recognizing where measurement, management, and operational control can merge.
9.2 Active vs. Passive Measurement—A State Invariant Design Choice
How do you characterize network behavior without interfering with it? Two opposite corners:
Active measurement: Inject probes, observe responses. Send sustained UDP traffic at increasing rates; infer link capacity from loss curve. Send ping packets; measure round-trip time. This trades interference for control. You ask precise questions (“What is this link’s capacity?”) and get precise answers. Cost: probes consume bandwidth and may receive preferential treatment. Benefit: reproducible, deterministic. Tools include ping (ICMP round-trip timing), traceroute (hop-by-hop latency), iperf (sustained throughput under TCP/UDP), and netperf (transport performance). Active tools are mature and widely available, making deployment relatively straightforward in controlled environments—the primary advantage for lab research.
Passive measurement: Observe existing traffic. Monitor TCP flows, count bytes over time intervals, compute aggregate throughput. Zero measurement overhead—probes don’t consume bandwidth. But you see only what happens naturally. If no one downloads large files during your observation window, you won’t see peak throughput. Results are noisy, confounded by application behavior. Tools like tcpdump (packet capture), NetFlow, and sFlow enable operational networks to monitor production traffic without injection overhead. Passive tools require deep network access and operator expertise but are essential for carrier-grade monitoring.
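Passive aggregation is easy to sketch. The following is a minimal illustration, with a hypothetical `(timestamp, bytes)` record format standing in for NetFlow-style flow records: bucket observed bytes into fixed intervals and convert each bucket to Mbps.

```python
from collections import defaultdict

def throughput_per_interval(records, interval_s=1.0):
    """Bucket passively observed (timestamp_s, nbytes) records into fixed
    intervals and report per-interval throughput in Mbps."""
    buckets = defaultdict(int)
    for ts, nbytes in records:
        buckets[int(ts // interval_s)] += nbytes
    return {b: buckets[b] * 8 / (interval_s * 1e6) for b in sorted(buckets)}

# Three transfers over three seconds; the idle second produces no bucket at all,
# which is exactly the passive-measurement bias described above.
records = [(0.2, 500_000), (0.7, 750_000), (2.1, 1_250_000)]
print(throughput_per_interval(records))  # {0: 10.0, 2: 10.0}
```

The idle interval simply never appears in the output: passive observation reports what traffic existed, not what the link could carry.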
The State invariant choice here is fundamental: what visibility do we have? Active creates synthetic state (probes); passive observes real state. Both provide valid answers to different questions. A single user downloading a file over one TCP connection (passive observation) reports lower throughput than multiple parallel downloads (implicit active test by load generation). Neither is “wrong”—they measure different scenarios.
Active measurement faces regulatory and technical barriers. Aggressive probing on the public internet can be classified as denial-of-service. ISPs may rate-limit or block ICMP (ping). Passive measurement faces privacy concerns: observing user behavior (which domains they visit, which endpoints they contact) raises data collection questions. Aggregate passive measurement (byte and packet counters) mitigates this: you learn what users achieve, not what they do.
Teaching principle: Measurement bias is fundamental. Active probes may be treated differently from real traffic (lower-priority, special queue, faster route). Passive observation is biased by application patterns (idle periods, bursty transfers). Neither is “ground truth.” The choice depends on what you’re measuring, where, and what costs you’ll accept.
9.3 Throughput Measurement—Same Link, Different Answers
Throughput is not a single quantity. It depends on measurement technique, each exposing different aspects of link behavior.
Single-threaded TCP: One connection fills one congestion window. Rate = window_size / RTT. For a 64 KB window and 100 ms RTT, throughput ≈ 5 Mbps, even on a 1 Gbps link. This measures realistic user experience (most web browsers use few parallel connections). Bias: biased low, conservative, window-limited.
Multi-threaded TCP: Many parallel connections, aggregate window larger, likely to saturate buffers and trigger loss. Reflects theoretical link capacity. Bias: biased high, aggressive, reflects raw capacity not user experience. Commercial speedtest tools (Ookla) use this approach.
Passive observation: Real users, real applications, real patterns. Bursts, idle periods, competing applications. Bias: biased low because real traffic includes idle periods; confounded by what applications do.
UDP probes: Raw packet rate at which loss rises. No congestion control, so doesn’t reflect realistic TCP behavior. Bias: biased toward packet-handling limit, not representative transport.
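The window-limited case in the single-threaded technique above is pure arithmetic: one window per round trip, regardless of link speed. A minimal sketch:

```python
def window_limited_mbps(window_bytes, rtt_s):
    """Throughput of one TCP connection limited by its window:
    at most one window's worth of data per round trip."""
    return window_bytes * 8 / rtt_s / 1e6

# 64 KB window, 100 ms RTT: ~5.2 Mbps even on a 1 Gbps link.
print(round(window_limited_mbps(64 * 1024, 0.100), 2))  # 5.24
```

Shrinking RTT to 10 ms lifts the same window to ~52 Mbps, which is why window-limited measurements are so sensitive to path length.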
Lesson: Throughput is question-dependent. “How fast can a user download?” (single-thread). “What’s the link’s capacity?” (multi-thread). “What do all users see?” (passive). Each technique answers a different question. Recognize the bias. Choose technique whose bias matches your objective.
Measurement challenges cascade: throughput is sensitive to loss (1% loss reduces TCP throughput by ~10%, compounding), time of day (peak hours show congestion, off-peak shows utilization), and hardware (modem buffers, WiFi interference). Variation within a single service plan can reach 2x (customers paying for “100 Mbps” seeing 50–130 Mbps). PowerBoost—temporary rate acceleration in cable modems—adds complexity: measured throughput jumps during initial boost period, then settles to sustained rate. Operators must understand the measurement technique to interpret results fairly.
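The loss sensitivity can be made concrete with the well-known Mathis et al. approximation for loss-limited TCP throughput, rate ≈ C · MSS / (RTT · √p). A sketch, with the conventional constant C ≈ 1.22 (an assumption of the standard derivation, not something the text above specifies):

```python
import math

def mathis_mbps(mss_bytes, rtt_s, loss_rate, c=1.22):
    """Mathis et al. approximation: loss-limited TCP throughput
    ~= C * MSS / (RTT * sqrt(p)), reported in Mbps."""
    return c * mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate)) / 1e6

# 1460-byte MSS, 50 ms RTT: quadrupling loss roughly halves throughput.
for p in (0.0001, 0.001, 0.01):
    print(f"p={p}: {mathis_mbps(1460, 0.050, p):.1f} Mbps")
```

The inverse-square-root dependence is the key point: small changes in loss rate move measured throughput a lot, so a throughput number without its loss context is hard to interpret.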
9.5 From SNMP to Modern Telemetry—The Evolution of Visibility
Traditional network management uses SNMP (Simple Network Management Protocol): operators poll each device at fixed intervals, retrieve counters (bytes transmitted, packet loss, queue depth), and assemble a view of network state. This works for slow, low-frequency monitoring but breaks under two pressures: (1) measurement scale (hundreds of devices, thousands of flows) and (2) timeliness (polling every 5 minutes misses events lasting 100 ms).
SNMP defines an Interface between management systems and managed devices: human-readable MIBs (Management Information Bases) abstract away device details. This is powerful—operators need not understand each device’s internals. But it is limited: SNMP exposes what the device’s designer decided to expose. Dynamic new queries (e.g., “which /24 subnets are sending traffic to this AS?”) require new code deployment or firmware updates, or are simply impossible. SNMP’s pull model (management station sends request, device responds) creates latency: a management station learning about a problem must wait for the next polling interval. At 5-minute intervals, a 100-millisecond anomaly is invisible. If that anomaly is a DDoS attack, the delay is unacceptable.
NETCONF/YANG introduced declarative configuration models—operators push desired state, not isolated configuration commands. But this still relies on reactive polling: ask the device, wait for response, hope the device has been logging state accurately. The fundamental Coordination problem remains: devices are passive responders to queries, not active observers.
Modern telemetry inverts this: devices proactively stream high-frequency measurements (e.g., 100 measurements/sec, not 1 measurement/5 minutes) to collectors. This requires a different state model. Devices can no longer store all observations—bandwidth and storage are too expensive. Instead, they must summarize: emit only significant changes (threshold crossing), aggregates (counters), or sampled subsets. The measurement signal becomes quantized, and the management system must infer network state from incomplete, batched information.
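On-device summarization by threshold crossing can be sketched as follows (the queue-depth samples and threshold are illustrative):

```python
def threshold_events(samples, threshold):
    """Emit an event only when the measurement crosses the threshold,
    instead of streaming every sample to the collector."""
    events, above = [], False
    for i, v in enumerate(samples):
        if v >= threshold and not above:
            events.append((i, "rise", v))   # significant change: report it
            above = True
        elif v < threshold and above:
            events.append((i, "fall", v))
            above = False
    return events

depths = [3, 4, 12, 15, 9, 2, 2, 11, 5]  # per-sample queue depths
print(threshold_events(depths, 10))
# [(2, 'rise', 12), (4, 'fall', 9), (7, 'rise', 11), (8, 'fall', 5)]
```

Nine samples collapse to four events: the collector sees every significant transition but must reconstruct everything in between, which is exactly the quantization the text describes.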
The Interface changes: SNMP/NETCONF define request-response interaction (pull model). Telemetry uses publish-subscribe (push model). This enables reactive management: observe spike in latency → immediately investigate, not after next polling interval. Devices become active, making the Coordination invariant more distributed: devices decide what to report based on local state, not waiting for external queries.
The State changes: SNMP stores queried variables on devices. Telemetry stores streams on collectors. The control-plane model inverts: instead of devices being sources of truth, collectors become sources of truth for operational state. This places higher demands on collector architecture—must handle buffering, deduplication, and out-of-order delivery.
The Time changes: SNMP uses wallclock time (polls every 5 minutes). Telemetry uses sequence numbers (message 1, 2, 3) to detect gaps. This enables lossless operation even if collector is temporarily unavailable—devices can buffer unsent messages. Time semantics become more sophisticated: devices timestamp at millisecond precision; collectors correlate timestamps from multiple devices to detect causality.
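Gap detection over sequence numbers is simple to illustrate:

```python
def detect_gaps(received_seqs):
    """Given sequence numbers of telemetry messages in arrival order,
    return (first_missing, last_missing) for each gap."""
    gaps, prev = [], None
    for s in received_seqs:
        if prev is not None and s > prev + 1:
            gaps.append((prev + 1, s - 1))
        prev = s
    return gaps

print(detect_gaps([1, 2, 3, 7, 8, 12]))  # [(4, 6), (9, 11)]
```

Once the collector knows which ranges are missing, it can request retransmission from the device's buffer of unsent messages, giving lossless operation without wallclock synchronization.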
The shift from SNMP’s request-response interaction to telemetry’s proactive streaming model inverts how networks observe themselves. Traditional management required management stations to ask devices specific questions; modern systems require devices to decide what information is worth reporting. This distributed decision-making enables reactive management: the moment a queue depth threshold is crossed, the signal reaches collectors. No waiting for next polling interval. The price is complexity: operators must understand what their devices are emitting, tuning thresholds for significance and managing on-device summarization. Collectors must handle out-of-order delivery and deduplication that polling avoided. The measurement framework trades the simplicity of pull-based polling for the reactivity of push-based streaming. Figure 9.1 depicts how devices push telemetry streams to collectors instead of passively responding to management queries.
The figure contrasts telemetry architectural paradigms spanning orders of magnitude in measurement latency. SNMP polling (leftmost panel) exemplifies the request-response model: management stations periodically query devices (typically every 5 minutes) for counters like packet counts. The 5-minute polling window means operators observe network state with a delay of up to 5 minutes—an anomaly (packet loss spike, queue buildup) is invisible until the next polling cycle. But the architecture is simple: devices are stateless (they only store counters), collectors are simple (send and receive), and operators control exactly what gets reported. Streaming telemetry (middle panel) inverts this: devices proactively push measurements to collectors at high frequency (every 1–5 seconds), eliminating polling delays. The moment a queue depth threshold is exceeded, collectors are notified. This enables reactive management: detect anomalies on the seconds timescale rather than waiting for the next 5-minute polling cycle. The cost is complexity: devices must decide what to report, collectors must handle out-of-order delivery and deduplication, and per-packet bandwidth increases if reports are not carefully filtered.
9.6 Network Telemetry—Flexibility, Scalability, Accuracy Triangle
Telemetry is automated collection of flow-level metrics and performance data. Unlike SNMP polling, telemetry brings packet-level visibility. But visibility at 100 Gbps is expensive: up to ~200 million packets/second (at minimum packet sizes) and ~12.5 GB/s of raw data. No storage system sustains this indefinitely; no single CPU core processes it. Telemetry systems must choose what to keep and what to discard.
No system achieves flexibility, scalability, and accuracy simultaneously. This is not an engineering limitation—it is a fundamental impossibility. At 100 Gbps, data generation approaches ~200 million packets/sec. Storing all packets requires ~12.5 GB/sec of sustained I/O. Processing every packet into exact aggregates requires tens of billions of operations/sec. Neither is feasible on commodity hardware. You must give up one: reduce accuracy (sample), reduce flexibility (predefine queries), or reduce scalability (external storage).
Six representative architectures occupy different positions in the flexibility–scalability–accuracy triangle:
Full packet capture: Store all headers and payloads. Maximum flexibility (answer any future query), maximum accuracy (nothing lost). Zero scalability: at 100 Gbps, buffer fills in seconds. Cost: external packet brokers, long-term storage. Use case: forensic analysis, incident replay. Tools: Gigascope [SIGMOD ’03], NetQRE [SIGCOMM ’17].
Query execution at switches: Programmed match-action pipeline executes queries in the data-plane. Only results (aggregates, counts) egress to servers. High scalability (processes at line rate), accuracy for executed queries, but flexibility is constrained: only queries expressible in match-action pipeline execute. New queries require switch reconfiguration. Use case: standard operational queries (byte counters, flow counts, traffic engineering). Tools: OpenSketch, UnivMon, OmniMon.
Sampling (NetFlow, sFlow): Sample 1 packet per K. Reconstruct from samples. High scalability (sample rate determines data volume), moderate flexibility (analyze samples post-hoc), poor accuracy (rare events invisible, confidence intervals widen with sampling). Example: at a 1/1000 sampling rate, an attack sending 100 packets/sec yields about one expected sample per 10 seconds, with a roughly 37% chance of yielding none at all in that window. Use case: trend analysis, broad anomaly detection.
Header-only capture: Full headers, no payloads. Moderate scalability (~10 TB per hour at 100 Gbps, versus 45 TB for full packet), moderate flexibility (flow analysis possible, DPI impossible), accuracy for header-visible patterns. Use case: protocol-level debugging. Tools: EverFlow [SIGCOMM ’15], dShark [NSDI ’19].
Split execution (Sonata): Distribute query work between switches (simple, fast) and servers (complex, flexible). Switch executes filters and early aggregation; server completes analysis. High scalability (switch pre-aggregates), high flexibility (declarative query language), accuracy maintained. Cost: query planning complexity (compiler, ILP solver). Use case: production networks needing both standard and research queries.
Compression: Compress data at switch before sending. Reduces bandwidth, but information loss limits future queries. Once compressed, certain analyses become impossible.
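The sampling arithmetic (third architecture above) is worth checking directly: under independent 1-in-K sampling, a 100 packet/sec flow observed for 10 seconds at 1/1000 yields one expected sample, and roughly a 37% chance of yielding none. A sketch:

```python
def expected_samples(pkts_per_s, sample_prob, window_s):
    """Expected number of sampled packets from a flow over a window."""
    return pkts_per_s * sample_prob * window_s

def p_at_least_one(pkts_per_s, sample_prob, window_s):
    """P(at least one sample) under independent per-packet sampling."""
    n = int(pkts_per_s * window_s)           # packets in the window
    return 1 - (1 - sample_prob) ** n        # 1 - P(every packet unsampled)

print(expected_samples(100, 1 / 1000, 10))          # 1.0 expected sample
print(round(p_at_least_one(100, 1 / 1000, 10), 3))  # 0.632
```

A 37% miss probability for a 1000-packet attack is why sampling suits trend analysis but not rare-event detection.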
The triangle is an impossibility: pick two of three. Every practical system abandons one. Sonata demonstrates disaggregation stretches the frontier—better flexibility + scalability than either corner alone.
9.7 Sonata and Programmable Switches—Disaggregation of the Measurement Pipeline
Sonata accepts high-level declarative queries (filter, map, reduce, distinct) and automatically partitions execution between switch (PISA data-plane) and server (control-plane). The insight: not all operations need switch execution. Switch excels at line-rate aggregation; server excels at complex analysis. Optimal partitioning minimizes tuples flowing from switch to server.
PISA programming model: Programmable parser extracts fields into a Packet Header Vector (PHV). Match-action stages perform line-rate matching and state updates. Stateful memory (counters, hash tables) stores aggregates. Deparser reconstructs modified packet. The PISA model forces deterministic packet processing: every packet takes a fixed pipeline depth (e.g., 32 stages on a high-end switch), ensuring line-rate throughput. This is fundamentally different from CPU processing, where variable-latency operations (hash table lookups, memory allocation) would cause congestion or dropped packets.
Resource constraints: PHV width (how many fields simultaneously, typically 400-800 bits), number of actions (hundreds to thousands per stage), pipeline stages (24-32 typical), stateful memory (megabits to tens of megabits). These constraints define feasible queries. A query needing to track per-flow state (thousands of flows × hundreds of bytes each = megabytes) exceeds typical switch state. Query needing more state than available cannot fit on switch—must disaggregate. The ILP solver explicitly models these constraints: “Field X needs 16 bits of PHV space; we have 768 bits total and already allocated 600 bits to other operations; can we fit this query?” If no, the solver must partition—move some filtering to the server.
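The fit check the solver performs can be sketched in a few lines (field widths and budgets are hypothetical, echoing the numbers above):

```python
def fits_switch(query_fields, phv_budget_bits, allocated_bits):
    """Does this query's set of header fields fit in the PHV space left
    over after other queries' allocations? Returns (fits, spare_bits)."""
    needed = sum(bits for _name, bits in query_fields)
    spare = phv_budget_bits - allocated_bits
    return needed <= spare, spare - needed

# The example from the text: 768-bit PHV, 600 bits already allocated.
fields = [("dstIP", 32), ("srcIP", 32), ("sport", 16)]
print(fits_switch(fields, phv_budget_bits=768, allocated_bits=600))  # (True, 88)
```

When the answer is `(False, ...)`, the solver must move some operations (and their fields) to the server, which is exactly the disaggregation decision the ILP formalizes.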
Example query: Detect DNS amplification attack victims. Filter UDP packets where sport=53 (DNS). Extract (dstIP, srcIP). Count distinct srcIPs per dstIP. Report dstIPs with > threshold responses. Naive approach: send all port-53 packets to server (~millions/sec). Sonata approach: execute filter + count on switch, report only top-10 dstIPs every 100 ms (~10 tuples/sec). 4–5 orders of magnitude reduction.
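The switch-side reduction for this query can be expressed in software for clarity (packet tuples and the threshold are illustrative; a real deployment compiles the filter/distinct/count into PISA stages):

```python
from collections import defaultdict

def dns_amp_victims(packets, threshold):
    """Filter UDP sport==53, count distinct source IPs per destination,
    and report destinations exceeding the threshold."""
    srcs = defaultdict(set)                   # dstIP -> distinct srcIPs
    for proto, sport, src, dst in packets:
        if proto == "udp" and sport == 53:    # filter
            srcs[dst].add(src)                # distinct
    return {d: len(s) for d, s in srcs.items() if len(s) > threshold}

# 40 distinct resolvers answering one victim, plus background traffic.
pkts = [("udp", 53, f"10.0.0.{i}", "198.51.100.7") for i in range(40)]
pkts += [("udp", 53, "10.0.1.1", "198.51.100.8"), ("tcp", 53, "10.0.1.2", "198.51.100.8")]
print(dns_amp_victims(pkts, threshold=30))  # {'198.51.100.7': 40}
```

Forty-two input packets reduce to one reported tuple; at line rate the same structure turns millions of packets per second into a handful of summaries.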
The architecture that achieves this reduction is explicit disaggregation: the switch becomes an active participant in measurement, not a passive observer. Simple filtering and aggregation execute at line rate on the switch, with complex analysis deferred to servers that have more flexible computation. The switch sees all 200 million packets per second but emits only thousands of summary tuples. Servers never see the raw traffic; they receive pre-filtered, pre-aggregated results. This inverts the traditional measurement model where monitoring data flows up the stack unchanged, losing information at each layer due to storage constraints. The query partitioning between switch and server is illustrated in Figure 9.2, showing which operations execute at line rate and which move to the server.
Sonata’s core innovation is automatic disaggregation of measurement queries between hardware (switches) and software (servers). A declarative query specifies what to measure—for example, “detect all source IPs sending traffic exceeding 1% of total volume.” The query compiler analyzes the query and switch capabilities (how many match-action stages are available? how much state memory?), then automatically partitions the computation: the switch executes simple operations (filtering, hashing, counting) at line rate using PISA (Protocol Independent Switch Architecture) pipelines, while servers execute complex operations (exact deduplication, threshold detection, anomaly analysis). The data reduction is dramatic: instead of forwarding 1 million packets per second (1 million candidate flows) from switch to server, the switch pre-aggregates to 1,000 candidate flows (1,000× reduction). The server then completes analysis on this pre-filtered, pre-aggregated dataset, yielding ~10 heavy-hitter flows—a cumulative 100,000× reduction from raw traffic to results.
Query partitioning via ILP: Formulate as integer linear program: minimize (tuples sent per second) subject to (switch constraints). Solver determines: which predicates execute on switch? Which aggregations execute where? The optimization is non-trivial because moving an operation from switch to server saves switch resources but increases server load. The ILP balances these tradeoffs.
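A toy stand-in for the ILP conveys the objective: choose how many leading pipeline operators run on the switch so the tuple rate crossing to the server is minimized, subject to a switch resource budget (operator costs and rates below are hypothetical):

```python
def partition(pipeline, switch_budget, raw_rate=1_000_000_000):
    """Choose how many leading operators run on the switch so that the
    tuple rate crossing to the server is minimized while total switch cost
    fits the budget. pipeline: [(name, switch_cost, output_tuples_per_s)]
    in dataflow order; k=0 means everything runs on the server."""
    best_k, best_rate = 0, raw_rate
    cost = 0
    for k, (_name, c, out_rate) in enumerate(pipeline, start=1):
        cost += c
        if cost > switch_budget:
            break                             # this prefix no longer fits
        if out_rate < best_rate:
            best_k, best_rate = k, out_rate
    return best_k, best_rate

ops = [("filter", 2, 100_000_000), ("map", 1, 100_000_000), ("reduce", 4, 100_000)]
print(partition(ops, switch_budget=7))  # (3, 100000): reduce fits on the switch
print(partition(ops, switch_budget=3))  # (1, 100000000): only the filter helps
```

The real ILP is harder: operators need not form a prefix, multiple queries share the budget, and per-stage constraints interact, but the objective (minimize switch-to-server tuples) is the same.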
Performance impact: Without partitioning, 8 simultaneous monitoring tasks on 100 Gbps traffic send ~1 billion tuples/sec to server. Basic filtering reduces to ~100 million tuples/sec (10x). Optimal partitioning reduces to ~100,000 tuples/sec (10,000x reduction, transforming infeasible to manageable).
Limitations: ILP solving is NP-hard; solving time is seconds to minutes (acceptable for long-lived queries, not ad hoc analysis). Query language intentionally restricts expressibility (no loops, limited state). Complex queries (entropy-based DDoS detection, time-series anomaly detection) may not partition well. Operator must simplify or move entirely to server. Hardware heterogeneity: ILP solver needs exact per-switch capabilities; deploying across mixed vendors is complex. Operational expertise required: understanding PISA semantics, ILP constraints, partitioning consequences.
Sonata demonstrates a core insight: disaggregation solves resource constraints. When a resource (switch state) is limited, push part of work to less-constrained place (server). Coordination shifts from runtime (every packet) to plan-time (query partitioning once). This is why disaggregation appears in transport (separating congestion control from reliable delivery), in queueing (separating detection from reaction), and in telemetry (separating filtering from analysis).
9.8 Boundary Blurring—Datacenter CC, SmartNIC Offload, and Prediction 3
Prediction 3 states: Relaxing interface or coordination constraints enables tighter belief-environment coupling, enabling operation at tighter margins.
Three examples demonstrate this across different layers:
9.8.1 Datacenter Congestion Control (DCTCP → DCQCN → HPCC → Swift)
Wide-area TCP operates in an administratively decentralized environment. No single entity sees all flows. TCP must infer congestion from loss signals (Jacobson’s algorithm), and loss is delayed and noisy. TCP’s belief about available capacity lags behind reality—packets are in flight before loss is detected. RTT spans tens of microseconds (datacenter) to hundreds of milliseconds (wide-area). Convergence speed scales with RTT. Buffering must be large to avoid loss (buffer is insurance against uncertainty).
Datacenter networks operate under single administrative control (one company, one data center). This relaxes the coordination constraint. Switches can mark packets with Explicit Congestion Notification (ECN) instead of dropping. Endpoints see congestion immediately (within one RTT), not after loss detection. Because RTT is small (microseconds to milliseconds), feedback is tight and fast. DCTCP operates at 95% utilization with sub-millisecond latencies. Wide-area TCP cannot achieve this—the administrative boundary prevents richer signals.
DCQCN (Data Center Quantized Congestion Notification) extends ECN-based control to RDMA (RoCEv2) traffic: switches mark packets under congestion, receivers relay congestion notifications back to senders, and senders adjust transmission rates in proportion to the intensity of marking, giving finer-grained, rate-based control than DCTCP's window adjustment. HPCC (High Precision Congestion Control) pushes further: each switch measures queue occupancy and link utilization in real time and carries this information in-band, in telemetry headers attached to data packets, back to the sender. The sender uses the reported per-link utilization to compute precise window adjustments, driving its inflight bytes toward the bottleneck's bandwidth-delay product. HPCC achieves near-zero queuing (average queueing delay targets below a microsecond) while maintaining near-optimal throughput (>95% link utilization), a regime impossible in wide-area TCP where RTT measurement uncertainty forces conservative buffering.
Swift tightens the coupling by a different route: instead of richer switch signals, it relies on precisely measured end-to-end delay, using NIC hardware timestamps to decompose total delay into fabric and host components, each controlled against its own explicit target. The Coordination invariant remains distributed: each sender still decides its rate independently. But single administrative control makes the delay signal trustworthy and precise enough for senders to hold queues near microsecond-scale targets. The State invariant shifts as well: the sender's belief is a directly measured delay compared against a known target, not an inference from delayed, noisy loss.
Interface relaxation: IP’s best-effort datagram interface is unchanged. But ECN (added to IP, widely deployed in datacenters) provides a richer measurement signal. This changes the State invariant: internal belief (congestion window) updates from loss (delayed, noisy) to ECN marks (immediate, precise).
Coordination relaxation: Datacenter traffic engineering (Hedera, CONGA) can move flows between paths. Wide-area routing is decentralized. DCTCP senders can trust switch ECN marks because they operate in the same administrative domain (no Byzantine switches).
Coupling tightness: Wide-area TCP belief ≈ capacity - queue. Datacenter DCTCP belief ≈ capacity - 0 (nearly perfect, because congestion signaling is immediate and reliable). HPCC belief = actual capacity (explicit, in-band signal). Swift belief = measured delay held against an explicit target (hardware-timestamped, precise).
9.8.2 SmartNIC Offload
Transport layer (TCP, QUIC) traditionally runs on CPU. Network interface card (NIC) has minimal processing—receive packets into ring buffer, transmit packets from ring buffer, update DMA pointers. This separation (CPU does logic, NIC does I/O) is clean but costly: TCP per-packet processing consumes CPU cycles (retransmission timers, ACK processing, cwnd updates); interrupt latency adds microseconds; context switches between kernel and application degrade cache locality. At 100 Gbps, per-packet interrupt overhead becomes unsustainable.
SmartNICs (programmable NICs with embedded processors) blur this boundary. Segments of transport logic move to NIC: connection state machine (SYN handling, state transitions), congestion window tracking (CWND updates, ACK processing), retransmission logic (timeout detection, packet resend), flow lookup (which connection does this ACK belong to?). This achieves two goals: (1) reducing CPU load (less context switching, fewer interrupts, CPUs freed for application), and (2) reducing latency (packet processing stays near hardware at nanosecond timescales, avoids kernel/user space boundary crossing at microsecond timescales).
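The flow-lookup piece can be sketched as a table mapping a connection 4-tuple to per-flow state (the state fields and per-ACK update below are illustrative, not a real SmartNIC API):

```python
class NicFlowTable:
    """Toy flow table of the kind a SmartNIC keeps in local DRAM: map a
    connection 4-tuple to per-flow transport state."""

    def __init__(self):
        self.flows = {}

    def on_syn(self, src, sport, dst, dport):
        self.flows[(src, sport, dst, dport)] = {"state": "SYN_RCVD", "cwnd": 10}

    def on_ack(self, src, sport, dst, dport):
        flow = self.flows.get((src, sport, dst, dport))
        if flow is None:
            return None                       # unknown flow: punt to host CPU
        if flow["state"] == "SYN_RCVD":
            flow["state"] = "ESTABLISHED"     # complete the handshake on-NIC
        else:
            flow["cwnd"] += 1                 # toy additive increase per ACK
        return flow

nic = NicFlowTable()
nic.on_syn("10.0.0.1", 40000, "10.0.0.2", 80)
print(nic.on_ack("10.0.0.1", 40000, "10.0.0.2", 80))
```

The "punt to host CPU" path is the essential design choice: the NIC handles the common case at hardware speed and escalates only exceptions, which is how the CPU load and latency savings are realized.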
Interface relaxation: Socket API unchanged. Application sees the same reliable byte stream (send() / recv() behave identically). But internally, some state (sequence numbers, window size, retransmission queue) migrates to NIC DRAM. CPU no longer owns entire connection state—it shares state with NIC through shared memory or explicit synchronization.
Coordination relaxation: When TCP runs on CPU, it coordinates with OS kernel (scheduler decides when to run), page faults (memory access delays), memory allocation (malloc/free overhead). SmartNIC TCP runs independently with dedicated resources (private processor, dedicated DRAM, no kernel interference).
Coupling tightness: CPU-based TCP experiences latency from scheduler (10+ microseconds between interrupt and handling; context switches add 1-5 microseconds each). SmartNIC TCP sees packets immediately in NIC DRAM, processes them in dedicated pipelines with no context switching, achieving sub-microsecond per-packet latency. The latency gap between CPU-based and SmartNIC-based TCP can reach 100x for high-frequency operations.
9.8.3 Layer 3 Boundary Blurring—Middle-Mile and Last-Mile Unified Control
Historically, network control has sharp boundaries: access networks (ISP) and core networks (different ISP) operate independently. They exchange traffic via open peering points; no single entity controls both. Access networks use different technologies (cable modems, DSL, fiber), operate at different scales, face different constraints.
New architectures relax this boundary. A vertically integrated ISP can implement unified congestion control across access and core: when core becomes congested, proactively slow down access link uploading (by signaling modem). When access link is congested, reroute core traffic away. This requires shared state (queue occupancy observations across both domains) and unified decision logic.
Interface relaxation: Traditionally, modem and router are closed, proprietary devices. ISPs cannot observe internal state. New modems expose telemetry (queue depth, latency, loss) via standard mechanisms (SNMP, JSON APIs). Routers signal back (via DOCSIS Message Header, or in-band signals) to modems.
Coordination relaxation: Without unified control, access and core compete independently. AIMD at both layers causes oscillation. With unified control, a single rate decision applies across both—coherent adaptation.
Coupling tightness: Separate layers → belief-environment gap is large (takes seconds to propagate signals across ISP). Unified layer → belief-environment gap is small (sub-RTT feedback).
Research frontier: When administrative boundaries collapse (one company owns entire path), tighter coupling enables new algorithms. HPCC exploits this to achieve near-optimal throughput with near-zero queuing. L4S (Low Latency, Low Loss, Scalable Throughput) uses ECN-based signaling to couple transport and AQM tightly. These algorithms fail in multi-administrative environments because they require richer signals (ECN vs loss) and tighter coordination (sub-RTT feedback).
9.9 Measurement and Management Tools—The Research Program
The instructor’s research program (TurboTest, BQT+, NetReplica, NetForge, NetGent) connects the six systems across measurement, management, and control. These tools instantiate the framework in systems you’ll use.
9.9.1 Measurement Tools: TurboTest and BQT+
TurboTest (NSDI ’26): Accelerates broadband speed tests using ML-based early termination. Insight: a speed test can be stopped early once confidence in the final estimate is high. Trained on historical test data (10 Gbps, 100 Gbps lines), the model learns when to stop collecting. Benefit: 92% reduction in test data (fewer packets, faster tests). Traditional speed-test methodologies run for 30-60 seconds to achieve statistical confidence; TurboTest learns from historical patterns that the same confidence can often be reached in 3-6 seconds. This applies Prediction 2: when measurement cost becomes pressure (battery drain, data quota), the State invariant (how much data to collect) restructures. Early termination trades measurement certainty for speed, which is optimal when test overhead is high. The measurement signal becomes sparser but still predictive of final throughput.
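A heuristic stand-in for the learned stopping rule (not TurboTest's actual model): stop when the rolling throughput estimate stabilizes within a tolerance.

```python
def early_stop(samples_mbps, window=3, tol=0.05):
    """Stop once the rolling mean over `window` samples changes by less
    than `tol` (as a fraction) between consecutive steps.
    Returns (samples_consumed, throughput_estimate)."""
    prev = None
    for i in range(window, len(samples_mbps) + 1):
        est = sum(samples_mbps[i - window:i]) / window
        if prev is not None and abs(est - prev) <= tol * prev:
            return i, est                 # estimate has stabilized
        prev = est
    return len(samples_mbps), prev        # never stabilized: use all samples

ramp = [20, 60, 85, 94, 95, 96, 95, 96, 95, 96]  # per-second samples, Mbps
print(early_stop(ramp))  # (6, 95.0): stops after 6 of 10 samples
```

A learned model replaces the fixed tolerance with patterns from historical tests, but the State-invariant restructuring is the same: collect only until the estimate is trustworthy.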
BQT+ (SIGCOMM ’26): Extends broadband measurement to affordability. Traditional speed tests measure throughput and latency. BQT+ measures affordability: given the user's actual plan (cost per GB, overage charges), what is the effective throughput once pricing is considered? A user on a 100 Mbps connection with a 200 GB/month cap can exhaust the cap in roughly 4.4 hours of sustained full-rate transfer, after which overage charges or throttling cut effective throughput well below the advertised rate. This challenges the Interface invariant: speedtest results (Mbps) do not reflect user experience if pricing creates barriers. BQT+ redefines measurement to capture administrative constraints (ISP pricing) as part of network state. It demonstrates how measurement systems must account for institutional boundaries and economic incentives, not just physical network properties.
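The cap arithmetic is a one-liner worth making explicit (plan parameters are illustrative):

```python
def hours_to_cap(plan_mbps, cap_gb):
    """Hours of sustained full-rate transfer before a monthly cap is hit."""
    gb_per_hour = plan_mbps / 8 * 3600 / 1000   # Mbps -> GB per hour
    return cap_gb / gb_per_hour

print(round(hours_to_cap(100, 200), 1))  # 4.4 hours at a full 100 Mbps
```

At 100 Mbps the link delivers 45 GB per hour, so a 200 GB cap is a few hours of sustained use: the advertised rate and the usable rate are very different quantities.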
9.9.2 Simulation and Replay: NetReplica
NetReplica: A network simulation platform that replays real traffic workloads under controlled network conditions. Measurement challenge: field measurements show user behavior (which applications users run, what latencies they experience) but confound network conditions with user behavior. Simulation challenge: synthetic workloads (iperf, netperf) don’t reflect real application patterns (video buffering, TCP slow-start, idle periods).
NetReplica solves this by (1) capturing real traffic at scale (packet traces or flow summaries), (2) replaying traffic in emulated network with shaped conditions (100 Mbps link, 50 ms latency, 1% loss), and (3) measuring application outcomes (page load time, video stalls, call quality). This enables controlled experiments: “how does page load time change when latency increases from 10 ms to 100 ms?” without affecting real users. It demonstrates Closed-loop reasoning: observing real user behavior, simulating intervention, measuring outcome.
9.9.3 Code Generation and Experimentation: NetForge and NetGent
NetForge: A data-driven code generation system for network functions. Input: captured traffic trace or behavioral specification. Output: synthesized P4 dataplane code (switch processing logic). Insight: telemetry and traffic engineering functions often have repetitive structure (count packets matching pattern, aggregate by key, report top-k). Instead of hand-coding P4, specify the query declaratively; code generation produces efficient PISA code. This partially automates Sonata’s manual partitioning step.
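The declarative-query idea can be illustrated in software: describe a telemetry task as (filter, key, top-k) and evaluate it over packets. A generator like NetForge would compile such a spec to P4 for the switch instead; the spec format below is invented for illustration.

```python
# A declarative telemetry query evaluated in software: count packets
# matching a predicate, group by a key, report the top-k heavy hitters.
# Hypothetical spec format; NetForge would emit P4 for such a query.

from collections import Counter

def run_query(packets, match, key, k):
    """Count packets matching `match`, grouped by `key`; return top-k."""
    counts = Counter(key(p) for p in packets if match(p))
    return counts.most_common(k)

# "Top-2 destination IPs receiving SYN packets" as a declarative spec.
pkts = [
    {"dst": "10.0.0.1", "flags": "S"},
    {"dst": "10.0.0.1", "flags": "S"},
    {"dst": "10.0.0.2", "flags": "S"},
    {"dst": "10.0.0.3", "flags": "A"},
]
top = run_query(pkts, match=lambda p: p["flags"] == "S",
                key=lambda p: p["dst"], k=2)
```

The repetitive filter/aggregate/top-k structure is exactly why code generation is feasible: the query shape is fixed and only the predicates and keys vary.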
NetGent: A browser automation system that executes application workflows (YouTube playback, Zoom calls, Netflix streaming) in a controlled environment. Measurement challenge: field speedtest measures throughput/latency of the network, but user experience depends on application behavior. NetGent bridges this by:
- Accepting application specification (workflow, duration, user actions)
- Executing workflow in browser while network is shaped (via CTP shaping service)
- Measuring application-level metrics (stalls, resolution, frames per second)
Example: “simulate a 60-second YouTube session with 20 Mbps capacity and 100 ms latency; measure stalls and average bitrate.” This replaces passive user measurement (which confounds everything) with controlled active measurement of real applications under known network conditions.
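The application-level metrics reduce to simple computations over a playback event log. The event format below is hypothetical; NetGent derives such metrics from real browser sessions.

```python
# Derive application-level metrics (stall count, average bitrate) from a
# playback event log. Event format is an invented illustration.

def playback_metrics(events):
    """events: list of (t_seconds, state, bitrate_kbps) tuples."""
    stalls = sum(1 for _, state, _ in events if state == "stall")
    rates = [r for _, state, r in events if state == "playing"]
    avg_bitrate = sum(rates) / len(rates) if rates else 0.0
    return stalls, avg_bitrate

# A session that starts at 4000 kbps, stalls once, then recovers lower.
log = [(0, "playing", 4000), (10, "stall", 0),
       (12, "playing", 2500), (30, "playing", 2500)]
stalls, avg = playback_metrics(log)
```

A throughput-only speedtest would report the link as healthy throughout this session; the stall and the bitrate downgrade are visible only at the application layer.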
These tools instantiate the framework: measurement as a system (answer State, Time, Coordination, Interface questions), management as action (observe system state, apply closed-loop control), and the research frontier as boundary-blurring (where tight coupling enables new algorithms).
9.10 The FCC Speed Test Paradox—Generative Exercise 1
The FCC mandates broadband providers measure and report speed test results to ensure “adequate” service. But the measurement itself reveals the framework’s power to hide problems.
The scenario: Cable modem connected to ISP, running FCC-approved speedtest. Test measures downstream throughput (how fast can you download?), upstream throughput, latency (idle), and loss.
Measurement setup: Speedtest downloads a file from a nearby server (to maximize throughput), measures bytes transferred per second. Idle latency: send 5 ICMP ping packets with minimal traffic.
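The downstream measurement reduces to bytes transferred per unit time. A minimal sketch of that arithmetic from periodic byte-counter readings follows; real speed tests add parallel connections, warm-up discarding, and server selection.

```python
# Compute per-interval throughput (Mbps) from cumulative byte counters
# sampled at a fixed interval, as a download progresses. Minimal sketch.

def throughput_mbps(byte_samples, interval_s):
    """Per-interval throughput from consecutive cumulative byte readings."""
    return [
        (b2 - b1) * 8 / 1_000_000 / interval_s
        for b1, b2 in zip(byte_samples, byte_samples[1:])
    ]

# Cumulative bytes observed once per second during a steady download:
# 12.5 MB per second corresponds to 100 Mbps.
counters = [0, 12_500_000, 25_000_000, 37_500_000]
rates = throughput_mbps(counters, 1.0)
```

Note what this methodology cannot see: a steady 100 Mbps off-peak says nothing about latency under load, which is exactly the gap the next subsection exposes.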
Framework analysis—State invariant:
- Environment: Actual modem buffer depth (100 ms of data). Actual congestion on ISP core (none during test, because test is during off-peak).
- Measurement signal: Downloaded bytes per second. Idle latency (ping with no competing traffic).
- Internal belief: FCC report says “user can achieve 100 Mbps downstream, 10 ms latency.”
The problem: This belief matches environment during test (off-peak, no congestion), but diverges dramatically during peak hours. Peak hours: modem buffer fills with competing user traffic; latency jumps to 150 ms; throughput drops to 30 Mbps. Why? Because the buffer is oversized (by 100x) and AQM is broken.
Prediction 2 applies: When measurement cost becomes pressure (users want tests that don’t run for hours), the State invariant (what to measure) restructures. FCC metrics measure ideal conditions (idle latency, peak rate), not realistic conditions (load latency, achievable throughput during congestion). The measurement framework chose to measure idle state, which is easier and makes ISPs look good, but hides bufferbloat.
What the framework predicts: Fixing this measurement requires measuring State under different conditions. Idle latency (what FCC currently mandates) is insufficient. Must measure working latency (latency under competing background traffic). Must measure throughput under loss (not just loss-free conditions). This restructures the Interface invariant: FCC speedtest must report both idle and working metrics, or it masks the real problem.
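The idle-versus-working comparison the fix requires is mechanically simple: probe RTTs once with the link idle and again while saturating traffic runs, then report the gap. The RTT values below are illustrative.

```python
# Compare median idle latency against median latency under load ("working
# latency"). A large gap indicates bufferbloat. Sample RTTs are invented.

def latency_gap_ms(idle_rtts, loaded_rtts):
    """Median working latency minus median idle latency, in ms."""
    med = lambda xs: sorted(xs)[len(xs) // 2]
    return med(loaded_rtts) - med(idle_rtts)

idle = [10, 11, 10, 12, 10]         # pings with no competing traffic
loaded = [140, 155, 150, 160, 148]  # pings during a bulk download
gap = latency_gap_ms(idle, loaded)  # 140 ms gap: oversized buffer filling
```

Reporting both medians (or the gap itself) is the Interface change the text calls for: idle latency alone would report a healthy 10 ms here.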
The measurement methodology reveals a deeper truth: measurement design is not neutral. By choosing to measure idle conditions, the FCC enabled bufferbloat to persist unseen. By ignoring PowerBoost effects (burst rate that transitions to sustained rate), the methodology allowed ISPs to oversell burst capacity. Different methodologies would reveal different truths. Point-in-time measurement misses time-of-day variation and seasonal patterns. Longitudinal measurement (tracking speed over days or weeks) would expose these patterns and change how operators and ISPs set expectations. The design of the measurement framework is the design of policy: what the framework measures is what gets reported, and what gets reported shapes what ISPs optimize for. Figure 9.3 illustrates how different measurement approaches (idle, load, sustained) expose different aspects of network behavior and can lead to opposite conclusions about service quality.
The top panel reveals the PowerBoost measurement paradox: a standard FCC speed test (Ookla/Speedtest) operates over a 30-second window. PowerBoost’s burst phase lasts approximately 10 seconds at 20 Mbps, then transitions to the sustained rate (6 Mbps) for the remaining 20 seconds. The speed test measures average throughput over the entire 30-second window: (10 s × 20 Mbps + 20 s × 6 Mbps) / 30 s = 320 Mb / 30 s = 10.7 Mbps reported. But users see the initial 10-second boost and perceive 20 Mbps as their “real” speed. The test is decoupled from actual user behavior: if a user transfers a 100 MB file (800 Mb), only the first 25 MB moves at the burst rate; the remaining 75 MB takes another 100 seconds at 6 Mbps, so they experience the sustained rate for nearly the entire transfer. The speed test window is too short to reveal sustained capacity constraints; it artificially privileges burst performance.
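The windowed-average arithmetic from the PowerBoost example, made explicit:

```python
# PowerBoost arithmetic: the test window averages the burst and sustained
# phases, while a long transfer runs almost entirely at the sustained rate.

def windowed_average_mbps(phases, window_s):
    """phases: list of (duration_s, rate_mbps); average over window_s."""
    total_mb = sum(d * r for d, r in phases)  # megabits transferred
    return total_mb / window_s

# 10 s burst at 20 Mbps + 20 s sustained at 6 Mbps over a 30 s window.
reported = windowed_average_mbps([(10, 20), (20, 6)], 30)   # ~10.7 Mbps

# A 100 MB (800 Mb) transfer: 200 Mb fits in the 10 s burst; the
# remaining 600 Mb at 6 Mbps takes 100 s more.
transfer_s = 10 + (800 - 10 * 20) / 6                       # 110 s total
```

The reported 10.7 Mbps sits between what users perceive (20 Mbps) and what long transfers actually get (6 Mbps), which is why neither party trusts the number.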
Exercises:
1. Design a measurement methodology that reveals bufferbloat without requiring end users to run complex diagnostics. What State should you observe? How would Closed-loop reasoning help (observing before bufferbloat is fixed, then after)? Consider: could you measure the latency gap (idle vs. working) automatically by having the modem run lightweight probes in the background, correlating with real traffic? What would be the measurement overhead?
2. Current FCC speed test is a point-in-time measurement. How would you design a longitudinal measurement (over days/weeks) to capture time-of-day variation and seasonal patterns? What Time semantics would enable this? Should you measure at fixed times (e.g., 8am, 5pm daily) or random times? Why does the choice matter for diagnosing bufferbloat patterns?
3. PowerBoost (burst rate for first 32 seconds, then sustained rate) dramatically changes speedtest results. A modem with PowerBoost will report 150 Mbps (burst rate), but real downloads average 100 Mbps (sustained). How does the Interface (what results to report) need to change to reflect this? Should FCC require reporting both peak and sustained rates? Should the test be designed to measure sustained rate (run longer than 32 seconds) or report both? What is the fairness implication of each choice?
9.11 L4S and Administrative Boundaries—Generative Exercise 2
L4S (Low Latency, Low Loss, Scalable Throughput) is a new queueing discipline that splits traffic into two queues: one for ECN-capable senders (who respond quickly to ECN marks) and one for classic TCP (who respond to loss). The dual-queue design maintains backward compatibility while enabling ultra-low latency for ECN-capable flows.
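The dual-queue split begins with classification on the ECN codepoint in the IP header: ECT(1) identifies an L4S-capable sender, and CE-marked packets default to the low-latency queue as well. A simplified sketch, omitting the coupled-AQM marking logic:

```python
# Simplified L4S dual-queue classification on the two-bit ECN field.
# ECT(1) is the L4S identifier; CE packets default to the L4S queue.
# Sketch only: the coupled AQM's marking/dropping logic is omitted.

NOT_ECT = 0b00  # no ECN support: classic, loss-based TCP
ECT0    = 0b10  # classic ECN (RFC 3168)
ECT1    = 0b01  # L4S identifier codepoint
CE      = 0b11  # congestion experienced

def classify(ecn_bits):
    """Return which of the two queues a packet enters."""
    return "low-latency" if ecn_bits in (ECT1, CE) else "classic"

queue = classify(ECT1)   # an L4S flow lands in the low-latency queue
```

Because classification keys on two header bits, a single middlebox that bleaches the ECN field silently demotes an L4S flow to the classic queue, foreshadowing the boundary fragility discussed below.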
The scenario: An ISP deploys L4S on its access link. Cable modems detect ECN-capable packets and queue them separately. Result: ECN flows see <1 ms latency; classic TCP flows see 10 ms latency. Users with ECN-capable applications (video conferencing, online gaming) get better experience; legacy applications get baseline experience.
Framework analysis—Coordination invariant:
- Wide-area Internet: Administrative decentralization. Each AS operates independently. ECN-capable sender cannot trust that path will mark (some middleboxes drop ECN). Must fall back to loss-based control. Result: hybrid behavior, unpredictable. L4S cannot work end-to-end.
- ISP access network: Administrative unification. Single ISP owns modem, router, edge switch. Sender (in user home) and AQM (in ISP equipment) are under same administrative control. Sender can trust ECN marks. L4S works within single ISP.
The problem: L4S requires Coordination relaxation (trusted signaling) and Interface relaxation (ECN field must be preserved). But ECN adoption is low (many middleboxes still remove ECN), and dual-queue implementation is complex. Deployment fragmented: one ISP deploys L4S, another doesn’t. What happens when traffic crosses an administrative boundary (leaves ISP with L4S, enters ISP without L4S)?
Prediction 3 applies—tighter margins: Within the ISP, L4S operates at 95% utilization with sub-millisecond latencies (like datacenter DCTCP). At the boundary, traffic hits classic queue management, loses ECN marks, falls back to loss-based feedback. Belief-environment coupling degrades. Throughput drops. Latency increases.
What the framework predicts: L4S is fragile under administrative boundary relaxation. It works beautifully when tight coupling is possible (single ISP), but fails when boundaries blur (traffic leaves ISP). This is why L4S will likely remain a datacenter or enterprise network solution, not a wide-area Internet protocol.
Exercises:
1. Design a measurement to detect when L4S boundaries are crossed. What State would you observe? (Hint: L4S flows experience low latency; non-L4S flows experience high latency. Can you infer administrative boundary from latency divergence?) Consider: if you send test packets with and without the ECN flag, will they follow the same path? Will they experience the same queue? If not, this reveals the boundary.
2. L4S requires ECN marking and ECN-aware congestion control. Currently, ECN adoption is ~50% (some paths support it, some don’t). Design a fallback strategy: if ECN is not available, what should L4S do? How does this change the Interface invariant? Should L4S senders gracefully degrade to loss-based feedback when ECN marking is not detected? Should they detect this automatically or require explicit configuration?
3. PowerBoost (from Exercise 1) interacts poorly with L4S. During PowerBoost, modem sends at 150 Mbps for 32 seconds, then drops to 100 Mbps. L4S AQM is tuned for the sustained rate (100 Mbps). What happens during PowerBoost? (Hint: queue drains faster than expected, latency drops paradoxically below target, then spikes when PowerBoost ends because the sudden rate drop causes queue buildup.) How would you redesign L4S to handle rate changes? Could you measure current egress rate and adapt target queue delay dynamically?
9.12 References
- Alizadeh, M., Greenberg, A., Maltz, D.A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., and Sridharan, M. (2010). “Data Center TCP (DCTCP).” Proc. ACM SIGCOMM.
- Bosshart, P., Gibb, G., Kim, H.-S., Varghese, G., McKeown, N., Izzard, M., Mujica, F., and Horowitz, M. (2013). “Forwarding Metamorphosis: Fast Programmable Match-Action Tables in Reconfigurable Hardware.” Proc. ACM SIGCOMM.
- Cardwell, N., Cheng, Y., Gunn, C.S., Yeganeh, S.H., and Jacobson, V. (2016). “BBR: Congestion-Based Congestion Control.” ACM Queue, 14(5).
- De Coninck, Q., and Bonaventure, O. (2019). “Leveraging the OpenFlow Protocol for Congestion Avoidance.” Proc. ACM CoNEXT.
- Dukkipati, N., Refice, T., Cheng, Y., Chu, J., Herbert, T., Agarwal, A., Jain, A., and Lam, V.Y. (2010). “An Argument for Increasing TCP’s Initial Congestion Window.” ACM SIGCOMM Computer Communication Review, 40(3).
- Gupta, A., Harrison, R., Canini, M., Feamster, N., Rexford, J., and Willinger, W. (2018). “Sonata: Query-Driven Streaming Network Telemetry.” Proc. ACM SIGCOMM.
- Hao, W., Salvatore, S., Jayakumar, R., and Rexford, J. (2016). “Measuring Per-application Network Performance in Networks with in-Network Computing.” Proc. USENIX NSDI.
- Li, Y., Miao, R., Liu, H.H., Zhuang, Y., Feng, F., Tang, L., Cao, Z., Zhang, M., Kelly, F., Alizadeh, M., and Yu, M. (2019). “HPCC: High Precision Congestion Control.” Proc. ACM SIGCOMM.
- McKeown, N., Anderson, T., Balakrishnan, H., Parulkar, G., Peterson, L., Rexford, J., Shenker, S., and Stoica, I. (2008). “OpenFlow: Enabling Innovation in Campus Networks.” ACM SIGCOMM Computer Communication Review, 38(2).
- Nichols, K., and Jacobson, V. (2012). “Controlling Queue Delay.” ACM Queue, 10(5).
- Ramakrishnan, K., Floyd, S., and Black, D. (2001). “The Addition of Explicit Congestion Notification (ECN) to IP.” RFC 3168.
- Sarrar, N., Uhlig, S., Feldmann, A., Sherwood, R., and Yalagandula, P. (2012). “Leveraging Zipf’s Law for Traffic Offloading.” ACM SIGCOMM Computer Communication Review, 42(5).
- Sundaresan, S., de Donato, W., Feamster, N., Teixeira, R., Crawford, S., and Pescapè, A. (2011). “Broadband Internet Performance: A View from the Gateway.” Proc. ACM SIGCOMM.
- White, S., Volk, D., Chen, L., and Sundaresan, S. (2012). “Broadband Performance: Trends in User Speeds.” FCC Measuring Broadband America.
9.13 Summary
Measurement and management consume operational systems—they observe to enable control. Both must answer the four invariants, trading off flexibility, scalability, and accuracy. Active measurement trades interference for control; passive measurement trades control for non-interference. Throughput depends on technique; latency reveals queue state.
From SNMP’s slow polling to modern telemetry’s high-frequency streaming, measurement systems have evolved to handle faster networks and more granular visibility. Yet the fundamental tradeoff remains: flexibility, scalability, and accuracy form an impossibility triangle. No system achieves all three simultaneously.
Telemetry systems occupy distinct positions on this triangle. Full packet capture maximizes flexibility and accuracy but fails at scale. Switch-only execution maximizes scalability and accuracy but limits flexibility. Sampling maximizes scalability and flexibility but sacrifices accuracy. Sonata demonstrates how disaggregation stretches the frontier by splitting query execution between switch and server, achieving 10,000x data reduction (from 1 billion tuples/sec to 100,000 tuples/sec) while maintaining accuracy and flexibility. PISA programmable switches enable in-network computation, but resource constraints (PHV width, pipeline depth, stateful memory) force critical decisions about what compute to push to switches and what to keep on servers.
Prediction 3 manifests across layers: datacenter DCTCP operates at 95% utilization with sub-millisecond latencies because administrative unification enables richer signals (ECN marks instead of loss); HPCC and Swift achieve near-zero queuing with near-optimal throughput through even tighter coordination; SmartNIC offload achieves sub-microsecond per-packet latency through boundary blurring between CPU and NIC; L4S requires administrative trust and ECN preservation to function. The research frontier recognizes that relaxing administrative boundaries and interface constraints enables tighter coupling and operation at margins previously infeasible.
The instructor’s research program (TurboTest, BQT+, NetReplica, NetForge, NetGent) instantiates these principles in tools for measurement, simulation, and application-aware experimentation. These systems bridge measurement and management, connecting the framework from theory to practice.
The framework is now complete: four invariants answer structural questions; three principles guide solution strategies; anchored dependency graphs trace constraints; six systems span the design space; and three predictions make testable claims about where systems are fragile. Use this framework to evaluate any networked system: identify the anchor, answer the invariants, trace the dependency graph, evaluate closed-loop dynamics, and check meta-constraints. The framework will show you where the system fails and what changes are feasible.
This chapter is part of “A First-Principles Approach to Networked Systems” by Arpit Gupta, UC Santa Barbara, licensed under CC BY-NC-SA 4.0.