From Measurement to Trustworthy Control

CS176C — Advanced Topics in Internet Computing

Arpit Gupta

2026-05-26

You Are On Call

It is 8 pm. A student messages the campus help desk:

“YouTube is broken in my dorm room.”

What do you do?

  1. Look at your monitoring dashboard.
  2. Form a guess.
  3. Change something.
  4. Watch the dashboard again.

Welcome to network operations.

Why Do We Even Need an Operator?

If TCP picks its own congestion window, BGP picks its own best path, AQM picks its own packets to drop…

…what’s left for a human to do?

Protocols decide within their scope. Humans decide about the scope.

What flows are allowed. Which queue policy the router runs. When to add a new uplink. What to do when half the campus cannot reach YouTube. Whether the new firewall rule is legal. Who to peer with.

A network operator is the person (or team) responsible for keeping a network running and meeting its service obligations.

Today’s Argument

Operations is a loop: sense → believe → decide → act.

For 40 years that loop has been built around what a single human can absorb and act on.

Today’s claim: most of the design decisions in network operations — how telemetry is sampled, how dashboards aggregate, how alerts get thresholded, how often we poll — only make sense once you treat the human operator as the binding constraint at the end of the loop.

L15 (today): the loop, the four invariants that constrain it, and how operators measured and controlled networks from 1988 to 2016.

L16 (Thursday): the data-plane revolution and what comes next.

The Operator’s Loop

  1. Sense — read telemetry.
  2. Form a belief — what is happening?
  3. Decide what to do.
  4. Act — change something.
  5. Sense again. Repeat.

Everything else — counters, dashboards, SNMP, SDN, modern automation systems — exists to make some part of this loop faster, cheaper, more accurate, or more accountable.
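The loop can be sketched in a few lines of code. This is a minimal illustration, not an operational tool: the function names, the 90% threshold, and the "reroute" action are hypothetical choices made for the example, and the telemetry is a hard-coded list of utilization samples.

```python
from dataclasses import dataclass


@dataclass
class Belief:
    congested: bool   # the operator's interpretation
    evidence: float   # the raw sample that led to it


def sense(sample):
    """Sense: read one telemetry value (here, a pre-recorded utilization)."""
    return sample


def form_belief(sample, threshold=0.9):
    """Believe: interpret the sample. The 0.9 threshold is an assumption."""
    return Belief(congested=sample >= threshold, evidence=sample)


def decide(belief):
    """Decide: pick an action name (hypothetical) from the belief."""
    return "reroute" if belief.congested else "no-op"


def act(action, log):
    """Act: in this sketch, acting only records the chosen action."""
    log.append(action)


def run_loop(samples):
    """One pass of sense -> believe -> decide -> act per sample."""
    log = []
    for s in samples:
        act(decide(form_belief(sense(s))), log)
    return log


actions = run_loop([0.55, 0.60, 0.97, 0.58])
print(actions)
```

Each tool discussed later replaces or speeds up one of these four functions; the structure of the loop itself stays the same.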

The four invariants from Ch 1 (State, Time, Coordination, Interface) are constraints on this one loop.

The rest of this lecture: walk each constraint, then trace how it has changed across four decades.

A Running Example

8 pm Tuesday. A student streams an HD lecture recording.

The playback freezes for three seconds, then resumes at lower quality.

“YouTube is broken.”

We will answer “What just happened?” from five different perspectives. Hold them all in your head.

The Internet Is a Network of Networks

~75,000 autonomous systems, each independently operated.

The campus IT department’s authority ends at the boundary where it hands packets to the eyeball ISP.

The eyeball ISP’s authority ends at the boundary where it hands to a transit provider or to the CDN.

Each boundary is where one operator’s view ends and another’s begins. Multi-stakeholder ops is not a side effect of regulation. It is the architecture.

So whose view is true when YouTube freezes?

Five Different People, Five Different Beliefs

The 8 pm rebuffer happens. What does each party see?

  Who                                       What they see
  ----------------------------------------  -------------
  The student                               ?
  The content provider (YouTube, CDN team)  ?
  The campus NOC                            ?
  The eyeball ISP                           ?
  A third-party probe (if deployed)         ?

Five Different People, Five Different Beliefs

  Who                   What they see
  --------------------  -----------------------------------------------------------------
  The student           Spinning wheel
  The content provider  One rebuffer; ABR drops 1080p → 480p; no surge across other users
  The campus NOC        Green dashboards. No alert.
  The eyeball ISP       Nothing unusual at the border router
  A third-party probe   Brief latency spike at 8:00:23 pm

Five truthful beliefs. No two of them describe the same picture.

Attribution = composing these partial views. No single party can do it alone.

What Is the Loop For? Performant + Secure

Performance is layered by consumer. Different consumers care about different things.

  Consumer           What kind of performance?    What does that look like?
  -----------------  ---------------------------  -------------------------
  Campus NOC         Capacity                     ?
  Application owner  Quality of Service (QoS)     ?
  End user           Quality of Experience (QoE)  ?

What Is the Loop For? Performant + Secure

  Consumer           What kind of performance?  What does that look like?
  -----------------  -------------------------  -------------------------
  Campus NOC         Capacity                   Link bytes-per-second
  Application owner  QoS                        Flow Completion Time, SLO
  End user           QoE                        Video plays smoothly?

Always relative to a workload. Same network, different traffic mix → different verdict.

Security is a closed loop too: trustworthy substrate + policy verification + anomaly detection + bounded-time mitigation. (Not the cryptographic kind.)

What Constrains the Loop?

Four invariants. Four constraints. Same loop.

STATE — Three Layers

  1. Environment — what’s actually happening on the network. Unobservable in full.
  2. Measurement — what telemetry actually captures. Lossy projection.
  3. Belief — the operator’s model of what’s happening.

Run the rebuffer through these layers. The campus NOC’s measurement was the SNMP counter, averaged over 5 minutes. Which layer lost the burst?

Why State Is Hostile to Naive Measurement

A 200 ms burst saturated the residential VLAN. The 5-min SNMP average shows 60% utilization. Why does the average hide the burst?
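A quick back-of-the-envelope calculation shows how little the burst moves the average. The sketch below assumes the link sat at a steady 60% outside the burst and at 100% during it; that baseline is an assumption for illustration, since the slide only gives the 5-minute average.

```python
WINDOW_S = 300.0   # one SNMP polling interval (5 minutes)
BURST_S = 0.2      # the 200 ms burst
BASELINE = 0.60    # assumed steady utilization outside the burst
BURST = 1.00       # link saturated during the burst

# Time-weighted average utilization over the polling window.
avg = (BASELINE * (WINDOW_S - BURST_S) + BURST * BURST_S) / WINDOW_S

# How far the burst shifted the reported number.
shift = avg - BASELINE

# Fraction of the window the burst occupied.
burst_fraction = BURST_S / WINDOW_S

print(f"reported average: {avg:.4%}")
print(f"shift caused by burst: {shift * 100:.4f} percentage points")
print(f"burst occupies {burst_fraction:.4%} of the window")
```

Under these assumptions the saturating burst changes the reported value by less than a tenth of a percentage point, which is invisible on any dashboard.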

In 1994, Leland, Taqqu, Willinger, and Wilson analyzed high-resolution traces of real Ethernet traffic collected at Bellcore and found:

Traffic is bursty at every timescale they could observe — milliseconds, seconds, minutes, hours.

Zoom in. Zoom out. Same shape. They named it self-similarity.

Aggregating does not smooth. The Poisson model — basis of decades of queueing theory inherited from telephony — was empirically wrong.

Consequences for measurement:

  • Averaging at any timescale hides bursts at every smaller timescale.
  • Uniform sampling under-represents heavy tails (a few flows carry most bytes).
  • Periodic polling aliases with periodic phenomena.
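The aliasing consequence can be made concrete with a toy example. The sketch assumes a hypothetical load that is busy for the first 60 seconds of every 5 minutes, and a poller that reads an instantaneous gauge every 5 minutes. (Real SNMP interface counters are cumulative, so this illustrates gauge-style readings only.)

```python
def gauge(t, period=300, busy=60):
    """Hypothetical load: 100% for the first `busy` seconds of each period."""
    return 1.0 if (t % period) < busy else 0.0


def poll(offset, interval=300, n=12):
    """Read the gauge every `interval` seconds, starting at `offset`."""
    return [gauge(offset + k * interval) for k in range(n)]


# Same load, same polling interval, different start phase.
always_busy = poll(offset=10)    # every poll lands inside the busy window
always_idle = poll(offset=120)   # every poll lands outside it

# Ground truth from one-second samples over an hour.
true_duty_cycle = sum(gauge(t) for t in range(3600)) / 3600

print(set(always_busy), set(always_idle), true_duty_cycle)
```

Because the polling period matches the period of the load, the poller reports either a permanently saturated link or a permanently idle one, and neither matches the true 20% duty cycle.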

TIME — Below Seconds, the Human Cannot Be in the Loop

Network operations span timescales. For each one below, what kind of decision lives there? Try to fill in before you click.

  Timescale         What kind of decision?  Concrete example
  ----------------  ----------------------  ----------------
  Microseconds      ?                       ?
  Milliseconds      ?                       ?
  Seconds           ?                       ?
  Minutes to hours  ?                       ?
  Days+             ?                       ?

Take a minute. Where in the table can the human still be the decider?

  Timescale         What kind of decision  Concrete example
  ----------------  ---------------------  --------------------------------------
  Microseconds      Per-packet             AQM (L14: FQ-CoDel, CAKE)
  Milliseconds      Per-RTT                TCP cwnd
  Seconds           Per-flow               Bitrate adaptation in the video player
  Minutes to hours  Per-incident           Outage triage
  Days+             Policy / planning      Topology change, vendor renewal

Human reaction sits at the slow end of this table, in the bottom rows. Not because of laziness. Because of physics.

If the human is not in the loop, who is? That’s the next invariant.

COORDINATION — Who Can Decide, and With What Scope?

Observation tells you what’s happening. Authority lets you act. They are not the same: the content provider can observe its own users’ rebuffers but cannot reach into the campus network to fix them.

There are three steps of authority over the loop. For each one, think: who has the password? what’s the scope of their decisions?

  Step                                Who has authority?  What’s the scope?
  ----------------------------------  ------------------  -----------------
  1. One router at a time             ?                   ?
  2. One controller for one operator  ?                   ?
  3. Across multiple operators        ?                   ?

Stop. Predict each one before clicking.

  Step                                Who has authority?                     What’s the scope?
  ----------------------------------  -------------------------------------  -------------------------------------------------------------
  1. One router at a time             Whoever has the device password        One router; one CLI session
  2. One controller for one operator  Whoever wrote the controller’s policy  Every router in the operator’s domain (SDN, §Act 7)
  3. Across multiple operators        No single authority                    Failures cross AS boundaries; still mostly a research problem

So who’s at the receiving end of all this authority? What does the operator actually see?

INTERFACE — The Consumer Contract

Interface = the contract between the network and whoever consumes belief and action.

For 40 years, that consumer has been a single human operator. What does that look like concretely? Predict before you click.

  When the operator receives belief, it looks like…  When the operator takes action, it looks like…
  -------------------------------------------------  ----------------------------------------------
  ?                                                  ?
  ?                                                  ?
  ?                                                  ?

Three examples each. Walk through your day on an on-call shift.

  Belief reaches the operator as…  The operator’s action looks like…
  -------------------------------  ---------------------------------
  A chart on a dashboard           A button click
  A colored alert                  A CLI line
  A paragraph in a runbook         A Slack message

Most operations tooling is a workaround for the limits of what a single human can absorb and act on.

The Forty-Year Constraint

The networks could always emit more.

The human could never absorb more.

The tooling lived in the gap.

That gap shows up everywhere: aggregated counters, thresholded alerts, 5-minute polling cadences, dashboards as the universal interface.

If the consumer contract has shaped four decades of tools, what was the first version?

The Chronology

What was the first consumer contract?

1988–1995: SNMP Counters

1988: SNMP standardized. Per-link device counters. Polled every 5 minutes.

Why 5 minutes? Not a network constraint — a consumer constraint. Faster polls would have produced more data than the human in the loop could absorb.

So operators ran on 5-min averages. They suspected they were missing things, but they had no way to prove it.

By 1994 they had proof: the self-similarity result of Leland, Taqqu, Willinger, and Wilson (introduced earlier under “Why State Is Hostile to Naive Measurement”) showed traffic was bursty at every timescale, meaning the 5-min averages had been systematically lying about what the network was doing.

Knowing the era’s tooling was lying didn’t give operators anything better to deploy. What came next?

1995–2003: The Flow Becomes the Unit

Cisco NetFlow v5 (1996): the flow as the operational unit. One record per 5-tuple, aggregated over a timeout window. Not one record per packet.
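The idea of one record per 5-tuple, closed by a timeout, can be sketched as follows. This is an illustration of the aggregation idea only, not the NetFlow v5 export format: the `Packet` type, the record fields, and the 15-second inactive timeout are assumptions chosen for the example.

```python
from collections import namedtuple

# Hypothetical packet representation for this sketch.
Packet = namedtuple("Packet", "ts src dst sport dport proto size")


def aggregate(packets, inactive_timeout=15.0):
    """Group packets into flow records keyed by 5-tuple.

    A gap longer than `inactive_timeout` seconds between packets of the
    same 5-tuple closes the current record and starts a new one.
    """
    active = {}     # 5-tuple -> open record
    finished = []   # closed records, in the order they were closed
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        rec = active.get(key)
        if rec is not None and p.ts - rec["last"] > inactive_timeout:
            finished.append(active.pop(key))
            rec = None
        if rec is None:
            rec = {"key": key, "first": p.ts, "last": p.ts,
                   "packets": 0, "bytes": 0}
            active[key] = rec
        rec["last"] = p.ts
        rec["packets"] += 1
        rec["bytes"] += p.size
    finished.extend(active.values())  # flush records still open at the end
    return finished


pkts = [
    Packet(0.0, "10.0.0.1", "1.1.1.1", 5000, 443, 6, 1500),
    Packet(0.5, "10.0.0.1", "1.1.1.1", 5000, 443, 6, 1500),
    Packet(1.0, "10.0.0.2", "1.1.1.1", 5001, 53, 17, 80),
    Packet(30.0, "10.0.0.1", "1.1.1.1", 5000, 443, 6, 1500),
]
records = aggregate(pkts)
for r in records:
    print(r["key"], r["packets"], r["bytes"])
```

Four packets become three records: the long idle gap splits the first 5-tuple into two flows. The operator now reasons about who talked to whom and how much, at far lower volume than per-packet data.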

sFlow (sampling). IPFIX (standardization). Paxson NPD (1997) — active probing methodology, with careful attention to sampling bias.
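Packet sampling trades accuracy on small flows for a large reduction in volume. The sketch below uses deterministic 1-in-100 sampling on a made-up stream so that the result is reproducible; real samplers such as sFlow typically randomize the choice, and the flow names and sizes here are invented for illustration.

```python
from collections import Counter

N = 100  # sample 1 in N packets (assumed rate)

# Hypothetical stream: one heavy flow, then 50 small flows of 3 packets each.
stream = ["elephant"] * 10_000 + [f"mouse{i // 3}" for i in range(150)]

# Keep every N-th packet and count samples per flow.
sampled = Counter(flow for i, flow in enumerate(stream) if (i + 1) % N == 0)

# Scale sampled counts back up to estimate true packet counts.
estimate = {flow: count * N for flow, count in sampled.items()}

true_counts = Counter(stream)
unseen = [flow for flow in true_counts if flow not in estimate]

print(estimate)
print(f"{len(unseen)} of {len(true_counts)} flows never sampled")
```

The heavy flow is estimated well, most small flows vanish entirely, and the one small flow that happens to be sampled is overestimated by a wide margin.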

Better measurement → better diagnostics. Operators could see what was wrong in finer detail.

But control was still distributed by design. Every router ran its own BGP, OSPF, scheduling decisions independently. Even with config-management automation, operators were tweaking inputs to distributed algorithms — they could not centrally express what the network as a whole should do.

The next gap wasn’t seeing. It was being able to express network-wide intent.

2003: The Knowledge Plane Names the Missing Layer

The 1990s gave operators sharper symptoms. They still could not centrally say what the network was supposed to do.

In 2003, Clark, Partridge, Ramming, and Wroclawski named the missing layer.

A new plane — alongside the data plane (forwards packets) and the control plane (computes routes) — whose job is to:

  1. Maintain an explicit belief about what the network is supposed to do.
  2. Distinguish raw observations (telemetry, counters) from interpreted beliefs (what those observations mean).
  3. Reason about uncertainty — missing data, conflicting evidence, hidden state.
  4. Provide services and advice to the data + control planes that consume the belief.

The Knowledge Plane was not an architecture or a controller — it was the abstraction that named what operations needed but did not yet have.

Operations is what you do when sensing has produced a belief and a decision must follow. The Knowledge Plane named the belief.

The vision named the missing layer. Who built it?

Before SDN: Distributed Control

Distributed control = each device runs its own algorithm using local information only. Network behavior emerges from independent local decisions.

You already know three examples of this from earlier in the course. Name them before clicking.

  • BGP — every router computes its own best path from neighbor announcements.
  • OSPF — every router floods link-state, runs Dijkstra independently.
  • TCP congestion control — every endpoint, its own algorithm.
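The OSPF bullet can be illustrated with a small sketch: every router holds the same flooded link-state database and runs Dijkstra on its own, with itself as the source. The four-router topology and link costs below are invented for the example, and the code omits everything else OSPF does (flooding, areas, timers).

```python
import heapq


def shortest_paths(lsdb, source):
    """Dijkstra over a link-state database; returns cost to every router."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in lsdb[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Hypothetical link-state database, identical on every router after flooding.
lsdb = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1, "D": 4},
    "C": {"A": 5, "B": 1, "D": 1},
    "D": {"B": 4, "C": 1},
}

# Each router computes its own view independently.
views = {router: shortest_paths(lsdb, router) for router in lsdb}
for router, view in views.items():
    print(router, view)
```

No router in this sketch ever asks another router what it computed; consistent forwarding emerges only because all of them ran the same algorithm on the same database.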

What’s hard about this regime?

Hard to express network-wide intent. Hard to query global state. No router holds the whole picture.

SDN’s Key Abstraction: Centralized Control

Treat the entire network as one programmable system, with a logically-centralized view of state.

  1. Match-action tables in every switch (match this header pattern → take this action).
  2. OpenFlow protocol — controller installs rules into switches and queries them.
  3. Centralized controller — runs one program over a global picture.

Per-rule counters → centralized observability. Per-rule install → centralized control.

Same abstraction. Both observation and action.
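A toy model makes the dual role of the abstraction visible. The `Switch` and `Controller` classes below are hypothetical and are not the OpenFlow API; packets are plain dictionaries, an empty match acts as a wildcard, and the action strings are invented labels.

```python
class Switch:
    """A switch holding one priority-ordered match-action table."""

    def __init__(self):
        self.rules = []

    def install(self, priority, match, action):
        self.rules.append({"priority": priority, "match": match,
                           "action": action, "packets": 0})
        self.rules.sort(key=lambda r: -r["priority"])  # highest first

    def process(self, pkt):
        """Apply the first matching rule and bump its counter."""
        for rule in self.rules:
            if all(pkt.get(k) == v for k, v in rule["match"].items()):
                rule["packets"] += 1
                return rule["action"]
        return "drop"  # table miss (assumed default for this sketch)

    def stats(self):
        return [(r["match"], r["packets"]) for r in self.rules]


class Controller:
    """One program with a view of every switch in the domain."""

    def __init__(self, switches):
        self.switches = switches

    def push_policy(self, priority, match, action):
        # Per-rule install -> centralized control.
        for sw in self.switches.values():
            sw.install(priority, match, action)

    def collect(self):
        # Per-rule counters -> centralized observability.
        return {name: sw.stats() for name, sw in self.switches.items()}


ctl = Controller({"s1": Switch(), "s2": Switch()})
ctl.push_policy(10, {"dport": 53}, "to_dns")
ctl.push_policy(0, {}, "to_uplink")

first = ctl.switches["s1"].process({"dport": 53})
second = ctl.switches["s1"].process({"dport": 443})
report = ctl.collect()
print(first, second, report)
```

The same rule object carries both the action the controller chose and the counter the controller later reads, which is why one abstraction serves observation and action at once.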

What Does SDN Actually Let Operators Do?

The abstraction is interesting. The control actions are what make it useful.

Before clicking: what kinds of new control actions would centralization enable that distributed protocols could not? (Hint: think about what BGP cannot express, or what TE looks like when one program sees every link in your WAN.)

  Use case                                  What it enables                                                                        System (era)
  ----------------------------------------  -------------------------------------------------------------------------------------  -----------------------------------
  Application-specific peering at IXPs      Things BGP cannot express (e.g., send video to AS X, DNS to AS Y, DDoS to a scrubber)  SDX (Gupta et al. SIGCOMM 2014)
  Global WAN traffic engineering            Centralized scheduler runs links above 90% utilization with bounded loss               B4 (Jain et al. SIGCOMM 2013)
  Programmable peering edge                 Egress route selection on per-application latency, not just BGP best-path              Espresso (Yap et al. SIGCOMM 2017)
  Per-flow authorization at the enterprise  Every flow must be approved by a central controller before it traverses the network    Ethane (Casado et al. SIGCOMM 2007)

Pattern: the centralization paid off because it let operators express things distributed protocols structurally could not.

But — Where Is SDN Actually Deployed?

Heavily deployed:

  • Hyperscaler datacenters — Google, Meta, Microsoft, Amazon
  • Hyperscaler WANs — B4, SWAN, Espresso
  • Cloudflare — eBPF/XDP as core data plane
  • SD-WAN for enterprise branches

Limited or absent:

  • Tier-1 ISP backbones — partial; specific TE only
  • Public Internet — almost none. BGP still per-AS, hop-by-hop
  • Enterprise core — CLI + Ansible still dominate; SNMP remains the favorite monitoring tool in 2026
  • P4 / Tofino — confined to AI/HPC fabrics

Huston (APNIC 2025) argues that ISP-scale SDN has “essential operational fragility”: the controller trusts that local state stays constant, and often it doesn’t.

Hyperscalers adopt first. The rest of the Internet trails by years to decades.

What L15 Has Traced

By 2016, the operator’s loop ran on:

  • Observability (counters → flow records → probes)
  • Belief (Knowledge Plane named it; SDN built the prerequisite)
  • Programmable authority (SDN controllers with concrete TE / peering / load-balancing actions)

But:

  • The consumer was still a human.
  • The data plane was still mostly fixed-function.
  • The Interface contract had not yet shifted.

Thursday: the data-plane revolution. Pre-read: Clark Knowledge Plane 2003.