CS176C — Advanced Topics in Internet Computing
2026-05-26
It is 8 pm. A student messages the campus help desk:
“YouTube is broken in my dorm room.”
What do you do?
Welcome to network operations.
If TCP picks its own congestion window, BGP picks its own best path, AQM picks its own packets to drop…
…what’s left for a human to do?
Protocols decide within their scope. Humans decide about the scope.
What flows are allowed. Which queue policy the router runs. When to add a new uplink. What to do when half the campus cannot reach YouTube. Whether the new firewall rule is legal. Who to peer with.
A network operator is the person (or team) responsible for keeping a network running and meeting its service obligations.
Operations is a loop: sense → believe → decide → act.
For 40 years that loop has been built around what a single human can absorb and act on.
Today’s claim: most of the design decisions in network operations — how telemetry is sampled, how dashboards aggregate, how alerts get thresholded, how often we poll — only make sense once you treat the human operator as the binding constraint at the end of the loop.
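A minimal sketch of the sense → believe → decide → act loop, for orientation only. Every name here (`Belief`, `run_loop_once`, the 0.9 threshold, the link names) is hypothetical and chosen for illustration; real tooling splits these stages across many systems.

```python
from dataclasses import dataclass


@dataclass
class Belief:
    link: str
    utilization: float  # fraction of capacity, 0.0 to 1.0
    congested: bool


def sense(counters):
    """Sense: read raw measurements (a dict stands in for device counters)."""
    return dict(counters)


def believe(samples, threshold=0.9):
    """Believe: turn raw samples into claims about network state."""
    return [Belief(link, u, u >= threshold) for link, u in sorted(samples.items())]


def decide(beliefs):
    """Decide: choose an action per belief; most beliefs need none."""
    return [f"page on-call: {b.link} at {b.utilization:.0%}"
            for b in beliefs if b.congested]


def act(actions):
    """Act: here we only print; a real system would page, reroute, or reconfigure."""
    for action in actions:
        print(action)
    return len(actions)


def run_loop_once(counters, threshold=0.9):
    return act(decide(believe(sense(counters), threshold)))


run_loop_once({"uplink": 0.95, "dorm-vlan": 0.60})
```

Note where the human sits in this sketch: the threshold and the action list are both sized so that one person can read the output.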
L15 (today): the loop, the four invariants that constrain it, how operators measured and controlled networks 1988 → 2016.
L16 (Thursday): the data-plane revolution and what comes next.
Everything else — counters, dashboards, SNMP, SDN, modern automation systems — exists to make some part of this loop faster, cheaper, more accurate, or more accountable.
The four invariants from Ch 1 (State, Time, Coordination, Interface) are constraints on this one loop.
The rest of this lecture: walk each constraint, then trace how it has changed across four decades.
8 pm Tuesday. A student streams an HD lecture recording.
The playback freezes for three seconds, then resumes at lower quality.
“YouTube is broken.”
We will answer “what just happened?” from five different perspectives. Hold them all in your head.
~75,000 autonomous systems, each independently operated.
The campus IT department’s authority ends at the boundary where it hands packets to the eyeball ISP.
The eyeball ISP’s authority ends at the boundary where it hands to a transit provider or to the CDN.
Each boundary is where one operator’s view ends and another’s begins. Multi-stakeholder ops is not a side effect of regulation. It is the architecture.
So whose view is true when YouTube freezes?
The 8 pm rebuffer happens. What does each party see?
| Who | What they see |
|---|---|
| The student | ? |
| The content provider (YouTube, CDN team) | ? |
| The campus NOC | ? |
| The eyeball ISP | ? |
| A third-party probe (if deployed) | ? |
| Who | What they see |
|---|---|
| The student | Spinning wheel |
| The content provider | One rebuffer; ABR drops 1080p → 480p; no surge across other users |
| The campus NOC | Green dashboards. No alert. |
| The eyeball ISP | Nothing unusual at the border router |
| A third-party probe | Brief latency spike at 8:00:23 pm |
Five truthful beliefs. No two of them describe the same picture.
Attribution = composing these partial views. No single party can do it alone.
Performance is layered by consumer. Different consumers care about different things.
| Consumer | What kind of performance? | What does that look like? |
|---|---|---|
| Campus NOC | Capacity | ? |
| Application owner | Quality of Service (QoS) | ? |
| End user | Quality of Experience (QoE) | ? |
| Consumer | What kind of performance? | What does that look like? |
|---|---|---|
| Campus NOC | Capacity | Link bytes-per-second |
| Application owner | QoS | Flow Completion Time, SLO |
| End user | QoE | Video plays smoothly? |
Always relative to a workload. Same network, different traffic mix → different verdict.
Security is a closed loop too: trustworthy substrate + policy verification + anomaly detection + bounded-time mitigation. (Not the cryptographic kind.)
Four invariants. Four constraints. Same loop.
Run the rebuffer through these layers. The campus NOC’s measurement was the SNMP counter, averaged over 5 minutes. Which layer lost the burst?
A 200 ms burst saturated the residential VLAN. The 5-min SNMP average shows 60% utilization. Why does the average hide the burst?
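The arithmetic behind the question can be sketched as follows. The 200 ms burst and the 5-minute window come from the scenario; the assumption that the link otherwise sits at a steady 60% is ours, made to keep the numbers simple.

```python
WINDOW_S = 300.0  # 5-minute SNMP polling window
BURST_S = 0.2     # 200 ms burst


def window_average(baseline, burst_level=1.0, burst_s=BURST_S, window_s=WINDOW_S):
    """Time-weighted average utilization over one window containing one burst."""
    return (baseline * (window_s - burst_s) + burst_level * burst_s) / window_s


quiet = window_average(0.60, burst_level=0.60)  # same window, no burst
bursty = window_average(0.60)                   # 200 ms at 100%

print(f"without burst: {quiet:.4%}")
print(f"with burst:    {bursty:.4%}")
print(f"difference:    {(bursty - quiet) * 100:.4f} percentage points")
```

The burst occupies 0.2 s of a 300 s window, so it moves the average by under 0.03 percentage points. On a dashboard the two windows are indistinguishable.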
In 1994, Leland, Taqqu, Willinger, and Wilson analyzed Ethernet traces collected at Bellcore between 1989 and 1992 and found:
Traffic is bursty at every timescale they could observe — milliseconds, seconds, minutes, hours.
Zoom in. Zoom out. Same shape. They named it self-similarity.
Aggregating does not smooth. The Poisson model — basis of decades of queueing theory inherited from telephony — was empirically wrong.
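One common way to test whether aggregation smooths a series is to average it over blocks of size m and watch how the variance of the block means changes with m. For independent samples that variance falls roughly as 1/m; the self-similarity result is that for real traffic it falls more slowly. The sketch below uses two deterministic toy series (our construction, not real traffic and not self-similar) that have the same values overall but different structure across timescales.

```python
from statistics import pvariance


def block_means(series, m):
    """Average non-overlapping blocks of m samples; drop a trailing partial block."""
    n_blocks = len(series) // m
    return [sum(series[i * m:(i + 1) * m]) / m for i in range(n_blocks)]


def variance_time(series, block_sizes):
    """Variance of the block means at each aggregation level m."""
    return {m: pvariance(block_means(series, m)) for m in block_sizes}


# Toy series 1: fluctuation lives only at the finest timescale.
alternating = [0, 2] * 50
# Toy series 2: same values, but arranged as one long burst.
long_burst = [2] * 50 + [0] * 50

for name, series in [("alternating", alternating), ("long burst", long_burst)]:
    print(name, variance_time(series, [1, 2, 10]))
```

Averaging flattens the alternating series immediately, while the long-burst series keeps its full variance at every block size shown. Whether an average is informative depends on how the traffic is correlated across timescales, not on the averaging itself.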
Consequences for measurement:
Network operations span timescales. For each one below, what kind of decision lives there? Try to fill in before you click.
| Timescale | What kind of decision? | Concrete example |
|---|---|---|
| Microseconds | ? | ? |
| Milliseconds | ? | ? |
| Seconds | ? | ? |
| Minutes to hours | ? | ? |
| Days+ | ? | ? |
Take a minute. Where in the table can the human still be the decider?
| Timescale | What kind of decision | Concrete example |
|---|---|---|
| Microseconds | Per-packet | AQM (L14 — FQ-CoDel, CAKE) |
| Milliseconds | Per-RTT | TCP cwnd |
| Seconds | Per-flow | Bitrate adaptation in the video player |
| Minutes to hours | Per-incident | Outage triage |
| Days+ | Policy / planning | Topology change, vendor renewal |
Human reaction sits at the slow end of this table: minutes and up. Not because of laziness. Because of physics.
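The table can be read as a lookup from decision deadline to decision class. In this sketch the rows are copied from the table, but the numeric boundaries between rows and the one-minute floor for a human decision are assumptions made for illustration.

```python
from bisect import bisect_right

# (timescale, kind of decision, example), copied from the table above.
ROWS = [
    ("microseconds", "per-packet", "AQM"),
    ("milliseconds", "per-RTT", "TCP cwnd"),
    ("seconds", "per-flow", "video bitrate adaptation"),
    ("minutes to hours", "per-incident", "outage triage"),
    ("days+", "policy / planning", "topology change"),
]

# Assumed boundaries, in seconds, between adjacent rows (not from the lecture).
BOUNDARIES_S = [1e-3, 1.0, 60.0, 86_400.0]

# Assumption: a person needs about a minute to read, reason, and act.
HUMAN_MIN_DEADLINE_S = 60.0


def decision_class(deadline_s):
    """Return the table row whose timescale contains this deadline."""
    return ROWS[bisect_right(BOUNDARIES_S, deadline_s)]


def human_can_decide(deadline_s):
    return deadline_s >= HUMAN_MIN_DEADLINE_S


for deadline in (50e-6, 0.03, 5.0, 600.0, 7 * 86_400.0):
    print(deadline, decision_class(deadline), human_can_decide(deadline))
```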
If the human is not in the loop, who is? That’s the next invariant.
Observation tells you what’s happening. Authority lets you act. They are not the same: the content provider can observe its own users’ rebuffer but cannot reach into the campus network to fix it.
There are three steps of authority over the loop. For each one, think: who has the password? what’s the scope of their decisions?
| Step | Who has authority? | What’s the scope? |
|---|---|---|
| 1. One router at a time | ? | ? |
| 2. One controller for one operator | ? | ? |
| 3. Across multiple operators | ? | ? |
Stop. Predict each one before clicking.
| Step | Who has authority? | What’s the scope? |
|---|---|---|
| 1. One router at a time | Whoever has the device password | One router; one CLI session |
| 2. One controller for one operator | Whoever wrote the controller’s policy | Every router in the operator’s domain (SDN — §Act 7) |
| 3. Across multiple operators | No single authority | Failures cross AS boundaries; still mostly a research problem |
So who’s at the receiving end of all this authority? What does the operator actually see?
Interface = the contract between the network and whoever consumes belief and action.
For 40 years, that consumer has been a single human operator. What does that look like concretely? Predict before you click.
| When the operator receives belief, it looks like… | When the operator takes action, it looks like… |
|---|---|
| ? | ? |
| ? | ? |
| ? | ? |
Three examples each. Walk through your day on an on-call shift.
| Belief reaches the operator as… | The operator’s action looks like… |
|---|---|
| A chart on a dashboard | A button click |
| A colored alert | A CLI line |
| A paragraph in a runbook | A Slack message |
Most operations tooling is a workaround for the limits of what a single human can absorb and act on.
The networks could always emit more.
The human could never absorb more.
The tooling lived in the gap.
That gap shows up everywhere: aggregated counters, thresholded alerts, 5-minute polling cadences, dashboards as the universal interface.
If the consumer contract has shaped four decades of tools, what was the first version?
What was the first consumer contract?
1988: SNMP standardized. Per-link device counters. Polled every 5 minutes.
Why 5 minutes? Not a network constraint — a consumer constraint. Faster polls would have produced more data than the human in the loop could absorb.
So operators ran on 5-min averages. They suspected they were missing things, but they had no way to prove it.
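How a 5-minute average falls out of two counter reads can be sketched as below. The function name and the example numbers are ours; the wrap handling assumes a 32-bit octet counter (such as `ifInOctets`) that wraps at most once between polls.

```python
COUNTER32_MOD = 2 ** 32


def average_utilization(octets_start, octets_end, interval_s, link_bps,
                        modulus=COUNTER32_MOD):
    """Average link utilization over one polling interval from two counter reads.

    The modulo handles a single counter wrap between the two reads.
    """
    delta_octets = (octets_end - octets_start) % modulus
    return delta_octets * 8 / (interval_s * link_bps)


# Hypothetical example: a 100 Mb/s link polled 300 s apart, counter wraps once.
start = 4_000_000_000
end = (start + 2_250_000_000) % COUNTER32_MOD
print(f"{average_utilization(start, end, 300, 100_000_000):.0%}")
```

Two reads yield one number for the whole interval; nothing about what happened inside the 300 seconds is recoverable from them. On faster links a 32-bit counter can wrap several times in one interval, which this sketch would undercount; 64-bit counters such as `ifHCInOctets` exist for that case.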
By 1994 they had proof: Leland-Taqqu-Willinger-Wilson’s self-similarity result (we saw this on slide 13) showed traffic was bursty at every timescale — meaning the 5-min averages had been systematically lying about what the network was doing.
Knowing the era’s tooling was lying didn’t give operators anything better to deploy. What came next?
Cisco NetFlow v5 (1996): the flow as the operational unit. One record per 5-tuple, aggregated over a timeout window. Not one record per packet.
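The idea of one record per 5-tuple per timeout window can be sketched as follows. This is a simplified illustration, not NetFlow's actual export logic or record format: the type names, the 15-second inactive timeout, and the example addresses are all assumptions.

```python
from dataclasses import dataclass
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int


@dataclass
class FlowRecord:
    key: FiveTuple
    first_seen: float
    last_seen: float
    packets: int = 0
    octets: int = 0


def aggregate(packets, inactive_timeout_s=15.0):
    """Fold (timestamp, five_tuple, length) packets into flow records.

    A record is closed when its key has been idle longer than the timeout;
    a later packet with the same key starts a new record.
    """
    active = {}
    exported = []
    for ts, key, length in sorted(packets, key=lambda p: p[0]):
        rec = active.get(key)
        if rec is not None and ts - rec.last_seen > inactive_timeout_s:
            exported.append(rec)
            rec = None
        if rec is None:
            rec = FlowRecord(key, first_seen=ts, last_seen=ts)
            active[key] = rec
        rec.packets += 1
        rec.octets += length
        rec.last_seen = ts
    exported.extend(active.values())  # flush whatever is still open
    return exported


video = FiveTuple("192.0.2.10", "198.51.100.7", 50000, 443, 6)
dns = FiveTuple("192.0.2.11", "198.51.100.53", 40000, 53, 17)
trace = [(0.0, video, 1500), (1.0, video, 1500), (1.0, dns, 80),
         (2.0, video, 1500), (100.0, video, 60)]
for record in aggregate(trace):
    print(record)
```

Five packets become three records. The operator now sees who talked to whom and how much, at the cost of everything that happened inside each record.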
sFlow (sampling). IPFIX (standardization). Paxson NPD (1997) — active probing methodology, with careful attention to sampling bias.
Better measurement → better diagnostics. Operators could see what was wrong in finer detail.
But control was still distributed by design. Every router ran its own BGP, OSPF, and scheduling decisions independently. Even with config-management automation, operators were tweaking inputs to distributed algorithms; they could not centrally express what the network as a whole should do.
The next gap wasn’t seeing. It was being able to express network-wide intent.
The 1990s gave operators sharper symptoms. They still could not centrally say what the network was supposed to do.
In 2003, Clark, Partridge, Ramming, and Wroclawski named the missing layer.
A new plane — alongside the data plane (forwards packets) and the control plane (computes routes) — whose job is to:
The Knowledge Plane was not an architecture or a controller — it was the abstraction that named what operations needed but did not yet have.
Operations is what you do when sensing has produced a belief and a decision must follow. The Knowledge Plane named the belief.
The vision named the missing layer. Who built it?
Distributed control = each device runs its own algorithm using local information only. Network behavior emerges from independent local decisions.
You already know three examples of this from earlier in the course. Name them before clicking.
What’s hard about this regime?
Hard to express network-wide intent. Hard to query global state. No router holds the whole picture.
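To make "local information only" concrete, here is a sketch of a distance-vector style update, chosen by us as a compact illustration rather than taken from the lecture. Each node recomputes its distances from nothing but its own link costs and the vectors its neighbors advertise. The topology and function names are hypothetical.

```python
INF = float("inf")


def dv_round(vectors, links):
    """One synchronous round: every node rebuilds its distance vector from
    its own link costs and its neighbors' current vectors, nothing else."""
    new_vectors = {}
    for node, neighbors in links.items():
        vec = {node: 0}
        for nbr, cost in neighbors.items():
            for dest, dist in vectors[nbr].items():
                if dest == node:
                    continue
                candidate = cost + dist
                if candidate < vec.get(dest, INF):
                    vec[dest] = candidate
        new_vectors[node] = vec
    return new_vectors


def converge(links, max_rounds=50):
    """Repeat rounds until no vector changes (or max_rounds is reached)."""
    vectors = {node: {node: 0} for node in links}
    for rounds in range(1, max_rounds + 1):
        updated = dv_round(vectors, links)
        if updated == vectors:
            return vectors, rounds
        vectors = updated
    return vectors, max_rounds


# Hypothetical triangle: the direct A-C link is expensive, A-B-C is cheap.
links = {
    "A": {"B": 1, "C": 5},
    "B": {"A": 1, "C": 1},
    "C": {"A": 5, "B": 1},
}
final_vectors, rounds_used = converge(links)
print(final_vectors["A"], rounds_used)
```

The network ends up routing A to C via B, yet no single node ever held the whole topology. Asking "what is the network doing?" means querying every node and assembling the answer yourself.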
Treat the entire network as one programmable system, with a logically-centralized view of state.
(match this header pattern → take this action).
Per-rule counters → centralized observability. Per-rule install → centralized control.
Same abstraction. Both observation and action.
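A toy match-action table shows how one abstraction serves both observation and action. This is a sketch under our own assumptions: the field names, action strings, and the drop-on-miss behavior are hypothetical and do not follow any particular switch specification.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    priority: int
    match: dict   # header field -> required value; absent fields are wildcards
    action: str
    packets: int = 0
    octets: int = 0


class FlowTable:
    def __init__(self):
        self.rules = []

    def install(self, rule):
        """Control: the controller installs a rule."""
        self.rules.append(rule)
        self.rules.sort(key=lambda r: -r.priority)

    def process(self, headers, length):
        """Data plane: the highest-priority matching rule wins and is counted."""
        for rule in self.rules:
            if all(headers.get(k) == v for k, v in rule.match.items()):
                rule.packets += 1
                rule.octets += length
                return rule.action
        return "drop"  # assumption: a table miss drops the packet

    def counters(self):
        """Observation: the controller reads per-rule counters."""
        return [(r.match, r.packets, r.octets) for r in self.rules]


table = FlowTable()
table.install(Rule(priority=10, match={"dst_port": 443}, action="output:2"))
table.install(Rule(priority=0, match={}, action="send-to-controller"))

print(table.process({"src_ip": "192.0.2.10", "dst_port": 443}, 1500))
print(table.process({"src_ip": "192.0.2.10", "dst_port": 53}, 80))
print(table.counters())
```

The same rule that expresses intent also carries the counter that reports on it, so a controller holding every table gets a network-wide view without a separate measurement system.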
The abstraction is interesting. The control actions are what make it useful.
Before clicking: what kinds of new control actions would centralization enable that distributed protocols could not? (Hint: think about what BGP cannot express, or what TE looks like when one program sees every link in your WAN.)
| Use case | What it enables | System (era) |
|---|---|---|
| Application-specific peering at IXPs | Things BGP cannot express (e.g., send video to AS X, DNS to AS Y, DDoS to a scrubber) | SDX (Gupta et al. SIGCOMM 2014) |
| Global WAN traffic engineering | Centralized scheduler runs links above 90% utilization with bounded loss | B4 (Jain et al. SIGCOMM 2013) |
| Programmable peering edge | Egress route selection on per-application latency, not just BGP best-path | Espresso (Yap et al. SIGCOMM 2017) |
| Per-flow authorization at the enterprise | Every flow must be approved by a central controller before it traverses the network | Ethane (Casado et al. SIGCOMM 2007) |
Pattern: the centralization paid off because it let operators express things distributed protocols structurally could not.
Heavily deployed:
Limited or absent:
Huston (APNIC 2025): ISP-scale SDN has “essential operational fragility” — controller trusts local state stays constant; often it doesn’t.
Hyperscalers adopt first. The rest of the Internet trails by years to decades.
By 2016, the operator’s loop ran on:
Observability (counters → flow records → probes)
Belief (Knowledge Plane named it; SDN built the prerequisite)
Programmable authority (SDN controllers with concrete TE / peering / load-balancing actions)
But:
Thursday: the data-plane revolution. Pre-read: Clark Knowledge Plane 2003.
© 2026 Arpit Gupta, UC Santa Barbara. All rights reserved.