From Programmable Observation to Agentic Operations

Last lecture ended on a question. I had walked you through the operator’s loop, the four invariants, the network-of-networks problem with attribution, and the chronology from SNMP counters through the Knowledge Plane vision through SDN’s birth. By 2016, SDN had given operators centralized authority and continuously verifiable policies. But three things had not changed: the consumer at the end of the loop was still a human, the data plane was still mostly fixed-function, and that combination meant operators were still squeezing observations through human-scale interfaces (five-minute averages, dashboards, thresholded alerts).

And the question I left you with was: if the human is not the consumer of all this very detailed fine-grained information, how do you synthesize state and decide what to do?

That’s where we pick up today. Tuesday already covered the why and whether — agentic networking, good idea or bad. Today we trace the how: how the engineering got to a place where the question is even askable. The data plane became programmable. Observation moved into it. Machine-learning decision-makers entered specific loops. A credibility crisis surfaced a gap the research frontier is still trying to close.

Act 1: The four-decade tug-of-war — and what the linear narrative gets wrong

Before I walk through the engineering substrate, I want to push back on the way the story usually gets told. The common framing in talks and survey papers goes something like: “for decades we had SNMP and NetFlow, then in the 2010s switches became programmable, and then we built per-hop in-band telemetry, declarative measurement systems, and sketches on switches.” That framing paints a clean linear evolution with the programmable-data-plane shift as the prime mover. The actual story is messier: parallel tracks running for decades, each chasing its own question, eventually fusing on a new substrate.

What was happening from 1988 onward was a tug-of-war between three pressures: how much richness operators wanted from telemetry, how much cost (memory, CPU, bandwidth, storage) the production-transport-storage chain could afford, and how much the human consumer at the end of the loop could absorb. At any given moment, several research and engineering communities were working on different points in that trade — in parallel. The 2013 programmable-data-plane shift was an enabler that let some of those parallel tracks fuse. The work that came after had its own causes — older ones.

And I should flag one more thing. L15 leaned hard on human cognitive bandwidth as the binding constraint that shaped four decades of observability. That framing is correct but partial. Cost is a co-equal constraint. When operators settled for five-minute SNMP averages, two pressures pushed them there: humans could absorb only so much, and the chain that ships richer summaries could afford only so much (device memory, management-network bandwidth, collector storage). Both pressures, consumer-side and producer-side, have constrained the design space. The relaxation arc this chapter traces is the relaxation of both.

Let me trace the actual chronology, including the questions designers were asking, because the questions changed across communities and across decades.

The parallel tracks, 1988 to ~2010

Track 1 — Counter telemetry (SNMP, 1988). Per-link device counters polled at minutes-to-hours cadence. The question SNMP’s authors were asking: what is the cheapest, lowest-common-denominator interface that every device vendor will agree to support? The answer was an implicit aggregation — the device picks the summary function, the operator picks only the polling interval. This track gets its own substrate upgrade in Act 2 below; the implicit-aggregation problem persists all the way through.

Track 2 — Flow-level summarization (NetFlow v5, 1996; sFlow, 2001; IPFIX standardization). NetFlow had its own motivation, fifteen years before programmable switches existed. NetFlow’s designers in 1996 were asking: how do we summarize traffic for billing and capacity planning when full packet capture is too expensive and SNMP counters are too coarse? Their answer was the flow record — connection-level summaries (5-tuple, byte count, packet count, duration) emitted per completed connection. sFlow (InMon, 2001) asked the high-speed version of the same question as 10 Gbps links came online: how do we monitor at line rate by sampling when per-packet processing is out of reach? Both are first-class telemetry. Both predate the programmable-data-plane work by more than a decade. Both were driven by the same operational constraint: what can we do given that the switch is fixed-function? That was the working question for fifteen years.

Track 3 — Streaming algorithms, theory and networking deployment (1970–2005). This is one track with two faces. The theoretical CS face: Bloom filters (1970) for set membership; Alon-Matias-Szegedy (1996) for frequency moments, a streaming-CS family of statistics that includes how many distinct keys you have seen, how skewed the key distribution is, and how often the heaviest key appears, all computed in bounded memory; Charikar-Chen-Farach-Colton (2002) for the Count Sketch (roughly, how often have I seen this specific key); and Cormode-Muthukrishnan (2005) for Count-Min. The networking-deployment face: Estan and Varghese’s Sample-and-Hold (SIGCOMM 2002) is the same family, a bounded-memory heavy-hitter detector, but tuned for the operational constraints networking actually has: line-rate insertion, per-flow memory budgets in switch SRAM, adversarial traffic patterns. Heavy-hitter detection is one of the canonical use-cases for streaming sketches; Estan-Varghese is the networking version of the streaming-CS heavy-hitter question. The unifying question across both faces: what queries can be answered with bounded memory in one pass over a data stream, and how does the answer change when the stream is line-rate packets?
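To make the bounded-memory trade concrete, here is a minimal Python sketch of the sample-and-hold idea. It is a toy under stated assumptions: it samples per packet rather than per byte, and the names (`sample_and_hold`, `sample_prob`, the elephant-and-mice stream) are illustrative, not taken from the Estan-Varghese paper.

```python
import random

def sample_and_hold(packets, sample_prob, rng):
    """Track only flows that get sampled; once tracked, count every packet.

    packets: iterable of flow keys, one entry per packet (toy model).
    sample_prob: per-packet probability of admitting an untracked flow.
    """
    table = {}
    for flow in packets:
        if flow in table:
            table[flow] += 1            # "hold": exact counting after admission
        elif rng.random() < sample_prob:
            table[flow] = 1             # "sample": admit the flow to the table
    return table

# Toy stream: one elephant flow interleaved with 1,000 single-packet mice.
rng = random.Random(7)
stream = ["elephant"] * 10_000 + [f"mouse-{i}" for i in range(1_000)]
rng.shuffle(stream)

table = sample_and_hold(stream, sample_prob=0.01, rng=rng)
print(len(table), table.get("elephant"))
```

The elephant is admitted early with overwhelming probability and then counted exactly, while most mice never earn a table entry. That asymmetry is the whole point: memory goes to the flows that carry the mass.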

So before 2013, the field had three parallel research tracks responding to different binding constraints. None of them needed programmable silicon. Each was answering its own question: the cheapest common interface vendors would agree to (SNMP); how to summarize for billing under cost pressure and how to sample at high speed (NetFlow, sFlow); and which queries bounded memory can answer over a stream, in theory and at line rate (streaming sketches and their networking instantiation).

What 2013 actually changed

In 2013, Pat Bosshart, Glen Gibb, Hun-Seok Kim, George Varghese, and Nick McKeown published the Reconfigurable Match Tables (RMT) architecture. P4 followed in 2014; Barefoot’s Tofino shipped in 2016. The shift: switch ASICs organized as a programmable parser, a sequence of match-action stages, and a programmable deparser. Behavior in software, silicon as a general substrate.

RMT’s designers were asking a structural question. Sketches-in-switches and per-hop measurement came as applications later. The structural question: if we expose the pipeline architecture as a target for software, what new design space opens up? The architectural shift was an enabling move. The work that built on top of it was driven by the questions the existing measurement communities were already asking.

One honesty check before we go further. Programmable data planes are a hyperscaler reality, not an Internet-wide one. Tofino runs in production at Google, Meta, and Microsoft datacenters. Enterprise switches, ISP edge routers, and consumer hardware mostly run fixed-function silicon. The work that follows matters intellectually; its operational footprint sits in a handful of organizations.

A short industrial history, because it matters. The Stanford team that designed RMT also designed the hardware to realize it, and spun out as Barefoot Networks. Barefoot did not fabricate silicon themselves — they shipped the source code and let others fabricate. The first Tofino chips made it possible for any organization to deploy. Intel acquired Barefoot and started shipping Tofino at scale. Intel struggled to keep up with hyperscaler-scale demand. Google rolled their own fabrication of a Tofino-architecturally-equivalent ASIC. Meta and Microsoft followed similar paths. So the architecture won decisively — but the commercial trajectory of the silicon itself is messier than the academic story suggests. (UCSB has a Tofino in its telemetry infrastructure, which is part of why this chapter takes the substrate seriously.)

2013-2018: pre-existing questions get programmable answers

Now the parallel tracks start to fuse. Several lines of work, each addressing a different question, become feasible on the new substrate.

Track 3 fusion — sketches on switches. OpenSketch (Yu, Jose, Miao; NSDI 2013) asked: can we take the streaming-algorithm theory the CS community had been refining for forty years and make it programmable on switches? The platform composed sketch primitives. UnivMon (Liu, Manousis, Vorsanger, Sekar, Braverman; SIGCOMM 2016) extended this with a universal sketch that approximates a wide class of frequency statistics. This work is the operational deployment of theory that long predated programmable hardware.

A new question — per-hop, per-packet visibility. Changhoon Kim and colleagues at Barefoot proposed INT (In-Band Network Telemetry) in 2015. INT starts from a different design question than NetFlow and sFlow. Where they asked “how do we summarize given fixed hardware?”, INT asked: now that switches are programmable, what could measurement look like if every packet carried its own measurement payload? INT embeds per-hop metadata (ingress timestamp, egress timestamp, queue depth, hop ID) into packets as they traverse the network. The sink at the destination reconstructs the per-hop trajectory at microsecond resolution. The spec slogan captures the inversion: “INT collects and reports network state, by the data plane, without requiring intervention or work by the control plane.”
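A toy model of that mechanism, in Python, may help fix the idea. It is only a sketch: real INT headers are binary fields selected by an instruction bitmap, and the class names, hop names, and timestamps below are made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class HopRecord:
    hop_id: str
    ingress_ns: int
    egress_ns: int
    queue_depth: int

@dataclass
class Packet:
    payload: bytes
    int_stack: list = field(default_factory=list)  # per-hop metadata carried in-band

def traverse(packet, hop_id, ingress_ns, egress_ns, queue_depth):
    """Each hop appends its own record; the control plane is not involved."""
    packet.int_stack.append(HopRecord(hop_id, ingress_ns, egress_ns, queue_depth))
    return packet

def sink_report(packet):
    """At the sink, turn the stack into (hop, residence time in ns, queue depth)."""
    return [(h.hop_id, h.egress_ns - h.ingress_ns, h.queue_depth)
            for h in packet.int_stack]

pkt = Packet(payload=b"video-chunk")
traverse(pkt, "leaf-1", 1_000, 1_400, queue_depth=3)
traverse(pkt, "spine-2", 2_000, 9_500, queue_depth=41)   # the congested hop
traverse(pkt, "leaf-7", 10_100, 10_450, queue_depth=2)

report = sink_report(pkt)
slowest = max(report, key=lambda r: r[1])
print(slowest)
```

The sink can name the hop where the packet waited longest without asking any switch a question after the fact, which is the inversion the slogan describes.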

Query-driven systems — Marple and Sonata. Marple (Narayana et al., SIGCOMM 2017) and Sonata (Gupta et al., SIGCOMM 2018; full disclosure, my own paper) asked yet another question: what abstraction should operators use to write measurement queries, given that the programmable substrate has resource constraints? Marple proposed a key-value abstraction over switch state. Sonata proposed a declarative dataflow query language with a compiler that partitions queries between switch and server.

The Sonata example everyone uses — detect DNS amplification victims:

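# Count distinct DNS responders per destination; flag destinations above threshold.
victims = (packets
    .filter(lambda p: p.udp_sport == 53)        # DNS responses only
    .map(lambda p: (p.dst_ip, p.src_ip))        # (candidate victim, responder)
    .distinct()                                 # one tuple per unique pair
    .map(lambda dst_ip, src_ip: (dst_ip, 1))    # each unique responder counts once
    .reduce(keys=["dst_ip"], op=sum)            # responders per destination
    .filter(lambda k, v: v > threshold))        # keep only the heavy destinations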

Naive execution: ship every port-53 packet to the server. Millions of tuples per second per link, operationally impossible. Sonata’s compiler: execute filter, distinct, and most of the count on the switch — using sketch primitives from Track 3 — and ship only the heavy hitters to the server. Tens of tuples per second. Four orders of magnitude reduction in tuples-to-server. The Sonata optimization objective explicitly encodes both binding constraints — the switch’s resource budget and the operator’s cognitive budget — in the same ILP.
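The difference between the two execution plans can be sketched in plain Python. This is not Sonata's compiler or its ILP; exact sets and dictionaries stand in for the switch-side sketches, and the trace, the function names, and the threshold of 100 are all invented for illustration.

```python
from collections import defaultdict

def make_packets():
    """Toy trace of (udp_sport, src_ip, dst_ip) tuples; all values are made up."""
    pkts = []
    # 500 distinct resolvers each send 4 DNS responses to one victim.
    for i in range(500):
        pkts += [(53, f"resolver-{i}", "victim")] * 4
    # 200 ordinary clients each receive 5 responses from a single resolver.
    for i in range(200):
        pkts += [(53, "resolver-0", f"client-{i}")] * 5
    # Non-DNS background traffic.
    pkts += [(443, "web", f"client-{i % 200}") for i in range(3000)]
    return pkts

def naive_plan(pkts):
    """Ship every port-53 packet to the server; the server does all the work."""
    return [p for p in pkts if p[0] == 53]

def partitioned_plan(pkts, threshold):
    """Run filter, distinct, and count 'on the switch'; ship only heavy keys."""
    seen_pairs = set()
    responders = defaultdict(int)
    for sport, src, dst in pkts:
        if sport != 53:
            continue
        if (dst, src) not in seen_pairs:
            seen_pairs.add((dst, src))
            responders[dst] += 1
    return [(dst, n) for dst, n in responders.items() if n > threshold]

pkts = make_packets()
to_server_naive = naive_plan(pkts)
to_server_partitioned = partitioned_plan(pkts, threshold=100)
print(len(to_server_naive), to_server_partitioned)
```

On this toy trace the naive plan ships 3,000 tuples and the partitioned plan ships one. The real system pays for that reduction in switch memory and pipeline stages, which is why the partitioning is an optimization problem and not a free choice.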

There were also siblings worth knowing: Everflow (Zhu et al., SIGCOMM 2015) proposed match-and-mirror at line rate for distributed packet capture. dShark (Yu et al., NSDI 2019) addressed how to analyze those captures across many vantage points. Different questions, same enabling substrate.

Where does the burstiness even come from?

Before we say sketches are the right tool, ask the prior question: why is the data heavy-tailed in the first place? Three reinforcing sources, each of which we have already seen earlier in the course.

Medium access reinforces burstiness. You randomly request access to a shared medium; when you get it, you send everything you have. The protocol concentrates sends in time. (L5–L6.)
Applications are event-driven, not steady. Humans are not continuously generating traffic; we type, we click, the application reacts, the application sends. Long idle, short spike, long idle.
The ABR cycle, as canonical example. A streaming player requests a chunk, gets ten seconds of buffer, and goes quiet for ten seconds while it plays from the buffer. Then another chunk. Spike, silence, spike.

The same burstiness shows up at every aggregation level: host, subnet, AS. Walter Willinger’s 1994 result names the structural property: the traffic is self-similar at every timescale. That structure, not generic information-theoretic compactness, is why sketches fit.

Why sketches win operationally: the twenty-year debt the field finally paid

Sketches matter operationally because the data they summarize is heavy-tailed in the sense above. A few keys carry most of the mass; most keys carry almost none. Sketches are exactly the right data structure for that distribution. A small fixed-size table gives bounded error on heavy-hitter queries precisely because heavy tails concentrate the mass in a few places.
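A small Count-Min sketch in Python shows why heavy tails and fixed-size tables fit together. Treat it as a teaching sketch: the width, depth, salted BLAKE2b hashing, and the Zipf-like workload are my assumptions, not parameters from any deployed system.

```python
import hashlib

class CountMinSketch:
    """Fixed-size table; estimates never undercount, and may overcount on collisions."""

    def __init__(self, width, depth):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _col(self, row, key):
        # One independent-looking hash per row, derived by salting BLAKE2b.
        h = hashlib.blake2b(key.encode(), digest_size=8, salt=bytes([row]))
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, key, count=1):
        for r, row in enumerate(self.rows):
            row[self._col(r, key)] += count

    def estimate(self, key):
        return min(row[self._col(r, key)] for r, row in enumerate(self.rows))

# Heavy-tailed toy workload: key i carries about 10,000 / (i + 1) packets.
true_counts = {f"k{i}": 10_000 // (i + 1) for i in range(5_000)}

cms = CountMinSketch(width=256, depth=4)   # 1,024 counters for 5,000 keys
for key, n in true_counts.items():
    cms.add(key, n)

for key in ("k0", "k1", "k4999"):
    print(key, true_counts[key], cms.estimate(key))
```

The collision noise is roughly the same absolute size for every key, so it is a small fraction of a heavy key's count and can swamp a tail key's count. On heavy-tailed data the keys you care about are the heavy ones, which is exactly where the sketch is accurate.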

The intellectual lag was nearly twenty years. Self-similarity was discovered in 1994. Count-Min was published in 2005. Networking deployed sketches operationally between 2013 and 2018, once a programmable substrate existed to host them. The field knew the data’s structure; the field had the theoretical tool; the operational deployment had to wait for the substrate.

UnivMon is the proto-foundation-model of telemetry. One shared data structure on the switch, many downstream queries on the collector. The abstraction that Frontier A’s network foundation models will re-instantiate two generations later — one representation, many downstream uses — is already present here, in 2016, in a sketch.

Apply to the rebuffer

In 2018, with INT plus Sonata plus sketches deployed on the campus residential VLAN, the operator can write a declarative query — “for each (source, destination) pair, count packets in 100 ms windows; alert if any pair exceeds ten times its baseline rate.” The switch runs the count via a sketch, the server receives only the heavy hitters, the operator gets a tractable alert. The 8 pm rebuffer surfaces as a specific traffic pattern.
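Here is roughly what that query computes, written as plain Python over a toy trace. Exact counters stand in for the switch-side sketch, and the host names, the baseline of 5 packets per window, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

WINDOW_MS = 100

def window_counts(packets):
    """packets: iterable of (timestamp_ms, src, dst). Returns {window: Counter}."""
    windows = defaultdict(Counter)
    for ts_ms, src, dst in packets:
        windows[ts_ms // WINDOW_MS][(src, dst)] += 1
    return windows

def find_alerts(windows, baseline, factor=10):
    """Return (window, pair, count) wherever count exceeds factor x baseline."""
    out = []
    for w in sorted(windows):
        for pair, n in windows[w].items():
            if n > factor * baseline.get(pair, 1):
                out.append((w, pair, n))
    return out

# One second of steady traffic (5 packets per window), plus a burst at t = 700 ms.
steady = [(w * 100 + i * 10, "cdn", "dorm-12") for w in range(10) for i in range(5)]
burst = [(700 + i, "cdn", "dorm-12") for i in range(60)]

baseline = {("cdn", "dorm-12"): 5}
alerts = find_alerts(window_counts(steady + burst), baseline)
print(alerts)
```

Only the window containing the burst crosses the threshold, so only one tuple needs to reach the operator.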

But notice what has not changed: the operator still has to read the alert and decide what to do. We have made the substrate vastly richer, the cost of asking specific questions much lower, the resource-vs-richness trade more elastic. The consumer at the end of the loop, and the dollar cost of running the whole apparatus, are still binding.

Act 2: The counter telemetry world catches up — gNMI and eBPF

Track 1 from Act 1 — counter telemetry, SNMP as the canonical instance — also got its own substrate upgrade in the same window: gNMI in 2016, eBPF and XDP maturing around 2018. Both inherited an old problem the counter world has lived with for thirty years and that the programmable-data-plane work in Act 1 partly sidestepped. I want to spell it out because it matters for the rest of the lecture.

The implicit aggregation problem

When a device exposes a counter — bytes-in, packets-dropped, queue-depth — the counter is already an aggregation. It is a sum, or a max, or an instantaneous value. The aggregation function is fixed at the device. The operator picks the polling interval; the device has already picked the function.

Why does that matter? Recall Willinger’s 1994 result: the underlying traffic is self-similar and heavy-tailed. A simple sum over a five-minute window is cheap to compute and cheap to store, but it throws away exactly the distributional structure that bursts and heavy hitters live in. A 60% link-utilization average for the past five minutes is consistent with steady moderate load, and equally consistent with a window that contained a 100 ms burst that fully saturated the link and dropped packets. The two cases look the same in the counter. They look completely different to a video player.
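The arithmetic is easy to check. The following Python sketch builds two synthetic five-minute traces at 100 ms resolution with the same mean; the numbers and helper names are illustrative only.

```python
SLOTS = 3000  # five minutes, divided into 100 ms slots

def steady_trace(level):
    """Every slot runs at the same utilization."""
    return [level] * SLOTS

def bursty_trace(mean_level, burst_slots):
    """Same mean, but `burst_slots` slots are fully saturated."""
    rest = (mean_level * SLOTS - 1.0 * burst_slots) / (SLOTS - burst_slots)
    return [1.0] * burst_slots + [rest] * (SLOTS - burst_slots)

def mean(xs):
    return sum(xs) / len(xs)

steady = steady_trace(0.60)
bursty = bursty_trace(0.60, burst_slots=1)

# The five-minute counter reports the mean; the video player experiences the max.
print(round(mean(steady), 6), round(mean(bursty), 6))
print(max(steady), max(bursty))
```

Both traces report 60% to the counter. Only a summary that keeps a maximum or a high quantile can tell them apart.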

A more honest summary — quantile sketches, distribution snapshots, top-k tracking, the same primitives the Act 1 data-plane work makes deployable — preserves more of the structure. It also costs more CPU at the device, more memory in the device’s table, more bandwidth on the wire, more storage at the collector. Every richness gain has a cost. The aggregation question — what function the device computes, at what cost — is the hidden parameter the counter world has been answering “simple sum” to for thirty years.

gNMI and OpenConfig (2016): the Time relaxation, not the aggregation relaxation

For thirty years, SNMP was a pull protocol — a management station polled devices on some cadence. Polling cadence was bounded by management-network bandwidth and the collector’s storage budget. gNMI inverted the architecture: devices stream telemetry on their preferred cadence; collectors handle buffering, out-of-order delivery, and schema drift.

What did this actually relax? The Time invariant. Observations arrive at sub-second cadences instead of five-minute cadences. What stayed unchanged? The aggregation question. The counter is still a sum, still a max, still an instantaneous value. gNMI lets you stream that same aggregation a thousand times more often. To stream a quantile distribution instead, you need either richer counters at the device or raw events pushed to the collector and aggregated there. Both cost more. gNMI made the cheap version faster. The rich version stayed expensive.

By the late 2010s, no operator was reading streaming telemetry with their eyes. The gNMI consumer was a Prometheus instance, a Kafka topic, a time-series database, an alerting rule engine. Geoff Huston at APNIC called this the vanishing network. That framing is real but partial: the vanishing-of-the-human story is one half of what was happening. The cost-of-richer-aggregation story is the other half. Two reasons the aggregations stayed simple: the humans were vanishing, and anything richer cost more than the management infrastructure was willing to pay.

eBPF and XDP (~2018): the same trade, on the host

Steve McCanne and Van Jacobson’s 1993 BPF, the kernel packet filter we mentioned on Tuesday, got extended into a full kernel programming environment with persistent state, helper functions, and a verifier. XDP runs those programs at the driver level, before the kernel allocates its per-packet socket buffer (the sk_buff). Host-side programmable telemetry now matches in-network programmable telemetry. A measurement system that wants to combine switch-side and host-side data can use the same query abstractions on both.

The cost dimension shows up here too. eBPF maps give you persistent state across packets, but the maps live in scarce kernel memory. XDP runs at the driver, but it costs CPU per packet. You can write rich queries; you pay for them. The architectural lesson is the same as for sketches and Sonata: programmable substrates expose the cost-vs-richness trade as a tunable parameter. The underlying trade stays.

Production deployments: what the substrate enabled at operator scale

The substrate work landed in production. By the late 2010s and early 2020s, large operators had built complete push-based telemetry pipelines on top of these primitives, and the published systems give us concrete cost-vs-richness numbers from production.

Pingmesh (Chuanxiong Guo and colleagues, Microsoft; SIGCOMM 2015) deployed an always-on host-driven active probing system across Microsoft’s datacenters. Every server probes a set of peers; results stream to a central analyzer; the analyzer maintains a continuously-updated belief about pairwise reachability and latency. Push-based, host-driven, complete pipeline. The reason it works is that the operator gets a globally consistent latency map at a cost the datacenter is already paying for ECN and ACK traffic — a richer belief at a known incremental cost.
OpTel (Miao, Chen, and colleagues at Tencent with our group at UCSB; NSDI 2022) replaced SNMP polling of Tencent’s optical-backbone devices with a push-based pipeline that streams device state to cloud-hosted controllers. The motivation was specific to China. China’s optical backbone was growing rapidly, and accidental fiber cuts from concurrent construction were frequent enough that minute-granularity SNMP polling was structurally insufficient: links degraded over seconds before failing, and you needed second-granularity signals to predict link failures before they hit. OpTel pushed one-second telemetry off the device into elastic cloud compute. In six months of production, it detected roughly 2× more optical events than the prior system. Push moves the aggregation cost off the device onto the operator’s cloud budget: a different cost allocation, not a free lunch.

Both systems are existence proofs of what the Act 1 / Act 2 substrate work made deployable at scale. They also make the cost dimension explicit in the published numbers, in a way the protocol-level descriptions of gNMI and eBPF do not.

The combined substrate work — Act 1’s programmable switches, sketches-on-switches, query-driven telemetry, plus Act 2’s gNMI, eBPF, and the production systems built on top — collectively reduced the cost-per-bit-of-belief by orders of magnitude across both the in-network and host-side telemetry worlds. The human-attention budget and the dollar budget both got more elastic. But neither went to zero, and they remain the two structural constraints the rest of the lecture is trying to relax further.

Act 3: Machine learning enters specific network-operations loops

A parallel line of work, starting in the mid-2000s and accelerating after 2013, began replacing humans inside individual network-operations loops. The pattern was consistent: take a closed-loop decision an operator had been making either explicitly or implicitly through hand-tuned heuristics, train a model offline, deploy it, measure that it beats the heuristic on the workloads that mattered.

We focus on the decisions an operator makes on the network’s behalf — what is this traffic, is something abnormal, where should flows go, when did the failure start, what caused it. Application-side learned decisions like bitrate selection (Pensieve, CS2P from L11) and host-side decisions like congestion control (Remy, Winstein–Balakrishnan 2013) belong to the same intellectual moment, but they live outside network operations — the bitrate is the player’s call, the host stack is the OS’s call.

The network-operations learning loops fall into three families.

Classification and detection — what is this and is it abnormal?

Traffic classification. Moore and Zuev’s Bayesian classifier (SIGMETRICS 2005) established the supervised-learning baseline: hand-crafted flow features, off-the-shelf classifier, application-label output. The 2017–2023 deep-learning wave, including DeepPacket (Lotfollahi et al., 2017), nPrint (Holland et al., CCS 2021), ET-BERT (Lin et al., WWW 2022), and netFound (Beltiukov et al., 2023, our group), replaced the hand-crafted features with learned representations of packets and flows. The operator-facing decision is the same one Moore was already automating in 2005: what application generated this flow, what device is it from, is the label trustworthy enough to act on for billing, QoS, or access control.
Anomaly and intrusion detection. Kitsune (Mirsky et al., NDSS 2018) and Whisper (Fu et al., CCS 2021) are the canonical examples — autoencoder and frequency-domain models that flag traffic the operator should look at. The decision automated here is should this be on the human’s queue at all.

Control and planning — where should the network move bits, and how should we provision?

AuTO (Chen et al., SIGCOMM 2018) — hierarchical RL for datacenter flow scheduling and traffic optimization.
NeuroPlan (Zhu et al., SIGCOMM 2021) — RL for long-horizon network planning: capacity provisioning over multi-year time horizons. Planning had been the most human-decision-dominated activity in the operator’s calendar.
Decima (Mao et al., SIGCOMM 2019) — RL for cluster job scheduling at the datacenter compute layer; included here because at hyperscale-operator scale the line between compute scheduling and network scheduling has thinned.

Root cause analysis and fault localization — what actually caused the failure?

This line of work is the most directly relevant to the operator’s lived experience and is often forgotten when “ML for networking” gets summarized as classification.

Sherlock (Bahl et al., SIGCOMM 2007) — inference graphs over enterprise application dependencies; an early statistical RCA system that localized application failures to network components.
NetPoirot (Arzani et al., SIGCOMM 2016) — supervised classifier that decides, when a cloud application complains, whether the fault is in the network or in the application stack. Pure binary decision over a noisy multi-layer system.
LossRadar (Li et al., CoNEXT 2016) — sketch-based per-flow loss localization inside the data center.
007 (Arzani et al., NSDI 2018) — host-coordinated voting to localize packet-drop causes across a cloud network, without requiring switch-side changes. Origin story: the paper emerged from a Microsoft organizational conflict between WAN and datacenter networking teams over attribution of customer-reported failures. Each team’s metrics were affected by tickets the other team’s failures generated. The system was a peer-reviewed attribution mechanism — internal conflict resolution as research output.
Industrial follow-ons — DeepCoffea, causal-inference RCA, Microsoft’s Project Flash — extended the idea to wider fault models.

Each of these papers automates a specific RCA sub-problem under a defined fault model: “is the failure in the network or the app stack?”, “which link dropped these packets?”, “which device caused this app to slow down?”.
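As one concrete instance, the voting idea behind 007 can be sketched in a few lines of Python. This is a simplification under assumptions: the rule that a flow with a retransmission splits one vote evenly across the links on its path is how I am modeling it here, and the topology and link names are invented.

```python
from collections import defaultdict

def rank_links(flows):
    """flows: list of (path, had_retransmission). Returns links sorted by votes."""
    votes = defaultdict(float)
    for path, bad in flows:
        if not bad:
            continue                         # healthy flows cast no votes
        for link in path:
            votes[link] += 1.0 / len(path)   # assumed weighting: one vote per flow
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical two-hop paths; three of the five flows saw retransmissions.
flows = [
    (("a-s1", "s1-b"), True),
    (("a-s1", "s1-c"), False),
    (("d-s1", "s1-b"), True),
    (("d-s2", "s2-b"), False),
    (("e-s1", "s1-b"), True),
]

ranked = rank_links(flows)
print(ranked[0])
```

No single flow knows which link dropped its packets, but the link shared by all the unhappy flows accumulates the most votes. The fault model is narrow by design: it answers "which link", not "why".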

What this arc actually delivered

The arc is consistent: narrow, well-defined decision loop; sufficient training data; ML beats heuristic on the benchmarks; deploy; repeat. By 2021, every operator-facing decision loop in the list above had at least one credible learned alternative.

But notice where the arc stops. When the rebuffer hits at 8 pm, four signals tell four fragments of the story. The access switch sees the queue drop. The route monitor sees the BGP withdrawal upstream. The server’s eBPF probe sees the congestion-window collapse. The APM tool sees the application retry storm. A human still composes the four fragments into a single causal chain. Each learned component was trained on one fragment, not on the composition. The classifiers know what the traffic is. The anomaly detectors know that something is off. The RCA tools know which layer the fault sits in, under their fault model. No single learned system assembles the cross-layer causal chain the way an experienced operator does. That gap is what the next act turns into a structural question.

Act 4: The credibility crisis — and the four open questions it surfaces

In 2022, Arthur Jacobs, Roman Beltiukov, and colleagues in our group at UCSB published Trustee at CCS. The title of the paper is the diagnosis: “AI/ML for Network Security: The Emperor Has No Clothes.”

Here is what they found. ML-for-networking systems were achieving very high benchmark accuracies on classification tasks — intrusion detection, malware classification, traffic identification. But the systems were often making decisions for completely wrong reasons. They were exploiting shortcuts in the training data — artifacts of how the dataset was collected, distributional quirks that would not generalize to deployment. The models worked on the test set and would not work in the field, and nobody had a principled way to tell the difference from outside.

Trustee proposed trust reports as a canonical interface: rather than asking operators to trust opaque ML decisions, the system would translate a black-box model into a succinct white-box decision tree — the highest-fidelity succinct surrogate it can find — that the operator can read. Three failure modes consistently surfaced: learned shortcuts (the model exploits training-set artifacts), failure to generalize out-of-distribution, and overfitting. Reading the trust report made each failure mode visible from the outside for the first time.
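The surrogate idea can be illustrated with a deliberately tiny stand-in. This is not Trustee's algorithm, which extracts and prunes a full decision tree; the sketch below fits a single-rule stump to a hypothetical black box and reports fidelity, meaning the fraction of inputs on which the surrogate agrees with the model.

```python
def black_box(x):
    """Stand-in for an opaque classifier (hypothetical). It has quietly learned
    that all attack traffic in its training set came from hosts with TTL 128."""
    return 1 if x["ttl"] > 100 else 0

def fit_stump(samples, predict):
    """Find the single (feature, threshold) rule that best mimics `predict`."""
    labels = [predict(x) for x in samples]
    best = None
    for feat in samples[0]:
        for thr in sorted({x[feat] for x in samples}):
            for above in (0, 1):            # label assigned when x[feat] > thr
                agree = sum((above if x[feat] > thr else 1 - above) == y
                            for x, y in zip(samples, labels))
                fidelity = agree / len(samples)
                if best is None or fidelity > best[0]:
                    best = (fidelity, feat, thr, above)
    return best

samples = [{"pkt_size": s, "ttl": t}
           for s in (60, 400, 900, 1500) for t in (64, 128)]

fidelity, feat, thr, above = fit_stump(samples, black_box)
print(f"if {feat} > {thr}: predict {above}  (fidelity {fidelity:.2f})")
```

A rule that reads "if ttl > 64, predict attack" with perfect fidelity is the kind of thing an operator can reject on sight: TTL is an artifact of the collection environment, not a property of attacks.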

The Beltiukov line — credibility, data, foundation models. The same researcher who is the second author on Trustee, Roman Beltiukov (who also publishes as Sylee Beltiukov), then drove a four-paper line that operationalized the response to the credibility crisis.

netUnicorn (Beltiukov, Guo, Gupta, Willinger; CCS 2023) addressed the data-collection root cause. If models exploit shortcuts in training data, build a platform that simplifies collecting multi-environment data so the shortcuts have nowhere to hide. netUnicorn is the data-side answer to the Trustee diagnosis.
netFound (Beltiukov, Guthula, Manda, Daneshamooz, Guo, Willinger, Gupta, Monga; 2023) is the representation-side answer — covered in Frontier A below.
Demystifying Network Foundation Models (Beltiukov, Guthula, Guo, Willinger, Gupta; NeurIPS Datasets & Benchmarks 2025) is the evaluation-side answer — covered in Frontier A.
IEF — the Intrinsic Evaluation Framework (2025) is the metric-side answer. Instead of checking only that the model gets test cases right, IEF checks the internal representation on three axes: latent-space geometry (representations should not collapse into a few skewed clusters), alignment with hand-engineered statistical features that domain experts have developed over thirty years, and context discrimination (a representation of a TCP-Cubic flow and a TCP-BBR flow over the same congested link should look different in the latent space). IEF is honest about open-endedness: in NLP, “context” is a black-box concept, but in networking we can be specific about what contexts a representation should be able to discriminate — congestion control variant, queue discipline, vantage point, and so on. The framework is still under active development.

I want to be precise about what the credibility crisis actually is, because the framing matters. The diagnosis is structural, not anti-ML. Traffic classifiers, anomaly detectors, NeuroPlan, AuTO, 007 — all of these worked on their benchmark workloads. The crisis is that once you compress the loop period past the human’s reaction threshold, you have to trust the machine, and the trust mechanisms operators had for closed-loop systems — rate-limit it, page someone, roll it back — transfer poorly to learned components. The trust gap is the cost of the speed.

But the crisis is doing more work than just diagnosing one problem. It is forcing the field to confront four first-principles questions that the existing engineering stack — programmable data planes, sketches, streaming telemetry, eBPF, ML in specific loops — leaves unanswered:

What is the right belief representation when the consumer is no longer constrained by human cognition?
What is the right decision-maker when no single program can be the controller for the whole loop?
What is the right verification regime when the decision function is a lossy learned approximation?
What is the right unit of attribution in a network-of-networks where no party has full telemetry?

Each of these questions has a research front trying to answer it. Each front has produced specific papers that answer specific sub-questions. None of the fronts is settled, and the field has not yet figured out how the answers compose. That is the agenda we will walk through next.

Act 5: The four open questions, the papers that take a swing at each, and what remains

Each frontier gets the same treatment: name the underlying first-principles question, walk through what specific papers answer about it, mark what they leave open. The point is to leave you with a clear map of which sub-problems are closed and which are still open — because if you go on to do research in this area, the open ones are where your contribution can land.

Frontier A: What is the right belief representation?

The first-principles question. L15 walked you through the chronology of belief representations — counters, flow records, sketches, sampled packet captures. Each generation compressed the data for a human consumer. Once the consumer is a machine, the constraint changes. Representations can be machine-scale: high-dimensional latent vectors. Raw vectors aren’t useful on their own — they have to encode operationally meaningful structure, and they have to be reusable across tasks so you can fine-tune instead of retraining from scratch for every new question. What does the right representation look like?

This is also where we have a thirty-year-old debt to pay. Walter Willinger’s 1994 self-similarity result told us the data is heavy-tailed and bursty at every timescale. Sketches partially exploited that (Act 2). But our representations (the actual data structures we feed into ML systems) mostly ignored it. We tokenized packet bytes as if they were generic text. We aggregated bursts into samples and lost the burst structure. We trained under i.i.d. assumptions that the data has been known to violate since 1994.
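One standard way to see what "bursty at every timescale" means is the variance-time check: average the series over blocks of length m and watch how fast the variance of those averages falls. For independent samples it falls like 1/m (Hurst parameter H near 0.5); for persistent traffic it falls more slowly (H closer to 1). The sketch below is a toy under stated assumptions: the "held" series is a synthetic stand-in for persistence, not real traffic, and the block sizes are arbitrary.

```python
import random
from math import log
from statistics import fmean, pvariance

def block_means(series, m):
    """Average the series over non-overlapping blocks of length m."""
    usable = len(series) - len(series) % m
    return [fmean(series[i:i + m]) for i in range(0, usable, m)]

def variance_time_slope(series, block_sizes):
    """Least-squares slope of log(variance of block means) vs log(block size)."""
    xs = [log(m) for m in block_sizes]
    ys = [log(pvariance(block_means(series, m))) for m in block_sizes]
    x_bar, y_bar = fmean(xs), fmean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

def hurst_estimate(series, block_sizes):
    """Variance-time estimate: slope = 2H - 2, so H = 1 + slope / 2."""
    return 1 + variance_time_slope(series, block_sizes) / 2

rng = random.Random(0)
sizes = [1, 2, 4, 8, 16, 32, 64]

# Independent samples: averaging smooths them quickly (H near 0.5).
iid = [rng.gauss(0, 1) for _ in range(2 ** 14)]

# Toy persistent series: each level is held for 64 samples, so averaging
# over blocks up to that length does not reduce the variance at all.
levels = [rng.gauss(0, 1) for _ in range(2 ** 8)]
held = [level for level in levels for _ in range(64)]

h_iid = hurst_estimate(iid, sizes)
h_held = hurst_estimate(held, sizes)
```

The gap between the two estimates is the point: a pipeline that assumes the first regime while the data lives nearer the second will treat bursts as noise to be averaged away.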

What the papers answer.

nPrint (Holland et al., 2021) answers: what should the input tensor look like? It standardizes a bit-level encoding of packet headers so every ML-for-networking pipeline can stop inventing its own. Closes the “what is the substrate” question.
ET-BERT (Lin et al., 2022) answers: does transformer pre-training transfer to network traffic the way it does to text? Yes, with caveats. Demonstrates that the BERT recipe works on encrypted traffic for classification tasks.
netFound (Beltiukov, Guthula, and colleagues in our group, 2023) answers: does respecting protocol structure improve representation quality? The idea is that protocols give you a natural vocabulary; tokenize on that vocabulary, build a structural foundation model, and downstream tasks need less data to fine-tune.
NetBurst (Guthula, Daneshamooz, Fleming, Kundu, Willinger, Gupta; 2025) answers: can we build a foundation model that actually respects self-similarity? The architecture is event-centric, not sample-centric; intermittent, not continuous. Walter Willinger is a co-author — the same person who discovered the data’s structure in 1994 is on the paper that finally builds a model around it. Thirty-one years.
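To give a feel for the "what should the input tensor look like" question, here is a minimal sketch in the spirit of a bit-level header encoding: every header bit gets a fixed position, and headers that are absent from a packet are filled with -1 so that all packets share one shape. The layout, field widths, and function names are assumptions for illustration, not nPrint's actual format or API.

```python
# Hypothetical fixed layout: header name -> number of bytes reserved for it.
LAYOUT = {"ipv4": 20, "tcp": 20, "udp": 8}
VECTOR_LENGTH = 8 * sum(LAYOUT.values())

def bytes_to_bits(data, width):
    """Expand up to `width` bytes into 0/1 ints (MSB first), padding with -1."""
    bits = [(byte >> shift) & 1 for byte in data[:width] for shift in range(7, -1, -1)]
    return bits + [-1] * (width * 8 - len(bits))

def encode_packet(headers):
    """Map {header name: raw bytes} to one fixed-length vector over {-1, 0, 1}."""
    vector = []
    for name, width in LAYOUT.items():
        vector.extend(bytes_to_bits(headers.get(name, b""), width))
    return vector

# A UDP packet: the TCP slots stay at -1, so the vector shape never changes.
udp_packet = {
    "ipv4": bytes([0x45, 0x00]) + bytes(18),
    "udp": bytes([0x00, 0x35, 0x00, 0x35, 0x00, 0x08, 0x00, 0x00]),
}
encoded = encode_packet(udp_packet)
```

The design choice worth noticing is that position carries meaning: bit k always refers to the same header bit, which is what lets different pipelines share one input format.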

What remains open. Cross-operator generalization is unsettled — a model trained on one operator’s telemetry transfers in some settings, breaks in others. Cross-task generalization is partial — the foundation models beat per-task training on benchmarks, but it remains unclear which downstream tasks benefit most. And the deepest open question: what evaluation metric tells you a learned belief representation is good enough to act on? Accuracy on a labeled benchmark misses it — that was the Trustee finding. The IEF work in our group is one attempt; the field is still converging.

Connection to the holistic theme. This frontier is what makes machine consumers possible. Without a good belief representation, you cannot have agents that reason about network state at machine speed. Every other frontier (B, C, D) depends on something like this working.

Frontier B: What is the right decision-maker?

The first-principles question. Tuesday’s lecture established that humans cannot be the consumer of microsecond-scale belief models. So what replaces them? The lazy answer is “an LLM” or “an agent.” The harder question is: what is the right architectural unit for the decision-maker? A single monolithic agent? A population of specialized agents? Agents with what scope, what authority, what coordination protocol? The Coordination invariant from L15 returns here with teeth.

What the papers answer.

NetConfEval (Wang et al., 2024) answers a diagnostic question: what can off-the-shelf LLMs already do on network-configuration tasks? Not enough. The contribution is the benchmark that quantifies the gap — without it, the field could not measure progress.
AIOpsLab (Chen et al., 2025) answers an evaluation question: what are the right operational metrics for an agentic incident-management system? It standardizes the use of TTD (time to detect) and TTM (time to mitigate) — operator-side metrics SRE teams have used for years — as the comparison axis for agent vs. human. The contribution is the benchmark and the metric protocol, not the invention of TTD/TTM.
NetGent (Daneshamooz and colleagues in our group, 2025) answers an architectural sub-question: can agents automate stateful workflows that previously required a human to drive a browser session or shell session? Yes, for bounded application workflows.
Industry instantiations split into two layers. Operator-facing agentic AIOps platforms — Nokia Sense/Think/Act, Cisco agentic AI, TM Forum’s A2A inter-agent protocol — answer a protocol question: what is the wire-level interface between agents that need to coordinate across operator domains? Developer-facing agent-runtime platforms — Cloudflare Project Think (Sunil Pai, April 2026) — answer a substrate question: what primitives does an agent need to run reliably across millions of long-lived sessions? Project Think’s answer is durable execution with crash recovery, isolated sub-agents with their own state, sandboxed code execution, and persistent forkable sessions. This is the substrate other operators’ AIOps agents will be built on, not Cloudflare operating its own network with agents.
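The TTD and TTM comparison is simple enough to sketch. The version below assumes TTD runs from fault onset to detection and TTM from detection to mitigation; some teams measure TTM from fault onset instead, so treat that as a labeled assumption. The `Incident` type and every timestamp are made up for illustration and are not results from any benchmark.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    fault_start: datetime   # when the fault began (injected, in a benchmark)
    detected: datetime      # when the responder first flagged it
    mitigated: datetime     # when service was restored

def ttd_seconds(incident):
    """Time to detect: fault onset to detection."""
    return (incident.detected - incident.fault_start).total_seconds()

def ttm_seconds(incident):
    """Time to mitigate, measured here from detection to mitigation."""
    return (incident.mitigated - incident.detected).total_seconds()

def summarize(incidents):
    """Mean TTD and TTM over a set of incidents handled by one responder."""
    return {
        "mean_ttd_s": mean(ttd_seconds(i) for i in incidents),
        "mean_ttm_s": mean(ttm_seconds(i) for i in incidents),
    }

def at(minute, second=0):
    """Helper: a timestamp on an arbitrary, made-up day."""
    return datetime(2026, 1, 1, 12, minute, second)

# Fabricated timelines, only to exercise the arithmetic.
agent_runs = [Incident(at(0), at(0, 30), at(2, 30)), Incident(at(10), at(11, 30), at(15, 30))]
human_runs = [Incident(at(0), at(5), at(25)), Incident(at(10), at(13), at(43))]

agent_summary = summarize(agent_runs)
human_summary = summarize(human_runs)
```

Putting both responders on the same two axes is what makes the comparison legible to an SRE team that already reports these numbers.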

What remains open. Three big things. Composition — multi-agent systems work in narrow benchmarks; how they behave at scale across operator domains is unknown. Accountability — Arpit raised this in Tuesday’s lecture: when something breaks in a thousand-agent network, which agent is responsible? No good answer yet. The right unit — nobody knows whether the future is one agent per flow, one agent per service, one agent per AS, or some hierarchical mix. The architectural question is wide open.

Connection to the holistic theme. This is the most direct successor to the L15 question about Coordination. The decision-maker shifted from per-device CLI to SDN controller. It is shifting again, and the destination is still being designed.

Frontier C: What is the right verification regime over lossy learned components?

The first-principles question. L15 walked you through the verification arc — HSA, VeriFlow, Batfish, NetKAT, Minesweeper. These were sound tools: if HSA said no loops exist, no loops existed. They worked because the systems they verified had explicit rule sets. When the rule set is a learned function — a neural network, an LLM — the verification primitives have to change. You cannot enumerate the input space. You cannot reason about every possible action symbolically. So how do you preserve the trust regime SDN inherited from formal methods, once the decision function is probabilistic?

What the papers answer.

LeJIT (Hè and Apostolaki, 2025) answers: can you put a symbolic safety check in the loop with an LLM, at inference time? Yes. The architectural move: let the LLM propose actions; let an SMT solver veto any action that violates a declared safety policy. The LLM stays creative; the solver stays principled. Closes the “can we even compose these” question.
Capisce (Campbell, Hojjat, Foster, 2024) answers: what formal language do you write the safety specification in? It proposes a precise control-interface specification language — a contract for what a controller is allowed to do.
VeriX (Wu et al., 2023) answers: can you formally verify properties of a neural network’s outputs? Yes, for narrow properties on small networks. Demonstrates the technique exists.
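The propose-then-veto pattern can be sketched in a few lines. This is a toy under stated assumptions, not LeJIT's implementation: plain Python predicates stand in for the SMT solver, a hard-coded ranked list stands in for the LLM, and the policy rules, link names, and `Action` type are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    kind: str        # e.g. "set_weight" or "shutdown_link"
    link: str
    value: int = 0

# Hypothetical declared safety policy, as named predicates over an action.
PROTECTED_LINKS = {"core-1", "core-2"}
POLICY = {
    "never shut down a protected link":
        lambda a: not (a.kind == "shutdown_link" and a.link in PROTECTED_LINKS),
    "link weights stay within 1..100":
        lambda a: a.kind != "set_weight" or 1 <= a.value <= 100,
}

def violations(action):
    """Return the names of every policy rule the action breaks."""
    return [name for name, holds in POLICY.items() if not holds(action)]

def gated_choice(proposals):
    """Walk proposals in the proposer's preference order; return the first safe one."""
    rejected = []
    for action in proposals:
        broken = violations(action)
        if not broken:
            return action, rejected
        rejected.append((action, broken))
    return None, rejected   # nothing safe: fall back to a human or a no-op

# Stand-in for a learned proposer's ranked output.
proposals = [
    Action("shutdown_link", "core-1"),
    Action("set_weight", "edge-7", 500),
    Action("set_weight", "edge-7", 40),
]
chosen, rejected = gated_choice(proposals)
```

The gate never needs to understand why the proposer ranked its options the way it did; it only needs the declared policy, which is why the quality of that policy matters so much.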

What remains open. Soundness versus completeness at scale: the L15 symbolic verification tools were sound but increasingly incomplete (some properties were out of reach). Their counterparts for learned components inherit that tradeoff, and the system being verified is now probabilistic on top of it. Specification quality: Capisce gives you the language; the harder question is who writes the spec, and how you know the spec itself is correct. Coverage: VeriX proves narrow properties; production systems need coverage of the full action space.

Connection to the holistic theme. Frontier C is the cure that the credibility crisis demanded. Frontier B (agentic ops) cannot be deployed at scale without Frontier C (safety gates). The two are coupled: every advance in agent capability raises the bar on the verification machinery.

Frontier D: What is the right unit of attribution under partial observability?

The first-principles question. L15 established that the Internet is a network of networks — independent operators, partial telemetry, no global view. We saw this play out in the running rebuffer example: five stakeholders, five truthful beliefs, none of them the same picture. When something breaks at Internet scale, attribution requires composing partial views across administrative boundaries. What is the right architectural unit for that composition?

For most of the chronology, the implicit assumption was a single-domain operator. The verification regime (Frontier C) presupposes a single authority over the configuration. The representation regime (Frontier A) presupposes a single training set. The agent regime (Frontier B) presupposes a single decision-maker. None of those assumptions hold across operators.

What the papers answer.

MoCE — Mixture-of-Context-aware Experts (Vipul Harsh, Sayan Sinha, Henry Milner, B. Aditya Prakash, Vyas Sekar, Hui Zhang; NSDI 2026) answers: is composition the right architectural unit for cross-domain attribution? The contribution is the framing — instead of trying to train one big model that sees everything (which is impossible given the data-sharing constraints), build a composition of experts, each specialized for a particular (stakeholder, vantage, telemetry-source) regime, and route the query to the right expert. The architectural unit is composed expertise, not unified state.
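The "composed expertise, not unified state" idea can be sketched as a registry of experts keyed by regime. This is a deliberately naive illustration: the lookup-table routing, the `Regime` tuple, the expert rules, and the thresholds are all assumptions of mine, and the source does not say how MoCE actually selects or trains its experts.

```python
from typing import Callable, Dict, NamedTuple, Optional

class Regime(NamedTuple):
    stakeholder: str   # e.g. "cdn", "isp", "player"
    vantage: str       # e.g. "edge", "access", "client"
    source: str        # e.g. "tcp_stats", "flow_records", "qoe_beacons"

Expert = Callable[[dict], str]

def cdn_edge_expert(observation: dict) -> str:
    """Toy rule: blame the CDN edge if its retransmit rate is high."""
    return "cdn-edge" if observation.get("retransmit_rate", 0.0) > 0.05 else "not-cdn-edge"

def player_expert(observation: dict) -> str:
    """Toy rule: blame the client if its buffer starved while throughput was fine."""
    starved = observation.get("buffer_s", 0.0) < 1.0
    healthy = observation.get("throughput_mbps", 0.0) > 5.0
    return "client" if starved and healthy else "not-client"

class ExpertRouter:
    """Holds one expert per regime; no expert ever sees another regime's data."""

    def __init__(self) -> None:
        self._experts: Dict[Regime, Expert] = {}

    def register(self, regime: Regime, expert: Expert) -> None:
        self._experts[regime] = expert

    def attribute(self, regime: Regime, observation: dict) -> Optional[str]:
        """Route the query to the matching expert, or None if no expert covers it."""
        expert = self._experts.get(regime)
        return None if expert is None else expert(observation)

router = ExpertRouter()
router.register(Regime("cdn", "edge", "tcp_stats"), cdn_edge_expert)
router.register(Regime("player", "client", "qoe_beacons"), player_expert)

edge_verdict = router.attribute(Regime("cdn", "edge", "tcp_stats"), {"retransmit_rate": 0.12})
client_verdict = router.attribute(
    Regime("player", "client", "qoe_beacons"), {"buffer_s": 0.2, "throughput_mbps": 20.0}
)
uncovered = router.attribute(Regime("isp", "access", "flow_records"), {"loss": 0.01})
```

The `None` result is the interesting case: when no registered expert covers a regime, the composition has nothing to say, and a unified model was never available as a fallback.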

Important scope caveat about MoCE. The framework presupposes hyperscaler-grade visibility — a Cloudflare, a Conviva, a Google can see deeply into both their own network and the client device, and the only unknown patch is the middle. For an operator who is not Google, who has no privileged view of the endpoint, the cross-stakeholder partial-observability problem is genuinely harder than MoCE addresses. MoCE is the first answer for the operators who already see almost everything; it is not yet an answer for the operators who see only their slice.

What remains open. Almost everything else. Federated training — how do you train the experts when no party shares raw telemetry? Cross-stakeholder verification — Frontier C’s safety gates assume a single specification authority; what does verification look like when multiple parties have to agree? Coalition formation — which experts trust which other experts, and how does that trust get established and revoked? Surfacing attribution to humans — once the experts compose an attribution, who is actually accountable for acting on it? And the operator-without-hyperscaler-visibility case — the structural one. No paper has yet given a satisfying answer for that operator.

Connection to the holistic theme. Frontier D is the front most directly downstream of the network-of-networks reality from L15. The first three frontiers can pretend the world has a single operator. Frontier D cannot — and as a result, it is the front where the gap between “research paper exists” and “production deployment works” is widest.

Where the frontiers leave us

Step back. Five years ago, the field had:

Programmable data planes that could observe at line rate (Acts 1–3).
ML decision-makers that beat heuristics in narrow loops (Act 3).
A diagnosis that those ML decision-makers could not be trusted the way old systems could (Act 4).

The four research fronts (Act 5) each take a swing at one sub-question: what to represent, who decides, how to verify, how to compose across stakeholders. Each front has produced specific contributions that close specific sub-questions. All four are still in motion. And — this is the part that usually gets buried in conference papers — the four answers don’t yet compose with each other.

Stacking them naively breaks. The intended pipeline: Frontier A’s foundation model gives you a representation; Frontier B’s agent consumes the representation; Frontier C’s safety gate vetoes the agent’s action; Frontier D’s expert composition handles the cross-stakeholder case. In theory. In practice, the layers lack interfaces. A safety gate (Frontier C) lacks a way to read a foundation-model embedding (Frontier A). An agent population (Frontier B) lacks a protocol for trusting an expert from a different stakeholder (Frontier D). The composition problem is the real open frontier — downstream of all four.

If you want to do research in this area, the open questions live at the interfaces — how the four answers fit together into a coherent stack an operator could deploy. The field has already swung many times at “another foundation model” and “another agent benchmark”; the gap is at the joins.

Act 6: What stays human

One last point — and I think it gets buried.

Throughout this lecture I have described what happens when machine consumers replace human consumers in inner loops, then in outer loops, then perhaps eventually in the loops that today require a NOC engineer (network operations center engineer) to interpret. The arc is real. But the arc has a stopping point — and the stop is not technical.

The outermost loop — where intent is set, where ethics is negotiated, where accountability is assigned, where regulatory constraints are imposed — stays human. The stop comes from accountability, not capability. What makes those decisions decisions is that they are answerable to people. A campus IT department does not get to decide that a particular community should be deprioritized on the residential VLAN. That decision belongs to the people the community comprises, to the operators who answer to them, to the regulators who answer to all of them.

No agent population, however capable, gets to make that decision.

The chapter’s view is that the operator’s role retreats outward, not away. The operator stops being the per-incident decider and becomes the intent specifier and the accountability gatekeeper. The agents handle the inner loops. The humans handle the loop that asks what the inner loops should optimize for, who they should serve, and who answers for them when they fail.

The thesis, said plainly: automation, not autonomy. If AI is the final decision-maker, your network is toast. The plausible outcome is AI-and-human teaming. Humans need to understand the fundamentals deeply enough to know what to delegate, to judge the quality of an agent’s output, and to make the critical calls. You are still the bottleneck, but you don’t have to take every decision; you have to take the critical ones, and you have to design the agent so it gives you the right data to take them with.

On what this means for who runs networks. Google today runs its network with an army of engineers. The future is not an army. The future is four or five really good network engineers per network, who understand the fundamentals, who know how to operate with AI — and who might be running five other networks at the same time. As AI matures, the per-network cognitive load drops, but the human stays in charge. The question is how many humans, not whether.

That is the honest state of network operations in 2026. The components of the relaxation are visible. The composition is unfinished. The human contract is genuinely undecided. You are graduating into a field that is in motion.

Generative questions for the rest of CS176C

A few questions to hold open as you finish the quarter:

What does it mean to be “an operator” when agents make most of the inner-loop decisions? The professional identity that “network operator” has held for four decades is being reshaped, not eliminated.
How do you architect cross-stakeholder verification when no single party trusts the others with raw telemetry? MoCE is one approach. Federated learning, blockchain-anchored attribution, formal coalition-verification — all active.
What does “performance” mean for a network whose user is increasingly another agent, not a human? The L15 definition was layered by consumer; agentic services may need new layers we cannot specify yet.
What does “secure” mean when the attacker is also agentic? The L15 secure definition assumed human-instigated attacks and human-supervised defenses. Both assumptions are weakening.
How does the planning/operations boundary hold up when planning itself is algorithmic (NeuroPlan and successors)? The boundary is doing pedagogical work today; it may need refinement tomorrow.

That is where I’ll leave you. Week 10 is project work; no lectures. Office hours are extended. Bring your project questions, your stuck moments, your half-formed ideas about which of these frontiers you would want to push on next.

Good luck with your projects, and thank you for the quarter.

Optional pre-read

Jacobs et al., “AI/ML for Network Security: The Emperor has no Clothes,” CCS 2022 (the Trustee paper).
Beltiukov et al., “Demystifying Network Foundation Models,” NeurIPS Datasets & Benchmarks 2025.
Pai, “Project Think: building the next generation of AI agents on Cloudflare,” Cloudflare blog, April 15, 2026.

References

[1] Bosshart et al. Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN. SIGCOMM 2013.
[2] Bosshart et al. P4: Programming Protocol-Independent Packet Processors. SIGCOMM CCR 2014.
[3] Kim et al. In-band Network Telemetry (INT). 2015.
[4] Gupta et al. Sonata: Query-Driven Streaming Network Telemetry. SIGCOMM 2018.
[5] Narayana et al. Marple: Language-Directed Hardware Design for Network Performance Monitoring. SIGCOMM 2017.
[6] Zhu et al. Everflow: Packet-Level Telemetry in Large Datacenter Networks. SIGCOMM 2015.
[7] Yu et al. dShark. NSDI 2019.
[8] Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. CACM 1970.
[9] Cormode and Muthukrishnan. Count-Min Sketch. J. Algorithms 2005.
[10] Yu, Jose, Miao. OpenSketch. NSDI 2013.
[11] Liu, Manousis, Vorsanger, Sekar, Braverman. UnivMon: One Sketch to Rule Them All. SIGCOMM 2016.
[12] OpenConfig Working Group. gNMI specification. 2016+.
[13] Linux kernel. eBPF / XDP documentation. 2018+.
[14] Huston. The Vanishing Network. APNIC blog 2024.
[14a] Guo et al. Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis. SIGCOMM 2015.
[14b] Miao, Chen, Gupta, Meng, et al. Detecting Ephemeral Optical Events with OpTel. NSDI 2022.
[15] Winstein and Balakrishnan. Remy: TCP ex Machina. SIGCOMM 2013.
[15a] Moore and Zuev. Internet Traffic Classification Using Bayesian Analysis Techniques. SIGMETRICS 2005.
[15b] Lotfollahi et al. DeepPacket. 2017.
[15c] Mirsky et al. Kitsune: An Ensemble of Autoencoders for Online Network Intrusion Detection. NDSS 2018.
[15d] Fu et al. Realtime Robust Malicious Traffic Detection via Frequency Domain Analysis (Whisper). CCS 2021.
[15e] Bahl et al. Sherlock: Towards Highly Available Enterprise Network Services. SIGCOMM 2007.
[15f] Arzani et al. Taking the Blame Game out of Data Centers Operations with NetPoirot. SIGCOMM 2016.
[15g] Li et al. LossRadar: Fast Detection of Lost Packets in Data Center Networks. CoNEXT 2016.
[15h] Arzani et al. 007: Democratically Finding the Cause of Packet Drops. NSDI 2018.
[18] Mao et al. Decima. SIGCOMM 2019.
[19] Chen et al. AuTO. SIGCOMM 2018.
[20] Zhu et al. NeuroPlan. SIGCOMM 2021.
[21] Jacobs, Beltiukov, Willinger, Ferreira, Gupta, Granville. AI/ML for Network Security: The Emperor has no Clothes (Trustee). CCS 2022.
[21a] Beltiukov, Guo, Gupta, Willinger. In Search of netUnicorn: A Data-Collection Platform to Develop Generalizable ML Models for Network Security Problems. CCS 2023.
[22] Beltiukov et al. IEF — Intrinsic Evaluation Framework. 2025.
[23] Holland et al. nPrint. IMC 2021.
[24] Lin et al. ET-BERT. WWW 2022.
[25] Beltiukov, Guthula, et al. netFound. arXiv 2023.
[26] Guthula, Daneshamooz, Fleming, Kundu, Willinger, Gupta. NetBurst. arXiv 2025.
[27] Wang et al. NetConfEval. CoNEXT 2024.
[28] Chen et al. AIOpsLab. MLSys 2025.
[29] Daneshamooz et al. NetGent. arXiv 2025.
[30] Hè and Apostolaki. LeJIT. HotNets 2025.
[31] Campbell, Hojjat, Foster. Capisce. OOPSLA 2024.
[32] Wu et al. VeriX. 2023.
[33] Harsh, Sinha, Milner, Prakash, Sekar, Zhang. MoCE. NSDI 2026.