From Programmable Observation to Agentic Operations

CS176C — Advanced Topics in Internet Computing

Arpit Gupta

2026-05-28

Where We Left Off — and Today’s Question

L15 (Tuesday): the operator’s loop, the four invariants, the network-of-networks attribution problem; chronology from SNMP through the Knowledge Plane through SDN. By 2016, SDN gave centralized authority and continuously-verifiable policies. Three things stayed the same: human consumer, mostly fixed-function data plane, five-minute-averages-and-dashboards interfaces.

Today’s question: if the human is not the consumer of all this fine-grained information, how do you synthesize state and decide what to do?

Today’s arc: the data plane became programmable; observation moved into it; ML decision-makers entered specific loops; a credibility crisis surfaced a gap the research frontier is still trying to close.

Act 1: The Story Gets Told Wrong

The common framing: “SNMP and NetFlow for decades → switches got programmable → INT, Sonata, sketches.”

That paints a clean linear evolution with the programmable-data-plane shift as the prime mover. The actual story: parallel tracks running for decades, each chasing its own question, eventually fusing on a new substrate.

Three pressures shaped the trade for 30+ years:

Pressure | What it means
Richness | What detail can the telemetry capture?
Cost | Memory, CPU, bandwidth, storage along the producer-transport-storage chain
Human consumer | What can a human absorb at the end of the loop?

L15 leaned on the human side. Cost is co-equal. Both pressures get relaxed across the rest of the lecture.

Three Parallel Tracks Before 2013

Track | Anchors | Question being asked
1. Counter telemetry | SNMP (1988) | Cheapest common interface vendors will agree to?
2. Flow summarization | NetFlow (1996), sFlow (2001), IPFIX | Summarize for billing / sample at line rate, given fixed-function silicon?
3. Streaming sketches (theory + networking deployment) | Bloom (1970), AMS (1996), Count Sketch (2002), Count-Min (2005); Estan-Varghese Sample-and-Hold (SIGCOMM 2002) | Bounded-memory queries over a stream, in theory and at line rate? Heavy-hitter detection is one of the canonical use cases.

None of these needed programmable silicon. They predate it by decades. Each was answering a different question.

What 2013 Enabled — and the Deployment Caveat

RMT (Bosshart, Gibb, Kim, Varghese, McKeown; 2013) → P4 (2014) → Tofino (Barefoot, 2016).

Switch ASIC: programmable parser + match-action stages + programmable deparser. Behavior in software, silicon as a general substrate.
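
To make "behavior in software" concrete, here is a toy Python model of the parser, match-action stages, and deparser. It is a sketch only, not P4 and not how Tofino is programmed: the `key=value` wire format, the `Table` class, and the `set_port` and `drop` actions are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # One match-action stage: exact match on a single header field (toy model).
    match_field: str
    entries: dict = field(default_factory=dict)  # field value -> action function

    def apply(self, headers):
        action = self.entries.get(headers.get(self.match_field))
        if action is not None:
            action(headers)


def parse(raw):
    # Toy parser: the "wire format" here is just "key=value;key=value".
    return dict(item.split("=", 1) for item in raw.split(";"))


def deparse(headers):
    # Toy deparser: serialize the (possibly modified) headers back out.
    return ";".join(f"{k}={v}" for k, v in headers.items())


def set_port(port):
    def action(headers):
        headers["egress_port"] = str(port)
    return action


def drop(headers):
    headers["drop"] = "1"


# The "program": which fields each stage matches on and which actions it runs.
pipeline = [
    Table("dst", {"10.0.0.2": set_port(7)}),
    Table("proto", {"telnet": drop}),
]


def process(raw):
    headers = parse(raw)
    for stage in pipeline:
        stage.apply(headers)
    return None if headers.get("drop") == "1" else deparse(headers)


print(process("dst=10.0.0.2;proto=http"))
print(process("dst=10.0.0.2;proto=telnet"))
```

Changing what the switch does means editing `parse`, `pipeline`, and the actions; the loop in `process` (standing in for the silicon) stays the same.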

The question RMT’s designers were asking was structural, not measurement-specific:

If we expose the pipeline architecture as a target for software, what new design space opens up?

Sketches-in-switches and per-hop measurement came as applications later. An enabling shift, not the prime mover.

Deployment honesty: Tofino runs in production at Google, Meta, Microsoft datacenters. Enterprise switches, ISP edge routers, and consumer hardware mostly run fixed-function silicon. The work that follows matters intellectually; its operational footprint sits in a handful of organizations.

2013–2018: Pre-Existing Questions Get Programmable Answers

System | Pre-existing question | What 2013+ enabled
OpenSketch (2013), UnivMon (2016) | Can streaming algorithms run on switches? | Track 3 deploys on programmable silicon
INT (Kim, Barefoot, 2015) | Per-hop, per-packet visibility? | A new question that needed the new substrate
Marple (2017), Sonata (2018) | What abstraction should operators use for resource-constrained queries? | All three tracks fuse here
Everflow (2015), dShark (2019) | Match-and-mirror at line rate; distributed packet analysis | Same substrate, different question
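
# Sonata: detect DNS amplification victims
# (hosts receiving DNS responses from many distinct sources)
victims = (packets
    .filter(lambda p: p.udp_sport == 53)             # DNS responses only
    .map(lambda p: (p.dst_ip, p.src_ip))
    .distinct()                                      # unique (victim, source) pairs
    .map(lambda dst_ip, src_ip: (dst_ip, 1))         # one count per distinct source
    .reduce(keys=["dst_ip"], op=sum)
    .filter(lambda dst_ip, count: count > threshold))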

Naive: millions of tuples/sec to server. Sonata’s compiler runs filter + distinct + most of the count on the switch — using Track 3 sketches — and ships only heavy hitters. ~10⁴× reduction in tuples-to-server. Switch resource budget and operator cognitive budget in the same ILP.
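
To pin down what the query computes, here is a minimal pure-Python sketch of the naive plan, where every packet reaches the server. This is not Sonata's API or its compiler output; the `Packet` record, its field names, and the threshold value are assumptions for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Packet(NamedTuple):
    # Hypothetical packet record; only the fields the query touches.
    src_ip: str
    dst_ip: str
    udp_sport: int


def dns_amplification_victims(packets, threshold):
    """Return {dst_ip: n} for hosts that receive DNS responses from more
    than `threshold` distinct sources."""
    # filter + map + distinct: unique (dst, src) pairs among DNS responses
    pairs = {(p.dst_ip, p.src_ip) for p in packets if p.udp_sport == 53}
    # count distinct sources per destination
    counts = Counter(dst for dst, _src in pairs)
    # final filter: keep destinations above the threshold
    return {dst: n for dst, n in counts.items() if n > threshold}


packets = (
    [Packet(f"10.0.0.{i}", "192.0.2.1", 53) for i in range(5)]  # five sources, one target
    + [Packet("10.0.0.1", "192.0.2.1", 53)] * 3                 # repeats collapse under distinct
    + [Packet("10.0.0.9", "192.0.2.2", 53)]                     # ordinary client, one source
    + [Packet("10.0.0.9", "192.0.2.1", 443)]                    # not DNS, filtered out
)
victims = dns_amplification_victims(packets, threshold=3)
print(victims)
```

In this version the set and the counter live on the server. The compiled plan described above differs in where those steps run, not in what they compute.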

Sketches Win Because Willinger Was Right (1994)

Recall L15: Walter Willinger’s 1994 self-similarity result. Network traffic is heavy-tailed at every observable timescale. A few keys carry most of the mass; most keys carry almost none.

Sketches are exactly the right data structure for that distribution. A small fixed-size table gives bounded error on heavy-hitter queries precisely because heavy tails concentrate the mass in a few places.
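
A minimal Count-Min sketch in Python shows the mechanism. It is an illustration under stated assumptions, not the switch implementation: the salted `blake2b` hashing, the table dimensions, and the synthetic workload (flow i carries about 10000/i packets) are all choices made here.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """depth rows of width counters; estimates never undercount."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.blake2b(
            key.encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += count

    def estimate(self, key):
        # Collisions only inflate counters, so the minimum is the tightest row.
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))


# Synthetic heavy-tailed workload: a few flows carry most of the packets.
true_counts = Counter({f"flow-{i}": 10_000 // i for i in range(1, 2001)})
total = sum(true_counts.values())

sketch = CountMinSketch(width=256, depth=4)  # 1,024 counters for 2,000 flows
for key, n in true_counts.items():
    sketch.add(key, n)

for key, n in true_counts.most_common(5):
    print(key, "true:", n, "estimate:", sketch.estimate(key))
```

The overestimate for any key comes from whatever else hashes into its counters. That collision mass is roughly the same size for every key, so it is a small fraction of a heavy hitter's count and a large fraction of a tiny flow's, which is why the structure suits heavy-hitter queries in particular.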

Year | Event
1994 | Self-similarity discovered (Leland-Taqqu-Willinger-Wilson, ToN)
2005 | Count-Min Sketch published (Cormode-Muthukrishnan)
2013–2018 | Sketches deployed operationally on programmable switches

The intellectual lag was ~20 years. The sketches existed first; the substrate let them be deployed.

What the Substrate Work Did and Did Not Do

Did: the substrate work — programmable switches, sketches-on-switches, query-driven telemetry — reduced cost-per-bit-of-belief by orders of magnitude for the operators who run it.

Did not: eliminate either binding constraint.

Constraint | Where it stood in 2010 | Where it stands now
Human-attention budget | Five-minute dashboards | Streaming pipelines feed alerting engines, not human eyeballs
Dollar budget | SNMP at five-minute polling was what the budget allowed | Sub-second telemetry feasible at hyperscaler budgets only

Both remain the structural constraints the rest of the lecture is trying to relax further.

Geoff Huston (APNIC, 2024) called this the vanishing network. Now: who consumes the richer data?

Act 3: ML in Network-Operations Loops — Three Families

Scope: decisions an operator makes on the network’s behalf. Out: bitrate selection (Pensieve/CS2P, L11), congestion control (Remy 2013) — same intellectual moment, but the operator does not pick those.

Family | Question being learned | Anchor papers
Classification & detection | What is this traffic? Is something abnormal? | Moore-Zuev 2005, DeepPacket 2017, nPrint 2021, ET-BERT 2022, netFound 2023 (our group); Kitsune NDSS 2018, Whisper CCS 2021
Control & planning | Where should flows go? How much to provision? | AuTO SIGCOMM 2018, NeuroPlan SIGCOMM 2021, Decima SIGCOMM 2019
Root cause analysis & localization | What actually caused the failure? | Sherlock SIGCOMM 2007, NetPoirot SIGCOMM 2016, LossRadar CoNEXT 2016, 007 NSDI 2018

Pattern across all three: narrow scope, defined fault model / decision space, ML beats hand-tuned heuristics on benchmarks. By 2021, every operator-facing loop above had a credible learned alternative.

What the Arc Did Not Cross

The 8 pm rebuffer: four signals tell four fragments of the story.

  • The access switch sees the queue drop.
  • The route monitor sees the BGP withdrawal upstream.
  • The server’s eBPF probe sees the congestion-window collapse.
  • The APM tool sees the application retry storm.

A human still composes the four fragments into a single causal chain. Each learned component was trained on one fragment, not on the composition.

No single learned system assembles the cross-layer causal chain the way an experienced operator does.

That gap is what the next act turns into a structural question.

The Credibility Crisis → Four First-Principles Questions

Trustee (Jacobs et al., our group, CCS 2022): “AI/ML for Network Security: The Emperor Has No Clothes.” ML-for-networking systems were achieving high benchmark accuracy for the wrong reasons — shortcuts in training data that wouldn’t survive deployment. IEF (Beltiukov et al., 2025) extends: check the model’s internal representation, not just its test-set accuracy.

The crisis is structural, not anti-ML. Classifiers, anomaly detectors, NeuroPlan, AuTO, 007: all worked on their benchmarks. The crisis: once the loop compresses past human reaction time, you have to trust the machine; the old trust mechanisms (rate-limit, page someone, roll back) transfer poorly to learned components.

It forces four first-principles questions the engineering stack leaves open:

  1. What is the right belief representation when the consumer is machine, not human?
  2. What is the right decision-maker when no single program controls the whole loop?
  3. What is the right verification regime over lossy learned components?
  4. What is the right unit of attribution in a network of networks?

Frontiers A + B — Representation and Decision-Maker

Frontier A — Representation

L15 chronology: counters → flow records → sketches → packet captures. Each compressed for a human. Now: machine-scale, reusable across tasks, respects self-similarity.

Paper | Answers
nPrint (2021) | Bit-level packet-header tensor
ET-BERT (2022) | BERT recipe works on encrypted traffic
netFound (2023, our group) | Protocol-structure-aware foundation model
NetBurst (2025, our group, Willinger co-author) | Event-centric, respects self-similarity. 31 years after 1994.
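
A bit-level representation in the spirit of the nPrint row can be sketched in a few lines: every packet maps to the same fixed-length vector, one entry per header bit, with -1 marking headers the packet does not carry. This is an illustration, not the nPrint tool. The slot sizes (option-free IPv4 and TCP) and the zero-padding of short headers are simplifying assumptions.

```python
# Header name and bytes reserved for it (assumed sizes, no options).
SLOTS = (("ipv4", 20), ("tcp", 20), ("udp", 8))


def to_bits(data, nbytes):
    # Truncate or zero-pad to the slot size, then emit bits most-significant first.
    data = data[:nbytes].ljust(nbytes, b"\x00")
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def encode(headers):
    """headers: dict of header name -> raw bytes. Absent headers become -1s."""
    vec = []
    for name, nbytes in SLOTS:
        raw = headers.get(name)
        vec.extend([-1] * (nbytes * 8) if raw is None else to_bits(raw, nbytes))
    return vec


# A UDP packet with source port 53: the TCP slot is filled with -1.
udp_packet = {
    "ipv4": bytes([0x45]) + bytes(19),        # version 4, IHL 5, rest zeroed
    "udp": (53).to_bytes(2, "big") + bytes(6),
}
vec = encode(udp_packet)
print(len(vec), vec[:8], vec[160:164], vec[320:336])
```

Because every bit sits at a fixed offset regardless of which protocols are present, a model can consume the vector without any hand-written feature extraction, and "field absent" stays distinguishable from "field is zero".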

Open: cross-operator transfer, cross-task transfer, what metric tells you a representation is good enough to act on?

Frontier B — Decision-Maker

Paper | Answers
NetConfEval (2024) | Benchmark: what can off-the-shelf LLMs do on net-config? Not enough yet, but now measurable.
AIOpsLab (2025) | Standardizes TTD/TTM as agent-vs-human axis
NetGent (2025, our group) | Agents automate stateful workflows that needed a human-driven browser/shell
Operator platforms: Nokia Sense/Think/Act, Cisco agentic AI, TM Forum A2A | Wire-level interface between coordinating agents
Developer platform: Cloudflare Project Think (Pai, April 2026) | Durable execution, sub-agents, sandboxed code: the substrate AIOps agents run on

Open: composition at scale, accountability when many agents act, the right architectural unit.

Frontiers C + D — Verification and Multi-Stakeholder Attribution

Frontier C — Safety Gates Over Learned Reps

L15 verification was sound (HSA, VeriFlow, Batfish, NetKAT). Once the decision function is a learned approximation, the primitives have to change.

Paper | Answers
LeJIT (Hè, Apostolaki; HotNets 2025) | Symbolic safety check in the loop with an LLM: solver vetoes unsafe actions
Capisce (Campbell, Hojjat, Foster; OOPSLA 2024) | Specification language for controller contracts
VeriX (Wu, Wu, Barrett; NeurIPS 2023) | Formal verification of NN outputs for narrow properties

Open: soundness-completeness at scale; spec correctness; coverage of the full action space.
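
The veto pattern, where a checker sits between a proposer and the network, can be sketched as follows. This is not LeJIT's design: plain Python predicates stand in for the solver, hard-coded proposals stand in for the LLM, and the `Reroute` action, link names, and capacities are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reroute:
    # Hypothetical action: move `gbps` of traffic from one link to another.
    src_link: str
    dst_link: str
    gbps: float


def apply(load, action):
    # Compute the post-action state on a copy; the live state is untouched.
    new = dict(load)
    new[action.src_link] -= action.gbps
    new[action.dst_link] += action.gbps
    return new


def violations(load, capacity):
    """Invariants checked here: no negative load, no link above capacity."""
    out = []
    for link, gbps in load.items():
        if gbps < 0:
            out.append(f"{link}: negative load")
        if gbps > capacity[link]:
            out.append(f"{link}: load {gbps} exceeds capacity {capacity[link]}")
    return out


def gate(load, capacity, proposed):
    """Return (action, []) if the proposal is safe, else (None, reasons)."""
    reasons = violations(apply(load, proposed), capacity)
    return (None, reasons) if reasons else (proposed, [])


capacity = {"a": 10, "b": 10}
load = {"a": 9, "b": 4}

for proposal in (Reroute("a", "b", 5), Reroute("a", "b", 7)):
    accepted, reasons = gate(load, capacity, proposal)
    print(proposal, "->", "accepted" if accepted else f"vetoed: {reasons}")
```

The gate is only as good as `violations`: an invariant missing from the specification is an unsafe action the gate will pass, which is the spec-correctness question listed as open above.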

Frontier D — Multi-Stakeholder Attribution

Frontiers A, B, and C all presuppose a single operator. That assumption breaks as soon as a path crosses operators.

Paper | Answers
MoCE: Mixture-of-Context-aware Experts (Vipul Harsh, Sayan Sinha, Henry Milner, B. Aditya Prakash, Vyas Sekar, Hui Zhang; NSDI 2026) | Composition, not unification: each expert covers a (stakeholder, vantage, telemetry-source) regime; route the query to the right expert

Open: federated training without raw-telemetry sharing; cross-stakeholder verification; coalition formation; surfacing attribution to humans.

The Four Answers Don’t Compose Yet

Five years ago, the field had: programmable observation (Acts 1–2), ML in narrow loops (Act 3), a credibility diagnosis (Act 4).

The four research frontiers (Act 5) each take a swing at one sub-question. All four are in motion. The four answers don’t compose with each other yet.

Intended pipeline: A’s foundation model → B’s agent → C’s safety gate → D’s expert composition.

In theory. In practice, the layers lack interfaces:

  • A safety gate (C) lacks a way to read a foundation-model embedding (A).
  • An agent population (B) lacks a protocol for trusting an expert from a different stakeholder (D).

The composition problem is the real open frontier — downstream of all four. If you want to do research here, the open questions live at the joins, not at “another foundation model” or “another agent benchmark.”

What Stays Human

One last point — and I think it gets buried.

The arc of machine consumers replacing humans in inner, then outer, then NOC-level loops is real. But the arc has a stopping point — and it’s not technical.

The outermost loop — intent, ethics, accountability, regulatory constraints — stays human. The stop comes from accountability, not capability. A campus IT department does not get to decide that a community should be deprioritized on the residential VLAN. That decision belongs to the people the community comprises, to the operators who answer to them, to the regulators who answer to all of them.

The operator’s role retreats outward, not away. The operator becomes the intent specifier and accountability gatekeeper. Agents handle the inner loops. Humans handle the loop that asks what the inner loops should optimize for, who they should serve, and who answers for them when they fail.

Week 10 is project work — extended office hours, bring your stuck moments. Thank you for the quarter. Good luck with your projects.