Systems for Agents, Agents for Systems

At CAIDA’s AIMS-19 workshop in February, I moderated a breakout session on agentic AI in Internet measurement. The room included researchers, operators, and infrastructure engineers — people who build and maintain the systems that keep the Internet observable. The conversation surfaced a pattern I had been noticing in scattered form across workshops, faculty discussions, and the kind of informal conversations with colleagues where the real thinking happens.

Everyone was using AI. LLMs for coding, data processing, operational triage, literature synthesis. But the hard questions people kept raising were systems questions — orchestration, reliability, sandboxing, state management, cost control. One national research network operator described building an LLM-powered Looking Glass that aggregates data across member networks for triage — and every challenge they reported was a distributed systems integration problem. A Japanese network operator reported similar interest, driven by reduced staffing, and was working on standardization for agent-to-agent communication protocols. The measurement community was discovering, in real time, that the problems worth solving were systems problems.

Then an NSF program manager posed a question that reframed everything: How should Internet measurement infrastructure and data change if the primary consumer is AI rather than humans? The question bypassed whether AI is useful and went straight to what infrastructure must exist for AI to consume and act on data reliably. A fundamentally different design question — and a systems question.

That reframing keeps surfacing. The question that matters is no longer whether AI helps; it is what it actually takes to make these systems work, and who is equipped to build that.

In a previous post, I argued that computing is a generative discipline — it produces the abstractions from which entirely new problem spaces emerge. This post picks up where that one left off. The generative pattern has moved again. And it has landed on systems.

The Generative Pattern Has Moved Again

The claim that computing is generative is empirical and historical. Packet switching created networking as a new intellectual domain — with its own theory, design principles, and research community. MapReduce reconstituted how a generation thought about scalable computation. Each new domain required a different combination of computing’s foundational abstractions. A computing abstraction opens a new problem space that did not exist before. That is the generative pattern.

Two years ago, the frontier was large language models. LLMs draw primarily on statistical learning, optimization, and parallel computation. The dominant narrative held that they would subsume everything — coding, writing, reasoning, entire professions. Today, the frontier has already moved. The current wave centers on agentic systems: architectures where AI models are embedded within larger computational frameworks that plan, use tools, maintain state, and interact with external environments.

At a high level, LLMs and agentic systems appear similar — both involve AI performing complex tasks. Yet they are fundamentally different in the computing abstractions they require. Agentic systems draw on a different and broader cross-section of computing: distributed systems, planning and control, program synthesis, security and sandboxing, formal verification of tool use, and the design of reliable multi-component software architectures.1 The shift happened in under two years, and it demanded entirely different foundational knowledge.

The generative pattern produced something more specific than another AI variant. It produced a domain whose foundations are systems foundations. The center of gravity has shifted from “how do we build better models” to “how do we build reliable systems around models.” That is computing generating a new problem space that happens to need the people who build distributed systems, operating systems, networked systems, and secure systems.

Two Distinct Questions

If agentic systems are a systems problem — generated by computing, requiring systems abstractions — then two questions arise. They are related but fundamentally different, and the distinction matters.

Question A: Systems for Agents

What must systems researchers build to make agentic systems reliable, scalable, and secure?

This is the infrastructure question. What does the new domain need from the discipline that generated it?

The AIMS breakout surfaced five sub-problems, each recognizable to anyone who has built distributed systems:

  1. Orchestration and composition — agents chain tools, APIs, and data sources into multi-step workflows. Internet2’s Looking Glass is fundamentally a distributed systems integration problem.
  2. Reliability under uncertainty — LLM backends are probabilistic, providers update models without notice, and the current ecosystem fragments across agent frameworks, model versions, and MCP servers. Version pinning, sandboxed testing, explicit scope constraints — classical fault tolerance, newly urgent.
  3. Security and containment — agents that touch production infrastructure inherit real-world consequences. Participants drew parallels to the early 2000s, when rapid web expansion led to widespread exploitation before guardrails existed.
  4. State management and observability — minor agent configuration changes cascade unpredictably, requiring the same distributed tracing infrastructure that systems researchers have been building for microservices.
  5. API design for non-human consumers — existing APIs were designed assuming human common sense. The NSF program manager’s question lives here: data infrastructure designed for AI consumers is a fundamentally different design problem.

These are familiar problems — newly urgent because the systems that need them are proliferating faster than the engineering to support them.
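To make the second sub-problem concrete, here is a minimal sketch of what "classical fault tolerance, newly urgent" might look like around a probabilistic LLM backend: an exact version pin, an explicit tool allowlist, and bounded retries. All names here (`call_model`, the model string, the canned replies) are illustrative assumptions, not the API of any particular agent framework; the backend is a deterministic stand-in for a model that occasionally proposes out-of-scope actions.

```python
# Hypothetical illustration (not a real framework API): version pinning,
# bounded retries, and an explicit scope constraint around an LLM backend
# whose outputs are probabilistic.

PINNED_MODEL = "example-model-2025-01-15"           # pin an exact version, never "latest"
ALLOWED_TOOLS = {"lookup_route", "run_traceroute"}  # explicit scope constraint

# Stand-in for a real backend: the first reply is out of scope, the second
# is fine, mimicking a model that sometimes proposes actions it should not.
_canned_replies = iter([
    {"tool": "reboot_router", "args": {"target": "198.51.100.7"}},
    {"tool": "run_traceroute", "args": {"target": "198.51.100.7"}},
])

def call_model(model: str, prompt: str) -> dict:
    return next(_canned_replies)

def invoke_with_guards(prompt: str, max_retries: int = 3) -> dict:
    """Retry on out-of-scope outputs; never hand anything outside
    ALLOWED_TOOLS to the executor."""
    last = None
    for _ in range(max_retries):
        last = call_model(PINNED_MODEL, prompt)
        if last["tool"] in ALLOWED_TOOLS:
            return last  # validated: safe to execute
    raise RuntimeError(f"no in-scope action after {max_retries} tries: {last}")

result = invoke_with_guards("Why is the path to 198.51.100.7 slow?")
```

The point of the sketch is architectural, not the specific checks: the reliability machinery lives outside the model, in ordinary systems code, which is exactly why this is a systems problem.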

Question B: Agents for Systems

What can agentic systems do for systems research itself?

This is the leverage question. How does the new domain enable the discipline that generated it? Agents could multiply scarce research capacity, transform teaching infrastructure, and lower the threshold between public investment and public service. The evidence for each comes later; those possibilities are the heart of this essay. For now, the claim is structural: the same infrastructure that answers Question A is what makes Question B possible.

One clarification on scope. I will lean heavily on examples from Internet measurement and networking — that is where I work, and the measurement community is a natural first mover because it has the data, the infrastructure, and the operational experience. But the two-question structure generalizes beyond networking. Every domain that agentic systems touch will face the same pair of questions. These arguments build on and extend earlier collaborative work on what it means for networking to remain a science in the age of AI — particularly the case that understanding, not just performance, must remain the goal.17


The resource constraints that make investment costly are what make the tools necessary.


The Squeeze

Here is the tension. The pace at which agentic systems are creating new research problems is accelerating — AI’s share of computer science publications nearly doubled between 2013 and 2023, and the total volume of AI publications nearly tripled over that span.2 Corporate AI investment exceeded $250 billion in 2024.3 The shift from LLMs to agentic systems happened in under twenty-four months. The problems are multiplying. The resources to pursue them are stagnant or shrinking.

A PI’s capacity to pursue research ideas is, at root, a function of how many people they can fund. That number has been compressing. Under the post-2022 UC-UAW agreement, the minimum GSR salary rose to $34,564 by October 2024, with additional campus- and step-level variation above that.4 Students should be paid fairly — the unionization corrected a long-standing inequity. But the structural economics are real: with tuition, fees, and benefits, the fully loaded cost of supporting a single PhD student now significantly exceeds what most individual grants can sustain for more than one or two students.

CISE funding rates sit around 22%.5 The FY2026 presidential budget request proposed a 65% reduction to NSF’s CISE directorate — from roughly $989 million to $346 million. Congress did not go along: the enacted appropriation holds NSF at roughly $8.75 billion, with no R&RA directorate cut by more than 5% from FY2024 levels.6 The threat did not materialize — but the fact that it was credible enough to appear in a presidential request illustrates a funding volatility that makes the squeeze structural, not cyclical. Meanwhile, the state picture compounds the pressure: the 2025–26 California budget proposed over $270 million in ongoing General Fund reductions for the UC system, and while the Legislature moderated some of those cuts, the budgetary pressure on graduate student support and TA allocations persists.7 At UCLA, the mathematics department has already cut TA appointments from 50% to 25% and eliminated paid graders entirely.8

The squeeze pressures both questions simultaneously. It makes building the infrastructure for agents (Question A) harder, because the engineering labor is expensive and grants are smaller. And it makes leveraging agents for research (Question B) more necessary, because the backlog of feasible but unfunded ideas keeps expanding while the pace of innovation demands faster execution. Every PI has a notebook of well-specified experiments, understood methods, and buildable tools that sit unexecuted — the hands and the funding simply fall short.

The arithmetic converts “agents could help” into “agents must help, because the structural economics leave no other path to executing the ideas that already exist at the rate the field now demands.”

What Becomes Possible

This is the center of gravity of the essay. Everything prior established why agentic systems are a systems problem and why the problem is urgent. What follows is what the urgency enables — the expansion of Question B.

The Research Multiplier

Here is the ambitious version of the claim: with the same number of students and the same grant budget, can a research group explore an order of magnitude more hypotheses? The intellectual work stays human — hypothesis formation, interpretation, critical evaluation. What compresses is the operational middle of the research pipeline: experiment setup, data collection, preprocessing, initial analysis. The labor-intensive work that currently consumes weeks of graduate student time per experiment.

The AIMS breakout reached a consensus that maps directly onto this: tasks currently requiring weeks of graduate student effort could compress to days, but only for well-understood, verifiable tasks. Novel debugging, hypothesis formation, and interpretation remain human. The compression is real, and so are the boundaries.

A recent MIT preprint offers a concrete reference point. Glia is a multi-agent framework for automated systems design — specialized agents for reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback.9 Applied to a distributed GPU cluster for LLM inference, Glia reports performance comparable to expert-designed algorithms for request routing, scheduling, and autoscaling, in significantly less time. The architectural insight matters more than the specific result: Glia operationalizes the research loop itself — hypothesize, experiment, analyze, refine — and compresses the time between having an idea and having the data to evaluate it.

Networking and Internet measurement research needs the same kind of execution substrate — a sandbox where a researcher specifies an intent (“Analyze the performance of my custom video conferencing application under challenging network conditions (i.e., slow paths with dynamic congestion pressure from the competing cross traffic at the bottleneck link)”), and the system translates that into an experiment specification, configures the infrastructure, collects the data, and delivers multi-representation results. The IP hourglass provides the right architectural analogy: diverse research intents above, diverse network conditions below, a unified execution layer in the middle. Several efforts in this direction are underway, including work in my own group that unifies network replication, traffic generation, and agent-based workflow automation into a single data-generation substrate.10 The motivation mirrors Glia’s: what the systems community needs is better infrastructure for the research loop itself.
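One way to make the intent-to-specification step concrete is to type it. The sketch below shows the shape such a translation could take; the `ExperimentSpec` structure, its field names, and the canned `compile_intent` mapping are all illustrative assumptions, not the interface of any existing substrate (including the work cited above, where a real system would use an LLM plus validation for this step).

```python
from dataclasses import dataclass, field

# Illustrative sketch of the "unified execution layer": a natural-language
# research intent is compiled into a declarative experiment spec that the
# substrate can configure, run, and collect data for. All field names and
# values are assumptions for illustration.

@dataclass
class NetworkCondition:
    bottleneck_mbps: float   # capacity of the slow path
    cross_traffic: str       # e.g. "dynamic-congestion"
    base_rtt_ms: float

@dataclass
class ExperimentSpec:
    intent: str              # the original natural-language intent
    application: str         # the system under test
    conditions: list[NetworkCondition] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)
    repetitions: int = 5     # repeat runs for statistical confidence

def compile_intent(intent: str) -> ExperimentSpec:
    """Toy stand-in for the agentic translation step (intent -> spec).
    A real substrate would generate and validate this; the canned mapping
    only shows the shape of the output."""
    return ExperimentSpec(
        intent=intent,
        application="custom-video-conferencing",
        conditions=[NetworkCondition(5.0, "dynamic-congestion", 80.0)],
        metrics=["frame_rate", "end_to_end_latency_ms", "stall_events"],
    )

spec = compile_intent("Analyze my app under challenging network conditions")
```

The design choice worth noting is the declarative middle layer itself: once intents compile to a spec rather than directly to actions, the spec can be reviewed, versioned, and replayed, which is what makes the hourglass's narrow waist trustworthy.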

The broader AI-for-science movement validates the scale of this opportunity. The Department of Energy’s Genesis Mission directed $320 million toward AI infrastructure for scientific research, including shared compute, domain-specific AI models, and autonomous laboratory projects.11 NSF’s AI portfolio now exceeds $700 million annually across 25 National AI Research Institutes.12 The bet across federal science agencies is the same: AI as a research multiplier, not a research replacement. The question is whether the systems research community builds the infrastructure to capture that multiplier — or cedes it to communities less equipped to handle the systems problems.

That backlog — the one every PI carries — is what force multiplication makes tractable. The people who do the thinking stay. The time and cost between having an idea and having the data to evaluate it compresses.

Key claim: With the same students and the same grant budget, agentic infrastructure could enable an order of magnitude more hypotheses explored — by compressing the operational middle of the research pipeline, while keeping the intellectual work human.

Super Teaching

The same resource squeeze that constrains research is compressing teaching infrastructure. And the same agentic tools that multiply research capacity can multiply teaching capacity — but only if we are ambitious about what that means.

Here is the framing: can we generate more meaningful empirical content for students — better lab environments, richer problem sets, more responsive feedback — without proportional TA support, and without compromising the quality of delivery?

The empirical context is stark. UC TA allocations are being cut as a direct consequence of budget pressures. At UCLA, mathematics TA appointments were halved and paid graders eliminated entirely — the pedagogical need stayed the same, but the funding shrank.8 The pattern extends across the UC system, where proposed budget reductions ripple through course offerings and academic support services.7 The trajectory is clear: the number of students is growing, the resources per student are shrinking, and the pace at which curriculum must evolve — especially in computing, where the LLM-to-agentic shift happened in under two years — is accelerating.

Agents can handle preparation labor: lab environments, problem sets, worked examples, grading scaffolds, automated feedback on student code. The infrastructure cost of principled curricular change drops. Junior faculty can take risks they currently cannot afford — designing new courses around emerging topics without requiring TAs who are already experts. Senior faculty can iterate on courses without diverting entire research groups.

I know the cost of getting this wrong from personal experience. Early in my career, I tried to build a completely new course on programmable networks. The ambition was right; the infrastructure was missing. The result was my worst teaching ratings.13 Ambition without infrastructure produces burnout and bad evaluations. Agents are a complement to pedagogical judgment — they provide the infrastructure that makes pedagogical ambition survivable.

And once agents compress the operational costs of both research and teaching, the educational question becomes harder and deeper: what should we train students to be exceptional at? That question connects directly back to the generative-discipline argument. Teach discrimination, not production. Teach the abstractions that carry across paradigm shifts, not the tools that will be obsolete in two years. If agents handle execution, the human premium is on judgment — and computing education should optimize for exactly that.

From Public Investment to Public Service

Public investment in science should produce artifacts that serve the public. That is the social contract underlying research funding. But the gap between a published paper and a public-facing artifact is fundamentally a labor gap.

A paper requires insight and limited engineering. A public artifact — a tool that policymakers, advocacy organizations, or other researchers can actually use — requires insight plus substantial engineering: packaging, documentation, testing, deployment, maintenance. That engineering gap keeps publicly funded research from producing publicly usable tools. Most papers die as papers.

Agents compress the engineering. If the artifact threshold drops, the social contract becomes more fulfillable.

Consider BQT+, a broadband-plan querying tool developed in my group. It started as NSF-funded measurement research.14 Today it is used by the California Public Utilities Commission, by Pew and Benton, by state broadband offices, and by digital equity organizations across the country. That translation from research to policy impact required enormous engineering effort — years of packaging, documentation, deployment, and maintenance that no grant budgeted for. BQT+ is the exception that proves the rule: the intellectual contribution was one paper, but the public service required years of engineering that the funding model does not support.

If the artifact threshold drops, the gain goes beyond individual productivity: it is a structural change in what academia can deliver to the public. The same research that produces a paper could also produce a deployable tool, a public dataset, a policy-grade measurement platform. This is the question academia has always been challenged on: where is the public return on research investment? Lowering the artifact threshold is a concrete, structural answer, one that could reshape how public investment in research translates into artifacts of public value.

Agentic systems may lower the threshold for turning publicly funded research into tools and services that others can actually use — shifting academia from a model centered on papers alone toward one in which deployable artifacts become a more routine part of the output.


Honest Reckoning

The preceding sections argued that agentic systems could multiply research capacity, transform teaching, and lower the artifact threshold. All of that depends on confronting three uncomfortable truths — and each requires both technical mitigations and community-level norms. Engineering alone is insufficient for institutional problems.

Quality at Scale

The most immediate community concern is a flood of AI-scaled papers that recycle existing methodologies without generating new insight. Even if such papers are detectable by reviewers, the reviewing burden alone is a significant cost. The AIMS breakout was explicit: agents are not yet capable of inventing novel measurement tools or techniques, and over-reliance could entrench existing methods at the expense of methodological innovation.

If running an experiment becomes cheap, the bar for what constitutes a contribution must rise. The community needs to articulate that bar explicitly — not discover it through an avalanche of mediocre submissions. This requires engineering solutions (better benchmarks, automated screening, reviewer assistance) and community norms around what counts as a contribution when execution costs approach zero.

Training Displacement

The deepest institutional concern. If agents handle preprocessing, data collection, and boilerplate analysis, where do junior researchers build the tacit knowledge that comes from wrestling with data? The AIMS discussion surfaced an uncomfortable observation: in some cases, the time spent validating AI outputs already exceeds the time saved. But that validation itself may be valuable training.

Which tasks must remain human for pedagogical reasons? “All of them” preserves busywork in the name of education. “None of them” produces researchers who can direct agents but lack the ability to evaluate what the agents produce. The boundary must be drawn deliberately, task by task. This connects to the generative-discipline argument: teach discrimination, not production. Teach students to evaluate, critique, and judge — because those are the competencies that carry across paradigm shifts. The cognitive engineering literature has been wrestling with exactly this tension for decades — the gap between what autonomous systems demonstrate under idealized conditions and how they perform in the ambiguity of real-world deployment, and what that gap means for the humans who must work alongside them.18

Cost and the Dependency Trap

The most operational concern. LLM inference costs have dropped substantially, but current pricing is VC-subsidized — and the $252 billion in corporate AI investment will eventually demand returns.3 The AIMS breakout offered a concrete cautionary case: one participant reported spending over $10,000 in a single month from a poorly designed automated script that scaled mistakes faster than it scaled insight.

Historical precedent is instructive. Google’s transition from unlimited to paid Drive storage left scientific communities with enormous migration costs. Cloud computing’s academic adoption followed a similar arc — free tiers disappeared, costs became structural, dependencies could not unwind. Shared infrastructure — NRP, CBORG, CloudBank 2.0, TritonGPT, Internet2 NET+ — helps redistribute the risk, though the risk persists.15 The community needs cost discipline, vendor diversification, and infrastructure that preserves portability — built from the start, before the dependency hardens.
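A minimal technical version of that cost discipline, the kind that would have stopped the $10,000 month described above, is a hard budget enforced inside the pipeline itself. The sketch below is an assumption-laden illustration: the budget, the per-call cost, and the `CostGuard` class are invented for this example, and a real deployment would meter against provider billing data rather than local estimates.

```python
# Hypothetical sketch of cost discipline for automated agent runs: a hard
# budget enforced inside the pipeline, so a buggy script halts itself
# instead of scaling a mistake into a five-figure bill. Costs are tracked
# in integer cents to avoid floating-point drift in the accounting.

class BudgetExceeded(RuntimeError):
    pass

class CostGuard:
    def __init__(self, budget_cents: int):
        self.budget = budget_cents
        self.spent = 0

    def charge(self, estimated_cost_cents: int) -> None:
        """Record an estimated per-call cost; refuse to exceed the budget."""
        if self.spent + estimated_cost_cents > self.budget:
            raise BudgetExceeded(
                f"charge would exceed the {self.budget}-cent budget")
        self.spent += estimated_cost_cents

guard = CostGuard(budget_cents=100 * 100)   # a $100 monthly cap
completed = 0
try:
    for _ in range(10_000):                 # a runaway loop, as in the cautionary case
        guard.charge(5)                     # assume ~5 cents per model call
        completed += 1
except BudgetExceeded:
    pass                                    # the guard halted the run at the cap
```

The guard stops the loop after 2,000 calls, exactly at the $100 cap; the runaway script pays for its bug in dollars, not thousands of dollars.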

Who Builds This

Everything in this essay reduces to a practical question: who actually builds the execution substrates, the sandboxes, the orchestration layers, the observability infrastructure that agentic systems require?

Model researchers build models. The infrastructure that makes those models reliable, composable, secure, and observable in real-world settings — that is systems work. It has always been systems work. The AIMS breakout made this vivid: every hard problem that surfaced was a problem that distributed systems researchers, network architects, and security engineers already knew how to think about. The problem class was familiar. The urgency was new.

And the same infrastructure that makes agents reliable for one domain makes them reliable for others. The Department of Energy is spending $320 million on AI infrastructure for scientific research.11 AlphaFold predicted the structures of over 200 million proteins, but its deployment depended on compute infrastructure, data pipelines, and validation frameworks.16 MIT’s Glia targets systems design, but its architectural pattern — specialized agents collaborating through empirical evaluation — maps onto any domain where the research loop can be instrumented. These are all instances of the same pair of questions this essay has been asking. They are all systems problems.

The squeeze makes the stakes concrete. Stagnant funding, rising costs, accelerating demands — the arithmetic only worsens with waiting. If the systems research community builds the infrastructure now, it captures the multiplier: more hypotheses per grant dollar, more public artifacts per paper, more pedagogical ambition per course. If it does not, the infrastructure will be built anyway — by communities less equipped to handle the reliability, security, and observability problems that determine whether these systems actually work.

That is already happening. The tools are being built. The question is whether the people who understand systems — really understand them, at the level of distributed state and fault tolerance and adversarial containment — are the ones doing the building.


TLDR

Computing generated AI. AI generated agentic systems. Agentic systems create two questions: What systems infrastructure must we build to make them reliable? And what can those systems do for research itself? The squeeze in academic research makes both questions urgent. If we build the infrastructure, agentic systems could multiply research capacity, transform teaching, and lower the barrier between public funding and public-facing artifacts. The discipline that keeps generating the next thing is the right one to build the tools for this thing.


Arpit Gupta is an Associate Professor of Computer Science at the University of California, Santa Barbara, and a Faculty Scientist at Berkeley Lab.

Notes

  1. This cross-section is elaborated in my earlier post, “Computing Is a Generative Discipline”.
  2. Stanford HAI AI Index Report, 2025. AI’s share of CS publications rose from 21.6% to 41.8% between 2013 and 2023; total AI publications nearly tripled over the same span (from ~102,000 to over 242,000).
  3. Stanford HAI AI Index Report, 2025. Corporate AI investment reached an estimated $252 billion globally in 2024.
  4. UC-UAW Postdoctoral and Academic Researcher contracts, 2022–2025. The minimum GSR salary reached $34,564.50 by October 1, 2024, per the ratified agreement. Campus- and step-level variation adds above that minimum.
  5. NSF CISE directorate funding data. The 22% funding rate is from FY2024 data published by the NSF Budget Division.
  6. The FY2026 presidential budget request proposed reducing CISE from approximately $989 million to $346 million — a 65% cut. Congress rejected the proposal: the enacted appropriation (signed January 23, 2026) provides NSF roughly $8.75 billion, a 3.4% reduction from FY2024, and directs that no R&RA directorate receive more than a 5% cut from FY2024 levels. See AAS summary; AIP FY2026 NSF tracker.
  7. California Legislative Analyst’s Office, “The 2025-26 Budget: University of California,” 2025. The Governor’s 2025–26 budget proposed a $272 million ongoing General Fund reduction for UC. The Legislature subsequently moderated some cuts while deferring a $129.7 million payment and delaying a $240.8 million general funding increase. Budgetary pressure on graduate support and TA allocations persists.
  8. “UCLA math department TA, grader cuts spark concern over student learning, support,” Daily Bruin, October 2025. TA appointments reduced from 50% to 25%; paid graders eliminated.
  9. P. Hamadanian et al., “Glia: A Human-Inspired AI for Automated Systems Design and Optimization,” arXiv:2510.27176, 2025. Project website: glia.mit.edu.
  10. This work — an agentic execution substrate for network data generation — is part of ongoing thesis research in my group. It draws on the IP hourglass architectural analogy: diverse research intents above, diverse network conditions below, a unified execution layer in the middle. Details forthcoming.
  11. DOE Genesis Mission, announced November 2025: $320 million in initial awards across four workstreams including shared AI compute infrastructure, a Transformational AI Models Consortium, foundational AI awards for domain-specific models, and autonomous laboratory projects.
  12. NSF AI portfolio exceeds $700 million annually, including 25 National AI Research Institutes with a $100 million expansion in July 2025.
  13. Rate My Professor listing: ratemyprofessors.com/professor/2615402. The full story of that course redesign is in “Computing Is a Generative Discipline.”
  14. BQT+ was initially supported by NSF Award #2220417 (Internet Measurement Research: Developing Querying Tools for Broadband Access Network Data). Its adoption by policy organizations is documented in H. Manda et al., “The Efficacy of the Connect America Fund in Addressing US Internet Access Inequities,” ACM SIGCOMM 2024.
  15. Shared infrastructure examples: National Research Platform (NRP) — distributed research cyberinfrastructure. CBORG — Lawrence Berkeley National Lab AI Portal. CloudBank 2.0 — NSF-funded cloud credit and commercial cloud access program, expanded with a $20 million NSF award (Award #2505560). (NSF announcement). TritonGPT — UC San Diego on-premises LLM service (tritongpt.ucsd.edu). Internet2 NET+ — collective purchasing program for 500+ member research institutions.
  16. AlphaFold Protein Structure Database, containing over 200 million predicted protein structures. See also A. Varadi et al., “AlphaFold Protein Structure Database in 2024,” Nucleic Acids Research, 2024.
  17. See, e.g., the NetAI Manifesto — collaborative work with Walter Willinger and others arguing that networking in the age of AI must remain a science grounded in understanding, not just optimization.
  18. Two foundational references: J. M. Bradshaw et al., “Seven Deadly Myths of ‘Autonomous Systems,’” IEEE Intelligent Systems, 28(3), 2013 — argues that the goal should be effective human-machine interdependence, not autonomy per se; and D. D. Woods, “The Risks of Autonomy: Doyle’s Catch,” Journal of Cognitive Engineering and Decision Making, 10(4), 2016 — on the gap between what autonomous systems demonstrate under controlled conditions and how they perform in real-world deployment. Earlier collaborative work with Walter Willinger connected these ideas explicitly to networking and agentic AI.