```mermaid
flowchart TD
A["<b>2020 Prompt</b><br/>State=prompt<br/>Time=point<br/>Coord=none<br/>Interface=text"]
B["<b>2022 CoT</b><br/>+ scratchpad<br/>(State externalized)"]
C["<b>2022–23 ReAct</b><br/>+ observe-act loop<br/>(Time closes)"]
D["<b>2023 Reflexion/MemGPT</b><br/>+ cross-episode memory<br/>(State persists)"]
E["<b>2023–24 Multi-Agent</b><br/>+ specialists<br/>(Coord distributes)"]
F["<b>2024 MCP</b><br/>+ thin waist<br/>(Interface ossifies)"]
G["<b>2026+ Persistent</b><br/>+ digest + archive<br/>(all four coupled)"]
A -->|"gap: reasoning invisible"| B
B -->|"gap: cannot act"| C
C -->|"gap: no persistence"| D
D -->|"gap: one perspective"| E
E -->|"gap: ad-hoc protocols"| F
F -->|"gap: memory+experiment still open"| G
style A fill:#4477AA,color:#fff,stroke:#4477AA
style B fill:#66CCEE,color:#000,stroke:#66CCEE
style C fill:#228833,color:#fff,stroke:#228833
style D fill:#CCBB44,color:#000,stroke:#CCBB44
style E fill:#EE6677,color:#fff,stroke:#EE6677
style F fill:#AA3377,color:#fff,stroke:#AA3377
style G fill:#BBBBBB,color:#000,stroke:#BBBBBB
```
14 Agentic Systems — Applying the Framework to Build the Framework’s Builders
14.1 The Anchor: Finite Context, Token Budgets, and Hallucination Risk
An agentic system is a networked system. A large language model sits at the center; around it, a web of tools, memories, other agents, and eventually a human operator exchanges messages. Every message costs tokens. Every tool call costs latency. Every memory entry costs a slot in a bounded context window — and every entry, once written, risks reinforcing a mistake. The agent runs under finite cost and finite time. It runs under the same four questions that every system in this book has had to answer: what to track, when to act, who decides, what crosses the interface.
The binding constraint is the conjunction of three realities that the agent inherits from the model it is built on: a finite context window, a finite token budget, and a non-zero hallucination rate. The window is bounded because attention scales quadratically. The budget is bounded because inference costs money and seconds. The hallucination rate is bounded away from zero because the model is statistical. The agent inherits these three bounds from the lower layer (the model) in exactly the way TCP inherits IP’s unreliable datagrams and WiFi inherits RF physics. Its only freedom is to design around them.
From that binding constraint, four decision problems fall out — the agentic analogues of TCP’s “when to send, what to send, how much to send”:
- What to add to context? (Every token occupies a slot.)
- When to call a tool? (Every call trades latency for information.)
- What to persist? (Every entry persists imperfectly and risks misleading the next run.)
- When to decompose into subagents? (A single context overflows beyond a threshold — but coordination introduces new gaps.)
These four decisions are answered, differently, by every design from prompt completion (2020) to persistent autonomous researchers (2026+). Looking back across those six years, the agentic community reinvented the four invariants — State, Time, Coordination, Interface — not as a pedagogical exercise but through debugging failures. What follows is that journey, Act by Act.
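The four decision problems share a budget arithmetic. Here is a minimal sketch of the first one — what to add to context — as a greedy value-per-token selection. The candidate names, token counts, and value scores are invented for illustration; real agents estimate value with heuristics or auxiliary models:

```python
# Hypothetical sketch: "what to add to context?" as a greedy
# value-per-token knapsack over a finite token budget.

def assemble_context(candidates, token_budget):
    """Pick entries that maximize estimated value under a token budget.

    candidates: list of (name, tokens, value) tuples.
    Returns (chosen names in greedy value-density order, tokens used).
    """
    ranked = sorted(candidates, key=lambda c: c[2] / c[1], reverse=True)
    chosen, used = [], 0
    for name, tokens, value in ranked:
        if used + tokens <= token_budget:   # every token occupies a slot
            chosen.append(name)
            used += tokens
    return chosen, used

candidates = [
    ("system_prompt",    200, 10.0),  # small and high value
    ("last_tool_output", 400,  6.0),
    ("old_reflection",   300,  1.5),  # risks reinforcing a mistake
    ("full_file_dump",  3000,  4.0),  # useful but enormous
]
chosen, used = assemble_context(candidates, token_budget=1000)
```

The point of the sketch is the trade, not the policy: the file dump loses to three smaller entries because value is paid for in slots.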
“An autonomous agent is a system that perceives its environment, reasons about it, and acts to achieve goals.” — retrospective, from the common framing that crystallized around 2023.
The pioneers who prompted GPT-3 in 2020 had no such framing. They were running a language model. The framing emerged from failure, the same way McQuillan’s tripartite routing procedure emerged from ARPANET instability in 1980.
14.2 Act 1: “It’s 2020. GPT-3 Completes Prompts.”
In June 2020, OpenAI released GPT-3 (Brown et al. 2020). At 175 billion parameters (Brown et al. 2020), it could continue a text prompt with startling fluency. The interface was simple: submit a prompt, receive a completion. Temperature, top-p, and stop sequences were the only control knobs. The whole interaction fit in a single HTTP request.
“Humans do not require large supervised datasets to learn most language tasks… scaling up language models greatly improves task-agnostic, few-shot performance.” — Brown et al., 2020 (Brown et al. 2020)
What the pioneers saw: a pure function. Prompt in, completion out. The cleverness of the interaction lived entirely in the prompt. The community discovered prompt engineering — phrasing, few-shot exemplars, role instructions — as the craft of extracting behavior from a frozen model.
What remained invisible from the pioneers’ vantage point: that a single forward pass is a terrible substrate for anything requiring derivation, planning, or action on the world. Arithmetic was brittle. Multi-hop questions flopped. The model had the knowledge but no mechanism to externalize a derivation.
14.2.1 The Designer’s Decisions (2020)
At this layer, the designer is the person writing the prompt. Each request she constructs forces her to answer, implicitly, the four decision problems:
- What to add to context? The entire prompt. There is no other State.
- When to call a tool? Tools are absent; the question is moot.
- What to persist? Zero state carries over. Each call is independent.
- When to decompose? Decomposition is absent. There is one model, one call.
14.2.2 Invariant Analysis: Prompt Completion (2020)
| Invariant | Prompt Completion Answer (2020) | Gap? |
|---|---|---|
| State | Prompt text only; no scratchpad | Reasoning vanishes inside one forward pass |
| Time | Single forward pass | No loop, no retry, no observation |
| Coordination | None — one model, one caller | Delegation and critique are absent |
| Interface | Text in / text out | World-action is absent |
Two invariants have essentially degenerate answers: Time is a point, not an interval, and Coordination is absent. State is the prompt, which makes inspection trivial but externalization impossible — the model thinks inside weights opaque to the designer. Interface is pure text. The gaps are enormous: the model lacks the ability to show its work, retry, call external services, or remember the last conversation. For simple continuations these gaps are invisible; for anything multi-step they are catastrophic.
14.2.3 Environment → Measurement → Belief
| Layer | What Prompt Completion Has | What’s Missing |
|---|---|---|
| Environment | The true task context; the world beyond the prompt | — |
| Measurement | Tokens in the prompt window | No tool outputs, no observations, no feedback |
| Belief | Model weights + prompt context | No externalized derivation; errors invisible |
This is a physically limited measurement regime: the only signal the model receives is the prompt. The loop is open — there is no closed feedback because there is no next step. Narrowing the E–M–B gap requires a measurement channel wider than the single forward pass; a loop must be designed in, which is exactly what comes next.
14.2.4 “The Gaps Didn’t Matter… Yet.”
For translation, summarization, and short-form generation, one forward pass is enough. Few-shot prompting already outperformed fine-tuned baselines on many benchmarks (Brown et al. 2020), which is why the community accepted it as the native interface for a frozen model. The gaps mattered only when the tasks got longer than a single reasoning step. By 2021, arithmetic benchmarks started showing plateaus — the model knew the pieces but could not assemble them.
The environment changed when the community stopped asking “can the model generate plausible text?” and started asking “can the model solve a multi-step problem?” That shift broke the single-pass answer to the State invariant.
14.3 Act 2: “It’s 2022. Let’s Think Step by Step.”
In January 2022, Jason Wei and colleagues at Google showed that prompting a model with a few examples of step-by-step reasoning transformed its multi-step accuracy (Wei et al. 2022). On GSM8K (grade-school math), PaLM’s accuracy jumped from 18% to 57% simply by including chains of reasoning in the prompt (Wei et al. 2022). Kojima et al. showed the even more surprising result that the phrase “Let’s think step by step” alone – zero-shot CoT – worked almost as well on some tasks (Kojima et al. 2022).
“We show that generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning.” — Wei et al., 2022 (Wei et al. 2022)
What the pioneers saw: that reasoning was latent in the model and needed only a scratchpad to surface. The prompt could carry its own derivation. By laying out intermediate steps as tokens, the model attended to them on the way to the final answer.
What remained invisible: that the scratchpad still lives inside a single forward pass. The model wrote its derivation, but it could not act on the world between steps, could not call a calculator to verify arithmetic, and had no access to facts beyond its training data. The scratchpad was a richer State — but still entirely internal. Tree of Thoughts (Yao, Yu, et al. 2023) later showed that the scratchpad could branch, but the world beyond the prompt remained out of reach.
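The zero-shot variant is almost trivially small, which is part of why it was surprising. A sketch, with a stub standing in for the completion API:

```python
# Minimal sketch of zero-shot chain-of-thought prompting
# (the Kojima et al. trigger phrase). `call_model` is a stand-in
# for any completion endpoint; only the prompt shape is the point.

def cot_prompt(question: str) -> str:
    # The scratchpad trigger: the model writes its derivation as
    # tokens before the answer, making intermediate State inspectable.
    return f"Q: {question}\nA: Let's think step by step."

def call_model(prompt: str) -> str:
    # Stub standing in for a real completion API.
    return "(model completion would appear here)"

prompt = cot_prompt("A bat and a ball cost $1.10 in total...")
completion = call_model(prompt)
```

Everything the technique changes is visible in `cot_prompt`: the State invariant's answer moves from "prompt only" to "prompt plus derivation," with no change to the model or the Time invariant.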
14.3.1 The Designer’s Decisions (2022)
- What to add to context? The prompt + the model’s own intermediate reasoning. The scratchpad becomes part of State.
- When to call a tool? Tools remain absent.
- What to persist? Zero state carries over.
- When to decompose? Decomposition remains absent — but branching within a single call (ToT) is the first hint that decomposition matters.
Applied decision placement: all decisions still live in one model, one call. The scratchpad externalizes reasoning within the call, but control authority is monolithic.
14.3.2 Invariant Analysis: Chain of Thought (2022)
| Invariant | CoT Answer (2022) | Gap? |
|---|---|---|
| State | Scratchpad: reasoning materialized as tokens | Scratchpad is ephemeral; vanishes at turn end |
| Time | One longer forward pass | No observe-act loop yet |
| Coordination | None — one model | Delegation and critique absent |
| Interface | Text + reasoning format | World-action absent |
CoT answers State substantively for the first time: the model’s intermediate belief is inspectable. But Time remains a point — the scratchpad is written in one pass and discarded. Coordination and Interface are unchanged from Act 1. The consequence: CoT improves reasoning about a closed problem but lacks the ability to debug a failing program, update a stale fact, or check a calculation against a calculator.
14.3.3 Environment → Measurement → Belief
| Layer | What CoT Has | What’s Missing |
|---|---|---|
| Environment | The task the prompt describes | — |
| Measurement | Prompt tokens + scratchpad tokens | No external observations |
| Belief | Chain of reasoning steps in context | No verification, no grounding |
Still a physically limited measurement regime: the only signal is text. But the scratchpad is a first attempt to narrow the E–M–B gap through self-measurement — the model inspects its own intermediate steps. Self-consistency (Wang et al. 2023) samples many chains and votes, effectively running several open-loop measurements in parallel. The fundamental limit remains: no external signal enters the loop.
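Self-consistency’s voting step is a one-liner over sampled final answers. The canned answer list below stands in for temperature-sampled chains:

```python
# Sketch of self-consistency voting (Wang et al.): sample several
# reasoning chains, keep only each chain's final answer, majority-vote.

from collections import Counter

def self_consistency(sampled_answers):
    """Return the most common final answer across sampled chains."""
    answer, _count = Counter(sampled_answers).most_common(1)[0]
    return answer

# Five hypothetical chains; three derivations converge on "18".
best = self_consistency(["18", "18", "21", "18", "24"])
```

The vote is over answers, not chains — two different derivations that reach the same number reinforce each other, which is exactly the parallel open-loop measurement the text describes.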
14.3.4 “The Gaps Didn’t Matter… Yet.”
For math word problems and logic puzzles, the scratchpad was enough. The gap between belief and environment stayed small because the task was fully described in the prompt. Later reasoning models — OpenAI’s o1 (OpenAI 2024), DeepSeek-R1 (DeepSeek AI 2025), and their successors — would push the scratchpad idea to its limit by training the model to produce long internal reasoning traces before emitting a visible answer. This introduced a new paradigm: inference-time scaling (OpenAI 2024), where performance improves not from larger models but from more compute spent at test time. The cost is concrete: reasoning models are 10 to 74 times more expensive per token than standard models (OpenAI 2024; DeepSeek AI 2025). And a subtle failure mode emerges — overthinking: on low-complexity tasks, reasoning models underperform standard models, spending tokens on derivation chains that add confusion rather than clarity. Beyond certain complexity thresholds, accuracy collapses entirely despite longer traces. The Time invariant is the diagnostic: the agent is spending too much time (tokens) on reasoning relative to the task’s demands.
But as soon as tasks required information outside the prompt — current weather, a database record, the contents of a file — no amount of internal reasoning could reach across the boundary. The environment broke the Interface invariant’s degenerate answer.
14.4 Act 3: “It’s 2022–2023. Reason, Then Act.”
In October 2022, Shunyu Yao and colleagues at Princeton and Google released ReAct (Yao, Zhao, et al. 2023). ReAct interleaved three primitives: Thought, Action, Observation. The model emitted a thought, then an action (e.g., search[Colorado orogeny]), then read an observation back from an external tool. The loop continued until the model emitted a final answer.
“ReAct prompts LLMs to generate both reasoning traces and task- specific actions in an interleaved manner, allowing for greater synergy between the two.” — Yao et al., 2022 (Yao, Zhao, et al. 2023)
Schick et al.’s Toolformer (Schick et al. 2023) showed the complementary idea at the pre-training level: a model could learn when to call which API and insert tool calls into its own generation. By mid-2023, function calling was a first-class product feature in commercial APIs. The scaffolding became standard.
What the pioneers saw: that reasoning and action had to interleave. A thought without observation is speculation; an action without thought is a reflex. The minimal unit of agency was the Thought→Action→Observation tuple.
What remained invisible: that the loop still lived inside a single task episode. When the task ended, the trace vanished. The next episode started from zero — every lesson learned in one run was unavailable to the next.
Applied disaggregation by separating the reasoning engine (the LLM) from the action substrate (the tools). This is the first serious disaggregation in the agentic stack, and it creates an interface — the tool-call format — that will later ossify and then crystallize into MCP.
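The Thought→Action→Observation loop can be sketched in a few lines. The `Action:`/`Final:` line protocol, the scripted model, and the toy search tool are illustrative stand-ins for a real output parser, LLM, and tool registry:

```python
# Minimal ReAct-style loop sketch. Real harnesses parse model output
# for action lines and append "Observation: ..." back into the trace.

def react_loop(model, tools, task, max_steps=5):
    trace = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model("\n".join(trace))          # Thought + Action
        trace.append(step)
        if step.startswith("Final:"):           # model emits final answer
            return step.removeprefix("Final: "), trace
        if step.startswith("Action: "):
            name, _, arg = step.removeprefix("Action: ").partition("[")
            obs = tools[name](arg.rstrip("]"))  # act on the world
            trace.append(f"Observation: {obs}") # observation closes the loop
    return None, trace

# Scripted "model" and a toy search tool for demonstration.
script = iter(["Action: search[Colorado orogeny]",
               "Final: an episode of mountain building"])
model = lambda _prompt: next(script)
tools = {"search": lambda q: f"results for {q!r}"}
answer, trace = react_loop(model, tools, "What is the Colorado orogeny?")
```

Note where State lives: everything — thoughts, actions, observations — is appended to `trace`, which is why the Act 3 invariant table records "grows unboundedly; no compression."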
14.4.1 The Designer’s Decisions (2022–2023)
- What to add to context? Thoughts + tool results, appended as the loop runs.
- When to call a tool? Whenever the model emits an action token.
- What to persist? Within the episode, everything. Across episodes, zero state.
- When to decompose? Still a single agent.
14.4.2 Invariant Analysis: ReAct + Tools (2022–2023)
| Invariant | ReAct Answer (2022–2023) | Gap? |
|---|---|---|
| State | Thoughts + actions + observations, appended | Grows unboundedly; no compression |
| Time | Observe→act loop, per-step | No checkpoint, no long-horizon planning |
| Coordination | Single agent, sequential tools | Parallel specialists absent |
| Interface | Tool-call syntax (per-framework) | Ad-hoc — no standard |
ReAct answers Time for the first time: there is now a loop, with a measurable period (one tool round-trip). It answers Interface partially: tools exist. But the State grows linearly with trace length, so long tasks overflow the context window. And Coordination remains single-agent — when the trace gets long, delegation remains unavailable.
14.4.3 Environment → Measurement → Belief
| Layer | What ReAct Has | What’s Missing |
|---|---|---|
| Environment | External world accessed through tools | — |
| Measurement | Tool outputs appended as Observations | Tool outputs are narrow; untrusted |
| Belief | Running trace of thoughts + observations | Trace grows without compression |
This is accidentally noisy measurement — tool outputs are honest but imperfect (stale search results, partial API responses). Better estimators can narrow the gap (better retrieval, grounding citations). The closed loop is finally alive: measure (observe) → estimate (think) → control (act). But the loop has no memory beyond the episode.
14.4.4 “The Gaps Didn’t Matter… Yet.”
For HotPotQA-style multi-hop question answering (Yao, Zhao, et al. 2023), 3–5 ReAct steps sufficed. The trace fit in context. The task ended before the agent got tired. The gap that would matter next — cross-episode learning — appeared as soon as people tried to run agents on software engineering tasks, where the same bug could recur across runs and the same lesson had to be re-learned every time. That broke the no-persistence answer to State.
14.5 Act 4: “It’s 2023. Memory and Self-Reflection.”
In March 2023, Noah Shinn and colleagues introduced Reflexion (Shinn et al. 2023). After each attempt at a task, the agent wrote a verbal self-reflection — a natural-language critique of what went wrong — and stored it. On the next attempt, the reflection entered the prompt. On HumanEval, Reflexion lifted pass@1 from 80% to 91% for GPT-4 (Shinn et al. 2023).
“Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials.” — Shinn et al., 2023 (Shinn et al. 2023)
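The trial loop can be sketched in a few lines. The actor, evaluator, and reflector below are stubs standing in for the LLM, the task’s test harness, and the self-critique prompt:

```python
# Sketch of the Reflexion trial loop: attempt, evaluate, write a verbal
# self-critique, feed the critique into the next attempt's prompt.

def reflexion(actor, evaluator, reflector, task, max_trials=3):
    reflections = []                            # episodic memory buffer
    for trial in range(max_trials):
        attempt = actor(task, reflections)      # reflections enter the prompt
        if evaluator(attempt):                  # e.g. unit tests pass
            return attempt, trial + 1
        reflections.append(reflector(attempt))  # critique persists to next trial
    return None, max_trials

# Demonstration with scripted attempts: fail once, then succeed.
attempts = iter(["buggy", "fixed"])
actor = lambda task, refl: next(attempts)
evaluator = lambda a: a == "fixed"
reflector = lambda a: f"attempt {a!r} failed: handle the empty-list case"
result, trials = reflexion(actor, evaluator, reflector, "write sort()")
```

The loop closes over trials, not steps — the outer Time loop the invariant table below records. The hygiene risk is also visible: a bad `reflector` output enters every subsequent prompt.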
In October 2023, Charles Packer and colleagues at Berkeley proposed MemGPT (Packer et al. 2023) (later productized as Letta), borrowing virtual-memory paging from operating systems. MemGPT defines three memory tiers with distinct access patterns (Packer et al. 2023): working memory (always visible in the context window — system prompt, recent messages), archival memory (semantically searchable via embedding retrieval — long-term knowledge), and recall memory (chronologically indexed — conversation history). The model issues tool calls to page data between tiers. When working memory approaches 70% of the context window, a memory pressure alert fires and the agent must evict (Packer et al. 2023) — compressing older entries via recursive summarization before they overflow. Paginated retrieval prevents context overflow when bringing archival data back.
“We propose MemGPT (MemoryGPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM’s limited context window.” — Packer et al., 2023 (Packer et al. 2023)
What the pioneers saw: that the agent was running out of memory. A long task, a long dialog, or a long-lived assistant needed a store that outlasted one context window. Reflexion added episodic memory (what went wrong). MemGPT added hierarchical memory (working, archival, recall — three tiers with distinct access semantics, mirroring the L1/L2/L3 cache hierarchy in hardware).
What remained invisible: that these were two ad-hoc answers to the same question. The field had no shared memory architecture. Every framework invented its own eviction policy, its own reflection format, its own vector store. Memory became a configuration problem, not an architecture.
Applied closed-loop reasoning at a new timescale: multi-run. The Reflexion loop closes over trials, not steps. The MemGPT loop closes over paging decisions. Both extend the agent’s effective context beyond the window.
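The paging mechanics can be sketched as follows, assuming the 70% pressure threshold described above. The summarizer stub stands in for LLM-driven recursive summarization, and the keep-recent policy is an invented simplification:

```python
# Sketch of MemGPT-style memory pressure and eviction. The real system
# uses the LLM itself to summarize and a vector store for archival search.

def check_pressure(working_tokens, window, threshold=0.70):
    """Fire a memory-pressure alert when working memory nears the window."""
    return working_tokens / window >= threshold

def evict(working, archival, summarize, keep_recent=2):
    """Compress older entries into archival memory; keep recent ones hot."""
    old, recent = working[:-keep_recent], working[-keep_recent:]
    if old:
        archival.append(summarize(old))  # paged out, searchable later
    return recent, archival

working = ["msg1", "msg2", "msg3", "msg4"]
archival = []
if check_pressure(working_tokens=7200, window=8192):
    working, archival = evict(
        working, archival,
        summarize=lambda msgs: "summary of " + ", ".join(msgs))
```

The structurally filtered measurement risk is right there in `summarize`: whatever the compression drops is gone from working memory, recoverable only through archival retrieval.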
14.5.1 The Designer’s Decisions (2023)
- What to add to context? Active conversation + recalled memories + past reflections.
- When to call a tool? Now also: when to page, when to reflect.
- What to persist? Summaries, reflections, vectors — policy is ad-hoc.
- When to decompose? Still a single agent, but the memory tiers prefigure decomposition.
14.5.2 Invariant Analysis: Reflexion + MemGPT (2023)
| Invariant | Memory-Augmented Answer (2023) | Gap? |
|---|---|---|
| State | Main context + archival memory + reflections | Eviction + retrieval policy ad-hoc |
| Time | Per-step + per-trial + per-page loops | No checkpoint/resume across agents |
| Coordination | Single agent with memory services | Reasoning delegation absent |
| Interface | Memory read/write APIs (per-framework) | No standard memory protocol |
State is now genuinely multi-tier: working, archival, reflective. Time gains an outer loop. But Coordination and Interface remain single-agent and per-framework. The next failure is inevitable: when the task demands specialized perspectives — a planner, a coder, a critic — a single agent lacks the capacity to hold all roles simultaneously.
14.5.3 Environment → Measurement → Belief
| Layer | What Memory-Augmented Agents Have | What’s Missing |
|---|---|---|
| Environment | Current task + history of past attempts | — |
| Measurement | Tool outputs + retrieval hits + past reflections | Retrieval is lossy; reflection risks misleading |
| Belief | Layered: working, archival, reflective | Policy for when to trust each layer is implicit |
This is structurally filtered measurement — the memory store edits the signal the agent receives. A bad reflection poisons future runs. The loop can widen the E–M–B gap if the filter is wrong, the same way BGP route flap damping suppressed legitimate updates. Memory hygiene becomes a first-order concern.
14.5.4 “The Gaps Didn’t Matter… Yet.”
For single-agent assistants with modest tasks, Reflexion + MemGPT worked. Agents ran for hours instead of seconds. But when the community tried to build software-engineering agents, a single perspective proved too narrow. The coder missed what the critic would have caught; the planner missed what the debugger would have caught. The no-Coordination answer broke.
14.6 Act 5: “It’s 2023–2024. Orchestrating Teams of Specialists.”
In August 2023, Microsoft’s AutoGen (Wu et al. 2023) introduced conversable agents — specialist agents that exchanged messages through a shared protocol. A Planner would emit tasks; an Executor would run code; a Critic would review. MetaGPT (Hong et al. 2024) encoded an entire software-engineering pipeline as a team of roles. CrewAI and LangGraph followed, each with a different take on how specialists should coordinate.
“AutoGen enables the development of LLM applications using multiple agents that can converse with each other to solve tasks.” — Wu et al., 2023 (Wu et al. 2023)
What the pioneers saw: that a single agent’s context was a bottleneck. Splitting a task across specialists distributed the context load and introduced critique as a first-class primitive. Decomposition traded memory for coordination.
What remained invisible: that each framework was building its own ad-hoc agent-to-agent protocol. AutoGen’s messages were AutoGen-shaped; MetaGPT’s were CrewAI-incompatible. Integrations were N×M. Tool definitions were framework-locked. The Interface invariant ballooned into a protocol-soup.
Applied decision placement by distributing reasoning authority across specialists. This is the agentic analogue of ARPANET’s distributed routing decision: each agent decides locally, and a lightweight orchestrator sequences conversations. Applied disaggregation again, this time at the agent level: planner, executor, critic, researcher become separately evolvable components.
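The orchestration pattern can be sketched as a fixed-turn conversation loop. The role names, message shapes, and approval convention are invented for illustration — each framework defines its own, which is exactly the protocol-soup problem:

```python
# Toy orchestrator sketch for a planner -> executor -> critic pipeline.
# Each role sees the running transcript and appends one message per turn.

def orchestrate(roles, task, max_rounds=3):
    """Sequence a turn-based conversation until the critic approves."""
    transcript = [("user", task)]
    for _ in range(max_rounds):
        for name, agent in roles:
            msg = agent(transcript)
            transcript.append((name, msg))
            if name == "critic" and msg == "APPROVE":
                return transcript
    return transcript                      # ran out of rounds: runaway-chat guard

roles = [
    ("planner",  lambda t: "plan: write function, then tests"),
    ("executor", lambda t: "code: def f(): ..."),
    ("critic",   lambda t: "APPROVE" if any("code:" in m for _, m in t)
                           else "REVISE"),
]
transcript = orchestrate(roles, "implement f()")
```

The `max_rounds` cap is the crude answer to the Time gap in the table below — without it, a critic that never approves produces exactly the runaway chats the invariant analysis names.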
14.6.1 The Designer’s Decisions (2023–2024)
- What to add to context? Per-agent context, per-role slice.
- When to call a tool? Per-agent policy.
- What to persist? Per-agent memory + a shared scratchpad.
- When to decompose? The designer decides at team-design time.
14.6.2 Invariant Analysis: Multi-Agent Orchestration (2023–2024)
| Invariant | Multi-Agent Answer (2023–2024) | Gap? |
|---|---|---|
| State | Per-role contexts + shared scratchpad | Coherence across agents is fragile |
| Time | Turn-based conversation | Deadlocks, loops, runaway chats |
| Coordination | Distributed specialists + orchestrator | Ad-hoc per framework |
| Interface | Per-framework message formats | No shared wire protocol |
Coordination is answered substantively for the first time. But the Interface gap has grown — every framework invents its own primitives, which means a tool written for LangChain is inaccessible from CrewAI. The field begins to need a thin waist.
14.6.3 Environment → Measurement → Belief
| Layer | What Multi-Agent Has | What’s Missing |
|---|---|---|
| Environment | Task + other agents’ outputs | — |
| Measurement | Inter-agent messages + tool outputs | Inter-agent messages are noisy reflections |
| Belief | Distributed across agents; unstable | No consistency guarantee |
This is a new failure mode: accidentally noisy measurement plus structurally filtered communication, because each specialist edits what it passes to the next. Chat loops and coordination failures are the agentic analogue of routing loops in distance-vector routing — and they have the same cause: inconsistent belief across distributed actors without a global consistency mechanism.
14.6.4 “The Gaps Didn’t Matter… Yet.”
For bounded pipelines — generate spec, generate code, generate tests — the ad-hoc protocols worked. Coding harnesses like SWE-agent (Yang et al. 2024; Jimenez et al. 2024), Aider, Cursor, and Claude Code encoded this pattern at the IDE/terminal edge, pairing a model with a narrow set of file-edit and shell tools to drive real software-engineering workflows. The ad-hoc protocols broke as soon as teams wanted to compose tools across frameworks, or call out to external services from multiple agent systems. That broke the per-framework Interface answer.
14.7 Act 6: “It’s 2024. The Thin Waist Emerges: MCP.”
In November 2024, Anthropic published the Model Context Protocol (Anthropic 2024). MCP defined four primitives and a single wire format (JSON-RPC (JavaScript Object Notation – Remote Procedure Call) over stdio or HTTP) for any agent to talk to any tool server (Anthropic 2024):
- Resources — readable, structured data (files, database rows, API responses) the server exposes for the model to inspect
- Tools — executable functions the model can invoke with parameters and receive structured results
- Prompts — reusable templates the server provides for common interaction patterns
- Sampling — bidirectional: the server can request the client to perform an LLM completion, enabling server-initiated reasoning
This is the architectural distinction from earlier function calling: function calling is provider-coupled (OpenAI’s format differs from Anthropic’s) and application-developer-handled. MCP is model-agnostic with server-handled implementation — any MCP client talks to any MCP server, the way any IP host talks to any IP host. The N×M integration problem (N tools × M applications = N×M custom integrations) collapses to N+M.
Within six months, the major agent frameworks had MCP clients. Tool vendors started shipping MCP servers.
“MCP is an open protocol that standardizes how applications provide context to LLMs.” — Anthropic, 2024 (Anthropic 2024)
This is the agentic thin waist. It sits between the model (which can vary) and the substrate (which can vary), exposing a narrow, reusable spanning layer. It is to agents what IP is to networks and POSIX is to operating systems: deliberately weak, deliberately general, deliberately stable.
What the pioneers saw: that the N×M integration problem was killing the ecosystem. A standard wire format was worth more than any one framework’s cleverness. Anthropic made MCP open, and the community adopted it.
What remained invisible: that MCP solved the agent↔︎tool interface but not agent↔︎memory, agent↔︎agent, or agent↔︎experiment. The waist is still being assembled. Memory architectures remain fragmented. Provenance (who did what, with which model version, on which data) has no standard. Benchmarks overfit within months — SWE-bench (Jimenez et al. 2024) for software engineering, WebArena (Zhou et al. 2024) for browser tasks, OSWorld (Xie et al. 2024) for desktop environments, and τ-bench (Yao et al. 2024) for tool-agent-user interaction each captured one slice of “real” agent performance and each saturated within a year of release. The contamination problem was quantified precisely: models achieving 70%+ on standard SWE-bench (Jimenez et al. 2024) dropped to ~23% on SWE-bench Pro — an uncontaminated, industrial-grade variant — revealing that headline scores measured memorization, not capability. This is a measurement-quality problem: the benchmark’s E→M gap was structurally filtered (test cases leaked into training data), producing a Belief (“the agent can fix software bugs”) that diverged sharply from the Environment (the agent fails on unfamiliar code).
Applied interface design at the level of a spanning layer — the same move Beck described in the hourglass model, and the same move the IETF made with IP. A narrow waist ossifies slowly and everything else can evolve above and below.
14.7.1 The Designer’s Decisions (2024)
- What to add to context? MCP resources + tool schemas + prompts.
- When to call a tool? Model-native, MCP-standardized.
- What to persist? Still framework-specific — MCP doesn’t specify.
- When to decompose? Per designer; but MCP servers make composition cheap.
14.7.2 Invariant Analysis: MCP (2024)
| Invariant | MCP Answer (2024) | Gap? |
|---|---|---|
| State | Unchanged per agent; MCP doesn’t standardize memory | Memory architecture still fragmented |
| Time | Per-call latency + budget | No standard checkpoint/resume |
| Coordination | Agent + tool servers (and nascent A2A) | Agent-agent still ad-hoc |
| Interface | MCP thin waist — four primitives, one wire format | Only the agent↔︎tool edge |
MCP answers Interface with a thin waist. Memory and inter-agent protocols remain outside its scope. The community is extending it (MCP-over-SSE (Server-Sent Events), MCP elicitations, A2A) — the waist is still narrowing.
14.7.3 Environment → Measurement → Belief After the Fix
| Layer | What an MCP Agent Has | What’s Missing |
|---|---|---|
| Environment | Every tool exposed through MCP | — |
| Measurement | Standardized tool outputs | Output fidelity still per-server |
| Belief | Composable context from resources + tools | Cross-run memory still fragmented |
The E–M–B loop now has a standard measurement interface. This is exactly the move from proprietary ARPANET measurement procedures (Act 1) to SNMP/IPFIX (Chapter 12): standardizing the sensor layer before standardizing the estimator.
MCP also introduced four security invariants that any deployment must satisfy: sandbox isolation (tools execute in constrained environments), contextual authorization (the user approves tool invocations before execution), exfiltration detection (monitoring for data leaving the trust boundary), and auditable logging (every tool call and result is recorded). These are load-bearing for the safety discussion in Act 7: without them, the thin waist becomes a broad attack surface. The security model is the Interface invariant’s enforcement layer — the same role that BGP’s RPKI (Chapter 6) plays for route advertisements.
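Two of these invariants — contextual authorization and auditable logging — can be sketched as a guard around tool execution. The approval callback, log format, and example tool are illustrative, not part of the protocol:

```python
# Sketch: a pre-action authorization gate plus an auditable log.
# Every invocation attempt is recorded, approved or not.

def guarded_call(tool, args, approve, audit_log):
    """Execute a tool only after explicit approval; record every attempt."""
    decision = approve(tool.__name__, args)
    audit_log.append({"tool": tool.__name__, "args": args,
                      "approved": decision})
    if not decision:
        return None                      # blocked at the trust boundary
    result = tool(**args)
    audit_log[-1]["result"] = result     # auditable record of the outcome
    return result

log = []
def delete_route(prefix):                # hypothetical dangerous tool
    return f"deleted {prefix}"

out = guarded_call(delete_route, {"prefix": "10.0.0.0/8"},
                   approve=lambda name, a: False,   # operator says no
                   audit_log=log)
```

The design choice worth noting: the log entry is written before the approval branch, so even blocked attempts leave an audit trail — the property exfiltration detection depends on.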
14.8 Act 7: “It’s 2026+. Persistent Memory and Autonomous Research.”
By 2026, three trajectories were converging. First, persistent memory as an architecture: Engram-style Research Digest + Archive, Letta’s hosted memory, MCP memory extensions. Second, autonomous research agents: AI Scientist (Lu et al. 2024) running end-to-end ML research; Agent Laboratory (Schmidgall et al. 2025) pairing humans with agent pipelines; Glia (Hamadanian et al. 2025) optimizing GPU schedulers with a Researcher+Supervisor loop. Third, agents in production networking: Confucius at Meta managing hyperscale networks via DSL-mediated tools; NetConfEval benchmarking LLM configuration synthesis (Wang et al. 2024).
“We introduce The AI Scientist, the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently.” — Lu et al., 2024 (Lu et al. 2024)
What the pioneers are seeing: that the four invariants are answered together, not in isolation. A persistent-memory agent with MCP tools, a planner-executor-critic triad, and a budget manager is a complete agentic system — all four decisions answered, all four invariants closed over each other.
What remains invisible: the verification wall. Autonomous research works cheaply only when the evaluation substrate is cheap. ML has Python + GPU; systems research has no equivalent. Glia (Hamadanian et al. 2025) builds on Vidur; Confucius leans on Meta’s tools. The composable empirical backend — the netUnicorn-style thin waist for experiments — remains unbuilt in most domains.
The safety gap is deeper than sandboxing. Constraint loss via context compression is the most dangerous failure mode for deployed agents: when a long context is lossy-compressed to fit a window, critical constraints are silently dropped. A network management agent instructed to “suggest configuration changes but do not execute them” loses the “do not execute” qualifier during summarization and proceeds to push configuration changes to production, causing mass service disruption. This is a State invariant failure with catastrophic consequences: the agent’s Belief (I should execute) diverges from the Environment (the operator said suggest-only) because the Measurement (compressed context) was structurally filtered.
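One defense against this failure mode can be sketched directly: keep operator constraints in a pinned region that the compressor is never allowed to touch. The names here (`PinnedContext`, a character budget standing in for a token budget, truncation standing in for summarization) are illustrative assumptions, not any cited system’s design:

```python
from dataclasses import dataclass, field

@dataclass
class PinnedContext:
    """Working context that lossy-compresses history but never constraints.

    Constraints live outside the compressible region, so however lossy the
    compressor, a 'do not execute' qualifier cannot be silently dropped.
    """
    budget_chars: int                                      # stand-in for a token budget
    constraints: list = field(default_factory=list)        # pinned, never compressed
    history: list = field(default_factory=list)            # compressible

    def add(self, msg: str, pinned: bool = False) -> None:
        (self.constraints if pinned else self.history).append(msg)

    def render(self) -> str:
        out = list(self.constraints)                       # pinned region first
        room = self.budget_chars - sum(len(c) for c in out)
        kept = []
        for msg in reversed(self.history):                 # naive policy: keep recent turns
            if room - len(msg) < 0:
                kept.append("[older turns summarized]")
                break
            kept.append(msg)
            room -= len(msg)
        return "\n".join(out + list(reversed(kept)))
```

The design choice is structural, not behavioral: the State invariant (“what to track”) is answered by partitioning, so no summarizer prompt needs to be trusted to preserve the qualifier.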
Behavioral verification is equally young: Constitutional AI (Bai et al. 2022) replaces human preference labels with a model-written critique loop, and downstream sandboxing and pre-action authorization layers (enforced through MCP’s security invariants (Anthropic 2024) from Act 6) bound what an agent can do before it acts. These are first moves at a layer that is still being designed, far from the verification equivalent of IP checksums or TCP sequence numbers.
Closed-loop reasoning is applied here at the longest timescale yet: across projects, not just runs. The Research Digest outlives any individual agent’s context window. This is the agentic analogue of long-term network telemetry: the measurements that outlast the measuring device.
14.8.1 Invariant Analysis: Persistent Autonomous Agents (2026+)
| Invariant | Autonomous Agent Answer (2026+) | Gap? |
|---|---|---|
| State | Working context + archival memory + research digest | Eviction + provenance policies still open |
| Time | Per-step + per-trial + per-project loops | Multi-day runs hit budget/latency walls |
| Coordination | Planner + Experimenter + Verifier + Knowledge-base | Verifier is the bottleneck outside ML |
| Interface | MCP + memory extensions + experiment specs | Experiment-spec waist still standardizing |
14.8.2 Environment → Measurement → Belief
| Layer | What Autonomous Agents Have | What’s Missing |
|---|---|---|
| Environment | Real systems, real datasets, real users | — |
| Measurement | MCP tools + persistent memory + experiment oracles | Experiment substrates vary wildly by domain |
| Belief | Digest + Archive + working context | Provenance of beliefs not standardized |
14.9 The Grand Arc: From Prompt to Persistent Agent
14.9.1 The Evolving Anchor
| Era | Binding Constraint | State | Time | Coordination | Interface |
|---|---|---|---|---|---|
| 2020 Prompt | Context window only | Prompt text | Single pass | None | Text in/out |
| 2022 CoT | + need to externalize reasoning | + scratchpad | Longer pass | None | + reasoning format |
| 2022–23 ReAct | + need to act on the world | + trace | Observe-act loop | None | + tool calls |
| 2023 Reflexion/MemGPT | + need cross-episode memory | + archival + reflections | + multi-trial loop | None | + memory API |
| 2023–24 Multi-agent | + need specialist perspectives | Per-role contexts | Turn-based | Distributed specialists | Per-framework protocols |
| 2024 MCP | + need composable tool ecosystem | Unchanged | + per-call budget | + tool servers | MCP thin waist |
| 2026+ Persistent | + need cross-run intelligence | + digest + archive | Multi-day runs | Planner/Exec/Verifier/KB | MCP + memory + experiment specs |
The binding constraint accumulates. Every era inherits its predecessors’ limits and adds one more. 2026+ agents must still fit each call into a finite context, still pay for each tool call, still tolerate hallucination — but now they also must manage memory, coordinate specialists, and validate against experiments.
14.9.2 Three Design Principles Applied Across the Arc
Disaggregation was applied first at ReAct (Yao, Zhao, et al. 2023) (reasoning / action), again at Reflexion/MemGPT (Shinn et al. 2023; Packer et al. 2023) (working context / archival memory), again at multi-agent (Wu et al. 2023; Hong et al. 2024) (planner / executor / critic), and finally at MCP (Anthropic 2024) (agent / tool). Each disaggregation created an interface; each interface is a candidate for ossification. The most successful — MCP — ossified deliberately, as a thin waist.
Closed-loop reasoning was applied at progressively longer timescales. CoT (Wei et al. 2022) closed a loop inside a single forward pass (self-attention). ReAct (Yao, Zhao, et al. 2023) closed a loop per step. Reflexion (Shinn et al. 2023) closed a loop per trial. AI Scientist (Lu et al. 2024) closes a loop per project. Each loop is a feedback mechanism narrowing the E–M–B gap at a specific timescale; each loop period is a design parameter.
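The nesting of these loops can be made concrete. This is a hedged sketch, with `agent`, `env`, `evaluate`, and `reflect` as placeholder callables rather than any framework’s API: a ReAct-style per-step loop runs inside a Reflexion-style per-trial loop, and reflections accumulate in a memory that survives across trials.

```python
def run_project(task, agent, env, evaluate, reflect, max_trials=3, max_steps=10):
    """Two nested E-M-B loops at different timescales (illustrative sketch)."""
    memory = []                                # survives across trials (Reflexion)
    for trial in range(max_trials):
        obs = env.reset(task)
        for _ in range(max_steps):             # inner loop: closes per step (ReAct)
            action = agent(task, obs, memory)
            obs, done = env.step(action)
            if done:
                break
        ok, feedback = evaluate(env)
        if ok:
            return trial + 1                   # trials consumed on success
        memory.append(reflect(feedback))       # outer loop: closes per trial
    return None                                # trial budget exhausted
```

Each `max_*` bound is a loop-period design parameter in the sense above; an AI-Scientist-style per-project loop would wrap one more layer around `run_project` itself.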
Decision placement migrated from centralized (one model, one call) to distributed (multi-agent) to hybrid (a centralized planner orchestrating distributed specialists). The migration is the same one routing took from ARPANET’s distributed IMPs to SDN’s centralized controller and back to segment-routing’s hybrid. The underlying question — who decides? — is identical.
Two cross-chapter analogies sharpen the pattern. First, token budgets are the closest thing agents have to congestion control. An agent consuming tokens is a flow consuming bandwidth — and the current literature lacks a treatment of agent token consumption as a flow subject to fair-queueing, rate-limiting, or admission control (the tools from Chapter 7). Second, misaligned tool selection at scale is the agentic analogue of BGP route leaks (Chapter 6): a syntactically valid action (the tool call is well-formed) with catastrophic semantic consequences (the tool executes an action the agent’s intent excluded). Both failures are structurally filtered — the measurement signal (tool call succeeded) obscures the environment (the action was harmful).
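The first analogy can be made literal with the standard token-bucket shaper from the congestion-control toolbox, applied to token spend rather than bandwidth. This is an illustrative sketch of the analogy, not an existing agent-framework API:

```python
import time

class TokenBucket:
    """Rate-limit an agent's token consumption the way a shaper limits a flow.

    Tokens refill at `rate` per second up to a `burst` cap; admit() answers the
    admission-control question: may this call spend `cost` tokens right now?
    """
    def __init__(self, rate: float, burst: float, now=time.monotonic):
        self.rate, self.burst, self.now = rate, burst, now
        self.tokens, self.last = burst, now()

    def admit(self, cost: float) -> bool:
        t = self.now()
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False                           # defer, queue, or shed the call
```

A denied call is the agentic analogue of a dropped or ECN-marked packet: a signal to back off, retry later, or choose a cheaper action. Fair-queueing across competing agents would layer one bucket per agent over a shared scheduler.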
14.9.3 The Dependency Chain
14.9.4 Pioneer Diagnosis Table
| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 2020 | Brown et al. | Interface | Language tasks need no supervised data | GPT-3 few-shot prompting (Brown et al. 2020) |
| 2022 | Wei et al. | State | Reasoning is latent; needs a scratchpad | Chain of Thought (Wei et al. 2022) |
| 2022 | Yao et al. | Time | Reasoning must interleave with action | ReAct (Yao, Zhao, et al. 2023) |
| 2023 | Schick et al. | Interface | Models can learn when to call tools | Toolformer (Schick et al. 2023) |
| 2023 | Shinn et al. | State | Agents repeat mistakes without reflection | Reflexion (Shinn et al. 2023) |
| 2023 | Packer et al. | State | One context window overflows on long tasks | MemGPT / Letta (Packer et al. 2023) |
| 2023 | Wu et al. | Coordination | Single agent is brittle | AutoGen (Wu et al. 2023) |
| 2024 | Jimenez et al. | (Evaluation) | Agents need real-world oracles | SWE-bench (Jimenez et al. 2024) |
| 2024 | Zhou et al. | (Evaluation) | Browsers need live environment oracles | WebArena (Zhou et al. 2024) |
| 2024 | Yang et al. | Interface | Coding agents need agent-computer interfaces | SWE-agent (Yang et al. 2024) |
| 2024 | OpenAI | State | Reasoning can be learned, not only prompted | o1 (OpenAI 2024) |
| 2024 | Anthropic | Interface | Ecosystem needs a thin waist | Model Context Protocol (Anthropic 2024) |
| 2024 | Lu et al. | (All four) | End-to-end research can autonomize | AI Scientist (Lu et al. 2024) |
| 2025 | Schmidgall et al. | Coordination | Research needs specialist teams | Agent Laboratory (Schmidgall et al. 2025) |
| 2025 | Hamadanian et al. | (Substrate) | Systems design needs an Evaluation Playground | Glia (Hamadanian et al. 2025) |
14.9.5 Innovation Timeline
flowchart TD
subgraph sg1["Prompting Era"]
A1["2020 — Brown: GPT-3"]
A2["2021 — Few-shot prompting craft"]
A1 --> A2
end
subgraph sg2["Reasoning Era"]
B1["2022 — Wei: Chain of Thought"]
B2["2022 — Kojima: Zero-shot CoT"]
B3["2022 — Wang: Self-consistency"]
B4["2023 — Yao: Tree of Thoughts"]
B1 --> B2 --> B3 --> B4
end
subgraph sg3["Tool-Use Era"]
C1["2022 — Yao: ReAct"]
C2["2023 — Schick: Toolformer"]
C3["2023 — OpenAI: Function calling"]
C4["2023 — Patil: Gorilla"]
C1 --> C2 --> C3 --> C4
end
subgraph sg4["Memory Era"]
D1["2023 — Shinn: Reflexion"]
D2["2023 — Packer: MemGPT/Letta"]
D3["2023 — Park: Generative Agents"]
D1 --> D2 --> D3
end
subgraph sg5["Multi-Agent Era"]
E1["2023 — Wu: AutoGen"]
E2["2023 — Hong: MetaGPT"]
E3["2023 — Li: CAMEL"]
E4["2024 — LangGraph"]
E5["2024 — CrewAI"]
E1 --> E2 --> E3 --> E4 --> E5
end
subgraph sg6["Thin-Waist Era"]
F1["2024 — Anthropic: MCP"]
F2["2024 — Jimenez: SWE-bench"]
F3["2024 — Zhou: WebArena"]
F4["2024 — Yang: SWE-agent"]
F5["2024 — OpenAI: o1"]
F6["2025 — A2A protocol"]
F7["2025 — MCP memory extensions"]
F1 --> F2 --> F3 --> F4 --> F5 --> F6 --> F7
end
subgraph sg7["Autonomous Era"]
G1["2024 — Lu: AI Scientist"]
G2["2025 — Schmidgall: Agent Laboratory"]
G3["2025 — Hamadanian: Glia"]
G4["2025 — Meta: Confucius"]
G5["2025 — Popper"]
G6["2026 — Persistent memory architectures"]
G7["2026 — Composable empirical backends"]
G1 --> G2 --> G3 --> G4 --> G5 --> G6 --> G7
end
sg1 --> sg2 --> sg3 --> sg4 --> sg5 --> sg6 --> sg7
14.10 Why This Matters
The four invariants are substrate-independent. They are the questions any system with State, Time, Coordination, and Interface decisions must answer — and that set is closed under composition. 1970 ALOHA answered them at the level of frame transmissions. 2012 CoDel answered them at the level of packet queues. 2026 agentic systems answer them at the level of tokens, tool calls, and specialist teams. The substrate changed; the questions did not.
This is the framework’s reach. A student who internalizes the four invariants can reason about an operating-system scheduler, a distributed database, a congestion-control loop, a memory hierarchy, a multi-agent planner — or whatever system class emerges next — without learning a new vocabulary. Each substrate has its own binding constraint (RF physics, disk latency, context window, hallucination rate); each imposes its own version of the E–M–B loop; each forces the same four placement decisions. The discipline transfers.
The six systems in this book are finished designs with decades of debugging behind them. The agentic layer above them remains under construction; the thin waist is still assembling; the memory architecture is still settling; the verification wall for systems research stands unbroken. A student who can diagnose which invariant is the binding constraint for a new system, and at what timescale its E-M-B loop closes, is equipped to contribute to any of those open problems. That is what the framework buys.
14.11 Generative Exercises
Design a research agent that ingests a 500-paper corpus in a domain you care about and produces the four survey artifacts (intent, triage, deepened notes, synthesis) used in this book’s literature-survey pipeline. Using only the four invariants:
- What belongs in the context window for the triage step vs. the deepen step vs. the synthesize step? (State)
- At what timescale does the agent check its own progress — per paper, per thread, per synthesis pass? (Time)
- Which roles must be specialized vs. shared? (Coordination)
- What is the thin waist between the survey agent and the corpus? (Interface — hint: MCP resources exposing the PDF store)
Predict which invariant will be the binding constraint if the corpus grows to 5,000 papers. Explain your answer using the E–M–B decomposition.
A software engineer is debugging a distributed-systems race condition and wants an agent to help. The agent can read logs, run tests, and propose hypotheses; the human has ground truth about the system’s intended behavior. Design the human-in-the-loop coordination using the four invariants:
- What belongs in the shared context between human and agent vs. the agent-private context? (State)
- When must the agent stop and ask a question vs. proceed on its own? (Time — framed as a budget for autonomous action)
- Who decides when to rollback a hypothesis? (Coordination — which decisions stay with the human?)
- What is the interface — chat, an IDE extension, an MCP server exposing logs? (Interface)
Identify which of the five tensions from this chapter’s dependency graph this design will hit.
Chapter 12 described network-measurement systems. Suppose you want an agent to drive a Sonata/netUnicorn-style measurement campaign: propose experiments, launch them on a testbed, interpret telemetry, iterate. Design the MCP server that exposes the measurement substrate to the agent:
- Which resources does the server expose (testbed topology, historical runs, available probes)?
- Which tools does the server offer (launch experiment, fetch telemetry, cancel, checkpoint)?
- How does the server expose the agent’s budget (cost, latency, testbed fair-share) — is it a resource, a tool response, or a separate primitive?
- What prevents the agent from damaging the testbed (validation layer, dry-run, human approval for destructive actions — Confucius’s pattern)?
Sketch the thin waist: what crosses between the measurement agent and any testbed, independent of which testbed? That thin waist is the agentic-research analogue of IP for networks.
14.12 What Comes After This Book
The agentic layer above this book’s six finished systems remains under construction. The thin waist is still assembling. The memory architecture is still settling. The verification wall for systems research stands unbroken.
The student who understands the four invariants can build what comes next. The framework is applied, not updated; the problems are new, the tools are the same. Go build.
Self-attention computes pairwise interactions between all tokens in the context window. For a window of N tokens, attention requires O(N²) computation and memory. Doubling the context window quadruples the cost — this quadratic scaling is the physics-level constraint that bounds context size.↩︎
Temperature (T) controls the randomness of the model’s token sampling. At T=0, the model always picks the highest-probability token (deterministic). At T=1, sampling follows the model’s learned distribution. Higher T increases diversity but also hallucination risk.↩︎
pass@k measures the probability that at least one of k generated solutions passes all test cases. pass@1 is the strictest metric (single attempt); pass@100 measures whether the model can produce a correct solution given many tries. The gap between pass@1 and pass@100 quantifies the benefit of search/retry.↩︎
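The metric has a standard unbiased estimator, popularized by code-generation evaluations, computable from n generated samples of which c pass all tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    This is the probability that a uniformly random size-k subset of the
    n samples contains at least one of the c passing solutions.
    """
    if n - c < k:
        return 1.0            # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 5 passing samples out of 10, pass@1 estimates to 0.5 while pass@3 estimates to about 0.92; that spread is exactly the search/retry benefit the footnote describes.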
An embedding is a dense vector representation of a text chunk in a high-dimensional space (typically 768-3072 dimensions). Similar texts have nearby embeddings, enabling semantic search by computing cosine similarity between the query embedding and stored document embeddings.↩︎