```mermaid
flowchart TD
A["<b>2020 Prompt</b><br/>State=prompt<br/>Time=point<br/>Coord=none<br/>Interface=text"]
B["<b>2022 CoT</b><br/>+ scratchpad<br/>(State externalized)"]
C["<b>2022–23 ReAct</b><br/>+ observe-act loop<br/>(Time closes)"]
D["<b>2023 Reflexion/MemGPT</b><br/>+ cross-episode memory<br/>(State persists)"]
E["<b>2023–24 Multi-Agent</b><br/>+ specialists<br/>(Coord distributes)"]
F["<b>2024 MCP</b><br/>+ thin waist<br/>(Interface ossifies)"]
G["<b>2026+ Persistent</b><br/>+ digest + archive<br/>(all four coupled)"]
A -->|"gap: reasoning invisible"| B
B -->|"gap: cannot act"| C
C -->|"gap: no persistence"| D
D -->|"gap: one perspective"| E
E -->|"gap: ad-hoc protocols"| F
F -->|"gap: memory+experiment still open"| G
style A fill:#4477AA,color:#fff,stroke:#4477AA
style B fill:#66CCEE,color:#000,stroke:#66CCEE
style C fill:#228833,color:#fff,stroke:#228833
style D fill:#CCBB44,color:#000,stroke:#CCBB44
style E fill:#EE6677,color:#fff,stroke:#EE6677
style F fill:#AA3377,color:#fff,stroke:#AA3377
style G fill:#BBBBBB,color:#000,stroke:#BBBBBB
```
14 Agentic Systems — Applying the Framework to Build the Framework’s Builders
14.1 The Anchor: Finite Context, Token Budgets, and Hallucination Risk
An agentic system is a networked system. A large language model sits at the center; around it, a web of tools, memories, other agents, and eventually a human operator exchanges messages. Every message costs tokens. Every tool call costs latency. Every memory entry costs a slot in a bounded context window — and every entry, once written, risks reinforcing a mistake. The agent runs under finite cost and finite time. It runs under the same four questions that every system in this book has had to answer: what to track, when to act, who decides, what crosses the interface.
The binding constraint is the conjunction of three realities that the agent inherits from the model it is built on: a finite context window, a finite token budget, and a non-zero hallucination rate. The window is bounded because attention scales quadratically. The budget is bounded because inference costs money and seconds. The hallucination rate is bounded away from zero because the model is statistical. The agent inherits these three bounds from the lower layer (the model) in exactly the way TCP inherits IP’s unreliable datagrams and WiFi inherits RF physics. Its only freedom is to design around them.
From that binding constraint, four decision problems fall out — the agentic analogues of TCP’s “when to send, what to send, how much to send”:
- What to add to context? (Every token occupies a slot.)
- When to call a tool? (Every call trades latency for information.)
- What to persist? (Every entry persists imperfectly and risks misleading the next run.)
- When to decompose into subagents? (A single context overflows beyond a threshold — but coordination introduces new gaps.)
These four decisions are answered, differently, by every design from prompt completion (2020) to persistent autonomous researchers (2026+). Looking back across those six years, the agentic community reinvented the four invariants — State, Time, Coordination, Interface — not as a pedagogical exercise but through debugging failures. What follows is that journey, Act by Act.
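The four decision problems share a budget arithmetic. Here is a minimal sketch of the first one — what to add to context — as a greedy value-per-token selection. The candidate names, token counts, and value scores are invented for illustration; real agents estimate value with heuristics or auxiliary models:

```python
# Hypothetical sketch: "what to add to context?" as a greedy
# value-per-token knapsack over a finite token budget.

def assemble_context(candidates, token_budget):
    """Pick entries that maximize estimated value under a token budget.

    candidates: list of (name, tokens, value) tuples.
    Returns (chosen names in greedy value-density order, tokens used).
    """
    ranked = sorted(candidates, key=lambda c: c[2] / c[1], reverse=True)
    chosen, used = [], 0
    for name, tokens, value in ranked:
        if used + tokens <= token_budget:   # every token occupies a slot
            chosen.append(name)
            used += tokens
    return chosen, used

candidates = [
    ("system_prompt",    200, 10.0),  # small and high value
    ("last_tool_output", 400,  6.0),
    ("old_reflection",   300,  1.5),  # risks reinforcing a mistake
    ("full_file_dump",  3000,  4.0),  # useful but enormous
]
chosen, used = assemble_context(candidates, token_budget=1000)
```

The point of the sketch is the trade, not the policy: the file dump loses to three smaller entries because value is paid for in slots.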
“An autonomous agent is a system that perceives its environment, reasons about it, and acts to achieve goals.” — retrospective, from the common framing that crystallized around 2023.
The pioneers who prompted GPT-3 in 2020 had no such framing. They were running a language model. The framing emerged from failure, the same way McQuillan’s tripartite routing procedure emerged from ARPANET instability in 1980.
14.2 Act 1: “It’s 2020. GPT-3 Completes Prompts.”
In June 2020, OpenAI released GPT-3 (Brown et al. 2020). At 175 billion parameters (Brown et al. 2020), it could continue a text prompt with startling fluency. The interface was simple: submit a prompt, receive a completion. Temperature, top-p, and stop sequences were the only control knobs. The whole interaction fit in a single HTTP request.
“Humans do not require large supervised datasets to learn most language tasks… scaling up language models greatly improves task-agnostic, few-shot performance.” — Brown et al., 2020 (Brown et al. 2020)
What the pioneers saw: a pure function. Prompt in, completion out. The cleverness of the interaction lived entirely in the prompt. The community discovered prompt engineering — phrasing, few-shot exemplars, role instructions — as the craft of extracting behavior from a frozen model.
What remained invisible from the pioneers’ vantage point: that a single forward pass is a terrible substrate for anything requiring derivation, planning, or action on the world. Arithmetic was brittle. Multi-hop questions flopped. The model had the knowledge but no mechanism to externalize a derivation.
14.2.1 The Designer’s Decisions (2020)
At this layer, the designer is the person writing the prompt. Each request she constructs forces her to answer, implicitly, the four decision problems:
- What to add to context? The entire prompt. There is no other State.
- When to call a tool? Tools are absent; the question is moot.
- What to persist? Zero state carries over. Each call is independent.
- When to decompose? Decomposition is absent. There is one model, one call.
14.2.2 Invariant Analysis: Prompt Completion (2020)
| Invariant | Prompt Completion Answer (2020) | Gap? |
|---|---|---|
| State | Prompt text only; no scratchpad | Reasoning vanishes inside one forward pass |
| Time | Single forward pass | No loop, no retry, no observation |
| Coordination | None — one model, one caller | Delegation and critique are absent |
| Interface | Text in / text out | World-action is absent |
Two invariants have essentially degenerate answers: Time is a point, not an interval, and Coordination is absent. State is the prompt, which makes inspection trivial but externalization impossible — the model thinks inside weights opaque to the designer. Interface is pure text. The gaps are enormous: the model lacks the ability to show its work, retry, call external services, or remember the last conversation. For simple continuations these gaps are invisible; for anything multi-step they are catastrophic.
14.2.3 Environment → Measurement → Belief
| Layer | What Prompt Completion Has | What’s Missing |
|---|---|---|
| Environment | The true task context; the world beyond the prompt | — |
| Measurement | Tokens in the prompt window | No tool outputs, no observations, no feedback |
| Belief | Model weights + prompt context | No externalized derivation; errors invisible |
This is a physically limited measurement regime: the only signal the model receives is the prompt. The loop is open — there is no closed feedback because there is no next step. Narrowing the E–M–B gap requires a measurement channel wider than the single forward pass; a loop must be designed in, which is exactly what comes next.
14.2.4 “The Gaps Didn’t Matter… Yet.”
For translation, summarization, and short-form generation, one forward pass is enough. Few-shot prompting already outperformed fine-tuned baselines on many benchmarks (Brown et al. 2020), which is why the community accepted it as the native interface for a frozen model. The gaps mattered only when the tasks got longer than a single reasoning step. By 2021, arithmetic benchmarks started showing plateaus — the model knew the pieces but could not assemble them.
The environment changed when the community stopped asking “can the model generate plausible text?” and started asking “can the model solve a multi-step problem?” That shift broke the single-pass answer to the State invariant.
14.3 Act 2: “It’s 2022. Let’s Think Step by Step.”
In January 2022, Jason Wei and colleagues at Google showed that prompting a model with a few examples of step-by-step reasoning transformed its multi-step accuracy (Wei et al. 2022). On GSM8K (grade-school math), PaLM’s accuracy jumped from 18% to 57% simply by including chains of reasoning in the prompt (Wei et al. 2022). Kojima et al. showed the even more surprising result that the phrase “Let’s think step by step” alone – zero-shot CoT – worked almost as well on some tasks (Kojima et al. 2022).
“We show that generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning.” — Wei et al., 2022 (Wei et al. 2022)
What the pioneers saw: that reasoning was latent in the model and needed only a scratchpad to surface. The prompt could carry its own derivation. By laying out intermediate steps as tokens, the model attended to them on the way to the final answer.
What remained invisible: that the scratchpad still lives inside a single forward pass. The model wrote its derivation, but it could not act on the world between steps, could not call a calculator to verify arithmetic, and had no access to facts beyond its training data. The scratchpad was a richer State — but still entirely internal. Tree of Thoughts (Yao, Yu, et al. 2023) later showed that the scratchpad could branch, but the world beyond the prompt remained out of reach.
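The zero-shot variant is almost trivially small, which is part of why it was surprising. A sketch, with a stub standing in for the completion API:

```python
# Minimal sketch of zero-shot chain-of-thought prompting
# (the Kojima et al. trigger phrase). `call_model` is a stand-in
# for any completion endpoint; only the prompt shape is the point.

def cot_prompt(question: str) -> str:
    # The scratchpad trigger: the model writes its derivation as
    # tokens before the answer, making intermediate State inspectable.
    return f"Q: {question}\nA: Let's think step by step."

def call_model(prompt: str) -> str:
    # Stub standing in for a real completion API.
    return "(model completion would appear here)"

prompt = cot_prompt("A bat and a ball cost $1.10 in total...")
completion = call_model(prompt)
```

Everything the technique changes is visible in `cot_prompt`: the State invariant's answer moves from "prompt only" to "prompt plus derivation," with no change to the model or the Time invariant.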
14.3.1 The Designer’s Decisions (2022)
- What to add to context? The prompt + the model’s own intermediate reasoning. The scratchpad becomes part of State.
- When to call a tool? Tools remain absent.
- What to persist? Zero state carries over.
- When to decompose? Decomposition remains absent — but branching within a single call (ToT) is the first hint that decomposition matters.
Applied decision placement: all decisions still live in one model, one call. The scratchpad externalizes reasoning within the call, but control authority is monolithic.
14.3.2 Invariant Analysis: Chain of Thought (2022)
| Invariant | CoT Answer (2022) | Gap? |
|---|---|---|
| State | Scratchpad: reasoning materialized as tokens | Scratchpad is ephemeral; vanishes at turn end |
| Time | One longer forward pass | No observe-act loop yet |
| Coordination | None — one model | Delegation and critique absent |
| Interface | Text + reasoning format | World-action absent |
CoT answers State substantively for the first time: the model’s intermediate belief is inspectable. But Time remains a point — the scratchpad is written in one pass and discarded. Coordination and Interface are unchanged from Act 1. The consequence: CoT improves reasoning about a closed problem but lacks the ability to debug a failing program, update a stale fact, or check a calculation against a calculator.
14.3.3 Environment → Measurement → Belief
| Layer | What CoT Has | What’s Missing |
|---|---|---|
| Environment | The task the prompt describes | — |
| Measurement | Prompt tokens + scratchpad tokens | No external observations |
| Belief | Chain of reasoning steps in context | No verification, no grounding |
Still a physically limited measurement regime: the only signal is text. But the scratchpad is a first attempt to narrow the E–M–B gap through self-measurement — the model inspects its own intermediate steps. Self-consistency (Wang et al. 2023) samples many chains and votes, effectively running several open-loop measurements in parallel. The fundamental limit remains: no external signal enters the loop.
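Self-consistency’s voting step is a one-liner over sampled final answers. The canned answer list below stands in for temperature-sampled chains:

```python
# Sketch of self-consistency voting (Wang et al.): sample several
# reasoning chains, keep only each chain's final answer, majority-vote.

from collections import Counter

def self_consistency(sampled_answers):
    """Return the most common final answer across sampled chains."""
    answer, _count = Counter(sampled_answers).most_common(1)[0]
    return answer

# Five hypothetical chains; three derivations converge on "18".
best = self_consistency(["18", "18", "21", "18", "24"])
```

The vote is over answers, not chains — two different derivations that reach the same number reinforce each other, which is exactly the parallel open-loop measurement the text describes.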
14.3.4 “The Gaps Didn’t Matter… Yet.”
For math word problems and logic puzzles, the scratchpad was enough. The gap between belief and environment stayed small because the task was fully described in the prompt. Later reasoning models — OpenAI’s o1 (OpenAI 2024), DeepSeek-R1 (DeepSeek AI 2025), and their successors — would push the scratchpad idea to its limit by training the model to produce long internal reasoning traces before emitting a visible answer. This introduced a new paradigm: inference-time scaling (OpenAI 2024), where performance improves not from larger models but from more compute spent at test time. The cost is concrete: reasoning models are 10 to 74 times more expensive per token than standard models (OpenAI 2024; DeepSeek AI 2025). And a subtle failure mode emerges — overthinking: on low-complexity tasks, reasoning models underperform standard models, spending tokens on derivation chains that add confusion rather than clarity. Beyond certain complexity thresholds, accuracy collapses entirely despite longer traces. The Time invariant is the diagnostic: the agent is spending too much time (tokens) on reasoning relative to the task’s demands.
But as soon as tasks required information outside the prompt — current weather, a database record, the contents of a file — no amount of internal reasoning could reach across the boundary. The environment broke the Interface invariant’s degenerate answer.
14.4 Act 3: “It’s 2022–2023. Reason, Then Act.”
In October 2022, Shunyu Yao and colleagues at Princeton and Google released ReAct (Yao, Zhao, et al. 2023). ReAct interleaved three primitives: Thought, Action, Observation. The model emitted a thought, then an action (e.g., search[Colorado orogeny]), then read an observation back from an external tool. The loop continued until the model emitted a final answer.
“ReAct prompts LLMs to generate both reasoning traces and task- specific actions in an interleaved manner, allowing for greater synergy between the two.” — Yao et al., 2022 (Yao, Zhao, et al. 2023)
Schick et al.’s Toolformer (Schick et al. 2023) showed the complementary idea at the pre-training level: a model could learn when to call which API and insert tool calls into its own generation. By mid-2023, function calling was a first-class product feature in commercial APIs. The scaffolding became standard.
What the pioneers saw: that reasoning and action had to interleave. A thought without observation is speculation; an action without thought is a reflex. The minimal unit of agency was the Thought→Action→Observation tuple.
What remained invisible: that the loop still lived inside a single task episode. When the task ended, the trace vanished. The next episode started from zero — every lesson learned in one run was unavailable to the next.
Applied disaggregation by separating the reasoning engine (the LLM) from the action substrate (the tools). This is the first serious disaggregation in the agentic stack, and it creates an interface — the tool-call format — that will later ossify and then crystallize into MCP.
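The Thought→Action→Observation loop can be sketched in a few lines. The `Action:`/`Final:` line protocol, the scripted model, and the toy search tool are illustrative stand-ins for a real output parser, LLM, and tool registry:

```python
# Minimal ReAct-style loop sketch. Real harnesses parse model output
# for action lines and append "Observation: ..." back into the trace.

def react_loop(model, tools, task, max_steps=5):
    trace = [f"Task: {task}"]
    for _ in range(max_steps):
        step = model("\n".join(trace))          # Thought + Action
        trace.append(step)
        if step.startswith("Final:"):           # model emits final answer
            return step.removeprefix("Final: "), trace
        if step.startswith("Action: "):
            name, _, arg = step.removeprefix("Action: ").partition("[")
            obs = tools[name](arg.rstrip("]"))  # act on the world
            trace.append(f"Observation: {obs}") # observation closes the loop
    return None, trace

# Scripted "model" and a toy search tool for demonstration.
script = iter(["Action: search[Colorado orogeny]",
               "Final: an episode of mountain building"])
model = lambda _prompt: next(script)
tools = {"search": lambda q: f"results for {q!r}"}
answer, trace = react_loop(model, tools, "What is the Colorado orogeny?")
```

Note where State lives: everything — thoughts, actions, observations — is appended to `trace`, which is why the Act 3 invariant table records "grows unboundedly; no compression."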
14.4.1 The Designer’s Decisions (2022–2023)
- What to add to context? Thoughts + tool results, appended as the loop runs.
- When to call a tool? Whenever the model emits an action token.
- What to persist? Within the episode, everything. Across episodes, zero state.
- When to decompose? Still a single agent.
14.4.2 Invariant Analysis: ReAct + Tools (2022–2023)
| Invariant | ReAct Answer (2022–2023) | Gap? |
|---|---|---|
| State | Thoughts + actions + observations, appended | Grows unboundedly; no compression |
| Time | Observe→act loop, per-step | No checkpoint, no long-horizon planning |
| Coordination | Single agent, sequential tools | Parallel specialists absent |
| Interface | Tool-call syntax (per-framework) | Ad-hoc — no standard |
ReAct answers Time for the first time: there is now a loop, with a measurable period (one tool round-trip). It answers Interface partially: tools exist. But the State grows linearly with trace length, so long tasks overflow the context window. And Coordination remains single-agent — when the trace gets long, delegation remains unavailable.
14.4.3 Environment → Measurement → Belief
| Layer | What ReAct Has | What’s Missing |
|---|---|---|
| Environment | External world accessed through tools | — |
| Measurement | Tool outputs appended as Observations | Tool outputs are narrow; untrusted |
| Belief | Running trace of thoughts + observations | Trace grows without compression |
This is accidentally noisy measurement — tool outputs are honest but imperfect (stale search results, partial API responses). Better estimators can narrow the gap (better retrieval, grounding citations). The closed loop is finally alive: measure (observe) → estimate (think) → control (act). But the loop has no memory beyond the episode.
14.4.4 “The Gaps Didn’t Matter… Yet.”
For HotPotQA-style multi-hop question answering (Yao, Zhao, et al. 2023), 3–5 ReAct steps sufficed. The trace fit in context. The task ended before the agent got tired. The gap that would matter next — cross-episode learning — appeared as soon as people tried to run agents on software engineering tasks, where the same bug could recur across runs and the same lesson had to be re-learned every time. That broke the no-persistence answer to State.
14.5 Act 4: “It’s 2023. Memory and Self-Reflection.”
In March 2023, Noah Shinn and colleagues introduced Reflexion (Shinn et al. 2023). After each attempt at a task, the agent wrote a verbal self-reflection — a natural-language critique of what went wrong — and stored it. On the next attempt, the reflection entered the prompt. On HumanEval, Reflexion lifted pass@1 from 80% to 91% for GPT-4 (Shinn et al. 2023).
“Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials.” — Shinn et al., 2023 (Shinn et al. 2023)
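The trial loop can be sketched in a few lines. The actor, evaluator, and reflector below are stubs standing in for the LLM, the task’s test harness, and the self-critique prompt:

```python
# Sketch of the Reflexion trial loop: attempt, evaluate, write a verbal
# self-critique, feed the critique into the next attempt's prompt.

def reflexion(actor, evaluator, reflector, task, max_trials=3):
    reflections = []                            # episodic memory buffer
    for trial in range(max_trials):
        attempt = actor(task, reflections)      # reflections enter the prompt
        if evaluator(attempt):                  # e.g. unit tests pass
            return attempt, trial + 1
        reflections.append(reflector(attempt))  # critique persists to next trial
    return None, max_trials

# Demonstration with scripted attempts: fail once, then succeed.
attempts = iter(["buggy", "fixed"])
actor = lambda task, refl: next(attempts)
evaluator = lambda a: a == "fixed"
reflector = lambda a: f"attempt {a!r} failed: handle the empty-list case"
result, trials = reflexion(actor, evaluator, reflector, "write sort()")
```

The loop closes over trials, not steps — the outer Time loop the invariant table below records. The hygiene risk is also visible: a bad `reflector` output enters every subsequent prompt.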
In October 2023, Charles Packer and colleagues at Berkeley proposed MemGPT (Packer et al. 2023) (later productized as Letta), borrowing virtual-memory paging from operating systems. MemGPT defines three memory tiers with distinct access patterns (Packer et al. 2023): working memory (always visible in the context window — system prompt, recent messages), archival memory (semantically searchable via embedding retrieval — long-term knowledge), and recall memory (chronologically indexed — conversation history). The model issues tool calls to page data between tiers. When working memory approaches 70% of the context window, a memory pressure alert fires and the agent must evict (Packer et al. 2023) — compressing older entries via recursive summarization before they overflow. Paginated retrieval prevents context overflow when bringing archival data back.
“We propose MemGPT (MemoryGPT), a system that intelligently manages different memory tiers in order to effectively provide extended context within the LLM’s limited context window.” — Packer et al., 2023 (Packer et al. 2023)
What the pioneers saw: that the agent was running out of memory. A long task, a long dialog, or a long-lived assistant needed a store that outlasted one context window. Reflexion added episodic memory (what went wrong). MemGPT added hierarchical memory (working, archival, recall — three tiers with distinct access semantics, mirroring the L1/L2/L3 cache hierarchy in hardware).
What remained invisible: that these were two ad-hoc answers to the same question. The field had no shared memory architecture. Every framework invented its own eviction policy, its own reflection format, its own vector store. Memory became a configuration problem, not an architecture.
Applied closed-loop reasoning at a new timescale: multi-run. The Reflexion loop closes over trials, not steps. The MemGPT loop closes over paging decisions. Both extend the agent’s effective context beyond the window.
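The paging mechanics can be sketched as follows, assuming the 70% pressure threshold described above. The summarizer stub stands in for LLM-driven recursive summarization, and the keep-recent policy is an invented simplification:

```python
# Sketch of MemGPT-style memory pressure and eviction. The real system
# uses the LLM itself to summarize and a vector store for archival search.

def check_pressure(working_tokens, window, threshold=0.70):
    """Fire a memory-pressure alert when working memory nears the window."""
    return working_tokens / window >= threshold

def evict(working, archival, summarize, keep_recent=2):
    """Compress older entries into archival memory; keep recent ones hot."""
    old, recent = working[:-keep_recent], working[-keep_recent:]
    if old:
        archival.append(summarize(old))  # paged out, searchable later
    return recent, archival

working = ["msg1", "msg2", "msg3", "msg4"]
archival = []
if check_pressure(working_tokens=7200, window=8192):
    working, archival = evict(
        working, archival,
        summarize=lambda msgs: "summary of " + ", ".join(msgs))
```

The structurally filtered measurement risk is right there in `summarize`: whatever the compression drops is gone from working memory, recoverable only through archival retrieval.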
14.5.1 The Designer’s Decisions (2023)
- What to add to context? Active conversation + recalled memories + past reflections.
- When to call a tool? Now also: when to page, when to reflect.
- What to persist? Summaries, reflections, vectors — policy is ad-hoc.
- When to decompose? Still a single agent, but the memory tiers prefigure decomposition.
14.5.2 Invariant Analysis: Reflexion + MemGPT (2023)
| Invariant | Memory-Augmented Answer (2023) | Gap? |
|---|---|---|
| State | Main context + archival memory + reflections | Eviction + retrieval policy ad-hoc |
| Time | Per-step + per-trial + per-page loops | No checkpoint/resume across agents |
| Coordination | Single agent with memory services | Reasoning delegation absent |
| Interface | Memory read/write APIs (per-framework) | No standard memory protocol |
State is now genuinely multi-tier: working, archival, reflective. Time gains an outer loop. But Coordination and Interface remain single-agent and per-framework. The next failure is inevitable: when the task demands specialized perspectives — a planner, a coder, a critic — a single agent lacks the capacity to hold all roles simultaneously.
14.5.3 Environment → Measurement → Belief
| Layer | What Memory-Augmented Agents Have | What’s Missing |
|---|---|---|
| Environment | Current task + history of past attempts | — |
| Measurement | Tool outputs + retrieval hits + past reflections | Retrieval is lossy; reflection risks misleading |
| Belief | Layered: working, archival, reflective | Policy for when to trust each layer is implicit |
This is structurally filtered measurement — the memory store edits the signal the agent receives. A bad reflection poisons future runs. The loop can widen the E–M–B gap if the filter is wrong, the same way BGP route flap damping suppressed legitimate updates. Memory hygiene becomes a first-order concern.
14.5.4 “The Gaps Didn’t Matter… Yet.”
For single-agent assistants with modest tasks, Reflexion + MemGPT worked. Agents ran for hours instead of seconds. But when the community tried to build software-engineering agents, a single perspective proved too narrow. The coder missed what the critic would have caught; the planner missed what the debugger would have caught. The no-Coordination answer broke.
14.6 Act 5: “It’s 2023–2024. Orchestrating Teams of Specialists.”
In August 2023, Microsoft’s AutoGen (Wu et al. 2023) introduced conversable agents — specialist agents that exchanged messages through a shared protocol. A Planner would emit tasks; an Executor would run code; a Critic would review. MetaGPT (Hong et al. 2024) encoded an entire software-engineering pipeline as a team of roles. CrewAI and LangGraph followed, each with a different take on how specialists should coordinate.
“AutoGen enables the development of LLM applications using multiple agents that can converse with each other to solve tasks.” — Wu et al., 2023 (Wu et al. 2023)
What the pioneers saw: that a single agent’s context was a bottleneck. Splitting a task across specialists distributed the context load and introduced critique as a first-class primitive. Decomposition traded memory for coordination.
What remained invisible: that each framework was building its own ad-hoc agent-to-agent protocol. AutoGen’s messages were AutoGen-shaped; MetaGPT’s were CrewAI-incompatible. Integrations were N×M. Tool definitions were framework-locked. The Interface invariant ballooned into a protocol-soup.
Applied decision placement by distributing reasoning authority across specialists. This is the agentic analogue of ARPANET’s distributed routing decision: each agent decides locally, and a lightweight orchestrator sequences conversations. Applied disaggregation again, this time at the agent level: planner, executor, critic, researcher become separately evolvable components.
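The orchestration pattern can be sketched as a fixed-turn conversation loop. The role names, message shapes, and approval convention are invented for illustration — each framework defines its own, which is exactly the protocol-soup problem:

```python
# Toy orchestrator sketch for a planner -> executor -> critic pipeline.
# Each role sees the running transcript and appends one message per turn.

def orchestrate(roles, task, max_rounds=3):
    """Sequence a turn-based conversation until the critic approves."""
    transcript = [("user", task)]
    for _ in range(max_rounds):
        for name, agent in roles:
            msg = agent(transcript)
            transcript.append((name, msg))
            if name == "critic" and msg == "APPROVE":
                return transcript
    return transcript                      # ran out of rounds: runaway-chat guard

roles = [
    ("planner",  lambda t: "plan: write function, then tests"),
    ("executor", lambda t: "code: def f(): ..."),
    ("critic",   lambda t: "APPROVE" if any("code:" in m for _, m in t)
                           else "REVISE"),
]
transcript = orchestrate(roles, "implement f()")
```

The `max_rounds` cap is the crude answer to the Time gap in the table below — without it, a critic that never approves produces exactly the runaway chats the invariant analysis names.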
14.6.1 The Designer’s Decisions (2023–2024)
- What to add to context? Per-agent context, per-role slice.
- When to call a tool? Per-agent policy.
- What to persist? Per-agent memory + a shared scratchpad.
- When to decompose? The designer decides at team-design time.
14.6.2 Invariant Analysis: Multi-Agent Orchestration (2023–2024)
| Invariant | Multi-Agent Answer (2023–2024) | Gap? |
|---|---|---|
| State | Per-role contexts + shared scratchpad | Coherence across agents is fragile |
| Time | Turn-based conversation | Deadlocks, loops, runaway chats |
| Coordination | Distributed specialists + orchestrator | Ad-hoc per framework |
| Interface | Per-framework message formats | No shared wire protocol |
Coordination is answered substantively for the first time. But the Interface gap has grown — every framework invents its own primitives, which means a tool written for LangChain is inaccessible from CrewAI. The field begins to need a thin waist.
14.6.3 Environment → Measurement → Belief
| Layer | What Multi-Agent Has | What’s Missing |
|---|---|---|
| Environment | Task + other agents’ outputs | — |
| Measurement | Inter-agent messages + tool outputs | Inter-agent messages are noisy reflections |
| Belief | Distributed across agents; unstable | No consistency guarantee |
This is a new failure mode: accidentally noisy measurement plus structurally filtered communication, because each specialist edits what it passes to the next. Chat loops and coordination failures are the agentic analogue of routing loops in distance-vector routing — and they have the same cause: inconsistent belief across distributed actors without a global consistency mechanism.
14.6.4 “The Gaps Didn’t Matter… Yet.”
For bounded pipelines — generate spec, generate code, generate tests — the ad-hoc protocols worked. Coding harnesses like SWE-agent (Yang et al. 2024; Jimenez et al. 2024), Aider, Cursor, and Claude Code encoded this pattern at the IDE/terminal edge, pairing a model with a narrow set of file-edit and shell tools to drive real software-engineering workflows. The ad-hoc protocols broke as soon as teams wanted to compose tools across frameworks, or call out to external services from multiple agent systems. That broke the per-framework Interface answer.
14.7 Act 6: “It’s 2024. The Thin Waist Emerges: MCP.”
In November 2024, Anthropic published the Model Context Protocol (Anthropic 2024). MCP defined four primitives and a single wire format (JSON-RPC (JavaScript Object Notation – Remote Procedure Call) over stdio or HTTP) for any agent to talk to any tool server (Anthropic 2024):
- Resources — readable, structured data (files, database rows, API responses) the server exposes for the model to inspect
- Tools — executable functions the model can invoke with parameters and receive structured results
- Prompts — reusable templates the server provides for common interaction patterns
- Sampling — bidirectional: the server can request the client to perform an LLM completion, enabling server-initiated reasoning
This is the architectural distinction from earlier function calling: function calling is provider-coupled (OpenAI’s format differs from Anthropic’s) and application-developer-handled. MCP is model-agnostic with server-handled implementation — any MCP client talks to any MCP server, the way any IP host talks to any IP host. The N×M integration problem (N tools × M applications = N×M custom integrations) collapses to N+M.
Within six months, the major agent frameworks had MCP clients. Tool vendors started shipping MCP servers.
“MCP is an open protocol that standardizes how applications provide context to LLMs.” — Anthropic, 2024 (Anthropic 2024)
This is the agentic thin waist. It sits between the model (which can vary) and the substrate (which can vary), exposing a narrow, reusable spanning layer. It is to agents what IP is to networks and POSIX is to operating systems: deliberately weak, deliberately general, deliberately stable.
What the pioneers saw: that the N×M integration problem was killing the ecosystem. A standard wire format was worth more than any one framework’s cleverness. Anthropic made MCP open, and the community adopted it.
What remained invisible: that MCP solved the agent↔︎tool interface but not agent↔︎memory, agent↔︎agent, or agent↔︎experiment. The waist is still being assembled. Memory architectures remain fragmented. Provenance (who did what, with which model version, on which data) has no standard. Benchmarks overfit within months — SWE-bench (Jimenez et al. 2024) for software engineering, WebArena (Zhou et al. 2024) for browser tasks, OSWorld (Xie et al. 2024) for desktop environments, and τ-bench (Yao et al. 2024) for tool-agent-user interaction each captured one slice of “real” agent performance and each saturated within a year of release. The contamination problem was quantified precisely: models achieving 70%+ on standard SWE-bench (Jimenez et al. 2024) dropped to ~23% on SWE-bench Pro — an uncontaminated, industrial-grade variant — revealing that headline scores measured memorization, not capability. This is a measurement-quality problem: the benchmark’s E→M gap was structurally filtered (test cases leaked into training data), producing a Belief (“the agent can fix software bugs”) that diverged sharply from the Environment (the agent fails on unfamiliar code).
Applied interface design at the level of a spanning layer — the same move Beck described in the hourglass model, and the same move the IETF made with IP. A narrow waist ossifies slowly and everything else can evolve above and below.
14.7.1 The Designer’s Decisions (2024)
- What to add to context? MCP resources + tool schemas + prompts.
- When to call a tool? Model-native, MCP-standardized.
- What to persist? Still framework-specific — MCP doesn’t specify.
- When to decompose? Per designer; but MCP servers make composition cheap.
14.7.2 Invariant Analysis: MCP (2024)
| Invariant | MCP Answer (2024) | Gap? |
|---|---|---|
| State | Unchanged per agent; MCP doesn’t standardize memory | Memory architecture still fragmented |
| Time | Per-call latency + budget | No standard checkpoint/resume |
| Coordination | Agent + tool servers (and nascent A2A) | Agent-agent still ad-hoc |
| Interface | MCP thin waist — four primitives, one wire format | Only the agent↔︎tool edge |
MCP answers Interface with a thin waist. Memory and inter-agent protocols remain outside its scope. The community is extending it (MCP-over-SSE (Server-Sent Events), MCP elicitations, A2A) — the waist is still narrowing.
14.7.3 Environment → Measurement → Belief After the Fix
| Layer | What an MCP Agent Has | What’s Missing |
|---|---|---|
| Environment | Every tool exposed through MCP | — |
| Measurement | Standardized tool outputs | Output fidelity still per-server |
| Belief | Composable context from resources + tools | Cross-run memory still fragmented |
The E–M–B loop now has a standard measurement interface. This is exactly the move from proprietary ARPANET measurement procedures (Act 1) to SNMP/IPFIX (Chapter 12): standardizing the sensor layer before standardizing the estimator.
MCP also introduced four security invariants that any deployment must satisfy: sandbox isolation (tools execute in constrained environments), contextual authorization (the user approves tool invocations before execution), exfiltration detection (monitoring for data leaving the trust boundary), and auditable logging (every tool call and result is recorded). These are load-bearing for the safety discussion in Act 7: without them, the thin waist becomes a broad attack surface. The security model is the Interface invariant’s enforcement layer — the same role that BGP’s RPKI (Chapter 6) plays for route advertisements.
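Two of these invariants — contextual authorization and auditable logging — can be sketched as a guard around tool execution. The approval callback, log format, and example tool are illustrative, not part of the protocol:

```python
# Sketch: a pre-action authorization gate plus an auditable log.
# Every invocation attempt is recorded, approved or not.

def guarded_call(tool, args, approve, audit_log):
    """Execute a tool only after explicit approval; record every attempt."""
    decision = approve(tool.__name__, args)
    audit_log.append({"tool": tool.__name__, "args": args,
                      "approved": decision})
    if not decision:
        return None                      # blocked at the trust boundary
    result = tool(**args)
    audit_log[-1]["result"] = result     # auditable record of the outcome
    return result

log = []
def delete_route(prefix):                # hypothetical dangerous tool
    return f"deleted {prefix}"

out = guarded_call(delete_route, {"prefix": "10.0.0.0/8"},
                   approve=lambda name, a: False,   # operator says no
                   audit_log=log)
```

The design choice worth noting: the log entry is written before the approval branch, so even blocked attempts leave an audit trail — the property exfiltration detection depends on.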
14.8 Act 7: “It’s 2026+. Persistent Memory and Autonomous Research.”
By 2026, three trajectories were converging. First, persistent memory as an architecture: Engram-style Research Digest + Archive, Letta’s hosted memory, MCP memory extensions. Second, autonomous research agents: AI Scientist (Lu et al. 2024) running end-to-end ML research; Agent Laboratory (Schmidgall et al. 2025) pairing humans with agent pipelines; Glia (Hamadanian et al. 2025) optimizing GPU schedulers with a Researcher+Supervisor loop. Third, agents in production networking: Confucius at Meta managing hyperscale networks via DSL-mediated tools; NetConfEval benchmarking LLM configuration synthesis (Wang et al. 2024).
“We introduce The AI Scientist, the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently.” — Lu et al., 2024 (Lu et al. 2024)
What the pioneers are seeing: that the four invariants are answered together, not in isolation. A persistent-memory agent with MCP tools, a planner-executor-critic triad, and a budget manager is a complete agentic system — all four decisions answered, all four invariants closed over each other.
What remains invisible: the verification wall. Autonomous research works cheaply only when the evaluation substrate is cheap. ML has Python + GPU; systems research has no equivalent. Glia (Hamadanian et al. 2025) builds on Vidur; Confucius leans on Meta’s tools. The composable empirical backend — the netUnicorn-style thin waist for experiments — remains unbuilt in most domains.
The safety gap is deeper than sandboxing. Constraint loss via context compression is the most dangerous failure mode for deployed agents: when a long context is lossy-compressed to fit a window, critical constraints are silently dropped. A network management agent instructed to “suggest configuration changes but do not execute them” loses the “do not execute” qualifier during summarization and proceeds to push configuration changes to production, causing mass service disruption. This is a State invariant failure with catastrophic consequences: the agent’s Belief (I should execute) diverges from the Environment (the operator said suggest-only) because the Measurement (compressed context) was structurally filtered.
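One defense against this failure mode can be sketched directly: keep operator constraints in a pinned region that the compressor is never allowed to touch. The names here (`PinnedContext`, a character budget standing in for a token budget, truncation standing in for summarization) are illustrative assumptions, not any cited system’s design:

```python
from dataclasses import dataclass, field

@dataclass
class PinnedContext:
    """Working context that lossy-compresses history but never constraints.

    Constraints live outside the compressible region, so however lossy the
    compressor, a 'do not execute' qualifier cannot be silently dropped.
    """
    budget_chars: int                                      # stand-in for a token budget
    constraints: list = field(default_factory=list)        # pinned, never compressed
    history: list = field(default_factory=list)            # compressible

    def add(self, msg: str, pinned: bool = False) -> None:
        (self.constraints if pinned else self.history).append(msg)

    def render(self) -> str:
        out = list(self.constraints)                       # pinned region first
        room = self.budget_chars - sum(len(c) for c in out)
        kept = []
        for msg in reversed(self.history):                 # naive policy: keep recent turns
            if room - len(msg) < 0:
                kept.append("[older turns summarized]")
                break
            kept.append(msg)
            room -= len(msg)
        return "\n".join(out + list(reversed(kept)))
```

The design choice is structural, not behavioral: the State invariant (“what to track”) is answered by partitioning, so no summarizer prompt needs to be trusted to preserve the qualifier.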
Behavioral verification is equally young: Constitutional AI (Bai et al. 2022) replaces human preference labels with a model-written critique loop, and downstream sandboxing and pre-action authorization layers (enforced through MCP’s security invariants (Anthropic 2024) from Act 6) bound what an agent can do before it acts. These are first moves at a layer that is still being designed, far from the verification equivalent of IP checksums or TCP sequence numbers.
Closed-loop reasoning is applied here at the longest timescale yet: across projects, not just runs. The Research Digest outlives any individual agent’s context window. This is the agentic analogue of long-term network telemetry: the measurements that outlast the measuring device.
14.8.1 Invariant Analysis: Persistent Autonomous Agents (2026+)
| Invariant | Autonomous Agent Answer (2026+) | Gap? |
|---|---|---|
| State | Working context + archival memory + research digest | Eviction + provenance policies still open |
| Time | Per-step + per-trial + per-project loops | Multi-day runs hit budget/latency walls |
| Coordination | Planner + Experimenter + Verifier + Knowledge-base | Verifier is the bottleneck outside ML |
| Interface | MCP + memory extensions + experiment specs | Experiment-spec waist still standardizing |
14.8.2 Environment → Measurement → Belief
| Layer | What Autonomous Agents Have | What’s Missing |
|---|---|---|
| Environment | Real systems, real datasets, real users | — |
| Measurement | MCP tools + persistent memory + experiment oracles | Experiment substrates vary wildly by domain |
| Belief | Digest + Archive + working context | Provenance of beliefs not standardized |
14.9 The Grand Arc: From Prompt to Persistent Agent
14.9.1 The Evolving Anchor
| Era | Binding Constraint | State | Time | Coordination | Interface |
|---|---|---|---|---|---|
| 2020 Prompt | Context window only | Prompt text | Single pass | None | Text in/out |
| 2022 CoT | + need to externalize reasoning | + scratchpad | Longer pass | None | + reasoning format |
| 2022–23 ReAct | + need to act on the world | + trace | Observe-act loop | None | + tool calls |
| 2023 Reflexion/MemGPT | + need cross-episode memory | + archival + reflections | + multi-trial loop | None | + memory API |
| 2023–24 Multi-agent | + need specialist perspectives | Per-role contexts | Turn-based | Distributed specialists | Per-framework protocols |
| 2024 MCP | + need composable tool ecosystem | Unchanged | + per-call budget | + tool servers | MCP thin waist |
| 2026+ Persistent | + need cross-run intelligence | + digest + archive | Multi-day runs | Planner/Exec/Verifier/KB | MCP + memory + experiment specs |
The binding constraint accumulates. Every era inherits its predecessors’ limits and adds one more. 2026+ agents must still fit each call into a finite context, still pay for each tool call, still tolerate hallucination — but now they also must manage memory, coordinate specialists, and validate against experiments.
14.9.2 Three Design Principles Applied Across the Arc
Disaggregation was applied first at ReAct (Yao, Zhao, et al. 2023) (reasoning / action), again at Reflexion/MemGPT (Shinn et al. 2023; Packer et al. 2023) (working context / archival memory), again at multi-agent (Wu et al. 2023; Hong et al. 2024) (planner / executor / critic), and finally at MCP (Anthropic 2024) (agent / tool). Each disaggregation created an interface; each interface is a candidate for ossification. The most successful — MCP — ossified deliberately, as a thin waist.
Closed-loop reasoning was applied at progressively longer timescales. CoT (Wei et al. 2022) closed a loop inside a single forward pass (self-attention). ReAct (Yao, Zhao, et al. 2023) closed a loop per step. Reflexion (Shinn et al. 2023) closed a loop per trial. AI Scientist (Lu et al. 2024) closes a loop per project. Each loop is a feedback mechanism narrowing the E–M–B gap at a specific timescale; each loop period is a design parameter.
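The nesting of these loops can be made concrete. This is a hedged sketch, with `agent`, `env`, `evaluate`, and `reflect` as placeholder callables rather than any framework’s API: a ReAct-style per-step loop runs inside a Reflexion-style per-trial loop, and reflections accumulate in a memory that survives across trials.

```python
def run_project(task, agent, env, evaluate, reflect, max_trials=3, max_steps=10):
    """Two nested E-M-B loops at different timescales (illustrative sketch)."""
    memory = []                                # survives across trials (Reflexion)
    for trial in range(max_trials):
        obs = env.reset(task)
        for _ in range(max_steps):             # inner loop: closes per step (ReAct)
            action = agent(task, obs, memory)
            obs, done = env.step(action)
            if done:
                break
        ok, feedback = evaluate(env)
        if ok:
            return trial + 1                   # trials consumed on success
        memory.append(reflect(feedback))       # outer loop: closes per trial
    return None                                # trial budget exhausted
```

Each `max_*` bound is a loop-period design parameter in the sense above; an AI-Scientist-style per-project loop would wrap one more layer around `run_project` itself.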
Decision placement migrated from centralized (one model, one call) to distributed (multi-agent) to hybrid (a centralized planner orchestrating distributed specialists). The migration is the same one routing took from ARPANET’s distributed IMPs to SDN’s centralized controller and back to segment-routing’s hybrid. The underlying question — who decides? — is identical.
Two cross-chapter analogies sharpen the pattern. First, token budgets are the closest thing agents have to congestion control. An agent consuming tokens is a flow consuming bandwidth — and the current literature lacks a treatment of agent token consumption as a flow subject to fair-queueing, rate-limiting, or admission control (the tools from Chapter 7). Second, misaligned tool selection at scale is the agentic analogue of BGP route leaks (Chapter 6): a syntactically valid action (the tool call is well-formed) with catastrophic semantic consequences (the tool executes an action the agent’s intent excluded). Both failures are structurally filtered — the measurement signal (tool call succeeded) obscures the environment (the action was harmful).
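The first analogy can be made literal with the standard token-bucket shaper from the congestion-control toolbox, applied to token spend rather than bandwidth. This is an illustrative sketch of the analogy, not an existing agent-framework API:

```python
import time

class TokenBucket:
    """Rate-limit an agent's token consumption the way a shaper limits a flow.

    Tokens refill at `rate` per second up to a `burst` cap; admit() answers the
    admission-control question: may this call spend `cost` tokens right now?
    """
    def __init__(self, rate: float, burst: float, now=time.monotonic):
        self.rate, self.burst, self.now = rate, burst, now
        self.tokens, self.last = burst, now()

    def admit(self, cost: float) -> bool:
        t = self.now()
        self.tokens = min(self.burst, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False                           # defer, queue, or shed the call
```

A denied call is the agentic analogue of a dropped or ECN-marked packet: a signal to back off, retry later, or choose a cheaper action. Fair-queueing across competing agents would layer one bucket per agent over a shared scheduler.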
14.9.3 The Dependency Chain
14.9.4 Pioneer Diagnosis Table
| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 2020 | Brown et al. | Interface | Language tasks need no supervised data | GPT-3 few-shot prompting (Brown et al. 2020) |
| 2022 | Wei et al. | State | Reasoning is latent; needs a scratchpad | Chain of Thought (Wei et al. 2022) |
| 2022 | Yao et al. | Time | Reasoning must interleave with action | ReAct (Yao, Zhao, et al. 2023) |
| 2023 | Schick et al. | Interface | Models can learn when to call tools | Toolformer (Schick et al. 2023) |
| 2023 | Shinn et al. | State | Agents repeat mistakes without reflection | Reflexion (Shinn et al. 2023) |
| 2023 | Packer et al. | State | One context window overflows on long tasks | MemGPT / Letta (Packer et al. 2023) |
| 2023 | Wu et al. | Coordination | Single agent is brittle | AutoGen (Wu et al. 2023) |
| 2024 | Jimenez et al. | (Evaluation) | Agents need real-world oracles | SWE-bench (Jimenez et al. 2024) |
| 2024 | Zhou et al. | (Evaluation) | Browsers need live environment oracles | WebArena (Zhou et al. 2024) |
| 2024 | Yang et al. | Interface | Coding agents need agent-computer interfaces | SWE-agent (Yang et al. 2024) |
| 2024 | OpenAI | State | Reasoning can be learned, not only prompted | o1 (OpenAI 2024) |
| 2024 | Anthropic | Interface | Ecosystem needs a thin waist | Model Context Protocol (Anthropic 2024) |
| 2024 | Lu et al. | (All four) | End-to-end research can autonomize | AI Scientist (Lu et al. 2024) |
| 2025 | Schmidgall et al. | Coordination | Research needs specialist teams | Agent Laboratory (Schmidgall et al. 2025) |
| 2025 | Hamadanian et al. | (Substrate) | Systems design needs an Evaluation Playground | Glia (Hamadanian et al. 2025) |
14.9.5 Innovation Timeline
flowchart TD
subgraph sg1["Prompting Era"]
A1["2020 — Brown: GPT-3"]
A2["2021 — Few-shot prompting craft"]
A1 --> A2
end
subgraph sg2["Reasoning Era"]
B1["2022 — Wei: Chain of Thought"]
B2["2022 — Kojima: Zero-shot CoT"]
B3["2022 — Wang: Self-consistency"]
B4["2023 — Yao: Tree of Thoughts"]
B1 --> B2 --> B3 --> B4
end
subgraph sg3["Tool-Use Era"]
C1["2022 — Yao: ReAct"]
C2["2023 — Schick: Toolformer"]
C3["2023 — OpenAI: Function calling"]
C4["2023 — Patil: Gorilla"]
C1 --> C2 --> C3 --> C4
end
subgraph sg4["Memory Era"]
D1["2023 — Shinn: Reflexion"]
D2["2023 — Packer: MemGPT/Letta"]
D3["2023 — Park: Generative Agents"]
D1 --> D2 --> D3
end
subgraph sg5["Multi-Agent Era"]
E1["2023 — Wu: AutoGen"]
E2["2023 — Hong: MetaGPT"]
E3["2023 — Li: CAMEL"]
E4["2024 — LangGraph"]
E5["2024 — CrewAI"]
E1 --> E2 --> E3 --> E4 --> E5
end
subgraph sg6["Thin-Waist Era"]
F1["2024 — Anthropic: MCP"]
F2["2024 — Jimenez: SWE-bench"]
F3["2024 — Zhou: WebArena"]
F4["2024 — Yang: SWE-agent"]
F5["2024 — OpenAI: o1"]
F6["2025 — A2A protocol"]
F7["2025 — MCP memory extensions"]
F1 --> F2 --> F3 --> F4 --> F5 --> F6 --> F7
end
subgraph sg7["Autonomous Era"]
G1["2024 — Lu: AI Scientist"]
G2["2025 — Schmidgall: Agent Laboratory"]
G3["2025 — Hamadanian: Glia"]
G4["2025 — Meta: Confucius"]
G5["2025 — Popper"]
G6["2026 — Persistent memory architectures"]
G7["2026 — Composable empirical backends"]
G1 --> G2 --> G3 --> G4 --> G5 --> G6 --> G7
end
sg1 --> sg2 --> sg3 --> sg4 --> sg5 --> sg6 --> sg7
14.10 Why This Matters
The four invariants are substrate-independent. They are the questions any system with State, Time, Coordination, and Interface decisions must answer — and that set is closed under composition. 1970 ALOHA answered them at the level of frame transmissions. 2012 CoDel answered them at the level of packet queues. 2026 agentic systems answer them at the level of tokens, tool calls, and specialist teams. The substrate changed; the questions did not.
This is the framework’s reach. A student who internalizes the four invariants can reason about an operating-system scheduler, a distributed database, a congestion-control loop, a memory hierarchy, a multi-agent planner — or whatever system class emerges next — without learning a new vocabulary. Each substrate has its own binding constraint (RF physics, disk latency, context window, hallucination rate); each imposes its own version of the E–M–B loop; each forces the same four placement decisions. The discipline transfers.
The six systems in this book are finished designs with decades of debugging behind them. The agentic layer above them remains under construction; the thin waist is still assembling; the memory architecture is still settling; the verification wall for systems research stands unbroken. A student who can diagnose which invariant is the binding constraint for a new system, and at what timescale its E-M-B loop closes, is equipped to contribute to any of those open problems. That is what the framework buys.
14.11 Generative Exercises
Design a research agent that ingests a 500-paper corpus in a domain you care about and produces the four survey artifacts (intent, triage, deepened notes, synthesis) used in this book’s literature-survey pipeline. Using only the four invariants:
- What belongs in the context window for the triage step vs. the deepen step vs. the synthesize step? (State)
- At what timescale does the agent check its own progress — per paper, per thread, per synthesis pass? (Time)
- Which roles must be specialized vs. shared? (Coordination)
- What is the thin waist between the survey agent and the corpus? (Interface — hint: MCP resources exposing the PDF store)
Predict which invariant will be the binding constraint if the corpus grows to 5,000 papers. Explain your answer using the E–M–B decomposition.
A software engineer is debugging a distributed-systems race condition and wants an agent to help. The agent can read logs, run tests, and propose hypotheses; the human has ground truth about the system’s intended behavior. Design the human-in-the-loop coordination using the four invariants:
- What belongs in the shared context between human and agent vs. the agent-private context? (State)
- When must the agent stop and ask a question vs. proceed on its own? (Time — framed as a budget for autonomous action)
- Who decides when to rollback a hypothesis? (Coordination — which decisions stay with the human?)
- What is the interface — chat, an IDE extension, an MCP server exposing logs? (Interface)
Identify which of the five tensions from this chapter’s dependency graph this design will hit.
Chapter 12 described network-measurement systems. Suppose you want an agent to drive a Sonata/netUnicorn-style measurement campaign: propose experiments, launch them on a testbed, interpret telemetry, iterate. Design the MCP server that exposes the measurement substrate to the agent:
- Which resources does the server expose (testbed topology, historical runs, available probes)?
- Which tools does the server offer (launch experiment, fetch telemetry, cancel, checkpoint)?
- How does the server expose the agent’s budget (cost, latency, testbed fair-share) — is it a resource, a tool response, or a separate primitive?
- What prevents the agent from damaging the testbed (validation layer, dry-run, human approval for destructive actions — Confucius’s pattern)?
Sketch the thin waist: what crosses between the measurement agent and any testbed, independent of which testbed? That thin waist is the agentic-research analogue of IP for networks.
14.12 What Comes After This Book
The agentic layer above this book’s six finished systems remains under construction. The thin waist is still assembling. The memory architecture is still settling. The verification wall for systems research stands unbroken.
The student who understands the four invariants can build what comes next. The framework is applied, not updated; the problems are new, the tools are the same. Go build.
Self-attention computes pairwise interactions between all tokens in the context window. For a window of N tokens, attention requires O(N²) computation and memory. Doubling the context window quadruples the cost — this quadratic scaling is the physics-level constraint that bounds context size.↩︎
Temperature (T) controls the randomness of the model’s token sampling. At T=0, the model always picks the highest-probability token (deterministic). At T=1, sampling follows the model’s learned distribution. Higher T increases diversity but also hallucination risk.↩︎
pass@k measures the probability that at least one of k generated solutions passes all test cases. pass@1 is the strictest metric (single attempt); pass@100 measures whether the model can produce a correct solution given many tries. The gap between pass@1 and pass@100 quantifies the benefit of search/retry.↩︎
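The metric has a standard unbiased estimator, popularized by code-generation evaluations, computable from n generated samples of which c pass all tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k).

    This is the probability that a uniformly random size-k subset of the
    n samples contains at least one of the c passing solutions.
    """
    if n - c < k:
        return 1.0            # every size-k subset must include a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 5 passing samples out of 10, pass@1 estimates to 0.5 while pass@3 estimates to about 0.92; that spread is exactly the search/retry benefit the footnote describes.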
An embedding is a dense vector representation of a text chunk in a high-dimensional space (typically 768-3072 dimensions). Similar texts have nearby embeddings, enabling semantic search by computing cosine similarity between the query embedding and stored document embeddings.↩︎