```mermaid
flowchart TD
A["<b>1969: Survivability</b><br/>Locks Coordination = distributed"] --> B["State: distance vectors<br/>Time: 128ms noisy updates"]
B --> C["<b>FAILURE: Oscillations at scale</b><br/>Measurement proxy wrong + no consistency"]
C --> D["<b>1980: McQuillan redesign</b><br/>State → full topology<br/>Time → 10s direct measurement"]
D --> E["<b>1989: Internet commercializes</b><br/>Relocks Coordination = autonomous"]
E --> F["State: policy-encoded RIBs<br/>Time: deliberately slow (MRAI 30s)"]
F --> G["<b>1997-2003: BGP crises</b><br/>Convergence 3-15 min, NP-hard,<br/>RFD paradox, prefix hijacking"]
G --> H["<b>2004: Comprehensibility crisis</b><br/>Relocks Coordination = centralized"]
H --> I["State: global NIB<br/>Interface: match-action flow tables"]
I --> J["<b>FAILURE: Interface ossifies</b><br/>OpenFlow 12→41 fields, still not enough"]
J --> K["<b>2014: P4</b><br/>Interface → programmable pipeline"]
style C fill:#EE6677,color:#fff
style G fill:#EE6677,color:#fff
style J fill:#EE6677,color:#fff
style A fill:#4477AA,color:#fff
style D fill:#228833,color:#fff
style E fill:#4477AA,color:#fff
style H fill:#4477AA,color:#fff
style K fill:#228833,color:#fff
```
7 Routing, Switching, and Programmable Networks
7.1 The Anchor: Forwarding Under Incomplete Information
Every routing system solves the same problem: a packet arrives with a destination address, and the node must choose an output port. The forwarding decision is trivial — a table lookup. Building and maintaining that table is the hard part. The network graph changes — links fail, nodes join, traffic shifts — but forwarding must continue. If two nodes disagree about the path to the same destination, packets loop or vanish.
The binding constraint is incomplete information about a graph you don’t fully control. Every node sees only its local neighborhood. Every routing design is a strategy for making forwarding decisions with incomplete, delayed, and potentially inconsistent information about a changing topology. This constraint is constant from 1964 to the present. What evolves is why the information is incomplete, what information matters, and who lacks control.
The intellectual origins of this constraint trace to 1964, when Paul Baran at RAND addressed a survivability problem:
“Let us consider the synthesis of a communication network which will allow several hundred major communications stations to talk with one another after an enemy attack.” — Baran, 1964 (Baran 1964)
Baran proved that a distributed mesh with redundancy levels of only three or four links per node survives heavy attack. His contribution was not a routing algorithm — it was a proof that the network must function without any central point of control. This survival requirement became the meta constraint that shaped the first generation of routing: Coordination must be distributed.
Routing must continuously answer four decision problems — the routing analogues of TCP’s “when to send, what to send, how much to send”:
- What to measure about the network? (What signals can I observe about the topology?)
- What to tell my neighbors? (What information do I share, and in what form?)
- How to compute the forwarding table? (What algorithm turns measurements into forwarding decisions?)
- When to update? (How fast do I react when conditions change?)
Looking back, John McQuillan — who redesigned ARPANET routing in 1980 — formalized these as a tripartite “routing procedure” (McQuillan et al. 1980):
“(1) a measurement process for determining pertinent network characteristics, (2) a protocol for disseminating information about these characteristics, and (3) a calculation to determine how traffic should be routed.”
This framing maps directly to the four invariants — measurement is State, dissemination is Interface, calculation requires Coordination, and the entire procedure operates on a Time axis. McQuillan articulated this structure in retrospect, after debugging the failures of the first routing system. The pioneers who built that first system in 1969 had no such framework — they were solving an engineering problem from scratch.
7.2 Act 1: “It’s 1969. The First Four Nodes Come Online.”
In September 1969, the first Interface Message Processor (IMP) was installed at UCLA. By December, four nodes were connected — UCLA, Stanford Research Institute, UC Santa Barbara, and the University of Utah. The first host-to-host message was sent on October 29, 1969: the intended word was “login,” but the system crashed after transmitting “lo.” The ARPANET was born.
The IMPs were built by Bolt, Beranek and Newman (BBN): dedicated minicomputers (Honeywell DDP-516, 12 KB of memory) that sat between a host computer and the network and served as the network’s packet switches. In today’s terms, the IMP combined what a modern network separates into a router (Layer 3 forwarding), a modem (Layer 1/2 link management), and a host interface — all in one device. The concept of protocol layers had yet to emerge; Zimmermann’s OSI model would arrive in 1980. The IMP was simply “the machine that moves packets.”
Frank Heart’s team framed their task as subnet engineering:
“The essential task is to transfer bits reliably from a source location to a specified destination.” — Heart et al., 1970 (Heart et al. 1970)
Their overriding design concern was autonomy:
“Any dependency between one IMP and another would merely broaden the area jeopardized by one IMP’s failure.” — Heart et al., 1970 (Heart et al. 1970)
What the pioneers saw: A small network (~10 nodes) of minicomputers connected by 50 Kbps leased telephone lines. Each IMP had very limited memory. The environment was benign — a cooperative research community, no competing traffic, no commercial pressure. The problem, as they understood it, was reliability: keep packets flowing even when individual machines or lines fail.
What remained invisible from the pioneers’ vantage point: The Internet would grow to millions of nodes. Commercial interests would fragment administrative control. Traffic would explode beyond any single link’s capacity. The routing algorithm that works for 10 cooperative nodes would catastrophically fail at 60. These forces were decades away — and the design reflected the environment the pioneers could see.
They chose the simplest distributed algorithm: distance-vector routing. Each IMP tells every neighbor, “I can reach destination X at cost N.” The neighbor adds its own link cost and records the result. After enough exchanges, every IMP converges to the shortest path. This is the Bellman-Ford algorithm (Bellman 1958) applied to a live network.
7.2.1 The Designer’s Decisions
Concretely, each IMP faces three decisions on every packet and every update:
- Packet arrives for destination X: Look up X in the routing table → forward to the neighbor with the lowest recorded cost. A “fast and simple table lookup procedure.”
- Neighbor J sends a distance update: “Compute delay via J to all other nodes by adding to each entry in J’s table its own delay to J.” Compare against current table. Keep the minimum.
- Link fails: Reroute packets immediately. Set cost to infinity. Tell neighbors.
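The update rule behind these decisions can be sketched in a few lines of Python. This is a minimal illustration, not the IMP’s actual code; the dict-based table and the function name are assumptions for clarity:

```python
INFINITY = float("inf")

def process_update(table, cost_to_j, j_table, j="J"):
    """Merge neighbor J's advertised distances into our routing table.

    table:     dest -> (cost, next_hop), our current belief
    cost_to_j: our own link cost to neighbor J
    j_table:   dest -> cost, as advertised by J
    Returns True if any entry improved (so we should re-advertise).
    """
    changed = False
    for dest, j_cost in j_table.items():
        candidate = cost_to_j + j_cost           # cost of reaching dest via J
        current, _ = table.get(dest, (INFINITY, None))
        if candidate < current:                  # keep the minimum
            table[dest] = (candidate, j)
            changed = True
    return changed
```

Link failure is the degenerate case: set the link cost to `INFINITY` and re-advertise. Note that nothing in the update checks whether J’s advertised path loops back through us.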
7.2.2 Invariant Analysis: Distance-Vector Routing (1969)
| Invariant | ARPANET DV Answer (1969) | Gap? |
|---|---|---|
| State | Delay estimate per destination, built from neighbor ads | Proxy measurement: queue length ≠ delay |
| Time | Updates every 128 ms | Too fast and noisy — chases transient congestion |
| Coordination | Fully distributed, “routing by rumor” | No consistency guarantee — loops possible |
| Interface | Full tables exchanged every 128 ms | Bandwidth cost scales with N |
Two invariants are answered solidly: distributed Coordination satisfies survivability, and the simple Interface is deployable. The other two have critical gaps. The State gap is the most consequential: each IMP estimates link delay by counting packets in its output queue — but queue length ignores line speed, propagation delay, and resource contention. Routes oscillate as the algorithm chases transient queue fluctuations, shifting traffic unpredictably. The Coordination gap compounds this: because each node computes from neighbors’ advertisements without checking whether the advertised path loops back through itself, two nodes can simultaneously believe they should route through each other. Packets bounce between them until discarded — the count-to-infinity problem.
7.2.3 Environment → Measurement → Belief
| Layer | What ARPANET DV has | What’s missing |
|---|---|---|
| Environment | True topology + link delays + traffic load | — |
| Measurement | Queue length sampled every 128 ms as proxy for delay | Queue length ignores line speed, propagation delay, and resource contention. Systematically wrong under both light and heavy load. |
| Belief | Distance vector: best cost to each destination via each neighbor | No topology model — only distances. Cannot detect self-loops. |
The Belief gap is the root cause of count-to-infinity: a node receiving a distance advertisement has no way to check whether the advertised path passes through itself. It accepts a route that loops back through its own position — and advertises that looping route to its neighbors, who accept it in turn.
The IMP’s design reflects all three design principles in their earliest form. The IMP is a monolithic device — routing, forwarding, and host interface entangled in one machine, with no disaggregation of concerns. Survivability forces decision placement to be fully distributed — each IMP decides independently. And the routing exchange is a closed-loop feedback system (measure → compute → advertise → neighbors recompute → they advertise back) with a 128 ms loop period — but one that is potentially divergent, widening the E-M-B gap instead of narrowing it under certain topologies.
7.2.4 “The gaps didn’t matter… yet.”
The ARPANET had ~50 nodes and moderate traffic. Distance-vector worked for a decade. The proxy measurement was inaccurate but tolerable. Routing loops occurred but resolved quickly at small scale. The system was, in Baran’s framing, “highly inefficient but extremely stable” — acceptable when survivability was the binding constraint.
7.2.5 Disaggregation in Action: The IMP Splits into TCP and IP (1974)
Notice that Heart’s IMP provided both routing and reliable delivery. The IMP guaranteed hop-by-hop reliability: “Once an IMP has accepted a packet and returned a positive acknowledgment, it holds onto that packet tenaciously until it in turn receives an acknowledgment from the succeeding IMP.” The IMP even reassembled messages and delivered them in order to the host. In modern terms, the IMP conflated the transport layer (reliable delivery) with the network layer (routing and forwarding) inside a single device.
In 1974, Cerf and Kahn proposed a radical disaggregation (Cerf and Kahn 1974). To interconnect dissimilar networks (ARPANET, packet radio, satellite), every intermediate gateway would need to guarantee reliability — an impossible coordination requirement across heterogeneous systems. Their solution: make the network layer intentionally unreliable. Gateways (routers) would only forward packets — “the responsibility for properly routing data” — while a new endpoint protocol, TCP, would handle reliability end-to-end: “End-to-end restoration procedures are desirable to allow complete recovery.” The proposal took nearly a decade to deploy: TCP/IP replaced the ARPANET’s original Network Control Program on January 1, 1983 — a coordinated cutover known as “Flag Day.”
This is the disaggregation principle in its most consequential form. Separating transport from routing created two independent evolutionary paths:
- Routing could evolve from distance-vector to link-state to BGP to SDN — each change affecting only how packets are forwarded, without breaking reliable delivery.
- Transport could evolve from Tahoe to Reno to CUBIC to BBR (Chapter 4) — each change affecting only endpoint behavior, without modifying routers.
The cost of disaggregation: TCP must infer network congestion from endpoint observations alone (ACK timing, loss events), because IP’s interface deliberately hides router state. This gap — the same Environment-Measurement-Belief gap from Chapter 4 — is the price of the separation. Saltzer, Reed, and Clark formalized this tradeoff a decade later (Saltzer et al. 1984): functions requiring application knowledge belong at endpoints. Internal reliability measures are “performance enhancements, never requirements for correctness.”
The rest of this chapter traces the routing side of this disaggregation. Transport evolved independently in Chapter 4.
Then the network grew.
7.3 Act 2: “It’s the Late 1970s. The ARPANET Is Oscillating.”
By the late 1970s, the ARPANET had grown to ~60 nodes. Distance-vector routing, which had worked reliably for a decade, became unstable. Links oscillated between congested and idle as the routing algorithm chased transient queue-length fluctuations. Count-to-infinity loops persisted for minutes. The gaps that were invisible at the original 4 nodes, and tolerable at 50, became catastrophic at 60. In May 1979, the ARPANET switched to a new routing algorithm — the old distance-vector program was removed entirely.
McQuillan diagnosed three “serious defects,” each mapping to a different invariant:
7.3.1 Which Invariant Broke?
| Invariant | What broke | McQuillan’s diagnosis |
|---|---|---|
| State (Measurement) | Queue-length proxy was systematically wrong | “The length of an output queue is only one of many factors that affect a packet’s delay. A measurement procedure that takes into account only one such factor cannot give accurate results.” |
| Coordination (Consistency) | Distributed iteration couldn’t guarantee consistent routes | “With such a scheme, it is difficult to ensure that routes used by different nodes are consistent” — leading to “wild oscillations” and “long-term loops.” |
| State (Scalability) | Routing tables grew with network size | “Routing packets would become correspondingly larger (or more frequent) as the ARPANET grows to 100 or more nodes.” |
7.3.2 Why Count-to-Infinity Is a Divergent Feedback Loop
Count-to-infinity reveals a fundamental closed-loop failure. The routing loop is supposed to narrow the E-M-B gap (converge Belief toward Environment). But when a destination becomes unreachable:
- Node A believes it can reach X via B at cost 2
- B’s link to X fails. B sets cost to infinity.
- Before B’s update reaches A, A tells B: “I can reach X at cost 2.”
- B has no way to detect that A’s path goes through B. B records: “Reach X via A at cost 3.”
- A hears B’s new cost and updates: “Reach X via B at cost 4.”
- The two nodes bounce the route, incrementing cost by 1 each round, until both reach infinity.
The measurement signal (neighbor’s distance) is contaminated by the loop’s own output. A’s belief feeds into B’s measurement, which feeds back into A’s belief. This is positive feedback — the loop amplifies the E-M-B gap instead of reducing it. At 30-second update intervals (RIP (Hedrick 1988)), a 15-hop network takes up to 8 minutes to converge. During those 8 minutes, packets destined for X bounce between A and B until their TTL expires.
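The bounce can be simulated directly. A toy two-node model (the function is hypothetical; `INFINITY = 16` follows RIP’s convention that hop count 16 means unreachable):

```python
INFINITY = 16  # RIP's conventional "infinity": hop count 16 = unreachable

def count_to_infinity(a_cost=2):
    """Simulate two nodes bouncing a dead route until both give up.

    After B's link to X fails, B's only remaining 'route' to X is via A,
    and A's only route is via B: each accepts the other's cost plus 1.
    Returns the number of update exchanges until both hit INFINITY.
    """
    b_cost = INFINITY            # B's direct link to X just failed
    exchanges = 0
    while a_cost < INFINITY or b_cost < INFINITY:
        b_cost = min(INFINITY, a_cost + 1)  # B accepts A's stale route
        a_cost = min(INFINITY, b_cost + 1)  # A updates its path via B
        exchanges += 1
    return exchanges
```

Starting from cost 2, the pair needs 8 exchanges to reach infinity; at RIP’s 30-second update interval that is four minutes of looping traffic for a single dead destination, and real topologies explore many more stale paths.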
7.3.3 McQuillan’s Redesign: Link-State Routing
McQuillan’s insight: if every node had the same complete topology database, they would compute the same shortest paths. Replace rumors with shared truth. His redesign applied all three design principles:
He applied disaggregation by separating the topology database from the forwarding table — two distinct data structures, connected by SPF (Shortest Path First) computation (Dijkstra 1959). This is the first separation of concerns within routing. He applied closed-loop reasoning with a critical fix: replacing the noisy queue-length proxy with direct delay measurement averaged over 10 seconds, making the measurement signal independent of the computation. The loop structure remains the same (measure → disseminate → compute → repeat), but now it converges because the measurement (LSA, Link-State Advertisement) carries direct observations — breaking the positive feedback that plagued DV. And decision placement remains distributed — survivability still demands it — but nodes now compute on “identical databases,” agreeing without negotiation.
| What changed | Distance-Vector (before) | Link-State (after) |
|---|---|---|
| State (what to believe) | Neighbor distances — O(N) partial state, built from rumors | Full topology database — O(N²) complete state, “identical databases maintained at all nodes” |
| Measurement (what to observe) | Queue length proxy, every 128 ms | Direct delay measurement, averaged over “10 seconds” — “a significant departure” |
| Coordination (how to agree) | Iterative relaxation via neighbor exchange (“routing by rumor”) | Reliable flooding: each node sends only its own link costs; all nodes run “the exact same algorithm on an identical data base” |
| Time (how fast) | 128 ms updates, unbounded convergence | Updates propagate within “100 msec”; convergence bounded by flooding time + SPF computation |
Why does link-state converge while distance-vector can diverge? The critical structural difference: in link-state, the measurement signal (LSA) is independent of the computation. Each node reports only its own directly-measured link costs. These measurements are decoupled from any other node’s belief. The circular dependency is eliminated — the positive feedback loop is broken.
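McQuillan’s calculation step is Dijkstra’s algorithm run locally on the shared database. A compact sketch, assuming a dict-of-dicts topology format for illustration:

```python
import heapq

def spf(topology, source):
    """Dijkstra's shortest-path-first over a shared topology database.

    topology: node -> {neighbor: link_cost}, identical at every node
    because it is assembled from each node's self-reported (flooded)
    link measurements. Returns dest -> shortest-path cost from source.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already improved
        for v, w in topology.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Because every node holds the same `topology` and runs the same deterministic computation, their shortest-path trees agree without any negotiation.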
7.3.4 Environment → Measurement → Belief After the Fix
| Layer | DV (before) | LS (after) | Gap closed? |
|---|---|---|---|
| Environment | True topology + delays | Same | — |
| Measurement | Queue-length proxy (noisy, indirect) | Direct delay measurement (accurate, averaged over 10s) | Yes — measurement tracks environment accurately |
| Belief | Distance vector (partial, inconsistent across nodes) | SPF shortest-path tree (complete, identical across nodes) | Yes — belief derived from shared, accurate measurement |
7.3.5 OSPF: Production Codification (1989–1998)
OSPF (RFC 2328) codified link-state for production IP networks (Moy 1998). OSPF applied disaggregation further through areas — partitioning the topology database into zones. Within an area, every router holds complete state (every link, every cost). Across area boundaries, Area Border Routers summarize: they advertise only “destination X is reachable at cost Y” without revealing the internal topology that produces that cost. This manages a fundamental scaling problem: in a flat link-state network with N routers, every router stores information about every link — and the number of links grows as N², consuming memory and flooding bandwidth. Areas keep the per-router state manageable by limiting full topology knowledge to local scope.
OSPF also refined decision placement through the Designated Router (DR) on multi-access segments (like an Ethernet LAN with many routers). Without a DR, every pair of routers on the segment forms a separate relationship and exchanges topology updates — N routers produce N×(N-1)/2 pairs, each flooding independently. The DR consolidates: one elected router collects updates from all others and redistributes a single copy, reducing the flooding traffic from quadratic to linear.
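The flooding arithmetic is worth making explicit. A sketch (ignoring OSPF’s Backup Designated Router, which adds a second linear term):

```python
def adjacencies_full_mesh(n: int) -> int:
    """Without a DR: every router pair on the segment forms an adjacency."""
    return n * (n - 1) // 2

def adjacencies_with_dr(n: int) -> int:
    """With a DR: each of the other n-1 routers pairs only with the DR."""
    return n - 1
```

On a 20-router segment that is 190 adjacencies without a DR versus 19 with one, and the gap widens quadratically as routers are added.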
7.4 Act 3: “It’s 1985. Ethernet LANs Keep Crashing from Loops.”
Acts 1 and 2 traced routing at what we now call Layer 3 — the network layer, where IP addresses identify destinations and routers forward packets hop by hop across wide-area networks. But the forwarding problem also exists at Layer 2 — the data link layer, where Ethernet MAC addresses identify devices on a local network and bridges forward frames between LAN segments. The loop-free forwarding problem is structurally identical: a packet must reach its destination without cycling forever. But the constraints are different, and the different constraints produce a fundamentally different design.
Why detour to Layer 2? Because Perlman’s spanning tree protocol is the inverse of McQuillan’s link-state solution. Both solve loop-free forwarding. McQuillan’s approach: give every node full topology knowledge, then compute loop-free paths. Perlman’s approach: give nodes almost no topology knowledge, and instead constrain the topology itself to be loop-free. The contrast reveals how the binding constraint — not the problem — determines the design.
Radia Perlman faced a different environment from the ARPANET engineers. Her problem was connecting multiple Ethernet LAN segments with bridges — simple devices that forward Ethernet frames between segments based on MAC addresses. Bridges were cheap, required no configuration, and were invisible to end stations. But redundant links between bridges created forwarding loops:
“Switching loops resulting in broadcast radiations and MAC table instability.”
Perlman’s binding constraint was Interface transparency: bridges must allow stations to participate in the network without any modification or awareness. In her own words:
“Transparent to stations, and thus allows a station to participate in an extended LAN with no modification.” — Perlman, 1985 (Perlman 1985)
What Perlman saw: Campus networks with a few dozen bridges connecting Ethernet segments. Bridges were “dumb” devices — far simpler than routers (which understood IP addresses and ran routing protocols). The designers of bridges deliberately avoided complexity: no configuration, no IP addresses, no routing protocols. Just plug in a cable and it works.
What the transparency constraint forced: Link-state requires every node to maintain a full topology database — expensive in memory and computation. Bridges in 1985 were simple forwarding devices with far fewer resources than routers. The transparency constraint further demanded that the solution work without any change to end-station behavior — stations must remain entirely unaware of bridges.
7.4.1 Perlman’s Inverse Insight
Instead of distributing full topology knowledge (McQuillan’s approach), constrain the active topology to a loop-free tree. A distributed root-bridge election prunes the mesh into a spanning tree. Every bridge agrees on a single root (the bridge with the lowest ID), computes its shortest path to the root, and disables all ports outside that shortest path. The result: a tree topology where exactly one path exists between any two points. Loops are impossible — the topology itself is pruned to prevent them, without requiring any node to know the full graph.
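The outcome of election-and-pruning can be sketched as one computation over the bridge graph. This is a simplification: real STP is distributed, exchanges BPDUs, and breaks ties by bridge and port IDs; the sketch simply keeps each bridge’s least-cost path toward the elected root:

```python
import heapq

def spanning_tree_ports(links, bridges):
    """Sketch of STP's result: elect the lowest-ID bridge as root, then
    keep only each bridge's least-cost link toward the root.

    links: {(bridge_a, bridge_b): cost}, undirected
    Returns the set of links left forwarding; all others are blocked.
    """
    root = min(bridges)                      # lowest bridge ID wins election
    adj = {}
    for (a, b), c in links.items():
        adj.setdefault(a, []).append((b, c))
        adj.setdefault(b, []).append((a, c))
    dist, parent = {root: 0}, {}
    heap = [(0, root)]
    while heap:                              # Dijkstra from the root
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, c in adj.get(u, []):
            if d + c < dist.get(v, float("inf")):
                dist[v] = d + c
                parent[v] = u
                heapq.heappush(heap, (d + c, v))
    # Each non-root bridge forwards only on the link to its parent
    return {tuple(sorted((v, p))) for v, p in parent.items()}
```

For three bridges in a triangle, the link between the two non-root bridges ends up blocked: exactly N−1 links forward, so loops are structurally impossible.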
7.4.2 Invariant Analysis: Spanning Tree Protocol (1985)
| Invariant | STP Answer (1985) | Gap? |
|---|---|---|
| State | Minimal: root ID, cost to root, designated port | Solid — constant per bridge regardless of network size |
| Time | 30–50 second convergence | Critical: all forwarding halts during reconvergence |
| Coordination | Distributed root election via BPDUs | Solid for LANs |
| Interface | Transparent — invisible to end stations | Blocks all redundant links, wasting bandwidth |
The Time gap is the most operationally painful: when a link fails or a new bridge joins, ALL forwarding through affected ports halts for 30–50 seconds while bridges re-elect a root and recompute the tree. During this window, packets are silently dropped. The Interface trade-off is equally significant: STP disables every redundant link, keeping only one active path between any two points. If a network has 10 paths between buildings A and B, STP blocks 9 of them — traffic flows on a single path regardless of load, while the blocked links sit idle.
Perlman’s design applied decision placement as distributed — the transparency constraint rules out centralized control, so bridges elect a root via BPDU exchange. She applied closed-loop reasoning through BPDU exchange → root election → port blocking, though the loop converges slowly (passive timers, no active feedback — RSTP later applied active “proposal/agreement” feedback to achieve sub-second convergence). And the design reflects a specific disaggregation choice: STP operates at L2, completely separate from L3 routing, enabling each layer to evolve independently — but also preventing L2 and L3 from coordinating their topologies.
7.4.3 “The gaps didn’t matter… yet.”
Campus LANs were small — a few dozen bridges, a handful of redundant links. Losing a link for 30 seconds while STP reconverged was annoying but survivable, because alternative paths were rare and traffic loads were modest.
Then data centers arrived. Modern data centers use fat-tree (Clos) topologies — massive multi-stage switch fabrics with hundreds of equal-cost paths between any two servers. The whole point of a fat-tree is that every path carries traffic simultaneously, providing aggregate bandwidth that grows with the number of switches. STP blocks every redundant path and keeps only one — eliminating exactly the bandwidth that fat-tree topologies exist to provide. A fat-tree network running STP operates at a fraction of its capacity, with most of its links idle.
This mismatch drove multiple attempts to replace STP in data centers. TRILL (Perlman et al. 2011) and Shortest Path Bridging (SPB) brought link-state routing to Layer 2, enabling multipath forwarding while preserving transparent bridging semantics. But the solution that ultimately prevailed was simpler: VXLAN (Mahalingam et al. 2014) encapsulates Layer 2 Ethernet frames inside Layer 3 UDP/IP packets, tunneling L2 traffic over an L3-routed core. The core network runs standard IP routing (OSPF/BGP) with all paths active. The L2 illusion — flat addressing, VM mobility — is maintained only at the edge, where VXLAN tunnel endpoints translate between L2 frames and L3 packets. The core remains stateless and fully utilized.
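The encapsulation itself is small: an 8-byte VXLAN header (RFC 7348) in front of the original Ethernet frame, carried over UDP port 4789. A sketch of just the header construction (the function name is hypothetical):

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned UDP destination port for VXLAN

def vxlan_encapsulate(inner_l2_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header (RFC 7348 layout) to an L2 frame.

    Layout: flags (1 byte, 0x08 = VNI-valid), 3 reserved bytes,
    24-bit VXLAN Network Identifier, 1 reserved byte. The result is
    then carried as UDP payload inside an outer IP packet that the
    L3 core routes like any other.
    """
    header = struct.pack("!B3s3sB",
                         0x08,                      # 'I' flag: VNI is valid
                         b"\x00\x00\x00",           # reserved
                         vni.to_bytes(3, "big"),    # 24-bit VNI
                         0)                         # reserved
    return header + inner_l2_frame
```

The outer IP header, added next, is all the L3 core ever inspects; the inner frame, VNI and all, travels as opaque payload until the far tunnel endpoint strips the header.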
7.5 Act 4: “It’s the Early 1990s. The Internet Commercializes. Shortest-Path Breaks.”
Acts 1–3 shared a common assumption: all nodes in the network have the same goal — find the shortest path (or in STP’s case, the loop-free path). OSPF routers all want to minimize delay. STP bridges all want a single tree. The cooperative ARPANET research community had no reason for nodes to disagree about what “best” means.
The ARPANET was formally decommissioned in 1990. The NSFNET, which had served as the research backbone since 1986, lifted its ban on commercial traffic in 1991 and was itself decommissioned in 1995. The Internet transformed from a cooperative research community into a federation of Autonomous Systems (ASes) — independently operated networks (ISPs, universities, corporations) connected by commercial contracts. Each AS had its own business interests: a transit provider wants to carry paying customers’ traffic but not freeloaders’. A content provider wants to peer directly with eyeball networks to avoid transit fees. A government network wants to avoid routing through certain countries.
The meta constraint shifted. The binding constraint moved beyond consistency (McQuillan’s problem) and transparency (Perlman’s problem) to commercial sovereignty — each AS must independently define its routing policy without revealing its internal topology or surrendering control over path selection.
What the BGP pioneers saw: A network that had grown from a single backbone to a mesh of competing providers. The previous interdomain protocol (EGP) assumed a tree-structured Internet with a single backbone. That assumption was now false:
EGP “could not handle the transition from a tree-structured to a mesh-structured Internet.”
What commercial sovereignty demanded: OSPF-style link-state routing is structurally incompatible with autonomous systems. First, flooding the full topology of every AS to every other AS is prohibitively expensive — “flooding overhead makes them generally inappropriate for inter-AS use.” Second, and more fundamentally, ASes are “unwilling to reveal their local policies to others” — the interface must hide internal state. Link-state’s core assumption — every node sees the same complete topology — directly contradicts commercial sovereignty.
7.5.1 Invariant Analysis: BGP (Rekhter et al. 2006) (1993–Present)
| Invariant | BGP Answer | Gap? |
|---|---|---|
| State | Per-prefix, per-peer RIB with policy attributes (AS_PATH, LOCAL_PREF, MED, Communities) | ~1M prefixes globally; incompressible because each AS applies different preferences |
| Time | MRAI (Minimum Route Advertisement Interval) = 30s, hold timer = 90s, convergence = 3–15 min | Deliberately slow: stability over speed |
| Coordination | Autonomous — each AS decides independently | NP-complete in general (Griffin); stable under commercial hierarchy (Gao-Rexford) |
| Interface | UPDATE message encodes business logic, hides internal topology | No origin validation — any AS can announce any prefix |
What the gaps mean concretely: BGP’s State is a per-prefix, per-peer Routing Information Base (RIB): for every destination prefix in the Internet (~1 million today), each router stores a separate entry from each neighbor, annotated with policy attributes. The AS_PATH lists every network the route has traversed, enabling loop detection. This state is “incompressible” — each AS applies its own preferences, so routes resist aggregation.
BGP’s Time is deliberately slow. After a route withdrawal, routers explore every alternative path before giving up. Each alternative takes one MRAI interval (30 seconds) to try. With 5 alternatives, convergence takes 2.5 minutes minimum. This is intentional: rapid oscillation across the global Internet would be worse than slow convergence.
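The arithmetic is simple enough to state as code (a lower bound only; real convergence also depends on topology, iBGP structure, and timer interactions):

```python
MRAI_SECONDS = 30  # Minimum Route Advertisement Interval, eBGP default

def worst_case_convergence_minutes(alternative_paths: int) -> float:
    """Lower-bound convergence time when BGP explores each alternative
    path one MRAI interval at a time after a route withdrawal."""
    return alternative_paths * MRAI_SECONDS / 60
```

Five alternatives yield a 2.5-minute floor; well-connected prefixes with more alternatives take proportionally longer.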
BGP’s Coordination is theoretically intractable: Griffin et al. (Griffin et al. 2002) proved that determining whether a BGP configuration has any stable solution is NP-complete — there exist policy combinations where routers oscillate forever. In practice, the commercial hierarchy (customer-provider-peer) prevents these pathological cases (Gao and Rexford 2001).
BGP’s Interface lacks origin validation: any AS can announce any prefix, including prefixes it has no right to. This enabled the Pakistan/YouTube hijacking in 2008. RPKI (Lepinski and Kent 2012) adds cryptographic origin checks, but deployment remains incomplete due to collective action problems (Goldberg 2014).
7.5.2 Environment → Measurement → Belief: A Qualitative Shift
BGP’s E-M-B gap is fundamentally different from ARPANET’s. In the ARPANET, measurement was accidentally noisy (queue length was a bad proxy). In BGP, measurement is structurally filtered — neighbors deliberately hide their internal topology and show only policy-shaped route advertisements.
| Layer | ARPANET DV | BGP |
|---|---|---|
| Environment | True topology + link delays | True topology + business relationships + traffic economics |
| Measurement | Queue-length proxy (accidentally noisy) | Route advertisements (structurally filtered — neighbor shows only what they want you to see) |
| Belief | Distance vector (partial, inconsistent) | RIB: “best path according to MY policy” (per-prefix, per-peer, policy-weighted) |
| Nature of E→M gap | Engineering problem — build a better sensor | Institutional reality — incompleteness is designed in. Better sensors are irrelevant. |
A model is only as good as its input signals. When the input signals are strategically shaped rather than merely noisy, the path to stability lies in structural constraints that limit the solution space — better estimation alone is insufficient.
7.5.3 The 1993–2008 Arc: 15 Years of Cascading Crises
BGP’s story extends far beyond its creation. Five crises over 15 years revealed gaps that the original design left exposed. Each crisis is a closed-loop failure: the loop that should keep Belief (routing tables) aligned with Environment (true network state) breaks because the measurement signal is insufficient, the loop gain is wrong, or structural constraints are violated.
- 1993 — Address exhaustion (State/Interface, Fuller et al.)
- The class-based address system (Class A/B/C) was exhausting the IPv4 space and inflating routing tables. CIDR (Fuller et al. 1993) replaced rigid class boundaries with variable-length prefixes, enabling hierarchical aggregation that reduced routing table size.
- 1997–2001 — Convergence crisis (Time, Labovitz et al. 2000)
- BGP converges in 3–15 minutes after failures. “Packet loss grows by a factor of 30.” The mechanism: after a route withdrawal, routers explore every alternative path before giving up — the same structural problem as count-to-infinity, but at interdomain scale, amplified by the MRAI timer (30 seconds per alternative).
- 2001 — Stability theory (Coordination, Gao and Rexford 2001)
- The commercial hierarchy (customer > peer > provider) guarantees BGP convergence — stability emerges from market structure, not protocol mechanism. If every AS prefers customer routes over peer/provider routes, the dispute-wheel pathology is structurally prevented.
- 2002 — NP-hardness (Coordination, Griffin et al. 2002)
- Determining whether a BGP configuration has any stable solution is NP-complete. There exist policy combinations — called “dispute wheels” — where routers form circular dependencies and oscillate forever with no equilibrium. The Gao-Rexford conditions prevent these cases in practice, but the theoretical worst case is intractable.
- 2003 — Route flap damping paradox (Time → Coordination, Mao et al. 2002)
- Route flap damping (RFD), designed to suppress unstable routes, worsened convergence. RFD treated the temporary route fluctuations that naturally occur during BGP path exploration as evidence of instability, and suppressed them — for up to an hour. A Time fix misidentified normal Coordination behavior as pathological.
- ~2008 — Prefix hijacking (Interface, Pakistan/YouTube incident)
- Any AS can announce any prefix, including prefixes it has no right to. In 2008, Pakistan Telecom accidentally announced YouTube’s prefix, diverting global YouTube traffic through Pakistan for hours. RPKI (Lepinski and Kent 2012) adds cryptographic origin checks, but deployment remains incomplete — early adopters gain little benefit without widespread participation (Goldberg 2014).
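The dispute-wheel pathology that Griffin et al. proved hard to detect can be made concrete with a toy simulation. The sketch below encodes the classic three-AS "bad gadget": each AS prefers the path through its clockwise neighbor over its own direct path to the destination, violating the Gao-Rexford customer-preference condition. The synchronous update model and all AS numbers are simplifying assumptions for illustration; real BGP adds timers, MRAI, and asynchrony.

```python
# Toy simulation of the three-AS dispute wheel ("bad gadget"):
# ASes 1, 2, 3 route to destination 0. Each AS prefers the path
# through its clockwise neighbor over its direct path, which
# violates Gao-Rexford and produces permanent oscillation.

# Preference lists: earlier entries are more preferred. Paths are AS tuples.
prefs = {
    1: [(1, 2, 0), (1, 0)],   # AS1 prefers going via AS2
    2: [(2, 3, 0), (2, 0)],   # AS2 prefers going via AS3
    3: [(3, 1, 0), (3, 0)],   # AS3 prefers going via AS1
}

def step(current):
    """One synchronous round: each AS picks its most preferred path
    consistent with what its neighbor currently uses."""
    new = {}
    for asn, ranked in prefs.items():
        for path in ranked:
            nxt = path[1]
            if nxt == 0:                       # direct path: always available
                new[asn] = path
                break
            if current.get(nxt) == (nxt, 0):   # via-neighbor path: usable only
                new[asn] = path                # if the neighbor goes direct
                break
    return new

state = {1: (1, 0), 2: (2, 0), 3: (3, 0)}      # start: everyone direct
seen = []
for _ in range(8):
    state = step(state)
    seen.append(dict(state))

# The system flips between "all indirect" and "all direct" forever:
# no stable assignment exists, so the loop never converges.
oscillates = seen[-1] != seen[-2]
print("oscillates:", oscillates)
```

Running this shows the route assignments alternating with period two rather than settling, which is exactly the circular dependency that the commercial hierarchy rules out in practice.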
BGP operates exactly as designed under constraints that every interdomain protocol must face. Any interdomain protocol serving autonomous, commercially competing networks faces the same NP-hardness (Griffin). BGP works in practice because the commercial hierarchy provides structural damping (Gao-Rexford). Replacing BGP means either abandoning commercial autonomy or solving NP-hard problems in real time. Neither is feasible.
BGP’s design reflects all three principles. The community applied disaggregation by separating intradomain routing (OSPF) from interdomain routing (BGP) — each evolves independently, though hot-potato routing couples them at the boundary, sometimes catastrophically. Commercial sovereignty forces decision placement to be fully autonomous: each AS decides independently, because sovereignty forbids the consistency constraints that would require centralization. And closed-loop reasoning reveals the deepest insight: BGP’s feedback loop (advertise → select → advertise) is convergent under Gao-Rexford conditions but NP-hard in general — the stabilizing mechanism is institutional (market hierarchy), not algorithmic.
7.6 Act 5: “It’s 1996. Operators Need Traffic Engineering.”
Acts 1–4 focused on the control plane — how nodes learn paths and agree on forwarding decisions. But as the Internet grew to carry commercial traffic in the mid-1990s, a new problem emerged that was neither routing nor switching: traffic engineering. Operators discovered that shortest-path routing, however correct, left some links congested while others sat idle. The network was loop-free and convergent — McQuillan and Moy had solved that — but it was inefficient. Operators needed to steer traffic along non-shortest paths to balance load, avoid congestion, and meet Service Level Agreements.
This is the traffic engineering problem — and its conflation with routing produced a decade of operational pain.
MPLS (Multiprotocol Label Switching) (Rosen et al. 2001) promised to decouple routing from forwarding. The idea: use IP routing protocols (OSPF, BGP) to compute paths, but forward packets using short fixed-length labels instead of IP address lookups. A 32-bit shim header — carrying a 20-bit label — is inserted between the Ethernet header and the IP header; the label tells each router which pre-computed path the packet belongs to.2 Routers swap labels at each hop (fast table lookup, no longest-prefix match needed), steering traffic along explicit Label Switched Paths (LSPs) that can follow non-shortest routes. This gave operators the traffic engineering control that shortest-path routing lacked.
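Label swapping along an LSP can be sketched in a few lines. The router names, labels, and ports below are invented for illustration; the point is that each hop is a single exact-match lookup on the incoming label, and the IP header is never examined in the core.

```python
# Minimal sketch of MPLS label swapping along one pre-computed LSP.
# Each router holds a label forwarding table: in_label -> (out_label, out_port).
# All router names, labels, and ports are hypothetical.

lfib = {
    "R1": {100: (200, "port2")},   # ingress-adjacent hop: swap 100 -> 200
    "R2": {200: (300, "port1")},   # core hop: swap 200 -> 300
    "R3": {300: (None, "port3")},  # egress: None means pop the label
}

def forward(router, label):
    """One label-swap hop: O(1) exact-match lookup, no longest-prefix match."""
    out_label, out_port = lfib[router][label]
    return out_label, out_port

label = 100                        # label pushed by the ingress router
path = []
for router in ["R1", "R2", "R3"]:
    label, port = forward(router, label)
    path.append((router, port))

print(path)                        # hop-by-hop output ports along the LSP
```

Because the core lookup is an exact match on a small integer rather than a longest-prefix match on a 32-bit address, the forwarding decision is decoupled from how the path was computed — the decoupling the paragraph above describes.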
But RSVP-TE signaling “maintains signaling states on ALL devices along the path.” Every core router stores state for every tunnel traversing it. At scale, this is “operationally very expensive” with a “notorious number of MPLS RSVP-TE tunnels.” The State invariant broke again — the same failure mode as DV’s routing-table explosion, but at a different layer.
Meanwhile, Fortz and Thorup (Fortz and Thorup 2000) showed that optimizing OSPF link weights achieves “within 1–3% of optimal multicommodity flow” — challenging whether MPLS-TE was even necessary.
And routing-TE conflation created subtle failures: “OSPF distance changes DO affect BGP routes, sometimes with updates lagging by as much as a minute.” Local traffic-engineering decisions broke global reachability. This is a disaggregation failure — routing and TE were supposed to be separated by MPLS but remained coupled through IGP (Interior Gateway Protocol)-BGP interaction.
Segment Routing (Filsfils et al. 2015) resolved this by moving path state from the core to the packet header. The ingress encodes the path as an ordered list of “segments.” Core routers follow the list — no per-flow state, no LDP (Label Distribution Protocol), no RSVP-TE. “Core routers don’t need to remember the entire network topology” — they follow the “travel checklist” in the header. Sub-50ms convergence via TI-LFA (Topology-Independent Loop-Free Alternate). The State invariant moves to its most extreme position: stateless core.
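The "path state in the packet header" idea can be sketched directly. In this toy model (node names and the header layout are invented), the ingress writes the segment list once, and each core hop only reads the active segment and advances a pointer — it keeps nothing about the flow between packets.

```python
# Sketch of segment routing's core mechanism: the ingress encodes the
# explicit path as an ordered segment list in the packet; core routers
# follow it statelessly. Node names and header fields are illustrative.

def ingress_encapsulate(dest, segments):
    """Ingress: write the explicit path into the packet header."""
    return {"dest": dest, "segments": list(segments), "ptr": 0}

def core_forward(packet):
    """Core router: read the active segment, advance the pointer.
    No per-flow table entry is created anywhere in the core."""
    seg = packet["segments"][packet["ptr"]]
    packet["ptr"] += 1
    return seg          # next node to steer the packet toward

pkt = ingress_encapsulate("10.0.0.0/24", ["B", "D", "E"])
hops = [core_forward(pkt) for _ in range(len(pkt["segments"]))]
print(hops)             # the non-shortest path chosen by the ingress
```

Contrast this with the RSVP-TE sketch of state above it: here the only copy of the path lives in the packet itself, which is why the core can stay stateless no matter how many tunnels traverse it.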
7.7 Act 6: “It’s 2004. Networks Are Unmanageable.”
By the early 2000s, every problem from Acts 1–5 had been solved — at least in principle. OSPF provided convergent intradomain routing. BGP provided policy-compliant interdomain routing. MPLS provided traffic engineering. STP provided loop-free L2 forwarding. The protocols worked. But operating them together was a nightmare.
The meta constraint shifted once more. The problem had moved beyond routing correctness, policy autonomy, and traffic engineering. The network’s policy had become incomprehensible to any single human. Enterprise networks had thousands of devices, each configured independently with ACLs, VLANs, and firewall rules. Policy was scattered across boxes. Debugging was forensic — tracing a dropped packet required logging into dozens of devices and correlating configuration fragments. Misconfigurations caused outages that were difficult to diagnose and expensive to fix.
The pioneers who launched SDN framed this as a management and consistency problem — distinct from routing:
| Pioneer | Year | Their question | Their framing |
|---|---|---|---|
| Feamster et al. (Feamster et al. 2004) | 2004 | “Extracting routing control into a logically centralized Routing Control Platform” | Consistency crisis — distributed BGP configs are fragile and undebuggable |
| Casado et al. (Casado et al. 2007) | 2007 | “How could we change the enterprise network architecture to make it more manageable?” | Management crisis — VLANs and ACLs “hide the complexity, not reduce it” |
| McKeown et al. (McKeown et al. 2008) | 2008 | “How can we run experiments in our campus networks?” | Innovation crisis — network “ossification” blocks research |
Three independent diagnoses. Three different framings. But all converged on the same architectural answer: separate the control plane from the data plane, and centralize control.
7.7.1 Decision Placement Inverts
This is the most dramatic shift in the chapter. Every previous era locked Coordination as distributed — survivability (Baran), correctness (McQuillan), and commercial sovereignty (BGP) all demanded it. Now comprehensibility demands the opposite.
Casado articulated why centralization was necessary — and how it responds to a fundamentally different binding constraint than Saltzer’s end-to-end argument:
“IP’s best-effort service model is both simple and unchanging, well-suited for distributed algorithms. Network management is quite the opposite; its requirements are complex and require strong consistency, making it quite hard to compute in a distributed manner.” — Casado et al., 2007 (Casado et al. 2007)
TCP’s constraint (no central authority over flows) forced distributed placement. SDN’s constraint (policy must be globally consistent and comprehensible) forced centralized placement. Both are correct for their binding constraint.
7.7.2 Invariant Analysis: SDN / OpenFlow (2008)
| Invariant | SDN Answer (2008) | Gap? |
|---|---|---|
| State | Global NIB in controller; cached flow entries in switch TCAMs (8K–16K entries) | TCAM capacity limits fine-grained policies |
| Time | Reactive: 0.4–12 ms first-packet delay; proactive: zero | Controller becomes bottleneck under high flow churn |
| Coordination | Centralized controller; switches execute | Single point of failure (addressed by Onix, ONOS) |
| Interface | Match-action flow tables on fixed header fields3 | Fixed field set ossifies: 12 → 41 fields, still insufficient |
What these gaps mean concretely: The State gap centers on TCAM (Ternary Content-Addressable Memory) — the specialized hardware that stores flow entries in a switch. TCAMs are fast (single-cycle lookup) but small and expensive: a typical switch holds 8,000–16,000 entries. A fine-grained policy (one rule per user per application) can require hundreds of thousands of entries, far exceeding TCAM capacity. The Time gap is the fundamental cost of centralization: in reactive mode, the first packet of every new flow is forwarded to the controller, which computes a rule and installs it — adding 0.4–12 ms of latency. Under high churn (thousands of new flows per second), the controller becomes the bottleneck. The Coordination gap — a single controller is a single point of failure — is addressed by distributed controller architectures (Onix (Koponen et al. 2010), ONOS) that replicate state across machines.
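The reactive-mode cost described above is easy to see in a toy model. The sketch below (flow keys, the 4-port "topology," and the hash-based rule computation are all hypothetical) shows the essential asymmetry: only a table miss pays the controller round trip; every subsequent packet of the flow is a local lookup.

```python
# Sketch of SDN reactive mode: the switch consults its cached flow table
# (standing in for TCAM); on a miss, the first packet is punted to the
# controller, which computes and installs a rule. All values illustrative.

flow_table = {}          # (src, dst) -> output port; plays the role of TCAM
controller_hits = 0      # how many packets paid the controller detour

def controller_compute(src, dst):
    """Controller: global view; here a toy rule picking one of 4 ports."""
    return hash((src, dst)) % 4

def switch_forward(src, dst):
    global controller_hits
    key = (src, dst)
    if key not in flow_table:          # table miss: punt to controller
        controller_hits += 1           # this is the 0.4-12 ms first-packet cost
        flow_table[key] = controller_compute(src, dst)
    return flow_table[key]             # hit: line-rate local lookup

# Ten packets of the same flow: only the first consults the controller.
ports = [switch_forward("10.0.0.1", "10.0.0.2") for _ in range(10)]
print("controller consulted:", controller_hits, "times")
```

The TCAM gap also falls out of this model: `flow_table` grows by one entry per distinct key, so a policy keyed on (user, application) pairs multiplies entries far past the 8K–16K a real switch can hold.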
The SDN pioneers applied disaggregation at a new boundary — separating the control plane from the data plane across devices, the most consequential new interface since Cerf/Kahn separated transport from routing. They applied decision placement as centralized, because comprehensibility demands a global view. And the closed-loop they built (observe topology → compute rules → install in switches → observe traffic → recompute) operates with near-complete measurement — the controller sees the global topology. B4 (Jain et al. 2013) achieves a loop period of 0.3–0.8s for TE computation and 2–3x better link utilization than distributed routing. The fail-open design ensures switches maintain last known state if the controller fails — a concession to the production tension where SDN must coexist with “existing distributed routing protocols” for interoperability, resulting in a hybrid placement.
7.7.3 The BGP-SDN Bridge
SDX (Gupta et al. 2014) applied SDN to Internet Exchange Points (IXPs) — the physical locations where BGP peering happens. SDX allowed IXP participants to express fine-grained, application-specific peering policies (e.g., “peer with AS X only for video traffic”) that BGP’s attribute system cannot express. The intellectual contribution was demonstrating that SDN’s match-action abstraction could operate at the interdomain boundary, not just within a single network.
SDX became the intellectual foundation for two production systems at the world’s largest content providers. Google’s Espresso (Yap et al. 2017) applied SDN to Google’s peering edge, using programmable switches and host-based routing to steer egress traffic based on real-time performance measurements. Facebook’s Edge Fabric (Schlinker et al. 2017) built a similar system, using a centralized controller to override BGP’s performance-oblivious route selection at Facebook’s PoPs, steering traffic away from congested peering links in near real-time.
Both systems worked — within their operators’ networks. But a surprising finding emerged. Arnold, Calder, Cunha, Gupta, and Katz-Bassett (Arnold et al. 2019) examined three large-scale measurement studies across Facebook, Microsoft, and Google and found that “in all cases, performance-aware routing provided little benefit over BGP in terms of latency.” For the vast majority of traffic, BGP’s performance-oblivious route selection performed as well as or better than the sophisticated alternatives. The paper identified several reasons: most alternate paths degrade together during congestion (so there is no better alternative to switch to), direct peering means BGP’s preferred routes are already short, and providers’ massive infrastructure investments align BGP’s choices with good performance.
This result reframes the SDN-at-the-edge story. SDX, Espresso, and Edge Fabric are architecturally sound applications of the disaggregation and decision placement principles — they separate peering control from BGP’s distributed path selection and centralize it for performance optimization. But the measurement evidence suggests that the environment BGP operates in — dense peering, short paths, infrastructure investment — already produces near-optimal outcomes. The gap between BGP’s belief and the environment’s true state is smaller than the SDN pioneers assumed. Better measurement and centralized control help only at the margins.
7.8 Act 7: “It’s 2013. OpenFlow’s Fields Are Ossifying.”
OpenFlow grew from 12 to 41 match fields in a few years — “increasing the complexity of the specification while still not providing the flexibility to add new headers.” Every new encapsulation format (VXLAN for data center overlays, NVGRE for Microsoft’s virtualization, INT for in-band telemetry) required a new OpenFlow specification revision — a process that took years of standardization before vendors implemented it in hardware. The Interface invariant broke: the same ossification pattern that afflicted IP (middleboxes hardcoded assumptions about headers), MPLS (signaling protocols proliferated), and the telephone network (circuit switching prevented packet innovation).
Bosshart et al. (Bosshart et al. 2014) asked: “How should OpenFlow evolve?” Their answer: make the pipeline itself programmable — define behavior in software, compile it to hardware.
7.8.1 Invariant Analysis: P4 (2014)
| Invariant | P4 Answer (2014) | Gap? |
|---|---|---|
| State | Programmer-defined: “typed match+action tables” + stateful registers. Memory layout not vendor-fixed. | Current frontier — expressiveness limited by hardware stages |
| Time | Line-rate deterministic: “fixed clock cycles per packet” via PISA (Protocol-Independent Switch Architecture) pipeline. No loops or recursion — bounded at compile time. | Solid — line-rate processing guaranteed |
| Coordination | Controller + programmable pipeline. P4Runtime defines the coordination interface between controller and switch. | SDN model preserved, but what the controller can express is now unbounded |
| Interface | Three-stage programmable: Parser (extract headers) → Match-Action (process) → Deparser (reconstruct). “Defines the behavior” of the hardware. | The escape: protocol-independent, reconfigurable, target-independent |
P4 unlocks capabilities beyond OpenFlow’s reach. Because the programmer defines what the switch does with each packet, the switch can perform application-level computation at line rate — not just forward packets. Examples: NetCache (Jin et al. 2017) stores frequently accessed database records directly in the switch’s memory, serving cache hits in microseconds without reaching the server. In-band Network Telemetry (INT) (Kim et al. 2015) inserts per-hop measurement data (queue depth, processing delay) into packet headers as they traverse the network, giving operators hop-by-hop visibility. SwitchML (Sapio et al. 2021) performs machine learning gradient aggregation — summing model updates from hundreds of servers — inside the switch fabric at terabit speeds, bypassing the CPU bottleneck. The data plane has become a programmable computation platform — far beyond its origins as a simple forwarding engine.
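The parse → match-action → deparse structure from the invariant table can be rendered as a toy pipeline. This is a Python sketch, not P4: the two-byte header layout, the table contents, and the TTL-decrement action are all invented, but the three stages mirror how a P4 program defines behavior that a compiler then maps onto fixed hardware stages.

```python
# Toy rendering of P4's three-stage pipeline over a byte string "packet":
# parse (extract headers), match-action (process), deparse (reconstruct).
# Header layout and table contents are hypothetical.

def parse(raw):
    """Parser: extract programmer-defined header fields from raw bytes."""
    return {"dst": raw[0], "ttl": raw[1], "payload": raw[2:]}

# Match-action table: exact match on dst; action = (set egress, decrement TTL)
table = {7: ("port3", True), 9: ("port1", True)}

def match_action(hdr):
    """Match-action stage: look up dst, apply the bound action."""
    egress, dec_ttl = table.get(hdr["dst"], ("drop", False))
    if dec_ttl:
        hdr["ttl"] -= 1
    return hdr, egress

def deparse(hdr):
    """Deparser: reassemble the (possibly modified) headers into bytes."""
    return bytes([hdr["dst"], hdr["ttl"]]) + hdr["payload"]

hdr, egress = match_action(parse(bytes([7, 64]) + b"data"))
out = deparse(hdr)
print(egress, out)     # chosen egress port and rewritten packet (TTL 64 -> 63)
```

The contrast with OpenFlow is in who defines `parse` and `table`'s key structure: under OpenFlow they were fixed by the specification (the 12 → 41 field ossification); under P4 they are part of the program.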
eBPF/XDP extends this to the kernel: programmable packet processing at the earliest point in the Linux network stack. SmartNICs (Firestone et al. 2018) push it to every server’s NIC. The escape from interface ossification is total: every layer of the stack is now programmable.
7.9 The Grand Arc: From Baran to P4
7.9.1 The Evolving Anchor
| Era (Year) | Meta Constraint → Coordination Lock | State and Measurement | Time and Interface |
|---|---|---|---|
| ARPANET DV (1969) | Survivability → distributed (rumor) | Distance vectors, O(N), queue-length proxy | 128 ms updates; full table exchange |
| ARPANET LS (1980) | Consistency → distributed (shared truth) | Full topology, O(N²), direct delay measurement | 10s measurement, 100ms propagation; LSA flooding |
| STP (1985) | L2 transparency → distributed election | Minimal: root/cost/port | 30–50s convergence; transparent to hosts |
| BGP (1993) | Commercial sovereignty → autonomous | Policy-encoded RIBs, filtered measurement | MRAI 30s, 3–15 min convergence; hides topology |
| MPLS/SR (1996–2015) | Traffic engineering → distributed + explicit paths | Per-flow labels → stateless core | RSVP 30s refresh → TI-LFA sub-50ms |
| SDN (2004–2013) | Policy comprehensibility → centralized | Global NIB in controller | 0.4–12ms reactive; fixed match-action fields |
| P4 (2014) | Interface flexibility → centralized + programmable | Programmer-defined tables | Line-rate deterministic; programmable pipeline |
7.9.2 Three Design Principles Applied Across the Arc
The three design principles are fixed analytical tools. They do not change. What changes is the problem — the nature of the questions being asked, driven by evolving constraints. Designers apply the same principles to each new problem. The solutions differ because the problems differ.
Disaggregation is applied whenever a monolithic system’s concerns become too entangled to manage. Each era’s designers identified a different pair of concerns to separate:
- 1974: Cerf/Kahn separated transport (reliable delivery) from routing (forwarding) — because interconnecting heterogeneous networks required an unreliable network layer.
- 1989: The community separated intradomain routing (OSPF) from interdomain routing (BGP) — because commercial sovereignty required hiding internal topology.
- 2008: McKeown separated the control plane from the data plane across devices — because policy comprehensibility required a global view unavailable to distributed routers.
- 2014: Bosshart separated pipeline definition from hardware implementation — because protocol evolution outpaced the fixed OpenFlow specification.
Each application of disaggregation creates a new interface between the separated concerns. Each interface eventually ossifies — and the next generation applies disaggregation again at a different boundary.
Closed-loop reasoning is applied in every era to bridge the Environment-Measurement-Belief gap. The principle is the same — build a feedback loop that keeps Belief aligned with Environment through Measurement. What differs is the measurement quality available to each generation’s designers:
- 1969: DV designers applied closed-loop reasoning with a noisy proxy measurement (queue length). The loop diverged (count-to-infinity) because the measurement was contaminated by the loop’s own output.
- 1980: McQuillan applied the same principle with direct delay measurement. The loop converged because the measurement signal became independent of the computation.
- 1993: BGP designers applied closed-loop reasoning with structurally filtered measurement (policy-shaped advertisements). The loop converges under commercial hierarchy constraints (Gao-Rexford) — structural damping from the market, applied externally to the protocol.
- 2008: SDN designers applied closed-loop reasoning with near-complete measurement (controller’s global view). The loop period dropped to controller latency (0.3–0.8s for B4’s TE computation).
Decision placement is applied based on each era’s binding constraint. The principle — “the tighter the constraint on consistency, the more centralized” — is constant. What differs is the constraint:
- Survivability (1969) demands distributed placement — centralized control is a single point of failure.
- Correctness (1980) demands distributed placement with shared state — every node runs the same computation on the same data.
- Commercial sovereignty (1989) demands autonomous placement — each AS decides independently.
- Policy comprehensibility (2004) demands centralized placement — a global view is essential for consistent policy enforcement.
- Operational reality (2015) demands hybrid placement — centralized TE with distributed baseline forwarding (segment routing + SDN controller).
7.9.3 The Dependency Chain
7.9.4 Pioneer Diagnosis Table
| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 1964 | Baran | (Meta constraint) | Centralized networks are fatally vulnerable | Proof that distributed mesh with 3–4 links/node survives → locks Coordination as distributed |
| 1970 | Heart/BBN | — | (Built first implementation) | IMP with distance-vector routing — the simplest distributed answer |
| 1980 | McQuillan | State + Time | Queue-length proxy wrong; routing-by-rumor inconsistent | Link-state: direct measurement + full topology + SPF |
| 1985 | Perlman | Interface | L2 transparency forbids full topology distribution | Spanning tree: constrain topology instead |
| 1993 | Fuller/Rekhter | State/Interface | Class-based addressing exhausted | CIDR: variable-length prefixes + BGP-4 |
| 1997–2001 | Labovitz | Time | BGP converges in 3–15 minutes; path exploration | MRAI timer analysis; convergence measurement |
| 2001 | Gao & Rexford | Coordination | BGP stability from market hierarchy, not protocol | Structural damping conditions: customer > peer > provider |
| 2002 | Griffin | Coordination | BGP stability is NP-complete; dispute wheels | Theoretical hardness proof |
| 2003 | Mao | Time → Coordination | Route flap damping worsens convergence | RFD paradox: Time fix misidentified normal Coordination behavior |
| 2004 | Feamster | Coordination | Distributed BGP configs are fragile | RCP: centralize routing control |
| 2007 | Casado | Coordination | Enterprise policy is unmanageable when distributed | Ethane: centralized policy over flow-based forwarding |
| 2008 | McKeown | Interface | No way to experiment on production networks | OpenFlow: standardized switch interface |
| 2014 | Bosshart | Interface | OpenFlow’s fixed fields ossified (12→41) | P4: programmable parse-match-deparse pipeline |
| 2015 | Filsfils | State | RSVP-TE per-flow state explodes at scale | Segment Routing: path state in packet header, stateless core |
7.9.5 Innovation Timeline
flowchart TD
subgraph sg1["Foundations (1958–1980)"]
A1["1958 — Bellman: dynamic programming for shortest paths"]
A2["1959 — Dijkstra: SPF algorithm"]
A3["1964 — Baran: distributed packet switching for survivability"]
A4["1969 — ARPANET IMPs with distance-vector routing"]
A5["1974 — Cerf & Kahn: TCP/IP and the gateway concept"]
A6["1978 — ARPANET routing oscillates — count-to-infinity crisis"]
A7["1980 — McQuillan: link-state replaces distance-vector"]
A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7
end
subgraph sg2["Scaling and Policy (1985–2003)"]
B1["1985 — Perlman: Spanning Tree Protocol for loop-free L2"]
B2["1988 — RIP (RFC 1058): distance-vector for IP"]
B3["1989 — OSPF (RFC 1131) + BGP-1 (RFC 1105)"]
B4["1993 — CIDR (RFC 1519) + BGP-4 (RFC 1771)"]
B5["1997 — Labovitz: BGP convergence takes 3–15 min"]
B6["2001 — Gao-Rexford stability proof; MPLS (RFC 3031)"]
B7["2002 — Griffin: BGP stability is NP-complete"]
B8["2003 — Mao: route flap damping paradox"]
B1 --> B2 --> B3 --> B4 --> B5 --> B6 --> B7 --> B8
end
subgraph sg3["SDN Revolution (2004–2015)"]
C1["2004 — Feamster: Separating Routing from Routers"]
C2["2007 — Casado: Ethane (centralized enterprise control)"]
C3["2008 — McKeown: OpenFlow; Pakistan/YouTube hijack"]
C4["2010 — Onix: distributed SDN controller"]
C5["2012 — RPKI (RFC 6480): incremental origin validation"]
C6["2013 — Google B4: production SDN WAN"]
C7["2014 — P4: programmable data planes; SDX at IXPs"]
C8["2015 — Segment Routing; Jupiter Rising (Google DC)"]
C1 --> C2 --> C3 --> C4 --> C5 --> C6 --> C7 --> C8
end
subgraph sg4["Current Frontier (2017+)"]
D1["2017 — Google Espresso: SDN at peering edge"]
D2["2018 — eBPF/XDP: kernel-level programmable packets"]
D3["2018 — Azure SmartNICs: FPGA offload at hyperscale"]
D1 --> D2 --> D3
end
sg1 --> sg2 --> sg3 --> sg4
7.10 Generative Exercises
The Mars-Earth link has a one-way propagation delay of 3–22 minutes depending on orbital position. Design a routing system for a Martian colony network that must interoperate with Earth’s Internet.
Which invariants change? What does “convergence” mean when the feedback loop’s period is 44 minutes? Can you use link-state routing between Mars and Earth, or does the timescale force something else? What decision placement makes sense — centralized on Mars, distributed between planets?
BGP assumes honest path advertisements. Consider a network where some ASes deliberately inject false routes (prefix hijacking, path forgery). Which invariant’s answer breaks first? RPKI validates origin but not path — what would full path validation require? Apply the E-M-B framework: if Measurement is adversarially corrupted, what kind of Belief model can still be useful?
A fleet of delivery drones forms an ad-hoc mesh network. Topology changes every few seconds as drones move. Link-state flooding would be constant. Distance-vector would never converge. What routing design is needed? Which invariant is the binding constraint? How does the loop bandwidth (how fast routing can track changes) compare to the environment’s rate of change?
This chapter is part of “A First-Principles Approach to Networked Systems” by Arpit Gupta, UC Santa Barbara, licensed under CC BY-NC-SA 4.0.
In distance-vector routing, if node A routes to destination D through node B, and node B’s path to D fails, B may accept A’s advertisement of a path to D — creating a loop. Both nodes increment their distance to D through each other, counting toward infinity. The loop persists until the distance exceeds the protocol’s infinity threshold (typically 16 in RIP).↩︎
An MPLS label is a 20-bit integer inserted between the L2 and L3 headers (the “shim” header). Each router swaps the label based on a label forwarding table, enabling path control without examining the IP header. The label stack allows nested tunnels.↩︎
OpenFlow defines a match-action abstraction: each flow entry matches on packet header fields (source/destination IP, port, VLAN, etc.) and specifies an action (forward to port, drop, modify header, send to controller). The original OpenFlow 1.0 supported 12 match fields; OpenFlow 1.3 expanded to 41.↩︎