```mermaid
flowchart TD
A["<b>1969: Survivability</b><br/>Locks Coordination = distributed"] --> B["State: distance vectors<br/>Time: 128ms noisy updates"]
B --> C["<b>FAILURE: Oscillations at scale</b><br/>Measurement proxy wrong + no consistency"]
C --> D["<b>1980: McQuillan redesign</b><br/>State → full topology<br/>Time → 10s direct measurement"]
D --> E["<b>1989: Internet commercializes</b><br/>Relocks Coordination = autonomous"]
E --> F["State: policy-encoded RIBs<br/>Time: deliberately slow (MRAI 30s)"]
F --> G["<b>1997-2003: BGP crises</b><br/>Convergence 3-15 min, NP-hard,<br/>RFD paradox, prefix hijacking"]
G --> H["<b>2004: Comprehensibility crisis</b><br/>Relocks Coordination = centralized"]
H --> I["State: global NIB<br/>Interface: match-action flow tables"]
I --> J["<b>FAILURE: Interface ossifies</b><br/>OpenFlow 12→41 fields, still not enough"]
J --> K["<b>2014: P4</b><br/>Interface → programmable pipeline"]
style C fill:#EE6677,color:#fff
style G fill:#EE6677,color:#fff
style J fill:#EE6677,color:#fff
style A fill:#4477AA,color:#fff
style D fill:#228833,color:#fff
style E fill:#4477AA,color:#fff
style H fill:#4477AA,color:#fff
style K fill:#228833,color:#fff
```
7 Routing, Switching, and Programmable Networks
7.1 The Anchor: Forwarding Under Incomplete Information
Every routing system solves the same problem: a packet arrives with a destination address, and the node must choose an output port. The forwarding decision is trivial — a table lookup. Building and maintaining that table is the hard part. The network graph changes — links fail, nodes join, traffic shifts — but forwarding must continue. If two nodes disagree about the path to the same destination, packets loop or vanish.
The binding constraint is incomplete information about a graph you don’t fully control. Every node sees only its local neighborhood. Every routing design is a strategy for making forwarding decisions with incomplete, delayed, and potentially inconsistent information about a changing topology. This constraint is constant from 1964 to the present. What evolves is why the information is incomplete, what information matters, and who lacks control.
The intellectual origins of this constraint trace to 1964, when Paul Baran at RAND addressed a survivability problem:
“Let us consider the synthesis of a communication network which will allow several hundred major communications stations to talk with one another after an enemy attack.” — Baran, 1964 (Baran 1964)
Baran proved that a distributed mesh with redundancy levels of only three or four links per node survives heavy attack. His contribution was not a routing algorithm — it was a proof that the network must function without any central point of control. This survival requirement became the meta constraint that shaped the first generation of routing: Coordination must be distributed.
Routing must continuously answer four decision problems — the routing analogues of TCP’s “when to send, what to send, how much to send”:
- What to measure about the network? (What signals can I observe about the topology?)
- What to tell my neighbors? (What information do I share, and in what form?)
- How to compute the forwarding table? (What algorithm turns measurements into forwarding decisions?)
- When to update? (How fast do I react when conditions change?)
Looking back, John McQuillan — who redesigned ARPANET routing in 1980 — formalized these as a tripartite “routing procedure” (McQuillan et al. 1980):
“(1) a measurement process for determining pertinent network characteristics, (2) a protocol for disseminating information about these characteristics, and (3) a calculation to determine how traffic should be routed.”
This framing maps directly to the four invariants — measurement is State, dissemination is Interface, calculation requires Coordination, and the entire procedure operates on a Time axis. McQuillan articulated this structure in retrospect, after debugging the failures of the first routing system. The pioneers who built that first system in 1969 had no such framework — they were solving an engineering problem from scratch.
7.2 Act 1: “It’s 1969. The First Four Nodes Come Online.”
In September 1969, the first Interface Message Processor (IMP) was installed at UCLA. By December, four nodes were connected — UCLA, Stanford Research Institute, UC Santa Barbara, and the University of Utah. The first host-to-host message was sent on October 29, 1969: the intended word was “login,” but the system crashed after transmitting “lo.” The ARPANET was born.
The IMPs were built by Bolt, Beranek and Newman (BBN): dedicated minicomputers (Honeywell DDP-516, 12 KB of memory) that sat between a host computer and the network and served as the network’s packet switches. In today’s terms, the IMP combined what a modern network separates into a router (Layer 3 forwarding), a modem (Layer 1/2 link management), and a host interface — all in one device. The concept of protocol layers had yet to emerge; Zimmermann’s OSI model would arrive in 1980. The IMP was simply “the machine that moves packets.”
Frank Heart’s team framed their task as subnet engineering:
“The essential task is to transfer bits reliably from a source location to a specified destination.” — Heart et al., 1970 (Heart et al. 1970)
Their overriding design concern was autonomy:
“Any dependency between one IMP and another would merely broaden the area jeopardized by one IMP’s failure.” — Heart et al., 1970 (Heart et al. 1970)
What the pioneers saw: A small network (~10 nodes) of minicomputers connected by 50 Kbps leased telephone lines. Each IMP had very limited memory. The environment was benign — a cooperative research community, no competing traffic, no commercial pressure. The problem, as they understood it, was reliability: keep packets flowing even when individual machines or lines fail.
What remained invisible from the pioneers’ vantage point: The Internet would grow to millions of nodes. Commercial interests would fragment administrative control. Traffic would explode beyond any single link’s capacity. The routing algorithm that works for 10 cooperative nodes would catastrophically fail at 60. These forces were decades away — and the design reflected the environment the pioneers could see.
They chose the simplest distributed algorithm: distance-vector routing. Each IMP tells every neighbor, “I can reach destination X at cost N.” The neighbor adds its own link cost and records the result. After enough exchanges, every IMP converges to the shortest path. This is the Bellman-Ford algorithm (Bellman 1958) applied to a live network.
7.2.1 The Designer’s Decisions
Concretely, each IMP faces three decisions on every packet and every update:
- Packet arrives for destination X: Look up X in the routing table → forward to the neighbor with the lowest recorded cost. A “fast and simple table lookup procedure.”
- Neighbor J sends a distance update: “Compute delay via J to all other nodes by adding to each entry in J’s table its own delay to J.” Compare against current table. Keep the minimum.
- Link fails: Reroute packets immediately. Set cost to infinity. Tell neighbors.
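The update rule behind these decisions can be sketched in a few lines of Python. This is a minimal illustration, not the IMP’s actual code; the dict-based table and the function name are assumptions for clarity:

```python
INFINITY = float("inf")

def process_update(table, cost_to_j, j_table, j="J"):
    """Merge neighbor J's advertised distances into our routing table.

    table:     dest -> (cost, next_hop), our current belief
    cost_to_j: our own link cost to neighbor J
    j_table:   dest -> cost, as advertised by J
    Returns True if any entry improved (so we should re-advertise).
    """
    changed = False
    for dest, j_cost in j_table.items():
        candidate = cost_to_j + j_cost           # cost of reaching dest via J
        current, _ = table.get(dest, (INFINITY, None))
        if candidate < current:                  # keep the minimum
            table[dest] = (candidate, j)
            changed = True
    return changed
```

Link failure is the degenerate case: set the link cost to `INFINITY` and re-advertise. Note that nothing in the update checks whether J’s advertised path loops back through us.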
7.2.2 Invariant Analysis: Distance-Vector Routing (1969)
| Invariant | ARPANET DV Answer (1969) | Gap? |
|---|---|---|
| State | Delay estimate per destination, built from neighbor ads | Proxy measurement: queue length ≠ delay |
| Time | Updates every 128 ms | Too fast and noisy — chases transient congestion |
| Coordination | Fully distributed, “routing by rumor” | No consistency guarantee — loops possible |
| Interface | Full tables exchanged every 128 ms | Bandwidth cost scales with N |
Two invariants are answered solidly: distributed Coordination satisfies survivability, and the simple Interface is deployable. The other two have critical gaps. The State gap is the most consequential: each IMP estimates link delay by counting packets in its output queue — but queue length ignores line speed, propagation delay, and resource contention. Routes oscillate as the algorithm chases transient queue fluctuations, shifting traffic unpredictably. The Coordination gap compounds this: because each node computes from neighbors’ advertisements without checking whether the advertised path loops back through itself, two nodes can simultaneously believe they should route through each other. Packets bounce between them until discarded — the count-to-infinity problem.
7.2.3 Environment → Measurement → Belief
| Layer | What ARPANET DV has | What’s missing |
|---|---|---|
| Environment | True topology + link delays + traffic load | — |
| Measurement | Queue length sampled every 128 ms as proxy for delay | Queue length ignores line speed, propagation delay, and resource contention. Systematically wrong under both light and heavy load. |
| Belief | Distance vector: best cost to each destination via each neighbor | No topology model — only distances. Cannot detect self-loops. |
The Belief gap is the root cause of count-to-infinity: a node receiving a distance advertisement has no way to check whether the advertised path passes through itself. It accepts a route that loops back through its own position — and advertises that looping route to its neighbors, who accept it in turn.
The IMP’s design reflects all three design principles in their earliest form. The IMP is a monolithic device — routing, forwarding, and host interface entangled in one machine, with no disaggregation of concerns. Survivability forces decision placement to be fully distributed — each IMP decides independently. And the routing exchange is a closed-loop feedback system (measure → compute → advertise → neighbors recompute → they advertise back) with a 128 ms loop period — but one that is potentially divergent, widening the E-M-B gap instead of narrowing it under certain topologies.
7.2.4 “The gaps didn’t matter… yet.”
The ARPANET had ~50 nodes and moderate traffic. Distance-vector worked for a decade. The proxy measurement was inaccurate but tolerable. Routing loops occurred but resolved quickly at small scale. The system was, in Baran’s framing, “highly inefficient but extremely stable” — acceptable when survivability was the binding constraint.
7.2.5 Disaggregation in Action: The IMP Splits into TCP and IP (1974)
Notice that Heart’s IMP provided both routing and reliable delivery. The IMP guaranteed hop-by-hop reliability: “Once an IMP has accepted a packet and returned a positive acknowledgment, it holds onto that packet tenaciously until it in turn receives an acknowledgment from the succeeding IMP.” The IMP even reassembled messages and delivered them in order to the host. In modern terms, the IMP conflated the transport layer (reliable delivery) with the network layer (routing and forwarding) inside a single device.
In 1974, Cerf and Kahn proposed a radical disaggregation (Cerf and Kahn 1974). To interconnect dissimilar networks (ARPANET, packet radio, satellite), every intermediate gateway would need to guarantee reliability — an impossible coordination requirement across heterogeneous systems. Their solution: make the network layer intentionally unreliable. Gateways (routers) would only forward packets — “the responsibility for properly routing data” — while a new endpoint protocol, TCP, would handle reliability end-to-end: “End-to-end restoration procedures are desirable to allow complete recovery.” The proposal took nearly a decade to deploy: TCP/IP replaced the ARPANET’s original Network Control Program on January 1, 1983 — a coordinated cutover known as “Flag Day.”
This is the disaggregation principle in its most consequential form. Separating transport from routing created two independent evolutionary paths:
- Routing could evolve from distance-vector to link-state to BGP to SDN — each change affecting only how packets are forwarded, without breaking reliable delivery.
- Transport could evolve from Tahoe to Reno to CUBIC to BBR (Chapter 4) — each change affecting only endpoint behavior, without modifying routers.
The cost of disaggregation: TCP must infer network congestion from endpoint observations alone (ACK timing, loss events), because IP’s interface deliberately hides router state. This gap — the same Environment-Measurement-Belief gap from Chapter 4 — is the price of the separation. Saltzer, Reed, and Clark formalized this tradeoff a decade later (Saltzer et al. 1984): functions requiring application knowledge belong at endpoints. Internal reliability measures are “performance enhancements, never requirements for correctness.”
The rest of this chapter traces the routing side of this disaggregation. Transport evolved independently in Chapter 4.
Then the network grew.
7.3 Act 2: “It’s the Late 1970s. The ARPANET Is Oscillating.”
By the late 1970s, the ARPANET had grown to ~60 nodes. Distance-vector routing, which had worked reliably for a decade, became unstable. Links oscillated between congested and idle as the routing algorithm chased transient queue-length fluctuations. Count-to-infinity loops persisted for minutes. The gaps that were invisible at the original 4 nodes, and tolerable at 50, became catastrophic at 60. In May 1979, the ARPANET switched to a new routing algorithm — the old distance-vector program was removed entirely.
McQuillan diagnosed three “serious defects,” each mapping to a different invariant:
7.3.1 Which Invariant Broke?
| Invariant | What broke | McQuillan’s diagnosis |
|---|---|---|
| State (Measurement) | Queue-length proxy was systematically wrong | “The length of an output queue is only one of many factors that affect a packet’s delay. A measurement procedure that takes into account only one such factor cannot give accurate results.” |
| Coordination (Consistency) | Distributed iteration couldn’t guarantee consistent routes | “With such a scheme, it is difficult to ensure that routes used by different nodes are consistent” — leading to “wild oscillations” and “long-term loops.” |
| State (Scalability) | Routing tables grew with network size | “Routing packets would become correspondingly larger (or more frequent) as the ARPANET grows to 100 or more nodes.” |
7.3.2 Why Count-to-Infinity Is a Divergent Feedback Loop
Count-to-infinity reveals a fundamental closed-loop failure. The routing loop is supposed to narrow the E-M-B gap (converge Belief toward Environment). But when a destination becomes unreachable:
- Node A believes it can reach X via B at cost 2
- B’s link to X fails. B sets cost to infinity.
- Before B’s update reaches A, A tells B: “I can reach X at cost 2.”
- B has no way to detect that A’s path goes through B. B records: “Reach X via A at cost 3.”
- A hears B’s new cost and updates: “Reach X via B at cost 4.”
- The two nodes bounce the route, incrementing cost by 1 each round, until both reach infinity.
The measurement signal (neighbor’s distance) is contaminated by the loop’s own output. A’s belief feeds into B’s measurement, which feeds back into A’s belief. This is positive feedback — the loop amplifies the E-M-B gap instead of reducing it. At 30-second update intervals (RIP (Hedrick 1988)), a 15-hop network takes up to 8 minutes to converge. During those 8 minutes, packets destined for X bounce between A and B until their TTL expires.
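The bounce can be simulated directly. A toy two-node model (the function is hypothetical; `INFINITY = 16` follows RIP’s convention that hop count 16 means unreachable):

```python
INFINITY = 16  # RIP's conventional "infinity": hop count 16 = unreachable

def count_to_infinity(a_cost=2):
    """Simulate two nodes bouncing a dead route until both give up.

    After B's link to X fails, B's only remaining 'route' to X is via A,
    and A's only route is via B: each accepts the other's cost plus 1.
    Returns the number of update exchanges until both hit INFINITY.
    """
    b_cost = INFINITY            # B's direct link to X just failed
    exchanges = 0
    while a_cost < INFINITY or b_cost < INFINITY:
        b_cost = min(INFINITY, a_cost + 1)  # B accepts A's stale route
        a_cost = min(INFINITY, b_cost + 1)  # A updates its path via B
        exchanges += 1
    return exchanges
```

Starting from cost 2, the pair needs 8 exchanges to reach infinity; at RIP’s 30-second update interval that is four minutes of looping traffic for a single dead destination, and real topologies explore many more stale paths.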
7.3.3 McQuillan’s Redesign: Link-State Routing
McQuillan’s insight: if every node had the same complete topology database, they would compute the same shortest paths. Replace rumors with shared truth. His redesign applied all three design principles:
He applied disaggregation by separating the topology database from the forwarding table — two distinct data structures, connected by SPF (Shortest Path First) computation (Dijkstra 1959). This is the first separation of concerns within routing. He applied closed-loop reasoning with a critical fix: replacing the noisy queue-length proxy with direct delay measurement averaged over 10 seconds, making the measurement signal independent of the computation. The loop structure remains the same (measure → disseminate → compute → repeat), but now it converges because the measurement (LSA, Link-State Advertisement) carries direct observations — breaking the positive feedback that plagued DV. And decision placement remains distributed — survivability still demands it — but nodes now compute on “identical databases,” agreeing without negotiation.
| What changed | Distance-Vector (before) | Link-State (after) |
|---|---|---|
| State (what to believe) | Neighbor distances — O(N) partial state, built from rumors | Full topology database — O(N²) complete state, “identical databases maintained at all nodes” |
| Measurement (what to observe) | Queue length proxy, every 128 ms | Direct delay measurement, averaged over “10 seconds” — “a significant departure” |
| Coordination (how to agree) | Iterative relaxation via neighbor exchange (“routing by rumor”) | Reliable flooding: each node sends only its own link costs; all nodes run “the exact same algorithm on an identical data base” |
| Time (how fast) | 128 ms updates, unbounded convergence | Updates propagate within “100 msec”; convergence bounded by flooding time + SPF computation |
Why does link-state converge while distance-vector can diverge? The critical structural difference: in link-state, the measurement signal (LSA) is independent of the computation. Each node reports only its own directly-measured link costs. These measurements are decoupled from any other node’s belief. The circular dependency is eliminated — the positive feedback loop is broken.
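McQuillan’s calculation step is Dijkstra’s algorithm run locally on the shared database. A compact sketch, assuming a dict-of-dicts topology format for illustration:

```python
import heapq

def spf(topology, source):
    """Dijkstra's shortest-path-first over a shared topology database.

    topology: node -> {neighbor: link_cost}, identical at every node
    because it is assembled from each node's self-reported (flooded)
    link measurements. Returns dest -> shortest-path cost from source.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already improved
        for v, w in topology.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Because every node holds the same `topology` and runs the same deterministic computation, their shortest-path trees agree without any negotiation.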
7.3.4 Environment → Measurement → Belief After the Fix
| Layer | DV (before) | LS (after) | Gap closed? |
|---|---|---|---|
| Environment | True topology + delays | Same | — |
| Measurement | Queue-length proxy (noisy, indirect) | Direct delay measurement (accurate, averaged over 10s) | Yes — measurement tracks environment accurately |
| Belief | Distance vector (partial, inconsistent across nodes) | SPF shortest-path tree (complete, identical across nodes) | Yes — belief derived from shared, accurate measurement |
7.3.5 OSPF: Production Codification (1989–1998)
OSPF (RFC 2328) codified link-state for production IP networks (Moy 1998). OSPF applied disaggregation further through areas — partitioning the topology database into zones. Within an area, every router holds complete state (every link, every cost). Across area boundaries, Area Border Routers summarize: they advertise only “destination X is reachable at cost Y” without revealing the internal topology that produces that cost. This manages a fundamental scaling problem: in a flat link-state network with N routers, every router stores information about every link — and the number of links grows as N², consuming memory and flooding bandwidth. Areas keep the per-router state manageable by limiting full topology knowledge to local scope.
OSPF also refined decision placement through the Designated Router (DR) on multi-access segments (like an Ethernet LAN with many routers). Without a DR, every pair of routers on the segment forms a separate relationship and exchanges topology updates — N routers produce N×(N-1)/2 pairs, each flooding independently. The DR consolidates: one elected router collects updates from all others and redistributes a single copy, reducing the flooding traffic from quadratic to linear.
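The flooding arithmetic is worth making explicit. A sketch (ignoring OSPF’s Backup Designated Router, which adds a second linear term):

```python
def adjacencies_full_mesh(n: int) -> int:
    """Without a DR: every router pair on the segment forms an adjacency."""
    return n * (n - 1) // 2

def adjacencies_with_dr(n: int) -> int:
    """With a DR: each of the other n-1 routers pairs only with the DR."""
    return n - 1
```

On a 20-router segment that is 190 adjacencies without a DR versus 19 with one, and the gap widens quadratically as routers are added.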
7.4 Act 3: “It’s 1985. Ethernet LANs Keep Crashing from Loops.”
Acts 1 and 2 traced routing at what we now call Layer 3 — the network layer, where IP addresses identify destinations and routers forward packets hop by hop across wide-area networks. But the forwarding problem also exists at Layer 2 — the data link layer, where Ethernet MAC addresses identify devices on a local network and bridges forward frames between LAN segments. The loop-free forwarding problem is structurally identical: a packet must reach its destination without cycling forever. But the constraints are different, and the different constraints produce a fundamentally different design.
Why detour to Layer 2? Because Perlman’s spanning tree protocol is the inverse of McQuillan’s link-state solution. Both solve loop-free forwarding. McQuillan’s approach: give every node full topology knowledge, then compute loop-free paths. Perlman’s approach: give nodes almost no topology knowledge, and instead constrain the topology itself to be loop-free. The contrast reveals how the binding constraint — not the problem — determines the design.
Radia Perlman faced a different environment from the ARPANET engineers. Her problem was connecting multiple Ethernet LAN segments with bridges — simple devices that forward Ethernet frames between segments based on MAC addresses. Bridges were cheap, required no configuration, and were invisible to end stations. But redundant links between bridges created forwarding loops:
“Switching loops resulting in broadcast radiations and MAC table instability.”
Perlman’s binding constraint was Interface transparency: bridges must allow stations to participate in the network without any modification or awareness. In her own words:
“Transparent to stations, and thus allows a station to participate in an extended LAN with no modification.” — Perlman, 1985 (Perlman 1985)
What Perlman saw: Campus networks with a few dozen bridges connecting Ethernet segments. Bridges were “dumb” devices — far simpler than routers (which understood IP addresses and ran routing protocols). The designers of bridges deliberately avoided complexity: no configuration, no IP addresses, no routing protocols. Just plug in a cable and it works.
What the transparency constraint forced: Link-state requires every node to maintain a full topology database — expensive in memory and computation. Bridges in 1985 were simple forwarding devices with far fewer resources than routers. The transparency constraint further demanded that the solution work without any change to end-station behavior — stations must remain entirely unaware of bridges.
7.4.1 Perlman’s Inverse Insight
Instead of distributing full topology knowledge (McQuillan’s approach), constrain the active topology to a loop-free tree. A distributed root-bridge election prunes the mesh into a spanning tree. Every bridge agrees on a single root (the bridge with the lowest ID), computes its shortest path to the root, and disables all ports outside that shortest path. The result: a tree topology where exactly one path exists between any two points. Loops are impossible — the topology itself is pruned to prevent them, without requiring any node to know the full graph.
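The outcome of election-and-pruning can be sketched as one computation over the bridge graph. This is a simplification: real STP is distributed, exchanges BPDUs, and breaks ties by bridge and port IDs; the sketch simply keeps each bridge’s least-cost path toward the elected root:

```python
import heapq

def spanning_tree_ports(links, bridges):
    """Sketch of STP's result: elect the lowest-ID bridge as root, then
    keep only each bridge's least-cost link toward the root.

    links: {(bridge_a, bridge_b): cost}, undirected
    Returns the set of links left forwarding; all others are blocked.
    """
    root = min(bridges)                      # lowest bridge ID wins election
    adj = {}
    for (a, b), c in links.items():
        adj.setdefault(a, []).append((b, c))
        adj.setdefault(b, []).append((a, c))
    dist, parent = {root: 0}, {}
    heap = [(0, root)]
    while heap:                              # Dijkstra from the root
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, c in adj.get(u, []):
            if d + c < dist.get(v, float("inf")):
                dist[v] = d + c
                parent[v] = u
                heapq.heappush(heap, (d + c, v))
    # Each non-root bridge forwards only on the link to its parent
    return {tuple(sorted((v, p))) for v, p in parent.items()}
```

For three bridges in a triangle, the link between the two non-root bridges ends up blocked: exactly N−1 links forward, so loops are structurally impossible.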
7.4.2 Invariant Analysis: Spanning Tree Protocol (1985)
| Invariant | STP Answer (1985) | Gap? |
|---|---|---|
| State | Minimal: root ID, cost to root, designated port | Solid — constant per bridge regardless of network size |
| Time | 30–50 second convergence | Critical: all forwarding halts during reconvergence |
| Coordination | Distributed root election via BPDUs | Solid for LANs |
| Interface | Transparent — invisible to end stations | Blocks all redundant links, wasting bandwidth |
The Time gap is the most operationally painful: when a link fails or a new bridge joins, ALL forwarding through affected ports halts for 30–50 seconds while bridges re-elect a root and recompute the tree. During this window, packets are silently dropped. The Interface trade-off is equally significant: STP disables every redundant link, keeping only one active path between any two points. If a network has 10 paths between buildings A and B, STP blocks 9 of them — traffic flows on a single path regardless of load, while the blocked links sit idle.
Perlman’s design applied decision placement as distributed — the transparency constraint rules out centralized control, so bridges elect a root via BPDU exchange. She applied closed-loop reasoning through BPDU exchange → root election → port blocking, though the loop converges slowly (passive timers, no active feedback — RSTP later applied active “proposal/agreement” feedback to achieve sub-second convergence). And the design reflects a specific disaggregation choice: STP operates at L2, completely separate from L3 routing, enabling each layer to evolve independently — but also preventing L2 and L3 from coordinating their topologies.
7.4.3 “The gaps didn’t matter… yet.”
Campus LANs were small — a few dozen bridges, a handful of redundant links. Losing a link for 30 seconds while STP reconverged was annoying but survivable, because alternative paths were rare and traffic loads were modest.
Then data centers arrived. Modern data centers use fat-tree (Clos) topologies — massive multi-stage switch fabrics with hundreds of equal-cost paths between any two servers. The whole point of a fat-tree is that every path carries traffic simultaneously, providing aggregate bandwidth that grows with the number of switches. STP blocks every redundant path and keeps only one — eliminating exactly the bandwidth that fat-tree topologies exist to provide. A fat-tree network running STP operates at a fraction of its capacity, with most of its links idle.
This mismatch drove multiple attempts to replace STP in data centers. TRILL (Perlman et al. 2011) and Shortest Path Bridging (SPB) brought link-state routing to Layer 2, enabling multipath forwarding while preserving transparent bridging semantics. But the solution that ultimately prevailed was simpler: VXLAN (Mahalingam et al. 2014) encapsulates Layer 2 Ethernet frames inside Layer 3 UDP/IP packets, tunneling L2 traffic over an L3-routed core. The core network runs standard IP routing (OSPF/BGP) with all paths active. The L2 illusion — flat addressing, VM mobility — is maintained only at the edge, where VXLAN tunnel endpoints translate between L2 frames and L3 packets. The core remains stateless and fully utilized.
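The encapsulation itself is small: an 8-byte VXLAN header (RFC 7348) in front of the original Ethernet frame, carried over UDP port 4789. A sketch of just the header construction (the function name is hypothetical):

```python
import struct

VXLAN_PORT = 4789  # IANA-assigned UDP destination port for VXLAN

def vxlan_encapsulate(inner_l2_frame: bytes, vni: int) -> bytes:
    """Prepend the 8-byte VXLAN header (RFC 7348 layout) to an L2 frame.

    Layout: flags (1 byte, 0x08 = VNI-valid), 3 reserved bytes,
    24-bit VXLAN Network Identifier, 1 reserved byte. The result is
    then carried as UDP payload inside an outer IP packet that the
    L3 core routes like any other.
    """
    header = struct.pack("!B3s3sB",
                         0x08,                      # 'I' flag: VNI is valid
                         b"\x00\x00\x00",           # reserved
                         vni.to_bytes(3, "big"),    # 24-bit VNI
                         0)                         # reserved
    return header + inner_l2_frame
```

The outer IP header, added next, is all the L3 core ever inspects; the inner frame, VNI and all, travels as opaque payload until the far tunnel endpoint strips the header.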
7.5 Act 4: “It’s the Early 1990s. The Internet Commercializes. Shortest-Path Breaks.”
Acts 1–3 shared a common assumption: all nodes in the network have the same goal — find the shortest path (or in STP’s case, the loop-free path). OSPF routers all want to minimize delay. STP bridges all want a single tree. The cooperative ARPANET research community had no reason for nodes to disagree about what “best” means.
The ARPANET was formally decommissioned in 1990. The NSFNET, which had served as the research backbone since 1986, lifted its ban on commercial traffic in 1991 and was itself decommissioned in 1995. The Internet transformed from a cooperative research community into a federation of Autonomous Systems (ASes) — independently operated networks (ISPs, universities, corporations) connected by commercial contracts. Each AS had its own business interests: a transit provider wants to carry paying customers’ traffic but not freeloaders’. A content provider wants to peer directly with eyeball networks to avoid transit fees. A government network wants to avoid routing through certain countries.
The meta constraint shifted. The binding constraint moved beyond consistency (McQuillan’s problem) and transparency (Perlman’s problem) to commercial sovereignty — each AS must independently define its routing policy without revealing its internal topology or surrendering control over path selection.
What the BGP pioneers saw: A network that had grown from a single backbone to a mesh of competing providers. The previous interdomain protocol (EGP) assumed a tree-structured Internet with a single backbone. That assumption was now false:
EGP “could not handle the transition from a tree-structured to a mesh-structured Internet.”
What commercial sovereignty demanded: OSPF-style link-state routing is structurally incompatible with autonomous systems. First, flooding the full topology of every AS to every other AS is prohibitively expensive — “flooding overhead makes them generally inappropriate for inter-AS use.” Second, and more fundamentally, ASes are “unwilling to reveal their local policies to others” — the interface must hide internal state. Link-state’s core assumption — every node sees the same complete topology — directly contradicts commercial sovereignty.
7.5.1 Invariant Analysis: BGP (Rekhter et al. 2006) (1993–Present)
| Invariant | BGP Answer | Gap? |
|---|---|---|
| State | Per-prefix, per-peer RIB with policy attributes (AS_PATH, LOCAL_PREF, MED, Communities) | ~1M prefixes globally; incompressible because each AS applies different preferences |
| Time | MRAI (Minimum Route Advertisement Interval) = 30s, hold timer = 90s, convergence = 3–15 min | Deliberately slow: stability over speed |
| Coordination | Autonomous — each AS decides independently | NP-complete in general (Griffin); stable under commercial hierarchy (Gao-Rexford) |
| Interface | UPDATE message encodes business logic, hides internal topology | No origin validation — any AS can announce any prefix |
What the gaps mean concretely: BGP’s State is a per-prefix, per-peer Routing Information Base (RIB): for every destination prefix in the Internet (~1 million today), each router stores a separate entry from each neighbor, annotated with policy attributes. The AS_PATH lists every network the route has traversed, enabling loop detection. This state is “incompressible” — each AS applies its own preferences, so routes resist aggregation.
BGP’s Time is deliberately slow. After a route withdrawal, routers explore every alternative path before giving up. Each alternative takes one MRAI interval (30 seconds) to try. With 5 alternatives, convergence takes 2.5 minutes minimum. This is intentional: rapid oscillation across the global Internet would be worse than slow convergence.
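The arithmetic is simple enough to state as code (a lower bound only; real convergence also depends on topology, iBGP structure, and timer interactions):

```python
MRAI_SECONDS = 30  # Minimum Route Advertisement Interval, eBGP default

def worst_case_convergence_minutes(alternative_paths: int) -> float:
    """Lower-bound convergence time when BGP explores each alternative
    path one MRAI interval at a time after a route withdrawal."""
    return alternative_paths * MRAI_SECONDS / 60
```

Five alternatives yield a 2.5-minute floor; well-connected prefixes with more alternatives take proportionally longer.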
BGP’s Coordination is theoretically intractable: Griffin et al. (Griffin et al. 2002) proved that determining whether a BGP configuration has any stable solution is NP-complete — there exist policy combinations where routers oscillate forever. In practice, the commercial hierarchy (customer-provider-peer) prevents these pathological cases (Gao and Rexford 2001).
BGP’s Interface lacks origin validation: any AS can announce any prefix, including prefixes it has no right to. This enabled the Pakistan/YouTube hijacking in 2008. RPKI (Lepinski and Kent 2012) adds cryptographic origin checks, but deployment remains incomplete due to collective action problems (Goldberg 2014).
7.5.2 Environment → Measurement → Belief: A Qualitative Shift
BGP’s E-M-B gap is fundamentally different from ARPANET’s. In the ARPANET, measurement was accidentally noisy (queue length was a bad proxy). In BGP, measurement is structurally filtered — neighbors deliberately hide their internal topology and show only policy-shaped route advertisements.
| Layer | ARPANET DV | BGP |
|---|---|---|
| Environment | True topology + link delays | True topology + business relationships + traffic economics |
| Measurement | Queue-length proxy (accidentally noisy) | Route advertisements (structurally filtered — neighbor shows only what they want you to see) |
| Belief | Distance vector (partial, inconsistent) | RIB: “best path according to MY policy” (per-prefix, per-peer, policy-weighted) |
| Nature of E→M gap | Engineering problem — build a better sensor | Institutional reality — incompleteness is designed in. Better sensors are irrelevant. |
A model is only as good as its input signals. When the input signals are strategically shaped rather than merely noisy, the path to stability lies in structural constraints that limit the solution space — better estimation alone is insufficient.
7.5.3 The 1993–2008 Arc: 15 Years of Cascading Crises
BGP’s story extends far beyond its creation. Five crises over 15 years revealed gaps that the original design left exposed. Each crisis is a closed-loop failure: the loop that should keep Belief (routing tables) aligned with Environment (true network state) breaks because the measurement signal is insufficient, the loop gain is wrong, or structural constraints are violated.
- 1993 — Address exhaustion (State/Interface, Fuller et al.)
- The class-based address system (Class A/B/C) was exhausting the IPv4 space and inflating routing tables. CIDR (Fuller et al. 1993) replaced rigid class boundaries with variable-length prefixes, enabling hierarchical aggregation that reduced routing table size.
- 1997–2001 — Convergence crisis (Time, Labovitz et al. 2000)
- BGP converges in 3–15 minutes after failures. “Packet loss grows by a factor of 30.” The mechanism: after a route withdrawal, routers explore every alternative path before giving up — the same structural problem as count-to-infinity, but at interdomain scale, amplified by the MRAI timer (30 seconds per alternative).
- 2001 — Stability theory (Coordination, Gao and Rexford 2001)
- The commercial hierarchy (customer > peer > provider) guarantees BGP convergence — stability emerges from market structure, not protocol mechanism. If every AS prefers customer routes over peer/provider routes, the dispute-wheel pathology is structurally prevented.
- 2002 — NP-hardness (Coordination, Griffin et al. 2002)
- Determining whether a BGP configuration has any stable solution is NP-complete. There exist policy combinations — called “dispute wheels” — where routers form circular dependencies and oscillate forever with no equilibrium. The Gao-Rexford conditions prevent these cases in practice, but the theoretical worst case is intractable.
- 2003 — Route flap damping paradox (Time → Coordination, Mao et al. 2002)
- Route flap damping (RFD), designed to suppress unstable routes, worsened convergence. RFD treated the temporary route fluctuations that naturally occur during BGP path exploration as evidence of instability, and suppressed them — for up to an hour. A Time fix misidentified normal Coordination behavior as pathological.
- ~2008 — Prefix hijacking (Interface, Pakistan/YouTube incident)
- Any AS can announce any prefix, including prefixes it has no right to. In 2008, Pakistan Telecom accidentally announced YouTube’s prefix, diverting global YouTube traffic through Pakistan for hours. RPKI (Lepinski and Kent 2012) adds cryptographic origin checks, but deployment remains incomplete — early adopters gain little benefit without widespread participation (Goldberg 2014).
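The dispute-wheel pathology that Griffin et al. proved hard to detect can be made concrete with a toy simulation. The sketch below encodes the classic three-AS "bad gadget": each AS prefers the path through its clockwise neighbor over its own direct path to the destination, violating the Gao-Rexford customer-preference condition. The synchronous update model and all AS numbers are simplifying assumptions for illustration; real BGP adds timers, MRAI, and asynchrony.

```python
# Toy simulation of the three-AS dispute wheel ("bad gadget"):
# ASes 1, 2, 3 route to destination 0. Each AS prefers the path
# through its clockwise neighbor over its direct path, which
# violates Gao-Rexford and produces permanent oscillation.

# Preference lists: earlier entries are more preferred. Paths are AS tuples.
prefs = {
    1: [(1, 2, 0), (1, 0)],   # AS1 prefers going via AS2
    2: [(2, 3, 0), (2, 0)],   # AS2 prefers going via AS3
    3: [(3, 1, 0), (3, 0)],   # AS3 prefers going via AS1
}

def step(current):
    """One synchronous round: each AS picks its most preferred path
    consistent with what its neighbor currently uses."""
    new = {}
    for asn, ranked in prefs.items():
        for path in ranked:
            nxt = path[1]
            if nxt == 0:                       # direct path: always available
                new[asn] = path
                break
            if current.get(nxt) == (nxt, 0):   # via-neighbor path: usable only
                new[asn] = path                # if the neighbor goes direct
                break
    return new

state = {1: (1, 0), 2: (2, 0), 3: (3, 0)}      # start: everyone direct
seen = []
for _ in range(8):
    state = step(state)
    seen.append(dict(state))

# The system flips between "all indirect" and "all direct" forever:
# no stable assignment exists, so the loop never converges.
oscillates = seen[-1] != seen[-2]
print("oscillates:", oscillates)
```

Running this shows the route assignments alternating with period two rather than settling, which is exactly the circular dependency that the commercial hierarchy rules out in practice.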
BGP operates exactly as designed under constraints that every interdomain protocol must face. Any interdomain protocol serving autonomous, commercially competing networks faces the same NP-hardness (Griffin). BGP works in practice because the commercial hierarchy provides structural damping (Gao-Rexford). Replacing BGP means either abandoning commercial autonomy or solving NP-hard problems in real time. Neither is feasible.
BGP’s design reflects all three principles. The community applied disaggregation by separating intradomain routing (OSPF) from interdomain routing (BGP) — each evolves independently, though hot-potato routing couples them at the boundary, sometimes catastrophically. Commercial sovereignty forces decision placement to be fully autonomous: each AS decides independently, because sovereignty forbids the consistency constraints that would require centralization. And closed-loop reasoning reveals the deepest insight: BGP’s feedback loop (advertise → select → advertise) is convergent under Gao-Rexford conditions but NP-hard in general — the stabilizing mechanism is institutional (market hierarchy), not algorithmic.
7.6 Act 5: “It’s 1996. Operators Need Traffic Engineering.”
Acts 1–4 focused on the control plane — how nodes learn paths and agree on forwarding decisions. But as the Internet grew to carry commercial traffic in the mid-1990s, a new problem emerged that was neither routing nor switching: traffic engineering. Operators discovered that shortest-path routing, however correct, left some links congested while others sat idle. The network was loop-free and convergent — McQuillan and Moy had solved that — but it was inefficient. Operators needed to steer traffic along non-shortest paths to balance load, avoid congestion, and meet Service Level Agreements.
This is the traffic engineering problem — and its conflation with routing produced a decade of operational pain.
MPLS (Multiprotocol Label Switching) (Rosen et al. 2001) promised to decouple routing from forwarding. The idea: use IP routing protocols (OSPF, BGP) to compute paths, but forward packets using short fixed-length labels instead of IP address lookups. A 32-bit shim header — carrying a 20-bit label — is inserted between the Ethernet header and the IP header; the label tells each router which pre-computed path the packet belongs to.2 Routers swap labels at each hop (fast table lookup, no longest-prefix match needed), steering traffic along explicit Label Switched Paths (LSPs) that can follow non-shortest routes. This gave operators the traffic engineering control that shortest-path routing lacked.
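Label swapping along an LSP can be sketched in a few lines. The router names, labels, and ports below are invented for illustration; the point is that each hop is a single exact-match lookup on the incoming label, and the IP header is never examined in the core.

```python
# Minimal sketch of MPLS label swapping along one pre-computed LSP.
# Each router holds a label forwarding table: in_label -> (out_label, out_port).
# All router names, labels, and ports are hypothetical.

lfib = {
    "R1": {100: (200, "port2")},   # ingress-adjacent hop: swap 100 -> 200
    "R2": {200: (300, "port1")},   # core hop: swap 200 -> 300
    "R3": {300: (None, "port3")},  # egress: None means pop the label
}

def forward(router, label):
    """One label-swap hop: O(1) exact-match lookup, no longest-prefix match."""
    out_label, out_port = lfib[router][label]
    return out_label, out_port

label = 100                        # label pushed by the ingress router
path = []
for router in ["R1", "R2", "R3"]:
    label, port = forward(router, label)
    path.append((router, port))

print(path)                        # hop-by-hop output ports along the LSP
```

Because the core lookup is an exact match on a small integer rather than a longest-prefix match on a 32-bit address, the forwarding decision is decoupled from how the path was computed — the decoupling the paragraph above describes.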
But RSVP-TE signaling “maintains signaling states on ALL devices along the path.” Every core router stores state for every tunnel traversing it. At scale, this is “operationally very expensive” with a “notorious number of MPLS RSVP-TE tunnels.” The State invariant broke again — the same failure mode as DV’s routing-table explosion, but at a different layer.
Meanwhile, Fortz and Thorup (Fortz and Thorup 2000) showed that optimizing OSPF link weights achieves “within 1–3% of optimal multicommodity flow” — challenging whether MPLS-TE was even necessary.
And routing-TE conflation created subtle failures: “OSPF distance changes DO affect BGP routes, sometimes with updates lagging by as much as a minute.” Local traffic-engineering decisions broke global reachability. This is a disaggregation failure — routing and TE were supposed to be separated by MPLS but remained coupled through IGP (Interior Gateway Protocol)-BGP interaction.
Segment Routing (Filsfils et al. 2015) resolved this by moving path state from the core to the packet header. The ingress encodes the path as an ordered list of “segments.” Core routers follow the list — no per-flow state, no LDP (Label Distribution Protocol), no RSVP-TE. “Core routers don’t need to remember the entire network topology” — they follow the “travel checklist” in the header. Sub-50ms convergence via TI-LFA (Topology-Independent Loop-Free Alternate). The State invariant moves to its most extreme position: stateless core.
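The "path state in the packet header" idea can be sketched directly. In this toy model (node names and the header layout are invented), the ingress writes the segment list once, and each core hop only reads the active segment and advances a pointer — it keeps nothing about the flow between packets.

```python
# Sketch of segment routing's core mechanism: the ingress encodes the
# explicit path as an ordered segment list in the packet; core routers
# follow it statelessly. Node names and header fields are illustrative.

def ingress_encapsulate(dest, segments):
    """Ingress: write the explicit path into the packet header."""
    return {"dest": dest, "segments": list(segments), "ptr": 0}

def core_forward(packet):
    """Core router: read the active segment, advance the pointer.
    No per-flow table entry is created anywhere in the core."""
    seg = packet["segments"][packet["ptr"]]
    packet["ptr"] += 1
    return seg          # next node to steer the packet toward

pkt = ingress_encapsulate("10.0.0.0/24", ["B", "D", "E"])
hops = [core_forward(pkt) for _ in range(len(pkt["segments"]))]
print(hops)             # the non-shortest path chosen by the ingress
```

Contrast this with the RSVP-TE sketch of state above it: here the only copy of the path lives in the packet itself, which is why the core can stay stateless no matter how many tunnels traverse it.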
7.7 Act 6: “It’s 2004. Networks Are Unmanageable.”
By the early 2000s, every problem from Acts 1–5 had been solved — at least in principle. OSPF provided convergent intradomain routing. BGP provided policy-compliant interdomain routing. MPLS provided traffic engineering. STP provided loop-free L2 forwarding. The protocols worked. But operating them together was a nightmare.
The meta constraint shifted once more. The problem had moved beyond routing correctness, policy autonomy, and traffic engineering. The network’s policy had become incomprehensible to any single human. Enterprise networks had thousands of devices, each configured independently with ACLs, VLANs, and firewall rules. Policy was scattered across boxes. Debugging was forensic — tracing a dropped packet required logging into dozens of devices and correlating configuration fragments. Misconfigurations caused outages that were difficult to diagnose and expensive to fix.
The pioneers who launched SDN framed this as a management and consistency problem — distinct from routing:
| Pioneer | Year | Their question | Their framing |
|---|---|---|---|
| Feamster et al. (Feamster et al. 2004) | 2004 | “Extracting routing control into a logically centralized Routing Control Platform” | Consistency crisis — distributed BGP configs are fragile and undebuggable |
| Casado et al. (Casado et al. 2007) | 2007 | “How could we change the enterprise network architecture to make it more manageable?” | Management crisis — VLANs and ACLs “hide the complexity, not reduce it” |
| McKeown et al. (McKeown et al. 2008) | 2008 | “How can we run experiments in our campus networks?” | Innovation crisis — network “ossification” blocks research |
Three independent diagnoses. Three different framings. But all converged on the same architectural answer: separate the control plane from the data plane, and centralize control.
7.7.1 Decision Placement Inverts
This is the most dramatic shift in the chapter. Every previous era locked Coordination as distributed — survivability (Baran), correctness (McQuillan), and commercial sovereignty (BGP) all demanded it. Now comprehensibility demands the opposite.
Casado articulated why centralization was necessary — and how it responds to a fundamentally different binding constraint than Saltzer’s end-to-end argument:
“IP’s best-effort service model is both simple and unchanging, well-suited for distributed algorithms. Network management is quite the opposite; its requirements are complex and require strong consistency, making it quite hard to compute in a distributed manner.” — Casado et al., 2007 (Casado et al. 2007)
TCP’s constraint (no central authority over flows) forced distributed placement. SDN’s constraint (policy must be globally consistent and comprehensible) forced centralized placement. Both are correct for their binding constraint.
7.7.2 Invariant Analysis: SDN / OpenFlow (2008)
| Invariant | SDN Answer (2008) | Gap? |
|---|---|---|
| State | Global NIB in controller; cached flow entries in switch TCAMs (8K–16K entries) | TCAM capacity limits fine-grained policies |
| Time | Reactive: 0.4–12 ms first-packet delay; proactive: zero | Controller becomes bottleneck under high flow churn |
| Coordination | Centralized controller; switches execute | Single point of failure (addressed by Onix, ONOS) |
| Interface | Match-action flow tables on fixed header fields3 | Fixed field set ossifies: 12 → 41 fields, still insufficient |
What these gaps mean concretely: The State gap centers on TCAM (Ternary Content-Addressable Memory) — the specialized hardware that stores flow entries in a switch. TCAMs are fast (single-cycle lookup) but small and expensive: a typical switch holds 8,000–16,000 entries. A fine-grained policy (one rule per user per application) can require hundreds of thousands of entries, far exceeding TCAM capacity. The Time gap is the fundamental cost of centralization: in reactive mode, the first packet of every new flow is forwarded to the controller, which computes a rule and installs it — adding 0.4–12 ms of latency. Under high churn (thousands of new flows per second), the controller becomes the bottleneck. The Coordination gap — a single controller is a single point of failure — is addressed by distributed controller architectures (Onix (Koponen et al. 2010), ONOS) that replicate state across machines.
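The reactive-mode cost described above is easy to see in a toy model. The sketch below (flow keys, the 4-port "topology," and the hash-based rule computation are all hypothetical) shows the essential asymmetry: only a table miss pays the controller round trip; every subsequent packet of the flow is a local lookup.

```python
# Sketch of SDN reactive mode: the switch consults its cached flow table
# (standing in for TCAM); on a miss, the first packet is punted to the
# controller, which computes and installs a rule. All values illustrative.

flow_table = {}          # (src, dst) -> output port; plays the role of TCAM
controller_hits = 0      # how many packets paid the controller detour

def controller_compute(src, dst):
    """Controller: global view; here a toy rule picking one of 4 ports."""
    return hash((src, dst)) % 4

def switch_forward(src, dst):
    global controller_hits
    key = (src, dst)
    if key not in flow_table:          # table miss: punt to controller
        controller_hits += 1           # this is the 0.4-12 ms first-packet cost
        flow_table[key] = controller_compute(src, dst)
    return flow_table[key]             # hit: line-rate local lookup

# Ten packets of the same flow: only the first consults the controller.
ports = [switch_forward("10.0.0.1", "10.0.0.2") for _ in range(10)]
print("controller consulted:", controller_hits, "times")
```

The TCAM gap also falls out of this model: `flow_table` grows by one entry per distinct key, so a policy keyed on (user, application) pairs multiplies entries far past the 8K–16K a real switch can hold.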
The SDN pioneers applied disaggregation at a new boundary — separating the control plane from the data plane across devices, the most consequential new interface since Cerf/Kahn separated transport from routing. They applied decision placement as centralized, because comprehensibility demands a global view. And the closed-loop they built (observe topology → compute rules → install in switches → observe traffic → recompute) operates with near-complete measurement — the controller sees the global topology. B4 (Jain et al. 2013) achieves a loop period of 0.3–0.8s for TE computation and 2–3x better link utilization than distributed routing. The fail-open design ensures switches maintain last known state if the controller fails — a concession to the production tension where SDN must coexist with “existing distributed routing protocols” for interoperability, resulting in a hybrid placement.
7.7.3 The BGP-SDN Bridge
SDX (Gupta et al. 2014) applied SDN to Internet Exchange Points (IXPs) — the physical locations where BGP peering happens. SDX allowed IXP participants to express fine-grained, application-specific peering policies (e.g., “peer with AS X only for video traffic”) that BGP’s attribute system cannot express. The intellectual contribution was demonstrating that SDN’s match-action abstraction could operate at the interdomain boundary, not just within a single network.
SDX became the intellectual foundation for two production systems at the world’s largest content providers. Google’s Espresso (Yap et al. 2017) applied SDN to Google’s peering edge, using programmable switches and host-based routing to steer egress traffic based on real-time performance measurements. Facebook’s Edge Fabric (Schlinker et al. 2017) built a similar system, using a centralized controller to override BGP’s performance-oblivious route selection at Facebook’s PoPs, steering traffic away from congested peering links in near real-time.
Both systems worked — within their operators’ networks. But a surprising finding emerged. Arnold, Calder, Cunha, Gupta, and Katz-Bassett (Arnold et al. 2019) examined three large-scale measurement studies across Facebook, Microsoft, and Google and found that “in all cases, performance-aware routing provided little benefit over BGP in terms of latency.” For the vast majority of traffic, BGP’s performance-oblivious route selection performed as well as or better than the sophisticated alternatives. The paper identified several reasons: most alternate paths degrade together during congestion (so there is no better alternative to switch to), direct peering means BGP’s preferred routes are already short, and providers’ massive infrastructure investments align BGP’s choices with good performance.
This result reframes the SDN-at-the-edge story. SDX, Espresso, and Edge Fabric are architecturally sound applications of the disaggregation and decision placement principles — they separate peering control from BGP’s distributed path selection and centralize it for performance optimization. But the measurement evidence suggests that the environment BGP operates in — dense peering, short paths, infrastructure investment — already produces near-optimal outcomes. The gap between BGP’s belief and the environment’s true state is smaller than the SDN pioneers assumed. Better measurement and centralized control help only at the margins.
7.8 Act 7: “It’s 2013. OpenFlow’s Fields Are Ossifying.”
OpenFlow grew from 12 to 41 match fields in a few years — “increasing the complexity of the specification while still not providing the flexibility to add new headers.” Every new encapsulation format (VXLAN for data center overlays, NVGRE for Microsoft’s virtualization, INT for in-band telemetry) required a new OpenFlow specification revision — a process that took years of standardization before vendors implemented it in hardware. The Interface invariant broke: the same ossification pattern that afflicted IP (middleboxes hardcoded assumptions about headers), MPLS (signaling protocols proliferated), and the telephone network (circuit switching prevented packet innovation).
Bosshart et al. (Bosshart et al. 2014) asked: “How should OpenFlow evolve?” Their answer: make the pipeline itself programmable — define behavior in software, compile it to hardware.
7.8.1 Invariant Analysis: P4 (2014)
| Invariant | P4 Answer (2014) | Gap? |
|---|---|---|
| State | Programmer-defined: “typed match+action tables” + stateful registers. Memory layout not vendor-fixed. | Current frontier — expressiveness limited by hardware stages |
| Time | Line-rate deterministic: “fixed clock cycles per packet” via PISA (Protocol-Independent Switch Architecture) pipeline. No loops or recursion — bounded at compile time. | Solid — line-rate processing guaranteed |
| Coordination | Controller + programmable pipeline. P4Runtime defines the coordination interface between controller and switch. | SDN model preserved, but what the controller can express is now unbounded |
| Interface | Three-stage programmable: Parser (extract headers) → Match-Action (process) → Deparser (reconstruct). “Defines the behavior” of the hardware. | The escape: protocol-independent, reconfigurable, target-independent |
P4 unlocks capabilities beyond OpenFlow’s reach. Because the programmer defines what the switch does with each packet, the switch can perform application-level computation at line rate — not just forward packets. Examples: NetCache (Jin et al. 2017) stores frequently accessed database records directly in the switch’s memory, serving cache hits in microseconds without reaching the server. In-band Network Telemetry (INT) (Kim et al. 2015) inserts per-hop measurement data (queue depth, processing delay) into packet headers as they traverse the network, giving operators hop-by-hop visibility. SwitchML (Sapio et al. 2021) performs machine learning gradient aggregation — summing model updates from hundreds of servers — inside the switch fabric at terabit speeds, bypassing the CPU bottleneck. The data plane has become a programmable computation platform — far beyond its origins as a simple forwarding engine.
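The parse → match-action → deparse structure from the invariant table can be rendered as a toy pipeline. This is a Python sketch, not P4: the two-byte header layout, the table contents, and the TTL-decrement action are all invented, but the three stages mirror how a P4 program defines behavior that a compiler then maps onto fixed hardware stages.

```python
# Toy rendering of P4's three-stage pipeline over a byte string "packet":
# parse (extract headers), match-action (process), deparse (reconstruct).
# Header layout and table contents are hypothetical.

def parse(raw):
    """Parser: extract programmer-defined header fields from raw bytes."""
    return {"dst": raw[0], "ttl": raw[1], "payload": raw[2:]}

# Match-action table: exact match on dst; action = (set egress, decrement TTL)
table = {7: ("port3", True), 9: ("port1", True)}

def match_action(hdr):
    """Match-action stage: look up dst, apply the bound action."""
    egress, dec_ttl = table.get(hdr["dst"], ("drop", False))
    if dec_ttl:
        hdr["ttl"] -= 1
    return hdr, egress

def deparse(hdr):
    """Deparser: reassemble the (possibly modified) headers into bytes."""
    return bytes([hdr["dst"], hdr["ttl"]]) + hdr["payload"]

hdr, egress = match_action(parse(bytes([7, 64]) + b"data"))
out = deparse(hdr)
print(egress, out)     # chosen egress port and rewritten packet (TTL 64 -> 63)
```

The contrast with OpenFlow is in who defines `parse` and `table`'s key structure: under OpenFlow they were fixed by the specification (the 12 → 41 field ossification); under P4 they are part of the program.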
eBPF/XDP extends this to the kernel: programmable packet processing at the earliest point in the Linux network stack. SmartNICs (Firestone et al. 2018) push it to every server’s NIC. The escape from interface ossification is total: every layer of the stack is now programmable.
7.9 The Grand Arc: From Baran to P4
7.9.1 The Evolving Anchor
| Era (Year) | Meta Constraint → Coordination Lock | State and Measurement | Time and Interface |
|---|---|---|---|
| ARPANET DV (1969) | Survivability → distributed (rumor) | Distance vectors, O(N), queue-length proxy | 128 ms updates; full table exchange |
| ARPANET LS (1980) | Consistency → distributed (shared truth) | Full topology, O(N²), direct delay measurement | 10s measurement, 100ms propagation; LSA flooding |
| STP (1985) | L2 transparency → distributed election | Minimal: root/cost/port | 30–50s convergence; transparent to hosts |
| BGP (1993) | Commercial sovereignty → autonomous | Policy-encoded RIBs, filtered measurement | MRAI 30s, 3–15 min convergence; hides topology |
| MPLS/SR (1996–2015) | Traffic engineering → distributed + explicit paths | Per-flow labels → stateless core | RSVP 30s refresh → TI-LFA sub-50ms |
| SDN (2004–2013) | Policy comprehensibility → centralized | Global NIB in controller | 0.4–12ms reactive; fixed match-action fields |
| P4 (2014) | Interface flexibility → centralized + programmable | Programmer-defined tables | Line-rate deterministic; programmable pipeline |
7.9.2 Three Design Principles Applied Across the Arc
The three design principles are fixed analytical tools. They do not change. What changes is the problem — the nature of the questions being asked, driven by evolving constraints. Designers apply the same principles to each new problem. The solutions differ because the problems differ.
Disaggregation is applied whenever a monolithic system’s concerns become too entangled to manage. Each era’s designers identified a different pair of concerns to separate:
- 1974: Cerf/Kahn separated transport (reliable delivery) from routing (forwarding) — because interconnecting heterogeneous networks required an unreliable network layer.
- 1989: The community separated intradomain routing (OSPF) from interdomain routing (BGP) — because commercial sovereignty required hiding internal topology.
- 2008: McKeown separated the control plane from the data plane across devices — because policy comprehensibility required a global view unavailable to distributed routers.
- 2014: Bosshart separated pipeline definition from hardware implementation — because protocol evolution outpaced the fixed OpenFlow specification.
Each application of disaggregation creates a new interface between the separated concerns. Each interface eventually ossifies — and the next generation applies disaggregation again at a different boundary.
Closed-loop reasoning is applied in every era to bridge the Environment-Measurement-Belief gap. The principle is the same — build a feedback loop that keeps Belief aligned with Environment through Measurement. What differs is the measurement quality available to each generation’s designers:
- 1969: DV designers applied closed-loop reasoning with a noisy proxy measurement (queue length). The loop diverged (count-to-infinity) because the measurement was contaminated by the loop’s own output.
- 1980: McQuillan applied the same principle with direct delay measurement. The loop converged because the measurement signal became independent of the computation.
- 1993: BGP designers applied closed-loop reasoning with structurally filtered measurement (policy-shaped advertisements). The loop converges under commercial hierarchy constraints (Gao-Rexford) — structural damping from the market, applied externally to the protocol.
- 2008: SDN designers applied closed-loop reasoning with near-complete measurement (controller’s global view). The loop period dropped to controller latency (0.3–0.8s for B4’s TE computation).
Decision placement is applied based on each era’s binding constraint. The principle — “the tighter the constraint on consistency, the more centralized” — is constant. What differs is the constraint:
- Survivability (1969) demands distributed placement — centralized control is a single point of failure.
- Correctness (1980) demands distributed placement with shared state — every node runs the same computation on the same data.
- Commercial sovereignty (1989) demands autonomous placement — each AS decides independently.
- Policy comprehensibility (2004) demands centralized placement — a global view is essential for consistent policy enforcement.
- Operational reality (2015) demands hybrid placement — centralized TE with distributed baseline forwarding (segment routing + SDN controller).
7.9.3 The Dependency Chain
7.9.4 Pioneer Diagnosis Table
| Year | Pioneer | Invariant | Diagnosis | Contribution |
|---|---|---|---|---|
| 1964 | Baran | (Meta constraint) | Centralized networks are fatally vulnerable | Proof that distributed mesh with 3–4 links/node survives → locks Coordination as distributed |
| 1970 | Heart/BBN | — | (Built first implementation) | IMP with distance-vector routing — the simplest distributed answer |
| 1980 | McQuillan | State + Time | Queue-length proxy wrong; routing-by-rumor inconsistent | Link-state: direct measurement + full topology + SPF |
| 1985 | Perlman | Interface | L2 transparency forbids full topology distribution | Spanning tree: constrain topology instead |
| 1993 | Fuller/Rekhter | State/Interface | Class-based addressing exhausted | CIDR: variable-length prefixes + BGP-4 |
| 1997–2001 | Labovitz | Time | BGP converges in 3–15 minutes; path exploration | MRAI timer analysis; convergence measurement |
| 2001 | Gao & Rexford | Coordination | BGP stability from market hierarchy, not protocol | Structural damping conditions: customer > peer > provider |
| 2002 | Griffin | Coordination | BGP stability is NP-complete; dispute wheels | Theoretical hardness proof |
| 2003 | Mao | Time → Coordination | Route flap damping worsens convergence | RFD paradox: Time fix misidentified normal Coordination behavior |
| 2004 | Feamster | Coordination | Distributed BGP configs are fragile | RCP: centralize routing control |
| 2007 | Casado | Coordination | Enterprise policy is unmanageable when distributed | Ethane: centralized policy over flow-based forwarding |
| 2008 | McKeown | Interface | No way to experiment on production networks | OpenFlow: standardized switch interface |
| 2014 | Bosshart | Interface | OpenFlow’s fixed fields ossified (12→41) | P4: programmable parse-match-deparse pipeline |
| 2015 | Filsfils | State | RSVP-TE per-flow state explodes at scale | Segment Routing: path state in packet header, stateless core |
7.9.5 Innovation Timeline
flowchart TD
subgraph sg1["Foundations (1958–1980)"]
A1["1958 — Bellman: dynamic programming for shortest paths"]
A2["1959 — Dijkstra: SPF algorithm"]
A3["1964 — Baran: distributed packet switching for survivability"]
A4["1969 — ARPANET IMPs with distance-vector routing"]
A5["1974 — Cerf & Kahn: TCP/IP and the gateway concept"]
A6["1978 — ARPANET routing oscillates — count-to-infinity crisis"]
A7["1980 — McQuillan: link-state replaces distance-vector"]
A1 --> A2 --> A3 --> A4 --> A5 --> A6 --> A7
end
subgraph sg2["Scaling and Policy (1985–2003)"]
B1["1985 — Perlman: Spanning Tree Protocol for loop-free L2"]
B2["1988 — RIP (RFC 1058): distance-vector for IP"]
B3["1989 — OSPF (RFC 1131) + BGP-1 (RFC 1105)"]
B4["1993 — CIDR (RFC 1519) + BGP-4 (RFC 1771)"]
B5["1997 — Labovitz: BGP convergence takes 3–15 min"]
B6["2001 — Gao-Rexford stability proof; MPLS (RFC 3031)"]
B7["2002 — Griffin: BGP stability is NP-complete"]
B8["2003 — Mao: route flap damping paradox"]
B1 --> B2 --> B3 --> B4 --> B5 --> B6 --> B7 --> B8
end
subgraph sg3["SDN Revolution (2004–2015)"]
C1["2004 — Feamster: Separating Routing from Routers"]
C2["2007 — Casado: Ethane (centralized enterprise control)"]
C3["2008 — McKeown: OpenFlow; Pakistan/YouTube hijack"]
C4["2010 — Onix: distributed SDN controller"]
C5["2012 — RPKI (RFC 6480): incremental origin validation"]
C6["2013 — Google B4: production SDN WAN"]
C7["2014 — P4: programmable data planes; SDX at IXPs"]
C8["2015 — Segment Routing; Jupiter Rising (Google DC)"]
C1 --> C2 --> C3 --> C4 --> C5 --> C6 --> C7 --> C8
end
subgraph sg4["Current Frontier (2017+)"]
D1["2017 — Google Espresso: SDN at peering edge"]
D2["2018 — eBPF/XDP: kernel-level programmable packets"]
D3["2018 — Azure SmartNICs: FPGA offload at hyperscale"]
D1 --> D2 --> D3
end
sg1 --> sg2 --> sg3 --> sg4
7.10 Generative Exercises
The Mars-Earth link has a one-way propagation delay of 3–22 minutes depending on orbital position. Design a routing system for a Martian colony network that must interoperate with Earth’s Internet.
Which invariants change? What does “convergence” mean when the feedback loop’s period is 44 minutes? Can you use link-state routing between Mars and Earth, or does the timescale force something else? What decision placement makes sense — centralized on Mars, distributed between planets?
BGP assumes honest path advertisements. Consider a network where some ASes deliberately inject false routes (prefix hijacking, path forgery). Which invariant’s answer breaks first? RPKI validates origin but not path — what would full path validation require? Apply the E-M-B framework: if Measurement is adversarially corrupted, what kind of Belief model can still be useful?
A fleet of delivery drones forms an ad-hoc mesh network. Topology changes every few seconds as drones move. Link-state flooding would be constant. Distance-vector would never converge. What routing design is needed? Which invariant is the binding constraint? How does the loop bandwidth (how fast routing can track changes) compare to the environment’s rate of change?
This chapter is part of “A First-Principles Approach to Networked Systems” by Arpit Gupta, UC Santa Barbara, licensed under CC BY-NC-SA 4.0.
In distance-vector routing, if node A routes to destination D through node B, and node B’s path to D fails, B may accept A’s advertisement of a path to D — creating a loop. Both nodes increment their distance to D through each other, counting toward infinity. The loop persists until the distance exceeds the protocol’s infinity threshold (typically 16 in RIP).↩︎
An MPLS label is a 20-bit integer inserted between the L2 and L3 headers (the “shim” header). Each router swaps the label based on a label forwarding table, enabling path control without examining the IP header. The label stack allows nested tunnels.↩︎
OpenFlow defines a match-action abstraction: each flow entry matches on packet header fields (source/destination IP, port, VLAN, etc.) and specifies an action (forward to port, drop, modify header, send to controller). The original OpenFlow 1.0 supported 12 match fields; OpenFlow 1.3 expanded to 41.↩︎