The Framework as a Design Tool
2026-04-09
You have the complete framework: four invariants, three principles, and the E-M-B (environment-measurement-belief) decomposition.
Today you use it to design two real systems from scratch.
Exercise 1: OSPF works inside one organization. The Internet commercializes. OSPF breaks across organizations. Design what replaces it. → BGP
Exercise 2: OSPF works but doesn’t scale economically. Networks grow massive. Per-device cost explodes. Design what replaces the architecture. → SDN/OpenFlow
Two evolutions from the same baseline. Same framework generates both.
Let’s visualize OSPF’s dependency graph. Help me fill this in — recall from Tuesday.
Binding constraint: survivability — the network must work despite failures (Baran, 1964)
| Invariant | OSPF’s answer | Forced by | Design principle |
|---|---|---|---|
| Coordination | Distributed — each router computes independently | Survivability → no central point of failure | Decision placement |
| State | Full topology — every router knows every link and cost | Distributed → need shared truth to avoid loops | Disaggregation: measurement from belief |
| Time | Event-driven flooding, sub-second convergence | Full topology → changes trigger immediate reflooding | Closed-loop: fast, honest feedback |
| Interface | LSAs — the format for exchanging raw link measurements | Cooperative trust → share everything, hide nothing | Honest measurement signal |
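To make the Coordination and State rows concrete, here is a minimal sketch (illustrative data structures, not OSPF's wire format): every router holds an identical link-state database assembled from flooded LSAs and runs Dijkstra on it independently.

```python
import heapq

# Minimal sketch: every router holds an identical link-state database
# (the full topology, assembled from flooded LSAs) and runs Dijkstra
# locally. Illustrative structures only, not OSPF's actual formats.
lsdb = {                        # per router: list of (neighbor, link cost)
    "A": [("B", 10), ("C", 5)],
    "B": [("A", 10), ("C", 2)],
    "C": [("A", 5), ("B", 2)],
}

def shortest_paths(source):
    """Each router computes this independently: same input, same answer,
    no central point of failure."""
    dist, frontier = {source: 0}, [(0, source)]
    while frontier:
        d, u = heapq.heappop(frontier)
        if d > dist.get(u, float("inf")):
            continue                      # stale queue entry
        for v, cost in lsdb[u]:
            if d + cost < dist.get(v, float("inf")):
                dist[v] = d + cost
                heapq.heappush(frontier, (d + cost, v))
    return dist

print(shortest_paths("A"))  # {'A': 0, 'B': 7, 'C': 5} -- B via C, not the direct link
```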
What assumption holds this together?
1990: ARPANET decommissioned. The Internet becomes an interconnection of independent networks.
This is fundamentally different from ARPANET:
| | ARPANET | Commercial Internet |
|---|---|---|
| Who operates it | One cooperative research community | Thousands of organizations — AT&T, Sprint, universities, corporations |
| Relationship | Collaborative — shared goals | Commercial — competing business interests |
| The problem | Route within one backbone | Organizations with disparate interests must figure out how to exchange traffic |
The tool at hand is OSPF. Is it the right tool?
Look at OSPF’s four invariant answers. Which one becomes impossible when separate commercial organizations must route between each other?
Recall OSPF’s State answer: full topology shared — every router sees every link via LSAs.
This means every router floods: "Link A-B, cost 10" — raw topology, nothing hidden.
Would AT&T flood its internal topology to Sprint? Its 500 routers, link capacities, traffic engineering policies?
No. Commercial competitors treat topology as a trade secret. OSPF’s State answer — share everything honestly — requires trust that no longer exists.
OSPF can still work inside each organization — single admin, full trust. But it fails across organizations. This means routing has to split: one system inside, a different system between.
What does this separation create? Who handles the boundary?
The State failure forces a split: intra-domain (OSPF, full trust) vs inter-domain (new protocol, filtered trust).
This creates gateway routers — routers at the AS boundary that speak both protocols. Inside: OSPF. Outside: the new inter-domain protocol.
Now: what does the gateway router advertise to other ASes?
OSPF advertises link states: "Link A-B, cost 10" — router-level. But we’re hiding internal routers. The outside world doesn’t know AT&T’s router names. What’s the right unit?
→ Prefixes — groups of IP addresses the AS is responsible for: 208.65.152.0/22
Plus the AS-level path for loop detection: [AS7018, AS3356, AS15169]
This is a path vector — more than a distance (DV), less than a topology (OSPF). The maximum competitors will disclose.
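A sketch of the path-vector mechanics under these constraints (hypothetical local ASN; real BGP updates carry many more attributes than prefix + AS_PATH):

```python
MY_ASN = 7018    # hypothetical: pretend we are AT&T's AS

def process_update(prefix, as_path):
    """Loop detection without topology disclosure: reject any route whose
    AS_PATH already contains our own ASN, otherwise prepend ourselves
    before re-advertising. Nothing internal ever leaves the AS."""
    if MY_ASN in as_path:
        return None                       # loop detected, discard
    return (prefix, [MY_ASN] + as_path)   # what our neighbors will see

print(process_update("208.65.152.0/22", [3356, 15169]))
# ('208.65.152.0/22', [7018, 3356, 15169])
print(process_update("208.65.152.0/22", [3356, 7018, 15169]))
# None -- our ASN is already in the path
```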
| | OSPF (cooperative) | Inter-domain (path vector) |
|---|---|---|
| What’s shared | Every link, every cost | Prefixes + AS-level path |
| What’s hidden | Nothing | Internal topology, capacity, congestion, cost |
| E-M-B gap | None — measurement = environment | Permanent — by design |
Is this the same type of gap as bufferbloat (accidentally noisy)? As count-to-infinity (circular belief)? Or something fundamentally different?
Time ← the loop runs across thousands of ASes exchanging prefix+path updates. How fast? DV ran at 128ms and oscillated. OSPF floods in seconds inside one admin.
→ Slow — 30-second minimum between updates for the same prefix. Stability over speed. Cost: 3-15 min convergence after failures. (Closed-loop: learned from DV’s mistake)
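That stability choice is small enough to sketch. A per-prefix minimum advertisement interval, simplified from BGP's classic 30-second eBGP default (real implementations batch and jitter these timers):

```python
import time

MRAI = 30.0      # classic eBGP default: min seconds between updates per prefix
_last_sent = {}  # prefix -> timestamp of last advertisement

def may_advertise(prefix, now=None):
    """Stability over speed: suppress an update if the last one for this
    prefix was too recent, forcing rapid flaps to coalesce."""
    now = time.monotonic() if now is None else now
    if now - _last_sent.get(prefix, float("-inf")) < MRAI:
        return False              # hold back; batch with later changes
    _last_sent[prefix] = now
    return True
```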
Selection ← OSPF picks shortest path. But commercial ASes have business preferences — a paying customer’s longer path beats a competitor’s shorter path.
→ Business preference (LOCAL_PREF) overrides shortest path. The protocol enforces business relationships, not optimal routing. (Decision placement: each AS applies local policy)
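The Coordination answer reduces to a comparator. A sketch of the top of the best-path decision (documentation-range ASNs; real BGP continues with MED, eBGP over iBGP, router ID, and further tie-breakers):

```python
from dataclasses import dataclass, field

@dataclass
class Route:
    local_pref: int               # business preference, set by local policy
    as_path: list = field(default_factory=list)

def best_path(routes):
    """Highest LOCAL_PREF wins before path length is even consulted."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))

customer = Route(local_pref=200, as_path=[64500, 64501, 64502])  # longer, but paying
peer     = Route(local_pref=100, as_path=[64510])                # shorter, but free
assert best_path([customer, peer]) is customer   # policy beats shortest path
```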
| Invariant | Your design | BGP (RFC 1105, 1989) | Design principle |
|---|---|---|---|
| Coordination | Each AS decides by local business policy | LOCAL_PREF > AS_PATH length > tie-breakers | Decision placement: maximally distributed |
| State | Path vector — prefix + AS path (structurally filtered) | AS_PATH + policy attributes; permanent E-M-B gap | Closed-loop: design measurement signal given privacy |
| Time | Slow updates for stability | 30-second minimum between updates; 3-15 min convergence | Closed-loop: stability over speed (DV’s lesson) |
| Interface | Disaggregated from internal routing | BGP ↔ OSPF/IS-IS boundary | Disaggregation: inter-domain from intra-domain |
OSPF → BGP: State broke because trust changed. Every other answer adapted.
Griffin, Shepherd & Wilfong (2002): with arbitrary policies, even deciding whether BGP has a stable routing is NP-complete (the Stable Paths Problem).
Gao-Rexford (2001): BGP provably converges if policies follow the customer-provider hierarchy: prefer customer routes, export only valley-free paths.
Stability comes from economic structure, not protocol design. The E-M-B gap is permanent. The system works because institutional constraints — the market hierarchy — keep policies aligned.
Consequence of no verification: Pakistan Telecom announces YouTube’s prefix (2008). BGP accepts it — no origin authentication. YouTube goes dark globally for 2 hours. RPKI (2012) partially fixes origin validation — same deployability meta-constraint.
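Origin validation is easy to sketch (simplified from RFC 6811; the ASNs below are the ones commonly reported for the 2008 incident): a ROA binds a prefix and maximum length to an authorized origin AS, and routers check announcements against it.

```python
import ipaddress

# Hypothetical ROA table: prefix -> (authorized origin ASN, max length).
roas = {"208.65.152.0/22": (36561, 24)}   # AS36561: YouTube

def validate(prefix, origin_asn):
    """Simplified RFC 6811 origin validation. With a ROA in place, the
    2008 hijack (a more-specific route announced by the wrong origin AS)
    would have been marked invalid instead of accepted."""
    net = ipaddress.ip_network(prefix)
    for roa_prefix, (asn, maxlen) in roas.items():
        if net.subnet_of(ipaddress.ip_network(roa_prefix)):
            if net.prefixlen <= maxlen and origin_asn == asn:
                return "valid"
            return "invalid"
    return "unknown"    # no covering ROA: the deployability gap in practice

print(validate("208.65.153.0/24", 17557))  # 'invalid' -- the hijack announcement
print(validate("208.65.152.0/22", 36561))  # 'valid'
```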
A tension is forming. BGP selects one best path per prefix — stability demands simplicity. Yet operators want finer control: route video one way, bulk transfers another. Every additional policy rule (communities, route maps, prefix-list filters) adds complexity per device — and complexity threatens the stability Gao-Rexford guarantees. Precision competes with stability. Control competes with cost. This same tension will reappear inside organizations — and it will break the architecture.
Go back to OSPF — inside a single organization (datacenter, campus, WAN). Same dependency graph as slide 4. But a different problem emerged.
Three forces converged:
1. Scale exploded. Cloud providers built datacenters with tens of thousands of switches. Each device running link-state + Dijkstra + line-rate forwarding = compute and memory bloat. Routers cost $500K+, and Cisco/Juniper duopoly pricing left organizations trapped.
2. Precision demanded bloat. OSPF routes by shortest path to IP prefix. Organizations wanted: video on path A, conferencing on path B, bulk transfers on path C. Per-application rules require more FIB/TCAM entries per device → more memory → higher cost. Every additional rule inflates every device.
3. Policy required touching every device. Each router owns its own control plane. Changing routing policy = reconfigure every router individually. Casado (2007): human error accounted for 62% of network downtime — because network-wide policy was expressed through thousands of lines of local configuration on individual devices.
Three forces: scale bloats per-device cost. Precision demands more rules per device. Policy requires touching every device individually.
All three trace back to one architectural choice: every router computes its own control plane. Route computation, policy expression, traffic engineering — distributed across every device.
Which invariant is that? And if it’s the problem — what’s the alternative?
But wait — in Lecture 3, Baran proved distributed coordination was essential for survivability. Doesn’t centralizing routing create a single point of failure?
“Network management is complex and requires strong consistency, making it quite hard to compute in a distributed manner.” — Casado et al., 2007
Baran’s context (1964): a national network across hostile territory. Survive nuclear attack. Centralization = one bomb destroys routing.
SDN’s context (2004): a datacenter or campus you fully control. You own every device, every link, every power supply.
Inside your own domain: you can replicate controllers, add fast failover, monitor health. Logically centralized ≠ physically centralized.
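A sketch of what "logically centralized, physically replicated" means in practice (hypothetical setup; production controller clusters use consensus protocols such as Raft):

```python
# Hypothetical failover sketch: one logical brain, several physical boxes.
controllers = ["ctrl-a", "ctrl-b", "ctrl-c"]   # replicas, in failover order

def active_controller(is_healthy):
    """Switches follow the first healthy replica: one logical decision
    point, no single physical point of failure."""
    for c in controllers:
        if is_healthy(c):
            return c
    raise RuntimeError("all controller replicas down")

# e.g. if ctrl-a's health check fails, decisions move to ctrl-b:
print(active_controller(lambda c: c != "ctrl-a"))  # 'ctrl-b'
```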
And the data showed: distributed configuration caused more downtime (62% from human error) than centralization risked. The real threat was no longer physical destruction — it was operational complexity.
“Why should network-wide routing decisions be implemented through thousands of lines of local configuration on individual, distributed devices?” — Feamster et al., 2004
The control plane and data plane are coupled in every device. Route computation, policy, traffic engineering — all crammed into the same $500K box. Same monolithic-IMP pattern from Lecture 3.
Heart separated forwarding from routing in 1969 — same box, different processes. What’s the equivalent separation here?
Binding constraint: per-device cost + operational complexity unsustainable. Manageability, not survivability.
Coordination ← single admin, full authority → centralized controller. The survivability tradeoff is acceptable because you control the domain and can replicate for redundancy.
State ← controller sees full topology (no secrets — you own everything). Switches hold only forwarding rules pushed to them. No Dijkstra, no link-state database on the switch.
Time ← controller pushes rules directly → sub-second. No distributed convergence. No path exploration.
Interface ← match-action rules: match on any header field (not just IP prefix). Video traffic? Match on port 443 + specific server IPs. This is the flexibility OSPF lacked — without inflating every device.
“OpenFlow provides an open protocol to program the forwarding table in different switches.” — McKeown et al., 2008
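A sketch of the match-action model the quote describes (field names loosely follow OpenFlow 1.0; the server address is hypothetical): the controller installs the rules, the switch only matches and forwards.

```python
import ipaddress

# Controller-installed rules, first match wins (like TCAM priority order).
flow_table = [
    ({"tcp_dst": 443, "ipv4_dst": "10.0.9.7"}, "output:port2"),  # video server
    ({"ipv4_dst": "10.0.0.0/8"},               "output:port1"),  # everything else
]

def field_matches(value, pattern):
    """Exact match, or CIDR containment for prefix patterns."""
    if isinstance(pattern, str) and "/" in pattern:
        return value is not None and ipaddress.ip_address(value) in ipaddress.ip_network(pattern)
    return value == pattern

def lookup(packet):
    """All that remains on the switch: a table lookup, no Dijkstra."""
    for match, action in flow_table:
        if all(field_matches(packet.get(k), v) for k, v in match.items()):
            return action
    return "send_to_controller"    # table miss: ask the controller for a rule

print(lookup({"tcp_dst": 443, "ipv4_dst": "10.0.9.7"}))  # 'output:port2'
print(lookup({"tcp_dst": 80,  "ipv4_dst": "10.0.1.1"}))  # 'output:port1'
```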
| Invariant | OSPF (coupled) | Your design | SDN/OpenFlow (2008) |
|---|---|---|---|
| Coordination | Distributed — each router decides | Centralized controller | NOX, ONOS, ODL |
| State | Full topology in every router | Global in controller; switches hold only rules | Network Information Base |
| Time | Distributed convergence (seconds) | Controller pushes rules (ms) | Flow setup: milliseconds |
| Interface | Per-prefix forwarding (limited) | Match on any header field (flexible) | OpenFlow match-action tables |
Heart (1969): separated forwarding from routing — same box, different processes.
SDN (2008): separated control plane from data plane — different devices entirely.
Same disaggregation principle. Applied more aggressively because the cost constraint demanded it.
| | OSPF (baseline) | BGP | SDN |
|---|---|---|---|
| Context | Intra-domain, cooperative | Inter-domain, commercial | Intra-domain, scale + flexibility |
| What broke | (baseline) | State — topology becomes a trade secret | Control/data coupling → cost bloat + inflexibility |
| Binding constraint | Survivability | Commercial sovereignty | Per-device cost + policy precision |
| Key principle | Closed-loop (LS flooding) | Closed-loop (slow updates, measurement under privacy) | Disaggregation (control from data plane) |
| Coordination | Distributed | Distributed (sovereign) | Centralized |
| State | Full topology | Filtered paths (permanent gap) | Global (controller) |
BGP: trust changes → State answer changes, coordination stays distributed.
SDN: cost + flexibility changes → deeper disaggregation, coordination centralizes.
Next week: wireless medium access — binding constraint is physics (shared spectrum). Same framework, new substrate. Read Ch 3.