Tapestry Behavioral Specification
Ben Y. Zhao
Version 1.0
August 28, 2002

Outline

  1. Introduction
    1. Purpose of this document
    2. Tapestry summary
  2. Naming
    1. Node GUID Assignment
    2. Object GUID Assignment
  3. Application Level API: Core Routing
    1. Route to defined ID: TapestryRouteMsg
    2. Proximity route to ID: TapestryPrefixRouteMsg
  4. Application Level API: Object Location
    1. Publish / advertising objects: TapestryPublishMsg
    2. Locate / route to objects: TapestryLocateMsg
    3. Object multicast behavior
  5. Internal Message Handling
    1. Object Location: Publish / Unpublish
    2. Routing: RouteMsg
  6. Dynamic Node Membership
    1. Node insertion
    2. Directed multicast
    3. Nearest neighbor
    4. Voluntary node deletion
    5. Involuntary node deletion
  7. Upcall Module: Multicast - Bayeux
    1. Multicast Session Announcement
    2. Node join protocol
    3. Node leave protocol
    4. Session Teardown Announcement
  8. Upcall Module: Anycast
    1. Anycast Group Announcement
    2. Anycast Join
    3. Anycast Leave
    4. Anycast Routing
  9. Upcall Module: Mobile Tapestry

1. Introduction

A. Purpose of this document
This document is written in order to clearly define the expected behavior of an implementation of the Tapestry overlay network.  We try to clearly outline the expected behavior from the perspective of the application interface, such that an application written for one Tapestry implementation should require minimal modification in order to work with a separate implementation.

B. Tapestry summary
Tapestry is a structured Peer-to-Peer overlay network infrastructure that efficiently provides location-independent routing and object location with deterministic performance.

As a network infrastructure, Tapestry allows applications to be written on top of it; these applications interact with it through a well-defined interface (most of which is defined in this document).

In a Tapestry overlay network, participating nodes are assigned a pseudo-random node identifier.  These are Globally Unique Identifiers, or GUIDs for short, and are typically randomized bit strings in a linear namespace from 0 to 2^160-1.  These nodes then self-organize into a Tapestry mesh, defined by distributed routing tables located at each node that point to a logarithmic number of other nodes in the mesh.  Given these routing tables, a Tapestry message can be routed from any node to any destination node, given the GUID of the destination node.

In the context of object location, objects can be any piece of data or metadata that has some notion of location inside the network.  Each object also has a Globally Unique Identifier in the same linear namespace as the node GUIDs.  For our purposes, object location is the process by which a node that has access to some data or resource advertises its availability through a publication process, and clients searching for that resource or data route a message through the Tapestry network layer to the location of the data or resource.

2. Naming

A. Node GUID assignment
There are a number of ways to generate the desired pseudo-random GUID for each node (machine) participating in the Tapestry network.  Any secure one-way hash algorithm (such as SHA-1 or MD5) will provide the pseudo-randomness necessary to avoid collisions.  The data to be hashed should provide some basic deterrence against Sybil attacks, where malicious nodes assume an arbitrarily large number of identities by continuously generating new GUIDs.  An intuitive choice is the IP address of the machine, which guarantees uniqueness across machines in the network.  While IP address spoofing is not trivial, it can be done with some effort.

A more secure solution is to require a public/private key pair, where the GUID is generated from a hash of the public key.  This allows anyone to verify the identity of the node in a cryptographically secure verification process.  Generating public/private key pairs is not difficult, however, so this alone does not deter Sybil attacks.  A more secure scheme would involve the use of one or more centralized certificate authorities, which would grant naming certificates once they have verified the requestor.  Verification might require some non-trivial computational task to deter continuous requests for new GUIDs.  Furthermore, the resulting certificate might map the GUID to a public key and IP address combination, both of which can be verified on demand.
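As an illustration, the key-based scheme reduces to hashing the node's public key with SHA-1.  The following is a minimal sketch in Java; the class name is ours and nothing here is part of the Tapestry API:

    import java.math.BigInteger;
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.MessageDigest;

    // Illustrative sketch only; not part of any Tapestry implementation.
    public class NodeGuidExample {
        public static void main(String[] args) throws Exception {
            // Generate the node's public/private key pair.
            KeyPair pair = KeyPairGenerator.getInstance("RSA").generateKeyPair();

            // The 160-bit GUID is the SHA-1 digest of the encoded public key,
            // placing the node pseudo-randomly in the namespace 0 to 2^160-1.
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] guid = sha1.digest(pair.getPublic().getEncoded());
            System.out.println("node GUID = " + new BigInteger(1, guid).toString(16));
        }
    }

Hashing the public key ties the GUID to a verifiable identity: any node can challenge another to sign a nonce with the corresponding private key.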

B. Object GUID assignment
There are also several ways of choosing the GUID of a given piece of data or resource.  Again, the way a GUID is assigned has implications for the authentication of data or resources.  One approach is to apply the one-way hash function to the entire contents of a piece of data to get the GUID.  While this allows for relatively simple verification, it restricts the ability of objects to change their contents over time.  While acceptable for static resources or immutable data, this naming scheme may be too limiting for applications dealing with more dynamic data, such as writable file systems.  An alternate scheme is to assign a static and unique descriptive metadata field to the data, and assign the GUID as a hash of that metadata.
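A minimal sketch of the two schemes, in the same spirit as the example above (the class and method names are hypothetical):

    import java.security.MessageDigest;

    // Illustrative sketch only; not part of any Tapestry implementation.
    public class ObjectGuidExample {
        // Content-hash naming: the GUID changes whenever the data changes,
        // which suits static or immutable objects.
        static byte[] guidFromContent(byte[] data) throws Exception {
            return MessageDigest.getInstance("SHA-1").digest(data);
        }

        // Metadata naming: the GUID is derived from a static descriptive
        // field, so the object's contents may change without renaming it.
        static byte[] guidFromMetadata(String description) throws Exception {
            return MessageDigest.getInstance("SHA-1").digest(description.getBytes("UTF-8"));
        }
    }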

3. Application Level API: Core Routing

We begin the discussion of Tapestry functionality by looking at node-to-node routing. This functionality depends solely on the distributed routing tables available on each Tapestry node to forward messages from any node to a destination node with the desired GUID. The routing algorithm uses a modified form of prefix routing, and does not involve object location lookups anywhere along the route.
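To make the prefix routing step concrete, the following sketch shows next-hop selection, assuming GUIDs are arrays of hexadecimal digits and a routing table indexed as table[level][digit]; the names and layout are illustrative, not the Tapestry API:

    // Illustrative sketch only; GUIDs are digit arrays, and
    // table[level][digit] holds the GUID of a known node (or null).
    public class PrefixRouting {
        // Return the GUID of the next hop toward dest.  The result may be
        // null if the required entry is empty; see surrogate routing in 3B.
        static int[] nextHop(int[] local, int[] dest, int[][][] table) {
            int n = 0;                          // length of shared prefix
            while (n < dest.length && local[n] == dest[n]) n++;
            if (n == dest.length) return local; // exact match: we are the destination
            return table[n][dest[n]];           // extend the shared prefix by one digit
        }
    }

Each hop resolves at least one more digit of the destination GUID, which bounds the route length by the number of digits, i.e. a logarithmic number of hops in the size of the namespace.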

A. Route to defined ID: TapestryRouteMsg
This is the message used for communication between two nodes that know each other by their GUIDs.  The assumption is that the sender of the message knows that the destination node exists.
Key fields:
  Peer    - the GUID of the destination node
  TTL     - the maximum number of hops the message may take
  Hops    - the number of hops actually taken, filled in upon delivery
  Payload - the application payload
When a node's Router receives a TapestryRouteMsg, it serializes the headers and concatenates the result to the application payload to generate a single byte array.  It then inserts the resulting full byte array as the payload of an internal RouteMsg.  If necessary, the Router should fragment the full byte array into multiple message fragments, each smaller than or equal to the set MTU size for the outgoing link.  For fragmented messages, each resulting fragment becomes the payload of a separate RouteMsg and gets the appropriate RouteMsg headers attached, including fragmentation and reassembly information.  If, in the course of routing, this message finds that no node GUID is an exact match of the destination GUID, the message is dropped silently.

From the application's perspective, message routing is successful if the full reassembled TapestryRouteMsg is received by a node whose GUID exactly matches the Peer field (if it is found within TTL hops; otherwise it is delivered to some node that is TTL hops away from the source en route to the destination).  The number of hops taken (as long as it is <= TTL) does not affect success, and should be recorded in TapestryRouteMsg.Hops upon delivery.  Internally, Tapestry does not need to use a preset route to deliver the message, and can route around node/link failures where appropriate.
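A minimal sketch of the fragmentation step described above, assuming each chunk simply becomes the payload of one RouteMsg; the class and method names are ours:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative sketch only; not the actual Tapestry fragmentation code.
    public class Fragmenter {
        // Cut the serialized headers-plus-payload into chunks of at most
        // mtu bytes; each chunk becomes the payload of a separate RouteMsg
        // carrying fragmentation and reassembly headers.
        static List<byte[]> fragment(byte[] full, int mtu) {
            List<byte[]> frags = new ArrayList<byte[]>();
            for (int off = 0; off < full.length; off += mtu) {
                int len = Math.min(mtu, full.length - off);
                byte[] frag = new byte[len];
                System.arraycopy(full, off, frag, 0, len);
                frags.add(frag);
            }
            return frags;
        }
    }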

B. Proximity route to ID: TapestryPrefixRouteMsg
The namespace used for nodes is necessarily large in order to accommodate potentially very large networks while avoiding name collisions.  For reasonably-sized or smaller networks, however, this results in a very sparse namespace of nodes.  Therefore, simply routing a message to a random GUID would likely not reach a physical node with an exactly matching name.  The TapestryPrefixRouteMsg routes messages to some existing node in the network that acts as a replacement, or surrogate, for the GUID we would like to match.
When a node's Tapestry Router receives a TapestryPrefixRouteMsg, it serializes the headers and concatenates the result to the application payload to generate a single byte array.  It then inserts the resulting full byte array as the payload of an internal RouteMsg.  A flag in the RouteMsg is set to enable "fuzzy routing."  Long messages are fragmented in the same fashion as TapestryRouteMsgs.  As this message is routed between nodes, if an empty routing entry is encountered, meaning no exact match exists for the destination GUID, the message uses surrogate routing to map the destination GUID to some existing node in the network.  Barring unrecoverable failures or corruption, this message should always be delivered to some node in the network.
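This document does not fix the surrogate selection rule, but a common convention from the Tapestry literature is to take the next non-empty entry at the same routing level, wrapping around the digit space; since every node applies the same deterministic rule, all messages for a given GUID converge on the same surrogate.  A sketch, reusing the table[level][digit] layout from the earlier example:

    // Illustrative sketch only.  When the exact entry for the wanted
    // digit is empty, deterministically pick the next non-empty entry at
    // the same level (next higher digit value, wrapping around).
    static int[] surrogateHop(int[][] level, int wantedDigit) {
        int radix = level.length;
        for (int i = 0; i < radix; i++) {
            int digit = (wantedDigit + i) % radix;
            if (level[digit] != null) return level[digit];
        }
        return null;  // the level is empty: the local node is the surrogate
    }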

4. Application Level API: Object Location

Object location and routing are an integral part of the Tapestry functionality set.  The basic routines allow a node to advertise, or make known, its access to a local resource, and allow clients to route messages to that node given the unique identifier (GUID) of the resource or data.  This object location mechanism is a general layer of indirection which can be leveraged for a variety of applications, the most general of which is routing to higher-level names.  These names are higher-level in the sense that they are not tied down to a particular network node.  Any particular node in the network can advertise an object name, and receive messages on its behalf.  Furthermore, there can be multiple instances of a single name, providing a flexible way to do receiver-controlled multicast.

A. Publish / advertising objects: TapestryPublishMsg
Before any routing to an object or name can occur, the nodes responsible for the name or object must first make its availability known.  This is done in Tapestry by sending a TapestryPublishMsg into the Tapestry mesh.  This message routes to a deterministically chosen physical node in the network, using surrogate routing where necessary.  The invariant should be that, no matter where in the network TapestryPublishMsgs originate, they all reach the same final "root" node if they are advertising the same object GUID.
The handling of a TapestryPublishMsg is relatively straightforward.  Since there is no payload other than the message's fields, the Router translates the fields into those of an ostore.tapestry.impl.PublishMsg and forwards it on towards the root of the object GUID.  At each hop, the local router reads the fields of the PublishMsg and stores a mapping between the object GUID and the node GUID of the original source of the TapestryPublishMsg.  The TapestryTag is also kept alongside the node GUID.  The message terminates when it has reached a root node.
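A minimal sketch of the per-hop publish handling, assuming a simple in-memory pointer cache; the class, map layout, and the forwardTowardRoot helper are hypothetical, not the ostore.tapestry.impl API:

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative sketch only.  Maps objectGuid -> (serverGuid -> tag):
    // the location pointers stored at every hop between the publishing
    // server and the object's root node.
    public class PointerCache {
        private final Map<String, Map<String, byte[]>> pointers = new HashMap<>();

        void handlePublish(String objectGuid, String serverGuid, byte[] tag,
                           boolean isRoot) {
            // Record that serverGuid holds a replica of objectGuid, keeping
            // the TapestryTag alongside the node GUID.
            Map<String, byte[]> replicas = pointers.get(objectGuid);
            if (replicas == null) {
                replicas = new HashMap<>();
                pointers.put(objectGuid, replicas);
            }
            replicas.put(serverGuid, tag);
            if (isRoot) {
                return;  // the message terminates at the root node
            }
            // Otherwise, forward the PublishMsg one hop closer to the root:
            // forwardTowardRoot(objectGuid);  (hypothetical helper)
        }
    }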

B.  Locate / route to objects: TapestryLocateMsg
For a client to locate an object, it must send a TapestryLocateMsg to the GUID of the desired object.  Through Tapestry object location, the infrastructure will embed this message in an internal message, which is then forwarded towards the object GUID's root node.  Along the way, it will query each router for a location mapping matching the desired object and query, yielding the node GUID of the object's location.  The message is then redirected to that node via Tapestry routing.
When a TapestryLocateMsg is received from the application layer, the Router routes towards the root node of the object GUID, while at each hop searching for a locally cached location pointer which matches the desired object GUID.  If the Query field is non-zero, replicas that match the object GUID must also match the Query tag in order to satisfy the query.  LocateMsgs contain a QueryState field which is accessible to upcall functions.
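Continuing the PointerCache sketch above, per-hop locate handling might look as follows (add import java.util.Arrays; routeToNode and forwardTowardRoot are hypothetical helpers):

    // Illustrative sketch only.  At each hop, check the local pointer
    // cache; on a match, redirect to the replica's node, otherwise keep
    // routing toward the object's root.
    void handleLocate(String objectGuid, byte[] queryTag) {
        Map<String, byte[]> replicas = pointers.get(objectGuid);
        if (replicas != null) {
            for (Map.Entry<String, byte[]> e : replicas.entrySet()) {
                // A non-null query tag must also match the replica's tag.
                if (queryTag == null || Arrays.equals(queryTag, e.getValue())) {
                    // routeToNode(e.getKey());  // redirect via Tapestry routing
                    return;
                }
            }
        }
        // No matching pointer here: continue toward the root node.
        // forwardTowardRoot(objectGuid);
    }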

C. Object multicast behavior
In the basic object location case, a TapestryLocateMsg is forwarded to the first object replica which matches the desired object GUID, and also matches its Query.  In a core "upcall module" provided by Tapestry, LocateMsgs invoke their upcalls when they find any objects with matching GUIDs, or when they terminate at the root of the object.

The default behavior is for QueryState to encode the number of unique replicas (N) to route this message to.  At each hop where a matching object GUID is found, an upcall is triggered, which then checks the list of matching object replicas to see if the required number of replicas has been found.  If so, the message is replicated and routed to the N closest object replicas.
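A sketch of that default upcall logic, under the same assumptions as the earlier examples (the replica bookkeeping and routing helpers are hypothetical):

    import java.util.List;

    // Illustrative sketch only.  QueryState carries the target replica
    // count n; once n matching replicas are known at some hop, replicate
    // the message to the n closest ones, otherwise keep routing upward.
    public class MulticastUpcall {
        void onMatch(List<String> matchingReplicas, int n) {
            if (matchingReplicas.size() >= n) {
                // sortByNetworkDistance(matchingReplicas);  (hypothetical)
                for (String replica : matchingReplicas.subList(0, n)) {
                    // routeToNode(replica);  // replicate and send
                }
            } else {
                // forwardTowardRoot();  // not enough replicas seen yet
            }
        }
    }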

5. Internal Message Handling

The message types specified in Sections 3 and 4 are those used by applications to interface with Tapestry, either through a function call or through a message-passing scheme.  In this section, we describe the message types and formats used for Tapestry internal messages.  In many cases, these messages encapsulate and generalize the specific functionality sets specified by the API messages.

There are two types of internal messages: Publish/Unpublish messages, and Route to Node/Object messages.

A. Object Location: Publish / Unpublish


B. Routing: RouteMsg