Position Paper for the SIGOPS Workshop on Fault Tolerance Support in Distributed Systems, 1990

Andrew Birrell
Systems Research Center, Digital Equipment Corporation, 130 Lytton Avenue, Palo Alto, CA 94022, U.S.A.

At SRC we have been exploring the provision and use of fault tolerance in the basic facilities of a distributed system - the physical communications, the name service and the file service. We now have research prototypes of each of these, and we are starting to gain experience in how tolerant they really are.

Our LAN, called "Autonet", provides a mesh-connected network with link speeds of 100 Mbits/sec connecting into full cross-bar switches. Each host is connected to two switches, with automatic fail-over. The network itself is self-configuring: the switches dynamically re-arrange their routing tables in response to failures of links or switches.

Our name service is fully and seamlessly integrated with the file service (which is called "Echo"). Both services provide the same semantics; they differ only in their response to failures. From the point of view of the normal user, the choice of placing a directory in the name service or in the file service is made by considering the desired fault tolerance characteristics. A global path name is rooted in the name service. Name resolution starts in the name service and proceeds until it encounters a "junction" describing a file service volume; the remainder of the path name is then presented to the file service (this hand-off is sketched below). The name service and file service are both accessible through a single interface, so, for example, the Unix "ls" and "find" commands work just as well within the name service as within the file service.

The name service is implemented with the familiar lazy update replication scheme: updates are committed at the initial replica and asynchronously propagated to the other replicas after "success" has been returned to the client. A name service volume is available to a client provided the client can contact at least one replica of the volume. This gives very high availability, but at the cost of weaker consistency guarantees.

The file service is implemented with a replication scheme providing tight consistency: an update does not return to the client until at least a majority of the replicas has committed it. A file service volume is available to a client provided the client can contact a majority of the replicas of that volume. This gives lower availability, but with the benefit of simpler and more powerful semantics - as seen by the client, all replicas of file volumes always contain the same data. (In addition, we maintain tight consistency for cached file system data by using a token-based cache consistency algorithm on client machines.)

In general, we can glue these two components together in arbitrary ways. We can have a file volume as a child of a name service directory, but equally we could have a name service volume as a child of a file service directory. The administrator can choose how to organize his name space, providing the appropriate trade-off between availability and consistency for each object. In practice, we anticipate that the system will be configured to use the name service for the higher level parts of the name space, and the file service for the lower level parts. This corresponds to the observation that the higher level parts change slowly - the weak update semantics are unlikely to disturb users - and so we can benefit from the very high availability of the name service.
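To make the junction mechanism concrete, here is a small Python sketch of the resolution walk described above. The NameService, FileService and Junction classes and the resolve function are invented for illustration - they are not the real Echo interfaces - and they omit replication, caching and error handling; the sketch shows only how a global path name starts at the name-service root and is handed off to a file volume when a junction is encountered.

    class Junction:
        """Entry in a name-service directory that names a file-service volume."""
        def __init__(self, file_service, volume):
            self.file_service = file_service
            self.volume = volume

    class NameService:
        """Toy stand-in: lazily replicated, highly available directories."""
        def __init__(self):
            self.dirs = {}           # directory path -> {entry: child path or Junction}

        def lookup(self, directory, entry):
            return self.dirs[directory][entry]

    class FileService:
        """Toy stand-in: majority-committed, tightly consistent volumes."""
        def __init__(self):
            self.volumes = {}        # volume name -> {path within volume: contents}

        def open(self, volume, remainder):
            return self.volumes[volume]["/".join(remainder)]

    def resolve(name_service, path):
        """Resolve a global path name: walk name-service directories from the
        root until a junction is met, then hand the rest to the file service."""
        components = [c for c in path.split("/") if c]
        current = "/"                # resolution is rooted in the name service
        for i, component in enumerate(components):
            entry = name_service.lookup(current, component)
            if isinstance(entry, Junction):
                # The remainder of the path name is presented to the file service.
                return entry.file_service.open(entry.volume, components[i + 1:])
            current = entry          # still within the name service
        return current               # the whole path named a name-service directory

    # Example: /com/dec lives in the name service; "src" is a junction to a file volume.
    fs = FileService()
    fs.volumes["src-vol"] = {"birrell/notes.txt": "position paper draft"}
    ns = NameService()
    ns.dirs["/"] = {"com": "/com"}
    ns.dirs["/com"] = {"dec": "/com/dec"}
    ns.dirs["/com/dec"] = {"src": Junction(fs, "src-vol")}
    print(resolve(ns, "/com/dec/src/birrell/notes.txt"))   # -> position paper draft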
The lower level parts, in contrast, change frequently, so lazy update propagation would be difficult to live with. Further, the higher level parts of the name space tend to be widely shared, whereas the lower level parts have high locality of access. This also corresponds well to the two replication schemes: the lazy update propagation scheme works well over poorly connected wide area networks, but all existing tight-consistency schemes face substantial performance penalties if the replicas are dispersed across such networks.

We believe that the combination of these facilities will provide us with the basis for a distributed system that is flexible, scalable, and capable of tolerating many failure modes while retaining high availability. Right now (April) the system is just about to enter service at SRC; by September we should have some real experience with its successes and failures.
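To make the contrast between the two update disciplines concrete, here is a small Python sketch. The Replica class and the lazy_update, propagate_lazily and majority_update functions are invented for illustration - they are not Echo or name-service code - and they ignore logging, recovery and real asynchrony. They show only the availability difference described above: a lazily replicated volume needs just one reachable replica, while a majority-committed volume needs a majority, which is also why dispersing its replicas over a wide area network is expensive.

    class Replica:
        """Toy replica holding a key/value store; 'reachable' models partitions."""
        def __init__(self, name, reachable=True):
            self.name = name
            self.reachable = reachable
            self.store = {}

        def commit(self, key, value):
            if not self.reachable:
                raise ConnectionError(self.name + " is unreachable")
            self.store[key] = value

    def lazy_update(replicas, key, value):
        """Name-service style: commit at one reachable replica and return
        "success" immediately; propagation to the others happens later."""
        for initial in replicas:
            if initial.reachable:
                initial.commit(key, value)
                return "success", initial     # the client is done at this point
        raise RuntimeError("no replica reachable: volume unavailable")

    def propagate_lazily(replicas, initial, key, value):
        """The asynchronous propagation pass, modelled here as a later best-effort
        sweep; replicas that are down now converge when they come back, so
        readers may briefly observe stale data."""
        for other in replicas:
            if other is not initial and other.reachable:
                other.commit(key, value)

    def majority_update(replicas, key, value):
        """File-service style: do not return until a majority of the replicas
        has committed the update, so clients never see divergent copies."""
        committed = 0
        for replica in replicas:
            try:
                replica.commit(key, value)
                committed += 1
            except ConnectionError:
                pass
        if committed <= len(replicas) // 2:
            raise RuntimeError("no majority reachable: volume unavailable")
        return "success"

    # With only one of three replicas reachable, the lazily replicated volume
    # stays available while the majority-committed volume does not.
    replicas = [Replica("a"), Replica("b", reachable=False), Replica("c", reachable=False)]
    print(lazy_update(replicas, "/com/dec", "new entry")[0])   # success
    try:
        majority_update(replicas, "report.txt", "contents")
    except RuntimeError as error:
        print(error)                                           # no majority reachable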