Donut implements two levels of hash table interfaces: one between the end-user and the system, and another between the request servers and donut nodes. This separates the implementation details to be abstracted from the user while maintaining flexibility to optimize the interface for system-internal operations.
The external interface, exposed to the end-user is a traditional get/put/remove hash interface. get takes a string (the key), and returns a stream of binary data (the value for that key). put takes a string (the key) and a stream of binary data (the value to be inserted for that key). remove takes a string (the key). The end-user is not exposed to details such as how the key is hashed, replication, or the separation between finding the responsible node:DonutNode and data transfer.
- A call findSuccessor is exposed to the request servers.
- The key used in the get, put, and remove routines have two fields: a string and a 64-bit id (in practice the id is encapsulated in such that the key-space could be arbitrary extended). Keeping a copy of the original key (the string) allows the system to deal with collisions.
Internally, finding the node responsible for a key and propagating the data are separate tasks. On receiving a request from the end user, a request server hashes the key into a 64-bit number. It then calls findSuccessor to find the node responsible for the key and invokes the applicable hash table procedure on that node.
Replication is done on two levels. Internally, nodes in the ring replicate data to their successor to guarantee availability of data when nodes leave the ring. A second level or replication is implemented by the client. On this level data is replicated to different keys across the ring. This level of replication can guarantee that stale reads can be detected.
A donut node replicates every write request (both puts and removes) to the next R successor nodes. This guarantees that when a node leaves the ring, it’s successor – which becomes responsible for it’s data – maintains a consistent view of that section of the data.
Replication is implemented in a way that guarantees that all successors have the data change before a write is considered finished. Specifically, a replicatePut or replicateRemove is sent recursively to the R successor nodes. The actual commit (update or removal of data) happens on the way back up the call stack. The originating node will not receive a response unless all nodes replicated successfully. The goal is not to guarantee that stale reads will not occur, or that the client can in all cases ascertain whether the write was successful. Rather, the goal is to guarantee that if a particular write was not successful the client will know. This means there are scenarios in which all nodes replicating a key have the updated data, but the client would be notified that the write was not successful.
There is another level of replication that is done on a higher level, by the request servers. The method used is based on a design in Dearle, Kirby, Norcross. There is a global constant R which specifies the total number of replicas (including the master).
For a given key k, R – 1 additional keys are computed to replicate the data on such that k(i) = k0 + i * K / R, where i is the ith replica. Given a reasonably distributed set of nodes, the successors for these keys will be a set of R unique nodes. If the chosen R is such that the total number of replicas is odd, this method can be extended to ensure that stale reads are identified.
Each write is done to all R keys. All subsequent reads of that key are also done on all R keys. If there is a consensus (a majority of the replicas hold the same value) than the read is correct and up to date. Otherwise the read is stale. Several scenarios can produce the stale read state. For a discussion of some of these scenarios, as well as some suggestions for possible ways to recover from stale reads, see Further Work.