Bolour Computing logo

September 11, 2009

Update Succession in Replicated Key-Value Stores

Filed under: database — Azad Bolour @ 7:16 am

In an earlier blog we looked at the use of vector clocks for keeping track of temporal relations between events in an asynchronous event system. To recap:

In an asynchronous event system each event is marked by a node-clock pair; intra-node temporal relations between events are based on clock values at a given node; and inter-node temporal relations between events rely on message transmittals being earlier than corresponding message receipts.

And we saw earlier that in such a system, an event E2 is a temporal successor of another event E1 if and only if the vector clock of E2 dominates the vector clock of E1. [The vector clock of an event is a map from the nodes of a system to the latest clock values of those nodes known to have occurred before (or at the same time as) the event.]

In this blog we’ll see how to extend the use of vector clocks to keep track of update succession in update-anywhere replicated key-value stores. This type of store is exemplified by Amazon’s Dynamo, and by the open source system Voldemort. In the literature I have seen so far on these systems, it is assumed without elaboration that vector clocks can be used to represent update succession. But as we’ll see shortly, this assumption is not immediately evident. And to prove it requires some constraints on asynchronous updates.

My aim in this blog is to outline the difference between temporal succession and update succession, and to show what this difference means for the use of vector clocks in update-anywhere data stores.

General Asynchronous Events

In order to demonstrate the use of vector clocks for update succession, I need to digress first to generalize the model of message passing between asynchronous events. The first generalization is to allow events to be both transmitters and receivers of multiple messages. The second generalization is to allow loopback messages from a node to itself.

Figure 1 depicts this more general model.

vector-clock-general

Figure 1. General Asynchronous Event System

[In Figure 1, the notation E(node, clock) designates an event that occurred at the given node and the given clock value at that node.]

It is easy to extend earlier arguments about the equivalence of temporal succession and vector clock dominance to this more general model. The main difference between the two models is that in propagating vector clocks we may now have to include the vector clocks from multiple message sources in our maximal merge computation.

Asynchronous Updates and Distributed Versions of Data Items

A replicated data store includes a set of key-value pairs, called data items, each of which is replicated to a number of nodes. For high write availability, an update to the value of a data item is allowed to be written at any available node.

Then independent and asynchronous updates of the value associated with a given key may have to be written to different nodes, so that multiple versions of a data item may coexist in the store as a whole. Each such differing version of a data item may, in its own right, carry useful information. Therefore, in general, these versions are not allowed to blindly overwrite each other. For maximum flexibility, coexisting versions of a data item are resolved (merged) by application code specific to the use of each instance of a data store.

This scenario leads to a model of the evolution of a data item in which:

  • A read by the application may cause a number of different versions of the data item for a given key to be read from the data store.
  • The application creates a single updated version of the data item based on all the versions read.
  • The new version is then written and it obsoletes and replaces all the versions read in this particular update operation.

The updated version is then an update successor of each version read. And the versions read for the update are update precursors of the updated version. Of course, the update successor and the update precursor relations are transitive. And we define them to be reflexive as well.

We know that an updated version of a data item should obsolete and replace all of its [proper] precursors. But when the new version of the data item is first written, some of its precursors may not be present at the node of this initial write. And even for those precursors that are present at this initial write node, there are replicate copies at other nodes that also need to be purged. While this new version will be replicated to all replicate nodes, replications may have to take place asynchronously to the original write of this new version. Therefore, the new version of the data item needs to carry with it information about its precursors, so that they can be purged once its replicate copies reach other nodes.

Writes of Data Items as Asynchronous Events

Conceptually, we may consider an entire update operation – including all its reads, their resolution, and the subsequent write of a new version – as an event in a general asynchronous event system. And for the purpose of tracking update succession, we may consider this event as occurring at the node in which the new update version is first written and at the clock value of the write at that node. Looked at in this manner, the corresponding reads can be thought of as messages sent from earlier write events (earlier versions of the data item) to the new update event (the new version of the data item).

The upshot is that if we identify versions of a data item with update (or initial write) events, we have here a system of events that is similar to our earlier general asynchronous event system.

Figure 2 depicts the update succession of versions in this scenario.

vector-clock-update-succession

Figure 2. Update Succession with Asynchronous Versions

In Figure 2, a version of a data item initially written to node n at clock value c of that node is depicted as V(n, c). Note in particular, in Figure 2, that a version may be the immediate update precursor to two asynchronous versions – as in V(1, 1) being asynchronously updated to V(2, 3) and to V(3, 2) – and that a version may be the immediate update successor of two asynchronous versions – as in V(1, 7) succeeding both V(2, 3) and V(3, 4).

Update Succession versus Temporal Succession

The similarity of our update/version event system and our earlier asynchronous event system depicted in Figure 1, leads us to associate vector clocks with write events (and corresponding versions of a data item) and to try to use them to determine the update succession of versions, and thereby to cause the obsolescence and purge of updated versions.

But before we can make the leap between the two event systems, there is another crucial property of asynchronous event systems that we have yet to establish for write events in a replicated data store: the linear temporal succession of events within each node according to their clock values.

Is there, in fact, a linear order of update succession for a data item within each node according to clock value in an update-anywhere replicated data store? Well, not by default. Following is a trivial counter-example.

Consider two different clients reading the same version of a data item, and proceeding to update it independently at the same node. If the system blindly writes both update versions to the data store, then one can occur at a clock time later than the other. But the second update version is not an update successor of the first: it was not created by reading the first and updating it, and it does not obsolete the first. This is a crucial difference between the event system of update-anywhere replicated data stores, and the general asynchronous event system we saw earlier.

The Case for Read Validation

The only way I know to remove this difference is to assume that in an update, reads are validated within the update transaction at its primary write node for those versions of the data item that were created at that node. If further versions of the data item – later versions than those that were read by the update operation – were created at the node where the update is first written, the update would be rejected and possibly retried.

Read validation specialized to the the primary node of an update in this manner would imply that the versions of a data item created at a given node are totally ordered in time via the local clock value at the node, and that this total ordering entails update succession: each version of a data item created at a node is an update successor of the version immediately before it. I’ll call this condition totally ordered local update succession.

Totally ordered local update succession: Within the sequence of versions of a data item created at a given node, ordered by their clock values at that node, each version is the result of an update whose read set included the immediately preceding version.

At this point, we have the sought-after similarity in the structure of the predecessor relation and its relation to clocks for general asynchronous events, and the structure of the precursor relation and its relation to clocks for write events of a given data item (and for corresponding versions) in an update-anywhere replicated data store. But as have seen, temporal succession defined through the predecessor relation for asynchronous events is equivalent to vector clock dominance. Therefore, update succession defined through the precursor relation for write events of a given data item (and for corresponding versions) must be equivalent to vector clock dominance as well.

Propagating Vector Clocks to New Versions

To maintain vector clocks for versions of data items, we need to perform the maximal merge of the immediate precursors of a version plus the node-clock of the version itself. The precursors are the versions read by the update operation. So reads need to piggy-back vector clocks with each version of a data item read. And, of course, writes need to store the new (maximally merged) vector clock with each update version of a data item. All versions of the same data item residing at the node of the write and dominated by the new vector clock then become obsolete and may be purged.

But What about Replicate Writes?

Replicate writes were excluded from our event system of writes/versions because replicate writes do not in fact create new versions of data items, nor new vector clocks. A replicate write simply copies a version and its vector clock intact from one node to another. Whether an update operation reads a version of a data item from its initial write node or from a replicate node is immaterial to the relationship between that version and the update version.

Of course, upon reaching a replicate node, a replica’s vector clock obsoletes any versions of the data item whose vector clocks it dominates, and allows them to be purged from that replicate node.

Acknowledgments

Thanks to the members of the Silicon Valley Patterns Group and in particular to Wayne Vucenic and Chris Tucker for useful discussions on distributed key-value stores. A special thanks to Jay Kreps, the creator of Voldemort, for participating in our group discussions.

September 10, 2009

Vector Clocks for Representing Temporal Relations between Distributed Events

Filed under: distributed systems — Azad Bolour @ 10:23 pm

In this blog I’ll review the use of vector clocks for comparing the times of occurrence of dispersed events in a distributed system. Vector clocks are well-known in the literature, and there are plenty of resources on the net about them. See, for example, the Wikipedia entry about them for original references and credits.

Vector clocks are useful in a scenario where we are not able or willing to make assumptions about clock drift rates at independent nodes of a distributed system, or about message transmission rates between nodes. In such cases, our primary source of information about temporal relations between events across nodes is the fact that for each message sent from node i to node j, the transmittal of the message from node i happens before the receipt of the message at node j. For example, suppose that a message is transmitted from node 1 when node 1’s clock value is 1, and is received at node 2 when node 2’s clock value is 3. If the clock value and the node of origin are piggy-backed on the message, then node 2 will know that its clock value of 3 is after node 1’s clock value of 1.

Based solely on the fact that message traversal entails some (unknown) delay, and that clock values at each node are increasing over time, we would like to be able to determine, for any two events in the system, whether one occurred before the other, or whether nothing can be asserted about their temporal relation.

My primary motivation for going over vector clocks here is to set the stage for my next blog about the use of vector clocks in update-anywhere replicated data stores.

But before we get there, here is a concrete example of the direct application of vector clocks. Consider a battlefield situation where a number of independent moving agents are observing a number of moving objects and communicating the positions of these objects to each other. The observers may get in and out of communication range of each other, and their communications equipment may get jammed or damaged and later be unjammed and repaired. Vector clocks make it possible to determine when an observation in such a scenario is obsoleted by another, so that decisions may be made based on the most recent observations for each object of interest.

Asynchronous Event System

The kind of event system described above is known as an asynchronous event system.

Asynchronous event system. A system of events in which each event is marked by a node+clock pair, where intra-node temporal relations between events are based on clock values at a given node, and where inter-node temporal relations between events rely on message transmittals being earlier than corresponding message receipts.

Figure 1 depicts the elements of an asynchronous event system.

vector-clock-basic-event-system

Figure 1. Nodes, Clocks, and Messages in an Asynchronous Event System

The notation E(node, clock) is used to represent an event that occurred at a given node and a given clock value at that node. We’ll call the pairing of a node and a clock, a node-clock (e.g., (3, 6) – the clock value of 6 at node 3). Nodes and clock values are represented as non-negative integers. We’ll typedef these integers to Node, and Clock, respectively, and represent their pairing as a class named NodeClock.

The Predecessor Relation between Events

In this scenario, the events can be thought of as forming the vertices of a graph with two types of edges: a last edge from an event to the last event that immediately preceded it at the same node; and a source edge from a message receipt event to the corresponding message transmittal event. Let’s call this graph the predecessor graph of the set of events, and let’s call the direct relation represented by this graph, that is, the union of the last and the source relations, the immediate-predecessor relation.

Figure 2 depicts the predecessor graph of the distributed events in Figure 1.

vector-clock-predecessor-graph

Figure 2. Predecessors of Events

Suppose we have two references E1 and E2 to events in such a system. Then E2 succeeds E1 in time, as far as we know, if and only if there is a path from E2 to E1 in the predecessor graph of events. Of course, if E1 and E2 refer to the same event, then their times are also comparable. (For simplicity, I am making the (inessential) assumption that at each node, different events occur at different clock values.) Thus, comparability in time leads directly to the reflexive-transitive closure of the immediate-predecessor relation, which I will call simply the predecessor relation.

predecessor: reflexive-transitive closure of (source + last)

Representing the Predecessor Relation

Vector clocks are a natural data structure motivated by the need for an efficient representation of the predecessor relation.

The direct (and inefficient) representation of the predecessor relation would include, for each event, a list of all node-clocks of events that precede it (directly or transitively) in the predecessor graph.

	class Event {
		Node origin;
		Set<NodeClock> predecessors;
	}

In this representation, the predecessor set for event E(3, 5) of Figure 2 is:

(3, 5),
(3, 4),
(2, 4),
(2, 3),
(1, 1)

But we can keep the size of this representation bounded by removing, for each predecessor node, all but the latest node-clock for that node. In our example, the predecessors of event E(3, 5) at node 2 include both E(2, 4), and E(2, 3). But we would know by comparing clock values at node 2, that event E(2, 3) is a predecessor of event E(2, 4). So there is no need keep (2, 3) explicitly in the predecessor set. Since E(2, 4) is the latest predecessor of E(3, 5) at node 2, we know that any other event originating from node 2 whose clock value is less than 4 is also a predecessor of E(3, 5).

So we end up with a representation in which the set of predecessors can be replaced by a map of latest predecessors:

	class Event {
		Node origin;
		Map<Node, Clock> latestPredecessors;
	}

The map of latest predecessors is called a vector clock.

In what follows, I’ll use the terms vector clock and latest predecessors interchangeably. The former is standard terminology. The latter is more intention-revealing in this writeup.

vector clock of event: map of latest predecessors of an event for each node

The vector clock of event E(3, 5) in our running example is then:

(1, 1),
(2, 4),
(3, 5)

Clock values for particular nodes in a vector clock may be represented by array reference notation. For example, E.vectorClock[1] (or E.latestPredecessors[1]) refers to the clock value of event E’s latest predecessor originating at node 1.

Direct Test of Temporal Succession

To summarize, including vector clocks in the representation of events provides for an easy test of temporal succession:

For any two events E1(n1, c1) and E(n2, c2):

E1 is a predecessor of E2 (E2 succeeds E1), if and only if,

c1 <= E2.latestPredecessors[n1] (c1 <= E2.vectorClock[n1]) … (1)

Of course, for the latter comparison to be true, E2 must have a latest predecessor for node 1. To finesse this case, we may use a clock value of -1 for each node that has no predecessor in a vector clock.

Vector Clock Dominance Test of Temporal Succession

For our direct test (1) to be useful, we need to keep track of both the vector clocks of events, and the nodes of origination of events.

But what if we don’t keep track of the nodes of origination?

In that case, we can use an equivalent test based solely on vector clocks: the vector clock dominance test. A vector clock vc1 is dominated by another, vc2, if for every node, the clock value of vc1 is no greater than the corresponding clock value of vc2.

It is then easy to demonstrate that:

vector clock dominance is equivalent to temporal succession

Or, in symbols:

E1(n1, c1) <= E2(n2, c2), if and only if,

E1.latestPredecessors <= E2.latestPredecessors

where the relational symbol <= is overloaded to depict both the is-dominated-by relation between vector clocks, and the temporal succession relation between events.

The equivalence of vector clock dominance and temporal succession is a fairly simple consequence of the direct test for temporal succession established above. Here are the details for completeness.

Suppose E1(n1, c1) <= E2(n2, c2). Clearly, since E2 succeeds E1, E2’s predecessors must include all of E1’s predecessors, and so E2’s latest predecessor at each node can’t have occurred any earlier than E1’s latest predecessor. In other words, temporal succession implies vector clock dominance.

Suppose, on the other hand, that E2’s vector clock dominates E1’s: E1(c1, n1).latestPredecessors <= E2(n2, c2).latestPredecessors. Then by the definition of vector clock dominance:

E1(n1, c1).latestPredecessor[n1] <= E2(n2, c2).latestPredecessors[n1] … (2)

(this relation holds for every node, and in particular for node n1). But since the predecessor relation was defined to be reflexive, the latest predecessor of E1(n1, c1) at node n1 (its node of origination), is E1(n1, c1) itself. So (2) simplifies to:

c1 <= E2.latestPredecessors[n1]

which by (1) above implies E1 <= E2.

Propagating Vector Clocks

It is easy to see that when a new event occurs, its vector clock has to become the maximal merge of the vector-clocks of its immediate predecessors plus its own node-clock. The maximal merge of a set S of vector clocks is a vector clock in which the clock value for each node, n, is the maximum clock value for n within the vector clocks in S. This type of computation can easily be performed if we piggy-back the vector clock of a message transmittal event onto the message.

This completes our overview of the use of vector clocks in asynchronous event systems.

What’s Next

As mentioned earlier, this blog forms the backdrop for my next blog on the use of vector clocks in update-anywhere replicated data stores. We’ll see there that just as in our battlefield example, vector clocks help remove obsoleted observations from consideration for each agent, in a replicated data store, vector clocks help remove obsoleted versions of a data item from the data store for each replica of the store.

Powered by WordPress