<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Programming Practices &#187; database</title>
	<atom:link href="http://bolour.com/blog/index.php/category/database/feed/" rel="self" type="application/rss+xml" />
	<link>http://bolour.com/blog</link>
	<description></description>
	<lastBuildDate>Mon, 14 Sep 2009 18:17:57 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Update Succession in Replicated Key-Value Stores</title>
		<link>http://bolour.com/blog/2009/09/update-succession-in-replicated-key-value-stores/</link>
		<comments>http://bolour.com/blog/2009/09/update-succession-in-replicated-key-value-stores/#comments</comments>
		<pubDate>Fri, 11 Sep 2009 07:16:19 +0000</pubDate>
		<dc:creator>Azad Bolour</dc:creator>
				<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://bolour.com/blog/?p=114</guid>
		<description><![CDATA[In an earlier blog we looked at the use of vector clocks for keeping track of temporal relations between events in an asynchronous event system. To recap:
In an asynchronous event system each event is marked by a node-clock pair; intra-node temporal relations between events are based on clock values at a given node; and inter-node [...]]]></description>
			<content:encoded><![CDATA[<p>In an <a href="http://bolour.com/blog/2009/09/vector-clocks-for-representing-temporal-relations-between-distributed-events/">earlier blog</a> we looked at the use of vector clocks for keeping track of temporal relations between events in an <a href="http://bolour.com/blog/2009/09/vector-clocks-for-representing-temporal-relations-between-distributed-events/#asynchronous-distributed-event-system">asynchronous event system</a>. To recap:</p>
<blockquote><p>In an <strong><em>asynchronous event system</em></strong> each event is marked by a node-clock pair; intra-node temporal relations between events are based on clock values at a given node; and inter-node temporal relations between events rely on message transmittals being earlier than corresponding message receipts.</p></blockquote>
<p>And we saw <a href="http://bolour.com/blog/2009/09/vector-clocks-for-representing-temporal-relations-between-distributed-events/#vector-clock-dominance">earlier</a> that in such a system, an event E2 is a <em>temporal successor</em> of another event E1 if and only if the vector clock of E2 dominates the vector clock of E1. [The vector clock of an event is a map from the nodes of a system to the latest clock values of those nodes <strong><em>known to have occurred</em></strong> before (or at the same time as) the event.]</p>
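<p>As a minimal sketch of the dominance test, assume a vector clock is a plain map from node name to clock value, and that a node missing from the map counts as clock value 0. The names <em>dominates</em> and <em>concurrent</em>, and the sample clocks, are illustrative only.</p>

```python
from typing import Dict

# Assumed representation: node name -> latest clock value known at that node.
VectorClock = Dict[str, int]


def dominates(a: VectorClock, b: VectorClock) -> bool:
    """True if a is at least as large as b at every node (missing entries count as 0)."""
    return all(a.get(node, 0) >= clock for node, clock in b.items())


def concurrent(a: VectorClock, b: VectorClock) -> bool:
    """True if neither clock dominates the other."""
    return not dominates(a, b) and not dominates(b, a)


e1 = {"n1": 1}
e2 = {"n1": 1, "n2": 3}
e3 = {"n1": 1, "n3": 2}

print(dominates(e2, e1))   # e2 is a temporal successor of e1
print(concurrent(e2, e3))  # neither e2 nor e3 succeeds the other
```

<p>Note that dominance in this sketch is reflexive: a clock dominates itself.</p>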
<p>In this blog we&#8217;ll see how to extend the use of vector clocks to keep track of update succession in update-anywhere replicated key-value stores. This type of store is exemplified by Amazon&#8217;s <a href="http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf">Dynamo</a>, and by the open source system <a href="http://project-voldemort.com/design.php">Voldemort</a>. In the literature I have seen so far on these systems, it is assumed without elaboration that vector clocks can be used to represent update succession. But as we&#8217;ll see shortly, this assumption is not self-evident, and proving it requires some constraints on asynchronous updates.</p>
<p>My aim in this blog is to outline the difference between <em>temporal succession</em> and <em>update succession</em>, and to show what this difference means for the use of vector clocks in update-anywhere data stores.</p>
<h3>General Asynchronous Events</h3>
<p>In order to demonstrate the use of vector clocks for update succession, I need to digress first to generalize the model of message passing between asynchronous events. The first generalization is to allow events to be both transmitters and receivers of multiple messages. The second generalization is to allow loopback messages from a node to itself.</p>
<p>Figure 1 depicts this more general model.</p>
<blockquote><p><img class="aligncenter size-full wp-image-119" title="vector-clock-general" src="http://bolour.com/blog/wp-content/uploads/2009/09/vector-clock-general.gif" alt="vector-clock-general" width="521" height="232" /></p>
<p>Figure 1. General Asynchronous Event System</p></blockquote>
<p>[In Figure 1, the notation <em>E(node, clock)</em> designates an event that occurred at the given node and the given clock value at that node.]</p>
<p>It is easy to extend <a href="http://bolour.com/blog/2009/09/vector-clocks-for-representing-temporal-relations-between-distributed-events/#vector-clock-dominance">earlier arguments</a> about the equivalence of temporal succession and vector clock dominance to this more general model. The main difference between the two models is that in <a href="http://bolour.com/blog/2009/09/vector-clocks-for-representing-temporal-relations-between-distributed-events/#vector-clock-propagation">propagating vector clocks</a> we may now have to include the vector clocks from multiple message sources in our <em>maximal merge</em> computation.</p>
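<p>The maximal merge over several message sources might be sketched as follows. The function names and the sample clocks are assumptions made for illustration; the sample values only loosely follow the <em>E(node, clock)</em> notation of Figure 1.</p>

```python
def maximal_merge(clocks):
    """Entry-wise maximum over any number of vector clocks."""
    merged = {}
    for clock in clocks:
        for node, value in clock.items():
            merged[node] = max(merged.get(node, 0), value)
    return merged


def event_clock(node, local_clock, incoming_clocks):
    """Vector clock of an event at `node` that receives several messages at once."""
    merged = maximal_merge(incoming_clocks)
    # The event itself is the latest thing known to have occurred at its own node.
    merged[node] = max(merged.get(node, 0), local_clock)
    return merged


# An event at node n1, clock 7, receiving messages from two earlier events.
incoming = [{"n1": 1, "n2": 3}, {"n1": 1, "n3": 4}]
print(event_clock("n1", 7, incoming))
```

<p>A loopback message needs no special handling here: its vector clock is simply one more element of the incoming list.</p>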
<h3>Asynchronous Updates and Distributed Versions of Data Items</h3>
<p>A replicated data store includes a set of key-value pairs, called <em>data items</em>, each of which is replicated to a number of nodes. For high write availability, an update to the value of a data item is allowed to be written at any available node.</p>
<p>Then independent and asynchronous updates of the value associated with a given key may have to be written to different nodes, so that multiple versions of a data item may coexist in the store as a whole. Each such differing version of a data item may, in its own right, carry useful information. Therefore, in general, these versions are not allowed to blindly overwrite each other. For maximum flexibility, coexisting versions of a data item are resolved (merged) by application code specific to the use of each instance of a data store.</p>
<p>This scenario leads to a model of the evolution of a data item in which:</p>
<ul>
<li>A read by the application may cause a number of different versions of the data item for a given key to be read from the data store.</li>
<li>The application creates a single updated version of the data item based on all the versions read.</li>
<li>The new version is then written and it obsoletes and replaces all the versions read in this particular update operation.</li>
</ul>
<p>The updated version is then an <em>update successor</em> of each version read. And the versions read for the update are <em>update precursors</em> of the updated version. Of course, the update successor and the update precursor relations are transitive. And we define them to be reflexive as well.</p>
<p>We know that an updated version of a data item should obsolete and replace all of its [proper] precursors. But when the new version of the data item is first written, some of its precursors may not be present at the node of this initial write. And even for those precursors that are present at this initial write node, there are replicate copies at other nodes that also need to be purged. While this new version will be replicated to all replicate nodes, replications may have to take place asynchronously to the original write of this new version. Therefore, the new version of the data item needs to carry with it information about its precursors, so that they can be purged once its replicate copies reach other nodes.</p>
<h3>Writes of Data Items as Asynchronous Events</h3>
<p>Conceptually, we may consider an entire update operation &#8211; including all its reads, their resolution, and the subsequent write of a new version &#8211; as an event in a general asynchronous event system. And for the purpose of tracking update succession, we may consider this event as occurring at the node in which the new update version is first written and at the clock value of the write at that node. Looked at in this manner, the corresponding reads can be thought of as messages sent from earlier write events (earlier versions of the data item)  to the new update event (the new version of the data item).</p>
<p>The upshot is that if we identify versions of a data item with update (or initial write) events, we have here a system of events that is similar to our earlier general asynchronous event system.</p>
<p>Figure 2 depicts the update succession of versions in this scenario.</p>
<blockquote><p><img class="aligncenter size-full wp-image-122" title="vector-clock-update-succession" src="http://bolour.com/blog/wp-content/uploads/2009/09/vector-clock-update-succession.gif" alt="vector-clock-update-succession" width="521" height="232" /></p>
<p>Figure 2. Update Succession with Asynchronous Versions</p></blockquote>
<p>In Figure 2, a version of a data item initially written to node <em>n </em>at clock value <em>c </em>of that node is depicted as <em>V(n, c)</em>. Note in particular, in Figure 2, that a version may be the immediate update precursor to two asynchronous versions &#8211; as in V(1, 1) being asynchronously updated to V(2, 3) and to V(3, 2) &#8211; and that a version may be the immediate update successor of two asynchronous versions &#8211; as in V(1, 7) succeeding both V(2, 3) and V(3, 4).</p>
<h3>Update Succession versus Temporal Succession</h3>
<p>The similarity between our update/version event system and the general asynchronous event system depicted in Figure 1 leads us to associate vector clocks with write events (and corresponding versions of a data item) and to try to use them to determine the <em>update succession</em> of versions, and thereby to cause the obsolescence and purge of superseded versions.</p>
<p>But before we can make the leap between the two event systems, there is another crucial property of asynchronous event systems that we have yet to establish for write events in a replicated data store: the linear temporal succession of events within each node according to their clock values.</p>
<p>Is there, in fact, a linear order of <em><strong>update succession</strong></em> for a data item within each node according to clock value in an update-anywhere replicated data store?  Well, not by default. Following is a trivial counter-example.</p>
<p>Consider two different clients reading the same version of a data item, and proceeding to update it independently at the same node. If the system blindly writes both update versions to the data store, then one can occur at a clock time later than the other. But the second update version is not an <em><strong>update</strong></em> successor of the first: <em><strong>it was not created by reading the first and updating it, and it does not obsolete the first</strong></em>. This is a crucial difference between the event system of update-anywhere replicated data stores, and the general asynchronous event system we saw earlier.</p>
<h3>The Case for Read Validation</h3>
<p>The only way I know to remove this difference is to assume that in an update, reads are validated within the update transaction at its primary write node for those versions of the data item that were <em>created</em> at that node. If further versions of the data item &#8211; later versions than those that were read by the update operation &#8211; were created at the node where the update is first written, the update would be rejected and possibly retried.</p>
<p>Read validation specialized to the primary node of an update in this manner would imply that the versions of a data item created at a given node are totally ordered in time via the local clock value at the node, and that this total ordering entails update succession: each version of a data item created at a node is an update successor of the version immediately before it. I&#8217;ll call this condition <strong><em>totally ordered local update succession</em></strong>.</p>
<blockquote><p><em><strong>Totally ordered local update succession</strong></em>: Within the sequence of versions of a data item created at a given node, ordered by their clock values at that node, each version is the result of an update whose read set included the immediately preceding version.</p></blockquote>
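<p>One possible way to implement read validation at the primary write node is sketched below. It is my own simplification, not a description of Dynamo or Voldemort: the node remembers the local clock of the latest version it created for each key, and rejects an update whose read set does not know about that version. The class and method names are hypothetical.</p>

```python
class StaleReadError(Exception):
    """Raised when an update did not read the latest version created at its write node."""


class Node:
    def __init__(self, name):
        self.name = name
        self.clock = 0            # local clock, incremented on every accepted write
        self.latest_created = {}  # key -> local clock of the latest version created here

    def write_update(self, key, read_clocks):
        """Validate the read set, then return the vector clock of the new version."""
        # Highest clock value of this node that the versions read know about.
        known = max((vc.get(self.name, 0) for vc in read_clocks), default=0)
        latest = self.latest_created.get(key, 0)
        if latest > known:
            raise StaleReadError(f"{key}: read set misses local version at clock {latest}")
        # Maximal merge of the read set, plus the node-clock of this write.
        new_clock = {}
        for vc in read_clocks:
            for node, value in vc.items():
                new_clock[node] = max(new_clock.get(node, 0), value)
        self.clock += 1
        new_clock[self.name] = self.clock
        self.latest_created[key] = self.clock
        return new_clock


node = Node("n1")
v1 = node.write_update("cart", [])    # initial write
v2 = node.write_update("cart", [v1])  # client A read v1 and updates it
try:
    node.write_update("cart", [v1])   # client B also read only v1
except StaleReadError as err:
    print("rejected:", err)
v3 = node.write_update("cart", [v2])  # client B retries after re-reading
```

<p>With this check in place, the second of the two independent updates in the counter-example above is rejected instead of being written blindly.</p>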
<p>At this point, we have the sought-after similarity in the structure of the <em>predecessor</em> relation and its relation to clocks for general asynchronous events, and the structure of the <em>precursor</em> relation and its relation to clocks for write events of a given data item (and for corresponding versions) in an update-anywhere replicated data store. But as we have seen, <em>temporal succession</em> defined through the predecessor relation for asynchronous events is equivalent to vector clock dominance. Therefore, <em>update succession</em> defined through the precursor relation for write events of a given data item (and for corresponding versions) must be equivalent to vector clock dominance as well.</p>
<h3>Propagating Vector Clocks to New Versions</h3>
<p>To maintain vector clocks for versions of data items, we need to perform the maximal merge of the vector clocks of the immediate precursors of a version, plus the node-clock of the version itself. The precursors are the versions read by the update operation. So reads need to piggy-back vector clocks with each version of a data item read. And, of course, writes need to store the new (maximally merged) vector clock with each update version of a data item. All versions of the same data item residing at the node of the write and dominated by the new vector clock then become obsolete and may be purged.</p>
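<p>A rough sketch of such a write follows. It assumes the versions of one data item held at a node are kept as a list of (vector clock, value) pairs; the function names and sample values are hypothetical.</p>

```python
def dominates(a, b):
    """True if a is at least as large as b at every node (missing entries count as 0)."""
    return all(a.get(node, 0) >= clock for node, clock in b.items())


def write_version(store, node, node_clock, value, precursor_clocks):
    """Write a new version at `node` and purge the local versions it obsoletes."""
    # Maximal merge of the precursor vector clocks.
    new_clock = {}
    for vc in precursor_clocks:
        for n, c in vc.items():
            new_clock[n] = max(new_clock.get(n, 0), c)
    # Add the node-clock of the new version itself.
    new_clock[node] = node_clock
    # Keep only the versions that the new vector clock does not dominate.
    kept = [(vc, v) for vc, v in store if not dominates(new_clock, vc)]
    kept.append((new_clock, value))
    return kept


# Versions of one data item present at node n1 before the update.
store_n1 = [
    ({"n1": 1, "n2": 3}, "cart-a"),
    ({"n1": 1, "n3": 4}, "cart-b"),
    ({"n4": 1}, "cart-c"),  # not read by the update below
]
read = [vc for vc, _ in store_n1[:2]]
store_n1 = write_version(store_n1, "n1", 7, "cart-merged", read)
print(store_n1)
```

<p>The version that was not read is not dominated by the new vector clock, so it survives the purge and remains to be resolved by a later update.</p>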
<h3>But What about Replicate Writes?</h3>
<p>Replicate writes were excluded from our event system of writes/versions  because replicate writes do not in fact create new versions of data items, nor new vector clocks. A replicate write simply copies a version and its vector clock intact from one node to another. Whether an update operation reads a version of a data item from its initial write node or from a replicate node is immaterial to the relationship between that version and the update version.</p>
<p>Of course, upon reaching a replicate node, a replica&#8217;s vector clock obsoletes any versions of the data item whose vector clocks it dominates, and allows them to be purged from that replicate node.</p>
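<p>Applying a replicate write at a receiving node might look like the sketch below, under the same assumed list-of-pairs representation of the versions of one data item. Dropping a replica that is itself already dominated (a duplicate or a late arrival) is an extra assumption of mine, not something stated above.</p>

```python
def dominates(a, b):
    """True if a is at least as large as b at every node (missing entries count as 0)."""
    return all(a.get(node, 0) >= clock for node, clock in b.items())


def apply_replica(store, incoming_clock, value):
    """Store a replicated version with its vector clock intact; purge what it obsoletes."""
    # Assumption: a replica already dominated by a stored version is dropped.
    if any(dominates(vc, incoming_clock) for vc, _ in store):
        return store
    kept = [(vc, v) for vc, v in store if not dominates(incoming_clock, vc)]
    kept.append((incoming_clock, value))
    return kept


store_n2 = [({"n1": 1, "n2": 3}, "cart-a")]
store_n2 = apply_replica(store_n2, {"n1": 7, "n2": 3, "n3": 4}, "cart-merged")
store_n2 = apply_replica(store_n2, {"n1": 1, "n3": 4}, "cart-b")  # late arrival
print(store_n2)
```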
<h3>Acknowledgments</h3>
<p>Thanks to the members of the Silicon Valley Patterns Group and in particular to Wayne Vucenic and Chris Tucker for useful discussions on distributed key-value stores. A special thanks to Jay Kreps, the creator of Voldemort, for participating in our group discussions.</p>
]]></content:encoded>
			<wfw:commentRss>http://bolour.com/blog/2009/09/update-succession-in-replicated-key-value-stores/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Google App Engine Data Model</title>
		<link>http://bolour.com/blog/2009/06/the-google-app-engine-data-model/</link>
		<comments>http://bolour.com/blog/2009/06/the-google-app-engine-data-model/#comments</comments>
		<pubDate>Mon, 01 Jun 2009 22:40:53 +0000</pubDate>
		<dc:creator>Azad Bolour</dc:creator>
				<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://bolour.com/blog/?p=29</guid>
		<description><![CDATA[Decades ago in my college database class we learned about relational databases, network databases, and hierarchical databases. Back then, relational was cool. And hierarchical was definitely passé. Today, hierarchical is making a comeback with the Google App Engine (GAE). Here is a brief overview.
In GAE, persistent data for a given application consists of a set [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;">Decades ago in my college database class we learned about relational databases, network databases, and hierarchical databases. Back then, relational was cool. And hierarchical was definitely passé. Today, hierarchical is making a comeback with the Google App Engine (GAE). Here is a brief overview.</p>
<p>In GAE, persistent data for a given application consists of a set of <strong><em>entities</em></strong>. An entity has a <strong><em>kind</em></strong>: a string that designates a set of similar entities. However, entities of a given kind need not be homogeneous.</p>
<p>The properties of an entity are represented by a name-value map. The names are strings. The values are basic types including dates and blobs. Properties may also be multi-valued. The fact that homogeneity within kinds is not a requirement means that different entities of a given kind may have different sets of property names, and that they may have differently-typed properties for identically named properties.</p>
<p>Entities form distinct strict hierarchies within a data store. An entity either has a unique parent, or no parent (a root entity). A root entity and all of its descendants form a cluster of entities known as an <em><strong>entity group</strong></em>. The parent relationship between entities and the corresponding entity groups arising from this relationship play a pivotal role in the GAE data store.</p>
<p>One way in which the hierarchical nature of the model manifests itself is in the construction of entity keys. Within a given kind, a root entity is identified either by a unique name, a string, or by a unique ID, a long integer. So a root entity is globally uniquely identified by the combination of kind and name/ID. Let&#8217;s call this combination of kind and name/ID a <em>simple key</em> [my terminology]. Non-root entities are identified within their parents by unique simple keys. So in general, a key in the data model can be thought of as a path composed of simple keys. A key of length 1 uniquely identifies a root entity. A key of length 2 uniquely identifies an entity at level 2 of the hierarchy. And so on.</p>
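<p>To make the path structure concrete, here is a small model of keys as tuples of simple keys. This is not the GAE API; the types and helper functions are hypothetical and only mirror the description above.</p>

```python
from typing import NamedTuple, Optional, Tuple, Union


class SimpleKey(NamedTuple):
    kind: str
    name_or_id: Union[str, int]  # a unique name (string) or a unique ID (integer)


# A full key is a path of simple keys, starting at a root entity.
Key = Tuple[SimpleKey, ...]


def child_key(parent: Key, kind: str, name_or_id: Union[str, int]) -> Key:
    """Key of an entity identified within its parent by a simple key."""
    return parent + (SimpleKey(kind, name_or_id),)


def parent_of(key: Key) -> Optional[Key]:
    """Key of the parent entity, or None for a root entity."""
    return key[:-1] or None


def same_entity_group(a: Key, b: Key) -> bool:
    """Two entities are in the same entity group if they share a root."""
    return a[0] == b[0]


author = (SimpleKey("Author", "bolour"),)  # key of length 1: a root entity
post = child_key(author, "Post", 114)      # key of length 2
comment = child_key(post, "Comment", 1)    # key of length 3
print(comment)
```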
<p>Another way in which the hierarchical nature of the model manifests itself is in the togetherness semantics of entity groups, which, as you will recall, consist of a root entity and all of its descendants. The underlying storage structure used to store entities is Google&#8217;s BigTable. BigTables are stored in a distributed fashion by assigning ranges of records (based on key) to different BigTable servers. The ranges are called <strong><em>tablets</em></strong>. What&#8217;s special about entity groups with respect to this type of sharding is that members of an entity group are not divided among different servers: they stay close together at all times, and are managed by a single server at any given time. A transaction involving members of the same entity group can therefore be managed by a single server and implemented simply and efficiently.</p>
<p>Currently GAE transactions are limited to single entity groups. GAE could, but currently does not, support transparent transaction management across multiple entity groups, by implementing distributed transactions under the covers.</p>
<p>Why can&#8217;t we have a dummy root entity and have the entire data store hang off of that in a single entity group? Because GAE&#8217;s design limits throughput by entity group. Specifically:</p>
<ul>
<li>GAE uses optimistic concurrency control for transactions, and its concurrency control algorithm operates at the root level of an entity group. The optimistic concurrency control timestamp of the root reflects the last update time of any entity in the group. So large entity groups increase the likelihood of timestamp conflicts and resulting rollbacks.</li>
<li>Currently the data store cannot support more than about 10 writes per entity group per second all told. So large entity groups reduce parallelism between transactions since a transaction affecting any entity in an entity group requires a write to the group&#8217;s root entity.</li>
</ul>
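<p>The effect of group-level optimistic concurrency control can be illustrated with a toy model. It is a deliberately simplified sketch, not GAE&#8217;s implementation: a counter stands in for the root timestamp, and all class names are hypothetical.</p>

```python
class ConflictError(Exception):
    """Raised when another transaction committed to the same entity group first."""


class EntityGroup:
    def __init__(self):
        self.root_timestamp = 0  # stands in for the root's last-update timestamp
        self.entities = {}

    def begin(self):
        return Transaction(self, self.root_timestamp)


class Transaction:
    def __init__(self, group, seen_timestamp):
        self.group = group
        self.seen_timestamp = seen_timestamp
        self.writes = {}

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Any commit to the group since this transaction began is a conflict,
        # even if it touched a different entity.
        if self.group.root_timestamp != self.seen_timestamp:
            raise ConflictError("entity group changed since the transaction began")
        self.group.entities.update(self.writes)
        self.group.root_timestamp += 1


group = EntityGroup()
t1, t2 = group.begin(), group.begin()
t1.put("post-1", "first")
t2.put("post-2", "second")  # a different entity in the same group
t1.commit()
try:
    t2.commit()
except ConflictError as err:
    print("rolled back:", err)
```

<p>The two transactions write different entities, yet the second one still fails, which is why hanging everything off a single dummy root would serialize all writes.</p>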
<p>One other major restriction on transactions is that queries are not supported within transactions. You can get an entity by its key within a transaction. But you cannot run a search based on property values within a transaction.</p>
<p>Relationships other than the special parent relationships may be represented explicitly in an entity as a property whose value is the key of the related entity. [Parent relationships, of course, are managed implicitly by GAE.] But GAE does not provide special support for relationships other than the parent relationship. No special support for traversal or joins, for example.</p>
<p>By default all indexable properties are automatically indexed (text and blob values are not indexed). It is also possible to explicitly create composite indexes on more than one property. Surprisingly for a search engine company, full text indexing of text fields is not supported at this time.</p>
<p>This completes my brief overview of the data model.</p>
<p>Clearly, there are a number of choices in the basic design of the GAE data model that are different from those of relational databases. Some of these, like native support for hierarchies and multi-valued properties can help you model your application more easily. Clearly too, other choices, such as the transaction restrictions, and the non-existence of native support for joins, can make your applications more complex, or less able to satisfy the real requirements of your users.</p>
<p>Now let&#8217;s shift attention to GAE&#8217;s high level APIs. GAE provides JDO and JPA interfaces to its persistent data. The main abstraction embodied by these high-level interfaces is the homogeneity of kinds. Since in JDO and JPA entities are persistent versions of Java objects, an entity kind in these interfaces represents a Java class that requires persistence. All entities of a given kind then necessarily have the same set of properties and corresponding property types.</p>
<p>Unfortunately, at the moment, the high-level interfaces do not hide the fact that transactions are limited to individual entity groups. A transaction that spans a second entity group triggers an exception, independently of which interface is used.</p>
<p>Nor do the high-level interfaces transparently provide joins or certain other familiar SQL functions at this time. In other words, the JDO and JPA query language variants supported by GAE are somewhat limited. A join appearing in a JDO query, for example, will trigger an exception.</p>
<p>The designers of the data store API have so far avoided exposing functionality whose performance may be iffy, or which may complicate the management of the data store. So for now, we have to roll our own frameworks, or look to third party offerings to fill in the gaps: gaps between our expectations of functionality from databases, conditioned as they are by relational databases, and what the GAE can reasonably deliver.</p>
<p>As an example, the impression one gets is that the App Engine folks are not at this time eager to embrace transparent distributed transactions across entity groups in the base GAE product. Clearly life can get more complicated with say two-phase commit and the possibility of a distributed transaction coordinator crashing after the prepare phase of a transaction. Other developers, however, are working on distributed transactions (see, for example, <a title="this talk" href="http://code.google.com/events/io/sessions/DesignDistributedTransactionLayerAppEngine.html" target="_self">this talk</a> at Google IO 2009).</p>
<p>The extent to which such third-party additions find success and are worked into the standard GAE development ecosystem is something to keep an eye on over the next months and years.</p>
<p><strong>References</strong></p>
<p><a title="http://labs.google.com/papers/bigtable-osdi06.pdf" href="http://labs.google.com/papers/bigtable-osdi06.pdf" target="_self">http://labs.google.com/papers/bigtable-osdi06.pdf</a><br />
<a title="http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine" href="http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine" target="_self">http://sites.google.com/site/io/building-scalable-web-applications-with-google-app-engine</a><br />
<a title="http://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore" href="http://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore" target="_self">http://sites.google.com/site/io/under-the-covers-of-the-google-app-engine-datastore</a><br />
<a title="http://www.stanford.edu/class/ee380/Abstracts/081105-slides.pdf" href="http://www.stanford.edu/class/ee380/Abstracts/081105-slides.pdf" target="_self">http://www.stanford.edu/class/ee380/Abstracts/081105-slides.pdf</a><br />
<a title="http://www-users.itlabs.umn.edu/classes/Fall-2008/csci8101/bigtable.pdf" href="http://www-users.itlabs.umn.edu/classes/Fall-2008/csci8101/bigtable.pdf" target="_self">http://www-users.itlabs.umn.edu/classes/Fall-2008/csci8101/bigtable.pdf</a></p>
]]></content:encoded>
			<wfw:commentRss>http://bolour.com/blog/2009/06/the-google-app-engine-data-model/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

