July 1, 2010

Versioning in Zope and Plone revisited - Part I

On a new versioning approach for Zope-based applications

CMFEditions - the current versioning system used for CMF and Plone

To make it short: CMFEditions stinks.

Why?

  • very monolithic
  • too tight integration with CMF
  • fragile implementation
  • doing "too much"
  • doing "too much" in a very opaque way
  • no backend serialization format other than Python pickles
  • only ZODB-based backend
  • backend not pluggable

Design goals for a new versioning implementation for Zope-based applications

  • golden rule #1: keep it simple, keep it small
  • pluggable storage API (storing the versioned data)
  • using JSON as data exchange format between the objects to be versioned and the versioner, and between the versioner and the backend storage (the storage may use a different serialization format internally, e.g. 'pickle' for a ZODB-based backend or 'json' for a typical NoSQL backend like MongoDB)
  • making use of the Zope Component Architecture for adapting arbitrary content objects to the storage API
  • the solution does not claim to store and restore the complete state of a content object. Instead we focus on dealing with the metadata and the content itself. If an object uses a complex internal data model then it is responsible for serializing and deserializing its data to and from JSON.
  • leave complex functionality (like handling of references, object relations etc.) out of the core versioning engine - this might be handled through adapters
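
The split between the JSON exchange format and the backend's own serialization can be sketched as follows. This is only an illustration: the class PickleBackedStore and its put()/get() methods are hypothetical names and not part of any existing package. The store accepts and returns JSON but keeps pickles internally.

```python
import json
import pickle


class PickleBackedStore:
    """Hypothetical backend: speaks JSON at its boundary, pickles internally."""

    def __init__(self):
        self._blobs = {}  # key -> pickled Python structure

    def put(self, key, json_text):
        # Decode the JSON exchange format, then serialize the backend's own way.
        self._blobs[key] = pickle.dumps(json.loads(json_text))

    def get(self, key):
        # Convert back to JSON so callers never see the internal format.
        return json.dumps(pickle.loads(self._blobs[key]), sort_keys=True)


store = PickleBackedStore()
store.put("doc-1", json.dumps({"title": "Hello", "text": "World"}))
restored = json.loads(store.get("doc-1"))
```

The caller only ever deals with JSON text, so replacing the pickle calls with another format would not affect the versioner.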

Storage layer

CMFEditions uses the ZODB for storing the state of an object as a Python pickle. The new versioning system supports pluggable storages: a version storage provides its functionality through a well-defined API (as defined through IVersionStorage). Three different kinds of storage come to mind: RDBMS, object databases like the ZODB, and document-oriented databases like MongoDB. A version storage API may look like this:

from zope.interface import Interface


class IVersionStorage(Interface):

    # methods used for IVersionSupport
    def store(id, version_data, revision_metadata):
        """ Store 'version_data' for a given 'id'. 'version_data' holds the
            data to be versioned (JSON format). 'revision_metadata' holds
            application-specific metadata for the particular version (e.g.
            revision date, creator uid, "revision is a major/minor
            revision") (JSON format).

            Returns revision number.
        """

    def retrieve(id, revision):
        """ Return 'version_data' for a given 'id' and 'revision' """

    def has_revision(id, revision):
        """ Check if there is a revision 'revision' for a given object 'id' """

    def remove_revision(id, revision):
        """ Remove a particular 'revision' for a given object 'id' """

    def remove(id):
        """ Remove all revisions for a given object 'id' """

    def list_revisions(id):
        """ Return all revisions (and their stored revision_metadata) stored for
            a particular content piece by its 'id'.
        """

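To make the storage API more concrete, here is a minimal in-memory sketch that offers the same methods. It is not the prototype's code: the class name InMemoryVersionStorage is made up, revision numbers starting at 1 are an assumption, and it is written as a plain Python class without zope.interface so that it runs on its own.

```python
import json


class InMemoryVersionStorage:
    """Hypothetical dict-backed storage mirroring the IVersionStorage methods."""

    def __init__(self):
        self._revisions = {}  # id -> {revision: (version_data, revision_metadata)}
        self._counters = {}   # id -> last revision number handed out

    def store(self, id, version_data, revision_metadata):
        # Assumption: revisions are numbered 1, 2, 3, ... per object id.
        revision = self._counters.get(id, 0) + 1
        self._counters[id] = revision
        self._revisions.setdefault(id, {})[revision] = (version_data, revision_metadata)
        return revision

    def retrieve(self, id, revision):
        return self._revisions[id][revision][0]

    def has_revision(self, id, revision):
        return revision in self._revisions.get(id, {})

    def remove_revision(self, id, revision):
        # The counter is kept, so a removed revision number is not reused.
        del self._revisions[id][revision]

    def remove(self, id):
        self._revisions.pop(id, None)
        self._counters.pop(id, None)

    def list_revisions(self, id):
        revisions = self._revisions.get(id, {})
        return [(rev, revisions[rev][1]) for rev in sorted(revisions)]


storage = InMemoryVersionStorage()
first = storage.store("doc-1", json.dumps({"title": "Draft"}), json.dumps({"major": False}))
second = storage.store("doc-1", json.dumps({"title": "Final"}), json.dumps({"major": True}))
```

A MongoDB or ZODB storage would keep the same method signatures and only change what happens behind them.
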
Versioning layer

Versioning is an application-level functionality. An application should have full control over the things to be versioned (recall that CMFEditions always persists the full object state). In order to make a particular object versionable we need a simple interface (either to be implemented directly by the object or through an adapter):

class IVersionSupport(Interface):
    """ API for retrieving data to be versioned from an object
        and restoring a previous state of an object.
        The data format is JSON.

        Objects must provide their unique ID through the 'id' field.

        This API applies to single objects only
        (no support for object collections).
    """

    def getVersionableData():
        """ Return versionable data (in JSON format) """

    def restoreFromVersion(version_data):
        """ Restore object based on 'version_data' (JSON format) """

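As a rough illustration of the adapter variant, the sketch below wraps a made-up content class and offers the two IVersionSupport methods. SimpleDocument and SimpleDocumentVersionSupport are hypothetical names, and a real adapter would be registered through the Zope Component Architecture instead of being instantiated by hand.

```python
import json


class SimpleDocument:
    """Hypothetical content object, standing in for real Plone content."""

    def __init__(self, id, title, text):
        self.id = id
        self.title = title
        self.text = text


class SimpleDocumentVersionSupport:
    """Adapter-style wrapper offering the two IVersionSupport methods."""

    def __init__(self, context):
        self.context = context

    def getVersionableData(self):
        data = {
            "id": self.context.id,
            "title": self.context.title,
            "text": self.context.text,
        }
        return json.dumps(data, sort_keys=True)

    def restoreFromVersion(self, version_data):
        data = json.loads(version_data)
        # The id identifies the object and is deliberately left untouched.
        self.context.title = data["title"]
        self.context.text = data["text"]


doc = SimpleDocument("doc-1", "Draft title", "First text")
support = SimpleDocumentVersionSupport(doc)
snapshot = support.getVersionableData()
doc.title = "Changed title"
support.restoreFromVersion(snapshot)  # doc.title is "Draft title" again
```

Only the attributes the adapter chooses to export end up in the version, which is exactly the control the application is supposed to have.
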
Open issues

  • dealing with large data (images, files). A storage backend like MongoDB has a limit of 4MB for embedded documents (we have to use GridFS for larger pieces of data)
  • all versionable objects must provide a unique ID (``UID`` for Archetypes-backed content). How about Dexterity? How about ZTK/zope.schema-based content?
  • should de-duplication be handled on the storage layer or the versioning layer? (I assume on the storage layer, as an optional feature, in order to keep the overall complexity low)
  • how to deal with object collections (folders, hierarchies)? I have some ideas, but they need more brainstorming and will likely be covered in part II of this blog entry some time soon
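
One possible shape for storage-level de-duplication, purely as a sketch: identical JSON payloads are stored once under their content hash, and each revision only points at a hash. The class DedupPayloadStore and its methods are hypothetical, and using SHA-256 is an assumption; none of this is a decision made for the prototype.

```python
import hashlib


class DedupPayloadStore:
    """Hypothetical store that keeps each distinct payload only once."""

    def __init__(self):
        self._payloads = {}  # digest -> JSON text
        self._index = {}     # (id, revision) -> digest

    def store(self, id, revision, version_data):
        digest = hashlib.sha256(version_data.encode("utf-8")).hexdigest()
        # Keep the payload only if this content has not been seen before.
        self._payloads.setdefault(digest, version_data)
        self._index[(id, revision)] = digest
        return digest

    def retrieve(self, id, revision):
        return self._payloads[self._index[(id, revision)]]

    def payload_count(self):
        return len(self._payloads)


store = DedupPayloadStore()
payload = '{"title": "Hello"}'
store.store("doc-1", 1, payload)
store.store("doc-1", 2, payload)  # same content, no second copy
store.store("doc-1", 3, '{"title": "Hello again"}')
```

Because the versioning layer never sees the hashes, this could remain an optional feature of an individual storage.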

Prototype implementation

I created a rapid prototype using MongoDB as storage backend for the versioned data. A very basic implementation for Archetypes-based content has been built using adapters:

  1. adapters for adapting Plone (4) content to IVersionSupport
  2. adapters for adapting Archetypes fields to a simple API with get(), set() methods for retrieving/storing values for a particular field from Plone content

The overall implementation is actually very small and extensible. In fact it is possible to version almost all core Plone content-types (except object collections like folders) into MongoDB.
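
The second kind of adapter, a per-field get()/set() API, might look roughly like the sketch below. It does not use the real Archetypes field API; AttributeField, FieldBasedVersionSupport and Page are hypothetical stand-ins that read and write plain attributes.

```python
import json


class AttributeField:
    """Hypothetical field adapter: get()/set() for one named attribute."""

    def __init__(self, context, name):
        self.context = context
        self.name = name

    def get(self):
        return getattr(self.context, self.name)

    def set(self, value):
        setattr(self.context, self.name, value)


class FieldBasedVersionSupport:
    """Builds the versionable JSON by asking each field adapter in turn."""

    def __init__(self, context, field_names):
        self.context = context
        self.fields = [AttributeField(context, name) for name in field_names]

    def getVersionableData(self):
        data = {field.name: field.get() for field in self.fields}
        data["id"] = self.context.id
        return json.dumps(data, sort_keys=True)

    def restoreFromVersion(self, version_data):
        data = json.loads(version_data)
        for field in self.fields:
            if field.name in data:
                field.set(data[field.name])


class Page:
    """Hypothetical content object with two versioned fields."""

    def __init__(self, id, title, description):
        self.id = id
        self.title = title
        self.description = description


page = Page("page-1", "Welcome", "Front page")
versioned = FieldBasedVersionSupport(page, ["title", "description"])
saved = versioned.getVersionableData()
page.description = "Edited"
versioned.restoreFromVersion(saved)
```

Supporting another field type would then mean adding one more small field adapter rather than touching the content adapter.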

Why MongoDB?

We are using MongoDB in an ongoing project and my experiences are very positive. MongoDB is very easy to install and run (much, much easier than a ZODB/ZEO server) and its document-oriented storage approach fits perfectly with the JSON data model. In addition: MongoDB is blazing fast (up to 50,000 inserts per second measured), has a rich query API and provides a bunch of replication options.

Keep in mind...

  • storages will be pluggable...so you may replace a MongoDB storage with a ZODB-backed implementation
  • versioning all and everything is not our primary goal - we want to version the interesting information