Buckybase, a document database with bidirectional hyperlinks and reverse chronological access

I'm with Kragen Sitaker, when he writes:

Now I am feeling the need for a personal semistructured store more than
ever.  This is a searchable database that supports "data first,
structure later" <http://www.betaversion.org/~stefano/linotype/news/93/>
(i.e. you can add a schema incrementally to existing data entered
without one, or with a very primitive one), but supports enough
structure to render web pages and things like that.

Here's my design for a document database called Buckybase that fits that description. 

I think that Buckybase, like its namesake, the buckyball molecule, "exhibit[s] a number of notable characteristics that make the possibilities of [its] use seem almost limitless".

Technologies Used by Buckybase

Sample Document and Core Data Model

A sample document that represents a bug report:

system:id
bug-1
title
Software doesn't exist yet.
notes
  1. Would be nice to have it by the end of 2006.
  2. 2007 ain't bad either.
  3. Anyhow, it's really needed fast.
related
  1. bug-2
  2. bug-3

system:id is a special field that each document has.

All other fields are optional.

This bug report uses the fields title, notes, and related. We use the syntax document.field to talk about fields, e.g. bug-1.title.

bug-1.title is single-valued, whereas bug-1.notes and bug-1.related are multi-valued with three and two values, respectively. Basically, all fields are multi-valued, and a single-valued field is simply a shorthand for a multi-valued field with only one value.

The values of a field are ordered, as indicated by the numbering (1., 2., 3., ...) of the <ol> lists above.

As for field values, they can be either strings (e.g. "Would be nice to have it by the end of 2006.") or links to other documents (e.g. bug-2). Nothing prevents you from mixing strings and links in the values of a field.
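To make the core data model concrete, here is a minimal Python sketch of bug-1. The names Link, Document, set, and get are my own illustrative assumptions, not part of any Buckybase implementation; the point is only to show that every field is an ordered list whose values may be strings or links.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class Link:
    """A link to another document, identified by its system:id (hypothetical type)."""
    target: str


# A field value is either a plain string or a link to another document.
Value = Union[str, Link]


@dataclass
class Document:
    """A document: a unique ID plus optional, ordered, multi-valued fields."""
    id: str  # plays the role of the system:id field
    fields: Dict[str, List[Value]] = field(default_factory=dict)

    def set(self, name: str, *values: Value) -> None:
        # A single-valued field is just a list with exactly one element.
        self.fields[name] = list(values)

    def get(self, name: str) -> List[Value]:
        # All fields except the ID are optional, so absent fields read as empty.
        return self.fields.get(name, [])


bug1 = Document("bug-1")
bug1.set("title", "Software doesn't exist yet.")
bug1.set(
    "notes",
    "Would be nice to have it by the end of 2006.",
    "2007 ain't bad either.",
    "Anyhow, it's really needed fast.",
)
bug1.set("related", Link("bug-2"), Link("bug-3"))
```

Because strings and links share one list type, nothing in this sketch prevents mixing them within a single field, which matches the description above.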

Inverse Fields

Fields are bidirectional: if you GET bug-2, the document looks like this:

system:id
bug-2
title
OutOfCoffeeException
notes
It's really hard to get good coffee in Shanghai.
inverse:related
bug-1

Note the inverse:related field. Because bug-1.related contains bug-2, bug-2 automatically has an inverse field called inverse:related, which contains bug-1.

Bidirectional fields make hopping around in the data pool extremely easy and convenient. Note that inverse fields are not truly first-class fields: their order cannot be changed manually and is instead determined by the system. (The reason is that in many implementations inverse fields will not be stored directly in the document, but rather delivered by a lookup in an inverted index like Lucene's.)
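Here is a rough Python sketch of how inverse fields could be derived by lookup instead of being stored in the document. Everything in it is an assumption for illustration: the Link type, the function names build_inverse_index and get_document, and the choice to order backlinks by source ID (the design only says that the system determines the order).

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Union


class Link(NamedTuple):
    target: str  # system:id of the linked document


Value = Union[str, Link]
Fields = Dict[str, List[Value]]


def build_inverse_index(docs: Dict[str, Fields]) -> Dict[str, Dict[str, List[str]]]:
    """Map each link target to {"inverse:<field>": [IDs of linking documents]}."""
    index = defaultdict(lambda: defaultdict(list))
    # Assumed system-determined order: backlinks sorted by source document ID.
    for source_id in sorted(docs):
        for name, values in docs[source_id].items():
            for value in values:
                if isinstance(value, Link):
                    backlinks = index[value.target]["inverse:" + name]
                    if source_id not in backlinks:
                        backlinks.append(source_id)
    return index


def get_document(docs: Dict[str, Fields], index, doc_id: str) -> Fields:
    """Return the stored fields of a document merged with its inverse fields."""
    view: Fields = {"system:id": [doc_id]}
    view.update(docs.get(doc_id, {}))
    for name, sources in index.get(doc_id, {}).items():
        view[name] = [Link(source) for source in sources]
    return view


docs = {
    "bug-1": {
        "title": ["Software doesn't exist yet."],
        "related": [Link("bug-2"), Link("bug-3")],
    },
    "bug-2": {
        "title": ["OutOfCoffeeException"],
        "notes": ["It's really hard to get good coffee in Shanghai."],
    },
}
index = build_inverse_index(docs)
bug2 = get_document(docs, index, "bug-2")
```

With this data, bug2 carries an inverse:related field containing a link to bug-1, even though bug-2's stored fields never mention it.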

Data Model Summary

  • Every document has a system:id field with a unique ID.
  • Other fields are optional and can be added at will.
  • All fields are multi-valued and ordered (single-value fields are available as a shorthand, but are just multi-valued fields with one value).
  • Field values can be strings or links to other documents. (A field may contain a mixture of strings and links as values.)
  • Inverse fields are automatically made available and contain backlinks. However, the order of inverse fields cannot be set manually.

Multiple Users Per System

A system should support multiple users. My account may be e.g. /manuel.

Multiple Feeds Per User

Each user should be able to maintain any number of feeds that contain documents.

I could create feeds like /manuel/default and /manuel/wiki.

GETting a feed returns the most recently changed documents, and may use Atom's paging facilities to reach older documents.
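The reverse chronological access can be sketched in a few lines of Python. This is only an illustration under assumptions: Entry, feed_page, the integer timestamps, and the page size are invented here, and a real implementation would express paging through Atom rather than list slices.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    doc_id: str
    updated: int  # last-modified time; a plain integer keeps the sketch simple


def feed_page(entries: List[Entry], page: int = 0, page_size: int = 2) -> List[Entry]:
    """Return one page of a feed, most recently changed documents first."""
    ordered = sorted(entries, key=lambda e: e.updated, reverse=True)
    start = page * page_size
    return ordered[start:start + page_size]


entries = [Entry("bug-1", 10), Entry("bug-2", 30), Entry("bug-3", 20)]
first_page = feed_page(entries, page=0)   # bug-2, then bug-3
second_page = feed_page(entries, page=1)  # bug-1
```

Page 0 is what a plain GET on the feed would return; higher page numbers stand in for following the links to older documents.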

A feed is also a practical unit for security. It would be nice to have public, shared, and private feeds.

The Atom Publishing Protocol also has support for media objects (BLOBs), which still need to be incorporated into this design.

Note that I plan to make a feed writable by its owning user only, which avoids the problems of shared state. Shared state will be simulated by special assessment modules (see Rohit Khare's ARRESTED architectural style) that aggregate the data from many users to present a unified view, as del.icio.us does.
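A small Python sketch of the single-writer idea, under stated assumptions: the Feed class, its put method, the aggregate function, and the second user alice are all hypothetical. It only shows that writes are rejected for anyone but the owner, while a separate read-only function builds the unified view.

```python
from typing import Dict, List, Tuple


class Feed:
    """A feed that only its owning user may write to."""

    def __init__(self, owner: str, name: str):
        self.owner = owner
        self.path = f"/{owner}/{name}"
        self.docs: Dict[str, int] = {}  # document ID -> last-updated timestamp

    def put(self, user: str, doc_id: str, updated: int) -> None:
        if user != self.owner:
            raise PermissionError(f"{user} may not write to {self.path}")
        self.docs[doc_id] = updated


def aggregate(feeds: List[Feed]) -> List[Tuple[str, str]]:
    """Read-only unified view over many feeds: (feed path, document ID), newest first."""
    merged = [
        (updated, feed.path, doc_id)
        for feed in feeds
        for doc_id, updated in feed.docs.items()
    ]
    merged.sort(reverse=True)
    return [(path, doc_id) for _, path, doc_id in merged]


manuel = Feed("manuel", "default")
manuel.put("manuel", "bug-1", 10)
alice = Feed("alice", "default")
alice.put("alice", "note-1", 20)
unified = aggregate([manuel, alice])
```

The aggregation never writes back into any feed, so each feed keeps exactly one writer.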

Next Steps

I plan to implement this system using SISC Scheme on top of the Java Virtual Machine and release it as free software (GPL is most likely).

In the first version I'll just write the Atom feeds as files to disk, and read 'em every time a request is made (and maybe add some caching). In the long run, however, the system should grow into a fault-tolerant, scalable, clustered architecture, likely using P2P technology. But that is a long way off for now.

On top of this database, I'd like to layer a system that lets users design the layouts of documents in the browser, and create mashups from their own and other users' feeds.

Please tell me what you think about the general architecture described here.

Acknowledgements, Further Links

Many thanks go to Zini for numerous discussions and prototyping sessions over the years.

Current systems that sport a similar design are Google Base, CouchDB, and many others listed in Kragen's post "semistructured data: summary of six years of wishes".

Comments

I've created a Google Project for Buckybase for tracking design issues.

Darius asked "one thing that wasn't clear was if strings could contain embedded links, or you just had simple strings and simple links at the top level?"

Only simple strings and simple links. Links inside a string are not meaningful to the system.
