I'm with Kragen Sitaker, when he writes:
Now I am feeling the need for a personal semistructured store more than
ever. This is a searchable database that supports "data first,
structure later" <http://www.betaversion.org/~stefano/linotype/news/93/>
(i.e. you can add a schema incrementally to existing data entered
without one, or with a very primitive one), but supports enough
structure to render web pages and things like that.
Here's my design for a document database called Buckybase that fits that description.
Like its namesake, the Buckyball molecule, I think that Buckybase "exhibit[s] a number of notable characteristics that make the possibilities of [its] use seem almost limitless".
Technologies Used by Buckybase
- Atom Syndication Format (the IETF standard for feeds and entries)
- Atom Publishing Protocol (still in draft, so I'm not committing to it yet)
- XOXO microformat for outlines (basically, using HTML's <dl> and <ol> tags to represent documents)
Sample Document and Core Data Model
A sample document that represents a bug report:
- system:id
- bug-1
- title
- Software doesn't exist yet.
- notes
-
- Would be nice to have it by the end of 2006.
- 2007 ain't bad either.
- Anyhow, it's really needed fast.
- related
system:id is a special field that each document has.
All other fields are optional.
This bug report uses the fields title, notes, and related. We use the syntax document.field to talk about fields, e.g. bug-1.title.
bug-1.title is single-valued, whereas bug-1.notes and bug-1.related are multi-valued with three and two values, respectively. Basically, all fields are multi-valued, and a single-valued field is simply a shorthand for a multi-valued field with only one value.
Field values are ordered, as indicated by the <ol> tag's numbering (1., 2., 3., ...).
As for field values, they can be either strings (e.g. "Would be nice to have it by the end of 2006.") or links to other documents (e.g. bug-2). Nothing prevents you from mixing strings and links in the values of a field.
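To make the outline representation concrete, here is a minimal sketch in Python of how such a document might be rendered with <dl> and <ol> tags. The exact markup, the ("link", id) tuple used to mark links, and the relative href are my assumptions for illustration, not a fixed part of the design. Only the bug-2 link of the related field is shown.

```python
from html import escape


def render_value(value):
    """Render one field value: links become <a>, strings become escaped text."""
    if isinstance(value, tuple) and value[0] == "link":
        target = escape(value[1], quote=True)
        # Assumption: a link is addressed by a relative URL equal to its id.
        return '<a href="%s">%s</a>' % (target, target)
    return escape(value, quote=False)


def render_document(doc_id, fields):
    """Render a document as a <dl>; fields with several values get an <ol>."""
    lines = ["<dl>"]
    for name, values in [("system:id", [doc_id])] + list(fields.items()):
        lines.append("  <dt>%s</dt>" % escape(name, quote=False))
        if len(values) == 1:
            # Shorthand for a single-valued field: no list wrapper.
            lines.append("  <dd>%s</dd>" % render_value(values[0]))
        else:
            items = "".join("<li>%s</li>" % render_value(v) for v in values)
            lines.append("  <dd><ol>%s</ol></dd>" % items)
    lines.append("</dl>")
    return "\n".join(lines)


markup = render_document("bug-1", {
    "title": ["Software doesn't exist yet."],
    "notes": [
        "Would be nice to have it by the end of 2006.",
        "2007 ain't bad either.",
        "Anyhow, it's really needed fast.",
    ],
    "related": [("link", "bug-2")],
})
print(markup)
```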
Inverse Fields
Fields are bidirectional: if you GET bug-2, the document looks like this:
- system:id
- bug-2
- title
- OutOfCoffeeException
- notes
- It's really hard to get good coffee in Shanghai.
- inverse:related
- bug-1
Note the inverse:related field. Because bug-1.related contains bug-2, bug-2 automatically has an inverse field called inverse:related that contains bug-1.
Bidirectional fields make hopping around in the data pool extremely easy and convenient. Note that inverse fields are not truly first-class fields: the order of their values cannot be changed manually and is instead determined by the system. (The reason is that in many implementations inverse fields will not be stored directly in the document, but rather delivered by a lookup in an inverse index like Lucene's.)
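The lookup idea can be sketched as follows, assuming an in-memory dictionary instead of a real index like Lucene's. The ("link", id) tuples and the function names are hypothetical. Backlinks are ordered by source document ID, which is just one possible system-determined order.

```python
from collections import defaultdict

# Each document maps field names to ordered lists of values.
# Assumption: a link is a ("link", target_id) tuple, anything else a string.
documents = {
    "bug-1": {
        "title": ["Software doesn't exist yet."],
        "related": [("link", "bug-2")],
    },
    "bug-2": {
        "title": ["OutOfCoffeeException"],
        "notes": ["It's really hard to get good coffee in Shanghai."],
    },
}


def is_link(value):
    return isinstance(value, tuple) and len(value) == 2 and value[0] == "link"


def build_inverse_index(docs):
    """Map target id -> field name -> ids of the documents linking to it."""
    index = defaultdict(lambda: defaultdict(list))
    # Iterating in sorted id order gives a stable, system-chosen backlink order.
    for doc_id in sorted(docs):
        for name, values in docs[doc_id].items():
            for value in values:
                if is_link(value) and doc_id not in index[value[1]][name]:
                    index[value[1]][name].append(doc_id)
    return index


def get_document(docs, index, doc_id):
    """Return the stored fields plus the computed inverse:* fields."""
    view = {"system:id": [doc_id]}
    view.update(docs[doc_id])
    for name, sources in index.get(doc_id, {}).items():
        view["inverse:" + name] = [("link", source) for source in sources]
    return view


index = build_inverse_index(documents)
bug2 = get_document(documents, index, "bug-2")
print(bug2["inverse:related"])
```

The inverse field never appears in the stored document; it is computed when the document is read, which is why its order cannot be edited.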
Data Model Summary
- Every document has a system:id field with a unique ID.
- Other fields are optional and can be added at will.
- All fields are multi-valued and ordered (single-valued fields are available as a shorthand, but are just multi-valued fields with one value).
- Field values can be strings or links to other documents. (A field may contain a mixture of strings and links as values.)
- Inverse fields are automatically made available and contain backlinks. However, the order of inverse fields cannot be set manually.
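The summary above could be modelled roughly like this in Python. This is only a sketch: the class and method names are hypothetical, and inverse fields are left out because they are computed rather than stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class Link:
    """A link to another document, identified by its system:id."""
    target: str


# A field value is either a plain string or a link to another document.
Value = Union[str, Link]


@dataclass
class Document:
    """A system:id plus optional, ordered, multi-valued fields."""
    id: str
    fields: Dict[str, List[Value]] = field(default_factory=dict)

    def set(self, name: str, *values: Value) -> None:
        # A single-valued field is just a list with one element.
        self.fields[name] = list(values)

    def get(self, name: str) -> List[Value]:
        # All fields are optional, so a missing field reads as empty.
        return self.fields.get(name, [])


bug1 = Document("bug-1")
bug1.set("title", "Software doesn't exist yet.")
bug1.set(
    "notes",
    "Would be nice to have it by the end of 2006.",
    "2007 ain't bad either.",
    "Anyhow, it's really needed fast.",
)
# Only the bug-2 link of the related field is shown here.
bug1.set("related", Link("bug-2"))
```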
Multiple Users Per System
A system should support multiple users. My account might be, e.g., /manuel.
Multiple Feeds Per User
Each user should be able to maintain any number of feeds that contain documents.
I could create feeds like /manuel/default and /manuel/wiki.
GETting a feed returns the most recently changed documents, and might use Atom's paging facilities to reach older documents.
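A rough sketch of that behaviour, assuming each document carries an Atom-style updated timestamp. The function name, the page query parameter, the page size, the bug-3 entry, and all timestamps are made up for illustration. The "next" value is what would end up in an Atom link element with rel="next".

```python
def feed_page(feed_path, documents, page=1, page_size=2):
    """Return one page of a feed, most recently changed documents first."""
    # Timestamps in the same UTC format sort correctly as plain strings.
    ordered = sorted(documents, key=lambda d: d["updated"], reverse=True)
    start = (page - 1) * page_size
    has_next = start + page_size < len(ordered)
    return {
        "entries": ordered[start:start + page_size],
        # Hypothetical URL scheme; None means there are no older documents.
        "next": "%s?page=%d" % (feed_path, page + 1) if has_next else None,
    }


# Made-up sample data.
documents = [
    {"id": "bug-1", "updated": "2006-10-09T08:00:00Z"},
    {"id": "bug-2", "updated": "2006-10-11T19:01:00Z"},
    {"id": "bug-3", "updated": "2006-10-10T12:30:00Z"},
]

first = feed_page("/manuel/default", documents)
second = feed_page("/manuel/default", documents, page=2)
print([d["id"] for d in first["entries"]], first["next"])
print([d["id"] for d in second["entries"]], second["next"])
```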
A feed is also a practical unit for security. It would be nice to have public, shared, and private feeds.
The Atom Publishing Protocol also has support for media objects (BLOBs) which need to be incorporated into this design.
Note that I plan to make a feed writable only by its owning user, which avoids the problems of shared state. Shared state will be simulated by special assessment modules (see Rohit Khare's ARRESTED architectural style) that aggregate the data from many users to present a unified view, like del.icio.us does.
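As a small illustration of such an aggregating module, the sketch below reads several single-owner feeds and builds a unified view without writing to any of them. The grouping by title, the function name, and the /alice/default feed are assumptions made up for this example.

```python
from collections import defaultdict


def aggregate_by_title(feeds):
    """Map each title to the feeds that contain a document with that title.

    `feeds` maps a feed path to a list of documents (field name -> values).
    The source feeds are only read, never modified.
    """
    view = defaultdict(list)
    for feed_path in sorted(feeds):
        for doc in feeds[feed_path]:
            for title in doc.get("title", []):
                view[title].append(feed_path)
    return dict(view)


# Hypothetical feeds; each one is writable by its owner only.
feeds = {
    "/manuel/default": [
        {"title": ["OutOfCoffeeException"]},
        {"title": ["Software doesn't exist yet."]},
    ],
    "/alice/default": [
        {"title": ["OutOfCoffeeException"]},
    ],
}

view = aggregate_by_title(feeds)
print(view["OutOfCoffeeException"])
```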
Next Steps
I plan to implement this system using SISC Scheme on top of the Java Virtual Machine and release it as free software (GPL is most likely).
In the first version I'll just write the Atom feeds as files to disk, and read 'em every time a request is made (and maybe add some caching). In the long run, however, the system should grow into a fault-tolerant, scalable, clustered architecture, likely using P2P technology. But that is still a long way off.
On top of this database, I'd like to layer a system that lets users design the layouts of documents in the browser, and create mashups from their own and other users' feeds.
Please tell me what you think about the general architecture described here.
Acknowledgements, Further Links
Many thanks go to Zini for numerous discussions and prototyping sessions over the years.
Current systems that sport a similar design are Google Base, CouchDB, and many others listed in Kragen's post "semistructured data: summary of six years of wishes".
I've created a Google Project for Buckybase for tracking design issues.
Posted by: Manuel | October 11, 2006 at 19:01
Darius asked "one thing that wasn't clear was if strings could contain embedded links, or you just had simple strings and simple links at the top level?"
Only simple strings and simple links. Links inside a string are not meaningful to the system.
Posted by: Manuel | October 15, 2006 at 15:17