I'm with Kragen Sitaker when he writes:
Now I am feeling the need for a personal semistructured store more than
ever. This is a searchable database that supports "data first,
structure later" <https://www.betaversion.org/~stefano/linotype/news/93/>
(i.e. you can add a schema incrementally to existing data entered
without one, or with a very primitive one), but supports enough
structure to render web pages and things like that.
Here's my design for a document database called Buckybase that fits that description.
Technologies Used by Buckybase
- Atom Syndication Format (the IETF standard for feeds and entries)
- Atom Publishing Protocol (still in draft, so I'm not committing to it yet)
- XOXO microformat for outlines (basically, using HTML's <dl> and <ol> tags to represent documents)
Sample Document and Core Data Model
A sample document that represents a bug report:
- Software doesn't exist yet.
- Would be nice to have it by the end of 2006.
- 2007 ain't bad either.
- Anyhow, it's really needed fast.
system:id is a special field that each document has.
All other fields are optional.
This bug report uses the fields title, notes, and related. We use the syntax document.field to talk about fields, e.g. bug-1.title.
bug-1.title is single-valued, whereas bug-1.notes and bug-1.related are multi-valued with three and two values, respectively. Basically, all fields are multi-valued, and a single-valued field is simply a shorthand for a multi-valued field with only one value.
The values of a field are ordered, as indicated by the <ol> tag's numbering (1., 2., 3., ...).
As for field values, they can be either strings (e.g. "Would be nice to have it by the end of 2006.") or links to other documents (e.g. bug-2). Nothing prevents you from mixing strings and links in the values of a field.
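To make the model above concrete, here is a minimal Python sketch. The names Document, Link, set, and get are hypothetical and only serve the illustration. Which sample line belongs to which field is my reading of the example (first line as the title, the next three as notes), and the second value of related is a made-up placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass(frozen=True)
class Link:
    """A field value that points at another document by its system:id."""
    target: str


# A field value is either a plain string or a link to another document.
Value = Union[str, Link]


@dataclass
class Document:
    """A unique ID plus optional, ordered, multi-valued fields."""
    system_id: str
    fields: Dict[str, List[Value]] = field(default_factory=dict)

    def set(self, name: str, *values: Value) -> None:
        # A single-valued field is simply a list with one element.
        self.fields[name] = list(values)

    def get(self, name: str) -> List[Value]:
        # All fields besides system:id are optional, so a missing one reads as empty.
        return self.fields.get(name, [])


bug1 = Document("bug-1")
# Assumption: the first sample line is the title, the rest are notes.
bug1.set("title", "Software doesn't exist yet.")
bug1.set(
    "notes",
    "Would be nice to have it by the end of 2006.",
    "2007 ain't bad either.",
    "Anyhow, it's really needed fast.",
)
# Strings and links may be mixed; the string here is a placeholder value.
bug1.set("related", Link("bug-2"), "placeholder second value")
```

Storing every field as a list keeps the single-valued case from needing special handling: bug1.get("title") is just a list of length one.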
Fields are bidirectional: if you GET bug-2, the document looks like this:
- It's really hard to get good coffee in Shanghai.
Note the inverse:related field. Because bug-1.related contains bug-2, bug-2 automatically has an inverse field called inverse:related that contains bug-1.
Bidirectional fields make hopping around in the data pool extremely easy and convenient. Note that inverse fields are not truly first-class fields: the order of their values cannot be changed manually and is instead determined by the system. (The reason is that in many implementations inverse fields will not be stored directly in the document, but rather delivered by a lookup in an inverted index like Lucene's.)
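A rough sketch of how inverse fields could be computed at read time rather than stored. Everything here is an assumption for illustration: links are represented as ("link", id) tuples, documents as plain dicts, and the function names are hypothetical. A real implementation might delegate the lookup to a search engine's index instead.

```python
from collections import defaultdict
from typing import Dict, List, Tuple, Union

# Hypothetical representation: a link is a ("link", target_id) tuple,
# anything else is a plain string value.
Value = Union[str, Tuple[str, str]]
Doc = Dict[str, List[Value]]


def build_inverse_index(docs: Dict[str, Doc]) -> Dict[str, Dict[str, List[str]]]:
    """Map target id -> "inverse:<field>" -> ids of documents linking to it."""
    index: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    # Sorted iteration gives a deterministic, system-chosen backlink order.
    for doc_id in sorted(docs):
        for field_name, values in docs[doc_id].items():
            for value in values:
                if isinstance(value, tuple) and value[0] == "link":
                    backlinks = index[value[1]]["inverse:" + field_name]
                    if doc_id not in backlinks:
                        backlinks.append(doc_id)
    return index


def get_document(docs, index, doc_id):
    """Return the stored fields merged with the computed inverse fields."""
    merged = {name: list(values) for name, values in docs.get(doc_id, {}).items()}
    for name, sources in index.get(doc_id, {}).items():
        merged[name] = [("link", source) for source in sources]
    return merged


# Assumption: the single line shown for each sample document is its title.
docs = {
    "bug-1": {
        "title": ["Software doesn't exist yet."],
        "related": [("link", "bug-2")],
    },
    "bug-2": {"title": ["It's really hard to get good coffee in Shanghai."]},
}
index = build_inverse_index(docs)
bug2 = get_document(docs, index, "bug-2")
```

Here bug2 gains an inverse:related field pointing back at bug-1, although nothing was ever written to bug-2 itself.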
Data Model Summary
- Every document has a system:id field with a unique ID.
- Other fields are optional and can be added at will.
- All fields are multi-valued and ordered (single-value fields are available as a shorthand, but are just multi-valued fields with one value).
- Field values can be strings or links to other documents. (A field may contain a mixture of strings and links as values.)
- Inverse fields are automatically made available and contain backlinks. However, the order of inverse fields cannot be set manually.
Multiple Users Per System
A system should support multiple users. My account might be, e.g., /manuel.
Multiple Feeds Per User
Each user should be able to maintain any number of feeds that contain documents.
I could create feeds like /manuel/default and /manuel/wiki.
GETting a feed returns the most recently changed documents, and may use Atom's paging facilities to reach older documents.
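The "most recently changed first, with paging" behaviour could look roughly like the sketch below. Entry, feed_page, and the page size are hypothetical names and values, and the updated timestamp is simplified to an integer. In an actual Atom feed the next page would presumably be advertised through a link element rather than a return value.

```python
from typing import List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    doc_id: str
    updated: int  # hypothetical: time of last change, seconds since the epoch


def feed_page(
    entries: List[Entry], page: int = 1, page_size: int = 2
) -> Tuple[List[Entry], Optional[int]]:
    """Return one page of entries, newest first, plus the next page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    ordered = sorted(entries, key=lambda e: e.updated, reverse=True)
    start = (page - 1) * page_size
    chunk = ordered[start:start + page_size]
    # None signals that there are no older documents left.
    next_page = page + 1 if start + page_size < len(ordered) else None
    return chunk, next_page


entries = [Entry("bug-1", 100), Entry("bug-2", 300), Entry("bug-3", 200)]
first, next_page = feed_page(entries)
second, last_marker = feed_page(entries, page=2)
```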
A feed is also a practical unit for security. It would be nice to have public, shared, and private feeds.
The Atom Publishing Protocol also has support for media objects (BLOBs), which still need to be incorporated into this design.
Note that I plan to make a feed writable by its owning user only, preventing the problems of shared state. Shared state will be simulated by special assessment modules (see Rohit Khare's ARRESTED architectural style) that aggregate the data from many users to present a unified view, like del.icio.us does.
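As a toy illustration of such an aggregating module, the sketch below only reads per-user feeds and derives a combined view, counting how many feeds mention each URL. The function name, the feed path /zini/default, and the example URLs are all made up.

```python
from collections import Counter
from typing import Dict, List, Tuple


def aggregate_links(feeds: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Count how many distinct feeds mention each URL, most popular first."""
    counts: Counter = Counter()
    for urls in feeds.values():
        counts.update(set(urls))  # one vote per feed, even if a URL repeats
    # Sort by descending count, then by URL, so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Each feed is written only by its owner; the aggregator never modifies them.
feeds = {
    "/manuel/default": ["http://example.org/a", "http://example.org/b"],
    "/zini/default": ["http://example.org/a", "http://example.org/a"],
}
view = aggregate_links(feeds)
```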
In the first version I'll just write the Atom feeds as files to disk, and read 'em every time a request is made (and maybe add some caching). In the long run, however, the system should grow into a fault-tolerant, scalable, clustered architecture, likely using P2P technology. But that is all still in the future.
On top of this database, I'd like to layer a system that lets users design the layouts of documents in the browser, and create mashups from their own and other users' feeds.
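A first cut of the files-on-disk approach might look like this sketch, using Python's standard xml.etree.ElementTree. The file layout (one .atom file per feed under a per-user directory) and the function names are assumptions. The output is deliberately stripped down and is not a complete Atom feed: it omits required elements such as updated, and uses bare document IDs where Atom expects IRIs.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)  # serialize Atom as the default namespace


def write_feed(path: Path, feed_title: str, entries: list) -> None:
    """Serialize (id, title) pairs as a stripped-down Atom feed file."""
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}title").text = feed_title
    for entry_id, title in entries:
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}id").text = entry_id
        ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    path.parent.mkdir(parents=True, exist_ok=True)
    ET.ElementTree(feed).write(str(path), encoding="utf-8", xml_declaration=True)


def read_feed(path: Path) -> list:
    """Re-parse the file on every call, with no caching yet."""
    root = ET.parse(str(path)).getroot()
    return [
        (e.findtext(f"{{{ATOM}}}id"), e.findtext(f"{{{ATOM}}}title"))
        for e in root.findall(f"{{{ATOM}}}entry")
    ]


# Hypothetical layout: <root>/<user>/<feed>.atom
root_dir = Path(tempfile.mkdtemp())
feed_file = root_dir / "manuel" / "default.atom"
write_feed(feed_file, "/manuel/default", [("bug-1", "Software doesn't exist yet.")])
loaded = read_feed(feed_file)
```

Caching could later wrap read_feed, for example keyed on the file's modification time, without changing its callers.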
Please tell me what you think about the general architecture described here.
Acknowledgements, Further Links
Many thanks go to Zini for numerous discussions and prototyping sessions over the years.
Current systems that sport a similar design are Google Base, CouchDB, and many others listed in Kragen's post "semistructured data: summary of six years of wishes".