Here's my most recent thinking on semistructured databases...
The system should store notes that usually consist of a title, some structured information, and a longer text, like so:
Deli NYC
tel: 6326 2835
location: Shanghai
special: Super Burrito (Sat and Sun only)
tags: shanghai, food, sandwich, excellent
rating: *****
url: http://www.delinyc.com/nyc.htm (seems to be down, CNY related)
Deli NYC plain rocks, especially the Tuna Melt, Pastrami and Mozzarella, Ham and Egg Salad, and of course the California Style Super Burrito.
Delivery usually takes 30 minutes.
The extracted data should look somewhat like this:
title: Deli NYC
tel: 6326 2835
location: Shanghai
special: Super Burrito (Sat and Sun only)
tags: shanghai
tags: food
tags: sandwich
tags: excellent
rating: *****
url: http://www.delinyc.com/nyc.htm (seems to be down, CNY related)
body: Deli NYC plain rocks, especially the Tuna Melt, Pastrami and Mozzarella, Ham and Egg Salad, and of course the California Style Super Burrito.
body: Delivery usually takes 30 minutes.
Note that the system uses some heuristics to find out that tags is actually a list of items, and not one long item. (The heuristic could be that each item is short, has little punctuation, and the list does not end with a dot.)
Heuristics like this can misfire, so ideally there would be ways to disable them when there's a problem.
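Here is a minimal sketch of that extraction step in Python. It is only one possible reading of the rules above: the function names (parse_note, looks_like_list), the 20-character limit for "short" items, and the list_heuristic switch for disabling the guess are all my own assumptions, not a settled design.

```python
import re

# Assumed shape of a structured line: a single-word key, a colon, then the value.
KEY_VALUE = re.compile(r"^(\w[\w-]*):\s+(.*)$")

def looks_like_list(value):
    """Guess whether a value is a comma-separated list of short items."""
    items = [item.strip() for item in value.split(",")]
    # A list needs at least two items and must not end with a dot.
    if len(items) < 2 or value.rstrip().endswith("."):
        return False
    # "Short, little punctuation": at most 20 characters (an arbitrary limit),
    # made of word characters, spaces and hyphens only.
    return all(
        item and len(item) <= 20 and re.fullmatch(r"[\w\s-]+", item)
        for item in items
    )

def parse_note(text, list_heuristic=True):
    """Turn a note into (key, value) pairs: title, key-value lines, then body."""
    lines = [line.strip() for line in text.strip().splitlines()]
    lines = [line for line in lines if line]
    if not lines:
        return []
    pairs = [("title", lines[0])]
    in_body = False
    for line in lines[1:]:
        # Once the first non key-value line is seen, everything after it is body.
        match = None if in_body else KEY_VALUE.match(line)
        if match is None:
            in_body = True
            pairs.append(("body", line))
            continue
        key, value = match.groups()
        if list_heuristic and looks_like_list(value):
            pairs.extend((key, item.strip()) for item in value.split(","))
        else:
            pairs.append((key, value))
    return pairs
```

Calling parse_note(text, list_heuristic=False) keeps every value as one long item, which is the crude "off switch" for the heuristic.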
Sloppy linking and backlinks: Links to other items (e.g. Shanghai, the rating *****) should be very sloppy (i.e. case-insensitive, also insensitive to punctuation), and also lead to discoverable backlinks from those items back to Deli NYC, as in Buckybase.
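A rough sketch of how sloppy matching and backlinks could work, in Python. The names sloppy_key and build_backlinks are made up for illustration, and treating every value as a potential link target is an assumption on my part; this is not how Buckybase does it.

```python
import re
from collections import defaultdict

def sloppy_key(name):
    """Normalize a name so that case, punctuation and extra spaces do not matter."""
    words = re.findall(r"\w+", name.lower())
    # Names made only of punctuation (e.g. the rating "*****") are kept as they are.
    return " ".join(words) if words else name.strip()

def build_backlinks(items):
    """Map each normalized value to the titles of the items that mention it.

    `items` maps an item title to its list of (key, value) pairs.
    """
    backlinks = defaultdict(set)
    for title, pairs in items.items():
        for _key, value in pairs:
            backlinks[sloppy_key(value)].add(title)
    return backlinks
```

With this, looking up backlinks[sloppy_key("SHANGHAI!")] would find every item whose location is written as "Shanghai", "shanghai" or "shanghai.".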
Pretty URLs: The URL for the Deli NYC item could be /manuel/Deli+NYC, but it should be possible to specify a different, easier-to-type URL for an item, especially if it has a long title. This could be done with a key-value pair:
slug: delinyc
Then /manuel/delinyc would also be a URL for the item.
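As a small sketch of the URL scheme in Python: the function name item_urls and the /user/name layout are taken from the example above rather than from any existing code, and the "+" for spaces comes from the standard library's quote_plus.

```python
from urllib.parse import quote_plus

def item_urls(user, pairs):
    """Return every URL an item answers to: its title first, then any slugs."""
    title = next(value for key, value in pairs if key == "title")
    names = [title] + [value for key, value in pairs if key == "slug"]
    # quote_plus turns spaces into "+" and escapes other unsafe characters.
    return [f"/{user}/{quote_plus(name)}" for name in names]

print(item_urls("manuel", [("title", "Deli NYC"), ("slug", "delinyc")]))
# prints ['/manuel/Deli+NYC', '/manuel/delinyc']
```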
User interface: Like Google, the system should have a search bar at the top, with two buttons, "search" and "go" (I'm feeling lucky). Search gives you a list of matching items, while go jumps to an item that has the string you entered as title (and can also be used to create new items: going to an item that doesn't exist brings up the edit mode.)
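The difference between the two buttons can be sketched in a few lines of Python. Everything here is an assumption for illustration: items is a plain dict from title to note text, and go returns a ("view", title) or ("edit", title) pair to say which mode the interface should open.

```python
def search(items, query):
    """Return the titles of all items whose title or text contains the query."""
    wanted = query.casefold()
    return [
        title for title, text in items.items()
        if wanted in title.casefold() or wanted in text.casefold()
    ]

def go(items, query):
    """Jump to the item with this title, or open edit mode to create it."""
    wanted = query.strip().casefold()
    for title in items:
        if title.casefold() == wanted:
            return ("view", title)
    # No such item yet: going there brings up the edit mode for a new item.
    return ("edit", query.strip())
```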
Versioning: Ideally, the system would provide versioning for items, but without some sort of stable identifier for each item, this could be tricky. The system could use a hidden identifier in the edit form, and update the existing item, though.
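A minimal sketch of the hidden-identifier idea in Python. The ItemStore class and its methods are hypothetical, and using a random UUID as the identifier is just one convenient choice; the point is only that the identifier, not the title, ties the versions together.

```python
import uuid

class ItemStore:
    """Keeps every saved version of an item under a stable, hidden identifier."""

    def __init__(self):
        self.versions = {}  # item id -> list of note texts, oldest first

    def save(self, text, item_id=None):
        """Store a new version; the edit form sends back item_id for existing items."""
        if item_id is None:
            # New item: make up an identifier that the edit form will carry along.
            item_id = uuid.uuid4().hex
        self.versions.setdefault(item_id, []).append(text)
        return item_id

    def latest(self, item_id):
        """Return the most recent version of an item."""
        return self.versions[item_id][-1]
```

Because the title lives inside the text, renaming "Deli NYC" in a later version does not break the history.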
Basically, the system should be as sloppy and useful as a pile of paper, but bring some of the benefits of computers (searching, backlinks, versioning, and, further down the road, advanced slice-and-dice-ability and presentation tools).
Check out SBook:
http://www.simson.net/ref/sbook5/
It's been around for a while doing much of what you want, and Simson Garfinkel recently stopped selling it and released the source code.
Posted by: Vladimir Sedach | February 20, 2007 at 05:05
Thanks for the link, I already saw it in your del.icio.us.
However, for me it's "I must create a system or be enslav'd by another man's" :) (-- William Blake)
What I'm trying to do at this layer is slightly different from SBook: I'm not using "AI" techniques to recognize things, but rather trying to define a simple-minded, easy-to-explain text syntax for entering arbitrary metadata documents.
An AI layer for recognizing things like document types, dates, people, places, etc. would be layered on top of that basic formatting layer, and for that I'll have a look at SBook. (The General Architecture for Text Engineering is also quite good at this.)
Posted by: Manuel | February 20, 2007 at 16:00