Motivation
These days there are many "schema-less document stores" which allow the storage, indexing, and retrieval of data.
The canonical examples are new, they are fast, and they are shiny.
But they're too modern for me to trust or risk in production, even though the idea of firing random data at a storage system, and later retrieving it by ID, is very attractive.
If we pretend that a schema-less document is merely a bundle of keys and values, and that the keys are strings, then we can flatten documents into a Redis store easily.
Illustration
Consider a website such as Debian Administration. If we ignore the blogs, polls, and similar things, the core of the site is the indexing and display of articles.
An article is a collection of data, a document if you will, consisting of:
- Author
- Body
- Publication Date
- Publication time
- Tags
- Title
- ...
article = Hash.new()
article["title"] = "Article title .."
article["date"]  = "10/03/1976"
article["tags"]  = "foo, bar, baz"
..
Isn't that just like a schema-free database entry? Perhaps it doesn't look like
one at first glance, but we can add/remove fields on a per-document basis and
we're not tied to a rigid definition, so actually it is. In brief: we can store ANY data posted to the storage, and later retrieve it in JSON-form.
Implementation
The code implements a simplification of the idea of a flexible document storage system:

- Any data which is HTTP-POSTed to the /$db/create handler will be broken down into "key = value" pairs.
- Every key will be stored for later use, against the database "$db".
- Every value will be stored in the newly constructed, flat document:
  - Find the next document ID. Let us pretend it was 6.
  - Store the data:
    - $db:6:title will have the value of the "title" field, submitted to us.
    - $db:6:body will have the value of the "body" field, submitted to us.
  - Add the members "title" and "body" to the set $db:6:members. This will let us know, in the future, which fields the document has.
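The storage steps above can be sketched in Ruby. An in-memory Hash stands in for Redis here so the key layout can be shown without a running server, and the helper name is illustrative rather than taken from the real code:

```ruby
require "set"

# In-memory stand-in for Redis; the real server talks to Redis itself.
store = {}

# Hypothetical helper mirroring the /$db/create steps described above.
def create_document(store, db, fields)
  # Find the next document ID (real Redis would use INCR on a counter key).
  id = store["#{db}:count"] = store.fetch("#{db}:count", 0) + 1

  members = Set.new
  fields.each do |key, value|
    store["#{db}:#{id}:#{key}"] = value  # flat key per field, e.g. "articles:6:title"
    members << key
  end
  store["#{db}:#{id}:members"] = members # records which fields this document has
  id
end

id = create_document(store, "articles",
                     "title" => "This is my title",
                     "body"  => "This is my body")
puts "Created ID #{id}"  # the first document in an empty store gets ID 1
```

Every document thus becomes a handful of flat string keys plus one set recording which fields exist.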
Extras
It has been shown many times that tags are useful, so for that reason there is special handling for any key called "tags". We assume that the value of this key is a comma-separated list of tags. Given an HTTP-POST like this:
title => This is my title,
body => This is my arbitrarily long document ...
tags => steve, test, data
The tags "steve", "test", and "data" will be created, and we can fetch any
matching document by tag easily.
Examples
Here is an example of a client adding a new document:

~$ curl -d "title=This is my title&body=This is my body&date=Today.." \
     http://127.0.0.1:9999/articles/create/
Created ID 4
Here we see four things:

- We've used curl to post from the command line.
- Our "document" consists of "title" + "body" + "date".
- We've used the database "articles".
- The HTTP-POST resulted in an ID being returned.
Fetching the document back returns it in JSON form:

~$ curl -v http://127.0.0.1:9999/articles/get/4
{"body":"This is my body","title":"This is my title", "date":"Today.."}
Recap: We're flexible, because we don't mandate which fields must be present. ANY client can post ANY data, and it will all be squirrelled away and made available for later retrieval.
Tag Searching
There are two end-points which are useful for tags:

- /tags will show the known tags, and their use-counts.
- /tag/foo will return a JSON array of document-identifiers with the given tag.
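The special-case tag handling can be sketched as follows: the "tags" value is split on commas, and each tag maps to a set of document IDs. A Hash of Sets stands in for the per-tag Redis sets the real server would keep; key names here are illustrative:

```ruby
require "set"

# Hypothetical sketch: record which document IDs carry each tag,
# assuming the comma-separated "tags" format described above.
def index_tags(tag_index, db, id, value)
  value.split(",").map(&:strip).reject(&:empty?).each do |tag|
    (tag_index["#{db}:tag:#{tag}"] ||= Set.new) << id
  end
end

tag_index = {}
index_tags(tag_index, "articles", 4, "steve, test, data")
index_tags(tag_index, "articles", 9, "test")

# /tag/test would now answer from the set for "test", i.e. documents 4 and 9;
# /tags would report each tag alongside the size of its set.
```

With this layout a tag lookup is a single set read, and the use-counts come from the set sizes, which is why tag queries stay cheap.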
Searching
The data which is inserted may be searched upon any field, if the tag-queries are insufficient:

~# curl -d "title=This is title xxxx&body=This is a body" http://localhost:9999/articles/create/
~# curl http://localhost:9999/articles/search/title/xxx
-> 6

Here we've searched against the "title" field in the database "articles"; any field which has been inserted is a valid target.

NOTE: We do not assume that every document has the same fields, or the same number of them; that is not a requirement.

On my desktop I can search 20,000 documents (each consisting of "title=NN" and "body=XX") against the title-field in just under 3 seconds. The search time is largely dominated by the time required to fetch the field of every non-matching record, in order.
NOTE: Only the first result is returned. If you wish to find subsequent matches, append /N to the search URL to start from record N.
Speeding this up would be a challenge, which suggests that the storage model is simple, but not suited to large data collections.
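The timing above follows from the search being a linear scan: the named field is fetched from every document in ID order until one matches. A sketch under the flat key layout from the Implementation section (names here are illustrative):

```ruby
# Hypothetical sketch of the /$db/search/$field/$pattern scan; an
# in-memory Hash stands in for Redis. Cost grows with collection size
# because every non-matching record's field must be fetched in turn.
def search(store, db, field, pattern, start = 1)
  count = store.fetch("#{db}:count", 0)
  (start..count).each do |id|
    value = store["#{db}:#{id}:#{field}"]
    return id if value && value.include?(pattern)
  end
  nil  # no document from `start` onwards matched
end

store = {
  "articles:count"   => 2,
  "articles:1:title" => "Plain title",
  "articles:2:title" => "This is title xxxx",
}
search(store, "articles", "title", "xxx")     # => 2
search(store, "articles", "title", "xxx", 3)  # => nil (started past the match)
```

The optional start parameter corresponds to the /N suffix noted above for finding subsequent matches.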
API
There are the following handlers installed:

- /$db/create - POST to this to submit a new document.
- /$db/get - GET an existing document.
- /$db/recent - Return the most recent documents.
- /$db/replace - POST to this to update an existing document.
- /$db/search - GET to this to search for matches based on a single field.
- /$db/tags - Return a JSON-hash of known tags, and their use-counts.
- /$db/tag - Retrieve a JSON-array of IDs which match the given tag.

"$db" is the equivalent of the database name; you can use any valid string here. The server will happily allow you to post, fetch, etc., against any number of databases - internally the database name is merely a prefix used for the Redis key names.
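To illustrate the prefixing, here is a sketch of what the /$db/get handler does: read the "members" set to learn which fields the document has, fetch each flat key, and emit the rebuilt document as JSON. A Hash stands in for Redis and the helper name is hypothetical:

```ruby
require "json"

# Hypothetical sketch of /$db/get: rebuild a document from its flat keys.
# The database name is nothing more than the leading key prefix.
def get_document(store, db, id)
  members = store.fetch("#{db}:#{id}:members", [])
  members.each_with_object({}) do |field, doc|
    doc[field] = store["#{db}:#{id}:#{field}"]
  end
end

store = {
  "articles:4:members" => ["title", "body"],
  "articles:4:title"   => "This is my title",
  "articles:4:body"    => "This is my body",
}
puts get_document(store, "articles", 4).to_json
```

Because the prefix is the only notion of a "database", any number of databases coexist in one Redis instance for free.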
Dependencies
The current server is sinatra-based, and relies upon two gems:

- redis
- sinatra

~# apt-get install rubygems
~# gem install redis
~# gem install sinatra
To improve performance you could use the thin server:

~# gem install thin

(On my desktop I can retrieve ~40 documents a second with no effort.)