Wednesday 16 April 2014

redis-document-store

Motivation

These days there are many "schema-less document stores", which allow the storage, indexing, and retrieval of data.
The canonical examples are:
These are new, these are fast, these are shiny.
But they're too modern for me to trust or risk in production, attractive as the idea is of firing random data at a storage system and later retrieving it by ID.
If we pretend that a schema-less document is merely a bundle of keys and values, and that keys are strings, then we can flatten them into a Redis store easily.

Illustration

Consider a website such as Debian Administration.
If we ignore the blogs, polls, and similar things, the core of the site is the indexing and display of articles.
An article is a collection of data, a document if you will, comprising:
  • Author
  • Body
  • Publication Date
  • Publication time
  • Tags
  • Title
  • ...
We could represent this easily as a hash:
article = Hash.new
article["title"] = "Article title .."
article["date"]  = "10/03/1976"
article["tags"]  = "foo, bar, baz"
# ..
Isn't that just like a schema-free database entry? Perhaps it doesn't look like one at first glance, but we can add or remove fields on a per-document basis and we're not tied to a rigid definition, so actually it is.
In brief: We can store ANY data posted to the storage, and later retrieve it in JSON-form.
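As a small sketch of that claim, two documents with entirely different fields can sit side by side and both come back as JSON. The "comment" document and its field names are made up for illustration; nothing here is taken from the server code itself.

```ruby
require "json"

# Two documents with different field sets; no schema ties them together.
article = {
  "title" => "Article title ..",
  "date"  => "10/03/1976",
  "tags"  => "foo, bar, baz"
}
comment = { "author" => "steve", "body" => "A hypothetical second document" }

# Each one serialises to JSON regardless of which fields it carries.
[article, comment].each { |doc| puts JSON.generate(doc) }
```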

Implementation

The code implements a simplification of the idea of a flexible document storage system:
  • Any data which is HTTP-posted to the /$db/create handler will be broken down into "key = value" pairs.
  • Every key will be stored for later use, against the database "$db".
  • Every value will be stored in the newly constructed, flat document.
Given a POST with two named keys ("title" and "body") we'll do this:
  • Find the next document ID.
    • Let us pretend it was 6.
  • Store the data:
    • $db:6:title will have the value of the "title" field, submitted to us.
    • $db:6:body will have the value of the "body" field, submitted to us.
  • Add the members "title" and "body" to the set:
    • $db:6:members. This will let us know, in the future, which fields the document has.
This is sufficient to store an arbitrary number of keys, associated values, and later retrieve them.
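The steps above can be sketched roughly as follows. This is not the server's actual code: MemoryStore is a hypothetical in-memory stand-in that mimics a handful of Redis commands (INCR, SET, GET, SADD, SMEMBERS) so the sketch runs without a Redis server, and the counter key "$db:next_id" is an assumed name, since the post does not say how the next ID is found.

```ruby
require "json"
require "set"

# Hypothetical stand-in for a Redis connection (illustration only).
class MemoryStore
  def initialize
    @strings = {}
    @sets = Hash.new { |hash, key| hash[key] = Set.new }
  end

  def incr(key)
    @strings[key] = ((@strings[key] || "0").to_i + 1).to_s
    @strings[key].to_i
  end

  def set(key, value)
    @strings[key] = value.to_s
  end

  def get(key)
    @strings[key]
  end

  def sadd(key, member)
    @sets[key].add(member.to_s)
  end

  def smembers(key)
    @sets[key].to_a
  end
end

# Flatten one posted document into "$db:$id:$field" keys and record
# its field names in the set "$db:$id:members".
def create_document(store, db, fields)
  id = store.incr("#{db}:next_id") # assumed counter key
  fields.each do |name, value|
    store.set("#{db}:#{id}:#{name}", value)
    store.sadd("#{db}:#{id}:members", name)
  end
  id
end

# Rebuild the document by walking its member set.
def get_document(store, db, id)
  store.smembers("#{db}:#{id}:members").each_with_object({}) do |name, doc|
    doc[name] = store.get("#{db}:#{id}:#{name}")
  end
end

store = MemoryStore.new
id = create_document(store, "articles",
                     "title" => "This is my title", "body" => "This is my body")
puts JSON.generate(get_document(store, "articles", id))
```

Because the database name is only a key prefix, the same two functions serve any number of databases.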

Extras

It has been shown many times that tags are useful, so there is special handling for any key called "tags". We assume that the value of this key is a comma-separated list of tags.
Given an HTTP POST like this:
title => This is my title,
body  => This is my arbitrarily long document ...
tags  => steve, test, data
The tags "steve", "test", and "data" will be created, and we can fetch any matching document by tag easily.
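A minimal sketch of that tag handling might look like the following. The Hash of Sets stands in for whatever Redis structures the server really uses for tags, which this post does not show, and the function names are invented for the example.

```ruby
require "set"

# Split the comma-separated "tags" value into clean tag names.
def parse_tags(value)
  value.to_s.split(",").map(&:strip).reject(&:empty?).uniq
end

# tag_index maps tag name => Set of document IDs (an assumed layout).
def index_tags(tag_index, id, fields)
  parse_tags(fields["tags"]).each do |tag|
    (tag_index[tag] ||= Set.new) << id
  end
  tag_index
end

tag_index = {}
index_tags(tag_index, 4, "title" => "This is my title", "tags" => "steve, test, data")
index_tags(tag_index, 5, "title" => "Another title", "tags" => "test")

p tag_index["test"].to_a             # IDs of documents tagged "test"
p tag_index.transform_values(&:size) # known tags with their use-counts
```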

Examples

Here is an example of a client adding a new document:
~$ curl -d "title=This is my title&body=This is my body&date=Today.." \
            http://127.0.0.1:9999/articles/create/
 Created ID 4
Here we see four things:
  • We've used curl to post from the command line.
  • Our "document" consists of "title" + "body" + "date".
  • We've used the database "articles".
  • The HTTP-post resulted in an ID being returned.
Now here is the retrieval, using that ID:
~$ curl -v http://127.0.0.1:9999/articles/get/4
{"body":"This is my body","title":"This is my title", "date":"Today.."}
Recap: We're flexible, because we don't mandate which fields are added. ANY client can post ANY data, and it will all be squirrelled away and available for later retrieval.
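The same two requests could be made from Ruby rather than curl. The sketch below uses only the standard net/http library; the helper names are invented here, and the base URL simply mirrors the curl examples above. The calls that touch the network are left commented out, since they need a running server.

```ruby
require "net/http"
require "uri"

BASE = "http://127.0.0.1:9999" # address taken from the curl examples

# Build (but do not send) the form-encoded POST for /$db/create/.
def build_create_request(db, fields)
  uri = URI("#{BASE}/#{db}/create/")
  request = Net::HTTP::Post.new(uri)
  request.set_form_data(fields)
  [uri, request]
end

# Send the POST and return the raw response body.
def post_document(db, fields)
  uri, request = build_create_request(db, fields)
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request).body }
end

# Fetch a document by ID; the response body is the JSON text.
def fetch_document(db, id)
  Net::HTTP.get(URI("#{BASE}/#{db}/get/#{id}"))
end

# With the server running:
#   puts post_document("articles", "title" => "This is my title",
#                                  "body"  => "This is my body")
#   puts fetch_document("articles", 4)
```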

Tag Searching

There are two end-points which are useful for tags:
  • /tags will show the known-tags, and their use-counts.
  • /tag/foo will return a JSON array of document-identifiers with the given tag.

Searching

If the tag queries are insufficient, the inserted data may be searched on any field:
~# curl -d "title=This is title xxxx&body=This is a body" http://localhost:9999/articles/create/

~# curl http://localhost:9999/articles/search/title/xxx
-> 6
Here we've searched against the "title" field in the database "articles". Any field which has been inserted is a valid target.
NOTE: Documents need not all have the same number of fields, or the same field names.
NOTE: Only the first result is returned. If you wish to find subsequent matches, add /N to start from record N.
On my desktop I can search 20,000 documents (consisting of "title=NN" and "body=XX") against the title field in just under 3 seconds. The search time is largely limited by the time required to fetch the field of every non-matching record, in order.
Speeding this up would be a challenge, which implies that the storage model is simple but not suited to large data collections.
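To make the cost concrete, here is a rough sketch of that kind of linear scan. A plain Hash stands in for the Redis string keys, the match is assumed to be a simple substring test (the example above matched "xxx" inside "xxxx"), and the function signature is invented for illustration.

```ruby
# store is a plain Hash standing in for Redis keys named "$db:$id:$field".
# Walk the IDs in order from start_id and return the first document whose
# field contains term, or nil if nothing matches.
def search(store, db, field, term, max_id, start_id = 1)
  (start_id..max_id).each do |id|
    value = store["#{db}:#{id}:#{field}"] # one fetch per document examined
    return id if value && value.include?(term)
  end
  nil
end

store = {
  "articles:5:title" => "This is title yyyy",
  "articles:6:title" => "This is title xxxx",
  "articles:7:title" => "Another xxxx title"
}

p search(store, "articles", "title", "xxx", 7)    # first match
p search(store, "articles", "title", "xxx", 7, 7) # resume from record 7
```

Every non-matching document costs one lookup, which against a real Redis server means one round trip each; that is why the search time grows with the number of documents.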

API

There are the following handlers installed:
  • /$db/create
    • POST to this to submit a new document.
  • /$db/get
    • GET an existing document.
  • /$db/recent
    • Return the most recent documents.
  • /$db/replace
    • POST to this to update an existing document.
  • /$db/search
    • GET to this to search for matches based on a single field.
  • /$db/tags
    • Return a JSON-hash of known tags, and their use-counts.
  • /$db/tag
    • Retrieve a JSON-array of IDs which match the given tag.
Here "$db" stands for the database name. You can use any valid string here.
The server will happily allow you to post, fetch, etc, against any number of databases - internally the database is merely a prefix used for the Redis key names.

Dependencies

The current server is sinatra-based, and relies upon two gems:
  • redis
  • sinatra
Assuming you're on a Debian GNU/Linux system you could install these via:
 ~# apt-get install rubygems
 ~# gem install redis
 ~# gem install sinatra
To improve performance you could use the thin server:
 ~# gem install thin
(On my desktop I can retrieve ~40 documents a second with no effort.)