A couple of months ago, I wrote an article titled Operation CouchDB. I noticed that a lot of people still visit my blog for this particular post, so here is an update on the situation.

And no, you may not copy and paste this or any other of my blog posts unless you ask me. ;-)

Caching revisited

A while back I wrote about how caching is trivial with CouchDB — well sort of.

CouchDB and its ecosystem like to emphasize how the power of HTTP allows you to leverage countless known and proven solutions. For caching, this only goes so far.

If you remember the ETag: the reality is that while it’s easy to grasp what it does, a lot of reverse proxies and caches either don’t implement it at all, or don’t implement it well. Take my favorite — Varnish. It currently has very little support for the ETag.

And once you go down this road, it gets messy:

  • you need a process listening on a filtered _changes feed,
  • and that process must be able to PURGE Varnish.

… suddenly it’s homegrown and not transparent at all.
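To make that concrete, here is a minimal sketch of such a homegrown invalidator in Python (using the requests library). It assumes Varnish is configured to accept PURGE requests, that cached URLs map directly to document paths, and that a filter named app/by_type exists; all hosts and names are placeholders.

```python
import requests

COUCH = "http://127.0.0.1:5984/mydb"      # placeholder CouchDB database
VARNISH = "http://127.0.0.1:6081"         # placeholder Varnish frontend

def follow_and_purge():
    since = 0
    while True:
        # Long-poll the filtered _changes feed.
        resp = requests.get(
            f"{COUCH}/_changes",
            params={"feed": "longpoll", "since": since, "filter": "app/by_type"},
            timeout=90,
        )
        data = resp.json()
        for change in data.get("results", []):
            # Purge the cached representation of the changed document.
            requests.request("PURGE", f"{VARNISH}/mydb/{change['id']}")
        since = data.get("last_seq", since)

if __name__ == "__main__":
    follow_and_purge()
```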

However, the biggest problem, which only occurred to me the other day, is that unless you happen to be one of the chosen few running a CouchApp in production, your CouchDB client is not a regular browser. Which means that your CouchDB client library doesn’t know about the ETag either. Bummer.
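For illustration, here is a minimal sketch of the ETag revalidation you would have to do yourself (Python with requests; host, database and document names are made up):

```python
import requests

url = "http://127.0.0.1:5984/mydb/some_doc_id"  # placeholder document

# First request: remember the ETag (for a document this is its current _rev).
first = requests.get(url)
etag = first.headers["ETag"]

# Revalidation: CouchDB answers 304 Not Modified if the document is unchanged,
# so no body is transferred; but the server is still hit every single time.
second = requests.get(url, headers={"If-None-Match": etag})
print(second.status_code)  # 304 if unchanged, 200 (with a new ETag) otherwise
```

This is exactly the kind of logic most client libraries simply don’t do for you.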

What about those lightweight HEAD requests?

HEAD requests to validate (or invalidate) the cache have the following issues:

  • An obvious I/O penalty, because CouchDB has to consult the B-tree each time you issue a HEAD request to check the ETag.

  • HEAD requests in CouchDB are currently implemented like GETs, with the body thrown away before the response is sent.

Work is being done in COUCHDB-941 to fix these two issues. In summary — and this is not just true for CouchDB — the awesomeness of the ETag comes primarily from faking performance (i.e. feeling fast) by saving the bandwidth needed to transfer the data.

Solution?

Ahhhhh — sorry, that might have been a lot of bitching! :-) Let me get to the point!

IMHO, if my cache has to HEAD-request against CouchDB each time it is used, just to make sure the data is not stale, it becomes pretty pointless to use a cache at all. Point taken, HEAD requests will become cheaper, but they still have to happen every time.

Thus only part of the work is actually offloaded from the database server, while I’d say the usual goal of caching is to not hit the database (server) at all.

The (current) bottom line is: no silver bullet available.

You have to invalidate the cache yourself: either from your application, or e.g. by using a tool like thinner to follow _changes and purge accordingly. Or, what we’re doing: make the application cache-aware and PURGE directly from it (see the sketch below).
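The cache-aware variant boils down to purging right after the write. A rough sketch, again with made-up hosts and names, and assuming Varnish accepts PURGE:

```python
import requests

COUCH = "http://127.0.0.1:5984/mydb"      # placeholder CouchDB database
VARNISH = "http://127.0.0.1:6081"         # placeholder Varnish frontend

def save_document(doc_id, doc):
    # Write to CouchDB first ...
    resp = requests.put(f"{COUCH}/{doc_id}", json=doc)
    resp.raise_for_status()
    # ... then purge the now-stale cache entry straight from the application.
    requests.request("PURGE", f"{VARNISH}/mydb/{doc_id}")
    return resp.json()
```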

include_docs

?include_docs=true works well, up until you get a sh!tload of traffic — then it’s a hefty I/O penalty each time your view is read. Of course, emitting the entire doc into the view instead means that you’ll need more disk space, but the performance improvement you buy by sacrificing disk space is between 10 and 100x. (Courtesy of the nice guys at Cloudant!)
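To illustrate the trade-off, here is a sketch of the two variants (design doc, view and database names are made up): the first view relies on ?include_docs=true at query time, the second emits the whole document into the index.

```python
import requests

COUCH = "http://127.0.0.1:5984/mydb"  # placeholder database

design_doc = {
    "_id": "_design/posts",
    "views": {
        # Lean index: needs ?include_docs=true on every read, which means an
        # extra b-tree lookup per row.
        "by_date_lean": {
            "map": "function(doc) { if (doc.type == 'post') emit(doc.date, null); }"
        },
        # Fat index: stores the full doc in the view; bigger on disk, but reads
        # are served straight from the view index.
        "by_date_fat": {
            "map": "function(doc) { if (doc.type == 'post') emit(doc.date, doc); }"
        },
    },
}
requests.put(f"{COUCH}/_design/posts", json=design_doc)

lean = requests.get(f"{COUCH}/_design/posts/_view/by_date_lean",
                    params={"include_docs": "true"})
fat = requests.get(f"{COUCH}/_design/posts/_view/by_date_fat")
```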

In a cloud computing context like AWS, that extra disk space is indeed a cost factor: think instance size for general I/O performance, the actual storage (in GB), and the throughput Amazon charges for. In a real datacenter, SAS disks are also more expensive than those crappy 7500 rpm SATA(-II) disks.

Partitioning

When I say partitioning, I mean spreading the various document types across databases. Even though a great advantage of a document-oriented database is being able to keep a lot of different data right next to each other, you should avoid doing that.

Because when your data grows, it’s pretty darn useful to be able to push a certain document type to another server in order to scale out. I know it’s not that hard to do this when they are all in one database, but separating them when everything is on fire is a whole lot more work than replicating the database to a dedicated instance and changing the endpoint in your application.
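If the types already live in separate databases, the move is one replication call plus a config change. A sketch with placeholder hosts and a hypothetical comments database:

```python
import requests

# One-off replication of the "comments" database to a dedicated instance.
requests.post(
    "http://old-host:5984/_replicate",
    json={
        "source": "http://old-host:5984/comments",
        "target": "http://new-host:5984/comments",
        "create_target": True,
    },
)

# Afterwards the application only needs a new endpoint for that document type.
ENDPOINTS = {
    "comments": "http://new-host:5984/comments",
    "posts": "http://old-host:5984/posts",
}
```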

Keeping an eye on things

I did not mention this earlier because it’s a no-brainer — sort of. It seems, though, that it has to be said.

Capacity planning

… and projections are (still) pretty important with CouchDB (or NoSQL in general). If you don’t know whether or by how much your data will grow, the least you should do is put a monitor on your CouchDB server’s _stats to keep an eye on how things develop.
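A minimal sketch of such a monitor, assuming a CouchDB 1.x-style /_stats endpoint and a placeholder database name:

```python
import requests

COUCH = "http://127.0.0.1:5984"  # placeholder server

# Server-wide counters (request rates, status codes, etc.).
stats = requests.get(f"{COUCH}/_stats").json()
print("httpd requests:", stats["httpd"]["requests"])

# Per-database growth: document count and size on disk.
db_info = requests.get(f"{COUCH}/mydb").json()
print("doc_count:", db_info["doc_count"], "disk_size:", db_info["disk_size"])
```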

Monitoring

If you happen to run munin, I suggest you take a look at Gordon Stratton’s plugins. They are very useful.

My second suggestion for monitoring would be to not just put a sensor on http://example.org:5984/, because with CouchDB that page almost always works.

Of course this page is a great indicator of whether your CouchDB server is available at all, but it tells you very little about the server’s performance: number of writes, write speed, view building, view reads and so on.
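A cheap improvement over pinging “/” is a probe that times an actual write and an actual view read; something along these lines (database, design doc and view names are placeholders):

```python
import time
import uuid
import requests

DB = "http://127.0.0.1:5984/monitoring"  # placeholder database

def timed(request):
    start = time.time()
    resp = request()
    resp.raise_for_status()
    return (time.time() - start) * 1000  # milliseconds

write_ms = timed(lambda: requests.put(
    f"{DB}/probe-{uuid.uuid4()}", json={"type": "probe", "ts": time.time()}))
view_ms = timed(lambda: requests.get(
    f"{DB}/_design/probe/_view/by_ts", params={"limit": 1}))

print(f"write: {write_ms:.1f} ms, view read: {view_ms:.1f} ms")
```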

For a more sophisticated setup, my suggestion is a tool like tsung, which actually makes requests, inserts data and does a whole lot more. It lets you aggregate the data, see your general throughput, and it comes in handy for a post-mortem.

If you struggle with tsung, check out my tsung chef cookbook.

Application specific

There really is never enough monitoring. Aside from general server vitals, I highly recommend keeping a sensor on things that are specific to your application, e.g. the number of documents of a certain type, or in a database. All these little things help you get the big picture.
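For example, a single view with the built-in _count reduce gives you documents per type; a sketch with made-up names:

```python
import requests

COUCH = "http://127.0.0.1:5984/mydb"  # placeholder database

design_doc = {
    "views": {
        "by_type": {
            "map": "function(doc) { emit(doc.type, null); }",
            "reduce": "_count",
        }
    }
}
requests.put(f"{COUCH}/_design/monitoring", json=design_doc)

rows = requests.get(f"{COUCH}/_design/monitoring/_view/by_type",
                    params={"group": "true"}).json()["rows"]
for row in rows:
    # Feed these numbers into munin, nagios, or whatever you use.
    print(row["key"], row["value"])
```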

BigCouch

BigCouch is probably the biggest news since I last blogged about Operation CouchDB.

BigCouch is Cloudant’s CouchDB sharding framework. It basically lets you spread a single database across multiple nodes. I haven’t had the pleasure of running BigCouch myself yet, but we’ve been a Cloudant customer for a while now.

Expect more on this topic soon, as I plan to get my hands dirty.

Fin

I hope I didn’t forget anything. If you have feedback or stories to share, please comment. :-)

I just re-read my original blog post, and I think the rest of it still stands.