
CouchDB: checkpointing on view building

I'm posting about this tidbit because Google seemed to know nothing about it.

Anyway, during the view building process, we may see the following in couchdb.log (with the log level set to at least info in local.ini):

[...] [info] [...] checkpointing view update at seq 78163851 for citations _design/erlang
[...] [debug] [...] New task status for citations _design/erlang: Processed 17844590 of 107444308 changes (16%)
[...] [debug] [...] New task status for citations _design/erlang: Processed 17848060 of 107444308 changes (16%)
[...] [debug] [...] New task status for citations _design/erlang: Processed 17850878 of 107444308 changes (16%)
[...] [info] [...] checkpointing view update at seq 78170348 for citations _design/erlang
[...] [debug] [...] New task status for citations _design/erlang: Processed 17851087 of 107444308 changes (16%)

The above tells us that CouchDB saved its progress during indexing, which allows the process to resume in case we decide to restart CouchDB and interrupt the indexing. I've tried it myself a couple of times with CouchDB 0.10.0; I hadn't noticed this feature prior to 0.10.0.

And why is this useful in particular? The biggest use case is upgrading computing power (e.g. on AWS EC2): when we realize we need MOAR, we can still resume the indexing once we boot with more resources.

Sidenote: Checkpointing will not help if indexing is stopped and the view is adjusted or changed, or when indexing stopped due to an error, such as a crash.

That's all, kids.

Introducing TillStore

Update, 2009-10-24: Fixed a bug and committed a couple of other improvements. TillStore 0.2.0 is released!

I went to nosqlberlin last week and got inspired. I listened to a lot of interesting talks, most notably about Redis, CouchDB, Riak and MongoDB (I'm omitting two talks, which of course were no less awesome than the rest!). Due to an unfortunate circumstance I had six hours to hack on stuff from Thursday to Friday.

And here it is — I'm proud to present my very own approach to key-value stores: TillStore!

What is a key-value store?

In a nutshell, a key-value store is a database for values (Doh!). There are no duplicates; it's always one key mapped to one value. No bells and whistles: most of them just want to be fast. Eventual consistency is another attribute most of them like to claim for themselves. The value part (in key-value store) is very flexible. Some key-value stores only support certain types; TillStore supports them all through JSON. ;-)
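To make that concrete, here's a minimal sketch of the idea (not TillStore's actual code; all names here are made up): one key maps to one value, and values of any type are handled by serializing them through JSON.

```javascript
// A minimal in-memory key-value store sketch. Storing values as JSON
// strings is what lets it accept numbers, strings, arrays, objects...
function KVStore() {
    this.data = {};
}

KVStore.prototype.put = function (key, value) {
    // One key, one value: a second put() on the same key overwrites.
    this.data[key] = JSON.stringify(value);
};

KVStore.prototype.get = function (key) {
    if (!(key in this.data)) {
        return undefined;
    }
    return JSON.parse(this.data[key]);
};

var store = new KVStore();
store.put("answer", { value: 42 });
console.log(store.get("answer").value); // 42
```

No bells and whistles, as promised: no persistence, no replication, just the key-to-value mapping.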

Other examples for key-value stores include Riak, Redis, Cassandra and Tokyo Cabinet.


So vain, right? Well, in the beginning TillStore was an inside joke with a colleague. And to be honest, TillStore was nothing more than the following:

$tillStore = array();

However, when I had time to hack away on Thursday night, I took it to a slightly higher level. ;-)

Small notes on CouchDB's views

I've been wrestling with a couple of views in CouchDB lately. This blog post serves as a mental note to myself, and hopefully helps others too. As I write this, I'm using 0.9.1 and 0.10.0 in a production setup.

Here's the environment:

  • Amazon AWS Large instance (ami-eef61587)
  • Ubuntu 9.04 (Jaunty)
  • CouchDB 0.9.1 and 0.10.0
  • database size: 199.8 GB
  • documents: 157408793

On to the tips

These are some small pointers which I gathered by reading different sources (wiki, mailing list, IRC, blog posts, Jan ...). All those revolve around views and what not with a relatively large data set.

Do you want the speed?

Building a view on a database of this magnitude will take a while.

In the beginning I estimated about a week and a half. And it really took that long.

Things to consider, always:

  • upgrade to trunk ;-) (or e.g. to 0.10.x)
  • view building is CPU-bound, which leads us to MOAR hardware (a larger instance)

The bottom line is, "Patience (really) is a virtue!". :-)

A side-note on upgrading: Double-check that upgrading doesn't require you to rebuild the views. That is, unless you have time.

View basics

When we initially tested whether CouchDB was made for us, we started off with a bunch of emit(key, doc)-like map functions in (sometimes) temporary views. On the production data, there are a few gotchas.

First off — the obvious: temporary views are slow.

Back to JavaScript

Emitting the complete document forces CouchDB to duplicate data in the index, which in turn needs more space and also makes view building a little slower. Instead, the suggestion is to always emit(key, null) and then POST with multiple keys in the body to retrieve the documents.
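To illustrate the pattern (the author field is made up, and the tiny emit() below only stands in for CouchDB's view server so the snippet runs on its own):

```javascript
// Hypothetical map function: emit the key we want to query by, but
// null as the value, so the index stays small.
function map(doc) {
    emit(doc.author, null);
}

// Stand-in for CouchDB's emit(): collect whatever the map function
// produces for a document.
var rows = [];
function emit(key, value) {
    rows.push({ key: key, value: value });
}

map({ _id: "doc-1", author: "till" });
console.log(JSON.stringify(rows)); // [{"key":"till","value":null}]

// CouchDB adds each row's document id by itself. To fetch the actual
// documents later, POST the ids in the request body, e.g.:
//
//   POST /dbname/_all_docs?include_docs=true
//   { "keys": ["doc-1", "doc-2"] }
```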

Reads are cheap, and if not, get a cache.


In case you wonder why I don't do emit(key, doc._id): that's because CouchDB is already kind enough to include the document's ID in each row anyway. (Sweet, no?)


Sort of related, CouchDB has a ?include_docs=true parameter.

This is really convenient — especially when you develop the application.

I gathered from various sources that using it bears a performance penalty. The reason is that include_docs issues another b-tree lookup for every row returned in the initial result. Especially with larger result sets, this may turn into a bottleneck, while it can be considered OK with smaller ones.

As always — don't forget that HTTP itself is relatively cheap and a single POST request with multiple keys (e.g. document IDs) in the body is likely not the bottleneck of your operation — compared to everything else.

And if you really need to optimize that part, there's always caching. :-)

Need a little more?

Especially when documents of different types are stored in the same database (oh, the beauty of document-oriented storage!), one should consider a map function like the following:

function(doc) {
    if (doc.foo) {
        emit(doc.foo, null);
    }
}

doc.foo is obviously an attribute in the document.
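A common convention for the multiple-types case (an assumption on my part, not something from this post) is a type attribute on every document, so each view's map function only lets one kind of document into its index:

```javascript
// Each view guards on doc.type; only matching documents are indexed.
// The "citation"/"user" types and field names are made up.
function mapCitations(doc) {
    if (doc.type === "citation") {
        emit(doc.author, null);
    }
}

function mapUsers(doc) {
    if (doc.type === "user") {
        emit(doc.name, null);
    }
}

// Stand-in for CouchDB's emit(), so we can watch what gets through.
var rows = [];
function emit(key, value) {
    rows.push(key);
}

var docs = [
    { _id: "c1", type: "citation", author: "till" },
    { _id: "u1", type: "user", name: "karl" }
];
docs.forEach(function (doc) { mapCitations(doc); });
console.log(rows.length); // 1 -- only the citation was indexed
```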

JavaScript vs. Erlang

Take sum(): I haven't found too many of these yet, but with version 0.10+, the CouchDB folks implemented a couple of the JavaScript reduce functions in Erlang, which makes for an easy replacement and adds a little speed on top. :-) So in this case, use the built-in _sum.
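For illustration, here's the kind of JavaScript reduce that _sum replaces (sum() is normally provided by CouchDB's view server; it's re-implemented here only so the snippet is self-contained):

```javascript
// sum() as CouchDB's JavaScript view server provides it: add up an
// array of numbers.
function sum(values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i];
    }
    return total;
}

// The classic hand-written JavaScript reduce...
function reduce(keys, values, rereduce) {
    return sum(values);
}

console.log(reduce(null, [1, 2, 3], false)); // 6

// ...which, with 0.10+, can be replaced by the literal string "_sum"
// as the reduce value in the design document:
//
//   { "views": { "total": { "map": "...", "reduce": "_sum" } } }
```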


Compaction, I learned, knows how to resume. So even if you kill the process, it'll manage to pick up where it left off.

When you populate a database through bulk writes, the gain from a compaction is relatively small and probably negligible, especially because compacting a database takes a long while. Keep in mind that compaction is disk-bound, which is often one of the final and inevitable bottlenecks in many environments. Unless the hardware is designed for it from the ground up, this will most likely suck.

Compaction can have a larger impact when documents are written one by one to a database, or a lot of updates have been committed on the set.

I remember that when I built another set of 20 million documents one by one, I ended up with a database size of 60 GB. After I compacted the database, the size dropped to 20 GB. I don't have numbers on read speed and what not, but it also felt speedier. ;-)


That'd be it. More next time!

My first PHP Unconference

I went to Hamburg last weekend to visit the PHP Unconference, which was probably my first conference of this kind ever. I've been to a couple of barcamps and other smaller events, but this felt more like a real conference to me. That is, if I exclude ALA and the various ad:techs I had to go to.

The reasons why I usually avoid tech conferences are, foremost, the price tag (working for myself, I can technically label it as an expense, but I still have to pay for it), doubts that it'll be worth it in terms of knowledge gained, and time. I tend to catch up with people outside of conferences (when they are in Berlin :-)), and that has worked well for me.

I'm glad I set all these things aside for Hamburg (and it was all too easy). A lot of people expressed how much they liked their (often 3rd) PHP Unconference, and I can second, or third that — job really well done. Ulf Wendel took it one step further, blogged and asked, "Is perfect too boring?", because everything worked out so well. I guess I would say, "No, it's not boring", and I'm inclined to add, "Thanks, it really felt like having a weekend off, yet I still learned something and met a ton of nice people (or connected online nicknames to real faces)!".

I can definitely see why people visit the PHP Unconference each year, and I'll be one of them next year! ;-)

As I said, I had a great time, and both my topics were accepted too. One was merged with another PHP performance talk that was overbooked with PHP VIPs, which is why I decided to listen to Kore Nordmann's talk on CouchDB instead. The other one ("Deployment") I kind of overslept, and I'm sorry about that! I'll make sure to avoid party, party Hamburg next year.

Here are the slides for my Zend Framework (performance) talk, I hope you find them interesting:

The slides and speaker-notes contain…

  • a small intro as to why I think it's worthwhile to get into the ring with it
  • hints and pointers on general PHP optimization
  • details on a couple of components (e.g. things to look out for and how to overcome them)

Make sure to check the speaker notes (using this link) — I didn't put everything in there, but a lot.

(The deployment slides will be up later this weekend.)

This also reminds me to improve my presentation-fu. I need something as kickass as Keynote, but for Windows (currently). If anyone has a pointer, let me know. ;-)