I’ve been wrestling with a couple of views in CouchDB lately. This blog post serves as a mental note to myself, and hopefully it helps others too. As I write this, I’m using 0.9.1 and 0.10.0 in a production setup.
Here’s the environment:
- Amazon AWS L Instance (ami-eef61587)
- Ubuntu 9.04 (Jaunty)
- CouchDB 0.9.1 and 0.10.0
- database size: 199.8 GB
- documents: 157408793
On to the tips
These are some small pointers which I gathered by reading different sources (wiki, mailing list, IRC, blog posts, Jan …). They all revolve around views and related topics on a relatively large data set.
Do you want the speed?
Building a view on a database of this magnitude will take a while.
In the beginning I estimated about a week and a half. And it really took that long.
Things to consider, always:
- upgrade to trunk ;-) (or e.g. to 0.10.x)
- view building is CPU-bound, which leads us to MOAR hardware: a larger instance
The bottom line is, “Patience (really) is a virtue!”. :-)
A side note on upgrading: double-check whether the upgrade requires you to rebuild the views. That is, unless you’ve got time.
View basics
When we initially tested whether CouchDB was made for us, we started off with a bunch of emit(doc.foo, doc)-like map functions in (sometimes) temporary views. On the production data, there are a few gotchas.
First off — the obvious: temporary views are slow.
Back to JavaScript
Emitting the complete document forces CouchDB to duplicate data in the index, which in turn needs more space and also makes view building a little slower. Instead it’s suggested to always emit(doc.foo, null) and then POST with multiple keys in the body to retrieve the documents.
Reads are cheap, and if not, get a cache.
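To get a feel for the difference, here is a minimal sketch in plain JavaScript. `runMap` is a made-up stand-in for the view server (in CouchDB `emit` is a global; here it is passed in as an argument so the snippet runs on its own), and the two documents are invented sample data. The serialized size is only a rough proxy for what ends up in the index.

```javascript
// Made-up stand-in for CouchDB's view server: runs a map function over
// some documents and collects the rows it emits. Illustration only.
function runMap(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  return rows;
}

// Invented sample documents.
const docs = [
  { _id: "a", foo: "x", body: "a fairly long text field ..." },
  { _id: "b", foo: "y", body: "another fairly long text field ..." },
];

// Same key, once with the whole document as value and once with null.
const fatRows = runMap((doc, emit) => emit(doc.foo, doc), docs);
const leanRows = runMap((doc, emit) => emit(doc.foo, null), docs);

// Serialized size as a rough proxy for how much data each variant stores.
const fatBytes = JSON.stringify(fatRows).length;
const leanBytes = JSON.stringify(leanRows).length;
console.log({ fatBytes, leanBytes });
```

Both variants produce the same keys; only the value (and with it the size) differs.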
doc._id
In case you wonder why I don’t do emit(doc.foo, doc._id)? Well, that’s because CouchDB is already kind enough to retrieve the document’s ID anyway. (Sweet, no?)
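As a small illustration, this is roughly what a view result looks like and how the IDs can be pulled out of it. The response body below is invented sample data; only the row shape (`id`, `key`, `value`) is what CouchDB returns.

```javascript
// Invented response body, shaped like the result of querying a view
// whose map function is emit(doc.foo, null).
const viewResult = {
  total_rows: 3,
  offset: 0,
  rows: [
    { id: "doc-1", key: "bar", value: null },
    { id: "doc-2", key: "bar", value: null },
    { id: "doc-3", key: "baz", value: null },
  ],
};

// The ID comes with every row, so there is no need to emit it as the value.
const ids = viewResult.rows.map((row) => row.id);
console.log(ids);
```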
include_docs
Sort of related, CouchDB has a ?include_docs=true parameter.
This is really convenient — especially when you develop the application.
I gathered from various sources that using it carries a performance penalty. The reason is that include_docs issues another b-tree lookup for every row returned in the initial result. With larger result sets this may turn into a bottleneck, while it can be considered OK with smaller ones.
As always, don’t forget that HTTP itself is relatively cheap, and a single POST request with multiple keys (e.g. document IDs) in the body is likely not the bottleneck of your operation compared to everything else.
And if you really need to optimize that part, there’s always caching. :-)
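Here is a sketch of what such a POST could look like. I’m assuming the `_all_docs` endpoint, which accepts a `keys` array in the request body; the sources above don’t spell out the endpoint. Note that `_all_docs` only returns the document bodies when asked with `include_docs=true`. The helper `buildBulkFetch`, the server address, the database name and the IDs are all made up, and the snippet only builds the request description, it doesn’t send anything.

```javascript
// Hypothetical helper: describes one POST that asks for many documents
// by ID. Assumes the _all_docs endpoint, which accepts {"keys": [...]}.
function buildBulkFetch(baseUrl, db, ids) {
  return {
    method: "POST",
    // include_docs=true is needed here so the bodies come back as well.
    url: `${baseUrl}/${encodeURIComponent(db)}/_all_docs?include_docs=true`,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ keys: ids }),
  };
}

// Made-up server address, database name and document IDs.
const bulkRequest = buildBulkFetch("http://localhost:5984", "mydb", [
  "doc-1",
  "doc-2",
]);
console.log(bulkRequest.method, bulkRequest.url);
console.log(bulkRequest.body);
```

The resulting object can be handed to whatever HTTP client you use, and the response is an obvious candidate for a cache keyed on the list of IDs.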
Need a little more?
Especially when documents of different types are stored into the same database (Oh, the beauty of document oriented storage!), one should consider the following map-example:
function (doc) {
  if (doc.foo) {
    emit(doc.foo, null);
  }
}
.foo is obviously an attribute in the document.
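To see what the guard buys you, here is a minimal sketch that runs both variants over a mixed bag of documents. `runMap` is a made-up stand-in for the view server (with `emit` passed in as an argument instead of being a global), and the documents and their `type` field are invented.

```javascript
// Made-up stand-in for CouchDB's view server: runs a map function over
// some documents and collects the rows it emits. Illustration only.
function runMap(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  return rows;
}

// Invented documents of different types in the same database.
const docs = [
  { _id: "1", type: "post", foo: "bar" },
  { _id: "2", type: "comment" },
  { _id: "3", type: "post", foo: "baz" },
];

// Without the guard, every document produces a row, useful key or not.
const unguarded = runMap((doc, emit) => emit(doc.foo, null), docs);

// With the guard, only documents that actually have .foo are indexed.
const guarded = runMap((doc, emit) => {
  if (doc.foo) {
    emit(doc.foo, null);
  }
}, docs);

console.log(unguarded.length, guarded.length); // 3 2
```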
JavaScript vs. Erlang
sum(): I haven’t found too many of these, but with version 0.10+ the CouchDB folks implemented a couple of JavaScript functions in Erlang, which are an easy replacement and add a little speed on top. :-) So in this case, use _sum.
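For illustration, this is how the two variants could sit side by side in a design document. The design document name, the view names and the map function are invented; the only difference that matters is the `reduce` entry.

```javascript
// Invented map function, shared by both views.
const mapSource = "function (doc) { if (doc.foo) { emit(doc.foo, 1); } }";

// Invented design document with two reduce variants: the JavaScript
// sum() helper and the built-in "_sum" implemented in Erlang.
const designDoc = {
  _id: "_design/example",
  views: {
    count_js: {
      map: mapSource,
      reduce: "function (keys, values, rereduce) { return sum(values); }",
    },
    count_erlang: {
      map: mapSource,
      // The plain string "_sum" selects the native implementation.
      reduce: "_sum",
    },
  },
};

console.log(JSON.stringify(designDoc, null, 2));
```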
Compact
Compact, I learned, knows how to resume. So even if you kill the process, it’ll manage to resume where it left off before.
When you populate a database through bulk writes, the gain from a compact is relatively small and probably negligible, especially because compacting a database takes a long while. Keep in mind that compaction is disk-bound, and disk I/O is often one of the final and inevitable bottlenecks in many environments. Unless the hardware was designed for it from the ground up, this will most likely suck.
Compaction can have a larger impact when documents are written one by one to a database, or a lot of updates have been committed on the set.
I remember that when I built another set with 20 million documents one by one, I ended up with a database size of 60 GB. After I compacted the database, the size dropped to 20 GB. I don’t have the numbers on read speed and so on, but it also felt more speedy. ;-)
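As a small sketch: compaction is started with a POST to the database’s `_compact` resource, and the numbers above work out to roughly two thirds of the file being reclaimed. The helpers `buildCompactRequest` and `reclaimedFraction`, the server address and the database name are made up, and the snippet only describes the request, it doesn’t send it. The JSON Content-Type header is an assumption on my part: newer CouchDB versions expect it, and I’m assuming it does no harm on older ones.

```javascript
// Hypothetical helper: describes the request that starts compaction.
// CouchDB exposes this as POST /{db}/_compact.
function buildCompactRequest(baseUrl, db) {
  return {
    method: "POST",
    url: `${baseUrl}/${encodeURIComponent(db)}/_compact`,
    // Assumption: a JSON content type, as newer versions expect.
    headers: { "Content-Type": "application/json" },
  };
}

// Share of the database file that compaction gave back.
function reclaimedFraction(sizeBeforeGb, sizeAfterGb) {
  return (sizeBeforeGb - sizeAfterGb) / sizeBeforeGb;
}

// Made-up server address and database name; sizes from the example above.
const compact = buildCompactRequest("http://localhost:5984", "mydb");
const fraction = reclaimedFraction(60, 20);
console.log(compact.method, compact.url);
console.log(`${(fraction * 100).toFixed(1)}% reclaimed`); // 66.7% reclaimed
```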
Fin
That’d be it. More next time!