A toolchain for CouchDB Lounge

Friday, February 26. 2010

One of our biggest issues with CouchDB is currently the lack of compaction of our database, and by lack of, I don't mean that CouchDB doesn't support it, I mean that we are unable to actually run it.

Compaction in a nutshell

Compaction in a nutshell is pretty cool.

As you know, CouchDB is not very space-efficient. For once, CouchDB saves revisions of all documents. Which means, whenever you update a document a new revision is saved. You can rollback any time, or expose it as a nifty feature in your application — regardless, those revisions are kept around until your database is compacted.

Think about it in terms of IMAP - emails are not deleted until you hit that magic "compact" button which 99% of all people who use IMAP don't know what it's for anyway.

Another thing is that whenever new documents are written to CouchDB and bulk mode is not used, it'll save them in a way which is not very efficient either. In terms of actual storage and indexing (so rumour has it).

Compaction woes

Since everything is simple with CouchDB, compaction is a simple process in CouchDB too. Yay!

When compaction is started, CouchDB will create a new database file where it stores the data in a very optimized way (I will not detail on this, go read a science book or google if you are really interested in this!). When the compaction process finished, CouchDB will exchange your old database file with the new database file.

The woes start with that e.g. when you have 700 GB uncompacted data, you will probably need another 400 GB for compaction to finish because it will create a second database file.

The second issue is that when you have constant writing on your database, the compaction process will actually never finish. It kind of sucks and for those people who aim to provide close to 100% availability, this is extremely painful to learn.


Continue reading "A toolchain for CouchDB Lounge"

Small notes on CouchDB's views

Wednesday, October 21. 2009

I've been wrestling with a couple views in CouchDB currently. This blog post serves as mental note to myself, and hopefully to others. As I write this, i'm using 0.9.1 and 0.10.0 in a production setup.

Here's the environment:

  • Amazon AWS L Instance (ami-eef61587)
  • Ubuntu 9.04 (Jaunty)
  • CouchDB 0.9.1 and 0.10.0
  • database size: 199.8 GB
  • documents: 157408793

On to the tips

These are some small pointers which I gathered by reading different sources (wiki, mailing list, IRC, blog posts, Jan ...). All those revolve around views and what not with a relatively large data set.

Do you want the speed?

Building a view on a database of this magnitude will take a while.

In the beginning I estimated about week and a half. And it really took that long.

Things to consider, always:

  • upgrade to trunk ;-) (or e.g. to 0.10.x)
  • view building is CPU-bound which leads us to MOAR hardware — a larger instance

The bottom line is, "Patience (really) is a virtue!". :-)

A side-note on upgrading: Double-check that upgrading doesn't require you to rebuild the views. That is, unless you got time.

View basics

When we initially tested if CouchDB was made for us we started off with a bunch off emit(doc.foo, doc)-like map functions in (sometimes) temporary views. On the production data, there are a few gotcha's.

First off — the obvious: temporary views are slow.

Back to JavaScript

Emitting the complete document will force CouchDB to duplicate data in the index which in return needs more space and also makes view building a little slower. Instead it's suggested to always emit(doc.foo, null) and then POST with multiple keys in the body to retrieve the documents.

Reads are cheap, and if not, get a cache.

doc._id

In case you wonder why I don't do emit(doc.foo, doc._id)? Well, that's because CouchDB is already kind enough to retrieve the document's ID anyway. (Sweet, no?)

include_docs

Sort of related, CouchDB has a ?include_docs=true parameter.

This is really convenient — especially when you develop the application.

I gathered from various sources that using them bears a performance penalty. The reason is that include_docs issues another b-tree lookup for every row returned in the initial result. Especially with larger sets, this may turn into a bottleneck, while it can be considered OK with smaller result sets.

As always — don't forget that HTTP itself is relatively cheap and a single POST request with multiple keys (e.g. document IDs) in the body is likely not the bottleneck of your operation — compared to everything else.

And if you really need to optimize that part, there's always caching. :-)

Need a little more?

Especially when documents of different types are stored into the same database (Oh, the beauty of document oriented storage!), one should consider the following map-example:

if (doc.foo) {
    emit(doc.foo, null)
}

.foo is obviously an attribute in the document.

JavaScript vs. Erlang

sum(), I haven't found too many of these — but with version 0.10+, the CouchDB folks implemented a couple JavaScript functions in Erlang, which is an easy replacement and adds a little speed on top. :-) So in this case, use _sum.

Compact

Compact, I learned, knows how to resume. So even if you kill the process, it'll manage to resume where it left off before.

When you populate a database through bulk writes, the gain from a compact is relatively small and is probably neglectable. Especially because compacting a database takes a long while. Keep in mind that compaction is disk-bound, which is often one of the final and inevitable bottlenecks in many environments. Unless hardware is designed ground up, this will most likely suck.

Compaction can have a larger impact when documents are written one by one to a database, or a lot of updates have been committed on the set.

I remember that when I build another set with 20 million documents one by one, I ended up with a database size of 60 GB. After I compacted the database, the size dropped to 20 GB. I don't have the numbers on read speed and what not, but it also felt more speedy. ;-)

Fin

That'd be it. More next time!

Defined tags for this entry: , , , ,

Thoughts on RightScale

Tuesday, October 20. 2009

RightScale provides all kinds of things — from a pre-configured MySQL master-slave setup (with automatic EBS/s3 backups), to a full LAMP stack, Rails app servers, virtually all kinds of other pre-configured server templates to a nifty auto-scaling feature.

We decided to leverage RightScale when we planned our move to AWS a couple months ago in order to not have to build everything ourselves. I've been writing this blog entry for the past five weeks and here are some observations, thoughts and tips.

RightScale

First off, whatever you think, and do, or have done so far, let me assure you, there's always a RightScale way of doing things. For (maybe) general sanity (and definitely your own), I suggest you don't do it their way — always.

One example for the RightScale way is, that all the so-called RightScripts will attempt to start services on reboot for you, instead of registering them with the init system (e.g., on Ubuntu, update-rc.d foo defaults) when they are set up.

You may argue that RightScale's attempt will provide you with a maybe more detailed protocol of what happened during the boot sequence, but at the same time it provides more potential for errors and introduces another layer around what the operating system provides, and what generally works pretty well already.

PHP and RightScale

RightScale's sales team knows how to charm people, and when I say charm, I do not mean scam (just for clarity)! :-)

The demos are very impressive and the client show cases not any less. Where they really need to excel though are PHP-related demos because not everyone in the world runs Ruby on Rails yet. No, really — there's still us PHP people and also folks who run Python, Java and so on.

Coming from the sales pitch, I felt disappointed a little because a standard PHP setup on RightScale is as standard as you would think three years ago. mod_php, Apache2 and so on. The configuration itself is a downer as well, a lot of unnecessary settings and generally not so speedy choices. Then remember that neither CentOS nor Ubuntu are exactly up to date on packages and add another constraint to the mix — Ubuntu is on 8.04 which is one and half years in the past as I write this entry.

And even though I can relate to RighScale's position — in terms of that supporting customers with all kinds of different software is a burden and messy to say the least — I am also not a fan.

Scaling up

The largest advantage when you select a service provider such as RightScale is, that they turn raw EC2 instances into usable servers out of the box. So far example setting up a lamp stack yourself requires time, while it's still a trivial task for many. With RightScale, it's a matter of a couple clicks — select image, start, provide input variables and done.

Aside from enhanced AMIs RightScale's advantage is auto-scaling. Auto-scaling has been done a couple times before. There are more than one service provider which leverages EC2 and provides scalability on top. Then take a look at Scalr, which is open source, and then recently Amazon themselves added their own Elastic Load Balancer.

In general, I think auto-scaling is something everyone gets, and wants, but of course it's not that dead simple to implement. And especially when you move to a new platform, it's a perfect trade off to sacrifice a flexibility and money for a warm and fuzzy "works out of the box" feeling.


Continue reading "Thoughts on RightScale"

AddressLimitExceeded: Too many addresses allocated

Tuesday, October 13. 2009

I got this error message tonight when I tried to allocate another EIP from within RightScale's dashboard. So it turns out there is a maximum of 5 (E)IPs on all AWS accounts, but there's a contact form to request more. Meh.

I wish AWS would make this part slightly easier, e.g. by announcing a customer's own IP space.

Defined tags for this entry: , ,

CouchDB on Ubuntu on AWS

Friday, August 28. 2009

Here's a little HowTo on how to setup CouchDB on an AWS EC2 instance. But outside of AWS (and EC2), this setup works on any other Ubuntu server, and I suppose Debian as well.

Getting started

The following steps are a rough draft, or a sketch on how to get started. I suggest that you familiarize yourself with what all of these things do. If you want to skip on the reading and just get started, this should work anyway.

  • you (obviously) need an AWS account (and log into the AWS console).
  • you need a custom security group (make sure to open up for http traffic)
security_group_001

security_group_002

  • create an EBS volume (Take a deep breath and think about the size of the volume. Keep in mind that you don't want to run into space issues right away and that allocated storage (even idle) costs you money (e.g. 400 GB =~ 40 USD (per month), excluding the i/o).)
  • create a keypair (It'll prompt you to download a foobar.pem, I placed mine on my local machine in ~/.ssh/ and ran chmod 400 on it.)
  • get an elastic IP
  • start the instance
    • select an AMI (I selected alestic's 64bit server Ubuntu 9.04 AMI.)
    • assign your own security group AND the defaults one
    • select your keypair

Woo! We made it that far.

The instance should boot and once this is done (green indicates all went well), we want to associate the previously created EBS volume and the elastic IP to said instance.

Once these steps are complete, go on the instance screen, click on your running instance and then click on "Connect". It'll show you the ssh command to connect to your instance -- it should be similar to this:

ssh -i .ssh/foobar.pem root@ec2-W-X-Y-Z.compute-1.amazonaws.com

The W-X-Y-Z part is most likely replaced with your elastic IP.

This process is not very automated yet, but at least you have an instance up and running. The next step is to try to login and see if the EBS was attached — if all went well, you should have /mnt.


Continue reading "CouchDB on Ubuntu on AWS"