Trying out BigCouch with Chef-Solo and Vagrant

Monday, April 4. 2011

So the other day, I wanted to quickly check something in BigCouch and thanks to Vagrant, chef(-solo) and a couple cookbooks — courtesy of Cloudant — this was exceptionally easy.

As a matter of fact, I had BigCouch set up and running within minutes.

Here's how.


You'll need git, Ruby, gems and Vagrant (along with VirtualBox) installed. If you need help with those items, I suggest you check out my previous blog post, Getting the most out of Chef with Scalarium and vagrant.

For the operating system, I suggest you get an Ubuntu 10.04 box (aka Lucid).

Vagrant (along with Ruby and VirtualBox) is a one-time setup which you can use and abuse for all kinds of things, so don't worry about the extra steps.


Clone the cookbooks in $HOME:

$ git clone

Create a vagrant environment:

$ mkdir ~/bigcouch-test
$ cd ~/bigcouch-test
$ vagrant init

Set up ~/bigcouch-test/Vagrantfile:

Vagrant::Config.run do |config|
  config.vm.box = "base"
  config.vm.box_url = ""

  # Forward a port from the guest to the host, which allows for outside
  # computers to access the VM, whereas host only networking does not.
  # config.vm.forward_port "http", 80, 8080

  config.vm.provisioner = :chef_solo
  config.chef.cookbooks_path = "~/cloudant_cookbooks"
  config.chef.add_recipe "bigcouch::default"
end

Start the vm:

$ vagrant up

Use BigCouch

$ vagrant ssh
$ sudo /etc/init.d/bigcouch start
$ ps aux|grep [b]igcouch

Done. (You should see processes located in /opt/bigcouch.)


That's all. For an added bonus, you could open BigCouch's ports on the VM and use it from your host system, because otherwise this is all a matter of localhost. See config.vm.forward_port in your Vagrantfile.

socket.io & nodejs: at a medium pace

Tuesday, February 15. 2011

In my last blog entry, I shared some nodejs-code to read CouchDB's _changes feed and publish the data to a website. In order to update the page in a continuous fashion, I used socket.io, which provides a nifty abstraction across server-to-client transports, for example websockets and ajax longpoll.


When we tested the code for a few days over the weekend, the largest issue we ran into was that the stream moved too fast. In fact it moved so fast, we couldn't read anything and were at risk of getting a seizure when we watched the page for too long.

Certainly awesome from one point of view — people are using the website — but it also led to the next objective: I had to find a way to throttle broadcasting to the client. Here's how!

Continue reading "socket.io & nodejs: at a medium pace"

node.js & socket.io fun

Wednesday, February 2. 2011

I recently had the extreme pleasure to use node.js and socket.io on a project. Here are some insights.


So the objective of the project was to read data from the _changes feed of our CouchDB cluster (hosted by Cloudant) and publish the data to a widget which we can use to display a constant stream of "what are people doing right now".

The core of the problem we faced was not just taking this stream of data and feeding it on to a page. Since we'll deploy this widget to our homepage, we also needed to make sure that no matter how many clients see it, the impact on the database cluster is minimal: only a single client (or, down the road, up to three for failover) would actually read data from the cluster.
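To illustrate that fan-out idea, here is a minimal sketch in Python (not the node.js code from the project). `Broadcaster` and its method names are made up for this example; the point is only that one upstream read is handed to any number of subscribers.

```python
class Broadcaster:
    """Fans a single upstream feed out to any number of subscribers."""

    def __init__(self):
        self._subscribers = []
        self.upstream_reads = 0  # how many changes were read from the database feed

    def subscribe(self, callback):
        """Register a callable that receives every change."""
        self._subscribers.append(callback)

    def on_change(self, change):
        """Called once per change read from the single upstream connection."""
        self.upstream_reads += 1
        for callback in self._subscribers:
            callback(change)


# Usage: two browser clients, but only one read against the cluster.
broadcaster = Broadcaster()
client_a, client_b = [], []
broadcaster.subscribe(client_a.append)
broadcaster.subscribe(client_b.append)
broadcaster.on_change({"id": "doc-1"})
```

However many subscribers are attached, the database only ever sees the one reader.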

After shopping around for a technology to use, it became obvious that we needed some sort of abstraction because of how the different technologies (e.g. comet, websockets, ajax longpolling, ...) are implemented in different browsers. We decided to build this project on top of socket.io, pretty much for the same reasons most people go to jQuery, prototype or dojo these days.

Continue reading "node.js & socket.io fun"

Operating CouchDB II

Tuesday, November 30. 2010

A couple of months ago, I wrote an article titled Operating CouchDB. I noticed that a lot of people still visit my blog for this particular post, so this is an update to the situation.

And no, you may not copy and paste this or any other of my blog posts unless you ask me. ;-)

Caching revisited

A while back I wrote about how caching is trivial with CouchDB — well sort of.

CouchDB and its ecosystem like to emphasize how the power of HTTP allows you to leverage countless known and proven solutions. For caching, this goes only so far.

Remember the ETag? The reality is that while it's easy to grasp what it does, a lot of reverse proxies and caches implement it poorly or not at all. Take my favorite, Varnish: there is currently very little support for the ETag.

And once you go down this road, it gets messy:

  • need a process to listen on a filtered _changes.
  • the process must be able to PURGE Varnish

... suddenly it's homegrown and not transparent at all.
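To give an idea of what such a homegrown process involves, here is a rough Python sketch of the first half: turning lines from a continuous _changes feed into the paths that would need a PURGE. It assumes the line-per-change JSON format of `feed=continuous`, that cached URLs follow the `/db/docid` pattern, and that Varnish has been configured in VCL to accept PURGE at all; `purge_paths` is a made-up helper name, and actually sending the requests is left out.

```python
import json
from urllib.parse import quote


def purge_paths(feed_lines, db_name):
    """Yield one cache path per changed document in a continuous _changes feed."""
    for line in feed_lines:
        line = line.strip()
        if not line:
            continue  # empty lines are heartbeats
        change = json.loads(line)
        if "id" not in change:
            continue  # e.g. the closing {"last_seq": ...} line
        yield "/{}/{}".format(quote(db_name, safe=""), quote(change["id"], safe=""))


# Hypothetical feed excerpt
lines = [
    '{"seq":1,"id":"doc-a","changes":[{"rev":"1-abc"}]}',
    "",
    '{"seq":2,"id":"user/42","changes":[{"rev":"2-def"}]}',
    '{"last_seq":2}',
]
paths = list(purge_paths(lines, "mydb"))
```

Each resulting path would then be sent to Varnish as a PURGE request, which is exactly the extra moving part complained about above.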

However, the biggest problem I thought about the other day is this: unless you happen to be one of the chosen few who run a CouchApp in production, your CouchDB client is not a regular browser. That means your CouchDB client library doesn't know about the ETag either. Bummer.

What about those lightweight HEAD requests?

HEAD requests to validate (or invalidate) the cache have the following issues:

  • An obvious I/O penalty, because CouchDB will consult the BTree each time you send a HEAD request to check for the ETag.

  • The HEAD request in CouchDB is currently implemented like a GET, except that the body is thrown away in the response.

Work is being done in COUCHDB-941 to fix these two issues. In summary, though (and this is not just true for CouchDB), the awesomeness of the ETag comes primarily from faking performance (aka fast) by saving the bandwidth needed to transfer data.


Ahhhhh — sorry, that might have been a lot of bitching! :-) Let me get to the point!

IMHO, if my cache has to send a HEAD request to CouchDB each time it is used, just to make sure the data is not stale, it becomes pretty pointless to use a cache anyway. Point taken, HEAD requests will become cheaper, but they have to happen anyway.

Thus only part of the work is actually offloaded from the database server while I'd say the usual approach in caching is to not hit the database (server) at all.
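A small simulation makes the point visible. This is a sketch in Python with an in-memory stand-in for the server; `FakeCouch` and `ValidatingCache` are invented for the example, and the only CouchDB-specific assumption is that the document revision serves as the ETag.

```python
class FakeCouch:
    """In-memory stand-in for a CouchDB server that counts incoming requests."""

    def __init__(self):
        self.docs = {}
        self.head_requests = 0
        self.get_requests = 0

    def put(self, doc_id, rev, body):
        self.docs[doc_id] = (rev, body)

    def head(self, doc_id):
        self.head_requests += 1
        return self.docs[doc_id][0]  # the revision doubles as the ETag

    def get(self, doc_id):
        self.get_requests += 1
        return self.docs[doc_id]  # (rev, body)


class ValidatingCache:
    """Cache that revalidates every read against the backend's ETag."""

    def __init__(self, backend):
        self.backend = backend
        self.entries = {}

    def read(self, doc_id):
        etag = self.backend.head(doc_id)  # one round trip on every single read
        cached = self.entries.get(doc_id)
        if cached is None or cached[0] != etag:
            cached = self.backend.get(doc_id)
            self.entries[doc_id] = cached
        return cached[1]


couch = FakeCouch()
couch.put("a", "1-x", "hello")
cache = ValidatingCache(couch)
bodies = [cache.read("a") for _ in range(3)]
```

Three reads cost one GET but still three HEAD requests, so the database server is contacted every time.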

The (current) bottom line is: no silver bullet available.

You have to invalidate the cache from your application or e.g. by using a tool like thinner to follow _changes and purge accordingly. Or, what we're doing: make the application cache-aware and PURGE directly from it.
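The cache-aware variant can be sketched like this (Python, with plain dicts standing in for CouchDB and Varnish; `CacheAwareStore` is a hypothetical name and the `pop` call marks where a real application would issue the PURGE):

```python
class CacheAwareStore:
    """Application-side store that purges the cache whenever it writes."""

    def __init__(self, database, cache):
        self.database = database  # dict standing in for CouchDB
        self.cache = cache        # dict standing in for Varnish
        self.database_reads = 0

    def read(self, doc_id):
        if doc_id not in self.cache:
            self.database_reads += 1
            self.cache[doc_id] = self.database[doc_id]
        return self.cache[doc_id]

    def write(self, doc_id, body):
        self.database[doc_id] = body
        self.cache.pop(doc_id, None)  # this is the PURGE step


store = CacheAwareStore(database={}, cache={})
store.write("a", "hello")
first_reads = [store.read("a") for _ in range(3)]
```

Here repeated reads never touch the database until the application itself changes the document.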


?include_docs=true works well, up until you get a sh!tload of traffic, and then it's a hefty I/O penalty each time your view is read. Of course emitting the entire doc means that you'll need more disk space, but the performance improvement you get by sacrificing disk space is between 10 and 100x. (Courtesy of the nice guys at Cloudant!)

In a cloud computing context like AWS this is indeed a cost factor: think instance size for general I/O performance, the actual storage (in GB) and the throughput which Amazon charges for. In a real datacenter, SAS disks are also more expensive than those crappy 7200 rpm SATA(-II) disks.
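The two options look roughly like this. The view definitions are ordinary CouchDB map functions held as strings; the database, design document and view names are placeholders, and `view_url` is just a small helper written for this sketch in Python.

```python
from urllib.parse import urlencode

# Variant 1: lean index, documents are looked up at query time via include_docs.
LEAN_VIEW = {"map": "function(doc) { emit(doc._id, null); }"}

# Variant 2: the whole document is stored in the index (more disk, fewer lookups).
FAT_VIEW = {"map": "function(doc) { emit(doc._id, doc); }"}


def view_url(db, ddoc, view, include_docs=False, **params):
    """Build the path and query string for reading a view."""
    query = dict(params)
    if include_docs:
        query["include_docs"] = "true"
    path = "/{}/_design/{}/_view/{}".format(db, ddoc, view)
    return path + ("?" + urlencode(query) if query else "")


lean_url = view_url("mydb", "app", "lean", include_docs=True)
fat_url = view_url("mydb", "app", "fat")
```

With the second variant the documents come straight out of the view index, so `include_docs` can be dropped from the request.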


When I say partitioning, I mean spreading out the various document types by database. So even though the great advantage of a document oriented database is to be able to have a lot of different data right next to each other, you should avoid that.

Because when your data grows, it's pretty darn useful to be able to push a certain document type to another server in order to scale out. I know it's not that hard to do this when they are all in one database, but separating them when everything is on fire is just a whole lot more work than replicating the database to a dedicated instance and changing the endpoint in your application.
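On the application side, "changing the endpoint" can be as small as one lookup table. A sketch in Python, with invented host names and document types:

```python
# Hypothetical endpoints; every document type lives in its own database.
DEFAULT_ENDPOINT = "http://couch-1.example.com:5984"

# Document types that have been moved to a dedicated instance.
DEDICATED_ENDPOINTS = {
    "events": "http://couch-2.example.com:5984",
}


def database_url(doc_type):
    """Return the database URL for a document type (database name == type)."""
    endpoint = DEDICATED_ENDPOINTS.get(doc_type, DEFAULT_ENDPOINT)
    return "{}/{}".format(endpoint, doc_type)
```

Moving another type then means replicating its database and adding one line to `DEDICATED_ENDPOINTS`.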

Keeping an eye on things

I did not mention this earlier because it's a no-brainer — sort of. It seems though, that it has to be said.

Capacity planning

... and projections are (still) pretty important with CouchDB (or NoSQL in general). If you don't know by how much or if your data will grow, the least you should do is put a monitor on your CouchDB server's _stats to keep an eye on how things develop.
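As a starting point, a monitor only needs to pull a few numbers out of the _stats response. The sketch below (Python) assumes the layout where each group contains metrics and each metric carries a `current` value; the sample payload is trimmed and its numbers are invented, and `current_values` is a helper written for this example.

```python
import json


def current_values(stats_json, wanted):
    """Extract the 'current' value for each (group, metric) pair, or None if absent."""
    stats = json.loads(stats_json)
    result = {}
    for group, metric in wanted:
        entry = stats.get(group, {}).get(metric, {})
        result["{}.{}".format(group, metric)] = entry.get("current")
    return result


# Trimmed, made-up sample of a _stats response
SAMPLE = """
{
  "couchdb": {"open_databases": {"description": "number of open databases", "current": 12}},
  "httpd": {"requests": {"description": "number of HTTP requests", "current": 5000}}
}
"""

values = current_values(SAMPLE, [("couchdb", "open_databases"), ("httpd", "requests")])
```

Feed those values into whatever graphing or alerting you already have, and growth trends become visible long before they hurt.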


If you happen to run munin, I suggest you take a look at Gordon Stratton's plugins. They are very useful.

My second suggestion for monitoring would be to not just put a sensor on because with CouchDB, that page almost always works:


Of course this page is a great indicator for whether your CouchDB server is available at all, but it is not a great indicator on the server's performance, e.g. number of writes, write speed, view building, view reads and so on.

Instead, my suggestion is to build a slightly more sophisticated setup, e.g. using a tool like tsung, which actually makes requests, inserts data and does a whole lot more. It lets you aggregate the data, which allows you to see your general throughput and comes in handy for a post-mortem.

If you struggle with tsung, check out my tsung chef cookbook.

Application specific

There really is never enough monitoring. Aside from monitoring general server vitals, I highly recommend keeping a sensor on things that are specific to your application, e.g. the number of documents of a certain type, or in a database, etc. All these little things help in getting the big picture.
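For the "documents of a certain type" case, a view with the built-in `_count` reduce does most of the work. This Python sketch assumes your documents carry a `type` field and that the view is queried with `group=true`; the response below is a made-up example and `counts_by_type` is a hypothetical helper.

```python
import json

# View definition: one row per document, keyed by its (assumed) "type" field.
COUNT_BY_TYPE_VIEW = {
    "map": "function(doc) { if (doc.type) { emit(doc.type, null); } }",
    "reduce": "_count",
}


def counts_by_type(view_response):
    """Turn a grouped view response into a {type: count} dict."""
    rows = json.loads(view_response)["rows"]
    return {row["key"]: row["value"] for row in rows}


# Invented response of the view queried with ?group=true
RESPONSE = '{"rows": [{"key": "comment", "value": 1200}, {"key": "user", "value": 85}]}'
counts = counts_by_type(RESPONSE)
```

Each of those counts can then be graphed like any other sensor.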


BigCouch is probably the biggest news since I last blogged about operating CouchDB.

BigCouch is Cloudant's CouchDB sharding framework. It basically lets you spread out a single database across multiple nodes. I haven't had the pleasure of running BigCouch myself yet, but we've been a customer of Cloudant for a while now.

Expect more on this topic soon, as I plan to get my hands dirty.


I hope I didn't forget anything. If you have feedback or stories to share, please comment. :-)

Since I just re-read my original blog post, I think the rest of it stands.

Looking for Two PHP Developers in NYC

Thursday, August 12. 2010

Hey everyone,

it's my sincere pleasure to announce that we're looking to fill two positions for PHP developers (entry/junior) in NYC.


This is what we look for from candidates:

  • A strong and firm knowledge of PHP5
  • First hand experience with the Zend Framework
  • You've heard of PHPUnit and TDD
  • An idea of what an HTTP request is and the different applications that take part in one
  • You've heard of CouchDB, MongoDB or Redis (generally "NoSQL") before

Last but absolutely not least:

We very, very, very much prefer people who contribute(d) to Open Source.


  • A web start-up.
  • The not-so-standard LAMP stack with: Linux, Nginx, PHP and mostly CouchDB.
  • A lot of time to play with Amazon Web Services.
  • Size matters to you? Databases and indices in the 100 millions.
  • Maybe Solr!
  • Definitely Redis!

... generally, we always try to use the right tool for the job.

If you're interested, please email me your resume:

[email protected]

If you know someone else and we happen to hire this person, my special referral bonus is a couple beers next time we meet. ;-) [Disclaimer: If you're 18/21, or older.]