Vagrant: ShellProvisioner vs. Chef

Wednesday, June 20. 2012

In my last blog entry, I demo'd how to get started with Vagrant and the ShellProvisioner.

To further illustrate how amazingly simple it is to get started on some Ruby, I'll convert the shell script from my last blog post to a little recipe for chef. Same objective, we install a PEAR package — but it could be anything really.

Follow me.


This is the shell script from before:


apt-get update

apt-get install -y php5 php5-cli php-pear
hash -r

pear upgrade-all
pear install -f HTTP_Request2


Create a cookbooks directory and create structure for your first cookbook in it:

$ mkdir -p my-cookbooks/first/recipes/

Create a default.rb file with the following content:

# my-cookbooks/first/recipes/default.rb
execute "apt-get update"

packages = ["php5", "php5-cli", "php-pear"]

packages.each do |p|
  package p
end

execute "pear upgrade-all"
execute "pear install -f HTTP_Request2"

The recipe is later referred to as first or first::default (name of the recipe directory, name of the .rb).

It's so simple it hurts. ;)

Step by step

  1. I run apt-get update using Chef's execute resource.
  2. I create an array of the packages and install each one with the package resource. (Arrays are ordered in Ruby; hashes were not until Ruby 1.9. Order is important here.)
  3. I run pear upgrade-all using the execute resource.
  4. I run pear install using the execute resource.
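
To make step 2 concrete, here is a plain-Ruby sketch of what the loop does. The package method below is a hypothetical stand-in that only records names; inside a real recipe, Chef provides the package resource instead.

```ruby
# Plain-Ruby sketch: `package` here is a hypothetical stand-in for Chef's
# package resource; it only records the order in which it is called.
INSTALLED = []

def package(name)
  INSTALLED << name
end

packages = ["php5", "php5-cli", "php-pear"]

# Arrays keep insertion order, so php5 is handled before php-pear.
packages.each do |p|
  package p
end

puts INSTALLED.join(", ")
```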


The Vagrantfile looks slightly different when you provision with chef-solo:

Vagrant::Config.run do |config|

  config.vm.define :web do |web_config|
    web_config.vm.box       = "lucid64"
    web_config.vm.host_name = "web"

    web_config.vm.provision :chef_solo do |chef|
      chef.cookbooks_path = "PATH-TO-YOUR-COOKBOOKS"
      chef.add_recipe "first"
      chef.log_level = :debug
    end
  end
end


The important bit: the path to the location of your cookbooks — could be an /absolute/path or ./../relative/path.

Because a Vagrantfile is essentially Ruby code, anything goes here.
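
For example, the cookbooks path can be computed instead of hard-coded. A minimal sketch, assuming the cookbooks live in a my-cookbooks directory next to the Vagrantfile; the helper name cookbooks_path_for is made up for illustration:

```ruby
# Hypothetical helper: resolve the cookbooks directory relative to the
# given file, so the setup works from any working directory.
def cookbooks_path_for(vagrantfile, dir = "my-cookbooks")
  File.expand_path(dir, File.dirname(vagrantfile))
end

# Inside a Vagrantfile you would pass __FILE__:
#   chef.cookbooks_path = cookbooks_path_for(__FILE__)
puts cookbooks_path_for("/home/me/project/Vagrantfile")
```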

Further reading

Getting started with Chef and Ruby can be intimidating or even frustrating at times. Google "chef cookbooks" and you know what I mean.

These links are what you need:


Save, and enjoy — vagrant up.

Automating with Chef(-Solo)

Thursday, January 6. 2011

In 2010, operations became an even more central part of my life. As I write this blog post (in early January, 2011), we have been running on Amazon AWS — and EC2 in particular — for over a year.

Previously we had used a service called RightScale but in Q3 of 2010, we moved on/away from RightScale and started using chef and a service called Scalarium.

Because Opscode's chef became such a big part of my work life, I gave a talk about chef, and chef-solo in particular, at last December's PHP Usergroup meeting in Berlin. The talk shares the experience I gained and some insights into the whole thing, and demos a couple of chef basics: enough to get started.

Here's a link to the slides:

Questions or general feedback — please use the comments (below).

Btw, the slides are using CSSS, a CSS-based slideshow system. I tried it for the first time and found working on my slides to be refreshingly simple compared to the things I had tried before. It's a pretty interesting project.

P.S. Happy new year!

Operating CouchDB II

Tuesday, November 30. 2010

A couple of months ago, I wrote an article titled Operating CouchDB. I noticed that a lot of people still visit my blog for this particular post, so this is an update to the situation.

And no, you may not copy and paste this or any other of my blog posts unless you ask me. ;-)

Caching revisited

A while back I wrote about how caching is trivial with CouchDB — well sort of.

CouchDB and its ecosystem like to emphasize how the power of HTTP allows you to leverage countless known and proven solutions. For caching, this goes only so far.

If you remember the ETag: the reality is that while it's easy to grasp what it does, a lot of reverse proxies and caches implement it poorly, or not at all. Take my favorite, Varnish: there is currently very little support for the ETag.

And once you go down this road, it gets messy:

  • You need a process that listens on a filtered _changes feed.
  • That process must be able to PURGE Varnish.

... suddenly it's homegrown and not transparent at all.
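
The first half of such a process can be sketched in a few lines of Ruby. This assumes the continuous _changes feed, where CouchDB sends one JSON object per line; the database name and the function name purge_paths are made up for illustration, and the actual PURGE against Varnish is left out.

```ruby
require "json"

# Turn lines from a continuous _changes feed into the URL paths that
# would need to be purged. Blank heartbeat lines and lines without an
# "id" (such as a closing last_seq line) are skipped.
def purge_paths(lines, db)
  lines.map(&:strip)
       .reject(&:empty?)
       .map { |line| JSON.parse(line) }
       .select { |change| change.key?("id") }
       .map { |change| "/#{db}/#{change["id"]}" }
end

feed = [
  '{"seq":1,"id":"doc-a","changes":[{"rev":"1-abc"}]}',
  "",
  '{"seq":2,"id":"doc-b","changes":[{"rev":"2-def"}],"deleted":true}',
  '{"last_seq":2}'
]

puts purge_paths(feed, "mydb")
```

Document ids with special characters would still need URL-encoding before they are used as a path.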

However, the biggest problem, which I thought about the other day, is that unless you happen to be one of the chosen few who run a CouchApp in production, your CouchDB client is not a regular browser. Which means that your CouchDB client library doesn't know about the ETag either. Bummer.

What about those lightweight HEAD requests?

HEAD requests to validate (or invalidate) the cache have the following issues:

  • An obvious I/O penalty, because CouchDB will consult the B-tree each time you send a HEAD request to check for the ETag.

  • The HEAD request in CouchDB is currently implemented like a GET, except that the body is thrown away on response.

Work is being done in COUCHDB-941 to fix these two issues. In summary, though (and this is not just true for CouchDB), the awesomeness of the ETag comes primarily from faking performance (aka fast) by saving the bandwidth needed to transfer data.


Ahhhhh — sorry, that might have been a lot of bitching! :-) Let me get to the point!

IMHO, if my cache has to send a HEAD request to CouchDB each time it is used, just to make sure the data is not stale, it becomes pretty pointless to use a cache anyway. Point taken, HEAD requests will become cheaper, but they have to happen anyway.

Thus only part of the work is actually offloaded from the database server while I'd say the usual approach in caching is to not hit the database (server) at all.
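
The argument can be illustrated with a toy model. The classes below are hypothetical and talk to no real server; FakeCouch just counts how often it is contacted.

```ruby
# Toy model, no network involved: FakeCouch is a hypothetical stand-in
# for the database that counts every request it receives.
class FakeCouch
  attr_reader :hits

  def initialize
    @hits = 0
    @docs = { "doc-a" => { etag: "1-abc", body: "hello" } }
  end

  # Like HEAD: returns only the ETag, but still costs a round trip.
  def head(id)
    @hits += 1
    @docs.fetch(id)[:etag]
  end

  # Like GET: returns the ETag and the body.
  def get(id)
    @hits += 1
    doc = @docs.fetch(id)
    [doc[:etag], doc[:body]]
  end
end

# A cache that revalidates with a HEAD-style call on every read.
class ValidatingCache
  def initialize(db)
    @db = db
    @store = {}
  end

  def read(id)
    etag, body = @store[id]
    return body if etag && etag == @db.head(id)

    @store[id] = @db.get(id)
    @store[id].last
  end
end

db = FakeCouch.new
cache = ValidatingCache.new(db)
3.times { cache.read("doc-a") }
puts db.hits
```

Every read reaches the database once, cached or not; the cache only saves transferring the body.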

The (current) bottom line is: no silver bullet available.

You have to invalidate the cache from your application or e.g. by using a tool like thinner to follow _changes and purge accordingly. Or, what we're doing: make the application cache-aware and PURGE directly from it.
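
A minimal sketch of purging from the application, using only Ruby's standard library. Host, port and path are placeholders, and Varnish only honours PURGE if your VCL is set up to handle it; treat this as an outline, not as our production code.

```ruby
require "net/http"

# Build a request with the non-standard PURGE verb. Net::HTTP has no
# dedicated class for it, so the generic request class is used:
# no request body, but a response body is expected.
def build_purge(path)
  Net::HTTPGenericRequest.new("PURGE", false, true, path)
end

# Send the PURGE to the cache in front of CouchDB (placeholder host/port).
def purge(host, port, path)
  Net::HTTP.start(host, port) do |http|
    http.request(build_purge(path))
  end
end

req = build_purge("/mydb/doc-a")
puts "#{req.method} #{req.path}"
```

In the application, purge would be called right after a successful write, with the path of the document that just changed.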


?include_docs=true works well, up until you get a sh!tload of traffic, and then it's a hefty I/O penalty each time your view is read. The alternative is to emit the entire doc from the view. Of course, emitting the entire doc means that you'll need more disk space, but the performance improvement gained by sacrificing disk space is between 10x and 100x. (Courtesy of the nice guys at Cloudant!)
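
For reference, the two variants look roughly like this. The map functions are JavaScript, stored as strings in a design document; the sketch builds that document in Ruby. The names _design/example, with_doc and without_doc are made up.

```ruby
require "json"

# Hypothetical design document with two views over the same key.
DESIGN_DOC = {
  "_id"   => "_design/example",
  "views" => {
    # Emits the whole document: larger index, no extra lookup on read.
    "with_doc"    => { "map" => "function(doc) { emit(doc._id, doc); }" },
    # Emits no value: smaller index, needs ?include_docs=true on read.
    "without_doc" => { "map" => "function(doc) { emit(doc._id, null); }" }
  }
}.freeze

puts JSON.pretty_generate(DESIGN_DOC)
```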

In a cloud computing context like AWS this is indeed a cost factor: think instance size for general I/O performance, the actual storage (in GB) and the throughput, which Amazon charges for. In a real datacenter, SAS disks are also more expensive than those crappy 7200 rpm SATA(-II) disks.


When I say partitioning, I mean spreading out the various document types by database. So even though the great advantage of a document oriented database is to be able to have a lot of different data right next to each other, you should avoid that.

Because when your data grows, it's pretty darn useful to be able to push a certain document type onto another server in order to scale out. I know it's not that hard to do this when they are all in one database, but separating them when everything is on fire is a whole lot more work than replicating the database to a dedicated instance and changing the endpoint in your application.
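
On the application side, this can be as simple as a lookup table from document type to endpoint. A sketch with made-up hostnames and database names:

```ruby
# Hypothetical mapping: one database per document type. Moving a type to
# a dedicated server then only means changing its entry here.
ENDPOINTS = {
  "user"    => "http://couch-1.example.org:5984/users",
  "comment" => "http://couch-1.example.org:5984/comments",
  "log"     => "http://couch-2.example.org:5984/logs"
}.freeze

# Look up the database URL for a document type; fail loudly on unknown types.
def endpoint_for(type)
  ENDPOINTS.fetch(type) { raise ArgumentError, "no database for type #{type}" }
end

puts endpoint_for("log")
```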

Keeping an eye on things

I did not mention this earlier because it's a no-brainer — sort of. It seems though, that it has to be said.

Capacity planning

... and projections are (still) pretty important with CouchDB (or NoSQL in general). If you don't know whether, or by how much, your data will grow, the least you should do is put a monitor on your CouchDB server's _stats to keep an eye on how things develop.
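
A sketch of what such a monitor could extract. It assumes the response is nested as group, then stat name, then a hash with a "current" value; check the output of your own server's _stats before relying on that shape. The sample payload is invented.

```ruby
require "json"

# Flatten a _stats-style payload into "group.name" => current value.
# The nesting (group -> stat -> {"current" => ...}) is an assumption.
def current_stats(json)
  JSON.parse(json).each_with_object({}) do |(group, stats), out|
    stats.each do |name, values|
      out["#{group}.#{name}"] = values["current"]
    end
  end
end

# Invented sample data, only for illustration.
SAMPLE_STATS = <<~JSON
  {
    "couchdb": { "open_databases": { "current": 12 } },
    "httpd":   { "requests":       { "current": 34567 } }
  }
JSON

current_stats(SAMPLE_STATS).each { |key, value| puts "#{key} #{value}" }
```

Feeding these key/value pairs into whatever graphs you already have makes growth visible over time.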


If you happen to run munin, I suggest you take a look at Gordon Stratton's plugins. They are very useful.

My second suggestion for monitoring would be to not just put a sensor on because with CouchDB, that page almost always works:


Of course this page is a great indicator for whether your CouchDB server is available at all, but it is not a great indicator on the server's performance, e.g. number of writes, write speed, view building, view reads and so on.

Instead, my suggestion is to build a slightly more sophisticated setup, e.g. using a tool like tsung, which actually makes requests, inserts data and does a whole lot more. It lets you aggregate the data, which allows you to see your general throughput and comes in useful in a post-mortem.

If you struggle with tsung, check out my tsung chef cookbook.

Application specific

There really is never enough monitoring. Aside from monitoring general server vitals, I highly recommend keeping a sensor on things that are specific to your application, e.g. the number of documents of a certain type, or in a database, etc. All these little things help to get the big picture.


BigCouch is probably the biggest news since I last blogged about operating CouchDB.

BigCouch is Cloudant's CouchDB sharding framework. It basically lets you spread out a single database across multiple nodes. I haven't had the pleasure of running BigCouch myself yet, but we've been a customer of Cloudant for a while now.

Expect more on this topic soon, as I plan to get my hands dirty.


I hope I didn't forget anything. If you have feedback or stories to share, please comment. :-)

Since I just re-read my original blog post, I think the rest of it stands.