Skip to content

apt-repair-sources on Ubuntu

When I ran our setup on an instance the other day, I noticed how it failed with a "package not found" (or similar) error. After debugging this a bit, we discovered that Karmic moved from "" to "" (Probably diskspace or something — but who knows? :-)). And because the sources pointed to the former, it broke the bootstrap process on new and existing EC2 instances and Vagrant VMs for us. A truely consistent experience!

Whenever apt-get update is run in a chef-recipe and it exists with a non-zero status, the process is stopped. Of course there are ways to work around it (for example: ignore_failure true), but then again, most of these workarounds are hacks and not suitable for a production environment (IMHO, of course): we often discover new sources from launchpad PPAs and so on and it's paramount to want to know if discovery failed. You cannot assume that all went well

Scalarium fixed their AMI already and updated the sources to point to "old-releases". Running instances are of course still broken.

Enter apt-repair-sources

apt-repair-sources is a small (opinionated) tool written in Ruby.

It offers:

  • --dry-run (-d), which is the default
  • --fix-it-for-me (-f), which attempts to correct all problems

The reason why apt-repair-sources was written in Ruby is, that I wanted a tool to run with only the most basic setup (on Scalarium). Since Ruby comes installed by default, it was my weapon of choice (vs. Python or PHP). Another advantage was that I had an opportunity to check out more Ruby (aside from cooking with chef) and used this project to learn more anything about testing in Ruby (using Test::Unit).

Dry run

A dry run can be used to essentially debug the sources on a system.

Here's the output of a dry-run, and all is well:

[email protected]:~/apt-repair-sources/bin$ ./apt-repair-sources 
There are no errors in /etc/apt/sources.list
There are no errors in /etc/apt/sources.list.d/chris-lea-node.js-lucid.list
There are no errors in /etc/apt/sources.list.d/node.list
There are no errors in /etc/apt/sources.list.d/chris-lea-redis-server.list
There are no errors in /etc/apt/sources.list.d/silverline.list

Here's the output of a system, where sources are currently broken:

[email protected]:~/apt-repair-sources/bin$ ./apt-repair-sources 
There are no errors in /etc/apt/sources.list.d/gearman-developers-ppa-karmic.list


Fix it for me

Fix it for me attempts to correct the sources like this:

  • sources with * are moved to
  • sources with * are moved to
  • sources with are moved to

On top of these things, it will check Launchpad and third-party PPAs as well, if an issue is found, it'll just disable the entry in the sources file (by commenting it out: #).

Future releases will probably re-check commented out entries and also attempt to do some kind of sanity-checking of entries using the release name, etc.. These things are hard though and it might be the wrong approach to be opinionated here because e.g. Lucid packages sometimes also work on Karmic. Disabling these might break other things, etc..

Here's a run:

[email protected]:~/apt-repair-sources/bin$ sudo ./apt-repair-sources -f
[email protected]:~/apt-repair-sources/bin$ echo $?
[email protected]:~/apt-repair-sources/bin$ ./apt-repair-sources
There are no errors in /etc/apt/sources.list
There are no errors in /etc/apt/sources.list.d/gearman-developers-ppa-karmic.list
There are no errors in /etc/apt/sources.list.d/karmic-multiverse.list

Great success!


Both modes usually exit with zero (0), which makes it easy to include them for bootstrap processes, general trouble-shooting or periodic cronjobs etc..

Reason to not exit with 0:

  • attempt to run apt-repair-sources on another distro than Ubuntu
  • is down
  • you run with -d and -f (which of course makes no sense :-))
  • trollop (a rubygem i use for CLI option parsing is not found)



# sudo gem install apt-repair-sources


  • install Ruby Enterprise Edition (steal Karmic here; this should be your default anyway)
  • sudo gem install trollop (don't use what is in apt)
  • clone my repo: git clone git://
  • cd ./apt-repair/sources/bin && ./apt-repair-sources


  • create a gem
  • add support for Debian
  • improve my Ruby


Sure hope it's useful for someone else out there.

The code is on github, and I take pull-requests:

Some thoughts on outtages

Cloud, everybody wants it, some actually use it. So what's my take away from AWS' recent outtage?


So first off, we had two pieces of our infrastructure failing (three if we include our Multi-AV RDS) — both of which involve EBS.

Numero uno

One of those pieces in my immediate reach was a MySQL server, which we use to keep sessions. And to say the least about AWS and in their defense, the instance had run for almost 550 days and had never given us much or any reason to let us down.

In almost two years with AWS I did not magically lose a single instance. I had to reboot two or three once because the host had issues and Amazon sent us an email and asked us to make sure the instance survive a reboot, but that's about it.

Recovering the service, or at least launching a replacement in a different region would have been possible if not by coincidence we would have hit several limits on our AWS account (instances, IPs and EBS volumes), which apparently take multiple days to lift. We contacted AWS immediately and the autoresponder told us to email back if it was urgent, but I guess they had their hands full and apparently we are not high up the chain enough to express how urgent it really was.

I also tried to reach out to some of the AWS evangelists on Twitter which didn't work since they went silent almost all the way through this outtage.

All in all, it took roughly five hours to get the volume back and another 4-5 to recover the database. As far as I can tell, nothing was lost.

And in our defense — we were well aware of this SPOF and had already plans to move on a more redundant approach — I have another blog post in draft about evaluating alternatives (Membase).

Numero due

The second critical piece of infrastructure which failed for us is our hosted BigCouch cluster with Cloudant.

We managed to (manually) failover to their cluster in us-west1 later in the day and brought the service back up. We would have done this earlier, but AWS suggested it would be only a few hours which is why we wanted to avoid the hassle of having to sync up clusters later on.

Sidenote: Cloudant is still (day three of the downtime) trying to get all pieces back online. Kudos to everyone from Cloudant for their hard work and patience with us.

Lesson learned for myself: When things fail which are not within your reach, it's pretty hard to do anything and stay calm. A good thing is to keep everyone busy so our team tried to reach out to all customers (we average about 200,000 users per day) via Twitter and Facebook and it looks like we've tackled that well.

Un trio!

Well, I don't have much to say about Amazon RDS (hosted MySQL in the cloud). Except that it didn't live up to our expectations: it costs a lot of money, but we learned that apparently that doesn't buy us anything.

Looking at the CloudWatch statistics associated with our RDS setup (or EBS in general), I'm rather weary and don't even know if they can be trusted. In the end, I can't really say for how long RDS was down or failed to failover, but it must have been back before I got a handle on our own MySQL server.

The rest?

The rest of our infrastructure seems fine on AWS — most of the servers are stateless (Shared nothing, anyone?) and no EBS is involved.

And with the absense of EBS, there are no issues to begin with. Everything continued to work just as expected.

Design for failure.

This is not really a take away, but a no-brainer. It's also not limited to AWS or cloud computing in general.

You should design for failure when you build any type of applications.

In terms of cloud computing it's really not as bad as ZOMG MY INSTANCE IS GONE!!!11, but it certainly can happen. I've heard people claim that EC2 instances constantly disappear and with my background of almost two years on AWS, I know that's just not true.

Designing for failure should take the following into account:

Services become unavailable

These services don't have to run in the cloud, they can become unavailable on bare metal too. For example, a service could crash, there could be a network partition or maintenance involved. The bottom line: How does your application deal with it?

Services run degraded

For example, higher latency between the services, slower response time from disk — you name it.

The unexpected

Sometimes, everything is green, but your application still chokes.

While testing for the unexpected is of course impossible, validating what comes in is not.


I'm not sure if a fire drill is necessary, but it helps to have a plan on how to troubleshoot issues to be able to recover from an outtage.

In our case, we log almost anything and everything to syslog and utilize loggly to centralize logs. loggly's nifty console provides us with great input about the state of our application at any time.

Add to the centralized logging, that we have a lot of monitoring using ganglia and munin in place. Monitoring is an ongoing project for us since it seems like once you start, you just can't stop. ;-)

And last but not least: We can launch a new configured EC2 instance with a couple mouse clicks using Scalarium.

I value all these things equally — without them, troubleshooting and recovery would be impossible or at least a royal PITA.

Don't buy hype

So to get to the bottom of this (still ongoing) event, I'm not particulary pissed that there was downtime. Of course I can live without it, but what I mean is: Things are bound to fail.

The truth is though, that Amazon's product description is not exactly honest, or at the very least provides everyone with a lot of room for interpretation. You're asking for how much interpretation? I'm sure you could put ten cloud experts into one room and come away with 15 different opinions.

For example the use of attributes to a service such as highly available may cause different expectations for different people.

Let me break it down for you: highly available in AWS speak, means, "it works most of the time".

What about an SLA?

Highly available is not an SLA with those infamous nine Erlang nines.

On paper a multi-az deployment of Amazon RDS gets pretty close to what generally people expect from highly available: MySQL master-master replication, backups — in multiple datacenters. As of today we all know: even these things can fail.

And speaking of SLAs: It looks like none of the services failing are covered by it: AWS' track record remains clean. This is because EBS is not explicitely named in it and by the way neither is RDS. Amazon's SLA — as far as EC2 is concerned — covers some but not all network outtages. Since I was able to access the instance the entire time none of this applies here.


On Twitter people are quick to suggest that everyone who complaines now should have had a setup in multiple availability zones setup.

Here's what I think about it:

  • I find it rather amusing that apparently everyone on bare metal runs in multiple datacenters.

  • When people suggest multi-zone (or multi-az in Amazon-speak), I think they really mean multi-region. Because a zone is effectively us-east-1a, us-east-1b, us-east-1c and us-east-1d. Since all datacenters (= availability zones) in the us-east1 region failed on 2011/04/21, your multi-zone setup would not have covered your butt. Even Amazon's multi-az RDS failed.

  • Little do people know, but the zone identifiers — e.g. us-east-1a, us-east-1b — are tied to customer accounts. So for example, Cloudant's version of us-east-1a may be my us-east-1c, it may or may not be the same. This is why in many cases AWS never calls out explicit zones in outtages. This also makes it somewhat hard to plan ahead in a single region.

  • AWS sells customers on the idea that an actual multi-az setup is plenty. I don't know too many companies who do multi-region (maybe SimpleGeo?). Not even NetFlix does multi-region, but I guess they managed to sail around this disaster because they don't use EBS.

  • In the end it shouldn't be necessary to do a multi-region setup (and deal with its caveats) since according to AWS the different zones inside a region are different physical locations (let's call them datacenters) to begin with. Correct me if I'm wrong, but the description says different physical location, this is not just another rack in the same building or another port on the core switch.


Which brings me to the most important point of my blong entry.

In a nutshell, when you build for the AWS platform, you're building for a blackbox. There are a couple papers and blog posts where people try to reverse engineer the platform and write about its behavior. The problem with these things is that most people are guessing, though often (of course depending on the person writing) it seems to be a very well educated guess.

Roman Stanek blogged about communication between AWS and its customers, so head on over, I pretty much agree with everything he has to say:


So what exactly is my take away? In terms of technical details and as far as redundancy is concerned: not so much.

Whatever you do to run redundant on AWS, applies to setups in your local colocation or POP as well. And in theory, AWS makes it easier though to leverage multiple datacenters (availability zones) and even achieve somewhat of a global footprint by distributing in different regions.

The real question for anyone to ask is, Is AWS fit to host anything which requires permanent storage? (I'm inclined to say no.)

That's all.

Getting the most out of Chef with Scalarium and vagrant

Ever since I started playing around with Unix ~13 years ago, I've been a fan of automating things. What started out as writing little (maybe pointless) shell scripts slowly but surely morphed into infrastructure automation today.

As for my, or maybe anyone's, motivation to do these things, I see three main factors:

  • I'm easily bored — because repeating things is dull.
  • I'm easily distracted (when I'm bored).
  • I'm German: Of course we strive for perfection and excellence. ;-)

Being on Unix (or Linux) it's fairly simple to automate things — add some script-fu to bash or csh (or even better zsh) and off you go wrapping things into a small shell script! Then execute again and again!

Before we decided to moved to AWS (and RightScale) in late 2009 we had half a rack of servers (in a Peer1's POP in NYC) and never did any or much infrastructure automation. We had an image and a set of commands to get a server up and running, but it was far from two mouse-clicks today.

At the time, I had read about cfengine a couple of times, but datacenter-grade infrastructure management along with a rather steep learning wall (at that time anyway) seemed overkill. Add to that, that there is not a lot of time for research Fridays when you work in a small company.

Moving to AWS and RightScale required us to write lots of small shell scripts using bash and Ruby. When we moved from RightScale to Scalarium in late 2010, we went from shell scripts to Chef.

Using Chef meant that we created so-called recipes which are used to bootstrap our servers. Recipes are little Ruby scripts which live in a cookbook — open source projects are sometimes so creative. Before this move I had very little, or next to no, experience with Chef, and Chef being Ruby didn't exactly make me want to try it either.

So what exactly is Chef?

A Chef recipe is a tiny bit of Ruby code — essentially a high(er)-level wrapper around calls such as installing certain packages, creating and parsing configuration files and many other things.

Chef offers a robust abstraction about everything you can do with shell and with a little effort it's also possible to write recipes which run on multiple OS'. Supported are Ubuntu, CentOS, FreeBSD and others. For an intro to Chef see the slides of a talk I gave a couple weeks ago; I briefly blogged about it too.

Our Chef recipes currently include things like installing and configuring PHP (from source and through a launchpad repository), nginx, MySQL, CouchDB, haproxy and many other things. The list was literally growing every day for the first few weeks.

Automating with Chef(-Solo)

In 2010, operations became an even more central part of my life. As I write this blog post (in early January, 2011), we have been running on Amazon AWS — and EC2 in particular — for over a year.

Previously we had used a service called RightScale but in Q3 of 2010, we moved on/away from RightScale and started using chef and a service called Scalarium.

Because Opscode's chef became such a big part of my work life, I gave a talk about chef, and chef-solo in particular, at last December's PHP Usergroup meeting in Berlin. My talk contains experience made and insights into the whole thing and demo'd a couple chef basics — enough to get started.

Here's a link to the slides:

Questions or general feedback — please use the comments (below).

Btw, the slides are using CSSS, a CSS-based slideshow system. I tried it for the first time and found working on my slides to be refreshingly simple compared to the things I had tried before. It's a pretty interesting and project.

P.S. Happy new year!

Operating CouchDB II

A couple months ago, I wrote an article titled Operation CouchDB. I noticed that a lot of people still visit my blog for this particular post, so this is an update to the situation.

And no, you may not copy and paste this or any other of my blog posts unless you ask me. ;-)

Caching revisited

A while back I wrote about how caching is trivial with CouchDB — well sort of.

CouchDB and its ecosystem like to emphasize on how the power of HTTP allows you to leverage countless of known and proven solutions. For caching, this goes only so far.

If you remember the ETag the reality is that while it's easy to grasp what it does, a lot of reverse proxies and caches don't implement this at all, or very well. Take my favorite — Varnish. There is currently very little support for the ETag.

And once you go down this road, it gets messy:

  • need a process to listen on a filtered _changes.
  • the process must be able to PURGE Varnish

... suddenly it's homegrown and not transparent at all.

However the biggest problem I thought about the other day is that if you don't happen to be one of a chosen few to run a CouchApp in production your CouchDB client is not a regular browser. Which means that your CouchDB client library doesn't know about the ETag either. Bummer.

What about those lightweight HEAD ?

HEAD requests to validate (or invalidate) the cache have the following issues:

  • An obvious I/O penalty because CouchDB will consult the BTree each time you HEAD request to check for the ETag.

  • The HEAD request in CouchDB is currently implemented like a GET, but they throw away the body on respone.

While work is being done in COUCHDB-941 to fix these two issues in summary — and this is not just true for CouchDB — the awesomeness of the ETag comes primarily from faking performance (aka fast) by saving bandwidth to transfer data.


Ahhhhh — sorry, that might have been a lot of bitching! :-) Let me get to the point!

IMHO, if my cache has to HEAD-request against CouchDB, each time it is used to make sure the data is not stale, it's becomes pretty pointless to use a cache anyway. Point taken, HEAD-requests will become cheaper, but they have to happen anyway.

Thus only part of the work is actually offloaded from the database server while I'd say the usual approach in caching is to not hit the database (server) at all.

The (current) bottom line is: no silver bullet available.

You have to invalidate the cache from your application or e.g. by using a tool like thinner to follow _changes and purge accordingly. Or, what we're doing: make the application cache-aware and PURGE directly from it.


?include_docs=true works well, up until you a sh!tload of traffic — and then it's a hefty i/o penalty each time your view is read. Of course emitting the entire doc means that you'll need more disk space, but the performance improvement by sacrificing disk space is between 10 and 100x. (Courtesy of the nice guys at cloudant!)

In a cloud computing context like AWS this is indeed a cost factor: think instance size for general i/o performance, the actual storage (in GB) and the throughput which Amazon charges for. In real a datacenter SAS disks are also more expensive than those crappy 7500 rpm SATA(-II) disks.


When I say partitioning, I mean spreading out the various document types by database. So even though the great advantage of a document oriented database is to be able to have a lot of different data right next to each other, you should avoid that.

Because when your data grows, it's pretty darn useful to be able to push a certain document type on another server in order to scale out. I know it's not that hard to do this when they are all in one database, but separating them when everything is on fire is just a whole lot more work, then replicating the database to a dedicated instance and changing the endpoint in your application.

Keeping an eye on things

I did not mention this earlier because it's a no-brainer — sort of. It seems though, that it has to be said.

Capacity planning

... and projections are (still) pretty important with CouchDB (or NoSQL in general). If you don't know by how much or if your data will grow, the least you should do is put a monitor on your CouchDB server's _stats to keep an eye on how things develop.


If you happen to run munin, I suggest you take a look at Gordon Stratton plugins. They are very useful.

My second suggestion for monitoring would be to not just put a sensor on because with CouchDB, that page almost always works:


Of course this page is a great indicator for whether your CouchDB server is available at all, but it is not a great indicator on the server's performance, e.g. number of writes, write speed, view building, view reads and so on.

Instead my suggestion is to build a slightly more sophisticated setup e.g. using a tool like tsung which actually makes requests, inserts data and does a whole more. It would let you aggregate the data and which allows you to see your general throughput and comes in useful with a post-mortem.

If you struggle with tsung, check out my tsung chef cookbook.

Application specific

There really is never enough monitoring. Aside from monitoring general server vitals, I highly recommend keeping a sensor things that are specific to your application. E.g. the number of documents of a certain type, or in a database, etc.. All these little things help getting the big picture.


BigCouch is probably the biggest news since I last blogged about operation CouchDB.

BigCouch is Cloudants CouchDB sharding framework. It basically lets you spread out a single database across multiple nodes. I haven't had the pleasure of running BigCouch myself yet, but we've been a customer of Cloudant for a while now.

Expect more on this topic soon, as I plan to get my hands dirty.


I hope I didn't forget anything. If you have feedback or stories to share, please comment. :-)

Since I just re-read my original blog post, I think the rest of it stands.