
Some thoughts on outages

Cloud: everybody wants it, some actually use it. So what's my takeaway from AWS' recent outage?


So first off, we had two pieces of our infrastructure failing (three if we include our Multi-AZ RDS), both of which involve EBS.

Numero uno

One of those pieces in my immediate reach was a MySQL server, which we use to keep sessions. In AWS' defense, the instance had run for almost 550 days and had never given us any reason to expect it to let us down.

In almost two years with AWS I did not magically lose a single instance. Once I had to reboot two or three instances because the host had issues; Amazon sent us an email and asked us to make sure the instances would survive a reboot, but that's about it.

Recovering the service, or at least launching a replacement in a different region, would have been possible had we not, by coincidence, hit several limits on our AWS account (instances, IPs and EBS volumes), which apparently take multiple days to lift. We contacted AWS immediately and the autoresponder told us to email back if it was urgent, but I guess they had their hands full and apparently we are not high enough up the chain to express how urgent it really was.

I also tried to reach out to some of the AWS evangelists on Twitter, which didn't work since they went silent almost all the way through this outage.

All in all, it took roughly five hours to get the volume back and another 4-5 to recover the database. As far as I can tell, nothing was lost.

And in our defense: we were well aware of this SPOF and already had plans to move to a more redundant approach. I have another blog post in draft about evaluating alternatives (Membase).

Numero due

The second critical piece of infrastructure which failed for us was our hosted BigCouch cluster with Cloudant.

We managed to (manually) fail over to their cluster in us-west-1 later in the day and brought the service back up. We would have done this earlier, but AWS suggested it would only be a few hours, which is why we wanted to avoid the hassle of having to sync up clusters later on.

Sidenote: Cloudant is still (day three of the downtime) trying to get all pieces back online. Kudos to everyone from Cloudant for their hard work and patience with us.

Lesson learned for myself: When things fail which are not within your reach, it's pretty hard to do anything and stay calm. A good thing is to keep everyone busy so our team tried to reach out to all customers (we average about 200,000 users per day) via Twitter and Facebook and it looks like we've tackled that well.

Un trio!

Well, I don't have much to say about Amazon RDS (hosted MySQL in the cloud). Except that it didn't live up to our expectations: it costs a lot of money, but we learned that apparently that doesn't buy us anything.

Looking at the CloudWatch statistics associated with our RDS setup (or EBS in general), I'm rather wary and don't even know if they can be trusted. In the end, I can't really say for how long RDS was down or failed to fail over, but it must have been back before I got a handle on our own MySQL server.

The rest?

The rest of our infrastructure seems fine on AWS — most of the servers are stateless (Shared nothing, anyone?) and no EBS is involved.

And with the absence of EBS, there are no issues to begin with. Everything continued to work just as expected.

Design for failure.

This is not really a takeaway, but a no-brainer. It's also not limited to AWS or cloud computing in general.

You should design for failure when you build any type of application.

In terms of cloud computing it's really not as bad as ZOMG MY INSTANCE IS GONE!!!11, but it certainly can happen. I've heard people claim that EC2 instances constantly disappear and with my background of almost two years on AWS, I know that's just not true.

Designing for failure should take the following into account:

Services become unavailable

These services don't have to run in the cloud, they can become unavailable on bare metal too. For example, a service could crash, there could be a network partition or maintenance involved. The bottom line: How does your application deal with it?

Services run degraded

For example, higher latency between the services, slower response time from disk — you name it.

The unexpected

Sometimes, everything is green, but your application still chokes.

While testing for the unexpected is of course impossible, validating what comes in is not.
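To make the first two points a little more concrete, here is a minimal Python sketch of a call that tolerates an unavailable or slow dependency by retrying briefly and then degrading. The names `call_with_fallback`, `broken_store` and the anonymous-session default are hypothetical and only serve the illustration; a real setup would also put timeouts on the underlying client.

```python
import time


def call_with_fallback(primary, fallback, retries=2, delay=0.0):
    """Try `primary` up to `retries` times, then use `fallback`.

    Both arguments are zero-argument callables (hypothetical stand-ins
    for e.g. a session store lookup and a degraded default).
    """
    for attempt in range(retries):
        try:
            return primary()
        except (ConnectionError, TimeoutError):
            # The dependency is down or too slow: wait briefly, then retry.
            if attempt < retries - 1:
                time.sleep(delay)
    # Every attempt failed: degrade instead of crashing.
    return fallback()


def broken_store():
    # Simulates a session store that cannot be reached.
    raise ConnectionError("session store unreachable")


# The application keeps answering, just with less data.
session = call_with_fallback(broken_store, lambda: {"anonymous": True})
```

Whether retrying or failing fast is the better choice depends on the service; the point is only that the decision is made on purpose and not left to an unhandled exception.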


I'm not sure if a fire drill is necessary, but it helps to have a plan on how to troubleshoot issues to be able to recover from an outage.

In our case, we log almost anything and everything to syslog and utilize loggly to centralize logs. loggly's nifty console provides us with great input about the state of our application at any time.

On top of the centralized logging, we have a lot of monitoring in place using ganglia and munin. Monitoring is an ongoing project for us since it seems like once you start, you just can't stop. ;-)

And last but not least: We can launch a new configured EC2 instance with a couple mouse clicks using Scalarium.

I value all these things equally — without them, troubleshooting and recovery would be impossible or at least a royal PITA.

Don't buy hype

So to get to the bottom of this (still ongoing) event, I'm not particularly pissed that there was downtime. Of course I can live without it, but what I mean is: Things are bound to fail.

The truth is though, that Amazon's product description is not exactly honest, or at the very least provides everyone with a lot of room for interpretation. You're asking for how much interpretation? I'm sure you could put ten cloud experts into one room and come away with 15 different opinions.

For example the use of attributes to a service such as highly available may cause different expectations for different people.

Let me break it down for you: in AWS-speak, highly available means "it works most of the time".

What about an SLA?

Highly available is not an SLA with those infamous nine Erlang nines.

On paper, a multi-az deployment of Amazon RDS gets pretty close to what people generally expect from highly available: MySQL master-master replication and backups, in multiple datacenters. As of today we all know: even these things can fail.

And speaking of SLAs: it looks like none of the failing services are covered by one, so AWS' track record remains clean. This is because EBS is not explicitly named in it and, by the way, neither is RDS. Amazon's SLA, as far as EC2 is concerned, covers some but not all network outages. Since I was able to access the instance the entire time, none of this applies here.


On Twitter people are quick to suggest that everyone who complains now should have had a setup in multiple availability zones.

Here's what I think about it:

  • I find it rather amusing that apparently everyone on bare metal runs in multiple datacenters.

  • When people suggest multi-zone (or multi-az in Amazon-speak), I think they really mean multi-region. Because a zone is effectively us-east-1a, us-east-1b, us-east-1c or us-east-1d. Since all datacenters (= availability zones) in the us-east-1 region failed on 2011/04/21, your multi-zone setup would not have covered your butt. Even Amazon's multi-az RDS failed.

  • Little do people know, but the zone identifiers (e.g. us-east-1a, us-east-1b) are tied to customer accounts. So for example, Cloudant's version of us-east-1a may be my us-east-1c; it may or may not be the same. This is why in many cases AWS never calls out explicit zones in outages. This also makes it somewhat hard to plan ahead in a single region.

  • AWS sells customers on the idea that an actual multi-az setup is plenty. I don't know too many companies who do multi-region (maybe SimpleGeo?). Not even Netflix does multi-region, but I guess they managed to sail around this disaster because they don't use EBS.

  • In the end it shouldn't be necessary to do a multi-region setup (and deal with its caveats) since according to AWS the different zones inside a region are different physical locations (let's call them datacenters) to begin with. Correct me if I'm wrong, but the description says different physical location, this is not just another rack in the same building or another port on the core switch.


Which brings me to the most important point of my blog entry.

In a nutshell, when you build for the AWS platform, you're building for a blackbox. There are a couple papers and blog posts where people try to reverse engineer the platform and write about its behavior. The problem with these things is that most people are guessing, though often (of course depending on the person writing) it seems to be a very well educated guess.

Roman Stanek blogged about communication between AWS and its customers, so head on over, I pretty much agree with everything he has to say:


So what exactly is my takeaway? In terms of technical details and as far as redundancy is concerned: not so much.

Whatever you do to run redundant on AWS applies to setups in your local colocation or POP as well. In theory, though, AWS makes it easier to leverage multiple datacenters (availability zones) and even achieve somewhat of a global footprint by distributing across different regions.

The real question for anyone to ask is, Is AWS fit to host anything which requires permanent storage? (I'm inclined to say no.)

That's all.

A roundhouse kick, or the state of PHP

Last week the usual round of PEAR-bashing on Twitter took place, then this morning Marco Tabini asked if PHP (core) was running out of scratches to itch. He also suggests he got this idea from Cal Evans' blog post about Drupal forking PHP.


[Not submitting to your linkbait.]

Pecl and PHP

So first off — moving libraries from the core to an external repository was done for various reasons. One of them is to not have to maintain more and more in the core — keep it small and lean. Though small is pretty relative in this respect.

Of course, doing so means that people who do not have root on a server cannot install the module in most cases. But I'm inclined to suggest that when a pecl extension is (really) required, there should be nothing holding you back.

And if there is no way, thanks to PHP's extensibility there's almost always a PHP equivalent to any C extension available.

Drupal and PHP

I know a couple of Drupal people myself and most of them consider themselves to be Drupal developers before PHP developers. Why is that? It's because Drupal found a great way to abstract whatever annoys people about PHP away from its developers, thus enabling them to build websites.

Is this a good or bad thing?

Of course it's a good thing because it makes people productive.

It's also a bad thing, because it seems that some (Drupal) people are rather disconnected from upstream [PHP].

Enabling people

Whatever people think about Drupal or any other framework, keep in mind that apparently it's PHP (and not Ruby, Python or pure C) which is more than a good enough enabler because PHP allows people to build a sophisticated content-management-framework like Drupal on top of it.

Drupal is of course no exception here. Despite e.g. the standstill in Ruby-land, other tools developed in PHP over the years which are a de facto standard: take a look at WordPress or phpBB.

If you'd like to take it down to the framework-level there are projects like Symfony, CodeIgniter, Zend Framework, ezComponents/Zeta and also PEAR.

Fork vs. wat?

I think that forking PHP is a joke and I believe that Cal doesn't know the difference between a fork and a custom package (or a distribution).

A fork usually adds or removes features from the actual code base, but reading Cal's blog post he suggests a custom package. [Woo! Technical details! They get lost along the way!]

The thing is that a lot of people do this already. The people who maintain a PHP package for a certain Linux or Unix distribution (Debian/Ubuntu, Gentoo or FreeBSD) are doing it already. Using these as an example, whatever OS is used, it already runs a customized version of PHP; some distributions customize more than others.

No one objects to the Drupal community suggesting ./configure flags or maintaining packages for the various flavours of Linux and Unix, or even Windows.

I would even go as far as to say that in order to optimize the stack completely, it wouldn't hurt Drupal if its community recommended flags and extensions for people who run Drupal sites.

I doubt, though, that anyone will maintain packages for a couple of distributions in their spare time, and the majority will not benefit from this effort anyway because they don't run Drupal on their own server. But generally this optimization is enterprisey enough and indeed what I call a business opportunity.

Moving Drupal to ...

So what's "..."? Moving it to Python or Ruby, or maybe Scala? Good luck with that.

While the majority of Drupal developers don't consider themselves to be PHP developers, they still live and benefit off the PHP ecosystem. Think libraries used in modules or used for other areas like testing. Good luck porting those.

Then add PHP's vast adoption among webhosts.

Last but not least

Which brings me to what is, in my opinion, the biggest selling point: doing PHP has another slight advantage over Ruby, Python and the other languages, namely that it's installed on over 90% of the shared webhosts out there.

I invite everyone to google php hosting. It's trivial to find a host for as low as a Dollar per month — you just can't beat that.

Dear Cal, if you call this a business opportunity, I wonder why there's no Dollar Ruby hosting yet. Or Java hosting for a Dollar. But maybe someone is just not seeing this great business opportunity? [Note, Sarcasm.]


What really bothers me about flaming PEAR is that the most vocal people in these flamewars never contributed any code. Open source is different from the Monday morning meetings some people are used to and where they talk people against the wall.

In open source land actual code contributions take the lead.

And while a lot of people complain about PEAR in general, here's something else:

  • thriving download stats of packages
  • PEAR package usage in other projects
  • adoption of the PEAR coding standards and conventions
  • PEAR channels thriving
  • overall installer adoption

Despite being called a mess, PEAR is an enabler for many.

Point taken

PEAR being so many things is confusing!

PEAR packages are not as easy to use as some code you copy-pasted off the Zend devzone or phpclasses. While I agree that we should try to make it just as easy, it's just not one of PEAR's goals right now.

Scratching my own itch

Scratching your own itch is what code contributions to PEAR are currently all about. Maybe always have been.

Active package maintainers most often contribute to packages they use themselves and they contribute to PEAR's environment to move development forward in areas where it's beneficial to them. Call that selfish, but the reality is that most of us contributors actually work in this web industry and we know what we want and therefore we do it.

At the same time PEAR has coding standards and conventions which are in place to ensure code is written so that it benefits most people.

Maintainership burden

The flip side of this situation is of course that components none of the maintainers have a use for get neglected, but calling this a PEAR-only problem is really one-sided.

Not even company-driven frameworks like the Zend Framework are immune to this; Zend_GData is/was rather unmaintained for a long time. Or frameworks where the proposal process and architecture are valued above all; I could point out how broken ezcFeed is for me. Or general issues I see in projects where decisions are primarily driven by a single person. Catch my drift?

This is not meant as a pissing contest between frameworks, but I just can't hear it anymore.


Is PHP generally developer-friendly?

At the expense of watering the term developer, I'd say yes.

It is extremely easy to get started — embed the following into a .php file:

<?php echo "Hello World";

There's your PHP. It doesn't require a custom webserver process, root server or anything else. It really doesn't get any easier.

Are PHP frameworks easier than frameworks in [your other favorite language]?

Probably not! Or, hell no!

But that's the barrier to entry for most if not all frameworks on the planet. Some frameworks allow you to write your own blog in 10 steps, but you will soon discover that writing your own blog is not a great indicator for a quality framework.

Indicators are:

  • maintained code
  • coding standards
  • tests

And if I'm allowed a snarky remark — these are areas Drupal is literally just getting around to.


Destruction breeds creation, but I get the impression that all these fights inside the PHP community don't really foster much of it.

Fighting and trolling may be an art for some and I agree they are entertaining at times, but when it becomes the only way people communicate or contribute then let me be clear: it doesn't help.

The PHP community seems to be unaware of how much PHP and its ecosystem are thriving. People often mistake stabilization for decline. There's nothing wrong if we don't crank out five new major versions every year.

  • People in the real world are conservative anyway and adoption is slow. [Not a PHP-only problem either, just ask the Ruby folks.]
  • People in the real world don't mind a more stable PHP environment, at the expense of buzzwords and all that crap.


In hindsight everyone always knew.

I feel like the more vocal people sharing their opinion are pretty disconnected from reality. That is despite them running a magazine or a podcast about the community.

When people resort to flaming others in order to make themselves look smarter or their own project better, then that's just poor judgement on their part. Projects often die off as fast as they came about.


Yelling, sometimes also referred to as shouting, is an old school management technique. Yelling has been around for literally as long as mankind walked on planet earth. And despite a great history of success, yelling is still often unappreciated and a misunderstood art.

key points

Let me summarize the key points — what yelling is all about!

  • Yelling emphasizes one's opinion.
  • Yelling helps to bring across your point.
  • Yelling helps to avoid mis-communication.
  • (Last but not least:) Yelling encourages people on your team to not make mistakes.

Next time, I'll explain other people management methods: name-calling and maybe mobbing.

nginx configuration gotchas

After running away screaming from Zend_XmlRpc, we migrated: our internal webservices are RESTful nowadays, which implies that we make heavy use of HTTP status codes and so on.

On the PHP side of things we implemented almost all of those webservices using the Zend Framework, where some parts are replaced by in-house replacements (mostly stripped-down and optimized equivalents of Zend_Foo) and a couple of nifty PEAR packages.

RESTful — how does it work?

Building a RESTful API means adhering to the HTTP standard. URLs are resources and the appropriate methods (DELETE, GET, POST, PUT) are used on them. Add status codes in the mix and ready you are.

To keep it simple for this blog post the following status codes are more or less relevant for a read-only (GET) API:

  • 200: it's A-OK
  • 400: a bad request, e.g. a parameter missing
  • 401: unauthorized
  • 404: nothing was found

... and this is just the beginning — check out a complete list of HTTP status codes.
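As a rough sketch of how these four codes map onto a read-only endpoint, consider the Python function below. The function name `status_for`, its parameters and the order of the checks are assumptions made for the illustration and are not taken from our webservices.

```python
def status_for(authorized, params, required, result):
    """Pick an HTTP status code for a read-only (GET) request.

    authorized -- whether the caller presented valid credentials
    params     -- dict of request parameters
    required   -- names of parameters the endpoint needs
    result     -- whatever the lookup returned (None or empty if nothing)
    """
    if not authorized:
        return 401  # unauthorized
    if any(name not in params for name in required):
        return 400  # a bad request, e.g. a parameter missing
    if not result:
        return 404  # nothing was found
    return 200  # it's A-OK


print(status_for(True, {"id": "42"}, ["id"], {"name": "foo"}))  # 200
print(status_for(True, {}, ["id"], None))                       # 400
print(status_for(True, {"id": "7"}, ["id"], None))              # 404
print(status_for(False, {"id": "7"}, ["id"], None))             # 401
```

The important part for the rest of this post is the 404 case: an empty result is a perfectly valid API response that still carries an error status code.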


To serve PHP almost all of our application servers are set up like the following:

  1. nginx in the front
  2. php (fpm) processes in the back

Nginx and PHP are glued together using fastcgi (and unix-domain sockets).

For an in-depth example of our setup check out the nginx-app and php-fpm recipes (along with our launchpad repository).


The other day, I noticed that for some reason whenever our API returned an error — e.g. a 404, for an empty result — it would display a standard nginx error page and not the actual response.


Digging around in /etc/nginx/fastcgi_params, I discovered the following:

fastcgi_intercept_errors on;

So what this does is that it intercepts any errors from the PHP backends and attempts to display an nginx error page. All errors may include the various PHP parse errors but apparently also a PHP generated page with a 404 status code.

So for example, the following code served by a PHP backend triggers the nginx page:

header("HTTP/1.1 404 Not Found");

The obvious fix seems simple:

fastcgi_intercept_errors off;

Sidenote: I think a similar issue might be in nginx's proxy_intercept_errors.

For both directives the manual suggests that they will intercept any status code of 400 or higher, i.e. 4xx and 5xx. But that's not all.

Tell me why?!

Reviewing the manual, I noticed that nginx will only act on fastcgi_intercept_errors on; when an error_page is (previously) defined. Checking out the rest of my sites configuration, the following part is to blame:

location / {
    error_page 404 /index.php;

    include /etc/nginx/fastcgi_params;

    fastcgi_pass  phpfpm;
    fastcgi_index index.php;

    fastcgi_param SCRIPT_FILENAME /var/www/current/www/index.php;

    index  index.php index.html index.htm;
}

So indeed the error_page 404 /index.php is what set it all off to begin with. And that's what I ended up removing, though it doesn't hurt to understand the implications of fastcgi_intercept_errors.
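The interplay of the two directives can be written down as a small decision function. This is a simplified Python model of the behavior described above and not nginx's actual code; the function name `nginx_intercepts` and its parameters are made up for the illustration. The threshold of 300 follows the current nginx documentation for fastcgi_intercept_errors, which speaks of codes greater than or equal to 300.

```python
def nginx_intercepts(status, intercept_errors, error_pages):
    """Return True if nginx would replace the backend's response.

    status           -- status code sent by the PHP backend
    intercept_errors -- value of fastcgi_intercept_errors (True = on)
    error_pages      -- set of codes that have an error_page defined
    """
    if not intercept_errors:
        return False
    # Only responses with a code of 300 or above are candidates.
    if status < 300:
        return False
    # Without a matching error_page the response passes through untouched.
    return status in error_pages


# Our broken setup: interception on, error_page defined for 404.
print(nginx_intercepts(404, True, {404}))   # True
# Removing the error_page is enough to get the API response back.
print(nginx_intercepts(404, True, set()))   # False
# So is switching the interception off.
print(nginx_intercepts(404, False, {404}))  # False
```

In other words, either change on its own restores the API's own 404 response, which is why removing the error_page was sufficient.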

I think historically we used the 404 error handler as a cheap excuse for a rewrite rule since we only serve dynamically generated pages (and super-SEO-friendly URLs) to begin with. But that doesn't seem to be necessary — testing will be required.


The moral of the story is: nginx is really never to blame. ;-)

This is precisely what happens when you copy/paste configurations from the Internetz and don't review each and every single line to understand the full scope. In the end this was more or less a picnic on my part but I wanted to share it anyway because it was one of those WTF-moments for me.

Trying out BigCouch with Chef-Solo and Vagrant

So the other day, I wanted to quickly check something in BigCouch and thanks to Vagrant, chef(-solo) and a couple cookbooks — courtesy of Cloudant — this was exceptionally easy.

As a matter of fact, I had BigCouch set up and running within literally minutes.

Here's how.


You'll need git, Ruby, gems and Vagrant (along with Virtualbox) installed. If you need help with those items, I suggest you check out my previous blog post called Getting the most out of Chef with Scalarium and vagrant.

As for the operating system to use, I suggest you get an Ubuntu 10.04 box (aka Lucid).

Vagrant (along with Ruby and Virtualbox) is a one time setup which you can use and abuse for all kinds of things, so don't worry about the extra steps.


Clone the cookbooks in $HOME:

$ git clone

Create a vagrant environment:

$ mkdir ~/bigcouch-test
$ cd ~/bigcouch-test
$ vagrant init

Set up ~/bigcouch-test/Vagrantfile:

Vagrant::Config.run do |config|
  config.vm.box = "base"
  config.vm.box_url = ""

  # Forward a port from the guest to the host, which allows for outside
  # computers to access the VM, whereas host only networking does not.
  # config.vm.forward_port "http", 80, 8080

  config.vm.provisioner = :chef_solo
  config.chef.cookbooks_path = "~/cloudant_cookbooks"
  config.chef.add_recipe "bigcouch::default"
end

Start the vm:

$ vagrant up

Use BigCouch

$ vagrant ssh
$ sudo /etc/init.d/bigcouch start
$ ps aux|grep [b]igcouch

Done. (You should see processes located in /opt/bigcouch.)


That's all. For an added bonus you could open BigCouch's ports on the VM and use it from your host system, because otherwise this is all a matter of localhost. See config.vm.forward_port in your Vagrantfile.