Some thoughts on outages

Saturday, April 23. 2011

Cloud: everybody wants it, some actually use it. So what's my takeaway from AWS' recent outage?


So first off, we had two pieces of our infrastructure failing (three if we include our Multi-AZ RDS), both of which involve EBS.

Numero uno

One of those pieces in my immediate reach was a MySQL server, which we use to keep sessions. To say the least about AWS, and in their defense: the instance had run for almost 550 days and had never given us any reason to doubt it.

In almost two years with AWS I did not magically lose a single instance. Once I had to reboot two or three of them because the host had issues; Amazon sent us an email asking us to make sure the instances would survive a reboot, but that's about it.

Recovering the service, or at least launching a replacement in a different region, would have been possible had we not, by coincidence, hit several limits on our AWS account (instances, IPs and EBS volumes), which apparently take multiple days to lift. We contacted AWS immediately and the autoresponder told us to email back if it was urgent, but I guess they had their hands full, and apparently we are not high enough up the chain to express how urgent it really was.

I also tried to reach out to some of the AWS evangelists on Twitter, which didn't work since they went silent almost all the way through this outage.

All in all, it took roughly five hours to get the volume back and another 4-5 to recover the database. As far as I can tell, nothing was lost.

And in our defense: we were well aware of this SPOF and already had plans to move to a more redundant approach. I have another blog post in draft about evaluating alternatives (Membase).

Numero due

The second critical piece of infrastructure which failed for us is our hosted BigCouch cluster with Cloudant.

We managed to (manually) fail over to their cluster in us-west-1 later in the day and brought the service back up. We would have done this earlier, but AWS suggested it would only be a few hours, which is why we wanted to avoid the hassle of having to sync up clusters later on.

Sidenote: Cloudant is still (day three of the downtime) trying to get all pieces back online. Kudos to everyone from Cloudant for their hard work and patience with us.

Lesson learned for myself: When things fail which are not within your reach, it's pretty hard to do anything and stay calm. A good thing is to keep everyone busy so our team tried to reach out to all customers (we average about 200,000 users per day) via Twitter and Facebook and it looks like we've tackled that well.

Un trio!

Well, I don't have much to say about Amazon RDS (hosted MySQL in the cloud). Except that it didn't live up to our expectations: it costs a lot of money, but we learned that apparently that doesn't buy us anything.

Looking at the CloudWatch statistics associated with our RDS setup (or EBS in general), I'm rather wary and don't even know if they can be trusted. In the end, I can't really say for how long RDS was down or failed to fail over, but it must have been back before I got a handle on our own MySQL server.

The rest?

The rest of our infrastructure seems fine on AWS — most of the servers are stateless (Shared nothing, anyone?) and no EBS is involved.

And with the absence of EBS, there are no issues to begin with. Everything continued to work just as expected.

Design for failure.

This is not really a takeaway, but a no-brainer. It's also not limited to AWS or cloud computing in general.

You should design for failure when you build any type of application.

In terms of cloud computing it's really not as bad as ZOMG MY INSTANCE IS GONE!!!11, but it certainly can happen. I've heard people claim that EC2 instances constantly disappear and with my background of almost two years on AWS, I know that's just not true.

Designing for failure should take the following into account:

Services become unavailable

These services don't have to run in the cloud, they can become unavailable on bare metal too. For example, a service could crash, there could be a network partition or maintenance involved. The bottom line: How does your application deal with it?
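One way to answer that question in code is to wrap the call, retry a few times and then degrade to a fallback. This is only a minimal sketch: callWithFallback() and the two closures are names made up for illustration, and a real setup would likely add a pause between attempts.

```php
<?php
/**
 * Hypothetical helper: run $primary up to $attempts times and
 * return $fallback() if it keeps throwing.
 */
function callWithFallback(callable $primary, callable $fallback, int $attempts = 3)
{
    for ($i = 1; $i <= $attempts; $i++) {
        try {
            return $primary();
        } catch (RuntimeException $e) {
            // Record the failure and try again.
            error_log("attempt {$i} failed: " . $e->getMessage());
        }
    }
    return $fallback();
}

// Usage: the session store is down, so serve an empty session instead.
$session = callWithFallback(
    function () { throw new RuntimeException('session store unreachable'); },
    function () { return array(); }
);
```

Whether an empty session is an acceptable fallback depends on the application; the point is that the decision is made on purpose and not by a fatal error.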

Services run degraded

For example, higher latency between the services, slower response time from disk — you name it.
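A degraded dependency can hurt more than a dead one, because calls hang instead of failing. PHP's default_socket_timeout is 60 seconds unless configured otherwise, so it may help to put an explicit bound on the wait. A small sketch, where the helper name and the two-second budget are my assumptions:

```php
<?php
// Hypothetical helper: build a stream context that gives up after
// $seconds instead of waiting for PHP's default socket timeout.
function httpContextWithTimeout(float $seconds)
{
    return stream_context_create(array(
        'http' => array(
            'timeout' => $seconds, // read timeout in seconds
        ),
    ));
}

$context = httpContextWithTimeout(2.0);
// Pass it as the third argument of the HTTP call, e.g.:
// $response = file_get_contents($url, false, $context);
```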

The unexpected

Sometimes, everything is green, but your application still chokes.

While testing for the unexpected is of course impossible, validating what comes in is not.
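Validating what comes in can be as simple as refusing anything outside the range you expect. A sketch using PHP's filter_var(); the "page" parameter, its limits and the default of 1 are hypothetical:

```php
<?php
// Hypothetical request parameter "page": accept 1..1000, else fall back to 1.
function parsePage(array $input): int
{
    if (!isset($input['page'])) {
        return 1;
    }
    $page = filter_var(
        $input['page'],
        FILTER_VALIDATE_INT,
        array('options' => array('min_range' => 1, 'max_range' => 1000))
    );
    // filter_var() returns false for anything that is not an integer in range.
    return $page === false ? 1 : $page;
}

$page = parsePage(array('page' => '7'));
```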


I'm not sure if a fire drill is necessary, but it helps to have a plan on how to troubleshoot issues to be able to recover from an outage.

In our case, we log almost anything and everything to syslog and use Loggly to centralize logs. Loggly's nifty console provides us with great input about the state of our application at any time.
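For illustration, logging to syslog from PHP needs little more than the built-in openlog(), syslog() and closelog() functions. This is a rough sketch and not our actual setup: the ident 'myapp', the facility and the formatLogLine() helper are placeholders, and how the lines travel from syslog to a central service is outside of it.

```php
<?php
// Hypothetical formatter: one line per event, context encoded as JSON
// so it can be parsed again later.
function formatLogLine(string $event, array $context = array()): string
{
    return $context ? $event . ' ' . json_encode($context) : $event;
}

openlog('myapp', LOG_PID | LOG_ODELAY, LOG_LOCAL0); // 'myapp' is a placeholder ident
syslog(LOG_WARNING, formatLogLine('session store unreachable', array('attempt' => 3)));
closelog();
```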

On top of the centralized logging, we have a lot of monitoring in place using Ganglia and Munin. Monitoring is an ongoing project for us since it seems like once you start, you just can't stop. ;-)

And last but not least: we can launch a newly configured EC2 instance with a couple of mouse clicks using Scalarium.

I value all these things equally — without them, troubleshooting and recovery would be impossible or at least a royal PITA.

Don't buy hype

So to get to the bottom of this (still ongoing) event, I'm not particularly pissed that there was downtime. Of course I can live without it, but what I mean is: Things are bound to fail.

The truth is though, that Amazon's product description is not exactly honest, or at the very least provides everyone with a lot of room for interpretation. You're asking for how much interpretation? I'm sure you could put ten cloud experts into one room and come away with 15 different opinions.

For example, describing a service with an attribute such as "highly available" may set different expectations for different people.

Let me break it down for you: "highly available", in AWS speak, means "it works most of the time".

What about an SLA?

Highly available is not an SLA with those infamous nine Erlang nines.
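To put numbers on the nines: an availability figure translates into allowed downtime with simple arithmetic. A quick sketch; the function name is mine and a year is taken as 365 days.

```php
<?php
// Allowed downtime per year, in seconds, for a given availability in percent.
function downtimeSecondsPerYear(float $availabilityPercent): float
{
    $secondsPerYear = 365 * 24 * 3600; // 31,536,000; leap years ignored
    return $secondsPerYear * (1 - $availabilityPercent / 100);
}

printf("99.9%%: %.2f hours\n", downtimeSecondsPerYear(99.9) / 3600);
printf("nine nines: %.1f ms\n", downtimeSecondsPerYear(99.9999999) * 1000);
```

That works out to roughly 8.76 hours per year for 99.9% and about 31.5 milliseconds per year for nine nines, which shows how far apart "works most of the time" and an actual number can be.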

On paper, a Multi-AZ deployment of Amazon RDS gets pretty close to what people generally expect from highly available: MySQL master-master replication and backups, in multiple datacenters. As of today we all know: even these things can fail.

And speaking of SLAs: it looks like none of the failing services are covered by one, so AWS' track record remains clean. This is because EBS is not explicitly named in it, and by the way neither is RDS. Amazon's SLA, as far as EC2 is concerned, covers some but not all network outages. Since I was able to access the instance the entire time, none of this applies here.


On Twitter people are quick to suggest that everyone who complains now should have had a setup in multiple availability zones.

Here's what I think about it:

  • I find it rather amusing that apparently everyone on bare metal runs in multiple datacenters.

  • When people suggest multi-zone (or multi-az in Amazon-speak), I think they really mean multi-region. Because a zone is effectively us-east-1a, us-east-1b, us-east-1c and us-east-1d. Since all datacenters (= availability zones) in the us-east-1 region failed on 2011/04/21, your multi-zone setup would not have covered your butt. Even Amazon's multi-az RDS failed.

  • Little do people know, but the zone identifiers (e.g. us-east-1a, us-east-1b) are tied to customer accounts. So for example, Cloudant's version of us-east-1a may be my us-east-1c; it may or may not be the same. This is why in many cases AWS never calls out explicit zones in outages. This also makes it somewhat hard to plan ahead in a single region.

  • AWS sells customers on the idea that an actual multi-az setup is plenty. I don't know too many companies who do multi-region (maybe SimpleGeo?). Not even Netflix does multi-region, but I guess they managed to sail around this disaster because they don't use EBS.

  • In the end it shouldn't be necessary to do a multi-region setup (and deal with its caveats) since according to AWS the different zones inside a region are different physical locations (let's call them datacenters) to begin with. Correct me if I'm wrong, but the description says different physical location; that is not just another rack in the same building or another port on the core switch.


Which brings me to the most important point of my blog entry.

In a nutshell, when you build for the AWS platform, you're building for a blackbox. There are a couple papers and blog posts where people try to reverse engineer the platform and write about its behavior. The problem with these things is that most people are guessing, though often (of course depending on the person writing) it seems to be a very well educated guess.

Roman Stanek blogged about communication between AWS and its customers, so head on over, I pretty much agree with everything he has to say:


So what exactly is my takeaway? In terms of technical details and as far as redundancy is concerned: not so much.

Whatever you do to run redundantly on AWS applies to setups in your local colocation or POP as well. And in theory, AWS makes it easier to leverage multiple datacenters (availability zones) and even achieve somewhat of a global footprint by distributing across different regions.

The real question for anyone to ask is, Is AWS fit to host anything which requires permanent storage? (I'm inclined to say no.)

That's all.

Tracking PHP errors

Saturday, November 27. 2010

track_errors provides the means to catch an error message emitted from PHP. It's something I like to use during the development of various applications, or to get a handle on legacy code. Here are a few examples why!

For example

Imagine the following remote HTTP call:

$response = file_get_contents('');

So whenever this call fails, it will return false and also emit an error message:

Warning: file_get_contents(
    failed to open stream: could not connect to host in /example.php on line 2

Some people use @ to suppress this error message — an absolute no-go for reasons such as:

  • it just became impossible to know why the call failed
  • @ at runtime is an expensive operation for the PHP parser

The advanced PHP web kiddo knows to always set display_errors = Off (e.g. in php.ini or through ini_set()) in order to shield the visitors of their application from these nasty messages. And maybe they even know how to log the error (somewhere).

But whenever an error is logged to a log file somewhere, it also means it's buried. Sometimes these error logs are too far away and often they get overlooked. If you happen to centralize and actually analyze your logfiles, I salute you!

So how do you use PHP's very useful and informative error message to debug this runtime error?

track_errors to the rescue.

track_errors is a PHP runtime setting.

To enable it:

; php.ini
track_errors = On

// or at runtime, early in the script
ini_set('track_errors', 1);

And this allows you to do the following:

$response = file_get_contents('');
if ($response === false) {
    throw new RuntimeException(
        "Could not open {$GLOBALS['php_errormsg']}"
    );
}

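The last error message is populated in the variable $php_errormsg, in the scope where the error occurred. $GLOBALS['php_errormsg'] therefore only sees errors that were raised in the global scope.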
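A note for newer PHP versions: track_errors and $php_errormsg were deprecated in PHP 7.2 and removed in PHP 8.0. The built-in error_get_last() gives access to the same message without the ini setting. A minimal sketch of the same idea, assuming PHP 7 or later (error_clear_last() needs it); fetchOrFail() and the path are hypothetical:

```php
<?php
// Hypothetical wrapper: turn a failed read into an exception that
// carries PHP's own error message.
function fetchOrFail(string $source): string
{
    error_clear_last(); // forget errors from earlier calls
    $response = file_get_contents($source);
    if ($response === false) {
        $error  = error_get_last(); // array with type, message, file, line (or null)
        $reason = $error !== null ? $error['message'] : 'unknown error';
        throw new RuntimeException("Could not open {$source}: {$reason}");
    }
    return $response;
}

try {
    fetchOrFail('/nonexistent/example.txt'); // made-up path to force a failure
} catch (RuntimeException $e) {
    $message = $e->getMessage();
}
```

The warning itself is still emitted, so display_errors and logging behave as before.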

You want more?

I also recently used the same technique to implement error checking into a legacy application. I basically did the following:

// footer.php ;-)
if (!empty($GLOBALS['php_errormsg']) && $user->isAdmin()) {
    // Show the last tracked error to admins only.
    echo $GLOBALS['php_errormsg'];
}


As useful as this can be, there are a number of trade-offs. I suggest you use it wisely.

  • $php_errormsg is a global variable [yuck]
  • many extensions provide built-in ways to catch errors
  • ini_set() calls at runtime are expensive
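If the global variable is the main objection, a different technique (not track_errors) is to convert PHP errors into exceptions with set_error_handler() and ErrorException. This is a sketch of that common idiom, assuming PHP 7 or later; the path is made up:

```php
<?php
set_error_handler(function (int $severity, string $message, string $file, int $line) {
    if (!(error_reporting() & $severity)) {
        return false; // not covered by error_reporting: let PHP handle it
    }
    throw new ErrorException($message, 0, $severity, $file, $line);
});

try {
    $response = file_get_contents('/nonexistent/example.txt');
} catch (ErrorException $e) {
    // The warning text is now available without any global state.
    $reason = $e->getMessage();
}

restore_error_handler(); // put the previous handler back
```

The trade-off is that every warning or notice covered by error_reporting now interrupts the code path, which may be too strict for a legacy application.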


That's all, quick and (very) dirty.

Fan Error

Monday, October 19. 2009

A Fan Error in this case is not when your Facebook fan page is down. I received this message after my Lenovo X61s notebook decided to quit and I restarted it. The screen said "Fan Error", and the notebook refused to continue the boot process.

A rescue party

Of course this is the last thing you want on a Sunday evening, but in true GTD fashion, I wanted to fix it right away. Here's how.


In order to not electrocute myself, I removed the battery and unplugged the notebook.

Get in there!

I basically unscrewed every screw there is at the bottom of the notebook, until it would let me remove the upper part of the casing and keyboard.


Then I tried to carefully clean the inside of my notebook of the dust and dirt that had accumulated over the 14 months since I purchased it. I think I had dust (and what not) from North America, Europe and South America in there. It was kinda gross. It really didn't look pretty. And that is despite all efforts to not eat and drink near it.


When I got to the fan, it wouldn't really move. Hence the fan error!

I forced it a little and white dust came out of it. So I decided to take more drastic measures and sucked it clean using my Dyson. In the beginning it wouldn't really move, but it took only a minute to resolve that. (Word of advice: If you are not super careful, the Dyson will try to suck in whatever it gets. So make sure to not vacuum the insides of your notebook. ;-))


Reassembly is pretty simple. The case clicks, and then you put the screws back in. IBM/Lenovo were smart enough to only use screws of the same type. There were ten in total (or maybe nine), and they are all gone. So that must have worked.


Don't try this unless you have to, and unless you know what you are doing. This blog entry comes with no guarantees or extended warranty. Being able to fix little things yourself feels good though.