Hosted MySQL: Amazon RDS (and backups)

Saturday, March 17. 2012

Among all the different technologies in our stack, we also use MySQL. While we still run MySQL (or Percona-Server) ourselves, we selected a managed solution to power parts of our production infrastructure: a Multi-AZ setup with Amazon's RDS.

AZ is Amazon-speak for "availability zone", essentially a datacenter. RDS stands for: Relational Database Service.

Judging from my experience with our own setups where EBS is in the mix, I have to say that Amazon does an outstanding job of hiding EBS's potential issues from us with RDS. Looking at the price tag of the setup can be intimidating at first, but as far as TCO is concerned, RDS is the complete package: managed from every which way and painless for the most part.

RDS in a nutshell

RDS is pretty awesome — it's basically a highly available MySQL setup with backups and optional goodness like read-slaves. RDS is one of the best services as far as Amazon Webservices are concerned: 90% of what anyone would need from RDS, Amazon allows you to do with a couple clicks. For tuning, it gets a little less shiny and maybe even messy, but even changing parameters is possible.

Another pretty important feature is growing and shrinking RDS. Change your storage and either apply the change right away or wait for your next maintenance window. It should be noted that these changes are rarely instant, even when applied "right away", which doesn't make it any less awesome. Resizing the storage is of course not an instant operation, but it still gives a whole new meaning to the word elastic.

The package

A standard RDS setup gives you a managed database instance with a DNS hostname and database credentials to log in.

Access from instances is granted using database security groups, which work just like the regular security groups (on AWS). In non-AWS-language, this translates to firewall policies.

Pricing

As far as pricing is concerned, AWS is always a little tough to understand: the baseline is 0.88 USD per hour for a multi-az deployment which totals to 633.6 USD a month (large instance class). Since we opted for reservation (a 1,200 USD one time fee for a three (3) year term), we were able to drop that price to 0.56 USD per hour.

Aside from instance costs there are storage costs as well: 0.20 USD per GB (100 GB will cost you and me about 20 USD) and 0.10 USD per million I/O requests (aka the "i/o rate"). On our multi-az RDS we selected 100 GB for total storage initially but since we currently use only about 60 GB, we just end up paying about 12 USD per billing period.

While storage costs are somewhat easy to predict, the "i/o rate" is not. But it's also not a major factor. I'm unable to provide exact numbers currently because we have three RDS servers (1 multi-az deployment, 1 read-slave and another single-az deployment) and the numbers are aggregated on the billing statement but our total is 368,915,692 IOs which runs at roughly 36 USD per month.
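To make the arithmetic above easier to follow, here is a small Python sketch. The rates are the ones quoted in this post; the 30-day month, the function names and the idea of spreading the reservation fee evenly over 36 months are my own assumptions for illustration, not how AWS bills.

```python
# Rates quoted above (USD). A 30-day month is assumed for illustration.
HOURS_PER_MONTH = 24 * 30


def instance_cost(hourly_rate, upfront_fee=0.0, term_months=1):
    """Monthly instance cost, with any one-time fee spread over the term."""
    return hourly_rate * HOURS_PER_MONTH + upfront_fee / term_months


def storage_cost(gigabytes, rate_per_gb=0.20):
    """Monthly storage cost for the given number of GB."""
    return gigabytes * rate_per_gb


def io_cost(requests, rate_per_million=0.10):
    """Cost of the "i/o rate" for a number of I/O requests."""
    return requests / 1_000_000 * rate_per_million


on_demand = instance_cost(0.88)
reserved = instance_cost(0.56, upfront_fee=1200, term_months=36)

print(f"on-demand instance: {on_demand:.2f} USD")
print(f"reserved instance:  {reserved:.2f} USD")
print(f"storage (60 GB):    {storage_cost(60):.2f} USD")
print(f"I/O requests:       {io_cost(368_915_692):.2f} USD")
```

Under these assumptions the on-demand figure matches the 633.6 USD above, and the reserved instance comes to roughly 437 USD per month once the one-time fee is included.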

Vendor lock-in

Anyway — if RDS is awesome, what's the catch? A closed environment.

The primary advantage and disadvantage of RDS is that we don't get access to the server and our own backups.

Of course there are backups and we can use them to restore (or rollback) our RDS setup from within AWS. There are options using the AWS console and I believe using their API as well. But in the end there is no way to export this backup and load it into a non-RDS-setup. And add to that: replicating from or into RDS is not possible either. Which makes migrations and backups an unnecessary pain in the butt.

Aside from not getting access to our own backup, we also don't get access to the actual instances. That makes sense for AWS, but it means we need to rely on metrics like Cloudwatch, which are questionable in my opinion. Questionable because there is no way for the customer to verify the data. AWS uses their own metrics (and scale) and it's often not obvious to me how well Cloudwatch works even on regular AWS EC2 instances.

I've seen instances which became unavailable, but Cloudwatch is reporting A-OK (green). I'm not sure how beta Cloudwatch is, but we decided on Silverline (more on that in another blog post) for instance monitoring. Since Silverline requires a local client, it's unfortunately not an option for RDS.

What's pain?

Aside from the monitoring and backup quirks, one of the real pain points of Amazon RDS is that a lot of the collective MySQL knowledge is not available to us. The knowledge which is manifested in books, blogs, various monitoring solutions and outstanding tools like Percona's backup tools is not available to people who run Amazon RDS setups.

Whatever the (technical) reasons for all this may be, they pain me at least a little and should be discussed when Amazon RDS is evaluated to replace MySQL infrastructure.

MySQL and backups

I mentioned backups! First off, I hope y'all are doing them! ;-)

For many, the preferred way to do MySQL-backups is mysqldump. It's pretty simple to use:

$ mysqldump -u root -pPASS -A > my_backup.sql

This command essentially dumps all (-A) databases from your MySQL server.

mysqldump is an OK solution as long as you keep a few things in mind:

  • Create your backups during a period where there is the least activity — typically the night.
  • There will be a slight bump, but hope that your database is small enough so no one notices.

With a larger database, or most databases with lots of read and write activity, this is almost impossible to achieve. For a snapshot to be consistent, table locks are used and that usually stalls any application which relies on your database.

Of course there is Percona's xtrabackup (which is outstanding), but with RDS, that is not an option either.

Read-slave to the rescue

Typically people will use a read-slave with MySQL to offload read queries from the master. I haven't done any tests on how far these typically lag behind with Amazon RDS, but I am going to use my RDS read-slave for something else: backup.

Adding a read-slave is easy:

  1. Log into the AWS Console and go to RDS.
  2. Select your RDS server and click 'add read replica' above

The operation will likely take a while. The timeframe depends on the type of instance and the amount of storage provisioned. In my case I selected a small instance and it assigned 100 GB of storage to match my RDS. Small instances take notoriously long to boot; the entire operation completed in a little over ten minutes.

On a side-note: read-replicas allow you to scale RDS beyond availability zones (AZ). But you should be aware that traffic across different AZs is billed to the customer.
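Once the read-slave is up, a backup run against it could look like the following Python sketch. The hostname and user are hypothetical placeholders, and the password is expected to live in an option file such as ~/.my.cnf. The --single-transaction flag is my addition: it gives a consistent snapshot without table locks, but only for transactional tables such as InnoDB, so treat that as an assumption about your schema.

```python
import subprocess


def build_dump_command(host, user):
    """Build a mysqldump call that targets the read-slave, not the master."""
    return [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--single-transaction",  # consistent InnoDB snapshot without table locks
        "--all-databases",       # same as -A in the example further up
    ]


def run_backup(host, user, outfile):
    """Write the dump to outfile; raises CalledProcessError if mysqldump fails."""
    with open(outfile, "wb") as handle:
        subprocess.run(build_dump_command(host, user), stdout=handle, check=True)


# Hypothetical endpoint of the read-slave; replace it with your own.
REPLICA_HOST = "my-read-replica.example.rds.amazonaws.com"
print(" ".join(build_dump_command(REPLICA_HOST, "backup")))
```

Calling run_backup(REPLICA_HOST, "backup", "my_backup.sql") from a cronjob would then keep the dump load off the master entirely.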

Costs

A small instance costs roughly 76 USD/month (excluding storage, I/O rate and bandwidth), which by itself is not bad for a fully managed server which I basically set up with two or three clicks. Since we plan to do backups on a regular basis, we will buy a coupon to reserve the instance, which cuts down costs tremendously and generally makes the AWS cloud very affordable.

Amazon RDS, quo vadis?

I mentioned a little vendor lock-in with the service and the limited visibility from the outside.

In theory, this should not matter — however there are more than a few issues you should be aware of. I don't want to mention them to stomp on Amazon — RDS is still in beta after all. But you should be aware of them to get a complete picture.

Pretty questionable is the way some of these issues are handled: not at all or in private messages. AWS is not always at fault here since I imagine pretty often the customer forgets to update the ticket when the issue is only temporary because their focus shifts to other areas.

But one of the core problems with customer service all over AWS is that customers have to resort to posting on a forum with no guaranteed response or have to buy a support contract which includes answers like "we fixed it". The first response is usually that more details are needed (Maybe customer accounts on the forum are not linked to AWS accounts on the inside?) and off it goes into private mode.

My wish is that these situations across all AWS services are handled more transparently in the future, so people see the development and evolution of the service, which means that a trustworthy platform is being built.

Fin

I've been thinking about my final statement for a while. If anything right now, I would be more in favour of Amazon RDS.

Amazon RDS is an extremely interesting product — the beta-tag is even more impressive. It'll be interesting to see what it will offer once Amazon pronounces it stable.

As for the future of our RDS-setups: they are not gonna go away soon. One of our objectives for 2012 is stabilizing across all products and infrastructure underneath. I think this will include some sort of integration with our external monitoring system to at least try to make sense of things like Cloudwatch and to be able to correlate with other events from all over production.

apt-repair-sources on Ubuntu

Wednesday, November 23. 2011

When I ran our setup on an instance the other day, I noticed how it failed with a "package not found" (or similar) error. After debugging this a bit, we discovered that Karmic moved from "archive.ubuntu.com" to "old-releases.ubuntu.com" (probably diskspace or something, but who knows? :-)). And because the sources pointed to the former, it broke the bootstrap process on new and existing EC2 instances and Vagrant VMs for us. A truly consistent experience!

Whenever apt-get update is run in a chef-recipe and it exits with a non-zero status, the process is stopped. Of course there are ways to work around it (for example: ignore_failure true), but then again, most of these workarounds are hacks and not suitable for a production environment (IMHO, of course): we often discover new sources from launchpad PPAs and so on and it's paramount to know if discovery failed. You cannot assume that all went well.

Scalarium fixed their AMI already and updated the sources to point to "old-releases". Running instances are of course still broken.

Enter apt-repair-sources

apt-repair-sources is a small (opinionated) tool written in Ruby.

It offers:

  • --dry-run (-d), which is the default
  • --fix-it-for-me (-f), which attempts to correct all problems

The reason why apt-repair-sources was written in Ruby is that I wanted a tool to run with only the most basic setup (on Scalarium). Since Ruby comes installed by default, it was my weapon of choice (vs. Python or PHP). Another advantage was that I had an opportunity to check out more Ruby (aside from cooking with chef) and used this project to learn more about testing in Ruby (using Test::Unit).

Dry run

A dry run can be used to essentially debug the sources on a system.

Here's the output of a dry-run, and all is well:

till@dev:~/apt-repair-sources/bin$ ./apt-repair-sources 
There are no errors in /etc/apt/sources.list
There are no errors in /etc/apt/sources.list.d/chris-lea-node.js-lucid.list
There are no errors in /etc/apt/sources.list.d/node.list
There are no errors in /etc/apt/sources.list.d/chris-lea-redis-server.list
There are no errors in /etc/apt/sources.list.d/silverline.list

Here's the output of a system, where sources are currently broken:

tillklampaeckel@ulic:~/apt-repair-sources/bin$ ./apt-repair-sources 
/etc/apt/sources.list: http://us-east-1.ec2.archive.ubuntu.com/ubuntu/dists/karmic/main/binary-amd64/Packages.gz
/etc/apt/sources.list: http://us-east-1.ec2.archive.ubuntu.com/ubuntu/dists/karmic/main/source/Sources.gz
/etc/apt/sources.list: http://us-east-1.ec2.archive.ubuntu.com/ubuntu/dists/karmic-updates/main/binary-amd64/Packages.gz
/etc/apt/sources.list: http://us-east-1.ec2.archive.ubuntu.com/ubuntu/dists/karmic-updates/main/source/Sources.gz
/etc/apt/sources.list: http://security.ubuntu.com/ubuntu/dists/karmic-security/main/binary-amd64/Packages.gz
/etc/apt/sources.list: http://security.ubuntu.com/ubuntu/dists/karmic-security/main/source/Sources.gz
There are no errors in /etc/apt/sources.list.d/gearman-developers-ppa-karmic.list
/etc/apt/sources.list.d/karmic-multiverse.list: http://archive.ubuntu.com/ubuntu/dists/karmic/multiverse/binary-amd64/Packages.gz
/etc/apt/sources.list.d/karmic-multiverse.list: http://archive.ubuntu.com/ubuntu/dists/karmic/multiverse/source/Sources.gz
/etc/apt/sources.list.d/karmic-multiverse.list: http://archive.ubuntu.com/ubuntu/dists/karmic-updates/multiverse/binary-amd64/Packages.gz
/etc/apt/sources.list.d/karmic-multiverse.list: http://archive.ubuntu.com/ubuntu/dists/karmic-updates/multiverse/source/Sources.gz
/etc/apt/sources.list.d/karmic-multiverse.list: http://security.ubuntu.com/ubuntu/dists/karmic-security/multiverse/binary-amd64/Packages.gz
/etc/apt/sources.list.d/karmic-multiverse.list: http://security.ubuntu.com/ubuntu/dists/karmic-security/multiverse/source/Sources.gz

Problem?
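For anyone wondering where the URLs in that output come from: they follow the usual layout of an apt repository. The Python sketch below derives them from a sources.list line. It is my reading of the output above, not the tool's actual Ruby code, and the function name is made up; lines with bracketed options are not handled.

```python
def index_urls(source_line, arch="amd64"):
    """Return the index URLs a 'deb' or 'deb-src' line points at."""
    parts = source_line.split()
    if len(parts) < 4 or parts[0] not in ("deb", "deb-src"):
        return []  # comments, blank lines and anything unexpected
    kind, suite, components = parts[0], parts[2], parts[3:]
    uri = parts[1].rstrip("/")
    if kind == "deb":
        leaf = f"binary-{arch}/Packages.gz"
    else:
        leaf = "source/Sources.gz"
    return [f"{uri}/dists/{suite}/{component}/{leaf}" for component in components]


line = "deb http://archive.ubuntu.com/ubuntu/ karmic multiverse"
for url in index_urls(line):
    print(url)
```

A dry run then only needs to request each of these URLs and report the ones that do not resolve; the actual HTTP check is left out here.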

Fix it for me

Fix it for me attempts to correct the sources like this:

  • sources with *.releases.ubuntu.com are moved to archive.ubuntu.com
  • sources with *.archive.ubuntu.com are moved to old-releases.ubuntu.com
  • sources with security.ubuntu.com are moved to old-releases.ubuntu.com

On top of these things, it will check Launchpad and third-party PPAs as well. If an issue is found, it'll just disable the entry in the sources file (by commenting it out: #).
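To illustrate the idea, here is a minimal Python sketch of the archive and security rules plus the commenting-out step. It is not the Ruby code from the tool; the function names are hypothetical and the first rule from the list is left out.

```python
from urllib.parse import urlsplit, urlunsplit

OLD_RELEASES = "old-releases.ubuntu.com"


def rewrite_host(uri):
    """Point archive and security mirrors at old-releases; leave the rest alone."""
    parts = urlsplit(uri)
    host = parts.hostname or ""
    moved = (
        host == "security.ubuntu.com"
        or host == "archive.ubuntu.com"
        or host.endswith(".archive.ubuntu.com")
    )
    if moved:
        return urlunsplit(parts._replace(netloc=OLD_RELEASES))
    return uri


def disable_entry(line):
    """Comment out a sources entry that could not be repaired."""
    return "#" + line


print(rewrite_host("http://us-east-1.ec2.archive.ubuntu.com/ubuntu/"))
print(disable_entry("deb http://ppa.launchpad.net/example/ppa/ubuntu karmic main"))
```

The PPA URL in the last line is a made-up example; anything that is not an Ubuntu archive or security mirror passes through rewrite_host unchanged.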

Future releases will probably re-check commented out entries and also attempt to do some kind of sanity-checking of entries using the release name, etc. These things are hard though and it might be the wrong approach to be opinionated here because e.g. Lucid packages sometimes also work on Karmic. Disabling these might break other things, etc.

Here's a run:

tillklampaeckel@ulic:~/apt-repair-sources/bin$ sudo ./apt-repair-sources -f
tillklampaeckel@ulic:~/apt-repair-sources/bin$ echo $?
0
tillklampaeckel@ulic:~/apt-repair-sources/bin$ ./apt-repair-sources
There are no errors in /etc/apt/sources.list
There are no errors in /etc/apt/sources.list.d/gearman-developers-ppa-karmic.list
There are no errors in /etc/apt/sources.list.d/karmic-multiverse.list

Great success!

Automation

Both modes usually exit with zero (0), which makes it easy to include them in bootstrap processes, general trouble-shooting or periodic cronjobs, etc.

Reasons to not exit with 0:

  • attempt to run apt-repair-sources on another distro than Ubuntu
  • old-releases.ubuntu.com is down
  • you run with -d and -f (which of course makes no sense :-))
  • trollop (a rubygem I use for CLI option parsing) is not found
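Because the exit status is meaningful, a bootstrap script can simply stop at the first step that fails. The Python sketch below shows one way to do that; the step list and function names are hypothetical, and only the --fix-it-for-me flag is taken from the tool itself.

```python
import subprocess
import sys


def run_step(argv):
    """Run one bootstrap step; return True if it exited with status 0."""
    try:
        result = subprocess.run(argv)
    except FileNotFoundError:
        print(f"{argv[0]} not found", file=sys.stderr)
        return False
    if result.returncode != 0:
        print(f"{argv[0]} exited with {result.returncode}", file=sys.stderr)
    return result.returncode == 0


def bootstrap(steps):
    """Run the steps in order and stop at the first failure."""
    for argv in steps:
        if not run_step(argv):
            return False
    return True


# Hypothetical sequence: repair the sources first, then update.
STEPS = [
    ["apt-repair-sources", "--fix-it-for-me"],
    ["apt-get", "update"],
]
```

Calling bootstrap(STEPS) as root would run both commands and return False as soon as one of them exits with a non-zero status.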

Setup

Gems!

# sudo gem install apt-repair-sources

Manually

  • install Ruby Enterprise Edition (steal Karmic here; this should be your default anyway)
  • sudo gem install trollop (don't use what is in apt)
  • clone my repo: git clone git://github.com/lagged/apt-repair-sources.git
  • cd ./apt-repair-sources/bin && ./apt-repair-sources

Todo

  • create a gem
  • add support for Debian
  • improve my Ruby

Fin

Sure hope it's useful for someone else out there.

The code is on github, and I take pull-requests: https://github.com/lagged/apt-repair-sources

Some thoughts on outages

Saturday, April 23. 2011

Cloud, everybody wants it, some actually use it. So what's my take away from AWS' recent outage?

Background

So first off, we had two pieces of our infrastructure failing (three if we include our Multi-AZ RDS), both of which involve EBS.

Numero uno

One of those pieces in my immediate reach was a MySQL server, which we use to keep sessions. And to say the least about AWS and in their defense, the instance had run for almost 550 days and had never let us down.

In almost two years with AWS I did not magically lose a single instance. I had to reboot two or three once because the host had issues and Amazon sent us an email and asked us to make sure the instances survive a reboot, but that's about it.

Recovering the service, or at least launching a replacement in a different region, would have been possible had we not, by coincidence, hit several limits on our AWS account (instances, IPs and EBS volumes), which apparently take multiple days to lift. We contacted AWS immediately and the autoresponder told us to email back if it was urgent, but I guess they had their hands full and apparently we are not high enough up the chain to express how urgent it really was.

I also tried to reach out to some of the AWS evangelists on Twitter, which didn't work since they went silent almost all the way through this outage.

All in all, it took roughly five hours to get the volume back and another 4-5 to recover the database. As far as I can tell, nothing was lost.

And in our defense: we were well aware of this SPOF and already had plans to move to a more redundant approach. I have another blog post in draft about evaluating alternatives (Membase).

Numero due

The second critical piece of infrastructure which failed for us is our hosted BigCouch cluster with Cloudant.

We managed to (manually) fail over to their cluster in us-west-1 later in the day and brought the service back up. We would have done this earlier, but AWS suggested it would be only a few hours, which is why we wanted to avoid the hassle of having to sync up clusters later on.

Sidenote: Cloudant is still (day three of the downtime) trying to get all pieces back online. Kudos to everyone from Cloudant for their hard work and patience with us.

Lesson learned for myself: When things fail which are not within your reach, it's pretty hard to do anything and stay calm. A good thing is to keep everyone busy so our team tried to reach out to all customers (we average about 200,000 users per day) via Twitter and Facebook and it looks like we've tackled that well.

Un trio!

Well, I don't have much to say about Amazon RDS (hosted MySQL in the cloud). Except that it didn't live up to our expectations: it costs a lot of money, but we learned that apparently that doesn't buy us anything.

Looking at the CloudWatch statistics associated with our RDS setup (or EBS in general), I'm rather wary and don't even know if they can be trusted. In the end, I can't really say for how long RDS was down or failed to fail over, but it must have been back before I got a handle on our own MySQL server.

The rest?

The rest of our infrastructure seems fine on AWS — most of the servers are stateless (Shared nothing, anyone?) and no EBS is involved.

And with the absence of EBS, there are no issues to begin with. Everything continued to work just as expected.

Design for failure.

This is not really a take away, but a no-brainer. It's also not limited to AWS or cloud computing in general.

You should design for failure when you build any type of application.

In terms of cloud computing it's really not as bad as ZOMG MY INSTANCE IS GONE!!!11, but it certainly can happen. I've heard people claim that EC2 instances constantly disappear and with my background of almost two years on AWS, I know that's just not true.

Designing for failure should take the following into account:

Services become unavailable

These services don't have to run in the cloud, they can become unavailable on bare metal too. For example, a service could crash, there could be a network partition or maintenance involved. The bottom line: How does your application deal with it?

Services run degraded

For example, higher latency between the services, slower response time from disk — you name it.
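One common way to deal with both cases is to retry with a growing delay and fall back to a degraded answer when the service stays away. The Python sketch below is a generic illustration with made-up names, not code from our application.

```python
import time


def call_with_retries(func, attempts=3, base_delay=0.1, fallback=None,
                      errors=(ConnectionError, TimeoutError)):
    """Call func(); on failure wait, retry, and finally return the fallback."""
    for attempt in range(attempts):
        try:
            return func()
        except errors:
            if attempt < attempts - 1:
                # Exponential backoff: base_delay, then twice that, and so on.
                time.sleep(base_delay * 2 ** attempt)
    return fallback


# Hypothetical session lookup that fails twice before it recovers.
calls = {"n": 0}


def flaky_session_lookup():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("session store unavailable")
    return {"user": "example"}


print(call_with_retries(flaky_session_lookup, base_delay=0.01))
```

Only connection and timeout errors are retried here; anything else is raised right away, since retrying a bug rarely helps.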

The unexpected

Sometimes, everything is green, but your application still chokes.

While testing for the unexpected is of course impossible, validating what comes in is not.

Recovery

I'm not sure if a fire drill is necessary, but it helps to have a plan on how to troubleshoot issues to be able to recover from an outage.

In our case, we log almost anything and everything to syslog and utilize loggly to centralize logs. loggly's nifty console provides us with great input about the state of our application at any time.

Add to the centralized logging, that we have a lot of monitoring using ganglia and munin in place. Monitoring is an ongoing project for us since it seems like once you start, you just can't stop. ;-)

And last but not least: We can launch a new configured EC2 instance with a couple mouse clicks using Scalarium.

I value all these things equally — without them, troubleshooting and recovery would be impossible or at least a royal PITA.

Don't buy hype

So to get to the bottom of this (still ongoing) event, I'm not particularly pissed that there was downtime. Of course I can live without it, but what I mean is: Things are bound to fail.

The truth is though, that Amazon's product description is not exactly honest, or at the very least provides everyone with a lot of room for interpretation. You're asking for how much interpretation? I'm sure you could put ten cloud experts into one room and come away with 15 different opinions.

For example the use of attributes to a service such as highly available may cause different expectations for different people.

Let me break it down for you: highly available in AWS speak, means, "it works most of the time".

What about an SLA?

Highly available is not an SLA with those infamous nine Erlang nines.

On paper a multi-az deployment of Amazon RDS gets pretty close to what generally people expect from highly available: MySQL master-master replication, backups — in multiple datacenters. As of today we all know: even these things can fail.

And speaking of SLAs: It looks like none of the failing services are covered by it, so AWS' track record remains clean. This is because EBS is not explicitly named in it and, by the way, neither is RDS. Amazon's SLA, as far as EC2 is concerned, covers some but not all network outages. Since I was able to access the instance the entire time, none of this applies here.

Multi-Zone

On Twitter people are quick to suggest that everyone who complains now should have had a setup in multiple availability zones.

Here's what I think about it:

  • I find it rather amusing that apparently everyone on bare metal runs in multiple datacenters.

  • When people suggest multi-zone (or multi-az in Amazon-speak), I think they really mean multi-region. Because a zone is effectively us-east-1a, us-east-1b, us-east-1c and us-east-1d. Since all datacenters (= availability zones) in the us-east-1 region failed on 2011/04/21, your multi-zone setup would not have covered your butt. Even Amazon's multi-az RDS failed.

  • Little do people know, but the zone identifiers (e.g. us-east-1a, us-east-1b) are tied to customer accounts. So for example, Cloudant's version of us-east-1a may be my us-east-1c; it may or may not be the same. This is why in many cases AWS never calls out explicit zones in outages. This also makes it somewhat hard to plan ahead in a single region.

  • AWS sells customers on the idea that an actual multi-az setup is plenty. I don't know too many companies who do multi-region (maybe SimpleGeo?). Not even Netflix does multi-region, but I guess they managed to sail around this disaster because they don't use EBS.

  • In the end it shouldn't be necessary to do a multi-region setup (and deal with its caveats) since according to AWS the different zones inside a region are different physical locations (let's call them datacenters) to begin with. Correct me if I'm wrong, but the description says different physical location, this is not just another rack in the same building or another port on the core switch.

Communication

Which brings me to the most important point of my blog entry.

In a nutshell, when you build for the AWS platform, you're building for a blackbox. There are a couple papers and blog posts where people try to reverse engineer the platform and write about its behavior. The problem with these things is that most people are guessing, though often (of course depending on the person writing) it seems to be a very well educated guess.

Roman Stanek blogged about communication between AWS and its customers, so head on over, I pretty much agree with everything he has to say:

Fin

So what exactly is my take away? In terms of technical details and as far as redundancy is concerned: not so much.

Whatever you do to run redundant on AWS, applies to setups in your local colocation or POP as well. And in theory, AWS makes it easier though to leverage multiple datacenters (availability zones) and even achieve somewhat of a global footprint by distributing in different regions.

The real question for anyone to ask is, Is AWS fit to host anything which requires permanent storage? (I'm inclined to say no.)

That's all.

PHP SDK for Amazon Web Services

Wednesday, September 29. 2010

Yesterday, Jeff Barr announced Amazon's own PHP SDK for their web services — own, because AWS hired CloudFusion's lead developer earlier this year (in March) and I guess after a while they decided it was time to incorporate his open source efforts into the company. The full story is on getcloudfusion.com.

So what?

What's more than just pretty interesting about all of this, is that not only is the AWS PHP SDK hosted on Github (bonus points for sure), but since it implements almost the entire API of all infrastructural services and is backed by the API provider, this library currently presents the most feasible way for PHP developers to work with AWS. And to add to that, the library is fully documented as well.

Having worked myself on various small wrappers for the EC2 and SNS web services, I'm really somewhat glad that I can stop working on them now and continue implementing the web services.

PEAR

Amazon's move is also another victory for PEAR (and of course Pirum) because it brings more acceptance to the distribution of PHP libraries using a PEAR Channel.

The SDK's channel is the following: http://pear.amazonwebservices.com/

Setup

pear channel-discover pear.amazonwebservices.com
pear install aws/sdk

The flipside

There are absolutely no unit tests included anywhere. But since I'm assuming that they exist indeed, I hope they will be open sourced before not too long. Or in case they don't (What's up with that?), that pull requests will be accepted so the community will be able to contribute some.

Shopping for a CDN

Saturday, June 5. 2010

In this blog post I'll compare different CDNs with each other, on the list are:

  • Akamai (through MySpace)
  • CacheFly
  • CloudFront
  • EdgeCast (twice, through Speedyrails)
  • LimeLight Networks (through mydeo)
  • … and Amazon S3 — the pseudo CDN

Thanks to SpeedyRails, EasyBib (CacheFly, Cloudfront, S3) and mydeo for helping with these tests.

What's a CDN?

A CDN (Content Delivery Network) is a service usually offered by Tier1's or at least companies that have a so-called global network footprint.

A CDN lets you distribute your assets/content on an array of servers and the nifty technology behind it makes sure that a customer is always transparently routed to a server closer to them, thus making it faster for the client to fetch the assets.

Content, or assets, can be anything such as images, css, JavaScript or media (audio, video). My numbers focus on assets primarily, I haven't run any tests with larger media files.

An example of CDN usage: let's say I go to myspace.com, where all the required assets are distributed using a CDN run by Akamai. When I browse MySpace, the JavaScript files are pulled from a server located in Frankfurt, whereas when I browse MySpace from the U.K., the files are pulled from a server in the U.K.

All of this is — as I said — transparent, which means that I don't really notice a difference when I go to the website. It should be faster though.

Performance

I'll skip over why it makes sense to use a CDN from a pure performance point of view. A much better blog article is available at the Yahoo! developer blog.

When is a CDN necessary?

I wouldn't recommend getting a CDN for a blog — unless you're TechCrunch and live off of it. In my opinion this is a gray area. If you make money and your traffic is not just local (to the location of your server), consider a CDN, it's more affordable than you think.

On monitoring

Pingdom is a nifty distributed monitoring service.

What Pingdom does is the following: it allows you to set up checks (literally within minutes) and then it runs the monitoring from different locations worldwide.

The advantage of multiple locations is that you know if, for example, your website is unavailable for everyone, or if it's a local issue of a backbone provider, etc. Beyond general availability, Pingdom also gathers data on response times (average, fastest and slowest) and lets you filter on all of the above.

The current locations from which your website is monitored include Amsterdam (Netherlands), Atlanta, GA (U.S.), Chicago, IL (U.S.), Copenhagen (Denmark), Dallas, TX (U.S.), Frankfurt (Germany), Herndon, VA (U.S.), Houston, TX (U.S.), London (U.K.), Los Angeles, CA (U.S.), Montreal (Canada), New York, NY (U.S.), Stockholm (Sweden) and Tampa, FL (U.S.). In some locations, Pingdom employs multiple monitors.

The only downside I can see is that Pingdom has no footprint in all of Asia, South America or Africa. So in case your target demographic is from any of those places, I'd advise you to gather your own numbers.

Well, gathering your own research data might be a good idea regardless.
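If you want to gather your own numbers, a few lines of Python are enough to get started. This sketch times the download of a single asset and summarizes the samples the same way the stats in this post are summarized. The function names are mine, and unlike Pingdom it only measures from the one machine it runs on.

```python
import time
import urllib.request


def time_fetch(url, timeout=10):
    """Return the time in milliseconds it takes to download url once."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return (time.perf_counter() - start) * 1000


def summarize(samples_ms):
    """Average, slowest and fastest of a list of response times."""
    return {
        "average": sum(samples_ms) / len(samples_ms),
        "slowest": max(samples_ms),
        "fastest": min(samples_ms),
    }


# Made-up sample values in milliseconds, just to show the summary.
print(summarize([65.0, 289.0, 19.0]))
```

For a real run you would collect something like [time_fetch(url) for _ in range(10)] per CDN, with url pointing at the same file on each of them, and repeat that over several days.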

Numbers

I used a minified jQuery library to compare the results of the various CDN vendors.

Amazon S3

Why do I consider S3 to be a pseudo CDN? Well, for starters, Amazon S3 is not distributed.

By nature, it shouldn't be used as a CDN. The problem is though that many people still do. Take a look at Twitter and think twice why a page takes so long to load (and the avatars are always last). There's your answer.

In order to be fair — Twitter also sometimes switches to Cloudfront (216.137.61.222) (or Akamai (213.248.124.139)?). I haven't really figured out why they don't stick to a real CDN period.

Besides, I think using Cloudfront is still not the best choice, thinking about it, they should of course use Joe Stump's project tweetimag.es (which uses EdgeCast).

Stats porn

Spoiler: 100% uptime on all of them! ;-)

But on to the stats!

Akamai

akamai-7day

  • provider: Akamai
  • 7 day period
  • average response time: 65 ms
  • slowest average response time: 289 ms
  • fastest average response time: 19 ms

Akamai is probably the most well-known CDN. The clear advantage of Akamai over others — they are everywhere. And they charge an arm and a leg for it too. ;-) (No offense meant!)

CacheFly

cachefly-7days

  • provider: CacheFly
  • 7 day period
  • average response time: 132 ms
  • slowest average response time: 1,506 ms
  • fastest average response time: 69 ms

CacheFly is another of the older CDN providers (~11 years). Pretty nice support and lots of custom options available when you email them. On their todo is a transparent proxy (WANT).

CacheFly has never failed me in over four years.

Cloudfront

cloudfront-7day

  • provider: Amazon Cloudfront
  • 7 day period
  • average response time: 276 ms
  • slowest average response time: 1,983 ms
  • fastest average response time: 171 ms

Cloudfront is Amazon's idea of a CDN. It integrates well with Amazon S3. There's no transparent proxy option and it's not as distributed. And remember, it's all eventually consistent.

EdgeCast

EdgeCast offers two options. Small and large files. Small files are a little more expensive but it's generally suggested that they work just as well as large files. The small files option distributes your assets on SSD (Solid State Disk!).

The suggested use case is that large is for video and audio assets.

Regardless of the options, check the graphs and the numbers for some serious head scratching.

Large

edgecast-big-7days

  • provider: EdgeCast (big files)
  • 7 day period
  • average response time: 77 ms
  • slowest average response time: 987 ms
  • fastest average response time: 22 ms

Small

edgecast-small

  • provider: EdgeCast (small files)
  • 7 day period
  • average response time: 91 ms
  • slowest average response time: 1,627 ms
  • fastest average response time: 28 ms

Limelight

limelight-7days

  • provider: Limelight through MyDeo
  • 7 day period
  • average response time: 216 ms
  • slowest average response time: 1,668 ms
  • fastest average response time: 28 ms

And why is Limelight so slow? I don't think I can blame it entirely on Limelight. In contrast to other resellers, such as Speedyrails (which resells EdgeCast), MyDeo gives you a url with mydeo.com. And this domain uses Godaddy's rather crappy DNS service so I'm guessing that part of the poor performance is due to them.

Amazon S3

amazon-s3-7days

ROFLMAO LOL!!!111one

  • provider: Amazon S3
  • 7 day period
  • average response time: 534 ms
  • slowest average response time: 2,323 ms
  • fastest average response time: 331 ms

Quo vadis CDN?

My first advice to all resellers would be to get Pingdom and constantly run monitoring to make sure the system behaves as expected. Or as the product description suggests. :-)

On Pingdom itself — of course there may be issues as well (not that I noticed). But I don't think these are a factor here. I've been running these tests for almost two months now and a different 7 day time frame didn't look too different. No one performed much better or far worse.

Here are the numbers again, side by side:

Provider            Average (ms)  Slowest average (ms)  Fastest average (ms)
Akamai              65            289                   19
CacheFly            132           1,506                 69
Cloudfront          276           1,983                 171
EdgeCast (large)    77            987                   22
EdgeCast (small)    91            1,627                 28
Limelight           216           1,668                 28
Amazon S3           534           2,323                 331
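To put the numbers into perspective, the following Python sketch ranks the providers by average response time and shows each one relative to the fastest. The values are the ones from the per-provider listings above; the variable names are mine.

```python
# (average, slowest average, fastest average) in ms, as listed per provider above.
RESULTS = {
    "Akamai": (65, 289, 19),
    "CacheFly": (132, 1506, 69),
    "Cloudfront": (276, 1983, 171),
    "EdgeCast (large)": (77, 987, 22),
    "EdgeCast (small)": (91, 1627, 28),
    "Limelight": (216, 1668, 28),
    "Amazon S3": (534, 2323, 331),
}

# Sort by the first value of each tuple, the average response time.
ranked = sorted(RESULTS.items(), key=lambda item: item[1][0])
best_name, best_stats = ranked[0]

for name, (average, slowest, fastest) in ranked:
    factor = average / best_stats[0]
    print(f"{name:<18}{average:>5} ms  {factor:.1f}x {best_name}")
```

On average, EdgeCast's large-file option is within about 1.2x of Akamai, while S3 is roughly 8x slower.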

Comment

Akamai is almost in a league of its own. Of all contenders they offer the best CDN hands down. If anyone reselling Akamai at a reasonable price reads this, feel free to leave a comment or email me. Of course I'd be interested.

Still, it's a little surprising that Akamai is not further ahead of EdgeCast.

Cloudfront versus others — from personal testing and also doing the math on S3 (storage, PUT, GET) with the addition of Cloudfront on top of it, I have to say that this is a pretty expensive service and probably only useful in terms of unified billing (one provider to rule them all). If this is not an issue, I suggest you find another.

CacheFly has great support, but lacks features and it's also pretty expensive compared to others.

EdgeCast vs. EdgeCast: I have to contact Speedyrails to find out if they gave me the wrong URLs or why the more expensive option did worse in these tests. That'll be interesting to figure out. Regardless of this bit, the performance is pretty stellar and the closest to Akamai.

I'll revisit Limelight and mydeo later again.

Fin

It's pretty obvious for us that we are switching from CacheFly to another CDN over the summer.

And not just because of the general performance but also because for example EdgeCast (through SpeedyRails) seems to be a lot more cost effective while offering more features and of course the much better performance at the same time.

In case there are questions, I can extract more numbers.