Installing Varnish on Ubuntu Hardy

Tuesday, September 14. 2010

This is a quick and dirty rundown on how to install Varnish 2.1.x on Ubuntu Hardy (8.04 LTS).

Set up the sources

Add the repository to /etc/apt/sources.list:

deb http://repo.varnish-cache.org/ubuntu/ hardy varnish-2.1

Import the key for the new repository:

gpg --keyserver wwwkeys.eu.pgp.net --recv-keys 60E7C096C4DEFFEB
gpg --armor --export 60E7C096C4DEFFEB | apt-key add -

Installation

Update sources list and install varnish:

apt-get update
apt-get install varnish

Files of importance:

/etc/default/varnish
/etc/varnish/default.vcl
/etc/init.d/varnish

Double-check:

root@server:~# varnishd -V
varnishd (varnish-2.1.2 SVN )
Copyright (c) 2006-2009 Linpro AS / Verdens Gang AS

Further reading

I recommend a mix of the following websites/links:

Fin

That's all!

Shopping for a CDN

Saturday, June 5. 2010

In this blog post I'll compare different CDNs with each other, on the list are:

  • Akamai (through MySpace)
  • CacheFly
  • CloudFront
  • EdgeCast (twice, through Speedyrails)
  • LimeLight Networks (through mydeo)
  • … and Amazon S3 — the pseudo CDN

Thanks to SpeedyRails, EasyBib (CacheFly, Cloudfront, S3) and mydeo for helping with these tests.

What's a CDN?

A CDN (Content Delivery Network) is a service usually offered by Tier 1 providers, or at least by companies that have a so-called global network footprint.

A CDN lets you distribute your assets/content on an array of servers, and the nifty technology behind it makes sure that a customer is always transparently routed to a server close to them, which makes it faster for the client to fetch the assets.

Content, or assets, can be anything such as images, CSS, JavaScript or media (audio, video). My numbers focus primarily on small assets; I haven't run any tests with larger media files.

For example, when I go to myspace.com, all the required assets are distributed through a CDN run by Akamai. When I browse MySpace, the JavaScript files are pulled from a server located in Frankfurt, whereas when I browse MySpace from the U.K., the files are pulled from a server in the U.K.

All of this is — as I said — transparent, which means that I don't really notice a difference when I go to the website. It should be faster though.

Performance

I'll skip over why it makes sense to use a CDN from a pure performance point of view. A much better article on that is available on the Yahoo! developer blog.

When is a CDN necessary?

I wouldn't recommend getting a CDN for a blog, unless you're TechCrunch and live off of it. In my opinion this is a gray area. If you make money and your traffic is not just local (to the location of your server), consider a CDN; it's more affordable than you think.

On monitoring

Pingdom is a nifty distributed monitoring service.

Pingdom lets you set up checks (literally within minutes) and then runs the monitoring from different locations worldwide.

The advantage of multiple locations is that you know whether your website is unavailable for everyone or whether it's a local issue, for example with a backbone provider. Beyond general availability, Pingdom also gathers data on response times (average, fastest and slowest) and lets you filter on all of the above.

The current locations from which your website is monitored include Amsterdam (Netherlands), Atlanta, GA (U.S.), Chicago, IL (U.S.), Copenhagen (Denmark), Dallas, TX (U.S.), Frankfurt (Germany), Herndon, VA (U.S.), Houston, TX (U.S.), London (U.K.), Los Angeles, CA (U.S.), Montreal (Canada), New York, NY (U.S.), Stockholm (Sweden) and Tampa, FL (U.S.). In some locations, Pingdom employs multiple monitors.

The only downside I can see is that Pingdom has no footprint in Asia, South America or Africa. So in case your target demographic is in any of those places, I'd advise you to gather your own numbers.

Well, gathering your own research data might be a good idea regardless.
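
If you want to gather your own numbers, a few lines of Python go a long way. This is only a rough sketch and not how Pingdom does it: the asset URL is a placeholder, it measures from a single location, and it times the full download of one file. It reports the same three figures used in this post (average, slowest, fastest).

```python
import time
import urllib.request


def time_request(url, timeout=10.0):
    """Return the time in milliseconds needed to download url once."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()  # pull the whole body, not just the headers
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms):
    """Reduce a list of timings to average, slowest and fastest."""
    return {
        "average": sum(samples_ms) / len(samples_ms),
        "slowest": max(samples_ms),
        "fastest": min(samples_ms),
    }


if __name__ == "__main__":
    # Hypothetical asset URL; replace it with a file on the CDN under test.
    url = "http://cdn.example.org/jquery.min.js"
    samples = [time_request(url) for _ in range(5)]
    print(summarize(samples))
```

Run it from cron on a few servers in different regions and you get a poor man's version of the graphs in this post.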

Numbers

I used a minified jQuery library to compare the results of the various CDN vendors.

Amazon S3

Why do I consider S3 to be a pseudo CDN? Well, for starters, Amazon S3 is not distributed.

By nature, it shouldn't be used as a CDN. The problem is that many people still do. Take a look at Twitter and think twice why a page takes so long to load (and the avatars are always last). There's your answer.

To be fair, Twitter also sometimes switches to Cloudfront (216.137.61.222) (or Akamai (213.248.124.139)?). I haven't really figured out why they don't stick to a real CDN, period.

Besides, I think Cloudfront is still not the best choice. Thinking about it, they should of course use Joe Stump's project tweetimag.es (which uses EdgeCast).

Stats porn

Spoiler: 100% uptime on all of them! ;-)

But on to the stats!

Akamai

akamai-7day

  • provider: Akamai
  • 7 day period
  • average response time: 65 ms
  • slowest average response time: 289 ms
  • fastest average response time: 19 ms

Akamai is probably the most well-known CDN. The clear advantage of Akamai over others — they are everywhere. And they charge an arm and a leg for it too. ;-) (No offense meant!)

CacheFly

cachefly-7days

  • provider: CacheFly
  • 7 day period
  • average response time: 132 ms
  • slowest average response time: 1,506 ms
  • fastest average response time: 69 ms

CacheFly is another of the older CDN providers (~11 years). Pretty nice support and lots of custom options available when you email them. On their todo list is a transparent proxy (WANT).

CacheFly has never failed me in over four years.

Cloudfront

cloudfront-7day

  • provider: Amazon Cloudfront
  • 7 day period
  • average response time: 276 ms
  • slowest average response time: 1,983 ms
  • fastest average response time: 171 ms

Cloudfront is Amazon's idea of a CDN. It integrates well with Amazon S3. There's no transparent proxy option and it's not as distributed. And remember, it's all eventually consistent.

EdgeCast

EdgeCast offers two options. Small and large files. Small files are a little more expensive but it's generally suggested that they work just as well as large files. The small files option distributes your assets on SSD (Solid State Disk!).

The suggested use case is that large is for video and audio assets.

Regardless of the options, check the graphs and the numbers for some serious head scratching.

Large

edgecast-big-7days

  • provider: EdgeCast (big files)
  • 7 day period
  • average response time: 77 ms
  • slowest average response time: 987 ms
  • fastest average response time: 22 ms

Small

edgecast-small

  • provider: EdgeCast (small files)
  • 7 day period
  • average response time: 91 ms
  • slowest average response time: 1,627 ms
  • fastest average response time: 28 ms

Limelight

limelight-7days

  • provider: Limelight through MyDeo
  • 7 day period
  • average response time: 216 ms
  • slowest average response time: 1,668 ms
  • fastest average response time: 28 ms

And why is Limelight so slow? I don't think I can blame it entirely on Limelight. In contrast to other resellers, such as Speedyrails (which resells EdgeCast), MyDeo gives you a URL on mydeo.com. And this domain uses GoDaddy's rather crappy DNS service, so I'm guessing that part of the poor performance is due to them.

Amazon S3

amazon-s3-7days

ROFLMAO LOL!!!111one

  • provider: Amazon S3
  • 7 day period
  • average response time: 534 ms
  • slowest average response time: 2,323 ms
  • fastest average response time: 331 ms

Quo vadis CDN?

My first piece of advice to all resellers would be to get Pingdom and constantly run monitoring to make sure the system behaves as expected. Or as the product description suggests. :-)

On Pingdom itself — of course there may be issues as well (not that I noticed). But I don't think these are a factor here. I've been running these tests for almost two months now and a different 7 day time frame didn't look too different. No one performed much better or far worse.

Here are the numbers again, side by side:

Provider           Average (ms)   Slowest average (ms)   Fastest average (ms)
Akamai             65             289                    19
CacheFly           132            1,506                  69
Cloudfront         275            1,983                  171
EdgeCast (large)   77             987                    22
EdgeCast (small)   91             1,627                  28
Limelight          216            1,668                  28
Amazon S3          534            2,323                  331

Comment

Akamai is almost in a league of its own. Of all contenders they offer the best CDN hands down. If anyone reselling Akamai at a reasonable price reads this, feel free to leave a comment or email me. Of course I'd be interested.

Still, it's a little surprising that Akamai is not further ahead of EdgeCast.

Cloudfront versus others: from personal testing and also doing the math on S3 (storage, PUT, GET) with the addition of Cloudfront on top of it, I have to say that this is a pretty expensive service and probably only useful in terms of unified billing (one provider to rule them all). If this is not an issue, I suggest you find another.

CacheFly has great support, but lacks features, and it's also pretty expensive compared to others.

EdgeCast vs. EdgeCast: I have to contact SpeedyRails to find out if they gave me the wrong URLs or why the more expensive option did worse in these tests. That'll be interesting to figure out. Regardless of this bit, the performance is pretty stellar and the closest to Akamai.

I'll revisit Limelight and mydeo later again.

Fin

It's pretty obvious for us that we are switching from CacheFly to another CDN over the summer.

And not just because of the general performance but also because for example EdgeCast (through SpeedyRails) seems to be a lot more cost effective while offering more features and of course the much better performance at the same time.

In case there are questions, I can extract more numbers.

Operating CouchDB

Saturday, May 8. 2010

These are some random operational things I learned about CouchDB. While I realize that my primary use-case (a CouchDB install with currently 230+ million documents) may be oversized for many, these are still important things to know and to consider. And I would have loved to know some of these before we grew that large.

I hope these findings are useful for others.

Compaction

CouchDB doesn't take great care of disk space; the assumption is that disk is cheap. To stay on top of it, you should run database and view compaction as often as you can. And the good news is, these operations help you to reclaim a lot of space (e.g. I've seen an uncompacted view of 200 GB trim down to ~30 GB).

Cleanup

In case you changed the actual view code, make sure to run the clean-up command (curl -X POST http://server/db/_view_cleanup) to regain disk space.
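
Compaction and clean-up are plain HTTP POST requests, so they are easy to script. Here's a minimal sketch in Python; the server, database and design document names are made up, and maintenance_request is just a helper I named for this example. The endpoints are CouchDB's _compact (for the database, or for one design document's views when you append its name) and _view_cleanup.

```python
import urllib.request


def maintenance_request(server, db, task, design_doc=None):
    """Build (but do not send) a POST request for a CouchDB maintenance task.

    task is "_compact" or "_view_cleanup"; design_doc, if given, narrows
    "_compact" to the views of that design document.
    """
    url = "{0}/{1}/{2}".format(server.rstrip("/"), db, task)
    if design_doc is not None:
        url += "/" + design_doc
    # Newer CouchDB releases reject these POSTs without a JSON content type.
    return urllib.request.Request(
        url,
        data=b"",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical server, database and design document names.
    server, db = "http://localhost:5984", "helloworld"
    for request in (
        maintenance_request(server, db, "_compact"),           # database
        maintenance_request(server, db, "_compact", "posts"),  # views of _design/posts
        maintenance_request(server, db, "_view_cleanup"),      # stale view files
    ):
        with urllib.request.urlopen(request) as response:
            print(request.full_url, response.status)
```

CouchDB answers right away and does the work in the background, so a quick response does not mean the compaction is finished.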

Performance impact

Database and view compaction (especially on larger sets) will slow down reads and writes considerably. Schedule downtime, or do it in off-peak hours if you think the operation is fast enough.

This is especially hideous when you run in a cloud environment where disk i/o generally sucks (OHAI, EBS!!!).

To stop either of those background tasks, just restart CouchDB.

(Just fyi, the third option is of course to throw resources (hardware) at it.)

Resuming view compaction?

HA, HA! [Note, sarcasm!] View compaction is not resumable, unlike database compaction.

View files

I suggest you split views into several design documents, which has the following benefit.

For each design document, CouchDB will create a .view file (by default these are in var/lib/couchdb/database-name/.database-name_design/). It's just faster to run compact and cleanup operations on multiple (hopefully smaller) files than on one giant file.

In the end, you don't run the operation against the file directly, but against CouchDB. Still, CouchDB will deal with a smaller file, which makes the operation faster and generally shorter. I call this poor man's view partitioning.

Warming the cache

Cache warming is when a cache is populated with items ahead of time, so that the cache and the server are not hit with too much traffic when a server starts up. Here is what you can do with CouchDB in this regard.

The basics are obvious: updates to a CouchDB view are performed when the view is requested. This has certain advantages and works well in most situations. Something I noticed is that especially on smaller VPS servers, where resources tend to be oversold and are scarce in general, generating view updates can slow your application down to a full stop.

As a matter of fact, CouchDB often does not respond during that operation when the disk is saturated (take into account that even a 2 GB database will get hard to work with if you only have 1 GB of RAM for CouchDB, the OS and whatever else is running on the same server).

The options are:

  1. Get more traffic, so views are constantly updated and the updates performed are kept at a minimum.
  2. Make your application query views with ?stale=ok and instead update the views on a set interval, for example via a curl request in a cronjob.

Cache-warming for dummies, the CouchDB way.
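
Here's what the second option could look like as a small Python script instead of a raw curl call. Treat it as a sketch: the database, design document and view names are invented, and view_url and warm are helpers I made up for the example. The script requests a single row, which is enough to make CouchDB bring the index up to date.

```python
import urllib.parse
import urllib.request


def view_url(server, db, design_doc, view, **params):
    """Build the URL of a CouchDB view, with optional query parameters."""
    url = "{0}/{1}/_design/{2}/_view/{3}".format(
        server.rstrip("/"), db, design_doc, view
    )
    if params:
        url += "?" + urllib.parse.urlencode(sorted(params.items()))
    return url


def warm(server, db, design_doc, view):
    """Request a single row so CouchDB brings the view index up to date."""
    url = view_url(server, db, design_doc, view, limit=1)
    with urllib.request.urlopen(url) as response:
        response.read()


if __name__ == "__main__":
    # Hypothetical names; run this from cron, e.g. every five minutes.
    warm("http://localhost:5984", "helloworld", "posts", "by_date")
```

The application itself would then read from view_url(..., stale="ok") and never pay for the index update.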

View data

For various reasons such as space management and performance, it doesn't hurt to put all views on their own dedicated partition.

In order to do this, add the following to your local.ini (in [couchdb]): view_index_dir = /new_device

And assuming your database is called "helloworld" and the view dir is /new_device, your .view-files will end up in /new_device/.helloworld_design.

Overshard

I've blogged on CouchDB and CouchDB-Lounge before. No matter if you use the Lounge or build sharding into your application: consider it. From what I learned, it's better to shard early (= overshard) than when it's too late. The faster your CouchDB grows, the more painful it will be to deal with all the data stuck in it.

My suggestion is that when you rapidly approach 50,000,000 documents and see yourself exceeding (and maybe doubling) this number soon, take a deep breath and think about a sharding strategy.

Oversharding has the advantage that you can, for example, run 10 CouchDB instances on the same server and move each of them (or a couple) to their own dedicated hardware once they exceed the resources of the single server.
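
To make this more concrete, here is a minimal sketch of application-side sharding in Python. It is not how CouchDB-Lounge routes requests; the CRC32-modulo scheme, the function names and the port layout are all assumptions made for the example.

```python
import zlib


def shard_for(doc_id, shard_count):
    """Map a document ID to a shard number in range(shard_count)."""
    # CRC32 is stable across processes and machines, unlike Python's hash().
    return zlib.crc32(doc_id.encode("utf-8")) % shard_count


def shard_url(doc_id, shard_urls):
    """Pick the CouchDB instance responsible for doc_id."""
    return shard_urls[shard_for(doc_id, len(shard_urls))]


if __name__ == "__main__":
    # Hypothetical layout: ten instances on one host, one port each.
    shards = ["http://localhost:{0}/helloworld".format(5984 + n) for n in range(10)]
    print(shard_url("till-laptop_1273064525", shards))
```

The point of oversharding shows in the list of URLs: when a shard moves to its own hardware, only its entry in the list changes. The number of shards stays the same, so no document has to move to a different shard.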

If sharding is not your cup of tea, just outsource to Cloudant — they do a great job.

CouchDB-Lounge

CouchDB-Lounge is Meebo's Python-based sharding framework for CouchDB. The lounge is essentially an nginx webserver and a twistd service which proxies your requests to different shards using a shards.conf. The number of shards and also the level of redundancy are all defined in it.

CouchDB-Lounge is a great, albeit young, project. The current shortcomings IMHO include the general stability of the twistd service and the absence of features such as _bulk_docs, which makes migrating a larger set into CouchDB-Lounge a tedious operation. Nevertheless, this is something to keep an eye on.

Related to CouchDB-Lounge, there's also lode — a JavaScript- and node.js-based rewrite of the Python lounge.

Erlang-Lounge

What I call Erlang-Lounge is Cloudant's internal Erlang-based sharding framework for CouchDB. It's in production at Cloudant and to be released soon. From what I know, Cloudant will probably offer a free open source version and support once they have released it.

Disk, CPU or memory — which is it?

This one is hard to say. But despite how awesome Erlang is, even CouchDB depends on the system's available resources. ;-)

Disk

For starters, disk i/o is almost always the bottleneck. To verify whether this is the bottleneck in your particular case, run and analyze iostat during the operations which appear to be slow in your context. For everyone on AWS, consider a RAID-0 setup; for everyone else, buy faster disks.

CPU

The more CPUs in a server, the more beam processes. CouchDB (or Erlang) seems to take great advantage of this resource. I haven't really figured out a connection between CPU and general performance though, because in my case memory and disk were always the bottleneck.

Memory

... seems to be another underestimated bottleneck. For example, I noticed that replication can slow down to a state where it seems faster to copy-paste documents from one instance to another when CouchDB is unable to cache an entire b-tree in RAM.

We've been testing some things on a nifty high-memory 4XL AWS instance and during a compact operation, almost 90% of my RAM (70 GB) was used by the OS for caching. And don't make my mistake and rely on (h)top to verify this; use cat /proc/meminfo instead.
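
If you'd rather have a number than eyeball the file, here's a small Python sketch that reads /proc/meminfo and prints how much of the RAM goes to the page cache. It assumes a Linux box and only looks at the MemTotal and Cached fields; the function names are mine.

```python
def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into a dict of integer values."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[name.strip()] = int(fields[0])  # most fields are in kB
    return values


def cached_share(values):
    """Fraction of total memory the kernel uses for the page cache."""
    return values["Cached"] / values["MemTotal"]


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:
        info = parse_meminfo(handle.read())
    print("page cache: {0:.0%} of RAM".format(cached_share(info)))
```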

Caching

Caching is trivial with CouchDB.

e-tags

Each document and view responds with an Etag header — here is an example:

curl -I http://foo:bar@till.cloudant.com:5984/foobar/till-laptop_1273064525
HTTP/1.1 200 OK
Server: CouchDB/0.11.0a-1.0.7 (Erlang OTP/R13B)
Etag: "1-92b4825ffe4c61630d3bd9a456c7a9e0"
Date: Wed, 05 May 2010 13:20:12 GMT
Content-Type: text/plain;charset=utf-8
Content-Length: 1771
Cache-Control: must-revalidate

The Etag only changes when the data in the document changes. Hence it's trivial to avoid hitting the database if you don't have to. The request above is a very lightweight HEAD request which only fetches the headers and does not pull the entire document.
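
One way to use the Etag is a conditional GET: send the Etag you already have in an If-None-Match header and the server answers 304 Not Modified without a body if nothing changed. A minimal sketch in Python follows; the function names are mine and where you keep the cached Etag is up to your application.

```python
import urllib.error
import urllib.request


def conditional_request(url, cached_etag=None):
    """Build a GET request that only transfers a body if the Etag changed."""
    request = urllib.request.Request(url)
    if cached_etag is not None:
        request.add_header("If-None-Match", cached_etag)
    return request


def fetch_if_changed(url, cached_etag=None):
    """Return (etag, body); body is None when the cached copy is still valid."""
    try:
        with urllib.request.urlopen(conditional_request(url, cached_etag)) as r:
            return r.headers.get("ETag"), r.read()
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: keep using the cached copy
            return cached_etag, None
        raise
```

Call fetch_if_changed(url) once to fill the cache, then pass the stored Etag on every later call and only re-parse the document when a body comes back.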

_changes

_changes represents a live-update feed of your CouchDB database. It's located at http://server/dbname/_changes.

Whenever a data-changing operation is completed, _changes will reflect that. This makes it easy for a developer to stay on top of changes, for example to invalidate an application cache only when needed (and not, as is usually done, when the cache has expired).
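
As a sketch of that idea in Python: the response of _changes is JSON with a list of results (each carrying the document id) and a last_seq value. The code below assumes the cache is a plain dict keyed by document ID, which is an assumption for the example, and leaves the HTTP request itself out.

```python
import json


def changed_ids(changes_body):
    """Return (document IDs, last_seq) from a _changes response body."""
    feed = json.loads(changes_body)
    ids = [row["id"] for row in feed["results"]]
    return ids, feed["last_seq"]


def invalidate(cache, changes_body):
    """Drop every changed document from cache and return the new last_seq."""
    ids, last_seq = changed_ids(changes_body)
    for doc_id in ids:
        cache.pop(doc_id, None)
    return last_seq
```

Remember the returned last_seq and request http://server/dbname/_changes?since=<last_seq> the next time, so you only ever see what changed since the last poll.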

Logging

Logrotate

First off, a lot of people run CouchDB from source, which means that in 99% of all installs, log rotation is not activated.

To fix this (on Ubuntu/Debian), do the following:

ln -s /usr/local/etc/logrotate.d/couch /etc/logrotate.d/couchdb

Make sure to familiarize yourself a little with logrotate, because depending on the available disk space and how busy your installation is, you should adjust the config a little to not run out of disk space. If CouchDB is unable to log, it will crash.

Loglevel

In most cases it's more than alright to just run with a log level of error.

Add the following to your local.ini (in [log]): level = error

Log directory

Still running out of diskspace? Add the following to your local.ini (in [log]):

file = /path/to/more/diskspace/couch.log

... if you adjusted the above, you will need to correct the config for logrotate.d as well.

No logging?

Last but not least — if no logs are needed, just turn them off completely.

Fin

That's all, kids.

Caching for dummies

Tuesday, April 6. 2010

Caching is one of the things recommended whenever something is slow — "Your [database, website, server]? Well, duh! You need a cache!".

All things aside, it's not always easy to cache stuff. I often find myself in situations where I can't cache at all or where a caching solution is complex as hell to implement. All of a sudden you need to dissect your page and lazy load half of it with Ajax. Bah. And that's just because those users never want to wait for a cache to expire or invalidate. They just expect the most recent ever! :-)

Sometimes however, caching can be so trivial, it even hurts.

Bypass the application server

There are lots of different techniques and strategies to employ when you start to think about caching. The fastest albeit not always available method is to bypass your app server stack completely. And here's how. :-)

An example

My example is a pet project of mine where I display screenshots of different news outlets which are taken every 3 (three) hours: 0:00, 3:00 AM, 6:00 AM, 9:00 AM, 12:00 PM, 3:00 PM, 6:00 PM, 9:00 PM and so on. In between those fixed times, the page basically never changes, so why would I need PHP to render a page if it hasn't changed since the last request?

Correct, I don't! :)

And here's what I did to set up my cache:

  • Apache 1.3.x (!) and mod_php5
  • my homepage: docroot/home.php
  • httpd.conf: DirectoryIndex index.html home.php
  • a cronjob: */10 * * * * fetch -q -o docroot/index.html http://example.org/home.php

In greater detail

Homepage

The home.php does all the PHP funkiness whenever I need a freshly rendered version of my website to track an error, or adjust something.

DirectoryIndex

If I ever delete my cache (index.html), my website will still be available. The DirectoryIndex will use home.php next and the request will be a little slower and also more expensive on my side, but the website will continue to work.

Cronjob

The cronjob will issue an HTTP request (GET) using fetch against my homepage and save the result to my index.html. It's really so simple, it hurts. Currently, this cronjob is executed every 10 minutes so I can fiddle with the design and deploy a change more easily, but I could run that cronjob every hour or every few hours as well.

If you don't have fetch, try the following wget command:

wget -q -O docroot/index.html http://example.org/home.php
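
One thing to keep in mind with a plain download into index.html: if home.php errors out or the download stops halfway, the broken result may end up as your cache. If that bothers you, here is a small Python sketch of the same cronjob which only swaps the file in once the whole page has been fetched. The paths and URL are the examples from above; the function names and the 0644 file mode are my assumptions.

```python
import os
import tempfile
import urllib.request


def write_atomically(path, data):
    """Write data to path so that readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    handle, temp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(handle, "wb") as temp_file:
            temp_file.write(data)
        os.chmod(temp_path, 0o644)   # mkstemp creates the file as 0600
        os.replace(temp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(temp_path)
        raise


def refresh_cache(url, path):
    """Fetch url and replace the cached page only if the fetch succeeded."""
    with urllib.request.urlopen(url) as response:  # raises on HTTP errors
        body = response.read()
    write_atomically(path, body)


if __name__ == "__main__":
    refresh_cache("http://example.org/home.php", "docroot/index.html")
```

If the fetch fails, the old index.html simply stays in place until the next run.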

Fin

That's all, a really simple cache which bypasses your application server. Enjoy!

If you're in for another nifty solution, I suggest you read Brandon Savage's article on caching with the Zend Framework and/or take a look at nginx+Memcached.

Zend Framework: Slow automatic view rendering

Monday, March 29. 2010

So I posted something on Twitter today, which wasn't exactly news to me. I was more or less pointing out the obvious.

From a couple of follow-up tweets, I guess I need to explain more.

The idea

My thesis is that there's a gain in page rendering time when I disable automatic view rendering and use explicit render calls ($this->render('foo');) inside my controllers. And to cut to the chase, there is. On our app, I measured a 12% improvement using Xdebug's profiler — simple before-after-style.

General setup

I've blogged about Zend Framework performance before (1, 2, 3). Our setup is not the average Zend Framework quickstart application. We utilize a custom (much faster) loader (my public open source work in progress), no Zend_Application and explicit (vs. implicit) view rendering. The framework code is at 1.10.2. On the server-side, the application servers are nginx+php(-cgi).

I don't feel like repeating myself, and while a lot of issues were already addressed in new releases of the framework, or are going to be addressed in 2.0, the above links still hold a lot of truth, or at least insight and pointers if you're interested in general PHP performance (in German).

Code

IMHO, it doesn't really matter what the rest of your application looks like. Of course all applications are different and that's why I didn't say, "OMG my page rendered in 100 ms", but instead I said something like, "we got a 10+% boost". The bottom line is that everyone wants to serve fast pages and get the most out of their hardware, but since applications tend to carry different features, there really is no holy grail or number to adhere to.

Proposal

I urge everyone to double-check my claim. After all, it's pretty simple:

  1. Set up Xdebug
  2. Profile the page
  3. Restart PHP app server/processes (in case you use APC and/or php-cgi)
  4. Disable automatic view rendering: $this->_helper->viewRenderer->setNoRender(true);
  5. Add render() call: $this->render('foo');
  6. Profile again

... simple as that.

Conclusion

All in all this thing doesn't require too much to follow.

Automatics — such as an automatic view renderer — add convenience which results in rapid development and hopefully shorter time to market. But they do so almost always (give it nine Erlang nines ;-)) at the expense of performance.

Update, 2010-03-20 21:37: As Rob pointed out, there's even more to gain by bypassing the helper entirely. Use the code snippet below, or consider something like the following:

Padraic also blogged extensively on Zend_Controller_Action_Helper_ViewRenderer, I recommend reading Having a bad ViewRenderer day in your ZF app?.