A toolchain for CouchDB Lounge

Friday, February 26. 2010

One of our biggest issues with CouchDB is currently the lack of compaction of our database. By lack of, I don't mean that CouchDB doesn't support it; I mean that we are unable to actually run it.

Compaction in a nutshell

Compaction in a nutshell is pretty cool.

As you know, CouchDB is not very space-efficient. For one, CouchDB saves revisions of all documents, which means that whenever you update a document, a new revision is saved. You can roll back any time, or expose it as a nifty feature in your application. Regardless, those revisions are kept around until your database is compacted.

Think about it in terms of IMAP: emails are not deleted until you hit that magic "compact" button, and 99% of all people who use IMAP don't know what it's for anyway.

Another thing is that whenever new documents are written to CouchDB and bulk mode is not used, it'll save them in a way which is not very efficient either, in terms of both actual storage and indexing (so rumour has it).

Compaction woes

Since everything is simple with CouchDB, compaction is a simple process in CouchDB too. Yay!

When compaction is started, CouchDB will create a new database file where it stores the data in a very optimized way (I will not go into detail on this, go read a science book or google if you are really interested in this!). When the compaction process has finished, CouchDB will swap your old database file for the new database file.

The woes start with disk space: when you have, e.g., 700 GB of uncompacted data, you will probably need another 400 GB for compaction to finish, because it creates a second database file.

The second issue is that when you have constant writing on your database, the compaction process will actually never finish. It kind of sucks and for those people who aim to provide close to 100% availability, this is extremely painful to learn.
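To make the disk space problem a bit more concrete, here is a minimal Python sketch of a pre-flight check one might run before kicking off compaction. The database name, the base URL and the rule of thumb that the new file may need up to as much room as the old one are assumptions for illustration. The POST to /{db}/_compact is CouchDB's compaction endpoint; the request is only built here, not sent.

```python
import urllib.request


def has_room_for_compaction(db_file_bytes, free_bytes, worst_case_ratio=1.0):
    """Return True if free disk space covers the assumed worst case.

    worst_case_ratio is an assumption: 1.0 means the compacted copy
    could grow as large as the current database file.
    """
    return free_bytes >= db_file_bytes * worst_case_ratio


def compaction_request(base_url, db_name):
    """Build (but do not send) the POST request that starts compaction."""
    url = "{}/{}/_compact".format(base_url.rstrip("/"), db_name)
    return urllib.request.Request(
        url,
        data=b"",
        headers={"Content-Type": "application/json"},
        method="POST",
    )


GB = 1024 ** 3

# Numbers from the example above: 700 GB of data, 400 GB of free space.
print(has_room_for_compaction(700 * GB, 400 * GB))       # conservative ratio
print(has_room_for_compaction(700 * GB, 400 * GB, 0.5))  # optimistic ratio

# "mydb" and localhost:5984 are placeholders, not a real setup.
request = compaction_request("http://localhost:5984", "mydb")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```

This only covers the disk space side of things. It does nothing about compaction never finishing under constant writes.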



Thoughts on RightScale

Tuesday, October 20. 2009

RightScale provides all kinds of things: a pre-configured MySQL master-slave setup (with automatic EBS/S3 backups), a full LAMP stack, Rails app servers, virtually all kinds of other pre-configured server templates, and a nifty auto-scaling feature.

We decided to leverage RightScale when we planned our move to AWS a couple months ago in order to not have to build everything ourselves. I've been writing this blog entry for the past five weeks and here are some observations, thoughts and tips.

RightScale

First off, whatever you think, and do, or have done so far, let me assure you, there's always a RightScale way of doing things. For (maybe) general sanity (and definitely your own), I suggest you don't do it their way — always.

One example of the RightScale way is that all the so-called RightScripts will attempt to start services on reboot for you, instead of registering them with the init system (e.g., on Ubuntu, update-rc.d foo defaults) when they are set up.

You may argue that RightScale's approach provides you with a more detailed log of what happened during the boot sequence, but at the same time it provides more potential for errors and introduces another layer around what the operating system provides, and what generally works pretty well already.

PHP and RightScale

RightScale's sales team knows how to charm people, and when I say charm, I do not mean scam (just for clarity)! :-)

The demos are very impressive and the client showcases no less so. Where they really need to excel though is PHP-related demos, because not everyone in the world runs Ruby on Rails yet. No, really: there's still us PHP people and also folks who run Python, Java and so on.

Coming from the sales pitch, I felt a little disappointed because a standard PHP setup on RightScale is what you would have considered standard three years ago: mod_php, Apache2 and so on. The configuration itself is a downer as well, with a lot of unnecessary settings and generally not so speedy choices. Then remember that neither CentOS nor Ubuntu is exactly up to date on packages, and add another constraint to the mix: Ubuntu is on 8.04, which is one and a half years in the past as I write this entry.

And even though I can relate to RightScale's position (supporting customers with all kinds of different software is a burden, and messy to say the least), I am also not a fan.

Scaling up

The largest advantage when you select a service provider such as RightScale is that they turn raw EC2 instances into usable servers out of the box. So, for example, setting up a LAMP stack yourself requires time, even though it's a trivial task for many. With RightScale, it's a matter of a couple clicks: select image, start, provide input variables and done.

Aside from enhanced AMIs, RightScale's advantage is auto-scaling. Auto-scaling has been done a couple times before: more than one service provider leverages EC2 and provides scalability on top. Then take a look at Scalr, which is open source, and recently Amazon themselves added their own Elastic Load Balancer.

In general, I think auto-scaling is something everyone gets, and wants, but of course it's not that dead simple to implement. And especially when you move to a new platform, it's a perfect trade-off to sacrifice some flexibility and money for a warm and fuzzy "works out of the box" feeling.



CouchDB on Ubuntu on AWS

Friday, August 28. 2009

Here's a little HowTo on how to set up CouchDB on an AWS EC2 instance. Outside of AWS (and EC2), this setup works on any other Ubuntu server too, and I suppose on Debian as well.

Getting started

The following steps are a rough draft, or a sketch of how to get started. I suggest that you familiarize yourself with what all of these things do. If you want to skip the reading and just get started, this should work anyway.

  • you (obviously) need an AWS account (and log into the AWS console).
  • you need a custom security group (make sure to open up for http traffic)

  • create an EBS volume (Take a deep breath and think about the size of the volume. Keep in mind that you don't want to run into space issues right away, and that allocated storage, even when idle, costs you money: e.g. 400 GB ≈ 40 USD per month, excluding I/O.)
  • create a keypair (It'll prompt you to download a foobar.pem, I placed mine on my local machine in ~/.ssh/ and ran chmod 400 on it.)
  • get an elastic IP
  • start the instance
    • select an AMI (I selected alestic's 64bit server Ubuntu 9.04 AMI.)
    • assign your own security group AND the default one
    • select your keypair
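
For the volume size decision in the list above, a quick back-of-the-envelope calculation helps. This is just a sketch in Python: the rate of 0.10 USD per GB per month is simply what the 400 GB for roughly 40 USD figure implies, so treat it as an assumption and check current pricing. I/O charges are left out.

```python
def monthly_storage_cost(size_gb, usd_per_gb_month=0.10):
    """Estimate the monthly cost of allocated EBS storage, excluding I/O.

    The default rate is an assumption derived from 400 GB =~ 40 USD.
    """
    return size_gb * usd_per_gb_month


# Compare a few candidate volume sizes before creating the volume.
for size_gb in (100, 400, 1000):
    print("{:>5} GB: about {:.2f} USD per month".format(
        size_gb, monthly_storage_cost(size_gb)))
```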

Woo! We made it that far.

The instance should boot and once this is done (green indicates all went well), we want to attach the previously created EBS volume and associate the elastic IP with said instance.

Once these steps are complete, go on the instance screen, click on your running instance and then click on "Connect". It'll show you the ssh command to connect to your instance -- it should be similar to this:

ssh -i .ssh/foobar.pem root@ec2-W-X-Y-Z.compute-1.amazonaws.com

The W-X-Y-Z part is most likely replaced with your elastic IP.

This process is not very automated yet, but at least you have an instance up and running. The next step is to try to log in and see if the EBS volume was attached; if all went well, you should have /mnt.
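
If you'd rather script that check than eyeball it, something like the following Python sketch works once you are logged in. It assumes, as described above, that the volume shows up at /mnt; adjust the path if yours differs.

```python
import os
import shutil


def describe_mount(path):
    """Report whether path is a mount point and how much space it has."""
    if not os.path.ismount(path):
        return "{} is not a mount point".format(path)
    usage = shutil.disk_usage(path)
    return "{}: {:.1f} GB total, {:.1f} GB free".format(
        path, usage.total / 1024 ** 3, usage.free / 1024 ** 3)


# /mnt is the assumed location of the volume.
print(describe_mount("/mnt"))
```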



Fixing up anti-spam plugins in Wordpress (and other apps) for Mosso

Monday, January 19. 2009

A lot of companies moved their web applications, or parts of them, to the cloud in 2008. Some people have had issues, for others (and AWS in particular), it's been one success story.

Because some of us like to focus on the business side and not run servers ourselves, providers like Mosso (a division of Rackspace) and MediaTemple offer scalable webhosting environments available to everyone.

Some of them call their offering cloud, others call it grid. Apparently it's the same. And I'm sure I am oversimplifying the services they offer (and I mean no disrespect), but scalable webhosting is what it really is.

Mosso in particular caught people's attention because they had a lot of issues in the beginning. But because most of us know there is no such thing as 100% uptime for 100 USD/month, I don't want to poke them too hard for it.

One important thing to take into account when moving into the cloud is that on the configuration side, any virtual solution is slightly different from regular webhosting.

In particular, one of the issues which my friend Allen Stern ran into when he moved to Mosso was that, due to the virtual nature of the entire setup, none of his anti-spam plugins in Drupal and Wordpress worked. The reason is that the IP populated in $_SERVER['REMOTE_ADDR'] is always the IP of Mosso's load balancer, which runs in front of the server farm and distributes all traffic to servers where resources are vacant.

Mosso instead passes the real client IP in a header which PHP exposes as $_SERVER['HTTP_X_CLUSTER_CLIENT_IP'], but because the majority of PHP developers are used to a very specific setup (the LAMP stack), they rarely spend time thinking about other environments.

In this case, the plugins will soon blacklist Mosso's load balancer and you will end up with a lot of comments which you will need to moderate. This blacklisting renders those plugins (e.g. Akismet, Mollom) useless.

While I'm certainly amazed that Mosso could not fix this at the server level, here are a couple solutions (free of charge) for their customer base to use to fix the problem themselves.

PHP to the rescue

For everyone involved, there are multiple solutions to this problem.

The hack!

Mass-replace REMOTE_ADDR with HTTP_X_CLUSTER_CLIENT_IP.

The disadvantage is that if you run software such as Wordpress, you will lose the easy update feature since you edited all files.

Still semi-dirty

Find a file (e.g. a configuration file) which is included by the software everywhere and add the following line into it: $_SERVER['REMOTE_ADDR'] = $_SERVER['HTTP_X_CLUSTER_CLIENT_IP'];

There is no real disadvantage here. The only thing you need to keep in mind is that you will probably need to re-add this line in case the software itself updates the file and overwrites your changes.

The clean solution!

My favorite is to put the above statement into its own file (e.g. ip-fix.php) and use auto_prepend_file to fix the IP everywhere - period. The great advantage here is that this fix (while probably not the best in terms of performance) is largely independent of the server you run (although .htaccess itself requires Apache, or at least htscanner) and survives all the updates and changes you make to the software.

In a nutshell, you should paste the following into a .htaccess file:

php_value auto_prepend_file /complete/path/to/ip-fix.php

Would you trade an arm for a leg?

All three solutions are of course less than ideal because they require the customer to fix something that should be fixed on the server side. For example, Mosso could patch Apache to override the header, or use a webserver such as nginx, which does it out of the box.

According to my buddy Allen, it worked for him, and Mosso wants to roll out my work-around for all customers. (Just by the way Mosso — I'm always available for consulting! :-D)

Other providers

I do know that this is an issue with other providers as well. And while Mosso uses HTTP_X_CLUSTER_CLIENT_IP, all you need to find out is where your provider hides the real IP address in order to apply this workaround to your environment. And that's all.

Here is an idea of how to go about it:

  1. Go to http://whatismyip.com and write down your IP address.
  2. Create a .php file with the following contents in it, and upload it: <?php phpinfo(); ?>
  3. Open the URL of the file in your browser and look for your IP address.
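
The same search can be done programmatically instead of scanning the phpinfo() output by eye. Here is the idea as a small Python sketch; the dictionary stands in for the server variables your provider exposes, and the sample values and IP addresses are made up for illustration.

```python
def find_ip_keys(server_vars, my_ip):
    """Return the names of all variables whose value contains my_ip.

    A substring match is used on purpose, since some proxies put a
    comma-separated list of addresses into a single header.
    """
    return sorted(
        name for name, value in server_vars.items()
        if isinstance(value, str) and my_ip in value
    )


# Made-up sample: REMOTE_ADDR holds the load balancer, not the visitor.
sample_vars = {
    "REMOTE_ADDR": "10.0.0.1",
    "HTTP_X_CLUSTER_CLIENT_IP": "203.0.113.7",
    "HTTP_HOST": "example.com",
}

print(find_ip_keys(sample_vars, "203.0.113.7"))
```

Whatever name turns up is the one to use in place of HTTP_X_CLUSTER_CLIENT_IP in the workaround.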

In case there is no other IP-related header populated, you will need to rely on the client-side to get this IP and/or utilize captchas to defend yourself from spam. Or, of course, move providers. ;-)
