jQuery post requests with a json response, sans eval()

Tuesday, May 18. 2010
Comments

I know some of you out there are probably tired of jQuery and people raving about it's goodness, but bare with me! Because jQuery never ceases to amaze me — especially when I haven't looked at it — or client-side JavaScript code in general — in a good year or so.

Refactoring

I've been refactoring some of my old JavaScript libs on a project and I noticed that I had used evil eval() all over the place to parse the JSON from our API. Guess we all know that this is not just a security issue since it allows code execution, but also a performance hit. And here's how to get around it. :-)

A codesnippet to automatically parse a JSON response from a $.post-request:

$.post(
  '/the-url-post-to',
  data,
  function(json) {
    // handle response
  },
  'json'
);

Note: The 4th parameter 'json' in $.post(). Magic.

Fin

That's all.

Defined tags for this entry: , , ,

PHP: So you'd like to migrate from MySQL to CouchDB? - Part III

Monday, May 17. 2010
Comments

This is part three of a beginner series for people with a MySQL/PHP background. Apologies for the delay, this blog entry has been in draft since the 13th December of last year (2009).

Follow these links for the previous parts:

Recap

Part I introduced the CouchDB basics which included basic requests using PHP and cURL. Part II focused on create, read, update and delete operations in CouchDB. I also introduced my nifty PHP CouchDB called ArmChair!

ArmChair is my own very simple and (hopefully) easy-to-use approach to accessing CouchDB from PHP. The objective is to develop it with each part of this series to make it a more comprehensive solution.

Part III

Part three will target basic view functions in CouchDB — think of views as a WHERE-clause in MySQL. They are similar, but also not. :-)

Map-Reduce-Thingy

If you read up on CouchDB before coming to this blog, you will probably heard of map-reduce. There, or maybe elsewhere. A lot of people attribute Google's success to map-reduce. Because they are able to process a lot of data in parallel (across multiple cores and/or machines) in relatively little time.

I guess the PageRank in Google Search or Google Analytics are examples of where it could be used.

In the following, I'll try to explain what map-reduce is. For people without a science degree. (And that includes me!)

Map

Generally, map-reduce is a way to process data. It's made off two things, map and reduce.

The idea is that the map-function is very robust and it allows data to be broken up into smaller pieces so it can be processed in parallel. In most cases the order data is processed in doesn't really matter. What generally counts is that it is processed at all. And since map allows us to run the processing in parallel, it's easier to scale out. (That's the secret sauce!)

And when I write scale-out, I don't suggest to built a cluster of 1000 servers in order to process a couple thousand documents. It's already sufficient in this case to utilize all cores in my own computer when the map task is run in parallel.

In CouchDB, the result of map is a list of keys and values.

Reduce

Reduce is called once the map-part is done. It's an optional step in terms of CouchDB — not every map requires a reduce to follow.

Real world example

  • take a simple photo application (such as flickr) with comments
  • use map to sort through the comments and emit the names of users who left one
  • use reduce to only get unique references and see how many comments were left by these user

In SQL:

SELECT user, count(*) FROM comments GROUP BY user

Why the fuzz?

Just so people don't feel offended. Map-reduce is slightly more complicated than my example SQL-query but it's also not some secret-super-duper thing. Its strength is really parallelization which requires the ability to break data into chunks to process them. The end.


Continue reading "PHP: So you'd like to migrate from MySQL to CouchDB? - Part III"

EC2 security group owner ID

Sunday, May 9. 2010
Comments

I recently had the pleasure to setup an RDS instance and it took me a while to figure out what the --ec2-security-group-owner-id parameter needs to be populated with when you want to allow access to your RDS instance from instances with a certain security group.

To cut to the chase, you need to log into AWS and then click the following link — done.

Defined tags for this entry: , , , , ,

Operating CouchDB

Saturday, May 8. 2010
Comments

These are some random operational things I learned about CouchDB. While I realize that my primary use-case (a CouchDB install with currently 230+ million documents) may be oversized for many, these are still things important things to know and to consider. And I would have loved to know of some of these before we grew that large.

I hope these findings are useful for others.

Compaction

CouchDB doesn't take great care of diskspace — the assumption is that disk is cheap. To get on top of it, you should run database and view compaction as often as you can. And the good news is, these operations help you to reclaim a lot of space (e.g. I've seen an uncompacted view of 200 GB trim down to ~30 GB).

Cleanup

In case you changed the actual view code, make sure to run the clean-up command (curl -X POST http://server/db/_view_cleanup) to regain disk space.

Performance impact

Database and view compaction (especially on larger sets) will slow down reads and writes considerably. Schedule downtime, or do it in off-peak hours if you think the operation is fast enough.

This is especially hideous when you run in a cloud environment where disk i/o generally sucks (OHAI, EBS!!!).

To stop either of those background-tasks, just restart CouchDB.

(Just fyi, the third option is of course to throw resources (hardware) at it.)

Resuming view compaction?

HA, HA! [Note, sarcasm!] — view compaction is not resumable, like database compaction.

View files

I suggest you split views into several design documents — this will have the following benefit.

For each design document, CouchDB will create a .view file (by default these are in var/lib/couchdb/database-name/.database-name_design/). It's just faster to run compact and cleanup operation on multiple (hopefully smaller files) versus one giant file.

In the end, you don't run the operation against the file directly, but against CouchDB — but CouchDB will deal with a smaller file which makes the operation faster and generally shorter — I call this poor man's view partitioning.

Warming the cache

Cache warming is when a cache is populated with items in order to avoid the cache and server being hit with too much traffic when a server starts up and here is what you can do with CouchDB in this regard.

The basics are obvious — updates to a CouchDB view are performed when the view is requested. This has certain advantages and works well in most situations. Something I noticed is that especially on smaller VPS servers where resources tend to be oversold and and are rare in general, generating view updates can slow your application down to a full stop.

As a matter of fact and CouchDB does often not respond during that operation when the disk was saturated (take into account that even a 2 GB database will get hard to work with if you only have 1 GB of RAM for CouchDB and the OS, and whatever else is running on the same server).

The options are:

  1. To get more traffic so views are constantly update and the updates performed are kept at a minimum.
  2. Make your application query views with ?stale=ok and instead update the views on a set interval, for example via a curl request in a cronjob.

Cache-warming for dummies, the CouchDB way.

View data

For various reasons such as space management and performance, it doesn't hurt to put all views on its own dedicated partition.

In order to do this, add the following to your local.ini (in [couchdb]): view_index_dir = /new_device

And assuming your database is called "helloworld" and the view dir is /new_device, your .view-files will end up in /new_device/.helloworld_design.

Overshard

I've blogged on CouchDB and CouchDB-Lounge before. No matter if you use the Lounge or build sharding into your application — consider it. From what I learned it's better to shard earlier (= overshard), than when it's too late. The faster your CouchDB grows, the more painful it will be to deal with all the data stuck in.

My suggestion is that when you rapidly approach 50,000,000 documents and see yourself exceeding (and maybe doubling) this number soon, take a deep breath and think about a sharding strategy.

Oversharding has the advantage that for example you run 10 CouchDB instances on the same server and move each of them (or a couple) to their own dedicated hardware once they exceed the resources of the single hardware.

If sharding is not your cup of tea, just outsource to Cloudant — they do a great job.

CouchDB-Lounge

CouchDB-Lounge is Meebo's python-based sharding framework for CouchDB. The lounge is essentially an nginx-webserver and a twistd service which proxies your requests to different shards using a shards.conf. The number of shards and also the level of redundancy are all defined in it.

CouchDB-Lounge is a great albeit young project. The current shortcomings IMHO include general stability of the twistd service and absence of features such as _bulk_docs which makes migrating a larger set into CouchDB-Lounge a tedious operation. Never the less, this is something to keep an eye on.

Related to CouchDB-Lounge, there's also lode — a JavaScript- and node.js-based rewrite of the Python lounge.

Erlang-Lounge

What I call Erlang-Lounge is Cloudant's internal erlang-based sharding framework for CouchDB. It's in production at Cloudant and to be released soon. From what I know Cloudant will probably offer a free opensource version and support once they released it.

Disk, CPU or memory — which is it?

This one is hard to say. But despite how awesome Erlang is, even CouchDB depends on the system's available resources. ;-)

Disk

For starters, disk i/o is almost always the bottleneck. To verify if this the bottleneck in your particular case, please run and analyze [iostat][] during certain operations which appear to be slow in your context. For everyone on AWS, consider a RAID-0 setup, for everyone else, buy faster disks.

CPU

The more CPU in a server, the more beam processes. CouchDB (or Erlang) seem to take great advantage of this resource. I haven't really figured out a connection between CPU and general performance though because in my case memory and disk were always the bottleneck.

Memory

... seems to be another underestimated bottleneck. For example, I noticed that replication can slow down to a state where it seems faster to copy-paste documents from one instance to another when CouchDB is unable to cache an entire b-tree in RAM.

We've been testing some things on a nifty high-memory 4XL AWS instance and during a compact operation, almost 90% of my ram (70 GB) was used by the OS to cache. And don't make my mistake and rely on (h)top to verify this, cat /proc/meminfo instead.

Caching

Caching is trivial with CouchDB.

e-tags

Each document and view responds with an Etag header — here is an example:

curl -I http://foo:bar@till.cloudant.com:5984/foobar/till-laptop_1273064525
HTTP/1.1 200 OK
Server: CouchDB/0.11.0a-1.0.7 (Erlang OTP/R13B)
Etag: "1-92b4825ffe4c61630d3bd9a456c7a9e0"
Date: Wed, 05 May 2010 13:20:12 GMT
Content-Type: text/plain;charset=utf-8
Content-Length: 1771
Cache-Control: must-revalidate

The Etag will only changes, when the data in the document change. Hence it's trivial to avoid hitting the database if you don't have to. The request above is a very lightweight HEAD request which only gathers the data and does not pull the entire document.

_changes

_changes represents a live-update feed of your CouchDB database. It's located at http://server/dbname/_changes.

Whenever a data changing operation is completed, _changes will reflect that, which makes it easy for a developer to stay on top to for example invalidate an application cache only when needed (and not like it's done usually when the cache expired).

Logging

Logrotate

First off, a lot of people run CouchDB from source which means that in 99% of all installs, the logrotation is not activated.

To fix this (on Ubuntu/Debian), do the following:

ln -s /usr/local/etc/logrotate.d/couch /etc/logrotate.d/couchdb

Make sure to familiarize yourself a little with logrotatet because depending on space and business of your installation, you should adjust the config a little to not run out of diskspace. If CouchDB is unable to log, it will crash.

Loglevel

In most cases it's more than alright to just run with a log level of error.

Add the following to your local.ini (in [log]): level = error

Log directory

Still running out of diskspace? Add the following to your local.ini (in [log]):

file = /path/to/more/diskspace/couch.log

... if you adjusted the above, you will need to correct the config for logrotate.d as well.

No logging?

Last but not least — if no logs are needed, just turn them off completely.

Fin

That's all kids.

The Slicehost-Cogent-Outage, or How to setup a relay with Postfix

Wednesday, May 5. 2010
Comments

Our problem is that an application hosted on Slicehost uses an external mailserver, which is located in Europe. Since neither Slicehost/Rackspace or Cogent seem to be able to fix the situation after almost two days, here's a quick workaround.

The idea is that our relay will collect emails and send them whenever the connection permits.

Postfix install

This is a pretty simple:

sudo aptitude install postfix

Configuration

main.cf

Edit /etc/postfix/main.cf (this is my entire main.cf):

relayhost = **MY-EXTERNAL-MAILSERVER (SMTP)**
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl/passwd
smtp_sasl_security_options =
smtpd_banner = $myhostname ESMTP $mail_name (Ubuntu)
biff = no
append_dot_mydomain = no
readme_directory = no
myhostname = **MY-SLICE-HOSTNAME**
mynetworks = 127.0.0.0/8, **MY-SLICE-IP**
alias_maps = hash:/etc/aliases
alias_database = hash:/etc/aliases
myorigin = /etc/mailname
mydestination =
mailbox_size_limit = 0
recipient_delimiter = +
inet_interfaces = loopback-only

Obviously you need to replace the **ENTRIES** in caps entries (a total of three).

/etc/postfix/sasl/passwd

Create the file:

sudo touch /etc/postfix/sasl/passwd
sudo chmod 600 /etc/postfix/sasl/passwd

Enter something like the following in it (vi /etc/postfix/sasl/passwd):

mail.example.org username:password
mail2.example.org user@example.org:password

Replace mail/mail2 with your actual SMTP. The SMTP will have to allow plain-text login for this to work.

Run the following to finish up:

sudo postmap /etc/postfix/sasl/passwd

Adjust

sudo /etc/init.d/postfix check
sudo /etc/init.d/postfix reload

Then, configure your application to not use smtp-auth and your SMPT runs on 127.0.0.1:25 if it runs on the same server.

Please note in the above main.cf, I configured postfix to only listen on the loopback.

Fin

This is an excerpt from an email:

Received: from [slice.ip] (helo=slice.host)
by mailserver.in.europe with esmtpa (Exim 4.69)
(envelope-from <email@example.org>)
id 1O9gAL-0005xQ-ES; Wed, 05 May 2010 17:05:37 +0200
Received: from [slice.ip] (localhost [127.0.0.1])
by slice.host (Postfix) with ESMTP id DE2DB11428C;
Wed,  5 May 2010 15:05:44 +0000 (UTC)

And that's all, kids!

Defined tags for this entry: , , , , , ,