
Operating CouchDB II

A couple of months ago, I wrote an article titled Operation CouchDB. I noticed that a lot of people still visit my blog for that particular post, so here is an update on the situation.

And no, you may not copy and paste this or any other of my blog posts unless you ask me. ;-)

Caching revisited

A while back I wrote about how caching is trivial with CouchDB — well sort of.

CouchDB and its ecosystem like to emphasize how the power of HTTP allows you to leverage countless known and proven solutions. For caching, this only goes so far.

If you remember the ETag: the reality is that while it's easy to grasp what it does, a lot of reverse proxies and caches don't implement it at all, or not very well. Take my favorite — Varnish. There is currently very little support for the ETag.

And once you go down this road, it gets messy:

  • you need a process to listen on a filtered _changes feed
  • that process must be able to PURGE Varnish (see the sketch below)

... suddenly it's homegrown and not transparent at all.
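To illustrate, here's a minimal sketch of what such a homegrown process might look like. The database name, the filter name (app/by_type) and the URL scheme the documents are cached under are all assumptions, so adjust them to your setup:

// poll a filtered _changes feed and PURGE Varnish for every changed doc
$couch   = 'http://127.0.0.1:5984/mydb';
$varnish = 'http://127.0.0.1:80';
$since   = 0;

while (true) {
    $url  = $couch . '/_changes?filter=app/by_type&since=' . $since;
    $json = file_get_contents($url);
    if ($json === false) {
        sleep(5);
        continue;
    }
    $changes = json_decode($json, true);
    foreach ($changes['results'] as $change) {
        // purge whatever URL this document is cached under
        $ch = curl_init($varnish . '/docs/' . rawurlencode($change['id']));
        curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'PURGE');
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_exec($ch);
        curl_close($ch);
    }
    $since = $changes['last_seq'];
    sleep(1);
}

The exact same PURGE request is what a cache-aware application would issue right after a write — more on that below.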

However, the biggest problem I thought about the other day is this: unless you happen to be one of the chosen few running a CouchApp in production, your CouchDB client is not a regular browser — which means that your CouchDB client library doesn't know about the ETag either. Bummer.

What about those lightweight HEAD requests?

HEAD requests to validate (or invalidate) the cache have the following issues:

  • An obvious I/O penalty, because CouchDB will consult the B-tree each time you issue a HEAD request to check the ETag.

  • The HEAD request in CouchDB is currently implemented like a GET, except the body is thrown away before the response is sent.

Work is being done in COUCHDB-941 to fix these two issues. In summary — and this is not just true for CouchDB — the awesomeness of the ETag comes primarily from faking performance (aka feeling fast) by saving the bandwidth needed to transfer the data.
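For completeness, this is what leveraging the ETag from a client looks like — a conditional GET with If-None-Match. The document URL and the ETag value here are made up:

$url  = 'http://127.0.0.1:5984/mydb/some_doc';
$etag = '"1-967a00dff5e02add41819138abb3284d"'; // saved from an earlier response

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('If-None-Match: ' . $etag));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body   = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($status == 304) {
    // not modified: no body was transferred, so we saved the bandwidth —
    // but CouchDB still had to look up the ETag
}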

Solution?

Ahhhhh — sorry, that might have been a lot of bitching! :-) Let me get to the point!

IMHO, if my cache has to issue a HEAD request against CouchDB each time it is used, just to make sure the data is not stale, it becomes pretty pointless to use a cache at all. Point taken, HEAD requests will become cheaper, but they still have to happen.

Thus only part of the work is actually offloaded from the database server, while I'd say the usual approach in caching is to not hit the database (server) at all.

The (current) bottom line is: no silver bullet available.

You have to invalidate the cache from your application, or e.g. use a tool like thinner to follow _changes and purge accordingly. Or, what we're doing: make the application cache-aware and PURGE directly from it.

include_docs

?include_docs=true works well, up until you get a sh!tload of traffic — then it's a hefty I/O penalty each time your view is read. Of course, emitting the entire doc means that you'll need more disk space, but the performance improvement gained by sacrificing disk space is between 10 and 100x. (Courtesy of the nice guys at Cloudant!)
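In case the difference isn't obvious, here are the two ways to read the same data, sketched in PHP — the database, design document and view names are placeholders:

$couch = 'http://127.0.0.1:5984/mydb';

// 1) keep the view index small and let CouchDB fetch each document on read:
//    extra I/O on every request, but no duplicated data on disk
$rows = json_decode(file_get_contents(
    $couch . '/_design/app/_view/by_type?include_docs=true'
), true);

// 2) have the view's map function do emit(doc.type, doc) so the full
//    document lives in the view index: more disk, much cheaper reads
$rows = json_decode(file_get_contents(
    $couch . '/_design/app/_view/by_type_with_doc'
), true);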

In a cloud computing context like AWS, this is indeed a cost factor: think instance size for general I/O performance, the actual storage (in GB) and the throughput which Amazon charges for. In a real datacenter, SAS disks are also more expensive than those crappy 7500 rpm SATA(-II) disks.

Partitioning

When I say partitioning, I mean spreading out the various document types by database. So even though the great advantage of a document-oriented database is being able to have a lot of different data right next to each other, you should avoid that.

Because when your data grows, it's pretty darn useful to be able to push a certain document type onto another server in order to scale out. I know it's not that hard to do this when they are all in one database, but separating them when everything is on fire is a whole lot more work than replicating one database to a dedicated instance and changing the endpoint in your application.
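As a made-up example, the endpoints could live in a tiny map in your configuration — moving a document type to its own server then boils down to a replication plus a one-line change:

// hypothetical configuration: one database per document type
$endpoints = array(
    'users'    => 'http://couch-1.example.org:5984/users',
    'invoices' => 'http://couch-1.example.org:5984/invoices',
    'logs'     => 'http://couch-2.example.org:5984/logs', // already moved off
);

// e.g. fetch a document of a given type
$doc = json_decode(
    file_get_contents($endpoints['users'] . '/some_doc_id'), true
);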

Keeping an eye on things

I did not mention this earlier because it's a no-brainer — sort of. It seems, though, that it has to be said.

Capacity planning

... and projections are (still) pretty important with CouchDB (or NoSQL in general). If you don't know whether — or by how much — your data will grow, the least you should do is put a monitor on your CouchDB server's _stats to keep an eye on how things develop.
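_stats is just JSON over HTTP, so the simplest monitor is a couple of lines of PHP. The counters below exist in CouchDB 1.0.x; check your version's _stats output before relying on them:

$stats = json_decode(
    file_get_contents('http://127.0.0.1:5984/_stats'), true
);

$reads  = $stats['couchdb']['database_reads']['current'];
$writes = $stats['couchdb']['database_writes']['current'];
$open   = $stats['couchdb']['open_databases']['current'];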

Monitoring

If you happen to run munin, I suggest you take a look at Gordon Stratton's plugins. They are very useful.

My second suggestion for monitoring would be to not just put a sensor on http://example.org:5984/, because with CouchDB, that page almost always works:

{"couchdb":"Welcome","version":"1.0.1"}

Of course this page is a great indicator of whether your CouchDB server is available at all, but it is not a great indicator of the server's performance, e.g. number of writes, write speed, view building, view reads and so on.

Instead, my suggestion is to build a slightly more sophisticated setup, e.g. using a tool like tsung, which actually makes requests, inserts data and does a whole lot more. It lets you aggregate the data, which allows you to see your general throughput, and it comes in useful for a post-mortem.

If you struggle with tsung, check out my tsung chef cookbook.

Application specific

There really is never enough monitoring. Aside from monitoring general server vitals, I highly recommend keeping a sensor on things that are specific to your application, e.g. the number of documents of a certain type, or in a database, etc. All these little things help to get the big picture.

BigCouch

BigCouch is probably the biggest news since I last blogged about operation CouchDB.

BigCouch is Cloudant's CouchDB sharding framework. It basically lets you spread a single database across multiple nodes. I haven't had the pleasure of running BigCouch myself yet, but we've been a customer of Cloudant for a while now.

Expect more on this topic soon, as I plan to get my hands dirty.

Fin

I hope I didn't forget anything. If you have feedback or stories to share, please comment. :-)

Having just re-read my original blog post, I think the rest of it still stands.

Tracking PHP errors

track_errors provides the means to catch an error message emitted by PHP. It's something I like to use during the development of various applications, or to get a handle on legacy code. Here are a few examples of why!

For example

Imagine the following remote HTTP call:

$response = file_get_contents('http://example.org/');

So whenever this call fails, it will return false and also emit an error message:

Warning: file_get_contents(http://example.org):
    failed to open stream: could not connect to host in /example.php on line 2

Some people use @ to suppress this error message — an absolute no-go for reasons such as:

  • it just became impossible to know why the call failed
  • @ at runtime is an expensive operation for the PHP parser

The advanced PHP web kiddo knows to always set display_errors = Off (e.g. in php.ini or through ini_set()) in order to shield the visitors of their application from these nasty messages. And maybe they even know how to log the error — somewhere.

But whenever an error is logged to a log file somewhere, it also means it's buried. Sometimes these error logs are too far away and often they get overlooked. If you happen to centralize and actually analyze your logfiles, I salute you!

So how do you use PHP's very useful and informative error message to debug this runtime error?

track_errors to the rescue.

track_errors is a PHP runtime setting.

To enable it:

; php.ini
track_errors = On 

Or:

ini_set('track_errors', 1);

And this allows you to do the following:

$response = file_get_contents('http://example.org');
if ($response === false) {
    throw new RuntimeException(
        "Could not open example.org: {$GLOBALS['php_errormsg']}"
    );
}

The last error message is always populated in the global variable $php_errormsg.

You want more?

I also recently used the same technique to implement error checking into a legacy application. I basically did the following:

// footer.php ;-)
if (!empty($GLOBALS['php_errormsg']) && $user->isAdmin()) {
    echo $GLOBALS['php_errormsg'];
}

Trade-offs

As useful as this can be, there are a number of trade-offs. I suggest you use it wisely.

  • $php_errormsg is a global variable [yuck]
  • many extensions provide built-in ways to catch errors
  • ini_set() calls at runtime are expensive

Fin

That's all, quick and (very) dirty.

The demand web

I read a blog entry this morning entitled "The unbearable lameness of web 2.0" (scroll down for the English version).

In his blog entry, Kris Köhntopp states that he's not satisfied with the status quo, and of course that he has said it all before — in a nutshell, he wants a social networking standard which is adhered to across all platforms, e.g. Twitter, Facebook and whatever else there is in between.

This standard includes things like:

  • a better like/friend/subscribe model
  • auto-classification of contacts into interest groups (basically diaspora's aspect feature, but automatic)
  • aggregation and analysis of shared items in your own stream and the stream of your friends/followers
  • providing sources (e.g. to be able to find the origin of a shared item vs. seeing it shared 20 times in your stream)
  • … and language detection (and possibly translation)

I hope I got it all right (in a nutshell, of course).

Fundamental problems

The blog entry itself and the comments on it suggest how trivial and easy these features are, so I'm wondering why exactly no one has implemented them yet?

Well, let me try to answer that.

Trivial?

These problems are not trivial and actually require a little more thought ("Googledenk", as Kris put it). I know there are services already that implement some of these features, but apparently it's not that easy after all — feel free to prove me wrong.

The average user

These problems are also not average user problems.

Yeah, there might be 10,000 or maybe even 100,000 people on Facebook who have these problems, but not 50,000,000. Facebook, being slightly more business-oriented than the average "go build it for me" social media blogger, will build a feature for the 50,000,000 first before it caters to the problems of those maybe 100,000 power users.

Power users are not their target audience. Mom and dad type of people are.

The fact that there are indeed services implementing these features (or at least some of them), combined with their general lack of traction, kind of supports my argument as well. Apparently, this is something not too many people need.

A Standard!

Does anyone remember how well OpenSocial worked out? Good luck with that.

Can someone, please?

To all those people who are pissed at diaspora because it's not what they thought it would be:

Get a grip and contribute for f's sake.

If you want something to happen, maybe you just have to go further than your blog to bitch about it. It's really easy to rant on Twitter or your blog (see this post for example :-)), but GTD — that's the hard part.

Playground

Last but not least, people forget that when they get into social networking, they have no rights.

Of course in some countries you may have a right to your data, but that's basically it.

There is no given right to access a platform, no right to certain features or how they are designed, and there sure as hell is no right to any kind of API. Facebook, Twitter and StudiVZ — they all allow users to come play. There's nothing for a user to demand.

Fin

My point of view. If you beg to differ, go build it.

Magento: Loading the product from a template

When I wrangle with Magento Commerce and customize anything, every other bit is of course tied to a product's ID, or sometimes its entity ID.

The challenging part is that whenever you're in a template in Magento, the scope is very different from the previous one. This is sometimes frustrating, but when you think of it — it makes sense. That is, in Magento! ;-)

Magento works in blocks, and since each block is basically a class file, $this is of course never the same. So, for example, the scope of a block that renders a product page is different from something like Mage_Sales_Block_Order_Item_Renderer_Default (used to display a row of an order).

Code

I recently had to customize the display of an item in an order in the customer's account.

To do so, I had a nifty helper which is tied to a product's ID. Unfortunately, I was only able to retrieve the SKU, quantity ordered and all sorts of other things right away — but the product's ID was not available.

So how do you load a product otherwise? Simple! Using the SKU!

// app/design/frontend/default/my-template/template/sales/order/items/renderer/default.phtml
$_sku     = $this->getItem()->getSku();
$_product = Mage::getModel('catalog/product')->loadByAttribute('sku', $_sku);
$_product->getEntityId(); // here is your ID

So yeah — Mage_Catalog_Model_Product = very helpful. And of course there are a bunch of other attributes to load products by — well, just about any attribute. Just dig around the docs, or var_dump() a product object on a product page to see what else is available.

Fin

Quick and dirty — I just blogged this because it took me 25 minutes to find that model and the loadByAttribute() method.

And hopefully by blogging this, I'll never ever forget.

APC: get a key's expiration time

It's always surprising to me, but APC is still the best kept secret.

APC offers a bunch of very useful features — foremost a realpath cache and an opcode cache. However, my favorite is neither: it's being able to cache data in shared memory. How so? Simple: use apc_store() and apc_fetch() to persist data between requests.
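If you haven't used it yet, it's literally two calls — expensive_operation() here is just a stand-in for whatever you want to cache:

$data = apc_fetch('example_key', $success);
if ($success === false) {
    $data = expensive_operation();        // stand-in for the real work
    apc_store('example_key', $data, 300); // cache for 300 seconds
}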

The other day, I wanted to use a key's expiration date to send the appropriate headers (Expires and Last-Modified) to the client, but it didn't seem like APC supports this out of the box yet.

Here is more or less a small hack until there's a native call:

/**
 * Return a key's expiration time.
 *
 * @param string $key The name of the key.
 *
 * @return mixed Returns false when no keys are cached or when the key
 *               does not exist. Returns int 0 when the key never expires
 *               (ttl = 0) or an integer (unix timestamp) otherwise.
 */
function apc_expire($key) {
    $cache = apc_cache_info('user');
    if (empty($cache['cache_list'])) {
        return false;
    }
    foreach ($cache['cache_list'] as $entry) {
        if ($entry['info'] != $key) {
            continue;
        }
        if ($entry['ttl'] == 0) {
            return 0;
        }
        return $entry['creation_time'] + $entry['ttl'];
    }
    return false;
}
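And roughly how I use it to send the header I was after — the key name is a placeholder:

$expire = apc_expire('example_key');
if (is_int($expire) && $expire > 0) {
    header('Expires: ' . gmdate('D, d M Y H:i:s', $expire) . ' GMT');
}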