Introducing TillStore

Saturday, October 24. 2009
Comments

Update, 2009-10-24: Fixed a bug, and committed a couple other improvements — TillStore 0.2.0 is released!

I went to nosqlberlin last week and I got inspired. I listened to a lot of interesting talks — most notably about Redis, CouchDB, Riak and MongoDB (I'm omitting two talks, which of course were not less awesome than the rest!) Due to an unfortunate circumstance I had six hours to hack on stuff from Thursday to Friday.

And here it is — I'm proud to present my very own approach to key-value stores: TillStore!

What is a key-value store?

In a nutshell, a key-value store is a database for values (Doh!). There are no duplicates, it's always a key mapped to a value. No bells and whistles — most just want to be fast. Eventually consistent is another attribute which most of them like to claim for themselves. The value term (in key-value store) is very flexible. Some key-value stores only support certain types, TillStore supports them all through JSON. ;-)

Other examples for key-value stores include Riak, Redis, Cassandra and Tokyo Cabinet.

TillStore?

So vain, right? Well, in the beginning TillStore was an inside joke with a colleague. And to be honest, TillStore was nothing more but the following:

<?php
$tillStore = array();
?>

However, when I had time to hack away on Thursday night, I took it a slightly higher level. ;-)


Continue reading "Introducing TillStore"

Small notes on CouchDB's views

Wednesday, October 21. 2009
Comments

I've been wrestling with a couple views in CouchDB currently. This blog post serves as mental note to myself, and hopefully to others. As I write this, i'm using 0.9.1 and 0.10.0 in a production setup.

Here's the environment:

  • Amazon AWS L Instance (ami-eef61587)
  • Ubuntu 9.04 (Jaunty)
  • CouchDB 0.9.1 and 0.10.0
  • database size: 199.8 GB
  • documents: 157408793

On to the tips

These are some small pointers which I gathered by reading different sources (wiki, mailing list, IRC, blog posts, Jan ...). All those revolve around views and what not with a relatively large data set.

Do you want the speed?

Building a view on a database of this magnitude will take a while.

In the beginning I estimated about week and a half. And it really took that long.

Things to consider, always:

  • upgrade to trunk ;-) (or e.g. to 0.10.x)
  • view building is CPU-bound which leads us to MOAR hardware — a larger instance

The bottom line is, "Patience (really) is a virtue!". :-)

A side-note on upgrading: Double-check that upgrading doesn't require you to rebuild the views. That is, unless you got time.

View basics

When we initially tested if CouchDB was made for us we started off with a bunch off emit(doc.foo, doc)-like map functions in (sometimes) temporary views. On the production data, there are a few gotcha's.

First off — the obvious: temporary views are slow.

Back to JavaScript

Emitting the complete document will force CouchDB to duplicate data in the index which in return needs more space and also makes view building a little slower. Instead it's suggested to always emit(doc.foo, null) and then POST with multiple keys in the body to retrieve the documents.

Reads are cheap, and if not, get a cache.

doc._id

In case you wonder why I don't do emit(doc.foo, doc._id)? Well, that's because CouchDB is already kind enough to retrieve the document's ID anyway. (Sweet, no?)

include_docs

Sort of related, CouchDB has a ?include_docs=true parameter.

This is really convenient — especially when you develop the application.

I gathered from various sources that using them bears a performance penalty. The reason is that include_docs issues another b-tree lookup for every row returned in the initial result. Especially with larger sets, this may turn into a bottleneck, while it can be considered OK with smaller result sets.

As always — don't forget that HTTP itself is relatively cheap and a single POST request with multiple keys (e.g. document IDs) in the body is likely not the bottleneck of your operation — compared to everything else.

And if you really need to optimize that part, there's always caching. :-)

Need a little more?

Especially when documents of different types are stored into the same database (Oh, the beauty of document oriented storage!), one should consider the following map-example:

if (doc.foo) {
    emit(doc.foo, null)
}

.foo is obviously an attribute in the document.

JavaScript vs. Erlang

sum(), I haven't found too many of these — but with version 0.10+, the CouchDB folks implemented a couple JavaScript functions in Erlang, which is an easy replacement and adds a little speed on top. :-) So in this case, use _sum.

Compact

Compact, I learned, knows how to resume. So even if you kill the process, it'll manage to resume where it left off before.

When you populate a database through bulk writes, the gain from a compact is relatively small and is probably neglectable. Especially because compacting a database takes a long while. Keep in mind that compaction is disk-bound, which is often one of the final and inevitable bottlenecks in many environments. Unless hardware is designed ground up, this will most likely suck.

Compaction can have a larger impact when documents are written one by one to a database, or a lot of updates have been committed on the set.

I remember that when I build another set with 20 million documents one by one, I ended up with a database size of 60 GB. After I compacted the database, the size dropped to 20 GB. I don't have the numbers on read speed and what not, but it also felt more speedy. ;-)

Fin

That'd be it. More next time!

Defined tags for this entry: , , , ,

My first PHP Unconfernce

Friday, September 18. 2009
Comments

I went to Hamburg last weekend to visit the PHP Unconference, which was probably my first conference ever. I've been to a couple barcamps and other smaller events, but anyway, this felt more like a real conference to me. That is, if I exclude ALA and the various ad:tech's I had to go to.

The reasons why I usually avoid tech conferences include foremost the price tag (working for myself, I can technically label it as an expense, but I still have to pay for it), doubts that it'll be worth it in terms of knowledge gained and probably time. I tend to catch up with people outside of conferences (when they are in Berlin :-)) and that has worked well for me.

I'm glad I set all these things aside for Hamburg (and it was all too easy). A lot of people expressed how much they liked their (often 3rd) PHP Unconference, and I can second, or third that — job really well done. Ulf Wendel took it one step further, blogged and asked, "Is perfect too boring?", because everything worked out so well. I guess I would say, "No, it's not boring", and I'm inclined to add, "Thanks, it really felt like having a weekend off, yet I still learned something and met a ton of nice people (or connected online nicknames to real faces)!".

I can definitely see why people visit the PHP Unconference each year, and I'll be one of them next year! ;-)

As I said, I had a great time, both my topics were accepted too. One was merged with another PHP performance talk which was overbooked with PHP VIPs which is why I decided to listen to Kore Nordmann's talk on CouchDB instead, and the other one ("Deployment") — I kind of overslept. And I'm sorry about that! I'll make sure to avoid party, party Hamburg next year.

Here are the slides for my Zend Framework (performance) talk, I hope you find them interesting:

The slides and speaker-notes contain…

  • a small intro as of why I think it's worth while to get into the ring with it
  • hints and pointers on general PHP optimization
  • I detail on a couple components (e.g. things to look out for and how to overcome them)

Make sure to check the speaker notes (using this link) — I didn't put everything in there, but a lot.

(The deployment slides will be up later this weekend.)

This also reminds me to improve my presentation-fu. I need something as kickass as keynote, but for Windows (currently). If anyone has a pointer, let me know. ;-)

Defined tags for this entry: , , ,

PHP in schnell

Thursday, September 3. 2009
Comments

Sorry, this is all in German. Or at least Germenglish. ;-) The slides are not rocket science and show a basic introduction to "faster PHP". Things to do, things to avoid, some autoloading and a little on "shared nothing" in the end. There are some speaker-notes included (link) to make the slides more valuable for those who couldn't make it.

Hier die Folien zu meinem Vortrag "PHP in schnell" bei der PHP Usergroup in Berlin im September 2009. Leider sieht man unten keine Notizen (speaker notes). Man kann sie jedoch sehen, wenn Ihr über folgenden Link (Actions > Show speaker notes (Vorsicht: Rechtschreibfehler usw. inklusive)) die Präsentation öffnet.

Wer Fragen hat, darf gerne kommentieren oder mich einladen. ;-)

Vielen Dank an alle die da waren. Und Danke an den NewThinking Store Berlin für den Ort, Beamer, Internet und das Glas Wasser. ;-)

Defined tags for this entry: , , , ,

PHP performance III -- Running nginx

Sunday, May 31. 2009
Comments

Since part one and two were uber-successful, here's an update on my Zend Framework PHP performance situation. I've also had this post sitting around since beginning of May and I figured if I don't post it now, I never will.

Disclaimer: All numbers (aka pseudo benchmarks) were not taken on a full moon and are (of course) very relative to our server hardware (e.g. DELL 1950, 8 GB RAM) and environment. The application we run is Zend Framework-based and currently handles between 150,000 and 200,000 visitors per day.

Why switch at all?

In January of this year (2009), we started investigating the 2.2 branch of the Apache webserver. Because we used Apache 1.3 for forever, we never had the need to upgrade to Apache 2.0, or 2.2. After all, you're probably familiar with the don't fix it, if it's not broken-approach.

Late last year we ran into a couple (maybe) rather FreeBSD-specific issues with PHP and its opcode cache APC. I am by no means an expert on the entire situation, but from reading mailing lists and investigating on the server, this seemed to be expected behavior — in a nutshell: Apache 1.3 and a large opcode cache on a newer versions of FreeBSD (7) were bound to fail with larger amounts of traffic.

We tried bumping up a few settings (pv entries), but we just ran into the same issue again and again.

Because the architecture of Apache 2.2 and 1.3 is so different from one another (and upgrading to 2.2 was the proposed solution), I went on to explore this upgrade to Apache 2.2. And once I completed the switch to Apache 2.2, my issues went away.

So far, so good!

Performance?

On the performance side we experienced rather mediocre results.

While we benched that a static file could be read at around 300 requests per second (that is a pretty standard Apache 2.2 install, sans a couple unnecessary modules), PHP (mod_php) performed at a fraction of that, averaging between 20 and 23 requests per second.

Myth: Hardware is cheap(, developer time is not)!

Before some people yell at me for trying to optimize my web server, one needs to take the costs of scaling (to a 100 requests per seconds) into account.

One of those servers currently runs at 2,600.00 USD. The price tag adds up to an additional 10,400.00 USD in order to scale to a 100 (lousy) requests per seconds. Chances are of course, that the hardware is slightly less expensive since DELL gives great rebates — but the 8 GB of (server) RAM and the SAS disks by themselves are melting budgets away.

And on top of all hardware costs, you need add setup, maintenance and running costs (rack space, electricity) for an additional four servers — suddenly, developer time is cheap. ;-)

So what do we do? Nginx to the rescue?!


Continue reading "PHP performance III -- Running nginx"