PHP: So you'd like to migrate from MySQL to CouchDB? - Part I

Saturday, October 31. 2009

Update (2009-10-13): I posted part II!

This is the first part of a series. I'll start off by introducing CouchDB — from a PHP side, then I'll demo a couple basic use cases and I later on, I'll dive into migrations from MySQL.

My idea is to introduce CouchDB to a world where database-driven development generally refers to MySQL. By no means, this is meant to be disrespectful to MySQL, or SQL-databases in general. However, I'm a firm believer in using the right tool for the job.

First things first!

First off, before using CouchDB and maybe eventually replacing MySQL with it, we need to ask ourself the "Why?"-question.

And in order to be capable of more than a well-educated guess we need to familiarize ourselves with the CouchDB basics.

Basics

  • Document-oriented and schema-less storage.
  • Erlang (for rock-solid-scaling-goodness).
  • RESTful HTTP-API (we'll get to that).
  • Everything is JSON - request data, storage, response!

Document-oriented

In a document-oriented as to opposed to a relational store, the data is not stored in table, where data is usually broken down into fields. In a document-oriented store each record is stored along side and can have its own characteristics — properties of any kind.

As an example, consider these two records:

Till Klampaeckel, Berlin
Till Klampaeckel, till@php.net, Berlin, Germany

In a relational store, we would attempt to break down, or normalize, the data. Which means that we would probably create a table with the columns name, email, city and country.

Consider adding another record:

Till Klampaeckel, +49110, till@some.jabber

(Just fyi — this is not my real phone number!)

Looking for an intersection in the records, the name is the only thing this record has in common with the previous two. With a relational database, we would either have to add a column for phone number and chat, or we would start splitting off the data into multiple tables (e.g. a table called phone and one called chat) in order to get grip.

With a document-oriented database — such as CouchDB — this is not an issue.

We can store any data, constraints do not apply.

Erlang

Erlang was invented a while ago, by Ericsson, when it was still sans Sony. In a nutshell, Erlang's true strength is reliability and stability. It also manages to really utilize all the resources modern hardware has to offer since it's a master of parallelization.

CouchDB is written in Erlang, and also accepts view code written in Erlang. More on views later.

RESTful HTTP-API

For starters, a lot of HTTP-APIs claim to be RESTful, most of them are not. HTTP has so called request verbs (DELETE, GET, HEAD, POST, PUT among them) and a lot of APIs don't use them to the fullest extend, or rather not all.

Instead, most APIs are limited GET and maybe use a little POST. An example of such an API is the Flickr API.

Most of us are familiar with GET and POST already. For example, when you opened the web page to this blog entry, the browser made a GET-request. If you decide to post a comment later on — you guessed it, that's a POST-request.

Aside from its basic yet powerful nature, HTTP is interesting in particular because it is the least common multiple in many programming language. Whatever you use — C#, PHP, Python, Ruby — these languages know how to talk HTTP. And even better — most of them ship pretty comfortable wrappers.

JSON

JSON — it's godsend for those of us who never liked XML.

It's very lightweight, yet we able to represent lists and objects, integers, strings — most data types you would want to use. A clear disadvantage of JSON is that it lacks validation (think DTD), and of course comments — ha, ha!

Why, oh why?

So along with "Why?", we should consider the following:

  • Does it make sense?
  • Is CouchDB (really) the better fit for my application?
  • What is my #1 problem in MySQL, and how does CouchDB solve it?

And if we are still convinced to migrate all of our data, we'll need to decide on an access wrapper.

It's all HTTP, right?

By now, everyone has heard that CouchDB has a RESTful HTTP-API. But what does that imply?

It means, that we won't need to build a new extension in PHP to be able to use it. There's already either ext/socket or ext/curl — often both — in 99% of all PHP installs out there. Which means that PHP is more than ready to talk to CouchDB — right out of the box.

Since I mentioned JSON before — today ext/json is available in most PHP installs as well. If however we happen to be one of the few unfortunates who don't have and cannot get this extension, we should use Services_JSON instead.

Install it!

CouchDB installations are available in most Linux and Unix distributions. On MacOSX, get CouchDBXthe one-click CouchDB package, and there's a work in process for Windows as well. Especially interesting for those who run Ubuntu 9.10 (which has been released a few days ago), there's already a CouchDB install included.

Ubuntu/Debian:

apt-get install couchdb

FreeBSD:

cd /usr/ports/databases/couchdb && make install clean

Continue reading "PHP: So you'd like to migrate from MySQL to CouchDB? - Part I"

A case for PEAR and PHP4 (Or, why BC is important!)

Tuesday, September 22. 2009

Every once in someone likes to argue that PEAR is all fugly PHP4 code and why you should not use it, and instead go and use another framework or component library. Most of those people also say that they looked at or used PEAR x years ago and then act all surprised when someone else disagrees.

In related (BC) news, most people probably read my blog because of Zend Framework, and I remember that one of the reasons I sold my clients on Zend Framework was a supposedly backward compatibility and clean API. Well, a couple years later one knows it's not all that and since another BC break was argued today on one of the mailing lists and the project lead said I spread fud, I felt l needed to write something on the topic.

Facts first.

A small history of PEAR.

I don't know how old exactly PEAR is, but the manual is copyrighted since 2001 and none of the other current frameworks have been around eight — almost nine — years.

Because PEAR has been around much longer, we also have more older code than any of those PHP5-only frameworks. In comparison, Zend Framework's first stable was release in June, 2007, almost six years later.

Major versus minor releases.

A package' version consists of x.y.z.

The rules are as follow:

  • A BC break (see below) — increment x, and set y and z to 0.
  • Adding a new feature — keep x, increment y and set z to 0.
  • Fixing a bug — keep x and y, but increment z.

When someone refers to a major release in PEAR's context (and a lot of other projects such as Zend Framework, Solar and ezComponents follow this), it's one with a change in x. :-)

What is backward compatibility?

Backward compatibility, or BC, describes that a component, package or library preserves compatibility with an older versions.

Because programming itself and developers tend to evolve, PEAR tries to keep BC in all minor versions, but allows you to break compatibility to an earlier version with a new major release.

The exception to the rule is that you may break BC during alpha and beta releases before the package reaches a stable 1.0.0. Once a 1.0.0 is reached, BC may not be broken — for whatever reason.

PHP4?

Because PEAR aims to provide BC all the way, BC includes the PHP version when the package was first released. Which in turn means that of course you may make the code compatible to a later PHP release, but not without breaking compatibility to the initial release.

If you followed the above, you understand the reason why for example there is a Mail_Queue and (a soon to be) Mail_Queue2, or more importantly: why the Mail_Queue release in 2009 is still compatible to PHP4. Even though PHP4 EOL'd a while ago.

The first Mail_Queue package was released in September of 2002, the 1.0.0 stable release followed in December of the same year. Because its 1.0.0 was compatible with PHP4, we keep it backwards compatible with PHP4 until we release Mail_Queue2-0.1.0.

A principal

A lot of people argue that with the official end of life of PHP4, one should break BC anyway. But here is why you should not.

  • Even though we all love to use PHP5, there is still a lot of PHP4 in the enterprise. And like it or not, many of those apps use PEAR, and not your funky PHP5 framework.

  • How do you keep so-called necessary and unnecessary BC breaks apart? From another point of you (which is not your own), there is always a necessary BC break to fix implement something else.

  • Because there is no such thing as small or acceptable BC breaks. There are BC breaks or there are none, it's one of those things that is black and white.

BC in other frameworks

I know for a fact that ezComponents is very strict on BC. I cannot comment on Solar or Symfony, but at least in Solar's case, I'm assuming that adhere to BC as well since a some former PEAR developers are active and they also follow the PEAR Coding Standard in many respects.

Zend Framework?

A friend of mine said that if Zend Framework really kept BC, we would at 10.0.0 and not on 1.9.3.

Reasons why Zend Framework likes to break BC, even though it advertises full BC:

  • No versioning per components but per framework.

  • Missing peer review and QA leads to unstable code in a so-called stable release. (Which in turn also fakes the stability of the entire framework since it suggests that a component that was added a couple weeks ago is as stable as a component added in 2007.)

  • Because it fixes an "issue". (Biggest WTF of them all.)

The issue in question, I'm not even sure what they were trying to fix. Supposedly some developers found it too hard or did not understand how to write an adapter for Zend_Db and someone committed a fix in the 1.9 tree/branch and apparently it was OK to break BC because it was the easier way out.

I haven't updated some projects since late 1.8.x because of these BC breaks because no one can tell me what the issue is and I don't have a day to debug my application to figure out where and how it breaks. This is annoying as hell, especially since they are supposed to be tailored to the business.

On a side-note, I know of a couple components (e.g. Zend_Session) which really deserve (!) a BC break and don't get one until 2.0. And I totally get why, but why is it OK in some cases? All BC breaks fix issues!

(Btw, as I finish this post, I see an email to zf-contributor@ in my inbox where someone considers pulling the 1.9.3 release (because it obviously breaks BC). Guess I didn't spread that much FUD after all.)

Defined tags for this entry: , , , , ,

My first PHP Unconfernce

Friday, September 18. 2009

I went to Hamburg last weekend to visit the PHP Unconference, which was probably my first conference ever. I've been to a couple barcamps and other smaller events, but anyway, this felt more like a real conference to me. That is, if I exclude ALA and the various ad:tech's I had to go to.

The reasons why I usually avoid tech conferences include foremost the price tag (working for myself, I can technically label it as an expense, but I still have to pay for it), doubts that it'll be worth it in terms of knowledge gained and probably time. I tend to catch up with people outside of conferences (when they are in Berlin :-)) and that has worked well for me.

I'm glad I set all these things aside for Hamburg (and it was all too easy). A lot of people expressed how much they liked their (often 3rd) PHP Unconference, and I can second, or third that — job really well done. Ulf Wendel took it one step further, blogged and asked, "Is perfect too boring?", because everything worked out so well. I guess I would say, "No, it's not boring", and I'm inclined to add, "Thanks, it really felt like having a weekend off, yet I still learned something and met a ton of nice people (or connected online nicknames to real faces)!".

I can definitely see why people visit the PHP Unconference each year, and I'll be one of them next year! ;-)

As I said, I had a great time, both my topics were accepted too. One was merged with another PHP performance talk which was overbooked with PHP VIPs which is why I decided to listen to Kore Nordmann's talk on CouchDB instead, and the other one ("Deployment") — I kind of overslept. And I'm sorry about that! I'll make sure to avoid party, party Hamburg next year.

Here are the slides for my Zend Framework (performance) talk, I hope you find them interesting:

The slides and speaker-notes contain…

  • a small intro as of why I think it's worth while to get into the ring with it
  • hints and pointers on general PHP optimization
  • I detail on a couple components (e.g. things to look out for and how to overcome them)

Make sure to check the speaker notes (using this link) — I didn't put everything in there, but a lot.

(The deployment slides will be up later this weekend.)

This also reminds me to improve my presentation-fu. I need something as kickass as keynote, but for Windows (currently). If anyone has a pointer, let me know. ;-)

Defined tags for this entry: , , ,

PHP performance III -- Running nginx

Sunday, May 31. 2009

Since part one and two were uber-successful, here's an update on my Zend Framework PHP performance situation. I've also had this post sitting around since beginning of May and I figured if I don't post it now, I never will.

Disclaimer: All numbers (aka pseudo benchmarks) were not taken on a full moon and are (of course) very relative to our server hardware (e.g. DELL 1950, 8 GB RAM) and environment. The application we run is Zend Framework-based and currently handles between 150,000 and 200,000 visitors per day.

Why switch at all?

In January of this year (2009), we started investigating the 2.2 branch of the Apache webserver. Because we used Apache 1.3 for forever, we never had the need to upgrade to Apache 2.0, or 2.2. After all, you're probably familiar with the don't fix it, if it's not broken-approach.

Late last year we ran into a couple (maybe) rather FreeBSD-specific issues with PHP and its opcode cache APC. I am by no means an expert on the entire situation, but from reading mailing lists and investigating on the server, this seemed to be expected behavior — in a nutshell: Apache 1.3 and a large opcode cache on a newer versions of FreeBSD (7) were bound to fail with larger amounts of traffic.

We tried bumping up a few settings (pv entries), but we just ran into the same issue again and again.

Because the architecture of Apache 2.2 and 1.3 is so different from one another (and upgrading to 2.2 was the proposed solution), I went on to explore this upgrade to Apache 2.2. And once I completed the switch to Apache 2.2, my issues went away.

So far, so good!

Performance?

On the performance side we experienced rather mediocre results.

While we benched that a static file could be read at around 300 requests per second (that is a pretty standard Apache 2.2 install, sans a couple unnecessary modules), PHP (mod_php) performed at a fraction of that, averaging between 20 and 23 requests per second.

Myth: Hardware is cheap(, developer time is not)!

Before some people yell at me for trying to optimize my web server, one needs to take the costs of scaling (to a 100 requests per seconds) into account.

One of those servers currently runs at 2,600.00 USD. The price tag adds up to an additional 10,400.00 USD in order to scale to a 100 (lousy) requests per seconds. Chances are of course, that the hardware is slightly less expensive since DELL gives great rebates — but the 8 GB of (server) RAM and the SAS disks by themselves are melting budgets away.

And on top of all hardware costs, you need add setup, maintenance and running costs (rack space, electricity) for an additional four servers — suddenly, developer time is cheap. ;-)

So what do we do? Nginx to the rescue?!


Continue reading "PHP performance III -- Running nginx"

Cannot send headers; headers already sent in , line 0

Thursday, April 9. 2009

Yesterday, I started upgrading a test server to Apache 2.2 and PHP 5.2.9 and ran into this is a problem, which then bugged me for a good day. So here's the run down:

My test.php:

<?php
ob_start();
system('file -i -b /path/file.txt');
$contents = ob_get_contents();
ob_end_clean();
var_dump(headers_sent($file, $line), $file, $line);

My test2.php:

<?php
$contents = shell_exec('file -i -b /path/file.txt');
$contents = trim($contents);
var_dump(headers_sent($file, $line), $file, $line);

test.php claims that headers were send, while test2.php shows the expected behavior. The difference is not only system() vs. shell_exec(). But the ob_start(). Apparently headers_sent() returns true right after system() was called. We've used output buffering to trap whatever came back from system and this works on PHP 5.2.6 and 5.2.8, but not on 5.2.9.

Since I was after the response of the shell call, using shell_exec() is what should have been used to begin with — still. This is pretty weird.

Update: I've opened #47973 - Yay, bug fixed!

Defined tags for this entry: ,