PHP: So you'd like to migrate from MySQL to CouchDB? - Part II

Thursday, November 12. 2009
Comments

This is part II of my introductory series to move from MySQL a relational database (management) system to CouchDB. I will be using MySQL as an example. Part I of this series is available here.

Recap

In part I, I introduced CouchDB by explaining its basics. I continued by showing a simple request to create a document using curl (on the shell) and expanded how the same request could be written in PHP (using ext/curl) — or in HTTP_Request2 or with phpillow.

Part II will focus on the most basic operations and help you build a small wrapper for CouchDB. The code will be available on Github.

Getting data into CouchDB

As I explained before — CouchDB is all HTTP.

But for starters, we'll take a shortcut here and briefly talk about CouchDB's nifty administration frontend called futon. Think of Futon as a really easy to use phpMyAdmin. And by default futon is included in your CouchDB installation.

Features

Futon will allow you to virtually do anything you need:

  • create a database
  • create a document
  • create a view
  • build the view
  • view ;-) (documents, views)
  • update all of the above
  • delete all of the above
  • replicate with other CouchDB servers

Assuming the installation completed successfully, and CouchDB runs on 127.0.0.1:5984, the URL to your very own Futon is http://127.0.0.1:5984/_utils/. And it doesn't hurt to create a bookmark while you are at it.

Why GUI?

Purists will always argue that using a GUI is slower than for example hacking your requests from the shell.

That may be correct once you are a level 99 CouchDB super hero, but since we all start at level 1, Futon is a great way to interact with CouchDB to learn the basics. And besides, even level 99 CouchDB super heroes sometimes like to just click and see, and not type in crazy hacker commands to get it done.

I'll encourage everyone to check it out.

Read, write, update and delete.

Sometimes also referred to as CRUD (create, read, update, delete) — read, write, update and delete are the basics most web applications do.

Since most of you have done a "write a blog in X"-tutorial before — and I personally am tired of writing blogs in different languages or with different backends — let's use another example.

Think about a small guestbook application, it does all of the above — a fancy guest will even do update. For the sake of simplicity, I'll skip on the frontend in this example and we'll work on the backend and essentially create a small wrapper class for CouchDB.

Operations

By now — "CouchDB is all HTTP" — should sound all familiar. So in turn, all these CRUD operations in CouchDB translate to the following HTTP request methods:

  • write/create - PUT or POST
  • read - GET
  • update - PUT or POST
  • delete - DELETE

On write

Whenever you supply an ID of a new document along with the document, you should use PUT.

When you don't care about the document ID, use POST instead, and CouchDB will generate a unique ID for you.

This unique ID will not look like an autoincremented integer, but we should not cling to this concept anyway. Without diving into too advanced context now, but the auto_increment feature in MySQL is a little flawed in general and in a distributed context especially. More on this (maybe) in a later part of this series — in the mean-time, check out Joshua Schachter's post.

On update

By default, CouchDB keeps a revision ID of each document. To many this is a pretty cool feature — out of the box, so to speak. But there are two very important and fundamental things to be aware of.

  1. CouchDB will keep previous revisions of a document around until you run compact on the database. Think of compact as a house keeping chore. It will wipe your database clean of previous revisions and even optimize the indices (in case you had a lot of data changing operations in the mean time). For CouchDB revisions are especially important in a distributed context (think replication — more on this later) and while it's cool to have them, they should not be used as a feature and be exposed to the user of your application.

  2. In case we decide a document, we always have to provide the previous (or current) revision of the document. This sounds strange, but the reasons are simple — in case another update gets in between all we have to do is provide the necessary interfaces and workflows in our application to alert the user and avoid a possible conflict.

CouchDB and the HTTP standard

CouchDB's API adheres to the above in 99.999999999% of the time. And it only breaks the pattern once. The exception to the rule is that when you bulk request multiple documents, which is strictly speaking a GET operation CouchDB will allow you to post in this case.

The reason for this is that the length of GET request is limited (by RFC) and if we exceeded this by requesting too many document IDs, we would hit a hard limit. To get around this, the operation is POST — but more on this later.

Requirements

For the following examples we assume a working CouchDB installation and a database called guestbook. No admins are set up — we can read and write without permission.

For simplicty, we imagine a form with the following fields:

  • author
  • entry

... and in addition to those two keys that may be populated by the user we add:

  • type (always: guestbook)
  • added (a date of some kind)

... the last two are not absolutely necessary, but will come handy in future iterations of this tutorial.

Also, on the tech side, we need a webserver with PHP and HTTP_Request2 (pear install HTTP_Request2-alpha) in your include_path. :-)


Continue reading "PHP: So you'd like to migrate from MySQL to CouchDB? - Part II"

CouchDB: checkpointing on view building

Wednesday, November 4. 2009
Comments

I'm posting about this tidbit because Google seemed to know nothing about it.

Anyway, during the view building process, we may see the following in the couchdb.log (level = info, at least, in local.ini):

[...] [info] [...] checkpointing view update at seq 78163851 for citations _design/erlang
[...] [debug] [...] New task status for citations _design/erlang: Processed 17844590 of 107444308 changes (16%)
[...] [debug] [...] New task status for citations _design/erlang: Processed 17848060 of 107444308 changes (16%)
[...] [debug] [...] New task status for citations _design/erlang: Processed 17850878 of 107444308 changes (16%)
[...] [info] [...] checkpointing view update at seq 78170348 for citations _design/erlang
[...] [debug] [...] New task status for citations _design/erlang: Processed 17851087 of 107444308 changes (16%)

The above tells us, that CouchDB saved the current process during indexing and allows us to resume in case we decide to restart the CouchDB and interrupt the indexing process. I've tried it myself a couple times with CouchDB 0.10.0 — I also had not noticed this feature prior to it.

And why is this useful in particular? The biggest use for this is upgrading computing power (e.g. on AWS EC2) when we realize we need MOAR and then we are still able to resume when we boot into more resources.

Sidenote: Checkpointing will not help if indexing is stopped and the view is adjusted/changed. Or when the indexing stopped due to an error, such as a crash.

That's all, kids.

Defined tags for this entry: , ,

PHP: So you'd like to migrate from MySQL to CouchDB? - Part I

Saturday, October 31. 2009
Comments

Update (2009-10-13): I posted part II!

This is the first part of a series. I'll start off by introducing CouchDB — from a PHP side, then I'll demo a couple basic use cases and I later on, I'll dive into migrations from MySQL.

My idea is to introduce CouchDB to a world where database-driven development generally refers to MySQL. By no means, this is meant to be disrespectful to MySQL, or SQL-databases in general. However, I'm a firm believer in using the right tool for the job.

First things first!

First off, before using CouchDB and maybe eventually replacing MySQL with it, we need to ask ourself the "Why?"-question.

And in order to be capable of more than a well-educated guess we need to familiarize ourselves with the CouchDB basics.

Basics

  • Document-oriented and schema-less storage.
  • Erlang (for rock-solid-scaling-goodness).
  • RESTful HTTP-API (we'll get to that).
  • Everything is JSON - request data, storage, response!

Document-oriented

In a document-oriented as to opposed to a relational store, the data is not stored in table, where data is usually broken down into fields. In a document-oriented store each record is stored along side and can have its own characteristics — properties of any kind.

As an example, consider these two records:

Till Klampaeckel, Berlin
Till Klampaeckel, till@php.net, Berlin, Germany

In a relational store, we would attempt to break down, or normalize, the data. Which means that we would probably create a table with the columns name, email, city and country.

Consider adding another record:

Till Klampaeckel, +49110, till@some.jabber

(Just fyi — this is not my real phone number!)

Looking for an intersection in the records, the name is the only thing this record has in common with the previous two. With a relational database, we would either have to add a column for phone number and chat, or we would start splitting off the data into multiple tables (e.g. a table called phone and one called chat) in order to get grip.

With a document-oriented database — such as CouchDB — this is not an issue.

We can store any data, constraints do not apply.

Erlang

Erlang was invented a while ago, by Ericsson, when it was still sans Sony. In a nutshell, Erlang's true strength is reliability and stability. It also manages to really utilize all the resources modern hardware has to offer since it's a master of parallelization.

CouchDB is written in Erlang, and also accepts view code written in Erlang. More on views later.

RESTful HTTP-API

For starters, a lot of HTTP-APIs claim to be RESTful, most of them are not. HTTP has so called request verbs (DELETE, GET, HEAD, POST, PUT among them) and a lot of APIs don't use them to the fullest extend, or rather not all.

Instead, most APIs are limited GET and maybe use a little POST. An example of such an API is the Flickr API.

Most of us are familiar with GET and POST already. For example, when you opened the web page to this blog entry, the browser made a GET-request. If you decide to post a comment later on — you guessed it, that's a POST-request.

Aside from its basic yet powerful nature, HTTP is interesting in particular because it is the least common multiple in many programming language. Whatever you use — C#, PHP, Python, Ruby — these languages know how to talk HTTP. And even better — most of them ship pretty comfortable wrappers.

JSON

JSON — it's godsend for those of us who never liked XML.

It's very lightweight, yet we able to represent lists and objects, integers, strings — most data types you would want to use. A clear disadvantage of JSON is that it lacks validation (think DTD), and of course comments — ha, ha!

Why, oh why?

So along with "Why?", we should consider the following:

  • Does it make sense?
  • Is CouchDB (really) the better fit for my application?
  • What is my #1 problem in MySQL, and how does CouchDB solve it?

And if we are still convinced to migrate all of our data, we'll need to decide on an access wrapper.

It's all HTTP, right?

By now, everyone has heard that CouchDB has a RESTful HTTP-API. But what does that imply?

It means, that we won't need to build a new extension in PHP to be able to use it. There's already either ext/socket or ext/curl — often both — in 99% of all PHP installs out there. Which means that PHP is more than ready to talk to CouchDB — right out of the box.

Since I mentioned JSON before — today ext/json is available in most PHP installs as well. If however we happen to be one of the few unfortunates who don't have and cannot get this extension, we should use Services_JSON instead.

Install it!

CouchDB installations are available in most Linux and Unix distributions. On MacOSX, get CouchDBXthe one-click CouchDB package, and there's a work in process for Windows as well. Especially interesting for those who run Ubuntu 9.10 (which has been released a few days ago), there's already a CouchDB install included.

Ubuntu/Debian:

apt-get install couchdb

FreeBSD:

cd /usr/ports/databases/couchdb && make install clean

Continue reading "PHP: So you'd like to migrate from MySQL to CouchDB? - Part I"

Small notes on CouchDB's views

Wednesday, October 21. 2009
Comments

I've been wrestling with a couple views in CouchDB currently. This blog post serves as mental note to myself, and hopefully to others. As I write this, i'm using 0.9.1 and 0.10.0 in a production setup.

Here's the environment:

  • Amazon AWS L Instance (ami-eef61587)
  • Ubuntu 9.04 (Jaunty)
  • CouchDB 0.9.1 and 0.10.0
  • database size: 199.8 GB
  • documents: 157408793

On to the tips

These are some small pointers which I gathered by reading different sources (wiki, mailing list, IRC, blog posts, Jan ...). All those revolve around views and what not with a relatively large data set.

Do you want the speed?

Building a view on a database of this magnitude will take a while.

In the beginning I estimated about week and a half. And it really took that long.

Things to consider, always:

  • upgrade to trunk ;-) (or e.g. to 0.10.x)
  • view building is CPU-bound which leads us to MOAR hardware — a larger instance

The bottom line is, "Patience (really) is a virtue!". :-)

A side-note on upgrading: Double-check that upgrading doesn't require you to rebuild the views. That is, unless you got time.

View basics

When we initially tested if CouchDB was made for us we started off with a bunch off emit(doc.foo, doc)-like map functions in (sometimes) temporary views. On the production data, there are a few gotcha's.

First off — the obvious: temporary views are slow.

Back to JavaScript

Emitting the complete document will force CouchDB to duplicate data in the index which in return needs more space and also makes view building a little slower. Instead it's suggested to always emit(doc.foo, null) and then POST with multiple keys in the body to retrieve the documents.

Reads are cheap, and if not, get a cache.

doc._id

In case you wonder why I don't do emit(doc.foo, doc._id)? Well, that's because CouchDB is already kind enough to retrieve the document's ID anyway. (Sweet, no?)

include_docs

Sort of related, CouchDB has a ?include_docs=true parameter.

This is really convenient — especially when you develop the application.

I gathered from various sources that using them bears a performance penalty. The reason is that include_docs issues another b-tree lookup for every row returned in the initial result. Especially with larger sets, this may turn into a bottleneck, while it can be considered OK with smaller result sets.

As always — don't forget that HTTP itself is relatively cheap and a single POST request with multiple keys (e.g. document IDs) in the body is likely not the bottleneck of your operation — compared to everything else.

And if you really need to optimize that part, there's always caching. :-)

Need a little more?

Especially when documents of different types are stored into the same database (Oh, the beauty of document oriented storage!), one should consider the following map-example:

if (doc.foo) {
    emit(doc.foo, null)
}

.foo is obviously an attribute in the document.

JavaScript vs. Erlang

sum(), I haven't found too many of these — but with version 0.10+, the CouchDB folks implemented a couple JavaScript functions in Erlang, which is an easy replacement and adds a little speed on top. :-) So in this case, use _sum.

Compact

Compact, I learned, knows how to resume. So even if you kill the process, it'll manage to resume where it left off before.

When you populate a database through bulk writes, the gain from a compact is relatively small and is probably neglectable. Especially because compacting a database takes a long while. Keep in mind that compaction is disk-bound, which is often one of the final and inevitable bottlenecks in many environments. Unless hardware is designed ground up, this will most likely suck.

Compaction can have a larger impact when documents are written one by one to a database, or a lot of updates have been committed on the set.

I remember that when I build another set with 20 million documents one by one, I ended up with a database size of 60 GB. After I compacted the database, the size dropped to 20 GB. I don't have the numbers on read speed and what not, but it also felt more speedy. ;-)

Fin

That'd be it. More next time!

Defined tags for this entry: , , , ,

CouchDB on Ubuntu on AWS

Friday, August 28. 2009
Comments

Here's a little HowTo on how to setup CouchDB on an AWS EC2 instance. But outside of AWS (and EC2), this setup works on any other Ubuntu server, and I suppose Debian as well.

Getting started

The following steps are a rough draft, or a sketch on how to get started. I suggest that you familiarize yourself with what all of these things do. If you want to skip on the reading and just get started, this should work anyway.

  • you (obviously) need an AWS account (and log into the AWS console).
  • you need a custom security group (make sure to open up for http traffic)
security_group_001

security_group_002

  • create an EBS volume (Take a deep breath and think about the size of the volume. Keep in mind that you don't want to run into space issues right away and that allocated storage (even idle) costs you money (e.g. 400 GB =~ 40 USD (per month), excluding the i/o).)
  • create a keypair (It'll prompt you to download a foobar.pem, I placed mine on my local machine in ~/.ssh/ and ran chmod 400 on it.)
  • get an elastic IP
  • start the instance
    • select an AMI (I selected alestic's 64bit server Ubuntu 9.04 AMI.)
    • assign your own security group AND the defaults one
    • select your keypair

Woo! We made it that far.

The instance should boot and once this is done (green indicates all went well), we want to associate the previously created EBS volume and the elastic IP to said instance.

Once these steps are complete, go on the instance screen, click on your running instance and then click on "Connect". It'll show you the ssh command to connect to your instance -- it should be similar to this:

ssh -i .ssh/foobar.pem root@ec2-W-X-Y-Z.compute-1.amazonaws.com

The W-X-Y-Z part is most likely replaced with your elastic IP.

This process is not very automated yet, but at least you have an instance up and running. The next step is to try to login and see if the EBS was attached — if all went well, you should have /mnt.


Continue reading "CouchDB on Ubuntu on AWS"