PHP: So you'd like to migrate from MySQL to CouchDB? - Part III

If you enjoyed this article, please leave a comment, subscribe to my RSS feed and/or follow me on Twitter. Thank you very much!

This is part three of a beginner series for people with a MySQL/PHP background. Apologies for the delay: this blog entry has been in draft since the 13th of December last year (2009).

Recap

Part I introduced the CouchDB basics, which included basic requests using PHP and cURL. Part II focused on create, read, update and delete operations in CouchDB. I also introduced my nifty PHP CouchDB client called ArmChair!

ArmChair is my own very simple and (hopefully) easy-to-use approach to accessing CouchDB from PHP. The objective is to develop it with each part of this series to make it a more comprehensive solution.

Part III

Part three will target basic view functions in CouchDB — think of views as a WHERE-clause in MySQL. They are similar, but also not. :-)

Map-Reduce-Thingy

If you read up on CouchDB before coming to this blog, you have probably heard of map-reduce, there or maybe elsewhere. A lot of people attribute Google's success to map-reduce, because it lets them process a lot of data in parallel (across multiple cores and/or machines) in relatively little time.

I guess the PageRank in Google Search or Google Analytics are examples of where it could be used.

In the following, I'll try to explain what map-reduce is, for people without a science degree. (And that includes me!)

Map

Generally, map-reduce is a way to process data. It's made of two things, map and reduce.

The idea is that the map-function is very robust and it allows data to be broken up into smaller pieces so it can be processed in parallel. In most cases the order data is processed in doesn't really matter. What generally counts is that it is processed at all. And since map allows us to run the processing in parallel, it's easier to scale out. (That's the secret sauce!)

And when I write scale-out, I'm not suggesting you build a cluster of 1000 servers in order to process a couple thousand documents. In this case it's already sufficient to use all the cores in my own computer by running the map task in parallel.

In CouchDB, the result of map is a list of keys and values.
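To make the "list of keys and values" a bit more concrete, here is a minimal sketch in plain JavaScript that mimics what happens when a map function is run. The `runMap` helper and the sample documents are made up for illustration. CouchDB runs your map function for you and provides `emit` itself, so treat this as a mental model rather than how CouchDB is implemented.

```javascript
// Hypothetical helper that mimics how a map function is applied to documents.
// In CouchDB, emit() is a global provided by the view server; here it is
// passed in as a second argument to keep the sketch self-contained.
function runMap(mapFn, docs) {
  const rows = [];
  for (const doc of docs) {
    // Every document is handled on its own, which is why map parallelizes well.
    const emit = (key, value) => rows.push({ id: doc._id, key, value });
    mapFn(doc, emit);
  }
  // View rows are kept ordered by key (simplified to plain string comparison).
  return rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// Made-up sample documents; the last one has no user and emits nothing.
const sampleDocs = [
  { _id: "b", user: "your mom" },
  { _id: "a", user: "till" },
  { _id: "c" },
];

const mapRows = runMap((doc, emit) => {
  if (doc.user) {
    emit(doc.user, null);
  }
}, sampleDocs);

console.log(mapRows);
```

Each row carries the id of the document it came from, the emitted key and the emitted value. Documents that don't call `emit` simply don't show up.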

Reduce

Reduce is called once the map-part is done. It's an optional step in terms of CouchDB — not every map requires a reduce to follow.

Real world example

  • take a simple photo application (such as Flickr) with comments
  • use map to sort through the comments and emit the names of users who left one
  • use reduce to only get unique references and see how many comments were left by each of these users

In SQL:

SELECT user, count(*) FROM comments GROUP BY user
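The same idea can be sketched in a few lines of plain JavaScript, without CouchDB involved at all. The `comments` array is made-up sample data, and the two steps only illustrate the concept: map turns each comment into a key/value pair, reduce collapses the pairs that share a key.

```javascript
// Made-up sample data: three comments from two users.
const comments = [
  { photo: "1", user: "till" },
  { photo: "1", user: "your mom" },
  { photo: "2", user: "till" },
];

// Map step: one [key, value] pair per comment.
const pairs = comments.map((comment) => [comment.user, 1]);

// Reduce step: collapse all pairs that share a key into a single count.
const counts = {};
for (const [user, n] of pairs) {
  counts[user] = (counts[user] || 0) + n;
}

console.log(JSON.stringify(counts)); // {"till":2,"your mom":1}
```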

Why the fuss?

Just so people don't feel offended: map-reduce is slightly more complicated than my example SQL query, but it's also not some secret-super-duper thing. Its real strength is parallelization, which requires the ability to break data into chunks and process them independently. The end.

An example

My example is a photo service. I have two users — myself and your mom! ;-) We both upload pictures.

Setup

My documents may look like the following:

{
  "_id" : "1",
  "type" : "photo",
  "title" : "A pretty cool photo",
  "description" : "This is a pretty fucking cool photo",
  "user" : "till"
}

{
  "_id" : "2",
  "type" : "photo",
  "title" : "Another pretty cool photo",
  "description" : "This is just another pretty fucking cool photo",
  "user" : "till"
}

{
  "_id" : "3",
  "type" : "photo",
  "title" : "A picture",
  "description" : "Not so cool, but still alright",
  "user" : "your mom"
}

{
  "_id" : "randomness",
  "type" : "comment",
  "photo" : "2",
  "text" : "My photo",
  "user" : "till"
}

Map-only

And here's a view map-function to get all my photos:

function(doc) {
  if (doc.type == 'photo') {
    emit(doc.user, null);
  }
}

Embed it in a document like this:

{
  "_id" : "_design/lookup",
  "views" : {
      "by_user" :  {
          "map" :  "function(doc) { if (doc.type == 'photo') { emit(doc.user, null); } }"
      }
  }
}
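Squeezing a function into a one-line JSON string by hand gets old quickly. One possible way around it, sketched here in plain JavaScript (run with Node.js, which is my assumption and not something this series requires), is to write the map function as real code and let `toString()` and `JSON.stringify()` do the escaping. The variable names are made up for the example.

```javascript
// Write the map function as real JavaScript so an editor can check the syntax.
// emit() is never called here; we only need the function's source code.
const byUser = function (doc) {
  if (doc.type == 'photo') {
    emit(doc.user, null);
  }
};

// CouchDB expects the function source as a string inside the design document.
const designDoc = {
  _id: "_design/lookup",
  views: {
    by_user: { map: byUser.toString() },
  },
};

// Print the JSON that could be pasted into Futon.
console.log(JSON.stringify(designDoc, null, 2));
```

The printed JSON has the same shape as the design document above, with the newlines inside the function escaped for you.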

What does the request look like?

curl 'http://localhost:5984/photos/_design/lookup/_view/by_user?key="till"'
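Note that the key in the query string is JSON, which is why "till" carries double quotes. If you build the URL in code instead of typing it, a small sketch like the following (plain JavaScript using the standard `URL` class; the variable names are mine) takes care of the quoting and the URL-encoding:

```javascript
// The view URL from the curl example above.
const viewUrl = new URL(
  "http://localhost:5984/photos/_design/lookup/_view/by_user"
);

// View keys are JSON values, so a string key needs its double quotes.
viewUrl.searchParams.set("key", JSON.stringify("till"));

console.log(viewUrl.toString());
// http://localhost:5984/photos/_design/lookup/_view/by_user?key=%22till%22
```

Since the view emits null as its value, you would add `include_docs=true` to the request if you want the full documents in the response.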

And just like that, you wrote a map-function in CouchDB.

The equivalent in SQL:

SELECT * FROM photos WHERE user = 'till'

Map-Reduce

Get a list of users

Map:

function (doc) {
  if (doc.user) {
    emit(doc.user, null);
  }
}

This basically gets us a list with "till", "till", "till" and "your mom": one row per document that has a user, so the comment counts as well. The assumption here is that each user has left at least one document (a picture or a comment), which may be a little flawed but works great for my example. :-)

To make the keys unique, we use the following reduce:

function (keys, values) {
  return true;
}

And here's the request:

curl 'http://localhost:5984/photos/_design/lookup/_view/userlist?group=true'

(Note: The reduce doesn't do much, but it allows us to use group=true.)

And here's the SQL:

SELECT DISTINCT user FROM photos
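If it's unclear why a reduce that just returns true gives us distinct users, here is a simplified picture of grouping in plain JavaScript. The `groupReduce` helper is hypothetical and the rows are typed in by hand. CouchDB does this internally, and its real reduce gets more detailed keys than the sketch passes along.

```javascript
// Hypothetical helper: a simplified picture of what group=true does.
// Rows with the same key are collected, then reduce runs once per group.
function groupReduce(rows, reduceFn) {
  const groups = new Map();
  for (const { key, value } of rows) {
    if (!groups.has(key)) {
      groups.set(key, []);
    }
    groups.get(key).push(value);
  }
  // Simplification: we pass the bare key instead of CouchDB's key/id pairs.
  return [...groups].map(([key, values]) => ({
    key,
    value: reduceFn([key], values),
  }));
}

// Rows as a map function emitting (doc.user, null) might produce them.
const userRows = [
  { key: "till", value: null },
  { key: "till", value: null },
  { key: "your mom", value: null },
];

// The reduce from above: its return value doesn't matter, the grouping does.
const userList = groupReduce(userRows, function (keys, values) {
  return true;
});

console.log(userList);
```

The result has one row per distinct key, which is all the user list needs.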

Number of photos by user

Let's assume you need your users and the number of photos they uploaded:

Map:

function (doc) {
  if (doc.type == 'photo') {
    emit(doc.user, 1);
  }
}

Reduce:

function (keys, values) {
  return sum(values);
}

Request:

curl 'http://localhost:5984/photos/_design/lookup/_view/count?group=true'

SQL (you've seen something like it before):

SELECT user, count(*) FROM photos GROUP BY user
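One detail worth knowing about this reduce: CouchDB may run it on chunks of rows first and then run it again on the partial results, passing true as a third argument usually called rereduce. Summing survives that just fine, which is what makes it parallel-friendly. Here is a minimal sketch in plain JavaScript. The `sum` helper stands in for the one CouchDB provides to view functions, and the chunking is made up for illustration.

```javascript
// Stand-in for the sum() helper that CouchDB provides to view functions.
const sum = (values) => values.reduce((a, b) => a + b, 0);

// The reduce from above, with the third argument spelled out.
function countReduce(keys, values, rereduce) {
  // Works in both modes: a sum of partial sums is still the total.
  return sum(values);
}

// Pretend the three emitted 1s were processed in two separate chunks.
const partialA = countReduce(null, [1, 1], false); // 2
const partialB = countReduce(null, [1], false); // 1

// Rereduce step: the values are now results of earlier reduce calls.
const total = countReduce(null, [partialA, partialB], true);

console.log(total); // 3
```

A reduce that returned something like `values.length` would break in the rereduce step, because it would count partial results instead of adding them up.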

Writing views

Writing JSON in Futon is pretty tedious. I wish I could say I like it, but I don't.

I also avoid creating views through code (though I'll show you an example in the upcoming part IV). The easiest way, when you don't want to rely on another tool, is to write the view code in your favorite editor/IDE and then copy it into Futon.

When you hit "save document" and it doesn't work, Futon^H^H^H^H^HCouchDB will complain. ;-)

CouchApp

In case you'd like to raise the bar a little: meet CouchApp!

The following guide (thanks, Jan) shows you how to install it and how to use it.

The biggest advantage of using CouchApp is that you'll be able to add your views to version control and so on, something a lot of people value. :-) A minor bonus is that you won't have to use Futon to fiddle with the views; instead, it's semi-integrated with your favorite editor/IDE, etc.

Setup

sudo easy_install -U couchapp
mkdir project
cd project
couchapp init

Create a view

couchapp generate view till-and-your-mom
nano views/till-and-your-mom/map.js
nano views/till-and-your-mom/reduce.js
couchapp push . http://localhost:5984/db

FIN

And without further ado — that was a little introduction to map-reduce. I'd love to say something like, "More complex examples next time!", but it's really so simple because CouchDB is cool like that. For further reading, I'd recommend "Views for SQL Jockeys".

If you have any specific questions, feel free to comment and I'll take your questions into account for the next part.

I'll also make sure to write more PHP next time. ;-)

'Til next time!
