
SQL MAX() and GROUP BY for CouchDB

While rewriting a couple of SQL statements for CouchDB, we got stuck when we wanted to do the equivalent of SELECT MAX(...), id ... GROUP BY id.

MySQL

Imagine the following SQL table with data:

mysql> SHOW FIELDS FROM deploys;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| project     | varchar(10) | NO   |     | NULL    |       |
| deploy_time | datetime    | NO   |     | NULL    |       |
+-------------+-------------+------+-----+---------+-------+
2 rows in set (0.01 sec)

In order to get the latest deploy for each project, I'd issue:

mysql> SELECT MAX(deploy_time), project FROM deploys GROUP BY project;
+---------------------+----------+
| MAX(deploy_time)    | project  |
+---------------------+----------+
| 2013-10-04 22:01:26 | project1 |
| 2013-10-04 22:02:17 | project2 |
| 2013-10-04 22:02:20 | project3 |
+---------------------+----------+
3 rows in set (0.00 sec)

Simple. But what do you do in CouchDB?

CouchDB

My documents look like this:

{
  "_id": "hash",
  "project": "github-user/repo/branch",
  "deploy_time": {
     "date": "2013-10-04 22:02:20",
     /* ... */
  },
  /* ... */
}

So, after more than a couple of hours spent trying to wrap our heads around map/reduce in CouchDB, it's working.

Here's the view's map function:

function (doc) {
  if (doc.project) {
    emit(doc.project, doc.deploy_time.date);
  }
}

This produces nice key-value pairs, and in fact multiple pairs per project.
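
To illustrate, querying the view with the reduce step disabled (reduce=false) returns one row per deploy document, along these lines (IDs and dates made up):

{
  "rows":[
    {"id":"deploy-1", "key":"github-user/repo/branch", "value":"2013-10-04 21:58:11"},
    {"id":"deploy-2", "key":"github-user/repo/branch", "value":"2013-10-04 22:01:26"},
    {"id":"deploy-3", "key":"github-user/other-repo/branch", "value":"2013-10-04 22:02:17"}
  ]
}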

And because the map function emits multiple rows per project, we need to reduce our set.

So here is the reduce:

function (keys, values, rereduce) {
  var max_time = 0, max_value, time_parsed;
  values.forEach(function(deploy_date) {
    time_parsed = Date.parse(deploy_date.replace(/ /gi,'T'));
    if (max_time < time_parsed) {
      max_time = time_parsed;
      max_value = deploy_date;
    }
  });
  return max_value;
}

The little .replace(/ /gi,'T') took especially long to figure out. Special thanks to Cloudant's señor JS-date-engineer Adam Kocoloski for helping out. ;-)

Step by step

  • iterate over values
  • fix each value (add the "T") so that SpiderMonkey's Date.parse() accepts it
  • parse, and compare
  • return the "latest" in the end

A note of advice: to save yourself some trouble, install a local copy of SpiderMonkey and test your view code there rather than in your browser; CouchDB runs its views on SpiderMonkey, and your browser's engine may behave differently.
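
For example, a quick sanity check of the date parsing in the SpiderMonkey shell might look like this (assuming the js binary is on your PATH); it should print a millisecond timestamp rather than NaN:

$ js -e 'print(Date.parse("2013-10-04 22:02:20".replace(/ /g, "T")))'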

Open the view in your browser: http://localhost/db/_design/deploys/_view/last_deploys?group=true

{
  "rows":[
    {
      "key":"testowner/testname/testbranch",
      "value":"2013-09-13 11:41:03"
    },
    {
      "key":"testowner/testname/testbranch2",
      "value":"2013-09-12 16:48:39"
    }
  ]
}

Yay, works!

FIN

That's all for today.

Hosted MySQL: Amazon RDS (and backups)

Among all the different technologies in our stack, we also use MySQL. While we still run MySQL (or Percona Server) ourselves, we selected a managed solution to power parts of our production infrastructure: a Multi-AZ setup with Amazon's RDS.

AZ is Amazon-speak for "availability zone", essentially a datacenter. RDS stands for Relational Database Service.

Judging from my experience with our own setups where EBS is in the mix, I have to say that Amazon does an outstanding job of hiding EBS's potential issues from RDS customers. The price tag of the setup can be intimidating at first, but as far as TCO is concerned, RDS is the complete package: managed from every which way and painless for the most part.

RDS in a nutshell

RDS is pretty awesome: it's basically a highly available MySQL setup with backups and optional goodness like read-slaves. It is one of the best services Amazon Web Services offers: 90% of what anyone would need from RDS can be done with a couple of clicks. For tuning it gets a little less shiny, and maybe even messy, but even changing server parameters is possible.

Another pretty important feature is growing and shrinking RDS: change your storage allocation and either apply the change right away or wait for your next maintenance window. It should be noted that these changes are rarely instant (so "right away" is relative), but that doesn't make the feature any less awesome: being able to resize storage at all, even if it takes a while, gives a whole new meaning to the word elastic.
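
For reference, such a storage change with the current AWS command line tools looks roughly like this (instance identifier and size are made up):

$ aws rds modify-db-instance \
    --db-instance-identifier mydb \
    --allocated-storage 200 \
    --apply-immediately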

The package

A standard RDS setup gives you a managed database instance with a DNS hostname and database credentials to log in.

Access from instances is granted using database security groups, which work just like the regular security groups (on AWS). In non-AWS-language, this translates to firewall policies.

Pricing

As far as pricing is concerned, AWS is always a little tough to understand: the baseline is 0.88 USD per hour for a Multi-AZ deployment in the large instance class, which totals 633.60 USD a month. Since we opted for a reservation (a 1,200 USD one-time fee for a three-year term), we were able to drop that price to 0.56 USD per hour.
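
For the record, the math behind those numbers (assuming a 720-hour month and a single reservation):

0.88 USD/h * 720 h    = 633.60 USD/month (on demand)
0.56 USD/h * 720 h    = 403.20 USD/month (reserved hourly rate)
1,200 USD / 36 months =  33.33 USD/month (reservation fee, amortized)
403.20 + 33.33        = 436.53 USD/month effective with the reservation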

Aside from instance costs, there are storage costs as well: 0.20 USD per GB per month (100 GB will cost you and me about 20 USD) and 0.10 USD per million I/O requests (aka the "I/O rate"). On our Multi-AZ RDS we initially selected 100 GB of total storage, but since we currently use only about 60 GB, we end up paying only about 12 USD per billing period.

While storage costs are somewhat easy to predict, the I/O rate is not. It's also not a major factor, though. I'm unable to provide exact numbers right now because we have three RDS servers (one Multi-AZ deployment, one read-slave and another single-AZ deployment) and the numbers are aggregated on the billing statement, but our total is 368,915,692 I/Os, which comes to roughly 36 USD per month.

Vendor lock-in

Anyway — if RDS is awesome, what's the catch? A closed environment.

The primary advantage of RDS is also its biggest disadvantage: we get access neither to the server nor to our own backups.

Of course there are backups, and we can use them to restore (or roll back) our RDS setup from within AWS; there are options in the AWS console and, I believe, in their API as well. But in the end there is no way to export such a backup and load it into a non-RDS setup. On top of that, replicating from or into RDS is not possible either, which makes migrations and off-site backups an unnecessary pain in the butt.

Aside from not getting access to our own backups, we also don't get access to the actual instances. That makes sense for AWS, but it means we need to rely on what are, in my opinion, questionable metrics like CloudWatch. Questionable because there is no way for the customer to verify the data: AWS uses their own metrics (and scale), and it's often not obvious to me how well CloudWatch works even on regular EC2 instances.

I've seen instances become unavailable while CloudWatch kept reporting A-OK (green). I'm not sure how beta CloudWatch is, but we decided on Silverline (more on that in another blog post) for instance monitoring. Since Silverline requires a local client, it's unfortunately not an option for RDS.

Where's the pain?

Aside from the monitoring and backup quirks, one of the real pain points of Amazon RDS is that a lot of the collective MySQL knowledge does not apply to it. The knowledge manifested in books, blogs, various monitoring solutions and outstanding tools like Percona's backup suite is simply not available to people who run Amazon RDS setups.

Whatever the (technical) reasons for all this may be, these limitations pain me at least a little and should be part of the discussion whenever Amazon RDS is evaluated as a replacement for your own MySQL infrastructure.

MySQL and backups

I mentioned backups! First off, I hope y'all are doing them! ;-)

For many, the preferred way to do MySQL-backups is mysqldump. It's pretty simple to use:

$ mysqldump -u root -pPASS -A > my_backup.sql

This command essentially dumps all (-A) databases from your MySQL server.

mysqldump is an OK solution as long as you keep a few things in mind:

  • Create your backups during the period with the least activity, typically at night.
  • There will still be a slight bump; hope that your database is small enough that no one notices.

With a larger database, or pretty much any database with lots of read and write activity, this is almost impossible to achieve: for the dump to be consistent, table locks are used, and that usually stalls any application which relies on your database.
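
If all of your tables are InnoDB, one way to soften this (a sketch, not RDS-specific advice) is mysqldump's --single-transaction option, which takes a consistent snapshot inside a transaction instead of locking the tables:

$ mysqldump -u root -pPASS --single-transaction -A > my_backup.sql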

Of course there is Percona's xtrabackup (which is outstanding), but with RDS, that is not an option either.

Read-slave to the rescue

Typically people will use a read-slave with MySQL to offload read queries from the master. I haven't done any tests on how far these typically lag behind with Amazon RDS, but I am going to use my RDS read-slave for something else: backup.

Adding a read-slave is easy:

  1. Log into the AWS Console and go to RDS.
  2. Select your RDS server and click 'Add Read Replica' above it.

The operation will likely take a while; the timeframe depends on the instance type and the amount of storage provisioned. In my case I selected a small instance, and RDS assigned 100 GB of storage to match my existing setup. Small instances are notoriously slow to boot, and the entire operation completed in a little over ten minutes.
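
For reference, the same thing with the current AWS command line tools would look roughly like this (identifiers and instance class are made up):

$ aws rds create-db-instance-read-replica \
    --db-instance-identifier mydb-replica \
    --source-db-instance-identifier mydb \
    --db-instance-class db.m1.small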

On a side note: read-replicas allow you to scale RDS beyond a single availability zone (AZ), but you should be aware that traffic between different AZs is billed to the customer.

Costs

A small instance costs roughly 76 USD per month (excluding storage, I/O rate and bandwidth), which by itself is not bad for a fully managed server which I basically set up with two or three clicks. Since we plan to do backups on a regular basis, we will purchase a reservation for the instance, which cuts down costs tremendously and generally makes the AWS cloud very affordable.

Amazon RDS, quo vadis?

I already mentioned the bit of vendor lock-in that comes with the service and the limited visibility from the outside.

In theory, this should not matter. In practice, however, there are more than a few issues you should be aware of. I don't want to mention them to stomp on Amazon (RDS is still in beta, after all), but they belong in the complete picture.

Pretty questionable is the way some of these issues are handled: not at all, or in private messages. AWS is not always at fault here; I imagine the customer quite often forgets to update the ticket when the issue turns out to be only temporary, because their focus shifts to other areas.

But one of the core problems with customer service all over AWS is that customers have to resort to posting on a forum with no guaranteed response, or have to buy a support contract which includes answers like "we fixed it". The first response is usually that more details are needed (maybe customer accounts on the forum are not linked to AWS accounts on the inside?), and off it goes into private mode.

My wish is that these situations are handled more transparently across all AWS services in the future, so people can see the development and evolution of the service and trust in the platform can be built.

Fin

I've been thinking about my final statement for a while. If anything, right now I lean in favour of Amazon RDS.

Amazon RDS is an extremely interesting product, and even more impressive considering it still carries a beta tag. It'll be interesting to see what it offers once Amazon pronounces it stable.

As for the future of our RDS setups: they are not going away anytime soon. One of our objectives for 2012 is stabilizing all products and the infrastructure underneath them. I think this will include some sort of integration with our external monitoring system, to at least try to make sense of things like CloudWatch and to be able to correlate them with other events from all over production.

Legacy code

Some people have this misconception that there is a plot to kill off the original mysql extension (aka ext/mysql) in PHP.

So first off, I know that bloggers are not journalists, so apparently fact-checking is not required for a rant blog post. ;-) But if a blog post is a rant, it should be labeled as such.

Improved

ext/mysqli (hint: "i" as in improved) has been the go-to extension for years. If anyone writes code in 2011 which uses ext/mysql, they should be fired or demoted to trainee. MySQL 4.1.3 beta was released in 2004, and ever since then it's been "strongly" suggested to use ext/mysqli.
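
To illustrate what the "i" buys you, here is a minimal sketch of ext/mysqli with a real prepared statement (connection details, table and column names are made up):

<?php
// hypothetical connection details
$mysqli = new mysqli('localhost', 'user', 'secret', 'mydb');

// prepared statement: the driver handles the escaping, not string concatenation
$email = 'someone@example.org';
$stmt  = $mysqli->prepare('SELECT id FROM users WHERE email = ?');
$stmt->bind_param('s', $email);
$stmt->execute();

$stmt->bind_result($id);
while ($stmt->fetch()) {
    echo $id, "\n";
}

$stmt->close();
$mysqli->close();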

PHP's fault in this case is that it didn't drop ext/mysql sooner.

Legacy code

The problem with legacy code is that it's usually not a legacy in the good sense of the word.

It's a legacy of the past which often exhibits everything that can go wrong on a PHP project.

In a situation where a project requires an older version of PHP to run, there are two kinds of improvements developers (and essentially the companies employing them) miss out on: active and passive.

Passive improvements

… generally happen and don't require anyone to re-write any code, examples include:

  • PHP's runtime became more efficient
  • improvements such as garbage collection
  • most importantly: bugs were fixed

Active improvements

… may require updates to the code base; they mostly include new features (e.g. through a new extension) or extensions of the language itself, such as namespaces and traits.

Ecosystem

In addition to missing out on improvements to the language, another major problem is that, since the code base requires an outdated (and maybe vulnerable) version of PHP to run, companies are often stuck in an entirely outdated ecosystem.

From top to bottom: it starts with the OS (security updates, anyone?), in this case Linux, and also includes the services you run, like MySQL and Apache. And there you go: the entire LAMP stack is fubar'd.

Anyone could argue that it's possible to, say, run PHP 4 on a recent Linux with semi-recent components, but compiling PHP yourself and/or backporting patches for security issues and crucial bug fixes doesn't make maintenance exactly trivial.

Zend Framework: Writing an IN-Clause with Zend_Db

I'm currently running a little database analysis to see how many of our users might be affected by Gawker's breach of security the other week. Since we use Zend Framework, Zend_Db is our weapon of choice.

Running the queries, I noticed that Zend_Db doesn't seem to support an IN clause with a straight fetchAll().

The IN clause is only supported when I wrap my statement in Zend_Db_Select, which is something I rarely do. Part of the reason is that I still don't feel comfortable writing my SQL in a DSL which doesn't really do anything besides wrapping a string in an object; IMHO it doesn't add to readability either.

And the other reason is that I don't plan to run this import against any database other than MySQL, which is why I don't see the need for abstraction either.
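
For completeness, the Zend_Db_Select route would look something like this (from memory, so treat it as a sketch rather than gospel); Zend_Db quotes the array into a comma-separated list for the IN clause:

// $db = Zend_Db::factory($config);
$params = array(1, 2, 3, 4);
$select = $db->select()
             ->from('bar', array('foo'))
             ->where('foo IN (?)', $params);
$rows   = $db->fetchAll($select);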

Example

The following code fails:

// $db = Zend_Db::factory($config);
$sql    = "SELECT foo FROM bar WHERE foo IN(?)";
$params = array(1,2,3,4);
$db->fetchAll($sql, $params);

The warning

PHP Warning:  mysqli_stmt::bind_param(): Number of variables doesn't match number of parameters in
prepared statement in /opt/library/Zend/Db/Statement/Mysqli.php on line 204

The exception

Mysqli statement execute error : No data supplied for parameters in prepared statement

Solution

My solution is simple:

$sql    = "SELECT foo FROM bar WHERE foo IN (?)";
$params = array('foo', 'bar', 'foobar', "foo\bar", '\bar', "'foobar'");
$query  = str_replace('?', substr(str_repeat('?, ', count($params)), 0, -2), $sql);
$db->fetchAll($query, $params);

I'm expanding the single ? to as many ?'s as necessary. The reason I did this is that I wanted to keep using Zend_Db's quoting: the data I'm dealing with is really messy and I didn't feel like repairing it.
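
If you need this in more than one place, the expansion is easy to wrap in a small helper (hypothetical function name, same idea as above):

// Expand the single placeholder into one placeholder per value.
// Note: this assumes the query contains exactly one '?'.
function expandInClause($sql, array $params)
{
    $placeholders = implode(', ', array_fill(0, count($params), '?'));
    return str_replace('?', $placeholders, $sql);
}

$db->fetchAll(expandInClause($sql, $params), $params);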

Of course there may be other, simpler ways to do it.

Fin

That's all, happy hacking.

(Update, 2010/12/19 10:05 PM: Refined my example. Apparently people don't read the text (= blog post) around the code. :-))

EC2 security group owner ID

I recently had the pleasure of setting up an RDS instance, and it took me a while to figure out what the --ec2-security-group-owner-id parameter needs to be populated with when you want to allow access to your RDS instance from instances in a certain EC2 security group.

To cut to the chase: the value is the numeric AWS account ID of the account that owns the EC2 security group. You can find it by logging into AWS and checking your account page. Done.
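
For reference, the authorization call then looks roughly like this with the current AWS command line tools (group names and the account ID are made up):

$ aws rds authorize-db-security-group-ingress \
    --db-security-group-name mydbgroup \
    --ec2-security-group-name webservers \
    --ec2-security-group-owner-id 123456789012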