
SQL MAX() and GROUP BY for CouchDB

While rewriting a couple of SQL statements for CouchDB, we got stuck when we wanted to do the equivalent of a SELECT MAX(...), id ... GROUP BY id.

MySQL

Imagine the following SQL table with data:

mysql> SHOW FIELDS FROM deploys;
+-------------+-------------+------+-----+---------+-------+
| Field       | Type        | Null | Key | Default | Extra |
+-------------+-------------+------+-----+---------+-------+
| project     | varchar(10) | NO   |     | NULL    |       |
| deploy_time | datetime    | NO   |     | NULL    |       |
+-------------+-------------+------+-----+---------+-------+
2 rows in set (0.01 sec)

In order to get the latest deploy for each project, I'd issue:

mysql> SELECT MAX(deploy_time), project FROM deploys GROUP BY project;
+---------------------+----------+
| MAX(deploy_time)    | project  |
+---------------------+----------+
| 2013-10-04 22:01:26 | project1 |
| 2013-10-04 22:02:17 | project2 |
| 2013-10-04 22:02:20 | project3 |
+---------------------+----------+
3 rows in set (0.00 sec)

Simple. But what do you do in CouchDB?

CouchDB

My documents look like this:

{
  "_id": "hash",
  "project": "github-user/repo/branch",
  "deploy_time": {
     "date": "2013-10-04 22:02:20",
     /* ... */
  },
  /* ... */
}

So, after more than a couple of hours spent trying to wrap our heads around map-reduce in CouchDB, we got it working.

Here's the view's map function:

function (doc) {
  if (doc.project) {
    emit(doc.project, doc.deploy_time.date);
  }
}

This produces nice key/value pairs, and in fact multiple pairs per project.
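
For example, with a handful of deploys in the database, the view might emit rows like these (hypothetical values, one row per deploy):

["project1", "2013-10-04 21:58:01"]
["project1", "2013-10-04 22:01:26"]
["project2", "2013-10-04 22:02:17"]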

And because the map function emits multiple rows per project, we need to reduce them down to a single row for each project.

So here is the reduce:

function (keys, values, rereduce) {
  // values holds the deploy dates emitted by the map step; on rereduce it
  // holds the dates returned by earlier reduce calls, so the same
  // comparison works in both passes.
  var max_time = 0, max_value, time_parsed;
  values.forEach(function(deploy_date) {
    // Date.parse() needs the "T" to accept the date as ISO 8601
    time_parsed = Date.parse(deploy_date.replace(/ /gi, 'T'));
    if (max_time < time_parsed) {
      max_time = time_parsed;
      max_value = deploy_date;
    }
  });
  return max_value;
}

The little .replace(/ /gi,'T') took especially long to figure out. Special thanks to Cloudant's señor JS-date-engineer Adam Kocoloski for helping out. ;-)

Step by step

  • iterate over values
  • fix each value (add the "T") to make SpiderMonkey comply (see the snippet below)
  • parse, and compare
  • return the "latest" in the end
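
To see why the "T" matters, here's a quick check to run in a JavaScript shell (results depend on the engine; older SpiderMonkey builds, like the one embedded in CouchDB, only accept the ISO 8601 form):

Date.parse("2013-10-04 22:02:20");                      // NaN, the space-separated form is rejected
Date.parse("2013-10-04 22:02:20".replace(/ /gi, 'T'));  // a numeric millisecond timestamp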

A word of advice: to save yourself some trouble, install a local copy of SpiderMonkey and test your view code there, not in your browser.

Open the view in your browser: http://localhost/db/_design/deploys/_view/last_deploys?group=true

{
  "rows":[
    {
      "key":"testowner/testname/testbranch",
      "value":"2013-09-13 11:41:03"
    },
    {
      "key":"testowner/testname/testbranch2",
      "value":"2013-09-12 16:48:39"
    }
  ]
}

Yay, works!

FIN

That's all for today.

The future of CouchDB

… is not Damien Katz.

TL;DR

The blog post Damien Katz wrote earlier today doesn't mean much, or anything, for the Apache CouchDB project (or the memcached project, for that matter). If anything, it's a public note that Damien Katz acknowledged he has moved on from CouchDB to Couchbase.

Short story, long

I'm not a contributor to CouchDB in terms of code, but I blog a lot, I maintain the FreeBSD port, I wrote a book and I have an opinion on many things CouchDB. I've been a CouchDB user for something like four years (since pre-0.8 times) and a BigCouch user (and Cloudant customer) for about one and a half to two years.

I am not sure what Damien Katz tried to achieve when he posted his message to the community and while I personally find it ignorant (to say the least), it worries me how it is perceived by the general public.

Talk is cheap

Of course it may sound like the end of the world when the creator of CouchDB quits his own project, but truth be told, Damien Katz left CouchDB a long time ago. Couchbase moved past CouchDB long before they announced it, basically when they started integrating Membase, though there are still all kinds of notable contributions from Couchbase employees (e.g. Filipe and Jan).

Damien himself hasn't actually committed anything to CouchDB in over a year, which makes his move no real surprise; it's just the way he decided to communicate it that surprised me. Especially since he claims to have no regrets, I find the tone and statements in his blog post rather questionable, from both a personal and a professional perspective.

Just last year I attended a Couchbase event in Berlin where talks about CouchDB were given alongside newer Couchbase developments. So personally, while I welcome clarity, all too sudden changes in strategy don't make me happy. If I were new to CouchDB and/or Couchbase, this would look like a headless chicken (excuse the image) and way too much drama to get into.

And on a professional level, Damien's post invalidates the efforts of the many people who contribute to and work with Apache CouchDB on a daily basis.

Then later today Damien shared this:

TIL, if you create an open source project, you should stick with it forever and ever. Family can live off unicorns and stardust. — Damien Katz

Real talk

First off, most of the people in this discussion (excluding HN, of course ;-)) are actually active open source contributors one way or another. Many of us have other projects (plural) besides CouchDB. It's not that we're trolling about something we have no idea about.

Secondly, it's not the fact that Damien left, it's how he left.

No one blames people for moving on: it happens all the time. I do it all the time — write code, push it out, move on.

If code is good enough it'll be picked up; if not, it'll rot on GitHub forever. It has happened to other projects and it happened to CouchDB. But why would anyone pronounce a project dead that he is no longer invested in?

Anyway: I wish Damien good luck in the future.

So where is CouchDB at?

I wrote a blog post about the current state of CouchDB last year (2011) in early December.

A few things have changed since then:

Overall, I still see contributions from notable community members all around. Not sure if it's my own perception, but there has been more activity as of late. A new release of Apache CouchDB is just around the corner. Overall, good times for Apache CouchDB indeed.

The other thing that happened after my blog post is that Couchbase said in their 2011 review (published in late December) that they would officially step away from Apache CouchDB and contribute documentation and OS X builds (aka CouchDBX or Couchbase Single Server) to the Apache CouchDB project.

This is great news, and announcing that they will step off the turf is fine too, since it clears up a couple of misconceptions people may have about Apache CouchDB and the former Couchbase Single Server.

All in all, this makes Damien's blog post even more unnecessary and confusing to many users out there, especially those who are not neck-deep in Apache CouchDB. By that I mean: they are not subscribed to a mailing list, don't take part on IRC, don't read the CouchDB planet and don't know any of the contributors directly.

In the end his blog post confuses people because it contains absolutely nothing but fear, uncertainty and doubt (FUD).

Outlook

I can't reveal too much because it's not my business to announce anything — let me just say that there are good times ahead for Apache CouchDB.

Fin

I'd like to get past all the drama and re-focus on what is important: CouchDB and our data. I don't care for the rest, I want to see exciting things from Apache CouchDB in 2012.

Quo vadis, CouchDB?

Update, 2011-12-21: Couchbase posted their review of 2011 (the other day) — TL;DR: Couchbase Single Server (their Apache CouchDB distribution) is discontinued and its documentation (and its buildtools) will be contributed to Apache CouchDB.


When Ubuntu1 dropped CouchDB two weeks ago, there were a couple of things which annoyed (and still annoy) me a lot. Add to that the general echo from various media outlets and blogs which pronounced CouchDB dead, and a general misconception of how this situation, and CouchDB in general, is being dealt with.

Some people said I am caremad about CouchDB and that is probably true. Let me try to work through these things without offending more people.

Ubuntu1

What annoyed (and still annoys) me about this situation is that I wrote a chapter about Ubuntu1 in my CouchDB book. And while I realize that as soon as a book is published its information is outdated, I also want to say that I could have used the space for another project.

I talked to a couple of people about CouchDB at Ubuntu1 on IRC, and no one made it sound like they were having huge, or for that matter any, issues.

Of course I work for neither Canonical nor Couchbase, and I haven't signed any NDAs, etc. But looking back a week or two, my well-educated guess is that not even the people at Couchbase knew there were fundamental issues with CouchDB and Ubuntu1.

The NDA-part is of course an assumption: don't quote me on it.

Transparency

Scumbag Ubuntu1 drops CouchDB and doesn't say why. — myself on Twitter

First off: I'm not really sorry. I was abusing a meme, and if you read my Twitter bio, you should know not to take things personally.

I also should have known better since it's not like I expect anything transparent from Canonical. (Just said it.)

When people are compelled to write a press release and put it out like that, they should expect a backlash. The reason I reacted harshly is that Canonical didn't share any valuable information on why they stopped using CouchDB, except for: it doesn't scale.

And I'm not aware of anything more concrete to date.

Helpful criticism — how does it work?

Please take a look at the following email: https://lists.launchpad.net/u1db-discuss/msg00043.html

This email contains a lot of criticism. And it's all valid as well.

CouchDB feedback

Other examples:

These are great emails because they contain extremely valuable feedback.

Deal with it!

In my (humble) opinion, these kinds of emails are exactly what is necessary in CouchDB-land, and in many other open source projects: criticism, a little time to reflect on the not-so-awesome features, and then moving on to make them better. If that feedback cycle doesn't happen, there's no development or evolution, just stagnation.

In retrospect, I wish more people would share their opinions on CouchDB and this situation more often. Since I'm personally invested in CouchDB, it's hard for me to say certain things. Honesty is sometimes brutal, but it's necessary.

In summary: a CouchDB user like Ubuntu1 (or Canonical) doesn't have a civic duty to give feedback. But deserting a project while pretending to be an open source vendor, without talking to the project's community or sharing your issues in public, is extremely unhelpful.

Overall it strikes me that the only thing known to date about Canonical's collaboration with CouchDB is the OAuth support in CouchDB, and most people don't even know about that (or wouldn't know how to use it). It worries me personally not to know what kind of problems Canonical ran into, because apparently they were so messed up that they couldn't be discussed in public.

CouchDB doesn't scale

One thing I was able to extract is: CouchDB doesn't scale.

Thanks! But no thanks.

I wrote a book on CouchDB and I pretty much used it all, or at least looked at it very, very closely. I also get plenty of experience with CouchDB through my job. Indeed, there are many situations where CouchDB doesn't scale, or where it becomes extremely hard to make it scale: situations where the user is better off putting the data somewhere else.

I (and I assume others) enjoy learning the reasons why things break, so we can take that experience and use it going forward. If this doesn't happen, we might as well all drink the Kool-Aid of a closed source vendor, purchase update subscriptions, install security packs and live happily ever after.

A patch to make CouchDB scale?

Another piece of information I gathered from the various emails is that Canonical maintained CouchDB-specific patches for Ubuntu1. However, it's unknown what the purpose of these patches was: for example, whether they made CouchDB scale (magically) for Ubuntu1, or whether the patchset added a new feature.

What I'd really like to know is why these patches were not discussed in the open, and why no one worked with the project on getting them incorporated upstream, upstream being the Apache CouchDB project.

This is another example of where communication went horribly wrong or just didn't happen.

A CouchDB company

I'm a little torn here and I don't want to offend anyone (further), especially since I know a couple of Couchbase'rs and original CouchOne'rs (hello to Jan, JChris and Mikeal) in person. But seriously: a lot of people realized that CouchOne stopped being The CouchDB company a long time ago.

This is not to say that the CouchDB project members who are employed by CouchOne/Couchbase are not dedicated to CouchDB. But if I take a look at the mobile strategy and the more or less recent integration of CouchDB with Membase/Memcached, I can't help noticing that these strategies are far away from Apache CouchDB: big data (whatever that means), to mobile and back.

The conclusion is that the majority of work done will not be merged into Apache CouchDB and this is one of the reasons why the Apache CouchDB project hasn't evolved much in a long time.

Not all changes can go upstream

I realize that when a company has a different strategy, not everything they do can be sent upstream. After all, most if not all companies operate in a world where money is to be made and goals are to be met. Nothing wrong there.

But let's take a look at the one project which could have been dedicated to Apache CouchDB: the documentation project.

CouchOne hired an ex-MySQL'er to write really great documentation for CouchDB. The documentation made sense, it was up to date with releases, it contained lots of examples and whatnot. But it was never contributed to the open source project. It is still online today, though now as the documentation of the Couchbase Server HTTP API.

Wakey, wakey!

So in my opinion the biggest news is not that Canonical stopped using CouchDB, and it's also not outrageous to think that there can be one CouchDB company. The biggest news is that Couchbase officially said: "It's not us!"

Having said that, and not knowing much about Canonical's setup and scale, I still fail to even remotely understand why they didn't work with Cloudant, who have specialized in making CouchDB scale all along.

CouchDB and Evolution

Of course it is unfair to single them (Couchbase employees) out like that. For the record, there are pretty lively projects such as GeoCouch which are also funded by Couchbase, and while these guys are devoted to the project, they also have to meet goals for their company.

Add to that that other CouchDB contributors have not driven substantial user-facing changes in Apache CouchDB either. CouchDB is still a very technical project with a lot of technical issues to solve. The upside of this situation is that while other NoSQL vendors add new buzzwords to each and every CHANGELOG, CouchDB is very conservative and stability-driven. I appreciate that a lot.

User-facing changes, on the other hand, are just as important for the health of a project. Subtle changes aside, today's talks on, for example, querying CouchDB are extremely similar to the talks given a year or two ago. Whatever happens in this regard is not visible to users at all.

Take URL rewriting, virtual hosts and range queries as example features. I question:

  • the usefulness for 80% of the users
  • the rather questionable quality
  • the state of their documentation

Users need to be able to grasp what's on the roadmap for CouchDB. There needs to be a way for less technical users to provide feedback which is actually incorporated into the project, and all of that aside from opening issues in a monster like Jira.

Since no one bothers currently, this is not going to happen soon.

Pretty candid stuff.

Marketing

In terms of marketing, and lacking an official CouchDB company, the CouchDB project has taken on a PostgreSQL attitude over the last two years.

In a nutshell:

We don't give a damn if you don't realize that our database is better than this other database.

This is a little dangerous for the project itself: when I look at the cash other NoSQL vendors pour into marketing for their databases, I quickly realize that with this lack of support the project could go away pretty soon.

CouchDB being an Apache project doesn't save me or anyone else either: clean intellectual property, deserted, forever.

The various larger companies (say, Cloudant and Meebo) are basically occupied with their own forks, with maybe too little reason to merge anything back upstream yet. There are independent contributors like Enki Multimedia who contribute to core but also to sub-projects like CouchApp.

And then there's Couchbase, which is trying to put CouchDB behind Memcached and, from what I can tell, pretty much abandons HTTP and other slower CouchDB principles in the process.

Is CouchDB alive and kicking?

You saw it coming: it depends!

Dear Jan, I'm still thinking about the email you wrote while I write my own blog entry. And honestly, that email and the general response raised more questions for myself and others than it answered.

I'd like to emphasize a difference I see (thanks, Lukas):

Core

Is the core of Apache CouchDB alive? — It's not dead.

  • Yes, because some companies drive a lot of stability into CouchDB.
  • No, because there's little or no innovation happening right now.

Ecosystem

There is a lot of innovation going on in CouchDB's ecosystem.

Most notably, the following projects come to mind:

  • BigCouch
  • Couchappspora
  • CouchDB-lucene
  • Doctrine ODM in PHP (and I'm sure there are similar projects in other languages)
  • ElasticSearch's river
  • erica
  • GeoCouch
  • Lounge (and lode)
  • various JavaScript libraries to connect CouchDB with CouchApps or node.js
  • various open data projects (like refuge.io)

Need more? Check out CouchDB in the wild which I think is more or less up to date.

Hate it or love it — there is plenty of innovation going on. And many (if not all) CouchDB committers are a part of it.

The innovation just doesn't happen in CouchDB's core.

Fin

My closing words are that I don't plan on migrating anywhere else. If anything, we have mostly migrated to BigCouch.

For Apache CouchDB, I think it's important that someone fills that void, be it a company, a BDFL or more engaged project leaders (plural). I think this is required for the project to stay lively.

Because I would really like to see the project survive.

RFC: Mocking protected methods

Update, 2011-06-16, 12:15 AM Thanks for the comments.

(I swear I had something like that before and it didn't work!) Here's the solution:

// Test fixtures: a user, a document id and the JSON body we want returned.
$userId = 61382;
$docId  = 'CLD2_62e029fc-1dae-4f20-873e-69facb64a21a';
$body   = '{"error":"missing","reason":"problem?"}';

// An HTTP client with the test adapter, so no real connection is ever made.
$client = new Zend_Http_Client;
$client->setAdapter(new Zend_Http_Client_Adapter_Test);

// Partial mock of the wrapper: only makeRequest() is replaced, the rest of
// the class (including getDocument()) stays real.
$couchdb = $this->getMock(
    'CouchDB',
    array('makeRequest',),
    array($userId, $this->config,)
);

// The fixture: makeRequest() returns a canned Zend_Http_Response instead of
// talking to an actual CouchDB server.
$couchdb->expects($this->once())
    ->method('makeRequest')
    ->will($this->returnValue(new Zend_Http_Response(200, array(), $body)));

$couchdb->setHttpClient($client);
$couchdb->getDocument($docId);

--- Original blog entry ---

I wrote a couple of tests for a small CouchDB access wrapper today. But when I wrote the implementation itself, I realized that my class setup depends on an actual CouchDB server being available, and here my journey began.

Example code

Consider the following example:
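
A minimal sketch of the wrapper in question; the constructor arguments and method names are inferred from the mock setup in the update above, while the base URL and document path are assumptions:

class CouchDB
{
    protected $userId;
    protected $config;
    protected $httpClient;

    public function __construct($userId, $config)
    {
        $this->userId     = $userId;
        $this->config     = $config;
        $this->httpClient = new Zend_Http_Client;
    }

    public function setHttpClient(Zend_Http_Client $client)
    {
        $this->httpClient = $client;
    }

    public function getDocument($docId)
    {
        // The public API builds on the protected HTTP call below.
        $response = $this->makeRequest('GET', '/db/' . urlencode($docId));
        return json_decode($response->getBody());
    }

    protected function makeRequest($method, $path)
    {
        // Talks to the actual CouchDB server; this is the part we want to
        // replace with a fixture in the test suite.
        $this->httpClient->setUri('http://localhost:5984' . $path);
        return $this->httpClient->request($method);
    }
}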

My objective is not to test any of the protected methods directly, but to be able to supply a fixture so we don't have to set up CouchDB to run our test suite. My fixture would replace makeRequest() and return a JSON string instead.

Some thoughts on outtages

Cloud: everybody wants it, some actually use it. So what's my takeaway from AWS' recent outage?

Background

So first off, we had two pieces of our infrastructure failing (three if we include our Multi-AZ RDS), all of which involve EBS.

Numero uno

One of those pieces within my immediate reach was a MySQL server which we use to keep sessions. And to say the least about AWS, and in their defense: the instance had run for almost 550 days and had never let us down.

In almost two years with AWS I have not magically lost a single instance. I had to reboot two or three once because the host had issues and Amazon sent us an email asking us to make sure the instances survive a reboot, but that's about it.

Recovering the service, or at least launching a replacement in a different region, would have been possible had we not, by coincidence, hit several limits on our AWS account (instances, IPs and EBS volumes), which apparently take multiple days to lift. We contacted AWS immediately and the autoresponder told us to email back if it was urgent, but I guess they had their hands full, and apparently we are not high enough up the chain to express how urgent it really was.

I also tried to reach out to some of the AWS evangelists on Twitter, which didn't work since they went silent for almost the entire outage.

All in all, it took roughly five hours to get the volume back and another 4-5 to recover the database. As far as I can tell, nothing was lost.

And in our defense: we were well aware of this SPOF and already had plans to move to a more redundant approach. I have another blog post in draft about evaluating alternatives (Membase).

Numero due

The second critical piece of infrastructure which failed for us is our hosted BigCouch cluster with Cloudant.

We managed to (manually) fail over to their cluster in us-west-1 later in the day and brought the service back up. We would have done this earlier, but AWS suggested the outage would only last a few hours, which is why we wanted to avoid the hassle of having to sync up the clusters later on.

Sidenote: Cloudant is still (day three of the downtime) trying to get all pieces back online. Kudos to everyone from Cloudant for their hard work and patience with us.

Lesson learned for me: when things fail which are not within your reach, it's pretty hard to do anything and stay calm. A good thing is to keep everyone busy, so our team tried to reach out to all customers (we average about 200,000 users per day) via Twitter and Facebook, and it looks like we tackled that well.

Un trio!

Well, I don't have much to say about Amazon RDS (hosted MySQL in the cloud). Except that it didn't live up to our expectations: it costs a lot of money, but we learned that apparently that doesn't buy us anything.

Looking at the CloudWatch statistics associated with our RDS setup (or EBS in general), I'm rather wary and don't even know if they can be trusted. In the end, I can't really say how long RDS was down or failed to fail over, but it must have been back before I got a handle on our own MySQL server.

The rest?

The rest of our infrastructure seems fine on AWS — most of the servers are stateless (Shared nothing, anyone?) and no EBS is involved.

And with the absence of EBS, there are no issues to begin with. Everything continued to work just as expected.

Design for failure.

This is not really a takeaway but a no-brainer. It's also not limited to AWS or cloud computing in general.

You should design for failure when you build any type of application.

In terms of cloud computing it's really not as bad as ZOMG MY INSTANCE IS GONE!!!11, but it certainly can happen. I've heard people claim that EC2 instances constantly disappear and with my background of almost two years on AWS, I know that's just not true.

Designing for failure should take the following into account:

Services become unavailable

These services don't have to run in the cloud; they can become unavailable on bare metal too. For example, a service could crash, or there could be a network partition or maintenance involved. The bottom line: how does your application deal with it?

Services run degraded

For example, higher latency between the services, slower response time from disk — you name it.

The unexpected

Sometimes, everything is green, but your application still chokes.

While testing for the unexpected is of course impossible, validating what comes in is not.
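
To make these three points a little more concrete, here is a minimal node.js sketch (the host, path and timeout are made up): it retries when the service is unavailable, gives up on a response that takes too long, and validates what comes back before using it.

var http = require('http');

function fetchWithRetry(path, retries, callback) {
  var req = http.get({ host: 'api.example.org', path: path }, function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      try {
        // The unexpected: validate what comes in before acting on it.
        callback(null, JSON.parse(body));
      } catch (e) {
        callback(e);
      }
    });
  });

  // Degraded service: don't wait forever for a response.
  req.setTimeout(2000, function () { req.abort(); });

  // Unavailable service: retry a couple of times, then degrade gracefully.
  req.on('error', function (err) {
    if (retries > 0) {
      fetchWithRetry(path, retries - 1, callback);
    } else {
      callback(err);
    }
  });
}

fetchWithRetry('/status', 2, function (err, data) {
  if (err) {
    console.log('service unavailable, serving a fallback:', err.message);
  } else {
    console.log('all good:', data);
  }
});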

Recovery

I'm not sure if a fire drill is necessary, but it helps to have a plan for how to troubleshoot issues so you can recover from an outage.

In our case, we log almost anything and everything to syslog and utilize loggly to centralize logs. loggly's nifty console provides us with great input about the state of our application at any time.

Add to the centralized logging that we have a lot of monitoring in place using Ganglia and Munin. Monitoring is an ongoing project for us, since it seems like once you start, you just can't stop. ;-)

And last but not least: we can launch a newly configured EC2 instance with a couple of mouse clicks using Scalarium.

I value all these things equally — without them, troubleshooting and recovery would be impossible or at least a royal PITA.

Don't buy hype

So, to get to the bottom of this (still ongoing) event: I'm not particularly pissed that there was downtime. Of course I could live without it, but what I mean is: things are bound to fail.

The truth is, though, that Amazon's product descriptions are not exactly honest, or at the very least leave everyone a lot of room for interpretation. How much interpretation, you ask? I'm sure you could put ten cloud experts into one room and come away with 15 different opinions.

For example, attaching an attribute such as highly available to a service may create different expectations in different people.

Let me break it down for you: highly available, in AWS speak, means "it works most of the time".

What about an SLA?

Highly available is not an SLA with those infamous nine nines we know from Erlang.

On paper, a multi-AZ deployment of Amazon RDS gets pretty close to what people generally expect from highly available: MySQL master-master replication and backups, in multiple datacenters. As of today we all know: even these things can fail.

And speaking of SLAs: it looks like none of the failing services are covered by one, so AWS' track record remains clean. This is because EBS is not explicitly named in the SLA, and by the way, neither is RDS. Amazon's SLA — as far as EC2 is concerned — covers some but not all network outages. Since I was able to access the instance the entire time, none of this applies here.

Multi-Zone

On Twitter, people are quick to suggest that everyone who complains now should have had a setup across multiple availability zones.

Here's what I think about it:

  • I find it rather amusing that apparently everyone on bare metal runs in multiple datacenters.

  • When people suggest multi-zone (or multi-AZ in Amazon-speak), I think they really mean multi-region, because a zone is effectively us-east-1a, us-east-1b, us-east-1c or us-east-1d. Since all datacenters (= availability zones) in the us-east-1 region failed on 2011/04/21, your multi-zone setup would not have covered your butt. Even Amazon's multi-AZ RDS failed.

  • Little do people know, but the zone identifiers — e.g. us-east-1a, us-east-1b — are tied to customer accounts. So, for example, Cloudant's us-east-1a may be my us-east-1c; it may or may not be the same datacenter. This is why in many cases AWS never calls out explicit zones in outages. It also makes it somewhat hard to plan ahead within a single region.

  • AWS sells customers on the idea that an actual multi-AZ setup is plenty. I don't know too many companies that do multi-region (maybe SimpleGeo?). Not even Netflix does multi-region, but I guess they managed to sail around this disaster because they don't use EBS.

  • In the end it shouldn't be necessary to do a multi-region setup (and deal with its caveats), since according to AWS the different zones inside a region are different physical locations (let's call them datacenters) to begin with. Correct me if I'm wrong, but the description says different physical location; that is not just another rack in the same building or another port on the core switch.

Communication

Which brings me to the most important point of my blog entry.

In a nutshell, when you build for the AWS platform, you're building for a black box. There are a couple of papers and blog posts where people try to reverse engineer the platform and write about its behavior. The problem with these is that most people are guessing, though often (depending on the person writing, of course) it seems to be a very well-educated guess.

Roman Stanek blogged about communication between AWS and its customers, so head on over; I pretty much agree with everything he has to say.

Fin

So what exactly is my takeaway? In terms of technical details, and as far as redundancy is concerned: not so much.

Whatever you do to run redundantly on AWS applies to setups in your local colocation or POP as well. And in theory, AWS makes it easier to leverage multiple datacenters (availability zones) and even achieve somewhat of a global footprint by distributing across different regions.

The real question for anyone to ask is, Is AWS fit to host anything which requires permanent storage? (I'm inclined to say no.)

That's all.