## Continuous Integration: Automated database setup with Doctrine on Travis-CI

Testing is important — most people understand that by now. A lot of people write tests for their open source code already, but in-house testing is still hard. For example, many of us had an encounter with Jenkins: it runs well to a point where it becomes harder to maintain the Jenkins than it is to write tests.

Another obstacle is test setup and environments: When I write and run tests, there is sometimes only so much I can do to mock and avoid actual calls to my storage backend. While I prefer to run my database tests against a SQLite in memory database, there are these edge cases, where I work with multiple database or I write a direct query (and by-pass the ORM-magic).

In these cases I need to have that database server available in my test environment!

The following blog posts explains how to solve these things with Travis-CI. I will walk you through the setup on Travis-CI's business service. But most of this applies to their open source offering as well.

## Step by step

I'll try to break it up into small steps.

### Travis-CI

The first step is to login at http://travis-ci.com and add the repository tests should be run for. To be able to add repositories of an organization, you have to be the owner of the organization. The easiest way to get access to the service right now is donating to these guys or in case you have done that already: email them. ;-)

The second step is setting up a .travis.yml file.

Mine looks like this:

language: php
php:
- 5.3
- 5.4
before_script:
- ./composer.phar -v install --dev
- psql -c 'create database testdatabase;' -U postgres


Run-down:

• run the tests against PHP 5.3 and 5.4
• before_script defines your test setup (outside PHP)
• I omitted the script stanza because the default (phpunit) works for me

### Composer

I am using composer to manage my dependencies and you should too. I don't want to go into details here, but a short example of my composer.json is the following:

{
"name": "mycompany/project",
"description": "A super cool project.",
"require": {
"doctrine/orm": "2.2.2"
},
"psr-0": {
"MyCompany\\Project\\Test": "tests/",
"MyCompany\\Project": "src/"
}
}
}


Side-note: We also currently commit a composer.phar into each repository for two reasons:

1. To ensure a change in composer won't break our setup.
2. Downtime of (or connectivity issues to) their website don't break our deployments and test runs.

### Test framework setup

There is not a whole lot to setup since Travis-CI installs phpunit already. Just make sure you have a phpunit.xml in the root of your repository and you are good to go.

### Database schema

The next step would be to generate your schema and check in some .sql, right? I'm not a huge fan of this, because I hate running through a lot of manual steps when I need to update something. Manual steps means that they might be forgotten or people make a mistake. So the objective is to avoid any manual labour as much as you can.

Instead of maintaining these files, I use Doctrine's SchemaTool. It takes care of this just fine because I annotated all my entities already.

To make use of this, I suggest to add the following to your test case:

<?php
namespace MyCompany\Project\Test;

use Doctrine\ORM\Tools\Setup;
use Doctrine\ORM\EntityManager;
use Doctrine\Common\Persistence\PersistentObject;
use Doctrine\ORM\Tools\SchemaTool;

class MyTestCase extends \PHPUnit_Framework_Testcase
{
protected $em,$tool;

public function setUp()
{
$this->setUpDatabase(); // wrap this again$this->setUpSchema();
/* more setup here */
}

public function tearDown()
{
$classes = array($this->em->getClassMetadata('MyCompany\Project\Entity\SomethingImportant'),
);
$this->tool->dropSchema($classes);
unset($tool); unset($em);
}

public function setUpDatabase()
{
$isDevMode = true;$doctrineConfig = Setup::createAnnotationMetadataConfiguration(
array('path/to/Entity'),
$isDevMode ); // database configuration parameters$dbConfig = array(
'host'     => '127.0.0.1',
'user'     => 'postgres',
'dbname'   => 'testdatabase',
);

$this->em = EntityManager::create($dbConfig, $doctrineConfig); PersistentObject::setObjectManager($this->em);
}

public function setUpSchema()
{
$this->tool = new SchemaTool($this->em);
$classes = array($this->em->getClassMetadata('MyCompany\Project\Entity\SomethingImportant'),
);
$this->tool->createSchema($classes);
}
}


Assumptions made (aka, room for improvement):

• your entities are in path/to/Entity
• PostgreSQL is used and runs on 127.0.0.1
• you're using a database called testdatabase

### Tests

Once the setup is complete, I add the usual testFoo() methods into my test case class.

From within the testFoo() methods I have Doctrine's EntityManager available via $this->em. The entity manager allows me to do whatever I need or want within a test run. After a test completes, the tearDown() method is invoked and destroys the tables in your testdatabase and then setUp() re-creates it. This will take some time but the side-effects of stale data are not to be neglected. Add to that, your tests should not rely on the order they are executed in anyway. Another benefit of this setup are updated SQL tables each time a commit changes the annotations. No extra .sql files to maintain. :) ## Fin That's really all there is to running your test suite with Travis-CI and while you did all the above, you just added continuous integration to your toolkit because these tests run each time a pul request is opened or commits are pushed. :-) As I mentioned early on, these steps apply to the open source offering as well — all but the link to login. If PostgreSQL is not your game, have a look at the list of currently supported databases on Travis-CI. And if your database server is not on the list of supported applications, you might as well install it with a before_script (as long as it runs on Ubuntu). ;-) Happy testing! ## Hosted MySQL: Amazon RDS (and backups) Among all the different technologies in our stack, we also use MySQL. While we still run MySQL (or Percona-Server) ourselves, we selected a managed solution to power parts of our production infrastructure: a Multi-AZ setup with Amazon's RDS. AZ is Amazon-speak for "availability zone", essentially a datacenter. RDS stands for: Relational Database Service. Judging from my experience with our own setups where EBS is in the mix, I have to say that Amazon does an outstanding job hiding these potential issues with RDS from us. Looking at the price tag of the setup can be intimidating at first, but as far as TCO is concerned, RDS is the complete package: managed from every which way and painless for the most part. ## RDS in a nutshell RDS is pretty awesome — it's basically a highly available MySQL setup with backups and optional goodness like read-slaves. RDS is one of the best services as far as Amazon Webservices are concerned: 90% of what anyone would need from RDS, Amazon allows you to do with a couple clicks. For tuning, it gets a little less shiny and maybe even messy, but even changing parameters is possible. Another pretty important feature is growing and shrinking RDS. Change your storage and either apply the change right away or wait for your next maintenance window. It should be noted that these changes are rarely instant (or "right away"), which doesn't make it any less awesome. So even though for example resizing the storage is not an instant operation (of course), it still puts a whole new definition into the word elastic. #### The package A standard RDS setup gives you a managed database instance with a DNS hostname and database credentials to log in. Access from instances is granted using database security groups, which work just like the regular security groups (on AWS). In non-AWS-language, this translates to firewall policies. ### Pricing As far as pricing is concerned, AWS is always a little tough to understand: the baseline is 0.88 USD per hour for a multi-az deployment which totals to 633.6 USD a month (large instance class). Since we opted for reservation (a 1,200 USD one time fee for a three (3) year term), we were able to drop that price to 0.56 USD per hour. Aside from instance costs there are storage costs as well: 0.20 USD per GB (100 GB will cost you and me about 20 USD) and 0.10 USD per million I/O requests (aka the "i/o rate"). On our multi-az RDS we selected 100 GB for total storage initially but since we currently use only about 60 GB, we just end up paying about 12 USD per billing period. While storage costs are somewhat easy to predict, the "i/o rate" is not. But it's also not a major factor. I'm unable to provide exact numbers currently because we have three RDS servers (1 multi-az deployment, 1 read-slave and another single-az deployment) and the numbers are aggregated on the billing statement but our total is 368,915,692 IOs which runs at roughly 36 USD per month. ### Vendor lockin Anyway — if RDS is awesome, what's the catch? A closed environment. The primary advantage and disadvantage of RDS is that we don't get access to the server and our own backups. Of course there are backups and we can use them to restore (or rollback) our RDS setup from within AWS. There are options using the AWS console and I believe using their API as well. But in the end there is no way to export this backup and load it into a non-RDS-setup. And add to that: replicating from or into RDS is not possible either. Which makes migrations and backups an unnecessary pain in the butt. Aside from not getting access to our own backup, we also don't get access to the actual instances. Which makes sense for AWS, but it means we need to rely on in my opinion questionable metrics like Cloudwatch. Questionable because there is no way for the customer to verify the data. AWS uses their own metrics (and scale) and it's often not obvious to me how well Cloudwatch works even on regular AWS EC2 instance. I've seen instances which became unavailable, but Cloudwatch is reporting A-OK (green). I'm not sure how beta Cloudwatch is, but we decided on Silverline (more on that in another blog post) for instance monitoring. Since Silverline requires a local client, it's unfortunately not an option for RDS. ### What's pain? Aside from the monitoring and backup quirks, one of the real pain points of Amazon RDS is that a lot of the collective MySQL knowledge is not available to us. The knowledge which is manifested in books, blogs, various monitoring solutions and outstanding tools like Percona's backup tools are not available to people who run Amazon RDS setups. Whatever the (technical) reasons for all this may be, they pain me at least a little and should be discussed when Amazon RDS is evaluated to replace MySQL infrastructure. ## MySQL and backups I mentioned backups! First off, I hope ya'll are doing them! ;-) For many, the preferred way to do MySQL-backups is mysqldump. It's pretty simple to use: $ mysqldump -u root -pPASS -A > my_backup.sql


This command essentially dumps all (-A) databases from your MySQL server.

mysqldump is an OK solution as long as you take a few things in mind:

• Create your backups during a period where there is the least activity — typically the night.
• There will be a slight bump, but hope that your database is small enough so no one notices.

With a larger database or most databases with lots of read and write activity, this is almost impossible to achieve. For a snapshot to be consistent, table locks are used and that usually stalls access to any application which relies on your database.

Of course there is Percona's xtrabackup (which is outstanding), but with RDS, that is not an option either.

Typically people will use a read-slave with MySQL to offload read queries from the master. I haven't done any tests on how far these typically lag behind with Amazon RDS, but I am going to use my RDS read-slave for something else: backup.

The operation will likely take a while. The timeframe depends on the type of instance and the amount of storage provisioned. In my case I selected a small instance and it assigned 100 GB of storage to match my RDS. Small instances boot notoriously long — the entire operation completed in a little over ten minutes.

On a side-note: read-replicas allow you to scale RDS beyond availability zones (AZ). But you should be aware that traffic across different AZ is billed to the customer.

### Costs

A small instance costs roughly 76 USD/month (excluding storage, I/O rate and bandwidth), which by itself is not bad for a fully managed server which I basically setup with two or three clicks. Since we plan to do backup on a regular basis, we will buy a coupon to reserve the instance which cuts down costs tremendously and generally makes the AWS cloud very affordable.

I mentioned a little vendor-lockin with the service and the little visibility from the outside.

In theory, this should not matter — however there are more than a few issues you should be aware of. I don't want to mention them to stomp on Amazon — RDS is still in beta after all. But you should be aware of them to get a complete picture.

Pretty questionable is the way some of these issues are handle: not at all or in private messages. AWS is not always at fault here since I imagine pretty often the customer forgets to update the ticket when the issue is only temporary because their focus shifts to other areas.

But one of the core problem with customer service problem all over AWS is that customers have to resort to posting on a forum with no guaranteed response or have to buy a support contract which includes answers like "we fixed it". The first response is usually that more details are needed (Maybe customer accounts on the forum are not linked to AWS accounts on the inside?) and off it goes into private mode.

My wish is that these situations across all AWS services are handled more transparent in the future so people see development and evolution of the service which means that a trust-worthy platform is be build.

## Fin

I've been thinking about my final statement for a while. If anything right now, I would be more in favour of Amazon RDS.

Amazon RDS is an extremely interesting product — the beta-tag is even more impressive. It'll be interesting to see what it will offer once Amazon pronounces it stable.

As for the future of our RDS-setups: they are not gonna go away soon. One of our objectives for 2012 is stabilizing across all products and infrastructure underneath. I think this will include some sort of integration with our external monitoring system to at least try to make sense of things like Cloudwatch and to be able to correlate with other events from all over production.

## Zend Framework: Writing an IN-Clause with Zend_Db

I'm currently running a little database analysis to see how many of our users might be affected by Gawker's breach of security the other week. Since we use Zend Framework, Zend_Db is our weapon of choice.

Running the queries I noticed, that Zend_Db doesn't seem to support an IN-clause with a straight fetchAll().

The IN-clause is only supported when I wrap my statement with Zend_Db_Select, which is something I rarely do. Part of the reason is that I still don't feel comfortable writing my SQL in a DSL which doesn't really do anything besides wrapping a string into an object and IMHO it doesn't add to readability either.

And the other reason is that I don't plan to run this import against any other database than MySQL. Which is why I don't see the need for abstraction either.

## Example

The following code fails:

// $db = Zend_Db::factory($config);
$sql = "SELECT foo FROM bar WHERE foo IN(?)";$params = array(1,2,3,4);
$db->fetchAll($sql, $params);  ### The warning PHP Warning: mysqli_stmt::bind_param(): Number of variables doesn't match number of parameters in prepared statement in /opt/library/Zend/Db/Statement/Mysqli.php on line 204  ### The exception Mysqli statement execute error : No data supplied for parameters in prepared statement  ## Solution My solution is simple: $sql    = "SELECT foo FROM bar WHERE foo IN (?)";
$params = array('foo', 'bar', 'foobar', "foo\bar", '\bar', "'foobar'");$query  = str_replace('?', substr(str_repeat('?, ', count($params)), 0, -2),$sql);
$db->fetchAll($query, \$params);


I'm expanding a single ? to as many ?'s as necessary. The reason why I did this, is that I wanted to be able to use Zend_Db's quoting. The data I'm dealing with is really messy and I didn't feel like repairing it.

Of course there may be other and maybe more simple ways to do it.

## Fin

That's all, happy hacking.

(Update, 2010/12/19 10:05 PM: Refined my example. Apparently people don't read the text (= blog post) around the code. :-))

## My Berlin Buzzwords 2010 recap

I attended Berlin Buzzwords 2010 for the last two days and aside from meeting a bunch of great people during talks, here are some take away notes from this conference:

• I got introduced to new stuff — such as HyperTable (c++ bigtable implementation), which I had never heard of before.

• I actually know a lot more about hadoop, HDFS and tika now than I did before — though I won't be able to use any a lot of it soon. The HDFS talk in particular was interesting as it got rid off the bells and whistles (OMG distributed file system and replicated!!!) for me. On Hadoop — it was easy to feel a little overwhelmed.

• No MongoDB for me.

• Hilarious: "Localhost is local most." (by Mario Scheliga)

• (On HDFS' issues with the NameNode:) "Highly available vs. pretty highly available."

• A lot of people talked about scaling (in and off talks) without a) having any first hand experience and/or b) a need for it. That was probably the buzzwordy part about this conference.

• I did not learn as much about Lucene as I wanted or had planned. Primarily because the nature of the talks was a little too advanced for me. A basic introduction to Lucene/Solr's architecture and ways to scale out is still on my wish list.

• I noticed that contributors to Apache projects like to discuss Jira issues in their talks.

• Twitter is using Lucene/Java to scale out its (near real-time) search, but sticks to trivial types (instead of objects) to (re)gain performance.

• Riak seems pretty cool: consistent hashing, auto-balancing, sharding — must investigate more. Also, Rusty Klophaus is a cool guy and I learned that Basho is not just a software company, but they also have a band. And riak is Indonesian and stands for something like how the water flows.

• Cassandra looks interesting as well. Considering they are Java not written in Erlang, a lot of people seem to like them anyways. Also, Eric Evans is a great presenter — kudos to him. I especially liked the part where he suggested to not use Cassandra for obvious reasons, but the inner geek disagreed.

• I don't know why presentations by Nokia, are like that. I'm missing a little enthusiasm about work or project.

• Bashing other projects sucks. Also, introducing yourself with, "We are like X but better.", makes you look shady as well.

• Benchmarks on slides really suck. And if people still can't resist, they should have a better explanation for them.

• Berlin Buzzwords really had a great venue.

• Thanks mucho to the organizers — Isabel, Simon, Jan & newthinking — for an interesting conference.

For more details, head over to Rusty Klophaus:

## A toolchain for CouchDB Lounge

One of our biggest issues with CouchDB is currently the lack of compaction of our database, and by lack of, I don't mean that CouchDB doesn't support it, I mean that we are unable to actually run it.

## Compaction in a nutshell

Compaction in a nutshell is pretty cool.

As you know, CouchDB is not very space-efficient. For once, CouchDB saves revisions of all documents. Which means, whenever you update a document a new revision is saved. You can rollback any time, or expose it as a nifty feature in your application — regardless, those revisions are kept around until your database is compacted.

Think about it in terms of IMAP - emails are not deleted until you hit that magic "compact" button which 99% of all people who use IMAP don't know what it's for anyway.

Another thing is that whenever new documents are written to CouchDB and bulk mode is not used, it'll save them in a way which is not very efficient either. In terms of actual storage and indexing (so rumour has it).

## Compaction woes

Since everything is simple with CouchDB, compaction is a simple process in CouchDB too. Yay!

When compaction is started, CouchDB will create a new database file where it stores the data in a very optimized way (I will not detail on this, go read a science book or google if you are really interested in this!). When the compaction process finished, CouchDB will exchange your old database file with the new database file.

The woes start with that e.g. when you have 700 GB uncompacted data, you will probably need another 400 GB for compaction to finish because it will create a second database file.

The second issue is that when you have constant writing on your database, the compaction process will actually never finish. It kind of sucks and for those people who aim to provide close to 100% availability, this is extremely painful to learn.