Over the last two weeks, I have been working on importing a raw text file of JSON data (~20 GB) into CouchDB.

Due to the fuzziness of the data, I decided not to use _bulk_docs for the import: if a single document inside a bulk request failed (e.g. a duplicate _id), I would have had to go through the request document by document to figure out what went wrong.

So while bulk writing offers speed, I decided to trade that speed for less complexity. That said, if you know your data, I would always suggest you use _bulk_docs.
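For illustration, here is a minimal sketch of such a one-document-at-a-time import, assuming one JSON object per line in the dump file. The file name, database URL, and the use of Python's requests library are my assumptions, not details from my actual setup.

```python
import json
import requests

COUCH_DB = "http://127.0.0.1:5984/imported_data"  # hypothetical database URL
DUMP_FILE = "dump.json"                           # hypothetical dump file, one JSON doc per line


def import_line_by_line():
    with open(DUMP_FILE, "r") as dump:
        for line_no, line in enumerate(dump, start=1):
            line = line.strip()
            if not line:
                continue
            doc = json.loads(line)
            # POST one document per request; a failure (e.g. a duplicate _id
            # answered with 409 Conflict) points directly at the offending line.
            resp = requests.post(COUCH_DB, json=doc)
            if resp.status_code not in (201, 202):
                print(f"line {line_no} failed: {resp.status_code} {resp.text}")


if __name__ == "__main__":
    import_line_by_line()
```

With _bulk_docs you would collect a batch of those lines into a single request instead, which is much faster but moves the error handling from "this request failed" to "something in this batch failed".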

The import ran for a week and totalled 20,125,604 documents and a database file of 79.4 GB.

Yesterday, I started _compact on the database, documented the numbers on Twitter, and was double-wow’d at the result: the database had shrunk to only 22.8 GB.
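Compaction is triggered over the same HTTP API. Here is a small sketch of how one might kick it off and watch its progress; the database URL is again hypothetical, and the compact_running and disk_size fields are what CouchDB reports in the database info document of the 1.x line I was using.

```python
import time
import requests

COUCH_DB = "http://127.0.0.1:5984/imported_data"  # hypothetical database URL

# POST /{db}/_compact starts compaction in the background and returns immediately.
requests.post(COUCH_DB + "/_compact", headers={"Content-Type": "application/json"})

# Poll the database info until the compact_running flag clears.
while True:
    info = requests.get(COUCH_DB).json()
    if not info.get("compact_running"):
        break
    print("still compacting, disk_size:", info.get("disk_size"))
    time.sleep(60)
```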

All in all, the operation ran for roughly 23 hours, during which I was still able to read from my database and possibly also write to it, though excessive writes during a compact should be avoided.

I also found it remarkable that the size of the compacted database compared to the raw text file is almost 1:1.