Find space hogs and prettify output using AWK

Saturday, September 18. 2010

I really love awk.

You might disagree and call me crazy, but while awk might be a royal brainfuck at first, here's a very simple example of its power which should explain my endorsement.

Figuring out space hogs

Every once in a while I run out of disk space on /home. Even though I am the only user on this laptop, I'm always puzzled as to why, and I start running du to figure out which install or program stole my disk space.

Here's an example of how I start it off in $HOME: du -h --max-depth 1

If I run the above line in my $HOME directory, I get a pretty list of sizes — and thanks to -h this list includes more or less useful SI units, e.g. G(B), M(B) and K(B).

However, since I have a gazillion folders in my $HOME directory, the list is too long to figure out the biggest offenders, so naturally, I pipe my du command to sort -n. This doesn't work for the following reason:

user@host:~$ du -h --max-depth 1|sort -n
2.5M    ./net_gearman
2.6M    ./logs
2.7M    ./.gconf
2.8M    ./.openoffice.org2
3.3G    ./.config
3.3M    ./ubuntu

The order of the entries is a little screwed up. As you can see, .config ate 3.3 GB and is listed before ubuntu, which is only 3.3 MB in size. The reason is that sort -n (-n is numeric sort) doesn't take the unit into account: it compares the leading number and then the rest as a string, and all of a sudden it makes sense why 3.3G is listed before 3.3M.

This is what I tried to fix it: du --max-depth 1|sort -n

The above command omits the human-readable SI units (-h), and the list is sorted. Yay. Case closed?

AWK to the rescue

In the end, I'm still human, and therefore I want to see those SI units to make sense of the output — and I want to see them in the correct order:

du --max-depth 1|sort -n|awk '{ $1=$1/1024; printf "%.2f MB: %s\n",$1,$2 }'

In detail

Let me explain the awk command:

  • Whenever you pipe output to awk, it breaks each line into multiple variables. This is incredibly useful as you can avoid grep'ing and parsing the hell out of simple strings. $0 is the entire line, then there's $1, $2, etc. — awk magically splits the string by whitespace. As an example, with "Hello World" piped to awk, $0 equals "Hello World", $1 equals "Hello" and $2 equals "World".
user@host:~$ echo "Hello World" | awk '{ print $0 }'
Hello World
user@host:~$ echo "Hello World" | awk '{ print $1 }'
Hello
user@host:~$ echo "Hello World" | awk '{ print $2 }'
World
  • My awk command takes $1 (which contains the size in raw kilobytes) and divides it by 1024 to get megabytes. No rocket science!
  • printf outputs the string; while outputting, we round the number to two decimals (%.2f) and display the name of the folder, which is still in $2.
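Putting those pieces together, here's the same sort | awk stage run over canned du output (sample numbers, not real directories; your own du output will obviously differ):

```shell
# Simulated `du --max-depth 1` output: size in kilobytes, then the path.
printf '%s\n' \
    '2560 ./net_gearman' \
    '3460 ./logs' \
    '3460608 ./.config' |
sort -n |                                            # numeric sort on the KB column
awk '{ $1 = $1 / 1024; printf "%.2f MB: %s\n", $1, $2 }'
```

This prints:

2.50 MB: ./net_gearman
3.38 MB: ./logs
3379.50 MB: ./.config

One caveat: since awk splits on whitespace, $2 only holds the path up to the first space, so directory names containing spaces get truncated.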

All of the above is not just simple, it should also look somewhat familiar if you have a development background. Even the shell lets you divide numbers and offers a printf builtin for formatting purposes.
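As an aside: if your GNU coreutils is recent enough (sort grew -h, aka --human-numeric-sort, in coreutils 7.5), you can keep du's -h and let sort compare the units directly, no awk required:

```shell
# sort -h understands the K/M/G suffixes that `du -h` emits.
printf '%s\n' '3.3G ./.config' '2.5M ./net_gearman' '3.3M ./ubuntu' | sort -h
```

With real data that's simply du -h --max-depth 1 | sort -h.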



I hope awk is a little less confusing now. For further reading, I recommend the GNU AWK User Guide. (Or maybe just keep it open next time you think you can put awk to good use.)

Installing Varnish on Ubuntu Hardy

Tuesday, September 14. 2010

This is a quick and dirty rundown on how to install Varnish 2.1.x on Ubuntu Hardy (8.04 LTS).

Get sources setup

Add the repository to /etc/apt/sources.list:

deb hardy Varnish-2.1 

Import the key for the new repository:

gpg --keyserver --recv-keys 60E7C096C4DEFFEB
gpg --armor --export 60E7C096C4DEFFEB | apt-key add -


Update sources list and install varnish:

apt-get update
apt-get install varnish

Files of importance:



root@server:~# varnishd -V
varnishd (varnish-2.1.2 SVN)
Copyright (c) 2006-2009 Linpro AS / Verdens Gang AS

Further reading

I recommend a mix of the following websites/links:


That's all!

Selenium & Saucelenium: installation and dbus-xorg-woes

Tuesday, September 7. 2010

We're about to launch a new product, and this time it's pretty client-side-intense. The application is powered by a lot of JavaScript(-mvc) and jQuery, which make XHR calls to a ZF/CouchDB-powered backend. While js-mvc has unit testing sort of covered, I was also looking for some integration testing, multiple browsers and all that.

Selenium vs. Saucelenium

I can't really say if you want one or the other. Revisiting Selenium in general, it's IMHO the only viable and suitable thing for a PHP shop — primarily, of course, because all those nifty test cases will integrate into our existing suite of PHPUnit/phpt tests. And while I already use Zend_Test, and find projects like SimpleTest's browser or even Perl's WWW::Mechanize very appealing, none of those executes JavaScript like a browser does.

Selenium and Saucelenium share the same roots — in fact, Saucelenium is a Selenium fork. While the Selenium project currently seems to focus on 2.x, stable 1.x development seems to really happen at Saucelabs. That is, if you call a commit from January 22nd of this year active development.

In the process of selecting one or the other, more people recommended the Saucelabs distribution than the original one, and so I forked it on GitHub.


Along with a script to start the damn thing, my fork also contains a README. Said README covers the installation part in detail, so I won't have to repeat myself here. All of this is pretty Ubuntu-centric and has been tested on Karmic Koala. I expect things to work just as well on Lucid, or on any other distribution if you get the installation right.

One thing that took me a while to figure out was the following error message:

user@host:/usr/src/saucelenium$ sudo ./
[config/dbus] couldn't take over org.x.config: org.freedesktop.DBus.Error.AccessDenied (Connection ":1.17" is not allowed to own the service "org.x.config.display0" due to security     policies in the configuration file)
(EE) config/hal: couldn't initialise context: unknown error (null)
13:17:23.166 INFO - Writing debug logs to /var/log/selenium.log
13:17:23.166 INFO - Java: Sun Microsystems Inc. 16.0-b13
13:17:23.166 INFO - OS: Linux amd64
13:17:23.176 INFO - v1.0.1 [exported], with Core v@VERSION@ [@REVISION@]
13:17:23.296 INFO - Version Jetty/5.1.x
13:17:23.306 INFO - Started HttpContext[/selenium-server/driver,/selenium-server/driver]
13:17:23.306 INFO - Started HttpContext[/selenium-server,/selenium-server]
13:17:23.316 INFO - Started HttpContext[/,/]
13:17:23.326 INFO - Started SocketListener on 0.0.0.0:4444
13:17:23.326 INFO - Started org.openqa.jetty.jetty.Server@...
^C13:18:09.796 INFO - Shutting down...

After Googling for a bit, I came to the conclusion that the above means I didn't have an X server installed.

The fix was rather simple: aptitude install xserver-xorg.

Example test

The following is an example test case. It'll open www.php.net and make sure it finds "What is PHP?" somewhere on that page.

Then it will continue to /downloads.php (by clicking on that link) and will make sure it finds "Binaries for other systems" on that page.

To run this test, execute: phpunit ExampleTestCase.php.

class ExampleTestCase extends PHPUnit_Extensions_SeleniumTestCase
{
    protected function setUp()
    {
        $this->setBrowser('*firefox');
        $this->setBrowserUrl('http://www.php.net/');
    }

    public function testPHP()
    {
        $this->open('/');
        $this->assertTextPresent('What is PHP?');

        $this->clickAndWait('link=downloads');
        $this->assertTextPresent('Binaries for other systems');

        // check this out, especially useful for debugging:
        $this->assertEquals('http://www.php.net/downloads.php', $this->drivers[0]->getLocation());
    }
}
That's all.

Thanks for reading, and until next time.

Tumblr: Display a list of entries in the sidebar

Thursday, September 2. 2010

Update 2010-09-06: I turned my JavaScript code into a handy plugin for jQuery — let me introduce: jquery-simplerss.

So for whatever reason, on a lot of blogs (but not mine ;-)), the sidebar also contains the list of latest entries on said blog.

I recently edited a template for a client who requested the same feature — which put me through a three-hour nightmare.


Tumblr is a hosted blog service. It doesn't have as many (confusing) features as, for example, Wordpress or Serendipity, but it does really well at what people usually do with blogs: posting stuff. Stuff includes text, excerpts from chats, photos, quotes, music, videos and maybe more.

The advantages of Tumblr versus any local blog install are:

  • it's (completely ad-)free (a local blog requires a webhost or server of some kind)
  • no installation
  • no security updates

Tumblr also comes with a domain and custom theme feature where you can basically CNAME a domain name to their server and it looks like it's part of your domain — take my activity stream as an example.

While I realize that Tumblr won't be free forever, all of the above are plenty of good reasons why I'd rather recommend that my clients get a Tumblr account instead of installing Wordpress on their server and getting hacked a week or so later.

Customized themes

For those of us with slightly less design skill, there are commercial themes available; everyone else can customize any available theme.

Customized themes are simple and meant to be kept simple for the most part. A sidebar with the latest posts, for example, is not something the theme system allows you to do, and while I appreciate that my suggestion was passed on to the development team (like others before it), I couldn't really wait until someone got around to making it happen.

JavaScript/jQuery to the rescue!

So I decided to use jQuery to parse the blog's RSS feed and display the three latest items. While I realize that there are plugins available, I thought I'd do it myself — quick and dirty — because with $.ajax() and .find(), jQuery offers pretty much all you need to download the RSS feed and parse the bits out of it. And since the blog and its RSS feed are served from the same domain, not even the cross-domain obstacles apply!

Here's how I did it!

Let's walk through the code, shall we?

  • $.ajax() is used to GET the feed
  • for Tumblr: dataType: 'xml'
  • .find() is used to select the items (rss > channel > item)
  • inside the loop:
    • find title, description and guid
    • create html (which is basically my template)
    • exit from the loop after we ran through the 3 latest items
  • after the loop: append to a selector

For the above to work, you'll need to include jQuery and the following HTML:

<ul id="blogposts"></ul>


So even though Tumblr's feed is valid RSS, there were a few things that didn't work right away.

At first I used dataType: feed in $.ajax() — which made sense at the time. But the problem is that Tumblr sends Content-Type: text/xml for their RSS feed. I haven't checked the internals of the dataType but apparently Google Chrome applies stricter rules to what it thinks are feeds.

This issue means that, for example, the entities in <title> make Google Chrome drop the <title> altogether when the XML feed is parsed. There is an interesting thread on Stack Overflow about the issue. The quick fix is to use dataType: xml instead.

The second problem is that <link> is dropped/ignored because the link is wrapped in double quotes. So I used <guid> instead.


I certainly hope that this example is complete and saves someone else some time.

PHP: So you'd like to migrate from MySQL to CouchDB? - Part III

Monday, May 17. 2010

This is part three of a beginner series for people with a MySQL/PHP background. Apologies for the delay; this entry has been in draft since December 13th of last year (2009).

Follow these links for the previous parts:


Part I introduced the CouchDB basics, including basic requests using PHP and cURL. Part II focused on create, read, update and delete operations in CouchDB. I also introduced my nifty PHP CouchDB client, called ArmChair!

ArmChair is my own very simple and (hopefully) easy-to-use approach to accessing CouchDB from PHP. The objective is to develop it with each part of this series to make it a more comprehensive solution.

Part III

Part three will cover basic view functions in CouchDB — think of views as a WHERE clause in MySQL. They are similar, but also not. :-)


If you read up on CouchDB before coming to this blog, you will probably have heard of map-reduce, either there or elsewhere. A lot of people attribute Google's success to map-reduce, because it lets them process a lot of data in parallel (across multiple cores and/or machines) in relatively little time.

I guess the PageRank in Google Search or Google Analytics are examples of where it could be used.

In the following, I'll try to explain what map-reduce is. For people without a science degree. (And that includes me!)


Generally, map-reduce is a way to process data. It's made of two things: map and reduce.

The idea is that the map function is very robust: it allows data to be broken up into smaller pieces so it can be processed in parallel. In most cases, the order the data is processed in doesn't really matter; what counts is that it gets processed at all. And since map allows us to run the processing in parallel, it's easier to scale out. (That's the secret sauce!)

And when I write scale out, I'm not suggesting you build a cluster of 1000 servers to process a couple thousand documents. In this case it's already sufficient to utilize all the cores in my own computer when the map task runs in parallel.

In CouchDB, the result of map is a list of keys and values.


Reduce is called once the map part is done. In CouchDB it's an optional step — not every map requires a reduce to follow.
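To make that concrete, here's a toy map/reduce built from nothing but standard shell tools. The data is made up; the point is the shape: map emits (key, value) pairs, grouping happens by key, and reduce collapses each group into a single value:

```shell
# map: emit one "(key, 1)" pair per input record
# sort: group identical keys together
# reduce: sum the values for each key (final sort just fixes the output order)
printf '%s\n' apple banana apple cherry apple |
awk '{ print $1, 1 }' |
sort |
awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }' |
sort
```

This prints "apple 3", "banana 1" and "cherry 1", one per line. Because each input record is mapped independently, the map stage could just as well run split across several cores or machines.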

Real world example

  • take a simple photo application (such as flickr) with comments
  • use map to sort through the comments and emit the names of users who left one
  • use reduce to only get unique references and see how many comments were left by each user


In MySQL, the equivalent would be something like:

SELECT user, count(*) FROM comments GROUP BY user
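In CouchDB, that map/reduce pair would live in a design document. The following is only a sketch: the document fields (type, user) and the view name are assumptions, not something defined earlier in this series. The map function emits one row per comment, and CouchDB's built-in _count reduce tallies the rows per key:

```json
{
  "_id": "_design/comments",
  "views": {
    "by_user": {
      "map": "function (doc) { if (doc.type === 'comment') { emit(doc.user, 1); } }",
      "reduce": "_count"
    }
  }
}
```

Querying the view with ?group=true then returns one row per user along with the number of comments they left — just like the GROUP BY above.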

Why the fuss?

Just so nobody feels offended: map-reduce is slightly more complicated than my example SQL query, but it's also not some secret super-duper thing. Its strength is really parallelization, which requires the ability to break the data into chunks and process them independently. The end.

Continue reading "PHP: So you'd like to migrate from MySQL to CouchDB? - Part III"