Find space hogs and prettify output using AWK

Saturday, September 18. 2010

I really love awk.

You might disagree and call me crazy, but while awk might be a royal brainfuck at first, here's a very simple example of its power which should explain my endorsement.

Figuring out space hogs

Every once in a while I run out of disk space on /home. Even though I am the only user on this laptop, I'm always puzzled as to why, and I start running du to figure out which install or program stole my disk space.

Here's an example of how I start it off in $HOME: du -h --max-depth 1

If I run the above line in my $HOME directory, I get a pretty list of files, and thanks to -h this list includes more or less useful SI units, e.g. G(B), M(B) and K(B).

However, since I have a gazillion folders in my $HOME directory, the list is too long to figure out the biggest offenders, so naturally, I pipe my du command to sort -n. This doesn't work for the following reason:

$ du -h --max-depth 1 | sort -n
2.5M    ./net_gearman
2.6M    ./logs
2.7M    ./.gconf
2.8M    ./.openoffice.org2
3.3G    ./.config
3.3M    ./ubuntu

The order of the files is a little screwed up. As you can see, .config ate 3.3 GB and is listed before ubuntu, which is only 3.3 MB in size. The reason is that sort -n (-n is numeric sort) doesn't take the unit into account. It only reads the number at the start of each line, so 3.3G and 3.3M count as equal, and the tie is broken by comparing the lines as plain strings. All of a sudden it makes sense why 3.3G is listed before 3.3M: G sorts before M.
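
The tie between 3.3G and 3.3M can be reproduced on a few sample lines. This is only a sketch: sample is a made-up helper that prints lines in the format of du -h, and the second command assumes a sort that supports -h (human-numeric sort), as GNU sort does.

```shell
# Hypothetical helper: three lines in the format of du -h (size, tab, path).
sample() {
  printf '3.3M\t./ubuntu\n3.3G\t./.config\n2.5M\t./net_gearman\n'
}

# -n reads only the leading number, so 3.3G and 3.3M compare as equal.
# LC_ALL=C makes the tie-break a plain byte comparison ("G" sorts before "M").
sample | LC_ALL=C sort -n

# -h takes the unit suffix into account, so the gigabyte entry ends up last.
# Assumption: your sort supports -h; GNU sort does.
sample | LC_ALL=C sort -h
```

If sort -h is available, it can sort the output of du -h directly. The rest of this post sticks to the raw numbers and lets awk do the formatting.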

This is what I tried to fix this: du --max-depth 1|sort -n

The above command omits the human readable SI units (-h), and the list is sorted. Yay. Case closed?

AWK to the rescue

In the end, I'm still human, and therefore I want to see those SI units to make sense of the output, and I want to see them in the correct order:

du --max-depth 1|sort -n|awk '{ $1=$1/1024; printf "%.2f MB: %s\n",$1,$2 }'

In detail

Let me explain the awk command:

  • Whenever you pipe output to awk, it breaks each line into multiple variables. This is incredibly useful as you can avoid grep'ing and parsing the hell out of simple strings. $0 is the entire line, then come $1, $2, etc.: awk magically divides the string by whitespace. As an example, "Hello World" piped to awk would mean $0 equals "Hello World", $1 equals "Hello" and $2 equals "World".
$ echo "Hello World" | awk '{ print $0 }'
Hello World
$ echo "Hello World" | awk '{ print $1 }'
Hello
$ echo "Hello World" | awk '{ print $2 }'
World
  • My awk command uses $1 (which contains the size in raw kilobytes) and divides it by 1024 to get megabytes. No rocket science!
  • printf outputs the string; while outputting, we round the number to two decimals (%.2f) and display the name of the folder, which is still in $2.
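
To watch the field splitting and the formatting at work without touching a real file system, the command can be fed a few made-up lines. This is only a sketch: fake_du, to_mb and to_mb_or_gb are names invented for illustration, the sizes are not from a real du run, and the GB variant is one possible extension rather than part of the one-liner above.

```shell
# Hypothetical stand-in for "du --max-depth 1": size in KB, a tab, the path.
fake_du() {
  printf '2560\t./net_gearman\n3460300\t./.config\n3379\t./ubuntu\n'
}

# The one-liner from the post, wrapped in a function: KB in, MB out.
to_mb() {
  awk '{ $1=$1/1024; printf "%.2f MB: %s\n",$1,$2 }'
}

# Possible extension (not in the original): use GB from 1024 MB upwards.
to_mb_or_gb() {
  awk '{
    mb = $1 / 1024
    if (mb >= 1024) printf "%.2f GB: %s\n", mb / 1024, $2
    else            printf "%.2f MB: %s\n", mb, $2
  }'
}

fake_du | sort -n | to_mb
fake_du | sort -n | to_mb_or_gb
```

Both variants read the folder name from $2, so a name containing spaces would be cut off after its first word.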

All of the above is not just simple, it should also look somewhat familiar if you have a development background. Even the shell allows you to divide a number and offers a printf command for formatting purposes.
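
As a rough comparison, here is the same division done in the shell itself and in awk. The size is an invented value. Note that arithmetic in bash and plain POSIX sh works on integers only, so the fractional part is dropped, which is one reason awk is handy here.

```shell
size_kb=3460300   # made-up size in kilobytes

# Integer division in the shell: the remainder is thrown away.
size_mb=$((size_kb / 1024))
printf '%d MB\n' "$size_mb"

# awk does the same division in floating point.
awk -v kb="$size_kb" 'BEGIN { printf "%.2f MB\n", kb / 1024 }'
```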



I hope awk is a little less confusing now. For further reading, I recommend the GNU Awk User's Guide. (Or maybe just keep it open next time you think you can put awk to good use.)

Ubuntu: nginx+php-cgi on a socket

Friday, July 31. 2009

Moving our PHP application into the cloud means that we are leaving FreeBSD for Linux. Not the best move (IMHO), but I shall elaborate on this in a future blog post.

Once we decided on Ubuntu as the Linux of our choice, I started by moving our development server to an instance on Slicehost. Point taken, Slicehost is not the cloud (as in Amazon EC2, Rackspace, Flexiscale or GoGrid) yet, but Linux on Slicehost and Linux on Amazon EC2 will be alike (or so I hope :-)), and getting a small slice versus getting a small EC2 instance is an economical decision in the end.


The following is the start script for my php-cgi processes, which I ported from FreeBSD (I previously blogged about it here).

The advantages of this script are:

  1. php-cgi runs on a unix domain socket — no need for tcp/ip on localhost.
  2. No need for the infamous spawn-fcgi script, which never worked for me anyway, and on Ubuntu requires you to install lighttpd (if you don't happen to be on Karmic Koala).
  3. You can set up different websites with different instances of php-cgi. This is great for virtual hosting, especially on a development server where the different workspaces may have different PHP settings and you want to run versions in parallel without them sharing settings and affecting each other.
  4. Icing on the cake: we could even add a custom php.ini to the start call for each instance (-c option) to customize it even further.
