Find space hogs and prettify output using AWK

If you enjoyed this article, please leave a comment, rss subscribe to my RSS feed and/or follow me on Twitter. Thank you very much!

I really love awk.

You might disagree and call me crazy, but while awk might be a royal brainfuck at first, here's a very simple example of its power which should explain my endorsement.

Figuring out space hogs

Every once in a while I run out of diskspace on /home. Even though I am the only user on this laptop I'm always puzzled as of why and I start running du trying to figure out which install or program stole my diskspace.

Here's a example of how I start it off in $HOME: du -h --max-depth 1

If I run the above line in my $HOME directory, I get a pretty list of lies — and thanks to -h this list is including more or less useful SI units, e.g. G(B), M(B) and K(B).

However, since I have a gazillion folders in my $HOME directory, the list is too long to figure out the biggest offenders, so naturally, I pipe my du command to sort -n. This doesn't work for the following reason:

till@till-laptop:~$ du -h --max-depth 1|sort -n
(...)
2.5M    ./net_gearman
2.6M    ./logs
2.7M    ./.gconf
2.8M    ./.openoffice.org2
3.3G    ./.config
3.3M    ./ubuntu
(...)

The order of the files is a little screwed up. As you see .config ate 3.3 GB and listed before ubuntu, which is only 3.3 MB in size. The reason is that sort -n (-n is numeric sort) doesn't take the unit into account. It compares the string and all of the sudden it makes sense why 3.3G is listed before 3.3M.

This is what I tried to fix this: du --max-depth 1|sort -n

The above command omits the human readable SI units (-h), and the list is sorted. Yay. Case closed?

AWK to the rescue

In the end, I'm still human, and therefor I want to see those SI units to make sense of the output and I want to see them in the correct order:

du --max-depth 1|sort -n|awk '{ $1=$1/1024; printf "%.2f MB: %s\n",$1,$2 }'

In detail

Let me explain the awk command:

  • Whenever you pipe output to awk, it breaks the line into multiple variables. This is incredible useful as you can avoid grep'ing and parsing the hell out of simple strings. $0 is the entire line, then $1, $2, etc. — awk magically divided the string by _whitespace. As an example, "Hello World" piped to awk would be $0 equals "Hello World", $1 equals "Hello" and $2 equals "World".
till@till-laptop:~$ echo "Hello World" |awk '{ print $0 }'
Hello World
till@till-laptop:~$ echo "Hello World" |awk '{ print $1 }'
Hello
till@till-laptop:~$ echo "Hello World" |awk '{ print $2 }'
World
  • My awk command uses $1 (which contains the size in raw kilobytes) and devides it by 1024 to receive megabytes. No rocket science!
  • printf outputs the string and while outputting we round the number (to two decimals: %.2f) and display the name of the folder which is still in $2.

All of the above is not just simple, but it should look somewhat familiar when you have a development background. Even shell allows you to divide a number and offers a printf function for formatting purposes.

Fin

Tada!

I hope awk is a little less confusing now. For further reading, I recommend the GNU AWK User Guide. (Or maybe just keep it open next time you think you can put awk to good use.)

| More