When a little awk goes a long way

There is a great post on the Gregable blog titled “Why you should know just a little awk”. The comments are also worth reading. I discovered the link through another post, “A little awk”, on John Cook’s blog, The Endeavour.

I use a wide variety of Unix text processing tools on a regular basis, but over time, like many others, I started migrating the tasks that required the power of ‘awk’ to another language; in my case that language was Python. Typically I can write scripts faster in Python, and I find the code more readable. However, after reading the above post I was reminded that some one-liner ‘awk’ tasks are really clean and effective. Lately I have found myself using ‘awk’ again, sparingly. Here’s why…

When to use ‘awk’ instead of ‘cut’

  1. Cut’s delimiter is a single character; awk’s delimiter is a regular expression.
  2. Awk allows fields to be specified relative to the last field position using ‘NF’.
  3. Cut always displays fields in order of ascending field number, regardless of the order in which they appear in the field list parameter. Awk can display the fields in any order that you specify.
    Examples:

    split fields at any one of the characters a, b, c, or d
    awk -F'[abcd]'

    split at one (1) or more spaces
    awk -F' +'

    re-order fields
    awk '{print $3 "\t" $2 "\t" $1}'

    prints last field
    awk '{print $NF}'

    prints next to last field
    awk '{print $(NF-1)}'
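
The differences listed above can be seen side by side in a small sketch. The three-field record here is made up for illustration, and the variable names are only placeholders.

```shell
# Hypothetical sample record: three tab-separated fields.
line=$(printf 'alpha\tbeta\tgamma')

# cut ignores the order given in the field list and prints
# fields 1 and 3 in ascending order.
cut_out=$(printf '%s\n' "$line" | cut -f3,1)

# awk prints the fields in the order requested.
awk_out=$(printf '%s\n' "$line" | awk -F'\t' '{print $3 "\t" $1}')

# awk can also address a field relative to the end of the record.
last_out=$(printf '%s\n' "$line" | awk -F'\t' '{print $NF}')

printf 'cut:  %s\nawk:  %s\nlast: %s\n' "$cut_out" "$awk_out" "$last_out"
```

Here cut prints alpha then gamma, while awk prints gamma then alpha.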

    When to use ‘awk’ instead of Python, Perl, etc.?

    1. When you can write the task in one simple, readable line with awk, i.e.
      1. Simple reformatting of data.
      2. Simple comparisons on fields.
      3. Rearrange order of fields.
      4. Split on regular expressions, including multiple characters.
      5. Feel free to comment on other reasons.
    2. When Python, Perl, etc. scripts are too slow for repeated use; this is rare when they are coded properly.
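
As a rough illustration of the "simple comparison" and "simple reformatting" cases, here is a minimal sketch. The colon-separated name and score data is invented for the example, not taken from a real file.

```shell
# Hypothetical colon-separated input: name and score.
data=$(printf 'ann:72\nbob:41\ncy:90\n')

# Keep records whose second field is at least 50 and
# reformat them as a right-aligned score followed by the name.
passed=$(printf '%s\n' "$data" | awk -F: '$2 >= 50 {printf "%3d %s\n", $2, $1}')

printf '%s\n' "$passed"
```

The comparison and the reformatting both fit in one pattern-action pair, which is about the point where awk stays more readable than a full script.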

    Watch your quotes with ‘awk’ …

    Here is the standard Unix method of quoting:

    $ awk '$NF > 385 && $(NF-1) ~ "^Sh" {print NR "\t" $0}' orders.txt
    4       10416   2005-05-10 00:00:00     Shipped 386
    6       10418   2005-05-16 00:00:00     Shipped 412
    
    Here is the equivalent command using unxutils for Windows:

    Note the difference in quoting…

    C:\> gawk "$NF>385 && $(NF-1) ~ \"^Sh\" {print NR \"\t\" $0}" orders.txt
    4       10416   2005-05-10 00:00:00     Shipped 386
    6       10418   2005-05-16 00:00:00     Shipped 412
    

    Here is what the above command is doing:

    1. Iterates through every line in the file “orders.txt”.
    2. Splits each line into fields at runs of whitespace (spaces and tabs), which is awk’s default delimiter. A status such as “On Hold” therefore becomes two fields, but the tests below only look at the last two fields.
    3. Tests if the last field is greater than 385.
    4. Tests if the next to last field matches the regular expression “^Sh”, i.e. begins with the letters “Sh”.
    5. If items 3 and 4 are both true, prints the line number (NR) followed by a tab and then the line text itself.
    Sample text being processed:
    $ cat orders.txt
    10413   2005-05-05 00:00:00     Shipped 175
    10414   2005-05-06 00:00:00     On Hold 362
    10415   2005-05-09 00:00:00     Disputed        471
    10416   2005-05-10 00:00:00     Shipped 386
    10417   2005-05-13 00:00:00     Disputed        141
    10418   2005-05-16 00:00:00     Shipped 412
    10419   2005-05-17 00:00:00     Shipped 382
    10420   2005-05-29 00:00:00     In Process      282
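
If the file really is tab-delimited and a field may contain spaces (such as “On Hold”), the delimiter can be set explicitly. The following is a minimal sketch under that assumption: it rebuilds three of the sample rows in a temporary file, with an assumed layout of id, timestamp, status, and amount.

```shell
# Recreate a few of the sample rows as a tab-delimited file
# (assumed layout: id, timestamp, status, amount).
tmp=$(mktemp)
printf '10414\t2005-05-06 00:00:00\tOn Hold\t362\n'  > "$tmp"
printf '10416\t2005-05-10 00:00:00\tShipped\t386\n' >> "$tmp"
printf '10419\t2005-05-17 00:00:00\tShipped\t382\n' >> "$tmp"

# With -F'\t', "On Hold" stays a single field instead of
# being split into "On" and "Hold".
status=$(awk -F'\t' 'NR == 1 {print $(NF-1)}' "$tmp")

# Same filter as the one-liner above, with explicit tab
# separators for both input (-F) and output (OFS).
hits=$(awk -F'\t' -v OFS='\t' '$NF > 385 && $(NF-1) ~ /^Sh/ {print NR, $0}' "$tmp")

printf '%s\n%s\n' "$status" "$hits"
rm -f "$tmp"
```

Only the 10416 row passes both tests in this reduced file, and it is reported as record 2 because the temporary file has fewer rows than orders.txt.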