Yesterday I got an email from a friend who complained that "awk is still a mystery". Not being one to ignore a cry for help with the command line, I was motivated to write up a simple introduction to the basics of awk. But where to post it? I know! We've got this little blog we're not doing anything with at the moment (er, yeah, sorry about that, folks. Life's been exciting for the Command Line Kung Fu team recently)...
The first thing you need to understand about awk is that it reads and operates on each line of input one at a time. It's as if your awk code were sitting inside a big loop:
for each line of input
    # your code is here
end loop
Your code goes in curly braces. So the simplest awk program is one that just prints out every line of a file:
awk '{print}' /etc/passwd
Nothing too exciting there. It's just a more complicated way to "cat /etc/passwd". Note that you generally want to enclose your awk code in single quotes like I did in the example above. This prevents special characters in the awk script from being interpolated by your shell before they even get to awk.
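If you want to see the quoting issue for yourself, here's a minimal sketch. The sample line is made up for illustration, and I clear the shell's positional parameters first so that the shell's own $2 is guaranteed to be empty:

```shell
# Made-up sample line, used only for illustration.
line='alpha beta gamma'

# Clear the shell's positional parameters so the shell's own $2 is empty.
set --

# Single quotes: the shell leaves $2 alone, so awk sees it and
# prints the second field.
single=$(printf '%s\n' "$line" | awk '{print $2}')
echo "$single"      # beta

# Double quotes: the shell expands $2 (to nothing) before awk runs,
# so awk receives {print } and prints the whole line instead.
double=$(printf '%s\n' "$line" | awk "{print $2}")
echo "$double"      # alpha beta gamma
```

The second command doesn't fail, it just quietly does something different, which is exactly why the single quotes are a good habit.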
One of the nice features of awk is that, by default, it automatically splits up each input line using whitespace as the delimiter. It doesn't matter how many spaces or tabs appear in between items on the line; each chunk of whitespace in its entirety is treated as a single delimiter.
The whitespace-delimited fields are put into variables named $1, $2, and so on. Rather than just doing "print" as we did in the last example (which prints out the whole original line), you can print out any of the individual fields by number. For example, I can pull out the percentage used (field 5) and file system mount point (field 6) from df output:
$ df -h -t ext4 | awk '{print $5, $6}'
Use% Mounted
58% /
24% /boot
42% /var
81% /home
89% /usr
The comma in the "print $5, $6" expression causes awk to put a space between the two fields. If you did "print $5 $6", you'd get the two fields jammed up against each other with no space between them.
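Here's a small sketch you can run without df to watch both behaviors: the splitting on runs of whitespace and the difference the comma makes. The input line is made up and deliberately mixes spaces and tabs:

```shell
# Made-up input: three fields separated by a mix of spaces and tabs.
input=$(printf 'one   two\t\tthree')

# With a comma, print puts a single space between the fields.
with_comma=$(printf '%s\n' "$input" | awk '{print $1, $3}')
echo "$with_comma"      # one three

# Without the comma, the two fields are simply concatenated.
no_comma=$(printf '%s\n' "$input" | awk '{print $1 $3}')
echo "$no_comma"        # onethree
```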
We could use a similar strategy to pull out just the usernames from ps (field 1):
$ ps -ef | awk '{print $1}'
UID
root
root
root
...
Not so interesting maybe, until you start combining it with other shell primitives:
$ ps -ef | awk '{print $1}' | sort | uniq -c | sort -nr
    188 root
     70 hal
      2 www-data
      2 avahi
      2 108
      1 UID
      1 syslog
      1 rtkit
      1 ntp
      1 mysql
      1 gdm
      1 daemon
      1 102
Once we sort all the usernames in order, we can use "uniq -c" to count the number of processes running as each user. The final "sort -nr" gives us a descending ("-r") numeric ("-n") sort of the counts.
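If you'd like to experiment with this pipeline on input that doesn't change from run to run, here's a sketch using a tiny stand-in for the ps output. The sample lines are hypothetical, not real ps output:

```shell
# Hypothetical, trimmed-down stand-in for "ps -ef" output.
sample='UID PID CMD
root 1 init
hal 200 bash
root 2 kthreadd
hal 201 vim
root 3 sshd'

# Same pipeline as above: pull field 1, sort, count, sort by count.
counts=$(printf '%s\n' "$sample" | awk '{print $1}' | sort | uniq -c | sort -nr)
echo "$counts"
```

You should see root with a count of 3, hal with 2, and the "UID" header with 1, just like the header showed up in the real output above. The exact padding in front of the counts depends on your version of uniq.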
And this is fundamentally what's interesting about awk. It's great in the middle of a shell pipeline to be able to pull out individual fields that we're interested in processing further.
The other cool power of awk is that you can operate on selected lines of your input and ignore the rest. Any awk statement like "{print}" can optionally be preceded by a conditional expression. If a conditional expression is present, then your awk code will only operate on lines that match it.
The most common conditional operator is "/.../", which does pattern matching. For example, I could pull out the process IDs of all sshd processes like this:
$ ps -ef | awk '/sshd/ {print $2}'
1366
10883
That output is maybe more interesting when you use it with the kill command to kick people off of your system:
# kill $(ps -ef | awk '/sshd/ {print $2}')
Of course, you'd better be on the system console when you execute that command. Otherwise, you've just locked yourself out of the box!
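One wrinkle worth knowing about: the awk command line itself contains the text "sshd", so depending on timing the awk process can show up in the ps output and match its own pattern. A common workaround is to write the pattern as "/[s]shd/", which still matches "sshd" but doesn't match its own literal text. Here's a sketch on hypothetical ps-style lines (the PIDs on the third line are made up):

```shell
# Hypothetical "ps -ef"-style lines. The third line is the awk command
# itself as it might appear in the process list.
sample='root 1366 1 /usr/sbin/sshd -D
hal 10883 1366 sshd: hal@pts/0
hal 10990 10884 awk /[s]shd/ {print $2}'

# [s]shd matches the text "sshd", but the literal text "[s]shd" on the
# third line does not contain "sshd", so that line is skipped.
pids=$(printf '%s\n' "$sample" | awk '/[s]shd/ {print $2}')
echo "$pids"
```

That prints 1366 and 10883 and leaves the awk line out.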
While pattern matching tends to get used most frequently, awk has a full suite of comparison and logical operators. Returning to our df example, what if we wanted to print out only the file systems that were more than 80% full? Remember that the percent used is in field 5 and the file system mount point is field 6. If field 5 is more than 80, we want to print field 6:
$ df -h -t ext4 | awk '($5 > 80) {print $6}'
Mounted
/home
/usr
Whoops! The header line ends up getting dumped out too! We'd actually like to suppress that. I could use the tail command to strip that out, but I can also do it in our awk statement:
$ df -h -t ext4 | awk '$5 ~ /[0-9]/ && ($5 > 80) {print $6}'
/home
/usr
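"$5 ~ /[0-9]/" means do a pattern match specifically against field 5 and make sure it contains at least one digit. And then we check to make sure that field 5 is greater than 80. If both of those conditional expressions are true then we'll print out field 6. I made this more complicated than it needs to be just to show you that you can put together complicated logical expressions with "&&" (and "||" for the "or" relationship) and do pattern matching on specific fields if you want to. One caveat: because field 5 ends in "%", awk compares it with 80 as a string rather than as a number, which is also why the header line matched in the first attempt. That works for the two-digit values shown here, but a value like "9%" or "100%" would be misjudged. Writing "$5+0 > 80" forces a numeric comparison.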
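To check how the comparison behaves on values that aren't two digits long, here's a sketch using made-up df-style output (the devices and sizes are invented for illustration). It compares the expression from above with a variant that adds 0 to field 5, which makes awk convert the leading digits to a number before comparing:

```shell
# Made-up df-style output with a one-digit and a three-digit percentage.
sample='Filesystem Size Used Avail Use% Mounted on
/dev/sda1 20G 1.8G 18G 9% /
/dev/sda2 50G 41G 9G 81% /home
/dev/sda3 10G 10G 0 100% /data'

# "9%" is not a number, so awk compares it with 80 as a string:
# "9%" sorts after "80", and "100%" sorts before it.
as_string=$(printf '%s\n' "$sample" | awk '$5 ~ /[0-9]/ && ($5 > 80) {print $6}')
echo "$as_string"

# $5+0 turns "81%" into 81 (and the header "Use%" into 0),
# so the comparison is numeric.
as_number=$(printf '%s\n' "$sample" | awk '($5+0 > 80) {print $6}')
echo "$as_number"
```

The first command reports / and /home and misses /data. The second reports /home and /data, and as a side effect the header no longer needs special handling because "Use%" converts to 0.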
While splitting on whitespace is frequently useful, sometimes you're dealing with input that's broken up by some other character, like commas in a CSV file or colons in /etc/passwd. awk has a "-F" option that lets you specify a delimiter other than whitespace.
Here's a little trick to find out if you have any duplicate UIDs in your /etc/passwd file:
$ awk -F: '{print $3}' /etc/passwd | sort | uniq -d
Here we're merely using awk to pull the UID field (field 3) from the colon-delimited ("-F:") /etc/passwd file. Then we sort the UIDs and use "uniq -d" to tell us if there are any duplicates. You want this command to return no output, indicating no duplicates were found.
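Since a healthy /etc/passwd produces no output, here's a sketch with a made-up passwd-style sample that does contain a duplicate, so you can see what a hit looks like. The accounts below are hypothetical:

```shell
# Made-up passwd-style data; "root" and "toor" share UID 0.
sample='root:x:0:0:root:/root:/bin/bash
hal:x:1000:1000:Hal:/home/hal:/bin/bash
toor:x:0:0:backup root:/root:/bin/sh
www-data:x:33:33:www-data:/var/www:/usr/sbin/nologin'

# Split on colons, pull field 3, and report any UID seen more than once.
dupes=$(printf '%s\n' "$sample" | awk -F: '{print $3}' | sort | uniq -d)
echo "$dupes"       # 0
```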
There's a lot more to awk, but this is more than enough to get you started with this useful little utility. But like any new skill, the best way to master awk is practice. So I'm going to give you a few exercises to work on. I'll post the answers on the blog in a week or so. Good luck!