Level: Introductory
Daniel Robbins (drobbins@gentoo.org), President/CEO, Gentoo Technologies, Inc.
01 Dec 2000
Awk is a very nice language with a very strange name. In this first article of a three-part series, Daniel Robbins will quickly get your awk programming skills up to speed. As the series progresses, more advanced topics will be covered, culminating with an advanced real-world awk application demo.
In defense of awk
In this series of articles, I'm going to turn you into a proficient
awk coder. I'll admit, awk doesn't have a very pretty or particularly
"hip" name, and the GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with the language may hear "awk" and think of a mess of code so backwards and antiquated that it's capable of driving even the most knowledgeable UNIX guru to
the brink of insanity (causing him to repeatedly yelp "kill -9!" as he runs for the coffee machine).
Sure, awk doesn't have a great name. But it is
a great language. Awk is geared toward text
processing and report generation, yet it offers many well-designed
features that allow for serious programming. And, unlike some
languages, awk's syntax is familiar, and borrows some of the best
parts of languages like C, Python, and bash (although, technically, awk was created before both Python and bash). Awk is one of those
languages that, once learned, will become a key part of your strategic coding
arsenal.
The first awk
Let's start by entering the following command:
$ awk '{ print }' /etc/passwd
You should see the contents of your /etc/passwd
file appear before your eyes. Now, for an explanation of what awk did.
When we called awk, we specified /etc/passwd as our input file. When
we executed awk, it evaluated the print command for each line in
/etc/passwd, in order. All output is sent to stdout, and we get a result identical to catting /etc/passwd.
Now, for an explanation of the { print } code block. In
awk, curly braces are used to group blocks of code together, similar to C. Inside our
block of code, we have a single print command. In awk, when a
print command appears by itself, the full contents of the current
line are printed.
$ awk '{ print $0 }' /etc/passwd
In awk, the $0 variable represents the entire current line, so
print and print $0 do exactly the same thing.
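A quick way to check this claim is to run both forms against the same input and compare the results. The sketch below uses a small made-up file (its name and contents are illustrative assumptions, not part of the original examples) so that it can be run anywhere:

```shell
# Hypothetical sample input; any text file would do.
sample=$(mktemp)
printf 'first line\nsecond line\n' > "$sample"

# A bare print and print $0 both emit the whole current line.
bare=$(awk '{ print }' "$sample")
dollar_zero=$(awk '{ print $0 }' "$sample")

# Both results also match the file itself, so this reports "identical".
if [ "$bare" = "$dollar_zero" ] && [ "$bare" = "$(cat "$sample")" ]; then
    echo "identical"
fi
```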
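You can also write a script whose output ignores the input data entirely:
$ awk '{ print "" }' /etc/passwd
This prints one blank line for every line in /etc/passwd, because awk executes the block once per input line. Here's another example:
$ awk '{ print "hiya" }' /etc/passwd
Running this script will fill your screen with hiya's. :)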
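Multiple fields
Awk can split each input line into fields and lets you reference them as $1, $2, and so on. Here, the -F option sets ":" as the field separator, and the script prints the first and third fields of each line:
$ awk -F":" '{ print $1 $3 }' /etc/passwd
Here's an excerpt of the output: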
halt7
operator11
root0
shutdown6
sync5
bin1
....etc.
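The script works, but there are no spaces between the two output fields. When two strings appear next to each other in an awk program, as in print $1 $3, awk concatenates them without adding a space. The following command inserts a space between the fields:
$ awk -F":" '{ print $1 " " $3 }' /etc/passwd
Called this way, print concatenates $1, " ", and $3, producing readable output. We can also add text labels: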
$ awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd
username: halt uid:7
username: operator uid:11
username: root uid:0
username: shutdown uid:6
username: sync uid:5
username: bin uid:1
....etc.
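To see the difference between concatenation with and without an explicit space, here is a small sketch that runs against made-up passwd-style records rather than your real /etc/passwd (the sample data and variable names are assumptions for illustration):

```shell
# Hypothetical passwd-style records, invented for this example.
sample=$(mktemp)
printf 'root:x:0:0:root:/root:/bin/bash\nbin:x:1:1:bin:/bin:/sbin/nologin\n' > "$sample"

# Adjacent strings are concatenated with nothing in between.
joined=$(awk -F":" '{ print $1 $3 }' "$sample")

# An explicit " " string puts a space between the fields.
spaced=$(awk -F":" '{ print $1 " " $3 }' "$sample")

echo "$joined"   # root0, then bin1
echo "$spaced"   # root 0, then bin 1
```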
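External scripts
One-liners on the command line are handy, but a longer program is easier to manage in its own file, which awk can read with the -f option. For example, this multi-line script prints the first field of each line of a colon-separated file such as /etc/passwd: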
BEGIN {
FS=":"
}
{ print $1 }
The difference between this script and our earlier one-liners has to do with how we set the
field separator. In this script, the field separator is specified
within the code itself (by setting the FS variable), while our
previous examples set FS by passing the -F":" option to awk on the command line. It's
generally best to set the field separator inside the script itself,
simply because it means you have one less command line argument to
remember to type. We'll cover the FS variable in more detail later in this
article.
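As a rough sketch of the workflow, the script above can be saved to a file and handed to awk with its -f option. The file names and sample input here are hypothetical, chosen only so the example is self-contained:

```shell
# Hypothetical file names, created in a scratch directory.
workdir=$(mktemp -d)

# Save the script from the listing above into its own file.
cat > "$workdir/first_field.awk" <<'EOF'
BEGIN {
    FS=":"
}
{ print $1 }
EOF

# Made-up passwd-style input.
printf 'root:x:0:0\nbin:x:1:1\n' > "$workdir/sample.txt"

# -f tells awk to read its program from the named file.
names=$(awk -f "$workdir/first_field.awk" "$workdir/sample.txt")
echo "$names"   # root, then bin
```

Because FS is set inside the script, no -F option is needed on the command line.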
The BEGIN and END blocks
Normally, awk executes each block of your script's code once for each input line.
However, there are many programming situations where you may need to execute
initialization code before awk begins processing the text from
the input file. For such situations, awk allows you to define a BEGIN
block. We used a BEGIN block in the previous example. Because the
BEGIN block is evaluated before awk starts processing the input file,
it's an excellent place to initialize the FS (field separator)
variable, print a heading, or initialize other global variables that you'll reference
later in the program.
Awk also provides another special block, called the END block. Awk
executes this block after all lines in the input file have been
processed. Typically, the END block is used to perform final
calculations or print summaries that should appear at the end of the
output stream.
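The following sketch puts all three kinds of block together: a BEGIN block that sets FS and prints a heading, a main block that runs once per line, and an END block that prints a summary. The input records and the wording of the heading and summary are assumptions made up for the example:

```shell
report=$(printf 'root:x:0\nbin:x:1\nsync:x:5\n' | awk '
BEGIN {
    FS = ":"            # set up before any input is read
    print "Accounts:"   # heading, printed once
}
{ print "  " $1 }       # runs once per input line
END {
    print "Total: " NR  # NR holds the number of lines read
}')
echo "$report"
```

The heading appears first and the total last, no matter how many input lines there are.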
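Regular expressions and blocks
A block can be preceded by a regular expression, in which case awk executes the block only for lines that match it. This script prints only the lines that contain a number with a decimal point:
/[0-9]+\.[0-9]*/ { print }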
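Here is a minimal sketch of that pattern in action. The four input lines are invented for illustration; only the lines that contain digits followed by a period get through:

```shell
matches=$(printf 'pi is 3.14\nno digits here\nversion 2\ntotal: 10.\n' |
    awk '/[0-9]+\.[0-9]*/ { print }')
echo "$matches"   # pi is 3.14, then total: 10.
```

Note that "total: 10." matches because [0-9]* allows zero digits after the period, while "version 2" has no period at all.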
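Expressions and blocks
A block can also be preceded by a boolean expression. This script prints the third field of every line whose first field equals fred:
$1 == "fred" { print $3 }
Awk also provides the "~" and "!~" operators, which mean "matches" and "does not match" a regular expression. For example, $5 ~ /root/ { print $3 } prints the third field of every line whose fifth field contains root.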
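Conditional statements
Awk also offers C-like if statements. The $5 ~ /root/ test can be written with an if statement inside a block that runs for every line: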
{
if ( $5 ~ /root/ ) {
print $3
}
}
Both scripts function identically. In the pattern form,
$5 ~ /root/ { print $3 }, the boolean expression is placed outside the block, while in the if
form, the block is executed for every input line, and we
selectively perform the print command by using an if statement. Both
methods are available, and you can choose the one that best meshes
with the other parts of your script.
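One way to convince yourself that the two forms agree is to run them side by side. This sketch assumes colon-separated sample records (invented here) and passes -F":" so that $5 is the fifth colon-delimited field:

```shell
# Hypothetical passwd-style records; only the fifth field matters here.
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
operator:x:11:0:operator:/root:/sbin/nologin
admin:x:500:500:root helper:/home/admin:/bin/sh
EOF

# Pattern form: the test sits outside the block.
pattern_form=$(awk -F":" '$5 ~ /root/ { print $3 }' "$sample")

# if form: the block runs for every line and tests inside.
if_form=$(awk -F":" '{ if ( $5 ~ /root/ ) { print $3 } }' "$sample")

echo "$pattern_form"   # 0, then 500
[ "$pattern_form" = "$if_form" ] && echo "same output"
```

The operator line is skipped by both forms: its sixth field contains root, but its fifth does not.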
Awk's if statements can be nested and chained with else, much as in C:
{
if ( $1 == "foo" ) {
if ( $2 == "foo" ) {
print "uno"
} else {
print "one"
}
} else if ($1 == "bar" ) {
print "two"
} else {
print "three"
}
}
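To trace which branch fires for which input, the sketch below feeds the same nested if statement four made-up lines, one per branch. The input lines and the shell variable names are assumptions for illustration:

```shell
# The nested if/else chain from the listing above, stored in a shell variable.
chain='{
    if ( $1 == "foo" ) {
        if ( $2 == "foo" ) {
            print "uno"
        } else {
            print "one"
        }
    } else if ( $1 == "bar" ) {
        print "two"
    } else {
        print "three"
    }
}'

# One input line per branch.
result=$(printf 'foo foo\nfoo bar\nbar foo\nbaz\n' | awk "$chain")
echo "$result"   # uno, one, two, three (one per line)
```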
Using an if statement, we can also rewrite a negated pattern. This script:
! /matchme/ { print $1 $3 $4 }
can be written as:
{
if ( $0 !~ /matchme/ ) {
print $1 $3 $4
}
}
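Both scripts produce output only for those lines that don't contain a
matchme character sequence. Again, you can choose the method that works best for your code. They both do the same thing.
Conditions can also be combined with the boolean operators "&&" (logical and) and "||" (logical or):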
( $1 == "foo" ) && ( $2 == "bar" ) { print }
This example will print only those lines where field one equals foo
and field two equals bar.
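The negation and the "&&" examples can both be tried on a few invented lines. In this sketch the input file and variable names are hypothetical, and the print statements are shortened to $1 to keep the output easy to read:

```shell
# Made-up input: three whitespace-separated fields per line.
input=$(mktemp)
printf 'foo bar 1\nfoo baz 2\nmatchme bar 3\n' > "$input"

# Two equivalent ways to skip lines that contain "matchme".
negated_pattern=$(awk '! /matchme/ { print $1 }' "$input")
negated_if=$(awk '{ if ( $0 !~ /matchme/ ) { print $1 } }' "$input")

# Both conditions must hold for the line to be printed.
both_true=$(awk '( $1 == "foo" ) && ( $2 == "bar" ) { print }' "$input")

echo "$negated_pattern"   # foo, then foo
echo "$both_true"         # foo bar 1
```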
Numeric variables!
Awk variables can hold numbers, which makes simple counting easy. Consider a script that counts the blank lines in a file.
In the BEGIN block, we initialize our integer variable x to
zero. Then, each time awk encounters a blank line, awk will execute
the x=x+1 statement, incrementing x. After all the lines have been
processed, the END block will execute, and awk will print out a final
summary, specifying the number of blank lines it found.
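A script matching that description might look like the following sketch. The /^$/ pattern for blank lines, the wording of the summary, and the sample input are assumptions; only the BEGIN initialization, the x=x+1 statement, and the END summary come from the description above:

```shell
summary=$(printf 'one\n\ntwo\n\n\nthree\n' | awk '
BEGIN { x = 0 }            # start the counter at zero
/^$/  { x = x + 1 }        # a blank line matches the empty-line pattern
END   { print "I found " x " blank lines." }')
echo "$summary"   # I found 3 blank lines.
```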
Stringy variables
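Awk variables hold strings, but when a variable contains a valid numeric string such as "1.01", awk converts it to a number automatically whenever you use it in a mathematical expression.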
If you do a little experimenting, you'll find that if a particular
variable doesn't contain a valid number, awk will treat that variable as a
numerical zero when it evaluates your mathematical expression.
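A small experiment along these lines, written as a sketch with made-up variable names, shows both behaviors: a numeric string takes part in arithmetic, and a non-numeric string counts as zero:

```shell
# x starts out as the string "1.01" and is converted when used in arithmetic.
converted=$(awk 'BEGIN { x = "1.01"; x = x + 1; print x }')

# "hello" is not a valid number, so it is treated as 0 in the sum.
not_a_number=$(awk 'BEGIN { y = "hello"; print y + 1 }')

echo "$converted"      # 2.01
echo "$not_a_number"   # 1
```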
Lots of operators
Another nice thing about awk is its full complement of mathematical
operators. In addition to standard addition, subtraction,
multiplication, and division, awk allows us to use the
exponent operator "^", the modulo (remainder) operator
"%", and a bunch of other handy assignment operators borrowed from C.
These include pre- and post-increment/decrement ( i++, --foo ) and
add/sub/mult/div assign operators ( a+=3, b*=2, c/=2.2, d-=6.2
). But that's not all -- we also get handy modulo/exponent assign ops
as well ( a^=2, b%=4 ).
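The sketch below exercises several of these operators in a BEGIN block, so no input file is needed. The variable names and starting values are arbitrary choices for illustration:

```shell
ops=$(awk 'BEGIN {
    print 2 ^ 10        # exponent: 1024
    print 17 % 5        # modulo: 2
    i = 5; i++; --i     # increment then decrement: back to 5
    a = 4; a += 3       # 7
    b = 3; b *= 2       # 6
    c = 9; c ^= 2       # 81
    d = 10; d %= 4      # 2
    print i, a, b, c, d
}')
echo "$ops"
```

The last print uses commas between its arguments, which makes awk separate the values with single spaces.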
Field separators
Awk has its own complement of special variables. Some of them allow
you to fine-tune how awk functions, while others can be read to glean
valuable information about the input. We've already touched on one of
these special variables, FS. As mentioned earlier, this variable
allows you to set the character sequence that awk expects to find
between fields. When we were using /etc/passwd as input, FS was set to
":". While this did the trick, FS allows us even more flexibility.
The FS value isn't limited to a single character; it can be set to a regular expression of any length. In such an expression, the special "+" regular expression character means "one or
more of the previous character", so a value like "\t+" separates fields at one or more tabs.
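As a sketch of a regular-expression field separator, the following example assumes tab-separated input in which some fields are separated by several tabs in a row. The sample line is invented, and the "|" characters in the output are only there to make the field boundaries visible:

```shell
fields=$(printf 'alpha\t\t\tbeta\tgamma\n' | awk '
BEGIN { FS = "\t+" }            # one or more tabs between fields
{ print $1 "|" $2 "|" $3 }')
echo "$fields"   # alpha|beta|gamma
```

With FS set to a single tab instead, the run of three tabs would produce empty fields between alpha and beta.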
If your fields are separated by whitespace, you may be tempted to set FS to a regular expression that matches one or more spaces or tabs.
While such an assignment will do the trick, it's not necessary. Why?
Because by default, FS is set to a single space character, which awk
interprets to mean "one or more spaces or tabs." In this case,
the default FS setting is exactly what you want in the
first place!
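The default behavior is easy to confirm with a line that mixes spaces and tabs. This is a minimal sketch with invented input, and the commas in the output just mark the field boundaries:

```shell
# No FS assignment and no -F option: awk uses its default field splitting.
default_split=$(printf '  one \t two    three\n' | awk '{ print $1 "," $2 "," $3 }')
echo "$default_split"   # one,two,three
```

Leading whitespace on the line is ignored as well, so $1 is "one" rather than an empty string.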
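Number of fields
The NF variable holds the number of fields in the current input record. This script produces output only for lines that have more than two fields: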
{
if ( NF > 2 ) {
print $1 " " $2 ":" $3
}
}
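Here is a short sketch of NF at work on three made-up lines with two, three, and four fields. The first command simply prints NF for each line; the second applies the NF > 2 test from the listing above:

```shell
# Number of fields on each line.
counts=$(printf 'a b\nc d e\nf g h i\n' | awk '{ print NF }')

# Only lines with more than two fields produce output.
filtered=$(printf 'a b\nc d e\nf g h i\n' |
    awk '{ if ( NF > 2 ) { print $1 " " $2 ":" $3 } }')

echo "$counts"     # 2, 3, 4 (one per line)
echo "$filtered"   # c d:e, then f g:h
```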
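Record number
The NR variable holds the number of the current record, counting from 1. In this script, the test is true only for records after the tenth: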
{
#skip header
if ( NR > 10 ) {
print "ok, now for the real information!"
}
}
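To watch NR skip a ten-line header, this sketch generates twelve numbered lines with one awk command and filters them with another. The generated input and the output format are assumptions made for the example:

```shell
# Generate 12 lines, then keep only the records after the tenth.
after_header=$(awk 'BEGIN { for (i = 1; i <= 12; i++) print "line " i }' |
    awk '{ if ( NR > 10 ) { print NR ": " $0 } }')
echo "$after_header"   # 11: line 11, then 12: line 12
```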
Awk provides additional variables that can be used for a variety of
purposes. We'll cover more of these variables in later
articles.
We've come to the end of our initial exploration of awk. As the
series continues, I'll demonstrate more advanced awk functionality, and
we'll end the series with a real-world awk application. In the meantime, if
you're eager to learn more, check out the resources listed below.
Resources
About the author
Residing in Albuquerque, New Mexico, Daniel Robbins is the
President/CEO of Gentoo Technologies,
Inc., the creator of Gentoo Linux, an advanced Linux for the
PC, and the Portage system, a next-generation ports system for Linux.
He has also served as a contributing author for the Macmillan books
Caldera OpenLinux Unleashed, SuSE Linux Unleashed, and Samba Unleashed.
Daniel has been involved with computers in some fashion since the
second grade, when he was first exposed to the Logo programming
language as well as a potentially dangerous dose of Pac Man. This
probably explains why he has since served as a Lead Graphic Artist at
SONY Electronic Publishing/Psygnosis. Daniel enjoys spending
time with his wife, Mary, and his new baby daughter, Hadassah. You can reach Daniel at drobbins@gentoo.org.