The following awk program prints the number of occurrences of each word in its input. It illustrates the associative nature of awk arrays by using strings as subscripts. It also demonstrates the `for index in array' mechanism. Finally, it shows how awk is used in conjunction with other utility programs to do a useful task of some complexity with a minimum of effort. Some explanations follow the program listing:
# Print list of word frequencies
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
This program has two rules. The
first rule, because it has an empty pattern, is executed for every input line.
It uses awk's field-accessing mechanism
(see section Examining Fields) to pick out the individual words from
the line, and the built-in variable NF (see section Built-in Variables)
to know how many fields are available.
For each input word, it increments an element of the array freq to
reflect that the word has been seen an additional time.
The second rule, because it has the pattern END, is not executed
until the input has been exhausted. It prints out the contents of the
freq table that has been built up inside the first action.
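To see what this first version produces, it can be run on a short piece of text. The following is only a sketch: the `sample' string is a made-up input, and the extra sort at the end is there solely to make the output order repeatable, since awk does not specify the order in which `for (word in freq)' visits the array.

```shell
# Hypothetical sample input; any short text would do.
sample='The cat saw the dog. The dog ran.'

# Run the first version of the program on the sample text.
# The traversal order of "for (word in freq)" is unspecified, so the
# result is piped through sort only to make the demonstration repeatable.
naive_counts=$(printf '%s\n' "$sample" | awk '
{
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}' | LC_ALL=C sort)

printf '%s\n' "$naive_counts"
```

In this run `The' and `the' are counted separately, and so are `dog' and `dog.', because each distinct field string is its own array subscript.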
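This program has several problems that would prevent it from being
useful by itself on real text files: it treats upper- and lowercase forms of a
word as different words, it counts punctuation characters as part of the words
they are attached to, and it prints its results in no useful order.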
The way to solve these problems is to use some of awk's more advanced
features. First, we use tolower to remove
case distinctions. Next, we use gsub to remove punctuation
characters. Finally, we use the system sort utility to process the
output of the awk script. Here is the new version of
the program:
# wordfreq.awk -- print list of word frequencies
{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}
Assuming we have saved this program in a file named `wordfreq.awk', and that the data is in `file1', the following pipeline:
awk -f wordfreq.awk file1 | sort -k 2nr
produces a table of the words appearing in `file1' in order of decreasing frequency. The awk program suitably massages the data and produces a word frequency table, which is not ordered.
The awk script's output is then sorted by the sort utility and printed on the terminal. The `-k 2nr' option given to sort specifies a sort key that starts at the second field of each input line, that the key should be treated as a numeric quantity (otherwise `15' would come before `5'), and that the sorting should be done in descending (reverse) order. The older `sort +1 -nr' form of this command is no longer part of POSIX and may be rejected by current versions of sort.
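The effect of the numeric flag can be checked on a small table. This is an illustrative sketch only: the three word/count pairs are invented, and the key is written in the POSIX `-k' syntax (`-k 2r' for a reversed text comparison, `-k 2nr' for a reversed numeric one).

```shell
# Hypothetical word/count table, tab-separated like the awk output.
table=$(printf 'the\t15\ncat\t5\ndog\t120\n')

# Without n, the second field is compared as text, character by character.
as_text=$(printf '%s\n' "$table" | LC_ALL=C sort -k 2r)

# With n, the second field is compared by numeric value.
as_number=$(printf '%s\n' "$table" | LC_ALL=C sort -k 2nr)

printf 'text order:\n%s\n\nnumeric order:\n%s\n' "$as_text" "$as_number"
```

Compared as text in descending order, `5' sorts ahead of `15' and `120' because only the leading characters are compared; compared as numbers, `120' correctly comes first.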
The sort could even be done from within the program, by changing
the END action to:
END {
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
}
This way of sorting must be used on systems that do not have true pipes at the command-line (or batch-file) level. See the general operating system documentation for more information on how to use the sort program.
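As a rough end-to-end check, the improved program with the sort done inside the END action can be run on a short text. The sketch below makes several assumptions: the `sample' string is invented, the awk in use supports POSIX character classes such as [:alnum:], and the sort key uses the POSIX `-k 2nr' syntax.

```shell
# Hypothetical sample input; any short text would do.
sample='The cat saw the dog. The dog ran.'

# Lowercase the line, strip punctuation, count the words, and let awk
# pipe its own output through sort from within the END action.
sorted_counts=$(printf '%s\n' "$sample" | awk '
{
    $0 = tolower($0)                        # remove case distinctions
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)  # remove punctuation
    for (i = 1; i <= NF; i++)
        freq[$i]++
}
END {
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%s\t%d\n", word, freq[word] | sort
    close(sort)
}')

printf '%s\n' "$sorted_counts"
```

With this input the three spellings of `the' collapse into one entry that heads the list, followed by `dog'; the words that occur once follow in an order that depends on how sort breaks ties.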