Transliterating Characters

The system tr utility transliterates characters. For example, it is often used to map uppercase letters into lowercase for further processing:

generate data | tr 'A-Z' 'a-z' | process data ...

tr requires two lists of characters. (On some older systems, tr may require that the lists be written as range expressions enclosed in square brackets (`[a-z]') and quoted, to prevent the shell from attempting a file name expansion. This is not a feature.) When processing the input, the first character in the first list is replaced with the first character in the second list, the second character in the first list is replaced with the second character in the second list, and so on. If there are more characters in the "from" list than in the "to" list, the last character of the "to" list is used for the remaining characters in the "from" list.
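As a quick illustration of the pairing rule, the following shell sketch runs tr with two equal-length lists. The sample input and the variable name `paired` are made up for this example.

```shell
# Each character in the first list maps to the character at the same
# position in the second list: a->x, b->y, c->z.
paired=$(printf 'aabbcc cab\n' | tr 'abc' 'xyz')
printf '%s\n' "$paired"

# A shorter second list: per the rule above, c should also map to y,
# though not every tr implementation is guaranteed to pad this way.
printf 'aabbcc cab\n' | tr 'abc' 'xy'
```

The first command prints `xxyyzz zxy`. The second is left without a stated result because it depends on how the local tr treats a short second list.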

Some time ago, a user proposed that a transliteration function should be added to gawk. The following program was written to prove that character transliteration could be done with a user-level function. This program is not as complete as the system tr utility, but it does most of the job.

The translate program demonstrates one of the few weaknesses of standard awk: dealing with individual characters is very painful, requiring repeated use of the substr, index, and gsub built-in functions (see section String Manipulation Functions). (This program was written before gawk acquired the ability to split each character in a string into separate array elements.)

There are two functions. The first, stranslate, takes three arguments:

from
    A list of characters to translate from.
to
    A list of characters to translate to.
target
    The string to do the translation on.

Associative arrays make the translation part fairly easy. t_ar holds the "to" characters, indexed by the "from" characters. Then a simple loop goes through from, one character at a time. For each character in from, if the character appears in target, gsub is used to change it to the corresponding to character.
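To make the lookup table concrete, here is a minimal sketch that builds t_ar for a sample pair of lists and prints the resulting mapping. The sample lists ("abc" and "xy"), the helper index j, and the shell variable `mapping` are assumptions chosen for illustration. The table is filled in a single loop here, which is a slight simplification of the two loops used in the program itself.

```shell
mapping=$(awk 'BEGIN {
    from = "abc"; to = "xy"
    lf = length(from); lt = length(to)
    # Index t_ar by each "from" character. When "to" runs out,
    # reuse its last character for the remaining "from" characters.
    for (i = 1; i <= lf; i++) {
        j = (i <= lt) ? i : lt
        t_ar[substr(from, i, 1)] = substr(to, j, 1)
    }
    # Print the table in "from" order, separated by spaces.
    for (i = 1; i <= lf; i++) {
        c = substr(from, i, 1)
        printf("%s%s->%s", (i > 1 ? " " : ""), c, t_ar[c])
    }
    printf("\n")
}')
printf '%s\n' "$mapping"
```

This prints `a->x b->y c->y`, showing that the last "to" character covers the extra "from" character.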

The translate function simply calls stranslate using $0 as the target. The main program sets two global variables, FROM and TO, from the command line, and then changes ARGV so that awk reads from the standard input.
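The ARGV manipulation can be tried in isolation. The sketch below is not part of the translate program; the variable PREFIX, the arguments `hello` and `ignored`, and the shell variable `tagged` are hypothetical names used only to show the technique. It saves a command-line argument, then resets ARGC and ARGV so that awk reads the standard input instead of treating the arguments as file names.

```shell
tagged=$(printf 'line one\nline two\n' | awk '
BEGIN {
    PREFIX = ARGV[1]    # remember the first command-line argument
    ARGC = 2            # only ARGV[0] and ARGV[1] remain visible
    ARGV[1] = "-"       # "-" names the standard input
}
{ print PREFIX ": " $0 }
' hello ignored)
printf '%s\n' "$tagged"
```

Each input line comes back prefixed with `hello: `, and neither `hello` nor `ignored` is opened as a file.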

Finally, the processing rule simply calls translate for each record:

# translate.awk -- do tr-like stuff
# Bugs: does not handle things like: tr A-Z a-z, it has
# to be spelled out. However, if `to' is shorter than `from',
# the last character in `to' is used for the rest of `from'.

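function stranslate(from, to, target,     lf, lt, t_ar, i, c)
{
    lf = length(from)
    lt = length(to)
    # map each "from" character to the "to" character at the same position
    for (i = 1; i <= lt; i++)
        t_ar[substr(from, i, 1)] = substr(to, i, 1)
    # if "to" is shorter, its last character covers the rest of "from"
    if (lt < lf)
        for (; i <= lf; i++)
            t_ar[substr(from, i, 1)] = substr(to, lt, 1)
    # replace every occurrence of each "from" character in the target
    for (i = 1; i <= lf; i++) {
        c = substr(from, i, 1)
        if (index(target, c) > 0)
            gsub(c, t_ar[c], target)
    }
    return target
}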

function translate(from, to)
{
    return $0 = stranslate(from, to, $0)
}

# main program
BEGIN {
    if (ARGC < 3) {
        print "usage: translate from to" > "/dev/stderr"
        exit
    }
    FROM = ARGV[1]
    TO = ARGV[2]
    ARGC = 2
    ARGV[1] = "-"
}

{
    translate(FROM, TO)
    print
}

While it is possible to do character transliteration in a user-level function, it is not necessarily efficient, and we (the gawk authors) started to consider adding a built-in function. However, shortly after writing this program, we learned that the System V Release 4 awk had added the toupper and tolower functions (see section String Manipulation Functions). These functions handle the vast majority of the cases where character transliteration is necessary, and so we chose to simply add those functions to gawk as well and then leave well enough alone.
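For the common case of changing letter case, the built-in functions are enough on their own. A brief sketch, with sample input and the shell variable `cased` chosen for illustration:

```shell
# tolower and toupper return a converted copy of their argument.
cased=$(printf 'Hello, World\n' | awk '{ print tolower($0); print toupper($0) }')
printf '%s\n' "$cased"
```

This prints `hello, world` followed by `HELLO, WORLD`, covering what the opening tr pipeline does without a separate process.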

An obvious improvement to this program would be to set up the t_ar array only once, in a BEGIN rule. However, this assumes that the "from" and "to" lists will never change throughout the lifetime of the program.
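One way this improvement might look is sketched below. It is an illustrative variant, not the program above: the function name build_table and the global array T_AR are hypothetical, and the table is filled in one loop instead of two. The per-record rule still uses gsub in the same way as stranslate, so it shares that function's limitations.

```shell
result=$(printf 'hello world\nabc\n' | awk '
# Hypothetical helper: fill the global table T_AR once.
function build_table(from, to,    lf, lt, i, j)
{
    lf = length(from)
    lt = length(to)
    for (i = 1; i <= lf; i++) {
        j = (i <= lt) ? i : lt      # reuse the last "to" character
        T_AR[substr(from, i, 1)] = substr(to, j, 1)
    }
}

BEGIN {
    FROM = ARGV[1]
    TO = ARGV[2]
    ARGC = 2
    ARGV[1] = "-"
    build_table(FROM, TO)           # done once, not once per record
}

{
    # Same replacement loop as stranslate, applied directly to $0.
    for (i = 1; i <= length(FROM); i++) {
        c = substr(FROM, i, 1)
        if (index($0, c) > 0)
            gsub(c, T_AR[c])
    }
    print
}
' 'lo' 'xy')
printf '%s\n' "$result"
```

With `lo` mapped to `xy`, the first line becomes `hexxy wyrxd` and the second line, which contains neither character, is printed unchanged.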

