" ends it. In the next example, we'll look at placing HTML-style paragraph tags in a plain text file. For this example, the input is a file containing variable-length lines that form paragraphs; each paragraph is separated from the next one by a blank line. Therefore, the script must collect all lines in the hold space until a blank line is encountered. The contents of the hold space are retrieved and surrounded with the paragraph tags. Here's the script: /^$/!{ H d } /^$/{ x s/^\n/
/ s/$/<\/p>/ G } Running the script on a sample file produces:
<p>My wife won't let me buy a power saw. She is afraid of an accident if I use one. So I rely on a hand saw for a variety of weekend projects like building shelves. However, if I made my living as a carpenter, I would have to use a power saw. The speed and efficiency provided by power tools would be essential to being productive.</p>
<p>For people who create and modify text files, sed and awk are power tools for editing.</p>
<p>Most of the things that you can do with these programs can be done interactively with a text editor. However, using these programs can save many hours of repetitive work in achieving the same result.</p>
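To try this yourself, you would save the script in a file and name the input file on the command line; assuming the (hypothetical) names para.sed and para.txt, the invocation is the usual one with the -f option:

$ sed -f para.sed para.txt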
The script has basically two parts, corresponding to each address. Either we do one thing if the input line is not blank or a different thing if it is. If the input line is not blank, it is appended to the hold space (with H), and then deleted from the pattern space. The delete command prevents the line from being output and clears the pattern space. Control passes back to the top of the script and a new line is read. The general idea is that we don't output any line of text; it is collected in the hold space. If the input line is blank, we process the contents of the hold space. To illustrate what the second procedure does, let's use the second paragraph in the previous sample file and show what happens. After a blank line has been read, the pattern space and the hold space have the following contents:

Pattern Space:
^$

Hold Space:
\nFor people who create and modify text files,
\nsed and awk are power tools for editing.

A blank line in the pattern space is represented as "^$", the regular expression that matches it. The embedded newlines are represented in the hold space by "\n". Note that the Hold command puts a newline in the hold space and then appends the current line to the hold space. Even when the hold space is empty, the Hold command places a newline before the contents of the pattern space. The exchange command (x) swaps the contents of the hold space and the pattern space. The blank line is saved in the hold space so we can retrieve it at the end of the procedure. (We could insert a newline in other ways, also.)

Pattern Space:
\nFor people who create and modify text files,
\nsed and awk are power tools for editing.

Hold Space:
^$

Now we make two substitutions: placing "<p>" at the beginning of the pattern space and "</p>" at the end. The first substitute command matches "^\n" because a newline is at the beginning of the line as a consequence of the Hold command. The second substitute command matches the end of the pattern space ("$" does not match any embedded newlines, only the terminal newline).

Pattern Space:
<p>For people who create and modify text files,
\nsed and awk are power tools for editing.</p>
Hold Space:
^$

Note that the embedded newline is preserved in the pattern space. The last command, G, appends the blank line in the hold space to the pattern space. Upon reaching the bottom of the script, sed outputs the paragraph we had collected in the hold space and coded in the pattern space.

This script illustrates the mechanics of collecting input and holding on to it until another pattern is matched. It's important to pay attention to flow control in the script. The first procedure in the script does not reach bottom because we don't want any output yet. The second procedure does reach bottom, clearing the pattern space and the hold space before we begin collecting lines for the next paragraph. This script also illustrates how to use addressing to set up exclusive addresses, in which a line must match one or the other address. You can also set up addresses to handle various exceptions in the input and thereby improve the reliability of a script. For instance, in the previous script, what happens if the last line in the input file is not blank? All the lines collected since the last blank line will not be output. There are several ways to handle this, but a rather clever one is to manufacture a blank line that the blank-line procedure will match later in the script. In other words, if the last line contains a line of text, we will copy the text to the hold space and clear the contents of the pattern space with the substitute command. We make the current line blank so that it matches the procedure that outputs what has been collected in the hold space. Here's the procedure:

${
/^$/!{
H
s/.*//
}
}

This procedure must be placed in the script before the two procedures shown earlier. The addressing symbol "$" matches only the last line in the file. Inside this procedure, we test for lines that are not blank. If the line is blank, we don't have to do anything with it. If the current line is not blank, then we append it to the hold space. This is what we do in the other procedure that matches a non-blank line. Then we use the substitute command to create a blank line in the pattern space. Upon exiting this procedure, there is a blank line in the pattern space. It matches the subsequent procedure for blank lines that adds the HTML paragraph codes and outputs the paragraph.

6.4 Advanced Flow Control Commands

You have already seen several examples of changes in sed's normal flow control. In this section, we'll look at two commands that allow you to direct which portions of the script get executed and when. The branch (b) and test (t) commands transfer control in a script to a line containing a specified label. If no label is specified, control passes to the end of the script. The branch command transfers control unconditionally while the test command is a conditional transfer, occurring only if a substitute command has changed the current line. A label is any sequence of up to seven characters.[1] A label is put on a line by itself that begins with a colon:

:mylabel

[1] The POSIX standard says that an implementation can allow longer labels if it wishes to. GNU sed allows labels to be of any length.

There are no spaces permitted between the colon and the label. Spaces at the end of the line will be considered part of the label. When you specify the label in a branch or test command, a space is permitted between the command and the label itself:

b mylabel

Be sure you don't put a space after the label.
6.4.1 Branching

The branch command allows you to transfer control to another line in the script.

[address]b[label]

The label is optional, and if not supplied, control is transferred to the end of the script. If a label is supplied, execution resumes at the line following the label.

In Chapter 4, Writing sed Scripts, we looked at a typesetting script that transformed quotation marks and hyphens into their typesetting counterparts. If we wanted to avoid making these changes on certain lines, then we could use the branch command to skip that portion of the script. For instance, text inside computer-generated examples marked by the .ES and .EE macros should not be changed. Thus, we could write the previous script like this:

/^\.ES/,/^\.EE/b
s/^"/``/
s/"$/''/
s/"? /''? /g
.
.
.
s/\\(em\\^"/\\(em``/g
s/"\\(em/''\\(em/g
s/\\(em"/\\(em``/g
s/@DQ@/"/g

Because no label is supplied, the branch command branches to the end of the script, skipping all subsequent commands.

The branch command can be used to execute a set of commands as a procedure, one that can be called repeatedly from the main body of the script. As in the case above, it also allows you to avoid executing the procedure at all based on matching a pattern in the input. You can have a similar effect by using ! and grouping a set of commands (see the sketch at the end of this section). The advantage of the branch command over ! for our application is that we can more easily specify multiple conditions to avoid. The ! symbol can apply to a single command, or it can apply to a set of commands enclosed in braces that immediately follows. The branch command, on the other hand, gives you almost unlimited control over movement around the script. For example, if we are using multiple macro packages, there may be other macro pairs besides .ES and .EE that define a range of lines that we want to avoid altogether. So, for example, we can write:

/^\.ES/,/^\.EE/b
/^\.PS/,/^\.PE/b
/^\.G1/,/^\.G2/b

To get a good idea of the types of flow control possible in a sed script, let's look at some simple but abstract examples. The first example shows you how to use the branch command to create a loop. Once an input line is read, command1 and command2 will be applied to the line; afterwards, if the contents of the pattern space match the pattern, then control will be passed to the line following the label "top," which means command1 then command2 will be executed again.

:top
command1
command2
/pattern/b top
command3

The script executes command3 only if the pattern doesn't match. All three commands will be executed, although the first two may be executed multiple times. In the next example, command1 is executed. If the pattern is matched, control passes to the line following the label "end." This means command2 is skipped.

command1
/pattern/b end
command2
:end
command3

In all cases, command1 and command3 are executed. Now let's look at how to specify that either command2 or command3 is executed, but not both. In the next script, there are two branch commands.

command1
/pattern/b dothree
command2
b
:dothree
command3

The first branch command transfers control to command3. If that pattern is not matched, then command2 is executed. The branch command following command2 sends control to the end of the script, bypassing command3. The first of the branch commands is conditional upon matching the pattern; the second is not. We will look at a "real-world" example after looking at the test command.
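As promised above, here is how the same exclusion could be written with ! and a group of commands in braces. This is only a sketch showing the first two substitutions; the full set of commands from the typesetting script would go inside the braces:

/^\.ES/,/^\.EE/!{
s/^"/``/
s/"$/''/
}

The substitutions are applied only to lines outside the .ES/.EE range, with no branching needed; the limitation is that the one negated address controls the entire group.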
6.4.2 The Test Command

The test command branches to a label (or the end of the script) if a successful substitution has been made on the currently addressed line. Thus, it implies a conditional branch. Its syntax follows:

[address]t[label]

If no label is supplied, control falls through to the end of the script. If the label is supplied, then execution resumes at the line following the label.

Let's look at an example from Tim O'Reilly. He was trying to generate automatic index entries based on evaluating the arguments in a macro that produced the top of a command reference page. If there were three quoted arguments, he wanted to do something different than if there were two or only one. The task was to try to match each of these cases in succession (3,2,1) and when a successful substitution was made, avoid making any further matches. Here's Tim's script:

/\.Rh 0/{
s/"\(.*\)" "\(.*\)" "\(.*\)"/"\1" "\2" "\3"/
t
s/"\(.*\)" "\(.*\)"/"\1" "\2"/
t
s/"\(.*\)"/"\1"/
}

The test command allows us to drop to the end of the script once a substitution has been made. If there are three arguments on the .Rh line, the test command after the first substitute command will be true, and sed will go on to the next input line. If there are fewer than three arguments, no substitution will be made, the test command will be evaluated false, and the next substitute command will be tried. This will be repeated until all the possibilities are used up.

The test command provides functionality similar to a case statement in the C programming language or the shell programming languages. You can test each case and when a case proves true, then you exit the construct. If the above script were part of a larger script, we could use a label, perhaps tellingly named "break," to drop to the end of the command grouping where additional commands can be applied.

/\.Rh 0/{
s/"\(.*\)" "\(.*\)" "\(.*\)"/"\1" "\2" "\3"/
t break
.
.
.
}
:break
more commands

The next section gives a full example of the test command and the use of labels.

6.4.3 One More Case

Remember Lenny? He was the fellow given the task of converting Scribe documents to troff. We had sent him the following script:

# Scribe font change script.
s/@f1(\([^)]*\))/\\fB\1\\fR/g
/@f1(.*/{
N
s/@f1(\(.*\n[^)]*\))/\\fB\1\\fR/g
P
D
}

He sent the following mail after using the script:

Thank you so much! You've not only fixed the script but shown me where I was confused about the way it works. I can repair the conversion script so that it works with what you've done, but to be optimal it should do two more things that I can't seem to get working at all - maybe it's hopeless and I should be content with what's there. First, I'd like to reduce multiple blank lines down to one. Second, I'd like to make sed match the pattern over more than two (say, even only three) lines. Thanks again.

Lenny

The first request, to reduce a series of blank lines to one, has already been shown in this chapter. The following four lines perform this function:

/^$/{
N
/^\n$/D
}

We want to look mainly at accomplishing the second request. Our previous font-change script created a two-line pattern space, tried to make the match across those lines, and then output the first line. The second line became the first line in the pattern space and control passed to the top of the script where another line was read in. We can use labels to set up a loop that reads multiple lines and makes it possible to match a pattern across multiple lines.
The following script sets up two labels: begin at the top of the script and again near the bottom. Look at the improved script:

# Scribe font change script. New and Improved.
:begin
/@f1(\([^)]*\))/{
s//\\fB\1\\fR/g
b begin
}
/@f1(.*/{
N
s/@f1(\([^)]*\n[^)]*\))/\\fB\1\\fR/g
t again
b begin
}
:again
P
D

Let's look more closely at this script, which has three parts. Beginning with the line that follows :begin, the first part attempts to match the font change syntax if it is found completely on one line. After making the substitution, the branch command transfers control back to the label begin. In other words, once we have made a match, we want to go back to the top and look for other possible matches, including the instruction that has already been applied - there could be multiple occurrences on the line.

The second part attempts to match the pattern over multiple lines. The Next command builds a multiple-line pattern space. The substitution command attempts to locate the pattern with an embedded newline. If it succeeds, the test command passes control to the line following the again label. If no substitution is made, control is passed to the line following the label begin so that we can read in another line. This is a loop that goes into effect when we've matched the beginning sequence of a font change request but have not yet found the ending sequence. Sed will loop back and keep appending lines into the pattern space until a match has been found.

The third part is the procedure following the label again. The first line in the pattern space is output and then deleted. Like the previous version of this script, we deal with multiple lines in succession. Control never reaches the bottom of the script but is redirected by the Delete command to the top of the script.

6.5 To Join a Phrase

We have covered all the advanced constructs of sed and are now ready to look at a shell script named phrase that uses nearly all of them. This script is a general-purpose, grep-like program that allows you to look for a series of multiple words that might appear across two lines. An essential element of this program is that, like grep, it prints out only the lines that match the pattern. You might think we'd use the -n option to suppress the default output of lines. However, what is unusual about this sed script is that it creates an input/output loop, controlling when a line is output or not. The logic of this script is to first look for the pattern on one line and print the line if it matches. If no match is found, we read another line into the pattern space (as in previous multiline scripts). Then we copy the two-line pattern space to the hold space for safekeeping. Now the new line that was read into the pattern space previously could match the search pattern on its own, so the next match we attempt is on the second line only. Once we've determined that the pattern is not found on either the first or second lines, we remove the newline between the two lines and look for it spanning those lines. The script is designed to accept arguments from the command line. The first argument is the search pattern. All other command-line arguments will be interpreted as filenames. Let's look at the entire script before analyzing it:
#! /bin/sh
# phrase -- search for words across lines
# $1 = search string; remaining args = filenames
search=$1
shift
for file
do
sed '
/'"$search"'/b
N
h
s/.*\n//
/'"$search"'/b
g
s/ *\n/ /
/'"$search"'/{
g
b
}
g
D' $file
done

A shell variable named search is assigned the first argument on the command line, which should be the search pattern. This script shows another method of passing a shell variable into a script. Here we surround the variable reference with a pair of double quotes and then single quotes. Notice the script itself is enclosed in single quotes, which protect characters that are normally special to the shell from being interpreted. The sequence of a double-quote pair inside a single-quote pair[2] makes sure the enclosed argument is evaluated first by the shell before the sed script is evaluated by sed.[3]

[2] Actually, this is the concatenation of single-quoted text with double-quoted text with more single-quoted text (and so on, whew!) to produce one large quoted string. Being a shell wizard helps here.

[3] You can also use shell variables to pass a series of commands into a sed script. This somewhat simulates a procedure call but it makes the script more difficult to read.

The sed script tries to match the search string at three different points, each marked by the address that looks for the search pattern. The first line of the script looks for the search pattern on a line by itself:

/'"$search"'/b

If the search pattern matches the line, the branch command, without a label, transfers control to the bottom of the script where the line is printed. This makes use of sed's normal control-flow so that the next input line is read into the pattern space and control then returns to the top of the script. The branch command is used in the same way each time we try to match the pattern. If a single input line does not match the pattern, we begin our next procedure to create a multiline pattern space. It is possible that the new line, by itself, will match the search string. It may not be apparent why this step is necessary - why not just immediately look for the pattern anywhere across two lines? The reason is that if the pattern is actually matched on the second line, we'd still output the pair of lines. In other words, the user would see the line preceding the matched line and might be confused by it. This way we output the second line by itself if that is what matches the pattern.

N
h
s/.*\n//
/'"$search"'/b

The Next command appends the next input line to the pattern space. The hold command places a copy of the two-line pattern space into the hold space. The next action will change the pattern space and we want to preserve the original intact. Before looking for the pattern, we use the substitute command to remove the previous line, up to and including the embedded newline. There are several reasons for doing it this way and not another way, so let's consider some of the alternatives. You could write a pattern that matches the search pattern only if it occurs after the embedded newline:

/\n.*'"$search"'/b

However, if a match is found, we don't want to print the entire pattern space, just the second portion of it. Using the above construct would print both lines when only the second line matches. You might want to use the Delete command to remove the first line in the pattern space before trying to match the pattern. A side effect of the Delete command is a change in flow control that would resume execution at the top of the script.
(The Delete command could conceivably be used, but not without changing the logic of this script.) So, we try to match the pattern on the second line, and if that is unsuccessful, then we try to match it across two lines:

g
s/ *\n/ /
/'"$search"'/{
g
b
}

The get command retrieves a copy of the original two-line pair from the hold space, overwriting the line we had worked with in the pattern space. The substitute command replaces the embedded newline and any spaces preceding it with a single space. Then we attempt to match the pattern. If the match is made, we don't want to print the contents of the pattern space, but rather get the duplicate from the hold space (which preserves the newline) and print it. Thus, before branching to the end of the script, the get command retrieves the copy from the hold space. The last part of the script is executed only if the pattern has not been matched.

g
D

The get command retrieves the duplicate, which preserves the newline, from the hold space. The Delete command removes the first line in the pattern space and passes control back to the top of the script. We delete only the first part of the pattern space, instead of clearing it, because after reading another input line, it is possible to match the pattern spanning across both lines. Here's the result when the program is run on a sample file:

$ phrase "the procedure is followed" sect3
If a pattern is followed by a \f(CW!\fP, then the procedure
is followed for all lines that do not match the pattern.
so that the procedure
is followed only if there is no match.

As we mentioned at the outset, writing sed scripts is a good primer for programming. In the chapters that follow, we will be looking at the awk programming language. You will see many similarities to sed to make you comfortable, but you will see a broader range of constructs for writing useful programs. As you begin trying to do more complicated tasks with sed, the scripts get so convoluted as to make them difficult to understand. One of the advantages of awk is that it handles complexity better, and once you learn the basics, awk scripts are easier to write and understand.

7. Writing Scripts for awk

Contents:
Playing the Game
Hello, World
Awk's Programming Model
Pattern Matching
Records and Fields
Expressions
System Variables
Relational and Boolean Operators
Formatted Printing
Passing Parameters Into a Script
Information Retrieval

As mentioned in the preface, this book describes POSIX awk; that is, the awk language as specified by the POSIX standard. Before diving into the details, we'll provide a bit of history. The original awk was a nice little language. It first saw the light of day with Version 7 UNIX, around 1978. It caught on, and people used it for significant programming. In 1985, the original authors, seeing that awk was being used for more serious programming than they had ever intended, decided to beef up the language. (See Chapter 11, A Flock of awks, for a description of the original awk, and all the things it did not have when compared to the new one.) The new version was finally released to the world at large in 1987, and it is this version that is still found on SunOS 4.1.x systems. In 1989, for System V Release 4, awk was updated in some minor ways.[1] This version became the basis for the awk feature list in the POSIX standard. POSIX clarified a number of things about awk, and added the CONVFMT variable (to be discussed later in this chapter).
[1] The -v option and the tolower() and toupper() functions were added, and srand() and printf were cleaned up. The details will be presented in this and the following chapters.

As you read the rest of this book, bear in mind that the term awk refers to POSIX awk, and not to any particular implementation, whether the original one from Bell Labs, or any of the others discussed in Chapter 11. However, in the few cases where different versions have fundamental differences of behavior, that will be pointed out in the main body of the discussion.

7.1 Playing the Game

To write an awk script, you must become familiar with the rules of the game. The rules can be stated plainly and you will find them described in Appendix B, Quick Reference for awk, rather than in this chapter. The goal of this chapter is not to describe the rules but to show you how to play the game. In this way, you will become acquainted with many of the features of the language and see examples that illustrate how scripts actually work. Some people prefer to begin by reading the rules, which is roughly equivalent to learning to use a program from its manual page or learning to speak a language by scanning its rules of grammar - not an easy task. Having a good grasp of the rules, however, is essential once you begin to use awk regularly. But the more you use awk, the faster the rules of the game become second nature. You learn them through trial and error - spending a long time trying to fix a silly syntax error such as a missing space or brace has a magical effect upon long-term memory. Thus, the best way to learn to write scripts is to begin writing them. As you make progress writing scripts, you will no doubt benefit from reading the rules (and rereading them) in Appendix B or the awk manpage or The AWK Programming Language book. You can do that later - let's get started now.

7.2 Hello, World

It has become a convention to introduce a programming language by demonstrating the "Hello, world" program. Showing how this program works in awk will demonstrate just how unconventional awk is. In fact, it's necessary to show several different approaches to printing "Hello, world." In the first example, we create a file named test that contains a single line. This example shows a script that contains the print statement:

$ echo 'this line of data is ignored' > test
$ awk '{ print "Hello, world" }' test
Hello, world

This script has only a single action, which is enclosed in braces. That action is to execute the print statement for each line of input. In this case, the test file contains only a single line; thus, the action occurs once. Note that the input line is read but never output. Now let's look at another example. Here, we use a file that contains the line "Hello, world."

$ cat test2
Hello, world
$ awk '{ print }' test2
Hello, world

In this example, "Hello, world" appears in the input file. The same result is achieved because the print statement, without arguments, simply outputs each line of input. If there were additional lines of input, they would be output as well. Both of these examples illustrate that awk is usually input-driven. That is, nothing happens unless there are lines of input on which to act. When you invoke the awk program, it reads the script that you supply, checking the syntax of your instructions. Then awk attempts to execute the instructions for each line of input.
Thus, the print statement will not be executed unless there is input from the file. To verify this for yourself, try entering the command line in the first example but omit the filename. You'll find that because awk expects input to come from the keyboard, it will wait until you give it input to process: press RETURN several times, then type an EOF (CTRL-D on most systems) to signal the end of input. For each time that you pressed RETURN, the action that prints "Hello, world" will be executed.

There is yet another way to write the "Hello, world" message and not have awk wait for input. This method associates the action with the BEGIN pattern. The BEGIN pattern specifies actions that are performed before the first line of input is read.

$ awk 'BEGIN { print "Hello, world" }'
Hello, world

Awk prints the message, and then exits. If a program has only a BEGIN pattern, and no other statements, awk will not process any input files.

7.3 Awk's Programming Model

It's important to understand the basic model that awk offers the programmer. Part of the reason why awk is easier to learn than many programming languages is that it offers such a well-defined and useful model to the programmer. An awk program consists of what we will call a main input loop. A loop is a routine that is executed over and over again until some condition exists that terminates it. You don't write this loop, it is given - it exists as the framework within which the code that you do write will be executed. The main input loop in awk is a routine that reads one line of input from a file and makes it available for processing. The actions you write to do the processing assume that there is a line of input available. In another programming language, you would have to create the main input loop as part of your program. It would have to open the input file and read one line at a time. This is not necessarily a lot of work, but it illustrates a basic awk shortcut that makes it easier for you to write your program. The main input loop is executed as many times as there are lines of input. As you saw in the "Hello, world" examples, this loop does not execute until there is a line of input. It terminates when there is no more input to be read.

Awk allows you to write two special routines that can be executed before any input is read and after all input is read. These are the procedures associated with the BEGIN and END rules, respectively. In other words, you can do some preprocessing before the main input loop is ever executed and you can do some postprocessing after the main input loop has terminated. The BEGIN and END procedures are optional. You can think of an awk script as having potentially three major parts: what happens before, what happens during, and what happens after processing the input. Figure 7.1 shows the relationship of these parts in the flow of control of an awk script.

Figure 7.1: Flow and control in awk scripts

Of these three parts, the main input loop or "what happens during processing" is where most of the work gets done. Inside the main input loop, your instructions are written as a series of pattern/action procedures. A pattern is a rule for testing the input line to determine whether or not the action should be applied to it. The actions, as we shall see, can be quite complex, consisting of statements, functions, and expressions. The main thing to remember is that each pattern/action procedure sits in the main input loop, which takes care of reading the input line. The procedures that you write will be applied to each input line, one line at a time.
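Schematically, a script that uses all three parts looks like this (a minimal sketch of ours, not one of the book's examples):

BEGIN { print "before any input is read" }
{ print }
END { print "after all input has been read" }

Only the middle rule runs inside the main input loop, once per input line; the BEGIN and END procedures each run exactly once.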
7.4 Pattern Matching

The "Hello, world" program does not demonstrate the power of pattern-matching rules. In this section, we look at a number of small, even trivial examples that nonetheless demonstrate this central feature of awk scripts. When awk reads an input line, it attempts to match each pattern-matching rule in a script. Only the lines matching the particular pattern are the object of an action. If no action is specified, the line that matches the pattern is printed (executing the print statement is the default action). Consider the following script:

/^$/ { print "This is a blank line." }

This script reads: if the input line is blank, then print "This is a blank line." The pattern is written as a regular expression that identifies a blank line. The action, like most of those we've seen so far, contains a single print statement. If we place this script in a file named awkscr and use an input file named test that contains three blank lines, then the following command executes the script:

$ awk -f awkscr test
This is a blank line.
This is a blank line.
This is a blank line.

(From this point on, we'll assume that our scripts are placed in a separate file and invoked using the -f command-line option.) The result tells us that there are three blank lines in test. This script ignores lines that are not blank. Let's add several new rules to the script. This script is now going to analyze the input and classify it as an integer, a string, or a blank line.

# test for integer, string or empty line.
/[0-9]+/ { print "That is an integer" }
/[A-Za-z]+/ { print "This is a string" }
/^$/ { print "This is a blank line." }

The general idea is that if a line of input matches any of these patterns, the associated print statement will be executed. The + metacharacter is part of the extended set of regular expression metacharacters and means "one or more." Therefore, a line containing a sequence of one or more digits will be considered an integer. Here's a sample run, taking input from standard input:

$ awk -f awkscr
4
That is an integer
t
This is a string
4T
That is an integer
This is a string
RETURN
This is a blank line.
44
That is an integer
CTRL-D
$

Note that input "4T" was identified as both an integer and a string. A line can match more than one rule. You can write a stricter rule set to prevent a line from matching more than one rule (one way to do that is sketched at the end of this section). You can also write actions that are designed to skip other parts of the script. We will be exploring the use of pattern-matching rules throughout this chapter.

7.4.1 Describing Your Script

Adding comments as you write the script is a good practice. A comment begins with the "#" character and ends at a newline. Unlike sed, awk allows comments anywhere in the script.

NOTE: If you are supplying your awk program on the command line, rather than putting it in a file, do not use a single quote anywhere in your program. The shell would interpret it and become confused.

As we begin writing scripts, we'll use comments to describe the action:

# blank.awk -- Print message for each blank line.
/^$/ { print "This is a blank line." }

This comment offers the name of the script, blank.awk, and briefly describes what the script does. A particularly useful comment for longer scripts is one that identifies the expected structure of the input file. For instance, in the next section, we are going to look at writing a script that reads a file containing names and phone numbers. The introductory comments for this program should be:

# blocklist.awk -- print name and address in block form.
# fields: name, company, street, city, state and zip, phone

It is useful to embed this information in the script because the script won't work unless the structure of the input file corresponds to that expected by the person who wrote the script.
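Here is the sketch promised above: one way to keep a line from matching more than one rule is to anchor the patterns so that each must describe the entire line, not just part of it (this variant is ours, not the original rule set):

# test for integer, string or empty line -- anchored variants.
/^[0-9]+$/ { print "That is an integer" }
/^[A-Za-z]+$/ { print "This is a string" }
/^$/ { print "This is a blank line." }

With the anchored patterns, a line such as "4T" matches neither of the first two rules, so you would need an additional rule if you wanted to report mixed input.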
7.5 Records and Fields

Awk makes the assumption that its input is structured and not just an endless string of characters. In the simplest case, it takes each input line as a record and each word, separated by spaces or tabs, as a field. (The characters separating the fields are often referred to as delimiters.) The following record in the file names has three fields, separated by either a space or a tab.

John Robinson	666-555-1111

Two or more consecutive spaces and/or tabs count as a single delimiter.

7.5.1 Referencing and Separating Fields

Awk allows you to refer to fields in actions using the field operator $. This operator is followed by a number or a variable that identifies the position of a field by number. "$1" refers to the first field, "$2" to the second field, and so on. "$0" refers to the entire input record. The following example displays the last name first and the first name second, followed by the phone number.

$ awk '{ print $2, $1, $3 }' names
Robinson John 666-555-1111

$1 refers to the first name, $2 to the last name, and $3 to the phone number. The commas that separate each argument in the print statement cause a space to be output between the values. (Later on, we'll discuss the output field separator (OFS), whose value the comma outputs and which is by default a space.) In this example, a single input line forms one record containing three fields: there is a space between the first and last names and a tab between the last name and the phone number. If you wanted to grab the first and last name as a single field, you could set the field separator explicitly so that only tabs are recognized. Then, awk would recognize only two fields in this record. You can use any expression that evaluates to an integer to refer to a field, not just numbers and variables.

$ echo a b c d | awk 'BEGIN { one = 1; two = 2 }
> { print $(one + two) }'
c

You can change the field separator with the -F option on the command line. It is followed by the delimiter character (either immediately, or separated by whitespace). In the following example, the field separator is changed to a tab.

$ awk -F"\t" '{ print $2 }' names
666-555-1111

"\t" is an escape sequence (discussed below) that represents an actual tab character. It should be surrounded by single or double quotes. Commas delimit fields in the following two address records.

John Robinson,Koren Inc.,978 4th Ave.,Boston,MA 01760,696-0987
Phyllis Chapman,GVE Corp.,34 Sea Drive,Amesbury,MA 01881,879-0900

An awk program can print the name and address in block format.
# blocklist.awk -- print name and address in block form.
# input file -- name, company, street, city, state and zip, phone
{
	print ""	# output blank line
	print $1	# name
	print $2	# company
	print $3	# street
	print $4, $5	# city, state zip
}

The first print statement specifies an empty string ("") (remember, print by itself outputs the current line). This arranges for the records in the report to be separated by blank lines. We can invoke this script and specify that the field separator is a comma using the following command:

awk -F, -f blocklist.awk names

The following report is produced:

John Robinson
Koren Inc.
978 4th Ave.
Boston MA 01760

Phyllis Chapman
GVE Corp.
34 Sea Drive
Amesbury MA 01881

It is usually a better practice, and more convenient, to specify the field separator in the script itself. The system variable FS can be defined to change the field separator. Because this must be done before the first input line is read, we must assign this variable in an action controlled by the BEGIN rule.

BEGIN { FS = "," }

Now let's use it in a script to print out the names and phone numbers.

# phonelist.awk -- print name and phone number.
# input file -- name, company, street, city, state and zip, phone

BEGIN { FS = "," }	# comma-delimited fields

{ print $1 ", " $6 }

Notice that we use blank lines in the script itself to improve readability. The print statement puts a comma followed by a space between the two output fields. This script can be invoked from the command line:

$ awk -f phonelist.awk names
John Robinson, 696-0987
Phyllis Chapman, 879-0900

This gives you a basic idea of how awk can be used to work with data that has a recognizable structure. This script is designed to print all lines of input, but we could modify the single action by writing a pattern-matching rule that selected only certain names or addresses. So, if we had a large listing of names, we could select only the names of people residing in a particular state. We could write:

/MA/ { print $1 ", " $6 }

where MA would match the postal state abbreviation for Massachusetts. However, we could possibly match a company name or some other field in which the letters "MA" appeared. We can test a specific field for a match. The tilde (~) operator allows you to test a regular expression against a field.

$5 ~ /MA/ { print $1 ", " $6 }

You can reverse the meaning of the rule by using bang-tilde (!~).

$5 !~ /MA/ { print $1 ", " $6 }

This rule would match all those records whose fifth field did not have "MA" in it. A more challenging pattern-matching rule would be one that matches only long-distance phone numbers. The following regular expression looks for an area code.

$6 ~ /1?(-| )?\(?[0-9]+\)?( |-)?[0-9]+-[0-9]+/

This rule matches any of the following forms:

707-724-0000
(707) 724-0000
(707)724-0000
1-707-724-0000
1 707-724-0000
1(707)724-0000

The regular expression can be deciphered by breaking down its parts. "1?" means zero or one occurrences of "1". "(-| )?" looks for either a hyphen or a space in the next position, or nothing at all. "\(?" looks for zero or one left parenthesis; the backslash prevents the interpretation of "(" as the grouping metacharacter. "[0-9]+" looks for one or more digits; note that we took the lazy way out and specified one or more digits rather than exactly three. In the next position, we are looking for an optional right parenthesis, and again, either a space or a hyphen, or nothing at all. Then we look for one or more digits "[0-9]+" followed by a hyphen followed by one or more digits "[0-9]+".
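As a quick check of the field-specific rule, here is what it prints when applied to the two comma-delimited records shown earlier (our own run; the file and field layout are the ones above):

$ awk -F, '$5 ~ /MA/ { print $1 ", " $6 }' names
John Robinson, 696-0987
Phyllis Chapman, 879-0900

Both sample records happen to be from Massachusetts, so both are printed; a record from another state would be skipped.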
7.5.2 Field Splitting: The Full Story

There are three distinct ways you can have awk separate fields. The first method is to have fields separated by whitespace. To do this, set FS equal to a single space. In this case, leading and trailing whitespace (spaces and/or tabs) are stripped from the record, and fields are separated by runs of spaces and/or tabs. Since the default value of FS is a single space, this is the way awk normally splits each record into fields.

The second method is to have some other single character separate fields. For example, awk programs for processing the UNIX /etc/passwd file usually use a ":" as the field separator. When FS is any single character, each occurrence of that character separates another field. If there are two successive occurrences, the field between them simply has the empty string as its value.

Finally, if you specify more than a single character as the field separator, it will be interpreted as a regular expression. That is, the field separator will be the "leftmost longest non-null and nonoverlapping" substring[2] that matches the regular expression. (The phrase "null string" is technical jargon for what we've been calling the "empty string.") You can see the difference between specifying:

[2] The AWK Programming Language [Aho], p. 60.

FS = "\t"

which causes each tab to be interpreted as a field separator, and:

FS = "\t+"

which specifies that one or more consecutive tabs separate a field. Using the first specification, the following line would have three fields:

abc\t\tdef

whereas the second specification would only recognize two fields. Using a regular expression allows you to specify several characters to be used as delimiters:

FS = "[':\t]"

Any of the three characters in brackets will be interpreted as the field separator.

7.6 Expressions

The use of expressions in which you can store, manipulate, and retrieve data is quite different from anything you can do in sed, yet it is a common feature of most programming languages. An expression is evaluated and returns a value. An expression consists of any combination of numeric and string constants, variables, operators, functions, and regular expressions. We covered regular expressions in detail in Chapter 2, Understanding Basic Operations, and they are summarized in Appendix B. Functions will be discussed fully in Chapter 9, Functions. In this section, we will look at expressions consisting of constants, variables, and operators. There are two types of constants: string or numeric ("red" or 1). A string must be quoted in an expression. Strings can make use of the escape sequences listed in Table 7.1.

Table 7.1: Escape Sequences

Sequence	Description
\a	Alert character, usually ASCII BEL character
\b	Backspace
\f	Formfeed
\n	Newline
\r	Carriage return
\t	Horizontal tab
\v	Vertical tab
\ddd	Character represented as 1 to 3 digit octal value
\xhex	Character represented as hexadecimal value[3]
\c	Any literal character c (e.g., \" for ")[4]

[3] POSIX does not provide "\x", but it is commonly available.

[4] Like ANSI C, POSIX leaves purposely undefined what you get when you put a backslash before any character not listed in the table. In most awks, you just get that character.

A variable is an identifier that references a value. To define a variable, you only have to name it and assign it a value. The name can only contain letters, digits, and underscores, and may not start with a digit. Case distinctions in variable names are important: Salary and salary are two different variables. Variables are not declared; you do not have to tell awk what type of value will be stored in a variable. Each variable has a string value and a numeric value, and awk uses the appropriate value based on the context of the expression. (Strings that do not consist of numbers have a numeric value of 0.) Variables do not have to be initialized; awk automatically initializes them to the empty string, which acts like 0 if used as a number.
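For instance, since a non-numeric string has the numeric value 0, you can watch awk switch contexts from the shell (a small sketch of our own):

$ echo 'hello' | awk '{ print $1 + 1 }'
1
$ echo '41' | awk '{ print $1 + 1 }'
42

In the first command, the string "hello" is used as a number and contributes 0 to the sum; in the second, the string "41" converts to the number 41.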
The following expression assigns a value to x:

x = 1

x is the name of the variable, = is an assignment operator, and 1 is a numeric constant. The following expression assigns the string "Hello" to the variable z:

z = "Hello"

A space is the string concatenation operator. The expression:

z = "Hello" "World"

concatenates the two strings and assigns "HelloWorld" to the variable z. The dollar sign ($) operator is used to reference fields. The following expression assigns the value of the first field of the current input record to the variable w:

w = $1

A variety of operators can be used in expressions. Arithmetic operators are listed in Table 7.2.

Table 7.2: Arithmetic Operators

Operator	Description
+	Addition
-	Subtraction
*	Multiplication
/	Division
%	Modulo
^	Exponentiation
**	Exponentiation[5]

[5] This is a common extension. It is not in the POSIX standard, and often not in the system documentation, either. Its use is thus nonportable.

Once a variable has been assigned a value, that value can be referenced using the name of the variable. The following expression adds 1 to the value of x and assigns it to the variable y:

y = x + 1

So, evaluate x, add 1 to it, and put the result into the variable y. The statement:

print y

prints the value of y. If the following sequence of statements appears in a script:

x = 1
y = x + 1
print y

then the value of y is 2. We could reduce these three statements to two:

x = 1
print x + 1

Notice, however, that after the print statement the value of x is still 1. We didn't change the value of x; we simply added 1 to it and printed that value. In other words, if a third statement print x followed, it would output 1. If, in fact, we wished to accumulate the value in x, we could use an assignment operator +=. This operator combines two operations; it adds 1 to x and assigns the new value to x. Table 7.3 lists the assignment operators used in awk expressions.

Table 7.3: Assignment Operators

Operator	Description
++	Add 1 to variable.
--	Subtract 1 from variable.
+=	Assign result of addition.
-=	Assign result of subtraction.
*=	Assign result of multiplication.
/=	Assign result of division.
%=	Assign result of modulo.
^=	Assign result of exponentiation.
**=	Assign result of exponentiation.[6]

[6] As with "**", this is a common extension, which is also nonportable.

Look at the following example, which counts each blank line in a file.

# Count blank lines.
/^$/ { print x += 1 }

Although we didn't initialize the value of x, we can safely assume that its value is 0 up until the first blank line is encountered. The expression "x += 1" is evaluated each time a blank line is matched and the value of x is incremented by 1. The print statement prints the value returned by the expression. Because we execute the print statement for every blank line, we get a running count of blank lines.
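Run against the test file with three blank lines that we used earlier, this version prints a running count (our own check, assuming the script is saved in awkscr):

$ awk -f awkscr test
1
2
3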
The expression "x += 1" is more concise than the following equivalent expression: x = x + 1 But neither of these expressions is as terse as the following expression: ++x "++" is the increment operator. ("--" is the decrement operator.) Each time the expression is evaluated the value of the variable is incremented by one. The increment and decrement operators can appear on either side of the operand, as prefix or postfix operators. The position has a different effect. ++x Increment x before returning value (prefix) x++ Increment x after returning value (postfix) For instance, if our example was written: /^$/ { print x++ } When the first blank line is matched, the expression returns the value "0"; the second blank line returns "1", and so on. If we put the increment operator before x, then the first time the expression is evaluated, it will return "1." Let's implement that expression in our example. In addition, instead of printing a count each time a blank line is matched, we'll accumulate the count as the value of x and print only the total number of blank lines. The END pattern is the place to put the print that displays the value of x after the last input line is read. # Count blank lines. /^$/ { ++x } END { print x } Let's try it on the sample file that has three blank lines in it. $ awk -f awkscr test 3 The script outputs the number of blank lines. 7.6.1 Averaging Student Grades Let's look at another example, one in which we sum a series of student grades and then calculate the average. Here's what the input file looks like: john 85 92 78 94 88 andrea 89 90 75 90 86 jasper 84 88 80 92 84 There are five grades following the student's name. Here is the script that will give us each student's average: # average five grades { total = $2 + $3 + $4 + $5 + $6 avg = total / 5 print $1, avg } This script adds together fields 2 through 6 to get the sum total of the five grades. The value of total is divided by 5 and assigned to the variable avg. ("/" is the operator for division.) The print statement outputs the student's name and average. Note that we could have skipped the assignment of avg and instead calculated the average as part of the print statement, as follows: print $1, total / 5 This script shows how easy it is to write programs in awk. Awk parses the input into fields and records. You are spared having to read individual characters and declaring data types. Awk does this for you, automatically. Let's see a sample run of the script that calculates student averages: $ awk -f grades.awk grades john 87.4 andrea 86 jasper 85.6 7.5 Records and Fields 7.7 System Variables Chapter 7 Writing Scripts for awk 7.7 System Variables There are a number of system or built-in variables defined by awk. Awk has two types of system variables. The first type defines values whose default can be changed, such as the default field and record separators. The second type defines values that can be used in reports or processing, such as the number of fields found in the current record, the count of the current record, and others. These are automatically updated by awk; for example, the current record number and input file name. There are a set of default values that affect the recognition of records and fields on input and their display on output. The system variable FS defines the field separator. By default, its value is a single space, which tells awk that any number of spaces and/or tabs separate fields. FS can also be set to any single character, or to a regular expression. 
Earlier, we changed the field separator to a comma in order to read a list of names and addresses. The output equivalent of FS is OFS, which is a space by default. We'll see an example of redefining OFS shortly. Awk defines the variable NF to be the number of fields for the current input record. Changing the value of NF actually has side effects. The interactions that occur when $0, the fields, and NF are changed are a murky area, particularly when NF is decreased.[7] Increasing it creates new (empty) fields, and rebuilds $0, with the fields separated by the value of OFS. In the case where NF is decreased, gawk and mawk rebuild the record, and the fields that were above the new value of NF are set equal to the empty string. The Bell Labs awk does not change $0.

[7] Unfortunately, the POSIX standard isn't as helpful here as it should be.

Awk also defines RS, the record separator, as a newline. RS is a bit unusual; it's the only variable where awk pays attention to only the first character of the value. The output equivalent of RS is ORS, which is also a newline by default. In the next section, "Working with Multiline Records," we'll show how to change the default record separator. Awk sets the variable NR to the number of the current input record. It can be used to number records in a list. The variable FILENAME contains the name of the current input file. The variable FNR is useful when multiple input files are used, as it provides the number of the current record relative to the current input file.

Typically, the field and record separators are defined in the BEGIN procedure because you want these values set before the first input line is read. However, you can redefine these values anywhere in the script. In POSIX awk, assigning a new value to FS has no effect on the current input line; it only affects the next input line.

NOTE: Prior to the June 1996 release of Bell Labs awk, versions of awk for UNIX did not follow the POSIX standard in this regard. In those versions, if you have not yet referenced an individual field, and you set the field separator to a different value, the current input line is split into fields using the new value of FS. Thus, you should test how your awk behaves, and if at all possible, upgrade to a correct version of awk.

Finally, POSIX added a new variable, CONVFMT, which is used to control number-to-string conversions. For example,

str = (5.5 + 3.2) " is a nice value"

Here, the result of the numeric expression 5.5 + 3.2 (which is 8.7) must be converted to a string before it can be used in the string concatenation. CONVFMT controls this conversion. Its default value is "%.6g", which is a printf-style format specification for floating-point numbers. Changing CONVFMT to "%d", for instance, would cause all numbers to be converted to strings as integers. Prior to the POSIX standard, awk used OFMT for this purpose. OFMT does the same job, but controls the conversion of numeric values when using the print statement. The POSIX committee wanted to separate the task of output conversion from that of simple string conversion. Note that numbers that are integers are always converted to strings as integers, no matter what the values of CONVFMT and OFMT may be. Now let's look at some examples, beginning with the NR variable. Here's a revised print statement for the script that calculates student averages:

print NR ".", $1, avg

Running the revised script produces the following output:

1. john 87.4
2. andrea 86
3. jasper 85.6
After the last line of input is read, NR contains the number of input records that were read. It can be used in the END action to provide a report summary. Here's a revised version of the phonelist.awk script.

# phonelist.awk -- print name and phone number.
# input file -- name, company, street, city, state and zip, phone
BEGIN { FS = ", *" }	# comma-delimited fields
{ print $1 ", " $6 }
END { print ""
      print NR, "records processed." }

This program changes the default field separator and uses NR to print the total number of records printed. Note that this program uses a regular expression for the value of FS. This program produces the following output:

John Robinson, 696-0987
Phyllis Chapman, 879-0900

2 records processed.

The output field separator (OFS) is generated when a comma is used to separate the arguments in a print statement. You may have wondered what effect the comma has in the following expression:

print NR ".", $1, avg

By default, the comma causes a space (the default value of OFS) to be output. For instance, you could redefine OFS to be a tab in a BEGIN action. Then the preceding print statement would produce the following output:

1.	john	87.4
2.	andrea	86
3.	jasper	85.6

This is especially useful if the input consists of tab-separated fields and you want to generate the same kind of output. OFS can be redefined to be a sequence of characters, such as a comma followed by a space. Another commonly used system variable is NF, which is set to the number of fields for the current record. As we'll see in the next section, you can use NF to check that a record has the same number of fields that you expect. You can also use NF to reference the last field of each record. Using the "$" field operator and NF produces that reference. If there are six fields, then "$NF" is the same as "$6". Given a list of names, such as the following:

John Kennedy
Lyndon B. Johnson
Richard Milhouse Nixon
Gerald R. Ford
Jimmy Carter
Ronald Reagan
George Bush
Bill Clinton

you will note that the last name is not the same field number for each record. You could print the last name of each President using "$NF".[8]

[8] This scheme breaks down for Martin Van Buren; fortunately, our list contains only recent U.S. presidents.

These are the basic system variables, the ones most commonly used. There are more of them, as listed in Appendix B, and we'll introduce new system variables as needed in the chapters that follow.

7.7.1 Working with Multiline Records

All of our examples have used input files whose records consisted of a single line. In this section, we show how to read a record where each field consists of a single line. Earlier, we looked at an example of processing a file of names and addresses. Let's suppose that the same data is stored on file in block format. Instead of having all the information on one line, the person's name is on one line, followed by the company's name on the next line and so on. Here's a sample record:

John Robinson
Koren Inc.
978 Commonwealth Ave.
Boston
MA 01760
696-0987

This record has six fields. A blank line separates each record. To process this data, we can specify a multiline record by defining the field separator to be a newline, represented as "\n", and setting the record separator to the empty string, which stands for a blank line.
BEGIN { FS = "\n"; RS = "" } We can print the first and last fields using the following script: # block.awk - print first and last fields # $1 = name; $NF = phone number BEGIN { FS = "\n"; RS = "" } { print $1, $NF } Here's a sample run: $ awk -f block.awk phones.block John Robinson 696-0987 Phyllis Chapman 879-0900 Jeffrey Willis 914-636-0000 Alice Gold (707) 724-0000 Bill Gold 1-707-724-0000 The two fields are printed on the same line because the default output separator (OFS) remains a single space. If you want the fields to be output on separate lines, change OFS to a newline. While you're at it, you probably want to preserve the blank line between records, so you must specify the output record separator ORS to be two newlines. OFS = "\n"; ORS = "\n\n" 7.7.2 Balance the Checkbook This is a simple application that processes items in your check register. While not necessarily the easiest way to balance the checkbook, it is amazing how quickly you can build something useful with awk. This program presumes you have entered in a file the following information: 1000 125 Market 125.45 126 Hardware Store 34.95 127 Video Store 7.45 128 Book Store 14.32 129 Gasoline 16.10 The first line contains the beginning balance. Each of the other lines represent information from a single check: the check number, a description of where it was spent, and the amount of the check. The three fields are separated by tabs. The core task of the script is that it must get the beginning balance and then deduct the amount of each check from that balance. We can provide detail lines for each check to compare against the check register. Finally, we can print the ending balance. Here it is: # checkbook.awk BEGIN { FS = "\t" } #1 Expect the first record to have the starting balance. NR == 1 { print "Beginning Balance: \t" $1 balance = $1 next # get next record and start over } #2 Apply to each check record, subtracting amount from balance. { print $1, $2, $3 print balance -= $3 } Let's run this program and look at the results: $ awk -f checkbook.awk checkbook.test Beginning Balance: 1000 125 Market 125.45 874.55 126 Hardware Store 34.95 839.6 127 Video Store 7.45 832.15 128 Book Store 14.32 817.83 129 Gasoline 16.10 801.73 The report is difficult to read, but later we will learn to fix the format using the printf statement. What's important is to confirm that the script is doing what we want. Notice, also, that getting this far takes only a few minutes in awk. In a programming language such as C, it would take you much longer to write this program; for one thing, you might have many more lines of code; and you'd be programming at a much lower level. There are any number of refinements that you'd want to make to this program to improve it, and refining a program takes much longer. The point is that with awk you are able to isolate and implement the basic functionality quite easily. 7.6 Expressions 7.8 Relational and Boolean Operators Chapter 7 Writing Scripts for awk 7.8 Relational and Boolean Operators Relational and Boolean operators allow you to make comparisons between two expressions. The relational operators are found in Table 7.4. Table 7.4: Relational Operators Operator Description < Less than > Greater than <= Less than or equal to >= Greater than or equal to == Equal to != Not equal to ~ Matches !~ Does not match A relational expression can be used in place of a pattern to control a particular action. 
For instance, if we wanted to limit the records selected for processing to those that have five fields, we could use the following expression:
NF == 5
This relational expression compares the value of NF (the number of fields for each input record) to five. If it is true, the action will be executed; otherwise, it will not. NOTE: Make sure you notice that the relational operator "==" ("is equal to") is not the same as the assignment operator "=" ("equals"). It is a common error to use "=" instead of "==" to test for equality. We can use a relational expression to validate the phonelist database before attempting to print out the record.
NF == 6 { print $1, $6 }
Then only lines with six fields will be printed. The opposite of "==" is "!=" ("is not equal to"). Similarly, you can compare one expression to another to see if it is greater than (>) or less than (<) or greater than or equal to (>=) or less than or equal to (<=). The expression
NR > 1
tests whether the number of the current record is greater than 1. As we'll see in the next chapter, relational expressions are typically used in conditional (if) statements and are evaluated to determine whether or not a particular statement should be executed. Regular expressions are usually written enclosed in slashes. These can be thought of as regular expression constants, much as "hello" is a string constant. We've seen many examples so far:
/^$/ { print "This is a blank line." }
However, you are not limited to regular expression constants. When used with the relational operators ~ ("match") and !~ ("no match"), the right-hand side of the expression can be any awk expression; awk treats it as a string that specifies a regular expression.[9] We've already seen an example of the ~ operator used in a pattern-matching rule for the phone database: [9] You may also use strings instead of regular expression constants when calling the match(), split(), sub(), and gsub() functions.
$5 ~ /MA/ { print $1 ", " $6 }
where the value of field 5 is compared against the regular expression "MA." Since any expression can be used with ~ and !~, regular expressions can be supplied through variables. For instance, in the phonelist script, we could replace "/MA/" with state and have a procedure that defines the value of state.
$5 ~ state { print $1 ", " $6 }
This makes the script much more general to use because a pattern can change dynamically during execution of the script. For instance, it allows us to get the value of state from a command-line parameter. We will talk about passing command-line parameters into a script later in this chapter. Boolean operators allow you to combine a series of comparisons. They are listed in Table 7.5.
Table 7.5: Boolean Operators
Operator	Description
||	Logical OR
&&	Logical AND
!	Logical NOT
Given two or more expressions, || specifies that one of them must evaluate to true (non-zero or non-empty) for the whole expression to be true. && specifies that both of the expressions must be true to return true. The following expression:
NF == 6 && NR > 1
states that the number of fields must be equal to 6 and that the number of the record must be greater than 1. && has higher precedence than ||. Can you tell how the following expression will be evaluated?
NR > 1 && NF >= 2 || $1 ~ /\t/
The parentheses in the next example show which expression would be evaluated first based on the rules of precedence.
(NR > 1 && NF >= 2) || $1 ~ /\t/
In other words, both of the expressions in parentheses must be true or the right-hand side must be true.
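If you're ever unsure how awk will group an expression like this, you can check it from the shell before relying on it. Here's a minimal sketch (the literal 1 and 0 operands are simply stand-ins for true and false conditions) that prints the same operands grouped both ways:
$ awk 'BEGIN { print (0 && 1 || 1), (0 && (1 || 1)) }'
1 0
The first value reflects awk's own grouping: because && binds more tightly, the expression is evaluated as (0 && 1) || 1, which is true. The second value shows the result when parentheses force || to be evaluated first.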
You can use parentheses to override the rules of precedence, as in the following example which specifies that two conditions must be true.
NR > 1 && (NF >= 2 || $1 ~ /\t/)
The first condition must be true and either of two other conditions must be true. Given an expression that is either true or false, the ! operator inverts the sense of the expression.
! (NR > 1 && NF > 3)
This expression is true if the parenthesized expression is false. This operator is most useful with awk's in operator to see if an index is not in an array (as we shall see later), although it has other uses as well. 7.8.1 Getting Information About Files Now we are going to look at a couple of scripts that process the output of a UNIX command, ls. The following is a sample of the long listing produced by the command ls -l:[10] [10] Note that on a Berkeley 4.3BSD-derived UNIX system such as Ultrix or SunOS 4.1.x, ls -l produces an eight-column report; use ls -lg to get the same report format shown here.
$ ls -l
-rw-rw-rw-  1 dale  project  6041 Jan  1 12:31 com.tmp
-rwxrwxrwx  1 dale  project  1778 Jan  1 11:55 combine.idx
-rw-rw-rw-  1 dale  project  1446 Feb 15 22:32 dang
-rwxrwxrwx  1 dale  project  1202 Jan  2 23:06 format.idx
This listing is a report in which data is presented in rows and columns. Each file is presented across a single row. The file listing consists of nine columns. The file's permissions appear in the first column, the size of the file in bytes in the fifth column, and the filename is found in the last column. Because one or more spaces separate the data in columns, we can treat each column as a field. In our first example, we're going to pipe the output of this command to an awk script that prints selected fields from the file listing. To do this, we'll create a shell script so that we can make the pipe transparent to the user. Thus, the structure of the shell script is:
ls -l $* | awk 'script'
The $* variable is used by the shell and expands to all arguments passed from the command line. (We could use $1 here, which would pass the first argument, but passing all the arguments provides greater flexibility.) These arguments can be the names of files or directories or additional options to the ls command. If no arguments are specified, the "$*" will be empty and the current directory will be listed. Thus, the output of the ls command will be directed to awk, which will automatically read standard input, since no filenames have been given. We'd like our awk script to print the size and name of the file. That is, print field 5 ($5) and field 9 ($9).
ls -l $* | awk '{ print $5, "\t", $9 }'
If we put the above lines in a file named fls and make that file executable, we can enter fls as a command.
$ fls
6041 	com.tmp
1778 	combine.idx
1446 	dang
1202 	format.idx
$ fls com*
6041 	com.tmp
1778 	combine.idx
So what our program does is take the long listing and reduce it to two fields. Now, let's add new functionality to our report by producing some information that the ls -l listing does not provide. We add each file's size to a running total, to produce the total number of bytes used by all files in the listing. We can also keep track of the number of files and produce that total. There are two parts to adding this functionality. The first is to accumulate the totals for each input line. We create the variable sum to accumulate the size of files and the variable filenum to accumulate the number of files in the listing.
{
	sum += $5
	++filenum
	print $5, "\t", $9
}
The first expression uses the assignment operator +=.
It adds the value of field 5 to the present value of the variable sum. The second expression increments the present value of the variable filenum. This variable is used as a counter, and each time the expression is evaluated, 1 is added to the count. The action we've written will be applied to all input lines. The totals that are accumulated in this action must be printed after awk has read all the input lines. Therefore, we write an action that is controlled by the END rule.
END { print "Total: ", sum, "bytes (" filenum " files)" }
We can also use the BEGIN rule to add column headings to the report.
BEGIN { print "BYTES", "\t", "FILE" }
Now we can put this script in an executable file named filesum and execute it as a single-word command.
$ filesum c*
BYTES 	FILE
882 	ch01
1771 	ch03
1987 	ch04
6041 	com.tmp
1778 	combine.idx
Total:  12459 bytes (5 files)
What's nice about this command is that it allows you to determine the size of all files in a directory or any group of files. While the basic mechanism works, there are a few problems to be taken care of. The first problem occurs when you list the entire directory using the ls -l command. The listing contains a line that specifies the total number of blocks in the directory. The partial listing (all files beginning with "c") in the previous example does not have this line. But the following line would be included in the output if the full directory was listed:
total 555
The block total does not interest us because the program displays the total file size in bytes. Currently, filesum does not print this line; however, it does read this line and causes the filenum counter to be incremented. There is also a problem with this script in how it handles subdirectories. Look at the following line from an ls -l:
drwxrwxrwx  3 dale  project  960 Feb  1 15:47 sed
A "d" as the first character in column 1 (file permissions) indicates that the file is a subdirectory. The size of this file (960 bytes) does not indicate the size of files in that subdirectory and therefore, it is slightly misleading to add it to the file size totals. Also, it might be helpful to indicate that it is a directory. If you want to list the files in subdirectories, supply the -R (recursive) option on the command line. It will be passed to the ls command. However, the listing is slightly different as it identifies each directory. For instance, to identify the subdirectory old, the ls -lR listing produces a blank line followed by:
./old:
Our script ignores that line and a blank line preceding it, but nonetheless they increment the file counter. Fortunately, we can devise rules to handle these cases. Let's look at the revised, commented script:
ls -l $* | awk '
# filesum: list files and total size in bytes
# input: long listing produced by "ls -l"

#1 output column headers
BEGIN { print "BYTES", "\t", "FILE" }

#2 test for 9 fields; files begin with "-"
NF == 9 && /^-/ {
	sum += $5      # accumulate size of file
	++filenum      # count number of files
	print $5, "\t", $9     # print size and filename
}

#3 test for 9 fields; directory begins with "d"
NF == 9 && /^d/ {
	print "<dir>", "\t", $9     # print <dir> and name
}

#4 test for ls -lR line ./dir:
$1 ~ /^\..*:$/ {
	print "\t" $0     # print that line preceded by tab
}

#5 once all is done,
END {
	# print total file size and number of files
	print "Total: ", sum, "bytes (" filenum " files)"
}'
The rules and their associated actions have been numbered to make it easier to discuss them. The listing produced by ls -l contains nine fields for a file.
Awk supplies the number of fields for a record in the system variable NF. Therefore, rules 2 and 3 test that NF is equal to 9. This helps us avoid matching odd blank lines or the line stating the block total. Because we want to handle directories and files differently, we use another pattern to match the first character of the line. In rule 2 we test for "-" in the first position on the line, which indicates a file. The associated action increments the file counter and adds the file size to the previous total. In rule 3, we test for a directory, indicated by "d" as the first character. The associated action prints "<dir>" in place of the file size. Rules 2 and 3 are compound expressions, specifying two patterns that are combined using the && operator. Both patterns must be matched for the expression to be true. Rule 4 tests for the special case produced by the ls -lR listing ("./old:"). There are a number of patterns that we can write to match that line, using regular expressions or relational expressions:
NF == 1	If the number of fields equals 1 ...
/^\..*:$/	If the line begins with a period followed by any number of characters and ends in a colon...
$1 ~ /^\..*:$/	If field 1 matches the regular expression...
We used the latter expression because it seems to be the most specific. It employs the match operator (~) to test the first field against a regular expression. The associated action consists of only a print statement. Rule 5 is the END pattern and its action is only executed once, printing the sum of file sizes as well as the number of files. The filesum program demonstrates many of the basic constructs used in awk. What's more, it gives you a pretty good idea of the process of developing a program (although syntax errors produced by typos and hasty thinking have been gracefully omitted). If you wish to tinker with this program, you might add a counter for directories, or a rule that handles symbolic links. 7.7 System Variables 7.9 Formatted Printing Chapter 7 Writing Scripts for awk 7.9 Formatted Printing Many of the scripts that we've written so far perform the data processing tasks just fine, but the output has not been formatted properly. That is because there is only so much you can do with the basic print statement. And since one of awk's most common functions is to produce reports, it is crucial that we be able to format our reports in an orderly fashion. The filesum program performs the arithmetic tasks well but the report lacks an orderly format. Awk offers an alternative to the print statement, printf, which is borrowed from the C programming language. The printf statement can output a simple string just like the print statement.
awk 'BEGIN { printf ("Hello, world\n") }'
The main difference that you will notice at the outset is that, unlike print, printf does not automatically supply a newline. You must specify it explicitly as "\n". The full syntax of the printf statement has two parts:
printf ( format-expression [, arguments] )
The parentheses are optional. The first part is an expression that describes the format specifications; usually this is supplied as a string constant in quotes. The second part is an argument list, such as a list of variable names, that correspond to the format specifications. A format specification is preceded by a percent sign (%) and the specifier is one of the characters shown in Table 7.6. The two main format specifiers are s for strings and d for decimal integers.[11] [11] The way printf does rounding is discussed in Appendix B.
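Here's a quick way to see both of these specifiers at work before studying the full table; this one-liner is a minimal sketch (the student name and score are made-up values) that you can paste at a shell prompt:
$ awk 'BEGIN { printf("%s scored %d points\n", "andrea", 95) }'
andrea scored 95 points
The %s consumes the first argument ("andrea") as a string and the %d consumes the second (95) as a decimal integer, in the order in which they appear.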
Table 7.6: Format Specifiers Used in printf
Character	Description
c	ASCII character
d	Decimal integer
i	Decimal integer. (Added in POSIX)
e	Floating-point format ([-]d.precisione[+-]dd)
E	Floating-point format ([-]d.precisionE[+-]dd)
f	Floating-point format ([-]ddd.precision)
g	e or f conversion, whichever is shortest, with trailing zeros removed
G	E or f conversion, whichever is shortest, with trailing zeros removed
o	Unsigned octal value
s	String
x	Unsigned hexadecimal number. Uses a-f for 10 to 15
X	Unsigned hexadecimal number. Uses A-F for 10 to 15
%	Literal %
This example uses the printf statement to produce the output for rule 2 in the filesum program. It outputs a string and a decimal value found in two different fields:
printf("%d\t%s\n", $5, $9)
The value of $5 is to be output, followed by a tab (\t) and $9 and then a newline (\n).[12] For each format specification, you must supply a corresponding argument. [12] Compare this statement with the print statement in the filesum program that prints the header line. The print statement automatically supplies a newline (the value of ORS); when using printf, you must supply the newline, it is never automatically provided for you. The printf statement can be used to specify the width and alignment of output fields. A format expression can take three optional modifiers following "%" and preceding the format specifier:
%-width.precision format-specifier
The width of the output field is a numeric value. When you specify a field width, the contents of the field will be right-justified by default. You must specify "-" to get left-justification. Thus, "%-20s" outputs a string left-justified in a field 20 characters wide. If the string is less than 20 characters, the field will be padded with whitespace to fill. In the following examples, a "|" is output to indicate the actual width of the field. The first example right-justifies the text:
printf("|%10s|\n", "hello")
It produces:
|     hello|
The next example left-justifies the text:
printf("|%-10s|\n", "hello")
It produces:
|hello     |
The precision modifier, used for decimal or floating-point values, controls the number of digits that appear to the right of the decimal point. For string values, it controls the maximum number of characters from the string that will be printed. Note that the default precision for the output of numeric values is "%.6g". You can specify both the width and precision dynamically, via values in the printf or sprintf argument list. You do this by specifying asterisks, instead of literal values.
printf("%*.*g\n", 5, 3, myvar);
In this example, the width is 5, the precision is 3, and the value to print will come from myvar. The default precision used by the print statement when outputting numbers can be changed by setting the system variable OFMT. For instance, if you are using awk to write reports that contain dollar values, you might prefer to change OFMT to "%.2f". Using the full syntax of the format expression can solve the problem with filesum of getting fields and headings properly aligned. One reason we output the file size before the filename was that the fields had a greater chance of aligning themselves if they were output in that order. The solution that printf offers us is the ability to fix the width of output fields; therefore, each field begins in the same column. Let's rearrange the output fields in the filesum report. We want a minimum field width so that the second field begins at the same position.
You specify the field width by placing it between the % and the conversion specification. "%-15s" specifies a minimum field width of 15 characters in which the value is left-justified. "%10d", without the hyphen, is right-justified, which is what we want for a decimal value.
printf("%-15s\t%10d\n", $9, $5)	# print filename and size
This will produce a report in which the data is aligned in columns and the numbers are right-justified. Look at how the printf statement is used in the END action:
printf("Total: %d bytes (%d files)\n", sum, filenum)
The column header in the BEGIN rule is also changed appropriately. With the use of the printf statement, filesum now produces the following output:
$ filesum g*
FILE           	     BYTES
g              	        23
gawk           	      2237
gawk.mail      	      1171
gawk.test      	        74
gawkro         	       264
gfilesum       	       610
grades         	        64
grades.awk     	       231
grepscript     	         6
Total: 4680 bytes (9 files)
7.8 Relational and Boolean Operators 7.10 Passing Parameters Into a Script Chapter 7 Writing Scripts for awk 7.10 Passing Parameters Into a Script One of the more confusing subtleties of programming in awk is passing parameters into a script. A parameter assigns a value to a variable that can be accessed within the awk script. The variable can be set on the command line, after the script and before the filename.
awk 'script' var=value inputfile
Each parameter must be interpreted as a single argument. Therefore, spaces are not permitted on either side of the equal sign. Multiple parameters can be passed this way. For instance, if you wanted to define the variables high and low from the command line, you could invoke awk as follows:
$ awk -f scriptfile high=100 low=60 datafile
Inside the script, these two variables are available and can be accessed as any awk variable. If you were to put this script in a shell script wrapper, then you could pass the shell's command-line arguments as values. (The shell makes available command-line arguments in the positional variables - $1 for the first parameter, $2 for the second, and so on.)[13] For instance, look at the shell script version of the previous command: [13] Careful! Don't confuse the shell's parameters with awk's field variables.
awk -f scriptfile "high=$1" "low=$2" datafile
If this shell script were named awket, it could be invoked as:
$ awket 100 60
"100" would be $1 and passed as the value assigned to the variable high. In addition, environment variables or the output of a command can be passed as the value of a variable. Here are two examples:
awk '{ ... }' directory=$cwd file1 ...
awk '{ ... }' directory=`pwd` file1 ...
"$cwd" returns the value of the variable cwd, the current working directory (csh only). The second example uses backquotes to execute the pwd command and assign its result to the variable directory (this is more portable). You can also use command-line parameters to define system variables, as in the following example:
$ awk '{ print NR, $0 }' OFS='. ' names
1. Tom 656-5789
2. Dale 653-2133
3. Mary 543-1122
4. Joe 543-2211
The output field separator is redefined to be a period followed by a space. An important restriction on command-line parameters is that they are not available in the BEGIN procedure. That is, they are not available until after the first line of input is read. Why? Well, here's the confusing part. A parameter passed from the command line is treated as though it were a filename. The assignment does not occur until the parameter, if it were a filename, is actually evaluated. Look at the following script that sets a variable n as a command-line parameter.
awk 'BEGIN { print n }
{
	if (n == 1)
		print "Reading the first file"
	if (n == 2)
		print "Reading the second file"
}' n=1 test n=2 test2
There are four command-line parameters: "n=1," "test," "n=2," and "test2". Now, if you remember that a BEGIN procedure is "what we do before processing input," you'll understand why the reference to n in the BEGIN procedure returns nothing. So the print statement will print a blank line. If the first parameter were a file and not a variable assignment, the file would not be opened until the BEGIN procedure had been executed. The variable n is given an initial value of 1 from the first parameter. The second parameter supplies the name of the file. Thus, for each line in test, the conditional "n == 1" will be true. After the input is exhausted from test, the third parameter is evaluated, and it sets n to 2. Finally, the fourth parameter supplies the name of a second file. Now the conditional "n == 2" in the main procedure will be true. One consequence of the way parameters are evaluated is that you cannot use the BEGIN procedure to test or verify parameters that are supplied on the command line. They are available only after a line of input has been read. You can get around this limitation by composing the rule "NR == 1" and using its procedure to verify the assignment. Another way is to test the command-line parameters in the shell script before invoking awk. POSIX awk provides a solution to the problem of defining parameters before any input is read. The -v option[14] specifies variable assignments that you want to take place before executing the BEGIN procedure (i.e., before the first line of input is read). The -v option must be specified before a command-line script. For instance, the following command uses the -v option to set the record separator for multiline records. [14] The -v option was not part of the original (1987) version of nawk (still used on SunOS 4.1.x systems and some System V Release 3.x systems). It was added in 1989 after Brian Kernighan of Bell Labs, the GNU awk authors, and the authors of MKS awk agreed on a way to set variables on the command line that would be available inside the BEGIN block. It is now part of the POSIX specification for awk.
$ awk -F"\n" -v RS="" '{ print }' phones.block
A separate -v option is required for each variable assignment that is passed to the program. Awk also provides the system variables ARGC and ARGV, which will be familiar to C programmers. Because this requires an understanding of arrays, we will discuss this feature in Chapter 8, Conditionals, Loops, and Arrays. 7.9 Formatted Printing 7.11 Information Retrieval Chapter 7 Writing Scripts for awk 7.11 Information Retrieval An awk program can be used to retrieve information from a database, the database basically being any kind of text file. The more structured the text file, the easier it is to work with, although the structure might be no more than a line consisting of individual words. The list of acronyms below is a simple database.
$ cat acronyms
BASIC	Beginner's All-Purpose Symbolic Instruction Code
CICS	Customer Information Control System
COBOL	Common Business Oriented Language
DBMS	Data Base Management System
GIGO	Garbage In, Garbage Out
GIRL	Generalized Information Retrieval Language
A tab is used as the field separator. We're going to look at a program that takes an acronym as input and displays the appropriate line from the database as output. (In the next chapter, we're going to look at two other programs that use the acronym database.
One program reads the list of acronyms and then finds occurrences of these acronyms in another file. The other program locates the first occurrence of these acronyms in a text file and inserts the description of the acronym.) The shell script that we develop is named acro. It takes the first argument from the command line (the name of the acronym) and passes it to the awk script. The acro script follows:
$ cat acro
#! /bin/sh
# assign shell's $1 to awk search variable
awk '$1 == search' search=$1 acronyms
The first argument specified on the shell command line ($1) is assigned to the variable named search; this variable is passed as a parameter into the awk program. Parameters passed to an awk program are specified after the script section. (This gets somewhat confusing, because $1 inside the awk program represents the first field of each input line, while $1 in the shell represents the first argument supplied on the command line.) The example below demonstrates how this program can be used to find a particular acronym on our list.
$ acro CICS
CICS	Customer Information Control System
Notice that we tested the parameter as a string ($1 == search). We could also have written this as a regular expression match ($1 ~ search). 7.11.1 Finding a Glitch A net posting was once forwarded to one of us because it contained a problem that could be solved using awk. Here's the original posting by Emmett Hogan:
I have been trying to rewrite a sed/tr/fgrep script that we use quite a bit here in Perl, but have thus far been unsuccessful... hence this posting. Having never written anything in perl, and not wishing to wait for the Nutshell Perl Book, I figured I'd tap the knowledge of this group. Basically, we have several files which have the format:
item
     info line 1
     info line 2
     .
     .
     .
     info line n
Where each info line refers to the item and is indented by either spaces or tabs. Each item "block" is separated by a blank line. What I need to do, is to be able to type:
info glitch filename
Where info is the name of the perl script, glitch is what I want to find out about, and filename is the name of the file with the information in it. The catch is that I need it to print the entire "block" if it finds glitch anywhere in the file, i.e.:
machine Sun 3/75
     8 meg memory
     Prone to memory glitches
     more info
     more info
would get printed if you looked for "glitch" along with any other "blocks" which contained the word glitch. Currently we are using the following script:
#!/bin/csh -f
#
sed '/^ /\!s/^/@/' $2 | tr '\012@' '@\012' | fgrep -i $1 | tr '@' '\012'
Which is in a word....SLOW. I am sure Perl can do it faster, better, etc...but I cannot figure it out. Any, and all, help is greatly appreciated. Thanks in advance, Emmett
-------------------------------------------------------------------
Emmett Hogan Computer Science Lab, SRI International
The problem yielded a solution based on awk. You may want to try to tackle the problem yourself before reading any further. The solution relies on awk's multiline record capability and requires that you be able to pass the search string as a command-line parameter. Here's the info script using awk:
awk 'BEGIN { FS = "\n"; RS = "" }
$0 ~ search { print $0 }' search=$1 $2
Given a test file with multiple entries, info was tested to see if it could find the word "glitch."
$ info glitch glitch.test
machine Sun 3/75
     8 meg memory
     Prone to memory glitches
     more info
     more info
In the next chapter, we look at conditional and looping constructs, and arrays.
7.10 Passing Parameters Into a Script 8. Conditionals, Loops, and Arrays Chapter 8 8. Conditionals, Loops, and Arrays Contents: Conditional Statements Looping Other Statements That Affect Flow Control Arrays An Acronym Processor System Variables That Are Arrays This chapter covers some fundamental programming constructs. It covers all the control statements in the awk programming language. It also covers arrays, variables that allow you to store a series of values. If this is your first exposure to such constructs, you'll recognize that even sed provided conditional and looping capabilities. In awk, these capabilities are much more generalized and the syntax is much easier to use. In fact, the syntax of awk's conditional and looping constructs is borrowed from the C programming language. Thus, by learning awk and the constructs in this chapter, you are also on the way to learning the C language. 8.1 Conditional Statements A conditional statement allows you to make a test before performing an action. In the previous chapter, we saw examples of pattern matching rules that were essentially conditional expressions affecting the main input loop. In this section, we look at conditional statements used primarily within actions. A conditional statement is introduced by if and evaluates an expression placed in parentheses. The syntax is: if ( expression ) action1 [else action2] If expression evaluates as true (non-zero or non-empty), action1 is performed. When an else clause is specified, action2 is performed if expression evaluates to false (zero or empty). An expression might contain the arithmetic, relational, or Boolean operators discussed in Chapter 7, Writing Scripts for awk. Perhaps the simplest conditional expression that you could write is one that tests whether a variable contains a non-zero value. if ( x ) print x If x is zero, the print statement will not be executed. If x has a non-zero value, that value will be printed. You can also test whether x equals another value: if ( x == y ) print x Remember that "==" is a relational operator and "=" is an assignment operator. We can also test whether x matches a pattern using the pattern-matching operator "~": if ( x ~ /[yY](es)?/ ) print x Here are a few additional syntactical points: ● If any action consists of more than one statement, the action is enclosed within a pair of braces. if ( expression ) { statement1 statement2 } Awk is not very particular about the placement of braces and statements (unlike sed). The opening brace is placed after the conditional expression, either on the same line or on the next line. The first statement can follow the opening brace or be placed on the line following it. The closing brace is put after the last statement, either on the same line or after it. Spaces or tabs are allowed before or after the braces. The indentation of statements is not required but is recommended to improve readability. ● A newline is optional after the close parenthesis, and after else. if ( expression ) action1 [else action2] ● A newline is also optional after action1, providing that a semicolon ends action1. if ( expression ) action1; [else action2] ● You cannot avoid using braces by using semicolons to separate multiple statements on a single line. In the previous chapter, we saw a script that averaged student grades. We could use a conditional statement to tell us whether the student passed or failed. 
Presuming that an average of 65 or above is a passing grade, we could write the following conditional: if ( avg >= 65 ) grade = "Pass" else grade = "Fail" The value assigned to grade depends upon whether the expression "avg >= 65" evaluates to true or false. Multiple conditional statements can be used to test whether one of several possible conditions is true. For example, perhaps the students are given a letter grade instead of a pass-fail mark. Here's a conditional that assigns a letter grade based on a student's average: if (avg >= 90) grade = "A" else if (avg >= 80) grade = "B" else if (avg >= 70) grade = "C" else if (avg >= 60) grade = "D" else grade = "F" The important thing to recognize is that successive conditionals like this are evaluated until one of them returns true; once that occurs, the rest of the conditionals are skipped. If none of the conditional expressions evaluates to true, the last else is accepted, constituting the default action; in this case, it assigns "F" to grade. 8.1.1 Conditional Operator Awk provides a conditional operator that is found in the C programming language. Its form is: expr ? action1 : action2 The previous simple if/else condition can be written using a conditional operator: grade = (avg >= 65) ? "Pass" : "Fail" This form has the advantage of brevity and is appropriate for simple conditionals such as the one shown here. While the ?: operator can be nested, doing so leads to programs that quickly become unreadable. For clarity, we recommend parenthesizing the conditional, as shown above. 7.11 Information Retrieval 8.2 Looping Chapter 8 Conditionals, Loops, and Arrays 8.2 Looping A loop is a construct that allows us to perform one or more actions again and again. In awk, a loop can be specified using a while, do, or for statement. 8.2.1 While Loop The syntax of a while loop is: while (condition) action The newline is optional after the right parenthesis. The conditional expression is evaluated at the top of the loop and, if true, the action is performed. If the expression is never true, the action is not performed. Typically, the conditional expression evaluates to true and the action changes a value such that the conditional expression eventually returns false and the loop is exited. For instance, if you wanted to perform an action four times, you could write the following loop: i = 1 while ( i <= 4 ) { print $i ++i } As in an if statement, an action consisting of more than one statement must be enclosed in braces. Note the role of each statement. The first statement assigns an initial value to i. The expression "i <= 4" compares i to 4 to determine if the action should be executed. The action consists of two statements, one that simply prints the value of a field referenced as "$i" and another that increments i. i is a counter variable and is used to keep track of how many times we go through the loop. If we did not increment the counter variable or if the comparison would never evaluate to false (e.g., i > 0), then the action would be repeated without end. 8.2.2 Do Loop The do loop is a variation of the while loop. The syntax of a do loop is: do action while (condition) The newline is optional after do. It is also optional after action providing the statement is terminated by a semicolon. The main feature of this loop is that the conditional expression appears after the action. Thus, the action is performed at least once. Look at the following do loop. 
BEGIN {
	do {
		++x
		print x
	} while ( x <= 4 )
}
In this example, the value of x is set in the body of the loop using the auto-increment operator. The body of the loop is executed once and the expression is evaluated. In the previous example of a while loop, the initial value of i was set before the loop. The expression was evaluated first, then the body of the loop was executed once. Note the value of x when we run this example:
$ awk -f do.awk
1
2
3
4
5
Before the conditional expression is first evaluated, x is incremented to 1. (This relies on the fact that all awk variables are initialized to zero.) The body of the loop is executed five times, not four; when x equals 4, the conditional expression is true and the body of the loop is executed again, incrementing x to 5 and printing its value. Only then is the conditional expression evaluated to false and the loop exited. By changing the operator from "<=" to "<", or less than, the body of the loop will be executed four times. To keep in mind the difference between the do loop and the while loop, remember that the do loop always executes the body of the loop at least once. At the bottom of the procedure, you decide if you need to execute it again. For an example, let's look at a program that loops through the fields of a record, referencing as many fields as necessary until their cumulative value exceeds 100. The reason we use a do loop is that we will reference at least one of the fields. We add the value of the field to the total, and if the total exceeds 100 we don't reference any other fields. We reference the second field only if the first field is less than 100. Its value is added to the total and if the total exceeds 100 then we exit the loop. If it is still less than 100, we execute the loop once again.
{
	total = i = 0
	do {
		++i
		total += $i
	} while ( total <= 100 )
	print i, ":", total
}
The first line of the script initializes the values of two variables: total and i. The loop increments the value of i and uses the field operator to reference a particular field. Each time through the loop, it refers to a different field. When the loop is executed for the first time, the field reference gets the value of field one and assigns it to the variable total. The conditional expression at the end of the loop evaluates whether total exceeds 100. If it does, the loop is exited. Then the value of i, the number of fields that we've referred to, and the total are printed. (This script assumes that each record totals at least 100; otherwise, we'd have to check that i does not exceed the number of fields for the record. We construct such a test in the example presented in the next section to show the for loop.) Here's a test file containing a series of numbers:
$ cat test.do
45 25 60 20
10 105 50 40
33 5 9 67
108 3 5 4
Running the script on the test file produces the following:
$ awk -f do.awk test.do
3 : 130
2 : 115
4 : 114
1 : 108
For each record, only as many fields are referenced as needed for the total to exceed 100. 8.2.3 For Loop The for statement offers a more compact syntax that achieves the same result as a while loop. Although it looks more difficult, this syntax is much easier to use and makes sure that you provide all the required elements of a loop. The syntax of a for loop is:
for ( set_counter ; test_counter ; increment_counter ) action
The newline after the right parenthesis is optional. The for loop consists of three expressions: set_counter Sets the initial value for a counter variable.
test_counter States a condition that is tested at the top of the loop. increment_counter Increments the counter each time at the bottom of the loop, right before testing the test_counter again. Look at this rather common for loop that prints each field on the input line. for ( i = 1; i <= NF; i++ ) print $i As in the previous example, i is a variable that is used to reference a field using the field operator. The system variable NF contains the number of fields for the current input record, and we test it to determine if i has reached the last field on the line. The value of NF is the maximum number of times to go through the loop. Inside the loop, the print statement is executed, printing each field on its own line. A script using this construct can print each word on a line by itself, which can then be run through sort | uniq -c to get word distribution statistics for a file. You can also write a loop to print from the last field to the first. for ( i = NF; i >= 1; i-- ) print $i Each time through the loop the counter is decremented. You could use this to reverse the order of fields. The grades.awk script that we showed earlier determined the average of five grades. We can make the script much more useful by averaging any number of grades. That is, if you were to run this script throughout the school year, the number of grades to average would increase. Rather than revising the script to accommodate the specific number of fields, we can write a generalized script that loops to read however many fields there are. Our earlier version of the program calculated the average of 5 grades using these two statements: total = $2 + $3 + $4 + $5 + $6 avg = total / 5 We can revise that using a for loop to sum each field in the record. total = 0 for (i = 2; i <= NF; ++i) total += $i avg = total / (NF - 1) We initialize the variable total each time because we don't want its value to accumulate from one record to the next. At the beginning of the for loop, the counter i is initialized to 2 because the first numeric field is field 2. Each time through the loop the value of the current field is added to total. When the last field has been referenced (i is greater than NF), we break out of the loop and calculate the average. For instance, if the record consists of 4 fields, the first time through the loop, we assign the value of $2 to total. At the bottom of the loop, i is incremented by 1, then compared to NF, which is 4. The expression evaluates to true and total is incremented by the value of $3. Notice how we divide the total by the number of fields minus 1 to remove the student name from the count. The parentheses are required around "NF - 1" because the precedence of operators would otherwise divide total by NF and then subtract 1, instead of subtracting 1 from NF first. 8.2.4 Deriving Factorials The factorial of a number is the product of successively multiplying that number by one less than that number. The factorial of 4 is 4 × 3 × 2 × 1, or 24. The factorial of 5 is 5 times the factorial of 4 or 5 × 24, or 120. Deriving a factorial for a given number can be expressed using a loop as follows: fact = number for (x = number - 1 ; x > 1; x--) fact *= x where number is the number for which we will derive the factorial fact. Let's say that number equals 5. The first time through the loop x is equal to 4. The action evaluates "5 * 4" and assigns the value to fact. The next time through the loop, x is 3 and 20 is multiplied by it. We go through the loop until x equals 1. 
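Before wiring this fragment into a full program, you can check it from the shell with a hardwired test value (a minimal sketch; the 5 is an arbitrary choice):
$ awk 'BEGIN { number = 5; fact = number; for (x = number - 1; x > 1; x--) fact *= x; print fact }'
120
With number set to 5, the loop multiplies fact by 4, 3, and 2 in turn, producing 120.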
Here is the above fragment incorporated into a standalone script that prompts the user for a number and then prints the factorial of that number.
awk '# factorial: return factorial of user-supplied number
BEGIN {
	# prompt user; use printf, not print, to avoid the newline
	printf("Enter number: ")
}

# check that user enters a number
$1 ~ /^[0-9]+$/ {
	# assign value of $1 to number & fact
	number = $1
	if (number == 0)
		fact = 1
	else
		fact = number
	# loop to multiply fact*x until x = 1
	for (x = number - 1; x > 1; x--)
		fact *= x
	printf("The factorial of %d is %g\n", number, fact)
	# exit -- saves user from typing CTRL-D.
	exit
}

# if not a number, prompt again.
{ printf("\nInvalid entry. Enter a number: ") }' -
This is an interesting example of a main input loop that prompts for input and reads the reply from standard input. The BEGIN rule is used to prompt the user to enter a number. Because we have specified that input is to come not from a file but from standard input, the program will halt after putting out the prompt and then wait for the user to type a number. The first rule checks that a number has been entered. If not, the second rule will be applied, prompting the user again to re-enter a number. We set up an input loop that will continue to read from standard input until a valid entry is found. See the lookup program in the next section for another example of constructing an input loop. Here's an example of how the factorial program works:
$ factorial
Enter number: 5
The factorial of 5 is 120
Note that the result uses "%g" as the conversion specification format in the printf statement. This permits floating point notation to be used to express very large numbers. Look at the following example:
$ factorial
Enter number: 33
The factorial of 33 is 8.68332e+36
8.1 Conditional Statements 8.3 Other Statements That Affect Flow Control Chapter 8 Conditionals, Loops, and Arrays 8.3 Other Statements That Affect Flow Control The if, while, for, and do statements allow you to change the normal flow through a procedure. In this section, we look at several other statements that also affect a change in flow control. There are two statements that affect the flow control of a loop, break and continue. The break statement, as you'd expect, breaks out of the loop, such that no more iterations of the loop are performed. The continue statement stops the current iteration before reaching the bottom of the loop and starts a new iteration at the top. Consider what happens in the following program fragment:
for ( x = 1; x <= NF; ++x )
	if ( y == $x ) {
		print x, $x
		break
	}
print
A loop is set up to examine each field of the current input record. Each time through the loop, the value of y is compared to the value of a field referenced as $x. If the result is true, we print the field number and its value and then break from the loop. The next statement to be executed is print. The use of break means that we are interested only in the first match on a line and that we don't want to loop through the rest of the fields. Here's a similar example using the continue statement:
for ( x = 1; x <= NF; ++x ) {
	if ( x == 3 ) continue
	print x, $x
}
This example loops through the fields of the current input record, printing the field number and its value. However (for some reason), we want to avoid printing the third field. The conditional statement tests the counter variable and if it is equal to 3, the continue statement is executed.
The continue statement passes control back to the top of the loop where the counter variable is incremented again. It avoids executing the print statement for that iteration. The same result could be achieved by simply re- writing the conditional to execute print as long as x is not equal to 3. The point is that you can use the continue statement to avoid hitting the bottom of the loop on a particular iteration. There are two statements that affect the main input loop, next and exit. The next statement causes the next line of input to be read and then resumes execution at the top of the script.[1] This allows you to avoid applying other procedures on the current input line. A typical use of the next statement is to continue reading input from a file, ignoring the other actions in the script until that file is exhausted. The system variable FILENAME provides the name of the current input file. Thus, a pattern can be written: [1] Some awks don't allow you to use next from within a user-defined function; Caveat emptor. FILENAME == "acronyms" { action next } { print } This causes the action to be performed for each line in the file acronyms. After the action is performed, the next line of input is read. Control does not pass to the print statement until the input is taken from a different source. The exit statement exits the main input loop and passes control to the END rule, if there is one. If the END rule is not defined, or the exit statement is used in the END rule, then the script terminates. We used the exit statement earlier in the factorial program to exit after reading one line of input. An exit statement can take an expression as an argument. The value of this expression will be returned as the exit status of awk. If the expression is not supplied, the exit status is 0. If you supply a value to an initial exit statement, and then call exit again from the END rule without a value, the first value is used. For example: awk '{ ... exit 5 } END { exit }' Here, the exit status from awk will be 5. You will come across examples that use these flow-control statements in upcoming sections. 8.2 Looping 8.4 Arrays Chapter 8 Conditionals, Loops, and Arrays 8.4 Arrays An array is a variable that can be used to store a set of values. Usually the values are related in some way. Individual elements are accessed by their index in the array. Each index is enclosed in square brackets. The following statement assigns a value to an element of an array: array[subscript] = value In awk, you don't have to declare the size of the array; you only have to use the identifier as an array. This is best done by assigning a value to an array element. For instance, the following example assigns the string "cherry" to an element of the array named flavor. flavor[1] = "cherry" The index or subscript of this element of the array is "1". The following statement prints the string "cherry": print flavor[1] Loops can be used to load and extract elements from arrays. For instance, if the array flavor has five elements, you can write a loop to print each element: flavor_count = 5 for (x = 1; x <= flavor_count; ++x) print flavor[x] One way that arrays are used in awk is to store a value from each record, using the record number as the index to the array. Let's suppose we wanted to keep track of the averages calculated for each student and come up with a class average. Each time a record is read we make the following assignment. 
student_avg[NR] = avg The system variable NR is used as the subscript for the array because it is incremented for each record. When the first record is read, the value of avg is placed in student_avg[1]; for the second record, the value is placed in student_avg[2], and so on. After we have read all of the records, we have a list of averages in the array student_avg. In an END rule, we can average all of these grades by writing a loop to get the total of the grades and then dividing it by the value of NR. Then we can compare each student average to the class average to collect totals for the number of students at or above average and the number below. END { for ( x = 1; x <= NR; x++ ) class_avg_total += student_avg[x] class_average = class_avg_total / NR for ( x = 1; x <= NR; x++ ) if (student_avg[x] >= class_average) ++above_average else ++below_average print "Class Average: ", class_average print "At or Above Average: ", above_average print "Below Average: ", below_average } There are two for loops for accessing the elements of the array. The first one totals the averages so that it can be divided by the number of student records. The next loop retrieves each student average so that it can be compared to the class average. If it is at or above average, we increment the variable above_average; otherwise, we increment below_average. 8.4.1 Associative Arrays In awk, all arrays are associative arrays. What makes an associative array unique is that its index can be a string or a number. In most programming languages, the indices of arrays are exclusively numeric. In these implementations, an array is a sequence of locations where values are stored. The indices of the array are derived from the order in which the values are stored. There is no need to keep track of indices. For instance, the index of the first element of an array is "1" or the first location in the array. An associative array makes an "association" between the indices and the elements of an array. For each element of the array, a pair of values is maintained: the index of the element and the value of the element. The elements are not stored in any particular order as in a conventional array. Thus, even though you can use numeric subscripts in awk, the numbers do not have the same meaning that they do in other programming languages - they do not necessarily refer to sequential locations. However, with numeric indices, you can still access all the elements of an array in sequence, as we did in previous examples. You can create a loop to increment a counter that references the elements of the array in order. Sometimes, the distinction between numeric and string indices is important. For instance, if you use "04" as the index to an element of the array, you cannot reference that element using "4" as its subscript. You'll see how to handle this problem in a sample program date-month, shown later in this chapter. Associative arrays are a distinctive feature of awk, and a very powerful one that allows you to use a string as an index to another value. For instance, you could use a word as the index to its definition. If you know the word, you can retrieve the definition. For example, you could use the first field of the input line as the index to the second field with the following assignment: array[$1] = $2 Using this technique, we could take our list of acronyms and load it into an array named acro. 
acro[$1] = $2
Each element of the array would be the description of an acronym and the subscript used to retrieve the element would be the acronym itself. The following expression:
acro["BASIC"]
produces:
Beginner's All-Purpose Symbolic Instruction Code
There is a special looping syntax for accessing all the elements of an associative array. It is a version of the for loop.
for ( variable in array )
	do something with array[variable]
The array is the name of an array, as it was defined. The variable is any variable, which you can think of as a temporary variable similar to a counter that is incremented in a conventional for loop. This variable is set to a particular subscript each time through the loop. (Because variable is an arbitrary name, you often see item used, regardless of what variable name was used for the subscript when the array was loaded.) For example, the following for loop prints the name of the acronym item and the definition referenced by that name, acro[item].
for ( item in acro )
	print item, acro[item]
In this example, the print statement prints the current subscript ("BASIC," for instance) followed by the element of the acro array referenced by the subscript ("Beginner's All-Purpose Symbolic Instruction Code"). This syntax can be applied to arrays with numeric subscripts. However, the order in which the items are retrieved is somewhat random.[2] The order is very likely to vary among awk implementations; be careful to write your programs so that they don't depend on any one version of awk. [2] The technical term used in The AWK Programming Language is "implementation dependent." It is important to remember that all array indices in awk are strings. Even when you use a number as an index, awk automatically converts it to a string first. You don't have to worry about this when you use integer indices, since they get converted to strings as integers, no matter what the value may be of OFMT (original awk and earlier versions of new awk) or CONVFMT (POSIX awk). But if you use a real number as an index, the number to string conversion might affect you. For instance:
$ gawk 'BEGIN { data[1.23] = "3.21"; CONVFMT = "%d"
> printf "<%s>\n", data[1.23] }'
<>
Here, nothing was printed between the angle brackets, since the second time, 1.23 was converted to just 1, and data["1"] has the empty string as its value. NOTE: Not all implementations of awk get the number to string conversion right when CONVFMT has changed between one use of a number and the next. Test the above example with your awk to be sure it works correctly. Now let's return to our student grade program for an example. Let's say that we wanted to report how many students got an "A," how many got a "B," and so on. Once we determine the grade, we could increment a counter for that grade. We could set up individual variables for each letter grade and then test which one to increment.
if ( grade == "A" )
	++gradeA
else if ( grade == "B" )
	++gradeB
.
.
.
However, an array makes this task much easier. We can define an array called class_grade, and simply use the letter grade (A through F) as the index to the array.
++class_grade[grade]
Thus, if the grade is an "A" then the value of class_grade["A"] is incremented by one. At the end of the program, we can print out these values in the END rule using the special for loop:
for (letter_grade in class_grade)
	print letter_grade ":", class_grade[letter_grade] | "sort"
The variable letter_grade references a single subscript of the array class_grade each time through the loop.
The output is piped to sort, to make sure the grades come out in the proper order. (Piping output to programs is discussed in Chapter 10, The Bottom Drawer.) Since this is the last addition we make to the grades.awk script, we can look at the full listing. # grades.awk -- average student grades and determine # letter grade as well as class averages. # $1 = student name; $2 - $NF = test scores. # set output field separator to tab. BEGIN { OFS = "\t" } # action applied to all input lines { # add up grades total = 0 for (i = 2; i <= NF; ++i) total += $i # calculate average avg = total / (NF - 1) # assign student's average to element of array student_avg[NR] = avg # determine letter grade if (avg >= 90) grade = "A" else if (avg >= 80) grade = "B" else if (avg >= 70) grade = "C" else if (avg >= 60) grade = "D" else grade = "F" # increment counter for letter grade array ++class_grade[grade] # print student name, average and letter grade print $1, avg, grade } # print out class statistics END { # calculate class average for (x = 1; x <= NR; x++) class_avg_total += student_avg[x] class_average = class_avg_total / NR # determine how many above/below average for (x = 1; x <= NR; x++) if (student_avg[x] >= class_average) ++above_average else ++below_average # print results print "" print "Class Average: ", class_average print "At or Above Average: ", above_average print "Below Average: ", below_average # print number of students per letter grade for (letter_grade in class_grade) print letter_grade ":", class_grade[letter_grade] | "sort" } Here's a sample run: $ cat grades.test mona 70 77 85 83 70 89 john 85 92 78 94 88 91 andrea 89 90 85 94 90 95 jasper 84 88 80 92 84 82 dunce 64 80 60 60 61 62 ellis 90 98 89 96 96 92 $ awk -f grades.awk grades.test mona 79 C john 88 B andrea 90.5 A jasper 85 B dunce 64.5 D ellis 93.5 A Class Average: 83.4167 At or Above Average: 4 Below Average: 2 A: 2 B: 2 C: 1 D: 1 8.4.2 Testing for Membership in an Array The keyword in is also an operator that can be used in a conditional expression to test that a subscript is a member of an array. The expression: item in array returns 1 if array[item] exists and 0 if it does not. For example, the following conditional statement is true if the string "BASIC" is a subscript of the array acro. if ( "BASIC" in acro ) print "Found BASIC" This is true if "BASIC" is a subscript used to access an element of acro. This syntax cannot tell you whether "BASIC" is the value of an element of acro. This expression is equivalent to writing a loop to check that such a subscript exists, but it is much easier to write and much more efficient to execute. 8.4.3 A Glossary Lookup Script This program reads a series of glossary entries from a file named glossary and puts them into an array. The user is prompted to enter a glossary term, and if it is found, the definition of the term is printed.
Here's the lookup program: awk '# lookup -- reads local glossary file and prompts user for query #0 BEGIN { FS = "\t"; OFS = "\t" # prompt user printf("Enter a glossary term: ") } #1 read local file named glossary FILENAME == "glossary" { # load each glossary entry into an array entry[$1] = $2 next } #2 scan for command to exit program $0 ~ /^(quit|[qQ]|exit|[Xx])$/ { exit } #3 process any non-empty line $0 != "" { if ( $0 in entry ) { # it is there, print definition print entry[$0] } else print $0 " not found" } #4 prompt user again for another term { printf("Enter another glossary term (q to quit): ") }' glossary - The pattern-matching rules are numbered to make this discussion easier. As we look at the individual rules, we'll discuss them in the order in which they are encountered in the flow of the script. Rule #0 is the BEGIN rule, which is performed only once before any input is read. It sets FS and OFS to a tab and then prompts the user to enter a glossary item. The response will come from standard input, but that is read after the glossary file. Rule #1 tests to see if the current filename (the value of FILENAME) is "glossary" and is therefore only applied while reading input from this file. This rule loads the glossary entries into an array: entry[term] = definition where $1 is the term and $2 is the definition. The next statement at the end of rule #1 is used to skip other rules in the script and causes a new line of input to be read. So, until all the entries in the glossary file are read, no other rule is evaluated. Once input from glossary is exhausted, awk reads from standard input because "-" is specified on the command line. Standard input is where the user's response comes from. Rule #3 tests that the input line ($0) is not empty. This rule should match whatever the user types. The action uses in to see if the input line is an index in the array. If it is, it simply prints out the corresponding value. Otherwise, we tell the user that no valid entry was found. After rule #3, rule #4 will be evaluated. This rule simply prompts the user for another entry. Note that regardless of whether a valid entry was processed in rule #3, rule #4 is executed. The prompt also tells the user how to quit the program. After this rule, awk looks for the next line of input. If the user chooses to quit by entering "q" as the next line of input, rule #2 will be matched. The pattern looks for a complete line consisting of alternative words or single letters that the user might enter to quit. The "^" and "$" are important, signifying that the input line contains no other characters but these; otherwise a "q" appearing in a glossary entry would be matched. Note that the placement of this rule in the sequence of rules is significant. It must appear before rules #3 and #4 because these rules will match anything, including the words "quit" and "exit." Let's look at how the program works. For this example, we will make a copy of the acronyms file and use it as the glossary file. $ cp acronyms glossary $ lookup Enter a glossary term: GIGO Garbage in, garbage out Enter another glossary term (q to quit): BASIC Beginner's All-Purpose Symbolic Instruction Code Enter another glossary term (q to quit): q As you can see, the program is set up to prompt the user for additional items until the user enters "q". Note that this program can be easily revised to read a glossary anywhere on the file system, including the user's home directory. 
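Here is a minimal sketch of that revision. It assumes the glossary is kept in a file named .glossary in the home directory (an invented location for illustration); the shell expands $HOME, the -v option passes the resulting filename into the variable glossfile, and rule #1 compares FILENAME against that variable instead of the literal string "glossary":
awk -v glossfile="$HOME/.glossary" '
#0
BEGIN { FS = "\t"; OFS = "\t"
    # prompt user
    printf("Enter a glossary term: ") }
#1 read the glossary file named on the command line
FILENAME == glossfile {
    # load each glossary entry into an array
    entry[$1] = $2
    next
}
#2 scan for command to exit program
$0 ~ /^(quit|[qQ]|exit|[Xx])$/ { exit }
#3 process any non-empty line
$0 != "" {
    if ($0 in entry)
        print entry[$0]
    else
        print $0 " not found"
}
#4 prompt user again for another term
{ printf("Enter another glossary term (q to quit): ") }' "$HOME/.glossary" -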
The shell script that invokes awk could handle command-line options that allow the user to specify the glossary filename. You could also read a shared glossary file and then read a local one by writing separate rules to process the entries. 8.4.4 Using split() to Create Arrays The built-in function split() can parse any string into elements of an array. This function can be useful to extract "subfields" from a field. The syntax of the split() function is: n = split(string, array, separator) string is the input string to be parsed into elements of the named array. The array's indices start at 1 and go to n, the number of elements in the array. The elements will be split based on the specified separator character. If a separator is not specified, then the field separator (FS) is used. The separator can be a full regular expression, not just a single character. Array splitting behaves identically to field splitting; see the section "Referencing and Separating Fields" in Chapter 7. For example, if you had a record in which the first field consisted of the person's full name, you could use the split() function to extract the person's first and last names. The following statement breaks up the first field into elements of the array fullname: z = split($1, fullname, " ") A space is specified as the delimiter. The person's first name can be referenced as: fullname[1] and the person's last name can be referenced as: fullname[z] because z contains the number of elements in the array. This works regardless of whether the person's full name contains a middle name. If z is the value returned by split(), you can write a loop to read all the elements of this array. z = split($1, array, " ") for (i = 1; i <= z; ++i) print i, array[i] The next section contains additional examples of using the split() function. 8.4.5 Making Conversions This section looks at two examples that demonstrate similar methods of converting output from one format to another. When working on the index program shown in Chapter 12, Full-Featured Applications, we needed a quick way to assign roman numerals to volume numbers. In other words, volume 4 needed to be identified as "IV" in the index. Since there was no immediate prospect of the number of volumes exceeding 10, we wrote a script that took as input a number between 1 and 10 and converted it to a roman numeral. This shell script takes the first argument from the command line and echoes it as input to the awk program. echo $1 | awk '# romanum -- convert number 1-10 to roman numeral # define numerals as list of roman numerals 1-10 BEGIN { # create array named numerals from list of roman numerals split("I,II,III,IV,V,VI,VII,VIII,IX,X", numerals, ",") } # look for number between 1 and 10 $1 > 0 && $1 <= 10 { # print specified element print numerals[$1] exit } { print "invalid number" exit }' This script defines a list of 10 roman numerals, then uses split() to load them into an array named numerals. This is done in the BEGIN action because it only needs to be done once. The second rule checks that the first field of the input line contains a number between 1 and 10. If it does, this number is used as the index to the numerals array, retrieving the corresponding element. The exit statement terminates the program. The last rule is executed only if there is no valid entry. Here's an example of how it works: $ romanum 4 IV Following along on the same idea, here's a script that converts dates in the form "mm-dd-yy" or "mm/dd/yy" to "month day, year."
awk ' # date-month -- convert mm/dd/yy or mm-dd-yy to month day, year # build list of months and put in array. BEGIN { # the 3-step assignment is done for printing in book listmonths = "January,February,March,April,May,June," listmonths = listmonths "July,August,September," listmonths = listmonths "October,November,December" split(listmonths, month, ",") } # check that there is input $1 != "" { # split on "/" the first input field into elements of array sizeOfArray = split($1, date, "/") # check that only one field is returned if (sizeOfArray == 1) # try to split on "-" sizeOfArray = split($1, date, "-") # must be invalid if (sizeOfArray == 1) exit # add 0 to number of month to coerce numeric type date[1] += 0 # print month day, year print month[date[1]], (date[2] ", 19" date[3]) }' This script reads from standard input. The BEGIN action creates an array named month whose elements are the names of the months of the year. The second rule verifies that we have a non-empty input line. The first statement in the associated action splits the first field of input looking for "/" as the delimiter. sizeOfArray contains the number of elements in the array. If awk was unable to parse the string, it creates the array with only one element. Thus, we can test the value of sizeOfArray to determine if we have several elements. If we do not, we assume that perhaps "-" was used as the delimiter. If that fails to produce an array with multiple elements, we assume the input is invalid and exit. If we have successfully parsed the input, date[1] contains the number of the month. This value can be used as the index to the array month, nesting one array inside another. However, before using date[1], we coerce its type by adding 0 to it. Remember that array subscripts are strings: split() creates the subscripts "1" through "12", so a month entered with a leading zero, such as "06", is a different string from "6" and would not match any element. Adding 0 forces a numeric interpretation, turning "06" into 6, which in turn converts to the subscript "6". The element referenced by date[1] is used as the subscript for month. Here's a sample run: $ echo "5/11/55" | date-month May 11, 1955 8.4.6 Deleting Elements of an Array Awk provides a statement for deleting an element of an array. The syntax is: delete array[subscript] The brackets are required. This statement removes the element indexed by subscript from array. In particular, the in test for subscript will now return false. This is different from just assigning the empty string to that element; in that case in would still be true. See the lotto script in the next chapter for an example of using the delete statement. 8.3 Other Statements That Affect Flow Control 8.5 An Acronym Processor Chapter 8 Conditionals, Loops, and Arrays 8.5 An Acronym Processor Now let's look at a program that scans a file for acronyms. Each acronym is replaced with its full text description, followed by the acronym in parentheses. If a line refers to "BASIC," we'd like to replace it with the description "Beginner's All-Purpose Symbolic Instruction Code" and put the acronym in parentheses afterwards. (This is probably not a useful program in and of itself, but the techniques used in the program are general and have many uses.) We can design this program for use as a filter that prints all lines, regardless of whether a change has been made. We'll call it awkro.
awk '# awkro - expand acronyms # load acronyms file into array "acro" FILENAME == "acronyms" { split($0, entry, "\t") acro[entry[1]] = entry[2] next } # process any input line containing caps /[A-Z][A-Z]+/ { # see if any field is an acronym for (i = 1; i <= NF; i++) if ( $i in acro ) { # if it matches, add description $i = acro[$i] " (" $i ")" } } { # print all lines print $0 }' acronyms $* Let's first see it in action. Here's a sample input file. $ cat sample The USGCRP is a comprehensive research effort that includes applied as well as basic research. The NASA program Mission to Planet Earth represents the principal space-based component of the USGCRP and includes new initiatives such as EOS and Earthprobes. And here is the file acronyms: $ cat acronyms USGCRP U.S. Global Change Research Program NASA National Aeronautic and Space Administration EOS Earth Observing System Now we run the program on the sample file. $ awkro sample The U.S. Global Change Research Program (USGCRP) is a comprehensive research effort that includes applied as well as basic research. The National Aeronautic and Space Administration (NASA) program Mission to Planet Earth represents the principal space-based component of the U.S. Global Change Research Program (USGCRP) and includes new initiatives such as Earth Observing System (EOS) and Earthprobes. We'll look at this program in two parts. The first part reads records from the acronyms file. # load acronyms file into array "acro" FILENAME == "acronyms" { split($0, entry, "\t") acro[entry[1]] = entry[2] next } The two fields from these records are loaded into an array using the first field as the subscript and assigning the second field to an element of the array. In other words, the acronym itself is the index to its description. Note that we did not change the field separator, but instead used the split() function to create the array entry. This array is then used in creating an array named acro. Here is the second half of the program: # process any input line containing caps /[A-Z][A-Z]+/ { # see if any field is an acronym for (i = 1; i <= NF; i++) if ( $i in acro ) { # if it matches, add description $i = acro[$i] " (" $i ")" } } { # print all lines print $0 } Only lines that contain two or more consecutive capital letters are processed by the first of the two actions shown here. This action loops through each field of the record. At the heart of this section is the conditional statement that tests if the current field ($i) is a subscript of the array (acro). If the field is a subscript, we replace the original value of the field with the array element and the original value in parentheses. (Fields can be assigned new values, just like regular variables.) Note that the insertion of the description of the acronym results in lines that may be too long. See the next chapter for a discussion of the length() function, which can be used to determine the length of a string so you can divide it up if it is too long. Now we're going to change the program so it makes a replacement only the first time an acronym appears. After we've found it, we don't want to search for that acronym any more. This is easy to do; we simply delete that acronym from the array. Because the assignment to $i overwrites the field, we must save the acronym in a variable before making the replacement; otherwise the subscript for the delete statement would be lost. if ( $i in acro ) { # remember the acronym before replacing the field acronym = $i # if it matches, add description $i = acro[$i] " (" $i ")" # only expand the acronym once delete acro[acronym] } There are other changes that would be good to make. In running the awkro program, we soon discovered that it failed to match the acronym if it was followed by a punctuation mark.
Our initial solution was not to handle it in awk at all. Instead, we used two sed scripts, one before processing: sed 's/\([^.,;:!][^.,;:!]*\)\([.,;:!]\)/\1 @@@\2/g' and one after: sed 's/ @@@\([.,;:!]\)/\1/g' A sed script, run prior to invoking awk, could simply insert a space before any punctuation mark, causing it to be interpreted as a separate field. A string of garbage characters (@@@) was also added so we'd be able to easily identify and restore the punctuation mark. (The complicated expression used in the first sed command makes sure that we catch the case of more than one punctuation mark on a line.) This kind of solution, using another tool in the UNIX toolbox, demonstrates that not everything needs to be done as an awk procedure. Awk is all the more valuable because it is situated in the UNIX environment. However, with POSIX awk, we can implement a different solution, one that uses a regular expression to match the acronym. Such a solution can be implemented with the match() and sub() functions described in the next chapter. 8.5.1 Multidimensional Arrays Awk supports linear arrays in which the index to each element of the array is a single subscript. If you imagine a linear array as a row of numbers, a two-dimensional array represents rows and columns of numbers. You might refer to the element in the second column of the third row as "array[3, 2]." Two- and three-dimensional arrays are examples of multidimensional arrays. Awk does not support multidimensional arrays but instead offers a syntax for subscripts that simulates a reference to a multidimensional array. For instance, you could write the following expression: file_array[NR, i] = $i where each field of an input record is indexed by its record number and field number. Thus, the following reference: file_array[2, 4] would produce the value of the fourth field of the second record. This syntax does not create a multidimensional array. It is converted into a string that uniquely identifies the element in a linear array. The components of a multidimensional subscript are interpreted as individual strings ("2" and "4," for instance) and concatenated together, separated by the value of the system variable SUBSEP. The subscript-component separator is defined as "\034" by default, an unprintable character rarely found in ASCII text. Thus, awk maintains a one-dimensional array and the subscript for our previous example would actually be "2\0344" (the concatenation of "2," the value of SUBSEP, and "4"). The main consequence of this simulation of multidimensional arrays is that the larger the array, the slower it is to access individual elements. However, you should time this, using your own application, with different awk implementations (see Chapter 11, A Flock of awks). Here is a sample awk script named bitmap.awk that shows how to load and output the elements of a multidimensional array. This array represents a two-dimensional bitmap that is 12 characters in width and height. BEGIN { FS = "," # comma-separated fields # assign width and height of bitmap WIDTH = 12 HEIGHT = 12 # loop to load entire array with "O" for (i = 1; i <= WIDTH; ++i) for (j = 1; j <= HEIGHT; ++j) bitmap[i, j] = "O" } # read input of the form x,y. { # assign "X" to that element of array bitmap[$1, $2] = "X" } # at end output multidimensional array END { for (i = 1; i <= WIDTH; ++i) { for (j = 1; j <= HEIGHT; ++j) printf("%s", bitmap[i, j]) # after each row, print newline printf("\n") } } Before any input is read, the bitmap array is loaded with O's.
This array has 144 elements. The input to this program is a series of coordinates, one per line. $ cat bitmap.test 1,1 2,2 3,3 4,4 5,5 6,6 7,7 8,8 9,9 10,10 11,11 12,12 1,12 2,11 3,10 4,9 5,8 6,7 7,6 8,5 9,4 10,3 11,2 12,1 For each coordinate, the program will put an "X" in place of an "O" at that element of the array. At the end of the script, the same kind of loop that loaded the array now outputs it. The following example reads the input from the file bitmap.test. $ awk -f bitmap.awk bitmap.test XOOOOOOOOOOX OXOOOOOOOOXO OOXOOOOOOXOO OOOXOOOOXOOO OOOOXOOXOOOO OOOOOXXOOOOO OOOOOXXOOOOO OOOOXOOXOOOO OOOXOOOOXOOO OOXOOOOOOXOO OXOOOOOOOOXO XOOOOOOOOOOX The multidimensional array syntax is also supported in testing for array membership. The subscripts must be placed inside parentheses. if ((i, j) in array) This tests whether the subscript i,j (actually, i SUBSEP j) exists in the specified array. Looping over a multidimensional array is the same as with one-dimensional arrays. for (item in array) You must use the split() function to access individual subscript components. Thus: split(item, subscr, SUBSEP) creates the array subscr from the subscript item. Note that we needed to use the loop-within-a-loop to output the two-dimensional bitmap array in the previous example because we needed to maintain rows and columns. 8.4 Arrays 8.6 System Variables That Are Arrays Chapter 8 Conditionals, Loops, and Arrays 8.6 System Variables That Are Arrays Awk provides two system variables that are arrays: ARGV An array of command-line arguments, excluding the script itself and any options specified with the invocation of awk. The number of elements in this array is available in ARGC. The index of the first element of the array is 0 (unlike all other arrays in awk but consistent with C) and the last is ARGC - 1. ENVIRON An array of environment variables. Each element of the array is the value in the current environment and the index is the name of the environment variable. 8.6.1 An Array of Command-Line Parameters You can write a loop to reference all the elements of the ARGV array. # argv.awk - print command-line parameters BEGIN { for (x = 0; x < ARGC; ++x) print ARGV[x] print ARGC } This example also prints out the value of ARGC, the number of command-line arguments. Here's an example of how it works on a sample command line: $ awk -f argv.awk 1234 "John Wayne" Westerns n=44 - awk 1234 John Wayne Westerns n=44 - 6 As you can see, there are six elements in the array. The first element is the name of the command that invoked the script. The last argument, in this case, is the filename, "-", for standard input. Note that "-f argv.awk" does not appear in the parameter list. Generally, the value of ARGC will be at least 2. If you don't want to refer to the program name or the filename, you can initialize the counter to 1 and then test against ARGC - 1 to avoid referencing the last parameter (assuming that there is only one filename). Remember that if you invoke awk from a shell script, the command-line parameters are passed to the shell script and not to awk. You have to pass the shell script's command-line parameters to the awk program inside the shell script. For instance, you can pass all command-line parameters from the shell script to awk, using "$*". Look at the following shell script: awk ' # argv.sh - print command-line parameters BEGIN { for (x = 0; x < ARGC; ++x) print ARGV[x] print ARGC }' $* This shell script works the same as the first example of invoking awk.
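To apply the suggestion made above, here is a sketch of the loop rewritten inside such a shell script so that it skips both the program name and the filename (a hypothetical variation, assuming a single filename appears as the last parameter):
awk '
# a hypothetical variation of argv.sh: print only the "real" parameters
BEGIN {
    # start at 1 to skip ARGV[0], and stop before the final
    # parameter, assumed to be the lone filename
    for (x = 1; x < ARGC - 1; ++x)
        print ARGV[x]
}' $*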
One practical use is to test the command-line parameters in the BEGIN rule using a regular expression. The following example tests that all the parameters, except the first, are integers. # number.awk - test command-line parameters BEGIN { for (x = 1; x < ARGC; ++x) if ( ARGV[x] !~ /^[0-9]+$/ ) { print ARGV[x], "is not an integer." exit 1 } } If the parameters contain any character that is not a digit, the program will print the message and quit. After testing the value, you can, of course, assign it to a variable. For instance, we could write a BEGIN procedure of a script that checks the command-line parameters before prompting the user. Let's look at the following shell script that uses the phone and address database from the previous chapter: awk '# phone - find phone number for person # supply name of person on command line or at prompt. BEGIN { FS = "," # look for parameter if ( ARGC > 2 ){ name = ARGV[1] delete ARGV[1] } else { # loop until we get a name while (! name) { printf("Enter a name? ") getline name < "-" } } } $1 ~ name { print $1, $NF }' $* phones.data We test the ARGC variable to see if there are more than two parameters. By specifying "$*", we can pass all the parameters from the shell command line inside to the awk command line. If this parameter has been supplied, we assume the second parameter, ARGV[1], is the one we want and it is assigned to the variable name. Then that parameter is deleted from the array. This is very important if the parameter that is supplied on the command line is not of the form "var=value"; otherwise, it will later be interpreted as a filename. If additional parameters are supplied, they will be interpreted as filenames of alternative phone databases. If there are not more than two parameters, then we prompt for the name. The getline function is discussed in Chapter 10; using this syntax, it reads the next line from standard input. Here are several examples of this script in action: $ phone John John Robinson 696-0987 $ phone Enter a name? Alice Alice Gold (707) 724-0000 $ phone Alice /usr/central/phonebase Alice Watson (617) 555-0000 Alice Gold (707) 724-0000 The first example supplies the name on the command line, the second prompts the user, and the third takes two command-line parameters and uses the second as a filename. (The script will not allow you to supply a filename without supplying the person's name on the command line. You could devise a test that would permit this syntax, though.) Because you can add to and delete from the ARGV array, there is the potential for doing a lot of interesting manipulation. You can place a filename at the end of the ARGV array, for instance, and it will be opened as though it were specified on the command line. Similarly, you can delete a filename from the array and it will never be opened. Note that if you add new elements to ARGV, you should also increment ARGC; awk uses the value of ARGC to know how many elements in ARGV it should process. Thus, simply decrementing ARGC will keep awk from examining the final element in ARGV. As a special case, if the value of an ARGV element is the empty string (""), awk will skip over it and continue on to the next element. 8.6.2 An Array of Environment Variables The ENVIRON array was added independently to both gawk and MKS awk. It was then added to the System V Release 4 nawk, and is now included in the POSIX standard for awk. It allows you to access variables in the environment. 
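A single environment variable can be fetched by using its name as the subscript. Here is a quick sketch (this assumes LOGNAME is set in your environment; the value shown matches the sample environment in the following example):
$ awk 'BEGIN { print "Logged in as", ENVIRON["LOGNAME"] }'
Logged in as dale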
The following script loops through the elements of the ENVIRON array and prints them. # environ.awk - print environment variables BEGIN { for (env in ENVIRON) print env "=" ENVIRON[env] } The index of the array is the variable name. The script generates the same output produced by the env command (printenv on some systems). $ awk -f environ.awk DISPLAY=scribe:0.0 FRAME=Shell 3 LOGNAME=dale MAIL=/usr/mail/dale PATH=:/bin:/usr/bin:/usr/ucb:/work/bin:/mac/bin:. TERM=mac2cs HOME=/work/dale SHELL=/bin/csh TZ=PST8PDT EDITOR=/usr/bin/vi You can reference any element, using the variable name as the index of the array: ENVIRON["LOGNAME"] You can also change any element of the ENVIRON array. ENVIRON["LOGNAME"] = "Tom" However, this change does not affect the user's actual environment (i.e., when awk is done, the value of LOGNAME will not be changed) nor does it affect the environment inherited by programs that are invoked from awk via the getline or system() functions, which are described in Chapter 10. This chapter has covered many important programming constructs. You will continue to see examples in upcoming chapters that make use of these constructs. If programming is new to you, be sure you take the time to run and modify the programs in this chapter, and write small programs of your own. It is essential, like learning how to conjugate verbs, that these constructs become familiar and predictable to you. 8.5 An Acronym Processor 9. Functions Chapter 9 9. Functions Contents: Arithmetic Functions String Functions Writing Your Own Functions A function is a self-contained computation that accepts a number of arguments as input and returns some value. Awk has a number of built-in functions in two groups: arithmetic and string functions. Awk also provides user-defined functions, which allow you to expand upon the built-in functions by writing your own. 9.1 Arithmetic Functions Nine of the built-in functions can be classified as arithmetic functions. Most of them take a numeric argument and return a numeric value. Table 9.1 summarizes these arithmetic functions. Table 9.1: awk's Built-In Arithmetic Functions Awk Function Description cos(x) Returns cosine of x (x is in radians). exp(x) Returns e to the power x. int(x) Returns truncated value of x. log(x) Returns natural logarithm (base-e) of x. sin(x) Returns sine of x (x is in radians). sqrt(x) Returns square root of x. atan2(y,x) Returns arctangent of y/x in the range -π to π. rand() Returns pseudo-random number r, where 0 <= r < 1. srand(x) Establishes new seed for rand(). If no seed is specified, uses time of day. Returns the old seed. 9.1.1 Trigonometric Functions The trigonometric functions cos() and sin() work the same way, taking a single argument that is the size of an angle in radians and returning the cosine or sine for that angle. (To convert from degrees to radians, multiply the number by π/180.) The trigonometric function atan2() takes two arguments and returns the arctangent of their quotient. The expression atan2(0, -1) produces π. The function exp() uses the natural exponential, which is also known as base-e exponentiation. The expression exp(1) returns the natural number 2.71828, the base of the natural logarithms, referred to as e. Thus, exp(x) is e to the x-th power. The log() function gives the inverse of the exp() function, the natural logarithm of x. The sqrt() function takes a single argument and returns the (positive) square root of that argument.
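The following throwaway program, a minimal sketch you can paste at a shell prompt, exercises several of these functions at once; the expected results, which follow from the descriptions above, are noted in the comments:
$ awk 'BEGIN {
> pi = atan2(0, -1)      # the arctangent of 0/-1 is pi
> print cos(pi)          # cosine of pi is -1
> print sin(pi/2)        # sine of pi/2 is 1
> print exp(1)           # e, the base of the natural logarithms
> print log(exp(2))      # log() undoes exp(), giving 2
> print sqrt(144)        # positive square root, 12
> }'
-1
1
2.71828
2
12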
9.1.2 Integer Function The int() function truncates a numeric value by removing digits to the right of the decimal point. Look at the following two statements: print 100/3 print int(100/3) The output from these statements is shown below: 33.3333 33 The int() function simply truncates; it does not round up or down. (Use the printf format "%.0f" to perform rounding.)[1] [1] The way printf does rounding is discussed in Appendix B, Quick Reference for awk. 9.1.3 Random Number Generation The rand() function generates a pseudo-random floating-point number between 0 and 1. The srand() function sets the seed or starting point for random number generation. If srand() is called without an argument, it uses the time of day to generate the seed. With an argument x, srand() uses x as the seed. If you don't call srand() at all, awk acts as if srand() had been called with a constant argument before your program started, causing you to get the same starting point every time you run your program. This is useful if you want reproducible behavior for testing, but inappropriate if you really do want your program to behave differently every time. Look at the following script: # rand.awk -- test random number generation BEGIN { print rand() print rand() srand() print rand() print rand() } We print the result of the rand() function twice, and then call the srand() function before printing the result of the rand() function two more times. Let's run the script. $ awk -f rand.awk 0.513871 0.175726 0.760277 0.263863 Four random numbers are generated. Now look what happens when we run the program again: $ awk -f rand.awk 0.513871 0.175726 0.787988 0.305033 The first two "random" numbers are identical to the numbers generated in the previous run of the program, while the last two numbers are different; the call to srand() reseeded the generator from the time of day. The return value of the srand() function is the seed it was using. This can be used to keep track of sequences of random numbers, and re-run them if needed. 9.1.4 Pick 'em To show how to use rand(), we'll look at a script that implements a "quick-pick" for a lottery game. This script, named lotto, picks x numbers from a series of numbers 1 to y. Two arguments can be supplied on the command line: how many numbers to pick (the default is 6) and the highest number in the series (the default is 30). Using the default values for x and y, the script generates six unique random numbers between 1 and 30. The numbers are sorted for readability from lowest to highest and output. Before looking at the script itself, let's run the program: $ lotto Pick 6 of 30 9 13 25 28 29 30 $ lotto 7 35 Pick 7 of 35 1 6 9 16 20 22 27 The first example uses the default values to print six random numbers from 1 to 30. The second example prints seven random numbers out of 35. The full lotto script is fairly complicated, so before looking at the entire script, let's look at a smaller script that generates a single random number in a series: awk -v TOPNUM=$1 ' # pick1 - pick one random number out of y # main routine BEGIN { # seed random number using time of day srand() # get a random number select = 1 + int(rand() * TOPNUM) # print pick print select }' The shell script expects a single argument from the command line and this is passed into the program as "TOPNUM=$1," using the -v option. All the action happens in the BEGIN procedure. Since there are no other statements in the program, awk exits when the BEGIN procedure is done.
The main routine first calls the srand() function to seed the random number generator. Then we get a random number by calling the rand() function: select = 1 + int(rand() * TOPNUM) It might be helpful to see this expression broken up so each part of it is obvious. Statement Result print r = rand() 0.467315 print r * TOPNUM 14.0195 print int(r * TOPNUM) 14 print 1 + int(r * TOPNUM) 15 Because the rand() function returns a number between 0 and 1, we multiply it by TOPNUM to get a number between 0 and TOPNUM. We then truncate the number to remove the fractional value, and add 1 to the result. The latter is necessary because rand() could return 0. In this example, the random number that is generated is 15. You could use this program to print any single number, such as picking a number between 1 and 100. $ pick1 100 83 The lotto script must "pick one" multiple times. Basically, we need to set up a for loop to execute the rand() function as many times as needed. One of the reasons this is difficult is that we have to worry about duplicates. In other words, it is possible for a number to be picked again; therefore we have to keep track of the numbers already picked. Here's the lotto script: awk -v NUM=$1 -v TOPNUM=$2 ' # lotto - pick x random numbers out of y # main routine BEGIN { # test command line args; NUM = $1, how many numbers to pick # TOPNUM = $2, last number in series if (NUM <= 0) NUM = 6 if (TOPNUM <= 0) TOPNUM = 30 # print "Pick x of y" printf("Pick %d of %d\n", NUM, TOPNUM) # seed random number using time and date; do this once srand() # loop until we have NUM selections for (j = 1; j <= NUM; ++j) { # loop to find a not-yet-seen selection do { select = 1 + int(rand() * TOPNUM) } while (select in pick) pick[select] = select } # loop through array and print picks. for (j in pick) printf("%s ", pick[j]) printf("\n") }' Unlike the previous program, this one looks for two command-line arguments, indicating x numbers out of y. The main routine looks to see if these numbers were supplied and if not, assigns default values. There is only one array, pick, for holding the random numbers that are selected. Each number is guaranteed to be in the desired range, because the result of rand() (a value between 0 and 1) is multiplied by TOPNUM, truncated, and incremented by 1. The heart of the script is a loop that occurs NUM times to assign NUM elements to the pick array. To get a new non-duplicate random number, we use an inner loop that generates selections and tests to see if they are in the pick array. (Using the in operator is much faster than looping through the array comparing subscripts.) As long as select in pick is true, that number has been picked already, so the selection is a duplicate and we reject it and try again. Once select in pick is false, we assign select to an element of the pick array. This makes future in tests for that number return true, so the do loop will generate a new selection if the same number comes up again. Finally, the program loops through the pick array and prints the elements. This version of the lotto script leaves one thing out. See if you can tell what it is if we run it again: $ lotto 7 35 Pick 7 of 35 5 21 9 30 29 20 2 That's right, the numbers are not sorted. We'll defer showing the code for the sort routine until we discuss user-defined functions. While it's not necessary to have written the sorting code as a function, it makes a lot of sense. One reason is that you can tackle a more generalized problem and retain the solution for use in other programs.
Later on, we will write a function that sorts the elements of an array. Note that the pick array isn't ready for sorting, since its indices are the same as its values, not numbers in order. We would have to set up a separate array for sorting by our sort function: # create a numerically indexed array for sorting i = 1 for (j in pick) sortedpick[i++] = pick[j] The lotto program is set up to do everything in the BEGIN block. No input is processed. You could, however, revise this script to read a list of names from a file and for each name generate a "quick-pick." 8.6 System Variables That Are Arrays 9.2 String Functions Chapter 9 Functions 9.2 String Functions The built-in string functions are much more significant and interesting than the numeric functions. Because awk is essentially designed as a string-processing language, a lot of its power derives from these functions. Table 9.2 lists the string functions found in awk. Table 9.2: Awk's Built-In String Functions Awk Function Description gsub(r,s,t) Globally substitutes s for each match of the regular expression r in the string t. Returns the number of substitutions. If t is not supplied, defaults to $0. index(s,t) Returns position of substring t in string s or zero if not present. length(s) Returns length of string s or length of $0 if no string is supplied. match(s,r) Returns either the position in s where the regular expression r begins, or 0 if no occurrences are found. Sets the values of RSTART and RLENGTH. split(s,a,sep) Parses string s into elements of array a using field separator sep; returns number of elements. If sep is not supplied, FS is used. Array splitting works the same way as field splitting. sprintf("fmt",expr) Uses printf format specification for expr. sub(r,s,t) Substitutes s for first match of the regular expression r in the string t. Returns 1 if successful; 0 otherwise. If t is not supplied, defaults to $0. substr(s,p,n) Returns substring of string s beginning at position p, up to a maximum length of n. If n is not supplied, the rest of the string from p is used. tolower(s) Translates all uppercase characters in string s to lowercase and returns the new string. toupper(s) Translates all lowercase characters in string s to uppercase and returns the new string. The split() function was introduced in the previous chapter in the discussion on arrays. The sprintf() function uses the same format specifications as printf(), which is discussed in Chapter 7, Writing Scripts for awk. It allows you to apply a format specification to a string. Instead of printing the result, sprintf() returns a string that can be assigned to a variable. It can do specialized processing of input records or fields, such as performing character conversions. For instance, the following example uses the sprintf() function to convert a number into an ASCII character. for (i = 97; i <= 122; ++i) { nextletter = sprintf("%c", i) ... } A loop supplies numbers from 97 to 122, which produce ASCII characters from a to z. That leaves us with three basic built-in string functions to discuss: index(), substr(), and length(). 9.2.1 Substrings The index() and substr() functions both deal with substrings. Given a string s, index(s,t) returns the leftmost position where string t is found in s. The beginning of the string is position 1 (which is different from the C language, where the first character in a string is at position 0). Look at the following example: pos = index("Mississippi", "is") The value of pos is 2.
If the substring is not found, the index() function returns 0. Given a string s, substr(s,p) returns the characters beginning at position p. The following example creates a phone number without an area code. phone = substr("707-555-1111", 5) You can also supply a third argument which is the number of characters to return. The next example returns just the area code: area_code = substr("707-555-1111", 1, 3) The two functions can be and often are used together, as in the next example. This example capitalizes the first letter of the first word for each input record. awk '# caps - capitalize 1st letter of 1st word # initialize strings BEGIN { upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" lower = "abcdefghijklmnopqrstuvwxyz" } # for each input line { # get first character of first word FIRSTCHAR = substr($1, 1, 1) # get position of FIRSTCHAR in lowercase array; if 0, ignore if (CHAR = index(lower, FIRSTCHAR)) # change $1, using position to retrieve # uppercase character $1 = substr(upper, CHAR, 1) substr($1, 2) # print record print $0 }' This script creates two variables, upper and lower, consisting of uppercase and lowercase letters. Any character that we find in lower can be found at the same position in upper. The first statement of the main procedure extracts a single character, the first one, from the first field. The conditional statement tests to see if that character can be found in lower using the index() function. If CHAR is not 0, then CHAR can be used to extract the uppercase character from upper. There are two substr() function calls: the first one retrieves the capitalized letter and the second call gets the rest of the first field, extracting all characters, beginning with the second character. The values returned by both substr() functions are concatenated and assigned to $1. Making an assignment to a field as we do here is a new twist, but it has the added benefit that the record can be output normally. (If the assignment was made to a variable, you'd have to output the variable and then output the record's remaining fields.) The print statement prints the changed record. Let's see it in action: $ caps root user Root user dale Dale Tom Tom In a little bit, we'll see how to revise this program to change all characters in a string from lower- to uppercase or vice versa. 9.2.2 String Length When presenting the awkro program in the previous chapter, we noted that the program was likely to produce lines that exceed 80 characters. After all, the descriptions are quite long. We can find out how many characters are in a string using the built-in function length(). For instance, to evaluate the length of the current input record, we specify length($0). (As it happens, if length() is called without an argument, it returns the length of $0.) The length() function is often used to find the length of the current input record, in order to determine if we need to break the line. One way to handle the line break, perhaps more efficiently, is to use the length() function to get the length of each field. By accumulating those lengths, we could specify a line break when a new field causes the total to exceed a certain number. Chapter 13, A Miscellany of Scripts, contains a script that uses the length() function to break lines greater than 80 columns wide. 9.2.3 Substitution Functions Awk provides two substitution functions: sub() and gsub(). The difference between them is that gsub() performs its substitution globally on the input string whereas sub() makes only the first possible substitution. 
This makes gsub() equivalent to the sed substitution command with the g (global) flag. Both functions take at least two arguments. The first is a regular expression (surrounded by slashes) that matches a pattern and the second argument is a string that replaces what the pattern matches. The regular expression can be supplied by a variable, in which case the slashes are omitted. An optional third argument specifies the string that is the target of the substitution. If there is no third argument, the substitution is made for the current input record ($0). The substitution functions change the specified string directly. You might expect, given the way functions work, that the function returns the new string created when the substitution is made. The substitution functions actually return the number of substitutions made. sub() will always return 1 if successful; both return 0 if not successful. Thus, you can test the result to see if a substitution was made. For example, the following example uses gsub() to replace all occurrences of "UNIX" with "POSIX". if (gsub(/UNIX/, "POSIX")) print The conditional statement tests the return value of gsub() such that the current input line is printed only if a change is made. As with sed, if an "&" appears in the substitution string, it will be replaced by the string matched by the regular expression. Use "\&" to output an ampersand. (Remember that to get a literal "\" into a string, you have to type two of them.) Also, note that awk does not "remember" the previous regular expression, as does sed, so you cannot use the syntax "//" to refer to the last regular expression. The following example surrounds any occurrence of "UNIX" with the troff font-change escape sequences. gsub(/UNIX/, "\\fB&\\fR") If the input is "the UNIX operating system", the output is "the \fBUNIX\fR operating system". In Chapter 4, Writing sed Scripts, we presented the following sed script named do.outline: sed -n ' s/"//g s/^\.Se /Chapter /p s/^\.Ah / A. /p s/^\.Bh / B. /p' $* Now here's that script rewritten using the substitution functions: awk ' { gsub(/"/, "") if (sub(/^\.Se /, "Chapter ")) print if (sub(/^\.Ah /, "\tA. ")) print if (sub(/^\.Bh /, "\t\tB. ")) print }' $* The two scripts are exactly equivalent, printing out only those lines that are changed. For the first edition of this book, Dale compared the run-time of both scripts and, as he expected, the awk script was slower. For the second edition, new timings showed that performance varies by implementation, and in fact, all tested versions of new awk were faster than sed! This is nice, since we have the capabilities in awk to make the script do more things. For instance, instead of using letters of the alphabet, we could number the headings. Here's the revised awk script: awk '# do.outline -- number headings in chapter. { gsub(/"/, "") } /^\.Se/ { sub(/^\.Se /, "Chapter ") ch = $2 ah = 0 bh = 0 print next } /^\.Ah/ { sub(/^\.Ah /, "\t " ch "." ++ah " ") bh = 0 print next } /^\.Bh/ { sub(/^\.Bh /, "\t\t " ch "." ah "." ++bh " ") print }' $* In this version, we break out each heading into its own pattern-matching rule. This is not necessary but seems more efficient since we know that once a rule is applied, we don't need to look at the others. Note the use of the next statement to bypass further examination of a line that has already been identified. The chapter number is read as the first argument to the ".Se" macro and is thus the second field on that line. 
The numbering scheme is done by incrementing a variable each time the substitution is made. The action associated with the chapter-level heading initializes the section-heading counters to zero. The action associated with the top-level heading ".Ah" zeroes the second-level heading counter. Obviously, you can create as many levels of heading as you need. Note how we can specify a concatenation of strings and variables as a single argument to the sub() function. $ do.outline ch02 Chapter 2 Understanding Basic Operations 2.1 Awk, by Sed and Grep, out of Ed 2.2 Command-line Syntax 2.2.1 Scripting 2.2.2 Sample Mailing List 2.3 Using Sed 2.3.1 Specifying Simple Instructions 2.3.2 Script Files 2.4 Using Awk 2.5 Using Sed and Awk Together If you wanted the option of choosing either numbers or letters, you could maintain both programs and construct a shell wrapper that uses some flag to determine which program should be invoked. 9.2.4 Converting Case POSIX awk provides two functions for converting the case of characters within a string. The functions are tolower() and toupper(). Each takes a single string argument, and returns a copy of that string, with all the characters of one case converted to the other (upper to lower and lower to upper, respectively). Their use is straightforward: $ cat test Hello, World! Good-bye CRUEL world! 1, 2, 3, and away we GO! $ awk '{ printf("<%s>, <%s>\n", tolower($0), toupper($0)) }' test <hello, world!>, <HELLO, WORLD!> <good-bye cruel world!>, <GOOD-BYE CRUEL WORLD!> <1, 2, 3, and away we go!>, <1, 2, 3, AND AWAY WE GO!> Note that nonalphabetic characters are left unchanged. 9.2.5 The match() Function The match() function allows you to determine if a regular expression matches a specified string. It takes two arguments, the string and the regular expression. (This function is confusing because the regular expression is in the second position, whereas it is in the first position for the substitution functions.) The match() function returns the starting position of the substring that was matched by the regular expression. You might consider it a close relation to the index() function. In the following example, the regular expression matches any sequence of capital letters in the string "the UNIX operating system". match("the UNIX operating system", /[A-Z]+/) The value returned by this function is 5, the character position of "U," the first capital letter in the string. The match() function also sets two system variables: RSTART and RLENGTH. RSTART contains the same value returned by the function, the starting position of the substring. RLENGTH contains the length of the matched substring in characters (not the ending position of the substring). When the pattern does not match, RSTART is set to 0 and RLENGTH is set to -1. In the previous example, RSTART is equal to 5 and RLENGTH is equal to 4. (Adding them together gives you the position of the first character after the match.) Let's look at a rather simple example that prints out a string matched by a specified regular expression, demonstrating the "extent of the match," as discussed in Chapter 3, Understanding Regular Expression Syntax. The following shell script takes two command-line arguments: the regular expression, which should be specified in quotes, and the name of the file to search.
awk '# match -- print string that matches line # for lines matching pattern match($0, pattern) { # extract string matching pattern using # starting position and length of string in $0 # print string print substr($0, RSTART, RLENGTH) }' pattern="$1" $2 The first command-line parameter is passed as the value of pattern. Note that $1 is surrounded by quotes, necessary to protect any spaces that might appear in the regular expression. The match() function appears in a conditional expression that controls execution of the only procedure in this awk script. The match() function returns 0 if the pattern is not found, and a non-zero value (RSTART) if it is found, allowing the return value to be used as a condition. If the current record matches the pattern, then the string is extracted from $0, using the values of RSTART and RLENGTH in the substr() function to specify the starting position of the substring to be extracted and its length. The substring is printed. This procedure only matches the first occurrence in $0. Here's a trial run, given a regular expression that matches "emp" and any number of characters up to a blank space: $ match "emp[^ ]*" personnel.txt employees employee employee. employment, employer employment employee's employee The match script could be a useful tool in improving your understanding of regular expressions. The next script uses the match() function to locate any sequence of uppercase letters so that they can be converted to lowercase. Compare it to the caps program shown earlier in the chapter. awk '# lower - change upper case to lower case # initialize strings BEGIN { upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" lower = "abcdefghijklmnopqrstuvwxyz" } # for each input line { # see if there is a match for all caps while (match($0, /[A-Z]+/)) # get each cap letter for (x = RSTART; x < RSTART+RLENGTH; ++x) { CAP = substr($0, x, 1) CHAR = index(upper, CAP) # substitute lowercase for upper gsub(CAP, substr(lower, CHAR, 1)) } # print record print $0 }' $* In this script, the match() function appears in a conditional expression that determines whether a while loop will be executed. By placing this function in a loop, we apply the body of the loop as many times as the pattern occurs in the current input record. The regular expression matches any sequence of uppercase letters in $0. If a match is made, a for loop does the lookup of each character in the substring that was matched, similar to what we did in the caps sample program, shown earlier in this chapter. What's different here is how we use the system variables RSTART and RLENGTH. RSTART initializes the counter variable x. It is used in the substr() function to extract one character at a time from $0, beginning with the first character that matched the pattern. By adding RLENGTH to RSTART, we get the position of the first character after the ones that matched the pattern. That is why the loop uses "<" instead of "<=". At the end, we use gsub() to replace the uppercase letter with the corresponding lowercase letter.[2] Notice that we use gsub() instead of sub() because it offers us the advantage of making several substitutions if there are multiple instances of the same letter on the line. [2] You may be wondering, "why not just use tolower()?" Good question. Some early versions of nawk, including the one on SunOS 4.1.x systems, don't have tolower() and toupper(); thus it's useful to know how to do it yourself. $ cat test Every NOW and then, a WORD I type appears in CAPS.
$ lower test every now and then, a word i type appears in caps. Note that you could change the regular expression to avoid matching individual capital letters by matching a sequence of two or more uppercase characters, by using: "/[A-Z][A-Z]+/." This would also require revising the way the lowercase conversion was made using gsub(), since it matches a single character on the line. In our discussion of the sed substitution command, you saw how to save and recall a portion of a string matched by a pattern, using \( and \) to surround the pattern to be saved and \n to recall the saved string in the replacement pattern. Unfortunately, awk's standard substitution functions offer no equivalent syntax. The match() function can solve many such problems, though. For instance, if you match a string using the match() function, you can single out characters or a substring at the head or tail of the string. Given the values of RSTART and RLENGTH, you can use the substr() function to extract the characters. In the following example, we replace the second of two colons with a semicolon. We can't use gsub() to make the replacement because "/:/" matches the first colon and "/:[^:]*:/" matches the whole string of characters. We can use match() to match the string of characters and to extract the last character of the string. # replace 2nd colon with semicolon using match, substr if (match($1, /:[^:]*:/)) { before = substr($1, 1, (RSTART + RLENGTH - 2)) after = substr($1, (RSTART + RLENGTH)) $1 = before ";" after } The match() function is placed within a conditional statement that tests that a match was found. If there is a match, we use the substr() function to extract the substring before the second colon as well as the substring after it. Then we concatenate before, the literal ";", and after, assigning it to $1. You can see examples of the match() function in use in Chapter 12, Full-Featured Applications. 9.1 Arithmetic Functions 9.3 Writing Your Own Functions Chapter 9 Functions 9.3 Writing Your Own Functions With user-defined functions, awk allows the novice programmer to take another step toward C programming[3] by writing programs that make use of self-contained functions. When you write a function properly, you have defined a program component that can be reused in other programs. The real benefit of modularity becomes apparent as programs grow in size or in age, and as the number of programs you write increases significantly. [3] Or programming in any other traditional high-level language. A function definition can be placed anywhere in a script that a pattern-action rule can appear. Typically, we put the function definitions at the top of the script before the pattern-action rules. A function is defined using the following syntax: function name (parameter-list) { statements } The newlines after the left brace and before the right brace are optional. You can also have a newline after the close-parenthesis of the parameter list and before the left brace. The parameter-list is a comma-separated list of variables that are passed as arguments into the function when it is called. The body of the function consists of one or more statements. The function typically contains a return statement that returns control to that point in the script where the function was called; it often has an expression that returns a value as well. 
return expression

The following example shows the definition for an insert() function:

function insert(STRING, POS, INS) {
    before_tmp = substr(STRING, 1, POS)
    after_tmp = substr(STRING, POS + 1)
    return before_tmp INS after_tmp
}

This function takes three arguments, inserting one string INS in another string STRING after the character at position POS.[4] The body of this function uses the substr() function to divide the value of STRING into two parts. The return statement returns a string that is the result of concatenating the first part of STRING, the INS string, and the last part of STRING. A function call can appear anywhere that an expression can. Thus, the following statement:

[4] We've used a convention of giving all uppercase names to our parameters. This is mostly to make the explanation easier to follow. In practice, this is probably not a good idea, since it becomes much easier to accidentally have a parameter conflict with a system variable.

print insert($1, 4, "XX")

If the value of $1 is "Hello," then this statement prints "HellXXo." Note that when calling a user-defined function, there can be no spaces between the function name and the left parenthesis. This is not true of built-in functions.

It is important to understand the notion of local and global variables. A local variable is a variable that is local to a function and cannot be accessed outside of it. A global variable, on the other hand, can be accessed or changed anywhere in the script. There can be potentially damaging side effects of global variables if a function changes a variable that is used elsewhere in the script. Therefore, it is usually a good idea to eliminate global variables in a function.

When we call the insert() function, and specify $1 as the first argument, then a copy of that variable is passed to the function, where it is manipulated as a local variable named STRING. All the variables in the function definition's parameter list are local variables and their values are not accessible outside the function. Similarly, the arguments in the function call are not changed by the function itself. When the insert() function returns, the value of $1 is not changed. However, the variables defined in the body of the function are global variables, by default. Given the above definition of the insert() function, the temporary variables before_tmp and after_tmp are visible outside the function.

Awk provides what its developers call an "inelegant" means of declaring variables local to a function, and that is by specifying those variables in the parameter list. The local temporary variables are put at the end of the parameter list. This is essential; parameters in the parameter list receive their values, in order, from the values passed in the function call. Any extra parameters, like normal awk variables, are initialized to the empty string. By convention, the local variables are separated from the "real" parameters by several spaces. For instance, the following example shows how to define the insert() function with two local variables.

function insert(STRING, POS, INS,    before_tmp, after_tmp) {
    body
}

If this seems confusing,[5] seeing how the following script works might help:

[5] The documentation calls it a syntactical botch.
function insert(STRING, POS, INS,    before_tmp) {
    before_tmp = substr(STRING, 1, POS)
    after_tmp = substr(STRING, POS + 1)
    return before_tmp INS after_tmp
}
# main routine
{
    print "Function returns", insert($1, 4, "XX")
    print "The value of $1 after is:", $1
    print "The value of STRING is:", STRING
    print "The value of before_tmp:", before_tmp
    print "The value of after_tmp:", after_tmp
}

Notice that we specify before_tmp in the parameter list. In the main routine, we call the insert() function and print its result. Then we print different variables to see what their value is, if any. Now let's run the above script and look at the output:

$ echo "Hello" | awk -f insert.awk -
Function returns HellXXo
The value of $1 after is: Hello
The value of STRING is:
The value of before_tmp:
The value of after_tmp: o

The insert() function returns "HellXXo," as expected. The value of $1 is the same after the function was called as it was before. The variable STRING is local to the function and it does not have a value when called from the main routine. The same is true for before_tmp because its name was placed in the parameter list for the function definition. The variable after_tmp, which was not specified in the parameter list, does have a value, the letter "o."

As this example shows, $1 is passed "by value" into the function. This means that a copy is made of the value when the function is called and the function manipulates the copy, not the original. Arrays, however, are passed "by reference." That is, the function does not work with a copy of the array but is passed the array itself. Thus, any changes that the function makes to the array are visible outside of the function. (This distinction between "scalar" variables and arrays also holds true for functions written in the C language.) The next section presents an example of a function that operates on an array.

9.3.1 Writing a Sort Function

Earlier in this chapter we presented the lotto script for picking x random numbers out of a series of y numbers. The script that we showed did not sort the list of numbers that were selected. In this section, we develop a sort function that sorts the elements of an array. We define a function that takes two arguments, the name of the array and the number of elements in the array. This function can be called this way:

sort(sortedpick, NUM)

The function definition lists the two arguments and three local variables used in the function.

# sort numbers in ascending order
function sort(ARRAY, ELEMENTS,    temp, i, j) {
    for (i = 2; i <= ELEMENTS; ++i) {
        for (j = i; ARRAY[j-1] > ARRAY[j]; --j) {
            temp = ARRAY[j]
            ARRAY[j] = ARRAY[j-1]
            ARRAY[j-1] = temp
        }
    }
    return
}

The body of the function implements an insertion sort. This sorting algorithm is very simple. We loop through each element of the array and compare it to the value preceding it. If the first element is greater than the second, the first and second elements are swapped. To actually swap the values, we use a temporary variable to hold a copy of the value while we overwrite the original. The loop continues swapping adjacent elements until all are in order. At the end of the function, we use the return statement to simply return control.[6] The function does not need to pass the array back to the main routine because the array itself is changed and it can be accessed directly.

[6] In this case, the return is not strictly necessary; "falling off the end" of the function would have the same effect.
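To see the by-reference behavior concretely, here is a small self-contained sketch (our illustration, not a script from this chapter) that loads an array, hands it to the sort() function just defined, and prints the elements afterward. The changes made inside the function persist in the caller:

# refdemo.awk -- a sketch showing that arrays are passed by reference
function sort(ARRAY, ELEMENTS,    temp, i, j) {
    for (i = 2; i <= ELEMENTS; ++i) {
        for (j = i; ARRAY[j-1] > ARRAY[j]; --j) {
            temp = ARRAY[j]
            ARRAY[j] = ARRAY[j-1]
            ARRAY[j-1] = temp
        }
    }
    return
}
BEGIN {
    n = split("19 3 11 7", nums, " ")
    sort(nums, n)                 # sorts nums itself, not a copy
    for (i = 1; i <= n; i++)
        printf("%d ", nums[i])    # prints: 3 7 11 19
    printf("\n")
}

Had nums been a scalar, the function would have received a copy and the caller's value would have been untouched.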
Since functions can have return values, it's a good idea to always use a return statement, even when you are not returning a value. This helps make your programs more readable. Here's proof positive, from a run of the revised lotto script:

$ lotto 7 35
Pick 7 of 35
6 7 17 19 24 29 35

In fact, many of the scripts that we developed in this chapter could be turned into functions. For instance, if we only had the original, 1987, version of nawk, we might want to write our own tolower() and toupper() functions.

The value of writing the sort() function in a general fashion is that you can easily reuse it. To demonstrate this, we'll take the above sort function and use it to sort student grades. In the following script, we read all of the student grades into an array and then call sort() to put the grades in ascending order.

# grade.sort.awk -- script for sorting student grades
# input: student name followed by a series of grades
# sort function -- sort numbers in ascending order
function sort(ARRAY, ELEMENTS,    temp, i, j) {
    for (i = 2; i <= ELEMENTS; ++i)
        for (j = i; ARRAY[j-1] > ARRAY[j]; --j) {
            temp = ARRAY[j]
            ARRAY[j] = ARRAY[j-1]
            ARRAY[j-1] = temp
        }
    return
}
# main routine
{
    # loop through fields 2 through NF and assign values to
    # array named grades
    for (i = 2; i <= NF; ++i)
        grades[i-1] = $i
    # call sort function to sort elements
    sort(grades, NF-1)
    # print student name
    printf("%s: ", $1)
    # output loop
    for (j = 1; j <= NF-1; ++j)
        printf("%d ", grades[j])
    printf("\n")
}

Note that the sort routine is identical to the previous version. In this example, once we've sorted the grades we simply output them:

$ awk -f grade.sort.awk grades.test
mona: 70 70 77 83 85 89
john: 78 85 88 91 92 94
andrea: 85 89 90 90 94 95
jasper: 80 82 84 84 88 92
dunce: 60 60 61 62 64 80
ellis: 89 90 92 96 96 98

However, you could, for instance, delete the first element of the sort array if you wanted to average the student grades after dropping the lowest grade. As another exercise, you could write a version of the sort function that takes a third argument indicating an ascending or descending sort; one possible approach is sketched below.
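One way to approach that exercise - our sketch, not code from this chapter - is to choose the comparison in the inner loop based on the third argument:

# sort2 -- insertion sort with a direction argument (a sketch)
# DESCEND is a flag: non-zero sorts high-to-low, zero sorts low-to-high
function sort2(ARRAY, ELEMENTS, DESCEND,    temp, i, j, swap) {
    for (i = 2; i <= ELEMENTS; ++i)
        for (j = i; j > 1; --j) {
            # pick the out-of-order test that matches the requested order
            if (DESCEND)
                swap = (ARRAY[j-1] < ARRAY[j])
            else
                swap = (ARRAY[j-1] > ARRAY[j])
            if (! swap)
                break
            temp = ARRAY[j]
            ARRAY[j] = ARRAY[j-1]
            ARRAY[j-1] = temp
        }
    return
}

A call such as sort2(grades, NF-1, 1) would list each student's grades highest first. (The explicit j > 1 guard replaces the original loop's reliance on comparing against the empty ARRAY[0] to stop.)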
9.3.2 Maintaining a Function Library

You might want to put a useful function in its own file and store it in a central directory. Awk permits multiple uses of the -f option to specify more than one program file.[7] For instance, we could have written the previous example such that the sort function was placed in a separate file from the main program grade.awk. The following command specifies both program files:

[7] The SunOS 4.1.x version of nawk does not support multiple script files. This feature was not in the original 1987 version of nawk either. It was added in 1989 and is now part of POSIX awk.

$ awk -f grade.awk -f /usr/local/share/awk/sort.awk grades.test

This command assumes that grade.awk is in the working directory and that the sort function is defined in sort.awk in the directory /usr/local/share/awk.

NOTE: You cannot put a script on the command line and also use the -f option to specify a filename for a script.

Remember to document functions clearly so that you will understand how they work when you want to reuse them.

9.3.3 Another Sorted Example

Lenny, our production editor, is back with another request.

Dale:
The last section of each Xlib manpage is called "Related Commands" (that is the argument of a .SH) and it's followed by a list of commands (often 10 or 20) that are now in random order. It'd be more useful and professional if they were alphabetized. Currently, commands are separated by a comma after each one except the last, which has a period. The question is: could awk alphabetize these lists? We're talking about a couple of hundred manpages. Again, don't bother if this is a bigger job than it seems to someone who doesn't know what's involved.
Best to you and yours,
Lenny

To see what he is talking about, a simplified version of an Xlib manpage is shown below:

.SH "Name"
XSubImage - create a subimage from part of an image.
.
.
.
.SH "Related Commands"
XDestroyImage, XPutImage, XGetImage,
XCreateImage, XGetSubImage, XAddPixel,
XPutPixel, XGetPixel, ImageByteOrder.

You can see that the names of related commands appear on several lines following the heading. You can also see that they are in no particular order. To sort the list of related commands is actually fairly simple, given that we've already covered sorting. The structure of the program is somewhat interesting, as we must read several lines after matching the "Related Commands" heading.

Looking at the input, it is obvious that the list of related commands is the last section in the file. All other lines except these we want to print as is. The key is to match all lines from the heading "Related Commands" to the end of the file. Our script can consist of four rules, that match:

1. The "Related Commands" heading
2. The lines following that heading
3. All other lines
4. After all lines have been read (END)

Most of the "action" takes place in the END procedure. That's where we sort and output the list of commands. Here's the script:

# sorter.awk -- sort list of related commands
# requires sort.awk as function in separate file
BEGIN { relcmds = 0 }
#1 Match related commands; enable flag
/\.SH "Related Commands"/ {
    print
    relcmds = 1
    next
}
#2 Apply to lines following "Related Commands"
(relcmds == 1) { commandList = commandList $0 }
#3 Print all other lines, as is.
(relcmds == 0) { print }
#4 now sort and output list of commands
END {
    # remove spaces after commas and the final period.
    gsub(/, */, ",", commandList)
    gsub(/\. *$/, "", commandList)
    # split list into array
    sizeOfArray = split(commandList, comArray, ",")
    # sort
    sort(comArray, sizeOfArray)
    # output elements
    for (i = 1; i < sizeOfArray; i++)
        printf("%s,\n", comArray[i])
    printf("%s.\n", comArray[i])
}

Once the "Related Commands" heading is matched, we print that line and then set a flag, the variable relcmds, which indicates that subsequent input lines are to be collected.[8] The second procedure actually collects each line into the variable commandList. The third procedure is executed for all other lines, simply printing them.

[8] The getline function introduced in the next chapter provides a simpler way to control reading input lines.

When all lines of input have been read, the END procedure is executed, and we know that our list of commands is complete. Before splitting up the commands into fields, we remove any number of spaces following a comma. Next we remove the final period and any trailing spaces. Finally, we create the array comArray using the split() function. We pass this array as an argument to the sort() function, and then we print the sorted values. This program generates the following output:

$ awk -f sorter.awk test
.SH "Name"
XSubImage - create a subimage from part of an image.
.SH "Related Commands"
ImageByteOrder,
XAddPixel,
XCreateImage,
XDestroyImage,
XGetImage,
XGetPixel,
XGetSubImage,
XPutImage,
XPutPixel.
Once again, the virtue of calling a function to do the sort versus writing or copying the code to do the same task is that the function is a module that's been tested previously and has a standard interface. That is, you know that it works and you know how it works. When you come upon the same sort code in the awk version, which uses different variable names, you have to scan it to verify that it works the same way as other versions. Even if you were to copy the lines into another program, you would have to make changes to accommodate the new circumstances. With a function, all you need to know is what kind of arguments it expects and their calling sequence. Using a function reduces the chance for error by reducing the complexity of the problem that you are solving.

Because this script presumes that the sort() function exists in a separate file, it must be invoked using the multiple -f options:

$ awk -f sort.awk -f sorter.awk test

where the sort() function is defined in the file sort.awk.

10. The Bottom Drawer

Contents:
The getline Function
The close() Function
The system() Function
A Menu-Based Command Generator
Directing Output to Files and Pipes
Generating Columnar Reports
Debugging
Limitations
Invoking awk Using the #! Syntax

This chapter is proof that not everything has its place. Some things just don't seem to fit, no matter how you organize them. This chapter is a collection of such things. It is tempting to label it "Advanced Topics," as if to explain its organization (or lack thereof), but some readers might feel they need to make more progress before reading it. We have therefore called it "The Bottom Drawer," thinking of the organization of a chest of drawers, with underwear, socks, and other day-to-day things in the top drawers and heavier things that are less frequently used, like sweaters, in the bottom drawers. All of it is equally accessible, but you have to bend over to get things in the bottom drawer. It requires a little more effort to get something, that's all.

In this chapter we cover a number of topics, including the following:

● The getline function
● The system() function
● Directing output to files and pipes
● Debugging awk scripts

10.1 The getline Function

The getline function is used to read another line of input. Not only can getline read from the regular input data stream, it can also handle input from files and pipes. The getline function is similar to awk's next statement. While both cause the next input line to be read, the next statement passes control back to the top of the script. The getline function gets the next line without changing control in the script. Possible return values are:

1    If it was able to read a line.
0    If it encounters the end-of-file.
-1   If it encounters an error.

NOTE: Although getline is called a function and it does return a value, its syntax resembles a statement. Do not write getline(); its syntax does not permit parentheses.

In the previous chapter, we used a manual page source file as an example. The -man macros typically place the text argument on the next line. Although the macro is the pattern that you use to find the line, it is actually the next line that you process. For instance, to extract the name of the command from the manpage, the following example matches the heading "Name," reads the next line, and prints the first field of it:

# getline.awk -- test getline function
/^\.SH "?Name"?/ {
    getline  # get next line
    print $1 # print $1 of new line.
}

The pattern matches any line with ".SH" followed by "Name," which might be enclosed in quotes. Once this line is matched, we use getline to read the next input line. When the new line is read, getline assigns it $0 and parses it into fields. The system variables NF, NR, and FNR are also set. Thus, the new line becomes the current line, and we are able to refer to "$1" and retrieve the first field. Note that the previous line is no longer available as $0. However, if necessary, you can assign the line read by getline to a variable and avoid changing $0, as we'll see shortly. Here's an example that shows how the previous script works, printing out the first field of the line following ".SH Name."

$ awk -f getline.awk test
XSubImage

The sorter.awk program that we demonstrated at the end of Chapter 9, Functions, could have used getline to read all the lines after the heading "Related Commands." We can test the return value of getline in a while loop to read a number of lines from the input. The following procedure replaces the first two procedures in the sorter program:

# Match "Related Commands" and collect them
/^\.SH "?Related Commands"?/ {
    print
    while (getline > 0)
        commandList = commandList $0
}

The expression "getline > 0" will be true as long as getline successfully reads an input line. When it gets to the end-of-file, getline returns 0 and the loop is exited.

10.1.1 Reading Input from Files

Besides reading from the regular input stream, the getline function allows you to read input from a file or a pipe. For instance, the following statement reads the next line from the file data:

getline < "data"

Although the filename can be supplied through a variable, it is typically specified as a string constant, which must be enclosed in quotes. The symbol "<" is the same as the shell's input redirection symbol and will not be interpreted as the "less than" symbol. We can use a while loop to read all the lines from a file, testing for an end-of-file to exit the loop. The following example opens the file data and prints all of its lines:

while ( (getline < "data") > 0 )
    print

(We parenthesize to avoid confusion; the "<" is a redirection, while the ">" is a comparison of the return value.) The input can also come from standard input. You can use getline following a prompt for the user to enter information:

BEGIN {
    printf "Enter your name: "
    getline < "-"
    print
}

This sample code prints the prompt "Enter your name:" (printf is used because we don't want a carriage return after the prompt), and then calls getline to gather the user's response.[1] The response is assigned to $0, and the print statement outputs that value.

[1] At least at one time, SGI versions of nawk did not support the use of "-" with getline to read from standard input. Caveat emptor.

10.1.2 Assigning the Input to a Variable

The getline function allows you to assign the input record to a variable. The name of the variable is supplied as an argument. Thus, the following statement reads the next line of input into the variable input:

getline input

Assigning the input to a variable does not affect the current input line; that is, $0 is not affected. The new input line is not split into fields, and thus the variable NF is also unaffected. It does increment the record counters, NR and FNR. The previous example demonstrated how to prompt the user. That example could be written as follows, assigning the user's response to the variable name.
BEGIN {
    printf "Enter your name: "
    getline name < "-"
    print name
}

Study the syntax for assigning the input data to a variable because it is a common mistake to instead write:

name = getline   # wrong

which assigns the return value of getline to the variable name.

10.1.3 Reading Input from a Pipe

You can execute a command and pipe the output into getline. For example, look at the following expression:

"who am i" | getline

That expression sets "$0" to the output of the who am i command.

dale       ttyC3        Jul 18 13:37

The line is parsed into fields and the system variable NF is set. Similarly, you can assign the result to a variable:

"who am i" | getline me

By assigning the output to a variable, you avoid setting $0 and NF, but the line is not split into fields.

The following script is a fairly simple example of piping the output of a command to getline. It uses the output from the who am i command to get the user's name. It then looks up the name in /etc/passwd, printing out the fifth field of that file, the user's full name:

awk '# getname - print users fullname from /etc/passwd
BEGIN { "who am i" | getline
        name = $1
        FS = ":"
}
name ~ $1 { print $5 }
' /etc/passwd

The command is executed from the BEGIN procedure, and it provides us with the name of the user that will be used to find the user's entry in /etc/passwd. As explained above, who am i outputs a single line, which getline assigns to $0. $1, the first field of that output, is then assigned to name. The field separator is set to a colon (:) to allow us to access individual fields in entries in the /etc/passwd file. Notice that FS is set after getline or else the parsing of the command's output would be affected. Finally, the main procedure is designed to test that the first field matches name. If it does, the fifth field of the entry is printed. For instance, when Dale runs this script, it prints "Dale Dougherty."

When the output of a command is piped to getline and it contains multiple lines, getline reads a line at a time. The first time getline is called it reads the first line of output. If you call it again, it reads the second line. To read all the lines of output, you must set up a loop that executes getline until there is no more output. For instance, the following example uses a while loop to read each line of output and assign it to the next element of the array, who_out:

while ("who" | getline)
    who_out[++i] = $0

Each time the getline function is called, it reads the next line of output. The who command, however, is executed only once.

The next example looks for "@date" in a document and replaces it with today's date:

# subdate.awk -- replace @date with todays date
/@date/ {
    "date +'%a., %h %d, %Y'" | getline today
    gsub(/@date/, today)
}
{ print }

The date command, using its formatting options,[2] provides the date and getline assigns it to the variable today. The gsub() function replaces each instance of "@date" with today's date.

[2] Older versions of date don't support formatting options. Particularly the one on SunOS 4.1.x systems; there you have to use /usr/5bin/date. Check your local documentation.

This script might be used to insert the date in a form letter:

To: Peabody
From: Sherman
Date: @date
I am writing you on @date to remind you
about our special offer.

All lines of the input file would be passed through as is, except the lines containing "@date", which are replaced with today's date:

$ awk -f subdate.awk subdate.test
To: Peabody
From: Sherman
Date: Sun., May 05, 1996
I am writing you on Sun., May 05, 1996 to remind you
about our special offer.
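If you can count on gawk, you can sidestep the external date command and its version-to-version differences entirely: gawk has a built-in strftime() function that formats the current date and time. Here is the same script rewritten as a gawk-only sketch (note %b in place of %h; both print the abbreviated month name, but %b is the more widely supported of the two):

# subdate.gawk -- gawk-only variant of subdate.awk (a sketch)
/@date/ { gsub(/@date/, strftime("%a., %b %d, %Y")) }
{ print }

Because no subprocess is spawned, there is also no pipe to manage or close.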
10.2 The close() Function

The close() function allows you to close open files and pipes. There are a number of reasons you should use it.

● You can only have so many pipes open at a time. (See the section "Limitations" below, which describes how such limitations can differ from system to system.) In order to open as many pipes in a program as you wish, you must use the close() function to close a pipe when you are done with it (ordinarily, when getline returns 0 or -1). It takes a single argument, the same expression used to create the pipe. Here's an example:

close("who")

● Closing a pipe allows you to run the same command twice. For example, you can use date twice to time a command.

● Using close() may be necessary in order to get an output pipe to finish its work. For example:

{
    some processing of $0 | "sort > tmpfile"
}
END {
    close("sort > tmpfile")
    while ((getline < "tmpfile") > 0) {
        do more work
    }
}

● Closing open files is necessary to keep you from exceeding your system's limit on simultaneously open files.

We will see an example of the close() function in the section "Working with Multiple Files" later in this chapter.

10.3 The system() Function

The system() function executes a command supplied as an expression.[3] It does not, however, make the output of the command available within the program for processing. It returns the exit status of the command that was executed. The script waits for the command to finish before continuing execution. The following example executes the mkdir command:

[3] The system() function is modeled after the standard C library function of the same name.

BEGIN {
    if (system("mkdir dale") != 0)
        print "Command Failed"
}

The system() function is called from an if statement that tests for a non-zero exit status. Running the program twice produces one success and one failure:

$ awk -f system.awk
$ ls
dale
$ awk -f system.awk
mkdir: dale: File exists
Command Failed

The first run creates the new directory and system() returns an exit status of 0 (success). The second time the command is executed, the directory already exists, so mkdir fails and produces an error message. The "Command Failed" message is produced by awk.

The Berkeley UNIX command set has a small but useful command for troff users named soelim, named because it "eliminates" ".so" lines from a troff input file. (.so is a request to include or "source" the contents of the named file.) If you have an older System V system that does not have soelim, you can use the following awk script to create it:

/^\.so/ {
    gsub(/"/, "", $2)
    system("cat " $2)
    next
}
{ print }

This script looks for ".so" at the beginning of a line, removes any quotation marks, and then uses system() to execute the cat command and output the contents of the file. This output merges with the rest of the lines in the file, which are simply printed to standard output, as in the following example.

$ cat soelim.test
This is a test
.so test1
This is a test
.so test2
This is a test.

$ awk -f soelim.awk soelim.test
This is a test
first:second
one:two
This is a test
three:four
five:six
This is a test.

We don't explicitly test the exit status of the command. Thus, if the file does not exist, the error messages merge with the output:

$ awk -f soelim.awk soelim.test
This is a test
first:second
one:two
This is a test
cat: cannot open test2
This is a test.

We might want to test the return value of the system() function and generate an error message for the user; one way to do that is sketched below.
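Here is a sketch of ours, not a program from this chapter, wrapping the system() call in a test and sending the complaint to /dev/tty so it stands apart from the merged output:

# soelim.awk variant -- check cat's exit status (a sketch)
/^\.so/ {
    gsub(/"/, "", $2)
    if (system("cat " $2) != 0)
        print "** could not include " $2 > "/dev/tty"
    next
}
{ print }

(cat's own error message still appears on standard error; the extra line simply makes the failure hard to miss.)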
This program is also very simplistic: it does not handle instances of ".so" nested in the included file. Think about how you might implement a version of this program that did handle nested ".so" requests.

This example is a function prompting you to enter a filename. It uses the system() function to execute the test command to verify the file exists and is readable:

# getFilename function -- prompts user for filename,
# verifies that file exists and returns absolute pathname.
function getFilename(    file) {
    while (! file) {
        printf "Enter a filename: "
        getline < "-"  # get response
        file = $0
        # check that file exists and is readable
        # test returns 1 if file does not exist.
        if (system("test -r " file)) {
            print file " not found"
            file = ""
        }
    }
    if (file !~ /^\//) {
        "pwd" | getline  # get current directory
        close("pwd")
        file = $0 "/" file
    }
    return file
}

This function returns the absolute pathname of the file specified by the user. It places the prompting and verification sequence inside a while loop in order to allow the user to make a different entry if the previous one is invalid. The test -r command returns 0 if the file exists and is readable, and 1 if not. Once it is determined that the filename is valid, then we test the filename to see if it begins with a "/", which would indicate that the user supplied an absolute pathname. If that test fails, we use the getline function to get the output of the pwd command and prepend it to the filename. (Admittedly, the script makes no attempt to deal with "./" or "../" entries, although tests can be easily devised to match them.) Note the two uses of the getline function: the first gets the user's response and the second executes the pwd command.

10.4 A Menu-Based Command Generator

In this section, we look at a general use of the system() and getline functions to implement a menu-based command generator. The object of this program is to give unsophisticated users a simple way to execute long or complex UNIX commands. A menu is used to prompt the user with a description of the task to be performed, allowing the user to choose by number any selection of the menu to execute.

This program is designed as a kind of interpreter that reads from a file the descriptions that appear in the menu and the actual command lines that are executed. That way, multiple menu-command files can be used, and they can be easily modified by awk-less users without changing the program. The format of a menu-command file contains the menu title as the first line in the file. Subsequent lines contain two fields: the first is the description of the action to be performed and the second is the command line that performs it.
An example is shown below:

$ cat uucp_commands
UUCP Status Menu
Look at files in PUBDIR:find /var/spool/uucppublic -print
Look at recent status in LOGFILE:tail /var/spool/uucp/LOGFILE
Look for lock files:ls /var/spool/uucp/*.LCK

The first step in implementing the menu-based command generator is to read the menu-command file. We read the first line of this file and assign it to a variable named title. The rest of the lines contain two fields and are read into two arrays, one for the menu items and one for the commands to be executed. A while loop is used, along with getline, to read one line at a time from the file.

BEGIN { FS = ":"
    if ((getline < CMDFILE) > 0)
        title = $1
    else
        exit 1
    while ((getline < CMDFILE) > 0) {
        # load array
        ++sizeOfArray
        # array of menu items
        menu[sizeOfArray] = $1
        # array of commands associated with items
        command[sizeOfArray] = $2
    }
    ...
}

Look carefully at the syntax of the expression tested by the if statement and the while loop.

(getline < CMDFILE) > 0

The variable CMDFILE is the name of the menu-command file, which is passed as a command-line parameter. The two angle-bracket symbols have completely different functions. The "<" symbol is interpreted by getline as the input redirection operator. Then the value returned by getline is tested to see if it is greater than (">") 0. It is parenthesized on purpose, in order to make this clear. In other words, "getline < CMDFILE" is evaluated first and then its return value is compared to 0.

This procedure is placed in the BEGIN pattern. However, there is one catch. Because we intended to pass the name of the menu file as a command-line parameter, the variable CMDFILE would not normally be defined and available in the BEGIN pattern. In other words, the following command will not work:

awk script CMDFILE="uucp_commands" -

because the CMDFILE variable won't be defined until the first line of input is read. Fortunately, awk provides the -v option to handle just such a case. Using the -v option makes sure that the variable is set immediately and thus available in the BEGIN pattern.

awk -v CMDFILE="uucp_commands" script

If your version of awk doesn't have the -v option, you can pass the value of CMDFILE as a shell variable. Create a shell script to execute awk and in it define CMDFILE. Then change the line that reads CMDFILE in the invoke script (see below) as follows:

while ((getline < '"$CMDFILE"') > 0 ) {

Once the menu-command file is loaded, the program must display the menu and prompt the user. This is implemented as a function because we need to call it in two places: from the BEGIN pattern to prompt the user initially, and after we have processed the user's response so that another choice can be made. Here's the display_menu() function:

function display_menu() {
    # clear screen -- comment out if clear does not work
    system("clear")
    # print title, list of items, exit item, and prompt
    print "\t" title
    for (i = 1; i <= sizeOfArray; ++i)
        printf "\t%d. %s\n", i, menu[i]
    printf "\t%d. Exit\n", i
    printf("Choose one: ")
}

The first thing we do is use the system() function to call a command to clear the screen. (On my system, clear does this; on others it may be cls or some other command. Comment out the line if you cannot find such a command; a more forgiving variant is sketched at the end of this section.) Then we print the title and each of the items in a numbered list. The last item is always "Exit." Finally, we prompt the user for a choice. The program will take standard input so that the user's answer to the prompt will be the first line of input.
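Rather than commenting the line out entirely, display_menu() could degrade gracefully. The following clear_screen() function is our sketch, not part of the original program: it checks the exit status of system() and, if clear fails, scrolls the old screen contents out of view instead:

# clear_screen -- use clear if available, blank lines otherwise (a sketch)
function clear_screen(    i) {
    if (system("clear") != 0)
        for (i = 1; i <= 24; i++)
            print ""
}

display_menu() would then call clear_screen() in place of system("clear").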
Our reading of the menu-command file was done within the program and not as part of the input stream. Thus, the main procedure of the program is to respond to the user's choice and execute a command. Here's that part of the program:

# Applies the user response to prompt
{
    # test value of user response
    if ($1 > 0 && $1 <= sizeOfArray) {
        # print command that is executed
        printf("Executing ... %s\n", command[$1])
        # then execute it.
        system(command[$1])
        printf("<Press RETURN to continue>")
        # wait for input before displaying menu again
        getline
    }
    else
        exit
    # re-display menu
    display_menu()
}

First, we test the range of the user's response. If the response falls outside the range, we simply exit the program. If it is a valid response, then we retrieve the command from the array command, display it, and then execute it using the system() function. The user sees the result of the command on the screen, followed by the message "<Press RETURN to continue>". The purpose of this message is to wait for the user to finish before clearing the screen and redisplaying the menu. The getline function causes the program to wait for a response. Note that we don't do anything with the response. The display_menu() function is called at the end of this procedure to redisplay the menu and prompt for another line of input. Here's the invoke program in full:

awk -v CMDFILE="uucp_commands" '# invoke -- menu-based
# command generator
# first line in CMDFILE is the title of the menu
# subsequent lines contain: $1 - Description;
# $2 Command to execute

BEGIN { FS = ":"
    # process CMDFILE, reading items into menu array
    if ((getline < CMDFILE) > 0)
        title = $1
    else
        exit 1
    while ((getline < CMDFILE) > 0) {
        # load array
        ++sizeOfArray
        # array of menu items
        menu[sizeOfArray] = $1
        # array of commands associated with items
        command[sizeOfArray] = $2
    }
    # call function to display menu items and prompt
    display_menu()
}

# Applies the user response to prompt
{
    # test value of user response
    if ($1 > 0 && $1 <= sizeOfArray) {
        # print command that is executed
        printf("Executing ... %s\n", command[$1])
        # then execute it.
        system(command[$1])
        printf("<Press RETURN to continue>")
        # wait for input before displaying menu again
        getline
    }
    else
        exit
    # re-display menu
    display_menu()
}

function display_menu() {
    # clear screen -- if clear does not work, try "cls"
    system("clear")
    # print title, list of items, exit item, and prompt
    print "\t" title
    for (i = 1; i <= sizeOfArray; ++i)
        printf "\t%d. %s\n", i, menu[i]
    printf "\t%d. Exit\n", i
    printf("Choose one: ")
}' -

When a user runs the program, the following output is displayed:

        UUCP Status Menu
        1. Look at files in PUBDIR
        2. Look at recent status in LOGFILE
        3. Look for lock files
        4. Exit
Choose one:

The user is prompted to enter the number of a menu selection. Anything other than a number between 1 and 3 exits the menu. For instance, if the user enters "1" to see a list of files in uucp's public directory, then the following result is displayed on the screen:

Executing ... find /var/spool/uucppublic -print
/var/spool/uucppublic
/var/spool/uucppublic/dale
/var/spool/uucppublic/HyperBugs
<Press RETURN to continue>

When the user presses the RETURN key, the menu is redisplayed on the screen. The user can quit from the program by choosing "4". This program is really a shell for executing commands. Any sequence of commands (even other awk programs) can be executed by modifying the menu-command file. In other words, the part of the program that might change the most is extracted from the program itself and maintained in a separate file.
This allows the menu list to be changed and extended very easily by a nontechnical user.

10.5 Directing Output to Files and Pipes

The output of any print or printf statement can be directed to a file, using the output redirection operators ">" or ">>". For example, the following statement writes the current record to the file data.out:

print > "data.out"

The filename can be any expression that evaluates to a valid filename. A file is opened by the first use of the redirection operator, and subsequent uses append data to the file. The difference between ">" and ">>" is the same as between the shell redirection operators. A right-angle bracket (">") truncates the file when opening it while ">>" preserves whatever the file contains and appends data to it.

Because the redirection operator ">" is the same as the relational operator, there is the potential for confusion when you specify an expression as an argument to the print command. The rule is that ">" will be interpreted as a redirection operator when it appears in an argument list for any of the print statements. To use ">" as a relational operator in an expression that appears in the argument list, put either the expression or the argument list in parentheses. For example, the following example uses parentheses around the conditional expression to make sure that the relational expression is evaluated properly:

print "a =", a, "b =", b, "max =", (a > b ? a : b) > "data.out"

The conditional expression evaluates whether a is greater than b; if it is, then the value of a is printed as the maximum value; otherwise, b's value is used.

10.5.1 Directing Output to a Pipe

You can also direct output to a pipe. The command

print | command

opens a pipe the first time it is executed and sends the current record as input to that command. In other words, the command is only invoked once, but each execution of the print command supplies another line of input. The following script strips troff macros and requests from the current input line and then sends the line as input to wc to determine how many words are in the file:

{ # words.awk - strip macros then get word count
sub(/^\.../,"")
print | "wc -w"
}

By removing formatting codes, we get a truer word count. In most cases, we prefer to use a shell script to pipe the output of the awk command to another command rather than do it inside the awk script. For instance, we'd write the previous example as a shell script invoking awk and piping its output to wc:

awk '{ # words -- strip macros
sub(/^\.../,"")
print
}' $* |
# get word count
wc -w

This method seems simpler and easier to understand. Nonetheless, the other method has the advantage of accomplishing the same thing without creating a shell script. Remember that you can only have so many pipes open at a time. Use the close() function to close the pipe when you are done with it.

10.5.2 Working with Multiple Files

A file is opened whenever you read from or write to a file. Every operating system has some limit on the number of files a running program may have open. Furthermore, each implementation of awk may have an internal limit on the number of open files; this number could be smaller than the system's limit.[4] So that you don't run out of open files, awk provides a close() function that allows you to close an open file. Closing files that you have finished processing allows your program to open more files later on.
[4] Gawk will attempt to appear to have more files open than the system limit by closing and reopening files as needed. Even though gawk is "smart," it is still more efficient to close your files when you're done with them.

A common use for directing output to files is to split up a large file into a number of smaller files. Although UNIX provides utilities, split and csplit, that do a similar job, they do not have the ability to give the new file a useful filename. Similarly, sed can be used to write to a file, but you must specify a fixed filename. With awk, you can use a variable to specify the filename and pick up the value from a pattern in the file. For instance, if $1 provided a string that could be used as a filename, you could write a script to output each record to its own file:

print $0 > $1

You should perhaps test the filename, either to determine its length or to look for characters that cannot be used in a filename. If you don't close your files, such a program would eventually run out of available open files, and have to give up. (A one-rule version that avoids this appears at the end of this section.) The example we are going to look at works because it uses the close() function so that you will not run into any open-file limitations.

The following script was used to split up a large file containing dozens of manpages. Each manual page began by setting a number register and ended with a blank line:

.nr X 0

(Although they used the -man macros for the most part, the beginning of a manpage was strangely coded, making things a little harder.) The line that provides the filename looks like this:

.if \nX=0 .ds x} XDrawLine "" "Xlib - Drawing Primitives"

The fifth field on this line, "XDrawLine," contains the filename. Perhaps the only difficulty in writing the script is that the first line is not the one that provides the filename. Therefore, we collect the lines in an array until we get a filename. Once we get the filename, we output the array, and from that point on we simply write each input line to the new file. Here's the man.split script:

# man.split -- split up a file containing X manpages.
BEGIN { file = 0; i = 0; filename = "" }
# First line of new manpage is ".nr X 0"
# Last line is blank
/^\.nr X 0/,/^$/ {
    # this conditional collects lines until we get a filename.
    if (file == 0)
        line[++i] = $0
    else
        print $0 > filename
    # this matches the line that gives us the filename
    if ($4 == "x}") {
        # now we have a filename
        filename = $5
        file = 1
        # output name to screen
        print filename
        # print any lines collected
        for (x = 1; x <= i; ++x) {
            print line[x] > filename
        }
        i = 0
    }
    # close up and clean up for next one
    if ($0 ~ /^$/) {
        close(filename)
        filename = ""
        file = 0
        i = 0
    }
}

As you can see, we use the variable file as a flag to convey whether or not we have a valid filename and can write to the file. Initially, file is 0, and the current input line is stored in an array. The variable i is a counter used to index the array. When we encounter the line that sets the filename, then we set file to 1. The name of the new file is printed to the screen so that the user can get some feedback on the progress of the script. Then we loop through the array and output it to the new file. When the next input line is read, file will be set to 1 and the print statement will output it to the named file.
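The same close() discipline pays off in the simpler record-per-file case mentioned at the start of this section. Here is a one-rule sketch (ours, not from the manpage example) that files every input line under the name in its first field while holding at most one output file open:

# filesplit.awk -- append each record to the file named by $1 (a sketch)
# ">>" matters here: after close(), a later ">" would truncate the file
# and lose the records already written to it.
$1 != "" {
    print $0 >> $1
    close($1)
}

Because of the append, rerunning the script adds to files left over from a previous run; remove them first if that matters.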
10.6 Generating Columnar Reports

This section describes a small-scale business application that produces reports with dollar amounts. While this application doesn't introduce any new material, it does emphasize the data processing and reporting capabilities of awk. (Surprisingly, some people do use awk to write small business applications.)

It is presumed that a script exists for data entry. The data-entry script has two jobs: the first is to enter the customer's name and mailing address for later use in building a mailing list; the second is to record the customer's order of any of seven items, the number of items ordered, and the price per item. The data collected for the mailing list and the customer order were written to separate files. Here are two sample customer records from the customer order file:

Charlotte Webb
P.O N61331 97 Y 045    Date: 03/14/97
#1 3 7.50
#2 3 7.50
#3 1 7.50
#4 1 7.50
#7 1 7.50

Martin S. Rossi
P.O NONE    Date: 03/14/97
#1 2 7.50
#2 5 6.75

Each order covers multiple lines, and a blank line separates one order from another. The first two lines supply the customer's name, purchase order number and the date of the order. Each subsequent line identifies an item by number, the number ordered, and the price of the item.

Let's write a simple program, addem, that multiplies the number of items by the price. The script can ignore the first two lines of each record. We only want to read the lines where an item is specified, as in the following example.

awk '/^#/ {
    amount = $2 * $3
    printf "%s %6.2f\n", $0, amount
    next
}
{ print }' $*

The main procedure only affects lines that match the pattern. It multiplies the second field by the third field, assigning the value to the variable amount. The printf conversion %f is used to print a floating-point number; "6.2" specifies a minimum field width of six and a precision of two. Precision is the number of digits to the right of the decimal point; the default for %f is six. We print the current record along with the value of the variable amount. If a line is printed within this procedure, the next line is read from standard input. Lines not matching the pattern are simply passed through. Let's look at how addem works:

$ addem orders
Charlotte Webb
P.O N61331 97 Y 045    Date: 03/14/97
#1 3 7.50  22.50
#2 3 7.50  22.50
#3 1 7.50   7.50
#4 1 7.50   7.50
#7 1 7.50   7.50

Martin S. Rossi
P.O NONE    Date: 03/14/97
#1 2 7.50  15.00
#2 5 6.75  33.75

This program did not need to access the customer record as a whole; it simply acted on the individual item lines. Now, let's design a program that reads multiline records and accumulates order information for display in a report. This report should display for each item the total number of copies and the total amount. We also want totals reflecting all copies ordered and the sum of all orders.

Our new script will begin by setting the field and record separators:

BEGIN { FS = "\n"; RS = "" }

Each record has a variable number of fields, depending upon how many items have been ordered. First, we check that the input record has at least three fields. Then a for loop is built to read all of the fields beginning with the third field.

NF >= 3 {
    for (i = 3; i <= NF; ++i) {

In database terms, each field has a value and each value can be further broken up as subvalues. That is, if the value of a field in a multiline record is a single line, subvalues are the words that are on that line. We can use the split() function to divide a field into subvalues. The following part of the script splits each field into subvalues.
$i will supply the value of the current field that will be divided into elements of the array order:

sv = split($i, order, " ")
if (sv == 3) {
    procedure
}
else
    print "Incomplete Record"
} # end for loop

The number of elements returned by the function is saved in a variable sv. This allows us to test that there are three subvalues. If there are not, the else statement is executed, printing the error message to the screen.

Next we assign each individual element of the array to a specific variable. This is mainly to make it easier to remember what each element represents:

title = order[1]
copies = order[2]
price = order[3]

Then we perform a group of arithmetic operations on these values:

amount = copies * price
total_vol += copies
total_amt += amount
vol[title] += copies
amt[title] += amount

We accumulate these values until the last input record is read. The END procedure prints the report. Here's the complete program:

$ cat addemup
#! /bin/sh
# addemup -- total customer orders
awk 'BEGIN { FS = "\n"; RS = "" }
NF >= 3 {
    for (i = 3; i <= NF; ++i) {
        sv = split($i, order, " ")
        if (sv == 3) {
            title = order[1]
            copies = order[2]
            price = order[3]
            amount = copies * price
            total_vol += copies
            total_amt += amount
            vol[title] += copies
            amt[title] += amount
        }
        else
            print "Incomplete Record"
    }
}
END {
    printf "%5s\t%10s\t%6s\n\n", "TITLE", "COPIES SOLD", "TOTAL"
    for (title in vol)
        printf "%5s\t%10d\t$%7.2f\n", title, vol[title], amt[title]
    printf "%s\n", "-------------"
    printf "\t%s%4d\t$%7.2f\n", "Total ", total_vol, total_amt
}' $*

We have defined two arrays that have the same subscript. We only need to have one for loop to read both arrays. addemup, an order report generator, produces the following output:

$ addemup orders
TITLE   COPIES SOLD     TOTAL

   #1             5    $  37.50
   #2             8    $  56.25
   #3             1    $   7.50
   #4             1    $   7.50
   #7             1    $   7.50
-------------
        Total   16     $ 116.25

10.7 Debugging

No aspect of programming is more frustrating or more essential than debugging. In this section, we'll look at ways to debug awk scripts and offer advice on how to correct an awk program that fails to do what it is supposed to do. Modern versions of awk do a pretty good job of reporting syntax errors. But even with good error detection, it is often difficult to isolate the problem. The techniques for discovering the source of the problem are a modest few and are fairly obvious. Unfortunately, most awk implementations come with no debugging tools or extensions.

There are two classes of problems with a program. The first is really a bug in the program's logic. The program runs - that is, it finishes without reporting any error messages, but it does not produce the result you wanted. For instance, perhaps it does not create any output. This bug could be caused by failing to use a print statement to output the result of a calculation. Program errors are mental errors, if you will. The second class of error is one in which the program fails to execute or complete execution. This could result from a syntax error and cause awk to spit code at you that it is unable to interpret. Many syntax errors are the result of a typo or a missing brace or parenthesis. Syntax errors usually generate error messages that help direct you to the problem. Sometimes, however, a program may cause awk to fail (or "core dump") without producing any reasonable error message.[5] This may also be caused by a syntax error, but there could be problems specific to the machine.
We have had a few larger scripts that dumped core on one machine while they ran without a problem on another. You could, for instance, be running up against limitations set for awk for that particular implementation. See the section "Limitations," later in this chapter.

[5] This indicates that the awk implementation is poor. Core dumps are very rare in modern versions of awk.

You should be clear in your mind which type of program bug you are trying to find: an error in the script's logic or an error in its syntax.

10.7.1 Make a Copy

Before you begin debugging a program, make a copy of it. This is extremely important. To debug an awk script, you have to change it. These modifications may point you to the error but many changes will have no effect or may introduce new problems. It's good to be able to restore changes that you make. However, it is bothersome to restore each change that you make, so I like to continue making changes until I have found the problem. When I know what it is, I go back to the original and make the change. In effect, that restores all the other inconsequential changes that were made in the copy.

It is also helpful to view the process of creating a program as a series of stages. Look at a core set of features as a single stage. Once you have implemented these features and tested them, make a copy of the program before going to the next stage to develop new features. That way, you can always return to the previous stage if you have problems with the code that you add. We would recommend that you formalize this process, and go so far as to use a source code management system, such as SCCS (Source Code Control System), RCS (Revision Control System), or CVS (Concurrent Versioning System, which is compatible with RCS). The latter two are freely available from any GNU FTP mirror site.

10.7.2 Before and After Photos

What is difficult in debugging awk is that you don't always know what is happening during the course of the program. You can inspect the input and the output, but there is no way to stop the program in mid-course and examine its state. Thus, it is difficult to know which part of the program is causing a problem.

A common problem is determining when or where in the program the assignment of a variable takes place. The first method of attack is to use the print statement to print the value of the variable at various points in the program. For instance, it is common to use a variable as a flag to determine that a certain condition has occurred. At the beginning of the program, the flag might be set to 0. At one or more points in the program, the value of this flag might be set to 1. The problem is to find where the change actually occurs. If you want to check the flag at a particular part of the program, use print statements before and after the assignment. For instance:

print flag, "before"
if (! $1) {
    .
    .
    .
    flag = 1
}
print flag, "after"

If you are unsure about the result of a substitution command or any function, print the string before and after the function is called:

print $2
sub(/ *\(/, "(", $2)
print $2

The value of printing the value before the substitution command is to make sure that the command sees the value that you think should be there. A previous command might have changed that variable. The problem may turn out to be that the format of the input record is not as you thought. Checking the input carefully is a very important step in debugging. In particular, use print statements to verify that the sequence of fields is as you expect.
When you find that input is causing the problem, you can either fix the input or write new code to accommodate it.

10.7.3 Finding Out Where the Problem Is

The more modular a script is - that is, the more it can be broken down into separate parts - the easier it is to test and debug the program. One of the advantages of writing functions is that you can isolate what is going on inside the function and test it without affecting other parts of the program. You can omit an entire action and see what happens.

If a program has a number of branching constructs, you might find that an input line falls through one of the branches. Test that the input reaches part of a program. For instance, when debugging the masterindex program, described in Chapter 12, Full-Featured Applications, we wanted to know if an entry containing the word "retrieving" was being handled in a particular part of the program. We inserted the following line in the part of the program where we thought it should be encountered:

if ($0 ~ /retrieving/)
    print ">> retrieving" > "/dev/tty"

When the program runs, if it encounters the string "retrieving," it will print the message. (">>" is used as a pair of characters that will instantly call attention to the output; "!!" is also a good one.)

Sometimes you might not be sure which of several print statements are causing a problem. Insert identifiers into the print statement that will alert you to the print statement being executed. In the following example, we simply use the variable name to identify what is printed with a label:

if (PRIMARY)
    print (">>PRIMARY:", PRIMARY)
else
    if (SECONDARY)
        print (">>SECONDARY:", SECONDARY)
    else
        print (">>TERTIARY:", TERTIARY)

This technique is also useful for investigating whether or not parts of the program are executed at all. Some programs get to be like remodeled homes: a room is added here, a wall is taken down there. Trying to understand the basic structure can be difficult. You might wonder if each of the parts is truly needed or indeed if it is ever executed at all.

If an awk program is part of a pipeline of several programs, even other awk programs, you can use the tee command to redirect output to a file, while also piping the output to the next command. For instance, look at the shell script for running the masterindex program, as shown in Chapter 12:

$INDEXDIR/input.idx $FILES |
sort -bdf -t: +0 -1 +1 -2 +3 -4 +2n -3n | uniq |
$INDEXDIR/pagenums.idx | tee page.tmp |
$INDEXDIR/combine.idx |
$INDEXDIR/format.idx

By adding "tee page.tmp", we are able to capture the output of the pagenums.idx program in a file named page.tmp. The same output is also piped to combine.idx.

10.7.4 Commenting Out Loud

Another technique is simply commenting out a series of lines that may be causing problems to see whether they really are. We recommend developing a consistent two-character symbol such as "#%" to comment out lines temporarily. Then you will notice them on subsequent editing and remember to deal with them. It also becomes easier to remove the symbols and restore the lines with a single editing command that does not affect program comments:

#% if ( thisFails )
    print "I give up"

Using the comment here eliminates the conditional, so the print statement is executed unconditionally.

10.7.5 Slash and Burn

When all else fails, arm yourself with your editor's delete command and begin deleting portions of the program until the error disappears. Of course, make a copy of the program and delete lines from the temporary copy.
10.7.5 Slash and Burn

When all else fails, arm yourself with your editor's delete command and begin deleting portions of the program until the error disappears. Of course, make a copy of the program and delete lines from the temporary copy. This is a very crude technique, but an effective one to use before giving up altogether or starting over from scratch. It is sometimes the only way to discover what is wrong when the only result you get is that the program dumps core. The idea is the same as above: to isolate the problem code. Remove a function, for instance, or a for loop to see if it is the cause of the problem. Be sure to cut out complete units: for instance, all the statements within braces and the matching braces. If the problem persists - the program continues to break - then cut out another large section of the program. Sooner or later, you will find the part that is causing the problem.

You can also use "slash and burn" to learn how a program works. First, run the original program on sample input, saving the output. Begin by removing a part of the program that you don't understand. Then run the modified program on sample input and compare the output to the original. Look to see what changed.

10.7.6 Getting Defensive About Your Script

There are all types of input errors and inconsistencies that will turn up bugs in your script. You may not have considered them when writing the script, but users will point to them as problems with your program. Therefore, it is a good idea to surround your core program with "defensive" procedures designed to trap inconsistent input records and prevent the program from failing unexpectedly. For instance, you might want to verify each input record before processing it, making sure that the proper number of fields exist or that the kind of data that you expect is found in a particular field.

Another aspect of incorporating defensive techniques is error handling. In other words, what do you want to have happen once the program detects an error? While in some cases you can have the program continue, in other cases it may be preferable that the program print an error message and/or halt.

It is also appropriate to recognize that awk scripts are typically confined to the realm of quick fixes, programs that solve a particular problem rather than solving a class of problems encountered by many different users. Because of the nature of these programs, it is not really necessary that they be professional quality. Thus, it is not necessary to write 100% user-proof programs. For one thing, defensive programming is quite time-consuming and frequently tedious. Secondly, as amateurs, we are at liberty to write programs that perform the way we expect them to; a professional has to write for an audience and must account for their expectations. In brief, if you are writing the script for others to use, consider how it may be used and what problems its users may encounter before considering the program complete. If not, maybe the fact that the script works - even for a very narrow set of circumstances - is good enough and all there is time for.
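As a sketch of such a validation procedure - ours, with an assumed record format of five colon-separated fields - you might report and skip any record that doesn't match expectations:

    BEGIN { FS = ":" }
    # defensive check: reject records with the wrong number of fields
    NF != 5 {
        printf("ERROR: record %d has %d fields, expected 5\n", NR, NF) > "/dev/tty"
        next
    }

Records that fail the test are reported to the terminal and skipped, and the rest of the script can then assume well-formed input.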
10.8 Limitations

There are fixed limits within any awk implementation. The only trouble is that the documentation seldom reports them. Table 10.1 lists the limitations as described in The AWK Programming Language. These limitations are implementation-specific but they are good ballpark figures for most systems.

    Table 10.1: Limitations
    Item                             Limit
    Number of fields per record      100
    Characters per input record      3000
    Characters per output record     3000
    Characters per field             1024
    Characters per printf string     3000
    Characters in literal string     400
    Characters in character class    400
    Files open                       15
    Pipes open                       1

NOTE: Despite the number in Table 10.1, experience has shown that most awks allow you to have more than one open pipe.

In terms of numeric values, awk uses double-precision, floating-point numbers that are limited in size by the machine's architecture.

Running into these limits can cause unanticipated problems with scripts. In developing examples for the first edition of this book, Dale thought he'd write a search program that could look for a word or sequence of words in a single paragraph. The idea was to read a document as a series of multiline records and if any of the fields contained the search term, print the record, which was a paragraph. It could be used to search through mail files where blank lines delimit paragraphs. The resulting program worked for small test files. However, when tried on larger files, the program dumped core because it encountered a paragraph that was longer than the maximum input record size, which is 3000 characters. (Actually, the file contained an included mail message where blank lines within the message were prefixed by ">".) Thus, when reading multiple lines as a single record, you had better be sure that you won't encounter records longer than 3000 characters. By the way, there is no particular error message that alerts you to the fact that the problem is the size of the current record.

Fortunately, gawk and mawk (see Chapter 11, A Flock of awks) don't have such small limits; for example, the number of fields in a record is limited in gawk to the maximum value that can be held in a C long, and certainly records can be longer than 3000 characters. These versions also allow you to have more open files and pipes. Recent versions of the Bell Labs awk have two options, -mf N and -mr N, that allow you to set the maximum number of fields and the maximum record size on the command line, as an emergency way to get around the default limits.

(Sed implementations also have their own limits, which aren't documented. Experience has shown that most UNIX versions of sed have a limit of 99 or 100 substitute (s) commands.)
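For example, to raise the Bell Labs awk's limits for a single run, the invocation might look like this (the numbers and filenames here are arbitrary):

    awk -mr 10000 -mf 200 -f program.awk datafile

Check your awk's manpage for the exact form of these options; they are not part of POSIX awk.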
10.9 Invoking awk Using the #! Syntax

The "#!" syntax is an alternative syntax for invoking awk from a shell script. It has the advantage of allowing you to specify awk parameters and filenames on the shell-script command line. The "#!" syntax is recognized on modern UNIX systems, but is not typically found in older System V systems. The best way to use this syntax is to put the following line as the first line[6] of the shell script:

    #!/bin/awk -f

[6] Note that the pathname to use is system-specific.

"#!" is followed by the pathname that locates your version of awk and then the -f option. After this line, you specify the awk script:

    #!/bin/awk -f
    { print $1 }

Note that no quotes are necessary around the script. All lines in the file after the first one will be executed as though they were specified in a separate script file.

A few years ago, there was an interesting discussion on the Net about the use of the "#!" syntax that clarified how it works. The discussion was prompted by a 4.2BSD user's observation that the shell script below fails:

    #!/bin/awk
    { print $1 }

while the one below works:

    #!/bin/sh
    /bin/awk '{ print $1 }'

The two responses that we saw were by Chris Torek and Guy Harris, and we will try to summarize their explanation. The first script fails because it passes the filename of the script as the first parameter (argv[1] in C) and awk interprets it as the input file and not the script file. Because no script has been supplied, awk produces a syntax error message. In other words, if the name of the shell script is "myscript," then the first script executes as:

    /bin/awk myscript

If the script is changed to add the -f option, it looks like this:

    #!/bin/awk -f
    { print $1 }

Then you enter the following command:

    $ myscript myfile

It then executes as though you had typed:

    /bin/awk -f myscript myfile

NOTE: You can put only one parameter on the "#!" line. This line is processed directly by the UNIX kernel; it is not processed by the shell and thus cannot contain arbitrary shell constructs.

The "#!" syntax allows you to create shell scripts that pass command-line parameters transparently to awk. In other words, you can pass awk parameters from the command line that invokes the shell script. For instance, we demonstrate passing parameters by changing our sample awk script to expect a parameter n:

    { print $1*n }

Assuming that we have a test file in which the first field contains a number that can be multiplied by n, we can invoke the program as follows:

    $ myscript n=4 myfile

This spares us from having to pass "$1" as a shell variable and assigning it to n as an awk parameter inside the shell script.

The masterindex, described in Chapter 12, uses the "#!" syntax to invoke awk. If your system does not support this syntax, you can change the script by removing the "#!", placing single quotes around the entire script, and ending the script with "$*", which expands to all shell command-line parameters.
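As a sketch of that conversion (ours, using the sample script from above), this version:

    #!/bin/awk -f
    { print $1*n }

would become:

    #!/bin/sh
    awk '
    { print $1*n }
    ' $*

Invoking the shell-script version as myscript n=4 myfile works the same way, since "$*" passes n=4 and myfile along to awk.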
Well, we've quite nearly cleaned out this bottom drawer. The material in this chapter has a lot to do with how awk interfaces with the UNIX operating system, invoking other utilities, opening and closing files, and using pipes. And, we have discussed some of the admittedly crude techniques for debugging awk scripts.

We have covered all of the features of the awk programming language. We have concentrated on the POSIX specification for awk, with only an occasional mention of actual awk implementations. The next chapter covers the differences among various awk versions. Chapter 12 is devoted to breaking down two large, complex applications: a document spellchecker and an indexing program. Chapter 13, A Miscellany of Scripts, presents a variety of user-contributed programs that provide additional examples of how to write programs.

11. A Flock of awks

Contents:
Original awk
Freely Available awks
Commercial awks
Epilogue

In the previous four chapters, we have looked at POSIX awk, with only occasional reference to actual awk implementations that you would run. In this chapter, we focus on the different versions of awk that are available, what features they do or do not have, and how you can get them. First, we'll look at the original V7 version of awk. The original awk lacks many of the features we've described, so this section mostly describes what's not there. Next, we'll look at the three versions whose source code is freely available. All of them have extensions to the POSIX standard. Those that are common to all three versions are discussed first. Finally, we look at three commercial versions of awk.

11.1 Original awk

In each of the sections that follow, we'll take a brief look at how the original awk differs from POSIX awk. Over the years, UNIX vendors have enhanced their versions of original awk; you may need to write small test programs to see exactly what features your old awk has or doesn't have.

11.1.1 Escape Sequences

The original V7 awk only had "\t", "\n", "\"", and, of course, "\\". Most UNIX vendors have added some or all of "\b", "\r", and "\f".

11.1.2 Exponentiation

Exponentiation (using the ^, ^=, **, and **= operators) is not in old awk.

11.1.3 The C Conditional Expression

The three-argument conditional expression found in C, "expr1 ? expr2 : expr3", is not in old awk. You must resort to a plain old if-else statement.
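For example, what takes one line in new awk must be spelled out with if-else in old awk (the variable names here are made up for illustration):

    # new awk
    max = (x > y) ? x : y

    # old awk
    if (x > y)
        max = x
    else
        max = y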
11.1.4 Variables as Boolean Patterns

You cannot use the value of a variable as a Boolean pattern.

    flag { print "..." }

You must instead use a comparison expression.

    flag != 0 { print "..." }

11.1.5 Faking Dynamic Regular Expressions

The original awk made it difficult to use patterns dynamically because they had to be fixed when the script was interpreted. You can get around the problem of not being able to use a variable as a regular expression by importing a shell variable inside an awk program. The value of the shell variable will be interpreted by awk as a constant. Here's an example:

    $ cat awkro2
    #! /bin/sh
    # assign shell's $1 to awk search variable
    search=$1
    awk '$1 ~ /'"$search"'/' acronyms

The first line of the script makes the variable assignment before awk is invoked. To get the shell to expand the variable inside the awk procedure, we enclose it within single, then double, quotation marks.[1] Thus, awk never sees the shell variable and evaluates it as a constant string.

[1] Actually, this is the concatenation of single-quoted text with double-quoted text with more single-quoted text to produce one large quoted string. This trick was used earlier, in Chapter 6, Advanced sed Commands.

Here's another version that makes use of the Bourne shell variable substitution feature. Using this feature gives us an easy way to specify a default value for the variable if, for instance, the user does not supply a command-line argument.

    search=$1
    awk '$1 ~ /'"${search:-.*}"'/' acronyms

The expression "${search:-.*}" tells the shell to use the value of search if it is defined; if not, use ".*" as the value. Here, ".*" is regular-expression syntax specifying any string of characters; therefore, all entries are printed if no entry is supplied on the command line. Because the whole thing is inside double quotes, the shell does not perform a wildcard expansion on ".*".

11.1.6 Control Flow

In POSIX awk, if a program has just a BEGIN procedure, and nothing else, awk will exit after executing that procedure. The original awk is different; it will execute the BEGIN procedure and then go on to process input, even if there are no pattern-action statements. You can force awk to exit by supplying /dev/null on the command line as a data file argument, or by using exit.

In addition, the BEGIN and END procedures, if present, have to be at the beginning and end of the program, respectively. Furthermore, you can only have one of each.

11.1.7 Field Separating

Field separating works the same in old awk as it does in modern awk, except that you can't use regular expressions.

11.1.8 Arrays

There is no way in the original awk to delete an element from an array. The best thing you can do is assign the empty string to the unwanted array element, and then code your program to ignore array elements whose values are empty.

Along the same lines, in is not an operator in original awk; you cannot use if (item in array) to see if an item is present. Unfortunately, this forces you to loop through every item in an array to see if the index you want is present.

    for (item in array) {
        if (item == searchkey) {
            process array[item]
            break
        }
    }

11.1.9 The getline Function

The original V7 awk did not have getline. If your awk is really ancient, then getline may not work for you. Some vendors have the simplest form of getline, which reads the next record from the regular input stream, and sets $0, NF, and NR (there is no FNR; see below). All of the other forms of getline are not available.

11.1.10 Functions

The original awk had only a limited number of built-in string functions. (See Table 11.1 and Table 11.2.)

    Table 11.1: Original awk's Built-In String Functions
    Awk Function          Description
    index(s,t)            Returns position of substring t in string s or zero if not present.
    length(s)             Returns length of string s or length of $0 if no string is supplied.
    split(s,a,sep)        Parses string s into elements of array a using field separator sep; returns number of elements. If sep is not supplied, FS is used. Array splitting works the same way as field splitting.
    sprintf("fmt",expr)   Uses printf format specification for expr.
    substr(s,p,n)         Returns substring of string s at beginning position p up to a maximum length of n. If n isn't supplied, the rest of the string from p is used.

Some built-in functions can be classified as arithmetic functions. Most of them take a numeric argument and return a numeric value. Table 11.2 summarizes these arithmetic functions.

    Table 11.2: Original awk's Built-In Arithmetic Functions
    Awk Function   Description
    exp(x)         Returns e to the power x.
    int(x)         Returns truncated value of x.
    log(x)         Returns natural logarithm (base-e) of x.
    sqrt(x)        Returns square root of x.

One of the nicest facilities in awk, the ability to define your own functions, is also not available in original awk.

11.1.11 Built-In Variables

In original awk only the variables shown in Table 11.3 are built in.

    Table 11.3: Original awk System Variables
    Variable   Description
    FILENAME   Current filename
    FS         Field separator (a blank)
    NF         Number of fields in current record
    NR         Number of the current record
    OFMT       Output format for numbers (%.6g)
    OFS        Output field separator (a blank)
    ORS        Output record separator (a newline)
    RS         Record separator (a newline)

OFMT does double duty, serving as the conversion format for the print statement, as well as for converting numbers to strings.
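A quick illustration of OFMT's effect on print (the value shown is arbitrary):

    $ awk 'BEGIN { OFMT = "%.2f"; print 3.1415926 }'
    3.14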
11.2 Freely Available awks

There are three versions of awk whose source code is freely available. They are the Bell Labs awk, GNU awk, and mawk, by Michael Brennan. This section discusses the extensions that are common to two or more of them, and then looks at each version in detail and describes how to obtain it.

11.2.1 Common Extensions

This section discusses extensions to the awk language that are available in two or more of the freely available awks.[2]

[2] As the maintainer of gawk and the author of many of the extensions described here and in the section below on gawk, my opinion about the usefulness of these extensions may be biased. :-) You should make your own evaluation. [A.R.]

11.2.1.1 Deleting all elements of an array

All three free awks extend the delete statement, making it possible to delete all the elements of an array at one time. The syntax is:

    delete array

Normally, to delete every element from an array, you have to use a loop, like this.

    for (i in data)
        delete data[i]

With the extended version of the delete statement, you can simply use

    delete data

This is particularly useful for arrays with lots of subscripts; this version is considerably faster than the one using a loop. Even though it no longer has any elements, you cannot use the array name as a simple variable. Once an array, always an array.

This extension appeared first in gawk, then in mawk and the Bell Labs awk.

11.2.1.2 Obtaining individual characters

All three awks extend field splitting and array splitting as follows. If the value of FS is the empty string, then each character of the input record becomes a separate field. This greatly simplifies cases where it's necessary to work with individual characters.

Similarly, if the third argument to the split() function is the empty string, each character in the original string will become a separate element of the target array. Without these extensions, you have to use repeated calls to the substr() function to obtain individual characters.

This extension appeared first in mawk, then in gawk and the Bell Labs awk.
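Here is a quick illustration of character-at-a-time splitting; gawk is shown, but any of the three free awks should behave the same way:

    $ echo abc | gawk 'BEGIN { FS = "" } { print NF, $1, $2, $3 }'
    3 a b c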
11.2.1.3 Flushing buffered output

The 1993 version of the Bell Labs awk introduced a new function that is not in the POSIX standard, fflush(). Like close(), the argument to fflush() is the name of an open file or pipe. Unlike close(), the fflush() function only works on output files and pipes.

Most programs buffer their output, storing data to be written to a file or pipe in an internal chunk of memory until there's enough to send on to the destination. Occasionally, it's useful for the programmer to be able to explicitly flush the buffer, that is, force all buffered data to actually be delivered. This is the purpose of the fflush() function.

This function appeared first in the Bell Labs awk, then in gawk and mawk.
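A minimal sketch of its use, with a hypothetical filename: after writing a progress message to a log file, flush it so that the message can be seen immediately:

    {
        print "processed record", NR > "progress.log"
        fflush("progress.log")    # force the buffered line out now
    }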
11.2.1.4 Special filenames

With any version of awk, you can write directly to the special UNIX file, /dev/tty, that is a name for the user's terminal. This can be used to direct prompts or messages to the user's attention when the output of the program is directed to a file:

    printf "Enter your name:" > "/dev/tty"

This prints "Enter your name:" directly on the terminal, no matter where the standard output and the standard error are directed.

The three free awks support several special filenames, as listed in Table 11.4.

    Table 11.4: Special Filenames
    Filename      Description
    /dev/stdin    Standard input (not mawk)[3]
    /dev/stdout   Standard output
    /dev/stderr   Standard error

[3] The mawk manpage recommends using "-" for the standard input, which is most portable.

Note that a special filename, like any filename, must be quoted when specified as a string constant.

The /dev/stdin, /dev/stdout, and /dev/stderr special files originated in V8 UNIX. Gawk was the first to build in special recognition of these files, followed by mawk and the Bell Labs awk.

A printerr() function

Error messages inform users about problems often related to missing or incorrect input. You can simply inform the user with a print statement. However, if the output of the program is redirected to a file, the user won't see it. Therefore, it is good practice to specify explicitly that the error message be sent to the terminal. The following printerr() function helps to create consistent user error messages. It prints the word "ERROR" followed by a supplied message, the record number, and the current record. The following example directs output to /dev/tty:

    function printerr (message) {
        # print message, record number and record
        printf("ERROR:%s (%d) %s\n", message, NR, $0) > "/dev/tty"
    }

If the output of the program is sent to the terminal screen, then error messages will be mixed in with the output. Outputting "ERROR" will help the user recognize error messages.

In UNIX, the standard destination for error messages is standard error. The rationale for writing to standard error is the same as above. To write to standard error explicitly, you must use the convoluted syntax "cat 1>&2" as in the following example:

    print "ERROR" | "cat 1>&2"

This directs the output of the print statement to a pipe which executes the cat command. You can also use the system() function to execute a UNIX command such as cat or echo and direct its output to standard error. When the special file /dev/stderr is available, this gets much simpler:

    print "ERROR" > "/dev/stderr"    # recent awks only

11.2.1.5 The nextfile statement

The nextfile statement is similar to next, but it operates at a higher level. When nextfile is executed, the current data file is abandoned, and processing starts over at the top of the script, using the first record of the following file. This is useful when you know that you only need to process part of a file; there's no need to then set up a loop to skip records using next.

The nextfile statement originated in gawk, and then was added to the Bell Labs awk. It will be available in mawk, starting with version 1.4.
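For example, this one-line program - a sketch, assuming an awk that has both nextfile and FNR - prints only the first line of each input file and skips the rest:

    FNR == 1 { print FILENAME ":", $0; nextfile }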
11.2.1.6 Regular expression record separators (gawk and mawk)

Gawk and mawk allow RS to be a full regular expression, not just a single character. In that case, the records are separated by the longest text in the input that matches the regular expression. Gawk also sets RT (the record terminator) to the actual input text that matched RS. An example of this is given below. The ability to have RS be a regular expression first appeared in mawk, and was later added to gawk.

11.2.2 Bell Labs awk

The Bell Labs awk is, of course, the direct descendant of the original V7 awk, and of the "new" awk that first became available with System V Release 3.1. Source code is freely available via anonymous FTP to the host netlib.bell-labs.com. It is in the file /netlib/research/awk.bundle.Z. This is a compressed shell archive file. Be sure to use "binary" or "image" mode to transfer the file. This version of awk requires an ANSI C compiler.

There have been several distinct versions; we will identify them here according to the year they became available.

The first version of new awk became available in late 1987. It had almost everything we've described in the previous four chapters (although there are footnotes that indicate those things that are not available). This version is still in use on SunOS 4.1.x systems and some System V Release 3 UNIX systems.

In 1989, for System V Release 4, several new things were added. The only difference between this version and POSIX awk is that POSIX uses CONVFMT for number-to-string conversions, while the 1989 version still used OFMT. The new features were:

● Escape characters in command-line assignments were now interpreted.

● The tolower() and toupper() functions were added.

● printf was improved: dynamic width and precision were added, and the behavior for "%c" was rationalized.

● The return value from the srand() function was defined to be the previous seed. (The awk book didn't state what srand() returned.)

● It became possible to use regular expressions as simple expressions. For example:

    if (/cute/ || /sweet/)
        print "potential here!"

● The -v option was added to allow setting variables on the command line before execution of the BEGIN procedure.

● Multiple -f options could now be used to have multiple source files. (This originated in MKS awk, was adopted by gawk, and then added to the Bell Labs awk.)

● The ENVIRON array was added. (This was developed independently for both MKS awk and gawk, and then added to the Bell Labs awk.)

In 1993, Brian Kernighan of Bell Labs was able to release the source code to his awk. At this point, CONVFMT became available, and the fflush() function, described above, was added. A bug-fix release was made in August of 1994.

In June of 1996, Brian Kernighan made another release. It can be retrieved either from the FTP site given above, or via a World Wide Web browser from Dr. Kernighan's Web page (http://cm.bell-labs.com/who/bwk), which refers to this version as "the one true awk." :-) This version adds several features that originated in gawk and mawk, described earlier in this chapter in the "Common Extensions" section.

11.2.3 GNU awk (gawk)

The Free Software Foundation GNU project's version of awk, gawk, implements all the features of the POSIX awk, and many more. It is perhaps the most popular of the freely available implementations; gawk is used on Linux systems, as well as various other freely available UNIX-like systems, such as NetBSD and FreeBSD.

Source code for gawk is available via anonymous FTP[4] to the host ftp.gnu.ai.mit.edu. It is in the file /pub/gnu/gawk-3.0.3.tar.gz (there may be a later version there by the time you read this). This is a tar file compressed with the gzip program, whose source code is available in the same directory. There are many sites worldwide that "mirror" the files from the main GNU distribution site; if you know of one close to you, you should get the files from there. Be sure to use "binary" or "image" mode to transfer the file(s).

[4] If you don't have Internet access and wish to get a copy of gawk, contact the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 U.S.A. The telephone number is 617-542-5942, and the fax number is 617-542-2652.

Besides the common extensions listed earlier, gawk has a number of additional features. We examine them in this section.

11.2.3.1 Command line options

Gawk has several very useful command-line options. Like most GNU programs, these options are spelled out and begin with two dashes, "--".

● --lint and --lint-old cause gawk to check your program, both at parse-time and at run-time, for constructs that are dubious or nonportable to other versions of awk. The --lint-old option warns about function calls that are not portable to the original version of awk. It is separate from --lint, since most systems now have some version of new awk.

● --traditional disables GNU-specific extensions, such as the time functions and gensub() (see below). With this option, gawk is intended to behave the same as the Bell Labs awk.

● --re-interval enables full POSIX regular expression matching, by allowing gawk to recognize interval expressions (such as "/stuff{1,3}/"); see the example following this list.

● --posix disables all extensions that are not specified in the POSIX standard. This option also turns on recognition of interval expressions.
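For instance, in gawk 3.0 an interval expression is recognized only when one of these two options is supplied; without them, the braces are treated as literal characters:

    $ echo aaab | gawk --re-interval '/a{2,3}b/ { print "matched:", $0 }'
    matched: aaab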
There are a number of other options that are less important for everyday programming and script portability; see the gawk documentation for details.

Although POSIX awk allows you to have multiple instances of the -f option, there is no easy way to use library functions from a command-line program. The --source option in gawk makes this possible.

    gawk --source 'script' -f mylibs.awk file1 file2

This example runs the program in script, which can use awk functions from the file mylibs.awk. The input data comes from file1 and file2.

11.2.3.2 An awk program search path

Gawk allows you to specify an environment variable named AWKPATH that defines a search path for awk program files. By default, it is defined to be .:/usr/local/share/awk. Thus, when a filename is specified with the -f option, the two default directories will be searched, beginning with the current directory. Note that if the filename contains a "/", then no search is performed.

For example, if mylibs.awk was a file of awk functions in /usr/local/share/awk, and myprog.awk was a program in the current directory, we run gawk like this:

    gawk -f myprog.awk -f mylibs.awk datafile1

Gawk would find each file in the appropriate place. This makes it much easier to have and use awk library functions.

11.2.3.3 Line continuation

Gawk allows you to break lines after either a "?" or ":". You can also continue strings across newlines using a backslash.

    $ gawk 'BEGIN { print "hello, \
    > world" }'
    hello, world

11.2.3.4 Extended regular expressions

Gawk provides several additional regular expression operators. These are common to most GNU programs that work with regular expressions. The extended operators are listed in Table 11.5.

    Table 11.5: Gawk Extended Regular Expressions
    Special Operators   Usage
    \w   Matches any word-constituent character (a letter, digit, or underscore).
    \W   Matches any character that is not word-constituent.
    \<   Matches the empty string at the beginning of a word.
    \>   Matches the empty string at the end of a word.
    \y   Matches the empty string at either the beginning or end of a word (the word boundary). Other GNU software uses "\b", but that was already taken.
    \B   Matches the empty string within a word.
    \`   Matches the empty string at the beginning of a buffer. A buffer is the same as a string in awk, and thus this is the same as ^. It is provided for compatibility with GNU Emacs and other GNU software.
    \'   Matches the empty string at the end of a buffer. A buffer is the same as a string in awk, and thus this is the same as $. It is provided for compatibility with GNU Emacs and other GNU software.

You can think of "\w" as a shorthand for the (POSIX) notation [[:alnum:]_] and "\W" as a shorthand for [^[:alnum:]_]. The following table gives examples of what the middle four operators match, borrowed from Effective AWK Programming.

    Table 11.6: Examples of gawk Extended Regular Expression Operators
    Expression   Matches          Does Not Match
    \<stow\>     stow             stowaway
    \yballs?\y   ball or balls    ballroom or baseball
    \Brat\B      crate            dirty rat

11.2.3.5 Regular expression record terminators

Besides allowing RS to be a regular expression, gawk sets the variable RT (record terminator) to the actual input text that matched the value of RS.

Here is a simple example, due to Michael Brennan, that shows the power of gawk's RS and RT variables. As we have seen, one of the most common uses of sed is its substitute command (s/old/new/g). By setting RS to the pattern to match, and ORS to the replacement text, a simple print statement can print the unchanged text followed by the replacement text.

    $ cat simplesed.awk
    # simplesed.awk --- do s/old/new/g using just print
    # Thanks to Michael Brennan for the idea
    #
    # NOTE! RS and ORS must be set on the command line
    {
        if (RT == "")
            printf "%s", $0
        else
            print
    }

There is one wrinkle; at end of file, RT will be empty, so we use a printf statement to print the record.[5] We could run the program like this.

[5] See Effective AWK Programming [Robbins], Section 16.2.8, for an elaborate version of this program.

    $ cat simplesed.data
    "This OLD house" is a great show.
    I like shopping for old things at garage sales.
    $ gawk -f simplesed.awk RS="old|OLD" ORS="brand new" simplesed.data
    "This brand new house" is a great show.
    I like shopping for brand new things at garage sales.

11.2.3.6 Separating fields

Besides the regular way that awk lets you split the input into records and the record into fields, gawk gives you some additional capabilities.

First, as mentioned above, if the value of FS is the empty string, then each character of the input record becomes a separate field.

Second, the special variable FIELDWIDTHS can be used to split out data that occurs in fixed-width columns. Such data may or may not have whitespace separating the values of the fields.

    FIELDWIDTHS = "5 6 8 3"

Here, the record has four fields: $1 is five characters wide, $2 is six characters wide, and so on. Assigning a value to FIELDWIDTHS causes gawk to start using it for field splitting. Assigning a value to FS causes gawk to return to the regular field splitting mechanism. Use FS = FS to make this happen without having to save the value of FS in an extra variable.

This facility would be of most use when working with fixed-width field data, where there may not be any whitespace separating fields, or when intermediate fields may be all blank.
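Here is a small example of FIELDWIDTHS in action; the input and the widths are arbitrary:

    $ echo "12345abcdefWXYZ" | gawk 'BEGIN { FIELDWIDTHS = "5 6 4" } { print $2 }'
    abcdef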
11.2.3.7 Additional special files

Gawk has a number of additional special filenames that it interprets internally. All of the special filenames are listed in Table 11.7.

    Table 11.7: Gawk's Special Filenames
    Filename      Description
    /dev/stdin    Standard input.
    /dev/stdout   Standard output.
    /dev/stderr   Standard error.
    /dev/fd/n     The file referenced as file descriptor n.

    Obsolete Filename   Description
    /dev/pid      Returns a record containing the process ID number.
    /dev/ppid     Returns a record containing the parent process ID number.
    /dev/pgrpid   Returns a record containing the process group ID number.
    /dev/user     Returns a record with the real and effective user IDs, the real and effective group IDs, and if available, any secondary group IDs.

The first three were described earlier. The fourth filename provides access to any open file descriptor that may have been inherited from gawk's parent process (usually the shell). You can use file descriptor 0 for standard input, 1 for standard output, and 2 for standard error.

The second group of special files, labeled "obsolete," have been in gawk for a while, but are being phased out. They will be replaced by a PROCINFO array, whose subscripts are the desired item and whose element value is the associated value. For example, you would use PROCINFO["pid"] to get the current process ID, instead of using getline pid < "/dev/pid". Check the gawk documentation to see if PROCINFO is available and if these filenames are still supported.

11.2.3.8 Additional variables

Gawk has several more system variables. They are listed in Table 11.8.

    Table 11.8: Additional gawk System Variables
    Variable      Description
    ARGIND        The index in ARGV of the current input file.
    ERRNO         A message describing the error if getline or close() fail.
    FIELDWIDTHS   A space-separated list of numbers describing the widths of the input fields.
    IGNORECASE    If non-zero, pattern matches and string comparisons are case-independent.
    RT            The value of the input text that matched RS.

We have already seen the record terminator variable, RT, so we'll proceed to the other variables that we haven't covered yet.

All pattern matching and string comparison in awk is case sensitive. Gawk introduced the IGNORECASE variable so that you can specify that regular expressions be interpreted without regard for upper- or lowercase characters. Beginning with version 3.0 of gawk, string comparisons can also be done without case sensitivity.

The default value of IGNORECASE is zero, which means that pattern matching and string comparison are performed the same as in traditional awk. If IGNORECASE is set to a non-zero value, then case distinctions are ignored. This applies to all places where regular expressions are used, including the field separator FS, the record separator RS, and all string comparisons. It does not apply to array subscripting.

Two more gawk variables are of interest. ARGIND is set automatically by gawk to be the index in ARGV of the current input file name. This variable gives you a way to track how far along you are in the list of filenames. Finally, if an error occurs doing a redirection for getline or during a close(), gawk sets ERRNO to a string describing the error. This makes it possible to provide descriptive error messages when something goes wrong.
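A quick demonstration of IGNORECASE (gawk only; the input is arbitrary):

    $ echo "Hello, World" | gawk 'BEGIN { IGNORECASE = 1 } /hello/ { print "matched:", $0 }'
    matched: Hello, World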
11.2.3.9 Additional functions

Gawk has one additional string function, and two functions for dealing with the current date and time. They are listed in Table 11.9.

    Table 11.9: Additional gawk Functions
    Gawk Function                 Description
    gensub(r, s, h, t)            If h is a string starting with g or G, globally substitutes s for r in t. Otherwise, h is a number: substitutes for the h'th occurrence. Returns the new value; t is unchanged. If t is not supplied, it defaults to $0.
    systime()                     Returns the current time of day in seconds since the Epoch (midnight, January 1, 1970 UTC).
    strftime(format, timestamp)   Formats timestamp (of the same form returned by systime()) according to format. If no timestamp, use current time. If no format either, use a default format whose output is similar to the date command.

11.2.3.10 A general substitution function

The 3.0 version of gawk introduced a new general substitution function, named gensub(). The sub() and gsub() functions have some problems.

● You can change either the first occurrence of a pattern or all the occurrences of a pattern. There is no way to change, say, only the third occurrence of a pattern but not the ones before it or after it.

● Both sub() and gsub() change the actual target string, which may be undesirable.

● It is impossible to get sub() and gsub() to emit a literal backslash followed by the matched text, because an ampersand preceded by a backslash is never replaced.[6]

[6] A full discussion is given in Effective AWK Programming [Robbins], Section 12.3. The details are not for the faint of heart.

● There is no way to get at parts of the matched text, analogous to the \(...\) construct in sed.

For all these reasons, gawk introduced the gensub() function. The function takes at least three arguments. The first is a regular expression to search for. The second is the replacement string. The third is a flag that controls how many substitutions should be performed. The fourth argument, if present, is the original string to change. If it is not provided, the current input record ($0) is used.

The pattern can have subpatterns delimited by parentheses. For example, it can have "/(part) (one|two|three)/". Within the replacement string, a backslash followed by a digit represents the text that matched the nth subpattern.

    $ echo part two | gawk '{ print gensub(/(part) (one|two|three)/, "\\2", "g") }'
    two

The flag is either a string beginning with g or G, in which case the substitution happens globally, or it is a number indicating that the nth occurrence should be replaced.

    $ echo a b c a b c a b c | gawk '{ print gensub(/a/, "AA", 2) }'
    a b c AA b c a b c

The fourth argument is the string in which to make the change. Unlike sub() and gsub(), the target string is not changed. Instead, the new string is the return value from gensub().

    $ gawk '
    BEGIN { old = "hello, world"
            new = gensub(/hello/, "goodbye", 1, old)
            printf("<%s>, <%s>\n", old, new)
    }'
    <hello, world>, <goodbye, world>

11.2.3.11 Time management for programmers

Awk programs are very often used for processing the log files produced by various programs. Often, each record in a log file contains a timestamp, indicating when the record was produced. For both conciseness and precision, the timestamp is written as the result of the UNIX time(2) system call, which is the number of seconds since midnight, January 1, 1970 UTC. (This date is often referred to as "the Epoch.") To make it easier to generate and process log file records with these kinds of timestamps in them, gawk has two functions, systime() and strftime().

The systime() function is primarily intended for generating timestamps to go into log records. Suppose, for example, that we use an awk script to respond to CGI queries to our WWW server. We might log each query to a log file.

    {
        ...
        printf("%s:%s:%d\n", User, Host, systime()) >> "/var/log/cgi/querylog"
        ...
    }

Such a record might look like:

    arnold:some.domain.com:831322007

The strftime() function[7] makes it easy to turn timestamps into human-readable dates. The format string is similar to the one used by sprintf(); it consists of literal text mixed with format specifications for different components of date and time.

[7] This function is patterned after the function of the same name in ANSI C.

    $ gawk 'BEGIN { print strftime("Today is %A, %B %d, %Y") }'
    Today is Sunday, May 05, 1996

The list of available formats is quite long. See your local strftime(3) manpage, and the gawk documentation for the full list.

Our hypothetical CGI log file might be processed by this program:

    # cgiformat --- process CGI logs
    # data format is user:host:timestamp
    #1
    BEGIN { FS = ":"; SUBSEP = "@" }
    #2
    {
        # make data more obvious
        user = $1; host = $2; time = $3
        # store first contact by this user
        if (! ((user, host) in first))
            first[user, host] = time
        # count contacts
        count[user, host]++
        # save last contact
        last[user, host] = time
    }
    #3
    END {
        # print the results
        for (contact in count) {
            i = strftime("%y-%m-%d %H:%M", first[contact])
            j = strftime("%y-%m-%d %H:%M", last[contact])
            printf "%s -> %d times between %s and %s\n", contact, count[contact], i, j
        }
    }

The first step is to set FS to ":" to split the fields correctly. We also use a neat trick and set the subscript separator to "@", so that the arrays become indexed by "user@host" strings.

In the second step, we look to see if this is the first time we've seen this user. If so (they're not in the first array), we add them. Then we increment the count of how many times they've connected. Finally we store this record's timestamp in the last array.
This element keeps getting overwritten each time we see a new connection by the user. That's OK; what we will end up with is the last (most recent) connection stored in the array.

The END procedure formats the data for us. It loops through the count array, formatting the timestamps in the first and last arrays for printing.

Consider a log file with the following records in it.

    $ cat /var/log/cgi/querylog
    arnold:some.domain.com:831322007
    mary:another.domain.org:831312546
    arnold:some.domain.com:831327215
    mary:another.domain.org:831346231
    arnold:some.domain.com:831324598

Here's what running the program produces:

    $ gawk -f cgiformat.awk /var/log/cgi/querylog
    mary@another.domain.org -> 2 times between 96-05-05 12:09 and 96-05-05 21:30
    arnold@some.domain.com -> 3 times between 96-05-05 14:46 and 96-05-05 15:29

11.2.4 Michael's awk (mawk)

The third freely available awk is mawk, written by Michael Brennan. This program is upwardly compatible with POSIX awk, and has a few extensions as well. It is solid and performs very well. Source code for mawk is freely available via anonymous FTP from ftp.whidbey.net. It is in /pub/brennan/mawk1.3.3.tar.gz. (There may be a later version there by the time you read this.) This is also a tar file compressed with the gzip program. Be sure to use "binary" or "image" mode to transfer the file.

Mawk's primary advantages are its speed and robustness. Although it has fewer features than gawk, it almost always outperforms it.[8] Besides UNIX systems, mawk also runs under MS-DOS.

[8] Gawk's advantages are that it has a larger feature set, it has been ported to more non-UNIX kinds of systems, and it comes with much more extensive documentation.

The common extensions described above are also available in mawk.

11.3 Commercial awks

There are also several commercial versions of awk. In this section, we review the ones that we know about.

11.3.1 MKS awk

Mortice Kern Systems (MKS) in Waterloo, Ontario (Canada)[9] supplies awk as part of the MKS Toolkit for MS-DOS/Windows, OS/2, Windows 95, and Windows NT.

[9] Mortice Kern Systems, 185 Columbia Street West, Waterloo, Ontario N2L 5Z5, Canada. Phone: 1-800-265-2797 in North America, 1-519-884-2251 elsewhere. URL is http://www.mks.com/.

The MKS version implements POSIX awk. It has the following extensions:

● The exp(), int(), log(), sqrt(), tolower(), and toupper() functions use $0 if given no argument.

● An additional function ord() is available. This function takes a string argument, and returns the numeric value of the first character in the string. It is similar to the function of the same name in Pascal.
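In awks that lack ord(), a user-defined function can provide similar behavior by building a lookup table. This sketch is ours, not MKS's implementation, and it handles only 8-bit character sets:

    # map each possible character to its numeric value, once
    BEGIN {
        for (i = 0; i < 256; i++)
            ordtab[sprintf("%c", i)] = i
    }
    function ord(s) {
        # return the numeric value of the first character of s
        return ordtab[substr(s, 1, 1)]
    }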
11.3.2 Thompson Automation awk (tawk)

Thompson Automation Software[10] makes a version of awk (tawk)[11] for MS-DOS/Windows, Windows 95 and NT, and Solaris. Tawk is interesting on several counts. First, unlike other versions of awk, which are interpreters, tawk is a compiler. Second, tawk comes with a screen-oriented debugger, written in awk! The source for the debugger is included. Third, tawk allows you to link your compiled program with arbitrary functions written in C. Tawk has received rave reviews in the comp.lang.awk newsgroup.

[10] Thompson Automation Software, 5616 SW Jefferson, Portland OR 97221 U.S.A. Phone: 1-800-944-0139 within the U.S., 1-503-224-1639 elsewhere.

[11] Michael Brennan, in the mawk(1) manpage, makes the following statement: "Implementors of the AWK language have shown a consistent lack of imagination when naming their programs."

Tawk comes with an awk interface that acts like POSIX awk, compiling and running your program. You can, however, compile your program into a standalone executable file. The tawk compiler actually compiles into a compact intermediate form. The intermediate representation is linked with a library that executes the program when it is run, and it is at link time that other C routines can be integrated with the awk program.

Tawk is a very full-featured implementation of awk. Besides implementing the features of POSIX awk (based on new awk), it extends the language in some fundamental ways, and also has a very large number of built-in functions.

11.3.2.1 Tawk language extensions

This section provides a "laundry list" of the new features in tawk. A full treatment of them is beyond the scope of this book; the tawk documentation does a nice job of presenting them. Hopefully, by now you should be familiar enough with awk that the value of these features will be apparent. Where relevant, we'll contrast the tawk feature with a comparable feature in gawk.

● Additional special patterns, INIT, BEGINFILE, and ENDFILE. INIT is like BEGIN, but the actions in its procedure are run before[12] those of the BEGIN procedure. BEGINFILE and ENDFILE provide you the ability to have per-file start-up and clean-up actions. Unlike using a rule based on FNR == 1, these actions are executed even when files are empty.

[12] I confess that I don't see the real usefulness of this. [A.R.]

● Controlled regular expressions. You can add a flag to a regular expression ("/match me/") that tells tawk how to treat the regular expression. An i flag ("/match me/i") indicates that case should be ignored when doing matching. An s flag indicates that the shortest possible text should be matched, instead of the longest.

● An abort [expr] statement. This is similar to exit, except that tawk exits immediately, bypassing any END procedure. The expr, if provided, becomes the return value from tawk to its parent program.

● True multidimensional arrays. Conventional awk simulates multidimensional arrays by concatenating the values of the subscripts, separated by the value of SUBSEP, to generate a (hopefully) unique index in a regular associative array. While implementing this feature for compatibility, tawk also provides true multidimensional arrays.

    a[1][1] = "hello"
    a[1][2] = "world"
    for (i in a[1])
        print a[1][i]

Multidimensional arrays guarantee that the indices will be unique, and also have the potential for greater performance when the number of elements gets to be very large.

● Automatic sorting of arrays. When looping over every element of an array using the for (item in array) construct, tawk will first sort the indices of the array, so that array elements are processed in order. You can control whether this sorting is turned on or off, and if on, whether the sorting is numeric or alphabetic, and in ascending or descending order. While the sorting incurs a performance penalty, it is likely to be less than the overhead of sorting the array yourself using awk code, or piping the results into an external invocation of sort.

● Scope control for functions and variables. You can declare that functions and variables are global to an entire program, global within a "module" (source file), local to a module, and local to a function. Regular awk only gives you global variables, global functions, and extra function parameters, which act as local variables (a sketch of that standard-awk convention appears after this list). This feature is a very nice one, making it much easier to write libraries of awk functions without having to worry about variable names inadvertently conflicting with those in other library functions or in the user's main program.

● RS can be a regular expression. This is similar to gawk and mawk; however, the regular expression cannot be one that requires more than one character of look-ahead. The text that matched RS is saved in the variable RSM (record separator match), similar to gawk's RT variable.

● Describing fields, instead of the field separators. The variable FPAT can be a regular expression that describes the contents of the fields. Successive occurrences of text that matches FPAT become the contents of the fields.

● Controlling the implicit file processing loop. The variable ARGI tracks the position in ARGV of the current input data file. Unlike gawk's ARGIND variable, assigning a value to ARGI can be used to make tawk skip over input data files.

● Fixed-length records. By assigning a value to the RECLEN variable, you can make tawk read records in fixed-length chunks. If RS is not matched within RECLEN characters, then tawk returns a record that is RECLEN characters long.

● Hexadecimal constants. You can specify C-style hexadecimal constants (0xDEAD and 0xBEEF being two rather famous ones) in tawk programs. This helps when using the built-in bit manipulation functions (see the next section).

Whew! That's a rather long list, but these features bring additional power to programming in awk.
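For comparison, here is the convention tawk's scope control improves upon. In standard awk, extra names in a function's parameter list serve as local variables; callers simply don't pass them (the function shown is a made-up example):

    function max(a, b,    result) {
        # "result" is local: the caller passes only two arguments
        result = (a > b) ? a : b
        return result
    }

By convention, the extra parameters are set off with additional spaces in the parameter list.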
11.3.2.2 Additional built-in tawk functions

Besides extending the language, tawk provides a large number of additional built-in functions. Here is another "laundry list," this time of the different classes of functions available. Each class has two or more functions associated with it. We'll briefly describe the functionality of each class.

● Extended string functions. Extensions to the standard string functions and new string functions allow you to match and substitute for subpatterns within patterns (similar to gawk's gensub() function), assign to substrings within strings, and split a string into an array based on a pattern that matches elements, instead of the separator. There are additional printf formats, and string translation functions. While undoubtedly some of these functions could be written as user-defined functions, having them built in provides greater performance.

● Bit manipulation functions. You can perform bitwise AND, OR, and XOR operations on (integer) values. These could also be written as user-defined functions, but with a loss of performance.

● More I/O functions. There is a suite of functions modeled after those in the stdio(3) library. In particular, the ability to seek within a file, and do I/O in fixed-size amounts, is quite useful.

● Directory operation functions. You can make, remove, and change directories, as well as remove and rename files.

● File information functions. You can retrieve file permissions, size, and modification times.

● Directory reading functions. You can get the current directory name, as well as read a list of all the filenames in a directory.

● Time functions. There are functions to retrieve the current time of day, and format it in various ways. These functions are not quite as flexible as gawk's strftime() function.

● Execution functions. You can sleep for a specific amount of time, and start other functions running.
Tawk's spawn() function is interesting because it allows you to provide values for the new program's environment, and also indicate whether the program should or should not run asynchronously. This is particularly valuable on non-UNIX systems, where the command interpreters (such as MS-DOS's command.com) are quite limited.

● File locking. You can lock and unlock files and ranges within files.

● Screen functions. You can do screen-oriented I/O. Under UNIX, these functions are implemented on top of the curses(3) library.

● Packing and unpacking of binary data. You can specify how binary data structures are laid out. This, together with the new I/O functions, makes it possible to do binary I/O, something you would normally have to do in C or C++.

● Access to internal state. You can get or set the value of any awk variable through function calls.

● Access to MS-DOS low-level facilities. You can use system interrupts, and peek and poke values at memory addresses. These features are obviously for experts only.

From this list, it becomes clear that tawk provides a nice alternative to C and to Perl for serious programming tasks. As an example, the screen functions and internal state functions are used to implement the tawk debugger in awk.

11.3.3 Videosoft VSAwk

Videosoft[13] sells software called VSAwk that brings awk-style programming into the Visual Basic environment. VSAwk is a Visual Basic control that works in an event-driven fashion. Like awk, VSAwk gives you startup and cleanup actions, and splits the input record into fields, as well as the ability to write expressions and call the awk built-in functions.

[13] Videosoft can be reached at 2625 Alcatraz Avenue, Suite 271, Berkeley CA 94705 U.S.A. Phone: 1-510-704-8200. Fax: 1-510-843-0174. Their site is http://www.videosoft.com.

VSAwk resembles UNIX awk mostly in its data processing model, not its syntax. Nevertheless, it's interesting to see how people apply the concepts from awk to the environment provided by a very different language.

11.4 Epilogue

Well, we've pretty thoroughly covered the ins and outs of programming in awk, both the standard language, and the extensions available in different implementations. As you work with awk, you'll come to find it an easy and pleasant language to program in, since it does almost all of the drudgery for you, allowing you to concentrate on the actual problem to be solved.

12. Full-Featured Applications

Contents:
An Interactive Spelling Checker
Generating a Formatted Index
Spare Details of the masterindex Program

This chapter presents two complex applications that integrate most features of the awk programming language. The first program, spellcheck, provides an interactive interface to the UNIX spell program. The second application, masterindex, is a batch program for generating an index for a book or a set of books. Even if you are not interested in the particular application, you should study these larger programs to get a feel for the scope of the problems that an awk program can solve.

12.1 An Interactive Spelling Checker

The UNIX spell program does an adequate job of catching spelling errors in a document. For most people, however, it only does half the job. It doesn't help you correct the misspelled words. First-time users of spell find themselves jotting down the misspelled words and then using the text editor to change the document.
More skilled users build a sed script to make the changes automatically. The spellcheck program offers another way - it shows you each word that spell has found and asks if you want to correct the word. You can change each occurrence of the word after seeing the line on which it occurs, or you can correct the spelling error globally. You can also choose to add any word that spell turns up to a local dictionary file.

Before describing the program, let's have a demonstration of how it works. The user enters spellcheck, a shell script that invokes awk, and the name of the document file.

    $ spellcheck ch00
    Use local dict file? (y/n)y

If a dictionary file is not specified on the command line, and a file named dict exists in the current directory, then the user is asked if the local dictionary should be used. spellcheck then runs spell using the local dictionary.

    Running spell checker ...

Using the list of "misspelled" words turned up by spell, spellcheck prompts the user to correct them. Before the first word is displayed, a list of responses is shown that describes what actions are possible.

    Responses:
            Change each occurrence,
            Global change,
            Add to Dict,
            Help,
            Quit
            CR to ignore:
    1 - Found SparcStation (C/G/A/H/Q/):a

The first word found by spell is "SparcStation." A response of "a" (followed by a carriage return) adds this word to a list that will be used to update the dictionary. The second word is clearly a misspelling and a response of "g" is entered to make the change globally:

    2 - Found languauge (C/G/A/H/Q/):g
    Globally change to:language
    Globally change languauge to language? (y/n):y
    > and a full description of its scripting language.
    1 lines changed. Save changes? (y/n)y

After prompting the user to enter the correct spelling and confirming the entry, the change is made and each line affected is displayed, preceded by a ">". The user is then asked to approve these changes before they are saved.

The third word is also added to the dictionary:

    3 - Found nawk (C/G/A/H/Q/):a

The fourth word is a misspelling of "utilities."

    4 - Found utlitities (C/G/A/H/Q/):c
    These utlitities have many things in common, including
    ^^^^^^^^^^
    Change to:utilities
    Change utlitities to utilities? (y/n):y
    Two other utlitities that are found on the UNIX system
    ^^^^^^^^^^
    Change utlitities to utilities? (y/n):y
    >These utilities have many things in common, including
    >Two other utilities that are found on the UNIX system
    2 lines changed. Save changes? (y/n)y

The user enters "c" to change each occurrence. This response allows the user to see the line containing the misspelling and then make the change. After the user has made each change, the changed lines are displayed and the user is asked to confirm saving the changes.

It is unclear whether the fifth word is a misspelling or not, so the user enters "c" to view the line.

    5 - Found xvf (C/G/A/H/Q/):c
    tar xvf filename
    ^^^
    Change to:RETURN

After determining that it is not a misspelling, the user enters a carriage return to ignore the word. Generally, spell turns up a lot of words that are not misspellings so a carriage return means to ignore the word.

After all the words in the list have been processed, or if the user quits before then, the user is prompted to save the changes made to the document and the dictionary.

    Save corrections in ch00 (y/n)? y
    Make changes to dictionary (y/n)? y

If the user answers "n," the original file and the dictionary are left unchanged.
Now let's look at the spellcheck.awk script, which can be divided into four sections:
● The BEGIN procedure, which processes the command-line arguments and executes the spell command to create a word list.
● The main procedure, which reads one word at a time from the list and prompts the user to make a correction.
● The END procedure, which saves the working copy of the file, overwriting the original. It also appends words from the exception list to the current dictionary.
● Supporting functions, which are called to make changes in the file.
We will look at each of these sections of the program.
12.1.1 BEGIN Procedure
The BEGIN procedure for spellcheck.awk is large. It is also somewhat unusual.

# spellcheck.awk -- interactive spell checker
#
# AUTHOR: Dale Dougherty
#
# Usage: nawk -f spellcheck.awk [+dict] file
# (Use spellcheck as name of shell program)
# SPELLDICT = "dict"
# SPELLFILE = "file"

# BEGIN actions perform the following tasks:
# 1) process command-line arguments
# 2) create temporary filenames
# 3) execute spell program to create wordlist file
# 4) display list of user responses
BEGIN {
    # Process command-line arguments
    # Must be at least two args -- nawk and filename
    if (ARGC > 1) {
        # if more than two args, second arg is dict
        if (ARGC > 2) {
            # test to see if dict is specified with "+"
            # and assign ARGV[1] to SPELLDICT
            if (ARGV[1] ~ /^\+.*/)
                SPELLDICT = ARGV[1]
            else
                SPELLDICT = "+" ARGV[1]
            # assign file ARGV[2] to SPELLFILE
            SPELLFILE = ARGV[2]
            # delete args so awk does not open them as files
            delete ARGV[1]
            delete ARGV[2]
        }
        # not more than two args
        else {
            # assign file ARGV[1] to SPELLFILE
            SPELLFILE = ARGV[1]
            # test to see if local dict file exists
            if (! system("test -r dict")) {
                # if it does, ask if we should use it
                printf ("Use local dict file? (y/n)")
                getline reply < "-"
                # if reply is yes, use "dict"
                if (reply ~ /[yY](es)?/) {
                    SPELLDICT = "+dict"
                }
            }
        }
    } # end of processing args > 1
    # if args not > 1, then print shell-command usage
    else {
        print "Usage: spellcheck [+dict] file"
        exit 1
    }
    # end of processing command line arguments

    # create temporary filenames, each beginning with sp_
    wordlist = "sp_wordlist"
    spellsource = "sp_input"
    spellout = "sp_out"
    # copy SPELLFILE to temporary input file
    system("cp " SPELLFILE " " spellsource)
    # now run spell program; output sent to wordlist
    print "Running spell checker ..."
    if (SPELLDICT)
        SPELLCMD = "spell " SPELLDICT " "
    else
        SPELLCMD = "spell "
    system(SPELLCMD spellsource " > " wordlist)
    # test wordlist to see if misspelled words turned up
    if ( system("test -s " wordlist) ) {
        # if wordlist is empty (or spell command failed), exit
        print "No misspelled words found."
        system("rm " spellsource " " wordlist)
        exit
    }
    # assign wordlist file to ARGV[1] so that awk will read it.
    ARGV[1] = wordlist
    # display list of user responses
    responseList = "Responses: \n\tChange each occurrence,"
    responseList = responseList "\n\tGlobal change,"
    responseList = responseList "\n\tAdd to Dict,"
    responseList = responseList "\n\tHelp,"
    responseList = responseList "\n\tQuit"
    responseList = responseList "\n\tCR to ignore: "
    printf("%s", responseList)
} # end of BEGIN procedure

The first part of the BEGIN procedure processes the command-line arguments. It checks that ARGC is greater than one for the program to continue. That is, in addition to "nawk," a filename must be specified. This file specifies the document that spell will analyze. An optional dictionary filename can be specified as the second argument.
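If the way awk hands you arguments seems opaque, it helps to dump ARGC and ARGV before anything else runs. The following throwaway script (args.awk is my name for it; it is not part of spellcheck) just prints them:

# args.awk -- show command-line arguments as awk sees them
BEGIN {
    for (i = 0; i < ARGC; i++)
        print i, ARGV[i]
}

Running nawk -f args.awk +dict ch00 prints something like "0 nawk", "1 +dict", and "2 ch00" (the exact value of ARGV[0] depends on the implementation). This is why spellcheck.awk treats ARGC > 2 as the signal that a dictionary was named, and why the arguments must be deleted from ARGV before awk tries to open them as input files.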
The spellcheck script follows the command-line interface of spell, although none of the obscure spell options can be invoked from the spellcheck command line. If a dictionary is not specified, then the script executes a test command to see if the file dict exists. If it does, the prompt asks the user to approve using it as the dictionary file.
Once we've processed the arguments, we delete them from the ARGV array. This is to prevent their being interpreted as filename arguments.
The second part of the BEGIN procedure sets up some temporary files, because we do not want to work directly with the original file. At the end of the program, the user will have the option of saving or discarding the work done in the temporary files. The temporary files all begin with "sp_" and are removed before exiting the program.
The third part of the procedure executes spell and creates a word list. We test to see that this file exists and that there is something in it before proceeding. If for some reason the spell program fails, or there are no misspelled words found, the wordlist file will be empty. If this file does exist, then we assign the filename as the second element in the ARGV array. This is an unusual but valid way of supplying the name of the input file that awk will process. Note that this file did not exist when awk was invoked! The name of the document file, which was specified on the command line, is no longer in the ARGV array. We will not read the document file using awk's main input loop. Instead, a while loop reads the file to find and correct misspelled words.
The last task in the BEGIN procedure is to define and display a list of responses that the user can enter when a misspelled word is displayed. This list is displayed once at the beginning of the program as well as when the user enters "Help" at the main prompt. Putting this list in a variable allows us to access it from different points in the program, if necessary, without maintaining duplicates. The assignment of responseList could be done more simply, but the long string would not be printable in this book. (You can't break a string over two lines.)
12.1.2 Main Procedure
The main procedure is rather small, merely displaying a misspelled word and prompting the user to enter an appropriate response. This procedure is executed for each misspelled word. One reason this procedure is short is because the central action - correcting a misspelled word - is handled by two larger user-defined functions, which we'll see in the last section.

# main procedure, executed for each line in wordlist.
# Purpose is to show misspelled word and prompt user
# for appropriate action.
{
    # assign word to misspelling
    misspelling = $1
    response = 1
    ++word
    # print misspelling and prompt for response
    while (response !~ /(^[cCgGaAhHqQ])|^$/) {
        printf("\n%d - Found %s (C/G/A/H/Q/):", word, misspelling)
        getline response < "-"
    }
    # now process the user's response
    # CR - carriage return ignores current word
    # Help
    if (response ~ /[Hh](elp)?/) {
        # Display list of responses and prompt again.
        printf("%s", responseList)
        printf("\n%d - Found %s (C/G/A/H/Q/):", word, misspelling)
        getline response < "-"
    }
    # Quit
    if (response ~ /[Qq](uit)?/)
        exit
    # Add to dictionary
    if (response ~ /[Aa](dd)?/) {
        dict[++dictEntry] = misspelling
    }
    # Change each occurrence
    if (response ~ /[cC](hange)?/) {
        # read each line of the file we are correcting
        newspelling = ""; changes = ""
        while ((getline < spellsource) > 0) {
            # call function to show line with misspelled word
            # and prompt user to make each correction
            make_change($0)
            # all lines go to temp output file
            print > spellout
        }
        # all lines have been read
        # close temp input and temp output file
        close(spellout)
        close(spellsource)
        # if change was made
        if (changes) {
            # show changed lines
            for (j = 1; j <= changes; ++j)
                print changedLines[j]
            printf ("%d lines changed. ", changes)
            # function to confirm before saving changes
            confirm_changes()
        }
    }
    # Globally change
    if (response ~ /[gG](lobal)?/) {
        # call function to prompt for correction
        # and display each line that is changed.
        # Ask user to approve all changes before saving.
        make_global_change()
    }
} # end of Main procedure

The first field of each input line from wordlist contains the misspelled word and it is assigned to misspelling. We construct a while loop inside which we display the misspelled word to the user and prompt for a response. Look closely at the regular expression that tests the value of response:

while (response !~ /(^[cCgGaAhHqQ])|^$/)

The user can only get out of this loop by entering any of the specified letters or by entering a carriage return - an empty line. The use of regular expressions for testing user input helps tremendously in writing a simple but flexible program. The user can enter a single letter "c" in lower- or uppercase or a word beginning with "c" such as "Change."
The rest of the main procedure consists of conditional statements that test for a specific response and perform a corresponding action. The first response is "help," which displays the list of responses again and then redisplays the prompt. The next response is "quit." The action associated with quit is exit, which drops out of the main procedure and goes to the END procedure. If the user enters "add," the misspelled word is put in the array dict and will be added as an exception in a local dictionary.
The "Change" and "Global" responses cause the program's real work to begin. It's important to understand how they differ. When the user enters "c" or "change," the first occurrence of the misspelled word in the document is displayed. Then the user is prompted to make the change. This happens for each occurrence in the document. When the user enters "g" or "global," the user is prompted to make the change right away, and all the changes are made at once without prompting the user to confirm each one. This work is largely handled by two functions, make_change() and make_global_change(), which we'll look at in the last section.
These are all the valid responses, except one. A carriage return means to ignore the misspelled word and get the next word in the list. This is the default action of the main input loop, so no conditional need be set up for it.
12.1.3 END Procedure
The END procedure, of course, is reached in one of the following circumstances:
● The spell command failed or did not turn up any misspellings.
● The list of misspelled words is exhausted.
● The user has entered "quit" at a prompt.
The purpose of the END procedure is to allow the user to confirm any permanent change to the document or the dictionary.
# END procedure makes changes permanent.
# It overwrites the original file, and adds words
# to the dictionary.
# It also removes the temporary files.
END {
    # if we got here after reading only one record,
    # no changes were made, so exit.
    if (NR <= 1) exit
    # user must confirm saving corrections to file
    while (saveAnswer !~ /([yY](es)?)|([nN]o?)/) {
        printf "Save corrections in %s (y/n)? ", SPELLFILE
        getline saveAnswer < "-"
    }
    # if answer is yes then mv temporary input file to SPELLFILE
    # save old SPELLFILE, just in case
    if (saveAnswer ~ /^[yY]/) {
        system("cp " SPELLFILE " " SPELLFILE ".orig")
        system("mv " spellsource " " SPELLFILE)
    }
    # if answer is no then rm temporary input file
    if (saveAnswer ~ /^[nN]/)
        system("rm " spellsource)
    # if words have been added to dictionary array, then prompt
    # to confirm saving in current dictionary.
    if (dictEntry) {
        printf "Make changes to dictionary (y/n)? "
        getline response < "-"
        if (response ~ /^[yY]/) {
            # if no dictionary defined, then use "dict"
            if (! SPELLDICT) SPELLDICT = "dict"
            # loop through array and append words to dictionary
            sub(/^\+/, "", SPELLDICT)
            for (item in dict)
                print dict[item] >> SPELLDICT
            close(SPELLDICT)
            # sort dictionary file
            system("sort " SPELLDICT " > tmp_dict")
            system("mv " "tmp_dict " SPELLDICT)
        }
    }
    # remove word list
    system("rm sp_wordlist")
} # end of END procedure

The END procedure begins with a conditional statement that tests that the number of records is less than or equal to 1. This occurs when the spell program does not generate a word list or when the user enters "quit" after seeing just the first record. If so, the END procedure is exited as there is no work to save.
Next, we create a while loop to ask the user about saving the changes made to the document. It requires the user to respond "y" or "n" to the prompt. If the answer is "y," the temporary input file replaces the original document file. If the answer is "n," the temporary file is removed. No other responses are accepted.
Next, we test to see if the dict array has something in it. Its elements are the words to be added to the dictionary. If the user approves adding them to the dictionary, these words are appended to the current dictionary, as defined above, or if not, to a local dict file. Because the dictionary must be sorted to be read by spell, a sort command is executed with the output sent to a temporary file that is afterwards copied over the original file.
12.1.4 Supporting Functions
There are three supporting functions, two of which are large and do the bulk of the work of making changes in the document. The third function supports that work by confirming that the user wants to save the changes that were made.
When the user wants to "Change each occurrence" in the document, the main procedure has a while loop that reads the document one line at a time. (This line becomes $0.) It calls the make_change() function to see if the line contains the misspelled word. If it does, the line is displayed and the user is prompted to enter the correct spelling of the word.

# make_change -- prompt user to correct misspelling
# for current input line. Calls itself
# to find other occurrences in string.
# stringToChange -- initially $0; then unmatched substring of $0
# len -- length from beginning of $0 to end of matched string
# Assumes that misspelling is defined.
function make_change(stringToChange, len,    # parameters
        line, OKmakechange, printstring, carets)    # locals
{
    # match misspelling in stringToChange; otherwise do nothing
    if ( match(stringToChange, misspelling) ) {
        # Display matched line
        printstring = $0
        gsub(/\t/, " ", printstring)
        print printstring
        carets = "^"
        for (i = 1; i < RLENGTH; ++i)
            carets = carets "^"
        if (len)
            FMT = "%" len+RSTART+RLENGTH-2 "s\n"
        else
            FMT = "%" RSTART+RLENGTH-1 "s\n"
        printf(FMT, carets)
        # Prompt user for correction, if not already defined
        if (! newspelling) {
            printf "Change to:"
            getline newspelling < "-"
        }
        # A carriage return falls through
        # If user enters correction, confirm
        while (newspelling && ! OKmakechange) {
            printf ("Change %s to %s? (y/n):", misspelling, newspelling)
            getline OKmakechange < "-"
            madechg = ""
            # test response
            if (OKmakechange ~ /[yY](es)?/ ) {
                # make change (first occurrence only)
                madechg = sub(misspelling, newspelling, stringToChange)
            }
            else if ( OKmakechange ~ /[nN]o?/ ) {
                # offer chance to re-enter correction
                printf "Change to:"
                getline newspelling < "-"
                OKmakechange = ""
            }
        } # end of while loop
        # if len, we are working with substring of $0
        if (len) {
            # assemble it
            line = substr($0, 1, len-1)
            $0 = line stringToChange
        }
        else {
            $0 = stringToChange
            if (madechg) ++changes
        }
        # put changed line in array for display
        if (madechg)
            changedLines[changes] = ">" $0
        # create substring so we can try to match other occurrences
        len += RSTART + RLENGTH
        part1 = substr($0, 1, len-1)
        part2 = substr($0, len)
        # calls itself to see if misspelling is found in remaining part
        make_change(part2, len)
    } # end of if
} # end of make_change()

If the misspelled word is not found in the current input line, nothing is done. If it is found, this function shows the line containing the misspelling and asks the user if it should be corrected. Underneath the display of the current line is a row of carets that indicates the misspelled word.

Two other utlitities that are found on the UNIX system
          ^^^^^^^^^^

The current input line is copied to printstring because it is necessary to change the line for display purposes. If the line contains any tabs, each tab in this copy of the line is temporarily replaced by a single space. This solves a problem of aligning the carets when tabs were present. (A tab counts as a single character when determining the length of a line but actually occupies greater space when displayed, usually five to eight characters long.)
After displaying the line, the function prompts the user to enter a correction. It then follows up by displaying what the user has entered and asks for confirmation. If the correction is approved, the sub() function is called to make the change. If not approved, the user is given another chance to enter the correct word.
Remember that the sub() function only changes the first occurrence on a line. The gsub() function changes all occurrences on a line, but we want to allow the user to confirm each change. Therefore, we have to try to match the misspelled word against the remaining part of the line. And we have to be able to match the next occurrence regardless of whether or not the first occurrence was changed.
To do this, make_change() is designed as a recursive function; it calls itself to look for additional occurrences on the same line. In other words, the first time make_change() is called, it looks at all of $0 and matches the first misspelled word on that line.
Then it splits the line into two parts - the first part contains the characters up to the end of the first occurrence and the second part contains the characters that immediately follow up to the end of the line. Then it calls itself to try to match the misspelled word in the second part.
When called recursively, the function takes two arguments.

make_change(part2, len)

The first is the string to be changed, which is initially $0 when called from the main procedure but each time thereafter is the remaining part of $0. The second argument is len or the length of the first part, which we use to extract the substring and reassemble the two parts at the end.
The make_change() function also collects an array of lines that were changed.

# put changed line in array for display
if (madechg)
    changedLines[changes] = ">" $0

The variable madechg will have a value if the sub() function was successful. $0 (the two parts have been rejoined) is assigned to an element of the array. When all of the lines of the document have been read, the main procedure loops through this array to display all the changed lines. Then it calls the confirm_changes() function to ask if these changes should be saved. It copies the temporary output file over the temporary input file, keeping intact the corrections made for the current misspelled word.
If a user decides to make a "Global change," the make_global_change() function is called to do it. This function is similar to the make_change() function, but is simpler because we can make the change globally on each line.

# make_global_change --
# prompt user to correct misspelling
# for all lines globally.
# Has no arguments
# Assumes that misspelling is defined.
function make_global_change(    newspelling, OKmakechange, changes)
{
    # prompt user to correct misspelled word
    printf "Globally change to:"
    getline newspelling < "-"
    # carriage return falls through
    # if there is an answer, confirm
    while (newspelling && ! OKmakechange) {
        printf ("Globally change %s to %s? (y/n):", misspelling, newspelling)
        getline OKmakechange < "-"
        # test response and make change
        if (OKmakechange ~ /[yY](es)?/ ) {
            # open file, read all lines
            while ((getline < spellsource) > 0) {
                # if match is found, make change using gsub
                # and print each changed line.
                if ($0 ~ misspelling) {
                    madechg = gsub(misspelling, newspelling)
                    print ">", $0
                    changes += 1 # counter for line changes
                }
                # write all lines to temp output file
                print > spellout
            } # end of while loop for reading file
            # close temporary files
            close(spellout)
            close(spellsource)
            # report the number of changes
            printf ("%d lines changed. ", changes)
            # function to confirm before saving changes
            confirm_changes()
        } # end of if (OKmakechange ~ y)
        # if correction not confirmed, prompt for new word
        else if ( OKmakechange ~ /[nN]o?/ ) {
            printf "Globally change to:"
            getline newspelling < "-"
            OKmakechange = ""
        }
    } # end of while loop for prompting user for correction
} # end of make_global_change()

This function prompts the user to enter a correction. A while loop is set up to read all the lines of the document and apply the gsub() function to make the changes. The main difference is that all the changes are made at once - the user is not prompted to confirm them. When all lines have been read, the function displays the lines that were changed and calls confirm_changes() to get the user to approve this batch of changes before saving them.
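The design difference between the two functions comes down to sub() versus gsub(). If that distinction is hazy, this standalone fragment (mine, not part of spellcheck) shows it:

# sub() changes the first match; gsub() changes them all.
# Both return the number of substitutions made.
BEGIN {
    s = "utlitities here and utlitities there"
    t = s
    print sub(/utlitities/, "utilities", s), s
    print gsub(/utlitities/, "utilities", t), t
}

This prints "1 utilities here and utlitities there" followed by "2 utilities here and utilities there". The return value is also how make_change() sets madechg and how make_global_change() knows a change was made.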
The confirm_changes() function is a routine called to get approval of the changes made when the make_change() or make_global_change() function is called.

# confirm_changes --
# confirm before saving changes
function confirm_changes(    savechanges)
{
    # prompt to confirm saving changes
    while (! savechanges) {
        printf ("Save changes? (y/n)")
        getline savechanges < "-"
    }
    # if confirmed, mv output to input
    if (savechanges ~ /[yY](es)?/)
        system("mv " spellout " " spellsource)
}

The reason for creating this function is to prevent the duplication of code. Its purpose is simply to require the user to acknowledge the changes before replacing the old version of the document file (spellsource) with the new version (spellout).
12.1.5 The spellcheck Shell Script
To make it easy to invoke this awk script, we create the spellcheck shell script (say that three times fast). It contains the following lines:

AWKLIB=/usr/local/awklib
nawk -f $AWKLIB/spellcheck.awk $*

This script sets up a shell variable AWKLIB that specifies the location of the spellcheck.awk script. The symbol "$*" expands to all command-line parameters following the name of the script. These parameters are then available to awk. One of the interesting things about this spell checker is how little is done in the shell script.[1] All of the work is done in the awk programming language, including executing 10 UNIX commands. Doing it all in awk lets us use a consistent syntax and the same constructs throughout. When you have to do some of your work in the shell and some in awk, it can get confusing. For instance, you have to remember the differences in the syntax of if conditionals and how to reference variables. Modern versions of awk provide a true alternative to the shell for executing commands and interacting with a user. The full listing for spellcheck.awk is found in Appendix C, Supplement for Chapter 12.
[1] UNIX Text Processing (Dougherty and O'Reilly, Howard W. Sams, 1987) presents a sed-based spell checker that relies heavily upon the shell. It is interesting to compare the two versions.
12.2 Generating a Formatted Index
The process of generating an index usually involves three steps:
● Code the index entries in the document.
● Format the document, producing index entries with page numbers.
● Process the index entries to sort them, combining entries that differ only in page number, and then preparing the formatted index.
This process remains pretty much the same whether using troff, other coded batch formatters, or a WYSIWYG formatter such as FrameMaker, although the steps are not as clearly separated with the latter. However, I will be describing how we use troff to generate an index such as the one for this book. We code the index using the following macros:

Macro    Description
.XX      Produces general index entries.
.XN      Creates "see" or "see also" cross references.
.XB      Creates bold page entry indicating primary reference.
.XS      Begins range of pages for entry.
.XE      Ends range of pages for entry.

These macros take a single quoted argument, which can have one of several forms, indicating primary, secondary, or tertiary keys:

"primary [ : secondary [ ; tertiary ]]"

A colon is used as the separator between the primary and secondary keys. To support an earlier coding convention, the first comma is interpreted as the separator if no colon is used. A semicolon indicates the presence of a tertiary key. The page number is always associated with the last key.
Here is an entry with only a primary key:

.XX "XView"

The next two entries specify a secondary key:

.XX "XView: reserved names"
.XX "XView, packages"

The most complex entries contain tertiary keys:

.XX "XView: objects; list of"
.XX "XView: objects; hierarchy of"

Finally, there are two types of cross references:

.XN "error recovery: (see error handling)"
.XN "mh mailer: (see also xmh mailer)"

The "see" entry refers a person to another index entry. The "see also" is typically used when there are entries for, in this case, "mh mailer," but there is relevant information catalogued under another name. Only "see" entries do not have page numbers associated with them.
When the document is processed by troff, the following index entries are produced:

XView	42
XView: reserved names	43
XView, packages	43
XView: objects; list of	43
XView: objects; hierarchy of	44
XView, packages	45
error recovery: (See error handling)
mh mailer: (see also xmh mailer)	46

These entries serve as input to the indexing program. Each entry (except for "see" entries) consists of the key and a page number. In other words, the entry is divided into two parts and the first part, the key, can also be divided into three parts. When these entries are processed by the indexing program and the output is formatted, the entries for "XView" are combined as follows:

XView, 42
    objects; hierarchy of, 44; list of, 43
    packages, 43,45
    reserved names, 43

To accomplish this, the indexing program must:
● Sort the index by key and page number.
● Merge entries that differ only in the page number.
● Merge entries that have the same primary and/or secondary keys.
● Look for consecutive page numbers and combine as a range.
● Prepare the index in a format for display on screen or for printing.
This is what the index program does if you are processing the index entries for a single book. It also allows you to create a master index, an overall index for a set of volumes. To do that, an awk script appends either a roman numeral or an abbreviation after the page number. Each file then contains the entries for a particular book and those entries are uniquely identified. If we chose to use roman numerals to identify the volume, then the above entries would be changed to:

XView	42:I
XView: reserved names	43:I
XView: objects; list of	43:I

With multivolume entries, the final index that is generated might look like this:

XView, I:42; II:55,69,75
    objects; hierarchy of, I:44; list of, I:43; II:56
    packages, I:43,45
    reserved names, I:43

For now, it's only important to recognize that the index entry used as input to the awk program can have a page number or a page number followed by a volume identifier.
12.2.1 The masterindex Program
Because of the length and complexity of this indexing application,[2] our description presents the larger structure of the program. Use the comments in the program itself to understand what is happening in the program line by line.
[2] The origins of this indexing program are traced back to a copy of an indexing program written in awk by Steve Talbott. I learned this program by taking it apart, and made some changes to it to support consecutive page numbering in addition to section-page numbering. That was the program I described in UNIX Text Processing. Knowing that program, I wrote an indexing program that could deal with index entries produced by Microsoft Word and generate an index using section-page numbering. Later, we needed a master index for several books in our X Window System Series.
I took it as an opportunity to rethink our indexing program, and rewrite it using nawk, so that it supports both single-book and multiple-book indices.
The AWK Programming Language contains an example of an index program that is smaller than the one shown here and might be a place to start if you find this one too complicated. It does not, however, deal with keys. That indexing program is a simplified version of the one described in Bell Labs Computing Science Technical Report 128, Tools for Printing Indexes, October 1986, by Brian Kernighan and Jon Bentley. [D.D.]
After descriptions of each of the program modules, a final section discusses a few remaining details. For the most part, these are code fragments that deal with nitty-gritty, input-related problems that had to be solved along the way.
The shell script masterindex[3] allows the user to specify a number of different command-line options to specify what kind of index to make and it invokes the necessary awk programs to do the job. The operations of the masterindex program can be broken into five separate programs or modules that form a single pipe.
[3] This shell script and the documentation for the program are presented in Appendix C. You might want to first read the documentation for a basic understanding of using the program.

input.idx | sort | pagenums.idx | combine.idx | format.idx

All but one of the programs are written using awk. For sorting the entries, we rely upon sort, a standard UNIX utility. Here's a brief summary of what each of these programs does:

input.idx      Standardizes the format of entries and rotates them.
sort           Sorts entries by key, volume, and page number.
pagenums.idx   Merges entries with same key, creating a list of page numbers.
combine.idx    Combines consecutive page numbers into a range.
format.idx     Prepares the formatted index for the screen or processing by troff.

We will discuss each of these steps in a separate section.
12.2.2 Standardizing Input
The input.idx script looks for different types of entries and standardizes them for easier processing by subsequent programs. Additionally, it automatically rotates index entries containing a tilde (~). (See the section "Rotating Two Parts" later in this chapter.)
The input to the input.idx program consists of two tab-separated fields, as described earlier. The program produces output records with three colon-separated fields. The first field contains the primary key; the second field contains the secondary and tertiary keys, if defined; and the third field contains the page number.
Here's the code for the input.idx program:

#!/work/bin/nawk -f
# ------------------------------------------------
# input.idx -- standardize input before sorting
# Author: Dale Dougherty
# Version 1.1 7/10/90
#
# input is "entry" tab "page_number"
# ------------------------------------------------
BEGIN { FS = "\t"; OFS = "" }

#1 Match entries that need rotating that contain a single tilde
# $1 ~ /~[^~]/ # regexp does not work and I do not know why
$1 ~ /~/ && $1 !~ /~~/ {
    # split first field into array named subfield
    n = split($1, subfield, "~")
    if (n == 2) {
        # print entry without "~" and then rotated
        printf("%s %s::%s\n", subfield[1], subfield[2], $2)
        printf("%s:%s:%s\n", subfield[2], subfield[1], $2)
    }
    next
} # End of 1

#2 Match entries that contain two tildes
$1 ~ /~~/ {
    # replace ~~ with ~
    gsub(/~~/, "~", $1)
} # End of 2

#3 Match entries that use "::" for literal ":".
$1 ~ /::/ {
    # substitute octal value for "::"
    gsub(/::/, "\\72", $1)
} # End of 3

#4 Clean up entries
{
    # look for second colon, which might be used instead of ";"
    if (sub(/:.*:/, "&;", $1)) {
        sub(/:;/, ";", $1)
    }
    # remove blank space if any after colon.
    sub(/: */, ":", $1)
    # if comma is used as delimiter, convert to colon.
    if ( $1 !~ /:/ ) {
        # On see also & see, try to put delimiter before "("
        if ($1 ~ /\([sS]ee/) {
            if (sub(/, *.*\(/, ":&", $1))
                sub(/:, */, ":", $1)
            else
                sub(/ *\(/, ":(", $1)
        }
        else { # otherwise, just look for comma
            sub(/, */, ":", $1)
        }
    }
    else {
        # added to insert semicolon in "See"
        if ($1 ~ /:[^;]+ *\([sS]ee/)
            sub(/ *\(/, ";(", $1)
    }
} # End of 4

#5 match See Alsos and fix for sort at end
$1 ~ / *\([Ss]ee +[Aa]lso/ {
    # add "~zz" for sort at end
    sub(/\([Ss]ee +[Aa]lso/, "~zz(see also", $1)
    if ($1 ~ /:[^; ]+ *~zz/) {
        sub(/ *~zz/, "; ~zz", $1)
    }
    # if no page number
    if ($2 == "") {
        print $0 ":"
        next
    }
    else {
        # output two entries:
        # print See Also entry w/out page number
        print $1 ":"
        # remove See Also
        sub(/ *~zz\(see also.*$/, "", $1)
        sub(/;/, "", $1)
        # print as normal entry
        if ( $1 ~ /:/ )
            print $1 ":" $2
        else
            print $1 "::" $2
        next
    }
} # End of 5

#6 Process entries without page number (See entries)
(NF == 1 || $2 == "" || $1 ~ /\([sS]ee/) {
    # if a "See" entry
    if ( $1 ~ /\([sS]ee/ ) {
        if ( $1 ~ /:/ )
            print $1 ":"
        else
            print $1 "::"
        next
    }
    else { # if not a See entry, generate error
        printerr("No page number")
        next
    }
} # End of 6

#7 If the colon is used as the delimiter
$1 ~ /:/ {
    # output entry:page
    print $1 ":" $2
    next
} # End of 7

#8 Match entries with only primary keys.
{
    print $1 "::" $2
} # End of 8

# supporting functions
#
# printerr -- print error message and current record
# Arg: message to be displayed
function printerr(message) {
    # print message, record number and record
    printf("ERROR:%s (%d) %s\n", message, NR, $0) > "/dev/tty"
}

This script consists of a number of pattern-matching rules to recognize different types of input. Note that an entry could match more than one rule unless the action associated with a rule calls the next statement. As we describe this script, we will be referring to the rules by number.
Rule 1 rotates entries containing a tilde and produces two output records. The split() function creates an array named subfield that contains the two parts of the compound entry. The two parts are printed in their original order and are then swapped to create a second output record in which the secondary key becomes a primary key.
Because we are using the tilde as a special character, we must provide some way of actually entering a tilde. We have implemented the convention that two consecutive tildes are translated into a single tilde. Rule 2 deals with that case, but notice that the pattern for rule 1 makes sure that the first tilde it matches is not followed by another tilde.[4]
[4] In the first edition, Dale wrote, "For extra credit, please send me mail if you can figure out why the commented regular expression just before rule 1 does not do the job. I used the compound expression as a last resort." I'm ashamed to admit that this stumped me also. When Henry Spencer turned on the light, it was blinding: "The reason why the commented regexp doesn't work is that it doesn't do what the author thought. :-) It looks for tilde followed by a non-tilde character... but the second tilde of a ~~ combination is usually followed by a non-tilde! Using /[^~]~[^~]/ would probably work." I plugged this regular expression into the program, and it worked just fine. [A.R.]
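It's easy to convince yourself of Henry Spencer's explanation with a quick test. The following standalone script (mine, not part of masterindex) runs all three candidate patterns against a rotating entry and a literal-tilde entry:

# tilde-test.awk -- compare the three candidate patterns
BEGIN {
    entry[1] = "foo~bar"    # single tilde: should match (rotate)
    entry[2] = "foo~~bar"   # escaped tilde: should not match
    for (i = 1; i <= 2; i++) {
        e = entry[i]
        printf("%-10s /~[^~]/:%s compound:%s /[^~]~[^~]/:%s\n", e,
            (e ~ /~[^~]/) ? "yes" : "no",
            (e ~ /~/ && e !~ /~~/) ? "yes" : "no",
            (e ~ /[^~]~[^~]/) ? "yes" : "no")
    }
}

The commented pattern wrongly says yes to "foo~~bar" because the second tilde is followed by "b"; Spencer's /[^~]~[^~]/ gets both cases right (though, as he hedged, it would still miss a tilde at the very start or end of a field).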
The order of rules 1 and 2 in the script is significant. We can't replace "~~" with "~" until after the procedure for rotating the entry.
Rule 3 does a job similar to that of rule 2; it allows "::" to be used to output a literal ":" in the index. However, since we use the colon as an input delimiter throughout the input to the program, we cannot allow it to appear in an entry as finally output until the very end. Thus, we replace the sequence "::" with the colon's ASCII value in octal. (The format.idx program will reverse the replacement.)
Beginning with rule 4, we attempt to recognize various ways of coding entries - giving the user more flexibility. However, to make writing the remaining programs easier, we must reduce this variety to a few basic forms.
In the "basic" syntax, the primary and secondary keys are separated by a colon. The secondary and tertiary keys are separated by a semicolon. Nonetheless, the program also recognizes a second colon, in place of a semicolon, as the delimiter between the secondary and tertiary keys. It also recognizes that if no colon is specified as a delimiter, then a comma can be used as the delimiter between primary and secondary keys. (In part, this was done to be compatible with an earlier program that used the comma as the delimiter.) The sub() function looks for the first comma on the line and changes it to a colon. This rule also tries to standardize the syntax of "see" and "see also" entries. For entries that are colon-delimited, rule 4 removes spaces after the colon. All of the work is done using the sub() function.
Rule 5 deals with "see also" entries. We prepend the arbitrary string "~zz" to the "see also" entries so that they will sort at the end of the list of secondary keys. The pagenums.idx script, later in the pipeline, will remove "~zz" after the entries have been sorted.
Rule 6 matches entries that do not specify a page number. The only valid entry without a page number contains a "see" reference. This rule outputs "see" entries with ":" at the end to indicate an empty third field. All other entries generate an error message via the printerr() function. This function notifies the user that a particular entry does not have a page number and will not be included in the output. This is one method of standardizing input - throwing out what you can't interpret properly. However, it is critical to notify the user so that he or she can correct the entry.
Rule 7 outputs entries that contain the colon-delimiter. Its action uses next to avoid reaching rule 8. Finally, rule 8 matches entries that contain only a primary key. In other words, there is no delimiter. We output "::" to indicate an empty second field.
Here's a portion of the contents of our test file. We'll be using it to generate examples in this section.

$ cat test
XView: programs; initialization	45
XV_INIT_ARGS~macro	46
Xv_object~type	49
Xv_singlecolor~type	80
graphics: (see also server image)
graphics, XView model	83
X Window System: events	84
graphics, CANVAS_X_PAINT_WINDOW	86
X Window System, X Window ID for paint window	87
toolkit (See X Window System).
graphics: (see also server image)
Xlib, repainting canvas	88
Xlib.h~header file	89

When we run this file through input.idx, it produces:

$ input.idx test
XView:programs; initialization:45
XV_INIT_ARGS macro::46
macro:XV_INIT_ARGS:46
Xv_object type::49
type:Xv_object:49
Xv_singlecolor type::80
type:Xv_singlecolor:80
graphics:~zz(see also server image):
graphics:XView model:83
X Window System:events:84
graphics:CANVAS_X_PAINT_WINDOW:86
X Window System:X Window ID for paint window:87
graphics:~zz(see also server image):
Xlib:repainting canvas:88
Xlib.h header file::89
header file:Xlib.h:89

Each entry now consists of three colon-separated fields. In the sample output, you can find examples of entries with only a primary key, those with primary and secondary keys, and those with primary, secondary, and tertiary keys. You can also find examples of rotated entries, duplicate entries, and "see also" entries. The only difference in the output for multivolume entries is that each entry would have a fourth field that contains the volume identifier.
12.2.3 Sorting the Entries
Now the output produced by input.idx is ready to be sorted. The easiest way to sort the entries is to use the standard UNIX sort program rather than write a custom script. In addition to sorting the entries, we want to remove any duplicates and for this task we use the uniq program. Here's the command line we use:

sort -bdf -t: +0 -1 +1 -2 +3 -4 +2n -3n | uniq

As you can see, we use a number of options with the sort command. The first option, -b, specifies that leading spaces be ignored. The -d option specifies a dictionary sort in which symbols and special characters are ignored. -f specifies that lower- and uppercase letters are to be folded together; in other words, they are to be treated as the same character for purposes of the sort. The next argument is perhaps the most important: -t: tells the program to use a colon as a field delimiter for sort keys. The "+" options that follow specify the number of fields to skip from the beginning of the line. Therefore, to specify the first field as the primary sort key, we use "+0." Similarly, the "-" options specify the end of a sort key. "-1" specifies that the primary sort key ends at the first field, or the beginning of the second field. The second sort field is the secondary key. The fourth field ("+3"), if it exists, contains the volume number. The last key to sort is the page number; this requires a numeric sort (if we did not tell sort that this key consists of numbers, then the number 1 would be followed by 10, instead of 2). Notice that we sort page numbers after sorting the volume numbers. Thus, all the page numbers for Volume I are sorted in order before the page numbers for Volume II. Finally, we pipe the output to uniq to remove identical entries.
Processing the output from input.idx, the sort command produces:

graphics:CANVAS_X_PAINT_WINDOW:86
graphics:XView model:83
graphics:~zz(see also server image):
header file:Xlib.h:89
macro:XV_INIT_ARGS:46
toolkit:(See X Window System).:
type:Xv_object:49
type:Xv_singlecolor:80
X Window System:events:84
X Window System:X Window ID for paint window:87
Xlib:repainting canvas:88
Xlib.h header file::89
XView:programs; initialization:45
XV_INIT_ARGS macro::46
Xv_object type::49
Xv_singlecolor type::80

12.2.4 Handling Page Numbers
The pagenums.idx program looks for entries that differ only in page number and creates a list of page numbers for a single entry.
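Before looking at pagenums.idx in detail, one portability note on the sort command above: the "+pos -pos" form of specifying sort keys is obsolete, and some modern sort implementations no longer accept it. On a POSIX system, the equivalent command using -k options would be (my translation; the original masterindex script uses the older form):

sort -bdf -t: -k 1,1 -k 2,2 -k 4,4 -k 3,3n | uniq

Each -k option names a key by its starting and ending field, so -k 1,1 corresponds to the old "+0 -1" and -k 3,3n to "+2n -3n". With the entries sorted and duplicates removed, we can turn to pagenums.idx.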
The input to this program is four colon-separated fields:

PRIMARY:SECONDARY:PAGE:VOLUME

The fourth is optional. For now, we consider only the index for a single book, in which there are no volume numbers. Remember that the entries are now sorted.
The heart of this program compares the current entry to the previous one and determines what to output. The conditionals that implement the comparison can be extracted and expressed in pseudocode, as follows:

PRIMARY = $1
SECONDARY = $2
PAGE = $3
if (PRIMARY == prevPRIMARY)
    if (SECONDARY == prevSECONDARY)
        print PAGE
    else
        print PRIMARY:SECONDARY:PAGE
else
    print PRIMARY:SECONDARY:PAGE
prevPRIMARY = PRIMARY
prevSECONDARY = SECONDARY

Let's see how this code handles a series of entries, beginning with:

XView::18

The primary key doesn't match the previous primary key; the line is output as is:

XView::18

The next entry is:

XView:about:3

When we compare the primary key of this entry to the previous one, they are the same. When we compare secondary keys, they differ; we output the record as is:

XView:about:3

The next entry is:

XView:about:7

Because both the primary and secondary keys match the keys of the previous entry, we simply output the page number. (The printf function is used instead of print so that there is no automatic newline.) This page number is appended to the previous entry so that it looks like this:

XView:about:3,7

The next entry also matches both keys:

XView:about:10

Again, only the page number is output so that entry now looks like:

XView:about:3,7,10

In this way, three entries that differ only in page number are combined into a single entry. The full script adds an additional test to see if the volume identifier matches. Here's the full pagenums.idx script:

#!/work/bin/nawk -f
# ------------------------------------------------
# pagenums.idx -- collect pages for common entries
# Author: Dale Dougherty
# Version 1.1 7/10/90
#
# input should be PRIMARY:SECONDARY:PAGE:VOLUME
# ------------------------------------------------
BEGIN { FS = ":"; OFS = ""}

# main routine -- apply to all input lines
{
    # assign fields to variables
    PRIMARY = $1
    SECONDARY = $2
    PAGE = $3
    VOLUME = $4
    # check for a see also and collect it in array
    if (SECONDARY ~ /\([Ss]ee +[Aa]lso/) {
        # create tmp copy & remove "~zz" from copy
        tmpSecondary = SECONDARY
        sub(/~zz\([Ss]ee +[Aa]lso */, "", tmpSecondary)
        sub(/\) */, "", tmpSecondary)
        # remove secondary key along with "~zz"
        sub(/^.*~zz\([Ss]ee +[Aa]lso */, "", SECONDARY)
        sub(/\) */, "", SECONDARY)
        # assign to next element of seeAlsoList
        seeAlsoList[++eachSeeAlso] = SECONDARY "; "
        prevPrimary = PRIMARY
        # assign copy to previous secondary key
        prevSecondary = tmpSecondary
        next
    } # end test for see Also
    # Conditionals to compare keys of current record to previous
    # record. If Primary and Secondary keys are the same, only
    # the page number is printed.

    # test to see if each PRIMARY key matches previous key
    if (PRIMARY == prevPrimary) {
        # test to see if each SECONDARY key matches previous key
        if (SECONDARY == prevSecondary)
            # test to see if VOLUME matches;
            # print only VOLUME:PAGE
            if (VOLUME == prevVolume)
                printf (",%s", PAGE)
            else {
                printf ("; ")
                volpage(VOLUME, PAGE)
            }
        else {
            # if array of See Alsos, output them now
            if (eachSeeAlso)
                outputSeeAlso(2)
            # print PRIMARY:SECONDARY:VOLUME:PAGE
            printf ("\n%s:%s:", PRIMARY, SECONDARY)
            volpage(VOLUME, PAGE)
        }
    } # end of test for PRIMARY == prev
    else { # PRIMARY != prev
        # if we have an array of See Alsos, output them now
        if (eachSeeAlso)
            outputSeeAlso(1)
        if (NR != 1)
            printf ("\n")
        if (NF == 1) {
            printf ("%s:", $0)
        }
        else {
            printf ("%s:%s:", PRIMARY, SECONDARY)
            volpage(VOLUME, PAGE)
        }
    }
    prevPrimary = PRIMARY
    prevSecondary = SECONDARY
    prevVolume = VOLUME
} # end of main routine

# at end, print newline
END {
    # in case last entry has "see Also"
    if (eachSeeAlso)
        outputSeeAlso(1)
    printf("\n")
}

# outputSeeAlso function -- list elements of seeAlsoList
function outputSeeAlso(LEVEL) {
    # LEVEL - indicates which key we need to output
    if (LEVEL == 1)
        printf ("\n%s:(See also ", prevPrimary)
    else {
        sub(/;.*$/, "", prevSecondary)
        printf ("\n%s:%s; (See also ", prevPrimary, prevSecondary)
    }
    sub(/; $/, ".):", seeAlsoList[eachSeeAlso])
    for (i = 1; i <= eachSeeAlso; ++i)
        printf ("%s", seeAlsoList[i])
    eachSeeAlso = 0
}

# volpage function -- determine whether or not to print volume info
# two args: volume & page
function volpage(v, p) {
    # if VOLUME is empty then print PAGE only
    if ( v == "" )
        printf ("%s", p)
    else
        # otherwise print VOLUME^PAGE
        printf ("%s^%s", v, p)
}

Remember, first of all, that the input to the program is sorted by its keys. The page numbers are also in order, such that an entry for "graphics" on page 7 appears in the input before one on page 10. Similarly, entries for Volume I appear in the input before Volume II. Therefore, this program need do no sorting; it simply compares the keys and if they are the same, appends the page number to a list. In this way, the entries are reduced.
This script also handles "see also" entries. Since the records are now sorted, we can remove the special sorting sequence "~zz." We also handle the case where we might encounter consecutive "see also" entries. We don't want to output:

Toolkit (see also Xt) (See also XView) (See also Motif).

Instead we'd like to combine them into a list such that they appear as:

Toolkit (see also Xt; XView; Motif)

To do that, we create an array named seeAlsoList. From SECONDARY, we remove the parentheses, the secondary key if it exists, and the "see also" and then assign it to an element of seeAlsoList. We make a copy of SECONDARY with the secondary key and assign it to prevSecondary for making comparisons to the next entry.
The function outputSeeAlso() is called to read all the elements of the array and print them. The function volpage() is also simple and determines whether or not we need to output a volume number. Both of these functions are called from more than one place in the code, so the main reason for defining them as functions is to reduce duplication.
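If you want to watch pagenums.idx work on the sample data, the pipeline built so far can be run by hand - this assumes you have made the awk scripts executable (they begin with #! lines) and that they are in your search path:

$ input.idx test | sort -bdf -t: +0 -1 +1 -2 +3 -4 +2n -3n | uniq | pagenums.idx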
Here's an example of what it produces for a single-book index:

X Window System:Xlib:6
XFontStruct structure::317
Xlib::6
Xlib:repainting canvas:88
Xlib.h header file::89,294
Xv_Font type::310
XView::18
XView:about:3,7,10
XView:as object-oriented system:17

Here's an example of what it produces for a master index:

reserved names:table of:I^43
Xt:example of programming interface:I^44,65
Xt:objects; list of:I^43,58; II^40
Xt:packages:I^43,61; II^42
Xt:programs; initialization:I^45
Xt:reserved names:I^43,58
Xt:reserved prefixes:I^43,58
Xt:types:I^43,54,61

The "^" is used as a temporary delimiter between the volume number and the list of page numbers.
12.2.5 Merging Entries with the Same Keys
The pagenums.idx program reduced entries that were the same except for the page number. Now we'll process entries that share the same primary key. We also want to look for consecutive page numbers and combine them in ranges.
The combine.idx script is quite similar to the pagenums.idx script, making another pass through the index, comparing entries with the same primary key. The following pseudocode abstracts this comparison. (To make this discussion easier, we will omit tertiary keys and show how to compare primary and secondary keys.) After the entries are processed by pagenums.idx, no two entries exist that share the same primary and secondary keys. Therefore, we don't have to compare secondary keys.

PRIMARY = $1
SECONDARY = $2
PAGE = $3
if (PRIMARY == prevPRIMARY)
    print :SECONDARY:
else
    print PRIMARY:SECONDARY
prevPRIMARY = PRIMARY
prevSECONDARY = SECONDARY

If the primary keys match, we output only the secondary key. For instance, if there are three entries:

XView:18
XView:about:3, 7, 10
XView:as object-oriented system:17

they will be output as:

XView:18
:about:3, 7, 10
:as object-oriented system:17

We drop the primary key when it is the same. The actual code is a little more difficult because there are tertiary keys. We have to test primary and secondary keys to see if they are unique or the same, but we don't have to test tertiary keys. (We only need to know that they are there.)
You no doubt noticed that the above pseudocode does not output page numbers. The second role of this script is to examine page numbers and combine a list of consecutive numbers. The page numbers are a comma-separated list that can be loaded into an array, using the split() function. To see if numbers are consecutive, we loop through the array, comparing each element to 1 + the previous element.

eachpage[j-1]+1 == eachpage[j]

In other words, if adding 1 to the previous element produces the current element, then they are consecutive. The previous element becomes the first page number in the range and the current element becomes the last page in the range. This is done within a while loop until the conditional is not true, and the page numbers are not consecutive. Then we output the first page number and the last page number separated by a hyphen:

23-25

The actual code looks more complicated than this because it is called from a function that must recognize volume and page number pairs. It first has to split the volume from the list of page numbers and then it can call the function (rangeOfPages()) to process the list of numbers.
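The core idea is easier to see in isolation. Here is a minimal standalone sketch of the consecutive-number merge (my simplification - it ignores the volume handling and the troff-generated ranges that the real rangeOfPages() must cope with):

# ranges.awk -- merge consecutive numbers in a sorted list
BEGIN {
    n = split("23,24,25,28,30,31", eachpage, ",")
    for (j = 1; j <= n; ++j) {
        firstpage = eachpage[j]
        lastpage = firstpage
        # absorb numbers while they remain consecutive
        while (j < n && eachpage[j]+1 == eachpage[j+1])
            lastpage = eachpage[++j]
        pages = (firstpage == lastpage) ? firstpage : firstpage "-" lastpage
        out = (out == "") ? pages : out ", " pages
    }
    print out
}

Running it prints "23-25, 28, 30-31".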
Here is the full listing of combine.idx:

#!/work/bin/nawk -f
# ------------------------------------------------
# combine.idx -- merge keys with same PRIMARY key
# and combine consecutive page numbers
# Author: Dale Dougherty
# Version 1.1 7/10/90
#
# input should be PRIMARY:SECONDARY:PAGELIST
# ------------------------------------------------
BEGIN { FS = ":"; OFS = ""}

# main routine -- applies to all input lines
# It compares the keys and merges the duplicates.
{
    # assign first field
    PRIMARY = $1
    # split second field, getting SEC and TERT keys.
    sizeOfArray = split($2, array, ";")
    SECONDARY = array[1]
    TERTIARY = array[2]
    # test that tertiary key exists
    if (sizeOfArray > 1) {
        # tertiary key exists
        isTertiary = 1
        # two cases where ";" might turn up
        # check SEC key for list of "see also"
        if (SECONDARY ~ /\([sS]ee also/) {
            SECONDARY = $2
            isTertiary = 0
        }
        # check TERT key for "see also"
        if (TERTIARY ~ /\([sS]ee also/) {
            TERTIARY = substr($2, (index($2, ";") + 1))
        }
    }
    else # tertiary key does not exist
        isTertiary = 0
    # assign third field
    PAGELIST = $3
    # Conditional to compare primary key of this entry to that
    # of previous entry. Then compare secondary keys. This
    # determines which non-duplicate keys to output.
    if (PRIMARY == prevPrimary) {
        if (isTertiary && SECONDARY == prevSecondary)
            printf (";\n::%s", TERTIARY)
        else
            if (isTertiary)
                printf ("\n:%s; %s", SECONDARY, TERTIARY)
            else
                printf ("\n:%s", SECONDARY)
    }
    else {
        if (NR != 1)
            printf ("\n")
        if ($2 != "")
            printf ("%s:%s", PRIMARY, $2)
        else
            printf ("%s", PRIMARY)
        prevPrimary = PRIMARY
    }
    prevSecondary = SECONDARY
} # end of main procedure

# routine for "See" entries (primary key only)
NF == 1 { printf ("\n") }

# routine for all other entries
# It handles output of the page number.
NF > 1 {
    if (PAGELIST)
        # calls function numrange() to look for
        # consecutive page numbers.
        printf (":%s", numrange(PAGELIST))
    else
        if (! isTertiary || (TERTIARY && SECONDARY))
            printf (":")
} # end of NF > 1

# END procedure outputs newline
END { printf ("\n") }

# Supporting Functions

# numrange -- read list of Volume^Page numbers, detach Volume
# from Page for each Volume and call rangeOfPages
# to combine consecutive page numbers in the list.
# PAGE = volumes separated by semicolons; volume and page
# separated by ^.
function numrange(PAGE,    listOfPages, sizeOfArray) {
    # Split up list by volume.
    sizeOfArray = split(PAGE, howManyVolumes, ";")
    # Check to see if more than 1 volume.
    if (sizeOfArray > 1) {
        # if more than 1 volume, loop through list
        for (i = 1; i <= sizeOfArray; ++i) {
            # for each Volume^Page element, detach Volume
            # and call rangeOfPages function on Page to
            # separate page numbers and compare to find
            # consecutive numbers.
            if (split(howManyVolumes[i], volPage, "^") == 2)
                listOfPages = volPage[1] "^" rangeOfPages(volPage[2])
            # collect output in listOfPages
            if (i == 1)
                result = listOfPages
            else
                result = result ";" listOfPages
        } # end for loop
    }
    else { # not more than 1 volume
        # check for single volume index with volume number
        # if so, detach volume number.
        # Both call rangeOfPages on the list of page numbers.
        if (split(PAGE, volPage, "^") == 2)
            # if Volume^Page, detach volume and then call rangeOfPages
            listOfPages = volPage[1] "^" rangeOfPages(volPage[2])
        else
            # No volume number involved
            listOfPages = rangeOfPages(volPage[1])
        result = listOfPages
    } # end of else
    return result # Volume^Page list
} # End of numrange function

# rangeOfPages -- read list of comma-separated page numbers,
# load them into an array, and compare each one
# to the next, looking for consecutive numbers.
# PAGENUMBERS = comma-separated list of page numbers
function rangeOfPages(PAGENUMBERS,    pagesAll, sizeOfArray, pages,
        listOfPages, d, p, j) {
    # close-up space on troff-generated ranges
    gsub(/ - /, ",-", PAGENUMBERS)
    # split list up into eachpage array.
    sizeOfArray = split(PAGENUMBERS, eachpage, ",")
    # if more than 1 page number
    if (sizeOfArray > 1) {
        # for each page number, compare it to previous number + 1
        p = 0 # flag indicates assignment to pagesAll
        # for loop starts at 2
        for (j = 2; j-1 <= sizeOfArray; ++j) {
            # start by saving first page in sequence (firstpage)
            # and loop until we find last page (lastpage)
            firstpage = eachpage[j-1]
            d = 0 # flag indicates consecutive numbers found
            # loop while page numbers are consecutive
            while ((eachpage[j-1]+1) == eachpage[j] || eachpage[j] ~ /^-/) {
                # remove "-" from troff-generated range
                if (eachpage[j] ~ /^-/) {
                    sub(/^-/, "", eachpage[j])
                }
                lastpage = eachpage[j]
                # increment counters
                ++d
                ++j
            } # end of while loop
            # use values of firstpage and lastpage to make range.
            if (d >= 1) {
                # there is a range
                pages = firstpage "-" lastpage
            }
            else # no range; only read firstpage
                pages = firstpage
            # assign range to pagesAll
            if (p == 0) {
                pagesAll = pages
                p = 1
            }
            else {
                pagesAll = pagesAll "," pages
            }
        } # end of for loop
        # assign pagesAll to listOfPages
        listOfPages = pagesAll
    } # end of sizeOfArray > 1
    else # only one page
        listOfPages = PAGENUMBERS
    # add space following comma
    gsub(/,/, ", ", listOfPages)
    # return changed list of page numbers
    return listOfPages
} # End of rangeOfPages function

This script consists of minimal BEGIN and END procedures. The main routine does the work of comparing primary and secondary keys. The first part of this routine assigns the fields to variables. The second field contains the secondary and tertiary keys and we use split() to separate them. Then we test that there is a tertiary key and set the flag isTertiary to either 1 or 0.
The next part of the main procedure contains the conditional expressions that look for identical keys. As we said in our discussion of the pseudocode for this part of the program, entries with wholly identical keys have already been removed by the pagenums.idx script.
The conditionals in this procedure determine what keys to output based on whether or not each is unique. If the primary key is unique, it is output, along with the rest of the entry. If the primary key matches the previous key, we compare secondary keys. If the secondary key is unique, then it is output, along with the rest of the entry. If the primary key matches the previous primary key, and the secondary key matches the previous secondary key, then the tertiary key must be unique. Then we only output the tertiary key, leaving the primary and secondary keys blank. The different forms are shown below:

primary
primary:secondary
:secondary
:secondary:tertiary
::tertiary
primary:secondary:tertiary

The main procedure is followed by two additional routines. The first of them is executed only when NF equals one. It deals with the first of the forms on the list above.
That is, there is no page number so we must output a newline to finish the entry.
The second procedure deals with all entries that have page numbers. This is the procedure where we call a function to take apart the list of page numbers and look for consecutive pages. It calls the numrange() function, whose main purpose is to deal with a multivolume index where a list of page numbers might look like:

I^35,55; II^200

This function calls split() using a semicolon delimiter to separate each volume. Then we call split() using a "^" delimiter to detach the volume number from the list of page numbers. Once we have the list of pages, we call a second function, rangeOfPages(), to look for consecutive numbers. On a single-book index, such as the sample shown in this chapter, the numrange() function really does nothing but call rangeOfPages(). We discussed the meat of the rangeOfPages() function earlier. The eachpage array is created and a while loop is used to go through the array comparing an element to the one previous. This function returns the list of pages.
Sample output from this program follows:

Xlib:6
:repainting canvas:88
Xlib.h header file:89, 294
Xv_Font type:310
XView:18
:about:3, 7, 10
:as object-oriented system:17
:compiling programs:41
:concept of windows differs from X:25
:data types; table of:20
:example of programming interface:44
:frames and subframes:26
:generic functions:21
:Generic Object:18, 24
:libraries:42
:notification:10, 35
:objects:23-24;
:: table of:20;
:: list of:43
:packages:18, 43
:programmer's model:17-23
:programming interface:41
:programs; initialization:45
:reserved names:43
:reserved prefixes:43
:structure of applications:41
:subwindows:28
:types:43
:window objects:25

In particular, notice the entry for "objects" under "XView." This is an example of a secondary key with multiple tertiary keys. It is also an example of an entry with a consecutive page range.
12.2.6 Formatting the Index
The previous scripts have done nearly all of the processing, leaving the list of entries in good order. The format.idx script, probably the easiest of the scripts, reads the list of entries and generates a report in two different formats, one for display on a terminal screen and one to send to troff for printing on a laser printer. Perhaps the only difficulty is that we output the entries grouped by each letter of the alphabet. A command-line argument sets the variable FMT that determines which of the two output formats is to be used.
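The masterindex shell script (presented in Appendix C) takes care of supplying FMT and MACDIR. To run format.idx by itself, one plausible invocation - my example, not taken from the shell script - uses awk's command-line variable assignments, which take effect before the input file is read:

nawk -f format.idx FMT=1 MACDIR=/usr/local/awklib indexout

Here indexout is a hypothetical file holding the output of combine.idx. Leaving off FMT=1 selects the screen format, since an unset FMT compares numerically equal to 0.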
Here's the full listing for format.idx:

#!/work/bin/nawk -f
# ------------------------------------------------
# format.idx -- prepare formatted index
# Author: Dale Dougherty
# Version 1.1 7/10/90
#
# input should be PRIMARY:SECONDARY:PAGE:VOLUME
# Args: FMT = 0 (default) format for screen
#       FMT = 1 output with troff macros
#       MACDIR = pathname of index troff macro file
# ------------------------------------------------
BEGIN {
    FS = ":"
    upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lower = "abcdefghijklmnopqrstuvwxyz"
}
# Output initial macros if troff FMT
NR == 1 && FMT == 1 {
    if (MACDIR)
        printf (".so %s/indexmacs\n", MACDIR)
    else
        printf (".so indexmacs\n")
    printf (".Se \"\" \"Index\"\n")
    printf (".XC\n")
} # end of NR == 1
# main routine - apply to all lines
# determine which fields to output
{
    # convert octal colon to "literal" colon
    # make sub for each field, not $0, so that fields are not parsed
    gsub(/\\72/, ":", $1)
    gsub(/\\72/, ":", $2)
    gsub(/\\72/, ":", $3)
    # assign field to variables
    PRIMARY = $1
    SECONDARY = $2
    TERTIARY = ""
    PAGE = $3
    if (NF == 2) {
        SECONDARY = ""
        PAGE = $2
    }
    # Look for empty fields to determine what to output
    if (! PRIMARY) {
        if (! SECONDARY) {
            TERTIARY = $3
            PAGE = $4
            if (FMT == 1)
                printf (".XF 3 \"%s", TERTIARY)
            else
                printf (" %s", TERTIARY)
        }
        else
            if (FMT == 1)
                printf (".XF 2 \"%s", SECONDARY)
            else
                printf (" %s", SECONDARY)
    }
    else { # if primary entry exists
        # extract first char of primary entry
        firstChar = substr($1, 1, 1)
        # see if it is in lower string.
        char = index(lower, firstChar)
        # char is an index to lower or upper letter
        if (char == 0) {
            # if char not found, see if it is upper
            char = index(upper, firstChar)
            if (char == 0)
                char = prevChar
        }
        # if new char, then start group for new letter of alphabet
        if (char != prevChar) {
            if (FMT == 1)
                printf(".XF A \"%s\"\n", substr(upper, char, 1))
            else
                printf("\n\t\t%s\n", substr(upper, char, 1))
            prevChar = char
        }
        # now output primary and secondary entry
        if (FMT == 1)
            if (SECONDARY)
                printf (".XF 1 \"%s\" \"%s", PRIMARY, SECONDARY)
            else
                printf (".XF 1 \"%s\" \"", PRIMARY)
        else
            if (SECONDARY)
                printf ("%s, %s", PRIMARY, SECONDARY)
            else
                printf ("%s", PRIMARY)
    }
    # if page number, call pageChg to replace "^" with ":"
    # for multi-volume page lists.
    if (PAGE) {
        if (FMT == 1) {
            # added to omit comma after bold entry
            if (! SECONDARY && ! TERTIARY)
                printf ("%s\"", pageChg(PAGE))
            else
                printf (", %s\"", pageChg(PAGE))
        }
        else
            printf (", %s", pageChg(PAGE))
    }
    else if (FMT == 1)
        printf("\"")
    printf ("\n")
} # End of main routine
# Supporting function
# pageChg -- convert "^" to ":" in list of volume^page
# Arg: pagelist -- list of numbers
function pageChg(pagelist) {
    gsub(/\^/, ":", pagelist)
    if (FMT == 1) {
        gsub(/[1-9]+\*/, "\\fB&\\fP", pagelist)
        gsub(/\*/, "", pagelist)
    }
    return pagelist
} # End of pageChg function

The BEGIN procedure defines the field separator and the strings upper and lower. The next procedure is one that outputs the name of the file that contains the troff index macro definitions. The name of the macro directory can be set from the command line as the second argument. The main procedure begins by converting the "hidden" colon to a literal colon. Note that we apply the gsub() function to each field rather than the entire line because doing the latter would cause the line to be reevaluated and the current order of fields would be disturbed. Next we assign the fields to variables and then test to see whether the field is empty. If the primary key is not defined, then we see if the secondary key is defined.
If it is, we output it. If it is not, then we output a tertiary key. If the primary key is defined, then we extract its first character and then see if we find it in the lower string. firstChar = substr($1, 1, 1) char = index(lower, firstChar) The char variable holds the position of the letter in the string. If this number is greater than or equal to 1, then we also have an index into the upper string. We compare each entry and while char and prevChar are the same, the current letter of the alphabet is unchanged. Once they differ, first we check for the letter in the upper string. If char is a new letter, we output a centered string that identifies that letter of the alphabet. Then we look at outputting the primary and secondary entries. Finally, the list of page numbers is output, after calling the pageChg() function to replace the "^" in volume-page references with a colon. Sample screen output produced by format.idx is shown below: X X Protocol, 6 X Window System, events, 84 extensibility, 9 interclient communications, 9 overview, 3 protocol, 6 role of window manager, 9 server and client relationship, 5 software hierarchy, 6 toolkits, 7 X Window ID for paint window, 87 Xlib, 6 XFontStruct structure, 317 Xlib, 6 repainting canvas, 88 Xlib.h header file, 89, 294 Xv_Font type, 310 XView, 18 about, 3, 7, 10 as object-oriented system, 17 compiling programs, 41 concept of windows differs from X, 25 data types; table of, 20 example of programming interface, 44 frames and subframes, 26 generic functions, 21 Generic Object, 18, 24 Sample troff output produced by format.idx is shown below: .XF A "X" .XF 1 "X Protocol" "6" .XF 1 "X Window System" "events, 84" .XF 2 "extensibility, 9" .XF 2 "interclient communications, 9" .XF 2 "overview, 3" .XF 2 "protocol, 6" .XF 2 "role of window manager, 9" .XF 2 "server and client relationship, 5" .XF 2 "software hierarchy, 6" .XF 2 "toolkits, 7" .XF 2 "X Window ID for paint window, 87" .XF 2 "Xlib, 6" .XF 1 "XFontStruct structure" "317" .XF 1 "Xlib" "6" .XF 2 "repainting canvas, 88" .XF 1 "Xlib.h header file" "89, 294" .XF 1 "Xv_Font type" "310" .XF 1 "XView" "18" .XF 2 "about, 3, 7, 10" .XF 2 "as object-oriented system, 17" This output must be formatted by troff to produce a printed version of the index. The index of this book was originally done using the masterindex program. 12.2.6.1 The masterindex shell script The masterindex shell script is the glue that holds all of these scripts together and invokes them with the proper options based on the user's command line. For instance, the user enters: $ masterindex -s -m volume1 volume2 to specify that a master index be created from the files volume1 and volume2 and that the output be sent to the screen. The masterindex shell script is presented in Appendix C with the documentation. 12.1 An Interactive Spelling Checker 12.3 Spare Details of the masterindex Program Chapter 12 Full-Featured Applications 12.3 Spare Details of the masterindex Program This section presents a few interesting details of the masterindex program that might otherwise escape attention. The purpose of this section is to extract some interesting program fragments and show how they solve a particular problem. 12.3.1 How to Hide a Special Character Our first fragment is from the input.idx script, whose job it is to standardize the index entries before they are sorted. This program takes as its input a record consisting of two tab-separated fields: the index entry and its page number. 
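For instance, a single input record to input.idx might look like this (a hypothetical entry patterned on the sample index shown earlier; the gap before the page number is a tab):

XView: programming interface	41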
A colon is used as part of the syntax for indicating the parts of an index entry. Because the program uses a colon as a special character, we must provide a way to pass a literal colon through the program. To do this, we allow the indexer to specify two consecutive colons in the input. However, we can't simply convert the sequence to a literal colon because the rest of the program modules called by masterindex read three colon-separated fields. The solution is to convert the colon to its octal value using the gsub() function.

#< from input.idx
# convert literal colon to octal value
$1 ~ /::/ {
    # substitute octal value for "::"
    gsub(/::/, "\\72", $1)
}

"\\72" represents the octal value of a colon. (You can find this value by scanning a table of hexadecimal and octal equivalents in the file /usr/pub/ascii.) In the last program module, we use gsub() to convert the octal value back to a colon. Here's the code from format.idx.

#< from format.idx
# convert octal colon to "literal" colon
# make sub for each field, not $0, so that fields are not parsed
gsub(/\\72/, ":", $1)
gsub(/\\72/, ":", $2)
gsub(/\\72/, ":", $3)

The first thing you notice is that we make this substitution for each of the three fields separately, instead of having one substitution command that operates on $0. The reason for this is that the input fields are colon-separated. When awk scans an input line, it breaks the line into fields. If you change the contents of $0 at any point in the script, awk will reevaluate the value of $0 and parse the line into fields again. Thus, if you have three fields prior to making the substitution, and the substitution makes one change, adding a colon to $0, then awk will recognize four fields. By doing the substitution for each field, we avoid having the line parsed again into fields.

12.3.2 Rotating Two Parts

Above we talked about the colon syntax for separating the primary and secondary keys. With some kinds of entries, it makes sense to classify the item under its secondary key as well. For instance, we might have a group of program statements or user commands, such as "sed command." The indexer might create two entries: one for "sed command" and one for "command: sed." To make coding this kind of entry easier, we implemented a coding convention that uses a tilde (~) character to mark the two parts of this entry so that the first and second part can be swapped to create the second entry automatically.[5] Thus, coding the following index entry

.XX "sed~command"

produces two entries:

sed command 43
command: sed 43

[5] The idea of rotating index entries was derived from The AWK Programming Language. There, however, an entry is automatically rotated where a blank is found; the tilde is used to prevent a rotation by "filling in" the space. Rather than have rotation be the default action, we use a different coding convention, where the tilde indicates where the rotation should occur.

Here's the code that rotates entries.

#< from input.idx
# Match entries that need rotating that contain a single tilde
$1 ~ /~/ && $1 !~ /~~/ {
    # split first field into array named subfield
    n = split($1, subfield, "~")
    if (n == 2) {
        # print entry without "~" and then rotated
        printf("%s %s::%s\n", subfield[1], subfield[2], $2)
        printf("%s:%s:%s\n", subfield[2], subfield[1], $2)
    }
    next
}

The pattern-matching rule matches any entry containing a tilde but not two consecutive tildes, which indicate a literal tilde. The procedure uses the split() function to break the first field into two "subfields."
This gives us two substrings, one before and one after the tilde. The original entry is output and then the rotated entry is output, both using the printf statement. Because the tilde is used as a special character, we use two consecutive tildes to represent a literal tilde in the input. The following code occurs in the program after the code that swaps the two parts of an entry.

#< from input.idx
# Match entries that contain two tildes
$1 ~ /~~/ {
    # replace ~~ with ~
    gsub(/~~/, "~", $1)
}

Unlike the colon, which retains a special meaning throughout the masterindex program, the tilde has no significance after this module so we can simply output a literal tilde.

12.3.3 Finding a Replacement

The next fragment also comes from input.idx. The problem was to look for two colons separated by text and change the second colon to a semicolon. If the input line contains

class: class initialize: (see also methods)

then the result is:

class: class initialize; (see also methods)

The problem is fairly simple to formulate - we want to change the second colon, not the first one. It is pretty easy to solve in sed because of the ability to select and recall a portion of what is matched in the replacement section (using \(...\) to surround the portion to match and \1 to recall the first portion). Lacking the same ability in awk, you have to be more clever. Here's one possible solution:

#< from input.idx
# replace 2nd colon with semicolon
if (sub(/:.*:/, "&;", $1))
    sub(/:;/, ";", $1)

The first substitution matches the entire span between two colons. It makes a replacement with what is matched (&) followed by a semicolon. This substitution occurs within a conditional expression that evaluates the return value of the sub() function. Remember, this function returns 1 if a substitution is made - it does not return the resulting string. In other words, if we make the first substitution, then we make the second one. The second substitution replaces ":;" with ";". Because we can't make the replacement directly, we do it indirectly by making the context in which the second colon appears distinct.

12.3.4 A Function for Reporting Errors

The purpose of the input.idx program is to allow variations (or less kindly, inconsistencies) in the coding of index entries. By reducing these variations to one basic form, the other programs are made easier to write. The other side is that if the input.idx program cannot accept an entry, it must report it to the user and drop the entry so that it does not affect the other programs. The input.idx program has a function used for error reporting called printerr(), as shown below:

function printerr (message) {
    # print message, record number and record
    printf("ERROR:%s (%d) %s\n", message, NR, $0) > "/dev/tty"
}

This function makes it easier to report errors in a standard manner. It takes as an argument a message, which is usually a string that describes the error. It outputs this message along with the record number and the record itself. The output is directed to the user's terminal "/dev/tty." This is a good practice since the standard output of the program might be, as it is in this case, directed to a pipe or to a file. We could also send the error message to standard error, like so:

print "ERROR:" message " (" NR ") " $0 | "cat 1>&2"

This opens a pipe to cat, with cat's standard output redirected to the standard error.
If you are using gawk, mawk, or the Bell Labs awk, you could instead say:

printf("ERROR:%s (%d) %s\n", message, NR, $0) > "/dev/stderr"

In the program, the printerr() function is called as follows:

printerr("No page number")

When this error occurs, the user sees the following error message:

ERROR:No page number (612) geometry management: set_values_almost

12.3.5 Handling See Also Entries

One type of index entry is a "see also." Like a "see" reference, it refers the reader to another entry. However, a "see also" entry may have a page number as well. In other words, this entry contains information of its own but refers the reader elsewhere for additional information. Here are a few sample entries.

error procedure 34
error procedure (see also XtAppSetErrorMsgHandler) 35
error procedure (see also XtAppErrorMsg)

The first entry in this sample has a page number while the last one does not. When the input.idx program finds a "see also" entry, it checks to see if a page number ($2) is supplied. If there is one, it outputs two records, the first of which is the entry without the page number and the second of which is an entry and page number without the "see also" reference.

#< input.idx
# if no page number
if ($2 == "") {
    print $0 ":"
    next
}
else { # output two entries:
    # print See Also entry w/out page number
    print $1 ":"
    # remove See Also
    sub(/ *~zz\(see also.*$/, "", $1)
    sub(/;/, "", $1)
    # print as normal entry
    if ( $1 ~ /:/ )
        print $1 ":" $2
    else
        print $1 "::" $2
    next
}

The next problem to be solved was how to get the entries sorted in the proper order. The sort program, using the options we gave it, sorted the secondary keys for "see also" entries together under "s." (The -d option causes the parenthesis to be ignored.) To change the order of the sort, we alter the sort key by adding the sequence "~zz" to the front of it.

#< input.idx
# add "~zz" for sort at end
sub(/\([Ss]ee [Aa]lso/, "~zz(see also", $1)

The tilde is not interpreted by the sort but it helps us identify the string later when we remove it. Adding "~zz" assures us of sorting to the end of the list of secondary or tertiary keys. The pagenums.idx script removes the sort string from "see also" entries. However, as we described earlier, we look for a series of "see also" entries for the same key and create a list. Therefore, we also remove that which is the same for all entries, and put the reference itself in an array:

#< pagenums.idx
# remove secondary key along with "~zz"
sub(/^.*~zz\([Ss]ee +[Aa]lso */, "", SECONDARY)
sub(/\) */, "", SECONDARY)
# assign to next element of seeAlsoList
seeAlsoList[++eachSeeAlso] = SECONDARY "; "

There is a function that outputs the list of "see also" entries, separating each of them by a semicolon. Thus, the output of the "see also" entry by pagenums.idx looks like:

error procedure:(see also XtAppErrorMsg; XtAppSetErrorHandler.)

12.3.6 Alternative Ways to Sort

In this program, we chose not to support troff font and point size requests in index entries. If you'd like to support special escape sequences, one way to do so is shown in The AWK Programming Language. For each record, take the first field and prepend it to the record as the sort key. Now that there is a duplicate of the first field, remove the escape sequences from the sort key. Once the entries are sorted, you can remove the sort key. This process prevents the escape sequences from disturbing the sort. Yet another way is to do something similar to what we did for "see also" entries.
Because special characters are ignored in the sort, we could use the input.idx program to convert a troff font change sequence such as "\fB" to "~~~" and "\fI" to "~~~~," or any convenient escape sequence. This would get the sequence through the sort program without disturbing the sort. (This technique was used by Steve Talbott in his original indexing script.) The only additional problem that needs to be recognized in both cases is that two entries for the same term, one with font information and one without, will be treated as different entries when one is compared to the other.

Chapter 13
13. A Miscellany of Scripts

Contents:
uutot.awk - Report UUCP Statistics
phonebill - Track Phone Usage
combine - Extract Multipart uuencoded Binaries
mailavg - Check Size of Mailboxes
adj - Adjust Lines for Text Files
readsource - Format Program Source Files for troff
gent - Get a termcap Entry
plpr - lpr Preprocessor
transpose - Perform a Matrix Transposition
m1 - Simple Macro Processor

This chapter contains a miscellany of scripts contributed by Usenet users. Each program is introduced with a brief description by the program's author. Our comments are placed inside brackets [like this]. Then the full program listing is shown. If the author did not supply an example, we generate one and describe it after the listing. Finally, in a section called "Program Notes," we talk briefly about the program, highlighting some interesting points. Here is a summary of the scripts:

uutot.awk    Report UUCP statistics.
phonebill    Track phone usage.
combine      Extract multipart uuencoded binaries.
mailavg      Check size of mailboxes.
adj          Adjust lines for text files.
readsource   Format program source files for troff.
gent         Get a termcap entry.
plpr         lpr preprocessor.
transpose    Perform a matrix transposition.
m1           A very simple macro processor.

13.1 uutot.awk - Report UUCP Statistics

Contributed by Roger A. Cornelius

Here's something I wrote in nawk in response to all the C versions of the same thing which were posted to alt.sources awhile back. Basically, it summarizes statistics of uucp connections (connect time, throughput, files transmitted, etc.). It only supports HDB-style log files, but will show statistics on a site-by-site, or on an overall (all sites), basis. [It also works with /usr/spool/uucp/SYSLOG.] I use a shell wrapper which calls "awk -f" to run this, but it's not necessary. Usage information is in the header. (Sorry about the lack of comments.)
# @(#) uutot.awk - display uucp statistics - requires new awk
# @(#) Usage: awk -f uutot.awk [site ...] /usr/spool/uucp/.Admin/xferstats
# Author: Roger A. Cornelius (rac@sherpa.uucp)
#
# dosome[];  # site names to work for - all if not set
# remote[];  # array of site names
# bytes[];   # bytes xmitted by site
# time[];    # time spent by site
# files[];   # files xmitted by site

BEGIN {
    doall = 1;
    if (ARGC > 2) {
        doall = 0;
        for (i = 1; i < ARGC-1; i++) {
            dosome[ ARGV[i] ];
            ARGV[i] = "";
        }
    }
    kbyte = 1024    # 1000 if you're not picky
    bang = "!";
    sending = "->";
    xmitting = "->" "|" "<-";
    hdr1 = "Remote K-Bytes K-Bytes K-Bytes " \
           "Hr:Mn:Sc Hr:Mn:Sc AvCPS AvCPS # #\n";
    hdr2 = "SiteName Recv Xmit Total " \
           "Recv Xmit Recv Xmit Recv Xmit\n";
    hdr3 = "-------- --------- --------- --------- -------- " \
           "-------- ----- ----- ---- ----";
    fmt1 = "%-8.8s %9.3f %9.3f %9.3f %2d:%02d:%02.0f " \
           "%2d:%02d:%02.0f %5.0f %5.0f %4d %4d\n";
    fmt2 = "Totals %9.3f %9.3f %9.3f %2d:%02d:%02.0f " \
           "%2d:%02d:%02.0f %5.0f %5.0f %4d %4d\n";
}
{
    if ($6 !~ xmitting)    # should never be
        next;
    direction = ($6 == sending ? 1 : 2)
    site = substr($1,1,index($1,bang)-1);
    if (site in dosome || doall) {
        remote[site];
        bytes[site,direction] += $7;
        time[site,direction] += $9;
        files[site,direction]++;
    }
}
END {
    print hdr1 hdr2 hdr3;
    for (k in remote) {
        rbyte += bytes[k,2]; sbyte += bytes[k,1];
        rtime += time[k,2]; stime += time[k,1];
        rfiles += files[k,2]; sfiles += files[k,1];
        printf(fmt1, k, bytes[k,2]/kbyte, bytes[k,1]/kbyte,
            (bytes[k,2]+bytes[k,1])/kbyte,
            time[k,2]/3600, (time[k,2]%3600)/60, time[k,2]%60,
            time[k,1]/3600, (time[k,1]%3600)/60, time[k,1]%60,
            bytes[k,2] && time[k,2] ? bytes[k,2]/time[k,2] : 0,
            bytes[k,1] && time[k,1] ? bytes[k,1]/time[k,1] : 0,
            files[k,2], files[k,1]);
    }
    print hdr3
    printf(fmt2, rbyte/kbyte, sbyte/kbyte, (rbyte+sbyte)/kbyte,
        rtime/3600, (rtime%3600)/60, rtime%60,
        stime/3600, (stime%3600)/60, stime%60,
        rbyte && rtime ? rbyte/rtime : 0,
        sbyte && stime ? sbyte/stime : 0,
        rfiles, sfiles);
}

A test file was generated to test Cornelius' program. Here are a few lines extracted from /usr/spool/uucp/.Admin/xferstats (because each line in this file is too long to print on a page, we have broken the line following the directional arrow for display purposes only):

isla!nuucp S (8/3-16:10:17) (C,126,25) [ttyi1j] ->
    1131 / 4.880 secs, 231 bytes/sec
isla!nuucp S (8/3-16:10:20) (C,126,26) [ttyi1j] ->
    149 / 0.500 secs, 298 bytes/sec
isla!sue S (8/3-16:10:49) (C,126,27) [ttyi1j] ->
    646 / 25.230 secs, 25 bytes/sec
isla!sue S (8/3-16:10:52) (C,126,28) [ttyi1j] ->
    145 / 0.510 secs, 284 bytes/sec
uunet!uisla M (8/3-16:15:50) (C,951,1) [cui1a] ->
    1191 / 0.660 secs, 1804 bytes/sec
uunet!uisla M (8/3-16:15:53) (C,951,2) [cui1a] ->
    148 / 0.080 secs, 1850 bytes/sec
uunet!uisla M (8/3-16:15:57) (C,951,3) [cui1a] ->
    1018 / 0.550 secs, 1850 bytes/sec
uunet!uisla M (8/3-16:16:00) (C,951,4) [cui1a] ->
    160 / 0.070 secs, 2285 bytes/sec
uunet!daemon M (8/3-16:16:06) (C,951,5) [cui1a] <-
    552 / 2.740 secs, 201 bytes/sec
uunet!daemon M (8/3-16:16:09) (C,951,6) [cui1a] <-
    102 / 1.390 secs, 73 bytes/sec

Note that there are 12 fields; however, the program really only uses fields 1, 6, 7, and 9.
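As a sketch of how those four fields drive the main procedure (this simply restates the assignments in the listing above, with our comments added):

site = substr($1, 1, index($1, bang) - 1)   # "isla!nuucp" yields "isla"
direction = ($6 == sending ? 1 : 2)         # 1 = sent, 2 = received
bytes[site,direction] += $7                 # byte count
time[site,direction] += $9                  # elapsed seconds

The remaining fields are never referenced.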
Running the program on the sample input produces the following results: $ nawk -f uutot.awk uutot.test Remote K-Bytes K-Bytes K-Bytes Hr:Mn:Sc Hr:Mn:Sc AvCPS AvCPS # # SiteName Recv Xmit Total Recv Xmit Recv Xmit Recv Xmit -------- --------- --------- --------- -------- -------- ----- ----- ---- ---- uunet 0.639 2.458 3.097 0:04:34 2:09:49 2 0 2 4 isla 0.000 2.022 2.022 0:00:00 0:13:58 0 2 0 4 -------- --------- --------- --------- -------- -------- ----- ----- ---- ---- Totals 0.639 4.480 5.119 0:04:34 2:23:47 2 1 2 8 13.1.1 Program Notes for uutot.awk This nawk application is an excellent example of a clearly written awk program. It is also a typical example of using awk to change a rather obscure UNIX log into a useful report. Although Cornelius apologizes for the lack of comments that explain the logic of the program, the usage of the program is clear from the initial comments. Also, he uses variables to define search patterns and the report's layout. This helps to simplify conditional and print statements in the body of the program. It also helps that the variables have names which aid in immediately recognizing their purpose. This program has a three-part structure, as we emphasized in Chapter 7, Writing Scripts for awk. It consists of a BEGIN procedure, in which variables are defined; the body, in which each line of data from the log file is processed; and the END procedure, in which the output for the report is generated. 12.3 Spare Details of the masterindex Program 13.2 phonebill - Track Phone Usage Chapter 13 A Miscellany of Scripts 13.2 phonebill - Track Phone Usage Contributed by Nick Holloway The problem is to calculate the cost of phone calls made. In the United Kingdom, charges are made for the number of "units" used during the duration of the call (no free local calls). The length of time a "unit" lasts depends on the charge band (linked to distance) and the charge rate (linked to time of day). You get charged a whole unit as soon as the time period begins. The input to the program is four fields. The first field is the date (not used). The second field is "band/rate" and is used to look up the length a unit will last. The third field is the length of the call. This can either be "ss," "mm:ss," or "hh:mm:ss". The fourth field is the name of the caller. We keep a stopwatch (old cheap digital), a book, and a pen. Come bill time this is fed through my awk script. This only deals with the cost of the calls, not the standing charge. The aim of the program was to enable the minimum amount of information to be entered by the callers, and the program could be used to collect together the call costs for each user in one report. It is also written so that if British Telecom changes its costs, these can be done easily in the top of the source (this has been done once already). If more charge bands or rates are added, the table can be simply expanded (wonders of associative arrays). There are no real sanity checks done on the input data. The usage is: phonebill [ file ... ] Here is a (short) sample of input and output. 
Input:

29/05 b/p 5:35 Nick
29/05 L/c 1:00:00 Dale
01/06 L/c 30:50 Nick

Output:

Summary for Dale:
    29/05  L/c 1:00:00  11 units
Total: 11 units @ 5.06 pence per unit = $0.56

Summary for Nick:
    29/05  b/p    5:35  19 units
    01/06  L/c   30:50   6 units
Total: 25 units @ 5.06 pence per unit = $1.26

The listing for phonebill follows:

#!/bin/awk -f
#------------------------------------------------------------------
# Awk script to take in phone usage - and calculate cost for each
# person
#------------------------------------------------------------------
# Author: N.Holloway (alfie@cs.warwick.ac.uk)
# Date  : 27 January 1989
# Place : University of Warwick
#------------------------------------------------------------------
# Entries are made in the form
#     Date Type/Rate Length Name
#
# Format:
#     Date      : "dd/mm" - one word
#     Type/Rate : "bb/rr" (e.g. L/c)
#     Length    : "hh:mm:ss", "mm:ss", "ss"
#     Name      : "Fred" - one word (unique)
#------------------------------------------------------------------
# Charge information kept in array 'c', indexed by "type/rate",
# and the cost of a unit is kept in the variable 'pence_per_unit'
# The info is stored in two arrays, both indexed by the name. The
# first 'summary' has the lines that hold input data, and number
# of units, and 'units' has the cumulative total number of units
# used by name.
#------------------------------------------------------------------
BEGIN \
{
    # --- Cost per unit
    pence_per_unit = 4.40    # cost is 4.4 pence per unit
    pence_per_unit *= 1.15   # VAT is 15%
    # --- Table of seconds per unit for different bands/rates
    #     [ not applicable have 0 entered as value ]
    c ["L/c"] = 330 ; c ["L/s"] = 85.0; c ["L/p"] = 60.0;
    c ["a/c"] = 96  ; c ["a/s"] = 34.3; c ["a/p"] = 25.7;
    c ["b1/c"]= 60.0; c ["b1/s"]= 30.0; c ["b1/p"]= 22.5;
    c ["b/c"] = 45.0; c ["b/s"] = 24.0; c ["b/p"] = 18.0;
    c ["m/c"] = 12.0; c ["m/s"] = 8.00; c ["m/p"] = 8.00;
    c ["A/c"] = 9.00; c ["A/s"] = 7.20; c ["A/p"] = 0   ;
    c ["A2/c"]= 7.60; c ["A2/s"]= 6.20; c ["A2/p"]= 0   ;
    c ["B/c"] = 6.65; c ["B/s"] = 5.45; c ["B/p"] = 0   ;
    c ["C/c"] = 5.15; c ["C/s"] = 4.35; c ["C/p"] = 3.95;
    c ["D/c"] = 3.55; c ["D/s"] = 2.90; c ["D/p"] = 0   ;
    c ["E/c"] = 3.80; c ["E/s"] = 3.05; c ["E/p"] = 0   ;
    c ["F/c"] = 2.65; c ["F/s"] = 2.25; c ["F/p"] = 0   ;
    c ["G/c"] = 2.15; c ["G/s"] = 2.15; c ["G/p"] = 2.15;
}
{
    spu = c [ $2 ]    # look up charge band
    if ( spu == "" || spu == 0 ) {
        summary [ $4 ] = summary [ $4 ] "\n\t" \
            sprintf ( "%4s %4s %7s ? units",\
            $1, $2, $3 ) \
            " - Bad/Unknown Chargeband"
    } else {
        n = split ( $3, t, ":" )    # calculate length in seconds
        seconds = 0
        for ( i = 1; i <= n; i++ )
            seconds = seconds*60 + t[i]
        u = seconds / spu           # calculate number of units
        if ( int( u ) == u )        # round up to next whole unit
            u = int( u )
        else
            u = int( u ) + 1
        units [ $4 ] += u           # store info to output at end
        summary [ $4 ] = summary [ $4 ] "\n\t" \
            sprintf ( "%4s %4s %7s %3d units",\
            $1, $2, $3, u )
    }
}
END \
{
    for ( i in units ) {    # for each person
        printf ( "Summary for %s:", i )    # newline at start
                                           # of summary
        print summary [ i ]    # print summary details
        # calc cost
        total = int ( units[i] * pence_per_unit + 0.5 )
        printf ( \
            "Total: %d units @ %.2f pence per unit = $%d.%02d\n\n", \
            units [i], pence_per_unit, total/100, \
            total%100 )
    }
}

13.2.1 Program Notes for phonebill

This program is another example of generating a report that consolidates information from a simple record structure. This program also follows the three-part structure. The BEGIN procedure defines variables that are used throughout the program.
This makes it easy to change the program, as phone companies are known to "upwardly revise" their rates. One of the variables is a large array named c in which each element is the number of seconds per unit, using the band over the rate as the index to the array. The main procedure reads each line of the user log. It uses the second field, which identifies the band/rate, to get a value from the array c. It checks that a positive value was returned and then processes that value by the time specified in $3. The number of units for that call is then stored in an array named units, indexed by the name of the caller ($4). This value accumulates for each caller. Finally, the END routine prints out the values in the units array, producing the report of units used per caller and the total cost of the calls.

Chapter 13 A Miscellany of Scripts

13.3 combine - Extract Multipart uuencoded Binaries

Contributed by Rahul Dhesi

Of all the scripts I have ever written, the one I am most proud of is the "combine" script. While I was moderating comp.binaries.ibm.pc, I wanted to provide users a simple way of extracting multipart uuencoded binaries. I added BEGIN and END headers to each part to enclose the uuencoded part and provided users with the following script:

cat $* | sed '/^END/,/^BEGIN/d' | uudecode

This script will accept a list of filenames (in order) provided as command-line arguments. It will also accept concatenated articles as standard input. This script invokes cat in a very useful way that is well known to expert shell script users but not enough used by most others. This allows the user the choice of either providing command-line arguments or standard input. The script invokes sed to strip out superfluous headers and trailers, except for headers in the first input file and trailers in the last input file. The final result is that the uuencoded part of the multiple input files is extracted and uudecoded. Each input file (see postings in comp.binaries.ibm.pc) has the following form:

headers
BEGIN
uuencoded text
END

I have lots of other shell stuff, but the above is simplest and has proved useful to several thousand comp.binaries.ibm.pc readers.

13.3.1 Program Notes for combine

This one is pretty obvious but accomplishes a lot. For those who might not understand the use of this command, here is the explanation. A Usenet newsgroup such as comp.binaries.ibm.pc distributes public-domain programs and such. Binaries, the object code created by the compiler, cannot be distributed as news articles unless they are "encoded." A program named uuencode converts the binary to an ASCII representation that can be easily distributed. Furthermore, there are limits on the size of news articles and large binaries are broken up into a series of articles (1 of 3, 2 of 3, 3 of 3, for example). Dhesi would break up the encoded binary into manageable chunks, and then add the BEGIN and END lines to delimit the text that contained encoded binary. A reader of these articles might save each article in a file. Dhesi's script automates the process of combining these articles and removing extraneous information such as the article header as well as the extra BEGIN and END headers. His script removes lines from the first END up to and including the next BEGIN pattern. It combines all the separate encoded parcels and directs them to uudecode, which converts the ASCII representation to binary.
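To picture what the sed range does, assume two saved articles have been concatenated (a schematic sketch; "article headers" stands in for the real Usenet headers):

article headers
BEGIN
first half of the encoded text
END
article headers
BEGIN
second half of the encoded text
END

The first range match deletes everything from the END of part one through the BEGIN of part two, inclusive, removing the trailer of one part and the headers of the next. The final END opens a range that has no closing BEGIN, so sed deletes from there to the end of the input. What remains is the leading headers and BEGIN line, which uudecode ignores, followed by one continuous stream of encoded text.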
One has to appreciate the amount of manual editing work avoided by a simple one-line script. 13.2 phonebill - Track Phone Usage 13.4 mailavg - Check Size of Mailboxes Chapter 13 A Miscellany of Scripts 13.4 mailavg - Check Size of Mailboxes Contributed by Wes Morgan While tuning our mail system, we needed to take a "snapshot" of the users' mailboxes at regular intervals over a 30-day period. This script simply calculates the average size and prints the arithmetic distribution of user mailboxes. #! /bin/sh # # mailavg - average size of files in /usr/mail # # Written by Wes Morgan, morgan@engr.uky.edu, 2 Feb 90 ls -Fs /usr/mail | awk ' { if(NR != 1) { total += $1; count += 1; size = $1 + 0; if(size == 0) zercount+=1; if(size > 0 && size <= 10) tencount+=1; if(size > 10 && size <= 19) teencount+=1; if(size > 20 && size <= 50) uptofiftycount+=1; if(size > 50) overfiftycount+=1; } } END { printf("/usr/mail has %d mailboxes using %d blocks,", count,total) printf("average is %6.2f blocks\n", total/count) printf("\nDistribution:\n") printf("Size Count\n") printf(" O %d\n",zercount) printf("1-10 %d\n",tencount) printf("11-20 %d\n",teencount) printf("21-50 %d\n",uptofiftycount) printf("Over 50 %d\n",overfiftycount) }' exit 0 Here's a sample output from mailavg: $ mailavg /usr/mail has 47 mailboxes using 5116 blocks, average is 108.85 blocks Distribution: Size Count O 1 1-10 13 11-20 1 21-50 5 Over 50 27 13.4.1 Program Notes for mailavg This administrative program is similar to the filesum program in Chapter 7. It processes the output of the ls command. The conditional expression "NR != 1" could have been put outside the main procedure as a pattern. While the logic is the same, using the expression as a pattern clarifies how the procedure is accessed, making the program easier to understand. In that procedure, Morgan uses a series of conditionals that allow him to collect distribution statistics on the size of each user's mailbox. 13.3 combine - Extract Multipart uuencoded Binaries 13.5 adj - Adjust Lines for Text Files Chapter 13 A Miscellany of Scripts 13.5 adj - Adjust Lines for Text Files Contributed by Norman Joseph [Because the author used his program to format his mail message before sending it, we're preserving the linebreaks and indented paragraphs in presenting it here as the program's example. This program is similar to the BSD fmt program.] Well, I decided to take you up on your offer. I'm sure there are more sophisticated gurus out there than me, but I do have a nawk script that I'm kind of fond of, so I'm sending it in. Ok, here's the low down. When I'm writing e-mail, I often make a lot of changes to the text (especially if I'm going to post on the net). So what starts out as a nicely adjusted letter or posting usually ends up looking pretty sloppy by the time I'm done adding and deleting lines. So I end up spending a lot of time joining and breaking lines all through my document so as to get a nice right-hand margin. So I say to myself, "This is just the kind of tedious work a program would be good for." Now, I know I can use nroff to filter my document through and adjust the lines, but it has lousy defaults (IMHO) for simple text like this. So, with a view to sharpening my nawk skills I wrote adj.nawk and the accompanying shell script wrapper adj. Here's the syntax for the nawk filter adj: adj [-l|c|r|b] [-w n] [-i n] [files ...] The options are: -l Lines are left adjusted, right ragged (default). -c Lines are centered. -r Lines are right adjusted, left ragged. 
-b Lines are left and right adjusted. -w n Sets line width to n characters (default is 70). -i n Sets initial indent to n characters (default is 0). So, whenever I'm finished with this letter (I'm using vi) I will give the command :%!adj -w73 (I like my lines a little longer) and all the breaking and joining will be done by a program (the way the Good Lord intended :-). Indents and blank lines are preserved, and two spaces are given after any end-of-sentence punctuation. The program is naive about tabs, and when computing line lengths, it considers a tab character to be one space wide. The program is notable for its use of command-line parameter assignment, and some of the newer features of awk (nawk), such as the match and split built-in functions, and for its use of support functions. #! /bin/sh # # adj - adjust text lines # # usage: adj [-l|c|r|b] [-w n] [-i n] [files ...] # # options: # -l - lines are left adjusted, right ragged (default) # -c - lines are centered # -r - lines are right adjusted, left ragged # -b - lines are left and right adjusted # -w n - sets line width to characters (default: 70) # -i n - sets initial indent to characters (default: 0) # # note: # output line width is -w setting plus -i setting # # author: # Norman Joseph (amanue!oglvee!norm) adj=l wid=70 ind=0 set -- `getopt lcrbw:i: $*` if test $? != 0 then printf 'usage: %s [-l|c|r|b] [-w n] [-i n] [files ...]' $0 exit 1 fi for arg in $* do case $arg in -l) adj=l; shift;; -c) adj=c; shift;; -r) adj=r; shift;; -b) adj=b; shift;; -w) wid=$2; shift 2;; -i) ind=$2; shift 2;; --) shift; break;; esac done exec nawk -f adj.nawk type=$adj linelen=$wid indent=$ind $* Here's the adj.nawk script that's called by the shell script adj. # adj.nawk -- adjust lines of text per options # # NOTE: this nawk program is called from the shell script "adj" # see that script for usage & calling conventions # # author: # Norman Joseph (amanue!oglvee!norm) BEGIN { FS = "\n" blankline = "^[ \t]*$" startblank = "^[ \t]+[^ \t]+" startwords = "^[^ \t]+" } $0 ~ blankline { if ( type == "b" ) putline( outline "\n" ) else putline( adjust( outline, type ) "\n" ) putline( "\n" ) outline = "" } $0 ~ startblank { if ( outline != "" ) { if ( type == "b" ) putline( outline "\n" ) else putline( adjust( outline, type ) "\n" ) } firstword = "" i = 1 while ( substr( $0, i, 1 ) ~ "[ \t]" ) { firstword = firstword substr( $0, i, 1 ) i++ } inline = substr( $0, i ) outline = firstword nf = split( inline, word, "[ \t]+" ) for ( i = 1; i <= nf; i++ ) { if ( i == 1 ) { testlen = length( outline word[i] ) } else { testlen = length( outline " " word[i] ) if ( match( ".!?:;", "\\" substr( outline, length( outline ), 1 )) ) testlen++ } if ( testlen > linelen ) { putline( adjust( outline, type ) "\n" ) outline = "" } if ( outline == "" ) outline = word[i] else if ( i == 1 ) outline = outline word[i] else { if ( match( ".!?:;", "\\" substr( outline, length( outline ), 1 )) ) outline = outline " " word[i] # 2 spaces else outline = outline " " word[i] # 1 space } } } $0 ~ startwords { nf = split( $0, word, "[ \t]+" ) for ( i = 1; i <= nf; i++ ) { if ( outline == "" ) testlen = length( word[i] ) else { testlen = length( outline " " word[i] ) if ( match( ".!?:;", "\\" substr( outline, length( outline ), 1 )) ) testlen++ } if ( testlen > linelen ) { putline( adjust( outline, type ) "\n" ) outline = "" } if ( outline == "" ) outline = word[i] else { if ( match( ".!?:;", "\\" substr( outline, length( outline ), 1 )) ) outline = outline " " word[i] # 2 spaces else outline = 
outline " " word[i] # 1 space } } } END { if ( type == "b" ) putline( outline "\n" ) else putline( adjust( outline, type ) "\n" ) } # # -- support functions -- # function putline( line, fmt ) { if ( indent ) { fmt = "%" indent "s%s" printf( fmt, " ", line ) } else printf( "%s", line ) } function adjust( line, type, fill, fmt ) { if ( type != "l" ) fill = linelen - length( line ) if ( fill > 0 ) { if ( type == "c" ) { fmt = "%" (fill+1)/2 "s%s" line = sprintf( fmt, " ", line ) } else if ( type == "r" ) { fmt = "%" fill "s%s" line = sprintf( fmt, " ", line ) } else if ( type == "b" ) { line = fillout( line, fill ) } } return line } function fillout( line, need, i, newline, nextchar, blankseen ) { while ( need ) { newline = "" blankseen = 0 if ( dir == 0 ) { for ( i = 1; i <= length( line ); i++ ) { nextchar = substr( line, i, 1 ) if ( need ) { if ( nextchar == " " ) { if ( ! blankseen ) { newline = newline " " need-- blankseen = 1 } } else { blankseen = 0 } } newline = newline nextchar } } else if ( dir == 1 ) { for ( i = length( line ); i >= 1; i-- ) { nextchar = substr( line, i, 1 ) if ( need ) { if ( nextchar == " " ) { if ( ! blankseen ) { newline = " " newline need-- blankseen = 1 } } else { blankseen = 0 } } newline = nextchar newline } } line = newline dir = 1 - dir } return line } 13.5.1 Program Notes for adj This small text formatter is a nifty program for those of us who use text editors. It allows you to set the maximum line width and justify paragraphs and thus can be used to format mail messages or simple letters. The adj shell script does all the option setting, although it could have been done by reading ARGV in the BEGIN action. Using the shell to establish command-line parameters is probably easier for those who are already familiar with the shell. The lack of comments in the adj.awk script makes this script more difficult to read than some of the others. The BEGIN procedure assigns three regular expressions to variables: blankline, startblank, startwords. This is a good technique (one that you'll see used in lex specifications) because regular expressions can be difficult to read and the name of the variable makes it clear what it matches. Remember that modern awks lets you supply a regular expression as a string, in a variable. There are three main procedures, which can be named by the variable they match. The first is blankline, a procedure which handles collected text once a blank line is encountered. The second is startblank, which handles lines that begin with whitespace (spaces or tabs). The third is startwords, which handles a line of text. The basic procedure is to read a line of text and determine how many of the words in that line will fit, given the line width, outputting those that will fit and saving those that will not in the variable outline. When the next input line is read, the contents of outline must be output before that line is output. The adjust() function does the work of justifying the text based on a command-line option specifying the format type. All types except "l" (left-adjusted, right-ragged) need to be filled. Therefore, the first thing this function does is figure out how much "fill" is needed by subtracting the length of the current line from the specified line length. It makes excellent use of the sprintf() function to actually do the positioning of the text. For instance, to center text, the value of fill (plus 1) is divided by 2 to determine the amount of padding needed on each side of the line. 
This amount is passed through the fmt variable as the argument to sprintf():

fmt = "%" (fill+1)/2 "s%s"
line = sprintf( fmt, " ", line )

Thus, the space will be used to pad a field that is the length of half the amount of fill needed. If text is right-justified, the value of fill itself is used to pad the field. Finally, if the format type is "b" (block), then the function fillout is called to determine where to add spaces that will fill out the line. In looking over the design of the program, you can see, once again, how the use of functions helps to clarify what a program is doing. It helps to think of the main procedure as controlling the flow of input through the program while procedures handle the operations performed on the input. Separating the "operations" from the flow control makes the program readable and more easily maintained. In passing, we're not sure why FS, the field separator, is set to newline in the BEGIN procedure. This means that the field and record separators are the same (i.e., $0 and $1 are the same). The split() function is called to break the line into fields using tabs or spaces as the delimiter.

nf = split( $0, word, "[ \t]+" )

It would seem that the field separator could have been set to the same regular expression, as follows:

FS = "[ \t]+"

It would be more efficient to use the default field parsing. Finally, using the match() function to find punctuation is inefficient; it would have been better to use index().

Chapter 13 A Miscellany of Scripts

13.6 readsource - Format Program Source Files for troff

Contributed by Martin Weitzel

I am often preparing technical documents, especially for courses and training. In these documents, I often need to print source files of different kinds (C programs, awk programs, shell scripts, makefiles). The problem is that the sources often change with time and I want the most recent version when I print. I also want to avoid typos in print. As I'm using troff for text processing, it should be easy to include the original sources into the text. But there are some characters (especially "\", and "." or "'" at the beginning of a line) that I must escape to prevent interpretation by troff. I often want excerpts from sources rather than a complete file. I also need a mechanism for setting page breaks. Well, perhaps I'm being a perfectionist, but I don't want to see a C function printed nearly complete on one page, with only the last two lines appearing on the next. As I frequently change the documents, I cannot hunt for "nice" page breaks - this must be done automatically. To solve this set of problems, I wrote a filter that preprocesses any source for inclusion as text in troff. This is the awk program I send with this letter. [He didn't offer a name for it so it is here named readsource.] The whole process can be further automated through makefiles. I include a preprocessed version of the sources into my troff documents, and I make the formatting dependent on these preprocessed files. These files again are dependent on their originals, so if I "make" the document to print it, the preprocessed sources will be checked to see if they are still current; otherwise they will be generated new from their originals. My program contains a complete description in the form of comments. But as the description is more for me than for others, I'll give you some more hints.
Basically, the program simply guards some characters, e.g., "\" is turned into "\e" and "\&" is written before every line. Tabs may be expanded to spaces (there's a switch for it), and you may even generate line numbers in front of every line (switch selectable). The format of these line numbers can be set through an environmental variable. If you want only parts of a file to be processed, you can select these parts with two regular expressions (with another switch). You must specify the first line to be included and the first line not to be. I've found that this is often practical: If you want to show only a certain function of a C program, you can give the first line of the function definition and the first line of the next function definition. If the source is changed such that new functions are inserted between the two or the order is changed, the pattern matching will not work correctly. But this will accommodate the more frequently made, smaller changes in a program. The final feature, getting the page breaks right, is a bit tricky. Here a technique has evolved that I call "here-you-may-break." Those points are marked by a special kind of line (I use "/*!" in C programs and "#!" in awk, shell, makefiles, etc.). How the points are marked doesn't matter too much, you may have your own conventions, but it must be possible to give a regular expression that matches exactly this kind of line and no others (e.g., if your sources are written so that a page break is acceptable wherever you have an empty line, you can specify this very easily, as all you need is the regular expression for empty lines). Before all the marked lines, a special sequence will be inserted which again is given by an environmental variable. With troff, I use the technique of opening a "display" (.DS) before I include such preprocessed text, and inserting a close (.DE) and new open (.DS) display wherever I would accept a page break. After this, troff does the work of gathering as many lines as fit onto the current page. I suppose that suitable techniques for other text processors exist. #! /bin/sh # Copyright 1990 by EDV-Beratung Martin Weitzel, D-6100 Darmstadt # ================================================================== # PROJECT: Printing Tools # SH-SCRIPT: Source to Troff Pre-Formatter # ================================================================== #! # ------------------------------------------------------------------ # This programm is a tool to preformat source files, so that they # can be included (.so) within nroff/troff-input. Problems when # including arbitrary files within nroff/troff-input occur on lines, # starting with dot (.) or an apostrophe ('), or with the respective # chars, if these are changed, furthermore from embedded backslashes. # While changing the source so that none of the above will cause # any problems, some other useful things can be done, including # line numbering and selecting interesting parts. # ------------------------------------------------------------------ #! 
USAGE="$0 [-x d] [-n] [-b pat] [-e pat] [-p pat] [file ...]" # # SYNOPSIS: # The following options are supported: # -x d expand tabs to "d" spaces # -n number source lines (see also: NFMT) # -b pat start output on a line containing "pat", # including this line (Default: from beginning) # -e pat end output on a line containing "pat" # excluding this line (Default: upto end) # -p pat before lines containing "pat", page breaks # may occur (Default: no page breaks) # "pat" may be an "extended regular expression" as supported by awk. # The following variables from the environment are used: # NFMT specify format for line numbers (Default: see below) # PBRK string, to mark page breaks. (Default: see below) #! # PREREQUISITES: # Common UNIX-Environment, including awk. # # CAVEATS: # "pat"s are not checked before they are used (processing may have # started, before problems are detected). # "NFMT" must contain exactly one %d-format specifier, if -n # option is used. # In "NFMT" and "PBRK", embedded double quotes must be guarded with # a leading backslash. # In "pat"s, "NFMT" and "PBRK" embedded TABs and NLs must be written # as \t and \n. Backslashes that should "go thru" to the output as # such, should be doubled. (The latter is only *required* in a few # special cases, but it does no harm the other cases). # #! # BUGS: # Slow - but may serve as prototype for a faster implementation. # (Hint: Guarding backslashes the way it is done by now is very # expensive and could also be done using sed 's/\\/\\e/g', but tab # expansion would be much harder then, because I can't imagine how # to do it with sed. If you have no need for tab expansion, you may # change the program. Another option would be to use gsub(), which # would limit the program to environments with nawk.) # # Others bugs may be, please mail me. #! # AUTHOR: Martin Weitzel, D-6100 DA (martin@mwtech. UUCP) # # RELEASED: 25. Nov 1989, Version 1.00 # ------------------------------------------------------------------ #! CSOPT # ------------------------------------------------------------------ # check/set options # ------------------------------------------------------------------ xtabs=0 nfmt= bpat= epat= ppat= for p do case $sk in 1) shift; sk=0; continue esac case $p in -x) shift; case $1 in [1-9]|1[0-9]) xtabs=$1; sk=1;; *) { >&2 echo "$0: bad value for option -x: $1"; exit 1; } esac ;; -n) nfmt="${NFMT:-<%03d>\ }"; shift ;; -b) shift; bpat=$1; sk=1 ;; -e) shift; epat=$1; sk=1 ;; -p) shift; ppat=$1; sk=1 ;; --) shift; break ;; *) break esac done #! MPROC # ------------------------------------------------------------------ # now the "real work" # ------------------------------------------------------------------ awk ' #. prepare for tab-expansion, page-breaks and selection BEGIN { if (xt = '$xtabs') while (length(sp) < xt) sp = sp " "; PBRK = "'"${PBRK-'.DE\n.DS\n'}"'" '${bpat:+' skip = 1; '}' } #! limit selection range { '${epat:+' if (!skip && $0 ~ /'"$epat"'/) skip = 1; '}' '${bpat:+' if (skip && $0 ~ /'"$bpat"'/) skip = 0; '}' if (skip) next; } #! process one line of input as required { line = ""; ll = 0; for (i = 1; i <= length; i++) { c = substr($0, i, 1); if (xt && c == "\t") { # expand tabs nsp = 8 - ll % xt; line = line substr(sp, 1, nsp); ll += nsp; } else { if (c == "\\") c = "\\e"; line = line c; ll++; } } } #! 
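Before looking at an example, it helps to see what awk actually receives once the shell has performed its substitutions. For the command line used in the example below, readsource -x 3 -b "process one line" -e "finally print" readsource, the '$xtabs', ${bpat:+...}, and ${epat:+...} fragments expand so that the start of the awk program reads roughly like this (a sketch, not a verbatim trace):

BEGIN {
    if (xt = 3)
        while (length(sp) < xt) sp = sp " ";
    PBRK = ".DE\n.DS\n"
    skip = 1;
}
{
    if (!skip && $0 ~ /finally print/) skip = 1;
    if (skip && $0 ~ /process one line/) skip = 0;
    if (skip) next;
}

The option values become literal program text; that is the whole trick behind passing shell variables into an old-awk script.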
finally print this line { '${ppat:+' if ($0 ~ /'"$ppat"'/) printf("%s", PBRK); '}' '${nfmt:+' printf("'"$nfmt"'", NR) '}' printf("\\&%s\n", line); } ' $* For an example of how it works, we ran readsource to extract a part of its own program. $ readsource -x 3 -b "process one line" -e "finally print" readsource \! process one line of input as required \&{ \& line = ""; ll = 0; \& for (i = 1; i <= length; i++) { \& c = substr($0, i, 1); \& if (xt && c == "\\et") { \& # expand tabs \& nsp = 8 - ll % xt; \& line = line substr(sp, 1, nsp); \& ll += nsp; \& } \& else { \& if (c == "\\e\\e") c = "\\e\\ee"; \& line = line c; \& ll++; \& } \& } \&} 13.6.1 Program Notes for readsource This program is, first of all, quite useful, as it helped us prepare the listings in this book. The author does really stretch (old) awk to its limits, using shell variables to pass information into the script. It gets the job done, but it is quite obscure. The program does run slowly. We followed up on the author's suggestion and changed the way the program replaced tabs and backslashes. The original program uses an expensive character-by-character comparison, obtaining the character using the substr() function. (It is the procedure that is extracted in the example above.) Its performance points out how costly it is in awk to read a line one character at a time, something that is very simple in C. Running readsource on itself produced the following times: $ timex readsource -x 3 readsource > /dev/null real 1.56 user 1.22 sys 0.20 The procedure that changes the way tabs and backslashes are handled can be re-written in nawk to use the gsub() function: #! process one line of input as required { if ( xt && $0 ~ "\t" ) gsub(/\t/, sp) if ($0 ~ "\\") gsub(/\\/, "\\e") } The last procedure needs a small change, replacing the variable line with "$0". (We don't use the temporary variable line.) The nawk version produces: $ timex readsource.2 -x 3 readsource > /dev/null real 0.44 user 0.10 sys 0.22 The difference is pretty remarkable. One final speedup might be to use index() to look for backslashes: #! process one line of input as required { if ( xt && index($0, "\t") > 0 ) gsub(/\t/, sp) if (index($0, "\\") > 0) gsub(/\\/, "\\e") } 13.5 adj - Adjust Lines for Text Files 13.7 gent - Get a termcap Entry Chapter 13 A Miscellany of Scripts 13.7 gent - Get a termcap Entry Contributed by Tom Christiansen Here's a sed script I use to extract a termcap entry. It works for any termcap-like file, such as disktab. For example: $ gent vt100 extracts the vt100 entry from termcap, while: $ gent eagle /etc/disktab gets the eagle entry from disktab. Now I know it could have been done in C or Perl, but I did it a long time ago. It's also interesting because of the way it passes options into the sed script. I know, I know: it should have been written in sh not csh, too. 
#!/bin/csh -f set argc = $#argv set noglob set dollar = '$' set squeeze = 0 set noback="" nospace="" rescan: if ( $argc > 0 && $argc < 3 ) then if ( "$1" =~ -* ) then if ( "-squeeze" =~ $1* ) then set noback='s/\\//g' nospace='s/^[ ]*//' set squeeze = 1 shift @ argc -- goto rescan else echo "Bad switch: $1" goto usage endif endif set entry = "$1" if ( $argc == 1 ) then set file = /etc/termcap else set file = "$2" endif else usage: echo "usage: `basename $0` [-squeeze] entry [termcapfile]" exit 1 endif sed -n -e \ "/^${entry}[|:]/ {\ :x\ /\\${dollar}/ {\ ${noback}\ ${nospace}\ p\ n\ bx\ }\ ${nospace}\ p\ n\ /^ / {\ bx\ }\ }\ /^[^ ]*|${entry}[|:]/ {\ :y\ /\\${dollar}/ {\ ${noback}\ ${nospace}\ p\ n\ by\ }\ ${nospace}\ p\ n\ /^ / {\ by\ }\ }" < $file 13.7.1 Program Notes for gent Once you get used to reading awk scripts, they seem so much easier to understand than all but the simplest sed script. It can be a painstaking task to figure out what a small sed script like the one shown here is doing. This script does show how to pass shell variables into a sed script. Variables are used to pass optional sed commands into the script, such as the substitution commands that replace backslashes and spaces. This script could be simplified in several ways. First of all, the two regular expressions don't seem necessary to match the entry. The first matches the name of the entry at the beginning of a line; the second matches it elsewhere on the line. The loops labeled x and y are identical and even if the two regular expressions were necessary, we could have them branch to the same loop. 13.6 readsource - Format Program Source Files for troff 13.8 plpr - lpr Preprocessor Chapter 13 A Miscellany of Scripts 13.8 plpr - lpr Preprocessor Contributed by Tom Van Raalte I thought you might want to use the following script around the office. It is a preprocessor for lpr that sends output to the "best" printer. [This shell script is written for a BSD or Linux system and you would use this command in place of lpr. It reads the output of the lpq command to determine if a specific printer is available. If not, it tries a list of printers to see which one is available or which is the least busy. Then it invokes lpr to send the job to that printer.] #!/bin/sh # #set up temp file TMP=/tmp/printsum.$$ LASERWRITER=${LASERWRITER-ps6} #Check to see if the default printer is free? # # FREE=`lpq -P$LASERWRITER | awk ' { if ($0 == "no entries") { val=1 print val exit 0 } else { val=0 print val exit 0 } }'` #echo Free is $FREE # #If the default is free then $FREE is set, and we print and exit. # if [ $FREE -eq 1 ] then SELECT=$LASERWRITER #echo selected $SELECT lpr -P$SELECT $* exit 0 fi #echo Past the exit # #Now we go on to see if any of the printers in bank are free. # BANK=${BANK-$LASERWRITER} #echo bank is $BANK # #If BANK is the same as LASERWRITER, then we have no choice. #otherwise, we print on the one that is free, if any are free. # if [ "$BANK" = "$LASERWRITER" ] then SELECT=$LASERWRITER lpr -P$SELECT $* exit 0 fi #echo past the check bank=laserprinter # #Now we check for a free printer. #Note that $LASERWRITER is checked again in case it becomes free #during the check. 
#
#echo now we check the other for a free one
for i in $BANK $LASERWRITER
do
FREE=`lpq -P$i | awk ' {
	if ($0 == "no entries") {
		val=1
		print val
		exit 0
	}
	else {
		val=0
		print val
		exit 0
	}
}'`
if [ $FREE -eq 1 ]
then
#	echo in loop for $i
	SELECT=$i
#	echo select is $SELECT
#	if [ "$FREE" != "$LASERWRITER" ]
#	then
#		echo "Output redirected to printer $i"
#	fi
	lpr -P$SELECT $*
	exit 0
fi
done
#echo done checking for a free one
#
#If we make it here then no printers are free. So we
#print on the printer with the least bytes queued.
#
#
for i in $BANK $LASERWRITER
do
val=`lpq -P$i | awk '
BEGIN { start=0; }
/^Time/ { start=1; next; }
(start == 1) {
	test=substr($0,62,20);
	print test;
} ' | awk '
BEGIN { summ=0; }
{ summ=summ+$1; }
END { print summ; }'`
echo "$i $val" >> $TMP
done
SELECT=`awk '(NR==1) { select=$1; best=$2 }
($2 < best) { select=$1; best=$2 }
END { print select } ' $TMP`
#echo $SELECT
#
rm $TMP
#Now print on the selected printer
#if [ $SELECT != $LASERWRITER ]
#then
#	echo "Output redirected to printer $i"
#fi
lpr -P$SELECT $*
trap 'rm -f $TMP; exit 99' 2 3 15

13.8.1 Program Notes for plpr

For the most part, we've avoided scripts like these in which most of the logic is coded in the shell script. However, such a minimalist approach is representative of a wide variety of uses of awk. Here, awk is called to do only those things that the shell script can't do (or can't do as easily): manipulating the output of a command and performing numeric comparisons. As a side note, the trap statement at the end should be at the top of the script, not at the bottom.

Chapter 13
A Miscellany of Scripts

13.9 transpose - Perform a Matrix Transposition

Contributed by Geoff Clare

transpose performs a matrix transposition on its input. I wrote this when I saw a script to do this job posted to the Net and thought it was horribly inefficient. I posted mine as an alternative with timing comparisons. If I remember rightly, the original one stored all the elements individually and used a nested loop with a printf for each element. It was immediately obvious to me that it would be much faster to construct the rows of the transposed matrix "on the fly." My script uses ${1+"$@"} to supply file names on the awk command line so that if no files are specified awk will read its standard input. This is much better than plain $*, which can't handle filenames containing whitespace.

#! /bin/sh
# Transpose a matrix: assumes all lines have same number
# of fields
exec awk '
NR == 1 {
	n = NF
	for (i = 1; i <= NF; i++)
		row[i] = $i
	next
}
{
	if (NF > n)
		n = NF
	for (i = 1; i <= NF; i++)
		row[i] = row[i] " " $i
}
END {
	for (i = 1; i <= n; i++)
		print row[i]
}' ${1+"$@"}

Here's a test file:

1 2 3 4
5 6 7 8
9 10 11 12

Now we run transpose on the file.

$ transpose test
1 5 9
2 6 10
3 7 11
4 8 12

13.9.1 Program Notes for transpose

This is a very simple but interesting script. It creates an array named row and appends each field into an element of the array. The END procedure outputs the array.

Chapter 13
A Miscellany of Scripts

13.10 m1 - Simple Macro Processor

Contributed by Jon Bentley

The m1 program is a "little brother" to the m4 macro processor found on UNIX systems. It was originally published in the article m1: A Mini Macro Processor, in Computer Language, June 1990, Volume 7, Number 6, pages 47-61. This program was brought to my attention by Ozan Yigit.
Jon Bentley kindly sent me his current version of the program, as well as an early draft of his article (I was having trouble getting a copy of the published one). A PostScript version of this paper is included with the example programs, available from O'Reilly's FTP server (see the Preface). I wrote these introductory notes, and the program notes below. [A.R.]

A macro processor copies its input to its output, while performing several jobs. The tasks are:

1. Define and expand macros. Macros have two parts, a name and a body. All occurrences of a macro's name are replaced with the macro's body.

2. Include files. Special include directives in a data file are replaced with the contents of the named file. Includes can usually be nested, with one included file including another. Included files are processed for macros.

3. Conditional text inclusion and exclusion. Different parts of the text can be included in the final output, often based upon whether a macro is or isn't defined.

4. Depending on the macro processor, comment lines can appear that will be removed from the final output.

If you're a C or C++ programmer, you're already familiar with the built-in preprocessor in those languages. UNIX systems have a general-purpose macro processor called m4. This is a powerful program, but somewhat difficult to master, since macro definitions are processed for expansion at definition time, instead of at expansion time. m1 is considerably simpler than m4, making it much easier to learn and to use.

Here is Jon's first cut at a very simple macro processor. All it does is define and expand macros. We can call it m0a. In this and the following programs, the "at" symbol (@) distinguishes lines that are directives, and also indicates the presence of macros that should be expanded.

/^@define[ \t]/ {
	name = $2
	$1 = $2 = ""; sub(/^[ \t]+/, "")
	symtab[name] = $0
	next
}
{
	for (i in symtab)
		gsub("@" i "@", symtab[i])
	print
}

This version looks for lines beginning with "@define." This keyword is $1 and the macro name is taken to be $2. The rest of the line becomes the body of the macro. The next input line is then fetched using next. The second rule simply loops through all the defined macros, performing a global substitution of each macro with its body in the input line, and then printing the line. Think about the tradeoff made in this version between simplicity and program execution time.

The next version (m0b) adds file inclusion:

function dofile(fname) {
	while ((getline <fname) > 0) {
		if (/^@define[ \t]/) {		# @define name value
			name = $2
			$1 = $2 = ""; sub(/^[ \t]+/, "")
			symtab[name] = $0
		} else if (/^@include[ \t]/)	# @include filename
			dofile($2)
		else {				# Anywhere in line @name@
			for (i in symtab)
				gsub("@" i "@", symtab[i])
			print
		}
	}
	close(fname)
}

BEGIN {
	if (ARGC == 2)
		dofile(ARGV[1])
	else
		dofile("/dev/stdin")
}

Note the way dofile() is called recursively to handle nested include files.

With all of that introduction out of the way, here is the full-blown m1 program.

#! /bin/awk -f
# NAME
#
# m1
#
# USAGE
#
# awk -f m1.awk [file...]
#
# DESCRIPTION
#
# M1 copies its input file(s) to its output unchanged except as modified by
# certain "macro expressions."
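#
# For instance (an illustrative aside from the editors, not part of
# Bentley's original header comment): given the two input lines
#
#	@define os Linux
#	This file was built on @os@.
#
# m1 prints the single line "This file was built on Linux."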
# The following lines define macros for
# subsequent processing:
#
# @comment Any text
# @@			same as @comment
# @define name value
# @default name value	set if name undefined
# @include filename
# @if varname		include subsequent text if varname != 0
# @unless varname	include subsequent text if varname == 0
# @fi			terminate @if or @unless
# @ignore DELIM		ignore input until line that begins with DELIM
# @stderr stuff		send diagnostics to standard error
#
# A definition may extend across many lines by ending each line with
# a backslash, thus quoting the following newline.
#
# Any occurrence of @name@ in the input is replaced in the output by
# the corresponding value.
#
# @name at beginning of line is treated the same as @name@.
#
# BUGS
#
# M1 is three steps lower than m4. You'll probably miss something
# you have learned to expect.
#
# AUTHOR
#
# Jon L. Bentley, jlb@research.bell-labs.com
#

function error(s) {
	print "m1 error: " s | "cat 1>&2"; exit 1
}

function dofile(fname,  savefile, savebuffer, newstring) {
	if (fname in activefiles)
		error("recursively reading file: " fname)
	activefiles[fname] = 1
	savefile = file; file = fname
	savebuffer = buffer; buffer = ""
	while (readline() != EOF) {
		if (index($0, "@") == 0) {
			print $0
		} else if (/^@define[ \t]/) {
			dodef()
		} else if (/^@default[ \t]/) {
			if (!($2 in symtab))
				dodef()
		} else if (/^@include[ \t]/) {
			if (NF != 2) error("bad include line")
			dofile(dosubs($2))
		} else if (/^@if[ \t]/) {
			if (NF != 2) error("bad if line")
			if (!($2 in symtab) || symtab[$2] == 0)
				gobble()
		} else if (/^@unless[ \t]/) {
			if (NF != 2) error("bad unless line")
			if (($2 in symtab) && symtab[$2] != 0)
				gobble()
		} else if (/^@fi([ \t]?|$)/) {	# Could do error checking here
		} else if (/^@stderr[ \t]?/) {
			print substr($0, 9) | "cat 1>&2"
		} else if (/^@(comment|@)[ \t]?/) {
		} else if (/^@ignore[ \t]/) {	# Dump input until $2
			delim = $2
			l = length(delim)
			while (readline() != EOF)
				if (substr($0, 1, l) == delim)
					break
		} else {
			newstring = dosubs($0)
			if ($0 == newstring || index(newstring, "@") == 0)
				print newstring
			else
				buffer = newstring "\n" buffer
		}
	}
	close(fname)
	delete activefiles[fname]
	file = savefile
	buffer = savebuffer
}

# Put next input line into global string "buffer"
# Return "EOF" or "" (null string)
function readline(  i, status) {
	status = ""
	if (buffer != "") {
		i = index(buffer, "\n")
		$0 = substr(buffer, 1, i-1)
		buffer = substr(buffer, i+1)
	} else {
		# Hume: special case for non v10:
		if (file == "/dev/stdin") {
			if ((getline) <= 0)
				status = EOF
		} else if ((getline <file) <= 0)
			status = EOF
	}
	return status
}

# Skip text between an @if (or @unless) and the matching @fi;
# conditionals may nest
function gobble(  ifdepth) {
	ifdepth = 1
	while (readline() != EOF) {
		if (/^@(if|unless)[ \t]/)
			ifdepth++
		if (/^@fi[ \t]?/ && --ifdepth <= 0)
			break
	}
}

# dosubs - substitute macros in string s, working left to right;
# rescanning of the result is left to the caller
function dosubs(s,  l, r, i, m) {
	if (index(s, "@") == 0)
		return s
	l = ""	# l: the part processed so far
	r = s	# r: the part still to be processed
	while ((i = index(r, "@")) != 0) {
		l = l substr(r, 1, i-1)
		r = substr(r, i+1)	# skip over the first @
		i = index(r, "@")
		if (i == 0) {		# no closing @: pass it through
			l = l "@"
			break
		}
		m = substr(r, 1, i-1)
		r = substr(r, i+1)	# skip over the second @
		if (m in symtab)
			r = symtab[m] r
		else {
			l = l "@" m
			r = "@" r
		}
	}
	return l r
}

# dodef - collect a definition; the body may be continued across
# lines that end with a backslash
function dodef(fname,  str, x) {
	name = $2
	sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]*/, "")	# preserve body whitespace
	str = $0
	while (str ~ /\\$/) {
		if (readline() == EOF)
			error("EOF inside definition")
		x = $0
		sub(/^[ \t]+/, "", x)
		str = substr(str, 1, length(str)-1) "\n" x
	}
	symtab[name] = str
}

BEGIN {
	EOF = "EOF"	# sentinel returned by readline()
	if (ARGC == 1)
		dofile("/dev/stdin")
	else if (ARGC >= 2) {
		for (i = 1; i < ARGC; i++)
			dofile(ARGV[i])
	} else
		error("usage: m1 [fname...]")
}

13.10.1 Program Notes for m1

The program is nicely modular, with an error() function similar to the one presented in Chapter 11, A Flock of awks, and each task cleanly divided into separate functions.

The main program occurs in the BEGIN procedure at the bottom. It simply processes either standard input, if there are no arguments, or all of the files named on the command line.

The high-level processing happens in the dofile() function, which reads one line at a time, and decides what to do with each line. The activefiles array keeps track of open files. The variable fname indicates the current file to read data from. When an "@include" directive is seen, dofile() simply calls itself recursively on the new file, as in m0b. Interestingly, the included filename is first processed for macros. Read this function carefully - there are some nice tricks here.

The readline() function manages the "pushback."
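To see the pushback idea in isolation, here is a minimal, self-contained sketch of the same technique (ours, not part of m1): lines are served from a string buffer first, and fresh input is read only when the buffer is empty.

awk '
function nextline(    i) {              # simplified readline()
	if (buffer != "") {             # serve pushed-back text first
		i = index(buffer, "\n")
		$0 = substr(buffer, 1, i - 1)
		buffer = substr(buffer, i + 1)
		return 1
	}
	return ((getline) > 0)          # then fall back to real input
}
BEGIN {
	buffer = "one pushed-back line\nanother one\n"
	while (nextline())
		print "read: " $0
}' /dev/null

Because nextline() always drains the buffer before touching the input, anything appended to buffer is rescanned before a new line is ever read, which is exactly how m1 reprocesses expanded text.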
After expanding a macro, macro processors examine the newly created text for any additional macro names. Only after all expanded text has been processed and sent to the output does the program get a fresh line of input.

The dosubs() function actually performs the macro substitution. It processes the line left-to-right, replacing macro names with their bodies. The rescanning of the new line is left to the higher-level logic that is jointly managed by readline() and dofile(). This version is considerably more efficient than the brute-force approach used in the m0 programs.

Finally, the dodef() function handles the defining of macros. It saves the macro name from $2, and then uses sub() to remove the first two fields. The new value of $0 now contains just (the first line of) the macro body. The Computer Language article explains that sub() is used on purpose, in order to preserve whitespace in the macro body. Simply assigning the empty string to $1 and $2 would rebuild the record, but with all occurrences of whitespace collapsed into single occurrences of the value of OFS (a single blank). The function then proceeds to gather the rest of the macro body, indicated by lines that end with a "\". This is an additional improvement over m0: macro bodies can be more than one line long.

The rest of the program is concerned with conditional inclusion or exclusion of text; this part is straightforward. What's nice is that these conditionals can be nested inside each other.

m1 is a very nice start at a macro processor. You might want to think about how you could expand upon it; for instance, by allowing conditionals to have an "@else" clause; processing the command line for macro definitions; "undefining" macros, and the other sorts of things that macro processors usually do. Some other extensions suggested by Jon Bentley are:

1. Add "@shell DELIM shell line here," which would read input lines up to "DELIM," and send the expanded output through a pipe to the given shell command.

2. Add commands "@longdef" and "@longend." These commands would define macros with long bodies, i.e., those that extend over more than one line, simplifying the logic in dodef().

3. Add "@append MacName MoreText," like ".am" in troff. This macro in troff appends text to an already defined macro. In m1, this would allow you to add on to the body of an already defined macro.

4. Avoid the V10 /dev/stdin special file. The Bell Labs UNIX systems[1] have a special file actually named /dev/stdin, that gives you access to standard input. It occurs to me that the use of "-" would do the trick, quite portably. This is also not a real issue if you use gawk or the Bell Labs awk, which interpret the special file name /dev/stdin internally (see Chapter 11).

[1] And some other UNIX systems, as well.

As a final note, Jon often makes use of awk in two of his books, Programming Pearls, and More Programming Pearls - Confessions of a Coder (both published by Addison-Wesley). These books are both excellent reading.

Appendix A
A. Quick Reference for sed

Contents:
Command-Line Syntax
Syntax of sed Commands
Command Summary for sed

A.1 Command-Line Syntax

The syntax for invoking sed has two forms:

sed [-n][-e] `command' file(s)
sed [-n] -f scriptfile file(s)

The first form allows you to specify an editing command on the command line, surrounded by single quotes. The second form allows you to specify a scriptfile, a file containing sed commands.
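For example, assuming a hypothetical log file install.log, the following two invocations print the same lines (find.sed holds the one-command script):

$ sed -n '/error/p' install.log
$ cat find.sed
/error/p
$ sed -n -f find.sed install.log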
Both forms may be used together, and they may be used multiple times. The resulting editing script is the concatenation of the commands and script files. The following options are recognized:

-n	Only print lines specified with the p command or the p flag of the s command.
-e cmd	Next argument is an editing command. Useful if multiple scripts are specified.
-f file	Next argument is a file containing editing commands.

If the first line of the script is "#n", sed behaves as if -n had been specified.

Frequently used sed scripts are usually invoked from a shell script. Since this is the same for sed or awk, see the section "Shell Wrapper for Invoking awk" in Appendix B, Quick Reference for awk.

Appendix A
Quick Reference for sed

A.2 Syntax of sed Commands

Sed commands have the general form:

[address[,address]][!]command [arguments]

Sed copies each line of input into a pattern space. Sed instructions consist of addresses and editing commands. If the address of the command matches the line in the pattern space, then the command is applied to that line. If a command has no address, then it is applied to each input line. If a command changes the contents of the space, subsequent command-addresses will be applied to the current line in the pattern space, not the original input line.

A.2.1 Pattern Addressing

address can be either a line number or a pattern, enclosed in slashes (/pattern/). A pattern is described using a regular expression. Additionally, \n can be used to match any newline in the pattern space (resulting from the N command), but not the newline at the end of the pattern space.

If no pattern is specified, the command will be applied to all lines. If only one address is specified, the command will be applied to all lines matching that address. If two comma-separated addresses are specified, the command will be applied to a range of lines between the first and second addresses, inclusively. Some commands accept only one address: a, i, r, q, and =.

The ! operator following an address causes sed to apply the command to all lines that do not match the address.

Braces ({}) are used in sed to nest one address inside another or to apply multiple commands at the same address.

[/pattern/[,/pattern/]]{
command1
command2
}

The opening curly brace must end a line, and the closing curly brace must be on a line by itself. Be sure there are no spaces after the braces.

A.2.2 Regular Expression Metacharacters for sed

The following table lists the pattern-matching metacharacters that were discussed in Chapter 3, Understanding Regular Expression Syntax. Note that an empty regular expression "//" is the same as the previous regular expression.

Table A.1: Pattern-Matching Metacharacters

Special Characters	Usage
.	Matches any single character except newline.
*	Matches any number (including zero) of the single character (including a character specified by a regular expression) that immediately precedes it.
[...]	Matches any one of the class of characters enclosed between the brackets. All other metacharacters lose their meaning when specified as members of a class. A circumflex (^) as the first character inside brackets reverses the match to all characters except newline and those listed in the class. A hyphen (-) is used to indicate a range of characters. The close bracket (]) as the first character in the class is a member of the class.
\{n,m\}	Matches a range of occurrences of the single character (including a character specified by a regular expression) that immediately precedes it. \{n\} will match exactly n occurrences, \{n,\} will match at least n occurrences, and \{n,m\} will match any number of occurrences between n and m. (sed and grep only.)
^	Locates regular expression that follows at the beginning of line. The ^ is only special when it occurs at the beginning of the regular expression.
$	Locates preceding regular expression at the end of line. The $ is only special when it occurs at the end of the regular expression.
\	Escapes the special character that follows.
\( \)	Saves the pattern enclosed between "\(" and "\)" into a special holding space. Up to nine patterns can be saved in this way on a single line. They can be "replayed" in substitutions by the escape sequences "\1" to "\9".
\n	Matches the nth pattern previously saved by "\(" and "\)", where n is a number from 1 to 9 and previously saved patterns are counted from the left on the line.
&	Prints the entire matched text when used in a replacement string.

Appendix A
Quick Reference for sed

A.3 Command Summary for sed

:
:label
Label a line in the script for the transfer of control by b or t. label may contain up to seven characters. (The POSIX standard says that an implementation can allow longer labels if it wishes to. GNU sed allows labels to be of any length.)

=
[address]=
Write to standard output the line number of addressed line.

a
[address]a\
text
Append text following each line matched by address. If text goes over more than one line, newlines must be "hidden" by preceding them with a backslash. The text will be terminated by the first newline that is not hidden in this way. The text is not available in the pattern space and subsequent commands cannot be applied to it. The results of this command are sent to standard output when the list of editing commands is finished, regardless of what happens to the current line in the pattern space.

b
[address1[,address2]]b[label]
Transfer control unconditionally (branch) to :label elsewhere in script. That is, the command following the label is the next command applied to the current line. If no label is specified, control falls through to the end of the script, so no more commands are applied to the current line.

c
[address1[,address2]]c\
text
Replace (change) the lines selected by the address with text. When a range of lines is specified, all lines as a group are replaced by a single copy of text. The newline following each line of text must be escaped by a backslash, except the last line. The contents of the pattern space are, in effect, deleted and no subsequent editing commands can be applied to it (or to text).

d
[address1[,address2]]d
Delete line(s) from pattern space. Thus, the line is not passed to standard output. A new line of input is read and editing resumes with first command in script.

D
[address1[,address2]]D
Delete first part (up to embedded newline) of multiline pattern space created by N command and resume editing with first command in script. If this command empties the pattern space, then a new line of input is read, as if the d command had been executed.

g
[address1[,address2]]g
Copy (get) contents of hold space (see h or H command) into the pattern space, wiping out previous contents.

G
[address1[,address2]]G
Append newline followed by contents of hold space (see h or H command) to contents of the pattern space.
If hold space is empty, a newline is still appended to the pattern space.

h
[address1[,address2]]h
Copy pattern space into hold space, a special temporary buffer. Previous contents of hold space are wiped out.

H
[address1[,address2]]H
Append newline and contents of pattern space to contents of the hold space. Even if hold space is empty, this command still appends the newline first.

i
[address1]i\
text
Insert text before each line matched by address. (See a for details on text.)

l
[address1[,address2]]l
List the contents of the pattern space, showing nonprinting characters as ASCII codes. Long lines are wrapped.

n
[address1[,address2]]n
Read next line of input into pattern space. Current line is sent to standard output. New line becomes current line and increments line counter. Control passes to command following n instead of resuming at the top of the script.

N
[address1[,address2]]N
Append next input line to contents of pattern space; the new line is separated from the previous contents of the pattern space by a newline. (This command is designed to allow pattern matches across two lines. Using \n to match the embedded newline, you can match patterns across multiple lines.)

p
[address1[,address2]]p
Print the addressed line(s). Note that this can result in duplicate output unless default output is suppressed by using "#n" or the -n command-line option. Typically used before commands that change flow control (d, n, b) and might prevent the current line from being output.

P
[address1[,address2]]P
Print first part (up to embedded newline) of multiline pattern space created by N command. Same as p if N has not been applied to a line.

q
[address]q
Quit when address is encountered. The addressed line is first written to output (if default output is not suppressed), along with any text appended to it by previous a or r commands.

r
[address]r file
Read contents of file and append after the contents of the pattern space. Exactly one space must be put between r and the filename.

s
[address1[,address2]]s/pattern/replacement/[flags]
Substitute replacement for pattern on each addressed line. If pattern addresses are used, the pattern // represents the last pattern address specified. The following flags can be specified:

n	Replace nth instance of /pattern/ on each addressed line. n is any number in the range 1 to 512, and the default is 1.
g	Replace all instances of /pattern/ on each addressed line, not just the first instance.
p	Print the line if a successful substitution is done. If several successful substitutions are done, multiple copies of the line will be printed.
w file	Write the line to file if a replacement was done. A maximum of 10 different files can be opened.

t
[address1[,address2]]t [label]
Test if successful substitutions have been made on addressed lines, and if so, branch to line marked by :label. (See b and :.) If label is not specified, control falls through to bottom of script.

w
[address1[,address2]]w file
Append contents of pattern space to file. This action occurs when the command is encountered rather than when the pattern space is output. Exactly one space must separate the w and the filename. A maximum of 10 different files can be opened in a script. This command will create the file if it does not exist; if the file exists, its contents will be overwritten each time the script is executed. Multiple write commands that direct output to the same file append to the end of the file.
x
[address1[,address2]]x
Exchange contents of the pattern space with the contents of the hold space.

y
[address1[,address2]]y/abc/xyz/
Transform each character by position in string abc to its equivalent in string xyz.

Appendix B
B. Quick Reference for awk

Contents:
Command-Line Syntax
Language Summary for awk
Command Summary for awk

This appendix describes the features of the awk scripting language.

B.1 Command-Line Syntax

The syntax for invoking awk has two basic forms:

awk [-v var=value] [-Fre] [--] 'pattern { action }' var=value datafile(s)
awk [-v var=value] [-Fre] -f scriptfile [--] var=value datafile(s)

An awk command line consists of the command, the script, and the input filename. Input is read from the file specified on the command line. If there is no input file or "-" is specified, then standard input is read. The -F option sets the field separator (FS) to re. The -v option sets the variable var to value before the script is executed. This happens even before the BEGIN procedure is run. (See the discussion below on command-line parameters.) Following POSIX argument parsing conventions, the "--" option marks the end of command-line options. Using this option, for instance, you could specify a datafile that begins with "-", which would otherwise be confused with a command-line option.

You can specify a script consisting of pattern and action on the command line, surrounded by single quotes. Alternatively, you can place the script in a separate file and specify the name of the scriptfile on the command line with the -f option.

Parameters can be passed into awk by specifying them on the command line after the script. This includes setting system variables such as FS, OFS, and RS. The value can be a literal, a shell variable ($var), or the result of a command (`cmd`); it must be quoted if it contains spaces or tabs. Any number of parameters can be specified. Command-line parameters are not available until the first line of input is read, and thus cannot be accessed in the BEGIN procedure. (Older implementations of awk and nawk would process leading command-line assignments before running the BEGIN procedure. This was contrary to how things were documented in The AWK Programming Language, which says that they are processed when awk would go to open them as filenames, i.e., after the BEGIN procedure. The Bell Labs awk was changed to correct this, and the -v option was added at the same time, in early 1989. It is now part of POSIX awk.) Parameters are evaluated in the order in which they appear on the command line up until a filename is recognized. Parameters appearing after that filename will be available when the next filename is recognized.

B.1.1 Shell Wrapper for Invoking awk

Typing a script at the system prompt is only practical for simple, one-line scripts. Any script that you might invoke as a command and reuse can be put inside a shell script. Using a shell script to invoke awk makes the script easy for others to use. You can put the command line that invokes awk in a file, giving it a name that identifies what the script does. Make that file executable (using the chmod command) and put it in a directory where local commands are kept. The name of the shell script can be typed on the command line to execute the awk script. This is preferred for easily used and reused scripts. On modern UNIX systems, including Linux, you can use the #! syntax to create self-contained awk scripts:
#! /usr/bin/awk -f
script

Awk parameters and the input filename can be specified on the command line that invokes the shell script. Note that the pathname to use is system-dependent.

Appendix B
Quick Reference for awk

B.2 Language Summary for awk

This section summarizes how awk processes input records and describes the various syntactic elements that make up an awk program.

B.2.1 Records and Fields

Each line of input is split into fields. By default, the field delimiter is one or more spaces and/or tabs. You can change the field separator by using the -F command-line option. Doing so also sets the value of FS. The following command line changes the field separator to a colon:

awk -F: -f awkscr /etc/passwd

You can also assign the delimiter to the system variable FS. This is typically done in the BEGIN procedure, but can also be passed as a parameter on the command line.

awk -f awkscr FS=: /etc/passwd

Each input line forms a record containing any number of fields. Each field can be referenced by its position in the record. "$1" refers to the value of the first field; "$2" to the second field, and so on. "$0" refers to the entire record. The following action prints the first field of each input line:

{ print $1 }

The default record separator is a newline. The following procedure sets FS and RS so that awk interprets an input record as any number of lines up to a blank line, with each line being a separate field.

BEGIN { FS = "\n"; RS = "" }

It is important to know that when RS is set to the empty string, newline always separates fields, in addition to whatever value FS may have. This is discussed in more detail in both The AWK Programming Language and Effective AWK Programming.

B.2.2 Format of a Script

An awk script is a set of pattern-matching rules and actions:

pattern { action }

An action is one or more statements that will be performed on those input lines that match the pattern. If no pattern is specified, the action is performed for every input line. The following example uses the print statement to print each line in the input file:

{ print }

If only a pattern is specified, then the default action consists of the print statement, as shown above.

Function definitions can also appear:

function name (parameter list) { statements }

This syntax defines the function name, making available the list of parameters for processing in the body of the function. Variables specified in the parameter-list are treated as local variables within the function. All other variables are global and can be accessed outside the function. When calling a user-defined function, no space is permitted between the name of the function and the opening parenthesis. Spaces are allowed in the function's definition. User-defined functions are described in Chapter 9, Functions.

B.2.2.1 Line termination

A line in an awk script is terminated by a newline or a semicolon. Using semicolons to put multiple statements on a line, while permitted, reduces the readability of most programs. Blank lines are permitted between statements. Program control statements (do, if, for, or while) continue on the next line, where a dependent statement is listed. If multiple dependent statements are specified, they must be enclosed within braces.

if (NF > 1) {
	name = $1
	total += $2
}

You cannot use a semicolon to avoid using braces for multiple statements. You can type a single statement over multiple lines by escaping the newline with a backslash (\).
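For instance, the following (contrived) action is one statement spread over two lines; without the backslash, the line ending in "+" would be a syntax error:

{ total += $2 + \
	$3 }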
You can also break lines following any of the following characters:

,	{	&&	||

Gawk also allows you to continue a line after either a "?" or a ":". Strings cannot be broken across a line (except in gawk, using "\" followed by a newline).

B.2.2.2 Comments

A comment begins with a "#" and ends with a newline. It can appear on a line by itself or at the end of a line. Comments are descriptive remarks that explain the operation of the script. Comments cannot be continued across lines by ending them with a backslash.

B.2.3 Patterns

A pattern can be any of the following:

/regular expression/
relational expression
BEGIN
END
pattern, pattern

1. Regular expressions use the extended set of metacharacters and must be enclosed in slashes. For a full discussion of regular expressions, see Chapter 3, Understanding Regular Expression Syntax.

2. Relational expressions use the relational operators listed under "Expressions" later in this appendix.

3. The BEGIN pattern is applied before the first line of input is read and the END pattern is applied after the last line of input is read.

4. Use ! to negate the match; i.e., to handle lines not matching the pattern.

5. You can address a range of lines, just as in sed:

pattern, pattern

Patterns, except BEGIN and END, can be expressed in compound forms using the following operators:

&&	Logical And
||	Logical Or

Sun's version of nawk (SunOS 4.1.x) does not support treating regular expressions as parts of a larger Boolean expression. E.g., "/cute/ && /sweet/" or "/fast/ || /quick/" do not work. In addition, the C conditional operator ?: (pattern ? pattern : pattern) may be used in a pattern.

6. Patterns can be placed in parentheses to ensure proper evaluation.

7. BEGIN and END patterns must be associated with actions. If multiple BEGIN and END rules are written, they are merged into a single rule before being applied.

B.2.4 Regular Expressions

Table B.1 summarizes the regular expressions as described in Chapter 3. The metacharacters are listed in order of precedence.

Table B.1: Regular Expression Metacharacters

Special Characters	Usage
c	Matches any literal character c that is not a metacharacter.
\	Escapes any metacharacter that follows, including itself.
^	Anchors following regular expression to the beginning of string.
$	Anchors preceding regular expression to the end of string.
.	Matches any single character, including newline.
[...]	Matches any one of the class of characters enclosed between the brackets. A circumflex (^) as the first character inside brackets reverses the match to all characters except those listed in the class. A hyphen (-) is used to indicate a range of characters. The close bracket (]) as the first character in a class is a member of the class. All other metacharacters lose their meaning when specified as members of a class, except \, which can be used to escape ], even if it is not first.
r1|r2	Between two regular expressions, r1 and r2, it allows either of the regular expressions to be matched.
(r1)(r2)	Used for concatenating regular expressions.
r*	Matches any number (including zero) of the regular expression that immediately precedes it.
r+	Matches one or more occurrences of the preceding regular expression.
r?	Matches 0 or 1 occurrences of the preceding regular expression.
(r)	Used for grouping regular expressions.

Regular expressions can also make use of the escape sequences for accessing special characters, as defined in the section "Escape sequences" later in this appendix.
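As a quick illustration (our own example, not from the reference), here is a rule that combines several of these metacharacters to match records whose first field looks like a signed decimal number:

$1 ~ /^[-+]?[0-9]+(\.[0-9]+)?$/ { print "numeric:", $1 }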
Note that ^ and $ work on strings; they do not match against newlines embedded in a record or string.

Within a pair of brackets, POSIX allows special notations for matching non-English characters. They are described in Table B.2.

Table B.2: POSIX Character List Facilities

Notation	Facility
[.symbol.]	Collating symbols. A collating symbol is a multi-character sequence that should be treated as a unit.
[=equiv=]	Equivalence classes. An equivalence class lists a set of characters that should be considered equivalent, such as "e" and "è".
[:class:]	Character classes. Character class keywords describe different classes of characters such as alphabetic characters, control characters, and so on.
	[:alnum:] Alphanumeric characters
	[:alpha:] Alphabetic characters
	[:blank:] Space and tab characters
	[:cntrl:] Control characters
	[:digit:] Numeric characters
	[:graph:] Printable and visible (non-space) characters
	[:lower:] Lowercase characters
	[:print:] Printable characters
	[:punct:] Punctuation characters
	[:space:] Whitespace characters
	[:upper:] Uppercase characters
	[:xdigit:] Hexadecimal digits

Note that these facilities (as of this writing) are still not widely implemented.

B.2.5 Expressions

An expression can be made up of constants, variables, operators and functions. A constant is a string (any sequence of characters) or a numeric value. A variable is a symbol that references a value. You can think of it as a piece of information that retrieves a particular numeric or string value.

B.2.5.1 Constants

There are two types of constants, string and numeric. A string constant must be quoted while a numeric constant is not.

B.2.5.2 Escape sequences

The escape sequences described in Table B.3 can be used in strings and regular expressions.

Table B.3: Escape Sequences

Sequence	Description
\a	Alert character, usually ASCII BEL character
\b	Backspace
\f	Formfeed
\n	Newline
\r	Carriage return
\t	Horizontal tab
\v	Vertical tab
\ddd	Character represented as 1 to 3 digit octal value
\xhex	Character represented as hexadecimal value[1]
\c	Any literal character c (e.g., \" for ")[2]

[1] POSIX does not provide "\x", but it is commonly available.

[2] Like ANSI C, POSIX leaves it purposely undefined what you get when you put a backslash before any character not listed in the table. In most awks, you just get that character.

B.2.5.3 Variables

There are three kinds of variables: user-defined, built-in, and fields. By convention, the names of built-in or system variables consist of all capital letters.

The name of a variable cannot start with a digit. Otherwise, it consists of letters, digits, and underscores. Case is significant in variable names. A variable does not need to be declared or initialized. A variable can contain either a string or numeric value. An uninitialized variable has the empty string ("") as its string value and 0 as its numeric value. Awk attempts to decide whether a value should be processed as a string or a number depending upon the operation.

The assignment of a variable has the form:

var = expr

It assigns the value of the expression to var. The following expression assigns a value of 1 to the variable x.

x = 1

The name of the variable is used to reference the value:

{ print x }

prints the value of the variable x. In this case, it would be 1. See the section "System Variables" below for information on built-in variables.

A field variable is referenced using $n, where n is any number 0 to NF, that references the field by position.
It can be supplied by a variable, such as $NF meaning the last field, or a constant, such as $1 meaning the first field.

B.2.5.4 Arrays

An array is a variable that can be used to store a set of values. The following statement assigns a value to an element of an array:

array[index] = value

In awk, all arrays are associative arrays. What makes an associative array unique is that its index can be a string or a number. An associative array makes an "association" between the indices and the elements of an array. For each element of the array, a pair of values is maintained: the index of the element and the value of the element. The elements are not stored in any particular order as in a conventional array. You can use the special for loop to read all the elements of an associative array.

for (item in array)

The index of the array is available as item, while the value of an element of the array can be referenced as array[item].

You can use the operator in to test that an element exists by testing to see if its index exists.

if (index in array)

tests that array[index] exists, but you cannot use it to test the value of the element referenced by array[index].

You can also delete individual elements of the array using the delete statement.

B.2.5.5 System variables

Awk defines a number of special variables that can be referenced or reset inside a program, as shown in Table B.4 (defaults are listed in parentheses).

Table B.4: Awk System Variables

Variable	Description
ARGC	Number of arguments on command line
ARGV	An array containing the command-line arguments
CONVFMT	String conversion format for numbers (%.6g). (POSIX)
ENVIRON	An associative array of environment variables
FILENAME	Current filename
FNR	Like NR, but relative to the current file
FS	Field separator (a blank)
NF	Number of fields in current record
NR	Number of the current record
OFMT	Output format for numbers (%.6g)
OFS	Output field separator (a blank)
ORS	Output record separator (a newline)
RLENGTH	Length of the string matched by match() function
RS	Record separator (a newline)
RSTART	First position in the string matched by match() function
SUBSEP	Separator character for array subscripts (\034)

B.2.5.6 Operators

Table B.5 lists the operators, in order of precedence (low to high), that are available in awk.

Table B.5: Operators

Operators	Description
= += -= *= /= %= ^= **=	Assignment
?:	C conditional expression
||	Logical OR
&&	Logical AND
~ !~	Match regular expression and negation
< <= > >= != ==	Relational operators
(blank)	Concatenation
+ -	Addition, subtraction
* / %	Multiplication, division, and modulus
+ - !	Unary plus and minus, and logical negation
^ **	Exponentiation
++ --	Increment and decrement, either prefix or postfix
$	Field reference

NOTE: While "**" and "**=" are common extensions, they are not part of POSIX awk.

B.2.6 Statements and Functions

An action is enclosed in braces and consists of one or more statements and/or expressions. The difference between a statement and a function is that a function returns a value, and its argument list is specified within parentheses. (The formal syntactical difference does not always hold true: printf is considered a statement, but its argument list can be put in parentheses; getline is a function that does not use parentheses.)

Awk has a number of predefined arithmetic and string functions. A function is typically called as follows:

return = function(arg1,arg2)

where return is a variable created to hold what the function returns.
(In fact, the return value of a function can be used anywhere in an expression, not just on the right-hand side of an assignment.) Arguments to a function are specified as a comma-separated list. The left parenthesis follows after the name of the function. (With built-in functions, a space is permitted between the function name and the parentheses.)

Appendix B
Quick Reference for awk

B.3 Command Summary for awk

The following alphabetical list of statements and functions includes all that are available in POSIX awk, nawk, or gawk. See Chapter 11, A Flock of awks, for extensions available in different implementations.

atan2()
atan2(y, x)
Returns the arctangent of y/x in radians.

break
Exit from a while, for, or do loop.

close()
close(filename-expr)
close(command-expr)
In most implementations of awk, you can only have a limited number of files and/or pipes open simultaneously. Therefore, awk provides a close() function that allows you to close a file or a pipe. It takes as an argument the same expression that opened the pipe or file. This expression must be identical, character by character, to the one that opened the file or pipe - even whitespace is significant.

continue
Begin next iteration of while, for, or do loop.

cos()
cos(x)
Return cosine of x in radians.

delete
delete array[element]
Delete element of an array.

do
do
	body
while (expr)
Looping statement. Execute statements in body then evaluate expr and if true, execute body again.

exit
exit [expr]
Exit from script, reading no new input. The END rule, if it exists, will be executed. An optional expr becomes awk's return value.

exp()
exp(x)
Return exponential of x (e ^ x).

for
for (init-expr; test-expr; incr-expr)
	statement
C-style looping construct. init-expr assigns the initial value of the counter variable. test-expr is a relational expression that is evaluated each time before executing the statement. When test-expr is false, the loop is exited. incr-expr is used to increment the counter variable after each pass.

for (item in array)
	statement
Special loop designed for reading associative arrays. For each element of the array, the statement is executed; the element can be referenced by array[item].

getline
Read next line of input.
getline [var] [<file]
Read next line of input from file.
command | getline [var]
Read next line of input from the output of command.
Plain getline reads the next record from the current input, updating $0, NF, NR, and FNR; getline var updates var, NR, and FNR. Reading from a file sets $0 and NF (or just var); reading from a command does the same with the command's output. In all forms, the return value is 1 for a successful read, 0 at end of file, and -1 on an error.

gsub()
gsub(r, s, t)
Globally substitute s for each match of the regular expression r in the string t. Return the number of substitutions. If t is not supplied, defaults to $0.

index()
index(str, substr)
Return position of substring substr in string str, or 0 if it is not present.

int()
int(x)
Return integer part of x, truncating toward zero.

length()
length(arg)
Return length of arg taken as a string, or of $0 if no argument.

log()
log(x)
Return natural logarithm of x.

match()
match(s, r)
Return the position in string s where the regular expression r first matches, or 0 if no occurrence is found. Sets the values of RSTART and RLENGTH.

next
Read next input line and start over with the first rule in the script.

print
print [ output-expr[, ...] ] [ dest-expr ]
Evaluate the output-expr(s) and direct the result to standard output, followed by the value of ORS. Each comma-separated output-expr is separated in the output by the value of OFS. With no output-expr, print $0. dest-expr is an optional expression that directs the output to a file or pipe. "> file" directs the output to a file, overwriting its previous contents. ">> file" appends the output to a file, preserving its previous contents. In both of these cases, the file will be created if it does not already exist. "| command" directs the output as the input to a system command.

printf
printf (format-expr [, expr-list ]) [ dest-expr ]
An alternative output statement borrowed from the C language. It has the ability to produce formatted output. It can also be used to output data without automatically producing a newline. format-expr is a string of format specifications and constants; see next section for a list of format specifiers. expr-list is a list of arguments corresponding to format specifiers. See the print statement for a description of dest-expr.

rand()
rand()
Generate a random number between 0 and 1. This function returns the same series of numbers each time the script is executed, unless the random number generator is seeded using the srand() function.

return
return [expr]
Used at end of user-defined functions to exit function, returning value of expression.

sin()
sin(x)
Return sine of x in radians.
split()
split(str, array, sep)
Function that parses string into elements of array using field separator, returning number of elements in array. Value of FS is used if no field separator is specified. Array splitting works the same as field splitting.

sprintf()
sprintf (format-expr [, expr-list ] )
Function that returns string formatted according to printf format specification. It formats data but does not output it. format-expr is a string of format specifications and constants; see the next section for a list of format specifiers. expr-list is a list of arguments corresponding to format specifiers.

sqrt()
sqrt(x)
Return square root of x.

srand()
srand(expr)
Use expr to set a new seed for random number generator. Default is time of day. Return value is the old seed.

sub()
sub(r, s, t)
Substitute s for first match of the regular expression r in the string t. Return 1 if successful; 0 otherwise. If t is not supplied, defaults to $0.

substr()
substr(str, beg, len)
Return substring of string str at beginning position beg, and the characters that follow to maximum specified length len. If no length is given, use the rest of the string.

system()
system(command)
Function that executes the specified command and returns its status. The status of the executed command typically indicates success or failure. A value of 0 means that the command executed successfully. A non-zero value, whether positive or negative, indicates a failure of some sort. The documentation for the command you're running will give you the details. The output of the command is not available for processing within the awk script. Use "command | getline" to read the output of a command into the script.

tolower()
tolower(str)
Translate all uppercase characters in str to lowercase and return the new string.[3]

[3] Very early versions of nawk, such as that in SunOS 4.1.x, don't support tolower() and toupper(). However, they are now part of the POSIX specification for awk.

toupper()
toupper(str)
Translate all lowercase characters in str to uppercase and return the new string.

while
while (expr)
	statement
Looping construct. While expr is true, execute statement.

B.3.1 Format Expressions Used in printf and sprintf

A format expression can take three optional modifiers following "%" and preceding the format specifier:

%-width.precision format-specifier

The width of the output field is a numeric value. When you specify a field width, the contents of the field will be right-justified by default. You must specify "-" to get left-justification. Thus, "%-20s" outputs a string left-justified in a field 20 characters wide. If the string is less than 20 characters, the field will be padded with spaces to fill.

The precision modifier, used for decimal or floating-point values, controls the number of digits that appear to the right of the decimal point. For string formats, it controls the number of characters from the string to print.

You can specify both the width and precision dynamically, via values in the printf or sprintf argument list. You do this by specifying asterisks, instead of specifying literal values.

printf("%*.*g\n", 5, 3, myvar);

In this example, the width is 5, the precision is 3, and the value to print will come from myvar. Older versions of nawk may not support this.

Note that the default precision for the output of numeric values is "%.6g." The default can be changed by setting the system variable OFMT. This affects the precision used by the print statement when outputting numbers.
For instance, if you are using awk to write reports that contain dollar values, you might prefer to change OFMT to "%.2f."

The format specifiers, shown in Table B.6, are used with printf and sprintf statements.

Table B.6: Format Specifiers Used in printf

Character	Description
c	ASCII character.
d	Decimal integer.
i	Decimal integer. Added in POSIX.
e	Floating-point format ([-]d.precisione[+-]dd).
E	Floating-point format ([-]d.precisionE[+-]dd).
f	Floating-point format ([-]ddd.precision).
g	e or f conversion, whichever is shortest, with trailing zeros removed.
G	E or f conversion, whichever is shortest, with trailing zeros removed.
o	Unsigned octal value.
s	String.
x	Unsigned hexadecimal number. Uses a-f for 10 to 15.
X	Unsigned hexadecimal number. Uses A-F for 10 to 15.
%	Literal %.

Often, whatever format specifiers are available in the system's sprintf(3) subroutine are available in awk.

The way printf and sprintf() do rounding will often depend upon the system's C sprintf(3) subroutine. On many machines, sprintf rounding is "unbiased," which means it doesn't always round a trailing ".5" up, contrary to naive expectations. In unbiased rounding, ".5" rounds to even, rather than always up, so 1.5 rounds to 2 but 4.5 rounds to 4. The result is that if you are using a format that does rounding (e.g., "%.0f") you should check what your system does. The following function does traditional rounding; it might be useful if your awk's printf does unbiased rounding.

# round --- do normal rounding
# Arnold Robbins, arnold@gnu.ai.mit.edu
# Public Domain
function round(x,   ival, aval, fraction)
{
	ival = int(x)	# integer part, int() truncates
	# see if fractional part
	if (ival == x)	# no fraction
		return x
	if (x < 0) {
		aval = -x	# absolute value
		ival = int(aval)
		fraction = aval - ival
		if (fraction >= .5)
			return int(x) - 1	# -2.5 --> -3
		else
			return int(x)		# -2.3 --> -2
	} else {
		fraction = x - ival
		if (fraction >= .5)
			return ival + 1
		else
			return ival
	}
}

Appendix C
C. Supplement for Chapter 12

Contents:
Full Listing of spellcheck.awk
Listing of masterindex Shell Script
Documentation for masterindex

This appendix contains supplemental programs and documentation for the programs described in Chapter 12, Full-Featured Applications.

C.1 Full Listing of spellcheck.awk

# spellcheck.awk -- interactive spell checker # # AUTHOR: Dale Dougherty # # Usage: nawk -f spellcheck.awk [+dict] file # (Use spellcheck as name of shell program) # SPELLDICT = "dict" # SPELLFILE = "file" # BEGIN actions perform the following tasks: # 1) process command line arguments # 2) create temporary filenames # 3) execute spell program to create wordlist file # 4) display list of user responses BEGIN { # Process command line arguments # Must be at least two args -- nawk and filename if (ARGC > 1) { # if more than two args, second arg is dict if (ARGC > 2) { # test to see if dict is specified with "+" # and assign ARGV[1] to SPELLDICT if (ARGV[1] ~ /^\+.*/) SPELLDICT = ARGV[1] else SPELLDICT = "+" ARGV[1] # assign file ARGV[2] to SPELLFILE SPELLFILE = ARGV[2] # delete args so awk does not open them as files delete ARGV[1] delete ARGV[2] } # not more than two args else { # assign file ARGV[1] to SPELLFILE SPELLFILE = ARGV[1] # test to see if local dict file exists if (! system ("test -r dict")) { # if it does, ask if we should use it printf ("Use local dict file? 
(y/ n)") getline reply < "-" # if reply is yes, use "dict" if (reply ~ /[yY](es)?/){ SPELLDICT = "+dict" } } } } # end of processing args > 1 # if args not > 1, then print shell-command usage else { print "Usage: spellcheck [+dict] file" exit 1 } # end of processing command line arguments # create temporary file names, each begin with sp_ wordlist = "sp_wordlist" spellsource = "sp_input" spellout = "sp_out" # copy SPELLFILE to temporary input file system("cp " SPELLFILE " " spellsource) # now run spell program; output sent to wordlist print "Running spell checker ..." if (SPELLDICT) SPELLCMD = "spell " SPELLDICT " " else SPELLCMD = "spell " system(SPELLCMD spellsource " > " wordlist ) # test wordlist to see if misspelled words turned up if ( system("test -s " wordlist ) ) { # if wordlist is empty, (or spell command failed), exit print "No misspelled words found." system("rm " spellsource " " wordlist) exit } # assign wordlist file to ARGV[1] so that awk will read it. ARGV[1] = wordlist # display list of user responses responseList = "Responses: \n\tChange each occurrence," responseList = responseList "\n\tGlobal change," responseList = responseList "\n\tAdd to Dict," responseList = responseList "\n\tHelp," responseList = responseList "\n\tQuit" responseList = responseList "\n\tCR to ignore: " printf("%s", responseList) } # end of BEGIN procedure # main procedure, executed for each line in wordlist. # Purpose is to show misspelled word and prompt user # for appropriate action. { # assign word to misspelling misspelling = $1 response = 1 ++word # print misspelling and prompt for response while (response !~ /(^[cCgGaAhHqQ])|^$/ ) { printf("\n%d - Found %s (C/G/A/H/Q/):", word, misspelling) getline response < "-" } # now process the user's response # CR - carriage return ignores current word # Help if (response ~ /[Hh](elp)?/) { # Display list of responses and prompt again. printf("%s", responseList) printf("\n%d - Found %s (C/G/A/Q/):", word, misspelling) getline response < "-" } # Quit if (response ~ /[Qq](uit)?/) exit # Add to dictionary if ( response ~ /[Aa](dd)?/) { dict[++dictEntry] = misspelling } # Change each occurrence if ( response ~ /[cC](hange)?/) { # read each line of the file we are correcting newspelling = ""; changes = "" while( (getline < spellsource) > 0){ # call function to show line with misspelled word # and prompt user to make each correction make_change($0) # all lines go to temp output file print > spellout } # all lines have been read # close temp input and temp output file close(spellout) close(spellsource) # if change was made if (changes){ # show changed lines for (j = 1; j <= changes; ++j) print changedLines[j] printf ("%d lines changed. ", changes) # function to confirm before saving changes confirm_changes() } } # Globally change if ( response ~ /[gG](lobal)?/) { # call function to prompt for correction # and display each line that is changed. # Ask user to approve all changes before saving. make_global_change() } } # end of Main procedure # END procedure makes changes permanent. # It overwrites the original file, and adds words # to the dictionary. # It also removes the temporary files. END { # if we got here after reading only one record, # no changes were made, so exit. if (NR <= 1) exit # user must confirm saving corrections to file while (saveAnswer !~ /([yY](es)?)|([nN]o?)/ ) { printf "Save corrections in %s (y/n)? 
", SPELLFILE getline saveAnswer < "-" } # if answer is yes then mv temporary input file to SPELLFILE # save old SPELLFILE, just in case if (saveAnswer ~ /^[yY]/) { system("cp " SPELLFILE " " SPELLFILE ".orig") system("mv " spellsource " " SPELLFILE) } # if answer is no then rm temporary input file if (saveAnswer ~ /^[nN]/) system("rm " spellsource) # if words have been added to dictionary array, then prompt # to confirm saving in current dictionary. if (dictEntry) { printf "Make changes to dictionary (y/n)? " getline response < "-" if (response ~ /^[yY]/){ # if no dictionary defined, then use "dict" if (! SPELLDICT) SPELLDICT = "dict" # loop through array and append words to dictionary sub(/^\+/, "", SPELLDICT) for ( item in dict ) print dict[item] >> SPELLDICT close(SPELLDICT) # sort dictionary file system("sort " SPELLDICT "> tmp_dict") system("mv " "tmp_dict " SPELLDICT) } } # remove word list system("rm sp_wordlist") } # end of END procedure # function definitions # make_change -- prompt user to correct misspelling # for current input line. Calls itself # to find other occurrences in string. # stringToChange -- initially $0; then unmatched substring of $0 # len -- length from beginning of $0 to end of matched string # Assumes that misspelling is defined. function make_change (stringToChange, len, # parameters line, OKmakechange, printstring, carets) # locals { # match misspelling in stringToChange; otherwise do nothing if ( match(stringToChange, misspelling) ) { # Display matched line printstring = $0 gsub(/\t/, " ", printstring) print printstring carets = "^" for (i = 1; i < RLENGTH; ++i) carets = carets "^" if (len) FMT = "%" len+RSTART+RLENGTH-2 "s\n" else FMT = "%" RSTART+RLENGTH-1 "s\n" printf(FMT, carets) # Prompt user for correction, if not already defined if (! newspelling) { printf "Change to:" getline newspelling < "-" } # A carriage return falls through # If user enters correction, confirm while (newspelling && ! OKmakechange) { printf ("Change %s to %s? (y/n):", misspelling, newspelling) getline OKmakechange < "-" madechg = "" # test response if (OKmakechange ~ /[yY](es)?/ ) { # make change (first occurrence only) madechg = sub(misspelling, newspelling, stringToChange) } else if ( OKmakechange ~ /[nN]o?/ ) { # offer chance to re-enter correction printf "Change to:" getline newspelling < "-" OKmakechange = "" } } # end of while loop # if len, we are working with substring of $0 if (len) { # assemble it line = substr($0,1,len-1) $0 = line stringToChange } else { $0 = stringToChange if (madechg) ++changes } # put changed line in array for display if (madechg) changedLines[changes] = ">" $0 # create substring so we can try to match other occurrences len += RSTART + RLENGTH part1 = substr($0, 1, len-1) part2 = substr($0, len) # calls itself to see if misspelling is found in remaining part make_change(part2, len) } # end of if } # end of make_change() # make_global_change -- # prompt user to correct misspelling # for all lines globally. # Has no arguments # Assumes that misspelling is defined. function make_global_change( newspelling, OKmakechange, changes) { # prompt user to correct misspelled word printf "Globally change to:" getline newspelling < "-" # carriage return falls through # if there is an answer, confirm while (newspelling && ! OKmakechange) { printf ("Globally change %s to %s? 
(y/n):", misspelling, newspelling) getline OKmakechange < "-" # test response and make change if (OKmakechange ~ /[yY](es)?/ ) { # open file, read all lines while( (getline < spellsource) > 0){ # if match is found, make change using gsub # and print each changed line. if ($0 ~ misspelling) { madechg = gsub (misspelling, newspelling) print ">", $0 changes += 1 # counter for line changes } # write all lines to temp output file print > spellout } # end of while loop for reading file # close temporary files close(spellout) close(spellsource) # report the number of changes printf ("%d lines changed. ", changes) # function to confirm before saving changes confirm_changes() } # end of if (OKmakechange ~ y) # if correction not confirmed, prompt for new word else if ( OKmakechange ~ /[nN]o?/ ){ printf "Globally change to:" getline newspelling < "-" OKmakechange = "" } } # end of while loop for prompting user for correction } # end of make_global_change() # confirm_changes -- # confirm before saving changes function confirm_changes( savechanges) { # prompt to confirm saving changes while (! savechanges ) { printf ("Save changes? (y/n)") getline savechanges < "-" } # if confirmed, mv output to input if (savechanges ~ /[yY](es)?/) system("mv " spellout " " spellsource) } B.3 Command Summary for awk C.2 Listing of masterindex Shell Script Appendix C Supplement for Chapter 12 C.2 Listing of masterindex Shell Script #! /bin/sh # 1.1 -- 7/9/90 MASTER="" FILES="" PAGE="" FORMAT=1 INDEXDIR=/work/sedawk/awk/index #INDEXDIR=/work/index INDEXMACDIR=/work/macros/current # Add check that all dependent modules are available. sectNumber=1 useNumber=1 while [ "$#" != "0" ]; do case $1 in -m*) MASTER="TRUE";; [1-9]) sectNumber=$1;; *,*) sectNames=$1; useNumber=0;; -p*) PAGE="TRUE";; -s*) FORMAT=0;; -*) echo $1 " is not a valid argument";; *) if [ -f $1 ]; then FILES="$FILES $1" else echo "$1: file not found" fi;; esac shift done if [ "$FILES" = "" ]; then echo "Please supply a valid filename." exit fi if [ "$MASTER" != "" ]; then for x in $FILES do if [ "$useNumber" != 0 ]; then romaNum=`$INDEXDIR/romanum $sectNumber` awk '-F\t' ' NF == 1 { print $0 } NF > 1 { print $0 ":" volume } ' volume=$romaNum $x >>/tmp/index$$ sectNumber=`expr $sectNumber + 1` else awk '-F\t' ' NR == 1 { split(namelist, names, ","); volname = names[volume] } NF == 1 { print $0 } NF > 1 { print $0 ":" volname } ' volume=$sectNumber namelist=$sectNames $x >>/tmp/ index$$ sectNumber=`expr $sectNumber + 1` fi done FILES="/tmp/index$$" fi if [ "$PAGE" != "" ]; then $INDEXDIR/page.idx $FILES exit fi $INDEXDIR/input.idx $FILES | sort -bdf -t: +0 -1 +1 -2 +3 -4 +2n -3n | uniq | $INDEXDIR/pagenums.idx | $INDEXDIR/combine.idx | $INDEXDIR/format.idx FMT=$FORMAT MACDIR=$INDEXMACDIR if [ -s "/tmp/index$$" ]; then rm /tmp/index$$ fi C.1 Full Listing of spellcheck. awk C.3 Documentation for masterindex Appendix C Supplement for Chapter 12 C.3 Documentation for masterindex This documentation, and the notes that follow, are by Dale Dougherty. C.3.1 masterindex indexing program for single and multivolume indexing. Synopsis masterindex [-master [volume]] [-page] [-screen] [filename..] Description masterindex generates a formatted index based on structured index entries output by troff. Unless you redirect output, it comes to the screen. Options -m or -master indicates that you are compiling a multivolume index. The index entries for each volume should be in a single file and the filenames should be listed in sequence. 
C.3 Documentation for masterindex

This documentation, and the notes that follow, are by Dale Dougherty.

C.3.1 masterindex

masterindex - indexing program for single and multivolume indexing.

Synopsis

masterindex [-master [volume]] [-page] [-screen] [filename..]

Description

masterindex generates a formatted index based on structured index entries output by troff. Unless you redirect output, it comes to the screen.

Options

-m or -master indicates that you are compiling a multivolume index. The index entries for each volume should be in a single file, and the filenames should be listed in sequence. If the first file is not the first volume, then specify the volume number as a separate argument. The volume number is converted to a roman numeral and prepended to all the page numbers of entries in that file.

-p or -page produces a listing of index entries for each page number. It can be used to proof the entries against hardcopy.

-s or -screen specifies that the unformatted index will be viewed on the "screen". The default is to prepare output that contains troff macros for formatting.

Files

/work/bin/masterindex
/work/bin/page.idx
/work/bin/pagenums.idx
/work/bin/combine.idx
/work/bin/format.idx
/work/bin/rotate.idx
/work/bin/romanum
/work/macros/current/indexmacs

See Also

Note that these programs require "nawk" (new awk): nawk(1), and sed(1V).

Bugs

The new index program is modular, invoking a series of smaller programs. This should allow me to connect different modules to implement new features, as well as to isolate and fix problems more easily. Index entries should not contain any troff font changes; the program does not handle them. Roman numerals greater than eight will not be sorted properly, thus imposing a limit of an eight-book index. (The sort program will sort the roman numerals 1-10 in the following order: I, II, III, IV, IX, V, VI, VII, VIII, X.)

C.3.2 Background Details

Tim O'Reilly recommends The Joy of Cooking (JofC) index as an ideal index. I examined the JofC index quite thoroughly and set out to write a new indexing program that duplicated its features. I did not wholly duplicate the JofC format, but this could be done fairly easily if desired. Please look at the JofC index yourself to examine its features. I also tried to do a few other things to improve on the previous index program and provide more support for the person coding the index.

C.3.3 Coding Index Entries

This section describes the coding of index entries in the document file. We use the .XX macro for placing index entries in a file. The simplest case is:

.XX "entry"

If the entry consists of primary and secondary sort keys, then we can code it as:

.XX "primary, secondary"

A comma delimits the two keys. We also have a .XN macro for generating "See" references without a page number. It is specified as:

.XN "entry (See anotherEntry)"

While these coding forms continue to work as they have, masterindex provides greater flexibility by allowing three levels of keys: primary, secondary, and tertiary. You'd specify the entry like so:

.XX "primary: secondary; tertiary"

Note that the comma is not used as a delimiter. A colon delimits the primary and secondary entry; the semicolon delimits the secondary and tertiary entry. This means that commas can be part of a key using this syntax. Don't worry, though; you can continue to use a comma to delimit the primary and secondary keys. (Be aware that the first comma in a line is converted to a colon if no colon delimiter is found.) I'd recommend that new books be coded using the above syntax, even if you are only specifying a primary and secondary key.

Another feature is automatic rotation of primary and secondary keys if a tilde (~) is used as the delimiter. So the following entry:

.XX "cat~command"

is equivalent to the following two entries:

.XX "cat command"
.XX "command: cat"

You can think of the secondary key as a classification (command, attribute, function, etc.) of the primary entry. Be careful not to reverse the two, as "command cat" does not make much sense. To use a tilde in an entry, enter "~~".
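To make the delimiter rules concrete, here is a minimal awk sketch of the tilde rotation and the comma-to-colon conversion described above. It is an illustration only, not the actual code from input.idx, which handles many more cases; the filename delimiters.awk is mine:

# delimiters.awk -- illustrate the entry delimiter rules
# (a simplified sketch; input.idx does the real work)
{
    entry = $0
    # "~~" stands for a literal tilde; skip rotation in that case
    if (entry ~ /~/ && entry !~ /~~/) {
        split(entry, part, "~")
        print part[1] " " part[2]   # e.g., "cat command"
        print part[2] ": " part[1]  # e.g., "command: cat"
        next
    }
    # first comma becomes a colon if no colon delimiter is found
    if (entry !~ /:/)
        sub(/, */, ": ", entry)
    print entry
}

For example, echo "cat~command" | awk -f delimiters.awk prints the two rotated entries, while "primary, secondary" comes out as "primary: secondary".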
I added a new macro, .XB, that is the same as .XX except that the page number for this index entry will be output in bold to indicate that it is the most significant page number in a range. Here is an example:

.XB "cat command"

When troff processes the index entries, it outputs the page number followed by an asterisk. This is how it appears when output is seen in screen format. When coded for troff formatting, the page number is surrounded by the bold font change escape sequences. (By the way, in the JofC index, I noticed that they allowed having the same page number in roman and in bold.) Also, this page number will not be combined in a range of consecutive numbers.

One other feature of the JofC index is that the very first secondary key appears on the same line with the primary key. The old index program placed any secondary key on the next line. The one advantage of doing it the JofC way is that entries containing only one secondary key will be output on the same line and look much better. Thus, you'd have "line justification, definition of" rather than having "definition of" indented on the next line. The next secondary key would be indented. Note that if the primary key exists as a separate entry (it has page numbers associated with it), the page references for the primary key will be output on the same line and the first secondary entry will be output on the next line.

To reiterate, while the syntax of the three-level entries is different, this index entry is perfectly valid:

.XX "line justification, definition of"

It also produces the same result as:

.XX "line justification: definition of"

(The colon disappears in the output.) Similarly, you could write an entry such as:

.XX "justification, lines, defined"

or

.XX "justification: lines, defined"

where the comma between "lines" and "defined" does not serve as a delimiter but is part of the secondary key. The previous example could be written as an entry with three levels:

.XX "justification: lines; defined"

where the semicolon delimits the tertiary key. The semicolon is output with the key, and multiple tertiary keys may follow immediately after the secondary key. The main thing, though, is that page numbers are collected for all primary, secondary, and tertiary keys. Thus, you could have output such as:

justification  4-9
    lines  4,6; defined, 5

C.3.4 Output Format

One thing I wanted to do that our previous program did not do is generate an index without the troff codes. masterindex has three output modes: troff, screen, and page.

The default output is intended for processing by troff (via fmt). It contains macros that are defined in /work/macros/current/indexmacs. These macros should produce the same index format as before, which was largely done directly through troff requests. Here are a few lines off the top:

$ masterindex ch01
.so /work/macros/current/indexmacs
.Se "" "Index"
.XC
.XF A "A"
.XF 1 "applications, structure of 2; program 1"
.XF 1 "attribute, WIN_CONSUME_KBD_EVENTS 13"
.XF 2 "WIN_CONSUME_PICK_EVENTS 13"
.XF 2 "WIN_NOTIFY_EVENT_PROC 13"
.XF 2 "XV_ERROR_PROC 14"
.XF 2 "XV_INIT_ARGC_PTR_ARGV 5,6"

The top two lines should be obvious. The .XC macro produces multicolumn output. (It will print out two columns for smaller books. It's not smart enough to take arguments specifying the width of columns, but it should be.) The .XF macro has three possible values for its first argument. An "A" indicates that the second argument is a letter of the alphabet that should be output as a divider.
A "1" indicates that the second argument contains a primary entry. A "2" indicates that the entry begins with a secondary entry, which is indented. When invoked with the -s argument, the program prepares the index for viewing on the screen (or printing as an ASCII file). Again, here are a few lines: $ masterindex -s ch01 A applications, structure of 2; program 1 attribute, WIN_CONSUME_KBD_EVENTS 13 WIN_CONSUME_PICK_EVENTS 13 WIN_NOTIFY_EVENT_PROC 13 XV_ERROR_PROC 14 XV_INIT_ARGC_PTR_ARGV 5,6 XV_INIT_ARGS 6 XV_USAGE_PROC 6 Obviously, this is useful for quickly proofing the index. The third type of format is also used for proofing the index. Invoked using -p, it provides a page-by-page listing of the index entries. $ masterindex -p ch01 Page 1 structure of XView applications applications, structure of; program XView applications XView applications, structure of XView interface compiling XView programs XView, compiling programs Page 2 XView libraries C.3.5 Compiling a Master Index A multivolume master index is invoked by specifying the -m option. Each set of index entries for a particular volume must be placed in a separate file. $ masterindex -m -s book1 book2 book3 xv_init() procedure II: 4; III: 5 XV_INIT_ARGC_PTR_ARGV attribute II: 5,6 XV_INIT_ARGS attribute I: 6 Files must be specified in consecutive order. If the first file is not Volume 1, you can specify the number as an argument. $ masterindex -m 4 -s book4 book5 C.2 Listing of masterindex Shell Script