) { # process the line }
    __END__
    # your data goes here

7.12.3 Discussion

The __DATA__ and __END__ symbols tell the Perl compiler there's nothing more for it to do in the current file. They represent the logical end for code in a module or a program before the physical end-of-file. Text after __DATA__ or __END__ can be read through the per-package DATA filehandle. For example, take the hypothetical module Primes. Text after __DATA__ in Primes.pm can be read from the Primes::DATA filehandle. __END__ behaves as a synonym for __DATA__ in the main package. Any text occurring after an __END__ token in a module is completely inaccessible.

This lets you write self-contained programs instead of keeping data in separate files. Often this is used for documentation. Sometimes it's configuration data or old test data that the program was originally developed with, left lying about should it ever be needed again.

Another trick is to use DATA to find out the current program's or module's size or last modification date. On most systems, the $0 variable will contain the full pathname to your running script. On systems where $0 is not correct, you could try the DATA filehandle instead. This can be used to pull in the size, modification date, etc. Put a special token __DATA__ at the end of the file (and maybe a warning not to delete it), and the DATA filehandle is available to the script itself.

    use POSIX qw(strftime);

    $raw_time = (stat(DATA))[9];
    $size     = -s DATA;
    $kilosize = int($size / 1024) . "k";

    print "Script size is $kilosize\n";
    print strftime("Last script update: %c (%Z)\n", localtime($raw_time));

    __DATA__
    DO NOT REMOVE THE PRECEDING LINE.
    Everything else in this file will be ignored.

If you want to store more than one file in your program, see Recipe 7.13.

7.12.4 See Also

The "Scalar Value Constructors" section of perldata(1), and the "Other Literal Tokens" section of Chapter 2 of Programming Perl; Recipe 7.13

Recipe 7.13 Storing Multiple Files in the DATA Area

7.13.1 Problem

You've figured out how to use __END__ or __DATA__ to store a virtual file in your source code, but you now want multiple virtual files in one source file.

7.13.2 Solution

Use the Inline::Files module from CPAN. Carefully.

    use Inline::Files;

    while (<SETUP>) {
        # ...
    }
    while (<EXECUTION>) {
        # ...
    }

    __SETUP__
    everything for the SETUP filehandle goes here
    __EXECUTION__
    everything for the EXECUTION filehandle goes here

7.13.3 Discussion

One limitation with the __DATA__ setup is that you can have only one embedded data file per physical file. The CPAN module Inline::Files cleverly circumvents this restriction by providing logical embedded files. It's used like this:

    use Inline::Files;
    #
    # All your code for the file goes here first, then...
    #
    __ALPHA__
    This is the data in the
    first virtual file, ALPHA.
    __BETA__
    This is the data in the
    next virtual file, BETA.
    __OMEGA__
    This is the data in yet
    another virtual file, OMEGA.
    __ALPHA__
    This is more data in the
    second part of virtual file, ALPHA.

The code is expected to read from filehandles whose names correspond to the double-underbarred symbols: here ALPHA, BETA, and OMEGA. You may have more than one section by the same name in the same program, and differently named sections needn't be read in any particular order. These handles work much like the ARGV handle does. For one thing, they're implicitly opened on first usage. For example, using the following code in the designated spot in the preceding code example:

    while (<OMEGA>) {
        print "omega data: $_";
    }
    while (<ALPHA>) {
        print "alpha data: $_";
    }

would produce this:

    omega data: This is the data in yet
    omega data: another virtual file, OMEGA.
    alpha data: This is the data in the
    alpha data: first virtual file, ALPHA.
    alpha data: This is more data in the
    alpha data: second part of virtual file, ALPHA.

Also like the ARGV handle, while reading from a particular handle, the list of available virtual files is in the array by that name, and the currently opened virtual file is in the scalar by that name. There's also a hash by that name that holds various bits of status information about that set of virtual files, including the current file, line number, and byte offset. If we used the Perl debugger on this program and dumped out the variables, it might show this:

    DB<2> x \$ALPHA, \@ALPHA, \%ALPHA
    0  SCALAR(0x362e34)
       -> '/home/tchrist/inline-demo(00000000000000000291)'
    1  ARRAY(0x362e40)
       0  '/home/tchrist/inline-demo(00000000000000000291)'
       1  '/home/tchrist/inline-demo(00000000000000000476)'
    2  HASH(0x362edc)
       'file' => undef
       'line' => undef
       'offset' => undef
       'writable' => 1

What's that last line telling us? It tells whether that virtual file is writable. By default, if your script is writable, then so too are the virtual files, and they are opened in read-write mode! Yes, that means you can update them yourself, including even adding new virtual files to your source code simply by running that code.
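For instance, here's a minimal sketch of a script that rereads and extends one of its own virtual files each time it runs. The LOG section name is our own invention, and the sketch assumes the module's documented behavior of saving writes back into the source file when the program ends (note the -backup option, explained next):

    use Inline::Files -backup;          # keep a safety copy before self-modifying

    while (<LOG>) {                     # read whatever LOG accumulated so far
        print "previous run: $_";
    }
    print LOG scalar(localtime), "\n";  # append a timestamp for the next run

    __LOG__
    Tue May  5 09:30:37 1998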
There is absolutely no limit to the mischief or grief that can ensue from this: catastrophes are easy to come by as you accidentally obliterate your painstakingly won data. We therefore implore you to back everything up first. The module itself supports an automatic mechanism for this:

    use Inline::Files -backup;

which saves the original in a file with a ".bak" appended to it. You may also specify an explicit backup file:

    use Inline::Files -backup => "/tmp/safety_net";

7.13.4 See Also

The documentation for the CPAN module Inline::Files; Recipe 7.12

Recipe 7.14 Writing a Unix-Style Filter Program

7.14.1 Problem

You want to write a program that takes a list of filenames on the command line and reads from STDIN if no filenames were given. You'd like the user to be able to give the file "-" to indicate STDIN or "someprogram |" to indicate the output of another program. You might want your program to modify the files in place or to produce output based on its input.

7.14.2 Solution

Read lines with <>:

    while (<>) {
        # do something with the line
    }

7.14.3 Discussion

When you say:

    while (<>) {
        # ...
    }

Perl translates this into:[4]

[4] Except that the code written here won't work, because ARGV has internal magic.

    unshift(@ARGV, "-") unless @ARGV;
    while ($ARGV = shift @ARGV) {
        unless (open(ARGV, $ARGV)) {
            warn "Can't open $ARGV: $!\n";
            next;
        }
        while (defined($_ = <ARGV>)) {
            # ...
        }
    }

You can access ARGV and $ARGV inside the loop to read more from the filehandle or to find the filename currently being processed. Let's look at how this works.

7.14.3.1 Behavior

If the user supplies no arguments, Perl sets @ARGV to a single string, "-". This is shorthand for STDIN when opened for reading and STDOUT when opened for writing. It's also what lets the user of your program specify "-" as a filename on the command line to read from STDIN.

Next, the file-processing loop removes one argument at a time from @ARGV and copies the filename into the global variable $ARGV. If the file cannot be opened, Perl goes on to the next one. Otherwise, it processes a line at a time. When the file runs out, the loop goes back and opens the next one, repeating the process until @ARGV is exhausted.

The open statement didn't say open(ARGV, "<", $ARGV). There's no extra less-than sign supplied. This allows for interesting effects, like passing the string "gzip -dc file.gz |" as an argument, to make your program read the output of the command "gzip -dc file.gz". See Recipe 16.6 for more about this use of magic open.

You can change @ARGV before or inside the loop. Let's say you don't want the default behavior of reading from STDIN if there aren't any arguments—you want it to default to all C or C++ source and header files. Insert this line before you start processing <>:

    @ARGV = glob("*.[Cch]") unless @ARGV;

Process options before the loop, either with one of the Getopt libraries described in Chapter 15 or manually:

    # arg demo 1: Process optional -c flag
    if (@ARGV && $ARGV[0] eq "-c") {
        $chop_first++;
        shift;
    }

    # arg demo 2: Process optional -NUMBER flag
    if (@ARGV && $ARGV[0] =~ /^-(\d+)$/) {
        $columns = $1;
        shift;
    }

    # arg demo 3: Process clustering -a, -i, -n, or -u flags
    while (@ARGV && $ARGV[0] =~ /^-(.+)/ && (shift, ($_ = $1), 1)) {
        next if /^$/;
        s/a// && (++$append,      redo);
        s/i// && (++$ignore_ints, redo);
        s/n// && (++$nostdout,    redo);
        s/u// && (++$unbuffer,    redo);
        die "usage: $0 [-ainu] [filenames] ...\n";
    }
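For comparison, the first two demos might be handled by one of those Getopt libraries instead. This is just a sketch using the standard Getopt::Std module, reusing the $chop_first and $columns variables from the demos; the -NUMBER flag is recast as -n NUMBER, since getopts wants named switches:

    use Getopt::Std;

    # -c is a boolean flag; -n takes a value, as in: myprog -c -n 8 file1 file2
    my %opts;
    getopts("cn:", \%opts)
        or die "usage: $0 [-c] [-n columns] [filenames] ...\n";
    $chop_first++       if $opts{c};
    $columns = $opts{n} if defined $opts{n};
    # whatever remains in @ARGV is left for <> to process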
Other than its implicit looping over command-line arguments, <> is not special. The special variables controlling I/O still apply; see Chapter 8 for more on them. You can set $/ to change the line terminator, and $. contains the current line (record) number. If you undefine $/, you don't get the concatenated contents of all files at once; you get one complete file each time:

    undef $/;
    while (<>) {
        # $_ now has the complete contents of
        # the file whose name is in $ARGV
    }

If you localize $/, the old value is automatically restored when the enclosing block exits:

    {   # create block for local
        local $/;           # record separator now undef
        while (<>) {
            # do something; called functions still have
            # undeffed version of $/
        }
    }                       # $/ restored here

Because <> processing never explicitly closes filehandles, the record number in $. is not reset. If you don't like that, you can explicitly close the file yourself to reset $.:

    while (<>) {
        print "$ARGV:$.:$_";
        close ARGV if eof;
    }

The eof function defaults to checking the end-of-file status of the last file read. Since the last handle read was ARGV, eof reports whether we're at the end of the current file. If so, we close it and reset the $. variable. On the other hand, the special notation eof(), with parentheses but no argument, checks whether we've reached the end of all the files in the <> processing.

7.14.3.2 Command-line options

Perl has command-line options, -n, -p, -a, and -i, to make writing filters and one-liners easier. The -n option adds the while (<>) loop around your program text. It's normally used for filters like grep or programs that summarize the data they read. The program is shown in Example 7-2.

Example 7-2. findlogin1

    #!/usr/bin/perl
    # findlogin1 - print all lines containing the string "login"
    while (<>) {            # loop over files on command line
        print if /login/;
    }

The program in Example 7-2 could be written as shown in Example 7-3.

Example 7-3. findlogin2

    #!/usr/bin/perl -n
    # findlogin2 - print all lines containing the string "login"
    print if /login/;

You can combine the -n and -e options to run Perl code from the command line:

    % perl -ne 'print if /login/'

The -p option is like -n but adds a print right before the end of the loop. It's normally used for programs that translate their input, such as the program shown in Example 7-4.

Example 7-4. lowercase1

    #!/usr/bin/perl
    # lowercase - turn all lines into lowercase
    while (<>) {                    # loop over lines on command line
        s/(\p{Letter})/\l$1/g;      # change all letters to lowercase
        print;
    }

The program in Example 7-4 could be written as shown in Example 7-5.

Example 7-5. lowercase2

    #!/usr/bin/perl -p
    # lowercase - turn all lines into lowercase
    s/(\p{Letter})/\l$1/g;          # change all letters to lowercase

Or it could be written from the command line as:

    % perl -pe 's/(\p{Letter})/\l$1/g'

While using -n or -p for implicit input looping, the special label LINE: is silently created for the whole input loop. That means that from an inner loop, you can skip to the following input record by using next LINE (which is like awk's next statement), or go on to the next file by closing ARGV (which is like awk's nextfile statement). This is shown in Example 7-6.

Example 7-6. countchunks

    #!/usr/bin/perl -n
    # countchunks - count how many words are used.
    # skip comments, and bail on file if __END__
    # or __DATA__ seen.
    for (split /\W+/) {
        next LINE if /^#/;
        close ARGV if /__(DATA|END)__/;
        $chunks++;
    }
    END { print "Found $chunks chunks\n" }

The tcsh shell keeps a .history file in a format such that every other line contains a commented-out timestamp in Epoch seconds:

    #+0894382237
    less /etc/motd
    #+0894382239
    vi ~/.exrc
    #+0894382242
    date
    #+0894382242
    who
    #+0894382288
    telnet home

A simple one-liner can render that legible:

    % perl -pe 's/^#\+(\d+)\n/localtime($1) . " "/e'
    Tue May  5 09:30:37 1998 less /etc/motd
    Tue May  5 09:30:39 1998 vi ~/.exrc
    Tue May  5 09:30:42 1998 date
    Tue May  5 09:30:42 1998 who
    Tue May  5 09:31:28 1998 telnet home

The -i option changes each file on the command line. It is described in Recipe 7.16, and is normally used in conjunction with -p.

7.14.4 See Also

perlrun(1), and the "Switches" section of Chapter 19 of Programming Perl; Recipe 7.16; Recipe 16.6

Recipe 7.15 Modifying a File in Place with a Temporary File

7.15.1 Problem

You need to update a file in place, and you can use a temporary file.

7.15.2 Solution

Read from the original file, write changes to a temporary file, and then rename the temporary back to the original:

    open(OLD, "<", $old)            or die "can't open $old: $!";
    open(NEW, ">", $new)            or die "can't open $new: $!";
    while (<OLD>) {
        # change $_, then...
        print NEW $_                or die "can't write $new: $!";
    }
    close(OLD)                      or die "can't close $old: $!";
    close(NEW)                      or die "can't close $new: $!";
    rename($old, "$old.orig")       or die "can't rename $old to $old.orig: $!";
    rename($new, $old)              or die "can't rename $new to $old: $!";

This is the best way to update a file "in place."

7.15.3 Discussion

This technique uses little memory compared to the approach that doesn't use a temporary file. It has the added advantages of giving you a backup file and being easier and safer to program.

You can make the same changes to the file using this technique that you can with the version that uses no temporary file. For instance, to insert lines at line 20, say:

    while (<OLD>) {
        if ($. == 20) {
            print NEW "Extra line 1\n";
            print NEW "Extra line 2\n";
        }
        print NEW $_;
    }

To delete lines 20 through 30, say:

    while (<OLD>) {
        next if 20 .. 30;
        print NEW $_;
    }

Note that rename won't work across filesystems, so you should create your temporary file in the same directory as the file being modified.

The truly paranoid programmer would lock the file during the update. The tricky part is that you have to open the file for writing without destroying its contents before you can get a lock to modify it. Recipe 7.18 shows how to do this.

7.15.4 See Also

Recipe 7.1; Recipe 7.16; Recipe 7.17; Recipe 7.18

Recipe 7.16 Modifying a File in Place with the -i Switch

7.16.1 Problem

You need to modify a file in place from the command line, and you're too lazy[5] for the file manipulation of Recipe 7.15.

[5] Lazy-as-virtue, not lazy-as-sin.

7.16.2 Solution

Use the -i and -p switches to Perl. Write your program on the command line:

    % perl -i.orig -p -e 'FILTER COMMAND' file1 file2 file3 ...

or use the switches in programs:

    #!/usr/bin/perl -i.orig -p
    # filter commands go here

7.16.3 Discussion

The -i command-line switch modifies each file in place. It creates a temporary file as in the previous recipe, but Perl takes care of the tedious file manipulation for you. Use it with -p (explained in Recipe 7.14) to turn:

    while (<>) {
        if ($ARGV ne $oldargv) {            # are we at the next file?
            rename($ARGV, $ARGV . ".orig");
".orig"); open(ARGVOUT, ">", $ARGV); # plus error check select(ARGVOUT); $oldargv = $ARGV; } s/DATE/localtime/e; } continue{ print; } select (STDOUT); # restore default output into: % perl -pi.orig -e 's/DATE/localtime/e' The -i switch takes care of making a backup (say -i instead of -i.orig to discard the original file contents instead of backing them up), and -p makes Perl loop over filenames given on the command line (or STDIN if no files were given). The preceding one-liner would turn a file containing the following: Dear Sir/Madam/Ravenous Beast, As of DATE, our records show your account is overdue. Please settle by the end of the month. Yours in cheerful usury, --A. Moneylender into: Dear Sir/Madam/Ravenous Beast, As of Sat Apr 25 12:28:33 1998, our records show your account is overdue. Please settle by the end of the month. Yours in cheerful usury, --A. Moneylender This switch makes in-place translators a lot easier to write and to read. For instance, this changes isolated instances of "hisvar" to "hervar" in all C, C++, and yacc files: % perl -i.old -pe 's{\bhisvar\b}{hervar}g' *.[Cchy] Turn on and off the -i behavior with the special variable $^I. Set @ARGV, and then use <> as you would with -i on the command line: # set up to iterate over the *.c files in the current directory, # editing in place and saving the old file with a .orig extension local $^I = ".orig"; # emulate -i.orig local @ARGV = glob("*.c"); # initialize list of files while (<>) { if ($. = = 1) { print "This line should appear at the top of each file\n"; } s/\b(p)earl\b/${1}erl/ig; # Correct typos, preserving case print; } continue {close ARGV if eof} Beware that creating a backup file under a particular name when that name already exists clobbers the version previously backed up. 7.16.4 See Also perlrun(1), and the "Switches" section of Chapter 19 of Programming Perl; the $^I and $. variables in perlvar(1), and in Chapter 28 of Programming Perl; the .. operator in the "Range Operator" sections of perlop(1) and Chapter 3 of Programming Perl [ Team LiB ] [ Team LiB ] Recipe 7.17 Modifying a File in Place Without a Temporary File 7.17.1 Problem You need to insert, delete, or change one or more lines in a file, and you don't want to (or can't) use a temporary file. 7.17.2 Solution Open the file in update mode ("+<"), read the whole file into an array of lines, change the array, then rewrite the file and truncate it to its current seek pointer. open(FH, "+<", $FILE) or die "Opening: $!"; @ARRAY = ; # change ARRAY here seek(FH,0,0) or die "Seeking: $!"; print FH @ARRAY or die "Printing: $!"; truncate(FH,tell(FH)) or die "Truncating: $!"; close(FH) or die "Closing: $!"; 7.17.3 Discussion As explained in this chapter's Introduction, the operating system treats files as unstructured streams of bytes. This makes it impossible to insert, modify, or change bits of the file in place. (Except for the special case of fixed-record-length files, discussed in Recipe 8.13.) You can use a temporary file to hold the changed output, or you can read the entire file into memory, change it, and write it back out again. Reading everything into memory is fine for small files, but doesn't scale well. Trying it on your 800 MB web server log files will either deplete your virtual memory or thrash your machine's VM system. 
7.17.4 See Also

The seek, truncate, open, and sysopen functions in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 7.15; Recipe 7.16; Recipe 7.18

Recipe 7.18 Locking a File

7.18.1 Problem

Many processes need to update the same file simultaneously.

7.18.2 Solution

Have all processes honor advisory locking by using flock:

    use Fcntl qw(:flock);       # for the LOCK_* constants

    open(FH, "+<", $path)       or die "can't open $path: $!";
    flock(FH, LOCK_EX)          or die "can't flock $path: $!";
    # update file, then...
    close(FH)                   or die "can't close $path: $!";

7.18.3 Discussion

Operating systems vary greatly in the type and reliability of locking techniques available. Perl tries hard to give you something that works, even if your operating system uses its own underlying technique. The flock function takes two arguments: a filehandle and a number representing what to do with the lock on that filehandle. The numbers are normally represented by names, such as LOCK_EX, which you can get from the Fcntl or IO::File modules.

Locks come in two varieties: shared (LOCK_SH) and exclusive (LOCK_EX). Despite what you might infer from "exclusive," processes aren't required to obey locks on files. Another way of saying this is that flock implements advisory locking. It allows processes to let the operating system suspend would-be writers of a file until any readers are finished with it.

Flocking files is like putting up a stoplight at an intersection. It works only if people pay attention to whether the light is red or green—or yellow for a shared lock. The red light doesn't stop traffic; it merely signals that traffic should stop. A desperate, ignorant, or rude person will still go flying through the intersection no matter what the light says. Likewise, flock only blocks out other flockers—not all processes trying to do I/O. Unless everyone is polite, accidents can (and will) happen.

The polite process customarily indicates its intent to read from the file by requesting a LOCK_SH. Many processes can have simultaneous shared locks on the file because they (presumably) won't be changing the data. If a process intends to write to the file, it should request an exclusive lock via LOCK_EX. The operating system then suspends the requesting process until all other processes have released their locks, at which point it grants the lock to the suspended process and unblocks it. You are guaranteed that no other process will be able to successfully run flock(FH, LOCK_EX) on the same file while you hold the lock. (This is almost—but not quite—like saying there can be only one exclusive lock on the file. Forked children inherit not only their parents' open files, but, on some systems, also any locks held. That means if you hold an exclusive lock and fork without execing, your child might also have that same exclusive lock on the file!)
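To make that polite-reader convention concrete, here's a minimal sketch (the filename is just a placeholder); the flock call simply waits while any writer holds its exclusive lock:

    use Fcntl qw(:flock);

    # a polite reader: a shared lock lets other readers run concurrently,
    # but blocks until any LOCK_EX holder is finished
    open(FH, "<", "numfile")    or die "can't open numfile: $!";
    flock(FH, LOCK_SH)          or die "can't lock numfile: $!";
    my $current = <FH>;         # safe to read; writers are held off
    close(FH)                   or die "can't close numfile: $!";   # releases the lock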
The flock function is therefore by default a blocking operation. You can also acquire a lock without wedging your process by using the LOCK_NB flag when you request a lock. This lets you warn the user that there may be a wait until other processes with locks are done:

    unless (flock(FH, LOCK_EX|LOCK_NB)) {
        warn "can't immediately write-lock the file ($!), blocking ...";
        unless (flock(FH, LOCK_EX)) {
            die "can't get write-lock on numfile: $!";
        }
    }

If you use LOCK_NB and are refused a LOCK_SH, then you know that someone else has a LOCK_EX and is updating the file. If you are refused a LOCK_EX, then someone holds either a LOCK_SH or a LOCK_EX, so you shouldn't try to update the file.

Locks dissolve when the file is closed, which may not be until your process exits. If you lock or unlock the file, Perl automatically flushes its buffers for you. Here's how you increment a number in a file, using flock:

    use Fcntl qw(:DEFAULT :flock);

    sysopen(FH, "numfile", O_RDWR|O_CREAT)
        or die "can't open numfile: $!";
    flock(FH, LOCK_EX)
        or die "can't write-lock numfile: $!";
    # Now we have acquired the lock, it's safe for I/O
    $num = <FH> || 0;           # DO NOT USE "or" THERE!!
    seek(FH, 0, 0)
        or die "can't rewind numfile: $!";
    truncate(FH, 0)
        or die "can't truncate numfile: $!";
    print FH $num+1, "\n"
        or die "can't write numfile: $!";
    close(FH)
        or die "can't close numfile: $!";

Closing the filehandle flushes the buffers and unlocks the file. The truncate function is discussed in Chapter 8.

File locking is not as easy as you might think—or wish. Because locks are advisory, if one process uses locking and another doesn't, all bets are off. Never use the existence of a file as a locking indication, because there exists a race condition between the test for the existence of the file and its creation. Furthermore, because file locking is an intrinsically stateful concept, it doesn't get along well with the stateless model embraced by network filesystems such as NFS. Although some vendors claim that fcntl addresses such matters, practical experience suggests otherwise. The CPAN module File::NFSLock uses a clever scheme to obtain and release locks on files over NFS, which is different from the flock system.

Don't confuse Perl's flock with the SysV function lockf. Unlike lockf, flock locks entire files at once. Perl doesn't support lockf directly, although the CPAN module File::Lock does offer its functionality if your operating system has lockf. The only way in pure Perl to lock a portion of a file is to use the fcntl function, as demonstrated in the lockarea program at the end of this chapter.

7.18.4 See Also

The flock and fcntl functions in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the standard Fcntl and DB_File modules (also in Chapter 32 of Programming Perl); the documentation for the CPAN modules File::Lock and File::NFSLock; Recipe 7.24; Recipe 7.25

Recipe 7.19 Flushing Output

7.19.1 Problem

When printing to a filehandle, output doesn't appear immediately. This is a problem in CGI scripts running on some programmer-hostile web servers where, if the web server sees warnings from Perl before it sees the (buffered) output of your script, it sends the browser an uninformative 500 Server Error. These buffering problems also arise with concurrent access to files by multiple programs and when talking with devices or sockets.
7.19.2 Solution

Disable buffering by setting the per-filehandle variable $| to a true value, customarily 1:

    $old_fh = select(OUTPUT_HANDLE);
    $| = 1;
    select($old_fh);

Or, if you don't mind the expense of loading an IO module, disable buffering by invoking the autoflush method:

    use IO::Handle;
    OUTPUT_HANDLE->autoflush(1);

This works with indirect filehandles as well:

    use IO::Handle;
    $fh->autoflush(1);

7.19.3 Discussion

In most stdio implementations, buffering varies with the type of output device. Disk files are block buffered, often with a buffer size of more than 2K. Pipes and sockets are often buffered with a buffer size between ½K and 2K. Serial devices, including terminals, modems, mice, and joysticks, are normally line-buffered; stdio sends the entire line out only when it gets the newline.

Perl's print function does not directly support truly unbuffered output, i.e., a physical write for each individual character. Instead, it supports command buffering, in which one physical write is made after every separate output command. This isn't as hard on your system as no buffering at all, and it still gets the output where you want it, when you want it.

Control output buffering through the $| special variable. Enable command buffering on output handles by setting it to a true value. This does not affect input handles at all; see Recipe 15.6 and Recipe 15.8 for unbuffered input. Set this variable to a false value to use default stdio buffering. Example 7-7 illustrates the difference.

Example 7-7. seeme

    #!/usr/bin/perl -w
    # seeme - demo stdio output buffering
    $| = (@ARGV > 0);       # command buffered if arguments given
    print "Now you don't see it...";
    sleep 2;
    print "now you do\n";

If you call this program with no arguments, STDOUT is not command buffered. Your terminal (console, window, telnet session, whatever) doesn't receive output until the entire line is completed, so you see nothing for two seconds and then get the full line "Now you don't see it ... now you do". If you call the program with at least one argument, STDOUT is command buffered. That means you first see "Now you don't see it...", and then after two seconds you finally see "now you do".

The dubious quest for increasingly compact code has led programmers to use the return value of select, the filehandle that was currently selected, as part of the second select:

    select((select(OUTPUT_HANDLE), $| = 1)[0]);

There's another way. The IO::Handle module and any modules that inherit from that class provide three methods for flushing: flush, autoflush, and printflush. All are invoked on filehandles, either as literals or as variables containing a filehandle or reasonable facsimile. The flush method causes all unwritten output in the buffer to be written out, returning true on success and false on failure. The printflush method is a print followed by a one-time flush. The autoflush method is syntactic sugar for the convoluted antics just shown. It sets the command-buffering property on that filehandle (or clears it if passed an explicit false value), and returns the previous value for that property on that handle. For example:

    use FileHandle;

    STDERR->autoflush;          # already unbuffered in stdio
    $filehandle->autoflush(0);

If you're willing to accept the oddities of indirect object notation covered in Chapter 13, you can even write something reasonably close to English:

    use IO::Handle;
    # assume REMOTE_CONN is an interactive socket handle,
    # but DISK_FILE is a handle to a regular file.
    autoflush REMOTE_CONN  1;           # unbuffer for clarity
    autoflush DISK_FILE    0;           # buffer this for speed

This avoids the bizarre select business and makes your code much more readable. Unfortunately, your program takes longer to compile because now you're including the IO::Handle module, so dozens of files need to be opened and thousands and thousands of lines must first be read and compiled. For short and simple applications, you might as well learn to manipulate $| directly, and you'll be happy. But for larger applications that already use a class derived from the IO::Handle class, you've already paid the price for the ticket, so you might as well see the show.

To ensure that your output gets where you want it, when you want it, buffer flushing is important. It's particularly important with sockets, pipes, and devices, because you may be trying to do interactive I/O with these—more so, even, because you can't assume line buffering. Consider the program in Example 7-8.

Example 7-8. getpcomidx

    #!/usr/bin/perl -w
    # getpcomidx - fetch www.perl.com's index.html document
    use IO::Socket;
    $sock = new IO::Socket::INET (PeerAddr => "www.perl.com",
                                  PeerPort => "http(80)");
    die "Couldn't create socket: $@" unless $sock;
    # the library doesn't support $! setting; it uses $@
    $sock->autoflush(1);
    # Mac *must* have \015\012\015\012 instead of \n\n here.
    # It's a good idea for others, too, as that's the spec,
    # but implementations are encouraged to accept "\cJ\cJ" too,
    # and as far as we've seen, they do.
    $sock->print("GET /index.html http/1.1\n\n");
    $document = join("", $sock->getlines());
    print "DOC IS: $document\n";

If you're running at least v5.8 Perl, you can use the new I/O layers mechanism to force unbuffered output. This is available through the :unix layer. If the handle is already open, you can do this:

    binmode(STDOUT, ":unix")    || die "can't binmode STDOUT to :unix: $!";

or you can specify the I/O layer when initially calling open:

    open(TTY, ">:unix", "/dev/tty")     || die "can't open /dev/tty: $!";
    print TTY "54321";
    sleep 2;
    print TTY "\n";

There's no way to control input buffering using the sorts of flushing discussed so far. For that, you need to see Recipe 15.6 and Recipe 15.8.

7.19.4 See Also

The $| entry in perlvar(1), and Chapter 28 of Programming Perl; the documentation for the standard FileHandle and IO::Handle modules (also in Chapter 32 of Programming Perl); the select function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 15.6 and Recipe 15.8

Recipe 7.20 Doing Non-Blocking I/O

7.20.1 Problem

You want to read from or write to a filehandle without the system blocking your process until the program, file, socket, or device at the other end is ready. This is desired less often of regular files than of special files.

7.20.2 Solution

Open the file with sysopen, specifying the O_NONBLOCK option:

    use Fcntl;

    sysopen(MODEM, "/dev/cua0", O_NONBLOCK|O_RDWR)
        or die "Can't open modem: $!\n";

If you already have an open filehandle, invoke the blocking method from IO::Handle with an argument of 0:

    use IO::Handle;

    MODEM->blocking(0);         # assume MODEM already opened

Or use the low-level fcntl function:

    use Fcntl;

    $flags = "";
    fcntl(HANDLE, F_GETFL, $flags)
        or die "Couldn't get flags for HANDLE: $!\n";
    $flags |= O_NONBLOCK;
    fcntl(HANDLE, F_SETFL, $flags)
        or die "Couldn't set flags for HANDLE: $!\n";

7.20.3 Discussion

On a disk file, when no more data can be read because you're at the end of the file, the input operation returns immediately.
But suppose the filehandle in question were the user's keyboard or a network connection. In those cases, simply because there's no data there right now doesn't mean there never will be, so the input function normally doesn't return until it gets data. Sometimes, though, you don't want to wait; you want to grab whatever's there and carry on with whatever you were doing.

Once a filehandle has been set for non-blocking I/O, the sysread or syswrite calls that would otherwise block will instead return undef and set $! to EAGAIN:

    use Errno;

    $rv = syswrite(HANDLE, $buffer, length $buffer);
    if (!defined($rv) && $!{EAGAIN}) {
        # would block
    } elsif ($rv != length $buffer) {
        # incomplete write
    } else {
        # successfully wrote
    }

    $rv = sysread(HANDLE, $buffer, $BUFSIZ);
    if (!defined($rv) && $!{EAGAIN}) {
        # would block
    } else {
        # successfully read $rv bytes from HANDLE
    }

The O_NONBLOCK constant is part of the POSIX standard, so most machines should support it. We use the Errno module to test for the error EAGAIN. Testing $!{EAGAIN} is the same as testing whether $! == EAGAIN.

7.20.4 See Also

The sysopen and fcntl functions in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the standard Errno and IO::Handle modules (also in Chapter 32 of Programming Perl); your system's open(2) and fcntl(2) manpages; Recipe 7.22; Recipe 7.21

Recipe 7.21 Determining the Number of Unread Bytes

7.21.1 Problem

You want to know how many unread bytes are available for reading from a filehandle.

7.21.2 Solution

Use the FIONREAD ioctl call:

    $size = pack("L", 0);
    ioctl(FH, $FIONREAD, $size)     or die "Couldn't call ioctl: $!\n";
    $size = unpack("L", $size);
    # $size bytes can be read

Make sure the input filehandle is unbuffered (because you've used an I/O layer like :unix on it), or use only sysread.

7.21.3 Discussion

The Perl ioctl function is a direct interface to the operating system's ioctl(2) system call. If your system doesn't have the FIONREAD request or the ioctl(2) call, you can't use this recipe. FIONREAD and the other ioctl(2) requests are numeric values normally found lurking in C include files.

Perl's h2ph tool tries to convert C include files to Perl code, which can be required. FIONREAD ends up defined as a function in the sys/ioctl.ph file:

    require "sys/ioctl.ph";

    $size = pack("L", 0);
    ioctl(FH, FIONREAD(), $size)    or die "Couldn't call ioctl: $!\n";
    $size = unpack("L", $size);

If h2ph wasn't installed or doesn't work for you, you can manually grep the include files:

    % grep FIONREAD /usr/include/*/*
    /usr/include/asm/ioctls.h:#define FIONREAD 0x541B

If you install Inline::C from CPAN, you can write a C subroutine to obtain the constant for you:

    use Inline C;

    $FIONREAD = get_FIONREAD();
    # ...
    __END__
    __C__

    #include <sys/ioctl.h>

    int get_FIONREAD() {
        return FIONREAD;
    }

If all else fails, write a small C program using the editor of champions:

    % cat > fionread.c
    #include <sys/ioctl.h>
    main() {
        printf("%#08x\n", FIONREAD);
    }
    ^D
    % cc -o fionread fionread.c
    % ./fionread
    0x4004667f

Then hardcode it, leaving porting as an exercise to your successor.

    $FIONREAD = 0x4004667f;         # XXX: opsys dependent

    $size = pack("L", 0);
    ioctl(FH, $FIONREAD, $size)     or die "Couldn't call ioctl: $!\n";
    $size = unpack("L", $size);

FIONREAD requires a filehandle connected to a stream, which means sockets, pipes, and tty devices all work, but regular files don't.

If this is too much system programming for you, try to think outside the problem. Read from the filehandle in non-blocking mode (see Recipe 7.20).
Then, if you manage to read something, that's how much was there waiting to be read. If you couldn't read anything, you know there was nothing to be read. This might get you in trouble with other users (or other processes) who are using the same system, though—because it uses busy-wait I/O, it's a drain on system resources.

7.21.4 See Also

Recipe 7.20; your system's ioctl(2) manpage; the ioctl function in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the Inline::C module from CPAN

Recipe 7.22 Reading from Many Filehandles Without Blocking

7.22.1 Problem

You want to learn whether input is available to be read, rather than blocking until there's input the way <FH> does. This is useful when reading from pipes, sockets, devices, and other programs.

7.22.2 Solution

Use select with a timeout value of 0 seconds if you're comfortable with manipulating bit-vectors representing file descriptor sets:

    $rin = "";
    # repeat next line for all filehandles to poll
    vec($rin, fileno(FH1), 1) = 1;
    vec($rin, fileno(FH2), 1) = 1;
    vec($rin, fileno(FH3), 1) = 1;

    $nfound = select($rout=$rin, undef, undef, 0);
    if ($nfound) {
        # input waiting on one or more of those 3 filehandles
        if (vec($rout, fileno(FH1), 1)) {
            # do something with FH1
        }
        if (vec($rout, fileno(FH2), 1)) {
            # do something with FH2
        }
        if (vec($rout, fileno(FH3), 1)) {
            # do something with FH3
        }
    }

The IO::Select module provides an abstraction layer to hide bit-vector operations:

    use IO::Select;

    $select = IO::Select->new();
    # repeat next line for all filehandles to poll
    $select->add(*FILEHANDLE);
    if (@ready = $select->can_read(0)) {
        # input waiting on the filehandles in @ready
    }

7.22.3 Discussion

The select function is really two functions in one. If you call it with one argument, you change the current default output filehandle (see Recipe 7.19). If you call it with four arguments, it tells you which filehandles have input waiting or are ready to receive output. This recipe deals only with four-argument select.

The first three arguments to select are strings containing bit-vectors. Each bit-vector represents a set of file descriptors to inspect for pending input, pending output, and pending expedited data (like out-of-band or urgent data on a socket), respectively. The final argument is the timeout—how long select should spend waiting for status to change. A timeout value of 0 indicates a poll. Timeout can also be a floating-point number of seconds, or undef to wait (block) until status changes:

    $rin = "";
    vec($rin, fileno(FILEHANDLE), 1) = 1;
    $nfound = select($rin, undef, undef, 0);    # just check
    if ($nfound) {
        # read ten bytes from FILEHANDLE
        sysread(FILEHANDLE, $data, 10);
        print "I read $data";
    }

The IO::Select module hides the bit-vectors from you. IO::Select->new() returns a new object on which you invoke the add method to add one or more filehandles to the set. Once you've added the filehandles you are interested in, invoke can_read, can_write, or has_exception. These methods return a list of filehandles that you can safely read from or write to, or that have unread exceptional data (TCP out-of-band data, for example).

If you want to read an entire line of data, you can't use the readline function or the line input operator (unless you use an unbuffered I/O layer). Otherwise, you'll mix a buffered I/O function with a check that ignores those buffers in user space and cares only about what's buffered in kernel space. This is a big no-no.
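Here's a small sketch of the safe combination: poll with can_read, then drain the ready descriptor with sysread. The handle names and buffer size are our own, not part of any recipe:

    use IO::Select;

    my $select = IO::Select->new(\*FH1, \*FH2);     # handles assumed already open
    while ($select->count) {
        for my $fh ($select->can_read) {            # block until a handle has data
            my $rv = sysread($fh, my $chunk, 4096); # unbuffered read; no <> here
            if (!defined $rv) {
                die "sysread: $!";
            } elsif ($rv == 0) {                    # this handle reached end-of-file
                $select->remove($fh);
                close($fh);
            } else {
                print "got: $chunk";
            }
        }
    }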
For details on this mismatch, and directions for calling sysread on whatever is available on a socket or pipe and then returning immediately, see Recipe 7.23. If you're trying to do non-blocking reads on a terminal line (that is, on a keyboard or other serial line device), see Recipe 15.6 and Recipe 15.8.

7.22.4 See Also

The select function in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the standard module IO::Select (also in Chapter 32 of Programming Perl); Recipe 7.20; Recipe 7.23

Recipe 7.23 Reading an Entire Line Without Blocking

7.23.1 Problem

You need to read a line of data from a handle that select says is ready for reading, but you can't use Perl's normal <FH> operation (readline) in conjunction with select, because <FH> may buffer extra data and select doesn't know about those buffers.

7.23.2 Solution

Use the following sysreadline function, like this:

    $line = sysreadline(SOME_HANDLE);

In case only a partial line has been sent, include a number of seconds to wait:

    $line = sysreadline(SOME_HANDLE, TIMEOUT);

Here's the function to do that:

    use IO::Handle;
    use IO::Select;
    use Symbol qw(qualify_to_ref);

    sub sysreadline(*;$) {
        my($handle, $timeout) = @_;
        $handle = qualify_to_ref($handle, caller());
        my $infinitely_patient = (@_ == 1 || $timeout < 0);
        my $start_time = time();
        my $selector = IO::Select->new();
        $selector->add($handle);
        my $line = "";
    SLEEP:
        until (at_eol($line)) {
            unless ($infinitely_patient) {
                return $line if time() > ($start_time + $timeout);
            }
            # sleep only 1 second before checking again
            next SLEEP unless $selector->can_read(1.0);
    INPUT_READY:
            while ($selector->can_read(0.0)) {
                my $was_blocking = $handle->blocking(0);
    CHAR:       while (sysread($handle, my $nextbyte, 1)) {
                    $line .= $nextbyte;
                    last CHAR if $nextbyte eq "\n";
                }
                $handle->blocking($was_blocking);
                # if incomplete line, keep trying
                next SLEEP unless at_eol($line);
                last INPUT_READY;
            }
        }
        return $line;
    }
    sub at_eol($) { $_[0] =~ /\n\z/ }

7.23.3 Discussion

As described in Recipe 7.22, to determine whether the operating system has data on a particular handle for your process to read, you can use either Perl's built-in select function or the can_read method from the standard IO::Select module. Although you can reasonably use functions like sysread and recv to get data, you can't use the buffered functions like readline (that is, <FH>), read, or getc.

Also, even the unbuffered input functions might still block. If someone connects and sends a character but never sends a newline, your program will block in a <FH>, which expects its input to end in a newline—or in whatever you've assigned to the $/ variable. We circumvent this by setting the handle to non-blocking mode and then reading in characters until we find "\n". This removes the need for the blocking call. The sysreadline function in the Solution takes an optional second argument so you don't have to wait forever in case you get a partial line and nothing more.

A far more serious issue is that select reports only whether the operating system's low-level file descriptor is available for I/O. It's not reliable in the general case to mix calls to four-argument select with calls to any of the buffered I/O functions listed in this chapter's Introduction (read, <FH>, seek, tell, etc.). Instead, you must use sysread—and sysseek if you want to reposition the filehandle within the file. The reason is that select's response does not reflect any user-level buffering in your own process's address space once the kernel has transferred the data.
But the <FH>—really Perl's readline() function—still uses your underlying buffered I/O system. If two lines were waiting, select would report true only once. You'd read the first line and leave the second one in the buffer. But the next call to select would block because, as far as the kernel is concerned, it's already given you all of the data it had. That second line, now hidden from your kernel, sits unread in an input buffer that's solely in user space.

7.23.4 See Also

The sysread function in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the standard modules Symbol, IO::Handle, and IO::Select (also in Chapter 32 of Programming Perl); Recipe 7.22

Recipe 7.24 Program: netlock

When locking files, we recommend that you use flock when possible. However, on some systems, flock's locking strategy is not reliable. For example, perhaps the person who built Perl on your system configured flock to use a version of file locking that didn't even try to work over the Net, or you're on the increasingly rare system where no flock emulation exists at all.

The following program and module provide a basic implementation of a file locking mechanism. Unlike a normal flock, with this module you lock file names, not file descriptors. Thus, you can use it to lock directories, domain sockets, and other non-regular files. You can even lock files that don't exist yet. It uses a directory created at the same level in the directory structure as the locked file, so you must be able to write to the enclosing directory of the file you wish to lock. A sentinel file within the lock directory contains the owner of the lock. This is also useful with Recipe 7.15, because you can lock the filename even though the file that has that name changes.

The nflock function takes one or two arguments. The first is the pathname to lock; the second is the optional amount of time to wait for the lock. The function returns true if the lock is granted, returns false if the timeout expired, and raises an exception should various improbable events occur, such as being unable to write the directory.

Set the $File::LockDir::Debug variable to true to make the module emit messages if it stalls waiting for a lock. If you forget to free a lock and try to exit the program, the module will remove them for you. This won't happen if your program is sent a signal it doesn't trap.

Example 7-9 shows a driver program to demonstrate the File::LockDir module.

Example 7-9. drivelock

    #!/usr/bin/perl -w
    # drivelock - demo File::LockDir module
    use strict;
    use File::LockDir;

    $SIG{INT} = sub { die "outta here\n" };

    $File::LockDir::Debug = 1;
    my $path = shift or die "usage: $0 <path>\n";
    unless (nflock($path, 2)) {
        die "couldn't lock $path in 2 seconds\n";
    }
    sleep 100;
    nunflock($path);

The module itself is shown in Example 7-10. For more on building your own modules, see Chapter 12.

Example 7-10. File::LockDir

    package File::LockDir;

    # module to provide very basic filename-level
    # locks.  No fancy systems calls.  In theory,
    # directory info is sync'd over NFS.  Not
    # stress tested.
    use strict;

    use Exporter;
    our (@ISA, @EXPORT);
    @ISA    = qw(Exporter);
    @EXPORT = qw(nflock nunflock);

    our ($Debug, $Check);
    $Debug ||= 0;   # may be predefined
    $Check ||= 5;   # may be predefined

    use Cwd;
    use Fcntl;
    use Sys::Hostname;
    use File::Basename;
    use File::stat;
    use Carp;

    my %Locked_Files = ();

    # usage: nflock(FILE; NAPTILL)
    sub nflock($;$) {
        my $pathname = shift;
        my $naptime  = shift || 0;
        my $lockname = name2lock($pathname);
        my $whosegot = "$lockname/owner";
        my $start    = time();
        my $missed   = 0;
        my $owner;

        # if locking what I've already locked, return
        if ($Locked_Files{$pathname}) {
            carp "$pathname already locked";
            return 1;
        }

        if (!-w dirname($pathname)) {
            croak "can't write to directory of $pathname";
        }

        while (1) {
            last if mkdir($lockname, 0777);
            confess "can't get $lockname: $!"
                if $missed++ > 10 && !-d $lockname;
            if ($Debug) {{
                open($owner, "< $whosegot") || last;    # exit "if"!
                my $lockee = <$owner>;
                chomp($lockee);
                printf STDERR "%s $0\[$$]: lock on %s held by %s\n",
                    scalar(localtime), $pathname, $lockee;
                close $owner;
            }}
            sleep $Check;
            return if $naptime && time > $start + $naptime;
        }

        sysopen($owner, $whosegot, O_WRONLY|O_CREAT|O_EXCL)
            or croak "can't create $whosegot: $!";
        printf $owner "$0\[$$] on %s since %s\n",
            hostname(), scalar(localtime);
        close($owner)
            or croak "close $whosegot: $!";
        $Locked_Files{$pathname}++;
        return 1;
    }

    # free the locked file
    sub nunflock($) {
        my $pathname = shift;
        my $lockname = name2lock($pathname);
        my $whosegot = "$lockname/owner";
        unlink($whosegot);
        carp "releasing lock on $lockname" if $Debug;
        delete $Locked_Files{$pathname};
        return rmdir($lockname);
    }

    # helper function
    sub name2lock($) {
        my $pathname = shift;
        my $dir  = dirname($pathname);
        my $file = basename($pathname);
        $dir = getcwd() if $dir eq ".";
        my $lockname = "$dir/$file.LOCKDIR";
        return $lockname;
    }

    # anything forgotten?
    END {
        for my $pathname (keys %Locked_Files) {
            my $lockname = name2lock($pathname);
            my $whosegot = "$lockname/owner";
            carp "releasing forgotten $lockname";
            unlink($whosegot);
            rmdir($lockname);
        }
    }

    1;

Recipe 7.25 Program: lockarea

Perl's flock function only locks complete files, not regions of the file. Although fcntl supports locking of a file's regions, this is difficult to access from Perl, largely because no one has written an XS module that portably packs up the necessary structure.

The program in Example 7-11 implements fcntl locking, but only for the three architectures it already knows about: SunOS, BSD, and Linux. If you're running something else, you'll have to figure out the layout of the flock structure. We did this by eyeballing the C-language sys/fcntl.h #include file—and running the c2ph program to figure out alignment and typing. This program, while included with Perl, only works on systems with a strong Berkeley heritage, like those listed above. As with Unix—or Perl itself—you don't have to use c2ph, but it sure makes life easier if you can.

The struct_flock function in the lockarea program packs and unpacks in the proper format for the current architecture by consulting the $^O variable, which contains your current operating system name. There is no struct_flock function declaration. It's just aliased to the architecture-specific version. Function aliasing is discussed in Recipe 10.14.

The lockarea program opens a temporary file, clobbering any existing contents and writing a screenful (80 by 23) of blanks. Each line is the same length.
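That fixed length is what makes region locking practical here: with $COLS characters plus a newline per line, the byte offset of line N is simply N * ($COLS + 1), so any line can be reached with a single seek and locked independently. In miniature, using the same constants as the program below:

    use Fcntl qw(:seek);        # for SEEK_SET

    my $COLS = 80;              # characters per line, not counting "\n"
    my $line_num = 5;
    my $offset = $line_num * ($COLS + 1);   # line 5 starts at byte 405
    seek(FH, $offset, SEEK_SET) or die "can't seek: $!";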
The program then forks one or more times and lets the child processes try to update the file at the same time. The first argument, N, is the number of times to fork to produce 2 ** N processes. So lockarea 1 makes two children, lockarea 2 makes four, lockarea 3 makes eight, lockarea 4 makes sixteen, etc. The more kids, the more contention for the locks.

Each process picks a random line in the file, locks that line only, and then updates it. It writes its process ID into the line, prepended with a count of how many times the line has been updated:

    4: 18584 was just here

If the line was already locked, then when the lock is finally granted, that line is updated with a message telling which process was in the way of this process:

    29: 24652 ZAPPED 24656

A fun demo is to run the lockarea program in the background and the rep program from Chapter 15, watching the file change. Think of it as a video game for systems programmers.

    % lockarea 5 &
    % rep -1 'cat /tmp/lkscreen'

When you interrupt the original parent, usually with Ctrl-C or by sending it a SIGINT from the command line, it kills all of its children by sending its entire process group a signal.

Example 7-11. lockarea

    #!/usr/bin/perl -w
    # lockarea - demo record locking with fcntl

    use strict;

    my $FORKS = shift || 1;
    my $SLEEP = shift || 1;

    use Fcntl;
    use POSIX qw(:unistd_h);
    use Errno;

    my $COLS = 80;
    my $ROWS = 23;

    # when's the last time you saw *this* mode used correctly?
    open(FH, "+> /tmp/lkscreen") or die $!;

    select(FH);
    $| = 1;
    select STDOUT;

    # clear screen
    for (1 .. $ROWS) {
        print FH " " x $COLS, "\n";
    }

    my $progenitor = $$;
    fork() while $FORKS-- > 0;

    print "hello from $$\n";

    if ($progenitor == $$) {
        $SIG{INT} = \&infanticide;
    } else {
        $SIG{INT} = sub { die "goodbye from $$" };
    }

    while (1) {
        my $line_num = int rand($ROWS);
        my $line;
        my $n;

        # move to line
        seek(FH, $n = $line_num * ($COLS+1), SEEK_SET)      or next;

        # get lock
        my $place = tell(FH);
        my $him;
        next unless defined($him = lockplace(*FH, $place, $COLS));

        # read line
        read(FH, $line, $COLS) == $COLS                     or next;
        my $count = ($line =~ /(\d+)/) ? $1 : 0;
        $count++;

        # update line
        seek(FH, $place, 0)                                 or die $!;
        my $update = sprintf($him
                                 ? "%6d: %d ZAPPED %d"
                                 : "%6d: %d was just here",
                             $count, $$, $him);
        my $start = int(rand($COLS - length($update)));
        die "XXX" if $start + length($update) > $COLS;
        printf FH "%*.*s\n", -$COLS, $COLS, " " x $start . $update;

        # release lock and go to sleep
        unlockplace(*FH, $place, $COLS);
        sleep $SLEEP if $SLEEP;
    }
    die "NOT REACHED";                  # just in case

    # lock($handle, $offset, $timeout) - get an fcntl lock
    sub lockplace {
        my ($fh, $start, $till) = @_;
        ##print "$$: Locking $start, $till\n";
        my $lock = struct_flock(F_WRLCK, SEEK_SET, $start, $till, 0);
        my $blocker = 0;
        unless (fcntl($fh, F_SETLK, $lock)) {
            die "F_SETLK $$ @_: $!"
                unless $!{EAGAIN} || $!{EDEADLK};
            fcntl($fh, F_GETLK, $lock)
                or die "F_GETLK $$ @_: $!";
            $blocker = (struct_flock($lock))[-1];
            ##print "lock $$ @_: waiting for $blocker\n";
            $lock = struct_flock(F_WRLCK, SEEK_SET, $start, $till, 0);
            unless (fcntl($fh, F_SETLKW, $lock)) {
                warn "F_SETLKW $$ @_: $!\n";
                return;                 # undef
            }
        }
        return $blocker;
    }

    # unlock($handle, $offset, $timeout) - release an fcntl lock
    sub unlockplace {
        my ($fh, $start, $till) = @_;
        ##print "$$: Unlocking $start, $till\n";
        my $lock = struct_flock(F_UNLCK, SEEK_SET, $start, $till, 0);
        fcntl($fh, F_SETLK, $lock)      or die "F_UNLCK $$ @_: $!";
    }

    # OS-dependent flock structures

    # Linux struct flock
    #   short l_type;
    #   short l_whence;
    #   off_t l_start;
    #   off_t l_len;
    #   pid_t l_pid;
    BEGIN {
        # c2ph says: typedef='s2 l2 i', sizeof=16
        my $FLOCK_STRUCT = "s s l l i";

        sub linux_flock {
            if (wantarray) {
                my ($type, $whence, $start, $len, $pid) =
                    unpack($FLOCK_STRUCT, $_[0]);
                return ($type, $whence, $start, $len, $pid);
            } else {
                my ($type, $whence, $start, $len, $pid) = @_;
                return pack($FLOCK_STRUCT,
                        $type, $whence, $start, $len, $pid);
            }
        }
    }

    # SunOS struct flock:
    #   short l_type;       /* F_RDLCK, F_WRLCK, or F_UNLCK */
    #   short l_whence;     /* flag to choose starting offset */
    #   long  l_start;      /* relative offset, in bytes */
    #   long  l_len;        /* length, in bytes; 0 means lock to EOF */
    #   short l_pid;        /* returned with F_GETLK */
    #   short l_xxx;        /* reserved for future use */
    BEGIN {
        # c2ph says: typedef='s2 l2 s2', sizeof=16
        my $FLOCK_STRUCT = "s s l l s s";

        sub sunos_flock {
            if (wantarray) {
                my ($type, $whence, $start, $len, $pid, $xxx) =
                    unpack($FLOCK_STRUCT, $_[0]);
                return ($type, $whence, $start, $len, $pid);
            } else {
                my ($type, $whence, $start, $len, $pid) = @_;
                return pack($FLOCK_STRUCT,
                        $type, $whence, $start, $len, $pid, 0);
            }
        }
    }

    # (Free)BSD struct flock:
    #   off_t l_start;      /* starting offset */
    #   off_t l_len;        /* len = 0 means until end-of-file */
    #   pid_t l_pid;        /* lock owner */
    #   short l_type;       /* lock type: read/write, etc. */
    #   short l_whence;     /* type of l_start */
    BEGIN {
        # c2ph says: typedef="q2 i s2", size=24
        my $FLOCK_STRUCT = "ll ll i s s";   # XXX: q is ll

        sub bsd_flock {
            if (wantarray) {
                my ($xxstart, $start, $xxlen, $len, $pid, $type, $whence) =
                    unpack($FLOCK_STRUCT, $_[0]);
                return ($type, $whence, $start, $len, $pid);
            } else {
                my ($type, $whence, $start, $len, $pid) = @_;
                my ($xxstart, $xxlen) = (0, 0);
                return pack($FLOCK_STRUCT,
                        $xxstart, $start, $xxlen, $len, $pid, $type, $whence);
            }
        }
    }

    # alias the fcntl structure at compile time
    BEGIN {
        for ($^O) {
            *struct_flock = do {
                    /bsd/   && \&bsd_flock
                ||  /linux/ && \&linux_flock
                ||  /sunos/ && \&sunos_flock
                ||  die "unknown operating system $^O, bailing out";
            };
        }
    }

    # install signal handler for children
    BEGIN {
        my $called = 0;

        sub infanticide {
            exit if $called++;
            print "$$: Time to die, kiddies.\n" if $$ == $progenitor;
            my $job = getpgrp();
            $SIG{INT} = "IGNORE";
            kill -2, $job if $job;      # killpg(SIGINT, job)
            1 while wait > 0;
            print "$$: My turn\n" if $$ == $progenitor;
            exit;
        }
    }

    END { &infanticide }

Chapter 8. File Contents

    The most brilliant decision in all of Unix was the choice of a single character for the newline sequence.

    —Mike O'Dell, only half jokingly

Introduction

Before the Unix Revolution, every kind of data source and destination was inherently different. Getting two programs merely to understand each other required heavy wizardry and the occasional sacrifice of a virgin stack of punch cards to an itinerant mainframe repairman.
This computational Tower of Babel made programmers dream of quitting the field to take up a less painful hobby, like autoflagellation.

These days, such cruel and unusual programming is largely behind us. Modern operating systems work hard to provide the illusion that I/O devices, network connections, process control information, other programs, the system console, and even users' terminals are all abstract streams of bytes called files. This lets you easily write programs that don't care where their input came from or where their output goes.

Because programs read and write streams of simple text, every program can communicate with every other program. It is difficult to overstate the power and elegance of this approach. No longer dependent upon troglodyte gnomes with secret tomes of JCL (or COM) incantations, users can now create custom tools from smaller ones by using simple command-line I/O redirection, pipelines, and backticks.

Basic Operations

Treating files as unstructured byte streams necessarily governs what you can do with them. You can read and write sequential, fixed-size blocks of data at any location in the file, increasing its size if you write past the current end. Perl uses an I/O library that emulates C's stdio(3) to implement reading and writing of variable-length records like lines, paragraphs, and words.

What can't you do to an unstructured file? Because you can't insert or delete bytes anywhere but at end-of-file, you can't easily change the length of, insert, or delete records. An exception is the last record, which you can delete by truncating the file to the end of the previous record. For other modifications, you need to use a temporary file or work with a copy of the file in memory. If you need to do this a lot, a database system may be a better solution than a raw file (see Chapter 14). Standard with Perl as of v5.8 is the Tie::File module, which offers an array interface to files of records. We use it in Recipe 8.4.

The most common files are text files, and the most common operations on text files are reading and writing lines. Use the line-input operator, <FH> (or the internal function implementing it, readline), to read lines, and use print to write them. These functions can also read or write any record that has a specific record separator. Lines are simply variable-length records that end in "\n".

The <FH> operator returns undef on error or when end of the file is reached, so use it in loops like this:

    while (defined ($line = <DATAFILE>)) {
        chomp $line;
        $size = length($line);
        print "$size\n";                # output size of line
    }

Because this operation is extremely common in Perl programs that process lines of text, and that's an awful lot to type, Perl conveniently provides some shorter aliases for it. If all shortcuts are taken, this notation might be too abstract for the uninitiated to guess what it's really doing. But it's an idiom you'll see thousands of times in Perl, so you'll soon get used to it. Here are increasingly shortened forms, where the first line is the completely spelled-out version:

    while (defined ($line = <DATAFILE>)) { ... }
    while ($line = <DATAFILE>) { ... }
    while (<DATAFILE>) { ... }

In the second line, the explicit defined test needed for detecting end-of-file is omitted. To make everyone's life easier, you're safe to skip that defined test, because when the Perl compiler detects this situation, it helpfully puts one there for you to guarantee your program's correctness in odd cases.
This implicit addition of a defined occurs on all while tests that do nothing but assign to one scalar variable the result of calling readline, readdir, or readlink. As <FH> is just shorthand for readline(FH), it also counts.

We're not quite done shortening up yet. As the third line shows, you can also omit the variable assignment completely, leaving just the line input operator in the while test. When you do that here in a while test, it doesn't simply discard the line it just read as it would anywhere else. Instead, it reads lines into the special global variable $_. Because so many other operations in Perl also default to $_, this is more useful than it might initially appear.

    while (<DATAFILE>) {
        chomp;
        print length($_), "\n";     # output size of line
    }

In scalar context, <DATAFILE> reads just the next line, but in list context, it reads all remaining lines:

    @lines = <DATAFILE>;

Each time <DATAFILE> reads a record from a filehandle, it increments the special variable $. (the "current input record number"). This variable is reset only when close is called explicitly, which means that it's not reset when you reopen an already opened filehandle.

Another special variable is $/, the input record separator. It is set to "\n" by default. You can set it to any string you like; for instance, "\0" to read null-terminated records. Read entire paragraphs by setting $/ to the empty string, "". This is almost like setting $/ to "\n\n", in that empty lines function as record separators. However, "" treats two or more consecutive empty lines as a single record separator, whereas "\n\n" returns empty records when more than two consecutive empty lines are read. Undefine $/ to read the rest of the file as one scalar:

    undef $/;
    $whole_file = <FILE>;           # "slurp" mode

The -0 option to Perl lets you set $/ from the command line:

    % perl -040 -e '$word = <>; print "First word is $word\n";'

The digits after -0 are the octal value of the single character to which $/ is to be set. If you specify an illegal value (e.g., with -0777), Perl will set $/ to undef. If you specify -00, Perl will set $/ to "". The limit of a single octal value means you can't set $/ to a multibyte string; for instance, "%%\n" to read fortune files. Instead, you must use a BEGIN block:

    % perl -ne 'BEGIN { $/="%%\n" } chomp; print if /Unix/i' fortune.dat

Use print to write a line or any other data. The print function writes its arguments one after another and doesn't automatically add a line or record terminator by default.

    print HANDLE "One", "two", "three";     # "Onetwothree"
    print "Baa baa black sheep.\n";         # Sent to default output handle

There is no comma between the filehandle and the data to print. If you put a comma in there, Perl gives the error message "No comma allowed after filehandle".

The default output handle is STDOUT. Change it with the select function. (See the Introduction to Chapter 7.)

Newlines

All systems use the virtual "\n" to represent a line terminator, called a newline. There is no such thing as a newline character; it is a platform-independent way of saying "whatever your string library uses to represent a line terminator." On Unix, VMS, and Windows, this line terminator in strings is "\cJ" (the Ctrl-J character). Versions of the old Macintosh operating system before Mac OS X used "\cM". As a Unix variant, Mac OS X uses "\cJ".

Operating systems also vary in how they store newlines in files. Unix also uses "\cJ" for this. On Windows, though, lines in a text file end in "\cM\cJ".
If your I/O library knows you are reading or writing a text file, it will automatically translate between the string line terminator and the file line terminator. So on Windows, you could read four bytes ("Hi\cM\cJ") from disk and end up with three in memory ("Hi\cJ", where "\cJ" is the physical representation of the newline character). This is never a problem on Unix, as no translation needs to happen between the disk's newline ("\cJ") and the string's newline ("\cJ").

Terminals, of course, are a different kettle of fish. Except when you're in raw mode (as in system("stty raw")), the Enter key generates a "\cM" (carriage return) character. This is then translated by the terminal driver into a "\n" for your program. When you print a line to a terminal, the terminal driver notices the "\n" newline character (whatever it might be on your platform) and turns it into the "\cM\cJ" (carriage return, line feed) sequence that moves the cursor to the start of the line and down one line.

Even network protocols have their own expectations. Most protocols prefer to receive and send "\cM\cJ" as the line terminator, but many servers also accept merely a "\cJ". This varies between protocols and servers, so check the documentation closely!

The important notion here is that if the I/O library thinks you are working with a text file, it may be translating sequences of bytes for you. This is a problem in two situations: when your file is not text (e.g., you're reading a JPEG file) and when your file is text but not in a byte-oriented ASCII-like encoding (e.g., UTF-8 or any of the other encodings the world uses to represent their characters). As if this weren't bad enough, some systems (again, MS-DOS is an example) use a particular byte sequence in a text file to indicate end-of-file. An I/O library that knows about text files on such a platform will indicate EOF when that byte sequence is read. Recipe 8.11 shows how to disable any translation that your I/O library might be doing.

I/O Layers

With v5.8, Perl I/O operations are no longer simply wrappers on top of stdio. Perl now has a flexible system (I/O layers) that transparently filters multiple encodings of external data. In Chapter 7 we met the :unix layer, which implements unbuffered I/O. There are also layers for using your platform's stdio (:stdio) and Perl's portable stdio implementation (:perlio), both of which buffer input and output. In this chapter, these implementation layers don't interest us as much as the encoding layers built on top of them.

The :crlf layer converts a carriage return and line feed (CRLF, "\cM\cJ") to "\n" when reading from a file, and converts "\n" to CRLF when writing. The opposite of :crlf is :raw, which makes it safe to read or write binary data from the filehandle. You can specify that a filehandle contains UTF-8 data with :utf8, or specify an encoding with :encoding(...). You can even write your own filter in Perl that processes data being read before your program gets it, or processes data being written before it is sent to the device.

It's worth emphasizing: to disable :crlf, specify the :raw layer. The :bytes layer is sometimes misunderstood to be the opposite of :crlf, but they do completely different things. The former refers to the UTF-8ness of strings, and the latter to the behind-the-scenes conversion of carriage returns and line feeds.
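To make that distinction concrete, here is a minimal sketch (the file dos.txt and its contents are hypothetical) contrasting the two layers on a single CRLF-terminated line:

    # assume dos.txt holds the four bytes "hi\cM\cJ"
    open(my $fh, "<:crlf", "dos.txt")   or die "can't open dos.txt: $!";
    my $line = <$fh>;       # $line is "hi\n": the CRLF was translated
    close($fh);

    open($fh, "<:raw", "dos.txt")       or die "can't open dos.txt: $!";
    $line = <$fh>;          # $line is "hi\cM\cJ": the bytes arrive untouched
    close($fh);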
You may specify I/O layers when you open the file:

    open($fh, "<:raw:utf8", $filename);             # read UTF-8 from the file
    open($fh, "<:encoding(shiftjis)", $filename);   # shiftjis japanese encoding
    open(FH, "+<:crlf", $filename);                 # convert between CRLF and \n

Or you may use binmode to change the layers of an existing handle:

    binmode($fh, ":raw:utf8");
    binmode($fh, ":raw:encoding(shiftjis)");
    binmode(FH, ":raw:crlf");

Because binmode pushes onto the stack of I/O layers, and the facility for removing layers is still evolving, you should always specify a complete set of layers by making the first layer be :raw as follows:

    binmode(HANDLE, ":raw");                        # binary-safe
    binmode(HANDLE);                                # same as :raw
    binmode(HANDLE, ":raw :utf8");                  # read/write UTF-8
    binmode(HANDLE, ":raw :encoding(shiftjis)");    # read/write shiftjis

Recipe 8.18, Recipe 8.19, and Recipe 8.20 show how to manipulate I/O layers.

Advanced Operations

Use the read function to read a fixed-length record. It takes three arguments: a filehandle, a scalar variable, and the number of characters to read. It returns undef if an error occurred or else returns the number of characters read.

    $rv = read(HANDLE, $buffer, 4096)
            or die "Couldn't read from HANDLE: $!\n";
    # $rv is the number of bytes read,
    # $buffer holds the data read

To write a fixed-length record, just use print.

The truncate function changes the length (in bytes) of a file, which can be specified as a filehandle or as a filename. It returns true if the file was successfully truncated, false otherwise:

    truncate(HANDLE, $length)
        or die "Couldn't truncate: $!\n";
    truncate("/tmp/$$.pid", $length)
        or die "Couldn't truncate: $!\n";

Each filehandle keeps track of where it is in the file. Reads and writes occur from this point, unless you've specified the O_APPEND flag (see Recipe 7.1). Fetch the file position for a filehandle with tell, and set it with seek. Because the library rewrites data to preserve the illusion that "\n" is the line terminator, and also because you might be using characters with code points above 255 and therefore requiring a multibyte encoding, you cannot portably seek to offsets calculated simply by counting characters. Unless you can guarantee your file uses one byte per character, seek only to offsets returned by tell.

    $pos = tell(DATAFILE);
    print "I'm $pos bytes from the start of DATAFILE.\n";

The seek function takes three arguments: the filehandle, the offset (in bytes) to go to, and a numeric argument indicating how to interpret the offset. 0 indicates an offset from the start of the file (like the value returned by tell); 1, an offset from the current location (a negative number means move backward in the file, a positive number means move forward); and 2, an offset from end-of-file.

    seek(LOGFILE, 0, 2)         or die "Couldn't seek to the end: $!\n";
    seek(DATAFILE, $pos, 0)     or die "Couldn't seek to $pos: $!\n";
    seek(OUT, -20, 1)           or die "Couldn't seek back 20 bytes: $!\n";

So far we've been describing buffered I/O. That is, readline or <FH>, print, read, seek, and tell are all operations that use buffering for speed and efficiency. This is their default behavior, although if you've specified an unbuffered I/O layer for that handle, they won't be buffered. Perl also provides an alternate set of I/O operations guaranteed to be unbuffered no matter what I/O layer is associated with the handle. These are sysread, syswrite, and sysseek, all discussed in Chapter 7.

The sysread and syswrite functions are different in appearance from their <FH> and print counterparts.
Both take a filehandle to act on, a scalar variable to either read into or write out from, and the number of characters to transfer. (With binary data, this is the number of bytes, not characters.) They also accept an optional fourth argument, the offset from the start of the scalar variable at which to start reading or writing:

    $written = syswrite(DATAFILE, $mystring, length($mystring));
    die "syswrite failed: $!\n"
        unless $written == length($mystring);
    $read = sysread(INFILE, $block, 256, 5);
    warn "only read $read bytes, not 256"
        if 256 != $read;

The syswrite call sends the contents of $mystring to DATAFILE. The sysread call reads 256 characters from INFILE and stores them in $block, starting at offset 5, leaving intact the first 5 characters it skipped over.

Both sysread and syswrite return the number of characters transferred, which could be different than the amount of data you were attempting to transfer. Maybe the file didn't have as much data as you thought, so you got a short read. Maybe the filesystem that the file lives on filled up. Maybe your process was interrupted partway through the write. Stdio takes care of finishing the transfer in cases of interruption, but if you use raw sysread and syswrite calls, you must finish up yourself. See Recipe 9.3 for an example.

The sysseek function doubles as an unbuffered replacement for both seek and tell. It takes the same arguments as seek, but it returns the new position on success and undef on error. To find the current position within the file:

    $pos = sysseek(HANDLE, 0, 1);       # don't change position
    die "Couldn't sysseek: $!\n" unless defined $pos;

These are the basic operations available to you. The art and craft of programming lies in using these basic operations to solve complex problems such as finding the number of lines in a file, reversing lines in a file, randomly selecting a line from a file, building an index for a file, and so on.

Recipe 8.1 Reading Lines with Continuation Characters

8.1.1 Problem

You have a file with long lines split over two or more lines, with backslashes to indicate that a continuation line follows. You want to rejoin those split lines. Makefiles, shell scripts, and many other scripting or configuration languages let you break a long line into several shorter ones in this fashion.

8.1.2 Solution

Build up the complete lines one at a time until reaching one without a backslash:

    while (defined($line = <FH>)) {
        chomp $line;
        if ($line =~ s/\\$//) {
            $line .= <FH>;
            redo unless eof(FH);
        }
        # process full record in $line here
    }

8.1.3 Discussion

Here's an example input file:

    DISTFILES = $(DIST_COMMON) $(SOURCES) $(HEADERS) \
                $(TEXINFOS) $(INFOS) $(MANS) $(DATA)
    DEP_DISTFILES = $(DIST_COMMON) $(SOURCES) $(HEADERS) \
                $(TEXINFOS) $(INFO_DEPS) $(MANS) $(DATA) \
                $(EXTRA_DIST)

You'd like to process that file a record at a time with the escaped newlines ignored. The first record would then be the first two lines, the second record the next three lines, etc.

Here's how the algorithm works. The while loop reads lines one at a time. The substitution operator s/// tries to remove a trailing backslash. If the substitution fails, we've found a line without a backslash at the end. Otherwise, read another record, concatenate it onto the accumulating $line variable, and use redo to jump back to just inside the opening brace of the while loop. This lands us back on the chomp.

A frequent problem with files intended to be in this format arises when unnoticed spaces or tabs follow the backslash before the newline.
The substitution that found continuation lines would be more forgiving if written this way:

    if ($line =~ s/\\\s*$//) {
        # as before
    }

Unfortunately, even if your program is forgiving, surely others will not be. Remember to be liberal in what you accept, but conservative in what you produce.

8.1.4 See Also

The chomp function in perlfunc(1) and in Chapter 29 of Programming Perl; the redo keyword in the "Loop Control" sections of perlsyn(1) and Chapter 4 of Programming Perl

Recipe 8.2 Counting Lines (or Paragraphs or Records) in a File

8.2.1 Problem

You need to compute the number of lines in a file.

8.2.2 Solution

Many systems have a wc program to count lines in a file:

    $count = `wc -l < $file`;
    die "wc failed: $?" if $?;
    chomp($count);

You could also open the file and read line-by-line until the end, counting lines as you go:

    open(FILE, "<", $file) or die "can't open $file: $!";
    $count++ while <FILE>;
    # $count now holds the number of lines read

Here's the fastest solution, assuming your line terminator really is "\n":

    $count += tr/\n/\n/ while sysread(FILE, $_, 2 ** 20);

8.2.3 Discussion

Although you can use -s $file to determine the file size in bytes, you generally cannot use it to derive a line count. See the Introduction in Chapter 9 for more on -s.

If you can't or don't want to call another program to do your dirty work, you can emulate wc by opening up and reading the file yourself:

    open(FILE, "<", $file) or die "can't open $file: $!";
    $count++ while <FILE>;
    # $count now holds the number of lines read

Another way of writing this is:

    open(FILE, "<", $file) or die "can't open $file: $!";
    for ($count = 0; <FILE>; $count++) { }

If you're not reading from any other files, you don't need the $count variable in this case. The special variable $. holds the number of lines read since a filehandle was last explicitly closed:

    1 while <FILE>;
    $count = $.;

This reads in all records in the file, then discards them.

To count paragraphs, set the global input record separator variable $/ to the empty string ("") before reading to make the input operator (<FILE>) read a paragraph at a time.

    $/ = "";        # enable paragraph mode for all reads
    open(FILE, "<", $file) or die "can't open $file: $!";
    1 while <FILE>;
    $para_count = $.;

The sysread solution reads the file a megabyte at a time. Once end-of-file is reached, sysread returns 0. This ends the loop, as does undef, which would indicate an error. The tr operation doesn't really substitute \n for \n in the string; it's an old idiom for counting occurrences of a character in a string.

8.2.4 See Also

The tr operator in perlop(1) and Chapter 5 of Programming Perl; your system's wc(1) manpage; the $/ entry in perlvar(1), and in the "Special Variables in Alphabetical Order" section of Chapter 28 of Programming Perl; the Introduction to Chapter 9

Recipe 8.3 Processing Every Word in a File

8.3.1 Problem

You need to do something to every word in a file, similar to the foreach function of csh.

8.3.2 Solution

Either split each line on whitespace:

    while (<>) {
        for $chunk (split) {
            # do something with $chunk
        }
    }

or use the m//g operator to pull out one chunk at a time:

    while (<>) {
        while ( /(\w[\w'-]*)/g ) {
            # do something with $1
        }
    }

8.3.3 Discussion

Decide what you mean by "word." Sometimes you want anything but whitespace, sometimes you want only program identifiers, and sometimes you want English words. Your definition governs which regular expression to use.

The preceding two approaches work differently.
Patterns are used in the first approach to decide what is not a word. In the second, they're used to decide what is a word.

With these techniques, it's easy to make a word frequency counter. Use a hash to store how many times each word has been seen:

    # Make a word frequency count
    %seen = ();
    while (<>) {
        while ( /(\w[\w'-]*)/g ) {
            $seen{lc $1}++;
        }
    }

    # output hash in a descending numeric sort of its values
    foreach $word ( sort { $seen{$b} <=> $seen{$a} } keys %seen) {
        printf "%5d %s\n", $seen{$word}, $word;
    }

To make the example program count line frequency instead of word frequency, omit the second while loop and use $seen{lc $_}++ instead:

    # Line frequency count
    %seen = ();
    while (<>) {
        $seen{lc $_}++;
    }
    foreach $line ( sort { $seen{$b} <=> $seen{$a} } keys %seen ) {
        printf "%5d %s", $seen{$line}, $line;
    }

Odd things that may need to be considered as words include "M.I.T.", "Micro$oft", "o'clock", "49ers", "street-wise", "and/or", "&", "c/o", "St.", "Tschüß", and "Niño". Bear this in mind when you choose a pattern to match. The last two require you to place a use locale in your program and then use \w for a word character in the current locale, or else use the Unicode letter property if you have Unicode text:

    /(\p{Letter}[\p{Letter}'-]*)/

8.3.4 See Also

perlre(1); the split function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 6.3; Recipe 6.23

Recipe 8.4 Reading a File Backward by Line or Paragraph

8.4.1 Problem

You want to process each line or paragraph of a text file in reverse.

8.4.2 Solution

Read all lines into an array, then process that array from the end to the start:

    @lines = <FILE>;
    while ($line = pop @lines) {
        # do something with $line
    }

Or store an array of lines in reverse order:

    @lines = reverse <FILE>;
    foreach $line (@lines) {
        # do something with $line
    }

Or use the Tie::File module (standard as of v5.8):

    use Tie::File;

    tie(@lines, "Tie::File", $FILENAME, mode => 0)
        or die "Can't tie $FILENAME: $!";
    $max_lines = $#lines;
    for ($i = $max_lines; $i >= 0; $i--) {
        # do something with $lines[$i], e.g., line number them:
        printf "%5d %s\n", $i+1, $lines[$i];
    }

8.4.3 Discussion

The limitations of file access mentioned in this chapter's Introduction prevent reading a line at a time starting from the end. You must read the lines into memory, then process them in reverse order. This requires at least as much available memory as the size of the file, unless you use tricks like Tie::File does.

The first technique moves through the array of lines in reverse order. This destructively processes the array, popping an element off the end of the array each time through the loop. We could do it non-destructively with:

    for ($i = $#lines; $i != -1; $i--) {
        $line = $lines[$i];
    }

The second approach generates an array of lines already in reverse order. This array can then be processed non-destructively. We get the reversed lines because the assignment to @lines confers list context on the return from reverse, and reverse confers list context on its argument of <FILE>, which returns a list of all lines in the file.

These approaches are easily extended to paragraphs just by changing $/:

    # this enclosing block keeps local $/ temporary
    {
        local $/ = "";
        @paragraphs = reverse <FILE>;
    }

    foreach $paragraph (@paragraphs) {
        # do something
    }

The Tie::File module lets you treat the file as an array of lines. The solution then becomes simply iterating through the array a line at a time from the end back to the start.
It's much slower than reading everything into memory and reversing it, but works on files too big to fit into memory all at once. Be careful, though: Tie::File will rewrite the file if you change the contents of the tied @lines, so don't do that. In our example, assigning @lines = reverse(@lines) would reverse the file on disk! By opening the file with mode O_RDONLY (0), you can avoid that possibility. The default mode is O_RDWR | O_CREAT. Also, Tie::File cannot emulate the paragraph semantics of setting $/ to the empty string ("").

8.4.4 See Also

The reverse function in perlfunc(1) and in Chapter 29 of Programming Perl; the $/ entry in perlvar(1), and in Chapter 28 of Programming Perl; the documentation for the standard Tie::File module; Recipe 4.11; Recipe 1.7

Recipe 8.5 Trailing a Growing File

8.5.1 Problem

You want to read from a continually growing file, but the read fails when you reach the current end-of-file.

8.5.2 Solution

Read until end-of-file. Sleep, clear the EOF flag, and read some more. Repeat until interrupted. To clear the EOF flag, either use seek:

    for (;;) {
        while (<FH>) { .... }
        sleep $SOMETIME;
        seek(FH, 0, 1);
    }

or use the IO::Handle module's clearerr method:

    use IO::Handle;

    for (;;) {
        while (<FH>) { .... }
        sleep $SOMETIME;
        FH->clearerr();
    }

8.5.3 Discussion

When you read until end-of-file, an internal flag is set that prevents further reading. The most direct way to clear this flag is the clearerr method, if supported: it's in the IO::Handle module.

    $naptime = 1;

    use IO::Handle;
    open (LOGFILE, "/tmp/logfile") or die "can't open /tmp/logfile: $!";
    for (;;) {
        while (<LOGFILE>) { print }     # or appropriate processing
        sleep $naptime;
        LOGFILE->clearerr();            # clear stdio error flag
    }

Because Perl v5.8 ships with its own stdio implementation, that simple approach should almost always work. On the rare system where it doesn't work, you may need to use seek. The seek code given in the Solution tries to move zero bytes from the current position, which nearly always works. It doesn't change the current position, but it should clear the end-of-file condition on the handle so that the next operation picks up new data.

If that still doesn't work, perhaps because it relies on features of your I/O implementation, you may need to use the following seek code, which remembers the old file position explicitly and returns there directly.

    for (;;) {
        for ($curpos = tell(LOGFILE); <LOGFILE>; $curpos = tell(LOGFILE)) {
            # process $_ here
        }
        sleep $naptime;
        seek(LOGFILE, $curpos, 0);      # seek to where we had been
    }

On some kinds of filesystems, the file could be removed while you are reading it. If so, there's probably little reason to continue checking whether it grows. To make the program exit in that case, stat the handle and make sure its link count (index 3 in the return list) hasn't gone to 0:

    exit if (stat(LOGFILE))[3] == 0;

If you're using the File::stat module, you could write that more readably as:

    use File::stat;
    exit if stat(*LOGFILE)->nlink == 0;

The CPAN module File::Tail lets you tie a filehandle so that the read operation blocks at the end of the file until more data is available:

    use File::Tail;

    tie *FH, "File::Tail", (name => $FILENAME);
    while (<FH>) {
        # do something with line read
    }

The <FH> operator in this case never returns undef to indicate end-of-file.
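If you would rather avoid a tied handle, File::Tail also provides an object interface; here is a minimal sketch, assuming the same $FILENAME (the interval and maxinterval values shown are just illustrative polling bounds):

    use File::Tail;

    my $tail = File::Tail->new(name        => $FILENAME,
                               interval    => 1,    # first sleep between checks
                               maxinterval => 10);  # upper bound on dozing
    while (defined(my $line = $tail->read)) {
        print $line;            # read blocks until another line arrives
    }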
8.5.4 See Also

The seek and tell functions in perlfunc(1) and in Chapter 29 of Programming Perl; your system's tail(1) and stdio(3) manpages; the documentation for the standard File::stat module (also in Chapter 32 of Programming Perl); the documentation for the CPAN module File::Tail

Recipe 8.6 Picking a Random Line from a File

8.6.1 Problem

You want to return a random line from a file.

8.6.2 Solution

Use rand and $. (the current line number) to decide which line to print:

    srand;
    rand($.) < 1 && ($line = $_) while <>;
    # $line is the random line

8.6.3 Discussion

This is a beautiful example of a solution that may not be obvious. We read every line in the file but don't have to store them all in memory. This is great for large files. Each line has a 1 in N (where N is the number of lines read so far) chance of being selected.

Here's a replacement for fortune using this algorithm:

    $/ = "%%\n";
    @ARGV = ("/usr/share/games/fortunes") unless @ARGV;
    srand;
    rand($.) < 1 && ($adage = $_) while <>;
    print $adage;

If you know line offsets (for instance, you've created an index) and the number of lines, you can randomly select a line and jump to its offset in the file, but you usually don't have such an index.

Here's a more rigorous explanation of how the algorithm works. The function call rand($.) picks a random number between 0 and the current line number. Therefore, you have a one in N chance, that is, 1/N, of keeping the Nth line. Therefore you've a 100% chance of keeping the first line, a 50% chance of keeping the second, a 33% chance of keeping the third, and so on. The question is whether this is fair for all N, where N is any positive integer.

First, some concrete examples, then abstract ones. Obviously, a file with one line (N=1) is fair: you always keep the first line because 1/1 = 100%, making it fair for files of 1 line. For a file with two lines, N=2. You always keep the first line; then when reaching the second line, you have a 50% chance of keeping it. Thus, both lines have an equal chance of being selected, which shows that N=2 is fair. For a file with three lines, N=3. You have a one-third chance, 33%, of keeping that third line. That leaves a two-thirds chance of retaining one of the first two out of the three lines. But we've already shown that for those first two lines there's a 50-50 chance of selecting either one. 50 percent of two-thirds is one-third. Thus, you have a one-third chance of selecting each of the three lines of the file.

In the general case, a file of N+1 lines will choose the last line 1/(N+1) times and one of the previous N lines N/(N+1) times. Dividing N/(N+1) by N leaves us with 1/(N+1) for each of the first N lines in our N+1 line file, and also 1/(N+1) for line number N+1. The algorithm is therefore fair for all N, where N is a positive integer.

We've managed to fairly choose a random line from a file with speed directly proportional to the size of the file, but using no more memory than it takes to hold the longest line, even in the worst case.

8.6.4 See Also

The $. entry in perlvar(1) and in Chapter 28 of Programming Perl; Recipe 2.6; Recipe 2.7

Recipe 8.7 Randomizing All Lines

8.7.1 Problem

You want to copy a file and randomly reorder its lines.
8.7.2 Solution

Read all lines into an array, shuffle the array using List::Util's shuffle function, and write the shuffled lines back out:

    use List::Util qw(shuffle);

    while (<INPUT>) {
        push(@lines, $_);
    }
    @reordered = shuffle(@lines);
    foreach (@reordered) {
        print OUTPUT $_;
    }

8.7.3 Discussion

The easiest approach is to read all lines into memory and shuffle them there. Because you don't know where lines start in the file, you can't just shuffle a list of line numbers and then extract lines in the order they'll appear in the shuffled file. Even if you did know the byte offsets of the start of each line, it would probably still be slower because you'd be seeking around in the file instead of sequentially reading it from start to finish.

If you have a version of Perl older than v5.8, you can download the List::Util module from CPAN.

8.7.4 See Also

The documentation for the standard List::Util module; Recipe 2.6; Recipe 2.7; Recipe 4.18

Recipe 8.8 Reading a Particular Line in a File

8.8.1 Problem

You want to extract a single line from a file.

8.8.2 Solution

The simplest solution is to read the lines until you get to the one you want:

    # looking for line number $DESIRED_LINE_NUMBER
    $. = 0;
    do { $LINE = <HANDLE> } until $. == $DESIRED_LINE_NUMBER || eof;

If you are going to be doing this a lot and the file fits into memory, read the file into an array:

    @lines = <HANDLE>;
    $LINE = $lines[$DESIRED_LINE_NUMBER];

The standard (as of v5.8) Tie::File ties an array to a file, one line per array element:

    use Tie::File;
    use Fcntl;

    tie(@lines, Tie::File, $FILE, mode => O_RDWR)
        or die "Cannot tie file $FILE: $!\n";
    $line = $lines[$sought - 1];

If you have the DB_File module, its DB_RECNO access method ties an array to a file, one line per array element:

    use DB_File;
    use Fcntl;

    $tie = tie(@lines, DB_File, $FILE, O_RDWR, 0666, $DB_RECNO)
        or die "Cannot open file $FILE: $!\n";

    # extract it
    $line = $lines[$sought - 1];

8.8.3 Discussion

Each strategy has different features, useful in different circumstances. The linear access approach is easy to write and best for short files. The Tie::File module gives good performance, regardless of the size of the file or which line you're reading (and is pure Perl, so doesn't require any external libraries). The DB_File mechanism has some initial overhead, but later accesses are faster than with linear access, so use it for long files that are accessed more than once and are accessed out of order.

It is important to know whether you're counting lines from 0 or 1. The $. variable is 1 after the first line is read, so count from 1 when using linear access. The index mechanism uses many offsets, so count from 0. Tie::File and DB_File treat the file's records as an array indexed from 0, so count lines from 0.

Here are three different implementations of the same program, print_line. The program takes two arguments: a filename and a line number to extract.

The version in Example 8-1 simply reads lines until it finds the one it's looking for.

Example 8-1. print_line-v1

    #!/usr/bin/perl -w
    # print_line-v1 - linear style

    @ARGV == 2 or die "usage: print_line FILENAME LINE_NUMBER\n";

    ($filename, $line_number) = @ARGV;
    open(INFILE, "<", $filename)
        or die "Can't open $filename for reading: $!\n";
    while (<INFILE>) {
        $line = $_;
        last if $. == $line_number;
    }
    if ($. != $line_number) {
        die "Didn't find line $line_number in $filename\n";
    }
    print;

The Tie::File version is shown in Example 8-2.

Example 8-2.
print_line-v2

    #!/usr/bin/perl -w
    # print_line-v2 - Tie::File style

    use Tie::File;
    use Fcntl;

    @ARGV == 2 or die "usage: print_line FILENAME LINE_NUMBER\n";

    ($filename, $line_number) = @ARGV;

    tie @lines, Tie::File, $filename, mode => O_RDWR
        or die "Can't open $filename for reading: $!\n";

    if (@lines < $line_number) {
        die "Didn't find line $line_number in $filename\n";
    }
    print "$lines[$line_number-1]\n";

The DB_File version in Example 8-3 follows the same logic as Tie::File.

Example 8-3. print_line-v3

    #!/usr/bin/perl -w
    # print_line-v3 - DB_File style

    use DB_File;
    use Fcntl;

    @ARGV == 2 or die "usage: print_line FILENAME LINE_NUMBER\n";

    ($filename, $line_number) = @ARGV;
    $tie = tie(@lines, DB_File, $filename, O_RDWR, 0666, $DB_RECNO)
        or die "Cannot open file $filename: $!\n";

    unless ($line_number < $tie->length) {
        die "Didn't find line $line_number in $filename\n";
    }

    print $lines[$line_number-1];   # easy, eh?

If you will be retrieving lines by number often and the file doesn't fit into memory, build a byte-address index to let you seek directly to the start of the line using the techniques in Recipe 8.27.

8.8.4 See Also

The documentation for the standard Tie::File and DB_File modules (also in Chapter 32 of Programming Perl); the tie function in perlfunc(1) and in Chapter 29 of Programming Perl; the entry on $. in perlvar(1) and in Chapter 28 of Programming Perl; Recipe 8.27

Recipe 8.9 Processing Variable-Length Text Fields

8.9.1 Problem

You want to extract variable-length fields from your input.

8.9.2 Solution

Use split with a pattern matching the field separators.

    # given $RECORD with fields separated by a pattern,
    # extract a list of fields
    @FIELDS = split(/PATTERN/, $RECORD);

8.9.3 Discussion

The split function takes up to three arguments: PATTERN, EXPRESSION, and LIMIT. The LIMIT parameter is the maximum number of fields to split into. (If the input contains more fields, they are returned unsplit in the final list element.) If LIMIT is omitted, all fields (except any final empty ones) are returned. EXPRESSION gives the string value to split. If EXPRESSION is omitted, $_ is split. PATTERN is a pattern matching the field separator. If PATTERN is omitted, contiguous stretches of whitespace are used as the field separator and leading empty fields are silently discarded.

If your input field separator isn't a fixed string, you might want split to return the field separators as well as the data by using parentheses in PATTERN to save the field separators. For instance:

    split(/([+-])/, "3+5-2");

returns the values:

    (3, "+", 5, "-", 2)

To split colon-separated records in the style of the /etc/passwd file, use:

    @fields = split(/:/, $RECORD);

The classic application of split is whitespace-separated records:

    @fields = split(/\s+/, $RECORD);

If $RECORD started with whitespace, this last use of split would have put an empty string into the first element of @fields because split would consider the record to have an initial empty field. If you didn't want this, you could use this special form of split:

    @fields = split(" ", $RECORD);

This behaves like split with a pattern of /\s+/, but ignores leading whitespace.

When the record separator can appear in the record, you have a problem. The usual solution is to escape occurrences of the record separator in records by prefixing them with a backslash, as sketched below. See Recipe 1.18.
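As a rough sketch of that approach (deliberately ignoring the further wrinkle of escaped backslashes), you can split on separators not preceded by a backslash and then unescape what remains:

    $RECORD = 'video:/usr/local/video\:old:root';   # hypothetical record
    @fields = split(/(?<!\\):/, $RECORD);   # split only on unescaped colons
    s/\\:/:/g for @fields;                  # turn each "\:" back into ":"
    # @fields is now ("video", "/usr/local/video:old", "root")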
8.9.4 See Also

The split function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 1.18

Recipe 8.10 Removing the Last Line of a File

8.10.1 Problem

You'd like to remove the last line from a file.

8.10.2 Solution

Use the standard (as of v5.8) Tie::File module and delete the last element from the tied array:

    use Tie::File;

    tie @lines, Tie::File, $file or die "can't update $file: $!";
    delete $lines[-1];

8.10.3 Discussion

The Tie::File solution is the most efficient solution, at least for large files, because it doesn't have to read through the entire file to find the last line and doesn't read the entire file into memory. It is, however, considerably slower for small files than code you could implement yourself by hand. That doesn't mean you shouldn't use Tie::File; it just means you've optimized for programmer time instead of for computer time.

If you don't have Tie::File and can't install it from CPAN, read the file a line at a time and keep track of the byte address of the last line you've seen. When you've exhausted the file, truncate to the last address you saved:

    open (FH, "+<", $file)          or die "can't update $file: $!";
    while (<FH>) {
        $addr = tell(FH) unless eof(FH);
    }
    truncate(FH, $addr)             or die "can't truncate $file: $!";

Remembering the offset is more efficient than reading the whole file into memory because it holds only one given line at a time. Although you still have to grope your way through the whole file, you can use this technique on files larger than available memory.

8.10.4 See Also

The documentation for the standard Tie::File module; the truncate and tell functions in perlfunc(1) and in Chapter 29 of Programming Perl; your system's open(2) and fopen(3) manpages; Recipe 8.18

Recipe 8.11 Processing Binary Files

8.11.1 Problem

You want to read 8-bit binary data as 8-bit binary data, i.e., neither as characters in a particular encoding nor as a text file with any newline or end-of-file conversions that your I/O library might want to do.

8.11.2 Solution

Use the binmode function on the filehandle:

    binmode(HANDLE);

8.11.3 Discussion

The binmode function lets you specify new I/O layers for a filehandle. The default layer to specify is :raw, which removes any layers that would interfere with binary data. The Solution is thus equivalent to:

    binmode(HANDLE, ":raw");

except that explicitly specifying :raw only works on Perl 5.8 and later. The one-argument form of binmode works on all versions of Perl.

Because Perl makes :crlf the default if you are on an operating system that needs it, you should rarely (if ever) need to specify :crlf in your program. Furthermore, it's generally not wise to add or remove the :crlf layer once you've begun reading from the file, as there may be data already read into buffers that you can't unread. You can, however, safely change the :encoding(...) layer midstream (when parsing XML, for example).

You should get into the habit of calling binmode when you open a binary file. This will make your program portable to systems that might (un)helpfully translate bytes in your binary file into something unusable.
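As an illustration, here is a minimal sketch of a binary-safe file copy (the filenames are made up); without the two binmode calls, a CRLF-translating system could corrupt the copy:

    open(my $in,  "<", "photo.jpg")      or die "can't read photo.jpg: $!";
    open(my $out, ">", "photo-copy.jpg") or die "can't write photo-copy.jpg: $!";
    binmode($in);                   # no newline or EOF translation on input
    binmode($out);                  # ... and none on output
    my $buf;
    while (read($in, $buf, 64 * 1024)) {
        print {$out} $buf;
    }
    close($out)                     or die "can't close photo-copy.jpg: $!";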
You may specify the I/O layers when you open a filehandle, rather than using binmode after the fact:

    open(FH, "< :raw", $filename);      # binary mode

Specify the default set of layers for all subsequently opened input and output filehandles with the open pragma:

    use open IN => ":raw";              # binary files

8.11.4 See Also

The PerlIO(3) manpage; the open and binmode functions in perlfunc(1) and in Chapter 29 of Programming Perl; your system's open(2) and fopen(3) manpages

Recipe 8.12 Using Random-Access I/O

8.12.1 Problem

You have to read a binary record from the middle of a large file but don't want to read a record at a time to get there.

8.12.2 Solution

Once you know the record's size, multiply it by the record number to get the byte address, and then seek to that byte address and read the record:

    $ADDRESS = $RECSIZE * $RECNO;
    seek(FH, $ADDRESS, 0) or die "seek:$!";
    read(FH, $BUFFER, $RECSIZE);

8.12.3 Discussion

The Solution assumes the first record has a RECNO of 0. If you're counting from one, use:

    $ADDRESS = $RECSIZE * ($RECNO-1);

This is best applied to binary data. Applying it to text files assumes you have a constant character width and constant line length. This rules out most Unicode encodings, any kind of Windows text file, and any text file where lines can have different lengths.

8.12.4 See Also

The seek function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 8.13

Recipe 8.13 Updating a Random-Access File

8.13.1 Problem

You want to read an old record from a binary file, change its values, and write back the record.

8.13.2 Solution

After reading the old record, pack up the updated values, seek to the previous address, and write it back.

    use Fcntl;      # for SEEK_SET and SEEK_CUR

    $ADDRESS = $RECSIZE * $RECNO;
    seek(FH, $ADDRESS, SEEK_SET)    or die "Seeking: $!";
    read(FH, $BUFFER, $RECSIZE) == $RECSIZE
                                    or die "Reading: $!";
    @FIELDS = unpack($FORMAT, $BUFFER);
    # update fields, then
    $BUFFER = pack($FORMAT, @FIELDS);
    seek(FH, -$RECSIZE, SEEK_CUR)   or die "Seeking: $!";
    print FH $BUFFER;
    close FH                        or die "Closing: $!";

8.13.3 Discussion

You don't have to use anything fancier than print in Perl to output a record. Remember that the opposite of read is not write but print, although oddly enough, the opposite of sysread is syswrite.

The example program shown in Example 8-4, weekearly, takes one argument: the user whose record you want to backdate by a week. (Of course, in practice, you wouldn't really want to (nor be able to!) mess with the system accounting files.) This program requires write access to the file to be updated, since it opens the file in update mode. After fetching and altering the record, it packs it up again, skips backward in the file one record, and writes it out.

Example 8-4.
weekearly

    #!/usr/bin/perl -w
    # weekearly -- set someone's login date back a week

    use User::pwent;
    use IO::Seekable;

    $typedef = "L A12 A16";             # linux fmt; sunos is "L A8 A16"
    $sizeof  = length(pack($typedef, ()));
    $user    = shift(@ARGV) || $ENV{USER} || $ENV{LOGNAME};

    $address = getpwnam($user)->uid * $sizeof;

    open (LASTLOG, "+<:raw", "/var/log/lastlog")
        or die "can't update /var/log/lastlog: $!";
    seek(LASTLOG, $address, SEEK_SET)
        or die "seek failed: $!";
    read(LASTLOG, $buffer, $sizeof) == $sizeof
        or die "read failed: $!";

    ($time, $line, $host) = unpack($typedef, $buffer);
    $time  -= 24 * 7 * 60 * 60;         # back-date a week
    $buffer = pack($typedef, $time, $line, $host);

    seek(LASTLOG, -$sizeof, SEEK_CUR)   # backup one record
        or die "seek failed: $!";
    print LASTLOG $buffer;

    close(LASTLOG)
        or die "close failed: $!";

8.13.4 See Also

The PerlIO(3) manpage; the open, seek, read, pack, and unpack functions in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 8.12; Recipe 8.14

Recipe 8.14 Reading a String from a Binary File

8.14.1 Problem

You want to read a NUL-terminated string from a file, starting at a particular address.

8.14.2 Solution

Ensure you're working with a binary file, set $/ to an ASCII NUL, and read the string with <FH>:

    binmode(FH);                # binary mode
    $old_rs = $/;               # save old $/
    $/ = "\0";                  # ASCII 0: NUL
    seek(FH, $addr, SEEK_SET)   or die "Seek error: $!\n";
    $string = <FH>;             # read string
    chomp $string;              # remove NUL
    $/ = $old_rs;               # restore old $/

You can use local to save and restore $/:

    {
        local $/ = "\0";
        # ...
    }   # $/ is automatically restored

8.14.3 Discussion

The example program shown in Example 8-5, bgets, accepts a filename and one or more byte addresses as arguments. Decimal, octal, or hexadecimal addresses may be specified. For each address, the program reads and prints the null- or EOF-terminated string at that position.

Example 8-5. bgets

    #!/usr/bin/perl -w
    # bgets - get a string from an address in a binary file

    use IO::Seekable;
    use open IO => ":raw";      # binary mode on all opened handles

    ($file, @addrs) = @ARGV     or die "usage: $0 file addr ...";
    open(FH, $file)             or die "cannot open $file: $!";
    $/ = "\000";
    foreach $addr (@addrs) {
        $addr = oct $addr if $addr =~ /^0/;
        seek(FH, $addr, SEEK_SET)
            or die "can't seek to $addr in $file: $!";
        printf qq{%#x %#o %d "%s"\n}, $addr, $addr, $addr, scalar <FH>;
    }

Example 8-6 is a simple implementation of the Unix strings program.

Example 8-6. strings

    #!/usr/bin/perl -w
    # strings - pull strings out of a binary file

    $/ = "\0";
    use open IO => ":raw";
    while (<>) {
        while (/([\040-\176\s]{4,})/g) {
            print $1, "\n";
        }
    }

8.14.4 See Also

The PerlIO(3) manpage; the seek, getc, and ord functions in perlfunc(1) and in Chapter 29 of Programming Perl; the discussion of qq// in the "Quote and Quote-Like Operators" section of the perlop(1) manpage, and in the "Pick Your Own Quotes" section of Chapter 2 of Programming Perl

Recipe 8.15 Reading Fixed-Length Records

8.15.1 Problem

You want to read a file whose records have a fixed length.

8.15.2 Solution

Use read and unpack:

    # $RECORDSIZE is the length of a record, in bytes.
    # $TEMPLATE is the unpack template for the record
    # FILE is the file to read from
    # @FIELDS is an array, one element per field

    until ( eof(FILE) ) {
        read(FILE, $record, $RECORDSIZE) == $RECORDSIZE
            or die "short read\n";
        @FIELDS = unpack($TEMPLATE, $record);
    }

8.15.3 Discussion

Because the file in question is not a text file, you can't use <FILE> or IO::Handle's getline method to read records.
Instead, you must simply read a particular number of bytes into a variable. This variable contains one record's data, which you decode using unpack with the appropriate format.

For binary data, the catch is determining that format. When reading data written by a C program, this can mean peeking at C include files or manpages describing the structure layout, and this requires knowledge of C. It also requires that you become unnaturally chummy with your C compiler, because otherwise it's hard to predict field padding and alignment (such as the x2 in the format used in Recipe 8.24). If you're lucky enough to be on a Berkeley Unix system or a system supporting gcc, then you may be able to use the c2ph tool distributed with Perl to cajole your C compiler into helping you with this.

The tailwtmp program at the end of this chapter uses the format described in utmp(5) under Linux, and works on its /var/log/wtmp and /var/run/utmp files. Once you commit to working in binary format, machine dependencies creep in fast. It probably won't work unaltered on your system, but the procedure is still illustrative. Here is the relevant layout from the C include file on Linux:

    #define UT_LINESIZE           12
    #define UT_NAMESIZE           8
    #define UT_HOSTSIZE           16

    struct utmp {                       /* here are the pack template codes */
        short ut_type;                  /* s for short, must be padded      */
        pid_t ut_pid;                   /* i for integer                    */
        char ut_line[UT_LINESIZE];      /* A12 for 12-char string           */
        char ut_id[2];                  /* A2, but need x2 for alignment    */
        time_t ut_time;                 /* l for long                       */
        char ut_user[UT_NAMESIZE];      /* A8 for 8-char string             */
        char ut_host[UT_HOSTSIZE];      /* A16 for 16-char string           */
        long ut_addr;                   /* l for long                       */
    };

Once you figure out the binary layout, feed that (in this case, "s x2 i A12 A2 x2 l A8 A16 l") to pack with an empty field list to determine the record's size. Remember to check the return value of read to make sure you got the number of bytes you asked for.

If your records are text strings, use the "a" or "A" unpack templates.

Fixed-length records are useful in that the nth record begins at byte offset SIZE * (n-1) in the file, where SIZE is the size of a single record. See the indexing code in Recipe 8.8 for an example.

8.15.4 See Also

The unpack, pack, and read functions in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 1.1

Recipe 8.16 Reading Configuration Files

8.16.1 Problem

You want to allow users of your program to change its behavior through configuration files.

8.16.2 Solution

Either process a file in trivial VAR=VALUE format, setting a hash key-value pair for each setting:

    while (<CONFIG>) {
        chomp;                      # no newline
        s/#.*//;                    # no comments
        s/^\s+//;                   # no leading white
        s/\s+$//;                   # no trailing white
        next unless length;         # anything left?
        my ($var, $value) = split(/\s*=\s*/, $_, 2);
        $User_Preferences{$var} = $value;
    }

or better yet, treat the config file as full Perl code:

    do "$ENV{HOME}/.progrc";

8.16.3 Discussion

The first solution lets you read config files in a trivial format like this (comments and empty lines are allowed):

    # set class C net
    NETMASK = 255.255.255.0
    MTU     = 296

    DEVICE  = cua1
    RATE    = 115200
    MODE    = adaptive

After you're done, you can pull in a setting by using something like $User_Preferences{"RATE"} to find the value 115200. If you wanted the config file to set the global variable by that name, instead of assigning to the hash, use this:

    no strict "refs";
    $$var = $value;

and the $RATE variable would contain 115200.

The second solution uses do to pull in raw Perl code directly.
When used with an expression instead of a block, do interprets the expression as a filename. This is nearly identical to using require, but without risk of taking a fatal exception. In the second format, the config file would look like:

    # set class C net
    $NETMASK = "255.255.255.0";
    $MTU     = 0x128;

    # Brent, please turn on the modem
    $DEVICE  = "cua1";
    $RATE    = 115_200;
    $MODE    = "adaptive";

If you don't see the point of having extra punctuation and live code, consider this: you can have all of Perl at your disposal. You can now add arbitrary logic and tests to your simple assignments:

    if ($DEVICE =~ /1$/) {
        $RATE = 28_800;
    } else {
        $RATE = 115_200;
    }

Many programs support system and personal configuration files. If you want the user's choices to override the system ones, load the user file second:

    $APPDFLT = "/usr/local/share/myprog";

    do "$APPDFLT/sysconfig.pl";
    do "$ENV{HOME}/.myprogrc";

If you want to ignore the system config file when the user has his own, test the return value of the do.

    do "$APPDFLT/sysconfig.pl"
        or
    do "$ENV{HOME}/.myprogrc";

You might wonder what package those files are compiled in. They will be in the same package that do itself was compiled into. Typically you'll direct users to set particular variables, which, being unqualified globals, will end up in the current package. If you'd prefer unqualified variables go into a particular package, do this:

    { package Settings; do "$ENV{HOME}/.myprogrc" }

As with a file read using require or use, those read using do count as a separate and unrelated lexical scope. That means the configuration file can't access its caller's lexical (my) variables, nor can the caller find any such variables that might have been set in the file. It also means that the user's code isn't held accountable to a lexically scoped pragma like use strict or use warnings, which may be in effect in the caller.

If you don't want clean partitioning of variable visibility, you can get the config file's code executed in your own lexical scope. If you have a cat program or its technical equivalent handy, you could write yourself a hand-rolled do:

    eval `cat $ENV{HOME}/.myprogrc`;

We've never actually seen anyone (except Larry Wall himself) use that approach in production code. For one thing, do is a lot easier to type. Also, it respects the @INC path, which is normally searched if a full path is not specified. Unlike require, though, no implicit error checking happens under do. This means you don't have to wrap it in an eval to catch exceptions that would otherwise cause your program to die, because do already functions as an eval. You can still check for errors on your own if you'd like:

    $file = "someprog.pl";
    unless ($return = do $file) {
        warn "couldn't parse $file: $@"     if $@;
        warn "couldn't do $file: $!"        unless defined $return;
        warn "couldn't run $file"           unless $return;
    }

This is much simpler for the programmer to source in code than it would be to invent and then parse a complicated, new syntax. It's also much easier on the users than forcing them to learn the syntax rules of yet another configuration file. Even better, you give the user access to a powerful algorithmic programming language.

One reasonable concern is security. How do you know that the file hasn't been tampered with by someone other than the user? The traditional approach here is to do nothing, trusting the directory and file permissions. Nine times out of ten, this is also the right approach. Most projects just aren't worth being that paranoid over. For those that are, see the next recipe.
8.16.4 See Also

The eval and require functions in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 8.17

Recipe 8.17 Testing a File for Trustworthiness

8.17.1 Problem

You want to read from a file, perhaps because it has configuration information. You want to use the file only if it can't be written to (or perhaps not even be read from) by anyone other than its owner.

8.17.2 Solution

Use the stat function to retrieve ownership and file permissions information. You can use the built-in version, which returns a list:

    ($dev, $ino, $mode, $nlink,
     $uid, $gid, $rdev, $size,
     $atime, $mtime, $ctime,
     $blksize, $blocks) = stat($filename)
            or die "no $filename: $!";

    $mode &= 07777;             # discard file type info

Or you can use the by-name interface:

    use File::stat;

    $info = stat($filename)     or die "no $filename: $!";
    if ($info->uid == 0) {
        print "Superuser owns $filename\n";
    }
    if ($info->atime > $info->mtime) {
        print "$filename has been read since it was written.\n";
    }

8.17.3 Discussion

Usually you trust users to set file permissions as they wish. If they want others to read their files, or even to write to them, that's their business. Applications such as editors, mailers, and shells are often more discerning, though, refusing to evaluate code in configuration files if anyone but the owner can write to them. This helps avoid Trojan horse attacks. Security-minded programs such as ftp and ssh may even reject config files that can be read by anyone but their owner.

If the file is writable by someone other than the owner or is owned by someone other than the current user or the superuser, it shouldn't be trusted. To figure out file ownership and permissions, the stat function is used. The following function returns true if the file is deemed safe and false otherwise. If the stat fails, undef is returned.

    use File::stat;

    sub is_safe {
        my $path = shift;
        my $info = stat($path);
        return unless $info;

        # owner neither superuser nor me
        # the real uid is stored in the $< variable
        if (($info->uid != 0) && ($info->uid != $<)) {
            return 0;
        }

        # check whether group or other can write file.
        # use 066 to detect either reading or writing
        if ($info->mode & 022) {        # someone else can write this
            return 0 unless -d _;       # non-directories aren't safe
            # but directories with the sticky bit (01000) are
            return 0 unless $info->mode & 01000;
        }
        return 1;
    }

A directory is considered safe even if others can write to it, provided its mode 01000 (owner delete only) bit is set. Careful programmers also ensure that no enclosing directory is writable. This is due to systems with the "chown giveaway" problem, in which any user can give away a file they own to make it owned by someone else. The following function handles that by using the is_safe function to check every enclosing directory up to the root if it detects that you have the chown problem, for which it queries POSIX::sysconf. If you don't have an unrestricted version of chown, the is_verysafe subroutine just calls is_safe. If you do have the problem, it walks up the filesystem tree until it reaches the root.

    use Cwd;
    use POSIX qw(sysconf _PC_CHOWN_RESTRICTED);

    sub is_verysafe {
        my $path = shift;
        return is_safe($path) if sysconf(_PC_CHOWN_RESTRICTED);
        $path = getcwd() . "/" . $path if $path !~ m{^/};
        do {
            return unless is_safe($path);
            $path =~ s#([^/]+|/)$##;                # dirname
            $path =~ s#/$## if length($path) > 1;   # last slash
        } while length $path;

        return 1;
    }

To use this in a program, try something like this:

    $file = "$ENV{HOME}/.myprogrc";
    readconfig($file) if is_safe($file);

This has potential for a race condition, because it's presumed that the hypothetical readconfig function will open the file. Between the time when is_safe checks the file's stats and when readconfig opens it, something wicked could theoretically occur. To avoid this, pass is_safe the already open filehandle, which is set up to handle this:

    $file = "$ENV{HOME}/.myprogrc";
    if (open(FILE, "<", $file)) {
        readconfig(*FILE) if is_safe(*FILE);
    }

You would still have to arrange for readconfig to accept a filehandle instead of a filename, though.

8.17.4 See Also

The stat function in perlfunc(1) and in Chapter 29 of Programming Perl; documentation for the standard POSIX and File::stat modules; Recipe 8.16

Recipe 8.18 Treating a File as an Array

8.18.1 Problem

Your file contains a list of lines or records, and you'd like to be able to use Perl's powerful array operations to access and manipulate the file.

8.18.2 Solution

Use the Tie::File module, standard with v5.8 of Perl:

    use Tie::File;
    use Fcntl;

    tie @data, Tie::File, $FILENAME
        or die "Can't tie to $FILENAME: $!\n";
    # use array operations on @data to work with the file

8.18.3 Discussion

The Tie::File module makes a file appear to be an array, one record per element. You can then fetch and assign to elements of the array, use array functions like push and splice, use negative indices, or reverse it, and in every instance you're really working with the data on disk. If you don't specify how Tie::File should open the file, it is opened for read and write access and created if it doesn't exist. To specify a particular access mode (see Recipe 7.1), pass the Fcntl mode with the mode parameter when you tie. For example:

    use Fcntl;

    tie(@data, Tie::File, $filename, mode => O_RDONLY)
        or die "Can't open $filename for reading: $!\n";

When you alter the array, the file is rewritten on disk. For example, if you change the length of an element, all records later in the file must be copied to make the change. Take this code:

    foreach (@data) {
        s/Perl Cookbook/Perl Cookbook (2nd edition)/g;
    }

That's slow, because you change the length of record 0, forcing a copy of records 1..N. Then you change the length of record 1, forcing a copy of records 2..N. It's better to defer the update until all changes have been made and then have Tie::File update the file in one single write. To do this, call a method on the object behind the tied array:

    (tied @data)->defer;        # defer updates
    foreach (@data) {
        s/Perl Cookbook/Perl Cookbook (2nd edition)/g;
    }
    (tied @data)->flush;

Exactly how much rewriting to defer is governed by how much memory you let Tie::File use, because the only way to keep track of changes without updating the file is to store those changes in memory. The Tie::File manpage shows how to change options for memory use.

8.18.4 See Also

Recipe 8.4; Recipe 8.8; Recipe 8.10

Recipe 8.19 Setting the Default I/O Layers

8.19.1 Problem

You want to ensure all files opened by your program use a particular set of I/O layers. For example, you know that every file will contain UTF-8 data.
8.18.4 See Also
Recipe 8.4; Recipe 8.8; Recipe 8.10

Recipe 8.19 Setting the Default I/O Layers

8.19.1 Problem
You want to ensure all files opened by your program use a particular set of I/O layers. For example, you know that every file will contain UTF-8 data.

8.19.2 Solution
Use the open pragma:

    use open IO => ":raw:utf8";

8.19.3 Discussion
You can easily specify I/O layers when you open a filehandle directly, but that doesn't help you when the filehandle is opened by someone else's code (possibly even the Perl core). The open pragma lets you specify a default set of layers for every open that doesn't specify its own layers.

The open module also offers separate IN and OUT control for input and output handles. For example, to read bytes and emit UTF-8:

    use open "IN" => ":bytes", "OUT" => ":utf8";

The :std option tells open to apply the input and output layers to STDIN and STDOUT/STDERR. For example, the following code makes input handles read Greek (ISO 8859-7) and output handles write in the UTF-8 Unicode encoding, then applies those same layers to STDIN, STDOUT, and STDERR:

    use open "IN"  => ":encoding(Greek)",   # reading Greek
             "OUT" => ":utf8",              # writing UTF-8
             ":std";                        # apply to STDIN, STDOUT, STDERR

8.19.4 See Also
The documentation for the standard open pragma; Recipe 8.12 and Recipe 8.20

Recipe 8.20 Reading or Writing Unicode from a Filehandle

8.20.1 Problem
You have a file containing text in a particular encoding, and when you read data from that file into a Perl string, Perl treats it as a series of 8-bit bytes. You'd like to work with characters instead of bytes, because characters in your encoding can take more than one byte. Also, if Perl doesn't know about your encoding, it may fail to identify certain characters as letters. Similarly, you may want to output text in a particular encoding.

8.20.2 Solution
Use I/O layers to tell Perl that data from that filehandle is in a particular encoding:

    open(my $ifh, "<:encoding(ENCODING_NAME)", $filename);
    open(my $ofh, ">:encoding(ENCODING_NAME)", $filename);

8.20.3 Discussion
Perl's text manipulation functions handle UTF-8 strings just as well as they do 8-bit data—they just need to know what type of data they're working with. Each string in Perl is internally marked as either UTF-8 or 8-bit data. The encoding(...) layer converts data between various external encodings and the internal UTF-8 within Perl. This is done by way of the Encode module.

In the section on Unicode Support in Perl back in the Introduction to Chapter 1, we explained how under Unicode, every different character has a different code point (i.e., a different number) associated with it. Assigning all characters unique code points solves many problems. No longer does the same number, like 0xC4, represent one character under one character repertoire (e.g., a LATIN CAPITAL LETTER A WITH DIAERESIS under ISO-8859-1) and a different character in another repertoire (e.g., a GREEK CAPITAL LETTER DELTA under ISO-8859-7).

This neatly solves many problems, but still leaves one important issue: the precise format used in memory or on disk for each code point. If most code points fit in 8 bits, it would seem wasteful to use, say, a full 32 bits for each character. But if every character is the same size as every other character, the code is easier to write and may be faster to execute. This has given rise to different encoding systems for storing Unicode, each offering distinct advantages. Fixed-width encodings fit every code point into the same number of bits, which simplifies programming but wastes some space. Variable-width encodings use only as much space as each code point requires, which saves space but complicates programming.
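To see the trade-off concretely, here is a small sketch using the Encode module mentioned earlier: it encodes the same two-character string several ways and counts the bytes each form requires.

    use Encode qw(encode);

    my $str = "A\x{C4}";    # LATIN CAPITAL LETTER A, then A WITH DIAERESIS
    for my $enc (qw(UTF-8 UTF-16BE UTF-32BE)) {
        # length() on the encoded byte string counts bytes, not characters
        printf "%-9s %d bytes\n", $enc, length(encode($enc, $str));
    }
    # variable-width UTF-8 needs 3 bytes here, while fixed-width
    # UTF-32BE always spends 4 bytes per character, 8 in total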
One further complication is combined characters, which may look like single letters on paper but in code require multiple code points. When you see a capital A with two dots above it (a diaeresis) on your screen, it may not even be character U+00C4. As explained in Recipe 1.8, Unicode supports the idea of combining characters, where you start with a base character and add non-spacing marks to it. U+0308 is a COMBINING DIAERESIS, so you could use a capital A (U+0041) followed by U+0308, written A\x{308}, to produce the same output.

The following table shows the old ISO 8859-1 way of writing a capital A with a diaeresis, in which the logical character code and the physical byte layout enjoyed an identical representation, and the new way under Unicode. We include both ways of writing that character: one precomposed in one code point, and the other using two code points to create a combined character.

                          Old way      New way
    Character(s)          Ä            A          Ä           Ä
    Character code(s)     0xC4         U+0041     U+00C4      U+0041 U+0308
    Character repertoire  ISO 8859-1   Unicode    Unicode     Unicode
    Encoding              —            UTF-8      UTF-8       UTF-8
    Byte(s)               0xC4         0x41       0xC3 0x84   0x41 0xCC 0x88

The internal format used by Perl is UTF-8, a variable-width encoding system. One reason for this choice is that legacy ASCII requires no conversion for UTF-8, looking in memory exactly as it did before—just one byte per character. Character U+0041 is just 0x41 in memory. Legacy data sets don't increase in size, and even those using Western character sets like ISO 8859-n grow only slightly, since in practice you still have a favorable ratio of regular ASCII characters to 8-bit accented characters.

Just because Perl uses UTF-8 internally doesn't preclude using other formats externally. Perl automatically converts all data between UTF-8 and whatever encoding you've specified for that handle. The Encode module is used implicitly when you specify an I/O layer of the form ":encoding(....)". For example:

    binmode(FH, ":encoding(UTF-16BE)")
        or die "can't binmode to utf-16be: $!";

or directly in the open:

    open(FH, "< :encoding(UTF-32)", $pathname)
        or die "can't open $pathname: $!";

Here's a comparison of actual byte layouts of those two sequences, both representing a capital A with diaeresis, under several other popular formats:

                U+00C4                    U+0041 U+0308
    UTF-8       c3 84                     41 cc 88
    UTF-16BE    00 c4                     00 41 03 08
    UTF-16LE    c4 00                     41 00 08 03
    UTF-16      fe ff 00 c4               fe ff 00 41 03 08
    UTF-32LE    c4 00 00 00               41 00 00 00 08 03 00 00
    UTF-32BE    00 00 00 c4               00 00 00 41 00 00 03 08
    UTF-32      00 00 fe ff 00 00 00 c4   00 00 fe ff 00 00 00 41 00 00 03 08

This can chew up memory quickly. It's also complicated by the fact that some computers are big-endian, others little-endian. So fixed-width encoding formats that don't specify their endianness require a special byte-ordering sequence ("FE FF" versus "FF FE"), usually needed only at the start of the stream.

If you're reading or writing UTF-8 data, use the :utf8 layer. Because Perl natively uses UTF-8, the :utf8 layer bypasses the Encode module for performance.

The Encode module understands many aliases for encodings, so ascii, US-ascii, and ISO-646-US are synonymous. Read the Encode::Supported manpage for a list of available encodings. Perl supports not only standard Unicode names but vendor-specific names, too; for example, iso-8859-1 is cp850 on DOS, cp1252 on Windows, MacRoman on a Mac, and hp-roman8 on NeXTstep. The Encode module recognizes all of these as names for the same encoding.
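If you're unsure what an alias maps to, the Encode module's resolve_alias function reports the canonical name. A quick sketch:

    use Encode;

    for my $alias (qw(ascii US-ascii ISO-646-US latin1)) {
        # resolve_alias returns the canonical encoding name,
        # or false if the alias is unknown
        printf "%-12s => %s\n", $alias,
            Encode::resolve_alias($alias) || "(unknown)";
    }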
8.20.4 See Also
The documentation for the standard Encode module; the Encode::Supported manpage; Recipe 8.12 and Recipe 8.19

Recipe 8.21 Converting Microsoft Text Files into Unicode

8.21.1 Problem
You have a text file written on a Microsoft computer that looks like garbage when displayed. How do you fix this?

8.21.2 Solution
Set the encoding layer appropriately when reading to convert this into Unicode:

    binmode(IFH, ":encoding(cp1252)")
        || die "can't binmode to cp1252 encoding: $!";

8.21.3 Discussion
Suppose someone sends you a file in cp1252 format, Microsoft's default in-house 8-bit character set. Files in this format can be annoying to read—while they might claim to be Latin1, they are not, and if you look at them with Latin1 fonts loaded, you'll get garbage on your screen.

A simple solution is as follows:

    open(MSMESS, "< :crlf :encoding(cp1252)", $inputfile)
        || die "can't open $inputfile: $!";

Now data read from that handle will be automatically converted into Unicode when you read it in. It will also be processed in CRLF mode, which is needed on systems that don't use that sequence to indicate end of line.

You probably won't be able to write out this text as Latin1, because cp1252 includes characters that don't exist in Latin1. You'll have to leave it in Unicode, and displaying Unicode properly may not be as easy as you'd wish, because finding tools to work with Unicode is something of a quest in its own right. Most web browsers support ISO 10646 fonts; that is, Unicode fonts (see http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html). Whether your text editor does is a different matter, although both emacs and vi (actually, vim, not nvi) have mechanisms for handling Unicode. The authors used the following xterm(1) command to look at text:

    xterm -n unicode -u8 -fn \
        -misc-fixed-medium-r-normal--20-200-75-75-c-100-iso10646-1

But many open questions still exist, such as cutting and pasting of Unicode data between windows. The www.unicode.org site has help for finding and installing suitable tools for a variety of platforms, including both Unix and Microsoft systems.

You'll also need to tell Perl it's alright to emit Unicode. If you don't, you'll get a warning about a "Wide character in print" every time you try.
Assuming you're running in an xterm like the one shown previously (or its equivalent for your system) that has Unicode fonts available, you could just do this:

    binmode(STDOUT, ":utf8");

But that requires the rest of your program to emit Unicode, which might not be convenient. When writing new programs specifically designed for this, though, it might not be too much trouble.

As of v5.8.1, Perl offers a couple of other means of getting this effect. The -C command-line switch controls some Unicode features related to your runtime environment. This way you can set those features on a per-command basis without having to edit the source code. The -C switch can be followed by either a number or a list of option letters. Some available letters, their numeric values, and effects are as follows:

    Letter  Number  Meaning
    I          1    STDIN is assumed to be in UTF-8
    O          2    STDOUT will be in UTF-8
    E          4    STDERR will be in UTF-8
    S          7    I + O + E
    i          8    UTF-8 is the default PerlIO layer for input streams
    o         16    UTF-8 is the default PerlIO layer for output streams
    D         24    i + o
    A         32    the @ARGV elements are expected to be strings encoded in UTF-8

You may use letters or numbers. If you use numbers, you have to add them up. For example, -COE and -C6 are synonyms of UTF-8 on both STDOUT and STDERR.

One last approach is to use the PERL_UNICODE environment variable. If set, it contains the same value as you would use with -C. For example, with the xterm that has Unicode fonts loaded, you could do this in a POSIX shell:

    sh% export PERL_UNICODE=6

or this in the csh:

    csh% setenv PERL_UNICODE 6

The advantage of using the environment variable is that you don't have to edit the source code as the pragma would require, and you don't even need to change the command invocation as setting -C would require.

8.21.4 See Also
The perlrun(1), encoding(3), PerlIO(3), and Encode(3) manpages

Recipe 8.22 Comparing the Contents of Two Files

8.22.1 Problem
You have two files and want to see whether they're the same or different.

8.22.2 Solution
Use the standard File::Compare module with filenames, typeglobs, or any indirect filehandles:

    use File::Compare;

    if (compare($FILENAME_1, $FILENAME_2) == 0) {
        # they're equal
    }

    if (compare(*FH1, *FH2) == 0) {
        # they're equal
    }

    if (compare($fh1, $fh2) == 0) {
        # they're equal
    }

8.22.3 Discussion
The File::Compare module (standard as of v5.8 and available on CPAN if you have an earlier version of Perl) compares two files for equality. The compare function, exported by default, returns 0 when the files are equal, 1 when they differ, and -1 when any error occurs during reading.

To compare more than two filehandles, simply loop, comparing two at a time:

    # ensure all filehandles in @fh hold the same data
    foreach $fh (@fh[1..$#fh]) {
        if (compare($fh[0], $fh)) {
            # $fh differs
        }
    }

If you want details of exactly how two files differ, use the Text::Diff module from CPAN:

    use Text::Diff;
    $diff = diff(*FH1, *FH2);
    $diff = diff($FILENAME_1, $FILENAME_2, { STYLE => "Context" });

In addition to filehandles, diff can also take filenames, strings, and even arrays of records. Pass a hash of options as the third argument. The STYLE option controls the type of output returned; it can be "Unified" (the default), "Context", or "OldStyle". You can even write your own class for custom diff formats.

The value returned by diff is a string similar to the output of the diff(1) program. This string is in valid diff format, suitable for feeding into patch(1).
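Here is a minimal sketch of diffing two arrays of records, as mentioned above; the record sets are made up for illustration:

    use Text::Diff;

    # hypothetical record sets; each element is one line, newline included
    my @old = map { "$_\n" } qw(alpha beta gamma);
    my @new = map { "$_\n" } qw(alpha BETA gamma delta);

    print diff(\@old, \@new, { STYLE => "Unified" });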
Although Text::Diff will not always produce the same output as GNU diff, byte for byte, its diffs are nevertheless correct.

8.22.4 See Also
The documentation for the standard File::Compare module; the documentation for the CPAN module Text::Diff; the diff(1) and patch(1) manpages

Recipe 8.23 Pretending a String Is a File

8.23.1 Problem
You have data in a string, but would like to treat it as a file. For example, you have a subroutine that expects a filehandle as an argument, but you would like that subroutine to work directly on the data in your string instead. Additionally, you don't want to write the data to a temporary file.

8.23.2 Solution
Use the scalar I/O in Perl v5.8:

    open($fh, "+<", \$string);      # read and write contents of $string

8.23.3 Discussion
Perl's I/O layers include support for input and output from a scalar. When you read a record with <$fh>, you are reading the next line from $string. When you write a record with print, you change $string. You can pass $fh to a function that expects a filehandle, and that subroutine need never know that it's really working with data in a string.

Perl respects the various access modes in open for strings, so you can specify that the strings be opened as read-only, with truncation, in append mode, and so on:

    open($fh, "<",  \$string);      # read only
    open($fh, ">",  \$string);      # write only, discard original contents
    open($fh, "+>", \$string);      # read and write, discard original contents
    open($fh, "+<", \$string);      # read and write, preserve original contents

These handles behave in all respects like regular filehandles, so all I/O functions work, such as seek, truncate, sysread, and friends.
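For example, here is a small sketch; count_lines is a made-up subroutine that reads whatever filehandle it's given, and it works identically on an in-memory handle:

    sub count_lines {               # works on any filehandle
        my $fh = shift;
        my $count = 0;
        $count++ while <$fh>;
        return $count;
    }

    my $string = "one\ntwo\nthree\n";
    open(my $fh, "<", \$string) or die "can't open in-memory file: $!";
    print count_lines($fh), "\n";   # prints 3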
8.23.4 See Also
The open function in perlfunc(1) and in Chapter 29 of Programming Perl; Recipe 8.12 and Recipe 8.19

Recipe 8.24 Program: tailwtmp
Every time a user logs into or out of a Unix system, a record is added to the wtmp file. You can't use the normal tail program on it, because the file is in binary format. The tailwtmp program in Example 8-7 knows the format of the binary file and shows every new record as it appears. You'll have to adjust the pack format for your own system.

Example 8-7. tailwtmp

    #!/usr/bin/perl -w
    # tailwtmp - watch for logins and logouts;
    # uses linux utmp structure, from utmp(5)
    $typedef = "s x2 i A12 A4 l A8 A16 l";
    $sizeof = length pack($typedef, ( ));
    use IO::File;
    open(WTMP, "< :raw", "/var/log/wtmp") or die "can't open /var/log/wtmp: $!";
    seek(WTMP, 0, SEEK_END);
    for (;;) {
        while (read(WTMP, $buffer, $sizeof) == $sizeof) {
            ($type, $pid, $line, $id, $time, $user, $host, $addr)
                = unpack($typedef, $buffer);
            next unless $user && ord($user) && $time;
            printf "%1d %-8s %-12s %2s %-24s %-16s %5d %08x\n",
                $type, $user, $line, $id, scalar(localtime($time)),
                $host, $pid, $addr;
        }
        for ($size = -s WTMP; $size == -s WTMP; sleep 1) { }
        WTMP->clearerr( );
    }

Recipe 8.25 Program: tctee
Not all systems support the classic tee program for splitting output pipes to multiple destinations. This command sends the output from someprog to /tmp/output and to the mail pipe beyond:

    % someprog | tee /tmp/output | Mail -s "check this" user@host.org

This program helps not only users who aren't on Unix systems and don't have a regular tee; it also helps those who are, because it offers features not found on other versions of tee.

The four flag arguments are -i to ignore interrupts, -a to append to output files, -u for unbuffered output, and -n to omit copying the output on to standard out. Because this program uses Perl's magic open, you can specify pipes as well as files.

    % someprog | tctee f1 "|cat -n" f2 ">>f3"

That sends the output from someprog to the files f1 and f2, appends it to f3, sends a copy to the program cat -n, and also produces the stream on standard output. The program in Example 8-8 is one of many venerable Perl programs written nearly a decade ago that still runs perfectly well. If written from scratch now, we'd probably use strict, warnings, and ten to thirty thousand lines of modules. But if it ain't broke . . .

Example 8-8. tctee

    #!/usr/bin/perl
    # tctee - clone that groks process tees
    # perl3 compatible, or better.
    while ($ARGV[0] =~ /^-(.+)/ && (shift, ($_ = $1), 1)) {
        next if /^$/;
        s/i// && (++$ignore_ints, redo);
        s/a// && (++$append,      redo);
        s/u// && (++$unbuffer,    redo);
        s/n// && (++$nostdout,    redo);
        die "usage $0 [-aiun] [filenames] ...\n";
    }

    if ($ignore_ints) {
        for $sig ("INT", "TERM", "HUP", "QUIT") { $SIG{$sig} = "IGNORE"; }
    }

    $SIG{"PIPE"} = "PLUMBER";
    $mode = $append ? ">>" : ">";
    $fh = "FH000";

    unless ($nostdout) {
        %fh = ("STDOUT", "standard output");   # always go to stdout
    }

    $| = 1 if $unbuffer;

    for (@ARGV) {
        if (!open($fh, (/^[^>|]/ && $mode) . $_)) {
            warn "$0: cannot open $_: $!\n";   # like sun's; i prefer die
            $status++;
            next;
        }
        select((select($fh), $| = 1)[0]) if $unbuffer;
        $fh{$fh++} = $_;
    }

    while (<STDIN>) {
        for $fh (keys %fh) {
            print $fh $_;
        }
    }

    for $fh (keys %fh) {
        next if close($fh) || !defined $fh{$fh};
        warn "$0: couldnt close $fh{$fh}: $!\n";
        $status++;
    }

    exit $status;

    sub PLUMBER {
        warn "$0: pipe to \"$fh{$fh}\" broke!\n";
        $status++;
        delete $fh{$fh};
    }

Recipe 8.26 Program: laston
When you log in to a Unix system, it tells you when you last logged in. That information is stored in a binary file called lastlog. Each user has their own record; UID 8 is at record 8, UID 239 at record 239, and so on. To find out when a given user last logged in, convert their login name to a number, seek to their record in that file, read, and unpack. Doing so with shell tools is hard, but with the laston program, it's easy. Here's an example:

    % laston gnat
    gnat  UID 314 at Mon May 25 08:32:52 2003 on ttyp0 from below.perl.com

The program in Example 8-9 is much newer than the tctee program in Example 8-8, but it's less portable. It uses the Linux binary layout of the lastlog file. You'll have to change this for other systems.

Example 8-9. laston

    #!/usr/bin/perl -w
    # laston - find out when given user last logged on
    use User::pwent;
    use IO::Seekable qw(SEEK_SET);
    open (LASTLOG, "< :raw", "/var/log/lastlog")
        or die "can't open /var/log/lastlog: $!";
    $typedef = "L A12 A16";   # linux fmt; sunos is "L A8 A16"
    $sizeof = length(pack($typedef, ( )));
    for $user (@ARGV) {
        $U = ($user =~ /^\d+$/) ? getpwuid($user) : getpwnam($user);
        unless ($U) { warn "no such uid $user\n"; next; }
        seek(LASTLOG, $U->uid * $sizeof, SEEK_SET) or die "seek failed: $!";
        read(LASTLOG, $buffer, $sizeof) == $sizeof or next;
        ($time, $line, $host) = unpack($typedef, $buffer);
        printf "%-8s UID %5d %s%s%s\n", $U->name, $U->uid,
            $time ? ("at " . localtime($time)) : "never logged in",
            $line && " on $line",
            $host && " from $host";
    }
localtime($time)) : "never logged in", $line && " on $line", $host && " from $host"; } [ Team LiB ] [ Team LiB ] Recipe 8.27 Program: Flat File Indexes It sometimes happens that you need to jump directly to a particular line number in a file, but the lines vary in length, so you can't use Recipe 8.12. Although you could start at the beginning of the file and read every line, this is inefficient if you're making multiple queries. The solution is to build an index of fixed-width records, one per line. Each record contains the offset in the data file of the corresponding line. The subroutine in Example 8-10 takes the data file and a filehandle to send the index to. It reads a record at a time and prints the current offset in the file to the index, packed into a big-ending unsigned 32-bit integer; see the documentation for the pack function in perlfunc(1) for alternative storage types. Example 8-10. build_index # usage: build_index(*DATA_HANDLE, *INDEX_HANDLE) sub build_index { my $data_file = shift; my $index_file = shift; my $offset = 0; while (<$data_file>) { print $index_file pack("N", $offset); $offset = tell($data_file); } } Once you have an index, it becomes easy to read a particular line from the data file. Jump to that record in the index, read the offset, and jump to that position in the data file. The next line you read will be the one you want. Example 8-11 returns the line, given the line number and the index and data file handles. Example 8-11. line_with_index # usage: line_with_index(*DATA_HANDLE, *INDEX_HANDLE, $LINE_NUMBER) # returns line or undef if LINE_NUMBER was out of range sub line_with_index { my $data_file = shift; my $index_file = shift; my $line_number = shift; my $size; # size of an index entry my $i_offset; # offset into the index of the entry my $entry; # index entry my $d_offset; # offset into the data file $size = length(pack("N", 0)); $i_offset = $size * ($line_number-1); seek($index_file, $i_offset, 0) or return; read($index_file, $entry, $size); $d_offset = unpack("N", $entry); seek($data_file, $d_offset, 0); return scalar(<$data_file>); } To use these subroutines, just say: open($fh, "<", $file) or die "Can't open $file for reading: $!\n"; open($index, "+>", $file.idx) or die "Can't open $file.idx for read/write: $!\n"; build_index($fh, $index); $line = line_with_index($file, $index, $seeking); The next step is to cache the index file between runs of the program, so you're not building it each time. This is shown in Example Recipe 8.12. Then add locking for concurrent access, and check time stamps on the files to see whether a change to the data file has made an old index file out of date. Example 8-12. cache_line_index #!/usr/bin/perl -w # cache_line_index - index style # build_index and line_with_index from above @ARGV = = 2 or die "usage: print_line FILENAME LINE_NUMBER"; ($filename, $line_number) = @ARGV; open(my $orig, "<", $filename) or die "Can't open $filename for reading: $!"; # open the index and build it if necessary # there's a race condition here: two copies of this # program can notice there's no index for the file and # try to build one. This would be easily solved with # locking $indexname = "$filename.index"; sysopen(my $idx, $indexname, O_CREAT|O_RDWR) or die "Can't open $indexname for read/write: $!"; build_index($orig, $idx) if -z $indexname; # XXX: race unless lock $line = line_with_index($orig, $idx, $line_number); die "Didn't find line $line_number in $filename" unless defined $line; print $line; [ Team LiB ] [ Team LiB ] Chapter 9. 
Chapter 9. Directories

Unix has its weak points, but its file system is not one of them.
—Chris Torek

Introduction
To fully understand directories, you need to be acquainted with the underlying mechanics. The following explanation is slanted toward the Unix filesystem, for whose syscalls and behavior Perl's directory access routines were designed, but it is applicable to some degree to most other platforms.

A filesystem consists of two parts: a set of data blocks where the contents of files and directories are kept, and an index to those blocks. Each entity in the filesystem has an entry in the index, be it a plain file, a directory, a link, or a special file like those in /dev. Each entry in the index is called an inode (short for index node). Since the index is a flat index, inodes are addressed by number.

A directory is a specially formatted file, whose inode entry marks it as a directory. A directory's data blocks contain a set of pairs. Each pair consists of the name of something in that directory and the inode number of that thing. The data blocks for /usr/bin might contain:

    Name   Inode
    bc        17
    du        29
    nvi        8
    pine      55
    vi         8

Every directory is like this, even the root directory (/). To read the file /usr/bin/vi, the operating system reads the inode for /, reads its data blocks to find the entry for /usr, reads /usr's inode, reads its data block to find /usr/bin, reads /usr/bin's inode, reads its data block to find /usr/bin/vi, reads /usr/bin/vi's inode, and then reads the data from its data block.

The name in a directory entry isn't fully qualified. The file /usr/bin/vi has an entry with the name vi in the /usr/bin directory. If you open the directory /usr/bin and read entries one by one, you get filenames like patch, rlogin, and vi instead of fully qualified names like /usr/bin/patch, /usr/bin/rlogin, and /usr/bin/vi.

The inode has more than a pointer to the data blocks. Each inode also contains the type of thing it represents (directory, plain file, etc.), the size of the thing, a set of permissions bits, owner and group information, the time the thing was last modified, the number of directory entries that point to this inode, and so on.

Some operations on files change the contents of the file's data blocks; others change just the inode. For instance, appending to or truncating a file updates its inode by changing the size field. Other operations change the directory entry that points to the file's inode. Changing a file's name changes only the directory entry; it updates neither the file's data nor its inode.

Three fields in the inode structure contain the last access, change, and modification times: atime, ctime, and mtime. The atime field is updated each time the pointer to the file's data blocks is followed and the file's data is read. The mtime field is updated each time the file's data changes. The ctime field is updated each time the file's inode changes. The ctime is not creation time; there is no way under standard Unix to find a file's creation time.

Reading a file changes its atime only. Changing a file's name doesn't change atime, ctime, or mtime, because only the directory entry changed (it does change the atime and mtime of the directory the file is in, though). Truncating a file doesn't change its atime (because we haven't read; we've just changed the size field in its inode), but it does change its ctime because we changed its size field and its mtime because we changed its contents (even though we didn't follow the pointer to do so).
We can access the inode of a file or directory by calling the built-in function stat on its name. For instance, to get the inode for /usr/bin/vi, say:

    @entry = stat("/usr/bin/vi") or die "Couldn't stat /usr/bin/vi : $!";

To get the inode for the directory /usr/bin, say:

    @entry = stat("/usr/bin") or die "Couldn't stat /usr/bin : $!";

You can stat filehandles, too:

    @entry = stat(INFILE) or die "Couldn't stat INFILE : $!";

The stat function returns a list of the values of the fields in the directory entry. If it couldn't get this information (for instance, if the file doesn't exist), it returns an empty list. It's this empty list we test for using the or die construct. Be careful of using || die because that throws the expression into scalar context, in which case stat only reports whether it worked. It doesn't return the list of values. The underscore ( _ ) cache referred to later will still be updated, though.

The values returned by stat are listed in Table 9-1.

Table 9-1. Stat return values

    Element  Abbreviation  Description
       0     dev           Device number of filesystem
       1     ino           Inode number (the "pointer" field)
       2     mode          File mode (type and permissions)
       3     nlink         Number of (hard) links to the file
       4     uid           Numeric user ID of file's owner
       5     gid           Numeric group ID of file's owner
       6     rdev          The device identifier (special files only)
       7     size          Total size of file, in bytes
       8     atime         Last access time, in seconds, since the Epoch
       9     mtime         Last modify time, in seconds, since the Epoch
      10     ctime         Inode change time, in seconds, since the Epoch
      11     blksize       Preferred block size for filesystem I/O
      12     blocks        Actual number of blocks allocated

The standard File::stat module provides a named interface to these values. It overrides the stat function, so instead of returning the preceding array, it returns an object with a method for each attribute:

    use File::stat;
    $inode = stat("/usr/bin/vi");
    $ctime = $inode->ctime;
    $size  = $inode->size;

In addition, Perl provides operators that call stat and return one value only (see Table 9-2). These are collectively referred to as the -X operators because they all take the form of a dash followed by a single character. They're modeled on the shell's test operators.

Table 9-2. File test operators

    -X  Stat field  Meaning
    -r  mode        File is readable by effective UID/GID
    -w  mode        File is writable by effective UID/GID
    -x  mode        File is executable by effective UID/GID
    -o  mode        File is owned by effective UID
    -R  mode        File is readable by real UID/GID
    -W  mode        File is writable by real UID/GID
    -X  mode        File is executable by real UID/GID
    -O  mode        File is owned by real UID
    -e              File exists
    -z  size        File has zero size
    -s  size        File has nonzero size (returns size)
    -f  mode,rdev   File is a plain file
    -d  mode,rdev   File is a directory
    -l  mode        File is a symbolic link
    -p  mode        File is a named pipe (FIFO)
    -S  mode        File is a socket
    -b  rdev        File is a block special file
    -c  rdev        File is a character special file
    -t  rdev        Filehandle is opened to a tty
    -u  mode        File has setuid bit set
    -g  mode        File has setgid bit set
    -k  mode        File has sticky bit set
    -T  N/A         File is a text file
    -B  N/A         File is a binary file (opposite of -T)
    -M  mtime       Age of file in days when script started
    -A  atime       Same for access time
    -C  ctime       Same for inode change time (not creation)

The stat and the -X operators cache the values that the stat(2) syscall returned. If you then call stat or a -X operator with the special filehandle _ (a single underscore), it won't call stat again but will instead return information from its cache. This lets you test many properties of a single file without calling stat(2) many times or introducing a race condition:

    open(F, "<", $filename) or die "Opening $filename: $!\n";
    unless (-s F && -T _) {
        die "$filename doesn't have text in it.\n";
    }

The stat call just returns the information in one inode, though. How do we list the directory contents? For that, Perl provides opendir, readdir, and closedir:

    opendir(DIRHANDLE, "/usr/bin") or die "couldn't open /usr/bin : $!";
    while ( defined ($filename = readdir(DIRHANDLE)) ) {
        print "Inside /usr/bin is something called $filename\n";
    }
    closedir(DIRHANDLE);

These directory-reading functions are designed to look like the file open and close functions. Where open takes a filehandle, though, opendir takes a directory handle. They may look the same to you (the same bare word), but they occupy different namespaces. Therefore, you could open(BIN, "/a/file") and opendir(BIN, "/a/dir"), and Perl won't get confused. You might, but Perl won't. Because filehandles and directory handles are different, you can't use the <> operator to read from a directory handle (<> calls readline on the filehandle).
Similar to what happens with open and the other functions that initialize filehandles, you can supply opendir an undefined scalar variable where the directory handle is expected. If the function succeeds, Perl initializes that variable with a reference to a new, anonymous directory handle.

    opendir(my $dh, "/usr/bin") or die;
    while (defined ($filename = readdir($dh))) {
        # ...
    }
    closedir($dh);

Just like any other autovivified reference, when this one is no longer used (for example, when it goes out of scope and no other references to it are held), Perl automatically deallocates it. And just as close is implicitly called on filehandles autovivified through open at that point, directory handles autovivified through opendir have closedir called on them, too.

Filenames in a directory aren't necessarily stored alphabetically. For an alphabetical list of files, read the entries and sort them yourself.

The separation of directory information from inode information can create some odd situations. Operations that update the directory—such as linking, unlinking, or renaming a file—all require write permission only on the directory, not on the file. This is because the name of a file is actually something the directory calls that file, not a property inherent to the file itself. Only directories hold names of files; files are ignorant of their own names. Only operations that change information in the file data itself demand write permission on the file. Lastly, operations that alter the file's permissions or other metadata are restricted to the file's owner or the superuser. This can lead to the interesting situation of being able to delete (i.e., unlink from its directory) a file you can't read, or write to a file you can't delete.

Although these situations may make the filesystem structure seem odd at first, they're actually the source of much of Unix's power. Links, two filenames that refer to the same file, are now extremely simple. The two directory entries just list the same inode number. The inode structure includes a count of the number of directory entries referring to the file (nlink in the values returned by stat). This lets the operating system store and maintain only one copy of the modification times, size, and other file attributes.
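You can watch that count change with the built-in link function. Here is a minimal sketch; the filenames are made up, and /tmp is assumed to be writable:

    use File::stat;

    open(my $fh, ">", "/tmp/original") or die "can't create: $!";
    close($fh);
    link("/tmp/original", "/tmp/alias") or die "can't link: $!";

    # both names now share one inode, so nlink is 2
    print stat("/tmp/original")->nlink, "\n";
    unlink("/tmp/alias", "/tmp/original");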
When one directory entry is unlinked, data blocks are deleted only if the directory entry was the last one that referred to the file's inode—and no processes still have the file open. You can unlink an open file, but its disk space won't be released until the last close.

Links come in two forms. The kind described previously, where two directory entries list the same inode number (like vi and nvi in the earlier table), are called hard links. The operating system cannot tell the first directory entry of a file (the one created when the file was created) from any subsequent hard links to it. The other kind, soft or symbolic links, are very different. A soft link is a special type of file whose data block stores the filename the file is linked to. Soft links have a different mode value, indicating they're not regular files. The operating system, when asked to open a soft link, instead opens the filename contained in the data block.

Executive Summary
Filenames are kept in a directory, separate from the size, protections, and other metadata kept in an inode. The stat function returns the inode information (metadata). opendir, readdir, and friends provide access to filenames in a directory through a directory handle. Directory handles look like filehandles, but they are not the same. In particular, you can't use <> on directory handles. Permissions on a directory determine whether you can read and write the list of filenames. Permissions on a file determine whether you can change the file's metadata or contents. Three different times are stored in an inode. None of them is the file's creation time.

Recipe 9.1 Getting and Setting Timestamps

9.1.1 Problem
You need to retrieve or alter when a file was last modified (written or changed) or accessed (read).

9.1.2 Solution
Use stat to get those times and utime to set them. Both functions are built into Perl:

    ($READTIME, $WRITETIME) = (stat($filename))[8,9];
    utime($NEWREADTIME, $NEWWRITETIME, $filename);

9.1.3 Discussion
As explained in the Introduction, three different times are associated with an inode in the traditional Unix filesystem. Of these, any user can set the atime and mtime with utime, assuming the user has write access to the parent directory of the file. There is effectively no way to change the ctime. This example shows how to call utime:

    $SECONDS_PER_DAY = 60 * 60 * 24;
    ($atime, $mtime) = (stat($file))[8,9];
    $atime -= 7 * $SECONDS_PER_DAY;
    $mtime -= 7 * $SECONDS_PER_DAY;
    utime($atime, $mtime, $file)
        or die "couldn't backdate $file by a week w/ utime: $!";

You must call utime with both atime and mtime values. If you want to change only one, you must call stat first to get the other:

    $mtime = (stat $file)[9];
    utime(time, $mtime, $file);

This is easier to understand if you use File::stat:

    use File::stat;
    utime(time, stat($file)->mtime, $file);

Use utime to make it appear as though you never touched a file at all (beyond its ctime being updated). For example, to edit a file, use the program in Example 9-1.
Example 9-1. uvi

    #!/usr/bin/perl -w
    # uvi - vi a file without changing its access times
    $file = shift or die "usage: uvi filename\n";
    ($atime, $mtime) = (stat($file))[8,9];
    system($ENV{EDITOR} || "vi", $file);
    utime($atime, $mtime, $file)
        or die "couldn't restore $file to orig times: $!";

9.1.4 See Also
The stat and utime functions in perlfunc(1) and in Chapter 29 of Programming Perl; the standard File::stat module (also in Chapter 32 of Programming Perl); your system's utime(3) manpage

Recipe 9.2 Deleting a File

9.2.1 Problem
You want to delete a file. Perl's delete function isn't what you want.

9.2.2 Solution
Use Perl's unlink function:

    unlink($FILENAME) or die "Can't delete $FILENAME: $!\n";
    unlink(@FILENAMES) == @FILENAMES
        or die "Couldn't unlink all of @FILENAMES: $!\n";

9.2.3 Discussion
The unlink function takes its name from the Unix syscall. Perl's unlink takes a list of filenames and returns the number of filenames successfully deleted. This return value can then be tested with || or or:

    unlink($file) or die "Can't unlink $file: $!";

unlink doesn't report which filenames it couldn't delete, only how many it deleted. Here's one way to test for successful deletion of many files and report the number deleted:

    unless (($count = unlink(@filelist)) == @filelist) {
        warn "could only delete $count of " . scalar(@filelist) . " files";
    }

A foreach over @filelist would permit individual error messages, as in the sketch below.
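    foreach $file (@filelist) {
        unlink($file) or warn "Couldn't unlink $file: $!\n";
    }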
use File::Copy; copy("datafile.dat", "datafile.bak") or die "copy failed: $!"; move("datafile.dat", "datafile.new") or die "move failed: $!"; Because these functions return only a simple success status, you can't easily tell which file prevented the copy or move from working. Copying the files manually lets you pinpoint which files didn't copy, but it fills your program with complex sysreads and syswrites. 9.3.4 See Also Documentation for the standard File::Copy module (also in Chapter 32 of Programming Perl); the rename, read, and syswrite functions in perlfunc(1) and in Chapter 29 of Programming Perl [ Team LiB ] [ Team LiB ] Recipe 9.4 Recognizing Two Names for the Same File 9.4.1 Problem You want to determine whether two filenames in a list correspond to the same file on disk (because of hard and soft links, two filenames can refer to a single file). You might do this to make sure that you don't change a file you've already worked with. 9.4.2 Solution Maintain a hash, keyed by the device and inode number of the files you've seen. The values are the names of the files: %seen = ( ); sub do_my_thing { my $filename = shift; my ($dev, $ino) = stat $filename; unless ($seen{$dev, $ino}++) { # do something with $filename because we haven't # seen it before } } 9.4.3 Discussion A key in %seen is made by combining the device number ($dev) and inode number ($ino) of each file. Files that are the same will have the same device and inode numbers, so they will have the same key. If you want to maintain a list of all files of the same name, instead of counting the number of times seen, save the name of the file in an anonymous array. foreach $filename (@files) { ($dev, $ino) = stat $filename; push( @{ $seen{$dev,$ino} }, $filename); } foreach $devino (sort keys %seen) { ($dev, $ino) = split(/$;/o, $devino); if (@{$seen{$devino}} > 1) { # @{$seen{$devino}} is a list of filenames for the same file } } The $; variable contains the separator string using the old multidimensional associative array emulation syntax, $hash{$x,$y,$z}. It's still a one-dimensional hash, but it has composite keys. The key is really join($; => $x, $y, $z). The split separates them again. Although you'd normally just use a real multilevel hash directly, here there's no need, and it's cheaper not to. 9.4.4 See Also The $; ($SUBSEP) variable in perlvar(1), and in the "Special Variables" section of Chapter 28 of Programming Perl; the stat function in perlfunc(1) and in Chapter 29 of Programming Perl; Chapter 5 [ Team LiB ] [ Team LiB ] Recipe 9.5 Processing All Files in a Directory 9.5.1 Problem You want to do something to each file in a particular directory. 9.5.2 Solution Use opendir to open the directory and readdir to retrieve every filename: opendir(DIR, $dirname) or die "can't opendir $dirname: $!"; while (defined($file = readdir(DIR))) { # do something with "$dirname/$file" } closedir(DIR); 9.5.3 Discussion The opendir, readdir, and closedir functions operate on directories as open, <>, and close operate on files. Both use handles, but the directory handles used by opendir and friends are different from the filehandles used by open and friends. In particular, you can't use <> on a directory handle. In scalar context, readdir returns the next filename in the directory until it reaches the end of the directory, when it returns undef. In list context it returns the rest of the filenames in the directory or an empty list if there were no files left. 
As explained in this chapter's Introduction, the filenames returned by readdir do not include the directory name. When you work with the filenames returned by readdir, you must either move to the right directory first or prepend the directory to the filename. This shows one way of prepending:

    $dir = "/usr/local/bin";
    print "Text files in $dir are:\n";
    opendir(BIN, $dir) or die "Can't open $dir: $!";
    while ( defined ($file = readdir BIN) ) {
        print "$file\n" if -T "$dir/$file";
    }
    closedir(BIN);

The readdir function will return the special directories "." (the directory itself) and ".." (the parent of the directory). Most people skip those files with code like:

    while ( defined ($file = readdir BIN) ) {
        next if $file =~ /^\.\.?$/;     # skip . and ..
        # ...
    }

Like filehandles, bareword directory handles are per-package constructs. You can use the local *DIRHANDLE syntax to get a new bareword directory handle. Alternatively, pass an undefined scalar as the first argument to opendir and Perl will put a new indirect directory handle into that scalar:

    opendir my $dh, $directory or die;
    while (defined ($filename = readdir($dh))) {
        # ...
    }
    closedir $dh;

Or, finally, you can use DirHandle to get an object-oriented view of a directory handle. The following code uses DirHandle and produces a sorted list of plain files that aren't dotfiles (that is, their names don't begin with a "."):

    use DirHandle;

    sub plainfiles {
        my $dir = shift;
        my $dh = DirHandle->new($dir) or die "can't opendir $dir: $!";
        return sort                   # sort pathnames
               grep { -f }            # choose only "plain" files
               map  { "$dir/$_" }     # create full paths
               grep { !/^\./ }        # filter out dot files
               $dh->read( );          # read all entries
    }

DirHandle's read method behaves just like readdir, returning all remaining filenames. The bottom grep returns only those that don't begin with a period. The map turns the filenames returned by read into fully qualified filenames, and the top grep filters out directories, links, etc. The resulting list is then sorted and returned.

In addition to readdir, there's also rewinddir (to move the directory handle back to the start of the filename list), seekdir (to move to a specific offset in the list), and telldir (to find out how far from the start of the list you are).

9.5.4 See Also
The closedir, opendir, readdir, rewinddir, seekdir, and telldir functions in perlfunc(1) and in Chapter 29 of Programming Perl; documentation for the standard DirHandle module (also in Chapter 32 of Programming Perl)

Recipe 9.6 Globbing, or Getting a List of Filenames Matching a Pattern

9.6.1 Problem
You want to get a list of filenames similar to those produced by MS-DOS's *.* and Unix's *.h. This is called globbing, and the filename wildcard expression is called a glob, or occasionally a fileglob to distinguish it from a typeglob.

9.6.2 Solution
Perl provides globbing with the semantics of the Unix C shell through the glob keyword and <>:

    @list = <*.c>;
    @list = glob("*.c");

You can also use readdir to extract the filenames manually:

    opendir(DIR, $path);
    @files = grep { /\.c$/ } readdir(DIR);
    closedir(DIR);

9.6.3 Discussion
In versions of Perl before v5.6, Perl's built-in glob and <WILDCARD> notation (not to be confused with <FILEHANDLE>) ran an external program (often the csh shell) to get the list of filenames. This led to globbing being tarred with security and performance concerns. As of v5.6, Perl uses the File::Glob module to glob files, which solves the security and performance problems of the old implementation.
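File::Glob can also be called directly; here is a sketch using its bsd_glob function and flags (the directory and extensions are made up):

    use File::Glob qw(:glob);

    # GLOB_BRACE enables csh-style {a,b} alternation;
    # GLOB_TILDE expands a leading ~ to your home directory
    @files = bsd_glob("~/docs/*.{ps,pdf}", GLOB_BRACE | GLOB_TILDE);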
Globs have C shell semantics on non-Unix systems to encourage portability. In particular, glob syntax isn't regular expression syntax—glob uses ? to mean "any single character" and * to mean "zero or more characters," so glob("f?o*") matches flo and flood but not fo.

For complex rules about which filenames you want, roll your own selection mechanism using readdir and regular expressions. At its simplest, an opendir solution uses grep to filter the list returned by readdir:

    @files = grep { /\.[ch]$/i } readdir(DH);

As always, the filenames returned don't include the directory. When you use the filename, prepend the directory name to get the full pathname:

    opendir(DH, $dir) or die "Couldn't open $dir for reading: $!";

    @files = ( );
    while( defined ($file = readdir(DH)) ) {
        next unless $file =~ /\.[ch]$/i;

        my $filename = "$dir/$file";
        push(@files, $filename) if -T $filename;
    }

The following example combines directory reading and filtering with the efficient sorting technique from Recipe 4.16. It sets @dirs to a sorted list of the subdirectories in a directory whose names are all numeric:

    @dirs = map  { $_->[1] }                 # extract pathnames
            sort { $a->[0] <=> $b->[0] }     # sort names numeric
            grep { -d $_->[1] }              # path is a dir
            map  { [ $_, "$path/$_" ] }      # form (name, path)
            grep { /^\d+$/ }                 # just numerics
            readdir(DIR);                    # all files

Recipe 4.16 explains how to read these strange-looking constructs. As always, formatting and documenting your code can make it much easier to read and understand.

9.6.4 See Also
The opendir, readdir, closedir, grep, map, and sort functions in perlfunc(1) and in Chapter 29 of Programming Perl; documentation for the standard DirHandle module (also in Chapter 32 of Programming Perl); the "I/O Operators" section of perlop(1), and the "Filename Globbing Operator" section of Chapter 2 of Programming Perl; we talk more about globbing in Recipe 6.9; Recipe 9.5

Recipe 9.7 Processing All Files in a Directory Recursively

9.7.1 Problem
You want to do something to each file and subdirectory in a particular directory.

9.7.2 Solution
Use the standard File::Find module.

    use File::Find;
    sub process_file {
        # do whatever;
    }
    find(\&process_file, @DIRLIST);

9.7.3 Discussion
File::Find provides a convenient way to process a directory recursively. It does the directory scans and recursion for you. All you do is pass find a code reference and a list of directories. For each file in those directories, recursively, find calls your function.

Before calling your function, find by default changes to the directory being visited, whose path relative to the starting directory is stored in the $File::Find::dir variable. $_ is set to the basename of the file being visited, and the full path of that file can be found in $File::Find::name. Your code can set $File::Find::prune to true to tell find not to descend into the directory just seen, as in the sketch below.
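Here is a sketch of pruning: it skips any directory named CVS (a made-up criterion) while printing everything else:

    use File::Find;

    find sub {
        if (-d && $_ eq "CVS") {        # don't descend into CVS dirs
            $File::Find::prune = 1;
            return;
        }
        print $File::Find::name, "\n";
    }, @ARGV;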
This simple example demonstrates File::Find. We give find an anonymous subroutine that prints the name of each file visited and adds a / to the names of directories:

    @ARGV = qw(.) unless @ARGV;
    use File::Find;
    find sub { print $File::Find::name, -d && "/", "\n" }, @ARGV;

The -d file test operator returns the empty string '' if it fails, making the && return that, too. But if -d succeeds, the && returns "/", which is then printed.

The following program prints the total bytes occupied by everything in a directory, including subdirectories. It gives find an anonymous subroutine to keep a running sum of the sizes of each file it visits. That includes all inode types, including the sizes of directories and symbolic links, not just regular files. Once the find function returns, the accumulated sum is displayed.

    use File::Find;
    @ARGV = (".") unless @ARGV;
    my $sum = 0;
    find sub { $sum += -s }, @ARGV;
    print "@ARGV contains $sum bytes\n";

This code finds the largest single file within a set of directories:

    use File::Find;
    @ARGV = (".") unless @ARGV;
    my ($saved_size, $saved_name) = (-1, "");
    sub biggest {
        return unless -f && -s _ > $saved_size;
        $saved_size = -s _;
        $saved_name = $File::Find::name;
    }
    find(\&biggest, @ARGV);
    print "Biggest file $saved_name in @ARGV is $saved_size bytes long.\n";

We use $saved_size and $saved_name to keep track of the name and the size of the largest file visited. If we find a file bigger than the largest seen so far, we replace the saved name and size with the current ones. When the find finishes, the largest file and its size are printed out, rather verbosely. A more general tool would probably just print the filename, its size, or both. This time we used a named function rather than an anonymous one because the function was getting big.

It's simple to change this to find the most recently changed file:

    use File::Find;
    @ARGV = (".") unless @ARGV;
    my ($age, $name);
    sub youngest {
        return if defined $age && $age > (stat($_))[9];
        $age  = (stat(_))[9];
        $name = $File::Find::name;
    }
    find(\&youngest, @ARGV);
    print "$name " . scalar(localtime($age)) . "\n";

The File::Find module doesn't export its $name variable, so always refer to it by its fully qualified name. Example 9-2 is more a demonstration of namespace munging than of recursive directory traversal, although it does find all directories. It makes $name in our current package an alias for the one in File::Find, which is essentially how Exporter works. Then it declares its own version of find with a prototype so it can be called like grep or map.

Example 9-2. fdirs

    #!/usr/bin/perl -lw
    # fdirs - find all directories
    @ARGV = qw(.) unless @ARGV;
    use File::Find ( );
    sub find(&@) { &File::Find::find }
    *name = *File::Find::name;
    find { print $name if -d } @ARGV;

Our own find only calls the find in File::Find, which we were careful not to import by specifying an empty ( ) list in the use statement. Rather than write this:

    find sub { print $File::Find::name if -d }, @ARGV;

we can write the more pleasant:

    find { print $name if -d } @ARGV;

9.7.4 See Also
The documentation for the standard File::Find and Exporter modules (also in Chapter 32 of Programming Perl); your system's find(1) manpage; Recipe 9.6

Recipe 9.8 Removing a Directory and Its Contents

9.8.1 Problem
You want to remove a directory tree recursively without using rm -r.

9.8.2 Solution
Use the finddepth function from File::Find, shown in Example 9-3.

Example 9-3. rmtree1

    #!/usr/bin/perl
    # rmtree1 - remove whole directory trees like rm -r
    use File::Find;
    die "usage: $0 dir ..\n" unless @ARGV;
    find {
        bydepth   => 1,
        no_chdir  => 1,
        wanted    => sub {
            if (!-l && -d _) {
                rmdir  or warn "couldn't rmdir directory $_: $!";
            } else {
                unlink or warn "couldn't unlink file $_: $!";
            }
        }
    } => @ARGV;

Or use rmtree from File::Path, as shown in Example 9-4.

Example 9-4. rmtree2

    #!/usr/bin/perl
    # rmtree2 - remove whole directory trees like rm -r
    use File::Path;
    die "usage: $0 dir ..\n" unless @ARGV;
    foreach $dir (@ARGV) {
        rmtree($dir);
    }

These programs remove an entire directory tree. Use with extreme caution!
9.8.3 Discussion The File::Find module supports an alternate interface in which find's first argument is a hash reference containing options and their settings. The bydepth option is the same as calling finddepth instead of find. This is guaranteed to visit all files beneath a directory before the directory itself, just what we need to remove a directory and its contents. The no_chdir option stops find from changing its working directory as it processes files; under this option, $_ is the same as $File::Find::name. Finally, the wanted option takes a code reference, our old wanted( ) function. We use two different functions, rmdir and unlink; both default to $_ if no argument is provided. The unlink function deletes only files, and rmdir deletes only empty directories. We need to use finddepth or the bydepth option to make sure we've first removed the directory's contents before we rmdir the directory itself. We first check that the file isn't a symbolic link before determining whether it's a directory, because -d returns true for both a real directory and a symbolic link to a directory. stat, lstat, and file test operators like -d all use the syscall stat(2), which returns the file meta-information stored in the file's inode. These functions and operators cache that information in the special underscore (_) filehandle. This permits tests on the same file while avoiding redundant syscalls that would return the same information, slowly. According to POSIX, if the directory is either the root directory (the mount point for a filesystem or the result of a chroot(2) syscall) or the current working directory of any process, it is unspecified whether the rmdir syscall succeeds, or whether it fails and sets errno ($! in Perl) to EBUSY ("Device busy"). Many systems tolerate the latter condition, but few the former. 9.8.4 See Also The unlink, rmdir, lstat, and stat functions in perlfunc(1) and in Chapter 29 of Programming Perl; the documentation for the standard File::Find module (also in Chapter 32 of Programming Perl); your system's rm(1) and stat(2) manpages; the -X section of perlfunc(1), and the "Named Unary and File Test Operators" section of Chapter 3 of Programming Perl [ Team LiB ] [ Team LiB ] Recipe 9.9 Renaming Files 9.9.1 Problem You have many files whose names you want to change. 9.9.2 Solution Use a foreach loop and the rename function: foreach $file (@NAMES) { my $newname = $file; # change $newname rename($file, $newname) or warn "Couldn't rename $file to $newname: $!\n"; } 9.9.3 Discussion This is straightforward. rename takes two arguments. The first is the filename to change, and the second is its new name. Perl's rename is a frontend to the operating system's rename syscall, which typically won't rename files across filesystem boundaries. A small change turns this into a generic rename script, such as the one by Larry Wall shown in Example 9-5. Example 9-5. rename #!/usr/bin/perl -w # rename - Larry's filename fixer $op = shift or die "Usage: rename expr [files]\n"; chomp(@ARGV = <STDIN>) unless @ARGV; for (@ARGV) { $was = $_; eval $op; die $@ if $@; rename($was,$_) unless $was eq $_; } This script's first argument is Perl code that alters the filename (stored in $_) to reflect how you want the file renamed. It can do this because it uses an eval to do the hard work. It also skips rename calls when the filename is untouched. This lets you simply use wildcards like rename EXPR * instead of making long lists of filenames. 
Here are five examples of calling the rename program from the shell: % rename 's/\.orig$//' *.orig % rename "tr/A-Z/a-z/ unless /^Make/" * % rename '$_ .= ".bad"' *.f % rename 'print "$_: "; s/foo/bar/ if <STDIN> =~ /^y/i' * % find /tmp -name "*~" -print | rename 's/^(.+)~$/.#$1/' The first shell command removes a trailing ".orig" from each filename. The second converts uppercase to lowercase. Because a translation is used rather than the lc function, this conversion won't be locale-aware. To fix that, you'd have to write: % rename 'use locale; $_ = lc($_) unless /^Make/' * The third appends ".bad" to each Fortran file ending in ".f", something many of us have wanted to do for a long time. The fourth prompts the user for the change. Each file's name is printed to standard output and a response read from standard input. If the user types something starting with a "y" or "Y", any "foo" in the filename is changed to "bar". The fifth uses find to locate files in /tmp that end with a tilde. It renames these so that instead of ending with a tilde, they start with a dot and a pound sign. In effect, this switches between two common conventions for backup files. The rename script exemplifies the powerful Unix tool-and-filter philosophy. Even though we could have created a dedicated command for lowercase conversion, it's nearly as easy to write a flexible, reusable tool by embedding an eval. By reading filenames from standard input, we don't have to build in the recursive directory walk. Instead, we just use find, which performs this function well. There's no reason to recreate the wheel, although using File::Find we could have. 9.9.4 See Also The rename function in perlfunc(1) and in Chapter 29 of Programming Perl; your system's mv(1) and rename(2) manpages; the documentation for the standard File::Find module (also in Chapter 32 of Programming Perl) [ Team LiB ] [ Team LiB ] Recipe 9.10 Splitting a Filename into Its Component Parts 9.10.1 Problem You want to extract a filename, its enclosing directory, or the extension(s) from a string that contains a full pathname. 9.10.2 Solution Use routines from the standard File::Basename module. use File::Basename; $base = basename($path); $dir = dirname($path); ($base, $dir, $ext) = fileparse($path); 9.10.3 Discussion The standard File::Basename module contains routines to split up a filename. dirname and basename supply the directory and filename portions, respectively: $path = "/usr/lib/libc.a"; $file = basename($path); $dir = dirname($path); print "dir is $dir, file is $file\n"; # dir is /usr/lib, file is libc.a The fileparse function can extract the extension. Pass fileparse the path to decipher and a regular expression that matches the extension. You must supply a pattern because an extension isn't necessarily dot-separated. Consider ".tar.gz": is the extension ".tar", ".gz", or ".tar.gz"? By specifying the pattern, you control which you get. $path = "/usr/lib/libc.a"; ($name,$dir,$ext) = fileparse($path,'\..*'); print "dir is $dir, name is $name, extension is $ext\n"; # dir is /usr/lib/, name is libc, extension is .a By default, these routines parse pathnames using your operating system's normal conventions for directory separators by consulting the $^O ($OSNAME) variable, which holds a string identifying the platform you're running on. That value was determined when Perl was built and installed. You can change the default by calling the fileparse_set_fstype routine. 
This alters the behavior of subsequent calls to the File::Basename functions: fileparse_set_fstype("MacOS"); $path = "Hard%20Drive:System%20Folder:README.txt"; ($name,$dir,$ext) = fileparse($path,'\..*'); print "dir is $dir, name is $name, extension is $ext\n"; # dir is Hard%20Drive:System%20Folder:, name is README, extension is .txt To pull out just the extension, you might use this: sub extension { my $path = shift; my $ext = (fileparse($path,'\..*'))[2]; $ext =~ s/^\.//; return $ext; } When called on a file like source.c.bak, this returns an extension of "c.bak", not just "bak". If you want ".bak" returned, use '\.[^.]*' as the second argument to fileparse (this will, of course, leave the filename as source.c). When passed a pathname with a trailing directory separator, such as "lib/", fileparse considers the directory name to be "lib/", whereas dirname considers it to be ".". 9.10.4 See Also The documentation for the standard File::Basename module (also in Chapter 32 of Programming Perl); the entry for $^O ($OSNAME) in perlvar(1), and in the "Special Variables in Alphabetical Order" section of Chapter 28 of Programming Perl [ Team LiB ] [ Team LiB ] Recipe 9.11 Working with Symbolic File Permissions Instead of Octal Values 9.11.1 Problem You want to print, inspect, or change permissions on a file or directory, but you don't want to specify the permissions in octal (e.g., 0644, 0755). You want to print permissions as ls(1) shows them (e.g., -rwxr-xr-x) and specify permissions changes in the way that chmod(1) does (e.g., g-w to remove write access for the group). 9.11.2 Solution Use the CPAN module Stat::lsMode to convert numeric permissions to a string: use Stat::lsMode; $lsmode = file_mode($pathname); Use the CPAN module File::chmod to manipulate symbolic permissions: use File::chmod; chmod("g=rw,o-w", @files); # group can read/write, others can't write chmod("-rwxr-xr--", @files); # ls-style permissions 9.11.3 Discussion The Stat::lsMode module provides functions for generating ls-style permissions strings. The file_mode function takes a pathname and returns a permissions string. This string is false if the pathname doesn't exist or Perl can't stat it. If all goes well, you get a string like "drwxr-x---" for a directory or "-rwxr-x---" for a file. For more fine-grained control, Stat::lsMode offers format_mode, which takes a numeric permissions value and returns the 10-character ls-style string. Notice the leading d and - in those strings. This indicates the type of file whose permissions you're inspecting: - means regular file, d means directory, l means symbolic link, and so on. The format_perms function from Stat::lsMode does the same job as format_mode, but it returns a nine-character string, which does not have the type indicator. For example: use Stat::lsMode; print file_mode("/etc"), "\n"; print format_perms((stat "/etc")[2]), "\n"; drwxr-xr-x rwxr-xr-x The File::chmod module gives you a chmod that accepts these nine-character permissions strings: use File::chmod; chmod("rwxr-xr-x", @files); These strings are three clusters of three characters. The three clusters represent what the user, group, and others can do to the file (respectively). The three characters represent reading, writing, and executing, with a dash (-) in a column indicating that the corresponding permission is denied. So in "rwxrw-r--", the owner can read, write, and execute; users in the same group as the file can read and write but not execute; and everyone else can only read. 
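Here is a short sketch that puts the two modules together; it assumes a writable scratch file named demo in the current directory. It shows the file's permissions with Stat::lsMode, tightens them with a nine-character ls-style string through File::chmod's replacement chmod, then shows them again:
use Stat::lsMode;
use File::chmod;                    # overrides the built-in chmod
my $file = "demo";                  # hypothetical scratch file
print file_mode($file), "\n";       # e.g., -rw-r--r--
chmod("rw-------", $file)           # owner read/write, nothing for anyone else
    or die "couldn't chmod $file";
print file_mode($file), "\n";       # now -rw-------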
You can specify relative changes to the permissions for a particular file; for example, g-w removes write permission from the group. The first letter(s) indicates whose permissions are being changed (user, group, other, or a combination). Then comes a + or - to indicate adding or removing permissions, or = to indicate you're specifying the complete set of permissions. Then you specify some or all of rwx. You can join these with commas to combine several changes; for example, g-w,o+x (remove write from group, add execute to other). If you omit the u, g, or o, then the change applies to everyone. Here are some valid permissions changes and what they do:
u= remove all permissions for the user
g=r group can only read
g+wx group can also write and execute
g=rwx,o=rx group can do all, other can only read and execute
=rwx everybody can do everything
So you can now say: chmod("u=", @files); # remove all permissions for the user on @files chmod("g=r", @files); chmod("g+wx", @files); chmod("g=rwx,o=rx", @files); chmod("=rwx", @files); File::chmod also provides functions for seeing what the new permission would be without actually making the change. See the File::chmod documentation for more details. 9.11.4 See Also The documentation for the CPAN modules File::chmod and Stat::lsMode; the chmod and stat functions in perlfunc(1) [ Team LiB ] [ Team LiB ] Recipe 9.12 Program: symirror The program in Example 9-6 recursively duplicates a directory tree, making a shadow forest full of symlinks pointing back at the real files. Example 9-6. symirror #!/usr/bin/perl # symirror - build spectral forest of symlinks use warnings; use strict; use Cwd qw(realpath); use File::Find qw(find); die "usage: $0 realdir mirrordir" unless @ARGV == 2; our $SRC = realpath $ARGV[0]; our $DST = realpath $ARGV[1]; my $oldmask = umask 077; # in case was insanely uncreatable chdir $SRC or die "can't chdir $SRC: $!"; unless (-d $DST) { mkdir($DST, 0700) or die "can't mkdir $DST: $!"; } find { wanted => \&shadow, postprocess => \&fixmode, } => "."; umask $oldmask; sub shadow { (my $name = $File::Find::name) =~ s!^\./!!; # correct name return if $name eq "."; if (-d) { # make a real dir; we'll copy mode later mkdir("$DST/$name", 0700) or die "can't mkdir $DST/$name: $!"; } else { # all else gets symlinked symlink("$SRC/$name", "$DST/$name") or die "can't symlink $SRC/$name to $DST/$name: $!"; } } sub fixmode { my $dir = $File::Find::dir; my $mode = (stat("$SRC/$dir"))[2] & 07777; chmod($mode, "$DST/$dir") or die "can't set mode on $DST/$dir: $!"; } [ Team LiB ] [ Team LiB ] Recipe 9.13 Program: lst Have you ever wondered what the newest or biggest files within a directory are? The standard ls program has options for listing out directories sorted in time order (the -t flag) and for recursing into subdirectories (the -R flag). However, it pauses at each directory to display the sorted contents of just that directory. It doesn't descend through all subdirectories first and then sort everything it found. The following lst program does that. 
Here's an example using its -l flag to get a long listing: % lst -l /etc 12695 0600 1 root wheel 512 Fri May 29 10:42:41 1998 /etc/ssh_random_seed 12640 0644 1 root wheel 10104 Mon May 25 7:39:19 1998 /etc/ld.so.cache 12626 0664 1 root wheel 12288 Sun May 24 19:23:08 1998 /etc/psdevtab 12304 0644 1 root root 237 Sun May 24 13:59:33 1998 /etc/exports 12309 0644 1 root root 3386 Sun May 24 13:24:33 1998 /etc/inetd.conf 12399 0644 1 root root 30205 Sun May 24 10:08:37 1998 /etc/sendmail.cf 18774 0644 1 gnat perldoc 2199 Sun May 24 9:35:57 1998 /etc/X11/XMetroconfig 12636 0644 1 root wheel 290 Sun May 24 9:05:40 1998 /etc/mtab 12627 0640 1 root root 0 Sun May 24 8:24:31 1998 /etc/wtmplock 12310 0644 1 root tchrist 65 Sun May 24 8:23:04 1998 /etc/issue .... /etc/X11/XMetroconfig showed up in the middle of the listing for /etc because it wasn't just for /etc, but for everything within that directory, recursively. Other supported options include sorting on read time instead of write time using -u and sorting on size rather than time with -s. The -i flag takes the list of filenames from standard input instead of recursing with find. That way, if you already had a list of filenames, you could feed them to lst for sorting. The program is shown in Example 9-7. Example 9-7. lst #!/usr/bin/perl # lst - list sorted directory contents (depth first) use Getopt::Std; use File::Find; use File::stat; use User::pwent; use User::grent; getopts("lusrcmi") or die <<DEATH; Usage: $0 [-mucsril] [dirs ...] or $0 -i [-mucsrl] < filelist Input format: -i read pathnames from stdin Output format: -l long listing Sort on: -m use mtime (modify time) [DEFAULT] -u use atime (access time) -c use ctime (inode change time) -s use size for sorting Ordering: -r reverse sort NB: You may only select one sorting option at a time. DEATH unless ($opt_i || @ARGV) { @ARGV = (".") } if ($opt_c + $opt_u + $opt_s + $opt_m > 1) { die "can only sort on one time or size"; } $IDX = "mtime"; $IDX = "atime" if $opt_u; $IDX = "ctime" if $opt_c; $IDX = "size" if $opt_s; $TIME_IDX = $opt_s ? "mtime" : $IDX; *name = *File::Find::name; # forcibly import that variable # the $opt_i flag tricks wanted into taking # its filenames from ARGV instead of being # called from find. if ($opt_i) { *name = *_; # $name now alias for $_ while (<>) { chomp; &wanted; } # ok, not stdin really } else { find(\&wanted, @ARGV); } # sort the files by their cached times, youngest first @skeys = sort { $time{$b} <=> $time{$a} } keys %time; # but flip the order if -r was supplied on command line @skeys = reverse @skeys if $opt_r; for (@skeys) { unless ($opt_l) { # emulate ls -l, except for permissions print "$_\n"; next; } $now = localtime $stat{$_}->$TIME_IDX( ); printf "%6d %04o %6d %8s %8s %8d %s %s\n", $stat{$_}->ino( ), $stat{$_}->mode( ) & 07777, $stat{$_}->nlink( ), user($stat{$_}->uid( )), group($stat{$_}->gid( )), $stat{$_}->size( ), $now, $_; } # get stat info on the file, saving the desired # sort criterion (mtime, atime, ctime, or size) # in the %time hash indexed by filename. # if they want a long list, we have to save the # entire stat object in %stat. yes, this is a # hash of objects sub wanted { my $sb = stat($_); # XXX: should be stat or lstat? return unless $sb; $time{$name} = $sb->$IDX( ); # indirect method call $stat{$name} = $sb if $opt_l; } # cache user number to name conversions; don't worry # about the apparently extra call, as the system caches the # last one called all by itself sub user { my $uid = shift; $user{$uid} = getpwuid($uid) ? 
getpwuid($uid)->name : "#$uid" unless defined $user{$uid}; return $user{$uid}; } # cache group number to name conversions; ditto on unworryness sub group { my $gid = shift; $group{$gid} = getgrgid($gid) ? getgrgid($gid)->name : "#$gid" unless defined $group{$gid}; return $group{$gid}; } [ Team LiB ] [ Team LiB ] Chapter 10. Subroutines Composing mortals with immortal fire. —W. H. Auden, "Three Songs for St Cecilia's Day" [ Team LiB ] [ Team LiB ] Introduction To avoid the dangerous practice of copying and pasting code, larger programs reuse chunks of code as subroutines and functions. We'll use the terms subroutine and function interchangeably because Perl doesn't distinguish between the two. Even object-oriented methods are just subroutines that are called using a special syntax, described in Chapter 13. A subroutine is declared with the sub keyword. Here's a simple subroutine definition: sub hello { $greeted++; # global variable print "hi there!\n"; } The typical way of calling that subroutine is: hello( ); # call subroutine hello with no arguments/parameters Because Perl compiles your program before executing it, it doesn't matter where subroutines are declared. Definitions don't have to be in the same file as your main program. They can be pulled in from other files using the do, require, or use operators, as described in Chapter 12. They can even be created on the fly using eval or AUTOLOAD, or generated using closures, which can act as function templates. If you are familiar with other programming languages, several characteristics of Perl's functions may surprise you if you're unprepared for them. Most recipes in this chapter illustrate how to be aware of—and to take advantage of—these properties. Perl functions have no formal, named parameters, but this is not necessarily a bad thing. See Recipe 10.1 and Recipe 10.7. All variables are global unless declared otherwise. See Recipe 10.2, Recipe 10.3, and Recipe 10.13 for details. Passing or returning more than one array or hash normally causes them to lose their separate identities. See Recipe 10.5, Recipe 10.8, Recipe 10.9, and Recipe 10.11 to avoid this. A function can know in which context it was called, how many arguments it was called with, and even which other function called it. See Recipe 10.4 and Recipe 10.6 to find out how. Perl's undef value can be used to signal an error return from the function because no valid string, number, or reference ever has that value. Recipe 10.10 covers subtle pitfalls with undef you should avoid, and Recipe 10.12 shows how to deal with other catastrophic conditions. Perl supports interesting operations on functions that you might not see in other languages, such as anonymous functions, creating functions on the fly, and calling them indirectly using function pointers. See Recipe 10.14 and Recipe 10.16 for these esoteric topics. Calling a function as $x = &func; does not supply any arguments, but rather provides direct access to its caller's @_ array! If you omit the ampersand and use either func( ) or func, then a new and empty @_ is provided instead. Historically, Perl hasn't provided a construct like C's switch or the shell's case for multiway branching. The switch function shown in Recipe 10.17 takes care of that for you. [ Team LiB ] [ Team LiB ] Recipe 10.1 Accessing Subroutine Arguments 10.1.1 Problem You have written a function that takes arguments supplied by its caller, and you need to access those arguments. 
10.1.2 Solution The special array @_ holds the values passed in as the function's arguments. Thus, the first argument to the function is in $_[0], the second in $_[1], and so on. The number of arguments is simply scalar(@_). For example: sub hypotenuse { return sqrt( ($_[0] ** 2) + ($_[1] ** 2) ); } $diag = hypotenuse(3,4); # $diag is 5 Most subroutines start by copying arguments into named private variables for safer and more convenient access: sub hypotenuse { my ($side1, $side2) = @_; return sqrt( ($side1 ** 2) + ($side2 ** 2) ); } 10.1.3 Discussion It's been said that programming has only three nice numbers: zero, one, and however many you please. Perl's subroutine mechanism was designed to facilitate writing functions with as many—or as few—elements in the parameter and return lists as you wish. All incoming parameters appear as separate scalar values in the special array @_, which is automatically local to each function (see Recipe 10.13). To return a value or values from a subroutine, use the return statement with arguments. If there is no return statement, the return value is the result of the last evaluated expression. Here are some sample calls to the hypotenuse function defined in the Solution: print hypotenuse(3, 4), "\n"; # prints 5 @a = (3, 4); print hypotenuse(@a), "\n"; # prints 5 If you look at the arguments used in the second call to hypotenuse, it might appear that only one argument was passed: the array @a. This isn't what happens—the elements of @a are copied into the @_ array separately. Similarly, if you called a function with (@a, @b), you'd be giving it all arguments from both arrays. This is the same principle of flattened lists at work as in: @both = (@men, @women); The scalars in @_ are implicit aliases for the ones passed in, not copies. That means changing the elements of @_ in a subroutine changes the values in the subroutine's caller. This is a holdover from before Perl had proper references. You can write functions that leave their arguments intact by copying the arguments to private variables like this: @nums = (1.4, 3.5, 6.7); @ints = int_all(@nums); # @nums unchanged sub int_all { my @retlist = @_; # make safe copy for return for my $n (@retlist) { $n = int($n) } return @retlist; } You can also write functions that change their caller's variables: @nums = (1.4, 3.5, 6.7); trunc_em(@nums); # @nums now (1,3,6) sub trunc_em { for (@_) { $_ = int($_) } # truncate each argument } Don't pass constants as arguments to a function that intends to modify those arguments; for example, don't call trunc_em(1.4, 3.5, 6.7). If you do, you'll get a runtime exception to the effect of Modification of a read-only value attempted at .... The built-in functions chop and chomp are like that; they modify their caller's variables and return something else entirely. Beginning Perl programmers who notice regular functions that all return some new value—including int, uc, and readline— without modifying those functions' arguments sometimes incorrectly infer that chop and chomp work similarly. This leads them to write code like: $line = chomp(<>); # WRONG $removed_chars = chop($line); # RIGHT $removed_count = chomp($line); # RIGHT until they get the hang of how this pair really works. Given the vast potential for confusion, you might want to think twice before modifying @_ in your own subroutines, especially if you also intend to provide a distinct return value. 
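To see scalar(@_) and @_ working together, here's a small max function of our own devising (not a built-in) that accepts however many arguments you please:
sub max {
    die "max: need at least one argument" unless @_;    # scalar(@_) is the count
    my $biggest = shift;
    for (@_) { $biggest = $_ if $_ > $biggest }
    return $biggest;
}
print max(3, 1, 4, 1, 5, 9), "\n";      # prints 9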
10.1.4 See Also Chapter 6 of Programming Perl and perlsub(1) [ Team LiB ] [ Team LiB ] Recipe 10.2 Making Variables Private to a Function 10.2.1 Problem Your subroutine needs temporary variables. You shouldn't use global variables, because another subroutine might also use the same variables. 10.2.2 Solution Use my to declare a variable private to a region of your program: sub somefunc { my $variable; # $variable is invisible outside somefunc( ) my ($another, @an_array, %a_hash); # declaring many variables at once # ... } 10.2.3 Discussion The my operator confines a variable to a particular region of code in which it can be used and accessed. Outside that region, it can't be accessed. This region is called its scope. Variables declared with my have lexical scope, meaning that they exist only within a specific textual region of code. For instance, the scope of $variable in the Solution is the function it was defined in, somefunc. The variable is created when somefunc is entered, and it is destroyed when the function returns. The variable can be accessed only from inside the function, not from outside. A lexical scope is usually a block of code with braces around it, such as those defining the body of the somefunc subroutine or those marking the code blocks of if, while, for, foreach, and eval. An entire source file and the string argument to eval are each a lexical scope;[1] think of them as blocks with invisible braces delimiting their confines. Because a lexical scope is most often found as a brace-delimited block, when discussing lexical variables we sometimes say that they are visible only in their block, but what we really mean is that they're visible only in their scope. [1] Although not of the same sort: the eval scope is a nested scope, just like a nested block, but the file scope is unrelated to any other. The code that can legally access a my variable is determined statically at compile time and never changes, and so lexical scoping is sometimes referred to as static scoping, especially when in contrast to dynamic scoping, a topic we'll cover in Recipe 10.13. You can combine a my declaration with an assignment. Use parentheses when defining more than one variable: my ($name, $age) = @ARGV; my $start = fetch_time( ); These lexical variables behave as you would expect of a local variable. Nested blocks can see lexicals declared in enclosing, outer blocks, but not in unrelated blocks: my ($a, $b) = @pair; my $c = fetch_time( ); sub check_x { my $x = $_[0]; my $y = "whatever"; run_check( ); if ($condition) { print "got $x\n"; } } In the preceding code, the if block inside the function can access the private $x variable. However, the run_check function called from within that scope cannot access $x or $y, because run_check was presumably defined in another scope. However, check_x can access $a, $b, or $c from the outer scope because the function was defined in the same scope as those three variables. Don't nest definitions of named subroutines. If you do, they won't get the right bindings of the lexical variables. Recipe 10.16 shows how to cope with this restriction. When a lexical variable goes out of scope, its storage is freed unless a reference to the variable still exists, as with @arguments in the following code: sub save_array { my @arguments = @_; push(our @Global_Array, \@arguments); } This code creates a new array each time save_array is called, so you don't have to worry that it'll reuse the same array each time the function is called. 
Perl's garbage collection system knows not to deallocate things until they're no longer used. This is why you can return a reference to a private variable without leaking memory. 10.2.4 See Also The section on "Scoped Declarations" in Chapter 4 of Programming Perl and the section on "Private Variables via my( )" in perlsub(1) [ Team LiB ] [ Team LiB ] Recipe 10.3 Creating Persistent Private Variables 10.3.1 Problem You want a variable to retain its value between calls to a subroutine but not be visible outside that routine. For instance, you'd like your function to keep track of how many times it was called. 10.3.2 Solution Wrap the function in another block, then declare my variables in that block's scope rather than in the function's: { my $variable; sub mysub { # ... accessing $variable } } If the variables require initialization, make that block an INIT so the variable is guaranteed to be set before the main program starts running: INIT { my $variable = 1; # initial value sub othersub { # ... accessing $variable } } 10.3.3 Discussion Unlike local[2] variables in C or C++, Perl's lexical variables don't necessarily get recycled just because their scope has exited. If something more permanent is still aware of the lexical, it will stick around. In this code, mysub uses $variable, so Perl doesn't reclaim the variable when the block around the definition of mysub ends. [2] Technically speaking, auto variables. Here's how to write a counter: { my $counter; sub next_counter { return ++$counter } } Each time next_counter is called, it increments and returns the $counter variable. The first time next_counter is called, $counter is undefined, so it behaves as though it were 0 for the ++. The variable is not part of next_counter's scope, but rather part of the block surrounding it. No code from outside can change $counter except by calling next_counter. Generally, you should use an INIT block for the extra scope. Otherwise, you could call the function before its variables were initialized. INIT { my $counter = 42; sub next_counter { return ++$counter } sub prev_counter { return --$counter } } This technique creates the Perl equivalent of C's static variables. Actually, it's a little better: rather than being limited to just one function, both functions share their private variable. 10.3.4 See Also The sections on "Closures" in Chapter 8 of Programming Perl and on "Avant-Garde Compiler, Retro Interpreter" in Chapter 18 of Programming Perl; the section on "Private Variables via my( )" in perlsub(1); the section on "Package Constructors and Destructors" in perlmod(1); Recipe 11.4 [ Team LiB ] [ Team LiB ] Recipe 10.4 Determining Current Function Name 10.4.1 Problem You want to determine the name of the currently running function. This is useful for creating error messages that don't need to be changed if you copy and paste the subroutine code. 10.4.2 Solution Use the caller function: $this_function = (caller(0))[3]; 10.4.3 Discussion Code can always determine the current source line number via the special symbol _ _LINE_ _, the current file via _ _FILE_ _, and the current package via _ _PACKAGE_ _. But no such symbol for the current subroutine name exists, let alone the name for the subroutine that called this one. The built-in function caller handles all of these. In scalar context it returns the calling function's package name, but in list context it returns much more. 
You can also pass it a number indicating how many frames (nested subroutine calls) back you'd like information about: 0 is your own function, 1 is your caller, and so on. Here's the full syntax, where $i is how far back you're interested in: ($package, $filename, $line, $subr, $has_args, $wantarray # 0 1 2 3 4 5 $evaltext, $is_require, $hints, $bitmask # 6 7 8 9 ) = caller($i); Here's what each of those return values means: $package The package in which the code was compiled. $filename The name of the file in which the code was compiled, reporting -e if launched from that command-line switch, or - if the script was read from standard input. $line The line number from which that frame was called. $subr The name of that frame's function, including its package. Closures are indicated by names like main::_ _ANON_ _, which are not callable. In an eval, it contains (eval). $has_args Whether the function had its own @_ variable set up. It may be that there are no arguments, even if true. The only way for this to be false is if the function was called using the &fn notation instead of fn( ) or &fn( ). $wantarray The value the wantarray function would return for that stack frame; either true, false but defined, or else undefined. This tells whether the function was called in list, scalar, or void context (respectively). $evaltext The text of the current eval STRING, if any. $is_require Whether the code is currently being loaded by a do, require, or use. $hints, $bitmask These both contain pragmatic hints that the caller was compiled with. Consider them to be for internal use only by Perl itself. Rather than using caller directly as in the Solution, you might want to write functions instead: $me = whoami( ); $him = whowasi( ); sub whoami { (caller(1))[3] } sub whowasi { (caller(2))[3] } These use arguments of 1 and 2 for parent and grandparent functions because the call to whoami or whowasi would itself be frame number 0. 10.4.4 See Also The wantarray and caller functions in Chapter 29 of Programming Perl and in perlfunc(1); Recipe 10.6 [ Team LiB ] [ Team LiB ] Recipe 10.5 Passing Arrays and Hashes by Reference 10.5.1 Problem You want to pass a function more than one array or hash and have each remain distinct. For example, you want to put the algorithm from Recipe 4.8 into a subroutine. This subroutine must then be called with two arrays that remain distinct. 10.5.2 Solution Pass arrays and hashes by reference, using the backslash operator: array_diff( \@array1, \@array2 ); 10.5.3 Discussion See Chapter 11 for more about manipulation of references. Here's a subroutine that expects array references, along with code to call it correctly: @a = (1, 2); @b = (5, 8); @c = add_vecpair( \@a, \@b ); print "@c\n"; 6 10 sub add_vecpair { # assumes both vectors the same length my ($x, $y) = @_; # copy in the array references my @result; for (my $i=0; $i < @$x; $i++) { $result[$i] = $x->[$i] + $y->[$i]; } return @result; } A potential problem with this function is that it doesn't verify the number and types of arguments passed into it. You could check explicitly this way: unless (@_ == 2 && ref($x) eq 'ARRAY' && ref($y) eq 'ARRAY') { die "usage: add_vecpair ARRAYREF1 ARRAYREF2"; } If all you plan to do is die on error (see Recipe 10.12), you can sometimes omit this check, since dereferencing the wrong kind of reference triggers an exception anyway. However, good defensive programming style encourages argument validation for all functions. 
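Hashes pass by reference the same way. This sketch uses a made-up common_keys function that takes two hash references and returns the keys present in both:
%fruit = (apple => 1, banana => 2);
%sweet = (banana => 1, cherry => 1);
@shared = common_keys(\%fruit, \%sweet);
print "@shared\n";                      # banana

sub common_keys {
    my ($href1, $href2) = @_;           # copy in the hash references
    return grep { exists $href2->{$_} } keys %$href1;
}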
10.5.4 See Also The sections on "Passing References" and on "Prototypes" in Chapter 6 of Programming Perl and on "Pass by Reference" in perlsub(1); Recipe 10.11; Chapter 11; Chapter 8 of Programming Perl [ Team LiB ] [ Team LiB ] Recipe 10.6 Detecting Return Context 10.6.1 Problem You want to know in which context your function was called. This lets one function do different things, depending on how its return value or values are used, just like many of Perl's built-in functions. 10.6.2 Solution Use the wantarray( ) function, which has three possible return values, depending on how the current function was called: if (wantarray( )) { # list context } elsif (defined wantarray( )) { # scalar context } else { # void context } 10.6.3 Discussion Many built-in functions act differently when called in scalar context than they do when called in list context. A user-defined function can learn which context it was called in by checking wantarray. List context is indicated by a true return value. If wantarray returns a value that is false but defined, then the function's return value will be used in scalar context. If wantarray returns undef, your function isn't being asked to provide any value at all. if (wantarray( )) { print "In list context\n"; return @many_things; } elsif (defined wantarray( )) { print "In scalar context\n"; return $one_thing; } else { print "In void context\n"; return; # nothing } mysub( ); # void context $a = mysub( ); # scalar context if (mysub( )) { } # scalar context @a = mysub( ); # list context print mysub( ); # list context 10.6.4 See Also The return and wantarray functions in Chapter 29 of Programming Perl and in perlfunc(1) [ Team LiB ] [ Team LiB ] Recipe 10.7 Passing by Named Parameter 10.7.1 Problem You want to make a function with many parameters that are easy to call so that programmers remember what the arguments do, rather than having to memorize their order. 10.7.2 Solution Name each parameter in the call: thefunc(INCREMENT => "20s", START => "+5m", FINISH => "+30m"); thefunc(START => "+5m", FINISH => "+30m"); thefunc(FINISH => "+30m"); thefunc(START => "+5m", INCREMENT => "15s"); Then in the subroutine, create a hash loaded up with default values plus the array of named pairs. sub thefunc { my %args = ( INCREMENT => '10s', FINISH => 0, START => 0, @_, # argument pair list goes here ); if ($args{INCREMENT} =~ /m$/ ) { ... } } 10.7.3 Discussion Functions whose arguments require a particular order work well for short argument lists, but as the number of parameters increases, it's awkward to make some optional or have default values. You can only leave out trailing arguments, never initial ones. A more flexible approach allows the caller to supply arguments using name-value pairs. The first element of each pair is the argument name; the second, its value. This makes for self-documenting code because you can see the parameters' intended meanings without having to read the full function definition. Even better, programmers using your function no longer have to remember argument order, and they can leave unspecified any extraneous, unused arguments. This works by having the function declare a private hash variable to hold the default parameter values. Put the current arguments, @_, after the default values, so the actual arguments override the defaults because of the order of the values in the assignment. 
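Because the arguments arrive as ordinary hash keys, you can also catch a caller's misspelled parameter names cheaply. Here's a minimal sketch that extends the same thefunc with such a check:
sub thefunc {
    my %args = ( INCREMENT => '10s', FINISH => 0, START => 0, @_ );
    my %known = map { $_ => 1 } qw(INCREMENT FINISH START);
    for my $key (keys %args) {
        die "thefunc: unknown parameter $key" unless $known{$key};
    }
    # ...
}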
A common variation on this is to preface the parameter name with a hyphen, intended to evoke the feel of command-line parameters: thefunc(-START => "+5m", -INCREMENT => "15s"); Ordinarily the hyphen isn't part of a bareword, but the Perl tokenizer makes an exception for the => operator to permit this style of function argument. 10.7.4 See Also Chapter 4 [ Team LiB ] [ Team LiB ] Recipe 10.8 Skipping Selected Return Values 10.8.1 Problem You have a function that returns many values, but you only care about some of them. The stat function is a classic example: you often want only one value from its long return list (mode, for instance). 10.8.2 Solution Either assign to a list that has undef in some positions: ($a, undef, $c) = func( ); or else take a slice of the return list, selecting only what you want: ($a, $c) = (func( ))[0,2]; 10.8.3 Discussion Using dummy temporary variables is wasteful; plus it feels artificial and awkward: ($dev,$ino,$DUMMY,$DUMMY,$uid) = stat($filename); A nicer style is to use undef instead of dummy variables to discard a value: ($dev,$ino,undef,undef,$uid) = stat($filename); Or you can take a slice, picking up just the values you care about: ($dev,$ino,$uid,$gid) = (stat($filename))[0,1,4,5]; If you want to put an expression into list context and discard all of its return values (calling it simply for side effects), you can assign this to the empty list: ( ) = some_function( ); This last strategy is rather like a list version of the scalar operator—it calls the function in list context, even in a place it wouldn't otherwise do so. You can get just a count of return values this way: $count = ( ) = some_function( ); or you can call it in list context and make sure it returns some non-zero number of items (which you immediately discard) like this: if (( ) = some_function( )) { .... } If you hadn't assigned to the empty list, the Boolean context of the if test would have called the function in scalar context. 10.8.4 See Also The section on "List Values and Arrays" in Chapter 2 of Programming Perl and perlsub(1); Recipe 3.1 [ Team LiB ] [ Team LiB ] Recipe 10.9 Returning More Than One Array or Hash 10.9.1 Problem You want a function to return more than one array or hash, but the return list flattens into just one long list of scalars. 10.9.2 Solution Return references to the hashes or arrays: ($array_ref, $hash_ref) = somefunc( ); sub somefunc { my @array; my %hash; # ... return ( \@array, \%hash ); } 10.9.3 Discussion Just as all arguments collapse into one flat list of scalars, return values do, too. Functions that want to return multiple, distinct arrays or hashes need to return those by reference, and the caller must be prepared to receive references. If a function wants to return three separate hashes, for example, it should use one of the following: sub fn { ..... return (\%a, \%b, \%c); # or return \(%a, %b, %c); # same thing } The caller must expect a list of hash references returned by the function. It cannot just assign to three hashes. (%h0, %h1, %h2) = fn( ); # WRONG! @array_of_hashes = fn( ); # eg: $array_of_hashes[2]{"keystring"} ($r0, $r1, $r2) = fn( ); # eg: $r2->{"keystring"} 10.9.4 See Also The general discussions on references in Chapter 11, and in Chapter 8 of Programming Perl; Recipe 10.5 [ Team LiB ] [ Team LiB ] Recipe 10.10 Returning Failure 10.10.1 Problem You want to return a value indicating that your function failed. 
10.10.2 Solution Use a bare return statement without any argument, which returns undef in scalar context and the empty list ( ) in list context. return; 10.10.3 Discussion A return without an argument means: sub empty_retval { return ( wantarray ? ( ) : undef ); } You can't use just return undef, because in list context you will get a list of one value: undef. If your caller says: if (@a = yourfunc( )) { ... } then the "error" condition will be perceived as true because @a will be assigned (undef) and then evaluated in scalar context. This yields 1, the number of elements assigned to @a, which is true. You could use the wantarray function to see what context you were called in, but a bare return is a clear and tidy solution that always works: unless ($a = sfunc( )) { die "sfunc failed" } unless (@a = afunc( )) { die "afunc failed" } unless (%a = hfunc( )) { die "hfunc failed" } Some of Perl's built-in functions have a peculiar return value. Both fcntl and ioctl have the curious habit of returning the string "0 but true" in some circumstances. (This magic string is conveniently exempt from nagging warnings about improper numerical conversions.) This has the advantage of letting you write code like this: ioctl(....) or die "can't ioctl: $!"; That way, code doesn't have to check for a defined zero as distinct from the undefined value, as it would for the read or glob functions. "0 but true" is zero when used numerically. It's rare that this kind of return value is needed. A more common (and spectacular) way to indicate failure in a function is to raise an exception, as described in Recipe 10.12. 10.10.4 See Also The undef, wantarray, and return functions in Chapter 29 of Programming Perl and in perlfunc(1); Recipe 10.12 [ Team LiB ] [ Team LiB ] Recipe 10.11 Prototyping Functions 10.11.1 Problem You want to use function prototypes so the compiler can check your argument types. 10.11.2 Solution Perl has something of a prototype facility, but it isn't what you're thinking. Perl's function prototypes are more like a context coercion used to write functions that behave like some Perl built-ins, such as push and pop. 10.11.3 Discussion Manually checking the validity of a function's arguments can't happen until runtime. If you make sure the function is declared before it is used, you can tickle the compiler into using a very limited form of prototype checking. But don't confuse Perl's function prototypes with those found in any other language. A Perl function prototype is zero or more spaces, backslashes, or type characters enclosed in parentheses after the subroutine definition or name. A backslashed type symbol means that the argument is passed by reference, and the argument in that position must start with that type character. A prototype can impose context on the prototyped function's arguments. This is done when Perl compiles your program. But this does not always mean that Perl checks the number or type of arguments; since a scalar prototype is like inserting a scalar in front of just one argument, sometimes an implicit conversion occurs instead. For example, if Perl sees func(3, 5) for a function prototyped as sub func ($), it will stop with a compile-time error. But if it sees func(@array) with the same prototype, it will merely put @array into scalar context instead of complaining that you passed an array, but it wanted a scalar. This is so important that it bears repeating: don't use Perl prototypes expecting the compiler to check type and number of arguments for you. 
It does a little bit of that, sometimes, but mostly it's about helping you type less, and sometimes to emulate the calling and parsing conventions of built-in functions. 10.11.3.1 Omitting parentheses Ordinarily your subroutines take a list of arguments, and you can omit parentheses on the function call if the compiler has already seen a declaration or definition for that function: @results = reverse myfunc 3, 5; Without prototypes, this is the same as: @results = reverse(myfunc(3, 5)); Without parentheses, Perl puts the righthand side of the subroutine call into list context. You can use prototypes to change this behavior. Here is a function that's prototyped to take just one argument: sub myfunc($); @results = reverse myfunc 3, 5; Now this is the same as: @results = reverse(myfunc(3), 5); Notice how the scalar prototype has altered the Perl parser! It grabs only the next thing it sees, leaving what remains for whatever other function is looking for arguments. A void prototype like: sub myfunc( ); will also alter the parser, causing no arguments to be passed to the function. This works just like the time built-in. That means that in the absence of parentheses, you cannot know what is going on by casual inspection. Things that look the same can quietly behave completely differently from one another. Consider these declarations and assignments: sub fn0( ); sub fn1($); sub fnN(@); $x = fn0 + 42; $x = fn1 + 42; $y = fnN fn1 + 42, fn0 + 42; $y = fnN fn0 + 42, fn1 + 42; $z = fn1 fn1 + 42, fn1 + 42; $z = fnN fnN + 42, fnN + 42; Astonishingly enough, those are parsed by the Perl compiler as though they'd been written this way: $x = fn0( ) + 42; $x = fn1(42); $y = fnN(fn1(42), fn0( ) + 42); $y = fnN(fn0( ) + 42, fn1(42)); $z = fn1(fn1(42)), fn1(42); $z = fnN(fnN(42, fnN(42))); Without first looking closely at the prototypes and then thinking really hard about how Perl's parser works, you'd never be able to predict that. Maintainability would suffer horribly. This is one strong argument for using more parentheses than might be demanded by purely precedential concerns (or, alternatively, this is an argument for avoiding prototypes). 10.11.3.2 Mimicking built-ins The other common use of prototypes is to give the convenient pass-without-flattening behavior of built-in functions like push and shift. When you call push as push(@array, 1, 2, 3) the function gets a reference to @array instead of the actual array. This is accomplished by backslashing the @ character in the prototype: sub mypush (\@@) { my $array_ref = shift; my @remainder = @_; # ... } The \@ in the prototype says "require the first argument to begin with an @ character, and pass it by reference." The second @ says "the rest of the arguments are a (possibly empty) list." A backslash in a prototype requires that the argument actually begin with the literal type character, which can sometimes be annoying. You can't even use the conditional ?: construct to pick which array to pass: mypush( $x > 10 ? @a : @b, 3, 5 ); # WRONG Instead, you must play games with references: mypush( @{ $x > 10 ? \@a : \@b }, 3, 5 ); # RIGHT (but ugly) Here's an hpush function that works like push, but on hashes. It uses a list of key-value pairs to add to an existing hash, overwriting any previous values associated with those keys. 
sub hpush(\%@) { my $href = shift; while ( my ($k, $v) = splice(@_, 0, 2) ) { $href->{$k} = $v; } } hpush(%pieces, "queen" => 9, "rook" => 5); You may also backslash several argument types simultaneously by using the \[ ] notation: sub mytie ( \[$@%&*] $; @ ) That function accepts any of the five types and passes it by reference, followed by one mandatory scalar context argument and an optional trailing list of remaining arguments. You can discover a particular function's prototype using the prototype built-in function. For example, calling prototype("hpush") given the previous definition would return the string "\%@". You can even find out a built-in's prototype this way—if it has one, that is. Not all core built-ins can be emulated. For those that can, the prototype function returns what their built-in prototype is. Since you can always call a core built-in function like int as CORE::int, built-ins are deemed to reside in package CORE. For example: for $func (qw/int reverse keys push open print/) { printf "Prototype for %s is %s\n", $func, prototype("CORE::$func") || "UNAVAILABLE"; } Prototype for int is ;$ Prototype for reverse is @ Prototype for keys is \% Prototype for push is \@@ Prototype for open is *;$@ Prototype for print is UNAVAILABLE 10.11.4 See Also The prototype function in perlfunc(1); the section on "Prototypes" in Chapter 6 of Programming Perl and in perlsub(1); Recipe 10.5 [ Team LiB ] [ Team LiB ] Recipe 10.12 Handling Exceptions 10.12.1 Problem How do you safely call a function that might raise an exception? How do you create a function that raises an exception? 10.12.2 Solution Sometimes you encounter a problem so exceptional that merely returning an error isn't strong enough, because the caller could unintentionally ignore the error. Use die STRING from your function to trigger an exception: die "some message"; # raise exception The caller can wrap the function call in an eval to intercept that exception, then consult the special variable $@ to see what happened: eval { func( ) }; if ($@) { warn "func raised an exception: $@"; } 10.12.3 Discussion Raising exceptions is not a facility to be used lightly. Most functions should return an error using a bare return statement. Wrapping every call in an exception trap is tedious and unsightly, removing the appeal of using exceptions in the first place. But, on rare occasions, failure in a function should cause the entire program to abort. Rather than calling the irrecoverable exit function, you should call die instead, which at least gives the programmer the chance to cope. If no exception handler has been installed via eval, then the program aborts at that point. To detect this, wrap the call to the function with a block eval. The $@ variable will be set to the offending exception if one occurred; otherwise, it will be false. eval { $val = func( ) }; warn "func blew up: $@" if $@; Any eval catches all exceptions, not just specific ones. Usually you should propagate unexpected exceptions to an enclosing handler. For example, suppose your function raised an exception containing the string "Full moon!". You could safely trap that exception while letting others through by inspecting the $@ variable. Calling die without an argument uses the contents of $@ to construct a new exception string. eval { $val = func( ) }; if ($@ && $@ !~ /Full moon!/) { die; # re-raise unknown errors } If the function is part of a module, consider using the Carp module and calling croak or confess instead of die. 
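For instance, a module function might reject bad arguments this way (set_rate is a made-up example, not a standard function):
use Carp;
sub set_rate {
    my $rate = shift;
    croak "set_rate: rate must be a positive number"    # blame the caller, not the module
        unless defined $rate && $rate > 0;
    # ...
}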
The only difference between die and croak is that with croak, the error appears to be from the caller's perspective, not the module's. The confess function, on the other hand, creates a full stack backtrace of who called whom and with what arguments. Another intriguing possibility is for the function to detect that its return value is being completely ignored because the function was called in a void context. If that were the case, returning an error indication would be useless, so raise an exception instead. if (defined wantarray( )) { return; } else { die "pay attention to my error!"; } Of course, just because it's not voided doesn't mean the return value is being dealt with appropriately. But if it is voided, it's certainly not being checked. There are CPAN modules that offer alternative ways of handling exceptions. The Error module offers try, catch, and throw notation instead of eval and die: use Error ':try'; try { something( ); } catch Error::Database with { my $e = shift; warn "Problem in " . $e->{'-database'} . " (caught)\n"; }; Error offers try, catch ... with, except, otherwise, and finally blocks for maximum flexibility in error handling. The Exception::Class module from CPAN lets you create classes of exceptions and objects to represent specific exceptions. The two modules can be combined so that you can catch these exception objects. 10.12.4 See Also The $@ ($EVAL_ERROR) variable in Chapter 28 of Programming Perl and perlvar(1); the die and eval functions in Chapter 29 of Programming Perl and perlfunc(1); the documentation for the CPAN modules Error and Exception::Class; Recipe 10.15; Recipe 12.2; Recipe 16.21 [ Team LiB ] [ Team LiB ] Recipe 10.13 Saving Global Values 10.13.1 Problem You need to temporarily save away the value of a global variable. 10.13.2 Solution Use the local operator to save a previous global value, automatically restoring it when the current block exits: our $age = 18; # declare and set global variable if (CONDITION) { local $age = 23; func( ); # sees temporary value of 23 } # Perl restores the old value at block exit 10.13.3 Discussion Despite its name, Perl's local operator does not create a local variable. That's what my does. Instead, local merely preserves an existing value for the duration of its enclosing block. Hindsight shows that if local had been called save_value instead, much confusion could have been avoided. Three places where you must use local instead of my are: 1. You need to give a global variable a temporary value, especially $_. 2. You need to create a local file or directory handle or a local function. 3. You want to temporarily change just one element of an array or hash. 10.13.3.1 Using local( ) for temporary values for globals The first situation is more apt to happen with predefined, built-in variables than with user variables. Often these are variables that Perl consults for hints for its high-level operations. In particular, any function that uses $_, implicitly or explicitly, should certainly have a local $_. This is annoyingly easy to forget to do. See Recipe 13.15 for one solution to this. Another common target for local is the $/ variable, a global that implicitly affects the behavior of the readline operator used in <FH> operations. $para = get_paragraph(*FH); # pass filehandle glob $para = get_paragraph(\*FH); # pass filehandle by glob reference $para = get_paragraph(*IO{FH}); # pass filehandle by IO reference 
sub get_paragraph { my $fh = shift; local $/ = ''; my $paragraph = <$fh>; chomp($paragraph); return $paragraph; } 10.13.3.2 Using local( ) for local handles The second situation used to arise whenever you needed a local filehandle or directory handle—or more rarely, a local function. $contents = get_motd( ); sub get_motd { local *MOTD; open(MOTD, "/etc/motd") or die "can't open motd: $!"; local $/ = undef; # slurp full file local $_ = <MOTD>; close (MOTD); return $_; } If you wanted to return the open filehandle, you'd use: return *MOTD; However, in modern releases of Perl, you would make use of the filehandle autovivification property: $contents = get_motd( ); sub get_motd { my $motd; # this will be filled in by the next line open($motd, "/etc/motd") or die "can't open motd: $!"; local $/ = undef; # slurp full file return scalar <$motd>; } When the function returns, the anonymous filehandle is automatically closed for you. However, if you'd chosen to return $motd, then it wouldn't be. This is explained more fully in the Introduction to Chapter 7. 10.13.3.3 Using local( ) on parts of aggregates The third situation is exceedingly rare, except for one common case. Because the local operator is really a "save value" operator, you can use it to save off just one element of an array or hash, even if that array or hash is itself a lexical! my @nums = (0 .. 5); sub first { local $nums[3] = 3.14159; second( ); } sub second { print "@nums\n"; } second( ); 0 1 2 3 4 5 first( ); 0 1 2 3.14159 4 5 The only common use for this kind of thing is for temporary signal handlers. sub first { local $SIG{INT} = 'IGNORE'; second( ); } Now while second is running, interrupt signals are ignored. When first returns, the previous value of $SIG{INT} is automatically restored. Although a lot of old code uses local, it's definitely something to steer clear of when it can be avoided. 
Because local still manipulates the values of global variables, not local variables, you'll run afoul of use strict unless you declared the globals using our or the older use vars.

The local operator produces dynamic scoping, or runtime scoping. This is in contrast with the other kind of scoping Perl supports, which is much more easily understood: the kind that my provides, known as lexical scoping, or sometimes as static or compile-time scoping. With dynamic scoping, a variable is accessible if it's found in the current scope—or in the scope of any frames (blocks) in its entire subroutine call stack, as determined at runtime. Any functions called have full access to dynamic variables, because they're still globals, just ones with temporary values. Only lexical variables are safe from such tampering.

Old code that says:

sub func {
    local($x, $y) = @_;
    # ....
}

can almost always be replaced without ill effect by the following:

sub func {
    my($x, $y) = @_;
    # ....
}

The only case where code can't be so upgraded is when it relies on dynamic scoping. That would happen if one function called another, and the latter relied upon access to the former's temporary versions of the global variables $x and $y. Code that handles global variables and expects strange action at a distance instead of using proper parameters is fragile at best. Good programmers avoid this kind of thing like the plague. (The solution is to explicitly pass values as parameters, rather than storing them in shared global variables.)

If you come across old code that uses:

&func(*Global_Array);
sub func {
    local(*aliased_array) = shift;
    for (@aliased_array) { .... }
}

this should probably be changed into something like this:

func(\@Global_Array);
sub func {
    my $array_ref = shift;
    for (@$array_ref) { .... }
}

They're using the old pass-the-typeglob strategy devised before Perl supported proper references. It's not a pretty thing.

10.13.4 See Also
The local, my, and our functions in Chapter 29 of Programming Perl and perlfunc(1); Chapter 6 of Programming Perl; the section on "Scoped Declarations" in Chapter 4 of Programming Perl; the sections on "Private Variables via my( )" and "Temporary Values via local( )" in perlsub(1); Recipe 10.2; Recipe 10.16

Recipe 10.14 Redefining a Function

10.14.1 Problem
You want to temporarily or permanently redefine a function, but functions can't be assigned to.

10.14.2 Solution
To redefine a function, assign a new code reference to the typeglob of the name of that function. Use local if you want this redefinition to be temporary.

undef &grow;                   # silence -w complaints of redefinition
*grow = \&expand;
grow( );                       # calls expand( )

{
    local *grow = \&shrink;    # only until this block exits
    grow( );                   # calls shrink( )
}

10.14.3 Discussion
Unlike a variable (but like named filehandles, directory handles, and formats), a named function cannot be directly assigned to. It's just a name and doesn't vary. You can manipulate it almost as though it were a variable, because you can directly manipulate the runtime symbol table using typeglobs like *foo to produce interesting aliasing effects.

Assigning a reference to a typeglob changes what is accessed the next time a symbol of the referent's type is needed. This is what the Exporter does when you import a function or variable from one package into another. Since this is direct manipulation of the package symbol table, it works only on package variables (globals), not lexicals.
*one::var = \%two::Table;    # make %one::var alias for %two::Table
*one::big = \&two::small;    # make &one::big alias for &two::small

A typeglob is one of those things you can only use local on, not my. If you do use local, the aliasing effect is then limited to the duration of the current block.

local *fred = \&barney;      # temporarily alias &fred to &barney

If the value assigned to a typeglob is not a reference but itself another typeglob, then all types by that name are aliased. The types aliased in a full typeglob assignment are scalar, array, hash, function, filehandle, directory handle, and format. That means that assigning *Top = *Bottom would make the current package variable $Top an alias for $Bottom, @Top for @Bottom, %Top for %Bottom, and &Top for &Bottom. It would even alias the corresponding file and directory handles and formats! You probably don't want to do this.

Use assignments to typeglobs together with closures to clone a bunch of similar functions cheaply and easily. Imagine you wanted a function for HTML generation to help with colors. For example:

$string = red("careful here");
print $string;
<FONT COLOR='red'>careful here</FONT>

You could write the red function this way:

sub red { "<FONT COLOR='red'>@_</FONT>" }

If you need more colors, you could do something like this:

sub color_font {
    my $color = shift;
    return "<FONT COLOR='$color'>@_</FONT>";
}
sub red    { color_font("red", @_)    }
sub green  { color_font("green", @_)  }
sub blue   { color_font("blue", @_)   }
sub purple { color_font("purple", @_) }
# etc

The similar nature of these functions suggests that there may be a way to factor out the common bit. To do this, use an assignment to an indirect typeglob. If you're running with the highly recommended use strict pragma, you must first disable strict "refs" for that block.

@colors = qw(red blue green yellow orange purple violet);
for my $name (@colors) {
    no strict 'refs';
    *$name = sub { "<FONT COLOR='$name'>@_</FONT>" };
}

These functions all seem independent, but the real code was compiled only once. This technique saves on compile time and memory use. To create a proper closure, any variables in the anonymous subroutine must be lexicals. That's the reason for the my on the loop iteration variable.

This is one of the few places where giving a prototype to a closure is sensible. If you wanted to impose scalar context on the arguments of these functions (probably not a wise idea), you could have written it this way instead:

*$name = sub ($) { "<FONT COLOR='$name'>$_[0]</FONT>" };

However, since prototype checking happens at compile time, the preceding assignment happens too late to be useful. So, put the whole loop of assignments within a BEGIN block, forcing it to occur during compilation. You really want to use a BEGIN here, not an INIT, because you're doing something that you want the compiler itself to notice right away, not something for the interpreter to do just before your program runs.

10.14.4 See Also
The sections on "Symbol Tables" in Chapter 10 of Programming Perl and in perlmod(1); the sections on "Closures" and "Symbol Table References" in Chapter 8 of Programming Perl; the discussion of closures in perlref(1); Recipe 10.11; Recipe 11.4

Recipe 10.15 Trapping Undefined Function Calls with AUTOLOAD

10.15.1 Problem
You want to intercept calls to undefined functions so you can handle them gracefully.

10.15.2 Solution
Declare a function called AUTOLOAD for the package whose undefined function calls you'd like to trap. While running, that package's $AUTOLOAD variable contains the name of the undefined function being called.
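For instance, a minimal sketch of such a trap might look like this (the warning message and fallback behavior are invented for illustration):

sub AUTOLOAD {
    our $AUTOLOAD;                      # fully qualified name of the missing function
    my $name = $AUTOLOAD;
    return if $name =~ /::DESTROY$/;    # don't trap object destruction
    warn "no such function $name, called with: @_\n";
    return;
}

Any call to an undefined function in this package now lands in AUTOLOAD instead of raising an exception, with the original arguments still in @_.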
10.15.3 Discussion
Another strategy for creating similar functions is to use a proxy function. If you call an undefined function, instead of automatically raising an exception, you can trap the call. If the function's package has a function named AUTOLOAD, then this function is called in its place, with the special package global $AUTOLOAD set to the package-qualified function name. The AUTOLOAD subroutine can then do whatever that function would do.

sub AUTOLOAD {
    my $color = our $AUTOLOAD;
    $color =~ s/.*:://;
    return "<FONT COLOR='$color'>@_</FONT>";
}
# note: sub chartreuse isn't defined.
print chartreuse("stuff");

When the nonexistent main::chartreuse function is called, rather than raising an exception, main::AUTOLOAD is called with the same arguments as you passed chartreuse. The package variable $AUTOLOAD would contain the string main::chartreuse because that's the function it's proxying.

The technique using typeglob assignments shown in Recipe 10.14 is faster and more flexible than using AUTOLOAD. It's faster because you don't have to run the copy and substitute. It's more flexible because it lets you do this:

{
    local *yellow = \&violet;
    local (*red, *green) = (\&green, \&red);
    print_stuff( );
}

While print_stuff( ) is running, including from within any functions it calls, anything printed in yellow will come out violet, and the red and green texts will exchange colors. Aliasing subroutines like this won't handle calls to undefined subroutines. AUTOLOAD does.

10.15.4 See Also
The section on "Autoloading" in Chapter 10 of Programming Perl and in perlsub(1); the documentation for the standard modules AutoLoader and AutoSplit; Recipe 10.12; Recipe 12.11; Recipe 13.12

Recipe 10.16 Nesting Subroutines

10.16.1 Problem
You want subroutines to nest, such that one subroutine is visible and callable only from another. When you try the obvious approach of nesting sub FOO { sub BAR { } ... }, Perl gives warnings about variables that will not stay shared.

10.16.2 Solution
Instead of making the inner functions normal subroutines, make them closures and temporarily assign their references to the typeglob of the right name to create a localized function.

10.16.3 Discussion
If you use nested subroutines in other programming languages with their own private variables, you'll have to work at it a bit in Perl. The intuitive coding of this kind of thing gives the warning "will not stay shared." For example, this won't work:

sub outer {
    my $x = $_[0] + 35;
    sub inner { return $x * 19 }    # WRONG
    return $x + inner( );
}

The following is a workaround:

sub outer {
    my $x = $_[0] + 35;
    local *inner = sub { return $x * 19 };
    return $x + inner( );
}

Now inner( ) can be called only from within outer( ) because of the temporary assignments of the closure. Once called, it has normal access to the lexical variable $x from the scope of outer( ). This essentially creates a function local to another function, something not directly supported in Perl; however, the programming isn't always clear.

10.16.4 See Also
The sections on "Symbol Tables" in Chapter 10 in Programming Perl and in perlmod(1); the sections on "Closures" and "Symbol Table References" in Chapter 8 of Programming Perl and the discussion of closures in perlref(1); Recipe 10.13; Recipe 11.4

Recipe 10.17 Writing a Switch Statement

10.17.1 Problem
You want to write a multiway branch statement, much as you can in C using its switch statement or in the shell using case—but Perl seems to support neither.
10.17.2 Solution
Use the Switch module, standard as of the v5.8 release of Perl.

use Switch;
switch ($value) {
    case 17         { print "number 17"       }
    case "snipe"    { print "a snipe"         }
    case /[a-f]+/i  { print "pattern matched" }
    case [1..10,42] { print "in the list"     }
    case (@array)   { print "in the array"    }
    case (%hash)    { print "in the hash"     }
    else            { print "no case applies" }
}

10.17.3 Discussion
The Switch module extends Perl's basic syntax by providing a powerful and flexible switch construct. In fact, it's so powerful and flexible that instead of a complete description of how it works, we'll instead provide examples of some common uses. For the full story, make sure to consult the documentation that accompanies the module.

A switch takes an argument and a mandatory block, within which can occur any number of cases. Each of those cases also takes an argument and a mandatory block. The arguments to each case can vary in type, allowing (among many other things) any or all of string, numeric, or regex comparisons against the switch's value. When the case is an array or hash (or reference to the same), the case matches if the switch value corresponds to any of the array elements or hash keys. If no case matches, a trailing else block will be executed.

Unlike certain languages' multiway branching constructs, here once a valid case is found and its block executed, control transfers out of the enclosing switch. In other words, there's no implied fall-through behavior the way there is in C. This is considered desirable because even the best of programmers will occasionally forget about fall-through. However, this is Perl, so you can have your cake and eat it, too. Just use a next from within a switch to transfer control to the next case. Consider:

%traits = (pride => 2, sloth => 3, hope => 14);
switch (%traits) {
    case "impatience"                   { print "Hurry up!\n";       next }
    case ["laziness","sloth"]           { print "Maybe tomorrow!\n"; next }
    case ["hubris","pride"]             { print "Mine's best!\n";    next }
    case ["greed","cupidity","avarice"] { print "More more more!";   next }
}
Maybe tomorrow!
Mine's best!

Because each case has a next, it doesn't just do the first one it finds, but goes on for further tests. The next can be conditional, too, allowing for conditional fall through.

You might have noticed something else interesting about that previous example: the argument to the switch wasn't a scalar; it was the %traits hash. It turns out that you can switch on other things than scalars. In fact, both case and switch accept nearly any kind of argument. The behavior varies depending on the particular combination. Here, the strings from each of those cases are taken as keys to index into the hash we're switching on.

If you find yourself preferring fall-through as the default, you can have that, too:

use Switch 'fallthrough';
%traits = (pride => 2, sloth => 3, hope => 14);
switch (%traits) {
    case "impatience"                   { print "Hurry up!\n"       }
    case ["laziness","sloth"]           { print "Maybe tomorrow!\n" }
    case ["hubris","pride"]             { print "Mine's best!\n"    }
    case ["greed","cupidity","avarice"] { print "More more more!"   }
}

One area where a bunch of cascading ifs would still seem to excel is when each test involves a different expression, and those expressions are more complex than a simple string, numeric, or pattern comparison.
For example:

if    ($n % 2 == 0) { print "two "   }
elsif ($n % 3 == 0) { print "three " }
elsif ($n % 5 == 0) { print "five "  }
elsif ($n % 7 == 0) { print "seven " }

Or if you want more than one test to be able to apply, you can do this with fall-through behavior:

if ($n % 2 == 0) { print "two "   }
if ($n % 3 == 0) { print "three " }
if ($n % 5 == 0) { print "five "  }
if ($n % 7 == 0) { print "seven " }

Perl's switch can handle this too, but you have to be a bit more careful. For a case item to be an arbitrary expression, wrap that expression in a subroutine. That subroutine is called with the switch argument as the subroutine's argument. If the subroutine returns a true value, then the case is satisfied.

use Switch 'fallthrough';
$n = 30;
print "Factors of $n include: ";
switch ($n) {
    case sub{$_[0] % 2 == 0} { print "two "   }
    case sub{$_[0] % 3 == 0} { print "three " }
    case sub{$_[0] % 5 == 0} { print "five "  }
    case sub{$_[0] % 7 == 0} { print "seven " }
}

That's pretty cumbersome to write—and to read—but with a little bit of highly magical syntactic sugar, even that clumsiness goes away. If you import the __ subroutine (yes, that really is a double underscore), you can use that in an expression as the case target, and the __ will represent the value being switched on. For example:

use Switch qw( __ fallthrough );
$n = 30;
print "Factors of $n include: ";
switch ($n) {
    case __ % 2 == 0 { print "two "   }
    case __ % 3 == 0 { print "three " }
    case __ % 5 == 0 { print "five "  }
    case __ % 7 == 0 { print "seven " }
}
print "\n";

Due to the way that __ is implemented, some restrictions on its use apply. The main one is that your expression can't use && or || in it.

Here's one final trick with switch. This time, instead of having a scalar in the switch and subroutines in the cases, let's do it the other way around. You can switch on a subroutine reference; each case value will be passed into that subroutine, and if the sub returns a true value, then the case is deemed to have matched and its code block executed. That makes the factor example read:

use Switch qw(fallthrough);
$n = 30;
print "Factors of $n include: ";
switch (sub {$n % $_[0] == 0} ) {
    case 2 { print "two "   }
    case 3 { print "three " }
    case 5 { print "five "  }
    case 7 { print "seven " }
}

This is probably the most aesthetically pleasing way of writing it, since there's no longer duplicate code on each line.

The Switch module uses a facility called source filters to emulate behavior anticipated in Perl 6 (whenever that might be). This has been known to cause mysterious compilation errors if you use constructs in your code you were warned against. You should therefore pay very close attention to the section on "Dependencies, Bugs, and Limitations" in the Switch manpage.

10.17.4 See Also
The documentation for the Switch module; the perlsyn(1) manpage's section on "Basic BLOCKs and Switch Statements"; the section on "Case Statements" in Chapter 4 of Programming Perl

Recipe 10.18 Program: Sorting Your Mail

The program in Example 10-1 sorts a mailbox by subject by reading input a paragraph at a time, looking for one with a "From " at the start of a line. When it finds one, it searches for the subject, strips it of any "Re: " marks, and stores its lowercased version in the @sub array. Meanwhile, the messages themselves are stored in a corresponding @msgs array. The $msgno variable keeps track of the message number.

Example 10-1. bysub1
#!/usr/bin/perl
# bysub1 - simple sort by subject
my(@msgs, @sub);
my $msgno = -1;
$/ = '';                # paragraph reads
while (<>) {
    if (/^From/m) {
        /^Subject:\s*(?:Re:\s*)*(.*)/mi;
        $sub[++$msgno] = lc($1) || '';
    }
    $msgs[$msgno] .= $_;
}
for my $i (sort { $sub[$a] cmp $sub[$b] || $a <=> $b } (0 .. $#msgs)) {
    print $msgs[$i];
}

That sort is only sorting array indices. If the subjects are the same, cmp returns 0, so the second part of the || is taken, which compares the message numbers in the order they originally appeared. If sort were fed a list like (0,1,2,3), that list would get sorted into a different permutation, perhaps (2,1,3,0). We iterate across them with a for loop to print out each message.

Example 10-2 shows how an awk programmer might code this program, using the -00 switch to read paragraphs instead of lines.

Example 10-2. bysub2

#!/usr/bin/perl -n00
# bysub2 - awkish sort-by-subject
INIT { $msgno = -1 }
$sub[++$msgno] = (/^Subject:\s*(?:Re:\s*)*(.*)/mi)[0] if /^From/m;
$msg[$msgno] .= $_;
END { print @msg[ sort { $sub[$a] cmp $sub[$b] || $a <=> $b } (0 .. $#msg) ] }

Perl programmers have used parallel arrays like this since Perl 1. Keeping each message in a hash is a more elegant solution, though. We'll sort on each field in the hash, by making an anonymous hash as described in Chapter 11. Example 10-3 is a program similar in spirit to Example 10-1 and Example 10-2.

Example 10-3. bysub3

#!/usr/bin/perl -00
# bysub3 - sort by subject using hash records
use strict;
my @msgs = ( );
while (<>) {
    push @msgs, {
        SUBJECT => /^Subject:\s*(?:Re:\s*)*(.*)/mi,
        NUMBER  => scalar @msgs,    # which msgno this is
        TEXT    => '',
    } if /^From/m;
    $msgs[-1]{TEXT} .= $_;
}
for my $msg (sort {
                 $a->{SUBJECT} cmp $b->{SUBJECT}
                     ||
                 $a->{NUMBER} <=> $b->{NUMBER}
             } @msgs)
{
    print $msg->{TEXT};
}

Once you have real hashes, adding further sorting criteria is simple. A common way to sort a folder is subject major, date minor order. The hard part is figuring out how to parse and compare dates. Date::Manip does this, returning a string you can compare; however, the datesort program in Example 10-4, which uses Date::Manip, runs more than 10 times slower than the previous one. Parsing dates in unpredictable formats is extremely slow.

Example 10-4. datesort

#!/usr/bin/perl -00
# datesort - sort mbox by subject then date
use strict;
use Date::Manip;
my @msgs = ( );
while (<>) {
    next unless /^From/m;
    my $date = '';
    if (/^Date:\s*(.*)/m) {
        ($date = $1) =~ s/\s+\(.*//;    # library hates (MST)
        $date = ParseDate($date);
    }
    push @msgs, {
        SUBJECT => /^Subject:\s*(?:Re:\s*)*(.*)/mi,
        DATE    => $date,
        NUMBER  => scalar @msgs,
        TEXT    => '',
    };
} continue {
    $msgs[-1]{TEXT} .= $_;
}
for my $msg (sort {
                 $a->{SUBJECT} cmp $b->{SUBJECT}
                     ||
                 $a->{DATE} cmp $b->{DATE}
                     ||
                 $a->{NUMBER} <=> $b->{NUMBER}
             } @msgs)
{
    print $msg->{TEXT};
}

Example 10-4 is written to draw attention to the continue block. When a loop's end is reached, either because it fell through to that point or got there from a next, the whole continue block is executed. It corresponds to the third portion of a three-part for loop, except that the continue block isn't restricted to an expression. It's a full block, with separate statements.

10.18.1 See Also
The sort function in Chapter 29 of Programming Perl and in perlfunc(1); the discussion of the $/ ($RS, $INPUT_RECORD_SEPARATOR) variable in Chapter 28 of Programming Perl, in perlvar(1), and in the Introduction to Chapter 8; Recipe 3.7; Recipe 4.16; Recipe 5.10; Recipe 11.9
Chapter 11. References and Records

With as little a web as this will I ensnare as great a fly as Cassio.
—Shakespeare, Othello, Act II, scene i

Introduction

Perl provides three fundamental data types: scalars, arrays, and hashes. It's certainly possible to write many programs without complex records, but most programs need something more sophisticated than simple variables and lists. Perl's three built-in types combine with references to produce arbitrarily complex and powerful data structures. Selecting the proper data structure and algorithm can make the difference between an elegant program that does its job quickly and an ungainly concoction that's glacially slow to execute and consumes system resources voraciously.

The first part of this chapter shows how to create and use plain references. The second part shows how to create higher-order data structures out of references.

References

To grasp the concept of references, you must first understand how Perl stores values in variables. Each defined variable has associated with it a name and the address of a chunk of memory. This idea of storing addresses is fundamental to references because a reference is a value that holds the location of another value. The scalar value that contains the memory address is called a reference. Whatever value lives at that memory address is called its referent. See Figure 11-1.

Figure 11-1. Reference and referent

The referent could be any built-in type (scalar, array, hash, ref, code, or glob) or a user-defined type based on one of the built-ins. Referents in Perl are typed. This means, for example, that you can't treat a reference to an array as though it were a reference to a hash. Attempting to do so raises a runtime exception. No mechanism for type casting exists in Perl. This is considered a feature.

So far, it may look as though a reference were little more than a raw address with strong typing. But it's far more than that. Perl takes care of automatic memory allocation and deallocation (garbage collection) for references, just as it does for everything else. Every chunk of memory in Perl has a reference count associated with it, representing how many places know about that referent. The memory used by a referent is not returned to the process's free pool until its reference count reaches zero. This ensures that you never have a reference that isn't valid—no more core dumps and general protection faults from mismanaged pointers as in C.

Freed memory is returned to Perl for later use, but few operating systems reclaim it and decrease the process's memory footprint. This is because most memory allocators use a stack, and if you free up memory in the middle of the stack, the operating system can't take it back without moving the rest of the allocated memory around. That would destroy the integrity of your pointers and blow XS code out of the water.

To follow a reference to its referent, preface the reference with the appropriate type symbol for the data you're accessing. For instance, if $sref is a reference to a scalar, you can say:

print $$sref;       # prints the scalar value that the reference $sref refers to
$$sref = 3;         # assigns to $sref's referent

To access one element of an array or hash whose reference you have, use the infix pointer-arrow notation, as in $rv->[37] or $rv->{"wilma"}. Besides dereferencing array references and hash references, the arrow is also used to call an indirect function through its reference, as in $code_ref->("arg1", "arg2"); this is discussed in Recipe 11.4.
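Putting those pieces together, here is a small, self-contained illustration; the variable names are invented:

my @primes = (2, 3, 5, 7);
my $aref   = \@primes;              # reference to the array
print $aref->[2], "\n";             # prints 5, arrow notation
print ${$aref}[2], "\n";            # same element, block notation

my $cref = sub { join " ", @_ };    # reference to an anonymous function
print $cref->("called", "indirectly"), "\n";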
If you're using an object, use an arrow to call a method, $object->methodname("arg1", "arg2"), as shown in Chapter 13.

Perl's syntax rules make dereferencing complex expressions tricky—it falls into the category of "hard things that should be possible." Mixing right associative and left associative operators doesn't work out well. For example, $$x[4] is the same as $x->[4]; that is, it treats $x as a reference to an array and then extracts element number four from that. This could also have been written ${$x}[4]. If you really meant "take the fifth element of @x and dereference it as a scalar reference," then you need to use ${$x[4]}. Avoid putting two type signs ($@%&) side by side, unless it's simple and unambiguous like %hash = %$hashref. In the simple cases using $$sref in the previous example, you could have written:

print ${$sref};           # prints the scalar $sref refers to
${$sref} = 3;             # assigns to $sref's referent

For safety, some programmers use this notation exclusively.

When passed a reference, the ref function returns a string describing its referent. (It returns false if passed a non-reference.) This string is usually one of SCALAR, ARRAY, HASH, or CODE, although the other built-in types of GLOB, REF, IO, Regexp, and LVALUE also occasionally appear. If you call ref on a non-reference, it returns an empty string. If you call ref on an object (a reference whose referent has been blessed), it returns the class the object was blessed into: CGI, IO::Socket, or even ACME::Widget.

Create references in Perl by using a backslash on things already there, or dynamically allocate new things using the [ ], { }, and sub { } composers. The backslash operator is simple to use: put it before whatever you want a reference to. For instance, if you want a reference to the contents of @array, just say:

$aref = \@array;

You can even create references to constant values; future attempts to change the value of the referent cause a runtime exception:

$pi = \3.14159;
$$pi = 4;           # runtime error

Anonymous Data

Using a backslash to produce references to existing, named variables is simple enough for implementing pass-by-reference semantics in subroutine calls, but for creating complex data structures on the fly, it quickly becomes cumbersome. You don't want to be bogged down by having to invent a unique name for each subsection of the large, complex data structure. Instead, you allocate new, nameless arrays and hashes (or scalars or functions) on demand, growing your structure dynamically.

Explicitly create anonymous arrays and hashes with the [ ] and { } composers. This notation allocates a new array or hash, initializes it with any data values listed between the pair of square or curly brackets, and returns a reference to the newly allocated aggregate:

$aref = [ 3, 4, 5 ];                              # new anonymous array
$href = { "How" => "Now", "Brown" => "Cow" };     # new anonymous hash

Perl also implicitly creates anonymous data types through autovivification. This occurs when you indirectly store data through a variable that's currently undefined; that is, you treat that variable as though it holds the reference type appropriate for that operation. When you do so, Perl allocates the needed array or hash and stores its reference in the previously undefined variable.

undef $aref;
@$aref = (1, 2, 3);
print $aref;
ARRAY(0x80c04f0)

See how we went from an undefined variable to one with an array reference in it without explicitly assigning that reference? Perl filled in the reference for us.
This property lets code like the following work correctly, even as the first statement in your program, all without declarations or allocations:

$a[4][23][53][21] = "fred";
print $a[4][23][53][21];
fred
print $a[4][23][53];
ARRAY(0x81e2494)
print $a[4][23];
ARRAY(0x81e0748)
print $a[4];
ARRAY(0x822cd40)

Table 11-1 shows mechanisms for producing references to both named and anonymous scalars, arrays, hashes, functions, and typeglobs. (See the discussion of filehandle autovivification in the Introduction to Chapter 7 for a discussion of anonymous filehandles.)

Table 11-1. Syntax for named and anonymous values

    Reference to    Named           Anonymous
    Scalar          \$scalar        \do{my $anon}
    Array           \@array         [ LIST ]
    Hash            \%hash          { LIST }
    Code            \&function      sub { CODE }
    Glob            *symbol         open(my $handle, ...); $handle

Figure 11-2 and Figure 11-3 illustrate the differences between named and anonymous values. Figure 11-2 shows named values, and Figure 11-3 shows anonymous ones.

Figure 11-2. Named values

Figure 11-3. Anonymous values

In other words, saying $a = \$b makes $$a and $b the same piece of memory. If you say $$a = 3, then $b is set to 3, even though you only mentioned $a by name, not $b.

All references evaluate to true when used in Boolean context. That way a subroutine that normally returns a reference can indicate an error by returning undef.

sub cite {
    my (%record, $errcount);
    ...
    return $errcount ? undef( ) : \%record;
}
$op_cit = cite($ibid)    or die "couldn't make a reference";

Without an argument, undef produces an undefined value. But passed a variable or function as its argument, the undef operator renders that variable or function undefined when subsequently tested with the defined function. However, this does not necessarily free memory, call object destructors, etc. It just decrements its referent's reference count by one.

my ($a, $b) = ("Thing1", "Thing2");
$a = \$b;
undef $b;

Memory isn't freed yet, because you can still reach "Thing2" indirectly using its reference in $a. "Thing1", however, is completely gone, having been recycled as soon as $a was assigned \$b.

Although memory allocation in Perl is sometimes explicit and sometimes implicit, memory deallocation is nearly always implicit. You don't routinely have cause to undefine variables. Just let lexical variables (those declared with my) evaporate when their scope terminates; the next time you enter that scope, those variables will be new again. For global variables (those declared with our, fully qualified by their package name, or imported from a different package) that you want reset, it normally suffices to assign the empty list to an aggregate variable or a false value to a scalar one.

It has been said that there exist two opposing schools of thought regarding memory management in programming. One school holds that memory management is too important a task to be left to the programming language, while the other judges it too important to be left to the programmer. Perl falls solidly in the second camp, since if you never have to remember to free something, you can never forget to do so. As a rule, you need rarely concern yourself with freeing any dynamically allocated storage in Perl,[1] because memory management—garbage collection, if you would—is fully automatic. Recipe 11.15 and Recipe 13.13, however, illustrate exceptions to this rule.

[1] External subroutines compiled in C notwithstanding.

Records

The predominant use of references in Perl is to circumvent the restriction that arrays and hashes may hold scalars only. References are scalars, so to make an array of arrays, make an array of array references. Similarly, hashes of hashes are implemented as hashes of hash references, arrays of hashes as arrays of hash references, hashes of arrays as hashes of array references, and so on.
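For example, a two-dimensional structure is simply an array whose elements are array references (a small illustration; the data is made up):

my @matrix = (
    [ 1, 2, 3 ],              # each row is an anonymous array
    [ 4, 5, 6 ],
);
print $matrix[1][2], "\n";    # prints 6 (row 1, column 2)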
Once you have these complex structures, you can use them to implement records. A record is a single logical unit comprising various different attributes. For instance, a name, an address, and a birthday might compose a record representing a person. C calls such things structs, and Pascal calls them RECORDs. Perl doesn't have a particular name for these because you can implement this notion in different ways.

The most common technique in Perl is to treat a hash as a record, where the keys of the hash are the record's field names and the values of the hash are those fields' values. For instance, we might create a "person" record like this:

$person = {
    "Name"     => "Leonhard Euler",
    "Address"  => "1729 Ramanujan Lane\nMathworld, PI 31416",
    "Birthday" => 0x5bb5580,
};

Because $person is a scalar, it can be stored in an array or hash element, thus creating groups of people. Now apply the array and hash techniques from Chapter 4 and Chapter 5 to sort the sets, merge hashes, pick a random record, and so on.

The attributes of a record, including the "person" record, are always scalars. You can certainly use numbers as readily as strings there, but that's no great trick. The real power play happens when you use even more references for values in the record. "Birthday", for instance, might be stored as an anonymous array with three elements: day, month, and year. You could then say $person->{"Birthday"}->[0] to access just the day field. Or a date might be represented as a hash record, which would then lend itself to access such as $person->{"Birthday"}->{"day"}. Adding references to your collection of skills makes possible many more complex and useful programming strategies.

At this point, we've conceptually moved beyond simple records. We're now creating elaborate data structures that represent complicated relationships between the data they hold. Although we can use these to implement traditional data structures like linked lists, recipes in the second half of this chapter don't deal specifically with any particular structure. Instead, they give generic techniques for loading, printing, copying, and saving generic data structures. The final program example demonstrates creating binary trees.

See Also

Chapters 8 and 9 of Programming Perl; perlref(1), perlreftut(1), perllol(1), and perldsc(1)

Recipe 11.1 Taking References to Arrays

11.1.1 Problem
You need to manipulate an array by reference.
11.1.2 Solution
To get a reference to an array:

$aref                = \@array;
$anon_array          = [1, 3, 5, 7, 9];
$anon_copy           = [ @array ];
@$implicit_creation  = (2, 4, 6, 8, 10);

To dereference an array reference, precede it with an at sign (@):

push(@$anon_array, 11);

Or use a pointer arrow plus a bracketed subscript for a particular element:

$two = $implicit_creation->[0];

To get the last index number by reference, or the number of items in that referenced array:

$last_idx  = $#$aref;
$num_items = @$aref;

Or defensively embracing and forcing context:

$last_idx  = $#{ $aref };
$num_items = scalar @{ $aref };

11.1.3 Discussion
Here are array references in action:

# check whether $someref contains a simple array reference
if (ref($someref) ne "ARRAY") {
    die "Expected an array reference, not $someref\n";
}

print "@{$array_ref}\n";          # print original data
@order = sort @{ $array_ref };    # sort it
push @{ $array_ref }, $item;      # append new element to orig array

If you can't decide whether to use a reference to a named array or to create a new one, here's a simplistic guideline that will prove right more often than not. Only take a reference to an existing array either to return the reference out of scope, thereby creating an anonymous array, or to pass the array by reference to a function. For virtually all other cases, use [@array] to create a new array reference with a copy of the old values.

Automatic reference counting and the backslash operator make a powerful combination:

sub array_ref {
    my @array;
    return \@array;
}
$aref1 = array_ref( );
$aref2 = array_ref( );

Each time array_ref is called, the function allocates a new piece of memory for @array. If we hadn't returned a reference to @array, its memory would have been freed when its block, the subroutine, ended. But here a reference to @array is still accessible, so Perl doesn't free that storage, and we wind up with a reference to a piece of memory no longer accessible through the symbol table. Such a piece of memory is called anonymous because it has no name associated with it.

To access a particular element of the array referenced by $aref, you could write $$aref[4], but $aref->[4] is the same thing, and clearer.

print $array_ref->[$N];      # access item in position N (best)
print $$array_ref[$N];       # same, but confusing
print ${$array_ref}[$N];     # same, but still confusing, and ugly to boot

If you have an array reference, you can only access a slice of the referenced array in this way:

@$pie[3..5];       # array slice, but a little confusing to read
@{$pie}[3..5];     # array slice, easier (?) to read

Array slices, even when accessed through array references, are assignable. In the next line, the array dereference happens first, then the slice:

@{$pie}[3..5] = ("blackberry", "blueberry", "pumpkin");

An array slice is just syntactic sugar for a list of individual array elements. Because you can't take a reference to a list, you can't take a reference to an array slice:

$sliceref = \@{$pie}[3..5];    # WRONG!

To iterate through the entire array, loop with foreach or for:

foreach $item ( @{$array_ref} ) {
    # $item has data
}
for ($idx = 0; $idx <= $#{ $array_ref }; $idx++) {
    # $array_ref->[$idx] has data
}

11.1.4 See Also
Chapters 8 and 9 of Programming Perl; perlref(1), perlreftut(1), and perllol(1); Recipe 2.13; Recipe 4.6

Recipe 11.2 Making Hashes of Arrays

11.2.1 Problem
For each key in a hash, only one scalar value is allowed, but you'd like to use one key to store and retrieve multiple values. That is, you'd like the value to produce a list.
11.2.2 Solution
Use references to arrays as the hash values. Use push to append:

push(@{ $hash{"KEYNAME"} }, "new value");

Then, dereference the value as an array reference when printing out the hash:

foreach $string (keys %hash) {
    print "$string: @{$hash{$string}}\n";
}

11.2.3 Discussion
You can only store scalar values in a hash. References, however, are scalars. This solves the problem of storing multiple values for one key by making $hash{$key} a reference to an array containing the values for $key. Normal hash operations acting on individual scalar values (insertion, deletion, iteration, and testing for existence) are now written with array operations acting on lists of values (like push, splice, and foreach).

Here's how to give a key many values:

$hash{"a key"} = [ 3, 4, 5 ];    # anonymous array

Once you have a key with many values, here's how to use them:

@values = @{ $hash{"a key"} };

To append a new value to the array of values associated with a particular key, use push:

push @{ $hash{"a key"} }, $value;

One common application of such data structures is inverting a hash that may have several keys with the same associated value. When inverted, you end up with a hash that has many values for the same key. This is addressed in Recipe 5.9.

Be warned that this:

@residents = @{ $phone2name{$number} };

causes a runtime exception under use strict because you're dereferencing an undefined reference where autovivification won't occur. You must do this instead:

@residents = exists( $phone2name{$number} )
                 ? @{ $phone2name{$number} }
                 : ( );

11.2.4 See Also
The section on "Hashes of Arrays" in Chapter 9 of Programming Perl and in perldsc(1); the section on "Symbolic References" in Chapter 8 of Programming Perl; Recipe 5.9; Tie Example: Make a Hash That Always Appends in Recipe 13.15

Recipe 11.3 Taking References to Hashes

11.3.1 Problem
You need to manipulate a hash by reference. This might be because it was passed into a function that way or because it's part of a larger data structure.

11.3.2 Solution
To get a hash reference:

$href = \%hash;
$anon_hash = { "key1" => "value1", "key2" => "value2", ... };
$anon_hash_copy = { %hash };

To dereference a hash reference:

%hash  = %$href;
$value = $href->{$key};
@slice = @$href{$key1, $key2, $key3};    # note: no arrow!
@keys  = keys %$href;

To check whether something is a hash reference:

if (ref($someref) ne "HASH") {
    die "Expected a hash reference, not $someref\n";
}

11.3.3 Discussion
This example prints out all keys and values from two predefined hashes:

foreach $href ( \%ENV, \%INC ) {    # OR: for $href ( \(%ENV,%INC) ) {
    foreach $key ( keys %$href ) {
        print "$key => $href->{$key}\n";
    }
}

Access slices of hashes by reference as you'd access slices of arrays by reference. For example:

@values = @$hash_ref{"key1", "key2", "key3"};

for $val (@$hash_ref{"key1", "key2", "key3"}) {
    $val += 7;    # add 7 to each value in hash slice
}

11.3.4 See Also
The Introduction in Chapter 5; Chapter 8 of Programming Perl; perlref(1); Recipe 11.9

Recipe 11.4 Taking References to Functions

11.4.1 Problem
You need to manipulate a subroutine by reference. This might happen if you need to create a signal handler, a Tk callback, or a hash of function pointers.

11.4.2 Solution
To get a code reference:

$cref = \&func;
$cref = sub { ...
};

To call a code reference:

@returned = $cref->(@arguments);
@returned = &$cref(@arguments);

11.4.3 Discussion
If the name of a function is func, you can produce a reference to it by prefixing its name with \&. You can also dynamically allocate anonymous functions using the sub { } notation. These code references can be stored just like any other reference.

It is possible to store the name of a function in a variable, such as:

$funcname = "thefunc";
&$funcname( );

but that's not a very good solution for several reasons. First, it uses symbolic references, not real (hard) references, and so is forbidden under the use strict "refs" pragma. Symbolic references to variables are usually a bad idea, since they can't access lexical variables, only globals, and aren't reference counted. Second, as written it doesn't include package information, so if executed in a different package, it would try to call the wrong function. Finally, in the odd case that the function were redefined at some point, the symbolic reference would get whatever the current definition for the function was, whereas the hard reference would still access the old definition.

Instead of placing the name of the function in the variable, use the backslash operator to create a reference to the function. This is the normal way to store a function in a variable or pass along to another function. You can mix and match references to named functions with references to unnamed ones:

my %commands = (
    "happy" => \&joy,
    "sad"   => \&sullen,
    "done"  => sub { die "See ya!" },
    "mad"   => \&angry,
);

print "How are you? ";
chomp($string = <STDIN>);
if ($commands{$string}) {
    $commands{$string}->( );
} else {
    print "No such command: $string\n";
}

If you create an anonymous function that refers to a lexical (my) variable from an enclosing scope, reference counting ensures that the lexical variable is never deallocated so long as that function reference exists:

sub counter_maker {
    my $start = 0;
    return sub {              # this is a closure
        return $start++;      # lexical from enclosing scope
    };
}
$counter = counter_maker( );
for ($i = 0; $i < 5; $i++) {
    print &$counter, "\n";
}

Even though counter_maker has ended and $start gone out of scope, Perl doesn't free the variable, because the anonymous subroutine referenced by $counter still has a reference to $start. If we call counter_maker again, it'll return another, different anonymous subroutine reference that uses a different $start:

$counter1 = counter_maker( );
$counter2 = counter_maker( );
for ($i = 0; $i < 5; $i++) {
    print &$counter1, "\n";
}
print &$counter1, " ", &$counter2, "\n";
0
1
2
3
4
5 0

Closures are often used in callback routines. In graphical and other event-based programming, you associate code with a keypress, mouse click, window expose event, etc. The code will be called much later, probably from an entirely different scope. Variables mentioned in the closure must be available when it's finally called. To work properly, such variables must be lexicals, not globals.

Another use for closures is function generators, that is, functions that create and return brand-new functions. The counter_maker function is such a function generator. Here's another simple one:

sub timestamp {
    my $start_time = time( );
    return sub { return time( ) - $start_time };
}
$early = timestamp( );
sleep 20;
$later = timestamp( );
sleep 10;
printf "It's been %d seconds since early.\n", $early->( );
printf "It's been %d seconds since later.\n", $later->( );
It's been 30 seconds since early.
It's been 10 seconds since later.
Each call to timestamp generates and returns a brand-new function. The timestamp function creates a lexical called $start_time that contains the current clock time (in epoch seconds). Every time that closure is called, it returns how many seconds have elapsed since it was created by subtracting its starting time from the current time.

11.4.4 See Also
The section on "Closures" in Chapter 8 of Programming Perl and the discussion on closures in perlref(1); Recipe 10.11; Recipe 11.4

Recipe 11.5 Taking References to Scalars

11.5.1 Problem
You want to create and manipulate a reference to a scalar value.

11.5.2 Solution
To create a reference to a scalar variable, use the backslash operator:

$scalar_ref = \$scalar;     # get reference to named scalar

To create a reference to an anonymous scalar value (a value that isn't in a variable), assign to a dereferenced undefined variable:

undef $anon_scalar_ref;
$$anon_scalar_ref = 15;

This creates a reference to a constant scalar:

$anon_scalar_ref = \15;

Use ${...} to dereference:

print ${ $scalar_ref };           # dereference it
${ $scalar_ref } .= "string";     # alter referent's value

11.5.3 Discussion
If you want to create many new anonymous scalars, use a subroutine that returns a reference to a lexical variable out of scope, as explained in this chapter's Introduction:

sub new_anon_scalar {
    my $temp;
    return \$temp;
}

Dereference a scalar reference by prefacing it with $ to get at its contents:

$sref = new_anon_scalar( );
$$sref = 3;
print "Three = $$sref\n";
@array = ( new_anon_scalar( ), new_anon_scalar( ) );
${ $array[0] } = 6.02e23;
${ $array[1] } = "avocado";
print "\@array contains: ", join(", ", map { $$_ } @array ), "\n";

Notice we put braces around $array[0] and $array[1]. If we tried to say $$array[0], the tight binding of dereferencing would turn it into $array->[0]. It would treat $array as an array reference and return the element at index zero.

Here are other examples where it is safe to omit the braces:

$var = `uptime`;              # $var holds text
$vref = \$var;                # $vref "points to" $var
if ($$vref =~ /load/) { }     # look at $var, indirectly
chomp $$vref;                 # alter $var, indirectly

As mentioned in the Introduction, you may use the ref built-in to inspect a reference for its referent's type. Calling ref on a scalar reference returns the string "SCALAR":

# check whether $someref contains a simple scalar reference
if (ref($someref) ne "SCALAR") {
    die "Expected a scalar reference, not $someref\n";
}

11.5.4 See Also
Chapters 8 and 9 of Programming Perl and perlref(1)

Recipe 11.6 Creating Arrays of Scalar References

11.6.1 Problem
You want to create and manipulate an array of references to scalars. This arises when you pass variables by reference to a function so the function can change their values.

11.6.2 Solution
To create an array, either backslash each scalar in the list to store in the array:

@array_of_scalar_refs = ( \$a, \$b );

or simply backslash the entire list, taking advantage of the backslash operator's distributive property:

@array_of_scalar_refs = \( $a, $b );

To get or set the value of an element of the list, use ${ ... }:

${ $array_of_scalar_refs[1] } = 12;    # $b = 12

11.6.3 Discussion
In the following examples, @array is a simple array containing references to scalars (an array of references is not a reference to an array). To access the original data indirectly, braces are mandatory.

($a, $b, $c, $d) = (1 ..
4);                                  # initialize
@array = (\$a, \$b, \$c, \$d);       # refs to each scalar
@array = \( $a, $b, $c, $d);         # same thing!
@array = map { \my $anon } 0 .. 3;   # allocate 4 anon scalar refs
${ $array[2] } += 9;                 # $c now 12
${ $array[ $#array ] } *= 5;         # $d now 20
${ $array[-1] } *= 5;                # same; $d now 100
$tmp = $array[-1];                   # using temporary
$$tmp *= 5;                          # $d now 500

The two assignments to @array are equivalent—the backslash operator is distributive across a list. So preceding a list (including a slice or a function's return list, but not an array) with a backslash is the same as applying a backslash to everything in that list. The ensuing code changes the values of the variables whose references were stored in the array.

Here's how to deal with such an array without explicit indexing:

use Math::Trig qw(pi);              # load the constant pi
foreach $sref (@array) {            # prepare to change $a,$b,$c,$d
    ($$sref **= 3) *= (4/3 * pi);   # replace with spherical volumes
}

This code uses the formula for deriving the volume of a sphere from its radius: V = (4/3)πr³. The $sref loop index variable is each reference in @array, and $$sref is the number itself, that is, the original variables $a, $b, $c, and $d. Changing $$sref in the loop changes those variables as well. First we replace $$sref with its cube, then multiply the resulting value by (4/3)π. This takes advantage of the fact that assignment in Perl returns an lvalue, letting you chain assignment operators together as we've done using the **= and *= assignment operators.

Actually, anonymous scalars are pretty useless, given that a scalar value fits in the same space as a scalar reference. That's why there's no explicit composer. Scalar references exist only to allow aliasing—which can be done in other ways.

11.6.4 See Also
The section on "Assignment Operators" in Chapter 3 of Programming Perl and in perlop(1); the section on "Other Tricks You Can Do with Hard References" in Chapter 8 of Programming Perl

Recipe 11.7 Using Closures Instead of Objects

11.7.1 Problem
You want records with private state, behavior, and identity, but you don't want to learn object-oriented programming to accomplish this.

11.7.2 Solution
Write a function that returns (by reference) a hash of code references. These code references are closures created in the same scope, so when they execute, they'll share bindings to the same private variables.

11.7.3 Discussion
Because a closure is a binding of code and data, it can implement what might be thought of as an object.

Here's an example that creates and returns a hash of anonymous functions. mkcounter takes an argument of a seed counter and returns a reference to a hash of code references that you can use to manipulate the counter indirectly.

$c1 = mkcounter(20);
$c2 = mkcounter(77);

printf "next c1: %d\n", $c1->{NEXT}->( );     # 21
printf "next c2: %d\n", $c2->{NEXT}->( );     # 78
printf "next c1: %d\n", $c1->{NEXT}->( );     # 22
printf "last c1: %d\n", $c1->{PREV}->( );     # 21
printf "old  c2: %d\n", $c2->{RESET}->( );    # 77

The code values in the hash references in $c1 and $c2 maintain their own separate state.
Here's how to set that up:

sub mkcounter {
    my $count  = shift;
    my $start  = $count;
    my $bundle = {
        "NEXT"  => sub { return ++$count  },
        "PREV"  => sub { return --$count  },
        "GET"   => sub { return $count    },
        "SET"   => sub { $count = shift   },
        "BUMP"  => sub { $count += shift  },
        "RESET" => sub { $count = $start  },
    };
    $bundle->{"LAST"} = $bundle->{"PREV"};
    return $bundle;
}

Because the lexical variables used by the closures in the $bundle hash reference are returned by the function, they are not deallocated. The next time mkcounter is called, the closures get a different set of variable bindings for the same code. Because no one outside those closures can access these two variables, this assures true privacy.

The assignment right before the return makes both "PREV" and "LAST" values point to the same closure. Depending on your object-oriented background, you might think of these as being two different messages, both implemented using the same method.

The bundle we return is not an object in that it has no obvious inheritance and polymorphism. (Yet.) But it certainly does have state, behavior, and identity, as well as encapsulation.

11.7.4 See Also
The section on "Closures" in Chapter 8 of Programming Perl and the discussion on closures in perlref(1); Recipe 11.4; Recipe 11.9; Chapter 13

Recipe 11.8 Creating References to Methods

11.8.1 Problem
You want to store a reference to a method.

11.8.2 Solution
Create a closure that makes the proper method call on the appropriate object.

11.8.3 Discussion
When you ask for a reference to a method, you're asking for more than just a raw function pointer. You also need to record which object the method needs to be called upon as the object contains the data the method will work with. The best way to do this is using a closure. Assuming $obj is lexically scoped, you can say:

$mref = sub { $obj->meth(@_) };
# later...
$mref->("args", "go", "here");

Even when $obj goes out of scope, the closure stored in $mref has captured it. Later when it's called indirectly, the correct object is used for the method call.

Be aware that the notation:

$sref = \$obj->meth;

doesn't do what you probably expected. It first calls the method on that object and gives you either a reference to the return value or a reference to the last of the return values if the method returns a list.

The can method from the UNIVERSAL base class, while appealing, is also unlikely to produce what you want.

$cref = $obj->can("meth");

This produces a code ref to the appropriate method (should one be found), but one that carries no object information. Think of it as a raw function pointer. The information about the object is lost. That's why you need a closure to capture both the object state as well as the method to call.

11.8.4 See Also
The discussion on methods in the Introduction to Chapter 13; the section on "Closures" in Chapter 8 of Programming Perl; Recipe 11.7; Recipe 13.8

Recipe 11.9 Constructing Records

11.9.1 Problem
You want to create a record data type.

11.9.2 Solution
Use a reference to an anonymous hash.

11.9.3 Discussion
Suppose you wanted to create a data type that contained various data fields. The easiest way is to use an anonymous hash.
For example, here's how to initialize and use that record:

$record = {
    NAME   => "Jason",
    EMPNO  => 132,
    TITLE  => "deputy peon",
    AGE    => 23,
    SALARY => 37_000,
    PALS   => [ "Norbert", "Rhys", "Phineas" ],
};
printf "I am %s, and my pals are %s.\n",
    $record->{NAME},
    join(", ", @{$record->{PALS}});

Just having one of these records isn't much fun—you'd like to build larger structures. For example, you might want to create a %byname hash that you could initialize and use this way:

# store record
$byname{ $record->{NAME} } = $record;

# later on, look up by name
if ($rp = $byname{"Aron"}) {    # false if missing
    printf "Aron is employee %d.\n", $rp->{EMPNO};
}

# give jason a new pal
push @{$byname{"Jason"}->{PALS}}, "Theodore";
printf "Jason now has %d pals\n", scalar @{$byname{"Jason"}->{PALS}};

That makes %byname a hash of hashes because its values are hash references. Looking up employees by name would be easy using such a structure. If we find a value in the hash, we store a reference to the record in a temporary variable, $rp, which we then use to get any field we want.

We can use our existing hash tools to manipulate %byname. For instance, we could use the each iterator to loop through it in an arbitrary order:

# Go through all records
while (($name, $record) = each %byname) {
    printf "%s is employee number %d\n", $name, $record->{EMPNO};
}

What about looking employees up by employee number? Just build and use another data structure, an array of hashes called @employees. If your employee numbers aren't consecutive (for instance, they jump from 1 to 159997) an array would be a bad choice. Instead, you should use a hash mapping employee number to record. For consecutive employee numbers, use an array:

# store record
$employees[ $record->{EMPNO} ] = $record;

# lookup by id
if ($rp = $employees[132]) {
    printf "employee number 132 is %s\n", $rp->{NAME};
}

With a data structure like this, updating a record in one place effectively updates it everywhere. For example, this gives Jason a 3.5% raise:

$byname{"Jason"}->{SALARY} *= 1.035;

This change is reflected in all views of these records. Remember that $byname{"Jason"} and $employees[132] both refer to the same record because the references they contain refer to the same anonymous hash.

How would you select all records matching a particular criterion? This is what grep is for. Here's how to get everyone with "peon" in their titles or all 27-year-olds:

@peons   = grep { $_->{TITLE} =~ /peon/i } @employees;
@tsevens = grep { $_->{AGE}   == 27      } @employees;

Each element of @peons and @tsevens is itself a reference to a record, making them arrays of hashes, like @employees.

Here's how to print all records sorted in a particular order, say by age:

# Go through all records
foreach $rp (sort { $a->{AGE} <=> $b->{AGE} } values %byname) {
    printf "%s is age %d.\n", $rp->{NAME}, $rp->{AGE};
    # or with a hash slice on the reference
    printf "%s is employee number %d.\n", @$rp{"NAME","EMPNO"};
}

Rather than take time to sort them by age, you could create another view of these records, @byage. Each element in this array, $byage[27] for instance, would be an array of all records with that age. In effect, this is an array of arrays of hashes.
Build it this way:

    # use @byage, an array of arrays of records
    push @{ $byage[ $record->{AGE} ] }, $record;

Then you could find them all this way:

    for ($age = 0; $age <= $#byage; $age++) {
        next unless $byage[$age];
        print "Age $age: ";
        foreach $rp (@{$byage[$age]}) {
            print $rp->{NAME}, " ";
        }
        print "\n";
    }

A similar approach is to use map to avoid the foreach loop:

    for ($age = 0; $age <= $#byage; $age++) {
        next unless $byage[$age];
        printf "Age %d: %s\n", $age,
            join(", ", map {$_->{NAME}} @{$byage[$age]});
    }

11.9.4 See Also

Recipe 4.14; Recipe 11.3

Recipe 11.10 Reading and Writing Hash Records to Text Files

11.10.1 Problem

You want to read or write hash records stored in text files.

11.10.2 Solution

Use a simple file format with one field per line:

    FieldName: Value

and separate records with blank lines.

11.10.3 Discussion

If you have an array of records that you'd like to store into and retrieve from a text file, you can use a simple format based on mail headers. The format's simplicity requires that the keys have neither colons nor newlines, and the values not have newlines. This code writes them out:

    foreach $record (@Array_of_Records) {
        for $key (sort keys %$record) {
            print "$key: $record->{$key}\n";
        }
        print "\n";
    }

Reading them in is easy, too.

    $/ = "";            # paragraph read mode
    while (<>) {
        my @fields = split /^([^:]+):\s*/m;
        shift @fields;  # for leading null field
        push(@Array_of_Records, { map /(.*)/, @fields });
    }

The split acts upon $_, its default second argument, which contains a full paragraph. The pattern looks for start of line (not just start of record, thanks to the /m) followed by one or more non-colons, followed by a colon and optional whitespace. When split's pattern contains parentheses, these are returned along with the values. The return values placed in @fields are in key-value order, with a leading null field we shift off. The braces in the call to push produce a reference to a new anonymous hash, which we copy @fields into. Since that array was stored in order of the needed key-value pairing, this makes for well-ordered hash contents.

All you're doing is reading and writing a plain text file, so you can use related recipes for additional components. You could use Recipe 7.18 to ensure that you have clean, concurrent access; Recipe 1.18 to store colons and newlines in keys and values; and Recipe 11.3 to store more complex structures. If you are willing to sacrifice the elegance of a plain text file for a quick, random-access database of records, use a DBM file, as described in Recipe 11.14.
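Here's a minimal, self-contained sketch of the complete round trip, writing records out in this format and then reading them back. (The filename records.txt and the sample data are invented for the example.)

    use strict;

    my @Array_of_Records = (
        { NAME => "Jason", EMPNO => 132 },
        { NAME => "Aron",  EMPNO => 451 },
    );

    # write one "Key: Value" field per line, a blank line between records
    open(my $out, ">", "records.txt") or die "can't write records.txt: $!";
    foreach my $record (@Array_of_Records) {
        for my $key (sort keys %$record) {
            print $out "$key: $record->{$key}\n";
        }
        print $out "\n";
    }
    close($out);

    # read them back in paragraph mode
    my @copy;
    open(my $in, "<", "records.txt") or die "can't read records.txt: $!";
    local $/ = "";              # paragraph read mode
    while (<$in>) {
        my @fields = split /^([^:]+):\s*/m;
        shift @fields;          # for leading null field
        push(@copy, { map /(.*)/, @fields });
    }
    close($in);

    print "$copy[1]{NAME} is employee $copy[1]{EMPNO}\n";   # Aron is employee 451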
11.10.4 See Also

The split function in perlfunc(1) and Chapter 29 of Programming Perl; Recipe 11.9; Recipe 11.13; Recipe 11.14

Recipe 11.11 Printing Data Structures

11.11.1 Problem

You want to print out a data structure.

11.11.2 Solution

If the output's legibility and layout are important, write your own custom printing routine.

If you are in the Perl debugger, use the x command:

    DB<1> $reference = [ { "foo" => "bar" }, 3, sub { print "hello, world\n" } ];

    DB<2> x $reference
    0  ARRAY(0x1d033c)
       0  HASH(0x7b390)
          'foo' => 'bar'
       1  3
       2  CODE(0x21e3e4)
          -> &main::__ANON__[(eval 15)[/usr/local/...perl5db.pl:17]:2] in (eval 15)[/usr/local/.../perl5db.pl:17]:2-2

From within your own programs, use the Dumper function from the standard module Data::Dumper:

    use Data::Dumper;
    print Dumper($reference);

Or if you'd like output formatted in the same style as the Debugger uses:

    use Dumpvalue;
    Dumpvalue->new->dumpValue($reference);

11.11.3 Discussion

Sometimes you'll want to make a dedicated function for your data structure that delivers a particular output format, but often this is overkill. If you're running under the Perl debugger, the x and X commands provide nice pretty-printing. The x command is more useful because it works on both global and lexical variables, whereas X works only on globals. Pass x a reference to the data structure you want to print.

    DB<3> x \@INC
    0  ARRAY(0x807d0a8)
       0  '/home/tchrist/perllib'
       1  '/usr/lib/perl5/i686-linux/5.00403'
       2  '/usr/lib/perl5'
       3  '/usr/lib/perl5/site_perl/i686-linux'
       4  '/usr/lib/perl5/site_perl'
       5  '.'

The standard Dumpvalue module provides the Debugger's output formatting using an object-oriented interface. Here's an example:

    use Dumpvalue;
    Dumpvalue->new->dumpvars("main", "INC");

    @INC = (
       0  '/usr/local/lib/perl5/5.8.1/OpenBSD.i386-openbsd'
       1  '/usr/local/lib/perl5/5.8.1'
       2  '/usr/local/lib/perl5/site_perl/5.8.1/OpenBSD.i386-openbsd'
       3  '/usr/local/lib/perl5/site_perl/5.8.1'
       4  '/usr/local/lib/perl5/site_perl/5.8.0/OpenBSD.i386-openbsd'
       5  '/usr/local/lib/perl5/site_perl/5.8.0'
       6  '/usr/local/lib/perl5/site_perl'
       7  '.'
    )
    %INC = (
       'Dumpvalue.pm' => '/usr/local/lib/perl5/5.8.1/Dumpvalue.pm'
       'strict.pm' => '/usr/local/lib/perl5/5.8.1/strict.pm'
    )

which is like using the V main INC command in the Debugger. All the output formatting options from the Debugger are available from Dumpvalue. Just pass Dumpvalue->new option pairs:

    $dobj = Dumpvalue->new(option1 => value1, option2 => value2);

Options available as of v5.8.1 include arrayDepth, hashDepth, compactDump, veryCompact, globPrint, dumpDBFiles, dumpPackages, dumpReused, tick, quoteHighBit, printUndef, usageOnly, unctrl, subdump, bareStringify, and stopDbSignal.

The Data::Dumper module, also included in the standard Perl distribution, has a different approach. It provides a Dumper function that takes a list of references and returns a string with a printable (and evalable) form of those references.

    use Data::Dumper;
    print Dumper(\@INC);

    $VAR1 = [
        '/usr/local/lib/perl5/5.8.1/OpenBSD.i386-openbsd',
        '/usr/local/lib/perl5/5.8.1',
        '/usr/local/lib/perl5/site_perl/5.8.1/OpenBSD.i386-openbsd',
        '/usr/local/lib/perl5/site_perl/5.8.1',
        '/usr/local/lib/perl5/site_perl/5.8.0/OpenBSD.i386-openbsd',
        '/usr/local/lib/perl5/site_perl/5.8.0',
        '/usr/local/lib/perl5/site_perl',
        '.'
    ];

Data::Dumper supports a variety of output formats. Check its documentation for details. Particularly useful is the option to decompile Perl code:

    use Data::Dumper;
    $Data::Dumper::Deparse = 1;
    $a = sub { print "hello, world\n" };
    print Dumper($a);

    $VAR1 = sub {
        print 'hello, world';
    };
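Because Dumper's output is itself valid Perl source, you can squirrel the string away and eval it later to rebuild the structure, as the Data::Dumper documentation describes. A minimal sketch (the sample data is invented for the example):

    use Data::Dumper;

    my $original = { colors => ["red", "blue"], count => 2 };
    my $text = Dumper($original);    # a string of Perl code: "$VAR1 = ...;"

    my $VAR1;                        # Dumper names its result $VAR1
    my $copy = eval $text;
    die "couldn't eval dumped structure: $@" if $@;
    print "copy has $copy->{count} colors: @{ $copy->{colors} }\n";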
11.11.4 See Also

The documentation for Data::Dumper; Chapter 20 of Programming Perl or perldebug(1)

Recipe 11.12 Copying Data Structures

11.12.1 Problem

You need to copy a complex data structure.

11.12.2 Solution

Use the dclone function from the standard Storable module:

    use Storable;
    $r2 = dclone($r1);

11.12.3 Discussion

Two types of "copy" are sometimes confused. A surface copy (also known as a shallow copy) simply copies references without creating copies of the data behind them:

    @original = ( \@a, \@b, \@c );
    @surface  = @original;

A deep copy creates an entirely new structure with no overlapping references. This copies the references one layer deep:

    @deep = map { [ @$_ ] } @original;

If @a, @b, and @c themselves contain references, the preceding map is no longer adequate. Writing your own code to deep-copy structures is laborious and rapidly becomes tiresome. The Storable module provides a function called dclone that recursively copies its argument:

    use Storable qw(dclone);
    $r2 = dclone($r1);

This only works on references or blessed objects of type SCALAR, ARRAY, HASH, or CODE;[2] references of type GLOB, IO, and the more esoteric types are not supported. The safeFreeze function from the FreezeThaw module supports even these types when used in the same address space, by using a reference cache that could interfere with garbage collection and object destructors under some circumstances.

[2] Believe it or not, it's true: Storable can even serialize closures. See its manpage for how to thaw these using a Safe compartment.

Because dclone takes and returns references, you must add extra punctuation if you have a hash or array to copy:

    %newhash = %{ dclone(\%oldhash) };
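A minimal sketch showing the difference in practice (the data is invented for the example):

    use Storable qw(dclone);

    my $r1 = { pets => ["cat", "dog"] };

    my $shallow = { %$r1 };        # new top-level hash, but it shares the array
    my $deep    = dclone($r1);     # a completely independent copy

    push @{ $r1->{pets} }, "ferret";
    print scalar @{ $shallow->{pets} }, "\n";    # 3 -- the change leaked through
    print scalar @{ $deep->{pets}    }, "\n";    # 2 -- the deep copy is unaffected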
11.12.4 See Also

The documentation for the standard Storable and Data::Dumper modules, and for the FreezeThaw CPAN module

Recipe 11.13 Storing Data Structures to Disk

11.13.1 Problem

You want to save your large, complex data structure to disk so you don't have to reconstruct it from scratch each time your program runs.

11.13.2 Solution

Use the Storable module's store and retrieve functions:

    use Storable;
    store(\%hash, "filename");

    # later on...
    $href = retrieve("filename");        # by ref
    %hash = %{ retrieve("filename") };   # direct to hash

11.13.3 Discussion

The Storable module uses C functions and a binary format to walk Perl's internal data structures and lay out its data. It's more efficient than a pure Perl, string-based approach, but it's also more fragile.

The store and retrieve functions expect binary data using the machine's own byte ordering. This means files created with these functions cannot be shared across different architectures. nstore does the same job store does, but keeps data in canonical (network) byte order, at a slight speed cost:

    use Storable qw(nstore);
    nstore(\%hash, "filename");
    # later ...
    $href = retrieve("filename");

No matter whether store or nstore was used, you need to call the same retrieve routine to restore the objects in memory. The producer must commit to portability, but the consumer doesn't have to. Code need only be changed in one place when the producer has a change of heart; the code thus offers a consistent interface for the consumer, who does not need to know or care.

The store and nstore functions don't lock the files they work on. If you're worried about concurrent access, open the file yourself, lock it using Recipe 7.18, and then use store_fd or its slower but machine-independent version nstore_fd.

Here's code to save a hash to a file, with locking. We don't open with the O_TRUNC flag because we have to wait to get the lock before we can clobber the file.

    use Storable qw(nstore_fd);
    use Fcntl qw(:DEFAULT :flock);

    sysopen(DF, "/tmp/datafile", O_RDWR|O_CREAT, 0666)
        or die "can't open /tmp/datafile: $!";
    flock(DF, LOCK_EX)
        or die "can't lock /tmp/datafile: $!";
    nstore_fd(\%hash, *DF)
        or die "can't store hash\n";
    truncate(DF, tell(DF));
    close(DF);

Here's code to restore that hash from a file, with locking:

    use Storable qw(retrieve_fd);
    use Fcntl qw(:DEFAULT :flock);

    open(DF, "< /tmp/datafile")
        or die "can't open /tmp/datafile: $!";
    flock(DF, LOCK_SH)
        or die "can't lock /tmp/datafile: $!";
    $href = retrieve_fd(*DF);
    close(DF);

With care, you can pass large data objects efficiently between processes using this strategy, since a filehandle connected to a pipe or socket is still a byte stream, just like a plain file.

Unlike the various DBM bindings, Storable does not restrict you to using only hashes (or arrays, with DB_File). Arbitrary data structures, including objects, can be stored to disk. The whole structure must be read in or written out in its entirety.

11.13.4 See Also

The section on "Remote Procedure Calls (RPC)" in Chapter 13 of Advanced Perl Programming, by Sriram Srinivasan (O'Reilly); Recipe 11.14

Recipe 11.14 Transparently Persistent Data Structures

11.14.1 Problem

You have a complex data structure that you want to persist outside your program.

11.14.2 Solution

Use MLDBM and either (preferably) DB_File, or else GDBM_File:

    use MLDBM qw(DB_File);
    use Fcntl;

    tie(%hash, "MLDBM", "testfile.db", O_CREAT|O_RDWR, 0666)
        or die "can't open tie to testfile.db: $!";

    # ... act on %hash

    untie %hash;

11.14.3 Discussion

A hash with 100,000 items in it would undoubtedly take considerable time to build. Storing this to disk, either slowly by hand or quickly with Storable, is still an expensive operation in memory and computation.

The DBM modules solve this by tying hashes to disk database files. Rather than reading the whole structure in at once, they only pull in what they need, when they need it. To the user, it looks like a hash that persists across program invocations.

Unfortunately, the values in this persistent hash must be plain strings. You cannot readily use a database file as a backing store for a hash of hashes, a hash of arrays, and so on—just for a hash of strings. However, the MLDBM module from CPAN allows you to store references in a database. It uses Data::Dumper to stringify these references for external storage:

    use MLDBM qw(DB_File);
    use Fcntl;
    tie(%hash, "MLDBM", "testfile.db", O_CREAT|O_RDWR, 0666)
        or die "can't open tie to testfile.db: $!";

Now you can use %hash to fetch or store complex records from disk. The only drawback is that you can't access the references piecemeal. You have to pull in the reference from the database, work with it, and then store it back.

    # this doesn't work!
    $hash{"some key"}[4] = "fred";

    # RIGHT
    $aref = $hash{"some key"};
    $aref->[4] = "fred";
    $hash{"some key"} = $aref;
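Here's a minimal sketch of that whole read-modify-write cycle in one place (the testfile.db filename and the record layout are invented for the example):

    use MLDBM qw(DB_File);
    use Fcntl;

    tie(my %hash, "MLDBM", "testfile.db", O_CREAT|O_RDWR, 0666)
        or die "can't open tie to testfile.db: $!";

    # store a complex record
    $hash{"Jason"} = { TITLE => "deputy peon", PALS => ["Norbert", "Rhys"] };

    # read-modify-write: pull the whole reference out, change it, store it back
    my $rec = $hash{"Jason"};
    push @{ $rec->{PALS} }, "Theodore";
    $hash{"Jason"} = $rec;

    print "Jason has ", scalar @{ $hash{"Jason"}{PALS} }, " pals\n";   # 3

    untie %hash;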
11.14.4 See Also

Recipe 11.13

Recipe 11.15 Coping with Circular Data Structures Using Weak References

11.15.1 Problem

You have an inherently self-referential data structure, so Perl's reference-based garbage collection system won't notice when that structure is no longer being used. You want to prevent your program from leaking memory.

11.15.2 Solution

Convert all internal references within the data structure into weak references so they don't increment the reference count.

11.15.3 Discussion

Perl's memory management system relies on an underlying reference count to know when to reclaim memory. In practice, this works fairly well except for one particular situation: when a variable directly or indirectly points at itself. Consider:

    {
        my ($a, $b);
        ($a, $b) = \($b, $a);    # same as (\$b, \$a);
    }

The two underlying scalars that $a and $b represent each start out with a reference count of one apiece in the first line of the block. In the second line, those scalars are each initialized to contain a reference to the other variable; $a points to $b and vice versa. Saving a reference increments the underlying reference count on the scalars, so now both refcounts are set to two. As the block exits and those lexical variables become unreachable (by name), both refcounts are decremented by one, leaving one in each—forever. Since the refcounts can never reach zero, memory used by those two underlying scalars will never be reclaimed. You'll leak two scalars every time that block executes; if it's in a loop or a subroutine, you could eventually run out of memory.

The standard Devel::Peek module's Dump function shows you underlying reference counts, plus a whole lot more. This code:

    use Devel::Peek;
    $a = 42;
    $b = \$a;
    Dump $a;

produces this output:

    SV = IV(0xd7cc4) at 0xd72b8
      REFCNT = 2
      FLAGS = (IOK,pIOK)
      IV = 42

The important thing to notice there is that the refcount is two. That's because the scalar can be reached two ways: once via the variable named $a, and then again through dereferencing $b using $$b. You can produce the same condition even without using another variable:

    {
        my $a;
        $a = \$a;
    }

This most often occurs when creating a data structure whose elements contain references that directly or indirectly point back to the initial element. Imagine a circular linked list—a ring data structure.

    $ring = {
        VALUE => undef,
        NEXT  => undef,
        PREV  => undef,
    };
    $ring->{NEXT} = $ring;
    $ring->{PREV} = $ring;

The underlying hash has an underlying refcount of three, and undeffing $ring or letting it go out of scope will decrement that count only by one, resulting in a whole hash full of memory irrecoverable by Perl.

To address this situation, Perl introduced in its v5.6 release the concept of weak references. A weak reference is just like any other regular reference (meaning a "hard" reference, not a "symbolic" one) except for two critical properties: it no longer contributes to the reference count on its referent, and when its referent is garbage collected, the weak reference itself becomes undefined. These properties make weak references perfect for data structures that hold internal references to themselves. That way those internal references do not count toward the structure's reference count, but external ones still do.

Although Perl supported weak references starting in v5.6, there was no standard weaken( ) function to access them from Perl itself in the standard release. You had to go to CPAN to pull in the WeakRef module. Beginning in v5.8.1, the weaken( ) function is included standard with the Scalar::Util module. That module also provides an isweak( ) function that reports whether its reference argument has been weakened.
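Both properties are easy to watch in just a few lines. This sketch assumes the v5.8.1 Scalar::Util just described:

    use Scalar::Util qw(weaken isweak);

    my $weak;
    {
        my $node = { VALUE => 42 };
        $weak = $node;       # a normal reference: the hash's refcount is now 2
        weaken($weak);       # refcount drops back to 1; isweak($weak) is true
        print "inside the block: VALUE is $weak->{VALUE}\n";
    }
    # $node has gone out of scope, so the hash was reclaimed,
    # and the weak reference was automatically undefined
    print defined($weak) ? "still here\n" : "outside the block: undef\n";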
Here's how you would apply weak references to the ring example just given:

    use Scalar::Util qw(weaken);

    $ring = {
        VALUE => undef,
        NEXT  => undef,
        PREV  => undef,
    };
    $ring->{NEXT} = $ring;
    $ring->{PREV} = $ring;
    weaken($ring->{NEXT});
    weaken($ring->{PREV});

In Recipe 13.13, we show how to create a circular-linked list data structure that won't leak memory by employing an elaborate trick using a dummy head node and an object-oriented device called a destructor. With weak references, the code becomes much simpler. Here's the same algorithm as that recipe uses, but here using weak references to address the memory-leak issue.

    use Scalar::Util qw(weaken);

    my $COUNT = 1000;
    for (1..20) {
        my $ring = node(100_000 + $_);
        for my $value (1 .. $COUNT) {
            insert_value($ring, $value);
        }
    }

    # return a node
    sub node($) {
        my ($init_value) = @_;
        my $node = { VALUE => $init_value };
        $node->{NEXT} = $node->{PREV} = $node;
        weaken($node->{NEXT});
        weaken($node->{PREV});
        return $node;
    }

    # $node = search_ring($ring, $value) : find $value in the ring
    # structure in $node
    sub search_ring {
        my ($ring, $value) = @_;
        my $node = $ring->{NEXT};
        while ($node != $ring && $node->{VALUE} != $value) {
            $node = $node->{NEXT};
        }
        return $node;
    }

    # insert_value( $ring, $value ) : insert $value into the ring structure
    sub insert_value {
        my ($ring, $value) = @_;
        my $node = { VALUE => $value };
        weaken($node->{NEXT} = $ring->{NEXT});
        weaken($ring->{NEXT}->{PREV} = $node);
        weaken($ring->{NEXT} = $node);
        weaken($node->{PREV} = $ring);
        ++$ring->{COUNT};
    }

    # delete_value( $ring, $value ) : delete a node from the ring
    # structure by value
    sub delete_value {
        my ($ring, $value) = @_;
        my $node = search_ring($ring, $value);
        return if $node == $ring;
        delete_node($ring, $node);
    }

    # delete a node from the ring structure
    sub delete_node {
        my ($ring, $node) = @_;
        weaken($node->{PREV}->{NEXT} = $node->{NEXT});
        weaken($node->{NEXT}->{PREV} = $node->{PREV});
        --$ring->{COUNT};
    }

Every time we store a reference to part of the data structure within that same structure, we weaken the reference so it doesn't count toward the reference count. Otherwise our program's in-core memory footprint would have grown terrifically. You can watch that happen by adding:

    system("ps v$$");

within the loop on systems that support the ps(1) program. All it takes to trigger the leak is not weakening any of the four assignments in the insert_value function just shown.

11.15.4 See Also

The algorithms in this recipe derive in part from pages 206-207 of Introduction to Algorithms, by Cormen, Leiserson, and Rivest (MIT Press/McGraw-Hill). See also Recipe 13.13; the section on "Garbage Collection, Circular References, and Weak References" in Chapter 8 of Programming Perl; the documentation for the standard Devel::Peek and Scalar::Util modules
Recipe 11.16 Program: Outlines

Outlines are a simple (and thus popular) way of structuring data. The hierarchy of detail implied by an outline maps naturally to our top-down way of thinking about the world. The only problem is that it's not obvious how to represent outlined data as a Perl data structure. Take, for example, this simple outline of some musical genres:

    Alternative
    .Punk
    ..Emo
    ..Folk Punk
    .Goth
    ..Goth Rock
    ..Glam Goth
    Country
    .Old Time
    .Bluegrass
    .Big Hats
    Rock
    .80s
    ..Big Hair
    ..New Wave
    .60s
    ..British
    ..American

Here we use a period to indicate a subgroup. There are many different formats in which that outline could be output. For example, you might write the genres out in full:

    Alternative
    Alternative - Punk
    Alternative - Punk - Emo
    Alternative - Punk - Folk Punk
    Alternative - Goth
    ...

You might number the sections:

    1 Alternative
    1.1 Punk
    1.1.1 Emo
    1.1.2 Folk Punk
    1.2 Goth
    ...

or alphabetize:

    Alternative
    Alternative - Goth
    Alternative - Goth - Glam Goth
    Alternative - Goth - Goth Rock
    Alternative - Punk
    Alternative - Punk - Emo
    ...

or show inheritance:

    Alternative
    Punk - Alternative
    Emo - Punk - Alternative
    Folk Punk - Punk - Alternative
    Goth - Alternative
    Goth Rock - Goth - Alternative
    ...

These transformations are all much easier than it might seem. The trick is to represent the levels of the hierarchy as elements in an array. For example, you'd represent the ..Glam Goth entry in the sample outline as:

    @array = ("Alternative", "Goth", "Glam Goth");

Now reformatting the entry is trivial. There's an elegant way to parse the input file to get this array representation:

    while (<>) {
        chomp;
        $tag[$in = s/\G\.//g] = $_;
        # do something with @tag[0..$in]
    }

The substitution deletes leading periods from the current entry, returning how many it deleted. This number indicates the indentation level of the current entry.

Alphabetizing is now simple using the Unix sort program:

    $ISA = "-";
    open(STDOUT, "|sort -b -t'$ISA' -df");
    while (<DATA>) {
        chomp;
        $tag[$in = s/\G\.//g] = $_;
        print join(" $ISA ", @tag[0 .. $in]), "\n";
    }
    close STDOUT;
    __END__
    Alternative
    .Punk
    ..Emo
    ..Folk Punk
    .Goth

Numbering the outline is equally simple:

    while (<DATA>) {
        chomp;
        $count[$in = s/\G\.//g]++;
        delete @count[($in+1) .. $#count];
        print join(".", @count), " $_\n";
    }
    __END__
    Alternative
    .Punk
    ..Emo
    ..Folk Punk
    .Goth
    ..Goth Rock

Notice that renumbering is our only application where we've deleted elements from the array. This is because we're not keeping names of hierarchy levels in the array; now we're keeping counts. When we go up a level (e.g., from three levels down to a new second-level heading), we reset the counter on the old level.
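The one output format listed earlier that we haven't coded is the inheritance view, and it falls out of the same @tag array, just joined in reverse. A minimal sketch, assuming the outline arrives on the DATA filehandle like the previous examples:

    while (<DATA>) {
        chomp;
        $tag[$in = s/\G\.//g] = $_;
        print join(" - ", reverse @tag[0 .. $in]), "\n";
    }
    __END__
    Alternative
    .Punk
    ..Emo

That prints "Alternative", then "Punk - Alternative", then "Emo - Punk - Alternative", and so on.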
Recipe 11.17 Program: Binary Trees

Because Perl's built-in data types are already powerful, high-level, dynamic data types in their own right, most code can use what's already provided. If you just want quick lookups, you nearly always want to use a simple hash. As Larry has said, "The trick is to use Perl's strengths rather than its weaknesses."

However, hashes provide no inherent ordering. To traverse the hash in a particular order, you must first extract its keys and then sort them. If you find yourself doing so many times, performance will suffer, but probably not enough to justify the time required to craft a fancy algorithm. A tree structure provides ordered traversals.

How do you write a tree in Perl? First, you grab one of your favorite textbooks on data structures; the authors recommend Cormen et al., as mentioned in Other Books in the Preface. Using an anonymous hash to represent each node in the tree, translate the algorithms in the book into Perl. This is usually much more straightforward than you would imagine.

The program code in Example 11-1 demonstrates an ordered binary tree implementation using anonymous hashes. Each node has three fields: a left child, a right child, and a value. The crucial property of an ordered binary tree is that at every node, all left children have values that are less than the current node value, and all right children have values that are greater.

The main program does three things. First, it creates a tree with 20 random nodes. Then it shows the in-order, pre-order, and post-order traversals of that tree. Finally, it allows the user to enter a key and reports whether that key is in the tree.

The insert function takes advantage of Perl's implicit pass-by-reference behavior on scalars to initialize an empty tree when asked to insert into an empty node. The assignment of the new node back to $_[0] alters the value in its caller.

Although this data structure takes much more memory than a simple hash and the lookups are slower, the ordered traversals themselves are faster.

A B-Tree is not a binary tree; it is a more flexible tree structure normally maintained on disk. DB_File has a BTREE interface (see DB_File(3)), and Mark-Jason Dominus has an excellent article on B-Trees in The Perl Journal, Volume 2, Issue 4, Winter 1997, pp. 35-42.

If you want to learn more about binary trees, Introduction to Algorithms, by Cormen, Leiserson, and Rivest, and Algorithms in C, by Robert Sedgewick, both cover the basic material. But for a treatment of that material cast in native Perl, no book can compare with Mastering Algorithms with Perl, by Orwant, Hietaniemi, and MacDonald.

The program is shown in Example 11-1.

Example 11-1. bintree

    #!/usr/bin/perl -w
    # bintree - binary tree demo program
    use strict;

    my($root, $n);

    # first generate 20 random inserts
    while ($n++ < 20) { insert($root, int(rand(1000))) }

    # now dump out the tree all three ways
    print "Pre order:  ";  pre_order($root);  print "\n";
    print "In order:   ";  in_order($root);   print "\n";
    print "Post order: ";  post_order($root); print "\n";

    # prompt until EOF
    for (print "Search? "; <>; print "Search? ") {
        chomp;
        my $found = search($root, $_);
        if ($found) { print "Found $_ at $found, $found->{VALUE}\n" }
        else        { print "No $_ in tree\n" }
    }

    exit;

    #########################################

    # insert given value into proper point of
    # provided tree.  If no tree provided,
    # use implicit pass by reference aspect of @_
    # to fill one in for our caller.
    sub insert {
        my($tree, $value) = @_;
        unless ($tree) {
            $tree = { };            # allocate new node
            $tree->{VALUE} = $value;
            $tree->{LEFT}  = undef;
            $tree->{RIGHT} = undef;
            $_[0] = $tree;          # $_[0] is reference param!
            return;
        }
        if    ($tree->{VALUE} > $value) { insert($tree->{LEFT},  $value) }
        elsif ($tree->{VALUE} < $value) { insert($tree->{RIGHT}, $value) }
        else                            { warn "dup insert of $value\n" }   # XXX: no dups
    }

    # recurse on left child,
    # then show current value,
    # then recurse on right child.
    sub in_order {
        my($tree) = @_;
        return unless $tree;
        in_order($tree->{LEFT});
        print $tree->{VALUE}, " ";
        in_order($tree->{RIGHT});
    }

    # show current value,
    # then recurse on left child,
    # then recurse on right child.
    sub pre_order {
        my($tree) = @_;
        return unless $tree;
        print $tree->{VALUE}, " ";
        pre_order($tree->{LEFT});
        pre_order($tree->{RIGHT});
    }

    # recurse on left child,
    # then recurse on right child,
    # then show current value.
    sub post_order {
        my($tree) = @_;
        return unless $tree;
        post_order($tree->{LEFT});
        post_order($tree->{RIGHT});
        print $tree->{VALUE}, " ";
    }

    # find out whether provided value is in the tree.
    # if so, return the node at which the value was found.
    # cut down search time by only looking in the correct
    # branch, based on current value.
    sub search {
        my($tree, $value) = @_;
        return unless $tree;
        if ($tree->{VALUE} == $value) {
            return $tree;
        }
        search($tree->{ ($value < $tree->{VALUE}) ? "LEFT" : "RIGHT"}, $value);
    }
Chapter 12. Packages, Libraries, and Modules

Like all those possessing a library, Aurelian was aware that he was guilty of not knowing his in its entirety.

—Jorge Luis Borges, The Theologians

Introduction

Imagine that you have two separate programs, both of which work fine by themselves, and you decide to make a third program that combines the best features from the first two. You copy both programs into a new file or cut and paste selected pieces. You find that the two programs had variables and functions with the same names that should remain separate. For example, both might have an init function or a global $count variable. When merged into one program, these separate parts would interfere with each other.

The solution to this problem is packages. Perl uses packages to partition the global namespace. The package is the basis for both traditional modules and object-oriented classes. Just as directories contain files, packages contain identifiers. Every global identifier (variables, functions, file and directory handles, and formats) has two parts: its package name and the identifier proper. These two pieces are separated from one another with a double colon. For example, the variable $CGI::needs_binmode is a global variable named $needs_binmode, which resides in package CGI. Where the filesystem uses slashes to separate the directory from the filename, Perl uses a double colon. $Names::startup is the variable named $startup in the package Names, whereas $Dates::startup is the $startup variable in package Dates.

Saying $startup by itself without a package name means the global variable $startup in the current package. (This assumes that no lexical $startup variable is currently visible. Lexical variables are explained in Chapter 10.) When looking at an unqualified variable name, a lexical takes precedence over a global. Lexicals live in scopes; globals live in packages. If you really want the global instead, you need to fully qualify it.

package is a compile-time declaration that sets the default package prefix for unqualified global identifiers, much as chdir sets the default directory prefix for relative pathnames. This effect lasts until the end of the current scope (a brace-enclosed block, file, or eval). The effect is also terminated by any subsequent package statement in the same scope. (See the following code.) All programs are in package main until they use a package statement to change this.

    package Alpha;
    $name = "first";

    package Omega;
    $name = "last";

    package main;
    print "Alpha is $Alpha::name, Omega is $Omega::name.\n";

    Alpha is first, Omega is last.

Unlike user-defined identifiers, built-in variables with punctuation names (like $_ and $.) and the identifiers STDIN, STDOUT, STDERR, ARGV, ARGVOUT, ENV, INC, and SIG are all forced to be in package main when unqualified. That way things like STDIN, @ARGV, %ENV, and $_ are always the same no matter what package you're in; for example, @ARGV always means @main::ARGV, even if you've used package to change the default package. A fully qualified @ElseWhere::ARGV would not, and carries no special built-in meaning. Make sure to localize $_ if you use it in your module.

Modules

The unit of software reuse in Perl is the module, a file containing related functions designed to be used by programs and other modules. Every module has a public interface, a set of variables and functions that outsiders are encouraged to use.
From inside the module, the interface is defined by initializing certain package variables that the standard Exporter module looks at. From outside the module, the interface is accessed by importing symbols as a side effect of the use statement. The public interface of a Perl module is whatever is documented to be public. When we talk about modules in this chapter, and traditional modules in general, we mean those that use the Exporter.

The require and use statements load a module into your program, although their semantics vary slightly. require loads modules at runtime, with a check to avoid the redundant loading of a given module. use is like require, with two added properties: compile-time loading and automatic importing. Modules included with use are processed at compile time, but require processing happens at runtime. This is important because if a module needed by a program is missing, the program won't even start, because the use fails during compilation of your script.

Another advantage of compile-time use over runtime require is that function prototypes in the module's subroutines become visible to the compiler. This matters because only the compiler cares about prototypes, not the interpreter. (Then again, we don't usually recommend prototypes except for replacing built-in commands, which do have them.)

use is suitable for giving hints to the compiler because of its compile-time behavior. A pragma is a special module that acts as a directive to the compiler to alter how Perl compiles your code. A pragma's name is always all lowercase, so when writing a regular module instead of a pragma, choose a name that starts with a capital letter. Pragmas supported by the v5.8.1 release of Perl include attributes, autouse, base, bigint, bignum, bigrat, bytes, charnames, constant, diagnostics, fields, filetest, if, integer, less, locale, open, overload, sigtrap, sort, strict, subs, utf8, vars, vmsish, and warnings. Each has its own manpage.

The other difference between require and use is that use performs an implicit import on the included module's package. Importing a function or variable from one package to another is a form of aliasing; that is, it makes two different names for the same underlying thing. It's like linking files from another directory into your current one by the command ln /somedir/somefile. Once it's linked in, you no longer have to use the full pathname to access the file. Likewise, an imported symbol no longer needs to be fully qualified by package name (or declared with our or the older use vars if a variable, or with use subs if a subroutine). You can use imported variables as though they were part of your package. If you imported $English::OUTPUT_AUTOFLUSH in the current package, you could refer to it as $OUTPUT_AUTOFLUSH.

The required file extension for a Perl module is .pm. The module named FileHandle would be stored in the file FileHandle.pm. The full path to the file depends on your include path, which is stored in the global @INC variable. Recipe 12.8 shows how to manipulate this array for your own purposes. If the module name itself contains any double colons, these are translated into your system's directory separator. That means that the File::Find module resides in the file File/Find.pm under most filesystems.
For example:

    require "FileHandle.pm";     # runtime load
    require FileHandle;          # ".pm" assumed; same as previous
    use FileHandle;              # compile-time load

    require "Cards/Poker.pm";    # runtime load
    require Cards::Poker;        # ".pm" assumed; same as previous
    use Cards::Poker;            # compile-time load

Import/Export Regulations

The following is a typical setup for a hypothetical module named Cards::Poker that demonstrates how to manage its exports. The code goes in the file named Poker.pm within the directory Cards; that is, Cards/Poker.pm. (See Recipe 12.8 for where the Cards directory should reside.) Here's that file, with line numbers included for reference:

    1 package Cards::Poker;
    2 use Exporter;
    3 @ISA = ("Exporter");
    4 @EXPORT = qw(&shuffle @card_deck);
    5 @card_deck = ( );    # initialize package global
    6 sub shuffle { }      # fill-in definition later
    7 1;                   # don't forget this

Line 1 declares the package that the module will put its global variables and functions in. Typically, a module first switches to a particular package so that it has its own place for global variables and functions, one that won't conflict with that of another program. This package name must be written exactly as in the corresponding use statement when the module is loaded.

Don't say package Poker just because the basename of your file is Poker.pm. Rather, say package Cards::Poker because your users will say use Cards::Poker. This common problem is hard to debug. If you don't make the package names specified by the package and use statements identical, you won't see a problem until you try to call imported functions or access imported variables, which will be mysteriously missing.

Line 2 loads in the Exporter module, which manages your module's public interface as described later. Line 3 initializes the special, per-package array @ISA to contain the word "Exporter". When a user says use Cards::Poker, Perl implicitly calls a special method, Cards::Poker->import( ). You don't have an import method in your package, but that's okay, because the Exporter package does, and you're inheriting from it because of the assignment to @ISA (is a). Perl looks at the package's @ISA for resolution of undefined methods. Inheritance is a topic of Chapter 13. You may ignore it for now—so long as you put code like that in lines 2 and 3 into each module you write.

Line 4 assigns the list ('&shuffle', '@card_deck') to the special, per-package array @EXPORT. When someone imports this module, variables and functions listed in that array are aliased into the caller's own package. That way they don't have to call the function Cards::Poker::shuffle(23) after the import. They can just write shuffle(23) instead. This won't happen if they load Cards::Poker with require Cards::Poker; only a use imports.

Lines 5 and 6 set up the package global variables and functions to be exported. (We presume you'll actually flesh out their initializations and definitions more than in these examples.) You're free to add other variables and functions to your module, including ones you don't put in the public interface via @EXPORT. See Recipe 12.1 for more about using the Exporter.

Finally, line 7 is a simple 1, indicating the overall return value of the module. If the last evaluated expression in the module doesn't produce a true value, an exception will be raised. Trapping this is the topic of Recipe 12.2.
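From the user's side, the whole arrangement looks like this. (This is a hypothetical program, assuming the Cards directory is somewhere in your @INC path and that shuffle has been fleshed out.)

    #!/usr/bin/perl
    # deal - hypothetical program that uses Cards::Poker
    use Cards::Poker;    # imports &shuffle and @card_deck

    @card_deck = (2 .. 10, qw(J Q K A));    # the imported package global
    shuffle( );          # really calls Cards::Poker::shuffle( )
    print "deck holds ", scalar(@card_deck), " cards\n";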
Packages group and organize global identifiers. They have nothing to do with privacy. Code compiled in package Church can freely examine and alter variables in package State. Package variables are always global and are used for sharing. But that's okay, because a module is more than just a package; it's also a file, and files count as their own scope. So if you want privacy, use lexical variables instead of globals. This is the topic of Recipe 12.4.

Other Kinds of Library Files

A library is a collection of loosely related functions designed to be used by other programs. It lacks the rigorous semantics of a Perl module. The file extension .pl indicates that it's a Perl library file. Examples include syslog.pl and abbrev.pl. These are included with the standard release for compatibility with prehistoric scripts written under Perl v4 or below.

Perl libraries—or in fact, any arbitrary file with Perl code in it—can be loaded in using do "file.pl" or with require "file.pl". The latter is preferred in most situations, because unlike do, require does implicit error checking. It raises an exception if the file can't be found in your @INC path, doesn't compile, or if it doesn't return a true value when any initialization code is run (the last part is what the 1 was for earlier). Another advantage of require is that it keeps track of which files have already been loaded in the global hash %INC. It doesn't reload the file if %INC indicates that the file has already been read.

Libraries work well when used by a program, but problems arise when libraries use one another. Consequently, simple Perl libraries have been rendered mostly obsolete, replaced by the more modern modules. But some programs still use libraries, usually loading them in with require instead of do.

Other file extensions are occasionally seen in Perl. A .ph is used for C header files that have been translated into Perl libraries using the h2ph tool, as discussed in Recipe 12.17. A .xs indicates an augmented C source file, possibly created by the h2xs tool, which will be compiled by the xsubpp tool and your C compiler into native machine code. This process of creating mixed-language modules is discussed in Recipe 12.18.

So far we've talked only about traditional modules, which export their interface by allowing the caller direct access to particular subroutines and variables. Most modules fall into this category. But some problems—and some programmers—lend themselves to more intricately designed modules: those involving objects. An object-oriented module seldom uses the import-export mechanism at all. Instead, it provides an object-oriented interface full of constructors, destructors, methods, inheritance, and operator overloading. This is the subject of Chapter 13.

Not Reinventing the Wheel

CPAN, the Comprehensive Perl Archive Network, is a gigantic repository of nearly everything about Perl you could imagine, including source, documentation, alternate ports, and above all, modules—some 4,500 of them as of spring of 2003. Before you write a new module, check with CPAN to see whether one already exists that does what you need. Even if one doesn't, something close enough might give you ideas.

CPAN is a replicated archive, currently mirrored on nearly 250 sites. Access CPAN via http://www.cpan.org/. If you just want to poke around, you can manually browse through the directories there. There are many indices, including listings of just new modules and of all modules organized by name, author, or category. A convenient alternative to picking through thousands of modules is the search engine available at http://search.cpan.org/.
You can search for modules by their name or author, but the facility for grepping through all registered modules' documentation is often more useful. That way you don't have to download and install a module just to see what it's supposed to do.

See Also

Chapters 10, 11, and 22 of Programming Perl; perlmod(1)

Recipe 12.1 Defining a Module's Interface

12.1.1 Problem

You want the standard Exporter module to define the external interface to your module.

12.1.2 Solution

In module file YourModule.pm, place the following code. Fill in the ellipses as explained in the Discussion section.

    package YourModule;
    use strict;
    our (@ISA, @EXPORT, @EXPORT_OK, %EXPORT_TAGS, $VERSION);
    use Exporter;
    $VERSION = 1.00;           # Or higher
    @ISA = qw(Exporter);

    @EXPORT      = qw(...);    # Symbols to autoexport (:DEFAULT tag)
    @EXPORT_OK   = qw(...);    # Symbols to export on request
    %EXPORT_TAGS = (           # Define names for sets of symbols
        TAG1 => [...],
        TAG2 => [...],
        ...
    );

    ########################
    # your code goes here
    ########################

    1;                         # this should be your last line

In other files where you want to use YourModule, choose one of these lines:

    use YourModule;              # Import default symbols into my package
    use YourModule qw(...);      # Import listed symbols into my package
    use YourModule ( );          # Do not import any symbols
    use YourModule qw(:TAG1);    # Import whole tag set

12.1.3 Discussion

The standard Exporter module handles the module's external interface. Although you could define your own import method for your package, almost no one does this.

When someone says use YourModule, this does a require "YourModule.pm" statement followed by a YourModule->import( ) method call, both during compile time. The import method inherited from the Exporter package looks for global variables in your package to govern its behavior. Because they must be package globals, we've declared them with our to satisfy use strict. These variables are:

$VERSION

When a module is loaded, a minimal required version number can be supplied. If the version isn't at least this high, the use will raise an exception.

    use YourModule 1.86;    # If $VERSION < 1.86, fail

@EXPORT

This array contains a list of functions and variables that will be exported into the caller's own namespace so they can be accessed without being fully qualified. Typically, a qw( ) list is used.

    @EXPORT = qw(&F1 &F2 @List);
    @EXPORT = qw( F1  F2 @List);    # same thing

With the simple use YourModule call, the function &F1 can be called as F1( ) rather than YourModule::F1( ), and the array can be accessed as @List instead of @YourModule::List. The ampersand is optional in front of an exported function specification. To load the module at compile time but request that no symbols be exported, use the special form use YourModule ( ), with empty parentheses.

@EXPORT_OK

This array contains symbols that can be imported if they're specifically asked for. If the array were loaded this way:

    @EXPORT_OK = qw(Op_Func %Table);

then the user could load the module like so:

    use YourModule qw(Op_Func %Table F1);

and import only the Op_Func function, the %Table hash, and the F1 function. The F1 function was listed in the @EXPORT array. Notice that this does not automatically import F2 or @List, even though they're in @EXPORT. To get everything in @EXPORT plus extras from @EXPORT_OK, use the special :DEFAULT tag, such as:

    use YourModule qw(:DEFAULT %Table);

%EXPORT_TAGS

This hash is used by large modules like CGI or POSIX to create higher-level groupings of related import symbols.
Its values are references to arrays of symbol names, all of which must be in either @EXPORT or @EXPORT_OK. Here's a sample initialization:

    %EXPORT_TAGS = (
        Functions => [ qw(F1 F2 Op_Func) ],
        Variables => [ qw(@List %Table)  ],
    );

An import symbol with a leading colon means to import a whole group of symbols. Here's an example:

    use YourModule qw(:Functions %Table);

That pulls in all symbols from:

    @{ $YourModule::EXPORT_TAGS{Functions} }

that is, it pulls in the F1, F2, and Op_Func functions and then the %Table hash. Although you don't list it in %EXPORT_TAGS, the implicit tag :DEFAULT automatically means everything in @EXPORT.

You don't have to have all those variables defined in your module. You just need the ones that you expect people to be able to use.

12.1.4 See Also

The "Creating Modules" section of Chapter 11 of Programming Perl; the documentation for the standard Exporter module, also found in Chapter 32 of Programming Perl; Recipe 12.8; Recipe 12.22

Recipe 12.2 Trapping Errors in require or use

12.2.1 Problem

You need to load in a module that might not be present on your system. This normally results in a fatal exception. You want to detect and trap these failures.

12.2.2 Solution

Wrap the require or use in an eval, and wrap the eval in a BEGIN block:

    # no import
    BEGIN {
        unless (eval "require $mod; 1") {
            warn "couldn't require $mod: $@";
        }
    }

    # imports into current package
    BEGIN {
        unless (eval "use $mod; 1") {
            warn "couldn't use $mod: $@";
        }
    }

12.2.3 Discussion

You usually want a program to fail if it tries to load a module that is missing or doesn't compile. Sometimes, though, you'd like to recover from that error, perhaps trying an alternative module instead. As with any other exception, you insulate yourself from compilation errors with an eval.

You don't want to use eval { BLOCK }, because this traps only runtime exceptions, and use is a compile-time event. Instead, you must use eval "string" to catch compile-time problems as well. Remember, require on a bareword has a slightly different meaning than require on a variable. It adds a ".pm" and translates double colons into your operating system's path separators, canonically / (as in URLs), but sometimes \, :, or even . on some systems.

If you need to try several modules in succession, stopping at the first one that works, you could do something like this:

    BEGIN {
        my($found, @DBs, $mod);
        $found = 0;
        @DBs = qw(Giant::Eenie Giant::Meanie Mouse::Mynie Moe);
        for $mod (@DBs) {
            if (eval "require $mod") {
                $mod->import( );    # if needed
                $found = 1;
                last;
            }
        }
        die "None of @DBs loaded" unless $found;
    }

We wrap the eval in a BEGIN block to ensure the module-loading happens at compile time instead of runtime.

12.2.4 See Also

The eval, die, use, and require functions in Chapter 32 of Programming Perl and in perlfunc(1); Recipe 10.12; Recipe 12.3

Recipe 12.3 Delaying use Until Runtime

12.3.1 Problem

You have a module that you don't need to load each time the program runs, or whose inclusion you wish to delay until after the program starts up.

12.3.2 Solution

Either break up the use into its separate require and import components, or else employ the use autouse pragma.

12.3.3 Discussion

Programs that check their arguments and abort with a usage message on error have no reason to load modules they never use. This delays the inevitable and annoys users. But those use statements happen during compilation, not execution, as explained in the Introduction.
Here, an effective strategy is to place argument checking in a BEGIN block before loading the modules. The following is the start of a program that checks to make sure it was called with exactly two arguments, which must be whole numbers, before going on to load the modules it will need:

    BEGIN {
        unless (@ARGV == 2 && (2 == grep {/^\d+$/} @ARGV)) {
            die "usage: $0 num1 num2\n";
        }
    }
    use Some::Module;
    use More::Modules;

A related situation arises in programs that don't always use the same set of modules every time they're run. For example, the factors program from Chapter 2 needs the infinite precision arithmetic library only when the -b command-line flag is supplied. A use statement would be pointless within a conditional because it's evaluated at compile time, long before the if can be checked. So we use a require instead:

    if ($opt_b) {
        require Math::BigInt;
    }

Because Math::BigInt is an object-oriented module instead of a traditional one, no import was needed. If you have an import list, specify it with a qw( ) construct as you would with use. For example, rather than this:

    use Fcntl qw(O_EXCL O_CREAT O_RDWR);

you might say this instead:

    require Fcntl;
    Fcntl->import(qw(O_EXCL O_CREAT O_RDWR));

Delaying the import until runtime means that the rest of your program is not subject to any imported semantic changes that the compiler would have seen if you'd used a use. In particular, subroutine prototypes and the overriding of built-in functions are not seen in time.

You might want to encapsulate this delayed loading in a subroutine. The following deceptively simple approach does not work:

    sub load_module {
        require $_[0];    # WRONG
        import  $_[0];    # WRONG
    }

It fails for subtle reasons. Imagine calling require with an argument of "Math::BigFloat". If that's a bareword, the double colon is converted into your operating system's path separator and a trailing .pm is added. But as a simple variable, it's a literal filename. Worse, Perl doesn't have a built-in import function. Instead, there's a class method named import that we're using the dubious indirect object syntax on. As with indirect filehandles, you can use indirect objects only on a plain scalar variable, a bareword, or a block returning the object, not an expression or one element from an array or hash.

A better implementation might look more like:

    load_module("Fcntl", qw(O_EXCL O_CREAT O_RDWR));

    sub load_module {
        eval "require $_[0]";
        die if $@;
        $_[0]->import(@_[1 .. $#_]);
    }

But this still isn't perfectly correct in the general case. It really shouldn't import those symbols into its own package. It should put them into its caller's package. We could account for this, but the whole procedure is getting increasingly messy.

Occasionally, the condition can be reasonably evaluated before runtime, perhaps because it uses only built-in, predefined variables, or because you've arranged to initialize the variables used in the conditional expression at compile time with a BEGIN block. If so, the if pragma comes in handy. The syntax is:

    use if CONDITION, MODULE;
    use if CONDITION, MODULE => ARGUMENTS;

As in:

    use if $^O =~ /bsd/i, BSD::Resource;
    use if $] >= 5.006_01, File::Temp => qw/tempfile tempdir/;

A convenient alternative is the use autouse pragma. This directive can save time on infrequently loaded functions by delaying their loading until they're actually used:

    use autouse Fcntl => qw( O_EXCL( ) O_CREAT( ) O_RDWR( ) );

We put parentheses after O_EXCL, O_CREAT, and O_RDWR when we autoused them but not when we used them or imported them.
The autouse pragma doesn't just take function names; it can also take a prototype for the function. The Fcntl constants are prototyped to take no arguments, so we can use them as barewords in our program without use strict kvetching.

Remember, too, that use strict's checks take place at compile time. If we use Fcntl, the prototypes in the Fcntl module are compiled and we can use the constants without parentheses. If we require or wrap the use in an eval, as we did earlier, we prevent the compiler from reading the prototypes, so we can't use the Fcntl constants without parentheses.

Read the autouse pragma's online documentation to learn its various caveats and provisos.

12.3.4 See Also

Recipe 12.2; the discussion on the import method in the documentation for the standard Exporter module, also found in Chapter 32 of Programming Perl; the documentation for the standard use autouse pragma

Recipe 12.4 Making Variables Private to a Module

12.4.1 Problem

You want to make a variable private to a package.

12.4.2 Solution

You can't. But you can make them private to the file that the module sits in, which usually suffices.

12.4.3 Discussion

Remember that a package is just a way of grouping variables and functions together, conferring no privacy. Anything in a package is by definition global and accessible from anywhere. Packages only group; they don't hide.

For privacy, only lexical variables will do. A module is implemented in a Module.pm file, with all its globals in the package named Module. Because that whole file is by definition a scope and lexicals are private to a scope, creating file-scoped lexicals is effectively the same thing as a module-private variable.

If you alternate packages within a scope, though, you may be surprised that the scope's lexicals are still visible throughout that scope. That's because a package statement only sets a different prefix for a global identifier; it does not end the current scope, nor does it begin a new one.

    package Alpha;
    my $aa = 10;
       $x = "azure";

    package Beta;
    my $bb = 20;
       $x = "blue";

    package main;
    print "$aa, $bb, $x, $Alpha::x, $Beta::x\n";

    10, 20, , azure, blue

Was that the output you expected? The two lexicals, $aa and $bb, are still in scope because we haven't left the current block, file, or eval. You might think of globals and lexicals as existing in separate dimensions, forever unrelated to each other. Package statements have nothing to do with lexicals. By setting the current prefix, the first global variable $x is really $Alpha::x, whereas the second $x is now $Beta::x because of the intervening package statement changing the default prefix. Package identifiers, if fully qualified, can be accessed from anywhere, as we've done in the print statement.

So, packages can't have privacy—but modules can, because they're in a file, which is always its own scope. Here's a simple module, placed in the file Flipper.pm, that exports two functions, flip_words and flip_boundary. The module provides code to reverse words in a line, and to change the definition of a word boundary.

    # Flipper.pm
    package Flipper;
    use strict;

    require Exporter;
    use vars qw(@ISA @EXPORT $VERSION);
    @ISA     = qw(Exporter);
    @EXPORT  = qw(flip_words flip_boundary);
    $VERSION = 1.0;

    my $Separatrix = " ";    # default to blank; must precede functions

    sub flip_boundary {
        my $prev_sep = $Separatrix;
        if (@_) { $Separatrix = $_[0] }
        return $prev_sep;
    }

    sub flip_words {
        my $line  = $_[0];
        my @words = split($Separatrix, $line);
        return join($Separatrix, reverse @words);
    }

    1;
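A short program using the module might look like this (hypothetical, assuming Flipper.pm is somewhere in your @INC path):

    #!/usr/bin/perl
    # flipdemo - hypothetical program that uses Flipper
    use Flipper;

    print flip_words("one two three"), "\n";    # three two one

    my $old_sep = flip_boundary(":");           # words now split on colons
    print flip_words("a:b:c"), "\n";            # c:b:a
    flip_boundary($old_sep);                    # restore the previous separatrix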
This module sets three package variables needed by the Exporter and also initializes a lexical variable at file level called $Separatrix. Again, this variable is private to the file, not to the package. All code beneath its declaration in the same scope (or nested within that scope, as are the functions' blocks) can see $Separatrix perfectly. Even though they aren't exported, global variables could be accessed using the fully qualified name, as in $Flipper::VERSION. A scope's lexicals cannot be examined or tinkered with from outside that scope, which in this case is the entire file below their point of declaration. You cannot fully qualify lexicals or export them either; only globals can be exported. If someone outside the module needs to look at or change the file's lexicals, they must ask the module itself. That's where the flip_boundary function comes into play, allowing indirect access to the module's private parts.

This module would work the same even if its $Separatrix variable were a package global rather than a file lexical. Someone from the outside could theoretically play with it without the module realizing this. On the other hand, if they really want to that badly, perhaps you should let them do so. Peppering your module with file-scoped lexicals is not necessary. You already have your own namespace (Flipper, in this case) where you can store any identifier you want. That's what it's there for, after all.

Good Perl programming style nearly always avoids fully qualified identifiers. Speaking of style, the case of identifiers used in the Flipper module was not random. Following the Perl style guide, identifiers in all capitals are reserved for those with special meaning to Perl itself. Functions and local variables are all lowercase. The module's persistent variables (either file lexicals or package globals) are capitalized. Identifiers with multiple words have each word separated by an underscore to make them easier to read. We advise against using mixed capitals without underscores—you wouldn't like reading this book without spaces, either.

12.4.4 See Also

The discussion on file-scoped lexicals in perlmod(1); the "Scoped Declarations" section in Chapter 4 of Programming Perl; the section on "Programming with Style" in Chapter 24 of Programming Perl or perlstyle(1); Recipe 10.2; Recipe 10.3

Recipe 12.5 Making Functions Private to a Module

12.5.1 Problem

You want to make a function private to a package.

12.5.2 Solution

You can't. But you can make a private variable and store a reference to an anonymous function in it.

    # this is the file SomeModule.pm
    package Some_Module;

    my $secret_function = sub {
        # your code here
    };

    sub regular_function {
        # now call your "private" function via the code ref
        $secret_function->(ARG1, ARG2);
    }

12.5.3 Discussion

Even a function that isn't exported can still be accessed by anyone, anywhere, if they qualify that function's name with its package. That's because function names are always in the package symbol table, which is globally accessible.
By creating a lexical variable at the file scope, code in that module file below the point of declaration has full access to that variable. Code in other files will not, because those scopes are unrelated. The subroutine created via sub { .... } is anonymous, so there's no name in the symbol table for anyone outside to find. Not even other code in the module can call the function by name, since it doesn't have one, but that code can use the lexical variable to dereference the code reference indirectly:

$secret_function->(ARGS);    # infix deref form
&$secret_function(ARGS);     # prefix deref form

Curiously, if you really wanted to, you could give this anonymous function a temporary name. Using the technique outlined in Recipe 10.16, assign the code reference to a localized typeglob, like this:

sub module_function {
    local *secret = $secret_function;
    Other_Package::func1( );
    secret(ARG1, ARG2);
    Yet_Another_Package::func2( );
}

Now for the duration of module_function, your previously secret function can be called using a direct function call; no indirection required. However, code outside the module can also find that function. In the example, it doesn't matter whether func1 and func2 are in the module's file scope, because you've made a temporary symbol table entry through which they could get at your secret function. Therefore, if Other_Package::func1 turned around and called Some_Module::secret, it could find it—but only if func1 were called from the module_function in the example. If it were called from some other point, there wouldn't be any secret function in the Some_Module package symbol table, so the attempted function call would fail.

This slightly peculiar behavior, where temporaries' values and visibility depend upon who called whom at runtime, is called dynamic scoping. This is the nature of the local keyword. You can see why we don't usually suggest using it.

12.5.4 See Also

Recipe 12.4; the section on "Dynamically Scoped Variables: local" in Chapter 4 of Programming Perl; the section on "Symbol Tables" in Chapter 10 of Programming Perl

Recipe 12.6 Determining the Caller's Package

12.6.1 Problem

You need to find out the current or calling package.

12.6.2 Solution

To find the current package:

$this_pack = _ _PACKAGE_ _;

To find the caller's package:

$that_pack = caller( );

12.6.3 Discussion

The _ _PACKAGE_ _ symbol returns the package that the code is currently being compiled into. This doesn't interpolate into double-quoted strings:

print "I am in package _ _PACKAGE_ _\n";    # WRONG!
I am in package _ _PACKAGE_ _

Needing to figure out the caller's package arose more often in older code that received as input a string of code to be evaluated, or a filehandle, format, or directory handle name. Consider a call to a hypothetical runit function:

package Alpha;
runit('$line = <TEMP>');

package Beta;
sub runit {
    my $codestr = shift;
    eval $codestr;
    die if $@;
}

Because runit was compiled in a different package than was currently executing, when the eval runs, it acts as though it were passed $Beta::line and Beta::TEMP. The old workaround was to include your caller's package first:

package Beta;
sub runit {
    my $codestr = shift;
    my $hispack = caller;
    eval "package $hispack; $codestr";
    die if $@;
}

That approach works only when $line is a global variable. If it's lexical, that won't help at all.
Instead, arrange for runit to accept a reference to a subroutine:

package Alpha;
runit( sub { $line = <TEMP> } );

package Beta;
sub runit {
    my $coderef = shift;
    &$coderef( );
}

This not only works with lexicals, but has the added benefit of checking the code's syntax at compile time, which is a major win.

If all that's being passed in is a filehandle, it's more portable to use the Symbol::qualify function. This function takes a name and a package to qualify the name into. If the name needs qualification, it fixes it; otherwise, it's left alone. But that's considerably less efficient than a * prototype. Here's an example that reads and returns n lines from a filehandle. The function qualifies the handle before working with it:

open (FH, "<", "/etc/termcap")
    or die "can't open /etc/termcap: $!";
($a, $b, $c) = nreadline(3, "FH");

use Symbol ( );
use Carp;
sub nreadline {
    my ($count, $handle) = @_;
    my (@retlist, $line);

    croak "count must be > 0" unless $count > 0;
    $handle = Symbol::qualify($handle, (caller( ))[0]);
    croak "need open filehandle" unless defined fileno($handle);

    push(@retlist, $line) while defined($line = <$handle>) && $count--;
    return @retlist;
}

If everyone who called your nreadline function passed the filehandle as a typeglob *FH, as a glob reference \*FH, or using FileHandle or IO::Handle objects, you wouldn't need to do this. It's only the possibility of a bare "FH" string that requires qualification.

12.6.4 See Also

The documentation for the standard Symbol module, also found in Chapter 32 of Programming Perl; the descriptions of the special symbols _ _FILE_ _, _ _LINE_ _, and _ _PACKAGE_ _ in perldata(1); Recipe 12.14; Recipe 7.6

Recipe 12.7 Automating Module Cleanup

12.7.1 Problem

You need to create module setup code and cleanup code that gets called automatically, without user intervention.

12.7.2 Solution

For setup code, put executable statements outside subroutine definitions in the module file. For cleanup code, use an END subroutine in that module.

12.7.3 Discussion

In some languages, the programmer must remember to call module initialization code before accessing any of that module's regular functions. Similarly, when the program is done, the programmer may have to call module-specific finalization code. Not so in Perl. For per-module initialization code, executable statements outside of any subroutines in your module suffice. When the module is loaded, that code runs right then and there. The user never has to remember to do this, because it's done automatically.

Now, why would you want automatic cleanup code? It depends on the module. You might want to write a shutdown message to a logfile, tell a database server to commit any pending state, refresh a screen, or return the tty to its original state.

Suppose you want a module to log quietly whenever a program using it starts up or finishes. Add code in an END subroutine to run after your program finishes:

$Logfile = "/tmp/mylog" unless defined $Logfile;
open(LF, ">>", $Logfile)
    or die "can't append to $Logfile: $!";
select(((select(LF), $|=1))[0]);    # unbuffer LF
logmsg("startup");

sub logmsg {
    my $now = scalar gmtime;
    print LF "$0 $$ $now: @_\n"
        or die "write to $Logfile failed: $!";
}

END {
    logmsg("shutdown");
    close(LF) or die "close $Logfile failed: $!";
}

The first part of the code, outside any subroutine declaration, is executed at module load time. The module user doesn't have to do anything special to make this happen.
Someone might be unpleasantly surprised, however, if the file couldn't be accessed, since the die would make the use or require fail.

END routines work like exit handlers, such as trap 0 in the shell, atexit in C programming, or global destructors or finalizers in object-oriented languages. All the ENDs in a program are run in the opposite order from which they were loaded; that is, last seen, first run. These get called whether the program finishes through normal process termination (by implicitly reaching the end of your main program), through an explicit call to the exit function, or via an uncaught exception such as die or a mistake involving division by zero.

Uncaught signals are a different matter, however. Death by signal does not run your exit handlers. The following pragma takes care of them:

use sigtrap qw(die normal-signals error-signals);

That causes all normal signals and error signals to make your program expire via the die mechanism, effectively converting a signal into an exception and thus permitting your END handlers to run. You can get fancier, too:

use sigtrap qw( die untrapped normal-signals
                stack-trace any error-signals );

That says to die only on an untrapped normal signal, but for error signals, to produce a stack trace before dying—like the confess function from the Carp module.

END also isn't called when a process polymorphs itself via the exec function, because you are still in the same process, just a different program. All normal process attributes remain, like process ID and parent PID, user and group IDs, umask, current directory, environment variables, resource limits and accumulated statistics, and open file descriptors (however, see the $^F variable in perlvar(1) or Programming Perl). If it didn't work this way, exit handlers would execute redundantly in programs manually managing their fork and exec calls. This would not be good.

12.7.4 See Also

The standard use sigtrap pragma, also in Chapter 31 of Programming Perl; Chapter 18 of Programming Perl and the section on "Package Constructors and Destructors" in perlmod(1); the $^F ($SYSTEM_FD_MAX) variable in Chapter 28 of Programming Perl and in perldata(1); the fork and exec functions in Chapter 29 of Programming Perl and in perlmod(1)

Recipe 12.8 Keeping Your Own Module Directory

12.8.1 Problem

You don't want to install your own personal modules in the standard per-system extension library.

12.8.2 Solution

You have several choices: use Perl's -I command-line switch; set your PERL5LIB environment variable; or employ the use lib pragma, possibly in conjunction with the FindBin module.

12.8.3 Discussion

The @INC array contains a list of directories to consult when do, require, or use pulls in code from another file. You can print these out easily from the command line:

% perl -e 'printf "%d %s\n", $i++, $_ for @INC'
0 /usr/local/lib/perl5/5.8.0/OpenBSD.i386-openbsd
1 /usr/local/lib/perl5/5.8.0
2 /usr/local/lib/perl5/site_perl/5.8.0/OpenBSD.i386-openbsd
3 /usr/local/lib/perl5/site_perl/5.8.0
4 /usr/local/lib/perl5/site_perl/5.6.0
5 /usr/local/lib/perl5/site_perl/5.00554
6 /usr/local/lib/perl5/site_perl/5.005
7 /usr/local/lib/perl5/site_perl
8 .

The first two directories, elements 0 and 1 of @INC, are respectively the standard architecture-dependent and architecture-independent directories, which all standard libraries, modules, and pragmas go into. You have two of them because some modules contain information or formatting that makes sense only on that particular architecture.
For example, the Config module contains information that cannot be shared across several architectures, so it goes in the 0th array element. Modules that include compiled C components, such as Socket.so, are also placed there. Most modules, however, go in the platform-independent directory in the 1st element.

The next pair, elements 2 and 3, fulfills roles analogous to elements 0 and 1, but on a site-specific basis. Suppose you have a module that didn't come with Perl, such as one from CPAN or one you wrote yourself. When you or (more likely) your system administrator installs this module, its components go into one of the site-specific directories. You are encouraged to use these for any modules that your entire site should be able to access conveniently.

In this particular configuration, elements 4-7 are there so that Perl can find any site-specific modules installed under a previous release of Perl. Such directories can be automatically added to @INC when you configure, build, and install a newer Perl release, making it easier to upgrade.

The last standard component, "." (your current working directory), is useful only when developing and testing your software, not when deploying it. If your modules are in the same directory that you last chdired to, you're fine. If you're anywhere else, it doesn't work.

So sometimes none of the @INC directories work out. Maybe you have your own personal modules. Perhaps your project group has particular modules that are relevant only to that project. In these cases, you need to augment the standard @INC search.

The first approach involves a command-line flag, -Idirlist. The dirlist is a colon-separated[1] list of one or more directories, which are prepended to the front of the @INC array. This works well for simple command lines, and thus can be used on a per-command basis, such as when you call a quick one-liner from a shell script.

[1] Comma-separated on Mac OS 9.

This technique should not be used in the #! (pound-bang) line. First, it's not much fun to modify each program. More importantly, some older operating systems have bugs related to how long that line can be, typically 32 characters, including the #! part. That means if you have a very long path, such as #!/opt/languages/free/extrabits/perl, you may get the mysterious "Command not found" error. Perl does its best to rescan the line manually, but this is still too dicey to rely on.

Often, a better solution is to set the PERL5LIB environment variable. This can be done in your shell start-up file. Or, your system administrator may want to do so in a systemwide start-up file so all users can benefit. For example, suppose you have all your own modules in a directory called ~/perllib. You would place one of the following lines in your shell start-up file, depending on which shell you use:

# syntax for sh, bash, ksh, or zsh
$ export PERL5LIB=$HOME/perllib

# syntax for csh or tcsh
% setenv PERL5LIB ~/perllib

Probably the most convenient solution from your users' perspective is for you to add a use lib pragma near the top of your script. That way, users of the program need take no special action to run it. Imagine a hypothetical project called Spectre whose programs rely on its own set of libraries. Those programs could have a statement like this at their start:

use lib "/projects/spectre/lib";

What happens when you don't know the exact path to the library? Perhaps you've installed the whole project in an arbitrary path.
You could create an elaborate installation procedure to dynamically update the script, but even if you did, paths would still be frozen at installation time. If someone moved the files later, the libraries wouldn't be found.

The FindBin module conveniently solves this problem. This module tries to determine the full path to the executing script's enclosing directory, setting an importable package variable called $Bin to that directory. Typical usage is to look for modules either in the same directory as the program or in a lib directory at the same level.

To demonstrate the first case, suppose you have a program called /wherever/spectre/myprog that needs to look in /wherever/spectre for its modules, but you don't want that path hardcoded:

use FindBin;
use lib $FindBin::Bin;

The second case would apply if your program lives in /wherever/spectre/bin/myprog but needs to look at /wherever/spectre/lib for its modules:

use FindBin qw($Bin);
use lib "$Bin/../lib";

12.8.4 See Also

The documentation for the standard use lib pragma (also in Chapter 31 of Programming Perl) and the standard FindBin module; the discussion of the PERL5LIB environment variable in perl(1) and the "Environmental Variables" section of Chapter 19 of Programming Perl; your shell's syntax for setting environment variables

Recipe 12.9 Preparing a Module for Distribution

12.9.1 Problem

You want to prepare your module in standard distribution format so you can easily send your module to a friend. Better yet, you plan to contribute your module to CPAN so everyone can use it.

12.9.2 Solution

It's best to start with Perl's standard h2xs tool. Let's say you want to make a Planets module or an Astronomy::Orbits module. You'd type:

% h2xs -XA -n Planets
% h2xs -XA -n Astronomy::Orbits

These commands make subdirectories called ./Planets/ and ./Astronomy/Orbits/, respectively, where you will find all the components you need to get you started. The -n flag names the module you want to make, -X suppresses creation of XS (external subroutine) components, and -A means the module won't use the AutoLoader.

12.9.3 Discussion

Writing modules is easy—once you know how. Writing a proper module is like filling out a legal contract: it's full of places to initial, sign, and date exactly right. If you miss any, it's not valid. Instead of hiring a contract lawyer, you can get a quick start on writing modules using the h2xs program. This tool gives you a skeletal module file with the right parts filled in, and it also gives you the other files needed to correctly install your module and its documentation, or to bundle it up for contributing to CPAN or sending off to a friend.

h2xs is something of a misnomer, because XS is Perl's external subroutine interface for linking with C or C++. But the h2xs tool is also extremely convenient for preparing a distribution even when you aren't using the XS interface.

Let's look at the module file that h2xs has made. Because the module is called Astronomy::Orbits, the user specifies not use Orbits but rather use Astronomy::Orbits. Therefore an extra Astronomy subdirectory is made, under which an Orbits subdirectory is placed. Here is the first and perhaps most important line of Orbits.pm:

package Astronomy::Orbits;

This sets the package—the default prefix—on all global identifiers (variables, functions, filehandles, etc.) in the file. Therefore a variable like @ISA is really the global variable @Astronomy::Orbits::ISA.
As we said in the Introduction, you must not make the mistake of saying package Orbits simply because it's in the file Orbits.pm. The package statement in the module must exactly match the target of the use or require statement, which means the leading directory portion needs to be there and the characters' case must be the same. Furthermore, it must be installed in an Astronomy subdirectory. The h2xs command will set this all up properly, including the installation rule in the Makefile. But if you're doing this by hand, you must keep this in mind. See Recipe 12.1 for that.

If you plan to use autoloading, described in Recipe 12.11, omit the -A flag to h2xs, which produces lines like this:

require Exporter;
require AutoLoader;
@ISA = qw(Exporter AutoLoader);

If your module is bilingual in Perl and C, as described in Recipe 12.18, omit the -X flag to h2xs to produce lines like this:

require Exporter;
require DynaLoader;
@ISA = qw(Exporter DynaLoader);

Following these come the Exporter's variables, as explained in Recipe 12.1. If you're writing an object-oriented module as described in Chapter 13, you probably won't use the Exporter at all.

That's all there is for setup. Now, write your module code. When you're ready to ship it off, use the make dist directive from your shell to bundle it all up into a tar archive for easy distribution. (The name of the make program may vary from system to system.)

% perl Makefile.PL
% make dist

This will leave you with a file whose name is something like Astronomy-Orbits-1.03.tar.Z.

To register as a CPAN developer, check out http://pause.cpan.org.

12.9.4 See Also

http://www.cpan.org to find a mirror near you and directions for submission; h2xs(1); the documentation for the standard Exporter, AutoLoader, AutoSplit, and ExtUtils::MakeMaker modules, also found in Chapter 32 of Programming Perl

Recipe 12.10 Speeding Module Loading with SelfLoader

12.10.1 Problem

You'd like to load a very large module quickly.

12.10.2 Solution

Use the SelfLoader module:

require Exporter;
require SelfLoader;
@ISA = qw(Exporter SelfLoader);
#
# other initialization or declarations here
#
_ _DATA_ _
sub abc { .... }
sub def { .... }

12.10.3 Discussion

When you load a module using require or use, the entire module file must be read and compiled (into internal parse trees, not into byte code or native machine code) right then. For very large modules, this annoying delay is unnecessary if you need only a few functions from a particular file.

To address this problem, the SelfLoader module delays compilation of each subroutine until that subroutine is actually called. SelfLoader is easy to use: just place your module's subroutines underneath the _ _DATA_ _ marker so the compiler will ignore them, use a require to pull in the SelfLoader, and include SelfLoader in the module's @ISA array. That's all there is to it. When your module is loaded, the SelfLoader creates stub functions for all routines below _ _DATA_ _. The first time a function gets called, the stub replaces itself by first compiling the real function and then calling it.

There is one significant restriction on modules that employ the SelfLoader (or the AutoLoader, for that matter, described in Recipe 12.11). SelfLoaded or AutoLoaded subroutines have no access to lexical variables in the file whose _ _DATA_ _ block they are in, because they are compiled via eval in an imported AUTOLOAD block. Such dynamically generated subroutines are therefore compiled in the scope of SelfLoader's or AutoLoader's AUTOLOAD.
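Here's a minimal sketch of that restriction (the Gotcha module and its variable are hypothetical):

# Gotcha.pm
package Gotcha;
require Exporter;
require SelfLoader;
@ISA = qw(Exporter SelfLoader);

my $Hidden = "file-scoped lexical";
sub sees_it { return $Hidden }    # compiled now, so $Hidden is in scope
1;
_ _DATA_ _
sub cannot_see_it {
    return $Hidden;    # compiled later inside SelfLoader's AUTOLOAD,
}                      # where this $Hidden is an unrelated, undefined global

Calling Gotcha::sees_it( ) returns the string, but Gotcha::cannot_see_it( ) quietly returns undef.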
Whether the SelfLoader helps or hinders performance depends on how many subroutines the module has, how large they are, and whether they are all called over the lifetime of the program or not.

You should initially develop and test your module without SelfLoader. Commenting out the _ _DATA_ _ line will take care of that, making those functions visible to the compiler.

12.10.4 See Also

The documentation for the standard module SelfLoader; Recipe 12.11

Recipe 12.11 Speeding Up Module Loading with Autoloader

12.11.1 Problem

You want to use the AutoLoader module.

12.11.2 Solution

The easiest solution is to use the h2xs facility to create a directory and all the files you need. Here we assume you have your own directory, ~/perllib/, which contains your personal library modules.

% h2xs -Xn Sample
% cd Sample
% perl Makefile.PL LIB=~/perllib
% (edit Sample.pm)
% make install

12.11.3 Discussion

The AutoLoader addresses the same performance issues as the SelfLoader. It also provides stub functions that get replaced by real ones the first time they're called. But instead of looking for functions all in the same file, hidden below a _ _DATA_ _ marker, the AutoLoader expects to find the real definition for each function in its own file. If your Sample.pm module had two functions, foo and bar, then the AutoLoader would expect to find them in auto/Sample/foo.al and auto/Sample/bar.al, respectively. Modules employing the AutoLoader load faster than those using the SelfLoader, but at the cost of extra files, disk space, and complexity.

This setup sounds complicated. If you were doing it manually, it probably would be. Fortunately, h2xs helps out tremendously here. Besides creating a module directory with templates for your Sample.pm file and the other files you need, it also generates a Makefile that uses the AutoSplit module to break your module's functions into little files, one function per file. The make install rule installs these so they will be found automatically. All you have to do is put the module functions down below an _ _END_ _ line (rather than a _ _DATA_ _ line as with SelfLoader) that h2xs already created.

As with the SelfLoader, it's easier to develop and test your module without the AutoLoader. Just comment out the _ _END_ _ line while developing. The same restrictions about invisibility of file lexicals that apply to modules using the SelfLoader also apply when using the AutoLoader, so using file lexicals to maintain private state doesn't work. If state is becoming that complex and significant an issue, consider writing an object module instead of a traditional one.

12.11.4 See Also

The documentation for the standard module AutoLoader; h2xs(1); Recipe 12.10

Recipe 12.12 Overriding Built-in Functions

12.12.1 Problem

You want to replace a standard, built-in function with your own version.

12.12.2 Solution

Import that function from another module into your own namespace.

12.12.3 Discussion

Suppose you want to give a function of your own the same name as one of Perl's core built-ins. If you write:

sub time { "it's howdy doody time" }
print time( );

then you won't get your function called—you'll still get Perl's original, built-in version. You could use an explicit ampersand to call the function:

print &time( );

because that always gets your function, never the built-in. But then you forego any prototype checking and context coercion on the function's arguments. However, there is a way to override the built-in properly.
Many (but not all) of Perl's built-in functions may be overridden. This is not something to be attempted lightly, but it is possible. You might do this, for example, if you are running on a platform that doesn't support the function you'd like to emulate. Or, you might want to add your own wrapper around the built-in.

Not all reserved words have the same status. Those that return a negative number in the C-language keyword( ) function in the toke.c file in your Perl source kit may be overridden. Keywords that cannot be overridden as of v5.8.1 are defined, delete, do, else, elsif, eval, exists, for, foreach, format, glob, goto, grep, if, last, local, m, map, my, next, no, our, package, pos, print, printf, prototype, q, qq, qr, qw, qx, redo, require, return, s, scalar, sort, split, study, sub, tie, tied, tr, undef, unless, untie, until, use, while, and y. The rest can.

A standard Perl module that overrides a built-in is Cwd, which can override chdir. Others are the by-name versions of functions that return lists: File::stat, Net::hostent, Net::netent, Net::protoent, Net::servent, Time::gmtime, Time::localtime, Time::tm, User::grent, and User::pwent. These modules all override built-in functions like stat or getpwnam to return an object that can be accessed using a name, like getpwnam("daemon")->dir. To do this, they have to override the original, list-returning versions of those functions.

Overriding may be done only by importing the function from another package. This import takes effect only in the importing package, not in all possible packages. It's not enough simply to predeclare the function. You have to import it. This is a guard against accidentally redefining built-ins.

Let's say that you'd like to replace the built-in time function, whose return value is in integer seconds, with one that returns a floating-point number instead. You could make a Time::HiRes module with an optionally exported time function as follows:

package Time::HiRes;
use strict;
require Exporter;
use vars qw(@ISA @EXPORT_OK);
@ISA = qw(Exporter);
@EXPORT_OK = qw(time);

sub time( ) { ..... }    # TBA

Then the user who wants to use this augmented version of time would say something like:

use Time::HiRes qw(time);
$start = time( );
1 while print time( ) - $start, "\n";

This code assumes that your system has a function you can stick in the "TBA" definition shown previously. It just so happens, however, that you don't have to figure that part out, because the Time::HiRes module (which is included standard with the Perl distribution) does indeed behave as we've outlined it here. You can import its time( ) function to get one that is fancier than the core built-in, just as we did here.

If you don't want to take the trouble to create a full module file, set up its exports, and all the rest of the rigamarole, there's a shortcut approach via the subs pragma. It works like this:

use subs qw(time);
sub time { "it's howdy doody time" }
print time( );

Now you'd get your own function, even without the ampersand.

Even when you override a built-in by importing a function, that built-in is always still accessible if you fully qualify it using the (pseudo)package named CORE. Thus, even if you imported time( ) from FineTime, overriding the built-in, that original built-in can be called as CORE::time( ). For overriding of methods and operators, see Chapter 13.
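Here's a short sketch of both versions coexisting, assuming the Time::HiRes import shown earlier:

use Time::HiRes qw(time);
$fine   = time( );          # the imported, floating-point version
$coarse = CORE::time( );    # fully qualified: always the integer built-in
printf "%.6f versus %d\n", $fine, $coarse;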
12.12.4 See Also

The section on "Overriding Built-in Functions" in Chapter 11 of Programming Perl and in perlsub(1); Recipe 10.11

Recipe 12.13 Overriding a Built-in Function in All Packages

12.13.1 Problem

You want to change the definition of a core built-in function within your entire program, not just the current package.

12.13.2 Solution

Manually import, via direct symbol-table manipulation, the function into the CORE::GLOBAL pseudopackage:

*CORE::GLOBAL::int = \&myown_int;

12.13.3 Discussion

The technique demonstrated in the previous recipe only overrides a built-in in a particular package. It doesn't change everything for your whole program, no matter what package that function is called from. To do so would risk changing the behavior of code from modules you didn't write, and which were therefore not prepared for the change.

It has been said that Unix was not designed to stop you from doing stupid things, because that would also stop you from doing clever things. So, too, with Perl. Just because overriding a function in all packages at once might seem, well, imprudent doesn't mean a clever person won't someday find a marvelous use for such a facility.

For example, let's suppose that you've decided that the core int function's behavior of integer truncation, also known as rounding toward zero, is so annoying to your program that you want to provide an alternative by the same name. This would do it:

package Math::Rounding;
use warnings;
use Carp;
use Exporter;
our @EXPORT = qw(int);
our @ISA    = qw(Exporter);

sub int(;$) {
    my $arg = @_ ? shift : $_;
    use warnings FATAL => "numeric";    # promote to die( )ing
    my $result = eval { sprintf("%.0f", $arg) };
    if ($@) {
        die if $@ !~ /isn't numeric/;
        $@ =~ s/ in sprintf.*/ in replacement int/s;
        croak $@;
    } else {
        return $result;
    }
}

Your replacement version uses sprintf( ) to round to the closest integer. It also raises an exception if passed a non-numeric string. A program could access this function either by saying:

use Math::Rounding ( );
$y = Math::Rounding::int($x);

or by importing the function and overriding the built-in:

use Math::Rounding qw(int);
$y = int($x);

However, that only manages to replace the built-in for the current package. To replace it in all packages, at some point during compile time you'll have to execute a line of code like this:

*CORE::GLOBAL::int = \&Math::Rounding::int;

The standard File::Glob module allows you to change Perl's core glob operator using special import tags:

## override the core glob, forcing case sensitivity
use File::Glob qw(:globally :case);
my @sources = <*.{c,h,y}>;

## override the core glob, forcing case insensitivity
use File::Glob qw(:globally :nocase);
my @sources = <*.{c,h,y}>;

The module does this with its own version of import that detects those tags and makes the necessary assignments. You could do this, too. That way, this:

use Math::Rounding qw(-global int);

would make Perl use your replacement version for all calls to int from any package anywhere in your program. Here's a replacement import function that handles this:

sub import {
    if (@_ && $_[1] =~ /^-/) {
        if ($_[1] ne "-global") {
            croak "unknown import pragma";
        }
        splice(@_, 1, 1);      # discard "-global"
        no warnings "once";    # suppress "used only once" warnings
        *CORE::GLOBAL::int = \&int;
    }
    _ _PACKAGE_ _ -> export_to_level(1, @_);
}

The assignment happens only if the first thing to import is "-global".
The last line in our import function uses part of the Exporter module's internal API to handle any normal import.

12.13.4 See Also

Recipe 12.12; the section on "Overriding Built-in Functions" in Chapter 11 of Programming Perl and in perlsub(1); the documentation for the standard File::Glob module, as well as its source code

Recipe 12.14 Reporting Errors and Warnings Like Built-ins

12.14.1 Problem

You want to generate errors and warnings in your modules, but when you use warn or die, the user sees your own filename and line number. You'd like your functions to act like built-ins and report messages from the perspective of the user's code, not your own.

12.14.2 Solution

The standard Carp module provides functions to do this. Use carp instead of warn. Use croak (for a short message) and confess (for a long message) instead of die.

12.14.3 Discussion

Like built-ins, some of your module's functions generate warnings or errors if all doesn't go well. Think about sqrt: when you pass it a negative number (and you haven't used the Math::Complex module), an exception is raised, producing a message such as "Can't take sqrt of -3 at /tmp/negroot line 17", where /tmp/negroot is the name of your own program. But if you write your own function that dies, perhaps like this:

sub even_only {
    my $n = shift;
    die "$n is not even" if $n & 1;    # one way to test
    #....
}

then the message will say it's coming from the file your even_only function was itself compiled in, rather than from the file the user was in when they called your function. That's where the Carp module comes in handy. Instead of using die, use croak:

use Carp;
sub even_only {
    my $n = shift;
    croak "$n is not even" if $n % 2;    # here's another
    #....
}

If you just want to complain about something, but have the message report where in the user's code the problem occurred, call carp instead of warn. For example:

use Carp;
sub even_only {
    my $n = shift;
    if ($n & 1) {    # test whether odd number
        carp "$n is not even, continuing";
        ++$n;
    }
    #....
}

Many built-ins emit warnings only when the -w command-line switch has been used. The $^W variable (which is not meant to be a control character, but rather a ^ followed by a W) reflects whether that switch was used. You could choose to grouse only if the user asked for complaints:

carp "$n is not even, continuing" if $^W;

The Carp module provides a third function: confess. This works just like croak, except that it provides a full stack backtrace as it dies, reporting who called whom and with what arguments.

If you're interested only in the error message from carp, croak, and friends, the longmess and shortmess functions offer those:

use Carp;
$self->transplant_organ( )
    or $self->error( Carp::longmess("Unable to transplant organ") );

12.14.4 See Also

The warn and die functions in Chapter 29 of Programming Perl and in perlfunc(1); the documentation for the standard Carp module, also in Chapter 32 of Programming Perl; Recipe 19.2; the discussion on _ _WARN_ _ and _ _DIE_ _ in the %SIG entry of Chapter 28 of Programming Perl, in perlvar(1), and in Recipe 16.15

Recipe 12.15 Customizing Warnings

12.15.1 Problem

You would like your module to respect its caller's settings for lexical warnings, but you can't inspect the predefined $^W[2] variable to determine those settings.

[2] That's $WARNING if you've used English.
12.15.2 Solution

Your module should use this pragma:

use warnings::register;

Then from inside your module, use the warnings::enabled function from that module, as described in the Discussion, to check whether the caller has warnings enabled. This works both for the old-style, global warnings and for lexical warnings set via the use warnings pragma.

12.15.3 Discussion

Perl's -w command-line flag, mirrored by the global $^W variable, suffers from several problems. For one thing, it's an all-or-nothing affair, so if you turn it on for the program, module code included by that program—including code you may not have written—is also affected by it. For another, it's at best cumbersome to control compile-time warnings with it, forcing you to resort to convoluted BEGIN blocks. Finally, suppose you were interested in numeric warnings but not any other sort; you'd have to write a $SIG{_ _WARN_ _} handler to sift through all warnings to find those you did or did not want to see.

Lexical warnings, first introduced in Perl v5.6, address all this and more. By lexical, we mean that their effects are constrained to the lexical scope in which use warnings or no warnings occurs. Lexical warnings pay no attention to the -w command-line switch. Now when you turn warnings on in one scope, such as the main program's file scope, that doesn't enable warnings in modules you load. You can also selectively enable or disable individual categories of warnings. For example:

use warnings qw(numeric uninitialized);
use warnings qw(all);
no warnings qw(syntax);

The warnings::register pragma permits a module to check the warnings preferences of its caller's lexical scope. The pragma also creates a new warning category, taken from the name of the current package. These user-defined warning categories are easily distinguishable from the built-in warning categories, because a module's package always starts (or should always start) with an uppercase letter. This way lowercase warning categories, like lowercase module names, are reserved to Perl itself.

Built-in warnings categories are organized into several groups. The all category means all built-in warnings categories, including subcategories such as unsafe, io, syntax, etc. (see Figure 12-1). The syntax category comprises particular warnings categories, such as ambiguous, precedence, and deprecated. These can be added and subtracted at will, but order matters:

use warnings;                 # turn on all warnings
no warnings "syntax";         # turn off the syntax group
use warnings "deprecated";    # but turn back on deprecated warnings

Figure 12-1. Warnings categories

Back to your module. Suppose you write a module called Whiskey. The Whiskey.pm file begins this way:

package Whiskey;
use warnings::register;

Now code using that module does this:

use Whiskey;
use warnings qw(Whiskey);

It's important to load the module before asking to use warnings for that module. Otherwise, the Whiskey warning category hasn't been registered yet, and you'll raise an exception if you try to use it as a warnings category.
Here's a whimsical Whiskey module:

package Whiskey;
use strict;
use warnings;    # for our own code, not our caller's
use warnings::register;

sub drink {
    if (warnings::enabled( ) && (localtime( ))[2] < 12) {
        warnings::warn("Sun not yet over the yardarm");
    }
    print "Merry!\n";
}

sub quaff {
    if (warnings::enabled("deprecated")) {
        warnings::warn("deprecated",
            "quaffing deprecated in favor of chugging");
    }
    &drink;
}

# chuggers care not of the hour
sub chug {
    print "Very merry\n";
}

1;

The Whiskey::drink function uses the warnings::enabled function to check whether its caller has warnings enabled. Any of these in the caller's scope is enough to make that function return true:

use warnings;
use warnings qw(all);        # means same as previous
use warnings qw(Whiskey);

The function will also return true if global warnings are enabled using -w or $^W.

In the Whiskey::quaff function, a specific category of warnings is checked: deprecated. This is enabled if all warnings have been selected, if the syntax warnings have been selected (because deprecated warnings are considered a subcategory of syntax warnings, which is a subcategory of all warnings), or if deprecated warnings have been specifically selected. It will not be enabled just because the caller has enabled Whiskey warnings. Any category you create is considered a subcategory of all, but not of anything else. Check for Whiskey warnings using:

warnings::enabled("Whiskey")

The warnings::warn function is used instead of the warn built-in, in case Whiskey warnings have been promoted into exceptions:

use warnings FATAL => "Whiskey";

12.15.4 See Also

The documentation on the use warnings pragma in Chapter 31 of Programming Perl and perllexwarn(1)

Recipe 12.16 Referring to Packages Indirectly

12.16.1 Problem

You want to refer to a variable or function in a package unknown until runtime, but syntax like $packname::$varname is illegal.

12.16.2 Solution

Use symbolic references:

{
    no strict "refs";
    $val  = ${ $packname . "::" . $varname };
    @vals = @{ $packname . "::" . $aryname };
    &{ $packname . "::" . $funcname }("args");
    ($packname . "::" . $funcname) -> ("args");
}

12.16.3 Discussion

A package declaration has meaning at compile time. If you don't know the name of the package or variable until runtime, you'll have to resort to symbolic references for direct access to the package symbol table. Assuming you normally run with use strict in effect, you must disable part of it to use symbolic references. Once you've used the no strict "refs" directive in that block, build up a string with the fully qualified name of the variable or function you're interested in. Then dereference this name as though it were a proper Perl reference.

During the prehistoric eras (before Perl 5), programmers were forced to use an eval for this kind of thing:

eval "package $packname; \$'$val = \$$varname";    # set $main'val
die if $@;

As you see, this approach makes quoting difficult. It's also comparatively slow. Fortunately, you never need to do this just to access variables indirectly by name. Symbolic references are a necessary compromise.

Similarly, eval could be used to define functions on the fly. Suppose you wanted to be able to get the base 2 or base 10 logs of numbers:

printf "log2  of 100 is %.2f\n", log2(100);
printf "log10 of 100 is %.2f\n", log10(100);

Perl has only the natural log function. Here's how one could use eval to create these functions at runtime.
Here we'll create functions named log2 up through log999:

$packname = "main";
for ($i = 2; $i < 1000; $i++) {
    $logN = log($i);
    eval "sub ${packname}::log$i { log(shift) / $logN }";
    die if $@;
}

Here, at least, you don't need to do that. The following code does the same thing, but instead of compiling a new function 998 times, we compile it only once, as a closure. Then we use symbolic dereferencing of the symbol table to assign the same subroutine reference to many function names:

$packname = "main";
for ($i = 2; $i < 1000; $i++) {
    my $logN = log($i);
    no strict "refs";
    *{"${packname}::log$i"} = sub { log(shift) / $logN };
}

When you assign a reference to a typeglob, you create an alias for just the referent type of that name. That's how the Exporter does its job. The first line in the next code sample manually imports the function name Colors::blue into the current package. The second makes the main::blue function an alias for the Colors::azure function.

*blue       = \&Colors::blue;
*main::blue = \&Colors::azure;

Given the flexibility of typeglob assignments and symbolic references, a full-blown eval "STRING" is nearly always unnecessary for these sorts of indirect namespace manipulation; it is the last resort of the desperate programmer. The only thing worse would be if it weren't available at all.

12.16.4 See Also

The section on "Symbolic References" in Chapter 8 of Programming Perl and in the start of perlsub(1); Recipe 11.4

Recipe 12.17 Using h2ph to Translate C #include Files

12.17.1 Problem

Someone gave you code that generates the bizarre error message:

Can't locate sys/syscall.ph in @INC (did you run h2ph?)
(@INC contains: /usr/lib/perl5/i686-linux/5.00404 /usr/lib/perl5
/usr/lib/perl5/site_perl/i686-linux /usr/lib/perl5/site_perl .)
at some_program line 7.

You want to know what it means and how to fix it.

12.17.2 Solution

Get your system administrator to do this, running as the superuser:

% cd /usr/include; h2ph sys/syscall.h

However, most include files require other include files, which means you should probably just translate them all:

% cd /usr/include; h2ph *.h */*.h

If that reports too many filenames or misses some that are more deeply nested, try this instead:

% cd /usr/include; find . -name "*.h" -print | xargs h2ph

12.17.3 Discussion

A file whose name ends in .ph has been created by the h2ph tool, which translates C preprocessor directives from C #include files into Perl. The goal is to allow Perl code to access the same constants as C code. h2xs is a better approach in most cases, because it provides compiled C code for your modules, not Perl code simulating C code. However, using h2xs requires a lot more programming savvy (at least, for accessing C code) than h2ph does.

When h2ph's translation process works, it's wonderful. When it doesn't, you're probably out of luck. As system architectures and include files become more complex, h2ph fails more frequently. If you're lucky, the constants you need are already in the Fcntl, Socket, or POSIX modules. The POSIX module implements constants from sys/file.h, sys/errno.h, and sys/wait.h, among others. It also allows fancy tty handling, as described in Recipe 15.8.

So what can you do with these .ph files? Here are a few examples. The first uses the pessimally non-portable syscall function to access your operating system's gettimeofday syscall. This implements the FineTime module described in Recipe 12.12.
# file FineTime.pm

package main;
require "sys/syscall.ph";
die "No SYS_gettimeofday in sys/syscall.ph"
    unless defined &SYS_gettimeofday;

package FineTime;
use strict;
require Exporter;
use vars qw(@ISA @EXPORT_OK);
@ISA = qw(Exporter);
@EXPORT_OK = qw(time);

sub time( ) {
    my $tv = pack("LL", ( ));    # presize buffer to two longs
    syscall(&main::SYS_gettimeofday, $tv, undef) >= 0
        or die "gettimeofday: $!";
    my ($seconds, $microseconds) = unpack("LL", $tv);
    return $seconds + ($microseconds / 1_000_000);
}

1;

If you are forced to require an old-style .pl or .ph file, do so from the main package (package main in the preceding code). These old libraries always put their symbols in the current package, and main serves as a reasonable rendezvous point. To use a symbol, use its fully qualified name, as we did with main::SYS_gettimeofday.

The sys/ioctl.ph file, if you can get it to build on your system, is the gateway to your system's idiosyncratic I/O functions through the ioctl function. One such function is the TIOCSTI ioctl, shown in Example 12-1. That abbreviation stands for "terminal I/O control, simulate terminal input." On systems that implement this function, it will push one character into your device stream so that the next time any process reads from that device, it gets the character you put there.

Example 12-1. jam

#!/usr/bin/perl -w
# jam - stuff characters down STDIN's throat
require "sys/ioctl.ph";
die "no TIOCSTI" unless defined &TIOCSTI;
sub jam {
    local $SIG{TTOU} = "IGNORE";    # "Stopped for tty output"
    local *TTY;                     # make local filehandle
    open(TTY, "+<", "/dev/tty") or die "no tty: $!";
    for (split(//, $_[0])) {
        ioctl(TTY, &TIOCSTI, $_) or die "bad TIOCSTI: $!";
    }
    close(TTY);
}
jam("@ARGV\n");

Since sys/ioctl.h translation is so dodgy, you'll probably have to run this C program to get your TIOCSTI value:

% cat > tio.c << EOF && cc tio.c && a.out
#include <sys/ioctl.h>
main( ) { printf("%#08x\n", TIOCSTI); }
EOF
0x005412

Another popular use for ioctl is for figuring out your current window size in rows and columns, and maybe even in pixels. This is shown in Example 12-2.

Example 12-2. winsz

#!/usr/bin/perl
# winsz - find x and y for chars and pixels
require "sys/ioctl.ph";
die "no TIOCGWINSZ" unless defined &TIOCGWINSZ;
open(TTY, "+<", "/dev/tty") or die "No tty: $!";
unless (ioctl(TTY, &TIOCGWINSZ, $winsize="")) {
    die sprintf "$0: ioctl TIOCGWINSZ (%08x: $!)\n", &TIOCGWINSZ;
}
($row, $col, $xpixel, $ypixel) = unpack("S4", $winsize);
print "(row,col) = ($row,$col)";
print "  (xpixel,ypixel) = ($xpixel,$ypixel)" if $xpixel || $ypixel;
print "\n";

As you see, as soon as you start playing with .ph files, unpacking binary data, and calling syscall and ioctl, you need to know about the C APIs that Perl normally hides. The only other thing that requires this much C knowledge is using the XS interface. Some suggest you should resist the temptation to descend into such unportable convolutions. Others feel that the demands put upon the trenchworkers are such that they must be forgiven these desperate measures. Fortunately, less fragile mechanisms are increasingly available. CPAN modules for most of these functions now exist, which should theoretically prove more robust than sourcing .ph files.
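For instance, the errno constants that once meant requiring sys/errno.ph can come straight from the standard POSIX module instead. A small sketch (the filename is made up):

use POSIX qw(ENOENT EACCES);
unless (open(FH, "<", "/no/such/file")) {
    if    ($! == ENOENT) { warn "file doesn't exist\n" }
    elsif ($! == EACCES) { warn "permission denied\n"  }
}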
12.17.4 See Also

h2ph(1); the instructions on running h2ph in the INSTALL file from the Perl source distribution; the syscall and ioctl functions in Chapter 29 of Programming Perl and in perlmod(1); Recipe 12.18

Recipe 12.18 Using h2xs to Make a Module with C Code

12.18.1 Problem

You'd like to access your system's unique C functions from Perl.

12.18.2 Solution

Use the h2xs tool to generate the necessary template files, fill the files in appropriately, and then type:

% perl Makefile.PL
% make

12.18.3 Discussion

A Perl module need not be written solely in Perl. As with any other module, first pick a module name and use h2xs on it. We'll make a FineTime::time function with the same semantics as in the previous recipe, but this time around, we'll implement it using real C. First, we run the following command:

% h2xs -cn FineTime

If we had a .h file with function prototype declarations, we could include that, but because we're writing this one from scratch, we'll use the -c switch to omit building code to translate any #define symbols. The -n switch says to create a module directory named FineTime/, which will have the following files:

MANIFEST       List of files in the distribution
Changes        Change log
Makefile.PL    A meta-makefile
FineTime.pm    The Perl parts
FineTime.xs    The soon-to-be C parts
test.pl        A test driver

Before we can type make, we'll have to generate a Makefile based on our system's configuration using the Makefile.PL template. Here's how to do that:

% perl Makefile.PL

If the XS code calls library code that isn't in the normal set of libraries Perl links from, add one more line to Makefile.PL first. For example, if we wanted to link against the librpm.a library, which lives in the /usr/redhat/lib directory, we'd change the line of Makefile.PL that reads:

"LIBS" => [""],    # e.g., "-lm"

so that it says:

"LIBS" => ["-L/usr/redhat/lib -lrpm"],

If the module is to be installed somewhere other than the local site_lib directory, specify that on the command line:

% perl Makefile.PL LIB=~/perllib

Finally, edit the FineTime.pm and FineTime.xs files. In the first case, most of the work has been done for us. We just set up the export list with the function to be exported. This time we put it in @EXPORT_OK so that if the user wants the function, they must ask for it by name. Here's FineTime.pm:

package FineTime;
use strict;
use vars qw($VERSION @ISA @EXPORT_OK);
require Exporter;
require DynaLoader;
@ISA = qw(Exporter DynaLoader);
@EXPORT_OK = qw(time);
$VERSION = "0.01";
bootstrap FineTime $VERSION;
1;

The make process automatically translates FineTime.xs into a FineTime.c file and eventually into a shared library, probably called FineTime.so on most platforms. The utility that does this translation is xsubpp, which is described in its own manpage and in perlxstut(1). The build will call xsubpp automatically.

Besides a strong C background, you also need to understand the C-to-Perl interface, called XS (external subroutine). The details and nuances of XS are beyond the scope of this book. The automatically generated FineTime.xs had the Perl-specific include files in it, as well as the MODULE declaration. We've added some extra includes and written the code for the new time function. Although this doesn't look entirely like C, it will, once xsubpp is done with it.
Here's the FineTime.xs we used:

#include <sys/types.h>
#include <sys/time.h>
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"

MODULE = FineTime    PACKAGE = FineTime

double
time( )
    CODE:
        struct timeval tv;
        gettimeofday(&tv, 0);
        RETVAL = tv.tv_sec + ((double) tv.tv_usec) / 1000000;
    OUTPUT:
        RETVAL

Defining a function by the same name as one from the standard C library won't cause a problem when it's compiled, because that's not its real name. That's just what Perl calls it. The C linker will see it as XS_FineTime_time, so no conflict exists.

Here's what happened with make install (with some edits):

% make install
mkdir ./blib/lib/auto/FineTime
cp FineTime.pm ./blib/lib/FineTime.pm
/usr/local/bin/perl -I/usr/lib/perl5/i686-linux/5.00403 -I/usr/lib/perl5
    /usr/lib/perl5/ExtUtils/xsubpp -typemap /usr/lib/perl5/ExtUtils/typemap
    FineTime.xs FineTime.tc && mv FineTime.tc FineTime.c
cc -c -Dbool=char -DHAS_BOOL -O2 -DVERSION=\"0.01\" -DXS_VERSION=\"0.01\"
    -fpic -I/usr/lib/perl5/i686-linux/5.00403/CORE FineTime.c
Running Mkbootstrap for FineTime ( )
chmod 644 FineTime.bs
LD_RUN_PATH="" cc -o blib/arch/auto/FineTime/FineTime.so -shared
    -L/usr/local/lib FineTime.o
chmod 755 blib/arch/auto/FineTime/FineTime.so
cp FineTime.bs ./blib/arch/auto/FineTime/FineTime.bs
chmod 644 blib/arch/auto/FineTime/FineTime.bs
Installing /home/tchrist/perllib/i686-linux/./auto/FineTime/FineTime.so
Installing /home/tchrist/perllib/i686-linux/./auto/FineTime/FineTime.bs
Installing /home/tchrist/perllib/./FineTime.pm
Writing /home/tchrist/perllib/i686-linux/auto/FineTime/.packlist
Appending installation info to /home/tchrist/perllib/i686-linux/perllocal.pod

Once this is all done, we'll be able to type something like this into the shell:

% perl -I ~/perllib -MFineTime=time -le "1 while print time( )" | head
888177070.090978
888177070.09132
888177070.091389
888177070.091453
888177070.091515
888177070.091577
888177070.091639
888177070.0917
888177070.091763
888177070.091864

12.18.4 See Also

Chapters 18 through 20 in Advanced Perl Programming; perlxstut(1) and perlxs(1) to learn how to call C from Perl; perlcall(1) and perlguts(1) to understand the internal Perl API, also the "Extending Perl" section of Chapter 21 of Programming Perl; perlembed(1) to learn how to call Perl from C, also the "Embedding Perl" section of Chapter 21 of Programming Perl; the documentation for the standard ExtUtils::MakeMaker module; h2ph(1) and xsubpp(1); http://www.cpan.org/authors/Dean_Roehrich/, which contains Dean's comprehensive XS cookbook that includes directions on interfacing with C++

Recipe 12.19 Writing Extensions in C with Inline::C

12.19.1 Problem

You'd like to write functions in C that you can call from Perl. You may already have tried XS and found it harmful to your mental health.

12.19.2 Solution

Use the Inline::C module available from CPAN:

use Inline C;

$answer = somefunc(20, 4);
print "$answer\n";    # prints 80

_ _END_ _
_ _C_ _

double somefunc(int a, int b) {
    /* Inline knows most basic C types */
    double answer = a * b;
    return answer;
}

12.19.3 Discussion

Inline::C was created as an alternative to the XS system for building C extension modules. Rather than jumping through all the hoopla of h2xs and the format of an .xs file, Inline::C lets you embed C code into your Perl program. There are also Inline modules for Python, Ruby, and Java, among other languages.

By default, your C source is in the _ _END_ _ or _ _DATA_ _ section of your program, after a _ _C_ _ token. This permits multiple Inlined language blocks in a single file.
If you want, use a here document when you load Inline:

use Inline C => <<'END_OF_C';
double somefunc(int a, int b) {
    /* Inline knows most basic C types */
    double answer = a * b;
    return answer;
}
END_OF_C

Inline::C scans the source code for ANSI-style function definitions. When it finds a function definition it knows how to deal with, it creates a Perl wrapper for the function. Inline can automatically translate the basic C data types (double, int, char *, etc.) by using the typemap that comes with Perl. A typemap shows Perl how to convert between C values and Perl data types, and you can install your own if you need to use more complex data structures than the basic typemap supports. You can link against external libraries, parse header files as h2xs does, pass and return multiple values, handle objects, and more. See the Inline::C-Cookbook manpage that comes with the Inline::C module for more details.

12.19.4 See Also

The documentation with the Inline::C module from CPAN; the Inline::C-Cookbook manpage

Recipe 12.20 Documenting Your Module with Pod

12.20.1 Problem

You need to document your module, but don't know what format to use.

12.20.2 Solution

Embed your documentation in your module file using pod format.

12.20.3 Discussion

Pod stands for plain old documentation. It's documentation embedded in your program using a very simple markup format. Programmers are notorious for writing the code first and the documentation never, so pod was designed to make writing documentation so easy that anyone can and will do so. Sometimes this even works.

When Perl is parsing your source code, a line starting with an equals sign (where a new statement is expected) says to ignore all text until it finds a line beginning with =cut, after which it will start parsing code again. This lets you mix code and documentation throughout your Perl program or module file. Since pod is mostly plain text, type in your documentation as literal text, or nearly so. The translators try to be clever and make output-specific decisions so the programmer doesn't have to specifically format variable names, function calls, etc.

Perl ships with several translators that filter generic pod format into specific output styles. These include pod2man to change your pods into troff for use with the man program or for phototypesetting and printing; pod2html for creating web pages (which works even on non-Unix systems); and pod2text for plain ASCII. Other translators, such as pod2ipf, pod2fm, pod2texi, pod2latex, and pod2ps, may also be available or can be found on CPAN.

Many books are written using proprietary word processors with limited scripting capabilities. Not this one! It was written in pod format using common text editors (vi for Tom, emacs for Nat). The final book was produced by converting the pod source files to FrameMaker.

Although formally documented in perlpod(1), pod is probably easiest to learn by reading existing module files. If you started making your module using h2xs, then you already have sample pods right there. The Makefile knows to convert these into man format and install those manpages so others can read them. Alternatively, the perldoc program can translate pods on the fly using pod2text. Indented paragraphs will be left verbatim. Other paragraphs will be reformatted to fit the page.
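As a minimal sketch, here's how a few pod paragraphs might sit inside the Flipper module from Recipe 12.4 (the surrounding code is elided):

=head1 NAME

Flipper - reverse the words on a line

=head1 SYNOPSIS

    use Flipper;
    print flip_words("hello world");    # prints "world hello"

=cut

sub flip_words { .... }    # Perl resumes parsing code after =cut

Note the indented SYNOPSIS lines: being indented, a translator leaves them verbatim, exactly as just described.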
Only two kinds of markup are used in pod: paragraphs beginning with an equals sign and one or more words, and interior sequences starting with a single letter followed by text enclosed in angle brackets. Paragraph tags are for headers, list enumeration, and per-translator escapes. Angle bracket sequences are mainly used for font changes, such as selecting bold, italic, or constant-width fonts. Here's an example of an =head2 pod directive and various angle bracket escapes for font changes:

=head2 Discussion

If we had a I<.h> file with function prototype declarations, we could
include that, but since we're writing this one from scratch, we'll use
the B<-c> flag to omit building code to translate any #define symbols.
The B<-n> flag says to create a module directory named I<FineTime>,
which will have the following files.

The =for escape introduces specific code that is only for a particular output filter. This book, for example, written mostly in pod, includes calls to the standard troff tools eqn, tbl, and pic. Here's an example of embedded eqn. Only translators that produce troff will heed this paragraph.

=for troff
.EQ
log sub n (x) = { {log sub e (x)} over {log sub e (n)} }
.EN

Pod can also create multiline comments. In C, the sequence /* .... */ can comment out many lines of text all at once—there's no need to put a marker on each line. Since Perl ignores pod directives, use these for block commenting. The trick is to find a directive that the pod filters ignore. You could specify that a block is "for later" or "for nobody":

=for later
next if 1 .. ?^$?;
s/^(.)/>$1/;
s/(.{73})........*/$1<SNIP>/;

=cut back to perl

or you could use a =begin and =end pair:

=begin comment

if (!open(FILE, "<", $file)) {
    unless ($opt_q) {
        warn "$me: $file: $!\n";
        $Errors++;
    }
    next FILE;
}
$total = 0;
$matches = 0;

=end comment

12.20.4 See Also

The section on "PODs: Embedded Documentation" in perlsyn(1), as well as perlpod(1), pod2man(1), pod2html(1), and pod2text(1); Chapter 26 of Programming Perl

Recipe 12.21 Building and Installing a CPAN Module

12.21.1 Problem

You want to install a module file that you downloaded from CPAN over the Net or obtained from a CD.

12.21.2 Solution

Type the following commands into your shell. This will build and install Version 4.54 of the Some::Module package.

% gunzip Some-Module-4.54.tar.gz
% tar xf Some-Module-4.54.tar
% cd Some-Module-4.54
% perl Makefile.PL
% make
% make test
% make install

12.21.3 Discussion

Like most programs on the Net, Perl modules are available in source kits stored as tar archives in GNU zip format.[3] If tar warns of "Directory checksum errors", then you downloaded the binary file in text format, mutilating it.

[3] This is not the same as the zip format common on Windows machines, but newer versions of Windows winzip will read it. Prior to Perl 5.005, you'll need the standard port of Perl for Win32, not the ActiveState port, to build CPAN modules. Free versions of tar and gnutar are also available for Microsoft systems.

You'll probably have to become a privileged user with adequate permissions to install the module in the system directories. Standard modules are installed in a directory like /usr/lib/perl5, whereas third-party modules are installed in /usr/lib/perl5/site_perl.

Here's a sample run, showing the installation of the MD5 module:

% gunzip MD5-1.7.tar.gz
% tar xf MD5-1.7.tar
% cd MD5-1.7
% perl Makefile.PL
Checking if your kit is complete...
Looks good
Writing Makefile for MD5
% make
mkdir ./blib
mkdir ./blib/lib
cp MD5.pm ./blib/lib/MD5.pm
AutoSplitting MD5 (./blib/lib/auto/MD5)
/usr/bin/perl -I/usr/local/lib/perl5/i386 ...
...
cp MD5.bs ./blib/arch/auto/MD5/MD5.bs
chmod 644 ./blib/arch/auto/MD5/MD5.bs
mkdir ./blib/man3
Manifying ./blib/man3/MD5.3
% make test
PERL_DL_NONLAZY=1 /usr/bin/perl -I./blib/arch -I./blib/lib
    -I/usr/local/lib/perl5/i386-freebsd/5.00404 -I/usr/local/lib/perl5 test.pl
1..14
ok 1
ok 2
...
ok 13
ok 14
% sudo make install
Password:
Installing /usr/local/lib/perl5/site_perl/i386-freebsd/./auto/MD5/MD5.so
Installing /usr/local/lib/perl5/site_perl/i386-freebsd/./auto/MD5/MD5.bs
Installing /usr/local/lib/perl5/site_perl/./auto/MD5/autosplit.ix
Installing /usr/local/lib/perl5/site_perl/./MD5.pm
Installing /usr/local/lib/perl5/man/man3/./MD5.3
Writing /usr/local/lib/perl5/site_perl/i386-freebsd/auto/MD5/.packlist
Appending installation info to /usr/local/lib/perl5/i386-freebsd/5.00404/perllocal.pod

If your system manager isn't around or can't be prevailed upon to run the installation, don't worry. When you use Perl to generate the Makefile from the template Makefile.PL, you can specify alternate installation directories.

# if you just want the modules installed in your own directory
% perl Makefile.PL LIB=~/lib

# if you have your own complete distribution
% perl Makefile.PL PREFIX=~/perl5-private

An even simpler approach is to use the CPAN module from the command line, because it can search for, download, and install the module you need. Suppose you wanted to find the CPAN module Getopt::Declare. All you'd have to do is type:

% perl -MCPAN -e "install Getopt::Declare"

The first time you use the CPAN module, it will ask you some configuration questions. It saves these away so that when you use it in the future, it won't need to ask you those questions again.

The CPAN module also supports an interactive command shell. This can be used to search for modules whose precise names you're uncertain of, check which CPAN modules have newer versions than the ones you have installed, install bundles of related modules, and perform various other useful commands. Here's an example run of the interactive shell:

% perl -MCPAN -e shell

cpan shell -- CPAN exploration and modules installation (v1.70)
ReadLine support enabled

cpan> h

Display Information
 command   argument          description
 a,b,d,m   WORD or /REGEXP/  about authors, bundles, distributions, modules
 i         WORD or /REGEXP/  about anything of above
 r         NONE              reinstall recommendations
 ls        AUTHOR            about files in the author's directory

Download, Test, Make, Install...
 get                         download
 make                        make (implies get)
 test      MODULES,          make test (implies make)
 install   DISTS, BUNDLES    make install (implies test)
 clean                       make clean
 look                        open subshell in these dists' directories
 readme                      display these dists' README files

Other
 h,?            display this menu
 ! perl-code    eval a perl command
 o conf [opt]   set and query options
 q              quit the cpan shell
 reload cpan    load CPAN.pm again
 reload index   load newer indices
 autobundle     Snapshot
 force cmd      unconditionally do cmd

cpan> i /inflect/
CPAN: Storable loaded ok
Going to read /home/tchrist/.cpan/Metadata
  Database was generated on Mon, 07 Apr 2003 22:42:33 GMT
Distribution    D/DC/DCONWAY/Lingua-EN-Inflect-1.88.tar.gz
Module          Lingua::EN::Inflect (D/DC/DCONWAY/Lingua-EN-Inflect-1.88.tar.gz)
2 items found

cpan> install Lingua::EN::Inflect

[build and install output deleted]

cpan> quit

The CPAN module is slowly being phased out in favor of CPANPLUS, a module with similar functionality that is built for flexibility as well as power. The CPANPLUS text interface is similar to that of the CPAN module, but it also offers a GUI and programmer interfaces, which can access a lot of functionality that the CPAN module hides.

12.21.4 See Also

The documentation for the standard ExtUtils::MakeMaker module; the INSTALL file in the Perl source distribution for information on building a statically linked perl binary

Recipe 12.22 Example: Module Template

Following is the skeleton of a module. If you want to write a module of your own, you can copy this and customize it.

package Some::Module;  # must live in Some/Module.pm
use strict;

require Exporter;

# set the version for version checking
our $VERSION = 0.01;

our @ISA = qw(Exporter);
our @EXPORT = qw(&func1 &func2 &func4);
our %EXPORT_TAGS = ( );    # eg: TAG => [ qw!name1 name2! ],

# your exported package globals go here,
# as well as any optionally exported functions
our @EXPORT_OK = qw($Var1 %Hashit &func3);

use vars qw($Var1 %Hashit);

# non-exported package globals go here
our(@more, $stuff);

# initialize package globals, first exported ones
$Var1 = "";
%Hashit = ( );

# then the others (which are still accessible as $Some::Module::stuff)
$stuff = "";
@more = ( );

# all file-scoped lexicals must be created before
# the functions below that use them.

# file-private lexicals go here
my $priv_var = "";
my %secret_hash = ( );

# here's a file-private function as a closure,
# callable as &$priv_func.
my $priv_func = sub {
    # stuff goes here.
};

# make all your functions, whether exported or not;
# remember to put something interesting in the { } stubs
sub func1      { .... }    # no prototype
sub func2( )   { .... }    # proto'd void
sub func3($$)  { .... }    # proto'd to 2 scalars

# this one isn't auto-exported, but could be called!
sub func4(\%)  { .... }    # proto'd to 1 hash ref

END { }  # module cleanup code here (global destructor)

1;

Recipe 12.23 Program: Finding Versions and Descriptions of Installed Modules

Perl comes with many modules included standard. Even more can be found on CPAN. The following program prints out the names, versions, and descriptions of all modules installed on your system. It uses standard modules like File::Find and includes several techniques described in this chapter.

To run it, type:

% pmdesc

It prints a list of modules and their descriptions:

FileHandle (2.00) - supply object methods for filehandles
IO::File (1.06021) - supply object methods for filehandles
IO::Select (1.10) - OO interface to the select system call
IO::Socket (1.1603) - Object interface to socket communications
...

With the -v flag, pmdesc also provides the names of the directories the files are in:

% pmdesc -v

 << Modules from ... >>

FileHandle (2.00) - supply object methods for filehandles

...
The -w flag warns if a module doesn't come with a pod description, and -s sorts the module list within each directory.

The program is given in Example 12-3.

Example 12-3. pmdesc

#!/usr/bin/perl -w
# pmdesc - describe pm files
# tchrist@perl.com

use strict;
use File::Find      qw(find);
use Getopt::Std     qw(getopts);
use Carp;

use vars (
    q!$opt_v!,      # give debug info
    q!$opt_w!,      # warn about missing descs on modules
    q!$opt_a!,      # include relative paths
    q!$opt_s!,      # sort output within each directory
);

$| = 1;

getopts("wvas")     or die "bad usage";
@ARGV = @INC        unless @ARGV;

# Globals.  wish I didn't really have to do this.
use vars (
    q!$Start_Dir!,  # The top directory find was called with
    q!%Future!,     # topdirs find will handle later
);
my $Module;

# install an output filter to sort my module list, if wanted.
if ($opt_s) {
    if (open(ME, "-|")) {
        $/ = "";
        while (<ME>) {
            chomp;
            print join("\n", sort split /\n/), "\n";
        }
        exit;
    }
}

MAIN: {
    my %visited;
    my ($dev,$ino);
    @Future{@ARGV} = (1) x @ARGV;
    foreach $Start_Dir (@ARGV) {
        delete $Future{$Start_Dir};
        print "\n << Modules from $Start_Dir>>\n\n" if $opt_v;
        next unless ($dev,$ino) = stat($Start_Dir);
        next if $visited{$dev,$ino}++;
        next unless $opt_a || $Start_Dir =~ m!^/!;
        find(\&wanted, $Start_Dir);
    }
    exit;
}

# calculate module name from file and directory
sub modname {
    local $_ = $File::Find::name;
    if (index($_, $Start_Dir . "/") == 0) {
        substr($_, 0, 1+length($Start_Dir)) = "";
    }
    s { / } {::}gx;
    s { \.p(m|od)$ } { }x;
    return $_;
}

# decide if this is a module we want
sub wanted {
    if ( $Future{$File::Find::name} ) {
        warn "\t(Skipping $File::Find::name, qui venit in futuro.)\n"
            if 0 and $opt_v;
        $File::Find::prune = 1;
        return;
    }
    return unless /\.pm$/ && -f;
    $Module = &modname;

    # skip obnoxious modules
    if ($Module =~ /^CPAN(\Z|::)/) {
        warn("$Module -- skipping because it misbehaves\n");
        return;
    }

    my $file = $_;
    unless (open(POD, "<", $file)) {
        warn "\tcannot open $file: $!";  # if $opt_w;
        return 0;
    }

    $: = " -:";
    local $/ = "";
    local $_;
    while (<POD>) {
        if (/=head\d\s+NAME/) {
            chomp($_ = <POD>);
            s/^.*?-\s+//s;
            s/\n/ /g;
            #write;
            my $v;
            if (defined ($v = getversion($Module))) {
                print "$Module ($v) ";
            } else {
                print "$Module ";
            }
            print "- $_\n";
            return 1;
        }
    }

    warn "\t(MISSING DESC FOR $File::Find::name)\n" if $opt_w;
    return 0;
}

# run Perl to load the module and print its version number, redirecting
# errors to /dev/null
sub getversion {
    my $mod = shift;

    my $vers = `$^X -m$mod -e 'print \$${mod}::VERSION' 2>/dev/null`;
    $vers =~ s/^\s*(.*?)\s*$/$1/;   # remove stray whitespace
    return ($vers || undef);
}

format =
^<<<<<<<<<<<<<<<<<~~^<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$Module,            $_
.

This can also be accomplished through the backend programmer interface in the CPANPLUS module, if you have it installed.
This program displays information on all available modules (the -X option is to silence any warnings about invalid paths or version numbers):

#!/usr/bin/perl -X

use CPANPLUS::Backend;
use Data::Dumper;

$cp = CPANPLUS::Backend->new;
$installed = $cp->installed->rv;    # fetch list of installed mods

foreach my $module (sort keys %$installed) {
    # get the module's information
    $info = $cp->details(modules => [$module])->rv->{$module};

    # display the fields we care about
    printf("%-35.35s %44.44s\n", $module, $info->{Description});
}

When run, it outputs a table like this:

Algorithm::Cluster       Perl extension for the C clustering library
Algorithm::NaiveBayes    None given
AnyDBM_File              Uses first available *_File module above
Apache                   Interface to the Apache server API
Apache::AuthDBI          None given
Apache::Connection       Inteface to Apache conn_rec struct

Chapter 13. Classes, Objects, and Ties

All the world over, I will back the masses against the classes.

—William E. Gladstone, Speech at Liverpool, 28 June 1886

Introduction

Although Perl was not initially conceived of as an object-oriented language, within a few years of its initial release, complete support for object-oriented programming had been added. As usual, Perl doesn't try to enforce one true style, but embraces many. This helps more people do their job the way they want to do it.

You don't have to use objects to write programs, unlike Java, where programs are instances of objects. If you want to, though, you can write Perl programs that use nearly every weapon in the object-oriented arsenal. Perl supports classes and objects, single and multiple inheritance, instance methods and class methods, access to overridden methods, constructors and destructors, operator overloading, proxy methods through autoloading, delegation, a rooted hierarchy for all objects, and two levels of garbage collection. You can use as many or as few object-oriented techniques as you want and need.

Ties are the only part of Perl where you must use object orientation. And even then, only the module implementor need be aware of this; the casual user gets to remain blissfully unaware of the internal mechanics. Ties, discussed in Recipe 13.15, let you transparently intercept access to a variable. For example, you can use ties to create hashes that support lookups by key or value instead of just by key.

Under the Hood

If you ask 10 people what object orientation is, you'll get 10 different answers. People bandy about terms like abstraction and encapsulation, trying to isolate the basic units of object-oriented programming languages and give them big names to write papers and books about. Not all object-oriented languages offer the same features, yet they are still deemed object-oriented. This, of course, produces more papers and books.

We follow the nomenclature used in Perl's documentation, the perlobj(1) manpage, and Chapter 12 of Programming Perl. An object is a variable that belongs to a class. Methods are functions associated with a class. In Perl, a class is a package—and usually a module. An object is a reference to something associated with a class. Once associated with a class, something is said to be blessed into that class.

There's nothing ecclesiastical or spooky going on here. Blessing merely associates a referent with a class, and this is done with the bless function, which takes one or two arguments.
The first is a reference to the thing you want associated with the class; the second is the package with which to make that association.

$object = { };                      # hash reference
bless($object, "Data::Encoder");    # bless $object into Data::Encoder class
bless($object);                     # bless $object into current package

The class name is the package name (Data::Encoder in this example). Because classes are modules (usually), the code for the Data::Encoder class resides in the file Data/Encoder.pm. As with traditional modules, the directory structure is purely for convenience; it implies nothing about inheritance, variable sharing, or anything else. Unlike a traditional module, though, an object module seldom if ever uses the Exporter. Access should be through methods only, not imported functions or variables.

Once an object has been blessed, calling the ref function on its reference returns the name of its class instead of the fundamental type of referent:

$obj = [3,5];
print ref($obj), " ", $obj->[1], "\n";
bless($obj, "Human::Cannibal");
print ref($obj), " ", $obj->[1], "\n";

ARRAY 5
Human::Cannibal 5

As you can see, you can still dereference a reference once it has been blessed. Most frequently, objects are implemented as blessed hash references. You can use any kind of reference you want, but hash references are the most flexible because they allow arbitrarily named data fields in an object.

$obj->{Stomach} = "Empty";   # directly accessing an object's contents
$obj->{NAME}    = "Thag";    # uppercase field name to make it stand out (optional)

Although Perl permits it, it's considered poor form for code outside the class to directly access the contents of an object. The point of objects, everyone agrees, is to give you an abstract something with mediated access through designated methods. This lets the maintainer of the class change its implementation without needing to change all the application code that uses the class.

Methods

The whole purpose for blessing—that is, associating a package with a referent—is so that Perl can determine the package namespace in which to find functions when you invoke methods against an object. To invoke a method, use ->. Here, we invoke the encode( ) method of $object with the argument "data" and store the return value in $encoded:

$encoded = $object->encode("data");

The lefthand operand of the -> operator is said to be the method's invocant. Think of the invocant as the entity on whose behalf the method was called. Methods always involve invocants. Here we have an object method because we invoke the method on an object. We can also have class methods, where the invocant is a string representing the package—meaning, of course, the class.

$encoded = Data::Encoder->encode("data");

Invoking a method calls the function in the corresponding class, implicitly passing its invocant as the initial argument to that function: a reference for object methods, a string for class methods. It isn't always obvious which of the two invocation types you have, because the invocant could be a variable holding a class name instead of one holding a reference that's been blessed.

$class = "Animal::" . ($aquatic ? "Fish" : "Mammal");
$beastie = $class->create( );

That will sometimes invoke the create method from class Animal::Fish and sometimes invoke the create method from class Animal::Mammal. This might even end up being the same underlying function if those two classes share a common ancestral class. Here you don't know the class until runtime.
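One way to make the implicit invocant argument visible is a method that simply reports what it was passed. This is a minimal sketch, not the actual Data::Encoder class; the encode body is a stand-in invented for illustration:

package Data::Encoder;

sub encode {
    my $invocant = shift;   # blessed reference or class-name string
    my ($data)   = @_;
    if (ref $invocant) {
        print "object method, invocant is a ", ref($invocant), " object\n";
    } else {
        print "class method, invocant is the string '$invocant'\n";
    }
    return scalar reverse $data;    # stand-in "encoding"
}

package main;

Data::Encoder->encode("data");          # class method, invocant is the
                                        # string 'Data::Encoder'
$object = bless { }, "Data::Encoder";
$object->encode("data");                # object method, invocant is a
                                        # Data::Encoder object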
Recipe 13.8 shows how to invoke a method whose name isn't determined until runtime.

Most classes provide constructor methods, which return new objects. Unlike in some object-oriented languages, constructor methods in Perl are not specially named. In fact, you can name them anything you like. C++ programmers have a penchant for naming their Perl constructors new. We recommend that you name your constructors whatever makes sense in the context of the problem you're solving. For example, constructors in the Tk extension to Perl are named after the widgets they create. A less common approach is to export a function with the same name as the class; see Recipe 13.14 for an example.

A typical constructor used as a class method looks like this:

sub new {
    my $class = shift;
    my $self = { };     # allocate new hash for object
    bless($self, $class);
    return $self;
}

Call the constructor with:

$object = Classname->new( );

If there isn't any inheritance or other monkey business working behind the scenes, this is effectively the same as:

$object = Classname::new("Classname");

The new function's first argument here is the name of the class—hence, the package—to bless the new reference into. A constructor should pass that string as the second argument to bless. Recipe 13.1 also talks about functions that return blessed references. Constructors don't have to be class methods; it's often useful to have object methods that themselves return ne