File Scanning Commands

You can always process a file line by line in TclX as follows:

for_file line list.html {
	do_something_with $line
}

However, the TclX developers went much further than this, and provided a family of commands expressly for file scanning. These commands give you the power of awk and grep (and then some) in Tcl, and the file scan works remarkably fast. The basic mechanism is more complicated than grep, but once you grasp the concepts it's actually simpler to use.

For those who aren't sure what file scanning is about: programmers often want to scan through an ASCII file, and for each line ask "if the line contains foo do this, and if it contains bar do that". The Unix awk utility was designed for just this purpose; the grep family of Unix commands more simply extracts matching lines from files, given a pattern. TclX can do this concisely, without a system or exec invocation of awk; furthermore, the processing can easily be made as elaborate as you need. (Note: a regular expression is a way of specifying complex match criteria in a single string, by using wildcards and other metacharacters; TclX uses regular expressions, just as grep and egrep do, to scan files.)

The essential concept for scanning is a scancontext. It's a handle -- Tcl commonly uses handles, like the value returned by an open command, to manipulate files and other data sources. You use the scancontext command to create the handle, and then the scanmatch command to associate it with a regular expression and some code to execute if a match is found. You can use multiple scanmatch commands to set up checking and processing for several regular expressions at once. Then you use the scanfile command, which actually scans the file, executes the code you established for each match, and makes match status information available in the matchInfo array (in the scope from which scanfile was called). So the sequence is:

set fileHandleVar [open fileName r]
set scanHandleVar [scancontext create]
scanmatch $scanHandleVar regExp codeToExecute
scanfile $scanHandleVar $fileHandleVar
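
A complete round trip also cleans up after itself: delete the scancontext and close the file when the scan is done. Here is a minimal self-contained sketch (the filename and pattern are made up purely for illustration):

# print every line of /etc/hosts containing "localhost"
set fd [open /etc/hosts r]
set sc [scancontext create]
scanmatch $sc {localhost} {
	puts $matchInfo(line)
}
scanfile $sc $fd
# clean up the handles
scancontext delete $sc
close $fd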

Here's a simple example. The object is to pull out of a log file some records matching the current month, and to rewrite those records into a new format in an output file:

# set up a target dir and some date-related strings
set accdir /usr/local/bean/counting/puse
set efp [open $accdir/HOST_ERROR.LOG a]
set yd [fmtclock [convertclock yesterday]]
set ydd [lrange $yd 0 2]
set ydm [lindex $yd 1]
set yy [lindex $yd 5]
set rcode PRT001
# set up a target output file and open it
set target $accdir/$rcode.$ydm.$yy.flat
set bfp [open $target w]
# set up a file pointer for the file to scan
# ($accfile, the name of the accounting file, is assumed to be set elsewhere)
set fd [open /var/adm/$accfile r]
# create a scancontext
set sc [scancontext create]
# attach a scanmatch to the scancontext
scanmatch $sc "\[A-z\] $ydm \[0-9\]* \[0123456789:]* $yy" \
	{processLine $matchInfo(line)}
# scan the file using the scancontext
scanfile $sc $fd

Here we want to scan a printer accounting file for lines logged during a certain month. We set up some strings and file pointers, then create a scancontext handle (scancontext create). We then associate a regexp with that, which should match lines like

Tue Jul 11 17:42:16 1995 foo.ps marvin helios.cia.org 1 535.570 666

where the month ($ydm) is Jul and the year is 1995. If we get a match, we call processLine, with one argument: the text of the matching line. Here's processLine. It's not very interesting, but note the use of lassign, a TclX command discussed earlier in this chapter; the core Tcl list commands llength, split, and lindex are also useful here:

proc processLine line {
	global efp
	global bfp
	global rcode

	if {[llength $line] > 11} {
		puts $efp "Cannot parse line from printer $rcode:"
		puts $efp "  $line"
	} else {
		lassign $line wday mon day time yr file user host pages cpu uid
		set date "$wday $mon $day $time $yr"
		set toy [convertclock $date]
		set host [lindex [split $host .] 0]
		puts $bfp "$toy~$rcode~$user~$file~$host~$pages"
	}
}

We split the line into words, convert the date to an integer clock value to conserve space, and write a line of output. Another utility (in this case, Sybase bcp) can now read the data, using the tildes as field separators, and store these records in a relational database. You could do the same thing with a for_file, checking to see if each line matches the month ydm. Or you could exec a grep command and process the output of the exec. But as it turns out, the file scanning commands are faster than either of those.
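
For comparison, here is roughly what the for_file version would look like, using the same (hypothetical) variables as the example above:

for_file line /var/adm/$accfile {
	if {[regexp "$ydm \[0-9\]* \[0123456789:]* $yy" $line]} {
		processLine $line
	}
}

This works, but the regexp test runs inside the Tcl loop for every line, which is part of why the scanfile approach comes out faster.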

The scancontext/scanfile mechanism is tremendously flexible and powerful because you can create several scancontexts, each bound to any set of regexps and conditional code. Each line of the scanned file will be checked against the regular expressions in the order in which you added them using scanmatch. If a line happens to match more than one of your regexps, you can permit all the scanmatch code to be executed; but often you want to stop if you match a particular expression. In that case you can use continue in the scanmatch code, which makes scanfile skip all later scanmatches in the scancontext. You can thus write, remarkably tersely, a complex file parsing algorithm that might otherwise take many, many lines of code (as a for_file loop with a complicated mess of if-blocks).
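
A sketch of that technique, with two hypothetical patterns where the first suppresses the second via continue (fd is assumed to be an open file handle):

set sc [scancontext create]
# lines flagged ERROR are reported here and nowhere else
scanmatch $sc {ERROR} {
	puts "error: $matchInfo(line)"
	continue	;# skip the remaining scanmatches for this line
}
# everything else with a timestamp-like field falls through to here
scanmatch $sc {[0-9]+:[0-9]+:[0-9]+} {
	puts "event: $matchInfo(line)"
}
scanfile $sc $fd
scancontext delete $sc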

Some of the potential power of these commands starts to reveal itself in the matchInfo array. The text of the matching line (array index line) is only the beginning. scanfile also sets the following indices in matchInfo:

	offset		the byte offset in the file of the beginning of the matching line
	linenum		the line number of the match (the first line of the file is line 1)
	context		the scancontext in which the match occurred
	handle		the file handle of the file being scanned
	submatchN	the string matched by the Nth parenthesized subexpression, if any
	subindexN	the starting and ending indices of the Nth subexpression's match

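For instance, a match action can report where in the file the match occurred, and pull a field out of the line with a parenthesized subexpression (the pattern here is hypothetical, and sc is an existing scancontext):

scanmatch $sc {user=([a-z]*)} {
	# linenum locates the match; submatch0 holds the text matched
	# by the first parenthesized subexpression
	puts "line $matchInfo(linenum): user is $matchInfo(submatch0)"
}
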
Here's another, slightly smarter example (code courtesy of R. Stover, UCO/Lick Observatory). To help you visualize what these code excerpts are really doing, here's some sample data from a "services" file which controls a large data-taking system:

host:lichenous.cia.org
trtalk     	 horticultist.cia.org        	# traffic controller host 
JAPANDTAKE  	/u/developers/crate/japan/dtake
TIGERDTAKE    	/u/developers/crate/tiger/dtake
DSP-TIGERDTAKE	/u/developers/host/lab/dtake
TOSSDTAKE   	/u/developers/crate/toss/dtake
TOSS_CONTROLLER nocontroller
SPROCKETDTAKE   /u/developers/crate/sprocket/dtake
LURKERDTAKE  	/u/developers/crate/lurker/dtake
lockfile    	/usr/local/noise/lockfile
dictdir     	/usr/local/noise/info/
errorlog    	/usr/local/noise/log/errorlog
runnerlog   	/usr/local/noise/log/runnerlog
trafficlog  	/usr/local/noise/log/trafficlog
INFOHOST  	horticultist.cia.org       	# Host for infoman
INFOMAN   	/u/developers/host/infoman/infoman
hamtalk     	horticultist.cia.org   		# Machine for talking to hambone 
hamport     	/dev/ttyb    			# Port for hambone spectrograph

And here is some code which parses this file:

# Read the services file and scan for all entries corresponding to
# a given host.  Return the entries in an array.
# Input:        File            The pathname of the file to scan
#               host            The host section to scan for
# Output:       savearray       The array into which this procedure
#                               stores the matched values.
#               The return value is 1 if all processing went OK and 0
#               otherwise.
#
# Sample call: Services /home/ccdev/dtakeservice myHost arrayName
#
#@package: services Services Servicefile
proc Services {File host savearray} {
    global ServiceHost
    set ServiceHost ""
    # open raises an error on failure rather than returning -1,
    # so trap it with catch
    if {[catch {open $File r} fd]} {return 0}
    set sc [scancontext create]
    scanmatch $sc "host:" {ScanServiceForHost $host}
    scanmatch $sc {ScanForService $savearray}
    scanfile $sc $fd
    scancontext delete $sc
    close $fd
    unset ServiceHost
    return 1
}

Here the author sets up two regexps, one to look for lines containing "host:" and the other to look at all lines not containing "host:" -- the lack of a regexp in the second scanmatch is shorthand for "does not match any of the regexps in the current scancontext". Here is ScanServiceForHost, which we call if the string "host:" is present in the line:

# Input:        host    The name of the host to scan for.
# Output:       Global ServiceHost is either set or unset.
proc ScanServiceForHost {host} {
    global ServiceHost
    upvar 1 matchInfo matchInfo
    set hline $matchInfo(line)
    set colon [string first ":" $hline]
    incr colon
    set hostval [string trim [string range "$hline" $colon end]]
    if {[string match $hostval $host] == 1} {
        set ServiceHost $host
    } else {
        set ServiceHost ""
    }
}

Here we find the hostname, that is, the string immediately following the string "host:", and check to see whether it matches our own hostname. If so, we are the ServiceHost. But for any line that does not contain "host:", we do this:

# Output: savearray     The services are stored in this array, with the
#                       service name used as the index.
proc ScanForService {savearray} {
    global ServiceHost
    if {$ServiceHost != ""} {
        upvar #0 $savearray save
        upvar 1 matchInfo matchInfo
        set sline [string trim $matchInfo(line)]
        if {[string length "$sline"] == 0} return
        if {[string first "#" "$sline"] == 0} return
        set sname [lvarpop sline 0]
        set service [string trim [string range "$sline" 0 end]]
        set save($sname) $service
    }
}

Here we use the global variable ServiceHost, which was set in ScanServiceForHost, and if it has been set (i.e. we previously encountered a line containing "host:" whose hostname matched our own) we collect the service names from the matching lines and stuff them in an array.
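
A call to Services, followed by a walk over the resulting array, might look like this (the file path is hypothetical):

Services /usr/local/noise/services lichenous.cia.org svc
foreach name [array names svc] {
	puts "$name -> $svc($name)"
}

Note that because Services stores the results with upvar #0, the name you pass (svc here) refers to a global array.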

You would have had to use two execs (a grep and a grep -v) or some untidy logic with a for_file to achieve the same result by other means. The file scanning commands are a concise, modular, all-Tcl way of performing complex processing of input files. They avoid the overhead of repeated process startup (via exec), and there are no external awk/sed script files to maintain. In my estimation scancontext, scanmatch, and scanfile are three of the most ingenious and useful commands in TclX.