Next Up Previous Contents
Next: 2.5 Code topics
Up: 2.4 Programming languages
Previous: 2.4.5 Java
[ID index][Keyword index]

2.4.6 Other languages

C and Fortran cover most of the bases for scientific computing, but there are one or two others which come in useful occasionally.

2.4.6.1 awk and sed

Often, you can find yourself performing some repetitive editing task, for example massaging data into a form which a program can conveniently read. Such tasks can be done quickly and reliably by programs such as awk and sed. Neither utility is as well known as it should be, given how much tedious and error-prone effort they can save.

sed is a version of the very simple editor ed, which is specialised for performing edits on a stream of text. For example, the following rather elaborate sed script prints all the section headings from a LaTeX document:

sed -n 's/^\\\(sub\)*section{\(.*\)}.*$/\2/p' sc13.tex

This may look like gibberish, but it is simpler than it looks. The option -n instructs sed not to print out input lines, which it does by default. The sed expression in quotes calls the s command: whenever the `regular expression' between the first pair of slashes matches, the s command replaces the matched text with the expression between the second pair and, because the s command is suffixed with a p, prints out the modified line.

The regular expression matches lines which start with a backslash, followed by zero or more occurrences of the string `sub', then the string `section{', then any sequence of characters, then a }, then any further characters up to the end of the line. Taking the pieces in turn: the caret ^ matches the beginning of a line; the backslash is a special character, so it must be `escaped' by prefixing it with another backslash, \\; the grouping operators are \( and \); the asterisk indicates that the previous (bracketed) expression may be present zero or more times; the dot matches any character; and the dollar matches the end of the line. As well as grouping, the parentheses save what they match, and the expression \2 in the replacement text refers to what the second pair of parentheses matched, namely the text between the curly braces. (Because .* matches as much as it can, this means everything up to the last } on the line, if there is more than one.) The overall result is that the matched string (the whole of the line) is replaced by the contents of the braces and then, because of the p suffix, printed.
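To see the expression at work without a LaTeX file to hand, the following sketch feeds a few invented lines to the same sed command. The sample input and the variable name headings are assumptions made for illustration only; they do not come from sc13.tex.

```shell
# Invented sample input standing in for a LaTeX file (not taken from sc13.tex).
# printf '%s\n' prints each argument on its own line, backslashes intact.
headings=$(printf '%s\n' \
  '\section{Introduction}' \
  'Some body text.' \
  '\subsection{Background}' \
  '\subsubsection{Details} % a trailing comment' |
  sed -n 's/^\\\(sub\)*section{\(.*\)}.*$/\2/p')

# Show what sed extracted, one heading per line.
printf '%s\n' "$headings"
```

This should print Introduction, Background and Details on separate lines: the body-text line does not match and so is suppressed by -n, and the trailing comment is swallowed by the final .* of the expression.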

Another useful tool is awk, named after its designers Aho, Weinberger and Kernighan. Like sed, it works through a text file, executing scraps of code whenever a line matches some condition. Before it does anything with a line, it breaks it into fields, separated by whitespace by default. Consider the following example [Note 7]:


ps u | sed 1d | \
   awk '{print $4, $0; totmem+=$4}; END {printf "total memory: %f\n", totmem}'

This (not terribly useful) line generates a process listing using ps, uses sed to delete the first line (that is, it executes the command d on line number 1), and then passes the result through the awk program contained in quotes. On every line, this prints field number 4 (the %MEM column in the listing) and field 0 (which is awk-speak for the whole input line), and adds the value of the fourth field to a running total; on the line matching the pattern END -- that is, the pseudo-line at the end of the file -- awk prints out the accumulated total.
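Because the output of ps differs from machine to machine, it can be easier to follow the same pipeline on fixed input. The sketch below is a variation rather than the original command: the listing is invented sample data, cut down to five columns so that %MEM is still field 4, and the awk program prints the command name (field 5) instead of the whole line. The variable name summary is likewise an assumption.

```shell
# Invented sample data in a cut-down `ps u` layout: %MEM is field 4.
summary=$(printf '%s\n' \
  'USER PID %CPU %MEM COMMAND' \
  'alice 101 0.0 1.5 bash' \
  'alice 102 0.3 2.5 emacs' |
  sed 1d |
  awk '{ print $4, $5; totmem += $4 }
       END { printf "total memory: %.1f\n", totmem }')

# sed 1d removed the header line; awk printed two fields per line
# and then the running total in its END block.
printf '%s\n' "$summary"
```

With this input the result should be the lines `1.5 bash', `2.5 emacs' and `total memory: 4.0'. Note that totmem needs no declaration: awk treats an unset variable as zero the first time it is used in arithmetic.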

You won't typically generate expressions as complicated as these on the fly (at least, not until you get really good). These examples are intended to suggest that you can, in aliases or in scripts, perform quite complicated transformations of text files. For further details you could look at the sed or awk man-pages, which are complete but very compressed, or work through a tutorial in your system's printed documentation. There are several guides to sed and awk, but you might be best off, initially, using an advanced introduction to Unix, such as [quigley] or [nutshell]. The canonical documentation for regular expressions is on the ed(1) manual page.

2.4.6.2 Perl

Perl is a general-purpose scripting language. It started off as a text-reformatting facility, rather like a super-awk, but it has now grown to the point where it really is a programming language in its own right, capable of supporting quite substantial projects. Perl programmers can call on a huge range of supporting code, collected at the Comprehensive Perl Archive Network, CPAN, to do everything from internet programming to database access. Perl's expressive power makes it ideal for rapid development of all sorts of complex systems -- some huge proportion of the web's CGI scripts, for example, are written in Perl. Unfortunately, the flexibility of Perl's syntax makes it quite possible to write spaghetti, the like of which we have not seen since Fortran IV dropped out of fashion.

The Perl manual pages are reasonably clear. O'Reilly publishes a good book on Perl, written by Larry Wall, its author [wall]. This is a good reference book, but [schwartz97] is possibly a better tutorial introduction. Perl regular expressions are slightly different from the ones used by sed and friends -- see the perlre manual page.

Perl is a semi-interpreted language. Somewhat as with Java, when the Perl interpreter first processes your script, it compiles it to an internal code, which it then proceeds to interpret. This means that Perl programs have a relatively long startup time, but run reasonably efficiently after that. This is not a big issue in most applications.

The current (end-2001) version of Perl is 5.6 or thereabouts. Perl 6 will be a significant step in the evolution of the language: it's in the offing, but still some way away.


Theory and Modelling Resources Cookbook
Starlink Cookbook 13
Norman Gray
2 December 2001. Release 2-5. Last updated 10 March 2003