C and Fortran cover most of the bases for scientific computing, but there are one or two others which come in useful occasionally.
Often, you can find yourself performing some repetitive
editing task, for example massaging data into a form
which a program can conveniently read. Such tasks can
conveniently, and reliably, be done by programs such as
awk and sed. Neither of these utilities is as well known
as it should be, as they can save a great deal of tedious
and error-prone effort.
sed is a version of the very simple editor ed, which is
specialised for performing edits on a stream of text. For
example, the following rather elaborate sed script prints
all the section headings from a LaTeX document:

    sed -n 's/^\\\(sub\)*section{\(.*\)}.*$/\2/p' sc13.tex

This may look like gibberish, but it is simpler than it
looks. The option -n instructs sed not to print out input
lines, which it does by default. The sed expression in
quotes calls the s command: whenever the `regular
expression' between the first pair of slashes matches,
the s command replaces it with the expression between the
second pair and, because the s command is suffixed with a
p, prints out the modified line. The
regular expression matches lines which start with a
backslash, have zero or more occurrences of the string
`sub', which is followed by the string `section{', then
any sequence of characters, followed by a }, then any
characters, ending at the end of the line. The caret ^
matches the beginning of a line; the backslash is a
special character, so it must be `escaped' by prefixing
it with another backslash, \\; the grouping operators are
\( and \); the asterisk indicates that the previous
(bracketed) expression may be present zero or more times;
the dot matches any character; and the dollar matches the
end of the line. As well as grouping, the parentheses
save what they match, and the expression \2 in the
replacement text refers to what the second pair of
parentheses matched, namely the text between the curly
braces. The overall result is that the matched string
(the whole of the line) is replaced by the contents of
the braces and then, because of the p suffix, printed.

Another useful tool is awk, named after
its designers Aho, Weinberger and Kernighan. Like sed, it
works through a text file, executing scraps of code
whenever a line matches some condition. Before it does
anything with a line, it breaks it into fields, separated
by whitespace by default. Consider the following example
[Note 7]:

    ps u | sed 1d | \
      awk '{print $4, $0; totmem+=$4}; END {printf "total memory: %f\n", totmem}'

This (not terribly useful) line generates a process
listing using ps, uses sed to
delete the first line (that is, it executes the command d
on line number 1), and then passes the result through the
awk program contained in quotes. On every line, this
prints field number 4 (the %MEM column in the listing)
and field 0 (which is awk-speak for the whole input
line), and adds the value of the fourth field to a
running total; on the line matching the pattern END --
that is, the pseudo-line at the end of the file -- awk
prints out the accumulated total.

You won't typically generate expressions as complicated
as these on the fly (at least, not until you get
really good). These examples are intended to
suggest that you can, in aliases or in scripts, perform
quite complicated transformations of text files. For
further details you could look at the sed or awk
man-pages, which are complete but very compressed, or
work through a tutorial in your system's printed
documentation. There are several guides to sed and awk,
but you might be best off, initially, using an advanced
introduction to Unix, such as [quigley] or [nutshell].
The canonical documentation for regular expressions is on
the ed(1) manual page.
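If you want to experiment before pointing these tools at real files, the following sketch runs both on small throwaway inputs. The file names sample.tex and masses.dat, their contents, and the shell variable names are all invented for the illustration; only standard sed and awk features, plus mktemp, are assumed.

```shell
# Work in a scratch directory so nothing real is touched.
tmpdir=$(mktemp -d)

# A made-up LaTeX fragment.
cat > "$tmpdir/sample.tex" <<'EOF'
\section{Introduction}
Some text which is not a heading.
\subsection{Other tools} % trailing comment
EOF

# The sed script from the text: print only the heading titles.
headings=$(sed -n 's/^\\\(sub\)*section{\(.*\)}.*$/\2/p' "$tmpdir/sample.tex")
printf '%s\n' "$headings"

# A made-up table: one header line, then name/value pairs.
cat > "$tmpdir/masses.dat" <<'EOF'
name mass
alpha 1.5
beta 2.25
gamma 0.25
EOF

# Delete the header with sed, then let awk print and sum field 2.
summary=$(sed 1d "$tmpdir/masses.dat" |
  awk '{ print $2; total += $2 } END { printf "total: %.2f\n", total }')
printf '%s\n' "$summary"

rm -r "$tmpdir"
```

The second pipeline has the same shape as the ps example: sed removes a header line, awk acts on every remaining line, and the END block reports the accumulated total.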
Perl is a general-purpose scripting language. It
started off as a text-reformatting facility, rather like
a super-awk, but it has now grown to the
point where it really is a programming language in its
own right, capable of supporting quite substantial
projects. Perl programmers can call on a huge range of
supporting code, collected at the Comprehensive Perl
Archive Network, CPAN, to do
everything from internet programming to database access.
Perl's expressive power makes it ideal for rapid
development of all sorts of complex systems -- some huge
proportion of the web's CGI scripts, for example, are
written in Perl. Unfortunately, the flexibility of
Perl's syntax makes it quite possible to write spaghetti,
the like of which we have not seen since Fortran IV
dropped out of fashion.
The Perl manual pages
are reasonably clear. O'Reilly publishes a good book on
Perl, written by Larry Wall, its author
[wall]. This is a good reference
book, but [schwartz97] is possibly a
better tutorial introduction. Perl regular expressions
are slightly different from the ones used by
sed and friends -- see the perlre manual page.
Perl is a semi-interpreted language. Somewhat like Java, when the Perl interpreter first processes your Perl script, it compiles it to an internal code, which it then proceeds to interpret. This means that Perl programs have a relatively long startup time, but run reasonably efficiently after that. This is not a big issue in most applications.
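One visible consequence of this compile-then-run behaviour can be sketched as follows, assuming a perl executable on the PATH. The one-line program is invented: it has a valid print statement followed by a deliberately unfinished if. Because the whole script is compiled before any of it is executed, the syntax error stops Perl before the print ever runs.

```shell
# The print statement comes first, but the unfinished "if (" makes the
# compilation phase fail, so perl never reaches the execution phase.
output=$(perl -e 'print "reached\n"; if (' 2>/dev/null) || status=$?
printf 'output: "%s", exit status: %s\n' "$output" "${status:-0}"
```

A purely line-by-line interpreter would have printed the message before tripping over the error; here the captured output is empty and the exit status is non-zero.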
The current (end-2001) version of Perl is 5.6 or thereabouts. Perl 6 will be a significant step in the evolution of the language: it's in the offing, but still some way away.