This script takes one or several source code files with code prologues more or less resembling the SST prologue conventions and marks them up as SGML conforming to the Starlink Programcode DTD. It does this using a combination of special knowledge about the likely contents of Starlink prologues and rules or guesses about the meaning of indentation, spacing, key words and so on in the source. Some of this processing can be influenced by use of an optional configuration file specified on the command line. This is not normally necessary, but can be done to tailor the program's behaviour to particular foibles of the source code to be converted.
According to the command-line flags, the output may or may not contain enough surrounding markup for it to constitute an entire SGML programcode document. The SGML is in all cases written to standard output.
The nature of the output is controlled by the -d and -D flags as follows:
If the -D flag is given, then output will be a monolithic document which contains the marked up prologues of the given source-files and the surrounding markup necessary to constitute a free-standing SGML document conforming to the programcode DTD. In this case the parts of the source code which are not the prologue (i.e. the body of the code) will be discarded.
If the -d flag is given, then output will be a container document which refers to the given source-files by way of entity references.
If neither the -d nor -D flag is given, then output will be a version of the source code in the given source-files in which the prologue is marked up with SGML tags. This output does not constitute a free-standing SGML document, but may be included in a suitable container document. The output contains the original source code, and will look just the same to a compiler as the input file. The resulting file can, if desired, be used as the new primary copy of the source code.
The default configuration is designed to convert Fortran source code, but since many Starlink C source files contain prologues which look identical to those of Fortran files (each comment line begins with a *) the program may be used on suitable C files too.
Example 1:
code2sgml -1 -D sub.f subwrap.c > subs.sgml
The first prologue from each of the named files is marked up and written, with appropriate pre- and post-amble, to a new file subs.sgml which constitutes a complete freestanding SGML document. This will not include the main body of the source code from the original files.
Example 2:
code2sgml -l script -c config.pl setup.sh > new/setup.sh
A new version of the named file is written to a different directory; any prologues in it will be marked up, but changes will only be made in comment lines, so that the new file will appear identical to the old one as far as the interpreter is concerned. The file config.pl is used to override some of the default configuration options of the program. Because of the -l script flag, the default configuration options are those appropriate for script-type source code.
Once it has been checked that the markup is satisfactory, the old source file can be discarded and the new one used as a replacement. The new file does not by itself constitute a complete SGML document.
Example 3:
code2sgml -d *.f > routines.sgml
A container document is written which references all .f files in the current directory. This constitutes a complete SGML document, but requires the same .f files, with prologues marked up in SGML as in the previous example, to be in its directory for subsequent processing.
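-1
Example 1 uses this flag to mark up only the first prologue from each named file.
-c config-file
The named config-file is executed as a user configuration file, overriding the default settings (see the description of configuration files).
-D
Output is a monolithic, free-standing SGML document containing the marked up prologues; the body of the code is discarded.
-d
Output is a container document which refers to the source files by way of entity references.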
-l language
The file c2s.language is loaded from code2sgml's home directory. Currently the values `fortran' and `script' are available; `fortran' is the default and will work for many C prologues too (as long as they follow the convention of starting each prologue comment line with an asterisk). The `script' file should work, on the whole, for files in which prologue comment lines begin with a hash character.
All the values implied by the value of this flag can in any case be overridden from the user configuration file (see the -c flag).
If neither the -d nor -D flag is given, a marked up copy of the named file, which does not constitute a complete SGML document, will be written.
The quality of markup made by this program is inevitably not perfect. For prologues which conform to the usages common in, say, KAPPA, it should provide quite good output, but peculiar indentations, spellings, linebreaks, lists, verbatim-type blocks etc. are bound to confuse it. For some of these, if usage is consistent within a package to be converted, it may be worthwhile to write a configuration file. In any case it will always be advisable to cast an eye over the SGML output, or failing that, over the resulting downconverted documentation.
A particular idiosyncrasy of the program is that it converts miscellaneous blocks of text into one-element ul lists. This is often not the best markup for such text (depending on context it may have the sense of a block quote, verbatim text, a list of some sort ...), but without knowing the semantics of the text it's hard to come up with anything better. It may therefore be a good idea to look for <ul><li> sequences in the output with a view to replacing them by something more appropriate.
Another thing to look for particularly is normal text marked up as verbatim and, especially, vice versa.
The converter currently discards top-level sections which have no (non-ignored) content, so that placeholders for content, e.g. an empty `Bugs:' heading, will not be propagated into the output document.
A configuration file may be named on the command line using the -c flag as described above. This takes the form of executable Perl code which can override the default values of certain variables and functions used by the program. Many of these are regular expressions. By adjusting these to the conventions used in a given set of source files the upconversion can be tailored to work better than it would do with the default settings. Modifying the default settings does not require much expertise in Perl programming as such, but it is necessary to understand Perl regular expressions for modifying the pattern matching variables. Perl regular expressions are a superset of normal (grep(1)-style) regular expressions.
The program actually executes the file c2s.fortran (or c2s.script; see the description of the -l flag) from its own home directory before executing the user config file if one has been specified. All or parts of c2s.fortran or c2s.script can therefore be used as a skeleton for modification if required.
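A user configuration file can be very short. The following sketch is a hypothetical config.pl which narrows $ForceLi_rx (one of the regular expressions described below). The pattern shown is an illustrative assumption rather than the shipped default, and the sketch assumes that the settings are plain Perl globals which can be assigned as strings.

```perl
# config.pl: hypothetical user configuration file, used as
#     code2sgml -c config.pl source.f > new/source.f
# Assumption: the settings are ordinary global variables, so a plain
# assignment here replaces the value set by c2s.fortran or c2s.script.

# Begin a list item only at a dash which is followed by whitespace and
# then some text, so that a wrapped line starting "-1 is returned" is
# not taken for a list.  No `^' is used, because the pattern is matched
# after $Prefix_rx.  The (?=...) group is a lookahead, so it does not
# capture and is not a backreferencing parenthesis.
# (Illustrative pattern only; see c2s.fortran for the real default.)
$ForceLi_rx = '-\s+(?=\S)';

1;   # end with a true value; harmless however the file is executed
```

A line whose text begins with a genuine punctuation dash followed by a space would still match this pattern, so the output should be checked in any case.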
The following are regular expressions which can be modified. They are written using the normal Perl 5 regular expression syntax. Except where noted, they should not contain backreferencing parentheses.
$Blank_rx
... `^'.
$Bullet_rx
... $Prefix_rx and should not contain a `^'. Normally only blocks in which all the elements can be matched against this expression will be interpreted as a list.
$Comment_rx
... `^'.
$Default_rx
... `$' character should not be used.
$Discard_rx
... `^'. The default value ignores the placeholder lines such as {insert_further_arguments_here} which pepper some Starlink prologues.
$ForceLi_rx
... ul element will be begun here. In this case the list item continues until the end of the block, or until another list item of the same sort or a blank line is encountered. Note that if $ForceLi_rx is defined too broadly it can begin lists where they are not intended, for instance if set simply to `-' it may interpret a punctuation dash at the start of a line as the start of a list. The pattern is assumed to follow $Prefix_rx and so should not contain the `^' character.
$ForceDt_rx
... dt (item description in a dl-type list) is certainly starting here (unless we are specifically expecting something else, for instance inside a verbatim block). If we are not already within a list a new dl element will be begun here. In this case the list item continues until the end of the block, or until another list item of the same sort or a blank line is encountered. The expression should contain one set of parentheses which surround the content of the dt element. Note that if $ForceDt_rx is defined too broadly it can begin lists where they are not intended. The pattern is assumed to follow $Prefix_rx and so should not contain the `^' character. The default expression will recognise list items which look like
- Item -- Description
See also the $Dt_maxlines variable.
@Headings
... Name, Usage etc. which get turned into top-level children of the routineprologue element. It is used in conjunction with the $Posthead_rx pattern. Each expression will be surrounded in grouping parentheses (so the disjunction character `|' may be used) and matched insensitive to case. The form of @Headings is a list of pairs of (element-type, regular expression). Earlier matches in the list are preferred.
$Posthead_rx
$Prefix_rx
... `^'. The default value just ignores leading whitespace.
$Probegin_rx
... $Proend_rx. It should start with a `^' character and may end with a `$' character. The default value will match a leading comment-start character followed by one or more plus signs.
$Proend_rx
... `^' character and may end with a `$' character. The default value will match a leading comment-start character followed by one or more minus signs.
$Verbatim_rx
... $Verbatim_rx then the whole block is assumed to be verbatim text. The default value is at least four whitespace characters, since if a line contains these it is probably using spacing for text formatting. Note that trailing spaces are stripped by the code before this is assessed and leading spaces will normally get mopped up by $Prefix_rx matching.
The following are other variables which affect the operation of the converter:
$Brkleng
... $Maxleng). This gives the length to which the lines will be truncated by preference. $Brkleng should be shorter than $Maxleng, unless $Maxleng is zero.
$Commbegin
$Doctitle
... title element in the output document.
$Dt_maxlines
...
   Some text
      Some more text
under certain circumstances this will be identified as a list item description and its describing text, and it will be converted to a dt, dd pair in a dl list element. $Dt_maxlines gives the largest number of lines which the would-be dt block may occupy in the source prologue for this identification to be made; if larger it's assumed to be something else. The default value is 1.
$Exactindent
... $Exactindent to N. This makes the interpretation of block hierarchies more reliable. However, if this may sometimes not be the case, then this variable is best set to zero, in which case the program will try to make more relaxed intelligent decisions about hierarchies. The default value is zero.
$Fpi
$Maxleng
... $Maxleng an attempt is made to break it (to the size given by $Brkleng). Only the minimum number of line breaks is made to satisfy this criterion. If $Maxleng is zero, then no line breaks will be made.
$Normleng
$Verbose
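The line-length and list settings above are simple scalar values, so a user configuration file might adjust them as in the following sketch. The numbers are illustrative assumptions, not the program's defaults (only the default of 1 for $Dt_maxlines is stated above), and lengths are assumed to be counted in characters.

```perl
# Hypothetical fragment of a user configuration file (see the -c flag).
# Assumption: these are ordinary global scalars set by plain assignment.

# Attempt to break over-long lines, using 78 as the limit and 72 as the
# preferred length.  $Brkleng is kept shorter than $Maxleng, as required
# whenever $Maxleng is non-zero.  Setting $Maxleng = 0 instead would
# switch line breaking off altogether.
$Maxleng = 78;
$Brkleng = 72;

# Allow a would-be dt block to occupy up to two source lines
# (the default is 1).
$Dt_maxlines = 2;

1;   # end with a true value; harmless however the file is executed
```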
The following are subroutines used for parsing and manipulating certain strings. More detailed descriptions of their arguments and return values are given with their default definitions in the code.
Auth_parse( $authline ) = ( $id, $name, $affil )
MBT: Mark Taylor (STARLINK)
Hist_parse( $histline ) = ( $id, $date )
25-MAY-2000 (MBT):
Name2id( $routinename ) = ( $id )
... routinename element into a value to use for the referencing ID of the routinename element. The default definition simply discards any words after the first one and uses that as the result. Depending on what IDs the main SUN document uses to refer to routines in the routinelist, changes may need to be made to this, for instance some sort of case folding.
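Case folding of the kind just mentioned could be added by redefining Name2id in the user configuration file. The following is only a sketch: it assumes that the subroutine receives the routine name as its single argument and returns the ID, as the signature above suggests, and the choice of upper case is arbitrary.

```perl
# Hypothetical replacement for Name2id in a user configuration file.
# Like the default described above it keeps only the first word of the
# routinename content, but it also folds that word to upper case.
sub Name2id {
    my ($routinename) = @_;
    my ($first) = $routinename =~ /(\S+)/;    # first whitespace-delimited word
    return defined $first ? uc $first : '';   # empty string if there is no word
}

1;   # end with a true value; harmless however the file is executed
```

With this definition a call such as Name2id('kpg1_abc - example routine') returns 'KPG1_ABC'; the routine name here is made up for illustration. Whichever folding is chosen should match the IDs used by the document which refers to the routines.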