code2sgml - Convert source code prologue to SGML.

Description

This script takes one or several source code files with code prologues more or less resembling the SST prologue conventions and marks them up as SGML conforming to the Starlink Programcode DTD. It does this using a combination of special knowledge about the likely contents of Starlink prologues and rules or guesses about the meaning of indentation, spacing, key words and so on in the source. Some of this processing can be influenced by use of an optional configuration file specified on the command line. This is not normally necessary, but can be done to tailor the program's behaviour to particular foibles of the source code to be converted.

According to the command-line flags, the output may or may not contain enough surrounding markup for it to constitute an entire SGML programcode document. The SGML is in all cases written to standard output.

The nature of the output is controlled by the -d and -D flags as follows:

If the -D flag is given, then output will be a monolithic document which contains the marked up prologues of the given source-files and the surrounding markup necessary to constitute a free-standing SGML document conforming to the programcode DTD. In this case the parts of the source code which are not the prologue (i.e. the body of the code) will be discarded.
If the -d flag is given, then output will be a container document which refers to the given source-files by way of entity references.
If neither the -d nor -D flag is given, then output will be a version of the source code in the given source-files in which the prologue is marked up with SGML tags. This output does not constitute a free-standing SGML document, but may be included in a suitable container document. The output contains the original source code, and will look just the same to a compiler as the input file. The resulting file can, if desired, be used as the new primary copy of the source code.

The default configuration is designed to convert Fortran source code, but since many Starlink C source files contain prologues which look identical to those of Fortran files (each comment line begins with a *) the program may be used on suitable C files too.

Examples

Example 1:

     code2sgml -1 -D sub.f subwrap.c > subs.sgml

The first prologue from each of the named files is marked up and written, with appropriate pre- and post-amble, to a new file subs.sgml which constitutes a complete freestanding SGML document. This will not include the main body of the source code from the original files.

Example 2:

     code2sgml -l script -c config.pl setup.sh > new/setup.sh

A new version of the named file is written to a different directory; any prologues in it will be marked up, but changes will only be made in comment lines, so that the new file will appear identical to the old one as far as the compiler (or, for a script, the interpreter) is concerned. Because of the -l script flag, the default configuration options are those appropriate for script-type source code; the file config.pl is used to override some of them. Once it has been checked that the markup is satisfactory, the old source file can be discarded and the new one used as a replacement. The new file does not by itself constitute a complete SGML document.

Example 3:

     code2sgml -d *.f > routines.sgml

A container document is written which references all .f files in the current directory. This constitutes a complete SGML document, but requires the same .f files, with prologues marked up in SGML as in the previous example, to be in its directory for subsequent processing.

Authors

Mark Taylor

Flags

-1: If this flag is given, then only the first prologue identified in the file will be marked up. Otherwise, all prologues will be converted.
-c config-file: Specify a user configuration file, which is executable Perl code, to fine-tune conversion parameters. See below for a detailed description of what may be configured in such a file.
-D: Output a monolithic SGML document containing prologues from all the listed source-files, but omitting non-prologue parts of the code.
-d: Output only a container SGML document referring to constituent files via entity references. The named source-files do not need to exist when this program is invoked, but will need to be present when the document is subsequently processed.
-l language: Specify the language to be assumed by the converter. This determines the default configuration file to be used - a file called c2s.language is loaded from code2sgml's home directory. Currently the values `fortran' and `script' are available; `fortran' is the default and will work for many C prologues too (as long as they follow the convention of starting each prologue comment line with an asterisk). The `script' file should work, on the whole, for files in which prologue comment lines begin with a hash character.
All the values implied by the value of this flag can in any case be overridden from the user configuration file (see the -c flag).

If neither the -d nor the -D flag is given, a marked up copy of the named file, which does not constitute a complete SGML document, will be written.

Notes

The markup produced by this program is inevitably imperfect. For prologues which conform to the usages common in, say, KAPPA, it should provide quite good output, but peculiar indentations, spellings, linebreaks, lists, verbatim-type blocks etc. are bound to confuse it. For some of these, if usage is consistent within a package to be converted, it may be worthwhile to write a configuration file. In any case it will always be advisable to cast an eye over the SGML output, or failing that, over the resulting downconverted documentation.

A particular idiosyncrasy of the program is that it converts miscellaneous blocks of text into one-element ul lists. This is often not the best markup for such text (depending on context it may have the sense of a block quote, verbatim text, a list of some sort ...), but without knowing the semantics of the text it's hard to come up with anything better. It may therefore be a good idea to look for

        <ul><li>

sequences in the output with a view to replacing them by something more appropriate.

Another thing to look for particularly is normal text marked up as verbatim and, especially, vice versa.

The converter currently discards top-level sections which have no (non-ignored) content, so that placeholders for content, e.g. an empty `Bugs:' heading, will not be propagated into the output document.

Configuration file

A configuration file may be named on the command line using the -c flag as described above. This takes the form of executable Perl code which can override the default values of certain variables and functions used by the program. Many of these are regular expressions. By adjusting these to the conventions used in a given set of source files the upconversion can be tailored to work better than it would do with the default settings. Modifying the default settings does not require much expertise in Perl programming as such, but it is necessary to understand Perl regular expressions for modifying the pattern matching variables. Perl regular expressions are a superset of normal (grep(1)-style) regular expressions.

The program actually executes the file c2s.fortran (or c2s.script -- see the description of the -l flag) from its own home directory before executing the user config file if one has been specified. All or parts of c2s.fortran or c2s.script can therefore be used as a skeleton for modification if required.

The following are regular expressions which can be modified. They are written using the normal Perl 5 regular expression syntax. Except where noted, they should not contain backreferencing parentheses.

$Blank_rx

This matches a whole blank line in the source code -- blank lines are significant as paragraph breaks etc. It should start with a `^'.

$Bullet_rx

This matches a bullet in a normal list which is indented from its parent comment text. It is assumed to follow $Prefix_rx and should not contain a `^'. Normally only blocks in which all the elements can be matched against this expression will be interpreted as a list.
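
The all-or-nothing rule can be sketched as follows. The sketch is in Python purely for easy experimentation (a real configuration file is Perl), and both patterns are hypothetical stand-ins rather than the shipped defaults.

```python
import re

PREFIX_RX = r"^\s*"      # hypothetical stand-in for $Prefix_rx
BULLET_RX = r"[-o]\s+"   # hypothetical stand-in for $Bullet_rx (no '^')

def looks_like_list(first_lines):
    """True only if the first line of every element starts with a bullet."""
    rx = re.compile(PREFIX_RX + BULLET_RX)
    return bool(first_lines) and all(rx.match(line) for line in first_lines)
```

A block with one unbulleted element is therefore not treated as a list at all in this sketch.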

$Comment_rx

This matches any comment line. Only comment lines will become part of the marked up SGML. It should start with a `^'.

$Default_rx

This identifies the parameter default value at the end of an argument description block; the default expression identifies any text in square brackets. The expression should contain one set of parentheses which surround the default value itself. The match is assumed to be at the end of a line, so the `$' character should not be used.
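
As a rough illustration, the following Python sketch shows how a pattern of this kind could pick out a bracketed default. The pattern is an assumption, not the actual default, and the end-of-line anchoring is added by the sketch itself rather than written into the pattern.

```python
import re

# Hypothetical stand-in for $Default_rx: one capturing group around the
# text inside square brackets, with no trailing '$'.
DEFAULT_RX = r"\[([^\[\]]*)\]"

def split_default(line):
    """Return (text_without_default, default), or (line, None) if absent."""
    # The sketch appends the end-of-line anchor, so the pattern need not.
    m = re.search(DEFAULT_RX + r"\s*$", line)
    if m is None:
        return line, None
    return line[:m.start()].rstrip(), m.group(1)
```

Bracketed text in the middle of a line is left alone, because only a match at the end of the line counts.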

$Discard_rx

This matches the whole of any comment line which should be discarded altogether. It should start with a `^'. The default value ignores the placeholder lines such as

            {insert_further_arguments_here}

which pepper some Starlink prologues.
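
A pattern of this sort might behave as in the Python sketch below. The expression is a hypothetical stand-in (the real default is defined in the language configuration file), and it assumes `*' as the comment character.

```python
import re

# Hypothetical stand-in for $Discard_rx: optional '*' comment character,
# optional whitespace, then a single {placeholder} filling the whole line.
DISCARD_RX = re.compile(r"^\*?\s*\{[^{}]*\}\s*$")

def is_discarded(line):
    """Return True if the whole comment line is a placeholder to drop."""
    return DISCARD_RX.match(line) is not None
```

Because the pattern is anchored at both ends, a line which merely mentions something in braces is kept.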

$ForceLi_rx

This indicates that a normal list item is certainly starting here (unless we are specifically expecting something else, for instance inside a verbatim block). If we are not already within a list a new ul element will be begun here. In this case the list item continues until the end of the block, or until another list item of the same sort or a blank line is encountered. Note that if $ForceLi_rx is defined too broadly it can begin lists where they are not intended, for instance if set simply to `- ' it may interpret a punctuation dash at the start of a line as the start of a list. The pattern is assumed to follow $Prefix_rx and so should not contain the `^' character.

$ForceDt_rx

This indicates that a dt (item description in a dl-type list) is certainly starting here (unless we are specifically expecting something else, for instance inside a verbatim block). If we are not already within a list a new dl element will be begun here. In this case the list item continues until the end of the block, or until another list item of the same sort or a blank line is encountered. The expression should contain one set of parentheses which surround the content of the dt element. Note that if $ForceDt_rx is defined too broadly it can begin lists where they are not intended. The pattern is assumed to follow $Prefix_rx and so should not contain the `^' character. The default expression will recognise list items which look like

            - Item -- Description

See also the $Dt_maxlines variable.
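
The following Python sketch shows one way such an expression could recognise the `- Item -- Description' form. Both patterns are hypothetical stand-ins, and the sketch assumes the comment character has already been removed from the line.

```python
import re

PREFIX_RX = r"^\s*"                    # hypothetical stand-in for $Prefix_rx
FORCEDT_RX = r"-\s+(\S.*?)\s+--\s+"    # hypothetical stand-in for $ForceDt_rx

def parse_dt(line):
    """Return (term, description) if the line starts a dl item, else None."""
    m = re.match(PREFIX_RX + FORCEDT_RX, line)
    if m is None:
        return None
    # The single capturing group holds the content of the dt element.
    return m.group(1), line[m.end():]
```

Requiring the leading dash immediately after the prefix is what keeps a dash in running text from starting a list.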

@Headings

This is an array of regular expressions which match top-level headings in the prologue such as Name, Usage etc. which get turned into top-level children of the routineprologue element. It is used in conjunction with the $Posthead_rx pattern. Each expression will be surrounded by grouping parentheses (so the disjunction character `|' may be used) and matched case-insensitively. The form of @Headings is a list of pairs of (element-type, regular expression). Earlier matches in the list are preferred.

$Posthead_rx

This matches any text, usually punctuation, at the end of a top-level heading which is to be ignored. The default value allows a trailing colon to be ignored.
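
The interplay of @Headings and $Posthead_rx can be sketched in Python as follows. The pairs and the trailing pattern are invented for illustration; apart from routinename, the element-type names are assumptions and should be checked against the DTD.

```python
import re

# Hypothetical (element-type, pattern) pairs standing in for @Headings.
HEADINGS = [
    ("routinename", r"name"),
    ("usage", r"usage|invocation"),
    ("description", r"description"),
]
POSTHEAD_RX = r"\s*:?\s*"   # hypothetical $Posthead_rx: optional trailing colon

def match_heading(text):
    """Return the element type for a heading line, or None."""
    for element, rx in HEADINGS:
        # Each pattern is wrapped in a group so '|' stays local to it;
        # matching ignores case and the first pair that matches wins.
        pattern = "(?:" + rx + ")" + POSTHEAD_RX
        if re.fullmatch(pattern, text.strip(), re.IGNORECASE):
            return element
    return None
```

Because earlier pairs win, more specific patterns are best placed before more general ones.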

$Prefix_rx

This matches the leading part of a comment line which should be ignored (except during verbatim processing). It should start with a `^'. The default value just ignores leading whitespace.

$Probegin_rx

This matches the line which introduces the beginning of a code prologue block. If found within a prologue block which has already started, it is ignored. Otherwise it triggers conversion to SGML of the text following it until a line matching $Proend_rx. It should start with a `^' character and may end with a `$' character. The default value will match a leading comment-start character followed by one or more plus signs.

$Proend_rx

This matches the line which terminates a code prologue block. It should start with a `^' character and may end with a `$' character. The default value will match a leading comment-start character followed by one or more minus signs.
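
How the begin and end patterns delimit prologue blocks can be sketched in Python as below. The two expressions are hypothetical stand-ins assuming `*' as the comment-start character, and the first_only argument is an illustrative analogue of the -1 flag, not part of the program.

```python
import re

# Hypothetical stand-ins for $Probegin_rx and $Proend_rx.
PROBEGIN_RX = re.compile(r"^\*\++\s*$")
PROEND_RX = re.compile(r"^\*-+\s*$")

def prologue_blocks(lines, first_only=False):
    """Collect the lines between each begin/end marker pair."""
    blocks, current = [], None
    for line in lines:
        if current is None:
            if PROBEGIN_RX.match(line):
                current = []            # start a new prologue block
        elif PROEND_RX.match(line):
            blocks.append(current)      # end marker closes the block
            current = None
            if first_only:              # analogous to the -1 flag
                break
        elif not PROBEGIN_RX.match(line):
            current.append(line)        # a begin marker inside a block is ignored
    return blocks
```

Lines outside any marker pair, that is, the body of the code, never reach the collected blocks.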

$Verbatim_rx

If any line in a block matches $Verbatim_rx then the whole block is assumed to be verbatim text. The default value is at least four whitespace characters, since if a line contains these it is probably using spacing for text formatting. Note that trailing spaces are stripped by the code before this is assessed and leading spaces will normally get mopped up by $Prefix_rx matching.
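
The effect of the trimming described above can be seen in the Python sketch below. It reads `at least four whitespace characters' as a run of four or more, which is an assumption, and both patterns are hypothetical stand-ins.

```python
import re

VERBATIM_RX = re.compile(r"\s{4,}")   # hypothetical stand-in for $Verbatim_rx
PREFIX_RX = re.compile(r"^\s*")       # hypothetical stand-in for $Prefix_rx

def block_is_verbatim(block):
    """True if any line still contains a run of 4+ spaces after trimming."""
    for line in block:
        # Trailing spaces are stripped and the prefix removed before testing.
        body = PREFIX_RX.sub("", line.rstrip(), count=1)
        if VERBATIM_RX.search(body):
            return True
    return False
```

Only interior spacing, such as the gaps between table columns, triggers verbatim treatment in this sketch.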

The following are other variables which affect the operation of the converter:

$Brkleng

The converter will sometimes break lines which have become too long (see $Maxleng). This gives the preferred length to which such lines will be broken. $Brkleng should be shorter than $Maxleng, unless $Maxleng is zero.

$Commbegin

Text to place at the start of new comment lines.

$Doctitle

This gives the text of the top level title element in the output document.

$Dt_maxlines

If a block looks like

             Some text
                Some more text

under certain circumstances this will be identified as a list item description and its describing text, and it will be converted to a dt, dd pair in a dl list element. $Dt_maxlines gives the largest number of lines which the would-be dt block may occupy in the source prologue for this identification to be made -- if larger it's assumed to be something else. The default value is 1.

$Exactindent

If all blocks are indented such that one bit of text is subordinate to another bit of text only where the block is indented by exactly N spaces, then set $Exactindent to N. This makes the interpretation of block hierarchies more reliable. However, if this may sometimes not be the case, then this variable is best set to zero, in which case the program will try to make more relaxed intelligent decisions about hierarchies. The default value is zero.

$Fpi

This gives the Formal Public Identifier of the DTD to which the output document will claim to conform.

$Maxleng

Lines will often become longer when markup tags are inserted. If any line becomes longer than $Maxleng an attempt is made to break it (to the size given by $Brkleng). Only the minimum number of line breaks is made to satisfy this criterion. If $Maxleng is zero, then no line breaks will be made.
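
The relationship between $Maxleng and $Brkleng can be illustrated with the Python sketch below. It is not the converter's own algorithm: breaking at the last space before the preferred length, and prefixing continuation lines with a comment-start string (compare $Commbegin), are assumptions made for the illustration.

```python
def break_line(line, maxleng, brkleng, commbegin="* "):
    """Break an over-long comment line at spaces, aiming for brkleng."""
    if maxleng == 0:
        return [line]                   # zero disables line breaking
    pieces = []
    while len(line) > maxleng:          # break only while still too long
        cut = line.rfind(" ", len(commbegin), brkleng + 1)
        if cut == -1:
            break                       # no usable space: leave the rest alone
        pieces.append(line[:cut])
        line = commbegin + line[cut + 1:]
    pieces.append(line)
    return pieces
```

Because the loop stops as soon as the remainder fits within the maximum, the final piece may be longer than the preferred length.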

$Normleng

This gives the `normal' minimum length of a line. If two consecutive lines within a block (not at the end of it) are shorter than this, the block is assumed to be for verbatim formatting. This isn't always a good indication, but for many prologues it works quite well. If set to zero then blocks will not be identified as verbatim on the basis of line lengths.
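
One possible reading of this heuristic is sketched below in Python. It takes `not at the end of it' to mean that a pair of lines which includes the final line of the block is not counted, which is an interpretation and not something confirmed by the program source.

```python
def short_lines_suggest_verbatim(block, normleng):
    """True if two consecutive non-final lines are shorter than normleng."""
    if normleng == 0:
        return False                    # zero switches the heuristic off
    lengths = [len(line.rstrip()) for line in block]
    # Consider only pairs (i, i+1) that exclude the last line of the block,
    # since the last line of an ordinary paragraph is often short.
    for i in range(len(lengths) - 2):
        if lengths[i] < normleng and lengths[i + 1] < normleng:
            return True
    return False
```

An ordinary paragraph with a short last line passes unflagged, while a column of short entries is taken as verbatim.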

$Verbose

Determines whether warnings are given for unexpected input. The default value is 1. This should not normally be changed.

The following are subroutines used for parsing and manipulating certain strings. More detailed descriptions of their arguments and return values are given with their default definitions in the code.

Auth_parse( $authline ) = ( $id, $name, $affil )

This extracts the ID, name, and affiliation of an author from an author description line. The default definition will decode lines of the form:

            MBT: Mark Taylor (STARLINK)
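
A parser for lines of this form might look like the Python sketch below. The real Auth_parse is a Perl subroutine; this version only mirrors the documented return order, and treating the affiliation as optional is an assumption.

```python
import re

def auth_parse(authline):
    """Split 'ID: Name (AFFILIATION)' into (id, name, affil), or None."""
    m = re.match(r"^\s*(\w+)\s*:\s*(.*?)\s*(?:\(([^()]*)\))?\s*$", authline)
    if m is None:
        return None
    # affil is None when no parenthesised affiliation is present.
    return m.group(1), m.group(2), m.group(3)
```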

Hist_parse( $histline ) = ( $id, $date )

This extracts the author ID and date from a history change description line. The default definition will decode lines of the form:

            25-MAY-2000 (MBT):
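
A Python sketch of a parser for this form is given below. The real Hist_parse is a Perl subroutine, and the exact date format accepted here is an assumption based only on the example line.

```python
import re

def hist_parse(histline):
    """Extract (id, date) from a line like '25-MAY-2000 (MBT):', or None."""
    m = re.match(r"^\s*(\d{1,2}-[A-Za-z]{3}-\d{4})\s*\((\w+)\)\s*:?", histline)
    if m is None:
        return None
    # Return the author ID first, matching the documented order.
    return m.group(2), m.group(1)
```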

Name2id( $routinename ) = ( $id )

This turns the contents of the routinename element into a value to use for the referencing ID of the routinename element. The default definition simply discards any words after the first one and uses that as the result. Depending on what IDs the main SUN document uses to refer to routines in the routinelist, changes may need to be made to this, for instance by adding some sort of case folding.
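
The first-word behaviour, with one possible case-folding variant, can be sketched in Python as follows. The fold_case option is hypothetical and is included only to show where such a change would go.

```python
def name2id(routinename, fold_case=False):
    """Use the first word of the routine name as its ID."""
    words = routinename.split()
    ident = words[0] if words else ""
    # Optional case folding (an illustrative extension, not the default).
    return ident.upper() if fold_case else ident
```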
