This reads a file marked up using the Star2HTML extensions, plus an HTX index file, and produces a summary of the document marked up using the DocumentSummary DTD. That includes the sub*section structure, the cross-references, and most of the header. It emits a suitable CATALOG line on STDOUT.
The aim of this tool is only secondarily to produce an accurate, or complete, summary of the target document. It is primarily required simply to complete whilst running unattended (within the package's installation script), producing a valid SGML document, which can be used as a cross-reference target in the General DTD's DOCXREF element without being grossly misleading.
Usage:
    abstract-star2html.pl \
        --prefix=sun180.htx/ --output=sun180.summary \
        /star/docs/sun180.tex /star/docs/sun180.htx/htx.index \
        >>CATALOG 2>abstract-warnings
The parsing can cope with \xlabel commands either outside section headings or inside them, and copes with multiple \xlabel commands within a heading by emitting <label> elements after the heading. It logs a message to STDERR when it discovers this.
If the parser discovers LaTeX markup in the section headings (which is true at some point for almost every file), then it logs a message to STDERR.
The parsing respects \begin{htmlonly}...\end{htmlonly}.
The parsing copes with the arguments to each of the commands it matches (\newcommand{\star...} and \sub*section{...}) being on more than one line. It concatenates the lines before analysing them.
It emits a warning if an \xlabel doesn't appear in the HTX index file.
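The line-joining and \xlabel handling described above can be sketched roughly as follows. This is an illustration in Python, not the script's own Perl, and it makes several assumptions: the names read_headings, heading_argument and brace_depth are hypothetical, the regular expressions are guesses at what "a heading at the start of a line" means, and only \xlabel commands inside the heading argument are collected.

```python
import re
import sys

# Assumed patterns (not taken from the script): a heading command at the
# start of a line, and an \xlabel with a simple brace-free argument.
HEADING_START = re.compile(r"^\s*\\((?:sub)*)section\*?\s*\{")
XLABEL = re.compile(r"\\xlabel\s*\{([^{}]*)\}")


def brace_depth(text):
    """Count unclosed '{' in text, ignoring escaped braces."""
    plain = re.sub(r"\\[{}]", "", text)
    return plain.count("{") - plain.count("}")


def heading_argument(text, start):
    """Return the contents of the brace group that opens just before start."""
    depth, i = 1, start
    while i < len(text):
        if text[i] == "\\":
            i += 2          # skip the backslash and the character after it
            continue
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[start:i]
        i += 1
    return text[start:]     # unbalanced: take what there is


def read_headings(lines):
    """Yield (level, title, labels) for each heading found in lines."""
    source = iter(lines)
    for line in source:
        match = HEADING_START.match(line)
        if match is None:
            continue
        buf = line.rstrip("\n")
        # Concatenate following lines until the argument's braces balance.
        while brace_depth(buf) > 0:
            following = next(source, None)
            if following is None:
                break       # end of input: stop rather than loop forever
            buf += " " + following.strip()
        argument = heading_argument(buf, match.end())
        labels = XLABEL.findall(argument)
        if len(labels) > 1:
            print(f"warning: {len(labels)} \\xlabel commands in one heading",
                  file=sys.stderr)
        # The title is the argument with the \xlabel commands removed.
        title = " ".join(XLABEL.sub("", argument).split())
        yield len(match.group(1)) // 3, title, labels
```

The level is the number of "sub" prefixes, so \section gives 0 and \subsection gives 1. A caller would emit the heading first and then one <label> element per collected label.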
A file marked up using the Star2HTML extensions to LaTeX2HTML
The index file produced by HTX (see SUN/188)
The prefix is added to each of the filenames in the HTX index, to make the generated URLs relative to the appropriate root of the document server, which is set in the DSSSL variable %starlink-document-server%, and might have the value `file:///star/docs/'.
The name of the file to receive the generated output. If omitted, the result goes to STDOUT.
If --force is present, then any errors which emerge after this will terminate the program, but it will return with a zero exit status.
Print the version number and exit.
Type: file
A file conforming to the DocumentSummary DTD
The result could benefit from a little editing, to insert the attribute values of the AUTHOR element such as email address, which aren't included in the Star2HTML file, but it should be valid without it. However, these attributes aren't actually used in the cross reference, so there's no great loss at present.
It also assumes that the things it matches are at the beginning of the line, possibly preceded by whitespace (this isn't just to speed it up, but also to avoid matching any reference to the \section command within the body of the text).
The parsing doesn't attempt to deal with markup in section titles, but it does attempt to detect and warn about it, logging a message to stderr.
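A rough sketch of these two behaviours, matching headings only at the start of a line and warning about markup in the title, might look like this. It is written in Python for illustration rather than in the script's Perl; parse_heading, SECTION and MARKUP are hypothetical names, and the idea of what counts as "markup" (a backslash command, a dollar sign, or a brace) is an assumption.

```python
import re
import sys

# Anchored at the start of the line, allowing leading whitespace, so that a
# mention of \section in running text is not mistaken for a heading.
SECTION = re.compile(r"^\s*\\((?:sub)*)section\*?\s*\{(.*)\}\s*$")

# Assumed definition of markup in a title: a command, math shift, or group.
MARKUP = re.compile(r"\\[A-Za-z]+|[${}]")


def parse_heading(line, lineno=0):
    """Return (level, title) for a single-line heading, else None."""
    match = SECTION.match(line)
    if match is None:
        return None
    title = match.group(2).strip()
    if MARKUP.search(title):
        # Detect and report only; no attempt is made to interpret the markup.
        print(f"line {lineno}: markup in section title: {title!r}",
              file=sys.stderr)
    return len(match.group(1)) // 3, title
```

The title is passed through unchanged even when markup is found, which mirrors the stated aim of producing something valid rather than something polished.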
There's not a lot of point in working hard to make this code do much better than this, since that would essentially require the sophistication of a full document conversion.
Because folk can do arbitrarily clever things with newcommands, I've had to dumb down the parsing of them. The code here will successfully extract document number, author, etc., as long as the corresponding newcommand is all on one line. I've had to do the same with \(sub)*section parsing.
This will fail sometimes, but the result will be a thin but valid SGML document, and will not cause this code to spin its wheels indefinitely. If you include the option --force, then even if there's some fatal error, such as an input file not being present, the script will still return with a zero exit status.
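The effect of --force on the exit status can be illustrated with a small sketch. This is Python rather than the script's Perl, and only the option names --prefix, --output and --force come from the text above; the positional argument names, the summarise placeholder, and the choice of 1 as the failure status are all assumptions.

```python
import argparse
import sys


def summarise(tex_path, index_path):
    """Placeholder for the real work: just open both inputs and count lines."""
    with open(tex_path, encoding="latin-1") as tex:
        with open(index_path, encoding="latin-1") as index:
            return len(tex.readlines()), len(index.readlines())


def main(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--prefix", default="")
    parser.add_argument("--output")
    parser.add_argument("--force", action="store_true")
    parser.add_argument("texfile")      # hypothetical name
    parser.add_argument("htxindex")     # hypothetical name
    args = parser.parse_args(argv)
    try:
        summarise(args.texfile, args.htxindex)
    except OSError as err:
        # The error is still reported and still ends the run, but with
        # --force the caller (an installation script, say) sees success.
        print(f"fatal: {err}", file=sys.stderr)
        return 0 if args.force else 1   # 1 is an assumed failure status
    return 0
```

A wrapper would call sys.exit(main(sys.argv[1:])), so that an unattended installation can carry on past a missing input file when --force is given.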