Package uk.me.nxg.enormity.esis

Writes out a SAX stream in a format based on the sgmls ESIS output.


Interface Summary
EsisWriter Provides the writing functions needed by an EsisHandler
 

Class Summary
EsisHandler Writes out a SAX stream in a format based on the sgmls ESIS output.
EsisParser A parser which can interpret the pseudo-ESIS syntax of EsisHandler.
StreamEsisWriter Writes ESIS output to a stream, taking care of encodings and line separators
 

Package uk.me.nxg.enormity.esis Description

Writes out a SAX stream in a format based on the ESIS output of sgmls, which defines the original format. That format was designed to be easy for downstream tools to parse. Its purpose here is to turn an XML file into an unambiguous byte stream and, further, to permit a normalisation operation which is both well-defined and simple.

The ESIS and SAX models do not overlap completely, so this format differs from the sgmls output in places. All of the differences are extensions rather than changes.

The output consists of a sequence of lines, separated by CR LF (i.e. bytes 0x0d 0x0a). Each line consists of a start character indicating which type of output record it represents, followed by one or more arguments. Each record type always takes the same number of arguments, separated by single spaces.

Record                              Meaning                        Origin
Mprefix uri                         start prefix mapping           extn
mprefix                             end prefix mapping             extn
Aattname CDATA value                declare attribute              ESIS
Bnamespace localname CDATA value    declare namespaced attribute   extn
(name                               start element                  ESIS
[namespace localname                start namespaced element       extn
)name                               end element                    ESIS
]namespace localname                end namespaced element         extn
-text                               character content              ESIS
=text                               ignorable whitespace           extn
?pi data                            processing instruction         ESIS
Xname                               skipped entity                 extn
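
As an illustration of the record layout above, the following Java sketch splits one record line into its arguments. The class and method names are hypothetical and are not part of this package. It also assumes that the final argument (the attribute value, text or PI data) runs to the end of the line and so may contain spaces; the description above does not say how such spaces are handled.

```java
import java.util.Arrays;
import java.util.List;

final class EsisRecordSketch {
    /** Number of arguments taken by each record type, as listed in the table. */
    static int argumentCount(char type) {
        switch (type) {
            case 'm': case '(': case ')': case '-': case '=': case 'X':
                return 1;
            case 'M': case '[': case ']': case '?':
                return 2;
            case 'A':
                return 3;
            case 'B':
                return 4;
            default:
                throw new IllegalArgumentException("Unknown record type: " + type);
        }
    }

    /** Splits one line (without its CR LF) into its arguments. */
    static List<String> arguments(String line) {
        if (line.isEmpty()) {
            throw new IllegalArgumentException("Empty record");
        }
        int count = argumentCount(line.charAt(0));
        // Assumption: the last argument extends to the end of the line,
        // so the split is limited to 'count' pieces.
        return Arrays.asList(line.substring(1).split(" ", count));
    }
}
```

For example, `EsisRecordSketch.arguments("Burn:namespace att CDATA bar")` yields the four arguments `urn:namespace`, `att`, `CDATA` and `bar`.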

An important function of this package is to normalise the ESIS output. It does this in the following ways:

  1. Attribute records (‘A’ and ‘B’) are alphabetised on output.
  2. Consecutive ‘character content’ events are merged, and leading and trailing whitespace is trimmed from the merged event. If the result is empty, it is discarded. Ignorable whitespace is discarded.
  3. Start and end prefix mappings (‘M’ and ‘m’) are discarded.
  4. Any processing instruction whose target is ‘signature’ is removed.
  5. All of the output is encoded to bytes as UTF-8.
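
The first four rules above can be sketched in Java as follows. This is an illustration only, not the package's implementation: the class name is hypothetical, and each record is modelled as a string holding the type character followed by unescaped content. Two further assumptions are made: attribute records are sorted by comparing the whole record string, and discarded records do not interrupt a run of character content.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class EsisNormaliseSketch {
    /** Applies normalisation rules 1 to 4 to a list of in-memory records. */
    static List<String> normalise(List<String> records) {
        List<String> out = new ArrayList<>();
        List<String> attributes = new ArrayList<>(); // awaiting their start element
        StringBuilder text = new StringBuilder();    // merged character content
        for (String record : records) {
            char type = record.charAt(0);
            switch (type) {
                case '-': // rule 2: accumulate character content
                    text.append(record, 1, record.length());
                    continue;
                case '=': case 'M': case 'm': // rules 2 and 3: discard
                    continue;
                case '?': // rule 4: drop PIs whose target is 'signature'
                    if (record.substring(1).split(" ", 2)[0].equals("signature")) {
                        continue;
                    }
                    break;
                default:
                    break;
            }
            flushText(text, out);
            if (type == 'A' || type == 'B') {
                attributes.add(record);
                continue;
            }
            if (type == '(' || type == '[') {
                // rule 1: assumed ordering is plain string comparison
                Collections.sort(attributes);
                out.addAll(attributes);
                attributes.clear();
            }
            out.add(record);
        }
        flushText(text, out);
        return out;
    }

    /** Emits the merged text, trimmed, unless nothing but whitespace remains. */
    private static void flushText(StringBuilder text, List<String> out) {
        String trimmed = text.toString().trim();
        text.setLength(0);
        if (!trimmed.isEmpty()) {
            out.add("-" + trimmed);
        }
    }
}
```

Rule 5 is not shown; in this sketch it would amount to encoding the resulting lines with `StandardCharsets.UTF_8` once they have been escaped and joined.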

Each start element record is preceded by the attribute records for that element.

The result of this is to turn the XML:

<doc><ns:p class='foo' xmlns:ns="urn:namespace" ns:att='bar'>Hello</ns:p>
  <p> there,
chum
</p>
</doc>

into the (unnormalised) ESIS form:

(doc
Mns urn:namespace
Aclass CDATA foo
Burn:namespace att CDATA bar
[urn:namespace p
-Hello
]urn:namespace p
mns
-\n  
(p
- there,\nchum\n
)p
-\n
)doc

The same document has the normalised form:

(doc
Aclass CDATA foo
Burn:namespace att CDATA bar
[urn:namespace p
-Hello
]urn:namespace p
(p
-there,\nchum
)p
)doc

In the normalised form, the prefix mappings have been removed (the prefixes are not semantically important), leading and trailing whitespace has been removed from the ‘-’ lines, and all-whitespace ‘-’ records have been removed.
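
The mapping from SAX events to the unnormalised records shown earlier can be sketched with a standard `org.xml.sax.helpers.DefaultHandler` subclass. This is a hypothetical illustration, not the package's EsisHandler: the class name and the `records` field are invented, and the escaping rule is an assumption. The example output shows newlines written as `\n`; doubling backslashes is borrowed from sgmls and is not confirmed by the description above.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: collects one record string per SAX event. */
final class EsisRecordingSketch extends DefaultHandler {
    final List<String> records = new ArrayList<>();

    // Assumption: backslash and newline are escaped as in sgmls output.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\n", "\\n");
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        records.add("M" + prefix + " " + uri);
    }

    @Override
    public void endPrefixMapping(String prefix) {
        records.add("m" + prefix);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Attribute records come first, then the start element record.
        for (int i = 0; i < atts.getLength(); i++) {
            String tail = " " + atts.getType(i) + " " + escape(atts.getValue(i));
            if (atts.getURI(i).isEmpty()) {
                records.add("A" + atts.getQName(i) + tail);
            } else {
                records.add("B" + atts.getURI(i) + " " + atts.getLocalName(i) + tail);
            }
        }
        records.add(uri.isEmpty() ? "(" + qName : "[" + uri + " " + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        records.add(uri.isEmpty() ? ")" + qName : "]" + uri + " " + localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        records.add("-" + escape(new String(ch, start, length)));
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        records.add("=" + escape(new String(ch, start, length)));
    }

    @Override
    public void processingInstruction(String target, String data) {
        records.add("?" + target + " " + escape(data == null ? "" : data));
    }

    @Override
    public void skippedEntity(String name) {
        records.add("X" + name);
    }
}
```

To drive such a handler from a real parser, namespace processing has to be enabled, for instance with `SAXParserFactory.setNamespaceAware(true)`. A SAX parser is free to deliver one run of text as several `characters` calls, which is one reason the normalised form merges consecutive ‘-’ records.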



Copyright © 2012. All Rights Reserved.