English - Arch engine examples?

Subject: Re: Arch engine examples? [was: SP 1.1 released]
Date: 10 Jun 1996 00:00:00 GMT
From: jenglish@crl.com (Joe English)
Organization: Tagheads
Newsgroups: comp.text.sgml

On comp.text.sgml, Dan Connolly <connolly@w3.org> wrote (07 Jun 1996):
> In <9606061215.AA16600@jclark.com> James Clark <jjc@jclark.com> writes:
>  > New in this release is an architecture engine. If you build an
>  > application with SP that works with documents conforming to some DTD,
>  > the architecture engine will allow it automatically to work with any
>  > document that conforms to the architecture which has that DTD as its
>  > meta-DTD.
> [...]
> If anybody's really motivated, I have a sort of challenge:
> Take the SGML source (using Gary Houston's Snafu DTD) of the HTML 2.0
> spec at: [...]
>       http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec-19950922.tar.gz
> and generate something like the HTML output at:
>       http://www.w3.org/pub/WWW/MarkUp/html-spec/html-spec.html
> using the architecture engine.

OK...

Here's what I did:  Aiming for "-//W3C//DTD HTML 3.2//EN"
as the target architecture, I wrote an LPD to map SNAFU
onto HTML and added

    <!LINKTYPE HTML SPAPER #IMPLIED SYSTEM "SNAFU-HTML.lpd">

to the file 'html-spec.sgm', right after the <!DOCTYPE...[...]>
declaration.  (It would have been possible to modify the SNAFU DTD
instead, but LINK seemed like the easiest way to go.) I also replaced
the SNAFU SGML declaration with SP's default since snafu.decl
specifies LINK IMPLICIT NO; and I made a few changes to the HTML 3.2
DTD to make it possible for SNAFU to conform to it as an
architecture.  (Then I scrapped everything and did it all over
again,  since my first attempt was a mess :-)

You can see the results at:

    <URL: http://www.crl.com/~jenglish/arcxmp/ >

SNAFU-HTML.lpd contains the link process definition.
HTMLArch.dtd is the modified "architectural" HTML DTD.
html-spec.html is the converted output.

If you have SP 1.1 and Internet access, try this out:

    sgmlnorm -d -C http://www.crl.com/~jenglish/arcxmp/Catalog | more

to get a normalized version of the SNAFU source, and

    sgmlnorm -d -A html -C http://www.crl.com/~jenglish/arcxmp/Catalog | more

to generate the HTML version.

> I suspect there are some things (like cross reference processing)
> which are not expressable, and would require something like
> the DSSSL tree transformation stuff.

Yes.  The output document is missing some stuff:

There's no automatically-generated table of contents,
cross-reference text, or section numbers; and the output
document is not split up into multiple nodes.

Internal cross-references don't work, since the architecture engine
cannot change ID references "FOO" into URL fragment IDs "#FOO".
I added a REFID attribute to the HTML A element, so the
cross-references are actually there; they just won't be
recognized by most browsers.  (And they all use "HERE" as the
cross-reference text...)
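The missing "FOO" to "#FOO" step could be done as a separate pass over the
generated HTML. The following is only a sketch of that idea, not part of the
conversion described above: the function name refid_to_href is hypothetical,
and it assumes the REFID attribute is serialized inside a single A start-tag
with no ">" in any attribute value.

```python
import re

# Any <A ...> start-tag (assumes no ">" inside attribute values).
_A_TAG = re.compile(r'<A\b[^>]*>', re.IGNORECASE)

# REFID="FOO" or REFID=FOO; group 2 is the bare ID value.
_REFID = re.compile(r'\bREFID\s*=\s*("?)([A-Za-z][A-Za-z0-9.\-]*)\1',
                    re.IGNORECASE)


def refid_to_href(html: str) -> str:
    """Rewrite REFID attributes on A start-tags as HREF fragment links."""
    def fix_tag(tag_match):
        # Only touch text inside the matched start-tag.
        return _REFID.sub(lambda a: 'HREF="#%s"' % a.group(2),
                          tag_match.group(0))
    return _A_TAG.sub(fix_tag, html)


if __name__ == "__main__":
    print(refid_to_href('<A REFID="FOO">HERE</A>'))
```

This only fixes the link target; the cross-reference text would still read
"HERE", since generating it requires looking at the referenced element.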

All of the external data entity references vanished (this
includes most of the code listings).  I think this is due to
SGMLNORM, which does not seem to do the right thing with them.
(Even if it did, most browsers wouldn't process the entity
declarations or references anyway.)

The result document does not conform to HTML 2.0 or 3.2 (it does
conform to the architectural DTD), but it passes the "looks OK in
Mosaic" test (NCSA 2.7b4 for X).

My thoughts on the AFDR:

Mapping to architectural forms would be easier with EXPLICIT LINK.
(The result element type in the link rule would be used instead of the
ArcForm attribute.)  Indeed, the AFDR looks like it does exactly what
EXPLICIT LINK was intended to do in the first place, with a few
features added (and a considerably more awkward syntax).

LINK in general would be much more useful if context-sensitive
resolution of link rules worked as described in the annotations
in the SGML Handbook instead of what's actually specified in the
ISO text.  (There are gross inconsistencies between the two;
more on that later...)

It should be possible to specify the LPD somewhere other than
the document prolog:  either external to the document entity
altogether (so that processing could be specified without
modifying the instance or DTD), or inside the DTD (so document
types could use LINK to define conformance to architectures).
Putting <!LINKTYPE...> in the prolog runs contrary to the
principle of separating processing from data, and makes the
document unusable under systems that don't support LINK (like sgmls).
The AFDR can be used without LINK, but it's not as powerful
or convenient.

The AFDR is no better at doing arbitrary SGML-to-SGML transformations
than EXPLICIT LINK.  However, the AFDR concept itself is sound; in
particular a fully fleshed-out architectural version of HTML would be a
useful thing to have.  It would probably not take much for browsers to
support it either; the only things missing from current Web browsers
are support for ID references and local data entities.

Notes on the "architectural" HTML DTD:

HTML is much flatter than the SNAFU DTD; for example, SNAFU,
like many document types, allows bulleted lists, NOTEs, code listings,
etc., inside paragraphs, while HTML does not.  I had to modify the
target DTD to allow "deeper" content models.

SNAFU requires P elements in a lot of places that HTML does not,
and there are a lot of empty P's in the source document due to
invisible short references.  Consequently the HTML output has a
lot of extra <P>s and </P>s, and will look best in browsers that
don't automatically insert vertical whitespace for every P
start- and end-tag.  NCSA Mosaic for X 2.7b4 does a good job in
this respect, but YMMV.
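For browsers that do add whitespace for every P, the empty paragraphs could be
filtered out of the output afterwards. This is a rough sketch under the
assumption that an empty paragraph appears as a bare <P> start-tag followed
directly (or after whitespace only) by </P>; the function name
drop_empty_paragraphs is hypothetical, and P tags carrying attributes are
left alone.

```python
import re

# A <P> start-tag, optional whitespace, then the </P> end-tag.
_EMPTY_P = re.compile(r'<P>\s*</P>', re.IGNORECASE)


def drop_empty_paragraphs(html: str) -> str:
    """Remove P elements that contain nothing but whitespace."""
    return _EMPTY_P.sub('', html)


if __name__ == "__main__":
    print(drop_empty_paragraphs('<P></P><P>text</P><P>\n</P>'))
```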

There were some conflicts in the element structure at the outermost
levels between SNAFU and HTML.  In particular, the outermost SNAFU
document structure looks like:

    <!ELEMENT spaper O O (paperfront, body, index?)>
    <!ELEMENT paperfront O O
            (title & docnum? & date? & author & abstract?
             & copyrite? & thanks? & toc? & figlist?) -(%notes;)>

The TITLE, DOCNUM, DATE, and AUTHOR elements are all "metadata"
and map to elements in the HTML HEAD.  ABSTRACT and COPYRITE,
however, are "content" and should map to elements in the HTML
BODY.  Since the architecture engine cannot rearrange the
element hierarchy, I suppressed ABSTRACT and COPYRITE altogether.
A better solution for an architectural version of HTML would be
to make the distinction between "metadata" and "content" based
on attributes instead of position in the element hierarchy.

--Joe English

  jenglish@crl.com
Norman
1 January 2001