A.2 Reading DTDs

The discussion in Section 3 should be enough to let you produce your own documents but, now or in the future, you may find it useful to be able to read the DTD directly.[Note 11] Once you are familiar with the underlying ideas, the expression of them in the DTD turns out to be agreeably compact and reasonably readable.

My account of the DTD syntax will be rather compressed - see [gentle], or the other references in Section 2.2, for alternatives.

A simple HTML-like DTD could be declared as follows:

<!ELEMENT html      O O (head, body, copyright?)>
<!ELEMENT head      O O (title & link*)>
<!ELEMENT title     - - (#PCDATA)>
<!ELEMENT link      - O EMPTY>
<!ELEMENT body      O O (p | dl)+>
<!ELEMENT p         - O (#PCDATA)>
<!ELEMENT dl        - - (dt, dd)+>
<!ELEMENT (dt|dd)   - O (#PCDATA)>
<!ELEMENT copyright - - (#PCDATA)>

<!ENTITY % URL "CDATA"
        -- The term URL means a CDATA attribute
           whose value is a Uniform Resource Locator,
           See RFC1808 (June 95) and RFC1738 (Dec 94).
        -->
<!ATTLIST link
    href  %URL          #REQUIRED  -- URL for linked resource --
    rel   (next | prev) #IMPLIED   -- reverse link types --
    >
<!ENTITY amp "&">

And here is a simple document which uses this DTD:

<link href="http://www.astro.gla.ac.uk/users/norman/" rel=next>
<title>This is a title</title>

<p>And here is a paragraph
<dl>
<dt>With a delimited list
<dd>Correctly formed &amp; OK
</dl>

This displays most of the important syntactical features in an SGML DTD, so if we explain it line-by-line, it should illustrate the features you need to make some sense of most DTDs.

<!ELEMENT html    O O (head, body, copyright?)>

This `element declaration' declares the html element. The element type name is followed by a statement of whether the start and end tags may be omitted when the parser can infer their presence. Each of the two minimisation specifications may be either `-' (minus), indicating that the corresponding tag is required, or `O' (letter O), indicating that it may be omitted. Following this is the `content model' which, in this case, states that the html element must consist of one head, one body, and an optional copyright, in that order: the comma connecting the element names specifies that they must appear in order, and the question mark following the copyright element indicates that it may occur zero or one times. Omitting the start tag is possible in this case because the first element in the html element must be a head element, so whenever the parser finds a head element, it knows that the html element has begun.
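One way to see what a content model asserts is to treat it as a regular expression over the names of an element's children. The following Python sketch does this for the html content model. The helper name `children_match' and the encoding (each child name followed by one space) are assumptions made for illustration; they are not part of SGML or of any particular parser.

```python
import re

# Hypothetical encoding: each child element name is followed by one space,
# so the SGML model (head, body, copyright?) becomes an ordinary regex.
HTML_MODEL = re.compile(r"head body (?:copyright )?")


def children_match(model, children):
    """Return True if the list of child element names satisfies the model."""
    encoded = "".join(name + " " for name in children)
    return model.fullmatch(encoded) is not None


print(children_match(HTML_MODEL, ["head", "body"]))               # True
print(children_match(HTML_MODEL, ["head", "body", "copyright"]))  # True
print(children_match(HTML_MODEL, ["body", "head"]))               # False
```

The sketch only checks the order and number of children; it says nothing about tag omission, which a real SGML parser resolves before the content model is tested.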

So what is in the head element?

<!ELEMENT head    O O (title & link*)>

The head element consists of precisely one title, and zero or more link elements, in either order. The head tags can be inferred from the presence of the title and link elements, and so it is feasible for us to declare that they may be omitted. The star following the link token in the content model indicates that this element may appear zero or more times, and the ampersand declares that the elements on either side of it must both appear, but can do so in either order. Note that this content model allows `title', `title link link...' and `link link...title', but not `link' or `link title link'.
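The effect of the ampersand connector can be sketched by expanding it into every possible ordering of its operands, so that `(title & link*)' behaves like `(title, link*) | (link*, title)'. The Python below is a minimal illustration under that assumption; the names `and_group' and `head_accepts' are hypothetical, and each child name is encoded as the name followed by one space.

```python
import re
from itertools import permutations


def and_group(*tokens):
    """Expand an SGML '&' group into a regex alternation of every ordering."""
    orderings = ("".join(order) for order in permutations(tokens))
    return "(?:" + "|".join(orderings) + ")"


# (title & link*) with the hypothetical name-plus-space encoding.
HEAD_MODEL = re.compile(and_group(r"title ", r"(?:link )*"))


def head_accepts(children):
    """Return True if the child names form a valid head element."""
    encoded = "".join(name + " " for name in children)
    return HEAD_MODEL.fullmatch(encoded) is not None


print(head_accepts(["title"]))                  # True
print(head_accepts(["title", "link", "link"]))  # True
print(head_accepts(["link", "link", "title"]))  # True
print(head_accepts(["link"]))                   # False
print(head_accepts(["link", "title", "link"]))  # False
```

The last case fails because `link*' is a single token in the model: all of its occurrences must sit together on one side of the title.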

Finally we have some text:

<!ELEMENT title   - - (#PCDATA)>

The title element is very simple: neither the start nor the end tag may be omitted, and it may contain only characters (#PCDATA stands for `parsed character data') and entity references such as `&amp;'.

<!ELEMENT link    - O EMPTY>

The link element has no actual content, so it is given a content model consisting of the reserved word EMPTY. The tag omission for empty elements is always `- O'. The point of the link element is to hold its attributes, which we will come to shortly.

<!ELEMENT body    O O (p | dl)+>

The document body consists of paragraph elements, or `delimited lists'. The `or' connector, `|', indicates that either of the p or dl elements may appear, and the `plus' occurrence indicator asserts that the group (p|dl) must appear one or more times. In other words, the body consists of a sequence of p and dl elements in arbitrary order.

Finally, we start to specify the `interesting' content of the document.

<!ELEMENT dl      - - (dt, dd)+>

Like the body itself, the dl element consists of a sequence of one or more structures. Unlike the body element, however, the structure is not a list of alternatives, but a sequence. Where the body element would allow `p p dl p' for example, the dl element requires that the dt and dd elements alternate - the repeatable element is the ordered pair of elements `dt, dd'.
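The difference between a repeated list of alternatives and a repeated sequence can be made concrete with two regular expressions. This is only a sketch: the helper `accepts' is hypothetical, and each child name is again encoded as the name followed by one space.

```python
import re

# Hypothetical encoding: each child name is followed by one space.
BODY_MODEL = re.compile(r"(?:p |dl )+")  # (p | dl)+
DL_MODEL = re.compile(r"(?:dt dd )+")    # (dt, dd)+


def accepts(model, children):
    """Return True if the child names satisfy the given content model."""
    encoded = "".join(name + " " for name in children)
    return model.fullmatch(encoded) is not None


print(accepts(BODY_MODEL, ["p", "p", "dl", "p"]))     # True
print(accepts(DL_MODEL, ["dt", "dd", "dt", "dd"]))    # True
print(accepts(DL_MODEL, ["dt", "dt", "dd"]))          # False
```

In both models the plus sign demands at least one repetition, so an empty body and an empty dl are both rejected.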

<!ELEMENT (dt|dd) - O (#PCDATA)>
<!ELEMENT p       - O (#PCDATA)>
<!ELEMENT copyright - - (#PCDATA)>

The paragraph, list-item (dt and dd) and copyright elements have simple content models. Note that we can specify the structure of more than one element in the same declaration, as is done here for dt and dd.

Prior to specifying the attributes for the link element, we may declare an abbreviation.

<!ENTITY % URL "CDATA"
        -- The term URL means a CDATA attribute
           whose value is a Uniform Resource Locator,
           See RFC1808 (June 95) and RFC1738 (Dec 94).
        -->

This declares URL to be a `parameter entity', usable only within this DTD. The entity reference `%URL' will be substituted by the string `CDATA' (unparsed character data) when it is encountered. A DTD may declare an entity more than once, but any declarations after the first are silently ignored.
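The two rules just described, textual substitution and `first declaration wins', can be sketched in a few lines of Python. The function names `declare' and `expand' are hypothetical, the second declaration of URL is invented purely to show that it is ignored, and the pattern for entity names is a simplification.

```python
import re


def declare(entities, name, text):
    """Record a parameter entity; a repeated declaration is ignored."""
    entities.setdefault(name, text)


def expand(entities, text):
    """Replace %name or %name; references with the declared text."""
    return re.sub(r"%([A-Za-z][\w.-]*);?",
                  lambda match: entities[match.group(1)], text)


entities = {}
declare(entities, "URL", "CDATA")
declare(entities, "URL", "NAME")  # ignored: URL has already been declared
print(expand(entities, "href  %URL  #REQUIRED"))  # href  CDATA  #REQUIRED
```

In this sketch a reference to an undeclared entity raises a KeyError; it makes no attempt to reproduce the error reporting of a real parser.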

Note the structure of the comment in the declaration of the URL entity: in SGML, comments may appear only within markup declarations (that is, within `<! ... >'), they start and end with the string `--', and there may be more than one in a row. Thus you may legally find `<!>' within an SGML file - this is a completely empty markup declaration. Such a declaration may have a single comment within it, as in `<!-- a comment -->', or it may have several, as in `<!-- one -- -- two -- ---->', which has three comments within it, the third of which is empty.
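To make the comment rules concrete, here is a small Python sketch which pulls the comments out of a single markup declaration. The function name `comments_in' and the sample declarations are assumptions made for illustration, and the sketch ignores the possibility of `--' appearing inside a quoted literal.

```python
import re


def comments_in(declaration):
    """Return the text of each `-- ... --` comment in one markup declaration."""
    if not (declaration.startswith("<!") and declaration.endswith(">")):
        raise ValueError("not a markup declaration")
    # Strip the `<!` and `>` delimiters, then collect each `--`-bounded span.
    return re.findall(r"--(.*?)--", declaration[2:-1], flags=re.DOTALL)


print(comments_in("<!>"))                      # []
print(comments_in("<!-- one comment -->"))     # [' one comment ']
print(comments_in("<!-- a -- -- b -- ---->"))  # [' a ', ' b ', '']
```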

Now we declare the attributes for the link element.

<!ATTLIST link
    href  %URL          #REQUIRED  -- URL for linked resource --

This declares an href attribute. After expansion of the %URL entity reference, this attribute is seen to have a `declared value' of CDATA (unparsed character data), and this attribute is required to be present, so that the SGML parser will object if it finds a link element in a document without an href attribute.

The rel attribute can take only two values:

    rel   (next | prev) #IMPLIED   -- reverse link types --
    >

The link element may have the attribute `rel=next' or `rel=prev', but no other strings. Since this attribute is `#IMPLIED', it is also permitted to omit it entirely. A document may even specify this as simply `<link href="here.html" next>' and the parser will infer that the value `next' is associated with the attribute name `rel'.
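The checks that this attribute list implies can be sketched as follows. This is a simplified model, not a description of how any real parser is implemented: the table `LINK_ATTRS' and the function `resolve' are hypothetical names, and a bare value such as `next' is assigned to whichever attribute lists it among its permitted values.

```python
# Hypothetical model of the link ATTLIST: permitted values and whether required.
LINK_ATTRS = {
    "href": {"values": None, "required": True},              # CDATA #REQUIRED
    "rel": {"values": ("next", "prev"), "required": False},  # (next | prev) #IMPLIED
}


def resolve(attlist, named, bare=()):
    """Validate attributes; `bare` holds values given without a name."""
    result = dict(named)
    for value in bare:
        owners = [name for name, decl in attlist.items()
                  if decl["values"] and value in decl["values"]]
        if len(owners) != 1:
            raise ValueError(f"cannot infer an attribute name for {value!r}")
        result[owners[0]] = value
    for name, value in result.items():
        if name not in attlist:
            raise ValueError(f"undeclared attribute {name!r}")
        allowed = attlist[name]["values"]
        if allowed and value not in allowed:
            raise ValueError(f"{value!r} is not allowed for {name!r}")
    for name, decl in attlist.items():
        if decl["required"] and name not in result:
            raise ValueError(f"required attribute {name!r} is missing")
    return result


print(resolve(LINK_ATTRS, {"href": "here.html"}, bare=["next"]))
# {'href': 'here.html', 'rel': 'next'}
```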

<!ENTITY amp "&">

Entity references (other than parameter entities, which are internal to a DTD) are made using a construction such as `&entname;'. This presents a problem if you want to include the ampersand in your text, but this declaration sets up an entity called `amp', which can be used to include an ampersand in text by typing `&amp;'. You can use the same mechanism in your own documents to create shorthand forms for bits of text you don't want to retype.
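The substitution of general entity references can be sketched in the same spirit. In the Python below, the `amp' entry mirrors the declaration above, while `snsn' is a hypothetical shorthand invented for this example; the function name `expand_general' is likewise an assumption, and the pattern for entity names is a simplification.

```python
import re

# General entities as a DTD might declare them; `snsn` is hypothetical.
GENERAL_ENTITIES = {
    "amp": "&",
    "snsn": "Starlink System Note",
}


def expand_general(text, entities=GENERAL_ENTITIES):
    """Replace each &name; reference once; unknown names are left untouched."""
    def replace(match):
        return entities.get(match.group(1), match.group(0))
    return re.sub(r"&([A-Za-z][\w.-]*);", replace, text)


print(expand_general("Correctly formed &amp; OK"))  # Correctly formed & OK
print(expand_general("&snsn; 70"))                  # Starlink System Note 70
```

Leaving an unknown reference untouched is a choice made for this sketch only; it is not a statement about how an SGML parser treats an undeclared entity.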

The Starlink SGML Set
Starlink System Note 70
Norman Gray, Mark Taylor
21 April 1999. Release DR-0.7-13. Last updated 24 August 2001