Vocabularies in the Virtual Observatory, v0.01

IVOA Working Draft, 2007 December 6 [DRAFT $Revision: 22 $]

Working Group
Semantics
This version
http://www.ivoa.net/Documents/ivoa-thesaurus-0.01
Latest version
http://www.ivoa.net/Documents/ivoa-thesaurus-0.01
Editors
TBD
Authors
Alasdair J G Gray, Norman Gray, Frederic V Hessman and Andrea Preite Martinez

Abstract

As the astronomical information processed within the Virtual Observatory becomes more complex, there is an increasing need for a more formal means of identifying quantities, concepts, and processes not confined to things easily placed in a FITS image, or expressed in a catalogue or a table. We propose that the IVOA adopt a standard format for vocabularies based on the W3C's Resource Description Framework (RDF) and Simple Knowledge Organization System (SKOS). By adopting a standard and simple format, the IVOA will permit different groups to create and maintain their own specialized vocabularies while letting the rest of the astronomical community access, use, and combine them. The use of current, open standards ensures that VO applications will be able to tap into resources of the growing semantic web. Several examples of useful astronomical vocabularies are provided, including work on a common IVOA thesaurus intended to provide a semantic common base for VO applications.

Status of this document

This is an IVOA Working Draft. The first release of this document was 2007 December 6.

This document is an IVOA Working Draft for review by IVOA members and other interested parties. It is a draft document and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use IVOA Working Drafts as reference materials or to cite them as other than work in progress.

A list of current IVOA Recommendations and other technical documents can be found at http://www.ivoa.net/Documents/.

Acknowledgments

We would like to thank the members of the IVOA semantic working group for many interesting ideas and fruitful discussions.


1 Introduction

1.1 Vocabularies in astronomy

Astronomical information of relevance to the Virtual Observatory (VO) is not confined to quantities easily expressed in a catalogue or a table. Fairly simple things such as position on the sky, brightness in some units, times measured in some frame, redshifts, classifications or other similar quantities are easily manipulated and stored in VOTables and can now be identified using IVOA UCDs [std:ucd]. However, astrophysical concepts and quantities use a wide variety of names, identifications, classifications and associations, most of which cannot be described or labelled via UCDs.

Formally, there are a number of basic forms of organised semantic knowledge of potential use to the VO, ranging from informal at one extreme, to very formal and highly structured at the other. I think this list covers definitions covered more naturally in the text below it -- omissible? [NG]

The term folksonomy has emerged in the last few years, to describe what would in other circumstances be described as an uncontrolled keyword list. The new term, and the substantial recent interest in it, is a consequence of the realisation that even such a simple mechanism can in certain circumstances (well-known examples are the Flickr and del.icio.us social services) add substantial value to a set of resources.

More formal definitions are presented later in this document. In the present document, we will not need to distinguish between controlled vocabularies, taxonomies and thesauri, and so we will use the term vocabulary to represent all three cases.

There has been some progress towards creating an ontology of astronomical object types [std:ivoa-astro-onto]; however, such a formal approach may not be necessary, and may be counterproductive if the increased complication makes a system hard to use [AG Not sure counterproductive is the right argument here. Ontologies do not meet all of the navigation and retrieval use cases.]. An ontology is necessary if we are to have a computer (appear to) `understand' something of a domain, but in the present case we are more concerned with the related but distinct problem of letting human users find resources of interest. The most appropriate technology for that problem derives from the Information Science community: that of controlled vocabularies, taxonomies and thesauri.

One of the best examples of the need for a simple vocabulary within the VO is VOEvent [std:voevent], the VO standard for handling astronomical events. If someone broadcasts, or `publishes', the occurrence of an event, the implication is that someone else is going to want to respond to it. No institution is interested in all possible events, however, so some standardised information about what the event `is about' is necessary, in a form which ensures that the parties can communicate effectively. If a `burst' is announced, is it a Gamma Ray Burst due to the collapse of a star in a distant galaxy, a solar flare, or the brightening of a stellar or AGN accretion disk? If a publisher doesn't use the label one might have expected, how is one to guess what other equivalent labels might have been used?

There have been a number of attempts to create astronomical vocabularies.

1.2 Formalising and managing multiple vocabularies

We find ourselves in the situation where there are multiple vocabularies in use, describing a broad range of resources of interest to professional and amateur astronomers, and members of the public. These different vocabularies use different terms and different relationships to support the different constituencies they cater for. For example, delta Sct and RR Lyr are terms one would hope to find in a vocabulary aimed at professional astronomers, associated with the notion of variable star; however one would hope not to find such technical terms in a vocabulary intended to support outreach activities.

One approach to this problem is to create a single consensus vocabulary, which draws terms from the various existing vocabularies to create a new vocabulary which is able to express anything its users might desire. The problem with this is that such an effort would be very expensive, both for those creating it, in time and effort, and for its potential users, who would have to learn to navigate around it and recognise the new terms, and who would have to be supported in using the new terms correctly (or, more often, incorrectly).

The alternative approach to the problem is to evade it, and this is the approach taken in this document. Rather than deprecating the existence of multiple overlapping vocabularies, we embrace it, formalise all of them, and formally declare the relationships between them. This means that:

Illustrating the power of this peer-to-peer approach, we include as appendices to this proposal formalised versions of a number of existing vocabularies, encoded as SKOS vocabularies [std:skoscore]. I don't think we're just illustrating it here, but producing a specification [NG]

2 SKOS-based vocabularies

2.1 Selection of the vocabulary format

After extensive online and face-to-face discussions, the authors have brokered a consensus within the IVOA community that formalised vocabularies should be published at least in SKOS (Simple Knowledge Organization System) format, a W3C draft standard application of RDF to the field of knowledge organisation [std:skoscore]. SKOS draws on long experience within the Library and Information Science community to address a well-defined set of problems to do with the indexing and retrieval of information and resources; as such, it is a close match to the problem this working group is addressing.

ISO 5964 [std:iso5964] defines a number of the relevant terms (ISO 5964:1985=BS 6723:1985; see also [std:bs8723-1] and [std:z39.19]), and some of the (lightweight) theoretical background. The only technical distinction relevant to this document is that between `vocabulary' and `thesaurus': BS-8723-1 defines a thesaurus as a

controlled vocabulary in which concepts are represented by preferred terms, formally organized so that paradigmatic relationships between the concepts are made explicit, and the preferred terms are accompanied by lead-in entries for synonyms or quasi-synonyms. NOTE: The purpose of a thesaurus is to guide both the indexer and the searcher to select the same preferred term or combination of preferred terms to represent a given subject. (BS-8723-1, sect. 2.39)

with a similar definition in ISO-5964 sect. 3.16. The paradigmatic relationships in question are those relating a term to a broader, narrower or more generically related term, with an operational definition of broader term which is such that a resource retrieved by a given term will also be retrieved by that term's broader term. This is not a subsumption relationship, as there is no implication that the concept referred to by a narrower term is of the same type as a broader term.

Thus a vocabulary (SKOS or otherwise) is not an ontology. It has lighter and looser semantics than an ontology, and is specialised for the restricted case of resource retrieval. Those interested in ontological analyses can easily transfer the vocabulary relationship information from SKOS to a formal ontological format such as OWL [std:owl].
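The operational definition of a broader term given above (a resource retrieved by a term is also retrieved by that term's broader term) can be illustrated with a small sketch. It is not part of the proposal: the concept names, the `BROADER` table and the `INDEX` of resources are hypothetical, and the sketch assumes that broader links are followed transitively, which an application may or may not want.

```python
# Hypothetical broader-term links: narrower concept -> its broader concepts.
BROADER = {
    "RRLyr": ["variableStar"],
    "deltaSct": ["variableStar"],
    "variableStar": ["star"],
}

# Hypothetical index: concept -> resources labelled directly with that concept.
INDEX = {
    "RRLyr": {"paperA"},
    "deltaSct": {"paperB"},
    "variableStar": {"catalogueC"},
    "star": {"atlasD"},
}


def narrower_closure(concept):
    """Return the concept plus every concept that lists it as an ancestor."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for child, parents in BROADER.items():
            if current in parents and child not in found:
                found.add(child)
                frontier.append(child)
    return found


def retrieve(concept):
    """Resources indexed under the concept or under any narrower concept."""
    results = set()
    for name in narrower_closure(concept):
        results |= INDEX.get(name, set())
    return results
```

With these tables, everything returned by `retrieve("RRLyr")` is also returned by `retrieve("variableStar")`. Nothing in the code claims that an RR Lyr star "is a" variable star; the links are used for retrieval only, which is the distinction from subsumption drawn above.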

What is to be the format of the `master' files? SKOS or mildly-formatted plain text?[NG] By definition, this will be left up to the publishers! All we need to see is SKOS. [FVH] Open issue.

2.2 Content and format of a SKOS vocabulary

A published vocabulary in SKOS format consists of a list of entries. The examples below are shown in the Turtle notation for RDF [std:turtle], which is similar to the more informal N3 notation (e.g. N3) [FVH]; Turtle is probably more standard, and clearer as an example [NG]. Each entry should contain the following elements:

2.3 Additional SKOS vocabulary resources

In addition to the vocabulary itself, other resources can be provided to help users identify the structure and contents:

2.4 Suggested good practices

As long as the vocabularies conform to the standard RDF, SKOS and other syntaxes, there is nothing keeping a VO application from using the vocabulary to support the human user and to enable new connections between different sources of information. However, we have identified a set of 10 Commandments which, if followed, will make the creation, management, and use of the vocabularies within the VO much simpler and more effective (several of these are open issues [NG]):

  1. The SKOS documents defining the vocabulary should be published at a long-term accessible URI and should be mirrored at a central IVOA vocabulary repository.
  2. Each version of the vocabulary should be indicated within the name (e.g. "MyFavoriteVocabulary-v3.14") and previous versions should continue to be available even after having been superseded by newer versions; published vocabulary updates should be infrequent and individual changes should be documented, e.g. by <skos:changeNote>.
  3. To ensure that the parsing of the vocabulary tokens by various and sometimes limited computer programs is trivial, tokens should consist only of the letters a-z, A-Z, and numbers 0-9, i.e. no spaces, no exotic letters (e.g. umlauts), and no characters which would make a token inexpressible as part of a URI; since tokens are for use by computers only, this is not a big restriction - the exotic letters can be used within the labels and documentation if appropriate.
  4. Token names should be kept in human-readable form, directly reflect the implied meaning, and not be semi-random IDs only (e.g. spiralGalaxy, not "t1234567"); tokens should preferably be created via a direct conversion from the preferred label via removal/translation of non-token characters (see above) and sub-token separation via capitalization of the first sub-token character (e.g. the label "My favorite idea-label #42" is converted into "MyFavoriteIdeaLabel42"). Open issue
  5. Tokens and labels should be singular unless based on previously determined sources where the conversion to singular forms would impair the usefulness of the vocabulary (e.g. spiral galaxy, not "spiral galaxies"). Open issue
  6. Each entry should have one or more definitions (<skos:definition>) with a clear language localization (e.g. lang="fr" for French) that constitutes a short description of the concept which could be adopted by an application using the vocabulary; the use of additional documentation in standard SKOS or Dublin Core format (see above) is encouraged.
  7. Thesaurus entries (broader, narrower, related) are encouraged, but not required; if used, they should be complete (e.g. all broader links have corresponding narrower links in the referenced entries and related entries link each other).
  8. TopConcept entries (see above) should normally be those not having a broader reference (i.e. not at a sub-ordinate position in a thesaurus hierarchy) but should at least include all entries linked by the broader reference of subordinate entry; separate TopConcept documents should be indicated by concatenating "-topConcepts" to the vocabulary name (e.g. MyFavoriteVocabulary-v3.14-topConcepts.xml). IS THIS OK? HOW ELSE ARE WE TO FIND THE SEPARATE TopConcept FILES? [FVH]
  9. The publishers of vocabularies should provide on-line documentation permitting the easy perusal of labels, tokens, definitions, and other documentation; ideally, the namespace of the vocabulary should be identical with the documentation location and each token should correspond to an internal anchor (e.g. "http://www.MyInstitute.org/vocabularies/MyFavoriteVocabulary-v3.14#spiralGalaxy" is a direct link to the documentation about the entry spiral galaxy in the vocabulary "MyFavoriteVocabulary-v3.14"). IS THIS OK? WE DISCUSSED SOMETHING LIKE THIS SOMETIME A WHILE AGO, BUT NEVER PUT IT SO BLUNTLY. PRACTICABLE? [FVH]
  10. Publishers are encouraged to publish mappings between their vocabularies and other commonly used vocabularies; if external to the defining vocabulary documents, these documents should be indicated by concatenating "-mapping" to the vocabulary name (e.g. MyFavoriteVocabulary-v3.14-mapping.xml). OK? [FVH]
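The completeness requirement in item 7 (every broader link has a matching narrower link, and related links point both ways) lends itself to a mechanical check. The following is a minimal sketch under stated assumptions: the dictionary layout and the function name `find_incomplete_links` are hypothetical, and a real tool would first load the links from the published SKOS document.

```python
def find_incomplete_links(concepts):
    """Return a sorted list of (source, relation, target) links lacking an inverse.

    `concepts` maps a token to a dict that may hold "broader", "narrower"
    and "related" lists of tokens (a hypothetical in-memory layout).
    """
    inverse = {"broader": "narrower", "narrower": "broader", "related": "related"}
    missing = []
    for source, entry in concepts.items():
        for relation, back in inverse.items():
            for target in entry.get(relation, []):
                # The target must exist and must point back at the source.
                if source not in concepts.get(target, {}).get(back, []):
                    missing.append((source, relation, target))
    return sorted(missing)
```

An empty result means that the thesaurus links are complete in the sense of item 7; each reported triple names a link whose inverse should be added to the target entry.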

These suggestions are by no means trivial – there was considerable discussion within the semantic working group on many of these topics, particularly about token formats (some wanted lower-case only), and singular versus plural forms of the labels (different traditions exist within the international library science community). Obviously, no publisher of an astronomical vocabulary has to adopt these rules, but the adoption of these rules will make it easier to use the vocabulary in external generic VO applications.
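The label-to-token conversion suggested in items 3 and 4 can be sketched as follows. This is one possible reading, not a normative algorithm: the function names are hypothetical, accented letters are reduced to their base letters (an assumption about what "translation" of exotic letters means), and the case of characters after the first in each sub-token is left untouched, a point the suggestions leave open.

```python
import re
import unicodedata


def label_to_token(label):
    """Convert a preferred label into a token of ASCII letters and digits."""
    # Decompose accented letters and drop the non-ASCII marks ("ö" -> "o").
    ascii_label = (
        unicodedata.normalize("NFKD", label)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Split into sub-tokens on any run of characters outside a-z, A-Z, 0-9.
    sub_tokens = [s for s in re.split(r"[^A-Za-z0-9]+", ascii_label) if s]
    # Capitalise only the first character of each sub-token.
    return "".join(s[0].upper() + s[1:] for s in sub_tokens)


def is_valid_token(token):
    """True if the token is non-empty and uses only a-z, A-Z and 0-9."""
    return re.fullmatch(r"[A-Za-z0-9]+", token) is not None
```

For the label used in item 4, `label_to_token("My favorite idea-label #42")` returns `"MyFavoriteIdeaLabel42"`. A publisher preferring lower-case initial letters, as in spiralGalaxy, would need a small variation of the last step.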

3 Example vocabularies

The intent of having the IVOA adopt SKOS as the preferred format for astronomical vocabularies is to encourage the creation and management of diverse vocabularies by competent astronomical groups, so that users of the VO and related resources can benefit directly and dynamically without the intervention of an IAU or IVOA bureaucracy or committee. However, we felt it important to provide several examples of vocabularies in SKOS format as part of the proposal, both to illustrate how simple and powerful the concept is, and to provide an immediate vocabulary basis for VO applications.

We provide a set of SKOS files representing the vocabularies which have been developed, and mappings between them. These can be downloaded at the URL

http://www.ivoa.net/Documents/ivoa-thesaurus-0.01/dist-XXX.tar.gz

To be expanded: there are no mappings at the moment. Also, the vocabularies are all in a single language, though translations of the IAU93 thesaurus are available.

3.1 A Constellation Name Vocabulary (normative)

This vocabulary is presented as a simple example of an astronomical vocabulary for a very particular purpose, e.g. handling constellation information like that commonly encountered in variable star research. For example, SS Cygni is a cataclysmic variable located in the constellation Cygnus. The name of the star uses the genitive form Cygni, but the alternate label SS Cyg uses the standard abbreviation Cyg. Given the constellation vocabulary, all of these forms are recorded together in a computer-manipulatable format.

The <skos:ConceptScheme> contains a single <skos:TopConcept>, constellation:

	<skos:Concept rdf:about="#constellation">
		<skos:inScheme rdf:resource=""/>
		<skos:prefLabel>constellation</skos:prefLabel>
		<skos:definition>IAU-sanctioned constellation names</skos:definition>
		<skos:narrower rdf:resource="#Andromeda"/>
		...
		<skos:narrower rdf:resource="#Vulpecula"/>
	</skos:Concept>

The same entry in Turtle form, for illustration, with the SKOS namespace as the default:

<#constellation> a :Concept;
    :inScheme <>;
    :prefLabel "constellation";
    :definition "IAU-sanctioned constellation names";
    :narrower <#Andromeda>;
    ...
    :narrower <#Vulpecula>.

and the entry for Cygnus is

	<skos:Concept rdf:about="#Cygnus">
		<skos:inScheme rdf:resource=""/>
		<skos:prefLabel>Cygnus</skos:prefLabel>
		<skos:definition>Cygnus</skos:definition>
		<skos:altLabel>Cygni</skos:altLabel>
		<skos:altLabel>Cyg</skos:altLabel>
		<skos:broader rdf:resource="#constellation"/>
		<skos:scopeNote>prefLabel is nominative form; altLabels are the genitive and short forms</skos:scopeNote>
	</skos:Concept>

Note that SKOS alone does not permit the distinct differentiation of genitive forms and abbreviations, but the use of alternate labels is more than adequate for processing by VO applications, where the difference between SS Cygni, SS Cyg, and the incorrect form SS Cygnus is probably irrelevant.
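As an illustration of how an application might use these labels, the sketch below resolves a star name to its constellation concept regardless of which label form appears. It rests on assumptions: the `CONCEPTS` table is a hand-written excerpt rather than the parsed vocabulary, the function names are hypothetical, and the constellation is assumed to be the last word of the name.

```python
# Hand-written excerpt of the constellation vocabulary: token -> labels.
CONCEPTS = {
    "Cygnus": {"prefLabel": "Cygnus", "altLabels": ["Cygni", "Cyg"]},
    "Andromeda": {"prefLabel": "Andromeda", "altLabels": ["Andromedae", "And"]},
}


def build_label_index(concepts):
    """Map every preferred and alternate label (lower-cased) to its token."""
    index = {}
    for token, entry in concepts.items():
        for label in [entry["prefLabel"], *entry["altLabels"]]:
            index[label.lower()] = token
    return index


def constellation_of(star_name, index):
    """Return the constellation token for a name such as 'SS Cyg', or None."""
    # Assumption: the constellation label is the last whitespace-separated word.
    words = star_name.split()
    if not words:
        return None
    return index.get(words[-1].lower())
```

Because preferred and alternate labels share one index, SS Cygni, SS Cyg and SS Cygnus all resolve to the same concept, which is the behaviour described above.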

3.2 The 1993 IAU Thesaurus (normative)

The IAU Thesaurus consists of concepts with mostly capitalized labels and a rich set of thesaurus relationships (BF for "broader form", NF for "narrower form", and RF for "related form"). In addition, the non-SKOS thesaurus relationships U (for "use") and UF ("use for") link fundamental entries to separate entries which are really just alternative labels (mostly indicated by non-capitalized label names). In a separate document, the equivalents are given in five languages: English, French, German, Italian, and Spanish. Enumerable concepts are plural (e.g. SPIRAL GALAXIES) and non-enumerable concepts are singular (e.g. STABILITY). Finally, there are some usage hints like combine with other

In converting the IAU Thesaurus to SKOS, we have used the original English labels as preferred labels (e.g. SPIRAL GALAXIES) and the labels in other languages have been included as alternate labels. The U and UF links have been converted to alternative labels, reducing the total number of entries from X to Y. The tokens have been created using the "4th Commandment", i.e. deletion of spaces and capitalization of the first letter in sub-tokens only (e.g. SpiralGalaxies).
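The folding of U entries into alternative labels can be sketched as below. This is an illustration only, not the conversion actually used: the input layout, the function name `fold_use_entries` and the sample lead-in entry used in the usage note are hypothetical.

```python
def fold_use_entries(entries):
    """Fold 'U' (use) entries into the altLabels of the entries they point to.

    `entries` maps a label to a dict; a lead-in entry carries a "U" key naming
    the entry to use instead, and any other entry is kept as a concept.
    """
    concepts = {
        label: {"prefLabel": label, "altLabels": []}
        for label, entry in entries.items()
        if "U" not in entry
    }
    for label, entry in entries.items():
        target = entry.get("U")
        if target is None:
            continue
        if target not in concepts:
            raise KeyError(f"'{label}' points to unknown entry '{target}'")
        # The lead-in entry disappears and survives only as an alternate label.
        concepts[target]["altLabels"].append(label)
    return concepts
```

Given three input entries, one of which is a lead-in such as a hypothetical "Spirals" with U pointing to SPIRAL GALAXIES, the result holds two concepts, which is how the conversion reduces the number of entries.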

3.3 The Astronomy & Astrophysics Keyword List (normative)

SHORT DESCRIPTION HERE

3.4 The AOIM Taxonomy (normative)

SHORT DESCRIPTION HERE

3.5 The UCD1+ Vocabulary (non-normative)

The UCD standard is currently the only officially sanctioned and managed vocabulary for the IVOA. The normative document is a simple text file containing entries consisting of tokens (e.g. em.IR), a short description, and usage information (syntax codes which permit UCD tokens to be concatenated). The form of the tokens implies a natural hierarchy: em.IR.8-15um is obviously a narrower term than em.IR, which in turn is narrower than em.
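The hierarchy implied by the token form can be made explicit with a few lines of code. The sketch below assumes, as the examples above suggest, that the broader term of a UCD word is obtained by dropping its last period-separated component; the function names are hypothetical.

```python
def broader_ucd(token):
    """Return the implied broader UCD word, or None for a top-level word."""
    head, sep, _tail = token.rpartition(".")
    # No period means there is nothing broader to derive.
    return head if sep else None


def ucd_ancestors(token):
    """Return all implied broader words, nearest first."""
    chain = []
    parent = broader_ucd(token)
    while parent is not None:
        chain.append(parent)
        parent = broader_ucd(parent)
    return chain
```

For example, `ucd_ancestors("em.IR.8-15um")` returns `["em.IR", "em"]`, matching the chain described above. Whether every derived word actually exists in the UCD list is not checked here.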

Given the structure of the UCD1+ vocabulary, the natural translation to SKOS consists of preferred labels equal to the original tokens (the UCD1 words include dashes and periods), vocabulary tokens created using the "4th Commandment" (e.g. "emIR815Um" for em.IR.8-15um), direct use of the definitions, and the syntax codes placed in usage documentation: <skos:scopeNote>UCD syntax code: P</skos:scopeNote>. NOTE: THIS IS THE FORMAT I USED IN MY VERSION - MAY NOT BE THE SAME AS NORMAN'S [FVH]

Note that the SKOS document containing the UCD1+ vocabulary does NOT constitute the official version: the normative document is still the text list. However, in the long term, the IVOA may decide to make the SKOS version normative, since the SKOS version contains all of the information contained in the original text document but has the advantage of being in a standard format easily read and used by any application on the semantic web.

3.6 The proposed IVOA Thesaurus

While it is true that the adoption of SKOS will make it easy to publish and access different astronomical vocabularies, the fact is that there is no vocabulary which makes it easy to jump-start the use of vocabularies in generic astrophysical VO applications: each of the previously developed vocabularies has its own limits and biases. For example, the IAU Thesaurus provides a large number of entries, copious relationships, and translations to four other languages, but there are no definitions, many concepts are now only useful for historical purposes (e.g. many photographic or historical instrument entries), some of the relationships are false or outdated, and many important or newer concepts and their common abbreviations are missing.

Despite its faults, the IAU Thesaurus constitutes a very extensive vocabulary which could easily serve as the basis vocabulary once we have removed its most egregious faults and extended it to cover the most obvious semantic holes. To this end, a heavily revised IAU thesaurus is in preparation for use within the IVOA and other astronomical contexts. The goal is to provide a general vocabulary foundation to which other, more specialized, vocabularies can be added as needed, and to provide a good lingua franca for the creation of vocabulary mappings.

Appendices

Bibliography

[baader04] Franz Baader, Ian Horrocks, and Ulrike Sattler.
Description logics. In Steffen Staab and Rudi Studer, editors, Handbook on Ontologies, International Handbooks on Information Systems, chapter 1, pages 3-28. Springer Verlag, 2004.
[gruber93] T Gruber.
A translation approach to portable ontology specification. Knowledge Acquisition, 5 no. 2 pp. 199-220, 1993.
[lortet94] M-C Lortet, S Borde, and F Ochsenbein.
Second reference dictionary of the nomenclature of celestial objects. Astron. Ap. Supp., 107 pp. 193-218, 1994. [Online].
[lortet94a] M-C Lortet, S Borde, and F Ochsenbein.
The second reference dictionary of the nomenclature of celestial objects (solar system excluded). Volumes I, II. Technical Report 24, Centre de Données astronomiques de Strasbourg, 1994. [Online].
[preitemartinez07] Andrea Preite Martinez and Soizick Lesteven.
Astronomical keywords in the era of the virtual observatory. IVOA Note, IVOA, 2007. [Online].
[shobbrook92] R.M. Shobbrook and R.R. Shobbrook.
The IAU thesaurus for improved on-line access to information. Proc. Astron. Soc. of Australia, 10 pp. 134, 1992. [Online].
[std:bs8723-1] Structured vocabularies for information retrieval - guide - definitions, symbols and abbreviations (BS 8723-1:2005).
British Standard, 2005.
[std:iso5964] Documentation - guidelines for the establishment and development of multilingual thesauri (ISO 5964:1985=BS 6723:1985).
International Standard, 1985.
[std:ivoa-astro-onto] L. Cambrésy, S. Derriere, P. Padovani, A. Preite Martinez, and A. Richard.
Ontology of astronomical object types. IVOA Working Draft, 2007. [Online].
[std:owl] World Wide Web Consortium.
The web ontology language. [Online].
[std:rtml] Remote telescope markup language.
Web page. [Online].
[std:skoscore] Alistair Miles and Dan Brickley.
SKOS core guide. W3C Working Draft, Nov. 2005. [Online].
[std:turtle] Dave Beckett.
Turtle - terse RDF triple language. Draft Standard, Nov. 2007. [Online].
[std:ucd] Sébastien Derriere, Norman Gray, Robert Mann, Andrea Preite Martinez, Jonathan McDowell, Thomas McGlynn, François Ochsenbein, Pedro Osuna, Guy Rixon, and Roy Williams.
UCD (Unified Content Descriptor) - moving to UCD1+. [Online, cited July 2005].
[std:voevent] Sky event reporting metadata (voevent).
IVOA Recommendation, 2006. [Online].
[std:z39.19] Guidelines for the construction, format and management of monolingual thesauri (ANSI/NISO Z39.19-1993=ISO 2788:1986).
American National Standard, 1993.

$Revision: 22 $ $Date: 2007-12-20 19:02:44 +0000 (Thu, 20 Dec 2007) $