[EDITORS' DRAFT: bfd758746561, 2010-11-25T17:39:07+00:00]
We highlight the lack of any consensus practice, in astronomy, for minting and maintaining URIs intended to be used for long-term naming purposes. In the current draft, we (i) highlight the current problem and suggest its importance, and (ii) indicate possible solutions, without yet recommending any solution in particular. This is intended to be a discussion document, rather than a set of conclusions.
This is an IVOA Note.
The first release of this document was YYYY Month DD.
This is an IVOA Note expressing suggestions from and opinions of the authors. It is intended to share best practices, possible approaches, or other perspectives on interoperability with the Virtual Observatory. It should not be referenced or otherwise interpreted as a standard specification.
A list of current IVOA Recommendations and other technical documents can be found at http://www.ivoa.net/Documents/.
Thanks to Sébastien Derriere and Rob Seaman for comments on an earlier draft of this document.
There is a present need to articulate a set of best practices for URIs of use in astronomy, which are expected to be usable, useful, and used, in the long term.
[This document should eventually appear as ... what? An IVOA Note would be obvious, but other avenues are possible. I very much welcome comments and collaborators on this.]
In this note, we discuss:
Only URIs – Uniform Resource Identifiers – are formally defined, by RFC 3986, and in principle they replace the original notions of URLs and URNs – Uniform Resource Locators and Names. A URI is in principle only the name of something, and does not have to be retrievable (or ‘resolvable’ or ‘dereferenceable’). Where we wish to stress that a URI may actually be dereferenced and the result processed, it is most natural to refer to it as a URL, even though the two terms are interchangeable. It is an established W3C recommendation that, even when a URI is being used only in its role as an identifier, it is still useful and prudent for it to return useful information when it is dereferenced.
In the beginning was the web, and on the second day there was the first 404 broken link. URLs have been breaking since the beginning of the web. Whether this is a problem depends on how important the information is, and on how distant, in space and in time, the reader of the information is from its author.
More and more, and especially in technical contexts, URLs are becoming the primary reference for a standard or other technical document. Organisations such as the W3C have careful procedures to generate URLs in a maintainable way, and to avoid URLs breaking or their contents changing.
In particular, the Semantic Web (SW) uses URIs extensively and consistently, both as formal names for objects and concepts and, when they are dereferenced and their contents retrieved, as sources of machine-readable information. These uses are most explicit in the notion of ‘Linked Data’ (see the Wikipedia article and references there). Such SW practices will diffuse through the technical and scientific communities, and software will come to depend on the integrity and resolvability of URIs to a greater and more pervasive extent than hitherto. Since there is no technical mechanism which prevents URIs from breaking (and the absence of such a mechanism is arguably one of the prime socio-technical reasons for the web’s initial success), such long-term preservation must be managed at a higher level.
This document therefore addresses the specific question: what prescriptions should the astronomy community make, concerning the domain, structure, and management of the URLs associated with astronomical conventions, to best preserve them for the long term?
This document is addressed to those entities who have an interest in the long-term availability and intelligibility of astronomical data. The general problem is not unique to this field, of course, but astronomy has traditionally taken a long view with respect to its data.
In particular, the document may be of interest to space agencies, archives and policy bodies such as the IVOA or the IAU.
There are a number of things which currently have long-term names in astronomy. These include the following.
http://www.ivoa.net/rdf. This would be the obvious location for further vocabularies, including vocabularies currently being developed as part of the SimDB process.
Above, we describe the domains where new URIs would ‘most naturally’ be minted; the purpose of this document is to question whether this is where such important URIs should be minted in fact.
The timescale of interest in this note is that on which the
presence or internal structure of domains
u-strasbg.fr will change.
One obvious source of change is that these entities cease to exist.
Government departments and agencies generally change their identities slowly (US government departments seem very persistent, whereas UK departments, for example, can change their names and remits on prime-ministerial whim), but their internal structures can change on shorter timescales – GSFC could potentially change its responsibilities without major upheaval to NASA. Such agencies will have archives, but these will in some circumstances concentrate on the (possibly mandated) preservation of core ‘business’ records, rather than being tasked and funded to support the interests of the astronomical community worldwide.
Non-governmental agencies such as the IVOA are potentially more secure than governmental ones: since they have no budget they are almost free to run, and can focus on a tightly-defined mission. The threat to names created by such agencies is that their missions, however tightly defined initially, may drift, or they may become absorbed into another organisation, such as the IAU.
Universities are long-lived entities – indeed numerous universities
have outlived the countries which founded them – but they have
historically achieved this longevity by numerous reorganisations,
refoundations, and accommodations with the local civil powers. Also,
while universities’ staff may be concerned with the preservation of
eternal truth (according to the fashion of the century), the
institutions themselves tend to have rather hard-headed
self-interests. As an example, Strasbourg’s
which had been the home of the SIMBAD and Vizier databases almost
since the web began, was in 2009 changed to
unistra.fr as a result of a university merger, and
gla.ac.uk was in 2010
whimsically ‘rebranded’ to
Separate from the simple preservation of URLs is the threat which comes from the continuing development of standards and names. A standard, or one of the names it includes, may start off in one organisation and, by a process of development, wider adoption, or formalisation, end up being associated with another. This is illustrated by the progress of the FITS standard over two decades, which started as a multilateral agreement, became a NASA standard, and was then more formally re-standardised with IAU blessing.
Some vocabularies will start off in one place, and then become someone else’s responsibility. Although existing astronomy vocabularies have been created through a formal process, within the IVOA, it seems likely that future vocabularies will develop in an analogous way to the FITS standard, starting off as experiments, and becoming more formal when, or if, they are more broadly used.
The upper limit of our preservation hopes should probably be taken to be of order 100 years. Though we should aim for solutions which have the potential to last longer (and 300 years does not sound unreasonable), aiming for a millennium seems optimistic – it is down to the archivists to preserve what they can past the apocalypse.
The principal difficulties that must be addressed are:
The first two difficulties are to some extent in opposition: a solution based on HTTP, for example, is obviously usable, but would seem to be vulnerable to obsolescence. This is probably not as big a problem as it might at first appear. HTTP is so pervasive now that its eventual replacement will have to seamlessly absorb its addresses if it is to be at all viable, and a well-designed HTTP URL scheme would be mechanically translatable to a successor. On this topic, see David Booth’s remarks on going from URNs to HTTP URLs: these are concerned with going in the opposite direction, but are usefully applicable. Also ARKs, as discussed below, are explicit in their concern to be useful in both the HTTP and non-HTTP worlds.
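As a rough illustration of what ‘mechanically translatable’ could mean in practice, the following Python sketch rewrites an HTTP URL into a successor scheme by changing only the scheme component and leaving the authority, path, query and fragment untouched. The scheme name `newproto` and the function names are purely hypothetical; no such successor protocol exists, and a real migration might well involve more than a change of scheme.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical successor scheme; an assumption for illustration only.
SUCCESSOR_SCHEME = "newproto"

def to_successor(uri: str, scheme: str = SUCCESSOR_SCHEME) -> str:
    """Rewrite an http(s) URI into the successor scheme, component by component."""
    parts = urlsplit(uri)
    if parts.scheme not in ("http", "https"):
        raise ValueError("not an HTTP URI: " + uri)
    # Only the scheme changes; every other component is carried over as-is.
    return urlunsplit((scheme, parts.netloc, parts.path, parts.query, parts.fragment))

def from_successor(uri: str, scheme: str = SUCCESSOR_SCHEME) -> str:
    """Inverse mapping, so that old HTTP names remain recoverable."""
    parts = urlsplit(uri)
    if parts.scheme != scheme:
        raise ValueError("not a successor-scheme URI: " + uri)
    return urlunsplit(("http", parts.netloc, parts.path, parts.query, parts.fragment))
```

Because the mapping is invertible, an identifier minted today as `http://www.ivoa.net/rdf` could in this sketch be carried into the successor scheme and back without consulting any lookup table.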
Branding has both positive and negative aspects. In the positive sense, a funder or instigator wants to have credit, perhaps as an academic reward for their work, or to justify their continued existence to their own funders. In contrast, when an identifier has to be moved from one curator to another, the new curator may not want to be burdened with the old curator’s branding, nor will the old curator (if they still exist) be happy to have part of their branding under the control of another organisation. For this reason, it is important that long-term identifiers are as neutral as possible, with as few links as possible to real-world entities.
There are several technical solutions to this problem, which we will briefly outline below. What these various mechanisms have in common is that the persistent identifier is almost completely ‘brandless’. Each of the cases has support for managing what the identifier points to, and each has at least some support for identifier metadata.
Persistent URLs (PURLs) are simply URLs which redirect to the intended resource. They are typically hosted at http://purl.org, though the purl software is available and they may in principle be hosted elsewhere. They redirect with a 302 HTTP status code (‘Found’, originally ‘Moved Temporarily’), though others are possible.
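The redirection idea can be sketched in a few lines of Python. This is a minimal model of a redirect table, not the actual purl software: the table contents, paths, host names and function names are all placeholders invented for illustration. The point it tries to show is that the persistent path stays fixed while its target can be repointed by whoever curates the table.

```python
from typing import Dict, Optional, Tuple

# Hypothetical redirect table: persistent path -> current location.
# All paths and URLs here are placeholders, not real PURLs.
redirects: Dict[str, str] = {
    "/astro/vocab/example": "http://host-a.example.org/vocab/example.rdf",
}

def resolve(path: str, table: Dict[str, str]) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, Location header value) for a persistent path."""
    target = table.get(path)
    if target is None:
        return 404, None
    # 302 signals that the target is the current, not necessarily permanent, home.
    return 302, target

def move(path: str, new_target: str, table: Dict[str, str]) -> None:
    """Repoint an existing persistent path; the path itself never changes."""
    if path not in table:
        raise KeyError(path)
    table[path] = new_target
```

If the resource later moves from one host to another, only the table entry is updated with `move(...)`; documents which cite the persistent path need no change.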
Digital Object Identifiers (DOI) are the best-known example of the
A DOI consists of a registrant prefix and a per-object suffix. DOIs
are resolved using an API, but can also be resolved by suffixing the
DOI to the URL
Although DOIs are perhaps most familiar as identifiers for journal articles,
they can in fact refer to any digital object. They are not free to
Archival Resource Keys (ARK), like DOIs, have a registrant prefix and a per-object suffix, but unlike DOIs they are most typically resolved by being part of a URL. [Description needs clarification!!]
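The prefix-plus-suffix structure shared by DOIs and ARKs can be illustrated with a short Python sketch. It assumes the commonly seen textual forms `10.<registrant>/<suffix>` for a DOI and `ark:/<prefix>/<suffix>` for an ARK, possibly embedded in a URL. The identifiers, host names and resolver base below are made-up placeholders, and the functions are a simplification for illustration rather than validators for either specification.

```python
from typing import Tuple

def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI into (registrant prefix, per-object suffix) at the first '/'."""
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError("not a DOI: " + doi)
    return prefix, suffix

def doi_url(doi: str, resolver_base: str) -> str:
    """Form a resolvable URL by suffixing the DOI to a resolver base URL.

    resolver_base is supplied by the caller; no particular resolver is assumed.
    """
    split_doi(doi)  # raises ValueError if the DOI is malformed
    return resolver_base.rstrip("/") + "/" + doi

def split_ark(text: str) -> Tuple[str, str]:
    """Extract (registrant prefix, per-object suffix) from an ARK or a URL containing one."""
    marker = "ark:/"
    start = text.find(marker)
    if start < 0:
        raise ValueError("no ARK found in: " + text)
    prefix, sep, suffix = text[start + len(marker):].partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError("malformed ARK in: " + text)
    return prefix, suffix
```

For example, `split_ark("http://archive.example.org/ark:/12345/xyz9")` yields the same prefix and suffix as `split_ark("ark:/12345/xyz9")`, which reflects the idea that the host part of the URL is replaceable while the ARK itself is not.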
EZIDs (pronounced ‘easy-IDs’) are rather new, and are intended as a uniform gateway to both ARKs and DOIs. [Needs elaboration.]
The Virtual Observatory has already produced two persistent identifier systems. The GLU system was developed at CDS, and IVOA Identifiers (often referred to as ‘ivorns’) by the IVOA. Whatever their merits and current use, and despite their priority (first appearing in 1998 and 2003 respectively), both systems suffer from being restricted to astronomy, and from not being worldwide archival standards with the institutional backing and awareness enjoyed by DOIs, PURLs and the like.