The Benefits of Digital Preservation: an alternative view

JISC ran a 'benefits' workshop in Bristol last month, as part of its Managing Research Data programme. That was valuable, in the main (though the talk about TRAC was depressing in more ways than I have self-control to enumerate).

One element of the second day was a presentation on, and discussion of, Neil Beagrie's analysis of the Benefits from Digital Preservation of Research Data -- a component of the MRD programme's Keeping Research Data Safe project (KRDS). For a more compact introduction, see the summary links and factsheet at the Charles Beagrie site.

That benefits work includes a few important dimensions and a taxonomy that illuminate the broad contours of the benefits digital preservation investments potentially generate. I had thought this taxonomy was interesting and useful, but as the discussion moved on, I realised that I had significantly misunderstood its intention. I owned up to that, and it turned out that that alternative view was, perhaps, orthogonal rather than wrong. Simon suggested I post a note about it after the meeting. Here, somewhat later than billed, is that note.

The benefits matrix, KRDS-style

First, it's probably useful to note how (I believe) the KRDS report conceives of the benefits matrix.

That matrix looks like this (see the factsheet for a fuller version):

Dimension 1: Direct benefits | Indirect benefits (costs avoided)
  New research opportunities | No re-creation of data
  Scholarly communication | Lower future preservation costs
  Repurposing data | Use by new audiences

Dimension 2: Near-term benefits | Long-term benefits
  Value to current researcher | Secures value to future researchers
  No data lost from postdoc turnover | Adds value over time

Dimension 3: Private benefits | Public benefits
  Benefits to sponsor/funder of research | Input for future research
  Benefits to researcher | Motivating new research

The report illustrates various features located exclusively in one box or another. Neil has since clarified that the dimensions weren't intended to be exclusive, but that remains a natural picture for a reader to take away from the report.

Well, that's not how I think of dimensions.

The benefits space

Not to get all mathy, but when I see the word dimension, it means a set of things which have common features:

Thus, I thought that the Beagrie notion was that any identified benefit could be placed somewhere along each of these three dimensions, and so be located at a point in a three-dimensional box. This isn't massively far away from the original notion, but stressing the continuity lets us look at this in a usefully rich way.

Locating benefits

Look at the figure below, for example:

This figure is intended to show the benefits of four separate features of a digital repository (four features that are relevant to the sort of repositories I'm interested in), by locating them roughly on the three axes.

DR software: Data reduction software is generally of fairly direct benefit to a user, so it lives to the left of the diagram, at the 'direct' end of the direct-indirect axis; it's helpful to researchers in the immediate term, but will also allow them to get access to, and re-reduce, their data in future, so is somewhat spread out along the near-long axis; and it's mostly a private benefit. All that, together, means that DR software's benefits are in a cloud at the lower-front-left of the box in the diagram.

Good metadata: This is very spread out. Metadata helps individuals find their own data, and that of others, and so is of immediate practical use - that puts it in the lower-front-left of the diagram. From another point of view, however, that same metadata will help add value to any archive which holds the corresponding data, which in turn is a benefit to the community as a whole. Accordingly, this benefit ends up spread out along the diagonal from direct-near-private to indirect-long-public (there's more on this topic below; thanks to Angus Whyte for encouraging me to clarify this).

Open data: This, on the other hand, has both indirect and long-term benefits. Its location on the private-public axis, though, is vaguer, and it seems to me that it should be regarded as a lot more spread out. That's why it ends up as a stretched out oval, nestling in that upper-far-right corner, though stretching down from the top towards the 'private' end of the axis.

Sysadmins have pretty indirect benefits in this context. They may help solve individuals' problems in many cases, but generally the main payoff is that they keep things quietly running. Although some of the things they keep running will be external services, most of the benefit is seen internally to the organisation. That puts them squarely in the bottom-right of the diagram. However, good sysadmins show their worth in both the short and the long term, so their cloud is stretched out the full way from the front to the back.

Almost all of these assessments are debatable, and I wouldn't defend my locating of any of them with much conviction.

Locating interests

A benefit of this way of approaching the Beagrie dimensions is that we can add onto the same diagram our answers to the question: who cares?

(See also movies: large and small).

Researchers are selfish (when it comes down to the spreadsheet and the moving finger hovering over the delete-row button, they want someone else's project to be cut, dammit). They care about direct, near-term private benefits; thus their interests colour in the front-left-bottom corner. Research councils, in contrast, care all about the long-term public benefits, and don't get any direct payoff from the research they fund; they fill in the corner diagonally opposite the researchers. Institutions are as selfish as the researchers, but the institution's interests are mostly long-term, so they stretch out over the length of the direct-indirect axis at the bottom of the far side.

What these two diagrams together tell us is that (in this example) support for good software and, to a slightly lesser extent, good metadata are likely to be easy sells to researchers, but that they're only going to get excited about open data and resourcing for sysadmins if one appeals to their theoretical long-term interests. It's also clear why the term 'open data' has spread like hot honey across every research council document in sight.
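One crude way to make the 'easy sell' reading mechanical is to treat each cloud as an axis-aligned box inside the unit cube and ask which benefit boxes intersect the researchers' box. The sketch below is only that: the `Cloud` class and its `overlaps` method are hypothetical names, and every number is an illustrative guess read off the descriptions above, not a measurement. In particular, the diagonal spread of good metadata is flattened here into its bounding box, which overstates its extent.

```python
from dataclasses import dataclass
from typing import Tuple

Interval = Tuple[float, float]  # (low, high) along one axis, each in [0, 1]


@dataclass(frozen=True)
class Cloud:
    """A box in the benefits space (hypothetical model, not from KRDS).

    direct: 0 = direct,    1 = indirect
    term:   0 = near-term, 1 = long-term
    public: 0 = private,   1 = public
    """
    name: str
    direct: Interval
    term: Interval
    public: Interval

    def overlaps(self, other: "Cloud") -> bool:
        # Two boxes intersect only if their intervals intersect on every axis.
        pairs = zip(
            (self.direct, self.term, self.public),
            (other.direct, other.term, other.public),
        )
        return all(lo1 <= hi2 and lo2 <= hi1
                   for (lo1, hi1), (lo2, hi2) in pairs)


# Assumed positions: rough guesses based on the prose, nothing more.
researchers = Cloud("researchers", (0.0, 0.3), (0.0, 0.3), (0.0, 0.3))
benefits = [
    Cloud("DR software", (0.0, 0.3), (0.0, 0.6), (0.0, 0.3)),
    Cloud("good metadata", (0.0, 1.0), (0.0, 1.0), (0.0, 1.0)),
    Cloud("open data", (0.7, 1.0), (0.7, 1.0), (0.3, 1.0)),
    Cloud("sysadmins", (0.7, 1.0), (0.0, 1.0), (0.0, 0.3)),
]

easy_sells = [b.name for b in benefits if b.overlaps(researchers)]
print(easy_sells)  # ['DR software', 'good metadata']
```

Boxes are a blunt instrument for clouds, so the point is less the answer than that the inputs are explicit: anyone who disagrees can change a number and see what follows.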

Locating non-interests

To the extent that these diagrams are complete, they also suggest why it might be difficult to get institutions excited about digital repositories: none of the (blue) benefits overlap with the institutions' (red) interests. It also, rather grimly, shows why sysadmins have such a precarious time: they're clearly beneficial, but no-one in the red diagram sees it as their job to be interested in that (all is not lost: I'd think that heads of department are located in the bottom-right-front of the red diagram – they care).

The utility of this point of view, and further thoughts

I don't necessarily claim that these are strongly defensible locations for those features and interests, nor that these are the only benefits, and the only interests which matter. However, I believe the exercise is usefully illustrative and (a further benefit of looking at things in this way) the locations can be very concretely argued about.

I mentioned, above, that some features would be easy sells to some interest groups. Another way of putting this is that it suggests that people will tend to see only those benefits that overlap with their interests at the time. That's no big surprise by itself, of course, but it might be useful to see it as framing people's approach to the benefits. To return to the example of the good metadata, consider a researcher struggling with a data file with missing or wrong metadata. They will mutter darkly about idiocy and incompetence, spend the time to reconstruct the missing information, and move on with their work. In this frame of mind, it would take a long-shot appeal to their altruism to persuade them to take the time to upload the modified metadata to the source it came from. The same researcher, however, would readily see the benefit of having other people upload their corrected metadata, and in this 'disciplinary health' frame of mind, their set of interests might sit on quite different parts of the diagram. So I don't think that this picture captures the full range of possibilities – one can imagine different clouds for researcher in work mode and researcher in grant mode.

Discipline dependence

I suspect that the red diagram would be largely discipline-independent, but the blue diagram strongly discipline-specific. Despite that, perhaps there's a cross-discipline version of the blue diagram: JISC's probably interested in that.

Independence

One thing I thought I might see in this diagram was degeneracy. I don't mean that in the sense of vice-chancellors retreating with their catamites and astrologers to their Capri villas, but in the mathematical sense.

On earth, distance from London is a dimension – everywhere on earth is some distance or other from London – but if you've got latitude and longitude, being told the distance from London doesn't add anything (this doesn't quite work as an analogy, because knowing, say, latitude and distance-from-London doesn't unambiguously give you the longitude, but let that pass). That is, that set of three positions is degenerate.
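The London example can be checked with a few lines of Python. This is a minimal sketch, assuming a spherical Earth and approximate coordinates for London; the function name `distance_from_london` is made up for the illustration. It uses the standard haversine formula to show both halves of the point: the distance is fully determined by latitude and longitude, while latitude plus distance leaves the longitude ambiguous.

```python
import math

EARTH_RADIUS_KM = 6371.0     # mean Earth radius (spherical approximation)
LONDON = (51.5074, -0.1278)  # approximate latitude, longitude in degrees


def distance_from_london(lat_deg: float, lon_deg: float) -> float:
    """Great-circle distance in km from London, via the haversine formula."""
    lat1, lon1 = (math.radians(x) for x in LONDON)
    lat2, lon2 = math.radians(lat_deg), math.radians(lon_deg)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2)
         * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp guards against rounding pushing a fractionally above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Latitude and longitude fix the distance: the third number adds nothing.
# But two places at the same latitude, 30 degrees east and 30 degrees west
# of London's meridian, are equally far away, so latitude plus distance
# does not pin down the longitude.
east = distance_from_london(40.0, LONDON[1] + 30.0)
west = distance_from_london(40.0, LONDON[1] - 30.0)
print(math.isclose(east, west))
```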

At first glance, this set of Beagrie dimensions looked degenerate: it appeared that anyone who was interested in direct benefits would also be focused on near-term private ones, and anyone interested in long-term benefits would care only about indirect public ones; that is, it appeared that once you've been told someone's position on one dimension, you know their position on the others. The fact that the clouds in the diagram tend to cluster along the diagonal does to some extent corroborate this. But it's not entirely true, as suggested by the (fairly confident) locations of Institutions and of sysadmins, sitting in corners well clear of that diagonal.

Having said that, I can't think of anyone with interests in the top-back-left of the red diagram (direct, long-term, public), and no benefits spring to mind in that area, so I suspect these axes are at least more correlated than they could be.

Acknowledgements

Thanks to Angus Whyte for very useful comments on an earlier version of this note, and to Neil Beagrie for useful clarifications on the goals of the benefits exercise.

Norman, 2010 December 20