One of the many challenges of our increasingly digital world is that of establishing effective ways of preserving digital information, which is far more fragile than printed material. What are the implications of this for the scholarly record, and where does Open Access (OA) fit into the picture?
In a 1999 report for the Council on Library and Information Resources, Jeff Rothenberg, a senior research scientist at the RAND Corporation, pointed out that while we were generating more and more digital content each year, no one really knew how to preserve it effectively. If we didn't find a way of doing it soon, he warned, "our increasingly digital heritage is in grave risk of being lost."
In launching the UK Web Archive earlier this year, British Library chief executive Dame Lynne Brindley estimated that the Library would only be able to archive about one per cent of the 8.8 million .co.uk domains expected to exist by 2011. The remaining 99 per cent, she said, was in danger of falling into a "digital black hole".
In the context of Rothenberg's earlier warning, Brindley's comment might seem to suggest that very little has changed in the past eleven years so far as digital preservation is concerned. But that would be the wrong conclusion to reach. Rather, it draws attention to the fact that digital preservation is not just a technical issue.
As it happens, many of the technical issues associated with digital preservation have now been resolved. In their place, however, a bunch of other issues have emerged — including legal, organisational, social, and financial issues.
What concerns Brindley, for instance, are not the technical issues associated with archiving the Web, but the undesirable barrier that today's copyright laws impose on anyone trying to do so. Since copyright law requires obtaining permission from the owner of every web site before archiving it, the task is time-consuming, expensive, and quite often impossible.
Clearly there are implications here for the research community.
State of play
So what is the current state of play so far as preserving the scholarly record is concerned?
First we need to distinguish between two different categories of digital information. There is retro-digitised material, which in the research context consists mainly of data created as a result of research libraries digitising their print holdings: journals, books, theses, special collections, etc. Then there is born-digital material, which includes e-journals, eBooks and raw data produced during the research process.
It is worth noting that the quantities of raw data generated by Big Science can be mind-boggling. In the case of the Large Hadron Collider, for instance, CERN expects that it will generate 27 terabytes of raw data every day when it is running at full throttle — plus 10 terabytes of "event summary data".
To cater for this deluge CERN has created a bespoke computing grid called the WLCG. While the costs associated with the WLCG will be shared amongst 130 computing centres around the world, the personnel and materials costs to CERN alone reached 100 million Euros in 2008, and CERN's budget for the grid going forward is 14 million Euros per annum.
Of course, these figures by no means represent preservation costs alone, and they are not typical — but they provide some perspective on the kind of challenges the science community faces.
What were the main findings?
So far as retro-digitisation is concerned, the Report points out that funding is limited and "the quantity of non-digitised material is huge". Even so, it adds, there is general concern about "the sustainability of hosting" the data that has been generated from digitisation. This is a particular concern for small and medium-sized institutions.
With regard to born-digital material the Report found that the largest gaps are currently in the "provision for perpetual access for e-journals".
The situation with regard to eBooks and databases is less clear since, as the Report points out, "experience in digital preservation with these content types is currently more limited."
While the Report focused on the situation in Germany, the international nature of today's research environment suggests the situation will be similar in all developed nations (although Germany does have two unique mass digitisation centres).
We should not be surprised that the German Report found the largest gap to be in the preservation of journal content. As we shall see, the migration from a print to a digital environment has disrupted traditional practices and responsibilities, and led to some uncertainty about who is ultimately responsible for preserving the scholarly record.
We should also point out that one important area that the German Report did not look at is the growing trend for scholars to make use of blogs, wikis, open notebooks and other Web 2.0 applications. Should this data not be preserved? If it should, whose responsibility is it to do so, and what particular challenges does it raise? As we have seen, for instance, preserving web content is not a technical issue alone. Amongst other things, there are copyright issues. (Although, as the research community starts to use more liberal copyright licences, these difficulties should ease somewhat.)
Another recently published report did look at the issue of web-created scholarly content, but reached no firm conclusion. Produced by the Blue Ribbon Task Force, this Report concluded: "[I]n scholarly discourse there is a clear community consensus about the value of e-journals over time. There is much less clarity about the long-term value of emerging forms of scholarly communication such as blogs, products of collaborative workspaces, digital lab books, and grey literature (at least in those fields that do not use preprints). Demand may be hypothesised — social networking sites should be preserved for future generations — but that does not tell us what to do or why."
One issue likely to be of interest to OA advocates is whether institutional repositories should be expected to play a part in preserving research output.
Evidence cited by the German Report suggests that repositories are not generally viewed as preservation tools. It pointed out, for instance, that the Dutch National Library's KB e-Depot currently archives the content hosted in 13 institutional repositories in the Netherlands.
The Blue Ribbon Report, by contrast, appears to believe that repositories do have a long-term archiving role. It suggests, for instance, that self-archiving mandates should always be accompanied by a "preservation mandate".
The Report goes on to suggest that the inevitable additional costs associated with repository preservation should be taken out of the institution's Gold OA fund (where such a fund exists).
If you wish to read the rest of this introduction, and the interview with preservation specialist Neil Beagrie, please click on the link below. I am publishing it under a Creative Commons licence, so you are free to copy and distribute it as you wish, so long as you credit me as the author, do not alter or transform the text, and do not use it for any commercial purpose.
If you would like to republish the interview on a commercial basis, or have any comments on it, please email me at firstname.lastname@example.org.
To read the rest of the introduction and the interview with Neil Beagrie (as a PDF file) click here.