Wednesday, October 05, 2016

Institutional Repositories: Response to comments

The introduction I wrote for the recent Q&A with Clifford Lynch has attracted some commentary from the institutional repository (IR) and open access (OA) communities. I thank those who took the time to respond. After reading the comments the following questions occurred to me.

(A print version of this text is available here)

1.     Is the institutional repository dead or dying?

Judging by the Mark Twain quote with which COAR’s Kathleen Shearer headed her response (“The reports of our death have been greatly exaggerated”), and judging by CORE’s Nancy Pontika insisting in her comment that we should not give up on the IR (“It is my strong belief that we don’t need to abandon repositories”) people might conclude that I had said the IR is dead.

Indeed, by the time Shearer’s comments were republished on the OpenAIRE blog (under the title “COAR counters reports of repositories’ demise”) the wording had strengthened – Shearer was now saying that I had made a number of “somewhat questionable assertions, in particular that institutional repositories (IRs) have failed.”

That is not exactly what I said, although I did quote a blog post by Eric Van de Velde (here) in which he declared the IR obsolete. As he put it, “Its flawed foundation cannot be repaired. The IR must be phased out and replaced with viable alternatives.”

What I said (and about this Clifford Lynch seemed to agree, as do a growing number of others) is that it is time for the research community to take stock, and rethink what it hopes to achieve with the IR.

It is however correct to say I argued that green OA has “failed as a strategy”. And I do believe this. I gave some of the reasons why I do in my introduction, the most obvious of which is that green OA advocates assumed that once IRs were created they would quickly be filled by researchers self-archiving their work. Yet seventeen years after the Santa Fe meeting, and 22 years after Stevan Harnad began his long campaign to persuade researchers to self-archive, it is clear there remains little or no appetite for doing so, even though researchers are more than happy to post their papers on commercial sites like Academia.edu and ResearchGate.

However, I then went on to say that I saw two possible future scenarios for the IR. The first would see the research community “finally come together, agree on the appropriate role and purpose of the IR, and then implement a strategic plan that will see repositories filled with the target content (whatever it is deemed to be).”

The second scenario I envisaged was that the IR would be “captured by commercial publishers, much as open access itself is being captured by means of pay-to-publish gold OA.”

Neither of these scenarios assumes the IR will die, although they do envisage somewhat different futures for it. That said, what they could share in common is a propensity for the link between the IR and open access to weaken. Already we are seeing a growing number of papers in IRs being hidden behind login walls – either as a result of publisher embargoes or because many institutions have come to view the IR less as a way of making research freely available, more as a primary source of raw material for researcher evaluation and/or other internal processes. As IRs merge with Research Information Management (RIM) tools and Current Research Information Systems (CRIS) this darkening of the content in IRs could intensify.  

What makes this darkening likely is that the internal processes that IRs are starting to be used for generally only require the deposit of the metadata (bibliographic details) of papers, not the full-text. As such, the underlying documents may not just be inaccessible, but entirely absent.

This outcome seems even more likely in my second scenario. Here the IR is (so far as research articles are concerned) downgraded to the task of linking users to content hosted on publishers’ sites. Again, to fulfil such a role the IR need host only metadata.

2.     So what is the role of an institutional repository? What should be deposited in it, and for what purpose?

As I pointed out in my introduction, there is today no consensus on the role and purpose of the IR. Some see it as a platform for green OA, some view it as a journal publication platform, some as a metadata repository, some as a digital archive, some as a research data repository (I could go on).

It is worth noting here a comment posted on my blog by David Lowe. The reason why the IR will persist, he said, “is not related to OA publishing as such, but instead to ETDs.” Presumably this means that Lowe expects the primary role of the IR to become that of facilitating ETD workflows.

It turns out that ETDs are frequently locked behind login walls, as Joachim Schöpfel and Hélène Prost pointed out in a 2014 paper called Back to Grey: Disclosure and Concealment of Electronic Theses and Dissertations. “Our paper,” they wrote, “describes a new and unexpected effect of the development of digital libraries and open access, as a paradoxical practice of hiding information from the scientific community and society, while partly sharing it with a restricted population (campus).”

And they concluded that the Internet “is not synonymous with openness, and the creation of institutional repositories and ETD workflows does not make all items more accessible and available. Sometimes, the new infrastructure even appears to increase barriers.”

In short, the roles that IRs are expected to play are now manifold and sometimes they are in conflict with one another. One consequence of this is that the link between the repository and open access could become more and more tenuous. Indeed, it is not beyond the bounds of possibility that the link could break altogether.

3.     To what extent can we say that the IR movement – and the OAI-PMH standard on which it was based – has proved successful, both in terms of interoperability and deposit levels?

As I said in my introduction, thousands of IRs have been created since 1999. That is undoubtedly an achievement. On the other hand, many of these repositories remain half empty, and for the reasons stated above we could see them increasingly being populated with metadata alone.

Both Shearer and Pontika agree that more could have been achieved with the IR. With regard to OAI-PMH Pontika says that while it has its disadvantages, “it has served the field well for quite some time now.”

But what does serving the field well mean in this context? Let’s recall that the main reason for holding the Santa Fe meeting, and for developing OAI-PMH, was to make IRs interoperable. And yet interoperability remains more aspiration than reality today. Perhaps for this reason most research papers are now located by means of commercial search engines and Google Scholar, not OAI-PMH harvesters – a point Shearer conceded when I interviewed her in 2014.

Of course, if running an IR becomes less about providing open access and more about enabling internal processes, or linking to papers hosted elsewhere, interoperability begins to seem unnecessary.

4.     Do IR advocates now accept that there is a need to re-think the institutional repository, and is the IR movement about to experience a great leap forward as a result?

Most IR advocates do appear to agree that it is time to review the current status of the institutional repository, and to rethink its role and purpose. And it is the Confederation of Open Access Repositories (COAR) that is leading on this.

“The calls for a fundamental rethink of repositories is already being answered!” Tony Ross-Hellauer –  scientific manager at OpenAIRE (a member of COAR) –  commented on my blog.  “See the ongoing work of the COAR next-generation repositories working group.”

Shearer, who is the executive director of COAR (and so presumably responsible for the working group), explains in her response that the group has set itself the task of identifying “the core functionalities for the next generation of repositories, as well as the architectures and technologies required to implement them.”

As a result, Shearer says, the IR community is “now well positioned to offer a viable alternative for an open and community led scholarly communication system.”

So all is well? Not everyone thinks so. As an anonymous commenter pointed out on my blog: “All this is not really offering a new way and more like reacting to the flow. Maybe that has to do with the kind of people working on it, the IR crowd is usually coming from the library field and their job is not to be inventive but to archive and keep stuff save.”

Archiving and keeping stuff save are very worthy missions, but it is to for-profit publishers that people tend to turn when they are looking for inventive solutions, and we can see that legacy publishers are now keen to move into the IR space. This suggests that if the goal is to create a community-led scholarly communications system COAR’s initiative could turn out to be a case of shutting the stable door after the horse has bolted.

5.     What is the most important task when seeking to engineer radical change in scholarly communication: articulating a vision, providing enabling technology, or getting community buy-in?

“Ultimately, what we are promoting is a conceptual model, not a technology,” says Shearer. “Technologies will and must change over time, including repository technologies. We are calling for the scholarly community to take back control of the knowledge production process via a distributed network based at scholarly institutions around the world.”

Shearer adds that the following vision underlies COAR’s work:

“To position distributed repositories as the foundation of a globally networked infrastructure for scholarly communication that is collectively managed by the scholarly community. The resulting global repository network should have the potential to help transform the scholarly communication system by emphasizing the benefits of collective, open and distributed management, open content, uniform behaviors, real-time dissemination, and collective innovation.”

As such, I take it that COAR is seeking to facilitate the first scenario I outlined. But were not the above objectives those of the attendees of the 1999 Santa Fe meeting? Yet seventeen years later we are still waiting for them to be realised. Why might it be different this time around, especially now that legacy publishers are entering the market for IR services, and some universities seem minded to outsource the hosting of research papers to commercial organisations, rather than work with colleagues in the research community to create an interoperable network of distributed repositories?

What has also become apparent over the past 17 years is that open movements and initiatives focused on radical reform of scholarly communication tend to be long on impassioned calls, petitions and visions, short on collective action.

As NYU librarian April Hathcock put it when reporting on a Force11 Scholarly Commons Working Group she attended recently: “As several of my fellow librarian colleagues pointed out at the meeting, we tend to participate in conversations like this all the time and always with very similar results. The principles are fine, but to me, they’re nothing new or radical. They’re the same things we’ve been talking about for ages.”

Without doubt, articulating a vision is a good and necessary thing to do. But it can only take you so far. You also need enabling technology. And here we have learned that “there is many a slip ’twixt the cup and the lip.” OAI-PMH has not delivered on its promise, as even Herbert Van de Sompel, one of the architects of the protocol, appears to have concluded. (Although this tweet suggests that he too does not agree with the way I characterised the current state of the IR movement).

Shearer is of course right to say that technologies have to change over time. However, choosing the wrong one can derail, or significantly slow down, the objective you are working towards.

But even if you have articulated a clear and desirable vision, and you have put the right technology in place, in the generally chaotic and anarchic world of scholarly communication you can only hope to achieve your objectives if you get community buy-in. That is what the IR and self-archiving movements have surely demonstrated.

6.     To what extent are commercial organisations colonising the IR landscape?

In my introduction I said that commercial publishers are now actively seeking to colonise and control the repository (a strategy supported by their parallel activities aimed at co-opting gold open access). As such, I said, the challenge the IR community faces is now much greater than in 1999.

In her response, Shearer says that I mischaracterise the situation. “[T]here are numerous examples of not-for-profit aggregators including BASE, CORE, SemanticScholar, CiteSeerX, OpenAIRE, LA Referencia and SHARE (I could go on),” she said. “These services index and provide access to a large set of articles, while also, in some cases, keeping a copy of the content.”

In fact, I did discuss non-profit services like BASE and OpenAIRE, as well as PubMed Central, HAL and SciELO. In doing so I pointed out that a high percentage of the large set of articles that Shearer refers to are not actually full-text documents, but metadata records. And of the full-text documents that are deposited, many are locked behind login walls. In the case of BASE, therefore, only around 60% of the records it indexes provide access to the full-text.

In addition, many consist of non-peer-reviewed and non-target content such as blog posts. That’s fine, but this is not the target content that OA advocates say they want to see made open access. Indeed, in some cases a record may consist of no more than a link to a link (e.g. see the first item listed here).

So the claims that these services make about indexing and providing access to a large set of articles need to be taken with a pinch of salt.

It is also important to note that publishers are at a significant advantage here, since they host and control access to the full-text of everything they publish. Moreover, they can provide access to the version of record (VoR) of articles. This is invariably the version that researchers want to read.

It also means that publishers can offer access both to OA papers and to paywalled papers, all through the same interface. And since they have the necessary funds to perfect the technology, publishers can offer more and better functionality, and a more user-friendly interface. For this reason, I suggested, they will soon be charging (and indeed some already are) for services that index open content, as I assume Elsevier plans to do with the DataSearch service it is developing. This seems to me to be a new form of enclosure of the commons.

Shearer also took me to task for attaching too much significance to the partnership between Elsevier and the University of Florida – in which the University has agreed to outsource access to papers indexed in its repository to Elsevier. I suggested that by signing up to deals like this, universities will allow commercial publishers to increasingly control and marginalise IRs. This is an exaggeration, says Shearer: “[O]ne repository does not make a trend.”

I agree that one swallow does not a summer make. However, summer does eventually arrive, and I anticipate that the agreement with the University of Florida will prove the first swallow of a hot summer. Other swallows will surely follow.

Consider, for instance, that the University of Florida has also signed a Letter of Agreement with CHORUS in a pilot initiative intended to scale up the Elsevier project “to a multilateral, industry effort.”

In addition to Elsevier, publishers involved in the pilot include the American Chemical Society, the American Physical Society, The Rockefeller University Press and Wiley. Other publishers will surely follow.

And just last week it was announced that Qatar University Library has signed a deal with Elsevier that apes the one signed by the University of Florida. I think we can see a trend in the making here.

As things stand, therefore, it is not clear to me how initiatives like COAR and SHARE can hope to match the collective power of legacy publishers working through CHORUS.

Let’s recall that OA advocates long argued that legacy publishers would never be able to replicate in an OA environment the dominance they have long enjoyed in the subscription world. As a result, it was said, as open access commodifies the services they provide publishers will experience a downward pressure on prices. In response, they will either have to downsize their operations, or get out of the publishing business altogether. Today we can see that legacy publishers are not only prospering in the OA environment, but getting ever richer as their profits rise – all at the expense of the taxpayer.

But let me be clear: while I fear that legacy publishers are going to co-opt both OA and IRs, I would much prefer they did not. Far better that the research community – with the help of non-profit concerns – succeeded in developing COAR’s “viable alternative for an open and community led scholarly communication system.”

So I applaud COAR’s initiative and absolutely sign up to its vision. My concern is that, as things stand, that vision is unlikely to be realised. For it to happen I believe more dramatic changes would be needed than the OA and IR movements appear to assume, or are working towards.

7.     Will the IR movement, as with all such attempts by the research community to take back control of scholarly communication, inevitably fall victim to a collective action dilemma?

Let me here quote Van de Sompel, one of the key architects of OAI-PMH. Van de Sompel, I would add, has subsequently worked on OAI-ORE (which Lynch mentions in the Q&A) and on ResourceSync (which Shearer mentions in her critique).

In a retrospective on repository interoperability efforts published last year Van de Sompel concluded, “Over the years, we have learned that no one is ‘King of Scholarly Communication’ and that no progress regarding interoperability can be accomplished without active involvement and buy-in from the stakeholder communities. However, it is a significant challenge to determine what exactly the stakeholder communities are, and who can act as their representatives, when the target environment is as broad as all nodes involved in web-based scholarship. To put this differently, it is hard to know how to exactly start an effort to work towards increased interoperability.”

The larger problem here, of course, is the difficulty inherent in trying to get the research community to co-operate.

This is the problem that afflicts all attempts by the research community to, in Shearer’s words, “take back control of the knowledge production process.” What inevitably happens is that they bump up against what John Wenzler, Dean of Libraries at California State University, has described as a “collective action dilemma”.

But what is the solution? Wenzler suggests the research community should focus on trying to control the costs of scholarly communication. Possible ways of doing this, he says, could include requiring pricing transparency and lobbying for government intervention and regulation. (“[T]he government can try to limit a natural monopoly’s ability to exploit its customers by regulating its prices instead.”)

He concedes however: “Currently, the dominant political ideology in Western capitalist countries, especially in the United States, is hostile to regulation, and it would be difficult to convince politicians to impose prices on an industry that hasn’t been regulated in the past.”

He adds: “Moreover, even if some kind of International Publishing Committee were created to establish price rates, there is a chance that regulators would be captured by publisher interests.”

It is worth recalling that while OA advocates have successfully persuaded many governments to introduce open access/public access policies, this has not put control of the knowledge production process back into the hands of the research community, or reduced prices. Quite the reverse: it is (ironically) increasing the power and dominance of legacy publishers.  

In short, as things stand if you want to make a lot of money from the taxpayer you could do no better than become a scholarly publisher!

I don’t like being the eternal pessimist. I am convinced there must be a way of achieving the objectives of the open access and IR movements, and I believe it would be a good thing for that to happen. Before it can, however, these movements really need to acknowledge the degree to which their objectives are being undermined and waylaid by publishers. And rather than just repeating the same old mantras, and recycling the same visions, they need to come up with new and more compelling strategies for achieving their objectives. I don’t claim to know what the answer is, but I do know that time is not on the side of the research community here.


Stevan Harnad said...

Repositories vs. Quasitories, or Much Ado About Next To Nothing: I

“I have a feeling that when Posterity looks back at the last decade of the 2nd A.D. millennium of scholarly and scientific research on our planet, it may chuckle at us…. I don't think there is any doubt in anyone's mind as to what the optimal and inevitable outcome of all this will be: The Give-Away literature will be free at last online, in one global, interlinked virtual library.. and its [peer review] expenses will be paid for up-front, out of the [subscription cancelation] savings. The only question is: When? This piece is written in the hope of wiping the potential smirk off Posterity's face by persuading the academic cavalry, now that they have been led to the waters of self-archiving, that they should just go ahead and drink!” (Harnad, 20th century)

Richard Poynder notes that 17 years on, Institutional Repositories (IRs) are still half-empty of their target content: peer-reviewed research journal articles.

He is right. Most researchers are still not doing the requisite keystrokes to deposit their peer-reviewed papers (and their frantic librarians' efforts are no substitute).

The reason is that researchers' institutions and funders still have not got their heads around the right deposit mandates.

They will, but they will not get historic credit for having done it as soon as they could have.

Richard also says authors are more willing to deposit in Academia.edu and ResearchGate.

Not true. In percentage terms those central Quasitories are doing just as badly as IRs. But their visible recruiting efforts (software that keeps reminding and cajoling authors) are clever, and something along the same lines should be adopted as part of funder and especially institutional deposit mandates. (Keystrokes are keystrokes, whether done for one's own institutional repository or a third party Quasitory.)

The biggest Quasitory of all is the Virtual Quasitory called Google Scholar (GS). GS has mooted most of the fuss about interoperability because it full-text-inverts all content. It's a nuclear weapon, but it is in no hurry. Unlike institutions and funders, GS is under no financial pressure. And unlike publishers, it does not have the ambition or the need to capture and preserve publishers' obsolete, parasitic functions (even though, unlike publishers, GS is in an incomparably better position to maximise functionality on the web). GS is waiting patiently for the research community to get its act together.

Institutions and funders are not just sluggish in adopting and optimizing their deposit mandates but they are making Faustian Little Deals with their parasites, prolonging their longstanding dysfunctional bondage.

Stevan Harnad said...

Repositories vs. Quasitories, or Much Ado About Next To Nothing: II

Can't blame publishers for striving at all costs to keep making a buck, even if they no longer really have any essential service or expertise to offer (other than managing peer review). Publishers' last resort for clinging to their empty empire is the OA embargo -- for which the antidote -- the eprint-request button (the IR's functional equivalent of Academia.edu and ResearchGate) -- is already known; it's just waiting to be used, along with effective deposit mandates.

As to why it's all taking so excruciatingly long: I'm no good at sussing that out, and besides, Alma Swan has forbidden me even to give voice to my suspicion, beyond perhaps the first of its nine letters: S.

Vincent-Lamarre, P, Boivin, J, Gargouri, Y, Larivière, V & Harnad, S (2016) Estimating Open Access Mandate Effectiveness: The MELIBEA Score. Journal of the Association for Information Science and Technology (JASIST) 67 (in press)

Swan, A; Gargouri, Y; Hunt, M; & Harnad, S (2015) Open Access Policy: Numbers, Analysis, Effectiveness. Pasteur4OA Workpackage 3 Report.

Harnad, S (2015) Open Access: What, Where, When, How and Why. In: Ethics, Science, Technology, and Engineering: An International Resource. eds. J. Britt Holbrook & Carl Mitcham, (2nd edition of Encyclopedia of Science, Technology, and Ethics, Farmington Hills MI: MacMillan Reference)

Harnad, S (2015) Optimizing Open Access Policy. The Serials Librarian, 69(2), 133-141

Sale, A., Couture, M., Rodrigues, E., Carr, L. and Harnad, S. (2014) Open Access Mandates and the "Fair Dealing" Button. In: Dynamic Fair Dealing: Creating Canadian Culture Online (Rosemary J. Coombe & Darren Wershler, Eds.)

Harnad, S (2014) The only way to make inflated journal subscriptions unsustainable: Mandate Green Open Access. LSE Impact of Social Sciences Blog 4/28

Petr Knoth said...

I just wanted to add that CORE, listed among the OAI-PMH harvesters, is a free not-for-profit service, which indexes and keeps a cached copy of research papers harvested from repositories and journals via OAI-PMH (and other protocols). It carries out no pulling of full-texts at the time of access, unless specifically requested by the user, and the user is also not going to hit a paywall. CORE has close to 4.5 million full-texts available and about 37 million metadata records. The access to the full-text content is, as opposed to other commercial services, provided both via a user interface as well as an API or data dumps.

My argument here is that I don't think that it can be said that OAI-PMH has failed and does not enable interoperability or the development of aggregations. Clearly this is possible. However, I have highlighted before certain issues with OAI-PMH that make interoperability more difficult to achieve, see (Knoth, 2013). I believe the way forward is to constructively address these issues through the development of common practice or better open protocols. This approach is completely different from the strategy of most existing commercial services that create solutions on top of which it is almost impossible to develop anything new. For example, in the domain of text mining research papers, there is no commercial service providing an acceptable solution to the provision of research papers. The role of repositories in enabling this (and other) important use cases should not be underestimated. In fact, achieving interoperability across the content from publishers is an order of magnitude more complicated than across repositories, see (Knoth & Pontika, 2016).

Knoth, P. (2013) From Open Access Metadata to Open Access Content: Two Principles for Increased Visibility of Open Access Content, Open Repositories 2013, Charlottetown, Prince Edward Island, Canada

Knoth, P. and Pontika, N. (2016) Aggregating Research Papers from Publishers' Systems to Support Text and Data Mining: Deliberate Lack of Interoperability or Not?, Workshop: INTEROP2016 at 10th Language Resources and Evaluation Conference