A preprint posted on PeerJ yesterday
offers some new insight into the number of articles now available on an open-access basis.
The new study is different to previous ones in a number of ways, not
least because it includes data from users of Unpaywall, a browser plug-in that identifies papers that researchers are looking for, and then checks to see whether the papers are available for free anywhere on the Web.
Unpaywall is based on oaDOI, a tool that scours the web for open-access full-text versions of journal articles.
Both tools were developed by Impactstory, a non-profit focused on
open-access issues in science. Two of the authors of the PeerJ preprint – Heather Piwowar
and Jason Priem – founded Impactstory. They also wrote the Unpaywall and oaDOI software.
The paper – which is called The State of
OA: A large-scale analysis of the prevalence and impact of Open Access articles
– reports that 28% of the scholarly literature (19 million articles) is now OA,
and growing, and that for recent articles the percentage available as OA rises to 45%.
The study authors say they also found that
OA articles receive 18% more citations than average.
In addition, the authors report on what they describe as a previously under-discussed phenomenon of open access – Bronze OA. This refers to articles that are made free-to-read on the publisher’s website without an explicit open licence.
Below I publish a Q&A with Heather Piwowar about the study.
Note: my questions were based on an earlier version of the article I saw, and a couple of the quotes I cite were changed in the final version of the paper. Nevertheless, all the questions and the answers remain relevant and useful so I have not changed any of the questions.
What is new and different about your study? Do you feel it is more accurate
than previous studies that have sought to estimate how much of the literature
is OA, or is it just another shot at trying to do that?
HP: Our study has a few important differences:
· We look at a
broader range of the literature than previous studies and go further back (to
pre-1950 articles), we look at more articles (all of Crossref, not just all of Scopus or Web of
Science – Crossref has twice the number of articles that Scopus has), and we take a larger sample than most other studies. That’s because we classify
OA status algorithmically, rather than relying on manual classification. This
allowed us to sample 300k articles, rather than a few hundred as many OA
studies have done. So, our sample is more accurate than most, and more
generalizable as well.
· We undertook a more detailed categorization of OA. We looked not just at Green and
Gold OA, but also Hybrid, and a new category we call Bronze OA. Many other
studies (including the most comparable to ours, the European Commission report
you mention below) do not bring out all these categories specifically. (I will
say more on that below). Furthermore, we didn’t include Academic Social
Networks. Mixing those with publisher-hosted free-to-read content makes the
results less useful to policy makers.
· Our data and our methods are open, for anyone to use and build upon. Again, this is
a big difference from the Archambault et al. study (that is, the one commissioned by
the European Commission) and we think it is an important difference.
· We include data
from Unpaywall users, which allows us to get a sense of how much of the
literature is OA from the perspective of actual readers. Readers massively
favour newer articles, for instance, which is good news because such articles
are more likely to be OA. By sampling actual reader data, from people using an
OA tool that anyone can install, we can report OA percentages that are more
realistic and useful for many real-world policy issues.
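To give a rough sense of why the sample size mentioned above matters, the sketch below compares the approximate 95% margin of error for an estimated OA share of 28% at two sample sizes. It assumes a simple random sample and the normal approximation for a proportion; the figure of 300 is only an illustrative stand-in for "a few hundred" and is not taken from any particular study, and the function name is hypothetical.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for a proportion p estimated from a
    simple random sample of size n (normal approximation, z=1.96 for 95%)."""
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    p = 0.28  # overall OA share reported in the study
    # 300 is an assumed "few hundred" sample; 300,000 is the study's sample.
    for n in (300, 300_000):
        moe = margin_of_error(p, n)
        print(f"n={n:>7,}: 28% +/- {moe * 100:.2f} percentage points")
```

Under these assumptions the smaller sample carries an uncertainty of roughly five percentage points either way, while the 300k sample narrows that to well under one point. Sampling error is only one source of inaccuracy, so this says nothing about errors introduced by the automated classification itself.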
You estimate that at least 28% of the scholarly literature is open access
today. OA advocates tend nowadays to cite the earlier European Commission
report which, the EU claims,
indicates that back in 2011 nearly 50% of papers were OA. Was the EU study an
overestimate in your view, or has there been a step backwards?
HP: Their 50% estimate was of recent
papers, and included papers posted to ResearchGate (RG) and Academia.edu as
open access. Our 28% estimate is for all journal articles, going back to 1900 –
everything with a DOI. We found 45% OA for recent articles, and that’s
excluding RG and Academia. So, they are pretty similar estimates.
In fact, you came up with a number of different percentages. Can you explain
the differences between these figures, why it is important to make these
distinctions, and what the implications of the different figures are?
HP: There are two summary percentages: 28% OA
for all journal articles, and 47% OA for journal articles that people read. As
I noted, people read more recent articles, and more recent articles are more
likely to be OA, so it turns out that almost half of the papers people are
interested in reading right now are actually OA. Which is really cool!
And when you consider that we used automated methods that missed a bit of OA, it is
more than half, so the 47% is a lower bound.
You coin a new definition of open access in your paper, what you call Bronze
OA. Can you say something about Bronze OA and its implications? It seems to me,
for instance, that a lot of papers (over half?) currently available as open
access are vulnerable to losing their OA status. Is that right? If so, what can
be done to mitigate the problem?
HP: Yes, we did think we were coining a new
term. But this morning I learned we weren’t the first to use the term Bronze OA
– that honour goes to Ged Ridgway, who used it in a tweet.
I guess it’s a case of Great Minds Think Alike!
Our definition of Bronze OA is the same as Ged’s: articles made free-to-read on the
publisher’s website, without an explicit open license. This includes Delayed OA
and promotional material like newsworthy articles that the publishers have
chosen to make free but not open.
It also includes a surprising number of articles (perhaps as much as half of the
Bronze total, based on a very preliminary sample) from entirely free-to-read
journals that are not listed in DOAJ and do not publish content under an open
license. Opinions will differ on whether these are properly called “Gold OA”
journals/articles; in the paper, we suggest they might be called “Dark Gold”
(because they are hard to find in OA indexes) or “Hidden Gold.” We are keen to
see more research on this.
More research is also needed to understand the other characteristics of Bronze OA.
Is it disproportionately non-peer-reviewed content (e.g. front-matter), as seems
likely? How much of Bronze OA is also Delayed OA? How much Bronze is
Promotional, and how transient is the free-to-read status of this content? How
many Bronze articles are published in “hidden gold” journals that are not
listed in the DOAJ? Why are these journals not defining an explicit license for
their content, and are there effective ways to encourage them to do so?
This kind of follow-up research is needed before we can understand the risks
associated with Bronze and what kind of mitigation would be helpful.
You say in your paper, “About 7% of the literature (and 17% of the OA
literature) is Green, and this number does not seem to be growing at the rate
of Gold and Hybrid OA.” You also suspect that much of this green OA is
“backfilling” repositories with older articles, which are generally viewed as
being of less value. What happened to the OA dream articulated
by Stevan Harnad in 1994, and what future do you predict for green OA going forward?
HP: First, I should clarify: our definition
of Green OA for the purposes of the study is that a paper is in a repository
and is not available for free on the publisher site. This is so we don’t double
count articles as both Green and Gold (or Hybrid or Bronze) for our analysis.
We gave publisher-hosted locations the priority in our classifications because we
suspect most people would rather read papers there. So, in our article when we
say green OA isn’t growing, what we mean is that more recent papers that are
only available in repositories are available as Green OA at roughly the same rate as older papers.
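The precedence described in this answer can be sketched as a small classification function. This is only an illustration, not the study's actual code: the Bronze and Green rules follow the definitions given in this interview, while the Gold rule (journal listed in DOAJ) and the Hybrid rule (open licence in a journal not listed in DOAJ) are assumptions made for the sake of the example. All field and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Article:
    free_on_publisher_site: bool  # free to read on the publisher's website
    has_open_license: bool        # explicit open licence on the publisher copy
    journal_in_doaj: bool         # assumption: used here as the test for Gold
    in_repository: bool           # a copy is deposited in a repository


def classify(a: Article) -> str:
    """Assign one OA category per article, publisher-hosted copies first."""
    if a.free_on_publisher_site:
        if a.journal_in_doaj:
            return "gold"    # assumed rule, not stated in the interview
        if a.has_open_license:
            return "hybrid"  # assumed rule, not stated in the interview
        return "bronze"      # free to read, no explicit open licence
    if a.in_repository:
        return "green"       # repository copy only, so no double counting
    return "closed"


if __name__ == "__main__":
    # "Shadowed green": in a repository AND free on the publisher site,
    # so the publisher-hosted category wins.
    shadowed = Article(True, False, False, True)
    print(classify(shadowed))
```

Because the publisher check comes first, an article that is both deposited in a repository and free on the publisher's site is never counted as Green, which is the "shadowed green" case discussed in this answer.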
It is worth future study to understand this better. I have a suspicion: perhaps
much of what would have been Green OA became Bronze and what we call “shadowed
green” – where there is a copy in a repository and a freely available copy on
the publisher’s site as well. I suspect publishers responded to funder mandates
that require self-archiving by making the paper free on the publisher sites as
well, in synchronized timing.
Biomed doesn’t look like it has as much Green as I’d expect, given the success
of the NIH mandate and the number of articles in PMC. We do know many biomed
journals have Delayed OA policies, which we categorized as Bronze in our
analysis. Did they implement these Delayed OA policies in response to the PMC
mandates? Perhaps others already know this to be true... I haven’t had a chance
to look it up. Anyway. I think the interplay between Green and Bronze is
especially worth more exploration.
We do also report on all the articles that are deposited in repositories, Green
plus shadowed green, in the article’s Appendices. We found the proportion of
the literature that is deposited in repositories to be higher for recent papers.
A final note: We actually changed the sentence that you quoted in the final
version of our paper, because we were wrong to talk about “growing” as we did.
Our study didn’t measure when articles were deposited in repositories, but just
looked at their publication year. Other studies have demonstrated that people
often upload papers from earlier years, a practice called backfilling.
I suppose in some ways these have less value, because they are read less often.
That said, anyone who really needs a particular paper and doesn’t otherwise
have access to it is surely happy to find it.
You also looked at the so-called citation advantage and estimate that an OA
article is likely to attract 18% more citations than average. The citation
advantage is a controversial topic. I don’t want to appear too cynical, but is
not the idea of trying to demonstrate a citation advantage more an advocacy
tool than a meaningful notion? I note, for instance, that Academia.edu has claimed
that posting papers to its network provides a 73% citation advantage. Surely
the real point here is that if all papers were open access there would be no
advantage to open access from a citation point of view?
HP: That’s true! And that’s the world I’d love to see – one where the citation playing
field is flat, because everyone can read everything.
What would you say were the implications of your study for the research
community, for librarians, for publishers and for open access policies?
HP: For the research community: Install
Unpaywall! You’ll be able to read half the literature for free. Self-archive
your papers, or publish OA.
For OA/bibliometrics researchers: Build on our open data and code, and let’s learn more
about OA and where it’s going.
For librarians: Use this data to negotiate with publishers: Half the literature is
free. Don’t pay full price for it.
For publishers: Half the literature is now free to read. That percentage is
growing. You don’t need a weathervane to know which way the wind blows: long
term, there’s no money in selling things that people can get for free. Flip
your journals. Sell services to authors, not access to content – it’s an
increasingly smart business decision, as well as the Right Thing To Do.
For open access policy makers: We need to understand more about Bronze. Bronze OA
doesn’t safeguard a paper’s free-to-read status, and it isn’t licensed for
reuse. This isn’t good enough for the noble and useful content that is
Scholarly Research. Also: let’s accelerate the growth.
You didn’t ask about tool developers. An increasing number of people are making
tools that they can integrate OA into. They should use the oaDOI service. Now
that such a large chunk of the literature is free, there are a lot of really
transformative things we can build and do – in terms of knowledge extraction,
indexing, search, recommendation, machine learning etc.
OA was at the beginning as much (in fact more) about affordability as about
access (certainly from the perspective of librarians). I note the recently
of the RCUK open access policy reports that the average APC paid by RCUK rose
by 14% between 2014 and 2016, and that the increase was greater for those
publishers below the top 10 (who are presumably focused on catching up with
their larger competitors). Likewise, the various flipping deals we are seeing
emerge are focused on no more than transferring costs from subscriptions to
APCs, with no realistic expectation of prices falling in the future. If the
research community could not afford the subscription system (which OA advocates
have always maintained) how can it afford open access in the long-term?
HP: If the rising APCs are because
small publishers are catching up with the leaders by raising prices, that won’t
continue forever – they’ll catch up. Then it’ll work like other competitive markets.
The main issue is freeing up the money that is currently spent on subscriptions. We
think studies like this, and tools like Unpaywall, can be helpful in lowering
subscription rates, and forgoing Big Deals, as libraries are increasingly doing.
As you say, in your study you ignored social networking sites like Academia.edu
and ResearchGate “in accordance with an emerging consensus from the OA
community, and based largely on concerns about long-term persistence and
copyright compliance.” And you also say, “The growing proportion of OA, along
with its increased availability using tools like oaDOI and Unpaywall, may make
toll-access publishing increasingly unprofitable, and encourage publishers to flip
to Gold OA models.” I am wondering, however, if it is not more likely that
sites like Academia.edu (which researchers much prefer to paying to
publish or depositing in their repository) and Sci-Hub (which is said to contain most of the scientific
literature now) will be the trigger that will finally force legacy
publishers to flip their journals to open access, whatever one’s views on the
copyright issues? Would you agree?
HP: It won’t be any one trigger, but
rather an increasingly inhospitable environment. Sci-Hub is a huge contributor
to that, and Academic Social Networks are too. Unpaywall opens up another
front: a best-practice, legal approach to bypassing paywalls that librarians
and others can unabashedly recommend. It all combines to make it easier and
more profitable for publishers to flip, and for the future to be OA.
Thank you for answering my questions.