Utopia Documents is a novel tool
for interacting with the scientific literature. First released in 2009, it is a free PDF reader that can connect the static content of scientific articles to the dynamic world of online content.
This week Utopia will be released as an open source project. It will also become the platform for a new crowdsourcing tool called Lazarus, which aims to recover large swathes of the legacy data currently imprisoned in the charts, tables, diagrams and free-text of life science papers published as PDF files. This information will then be made available in an open access database.
The developer of Utopia is computer
scientist Steve Pettifer, currently based at the University of Manchester. In a recent email
conversation Pettifer explained to me the background to Utopia, and what he
hopes to achieve with Lazarus.
One
of the long-standing debates within the open access movement is whether
priority should be given to advocating for gratis OA (no cost
access to read research papers), or libre OA (no cost
access to read plus the right to reuse/repurpose
papers).
Advocates
for libre OA argue that since the benefits it provides are much greater than those of gratis OA, libre OA should be prioritised. Advocates for gratis OA respond that since
gratis OA is achievable much more quickly and easily (and without additional
cost to the research community), it should
be prioritised. Besides, they add, very few researchers want to reuse research
papers in any case.
In
reply to this last point, libre OA advocates retort that the issue is not just
one of reuse, but of having the ability to text and data mine papers in order to create new services and databases and to generate new knowledge. For this reason, they
say, it is vital that papers are licensed under permissive Creative Commons
licences that allow reuse (i.e. libre OA).
Passive reading
For
similar reasons libre OA advocates dislike the widespread use of PDFs today.
Designed to ensure that the (print-focused) layout of a document looks the same whatever system it is displayed on, Adobe's PDF format is not conducive to text mining. So while it is fine for human readers, computers struggle to make sense of a PDF.
It
may, for instance, not include information about who authored the document or
the nature of the content in a form that machines can understand, since this
would require the inclusion of metadata. While metadata can be inserted into PDF files, publishers and authors rarely go to the effort of doing so. For the same reason, PDFs generally do not have an explicit machine-readable licence embedded in them to signal what can legally be done with the content.
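By way of illustration, the short sketch below shows how document metadata, including a licence pointer, could in principle be embedded in and read back from a PDF. It is only a minimal sketch using the pypdf library: the "/License" key is an illustrative convention rather than part of the PDF standard (real workflows would more likely use XMP metadata), and the file names are placeholders.

```python
# A minimal sketch (not a standard workflow): embedding and reading document
# metadata with pypdf. The "/License" key is an illustrative convention.
from pypdf import PdfReader, PdfWriter

def add_metadata(src_path: str, dst_path: str) -> None:
    """Copy a PDF and attach basic machine-readable metadata."""
    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.add_metadata({
        "/Author": "A. Researcher",
        "/Title": "An example article",
        "/License": "https://creativecommons.org/licenses/by/4.0/",
    })
    with open(dst_path, "wb") as fh:
        writer.write(fh)

def show_metadata(path: str) -> None:
    """Print whatever document-information metadata the PDF carries."""
    info = PdfReader(path).metadata or {}
    for key, value in info.items():
        print(f"{key}: {value}")

if __name__ == "__main__":
    add_metadata("article.pdf", "article_with_metadata.pdf")  # placeholder file names
    show_metadata("article_with_metadata.pdf")
```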
In
addition, any diagrams and charts in a PDF file will be static images, so
machines cannot extract the underlying data in order to reuse or process the
information.
Critics
of the PDF also dislike the fact that it permits only passive reading. This
means that scientists are not fully able to exploit the dynamic and linked
nature of the Web. In fact, researchers often simply print PDF files out and
read them offline. For these reasons, libre OA
advocates,
computer scientists, and forward-looking publishers (particularly OA
publishers) are constantly trying to wean researchers off PDFs in favour of
reading papers online in HTML.
Over
a decade ago, for instance, the Biochemical
Journal spent a great deal of time and effort revamping its site. It did
this sufficiently well that it won the 2007 ALPSP/Charlesworth
Award for Best Online Journal — on the grounds that it had successfully
“overcome the limitations of print and exploited the flexibility of the digital
environment”.
But
to the frustration of the journal’s publisher — Portland Press — despite all
its efforts scientists simply carried on downloading the papers as PDF files.
Researchers,
it turns out, still much prefer PDFs.
The question, however, is this: do PDF files allow scientists to make best use of the Web? This thought occurred to Steve Pettifer in 2008, as he watched a room full of life scientists trying to combine the work of two separate labs by downloading PDFs, printing them off, and then rapidly scanning the information in them. Surely, he thought, this is not a very efficient way of doing science in the 21st century?
Since
Portland Press had reached the same conclusion it offered to fund Pettifer and
his colleague Terri Attwood to come up
with a solution that would combine the appeal, portability, and convenience of
the PDF with the dynamic qualities of the Web.
The
outcome was Utopia Documents.
Utopia
Superficially,
Utopia Documents is just another PDF reader. Unlike other readers, however, it comes
with a number of novel interactive features. When a paper is loaded into it,
for instance, a sidebar opens up on the right-hand side and fills with relevant
data from external databases and services like Mendeley, SHERPA/RoMEO, and Wikipedia.
So, for instance, Utopia will pull from SHERPA/RoMEO the OA status of the paper and the self-archiving policy of the journal in which it was published. It also lists related papers from Mendeley; and where papers cited in the article are freely available on the Web it inserts live links that enable the user to download those papers into the viewer alongside the original, where they can then be read or saved.
There
is also a figure browser that lets the user flip through all the images in the
document, and a page browser for jumping between pages.
Other interactive features include the ability to 'play' and rotate molecular structures and to browse protein sequences; and where there is a reference to a drug, the molecular structure and formula of the compound can be pulled in.
In
addition, it is possible to look up words and phrases in the paper by
highlighting them and clicking “explore” from a popup menu. Amongst other
things, this function allows structures from the Protein Data Bank to be pulled
in, as well as associated laboratory products, related articles from PubMed,
and news from SciBite.
In
some cases it is also possible to manipulate and plot the data in the paper’s tables
as simple graphs, and to “play” 3D structures of proteins. Registered users can
also comment on the paper in the hope of starting a conversation about it with
other users.
So
how does Utopia turn a static PDF file into a dynamic interactive document? Pettifer
explains: “Where Utopia is able to find the machine-readable version of a paper
online, either because it exists in a well-known open place (such as PubMed Central) or because a
publisher has told us how to get hold of it, we can consume that version on the
fly and use it to enrich the PDF version by finding references, extracting
tables of data and so on.”
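To give a flavour of what consuming a machine-readable version "on the fly" can involve, here is a rough sketch (not Utopia's own code) that fetches the full-text XML of an open access article from PubMed Central via the NCBI E-utilities service and lists the titles of the works it cites. The PMC identifier is a placeholder.

```python
# A rough sketch (not Utopia's code): fetch the machine-readable XML of an
# open access article from PubMed Central and list the references it cites.
import urllib.request
import xml.etree.ElementTree as ET

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_pmc_xml(pmcid: str) -> ET.Element:
    """Download the JATS XML for an open access article from PubMed Central."""
    url = f"{EFETCH}?db=pmc&id={pmcid}&retmode=xml"
    with urllib.request.urlopen(url) as response:
        return ET.fromstring(response.read())

def list_reference_titles(article: ET.Element) -> list[str]:
    """Pull article titles out of the reference list, where present."""
    titles = []
    for ref in article.iter("ref"):
        title = ref.find(".//article-title")
        if title is not None and title.text:
            titles.append(title.text.strip())
    return titles

if __name__ == "__main__":
    root = fetch_pmc_xml("PMC1234567")  # placeholder identifier
    for title in list_reference_titles(root):
        print(title)
```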
Pettifer adds, “Where we can’t find any semantic information inside the PDF itself
(publishers rarely put this in) or online anywhere, Utopia tries to guess what’s
going on by reconstructing some of the semantics from the typographic layout of
the article.”
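The kind of guesswork Pettifer describes can be illustrated with a deliberately crude heuristic: text set in a noticeably larger font than the body text is probably a heading. The sketch below applies that rule using the pdfminer.six library; Utopia's own layout-analysis algorithms are, needless to say, considerably more sophisticated.

```python
# A crude illustration of recovering structure from typography with pdfminer.six.
# This single heuristic (big font = heading) is far simpler than Utopia's approach.
from collections import Counter
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

def guess_headings(path: str, ratio: float = 1.2) -> list[str]:
    """Treat lines set noticeably larger than the modal (body-text) font size
    as headings."""
    lines = []  # (font size, text) pairs
    for page in extract_pages(path):
        for element in page:
            if not isinstance(element, LTTextContainer):
                continue
            for line in element:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                if sizes:
                    lines.append((max(sizes), line.get_text().strip()))
    if not lines:
        return []
    body_size = Counter(round(size) for size, _ in lines).most_common(1)[0][0]
    return [text for size, text in lines if text and size > body_size * ratio]

if __name__ == "__main__":
    for heading in guess_headings("article.pdf"):  # placeholder file name
        print(heading)
```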
Pragmatic compromise
It
is important to note, however, that Pettifer does not assume that Utopia is a
long-term solution. Rather, he says, it is a pragmatic compromise in recognition of researchers’ continuing preference for PDFs.
Certainly
we should not doubt that scientists continue to prefer the PDF. A survey undertaken by
Nature earlier this year, for
instance, suggested that the overwhelming majority of researchers still view the
PDF as the best way to read scholarly papers. Specifically, 39.24% of those
surveyed said they preferred to read academic papers in PDF on their desktop
computer, 43.4% preferred printed PDF, and 11.28% preferred reading PDFs on a tablet (giving a total of 93.92% who preferred the PDF format). Only 4.34% said
they preferred HTML.
Nevertheless,
suggests Pettifer, we can expect this to change — for a number of reasons.
First, as more and more people start to read scientific articles on handheld
devices the appeal of the PDF will likely wane. “A PDF can be quite pleasant to
read on a big screen (centuries of typographic craft have gone into working out
how to lay things out ‘on paper’), but is often fairly awful on a tablet or
mobile phone, especially if it’s a two-column format.”
Second,
suggests Pettifer, the growth of OA will also likely change scientists’ habits,
since he suspects that the popularity of the PDF is partly a consequence of closed
access publishing. “I’m fairly sure that part of the motivation for hoarding
PDFs is that we know they can’t then be taken away from us if we move
institutions, or our subscriptions to a particular journal expire. I think that
if I could be certain that an article was always accessible to me wherever I
happened to be because it’s open access, I’d personally be more comfortable
with reading it on line rather than cluttering up my hard disc with files.”
Third,
Pettifer says, research papers will increasingly have to be read by machines
rather than by humans; and, as we saw, PDF files are not exactly
machine-friendly.
“We
are already way past the point where any single person, or even research group,
can hope to read all the relevant literature in even fairly niche areas of
science, so we have to assume that the primary consumers of scientific ‘papers’
in the future are going to be machines,” says Pettifer. “And that means that we
have to create content that is suitable for them first and foremost.”
Machines first
Clearly,
therefore, in the future papers will need to be comprehensible to machines as
well as to humans. However, says Pettifer, this need not be problematic, so
long as the needs of the machine are prioritised. “[M]achines can turn their
formats into ones suitable for human consumption much more easily than going
the other way round. So it really doesn’t matter hugely whether we have PDFs
and ePubs and Mobi and HTML and
so on being created all at the same time, as long as these are all generated
from the same machine-readable source.”
One
can therefore envisage a future in which the default format for research papers
will be a machine-readable one, but it will be possible to create human-readable
versions as and when needed — a variation on print-on-demand perhaps?
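To make the "machine-readable first" idea concrete, the toy sketch below projects a minimal, JATS-like XML source into HTML. The element names and content are simplified placeholders; a real pipeline would start from full JATS and generate PDF, EPUB and other formats from the same source.

```python
# A toy sketch of the "machine-readable first" idea: one XML source,
# human-readable HTML generated from it on demand. The markup here is a
# simplified, JATS-like placeholder, not a real publishing schema.
import xml.etree.ElementTree as ET
from html import escape

SOURCE = """
<article>
  <title>A consensus model of yeast glycolysis</title>
  <author>A. Researcher</author>
  <section heading="Introduction">
    <p>Two independent labs developed models of yeast glycolysis.</p>
  </section>
</article>
"""

def to_html(xml_text: str) -> str:
    """Project the machine-readable source into a simple human-readable page."""
    article = ET.fromstring(xml_text)
    parts = [
        f"<h1>{escape(article.findtext('title', ''))}</h1>",
        f"<p><em>{escape(article.findtext('author', ''))}</em></p>",
    ]
    for section in article.findall("section"):
        parts.append(f"<h2>{escape(section.get('heading', ''))}</h2>")
        parts.extend(f"<p>{escape(p.text or '')}</p>" for p in section.findall("p"))
    return "\n".join(parts)

if __name__ == "__main__":
    print(to_html(SOURCE))
```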
Given
all this, we might be inclined to conclude that Utopia’s usefulness will be
short-lived. In reality, however, it may be that its real value has yet to be
realised.
Consider, for instance, that for many research papers today the only readers will be the author, the editor and the reviewers. How many papers are never read by anyone other than this small group is a source of
disagreement,
but it is widely assumed that many papers are never downloaded and/or cited. As
it happens, this may be the fate of a great many of the PDF files lying around on
the Web. A recent report by the World Bank, for instance, concluded that nearly
one-third of its PDF reports have never been downloaded. Another 40 percent have
been downloaded fewer than 100 times, and only 13 percent have seen more than
250 downloads.
Commenting on the World Bank’s findings on the Washington Post website, Christopher Ingraham
said, “Not every policy report is going to be a game-changer, of course. But
the sheer numbers dictate that there are probably a lot of really, really good
ideas out there that never see the light of day. This seems like an inefficient
way for the policy community to do business, but what’s the alternative?”
Loading
the Web with PDF files of research papers that may never be read by other
scientists would seem to be an equally inefficient way for the research
community to do business.
But
suppose there were an alternative? Suppose that all the data and good ideas in those PDFs could be extracted and put in an OA database that scientists could search?
Lazarus
This,
in fact, is the next step for Utopia. Thanks to a grant Pettifer and
his colleagues received earlier this year from the Biotechnology and Biological Sciences Research Council (BBSRC), Utopia will become the platform for a new crowdsourcing project called Lazarus that aims to
recover large swathes of legacy data buried in the charts, tables, diagrams and
free-text in the multitude of life science papers that have been published.
Once extracted, this information will be converted into processable data and placed
in a searchable database.
The
logic of doing this is surely sound. In its current form this information may
not only never be read by researchers but, as things stand, it cannot be mined,
validated, analysed or reused by machines either — unless, that is, someone was
prepared to go to a great deal of effort to recover it manually.
So,
for instance, small molecules in PDF files are typically represented as static illustrations,
biochemical properties as tables or graphs, and any protein/DNA sequences are generally
buried in the text. None of this is understandable to machines today. In
addition, references and citations will likely be in arcane formats, and other
objects of biological interest referred to by ambiguous names.
To release this data manually would require re-typing figures from tables, checking citations in digital libraries, redrawing molecules by hand, and so on — a highly time-consuming task.
In
theory, of course, mass-mining methods (text mining, optical recognition) could
be used to automate the task. However, the technology for doing this is not yet
sufficiently reliable to be used without human validation. In addition, the licences under which the articles are published will invariably prevent the “bulk processing” that this implies. Bottom line: without “human computation”,
this knowledge is destined to remain entombed in the literature for the
foreseeable future.
Unless,
that is, tools like Utopia are deployed. Currently, when a scientist loads a
paper into Utopia the locked data is only freed temporarily, since the text
mining and semantic analysis Utopia performs when someone reads a paper is
thrown away when the article is closed. With Lazarus, however, all the
non-copyrighted bits of data released when an article is read will be collected
and put into a central repository by the scientists participating in the
project.
Web-based observatory
And since this data will be unlocked on an individual-researcher basis, with users extracting only the facts, it cannot be said to infringe copyright. Pettifer explains:
“[W]e’re not storing or redistributing any copyrightable material, just data
and ‘citations’ in the form of nanopublications. And we’re not ‘bulk processing’
anything at all; simply collecting together the insights of a lot of individual
people and, with their permission, putting them in a place where others can
find them.”
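To give a rough sense of what such a "citation in the form of a nanopublication" might contain, the sketch below assembles a single extracted fact, together with its provenance, using the rdflib library. A formal nanopublication separates assertion, provenance and publication information into named graphs; this flattened version, with placeholder URIs and vocabulary, merely illustrates the kinds of statements involved and is not Lazarus's actual schema.

```python
# An illustrative sketch (not Lazarus's schema) of a nanopublication-style record:
# a single extracted fact, the article it came from, and the reader who extracted it.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/lazarus/")   # placeholder namespace
PROV = Namespace("http://www.w3.org/ns/prov#")

def build_record(doi: str, subject: str, relation: str, obj: str, user: str) -> Graph:
    """Assemble one assertion plus minimal provenance as an RDF graph."""
    g = Graph()
    assertion = URIRef(EX[f"assertion/{subject}-{relation}-{obj}"])
    g.add((assertion, RDF.type, EX["Assertion"]))
    g.add((assertion, EX["subject"], Literal(subject)))
    g.add((assertion, EX["relation"], Literal(relation)))
    g.add((assertion, EX["object"], Literal(obj)))
    # Provenance: which article the fact was read in, and who extracted it.
    g.add((assertion, PROV["wasDerivedFrom"], URIRef(f"https://doi.org/{doi}")))
    g.add((assertion, PROV["wasAttributedTo"], URIRef(EX[f"user/{user}"])))
    return g

if __name__ == "__main__":
    record = build_record(
        doi="10.1000/example",   # placeholder DOI
        subject="ibuprofen",
        relation="affects",
        obj="inflammation",
        user="reader42",
    )
    print(record.serialize(format="turtle"))
```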
Clearly,
the secret will lie in persuading a sufficient number of researchers to take
part. How popular has Utopia proved to date? Unfortunately, we don’t know. “It’s horribly hard to tell right now,” Pettifer says, “because it’s a desktop application that doesn’t phone home and there’s no requirement to register, the only thing we can go off is downloads, which have been steady at around 50k per year for the past couple of years.”
By
contrast, he adds, when Lazarus kicks off “we’ll end up with much more
meaningful statistics because user provenance will get recorded against
behaviour.”
It
is worth stating that by taking part in the Lazarus project researchers will
not only help themselves, but benefit their community too. To stress this
point, and encourage scientists to take part, Pettifer plans to create a web-based
observatory. This will also gather per-article metrics and help observe and steer the crowd toward data-resurrection campaigns.
To
further incentivise researchers to take part, Pettifer and his team are
releasing the source code of Utopia Documents. This, says Pettifer, “opens up
opportunities for collaboration with other open source projects, and could
maybe even get the developer community interested enough to contribute new
stuff.”
If
a searchable crowd-sourced resource like Lazarus had existed in 2008 then
presumably the life scientists that Pettifer observed trying to combine the work of two labs could have achieved their objective both more quickly and more easily than by printing out and scanning multiple PDF files.
It
is also clear that libre OA has a great deal to offer the research community. On
the other hand, of course, the way in which Utopia/Lazarus works suggests that
gratis OA would be sufficient for Pettifer’s purposes. Moreover, as a result of
the Hargreaves
rules (which came into
force on June 1st) text mining will be less hampered by
copyright restrictions in future — in the UK at least.
To
find out more about Utopia, Lazarus, and Pettifer’s views on the future of
scholarly communication please read the interview below.
The interview begins
RP:
Can you say something briefly about yourself and your research interests?
SP: I’m a computer scientist. My
original research was in distributed virtual reality, trying to figure out how
humans could inhabit a shared, consistent virtual environment when the network
that connects them is relatively slow and unreliable compared with our experience
of ‘causality’. We developed the kinds of techniques that now make multiplayer
console games commonplace. From there I moved into visualising scientific data,
and started working with biophysicists and chemists to tackle some of the
problems in those domains.
RP:
How did your earlier research feed into your current research?
SP: What I realised working with
life scientists is that they have a very different take on ‘the literature’ to
computer scientists. Whereas we tend to use traditional publication as a reason
to attend conferences and workshops and for putting achievements ‘on record’,
in the life sciences the literature is a much more central part of the
knowledge base, and it was quite an eye-opener to realise how crude the tools
were for interacting with the literature.
So
we started looking at whether any of the ‘distributed visualisation’ techniques we’d been working on in other areas could improve the way that scientists
interact with scholarly articles. I guess the common theme here is trying to
understand what humans are trying to achieve, and how the distributed nature of
the internet can help or hinder this.
RP:
Why Utopia Documents?
SP: Two things happened independently, which jointly convinced me that a new tool was needed.
The
first was that I was working with Prof Doug Kell, and taking
part in a curiously named ‘Yeast Jamboree’. Two independent labs had been
developing a computational model of yeast glycolysis, and this was a
get-together to merge the best bits to create a consensus model [1].
This
was the first time I’d seen scientists interact with ‘the literature’ in anger;
a room full of people, furiously downloading PDFs, scanning them for evidence
that a certain bit of the model was good or bad, and cross-checking facts with
other papers and online resources. There were PDFs, printed and electronic all
over the place, and this convinced me that there must be a better way of doing
things.
The
second was that Prof Terri
Attwood
had been approached by Rhonda Oliver and Audrey McCulloch, who at the
time were working for Portland Press. They were
interested in exploring whether anything could be done to improve the quality
and utility of the auxiliary materials in their flagship publication, The
Biochemical Journal.
Portland
Press had put a lot of effort into building their own publishing platform
(which I believe won the ALPSP/Charlesworth
Award
a while back), but in spite of all the technical innovations around the HTML
view, they were still seeing users downloading static PDFs far more often than
consuming the online content.
Rhonda
and Audrey eventually took the brave move of funding a one year project to
build a tool that would help bring PDFs to life by dynamically linking content
to online tools and databases. And from that was born Utopia Documents.
RP:
What year did development on Utopia start?
SP: 2008; we launched the software
at an event in the British Library on December 10th 2009.
RP:
Is there a company behind Utopia Documents and do you earn revenue from the
reader, or is it exclusively a research project?
SP: Utopia was originally a
research project funded by Portland Press, then AstraZeneca and Pfizer. The IP is now owned by a spinout company
called Lost Island Labs, and although it’s legally a limited company we have
mostly treated it as a research account into which we can put any money we make
from selling bespoke versions of the software so that we can keep the team
going ‘between grants’ (unlike publishing an ‘idea’ in the literature, software
requires a certain amount of ongoing maintenance, and grant income is erratic,
so we decided we needed some mechanism of smoothing the income out).
More
recently we’ve been able to get a BBSRC grant for
harvesting data from the literature that will sustain the team for another
three years, so we’re in reasonably good shape from that point of view. We also
have a number of other plans for Utopia, some of which will help with funding.
For good or ill
RP:
Ok, can you first say something about what Utopia Documents does?
SP: The premise of Utopia
Documents is that for good or for ill, scientists still prefer PDFs to HTML
representations of articles.
So
rather than ignoring this fact and trying to force scientists to use a format
they are still largely uncomfortable with, Utopia attempts to blend the best of
‘the web’ with some of the nice qualities of PDFs.
The
main technical challenge is that although it’s possible to encode much of the
same ‘semantic structure’ in PDFs as it is in HTML, publishers generally don’t
do this. So to bring PDFs alive we had to develop algorithms that ‘retrofit’
much of the semantics by analysing the structure and content of a PDF article.
This
involves a blend of algorithms from computer graphics and vision for
recognising the ‘typography’ of an article, as well as text mining techniques for recognising its content.
RP:
So Utopia can link papers to external databases and to services like Mendeley, SHERPA/RoMEO, PubMed Central, SciBite and Wikipedia, all of which
makes reading a paper a more dynamic experience. Additionally, where papers
cited in the PDF file are available on an open-access basis, users can link
directly to those articles. Apart from this last feature, what, if any,
additional features does open access enable Utopia to provide? To what extent
(if any) was Utopia envisioned as an open-access tool?
SP: Apart from the ability to fetch
papers automatically, Utopia’s functionality relies less on open access and
more on whether publishers have machine-readable versions of their articles in
an accessible place (which could be behind a firewall, but accessible via IP
authentication).
Where
Utopia is able to find the machine-readable version of a paper online, either
because it exists in a well-known open place (such as PubMed Central) or because a
publisher has told us how to get hold of it, we can consume that version on the
fly and use it to enrich the PDF version by finding references, extracting
tables of data and so on.
Other
features, like browsing by image or being able to get data on biological terms that the user highlights, are completely independent of publishers; all you have
to have for those to work is the PDF itself.
Where
we can’t find any semantic information inside the PDF itself (publishers rarely
put this in) or online anywhere, Utopia tries to guess what’s going on by
reconstructing some of the semantics from the typographic layout of the article
(you can try out the PDF to XML conversion
outside of Utopia here).
RP:
Can you say something about the bespoke versions of Utopia and why they are
needed? (Presumably these are for pharmaceutical companies like AstraZeneca and
Pfizer?)
SP: There are two reasons that companies pay for
the commercial version. The first is that the free version is quite promiscuous
in terms of the relationships it has with other free-to-use online resources;
so as you look up terms in papers, it’s communicating with bio-databases such
as UniProt or PDB to find
definitions and information about biological entities in the paper you’re
reading.
For
most users that’s no problem — it’s no different from interacting with those
databases via the web — but for pharma companies, the mere fact that you’re
interested in a particular protein or drug is commercially sensitive.
So
we create versions of the software that can work behind a company firewall, so
that commercial users can be sure they are not leaving any fingerprints behind
on the internet that could cause their companies problems with IP later on.
The
other thing we do is create connections between Utopia and whatever in-house
knowledge bases the companies have themselves; so when commercial users
interact with articles they can see what their company knows about a particular
drug/transporter/gene as well as information that’s in the public domain.
Hamburger to cow
RP:
I’m thinking that Utopia doesn’t really solve the problem that is exercising
many minds in the OA movement today — how to enable effective text and data
mining?
SP: It certainly doesn’t address
the real problem, which is the licensing/legal one, but I think the software we’ve
built for turning un-semantic PDFs into something a machine can get traction on
goes a long way towards this.
It
would be much better if the articles were just created with the right markup in
the first place of course; turning a hamburger back into a cow is a rather
messy business.
RP:
The PDF format is increasingly frowned upon by scientists as a vessel for
scientific papers (indeed, researchers have started to hold conferences with
titles like “Beyond the PDF”). However, as you point out, for
good or ill most still prefer to download PDFs. Perhaps it is for this reason
that you describe Utopia as a compromise. But does this mean that you view Utopia
as a stopgap solution until a better one emerges, or do you expect to see more dynamic readers like Utopia being developed and used on a long-term basis?
SP: I take issue with the idea
that the PDF is frowned upon by scientists: there are definitely a very vocal
few who really don’t like it (and I completely understand why, and largely
agree with them), but compared with the wider body of scientists it’s a tiny
minority.
If
you do a show of hands at any scientific meeting outside of the Beyond The PDF
/ FORCE11 community,
the vast majority — and I mean 90% or more — of scientists say they primarily
interact with the literature by downloading PDFs; and even those that
occasionally read or browse articles online will also snag a copy of the PDF
version to keep for later reading and reference.
Most
of the PDFs produced by today’s publishers are really awful for machines to
read (and that’s as much because publishers don’t properly use the features the format has to offer). As I indicated, that’s a real impediment for people wanting
to do data or text mining (though less of an impediment than the licensing
issues that typically affect the same content). Which is why for Utopia to work
we’ve had to go to the lengths of writing software to try to recreate the
semantic structure of PDFs from their visual appearance.
We
built Utopia as a pragmatic compromise; it would be great if we didn’t have to
deal with PDFs, but the reality is that for now we do.
The
cracks are beginning to show for PDF as a human-readable format too, as more
people shift to small form factor devices for reading scientific articles. A
PDF can be quite pleasant to read on a big screen (centuries of typographic
craft have gone into working out how to lay things out ‘on paper’), but is
often fairly awful on a tablet or mobile phone, especially if it’s a two-column
format.
So
I think as these devices become more common, we will finally see a move away
from the PDF towards something like ePUB or just HTML that can reflow sensibly
to fit on a tiny screen.
RP:
There is a good example of this transition here, where the
main publishing format of the Code4Lib journal has
been changed from PDF to EPUB. This was done, the publisher explains, because
of the growing use of e-book readers plus the accessibility problems posed by
PDF. In terms of other solutions, I suppose we are mainly talking about XML and
RDF? But how
would you distinguish what you are doing with Utopia from what eLife is doing
with the Lens Project, Elsevier is doing with its Article of the Future and “executable papers” projects,
and Wiley is doing with its Anywhere Article? Is it that
these latter approaches (like that of Portland Press) focus on screen-based HTML solutions — a process I think is now called “enhanced publication” — whereas Utopia extracts and imports metadata from machine-readable versions of papers elsewhere, combined with guesswork/reverse engineering?
SP: Yes; well mostly yes anyway! Publishers
are definitely getting better at making the online HTML reading experience a
pleasant one, but there’s still a very long way to go. I’d say that eLife and PeerJ are way ahead
of the game on this front; they have clean, responsive designs and have avoided
cluttering up the article page with dross that I really don’t care about when I’m
doing ‘proper’ deep reading. They produce excellent PDFs of their content too. eLife’s Lens is particularly cool; the Article of the Future, on the other hand, I find a
very frustrating experience, especially on small screens, so even more of an
incentive for me to grab the PDF.
In
any case scientists are, somewhat ironically, often quite conservative in their
adoption of new technology — there’s still a lot of people who just print PDFs
out on paper. So I think that although the PDF’s life is limited, it’ll be quite a few years yet before it vanishes altogether (and there’s still the issue of dealing with all the pre-XML legacy content that exists only in PDF form).
What
I have concluded recently though is that a big part of the popularity of PDF is
a side-effect of closed access publishing. I’ve not done a proper study of
this, but I’m fairly sure that part of the motivation for hoarding PDFs is that
we know they can’t then be taken away from us if we move institutions, or our
subscriptions to a particular journal expire.
I
think that if I could be certain that an article was always accessible to me
wherever I happened to be because it’s open access, I’d personally be more
comfortable with reading it online rather than cluttering up my hard disc with
files.
RP:
Might it rather be that many remain far more comfortable reading research
papers on paper? As you say, many scientists still print out PDFs and read them
offline?
SP: I think there’s some truth in
that. I’ve largely weaned myself off printing articles to paper, but still find
myself doing it very occasionally if I know I’m going to be doing a lot of
annotation or want a really un-disturbed read.
But
I’m pretty sure that eReaders and tablets will start to have an effect on this
in a way that desktop reading hasn’t so far.
RP:
From what you say I assume Utopia is able to do more with papers from some
publishers than with papers from others — where, for instance, the publisher
encodes some semantic structure into its PDFs. Or is Utopia able to guess as
accurately as if it had access to the metadata?
SP: It definitely varies; if we
can find definitive metadata online then we can do more stuff more accurately.
If we can’t find it, then we have to apply heuristics to recover the semantics,
and that can be hit and miss. Generally we’re pretty good at it now, but we can’t
guarantee to get everything right all the time.
Largely automatic
RP:
I understand the Biochemical
Journal was marking up its papers for specific
use in the Utopia Reader. Is it still doing this, and have other publishers begun
to do so?
SP: Yes and no. Utopia’s original
emphasis was on editorial annotation of articles — something that boutique
publishers like Portland can do because they have dedicated in-house editors,
and that large-scale publishers such as Springer and Elsevier generally can’t
do because they rely more on academic good will.
Over
the years as named-entity
recognition
and our semantic analysis of the article have improved, we’ve found there’s less of a benefit to the manual annotation in any case; so now it’s largely automatic once we have found the article’s metadata.
RP:
Do you think the need for enhanced publication is more of an issue for
scientific (STEM) researchers than it is for humanities and social science
(HSS) researchers?
SP: I’m not sure I know the other
fields well enough to comment really; the Life Sciences in particular are very
well served by ontologies, databases and online tools, so Utopia works very
well in that field.
Elsewhere
there are fewer services we can call on, so less we can do with articles beyond
extracting tables and references.
RP:
You said Utopia was a “pragmatic compromise” and that you expect a move away
from PDF going forward. Can you say more about how you see scientific
publishing developing in the future?
SP:
My
take on this is that as the amount of stuff getting published increases we are
going to be more and more reliant on machines to help us figure out what to
read; and that means they have to do some of the reading for us.
We
are already way past the point where any single person, or even research group,
can hope to read all the relevant literature in even fairly niche areas of
science, so we have to assume that the primary consumers of scientific ‘papers’
in the future are going to be machines. And that means that we have to create
content that is suitable for them first and foremost.
The
nice thing about doing that (which for now means XML and RDF, but in the future
could be some other machine-readable format du jour, it really doesn’t matter
too much) is that machines can turn their formats into ones suitable for human
consumption much more easily than going the other way round.
So
it really doesn’t matter hugely whether we have PDFs and ePubs and Mobi and HTML and so on being
created all at the same time, as long as these are all generated from the same
machine-readable source.
RP:
So you are saying that increasingly the focus of scholarly publishing is going
to shift to creating machine-readable files from which PDFs and other
human-friendly formats can be created on the fly (as and when needed) — which
is the reverse of what Utopia currently does (cow to hamburger vs. hamburger to
cow)?
SP: Essentially yes; at the moment
Utopia has to guess at a lot of the semantics; if those were available in
machine-readable form we wouldn’t have to do that. Instead we could just ‘project’
those semantics onto whatever format the user happens to like, whether that’s
PDF, ePubs or whatever.
The
other crucial aspect is not just shipping the text and images around in a
machine-readable container, but of making the content amenable to machine
processing; and that means as authors / editors / publishers we’re going to
have to get better at bridging the gap between the ambiguities of natural
language (which is nice for humans to read) and formal notations that computers
can make sense of.
At
the moment these are worlds apart; computers struggle to make sense of
scientific prose, and humans find it very hard to write things in a way a
machine can interpret. I think that if we can’t bring these things together
somehow, we’re going to be in trouble fairly soon.
The bigger problem
RP:
Do you have a sense of how that might be done and what might hold it back?
SP: There’s a really interesting
bit of work going on in this space at the moment called the Resource Identification Initiative, which is trying to encourage authors and publishers to include machine-readable
identifiers in the text of their articles; it’s very early days yet, but I
think if it gets wider uptake it could be a really important step towards
making the important bits of articles accessible to machines as well as humans.
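For readers unfamiliar with the initiative, its Research Resource Identifiers (RRIDs) appear in article text as plain strings with a predictable pattern, which makes them easy for software to pick out. Below is a minimal sketch; the specific identifiers in the example passage are illustrative.

```python
# A minimal sketch of why machine-readable identifiers help: RRIDs follow a
# simple textual pattern and can be spotted reliably with a regular expression.
import re

# Matches identifiers such as "RRID:AB_2298772" (the examples are illustrative).
RRID_PATTERN = re.compile(r"RRID:\s*([A-Z]+_[A-Za-z0-9]+)")

def find_rrids(text: str) -> list[str]:
    """Return the distinct resource identifiers mentioned in a passage."""
    return sorted(set(RRID_PATTERN.findall(text)))

if __name__ == "__main__":
    passage = (
        "Sections were stained with the primary antibody (RRID:AB_2298772) "
        "and images were processed in ImageJ (RRID:SCR_003070)."
    )
    print(find_rrids(passage))  # ['AB_2298772', 'SCR_003070']
```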
There’s
nothing very technologically complicated involved here; the real problem as far
as I can see is one of tradition and momentum. Part of the problem is that
doing this kind of thing implies a cost to authors in return for a benefit to
readers; it’s easy for us to forget that every scientist wears both those hats
at different times of the day.
The
bigger problem is that today’s publishing industry is based on a business model
that relies on pretending that the internet doesn’t exist. It made sense to
charge for publishing as a service when you were laying things out with lead
type and shipping them round the country on a horse and cart, but we all know
that it’s easy and cheap to get material out on the internet for people to read
with very little technological expertise.
The
only things the industry has left that are of any value are ‘peer review’ — and the value of that is the subject of some debate —
and the very questionable kudos associated with being published in a ‘top
journal’. I’m not saying that ‘self-publishing’ is the way forward, but the
costs of ‘traditional publishing’ seem utterly out of kilter with the claimed
added value.
I’m
pretty sure at some point that funding bodies are going to call shenanigans on
the cost of the current process, so it strikes me that if industry could
position itself as doing something constructive that is a consequence of and in
harmony with the existence of the digital media and the internet, rather than
fighting against it, that would be a good thing.
RP:
You are saying that instead of fighting the exigencies of the Internet publishers
could be doing more to exploit its benefits. What sort of things do you have in
mind?
SP: Helping authors make their
material machine-readable. It doesn’t sound very sexy right now, but I think it
will be crucial in the future, and not everyone wants to have to learn about
identifiers and ontologies. I think that’s something where publishers could add
real value to the material they process, rather than obsessing about typography
and copyright.
RP:
When I spoke recently with COAR’s Kathleen Shearer, we discussed the discoverability problem arising from the failure to attach adequate
metadata to papers posted in institutional repositories. We can expect this
problem to be mitigated, she said, by the use of “new, automated methods for assigning metadata”, and because “repository software platforms can build-in standard vocabularies and metadata elements.” Earlier you said that much of what Utopia
does now is automated. Would I be right in thinking that new methods are being
(or will be) developed to automate the process of making research papers
machine-readable in the first place? If so, this would make it much easier for authors,
and presumably for publishers. But if that is right, what in your view would be
the implications for publishers as they seek to retain a key role in the
digital environment? The value they provide might seem to be becoming
increasingly redundant.
SP: The metadata requirements for ‘discoverability’
and what I’ll loosely call ‘reproducibility’ or maybe ‘verifiability’ are quite
different.
For
discovering content you can afford to have things a bit off because the nature
of the kind of query you’re going to be running is naturally a bit fuzzy (“find
me all the papers to do with skin cancer”). For this, automated techniques can
work pretty well, and people can cope with a few false positives and negatives
much as they do with any online search.
But
when it comes to specifying objects of interest, or being precise about claims
that are made in a paper it’s much harder to develop automatic techniques able
to guarantee accuracy. If a paper talks about a particular gene or antibody, it
would be good to know that the link I have to a database entry is definitively
to the thing the author meant, and not something that a heuristic has guessed
might be right.
So
computers are good at getting together a lot of data for broad overviews, but
when it comes down to detail it’ll be quite some time before we can rely on
them to be as accurate as a human. Of
course to make the associations manually you need a human that understands the
scientific domain, as well as how to represent this in terms of ontologies and
identifiers and such, and who can confirm the result with the original author
in order to get the provenance of how that link came to be made right! So it’s
quite a specialised job.
I
can imagine small or society-scale publishers maybe being able to do this;
whether large throughput publishers could get together teams with the right
experience at the right scale is another matter. But apart from this it’s hard
to see what else publishers can provide that counts as real value in the long
term (I’ll probably never get anything published again now).
RP:
This is what I take from what you have said: 1. Utopia turns static documents into dynamic documents. 2. It links the literature with the underlying data. 3. It provides a better way of meeting the needs of both machines and humans. 4. Something more drastic will need to be done in order to adapt the research paper to the Internet, but publishers are holding back the process of change. Is that correct?
SP: Yes that sums it up nicely.
RP:
I see your new grant is for a project called Lazarus. This aims “to harness the
crowd of scientists reading life-science articles to recover the swathes of
legacy data buried in charts, tables, diagrams and free-text, to liberate
process-able data into a shared resource that benefits the community.” Is this
connected with Utopia, or a development from it? Would you describe it as an
open access project (since presumably the results will be freely available)?
SP: The plan is to use Utopia as a
way of resurrecting the dead knowledge in the literature. At the moment all the
text mining and semantic analysis that Utopia performs when someone reads an
article gets thrown away when they close the article; the idea behind Lazarus
is that non-copyrighted bits of data from the article will get pooled in a
central repository that others can then search.
So
it’s a crowd-sourcing approach to generating a rich knowledge-base about the
literature. All the data generated will be open access, and attributed both to
the original source (so we expect it will drive traffic to the original article
which will keep publishers happy) and to whoever contributed it.
Stories that persuade with data
RP:
You said that the real problem with text mining is a licensing/legal one. We
should perhaps highlight this, not least because it will presumably circumscribe
what you are able to do with Lazarus.
You
will also know that there is a long-standing debate within the open access
movement between those who advocate for so-called gratis OA and those who insist that only libre OA will do. Essentially, gratis OA provides
eyeball access to papers alone, whereas libre OA allows content to be repurposed
and reused (by the use of Creative Commons licensing).
And
the licensing issue is key since, as you pointed out, eyeball access to traditionally licensed papers supplied as PDFs is fine for humans, but poses a serious problem for machines, and for people who want to text mine documents and reuse
and repurpose scholarly papers as you plan to do with Lazarus. I assume,
therefore, that you sit firmly in the libre OA camp?
SP: Very much so. I’ve not seen a
single plausible argument (outside of ‘protecting our business model’) why
anyone would want to publish a scientific article with any license other than
CC-BY.
But
there’s an interesting point here; it’s actually very hard to stop a single
user from extracting knowledge from a single article at a time because it’s
hard to pin down what ‘automated means’ implies.
As
soon as you open a PDF to read or print with any reader, you are processing it
digitally (and to try to prohibit that would make the thing you’ve bought utterly
unusable and therefore probably not a fair sale). I think it would be hard to
object or prevent someone from using the ‘find’ function in a PDF reader to
identify a particular term; and from there doing that repeatedly to find
multiple terms isn’t a huge step. It’s certainly not illegal to then cite an
article to say “I found the words ibuprofen
and inflammation in this paper;
therefore I think it’s a paper about ibuprofen and how it affects inflammation”
and maybe even to quote fragments of it to back your claim up.
And
that’s pretty much what we’re doing in Lazarus; we’re not storing or
redistributing any copyrightable material, just data and ‘citations’ in the
form of nanopublications. And we’re
not ‘bulk processing’ anything at all; simply collecting together the insights
of a lot of individual people and, with their permission, putting them in a
place where others can find them.
No
scientist would trust my claim that this is a paper about ibuprofen and
inflammation without wanting to read the full article in any case (maybe being
directed automatically to the relevant parts via something like Utopia), so I
think this will actually end up driving more traffic to full text articles to
confirm interesting claims.
Moreover,
my understanding is that the Hargreaves rules mean that the right to read is now the right to mine, in the UK at least. So the
problem is beginning to ease.
RP:
You said earlier that you had other plans for Utopia. This is in addition to
Lazarus?
SP: Apart from the Lazarus
activities, we’ve got a few new features in the pipeline that users have asked
for and that we think will really transform the way Utopia works — but it’s
probably a bit too early to say too much about those right now.
The
most exciting thing for us is that we’re finally in a position to be able to release
Utopia Documents as an open source project. As a primarily research-focussed
group we’ve been wanting to do this for quite some time (from the beginning
really), but didn’t want to shoot ourselves in our collective feet in terms of
financial sustainability; you can’t keep on going back to funding bodies for
more money to keep software alive, and we wanted to make sure that we had some
way of keeping the research group together through funding droughts.
We’ve
always had a model where the software has been free to use, even for commercial
purposes, and we have been able to get enough income to sustain the work on the
back of consultancy and bespoke versions for use inside commercial firewalls. Now
that we’ve tried that model for a while it’s pretty clear that making the source code open won’t damage that source of funding, and at the same time it opens up opportunities for collaboration with other open source projects, and could
maybe even get the developer community interested enough to contribute new
stuff.
RP:
When I spoke to John Wilbanks a few years
ago he painted a picture of a future in which the Internet will also become a major
platform for sharing the physical tools of science — e.g. cell lines, antibodies, plasmids etc. What
struck me as particularly interesting here is that he suggested the ability to
acquire these tools will be embedded directly into research papers. So if a
reader of an open access paper wanted more detailed information on, say, a cell
line, they would be able to click on a link and pull up information from a
remote database. If they then decided they wanted to obtain that cell line from
a biobank, they would order it in the same way as they might order an item on
Amazon or eBay, utilising a 1-click system available directly from the article.
This
suggests that the scholarly paper is set to become a great deal more than just a
viewable document. Not only will it become the raw material for multiple
machines and software agents to data mine, and the front-end to hundreds of
databases (as we have discussed), but it will also be the launch pad for a
range of ecommerce systems. The interview with Wilbanks was six years ago and
so thinking may have moved on. But what’s your view on this?
SP: I think that’s spot on. Anita de Waard described
articles as ‘stories that persuade with data’,
which I think encapsulates things nicely. Humans need the story and the ability
to analyse the data; and as the scale of the literature increases we’re going
to be relying more and more on machines to help us do that. There’s still a
long way to go though.
RP:
As I understand it, the research process consists primarily of gathering data,
performing analyses and then sharing the results of those analyses with peers.
This process has already begun to change as a result of the Internet and, as
you say, it can be expected to change a great deal more in the future. But how might
the process look in, say, 25 years’ time? Will the research paper as we
understand it have disappeared, and been replaced by something else altogether?
SP: The clearest description I’ve
seen of this is what Prof. Dave de Roure and
colleagues call ‘Research
Objects’
— packages of knowledge that contain the story, as well as the data and any
computational methods necessary to analyse or reproduce the claims being made.
There’s a lot of work going on at www.researchobject.org now that I
think has really exciting potential. If anything is going to be the article of
the future, I think it’s research objects of some kind.
1. M. J. Herrgård, N. Swainston,
P. Dobson, W. B. Dunn, K. Y. Arga, M. Arvas,
N. Blüthgen, S. Borger, R. Costenoble, M. Heinemann,
M. Hucka, N. Le Novère, P. Li, W. Liebermeister, M. L.
Mo, A. P. Oliveira, D. Petranovic, S. Pettifer,
E. Simeonidis, K. Smallbone, I. Spasić, D. Weichart,
R. Brent, D. S. Broomhead, H. V. Westerhoff, B. Kürdar,
M. Penttila, E. Klipp, B. Ø. Palsson, U. Sauer, S. G.
Oliver, P. Mendes, J. Nielsen, and D. B. Kell. A consensus yeast metabolic network obtained from a
community approach to systems biology. Nature Biotechnology, 26:1155–1160, October 2008.