Back from Beyond, or: How I learned to stop worrying and love the PDF
As I slowly wind down from what perhaps should be called the ‘Beyond the PDF Bootcamp’, I am starting to process the whirlwind of talks, demos, conversations and interactions that took place, and are still very much taking place. First and foremost, it was an incredible amount of fun to see old friends, and meet new ones, and see old friends make new ones, hear the multitude of passionate voices and see the hard work everyone did to show their wares and share their thoughts at this uniquely cross-disciplinary event. At dinner just before the workshop, I sat next to Michael Kurtz from Harvard, who told me something I admit I didn’t know: that 95% of all the matter in the universe is composed of dark matter and dark energy, and we don’t have a clue what either of those actually are. This concrete image of the vast amount of things we don’t know offered a great intro, to me, for an exciting, thrilling, amazing three days, and an image reflecting the sheer gigantonormousness of the questions that we are all trying to help solve, sooner, rather than later.
To someone coming in from the outside it must have been pretty overwhelming to see these 50 or so scientists and developers, with the odd librarian or publisher thrown in, all individually and collectively hell-bent on ‘changing science publishing’, while continually asking each other what that means, exactly. Change – from what? What is broken now; what needs fixing? But more importantly – change to what, for whom, and by whom? And there do not seem to be any hard and fast answers – although perhaps some slow and soft ones are starting to appear. In trying to process the workshop and its many discussions, I found myself thinking about the conference as covering not so much a list of topics as a set of polar opposites. So far, I’ve found seven, which I would like to put out as a first list of axes to span a space within which we are slowly trying to orient and position ourselves. I very much invite everyone, present and not present, to add to or comment on these; this is just a first, stream-of-consciousness representation of things that struck me.
- Order vs. chaos.
One of the things that I personally learned from helping Phil run this workshop was how much wonderful, useful, creative work can be done by leaning back and letting order develop out of a certain amount of chaos. In preparing for the meeting, Phil always knew quite clearly what he wanted, but he had a great down-under ‘no-worries mate’ attitude to the whole thing, while I panicked with Dutch uptight-ness about the program not being posted and the chairs not invited and how were we going to run the workshops when no one really knew what the scope or the goals were, and lots of other stuff. I admit to harassing poor Phil no end, as he dealt with the really important stuff – getting the food and the travel and the hotels together, oh and by the way running a research group and doing great science, just on the side. So a lesson I personally learned was that overplanning is not necessary, and probably doesn’t help – as Phil put it: ‘You just get a bunch of good people in a room, and then you step aside and let them get on with it.’ That is exactly what happened, and it was wonderful to see how everyone worked and stepped up to the plate and led discussions and breakout sessions and took up chores (taking notes, posting demos, commenting on Twitter) without anyone planning it or telling them what to do. It seems a testimony both to the quality of the participants and to Phil’s charismatic and understated leadership that we got where we did on Day 3 – at a point far beyond anything I think any of us had hoped for, beforehand. It seems a great model for science publishing, and science as a whole, to move forward: make sure you get interested and interesting people in the same space, offer them room to communicate and a few tools, and something amazing will emerge, without anyone steering or taking control or pushing things in a specific direction.
- Data vs. rhetoric
A key discussion at the conference – as it should be, and it was wonderful to watch it emerge! – concerns the nature of the research object. On the first day, a fascinating discussion emerged about the concept of annotation – and how you can argue that papers themselves are (just) annotations of data. Likewise, reviewers’ comments, review articles, blog posts and even citations themselves can all be seen to be annotations on a particular paper, so we can visualize the information space as a series of concentric circles that surround the data in an increasing cycle of finding, observation, interpretation and comment. There were pleas for a ‘data paper’ by John Kunze of the California Digital Library, which ‘minimally consists of a cover sheet and a set of links to archived artifacts’. A step up in complexity, perhaps, are nanopublications [1] – essentially, graphs of triples with provenance. One step more in the narrative chain brings one to what used to be called modular papers [2], and is now referred to as ‘a medium-grained data object’ in the W3C Health Care and Life Sciences discussion; and finally we get to the coarse-grained, IMRaD-shaped structure that we all know and love. Somewhere in between, rhetoric happens, and it was encouraging and exciting to hear the participants bandy about terms such as rhetoric, narrative and persuasion – not words you’d hear uttered by the semantic/bioinformatics community even five years ago! Although some of us old fogeys thought we’d made some perfectly useful wheels (e.g., [3]-[5]), it seems some of them will be reinvented; but having this discussion be as lively as it is, is much more important than rehashing what was thought up in the past. This is a vibrant and intelligent group of people who are apt to take up nuggets of what worked and reinvent parts that don’t seem useful anymore – and it will be fascinating to see what will emerge as a set of research objects that can be linked, archived and annotated.
- Old vs. New
One thing that truly thrilled me was to see some of the giants in changing scientific communication, such as Peter Murray-Rust and Michael Kurtz, interacting with, listening to and being heard by the new generation, who are all-aTwitter and have grown up in social networks in a semantic, interconnected world. Many of the good ideas from twenty years ago have been realized (as Kurtz’s sketch of the astronomy information space showed, where scientists run the journals, data stores and archives and, essentially, everyone has access to everything in the way they want – they even have an alerting system that works!). But many others have not gained the impact or traction they so deserve: to me, Scientific Markup Language is a case in point, and I hope Peter Murray-Rust will reintroduce it to this group. In aiming to provide a home for these ongoing developments, I think there is a unique opportunity to connect both the old and the new, and for a cross-disciplinary group of scientists to work on some problems using ideas ranging from Linked Data to triple stores to more SGML-based, old-fashioned principles. One party that I do think is still missing in this discussion, and that I found to be missing at the meeting, is the Digital Repositories community, who tackle issues such as archiving and attribution, authoring and annotation, and have a lot to tell us – and perhaps some concrete needs we can help address. Hopefully this can be rectified in the future, as we establish a more solid (virtual or real) meeting space, and allow for outside contributions in the discussions that are now taking place on a daily basis.
- RDF vs. PDF
Several people remarked that for a conference devoted to moving ‘Beyond the PDF’, there were a surprising number of PDFs shown! In fact, only part of the discussion focused on ‘overcoming’ this format, as the Utopia PDF viewer demonstration certainly offered one a great ‘wow!’ factor [6]. I believe Steve Pettifer of Utopia even put out the challenge for people to name 10 things that they believed PDF cannot do – and build them! RDF had a great number of proponents too, and of course, ideally, everyone agreed you need both: PDF is what people prefer to consume and RDF is more to computers’ tastes. The collaboration between Utopia and the Annotation Framework developed by the Harvard group shows, for my money, the most delightful way to date of combining Harvard’s semantically solid, provenance-focused Annotation Ontology with an awesome tool – allowing you to add annotations in the comfort of your own local copy and then sharing them with the community at large. The whole system makes such sense and seems so easy to use that I don’t see how we could have all lived without it – I’d certainly love a copy on my desktop as soon as possible. Now there’s an iPad application that would make science easier – getting an intelligently annotated document with links to other documents, and being able to communicate with your co-annotators at the moment you are reading what they commented!
- Open vs. closed.
My colleague Brad Allen made the astute observation that the single piece of software that dominated this conference was Twitter – yet nobody talked about it. In fact, a large and very passionate part of the discussion on the first two days of the conference concerned the plea that ‘everything we do has to be open’ – that no matter what is done in the future pertaining to new forms of science publishing, not only the content has to be freely available, but also all software that is used has to be entirely open source, which is ‘clonable’ (in the words of a Twitterer). Still, we all happily use Twitter, and for me it has considerably improved my work – I know of things I wouldn’t have otherwise known, and meet people I would otherwise not have met. In short: it works. I do not know how Twitter makes money, and frankly, I don’t care, as long as they give me a tool that makes my life more productive and pleasant. What was wonderful about Friday’s discussion, in my view, was that we were able to overcome the open/closed debate that seemed to divide the room in earlier discussions (both in the room and on Twitter). It seems clear that if there are features or entities that truly improve the way we read, write and communicate, we are okay with having to pay for them. As an example, I thought Wingu elements was an intriguing way of building an eLab-workflow tool, and the concept of people building apps that can run within or outside the platform a useful one. The business model has not fully evolved yet – but the community can give a list of needed components (interoperability, open data standards, import and export options, for starters) and perhaps these can help Wingu move to a model that is acceptable to the scientists they cater to, and still allow them to exist, as a company.
Similarly, the Utopia viewer, and some of the more domain-specific authoring, indexing, and annotation tools that seem to really enhance the speed and quality of information access, are ones I would gladly pay for.
As for content, on Friday Maryann Martone stated emphatically that money needs to be spent “Either putting in content or taking it out”. Michael Kurtz put some figures to these concepts: the information infrastructure for an average active astronomer costs about $20,000 per year. Of this, more than 70% goes into data archiving, and only about $4,000 into all publication costs (reading and publishing). The open/closed access discussion is clearly not settled – but as we move to explore a number of use cases, it seems at least we can start to define what the various components are, and can all go back and try to figure out what our role in this brave new world can be. To be sustainable, an information architecture needs to come into place that allows scientists to add value but not spend too much of their time or (grant) money in maintaining software or offering customer support, while the publishers and repositories provide services (such as archiving and large-scale, high-quality XML production) that the community as a whole agrees it is worth paying someone to do. The fact that the workshop seemed to face, and then collectively overcome, the open/closed dichotomy – an old discussion that has not always been very fruitful in the past – is great progress, indeed!
- Central vs. distributed
Another dichotomy that had proponents on either side of the spectrum concerned the organizational arrangement that best allows us to reach this glorious future we all seek. On one end of the argument is the ‘thousand flowers bloom’, ‘let’s all build our own thing and see how it connects when it works’ school of thought; on the other, proponents of an infrastructure, an architecture, a framework offering a solid foundation that we can all build on (and other construction site metaphors). It seems obvious, and I think was a clear outcome of the meeting, that we should do both: so over the next few months, some people will be working on developing principles or meeting places (there was talk of a journal, and an (Invisible) College?) whereas others are going home and happily continuing to work on a Really Cool App. And as long as the RCA guys’n’gals are aware of interoperability requirements and standards of Basic Semantic Hygiene, and as long as the frameworks and standards groups don’t lose the forest for the trees, and make sure they are still connected to things people actually need and use, this parallel development should help us all leapfrog over each other on the way to science communication paradise. A practical example of the ‘central vs. distributed’ dichotomy was the conference discussion that took place before, during and after BtPDF, which has been taking place on several platforms: the conference website, built on Google Sites; Twitter (which was cached and analyzed in different places); the Etherpad app, which was used to take communal notes during the breakout sessions; and the very active BtPDF (Google groups) mailing list, which occasionally contains urgent requests to please, please, post everything on the website as well. All in all, a nice model for the distributed, frantic information space that the average scientist (well – the average person!) finds him- or herself living in!
At least the collected BeyondthePDF conversations offer a nice little corpus of distributed discussions, that perhaps some clever group of computer scientists can mine, extract, combine, and connect to represent the voices, themes and dynamic of this spirited debate.
But, finally, a wonderful outcome of the conference as a whole, in my mind, was the idea that while we maintain the distributed nature of the discussions and developments, a bunch of the participants will collaborate and very concretely start to work on a single use case: helping Maryann Martone expedite finding a cure for Spinal Muscular Atrophy. Maryann actually has a fighting chance to help find a cure for this horrible disease (‘childhood Lou Gehrig’s disease’, as she described it), provided she doesn’t have to spend months or years first gathering and then processing the vast literature that is related to neuromuscular diseases, and their genetic origins, treatment and all other possible related elements. A number of groups at the meeting (including all publishers, Harvard, ISI, the Leiden Bioinformatics group and others) have agreed to join forces and build a ‘knowledge terrarium’ that will help connect all components of the available content without barriers of business models or technology and, most importantly, help Maryann speed up her research. From this use case, undoubtedly, new standards, definitions, architectures, and thoughts about business models will emerge – but more importantly, there’s a chance some kid might walk again because we all got together and did something.
- Haves vs. have-nots
One of the most interesting post-conference discussions I had was about the need to not just improve scientific communication, but to get more people interested in science, in the first place. In the US, a mere 16% of all college students got undergraduate degrees in science or engineering (in 2006, the latest figures available, for some reason), as opposed to 47% in China and 27% in France [7]. If we don’t collectively improve this number, there won’t be anyone who can cure cancer, figure out how the brain works, or find out what dark energy is, by the time we are all retired – and the publication process will be the least of the worries of the scientists that are left.
On the other hand, the single most poignant image of the entire meeting to me came from Leslie Chan, who said that we know that ‘mosquitoes transfer malaria’, in fact – there is a cure for malaria, and yet 850,000 people, mostly children, die from the disease every year. Having a cure is clearly not enough. In trying to solve these and other pressing issues, there is a vast contingent of scientists in the not-so-lucky parts of the world who cannot access, and certainly cannot get published in, the mainstream journals we are trying to change. Chan’s Bioline system, an open-access and open-source platform for developing-world scientists, is a valiant attempt to right some of these wrongs, but as a community it seems we should spend more of our thoughts, efforts, and resources looking at allowing access to these other groups of scientists – and perhaps, involving them can help address the dearth of scientists that we will surely face.
Well – those are my seven dimensions. I very much look forward to others adding points, debating some of the more outrageous, arrogant or incorrect claims, and in general continuing this discussion with people at, or regretfully not at, the meeting. As a community, I hope we can start to build this information space of the future as we discuss it, and very much look forward to the time when all of this will come to pass, and can be handed over to the next generation, as a matter of course. So that maybe, some day, one of them might figure out what dark energy actually is…
Anita de Waard
http://elsatglabs.com/labs/anita
References:
[1] Paul Groth, Andrew Gibson and Jan Velterop, The anatomy of a nanopublication, Information Services and Use, Volume 30, Number 1-2 / 2010, p. 51-56, http://iospress.metapress.com/content/ftkh21q50t521wm2/
[2] Joost Kircz, Modularity: the next form of scientific information presentation?, Journal of Documentation, Vol. 54, no. 2, March 1998, pp. 210-235.
[3] de Waard, A. (2007). A Pragmatic Structure for the Research Article, in: Proceedings ICPW’07: 2nd International Conference on the Pragmatic Web, 22-23 Oct. 2007, Tilburg: NL. (Eds.) Buckingham Shum, S., Lind, M. and Weigand, H. Published in: ACM Digital Library & Open University ePrint 9275. http://elsatglabs.com/labs/anita/papers/ICPW2007_DeWaard.pdf
[4] de Waard, A., Buckingham Shum, S., Carusi, A., Park, J., Samwald, M., and Sándor, Á. (2009). Hypotheses, Evidence and Relationships: The HypER Approach for Representing Scientific Knowledge Claims, Proceedings of the Workshop on Semantic Web Applications in Scientific Discourse (SWASD 2009), co-located with the 8th International Semantic Web Conference (ISWC-2009) – http://elsatglabs.com/labs/anita/papers/Hyper290809.pdf
[5] Tudor Groza, Siegfried Handschuh, Tim Clark, Simon Buckingham Shum, Anita de Waard, A Short Survey of Discourse Representation Models, Proceedings of the Semantic Web Applications in Scientific Discourse Workshop at The 8th International Semantic Web Conference (ISWC 2009), Chantilly, Virginia, USA, 2009. http://elsatglabs.com/labs/anita/papers/SWASD2009_Discourse#251E15.pdf
[6] T. K. Attwood, D. B. Kell, P. McDermott, J. Marsh, S. R. Pettifer, and D. Thorne, Utopia Documents: linking scholarly literature with research data, Bioinformatics, 26:i540-i546, Sep 2010.
[7] Lisa W. Foderaro, “An Infusion of Science Where the Arts Reign”, New York Times, January 21, 2011. http://www.nytimes.com/2011/01/22/nyregion/22science.html?_r=1&ref=science