There is a perfect storm forming in the publishing sector,
and not the negative one of impending doom and gloom that typically comes to
mind. This new storm, created by a tsunami of data and a flood of new
technologies, is one of potential and opportunity for those bold enough to navigate
the waters. The combination of data and technology will let us deliver
new features and products that better address our customers’
information needs. The storm has a simple name: “Big Data”.
“Big Data” is one of those catchphrases in
information technology circles that means many different things
to different people. While many people assume it refers specifically to very
large datasets, it really encompasses more than that. We like
the following definition of “Big Data” from O’Reilly Radar, Release 2.0: Issue
Big Data: when the size and performance requirements for data management become significant design and decision factors for implementing a data management and analysis system. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.
Within publishing there are three dimensions along which significant
movement can push a problem or product into the realm of “Big Data”:
the volume of data in question, the complexity of the calculations that have to be
applied to it, and the speed at which the processing needs to happen.
These dimensions surface as follows:
Volume – The amount of data being authored, crawled, and
captured is increasing exponentially. This data spans the spectrum from highly
structured record formats to completely unstructured text, and everything
in between. It includes the traditional well-formatted publishing
assets (articles, books, databases, etc.) as well as usage information, customer data,
crawled web pages, extracted intelligence and metadata, and more.
As more of this data is captured electronically and interlinked, it becomes much
more valuable. Unfortunately, it also becomes much more unwieldy to work with.
Complexity – This flood of new data quickly overwhelms traditional
software packages and paradigms. Even smaller
datasets can strain their capabilities when extensive computations need to be
performed to generate a result. Fortunately, within the past few years,
new technologies such as Hadoop, HPCC Systems, and NoSQL databases have
transformed our ability to perform extensive computations and rapidly process
large amounts of data. These technologies will form the backbone of the new
analytic and delivery capabilities for our data and products.
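To give a flavor of the style of computation these platforms enable, here is a minimal sketch of the MapReduce pattern that Hadoop popularized: counting term occurrences across a toy corpus. The documents are invented and plain Python stands in for a real cluster; on Hadoop the same mapper and reducer logic would run in parallel across many machines.

```python
from collections import defaultdict

# Toy corpus standing in for millions of articles
documents = [
    "big data in publishing",
    "data and technology in publishing",
]

def mapper(doc):
    # Emit (term, 1) for every term in a document
    for term in doc.split():
        yield term, 1

def reducer(term, counts):
    # Sum the partial counts for a single term
    return term, sum(counts)

# Shuffle phase: group intermediate pairs by key
grouped = defaultdict(list)
for doc in documents:
    for term, count in mapper(doc):
        grouped[term].append(count)

# Reduce phase: one call per distinct term
results = dict(reducer(term, counts) for term, counts in grouped.items())
print(results)  # {'big': 1, 'data': 2, 'in': 2, ...}
```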
Speed – Frequently, these new tools and the amount of data being
processed require significant amounts of storage and processing power to
generate results quickly. Historically, having to purchase and host the
necessary compute capacity would have precluded any attempt to work at such a large
scale. Fortunately, new platform technologies now allow us to
store, manage, and process this data in cost-effective and scalable ways. Developments
such as virtualization, public and private cloud computing, commodity hardware,
and high-speed bandwidth provide cost-effective alternatives for acquiring on-demand
compute capacity and bulk storage at the necessary scale.
Publishing has numerous entrenched challenges that are ripe
for a “Big Data” solution. Here are just a few that jump to mind:
- Author disambiguation: Various attempts such as Elsevier’s
Scopus Author ID, Thomson Reuters’ ResearcherID, and the new ORCID
initiative are a testament that this is not a trivial problem. The
complexity and scale of the calculations needed to resolve the records
make this a prime “Big Data” candidate (a minimal matching sketch appears after this list).
- Recommendation engines: Amazon-esque suggestions of
related material, potential upsells, and even predictive page generation
and caching are the product requirements du jour. The amount of data and
the required response times preclude traditional solutions (a minimal co-occurrence sketch also appears after this list).
- Deep citation analytics: Products like Elsevier’s SciVal
Suite and Thomson Reuters’ Research Analytics are raising the bar in
customized, on-demand analytics. Again, the combination of data size,
complex algorithms, and response speed bears the classic hallmarks of “Big Data”.
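To make the first of these concrete, here is a minimal author-matching sketch using invented records and a deliberately naive similarity rule. Production disambiguation systems compare many more signals (co-authors, affiliations, subject areas, citation patterns) across tens of millions of records, which is where the “Big Data” scale comes from.

```python
from itertools import combinations

# Hypothetical author records; real inputs number in the tens of millions
records = [
    {"id": 1, "name": "J. Smith",   "affiliation": "Univ. of Leiden"},
    {"id": 2, "name": "John Smith", "affiliation": "University of Leiden"},
    {"id": 3, "name": "J. Smith",   "affiliation": "MIT"},
]

def block_key(rec):
    # Blocking: only compare records that share a cheap key (here, the surname)
    return rec["name"].split()[-1].lower()

def similar(a, b):
    # Naive rule: treat records as the same author if their
    # affiliations share at least two words
    words_a = set(a["affiliation"].lower().split())
    words_b = set(b["affiliation"].lower().split())
    return len(words_a & words_b) >= 2

# Group records into blocks, then compare pairs only within each block
blocks = {}
for rec in records:
    blocks.setdefault(block_key(rec), []).append(rec)

matches = [
    (a["id"], b["id"])
    for recs in blocks.values()
    for a, b in combinations(recs, 2)
    if similar(a, b)
]
print(matches)  # [(1, 2)] -- records 1 and 2 likely refer to the same author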
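In the same spirit, much of a recommendation engine boils down to co-occurrence counting over usage data; the sketch below uses made-up download sessions. The hard part, as noted above, is doing this over billions of usage events and serving the results within interactive response times.

```python
from collections import Counter
from itertools import permutations

# Hypothetical usage logs: which articles each user downloaded together
sessions = [
    ["article-A", "article-B", "article-C"],
    ["article-A", "article-B"],
    ["article-B", "article-C"],
]

# Count how often each pair of articles appears in the same session
co_counts = Counter()
for items in sessions:
    for a, b in permutations(set(items), 2):
        co_counts[(a, b)] += 1

def recommend(article, top_n=2):
    # Rank other articles by how often they co-occur with the given one
    scores = Counter({b: c for (a, b), c in co_counts.items() if a == article})
    return [b for b, _ in scores.most_common(top_n)]

print(recommend("article-A"))  # ['article-B', 'article-C']
```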
The opportunity is here today. All that is needed to weather
the storm is a bit of vision and a good surfboard. We hope you’ll join us
for the ride. Aloha!
Curt Kohler – Director of Disruptive Technologies, Elsevier Labs
Darin McBeath – Director of Disruptive Technologies, Elsevier Labs