“Big Data” and Archives
I’m just back from a one-day conference at Princeton University titled Big Data: Public Policy and the Exploding Digital Corpus. It was a stimulating (and exhausting) day, and I thought I’d try to write down a few of my preliminary impressions.
The conference was sponsored by Princeton’s interdisciplinary Center for Information Technology Policy, and brought together a number of groups – information technology professionals, lawyers, policy experts, archivists and librarians – to discuss the challenges posed by the rapid growth of extremely large sets of data generated or aggregated by computers, commonly referred to as “big data.” This made for a day of challenging and exhilarating exchanges, and plenty of opportunities for participants to consider the fundamental values and principles of their profession and to explain those using language and theoretical frameworks that others could understand.
As its title suggests, the conference focused primarily on policy that addresses the challenges of big data. As a result, there was very little discussion of concrete solutions, and even fewer answers were proposed. Instead, much of the conversation centered on attempts to properly define and scope the issues of big data. I’m not sure we accomplished much even in that regard, but I do know that my horizons were dramatically expanded by sitting in a room full of smart and articulate people from a variety of backgrounds, and I hope that others had the same experience.
There’s no way I can recap every session, but the talk that stood out for me was the keynote address by David Weinberger, the author of Everything is Miscellaneous and co-director of the Harvard Law School Library Lab. His talk started out with a whirlwind tour of the evolution of “facts” and “knowledge” throughout history, starting with Aristotle and touching on Charles Darwin, Jeremy Bentham, Frank Zappa and Russell Ackoff. He then outlined two competing forms of knowledge, arguing that we’re at a moment in history when our concept of knowledge is changing dramatically. We’re moving away, Weinberger argued, from a conception of knowledge as pared down to a “brain-sized chunk” that can be contained within a long-form argument and the covers of a book, and towards a network of knowledge in which “facts” are unfinished, public, abundant, imperfect and unsettled. The former is designed to drive out difference through the “certified answers” of experts, while the latter embraces difference through an abundance of links and connections.
Weinberger compared an Encyclopedia Britannica article to a Wikipedia article. While Britannica articles are, on average, many times larger than Wikipedia articles, they are finite. The same can’t be said for a Wikipedia article. As anyone who’s spent any time on Wikipedia knows, it’s easy (and often fun) to get lost in the links for any given article. Given these two examples, Weinberger argued that a hypothetical alien landing on earth for the first time would get a far better idea of how humans think and “know” from a Wikipedia article than from an article in the Encyclopedia Britannica.
There were a lot of provocative ideas batted around at the conference, but one of the major themes that leapt out at me was that many of the problems and challenges posed by big data are not new to archivists. They are essentially the fundamental problems of preservation, access and description that the profession has struggled with from its genesis. However, these issues take on additional dimensions in the context of massive data sets that can be understood only as a whole, and only with the aid of computers. Providing access is tied to providing adequate technological infrastructure and appropriate licensing; description is tied to search and retrieval methods based on visualization rather than keywords; and preservation becomes something that’s assessed in project proposals presented to institutional review boards.
Likewise, it became clear, both while listening to a number of the panelists – including Brewster Kahle (of the Internet Archive) and the inimitable Richard Cox – and in conversations with colleagues afterwards, that many of the critical functions of the archival profession are not going to disappear, but that their scope and our relation to them as archivists need to change. As Cox put it, the archival impulse will survive, but the role of the archivist as currently constituted may not. Questions of appraisal, reference and access are going to remain central, and in some respects may become even more important, but we will most likely carry out those functions in cooperation with records creators, researchers and machines.
My overwhelming sense coming out of this conference is that these are precisely the kinds of conversations that are essential to the archival profession, not only as we move into an age of electronic records, but also as the production, reproduction and consumption of knowledge change dramatically. Events such as this are one way to stimulate these conversations, but it’s also incumbent on each of us to get out of our professional ghetto and start talking to our colleagues in other disciplines. I think we’ll all be surprised not only at what we can learn from each other, but also at what we have in common.
Video and speakers’ slides should be posted on the conference website soon. There have already been a few other blog posts about the conference, including David Weinberger’s live blog of the Brewster Kahle/Victoria Stodden/Richard Cox panel and of the closing panel, chaired by Paul Ohm of the University of Colorado, which included representatives from Facebook and Google as well as from the public policy and legal professions.