[This blog originally appeared on Big Data Republic in 2013. Unfortunately, all the content has been taken offline]
Sir Walter Scott contrasted his style of writing with that of Jane Austen: "The big Bow-Wow strain I can do myself like any now going; but the exquisite touch which renders ordinary commonplace things and characters interesting from the truth of the description and the sentiment is denied to me." While he characterized his work as large, Jane Austen called her own small, a "little bit (two inches wide) of ivory on which I work with so fine a brush."
Seeing themselves as such strong contrasts to each other, they likely would have been very surprised to be coupled together as "the literary equivalent of Homo erectus, or, if you prefer, Adam and Eve." Using computational power to analyze 3,592 works published between 1780 and 1900, Matthew Jockers concluded that Walter Scott and Jane Austen were the two primary influences, in terms of style and theme, on all novelists who came after them. Those are the types of discoveries that Jockers expounds upon in his newly published book, Macroanalysis: Digital Methods and Literary History.
Systematic textual analysis has a history that goes much further back than computers. The first concordance, according to The Word Crunchers, dates back about 800 years. It was a most labor-intensive project, taking up the work of 500 friars. A Chaucer concordance took 50 years before it was ready for publication in 1927. Computers entered the picture as early as 1951 when "I.B.M. helped create an automated concordance." Those were the days of punch card programming, so “indexing all of Aquinas took a million man-hours.” It was completed only in 1974. Ten years later, though, computers could analyze texts effortlessly, as depicted in the reports of a novelist’s favorite word in David Lodge’s novel Small World.
The proliferation of digitized books, courtesy of Google Books, is what now makes it possible for computers to process huge volumes of text from thousands of works. Matthew Jockers, along with Franco Moretti, founded the Stanford Literary Lab in 2010. The research is done in groups, along the lines of scientific investigations, with the help of computers.
The approach is critiqued in a Chronicle of Higher Education article, The Humanities Go Google:
Data-diggers are gunning to debunk old claims based on "anecdotal" evidence and answer once-impossible questions about the evolution of ideas, language, and culture. Critics, meanwhile, worry that these stat-happy quants take the human out of the humanities. Novels aren't commodities like bags of flour, they warn. Cranking words from deeply specific texts like grist through a mill is a recipe for lousy research, they say—and a potential disaster for the profession.
It’s not just a matter of traditionalists feeling threatened by computer power. Algorithms that depend on Google Books for metadata tags may reach wrong conclusions. Geoffrey Nunberg, a linguist, is quoted as declaring Google’s tags "a mess," not to be relied on. Aside from questions of accuracy, there is that of relevance. Researchers have to ask themselves: "What does this tell me that what we can't already do?"
I had the same question when I read the article on Jockers. Aside from identifying the trail that Austen set for the novel, it points out the supposed revelation that the novels of George Eliot "more closely resemble the patterns of male writers." Is it altogether surprising that the author of Silly Novels by Lady Novelists, who deliberately adopted a masculine pseudonym, broke the mold conceived for female writers? That’s something that any student of Victorian literature should already know.
What this form of research could do that traditional studies do not is unearth the roads not taken by the literary canon. In a New Scientist article on Jockers’ work, Nicholas Dames, chair of the department of English and comparative literature at Columbia University, is quoted as seeing the value of this type of research in bringing to light the full body of fiction "rather than the small percentage of canonical texts that are usually taken as exemplary." That opens up the consideration of the canon in a larger context, which can lead to questioning the marked trail of influence. But that will only work if the Google Books data proves comprehensive and reliable enough to accurately represent the literature of the time.