Data mining the English language

It’s one thing to read a book for its aesthetic subtleties, but it’s quite another to data mine roughly a million of them for shifts in the English language.

Ted Underwood and his team of graduate and undergraduate students have stepped back from reading individual books so that they may instead measure trends and changes in genres and diction across 800,000 volumes written in English. Identifying changes and differences between works traditionally has meant comparing some handful of books to another handful, but Underwood’s research allows him to see patterns in language that are too big to see if the reader is “too close to the text,” he said.

“There are patterns in literary history that we can’t see at the scale of reading we ordinarily encounter,” Underwood said.

So far, his research on works published in the 18th and 19th centuries has shown two major shifts in English: the diction of fiction, poetry and drama grows less like that of nonfiction over that period, and first-person narration decreases.

Underwood found that poetry adopted simpler language, or diction that predates the Norman conquest of England in the 11th century. Pre-Norman conquest words like “good,” “old” and “red” were used more toward the end of the 19th century as poetry moved from discussing broad topics like politics and nature to simpler, more elementary subjects like personal experiences and aesthetics.

“By the end of the 19th century, poetry is very careful about using less-learned language so that it can connect with elementary, personal experiences rather than ideas and social issues,” Underwood said.

Despite the trend toward personal experiences, the prevalence of third-person writing grew.

To begin the study, Jordan Sellers, one of Underwood’s research assistants and a graduate student in English literature, spent a year compiling a survey of more than 4,000 books from the 18th and 19th centuries. Using WorldCat, a massive online catalog of books and other published materials, he carefully evaluated books from all genres and types, from medical texts to fiction to travel memoirs, to create a representative sample of the work published in that time.

Determining the genre of a work, or even the defining characteristics of a genre, is a difficult business. Unlike other English literature research, Underwood’s study does not concern itself directly with genre.

“Projects in digital humanities trend toward smaller collections,” Sellers said. “They are more targeted in the sense that you will see someone working on Victorian poetry, or something like that.”

Because the goal of this research is not to show what a particular genre looked like at a certain time, genre can be left as a “fuzzy” term, Sellers said.

Because English today is a study of aesthetics, bringing in digital and statistical analysis can meet with some pushback.

But Mike Black, a graduate student in English and the research assistant who developed the software used to analyze the texts, thinks computer programming and English go hand in hand.

“I’ve always liked thinking about how language works, so being able to think about it in a programming context really forces you to very closely examine it,” Black said. “In a lot of ways programming is similar to writing: You’re taking a big abstract concept, and you are trying to break it down to little pieces that people can understand.”

The digital humanities is not a new field of study, but it’s not pervasive in English departments. Underwood and his team believe this research is not an end to the complex debates in English, but rather a way to add to current discussions or create new ones. With more research, Underwood hopes to uncover even more patterns in the English language that have yet to be seen.

With the initial survey of 4,000 books complete, Underwood proceeded to the next step: analyzing 800,000 books. He has obtained them largely from the HathiTrust Digital Library, which has a collection of 10.5 million digital volumes, 3.5 million of which are available for public consumption.

He is limited, though: Under current copyright laws, he can analyze only works published before 1923.

Even with this limitation, Underwood is using his findings to develop tools for digital libraries or other researchers to use in their own digital humanities research.

Research doesn’t have to be humanities against science, Underwood said, because they are “more complementary than conflicting.”

Instead of conflict, the digital humanities “gives us a panorama.”

Ryan can be reached at @ryanjweber.