Last month I did a workshop on text analysis in Python for a new computational social science group that several of us started at Duke Sociology (workshop materials). As I created the workshop materials, I had two thoughts: (1) most text analysis projects require essentially the same set of steps. The key is to come up with a system and design pattern that works for you. (2) There aren’t many new algorithms for text analysis in the social sciences. Most algorithms we've picked up are simply more efficient or slightly different variations of very old algorithms.
Most text analysis projects require the same steps. By this I mean that most projects require the same or similar boilerplate tasks: preprocessing might involve fixing spelling issues or removing artifacts from original texts; tokenization involves some decisions about which tokens to include, how to deal with named entities, stopword removal, or hyphenation collapsing; document representation storage involves placing the parsed dataset into a database (plug for my package doctable), spreadsheet, pickled file, or some other storage medium that can be accessed. Then an algorithm operates on those document representations to create models, which are again stored in a database or other file for creating visualizations or running statistical models. There may be more to it (hand-coding and metadata analysis, for example, tend to be pretty important), but projects that fall outside this basic pipeline are rare.
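To make those steps concrete, here is a minimal sketch of the pipeline using only the Python standard library. Everything in it is an assumption for illustration: the toy corpus, the stopword list, the function names, and the choice of an in-memory SQLite table for storage (it stands in for doctable, a spreadsheet, or a pickled file). The "model" is just a term count, a placeholder for whatever algorithm you actually run.

```python
import json
import re
import sqlite3
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
DOCS = [
    "The state-of-the-art model was   trained on news articles.",
    "Articles about the model appeared in the news.",
]
STOPWORDS = {"the", "was", "on", "in", "about", "of"}


def preprocess(text):
    """Clean artifacts: collapse runs of whitespace and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text):
    """Collapse hyphens, split into word tokens, drop stopwords."""
    text = text.replace("-", "")  # "state-of-the-art" -> "stateoftheart"
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOPWORDS]


def store(conn, docs_tokens):
    """Save each tokenized document as a JSON list in a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, tokens TEXT)"
    )
    conn.executemany(
        "INSERT INTO docs (id, tokens) VALUES (?, ?)",
        [(i, json.dumps(toks)) for i, toks in enumerate(docs_tokens)],
    )
    conn.commit()


def load(conn):
    """Read the tokenized documents back in insertion order."""
    rows = conn.execute("SELECT tokens FROM docs ORDER BY id").fetchall()
    return [json.loads(row[0]) for row in rows]


def model(docs_tokens):
    """Stand-in 'algorithm': corpus-wide term counts."""
    counts = Counter()
    for toks in docs_tokens:
        counts.update(toks)
    return counts


conn = sqlite3.connect(":memory:")
store(conn, [tokenize(preprocess(d)) for d in DOCS])
term_counts = model(load(conn))
print(term_counts.most_common(3))
```

The point is the shape rather than the particular functions: each stage has one job, and the stored representation sits between parsing and modeling so either side can be swapped out without touching the other.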
The point of learning these tools is to develop a series of design patterns. Most of us (speaking to social scientists here) are not software engineers, nor have we ever even been paid to write code that someone else will read or use. If we were, we would know that design patterns in code are all about predictability. Despite the incredibly large range of possible algorithm designs that could be used, the software engineer seeks to be consistent: consistent project structure, consistent design architecture, and even consistent use of syntax for basics like loops and conditionals. Someone has to read that code, so the goal is to make it as easily understood as possible. For us, we are (often) the only ones to read our code. So learning to write research code is about developing our way of doing the same boilerplate tasks and organizing projects in a way that we can recognize easily in the future.
There aren’t that many new tools for text analysis in the social sciences. Let's start with an easy one: Latent Dirichlet Allocation (LDA). This one algorithm spurred on a (perceived) revolution of social scientists doing text analysis for interpretation. It hit everywhere: sociology (see this Poetics special issue), corpus linguistics, and the digital humanities especially. Right now, Word2Vec is HOT. Same deal, different algorithm. Now, I’m not against this: in fact, I’ve used both of these tools in research projects, and I think they can be incredibly useful and provide important insights. My argument is not that they are not useful, only that they are not particularly novel in the ways that social scientists use them.
While LDA has become by far the most popular topic modeling algorithm, it does pretty much the same thing as its matrix factorization counterpart, Nonnegative Matrix Factorization (NMF). They both start with the same model of the texts: documents are bags of words represented as rows in a document-term matrix. NMF is similar to Singular Value Decomposition (SVD) except that it works on matrices with only nonnegative entries and constrains the resulting factors to be nonnegative as well. The thing is, NMF is really old. For whatever reason, it was LDA that spurred interest in topic modeling for the social sciences. Same thing goes for word embeddings. While Word2Vec became hugely popular, subsequent works showed that a simple Pointwise Mutual Information (PMI) calculation based on word co-occurrence frequencies, followed by SVD, could produce results similar to those from Word2Vec, and PMI-SVD has been around for a really long time. With the exception of parse tree and named entity extraction, I’m not totally convinced that we’ve seen anything new that really changes the way we can analyze texts.
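To show how little machinery PMI-SVD needs, here is a rough sketch using NumPy. It is a simplified positive-PMI variant under my own assumptions, not the exact procedure from the papers alluded to above: the toy corpus, the window size, the function names, and the choice to scale the singular vectors by the square root of the singular values are all illustrative. For a word w and context word c, PMI is log(p(w, c) / (p(w) p(c))); negative and undefined values are set to zero before the SVD.

```python
import numpy as np

# Hypothetical tokenized toy corpus, for illustration only.
DOCS = [
    ["cats", "chase", "mice"],
    ["dogs", "chase", "cats"],
    ["mice", "eat", "cheese"],
    ["dogs", "eat", "meat"],
]


def cooccurrence(docs, window=2):
    """Count how often each pair of words appears within `window` tokens."""
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for doc in docs:
        for i, word in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[word], index[doc[j]]] += 1
    return vocab, counts


def ppmi(counts):
    """Positive PMI: max(0, log(n_wc * N / (n_w * n_c)))."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)  # n_w, shape (V, 1)
    col = counts.sum(axis=0, keepdims=True)  # n_c, shape (1, V)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (row @ col))
    pmi[~np.isfinite(pmi)] = 0.0  # unseen pairs give log(0); treat as 0
    return np.maximum(pmi, 0.0)


def svd_embeddings(matrix, dim=2):
    """Keep the top `dim` singular directions as word vectors."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])


vocab, counts = cooccurrence(DOCS)
vectors = svd_embeddings(ppmi(counts), dim=2)
for word, vec in zip(vocab, vectors):
    print(word, np.round(vec, 2))
```

On a corpus this small the vectors mean nothing, but the recipe is the same at scale: count co-occurrences, reweight with PMI, and reduce with SVD.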
If most text analysis pipelines are very similar and we’re using basically the same tools as we’ve had for the last few decades, where does that leave text analysis researchers? It leaves us with substance. The fact that these algorithms are so readily available and we have so many tools for working with texts means that we can focus less on methods and more on substantive analyses. In my opinion, the value in doing computational text analysis is more than the cool factor: we can answer classical questions that were difficult to answer before. In an increasingly digital society, the study of digital texts has never been more important. We just have to be willing to ask the right questions before we pick up new tools.