Theory, Method, and Reproducibility in Text Analysis

I respond to critiques of computational methods for hermeneutic analysis.

Posted by Devin J. Cornell on Dec 29, 2017

I have been reading a lot recently about how we think and talk about the relationship between theory and methods in work using computational text analysis for the social sciences. I would like to suggest that an assumed inseparability of theory and methods leads us to false conclusions about the potential for topic modeling and other machine learning (ML) approaches to provide meaningful insight, and that it holds us back from developing more systematic interpretation methods and from addressing issues like reproducibility and robustness. To make this point, I will respond specifically to Dr. Andrew Hardie's talk "Exploratory analysis of word frequencies across corpus texts", given at the 2017 Corpus Linguistics conference in Birmingham. Andrew makes some really good points in his critique of the shortcomings and misunderstandings of tools like topic modeling, but I believe some of his critiques rest on a fundamental misunderstanding of what it means to make "assumptions" when using modeling tools.

Talk video: "Exploratory analysis of word frequencies across corpus texts" by Dr. Andrew Hardie

Andrew's argument is essentially that many researchers, particularly in the digital humanities, use these algorithms without considering the technical details that make them work, and that the algorithms themselves have serious shortcomings. Andrew points out that "topics", as we would refer to them in an LDA context, are not topics in the linguistic sense, but rather "a group of weightings that satisfy a stochastic model" (19:20). "It means that there is a kind of generative theory of how discourse works [...]" inherent in the algorithms. He says that topic modeling works on the assumption that texts are generated by authors drawing words probabilistically from sets of bins (topics) containing words: "that's the generative theory of discourse." Andrew also argues that the algorithms themselves are flawed in their application to the social sciences because (a) they require arbitrary parameters, like the number of topics and stopword lists, that are not standardized and are often given little thought in analysis; (b) they use stochastic initialization, which allows a different topic model for every run of the algorithm; and (c) they often produce topics which are apparently meaningless and difficult to trace back to the empirical data in a useful way.

Separation of Theory and Method

Andrew's presentation is an attack both on common practices for interpreting topic models in the social sciences and on the algorithms themselves. I'll first point out something that is obvious to humanists and digital humanities scholars: no one is arguing that texts are actually generated by probabilistic "bag of words" processes, or that topic modeling is an attempt at a "generative theory of how discourse works [...]". That is by no means an assumption of the algorithm, and this line of research rarely, if ever, claims an exact mapping; this description of topic modeling is simply inaccurate. Andrew also brings up an interesting definition of scientific models to make his case: "a description of a phenomenon that is simplified in some ways, to help us understand how that phenomenon actually works." A topic model, he says, is instead "a description of a phenomenon that works completely differently to how the thing it's modeling actually works." Andrew says that "we know that a topic model is not what is happening", yet researchers continue to use the method anyway.

I'll use the example of classical statistical models to point out why this argument doesn't make sense. Linear regression has been used for many years to test and develop theories based on large-scale relationships between observed variables. I would guess that most researchers who use these models do not assume that the observed variables were actually generated by some kind of linear model of related random variables. Researchers use the model because it can tell us something about empirical data which is evidence of some real underlying social or human process (a scientific model, in Andrew's terms) that is obviously not the same statistical process used to build the model. Looking through thousands of observations in more than two dimensions would likely not be a fruitful approach to scientific analysis, so the models are often presented as "results" in place of raw data. These models, however, are not introduced without supporting theory or interpretation; in fact, most quantitative papers spend most of their text describing the motivation for the model in an effort to show why the results properly test the theories. Most critiques of models are also made at the level of theory and model design; rarely are the implementations of the models themselves critiqued, mostly because they are built with standard software and, more importantly, researchers often know their limitations and potential biases (Type I and Type II errors, for example). Admittedly, linear regression is not a generative ML model, but neither were generative models designed to provide some "generative model of [X]" (Andrew uses X = discourse, a misuse of the word "generative", which has a stricter mathematical meaning in this case). When Gaussian mixture models or hidden Markov models are used for practical purposes, researchers are not arguing that the underlying processes are statistical processes whose features are somehow 'uncovered' by constructing the model from empirical data. A specific underlying generative process is by no means an assumption required for the use of an algorithm; this has been reiterated hundreds of times in the ML and statistics literature. Andrew's definition of a scientific model is not at all incompatible with his characterization of a topic model as a description of a phenomenon that works differently from the underlying process; an exact correspondence is simply never assumed by the use of these models.

His attack on common practices for interpreting topics is well justified; I think most researchers agree we could refine some of the systematic approaches to topic interpretation. Currently MALLET provides a way to tag topics according to the interpretation, but in practice I think this is often performed by simply looking at the top 20 topic words. We could be using things like word clouds or even histograms to reveal more nuance in the actual distributions over words that topics are composed of (a rough sketch follows below). Furthermore, we need to go back to the texts: topic content is useless without examining the ways these topics are present in the original documents. We need to find ways to systematically perform close readings that can be summarized for the purposes of making academic arguments that rely on topic modeling.
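As an illustration of what I mean by looking at the full distribution rather than just the top words, here is a minimal sketch assuming a topic model fit with gensim's LdaModel; the variable `lda`, the topic id, and the choice of 50 words are placeholders rather than a prescription:

```python
# Minimal sketch: plot the shape of a topic's word distribution instead of
# only reading off its top 20 words. Assumes `lda` is a fitted gensim
# LdaModel; the topic id and number of words shown are arbitrary choices.
import matplotlib.pyplot as plt

def plot_topic_distribution(lda, topic_id, topn=50):
    pairs = lda.show_topic(topic_id, topn=topn)   # (word, probability) pairs
    words, probs = zip(*pairs)
    plt.figure(figsize=(10, 4))
    plt.bar(range(len(words)), probs)
    plt.xticks(range(len(words)), words, rotation=90, fontsize=7)
    plt.ylabel('P(word | topic)')
    plt.title('Topic {}: shape of the word distribution'.format(topic_id))
    plt.tight_layout()
    plt.show()
```

Seeing whether a topic's probability mass is concentrated in a handful of words or spread thinly over hundreds changes how much weight the top-20 list should carry in interpretation.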

Reproducibility and Randomness

Reproducibility is one of the touted benefits of using topic modeling, and yet, whereas OLS models will consistently give the same output given the same data, this is not the case with topic modeling algorithms. Like most machine learning algorithms, they involve random parameter initializations that are iteratively updated according to some optimization function. This means that given the same data and parameters (except the random generator seed), the algorithm can produce completely different topic models. Andrew says that this is a complete deal breaker for most humanist scholars, and a big concern for other researchers. I believe this reflects a fundamental, and perhaps quite widespread, misunderstanding of topic models as a method that uncovers some kind of ground truth in the data – an assumption deeply embedded in Andrew's "generative model of discourse". I sometimes call this the "ghost in the data" bias, and I've observed it more broadly across different types of quantitative projects. The fact is, as Andrew points out, this is how many scholars view the models, and this bleeds over into the interpretations and claims being made.

If we dismiss the absurd assumption that there is some ground truth topic model which properly represents the appropriate features of the data, then we can view a given topic model as a useful description of the data which captures features that may be of interest to our analysis. With this understanding, I want to point out that it is possible to create reproducible topic models by setting the random generator seed as a parameter. Reported alongside the other parameter values and the software versions used, topic models are exactly reproducible given the same data. Admittedly, though, I remain concerned about the robustness of a model given the data itself. If one were to run the algorithm multiple times and get a similar result, there might be more confidence in the statistical foundations (note: not theoretical foundations – that is a moot point) of the model. As social scientists, we need to develop methods and metrics for determining the statistical robustness of a given topic model; this is not unusual for generative ML models. As a simple approach, I suggest running the model multiple times and quantitatively comparing the generated models with each other using their topic distributions over words (a sketch of this follows below). If a particular topic model is generated that may be useful for the research questions being asked, this model can be accompanied by a robustness measurement obtained by comparing it with a hundred or more other random models. Researchers can then select topic models based on a combination of robustness and usefulness. Dismissing the 'ground truth' assumption allows for a more rigorous understanding of how the model relates to the empirical data.
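To make that concrete, here is a rough sketch of the kind of robustness check I have in mind, again assuming gensim's LdaModel. The names `corpus` and `dictionary`, the number of topics, and the number of comparison runs are all placeholder choices, and cosine similarity between topic-word distributions is just one of several reasonable comparison metrics:

```python
# Sketch of a robustness check: refit the same model with different seeds and
# ask, for each topic in a reference model, how well its word distribution is
# matched (by cosine similarity) in the other runs. All parameter values are
# illustrative; `corpus` and `dictionary` are assumed to exist already.
import numpy as np
from gensim.models import LdaModel

def fit_lda(seed, corpus, dictionary, num_topics=20):
    return LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                    random_state=seed, passes=10)

def topic_robustness(reference, others):
    ref = reference.get_topics()                      # (num_topics, vocab_size)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    best_matches = []
    for other in others:
        oth = other.get_topics()
        oth = oth / np.linalg.norm(oth, axis=1, keepdims=True)
        sims = ref @ oth.T                            # pairwise cosine similarities
        best_matches.append(sims.max(axis=1))         # best match per reference topic
    return np.mean(best_matches, axis=0)              # average across runs

# reference = fit_lda(0, corpus, dictionary)
# others = [fit_lda(seed, corpus, dictionary) for seed in range(1, 11)]
# print(topic_robustness(reference, others))          # values near 1.0 = stable topics
```

Topics that keep reappearing across seeds can be reported with more confidence; topics that only show up in one run deserve more scrutiny before they carry any interpretive weight.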

Algorithm parameters are yet another subject of debate. Researchers often ask "how do we determine the correct number of topics we should use?", to which I (and many others) respond that there is no "correct" number of topics, nor a correct value for any other parameter of a topic modeling algorithm. If we see topic models as possible 'interpretations' of the data that provide useful perspectives, rather than as excavators of some 'ground truth', these questions no longer make sense. From this perspective, reproducibility in the sense of reporting input parameters, software versions, and random generator seeds is appropriate. Model results can be further strengthened by reporting robustness measures observed for a given model.
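In practice this kind of reporting can be as simple as saving the run's configuration next to the model; a minimal sketch, in which the file name and parameter values are purely illustrative:

```python
# Minimal sketch: record the exact parameters, seed, and software versions
# alongside a fitted model so the run can be reproduced exactly.
# The file name and parameter values here are illustrative only.
import json
import sys
import gensim

run_metadata = {
    'num_topics': 20,
    'passes': 10,
    'random_state': 0,
    'gensim_version': gensim.__version__,
    'python_version': sys.version,
}

with open('lda_run_metadata.json', 'w') as f:
    json.dump(run_metadata, f, indent=2)
```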

Algorithmic Transparency

Part of Andrew's critique is also the opaqueness of the algorithms themselves: we don't know what is happening under the hood because the models are so complex. He contrasts this with factor analysis, pointing out that it is both more interpretable and has parameters that are easier to select. I think this also hints at an assumed strong link between theory and method. I argue that simpler algorithms are not more interpretable in terms of their relationship to meanings in the data. It is true that it would be difficult or pointless to trace the mathematics of LDA through your data from the term-document matrix (TDM) into the factorized word-topic and topic-document matrices, but it is the role of the interpreter to trace topic models back to the original data (a sketch of one simple way to do this follows below). Different models should be selected based on how well they capture topics relevant to the research interest. Exactly what word collocates mean depends on the documents themselves anyway: content, linguistic style, audience, and a hundred other features are up for consideration by the researcher. In the same way that being able to solve the normal equations needed to build an OLS model doesn't give you any more insight into the data than if you had used R or Stata, knowing the algorithmic details of LDA is only useful to the extent that it gives you intuition about the types of patterns to look for in the interpretation. I'm not saying that knowing the algorithmic details of LDA is useless, but a properly systematic interpretive analysis of the data shouldn't require that mathematical knowledge.
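As one example of that tracing work, here is a minimal sketch that ranks documents by their weight on a chosen topic so the strongest examples can be read closely. It assumes a fitted gensim LdaModel (`lda`), the bag-of-words corpus it was fit on (`corpus`), and the raw document texts in the same order (`texts`); all three names are placeholders:

```python
# Minimal sketch of "going back to the texts": rank documents by their weight
# on a chosen topic so the strongest examples can be read closely.
# Assumes `lda` (fitted gensim LdaModel), `corpus` (bag-of-words corpus), and
# `texts` (raw documents, in the same order) already exist.
def top_documents_for_topic(lda, corpus, texts, topic_id, n=10):
    weighted_docs = []
    for i, bow in enumerate(corpus):
        doc_topics = dict(lda.get_document_topics(bow, minimum_probability=0.0))
        weighted_docs.append((doc_topics.get(topic_id, 0.0), i))
    weighted_docs.sort(reverse=True)                  # highest topic weight first
    return [(weight, texts[i]) for weight, i in weighted_docs[:n]]

# for weight, text in top_documents_for_topic(lda, corpus, texts, topic_id=3):
#     print(round(float(weight), 3), text[:200])      # skim the strongest examples
```

Reading the highest-weighted documents for a topic is what turns a list of words into an interpretable claim about the corpus, and it requires no knowledge of the underlying Bayesian machinery.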

Rather than teaching all of our social scientists Bayesian statistics, I think it would be more productive to further consider what the tools can show, to develop more systematic approaches to interpretive analysis, and to change how we talk and think about the separation between topic modeling tools and theories of discourse or text production. As in other types of quantitative analysis, topic models are tools that provide perspectives, rather than excavators of some hidden truth buried in textual data.