Devin J. Cornell

Sociology PhD Student at Duke University
devin.cornell [at] duke.edu

I use computational methods to study cultural processes through
which organizations produce and are shaped by meaning.


Recent Work

Discursive Fields and Intra-party Influence in Colombian Politics

MA Thesis Committee: John W. Mohr, Maria S. Charles, Verta Taylor

Thesis on UC Santa Barbara eScholarship

When are politicians influential in shifting party discourse? This study explores how same-party politicians influence one another, and how this influence leads to changes in a party's larger discourse. I suggest that the extent to which politicians are able to influence other party politicians depends on how their messages situate them within the party’s discursive field. I further suggest that certain messages are particularly influential when distinctive within a given time period. To assess this effect, I use a case study of just under 1 million Tweets from politicians in the Colombian political party Centro Democrático from 2015-2017. I use topic modeling and network analysis to measure influence within a dynamic discursive field, and a genetic learning algorithm to identify the types of messages, as topics, which constitute the field in which we observe the strongest linkage between field position and influence. I find that politicians are influential when posting about current events and when creating symbolic distinctions that are central to the party ideology - in the case of Centro Democrático, distinctions between the concept of peace itself and the peace process developing in Colombia. These results suggest that the discursive field can be a powerful tool for analysis of influence and political discourse.

School, Studying, and Smarts: Gender Stereotypes and Education Across 80 Years of American Print Media, 1930-2009

Andrei Boutyline, Alina Arseniev-Koehler, Devin Cornell

Working Paper on SocArXiv

Gender stereotypes have important consequences for boys’ and girls’ academic outcomes. In this article, we apply computational word embeddings to a 200-million-word corpus of American print media (1930-2009) to examine how these stereotypes changed as women’s educational attainment caught up with and eventually surpassed men’s. This transformation presents a rare opportunity to observe how stereotypes change alongside the reversal of an important pattern of stratification. We track six stereotypes that prior work has linked to academic outcomes. Our results suggest that stereotypes of socio-behavioral skills and problem behaviors—attributes closely tied to the core stereotypical distinction between women as communal and men as agentic—remained unchanged. The other four stereotypes, however, became increasingly gender-polarized: as women’s academic attainment increased, school and studying gained increasingly feminine associations, whereas both intelligence and unintelligence gained increasingly masculine ones. Unexpectedly, we observe that trends in the gender associations of intelligence and studying are near-perfect mirror opposites, suggesting that they may be connected. Overall, the changes we observe appear consistent with contemporary theoretical accounts of the gender system that argue that it persists partly because surface stereotypes shift to reinterpret social change in terms of a durable hierarchical distinction between men and women.


Python Packages

DocTable

Python package for parsing, storing, and accessing text documents and models for large scale text analysis.

Website

GitHub Project

EasyText

Command-line tool for text analysis.

Create topic models, run sentiment analysis, count named entities, and extract subject-verb-object triplets from the command line.

GitHub Project

Blog Posts

Jan 12, 2021

Last month I did a workshop on text analysis in Python for a new computational social science group that several of us started at Duke Sociology (workshop materials). As I created the workshop materials, I had two thoughts: (1) most text analysis projects require essentially the same set of steps. The key is to come up with a system and design pattern that works for you. (2) There aren’t many new algorithms for text analysis in the social sciences. Most algorithms we've picked up are simply more efficient or slightly different variations of very old algorithms.

Most text analysis projects require the same steps. By this I mean that most projects require the same or similar boilerplate tasks: preprocessing might involve fixing spelling issues or removing artifacts from original texts; tokenization involves some decisions about which tokens to include, how to deal with named entities, stopword removal, or hyphenation collapsing; document representation storage involves placing the parsed dataset into a database (plug for my package doctable), spreadsheet, pickled file, or some other storage medium that can be accessed. Then an algorithm operates on those document representations to create models, which are again stored in a database or other file for creating visualizations or running statistical models. There may be more aspects to this: hand-coding, metadata analysis, and the like tend to be pretty important - but exceptions to this general pipeline are rare.
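To make that concrete, here is a minimal sketch of the kind of boilerplate I have in mind, using only the Python standard library. The toy corpus, cleaning rules, and sqlite schema are stand-ins for whatever a given project (or a package like doctable) would actually use:

import re
import sqlite3

# stand-in corpus: in a real project these would be read from files or an API
raw_docs = {
    "doc1": "The  Senate passed the bill on Tuesday.",
    "doc2": "Protesters gathered downtown over the weekend.",
}
stopwords = {"the", "on", "a", "an", "of", "and", "over"}  # project-specific list

def preprocess(text):
    # fix artifacts in the original texts (here: just collapse whitespace)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    # lowercase, keep alphabetic tokens, drop stopwords
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stopwords]

# document representation storage: one row per parsed document
con = sqlite3.connect("corpus.db")
con.execute("CREATE TABLE IF NOT EXISTS docs (name TEXT PRIMARY KEY, tokens TEXT)")
for name, raw in raw_docs.items():
    con.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)",
                (name, " ".join(tokenize(preprocess(raw)))))
con.commit()

The point is less the specific code than that nearly every project ends up with some version of these steps before any modeling happens.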

The point of learning these tools is to develop a series of design patterns. Most of us (speaking to social scientists here) are not software engineers, nor have we ever even been paid to write code that someone else will read or use. If we were, we would know that design patterns in code are all about predictability. Despite the incredibly large range of possible algorithm designs that could be used, the software engineer seeks to be consistent: consistent project structure, consistent design architecture, and even consistent use of syntax for basics like loops and conditionals. Someone has to read that code, so the goal is to make it as easily understood as possible. For us, we are (often) the only ones to read our code. So learning to write research code is about developing our way of doing the same boilerplate tasks and organizing projects in a way that we can recognize easily in the future.

There aren’t that many new tools for text analysis in the social sciences. Let’s start with an easy one: Latent Dirichlet Allocation (LDA). This one algorithm spurred on a (perceived) revolution of social scientists doing text analysis for interpretation. It hit everywhere: sociology (see this Poetics special issue), corpus linguistics, and the digital humanities especially. Right now, Word2Vec is HOT. Same deal, different algorithm. Now, I’m not against this: in fact, I’ve used both of these tools in research projects, and I think they can be incredibly useful and provide important insights. My argument is not that they are not useful – only that they are not particularly novel in the ways that social scientists use them.

While LDA has become by far the most popular topic modeling algorithm, it does pretty much the same thing as its matrix factorization equivalent, Nonnegative Matrix Factorization (NMF). They both start with the same model of the texts: documents are bags of words represented as rows in a document-term matrix. NMF is similar to Singular Value Decomposition (SVD) except that it operates on matrices with only nonnegative entries and produces nonnegative factors. The thing is, NMF is really old. For whatever reason, it was LDA that spurred interest in topic modeling for the social sciences. Same thing goes for word embeddings. While Word2Vec became hugely popular, subsequent work showed that a simple Pointwise Mutual Information (PMI) calculation based on word frequency with SVD could produce results similar to those from Word2Vec, but PMI-SVD has been around for a really long time. With the exception of parse tree and named entity extraction, I’m not totally convinced that we’ve seen anything new that really changes the way we can analyze texts.
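As a rough illustration of how interchangeable the two are in practice, here is a sketch using scikit-learn with a tiny placeholder corpus: both models start from the same bag-of-words document-term matrix, and both hand back document-topic and topic-word matrices.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

# placeholder corpus; in practice this would be thousands of documents
docs = [
    "peace accords referendum vote congress",
    "education secretary schools students teachers",
    "peace process negotiations congress vote",
    "schools students exams teachers education",
]

# both algorithms start from the same bag-of-words document-term matrix
vec = CountVectorizer()
dtm = vec.fit_transform(docs)
vocab = vec.get_feature_names_out()  # sklearn >= 1.0; older versions use get_feature_names()

lda = LatentDirichletAllocation(n_components=2, random_state=0)
nmf = NMF(n_components=2, random_state=0)
doc_topic_lda = lda.fit_transform(dtm)   # documents x topics
doc_topic_nmf = nmf.fit_transform(dtm)   # documents x topics

# topic-word matrices: one row per topic, one column per vocabulary term
topic_word_lda = lda.components_
topic_word_nmf = nmf.components_

From the analyst's point of view, the downstream interpretive work is identical; the difference lies in the estimation procedure, not in what you get to look at.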

If most text analysis pipelines are very similar and we’re using basically the same tools as we’ve had for the last few decades, where does that leave text analysis researchers? It leaves us with substance. The fact that these algorithms are so readily available and we have so many tools for working with texts means that we can focus less on methods and more on substantive analyses. In my opinion, the value in doing computational text analysis is more than the cool factor: we can answer classical questions that were difficult to answer before. In an increasingly digital society the study of digital texts has never been so important. We just have to be willing to ask the right questions before we pick up new tools.

Jan 12, 2021

I've been thinking recently about how we think and talk about the relationship between theory and methods in computational text analysis. I would like to suggest that an assumed inseparability of theory and methods leads us to false conclusions about the potential for Topic Modeling and other machine learning (ML) approaches to provide meaningful insight, and that it holds us back from developing more systematic interpretation methods and addressing issues like reproducibility and robustness. I'll respond specifically to Dr. Andrew Hardie's presentation "Exploratory analysis of word frequencies across corpus texts" (watch here) given at the 2017 Corpus Linguistics conference in Birmingham. Andrew makes some really good points in his critique about shortcomings and misunderstandings of tools like Topic Modeling, and I hope to contribute to this conversation so that we can further improve these methods – both in how we use them and how we think about them.

In his talk, Andrew makes some good points about topic modeling as it is linked with the “Big Data” phenomenon everyone is obsessed with (including me). His argument is essentially that many researchers, particularly in the digital humanities, use the algorithms without considering the technical details that make them work, and that furthermore the algorithms themselves have serious shortcomings. Andrew points out that “topics”, as LDA calls them, are not topics in the linguistic sense, but rather “a group of weightings that satisfy a stochastic model” (19:20). “It means that there is a kind of generative theory of how discourse works []” inherent in the algorithms. He says that topic modeling works along the assumption that texts are generated from authors drawing words probabilistically from sets of bins (topics) containing words – “that’s the generative theory of discourse.” Andrew also points out that the algorithms themselves are flawed in their application to the social sciences because they (a) require arbitrary parameters like number of topics and stopword lists that are not standardized and are often given little thought in analysis; (b) they use stochastic initialization that allows for a different topic model for every run of the algorithm; and (c) the algorithm often produces topics which are apparently meaningless and difficult to trace back to the empirical data in a useful way.

Separation of Theory and Method

Andrew’s presentation is an attack on both common practices for interpretations of topic models as found in the social sciences as well as the algorithms themselves. I’ll first point out something that is obvious to humanists and digital humanities scholars: that no argument is being made for texts being generated using probabilistic “bag of words” processes, or furthermore that topic modeling is any attempt at a “generative theory of how discourse works []”. That is by no means an assumption of the algorithm, and this line of research rarely, if ever, makes the argument of an exact mapping – this description of topic modeling is simply inaccurate. Andrew also brings up an interesting point about scientific models to make his point. He uses the definition “a description of a phenomenon that is simplified in some ways, to help us understand how that phenomenon actually works.” Instead, he says, “a topic model is a description of a phenomenon that works completely differently to how the thing it’s modeling actually works.” Andrew says that “we know that a topic model is not what is happening” – but researchers continue to use this method anyways.

I’ll use the example of classical statistical models to point out why this argument doesn’t make sense. Linear regression has been used for many years to test and develop theories based on relationships between observed variables on a large scale. I would guess that most researchers do not assume by using the models that observed variables are generated from some kind of linear model of related random variables. Researchers use this model because it can tell us something about empirical data which is evidence of some real underlying social or human process (a scientific model according to Andrew) that is obviously not the same statistical process used to build the model. Looking through thousands of observations of more than two dimensions would likely not be a fruitful approach to scientific analysis, so the models are often presented as “results” in place of raw data. These models, however, are not introduced without supporting theory or interpretation – in fact, most quantitative papers spend most of the text describing the motivation for the model in an effort to point out why the results properly test the theories. Most critiques of models are also at the theoretical and model design level – rarely are the implementations of the models themselves critiqued, mostly because they are made with standard software and, more importantly, researchers often know their limitations and potential biases (Type 1 and Type 2 errors for example). Admittedly, linear regression is not a generative ML model, but neither were generative models designed to provide some “generative model of [X]” (Andrew uses X=discourse – a misuse of the word “generative”, which has a stricter mathematical meaning in this case). When Gaussian mixture models or hidden Markov models are used for practical purposes, researchers are not arguing that the underlying processes are statistical processes whose features are somehow ‘uncovered’ by the construction of the model from empirical data. A specific underlying generative process is by no means an assumption required for use of an algorithm – this is something that has been reiterated hundreds of times in the ML and statistics literature. Andrew’s definition of scientific models is not at all incompatible with his characterization of topic modeling as a description of a phenomenon that works differently than the underlying process; in any case, that correspondence is never assumed by the use of these models.

His attack on common practices of interpretation of the topics is well justified – I think most researchers agree we could refine some of the systematic approaches to topic interpretation. Currently MALLET provides a way to tag topics according to the interpretation, but I think in practice this is often performed by simply looking at the top 20 words in each topic. We could be using things like word clouds or even histograms to reveal more nuance in the actual distributions over words that topics are composed of. Furthermore, we need to go back to the texts – topic content is useless without examining the ways that these topics are present in the original documents. We need to find ways to systematically perform close readings that can be summarized for the purposes of making academic arguments that rely on topic modeling.
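For instance, rather than stopping at a top-20 word list, one could plot a larger slice of the topic-word distribution itself. A minimal matplotlib sketch, assuming a fitted scikit-learn-style model like the one sketched earlier:

import numpy as np
import matplotlib.pyplot as plt

def plot_topic(topic_word, vocab, k, n_words=40):
    # bar chart of the n_words highest-probability words in topic k
    # topic_word: topics x vocabulary matrix (e.g. lda.components_)
    # vocab: terms in column order (e.g. vec.get_feature_names_out())
    weights = topic_word[k] / topic_word[k].sum()   # normalize to a distribution
    top = np.argsort(weights)[::-1][:n_words]
    plt.figure(figsize=(10, 3))
    plt.bar(range(len(top)), weights[top])
    plt.xticks(range(len(top)), np.asarray(vocab)[top], rotation=90)
    plt.ylabel("P(word | topic %d)" % k)
    plt.tight_layout()
    plt.show()

Seeing the shape of the distribution (how quickly the weights fall off, how many words carry real mass) often changes how confident one should be in a one-line topic label.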

Reproducibility and Randomness

One of the touted benefits of quantitative analysis is reproducibility, but whereas OLS models will consistently give the same output given the same data, this is not the case with topic modeling algorithms. Like most machine learning algorithms, they involve random parameter initializations that are iteratively updated according to some optimization function. This means that given the same parameters (except the random generator seed), the topic models can produce completely different outputs. Andrew says that this is a complete deal breaker for most humanist scholars, and a big concern to other researchers. I believe this reflects a fundamental, and perhaps quite widespread, misunderstanding of topic models as a method that uncovers some kind of ground truth in the data – an assumption quite embedded in Andrew’s “generative model of discourse”. I sometimes call this the “ghost in the data” bias, and I’ve observed it more broadly across different types of quantitative projects. The fact is, as Andrew points out, this is how many scholars view the models, and this bleeds over into interpretations and claims being made.

If we dismiss the absurd assumption that there is some ground truth topic model which properly represents appropriate features of the data, then we can view a given topic model as a useful description of the data which captures features that may be of interest to our analysis. With this understanding, I want to point out that it is possible to create reproducible topic models by setting a random generator seed as a parameter. Alongside other parameter values and listings of the software versions used, topic models are exactly reproducible given the same data. Admittedly though, I remain concerned about the robustness of a model given the data itself. If one were to run the algorithm multiple times and get a similar result, there might be more confidence in the statistical foundations (note: not theoretical foundations – that is a moot point) of the model. As social scientists, we need to develop methods and metrics for determining the statistical robustness of a given topic model – this is not unusual for generative ML models. As a simple approach, I suggest running the model multiple times and quantitatively comparing the different generated models with each other using topic distributions over words. If a particular topic model is generated that may be useful for the research questions being asked, this model can include a robustness measurement by comparing it with a hundred or more other random models. Researchers can then select topic models based on a combination of both robustness and usefulness. The dismissal of the ‘ground truth’ assumption allows for a more rigorous understanding of how the model relates to the empirical data.
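A minimal sketch of that robustness check, assuming scikit-learn's LDA and a document-term matrix dtm like the one above: fit the model under several seeds, match topics across runs by their word distributions, and report how similar the matched topics are.

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from scipy.optimize import linear_sum_assignment

def topic_word_dist(dtm, n_topics, seed):
    # fit LDA with a fixed seed and return normalized topic-word distributions
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed).fit(dtm)
    return lda.components_ / lda.components_.sum(axis=1, keepdims=True)

def robustness(dtm, n_topics=10, seeds=range(10)):
    # mean cosine similarity between best-matched topics of the first run and each other run
    runs = [topic_word_dist(dtm, n_topics, s) for s in seeds]
    ref = runs[0]
    scores = []
    for other in runs[1:]:
        sim = cosine_similarity(ref, other)       # topics x topics similarity matrix
        rows, cols = linear_sum_assignment(-sim)  # best one-to-one topic matching
        scores.append(sim[rows, cols].mean())
    return float(np.mean(scores))

A score near 1 would suggest that repeated runs largely recover the same topics; a low score would suggest that the particular model being reported is fragile with respect to initialization.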

Algorithm parameters are yet another subject being debated. Researchers often ask “how do we determine the correct number of topics we should use?”, to which I (and many others) respond that there is no “correct” number of topics, nor a correct value for any other parameter to topic modeling algorithms. If we see topic models as possible ‘interpretations’ of the data that provide useful perspectives rather than some ‘ground truth’ excavators, these questions no longer make sense. From this perspective, reproducibility in the sense of reporting input parameters, software versions, and random generator seeds is appropriate. Model results can even further be strengthened by robustness measures observed for a given model.

Algorithmic Transparency

Part of Andrew’s critique is also the opaqueness of the algorithms themselves. As in, we don’t know what is happening under the hood because the models are so complex. He contrasts this with factor analysis, pointing out that it is both more interpretable and provides parameters that are easier to select. I think this also hints at the same assumed tight coupling between theory and method. I argue that simpler algorithms are not more interpretable in terms of their relationship to meanings in the data. It is true that it would be difficult or pointless to trace the mathematics of LDA through your data from the term-document matrix into the factorized word-topic and topic-document matrices, but it is actually the role of the interpreter to trace topic models back to the original data. Different models should be selected based on how they capture topics relevant to the research interest. Exactly what word collocates mean depends on the documents themselves anyway – content, linguistic style, audience, and a hundred other features that are up for consideration by the researcher. In the same way that being able to solve the normal equations needed to build an OLS model doesn’t give you any more insight into the data than if you had used R or Stata, knowing algorithmic details of LDA is only useful to the extent that it gives you intuition as to the types of patterns to look for in the interpretation. I’m not saying that knowing algorithmic details of LDA is useless, but that a properly systematic interpretive analysis of the data shouldn’t require that mathematical knowledge.

Rather than teaching all of our social scientists Bayesian statistics, I think it would be more productive to further consider what the tools can show, to develop more systematic approaches to interpretive analysis, and to change how we talk and think about the separation between topic modeling tools and theories of discourse or text production. As in other types of quantitative analysis, topic models are tools that provide perspectives rather than excavators of some hidden truth embedded in textual data.

Dec 3, 2019

I created a public GitHub repo to share a cleaned version of the US National Security Strategy documents in plain text. It is a nice dataset to use for text analysis demos, and you can use the download_nss function to download the docs from the public repo directly in your code.

I generated these by copy/pasting the PDF text into plain text and doing some cleaning like special character conversion and some spell-checking. Paragraphs in the text are separated by two newlines, and each paragraph appears on a single line.
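If you would rather not use the repo's download_nss helper (I won't reproduce its signature here), a minimal sketch of pulling one document directly and splitting it into paragraphs; the raw-file URL below is a placeholder for the actual path in the public repo:

import urllib.request

# placeholder raw-file URL: substitute the actual path from the public repo
url = "https://raw.githubusercontent.com/USER/REPO/master/NSS-2017.txt"
text = urllib.request.urlopen(url).read().decode("utf-8")

# paragraphs are separated by two newlines; each paragraph sits on one line
paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]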

The choice of NSS documents was motivated by one of my all-time favorite articles co-authored by my former advisor John Mohr, Robin Wagner-Pacifici, and Ronald Breiger. In addition to the documents analyzed in that piece, I also copy/pasted text from the Trump 2017 NSS document. Each presidential administration since 1987 is required to produce at least one document per term, so you can easily compare the documents by administration or party.

Mohr, J. W., Wagner-Pacifici, R., and Breiger, R. L. (2015). Toward a computational hermeneutics. Big Data and Society, (July–December), 1–8. (link)

April 11, 2019

John Mohr and I recently received a grant for undergraduate instructional development aimed at creating a tool for non-programmers to run and analyze LDA and NMF topic models on a provided set of texts. We chose to make this tool accessible to non-coders so that it can be integrated into general sociology courses where most students have very little technical experience. The tool generates topic-token and document-topic distributions as an Excel spreadsheet, allowing students to run analyses and generate figures from within an interface they may be familiar with. The tool uses a command-line interface and can be installed using the command pip install easytext (GitHub repo).

The command line interface is particularly focused on generating spreadsheets that students can then view and manipulate in a spreadsheet program like Excel or LibreOffice. Students can perform interpretive analysis by going between EasyText output spreadsheets and the original texts, or feed the output into a quantitative analysis program like R or Stata. The program supports features for simple word counting, noun phrase detection, Named Entity Recognition, noun-verb pair detection, entity-verb detection, prepositional phrase extraction, basic sentiment analysis, topic modeling, and the GloVe word embedding algorithm.

While there are debates about the role of topic modeling and other algorithmic approaches to text analysis requiring interpretation, our undergraduate students have shown enthusiasm and diligence in considering the limitations and strengths of such tools (see an example of a student I mentored). In many ways, their experiences with text analysis algorithms have forced them to think beyond the familiarity of p-values and confidence intervals to establish different kinds of patterns in the social world – ones that may be partially out-of-reach with classical sociological research methods. And in this process, they are forced to consider the promises and pitfalls of using these algorithms for analyses.

See the README and Command Reference pages for usage examples.

As an example use case, consider a time when you have a spreadsheet of document names and texts called “mytextdata.xls”. Let’s assume that the column name of document names is “title” and the column of texts is simply “text”. To run a topic model of this text data with 10 topics that outputs to “mytopicmodel.xls”, we would use the following command:

python -m easytext topicmodel -n 10 mytextdata.xls --doclabelcol "title" --textcol "text" mytopicmodel.xls

The topic model output spreadsheet contains four sheets: doc_topic, topic_words, doc_summary, and topic_summary.

easytext spreadsheet example

While doc_topic contains documents as rows and topic probabilities as columns, and topic_words contains topics as rows and word probabilities as columns, the doc_summary and topic_summary sheets are meant to assist with interpretation: they list the topics most closely associated with each document and the words most closely associated with each topic, respectively.
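As a minimal sketch of where analysis might go from there, the sheets can also be pulled into pandas; the topic column name below is a placeholder for however the columns are actually labeled in the doc_topic sheet:

import pandas as pd

# read all four sheets at once; sheet_name=None gives a dict of DataFrames
# (reading .xls files may require the xlrd package)
sheets = pd.read_excel("mytopicmodel.xls", sheet_name=None)
doc_topic = sheets["doc_topic"]
topic_words = sheets["topic_words"]

# e.g. pull the ten documents where one topic is most present, to read closely
# ("topic_3" is a placeholder; use whatever the topic columns are actually named)
top_docs = doc_topic.sort_values("topic_3", ascending=False).head(10)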

Any topic model interpretation of course relies on referring back to the text of the original documents themselves, but this spreadsheet is designed to help with the process of linking the statistical topic model with the content and form of texts.

Further documentation is needed to push this into an instructional tool, but this is a good first step towards that end.

December 30, 2017

This Fall, John Mohr and I ran a pilot program to teach Sociology undergraduates how to use topic modeling in course projects. The pilot program lasted 4 weeks and students were asked to prepare a text corpus of approximately 100 documents using LexisNexis (or copy-paste from the web) and perform analysis using Excel or Google Sheets. Past mentoring projects of both John and me showed that undergraduates can come up with some pretty creative ways to use these computational analysis tools, even if they can’t write the code to do it themselves.

Beyond the technical, the hardest part of this work is getting students to think about what information they can get from large corpora and how to use the tools to answer questions of interest. It is clear that the era of Big Data and widespread internet access has changed the way social processes occur on a large scale (think Fake News), so we need to train social scientists to use new tools and think about data differently.

Topic presence in NYT corpus

Researchers like John and me are excited about the new questions we can answer with these tools, but I've been realizing that it's not so easy to explain how to analyze 100 documents using algorithms that, to students, are 'black boxes'. Part of this involves the emphasis on a loose coupling between theory and method, and part of it relates to theorizing about the media being analyzed. I’m thinking that few undergraduates are immediately prepared to study news corpora because it’s not something students are used to exploring manually as 'close readers'. Additionally, there are technical challenges like corpus construction and use of Excel that steepen the learning curve.

Our approach was to have students create a corpus of interest using LexisNexis, send it to me so I could output a topic model as a spreadsheet, then use Excel to aid in qualitative analysis and generate quantitative measures from the data as a comparison between sources or over time. We thought that it would be a good idea to provide an example analysis that students could walk through, so I generated two documents to guide students through (1) corpus construction and (2) analysis using news about Betsy DeVos as an example. We wanted the document to explain not only methodologically how, but how to think about the data at each stage of the analysis.

Preparing Your Corpus PDF Document

The first document about corpus construction has three parts detailing (1) what a corpus looks like on a computer, (2) how to build a corpus by downloading files from LexisNexis, and (3) how to build a corpus from an arbitrary news website by copy/pasting. I look at text files as documents, and have students build a corpus from simply copy-pasting text from downloaded LexisNexis search results or web pages. This is arguably the simplest approach to this type of analysis and also perhaps the most time consuming on a per-document basis. I think it is appropriate for teaching purposes because students won’t reach memory or speed limitations while working with the data on their personal computers, and they can become intimately familiar with the texts as a practice for using the methods.

This is a document word cloud generated from https://www.jasondavies.com/wordcloud/. I encouraged students to try quickly reading through documents this way.

Betsy DeVos Word Cloud

I then performed the analysis using some of the command line topic modeling tools I built. The library relies on downloaded nltk corpora for things like stopwords and requires dependencies that users likely need to install, so I chose to simply run my code that would output a spreadsheet for them. If interested, you can see the scripts that I used from the command line: lda.py and nmf.py. In the future, I’d like to build semanticanalysis into an installable library that users can access using pip. I’d also build in nltk.download() functions as needed or maybe even switch to a different text analysis library for that step (I’m not fond of the Java-based tools that nltk relies on for some tasks). In addition to topic models, I also ran the simplest type of sentiment analysis using word banks: I used the Python empath library for this. It is essentially a collection of topics whose contents are uniformly distributed over manually-selected words. Among its categories are positive_emotion and negative_emotion, which I encouraged students to use, but they could use any of the other categories as well.
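For reference, the empath scoring step looks roughly like this; the example text is made up, and the two category names are the ones mentioned above from empath's built-in lexicon:

from empath import Empath

lexicon = Empath()
# normalize=True divides category counts by document length,
# so scores are comparable across documents of different sizes
scores = lexicon.analyze(
    "students protested the new education policy announcement",
    categories=["positive_emotion", "negative_emotion"],
    normalize=True,
)
# scores is a dict mapping each category name to its (normalized) count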

TopicModel Analysis Guide PDF

My second document details how to do different kinds of analysis using the topic model spreadsheet or sentiment analysis spreadsheets. The topic model spreadsheet contains topic content (first 20 words) on the first sheet, and a topic-document matrix on the second sheet. I then give a systematic method for interpreting each of the topics: (1) analyze the topic word contents and develop hypotheses about what each topic might be tracing through the data, then (2) sort documents by presence of that topic, and read through the first 10 or more documents to narrow or construct new hypotheses for topic representation. Topics can trace different styles, contents, or modes of discourse as they relate to different corpora and types of documents, so it is important to recognize what they mean within a specific corpus. After the topics are thoroughly examined, quantitative analysis can be performed to compare news sources, time periods, and topics or topic collections in the data. Questions like topic presence are only meaningful with appropriate interpretations, but can provide insightful results if effort is put into the process.

This analysis compares the relative presence of topics in each of two subcorpora. We can see that topic 9 dominates topic 2 more in the Daily News than it does in the New York Times. From the table (and other types of charts), we can also see that both topics 2 and 9 occur much more in the NYT than DN.

NYT Corpus Topic Prevalence Pie Charts

There was a lot of variation in the effort students were willing to put into the projects. Ultimately, I’d say that most students enjoyed it, even if at the end they were still a little confused about how the process works (perhaps in part due to the fact that I generated the topic models for them). Four weeks is far too short for a proper pilot program, but the results were well worth the effort. At the very least the project had the effect of opening students’ minds as to the types of analyses that can be performed with large corpora and new questions that can be asked using these tools.

I also encouraged the interpretation of topics by calculating document correlations with some of the empath sentiment categories that are easy to interpret. This table shows that T9 might be related to banking but all other relationships seem implausible.

Empath Prevalence

From my end, I think my biggest challenge is to demonstrate more thoroughly the link between theory and method: more than merely describing the weak coupling, I need to demonstrate under specific conditions and in specific contexts how and when assumptions may or may not hold to answer questions of interest. This is more than a technical problem, because these are questions researchers still actively debate. My hope is that by continuing to teach and refine the methods we can improve how we understand the tools and contribute to the broad field of theorists and methodologists involved in these debates.

December 30, 2017

Last October I visited Erlangen, Germany to attend a workshop set up by Dr. Tim Griebel and Prof. Dr. Stefan Evert called “Texts and Images of Austerity in Britain. A Multimodal Multimedia Analysis”. Tim and Stefan are leading this ongoing project aimed at analyzing 20k news articles from The Telegraph and The Guardian starting in 2010 and leading up to Brexit. I’m working alongside 21 other researchers with backgrounds in discourse analysis, corpus linguistics, computational linguistics, multimodal analysis, and sociology to explore discourse between the two news sources across time from different perspectives.

Images of austerity news headlines

Comparison of front pages of The Guardian and The Daily Telegraph after the Greek people voted to reject austerity measures imposed by the IMF.

This event was a great opportunity to learn from experienced corpus linguists (CL) and discourse analysis scholars from different countries with different academic backgrounds. I found it surprisingly difficult to jump into UK politics without any background, but I learned a lot from listening to presentations and discussions. I spent last summer learning about Colombian politics and assumed the language barrier was the most difficult part of that research, but even in English it took me a while to understand the economic factors involved in the political discussions.

It was fascinating to compare populist movements occurring in the UK with those in the US and Colombia that I’m more familiar with. Last year I attended a preconference on populism at the International Communications Association annual meeting in San Diego, and the general consensus was that populism is a set of styles (repertoires?) and moral backdrops (intuitions?) that politicians use to build support for political positions. My observation is that the exact positions they apply to may vary widely by country. An inclination towards formal practice theories of culture leads me to believe that we can compare and contrast contexts by identifying sets of discursive repertoires and discursive frameworks that tap into deeply held morals and emotions. I’ll be presenting some of my work on Colombia from both theoretical and empirical perspectives using a combination of interviews and Twitter data at the 2018 Pacific Sociological Association annual meeting. That work will be shared after the proceedings are posted.

Another big impression was the surprising divide between the CL community and other fields like sociology, communications, and the digital humanities that study large corpora of texts using computational methods. While I’m familiar with the topic modeling and LSA approaches, the linguists I met use collocations, POS taggers, stemming, lemmatization and other non-machine learning approaches to text analysis. I expressed my surprise to one of the other workshop attendees and they pointed out that the pushback against ML from CL was reflected in a presentation Dr. Andrew Hardie (also a workshop attendee) gave at the 2017 Corpus Linguistics conference that aimed to critique topic modeling – I wrote a response to that critique here.

This workshop also forced me to consider the contributions that sociology could have to computational text analysis. My thought is that sociologists can help to place the media into a larger societal context by exploring economic, cultural, and organizational factors that affect and are affected by the media. Discourse analysis, digital humanities, and communications all examine the causes, content, and effects of media on people, but I think sociology has the opportunity to explore this at a collective scale. Cultural analysis could further contribute by looking at the moral under-girding of political rhetoric and how it relates to the construction of social categories that people understand and navigate through.

Overall, this workshop was a great opportunity for me to meet new people and be exposed to totally different approaches to computational text analysis and broader discourse theory. I’m excited to see where these collaborations will lead!

EDIT: the end result was finally published!

December 24, 2017

I spent Summer of 2017 with my colleague Marcelle Cohen living in and studying the conflict and peace process in Colombia. Our objective was to explore how political discourse as cultural practice creates entrenched ideologies and contentious politics there, and how those discourses relate to other populist movements happening around the world. From a methodological perspective, I’m interested to see how we can use interview data in tandem with computational text analysis and quantitative network methods. We performed interviews with politicians and diplomats, attended political rallies in Bogota and more rural communities, and made connections with some local peace organizations and universities. Our interviews will allow us to give agency to the political elite and understand discourse at a point of production as it is embedded in a political institution. Ultimately I had a great experience that allowed me to test the lenses of cultural and political theory, learn about qualitative methods, and dive deeper into the political culture in Colombia.

I took this photo after the last disarmament event at a FARC camp in rural Colombia. The after-event scene felt like a foreshadowing of post-accords politics.

colombia fieldwork

This article is more about my meta-impressions – see my presentation Political Culture in Colombia for some depth.

This project started about halfway through our first year as soc grad students when we were talking about the deep underlying moral and emotional culture of politics that divides people in the US. Marcelle impressed upon me that a similar thing was happening in Colombia: in a 2016 popular referendum the public voted “no” to a set of peace accords that would end the 50-year war between the FARC and the Colombian government. She had spent the last summer there meeting people and learning more about the process, and she planned to go there in summer 2017 to interview people after the referendum. This was the type of research I’d always wanted to do – meeting and listening to real people embedded in this political culture that had the potential to change the country for generations to come. I decided to join her on the trip; we submitted the IRB proposal, applied for funding, and took off for Colombia!

Me with university professor Ginneth Esmeralda Narváez Jaimes and two of her students at Universidad de Santo Tomas in Bogota. They invited me to an English-speaking forum about the peace process at the university.

colombia students

The trip itself was an awesome experience. Not only did I learn about the conflict and peace process there, but through attendance of events and exploration of the city I developed a sense of the political culture and how it enters into other aspects of life. The qualitative research was a lot more difficult than I expected – I relied upon my colleague Marcelle much more than anticipated. In addition, I’m not fluent in Spanish so networking and making friends was more challenging than imagined. That said, I still managed to meet many contacts that I still talk to today. It’s tricky being a non-Colombian studying Colombia; on one hand, we got access to politicians and diplomats that might have otherwise been impossible, but on the other hand the last thing we want to do is tell other people about their culture. People would ask our opinions, and I would always preface by noting that mine is the perspective of an outsider. Still, they listened to what we had to say and gave their opinions on what was happening there. These interactions were invaluable for studying the lived experiences of Colombians.

We met ex-president Uribe himself. The woman on the left had traveled three hours through the country to see this event. She had to leave one hour into the event because it started late.

colombia saw uribe

My approach also seemed to be much more theoretical and less Marxist than that of most researchers studying Colombia, particularly those from political science. That first year in Sociology I was interested in practice theories and the debates between Swidler and Vaisey further laid out by Lizardo. Studying politics in Colombia using only formal theories of culture turned out to be difficult, so I’m doing some theoretical development that focuses on discourse as practice. Thus, the interviews become a demonstration of practice itself. I argue that identifying larger themes and styles of discourse reveals the kinds of “strategies for action” that Swidler pointed to – abstract discursive frameworks which politicians weave in and out of to form coherent and consistent arguments that will identify them with the party while also distinguishing themselves for individual attainment.

Bogota from a tall mountain overlooking the city.

bogota mountain view

A picture of me outside the state and senate buildings where we did some of the interviews.

me outside senate building

I received an Innovation grant from my institutional IGERT network science program to combine interview methods with computational text analysis and quantitative network methods. While I’m getting rather tired of studies using Twitter, I think it’s a particularly useful platform for studying politicians in Colombia: ex-president Uribe of the Centro Democratico party has Tweeted over 65 thousand times in the last six years alone. Most party members have Twitter accounts with nearly that many tweets, and so I can study relationships between them at the level of discourse. I plan to use Topic Modeling with some of the Semantic Analysis methods I’m developing to further explore how political propaganda becomes “common sense” to party followers. I’d also like to integrate some of the quantitative survey methods like Relational Class Analysis (or Correlation Class Analysis) that Goldberg, Vaisey, and Boutyline have been developing. This analysis would provide insight that goes beyond the study of party propaganda into cultural configurations that motivate the more general public.

I learned much about culture, politics, morality, and sociological research from this trip. Many questions arose about ethical practices and effective research, and still I have many more questions than answers. As someone who had entered the world of sociology less than a year prior, I was surprised at how challenging but exciting qualitative research can be. I’d only read about the adventures of ethnographers and interviewers in books, so it was amazing to be able to do some of that work myself. I think it also gave me a good opportunity to theorize about Colombian political culture while I lived there. It would be difficult to grasp the full complexity of the context by simply looking at survey data or even texts alone.

Here are some more photos from the trip:

Current president Santos at last disarmament event at a FARC camp in rural Colombia.

president santos

Amazing Peruvian food.

peruvian food

Beautiful street art in Bogota.

bogota street art

Central square in Bogota with a church and the capital building.

bogota central square