MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
To get started with MALLET and Latent Dirichlet Allocation (LDA) topic modeling, we're going to use this tool. For extensive use, however, I recommend the command line. If you go that route, note that MALLET's documentation could certainly be better; Graham, Weingart, and Milligan's "Getting Started with Topic Modeling and MALLET," in The Programming Historian, might be the best place to start.
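If you do work from the command line, the basic MALLET workflow is two steps: import a directory of plain-text files into MALLET's binary format, then train a topic model on the result. A minimal sketch follows; the paths, the number of topics, and the iteration count are illustrative placeholders, not recommendations.

```shell
# Step 1: import a directory of plain-text files into MALLET's format.
# --keep-sequence preserves word order for the trainer;
# --remove-stopwords drops common English function words.
bin/mallet import-dir \
  --input path/to/texts \
  --output texts.mallet \
  --keep-sequence \
  --remove-stopwords

# Step 2: train an LDA model. Vary --num-topics and --num-iterations
# across runs and compare what congeals.
bin/mallet train-topics \
  --input texts.mallet \
  --num-topics 20 \
  --num-iterations 1000 \
  --output-topic-keys topic-keys.txt \
  --output-doc-topics doc-topics.txt
```

Here `topic-keys.txt` lists the top words for each topic, and `doc-topics.txt` gives the topic proportions for each document.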
If you need a corpus of texts for MALLET, then I recommend using a subdirectory or two from this .zip corpus of 19 c. texts, compiled and corrected (using a period-specific spell checker as well as automated OCR correction) by Jordan Sellers and Ted Underwood. (Thanks, Jordan and Ted!) The zip file is somewhat large (~2000 volumes); give it a bit to download. You might also want to refrain from running the entire corpus through MALLET, especially if you are using a laptop. If you want a sense of what's in the .zip file of 19 c. texts, then review this metadata file, which contains author, title, date, and filename information, among other things. It's a TXT file best viewed in a spreadsheet application, such as Excel or Google Drive.
As you're using MALLET, it doesn't hurt to compare its results with results from Voyant, a web-based text reading and analysis environment that provides quantitative data and expresses it in graphical form. Pairing Voyant with MALLET lets you juxtapose, e.g., word frequencies with thematic summaries.
A few things to consider when using MALLET:
- Borrowing from Colorado Reed, "a topic is a probability distribution over a collection of words and a topic model is a formal statistical relationship between a group of observed and latent (unknown) random variables that specifies a probabilistic procedure to generate the topics—a generative model. The central goal of a topic is to provide a 'thematic summary' of a collection of documents. In other words, it answers the question: what themes are these documents discussing?" (2).
- MALLET relies on Latent Dirichlet Allocation (LDA), which is probably the most popular topic model across the disciplines. The model describes how documents (or bags of words) obtain their words. In so doing, it makes no assumptions about the order in which the words appear in a given document (Reed 2-3). What's more, you decide what counts as a document: a document could be a paragraph in a novel, or it could be an entire novel. That said, it's important to interpret models across the topics they generate. In the case of MALLET, a cluster of words is meaningful only in relation to the other clusters identified alongside it. Avoid isolating specific topics as if they emerged independently of the rest.
- This particular tool outputs MALLET results in both CSV (for spreadsheets) and HTML (for browsers). You should examine them both: they provide different information and, through their different structures/formats, inform each other.
- When using the tool, it is important to run the algorithm several times, changing the number of preferred topics and iterations. Also consider running it with and without stopwords (even if the results with stopwords will seem banal or obvious). This way, you can test for consistency (or interesting anomalies) and iteratively develop the model, seeing what congeals across trials.
- In the humanities, topic modeling and LDA are rarely used to prove anything about texts. Instead, they are vehicles for conjecture and speculation, perhaps prompting us to think about groups of documents in ways we have not considered.
- Keep Ben Schmidt's perspective in mind: "And most humanists who do what I've just done—blindly throwing data into MALLET—won't be able to give the results the pushback they deserve. . . . I don't think I'm alone: and I'm not sure that we should be too enthusiastic about interpreting results from machine learning which we can only barely steer. So there are cases where topic modeling can be useful for data-creation purposes . . . But as artifacts to be interpreted on their own, topic models may be less useful."
- If you are using LDA to make an argument, then you might consider including word and topic intrusion in your methodology. See Chang et al. below for details. In short, word intrusion allows you to better determine how semantically persuasive topics are. It also tests the degree to which topics correspond with everyday interpretation by people (not just computers). Meanwhile, topic intrusion tests for how well a model corresponds with what people typically associate with particular topics. Both situate LDA in the sphere of everyday communication through a sort of user testing.
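To make the "generative model" idea above concrete, here is a minimal Python sketch of LDA's generative story (not MALLET's inference): each document gets its own topic mixture, each topic is a probability distribution over the vocabulary, and each word token is produced by first picking a topic, then picking a word from that topic. The vocabulary, topic count, and document length here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["whale", "sea", "ship", "love", "heart", "letter"]
num_topics = 2
doc_length = 8

# Each topic is a categorical distribution over the vocabulary,
# drawn from a Dirichlet prior.
topic_word = rng.dirichlet(alpha=[0.5] * len(vocab), size=num_topics)

def generate_document(doc_length):
    """Generate one 'document' (a bag of words) from the model."""
    # Per-document mixture over topics, also drawn from a Dirichlet.
    doc_topics = rng.dirichlet(alpha=[1.0] * num_topics)
    words = []
    for _ in range(doc_length):
        z = rng.choice(num_topics, p=doc_topics)     # pick a topic
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from it
        words.append(vocab[w])
    return words

doc = generate_document(doc_length)
print(doc)  # a bag of words; order carries no meaning in the model
```

Note that nothing in the sketch records where in the document a word lands, which is exactly the "bag of words" assumption discussed above: inference (what MALLET actually does) runs this story in reverse, estimating the hidden topic distributions from observed documents.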
Some other resources for MALLET and topic modeling:
- Posner, "Very Basic Strategies for Interpreting Results from the Topic Modeling Tool"
- Chang et al., "Reading Tea Leaves: How Humans Interpret Topic Models"
- Graham, Weingart, and Milligan, "Getting Started with Topic Modeling and MALLET"
- Underwood, "Topic Modeling Made Just Simple Enough"
- Underwood, "Where to Start with Text Mining"
- Schmidt, "Compare and Contrast"
- Schmidt, "When You Have a MALLET, Everything Looks Like a Nail"
- Weingart, "Topic Modeling for Humanists: A Guided Tour"
- Templeton, "Topic Modeling in the Humanities: An Overview"
- Blei, "Topic Modeling and Digital Humanities"
- Nelson, "Mining the Dispatch"
- Blevins, "Topic Modeling Martha Ballard’s Diary"
- Rhody, "Topic Modeling and Figurative Language" (+ the data)