University of Victoria
Jentery Sayers
Spring 2015



MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modelling, information extraction, and other machine learning applications to text.

To get started with MALLET and Latent Dirichlet Allocation topic modeling, we're going to use this tool. However, for extensive use, I recommend using the command line. If you are using the command line, then it's important to note that MALLET's documentation could certainly be better. In fact, Graham, Weingart, and Milligan's "Getting Started with Topic Modeling and MALLET," in The Programming Historian, might be the best place to start.
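For those who do move to the command line, a typical session has two steps: import a folder of plain-text files into MALLET's own format, then train a topic model on the result. The sketch below wraps those two steps in Python so they are easy to rerun. It assumes MALLET is already installed; the launcher path, folder names, output filenames, and the default of 20 topics are all placeholders rather than anything prescribed here. The helper names `build_mallet_commands` and `run_mallet` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_mallet_commands(mallet_bin, corpus_dir, work_dir, num_topics=20):
    """Return MALLET's import and training commands as argument lists.

    All paths are placeholders supplied by the caller. On Windows the
    launcher is usually bin\\mallet.bat rather than bin/mallet.
    """
    work_dir = Path(work_dir)
    corpus_file = work_dir / "corpus.mallet"  # hypothetical output filename

    import_cmd = [
        str(mallet_bin), "import-dir",
        "--input", str(corpus_dir),       # folder of plain-text files
        "--output", str(corpus_file),     # MALLET's binary corpus format
        "--keep-sequence",                # keep tokens in order, as topic modeling expects
        "--remove-stopwords",             # drop common English function words
    ]
    train_cmd = [
        str(mallet_bin), "train-topics",
        "--input", str(corpus_file),
        "--num-topics", str(num_topics),
        "--output-topic-keys", str(work_dir / "topic_keys.txt"),  # top words per topic
        "--output-doc-topics", str(work_dir / "doc_topics.txt"),  # topic mix per document
    ]
    return import_cmd, train_cmd


def run_mallet(mallet_bin, corpus_dir, work_dir, num_topics=20):
    """Run the import step, then the training step, stopping on any error."""
    Path(work_dir).mkdir(parents=True, exist_ok=True)
    for cmd in build_mallet_commands(mallet_bin, corpus_dir, work_dir, num_topics):
        subprocess.run(cmd, check=True)
```

Calling something like `run_mallet("bin/mallet", "texts", "output")` from the MALLET install directory would write the topic keys and document-topic proportions into the (hypothetical) `output` folder. Building the commands separately from running them makes it easy to print and inspect them first.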

If you need a corpus of texts for MALLET, then I recommend using a subdirectory or two from this .zip corpus of 19th-century texts, compiled and corrected (using a period-specific spell checker as well as automated OCR correction) by Jordan Sellers and Ted Underwood. (Thanks, Jordan and Ted!) The zip file is somewhat large (~2000 volumes), so give it some time to download. You might also want to refrain from running the entire corpus through MALLET, especially if you are using a laptop. If you want a sense of what's in the .zip file of 19th-century texts, then review this metadata file, which contains author, title, date, and filename information, among other things. It's a TXT file best viewed in a spreadsheet application, such as Excel or Google Drive.

As you're using MALLET, it doesn't hurt to compare its results with results from Voyant, a web-based text reading and analysis environment that provides quantitative data and expresses it in graphical form. Using Voyant alongside MALLET will allow you to combine, e.g., word frequencies with thematic summaries.

A few things to consider when using MALLET:

Some other resources for MALLET and topic modelling: