Comparing two sets of texts

One very useful text analysis tool in the HathiTrust Research Center Portal is the Meandre Dunning LogLikelihood to Tagcloud algorithm. The algorithm compares two worksets by identifying the words that are more common and less common in one workset than in the other. This tool has been very useful in analyzing how different subsets of the Early American Cookbooks collection differ from the collection as a whole. Tag clouds display the overrepresented and underrepresented terms for government publications, Fannie Farmer’s cookbooks, and the different census regions of the United States (Northeastern, Southern, Midwestern, and Western).

How it works:
• calculates Dunning Log-likelihood based on two worksets provided as inputs: an “analysis workset” and a “reference workset”;
• loads each page of each workset, removes the first and last line of each page, and joins hyphenated words that occur at the end of a line;
• performs part-of-speech tagging (selecting only NN|NNS|JJ.*|RB.*|PRP.*|RP|VB.*|IN);
• lowercases the remaining tokens;
• counts the remaining tokens across all volumes in each collection;
• compares the counts from each collection using the Dunning Log-likelihood statistic; the “overused” tokens in the analysis collection relative to the reference collection (200 tokens by default) are displayed as a tag cloud and made available as a CSV file; the “underused” tokens in the analysis collection relative to the reference collection (also 200 tokens by default) are likewise displayed as a tag cloud and made available as a CSV file.
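
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Meandre implementation: the function names (page_tokens, dunning_g2, compare_worksets) are hypothetical, the part-of-speech filter is omitted because it would require a tagger, and the statistic is computed from the standard 2x2 contingency table, with a sign added to separate overused from underused tokens.

```python
import math
import re
from collections import Counter


def page_tokens(page_text):
    """Drop the first and last line of a page, rejoin words hyphenated
    across line ends, and return lowercased word tokens.
    (The part-of-speech filter from the list above is not applied here.)"""
    lines = page_text.splitlines()[1:-1]
    text = "\n".join(lines)
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    return re.findall(r"[a-z]+", text.lower())


def dunning_g2(a, b, total_a, total_b):
    """Dunning log-likelihood (G2) for one token.
    a, b: counts of the token in the analysis and reference worksets;
    total_a, total_b: total token counts of the two worksets."""
    n = total_a + total_b
    observed = [a, b, total_a - a, total_b - b]
    row_totals = [a + b, a + b, n - a - b, n - a - b]
    col_totals = [total_a, total_b, total_a, total_b]
    g = 0.0
    for o, r, c in zip(observed, row_totals, col_totals):
        if o > 0:  # an empty cell contributes nothing to the sum
            g += o * math.log(o * n / (r * c))
    return 2 * g


def compare_worksets(analysis_tokens, reference_tokens, top_n=200):
    """Return (overused, underused) lists of (token, signed score)."""
    ca, cr = Counter(analysis_tokens), Counter(reference_tokens)
    ta, tr = sum(ca.values()), sum(cr.values())
    scores = {}
    for tok in set(ca) | set(cr):
        a, b = ca[tok], cr[tok]
        score = dunning_g2(a, b, ta, tr)
        # Negative sign marks tokens that are relatively rarer in the analysis set
        scores[tok] = -score if a / ta < b / tr else score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    over = [(t, s) for t, s in ranked if s > 0][:top_n]
    under = [(t, s) for t, s in reversed(ranked) if s < 0][:top_n]
    return over, under
```

Calling compare_worksets with the tokens of a subset (for example, the government publications) as the analysis workset and the tokens of the whole collection as the reference workset would return the two ranked lists that the tag clouds visualize.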

What is topic modeling?

Topic modeling is a useful way to look for trends and patterns in the collection that may add to our understanding of early cookbooks.

As Megan R. Brett explains in “Topic Modeling: A Basic Introduction,” topic modeling is a form of text mining, a way of identifying patterns in a corpus. You take your corpus and run it through a tool which groups words across the corpus into ‘topics’ (Brett 2012). Miriam Posner has described topic modeling as “a method for finding and tracing clusters of words (called ‘topics’ in shorthand) in large bodies of texts” (Posner 2012).

Topic modeling is an automated text mining technique that offers a “suite of algorithms to discover hidden thematic structure in large collections of texts” (Blei 2013, 7). It is a methodology developed in computer science, machine learning, and natural language processing that has recently become very popular in the digital humanities (Meeks and Weingart 2013). Digital tools such as MALLET (McCallum 2002) generate comprehensive lists of subjects through statistical analysis of word occurrences in a corpus. The content of the documents, not a human indexer, determines the topics (Jockers 2013, 124). Unlike traditional classification systems with a pre-existing taxonomy of terms, topic modeling creates topics by clustering words that frequently occur together in a text. The resulting topical clusters can be readily interpreted as subject facets by human readers, allowing them to browse the topics of a collection quickly and find relevant material using topically expanded keyword searches (Mimno and McCallum 2007).

The topic models for Early American Cookbooks were generated using the Meandre Topic Modeling algorithm created by Loretta Auvil and available via the HathiTrust Research Center Portal. The algorithm serves to “identify ‘topics’ in a workset based on words that have a high probability of occurring close together in the text. Topics are models trained on co-occurring text using Latent Dirichlet Allocation (LDA), where each topic is treated as a generative model and volumes are assigned a probability of how likely each topic is to have generated that text. The most likely words for a topic are displayed as a word cloud.” Please see Topic Models for Early American Cookbooks and Topic Models for Government Publications for the word cloud results and the About page for more details on the workflow.
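
To make the idea of LDA more concrete, here is a minimal sketch of a collapsed Gibbs sampler, one common way of fitting an LDA model. It is a toy written for illustration only and is not the Meandre or MALLET implementation; the function name lda_gibbs, the hyperparameter values alpha and beta, and the number of iterations are all assumptions. Each document is a list of tokens, and the function returns the most likely words for each topic along with a probability for each topic in each document.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.
    docs: list of token lists. Returns (top_words, doc_probs)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_words = len(vocab)
    word_id = {w: i for i, w in enumerate(vocab)}

    doc_topic = [[0] * n_topics for _ in docs]            # tokens per topic, per document
    topic_word = [[0] * n_words for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics
    assignments = []

    # Start from a random topic assignment for every token
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Resample a topic in proportion to how well it fits
                # both the document and the word
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    # Most frequent words per topic (what a word cloud would display)
    top_words = [
        [vocab[v] for v in sorted(range(n_words), key=lambda v: -topic_word[t][v])[:5]]
        for t in range(n_topics)
    ]
    # Smoothed share of each topic in each document
    doc_probs = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    return top_words, doc_probs
```

Because the sampler is random, the topics it finds vary with the seed and the number of iterations, which is one reason the resulting clusters need a human reader to label and interpret them.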

WORKS CITED

Blei, David M. 2013. “Topic Modeling and Digital Humanities.” Journal of Digital Humanities.  

Brett, Megan R. 2012. “Topic Modeling: A Basic Introduction.” Journal of Digital Humanities.

Jockers, Matthew Lee. 2013. Macroanalysis: Digital Methods and Literary History. Urbana: University of Illinois Press.

McCallum, Andrew Kachites. 2002. “MALLET: A Machine Learning for Language Toolkit.” 

Meeks, Elijah, and Scott Weingart. 2013. “The Digital Humanities Contribution to Topic Modeling.” Journal of Digital Humanities.

Mimno, David, and Andrew McCallum. 2007. “Organizing the OCA: Learning Faceted Subjects from a Library of Digital Books.” Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, 376–385. New York, NY, USA: ACM.

Posner, Miriam. 2012. “Very Basic Strategies for Interpreting Results from the Topic Modeling Tool.” Miriam Posner’s Blog, 29 Oct.

Topic modeling for government publications

Here are the topic modeling results for the United States government publications in the collection. The ten word clouds in the chart below show different topics or clusters of words that recur across all of the texts. The names of the topics were not generated by the algorithm but rather added as a way to label and interpret the clusters. While it is impossible to draw definitive analytical conclusions, the topics do provide an interesting snapshot of the subject matter.

The government publications consist primarily of military cooking manuals, along with some USDA recipe booklets focusing on nutrition and the use of substitute ingredients during wartime rationing. The subject matter of these publications is quite different from the rest of the cookbooks in the collection, and this difference is demonstrated in the topic word clouds. The topics represent a more clearly defined, scientific approach to cooking with clear groups of ingredients and measurements. Topic 2 (dairy), topic 5 (bread), topic 6 (stew), topic 7 (meat), and topic 10 (equipment) are all quite straightforward descriptions of basic kitchen items. Topic 4 (meat analysis) and topic 9 (bread analysis) emphasize weights, measures, and nutrition terms such as results, protein, and digestibility. Topic 1 (mess hall) includes words such as men, mess, recipe, meal, and rations, and serves to describe the daily workings of a military kitchen. Topic 8 (labor and costs) addresses the economic aspects of running a large food service operation.

 

Topic 1: Mess hall
 

Topic 2: Dairy
 

Topic 3: Nutrition
 

Topic 4: Meat analysis
 

Topic 5: Bread
 

Topic 6: Stew
 

Topic 7: Meat
 

Topic 8: Labor and costs
 

Topic 9: Bread analysis
 

Topic 10: Equipment

Topic modeling for early American cookbooks

Topic modeling shows some interesting trends and patterns in the text for the 1450 books in the collection. The ten word clouds in the chart below show different topics or clusters of words that recur across all of the texts. The names of the topics were not generated by the algorithm but rather added as a way to label and interpret the clusters. While it is impossible to draw definitive analytical conclusions, the topics do provide an interesting snapshot of the subject matter.

Early American cookbooks had many common themes, largely because the diet and cookery techniques in the 1800 to 1920 period were far more homogeneous than they are today. Nearly every cook used salt, pepper, and butter as the primary seasonings (topic 1), boiled kettles over a fire for long periods of time (topic 2), prepared meat, most frequently with gravy or sauce (topic 3), and made bread (topic 6), cake (topic 9), and various fruit-based items such as jelly, lemonade, or ice cream (topic 10). Some topics are sparser and harder to interpret. Topic 5 possibly represents boiling vegetables, and topic 7 seems to be about pickling or similar processes. Topic 4 includes words such as place, time, made, long, heat, air, and dry. The significance is unclear, but the topic may refer to storing food in cupboards, drying fruit, or other related processes. Topic 8 reaches beyond the ingredients and instructions into the how and why of cooking and homemaking. Words such as food, time, good, made, great, people, work, body, give, family, and years are commonly present in the forewords and introductions to cookbooks, which sought to provide inspiration for readers.