“Vegetarian” timeline

The number of books in the Early American Cookbooks collection which contain the word “vegetarian” in the text increases slowly in the late 19th century and then grows substantially in the years from 1900 to 1920. The vegetarian movement in the United States grew over the same timespan and publishers began producing cookbooks devoted to a purely vegetarian diet. The timeline also reflects the increased number of references to a vegetarian diet not only in books such as How to Cook Vegetables (1891) by the bestselling author Sarah Tyson Rorer, but also in general cookbooks such as Fannie Farmer’s A New Book of Cookery (1917).

Use of word “vegetarian” over time

Vegetarian over-represented terms

Early vegetarian cookbooks featured recipes containing nuts and new forms of protein foods based on nuts. These ingredients are prominent in this word cloud, which shows over-represented terms in vegetarian cookbooks when compared to the full set of the Early American Cookbooks collection. Unusual words such as protose, nuttolene, trumese, and terralac (all nut-based mixtures to be used instead of meat) appear along with new terms for grain products (granose, granola). Many of these products were invented by John Harvey Kellogg, an early proponent of vegetarianism and the inventor of Corn Flakes cereal.

Vegetarian over-represented terms (Meandre Dunning Log Likelihood to Tagcloud Algorithm)

This visualization was created by comparing two sets of texts, vegetarian cookbooks and the full Early American Cookbooks collection, using the Meandre Dunning Log-likelihood to Tagcloud algorithm in the HathiTrust Research Center Portal.

Frugal cookbooks over and under-represented terms

A text analysis comparison between the cookbooks containing the word “frugal” and the full Early American Cookbooks set shows some interesting differences. The over-represented terms in the “frugal” books are everyday words such as them, that, good, not, should, and your, which have no obvious connection to cooking. The under-represented terms include kitchen measurement terms such as teaspoons and names of ingredients, notably some more luxurious items such as chicken, chocolate, cake, butter, and pineapple. While it is not possible to form definitive conclusions, it appears that the frugal books emphasize ordinary language (perhaps directed toward expenditure and lifestyle choices with a healthy dose of “should” and “not”?) and do not offer a wealth of different ingredient names.
 

Frugal over-represented terms (Meandre Dunning Log Likelihood to Tagcloud Algorithm)
Frugal under-represented terms (Meandre Dunning Log Likelihood to Tagcloud Algorithm)

This visualization was created by comparing two sets of texts, cookbooks containing the word “frugal” and the full Early American Cookbooks set, using the Meandre Dunning Log-likelihood to Tagcloud algorithm in the HathiTrust Research Center Portal.

“Frugal” timeline

The number of books in the Early American Cookbooks collection which contain the word “frugal” in the text increases over the years 1800 to 1920. This increase may simply be a reflection of the overall increase in the number of books published over time in the collection (see the books by year chart). The peaks at the end of the 19th century may reflect an increase in the number of books directed at young, inexperienced housekeepers with a small budget, such as The Cottage Kitchen: A Collection of Practical and Inexpensive Receipts or Motherly Talks: The Home, How to Make and Keep It.

Use of word “frugal” over time
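
The caveat in the timeline discussion, that raw counts may only mirror the growth of the collection, can be checked by dividing each year's count by the number of books published that year. The following is a minimal sketch of that normalization, not the method used to build the chart. The function name share_by_year, the (year, text) input format, and the sample records are all assumptions made for illustration.

```python
import re
from collections import Counter

def share_by_year(books, word):
    """Map each year to the fraction of that year's books containing `word`.

    `books` is an iterable of (year, full_text) pairs, a hypothetical input
    format chosen for this sketch.
    """
    # Whole-word, case-insensitive match: "frugality" does not count as "frugal".
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    totals, hits = Counter(), Counter()
    for year, text in books:
        totals[year] += 1
        if pattern.search(text):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Invented sample records, not taken from the collection.
sample = [
    (1800, "The frugal housewife saves her drippings."),
    (1800, "A rich plum cake."),
    (1900, "Frugal meals for small budgets."),
    (1900, "On frugality in general."),
]
shares = share_by_year(sample, "frugal")
```

A flat share across years alongside a rising raw count would suggest that the rise comes from the size of the collection rather than from growing interest in the word.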

Comparing two sets of texts

One very useful text analysis tool in the HathiTrust Research Center Portal is the Meandre Dunning Log-likelihood to Tagcloud algorithm. The algorithm compares two worksets by identifying the words that are more and less common in one workset than in the other. This tool has been very useful in analyzing how different subsets of the Early American Cookbooks collection differ from the collection as a whole. Tag clouds display the over- and under-represented terms for government publications, Fannie Farmer’s cookbooks, and the different census regions of the United States (Northeastern, Southern, Midwestern, and Western).

How it works:
• calculates Dunning Log-likelihood based on two worksets provided as inputs: an “analysis workset” and a “reference workset”;
• loads each page of each workset, removes the first and last line of each page, joins hyphenated words that occur at the end of the line;
• performs part of speech tagging (selecting only NN|NNS|JJ.*|RB.*|PRP.*|RP|VB.*|IN);
• lowercases the tokens remaining;
• counts the tokens remaining for all volumes for each collection;
• compares counts from each collection using the Dunning Log-likelihood statistic. The “overused” tokens in the analysis collection relative to the reference collection (200 tokens by default) are displayed as a tag cloud and made available via a csv file. The “underused” tokens in the analysis collection relative to the reference collection (also 200 tokens by default) are likewise displayed as a tag cloud and made available via a csv file.
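
The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Meandre implementation: the function names are hypothetical, part of speech tagging is left out because it needs an external tagger, and the statistic uses the two-term form of the log-likelihood that is common in corpus comparison, which may differ from the exact formula in the HathiTrust tool.

```python
import math
import re
from collections import Counter

def clean_page(page):
    """Drop the first and last line of a page and rejoin end-of-line hyphenations."""
    text = "\n".join(page.splitlines()[1:-1])
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

def count_tokens(pages):
    """Lowercase and count alphabetic tokens across all pages (no POS filter here)."""
    counts = Counter()
    for page in pages:
        counts.update(re.findall(r"[a-z]+", clean_page(page).lower()))
    return counts

def log_likelihood(a, b, c, d):
    """Score a word seen a times in c analysis tokens and b times in d reference tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the analysis set
    e2 = d * (a + b) / (c + d)  # expected count in the reference set
    score = 0.0
    if a > 0:
        score += a * math.log(a / e1)
    if b > 0:
        score += b * math.log(b / e2)
    return 2 * score

def compare(analysis, reference, top_n=200):
    """Return (overused, underused) word lists, highest score first."""
    c, d = sum(analysis.values()), sum(reference.values())
    over, under = [], []
    for word in set(analysis) | set(reference):
        a, b = analysis[word], reference[word]
        scored = (log_likelihood(a, b, c, d), word)
        if a * d > b * c:      # relative frequency is higher in the analysis set
            over.append(scored)
        elif a * d < b * c:    # relative frequency is lower in the analysis set
            under.append(scored)
    over.sort(reverse=True)
    under.sort(reverse=True)
    return [w for _, w in over[:top_n]], [w for _, w in under[:top_n]]

# Invented counts for illustration only.
analysis = Counter({"nut": 8, "salt": 2})
reference = Counter({"nut": 2, "salt": 8, "beef": 10})
overused, underused = compare(analysis, reference)
```

The statistic itself has no direction, so the sketch decides whether a word is overused or underused by comparing its relative frequency in the two sets and uses the score only for ranking.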

What is topic modeling?

Topic modeling is a useful way to look for trends and patterns in the collection which may add to our understanding of early cookbooks.

As Megan R. Brett explains in Topic Modeling: A Basic Introduction, topic modeling is a form of text mining, a way of identifying patterns in a corpus. You take your corpus and run it through a tool which groups words across the corpus into ‘topics’ (Brett 2012). Miriam Posner has described topic modeling as “a method for finding and tracing clusters of words (called ‘topics’ in shorthand) in large bodies of texts” (Posner 2012).

Topic modeling is an automated text mining technique that offers a “suite of algorithms to discover hidden thematic structure in large collections of texts” (Blei 2013, 7). Topic modeling is a methodology developed in computer science, machine learning, and natural language processing that has recently become very popular in the digital humanities (Meeks and Weingart 2013). New digital tools such as MALLET (McCallum 2002) generate comprehensive lists of subjects through statistical analysis of word occurrences in a corpus. The content of the documents, not a human indexer, determines the topics (Jockers 2013, 124). Unlike traditional classification systems with a pre-existing taxonomy of terms, topic modeling creates topics by clustering words that frequently occur together in a text. The resulting topical clusters can be readily interpreted as subject facets by human readers, allowing them to browse the topics of a collection quickly and find relevant material using topically expanded keyword searches (Mimno and McCallum 2007).

The topic models for Early American Cookbooks were generated using the Meandre Topic Modeling algorithm created by Loretta Auvil and available via the HathiTrust Research Center Portal. The algorithm serves to “identify ‘topics’ in a workset based on words that have a high probability of occurring close together in the text. Topics are models trained on co-occurring text using Latent Dirichlet Allocation (LDA), where each topic is treated as a generative model and volumes are assigned a probability of how likely each topic is to have generated that text. The most likely words for a topic are displayed as a word cloud.” Please see Topic Models for Early American Cookbooks and Topic Models for Government Publications for the word cloud results and the About page for more details on the workflow.
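
To make the idea of LDA more concrete, here is a small sketch that fits topics with collapsed Gibbs sampling, one standard way of training an LDA model. It is an illustration only: the source does not say how the Meandre algorithm is trained, and the function name lda_gibbs, the parameter defaults, and the toy documents are assumptions.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents with collapsed Gibbs sampling.

    Returns (doc_topic, topic_word): topic probabilities for each document
    and word probabilities for each topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    v = len(vocab)
    ndk = [[0] * n_topics for _ in docs]                      # document-topic counts
    nkw = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                       # tokens per topic
    z = []                                                    # topic of each token
    for d, doc in enumerate(docs):                            # random initial assignment
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                                   # remove the current assignment
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + v * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k                                   # record the resampled topic
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1
    doc_topic = [
        [(ndk[d][t] + alpha) / (len(doc) + n_topics * alpha) for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        {w: (nkw[t][w] + beta) / (nk[t] + v * beta) for w in vocab}
        for t in range(n_topics)
    ]
    return doc_topic, topic_word

# Invented toy documents, not drawn from the collection.
docs = [
    ["flour", "butter", "sugar", "flour", "bake"],
    ["beef", "stew", "onion", "beef", "boil"],
    ["sugar", "butter", "bake", "flour"],
    ["onion", "stew", "boil", "beef"],
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for t, dist in enumerate(topic_word):
    print(t, sorted(dist, key=dist.get, reverse=True)[:3])
```

The words printed for each topic correspond to what a word cloud would show in large type. As with the clouds on this site, the numbered topics carry no labels of their own, and naming them is left to the reader.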

WORKS CITED

Blei, David M. 2013. “Topic Modeling and Digital Humanities.” Journal of Digital Humanities.

Brett, Megan R. 2012. “Topic Modeling: A Basic Introduction.” Journal of Digital Humanities.

Jockers, Matthew Lee. 2013. Macroanalysis: Digital Methods and Literary History. Urbana: University of Illinois Press.

McCallum, Andrew Kachites. 2002. “MALLET: A Machine Learning for Language Toolkit.”

Meeks, Elijah, and Scott Weingart. 2013. “The Digital Humanities Contribution to Topic Modeling.” Journal of Digital Humanities.

Mimno, David, and Andrew McCallum. 2007. “Organizing the OCA: Learning Faceted Subjects from a Library of Digital Books.” Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries, 376–385. New York, NY, USA: ACM.

Posner, Miriam. 2012. “Very Basic Strategies for Interpreting Results from the Topic Modeling Tool.” Miriam Posner’s Blog. 29 Oct. 2012.

Topic modeling for government publications

Here are the topic modeling results for the United States government publications in the collection. The ten word clouds in the chart below show different topics or clusters of words that recur across all of the texts. The names of the topics were not generated by the algorithm but rather added as a way to label and interpret the clusters. While it is impossible to draw definitive analytical conclusions, the topics do provide an interesting snapshot of the subject matter.

The government publications include primarily military cooking manuals with some additional USDA recipe booklets focusing on nutrition and use of substitute ingredients during wartime rationing. The subject matter of these publications is quite different from the rest of the cookbooks in the collection and this difference is demonstrated in the topic word clouds. The topics represent a more clearly defined, scientific approach to cooking with clear groups of ingredients and measurements. Topic 2 (dairy), topic 5 (bread), topic 6 (stew), topic 7 (meat), and topic 10 (equipment) are all quite straightforward descriptions of basic kitchen items. Topic 4 (meat analysis) and topic 9 (bread analysis) emphasize weights, measures, and nutrition terms such as results, protein, and digestibility. Topic 1 (mess hall) includes words such as men, mess, recipe, meal, and rations, and serves to describe the daily workings of a military kitchen. Topic 8 (labor and costs) addresses the economic aspects of running a large food service operation.

 

Topic 1: Mess hall
Topic 2: Dairy
Topic 3: Nutrition
Topic 4: Meat analysis
Topic 5: Bread
Topic 6: Stew
Topic 7: Meat
Topic 8: Labor and costs
Topic 9: Bread analysis
Topic 10: Equipment

Government publications over-represented terms

Government publications over-represented terms (Meandre Dunning Log Likelihood to Tagcloud Algorithm)

What do the words feces, urine, experiment, grams, and ration have to do with cookbooks? They are all over-represented terms in United States government publications on cooking. These publications include primarily military cooking manuals with some additional USDA recipe booklets focusing on nutrition and use of substitute ingredients during wartime rationing. The subject matter of these publications is quite different from the rest of the cookbooks in the collection and this difference is demonstrated in the word cloud above. The government publications take a much more scientific approach to cooking, focusing on experiments, nutrition and digestion, measurements, and rations per man. 

The word “feces” was a valuable clue in interpreting and correcting the data visualizations in this project. The word first appeared in an over-represented tag cloud for books published in the Southern census region of the United States. It seemed hard to believe that cookbooks on Southern cuisine featured feces, so a re-examination of the dataset was in order. Washington, D.C. is part of the Southern census region, but it is also the place of publication for large numbers of government documents. Separating out the books published by government agencies from the larger Southern set proved to be the answer to the problem. Without the government publications, the over-represented terms for the Southern set no longer contained feces, urine, or any of the other nutrition-related terms.

This visualization was created by comparing two sets of texts, government publications and the full Early American Cookbooks set, using the Meandre Dunning Log-likelihood to Tagcloud algorithm in the HathiTrust Research Center Portal.

Midwestern

Cookbooks published in Midwestern states

Important Midwestern cookbooks include Buckeye Cookery, And Practical Housekeeping: Compiled From Original Recipes by Estelle Woods Wilcox (1877), Fullständigaste Svensk-Amerikansk Kokbok = Swedish English Cookbook (1897), Science in the Kitchen by E.E. Kellogg (1892), and The Settlement Cook Book by Mrs. Simon Kander (1915 edition).

Books published in the Midwest comprise 24% of the Early American Cookbooks collection. When the books are compared to the full set of titles in Early American Cookbooks, the over-represented terms include several baking terms plus the brand name Crisco. Crisco was introduced in 1911 by Procter & Gamble and promoted through cookbooks such as The Story of Crisco (1914). Other terms include the names of new types of foods introduced by the early vegetarian and health food movements such as protose (a peanut-based protein food marketed by John Harvey Kellogg) and graham (a whole grain flour biscuit introduced by Sylvester Graham).

Midwestern over-represented terms (Meandre Dunning Log Likelihood to Tagcloud Algorithm)

This visualization was created by comparing two sets of texts, cookbooks published in the Midwest and the full Early American Cookbooks collection, using the Meandre Dunning Log-likelihood to Tagcloud algorithm in the HathiTrust Research Center Portal.

Northeastern

Cookbooks published in Northeastern states

Books published in the Northeast comprise 61% of the Early American Cookbooks collection. Large numbers of cookbooks were published in New York, traditionally the publishing center of the United States, as well as in Boston and Philadelphia. The high percentage in the Northeast also reflects the population distribution in the United States in the period from 1800 to 1920. Most commercial publishing was centered in the Northeast in the early 19th century, and book publishers became established in other regions as the population shifted westward over time.

Text analysis of books published in the Northeast shows some interesting trends. When the books are compared to the full set of titles in Early American Cookbooks, the over-represented terms include words more common in early 19th century books. These reflect early printing styles such as the long “s,” which looks like an “f.” In the tag cloud below, “fweet” is “sweet” and “fugar” is “sugar.” There are also old versions of words (divers rather than diverse) and English spellings such as flavour, colour, and centre. The Northeast region is also evident in place names such as Philadelphia and local companies such as Ryzon, a baking powder company based in New York.

Northeastern over-represented terms (Meandre Dunning Log Likelihood to Tagcloud Algorithm)

This visualization was created by comparing two sets of texts, cookbooks published in the Northeast and the full Early American Cookbooks collection, using the Meandre Dunning Log-likelihood to Tagcloud algorithm in the HathiTrust Research Center Portal.