Comparing two sets of texts

One very useful text analysis tool in the HathiTrust Research Center Portal is the Meandre Dunning LogLikelihood to Tagcloud algorithm. The algorithm compares two worksets by identifying the words that are more common, and the words that are less common, in one workset than in the other. This tool has been very useful in analyzing how different subsets of the Early American Cookbooks collection differ from the collection as a whole. Tag clouds display the over- and under-represented terms for government publications, Fannie Farmer’s cookbooks, and the different census regions of the United States (Northeastern, Southern, Midwestern, and Western).

How it works:
• calculates Dunning Log-likelihood based on two worksets provided as inputs: an “analysis workset” and a “reference workset”;
• loads each page of each workset, removes the first and last line of each page, and rejoins hyphenated words that are split at the end of a line;
• performs part of speech tagging (selecting only NN|NNS|JJ.*|RB.*|PRP.*|RP|VB.*|IN);
• lowercases the tokens remaining;
• counts the tokens remaining for all volumes for each collection;
• compares counts from each collection using the Dunning Log-likelihood statistic; the most “overused” tokens in the analysis collection relative to the reference collection (200 tokens by default) are displayed as a tag cloud and made available as a CSV file; the most “underused” tokens in the analysis collection relative to the reference collection (also 200 by default) are likewise displayed as a tag cloud and made available as a CSV file.
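
The steps above can be approximated in a few lines of Python. This is only a rough sketch, not the Meandre implementation: the function names (clean_page, count_tokens, log_likelihood, compare) are hypothetical, the part-of-speech filtering step is omitted, tokens are assumed to be runs of ASCII letters, and the score uses the common two-term form of the Dunning log-likelihood (G2), which may differ in detail from what the portal computes. The sign is added here so that positive scores mean "overused in the analysis workset" and negative scores mean "underused".

```python
import math
import re
from collections import Counter


def clean_page(page_text):
    """Drop the first and last line of a page, then rejoin words
    that were hyphenated across a line break."""
    lines = page_text.splitlines()[1:-1]
    body = "\n".join(lines)
    # e.g. "cook-\ning" becomes "cooking"
    return re.sub(r"(\w)-\n\s*(\w)", r"\1\2", body)


def count_tokens(pages):
    """Lowercase and count tokens over all pages of a workset.
    Assumption: a token is a run of ASCII letters; no POS filtering."""
    counts = Counter()
    for page in pages:
        counts.update(re.findall(r"[a-z]+", clean_page(page).lower()))
    return counts


def log_likelihood(a, b, total_a, total_b):
    """Signed two-term G2 for one token.
    a, b: token counts in the analysis and reference worksets.
    total_a, total_b: total token counts of each workset (both non-zero)."""
    expected_a = total_a * (a + b) / (total_a + total_b)
    expected_b = total_b * (a + b) / (total_a + total_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    g2 *= 2
    # Positive when the token is relatively more frequent in the analysis set.
    return g2 if a / total_a >= b / total_b else -g2


def compare(analysis_counts, reference_counts, top_n=200):
    """Return (overused, underused) lists of (token, score) pairs."""
    total_a = sum(analysis_counts.values())
    total_b = sum(reference_counts.values())
    scores = {
        tok: log_likelihood(analysis_counts[tok], reference_counts[tok],
                            total_a, total_b)
        for tok in set(analysis_counts) | set(reference_counts)
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    overused = [(t, s) for t, s in ranked[:top_n] if s > 0]
    underused = [(t, s) for t, s in ranked[::-1][:top_n] if s < 0]
    return overused, underused
```

With two Counter objects built by count_tokens, compare(analysis_counts, reference_counts) returns the two ranked lists that would feed the "overused" and "underused" tag clouds; writing them to CSV and rendering the clouds is left out of the sketch.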