6B Text Analysis and Visualization of a Small Corpus (10 October)
Reading before class: Sinclair/Rockwell (NCDH), 274-290 “Text Analysis and Visualization: Making Meaning Count;” Sinclair/Ruecker/Radzikowski “Information Visualization for Humanities Scholars;” Clement “Text Analysis, Data Mining and Visualization in Literary Scholarship;” Ammann “What can a Text Analysis Smartphone App do for Digital Humanities?”
Explore before class: Voyant and Tapor web-based tools; download AntConc to your laptop and, if you have an iPhone, Textal (it is currently available only for iOS). Also identify three digital texts on a similar topic relevant to you that you would like to analyze (already available in plain text format, or OCR them at the CDS).
In-class discussion: What is a “bag of words” approach to textual analysis? What is a stop-word list? What becomes visible when reading at a distance that you did not see before? What new kinds of knowledge emerge with scale? What do we mean when we say that visualization is “heuristic?” If you are working in a language other than English, are these out-of-the-box tools sufficient? If you know Arabic, check out the Arabic interface of Voyant that the instructor and his colleague are translating. Do you have any suggestions for its improvement? What difference do parameters make when doing textual analysis (using dh101 4B)? What are the issues of using a web-based platform for visualization? Where do your data sit?
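To make "bag of words" and "stop-word list" concrete before the discussion, here is a minimal Python sketch. It is purely illustrative, not something you need for class: the tiny stop-word list, the sample sentence, and the regex tokenization are simplified assumptions, not how Voyant or AntConc work internally.

```python
# Illustrative only: a "bag of words" counts how often each word occurs,
# ignoring word order; a stop-word list filters out common function words
# before counting. The stop-word list below is a tiny invented sample.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}

def bag_of_words(text):
    # Lowercase and split on letters only; real tools use richer tokenizers.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

sample = "To be or not to be, that is the question; the question is absurd."
print(bag_of_words(sample).most_common(3))
# e.g. [('be', 2), ('question', 2), ('not', 1)]
```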
In-class exercise: Working with a sample corpus and some of the out-of-the-box tools, we will learn to create concordances, search for collocations, and make visuals such as word clouds, bubble lines, and density plots. We will also work with a plug-in for Firefox that simulates colorblindness. How might universal design be applied to visualization in the humanities?
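For those curious about what a concordance view does under the hood, here is a rough keyword-in-context (KWIC) sketch in Python. The filename, keyword, and window size are placeholders; AntConc's actual concordancer is far richer.

```python
# Rough sketch of a keyword-in-context (KWIC) concordance, the kind of
# display a concordancer produces: each hit is shown with a few words of
# context on either side. File name and keyword below are hypothetical.
import re

def concordance(text, keyword, window=5):
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>40} | {tok} | {right}")
    return lines

text = open("my_text.txt", encoding="utf-8").read()  # hypothetical plain-text file
for line in concordance(text, "river"):
    print(line)
```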
7A Larger Scale Mining and Cultural Analytics (12 October)
Reading before class: Michel/Aiden “What We Learned from Five Million Books;” Jockers/Underwood (NCDH), 291-306 “Text-Mining the Humanities;” Underwood/Goldstone “What Can Topic Models of PMLA Teach Us…?;” Blevins “Topic Modeling Martha Ballard’s Diary.” You might find dh101 6B helpful as well.
Preparation before class: If you don’t know it already, check out Google n-gram. Another version of this can be found at the HathiTrust (a consortium of libraries to which NYU belongs, providing access to digitized, though not usually OCR’d, texts); see Bookworm. Also skim through the new journal CA: Cultural Analytics without reading all the articles. Both approaches look at text/data from a distance.
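As a back-of-the-envelope illustration of what an n-gram viewer counts, here is a toy Python sketch: it slides a window of n words across a tokenized text and tallies each sequence. The per-year normalization that tools like Google n-gram and Bookworm apply is omitted, and the sample sentence simply echoes the familiar "the United States is/are" query.

```python
# Toy version of the counting that underlies an n-gram viewer: tally every
# run of n consecutive words. Real viewers also normalize counts by year
# of publication, which is not shown here.
from collections import Counter

def ngram_counts(tokens, n=2):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the united states are the united states is".split()
print(ngram_counts(tokens, n=3).most_common(1))
# e.g. [(('the', 'united', 'states'), 2)]
```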
In-class discussion: What are the hidden claims of culturomics? Would you call Google n-gram a “black box”? What does it actually show us? What can it not show us? How is it different from the ways we were analyzing custom corpora (say, in literature or historical writing)? What is the difference between HathiTrust Bookworm and Google n-gram? What kinds of topics are the authors publishing in the journal CA tackling? At what scales are their research projects articulated? What are some of the computational methods used in CA to analyze corpora? What corpora have they chosen? What is topic modeling and how does it work?
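We will discuss topic modeling conceptually, but a minimal sketch may help demystify it. This one uses scikit-learn's LDA implementation purely for illustration (many DH projects, including the Blevins reading, use other toolkits such as MALLET), and the four short "documents" are invented placeholders.

```python
# Minimal, illustrative topic-modeling sketch with scikit-learn's LDA.
# The documents below are invented placeholders; real projects work with
# hundreds or thousands of texts and tune the number of topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "delivered a baby, cold weather, spun wool",
    "baby born at night, weather stormy",
    "read poetry, wrote letters, literary society meeting",
    "novel and poetry discussed at the society",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)          # document-term matrix (bag of words)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Show the most heavily weighted words for each inferred topic.
terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-4:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```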
In-class exercise: We will look at how to build worksets and study them algorithmically from within the HathiTrust Research Center, including Bookworm.
7B Storytelling from Student Corpora (17 October)
Today’s session will be composed of short presentations based on student work. What stories have you been able to tell based on your small corpus? What have you not been able to do yet? What has visualization done to your thinking about texts? Do your visualizations have a rhetorical force? You need only use Voyant and AntConc for this, although if you are feeling adventurous you can try other means (Tapor is particularly useful for that).
As a practice in openness, I suggest that you invite a friend or a professor to come hear you give your short presentation.
Blog post #4: After your presentation, write up its results as a blog post (due 23 October). What kind of story can you tell from your mini corpus? What patterns and changes do you find? Here is a chance for you to begin to articulate your presentation (week 7); you can use the blog post and supporting visuals for the presentation instead of a slide deck. Check out the recent DH visualization event at CUNY and this reaction to it. What do you think they mean by “unflattening” and “enacting”? (To be posted by 6pm the day before the next class begins.) You might look at Sousanis’ Unflattening. Feel free to incorporate suggestions, criticisms, and lessons learned from your analysis.