Remote 11 – Reading Like a Computer

Week 11

14 April: On Computing Style

Before we jump into style, you may enjoy reading this recent article about transcription in the age of AI from the NYT.

Subject: In the next week and a half, we will be exploring the idea of “reading like a computer” from the perspective of style. We will discuss our opinions about what makes up an author's style and reorient our analysis around the features of language specific to a particular author. We will carry out some experiments with text corpora and the computing of style, also known as “stylometry”. We will also visualize our results.

Format: Class on Tuesday will be mixed delivery, consisting of pre-recorded parts, a synchronous meeting and a quick follow-up in a forum. There will be no synchronous class on Thursday, but attendance will be counted via your participation in the forum in the form of “quick writing” (a few sentences in response to a prompt).

To prepare before class:

Before class, I would like everyone to upload five (5) texts in English by a single author. The texts can come from anywhere and be about any subject, but each must be at least 2000 words long. You can copy them from Project Gutenberg, news articles or your own hard drive; anywhere you like. We are not going to reveal the identity of the authors until our discussions in class on Tuesday. The texts you choose will be shared in Drive, so consider first whether they contain any information you would not like to share with the class. We will not publish the corpus openly beyond Drive. In your choice of writings, you might pick two (by the same author) that are about one topic and three that are about other topics. Or you could pick two texts from long ago and three that are more recent. Whatever your choice of texts is, jot down a few notes so that you remember.

Copy them in machine-readable plain text format using TextEdit (macOS), Notepad (Windows) or any other text editor (SublimeText, Atom) and save each one as a .txt file. Name the files in camelCase, without any special characters or accents, in this format: {maskedName}_{nameOfText}.txt, e.g. mrBean_myLife.txt or zorro_curseOfCapistrano.txt. Please place them all in S20 RLAC Shared Folder > Data > Text for Style exercise.
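If you would like to check your files before uploading, here is a minimal Python sketch of the two rules above (the file name format and the 2000-word minimum). It is not part of the class notebook. The exact pattern is my assumption about what counts as camelCase here (ASCII letters and digits only, each part starting with a lowercase letter), and counting words by splitting on whitespace is only a rough estimate.

```python
import re

# Assumed reading of the naming rule: two camelCase parts (ASCII letters and
# digits only, each starting with a lowercase letter) joined by one underscore,
# followed by the .txt extension.
FILENAME_PATTERN = re.compile(r"[a-z][A-Za-z0-9]*_[a-z][A-Za-z0-9]*\.txt")

MIN_WORDS = 2000  # minimum length per text, as stated in the assignment


def is_valid_filename(filename: str) -> bool:
    """Return True if the name looks like {maskedName}_{nameOfText}.txt."""
    return FILENAME_PATTERN.fullmatch(filename) is not None


def count_words(text: str) -> int:
    """Rough word count: split on any run of whitespace."""
    return len(text.split())


def is_long_enough(text: str, minimum: int = MIN_WORDS) -> bool:
    """Return True if the text has at least `minimum` whitespace-separated words."""
    return count_words(text) >= minimum
```

For example, is_valid_filename("mrBean_myLife.txt") accepts the sample name from the instructions, while a name containing a space or an accented letter is rejected.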

Please submit these before Friday, 10 April, 11:59pm.

To read before class:

Read carefully the excerpt from the article “Revisiting Style” (Herrmann et al.) (in Drive).
Check out the site of the Computational Stylistics Group.

To watch before class: (you may have to open the videos and load in advance–streaming can be slow now)

Stylometry, the Types of Voice in Writing (5 mins)
Visualizing Literature: Trees, Maps and Networks (the talk is 50 mins, but the first 10 minutes should suffice to get the general idea)
Hildegard of Bingen: Authorship and Stylometry (20 mins)

To download before class:

To speed things up, download the corpora we will be using ahead of class. The corpora can be found in the data folder in Drive. I will demonstrate the notebook I wrote that allows you to generate these quickly. That way, we won’t overwhelm Project Gutenberg by all downloading the same thing at once. The corpora we will use on Tuesday are: AustenBrontes | Doyle | Little Cousin | Clean Corpus (the normalized version of the corpus that you provided me with)

Notebook:

The notebook needed for this session that I made publicly available is here. Check the “notebooks” folder in Drive for the one we will use in class. It is called Stylo_with_ProjectGutenberg_Texts_revised. Download it and open it in RStudio.

Class outline:

Introductory short video to week 11 (4 mins)
How to structure the metadata for Stylo for use with Project Gutenberg (4 mins)
Building Metadata for Stylo from Scratch (4 mins)
How to publish a Google Sheet openly to the web as a CSV (3 mins)

Run #1: First, we will learn how to automatically generate a corpus from Project Gutenberg and will study Jane Austen and the Brontë sisters.

About Austen | About the Brontë sisters
Step by step with the Stylo with Project Gutenberg Texts Notebook (11 mins)
Using the Notebook and the Graphic User Interface for Style with Austen and the Brontes (6 mins)
Explanation of the Cluster Analysis and Bootstrap Consensus Tree for Austen and the Brontes (11 mins)

What is cluster analysis?
What is principal component analysis? (discussed at 15:00 in the Hildegard documentary)
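To make the idea behind cluster analysis a little more concrete: it groups texts according to how far apart they are, so everything depends on how we measure the distance between two texts. One widely used measure in stylometry is Burrows's Delta, which compares how often each text uses the most frequent words (MFW) of the corpus. The Python sketch below is only an illustration of that calculation on a made-up three-text corpus. It is not the code of the class notebook, the tokenizer is deliberately naive (lowercase, split on whitespace), and the tiny texts and the choice of 4 MFW are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def tokenize(text):
    """Naive tokenizer (an assumption): lowercase, split on whitespace."""
    return text.lower().split()


def most_frequent_words(corpus, n):
    """The n most frequent words across the whole corpus."""
    counts = Counter()
    for text in corpus.values():
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(n)]


def relative_frequencies(tokens, vocabulary):
    """Share of the text taken up by each word in the vocabulary."""
    counts = Counter(tokens)
    return [counts[word] / len(tokens) for word in vocabulary]


def z_scores(corpus, vocabulary):
    """Standardize each word's frequency across texts (needs at least 2 texts)."""
    freqs = {name: relative_frequencies(tokenize(text), vocabulary)
             for name, text in corpus.items()}
    columns = list(zip(*freqs.values()))  # one column per word
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return {name: [(f - m) / s if s > 0 else 0.0
                   for f, m, s in zip(row, means, sds)]
            for name, row in freqs.items()}


def delta(z_a, z_b):
    """Burrows's Delta: mean absolute difference of z-scores."""
    return mean(abs(a - b) for a, b in zip(z_a, z_b))


# Hypothetical toy corpus, named in the same style as our class files.
corpus = {
    "authorA_text1": "the cat and the dog and the bird",
    "authorA_text2": "the dog and the cat and the fish",
    "authorB_text1": "a cat a dog a bird of a kind of a sort",
}
vocabulary = most_frequent_words(corpus, 4)
scores = z_scores(corpus, vocabulary)
for first, second in [("authorA_text1", "authorA_text2"),
                      ("authorA_text1", "authorB_text1")]:
    print(first, second, delta(scores[first], scores[second]))
```

A smaller Delta means two texts use the common words in more similar proportions. A cluster analysis takes the full table of such pairwise distances and joins the closest texts first, which is what produces the tree-shaped plots.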

Run #2: Second, we will try the Little Cousin series.

About the children’s literature of American Empire

Run #3: The corpus that we built as a class.

On reformatting the “messy” corpus you provided me (2 mins)

See our BCT as an interactive d3 object here. What is a force-directed graph?

And one last PCA:

Run #4: If there is time, we will try Sir Arthur Conan Doyle, who wrote in many different genres.

About the author of Sherlock Holmes | his prolific writing career

Quick writing:

In this forum, let us know what your definition of writing style is, or which definition given by someone else you adhere to. I would like you to focus on the plots that we generated in the classification exercise on Tuesday. What did you learn about the authors? Why do you think the computer classified them in the way it did? Can you relate anything that we found to your definition of style? Fill in the information in the “Mystery text classification metadata” sheet in the same Drive so that others can know more about the authors and the texts. (due Saturday night)

Blog 4:

Choose an author (or a small group of authors) who has several different writings in Project Gutenberg. Do some research into that author and their biography/bibliography. Create a metadata table like the one we used for the authors this week and, using the notebook, do a bootstrap consensus analysis with about 500 MFW (most frequent words). Also do a Principal Component Analysis with word loadings. What do the visualizations help you understand, if anything, about this author? If you are having trouble making the stylo notebook work, refer to my videos or make an appointment to meet in virtual office hours. (due: 30 April, 11:59pm)