-Text as Data

8 / Th 21 Feb  Text Analytics Basics” | Exploring Voyant Tools |  tapor.ca

9 / Tu 26 Feb  
Reading: Rockwell/Sinclair “The Measured Words: How Computers Analyze Texts” ch 2 of Hermeneutica | creating a Google Alert to help you with your Zotero | Project Gutenberg (for open, plain text)


Browse: 
Try the Harry Potter and fan fiction corpus (in Drive) with Voyant Tools.  Experiment with some of the different visualizations (using the drop down menu from the windows looking icon.)

Instead of cutting and pasting the text into the window, use the upload button to upload all 6 texts.  Try it and see what happens. 

 

10 / Th 28 Feb Practicum:  tidytext package – text mining and building a word cloud in R  | gutenbergr package | Discussion of  “The Measured Words”

Class 10 GitHub  

A more in-depth treatment of the Tidy package can be found here

 

Blog 1:  Build a small corpus of texts that interest you from Project Gutenberg, repeat the practicum to see what it is that you can discover.  In your response, please comment on the process, whatever you did to manipulate the texts or change the settings of Voyant or the provided R script.  Also, describe why you chose these texts and what looking at them with the help of a computer taught you about them. What happens to text when it is turned into lists, numbers? Read about more complex forms of text analysis like here or here. What do they do differently? (due 12 March, 11:59pm). 

 

11 / Tu 5 Mar  Textual forensics: trying out the stylo package for R

Please collect five texts by two different authors.  These authors could be yourself at different moments of your life (papers from middle school and now) or two other people whose identity you are sure of (journalists you read, authors, blog postings, presidential speeches).  You do not need to do anything with the texts except label them and place them in Drive. 
 
Steps to follow:
 
1    Please name the files exactly in this fashion:  {youpickthename}_{KeywordsAboutTheText}.txt   e.g. Wrisley_AllAboutNothing.txt; AnonBaker_Trees.txt, with a person’s last name, an underscore and title or keywords.
2     Each text should be at least 2000 words long, in modern English.
3     Each text should be saved in .txt format.  What this basically means is that you are able to paste it into TextEdit (IoS) or WordPad (Windows), or even better into a text editor such as Atom or SublimeText.  You can also save a word file into a .txt file, but it should not be a .doc or .docx format. All formatting such as italics, bold, underlining or fonts will disappear. Another way of calling .txt files is PLAIN TEXT. There is an option in Adobe Pro to perform optical character recognition on a pdf file, or if that process has already been performed, you can cut and paste the words from the page into a .txt file.  Be attentive to the fact that this process is imperfect.  You can also copy and paste from a web site (careful not to copy extraneous information like menus, headers).  
4    Place your five files labelled according to the instructions above in the class drive in the folder “CORPUS” 

A user manual for the Stylo package can be found here

Class 11 Github

 

-Dataset

     

12 / Th 7 Mar      Exploring humanities datasets:  

Titanic dataset
Barber shops of Abu Dhabi (2nd link therein)
data.world “public” repository
Wikidata
Zenodo (Literary Geographies of Christine de Pizan, Linguistic Landscape of Leipzig)
data.gov
CodeCamp
French Enlightenment Correspondents 

Reading:  Hayles, Narrative and Database

What is a dataset?  What elements might it contain? What structure can it have? What could a dataset about the humanities exhibit? 

 

13 / Tu 12 Mar    Project planning: Creating a Structured Narrative Dataset for visualization

Search for a table in Wikipedia/Eurostat/Kaggle on a topic of interest to you.  How can you “enrich” the data contained therein.  We are going to practice some very basic data enrichment or carpentry.  Not all datasets can be extended, but many can.  We will discuss today some ways that you might achieve this. Enrichment of a dataset allows you to correlate one set of information with another. 

Look for a dataset like the Titanic that allows you to plot the relationship between one set of numerical variables and another, and that also has bounded set of types (e.g. Male/Female ; no / yes / maybe; children, adolescents, adults).You will choose a topic and get to work on it.  Next class we will practice visualizing that data using R. 

Look at the following Wikipedia pages: 

https://en.wikipedia.org/wiki/United_Nations
https://en.wikipedia.org/wiki/FIFA_World_Cup

Take a look at Eurostat or Kaggle for lots of datasets: 

https://ec.europa.eu/eurostat/data/database
https://www.kaggle.com/datasets

 

14 / Th 14 Mar   Practicum: the dyplr / ggplot packages in R 

US Supreme Court Justices | Global Murder rates | Inter-state War Data


Blog 2: 
Find a topic that interests you where there are tables with string and numerical values that would lend themselves to a simple scatterplot visualization. Look for an existing medium size dataset in Wikipedia, Eurostat or Kaggle as we did in class 13. Extract the data into a csv file and clean/normalize it. Visualize the dataset in R using the basic scripts we have in the GitHub repository for class 14 from different angles. Add on one or two other columns that “enrich” the data. Comment on the process in your blog posting. What goes into a dataset? Who do you think brought together the basic information of yours? When you visualized a dataset, did anything become apparent that was not when you examined it in tabular form? Do you like to explore data visually? do you feel it is potentially deceptive? What kinds of other visualization would you like to do? (due 4 April. 11:59pm)

 

Class 14 GitHub

 

Zotero progress: make sure that you have 10 articles on your topic and a google alert made by 1 April.