Week 9:
Having trouble keeping focused in your studies? Try the virtual study buddy approach:
https://twitter.com/MedlibAEP/status/1239571278165618696?s=20
Transcribing Handwriting
Format: Class on Thursday will be mixed delivery, consisting of pre-recorded parts, some synchronous meeting time and a quick follow-up in a forum. There will be no synchronous class on Tuesday, but attendance will be counted via your participation in the forum in the form of “quick writing” (a few sentences in response to a prompt).
In-class exercise: We will be working with Transkribus to see how a computer can “read” human handwriting. We will work with a text from Project Gutenberg: Lowell Thomas’s With Lawrence in Arabia. https://www.gutenberg.org/files/61428/61428-0.txt
To prepare before class: In the data folder of our shared class drive, you will find a doc that contains a version of Thomas’s text above. I have divided the text into segments of 400-500 words, using natural breaks in the text. Find the line where ***** [your last name] starts and write the text out by hand, in your natural handwriting, until you reach ***** [your last name] ends (you can use the document outline at left to find your part). Please keep the margins of each line the same as in the original file; see the two sample pages that I did from the beginning. Scan your pages or photograph them with your phone and post the files to the drive (jpg or pdf format). Please submit this by 23 March, 11:59pm. This will count as a 0/1 participation point.
To check out before class:
Take a look at these LOC crowdsourcing projects: Disabled, but Not Disheartened | The Rosa Parks Papers | Letters to Lincoln
The “public models” of Transkribus and the texts on which they were trained.
To skim:
Who was Lawrence of Arabia? Who was Lowell Thomas?
Machine Reading the Primeros Libros (Alpert-Abrams) [optional]
AI, the Transcription Economy and the Future of Work
In Her Battle Against the Machines, She’s Holding Her Ground
To watch before class:
“Optical Character Recognition using Google Docs”
“Transkribus Makes a Breakthrough in Understanding Ancient Texts”
“How to Use Transkribus in 10 Steps or Less”
Thought questions: Why do people engage in crowdsourcing projects of archives such as these? Why do we want OCR to work? When is OCR easy? What are some of the key shortcomings of OCR? What is HTR? For what enterprise purposes would you want to extract text from images?
Class outline (Thursday):
9:00-9:10 Class housekeeping & Questions | Virtual office hours | Midterm project
9:10-9:20 Ways of text creation beyond Project Gutenberg – synchronous discussion
9:20-9:30 General discussion and questions (on our slow transition to mostly asynchronous class material)
The rest of class for today consists of a series of short videos that you can watch when you are ready:
1 How Does Optical Character Recognition (OCR) work? (up to 4:40)
2 OCR – Google Drive Tutorial
If you haven’t yet watched “How to Use Transkribus in 10 Steps or Less”, watch it now. The remaining short videos are hosted at NYU and are accessible only when you are authenticated.
3 A Small introduction to the rest of the videos. (4 mins)
4 Some examples of recognizing historical handwriting. (14 mins)
5 HTR and Archival Materials about the Gulf. (20 mins)
6 Some experiments with the Louvre Abu Dhabi Latin Bible. (12 mins) This manuscript is the one I will be speaking about.
7 Our class’s transcription of Lawrence of Arabia. (11 mins)
8 Concluding video. (4 mins)
The data produced by the HTR process in Transkribus for our Lawrence of Arabia segments will be placed in the “data” folder in the shared class drive: two files containing the text of your handwritten versions of Lawrence of Arabia, each produced with a different model.
Files in Drive:
full LOA.pdf: the combined file of handwriting that was given to Transkribus
full_LOA_Benthammodel.txt: the automatic transcription of full_LOA using the model trained on the Transcribe Bentham project
full_LOA_1892model.txt: the automatic transcription of full_LOA using the model trained on the Bushire residency papers.
Quick writing: By Saturday 4 April, please contribute to the forum a few sentences on what you discovered this week about how well our handwriting can be understood by a computer trained to read 19th-century handwriting. Whose handwriting was best recognized in the class? Any idea why? Which model worked the best? Why do you think that might be?
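If you would like a number to back up your impression of which model worked best, one common measure is the character error rate (CER): the number of single-character insertions, deletions and substitutions needed to turn the machine transcription into the original text, divided by the length of the original. The sketch below is only one way to compute it and is not part of the assignment. It assumes you have copied your own segment of the original text and the matching stretch of each model's output into plain strings or text files; the function names and the sample strings are made up for illustration.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions between two strings."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalize(text: str) -> str:
    """Collapse runs of whitespace (including line breaks) to one space,
    so that differences in line layout are not counted as errors."""
    return " ".join(text.split())


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    reference = normalize(reference)
    hypothesis = normalize(hypothesis)
    if not reference:
        raise ValueError("reference text is empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Illustrative strings only; replace them with your own segment and the
# corresponding part of each model's output, e.g. read from a file with
# open(path, encoding="utf-8").read().
original = "He wore the robes of an Arab prince"
model_a = "He wore the robes of an Arab prinee"
model_b = "Ile wore tle rokes of an Aral prince"

for name, output in [("model A", model_a), ("model B", model_b)]:
    print(name, round(character_error_rate(original, output), 3))
```

A lower CER means fewer character-level mistakes. Comparing one segment at a time keeps the calculation fast and also lets you compare classmates' handwriting, not just the two models.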
Virtual Office Hours (VOH): I encourage you to sign up for the regular office hours if you would like to speak more about the content this week. During them, I would like us to have both video and audio on so that we can have a conversation.