4A Introduction to Digital/Digitized Text (21 September)
Reading before class: Gardiner/Musto (DHP), 31-42 “The Elements of Digital Humanities: Text and Document;” Cooney/Roe/Olsen “The Notion of a Textbase”
Preparation before class: pick two of the following text archives (in the languages you are comfortable in, or feel free to find your own) and explore them before class: Internet Archive / Project Gutenberg / Oxford Text Archive / Corpus of Contemporary Arabic / Early English Books Online / Corpus of Middle English / al-Maktaba al-Shamela / Perseus Digital Library / Biblioteca Virtual Miguel de Cervantes / Digital Library of Old Spanish / MSU Humanities Data / Visualising English Print / King Saud University Corpus of Classical Arabic
In-class discussion: What did you find in the text archives? What is the difference between a digitized document and text? What are the formats in which the text is offered? How was it digitized? How is it structured? What do you think the “hidden labor” of digital text might mean? How are text archives different from Google Books? HathiTrust? What tensions do you see between open and restricted access? What other kinds of “texts” might the digitally inclined humanist turn to?
Exercise at Center for Digital Scholarship with Abbyy FineReader and digitizing some text. We will try scanning a few pages and test Optical Character Recognition (OCR) and its machine learning capabilities for a few languages of interest to the course (English, Middle English, French, Arabic, etc.). The process of creating “clean” electronic text is time-consuming; with Abbyy FineReader it involves scanning a text (or loading in the images of a PDF), analyzing the layout, determining the language, “reading” the text, perhaps training the software to read difficult fonts, and finally exporting the text.
Google has also released open-source OCR software, Tesseract (anyone want to try it?)
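To give a sense of what that workflow looks like outside FineReader, here is a minimal command-line sketch using the open-source Tesseract engine. The filenames are placeholders, and the `eng` and `ara` language data packs must be installed separately; this is an illustration, not part of the in-class exercise.

```shell
# Recognize an English-language scan; writes plain text to out.txt
tesseract scan_page1.png out -l eng

# The same page, recognized as Arabic (requires the ara traineddata pack)
tesseract scan_page2.png out_ar -l ara

# Export a searchable PDF instead of plain text
tesseract scan_page1.png out_pdf -l eng pdf
```

As with FineReader, the quality of the result depends heavily on the scan, the layout analysis, and how well the language model matches the fonts on the page.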
Blog Post #1 will be about your experience of digitization and your initial reflections on your ideas for a personal corpus (to be posted by 6pm on 25 September, the day before the next class begins).
Week Four – Digital Text (Social Editing, TEI) II
4B Social and Participatory Text Creation (26 September)
Reading Before Class: Price (NCDH), 137-149 “Social Scholarly Editing;” Terras (NCDH), 420-438 “Crowdsourcing in the Digital Humanities;” Bilansky “TypeWright: An Experiment in Participatory Curation;” Brabham “Concepts, Theories and Cases of Crowdsourcing”
Preparation Before Class: Take a look at a few of these crowd transcription projects: Trove: Australian Newspapers Online; TypeWright (for 18thConnect and EEBO); Transcribe Bentham; MicroPasts; Social Edition of the Devonshire Manuscript; the French Campaign in Egypt 1798-1802 (made with FromThePage); and Crowdmap the Crusades. The “Crowdsourcing the World’s Heritage” page is also full of interesting examples. Can you rank them in terms of potential users (amateur to specialized)? Compare these to citizen science initiatives. What do they have in common?
In-class discussion: When we do not have born-digital text and we cannot use machine learning to transcribe typeset documents into digital text, some projects resort to social participation in the editing or creation of text. Why is text creation desirable? How accurate are the data stemming from these initiatives? What strategies are adopted to improve text quality? How many people make up the social group working on these projects? What values or benefits do the projects invoke to encourage people to volunteer their time? How open is that group? How would you define the terms “participatory scholarship,” “crowdsourcing,” and “citizen scholar”? Who are the “top text correctors” or “super transcribers” in the above-mentioned projects? How is the textual data generated being used?
Exercise in class in electronic publishing workflows: We will use TypeWright, a tool that will allow us to transcribe, or rather correct the OCR of, a small document called a “chart,” that is, a textual description of the coastal regions of Arabia dating to the 18th century. The larger context of this text is the expansion of British trade through Arabia and Asia. It is called a “Memoir of a chart of the east coast of Arabia from Dofar to the island Maziera” (also available here in Google Books), and our crowdsourced transcription of the text, once finished, will be reviewed for accuracy and then published for anyone to use within 18th Connect. You can read about the author Alexander Dalrymple, who composed this based on a certain Captain Smith’s “eye-draught,” that is, a drawing made by sight without scientific instruments. Once our text has been approved, we will have it transformed into TEI XML (see 5A), and if one or two persons in the class would like to work with me to create a digital edition with it like this one done by an undergraduate, this can count as one of your final projects. Incidentally, the island “Maziera” is found off the coast of Oman as shown in the map below.
Blog post #2: What would be a good citizen scholar project for the Arab World or a global crossroads like Abu Dhabi? What kinds of texts could we transcribe? What else might we classify collectively, i.e., using crowd input? How could we at NYUAD encourage such an initiative that would be locally meaningful? What would be the challenges in carrying out such a project? (to be posted by 6pm on 27 September, the day before the next class begins.)
5A The Text Encoding Initiative (28 September)
Reading before class: Skim (without getting too caught up in the details) Pierazzo (NCDH), 307-321 “Textual Scholarship and Text Encoding;” Roueché “Why Do We Mark up Texts?,” dh101 6A. Check out Hamlet encoded in XML.
Also skim before class: “A Very Gentle Introduction to the TEI markup language,” the “examples” in “TEI by Examples.” Then, go look at the document samples in the Linguistic Landscapes of Beirut. On the map interfaces, you can view the actual sample of written language that has been “tagged” by clicking on one of the colorful points on the map. Up will pop an info box with a thumbnail and a link for viewing the larger image. What are the various languages and scripts used? Imagine how those multilingual mini-documents might look as marked up text.
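As a thought experiment (this is a hypothetical sketch, not a validated TEI document, and the sign text is invented), a bilingual shop sign from the Beirut samples might be encoded along these lines, using the TEI’s `xml:lang` attribute to distinguish the two languages:

```xml
<ab type="sign">
  <seg xml:lang="ar">مقهى</seg>
  <seg xml:lang="en">Café</seg>
</ab>
```

Notice that the markup records information the photograph alone cannot make machine-readable: which stretch of writing is in which language.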
In-class discussion: Documents we create in word processors like WordPerfect, MacWrite, and Word have lots of hidden markup. Commercial markup standards change over time. This means they will not be readable forever. Why do we use word processors? Do you have any old files on old media that are now unreadable? What did the demo with Markdown allow us to do easily? What is the difference between html and markdown? Today we are interested in markup, or text encoding. What does markup allow us to add to our digital texts? What kinds of semantic content are added with markup? Here we will refer to dh101 2A.
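One concrete answer to “what does markup allow us to add?” is that a program can query encoded text. The following short sketch (my own illustration, using Python’s standard `xml.etree` module and an invented TEI-like snippet echoing the Dalrymple chart from 4B) pulls out every tagged place name:

```python
# Minimal sketch: once text carries semantic markup, software can extract
# the encoded information. The snippet below is invented for illustration.
import xml.etree.ElementTree as ET

sample = """<p>The chart runs from <placeName>Dofar</placeName>
to the island of <placeName>Maziera</placeName>.</p>"""

root = ET.fromstring(sample)
places = [el.text for el in root.iter("placeName")]
print(places)  # ['Dofar', 'Maziera']
```

A plain-text or OCR version of the same sentence contains the same words, but no machine-readable way to know that “Dofar” and “Maziera” are places; that is the semantic content markup adds.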