The Arabic Collections Online Project
By Laura Batty
One of the most exciting digital developments in the world of scholarship and research is the ability to scan and archive texts and make them available online. The NYU Libraries currently host thousands upon thousands of e-books and millions of full-text articles, and provide access to thousands of e-journal publications. Recently, a group of researchers at NYU and other universities added their own efforts to the pool of knowledge by making some 10,000 titles available, all online and all in Arabic.
The project was several years in the making. NYU set out with partner universities Columbia, Cornell, and Princeton to digitize these books and make them available to scholars around the world. “NYU Abu Dhabi has been absolutely critical to this,” said Michael Stoller, Director of Collections & Research Services at NYU. “It is the identification of this content…that has been culturally significant to the Emirati government. This is a significant contribution to the Arab culture to make this material actively available on the web.”
Before the universities could even begin the process of having the books scanned, a team convened to cross-reference titles and double-check copyright dates. In the United States, materials published prior to 1923 do not require permission from copyright holders. Other countries use different cutoff dates. Given the global nature of the project, it was important to comb through the various national copyright laws to make sure each was applied correctly.
“We did an analysis to look at each of these countries’ publications, and different countries had different copyright laws in place,” said David Millman, Director of Digital Library Technology Services at NYU. “We’re able to make available materials that are much newer than we would normally, if they were published in the U.S.”
Launching the Collection
In October 2014, Arabic Collections Online launched with thousands of titles. Barely pausing to savor the team’s success, Millman and Stoller are already thinking about ways to expand the collection and make these books even more accessible. According to Stoller, “At some point, we may very well want to do what’s called optical character recognition, which would allow us to produce a full-text, searchable version of the book.” For now, the scanned books are available as PDF files, essentially images of the documents themselves. Optical character recognition would allow scholars to search for words within the text.
“The difficulty at this point is that, with optical character recognition (OCR), the computer has to be able to recognize the characters…in all the various forms that those letters can take,” Stoller said. Though the technology doesn’t yet exist for Arabic optical character recognition, the team has made a conscious effort to set certain specifications for the scanned documents so that the quality of the images will be conducive to OCR in the future.
A collection of 10,000 books is already large; however, Millman and Stoller hope to bring in new partner universities. “This has been a perfect project for NYU, because this is an institution that has come to think of itself in the global context,” Stoller said. “This is, obviously, an effort to make the literature of one of the most populous portions of the human race readily available to scholars everywhere.”