One of the main deliverables for ENM is a topic map of names and concepts derived from index entries. ENM’s topic map is a meta-index made by combining many individual back-of-book indexes into one dataset.
Background on Topic Maps
Originally developed in the 1990s to address the need for dynamic, aggregated indexes for UNIX manuals, Topic Maps is an ISO standard (ISO 13250) for representing data about concepts (Newcomb, 2003). For the ENM project, we do not use the XML syntax for topic maps (XTM); instead, we draw on the underlying data model. A topic map can be thought of as a dataset that sits on top of a source (such as a book) or a group of sources. As an interlinked graph, the topic map facilitates navigation between parts of the source.
A topic map consists of topics and relations. A relation is either an association between topics or a link between a topic and a source (called an occurrence). Topics represent abstract concepts; they are composed of names and can have types.
In book index terms, a topic is a heading, an association is a “see” or “see also” cross-reference, and occurrences are the page number locators listed for each heading. Unlike an index heading, however, topic map elements such as names can be scoped, both to disambiguate them and to designate the conditions under which a given statement is valid. We use scope as a qualifier, the kind of information that occasionally appears as a parenthetical in an index entry.
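The mapping between index terms and topic map terms described above can be sketched as a handful of small data structures. This is a minimal illustration in Python, not the ISO 13250 data model in full and not the data model used in the ENM software; the class names, field names, and the "Mercury" example data are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Name:
    """A topic name; the optional scope acts as a qualifier."""
    text: str
    scope: Optional[str] = None  # hypothetical field, e.g. "planet"

    def display(self) -> str:
        # Render the scope the way an index would: as a parenthetical.
        return f"{self.text} ({self.scope})" if self.scope else self.text


@dataclass(frozen=True)
class Occurrence:
    """A link from a topic to a place in a source (a page locator)."""
    book: str
    page: str  # kept as text, since locators can be ranges or roman numerals


@dataclass(frozen=True)
class Association:
    """A link between two topics, e.g. a "see" or "see also" reference."""
    kind: str
    source: str
    target: str


@dataclass
class Topic:
    """An index heading: names, optional types, and occurrences."""
    names: List[Name]
    types: List[str] = field(default_factory=list)
    occurrences: List[Occurrence] = field(default_factory=list)


# Illustrative data only; not taken from the ENM topic map.
mercury = Topic(
    names=[Name("Mercury", scope="planet"), Name("Mercury")],
    types=["celestial body"],
    occurrences=[Occurrence(book="Hypothetical Book A", page="42")],
)
see_also = Association(kind="see also", source="Mercury (planet)", target="Solar system")

print([n.display() for n in mercury.names])
```

Keeping the scope separate from the name text means the same string can serve several topics, with the qualifier only added when the name is displayed.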
ENM topic map
The ENM topic map was generated using a custom-built piece of software called the Topic Curation Toolkit (TCT). Developed by Infoloom, the TCT is a series of back-end scripts and a front-end editing interface. The TCT parses EPUB files and populates a relational database by extracting, merging, and linking index entries and page text.
Above is a screenshot of the record in the Topic Curation Toolkit for the topic Amazon. In the center there are names for this topic, including one that is scoped; on the right there are relations to other topics; below there are occurrences, or pages in books; and on the very bottom there are links that connect to controlled vocabularies.
Above is a screenshot of the occurrence or book page view in the Topic Curation Toolkit. The text of the page is in the middle, and on the right is a “reverse-index” view of all of the topics indexed on this page.
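The “reverse-index” idea can be illustrated with a few lines of Python: a back-of-book index maps headings to pages, and inverting it yields the headings indexed on each page. This is only a sketch of the concept, not the TCT's implementation; the function name and the sample entries are made up for the example.

```python
from collections import defaultdict


def reverse_index(index):
    """Invert heading -> page locators into page -> sorted headings."""
    by_page = defaultdict(set)
    for heading, pages in index.items():
        for page in pages:
            by_page[page].add(heading)
    return {page: sorted(headings) for page, headings in by_page.items()}


# Hypothetical index entries for a single book.
index = {"Topic maps": [12, 13], "Indexes": [12, 40], "EPUB": [40]}
pages = reverse_index(index)

print(pages[12])  # ['Indexes', 'Topic maps']
```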
ENM topic map stats
The ENM topic map was built from 89 books that have both indexes and page numbers. Not all EPUBs have both, so of the 113 books we considered, 89 were usable for this aspect of our project. Published between 1987 and 2016 (with the largest number published in 1998), the books in the topic map are open access titles from disciplines including literary analysis, philosophy, law, media studies, race studies, and gender studies. These 89 EPUBs generated 45,000+ topics! Only 2,652 of those topics appeared in two or more books, so one lesson learned is that it might make more sense to generate a topic map for a specific discipline, since books within one discipline might be more likely to share terminology.
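To make the "two or more books" figure concrete, here is a small sketch of how several back-of-book indexes could be combined and the shared topics counted. It assumes that headings are matched by a case-insensitive, whitespace-normalized string comparison, which is a simplification chosen for the example and not a description of how the TCT merges entries; the function names, book titles, and headings are hypothetical.

```python
from collections import defaultdict


def normalize(heading):
    """Assumed matching rule: ignore case and collapse whitespace."""
    return " ".join(heading.casefold().split())


def merge_indexes(indexes):
    """Combine {book: {heading: [locators]}} into {topic: {book: [locators]}}."""
    merged = defaultdict(lambda: defaultdict(list))
    for book, entries in indexes.items():
        for heading, pages in entries.items():
            merged[normalize(heading)][book].extend(pages)
    return merged


def shared_topics(merged, min_books=2):
    """Topics whose occurrences come from at least min_books different books."""
    return sorted(t for t, by_book in merged.items() if len(by_book) >= min_books)


# Hypothetical indexes from two books.
indexes = {
    "Hypothetical Book A": {"Ideology": ["12", "48-50"], "Print culture": ["7"]},
    "Hypothetical Book B": {"ideology": ["103"], "Copyright": ["15", "22"]},
}
merged = merge_indexes(indexes)

print(len(merged))            # 3 topics in total
print(shared_topics(merged))  # ['ideology']
```

Even in this toy case, most headings occur in only one book, which is the pattern the statistics above describe at a much larger scale.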
By the end of the project, we plan to release a more detailed Topic Curation Report, so stay tuned for more about the content and creation of the ENM topic map. To view the code, which is available under the open source Apache 2.0 license, head over to our GitHub repositories for the frontend, backend, and Vagrant box components.
References
- Newcomb, S. R. (2003). A Perspective on the Quest for Global Knowledge Interchange. In J. Park & S. Hunting (Eds.), XML topic maps: creating and using topic maps for the Web. Boston: Addison-Wesley.