The workflow for managing the topic map is a mix of machine processing and human curation. The Topic Curation Toolkit (TCT) handles automated tasks such as merging and linking topics, while the human editor reviews the results and performs quality control. Core editorial activities happen in the TCT. Most fields can be edited directly, and it is also possible to add new data, “re-run relations,” and mark whether a topic has been reviewed (simply checked over) or edited (in other words, changed in some way). The TCT provides lists of topics that can be sorted (alphabetically, by number of occurrences, or by number of relations) and filtered by review status (unreviewed or reviewed).
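As a rough illustration of the sorting and filtering described above, here is a minimal Python sketch. It is not the TCT's actual code: the `Topic` class, its field names, the `list_topics` function, and the sample records are all hypothetical, made up only to show the idea.

```python
from dataclasses import dataclass


@dataclass
class Topic:
    # Hypothetical record; the real TCT fields may differ.
    name: str
    occurrences: int
    relations: int
    reviewed: bool = False


def list_topics(topics, sort_by="name", reviewed=None):
    """Filter topics by review status, then sort them by one key.

    reviewed=None keeps every topic; True or False keeps only
    the reviewed or the unreviewed ones.
    """
    sort_keys = {
        "name": lambda t: t.name.casefold(),    # alphabetical, case-insensitive
        "occurrences": lambda t: -t.occurrences,  # most occurrences first
        "relations": lambda t: -t.relations,      # most relations first
    }
    selected = [t for t in topics if reviewed is None or t.reviewed == reviewed]
    return sorted(selected, key=sort_keys[sort_by])


# Invented sample data, for illustration only.
topics = [
    Topic("nation", occurrences=120, relations=14, reviewed=True),
    Topic("archive", occurrences=45, relations=30),
    Topic("Memory", occurrences=80, relations=9),
]

for topic in list_topics(topics, sort_by="occurrences", reviewed=False):
    print(topic.name, topic.occurrences)
```

Working through the unreviewed topics in alphabetical order, as I did, would correspond to calling `list_topics(topics, sort_by="name", reviewed=False)` in this sketch.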
When I began my work as Digital Production Editor, I created a workflow for proceeding through the topics in alphabetical order. Between January and June of 2017, I was able to check about 15,800 topics, or a little over one third of the topic map. Of these, 12,647 were marked as reviewed and 3,151 as edited. While working through this part of the topic map, I discovered issues and observed patterns that helped me develop editorial values, principles, and practices to guide my work going forward.
The values I identified for the topic map were:
- Automation over human curation
- Serendipity over rigidity
- Transparency and mitigation of bias
- Autonomy of the user (potential conflict with Serendipity over rigidity)
- Accuracy of content (potential conflict with Automation over human curation)
The principles (more specific than values) included:
- Pay attention to the impact of editorial intervention on the end-user
- Minimize involvement of the editor
  - Keep editorial intervention to the minimum possible, so that the workflow can scale up in the future. This implies a level of trust in the automated processes producing topics and relations.
- Privilege serendipity through loose semantics
  - The dataset resulting from this process should facilitate the serendipitous discovery of concepts by users. Therefore, a broad or loose understanding, especially of the relations between topics, is valued over a restrictive and narrow interpretation of a topic’s meaning and relations. This is especially true for broader or more general topics.
  - In the current iteration of the Topic Curation Toolkit software, and because of the nature of EPUB index markup today, some subentries may have been parsed incorrectly, and “see” or “see also” cross-references may have linked subentries instead of main entries. Because not all of these are easy to find, this kind of messiness or looseness is acceptable in the context of this project.
- Minimize misleading or irrelevant relations between topics
  - Do this to the extent possible while considering the importance of principle 2 (Minimize involvement of the editor).
- Work to get through all topics
  - Strive to address all topics efficiently, but acknowledge that the Digital Production Editor may not be able to review all of them.
- Acknowledge non-neutrality
  - A workflow is a series of decisions and actions. A critical analysis of a workflow acknowledges the points at which bias and judgement enter the process and recognizes the impact of the editor.
  - Relatedly, a critical workflow recognizes that the software and automated processes are designed to make choices that are not always the ones humans would make (e.g., connecting topics based on strings of matching words, which may have distinct meanings).
- Proceed as if we are going to continue adding terms and indexes to the topic map dataset
- Conduct more involved editing in special cases
Finally, the values and principles are informed by, and in turn inform, specific day-to-day practices, or policies. The practices provide guidelines for removing relations, adding relations, splitting topics, merging topics, deleting topics, name scopes, alternate names and changing names, main/sub entry curation, problem records, and URIs. For example, one policy under the category of splitting topics has to do with splitting proper nouns and general concepts:
SPLITTING TOPICS
Policy
Split proper nouns and general concepts into separate topics and move relations to the appropriate topic.
Key examples
Nation, The
Semantic ambiguity can be a problem for the user when they look at occurrences. For example, there was a topic record with two names listed: Nation and Nation, The. The occurrences referred to either the concept of a nation or to the specific publication called The Nation. In other words, one of the names in this topic was a proper noun. A user trying to read about The Nation would have to sift through irrelevant occurrences to find what they needed.
The example here is the topic “nation.” It turned out that some occurrences referred to the concept of a nation, while others referred to the publication The Nation. This happened because the TCT ignores stopwords like “the” and “a,” so the two names were treated as the same topic. Taking into account the values of autonomy and accuracy, we thought that a user trying to read about The Nation would become frustrated with having to sift through occurrences about the general concept, so I created a policy to split those kinds of homographs into separate topics.
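To make the stopword issue concrete, here is a small Python sketch of how ignoring words like “the” and “a” can collapse two different names into one. This is an assumption about the general mechanism, not the TCT's actual matching logic: the `match_key` and `group_by_key` functions are hypothetical, and the stopword list contains only the two words mentioned above.

```python
# Only the two stopwords named in the text; the TCT's real list is unknown.
STOPWORDS = {"the", "a"}


def match_key(name):
    """Build a comparison key: lowercase, drop commas, drop stopwords."""
    words = name.lower().replace(",", " ").split()
    return " ".join(word for word in words if word not in STOPWORDS)


def group_by_key(names):
    """Group index names that share the same comparison key."""
    groups = {}
    for name in names:
        groups.setdefault(match_key(name), []).append(name)
    return groups


names = ["Nation", "Nation, The", "Nationalism"]
groups = group_by_key(names)

# "Nation" and "Nation, The" end up under one key, so an automated
# process working this way would treat them as a single topic.
for key, members in groups.items():
    print(key, members)
```

Nothing in a key like this can tell the general concept from the publication, which is why the splitting policy relies on the editor to separate such homographs by hand and move the relations to the appropriate topic.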
Even though I was the only editor of the ENM topic map, creating a policy document was important for both maintaining consistency in my own work and ensuring that I kept a record of why certain decisions were made, which (I hope) provides some transparency to those who will eventually use the topic map. Read the full policies document for more details (link forthcoming).