Scaffolding Text Encoding

November 5, 2022

J.A.T. Smith, Pepperdine University

Rachel N. Hogan, Syracuse University

This article describes a structured approach to teaching text encoding for beginning digital humanists as part of a broader theoretical and practical instruction in textual editing and criticism.

Here is surely a truth now universally acknowledged: that the whole of our cultural inheritance has to be recurated and reedited in digital forms and institutional structures.
—Jerome McGann, A New Republic of Letters

Introduction

Textual scholars know that working with primary sources is one of the most profound and intimate ways to become acquainted with a text. Yet, few students on the undergraduate level ever have this opportunity. As a result, many students, even very experienced students of literature who have written dozens of thoughtful literary analyses,1 fail to understand the editorial history of the texts that they study—how the book in their hands or on their screens got there. They often lack a beginner’s appreciation of the kinds of alterations made by even well-meaning editors and how the accidents of survival or textual scholarship may impact literary criticism. Embracing the principles behind textual criticism, now expanded to include digital textual criticism, can play a foundational role in developing students’ analytical and interpretive abilities. This is because students who have gone through the process of transcription and editing, as well as analysis and description in the form of the writing of metadata and coding—all of which demand precision and meticulousness—inevitably recognize how much judgment comes into play in quality textual transmission and reproduction. One such student said in her end-of-semester reflection letter, “I found myself questioning what I knew about digital documents and how they existed on the internet.”2 Students come to realize that there is such a thing as good judgment and that all editors carry great responsibility in these areas of scholarship, since ultimately, editors do not know what kinds of burdens a primary source and its associated data will be asked to bear in future work.

Beginning in the 2018–2019 academic year, all students majoring in English at Pepperdine University have been required to enroll in a new gateway class called Introduction to Digital Humanities. It is the second of two introductory courses for the major; the other is a literary and rhetorical studies course. Introduction to DH emphasizes the skills and theory related to the production and dissemination of texts in a digital environment.3 With few exceptions, students arrive with few technical skills beyond the basic computer literacy common to most high school graduates. They can superficially use word processors like MS Word and Google Docs and upload files to social media platforms like Instagram. They can enter search terms into a search engine. They can send and share documents by email and sometimes also through cloud services like Dropbox. What they lack is an understanding of the rules behind those behaviors.

It is not just ignorance, however. Students in every cohort have expressed significant fear of technology, even citing their antipathy to technology as the reason for their decision to major in English. As one student said, “I have a real aversion to the inner workings of technology—it’s half the reason I pursued English.”4 And another, “Prior to this class, I wasn’t too interested in the digital world; I took a computer science class once, and while it was interesting, I clearly learned that the course was not my forte. Therefore, part of me getting into an English major was a goal to avoid such subjects.” After completing this class, however, the same student recognized that avoidance is not an option: “[A]s I sat in our Digital Humanities class, I was proven wrong. The digital world creeps upon all fields of study, whether we like it or not. To accommodate to the realities of the future is a talent in and of itself. I realized that in this class.”5

The Introduction to DH class seeks to break through that surface-level understanding of the digital sphere as well as reduce self-limiting fears of technology. It does so by tracing a historical-literary collection from its initial acquisition to its archival intake and finally to its digital presentation in edited form as part of a curated website. It shows how each stage in a text’s history and its treatment as first object and then as idea can impact its reception and interpretation. Simply put, the students follow this literary collection through the fields of archival studies, textual criticism, and then literary criticism, reading about and then actively engaging in components of each. This end-to-end pedagogical approach is rare in literary studies, but it can be extremely meaningful since it helps students develop the whole picture of what it is that they do. By beginning with a humanist object of study, moreover, this connection is more palatable for the digitally reluctant. The first digitally avoidant student quoted in the previous paragraph, for example, said that after taking the class, “I’ve broadened not only my general horizons but picked up tricks and tools to facilitate better stewardship of the books I’ve so long been passionate about.”6

Since 2018, the class has centered on the creation of a curated digital exhibit for the correspondence between Margaret Martin Brock, an important figure in the Republican Party in the second half of the twentieth century, and several US Presidents. Mrs. Brock’s papers are held by Pepperdine’s Special Collections at Payson Library. So far, four cohorts of between 16 and 18 students each have completed the class and their respective projects: M.M. Brock and Dwight Eisenhower, Fall 2018; M.M. Brock and Richard Nixon, Spring 2020; M.M. Brock and Gerald Ford, Spring 2021; and M.M. Brock and Ronald Reagan, Spring 2022. In addition to its historical interest, institutional relevance and accessibility, this collection was selected because it also met some practical needs. It includes many different objects, five or six dozen per presidential group, which can be easily divided among students, whether there are twelve or twenty enrolled. In addition, letters themselves are an excellent object for beginning coders, since they are fairly formulaic in their constituent parts and, therefore, provide many opportunities for learning through repetition.7

The class itself has been led by a professor of English, J.A.T. Smith, who brings in several DH liaisons from across campus, primarily from the Library8 but also from IT, to supplement the instruction with their areas of expertise.9 It has also been supported by one or more student interns in off semesters, that is, the semester when the class is not in session. These interns, who have taken the class in its previous iteration, minimize losses that result from the annual turnover of the larger body of the Brock Project team by addressing outstanding issues from the previous class and preparing some of the materials for the upcoming one. Jason Eggleston, Pepperdine’s Senior Application Analyst, serves as the DH liaison for the instruction of TEI-XML, and Rachel Hogan, an English major who graduated in 2021 and a current library and information science student at Syracuse University, served as the 2020–2021 intern. Students in the class are grouped into teams (digitization, coding, web design, and editing/historical context respectively), though all members are introduced to the stages of the project’s life cycle, beginning with an examination of the letters in Special Collections, followed by their digitization, transcription, coding, and description, and ending with their presentation on a publicly-aimed curated website.10

The students’ public digital projects are hosted on four respective Omeka sites funded by Pepperdine Libraries. At the end of each semester, the student-generated metadata is also imported directly from Omeka through the API into CONTENTdm, Pepperdine Libraries’ digital management system, and attached to the digital objects in the Pepperdine digital collections. (The Library also maintains a redundant copy on its own physical servers.) The transcriptions, coded letters, and accompanying essays are archived in a class Google Drive and maintained by Smith.

This paper focuses on the instructional process for the lessons concerned with coding as a form of textual criticism, with a special emphasis on one of the scaffolding assignments (lesson three below). In other words, it focuses on how we, the instructional staff and intern, addressed the relative lack of technical background in the enrolled students. The lessons, as presented, characterize the Spring 2021 and Spring 2022 classes, informed by experiences gained from both the Fall 2018 and Spring 2020 classes, each of which concluded with a “DH Debrief,” a reflexive exercise in which one DH cohort provides input on how to improve the process for the following cohort.

Coding the Text

Lesson-by-Lesson

1. A warm-up—MS Word Styles and indexing by a digital humanist
2. XML-TEI lecture by a technical specialist
3. XML-TEI analog worksheet of exemplary letter
4. XML-TEI coding of entire letter set using Oxygen or Visual Studio Code

A warm-up—MS Word Styles and indexing by a digital humanist

The coding unit begins with an introduction to hierarchy in MS Word by Smith and is followed by an introduction to TEI-XML by Eggleston, though students will have already transcribed the letters as part of an earlier unit on textual editing and criticism. While many students use MS Word, few fully utilize its affordances and constraints or recognize the logic that governs format. In this preliminary lesson, therefore, students learn how to differentiate between content and form in two activities: the first includes the imposition of MS Word Styles hierarchy for an unformatted textbook, including the generation of a table of contents, and the second involves the indexing of several pages of that textbook.11 The first of these two preliminary activities introduces students to the implicit structures of text by encouraging them to consider what the purpose of textual apparatus is, in this case, the table of contents, chapter headings, section headings, and subsection headings, and how such analytical work makes the text accessible in ways most useful to students and scholars. It sets students up to understand the OHCO (ordered hierarchy of content objects) model that necessarily (though problematically) governs every XML file.12 It also requires students to identify the relationship between form and function, since they must consistently associate the right font, size, color, and spacing with each level of text. One student in the most recent cohort recognized how important it is to understand the oftentimes poor alignment between text and technology: “I am grateful that in this class I learned more about how the humanities and technology intersect with each other, and I also loved getting to learn about how they sometime[s] don’t like to intersect perfectly with each other (looking at you hierarchies).”13
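The link between heading styles and a generated table of contents can be sketched in a few lines of Python. This is only an illustration of the idea, not the class exercise itself: the heading titles, the `build_toc` and `is_well_nested` helpers, and the level numbers are all hypothetical stand-ins for what a word processor does when it reads Heading 1, Heading 2, and Heading 3 styles.

```python
# Hypothetical (style level, heading text) pairs, as if read from a styled document.
# Level 1 stands in for a chapter heading, 2 for a section, 3 for a subsection.
headings = [
    (1, "Chapter 1"),
    (2, "Section 1.1"),
    (3, "Subsection 1.1.1"),
    (2, "Section 1.2"),
    (1, "Chapter 2"),
]


def build_toc(headings):
    """Return indented table-of-contents lines, one per heading."""
    return ["  " * (level - 1) + title for level, title in headings]


def is_well_nested(headings):
    """Check that no heading skips a level, e.g. a subsection directly under a chapter."""
    previous = 0
    for level, _title in headings:
        if level > previous + 1:
            return False
        previous = level
    return True


toc_lines = build_toc(headings)
print("\n".join(toc_lines))
print("well nested:", is_well_nested(headings))
```

The point of the sketch is that the table of contents is derived entirely from the levels assigned to each heading, not from how the headings look on the page, which is the same separation of structure from appearance that XML later makes explicit.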

The second of these two preliminary activities, the indexing, forms students’ understanding of the relationship between textual accidence and the indexical substance—for example, the difference between “Nixon, Richard” in an index and instances of reference to “President Nixon,” “Nixon,” the “37th president,” “Dick,” or even “he” or “him” that they might encounter in the body of the text. Finally, students practice discerning between different registers of production and how to navigate detail thresholds, since it is not possible to index (or code) every aspect of every element of a text. The indexing is also a formative activity for the development of students’ research skills, since many are not yet in the habit of using indices when doing research using physical books.
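The relationship between varied surface forms and a single index heading can be sketched as a small lookup. The following Python is a minimal illustration under stated assumptions: the `SURFACE_TO_HEADING` table, the `build_index` helper, and the sample page text are all invented for this example and are not drawn from the textbook the students indexed.

```python
# Hypothetical authority list: surface forms found in the text -> index heading.
SURFACE_TO_HEADING = {
    "President Nixon": "Nixon, Richard",
    "Nixon": "Nixon, Richard",
    "37th president": "Nixon, Richard",
    "Dick": "Nixon, Richard",
}


def build_index(pages):
    """Map each index heading to the sorted page numbers on which any of its surface forms occurs."""
    index = {}
    for page_number, text in pages.items():
        for surface, heading in SURFACE_TO_HEADING.items():
            if surface in text:
                index.setdefault(heading, set()).add(page_number)
    return {heading: sorted(numbers) for heading, numbers in index.items()}


# Invented sample pages, for illustration only.
pages = {
    12: "President Nixon replied within the week.",
    13: "He thanked her for the note.",
    14: "The 37th president wrote again in May.",
}

index = build_index(pages)
print(index)
```

Page 13 is deliberately left out of the result: a pronoun such as "He" cannot be resolved by string matching, which is one reason indexing (and, later, tagging) remains an act of editorial judgment rather than a mechanical search.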

XML-TEI lecture by a technical specialist

Next, students are introduced to TEI-XML by Jason Eggleston, Pepperdine University’s Senior Application Analyst. In former years, the “Coders” began coding their own letter sets immediately after this introduction; however, during the end-of-term debriefs, both earlier versions of the class recommended the creation of a highly structured TEI worksheet to serve as an intermediate step. They wanted low-stakes practice to complement the theoretical instruction. As an intern, and along with help from Eggleston and Smith, Rachel Hogan designed this worksheet from the TEI P5 Guidelines and a list of coding standards that her coding team from the Spring 2020 class developed.

XML-TEI analog worksheet of exemplary letter

The worksheet itself has two parts, meant to be taught across two separate days of class. The first part goes through the process of digital textual editing, situates the coding assignment in the context of the larger project, and explains the parts of a letter and their corresponding tags. Since students today have less experience writing and receiving analog letters, they may initially have difficulty with the conventions of correspondence architecture, that is, with naming the parts of the letter and matching those parts to the corresponding XML element or tag, but this lack of familiarity is quickly overcome. In the second part, students practice using the software programs Oxygen or Visual Studio Code to encode the same sample letter.

We visualized the digital lifecycle by showing them a side-by-side-by-side example of what the process and finished product look like: digitized letter, transcribed letter, coded letter (Fig. 1).

Three images side-by-side-by-side. The first is a digital facsimile of a letter; the second is a copy of that letter transcribed into a Word document, and the third is a copy of that same letter with xml tags added. — Figure 1. Three representations of the same text: digitized, transcribed, and coded.

Next, we took a digitized letter and mapped out the different parts of the letter to show the relationship between hierarchy and coding (Fig. 2). This also gave students specific direction on where certain elements belonged outside of the long list of tags given at the end of the worksheet.

A digital facsimile of a letter with each functional element of that letter circled and labeled with its type and its associated TEI tag: letter header, dateline, opening salutation, body, closing salutation, signature, and letter footer. — Figure 2. Mise-en-page of a letter.
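For readers without access to the figure, the hierarchy it describes can be approximated in code. The sketch below is an assumption-laden illustration: the element names (`fw`, `opener`, `dateline`, `salute`, `p`, `closer`, `signed`) are real TEI P5 elements, but the mapping shown here is our reconstruction rather than a copy of the worksheet, the bracketed text is placeholder content, and the fragment omits the TEI namespace and `teiHeader`, so it is well-formed XML but not a complete TEI document.

```python
import xml.etree.ElementTree as ET

# Placeholder letter fragment. The bracketed text stands in for transcribed content.
LETTER = """
<div type="letter">
  <fw type="header">[letterhead]</fw>
  <opener>
    <dateline>[date]</dateline>
    <salute>[opening salutation]</salute>
  </opener>
  <p>[body paragraph]</p>
  <closer>
    <salute>[closing salutation]</salute>
    <signed>[signature]</signed>
  </closer>
  <fw type="footer">[letter footer]</fw>
</div>
""".strip()


def outline(element, depth=0):
    """Return one indented line per element, showing how the parts of the letter nest."""
    label = element.tag
    if "type" in element.attrib:
        label += f' type="{element.attrib["type"]}"'
    lines = ["  " * depth + label]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(LETTER)
letter_outline = outline(root)
print("\n".join(letter_outline))
```

Printing the outline makes the ordered hierarchy visible: the dateline and opening salutation sit inside the opener, the closing salutation and signature sit inside the closer, and the body paragraphs sit between them.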

Last, we created a fill-in-the-blank exercise (Fig. 3) allowing students to practice filling in the tag names, without the added pressure of formatting and locating that comes with coding. We included a graphic on the right-hand side that allowed them to see what part of the transcribed letter they were tagging. The first page required the tagging of the header, dateline, and opening salutation of the letter; the second page required tagging of the body of the letter; and the last page required the tagging of the closing salutation, signature, and footer.

On the left of the image, the header, dateline, and opening salutation of a letter with empty boxes in green and red to be filled in with the appropriate TEI tags for the information included. Green represents an opening tag and red represents the closing tag. On the right side of the image, a reduced version of the entire letter with the associated sections highlighted in yellow. — Figure 3. TEI tag fill-in-the-blank worksheet for letter header, dateline, and opening salutation.

XML-TEI coding of entire letter set using Oxygen or Visual Studio Code

After completing the worksheet, students were then invited to code the same letter in Oxygen or Visual Studio Code by starting once again with a plain text transcription, copying and pasting it into the software, adding the header as detailed on the assignment, and transferring the tags that they had written on the worksheet.

Overall, the fill-in-the-blank exercise was a success in both its first and second implementations. The “Coders” were more comfortable completing the Oxygen/Visual Studio Code exercise after first doing the warm-up fill-in-the-blank worksheet. The sequence gave the coding team two practice exercises before they began to work on their own letter sets. Smith, Eggleston, and Hogan supplemented the initial activity by meeting with the coding team both in scheduled coding time during class and during individual office hours to help with other questions that they had. In a thank-you letter to Rachel at the end of the semester, one student wrote, “The TEI worksheet was helpful when learning how to code, and I really appreciated when you met with me for office hours to clarify some of my coding confusion.”14 The preliminary MS Word exercises, the worksheet, and the time addressing student questions and concerns all worked together to help the coding team gain confidence with their new technical skill set and complete their coding work successfully.

When transitioning from the worksheet to the software, however, students should be reminded that XML is case sensitive, so changing the case of a tag will cause their code to fail when entered into the software, and that punctuation (with the exception of punctuation to indicate abbreviations) belongs outside of the tag. It is also important that, when the Coders are freed to work on their own letter sets, time be allotted for the group to decide on an acceptable level of detail and consistency in tagging, that is, code norming or coding consensus. Students were unprepared to recognize the level of variability inherent in coding, an activity that they believed to be more objective than it is. For example, some students were comfortable with using simple elements with few attributes unless necessary (e.g., <fw type="header"></fw>), while others used attributes to add detail to elements such as <region type="state" n="NY"></region>.
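The case-sensitivity warning can be demonstrated with a short well-formedness check. This is a minimal sketch using Python's standard XML parser rather than Oxygen or Visual Studio Code; the `is_well_formed` helper and the sample salutation text are hypothetical and serve only to illustrate the failure.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Invented sample text, for illustration only.
matching = "<opener><salute>Dear Mrs. Brock:</salute></opener>"
mismatched = "<opener><Salute>Dear Mrs. Brock:</salute></opener>"
consistently_capitalized = "<opener><Salute>Dear Mrs. Brock:</Salute></opener>"

print(is_well_formed(matching))                  # opening and closing tags agree
print(is_well_formed(mismatched))                # <Salute> is closed by </salute>
print(is_well_formed(consistently_capitalized))  # tags agree with each other
```

The third case is worth noting: a consistently capitalized `<Salute>` is well-formed XML, so a plain parser accepts it, but TEI defines no element by that name, so validation against a TEI schema would be expected to reject it. Students therefore need both kinds of check, and a schema-aware editor normally supplies the second.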

This variability did not necessarily cause any errors in the code, but it should be noted that, depending on how the code is used in the future, the variation in coding practices could be problematic. If the code were to be used in a way that is analogous to the way that an index is used in a physical book, then every instance of a reference would not need to be coded because the purpose of the code would be for location. Just as a book does not index multiple instances of a concept on a single page, digital tags meant to locate references would not need to be comprehensive. A reader could quickly navigate to a particular paragraph and identify the other ways that that reference manifests. If, however, the code were being used to count instances of a particular reference, as, for example, is the case in much linguistic analysis, then all manifestations of the term, whether as a pronominal reference or in abbreviated form, would need to be coded. A brief statement of coding practices similar to a statement of editorial principles, however, could clarify which approach to coding was taken and help prevent end-users from assuming a level of comprehensiveness in searches that may not be present.
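The difference between tagging for location and tagging for counting can be made concrete with a small comparison. The sketch below is illustrative only: the sentences are invented, the `key` value is hypothetical, and the choice of `persName` for names and `rs` (referring string) for the pronoun is our assumption about how such references might be encoded, not a record of the students' coding standards.

```python
import xml.etree.ElementTree as ET

# Sparse encoding: only the first reference in each paragraph is tagged.
SPARSE = """
<body>
  <p n="1"><persName key="nixon">President Nixon</persName> wrote that he would attend.</p>
  <p n="2"><persName key="nixon">Nixon</persName> thanked her, and he promised to write again.</p>
</body>
""".strip()

# Full encoding: every reference is tagged, including the pronoun in paragraph 2.
FULL = """
<body>
  <p n="1"><persName key="nixon">President Nixon</persName> wrote that he would attend.</p>
  <p n="2"><persName key="nixon">Nixon</persName> thanked her, and <rs type="person" key="nixon">he</rs> promised to write again.</p>
</body>
""".strip()


def paragraphs_mentioning(root, key):
    """Location use: return the numbers of the paragraphs with at least one tagged reference."""
    return [p.get("n") for p in root.iter("p")
            if any(el.get("key") == key for el in p.iter())]


def count_mentions(root, key):
    """Counting use: return the total number of tagged references."""
    return sum(1 for el in root.iter() if el.get("key") == key)


sparse_root = ET.fromstring(SPARSE)
full_root = ET.fromstring(FULL)

print(paragraphs_mentioning(sparse_root, "nixon"), paragraphs_mentioning(full_root, "nixon"))
print(count_mentions(sparse_root, "nixon"), count_mentions(full_root, "nixon"))
```

Both encodings lead a reader to the same two paragraphs, so either serves an index-like purpose. Only the counts differ, which is why a statement of coding practices matters to anyone who later treats the tags as data. Note that even the fuller version leaves the "he" in the first paragraph untagged, so its count is still only as complete as the coders' decisions.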

Conclusion

By the end of this unit, students have gained experience reinforcing the difference between accidence, the instance of a thing, and substance, the meaning of a thing. This Platonic distinction is a key element in any form of interpretation that depends on thesis-driven claims based on inductive logic, the approach of virtually all English literary essays. Students improve their hermeneutic skills because they advance in the theoretical understanding of language in its manifestation and in its intent through structured repetition, accomplished by distinguishing between the content of a text and its form and function a couple hundred times. Although students may not initially see the connection between this technical application and their literary study, over time, they see that these skills are analogous to the scanning of a line of poetry to determine metrical patterns, identifying figurative devices in a work of literature, or recognizing its theme. These activities take individual instances of a thing and demand that the reader determine what, if any, category it belongs to.

For some students, this basic introduction to the hierarchical and analytical aspects of digital text is sufficient to meet their professional or academic needs; however, for those students who seek additional training and hands-on experience, there are many pathways available to them. English students early in their careers who are interested in developing additional technical skills are encouraged to add a digital humanities minor where they may take classes in JavaScript, R, Excel, or Adobe Photoshop.15 Students who are interested in developing additional textual editing skills, perhaps in preparation for graduate school in the humanities where they will be expected to work extensively with primary sources, are encouraged to seek out research assistantships with humanities professors who are editing primary sources, whether in English, history, religion, or a foreign language. Students who are interested in editorial work are also encouraged to seek out opportunities as early-stage formatters and indexers for academic books. And students who want to develop their understanding of library work or information systems are encouraged to seek internships in Special Collections and Archives, with the Digital Librarian, or in the Library Maker Space. These opportunities, though perhaps in slightly different forms, should be available at most colleges and universities, where there are always professors who need primary sources transcribed or book manuscripts formatted or indexed or where there are librarians who need assistance cataloging and describing new collections.

Another product of this course as a whole has been a shift in the career goals of some English students. Every year since the class has been offered, one to two of the English students, upon graduation, have decided to continue their studies in library and information science graduate programs. This was not the case before the class was implemented. These recent graduates have now gone on to enter MLIS programs at the University of Oklahoma, the University of North Carolina at Chapel Hill, and Syracuse University. One of the authors of this paper, Rachel, has found the background in metadata, text mining, and encoding invaluable in her graduate studies given her emphasis in digital scholarship and digital preservation in academic libraries. These three skills have also been useful in her work as a library graduate assistant. For example, she recently partnered with Syracuse University Libraries’ Digital Scholarship Librarian for a workshop on text mining technology.

And many students, whether they had plans to continue in these areas of study or not, found the DH course to be relevant:

As someone who likes to learn hands-on, being thrown in the Digital Humanities world by working on a real-world project was life-changing. I walked into the course a bit fearful, however, I am so proud of the work that I, along with my team, completed.
—T. G., Spring 2022

To me, this class felt more like an actual job than simply a required class for my major. I felt like I was contributing to the real world of scholarship.
—S.G., Spring 2022

The Brock Project itself was an incredibly eye-opening experience because it provided real-world application for things that were previously only abstract concepts in my head.
—A.H., Spring 2022

In a beautiful bit of irony, the exploration of the digital helped students to better understand the real. Textual scholarship facilitates literary scholarship. This greater understanding, together with the basic digital humanities skills that are introduced in this course, is good reason to continue and expand this kind of instruction in undergraduate humanities programs.

Notes

1 Coding can also have many positive applications for literary interpretation. See Jacobs (2013).
2 N. B., Reflection Letter, Spring 2022. Quotations from students throughout this essay are all drawn from an end-of-semester letter reflection.
3 The class also enrolls one to two students who are Digital Humanities minors, as well as the occasional student who is interested in the course for elective credit.
4 J. R., Reflection Letter, Spring 2022.
5 A. C., Reflection Letter, Spring 2022.
6 J. R., Reflection Letter, Spring 2022.
7 While designed to run more or less the same, none of these classes dealt with the same instructional parameters. With the exception of the last, all of these classes faced tremendous challenges that would not be considered usual in the course of a typical semester. The Fall 2018 semester was interrupted by the Woolsey Fire, which required the full evacuation of campus and the completion of the semester entirely online. The Spring 2020 class experienced a similar mid-semester disruption due to the COVID-19 outbreak. And the Spring 2021 class had to be conducted exclusively online and therefore lost the excellent hands-on work in Special Collections. Heroic efforts on the part of the library made it possible for each of the classes to continue and meet its learning objectives.
8 The course is also supported by Payson’s Special Collections Librarian, Melissa Nykanen, University Archivist, Kelsey Knox, and Digital Librarian, Josias Bartram in 2018 and 2020 and Gavin Do in 2021. See Brooks (2017) for another example of teaching TEI at a liberal arts college—though this example is one led exclusively by library staff.
9 This concurs with Francesca Giannetti (2019), who argues that text encoding should be incorporated in the teaching duties of librarians, who are already involved in undergraduate education through digital and information literacy, because it allows for practical application of digital and technical skills.
10 See Rehbein and Fritze (2012) and Engel and Thain (2015), who both discuss a similar end-to-end instructional approach in their courses.
11 Our gratitude to Ted McAllister, who offered up his unpublished text, The Paradox of Freedom, for the purposes of these two related assignments.
12 The assumption that all text can be simplified according to a single, non-overlapping hierarchy has faced numerous challenges. See Pichler (2021) and Pierazzo (2015, 317).
13 M. B., Reflection Letter, Spring 2022.
14 E. A., Reflection, Spring 2021.
15 A student who graduates with a minor in digital humanities should be able to:

Bring together the traditional tools of humanistic thinking (interpretation and critique, historical perspective, comparative cultural and social analysis, contextualization, archival research) with the tools of computational thinking (information design, statistical analysis, geographic information systems, database creation, and computer graphics) to formulate, analyze, and interpret a humanities-based research problem;
Understand and produce humanities-based data from multi-modal and multimedia sources through systematic data processing and data-mining;
Evaluate and design digital projects and tools critically for communication, project development, and long-term preservation of digital data in ways that demonstrate an understanding of the appropriate uses and limitations of tools and projects on both practical and ethical levels, including sensitivity to issues of sustainability, intellectual property, open access/proprietary knowledge, and private and public dissemination;
Work collaboratively and think across disciplines, media, and methodologies on multi-authored research projects, project proposals, reports, and presentations aimed at both academic and nonacademic communities.

References

Brooks, Mackenzie. 2017. “Teaching TEI to Undergraduates: A Case Study in a Digital Humanities Curriculum.” College & Undergraduate Libraries 24, nos. 2–4: 467–81. https://doi.org/10.1080/10691316.2017.1326331.

Engel, Deena, and Marion Thain. 2015. “Textual Artifacts and Their Digital Representations: Teaching Graduate Students to Build Online Archives.” Digital Humanities Quarterly 9, no. 1. http://www.digitalhumanities.org/dhq/vol/9/1/000199/000199.html.

Giannetti, Francesca. 2019. “ ‘So near while apart’: Correspondence Editions as Critical Library Pedagogy and Digital Humanities Methodology.” The Journal of Academic Librarianship 45, no. 5: 102033. https://doi.org/10.1016/j.acalib.2019.05.001.

Jacobs, Sarah Ruth. 2013. “Digital Close Reading: TEI for Teaching Poetic Vocabularies.” The Journal of Interactive Technology and Pedagogy 3. https://jitp.commons.gc.cuny.edu/digital-close-reading-tei-for-teaching-poetic-vocabularies/.

McGann, Jerome. 2014. A New Republic of Letters: Memory and Scholarship in the Age of Digital Reproduction. Cambridge, Massachusetts: Harvard University Press.

Pichler, Alois. 2021. “Hierarchical or Non-hierarchical? A Philosophical Approach to a Debate in Text Encoding.” Digital Humanities Quarterly 15, no. 1. http://www.digitalhumanities.org/dhq/vol/15/1/000525/000525.html.

Pierazzo, Elena. 2015. “Textual Scholarship and Text Encoding.” In A New Companion to Digital Humanities, edited by Susan Schreibman, Ray Siemens, and John Unsworth, 307–21. John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118680605.ch21.

Rehbein, Malte, and Christiane Fritze. 2012. “Hands-On Teaching Digital Humanities: A Didactic Analysis of a Summer School Course on Digital Editing.” In Digital Humanities Pedagogy: Practices, Principles and Politics, edited by Brett D. Hirsch, 47–78. Open Book Publishers. https://doi.org/10.2307/j.ctt5vjtt3.7.

Appendices

Appendix A – TEI Worksheet and Assignment Sheet

Appendix B – TEI Worksheet Answer Key

Appendix C – XML for Brock-Nixon 1968-05-24

About the Authors

J.A.T. Smith (PhD, UCLA) is an Associate Professor of English, Coordinator and Founder of the Digital Humanities Minor, and the Associate Director of the Center for Faith and Learning at Pepperdine University. In her time there, she has received an award for excellence in teaching as well as a major grant to develop a pedagogical app called The Vineyard. Her scholarly research focuses on the language and theology of late medieval English bishop Reginald Pecock. She has published a translation and commentary of his last polemical volume, The Book of Faith (UCLA-CMRS, 2020) and is currently working on a major project which seeks to sequence and situate the entirety of his corpus (even the burned books).

Rachel N. Hogan is a master’s student studying Library and Information Science at Syracuse University. She received a Bachelor of Arts degree in English Literature from Pepperdine University. As an Information Literacy Scholar at Syracuse University Libraries, she works with the Learning and Academic Engagement and Information Literacy Departments. She is also a graduate student intern at SUNY Environmental Science and Forestry’s Moon Library and helps with their scholarly communication, assessment, and service desks. Her scholarly interests include academic librarianship, digital scholarship, and information literacy.

This entry is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
