Journal of Interactive Technology and Pedagogy, no. 27
Putting the Friction Back In: Minimal Computing Approaches to Corpus Construction

Anouk Lang, University of Edinburgh

Abstract

This article sets out ways that corpus literacy can be taught in the digital humanities classroom to illuminate for students the practical steps and curatorial decisions that go into constructing a corpus, and the implications of these decisions for the computational text analysis that follows. It proposes a framework that resonates with the principles of minimal computing while also leading students to interrogate the resources and the labor required for constructing textual corpora. It suggests critical readings that can be used in tandem with an exploration of the Google Ngram Viewer and the Google Books project whose data underlies it, as a way into understanding the limitations of Google's digitization project and the importance of reliable metadata and robust OCR (optical character recognition), as well as the historical contingency of projects claiming to widen access to information. It lays out ways to lead students through the practical work of building their own corpus, from undertaking OCR on their own devices to the cleaning and structuring steps that, carried out collaboratively, bring awareness to concerns including file naming conventions, logical directory structures, accurate metadata, and version control, while also fostering the crucial digital humanities (DH) skill of working collaboratively with others. This kind of corpus literacy is, I argue, not only compatible with a minimal computing approach but one of the starting points from which a broader program of critical AI literacy might begin.

Keywords: corpus literacy; AI literacy; minimal computing; OCR; Google Books.

There might not at first seem to be much common ground between minimal computing and artificial intelligence in its post-GPT-3 incarnation. But the frictionless interfaces and cheerful personas of generative AI chatbots obscure the energy, resource, and human costs of training large language models (see for example Kgomo 2025; Narayanan and Kapoor 2024, 144; Rowe 2023; and Schwartz et al. 2019), while big tech companies hoover up increasing amounts of data, shatter benchmarks, dominate competitors, and generate higher revenue. In this context, bringing minimal computing and its principles into the classroom provides a countervailing perspective to the “arms race” ethos driving developments in machine learning, and presents students with a framework for thinking critically about the AI-powered technologies that underpin our daily lives. Concomitantly, the need for AI literacy is becoming increasingly pressing (see Dimock 2020, 452–3; Mollick 2024, 45; Raley and Rhee 2023, 191; Willison 2024), as the rise of downstream problems such as misinformation indicates that many people do not necessarily understand that a sequence of probabilistically generated tokens is not in the same epistemological category as a statement whose reliability can be gauged by referring to a credible source. The confusion between categories is compounded by the rhetorical authority AI chatbots perform, and by the fact that the outputs of large language models (LLMs) reflect the limitations of the corpora on which they have been trained.

It is this latter problem of the relationship between the outputs of LLMs and the corpora they have been trained on that I take up in this essay, which presents a minimal computing approach to corpus literacy and how it can be incorporated into the classroom. While “corpus literacy” is used in the field of language learning to designate “the ability to use the technology of corpus linguistics to investigate language and enhance the language development of students” (Heather and Helt 2012, 417), I am using the term here in a broader sense to refer to questions such as where the texts comprising a corpus have come from, the principles governing text selection, and the preprocessing steps that have been applied to them. If students have only a vague sense that AI chatbots rely on “scraping the internet,” they are likely even hazier on the processes involved in turning texts into data. What are the practical steps and the curatorial decisions that go into constructing a corpus, and what are the implications of these decisions for the computational text analysis that follows? In a single semester, it may not be possible for teachers with limited resources to train a model, though there are ways to do this on a laptop that align far better with the minimal computing principle of, as Roopika Risam and Alex Gil put it, “using only the technologies that are necessary and sufficient for developing digital humanities scholarship in … constrained environments” (2022), than do the gargantuan models that require new data centers to be built. However, focusing on a prior condition for generative AI, corpus construction, makes it possible to address these questions and give students hands-on experience in constructing a corpus themselves.

In what follows, I outline how I take students through the process of corpus construction in my Digital Humanities for Literary Studies course at the University of Edinburgh, a course I designed in 2014 and have been teaching to final-year undergraduates and master’s students since. I am fortunate to work at a well-resourced institution with a cohort of students who, since the early 2020s, bring both laptops and smartphones to class. One constraint is that no computer lab is available for teaching the course, resulting in a mix of Mac, PC, and Linux machines in the classroom. This means using web-based interfaces and nonproprietary software which can be used across operating systems. Over the years I have put increasing emphasis on assigning students secondary reading on the historical and technical background to digitization projects such as Google Books, an approach which aligns the course with the ethos of minimal computing not so much in terms of a set of methodologies but rather as “a mode of thinking about digital humanities praxis that resists the idea that ‘innovation’ is defined by newness, scale, or scope” (Risam and Gil 2022). It also resonates with Jentery Sayers’s (2016) point that minimal computing has more to do with the material particulars of computation than elements of minimal design such as plain text, simplified layouts, and pared-back interfaces, which have been prominent in thinking about minimal computing so far. AI hype in mainstream media discourse persistently directs attention away from the labor and the materiality underlying computational work, toward the latest model, the next benchmark achieved, and the newest Rubicon crossed. Against that pull, consciously going back down the big data scale to the OCR and preprocessing of a single text, and turning back in time from the apparently imminent AGI (artificial general intelligence) singularity to situate contemporary practices in their historical context, can itself feel like a radical move.

Thinking Critically and Historically about Corpora

Scholarly attention to corpus construction predates both LLMs and Google Books by some decades, as the body of scholarship on the topic within the discipline of corpus linguistics makes clear (see for example Biber 1993; Hardie and McEnery 2011, 2–22; Tognini-Bonelli 2001, 57–62). But while students in linguistics departments—often institutionally housed within social science or computer science administrative units—might be exposed to this body of literature, humanities students are less likely to encounter it. A reading which I assign in order to open up questions around the politics of corpus construction without the unfamiliar language of corpus linguistics is Tressie McMillan Cottom’s (2016, 542, 543–4) essay “More Scale, More Questions,” which elaborates the ways that assumptions are always embedded in the corpora on which so-called big data depends, and asserts the need for quantitative textual analysis to begin with an interrogation of the power relations, and the economic forces, involved in the construction of a corpus.

In preparation for querying the Ngram Viewer (a free online interface allowing users to search and visualize how frequently specific words or phrases have appeared in books over time) and eventually constructing their own corpus, students are assigned three essays by Robert Darnton published in the New York Review of Books on the Google Books digitization project. Writing in 2008, a few years after Google began digitizing the holdings of large research libraries and public libraries, Darnton was initially enthusiastic about the project’s potential to widen access to books, calling it “the ultimate stage in the democratization of knowledge set in motion by the invention of writing, the codex, movable type, and the Internet” (Darnton 2008). But he also raises salient—and prescient—concerns, for instance around the obsolescence of electronic media, the potential imperilment of digitized books should Google’s corporate fortunes decline, errors made while scanning, practices that deviate from the standards established by bibliographers and thus impede discoverability, and the potential for texts which are not digitized to become less visible and thus perceived as less important. In the second article, published in 2009, Darnton sets the Google Books project in the context of the Enlightenment and the economic imperatives shaping the dissemination of scholarly knowledge in both the eighteenth and the twenty-first centuries. If a certain level of privilege was required for individuals to participate in the burgeoning intellectual networks of the Republic of Letters, so too in the digital era economic inequity inflects access to knowledge when, for instance, publishers increase journal subscription prices beyond what university library budgets can bear. Weighing the possible harms of putting so much power, and control over so much information, into the hands of a single tech company, Darnton points out that as Google is a profit-making enterprise, its motivations in preserving books will inevitably differ from those of libraries, meaning that librarians “cannot sit on the sidelines, as if the market forces can be trusted to operate for the public good” (Darnton 2009). The third article, from 2014, seeks to “imagine a future free from the vested interests that have blocked the flow of information in the past” (Darnton 2014). It puts forward some mechanisms by which access to digitized books and digitized cultural heritage can be opened up: the use of preprint repositories, library consortia negotiating sustainable pricing structures with publishers, and the creation of institutions such as the Digital Public Library of America, which widens access to digitized holdings in libraries across the US through a distributed structure.

Darnton’s perspective as a book historian and librarian enables him to—in line with McMillan Cottom’s exhortation—cast light on the economic underpinnings of the Google Books project, whose data the Ngram Viewer visualizes, and in the process to complicate the view, common among students, that information “wants to be free” and that the costs of making it accessible online and preserving it in perpetuity are negligible. Taken together, his three articles not only situate Google’s digitization efforts in a historical line of other attempts to disseminate—and monetize—access to information, and emphasize the material conditions giving rise to those efforts, but also make clear what is at stake in a mass digitization project of this sort. There are many other large digital corpora to which there is only time to gesture briefly, for example HathiTrust and Chronicling America, along with “shadow libraries”: pirated archives of copyrighted publications such as Library Genesis (see Eve 2022). Choosing to focus on Google Books via Darnton’s critical eye is an attempt to temporarily pause students’ automatic recourse to the Google search box, and to help them see that behind the frictionless process of searching there are multiple operations with a great deal of friction built in, which have unfurled over time and which have economic and legal ramifications. As the three articles predate questions that rose to public prominence in the early 2020s around the ethical, labor, and copyright implications of the web-scraping tactics of the big tech companies building LLMs, they demonstrate to students the historical continuities to be found between the current moment of AI hype and previous points when tech companies have forged ahead with ingesting large amounts of text without paying sufficient attention to established principles from fields such as bibliography and information science.

Ngrams, OCR, and Data Literacy

Students now get to encounter the Google Books data—along with some of its problems—for themselves, by exploring the Google Ngram Viewer. They begin this section of the course by reading the canonical “culturomics” paper accompanying its launch (Michel et al. 2011), alongside an essay conveying the sense of excitement that the Ngram Viewer generated on its release (Cohen 2010).

The Ngram Viewer has a number of pedagogical benefits. Importantly, it is fun: students can quickly devise queries relating to their own interests. The site is fairly stable, as the ngrams are precomputed. The search box allows for both simple keyword queries and more advanced queries using a specific search syntax, for instance grammatical tags such as “searchword_ADJ” to find adjectives. For students with little or no exposure to more advanced search strategies, such as using Boolean operators, this is a step up in search literacy which is not too intimidating, and which can be readily grasped via a search syntax cheat sheet by Alan Liu (2022).
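For instructors who want to pull the same data programmatically, for instance to prepare class examples, the Ngram Viewer also exposes a JSON endpoint. The endpoint and its parameter names are undocumented and unofficial (an assumption based on the web interface's current behavior, so they may change or disappear without notice); a minimal sketch in Python:

```python
# Minimal sketch: querying the Google Ngram Viewer's undocumented JSON
# endpoint. The endpoint and parameter names are unofficial assumptions
# inferred from the web interface, and may change without notice.
import requests

params = {
    "content": "color,colour",  # comma-separated search terms
    "year_start": 1800,
    "year_end": 2019,
    "corpus": "en-2019",        # corpus identifier as used by the web interface
    "smoothing": 0,             # raw yearly values, no moving average
}
resp = requests.get("https://books.google.com/ngrams/json", params=params)
resp.raise_for_status()

for series in resp.json():
    # Each series pairs an ngram with one relative frequency per year.
    print(series["ngram"], series["timeseries"][:5])
```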

Alongside its ease of use, the Ngram Viewer also offers multiple ways into the unreliability of the Google Books corpus, bringing to light problems such as inaccurate publication dates and OCR errors. The concerns Darnton raises thus come to life on students’ screens, revealing some of the errors and fallibilities that underlie what might initially seem like the trustworthy infrastructure of Google. Crane is another useful reading here, as he goes into more detail on the types of noise found in massive digital libraries, which for example make books in historical languages like classical Greek essentially unsearchable (Crane 2006). Talking through how and why these problems occur is part of what the course seeks to teach about data literacy: when students encounter oddities such as unexpected dips or bumps in an Ngram plot, they are encouraged to return to the source—the scans of individual books—to investigate the possible causes.1

One type of oddity students encounter in their exploratory play with Ngrams is plots that shift between jagged and smooth, and these provide opportunities for statistical literacy. A simple plot such as the one in Figure 1 for a search for “colour” and “color” over the past four hundred years gives no explicit indication of the smaller amount of data (i.e., the number of published books) available from earlier centuries. Students can be shown that the blockiness of the lines pre-1850 compared with their relative smoothness post-1850 hints at how many more books were published in the second half of this historical period and, correspondingly, the greater reliability of the data from that period. This can lead to further discussion about what the smoothing function in the Ngrams interface might obscure or distort.

[Line graph: relative frequencies of “color” (blue) and “colour” (red), 1600–2022; the lines are jagged before 1850 and markedly smoother afterward.]
Figure 1. Search for “color” and “colour” in Google Ngram Viewer.
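The smoothing parameter itself rewards a closer look: as the Ngram Viewer describes it, a smoothing value of n replaces each year's raw figure with the mean of a window extending n years on either side. A short sketch makes the trade-off concrete, showing how smoothing does not remove a spurious spike of the kind metadata errors produce, but smears it across neighboring years into a plausible-looking bump:

```python
# A sketch of centered moving-average smoothing, the approach the Ngram
# Viewer describes: a smoothing value of n replaces each year's figure
# with the mean of a window reaching n years either side (truncated at
# the edges of the date range).
def smooth(values, n):
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - n): i + n + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed

# A lone one-year spike, as a bad publication date might produce:
frequencies = [0.0] * 10 + [1.0] + [0.0] * 10
print([round(v, 3) for v in smooth(frequencies, 3)])
# The spike survives, spread across a seven-year window.
```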

Another problem that students notice, as they try to make sense of the plots, is unexpected shapes that, on closer inspection, turn out to be metadata errors. As of this writing, a search for “Smashwords”—the publisher and distributor first launched as an ebook publishing platform in 2008—shows a bump in publication dates in the 1940s (Figure 2). Following through to the scanned books reveals numerous metadata errors in Smashwords books, including incorrect publication dates (Figure 3).

[Line graph: frequency of “smashwords,” 1800–2020; alongside the expected sharp rise after 2008, an anomalous bump appears in the 1940s.]
Figure 2. Search for “smashwords” in Google Ngram Viewer.
[Two Google Books screenshots: a scanned page from The Economist dated 1843, and a cover page for Online Passive Income, each with an impossible publication date.]
Figure 3. Two publications which appear in the results for “smashwords,” with erroneous publication dates (1843 and 1900).

Searches for “ebook” and “e-book,” meanwhile, show a spike in publication dates around 1900, thus demonstrating not only that caution is needed with metadata in Google Books data, but also that publication dates may be estimated or rounded to the nearest decade or century in ways that produce anomalies, as seen in Figure 4.

[Line graph: “ebook” (red) and “e-book” (blue), 1800–1980; both terms spike around 1900, decades before ebooks existed.]
Figure 4. The result of searches for “ebook” and “e-book” in Google Ngram Viewer. The date range has been constrained to 1800–1980 to make it easier to see the rise in results around 1900.

OCR errors also demonstrate to students the prescience of Darnton’s concerns that Google’s bibliographic capacities would prove somewhat lacking compared to those of professional librarians. Running a search for the common OCR error “tlie” from 1600 to 2022 shows how books in the earlier half of this period have considerably poorer OCR (Figure 5). Clicking through to view page scans of those books (Figure 6) illustrates the more uneven, blotchier printing, as well as features such as ligatures and the medial s, which are not as accurately classified by OCR engines trained on modern typography.

[Line graph: frequency of “tlie,” a common OCR misreading of “the,” 1600–2022; rates are far higher in the seventeenth and eighteenth centuries, declining sharply after 1850.]
Figure 5. Result for a search for “tlie” in Google Ngram Viewer.2

[Screenshot: part of a scanned page from George Wilkins’s The Miseries of Inforst Mariage in Google Books.]
Figure 6. Scan of a page from George Wilkins’s The Miseries of Inforst Mariage, published in London by George Vincent in 1607.

[Screenshot: Google Books metadata page for the Wilkins scan, erroneously titled Old English Drama.]
Figure 7. Google’s metadata page for the scanned Wilkins text in Figure 6, showing an inaccurate title, publisher, and original publication date.

Clicking through from a scanned book that contains the “tlie” OCR error to Google’s metadata page reveals further errors. The title is given as Old English Drama: Students’ Facsimile Edition · Volume 100 (when it should be The Miseries of Inforst Mariage), the publisher as John S. Farmer (when it should be George Vincent), and the original publication date as 1598 (when it should be 1607). The correct publication metadata is easily ascertained for this book by digitally flipping through the scanned pages, but for books whose full page scans are unavailable, checking the accuracy of the metadata is much more laborious.

As with search, OCR provides an example of a technological advance whose operation has become so frictionless that it can be hard for students to see it as a process with a history. As Ryan Cordell points out, treating OCR as an automatic process elides a substantial body of research by computer scientists: “[w]hile OCR certainly automates certain acts of transcription, it does so following constantly-developing rules and processes devised by human scholars” (2016). The human labor and expertise behind OCR were made vividly visible to students for several years after 2019, when I was joined by a coteacher, Dr. Bea Alex, a computer scientist and linguist who gave students firsthand insights into her work on historical OCR for the project Plague Dot Text (Casey et al. 2021). As Alex’s account of her work on this project made clear, actions that might appear to an end user to be well aligned with minimal computing principles—for instance uploading a scanned image file of text into Google Drive and having the OCR immediately returned—are the result of processes that are energy- and compute-intensive and that required the deep expertise of many computer scientists for their development. While it is difficult for any user of digital technology to have their hands entirely clean in this respect, focusing students’ attention on the extent to which ostensibly minimal computational processes often obscure their more resource-intensive aspects is a valuable component of both minimal and “maximal” DH teaching.

These tasks could be seen as having drifted away from the territory of minimal computing toward that of book history and science and technology studies (STS). However, this material provides essential background knowledge for students coming from literary and historical disciplines, before they build their own corpus. It is also a part of the course that is well future-proofed against change: the history of Google’s digitization project and the still-visible evidence of its errors will “stand still” in a course in which much else needs to be updated year after year.

Digital and Social Infrastructures for Coconstructing a Corpus

Having gotten some insights into how the corpus sausage is made, students now work together to construct a corpus of their own, the analysis of which will later form the core of their final project. As students do not have access to a digitization suite or a scanner, I supply each with several dozen image files of pages that they are individually responsible for OCRing, proofreading, and then saving—with the correct metadata—to a secure online repository. Performing OCR used to require applications such as Adobe Acrobat or ABBYY Finereader, but with the advent of reliable machine learning-powered text recognition on some mobile phones from 2021 onward, most students are now able to do OCR with devices that came into the classroom in their pockets.3 On phones where this is not an inbuilt feature, scanning apps can perform the same function, and students without a smartphone can also upload PDFs to Google Docs and use its built-in OCR. On the one hand, these tools represent a considerable efficiency gain for boutique classroom OCR projects and a welcome workaround in place of expensive proprietary software. As Risam and Gil observe, working from the principle of using what we have encourages students “to focus on the assets available to them and thus resists a deficit mindset for those of us who are working under constraints” (2022). However, this part of students’ workflow also exemplifies one of the points of compromise for minimal computing principles: using tools from big technology companies, the development of which uses considerable resources and may, in addition, be underpinned by unethical and harmful practices.4 Acknowledging the difficulties of extricating oneself from the big tech companies, Risam and Gil argue that a complete divorce from maximal forms of computation is currently impossible, and so dependence on big tech and social media companies’ infrastructure will remain in place for the foreseeable future. This enmeshment with the products of big tech in our everyday lives does, however, offer opportunities for class discussion: later in the semester students read Kashmir Hill’s (2019) sobering account of how difficult she found it to divest herself of the products of the five largest tech companies, both in terms of her work as a technology journalist and the practicalities of everyday life. If those who teach digital humanities cannot—at least for the moment—avoid using digital infrastructures whose low costs, reliability, and ease of use make them a logical choice where time is limited and students’ access to the latest technology is unevenly distributed, this tension can at least be used to prompt critical reflection on the compromises involved in using such technologies.
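For classes that want an entirely open-source route, the Tesseract OCR engine can also be driven from a short script. A minimal sketch, assuming Tesseract itself plus the pytesseract and Pillow packages are installed, and with the directory names purely illustrative:

```python
# Minimal open-source OCR sketch using the Tesseract engine via the
# pytesseract wrapper. Directory names here are illustrative only.
from pathlib import Path

import pytesseract
from PIL import Image

Path("ocr_raw").mkdir(exist_ok=True)
for image_path in sorted(Path("page_scans").glob("*.png")):
    text = pytesseract.image_to_string(Image.open(image_path))
    # Keep the raw, uncorrected output in its own directory; correction
    # and standardization happen later, as separate human steps.
    (Path("ocr_raw") / f"{image_path.stem}.txt").write_text(text, encoding="utf-8")
```

As with the tools discussed above, accuracy on clean modern print is good but markedly worse on historical typography.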

Once students have run OCR on their images and pasted the resulting text into a text file, they need to correct it, a task that some assume will involve little more than proofreading. As we undertake it in class, however, students begin to raise questions: “Should I put two hard returns between paragraphs, or indent them?”; “How are people representing the start of a new chapter?”; “Should I put spaces between each of the three dots in an ellipsis mark?” These might appear to be trivial formatting matters, but—as the students learn later when they begin querying the corpus with a concordancer—decisions about seemingly unimportant characters such as periods can have consequences when going beyond simple text searches to queries using wildcards and regular expressions. When I hear these kinds of consequential formatting problems coming to light in students’ discussions, I have taken to letting a few slip through the net, so that students discover for themselves at a later stage the importance of standardization across multiple contributions, and learn how to go back through the finished corpus to address those inconsistencies retroactively. Thus, though this task appears to be an individual one, it inducts students into what might be the most crucial skill of all for digital humanists to develop: working collaboratively with others. Used to writing essays on their own, or contributing in atomized ways to group projects, students learn that they need to check in with their fellow OCR correctors, keep a record of group decisions, and adapt their own practice accordingly.
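Once a group has agreed on its conventions, they can also be encoded so that they are applied uniformly, and reapplied retroactively when an inconsistency slips through the net. A sketch of such a normalization pass; the specific rules here are illustrative, not the ones any given class should adopt:

```python
import re

def normalize(text):
    """Apply a group's agreed formatting conventions to corrected OCR
    text. The rules below are illustrative examples only."""
    # Spaced ellipses (". . .") become a bare three-dot ellipsis.
    text = re.sub(r"\.(?:\s+\.){2,}", "...", text)
    # Paragraphs are separated by exactly one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Curly quotes, which vary across OCR engines, become straight quotes.
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    return text

print(normalize("He paused . . . and\n\n\n\nleft."))
```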

Beyond the exigencies of standardization and the importance of consulting one’s collaborators to develop standards everyone can agree on, there are other obligations owed to one’s groupmates. As the task progresses, some students complete their OCR correction earlier than others and want to move on to the analysis, but no one can advance until the whole corpus is ready. Students thus learn how important it is to meet deadlines and to carry out their assigned tasks so as not to hold up the rest of the group. Here, project management tools can be useful, especially for accountability.5 As students finish their files and need to store them somewhere accessible and secure, emerging questions such as “Who OCRed this section?” and “Where can I find the latest version of this file?” reveal the importance of file-naming conventions, logical directory structures, accurate metadata recording, and version control.6 These considerations—which tend to be new to the humanities students in the class—show how data infrastructures are essential to the functioning of even the most minimal of digital projects. These data management lessons hold across operating systems and applications, and they are transferable: students report carrying them into their other classes and their capstone dissertation projects. Moreover, having to confer on file-naming conventions, directory structures, and so on is another way of putting the friction back into what would otherwise be largely frictionless processes. While I take care not to make the volume of work too onerous—choosing the length of the text to be digitized based on the number of students signed up for the class—one of my aims is for students to appreciate that, done properly, the creation of digital artifacts is laborious. Time, human labor, material resources, and other considerations7 must all be weighed when assessing how and by whom a corpus has been built, labor that goes well beyond the authoring of the texts that constitute it.
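A file-naming convention only works if everyone follows it, and checking compliance by hand quickly becomes tedious; this, too, can be automated. A sketch, assuming a hypothetical convention of author_shorttitle_ppNNN-NNN_initials.txt (e.g., wilkins_miseries_pp001-020_al.txt); the pattern is illustrative, not a recommendation:

```python
import re
from pathlib import Path

# Hypothetical convention: author_shorttitle_ppNNN-NNN_initials.txt
PATTERN = re.compile(r"[a-z]+_[a-z]+_pp\d{3}-\d{3}_[a-z]{2,3}\.txt")

for path in sorted(Path("corpus_corrected").glob("*.txt")):
    if not PATTERN.fullmatch(path.name):
        print(f"Nonconforming file name: {path.name}")
```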

Close Reading Through Data Cleaning

The idea that “the best way to get to know your data is by cleaning it” is likely a familiar one to anyone who has built a corpus from scratch. Accordingly, a benefit that flows from students doing OCR correction and proofreading is that they get to know their allocated section of the corpus well. Inspecting digitized text carefully for errors can function as a form of close reading, something which is useful preparation for the analytical work that follows. For instance, a student might spot something of interest in their allocated section and develop it into a query to be applied to the corpus as a whole. Care needs to be taken not just with cleaning and curation decisions, however, but also with the notion of cleaning as a process separate from, and subordinate to, the analysis. As Katie Rawson and Trevor Muñoz point out, the paradigms and practices that are gathered under the heading of “data cleaning” too often go unspecified in humanities work, and explicitly articulating these is important if one wants to work with data “without risking the foreclosure of specific and valuable humanistic modes of producing knowledge” (2019, 280). The work of data preparation itself incorporates cultural critical practices, they point out, and separating these from the work of analysis risks reinscribing the binary between cultural criticism and data analysis.

Rawson and Muñoz (2019, 282) illustrate this point via a case study of their efforts to “clean” a data set of food items listed on digitized historical menus, where variant spellings and orthographic conventions led them to seek ways to standardize labels for dishes, and in the process to discover that the data model around which the data had originally been organized was different from the one they needed to answer their research questions. Thinking about a data model is a crucial part of the intellectual work of analyzing and representing data, but asking students to conceptualize a data model’s relationship to data preparation decisions, research questions, and, eventually, the argument they want to make about their corpus can be forbiddingly abstract. However, if data modeling is a bridge too far for students new to the field, then preparing a corpus at least puts them in a position to see how decisions made at the level of cleaning (e.g., choosing to standardize variant spellings) and structuring (e.g., adding part-of-speech tags) will have effects on downstream queries, such as the ability to extract toponyms along with their collocating prepositions in order to map these as part of an argument about a corpus’s spatial imaginary.
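One way to make these downstream effects concrete for students is a small standardization pass: once variant spellings are mapped onto a standard form, a single query pattern finds them all. A sketch, with a variant table that is purely illustrative; in practice a class would build one from the variants contributors actually encounter while proofreading:

```python
import re

# Illustrative variant table; a real one would be agreed collaboratively.
VARIANTS = {"colour": "color", "to-day": "today", "connexion": "connection"}

def standardize(text):
    """Map agreed variant spellings onto a standard form so that
    downstream queries need only one pattern per word."""
    for variant, standard in VARIANTS.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", standard, text,
                      flags=re.IGNORECASE)
    return text

print(standardize("The colour of it, even to-day."))
# -> The color of it, even today.
```

Keeping the uncorrected files in a separate directory, as described below, means such decisions remain reversible.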

Once students have their corpus—with uncorrected OCR files in one directory, corrected OCR files in another, and meaningful file names consistently assigned—they get to experience the satisfaction of knowing that they have, unusually for a course within a literature program, built something tangible. Much can be done with that thing, analytically and pedagogically, but even before any of that happens, something else has been achieved: a nascent community of practice. This is especially valuable in the face of the widely reported post-COVID decline in student engagement observed at universities across the globe (see, for example, Grove 2024; McMurtrie 2022; and Otte 2024). Meaningful collaborative activities like this—where no one can query the corpus unless everyone pulls their weight in building it—are a way of keeping students engaged with a course and accountable to each other. Scaffolding students’ ability to work collaboratively in what remains a resolutely individualist discipline—English literature—and in the specific context of the UK university system, whose relatively low contact hours militate against group projects, is not only valuable in itself (as illustrated by Croxall and Jakacki 2023; Ermolaev et al. 2024; and Kim 2024, among others), but is also well aligned with the ethos of minimal computing, given that these are skills for which no computing is required at all.

The analytical opportunities presented by a hand-curated corpus can also be explored in minimal ways. In class, we use the concordancer AntConc (Anthony 2022), an application which is free to download, supported on multiple platforms, relatively lightweight, and which has robust documentation. Students build on what they learned earlier about search syntax and Boolean expressions with the Ngram Viewer and take their searching up a level, for instance by specifying their own operators, which AntConc allows users to customize if they want to override its defaults. This introduces them to thinking about text searching at a more complex level than keywords or phrases, which can be connected to wider literary and theoretical ideas they have encountered in other parts of their degree. My go-to example is to have students compose a search string to investigate gendered pronouns before and after particular verbs, and then to point out that, for the verb to look, the greater prevalence of he looks/looked/is looking at her over she looks/looked/is looking at him is an illustration of the male gaze (see Mulvey 1975). Pursuing investigations of their own among the keyword in context (KWIC) lines and collocation tables that AntConc allows them to generate, students can then take these insights and incorporate them into an argument, which they will present as part of their final project for the course.
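For students curious about what such a query looks like outside a concordancer, the same search can be phrased as a regular expression in a few lines of Python; the verb forms matched, and the corpus file name, are illustrative:

```python
import re
from pathlib import Path

corpus_text = Path("corpus.txt").read_text(encoding="utf-8")  # illustrative path

# Count gaze constructions in each direction; the verb forms here are
# illustrative rather than exhaustive.
gaze = r"\b{subj} (?:looks|looked|is looking) at {obj}\b"
he_her = re.findall(gaze.format(subj="he", obj="her"), corpus_text, re.IGNORECASE)
she_him = re.findall(gaze.format(subj="she", obj="him"), corpus_text, re.IGNORECASE)

print(f"he -> her: {len(he_her)}, she -> him: {len(she_him)}")
```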

These text analysis techniques are thus at a small enough scale to be manageable on a laptop. Importantly for a course located within an English literature degree, the quantitative analysis is done in conjunction with close reading: we think about how to move between the two modes, and how readings at scale might be informed by, and integrated into, scholarly articles on the primary texts. Rather than succumbing to the move toward ever-bigger data, the aim is for students to begin to see the value of producing and working analytically with structured data. Even a basic level of structuring and a small dataset can be useful. Alphabetizing KWIC lines delivered by a concordancer can speed up the process of identifying variants in place names, for instance; this is more feasible than writing scripts to perform named entity recognition, which requires a level of coding expertise that is difficult to achieve in a single semester if students have no programming experience.
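Even the concordance step itself can be reproduced in a short script, which helps demystify what a tool like AntConc is doing under the hood. A sketch that builds KWIC lines and sorts them by right-hand context, so that variants following the keyword cluster together; the file name is illustrative:

```python
import re
from pathlib import Path

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines sorted by right-hand context,
    so that variants following the keyword cluster together."""
    text = re.sub(r"\s+", " ", text)  # flatten line breaks for display
    lines = []
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - width): m.start()].rjust(width)
        right = text[m.end(): m.end() + width]
        lines.append((right.lower(), f"{left} [{m.group(0)}] {right}"))
    return [line for _, line in sorted(lines)]

text = Path("corpus.txt").read_text(encoding="utf-8")  # illustrative path
for line in kwic(text, "to"):  # e.g., a preposition preceding place names
    print(line)
```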

Conclusion: Building the Foundations for AI Literacy

The more seamlessly processes such as OCR happen behind the scenes, the more important it is to teach students that they are activities which require labor, time, and resources, which were built in particular ways in specific times and places, and which will therefore have cultural, linguistic, and other biases embedded within them which reflect the specificities of their construction. Students’ experience of Google is that its suite of products manages emails, calendars, photos, documents, and much else seamlessly, but as Alison Booth and Miriam Posner put it, the errors and glitches in the scanned pages of books it has digitized serve “to peel back the glossy outer layer of Google Books to reveal an enterprise far from omnipotent” (2020, 15). Such reflections might be old news to those who have been working in digital humanities for some time, but are likely not so obvious to those whose habits of technological use have been formed in an era where IT infrastructure is, for the most part, the wallpaper of daily life: simply there, usually functional, and not something to be interrogated. Prompting such interrogation by pointing students to OCR errors—and asking them to do OCR themselves—would ideally go beyond following the principles of minimal computing to a recognition of the Anglophone and US biases that so frequently underlie digital infrastructures: coding languages whose putative human readability extends only to those who read English and in which American English is the default, applications that break when nonstandard characters such as accented letters are used, the dominance of left-to-right languages, and more.

To return to the bigger picture of AI literacy: this is a forbiddingly large thing to take on in the classroom, not only because LLMs are generally perceived as black boxes, but also because engaging critically and mindfully with them requires knowledge that extends across disciplinary boundaries. As a first step, however, understanding how textual corpora are assembled, and what is at stake when they are, can move us toward a better grasp of how the corpora on which LLMs are trained may be partial or flawed, and of the materiality and labor underpinning the otherwise ethereal experience of conversing with an AI chatbot. With a single text or a small number of texts, students do not directly address for themselves the question of selection and what is excluded, underrepresented, or overrepresented in a corpus, but trying out a tool like the Ngram Viewer and reading scholarship on corpus selection can help to show them the implications of those choices at a larger scale.

In this way, even a beginner-level digital humanities class can provide students with the opportunity to start to unfold some of the sociotechnical assemblages that constitute what is commonly referred to under the sign of “AI,” and that have produced the corpora on which generative AI and other algorithmic technologies are based. The STS scholar Lucy Suchman (2023) underscores the importance of this work of unfolding, urging us to talk about AI not as a static, singular thing, but rather in terms of the processes, material histories, actors, and other components that work to constitute it. The activities I have described above relate to only a few of the many practices and processes that need to be examined for this ongoing work of conceptual dereification to proceed, but they are one place to start with AI literacy, especially in a context where tech companies are not forthcoming about the specific texts that they have ingested into their training corpora. Coupling minimal computing principles to literary- and book history-aligned instantiations of DH in order to put some of the friction back into corpus-building is one thing we can do to better equip students to see through the rhetorical ferment of AI hype which drives the idea that bigger, better, and faster is the only way forward, and to resist technocapitalist imperatives to generate ever larger profits, consume ever greater amounts of energy and water, and naturalize the idea that computing needs to involve the disproportionate consumption of resources.

Notes

  1. It was previously possible to click through from the Ngram Viewer to Google Books to find all—or at least some—of the instances of a search term in the scanned books. As of this writing, however, the number of books returned has become much smaller, and the availability of full-text scans scarcer still.

  2. Ben Schmidt identifies the reason for the spike around 2008 visible in Figure 5: “‘tlie’ used to be a common OCR misreading of ‘the,’ but Google's OCR no longer includes it. BUT there was a boom in digital reprints of classic books that echoed the first scanning projects, so now ‘tlie’ spikes in 2008 because of shitty first-generation digital samizdat” (Schmidt 2024).

  3. On an iPhone XS, XR, or later, running iOS 15 or above, this feature is known as Live Text. A user opens the camera and points it at some text; after a few moments a text selection icon appears, and the text can then be selected, copied, and pasted into another application like a notes app or email, without the need to take a photo for the text recognition mechanism to be activated.

  4. One example is the ImageNet dataset, which was important to the development of image classification models, continues to be used for benchmarking, and contains racist and offensive labels (see, for example, Crawford and Paglen 2021).

  5. Tools such as Trello often have a free or educational tier that serves small-scale projects well. More minimally, a simple online document listing who is assigned to do what task by what time will also work.

  6. Kristin Briney’s “File Naming Convention Worksheet” and other resources available at http://dataabinitio.com/?p=976 are good options for teaching file-naming conventions.

  7. These include questions of copyright, open access, and fair use, which I do not have the space to cover fully here. As these topics have become more mainstream within public discussions of generative AI, they have correspondingly come to play a more prominent role in class discussions. Students do not, however, have to make practical decisions in this respect, as our class corpus is not publicly accessible. Students go on to produce analyses that appear on the open web and which are supported by examples, but these are brief enough to constitute fair use.

References

Anthony, Laurence. 2022. AntConc. V. 4.2.0. Released December 25. http://www.laurenceanthony.net/software/antconc/.

Biber, Douglas. 1993. “Representativeness in Corpus Design.” Literary and Linguistic Computing 8 (4): 243–57. https://doi.org/10.1093/llc/8.4.243.

Booth, Alison, and Miriam Posner. 2020. “Introduction: The Materials at Hand.” PMLA 135 (1): 9–22. https://doi.org/10.1632/pmla.2020.135.1.9.

Casey, Arlene, Mike Bennett, Richard Tobin, Claire Grover, Iona Walker, Lukas Engelmann, and Beatrice Alex. 2021. “Plague Dot Text: Text Mining and Annotation of Outbreak Reports of the Third Plague Pandemic (1894–1952).” Journal of Data Mining & Digital Humanities. https://doi.org/10.46298/jdmdh.6071.

Cohen, Dan. 2010. “Initial Thoughts on the Google Books Ngram Viewer and Datasets.” Dan Cohen (blog). December 19. https://dancohen.org/2010/12/19/initial-thoughts-on-the-google-books-ngram-viewer-and-datasets/.

Cordell, Ryan. 2016. “‘Q i-Jtb the Raven’: Taking Dirty OCR Seriously.” Ryan C. Cordell (blog). January 7. https://ryancordell.org/research/qijtb-the-raven-mla/.

Crane, Gregory. 2006. “What Do You Do with a Million Books?” D-Lib Magazine 12 (3). http://www.dlib.org/dlib/march06/crane/03crane.html.

Crawford, Kate, and Trevor Paglen. 2021. “Excavating AI: The Politics of Images in Machine Learning Training Sets.” AI & Society 36 (4): 1105–16. https://doi.org/10.1007/s00146-021-01162-8.

Croxall, Brian, and Diane K. Jakacki. 2023. “What Is Digital Humanities and What’s It Doing in the Classroom?” In What We Teach When We Teach DH, edited by Brian Croxall and Diane K. Jakacki. University of Minnesota Press.

Darnton, Robert. 2008. “The Library in the New Age.” New York Review of Books, June 12. http://www.nybooks.com/articles/21514.

———. 2009. “Google and the Future of Books.” New York Review of Books, February 12. http://www.nybooks.com/articles/22281.

———. 2014. “A World Digital Library Is Coming True!” New York Review of Books, May 22. http://www.nybooks.com/articles/archives/2014/may/22/world-digital-library-coming-true/.

Dimock, Wai Chee. 2020. “AI and the Humanities.” PMLA 135 (3): 449–54. https://doi.org/10.1632/pmla.2020.135.3.449.

Ermolaev, Natalia, Rebecca Munson, and Meredith Martin. 2024. “Graduate Students and Project Management: A Humanities Perspective.” In Digital Futures of Graduate Study in the Humanities, edited by Gabriel Hankins, Anouk Lang, and Simon Appleford. University of Minnesota Press.

Eve, Martin Paul. 2022. “Lessons from the Library: Extreme Minimalist Scaling at Pirate Ebook Platforms.” Digital Humanities Quarterly 16 (2). https://www.digitalhumanities.org/dhq/vol/16/2/000587/000587.html.

Grove, Jack. 2024. “Lectures in Question as Paid Work Pushes Attendance Even Lower.” Times Higher Education, March 14. https://www.timeshighereducation.com/news/lectures-question-paid-work-pushes-attendance-even-lower.

Hardie, Andrew, and Tony McEnery, eds. 2011. Corpus Linguistics: Method, Theory and Practice. Cambridge University Press. https://doi.org/10.1017/CBO9780511981395.002.

Heather, Julian, and Marie Helt. 2012. “Evaluating Corpus Literacy Training for Pre-Service Language Teachers: Six Case Studies.” Journal of Technology and Teacher Education 20 (4): 415–40. https://www.learntechlib.org/primary/p/39324/.

Hill, Kashmir. 2019. “Life Without the Tech Giants.” Gizmodo, January 22. https://gizmodo.com/life-without-the-tech-giants-1830258056.

Kgomo, Sonia. 2025. “I Was a Content Moderator for Facebook. I Saw the Real Cost of Outsourcing Digital Labour.” The Guardian, February 12. https://www.theguardian.com/commentisfree/2025/feb/12/moderator-facebook-real-cost-outsourcing-digital-labour.

Kim, Hoyeol. 2024. “Challenges of Collaboration: Pursuing Computational Research in a Humanities Graduate Program.” In Digital Futures of Graduate Study in the Humanities, edited by Gabriel Hankins, Anouk Lang, and Simon Appleford. University of Minnesota Press.

Liu, Alan. 2022. “Google Books Ngram Viewer Cheat Sheet.” Alanliu.org, September 26. https://alanyliu.org/course-materials/google-books-ngram-viewer-cheat-sheet/.

McMillan Cottom, Tressie. 2016. “More Scale, More Questions: Observations from Sociology.” In Debates in the Digital Humanities 2016, edited by Matthew K. Gold and Lauren F. Klein. University of Minnesota Press.

McMurtrie, Beth. 2022. “A ‘Stunning’ Level of Student Disconnection.” The Chronicle of Higher Education, April 5. https://www.chronicle.com/article/a-stunning-level-of-student-disconnection.

Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, et al. 2011. “Quantitative Analysis of Culture Using Millions of Digitized Books.” Science 331 (6014). https://doi.org/10.1126/science.1199644.

Mollick, Ethan. 2024. Co-Intelligence: Living and Working with AI. WH Allen.

Mulvey, Laura. 1975. “Visual Pleasure and Narrative Cinema.” Screen 16 (3): 6–18.

Narayanan, Arvind, and Sayash Kapoor. 2024. AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference. Princeton University Press.

Otte, Jedidajah. 2024. “‘I See Little Point’: UK University Students on Why Attendance Has Plummeted.” The Guardian, May 28. https://www.theguardian.com/education/article/2024/may/28/i-see-little-point-uk-university-students-on-why-attendance-has-plummeted.

Raley, Rita, and Jennifer Rhee. 2023. “Critical AI: A Field in Formation.” American Literature 95 (2): 185–204. https://doi.org/10.1215/00029831-10575021.

Rawson, Katie, and Trevor Muñoz. 2019. “Against Cleaning.” In Debates in the Digital Humanities 2019, edited by Matthew K. Gold and Lauren F. Klein. University of Minnesota Press.

Risam, Roopika, and Alex Gil. 2022. “Introduction: The Questions of Minimal Computing.” Digital Humanities Quarterly 16 (2). https://www.digitalhumanities.org/dhq/vol/16/2/000646/000646.html.

Rowe, Niamh. 2023. “‘It’s Destroyed Me Completely’: Kenyan Moderators Decry Toll of Training of AI Models.” The Guardian, August 2. https://www.theguardian.com/technology/2023/aug/02/ai-chatbot-training-human-toll-content-moderator-meta-openai.

Sayers, Jentery. 2016. “Minimal Definitions.” Minimal Computing: A Working Group of GO::DH, October 2. http://go-dh.github.io/mincomp/thoughts/2016/10/02/minimal-definitions/.

Schmidt, Ben (@bschmidt.bsky.social). 2024. “Oh wow this one is amazing. 'tlie' used to be a common OCR misreading of 'the,' but Google's OCR no longer includes it. BUT there was a boom in digital reprints of classic books that echoed the first scanning projects, so now 'tlie' spikes in 2008 …” Bluesky, December 11. https://bsky.app/profile/bschmidt.bsky.social/post/3lcyqtghgvc2x.

Schwartz, Roy, Jesse Dodge, Noah A. Smith, and Oren Etzioni. 2019. “Green AI.” arXiv, August 13. https://doi.org/10.48550/arXiv.1907.10597.

Suchman, Lucy. 2023. “The Uncontroversial ‘Thingness’ of AI.” Big Data & Society 10 (2): 1–5. https://doi.org/10.1177/20539517231206794.

Tognini-Bonelli, Elena. 2001. Corpus Linguistics at Work. John Benjamins.

Willison, Simon. 2024. “Things We Learned About LLMs in 2024.” Simon Willison’s Weblog, December 31. https://simonwillison.net/2024/Dec/31/llms-in-2024/.

About the Author

Anouk Lang is Senior Lecturer in Digital Humanities in the Department of English and Scottish Literature and an affiliate of the Edinburgh Futures Institute at the University of Edinburgh. She works on critical AI, critical making, digital mapping, and the application of digital humanities methods to twentieth-century literature and culture. She is the editor of From Codex to Hypertext: Reading at the Turn of the Twenty-First Century, co-editor (with Gabriel Hankins and Simon Appleford) of Digital Futures of Graduate Study in the Humanities, and co-editor (with Ian Henderson) of Patrick White Beyond the Grave: New Critical Perspectives.


This entry is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.
