Quantitative Formalism: An Experiment
This paper is the report of a study conducted by five people—four at Stanford and one at the University of Wisconsin—which tried to establish whether computer-generated algorithms could “recognize” literary genres. You take David Copperfield, run it through a program without any human input—“unsupervised,” as the expression goes—and . . . can the program figure out whether it’s a gothic novel or a Bildungsroman? The answer is, fundamentally, yes: but a yes with so many complications that it is necessary to look at the entire process of our study. These are new methods we are using, and with new methods the process is almost as important as the results.
1. Prologue: Docuscope Reads Shakespeare
During the fall of 2008, Franco Moretti was visiting Madison, where Michael Witmore introduced him to work he and Jonathan Hope had been doing on Shakespeare’s dramatic genres, using a text-tagging device known as Docuscope, a hand-curated corpus of several million English words (and strings of words) that had been sorted into grammatical, semantic, and rhetorical categories. (See Jonathan Hope and Michael Witmore, “The Very Large Textual Object: A Prosthetic Reading of Shakespeare,” Early Modern Literary Studies 9.3 (January 2004): 6.1–36; Hope and Witmore, “Shakespeare by the Numbers: On the Linguistic Texture of the Late Plays” in Early Modern Tragicomedy, eds. Subha Mukherji and Raphael Lyne (London: Boydell and Brewer, 2007): 133–53; Hope and Witmore, “The Hundredth Psalm to the Tune of ‘Greensleeves’: Digital Approaches to Shakespeare’s Language of Genre,” Shakespeare Quarterly 61.3, “Special Issue: New Media Approaches to Shakespeare,” ed. Katherine Rowe (Fall 2010): 357–90; and Witmore’s blog, www.winedarksea.org.)
Docuscope is essentially a smart dictionary: it consists of a list of more than 200 million possible strings of English, each assigned to one of 101 functional linguistic categories called Language Action Types (LATs).[1] When Docuscope “reads” a text, it looks for words, and strings of words, that it can “recognize” (that is to say, that it can match to one of its 101 LATs). Each time this happens, the associated LAT is credited with one appearance. For example, since Docuscope assigns “I” and “me” to the LAT FirstPerson, each occurrence of either word in a text is recorded as one appearance of FirstPerson.[2]
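The matching-and-counting procedure just described can be sketched in a few lines of Python. This is an illustration only, not Docuscope's actual code. The dictionary is a hypothetical five-entry stand-in for the real one: the assignment of “I” and “me” to FirstPerson comes from the text, while the remaining entries, the tokenizer, and the rule of preferring the longest matching string are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical miniature dictionary: a string of words -> the name of its LAT.
# Only the FirstPerson entries are taken from the text; the rest are invented.
TOY_DICTIONARY = {
    ("i",): "FirstPerson",
    ("me",): "FirstPerson",
    ("you",): "DirectAddress",
    ("perhaps",): "Uncertainty",
    ("i", "am", "not", "sure"): "Uncertainty",
}
MAX_STRING_LENGTH = max(len(key) for key in TOY_DICTIONARY)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_lats(text):
    """Credit a LAT with one appearance for every string matched in the text."""
    tokens = tokenize(text)
    counts = Counter()
    position = 0
    while position < len(tokens):
        longest = min(MAX_STRING_LENGTH, len(tokens) - position)
        # Assumption: try the longest candidate string first.
        for length in range(longest, 0, -1):
            candidate = tuple(tokens[position:position + length])
            lat = TOY_DICTIONARY.get(candidate)
            if lat is not None:
                counts[lat] += 1
                position += length
                break
        else:
            # No entry starts here: the word goes unrecognized.
            position += 1
    return counts


print(count_lats("I am not sure you believe me."))
```

In this toy run the opening “I” is absorbed into the longer string “I am not sure,” so FirstPerson is credited only once (for “me”), alongside one appearance each of Uncertainty and DirectAddress. Whether the real program resolves overlapping strings this way is not stated in the text.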
Based on these counts, Hope and Witmore used unsupervised factor analysis (a factor, here, being a pattern that includes some categories, in variable proportions, and excludes others) to create portraits of received genre distinctions such as those made by the editors of the First Folio (Heminges and Condell), and of the genre of “late romances” that was first identified in the 19th century. Multivariate analyses and clustering techniques produced groupings of the plays that not only corresponded to conventional genre groupings but also picked out texts that critics had identified as outliers.[3] Thus, in clustering Shakespeare’s Folio plays, the program managed to take Henry VIII out of the History plays cluster and place it near other “late plays,” a readjustment from the initial Folio designations that later critics have advocated as well. One can see this grouping pattern in figure 1 below, taken from an early complete linkage clustering of the plays.
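For readers unfamiliar with complete linkage clustering, the following minimal Python sketch shows the general technique on LAT-style frequency profiles. It is not Hope and Witmore's procedure: the factor-analysis step is omitted, Euclidean distance is an assumption, and the four texts and their numbers are invented purely for illustration.

```python
import math

# Hypothetical profiles: relative frequencies of three LATs
# (FirstPerson, DirectAddress, Motions) in four invented texts.
TOY_PROFILES = {
    "text_a": (0.050, 0.040, 0.005),
    "text_b": (0.048, 0.043, 0.006),
    "text_c": (0.020, 0.015, 0.030),
    "text_d": (0.022, 0.013, 0.028),
}


def euclidean(u, v):
    """Straight-line distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def complete_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Under complete linkage, the distance between two clusters is the
    distance between their two most dissimilar members.
    """
    clusters = [[name] for name in profiles]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                distance = max(
                    euclidean(profiles[a], profiles[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(cluster) for cluster in clusters]


print(complete_linkage(TOY_PROFILES, 2))
```

Here the two texts rich in pronouns end up in one cluster and the two rich in motion words in the other. The grouping is produced without any genre labels being supplied, which is the sense in which the method is unsupervised.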
After seeing these results, Moretti asked Witmore whether he would consider clustering novelistic genres. Witmore agreed, and a meeting was planned for February 2009 at Stanford.
[1] For Docuscope, see David Kaufer, Suguru Ishizaki, Brian Butler and Jeff Collins, The Power of Words: Unveiling the Speaker and Writer’s Hidden Craft (New Jersey and London: Lawrence Erlbaum Associates, 2004). A fascinating discussion of how the program came to be designed and an early précis of its categories can be found at http://www.betterwriting.net/projects/fed01/dsc_fed01.html, accessed March 3, 2010.
[2] Because of the way they are used in the program, LATs must be given names without spaces. Obviously the characterization of the words that are contained in each of these categories is a matter of interpretation, as is the choice of those words themselves, which took place over the course of almost a decade of hand-coding. In general, Witmore and Hope use the categories or LATs to identify statistical patterns, then move from the categories to concrete textual instances in order to see how particular words are functioning in context.
[3] They discovered, for instance, that Shakespeare’s “late romances” were distinguished, linguistically, from those that went before them by word patterns that allowed speakers to narrate past action while highlighting their own emotional stance with respect to those actions (a process they called focalized retrospection). Specific linguistic features of these plays were responsible for this effect, for example (1) certain types of subordinated conjunction (a comma, followed by the word which) and (2) past-tense verb forms introduced by a past-tense auxiliary form of the verb to be. Comedies and histories were also shown to be significantly distinct from one another, with comedy possessing a high degree of first- and second-person pronouns (classed under the LATs FirstPerson and DirectAddress); a high degree of language expressing uncertainty (the LAT Uncertainty); an absence of nouns and verbs used to refer to motion, the properties of sensed objects, and sensed changes in objects (LATs labeled Motions, SenseProperty, SenseObject); an absence of first-person plural pronouns (the LAT Inclusive); and an absence of words indicating social entities or expectations that must be shared or mutually acknowledged (the LAT CommonAuthority).