In my second post I’m going to offer some general methodological remarks about operationalization and then discuss how Jockers’ results can be brought to bear on Edward Said’s anxiety over the existence of an autonomous aesthetic sphere.
The Statistical Assessment of Style
Jockers begins (p. 63) with this statement:
In statistical or quantitative authorship attribution, a researcher attempts to classify a work of unknown or disputed authorship in order to assign it to a known author based on a training set of works of known authorship.

So we start with texts whose authorship is known and analyze them in some way to identify features thought to be characteristic of the particular author. We’re not interested in the content of the text, but in authorial style.
Such work has a substantial history, going back to the mid-1960s when Mosteller and Wallace used statistical techniques to identify the authors of fifteen (out of 85) Federalist Papers with uncertain authorship. Because this particular set of texts is relatively homogeneous in important respects – genre, time of composition, provenance – it is reasonable to attribute statistical differences in the texts to authorial style. Such homogeneity is not always the case, as Jockers notes (p. 63):
A consistent problem for authorship researchers, however, is the possibility that other external factors (for example, linguistic register, genre, nationality, gender, ethnicity, and so on) may influence or even overpower the latent authorial signal. Accounting for the influence of external factors on authorial style is an important task for authorship researchers, but the study of influence is also a concern to literary researchers who wish to understand the creative impulse and the degree to which authors are the products of their times and environments.

Jockers will go on to discover that authors are in fact highly constrained, which I don’t think comes as a surprise to anyone except, perhaps, a few sophomoric Romantics who are totally besotted with the trope of authorial creative genius.
What Determines Style?
First of all, you are always making identifications in the context of a known ensemble of possible identities. In the case of the Federalist Papers, the papers in question had to be identified with Alexander Hamilton, James Madison, or John Jay. So, given texts known to be by one of these, which features most reliably distinguish them from one another?
Years of research have shown that high-frequency words (mostly grammatical function words) and punctuation marks are the most useful such features. Just why those features are most useful is an interesting question, and Jockers has some discussion of the matter scattered throughout the chapter as appropriate in particular cases. But I’m simply going to treat the matter as an empirical fact. Nor, for that matter, am I going to attempt to summarize Jockers’ account of how one works with these features (pp. 64-68).
Jockers reports one set of studies where he used a set of 42 such features and another set of studies where he used 161 features. In both sets of studies he used a corpus of 106 novels that Franco Moretti had already coded for genre. Each novel, in turn, was sliced into 10 equal-sized segments to give us 1060 texts for investigation. For each of those texts we measure the relative frequency of each of the 42 features. So, each text is represented by a string of 42 numbers, where each number is the frequency of some feature. Each such string of numbers is the text’s “signature” (my term) or “signal” (Jockers’). Jockers then uses a standard technique to group the signatures into piles according to similarity.
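To make that procedure concrete, here’s a minimal sketch in Python. It’s my reconstruction under stated assumptions, not Jockers’ code: the ten-feature list and the two stand-in “novels” are toys (his studies used 42 and then 161 features over 106 novels), and k-means stands in for whatever clustering routine he actually used.

```python
from collections import Counter
import re

from sklearn.cluster import KMeans

# Toy feature list; Jockers' studies used 42 (later 161) high-frequency
# words and punctuation marks.
FEATURES = ["the", "of", "and", "to", "a", "in", "it", ",", ";", "!"]

def signature(text):
    """Relative frequency of each feature: the segment's 'signal'."""
    tokens = re.findall(r"[a-z]+|[,;!]", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[f] / total for f in FEATURES]

def segments(novel, n=10):
    """Slice a novel into n equal-sized chunks, as Jockers does."""
    step = len(novel) // n
    return [novel[i * step:(i + 1) * step] for i in range(n)]

# Two stand-in 'novels'; the real corpus had 106.
novels = {
    "novel_a": "It was the best of times, and it was the worst of times. " * 200,
    "novel_b": "Call me Ishmael; some years ago, never mind how long! " * 200,
}

X, labels = [], []
for title, text in novels.items():
    for seg in segments(text):
        X.append(signature(seg))
        labels.append(title)

# Group the signatures into piles by similarity.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(list(zip(labels, clusters)))  # segments of one novel should co-cluster
```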
By way of comparison, imagine that someone has given you a pile of 100 photos of people. Your job is to group them into piles of people who look more or less alike. What you don’t know is that only 10 different people are present in those photos, with each individual being represented by 10 photos taken at various times of life, in various poses, clothing, etc. Will you put all the photos of the same individual into the same pile? That’s the kind of game Jockers is playing.
But his game is a bit more complex. What he discovered is that the classification routine not only managed to group segments of the same works together (remember, each novel had been sliced into 10 segments), but also grouped them according to genre (as Moretti had categorized them).
So, we now know that genres, like authors, have distinct signals (as determined by this particular set of 42 features). But that leaves us with a problem (p. 70):
If genres, like authors, appear to have distinguishable linguistic expressions that can be detected at the level of common function words, then a way was needed for disambiguating “author” effects from those of genre and determining whether such effects were diluting or otherwise obscuring the genre signals we observed. If there were author effects as well as genre effects, then one must consider the further possibility that there are generational effects (that is, time-based factors) in the usage of language: perhaps the genre signals being observed were in fact just characteristic of a particular generation’s habits of style. In other words, although it was clear that the novels were clustering together in genre groups, it was not clear that these groupings could be attributed to signals expressed by unique genres.

That is, though it looked like the system was grouping according to genre, maybe it was also picking up on author (as authors tend to stick to one or a few genres) or even generation (remember that Moretti had found a generational turnover in genres). I’m not going to attempt to summarize what Jockers did to sort things out, but the upshot is that, yes, genres do appear to have fairly robust characteristic signals.
Jockers then decided to undertake a further set of experiments. He used the same set of 106 novels (by 47 different authors), again divided into 10 segments each. But he expanded the feature set from 42 to 161 (listed at the bottom of p. 78). Jockers then added five pieces of metadata to each text segment: author, author’s gender, genre, decade of publication, and title. Each of these is to be taken as an index of some ‘force’ in the world that exerts ‘causal influence’ over the properties of the text. What we want to know is just how much force each of these factors has in determining textual properties.
So, we’ve got 1060 chunks of text. Each chunk is characterized in two ways: 1) by a signal that’s a function of 161 text attributes, and 2) by five pieces of metadata taken as indices of causal forces.
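Schematically, the dataset looks like the sketch below. Everything in it is a stand-in (random numbers for the signals, hypothetical labels; the three genre names are illustrative, not Moretti’s actual categories), but it shows the shape of the thing: a 1060 × 161 matrix of measurements plus five labels per row.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in signals: in the real study each row holds 161 relative
# frequencies measured from one text segment.
signals = rng.random((1060, 161))

# Five metadata labels per segment, each an index of a putative 'force'.
metadata = pd.DataFrame({
    "author": rng.choice([f"author_{i}" for i in range(47)], 1060),
    "gender": rng.choice(["f", "m"], 1060),
    "genre":  rng.choice(["gothic", "historical", "silver-fork"], 1060),
    "decade": rng.choice(np.arange(1800, 1900, 10), 1060),
    "title":  rng.choice([f"novel_{i}" for i in range(106)], 1060),
})
```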
Before even looking at Jockers’ result, let’s do a little thinking. Given that each of these text segments is one of 10 slices from a single novel, we’d expect that the text ‘force’ would have a strong causal influence on the signature of a given segment. (And if the word “cause” bothers you, well, think in terms of Aristotle’s formal cause, rather than efficient cause.) What would it mean for one of the other forces to have a stronger influence on the signature than the text itself?
And then there’s the author factor. Each of these 1060 segments was written by one of 47 authors. What would it mean for one of the other factors to be stronger than the author?
As for the other factors, gender is an attribute of the author while genre is an attribute of the text. Authors choose to write in specific genres and that, in turn, determines a great deal about how they write the text. Decade is a function of overall cultural milieu.
Jockers’ analysis of this data is rich and interesting but, again, I’m not going to attempt to summarize it. I’m going straight to what I regard as his big result (pp. 92-93):
Despite the presence of forty-seven different authors in this test corpus, the model was able to detect enough variation between author signals to correctly identify the authors of a random set of samples with 93 percent accuracy. This may feel uncomfortable, even controversial, to those reared in the “there is no author” school of literary criticism. The data are undeniable. Ultimately, it does not matter if Dickens is really Dickens or some amalgamation of voices driven to paper by the hand of Dickens. The signal derived from 161 linguistic features and extracted from books bearing the name “Charles Dickens” on the cover is consistent.

Actually, concerning that penultimate sentence, I think it does matter, something I’ll return to in the next post. But I have no quarrel with the point Jockers is making here: authors determine style.
It’s not simply that authorship can be identified with a high degree of accuracy, but that it can be more accurately identified than any of the other factors, including text. It is even easier to identify the author of a segment than it is to identify the text from which the segment came. Gender, genre, and decade were, in turn, less accurately assessed than author or text.
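One way to picture that comparison: train one classifier per metadata factor and compare cross-validated accuracies. The sketch below is my illustration, not Jockers’ method; it regenerates stand-in data so it runs on its own, and with random signals the accuracies sit at chance. The point is the procedure: on the real corpus, this kind of comparison puts author on top, then text, with the other factors trailing.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Stand-in data, as in the earlier sketch, so this block runs alone.
rng = np.random.default_rng(0)
signals = rng.random((1060, 161))
metadata = pd.DataFrame({
    "author": rng.choice(47, 1060),
    "title":  rng.choice(106, 1060),
    "genre":  rng.choice(6, 1060),
    "gender": rng.choice(2, 1060),
    "decade": rng.choice(10, 1060),
})

# One classifier per factor: how recoverable is each label from the signals?
# Logistic regression is my stand-in for Jockers' classification method.
for factor in metadata.columns:
    clf = LogisticRegression(max_iter=1000)
    scores = cross_val_score(clf, signals, metadata[factor], cv=3)
    print(f"{factor:>7}: {scores.mean():.2%} cross-validated accuracy")
```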
So What?
Of course, many literary critics take it for granted that authorial style is important. But it is one thing to believe, on whatever basis, that authorial style is important. It’s something else again to be able to assess style in such a way that one can identify the author of a given text. Decades of research have shown that, at this task, computational methods are more accurate than human judgment.
In the next post I’m going to take Jockers’ computational resurrection of the dead author and use it to argue for the existence of an autonomous aesthetic realm.
* * * * *
Previously:
- Reading Macroanalysis 1: Framing: Hyperobjects, Objectification, and Evolution
- Reading Macroanalysis 2: Metadata and the Emperor’s New Clothes
- Reading Macroanalysis 2.1: How do we make inferences from patterns in collections of books to patterns in populations of readers?