Notes Toward a Theory of the Corpus, Part 1: History [#DH]

By Bbenzon @bbenzon
By corpus I mean a collection of texts. The texts can be of any kind, but I am interested in literature, so I’m interested in literary texts. What can we infer from a corpus of literary texts? In particular, what can we infer about history?
Well, to some extent, it depends on the corpus, no? I’m interested in an answer which is fairly general in some ways, in other ways not. The best thing to do is to pick an example and go from there.
The example I have in mind is the 3300 or so 19th century Anglophone novels that Matthew Jockers examined in Macroanalysis (2013 – so long ago, but it almost seems like yesterday). Of course, Jockers has already made plenty of inferences from that corpus. Let’s just accept them all more or less at face value. I’m after something different.
I’m thinking about the nature of historical process. Jockers' final study, the one about influence, tells us something about that process, more than Jockers seems to realize. I think it tells us that cultural evolution is a force in human history, but I don’t intend to make that argument here. Rather, my purpose is to argue that Jockers has created evidence that can be brought to bear on that kind of assertion. The purpose of this post is to indicate why I believe that.
A direction in a 600 dimension space
In his final study Jockers produced the following figure (I’ve superimposed the arrow):

Each node in that graph represents a single novel. The image is a 2D projection of a roughly 600 dimensional space, one dimension for each of the 600 features Jockers has identified for each novel. The length of each edge is proportional to the distance between the two nodes. Jockers has eliminated all edges above a certain relatively small value (as I recall he doesn’t tell us the cutoff point). Thus two nodes are connected only if they are relatively close to one another, where Jockers takes closeness to indicate that the author of the more recent novel was influenced by the author of the earlier one.
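The construction just described can be made concrete with a toy sketch. This is not Jockers’ code. It assumes Euclidean distance between feature vectors, uses three made-up features instead of roughly 600, and picks an arbitrary cutoff, since the actual cutoff isn’t given. The names influence_edges and toy_novels are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def influence_edges(novels, cutoff):
    """Return (earlier, later, distance) for every pair no farther apart than `cutoff`.

    `novels` maps a title to (year, feature_vector). The year is used only to
    orient each edge from the earlier book to the later one; it plays no part
    in computing the distance.
    """
    edges = []
    for (t1, (y1, f1)), (t2, (y2, f2)) in combinations(novels.items(), 2):
        d = euclidean(f1, f2)
        if d <= cutoff:  # drop every edge longer than the cutoff
            earlier, later = (t1, t2) if y1 <= y2 else (t2, t1)
            edges.append((earlier, later, d))
    return edges

# Toy data (assumption): made-up titles, years, and feature values.
toy_novels = {
    "A": (1810, [0.10, 0.80, 0.30]),
    "B": (1815, [0.12, 0.78, 0.33]),
    "C": (1890, [0.70, 0.20, 0.90]),
}
toy_edges = influence_edges(toy_novels, cutoff=0.1)
```

With these toy values only A and B end up linked, because C sits far from both in feature space. Note that the dates enter only when orienting an edge, never when deciding whether the edge exists.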
You may or may not find that to be a reasonable assumption, but let’s set it aside. What interests me is the fact that the novels in this graph are in rough temporal order, from 1800 at the left (gray) to 1900 at the right (purple). Where did that order come from? There were no dates in the 600D description of each novel. As far as I can tell, that must be a product of the historical process that produced those texts. That process must therefore have a temporal direction.
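Here is a rough sketch of how one might check that kind of claim: find a dominant direction in feature space without looking at dates, then ask afterward how well positions along it track publication year. Everything here is an assumption for illustration. The data are synthetic, with a drift over time deliberately built in, so the result shows only how the check works and is not evidence about real novels. I am also swapping in a first principal component, found by power iteration, in place of whatever layout produced the figure.

```python
import math
import random

def center(rows):
    """Subtract each column's mean so the data cloud is centered at the origin."""
    n, k = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(k)]
    return [[r[j] - means[j] for j in range(k)] for r in rows]

def principal_direction(rows, iterations=200):
    """Approximate the first principal component (unit vector) by power iteration."""
    X = center(rows)
    k = len(X[0])
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        scores = [sum(x[j] * v[j] for j in range(k)) for x in X]          # X v
        w = [sum(s * x[j] for s, x in zip(scores, X)) for j in range(k)]  # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

# Synthetic corpus (assumption): 60 "novels", 5 features, each feature drifting
# linearly with time plus Gaussian noise. The drift values are made up.
rng = random.Random(0)
drift = [1.0, -0.8, 0.5, 0.0, 0.3]
years = [1800 + round(100 * i / 59) for i in range(60)]
features = [
    [d * (y - 1800) / 100 + rng.gauss(0, 0.1) for d in drift]
    for y in years
]

direction = principal_direction(features)  # computed without any dates
scores = [sum(f * d for f, d in zip(row, direction)) for row in features]
r = pearson(scores, years)                 # dates enter only at this step
```

The sign of a principal component is arbitrary, so what matters is the magnitude of r, not whether it is positive or negative. The entries of direction are weights on the individual features, which is the sense in which such a direction is specified in the same terms as the space itself.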
I’ve spent a fair amount of effort explicitly arguing that point [1], but don’t want to reprise that argument here. For the purposes of this piece, assume that that argument is at least a reasonable one to make.
What is that direction? I don’t have a name for it, but that’s what the arrow in the image indicates. One might call it Progress, especially with Hegel looking over your shoulder. And I admit to a bias in favor of progress, though I have no use for the notion of some ultimate telos toward which history tends. But saying that direction is progress is a gesture without substantial intellectual content because it doesn’t engage with the terms in which that 600D space is constructed. What are those terms? Some of them are topics of the sort identified in topic analysis, e.g. American slavery, beauty and affection, dreams and thoughts, Greek and Egyptian gods, knaves rogues and asses, life history, machines and industry, misery and despair, scenes of natural beauty, and so on [3]. Others are stylistic features, such as the frequency of specific words, e.g. the, heart, would, me, lady, which are the first five words in a list Jockers has in the “Style” chapter of Macroanalysis (p. 94).
The arrow I’ve imposed on Jockers’ graph is a diagonal in the 600D space whose dimensions are defined by those features, and so its direction must be specified in terms that are commensurate with such features. Would I like to have an intelligible interpretation of that direction? Sure. But let’s leave that aside. We’ve got an abstract space in which we can represent the characteristics of novels (Daniel Dennett might call this a design space) and we’ve got a vector in that space, a direction.
What’s that direction about? What is it about texts that is changing as we move along that vector? I don’t know. Can I speculate? Sure. But not here and now. What’s important now is that that vector exists. We can think about it without having to know exactly what it is.
A snapshot of the Spirit of the 19th century
In a post back in 2014 I suggested that Jockers’ image depicts the Geist of 19th century Anglo-American literary culture [2]. That’s what interests me, the possibility that we’re looking at a 21st century operationalization of an idea from 19th century German idealism. Here’s what the Stanford Encyclopedia of Philosophy has to say about Hegel’s conception of history [4]:
In a sense Hegel’s phenomenology is a study of phenomena (although this is not a realm he would contrast with that of noumena) and Hegel’s Phenomenology of Spirit is likewise to be regarded as a type of propaedeutic to philosophy rather than an exercise in or work of philosophy. It is meant to function as an induction or education of the reader to the standpoint of purely conceptual thought from which philosophy can be done. As such, its structure has been compared to that of a Bildungsroman (educational novel), having an abstractly conceived protagonist—the bearer of an evolving series of so-called shapes of consciousness or the inhabitant of a series of successive phenomenal worlds—whose progress and set-backs the reader follows and learns from. Or at least this is how the work sets out: in the later sections the earlier series of shapes of consciousness becomes replaced with what seem more like configurations of human social life, and the work comes to look more like an account of interlinked forms of social existence and thought within which participants in such forms of social life conceive of themselves and the world. Hegel constructs a series of such shapes that maps onto the history of western European civilization from the Greeks to his own time.
Now, I am not proposing that Jockers has operationalized that conception, those “so-called shapes of consciousness”, in any way that could be used to buttress or refute Hegel’s philosophy of history – which, after all, posited a final end to history. But I am suggesting that we can reasonably interpret that image as depicting a (single) historical phenomenon, perhaps even something like an animating ‘force’, albeit one requiring a thoroughly material account. Whatever it is, it is as abstract as the Hegelian Geist.
How could that be?
Let’s spell out some fairly obvious things about the material underpinning, if you will, of the phenomena represented in that image. Each node represents a novel published in the 19th century, either in Ireland, England, or the United States. There are roughly 3300 novels, which implies something on the order of 3300 people, the authors of those novels. Of course, some authors produced more than one text in the corpus, and some texts had more than one author. Moreover, those authors worked with editors, each of whom read the book. And the editors reported to publishers who had to authorize publication, and so on. So let’s say we have on the order of 10,000 individuals more or less associated with the creation of those 3300 books.
Each book necessarily is an expression of those 10,000 minds. The expression is direct in the case of authors, but not-so-direct in the case of the others. Yet they wouldn’t have been involved with the book unless it somehow answered to or reflected something in their minds, even if it was only a commercial hunch about what would sell. And then we have the readers who (more or less necessarily) see or seek something of themselves in the books they read.
Note however that the fact of publication represents a commercial judgment about the viability of a given title in the marketplace. Editors and publishers, authors too, are aware of that marketplace and take that into account in their decisions. Thus independently of the actual post-publication readership of a book, the decision to publish represents a studied judgment about the taste and desires of the current literary marketplace.
Some books will have had relatively few readers, on the order of 100s or at most 1000s, while others had many readers, tens or hundreds of thousands, maybe even millions in some cases. And these readings, with their associated readers, may have happened within a year or two of publication or may have been spread out over decades or more. Yet the books register in the corpus only once, in the year of initial publication [5]. The corpus thus does not accurately represent the presence of those novels in the minds of 19th century readers.
Nor, for that matter, does that corpus represent either the entire 19th century production of Anglophone novels or a representative sample of that production. It’s a convenience sample. It’s what Jockers could cobble together in a reasonable amount of time.
Nonetheless I’m going to say that it represents the (collective) consciousness, perhaps mind if you will, of the 19th century Anglophone reading public. It is by no means a complete representation of the consciousness/mind of that public, but it’s not a mere chimera either. Perhaps a visual analogy will help.
This is a photograph of a high-end apartment building in Lower Manhattan, 8 Spruce Street. For what it’s worth, the building was designed by the “starchitect” Frank Gehry:

Is that a complete representation of 8 Spruce Street? Of course not. I note that the building is partially hidden by clouds and by another building. It represents the building as it appeared from a certain point of view at a certain time on a certain day, no more, no less. And, of course, it tells us nothing about the building’s interior, much less about those who live there.
If we are going to use that photograph to draw conclusions about that building, we are going to have to be careful. It will only support limited inferences. But it WILL support SOME inferences.
And so it is with Jockers’ snapshot of 19th century Anglo-American fiction. What conclusions can we draw from it? I’ve already drawn one conclusion, that that fiction unfolds or evolves along a certain (as yet uncharacterized) direction in the feature or design space of novelistic possibility. But what determines novelistic possibility?
What is spirit?
The human mind, obviously, the human mind.
Evolutionary psychologists, represented in literary studies by Joseph Carroll and his legion of literary Darwinists (among others), would have us believe that the basic parameters of the human mind are given in biology. Certainly, biology is important, essential, even foundational. But not even Carroll himself believes that biology is all. Biology is shaped by (local) culture.
I’m attracted by the analogy of a board game, such as chess. Biology provides the basic rules of the game: the board, the pieces, and their moves. Culture provides the tactics and strategy of game play. Those biological rules are thus quite open-ended, leaving many degrees of freedom for cultural elaboration and variation.
What determines those variations? Marxists of all stripes tell us material conditions, modes of production. Sure, why not? But those are hardly simple matters. And beyond that we have happenstance. Things are done this way because at some time and place someone decided to do it for whatever reason and, somehow, it caught on.
I mean, who knows? Biology, material conditions and modes of production, happenstance, what else? At the moment it really doesn’t matter, not for my argument. Biology creates a space of possibilities and culture plays in that space.
Think of it like this:

On the left I have indicated the various biological and cultural factors acting on the readers (note that writers are necessarily readers of their own texts) while the various features of literary works are on the right. The important part of the diagram is the middle block, the reader’s mind/brain. What is important is that the biological and cultural factors do not map on to text features in a simple manner. Each text feature is subject to multiple influencing factors, both biological and cultural.
Each person lives through the interaction of cultural and biological factors. Biological traits are inherited from one generation to another, as are most of the cultural ones. It’s very difficult for the apple to fall far from the tree, if you will. Evolution: descent with modification. And that’s what we see in Jockers’ figure, the gradual evolution of the 19th century Anglo-American Geist as it expresses itself in the novel. There’s nothing fundamentally mysterious or immaterial about this, though we certainly don’t understand the process. Those books are material objects. The people who produce them and read them are material beings, their brains in particular. How those brains work, we don’t know, though we’re learning more every day.
In the realm of the aesthetic
I would like to conclude by considering a passage from one of Edward Said’s last essays, “Globalizing Literary Study,” published in 2001 in PMLA [6]. He says:
I myself have no doubt, for instance, that an autonomous aesthetic realm exists, yet how it exists in relation to history, politics, social structures, and the like, is really difficult to specify. Questions and doubts about all these other relations have eroded the formerly perdurable national and aesthetic frameworks, limits, and boundaries almost completely. The notion neither of author, nor of work, nor of nation is as dependable as it once was, and for that matter the role of imagination, which used to be a central one, along with that of identity has undergone a Copernican transformation in the common understanding of it.
I believe that the most interesting way of thinking about that vector in the 600 dimensional “design space” of the 19th century Anglophone novel is to consider the possibility that it is evidence for the existence of that autonomous aesthetic realm [7]. THAT’s why I want to be very careful in thinking about just what a corpus is, what it implies.
In principle the previous diagram takes full account of all the various factors of historical particularity (at the upper left). But it also includes our underlying biological characteristics (lower left). Our engagement with texts must necessarily reflect both realms. That’s where we find the autonomous aesthetic realm, in the tension between biology and culture. Those texts are not fully subsumed by either realm, but emerge through living in both.
I take the fact that those 3300 19th century Anglophone novels seem to unfold along a single dimension in design space as evidence of the fundamental integrity and autonomy of the historical process that underlies those texts. That’s what Jockers has shown, even if he hasn’t interpreted his evidence in those terms. How do we then get from that demonstration to the claim that historical process is a force in human history? Couldn’t we think of those novels as epiphenomenal reflections of that process, in the way that some philosophers think of consciousness as an epiphenomenal effect of biological processes in the brain?
Of course we could think in those terms. Which is to say that thinking of that process – which I think of as a cultural evolutionary one – as itself a force in history requires an argument, one that I’ve not provided here. Nor do I intend to. Rather I simply want to indicate that such an argument is now in play. That’s what’s at stake in thinking carefully about a long-term historical corpus of literary texts. We’re thinking about the nature of history.
References
[1] See William Benzon, On the Direction of Cultural Evolution: Lessons from the 19th Century Anglophone Novel, Working Paper, April 2015, pp. 31, https://www.academia.edu/12112568/On_the_Direction_of_Cultural_Evolution_Lessons_from_the_19th_Century_Anglophone_Novel.
[2] William Benzon, Reading Macroanalysis 7.1: Visualizing the Geist of 19th Century Anglo-American Literary Culture, New Savanna (blog), August 22, 2014, accessed September 25, 2018, https://new-savanna.blogspot.com/2014/08/reading-macroanalysis-71-visualizing.html.
[3] Jockers identified 500 topics in his analysis and he has created a website where you can examine each of them, http://www.matthewjockers.net/macroanalysisbook/macro-themes/.
[4] Redding, Paul, "Georg Wilhelm Friedrich Hegel", The Stanford Encyclopedia of Philosophy (Summer 2018 Edition), Edward N. Zalta (ed.), https://plato.stanford.edu/archives/sum2018/entries/hegel/.
[5] I’m assuming this. I don’t know how Jockers handled the issue of multiple printings and different editions.
[6] Edward W. Said, “Globalizing Literary Study”, PMLA 116(1), 2001: 64-68.
[7] I first advanced this idea at the end of my 2006 theoretical and methodological paper, Literary Morphology: Nine Propositions in a Naturalist Theory of Form, PsyArt: An Online Journal for the Psychological Study of the Arts, August 2006, Article 060608, https://www.academia.edu/235110/Literary_Morphology_Nine_Propositions_in_a_Naturalist_Theory_of_Form.