Rhythmic Constraints on Stress Timing in English

What kind of embodied constraints affect the production of speech? Can we say anything we like when we like, or are there constraints in play that make some things easier than others? This is the question asked in Cummins & Port (1998), which we recently read in lab meeting (with our PhD student Agnes).

Cummins and Port asked participants to produce sentences over and over and examined when during the cycle a certain stress beat occurred. They used a pacing beep to cue that beat at locations spread throughout the cycle, but showed that people could actually only place the beat reliably in 2 or 3 places in the cycle. The big picture result is that speech production is shaped, in part, by the underlying dynamics of production described in terms of the rhythms it is set up to produce.

The nice detail here comes from the theoretical set up and analysis that drives this study. Cummins and Port are directly inspired and guided by work in coordination dynamics. Agnes is interested in this work because she's looking at ways to investigate language and speech using the tools of dynamical systems and embodied cognition - remember, our big pitch is that language is special but not magical and we should be able to study it the way we study, say, rhythmic movement coordination.

The task was to produce sentences over and over (speech cycling) and match a set timing. Each trial included 14 repetitions of a sentence like 'Big for a duck' (the first two were dropped from analysis). Participants heard two tones, a high tone and a low tone 700ms apart. Their job was to produce the sentence so that 'big' happened on the high tone and 'duck' happened on the low tone. Participants then did 12 'continuation' repetitions with no pacing signal. A cycle was defined as running from the onset of 'Big' to the next onset of 'Big' in a participant's speech. The phase of this cycle runs from 0 to 1. The gap between the low tone and the next high tone was varied so that the cycle was 1s - 2.33s in length. This meant the low tone (pacing 'duck') happened at a target phase that varied from 0.3 to 0.7. See Figure 1, a more useful version of their Figure 2.

Figure 1. A cycle is defined as the time between high tones, i.e. the production of ‘big’. Varying the high-low gap means the low tone occurs at different phases (relative times) in the cycle
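The arithmetic linking the fixed 700ms interval to the target phase is easy to lose track of, so here is a minimal sketch of it in Python. The function names are mine, not anything from the paper, and the second function just restates the cycle definition above (onset of 'big' to the next onset of 'big'); the onset times would come from the speech recordings, which I assume have already been measured.

```python
# Sketch: relate the fixed 700 ms high-low tone interval to target phase.
# Function names are illustrative; they do not come from Cummins & Port (1998).

INTERVAL_S = 0.7  # fixed gap between the high tone ('big') and the low tone ('duck')


def cycle_duration(target_phase: float, interval_s: float = INTERVAL_S) -> float:
    """Cycle length (high tone to next high tone) that puts the low tone at target_phase."""
    if not 0.0 < target_phase < 1.0:
        raise ValueError("target_phase must lie strictly between 0 and 1")
    return interval_s / target_phase


def produced_phase(big_onset: float, duck_onset: float, next_big_onset: float) -> float:
    """Phase of 'duck' within the cycle running from one 'big' onset to the next."""
    cycle = next_big_onset - big_onset
    if cycle <= 0:
        raise ValueError("next_big_onset must come after big_onset")
    return (duck_onset - big_onset) / cycle


if __name__ == "__main__":
    # The end points and midpoint of the 0.3 to 0.7 target range.
    for phase in (0.3, 0.5, 0.7):
        print(f"target phase {phase:.1f} -> cycle of {cycle_duration(phase):.2f} s")
```

A target phase of 0.7 needs the shortest cycle (1s) and a target phase of 0.3 needs the longest (about 2.33s), which is where the cycle range in the task description comes from.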

So the task for people was to produce the word 'duck' 700ms after 'big' but at different locations in the cycle as defined by the two high tones pacing 'big'. The phases defining the low-high gap were changed randomly for each trial with a uniform probability. The question was, would there be any variation in performance when assessed in terms of phase?

The analogy is coordinated rhythmic movement. Any relative phase is technically possible, but people only produce two (0º and 180º) because of the way they couple their limbs (using perceived relative phase, the information for which is the relative direction of motion). Cummins and Port note that this constraint on rhythm production emerges from the coupling of the components into a task-specific device (TSD) and suggest that if rhythmic speech is also the result of a TSD, the way that TSD is built might have identifiable effects constraining the rhythms that it can produce. (They have no theory or model of what this TSD might be yet, so they can't make specific predictions, but that's OK because this paper is the proof of concept work.)

The task asks people to produce a variety of phases that are drawn from a uniform distribution. If people can simply reproduce that uniform distribution, there is no TSD in between the stimulus and response shaping that response. If they can't, then the distributions they do produce will demonstrate the presence of a TSD-type organisation (i.e. lots of components temporarily organised into a lower dimensional synergy with specific dynamical characteristics). Recall that you might want to do this in order to make the whole system something you could control, i.e. to solve the degrees of freedom problem.

Results

Figure 2. Frequency histograms for the phases people produced while trying to produce phases that were uniformly distributed

Figure 2 shows that people could not reproduce the uniform distribution of the stimuli. The data for three subjects showed 3 clear clusters and one showed two clusters (confirmed by more formal analysis later in the paper). Three of the participants were female musicians; the fourth, KA, was a male non-musician. (Experiment 2 replicated part of this design with female non-musicians, a male non-musician and a male musician. The latter showed 2 clusters, the rest showed 3; this task clearly has space for considerable individual variation that is not entirely accounted for by musical experience.) Basic finding: Some phases are easier than others, and the clusters reflect attractors in the rhythmic dynamics of speech production in this task.
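To make the logic of "can people reproduce a uniform distribution?" concrete, here is a rough Python sketch that bins a set of produced phases and computes a chi-square statistic against equal counts per bin. To be clear, this is not the cluster analysis Cummins and Port ran; the bin count, the function names and the demo values are all my own assumptions, and the demo phases are invented purely to show the function running.

```python
# Sketch: compare produced phases against a uniform distribution over the
# target range. NOT the paper's analysis; a simple binned chi-square statistic.

def bin_counts(phases, lo=0.3, hi=0.7, n_bins=4):
    """Count phases per equal-width bin on [lo, hi]; phases outside are ignored."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for p in phases:
        if lo <= p <= hi:
            # Clamp so that p == hi lands in the last bin.
            idx = min(int((p - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts


def chi_square_vs_uniform(counts):
    """Chi-square statistic against equal expected counts in every bin."""
    total = sum(counts)
    if total == 0:
        raise ValueError("no observations to compare")
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


if __name__ == "__main__":
    # Invented values for illustration only; these are not data from the paper.
    spread_out = [0.35, 0.45, 0.55, 0.65]
    clustered = [0.33, 0.34, 0.35, 0.36]
    print(chi_square_vs_uniform(bin_counts(spread_out)))  # small: looks uniform
    print(chi_square_vs_uniform(bin_counts(clustered)))   # large: one cluster
```

A statistic near zero means the produced phases are spread evenly across the target range; a large value means they pile up in a few bins, which is the pattern Figure 2 shows.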

Figure 3. Plotting target phase vs error for each participant, with data sorted into clusters.

Figure 3 shows that within each cluster, error gets worse the farther the target phase is from that cluster's attractor. This is analogous to rhythmic movement coordination again; you can try to produce a mean relative phase of, say, 90º, but you will mostly fail.

Basic result: People made the fewest errors when the required phase was in the attractor region.

Figure 4. Plotting target phase vs variability and fitting that data with quadratic regression

Figure 4 shows how production variability varied with target phase. This plot is analogous to the famous HKB potential function that describes how the required effort to produce a coordination varies with relative phase. Each plot shows a local minimum in variability that aligns with the attractors; note that as in the HKB model, not all attractors are equally stable.

Summary

This paper was directly inspired by work in coordination dynamics and applied all the right lessons to the study of speech production. What they found was that people were unable to produce a constant 700ms interval between two stress beats in an English sentence when those beats occurred in the context of rhythmic production. The dynamical device assembled to produce the rhythmical speech imposed constraints on what timings were possible and these constraints affected behavior.

There is no explanation here as to what the dynamic might be or how it's composed or organised, but in principle the task analysis that led to Geoff's model of coordination dynamics could be used here as well. I seem to recall Bob Port mentioning in a talk that he had applied the HKB model to these kinds of data; this would work because the model allows him to a) add terms to include a third attractor (this is how Kelso and Zanone modelled learning) and b) parameterise it so those attractors show up at whatever phase fits the data. However, this approach suffers the same problems as the HKB model: it's purely descriptive, and because it does not include a specification of the actual dynamic at work it could easily lead you to make incorrect predictions. I'll see if I can get Fred Cummins to tell us about more recent work.

I liked this paper a lot; it was rigorous and it did not try to simply jam the HKB model onto their data. Instead, it drew inspiration from the task dynamic approach and tailored that approach to suit the task at hand. It is also a great example of how to study things like speech using the same tools and language as we use to study action more generally.

References

Cummins, F. & Port, R. (1998). Rhythmic constraints on stress timing in English. Journal of Phonetics, 26, 145-171.