Phoneme surprisal

What stimulus properties are responsible for driving neural activity during speech comprehension? For quite some time, we’ve known that word frequency modulates brain responses like the ERP N400; higher frequency words elicit smaller amplitude responses. In addition, listeners’ expectancies for particular words, as quantified in Cloze probability ratings, also modulate reaction time and the N400; higher Cloze probabilities elicit faster RTs and smaller amplitude responses.

For variables that modulate brain responses to speech sounds, phoneme surprisal seems to robustly correlate with superior temporal activity around 100 to 140ms after the onset of a phoneme (Brodbeck et al. 2018), although we’ll show below that phoneme surprisal effects are not limited to this time window. As a variable, phoneme surprisal is a product of information theory – how much information a phoneme carries in context, where the usual context to consider is the “prefix” or list of phonemes before the target phoneme starting from the beginning of a word. As we have seen in previous posts, phoneme surprisal is conventionally measured in terms of the probability distribution over the “cohort” of words that are consistent with this “prefix.”
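To make the cohort-based definition concrete, here is a minimal sketch in Python. It assumes a hypothetical toy lexicon with invented frequency counts, and it takes surprisal to be the negative log (base 2) of the proportion of cohort frequency mass that survives once the target phoneme is added to the prefix. The names `LEXICON`, `cohort_mass` and `phoneme_surprisal` are illustrative only and do not come from any of the cited studies.

```python
import math

# Hypothetical toy lexicon: phoneme tuples -> frequency counts (invented numbers).
LEXICON = {
    ("k", "l", "a", "sh"): 40,
    ("k", "l", "a", "p"): 30,
    ("k", "l", "i", "p"): 20,
    ("k", "a", "t"): 10,
}

def cohort_mass(lexicon, prefix):
    """Summed frequency of all words whose phonemes begin with `prefix`."""
    prefix = tuple(prefix)
    n = len(prefix)
    return sum(freq for word, freq in lexicon.items() if word[:n] == prefix)

def phoneme_surprisal(lexicon, word, position):
    """Surprisal (in bits) of word[position] given the prefix word[:position]."""
    before = cohort_mass(lexicon, word[:position])      # cohort before the phoneme
    after = cohort_mass(lexicon, word[:position + 1])   # cohort after the phoneme
    if after == 0:
        raise ValueError("no word in the lexicon is consistent with this phoneme")
    return -math.log2(after / before)

# Surprisal at each phoneme of the toy word; the values sum to the
# surprisal of the whole word relative to the full lexicon.
word = ("k", "l", "a", "sh")
per_phoneme = [phoneme_surprisal(LEXICON, word, i) for i in range(len(word))]
```

In this sketch a phoneme that eliminates little of the cohort's frequency mass receives low surprisal, while one that eliminates most of it receives high surprisal.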

In considering the literature on phoneme surprisal, and in planning future experiments, we should distinguish between “phoneme surprisal” as a variable (PS_V) and “phoneme surprisal” as a particular neural response (or multiple such responses) modulated by the phoneme surprisal variable (PS_N). We should also be clear about the difference between using surprisal as a variable without an account of why linguistic processing should be sensitive to this variable at a particular point in time and in a particular brain region, and using surprisal as a variable in connection with a neurobiological theory of speech processing, as, say, in the “predictive coding” framework (see e.g., Gagnepain et al. 2012).

On the first distinction, at the NYU MorphLab we have published two studies using the PS_V that have discovered different – at least in time – neural responses sensitive to the variable. In Gwilliams and Marantz (2015), a study on Arabic root processing, we found that PS_V computed over roots yielded a PS_N on the third root consonant in the 100-140ms post phoneme onset time period in the superior temporal gyrus (see Figure 3 below). PS_V measured at this third root consonant for the whole word, by contrast, did not yield a PS_N in the same time frame. (The graph of PS_V effects shows additional PS_N’s after 200ms and after 300ms, which we will set aside.)

In Gaston & Marantz (2018), we examined the effect of prior context on the PS_N of English words like clash that can be used either as nouns or as verbs. We computed PS_V in various ways when these words were placed after to to force the verb use and after the to force the noun use. For example, we considered measuring PS_V after the by removing from the cohort all the words with only verb uses but leaving the full frequency of the target word as both noun and verb in the probability distribution of the remaining cohort vs. also reducing the frequency of the target in the cohort by removing its verb uses as well. We found a set of complicated PS_N responses sensitive to the contextual manipulation of PS_V (see Figure 3 below). However, the PS_N was not in the same time range as the (first) PS_N from the Gwilliams and Marantz study but instead came 100ms later.
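The two ways of computing PS_V after "the" described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the paper: the lexicon and its noun/verb counts are invented, the function names are hypothetical, and the second variant reduces every remaining word to its noun frequency (the text above only specifies this for the target word).

```python
import math

# Hypothetical toy entries: phonemes -> (noun_count, verb_count); numbers invented.
LEXICON = {
    ("k", "l", "a", "sh"): (20, 20),  # "clash": noun and verb uses
    ("k", "l", "a", "p"): (5, 25),    # noun and verb uses
    ("k", "l", "i", "ng"): (0, 20),   # verb uses only
    ("k", "l", "a", "n"): (30, 0),    # noun uses only
}

def noun_context_weights(lexicon, reweight):
    """Cohort weights after a noun-forcing context such as "the".

    reweight=False: drop verb-only words, keep each survivor's full frequency.
    reweight=True:  drop verb-only words and count noun uses only (assumption:
                    applied to every surviving word, not just the target).
    """
    weights = {}
    for word, (noun, verb) in lexicon.items():
        if noun == 0:
            continue  # words with only verb uses leave the cohort
        weights[word] = noun if reweight else noun + verb
    return weights

def surprisal(weights, word, position):
    """Surprisal (bits) of word[position] given word[:position] under `weights`."""
    def mass(prefix):
        return sum(f for w, f in weights.items() if w[:len(prefix)] == prefix)
    return -math.log2(mass(word[:position + 1]) / mass(word[:position]))

word = ("k", "l", "a", "sh")
filtered = surprisal(noun_context_weights(LEXICON, reweight=False), word, 3)
reweighted = surprisal(noun_context_weights(LEXICON, reweight=True), word, 3)
```

With these invented counts the two variants assign different surprisal to the final phoneme of the target, which is the kind of difference that lets the competing PS_V measures be compared as predictors of the brain response.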

For these studies from the MorphLab, the fact that the PS_V yielded distinct PS_N’s was not crucial to the interpretation of the results. Of interest in the Arabic study was whether there was evidence for processing the root of the word independent of the whole word, despite the fact that the root and “pattern” morphology that make up the word are temporally intertwined. For the study on the effects of context, we were interested in whether the preceding context would modulate cohort frequency effects, and if so, which measure of cohort refinement provided the best modulator of brain responses in the superior temporal lobe. Our conclusions were connected to our hypotheses about the relationship between PS_V and cognitive models of representation and processing, not to prior assumptions about PS_N.

That being said, ultimately our goal is to understand the neurobiology of speech perception – the way that the cognitive models are implemented in the neural hardware. For this goal, we should be seeking a consistent PS_N (or multiple consistent PS_N’s) and developing an interpretation of this PS_N within a neurologically grounded processing theory. Here, the literature provides some promising results. In a study examining subjects’ responses to listening to natural speech (well, audiobooks), Brodbeck et al. (2018) identify a strong PS_N in the same time range and same neural neighborhood as the PS_N in Gwilliams and Marantz’s Arabic study (see Figure 2 below). Brodbeck et al. did not decompose the words in their study, so PS_V was computed based solely on whole word cohorts, and function and content words weren’t distinguished. While context, including word-internal morphological context, may have modulated the effects of PS_V on the PS_N, this simple whole word PS_V measure nevertheless remained robust and stronger than other variables they entertained as potentially modulating the brain response. Laura Gwilliams’ ongoing work in the MorphLab has found a similar latency for a PS_N from naturalistic speech, using different analysis techniques (and a different data set) from Brodbeck et al.

The timing and location of Brodbeck’s PS_N response is broadly compatible with the timing and location of responses associated with the initial identification of phonemes, as measured, e.g., by ECoG recordings in Mesgarani et al. (2014) and subsequent publications (see Figure 1 below). This invites an interpretation of the PS_N as a measure of the predictability of a phoneme being identified, rather than in terms of the information content of the phoneme. Such an analysis is part of the “predictive coding” framework as described, e.g., in Gagnepain et al. (2012). In this framework, a response that could be Brodbeck’s PS_N in time and space is construed as an error signal proportional to the discrepancy between the predicted phoneme and the incoming phonetic information. It will be of great interest to tease apart predictions of a processing model based on predictive coding vs. one based on information theory. We note here the prediction made by Gagnepain et al. (2012) that we should not see, in addition to a PS_V-related neural response, a response that is modulated by cohort entropy. However, Brodbeck et al. observe a robust entropy response that was close to the PS_N both temporally and areally but nonetheless statistically independent (see their Figure 2 above).
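Cohort entropy, the second variable at issue here, can be sketched alongside surprisal. The example below assumes the conventional Shannon entropy over the frequency-weighted distribution of words still consistent with the prefix; whether this matches the exact definition used in the cited studies is an assumption, and the lexicon and function name are hypothetical.

```python
import math

# Hypothetical toy lexicon: phoneme tuples -> frequency counts (invented numbers).
TOY_LEXICON = {
    ("b", "a", "t"): 50,
    ("b", "a", "d"): 50,
    ("k", "a", "t"): 100,
}

def cohort_entropy(lexicon, prefix):
    """Shannon entropy (bits) over the words consistent with `prefix`."""
    prefix = tuple(prefix)
    freqs = [f for w, f in lexicon.items() if w[:len(prefix)] == prefix]
    total = sum(freqs)
    if total == 0:
        raise ValueError("empty cohort: no word matches this prefix")
    # Sum of -p * log2(p) over the cohort; zero-frequency words contribute nothing.
    return -sum((f / total) * math.log2(f / total) for f in freqs if f > 0)

# Two equally likely candidates remain after "b": one bit of uncertainty.
after_b = cohort_entropy(TOY_LEXICON, ("b",))
# Only one candidate remains after "k": no uncertainty left.
after_k = cohort_entropy(TOY_LEXICON, ("k",))
```

Surprisal describes how unexpected the phoneme just heard was, whereas entropy describes how much uncertainty about the word remains afterwards, so the two can dissociate: in this toy lexicon "b" and "k" are equally surprising word-initially but leave cohorts with different entropy.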

Returning to the topic of PS_V responses in morphologically complex words, we see that it’s important to understand whether PS_V responses are uniquely associated with a PS_V that is computed using considerations like transition probability (the likelihood of an affix given the stem) and other factors that fix the probability of the affix before assessing the likelihood of the phonemes in the affix. One could imagine instead that the PS_V that matters most for the Brodbeck/Gwilliams early PS_N is computed over cohorts of morphemes, without modulation associated with the contextual statistics of the morphemes. Functional morphemes (prepositions, determiners, complementizers) are highly predicted in syntactic context, but the PS_V relevant to the Brodbeck/Gwilliams PS_N might ignore syntactic prediction in assigning probability weights to the morphemes in the relevant cohort for the PS_V. Consider that the context-modulated PS_N we observed in the Gaston & Marantz paper was not this early PS_N, but a significantly later response (with respect to phoneme onset), and that the Brodbeck et al. study apparently included functional morphemes without contextualizing their PS_V to predicted contextual frequencies of these morphemes. A contextually unmodulated PS_V would not, strictly speaking, be an information theoretic PS_V, since a contextually predicted phoneme is simply not as informative as a contextually unpredicted one, and this PS_V would thus overestimate the information content of contextually predicted phonemes (say, phonemes in a suffix after a stem that highly predicts the suffix). Still, the field awaits a set of plausible processing theories that make sense of the importance of the non-contextual PS_V and make further predictions for MEG experiments (that we can run).

References

Brodbeck, C., Hong, L. E., & Simon, J. Z. (2018). Rapid transformation from auditory to linguistic representations of continuous speech. Current Biology, 28(24), 3976-3983.

Gagnepain, P., Henson, R. N., & Davis, M. H. (2012). Temporal predictive codes for spoken words in auditory cortex. Current Biology, 22(7), 615-621.

Gaston, P., & Marantz, A. (2018). The time course of contextual cohort effects in auditory processing of category-ambiguous words: MEG evidence for a single “clash” as noun or verb. Language, Cognition and Neuroscience, 33(4), 402-423.

Gwilliams, L., & Marantz, A. (2015). Non-linear processing of a linear speech stream: The influence of morphological structure on the recognition of spoken Arabic words. Brain and Language, 147, 1-13.

Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343(6174), 1006-1010.

1 Comment

Laura Gwilliams
August 25, 2020 at 11:10 pm

This post focuses on how phoneme surprisal can shift neural responses “up and down” in magnitude. It is perhaps important to note that phoneme surprisal can also serve to initiate existing phonetic processes (e.g. the STG responses a la Mesgarani et al., 2014) “earlier and later” in time. In a preprint authored by myself, Alec Marantz, Jean-Remi King and David Poeppel, we show an analysis of responses to continuous speech using MEG (https://www.biorxiv.org/content/10.1101/2020.04.04.025684v1.full). We showed that, when controlling for phoneme position in the word, the phonetic features of more predictable phonemes were decodable earlier than less predictable ones. This thus suggests that the predictability of information potentially plays a role in multiple neural computations — not just e.g. the ease with which information can be integrated with the internal model predictions (i.e. modulating activity up/down), but also the speed with which existing computations can be initiated.
