Author: Alec Marantz (Page 5 of 11)

Phoneme surprisal

August 25, 2020 / Alec Marantz / 1 Comment

What stimulus properties are responsible for driving neural activity during speech comprehension? For quite some time, we’ve known that word frequency modulates brain responses like the ERP N400; higher frequency words elicit smaller amplitude responses. In addition, listeners’ expectancies for particular words, as quantified in Cloze probability ratings, also modulate reaction time and the N400; higher Cloze probabilities elicit faster RTs and smaller amplitude responses.

For variables that modulate brain responses to speech sounds, phoneme surprisal seems to robustly correlate with superior temporal activity around 100 to 140ms after the onset of a phoneme (Brodbeck et al. 2018), although we’ll show below that phoneme surprisal effects are not limited to this time window. As a variable, phoneme surprisal is a product of information theory – how much information a phoneme carries in context, where the usual context to consider is the “prefix” or list of phonemes before the target phoneme starting from the beginning of a word. As we have seen in previous posts, phoneme surprisal is conventionally measured in terms of the probability distribution over the “cohort” of words that are consistent with this “prefix.”
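As a concrete illustration of the cohort-based measure just described, here is a minimal sketch, assuming a toy lexicon with invented frequency counts and one character per phoneme. The names `LEXICON`, `cohort_mass` and `phoneme_surprisal` are hypothetical and are not taken from any of the studies cited here.

```python
import math

# Hypothetical toy lexicon: phoneme strings mapped to invented frequency counts.
LEXICON = {
    "klæʃ": 40,
    "klæp": 60,
    "klæs": 100,
    "kæt": 200,
}

def cohort_mass(prefix, lexicon):
    """Summed frequency of all words that begin with the given phoneme prefix."""
    return sum(freq for word, freq in lexicon.items() if word.startswith(prefix))

def phoneme_surprisal(prefix, phoneme, lexicon):
    """Surprisal in bits: -log2 P(phoneme | prefix), with P a ratio of cohort masses."""
    before = cohort_mass(prefix, lexicon)
    after = cohort_mass(prefix + phoneme, lexicon)
    if before == 0 or after == 0:
        raise ValueError("phoneme is not attested after this prefix in the toy lexicon")
    return -math.log2(after / before)

# Surprisal of each phoneme of one word, given the phonemes heard so far.
word = "klæʃ"
for i, phoneme in enumerate(word):
    print(phoneme, round(phoneme_surprisal(word[:i], phoneme, LEXICON), 3))
```

A phoneme that eliminates most of the cohort's probability mass gets high surprisal; a phoneme that every remaining candidate shares gets a surprisal of zero.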

In considering the literature on phoneme surprisal, and in planning future experiments, we should distinguish between “phoneme surprisal” as a variable (PS_V) and “phoneme surprisal” as a particular neural response (or multiple such responses) modulated by the phoneme surprisal variable (PS_N). We should also be clear about the difference between using surprisal as a variable without an account of why linguistic processing should be sensitive to this variable at a particular point of time and in a particular brain region, and using surprisal as a variable in connection with a neurobiological theory of speech processing, as say in the “predictive coding” framework (see e.g., Gagnepain et al. 2012).

On the first distinction, at the NYU MorphLab we have published two studies using the PS_V that have discovered different – at least in time – neural responses sensitive to the variable. In Gwilliams and Marantz (2015), a study on Arabic root processing, we found that PS_V computed over roots yielded a PS_N on the third root consonant in the 100-140ms post phoneme onset time period in the superior temporal gyrus (see Figure 3 below). PS_V measured at this third root consonant for the whole word, by contrast, did not yield a PS_N in the same time frame. (The graph of PS_V effects shows additional PS_N’s after 200ms and after 300ms, which we will set aside.)

In Gaston & Marantz (2018), we examined the effect of prior context on the PS_N of English words like “clash” that can be used either as nouns or as verbs. We computed PS_V in various ways when these words were placed after “to” to force the verb use and after “the” to force the noun use. For example, we compared two ways of measuring PS_V after “the”: (i) removing from the cohort all the words with only verb uses but leaving the full frequency of the target word, as both noun and verb, in the probability distribution of the remaining cohort; and (ii) additionally reducing the frequency of the target in the cohort by removing its verb uses. We found a set of complicated PS_N responses sensitive to the contextual manipulation of PS_V (see Figure 3 below). However, the PS_N was not in the same time range as the (first) PS_N from the Gwilliams and Marantz study but instead came 100ms later.
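To make the two cohort refinements after “the” concrete, here is a rough sketch. It assumes a toy lexicon whose noun and verb counts are invented, and the category assignments of the competitors are stipulated for illustration only. The function names are hypothetical, and this is not the computation actually used in the paper.

```python
import math

# Invented counts, split by category use. The competitors are simply stipulated
# to be verb-only or noun-only for the purposes of this illustration.
LEXICON = {
    "klæʃ": {"noun": 30, "verb": 10},   # category-ambiguous target
    "klæp": {"noun": 0, "verb": 60},    # treated as verb-only here
    "klæs": {"noun": 100, "verb": 0},   # treated as noun-only here
}

def weight_full(counts):
    """Drop verb-only words; words with any noun use keep their full frequency."""
    return counts["noun"] + counts["verb"] if counts["noun"] > 0 else 0

def weight_noun_only(counts):
    """Count only noun uses, which also strips the target's verb uses."""
    return counts["noun"]

def surprisal_after_the(prefix, phoneme, weight):
    """-log2 P(phoneme | prefix) over the cohort weighted by `weight`."""
    before = sum(weight(c) for w, c in LEXICON.items() if w.startswith(prefix))
    after = sum(weight(c) for w, c in LEXICON.items() if w.startswith(prefix + phoneme))
    return -math.log2(after / before)

full = surprisal_after_the("klæ", "ʃ", weight_full)
noun_only = surprisal_after_the("klæ", "ʃ", weight_noun_only)
print(round(full, 3), round(noun_only, 3))
```

With these invented counts the two schemes give different surprisal values for the same phoneme, which is what allows them to be compared as predictors of the neural response.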

For these studies from the MorphLab, the fact that the PS_V yielded distinct PS_N’s was not crucial to the interpretation of the results. Of interest in the Arabic study was whether there was evidence for processing the root of the word independent of the whole word, despite the fact that the root and “pattern” morphology that make up the word are temporally intertwined. For the study on the effects of context, we were interested in whether the preceding context would modulate cohort frequency effects, and if so, which measure of cohort refinement provided the best modulator of brain responses in the superior temporal lobe. Our conclusions were connected to our hypotheses about the relationship between PS_V and cognitive models of representation and processing, not to prior assumptions about PS_N.

That being said, our ultimate goal is to understand the neurobiology of speech perception – the way that the cognitive models are implemented in the neural hardware. To reach it, we should seek a consistent PS_N (or multiple consistent PS_N’s) and develop an interpretation of this PS_N within a neurologically grounded processing theory. Here the literature provides some promising results. In a study examining subjects’ responses to listening to natural speech (well, audiobooks), Brodbeck et al. (2018) identify a strong PS_N in the same time range and same neural neighborhood as the PS_N in Gwilliams and Marantz’s Arabic study (see Figure 2 below). Brodbeck et al. did not decompose the words in their study, so PS_V was computed based solely on whole word cohorts, and function and content words weren’t distinguished. While context, including word-internal morphological context, may have modulated the effects of PS_V on the PS_N, this simple whole word PS_V measure nevertheless remained robust and stronger than other variables they entertained as potentially modulating the brain response. Laura Gwilliams’ ongoing work in the MorphLab has found a similar latency for a PS_N from naturalistic speech, using different analysis techniques (and a different data set) from Brodbeck et al.

The timing and location of Brodbeck’s PS_N response is broadly compatible with the timing and location of responses associated with the initial identification of phonemes, as measured, e.g., by ECoG recordings in Mesgarani et al. (2014) and subsequent publications (see Figure 1 below). This invites an interpretation of the PS_N as a measure of the predictability of a phoneme being identified, rather than in terms of the information content of the phoneme. Such an analysis is part of the “predictive coding” framework as described, e.g., in Gagnepain et al. (2012). In this framework, a response that could be Brodbeck’s PS_N in time and space is construed as an error signal proportional to the discrepancy between the predicted phoneme and the incoming phonetic information. It will be of great interest to tease apart predictions of a processing model based on predictive coding vs. one based on information theory. We note here the prediction made by Gagnepain et al. (2012) that we should not see, in addition to a PS_V-related neural response, a response that is modulated by cohort entropy. However, Brodbeck et al. observe a robust entropy response that was close to the PS_N both temporally and areally but nonetheless statistically independent (see their Figure 2 above).
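Since cohort entropy and phoneme surprisal are easy to run together, here is a small sketch of how the two quantities differ when computed over the same cohort. The lexicon and its counts are invented, and the function names are hypothetical.

```python
import math

# Hypothetical toy lexicon with invented frequency counts.
LEXICON = {"klæʃ": 40, "klæp": 60, "klæs": 100, "kæt": 200}

def cohort(prefix):
    """Probability distribution over the words consistent with the prefix."""
    members = {w: f for w, f in LEXICON.items() if w.startswith(prefix)}
    total = sum(members.values())
    return {w: f / total for w, f in members.items()} if total else {}

def cohort_entropy(prefix):
    """Uncertainty (bits) about which word is being heard, given the prefix."""
    return -sum(p * math.log2(p) for p in cohort(prefix).values())

def phoneme_surprisal(prefix, phoneme):
    """Unexpectedness (bits) of the incoming phoneme, given the prefix."""
    p = sum(prob for w, prob in cohort(prefix).items() if w.startswith(prefix + phoneme))
    return -math.log2(p)

print(phoneme_surprisal("k", "l"), cohort_entropy("kl"))
```

Surprisal looks backward (how unexpected was the phoneme just heard?), while entropy looks forward (how uncertain is the word's identity now?). In the toy data, a phoneme can carry exactly one bit of surprisal and still leave a cohort with more than one bit of entropy.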

Returning to the topic of PS_V responses in morphologically complex words, we see that it’s important to understand whether PS_V responses are uniquely associated with a PS_V that is computed using considerations like transition probability (the likelihood of an affix given the stem) and other factors that fix the probability of the affix before assessing the likelihood of the phonemes in the affix. One could imagine instead that the PS_V that matters most for the Brodbeck/Gwilliams early PS_N is computed over cohorts of morphemes, without modulation associated with the contextual statistics of the morphemes. Functional morphemes (prepositions, determiners, complementizers) are highly predicted in syntactic context, but the PS_V relevant to the Brodbeck/Gwilliams PS_N might ignore syntactic prediction in assigning probability weights to the morphemes in the relevant cohort for the PS_V. Consider that the context-modulated PS_N we observed in the Gaston & Marantz paper was not this early PS_N, but a significantly later response (with respect to phoneme onset), and that the Brodbeck et al. study apparently included functional morphemes without contextualizing their PS_V to predicted contextual frequencies of these morphemes. A contextually unmodulated PS_V would not strictly speaking be an information theoretic PS_V, since a contextually predicted phoneme is simply not as informative as a contextually unpredicted one, and this PS_V would thus overestimate the information content of contextually predicted phonemes (say, phonemes in a suffix after a stem that highly predicts the suffix). Still, the field awaits a set of plausible processing theories that make sense of the importance of the non-contextual PS_V and make further predictions for MEG experiments (that we can run).

References

Brodbeck, C., Hong, L. E., & Simon, J. Z. (2018). Rapid transformation from auditory to linguistic representations of continuous speech. Current Biology, 28(24), 3976-3983.

Gagnepain, P., Henson, R. N., & Davis, M. H. (2012). Temporal predictive codes for spoken words in auditory cortex. Current Biology, 22(7), 615-621.

Gaston, P., & Marantz, A. (2018). The time course of contextual cohort effects in auditory processing of category-ambiguous words: MEG evidence for a single “clash” as noun or verb. Language, Cognition and Neuroscience, 33(4), 402-423.

Gwilliams, L., & Marantz, A. (2015). Non-linear processing of a linear speech stream: The influence of morphological structure on the recognition of spoken Arabic words. Brain and language, 147, 1-13.

Mesgarani, N., Cheung, C., Johnson, K., & Chang, E. F. (2014). Phonetic feature encoding in human superior temporal gyrus. Science, 343(6174), 1006-1010.

Probability distributions over infinite lists?

August 18, 2020 / Alec Marantz / 0 Comments

Recall that a grammar provides a representation of the words and sentences of a language. For standard generative grammars, the grammar is a finite set of rules that describes or enumerates an infinite list (of words and sentences). In a previous post, we conceptualized word and sentence recognition as a process of determining from auditory or orthographic input which member of the infinite list of words or sentences we’re hearing or reading. Just as in “cohort” models of auditory word recognition, one could imagine for sentence recognition a cohort of all the sentences compatible with the auditory input at each point in a sentence, and a probability distribution over these sentences. Each subsequent phoneme narrows down the cohort and changes the probability distribution of the remaining candidates in the cohort.

The cat …
/s/	/sɪ/	/sɪt/	/sɪts/
simpers	simpers	sits	sits
sits	sits	sitter
sitter	sitter	…
sniffed	…
…
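The narrowing shown in the table can be sketched in a few lines. This is only an illustration: it restricts the cohort to the four verbs listed (the real cohort of sentence continuations is open-ended), the transcriptions are rough, and the frequency counts are invented.

```python
# Rough broad transcriptions (an assumption) paired with invented frequency counts.
LEXICON = {
    "simpers": ("sɪmpərz", 5),
    "sits": ("sɪts", 60),
    "sitter": ("sɪtər", 15),
    "sniffed": ("snɪft", 20),
}

def cohort_distribution(heard):
    """Return {word: probability} over words whose transcription starts with `heard`."""
    members = {w: f for w, (phon, f) in LEXICON.items() if phon.startswith(heard)}
    total = sum(members.values())
    return {w: f / total for w, f in members.items()} if total else {}

# Each additional phoneme removes candidates and renormalizes the rest.
for heard in ("s", "sɪ", "sɪt", "sɪts"):
    print(heard, cohort_distribution(heard))
```

The renormalization step is the important part: probability mass freed up by eliminated candidates is redistributed over the survivors, so the distribution changes even for words that remain in the cohort.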

From one point of view, syntactic theory provides an account of the nature of the infinity that explains the “creativity” of language – that we understand and produce sentences that we haven’t heard or said before. The working linguist is mostly less concerned with the mathematical nature of the infinity (the subject matter to which the “Chomsky hierarchy” of grammars is related) and more concerned with what the human solution to infinity we describe in Universal Grammar tells us about the nature of language and matters like locality restrictions on long-distance dependencies. In my post on phrase structure rules and extended projections, I emphasized aspects of grammar that are finite. The core of a sentence, for example, could be viewed as the extended projection of a verb, and thus the set of sentences described/generated by the grammar could be a finite paradigm of the possibilities allowed in an extended projection.

But of course, while the set of inflected forms of a verb may be finite (as it is in English, for example), the set of extended projections of a verb is obviously not. While there may be a diverse set of “structure building” possibilities to consider here for multiplying the extended projections of verbs, the two most central considerations for infinity are usually called “arguments” and “adjuncts.” Argument noun phrases, or technically Determiner Phrases (DPs) in most contemporary theories, may occur as subjects and complements (e.g., objects) of extended projections. DPs (in at least most languages, putting aside the status of Everett’s (2005, et seq.) claims about Pirahã) may contain DPs, which may contain DPs, leading to infinity (the story about children in a country with an army with an unusual set of vehicles…). For adjuncts, consider at least the possibility of repeated prepositional phrase modification of verb phrases as a form of infinity (She played the piano on Thursday in Boston on a Steinway with a large audience in 80 degree heat…).

As noted in a previous post, the linguistically interesting account of these types of infinities involves recursion. DPs may occur in various positions, including within other DPs, but the structure of a DP is the same wherever it occurs. That is, a DP within a DP has the same structure and follows the same rules as the containing DP.

Now it’s not entirely true that the position of a phrase doesn’t determine any aspects of its internal structure. For example, the number of a noun phrase (e.g., singular or plural), which is related to the number of its head noun, determines whether it may appear as the subject of __ is a good thing (Soup is a good thing, yes; *Beans is a good thing, no). So appearing as the subject of an agreeing verb determines aspects of the internal structure of subjects in English. And the verb phrase put on the table is grammatical in the context of What did you put on the table, but not in the context of *We put on the table. So being in the context of e.g., a wh-question determines aspects of the internal syntax of the VP.

Chomsky’s (1995) Minimalist Program offers one theory of the limits on this contextual determination of the internal structure of phrases. In this theory, a set of important constituents, such as DPs and verb phrases (the part of the extended projection of a verb that includes the subject DP), are “phases.”

Phase diagram (Citko 2014: 32)

A phase presents to the outside world only its “label” and constituents at its “edge” (at the top of the phase, α above). The label of a phase is a finite set of features, including those like number which would be relevant for subject-verb agreement. The edge of the phase would include “what” in “what (you) put on the table,” which is associated with the object position between “put” and “on the table.” So the verb phrase “put on the table” is only grammatical when the object of “put” appears at the edge of the verb phrase, and the appearance of the object at the edge will ensure that the verb phrase is embedded within a larger structure that allows the object to appear in an appropriate position (What did you put on the table?).

Phase diagram for verb phrase (Citko 2014: 32)

The Minimalist Program allows for some information from inside a phase to matter for the grammatical embedding of a phase in a larger structure, but it does not allow the larger structure to mess with the internal structure of the phase beyond “selecting” for features of the label and features of the edge. And phases within phases can only help determine the grammatical position of the containing phase if they contribute features to its label or constituents to its edge.

Adopting the phase approach as a means to constrain infinity yields grammars that are not problematic to use in “parsing” (assigning a grammatical structure to) sentences as we hear them phoneme by phoneme or word by word (see e.g., Stabler (2013) and Fong (2014) for examples of “Minimalist” parsers). However, even phase-based infinity causes important difficulties for assigning a probability distribution over the members of a candidate set of sentences to compare to the linguistic input one hears or sees. How much probability should we assign to each of the infinite number of possible DPs as the subject of a sentence, for example, where the infinity is generated by DPs within DPs?

Even without these issues with infinity, the locality of syntactic dependencies, as captured by phase theory, itself puts pressure on any simple cohort style theory of sentence (or word) recognition. Since no information within a phase apart from its label and the constituents at its edge can interact with the syntactic structure above the phase, it’s not clear whether shifting probabilities for the internal structure of the phase should affect the probabilities for the containing structure as well. That is, once one has established the label and edge features of a subject DP, for example, the probability distribution over the cohort of compatible extended projections of the verb for which the DP is subject may be fixed, independent of further elaborations of the subject DP, including possible PP continuations as in the story about children in a country with an army with an unusual set of vehicles… – as far as the extended projection of the verb is concerned, we may be done computing a probability distribution after the story about children. Given the way phases fit together, this consideration about how the internal structure of a phase may affect processing of a containing phase covers one issue with infinity as well.

Note that cohort-style analyses of information-theoretic variables like phoneme surprisal always assume that the computation of cohorts can be reasonably accomplished while ignoring some possible context effects. The cohort is a set of units, perhaps morphemes or words. In any situation of language processing, there is an infinite set of contexts to consider that might affect the probability distribution over the members of the cohort, including larger words, phrases, sentences, and discourse contexts. Any experimental investigation of phoneme surprisal based on probability distributions over units must assume that these computations of cohorts and probability are meaningful even without computing the influence of some or all of these possible contextual influences.

Our MorphLab has some data, and an experiment in progress, that are relevant to this discussion. In Fruchter et al. (2015), subjects were visually presented with two-word modifier-noun phrases, one word at a time. For phrases where the second word is highly predicted by the first, like stainless steel, we found evidence that subjects retrieve the representation of the second word before they see it. This response was modulated by the strength of the prediction, but also, surprisingly, by the unigram frequency of the word. That is, even when a word is being retrieved solely on the basis of prediction, the relative frequency of that word compared to other words in the language, independent of context, modulates processing. This suggests the possibility that cohort-related phoneme surprisal responses might be driven at least in part by probability distributions over morphemes that are context-independent. Partly to test this possibility, Samantha Wray in the NeLLab is analyzing data from an experiment in which Tagalog speakers listened to a variety of Tagalog words, including compounds and reduplicated forms (involving full reduplication of bisyllabic stems). If frequency-based, but context-free, cohorts of morphemes are always relevant for phoneme surprisal, then phoneme surprisal phoneme by phoneme in the first and second copies in a reduplicated stem should be similar. By contrast, if one considers the prediction of the second copy in the reduplicated form from the first, the contextual phoneme surprisal in the second part of the reduplicated word should look nothing like the phoneme surprisal for the same phonemes in the first part of the word. So far, context-free phoneme surprisal in the second copy seems to be winning, although there are numerous complications.
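The contrast between the two predictions for reduplicated forms can be sketched as follows. Everything here is assumed for illustration: the stems are invented strings (not claimed to be Tagalog words), the counts are made up, and `REDUP_PROB`, the probability that the second stem repeats the first, is a hypothetical parameter rather than an estimate from any corpus or from the experiment.

```python
import math

# Invented stems and counts; not claimed to be real Tagalog words.
STEMS = {"taki": 50, "tako": 30, "tilu": 20}
REDUP_PROB = 0.9  # hypothetical probability that the second stem copies the first

def normalize(weights):
    """Turn raw counts into a probability distribution."""
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def surprisals(stem, dist):
    """Phoneme-by-phoneme surprisal (bits) of `stem` under a distribution over stems."""
    out = []
    for i in range(len(stem)):
        before = sum(p for s, p in dist.items() if s.startswith(stem[:i]))
        after = sum(p for s, p in dist.items() if s.startswith(stem[:i + 1]))
        out.append(-math.log2(after / before))
    return out

def context_free(stem):
    """Same values for the first and the second copy: the context is ignored."""
    return surprisals(stem, normalize(STEMS))

def contextual_second_copy(stem):
    """Second copy, given that the first copy was `stem`."""
    dist = {s: (1 - REDUP_PROB) * p for s, p in normalize(STEMS).items()}
    dist[stem] += REDUP_PROB
    return surprisals(stem, dist)

print([round(x, 3) for x in context_free("tako")])
print([round(x, 3) for x in contextual_second_copy("tako")])
```

Under the context-free measure the second copy reproduces the surprisal profile of the first; under the contextual measure the profile flattens toward zero, because the first copy has already made the second nearly certain.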

Returning to the sentence level, the phase may provide a relevant context-free “cohort” domain for assigning probability distributions to an infinite set of syntactic structures. Without abandoning the idea that syntactic processing involves consideration of whole sentence syntactic structures, we can reduce our cohorts of potential structures to finite sets if we shield phase-internal processing from the processing of the larger structures containing the phase. When we’re processing a structure containing an embedded phase, we consider only the finite set of possibilities for the label and edge properties of this phase. Once we’re processing inside the phase, we define our cohort of possible structures using only external contextual information that fixes the probability distribution over the phase’s label and its edge properties.

Applying this approach to morphological processing within words involves identifying what the relevant phases might be. Even within a phase, we need to consider the various ways in which a contextually-determined probability distribution over small (say, morpheme-sized) units might be affected by context. Much more on these topics in upcoming posts.

References

Citko, B. (2014). Phase theory: An introduction. Cambridge: CUP.

Chomsky, N. (1995). The Minimalist Program. Cambridge, MA: MIT Press.

Everett, D. L. (2005). Cultural constraints on grammar and cognition in Pirahã: Another look at the design features of human language. Current anthropology, 46(4), 621-646.

Fong, S. (2014). Unification and efficient computation in the Minimalist Program. In Lowenthal, F., & Lefebre, L. (eds.), Language and recursion, 129-138. New York: Springer.

Fruchter, J., Linzen, T., Westerlund, M., & Marantz, A. (2015). Lexical preactivation in basic linguistic phrases. Journal of cognitive neuroscience, 27(10), 1912-1935.

Stabler, E. P. (2013). Two models of minimalist, incremental syntactic analysis. Topics in cognitive science, 5(3), 611-633.

Contextual allosemy and idioms

August 11, 2020 / Alec Marantz / 5 Comments

[This post is a draft of a section of a chapter that I’m supposed to be writing with Neil Myler on contextual allosemy in Distributed Morphology for the forthcoming Handbook of Distributed Morphology. Comments and corrections solicited!]

In Distributed Morphology, idioms have been characterized as contextual allosemy. “Allosemy” is the property of having multiple meanings, either related or unrelated; “contextual allosemy” involves the choice of one of these meanings in a particular context or environment. The discussion of idioms as contextual allosemy starts with observations from the literature about locality restrictions on idioms. Notably, “subject idioms” with an idiomatic subject and verb, but an open object position, are observed to be rare or nonexistent, as are passive idioms that are not idiomatic in the active. These observations suggest a VP-sized locality domain for idioms that excludes transitive subjects. As part of the general project to dissolve the word/phrase distinction, Marantz (1997) relates these observations to Miyagawa’s (1980, 1984) claim that lexical causatives can be idiomatic, while syntactic causatives in Japanese are never idiomatic. The leading idea here is that the locality domain for idiom formation might fall inside of a phonological word. The Japanese syntactic causative, although a word-sized unit, nevertheless cannot take on an idiomatic meaning as a whole because the relationship between the causative affix and the verb stem would cross a locality barrier, much like the way that the relationship between a transitive subject and the verb does.

(25) VP idioms with V-(s)ase (Miyagawa 1984: 190)

a. hana o sak-ase
flower ACC bloom-CAUS
‘to succeed’

b. hara o her-ase
stomach ACC decrease-CAUS
‘to be hungry’

Marantz (1997) links these observations to the active voice node, which is a putative phase head in Minimalist Grammar. Such a node would separate the subject and a transitive verb, as well as the syntactic causative affix and the verb stem, which arguably contains active voice as well. The idea is that a phase boundary may occur within a word, as well as in phrases, and that phases mark the boundaries of lexical influence, such that idioms must be fully contained within a phase.

From the origins of this proposal within Distributed Morphology, any possible contrast between idioms and polysemy was blurred – “polysemy” referring to the property of having multiple related meanings, such as book the physical object and the intellectual property. Consider the title of a widely distributed paper addressing issues of stem suppletion, “‘Cat’ as a phrasal idiom” (Marantz 1995). The leading idea of the “Cat” paper was that all words, even apparently morphologically simple words like cat, decompose into at least a root and a category determining affix. The meaning of roots would always be determined contextually, within the domain of the first phase head up from the root. Calling cat a phrasal idiom, then, emphasizes both the dissolving of the word/phrase distinction and the assumption that idioms involve the same contextual calculation of root meaning as is involved in polysemy. That is, the connection between (fill the) bucket and (kick the) bucket might be parallel to that between (physical) book and (intellectual property) book.

What was missing from this early work, then, was any clear delineation among at least three possible relations between different meanings of the same phonological form: (accidental) homophonic, allosemic, and idiomatic. The two meanings of bat would, for the synchronic grammar at any rate, involve accidental homophony, the two meanings of book would involve polysemy and the two meanings of bucket an idiomatic connection. If these distinctions are real, linguistic theory should provide means beyond meaning intuitions to classify cases as one or the other. For the distinction between accidental homophony and polysemy, what’s needed is a theory of polysemy – what kinds of meaning relations are made available by grammars that might connect related meanings of the same root. Two apparent meanings of the same phonological form would involve polysemy to the extent that the relation between the meanings is analyzable within the theory of polysemy. What about the distinction between polysemy and idioms? Here, two possible generalizations emerge from the literature. First, idioms always involve a reading in addition to a possible literal reading, whereas in contextual allosemy, one reading may be forced. If there’s a bucket on the ground, you can always kick the bucket and spill its contents. However, although globe has related meanings of a sphere and the earth, global has only the earth reading. Second, when we’re considering the locality domain for interpretation, idioms seem to involve the relation between (at least) two roots, while allosemy may involve a root and a functional morpheme. Even in the case of the Japanese syntactic causatives, recent work by Yining Nie (2020) suggests that a root is involved for the causative affix in such constructions.

In an important and illuminating paper, Anagnostopoulou and Samioti (2013) present evidence from Greek adjectival participles to argue that the locality domains for idiom formation and contextual allosemy are indeed distinct, and that the domains correspond to the conflicting analyses of “special meanings” in Marantz (2001) and Marantz (2013). While active voice defines the domain for idioms, semantically relevant category determining nodes (n, v, a) create barriers for contextual allosemy. Anagnostopoulou and Samioti consider different types of deverbal adjectives formed with the suffixes –t(os) and –men(os). They argue that many –tos formations involve an adjective formed directly from a (semantic) root. With no barrier between the –tos affix and the root, these words can involve contextual allosemy on the verbal stem, with a special meaning triggered by the –tos not available for the verbal root as a verb.

(39) Stative –tos participles showing direct attachment of –tos to Root_event

a. Verb sfing-o ‘tighten’
Participle sfix-tos ‘tight, careful with money’

b. Verb ftin-o ‘spit’
Participle ftis-tos ‘spitted, spitting image’

Meanwhile, canonical –men(os) adjectives are built on semantically eventive v stems. Since the v intervenes between –menos and the root, –menos may not trigger special meanings not available for the use of the root as a verb. On the other hand, –menos adjectives may have additional, idiomatic readings not present for the verbal stem. For example, in (41), the –menos participle has either the literal meaning or the idiomatic reading, while the idiomatic reading is not present for the verb in its other forms.

(41) Eventive –menos participles

Verb trav-a-o ‘pull’
Participle trav-ig-menos ‘pulled, far-fetched’

Finally, there are –t(os) adjectives with “ability/possibility” readings parallel to –able adjectives in English. The semantics of ability adjectives implicate active voice, and Anagnostopoulou and Samioti (2013) argue that there is additional morphological evidence indicating that the ability –tos adjectives are formed from a larger stem than the perfective –tos adjectives and –men(os) adjectives. For the ability –tos adjectives, no idiomatic reading is possible; any ability reading of the adjective is paralleled by a reading of the (transitive) verb.

(51) Ability –tos adjectives have no idiomatic readings

a. –menos trav-ig-menos ‘pulled, far-fetched’
–tos aksi-o-travix-tos ‘worth pulling’

b. –menos stri-menos ‘twisted, crotchety’
–tos aksi-o-strif-tos ‘worth twisting’

The Distributed Morphology literature has presented evidence that idiom formation respects boundaries that may appear inside or outside of phonological words. This evidence supports the general thesis of DM that the phonological word is not itself a relevant constituent for syntactic and semantic principles that govern syntactic structure and meaning. However, the literature does not yet provide a comprehensive analysis of idioms or of polysemy, relying on generalizations in these areas that stand outside a general theory.

References

Anagnostopoulou, E., & Samioti, Y. (2013). Allosemy, idioms, and their domains: Evidence from adjectival participles. In Folli, R., Sevdali, C., & Truswell, R. (eds.), Syntax and its Limits, 218-250. Oxford: OUP.

Marantz, A. (1995). “Cat” as a phrasal idiom: Consequences of late insertion in Distributed Morphology. MIT: Ms.

Marantz, A. (1997). No escape from syntax: Don’t try morphological analysis in the privacy of your own lexicon. University of Pennsylvania Working Papers in Linguistics, 4(2), 14.

Marantz, A. (2001). Words. WCCFL XX handout, University of Southern California.

Marantz, A. (2013). Locality domains for contextual allomorphy across the interfaces. In Matushansky, O., & Marantz, A. (eds.), Distributed Morphology today: Morphemes for Morris Halle, 95-115. Cambridge, MA: MIT Press.

Miyagawa, S. (1984). Blocking and Japanese causatives. Lingua, 64(2-3): 177-207.

Miyagawa, S. (1980). Complex Verbs and the Lexicon. University of Arizona: PhD dissertation.

Nie, Y. (2020). Licensing arguments. New York University: PhD dissertation.
