Page 8 of 11

Understanding sentences, Part 2

June 30, 2020 / Alec Marantz / 5 Comments

In Syntactic Structures, Chomsky (1957) provides an analysis of the English auxiliary verb system that explains both the order of auxiliary verbs, when more than one is present, and the connection between a given auxiliary and the morphological form of the auxiliary or main verb that follows. For example, progressive be follows perfect have and requires the present participle –ing suffix on the verb that follows it, as in John has been driving recklessly. Subsequent work on languages that load more “inflectional” morphology on verbs and use fewer independent auxiliary verbs has revealed that the order of tense, aspect, and modality morphology cross-linguistically generally mirrors the order of English auxiliaries: tense, modal, perfect, progressive, passive, verb (or, if these are realized as suffixes, verb, passive, progressive, perfect, modal, tense). Work on the structure of verb phrases and noun phrases has revealed a set of “functional” categories (for nouns, things like number, definiteness and case) that form constituents with the noun and appear in similar hierarchies across languages.

Grimshaw (1991) was concerned with puzzles that involve the apparent optionality of these functional categories connected to nouns and verbs. For example, a verb phrase may appear only with tense, as in John sings, or it may appear with a number of auxiliaries, in which case tense appears on the topmost/leftmost auxiliary: John was/is singing, John has/had been singing, etc. If tense c-selects for a (main) verb, does it optionally also c-select for the progressive auxiliary, perfect auxiliary, etc.? The proper generalization, which was captured by Chomsky’s (1957) system in an elegant but problematic way, is that the functional categories appear in a fixed hierarchical order from the verb up (Chomsky had the auxiliaries in a fixed linear order, rather than a hierarchy, but subsequent research points to the hierarchical solution). There’s a sense in which the functional categories are optional – certainly no overt realization of aspect or “passive” is required in every English verb phrase. Yet there is also a downward selection associated with these categories. The modal auxiliaries, for example, require a bare verbal stem, while the perfect have auxiliary requires a perfect participle to head its complement, and the progressive auxiliary requires a present participle for its own complement.

Grimshaw suggested that nouns, verbs, adjectives and prepositions (or postpositions) anchor the distribution of “functional” material like tense or number that appears with these words in larger phrases. To borrow her terminology, a “lexical” category (N, V, Adj, P) is associated with an “extended projection” of optional “functional” (non-lexical) heads. This fixed hierarchy of heads is projected above the structure in which the “arguments” of lexical categories, like subjects and objects, appear.

What emerges from this history of phrase structure within generative syntax since the 1950s is an understanding of the distribution of morphemes and phrases in sentences that is not captured by standard phrase structure rules. Lexical categories are associated with an “extended projection,” the grammatical well-formedness of which is governed by each head’s demands on the features of the phrases it combines with; for example, the perfect auxiliary wants to combine with a phrase headed by a perfect participle, and the verb rely wants to combine with a phrase headed by the preposition on. The requirements of heads are thus governed by properties related to semantic compositionality (s-selection) and not directly by subcategorization (c-selection). The “arguments” of lexical categories similarly have their distribution governed by factors of s-selection and other properties (e.g., noun phrases need case), rather than by c-selection of a particular item or by phrase structure generalizations that refer directly to category (e.g., VP → V NP, where NP is the category of the verb’s direct object).

How does this discussion of constituent structure relate to morphology and the internal structure of words? First, note that the formal statement of a selectional relation between one constituent and a feature of another constituent to which it is joined in a larger structure describes a small constituent structure (phrase structure) tree. For instance, to return to an example from Syntactic Structures, the auxiliary have in English selects for a complement headed by a perfect participle (often indicated by the form of one of the allomorphs of the perfect participle suffix –en). Chomsky formalized this dependency by having have introduced along with the –en suffix, then “hopping” the –en onto the adjacent verb, whatever that verb might be (progressive be, passive be, or the main verb). In line with contemporary theories, we might formalize the selectional properties of have with the feature in (1). This corresponds to, and could be used to generate or describe, the small tree in (1). We can suppose that the “perfect participle” features of –en are visible on the verb phrase node that contains verb-en.

(1) have : [ __ [ verb+en … ] ]

Extrapolating from this example, we can note that by combining various mini-trees corresponding to selectional features, one can generate constituent structure trees for whole sentences. That is, sentence structure to some extent can be seen as a projection of selectional features.

Here we can see the connection between the structure of sentences and the internal structure of words. It is standard practice in generative grammar to encode the distributional properties of affixes in selectional features. For example, the suffix –er can attach to verbs to create agentive or instrumental nouns, a property encoded in the selectional feature in (2) with its corresponding mini-tree.

(2) –er : [_N verb __ ]

The careful reader may notice an odd fact about the selectional feature in (2): –er, of category N, appears to c-select for the category V. Yet in our discussion of lexical categories above in the phrasal domain, we noted that nouns, verbs and adjectives don’t generally c-select for their complements; rather, lexical categories “project” an “extended projection” of “functional” heads, and s-select for complements.

The term “derivational morphology” can be used to refer to affixes that appear to determine, and thus often appear to change, the category of the stems to which they attach. Derivational affixes in English fall into at least two (partially overlapping) categories: (i) those that are widely productive and don’t specify (beyond a category specification) a set of stems or affixes to which they like to attach, and (ii) those that are non- or semi-productive and only attach to a particular set of stems and affixes. Agentive/instrumental –er is a prime example of the first set, attaching to any verb, with the well-formedness of the result a function of the semantics of the combination (e.g., seemer is odd). The nominalizer –ity is of the second sort, creating nouns from a list of stems, some of which are bound roots (e.g., am-ity), and a set of adjectives ending specifically in the suffixes –al and –able. For this second set of derivational affixes, we can say that they s-select for their complement (-ity s-selects for a “property”) and further select for a specific set of morphemes, in the same way that, e.g., depend selects for on.

But for –er and affixes that productively attach to a lexical category of stems like verbs, we do seem to have some form of c-selection: the affixes seem to select for the category of the stems they attach to. But suppose this is upside-down. Suppose we can say that being a verb means that you can appear with –er. This is very similar to saying that the form verb-er can be projected up from the verb, in the same way that (tensed) verb-s and verb-ed are constructed. That is, –er can be seen as part of the extended projection of a verb.

Extended projections are frequently analyzed as morphological paradigms when the functional material of the extended projection is realized as affixes on the head. By performing an extended projection and realizing the functional material morphophonologically, one fills out the paradigm of inflected forms of the head. On the proposed view that productive derivational morphology associated with categories of stems involves the extended projections of the stems themselves, forms in –er, for example, would then be part of the paradigm of verbs. (This discussion echoes Shigeru Miyagawa’s (1980) treatment of Japanese causatives in his dissertation.) I’ll fill in the details of this proposal, as well as explain the contrast that emerges between the two types of derivation (productive-paradigmatic vs. semi-productive-selectional), in a later post.

Finally, remember that extended projections can be phrasal. That is, the structure of an English sentence, with its possible auxiliary verbs and other material on top of the inflected main verb, is the extended projection of the verb that heads the verb phrase in the sentence. If we view the paradigms of inflected verbs and nouns as generated from the extended projections of their stems, we can view sentences in languages like English as paradigmatic – cells in the paradigm of the head verb generated via the extended projection of that verb. When we look at phonological words in agglutinative languages like Yup’ik Eskimo, we see that these words (i) can stand alone as sentences translated into full phrasal sentences in English and (ii) have been analyzed as part of the enormous paradigm of forms associated with the head verbal root of the word. These types of examples point directly to the connection between parsing words and parsing sentences.

References

Chomsky, N. (1957). Syntactic Structures. Walter de Gruyter.

Grimshaw, J. (1991). Extended projection. Brandeis University: Ms. (Also appeared in Grimshaw, J. (2005). Words and Structure. Stanford: CSLI).

Miyagawa, S. (1980). Complex verbs and the lexicon. University of Arizona: PhD dissertation.

Understanding sentences

June 23, 2020 / Alec Marantz / 0 Comments

Most approaches that try to relate linguistic knowledge to real time processing of sentences have considered phrase structure rules as a reasonable formalism for the hierarchical constituent structure of sentences. From the perspective of the history of generative linguistics, one can trace the importance of phrase structure rules to Chomsky’s argument in Syntactic Structures (1957) that our knowledge of language involves a hierarchical arrangement of words, phrases, and phrases containing phrases, rather than knowledge of a linear string of words (Marantz 2019). Introductory linguistics textbooks present sentence structure with familiar Sentence → Noun_Phrase Verb_Phrase rules, whose output is illustrated in labelled branching tree structures.

[Image from Encyclopedia Britannica]

Note the function of “node” labels in standard phrase structure rules. First, and quite importantly, a label like NP appears in more than one rule. In a textbook presentation of English grammar, NP would appear as sister to VP as the expansion of the S node, but also as sister to the Verb in the expansion of VP. The important generalization captured here is that English includes phrases whose internal structure doesn’t uniquely determine their position in a sentence. Inside an NP, we don’t know if we’re inside a subject or an object – the potentially infinite list of NPs generated by the grammar could appear in either position.

It’s true that some languages will distinguish noun phrases using case. In a canonical tensed transitive sentence, the subject might be obligatorily marked with nominative case and the object with accusative case. In languages like Russian, this case marking appears on the head noun of the noun phrase as well as on agreeing adjectives and other constituents. Importantly, however, case-marked noun phrases don’t appear in unique positions in sentences. If you’re inside a dative-marked noun phrase in Icelandic, for example, you don’t know whether you’re in the subject position, the object position, or some other hierarchical position in a sentence. Furthermore, case marking in general appears as if it’s been added “on top of” a noun phrase. That is, the internal structure of a noun phrase (the distribution of determiners, adjectives, etc.) is generally the same within, say, a dative and a nominative noun phrase. As far as phrase structure generalizations are concerned, then, a noun phrase is a noun phrase, no matter what position it is in and what case marking it has.

In phrase structure rules, a node label that appears on the right side of one rule, such as VP in (1), can appear on the left side of another rule (2) that describes the internal structure of that node. That is, a node label serves to connect a phrase’s external distribution and its internal structure.

(1) S → NP VP

(2) VP → V (NP) (NP) PP* (S)

where parentheses indicate optionality and * indicates any number of the category, including zero
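To make the notation concrete, the right-hand side of rule (2) can be read as a pattern over sequences of daughter categories. The sketch below is only an illustration: it translates the parentheses and the star into a regular expression over space-separated labels, and the names `VP_RULE` and `licensed_by_vp_rule` are invented for this example.

```python
import re

# Rule (2), VP -> V (NP) (NP) PP* (S), as a regular expression over
# space-separated category labels: "?" marks optionality, "*" marks any
# number of repetitions, including zero.
VP_RULE = re.compile(r"V( NP)?( NP)?( PP)*( S)?")

def licensed_by_vp_rule(daughters):
    """Return True if this sequence of daughter labels is a legal VP expansion."""
    return VP_RULE.fullmatch(" ".join(daughters)) is not None

intransitive = licensed_by_vp_rule(["V"])                    # V alone
ditransitive = licensed_by_vp_rule(["V", "NP", "NP"])        # two objects
with_pps = licensed_by_vp_rule(["V", "NP", "PP", "PP"])      # any number of PPs
misordered = licensed_by_vp_rule(["V", "S", "NP"])           # S must come last
```

Note that the pattern says nothing about which verb may appear with which expansion; that is exactly the gap that subcategorization features were introduced to fill.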

The development of the “X-bar” theory of phrase structure captured the important insight that node labels themselves are not arbitrary with respect to syntactic theory. Nodes are labeled according to their internal structure, and the labels themselves consist of sets of features derived from a universal list. So a noun phrase is a phrase with a noun as its “head.” More generally, there is some computation over the features of the constituents within a phrase that determines the features of the node at the top. And it’s these top node features that determine the distribution of the phrase within other phrases, since it is via these features that the node will be identified by the right side of phrase structure rules, which describe the distribution of phrases inside phrases. X-bar theory therefore provided a constrained template for phrase structure rules as well as a built-in relationship between the label of a node and its internal structure: phrases are labeled by their heads, so an XP has an X as its head.

We can describe linguists’ evolving understanding of phrase structure by reviewing a simplified history of the syntactic literature. In Aspects of the Theory of Syntax (1965), Chomsky observed an apparent difference between phrases that appear as the sister to a verb in a verb phrase and phrases that appear as the sister to the verb phrase. Individual verbs seemed to care about the category of their complements within the verb phrase, but they did not seem to specify the category of the sister to the verb phrase, the “subject” of the sentence. For example, some verbs like hit might be transitive, requiring a noun phrase, while verbs like give seem to require either two noun phrases or a noun phrase and a prepositional phrase. Chomsky suggested that the category “verb” could be further “subcategorized” into smaller categories according to their complements. Verbs like hit would carry the subcategorization feature +[__NP], putting them in the subcategory of transitive verbs and indicating that they (must) appear as a sister to a noun phrase within the verb phrase. On the other hand, verbs did not seem to specify the category of the “subject” of the sentence, which could be a prepositional phrase or an entire sentence, for example. Instead, verbs might care about whether the subject is, say, animate – a semantic feature. Verbs, then, could “select” for the semantic category of their “specifier” (the sister to the verb phrase), while they would be “subcategorized” by (or “for”) the syntactic categories of the complements they take.

It was then observed that the phrase structure rule for a verb phrase, as in (2), could be generated as the combination of (i) a union of the subcategorization features of English verbs, and (ii) some ordering statements that could follow from general principles (e.g., noun phrases want to be next to the verb and sentential complements want to be at the end of the verb phrase). From this observation came the claim that phrase structure rules themselves do not specify the categories of constituents other than their heads. That is, the distribution of non-heads within phrases, as well as their order, would follow from principles independent of the phrase structure rules.

At this point, we give a shout out to Jane Grimshaw, who contributed foundational papers to the next two developments we’ll discuss. First, Grimshaw (1979, 1981) noted that verbs like ask seem to semantically take a “question” complement, but that this complement can take the syntactic form of either a sentence (ask what the time is) or a noun phrase “concealed question” (ask the time). Other verbs, like wonder, allow the sentence complement but not the noun phrase (wonder what the time is, but not *wonder the time). Grimshaw suggests that semantic selection, like the selection for animate subjects Chomsky described in Aspects, must be distinct from selection for a syntactic category, i.e., subcategorization. She dubbed syntactic category selection “c-selection” and suggested an independence of c-selection and “s-selection” (selection for semantic category or features).

In responding to Grimshaw, David Pesetsky (1982) noted a missing cell in the table one gets by crossing the c-selection possibilities for verbs that s-select for questions. Although there are verbs like wonder that c-select for a sentence and not an NP, there are no verbs that c-select for an NP and not a sentence.

c-selection for sentence	c-selection for NP	s-selection for question
✓	✓	ask
✓	✘	wonder
✘	✓	*

Simplifying somewhat, based on this asymmetry, Pesetsky asks whether c-selection is necessary at all. Suppose verbs are only specified for s-selection. What, then, explains the distribution of concealed (NP) questions? Pesetsky notes that the distribution of noun phrases is constrained by what is called “case theory” – the need for noun phrases to be “licensed” by (abstract) case. So the ability to take a noun phrase complement is a special ability, case assignment, which is associated with the theory of case and the special status of noun phrases. By contrast, there is no parallel theory governing the syntactic distribution of embedded sentences. According to Pesetsky, then, verbs can be classified according to whether they s-select for questions. If they do, they will automatically be able to take sentential question complements, since complement sentences don’t require any extra special grammatical sauce. However, to take a concealed question, a verb must also be marked to assign case. The verb ask has this case-assigning property but the verb wonder doesn’t.
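The logic of this account can be sketched in a few lines. The function `complements` and its two boolean properties are a hypothetical encoding made up for illustration; the point is only that, once sentential complements come for free with s-selection and noun phrase complements additionally require case assignment, no combination of properties yields a verb that takes an NP question but not a sentential one.

```python
from itertools import product

def complements(s_selects_question, assigns_case):
    """Question complement types predicted for a verb on the case-based account."""
    allowed = set()
    if s_selects_question:
        allowed.add("sentence")       # embedded sentences need no case licensing
        if assigns_case:
            allowed.add("NP")         # a concealed question needs case
    return allowed

# The two verbs discussed in the text, encoded as
# (s-selects a question, assigns case).
VERBS = {"ask": (True, True), "wonder": (True, False)}
predicted = {verb: complements(*props) for verb, props in VERBS.items()}

# The missing cell: no setting of the two properties gives "NP only".
np_only_possible = any(
    complements(s, c) == {"NP"} for s, c in product([True, False], repeat=2)
)
```

Here `predicted["ask"]` contains both complement types, `predicted["wonder"]` contains only the sentence, and `np_only_possible` is false, matching the starred row of the table.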

Perhaps, then, there is no c-selection – no “subcategorization features” – at all in grammar. Rather, the range of complements to verbs (along with nouns, adjectives, and prepositions) and their order and distribution would be explained by other factors, such as “case theory.”

But is it really true that no syntactic elements are specified to want to combine with nouns, verbs or adjectives? While it seems possible to maintain that Ns, Vs, Adjs, and Ps don’t c-select, it would seem that other heads, like “Tense” or “Determiner”, want particular categories as their complements. And here’s where Grimshaw’s second important contribution to phrase structure theory will come in next time – the concept of an “extended projection” of a lexical item, like a verb (Grimshaw 1991).

References

Chomsky, N. (1957). Syntactic Structures. Walter de Gruyter.

Chomsky, N. (1965). Aspects of the Theory of Syntax. MIT Press.

Grimshaw, J. (1979). Complement selection and the lexicon. Linguistic Inquiry 10(2): 279-326.

Grimshaw, J. (1981). Form, function and the language acquisition device. In C. L. Baker and J. J. McCarthy (eds.), The Logical Problem of Language Acquisition, 165-182. MIT Press.

Grimshaw, J. (1991). Extended projection. Brandeis University: Ms. (Also appeared in Grimshaw, J. (2005). Words and Structure. Stanford: CSLI).

Marantz, A. (2019). What do linguists do? The Silver Dialogues.

Pesetsky, D. (1982). Paths and Categories. MIT: PhD dissertation.

What does it mean to recognize a morphologically complex word?

June 16, 2020 / Alec Marantz / 0 Comments

Lexical access has been formalized in various bottom-up models of word recognition as the process of deciding, among all the words of a language, which word is being presented (orally or visually). In the auditory modality, thinking of the word as unfolding from beginning to end, phoneme by phoneme, the models imagine a search among all lexical items for the items beginning with the encountered phonemes. This “cohort” of lexical competitors consistent with the observed phonemes is winnowed down as more phonemes are heard, until the “uniqueness point” of the presented word, the point at which only a single item is left in the cohort. This final item is recognized as the word being heard.

/k/	/kl/	/klæ/	/klæʃ/
clash	clash	clash	clash
clan	clan	clan
cleave	cleave	…
car	…
…
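The winnowing shown in the table can be sketched as a simple prefix filter. The four-word lexicon and its broad phonemic transcriptions are assumptions made for this illustration (a real lexicon would be far larger), and `cohort` and `uniqueness_point` are hypothetical helper names.

```python
# Toy lexicon with assumed broad transcriptions, one tuple of phonemes per word.
LEXICON = {
    "clash": ("k", "l", "æ", "ʃ"),
    "clan": ("k", "l", "æ", "n"),
    "cleave": ("k", "l", "i", "v"),
    "car": ("k", "ɑ", "r"),
}

def cohort(prefix, lexicon=LEXICON):
    """Words whose transcription begins with the phonemes heard so far."""
    n = len(prefix)
    return [w for w, phones in lexicon.items() if phones[:n] == tuple(prefix)]

def uniqueness_point(word, lexicon=LEXICON):
    """Number of phonemes after which `word` is the only cohort member left.

    Returns None if the word never becomes unique (e.g. it is a prefix of
    another word in the lexicon).
    """
    phones = lexicon[word]
    for n in range(1, len(phones) + 1):
        if cohort(phones[:n], lexicon) == [word]:
            return n
    return None

# Replay the table: the cohort shrinks as each phoneme of "clash" arrives.
stages = [cohort(LEXICON["clash"][:n]) for n in range(1, 5)]
```

With this toy lexicon, `stages` goes from all four words down to just clash, and the uniqueness point of clash falls on its final phoneme, since clan shares its first three.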

So, for apparently morphologically simple words, like cat, word recognition in the cohort-based view involves deciding which word, from a finite list provided by the grammar of the language, is being presented. For psycholinguistic processing models, we can assign a probability distribution over the finite list, perhaps based on corpus frequency, and, for auditory word recognition, we can compute the probability of each successive phoneme based on its probability distribution over the members of the cohort compatible with the input string of phonemes encountered at each point.
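As a minimal sketch of the computation just described, the probability of an upcoming phoneme can be taken as the share of the current cohort's frequency mass belonging to words that continue with that phoneme. The lexicon, the transcriptions, and especially the frequency counts below are invented for illustration, and `next_phoneme_probs` is a hypothetical helper; words that end exactly at the prefix are simply left out of the calculation here.

```python
from collections import defaultdict

# Toy lexicon: word -> (assumed transcription, invented corpus count).
LEXICON = {
    "clash": (("k", "l", "æ", "ʃ"), 20),
    "clan": (("k", "l", "æ", "n"), 30),
    "cleave": (("k", "l", "i", "v"), 10),
    "car": (("k", "ɑ", "r"), 40),
}

def next_phoneme_probs(prefix, lexicon=LEXICON):
    """P(next phoneme | phonemes so far), weighted by word frequency."""
    n = len(prefix)
    mass = defaultdict(float)
    for phones, count in lexicon.values():
        # Keep only cohort members that have a phoneme after the prefix.
        if phones[:n] == tuple(prefix) and len(phones) > n:
            mass[phones[n]] += count
    total = sum(mass.values())
    return {p: m / total for p, m in mass.items()} if total else {}

after_k = next_phoneme_probs(("k",))        # /l/ vs. /ɑ/ after hearing /k/
after_kl = next_phoneme_probs(("k", "l"))   # /æ/ vs. /i/ after hearing /kl/
```

With these made-up counts, /l/ is the more likely continuation after /k/ because three of the four words, carrying most of the frequency mass, begin with /kl/.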

But what about multimorphemic words, either derived, like teacher, or inflected, like teaches? One approach to modelling the recognition of morphologically complex words would be to assume that the grammar provides structured representations of these words as consisting of multiple parts, “morphemes,” but that these structured representations join the list of monomorphemic words as potential members of the cohorts entertained by the listener/reader when confronted with a token of a word. For psycholinguistic models, the probability of these morphologically complex units can be estimated from their corpus frequency, as with apparently monomorphemic words like cat.

An immediate issue arises, at least for inflection, that we can recognize words that we haven’t heard before. (Here, we can delay the question of how the productivity, or lack thereof, of derivational morphology might figure into an approach to morphological processing that separates derivation and inflection. The relevant issues at this point can be illustrated with inflection, like tense, agreement, case and number morphology.) Erwin Chan from UPenn has quantified this aspect of inflectional morphology (Chan 2008). As you encounter more inflected forms as a learner and fill out the “paradigms” of noun and verb stems in your language, you also encounter more new stems with incomplete paradigms. Figure 4.5 from Chan’s dissertation shows that many of the expected inflected forms of Spanish verb lemmas are frequently unattested. This is known as the sparse data problem. For any amount of input data, some inflected forms of exemplified stems will be missing, requiring one to use one’s grammar to create these inflected forms when they are needed.

The sparse data problem shows that people must be able to process (produce and understand) words they haven’t heard or read before. But this might not be a real issue for word recognition if the list of words consistent with a grammar is finite. Speakers could use their grammars to pre-generate all the (potential) words of the language and place them in a list from which the cohort of potential candidates for recognition can be derived.

The immediate problem with this approach involves the psycholinguistic processing models alluded to earlier. These models require a probability distribution over the members of a cohort, and this distribution is estimated on the basis of corpus statistics. But what is the probability associated with a novel word, one generated by the grammar but not yet encountered? If one follows this approach to word recognition, one can estimate the expected corpus frequency of a morphologically complex word generated by the grammar based on the frequency of the stem and other factors. Fruchter & Marantz (2015), for instance, estimate the surface frequency of a complex word composed of stem X and suffix Y, F(X+Y), as a function of stem frequency (F(X)), biphone transition probability (the probability of encountering the first two phonemes of the suffix, given the preceding two phonemes of the stem, BTP(Y|X)), and semantic coherence (a measure of semantic well-formedness for a complex word, SC(X,Y)).

On a “whole word” approach to lexical access, where morphologically complex words join morphologically simple words on a list of candidates for recognition from which cohorts are computed, a single measure of word expectancy related to corpus frequencies of words and stems is used to derive frequency distributions over candidate words as wholes and, in the case of auditory word recognition, upcoming phonemes. The expectation is that recognition of a morphologically complex word will be modulated by whole word corpus frequency in the same way as a monomorphemic word.

The experimental work from my lab over the past 20 years, as well as related research from other labs, has shown, however, that this whole word approach to morphologically complex word recognition makes the wrong predictions, both for early visual neural responses in the processing of orthographically presented morphologically complex words and for the phoneme surprisal responses in the processing of auditorily presented complex words. That is, being morphologically complex matters for processing at the earliest stages of recognition, a fact that is incompatible with existing whole word cohort models of word recognition, at least. For example, the presentation of an orthographic stimulus (a letter string) elicits a neural response from what has been called the Visual Word Form Area (VWFA) at about 170ms. This response is not directly modulated by the corpus frequency of a morphologically complex word, as might be expected if these words were on a stored list with monomorphemic words, but by a variable that reflects the relative frequency of the stem and the affixes. Our experiments have found that the transition probability between the stem and the affixes is the best predictor of the 170ms response (Solomyak & Marantz 2010; Lewis, Solomyak & Marantz 2011; Fruchter, Stockall & Marantz 2013; Fruchter & Marantz 2015; Gwilliams & Marantz 2018).

For auditory word recognition, a neural response in the superior temporal auditory processing regions peaking between 100 and 150ms after the onset of a phoneme is modulated by “phoneme surprisal,” a measure of the expectancy of the phoneme given the prior phonemes and the probability distribution over the cohort of words compatible with the prior phonemes. Work from my lab has shown that putting whole morphologically complex words into the list over which cohorts are defined does not yield good predictors for phoneme surprisal for phonemes in morphologically complex words (Ettinger, Linzen & Marantz 2014; Gwilliams & Marantz 2015). Gwilliams & Marantz (2015), for instance, observe a main effect of phoneme surprisal based on morphological decomposition in the superior temporal gyrus but no effect based on whole word, or linear, phoneme expectancy (Figure 3).

Figure 3. Correlation between morphological and linear measures of surprisal and neural activity in superior temporal gyrus

It is clear, then, that morphological structure feeds into the probability distribution over candidates for recognition to yield accurate measures of phoneme surprisal. However, we do not have a motivated model of how exactly this occurs. That is, although we have shown that morphological complexity matters in word recognition, we do not have a good model of how it matters.

In his dissertation, Yohei Oseki proposes to attack the issue of processing multimorphemic words by eliminating the distinction between word and sentence processing (Oseki 2018) – which is in any case a dubious categorical contrast given the insights of Distributed Morphology and related linguistic theories.

If morphologically complex words are “parsed” from beginning to end, even if presented all at once visually, then the mechanisms of recognizing and assigning structure to a morphologically complex word would be the same as the mechanisms of recognizing and assigning structure to sentences. The estimate of the “surprisal” of a complex word, then, would not be an estimate of the corpus frequency of the word but would be computed over the frequencies of the individual morphemes and over the “rules” or generalizations that define the syntactic structure of the word (see also Gwilliams & Marantz 2018). Work in sentence processing has been able to assign “surprisal” values to sentences in this way, and Oseki provides evidence that this approach yields surprisal estimates for visually presented words that correlate with the 170ms response from VWFA.

Oseki’s work, while promising, does raise a number of problems and issues to which we will return. However, it serves to connect word recognition directly with sentence processing, leading us to examine what we know and don’t know about the latter. In the essays to follow, we’ll detour through considerations of sentence recognition and of syntactic theory, with some side trips through aspects of phonology and of meaning, before returning to our initial question of identifying the best models of morphologically complex word recognition to pit against experimental data.

References

Chan, E. (2008). Structures and distributions in morphology learning. University of Pennsylvania: PhD dissertation.

Ettinger, A., Linzen, T., & Marantz, A. (2014). The role of morphology in phoneme prediction: Evidence from MEG. Brain and Language, 129, 14-23.

Fruchter, J. & Marantz, A. (2015). Decomposition, Lookup, and Recombination: MEG evidence for the Full Decomposition model of complex visual word recognition. Brain and Language, 143, 81-96.

Fruchter, J., Stockall, L., & Marantz, A. (2013). MEG masked priming evidence for form-based decomposition of irregular verbs. Frontiers in Human Neuroscience, 7, 798.

Gwilliams, L., & Marantz, A. (2015). Tracking non-linear prediction in a linear speech stream: Influence of morphological structure on spoken word recognition. Brain and Language, 147, 1-13.

Gwilliams, L., & Marantz, A. (2018). Morphological representations are extrapolated from morpho-syntactic rules. Neuropsychologia, 114, 77–87.

Lewis, G., Solomyak, O., & Marantz, A. (2011). The neural basis of obligatory decomposition of suffixed words. Brain and Language, 118(3), 118-127.

Oseki, Y. (2018). Syntactic structures in morphological processing. New York University: PhD dissertation.

Solomyak, O., & Marantz, A. (2010). Evidence for early morphological decomposition in visual word recognition: A single-trial correlational MEG study. Journal of Cognitive Neuroscience, 22(9), 2042-2057.

NYU MorphLab
