The centuries-old field of music theory and the field of music cognition, whose lifespan is measured in decades, are becoming more and more explicitly connected, in both their research goals and methodology. Cognitive science is a collection of disciplines concerned with human information processing (Posner 1989), and music cognition is the branch of that field devoted specifically to aspects of human intelligence as they apply to music (Sloboda 1985). A telling indication of their convergence with music theory was the appearance of the first volume of Eugene Narmour’s book The Analysis and Cognition of Basic Melodic Structures: The Implication-Realization Model (Narmour 1990), the final chapter of which includes specifications of cognition experiments that could be conducted to confirm or deny hypotheses advanced in the theory.

With or without music cognition, music theory has always addressed the question of how humans experience music. When building computer programs to emulate aspects of musical skills, we do well to profit from the guidance that tradition can lend. Computer emulations represent an applied music theory, implementing ideas from classic or more recent theories and testing their function in the context of live performance. Similarly, music cognition is concerned at once with investigating the music information-processing capabilities of the mind and building computer models able to recreate some of those capabilities. Here again, perhaps even more directly, the construction of interactive music systems can be guided by research carried out in a related discipline.

In this chapter, we will examine issues in music theory and cognition as they relate to the construction of interactive music systems. Our interest extends to both theoretical correlations and some resulting practical implications for the engineering of working interactive programs.

4.1 Music Listening

Probably the skill on which all musical disciplines rely more than any other is listening: “having a good ear” describes skilled musicians of any specialty. The contrast between listening and music analysis should be drawn here: analysis is related to listening, as are all musical skills, but differs in two ways relevant to this discussion. First, music analysis has random access to the material; the analyst proceeds with the text of a written score, which he can consult in any order regardless of the temporal presentation of the piece in performance. Second, the analyst is often concerned with learning the compositional methods used in constructing the work; depending on the method, these may have anywhere from a great deal to almost no perceptual salience for the uninitiated listener. The listener, by contrast, is constrained to hear the piece from left to right, as it were, in real time. There is no random access; rather, the music must be processed as it arrives. The groupings and relations the listener forms can only arise from cognitively available perceptual events and the interaction of these events with short- and long-term memory. Seen in this light, the problems of music listening are simply part of the larger problem of human cognition.

A primary goal in the development of Cypher has been to fashion a computer program able to listen to music. The inspiration and touchstone for successful music listening is, naturally, human performance of the task, and Cypher implements what I take to be a method with plausible relations to human music processing. Still, I do not make the stronger claim that this program is a simulation of what goes on in the mind. For one thing, we simply do not yet know how humans process music. Because of the elusive nature of the thought processes the theories of cognitive psychology seek to explain, we remain unable to verify these theories. It is irrelevant to this work, however, whether the processes implemented in it correspond directly to those at work in the human mind. At best, the program offers an existence proof that the tasks it accomplishes can be done this way. The point, in any case, is not to “reverse engineer” human listening but rather to capture enough musicianship, and enough ways of learning more complete musical knowledge to allow a computer performer to interact with human players on their terms, instead of forcing an ensemble to adapt to the more limited capabilities of an unstructured mechanical rendition.

Music Cognition

Work in music cognition with implications for interactive music systems investigates the elicitation of structure from an ongoing musical flow, and the ways such structuring affects the comprehension, performance, and composition of music. Examples include the schemata theory of musical expectation (McAdams 1987), the influence of structural understanding on expressive performance (Clarke 1988), and various formal representations of pitch and rhythmic information (Krumhansl 1990).

The field of music cognition is relatively new, particularly in comparison with music theory, which is a centuries-old collection of techniques for describing the construction of musical works and the ways they are experienced. Music theory has long dealt with the tendency of Western music to exhibit directed motion, or dramatic form, in its progression through time. Observers including Wallace Berry and Leonard Meyer have discussed goal-oriented patterns of tension and relaxation (Berry 1976), and the way these arouse expectations, which are then confirmed or denied, in the mind of the listener (Meyer 1956). A perspective that describes musical discourse as movements of progression and recession around points of culmination and fulfillment is but one way of characterizing the experience of music; several studies, for example Kramer 1988, note many others. Still, the idea that enculturation and gradually acquired expectations of typical continuation play a major role in music listening has spurred several research efforts in interactive systems, including Cypher.

Expectation is, to be sure, only a part of what goes on during listening — for many kinds of music, it is not even the most important part. I am convinced that listening to music fully engages our cognitive capacity in a deep, significant way and that this stimulation is what makes music a universal and lasting interest for every human culture. Cypher does not, of course, capture all of that richness. Rather, the focus is on developing a computer program that can learn to recognize some facets of Western musical style and incorporate that knowledge in an evolving strategy for participating in live musical contexts. In the remainder of this book, I will use words such as “understanding,” “intelligence,” “learning,” and “recognition” to speak of program function. It is not the point of this usage to further fan the flames of debate raging around the aptness of these words to describe the behavior of a computer program (Winograd and Flores 1986). The work described here actually exists and has been used with human musicians in a wide variety of settings over a period of several years. Programs are said to “understand” or exhibit “intelligence” in the sense that musicians interacting with them feel that the programs are behaving musically, that they can understand what the systems do, and that they can see a musically commonsense relation between their performance and the program’s. The word “understanding” is used to describe a quality of this interaction as it is experienced by musicians performing with a program.

4.2 Hierarchies and Music Theory

In this section I present some structural perspectives developed in the field of music theory, particularly those related to questions of hierarchy and progression. The point of the discussion is to gauge the descriptive power of each organizational scheme, whether this power can be amplified in combination with other structures, and what consequences the adoption of multiple perspectives has for the coherence and versatility of theories of music.

Heinrich Schenker

In the early part of this century the Galician music theorist Heinrich Schenker developed an analytic technique with significant implications for the questions we are considering here. In particular, his work explored the combination of a strong concept of structural levels with the idea of progression from event to event within levels. The hierarchical basis of Schenker’s thought is readily apparent: “[Schenker] demonstrated that musical structure can be understood on three levels: foreground-middleground-background…. Analysis is a continuous process of connecting and integrating these three levels of musical perception” (Felix Salzer, in Schenker 1933, 14).

In the Schenkerian analysis shown in figure 4.1, notice the reduction of musical surface structures to a series of nested levels, with directed motion between certain events within each level. Such a graph portrays a double perspective on the musical flow: hierarchical in its layered organization and progressive in the implications between events within levels.

Figure 4.1 Reprinted from Heinrich Schenker, Five Graphic Music Analyses, p. 47. (c) 1969 Dover Publications, Inc.

Beyond the use of hierarchy and progression, another contribution from Schenker’s theory is his use of recursion. Recursive techniques use the same process on distinct levels of a problem, telescoping up or down to find similar structures on varying scales. “Schenkerian analysis is in fact a kind of metaphor according to which a composition is seen as the large-scale embellishment of a simple underlying harmonic progression, or even as a massively-expanded cadence; a metaphor according to which the same analytical principles that apply to cadences in strict counterpoint can be applied, mutatis mutandis, to the· large-scale structures of complete pieces” (Cook 1987, 36).

These three ideas — hierarchy, directed motion, and recursion — form a major part of the Schenkerian legacy to musical analysis; however, the application of the whole of Schenker’s thought runs up against a rigidity of structure, which unnecessarily restricts the music it can successfully treat. The Ursatz, one of the foremost concepts associated with Schenker’s name, is a background structure that Schenker claims is present in every well-composed piece of music — even though, as Schenker was aware, the Ursatz clearly does not underlie a great percentage of the world’s musical styles. The nonconformance to the Ursatz of most non-Western music, indeed of many centuries’ worth of Western music, was a sign to Schenker that such music had not reached the summit of musical thought, dictated by the forces of nature, which he believed was achieved by Western tonal music in the classical style. “If, as I have demonstrated, all systems and scale formations which have been and are taught in the music and theories of various peoples were and are merely self-deceptions, why should I take seriously the Greeks’ belief in the correctness of their prosody?” (Schenker 1979, 162; see also Narmour 1977, 38).

Schenker’s rejection of music that did not fit his structural perspective is an extreme case, but not an isolated one. I find it to be emblematic of a recurring tendency in music theory to embrace a particular structural perspective so strongly that the theorist becomes blinded to the considerable power of a listener’s mind to organize and make sense of music in ways unforeseen by any single theoretic account.

We may conclude by reiterating some relevant contributions of Schenkerian analysis — structural levels, progressive motion within levels, and recursion — and the cautionary tale of its inventor’s application of them: trying to enforce a particular perspective in describing musical thought can end by discarding a significant part of the phenomena the theory could well be used to illuminate.

Lerdahl and Jackendoff

The music theory of Heinrich Schenker and the linguistic theory of Noam Chomsky are two major influences on the work of composer Fred Lerdahl and linguist Ray Jackendoff, set forth in their book A Generative Theory of Tonal Music (Lerdahl and Jackendoff 1983). The combination of Chomsky and Schenker is certainly not incidental: as noted by John Sloboda, “their theories have some striking similarities. They both argue, for their own subject matter, that human behaviour must be supported by the ability to form abstract underlying representations” (Sloboda 1985, 11).

Lerdahl and Jackendoff’s theory is designed to produce representations of pieces of tonal music that correspond to the cognitive structure present in an experienced listener’s mind after hearing a piece. “[The theory] is not intended to enumerate what pieces are possible, but to specify a structural description for any tonal piece; that is, the structure that the experienced listener infers in his hearing of the piece” (Lerdahl and Jackendoff 1983, 6).

They focus on those parts of music cognition they consider to be hierarchical in nature.

We propose four such components, all of which enter into the structural description of a piece. As an initial overview we may say that grouping structure expresses a hierarchical segmentation of the piece into motives, phrases, and sections. Metrical structure expresses the intuition that the events of the piece are related to a regular alternation of strong and weak beats at a number of hierarchical levels. Time-span reduction assigns to the pitches of the piece a hierarchy of “structural importance” with respect to their position in grouping and metrical structure. Prolongational reduction assigns to the pitches a hierarchy that expresses harmonic and melodic tension and relaxation, continuity and progression. (Lerdahl and Jackendoff 1983, 8-9)

There are two kinds of rules in the theory: well-formedness rules and preference rules. The first rule set generates a number of possible interpretations; the second then chooses, from among those possibilities, the interpretation most likely to be selected by an experienced listener.

One of the attractions of Lerdahl and Jackendoff’s work is that it treats musical rhythm much more explicitly than do many music theories, Schenker’s being an example. In fact, they regard their theory as a foundation for Schenker’s work: where his graphs highlight the relative importance of different pitches in a composition, Lerdahl and Jackendoff point to segmentation and rhythmic structuring as the cognitive principles underlying that sense of importance. Another strength of their approach is that it coordinates the contributions of several musical features, including meter, harmony, and rhythm. Further, predictions of phrase boundaries made by their model correspond well to reports from listeners under some test conditions (Palmer and Krumhansl 1987).

Figure 4.2 Reprinted from Fred Lerdahl and Ray Jackendoff, A Generative Theory of Tonal Music, p. 144. (c) 1983 The Massachusetts Institute of Technology.

The output of the well-formedness rules is hierarchic and recursive, as figure 4.2 shows. The tree structure at the top of the figure can be divided into hierarchic levels according to the branching structure; the rules generating branches and their relative dominance at each level are the same and are applied recursively. The progressive perspective, indicating directed motion within levels, is attenuated in the Lerdahl and Jackendoff version. The successive levels of the rhythmic interpretation shown in figure 4.2 are related in an exacting tree structure; the only deviation from formal trees is that leaf nodes sometimes are linked to two adjacent nodes on the next highest level. The construction of these trees through application of the rule set makes possible comprehensive predictions of the strong/weak beat relationships experienced in a given musical context.

Again in the case of Lerdahl and Jackendoff’s theory, however, an exaggerated reliance on one perspective leads both to uncomfortable accounts of cognition, and finally to a devaluation of music not conforming to the structure. First of all, their theory does not say anything about how the representations they describe are actually formed by a listener during audition: “Instead of describing the listener’s real-time mental processes, we will be concerned only with the final state of his understanding. In our view it would be fruitless to theorize about mental processing before understanding the organization to which the processing leads” (Lerdahl and Jackendoff 1983, 3-4). This stance arises from a Chomskian desire to identify what a listener knows when she knows a work of music — what mental representations could account for the kinds of behavior we recognize as musical.

Another potential problem has to do with the cognitive reality of the upper reaches of the full tree structure they propose: “Evidence for the highest level in this structure is rather sparse, and is confined to statements by a number of composers (Mozart, Beethoven, Hindemith) which indicate that they were able to hear (or imagine) their own compositions in a single ‘glance’” (Clarke 1988, 2-3).

The reality of this proposed representation can be questioned in connection with the first problem: given what is known about human memory systems and their capacity, how can such a complete image of a piece of music be built up and remain present after a single audition or, for that matter, even after repeated listenings? Perhaps this question reduces to a chicken-and-egg problem: Lerdahl and Jackendoff maintain that it makes no sense to consider the experience of listening until a plausible model of the resulting structure has been built; others find it hard to imagine a representation without a plausible model of how it might be constructed. The fact that Lerdahl and Jackendoff have not come up with an account of how listening produces the structure they propose suggests that there is no easy solution.

Fred Lerdahl, in his essay “Cognitive Constraints on Compositional Systems” (1988), gives the following motivation for relating the theory elaborated in Lerdahl and Jackendoff 1983 to compositional technique: “Cognitive psychology has shown in recent decades that humans structure stimuli in certain ways rather than others. Comprehension takes place when the perceiver is able to assign a precise mental representation to what is perceived. Not all stimuli, however, facilitate the formation of a mental representation. Comprehension requires a degree of ecological fit between the stimulus and the mental capabilities of the perceiver” (Lerdahl 1988, 232).

Lerdahl complains that modern music composition has discarded any connection between what he refers to as the compositional grammar used in constructing a piece and the grammar listeners use to form a mental representation of it. This argument leads to his “Aesthetic Claim 2: The best music arises from an alliance of a compositional grammar with the listening grammar” (Lerdahl 1988, 256). A recurring example in Lerdahl’s essay is Boulez’s composition Le Marteau Sans Maître, which does have, if one accepts Lerdahl’s decomposition of the composing/performing/listening complex, a singularly striking decoupling of compositional and listening grammars. But, as Lerdahl points out, “this account is complicated by the fact that, as noted above, Boulez created Le Marteau not only through serial procedures but through his own inner listening. In the process he followed constraints that, while operating on the sequence of events produced by the compositional grammar, utilized principles from the listening grammar” (Lerdahl 1988, 234).

In my view, Fred Lerdahl should not be so surprised that Boulez’s serial technique, his “compositional grammar,” is often treated as if it were irrelevant, particularly when he himself notes that much of the interest of the piece comes from the workings of a musical mind operating beyond the scope of the purely formal rules. He cannot have it both ways: he cannot maintain that what makes Le Marteau a great piece of music is Boulez’s musicianship, his “intuitive constraints,” and maintain at the same time that music cannot be great unless cognition is explicitly coded into the formal system. To say that “the best music arises from an alliance of a compositional grammar with the listening grammar” and at the same time to recognize Le Marteau as a “remarkable” work when no such alliance occurs must mean that the Aesthetic Claim 2 carries little force indeed.

What is important is the way listeners make sense of music, a sense employed by composers, performers, and listeners alike — the very point so emphatically put forward by Lerdahl himself. Lerdahl and Jackendoff are the first to point out, however, that though their theory may well explain parts of that sense, it is not by any means complete. Basing aesthetic claims, and establishing constraints on composition, on an incomplete account again amounts to overestimating the theory and shortchanging the mind’s capacity to deal with many different kinds of music.

Eugene Narmour

A forceful statement against an overreliance on tree structures in music theory can be found in Narmour 1977:

If, however, as has been implied, the normal state of affairs in tonal music is noncongruence of parameters between levels instead of congruence, it follows that analytical reductions should be conceptualized not as trees (except perhaps in the most simplistic kinds of music, where each unit [form, prolongation, whatever] is highly closed) but as networks. That is, musical structures should not be analyzed as consisting of levels systematically stacked like blocks … but rather as intertwined, reticulated complexes — as integrated, nonuniform hierarchies.

Unity would then be a result of the interlocking connections that occur when implications are realized between parts rather than as a result of relationships determined by the assumption of a preexisting whole. (Narmour 1977, 97-98)

He observes in his article “Some Major Theoretical Problems Concerning the Concept of Hierarchy in the Analysis of Tonal Music” (1984) that in fact much of what passes for hierarchical structuring in music theory is not hierarchic at all, but rather “systemic,” by which he refers to “musical relationships which are conceived in Gestaltist fashion as parts of a completely integrated whole” (Narmour 1984, 138). These structures, he states, are not hierarchic because differentiations of material on lower levels are not reflected in their representation on levels higher up. Because of the loss of information as we travel up the tree, all individual characteristics of a particular piece of music are subordinated to a priori style traits; such analysis “reduces an idiostructural event to a default case of the style” (Narmour 1984, 135).

Narmour’s own theory, the implication-realization model, has as a structural consequence the postulation of an extensive, multifaceted network of connections among musical events on various levels of a composition. Following the work of Leonard Meyer and others, Narmour emphasizes the impact of listeners’ expectations on their experience of a piece of music. Further, he recognizes the operation of multiple perspectives in music cognition: “What makes the theory and analysis of music exceptionally difficult, I believe, is that pieces display both systematic and hierarchical tendencies simultaneously. And, as we shall see, this suggests that both ‘tree’ and ‘network’ structures may be present in the same patterning” (Narmour 1977, 102).

Eugene Narmour’s theory tends to assume the goal-directed, expectation-based model of music cognition. It relies heavily on the ideas of hierarchy and progression; in fact, it is the most consistently progression-oriented theory of the three reviewed here. Because Narmour sees progressions operating between noncongruent elements, however, his analytical structures tend not to resemble trees; for much the same reason, they are not recursive.

4.3 Cypher Hierarchies

Cypher’s listener and player are organized hierarchically, though these hierarchies tend toward Narmour’s network ideas rather than the more strictly structured trees of Lerdahl and Jackendoff. Further, the music representations adopted are expressed in a computer program, where processes realize the thrust of the theory. “To understand how memory and process merge in ‘listening’ we will have to learn to use much more ‘procedural’ descriptions, such as programs that describe how processes proceed” (Minsky 1989, 646). The program has been deeply influenced by several strands of thought in music theory, among them the ones we have just reviewed. Admittedly, Cypher does not describe many aspects of musical experience as well as these predecessors do. Its virtues, I would venture, are that Cypher concentrates on building a working procedural implementation of its theory, which can be tried out with music as it is performed, and that Cypher builds up its musical competence through several perspectives and the interaction of many simple, cross-connected processes.

The levels of Cypher’s hierarchies are distinguished in three ways. First, higher levels refer to collections of the objects treated by lower levels. For example, on the listening side, the lowest level examines individual events, while the next highest level looks at the behavior within a group of such events. Second, higher levels use the abstractions produced by lower levels in their processing. So, the second-level listening agents, which describe groups of events, will use the classifications of those events made by a lower-level analysis to generate a description. Third, because of the temporal nature of music, groups of events will be extended through time; therefore, higher levels in the hierarchy will describe structures that span longer durations of time.
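
To make these three distinctions concrete, the following fragment sketches one plausible shape for such a structure in C. It is only a sketch under stated assumptions: the type and field names are invented for illustration and do not reproduce Cypher’s source code.

```c
/* Illustrative sketch only: a two-level hierarchy in which
   level-2 groups collect level-1 events and describe their
   collective behavior. Names are invented, not Cypher's. */
#include <stddef.h>

typedef struct Event {
    long onset_ms;          /* time stamp of one sounding event       */
    int  pitch, velocity;   /* raw level-1 data                       */
    int  classification;    /* feature classification made on level 1 */
} Event;

typedef struct Group {
    Event  *members;        /* higher levels collect lower-level objects */
    size_t  count;
    long    span_ms;        /* higher levels span longer durations       */
    int     behavior;       /* description built from the members'
                               level-1 classifications                   */
} Group;
```

Each of the three distinctions appears here: the group collects event objects, its behavior field is computed from their lower-level classifications, and its time span necessarily exceeds that of any single member.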

Figure 4.3

Figure 4.3 shows some important ways relations are drawn between the events analyzed and generated by the program; however, this way of looking at the information is only one of many perspectives Cypher adopts in the course of carrying out various tasks. This first perspective is so strictly hierarchical that we can best depict it as a tree structure; subevents are connected to only one superevent, and not to each other. Other perspectives do not relate things in such an orderly way. The underlying idea is to provide several different ways of regarding and manipulating the parts of a musical context: “A thing or idea seems meaningful only when we have several different ways to represent it — different perspectives and different associations” (Minsky 1989, 640). This section will review the organizations suggested by different perspectives on the data under consideration and by the kinds of connections and communication linking the agents that process that data.

The Progressive Perspective

We are considering various perspectives on the operation of Cypher. One important axis around which to organize these perspectives separates the raw material from the processing. Some structures concern the way the sounding events, which make up the fabric of the analyzed or generated textures, are grouped and related. The processes that perform the analysis and generation are themselves operative on different levels, and have their own connections of communication and grouping. We have already seen one perspective on the musical events, captured by the tree structure drawn above. Another perspective places more emphasis on their progression through time and the associated relations of succession and precedence.

Figure 4.4 illustrates the progressive perspective. The objects shown represent the same sounding events pictured on level 1 of the previous figure. There the emphasis was on their subsumption in metaevents; here we see that pointers connect events to their neighbors in both temporal directions. By traversing the pointers, we can arrive at proximate events either earlier or later in time. The progressive perspective is adopted on other structures as well; harmonic progressions, patterns of rhythm or melody, and higher level groups of Cypher events all are related, at times, by the operations of succession and precedence.
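
A corresponding sketch of the progressive perspective, again with hypothetical names: the same events are threaded with pointers in both temporal directions, so that traversal can reach proximate events earlier or later in time.

```c
#include <stddef.h>

/* Hypothetical sketch: events linked to their temporal neighbors. */
typedef struct PEvent {
    long onset_ms;
    int  pitch;
    struct PEvent *prev;   /* proximate event earlier in time */
    struct PEvent *next;   /* proximate event later in time   */
} PEvent;

/* Follow successor pointers to an event n steps later (or earlier,
   using e->prev); returns NULL past either end of the sequence. */
PEvent *neighbor(PEvent *e, int steps_forward)
{
    while (e != NULL && steps_forward-- > 0)
        e = e->next;
    return e;
}
```

Giving each such event an additional pointer to its enclosing group would combine this view with the hierarchical one, anticipating the merged picture discussed below.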

Figure 4.4

Already we can see that multiple perspectives tangle the depiction of relations between sounding objects; we can perhaps combine progression and hierarchy into a single visual representation that makes sense (figure 4.5).

Figure 4.5

All events are connected in a hierarchy and simultaneously tied together in relations of succession and precedence. The addition of the progressive perspective takes our example far from the usual definition of a tree structure, however. Although the combination of two relations remains easy enough to follow, the effect of multiplying perspectives is also plain: universal laws of relation between objects become obscured and complicated when there are several ways to connect and relate them.

In Cypher, a single, simple, universal form for relating objects, such as a tree structure, has been abandoned in favor of collections of ways to relate things. Our structures may become difficult to draw, but we will not be constrained by the representation to adopt a complicated solution to a problem that can be simply treated by a more appropriate perspective. Such pragmatism may seem straightforward, but as our review of other structural approaches has shown, it is in fact an exception to common practice. Many prominent music theories devise a single structural perspective within which to describe musical behavior and continue to adhere to that perspective no matter what difficulties are encountered in trying to deal with the fullness of musical experience.

Message Routes

On the other side of the events/processing axis, Cypher includes two large collections of interacting agents: those associated with the listener and those associated with the player. The processes within each collection communicate with each other; different kinds of messages are passed between the two collections. The hierarchical perspective on these processes is largely a function of the level of the events they treat: level-1 processes deal with the individual sounding objects; level-2 processes deal with groups of these events.

Processes communicate by passing messages. These communication links form another perspective through which the ongoing musical information is regarded; the pattern of links and the messages passing across them give an indication of which features of the music are important for which tasks, and how the agents carrying out various tasks collaborate in their execution. Figure 4.6 shows some typical communication links within and between the two sides.

Figure 4.6

Although the lines of communication are hierarchical, in the sense that there is a meaningful distinction to be made between levels of processing, we see that they resemble a network of relations rather than a more strictly formed tree structure. There are many processes on each side dealing with any single sounding event. Further, some general directions of the information flow can be noted: first, information tends to go up from the sounding events to analytical processes on the listening side, and on the playing side, generation methods are sending increasingly precise information down to the level of individual events, which, when complete, are sent on to the response devices at the specified time. Another important regularity is that information passes only from the listener to the player and not in the other direction. The only communication passing from the player back to the listener involves queries for additional data.
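
The following fragment sketches that one-way flow in C. The message names, the registration mechanism, and the handler are all invented for illustration; they indicate the pattern of communication, not Cypher’s actual protocol.

```c
/* Hypothetical sketch of the one-way listener-to-player message
   flow. Names and the registration scheme are invented; they show
   the pattern of communication, not Cypher's protocol. */
#include <stdio.h>

enum Message { EVENT_CLASSIFIED, REGULARITY, PHRASE_BOUNDARY };

typedef void (*Handler)(enum Message m, long time_ms);

#define MAX_HANDLERS 8
static Handler handlers[MAX_HANDLERS];
static int     n_handlers = 0;

/* Player-side generation methods register interest in listener messages. */
void subscribe(Handler h)
{
    if (n_handlers < MAX_HANDLERS)
        handlers[n_handlers++] = h;
}

/* The listener broadcasts its analysis; information flows one way only. */
void post(enum Message m, long time_ms)
{
    for (int i = 0; i < n_handlers; i++)
        handlers[i](m, time_ms);
}

static void crescendo_per_phrase(enum Message m, long t)
{
    if (m == PHRASE_BOUNDARY)
        printf("advance crescendo at %ld ms\n", t);
}

int main(void)
{
    subscribe(crescendo_per_phrase);
    post(PHRASE_BOUNDARY, 2400);   /* listener detects a phrase boundary */
    return 0;
}
```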

4.4 Expressive Performance

The most noticeable difference between the musical performances of human and machine players is the expressive variation human musicians invariably add. Expression is performed across several parameters, among them onset timing, dynamic, and duration (articulation). Studies in performance practice such as Palmer 1988 are beginning to give us good data about the strategies humans use to play expressively, which is critical information if machine players are to learn to incorporate such expression in their own performances. Such work is of central importance to the development of interactive systems for two reasons: first, the cognitive faculties these studies highlight are an important part of human musical understanding, and second, to the extent that they can uncover operative performance strategies, these strategies can be implemented and made part of the arsenal of performance techniques available to interactive programs.

Cypher is one of the interactive systems beginning to use the findings of music cognition to implement its performance mechanisms. Human performers accomplish expressive variation through many musical parameters, including pitch inflections and timbral change. Cypher ignores timbral information and does not include continuous controls such as pitch variation. Therefore, those parameters, though highly desirable candidates for manipulation by a computer performer, are left untreated by the program. Rather, the problem of expressivity is explored through two parameters currently available to the system: variations of timing and loudness.

Expressive Timing

A sense of time is one of the most critical musical faculties. In human performance, we may separate structural from expressive timing considerations, as is done by most listeners (Clarke 1987). For example, listeners will usually interpret a series of gradually lengthening durations as a pattern of equally spaced notes undergoing a ritardando. In such a case, the structural rhythm perceived is the string of equally spaced note values. The expressive component of the percept is the ritard.

The separation of rhythmic experience into structural and expressive components is a form of categorical perception. As evidence of this, Eric Clarke discusses experiments in which two easily structured rhythmic presentations are chosen. A series of test rhythmic patterns is generated, with the two simple cases at the extremes. Intermediate steps form an interpolation between them, continuously varying durations of some notes until the opposite extreme is reached. Listeners to this sequence will perceive the straightforward rhythmic interpretation of the first pattern and continue to maintain this interpretation through several of the intermediate steps. Then as the other extreme is neared, a jump is made to an interpretation matching the pattern at the opposite extreme. Rhythmic parsing is thus done categorically, with percepts assigned to one of a limited number of simply structured possibilities, where additional timing variations are perceived as expressive, not structural.
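
A minimal sketch of that categorical split: each inter-onset interval is snapped to the nearest simple structural category (here, whole multiples of a sixteenth note at the current tempo), and whatever remains is treated as expressive deviation. The category grid and the example values are assumptions, chosen only to show the separation.

```c
/* Sketch of the structural/expressive split: snap each inter-onset
   interval to the nearest categorical value and treat the residue
   as expressive deviation. Grid and thresholds are illustrative. */
#include <stdio.h>
#include <math.h>

void parse_ioi(double ioi_ms, double sixteenth_ms,
               int *structural_units, double *expressive_ms)
{
    double units = ioi_ms / sixteenth_ms;
    *structural_units = (int)floor(units + 0.5);   /* nearest category */
    if (*structural_units < 1)
        *structural_units = 1;
    *expressive_ms = ioi_ms - *structural_units * sixteenth_ms;
}

int main(void)
{
    int units; double dev;
    parse_ioi(265.0, 125.0, &units, &dev);   /* a slightly long eighth note */
    printf("structural: %d sixteenths, expressive: %+.1f ms\n", units, dev);
    return 0;
}
```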

We have seen some basic tools for the manipulation of time in interactive systems, such as scheduling and time stamping. These techniques, however, are clearly on a level well below that of human temporal perception. The human experience of time can suggest important ways to organize and elaborate the programs’ temporal facilities. Further, many efforts already extant in computer music provide tools for working with time that are far more powerful than the elementary techniques reviewed earlier.

Time Maps

The time map described in Jaffe 1985 provides a general method for coordinating temporal deformations among several voices of output. The idea is to be able to specify tempo fluctuations in such a way that voices are able to speed up or slow down independently of one another and to return to synchronization at any desired point. As an example, imagine that a composer wants slight ritardandi to set off the cadences of certain phrases: a time map would allow the notation of each phrase as regular (for example, a series of eighth notes) but would force the performance to include delays of slightly increasing duration between successive note onsets as the cadence approaches. A time map, then, is a function describing the deviation between the performed and notated durations separating two attacks. Such maps, or tempo curves, can be stored to affect the generation of temporal performance in real time. Similarly, tempo curves can be analyzed from real performances as a way to quantify the expressive timing added by virtuoso players.
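
A sketch of the idea, not of Jaffe’s implementation: a time map as a monotonic function from score time to performance time. Here the map is the identity until a chosen point, then integrates a linearly growing tempo factor, stretching the final durations of a phrase into a ritardando. The curve and parameters are illustrative.

```c
/* Sketch of a time map in the spirit of Jaffe 1985: a monotonic
   function from score time to performance time. Identity until
   ritard_start; thereafter the local duration-stretch factor grows
   linearly from 1.0 to `stretch` at phrase_end. */
#include <stdio.h>

double time_map(double t, double ritard_start, double phrase_end, double stretch)
{
    if (t <= ritard_start)
        return t;
    double x    = t - ritard_start;          /* score time into the ritard */
    double span = phrase_end - ritard_start;
    /* integral of the factor 1 + (stretch - 1) * x / span */
    return ritard_start + x + (stretch - 1.0) * x * x / (2.0 * span);
}

int main(void)
{
    /* eighth notes through a 4-beat phrase; ritard over the last 2 beats */
    for (double t = 0.0; t <= 4.0; t += 0.5)
        printf("score %.1f -> performed %.3f\n", t, time_map(t, 2.0, 4.0, 1.5));
    return 0;
}
```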

As in the case of the example just outlined, these techniques have tended to be applied to score-driven models. As such, maps are applied most naturally in situations where there is a score or structural notation of the rhythmic content. To be able to use them in interactive systems with no score requires those systems to be able to assert goal points in their output or in some way to segment the material they are producing, such that a time map can meaningfully adjust the time offsets before performing the fragment. For example, indication of some output point as a phrase boundary can mark it as the goal of a slight decelerando, emphasizing that point’s structural importance, even if the phrase has not been stored in advance. All that is required is that the system mark the fragment as a phrase and apply a map before performance.

This is not to suggest that finding goal points in unnotated output is trivial, but it indicates one motivation for attempting to identify them. Such gestures as phrase lengthening are important cues for the perceptual system and serve to accentuate structural articulations in performed music. Their more widespread application is critical to the continued development of interactive systems, particularly as those systems are used more commonly as equal partners in human instrumental ensembles. To use time maps well in performance-driven systems will require advances in two areas: the real-time specification of goal points in musical output, such as phrase or larger structural boundaries, and the flexible description of time deformations for fragments of any duration, together with their calculation in real time.

Several interactive programming environments provide timing facilities that could support these kinds of extensions. The CMU MIDI Toolkit, for example, allows the specification of sequences by several means: text editing, capturing a MIDI stream in real time, or converting standard MIDI files produced by a commercial sequencer or other source. Playback of these scores operates through a virtual time system, which can be varied with respect to real time. Changing the clock rate of the virtual time system can produce variable playback speeds, allowing synchronization with external events (Dannenberg 1991). Such a facility could be used to apply stored time maps to designated portions of sequences in interactive performance. The problem of finding structural units across which to apply the maps remains; these must be identified by hand. But using a time map to vary the virtual time reference of the sequence playback would allow the realization of performance-driven tempo variation on stored music fragments.
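
The following is a generic sketch of a virtual time system, not the CMU MIDI Toolkit’s actual interface: virtual time advances at a variable rate relative to real time, and re-anchoring on each rate change keeps the mapping continuous.

```c
/* Generic sketch of variable-rate virtual time (not the CMU MIDI
   Toolkit's API): virtual time runs at `rate` virtual seconds per
   real second, measured from the last anchor point. */
typedef struct {
    double real_origin;    /* real time when the rate last changed */
    double virtual_origin; /* virtual time at that moment          */
    double rate;           /* virtual seconds per real second      */
} VClock;

double virtual_now(const VClock *c, double real_now)
{
    return c->virtual_origin + c->rate * (real_now - c->real_origin);
}

/* Change playback speed without a jump in virtual time: re-anchor first. */
void set_rate(VClock *c, double real_now, double new_rate)
{
    c->virtual_origin = virtual_now(c, real_now);
    c->real_origin    = real_now;
    c->rate           = new_rate;
}
```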

Time maps, or tempo curves, have deficiencies of their own which will require further study, however. The difficulty with such representations of expressive timing is that there is no general library of time maps that can be applied to any arbitrary piece of music and yield satisfactory results. In their article “Tempo Curves Considered Harmful,” Peter Desain and Henkjan Honing (1991) systematically show how the map representation fails to retain the necessary expressivity through various transformations. As an example of their argument, consider the tempo control implemented in most commercial MIDI sequencers. The idea of a tempo control is that a user can shift the speed of a performance up or down, changing the overall tempo of a sequence without degrading the quality of the performance. Desain and Honing show that the use of simple arithmetic transformations to the time offsets between events will not yield a new performance that remains close to what a human player would do when modifying the tempo of a performance by a similar amount. A human player will apply different timing variations to the same piece when the tempo is increased by 50 percent, generally by using less variation, producing a “flatter” tempo map between the ideal and performed versions. Turning up the tempo control on a sequencer, however, will use the same map shrunk by 50 percent, making variations that were effective at the slower tempo sound clownish and inappropriate at the higher speed.
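
A sketch of the distinction Desain and Honing draw: if each onset is separated into a nominal (structural) part and an expressive deviation, naive sequencer-style scaling shrinks both in exact proportion, while a more plausible fast rendition scales the structure but flattens the deviations further still. The flattening exponent here is an assumption standing in for measured performance data.

```c
#include <stdio.h>
#include <math.h>

/* Naive sequencer scaling: deviations shrink exactly with the structure. */
double naive_onset(double nominal, double deviation, double speedup)
{
    return (nominal + deviation) / speedup;
}

/* Flattened rendition: structure scales, but the expressive deviation
   shrinks faster, as players use less variation at higher tempi.
   The exponent is an assumed curve, not measured data. */
double flattened_onset(double nominal, double deviation, double speedup)
{
    return nominal / speedup + deviation * pow(speedup, -1.5);
}

int main(void)
{
    /* a beat nominally at 2.0 s, played 80 ms late, taken 50% faster */
    printf("naive: %.3f s  flattened: %.3f s\n",
           naive_onset(2.0, 0.08, 1.5), flattened_onset(2.0, 0.08, 1.5));
    return 0;
}
```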

The compositional processes implemented in Cypher provide both directed and static means of temporal deformation. Directed temporal operations perform a linear transformation of the offset times separating events, either lengthening them (decelerando) or shortening them (accelerando). Static operations add smaller, nondirectional changes to the length of event offsets. These possibilities may, as is the case with all compositional methods, be applied on any level, in response to messages arriving from the listener. Level-1 applications will change offsets on a per-event basis (see the transformation modules accelerator and decelerator described in section 6.1). Level-2 applications will be invoked in response to regularity or phrase-boundary messages. Therefore, their action will be advanced with the frequency of the appropriate messages coming from level 2; temporal deformations attached to phrase boundaries will be advanced on a per-phrase basis, for example.
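
A sketch of the two kinds of deformation; the arithmetic is illustrative, not Cypher’s accelerator and decelerator modules themselves.

```c
#include <stdlib.h>

/* Directed: linear transformation of successive offsets.
   slope > 0 lengthens them (decelerando); slope < 0 shortens
   them (accelerando). */
void directed_deform(long *offsets_ms, int n, double slope)
{
    for (int i = 0; i < n; i++)
        offsets_ms[i] = (long)(offsets_ms[i] * (1.0 + slope * (i + 1)));
}

/* Static: small, nondirectional changes added to each offset. */
void static_deform(long *offsets_ms, int n, long max_change_ms)
{
    for (int i = 0; i < n; i++)
        offsets_ms[i] += (rand() % (2 * max_change_ms + 1)) - max_change_ms;
}
```

Invoked on level 1, such functions would run once per output event; attached to a level-2 message, once per group or phrase.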

Dynamic Variation

Dynamic variation is another common conveyance of expressive performance. Changes in loudness are used to emphasize structural boundaries and to highlight directed motion toward some musical goal. Crescendo and decrescendo are clear examples of expressive dynamic variation, but they are far from the only ones.

As in the case of temporal deviations, slight, quickly changing perturbations of the dynamic level are a critical part of an expressive, or even just acceptable, musical performance. But dynamic variations cannot be applied randomly or by following only local musical constructs. The essence of expression is to use variation in pointing out structural boundaries, major articulations of the composition in progress.

Unless some of that structure is present, either laid out by a human user, or found by the program itself, dynamic variation will add unpredictability to a computer performance, but will not enhance the performance of larger musical gestures, as a human performer normally would.

Figure 4.7

The Max subpatch in figure 4.7, written by Todd Winkler, shows a method for adding some random jitter to performance variables, within a specified percentage. Slight randomization of inter-onset intervals, or dynamic values, can eliminate some of the most egregiously “mechanical” effects of computer-generated music. In fact, much of what is advertised as “human feel” in commercial synthesis systems is based on just such an idea. I have argued here that interactive systems need to move beyond strictly random variation in expressive performance, and I do not mean to suggest that Winkler proposes this patch as a solution to the problems of expression. Just such an approach can be quite effective, however, when used in conjunction with more structured variations — some fluctuation within an overall crescendo, for example, or perturbing the progression of an algorithmically generated accelerando.
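
In the same spirit, a C analog of the patch’s core idea: vary a performance parameter randomly within a specified percentage of its value. This is a sketch of the idea, not a transcription of Winkler’s subpatch.

```c
#include <stdlib.h>

/* Return value perturbed by a random amount within +-percent of it.
   Sketch of the idea behind figure 4.7, not the patch itself. */
int jitter(int value, double percent)
{
    double max_dev = value * percent / 100.0;
    double r = 2.0 * rand() / RAND_MAX - 1.0;   /* uniform in -1..1 */
    return value + (int)(r * max_dev);
}

/* e.g., jitter(64, 10.0) yields a MIDI velocity within 10% of 64 */
```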

Dynamic variation in Cypher can take three forms: crescendo, decrescendo, or a more quickly changing, unstable pattern of change in loudness. These possibilities subsume two kinds of directed motion (louder, softer) and one relatively static pattern of change. Further, these processes can operate simultaneously on different compositional levels. Imagine a series of decrescendi embedded in a larger, more global crescendo, or small, local dynamic changes within an overall decrescendo. The level structure of Cypher provides a framework for directing the application of dynamic variation to different structural planes of the compositional output. If a dynamic process is applied on level 1, for example, changes in loudness will occur on a per-event basis (see the louder and quieter transformation modules). Dynamic processes on the second level will be applied as a function of the listener message to which they are connected: a crescendo connected to the phrase boundary message, for example, will affect output on a per-phrase basis, with each successive phrase somewhat louder than the last.
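
A sketch of how one dynamic process might act on either level; the structure is illustrative, not Cypher’s louder and quieter modules themselves.

```c
/* One crescendo state, advanced at different rates depending on the
   level it is attached to. Names are illustrative. */
typedef struct {
    int velocity;   /* current loudness      */
    int step;       /* increment per advance */
} Crescendo;

int advance(Crescendo *c)
{
    c->velocity += c->step;
    if (c->velocity > 127)
        c->velocity = 127;   /* MIDI velocity ceiling */
    return c->velocity;
}

/* Level 1: call advance() for every output event.
   Level 2: call advance() once per phrase-boundary message and apply
   the result to the whole phrase, so each successive phrase sounds
   somewhat louder than the last. */
```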

Structural Anticipation

The performance of music is determined by the performers’ understanding of the composition they are playing. Performance pedagogy on all levels emphasizes the skill of interpretation, transcending the mechanics of pure technique. Once a performer has a conception of the music to convey, all of the specifics of physical interaction with an instrument are executed in accordance with the expression of that conception (Clarke 1988). Further, different performance circumstances will engender different performance strategies, still directed toward conveying the same underlying conception. External factors such as the size or resonance of a hall or faster tempi chosen by a conductor could cause the relationships between various parameters of the performance to change. These changes will, however, be made to optimize the conveyance of a performer’s understanding of the composition’s structure.

Many current interactive systems, including Cypher, are able to recognize some musical structures in a stream of events represented by MIDI. Recognition alone, however, is not enough. To perform expressively, a player (human or machine) must look ahead, anticipating the future motion of the musical discourse. A program performing expressively, therefore, must have knowledge of typical continuations in a style if it is to shape its performance in accordance with the anticipated contour of some musical structure.

Such anticipation can only come from familiarity with the norms of a musical style, providing another motivation for beginning to include a base of musical knowledge in computer music programs. Again, music theory and music cognition offer indications of how expectation works in the human experience of music. Phenomena such as tonal expectation have received significant attention from music cognition, including computer simulations. In chapter 7, we will review some of these efforts as we consider artificial intelligence applications in interactive systems.

The concluding point to be made here is that an understanding of structural functions shapes the composition, performance, and audition of music. Research in music cognition has been directed toward discovering the nature of that understanding and how it affects the performance of musical tasks. Music theory has long provided a rich collection of ideas on how musical structures are built and perceived. Interactive systems must come to grips with these problems, and they have already begun to do so. Without concepts of musical style sufficiently strong to anticipate probable continuations and points of articulation, computer programs will never be able to play music with the expression that is evident in human performances. Even the strategies humans employ in the structural cognition of music are only dimly understood. Interactive music systems need to advance in this area for the sake of their own performance levels; at the same time, interactive systems can provide a unique testing ground for theories advanced in other disciplines, such as applied theories of music and music cognition, by showing the function of theoretical constructs in real performance situations.