One of the primary motivations for building an analysis section into an interactive music system is the opportunity it affords for generating new output whose relation to the musical context is readily apparent to a human listener. It is my belief that trying to follow the example of human cognition in building a machine listener will be fruitful not only because humans provide the best, indeed the only, model of successful music listeners but also because couching an interactive system in terms compatible with human listening categories will encourage the generation of music that, though procedurally generated, will resonate with the human cognitive processes for which it is ultimately intended.
The division of a computer music system into listening (or analytical) and composition components seems to be a natural one. Several efforts have explicitly described their structure using such terms, and others easily fall into them. Cypher is one system where a listener/composition axis is distinctly maintained. Another clear example is the T-MAX system built by Bruce Pennycook and his colleagues at McGill University (Pennycook and Lea 1991). The HARP (Hybrid Action Representation and Planning) system, built at the University of Genoa by Antonio Camurri and others, likewise has two subsystems: one for managing sounds and analyses and another handling scores and descriptions of pieces (Camurri et al. 1991). David Wessel’s improvisation software is split into listening, composing, and performing assistants (Wessel 1991). Further, variants of the machine-listening task are evident in several other applications, such as the automatic transcription of audio recordings (Chafe, Mont-Reynaud, and Rush 1982) or tutoring systems for the piano (Dannenberg et al. 1990).
To say that building a listening capability into a computer program can enhance its musicality is not to imply that there is a generally agreed theory of what music listening is. Many competing versions of the cognitive process of listening exist, within and among fields specifically concerned with the question, such as music theory and music cognition. Building a machine listener demands making choices about the matter and means dealing with the consequences of those choices in live performance. This chapter will explore in considerable detail the listening engine of Cypher, expanding the discussion with references to other systems as these become relevant. Here as in the remainder of the text, it is crucial to note the way choices of representation and processing embody particular conceptions of the way music is experienced, thereby amplifying the power of a program to deal with musical situations in those terms and attenuating the same system’s power to address other conceptions.
The transformations of musical information made by listening systems are not lossless. In other words, the abstractions made do not produce a representation that can be transformed back into the original signal, as, for example, a Fourier analysis can be changed back into a sound-pressure wave. The abstraction of MIDI already throws away a good deal of timbral information about the music it represents. The analytical transformations described here, similarly, cannot be used to reproduce the original stream of MIDI data. Information is lost; for that reason, it is critical to take care that the parameters of the representation preserve salient aspects of the musical flow.
We will be concerned for the most part with listening systems that assume the MIDI abstraction. A major part of human musical understanding arises from perceptions of timbre, and these are discarded when MIDI represents a musical flow. Some researchers have built listening capabilities into systems that begin on the signal level, trying to identify notes and articulations in a digital audio representation. Examples would include Barry Vercoe’s recent work (Vercoe 1990), and the various forms of automatic transcription (Chafe, Mont-Reynaud, and Rush 1982). Adding timbral information to machine-listening algorithms will certainly make them more sensitive to some of the most prominent aspects of music in human perception; although a reliance on personal computers and MIDI has made timbre hard to reach currently, chapter 8 will show how timbral representations could enrich already existing models.
5.1 Feature Classification
Cypher’s listener, as we have seen, is organized hierarchically. Currently, two levels of analysis exist: the lowest level describes individual note or chord events, and the level above that describes the way these events change over time. To describe incoming events, the listener process on level 1 (a lower level, level 0, would be a timbral analysis) classifies each of them as occupying one point in a multidimensional featurespace. This is a conceptual space whose dimensions correspond to several perceptual categories; placement of an object in the space asserts that the object has the combination of qualities indicated by its relation to the various featural dimensions.
To illustrate the concept, consider a three-dimensional featurespace, bounded by the perceptual categories register, density, and dynamic. Points occupying such a featurespace are shown in figure 5.1. The position of the points in this example space is determined by assigning each event values for the three categories bounding the space; those values, taken as a vector, specify a unique location in the space for the event.
In figure 5.1, we see a cluster of points near the upper right back corner of the cube. These points represent sounding events that have been classified as high, loud chords: the values assigned to each of these qualities are near the maximum endpoint of the scale. Similarly, there is another cluster of points near the lower left front corner of the cube that, since the values assigned to all qualities are near the minimum endpoint, correspond to low, soft, single notes. We can see that events will be grouped more closely as their perceptual qualities are more closely related; points near each other in the space represent events with many perceptual qualities in common.
The featurespace actually used by the level-1 analysis has six dimensions: register, loudness, vertical density, attack speed, duration, and harmony. The values assigned to each feature are a function of the MIDI data associated with the input event and of the points in time at which the event begins and ends. For example, the loudness dimension is decided by simply comparing the event’s MIDI velocity to a threshold: those velocities above the threshold are classified as loud and those below it as soft.
Classifying loudness into one of two categories is of course unnaturally restrictive: there are certainly many more musically meaningful dynamic gradations than two. The point of this research, however, is to work on the combination and communication of many analysis agents. Therefore, the individual analysis processes are kept skeletally simple. Scaling the loudness analysis up to much finer gradations would change nothing in the design of the system other than making it bigger. Because such a change is not central to the function of the program, it has been left out of the research version.
The featurespace classification combines the output of each feature agent into a single vector, which is the main message passed from the level-1 listener to the player. The contribution of any single feature can be easily masked out of the whole. In this way, the player is continually informed of the behavior of each feature individually, and in combination. Further, higher listening levels will use the featurespace abstraction to characterize the development of each feature’s behavior over time.
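As a concrete illustration, the combined classification can be thought of as a single word in which each feature occupies a few bits, so that consumers can mask out the contribution of any one feature. The field widths, names, and layout in the following C sketch are invented for illustration and do not reproduce Cypher's actual data layout.

#include <stdio.h>

/* Illustrative bit layout for a level-1 feature vector. */
enum { LOUD_SHIFT = 0, REG_SHIFT = 1, DENS_SHIFT = 3,
       SPEED_SHIFT = 5, DUR_SHIFT = 7, HARM_SHIFT = 8 };
enum { LOUD_MASK  = 0x1  << LOUD_SHIFT,   /* 1 bit: soft/loud              */
       REG_MASK   = 0x3  << REG_SHIFT,    /* 2 bits: four registers        */
       DENS_MASK  = 0x3  << DENS_SHIFT,   /* 2 bits: note/chord spacing    */
       SPEED_MASK = 0x3  << SPEED_SHIFT,  /* 2 bits: four attack speeds    */
       DUR_MASK   = 0x1  << DUR_SHIFT,    /* 1 bit: short/long             */
       HARM_MASK  = 0x1F << HARM_SHIFT }; /* 5 bits: chord root and mode   */

/* Combine the individual feature classifications into one vector. */
unsigned make_feature_vector(unsigned loud, unsigned reg, unsigned dens,
                             unsigned speed, unsigned dur, unsigned harm)
{
    return (loud << LOUD_SHIFT) | (reg << REG_SHIFT) | (dens << DENS_SHIFT) |
           (speed << SPEED_SHIFT) | (dur << DUR_SHIFT) | (harm << HARM_SHIFT);
}

/* A consumer masks out the single feature it cares about. */
int is_loud(unsigned features)     { return (features & LOUD_MASK) != 0; }
int register_of(unsigned features) { return (features & REG_MASK) >> REG_SHIFT; }

int main(void)
{
    unsigned v = make_feature_vector(1, 3, 0, 2, 1, 7);
    printf("loud=%d register=%d\n", is_loud(v), register_of(v));
    return 0;
}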
Before continuing, it is worth clarifying some points of terminology. Cypher’s basic architecture is derived from several ideas put forward in Marvin Minsky’s book The Society of Mind (Minsky 1986). Two often-used terms from that work are agent and agency. Minsky describes the distinction between them as follows:
Agent Any part or process of the mind that by itself is simple enough to understand — even though the interactions among groups of such agents may produce phenomena that are much harder to understand.
Agency Any assembly of parts considered in terms of what it can accomplish as a unit, without regard to what each of its parts does by itself. (Minsky 1986, 326)
Thus, the feature agents described in the following sections, whose combined output specifies a point in the level-1 featurespace, are simple processes responsible for analyzing some aspect of a musical performance. As we shall see, agencies built from combinations of these simple processes are able to accomplish more complicated analytical tasks.
Note and Event
Before proceeding, let us examine some of the structures used by the program for representing music. The Note and Event structures are two of the most important ones.
typedef struct {
    unsigned char pitch, velocity;   /* MIDI pitch number and attack velocity */
    long duration;                   /* milliseconds from Note On to Note Off */
} Note;
A Note closely follows the MIDI standard’s concept of a note: it records a pitch (using the MIDI pitch-numbering system, where middle C = 60) and a velocity, representing the strength with which the note was attacked. Cypher adds a duration to these parameters, where the duration is the length of time in milliseconds between the MIDI Note On message initiating the note and the Note Off message that terminates it.
An Event structure holds at least one Note. Events are kept in doubly linked lists, which facilitate navigation forward and backward through temporally proximate events (the progressive perspective). The prev and next pointers point to the preceding and succeeding Events, respectively.
typedef struct ev {
    struct ev *prev, *next;     /* doubly linked list of Events                     */
    long time, offset;          /* absolute onset time; offset from previous Event  */
    char chordsize, external;   /* number of Notes; performed or program-generated  */
    Note data[NOTEMAX];         /* the Notes belonging to this Event                */
} Event;
The most important pieces of information added to the collection of Notes held in an Event are timestamps, which associate an Event with a point in time. There are two ways of identifying the onset time of an Event: the first records the absolute time of the Event relative to the beginning of program execution. The second method records the time offset of the Event relative to the Event preceding it. It is this second timestamp which is most often used; in fact, the absolute timestamp is used mainly to calculate the offset and the duration of Notes. The chordsize indicates how many Notes an Event contains and the data array holds the Notes themselves. The external field indicates simply whether the event arrived from the outside world or was generated by the program itself and is needed for some timing considerations.
Focus and Decay
Focus is a technique used to make feature judgments relative to the musical context actually heard. Rather than evaluating qualities against a constant scale, the measurement scale itself changes as new information arrives. When there is change within a very small range of values, the focus of the agent narrows to an area where change can be detected. When values are changing over a very wide field of possibilities, the focus pulls back to register change across a broader range of values.
Decay is the adjustment of the focal scale over time. The magnitude of the measurements made is relative to the range of change seen (the principle of focus); the scale against which measurements are made changes over time, as data that are not reinforced tend to recede from influencing the current context (the principle of decay). The first featural agent we will consider, the one classifying register, will provide a good example of these two techniques at work. The remaining agents make less use of focus and decay; either their lack of precision or more vexing theoretical problems make them less amenable to such treatment. The relation of each agent to these concepts, however, will be discussed as we proceed.
Register
The register agent classifies the pitch range within which an event is placed. At the most basic level, this classification distinguishes high from low in pitch space. Making realistic registral judgments requires more precision than a separation of high from low, however, and is further sensitive to the conditioning effects of context and timbre. Our considerations, limited by the representations of the MIDI standard, do not include timbre, and our precision is limited to two bits. Still, registral judgments in Cypher are made against a scale derived from the pitch information actually experienced by the program.
One way to classify register would be to divide a piano keyboard up into some number of equally sized divisions, compare incoming events against these divisions, and assign classifications according to which division contained them. The Max patch shown in figure 5.2 is an implementation of this idea. It is quite straightforward: Note On messages are checked to be within the expected limits of a MIDI keyboard. Then, in the expr object, 21 is subtracted from the pitch number, since 21 is the lowest pitch number on an 88-key keyboard. This makes the note numbers of incoming pitches range from zero to 87. Simply dividing the result by 22, then, yields 4 classifications, labeled from 0 to 3. This classification is seen in the number box at the bottom.
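The same fixed-scale classification can be written directly in C; the function below mirrors the arithmetic of the Max patch (subtract 21, divide by 22) and is a sketch rather than code taken from Cypher.

/* Fixed-scale register classification: classes 0 (low) through 3 (high). */
int register_fixed(int midi_pitch)
{
    if (midi_pitch < 21 || midi_pitch > 108)   /* outside the 88-key range */
        return -1;
    return (midi_pitch - 21) / 22;
}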
Although splitting up the 88 keys of the keyboard will work well for piano music and other ensemble textures that use most of the musical range, it makes little sense for other solo instruments or combinations of instruments using only a fraction of that span. Therefore, the register agent in Cypher keeps a scale of low to high based on the pitches actually analyzed up to any given moment (the principle of focus). As each new event arrives, it is compared to the scale stored for that event stream. If the event is lower or higher than either endpoint of the scale, the scale is adjusted outward to include the new event. For example, if the lowest point on the scale is MIDI pitch number 60 and a new event of 48 arrives, the low endpoint of the scale is changed to 48.
The precision of the classification reports is directly tied to the size of the scale against which the measurements are made. If the pitch scale for some stream is less than two octaves, the register agent will only distinguish between low and high. Once the span opens out to over two octaves, possible classifications expand to four: low, medium low, medium high, and high. Consequently, judgments made about pitches played in a small ambitus will have fewer gradations than judgments concerning pitches presented in a more varied context. In the case of chords, the overall register classification is decided by a “majority” rule: the register with the most pitches of the chord is declared the register of the event as a whole. If more than one register has the same number of pitches in a chord, that is, if there is a tie in register classification after application of the “majority” rule, the lowest classification among those tied will be chosen.
The adjustment of the endpoints of a pitch scale as new pitch information arrives is an example of focus in a feature agent. A necessary complement is the principle of decay, which also changes the endpoints over time, but in the opposite direction. These two operations together involve the register scale in a process of continual expansion and contraction. As we have seen, when new pitches arrive that exceed the bounds of the previously established range, the register scale grows by replacing one of the endpoints to match the new data. If an endpoint has not been reinforced for five seconds, that endpoint shrinks in toward the opposing endpoint by one halfstep (one MIDI pitch number). Thereafter, the endpoint will continue to shrink inward by one halfstep every half second. When a pitch arrives to reinforce or extend an endpoint, the scale again grows outward to meet it, and the decay timer is reset to five seconds. After a new duration of five seconds with no additional information near the extreme, the scale will again begin to shrink inward. The addition of focus and decay to the Max register agent shown in figure 5.2 would complicate matters but could easily be accomplished; Max’s timing and statistical objects, such as maximum, minimum, and timer, would be appropriate tools.
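A sketch of how focus and decay might be coded for the register scale follows. The structure, timing bookkeeping, and update policy are illustrative assumptions rather than a transcription of Cypher's source; the constants (five seconds, one half step per half second, the two-octave precision boundary) follow the description above.

/* Illustrative register scale with focus and decay; times are in milliseconds. */
typedef struct {
    int  low, high;               /* current scale endpoints (MIDI pitch numbers) */
    long low_time, high_time;     /* last time each endpoint was reinforced       */
} RegisterScale;

/* Focus: widen the scale when a pitch falls outside or reinforces an endpoint. */
void register_focus(RegisterScale *s, int pitch, long now)
{
    if (pitch <= s->low)  { s->low = pitch;  s->low_time = now; }
    if (pitch >= s->high) { s->high = pitch; s->high_time = now; }
}

/* Decay: after 5 seconds without reinforcement, shrink an endpoint inward
   by one half step, then by another half step every half second. */
void register_decay(RegisterScale *s, long now)
{
    if (now - s->low_time > 5000 && s->low < s->high) {
        s->low += 1;
        s->low_time = now - 4500;    /* next shrink occurs 500 ms later */
    }
    if (now - s->high_time > 5000 && s->high > s->low) {
        s->high -= 1;
        s->high_time = now - 4500;
    }
}

/* Precision depends on the span: two classes under two octaves, four above. */
int register_classify(const RegisterScale *s, int pitch)
{
    int span = s->high - s->low;
    int classes = (span > 24) ? 4 : 2;
    if (span <= 0) return 0;
    int c = ((pitch - s->low) * classes) / (span + 1);
    return (c < 0) ? 0 : (c >= classes ? classes - 1 : c);
}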
Dynamic
The dynamic agent classifies the loudness of events. In this case, the nature of the feature demands careful consideration of the focus and decay principles. For MIDI data, there is already a highly significant scale of possible loudness values against which events can be compared: velocity information is encoded in seven bits, giving a range of values varying from 0 to 127. Unfortunately, the perceived loudness of two different synthesizer voices, even using the same velocity value, can vary widely. Real classification of perceived loudness would require signal processing of an acoustic signal, which Cypher deliberately avoids. The listener is forced instead to rely on the MIDI velocity measurement, which records the force with which a key was struck in the case of keyboards, and some approximation of force of attack when coming from a commercial pitch-to-MIDI converter following an acoustic source.
Accepting the MIDI velocity scale of 0 to 127, it was found that focus and decay had little effect on establishing any particularly more appropriate scale of values for measuring loudness. In fact, even with the MIDI scale there seems to be no way to establish good thresholds for distinguishing velocity levels. One keyboard played normally may readily give a MIDI velocity of 110, for example, whereas another must be hammered with all the strength at one’s command to get the same reading. There is no clear way to compensate for this algorithmically: consistently low readings, for example, could come from an exceptionally delicate piece. Placing the loudness threshold in the middle of the values actually seen, then, would simply classify half of the events as loud when all of them are experienced as softly played.
For this reason, the dynamic agent is the only one that must be hand tuned for different instruments; an actual test with the physical instrument to be used must be performed to find out what MIDI velocity readings will be registered for various types of playing. With that information in hand, a threshold corresponding to the instrument can be established, distinguishing soft from loud playing. Once a threshold has been chosen, the agent simply compares the MIDI velocity of incoming events to it and reports a classification of loud or soft according to whether the event is above or below the threshold. Multiple-note events are classified as loud when more than half of the member notes are above the threshold and are classified as soft otherwise.
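The core of the dynamic agent reduces to a comparison and a majority count. The sketch below assumes a hand-tuned threshold of 100 and repeats the Note structure from above so that the fragment stands alone; it is an illustration, not Cypher's implementation.

#define LOUD_THRESHOLD 100   /* must be tuned for each physical instrument */

typedef struct { unsigned char pitch, velocity; long duration; } Note;

/* Returns 1 (loud) when more than half of the notes exceed the threshold,
   0 (soft) otherwise; a single note is simply compared directly. */
int classify_dynamic(const Note *notes, int count)
{
    int loud = 0;
    for (int i = 0; i < count; i++)
        if (notes[i].velocity > LOUD_THRESHOLD)
            loud++;
    return (2 * loud > count) ? 1 : 0;
}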
The Max patch of figure 5.3 is an implementation of the dynamic agent. The object compares the velocity values of Note On messages against a threshold of 100. If the velocity is less than or equal to 100, the number box below will show a classification of zero; otherwise, the classification will be 1. The dial above the test can be used to set the threshold value.
Density
The density agent tracks vertical density. Vertical density refers to the number and spacing of events played simultaneously; this is in contrast to horizontal density, the number and spacing of events played in succession. Horizontal density is treated by the speed agent. The techniques of focus and decay are laid aside when considering vertical density; this is because the perceptual difference between linear and chordal textures seems to remain constant regardless of the context. Similarly, though context does condition the perception of extension or clustering in chord voicings, the effect is still relatively weak. Because both of these aspects of vertical density (number of notes, spacing of notes) are represented with such low precision here, the coarser, more or less constant, perceptions are being modeled. Consequently, the thresholds used to decide classifications of density and spacing are constant as well.
The classification performed by this agent represents two related types of information: first, an event is judged to be either a single note or a chord. Second, if the event is a chord, the distance between extremes of the chord is considered, giving a classification of the chord’s spacing. The first judgment, deciding if an event is a single note or a chord, is harder than it sounds. The main problem has to do with the way MIDI information is transmitted — serially, with notes from any simultaneity sent down the wire one at a time. Another problem is finding the boundary that separates fast trills from chords. At what point do the onsets of neighboring notes follow each other so quickly that they begin to be heard as a simultaneity? The answer to the second question, in Cypher, is 100 milliseconds. (Note that this constant was arrived at simply through experimentation with MIDI gear and does not represent any empirically justifiable mark.) Notes arriving within 100 milliseconds of the first note in an event are considered part of the same event. Establishing that threshold, however, leads directly to the first problem: Must we always wait 100 milliseconds to find out if any more notes will arrive?
It took a surprisingly long time to find a good solution to this problem. The final implementation works as follows: The computer running Cypher is attached to an interface that receives and sends MIDI messages. Incoming MIDI messages are parsed, timestamped, and buffered in a MIDI driver that responds to the interrupts generated when a new message arrives. Cypher reads the buffered events from the MIDI driver in its main loop. First, a single MIDI event is read from the driver, and its timestamp is recorded. This gives us the start time of the current event. A counter, which will be incremented each time a null event is read from the driver, is set to zero. (A null event is sent from the driver when an attempt is made to read it before a new MIDI event is complete in the buffer. Many null events can separate two MIDI events, even if the duration separating them is very small.)
Now, MIDI events are repeatedly read from the driver and added to the Cypher event under construction as long as the following two conditions are true: the duration from the start time to the time of the most recent MIDI event read is less than 110 milliseconds, and the number of null MIDI events read is less than 50. These conditions are a consequence of the normal pattern of input accompanying a performed chord. First, timestamps for the notes in a chord are close to one another but almost never simultaneous; a range of 110 milliseconds will almost always catch onsets that belong together. Second, even MIDI packets with the same timestamp will sometimes be separated by null events. This is why one cannot simply keep reading events until the first null event. Allowing 50 null events to be read before giving up on additional data seems to capture almost all chords as chords, without introducing delays to system response while the program is waiting for additional data to straggle in. There is still some confusion of vertical versus horizontal density: even with this method, fast trills will sometimes be read as groups of very compact chords. An additional rule could be added that would insert a chord boundary any time the same pitch was about to be added a second time to the same chord.
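The following C sketch outlines the chord-gathering loop just described. The midi_read() function stands in for the driver interface and is an assumption, not Cypher's actual API; the null counter is treated here as a count of consecutive null reads, and the value of NOTEMAX is assumed.

#include <stdbool.h>

#define NOTEMAX      16    /* assumed maximum notes per Event        */
#define CHORD_WINDOW 110   /* milliseconds, as described in the text */
#define NULL_LIMIT   50    /* null reads before giving up            */

typedef struct { unsigned char pitch, velocity; long duration; } Note;

/* Hypothetical driver call: returns true and fills in a note and its
   timestamp when a complete MIDI event is buffered, false otherwise. */
extern bool midi_read(Note *out, long *timestamp);

/* Collect notes arriving within CHORD_WINDOW ms of the first into one
   event; returns the number of notes gathered. */
int gather_event(Note notes[NOTEMAX], long *start_time)
{
    Note n;
    long t;
    int count = 0, nulls = 0;

    while (!midi_read(&n, &t))          /* wait for the first note */
        ;
    *start_time = t;
    notes[count++] = n;

    while (nulls < NULL_LIMIT && count < NOTEMAX) {
        if (midi_read(&n, &t)) {
            if (t - *start_time >= CHORD_WINDOW)
                break;                  /* belongs to the next event; a full
                                           implementation would carry it over */
            notes[count++] = n;
            nulls = 0;
        } else {
            nulls++;                    /* null event: nothing ready yet */
        }
    }
    return count;
}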
The Max patch shown in figure 5.4 accomplishes a similar operation. The object will collect all incoming integers arriving within 100 milliseconds of each other. Then, using the objects iter and counter, the number of pitches in the list is counted. If this is greater than one, the event is classified as a chord (labeled 1), and otherwise as a single note (labeled 0).
There are two differences between this and the Cypher scheme: The Max patch of figure 5.4 will always wait 100 milliseconds before reporting the pitches recorded; because of the additional “null event” counter, Cypher is usually able to produce all the notes of a chord without waiting the full duration. Second, Cypher goes beyond the chord/note distinction of the Max patch: if it has been established that the density of an event is greater than one, the agent classifies the spacing of the chord. To do this, the extreme notes are identified, then the distance between extremes is measured. Chords falling within an octave are octave1; chords covering between one and two octaves are octave2; and chords spanning more than two octaves are wide.
Attack Speed
The speed agent classifies the temporal offset between the event being analyzed and the event previous to it in time. The offset is the duration in centiseconds between the attack of the previous event and the attack of the analyzed event: this duration is sometimes referred to as the inter-onset interval (IOI). Measuring the inter-onset interval indicates the horizontal density of events; a low IOI separates events closely spaced in time.
The speed agent currently uses an absolute scale to classify the IOI into one of four categories: events with an IOI longer than 2100 milliseconds are classified as slow; those between 2100 and 1000 milliseconds are classified as medium slow; those between 1000 and 300 milliseconds are medium fast; and those shorter than 300 milliseconds are fast. Note that the ranges decrease in size as the speed increases: the range of offsets for medium slow events is 1100 milliseconds, down to a range for medium fast events that covers 700 milliseconds.
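In C, this classification is a cascade of comparisons against the thresholds just listed; the function below is a minimal sketch using those constants.

/* Attack speed from the inter-onset interval in milliseconds:
   0 = slow, 1 = medium slow, 2 = medium fast, 3 = fast. */
int classify_speed(long ioi_ms)
{
    if (ioi_ms > 2100) return 0;
    if (ioi_ms > 1000) return 1;
    if (ioi_ms > 300)  return 2;
    return 3;
}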
The Max patch in figure 5.5 is a version of the speed agent. With each arriving note, a bang goes to both inlets of a timer object. This yields the time in milliseconds between each Note On message. Then a series of split objects classify the inter-onset interval. The object will pass durations between 100 and 300 msec out the left outlet, and anything else out the right outlet. Two more splits help classify the IOI into one of four categories, as in Cypher. The main difference between this patch and Cypher is that the latter is also performing the chord-gathering algorithm described in the previous section. Therefore, rather than measuring all arriving Note On messages, as are produced by the notein object in figure 5.5, the speed agent measures the IOI between Cypher events, which could be either individual notes or chords.
In Cypher, the threshold at which horizontal density becomes vertical density is 100 milliseconds: all raw MIDI events arriving within 100 milliseconds of the onset of a Cypher event are grouped together in that Cypher event as a chord. In other words, Cypher events constructed from MIDI data arriving from the outside world will never have an IOI less than 100 milliseconds (Cypher events generated wholly in the composition section, however, might be spaced more closely). The 100-millisecond threshold means that the range for fast offsets is limited to 200 milliseconds (the difference between 300 milliseconds, the upper bound, and 100 milliseconds, the limit below which no offsets will be recorded).
The speed feature clearly could benefit from a focus scale. A sliding series of thresholds able to reflect relative speed variations according to context would give a more lifelike representation of the experience of attack speed. Indeed, with adjustments by focus and decay, speed might be represented with somewhat less precision than would be needed without them, since the reduced classification set will always be referring to meaningful gradations. In other words, rather than having a large set of possible speeds, many of which remain unused in some context, a smaller set, adjusting itself to the data arriving, could give a more accurate picture of what is actually happening.
Duration
The duration agent classifies the length of Cypher events. A duration is the span of time between the onset, or Note On message, at the beginning of an event and the Note Off message terminating it. Unlike the agents reviewed so far, the duration agent adds a classification to the featurespace characterizing the event previous to the one whose attack has just been recorded. This is because the level-1 featurespace analysis is done as quickly as possible after the initial attack of an event. A good part of the perceptual information accompanying a musical event is generated at the onset, particularly if that event has been represented in MIDI. If it is MIDI, and continuous controller information is ignored, all of the relevant information is present at the onset — except for the event’s duration.
There are two ways to respond to this problem: either analysis can be delayed until an event has finished, or the current event can be analyzed for everything but duration. The second solution has been chosen, since durations tend to remain relatively constant from event to event, and because the responsiveness of the system improves markedly if the current event is the one analyzed and complemented by the composition section.
At present, an absolute scale is used to classify duration. Events whose duration exceeds 300 milliseconds are classified as long, and the rest as short. The duration classification should be given more precision and made available as a value relative to the current beat duration. Much as combining chord and key classifications yields the function of the chord in the key, combining durations and beat periods would yield an expression of an incoming duration in terms of whole and fractional parts of beats. Having durations available in such terms (one quarter note, dotted eighth note, etc.) could produce sequences amenable to treatment by the pattern processes described in chapter 6.
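A sketch of the current two-way classification, together with the proposed beat-relative representation, follows. The rounding to quarter-beat units is an assumption introduced for illustration; only the 300-millisecond threshold comes from the text.

/* Current classification: 1 = long, 0 = short. */
int classify_duration(long duration_ms)
{
    return (duration_ms > 300) ? 1 : 0;
}

/* Proposed extension: express a duration in quarter-beat units given the
   beat period from the beat tracker (2 = eighth note, 4 = one beat, etc.). */
int duration_in_quarter_beats(long duration_ms, long beat_period_ms)
{
    if (beat_period_ms <= 0) return 0;
    return (int)((4 * duration_ms + beat_period_ms / 2) / beat_period_ms);
}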
The implementation of focus on duration is problematic: a combination of sliding and fixed scales seems to be the correct way to proceed. For example, a grace note (a note of very short duration preceding some primary pitch) always has about the same length and is always experienced as very short, no matter how long the surrounding durations are. So grace notes should always be classified as (very) short — the classification should not change as the primary pitches become shorter or longer. More typical note durations are sensitive to context, however: an eighth note surrounded by thirty-second notes will sound longer than an eighth note of the same duration surrounded by whole notes. Perhaps relatively extreme durations should receive constant classifications, and those in the usual duration range should be subject to a more context-sensitive focal scale.
The Max patch in figure 5.6 is a schematized version of a duration agent. The select object separates Note On messages from Note Offs. When a Note On arrives, the left inlet of the timer receives a bang. For a Note Off, the right inlet is banged, yielding the time in milliseconds between the two. This duration is then compared against a threshold of 300 milliseconds, and a classification of long (1) or short (0) results.
Note that this patch is too reduced to be very useful in its present form: here it measures durations between any Note On and any Note Off regardless of whether those messages correspond to the same pitch.
5.2 Harmonic Analysis
The characteristic trait of categorical perception (the human propensity for separating some percepts into different distinct classes) is that the perceiver will tend to classify a stimulus as belonging to one category and then switch to another category as the stimulus is continually varied along some dimension that defines class membership. In other words, the perceiver will usually classify the stimulus as being one thing or the other and rarely as an object with some qualities of both. In the following two sections, I describe listening tasks resembling categorical perception: The first, harmonic analysis, maintains a theory of the current root and mode of the chordal area through which the music is passing, and, on a higher level, of the key to which those chords are related. The second task, beat tracking, performs the categorical analysis illustrated in the previous example: separating structural beat durations from the expressive and otherwise irregular spacings of event offsets.
Chord Identification
The goal of chord identification is to determine, in real time, the central pitch of a local harmonic area. The harmonic sense implemented here models a rather simple version of Western tonality. The choice of this particular orientation was made for two reasons: first, a pragmatic desire to test the function of categorical harmonic perception on easily understood tonal examples, and second, because a rudimentary understanding of tonality seems a reasonable capacity to give a program that attempts to deal with a wide variety of musical styles. To be sure, the tonal sense discussed here will be inadequate to describe many harmonic systems; a listener for many twentieth-century examples, in particular, should be programmed with a more wide-ranging vocabulary. The chord and key identification techniques described achieve a good measure of success within the target style, however, and mark out a path for adding supplemental harmonic competencies.
With the aforementioned motivations, a series of tests were performed to determine a reliable method for finding the root and mode, in a simple tonic sense, of musical passages played in real time on a MIDI keyboard. The first method closely followed the model described in Scarborough et al. 1989. In this approach, connectionist principles are applied to the problem of analyzing harmonies. A neural network with twelve input nodes is used, where each input node corresponds to one of the twelve pitch classes in the tempered scale, regardless of octave. As each incoming MIDI note arrives, positive increments are added to theories associated with six chords of which the note could be a member, and negative increments are added to all other theories.
In figure 5.7, we see an arriving “c” MIDI note shown as the leftmost input node at the bottom of the figure. Positive activation is seen spreading out from that node to the six simple triads (shown as output nodes at the top of the figure) of which the note could be a member: C major, C minor, F major, F minor, Ab major, and A minor. The negative increments, sent on to all the other output nodes, are not drawn.
The chord theories correspond to major and minor triads built on each of the twelve scale degrees. This is done not to restrict the harmonic discourse to major and minor triads but to direct the analyzer to find a likely root and mode (major or minor) for any arbitrary chord. Therefore, the network will report a root of F not only for F major or minor triads but for any chord that is more like F than anything else. Further, the output will be conditioned by the context, since a large amount of activation built up in some node will tend to keep that node dominant if subsequent input is ambiguous. The increments sent to each output node were developed by hand, through trial and error and a rudimentary consideration of traditional music theory. It would change nothing in the overall structure of the program to train a “real” neural net through back propagation and use the learned weights.
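The update rule can be sketched as follows, using the theory numbering described later in this chapter (even theory numbers are major, odd minor, with the root given by the theory number divided by two). The uniform increment of 5 and the penalty of 1 are illustrative simplifications of Cypher's hand-tuned weights, and the weight parameter merely anticipates the register and beat connections discussed below.

#define NTHEORIES 24

/* Does theory t (a major or minor triad) contain pitch class pc? */
static int contains_pc(int theory, int pc)
{
    int root  = theory / 2, minor = theory % 2;
    int third = (root + (minor ? 3 : 4)) % 12;
    int fifth = (root + 7) % 12;
    return pc == root || pc == third || pc == fifth;
}

/* Spread positive activation to the six triads containing the pitch class
   and negative activation to all other theories. */
void chord_net_update(int scores[NTHEORIES], int pc, float weight)
{
    for (int t = 0; t < NTHEORIES; t++) {
        if (contains_pc(t, pc))
            scores[t] += (int)(5 * weight);
        else
            scores[t] -= 1;
    }
}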
At any point in time, the root and mode of the current chord are taken to be those corresponding to the theory with the largest score. This method is a plausible first approximation of a tonic harmony analyzer; the unadorned version attains accuracy rates of better than 70 percent on Bach chorale examples. Successive refinements to this analysis method, however, were made not by concentrating on the weights associated with particular nodes but rather by consulting a continually growing network of related features in an effort to make the job of the chord analyzer simpler. Similarly, another network topology, including a layer of hidden nodes, might produce better results from the connectionist analysis (see section 7.5 for an overview of neural network design). The improvements gained from connection to an agency of additional analysis processes, however, would presumably still hold.
Connecting Additional Agents
Translating performed musical gestures to a MIDI stream causes simultaneous events (chords) to be serialized and sent down the wire one note after another. The effect of this serialization on connectionist analysis can be seen from the two trials shown in figures 5.9 and 5.10. The data analyzed is taken from the first phrase of the chorale Ach Gott, von Himmel sieh’ darein by J. S. Bach (figure 5.8).
The graphs of data shown for each trial should be read as follows: the row of pitch names at the top shows the 24 possible chord theories, where an uppercase letter indicates the major mode and lowercase indicates minor. So, the first two entries correspond to C major and C minor. The leftmost column shows incoming notes; these are considered without octave references and so are shown by pitch name only.
Reading across each row from the incoming pitch name, the tally associated with each chord theory is shown. Those theories with the highest value are followed by the sign “]” — for example, a value of 5] for a given theory means that that theory’s tally equals 5 and that 5 is the highest tally of all theories in that row (though there may be others with the same score). Finally, the rightmost column shows the confidence of the analyzer in the winning theory (or theories). Confidence is simply a measure of the strength of the winning theory relative to all the scores in that row, as in
certain = (high_theory/total_points_in_row)*100.
Therefore, the more a winning theory captures the points available at any one time, the higher its confidence rating will be.
In the first trial, the E major chord at the beginning of the chorale was arpeggiated with the notes played from bottom to top (to force an evaluation in that order); in the second trial, the same chord was arpeggiated from top to bottom. The absolute value for a theory of E major, after all four notes have passed, is the same in both trials (14). The certainty rating is slightly different (41 in the first trial, 32 in the second). The intermediate results, however, are strikingly different. The second trial begins with a theory of B (major or minor), and passes through G sharp minor before settling on E major. The first trial produces E major (or, at the outset, minor) throughout.
The same phenomenon will be observed for any chord when the order of evaluation is changed. Each successive pitch serves to direct a search further through the space of chord theories; different orderings of pitches necessarily result in different search paths. We are able to skirt the problem of internal path deviations arising from evaluation order differences by consulting the density agent. When that communication is added, the listener knows which notes are part of a chord and which have been played separately. Therefore, a simple refinement of the chord agent allows it to reserve classification until all the notes of a chord have been processed. The internal path through the chord theories is rendered irrelevant, since it is never seen by the rest of the system. Only the final theory, which remains reasonably stable (though certainty ratings are seen to change from one ordering to another) is broadcast.
Next, a communication path is established to the register agent. In Western tonal music, significant harmonic information tends to be presented in relatively lower registral placements. The bass voice of a four-part texture is more likely to be consonant with the prevailing harmony than the soprano: higher voices often include pitches dissonant to the harmony as passing tones or in ornamentation. Therefore, the chord identification agency was modified to give greater weight to information coming from the lowest register than to messages from higher registral areas.
We can see this illustrated in the data of figure 5.11: the pitches that have been found in the lowest register are marked by an asterisk in the far right column. The effect of these pitches on the chord theories can be seen from their greater weight; the bass “e” from the first chord, for instance, gives twice as much emphasis to the E major theory (10 points) as does the “e” higher up (5 points).
Finally, the beat agency is consulted. The heuristic here is that pitch information on the beat is more likely to be consonant with the dominant harmonic area than pitches off the beat. The beat agency conducts an initial pass over the data before the chord agency is called; so the chord process is able to obtain good information from the beat tracker about the current event. Weights associated with events on the beat are given 1.1 times the emphasis of events off the beat.
The chord identifier is able to return two kinds of information in response to query messages: the first is the current theory number. The theory number conveys both the root and mode of the chord, since twenty-four separate theories are maintained. Major and minor mode versions of the same root are paired, such that C major is theory 0, C minor is theory 1, and so on. Because of this ordering, querying processes can find the mode of the chord by taking the theory number modulo 2 (even theories are major, odd ones are minor) or can find the root of the chord by dividing the theory number by two. The second piece of information returned is the confidence rating of the analyzer in the theory. This can be used by other agents to discard theories with low confidence levels.
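In code, these two queries reduce to a modulo and an integer division; the function names below are illustrative.

/* Decode a chord theory number: even = major, odd = minor; root 0 = C. */
int chord_mode(int theory) { return theory % 2; }   /* 0 = major, 1 = minor */
int chord_root(int theory) { return theory / 2; }   /* pitch class 0-11     */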
Key Identification
In Western tonal harmony, there is a clear hierarchical relationship between the concepts of chord and key. Chords are local harmonic areas dominated by the root of a collection of pitches in a small temporal area. On a higher level, chords are related to keys: harmonic complexes whose central pitch affects the perception of tonal relations through longer spans. The level-2 harmonic analysis (hereafter referred to as key identification) has, like the chord agency, a connectionist core. The input nodes of the chord net are activated by the pitch classes of all Cypher events present in the listener’s input stream. The key identification network has twenty-four input nodes, which are activated by the chord classifications from the harmonic analysis agency one level down. Each arriving chord classification, one per event, will activate an input node, which in turn spreads activation among the twenty-four output nodes interpreted as corresponding to the minor and major modes built on each of the twelve scale degrees.
The key theories that are most positively influenced by an incoming major chord are those for which the chord could be the tonic, dominant, or subdominant. Other increments vary by the mode and scale degree of the chord and the mode of the prevailing key theory (since chords on some scale degrees will be minor in the major key and major in the minor key). In figure 5.12 we see twelve input nodes represented, for major mode chord reports from the twelve scale degrees. The complete network has twenty-four input nodes. In the figure, positive activation is spreading from the input to key theories for which the chord is tonic, subdominant, or dominant. Negative activation, which is simultaneously spreading from the same input node to all other key theories, is not shown.
The weights used were determined by trial and error rather than by a machine-learning technique such as the back-propagation algorithm. Again, these weights could easily be replaced by a new set developed with a learning rule. In developing them by hand, we noticed that finding good negative increments for key weights, as for chord weights, is in a sense more important than establishing positive ones. In this scheme, only four theories receive positive increments from a chord input. The other twenty theories receive negative or null increments. Well-chosen negative weights are important because they break down an established theory and make room for a new one to emerge. Particularly when keys are being tracked, one theory will tend to remain strongest for a long period, but must be supplanted by another rather quickly. Negative weights, coupled with an upper bound on theory strength, keep any theory from becoming so dominant that the analysis will become sluggish at finding changes of key.
Key Agency Connections
The connectionist part of the key identification process is again at the core of a larger agency, in which other featural and higher-level agents are linked together with the key net to form a complete, more accurate reading of the current harmonic context. The first agent added to the key net is the one tracking vertical density. Information is sent from the chord agency to the key agency with every input event. It is of interest to the key analyzer, however, to know the vertical density associated with arriving chord reports. Many Western musical styles will present nonharmonic pitches in passing tones, ornaments, or other kinds of linear embellishments to a more chordal texture. This heuristic is represented in the key agent by giving greater weight to harmonic information presented in simultaneities. Linear pitch material will still contribute to the analysis but will accumulate more slowly than chordal input. This connection to an analysis of vertical density enables us to know what kind of behavior to expect given the three possible textural types: (1) all chords: harmonic analyses carry the same weight, since there is no difference in vertical density, (2) completely linear: key analyses change more slowly than in denser textures, (3) combinations of linear and chordal material: denser material will advance harmonic analyses more quickly than do the (typically) more nonharmonic linear voices.
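A minimal sketch of this weighting follows; the 0.5 scale factor for linear input is an assumption chosen only to show the shape of the heuristic.

/* Chordal input advances the key theories at full weight,
   linear (single-note) input more slowly. */
float key_increment_weight(int chordsize)
{
    return (chordsize > 1) ? 1.0f : 0.5f;
}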
Information flows from the chord identification to the key analyzer; an improvement to the chord identification algorithm was made by including a feedback path from the key to the chords. This is because the chord finder often produces ties between two rival chord theories. One typical case is confusion between major and minor theories based on the same root. This situation cannot be disambiguated until the third of the chord is heard. Ties between other theories frequently arise, however, particularly when the analysis is in transition from one dominant theory to another. In these instances, the current key analysis is brought to bear: a table of probabilities for each chord in each key is consulted, and the most likely chord for the current key is chosen as the output of the chord identification process.
The chord agency includes the neural net chord analysis core, informed by the register, density, and key analysis agents. Each of these agents performs its own specialized task and communicates the results of this analysis to other agencies that profit from a broader context. The chord agency itself could benefit from an expanded repertoire of contributing agents: duration and dynamic could also help indicate relative importance among incoming events, such that the pitches associated with louder or longer events receive greater weight in the chord identification process.
In figure 5.13, we see the connections between the network core of the chord agency and other contributing agents. Dynamic and duration are connected by dotted lines to indicate their potential, but not actual, contribution. Beat tracking has a two-way connection to the chord net. The two-way link indicates that the agencies are consulting each other. Chords and keys are similarly connected in both directions. Finally, the density and register agents are shown sending information to the chord and/or key analyses.
Bruce Pennycook’s team at McGill University has built similar agency-style networks of expertise to perform real-time listening tasks. Their tracker object, built into a specialized Max environment, is able to track the beat period, metric placement, and chord progression of a jazz trio consisting of piano, bass, and drums (Pennycook 1992). The beat-tracking agent estimates a beat period from the drum lead-in to a performance. In an algorithm similar to the one discussed in the following section, the durations of various subdivisions of the basic period are calculated and compared against actually arriving inter-onset intervals. The current beat period is then adjusted as the real durations deviate from the estimated attack arrivals. Further, this system can distinguish beats in a measure because of regularities in the lead-in. Using that knowledge, harmonic information coming from the bass instrument on the downbeat is given a privileged status in the calculation of the current chord. Here again we see an interplay between harmonic and rhythmic tracking, where heuristic knowledge of rhythmic activity serves to strengthen harmonic estimates.
5.3 Beat Tracking
Beat tracking is the process of finding a duration to represent the perceived interval of a beat, described as that level of temporal periodicity in music to which a listener would tap a foot, or a conductor move a baton (Chung 1989).
Grosvenor Cooper and Leonard Meyer distinguish between three levels of rhythmic activity: the first level is a pulse, a regularly recurring succession of undifferentiated events. The second level is meter, which differentiates among regularly recurring events, and the third is rhythm, in which particular patterns of strong and weak beats arise (Cooper and Meyer 1960). We can use these distinctions to discuss the beat-tracking capabilities built into interactive music systems. Several of them, including Cypher, attempt to identify only the lowest level of rhythmic activity — the program’s task is to find the interval of a basic pulse in the music without grouping these pulses into a sense of meter.
In all of the applications we will consider here, there is no representation of expected input available to the program. In other words, no pattern matching can be done between live input and a stored score. Score followers, reviewed elsewhere, effectively perform beat tracking by matching in this way. Here the problem is to find a beat pulse in completely novel input. Further, we are interested in finding the beat in real time. A number of systems scan musical representations in several passes to find a beat (Chafe, Mont-Reynaud, and Rush 1982). Although we will look at how these techniques could be adapted to real time systems, such programs will not be our primary focus.
Beat Tracking in Cypher
Cypher accomplishes beat tracking with an agency built around a connectionist core tuned to find beat periodicities in MIDI data. The beat agency is quite similar to the agencies for chord and key identification in this respect and in fact consults many of the same agents that contribute to the other tasks. The beat agency sends to the chord, key, and phrase agencies the progress of its analysis and in turn consults the feature agents and the chord agency to assist in determining a plausible beat interpretation of the incoming MIDI stream.
The connectionist core of the beat agency maintains a large number of theories, which represent possible beat periods. Theories are maintained for all possibilities between two extremes judged to be the limit of a normal beat duration. The limits in this case were taken from the indications on a musical metronome, which has historically been found sufficient for typical beat durations. These limits are 40 beats per minute on the slow end, and 208 beats per minute on the fast end. Metronomic beat periods then fall in the range of 288 to 1500 milliseconds. In the multiple theory algorithm, separate theories are maintained for all possible centisecond offsets within this range; in other words, offsets from 28 to 150 centiseconds (a total of 123 possibilities) are regarded as possible beat periods.
There are two parts to a beat theory: the theory’s points and an expected arrival time. There are also two ways for a theory to accrue points: First, each incoming event spawns a list of candidate beat interpretations. The members of this list are all awarded some number of points. Second, the arrival times of incoming events are checked against the predictions of all nonzero theories. Theories whose prediction coincides with the real event are given additional points. Figure 5.14 illustrates this principle: the top row of arrows represents actual incoming MIDI events. The lower row of arrows represents the event arrivals predicted by a beat theory. When the arrows of actual events line up with the predicted arrivals, the corresponding theory is positively incremented. When no actual arrow aligns with a beat prediction, the theory loses strength.
The algorithm works as follows: When an event arrives for analysis on level 1, it is passed to the beat-tracking agency. The first thing the tracker does is to examine the expected event arrival times of all theories. If the real arrival coincides with an expected arrival for any nonzero theory (within a small margin to allow for performance deviations), points are added to that theory’s score. If the real arrival comes before an expected arrival, the real offset is subtracted from the expected offset, so that on the next event the agent will still be looking for a “hit” at the same absolute time. If the real offset arrives later than an expected offset, points are subtracted from that theory’s score. The heuristic here is that syncopations are unlikely; that is, true beat pulses will usually have events aligned with them.
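The prediction-checking step can be sketched as follows, with times kept in centiseconds to match the theory range given above. The structure, the tolerance window, and the point values are illustrative (the increments of six and seven echo the changes visible in the trace of figure 5.15, but the real weights are Cypher's own); by storing an absolute expected time, an early arrival simply leaves the expectation in place, matching the behavior described in the text.

#define MIN_PERIOD 28    /* centiseconds */
#define MAX_PERIOD 150
#define NPERIODS   (MAX_PERIOD - MIN_PERIOD + 1)   /* 123 beat theories */
#define TOLERANCE  3     /* allowed performance deviation, centiseconds */

typedef struct {
    int  points;         /* strength of this beat theory                  */
    long expected;       /* absolute time (cs) of the next expected beat  */
} BeatTheory;

void check_predictions(BeatTheory theories[NPERIODS], long now)
{
    for (int i = 0; i < NPERIODS; i++) {
        BeatTheory *t = &theories[i];
        long period = MIN_PERIOD + i;
        if (t->points == 0)
            continue;
        long diff = now - t->expected;
        if (diff < -TOLERANCE) {
            continue;                    /* early: keep waiting for a hit
                                            at the same absolute time     */
        } else if (diff <= TOLERANCE) {
            t->points += 6;              /* hit: reward the theory        */
            t->expected = now + period;  /* schedule the next beat        */
        } else {
            t->points -= 7;              /* syncopation: penalize         */
            if (t->points < 0) t->points = 0;
            t->expected += period;
        }
    }
}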
The beat tracker can return two different values in response to queries: (1) the absolute time of the next expected beat, and (2) the current beat period. The first response allows other processes to schedule events to coincide with the expected next beat. The second value can be used to schedule events at regular intervals corresponding to the calculated beat period.
Generating Candidate Interpretations
The first part of the beat-tracking algorithm uses the syncopation heuristic; theories whose estimated time of arrival coincides with a real event are rewarded, and those theories with ETAs before the incoming event (that is, the incoming event is syncopated with respect to the theory) are penalized.
The second step in the algorithm looks for the five most likely beat interpretations of the incoming event offset. Two candidates are found from the offset itself and from the offset of the previous event. Then a set of factors is used to generate the rest: the members of the set are each multiplied in turn with the offset of the incoming event to produce a possible beat duration. If the resulting duration is inside the acceptable range (on the metronome scale), it is added to the list of candidates. If the factored offset is off the metronome (outside the approved range), it is rejected, and the next factor on the list is tried. This process continues until five candidate interpretations are found.
Initially, two candidates are generated independently of the factors set. The first interpretation evaluated is always the offset associated with the incoming event; if that offset by itself is within the accepted range, it is given first place on the list. The second interpretation evaluated is the result of adding together the current and previous event offsets. The idea here is that the true beat duration will often not be directly present in the input, for example in the case of a regularly recurring quarter/eighth duration pattern (such as might arise in 6/8 time). In such a situation, the true beat duration is a dotted quarter; that duration, however, will be rarely present in the input. It would arise from the factorization process, as we shall see in a moment. It is quite effective to consider the simple possibility of adding adjacent offsets, however, since (as in the example) these will often yield the appropriate duration. For the second candidate, then, the current and previous offsets are summed and added to the list if the result is on the metronome.
All other candidates are generated from the set of factors. This set assumes the usual Western rhythmic subdivisions of two or three, or multiples of two or three, to a beat. The factors interpret the incoming offset as if it represents half the beat duration, or twice the beat duration, or 1/3 the beat duration, or 1.5 times the beat duration, etc. The set of factors follows:
static float factors[FACTORS] = { 2.00, 0.50, 3.00, 1.50, 0.66, 4.00, 6.00, 1.33, 0.33, 0.75 };
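The candidate-generation step can be sketched around this factor set as shown below. The factors array is repeated so the fragment stands alone; the helper names and the exact ordering of tests are assumptions, though the logic (offset itself, then the sum of adjacent offsets, then the factored interpretations, each kept only if on the metronome) follows the description above.

#define FACTORS        10
#define MAXCANDIDATES   5
#define MIN_PERIOD     28    /* centiseconds */
#define MAX_PERIOD    150

static float factors[FACTORS] =
    { 2.00, 0.50, 3.00, 1.50, 0.66, 4.00, 6.00, 1.33, 0.33, 0.75 };

/* Is a candidate period within the metronome range? */
static int on_metronome(long period)
{
    return period >= MIN_PERIOD && period <= MAX_PERIOD;
}

/* Fill candidates[] with up to five plausible beat periods (in cs)
   derived from the current and previous inter-onset intervals. */
int beat_candidates(long offset, long prev_offset, long candidates[MAXCANDIDATES])
{
    int n = 0;

    if (on_metronome(offset))                       /* the offset itself      */
        candidates[n++] = offset;
    if (n < MAXCANDIDATES && on_metronome(offset + prev_offset))
        candidates[n++] = offset + prev_offset;     /* adjacent offsets summed */

    for (int i = 0; i < FACTORS && n < MAXCANDIDATES; i++) {
        long period = (long)(factors[i] * offset);  /* factored interpretation */
        if (on_metronome(period))
            candidates[n++] = period;
    }
    return n;
}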
The trace in figure 5.15 was generated in real time from an analysis of a live performance. The rhythm played was the following, where 8 stands for a played eighth note, and 4 a quarter note, in moderate tempo: 88 88 88 88 4 4 4. There is one fewer row in the activation trace than there are notes in the rhythm, because two notes must pass to give an initial inter-onset interval.
This and subsequent beat theory activation traces are to be read as follows: the first column indicates the offset in centiseconds of the event being analyzed, measured from the attack of the previous event. The trace then has five columns marked from 0 to 4, which record the five beat theories with the highest scores. For each theory, two values are shown: the first is the projected beat duration for the theory; the second, in brackets, is the score currently associated with that theory. So an entry of 38[6] would indicate a beat duration theory of 38 centiseconds, which currently has a score of six.
In the example (figure 5.15), we see the eighth-note duration hovering around 37 centiseconds, with a typical performance spread covering a total of about 4 centiseconds. The quarter notes also show relatively constant values, more or less twice that of the eighth notes. The eighth-note tempo is initially accepted as the beat offset, because it is “on the metronome.” When the quarter notes arrive, however, the quarter-note duration quickly takes over as the beat duration. This process is seen clearly at the input of 77, the second quarter note (at the end of the first quarter-note duration). Because this duration arrives later than the expectation for the eighth-note pulse, the eighth-note theory loses seven points (drops from a score of 48 down to 41). The input falls directly on the expected quarter-note arrival time, however, and this theory gains six points. When the final quarter note comes (at the end of the second quarter-note duration), the same thing happens, and the quarter-note beat duration takes over first place.
The most interesting entries on the activation trace are those at the far right of the figure, the columns marked “R” and “E.” These mark the “real” and “expected” event arrival times. The “real” time is the clock time recorded when the incoming event was analyzed, and the “expected” time is the arrival time calculated for the next event after the analysis. Therefore, the expected time in any row should match a real time in some later row, if the tracker is working correctly. In the figure, I have marked the correspondences between expected and real arrival times. Lowercase letters are placed beside correctly predicted event times, with the same letter next to the prediction and the actual event. We see a quite acceptable alignment of expected and real event arrivals, up to the moment when the performance changes from eighth notes to quarter notes. The events predicted to arrive at times 642 and 720 never do arrive: the real events are syncopated with respect to them. By the end of the example, the quarter-note pulse has taken over, and the next expected arrival is anticipated at one quarter-note duration later.
Accommodating Performance Deviations
Even in this simplest of examples, which should present no performance difficulties whatever, we see the inevitable onset deviations that mark human playing. Such deviations will be even more pronounced with more difficult music, or when a performer is consciously adding expressive transformations to the offset times. The beat and meter analysis described in Chafe, Mont-Reynaud, and Rush 1982 corrects for these deviations by finding a threshold that allows the system to identify notes of equal duration. That is, if the deviation between two notes falls within the threshold, they will be treated as having the same value. Once a workable threshold has been found, adjacent note values can be classified as greater than, less than, or equal to one another. Rhythmic, or agogic, accentuation results from short-long pairs, and these accents are used as landmarks, together with harmonic accents, to decide the meter of a musical section.
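The threshold test they describe can be paraphrased in a few lines of C; the type and function names here are inventions for illustration, not theirs.

enum relation { SHORTER, EQUAL, LONGER };

/* Classify the second of two adjacent durations relative to the first:
   values differing by less than the threshold are treated as equal. */
enum relation classify(long dur, long next, long threshold)
{
    long diff = next - dur;

    if (diff > -threshold && diff < threshold)
        return EQUAL;
    return (diff > 0) ? LONGER : SHORTER;
}

/* An agogic accent would then be flagged on any short-long pair, that is,
   wherever classify(dur, next, threshold) returns LONGER. */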
Cypher attempts to accommodate expressive variations as follows: before points are awarded, the beat-tracking algorithm searches the immediate vicinity of a candidate theory for other nonzero theories. Nonzero neighbors are taken to be a representation of the same beat, but with a performance deviation from the candidate. If such a neighbor is found, the tracker asserts that the composite theory is midway between the neighbor and the candidate and adds together the points from the neighbor and those due to the candidate. Then the neighbor and candidate theories are zeroed out, leaving one theory in that vicinity — the average duration calculated from previous and current inputs.
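A minimal sketch of this merging step appears below. It assumes the theories are kept in an array indexed by beat period in centiseconds, that the candidate index is already within range, and that the search radius is a few centiseconds; the exact score bookkeeping is likewise an assumption.

#define MAXPERIOD 200
#define RADIUS      2       /* assumed search radius around a candidate, in centiseconds */

static int score[MAXPERIOD];    /* theory scores, indexed by beat period */

/* Merge a candidate period with any nonzero neighbor: the composite theory
   sits midway between the two and pools their evidence, and the originals
   are zeroed out. Returns the period that finally received the points. */
long merge_neighbors(long candidate, int points)
{
    long p, composite;

    for (p = candidate - RADIUS; p <= candidate + RADIUS; p++) {
        if (p <= 0 || p >= MAXPERIOD || p == candidate || score[p] == 0)
            continue;
        composite = (p + candidate) / 2;        /* average the two periods */
        score[composite] = score[p] + score[candidate] + points;
        if (p != composite)
            score[p] = 0;
        if (candidate != composite)
            score[candidate] = 0;
        return composite;
    }
    score[candidate] += points;                 /* no neighbor: award the points directly */
    return candidate;
}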
We can clearly see this “zeroing in” behavior in the next trace (figure 5.17), taken from a live performance of the opening of the Bach chorale example shown in figure 5.16. The tracker maintains the quarter-note pulse as the leading theory throughout. Because of the performance deviations, however, the period of the beat duration moves slightly throughout the trace, starting at 74 centiseconds and moving as low as 68 with an average value around 70. Again, I have marked correspondences between expected and actual arrival times with lowercase letters — identical letters show successful predictions of event arrivals.
The tracker does a respectable job on this example: 13 out of 14 quarter notes are correctly predicted, 12 of them within 4 centiseconds. It is interesting to note where the predictions break down: the sixteenth notes at the end of measure 4 are performed a little slowly, which leads to an incorrectly predicted downbeat of measure 5. By the second beat of measure 5, however, the prediction again matches reality.
The Bach example is of limited rhythmic complexity but demonstrates the kinds of rhythmic behavior the tracker can handle well. This method is more erratic with rhythmically complex input; passages with fast figuration around a slow underlying beat particularly tend to confuse it (such examples are an exaggerated form of the sixteenth-note mistake seen here). Such mistakes could probably be reduced with better weights, more sophisticated candidate selection, and added constraints reflecting typical rhythmic progressions. The point of this study, however, was not to develop the perfect beat tracker but rather to include beat tracking in a network of many competencies, all able to help each other achieve better performance on their specialty. The chord agency and dynamic agent are two entities whose reports affect the calculation of beat periods. In chapter 7, I will explore in more detail the relationships the beat tracker joins into with other agents.
Meter Detection
The digital audio editing system developed by Chafe, Mont-Reynaud, and Rush (1982) determines the second level of Cooper and Meyer’s hierarchy of rhythmic behavior: a meter of strong and weak beat relationships across the basic pulse. Their program does not function in real time, but it is interesting for its approach to the derivation of meter from an undifferentiated representation. For the purposes of this discussion, we can treat the basic material as MIDI data: the representation they actually used came from a lower level of audio analysis, yielding pitch and timing information much like MIDI.
First-pass processes find accents, either melodic or rhythmic, in the succession of events. Then these accentual anchors guide the generation of bridging structures, spanning several events between anchors. When a regularly recurring bridge interval has been found, it is applied forward and backward through the remainder of the example to find attacks that may coincide with a continuation of the bridge in either direction. “With the method used, the system looks ahead from simple bridges to see if it can extend to a note an equal bridge-width apart. Any remaining unconnected zones are examined in reverse on a second pass. The method of targeting ahead (or behind) for a particular time interval enables the system to ignore intervening syncopated accents and frees it to latch on to any note placed within a prescribed distance of the target” (Chafe, Mont-Reynaud, and Rush 1982, 544).
Good bridge intervals are then clustered together, and these clusters are examined for lengths in simple relations with each other, preferably in a ratio of 1:2:4 or some similar integer sequence. Durations within a bridge are brought into line with similarly simple subdivisions of the span. Some possible note value interpretations are accepted following stylistic heuristics, such as considering double-dotted values less likely for fast tempi than slow ones. Finally a meter is chosen from stylistically acceptable candidates. Again simple integer ratios are preferred, in duple or triple meters. The derived meter is then used to produce a barred notation of the performed music. This system was shown to work well for examples on the order of Mozart piano sonatas but would break down for more rhythmically complex inputs, particularly those with many syncopations.
Longuet-Higgins and Lee
In their article “The Rhythmic Interpretation of Monophonic Music,” H. C. Longuet-Higgins and C. S. Lee (1984) develop a partial theory of how listeners are able to arrive at a rhythmic interpretation of a stimulus presented as a succession of durations only — that is, no further cues from dynamics, harmonic activity, rubato, text, etc. are available. They present their considerations with respect to notated music examples, rather than live performances: “The expressive qualities of live performance vary so widely that we have felt it essential to restrict the discussion in some way, and we have done this by concentrating on those directions about a performance that a composer customarily indicates in a musical score” (Longuet-Higgins and Lee 1984, 151).
The authors consider traditional Western meters as hierarchical structures in which successive levels represent duple or triple subdivisions of the level above. A 4/4 meter would, in this system, be represented by the list [2 2 2 … ], which indicates that each level in the hierarchy forms a duple subdivision of the level above. Similarly, a 3/4 meter is represented as [3 2] (three beats in a bar, with duple subdivisions), and 6/8 as [2 3] (two beats per bar, with triple subdivisions). They present a set of context-free generative rules, which produce possible rhythms within a given meter. Then “the relationship between the rhythm and the meter may be simply stated: The former is one of the structures that is generated by the grammar associated with the latter” (Longuet-Higgins and Lee 1984, 155).
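This representation is easy to render concretely. In the C sketch below (an illustration, not the authors’ implementation), a meter is simply an array of subdivision factors, and multiplying the factors gives the number of smallest metrical units in a bar.

static int four_four[]  = { 2, 2, 2 };      /* duple subdivisions at every level */
static int three_four[] = { 3, 2 };         /* three beats per bar, duple subdivisions */
static int six_eight[]  = { 2, 3 };         /* two beats per bar, triple subdivisions */

/* Multiply the subdivision factors to find how many of the smallest
   metrical units make up one bar; e.g., units_per_bar(six_eight, 2) == 6. */
int units_per_bar(const int meter[], int levels)
{
    int i, units = 1;

    for (i = 0; i < levels; i++)
        units *= meter[i];
    return units;
}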
Even with such a generative description of rhythm, many situations arise that could be equally well described by more than one structure. Such situations give rise to rhythmic ambiguity in music. Still, ambiguities are usually assigned a particular interpretation by most listeners, and the forces behind these assignments are what occupy Longuet-Higgins and Lee. They begin by looking at syncopation, a phenomenon whereby a “strong” event is displaced relative to an underlying meter. “We may suspect that this is a general characteristic of ‘natural’ interpretations: that when a sequence of notes can be interpreted as the realization of an unsyncopated passage, then the listener will interpret the sequence in this way” (Longuet-Higgins and Lee 1984, 157). Accordingly, the authors define regular passages to be those with the following characteristics:
- Every bar, except possibly the first, begins with a sounded note. (This ensures that there are no syncopations across bar lines.)
- All the bars are generated by the same standard meter.
- There are no syncopations within any of the bars. (Longuet-Higgins and Lee 1984, 158)
If a passage is regular, the authors assert, the metric structure can be derived from the durations of the sounding notes. A set of rules is introduced along with a procedure for applying them. Basically, the shortest metrical unit is identified with a first pass through the passage. Then longer durations are found and compared in length with the shortest. Multiples of the short duration that produce the longer ones then are used to calculate the meter.
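The first steps of that procedure can be suggested in a few lines of C; the function names and the rounding tolerance are assumptions, and the full rule set of the article is not reproduced here. The meter would then be read off the pattern of multiples these routines produce.

#include <limits.h>

/* First pass: find the shortest duration in the passage. */
long shortest_unit(const long durations[], int n)
{
    long min = LONG_MAX;
    int  i;

    for (i = 0; i < n; i++)
        if (durations[i] < min)
            min = durations[i];
    return min;
}

/* Express a longer duration as an integer multiple of the shortest unit,
   rounding to absorb small notational or performance deviations. */
int as_multiple(long duration, long unit)
{
    return (int)((duration + unit / 2) / unit);
}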
The method advanced by Longuet-Higgins and Lee is not immediately appropriate for use in an interactive system: for one thing, their algorithm relies on several passes through the data, a technique generally unavailable to real-time programs interested in responding as quickly as possible. We can notice two features of their work, however, that could be adapted to interactive implementation and that in fact are already found in beat-tracking methods: first, a reliance on the heuristic that syncopations are unlikely, the same operative principle behind Cypher’s beat tracker; and second, a preference for finding larger durations as multiples of small ones. There are basically two ways to relate the durations of a musical passage: the small ones can be regarded as subdivisions of the large ones (division), or the large ones can be found to be multiples of the small (multiplication). In some of the beat-tracking methods reviewed in this text, we find George Lewis advocating a multiplication scheme, as Longuet-Higgins and Lee have done, with other researchers favoring the division approach, as in the Mont-Reynaud beat tracker.
The work of Longuet-Higgins and Lee is significant for the development of interactive meter detectors because it can be regarded as an incremental improvement of the beat-tracking methods already developed. If a program is able to find the beat duration accurately, the syncopation and multiplication rules they propose offer a promising avenue for extending an analysis to the metric level.
5.4 High-Level Description
In chapter 4 we reviewed some of the considerations involved in separating musical structure into different hierarchical levels. Here we will look at some of the analytic methods that have been developed for describing behavior on higher levels of such music hierarchies. There are a number of problems specific to the description of higher-level behavior, among them grouping, classification, and direction detection. Of these, the first is the most fundamental and has received the most attention. David Wessel’s improvisation engine uses principles derived from Lerdahl and Jackendoff’s grouping rules to find phrase boundaries in live MIDI performances. The HARP system of Camurri et al. deals with “music actions,” defined to be musical “chunks” or phrases. The CompAss project described by Andreas Mahling of the Institut für Informatik in Stuttgart provides phrase structure editors.
In Cypher, the listening processes on level 2 describe the behavior of several events. They examine the feature reports arriving from level 1 and look for two main types of structure in the behavior of the features over time. First, one agent tries to group events together into phrases. Another agent looks at all events within a phrase and decides whether each of the level-1 features is behaving regularly or irregularly within it. A third agent, as yet unimplemented but part of the same conception, would look for direction in the motion of lower-level features.
These three tasks (grouping, observations of regularity and direction) correspond to a general view of music in which change (or the lack of it) and goal-directed motion form the fundamental axes around which musical experience revolves. Such a view is espoused, for example, in Wallace Berry’s book Structural Functions in Music: “By recurrent reference to interrelations among element-systems, reciprocal and analogical correspondences are indicated in which the actions of individual elements are seen to project expressive shapes of progressive, recessive, static, or erratic tendencies. Progressive and recessive (intensifying and resolving) processes are seen as basic to musical effect and experience” (Berry 1976, 2).
Not all music exhibits such directionality, and music that does not will not comfortably bear description in these terms; some music, however, particularly much Western music, is about directionality and change. Cypher is biased toward looking for goal-directed musical behavior; even so, the listener will still have something meaningful to say about music that is not primarily goal directed. In that case, useful analysis will be shoved down, as it were, to level 1. Classifications of individual features and local musical contexts will take precedence over longer-term descriptions of motion and grouping.
Phrase Finding
Locating meaningful phrase groupings in a representation of music is one of the most important problems in interactive music and one of the most difficult. “I do not expect much more to come of a search for a compact set of rules for musical phrases. (The point is not so much what we mean by rule, as how large a body of knowledge is involved)” (Minsky 1989, 645). Minsky seems here to be pointing to the musical commonsense nature of phrase grouping, and commonsense reasoning is notoriously difficult to capture, particularly through a small, generalized set of rules. The CYC project is a current attempt to approach common sense, through an elaborate, cross-connected network of knowledge taken from newspaper accounts, encyclopedias, textbooks, and the like (Lenat and Guha 1991). Perhaps a vast network of similarly constructed musical knowledge would help us identify phrases; in any event, some systems already look for phrase boundaries, and we can try to evaluate their approaches and levels of success.
In Cypher, phrases are musical sequences, commonly from around two to ten seconds in duration, that cohere due to related harmonic, rhythmic, and textural behaviors. The level-2 listener detects boundaries between phrases by looking for discontinuities in the output of the level-1 feature agents. Each agent is given a different weight in affecting the determination of a phrase boundary; discontinuities in timing, for instance, contribute more heavily than differences in dynamic. The phrase boundary agent collects information from all of the perceptual features, plus the chord, key, and beat agencies. When a discontinuity is noticed in the output of a feature agent, the weight for that feature is summed with whatever other discontinuities are present for the same event. When the sum of these weights surpasses a threshold, the phrase agent signals a group boundary. Note that this signal will correspond to the initial event of a new phrase; the discontinuities are not noticed until after the change.
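The mechanism can be summarized in a short C sketch. The individual feature weights shown are placeholders (the text specifies only their relative importance), while the four points for a key change and the two points for an on-the-beat event follow the trace description given later in this chapter.

#define NFEATURES 5

static int feature_weight[NFEATURES] = { 4, 2, 3, 2, 2 };   /* placeholder weights */

/* Sum the weights of all features whose classification changed between the
   previous and current events, add the harmonic, key, and beat evidence,
   and report a boundary when the total passes the threshold. */
int phrase_boundary(const int current[NFEATURES], const int previous[NFEATURES],
                    int harmonic_weight, int key_change, int on_beat,
                    int threshold)
{
    int sum = 0, f;

    for (f = 0; f < NFEATURES; f++)
        if (current[f] != previous[f])      /* featural discontinuity */
            sum += feature_weight[f];

    sum += harmonic_weight;                 /* weight of the current chord function */
    if (key_change)
        sum += 4;                           /* a key change adds four points */
    if (on_beat)
        sum += 2;                           /* an event on the beat adds two points */

    return sum > threshold;
}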
The remaining feature dimension, harmony, is treated somewhat differently. The weight of the harmonic analysis is decided by the function of the current chord (local harmonic area) in relation to the current key. In other words, a chord analysis of F major will not be considered in isolation but as a functional harmony in some key. If the current key were also F major, for instance, the chord would have a tonic function (or function number zero, in the Cypher numbering scheme). Following the conventions of Western tonal harmony, tonic and dominant functions are given more weight as potential phrase boundaries than are chords built on other scale degrees. The table shown in figure 5.19 lists the phrase boundary weights given to the various chord functions.
Function   I     i     bII   bii   II    ii    bIII  biii
Weight     4     4     1     1     2     2     1     2

Function   III   iii   IV    iv    #IV   #iv   V     v
Weight     2     0     3     3     1     1     5     5

Function   bVI   bvi   VI    vi    bVII  bvii  VII   vii
Weight     1     2     2     0     1     2     2     2

Figure 5.19
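Rendered as a lookup table in C, the weights of figure 5.19 might be stored as follows. The assumption that Cypher’s function numbers run in exactly this order (0 for the tonic, as noted above, then alternating major and minor qualities up the chromatic scale) is mine, made for illustration.

static int function_weight[24] = {
    4, 4,   /* I,    i    */
    1, 1,   /* bII,  bii  */
    2, 2,   /* II,   ii   */
    1, 2,   /* bIII, biii */
    2, 0,   /* III,  iii  */
    3, 3,   /* IV,   iv   */
    1, 1,   /* #IV,  #iv  */
    5, 5,   /* V,    v    */
    1, 2,   /* bVI,  bvi  */
    2, 0,   /* VI,   vi   */
    1, 2,   /* bVII, bvii */
    2, 2    /* VII,  vii  */
};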
The phrase boundary analysis relies heavily on the progressive perspective; neighboring events in time are compared and split into different groups according to their similarity or dissimilarity. Another manifestation of this reliance is an extension of the functional harmony contribution: adjacent events’ functions are checked to see if they manifest a dominant/tonic relationship. That is, if the chord function of the previous event lies a perfect fifth above that of the current event, the evidence for a phrase boundary is strengthened. Another contribution comes from the beat tracker: events on the beat are given more weight as potential phrase boundaries than events not landing on a predicted beat point.
Further, the phrase boundary calculation implements a version of the techniques of focus and decay. Initially the phrase boundary threshold is set to a constant. When a new phrase boundary is found, the number of events in the new phrase is checked: if there are fewer than two events in the phrase, the threshold is incremented, since phrase boundaries are being found too quickly. On the other side, there is a maximum limit to the number of events in a phrase. If the event count passes this limit, a phrase boundary is declared and the threshold is set lower, since phrase boundaries are not being found quickly enough. In this case, the heuristic for moving the threshold up seems to work well. The decay part of the rule seems too arbitrary, however; there should be a more musical way to decide that phrases are too long than by counting events.
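A sketch of this focus-and-decay adjustment appears below; the increment size and the maximum phrase length are assumptions, since the text does not give their values.

#define MIN_EVENTS  2       /* shorter phrases mean the threshold is too low */
#define MAX_EVENTS 40       /* assumed ceiling on the number of events in a phrase */

/* Called as each phrase is evaluated: raise the threshold when phrases are
   coming too quickly, declare a boundary and lower it when a phrase runs
   too long. */
void adjust_threshold(int events_in_phrase, int *threshold, int *force_boundary)
{
    *force_boundary = 0;
    if (events_in_phrase < MIN_EVENTS) {
        (*threshold)++;                 /* boundaries are being found too quickly */
    } else if (events_in_phrase > MAX_EVENTS) {
        *force_boundary = 1;            /* declare a boundary here */
        (*threshold)--;                 /* boundaries are not being found quickly enough */
    }
}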
Regularity
The second higher-level analysis task carried out by Cypher is to describe regular and irregular behaviors by the features tracked on level 1. Within each phrase, the regularity detectors look to see whether the classifications for each feature are changing often or remaining relatively static. Features whose classifications do not change much are deemed regular, while those with more dynamic behaviors are called irregular.
In figure 5.20, we see the feature classifications for several Cypher events played out over time. The black dots represent the register classifications for each event, and the grey dots, the dynamic values. In this case, dynamic behavior would be termed regular by the level-2 listener, and registral behavior would be flagged as irregular.
The phrase boundary agency uses discontinuities between feature classifications for adjacent events as evidence of phrase boundaries. The regularity report on level 2 is closely allied to the group detection phase: the classification of regularity and irregularity is done over the history of a group. That is, once a group boundary has been found, regularity/irregularity tracking begins anew. The first event of any phrase, therefore, will always be regular for all features, since it is the only event in the phrase. Once two or more events are present in the current phrase, regularity judgments will again characterize a significant population of events. Then, for each feature, the number of discontinuities among events within the current phrase is calculated each time a new event arrives. If there are more discontinuities than identities, the feature is said to be behaving irregularly. If feature transitions are identical more often than they are different, the feature is behaving regularly.
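In outline, the regularity judgment for a single feature might be coded as follows; this is a sketch only, and the names and data layout are assumed.

typedef enum { REGULAR, IRREGULAR } Regularity;

/* Judge one feature over the events of the current phrase: more
   discontinuities than identities between adjacent events means the
   feature is behaving irregularly. A one-event phrase is trivially regular. */
Regularity judge_feature(const int classifications[], int events_in_phrase)
{
    int changes = 0, identities = 0, i;

    for (i = 1; i < events_in_phrase; i++) {
        if (classifications[i] != classifications[i - 1])
            changes++;
        else
            identities++;
    }
    return (changes > identities) ? IRREGULAR : REGULAR;
}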
5.5 Computer-Generated Analysis
The following sections document a real-time musical analysis of a performance of the Bach chorale shown in figure 5.16. This chorale was chosen because it presents a challenging tonal language, stretching the capabilities of the harmonic analysis. The rhythmic level of the piece, on the other hand, is less challenging but still presents some variety. Simpler, more easily handled examples could have been presented instead, but in the following exposition the ways in which the process falters are as instructive as the successes. All of the charts were generated by Cypher as the chorale was being played; as such, they represent the quality of the information about the piece that would be available to the rest of the program during performance.
Bach Chorale
We will first consider the output of the key agency. Incoming chord agency reports are shown in the leftmost column of figure 5.21. The activation associated with each key theory is shown in the 24 following columns. The highest activation for a key theory in any row is followed by a “]” symbol. So in the very first row the chord agency reported an E major chord, spawning a key theory of E major. Each measure in the original score has been marked in the trace with a dash in the chord column. Accordingly, there is a dash after the first E major report, since that measure contains only the pickup chord. The dashes are a convenience for comparing the trace to the score.
Notice the first key change, in measure 7. The tonic moves from A minor to E minor, which has taken over by the middle of measure 7. The move back to A minor is more circuitous; the agency notes the ambiguity at the onset of measure 8, with the F major chord, and abandons E minor for C major. The E major to A minor transition between the second and third beats, however, is enough to reestablish A minor. Another instability occurs at the beginning of measure 9, when a first inversion A minor chord is misread as C major. These reports, and the D minor report in measure 8, would be ignored, since key changes that are not confirmed by at least two successive reports are discarded. The tonal ambiguity of the end of the piece is instructive: the accumulation of accidentals in the final two measures leads to uncertain chord reports, contributing to a wandering key analysis as the piece closes. Here it is possible for the key agency to make different analyses on different performances. Sometimes the piece comes out in A minor and sometimes in E major: the instability of the chord reports, caused by ambiguous chords such as the penultimate D# diminished seventh, means that the key tracker will be less stable as well. The key analysis is particularly useful in going through the phrase boundary analysis, shown in figure 5.22.
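The confirmation rule mentioned here can be sketched briefly in C; the bookkeeping and variable names are assumptions, illustrating only the requirement that a new key lead the activations on two successive reports before it is accepted.

static int current_key   = -1;      /* the accepted key, coded as a number 0-23 */
static int pending_key   = -1;      /* a candidate key awaiting confirmation */
static int pending_count = 0;

/* Accept a new leading key only after two successive reports agree on it. */
void confirm_key(int leading_key)
{
    if (leading_key == current_key) {           /* nothing new: drop any candidate */
        pending_key = -1;
        pending_count = 0;
    } else if (leading_key == pending_key) {
        if (++pending_count >= 2)               /* confirmed by two successive reports */
            current_key = leading_key;
    } else {
        pending_key = leading_key;              /* a new, unconfirmed candidate */
        pending_count = 1;
    }
}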
The phrase boundary trace should be read as follows: the first five columns report the weights added by featural discontinuities between events. When one of these columns has a nonzero value, the corresponding feature has changed classifications between the event on that row and the preceding event. Then in the column marked chord, the weight added for chord function is reported. The function reported for each event is also noted. Next, key changes show up as an additional four points. Events on the beat are awarded an additional two points, shown in the next column. The next to last column is the total of the activation from all the previous ones, and the number in the far right column is the current phrase boundary threshold. Therefore, if the value in the second to last column for any row is greater than the last number (the threshold), a phrase boundary was reported for that event. I have added asterisks to the trace to mark the identified phrase boundaries.
In the phrase detection for the Bach example, notice that the phrase threshold is initially too low. There are double hits and spurious boundaries, such as the first two events of measure 2. The phrase agency notices this as well, and we see the threshold moving up with each double hit. By the time the threshold gets to about 9, better boundaries are being found. The beginning of the phrase on the last beat of measure 5, for example, is found correctly. The following phrase beginning, at the end of measure 7, is also correctly identified. Here we see the double-hit problem arise again, however. Evidence for phrase boundaries in this scheme seems to accumulate and dissipate too slowly, leading to identifications of phrase group boundaries on neighboring events. The final phrase boundary in the example, at the end of measure 9, shows exactly the same behavior: the correct boundary is identified, but the following event is chosen as well. This problem could easily be corrected in the program simply by ignoring double hits. The main cause of the problem is the speed change weight. After the fermatas, the program registers the change in duration between events, and this change contributes to the boundary identification. Since the speed change at a fermata is so pronounced, however, successive events tend to report changes in classification, leading to the double hits.