Chapter 7: Interactive Architectures and Artificial Intelligence
Techniques from artificial intelligence (AI) have been used in musical applications for many years. Each successive wave of AI technology, in fact, seems to find its way into several music programs. Here we will examine some of those efforts, including the influence of AI on the development of Cypher. This will not, however, be an exhaustive review of the many projects combining music and artificial intelligence. Rather, I will concentrate on branches that have already had an influence on the construction of interactive systems or that seem to hold particular promise for such an impact in the short term.
Artificial intelligence has often been attacked for concentrating on the “rational” processes of mind and for being unable to address the more subjective, emotional aspects of thought (Dreyfus 1979). A parallel notion holds that music engages precisely those parts of the mind that can least accurately be modeled formally, again usually termed emotional, irrational, or aesthetic (Clifton 1975). Taken together, these observations would seem to indicate a limited range of application for artificial intelligence techniques in music. A partial reply can be found in Marvin Minsky’s essay “Music, Mind, and Meaning”: “Our culture has a universal myth in which we see emotion as more complex and obscure than intellect. Indeed, emotion might be ‘deeper’ in some sense of prior evolution, but this need not make it harder to understand; in fact, I think today we actually know much more about emotion than about reason” (Minsky 1989, 639).
If such fundamental limitations to the operative power of artificial intelligence exist, they should be rapidly exposed in programs attempting to function musically. As we proceed through an examination of several extant systems, we will notice in some cases an unsuspected power of traditional formal techniques, and in others, new hybrid models that combine logical operations with non-rule-based components. To be sure, widespread limitations and compromises surrounding the musical intelligence of these efforts will be noticed as well. As attempts to implement musicianship in computer programs continue, interactive music systems may well serve to demonstrate empirically the power of formalized thought for the representation and generation of music; for some time to come, the failures may be as instructive as the successes in this regard. Further, the compartmentalization of cognition into “rational” and “emotional” components could become blurred as well, if those “emotional” thoughts prove to be at all amenable to treatment by techniques generally regarded as purely “rational.” In fact, some practitioners regard the confrontation with a computer performer as a stimulus to exploration of the irrational:
My approach to the art of musical improvisation is concerned with developing the musically expressive resources of one’s unconscious mind. During an improvisation, which can begin without any preconceived score or plan, the unconscious mind of a performer generates an initial act or gesture, expressed sonically through the vocal or instrumental medium. As it evolves, the improvisation develops as a kind of dialogue between the freely associative unconscious mind and its sonic manifestations, mediated by the physical actions of the performer.
Musical improvisation within the medium of live electronics can have the effect of further objectifying and detaching the perceived, sounding image; the image becomes detached both from its unconscious source and from the physical actions employed in its production. The “electronic double” takes on a life of its own, physically displaced in space, often amplified larger than life, perhaps costumed in new modulations, distortions and syntheses. (Teitelbaum 1991)
7.1 Cypher Societies
I will begin with a review of the borrowings from artificial intelligence found in Cypher. The most pervasive influence should be mentioned first: the theory of intelligence and cognition put forward by Marvin Minsky in his book The Society of Mind (Minsky 1986). The central idea of Minsky’s theory is that the performance of complex tasks, which require sophisticated forms of intelligence, is accomplished through the coordinated action and cross-connected communication of many small, relatively unintelligent, self-contained agents.
Similarly, the basic architecture of Cypher, on both the listening and playing sides, combines the action of many small, relatively simple agents. Agencies devoted to higher-level, more sophisticated tasks are
built up from collections of low-level agents handling smaller parts of the problem and communicating the progress of their work to the other agents involved. We have seen in the discussion of the listener, for example, that the process of chord identification is accomplished by an agency of interconnected feature classifiers and harmony experts.
This chapter will discuss in depth the design of an agency-based program architecture, describing the methods of communication and coordination that go into building high-level processes from smaller agents and the way a particular view of musical comprehension and composition is captured by such a design. First we will review in some detail the basic construction of the program and the representations used for various tasks. Gradually, a picture showing the cooperation of many agents will emerge, enabling us to look at the shared responsibility for various tasks spread through the program.
When a Cypher event is sent to a listener, first of all a collection of low-level feature classification agents is trained on it. These agents can be considered independent of one another, at least at this stage.
Cypher events are built from incoming MIDI data, and can represent either single notes or chords. In figure 7.1, we see three feature agents analyzing a new Cypher event. (In this and all subsequent illustrations, the arrows indicate the direction of information flow. In figure 7.1, information from the Cypher event is going up into the feature agents.) The three features shown (density, register, and dynamic) are all calculated at attack time. Only one of them is context dependent and therefore requires access to memory: the scale against which the register classification is made has been established by the analyses of preceding events.
In the expanded figure 7.2, the context of previous events has been added. The register agent, then, classifies the new event within the context established. The duration agent, moreover, depends directly on earlier events to perform its analysis task. In another expansion of the original processing diagram (figure 7.3), we see the addition of two more feature agents: one measuring the speed of attacks and the other tracking duration.
The speed agent classifies the temporal distance between two events and therefore relies on one event before the current one to make its calculation. Accordingly, the very first speed classification will be meaningless, since it only measures the length of time between the beginning of program execution and the performance of the first note. An even greater reliance on previous information is shown by the duration agent: because featural classification is done at attack time, it is impossible to analyze the eventual length of the new event. For that reason, the duration classification for the previous event is included in the current report.
Rather than being a limitation, this procedure seems to reflect the way music is actually heard; the program will react to the duration of events closer to their release than to their attack. It can happen, however, that a new event will arrive before the previous one has been released. Each event is assigned a provisional duration at attack time. In the situation just mentioned, the provisional duration will be the one analyzed for the last complete event; even though this duration may correspond to an event played some significant amount of time previously, it will still have more resonance with the current context than would some arbitrary constant.
The next step in the classification of an input event is the local harmonic analysis. Although the sequence of level-1 listening is presented here as a serial process, note that all of the analyses described so far could be done by the same collection of agents working in parallel. As shown in figure 7.4, the chord agency accesses the new Cypher event directly to find the values of the pitch information associated with it. At the same time, the other featural agents report their classifications of the event to the chord agent, guiding the evaluation of the raw pitch data. The chord agency uses a connectionist network to find a plausible root for the incoming pitches. The network is assisted by the feature agents, whose reports will change the weights associated with certain pitches according to their classifications. As an example, recall that pitches with a low register classification will tend to receive more weight than pitches higher in the range.
The formation of an agency incorporating a specialized, connectionist network, informed by the reports of other agents, gives the first real ordering to the processing of the level-1 listener. The low-level, perceptual feature agents (register, density, etc.) must complete their work before they can accurately inform the chord agency of their classifications. In a parallel implementation, then, a two-stage process would result: first, the feature agents arrive at their classifications in parallel; then, the chord agency accesses those classifications and the raw note data to find the dominant pitch of a local harmonic area.
The agency responsible for detecting beat periodicities shows an organization strongly paralleling that of the chord agency. Once again, a connectionist core is informed by featural agents, such that certain weights will be altered according to the classifications reported. Both the beat agency and the chord agency maintain separate, private memories, which record the activation levels of various theories. Further, there may be several copies of the listener active at the same time, and each copy will maintain separate memories for its constituent agencies.
For every distinct stream, the beat and chord agencies communicate results with each other. In figure 7.5, we see the beat and chord agencies, each coupled with its own local memories and receiving information from the low-level feature agents. Again, this organization reinforces the pattern of two-stage processing on level 1 that we noted earlier: first the perceptual feature agents (register, density, etc.) arrive at independent classifications. Information from these features, from the raw event data, and from the other second-stage agency is processed by the chord and beat agencies after the perceptual agents have finished, making use of contexts recorded in their local memories.
In figure 7.5, we see the first two-way connections, signifying communication paths where each end can be the source or destination of information traveling across the connection. Obviously such communication holds between an agency and its memory; the presence of two-way communication between agencies, such as between the chord and beat agencies, is a more problematic issue.
When two agencies are cooperating, making their calculations interdependent, a situation arises in which the ordering of execution and communication between the two sides becomes unclear. In the case of the beat and chord agencies, certain types of results arrived at by either agency can further some interpretation calculated by the other. If the beat agency reports that a new event was on the beat, the pitch information associated with that event is given more weight by the chord agency. Conversely, if the chord agency finds evidence of a new local harmonic area, that finding tends to influence the beat agency to find a beat boundary. The problem then becomes one of deciding which agency should have a stronger influence on the other or what message should be passed in which direction first.
In this case, a plausible way to arbitrate the communication between the chord and beat agencies is to use the confidence levels each of them maintains: messages with a high confidence rating are given more weight by their recipient than low-rated messages. Still, the question of execution order remains. Because the interagency influence is mutual, there is no clear way to decide whether partial results should be communicated from the chord agency to the beat agency first, or vice versa.
In a single processor implementation, like this one, the question must be decided: in Cypher, the beat tracker sends a message to the chord identifier first, followed by a message in the opposite direction. I regard a beat boundary as stronger evidence for the significance of the associated harmonic information than I regard the identification of a new chord as evidence for a beat boundary. Further, the beat tracker is making predictions about the arrival of the next beat. The chord agency can therefore plausibly influence the belief in the beat prediction after an initial estimate has been made. The processing chain proceeds as follows: the beat tracker uses the event and feature data to make an initial prediction, and to classify the current event as on or off the beat. Second, the chord agency decides on a classification, referencing the beat tracker’s onbeat decision. Third, a change in chord will send a message to the beat tracker, altering the confidence in the current theory according to whether or not the chord change coincided with a beat boundary.
Attachment of Classifications
At this point, the raw Cypher events processed so far are assigned a featurespace classification, which records values for the perceptual features and local harmony. In figure 7.6, the featurespace word is shown as an added layer of information surrounding the original event. Attaching the classification to the event allows us to accomplish two objectives: (1) detailed information about the nature of the event is made available to level-2 processes, which observe the way such information changes over time, and (2) the judgment of all individual agents concerning an event can be ascertained by referencing it, to further the processing of any agency which might require the information.
Basically, there are two ways to implement the connectivity of agents into larger agencies: either the results of the agent can be broadcast through messages to all concerned agencies, or the results can be attached to the event being analyzed, where it is available to any interested party. The first method is advantageous for parallel architectures, when the results can be sent out to many processes working simultaneously; the second is more efficient for serial implementations, where the desired result can be simply read off of the event, rather than obtained through a message.
In figure 7.6, notice the addition of two agencies, one that tracks key (the phrase-level harmonic analysis) and another that looks for phrase boundaries. Again, these agencies maintain local memories, which are also kept distinct per event stream. In other words, the key agency has access to a local memory, which communicates with no other process. Further, there are separate local memories for the key agent associated with all active event streams. In Cypher, then, there will be two local memories for the key agency, because there are two active listeners: one corresponding to the stream of events coming from the outside world, and another associated with the event stream produced by the composition section. The key agency is involved in a mutual communication link with the chord agency; the phrase finder communicates with chord, key, and beat, as well as referencing the featurespace classification for the current event — thereby effectively communicating with the feature classification agents as well.
The information attached to Cypher events is one of the fundamental representations of music used by the program. Representational and processing choices make some conceptions of music easily expressed, whereas others can be approached only with difficulty. The featurespace representation, which plays such a fundamental role here, tends to treat parameters of the sound-pressure waves carrying musical percepts as central. Register, loudness, and density are all primary components of the representation.
The listener on level 2 characterizes the way lower-level classifications change over time. The analysis record attached to each Cypher event is used to make this calculation, and the results are added to it as well. In figure 7.7, notice that the regularity detection agent has been added to the picture of the listener.
The regularity agent refers to the featurespace word, the difference between the current featurespace and the previous one, the running statistics of classification change during the current phrase, and the chord, key, and phrase agencies. A detailed description of how this analysis works can be found in chapter 5: the point to be noticed here is the connectivity and quasi-hierarchical arrangement of the various processes involved in a listener. The important characteristics of this architecture are preservation of the progressive perspective — that is, continual access to information associated with events proximate in time; maintenance of local memories on a per-stream basis for each agency; mutual reinforcement of cooperating agencies; and the attachment of analytic results to the events classified, where they can be read by other interested processes.
The illustration in figure 7.7 provides a reasonable picture of the architecture of Cypher’s listener. However, it is at a significant level of abstraction. As we have built up this picture, lower-level details (such as the feature agents) have been boiled down to the analysis record attached to the original events. With the understanding that the figure is a useful reduction of the entire collection of calculations carried out, we see the real outline of the type of organization I referred to earlier as a “layered network.” Recall the claim that the listener embodied a hierarchical analysis of an event stream but that the hierarchy was treelike in only a very restricted sense. Here we see the structure behind the claim: the organization can be considered hierarchical in the sense that the regularity description subsumes individual events in a characterization of a group of events. At the same time, the progressive and communication links between events and the processes analyzing them render the architecture as a whole more like a network of interconnected agents. And this is only half of the story. Next we will look at composition networks and will finally sketch the structure encompassing both competences.
7.2 Composition and Communication
In chapter 6, we examined the building blocks of compositional networks as they are constructed in Cypher. In this section, we will consider in more depth the nature of the networks linking these blocks together. Recall that there are three strategies available for generating music from the composition section: playing sequences, generating new material, or transforming received material. In contrast to the parallelism of the listener, compositional agencies built from these elements tend to link many processes together in chains, with the music that is generated being passed along from one link in the chain to the next.
The way several transformations of some material are invoked is the most obvious manifestation of the chainlike behavior of many Cypher composition networks. When several transformations are due to be applied to a block of events, they are executed in series, with the output of one transformation passed along to the input of the next (figure 7.8).
It is remarkable that the processing networks on the composition side are organized so differently from those of the listening side. Both competencies use many small agents to build up more complex behaviors, but on the listening side these can operate largely in parallel, whereas on the composition side they function largely in series. A reliance on the idea of transformations engenders such a structure. Applying a set of transformations to some base material in parallel is, first of all, not possible on the current hardware platform, and, second, would produce highly variable results from event to event. We have already seen that the order of transformations can make a big difference in the eventual result; transformations all operating on the same material in parallel would have no fixed order, and the sounding result at the output would show wide deviations for even identical input material.
With the model of composition implemented in Cypher, there really is no way around a significant amount of serial processing; however, the real interest of the composition section lies in the way decisions are made to apply some transformations and not others, and in the way some material is chosen as the basis for transformation. The way these decisions are expressed, evaluated, and altered is where the parallelism of the architecture as a whole again comes into play. In the next section, I will explore the question of how the operation of one or more listeners is coordinated with that of the composer.
A Cypher listener performs several levels of analysis and maintains a record of the classifications calculated for many different musical features and their behavior over time. The record containing this information is sent on to a process that organizes the invocation of composition methods, making the output of all listener components available for compositional decisions. According to the response connections established between incoming listener messages and composition methods, anywhere from zero to a dozen or more methods will be called in series, in the order of their priority.
In figure 7.9, we see on the left a stylized version of the listener representation built up in the previous sections. At the bottom of the listener we see incoming events, with their attached analysis records. The connections manager box is receiving the analysis record for the current event. According to the connections maintained there, certain transformations will be called up and ordered by priority. These are then applied to the same input event classified by the analysis record, and the result of all transformations is sent on to the output. In this example, the seed material for transformations is the input itself; another function called by the connections manager might decide to choose other material as a base, in response to some listener message.
The connections between listener messages and composition methods are established as logical expressions on the features and regularities reported by the listener. Consider the following example: High OR Loud -> Trill. Here, the triller transformation is queued up to be applied to some seed material, if the analysis record associated with the current event shows the event to be either high or loud. The job of the connections manager, then, is to parse and evaluate the logical expressions on the left-hand side of such a connection and prepare for execution the appropriate composition methods found on the right-hand side.
The connections as represented on the graphic interface allow only inclusive OR as a logical operator. Attaching two features to some method means that when one or the other feature is found, the method will be readied for execution. Any number of features can be tied to responses in this way; however, they are all related to each other by this OR relation. The interface is not, however, the only way to specify connections. The scripts of response connections executed during score orientation allow the connections manager to be programmed using a slightly wider range of logical operators. Through the script mechanism, inclusive OR and NOT operators are available for connecting any feature or regularity report to a method. For each feature, the connection manager searches two lists: if the feature is asserted in the analysis record for the event under consideration, all methods in the OR list will be queued for execution. If the feature is negated in the analysis record, all methods in the NOT list are readied.
The full range of logical expressions should be supported by the connections manager. The path to completing it is clear; what we need is a fully developed parser, a compact representation of the notated expressions, and extensions to the connection manager that, for each incoming event, would evaluate the appropriate parts of the analysis record to see if any active expressions are satisfied. Further, full nesting of expressions should be handled by the same mechanism, so that compound statements such as (Soft OR High) AND NOT Fast could be specified as well.
Another component in the collection of processes establishing interaction between listening and composing has been added to figure 7.10. The listener / connections manager interaction described in the previous section is shown at the left of the figure. Now, we see two more stages added between the transformations applied by the connections manager, and the output of data to the synthesizers: first, a second copy of the listener analyzes the output emanating from the collection of methods called up by the connections manager. This second copy of the listener is labeled “critic” in the figure. Actually, the second listener is only half of the critic. The other half is made up of the rules shown in the box marked “productions.” The critic includes a number of production rules, which examine the output of the composition section, and apply further modifications to it when conditions specified in the productions are satisfied.
There are two major differences between the operation of the connections manager and the critic’s productions. First, the compositional methods invoked by the connections manager in response to listener messages are specified by the user, either in a script of connections, which is stepped through by the score-orientation mechanism, or by using the graphic interface, which allows the user to draw connections between messages and methods on the screen. The critic, in contrast, maintains a number of rules associating responses with listener messages under certain conditions. These connections, then, are applied automatically according to the musical context. Often, the interface will be used to try out several analysis/response combinations, the most useful of which is saved to the productions for automatic application in performance.
The second difference is that the critic’s connection rules can use the full range of logical expressions on listener messages. Whereas the logical operators available to the connection manager are currently limited to inclusive OR and NOT, in the critic any arbitrary expression can be written as the conditional part of a production rule. A set of classification-reading routines can get the analysis results from the
information attached to any event. Using these routines, conditional statements such as
if ((LoudVal(featurespace) > 0) && (RegVal(featurespace) < 3))
can be written to any level of complexity, taking advantage of the normal C language evaluation mechanisms. The example expression means that the action associated with this condition should be executed if the output of the composition section is louder than the quietest possible value and not in the highest register. Accordingly, critic production rules can be sensitive to more complicated conditions and combinations of input features than can the connections manager links.
7.3 Production Systems
Rule-based, or production, systems rely on a collection of rules with the form if <condition> then <action>. The two halves of production rules are also commonly referred to as the antecedent and consequent parts. In choosing an action to perform, a representation of the current situation will be examined to see if features mentioned in the condition part of the rule are in the state required. If so, the action part of the rule fires and makes some change to the state.
A musical application of this approach is Stephan Schwanauer’s Music Understanding System Evolver (Schwanauer 1988). MUSE uses a generate-and-test method to perform tasks such as supplying a soprano voice to an unfigured bass. “Generate and test” means that the system first proposes (generates) a partial solution to the problem, such as positing the next note in a nascent soprano line. Production rules then test the proposal and accept or reject it according to the knowledge represented, which has to do with traditional music-theoretic concepts such as voice leading. When a generated step is found defective, another is proposed and evaluated. Sometimes problems are severe enough that several previous steps must be undone, until the system returns to a point where generation can again be continued successfully. This undoing and regeneration of earlier steps is known as backtracking, a common strategy in production systems for working out of failure.
Production systems lead into one of the main subtopics of artificial intelligence: planning. If a robot needs to achieve some goal, it must devise a plan to accomplish it. Production systems are basically a means of satisfying constraints; there are many others. Planning has often been approached from such a standpoint; one of the earliest efforts, hubristically called the General Problem Solver (GPS), employed a method henceforward termed means-ends analysis to choose a plan of action and execute it (Winston 1984, 146-157). In this scheme, the robot compares the current state of the world with a goal state, and then applies operations in an effort to reduce the difference between the two. In effect, GPS is employing a depth-first search of the state space; when the goal state is found (if it can be found), the search ends. GPS is recursive, in that it applies itself to subproblems as it works through the search. Further, some versions of the technique allow it to defer solving subproblems until an overall solution path has been found: a rudimentary kind of planning.
We can see here an affinity with production systems; a state is examined, and under certain conditions, actions are called up to modify the state. Planning in this sense has been explored in music expert systems, such as Schwanauer’s or Kemal Ebcioglu’s chorale harmonizer. A GPS-like engine has received little attention in interactive systems, though it might well be a fruitful strategy. One hurdle is that these mechanisms in their classic formulations are assumed to be working outside of real time; backtracking works well when there are no time constraints on producing an action. Production-style strategies can function in real time if one is reasonably confident that correct operations will be chosen at each juncture; in other words, if backtracking is kept to a minimum. The other solution would be to buffer the results of a calculation until all possible backtracking has completed. Although it is interactive, such a solution would not have the immediacy of many systems now in use.
The Max patch of figure 7.11 is a production system reduced almost to the point of caricature. There are two number boxes in the middle of the patch; when the left one is moved, it is compared with the value of the number box to the right. Whenever the left value passes below the value of the box on the right, the right box’s value is decremented by one. Some of the fundamentals of production systems can be seen from this: the state of the system is held in the two number boxes. A test is made on this state, with the expr object marked “condition” and the change object beneath it. When the left box passes below the right one in value, the condition is true, and a change is made to the state of the system: the threshold value stored in the right box is lowered.
The above patch is caricatural because it has no provision for dealing with error, through backtracking, for example. Further, the maintenance and manipulation of state is an area in which graphic languages built for the control of data flow, such as Max, are less adept than other systems designed for that purpose, such as PROSPECTOR (Duda, Gaschnig, and Hart 1979) or Dendral (Buchanan and Feigenbaum 1978). For that reason, the programs we will now review have been implemented in either specialized production system environments or text-based development languages such as Lisp.
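The behavior of the figure 7.11 patch can also be rendered as a text-based production system, which makes the state/condition/action division explicit. This is a sketch: Max itself is a data-flow language, and unlike the patch, which reevaluates only when a number box moves, this loop simply fires the rule until quiescence.

```python
# The two-number-box patch of figure 7.11 as a tiny production system.
# State is held in two values; the single rule fires while `left` sits
# below `right`, decrementing the threshold each time.

def run_productions(state, rules):
    """Fire the first matching rule repeatedly until no condition holds."""
    fired = []
    while True:
        for name, condition, action in rules:
            if condition(state):
                action(state)
                fired.append(name)
                break
        else:
            return fired        # quiescence: no rule's condition matches

rules = [
    ("lower-threshold",
     lambda s: s["left"] < s["right"],                   # condition
     lambda s: s.__setitem__("right", s["right"] - 1)),  # action on state
]
state = {"left": 3, "right": 6}
log = run_productions(state, rules)   # threshold settles at left's value
# state["right"] == 3; the rule fired three times
```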
Peter Beyls’s program Oscar is an interactive music system composed of three parts: a pattern-directed inference system, a planner, and an executer. The inference system is a collection of production rules, where the antecedent parts of the rules are patterns stored in the working memory of the program, against which input from the outside world is matched. When a match is found for any production rule, the consequent part of the rule is executed. The program has available a number of methods of response, most of which generate MIDI data and concern such musical issues as melodic contour or variation of incoming material (Beyls 1988).
Beyls has organized the memory of the system into long-term and short-term components. The long-term memory saves melodic intervals. The short-term memory is split into two halves, one maintaining salient events from the input stream and the other recording the recent behavior of Oscar itself. Because the program separates its own behavior from the nature of incoming material, Oscar is able to make comparative judgments about its place in the context. Further, the melodic patterns saved in the long-term memory are sometimes used in conjunction with more recent contours for melodic interpolations. Patterns lying between the two stored melodies are produced to create variations related to but showing novel deviations from already heard material.
Oscar develops opinions about the world in a musical sense, concerning both its own activity and the behavior of stimuli arriving from the outside world. On one axis, the program changes between states
of boredom or interest. Interest is excited by unexpected events, while boredom is induced by repetition or the lack of any stimulus at all. On another axis, Oscar is able to adjust its sensitivity to stimuli, producing a fit between the kinds of information being received, and the responses those stimuli evoke. The adjustment of stimulation thresholds implements a process of focus on the strength of musical input, where strength is measured in terms of the stimuli Oscar expects and the ways
they deviate from patterns stored in memory.
The planning component maintains a goal state and a list of steps to be taken in achieving it. Each successive step accomplishes a subgoal on the way to the overall desired state. In this respect, the planning part of Oscar uses means-ends analysis. In Oscar this general mechanism is extended to allow interrupts from the perceptual part of the program: if new information arrives that causes a goal to be superseded, the current path may be abandoned for a new line of subgoal processing.
Interactive Production Systems
Rule-based systems have found some use in interactive systems: a notable example is the Interactor program designed by Morton Subotnick and Mark Coniglio. In Interactor, scenes are constructed that consist of a set of production rules. The conditions of each rule are matched against the state of MIDI input and the program’s internal state: the first rule whose condition matches is executed. One possible action for a rule is to jump to another scene, activating a new set of conditions and actions. The similarity to production systems is limited to the form and processing of the rules; there is no generate-and-test cycle, nor any backtracking.
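The scene mechanism just described can be sketched as follows. The scene names, rule contents, and input events are invented for illustration and do not reproduce Interactor’s actual vocabulary; the point is only the ordered rule match and the jump action.

```python
# A sketch of Interactor-style scenes: each scene is an ordered list of
# (condition, action) rules; the first rule whose condition matches fires,
# and its action may name a new scene to jump to.

def step(scenes, current, event, state):
    """Process one input event; return the (possibly changed) scene."""
    for condition, action in scenes[current]:
        if condition(event, state):
            return action(event, state) or current  # action may jump
    return current

scenes = {
    "intro": [
        (lambda e, s: e == "downbeat", lambda e, s: "development"),
        (lambda e, s: True,            lambda e, s: s["notes"].append(e)),
    ],
    "development": [
        (lambda e, s: True, lambda e, s: s["notes"].append(e.upper())),
    ],
}
state = {"notes": []}
scene = "intro"
for event in ["c4", "downbeat", "e4"]:
    scene = step(scenes, scene, event, state)
# scene == "development"; state["notes"] == ["c4", "E4"]
```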
The ideas of Interactor are evident in such Subotnick works as A Desert Flowers, written for chamber orchestra and computer. In performance, the conductor leads the ensemble with a MIDI baton, which is
tracked by the computer. The software is able to detect changes in direction, and the velocity with which the conductor moves the baton (Dreier 1991). Further, information from a MIDI piano provides cross-referential confirmation of the ensemble’s place in the score. These cues are used to advance the computer processing, through the system of scene changes previously described. Among the manipulations the computer then performs are timbral transformations, spatialization, and harmonic mutation from one synthesized chord to another.
In this limited sense of productions, Cypher can be regarded as another example. The messages from the listener to the player are sent as a function of features found in the environment; certain player processes are invoked when messages corresponding to particular feature configurations are sent. This again is a form of production rule: the <if> part is made up of listener states, and <then> corresponds to the player processes called by the listener. Generate-and-test or backtracking mechanisms are again absent. Cypher is in one respect even further removed from the realm of production systems than Interactor: in the consequent part of a Cypher production, the player processes invoked have no effect on the state examined by the <if> condition. The only exception to this observation arises when the program is running introspectively: then the compositional operations produce events that are fed back into the listener, effecting changes in the state. Two sets of productions are maintained in Cypher. The first set is manipulable directly by the user, either through the interface or through a score-orientation script. The second set makes up the program’s internal critic, which reviews output about to be played and alters it according to the production rules representing its aesthetic preferences.
7.4 Knowledge Representation
An important subfield of artificial intelligence known as knowledge representation is concerned with methods and structures for capturing and manipulating knowledge. Again, several techniques derived from knowledge representation have been applied to musical problems and have already affected, or seem likely to affect, the evolution of interactive performance as well.
Object orientation is usually recommended as a programming discipline: objects can inherit functionality from one another and can communicate by passing messages that invoke methods local to the receiving object. Max is an object-oriented programming language. Objects are built into larger groups by describing relationships between them. As part of the Kyma system for programming computer music compositions, Carla Scaletti and Kurt Hebel propose an object-based representation for digital audio signals, where a signal is taken to be a sound ranging in complexity from a single sonic event to entire compositions. Sound objects have a duration and a method for computing their next sample. Atomic sound objects compute the following sample directly; compound objects derive a sample from the results of all their constituent atomic objects.
The use of variables within a sound object definition produces lifted sound objects. Lifted objects represent a higher level of abstraction than completely instantiated ones (hence they are “lifted” relative to specific objects). By substituting values for the variables of lifted objects, specific objects can be instantiated. Lifted objects therefore represent classes of sound objects, and instances of that class can be produced by substituting values for variables.
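A rough Python sketch of this object structure follows; Kyma itself is a Smalltalk-based environment, so the class names, the summing rule for compound objects, and the way lifting is shown here are all illustrative.

```python
import math

# A sketch of the sound-object idea: atomic objects compute their next
# sample directly; compound objects derive it from their constituents.

class Sine:
    """Atomic sound object: a single sinusoidal partial."""
    def __init__(self, freq, rate=44100):
        self.freq, self.rate, self.n = freq, rate, 0
    def next_sample(self):
        s = math.sin(2 * math.pi * self.freq * self.n / self.rate)
        self.n += 1
        return s

class Mix:
    """Compound sound object: the sum of its constituent objects."""
    def __init__(self, *parts):
        self.parts = parts
    def next_sample(self):
        return sum(p.next_sample() for p in self.parts)

def lifted_tone(freq):
    """A 'lifted' object in miniature: `freq` is a free variable, and
    substituting a value for it instantiates a specific sound object."""
    return Sine(freq)

chord = Mix(lifted_tone(440.0), lifted_tone(660.0))
first = chord.next_sample()    # both partials start at phase 0, so 0.0
```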
Stephen Travis Pope’s Musical Object Development Environment (MODE) is a large collection of object-oriented software components for composition, performance, and sound processing (Pope 1991a). Like Kyma, MODE uses object-like structuring for sound and compositional representations. The system combines real-time sampled sound, software synthesis, and interactive performance capabilities on UNIX workstations running Smalltalk-80. Pope has developed MODE as an exemplar of a class of interactive systems he calls the Interim DynaPiano (IDP), which feature large memories, real-time sampled sound and MIDI drivers, C-based music and DSP software libraries, and interactive graphical software development environments running under a multitasking operating system (UNIX) on a commercial engineering workstation. Kyma also falls into the IDP systems class.
MODE objects can behave differently according to context; for example, the behavior can depend on what the object represents, or on how its value is stored. Sounds are described with such classes as pitch, loudness, and duration. These magnitude models are then represented in several ways, for use in various contexts: Pitch can be accessed as a MIDI key number, a floating point frequency, a note name string, or a harmonic series ratio. Time is represented with relative durations, and as is the case with the other magnitudes, it may be expressed in a variety of ways: “Duration objects can have simple numerical or symbolic values, … or they can be conditions (e.g., the duration until some event x occurs), Boolean expressions of other durations, or arbitrary blocks of Smalltalk-80 code” (Pope 1991a, 75-76). This flexibility extends to virtually all of the objects MODE includes and allows events, for example, to have as properties envelope functions, digital signal-processing scripts, compositional algorithms, or any other complex behavior that may be appropriate to the kind of musical structure being modeled. MODE and Kyma offer the power and clarity of object-oriented programming environments in the context of interactive performance, giving the system designer (composer) enormous latitude over the behavior and relations of defined musical entities. The reader is referred to Pope 1991b for further information.
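The idea of one magnitude readable in several representations can be sketched as follows. This is an illustration in Python, not MODE’s actual Smalltalk-80 interface; the accessor names are invented.

```python
# A sketch of a magnitude readable in several representations, as MODE's
# Pitch can be: the same stored value viewed as a MIDI key number, a
# frequency, or a note name string.

class Pitch:
    NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

    def __init__(self, key_number):
        self.key = key_number                        # stored as a MIDI key

    def as_midi(self):
        return self.key                              # MIDI key number

    def as_hertz(self):
        return 440.0 * 2 ** ((self.key - 69) / 12)   # equal-tempered Hz

    def as_name(self):
        return f"{self.NAMES[self.key % 12]}{self.key // 12 - 1}"

a4 = Pitch(69)
# a4.as_midi() == 69; a4.as_hertz() == 440.0; a4.as_name() == "A4"
```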
The artificial intelligence community, and several music applications as well, has a long tradition of investigation into the concept of a frame (Minsky 1986). A frame is a collection of related information, representing typical features of some situation, that will be used to direct processing whenever a variant of that situation is encountered. A frame consists of a number of terminals (or slots), which correspond to features usually found in the frame’s reference situation. For example, a frame for a chair might have terminals representing legs, a seat, a back, etc. “Much of the phenomenological power of the theory hinges on the inclusion of expectations and other kinds of presumptions. A frame’s terminals are normally already filled with ‘default’ assignments. Thus a frame may contain a great many details whose supposition is not specifically warranted by the situation” (Minsky 1985, 246-247).
This observation sparked a great deal of interest in the development of languages such as KRL (Bobrow and Winograd 1977), which facilitate the expression of related information in a frame notation. It is in fact this style of grouping information together, and relating different frames through frame arrays (Minsky 1986), or other chunking schemes, that has endured. Patrick Hayes, for example, questions whether the frame proposal ever described a representation scheme: “A more serious suggestion is that the real force of the frames idea was not at the representational level at all, but rather at the implementation level: a suggestion about how to organize large memories. Looked at in this light, we could sum up ‘frames’ as the suggestion that we should store assertions in nameable ‘bundles’ which can be retrieved via some kind of indexing mechanism on their names” (Hayes 1981, 457).
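The chair example can be sketched as a frame whose terminals carry default assignments that specific situations override. The Frame class below is illustrative, not a reconstruction of KRL.

```python
# A sketch of a frame whose terminals come pre-filled with 'default'
# assignments; observed features of a particular situation override them,
# and frames can inherit defaults from a more general frame.

class Frame:
    def __init__(self, name, defaults, parent=None):
        self.name, self.defaults, self.parent = name, defaults, parent

    def instantiate(self, **observed):
        slots = {}
        frame = self
        while frame:                       # inherit defaults up the chain
            for k, v in frame.defaults.items():
                slots.setdefault(k, v)
            frame = frame.parent
        slots.update(observed)             # warranted observations override
        return slots

chair = Frame("chair", {"legs": 4, "seat": True, "back": True})
armchair = Frame("armchair", {"arms": 2}, parent=chair)

stool = chair.instantiate(legs=3, back=False)   # overrides two defaults
easy = armchair.instantiate()                   # pure default assignments
```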
A distinct but related proposal is the semantic net, which links concepts through relations of inheritance and specialization. The KL-ONE language (Brachman and Schmolze 1985) was developed to describe semantic nets, and extended by Antonio Camurri and his colleagues for musical purposes in the HARP system (Camurri et al. 1991). An important extension was to include temporal primitives, which allow objects in the system to be situated, and related to one another, in time. For example, music objects in HARP have begin and end terminals, which can be used to define temporal relations between them.
A basic relation in semantic nets is the IS-A link, which defines some concept as an example of one or several others. For instance, in the HARP knowledge base, a canon at the unison IS-A canon. That is, canon-at-unison is understood as a specialization of the more general concept canon. Canon-at-unison will inherit the terminals of the canon concept, with some additional ones describing its specialized function. Another relation associates a terminal with some concept. In the semantic net describing Corrado Canepa’s composition Anceps Imago, the concept three-part-concerto IS-A compound-music-action. At the same time, the concepts 1st-part and 2nd-part fill terminals of three-part-concerto, indicating two of the sections of the piece.
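IS-A inheritance of terminals can be sketched after the chapter’s canon example. The terminal names used here (“leader,” “delay,” and so on) are invented for illustration, not taken from the HARP knowledge base; the begin and end terminals echo HARP’s temporal primitives.

```python
# A sketch of IS-A links in a semantic net: a specialized concept inherits
# the terminals of the concept it IS-A, and adds its own.

class Concept:
    def __init__(self, name, terminals=(), isa=None):
        self.name, self.own, self.isa = name, set(terminals), isa

    def terminals(self):
        inherited = self.isa.terminals() if self.isa else set()
        return inherited | self.own       # specialization adds terminals

# begin/end model HARP's temporal primitives
music_action = Concept("music-action", {"begin", "end"})
canon = Concept("canon", {"leader", "follower", "delay"}, isa=music_action)
canon_at_unison = Concept("canon-at-unison", {"interval=0"}, isa=canon)

inherited = canon_at_unison.terminals()
# canon-at-unison carries begin/end and the canon terminals, plus its own
```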
The passage of time is a critical concept to be included in any computational model of musical activity. We need to be able to manipulate operations on many temporal levels and to indicate their starting time, duration, rate of speed, etc. A script is a representation that defines an ordering for events. A restaurant script, for example, describes the typical situation of entering the restaurant, examining the menu, ordering, eating, paying, and leaving (Schank and Abelson 1977).
For a moment let us consider the generation of a musical phrase, with the realization that similar questions will arise on other temporal levels, such as groups or parts of phrases. If we can specify the kind of motion traversed by the phrase in terms of its harmonic, textural, and rhythmic functions, we can be directed to operations appropriate to it. A cadential phrase could include a pointer to a time map that might shape the rhythmic events to effect a slight ritardando leading to the cadential point. Again we will want to have representations of typical situations, as in the preceding example.
We can combine the idea of a script with that of the frame in a representation known as a trans-frame. A trans-frame is a frame including certain kinds of terminals: actor, origin, trajectory, destination, and vehicle are examples. Some questions a trans-frame will be used to address are, “Where does the action start? Where does it end? What instrument is used? What is its purpose or goal? What are its effects? What difference will it make?” (Minsky 1986, 219). Trans-frames represent some kind of transfer or transformation. They would be useful for interactive systems on higher levels of both listening and composing tasks, and for the same reasons: to the extent that a musical discourse shows a goal-oriented behavior, or makes recognizable variants of some preexisting material, such behavior could be represented, at least in part, by a trans-frame representation.
7.5 Neural Networks
The term “neural network” refers to a class of computer programs that learn to associate output states with input vectors, through a network of interconnected processing units whose patterns of connectivity are modified by a learning rule. Other terms describing the same basic idea are “connectionism” or “parallel distributed processing (PDP)” (Rumelhart and McClelland 1986). A neural network is made up of layers of units, or nodes, of which there are at least two: input and output units. Some networks include another layer between these, of so-called “hidden” units. Input units are set to values corresponding to some features of the situation being modeled. Output units reflect the result of the network computation. Hidden units form a layer between the input and output nodes and are not seen by anything outside the system.
Within the system, units have associated with them a level of activation. Activation is propagated between connected units, and an activation rule indicates how the levels of activation arriving from all inputs to a unit are to be combined with the unit’s own level to arrive at a new activation state. Depending on the nature of the problem, different numbers of input, hidden, and output units are used. Further, a connectivity pattern will be specified among all units in the system, designating which units will propagate activation to which others.
The simplified network shown in figure 7.12 has three input units, two hidden units, and one output unit. Again, the topology of any working network is determined by the nature of the situation being modeled and the facility with which the network is able to capture the necessary relations. Our example network is fully connected: between successive layers, each lower unit is connected to all the units in the layer above it. The network in figure 7.12 can be trained to generate values at the output node corresponding to certain sets of three values at the input. The training process involves defining a set of examples showing “correct” associations between input vectors and output values. These examples are presented to the network, where all of the weights on the connections between units are initially random. A learning rule then adjusts the weights on unit connections according to the contribution of each unit to a correct, or incorrect, result. For instance, in the learning regime known as back propagation, the input vector of a training set is presented to the net. According to the activation rule, activation percolates up through the hidden units to the output node. Then, the output value achieved is compared with the desired training value. The difference between the two is used to modify the connection weights in a traversal back down the net. Repeating this process yields a set of connection weights that, if successful, ensures that the network will associate the output values from the training set with the input vectors.
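A minimal version of this 3-2-1 network, trained by back propagation, can be written in plain Python. The sigmoid activation rule, learning rate, epoch count, and the toy training task (respond when at least two of the three inputs are on) are illustrative choices, not taken from the text.

```python
import math, random

# A 3-2-1 fully connected network in the spirit of figure 7.12, trained
# by back propagation.

random.seed(1)                 # reproducible "initially random" weights

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

w_h = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # +bias
w_o = [random.uniform(-1, 1) for _ in range(3)]                      # +bias

def forward(x):
    """Percolate activation from three inputs up through the hidden units."""
    h = [sig(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_h]
    o = sig(sum(w * v for w, v in zip(w_o, h + [1.0])))
    return h, o

def train(examples, epochs=4000, rate=0.5):
    for _ in range(epochs):
        for x, target in examples:
            h, o = forward(x)
            d_o = (target - o) * o * (1 - o)       # error at the output
            for j in range(2):                     # traverse back down
                d_h = d_o * w_o[j] * h[j] * (1 - h[j])
                for i in range(3):
                    w_h[j][i] += rate * d_h * x[i]
                w_h[j][3] += rate * d_h            # hidden bias weight
            for j in range(2):
                w_o[j] += rate * d_o * h[j]
            w_o[2] += rate * d_o                   # output bias weight

# training set: output 1 when at least two of the three inputs are on
data = [([0, 0, 0], 0), ([1, 0, 0], 0), ([0, 1, 0], 0), ([0, 0, 1], 0),
        ([1, 1, 0], 1), ([1, 0, 1], 1), ([0, 1, 1], 1), ([1, 1, 1], 1)]
train(data)
```

After training, the connection weights alone encode the association: the network classifies the training vectors without any explicit rule stating the “two or more” criterion.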
An extensive literature on training and using neural networks covers these issues in much greater depth; the reader is referred to Todd and Loy 1991 and Rumelhart and McClelland 1986. What is important to notice here is that neural nets are able to perform tasks such as associative learning, or the discovery of regularity, without rule sets. Everything the system “knows” is captured by the connection weights between processing units. Artificial intelligence has repeatedly run up against the brittleness of rule sets, which are explicitly formulated plans of action that break down when confronted with contexts outside their area of specialized expertise. For some purposes, particularly perception-like tasks such as pattern completion, neural networks are able to function well in a variety of contexts, without the rule orientation that can lead to a limiting specialization.
Neural networks have found their way into several music programs, including interactive ones. Michael Lee, Adrian Freed, and David Wessel report on a new Max object developed at the Center for New Music and Audio Technologies, called MAXNet, that performs neural net computations. The net has successfully been trained to recognize gestures from MIDI keyboards, guitar controllers, the Boie radio drum, and the Buchla Lightning controller (Lee, Freed, and Wessel 1991). In the following sections we will look at the application of neural networks to several musical problem areas, particularly as they relate to outstanding problems in interactive music systems.
Modeling Tonal Expectancies
Several researchers have used neural networks to model human perception of tonal relationships. The neural networks at the center of Cypher’s chord and key recognition agencies were derived from work reported in Scarborough, Miller, and Jones 1989. Another major contribution can be found in an article titled “Modeling the Perception of Tonal Structure with Neural Nets,” by Jamshed Bharucha and Peter Todd (1989). In it they discuss the cognitive issue of acquiring tonal competence: how untrained listeners are able to learn expectations about tonal harmonic relationships in normal Western music. Several perceptual experiments have shown that listeners without musical training are able to recognize departures from normal tonal patterns, identify distance relationships between chords, and decide whether chords are tuned or mistuned. Since this competence did not come from formal musical training, it must have been acquired passively through exposure to numerous examples of the style.
Bharucha and Todd discuss two kinds of expectancy: schematic and veridical. Veridical expectancies are learned for specific pieces of music. Schematic expectancies are culturally based, arising from the knowledge of typical sequences built from exposure to a number of veridical experiences. Bharucha and Todd were able to train neural networks to exhibit both schematic and veridical expectancies. They take this as evidence for a theory of harmony learning that should supersede rule-based models, because the neural net version acquires expectations through passive learning, rather than by accumulating a collection of ad hoc rules.
The first network they describe maps from some musical scale presented at the input units to the same scale at the output. Once trained, the net can fill in missing tones in the scale when presented with a subset or variation of the scale at the input. Further, such self-organizing networks can learn hierarchical harmonic relationships, such as those arising among tones, chords, and keys. A net called MUSACT, developed by Bharucha, showed emergent recognition of the importance of the circle of fifths in distance relationships between chords in a key (Bharucha 1987).
Bharucha and Todd then trained a sequential network to develop expectancies concerning chord progressions. The network was made sequential as follows: each input node had a self-recurrent connection, with an identical constant weight between 0.0 and 1.0. The self-recurrent connection fed the activation of each input node back to itself on successive input steps, thereby providing the input layer with a decaying memory of the preceding context. The input nodes corresponded to the six most common major and minor chords of a key. For each successive chord in a sequence, the input node for that chord was activated, and all nodes received their own previous activation level from the previous step, modified by the decay weight on their self-recurrent links.
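The self-recurrent input layer can be sketched as follows. The decay constant is arbitrary, and the choice to add the sounding chord’s activation to its decayed trace (rather than replace it) is an assumption of this sketch.

```python
# A sketch of the decaying input-layer memory: each node keeps a fading
# trace of its own previous activation via a self-recurrent connection.

CHORDS = ["I", "ii", "iii", "IV", "V", "vi"]   # six common chords of a key

def present_sequence(progression, decay=0.7):
    """Return the input-layer activations after each chord is presented."""
    acts = [0.0] * len(CHORDS)
    history = []
    for chord in progression:
        acts = [a * decay for a in acts]        # self-recurrent feedback
        acts[CHORDS.index(chord)] += 1.0        # activate the sounding chord
        history.append(list(acts))
    return history

trace = present_sequence(["I", "IV", "V", "I"])
# after the final chord the I node carries 1 + 0.7**3, the V node 0.7,
# and the fading IV trace 0.49
```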
After training on a number of typical Western harmonic chord progressions, the network developed a schematic expectancy for the probability of some chords following certain others. In other words, for each input chord, the activation of output nodes matched a probability distribution showing the likelihood of the six chord types as a successor. The tonic chord was likely to be followed by the subdominant or dominant, for instance, while the dominant itself aroused a probability distribution strongly favoring the tonic. Although the sequential network did not learn veridical expectancies for specific progressions in the training set, it did develop a schematic expectancy for the most probable transitions in Western chord sequences.
The network was then extended to include “additional context” units in the input layer. These context units were clamped to some values uniquely identifying particular chord progressions. With the
addition of this information, essentially the name of the sequence being learned, the expanded network could learn veridical expectancies: it predicted successive chords in particular examples. Once a number of examples had been learned, two new sequences were presented to the net: one followed normal harmonic transitions, and the other did not. The net learned the typical sequence much more rapidly than the other, demonstrating the schematic expectancies acquired by the net in the process of learning specific examples.
The work reported by Bharucha and Todd shows that neural networks that learn regularities in harmonic behavior can be developed through exposure to examples in a style. Particularly, networks trained to recognize specific sequences learn normative harmonic behavior as an emergent property. The techniques of neural network design and training tend to be well suited to tasks concerned with recognition, pattern completion, and expectation: Bharucha and Todd have shown that tonal expectancy is one such area in computer-based music perception.
Connectionism and Algorithmic Composition
In work closely related to that reported in the previous section, Peter Todd (1989) developed some sequential net designs for experiments in algorithmic composition. Again, this work was not developed for interactive systems, but could certainly be adapted for such use. Todd notes, “In contrast to the slow learning these networks often exhibit, their performance once trained is practically instantaneous” (Todd 1989, 38). He describes his method thus:
Our new approach to algorithmic composition is first to create a network that can learn certain aspects of musical structure, second to give the network a selection of musical examples from which to learn those structural aspects, and third to let the network use what it has learned to construct new pieces of music. We can satisfy the first step by designing a network that can exactly reproduce a given set of musical examples, because being able to reproduce the examples requires that the network has learned a great deal about their structure. (Todd 1989, 27-28)
The network Todd used to learn melodic sequences shows some features in common with the tonal expectancy nets reviewed in the previous section. Again plan units (called in the other work “additional context” units) are clamped to a unique identifying value for each individual sequence being learned. These plan units allow the net to learn specific progressions, rather than only generalizing from all of the training examples presented. Context is preserved in two ways: First, each input unit feeds back to itself, modified by a decay, so that more recent inputs are represented more strongly than more distant ones. Second, the output units are fed back to the input, eliciting the production of the next pitch in the sequence. During training, the target values are fed back to the context units, rather than the outputs. Once the targets are learned, attaching the outputs instead is equivalent, since now the targets and the outputs will be the same. Training proceeds much more quickly, however, if incorrect outputs are not maintained in the context units during each trial (Todd 1989, 37).
Input and output units represent absolute pitches, rather than some more key-independent quantity such as intervals. This is because when the network makes an error in the intervallic representation, the rest of the sequence is suddenly thrown into another key, producing a quite jarring effect. When each unit corresponds instead to one pitch, errors are kept local to one wrong note, rather than affecting the remainder of the example. Duration is represented as well: the output of the network is assumed to be quantized to some minimum time slice. Then, pitches that sound for more than one slice are presented in succession, for the full duration of the note.
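The pitch-and-duration encoding can be sketched as a small conversion routine: a note lasting several slices simply appears once per slice. The slice size and the melody used here are illustrative.

```python
# A sketch of the time-sliced output representation: a pitch sounding for
# more than one slice is presented in succession for its full duration.

def to_slices(melody, slice_dur=0.25):
    """melody: list of (pitch_name, duration_in_beats) pairs."""
    frames = []
    for pitch, dur in melody:
        frames.extend([pitch] * round(dur / slice_dur))
    return frames

frames = to_slices([("C4", 0.5), ("D4", 0.25), ("E4", 1.0)])
# frames == ["C4", "C4", "D4", "E4", "E4", "E4", "E4"]
```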
Once the network has been trained, it can be made to produce new melodies simply by altering the plan units. We have seen that the plan units identify particular melodies, such that when the plan units have the same values they did during training, the network will reproduce the same sequence of notes. If a network has been trained to produce a single melody, new ones can be made by setting the plan units to some value other than the one used during training. Peter Todd shows examples from this method and another approach in which the network learns several sequences, after which the plan units are varied to produce interpolations between two known melodies.
In my view the fallacy in Todd’s compositional model is the belief that “being able to reproduce the examples requires that the network has learned a great deal about their structure” (Todd 1989, 28). This is roughly equivalent to maintaining that a neural network capable of reproducing the letter sequence of a sentence has learned a great deal about what the sentence says. In fact, the net knows virtually nothing about the structure of the passages it has learned, and the shallowness of its understanding is quite clear from the variations produced. Meter, in particular, is ignored by the network; the simple 2/4 meter of the training melodies is routinely changed into essentially random combinations of eighth- and quarter-note durations. Harmonic considerations do not fare much better. Interpolations between two sequences, for example, consist of the beginning of one sequence and the close of the other, with some quasi-random selection of pitches from both in the middle. Peter Todd is correct in noting that the interpolated melodies produce unpredictable combinations of the originals; however, they also bear little relation to traditional variations.
Connectionism and Music Cognition
In his essay “Understanding Music Cognition: A Connectionist View,” Christoph Lischka (1991) directly discusses the application of artificial intelligence techniques to modeling music cognition. He is, as a point of departure, skeptical about the prospects for a theory of music cognition: “Both concepts, music as well as cognition, involve such complexities that it seems hopeless to clarify their mutual interaction” (Lischka 1991, 417). However, positing the view that the very idea of music cognition is culturally conditioned and therefore problematic, Lischka sets out to explore the possibilities for modeling musical thought through artificial intelligence, and, in particular, he examines how some of the underlying assumptions of AI limit the sphere of cognition that can be so treated.
As a case study, Lischka considers the harmonization of Bach chorales. He distinguishes three design paradigms from the AI tradition that might be applied: (1) decomposition of chorale harmonization into subtasks, which are individually solved and put back together for an overall solution; (2) case-based reasoning, in which situations are matched against known cases and the known cases are modified to suit the new problem; and (3) constraint satisfaction, where harmonization is modeled as a network of interacting constraints. Reviewing implementations built using some techniques from all three design paradigms, for instance those of Ebcioglu 1988 and Lischka and Güsgen 1988, Lischka reports, “Our implementation exhibits interesting behavior; the proposed harmonizations, for instance, are (in a sense) correct. But they are not exciting. What is lacking, for instance, is some kind of global coherency. Also, the examples are not very specific to J. S. Bach’s practice” (Lischka 1991, 427). He concludes that the difficulty lies in capturing the cognitive processes actually used by expert harmonizers: the techniques reported in knowledge acquisition interviews seem closest to case-based reasoning, but the perception-like nature of the thought processes involved makes it extremely difficult to elicit a chain of inference, particularly in the terms required by a symbolic, information-processing model.
Neural net models can skirt the knowledge acquisition problem by providing a system itself able to learn the necessary relations between input and output. Particularly for perception-like tasks, neural nets can be trained to recognize contexts and complete patterns, areas where elicitation of rules from knowledge acquisition is notoriously difficult. Rather than accept the relative advance of connectionist models for some kinds of information-processing tasks, Lischka goes on to question information processing itself as a paradigm for cognition. In arguments reminiscent of Winograd and Flores 1986, he notes problems with the information-processing paradigm’s reliance on representations, on inference as a general approach to problem solving, and on the idea that behavior can be decomposed into constituent parts approached individually and recomposed into an overall solution. He concludes,
If we are interested in building truly cognitive systems there is strong evidence to pursue a general paradigm shift. The usual consequence drawn from the previously mentioned insights is to yield to resignation concerning the possibility of AI. The argument runs more or less as follows: Because cognition seems to be inevitably bound to biological organisms, there will be no way to construct artificial systems that exhibit comparable functionality. On the contrary, we think that there exists a much weaker conclusion: Because cognition is probably bound to biological organisms, the only possible way to build artificial cognitive systems is to ground them on artificial organisms. (Lischka 1991, 436)
The research program Lischka proposes to reach a new paradigm involves implementing Piagetian schemata in artificial organisms, where complex behavior will emerge from the interaction of many small, simple elements. Because these interactions are nonlinear in organic systems, precise analysis is impractical. Rather, simulation on massively parallel machines with cellular automata or neural networks would provide a path to implementation. Lischka suggests simulation of the rise of auditory imagery in Piagetian terms, where sensorimotor schemata are interiorized in the development of an inner voice, as a research objective. Such an agenda is certainly not idle speculation; Gary Drescher reports on recent work implementing a Piagetian schema system in a computer program inhabiting a microworld: “The mechanism learns from its experiences by processes of empirical learning and concept invention, and uses what it learns to plan actions, often for the sake of explicit goals” (Drescher 1991, 1). The learning processes embedded in the program develop concepts increasingly independent of immediate perception, recapitulating some of the classic stages in Piagetian development.
Christoph Lischka’s provocative essay suggests a direction for AI-based interactive systems that would take the performer paradigm to a new level of independence. We know that many of the techniques he advocates (cellular automata, neural nets, etc.) can, once trained, function in real time. Finding ways to allow such systems to learn their own interactions with the musical world would produce programs that could not only be thought of as separate personalities but that would have ways of interacting with their environment that are unpredictable by their maker; they would truly decide, on the basis of their own nonsymbolic mode of perception, what to make of a musical context.
7.6 Pattern Processing
Pattern processing encompasses two goals: (1) learning to recognize important sequential structures from repeated exposure to musical examples, and (2) matching new input against these learned structures. I will refer to processing directed toward the first goal as pattern induction; processing directed toward the second will be called pattern matching. In pattern induction, we want to find sequences — strings of numbers representing such things as harmonic progressions, rhythmic patterns, or melodic fragments — that gain significance either through repeated use, or because of their relation to other known structures. Pattern matching, in contrast, is an algorithm for comparing new input to the known sequences found from pattern induction and signaling new instances of the stored strings.
The two kinds of processing have different effects. A successful application of pattern induction will yield a new pattern, to be added to a database, which is retained in the program’s long-term memory from execution to execution. The pattern-matching process can send out several different messages, depending on its progress through a match. For example, the matcher will emit one message at the completion of a successful match, identifying which pattern was found; another kind of message will signal an ongoing match, when the beginning of some pattern has been seen. In figure 7.13, we see the information from a completed phrase being taken up into the induction process, which compares the phrase with known strings and new strings recently heard by the program, to see if the phrase should be added to the database. Pattern matching compares incoming events with the known strings, to see if there is a match.
Precedents of this kind of pattern processing are found in several sources, musical as well as nonmusical. Pattern recognition, a field that would seem to bear a strong resemblance to pattern induction, is in fact quite distinct; pattern recognition’s most common area of inquiry is computer vision, and it encompasses several techniques developed for finding objects in a raw image. The patterns dealt with here differ in two important respects from vision problems: first, the patterns we seek to find are played out over time, and second, our patterns gain significance through repetition and their resemblance to known structures, rather than through the intrinsic qualities of any individual instance. Similarly, strict matching of precisely defined patterns is only a beginning for the processing needed to deal with musical sequences. Deviations from an exact repetition of stored patterns can arise from many causes, and a musical pattern matcher needs to be able to accommodate them: examples could include transposition, change in ordering, omissions, and variations.
In our parlance, pattern induction is the assertion that certain sequences of numbers have been repeated to such a degree that they should become marked as significant and remembered for later use. David Cope’s Experiments in Music Intelligence (EMI) are an example of pattern induction in this sense: large samples of music from particular composers are analyzed by the program to extract patterns, or “signatures,” that exceed frequency thresholds maintained by the software. “This article assumes that 1) pattern matching is a powerful technique to use in attempting to discover some of the reasons why composers’ styles sound as they do; 2) patterns judged as alike which occur in different works of a composer are valuable and constitute what the author will call signatures; 3) re-using such signatures in music composed following standard chord protocols and voice leading can lead to replications in the style of the composer” (Cope 1990, 288).
In Cope’s view, culling patterns from large collections of data representing several examples of a particular composer’s work captures significant features of that composer’s style. Demonstrations of the learned styles are built by pressing signatures into a prepared substrate of tonal material. For example, an EMI composition in the style of Mozart begins with a template of textbook sonata-allegro form and adds signature patterns such as motives and cadences to provide the proper Mozartian stylistic traits.
What I call induction Cope refers to as matching: the heart of his system is a group of routines that exhaustively search number sequences to find repeated series. The number sequences that are processed represent chains of intervals. The first step in the induction process reduces strings of pitches to strings of the intervals separating those pitches. In this way the same material beginning on different scale degrees will have an identical, intervallic representation. Cope’s algorithm then performs an exhaustive search of melodic sequences to find those considered identical, within certain tolerances.
One tolerance he uses is a range tolerance, which will allow intervals falling within the specified range to match. During the search part of the algorithm, each successive interval in the two strings being matched is compared. Identical intervals will, obviously, be considered a match; intervals whose absolute difference does not exceed the range tolerance will match as well. For example, setting the range tolerance to 1 will produce matches between major and minor mode versions of a melody, because the scale degrees of the major and minor modes usually differ by no more than one half step.
Another tolerance, or “tuner,” as Cope calls them, is the error tolerance, which is the maximum number of times interval matches can fall outside the bounds of the range tolerance without aborting the match as a whole. If the error tolerance is nonzero, patterns with some small number of significant deviations in intervallic structure will still be recorded as patterns. This tuner allows variants of some sequence to be recognized as similar, where the variations go beyond the simple major/minor discrepancies handled by the range tolerance.
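As a rough illustration, the two tuners can be combined in a single comparison routine. The sketch below is not Cope’s code; the function name and fixed-length interface are invented for clarity.

```c
#include <stdlib.h>

/* Hedged sketch of Cope-style interval comparison. Two "tuners"
   govern the match: range is the largest absolute difference two
   intervals may show and still be counted alike; errors is the
   number of out-of-range comparisons tolerated before the match
   as a whole is abandoned. */
int intervals_match(const int *a, const int *b, int len,
                    int range, int errors)
{
    int misses = 0;
    for (int i = 0; i < len; i++) {
        if (abs(a[i] - b[i]) > range && ++misses > errors)
            return 0;   /* too many significant deviations */
    }
    return 1;           /* sequences considered identical */
}
```

With the range tolerance set to 1 and the error tolerance to 0, the major arpeggio intervals {4, 3} and their minor-mode variant {3, 4} match, as the text describes.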
David Cope’s EMI experiments are interesting because they implement a proven method for finding significant sequences, where significance is regarded as a function of frequent repetition; in other words, an algorithm for pattern induction. This evaluation is somewhat different from Cope’s own; he considers the program to be finding fundamental elements of a composer’s style. That claim I find overblown: style is a combination of processes operating on many levels of harmonic, rhythmic, and formal structure. Cope’s signatures correspond, in my view, to an inventory of a composer’s clichés rather than to the essence of her style.
Simon and Sumner
In their 1968 article “Pattern in Music” Herbert Simon and Richard Sumner outline a method of describing alphabets and their manipulation that represents significant features of temporal sequences. Moreover, Simon and Sumner consider patterns to be among the fundamental structures of music cognition and in particular to be a strong determinant of the expectations and goal-directed motion that are a basic part of the experience of Western music. “We are led by these studies to conclude that pattern-seeking is a common activity of people faced with temporal sequences, and that the vocabulary, or stock of basic concepts they have available for describing patterns is parsimonious, involving only a few elementary notions together with rules for combining them, and also is relatively independent of the specific stimulus material” (Simon and Sumner 1968, 222).
Simon and Sumner’s agenda assumes that patterns are the primary building blocks of music — if an exacting notation of patterns and transformations could be developed, entire pieces of music could be described and generated from their basic pattern structure. Though I regard pattern processing as one perspective on musical experience, which only gains coherence in conjunction with other modes, I am nonetheless struck by the terms they employ to describe patterns and their function in music:
Patterns involve periodicity — repetition (in a generalized sense) at intervals that occur periodically (in a generalized sense). Patterns make use of alphabets — sets of symbols ordered in a definite sequence. Patterns can be compound — made up of subpatterns which can themselves be represented as arrangements of symbols. Patterns generally possess phrase structure, which may be explicitly indicated by various forms of punctuation. Patterns, as we have already seen, may be multidimensional. Repetition in pattern generally involves variation. (Simon and Sumner 1968, 228)
Their treatment of patterns comprises two stages of processing, which bear some resemblance to my own: the first is called pattern induction, finding structure in a sequence of elements (I have adopted their terminology for an analogous operation). The second process is termed sequence extrapolation, the generation of an extension to some pattern according to the structure found through pattern induction.
“The pattern inductor may be thought of as a ‘listener,’ since it accepts the music (or the score) as input, and detects the melodic, harmonic, and rhythmic relations implicit in it” (Simon and Sumner 1968, 244). The work reported in this article is an intriguing approach to pattern processing, which relates the cognitive machinery of music perception to such tasks as letter sequence completion. I think, however, that the “magic bullet” status Simon and Sumner accord to pattern processing is too optimistic: the predictive power they describe accounts for only a part of the full experience of music. Their technique, in particular, does not learn or retain any sequences, regardless of their importance or frequency in the examples it has seen. Moreover, it is not clear how their rules should be applied in real time; the rules for finding phrase boundaries, for example, require much more backtracking than is possible for a system trying to find boundaries as they occur.
CompAss
The CompAss project developed at the Institut für Informatik in Stuttgart is a system for editing musical compositions on the phrase level (Mahling 1991). Built on top of Smalltalk-80, CompAss includes two groups of editors: the first group is used to modify time-independent relationships between musical objects, and the second group defines a composition’s form. The time-independent editors are coordinated in a PhraseStructureEditor window, allowing different views on musical objects. The output of some chain of operations can be seen, for example, in common music notation, or a MIDI event list. A PhraseArrangementEditor window manages the second group of editors, laying out in time the phrases defined with the first group.
Musical objects are organized into a hierarchical knowledge base, in which concepts are related to one another through taxonomy networks and PlayContexts. A PlayContext provides all the information necessary to perform a composition from the computer system, including MIDI-level details such as synthesizers and program changes. The processing layer provided by a PlayContext reinterprets MIDI information coming from the rest of the system according to the nature of the device to which it is routed. For example, Note On messages routed to synthesizers will have the customary effect. The same messages routed to a drum machine will select different drum sounds by the pitch number. Note On messages routed to a MIDI mixer will select channel numbers through the pitch field and fader settings with the velocity value.
The analytical function of CompAss is meant to abstract essential characteristics of musical phrases from several examples. The abstracted patterns are found by a procedure called constant matching. In constant matching, two phrases are compared with each other. Any elements that differ between the two phrases are replaced with variables. This procedure can be continued with any number of example phrases: the pattern abstracted is guaranteed to match all examples from the training set. CompAss uses patterns found in this way as part of the compositional process. Arbitrary new phrases can be matched against the stored patterns; if similar phrase patterns are found, aspects of the stored abstraction can be used to guide further elaboration of the new phrase. Constant matching is complemented by a set of incremental abstraction editors, which allow a user to pull salient relations out of a single phrase interactively. The abstractions supported by CompAss are directed toward representing such processes as voice leading: relationships among tones in a phrase can be abstracted as intervallic networks, as groups of scale degree members, or in terms of duration ratios.
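A minimal sketch of constant matching might look as follows; the wildcard encoding and the function name are assumptions for illustration, not CompAss internals.

```c
#define WILDCARD (-1)   /* assumed encoding for a variable position */

/* Sketch of constant matching: elements common to both phrases
   survive in the abstraction; positions where the phrases differ
   are replaced by variables. Repeating the procedure over a
   training set yields a pattern matching every example. */
void constant_match(const int *p1, const int *p2, int len, int *out)
{
    for (int i = 0; i < len; i++)
        out[i] = (p1[i] == p2[i]) ? p1[i] : WILDCARD;
}
```

Comparing two phrases that differ only in their second element, for instance, yields an abstraction with a variable in that position and constants everywhere else.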
Though CompAss is not a real-time interactive system, the motivations behind it lead us to consider it here. The constant matching algorithm, coupled with user-directed pattern abstraction editors, provides mechanisms for the system to learn phrase types and for a musician to improve on those representations. As it stands, new phrases can be matched against the stored patterns in a session of computer-assisted composition. One can easily imagine CompAss-style patterns incorporated in a real-time pattern matcher, where associated rule sets determine how the system will react to matches between the known patterns and incoming or machine-generated phrases.
7.7 Induction and Matching
Pattern induction is a form of unsupervised learning; the goal is to identify patterns in a musical sequence that have been repeated to such a degree that they should be remembered as significant and used in constructing responses. The learning is unsupervised because it occurs in the course of performance; another approach would be to teach the system the patterns required to function in some context. This would bias the computer performer towards sequences known in advance, however, and would limit its capacity to pick up on and use regularities introduced in the course of an improvisation. For this reason, induction explores the more difficult, unsupervised situation. Patterns can be induced from any number sequence; such sequences could be interpreted as representing melodies, rhythmic patterns, harmonic progressions, etc. In Cypher, patterns are induced and matched for melodies and harmonic progressions. A similar procedure has been implemented as part of the McGill University TMax project.
In their article “Real-Time Computer Accompaniment of Keyboard Performances,” Joshua Bloch and Roger Dannenberg describe techniques for matching monophonic and polyphonic performances against stored scores representing the expected performance. The patterns Cypher currently considers are strings of numbers, which may be interpreted in various ways: as chord progressions, for example, or progressions of rhythmic durations. The patterns considered in Bloch and Dannenberg 1985 are the Note On messages arriving from the performance of some musical score. Therefore, Cypher’s problem corresponds to their monophonic matching case: the program does not try to coordinate the reports of multiple listeners or consider simultaneous harmonic or rhythmic streams, and so it is able to avoid the polyphonic matching problem they describe later.
Briefly, the Bloch and Dannenberg algorithm computes a matrix relating score events (the stored representation) to performance events (the notes coming from the live player). Each time a new performance event arrives, the next column of the matrix is computed: finding the maximum rating of possible score to performance correlations will point to the most likely position in the score of the current performance. This approach maps well to the pattern matching task: we want to match incoming information against stored strings to see if any known pattern is being followed. The known pattern, then, corresponds to the score in the Bloch and Dannenberg application, and new input corresponds to their performance. There are, however, significant differences as well. First of all, pattern processing requires matching against many different patterns simultaneously. Second, there is no a priori way to know when a pattern may begin; new input must be continually matched against the head of many patterns to find incipient pattern instances.
Pattern induction is basically a matter of pattern matching, but with different kinds of data preparation and different interpretation of the results. Following this idea, the pattern-matching algorithm of Bloch and Dannenberg was made neutral enough to accommodate the demands of both induction and straight matching. There are two main advantages to this strategy: One is that a single matcher can be maintained, with the differences in processing required for induction and matching reflected in the way the process is called and the way its return value is used. Second, this pattern-matching algorithm is much more efficient than the exhaustive search employed by Cope. Cope’s intention was certainly not to develop a real-time pattern matcher; further, his exhaustive methods will turn up patterns not found by the algorithm described here. The point of pattern processing in Cypher, however, is to coordinate the induction and matching of patterns with other kinds of analysis in real time. Therefore, the constrained method developed by Bloch and Dannenberg is more appropriate to both the induction and matching problems.
The induction process is called twice, once for each type of data that can spawn a pattern, at every phrase boundary. Phrase boundaries are demarcated in the manner explained in chapter 5; once a boundary has been found, the harmonic progression and melodic intervals for the most recently completed phrase are sent to the induction process to be matched against progressions and melodies from earlier phrases. The TMax environment developed at McGill University employs a similar task division: as MIDI data arrives, it is first sent to a process called the segmenter, which uses several heuristics to identify possible phrase boundaries. When such a boundary has been located, a k-mismatch pattern-matching algorithm is trained on the events of the phrase. When a pattern is successfully matched against the stored collection of known patterns, the matcher reports back to the main process the location of the found pattern in the phrase. Otherwise, the new material is saved as a pattern in the collection (Pennycook and Lea 1991).
As noted above, the Bloch and Dannenberg algorithm maintains a matrix of score-to-performance matches. The basis of their technique is described thus:
An integer matrix is computed where each row corresponds to an event in the score and each column corresponds to an event in the performance. A new column is computed for each performed event. The integer computed for row r and column c answers the following question: If we were currently at the rth score event and the cth performance event, what would be the highest rating of any association up to the current time? The answer to this question can be computed from the answers for the previous column (the previous performance event) and from the previous row of the current column. The maximum rating up to score event r, performance event c will be at least as great as the one up to r-1, c because considering one more score event cannot reduce the number of possible matches. Similarly, the maximum rating up to r, c will be at least as great as the one up to r, c-1, where one less performance event is considered. Furthermore, if score event r matches performance event c, then the rating will be exactly one greater than the one up to r-1, c-1. (Bloch and Dannenberg 1985, 281)
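The recurrence described in this passage is essentially a longest-common-subsequence computation. A minimal sketch, assuming a unit weight for each match (a real accompaniment system would weight matches and keep only the previous column, as discussed in the surrounding text):

```c
/* Sketch of the Bloch and Dannenberg rating computation:
   rating[r][c] is the best score-to-performance association using
   the first r score events and c performance events. Each cell is
   the maximum of the cell above, the cell to the left, and, when
   the events match, the diagonal cell plus one. */
int best_rating(const int *score, int slen, const int *perf, int plen)
{
    int rating[slen + 1][plen + 1];   /* C99 variable-length array */
    for (int r = 0; r <= slen; r++) {
        for (int c = 0; c <= plen; c++) {
            if (r == 0 || c == 0) {
                rating[r][c] = 0;
                continue;
            }
            int m = rating[r - 1][c];
            if (rating[r][c - 1] > m)
                m = rating[r][c - 1];
            if (score[r - 1] == perf[c - 1] &&
                rating[r - 1][c - 1] + 1 > m)
                m = rating[r - 1][c - 1] + 1;
            rating[r][c] = m;
        }
    }
    return rating[slen][plen];
}
```

Matching the "score" {1, 2, 3, 4, 5} against the "performance" {1, 9, 2, 3, 9, 5} yields a rating of 4: the events 1, 2, 3, and 5 are associated in order, and the wrong notes 9 cost nothing beyond the lost match.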
Because the only information needed to compute the next column of the matrix is the previous column, the storage required for each pattern is simply twice the largest pattern size. Further, the computation can be centered around the highest rating achieved so far, limiting the necessary processing to a constant as well. These characteristics of the algorithm make it eminently suitable for our real-time requirements. The Cypher routine StringMatcher(s, start, newelement) — where s is a pointer to a string, start is the position in the matrix of the highest match, and newelement is the next element to be processed — is an implementation of the algorithm described by Bloch and Dannenberg. To induce harmonic progressions or melodic intervals, the matcher is handed two patterns: one from the current phrase, and one from the previous phrase. The larger pattern plays the role of the performance, in the Bloch and Dannenberg sense, and the smaller pattern is used as a score. In other words, the larger pattern is matched against the smaller one. Each element from the larger pattern is successively sent to the matcher, always to be matched against the smaller pattern. After all elements from the larger pattern have been processed, the highest rating achieved in the smaller pattern’s matrix is the last element in the smaller pattern’s array that was actually found. If this rating is larger than 4, that is, if at least 4 elements from the smaller pattern were also found in the larger one, the induction is successful, and an attempt is made to add a new entry to the list of known patterns. Now the newly induced pattern must be compared to those already known; accordingly, it is matched against all patterns already in memory. If the rating after matching the new pattern against a known pattern is 4 or more, the patterns are considered to be the same. In that case, the known pattern’s strength is incremented, and the process ends. 
If the new pattern does not match any known pattern, it is added to the list to be saved in the program’s long-term memory.
The Cypher analysis record for every event in a listening stream includes a representation of the event’s harmonic function relative to the active key. As each successive event is read from the stream and added to the phrase, the root of the local harmonic area and the current key are obtained from their respective agencies. With these two pieces of information, a simple calculation provides the function. Both chords and keys are stored as an integer between 0 and 23. Major and minor modes of the same root are stored adjacently; chord and key number zero are both C major. Therefore, for both chord and key, C major is 0, C minor is 1, C# major is 2, C# minor 3, and so on. With this system, we can use the following define statement to calculate chord function within the key:
#define Function(key, chord) (((chord + 24) - (key - (key % 2))) % 24)
The effect of the defined Function statement is to move the position of 0 across the collection of chord theories. If both key and chord are already 0, for example, the position of 0 in the chord theories will remain on C major and will indicate that chord as the tonic of the key. If the key is 4 (D major) and the chord 8 (E major), Function will return the value 4, indicating a major chord based on the scale degree a major second above the tonic. If the key is minor, however, Function subtracts the value of the major key on the same root; this keeps minor functions on odd values and major functions on even values, which is the convention for all harmonic reports in the program. Relating local harmonies to the key effectively removes key from consideration in storing and matching harmonic progressions. A functional progression of tonic to dominant to tonic will look identical, in this representation, regardless of the key in which it was originally presented. Chord progressions are sent through the induction process at each phrase boundary. When a phrase boundary is found, Cypher makes a list of the chord functions by reading from the beginning of the just completed phrase to the end, appending to the list function values for every event along the way. If a function for some event is found to be identical to the one before, it is discarded.
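The worked examples in this paragraph can be checked by direct substitution into the macro:

```c
#define Function(key, chord) (((chord + 24) - (key - (key % 2))) % 24)

/* With chords and keys numbered 0-23 (C major = 0, C minor = 1,
   C# major = 2, ...), substitution gives:
   Function(0, 0) -> 0  C major chord in C major key: the tonic
   Function(4, 8) -> 4  E major in D major: major second above tonic
   Function(4, 9) -> 5  E minor in D major: odd value marks a minor function
   Function(5, 9) -> 5  D minor measures from the root of its parallel major */
```

The `(key % 2)` term is what implements the rule for minor keys: subtracting it reduces any minor key number to the number of the major key on the same root before the chord is measured against it.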
The other current source of sequences is melodic activity. Melodies are strings of single-note events; the phrase boundary demarcation is used again to identify melodic groups to be submitted to the pattern inducer. Preprocessing of melodic information into a pattern requires two steps: First, we are currently considering only monophonic melodies, so as the preprocessor steps through the events in a phrase, events with a vertical density greater than 1 are discarded. Second, as in the case of harmonic progressions, it is desirable to give the pattern some degree of independence from the specific context in which it is embedded. In the case of melodies, this is done by casting specific pitch sequences into sequences of intervals. Each new pitch appended to the melody list is compared with the previous one: the difference between the two note values is the value added to the list. So a rise from C3 to E3 (MIDI notes 60 to 64) is represented as +4, an ascending major third. Descending intervals will produce negative values. Intervallic representation allows melodic contours beginning on different scale degrees or in different registers to be successfully matched. Note that this representation means that there will be one fewer interval generated than the total number of events in the phrase.
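The second preprocessing step can be sketched in a few lines; the function name is illustrative, and the density filter of the first step is assumed to have run already.

```c
/* Reduce a monophonic pitch string to the intervals between
   successive notes: ascending intervals positive, descending
   negative. A phrase of len pitches yields len - 1 intervals. */
void pitches_to_intervals(const int *pitch, int len, int *interval)
{
    for (int i = 1; i < len; i++)
        interval[i - 1] = pitch[i] - pitch[i - 1];
}
```

A phrase rising from C3 to E3 to G3 and falling back to C3 (MIDI 60, 64, 67, 60) becomes the pattern {+4, +3, -7}, regardless of the register or scale degree on which it began.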
As patterns are induced, they are added to a list of known patterns. Two lists are maintained: one for melodic patterns, and the other for harmonic progressions. Every time Cypher ceases operation, it writes out a file storing a record of the patterns it knows. This file represents the long-term memory of the program. Then whenever Cypher starts up again, it loads in the patterns from long-term memory and activates them for matching. In this way, the program incrementally learns and uses patterns found in the music played to it.
Each known pattern has an associated strength: the strength is an indication of the frequency with which the pattern has been encountered in recent invocations of the program. Whenever the inducer notices a pattern, it attempts to add the pattern to the known list. Before such an addition will succeed, the pattern is matched against those already on the list. If the pattern is similar to one already known, that is, if it matches most of the known pattern, the new copy will not be appended. In that case, however, the strength of the already stored pattern will be increased. Pattern strengths are used for two purposes: First, strong patterns will be used more often by the composition section as seed material. Second, pattern strength decays slightly with every program execution. If a pattern is not seen again through a sufficient number of executions, it will be deleted from the list. In other words, if a pattern is learned but then not seen again for a long time, it will be forgotten — dropped from the long-term memory file.
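The bookkeeping just described might be sketched like this; the structure layout, the pattern size limit, and the decay of one strength unit per execution are assumptions for illustration, not Cypher's actual representation.

```c
typedef struct {
    int elems[32];    /* the stored sequence (size limit assumed) */
    int length;
    int strength;     /* bumped on each re-induction, decays per run */
} Pattern;

/* Sketch of long-term memory maintenance: at each program startup,
   every stored pattern's strength decays by one; patterns whose
   strength reaches zero are forgotten. Surviving patterns are
   compacted to the front of the list; returns the new count. */
int decay_patterns(Pattern *list, int count)
{
    int kept = 0;
    for (int i = 0; i < count; i++) {
        list[i].strength--;
        if (list[i].strength > 0)
            list[kept++] = list[i];
    }
    return kept;
}
```

A pattern re-induced often enough accumulates strength faster than the per-execution decay removes it, so frequently heard material persists while one-off inductions eventually drop out of the long-term memory file.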
The pattern induction process just described works, after a fashion: it has successfully found chord progressions and simple melodies from listening to music. These successes, however, are limited to what I would call “toy” examples. Induced chord progressions, for instance, are found by repeating the progression literally several times in succession until the program notices it. Similarly for melodies, simple sets of intervals not more than eight notes long can be found by pattern induction if they are repeatedly played to the machine.
Much more desirable, of course, would be pattern induction that could function in the real world. Finding chord progressions and melodic patterns from improvisation or performance of compositions not geared to the algorithm is where this work is intended to lead. The reasons such “real-world” functionality remains elusive, in fact, have much less to do with finding an efficient matching algorithm than they do with good preparation of data to present the matcher as input. There are at least three major problems here: (1) building patterns on the right level, (2) successfully grouping information into patterns, and (3) storing good candidates for repeated matching attempts.
Different pattern types will present meaningful sequences of data on different levels. Melodic information, for example, will most often be found on level 1, in the succession of events; however, some evaluation of the data should be done before building a candidate pattern, to remove ornaments, for example. A “real-world” pattern inducer should be able to find melodies repeated with some ornamentation. Having a speed feature classification at the grace-note level would be a first approximation: events with the highest speed rating could be discarded when building melodic patterns for induction. Harmonic progressions, however, are most likely not found on the event level. Rather than trying to match chord sequences presented in a phrase, the progression preparation should try to identify an important, or dominant, chord or two within each phrase. Sequences of important chords could then be built into patterns and presented as candidates for induction.
The second problem, finding good groups, is related to the workings of the phrase boundary agency. As it is now, induction takes place at the end of each phrase, comparing the just completed phrase with the one before it. This works reasonably well, as long as the phrase boundary finder is reliably finding boundaries in the same place when presented with similar music. Particularly before the phrase threshold has stabilized, this is not always the case. Perfectly plausible patterns, then, may not be found, because a spurious phrase boundary would interrupt the construction of exactly the pattern we might wish to notice.
The third problem involves the question of how to identify sequences that are worth saving for future matching attempts. In “real-world” music, interesting patterns will rarely be literally repeated end to end. Normally, some melody, for example, will appear and reappear, but with each occurrence separated from the others in time. In a pattern inducer looking for matches between adjacent phrases, such sequences will not be found. What we need is some way to select melodic or harmonic strings that are interesting in their own right, so they will be kept around and matched against future sequences even when an immediate match is not found. This is the main lesson to be learned from the induction process reported here: repetition alone is an insufficient heuristic for finding good sequences. Using only repetition means that all sequences will have to be stored and tried out for induction, in an exhaustive search that could quickly drop out of real time. Melodies and harmonies must also be recognized as interesting because of some intrinsic qualities and tested repeatedly for matches against subsequent material.
The pattern-matching algorithm described at the beginning of section 7.7, in connection with the induction process, is also used for the related problem of matching induced patterns against incoming data: StringMatcher(), the routine that compares two patterns and returns the rating of the highest matched element between them, does the work in both cases. In this section, I will review the differences in the way data is prepared, and in the way results are interpreted, for the matching problem.
Because of the properties of StringMatcher(), every incoming event can be tested against the known patterns: there is no need to wait for a phrase boundary to find groups, as in the case of pattern induction. When the rating of a known pattern becomes nonzero, after exposure to incoming events, some part of it has been matched. Therefore, the general strategy is this: When a new event arrives, it is matched melodically and harmonically against known patterns. If more than half of a known pattern is found, a message is sent to the composition section along with the remainder of the pattern. If all of a pattern is recognized, another message is sent to the player, the rating matrix is reset to zero, and the process begins again.
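The decision logic just described can be sketched as follows. The enum and function names are illustrative only, and the routines that actually message the composition section and the player are not shown.

```c
#include <assert.h>

/* Possible outcomes after matching one incoming event against a
   known pattern (names are hypothetical, not from Cypher). */
typedef enum { NO_ACTION, SEND_REMAINDER, PATTERN_COMPLETE } MatchAction;

/* Decide what to report: more than half matched sends the remainder
   of the pattern to the composition section; a complete match
   triggers the player message and a reset of the rating matrix. */
MatchAction ClassifyMatch(int rating, int patternLength)
{
    if (rating == patternLength)
        return PATTERN_COMPLETE;
    if (rating > patternLength / 2)
        return SEND_REMAINDER;
    return NO_ACTION;
}
```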
Though the matrix-rating algorithm of Bloch and Dannenberg has proved to be a useful starting point, significant adaptations have been made to it for Cypher’s pattern-matching tasks, because the demands here differ from those of score following. In the score-following case, for example, an assumption is made that the performance will not go backward: once the matcher is relatively sure of having matched some part of a score, it will try to continue on from that point. In matching melodic patterns, however, we can never be sure when a pattern might start, and we have less reason to believe that matching part of it means we will not see that part again before the rest of the pattern arrives. Therefore, the ratings of the original algorithm must be modified if a partial match has not been continued for some time.
One of the efficiencies of the Bloch and Dannenberg algorithm is that a window can be centered on the highest-rated event found in the matrix, and matching continued from there. The same convenience is adopted here, as evidenced by the start argument to the StringMatcher() routine. The deviation from the Bloch and Dannenberg version comes when no new match is found. In that case, one is subtracted from the rating of every element in the matrix. Our window begins at the point of the latest match, represented by the start argument, and continues from there to the end of the pattern. Therefore, when we subtract one from all ratings after an unsuccessful match, we effectively begin the next match looking from one element further back in the target sequence. If the next element is found on a subsequent attempt, nothing is lost: because we continue to search to the end of the pattern, it will match in the same place as if no decay of the ratings had taken place. If the next element is not found for several events, however, the decay in ratings means that the matcher will again be searching from the beginning of the pattern. In other words, if the continuation of some pattern is not detected, the decay in element ratings gradually returns the matcher to the start condition, where it is again looking for the head of the sequence.
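The window-plus-decay behavior can be made concrete with a small sketch. This is a simplified stand-in for StringMatcher(), assuming integer (pitch) elements and a single ratings array rather than the full Bloch and Dannenberg matrix; all names are illustrative rather than Cypher's.

```c
#include <assert.h>
#include <string.h>

#define MAXPAT 32

/* Hypothetical matcher state: one known pattern, per-position
   ratings, and the window start (point of the latest match). */
typedef struct {
    int pattern[MAXPAT];
    int length;
    int ratings[MAXPAT];
    int start;
} Matcher;

void ResetMatcher(Matcher *m)
{
    memset(m->ratings, 0, sizeof m->ratings);
    m->start = 0;
}

/* Feed one incoming event; return the best rating achieved so far.
   A return value equal to m->length signals a complete match. */
int MatchEvent(Matcher *m, int event)
{
    int i, best = 0;

    /* window: from the latest match point to the end of the pattern */
    for (i = m->start; i < m->length; i++) {
        if (m->pattern[i] == event) {
            m->ratings[i] = (i > 0 ? m->ratings[i - 1] : 0) + 1;
            m->start = i;          /* continue from here next time */
            return m->ratings[i];
        }
    }

    /* unsuccessful match: decay every rating by one, and slide the
       window one element back toward the head of the pattern */
    for (i = 0; i < m->length; i++) {
        if (m->ratings[i] > 0)
            m->ratings[i]--;
        if (m->ratings[i] > best)
            best = m->ratings[i];
    }
    if (m->start > 0)
        m->start--;
    return best;
}
```

Repeated misses drive all ratings back to zero, so the matcher ends up once again hunting for the head of the pattern, as described above.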
We have seen that the matcher returns a rating representing the number of elements of the new pattern matched against a known one. Because of the way this matcher uses the rating matrix, the first location in the matrix of the maximum rating reached by the match equals the position of the last matched element in the known pattern. With this information, the composition section can schedule the rest of the pattern for performance, or for further elaboration by a compositional network. The routine FinishMelody(String *s) is called when the rating and number of matches returned for some pattern are greater than half its length. FinishMelody() then examines the maxrating array to find the last matched element; the elements after that last match are scheduled to play on successive beats following the invocation of FinishMelody(). This is a crude but effective way to test the matcher; a frame representation storing the number of beats associated with each remaining element of the pattern would better preserve the rhythmic integrity of remembered material. Still, playing out the remainder on the beat is already much preferable to playing it back at an unchanging speed whenever the beginning is found.
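A sketch of how the last matched position might be recovered from the maximum-rating array, and the remainder extracted for beat-by-beat playback, follows. The shape of maxrating here is a guess at the representation described above, and both routine names are hypothetical; the actual beat scheduler is not shown.

```c
#include <assert.h>

/* First location of the maximum rating equals the position of the
   last matched element in the known pattern (as described above). */
int LastMatched(const int *maxrating, int length)
{
    int i, max = 0, pos = 0;

    for (i = 0; i < length; i++) {
        if (maxrating[i] > max) {   /* strict >: keep the FIRST maximum */
            max = maxrating[i];
            pos = i;
        }
    }
    return pos;
}

/* Collect the elements after the last match into out[]; a scheduler
   would then play out[k] on the (k+1)th beat after invocation. */
int RemainderToPlay(const int *pattern, const int *maxrating,
                    int length, int *out)
{
    int i, n = 0, last = LastMatched(maxrating, length);

    for (i = last + 1; i < length; i++)
        out[n++] = pattern[i];
    return n;
}
```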