In this book, we have examined the underpinnings of interactive music systems in terms of both their technical implementation and the musical possibilities they advance. The specific realizations reviewed here are unquestionably quite diverse; still, a number of recurring trends emerge. Central to many interactive systems is the transfer of some degree of musical intelligence to the machine. In fact, virtually all computer programs dealing with musical issues could benefit greatly from such a transfer; however, the degree to which musicianship can be functionally captured by a computer program remains an open question.
Researchers in several fields, including music theory, cognitive science, and artificial intelligence, are each considering this question from their own perspective. Composers and performers working with computers onstage form another source of development for machine musicianship. Some kinds of music are effectively made this way; in fact, much of the music made with computers today could not be produced in any other way. The development and interpenetration of these differing movements, as they transfer musical sensibility to a computer, will continue to show the ways in which human music cognition can be so modeled and which parts of musical skill will continue to defy any computable formulation. The search will continue under many of the rubrics we have already seen; as with any vital undertaking, it will also branch into new paths that are unsuspected now or whose outlines are just coming into view. The remarks in this chapter identify some of those paths and provide a motivation for undertaking the effort at all.
8.1 Cooperative Interaction
An important goal for interactive systems is to devise artificial musicians able not only to respond to human players but to cooperate with them. The programs we have reviewed here demonstrate a style of interaction we can refer to as call and response: the human user initiates some action, and the computer system responds. To arrive at a more sophisticated, cooperative interaction, the system must understand the directions and goals of its human counterpart well enough to predict where those directions will lead, and it must know enough about composition to reinforce those goals at the moment they are achieved in the human performance. Programs are now forced to rely on call and response because they are unable to predict what a performer will do next. The only exceptions to this observation are score followers, an example of limited cooperative music systems: their derivation of tempo is in fact a prediction of the future spacing of events in a human performance, based on an analysis of the immediate past.
Although endowing a computer with the clairvoyance to know what a human will do next seems a formidable undertaking, many kinds of music are highly predictable. If they were not, even humans would not be able to play together. Rhythm offers a clear example: music to which one can tap a foot displays a predictable rhythm — otherwise, you would not know when to put your foot down. For a cooperative system to be equally prescient, it must pick up on regularities, or near regularities, in the onset times of external events. Once it has detected such a regularity, it can assert a continuation of it and begin scheduling events to coincide with the found pulse. The beat-tracking schemes we examined in chapter 5 do just this.
For example, the algorithm proposed in Dannenberg and Mont-Reynaud 1987 does not rely on a stored representation of the music to be followed and could form a component of a generally cooperative system. Mont-Reynaud’s tracker is sensitive to “healthy” notes, that is, notes exceeding a minimum time threshold and not significantly shorter than the preceding note. To build a predictor, the program then tries to notice arithmetic regularities in the offset times of three successive healthy notes. If roughly equal offset times are found, they are brought into a typical eighth-note range and taken to be the current beat. George Lewis has a technique for beat following that depends on the detection of a “subtempo,” some short duration that, when added to itself some integer number of times, will account for most of the actual durations being heard. In other words, the Lewis algorithm is additive, looking to find large durations from combinations of short ones, rather than looking for a tempo of longer values and subdividing it to account for the quick notes.
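As a concrete illustration, the sketch below detects rough regularity among inter-onset intervals in the spirit of the preceding description: it discards notes that are too short or much shorter than their predecessor, looks for three successive, roughly equal intervals, and folds the result into an eighth-note range. This is not the published algorithm; the thresholds, tolerance, and folding range are illustrative assumptions only.

```c
/* A minimal sketch of beat-regularity detection over note onsets.
 * All numeric constants are illustrative assumptions. */
#include <math.h>
#include <stdio.h>

#define MIN_IOI         80.0  /* ms: ignore very short notes              */
#define SHRINK_LIMIT     0.6  /* reject intervals much shorter than last  */
#define EQUAL_TOLERANCE  0.15 /* "roughly equal" = within 15 percent      */

/* Fold an interval into a plausible eighth-note range (assumed 200-400 ms). */
static double fold_to_eighth(double ioi)
{
    while (ioi > 400.0) ioi /= 2.0;
    while (ioi < 200.0) ioi *= 2.0;
    return ioi;
}

/* Return an estimated beat period in ms, or 0.0 if no regularity is found. */
double detect_beat(const double *onsets, int n)
{
    double healthy[64];
    int nh = 0;

    /* Collect inter-onset intervals of "healthy" notes. */
    for (int i = 1; i < n && nh < 64; i++) {
        double ioi = onsets[i] - onsets[i - 1];
        if (ioi < MIN_IOI) continue;
        if (nh > 0 && ioi < SHRINK_LIMIT * healthy[nh - 1]) continue;
        healthy[nh++] = ioi;
    }

    /* Look for three successive, roughly equal intervals. */
    for (int i = 2; i < nh; i++) {
        double mean = (healthy[i - 2] + healthy[i - 1] + healthy[i]) / 3.0;
        if (fabs(healthy[i - 2] - mean) < EQUAL_TOLERANCE * mean &&
            fabs(healthy[i - 1] - mean) < EQUAL_TOLERANCE * mean &&
            fabs(healthy[i]     - mean) < EQUAL_TOLERANCE * mean)
            return fold_to_eighth(mean);
    }
    return 0.0;
}

int main(void)
{
    double onsets[] = { 0.0, 310.0, 620.0, 915.0, 1230.0, 1540.0 };
    printf("estimated beat period: %.1f ms\n", detect_beat(onsets, 6));
    return 0;
}
```

Once a period is returned, a response process can schedule its own events at multiples of that period from the most recent onset, which is what is meant above by asserting a continuation of the found pulse.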
Extrapolating from the beat-detection case, we can see that predicting the future of a musical performance requires the construction of high-level recognizers: processes that watch the results of event-to-event analyses, searching for regularities and directions. Once we have a pattern, predicting the future reverts to the score-following case: not in the strict sense of matching tempi but in the general sense of comparing incoming material with what we believe to be an ongoing pattern and notifying response processes about the goodness of fit. If our follower is matching well, the responders can schedule output events to coincide with events in the remainder of the stored pattern. The work on pattern induction and recognition, and the use of scripts and frames reported in chapter 7, form fragments of the mechanism needed to accomplish such behavior.
Once such prototype induction/matching mechanisms are formed, we will learn whether correlations of patterns found by beat tracking, chord finding, and key recognition might allow the cooperative performance of some musical styles. If a listener has informed a generation section that “we are following a harmonic pattern (I-IV-I-V-I), in which the chords change every four beats. The current chord is tonic in the key of C, and the next chord change will occur two beats from now,” the generator could schedule events consonant with the upcoming chord to be played in relation to the found beat pattern.
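A minimal sketch of how a response process might act on such a report follows. The report structure, chord table, and scheduling stub are hypothetical, intended only to show consonant events being placed on predicted beats around the announced chord change.

```c
/* A minimal sketch: a generator uses a (hypothetical) listener report
 * to place chord tones of the upcoming chord on its first few beats. */
#include <stdio.h>

typedef struct {
    int    chord_root;        /* current chord root: 0 = C, 7 = G, ...  */
    int    beats_to_change;   /* beats until the next chord change      */
    double beat_period_ms;    /* period reported by the beat tracker    */
    double now_ms;            /* current time                           */
} listener_report;

/* Chord tones of a major triad, as semitone offsets from the root. */
static const int triad[3] = { 0, 4, 7 };

/* Stand-in for a real scheduler: just print what would be played. */
static void schedule_note(double when_ms, int pitch)
{
    printf("at %7.1f ms play MIDI note %d\n", when_ms, pitch);
}

/* Reinforce the upcoming chord in time with the found beat pattern. */
void reinforce_next_chord(const listener_report *r, int next_root)
{
    double change_time = r->now_ms + r->beats_to_change * r->beat_period_ms;
    for (int beat = 0; beat < 3; beat++) {
        double when = change_time + beat * r->beat_period_ms;
        schedule_note(when, 60 + next_root + triad[beat]);
    }
}

int main(void)
{
    /* "Tonic in C; the next change (to IV, root F) is two beats away." */
    listener_report r = { 0, 2, 500.0, 0.0 };
    reinforce_next_chord(&r, 5);   /* 5 semitones above C = F */
    return 0;
}
```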
Recognition of directed musical motion can take place in two ways, performed independently or in combination: either direction can be detected in the event-to-event behavior of the input, or the input can be recognized as following a stored template of typical musical patterns. The first kind of recognition is the one addressed by the work of Simon and Sumner 1968: they assume certain fundamental alphabets, such as scales, and a small number of ways of organizing the elements of those alphabets. Building on these assumptions, they write processes to find patterns arising from manipulating the alphabets with the given operations. Once such a pattern has been found, they can extrapolate (predict) the extension of the sequence by continuing to apply the discovered operations on the underlying alphabet.
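The sketch below illustrates the alphabet-and-operations idea under heavy simplification: the alphabet is a single major scale, the only operation considered is a fixed step of scale degrees, and the sequence is extrapolated by continuing to apply that step. The representation and the example melody are illustrative, not Simon and Sumner's.

```c
/* A minimal sketch of pattern induction over an alphabet (a scale)
 * and a single operation (a fixed step in scale degrees). */
#include <stdio.h>

static const int c_major[7] = { 0, 2, 4, 5, 7, 9, 11 };  /* pitch classes */

/* Find a pitch class's index in the alphabet, or -1 if absent. */
static int scale_index(int pc)
{
    for (int i = 0; i < 7; i++)
        if (c_major[i] == pc) return i;
    return -1;
}

/* If every step of the sequence moves the same number of scale degrees,
 * store that step size in *step and report success. */
int induce_step(const int *pcs, int n, int *step)
{
    int s = 0;
    for (int i = 1; i < n; i++) {
        int a = scale_index(pcs[i - 1]), b = scale_index(pcs[i]);
        if (a < 0 || b < 0) return 0;
        int d = ((b - a) % 7 + 7) % 7;
        if (i == 1) s = d;
        else if (d != s) return 0;      /* no single operation fits */
    }
    *step = s;
    return 1;
}

int main(void)
{
    int melody[] = { 0, 4, 7, 11 };  /* C E G B: every other scale degree */
    int step;
    if (induce_step(melody, 4, &step)) {
        int idx = scale_index(melody[3]);
        printf("predicted continuation (pitch classes):");
        for (int i = 1; i <= 3; i++)    /* extrapolate three more notes */
            printf(" %d", c_major[(idx + i * step) % 7]);
        printf("\n");
    }
    return 0;
}
```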
The second kind of direction detection would be an extension of the pattern induction/matching mechanism. The same script representations carrying information about typical harmonic and rhythmic behavior could be fitted with additional slots describing motions of intensification and relaxation along some dimension. Again, it remains to be seen whether scripts with sufficient subtlety and variability to be musically useful can be devised.
8.2 Parallel Implementations
Current generations of high-end computers are based on some degree of parallelism, in which several processors are used simultaneously to attack engineering and scientific problems. The interactive music systems discussed in this book are usually implemented on personal computers, for reasons of access, transportability, and cost. As parallel hardware implementations begin to move down into the personal computing class, interactive systems will start to take advantage of the higher power offered by collections of CPUs. With the power of parallel processors, however, comes the problem of programming them effectively, to make use of the opportunities for simultaneous execution of several processes. The software architecture of interactive systems, in other words, needs to be split into two or more subtasks that can be treated in parallel.
T-MAX
One of the first efforts in this area is the T-MAX project, developed by Bruce Pennycook and his colleagues at McGill University. T-MAX stands for Transputer Max; it is a software environment based on Max, running on a Macintosh computer platform extended with three transputers (25 MHz floating-point devices) and a Digidesign Sound Accelerator card. In a related development, a team from the Durham Music Technology center has developed a parallel architecture for real-time synthesis and signal processing, also built around Inmos Transputers (Bailey et al. 1990).
A primary task of the T-MAX project is to implement an Automatic Improviser, an interactive system capable of tonal jazz improvisation. On a parallel hardware platform, one of the first decisions is how to divide the task across the available processors. For the Automatic Improviser, three partitions were devised — listening, pattern matching, and playing. The partitions were spread across the processor resources as follows: (1) the Macintosh host handles MIDI transfers and system I/O, (2) one T805 transputer directs communication between the transputers and the Macintosh, (3) one T805 performs real-time analysis (listening), and (4) one T805 handles pattern matching and performance algorithms.
The devotion of one transputer to directing communication traffic to and from the Macintosh host allows transparent access to the processing power of the other T805s from within the T-MAX environment. When a T-MAX object is loaded in Max, T-MAX loads the executable code onto the appropriate transputer, and sets up a communications channel between the affected transputer and the Max window. In this way, a user can generate as many instances of a T-MAX object as are needed, with the system taking care of all of the details of allocating processor resources, and passing messages across the machine boundary.
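The same division of labor can be suggested on a modern shared-memory machine. The sketch below runs a listener and a player as two threads connected by a small message queue; it is an illustrative analogue of the transputer partitioning, not T-MAX code, and the event format, queue, and simulated input are assumptions.

```c
/* A minimal sketch of partitioning listening and playing across two
 * concurrent tasks that communicate through a message queue. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define QSIZE 16

typedef struct { int pitch; int velocity; } event;

static event  queue[QSIZE];
static int    head = 0, tail = 0;
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

/* Listener task: package incoming notes (simulated here) and post them. */
static void *listener(void *arg)
{
    int input[] = { 60, 64, 67, 72 };
    for (int i = 0; i < 4; i++) {
        pthread_mutex_lock(&lock);
        queue[tail] = (event){ input[i], 90 };
        tail = (tail + 1) % QSIZE;
        pthread_cond_signal(&ready);
        pthread_mutex_unlock(&lock);
        usleep(100000);                 /* pretend events arrive in time */
    }
    return NULL;
}

/* Player task: consume analyses and generate a response. */
static void *player(void *arg)
{
    for (int i = 0; i < 4; i++) {
        pthread_mutex_lock(&lock);
        while (head == tail)
            pthread_cond_wait(&ready, &lock);
        event e = queue[head];
        head = (head + 1) % QSIZE;
        pthread_mutex_unlock(&lock);
        printf("respond to pitch %d\n", e.pitch);
    }
    return NULL;
}

int main(void)
{
    pthread_t lt, pt;
    pthread_create(&lt, NULL, listener, NULL);
    pthread_create(&pt, NULL, player, NULL);
    pthread_join(lt, NULL);
    pthread_join(pt, NULL);
    return 0;
}
```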
The listening part of the Automatic Improviser project is an adaptation and extension of Cypher. Incoming MIDI data is packaged into events and sent to the T-MAX listener object. This object was originally developed from the C code of Cypher’s listener, then elaborated by the T-MAX developers. The T-MAX listener performs the two-level analysis described in chapter 5, the results of which become available from two outlets on the object, one outlet for each level of analysis. Separate objects called extractors then accept the analyses coming from the listener object and pull out the classifications for each separate feature, or regularity report. These classifications, available at the outlets of the extractors, can then be further processed in any way by the full Max environment.
The patch shown in figure 8.1 is one of the first implementations of this idea, showing an early test of the listener in a Max environment. Here, the listener object is a straight import of the Cypher listener. Pitch and velocity information from notein is sent to the inlets, and the level-1 and level-2 listener classifications come from the two outlets. The companion objects fextract (for feature extract) and rextract (for regularity extract) then make the various classifications available for other processing. In figure 8.1, a collection of dials and buttons shows the classifications emanating from the two extractors.
8.3 Signal Processing
To this point we have considered signal processing as an emanation from some black box generically called a synthesizer, or sampler. The use of commercially available synthesis devices tends to encourage standardized approaches to signal processing, typified in the worst case by the exclusive use of factory presets. In fact, the ubiquitous insinuation of a small subset of synthesis techniques into the sonic arsenal, or even the repeated use of exactly the same sounds, is a recent development in the history of real-time or interactive systems. Earlier examples tended to be built around high-speed signal processing devices, among which the 4X machine, developed at IRCAM by Giuseppe di Giugno and his team, is the most prominent example (Favreau et al. 1986).
Now, after a period of relative domination by commercially supported synthesis techniques, signal processing is again being opened to more widespread experimentation by the appearance of such devices as the Digidesign Sound Accelerator (Lowe and Currie 1989), the NeXT machine’s digital signal processing (DSP) capabilities (Jaffe and Boynton 1989), and the M.I.T. Media Laboratory’s Real-Time Acoustic Processing (RTAP) board (Boynton and Cumming 1988). All of these devices are built around high-speed DSP chips, such as the Motorola 56001, which is at the heart of the aforementioned hardware packages. As the design and manufacture of digital signal processors pass to large chip makers, these devices will become cheaper, more powerful, and more widespread, and the processors, as well as the synthesis techniques, will become standard.
The example of the 4X machine is instructive, as it relates to the kinds of interaction widespread DSP capabilities will make possible. The main differences between the 4X and the current generation of DSP devices are that the 4X was much more powerful, while the new chips are well integrated with personal computer platforms such as Apple’s Macintosh family or the NeXT machine. The 4X was most often used as a high-powered synthesis engine or real-time sound processor. But what made the 4X clearly superior to a rack full of specialized boxes was not its suitability to any particular synthesis technique, but that it was programmable and could implement whatever mix of functionalities was necessary in the way that was most appropriate to the piece at hand (Baisnee et al. 1986).
On the control level, the addition of signal-processing capabilities leaves the composer’s task largely unchanged — organizing the processing and arrangement of sound through the course of the composition or improvisation. It is at the level of the input and output to the control algorithms that the choices will be multiplied. On input, DSP capabilities allow access to aspects of instrumental sound not well represented by MIDI, most notably continually varying timbral information. On output, the composer can use flexible combinations of processing algorithms and an expanded range of real-time sound transformation techniques. Although using DSP devices is more demanding than selecting presets or even using a sound librarian, the sound world expressed and understood by interactive systems is already being enriched through the incorporation of real-time digital signal processing.
The Hierarchical Music Specification Language (HMSL), developed at Mills College, is a Forth-based, object-oriented programming language that supports the development of interactive music compositions. Recently HMSL has been extended to include digital signal processing objects, which are realized on several Motorola 56000-based architectures. The language itself makes few assumptions about the nature of the devices that will be used to realize any particular piece: “HMSL compositions have used electric motors, graphics devices, solenoids, text files, and MIDI synthesizers, as output devices” (Burk 1991, 15). To use the DSP capabilities, a composer concatenates various sound units, such as oscillators, filters, reverb units, envelopes, and the like. These units are realized with small code resources installed on a resident 56000 processor.
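As a rough idea of what such a chain of units computes, the sketch below strings a sine oscillator through a decay envelope and a one-pole lowpass filter in C rather than in HMSL's Forth. The unit names and parameter values are illustrative only and do not correspond to HMSL's actual DSP objects.

```c
/* A minimal sketch of chaining sound units: oscillator -> envelope -> filter. */
#include <math.h>
#include <stdio.h>

#define SR 44100.0
#define PI 3.14159265358979

int main(void)
{
    double phase = 0.0, freq = 440.0;
    double env = 1.0, decay = 0.9995;    /* exponential decay envelope */
    double lp = 0.0, alpha = 0.1;        /* one-pole lowpass state     */

    for (int i = 0; i < 10; i++) {       /* print the first few samples */
        double osc = sin(phase);                      /* oscillator  */
        phase += 2.0 * PI * freq / SR;
        double shaped = osc * env;                    /* envelope    */
        env *= decay;
        lp += alpha * (shaped - lp);                  /* lowpass     */
        printf("%f\n", lp);
    }
    return 0;
}
```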
Max and DSP
IRCAM has again set a standard for the incorporation of digital signal processing in interactive systems, with the recent IRCAM Signal Processing Workstation (ISPW). One of the most novel aspects of the ISPW is that it is built around a high-powered but general-purpose processor (the Intel i860), rather than a device designed particularly for signal processing (such as the Motorola 56000 series). Indeed, as general-purpose processor speeds increase, the need for specially designed devices will diminish. In any case, the market for fast processors is so much larger than the one for signal-processing devices that development efforts in the first realm will tend to overwhelm those in the second sooner rather than later.
The Max patch in figure 8.2, provided by Cort Lippe, shows a simple example combining control- and audio-level objects in a single patch. The notein and stripnote objects are familiar by now; a transition from control- to audio-level information is effected with the mtof object, which converts MIDI note numbers into a frequency value to control an audio oscillator. Every Max object whose name ends with the “~” character operates in the audio realm. Therefore, in the audio portion of the patch in figure 8.2, a simple oscillator is multiplied with the output of a simple line segment generator to control the amplitude. The result is sent to both channels of a stereo digital-to-analog converter, which can be enabled and disabled with start and stop messages, shown connected to the left inlet.
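For readers unfamiliar with the objects involved, the sketch below reproduces in C what the patch computes: the mtof conversion (equal temperament, with A4 = 440 Hz at MIDI note 69) and an oscillator whose amplitude follows a linear ramp, as the line segment generator would supply. The constants and ramp length are illustrative assumptions.

```c
/* A minimal sketch of the patch's computation: mtof plus a ramped oscillator. */
#include <math.h>
#include <stdio.h>

#define SR 44100.0

static double mtof(int note)              /* MIDI note number -> Hz */
{
    return 440.0 * pow(2.0, (note - 69) / 12.0);
}

int main(void)
{
    double freq = mtof(60);               /* middle C, about 261.63 Hz */
    double phase = 0.0;
    int ramp_len = 4410;                  /* 100 ms fade-in            */

    for (int i = 0; i < ramp_len; i++) {
        double amp = (double)i / ramp_len;            /* line segment */
        double sample = amp * sin(phase);             /* osc * ramp   */
        phase += 2.0 * 3.14159265358979 * freq / SR;
        if (i % 1000 == 0)
            printf("sample %4d: %f\n", i, sample);
    }
    return 0;
}
```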
The composer Todd Winkler has explored the use of Max for processing live acoustic sources through commercial digital signal processing devices, such as the Yamaha SPX-1000 or DMP-11 mixer. Many of the same effects realizable with the ISPW can be achieved in this way, though the configuration again must confront a machine boundary between the computer and the DSP device, with a concomitant reliance on MIDI. In Winkler’s work, a Max patch sends continuous controller data to affect parameters of signal-processing algorithms programmed in the commercial devices, modifying the sound of an acoustic instrumental performance.
For example, a module called “Capture Gesture” uses EXPLODE to record the dynamics and rhythm of a 6-second violin phrase. Using velocity as break points, the MAX “line” object creates a function that is sent as continuous controller data to change reverberation time. By lengthening the overall time of the function, the apparent size of the hall changes continuously over the next two minutes of the piece, with the same proportions as the original phrase, thus enabling the performer to “phrase” the apparent size of the hall. (Winkler 1991, 547)
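A sketch of the underlying idea follows: breakpoints captured from a short phrase are stretched over a much longer span and interpolated to yield a stream of controller values. The breakpoint data, controller number, and scaling below are illustrative assumptions, not Winkler's patch.

```c
/* A minimal sketch of stretching a captured gesture into controller data. */
#include <stdio.h>

typedef struct { double time_s; int value; } breakpoint;

/* Velocities of a hypothetical 6-second phrase, used as break points. */
static const breakpoint phrase[] = {
    { 0.0, 40 }, { 1.5, 90 }, { 3.0, 64 }, { 4.5, 110 }, { 6.0, 50 }
};
#define NPOINTS 5

/* Linearly interpolate the stretched function at time t (seconds). */
static int value_at(double t, double stretch)
{
    double src = t / stretch;                   /* map back to phrase time */
    for (int i = 1; i < NPOINTS; i++) {
        if (src <= phrase[i].time_s) {
            double span = phrase[i].time_s - phrase[i - 1].time_s;
            double frac = (src - phrase[i - 1].time_s) / span;
            return (int)(phrase[i - 1].value +
                         frac * (phrase[i].value - phrase[i - 1].value));
        }
    }
    return phrase[NPOINTS - 1].value;
}

int main(void)
{
    double stretch = 120.0 / 6.0;               /* 6 s phrase over 2 min   */
    for (int t = 0; t <= 120; t += 15)          /* emit a value every 15 s */
        printf("t=%3ds  send controller 91 value %d\n",
               t, value_at(t, stretch));
    return 0;
}
```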
8.4 Conclusion
Trouble would begin, however, if mechanical music were to flood the world to the detriment of live music, just as manufactured products have done to the detriment of handicraft. I conclude my essay with this supplication: May God protect our offspring from this plague!
-Bela Bartok, “Mechanical Music”
Bartok’s (1976) supplication concerns primarily the introduction of recording technology, though he does cite the example of one determined composer who attempted a kind of synthesis by etching his own grooves into the vinyl of a phonograph record. With the application of ever more powerful technologies to the production of music, the ways music is composed, performed, and experienced have changed radically. The fears about the future of music, particularly music performance, have grown as well. In an article in the New York Times of May 13, 1990, titled “The Midi Menace: Machine Perfection is Far From Perfect,” Jon Pareles (1990) notes with distress the burgeoning use of sequenced MIDI material as a substitute for human players in the performance of pop music. He writes, “If I wanted flawlessness, I’d stay home with the album. The spontaneity, uncertainty and ensemble coordination that automation eliminates are exactly what I go to concerts to see; the risk brings the suspense, and the sense of triumph, to live pop.”
Similarly, I have often heard professional musicians complain that soon all the performance work will go to machines. At first, I took such complaints as evidence of a lack of familiarity with the field: it seemed to me there is little danger of machines taking over the concert stage as long as they remain such remarkably poor musicians. And yet, as Pareles notes, machines are assuming an ever-increasing role in the performance of music. Because of the nature of the machine’s participation, such occasions come to resemble less a live performance than the public audition of a tape recording.
As this book has demonstrated, however, interactive systems are not concerned with replacing human players but with enriching the performance situations in which humans work. The goal of incorporating humanlike music intelligence grows out of the desire to fashion computer performers able to play music with humans, not for them. A program able to understand and play along with other musicians ranging from the awkward neophyte to the most accomplished professional should encourage more people to play music, not discourage those who already do.
It is in this respect that interactive music programs can change the way we think about machines and how we use them. We do not currently expect a machine to understand enough about what we are trying to do to be able to help us achieve it. Many problems must be solved to make a music machine able to show such sensitivity. But the means, the technical and intellectual foundations, and the people needed to address these problems are already engaged. Now we must concentrate on improving the musicianship of our emerging partners.
In developing my own computer musician, I attempt to make human participation a vital and natural element of the performance situation — not because I am concerned about putting performers out of work but rather because I believe that if the numbers of humans actively, physically making music declines, the climate for music making and the quality of music will deteriorate. Pareles concludes, “Perhaps the best we can hope for is that someone will come up with a way to program in some rough edges, too.” I hope we can do better than that: develop computer musicians that do not just play back music for people, but become increasingly adept at making new and engaging music with people, at all levels of technical proficiency.