Two basic pillars of interactive systems are MIDI handling and scheduling. The Musical Instrument Digital Interface (MIDI) standard was developed by instrument and synthesizer manufacturers, and it allows controllers, computers, and synthesizers to pass data among themselves (Loy 1985). All of the systems reviewed here deal with the MIDI standard to some extent and therefore require software components to receive, package, and transmit properly formatted MIDI messages. The second imperative of music programs is to be able to perform tasks at specified points in time. Music is a temporal art, and any computer program dealing with music must have sophisticated facilities for representing time and for scheduling processes to occur at particular points in time.

The processing chain of interactive computer music systems can be conceptualized in three stages. The first is the sensing stage, when data is collected from controllers reading gestural information from the human performers onstage. Second is the processing stage, in which a computer reads and interprets information coming from the sensors and prepares data for the third, or response stage, when the computer and some collection of sound-producing devices share in realizing a musical output.

An important reason for dividing the chain into these three links is that there are usually machine boundaries between each successive stage. Commercial manufacturers dominate the sensing and response stages, through MIDI controllers and synthesizers. The sophistication of these devices, which implement several modes of operation and can often themselves be programmed, requires a discussion of sensing and response, rather than simply of a real-time computer program with some input and output. In full-blown interactive music systems, functionality is spread across three clusters of machines, and the designer often can choose to place certain methods within one or another cluster.

The processing stage has commercial entries as well, most notably MIDI sequencers. It is in processing, however, that individual conceptions of interactive music are most readily expressed, in any of a variety of programming languages with temporal and MIDI extensions. In this chapter, we will examine the three phases of interactive processing, looking at the hardware associated with each, the impact of machine boundaries separating each phase, and the nature of communications protocols developed to pass information between them.

2.1 Sensing

The recent fast growth in the development of interactive music systems is due in no small part to the introduction of the MIDI standard. We will begin our examination of the sensing stage with a consideration of MIDI, followed by a look at some of the standard’s limitations and the way new sensing technologies have grown up to fill the gaps.

The MIDI Standard

The MIDI standard is a hardware specification and communications protocol that allows computers, controllers, and synthesis gear to pass information among themselves (Loy 1985). MIDI abstracts away from the acoustic signal level of music, up to a representation based on the concept of notes, comprising a pitch and velocity, that go on and off. The MIDI abstraction is eminently well suited to keyboard instruments, such as piano or mallet percussion, which can be represented as a series of switches. In fact, MIDI sensing on such instruments operates by treating each separate key as a switch. When the key is depressed, a Note On message is sent out, indicating which key was struck and with what velocity. When the key is released, a Note Off message (or, equivalently, Note On with a velocity of zero) is transmitted with the key number.
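The switch model described above can be made concrete with a minimal sketch of how note messages are packaged. The status values 0x90 (Note On) and 0x80 (Note Off), combined with a channel number in the low four bits, come from the MIDI specification; the type and function names are our own, chosen purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* A three-byte MIDI channel message: one status byte, two data bytes.
   The type name midi_msg is hypothetical. */
typedef struct {
    uint8_t status;
    uint8_t data1;
    uint8_t data2;
} midi_msg;

/* Note On: status 0x90 plus channel (0-15); key and velocity are 7-bit values. */
midi_msg make_note_on(uint8_t channel, uint8_t key, uint8_t velocity)
{
    midi_msg m;
    assert(channel < 16 && key < 128 && velocity < 128);
    m.status = (uint8_t)(0x90 | channel);
    m.data1 = key;
    m.data2 = velocity;
    return m;
}

/* Note Off: status 0x80 plus channel, with the key number and a release velocity. */
midi_msg make_note_off(uint8_t channel, uint8_t key, uint8_t velocity)
{
    midi_msg m;
    assert(channel < 16 && key < 128 && velocity < 128);
    m.status = (uint8_t)(0x80 | channel);
    m.data1 = key;
    m.data2 = velocity;
    return m;
}

/* True if the message ends a note: an explicit Note Off,
   or a Note On whose velocity is zero. */
int is_note_end(midi_msg m)
{
    uint8_t kind = (uint8_t)(m.status & 0xF0);
    return kind == 0x80 || (kind == 0x90 && m.data2 == 0);
}
```

A receiver that tests messages with is_note_end treats both release conventions alike, which is why a controller is free to send either one.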

The note concept is the fundamental MIDI paradigm. All MIDI instruments implement it. Following its genesis from commercial keyboard controllers, MIDI represents continuously varying control functions as well. One of these is known as pitchbend, usually implemented with a wheel built into the instrument. Each time the wheel is moved, a new pitchbend value is transmitted to the processing stage. Such continuous control messages are transmitted with a controller number and a value. Many instruments allow the user to assign controller numbers to physical devices, remapping the pitchbend wheel, for example, to another controller number. The generation of continuous controls is therefore, in most cases, easily reconfigurable. Moreover, additional layers of remapping and interpretation can be implemented in the processing and response stages.
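A remapping layer of the kind mentioned here might be sketched as a simple lookup table in the processing stage. This is only an illustration: the names are hypothetical, and the sketch handles MIDI control change messages (status 0xB0 plus channel), whereas in the MIDI specification pitchbend itself travels as a separate message type. Controller 1 (modulation wheel) and controller 7 (channel volume) are standard assignments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_CONTROLLERS 128

/* Hypothetical remapping table: remap[n] is the controller number
   sent onward when controller n arrives. */
typedef struct {
    uint8_t remap[NUM_CONTROLLERS];
} controller_map;

/* Start with the identity mapping: every controller maps to itself. */
void controller_map_init(controller_map *map)
{
    int i;
    assert(map != NULL);
    for (i = 0; i < NUM_CONTROLLERS; i++)
        map->remap[i] = (uint8_t)i;
}

/* Route one incoming controller number to another. */
void controller_map_assign(controller_map *map, uint8_t from, uint8_t to)
{
    assert(map != NULL && from < NUM_CONTROLLERS && to < NUM_CONTROLLERS);
    map->remap[from] = to;
}

/* Rewrite a three-byte control change message in place.
   Returns 1 if msg was a control change, 0 if it was left untouched. */
int controller_map_apply(const controller_map *map, uint8_t msg[3])
{
    assert(map != NULL && msg != NULL);
    if ((msg[0] & 0xF0) != 0xB0)
        return 0;
    assert(msg[1] < NUM_CONTROLLERS);
    msg[1] = map->remap[msg[1]];
    return 1;
}
```

After controller_map_assign(&map, 1, 7), movements of the modulation wheel arrive downstream as volume changes, while note messages pass through unchanged.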

Although the introduction of the MIDI standard has had the salutary effect of greatly expanding research into and performance with interactive music systems, use of the standard imposes limitations of several kinds (Moore 1988). First, because MIDI communicates on the control level — performance gestural events rather than any representation of the audio signal — the standard cannot be used to describe or control much of the timbral aspect of a musical performance. Control over synthesis and the evolution over time of any particular sound is coded into each synthesizer and may be affected through MIDI only by using an ad hoc collection of triggers, continuous controls, and system exclusive commands, private to each machine and inimical to the very idea of a standard.

That said, several synthesis algorithms have been built into commercial synthesizers that are able to achieve a quite broad and subtly varying range of sounds. The most successful of these offer a small number of fairly powerful control variables, which fit the control situation of a MIDI environment well. In a typical transaction, a Note On message emitted from a computer or controller will trigger a complex reaction from the synthesizer, realizing the attack portion of the sound. Continuous controls can affect variables of the synthesis algorithm during the “steady state” portion of the sound, modifying such things as filter cutoff frequencies or the amplitude of a modulating oscillator. A Note Off message then initiates the sound’s decay, again stored as a complex, though relatively constant event in the synthesizer.

A primary cause of this transaction paradigm is the fact that the MIDI standard enforces a transmission bandwidth of 31,250 bits per second. Each eight-bit MIDI byte is surrounded with start and stop bits, making its effective size ten bits. Therefore, a standard MIDI Note On message, which requires three bytes (30 bits) of information, takes approximately 1 millisecond to transmit. As Gareth Loy points out (Loy 1985), the performance through MIDI of a ten-note chord will introduce a delay of 10 milliseconds between the first note of the chord and the last. Though 10 milliseconds is not enough to affect the percept of that event as a single chord, it can have an effect on the timbral quality of the sound. Further, when we consider the impact of a 1 millisecond transmission time on the performance of ten-note chords sent out simultaneously on each of ten channels (100 messages in all), the 100 millisecond delay between the first note and the last will certainly be heard.
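The arithmetic behind these figures can be checked with a short sketch. It assumes messages are sent back to back with no idle time on the line; the function and macro names are our own. At 31,250 bits per second each bit occupies exactly 32 microseconds, so a three-byte message takes 960 microseconds, which the text rounds to 1 millisecond.

```c
#include <assert.h>

#define BITS_PER_WIRE_BYTE   10L  /* 8 data bits plus start and stop bits */
#define MICROSECONDS_PER_BIT 32L  /* 1,000,000 / 31,250 */

/* Time in microseconds to transmit n_messages messages of
   bytes_each bytes, assuming they are sent back to back. */
long midi_transmit_us(long n_messages, long bytes_each)
{
    long bits;
    assert(n_messages >= 0 && bytes_each >= 0);
    bits = n_messages * bytes_each * BITS_PER_WIRE_BYTE;
    return bits * MICROSECONDS_PER_BIT;
}
```

Under these assumptions, midi_transmit_us(10, 3) gives 9,600 microseconds for a ten-note chord and midi_transmit_us(100, 3) gives 96,000 microseconds for one hundred Note On messages.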

When we consider MIDI as a communications channel not only for note messages but for the state of continuous controllers, the bandwidth problems are even more serious. Though the standard does not provide a good way to control the internal evolution of a sound, such facilities as there are cannot be fed quickly enough at 31,250 bits per second to afford a performer close control over the sounding result (Moore 1988). The result is to accentuate the machine boundary between the computer and its outboard synthesis gear: a user must program the synthesizer with instructions concerning how to evolve a sound in response to triggers sent out from the computer. The limitations of MIDI bandwidth are thus somewhat attenuated, moving some of the musical control over to the sound gear, but at the expense of an integrated and flexible development environment for all aspects of computer performance.

Despite these concerns, the positive influence of the MIDI standard has far outweighed its limitations. Most of the work described in this book could not have been achieved without it. As interactive systems grow in power and application, we can only hope that the standard will grow with them. In any event, the migration of sensing and response capabilities back to the host computer has the potential to obviate the need for such a communications protocol entirely.

Custom Controllers

Information can arrive from sensors in forms other than MIDI. Another important input type is samples, the digital representation of a time-varying audio signal. Usually, the sound of some musical instrument is picked up by a microphone, sent through an analog-to-digital converter (ADC), and then on to the computer. Compact-disc-quality sampling rates for digital audio produce (at least) 44,100 16-bit samples per second, so dealing with a raw sample stream demands very high-powered processing. For that reason, interactive systems designed to handle audio have dedicated hardware devices able to process sounds at the requisite speeds. We will return to a discussion of audio input and output in the section on response.
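A rough comparison of raw bit rates shows the scale of the difference. The calculation assumes a single monophonic channel of audio, and the symbols are ours:

```latex
R_{\text{audio}} = 44{,}100 \times 16 = 705{,}600 \ \text{bits/s},
\qquad
\frac{R_{\text{audio}}}{R_{\text{MIDI}}} = \frac{705{,}600}{31{,}250} \approx 22.6
```

Even one channel of compact-disc-quality audio therefore carries more than twenty times the data of a fully saturated MIDI line.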

MIDI controllers can be thought of as gestural transducers, providing a representation of human musical performance. Keyboard performances are represented fairly well. Much of the important gestural information from performance on other instruments (strings, winds, voice), however, is not fully captured by MIDI sensors. For that reason, a significant research effort has grown up around the attempt to build controllers better able to capture the range of expression afforded by traditional musical instruments.

Implementing the violoncello interface used for Tod Machover’s composition Begin Again Again . . . , a team including Neil Gershenfeld, Joseph Chung, and Andy Hong developed sensors to track five aspects of the cellist’s physical performance: (1) bow pressure, (2) bow position (transverse to the strings), (3) bow placement (distance from the bridge), (4) bow wrist orientation, and (5) finger position on the strings (Machover et al. 1991). To sense positions (2) and (3), a drive antenna was mounted on the cello bridge and a receiving antenna on the bow. The capacitance between the two yielded both the position of the bow across the strings and the position of the bow relative to the bridge. Pressure on the bow was measured from the player’s finger, rather than the bow hairs. Again a capacitance measurement was used, by putting a foam capacitor around the part of the bow where finger pressure would be applied. Finger position on the strings was found from the resistance between the metal strings and strips of conductive thermoplastic sheet mounted on the fingerboard beneath them. Finally, wrist angle was read from a sensor mounted on the wrist that measured joint angles from the movement of magnets corresponding to the wrist and the back of the hand. These five traits of the physical gestures were continually tracked by computers during the performance of the piece, and used to change the timbral presentation of computer music associated with each section of the composition (see also section 3.4.4).

Space-Control Performance

Using gestural control to affect the output of electronic instruments is a principle that was already firmly established in 1919 by the Russian scientist Lev Termen with his “Aetherphone,” later called the “Theremin” after the Gallicized version of his name (Glinsky 1992). The instrument produces monophonic music with a quasi–sine wave timbre and is played by moving two hands in the vicinity of antennae controlling pitch and amplitude. The tones come from two heterodyning oscillators, one of fixed frequency, and the other of variable frequency. Both oscillate at frequencies well above the range of human hearing; however, when the variable oscillator’s frequency is changed, a difference tone is created between it and the reference oscillator, and this difference tone does fall in the audible range. Moving the right hand toward and away from one antenna changes the variable oscillator frequency, thereby producing eerie portamento effects as the audible beat frequency goes up and down. The amplitude is similarly controlled by the movements of the other hand. Because of this continuous control over pitch and loudness, the instrument is capable of quite expressive performances, though mastery of it requires years of practice. The Theremin was a sensation, and it was played by its inventor and other virtuoso performers to packed houses throughout Europe and the United States. Similar devices existed around the same time but never caught the public imagination the way the Theremin did. An important reason for this was what Termen called “space-control performance”: the fact that the instrument was played without anyone actually touching it. Competitors outfitted with more conventional keyboard or other interfaces never aroused the same sense of wonder as the space-control Theremin.
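The heterodyning principle can be stated as a one-line relation. The symbols and the numerical values below are ours and purely illustrative; they are not taken from Termen's design:

```latex
f_{\text{audible}} = \left| f_{\text{var}} - f_{\text{ref}} \right|,
\qquad \text{e.g.} \quad
\left| 170{,}440\ \text{Hz} - 170{,}000\ \text{Hz} \right| = 440\ \text{Hz}
```

A shift of a fraction of one percent in the variable oscillator is thus enough to sweep the difference tone across a wide audible range, which is why small hand movements produce such pronounced portamento.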

A fascination with the seeming magic of music performed by movement of the hands alone has carried through to the use of hand-based controllers in several recent interactive systems. Stichting STEIM in Amsterdam has devoted a considerable research effort to the development of new gestural controllers, resulting in such devices as The Hands (a pair of proximity-sensitive hand-mounted sensors) and The Web (a weblike device in which manipulations of one part affect other regions) (Krefeld 1990). Similarly, a number of hand controllers have been marketed for various purposes, including the Exos Dextrous Hand Master, the VPL Data Glove, and the Mattel Powerglove, which have subsequently been adapted for experimentation in interactive music. Two composers who have used this technology are Michel Waisvisz of STEIM, who performs the composition The Hands with the controller of the same name, and Tod Machover, who adapted the Dextrous Hand Master to control timbral variation through MIDI mixers in his composition Bug-Mudra.

The Buchla Lightning Controller is another variation on the same idea: the controller responds to motions made in space with various transmitting devices (Rich 1991). One of these transmitters is a wand, which sends infrared information to a control box. The control box is thereby able to track the motion of the wand within a performance field, which is about as wide as the distance from the transmitter to the control box and whose height is about 60 percent of the width. This performance field is split up into eight cells, and various combinations of the eight cells make up different zones. The controller responds to “strikes” (quick changes of direction made with the transmitter) within a zone, entry or exit from any zone, and combinations of switches built into the transmitters and foot pedals. The control box can be programmed to send out MIDI note, control, or program change commands, as well as MIDI clock messages. Further, one of the Lightning presets allows the controller to talk directly to Max patches. Lightning is thus a general control device, which enables users to make physical gestures in the air and define the functionality of those gestures through MIDI-based programming.

The Radio Drum is a three-dimensional percussion controller, sensitive to the placement of beats on the face of the drum and the position of the drumstick through the air as it approaches the drum face (Mathews 1989). The Radio Drum was designed by Max Mathews and Robert Boie, and Mathews’s main conception of the device is to use it to control tempo. In one performance, a singer beats the Radio Drum to cue each successive attack in the performance of an accompanimental part. Of course, the three-dimensional information coming from the drum can be used to trigger much more intricate interactions between the drummer and the computer, an area which has been extensively explored by such composers as Andrew Schloss and Richard Boulanger.

Pitch Detectors

Solutions geared to individual instruments have proliferated because of significant differences among the instruments themselves: critical controls for one family of instruments do not even exist on others. Another obstacle is the fact that reliable real-time pitch tracking from an arbitrary audio signal has not been developed. Commercial devices for the job (pitch-to-MIDI converters) exist, but they have widely variable results for different instruments, even for different performers on the same instrument. General solutions have not been forthcoming because the problem is a difficult one: techniques such as the Fast Fourier Transform (FFT) are often not fast enough or simply fail to find the correct pitch. Consider the case of trying to identify a pitch at 60 hertz: generally, two cycles of the waveform will be required for analysis. At that frequency, a delay of over 30 milliseconds is required in the best case for identification. When we consider, in addition, that the attack portion of an instrumental waveform will be the least regular part of it, the difficulty of quickly and accurately finding the pitch through standard Fourier analysis is readily seen.
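The latency bound in this example follows directly from the period of the waveform. Assuming an analysis window of N complete cycles at fundamental frequency f_0 (symbols ours), and ignoring computation time:

```latex
t_{\min} = \frac{N}{f_0} = \frac{2}{60\ \text{Hz}} \approx 33.3\ \text{ms},
\qquad \text{compared with} \qquad
\frac{2}{440\ \text{Hz}} \approx 4.5\ \text{ms}
```

The delay is therefore most severe for low-pitched instruments, where the period of the waveform is longest.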

One approach to the pitch-tracking problem has been to solve it for other musical instruments as it was solved for keyboards. The keyboard, and mallet percussion, case is simple: each key is treated as a switch, and the pitch played can be read off the switch depressed. Extending this concept to other instruments means fitting them with mechanical sensors to read fingerings or hand position, reducing the problem of pitch detection to one of indexing known fingerings to the pitches they produce. In cases where one fingering produces more than one pitch (overblown wind instruments, for example), minimal additional signal processing can disambiguate between the remaining possibilities.
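The indexing scheme described above might be sketched as a table lookup followed by a simple disambiguation step. Everything here is hypothetical: the table layout, the key bitmasks, and the idea of choosing the candidate nearest a rough pitch estimate are our assumptions, not a description of any particular controller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_CANDIDATES 4

/* One row of a hypothetical fingering table: a bitmask of closed keys
   and the MIDI note numbers that this fingering can sound. */
typedef struct {
    unsigned keys_closed;
    int n_candidates;
    int candidates[MAX_CANDIDATES];
} fingering_entry;

/* Look up a fingering and return the candidate pitch closest to a rough
   estimate (a MIDI note number) derived from the audio signal.
   Returns -1 if the fingering is not in the table. */
int pitch_from_fingering(const fingering_entry *table, size_t n_entries,
                         unsigned keys_closed, int rough_estimate)
{
    size_t i;
    assert(table != NULL);
    for (i = 0; i < n_entries; i++) {
        int j, best, best_dist;
        if (table[i].keys_closed != keys_closed)
            continue;
        assert(table[i].n_candidates >= 1 &&
               table[i].n_candidates <= MAX_CANDIDATES);
        best = table[i].candidates[0];
        best_dist = abs(best - rough_estimate);
        for (j = 1; j < table[i].n_candidates; j++) {
            int dist = abs(table[i].candidates[j] - rough_estimate);
            if (dist < best_dist) {
                best = table[i].candidates[j];
                best_dist = dist;
            }
        }
        return best;
    }
    return -1;
}
```

Because the fingering narrows the choice to a handful of widely spaced pitches, the audio estimate only needs to be accurate to within several semitones, a far easier task than unconstrained pitch tracking.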

An early instance of fingering-based pitch detection was built into the IRCAM flute controller. Optical sensors tracked the manipulations made by the player’s fingers on the instrument, and the found configurations were used as indices into a table of known pitches for each fingering. Since one flute fingering can potentially produce more than one pitch, additional signal processing on the 4X machine analyzed the audio signal to decide between the remaining possibilities (Baisnee et al. 1986). Other wind instrument devices have been built following similar physical-mapping principles: the Yamaha WX-7 wind controller follows key positions, and in fact extends the traditional range of single-reed instruments with additional octave keys, allowing the player to cover up to seven octaves.

MIDI adaptations for other instruments abound as well. MIDI guitars, for example, have been built by several manufacturers. Entire families of string instruments have been commercialized by Zeta and the RAAD group. Hardly an instrument exists that has not had some work done to allow it to transmit MIDI messages. The underlying message of this trend is clear: the motivation to participate in the expanded possibilities afforded by computer-based instruments is a compelling one for performers and builders alike. The general operating principles of existing orchestral instruments are being maintained, to capitalize on the years of training professional players have had. Because traditional instruments are markedly different, new control adaptations must be made on a highly individual basis. General analysis systems based on the properties of audio signals have not provided a solution, given the inability of techniques such as Fourier analysis to provide accurate results within the demands of real-time performance. Given these constraints, new instrumental controllers combine some interpretation of physical gestures with audio signal processing to provide a wider range of gestural information from instrumental performance. At the same time, controllers divorced from any relation to the orchestral instruments, but which give wider scope to the tracking of expressive gestures, are being developed in growing numbers.

2.2 Processing

The information collected during the sensing stage is passed on to a computer, which begins the processing stage. Communication between the two stages assumes a protocol understood by both sides, and the most widely used protocol is the MIDI standard. Other signals passing between them could include digital audio signals or custom control information. In either case, some form of the procedure for handling MIDI streams would apply.

Early interactive systems always needed to implement a MIDI driver, which was capable of buffering a stream of serial information from a hardware port and then packaging it as a series of MIDI commands. Such a facility still must be included in every program, but for most hardware platforms the problems of MIDI transmission have been solved by a standardized driver and associated software. A good example is the Midi Manager™, written for Apple Macintosh computers. The Midi Manager makes it possible for several MIDI applications to run simultaneously on a single computer. An associated desk accessory called PatchBay allows a user to route MIDI streams between applications and the Apple Midi Driver, and to specify timing relations between all of them (Wyatt 1991).

Beyond the bookkeeping details of receiving and packaging valid MIDI commands, MIDI drivers invariably introduce time stamps, a critical extension of the standard. Time stamps are an indication of the time at which some MIDI packet arrived at the computer. In the case of the Midi Manager, time stamps are notated in one of a number of possible formats, with a resolution of one millisecond. With the addition of time stamps, the timing information critical to music production becomes available. Interpretation processes can use this information to analyze the rhythmic presentation of incoming MIDI streams. Timing information is also needed as the computer prepares data for output through the response stage, and it is here that the function of a real-time scheduler comes into play.
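As a sketch of how time stamps support rhythmic analysis, the following fragment computes the intervals between successive note onsets in a stamped event stream. The event layout and function name are hypothetical and do not reflect the data structures of any particular driver; the time stamps are assumed to be in milliseconds and in nondecreasing order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A MIDI message tagged with its arrival time (hypothetical layout). */
typedef struct {
    uint32_t time_ms;   /* time stamp, in milliseconds */
    uint8_t  bytes[3];  /* status byte and two data bytes */
} timed_event;

/* Write the intervals between successive note onsets (Note On messages
   with nonzero velocity) into ioi_ms. Returns the number of intervals
   written, which is at most max_out. */
size_t inter_onset_intervals(const timed_event *ev, size_t n,
                             uint32_t *ioi_ms, size_t max_out)
{
    size_t i, count = 0;
    int have_prev = 0;
    uint32_t prev = 0;
    assert(ev != NULL && ioi_ms != NULL);
    for (i = 0; i < n && count < max_out; i++) {
        int is_onset = (ev[i].bytes[0] & 0xF0) == 0x90 && ev[i].bytes[2] > 0;
        if (!is_onset)
            continue;
        if (have_prev) {
            assert(ev[i].time_ms >= prev);
            ioi_ms[count++] = ev[i].time_ms - prev;
        }
        prev = ev[i].time_ms;
        have_prev = 1;
    }
    return count;
}
```

A list of inter-onset intervals of this kind is a natural starting point for the rhythmic interpretation processes mentioned above.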

Real-Time Schedulers

Scheduling is the process whereby a specific action of the computer is made to happen at some point in future time. Typically, a programmer can invoke scheduling facilities to delay the execution of a procedure for some number of milliseconds. Arguments to the routine to be executed are saved along with the name of the routine itself. When the scheduler notices that the specified time point has arrived, the scheduled process is called with the saved arguments.

Much of the work on interactive music systems developed at the MIT Media Laboratory, for example, relies on a scheduler adapted from work described in (Boynton 1987). The scheduler allows timed execution of any procedure called with any arbitrary list of arguments. A centisecond clock is maintained in the MIDI driver (also adapted from work by Lee Boynton) and is referenced by the driver to timestamp arriving MIDI data. The clock records the number of centiseconds that have passed since the MIDI driver was opened. Accordingly, incoming MIDI events are marked with the current clock time at interrupt level, when they arrive at the serial port. This same clock is used by the scheduler to time the execution of scheduled tasks.

Tasks are associated with a number of attributes, which control initial and (possibly) repeated executions of the scheduled function. We can see these attributes in the argument list to Scheduler_CreateTask, the routine that places a function into the scheduler task queue.

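Scheduler_CreateTask(time, tol, imp, per, fun, args)
         long time;        /* absolute execution time, in centiseconds */
         short tol;        /* tolerance: startup time allowed for the function */
         short imp;        /* importance: selects the priority queue */
         short per;        /* period: if nonzero, repeat every per centiseconds */
         void (*fun)();    /* function to be called */
         arglist args;     /* arguments passed to fun on execution */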

The first argument is time: the absolute clock time at which the function is to be executed. Absolute clock time means that the time point is expressed in centiseconds since the opening of the MIDI driver. The tolerance is an amount of time allowed for startup of the function. If there is a tolerance argument, the function will be called at a clock-time point that is calculated by subtracting tolerance from time. This accommodates routines whose effect will be noticed some fixed amount of time after their execution. In that case, the function can be scheduled to take place at the time its effect is desired, and the startup time is passed along as a tolerance argument.

Figure 2.1

In figure 2.1, the black bullets represent sounding events produced by some player process. The process needs a fixed amount of time to produce the sound — a tolerance — represented by the arrow marking a point some time in advance of the bullet. The call to Scheduler_CreateTask, then, would give the time for the desired arrival of a sounding event, with a tolerance argument to provide the necessary advance processing.

The scheduler maintains three separate, prioritized queues, and guarantees that all waiting tasks from high-priority queues will be executed before any tasks from lower-priority queues are invoked. The importance argument determines which priority the indicated task will receive. If a nonzero period argument is included, the task will reschedule itself period centiseconds after each execution. Periodicity is a widespread attribute of music production, and the facility of the period argument elegantly handles the necessity. Tasks that are periodically rescheduling themselves can be halted by an explicit kill command at any time. In figure 2.1, a period argument would continue the sounding events at regular intervals after the first one as shown. The tolerance argument remains in force for each invocation, providing the advance processing time required. Finally, a pointer to the function to be called, and the arguments to be sent to the function on execution, are listed. The arguments are evaluated at scheduling time, rather than when the function is eventually invoked.
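The interplay of time, tolerance, importance, and period can be illustrated with a deliberately simplified sketch. It is not Boynton's implementation: it uses one flat task table scanned in priority order rather than three real queues, passes a single pointer instead of an argument list, and assumes that importance 0 is the highest priority. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS      32
#define NUM_PRIORITIES 3

typedef void (*task_fn)(void *arg);

typedef struct {
    long    time;    /* absolute time of the desired effect, in centiseconds */
    short   tol;     /* tolerance: lead time needed before the effect */
    short   imp;     /* importance: 0 is assumed to be the highest priority */
    short   per;     /* period in centiseconds; 0 means run once */
    task_fn fun;
    void   *arg;
    int     active;
} task;

static task tasks[MAX_TASKS];  /* zero-initialized: every slot starts inactive */

/* Enter a task; returns a slot id usable with sched_kill, or -1 if full. */
int sched_create_task(long time, short tol, short imp, short per,
                      task_fn fun, void *arg)
{
    int i;
    assert(fun != NULL && tol >= 0 && per >= 0);
    assert(imp >= 0 && imp < NUM_PRIORITIES);
    for (i = 0; i < MAX_TASKS; i++) {
        if (!tasks[i].active) {
            tasks[i].time = time;
            tasks[i].tol = tol;
            tasks[i].imp = imp;
            tasks[i].per = per;
            tasks[i].fun = fun;
            tasks[i].arg = arg;
            tasks[i].active = 1;
            return i;
        }
    }
    return -1;
}

/* Halt a task, including one that is rescheduling itself periodically. */
void sched_kill(int id)
{
    assert(id >= 0 && id < MAX_TASKS);
    tasks[id].active = 0;
}

/* Call every task whose start time (time - tol) has arrived, taking
   higher priorities first. Returns the number of calls made. */
int sched_run_due(long now)
{
    int imp, i, calls = 0;
    for (imp = 0; imp < NUM_PRIORITIES; imp++) {
        for (i = 0; i < MAX_TASKS; i++) {
            task *t = &tasks[i];
            if (!t->active || t->imp != imp || t->time - t->tol > now)
                continue;
            if (t->per > 0)
                t->time += t->per;   /* periodic: advance to next effect time */
            else
                t->active = 0;       /* one-shot: retire before calling */
            t->fun(t->arg);
            calls++;
        }
    }
    return calls;
}

/* Example task: increments the int counter passed as its argument. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

With a task created as sched_create_task(100, 10, 0, 50, count_call, &n), the function first runs when the clock reaches 90 (the effect time less the tolerance) and again at 140, 190, and so on until sched_kill is called.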

A similar facility is provided by the CMU MIDI Toolkit, adapted from Doug Collinge’s language Moxie (Dannenberg 1989): the cause() routine will invoke a procedure, with the listed arguments, some number of centiseconds in the future. For instance, the call

cause(100, midi_note, 1, 60, 0);

will induce the scheduler to call the routine midi_note 100 centiseconds after the execution of cause(). The arguments 1, 60, and 0 are specific to midi_note (and correspond to channel, note number, and velocity). Any list of arguments can be presented after the name of the routine to be invoked, as in the Boynton syntax shown above. In fact, one of the most noticeable differences between the two schedulers is that Boynton’s expects absolute time points as a reference, whereas Dannenberg’s cause() routine requires the execution time of the scheduled routine to be specified relative to the invocation of cause itself.
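The two conventions differ only by the current clock value, as the following sketch suggests. The clock here is a hypothetical stand-in for the driver's centisecond clock, and none of these functions belongs to either toolkit.

```c
#include <assert.h>

/* Hypothetical stand-in for the MIDI driver's centisecond clock. */
static long clock_cs = 0;

long now_cs(void)
{
    return clock_cs;
}

/* Simulate the passage of time (for illustration only). */
void advance_clock(long cs)
{
    assert(cs >= 0);
    clock_cs += cs;
}

/* Convert a relative delay, as given to a cause-style call, into the
   absolute time point an absolute-time scheduler would expect. */
long absolute_from_delay(long delay_cs)
{
    assert(delay_cs >= 0);
    return now_cs() + delay_cs;
}

/* Convert an absolute time point into a delay from the present moment. */
long delay_from_absolute(long time_cs)
{
    return time_cs - now_cs();
}
```

If the clock stands at 250 centiseconds, a delay of 100 corresponds to the absolute time point 350, and vice versa.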

For both versions of this idea (and there are several others), it is important to note that the model assumes that none of the scheduled routines will require a long processing time. Once a procedure is called by the scheduler, it runs until the end, and then relinquishes control to the scheduler again. These routines are not interrupted, except by a strictly limited number of input handlers. If some procedure requires extensive processing, it must reschedule itself at regular intervals to allow the scheduler enough CPU time to keep up with the execution of any other tasks in the queue.
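One way to meet this requirement is to give a long computation an explicit state record and process it a slice at a time. The sketch below, with hypothetical names, sums an array in bounded steps; the return value tells the caller whether the job should be rescheduled.

```c
#include <assert.h>
#include <stddef.h>

/* State of a hypothetical long-running job: summing a large array. */
typedef struct {
    const int *data;
    size_t     length;
    size_t     next;   /* index of the first unprocessed element */
    long       sum;
} sum_job;

/* Process at most `slice` elements, then return to the caller.
   Returns 1 if work remains (the job should be rescheduled), 0 when done. */
int sum_job_step(sum_job *job, size_t slice)
{
    size_t end;
    assert(job != NULL && slice > 0);
    end = job->next + slice;
    if (end > job->length)
        end = job->length;
    while (job->next < end)
        job->sum += job->data[job->next++];
    return job->next < job->length;
}
```

A scheduled task built on this pattern would call sum_job_step once per invocation and, while it returns 1, schedule itself again a short time ahead, leaving the scheduler free to run other tasks in between.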

An extensive literature has grown up around the subject of real-time scheduling, particularly as it affects music programming, and several languages have been implemented that address the problem in various ways. The reader is referred to Roger Dannenberg’s excellent survey (Dannenberg 1989). Besides the scheduler included in the Midi Manager, extensive possibilities for scheduling events have been built into Max, the graphic programming language for interactive systems that we will examine in some detail.

Building Functionality

Along the sensing/processing/response chain, the differentiation of functionality among systems is accomplished most strongly through processing. Certain classes of hardware can generally be grouped with each link in the chain. Controllers contribute to the sensing stage, and synthesizers and other sound-making gear work in the response stage. The hardware of the processing stage is a digital computer. The programs running in this computer are the subject of this book, and different approaches to the processing stage will be seen repeatedly in the following pages.

We have already reviewed some of the problems surrounding sensing and two of the indispensable components of processing: MIDI handling and scheduling. The rest of the processing component is what distinguishes Michel Waisvisz’s The Hands from Morton Subotnick’s A Desert Flowers from Jean-Claude Risset’s Duet. In describing the sequence of events in an interactive music system, this subsection provides only a placeholder for a discussion of the cornucopia of possible processing approaches. The rest of this book fills the placeholder by describing in detail the processing link in the chain.

2.3 Response

The response stage resembles the sensing stage most strongly in the protocol used to pass information. Again here, the computer and the devices used to actually perform the responses communicate most often through the MIDI standard. The exact nature of the directions sent out by the computer will depend on the synthesis devices in use and the kinds of effects those devices are used to realize. Presently, commercial MIDI gear tends to fall into two large groups: synthesis and sampling. Synthesis modules use an algorithm (for example, frequency modulation) to produce sounds. Sampling gear has stored waveforms, often recordings of traditional acoustic instruments, which are played back at specific pitches in response to MIDI messages. Response commands sent to the devices, then, would include Note On and Note Off messages and whatever controller values are desired for manipulation of the device’s sound production technique.
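As an indication of how few control variables such an algorithm can require, simple frequency modulation with one carrier and one modulator is commonly written as follows. The symbols are ours, and commercial instruments typically combine several such oscillator pairs:

```latex
y(t) = A \,\sin\!\bigl( 2\pi f_c t + I \,\sin( 2\pi f_m t ) \bigr)
```

Here A is the amplitude, f_c the carrier frequency, f_m the modulating frequency, and I the modulation index. Varying I alone changes the brightness of the resulting spectrum, which makes it a natural target for a single MIDI continuous controller.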

Real-Time Digital Signal Processing

Before the arrival of MIDI, interactive computer music was often realized using specialized digital signal processing hardware for the sensing and response stages. A prominent example of this approach was the series of real-time processors designed by Giuseppe Di Giugno and his team at IRCAM, culminating in the 4X machine (Baisnee et al. 1986). The 4X was used for a range of interactive compositions, including Pierre Boulez’s Répons, several pieces by Philippe Manoury, such as Jupiter and Pluton, and my own Hall of Mirrors. Sensing involved the intake of audio samples from microphones and reading commands typed at an alphanumeric keyboard. Responses were generated by real-time signal processing of the audio taken in from the microphones, producing a live transformation of sounds already being performed by acoustic instruments. Later, MIDI handling was added to the capabilities of the 4X real-time system, to take advantage of the MIDI controllers which were by then becoming available in large numbers (Favreau et al. 1986).

As MIDI synthesis equipment became more sophisticated and duplicated many of the techniques developed in the leading computer music institutions, much of the work of response generation fell to such devices. Another important reason for the dominance of commercial gear through the late 1980s and early 1990s was the considerable expense involved in acquiring and maintaining a device like the 4X, a cost prohibitive for most studios and certainly for individuals.

With the introduction of digital signal processing chips such as the Motorola 56000, however, the pendulum began to swing back toward real-time signal processing as a viable choice for generating sound. The efficiencies of mass production, coupled with the installation of these chips on a variety of add-on boards designed for personal systems such as the IBM PC, Apple Macintosh, and NeXT machine, made DSP hardware both inexpensive and relatively widespread. The viability of digital signal processing power in personal workstations is changing the face of response synthesis techniques. Direct-to-disk sampling applications, and specialized hardware/software packages such as the Digidesign SampleCell, now make possible the extensive use in performance of recorded sound, or transformations of live sound, without relying on external MIDI samplers.

IRCAM Signal Processing Workstation

The accelerated use of digital signal processing in live performance is again changing the nature of response capabilities in interactive music systems. The standard for the integration of control- and audio-level programming has moved forward with the commercialization of the IRCAM Signal Processing Workstation (ISPW). The ISPW consists of a NeXT computer equipped with a special accelerator board, on which reside two Intel i860 processors (Lindemann et al. 1991). The i860s are very fast, general-purpose processing devices. Before the ISPW, real-time digital signal processing tasks were accomplished with specialized processors, built particularly for DSP, such as the Motorola 56000. The advantage of using a general-purpose processor such as the i860 is that the machine boundary between signal- and control-level computations is erased.

To take advantage of this flexibility, a new version of Max was written to include signal objects. These objects can be used to build signal-processing programs, just as MIDI Max objects are configured to implement control programs. When the two classes are combined, the conceptualization and implementation of interactive systems using real-time signal processing is considerably eased. First, the response phase of the system is entirely programmable: not the choices of a manufacturer but the demands of a composition can decide the synthesis algorithms used. Second, a single programming environment can be used for both the processing and response phases. Third, since processing and response are realized with the same machine through the same programming environment, the need for communications protocols such as MIDI, with all their bandwidth and conceptual limitations, falls away (Puckette 1991).

The realization on the ISPW of powerful signal analysis techniques can eliminate much of the need for external sensing as well. The workstation is fast enough to perform an FFT and inverse FFT in real time, simultaneously with an extensive network of other signal and control processing. Already pitch- and envelope-tracking objects have been used for compositional sketches. If continuous sensing of pitch, amplitude, and timbral information can be achieved from the audio signal alone, the entire sensing/processing/response chain could be reduced to a single machine, with all the attendant gains in flexibility and implementation power that entails.

2.4 Commercial Interactive Systems

Commercially available interactive systems began to appear in the mid-1980s. Such programs illustrate the processing chain outlined in the previous section and several hallmarks of interaction. Rather than survey the full range of applications, we will briefly consider M and Jam Factory, two ground-breaking efforts in the field, before moving on to Max, a newer graphic programming environment that allows users to design their own interactive systems.

M and Jam Factory

In December of 1986, Intelligent Music released M and Jam Factory. Intelligent Music is a company founded by composer Joel Chadabe for developing and distributing interactive composing software. Chadabe and three others designed M; one of those collaborators, David Zicarelli, also designed Jam Factory (Zicarelli 1987). Among the breakthroughs implemented by these programs are graphic control panels, which allow access to the values of global variables affecting their musical output. Manipulating the graphic controls has an immediately audible effect. The sensing performed by M and Jam Factory centers around reading manipulations of the control panel and interpreting an incoming stream of MIDI events. Responses are sent out as MIDI.

Each program implements different, though related, compositional algorithms. “The basic idea of M is that a pattern (a pattern in M is a collection of notes and an input method) is manipulated by the various parameters of an algorithm” (Zicarelli 1987, 19). Four patterns are active simultaneously, and variables for each of them can be changed independently. The algorithm allows control over such parameters as orchestration, which assigns patterns to MIDI channels; sound choice, which selects program changes for the channels of a pattern; note density, the percentage of time that notes from a pattern are played; and transposition, which offsets pitch material from its original placement. Duration, articulation, and accent are governed by cyclic distributions, collections of data used to reset the values of these parameters with each clock tick. Further, the mouse can be used to “conduct” through various settings of these variables.

Jam Factory implements four players, whose material is generated using the transition tables characteristic of Markov chains (see section 6.2). Several tables of different orders are maintained, and “an essential part of the Jam Factory algorithm is the probabilistic decision made on every note as to what transition table to use” (Zicarelli 1987, 24). Separate transition tables for pitches and durations are employed, and each player has independent tables for both parameters. The probabilities with which tables of different orders are used affect the balance between variation and relatively straightforward playback of the stored material. For instance, first-order tables depend only on the previous pitch (in the case of melodic generation) to determine the following one. Second-order tables look at the previous two pitches, and so on. “For many applications, 70-80 percent Order 2 with the rest divided between Orders 1 and 3 will blend ‘mistakes’ with recognizable phrases from the source material in a satisfying manner” (Zicarelli 1987, 25).
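The per-note choice among transition tables can be illustrated with a minimal Python sketch. It is not Jam Factory's code: the function names, the source melody, the fallback to lower orders when a context never occurred, and the exact weights are all assumptions made for the illustration, and only pitches are treated.

```python
import random
from collections import defaultdict


def build_table(notes, order):
    """Map every run of `order` consecutive notes to the notes that followed it."""
    table = defaultdict(list)
    for i in range(len(notes) - order):
        table[tuple(notes[i:i + order])].append(notes[i + order])
    return table


def next_note(history, tables, order, source, rng):
    """Pick a follower using the chosen order, falling back to lower orders."""
    for k in range(order, 0, -1):
        followers = tables.get(k, {}).get(tuple(history[-k:]))
        if followers:
            return rng.choice(followers)
    return rng.choice(source)  # assumption: no context matched, pick any source note


def generate(source, length, order_weights, rng=None):
    """Generate `length` notes, choosing a table order afresh for every note."""
    rng = rng or random.Random()
    orders = list(order_weights)
    weights = [order_weights[k] for k in orders]
    tables = {k: build_table(source, k) for k in orders}
    melody = list(source[:max(orders)])  # seed with the opening of the source
    while len(melody) < length:
        order = rng.choices(orders, weights=weights)[0]
        melody.append(next_note(melody, tables, order, source, rng))
    return melody[:length]


# Hypothetical source material (MIDI note numbers).
SOURCE = [60, 62, 64, 65, 67, 65, 64, 62, 60]
# Roughly the mix quoted above: mostly order 2, the rest split between 1 and 3.
MIX = {1: 0.125, 2: 0.75, 3: 0.125}
print(generate(SOURCE, 16, MIX, random.Random(1)))
```

Raising the weight of order 3 makes the output quote longer stretches of the source; raising the weight of order 1 makes it wander more freely.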

Timing is expressed on two levels. First, a master tempo defines the rate of clock ticks. Then a time base, independent for each voice, sets the number of ticks that pass between successive events. This scheme closely follows the MIDI conception of time, which is organized around beats in a tempo. Beats always represent an identical number of ticks: quarter notes, for example, can span 24, 48, or 96 clock ticks. Though the number of ticks is variable for different applications or pieces of hardware, once a resolution is chosen, it remains constant for that notated duration. The speed of quarter-note realization, then, is changed by varying the overall tempo, which defines the duration of one clock tick. The advantage of this scheme is that the relation of events to an underlying meter is kept constant through tempo changes. The disadvantage is that it enforces a conceptualization of musical time in beats and tempi, even in music for which these are not appropriate categories.
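The arithmetic behind this two-level scheme can be made concrete with a short Python sketch. The function names and the sample values are ours, and the sketch assumes that tempo is given in quarter-note beats per minute.

```python
def tick_duration_ms(tempo_bpm, ticks_per_quarter):
    """Duration of one clock tick in milliseconds."""
    return 60000.0 / (tempo_bpm * ticks_per_quarter)


def event_interval_ms(tempo_bpm, ticks_per_quarter, time_base_ticks):
    """Time between successive events of a voice whose time base is given in ticks."""
    return 60000.0 * time_base_ticks / (tempo_bpm * ticks_per_quarter)


# With 24 ticks per quarter note, a time base of 12 ticks yields eighth notes.
for tempo in (60, 120):
    print(tempo, event_interval_ms(tempo, 24, 12))
```

Doubling the tempo halves the time between events, while the time base of 12 ticks, and hence the relation of the voice to the beat, is untouched.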

M and Jam Factory are clear examples of performance-driven interactive systems. There is no stored score against which input is matched. The performance driving the program is basically a series of gestures with the mouse; an input control system allows the same functionality to be transferred to a MIDI keyboard. The response method is generative: stored lists of material are varied through the manipulation of a number of performance parameters. The programs follow a player paradigm in that they realize a distinct musical voice from the human performance. In fact, M and Jam Factory are unusual with respect to later programs in that the human performance is not particularly musical: rather, the performer’s actions are almost entirely directed to manipulating program variables.


In 1990, Opcode Systems released the commercial version of Max, a graphic programming environment for interactive music systems. Max was first developed at IRCAM by Miller Puckette and prepared for commercial release by David Zicarelli. Max is an object-oriented programming language, in which programs are realized by manipulating graphic objects on a computer screen and making connections between them. The collection of objects provided, the intuitively clear method of programming, and the excellent documentation provided by Opcode make Max a viable development environment for musicians with no prior technical training.

Throughout this book, examples of music programming techniques will be illustrated with Max. From the CD-ROM supplement, working Max patches can be downloaded and run on an Apple Macintosh computer. Even without a computer, the graphic nature of the language allows us to readily follow the algorithm being discussed. For an in-depth introduction to Max, the reader is referred to the documentation provided by Opcode Systems Inc. with the program (Dobrian and Zicarelli 1990). Enough discussion of the modules used in the illustrations will be provided with each example to make the process under discussion clear to those readers unfamiliar with Max as well.

Object orientation is a programming discipline that isolates computation in objects, self-contained processing units that communicate through passing messages. Receiving a message will invoke some method within an object. Methods are constituent processing elements, which are related to each other, and isolated from other methods, by virtue of their encapsulation in a surrounding object. Depending on the process executed by a method, a message to an object may enclose additional arguments required by that process as well. Full object orientation includes the concept of inheritance, by which objects can be defined as specializations of other objects. Max does not implement inheritance, though the fundamentals of messages and methods will become quite clear from using the language. For an introduction to the concepts of object-orientation and their application to music programming, the reader is referred to (Pope 1991).

In the next three sections, we will look at some objects in Max designed to handle the sensing, processing, and response stages of an interactive system. First, we can glimpse the flavor of programming in the environment: using Max, a composer specifies a flow of MIDI or other numerical data among various objects representing such operations as addition, scaling, transposition, delay, etc. A collection of interconnected objects is called a patch (in fact, Max has sometimes been called Patcher). Objects in a patch can be connected and moved to control data flow easily between them; “monitor” objects can be placed anywhere in a signal path, to inspect the input to or output from other objects; tables and histograms can easily be added to store or display data. Further, patches can be constructed hierarchically: once a working configuration of objects for some process has been found, it can be saved to a file and then included as a subpatch in other, larger programs.

Although Max is designed for making interactive music programs, one should always remember that it is a programming language, and therefore fundamentally different from sequencers and patch editors and similarly specialized commercial software applications. The effort required to achieve the same sequencing sophistication with Max that is available in programs such as Vision or Performer, for example, would be considerable and unjustified if the specialized packages do everything that is needed. Once the limitations of these programs are felt, however, and as soon as ideas of interactive and algorithmic composition arise, the power and flexibility of a programming language is indispensable. Other languages are available, but the documentation, ease of use, and focused optimization of Max for building such systems are compelling recommendations indeed.


Sensing in Max includes an extensive collection of objects for handling MIDI messages. These objects access the internal MIDI driver software maintained by the program; according to the preferences of the user, either a self-contained MIDI driver or the Apple MIDI Manager can be selected for this purpose. MIDI sensing objects are distinctive, for one thing, because of the nature of information required by their inlets. On a Max object, the name of the object is found in the middle of the graphic box. Along the top are darkened inlets, where messages are sent to the object. Along the bottom, the black strips are outlets, where messages are sent out. Each object has inlets and outlets suited to the methods it encapsulates. The midiin object, for example, has one inlet and one outlet (see figure 2.2).

Figure 2.2

Midiin receives a stream of messages from a MIDI port. A port could be one of the Apple serial ports, either printer or modem. It could also be a virtual port, associated with a configuration maintained by PatchBay and the Apple MIDI Manager. The inlet to midiin is used to indicate the port to which the object should be listening. From the outlet come MIDI bytes, retrieved from the designated port.

This object is emblematic of one source of power found in the Max environment: behind the simple object lies communication with a resident MIDI driver, or with the Apple MIDI Manager, possibly extended by the Opcode MIDI System (OMS). Rather than dealing with the intricacies of each standard, a musician has only to understand midiin to deal with MIDI streams. I will not continue to insist on the value of this level of abstraction, because the point should be clear enough: with environments like Max, musicians are free to concentrate on musical issues. The complications of scheduling and MIDI management, which are, as we have seen, the two indispensable elements of interactive systems, are hidden behind simple and uniform programming elements.


Like section 2.2 on general processing principles in interactive systems, the following section is little more than a placeholder for the rest of the book. Processing is the heart of any interactive program: the examples scattered through the text, and included on the CD-ROM, illustrate the enormous variety of approaches to processing musical information. Here again, however, we will pause to consider facilities for treating one of the most important aspects of processing: the manipulation of time, and, as a fundamental tool for such manipulation, scheduling of events.

A prototypical Max scheduling object is delay. Delay receives a bang message at its left inlet and sends it back out after some number of milliseconds, set either by an argument to the object or as a message arriving at delay’s right inlet. Bang is the most common Max message. Almost all objects understand bang to mean “do it.” In other words, whatever the object does, it will do when a bang arrives at its leftmost inlet. In this case, delay delays the transmission of a bang. What allows delay to save up the incoming bang and send it back out at some specified point in the future is Max’s internal real-time scheduler.

Figure 2.3

In figure 2.3, the delay object at the center of the patch serves to delay the transmission of a bang message from its left inlet to its outlet. A bang can be sent to delay by clicking on the button connected to its left inlet. When the patch is first loaded, 1000 milliseconds (1 second) will pass between the time the top button is clicked and the time the button attached to delay’s outlet flashes, indicating the transmission of the bang. This is because delay was given an initial argument of 1000, the number shown immediately following the name of the object. Arguments can be used to set variables within an object to some initial value, as was done here. They are listed after the name of the object when it is first placed on the screen. Besides bang, delay also understands the message stop. A stop message can be sent to delay by clicking on the message box with stop written in it, connected to delay’s left inlet. Stop tells delay to abort the transmission of any bangs it may be saving up. Therefore, if stop is clicked after sending a bang to the delay inlet, but before a second has gone by, delayed transmission of the bang will be cancelled, and the lower button will not flash.

The amount of time the bang is delayed can be varied using the objects stacked above delay’s right inlet. Initially, bangs are delayed by 1 second, because the argument given to the object sets its delay time to be 1000 milliseconds. If the uppermost number box is manipulated, new delay times will be sent to the delay object, changing the duration between incoming and outgoing bangs. In the patch pictured, the delay time objects have been changed to set the delay between incoming and outgoing bangs to 1140 milliseconds (1.14 seconds).
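The behavior of delay can be imitated outside Max with a small Python sketch. It is a model for illustration only, not Max's scheduler: the class and method names are ours, time is simulated rather than read from a clock, and we assume that a new bang replaces one that is still pending.

```python
import heapq
import itertools


class Scheduler:
    """A tiny virtual-time scheduler: callbacks run when the clock reaches them."""

    def __init__(self):
        self.now = 0  # current time in milliseconds
        self._queue = []
        self._ids = itertools.count()  # tie-breaker so callbacks are never compared

    def schedule(self, delay_ms, callback):
        entry = [self.now + delay_ms, next(self._ids), callback]
        heapq.heappush(self._queue, entry)
        return entry

    def cancel(self, entry):
        entry[2] = None  # stays in the heap, but is skipped when it comes due

    def advance(self, ms):
        """Move the clock forward, running everything that falls due on the way."""
        target = self.now + ms
        while self._queue and self._queue[0][0] <= target:
            due, _, callback = heapq.heappop(self._queue)
            self.now = due
            if callback is not None:
                callback()
        self.now = target


class Delay:
    """Model of the delay object: bang in, the same bang out some time later."""

    def __init__(self, scheduler, time_ms, on_bang):
        self.scheduler = scheduler
        self.time_ms = time_ms  # the argument typed after the object's name
        self.on_bang = on_bang  # whatever is connected to the outlet
        self._pending = None

    def bang(self):  # left inlet
        self.stop()  # assumption: a new bang replaces a pending one
        self._pending = self.scheduler.schedule(self.time_ms, self._fire)

    def stop(self):  # left inlet: abort any bang being saved up
        if self._pending is not None:
            self.scheduler.cancel(self._pending)
            self._pending = None

    def set_time(self, time_ms):  # right inlet
        self.time_ms = time_ms

    def _fire(self):
        self._pending = None
        self.on_bang()


clock = Scheduler()
flashes = []
delay = Delay(clock, 1000, lambda: flashes.append(clock.now))
delay.bang()
clock.advance(1000)  # the lower button flashes after one second
delay.bang()
clock.advance(500)
delay.stop()         # stop arrives in time: no second flash
clock.advance(1000)
print(flashes)
```

The point of the sketch is the division of labor: the delay model only asks for a callback at some future time, and the scheduler alone decides when that time has arrived.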


The response stage of an interactive system sends the results of processing out to some collection of devices, usually to make sound, and usually through MIDI. Again, Max provides a full range of objects for responding through MIDI, of which the noteout object is a good example.

Figure 2.4

We can see quite clearly from the graphic representation of the object (figure 2.4) that it handles output, or response. Notice that the object has no outlet. This is because the MIDI management tools of Max lurk behind noteout’s facade. Again, programs built with noteout are able to send MIDI messages directly out the serial port, or to use MIDI Manager or OMS without the details of such transmission being a matter of concern to the programmer.

The inlets of noteout correspond directly to the format of the MIDI Note On message. The leftmost inlet expects a MIDI note number, the middle inlet requires a MIDI velocity between 0 and 127, and the rightmost inlet is used to set the MIDI channel on which the resulting message will be transmitted. What will be sent to these inlets are questions of processing — the musical decisions and methods of the previous phase.
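For readers who want to see the bytes involved, here is a minimal Python sketch of the three-byte Note On message that these inlets describe. The function name is ours; the sketch only builds the message and does not transmit it.

```python
def note_on_bytes(pitch, velocity, channel):
    """Build a MIDI Note On message for a channel numbered 1 to 16."""
    if not 0 <= pitch <= 127:
        raise ValueError("pitch must be between 0 and 127")
    if not 0 <= velocity <= 127:
        raise ValueError("velocity must be between 0 and 127")
    if not 1 <= channel <= 16:
        raise ValueError("channel must be between 1 and 16")
    status = 0x90 | (channel - 1)  # Note On status in the high nibble, channel in the low
    return bytes([status, pitch, velocity])


# Middle C at velocity 90 on channel 1, then the same note with velocity 0.
print(note_on_bytes(60, 90, 1).hex())
print(note_on_bytes(60, 0, 1).hex())
```

A Note On with velocity zero is conventionally treated as a Note Off, which is why a single object can both start and stop a note.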

Figure 2.5

In the patch in figure 2.5, some processing has been added above the noteout object. Attached to the right inlet is a message box with the message “1”. Clicking on this box will set noteout to send messages out on MIDI channel 1. Above the middle inlet are some objects to set the velocity field. Clicking the toggle at the top of the stack on will set the velocity inlet of the noteout object to 90 (as we can see from the number box just above it). Clicking the toggle off sets the velocity to zero, turning the note off. Above the leftmost inlet is a keyboard slider, which can be used to set the pitch number of the note message. To produce output, a user would click first on the ‘1’ message box, to set the MIDI channel. Then, clicking on the toggle would set the velocity field to 90. Finally, clicking on a note on the keyboard slider would turn the note on. To turn the sound back off again, first the velocity toggle must be clicked to the off state. Then, clicking on the same note of the keyboard slider will cause the sound to turn off. This is because the noteout object will perform its method when new input arrives at the leftmost inlet. Therefore, if the velocity field is changed without reclicking on the pitch slider, no effect is heard. This principle is a general one in Max – for most objects, the operation of the object is executed when a message arrives at the leftmost inlet. Patches should be used and constructed with this in mind, to avoid unexpected results from problems with ordering.
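The ordering principle can be modeled in a few lines of Python. This is a sketch of the behavior described above, not of Max internals: the class and method names are ours, and the output is recorded in a list instead of being sent through MIDI.

```python
class NoteOutModel:
    """Stores velocity and channel; sends only when a pitch arrives at the left inlet."""

    def __init__(self, send):
        self.send = send
        self.velocity = 0
        self.channel = 1

    def right_inlet(self, channel):
        self.channel = channel  # stored, nothing is sent

    def middle_inlet(self, velocity):
        self.velocity = velocity  # stored, nothing is sent

    def left_inlet(self, pitch):
        self.send((pitch, self.velocity, self.channel))  # triggers the output


sent = []
noteout = NoteOutModel(sent.append)
noteout.right_inlet(1)    # click the "1" message box
noteout.middle_inlet(90)  # toggle on
noteout.left_inlet(60)    # click a key: the note turns on
noteout.middle_inlet(0)   # toggle off: nothing is heard yet
noteout.left_inlet(60)    # reclick the key: the note turns off
print(sent)
```

Only the two calls to left_inlet produce output, which is exactly why the toggle alone cannot silence the note.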

Figure 2.6

The patch in figure 2.6 introduces some additional aspects of Max programming, and begins to show how the language can be used to manage the distribution of functionality across the processing/response boundary. We have seen that Max supports levels of abstraction in which working patches can be encapsulated and included in other, larger configurations. The patch of figure 2.6 is a subpatch, in that it is written to be included in other patches. Across the top of the figure, then, are three inlets, marked “pitch”, “velocity”, and “segment time.” When included in an enclosing patch, these three inlets will appear at the top of the box representing this subpatch, just as small blackened inlets appear at the top of standard objects, such as stripnote in the figure. Similarly, the boxes with arrows pointing up at the bottom are outlets, and indicate where information will flow out of the subpatch and back into the surrounding program.

As a subpatch, this bit of code does not perform a completely self-contained function, as did the earlier examples. Its job is to take pitch and velocity pairs, and a segment length, and prepare these for the realization of an amplitude envelope segment with a response-phase device. (This example was adapted from programs written by Miller Puckette and Cort Lippe for compositions by Philippe Manoury, including Pluton and La partition du ciel et de l’enfer.) The pitch sent to the subpatch is multiplied by 100, to allow for microtonal inflections between semitones. If a Note On message is being handled, the velocity value will be nonzero. In that case, the incoming velocity is mapped to a new velocity with the table object, which we will explore in more detail in the following section. Otherwise the velocity remains zero, indicating a Note Off. In either case, the velocity is packed with a time value, which indicates the duration during which the amplitude setting of the response device should move from its present value, to the one produced by this subpatch. Packing numbers puts them into a list, which can be manipulated as a single item – sent into one inlet, for example. Unpacking reverses the operation.
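The data flow of the subpatch can be paraphrased in Python as follows. This is a sketch under assumptions: the function name, the shape of the return value, and above all the contents of the velocity table are hypothetical, since the actual table and outlet layout belong to the patch in the figure.

```python
# Hypothetical velocity map: a square-root curve that lifts soft notes.
VELOCITY_TABLE = [round(127 * (v / 127) ** 0.5) for v in range(128)]


def envelope_segment(pitch, velocity, segment_time_ms, table=VELOCITY_TABLE):
    """Prepare one amplitude envelope segment from a pitch/velocity pair."""
    scaled_pitch = pitch * 100  # leaves room for 100 steps between semitones
    if velocity != 0:
        target = table[velocity]  # Note On: remap the velocity through the table
    else:
        target = 0                # Note Off: the velocity stays zero
    # "Packing" target and time makes a list that travels as a single item.
    return scaled_pitch, [target, segment_time_ms]


print(envelope_segment(60, 64, 20))
print(envelope_segment(60, 0, 50))
```

The receiving device would then move its amplitude from the present value to the first element of the list over the number of milliseconds given by the second.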

2.5 Examples

In this section, I will present two simple examples of music generation methods implemented with Max. They refer to some of the earliest examples of algorithmic thought in Western music composition. The stages of interactive music systems will be shown in the context of simple working programs, and the basics of Max notation, as it will be used for the remaining examples in this text, will be reviewed.

Guido’s Method

Several authors have described the theoretic groundwork of compositional formalisms, and their recurrent use throughout the history of Western music (Loy 1989). An early example is Guido’s method for composing chant. To set a liturgical text in chant, all of the vowels of the text are extracted. Guido specified a table comprising the tones of the scale and a way to use the derived vowel set as an index into the table. To produce a chant melody, the user looks in the table of tones for the one corresponding to each vowel of the text in turn. The tone found is set against the syllable containing the vowel, and the process continues until the end of the text is reached. The original correspondence was defined as in figure 2.7.

Figure 2.7

The pitches are the scale members ranging from the G below middle C to the A one line above the treble clef. The vowels are arranged above these pitches. To find a pitch for a vowel, the composer locates the vowel in the top line and finds a corresponding pitch underneath. Note that there is more than one pitch associated with each vowel. Forming a chant with Guido’s method, the composer would rely on more traditional heuristics for choosing between the possibilities: selecting the closest pitch to the previous one, for instance, or favoring certain intervals.

First written down around the year 1026, Guido’s method is a perfect example of table lookup as a compositional formalism. The continued usefulness of this idea is evidenced by the presence in Max of the object table. Just as Guido employed the concept, table in Max is used to store compositional material and read it back out under different orderings for various circumstances. Its appearance on the screen shows the most important features of the structure: a two-dimensional graph of values contains all of the table’s information. The x axis of the graph indicates the addresses of values in the table, and the y axis marks the range of potential numerical values to be stored. Each address in the table, then, holds one value within the specified range.

Figure 2.8

In figure 2.8, the contents of a Max table object are shown. Along the bottom of the rectangle representing the table, the addresses are shown. 0 is the lowest available address, and 4 the highest: there are 5 addresses in all. The address 2 is also marked along the x axis, because the pencil cursor is pointing to the value stored at location 2. Similarly, along the y axis are marked pitch names, corresponding to the values stored. E3 is marked across from the pencil cursor, since that is the value currently stored at address 2.

We can easily implement a version of Guido’s method using the Max table object. Each vowel is used to generate an address sent to the left inlet of the table, and the pitch found at that address is played out as a MIDI note. This program is inadequate as a full Guidonian chant generator, because we limit the vowels to one pitch each. In the original table, each vowel corresponds to several pitch choices. Precisely this aspect of Guido’s method means that it is not strictly algorithmic: each step in the process is not definite, because any vowel (the address) corresponds to more than one possible value. This lack of precision leaves room for artistic judgement to influence the outcome of the method. Such “trapdoors” can often be found in compositional formalisms, down to the programs in use today. To bring the method to a point where it can play chant melodies interactively as the text is typed, however, a decision must be made.
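The restricted, one-pitch-per-vowel version can be sketched in Python as a plain table lookup. The pitch values are assumptions: we take the scale to be the white-key notes from the G below middle C (MIDI note 55) upward, and we assign each vowel the first pitch it would receive if the vowels a, e, i, o, u were laid out in order above the scale.

```python
VOWELS = "aeiou"

# Five addresses, 0 through 4, one per vowel (assumed values: G, A, B, C, D).
TABLE = [55, 57, 59, 60, 62]


def chant(text):
    """Return one MIDI pitch for every vowel in the text, in order."""
    return [TABLE[VOWELS.index(c)] for c in text.lower() if c in VOWELS]


print(chant("Ut queant laxis"))
```

Each vowel becomes an address, and the value stored at that address becomes the pitch; consonants and spaces are simply ignored.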

Figure 2.9

A variety of strategies are possible, any of which can easily be implemented in Max. In the second interactive Guidonian chant generator, shown in figure 2.10, the full vowel/pitch correspondence table is used. Each vowel typed can produce several indices into the table: the vowel “a” addresses four places, and every other vowel, three.

Figure 2.10

The strategy implemented in figure 2.10 for choosing a particular address at any given vowel is to cycle through the possibilities: each successive occurrence of a vowel will index the next possible pitch for that vowel in the table. When the end of the table is reached, the cycle begins again at the beginning of the list. Each vowel cycles through the table independently of the others – in other words, one vowel could be accessing its last possible entry at the same time that another is still pointing to its first.

Now, typing any text on the computer’s keyboard will simultaneously produce a Guidonian chant melody. As the text is typed, the vowels are selected and used as an index into a table of possible pitches, and each vowel cycles through its entries independently. The rhythmic presentation is purely a function of typing speed; each successive tone is produced as soon as the corresponding vowel is typed.
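A Python sketch of the cycling strategy follows. As before, the layout of the table is an assumption: sixteen white-key pitches from MIDI note 55 to 81, with the vowels a, e, i, o, u repeating above them, which gives "a" four entries and every other vowel three. The class and method names are ours.

```python
VOWELS = "aeiou"
SCALE = [55, 57, 59, 60, 62, 64, 65, 67, 69, 71, 72, 74, 76, 77, 79, 81]


class GuidoCycler:
    """Each vowel steps through its own list of pitches, independently of the others."""

    def __init__(self):
        self.choices = {
            v: [p for i, p in enumerate(SCALE) if i % len(VOWELS) == n]
            for n, v in enumerate(VOWELS)
        }
        self.position = {v: 0 for v in VOWELS}

    def type_char(self, char):
        """Return the pitch for a typed character, or None if it is not a vowel."""
        vowel = char.lower()
        if vowel not in self.choices:
            return None
        pitches = self.choices[vowel]
        pitch = pitches[self.position[vowel]]
        # Advance this vowel only, wrapping around at the end of its list.
        self.position[vowel] = (self.position[vowel] + 1) % len(pitches)
        return pitch

    def chant(self, text):
        return [p for p in map(self.type_char, text) if p is not None]


print(GuidoCycler().chant("Ut queant laxis"))
```

Because every vowel keeps its own position, a text rich in one vowel exhausts that vowel's pitches and wraps around while the others have barely begun.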


As can be said of many currents in compositional thought, exploration of formalisms in the generation of music has waxed and waned over the centuries. Periods of intense interest are followed by relative neglect, until stylistic circumstances bring formalistic procedures to the fore once again. After Guido’s method, another important algorithmic school grew in the fourteenth century with the rise of the isorhythmic motet (Reese 1959). An isorhythm is a repetition of some pattern of durations, usually in the tenor, coupled to the pitches of a liturgical chant. Interesting variations arise when the length of the chant melody and the length of the duration pattern are unequal: as the two are repeated, different parts of the chant come to be associated with different durations.

Here again, the formalism can be modeled as a form of table lookup. Let us store an isorhythm in a Max table as a succession of durations held at adjacent addresses. In the traditional usage, the table would be read in order from the beginning until the last address is reached. Then, the address counter would be reset to the first position, and the table read through again. A second table can be used to hold the successive pitches of the chant melody, and these paired up with the durations found in the isorhythm table. The Max patch shown in figure 2.11 contains two such tables, marked pitches and durations. Clicking the start toggle on will cause the resulting melody to play through, looping back to the beginning at the end of each complete presentation. The power afforded us by the interactive implementation of this idea, however, allows experimentation with several variations, some explored in the fourteenth century, and some not.

Figure 2.11

First, we have made this algorithm interactive by specifying that input coming from a MIDI keyboard will replace pitches in the note list. The table object will output the value stored at some address, unless a new value is presented to the right inlet. In that case, the new value is stored and nothing sent out. Therefore, if we connect the pitch information coming from a MIDI source to the data inlet of the pitch table, notes played through MIDI will be fed into the table at whichever address is currently chosen by the counter.

Now we can affect the computer’s performance by our playing: if we do nothing, the output will sound as before. If we play new notes, the result will become less repetitive, as fresh pitches are added to the basic material. If we play quickly enough, notes will not be repeated at all, since they will be replaced before they are performed a second time.
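The replacement of table entries during playback can be sketched in Python. The class is our own model of the counter and pitch table working together, not a transcription of the patch; the starting pitches are hypothetical.

```python
class PitchLoop:
    """A looping pitch table whose current entry can be overwritten by played notes."""

    def __init__(self, pitches):
        self.table = list(pitches)
        self.current = 0  # the address most recently chosen by the counter
        self._next = 0

    def step(self):
        """Counter tick: choose the next address and output the pitch stored there."""
        self.current = self._next
        self._next = (self._next + 1) % len(self.table)
        return self.table[self.current]

    def store(self, pitch):
        """A played note: store it at the current address and output nothing."""
        self.table[self.current] = pitch


loop = PitchLoop([60, 62, 64])
first_pass = [loop.step() for _ in range(3)]
loop.store(70)  # a note played while address 2 is current
second_pass = [loop.step() for _ in range(3)]
print(first_pass, second_pass)
```

If store is called after every step, each entry is overwritten before the counter comes around to it again, so nothing from the previous pass is repeated.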

Another possibility would be to make the effective lengths of the two tables unequal. If we added a second counter object to step through the durations table, we could set its upper limit to a value lower than that in the notes counter, for example. Then, setting the whole process in motion would cause different parts of the pitch table to be matched with different parts of the duration pattern each time through, until eventually they realign at the beginning of both tables and the process begins anew. The counter object allows us a number of other variations on this idea: we could make one counter read through the table from beginning to end, then turn around and read back from end to beginning, oscillating back and forth through the data instead of looping around. The other could read every other element, always moving backward.

How may we evaluate this system? First, we can categorize it according to our previously introduced dimensions: the technique is, first of all, performance driven. There is no stored score or concept of beat or tempo employed here, only time offsets. Second, it is essentially generative. Although external data feed the algorithm, the basic operation uses stored elemental material to generate the output, as opposed to transformations performed on the input as a musical whole. The system follows the player paradigm: the machine’s music is recognizable as a voice separate from the human player.

The compositional method, in this form, is not particularly interesting. The output shows some readily perceptible relation to a human performance but does nothing more than repeat it with a constant, rudimentary style of variation. There are no control variables to allow changes in the variation technique. Many isorhythmic motets, however, are very beautiful pieces of music. This is because the formalism was again replete with “trapdoors”: the isorhythmic method was used as a structural device, serving to unify and organize a much more complex and varied musical whole.

Often such formalisms have served a similar role – as a basic structure, a stimulant for the composer’s imagination, or even an “in joke” between the composer and generations of analysts to come. As computers have been enlisted to realize such processes, however, more and more responsibility is given over to the process. Certainly, many programs are used collaboratively by the composer, who is free to override, interpret, or elaborate the output of the compositional formalism in any way, just as the users of isorhythmic techniques did. But as experience and a realization of the power of such processes grow, a great many programs have been implemented, with a sophisticated and highly developed sense of musicianship, for the automated, and unassisted, composition of music.