1 Introduction

The sound of the human voice plays a role in almost every genre of music making, and there is an underlying fascination with it and its role in expressing emotion. It is poignant since it is the basis of everyday communication in the fulfilment of daily life, including conversation, laughter and non-verbal vocalisation such as wordless song and infant-directed utterance. Electronic voice synthesis now provides a basis for highly intelligible artificial speech, and is becoming more commonplace in everyday life. But the synthesis of natural-sounding artificial speech is not a reality; electronically synthesised speech is rarely, if ever, mistaken as being of human origin.

In terms of early attempts at creating the sound of the human voice, cathedral and some large church pipe organs as well as many cinema/theatre pipe organs include a Vox Humana stop which is designed to imitate the human voice. This stop has been in existence since the late 16th century (WWW-1), including, for example, the “England Organ” in St. Stephen’s Walbrook (WWW-2). The Vox Humana stop uses reed pipes that make use of a metal vibrating reed to create the sound input, and the usually cylindrical resonators with slots near their closed tops that sit above the reeds are short compared to the length of equivalent full-length open pipes that would sound the same pitch. The Vox Humana stop is usually used in conjunction with the tremulant stop which adds tremolo to the sound; the well-respected organ building commentator Audsley notes that “there is only one speaking stop in the organ whose characteristic voice depends on the effect of the tremolant: we allude to the Vox Humana” (Audsley, 1965, p. 210).

Overall though, for many listeners, the sound of the Vox Humana is disappointing as regards simulating the human voice, since it tends to be rather harsh and nasal in its output sound with no distinguishable vowel to be discerned. Audsley (1965, p. 609) states of the Vox Humana that “even the best results that have hitherto been obtained fall far short of what is to be desired”, that in some European organs “such stops when heard in their immediate neighbourhood are coarse and vulgar in the extreme”, and that “of all the stops in the organ, the Vox Humana is the one to which distance lends the greatest charm”.

The lack of naturalness or, indeed, of a recognisable vowel in the output from the Vox Humana pipe organ stop offers an opportunity for the development of a new approach to a keyboard instrument simulating the human voice. Brackhane & Trouvain (2013) note that “the use of the Vox Humana stop once was a substitution for church boy’s choirs”, which provides one rather specialist application. They further note that the measured formants (peaks in the output spectrum resulting from the natural acoustic resonances of the pipes that enable vowel identification [e.g., Titze et al. (2015)]) were higher on average than those of the human voice, that formant frequencies varied between organ builders, and that, perceptually, there was a strong link between the vowel perceived and the pitch of the note itself. The fact that the Vox Humana stop is not commonly found on church organs suggests that using it to substitute for a choir is not very successful, most likely as a direct consequence of the differences between its formant frequencies and human natural vowel formant frequencies.

The synthesis of vowel sounds is commonly achieved through formant modelling based on the analysis of the acoustics of spoken or sung vowels (e.g., Rodet (2002); Holmes & Holmes (2001)). The formants are connected in either series or parallel and each has its advantages and disadvantages. Indeed, Holmes (1983) demonstrated that a natural-sounding output could be achieved with a parallel formant synthesiser, but it involved many iterations using synthesis by analysis over months to achieve one sentence. Since the acoustic model of the human vocal tract is based on a series connection of the resonances, it is likely that Holmes’ success was in part due to the synthesiser modelling subtle changes in the acoustic features belonging to the excitation.

The ability to alter at will individual formant frequencies and bandwidths in either the series or parallel formant synthesiser means that one can readily end up with a formant specification that cannot be realised by the human vocal tract. The implementation of the Vox Humana organ stop attempts to create a vocal tract that is stimulated by a larynx voiced (buzzy reed) source; the acoustic resonances are fixed as a function of the shape of the resonator tubes. A more appropriate way to achieve this is to make use of measured human vocal tract shapes for specific vowels; in this way the acoustic formants are defined directly by the resonances of the tracts themselves. This is the underlying thinking behind the Vocal Tract Organ. The acoustic excitation is provided electronically based on known acoustic pressure models for larynx excitation which could be readily modified as desired for special effects. Not only does the Vocal Tract Organ have the potential to provide a modern Vox Humana stop with an output that is closer to human vowels, it also offers a new instrument for composition and performance using gestural and other controls. Although the Mellotron (Reid, 2002) provided a way to achieve choral textures, its output was based on recordings which could not be altered. The Vocal Tract Organ will enable multiple tracts to be used together to form choral textures of the same or even different vowels as desired.

The idea behind the Vocal Tract Organ arose as part of work to explore a route to more natural synthetic speech based on observations relating to natural human speech articulation, for which we have captured magnetic resonance images (MRI) of vocal tract positions for different speech sounds. These are used to provide the vocal tract shape data for a 2-D (Mullen et al., 2007) and 3-D (Speed et al., 2014) digital waveguide synthesis of the acoustic output from the tracts when a suitable larynx excitation source is provided. The larynx source is the Liljencrants-Fant (LF) model that is commonly used in speech synthesis systems and can be rendered as a mathematical model and adjusted to synthesise different voice qualities (Fant et al., 1985).

Although digital waveguide synthesis is carried out in software, the presence of 3-D MRI representations of the vocal tract for different speech sounds at a time when 3-D printing became available led to the production of 3-D printed vocal tracts to support learning and understanding of spoken and sung sound production. This further led to the suggestion that these could potentially be used as hardware vocal synthesis systems if an appropriate loudspeaker could be found and if the 3-D printed wall thickness was acoustically appropriate. In terms of a suitable loudspeaker, one with a small output port was required and an Adastra loudspeaker drive unit (Adastra 952.210) provided this whilst having a heavy metal body via which little sound was transmitted. Experiments were carried out with various wall thicknesses and a 2 mm wall thickness was found to be most appropriate. Fig. 1 shows a 3-D printed Vocal Tract for the vowel in “spa” mounted atop an Adastra 952.210 loudspeaker drive unit as used in the Vocal Tract Organ.

The creation of 3-D printed vocal tract models based on MRI data has been used elsewhere to explore the physiological changes in vocal tract shape when lowering fundamental frequency (Hirai et al., 1993); how the shape of the vocal tracts of sopranos changes when they sing different notes (Takemoto et al., 2012); and the physiology of vowel production (Kitamura et al., 2008; Honda et al., 2010).

Once a hardware version of a single 3-D printed vocal tract driven acoustically via a loudspeaker had been trialled successfully, it became clear to the author that this one tract on a loudspeaker had a visual similarity to an organ pipe and could be implemented to play a number of notes together to create choral-like textures. In addition, the output sound has highly vowel-like qualities. Thus the notion of creating the Vocal Tract Organ was initiated. To the author’s knowledge, the concept of a Vocal Tract Organ that makes use of measured human vocal tract shapes is entirely novel.

2 The Vocal Tract Organ

The notion behind the Vocal Tract Organ is that it should be a chamber organ style instrument that features 3-D printed vocal tracts on loudspeakers that are playable from a piano-style keyboard. Since the human vocal tract does not vary in length with pitch change as flue organ pipes do, there is no requirement for one tract per note as there is with organ pipes. Human voice (singing and speech) production is based on the notion of a sound source (the vibrating vocal folds in the larynx for voiced or pitched sounds and air being forced through a narrow gap producing turbulence and therefore acoustic noise for voiceless or non-pitched sounds) that is modified by the acoustic properties of the vocal tract (throat, mouth and nose) (e.g., Sundberg (1987)). Synthetic speech production generally relies on modelling the sound source and the sound modifiers separately in the production of its output. On the basis that human voice production is essentially linear, it is theoretically reasonable to play multiple sound sources (for example the individual notes of chords) through a single vocal tract. This means that the tract does not affect the sound source and therefore the sources for different notes can be added together before they are played through the tract. Whilst this assumption seems reasonable in the context of electronic synthesis, where the acoustic properties of the tract cannot modify the loudspeaker output from the sound source, more recent thinking in regard to human voice production is that there are non-linear interactions in the natural human acoustic vocal system (Titze, 2006).
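The linearity argument can be illustrated numerically. The following is a minimal sketch, not the author's implementation: it assumes a single two-pole digital resonator as a stand-in for one tract resonance, and the function names, the 700 Hz centre frequency and the 100 Hz bandwidth are all illustrative choices. It checks that mixing two notes before the resonator gives the same output as passing each note through separately and mixing afterwards.

```python
import math

def resonator(signal, freq_hz, bandwidth_hz, sample_rate=16000):
    """Two-pole resonator: a linear, time-invariant stand-in for one tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * freq_hz / sample_rate)
    a2 = -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2  # difference equation of the resonator
        out.append(y)
        y2, y1 = y1, y
    return out

def sine(freq_hz, n, sample_rate=16000):
    """n samples of a sine wave, used here as a placeholder source."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

n = 2000
note_a = sine(220.00, n)  # two notes of a chord (illustrative values)
note_b = sine(277.18, n)

# Sources added together, then played through one "tract"
mixed_then_filtered = resonator([a + b for a, b in zip(note_a, note_b)], 700.0, 100.0)

# Each source played through its own "tract", then added
filtered_then_mixed = [
    a + b
    for a, b in zip(resonator(note_a, 700.0, 100.0), resonator(note_b, 700.0, 100.0))
]

max_difference = max(abs(p - q) for p, q in zip(mixed_then_filtered, filtered_then_mixed))
```

For a linear system the two outputs differ only by floating-point rounding, which is why a single tract can in principle carry a whole chord. The sketch says nothing about the non-linear source-tract interactions noted by Titze (2006).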

In order to implement the Vocal Tract Organ, two approaches have been implemented for the synthesis of the voice source. The first involves the use of Pure Data, or Pd (Puckette, 2007), which is a freeware graphical audio synthesis system for Mac or PC. The second makes use of Arduino boards, which are relatively easy-to-program microcontrollers. Each is described below.

2.1 Pure Data (Pd)-Based Voice Source

Previous work on 4-part choral textures based on vowel synthesis (Howard et al., 2013) implemented four three-formant parallel synthesisers using Pd, each of which had a voice source that had the option of using a sawtooth, pulse or LF waveshape that incorporated controls for fundamental frequency, vibrato rate and vibrato depth.

Each note of the Vocal Tract Organ requires one of these voice source sections with an added amplitude control. The elements of the Pd patch for the overall six-channel system and one larynx source are shown in Fig. 2. Note that the lower part of Fig. 2 is the contents of one of the “pd larynx” object boxes in the upper part of the figure.

This version of the Vocal Tract Organ takes MIDI (musical instrument digital interface) data on channel 1 (“notein 1”) and it is six-note polyphonic with voice stealing (“poly 6 1”). Each received MIDI note is routed to one of the six larynx sources depending on the polyphonic voice number and thence to an individual audio channel. Each larynx source is identical and that for channel one is illustrated in the lower part of Fig. 2. The input note number is converted into a fundamental frequency value (“mtof”). The velocity value is used to control the output amplitude attack and decay, and to avoid clicks these are ramped over 100 ms for note on and 150 ms for note off. Vibrato rate (in Hz) and depth (in cents, where one cent is one hundredth of a semitone) are controlled via sliders as “vibRateHz1” and “vibDepthCents1” respectively. Vibrato depth in cents is converted into a frequency ratio (highest vibrato frequency/lowest vibrato frequency) for implementation as shown in equation 1 (Howard & Angus, 2009, App. 3).

$\text{Frequency Ratio} = 2^{\text{cents}/1200}$ (1)
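The note-number and vibrato conversions can be sketched in a few lines. This is an illustration only, not the Pd patch itself: the function names are hypothetical, the note-to-frequency mapping assumes the usual equal-tempered convention with A4 = 440 Hz at MIDI note 69, and the placement of the vibrato extremes symmetrically (in cents) about the nominal pitch is an assumption that the text does not state.

```python
def midi_to_hz(note, a4_hz=440.0):
    """Equal-tempered frequency for a MIDI note number (assumes A4 = note 69)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)

def cents_to_ratio(cents):
    """Frequency ratio corresponding to an interval in cents (equation 1)."""
    return 2.0 ** (cents / 1200.0)

def vibrato_extremes(f0_hz, depth_cents):
    """Lowest and highest vibrato frequencies for a given total depth in cents.

    Assumption: the excursion is centred on f0 on a logarithmic (cents) scale,
    so half the depth lies above f0 and half below.
    """
    half_span = cents_to_ratio(depth_cents / 2.0)
    return f0_hz / half_span, f0_hz * half_span

# Example: A4 with a total vibrato depth of 50 cents
lowest_hz, highest_hz = vibrato_extremes(midi_to_hz(69), 50.0)
```

With these definitions the ratio `highest_hz / lowest_hz` equals `cents_to_ratio(50.0)`, which is the highest-to-lowest frequency ratio that equation 1 describes.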

The output larynx waveform is selected as a pulse excitation synthesised with 20 harmonics (“pulse1”), sawtooth excitation synthesised with 20 harmonics (“saw1”) or the LF model (“drawn1”), which is set up and changed by drawing it with the mouse (a typical calculated cycle of the LF model is shown in Fig. 3). Typical shape changes for the LF model are described in Fant et al. (1985). To ensure that the six sources are not in synchrony their vibrato settings are set such that they are always different from each other.
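A 20-harmonic pulse or sawtooth excitation can be built by additive synthesis. The sketch below is an assumption about how such waveshapes might be constructed, not a transcription of the Pd patch: it uses the textbook harmonic weightings (amplitude 1/k for the sawtooth and equal amplitudes for the pulse), and the function names and the 1024-sample cycle length are illustrative.

```python
import math

NUM_HARMONICS = 20  # number of harmonics quoted in the text

def sawtooth_cycle(table_size=1024, harmonics=NUM_HARMONICS):
    """One cycle of a band-limited sawtooth: harmonic k has amplitude 1/k."""
    return [
        sum(math.sin(2.0 * math.pi * k * i / table_size) / k
            for k in range(1, harmonics + 1))
        for i in range(table_size)
    ]

def pulse_cycle(table_size=1024, harmonics=NUM_HARMONICS):
    """One cycle of a band-limited pulse: all harmonics at equal amplitude,
    scaled so that the peak at the start of the cycle is 1."""
    return [
        sum(math.cos(2.0 * math.pi * k * i / table_size)
            for k in range(1, harmonics + 1)) / harmonics
        for i in range(table_size)
    ]

saw = sawtooth_cycle()
pulse = pulse_cycle()
```

Limiting the sum to a fixed number of harmonics keeps the spectrum of each source bounded, at the cost of a slightly rounded waveshape compared with the ideal sawtooth or impulse train.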

2.2 Arduino-Based Voice Source

To make the Vocal Tract Organ playable as a general-purpose musical instrument, it would be most appropriate if it did not require a dedicated computer. One way to enable this is to make use of a microcontroller to create the larynx sound sources and to play the organ from a MIDI keyboard. The Arduino family of microcontrollers offers a cost-effective solution, and experimentation showed that it is possible to run a 6-note polyphonic MIDI synthesiser on an Arduino MEGA. This makes use of the Arduino “MIDI” library (WWW-3). It takes its wave shape from a stored single cycle, in this case of the LF voice source, which was set up in an Excel™ spreadsheet with 1024 samples at any desired resolution in terms of bits per sample.

Audio for the Arduino-based Vocal Tract Organ makes use of the Arduino “Mozzi” library (WWW-4) that can run in one of two modes: “LoFi” and “HiFi”, the difference being the resolution of the data samples. In LoFi mode, audio samples are 8-bit and the sampling rate is 16 kHz, which provides sufficient bandwidth for a larynx voice source. In HiFi mode, audio samples are 14-bit and the sampling rate is 32 kHz. Both LoFi and HiFi modes have been explored and either can be implemented as part of a 6-channel polyphonic synthesiser.
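The idea of replaying a stored single cycle at a chosen pitch can be sketched as a wavetable oscillator. The example below is a host-side illustration in Python under stated assumptions and does not reproduce the Mozzi library or the Arduino code: the stored cycle is a placeholder sine rather than the LF waveshape, the phase accumulator uses floating point, and all names are hypothetical. The table length (1024), bit depth (8) and sampling rate (16 kHz) follow the LoFi figures given in the text.

```python
import math

TABLE_SIZE = 1024     # samples in the stored single cycle
SAMPLE_RATE = 16000   # LoFi sampling rate quoted in the text, in Hz

def placeholder_cycle(size=TABLE_SIZE):
    """Placeholder waveshape only (a sine cycle), NOT the LF voice source."""
    return [math.sin(2.0 * math.pi * i / size) for i in range(size)]

def quantise(cycle, bits):
    """Map samples in [-1, 1] to signed integers of the given bit depth."""
    peak = 2 ** (bits - 1) - 1
    return [int(round(s * peak)) for s in cycle]

def play(table, f0_hz, num_samples, sample_rate=SAMPLE_RATE):
    """Read the table with a phase accumulator so that it repeats f0_hz times per second."""
    step = f0_hz * len(table) / sample_rate  # table positions advanced per output sample
    phase = 0.0
    out = []
    for _ in range(num_samples):
        out.append(table[int(phase) % len(table)])
        phase += step
    return out

table_8bit = quantise(placeholder_cycle(), 8)
samples = play(table_8bit, 220.0, 160)  # 10 ms of a 220 Hz note
```

Changing the `bits` argument to 14 gives the sample resolution quoted for HiFi mode; the oscillator itself is unchanged, since only the stored table differs.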

The specification for the Arduino-based Vocal Tract Organ is subject to change as it is developed, but it is currently as follows.

• 6-note polyphonic
• MIDI IN
• MIDI THRU
• Multiple stops in terms of vowel sound
• Multiple stop footages (e.g., 8’, 4’, 2 2/3’, 2’, 1 3/5’, 1 1/3’, 1 1/7’, 1’)
• Audio output to drive the Adastra 952.210
• Just and equal temperaments
• Master tuning control
• Vibrato rate control
• Vibrato depth control
• Volume Control

The multiple stops item is the proposition that the Vocal Tract Organ, which currently makes use of 3-D printed vocal tracts for the vowel in “spa” for an adult male, will ultimately have different vowel sounds available at least for an adult male and an adult female. An extended version of the Vocal Tract Organ might include stops at different footages, which is how acoustic harmonic synthesis control is achieved with the pipe organ. At the moment, in keeping with typical Vox Humana pipe organ stops, the Vocal Tract Organ makes use of an 8 foot (8’) stop, meaning that the notes play at concert pitch (Howard & Angus, 2009, ch. 5), but there is no reason why there could not be 4’ stops (sounding an octave higher), 16’ stops (sounding an octave lower) or any other footage creating acoustically members of the harmonic series of the 8’ fundamental. Eight are shown in the specification above, and the relevant harmonic number can be found by dividing the footage into 8 (e.g., Howard & Angus (2009)).
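The rule of dividing the footage into 8 can be checked with exact fractions. The following short sketch applies it to the footages listed in the specification; the dictionary and function names are illustrative only.

```python
from fractions import Fraction

def harmonic_number(footage, reference=Fraction(8)):
    """Harmonic of the 8' fundamental sounded by a stop of the given footage."""
    return reference / footage

# Footages from the specification, written as exact fractions of a foot
footages = {
    "8'": Fraction(8),
    "4'": Fraction(4),
    "2 2/3'": Fraction(8, 3),
    "2'": Fraction(2),
    "1 3/5'": Fraction(8, 5),
    "1 1/3'": Fraction(4, 3),
    "1 1/7'": Fraction(8, 7),
    "1'": Fraction(1),
}

harmonics = {name: harmonic_number(length) for name, length in footages.items()}

for name, number in harmonics.items():
    print(f"{name:>7} stop sounds harmonic {number} of the 8' fundamental")
```

The listed footages map onto consecutive whole-number harmonics, whereas a 16’ stop gives 1/2, that is, a sub-octave rather than a member of the harmonic series of the 8’ fundamental.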

Adding stops to reinforce acoustically the harmonic series of the notes played is basic to the design of a pipe organ. In the context of the Vox Humana, this is done typically by including a 16’ and a 4’ rank to provide a chorus effect. In some theatre organs, a Vox Humana Celeste 8’ was included, which is a second rank at 8’ pitch that is slightly detuned from concert pitch such that its acoustic output beats with the main Vox Humana 8’ rank, thus adding to the overall perceived chorus effect. One of the main advantages of creating electronic organ stops is their cheapness compared to traditional pipe organ stops and the ability to experiment with a number of different combinations readily. This is the thinking behind the proposition of creating a range of stop footages within the Vocal Tract Organ to enable their outputs when in various combinations to be assessed aesthetically and perceptually.

Another important aspect of the Vocal Tract Organ is the pitch range that is available, which encompasses the 5 octaves of a typical pipe organ keyboard. No typical human singer can sing over such a wide pitch range; 3 octaves would be around the maximum for a trained opera singer. This provides another addition to the perceptual space that can be excited by the Vocal Tract Organ. Clearly the higher the pitch the wider the harmonic spacing and therefore the fewer harmonics there are to excite the resonances of the vocal tract to produce the characteristic formant peaks in the overall output (Sundberg, 1987). Rodet (2002) comments in regard to listeners’ responses when hearing their synthetic interpretation of Mozart’s “Queen of the Night” aria from Die Zauberflöte that “unaware listeners attribute the synthetic voice to a human singer and not to a computer and that they spontaneously emit judgements about the quality of the singer”. This is an example of singing at very high pitches up to F6 with a nominal fundamental frequency of 1397 Hz and listeners can accept this as being of human origin. The exploration of sounds produced by human vocal tracts over a pitch range that is far greater than is usually heard from a human singer is another opportunity offered by the Vocal Tract Organ.
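The effect of pitch on the number of harmonics available to excite the tract resonances can be made concrete with a simple count. In the sketch below the 4000 Hz upper limit is an arbitrary illustrative figure chosen for the example, not a value taken from the text, and the function name is hypothetical; harmonics of a periodic source fall at integer multiples of the fundamental, so their spacing equals the fundamental frequency.

```python
def harmonics_below(f0_hz, upper_hz):
    """Number of harmonics of f0_hz lying at or below upper_hz."""
    return int(upper_hz // f0_hz)

UPPER_HZ = 4000.0  # illustrative limit only (an assumption of this sketch)

low_note_count = harmonics_below(110.0, UPPER_HZ)    # A2, a low male sung pitch
high_note_count = harmonics_below(1397.0, UPPER_HZ)  # F6, as in the aria example

print(f"110 Hz:  {low_note_count} harmonics up to {UPPER_HZ:.0f} Hz")
print(f"1397 Hz: {high_note_count} harmonics up to {UPPER_HZ:.0f} Hz")
```

Under this assumed limit the low note offers dozens of harmonics to sample the formant envelope, while the F6 offers only two, which is why vowel identity becomes harder to convey at the top of the keyboard range.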

In order to have the flexibility to meet this aspect of the specification, the use of one Arduino MEGA driving one loudspeaker with the appropriate 3-D printed vocal tract for the vowel relating to each stop is proposed. To achieve this, each Arduino board requires a MIDI IN and a MIDI THRU port so that any number of Arduinos can be daisy-chained together. The MIDI IN is opto-isolated as defined in the MIDI specification (WWW-3) and its output interfaces directly with the “COMS RX0” port of the Arduino and after pulse shaping to the MIDI THRU socket.

Switches are connected via the digital input pins of the Arduino MEGA for the stop to activate the appropriate Arduino board and to toggle between just and equal temperament. Master tuning, vibrato rate and vibrato depth are implemented using three linear 10 kΩ potentiometers connected to three Arduino MEGA analogue inputs. The fundamental frequency values are set up as an array for the top octave of MIDI notes in both just and equal temperament to be called as required and divided by the appropriate power of 2 depending on the octave specified by the note-on value. Master tuning is to be implemented as an offset to one of the notes in the top octave and the remaining 11 fundamental frequencies are recalculated for the relevant array (just or equal temperament). The volume control is implemented using a logarithmic 10 kΩ potentiometer in the analogue output path.
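The top-octave lookup scheme can be sketched as follows. This is a host-side illustration in Python rather than the Arduino code, and several details are assumptions made for the example: the stored octave is taken to start at MIDI note 108 (C8), the just intonation ratios are one common five-limit set relative to C, the master tuning offset is applied to the first note of the table, and all names are hypothetical.

```python
TOP_OCTAVE_START = 108  # assumption: the stored "top octave" begins at MIDI note C8
C8_HZ = 440.0 * 2.0 ** ((TOP_OCTAVE_START - 69) / 12.0)  # equal-tempered C8 for A4 = 440 Hz

# Frequency ratios of the 12 pitch classes relative to C
EQUAL_RATIOS = [2.0 ** (k / 12.0) for k in range(12)]
JUST_RATIOS = [1.0, 16 / 15, 9 / 8, 6 / 5, 5 / 4, 4 / 3,
               45 / 32, 3 / 2, 8 / 5, 5 / 3, 9 / 5, 15 / 8]  # assumed five-limit set

def build_top_octave(ratios, tuning_offset_hz=0.0):
    """Top-octave frequency array; the offset shifts the first note and the
    remaining 11 notes are recalculated from it."""
    reference_hz = C8_HZ + tuning_offset_hz
    return [reference_hz * r for r in ratios]

def note_to_hz(note, table):
    """Look up the pitch class in the top octave and divide by a power of 2."""
    pitch_class = (note - TOP_OCTAVE_START) % 12
    octaves_down = (TOP_OCTAVE_START + pitch_class - note) // 12
    return table[pitch_class] / (2.0 ** octaves_down)

equal_table = build_top_octave(EQUAL_RATIOS)
just_table = build_top_octave(JUST_RATIOS)

a4_equal = note_to_hz(69, equal_table)  # A4 in equal temperament
fifth_just = note_to_hz(67, just_table) / note_to_hz(60, just_table)  # G4 over C4
```

Storing only 12 values per temperament keeps the table small, and switching temperament amounts to selecting a different array while the octave division stays the same.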

The Arduino Vocal Tract Organ is a work in progress. A 6-note polyphonic LoFi voice source has been implemented incorporating an Arduino shield board (see Fig. 3) that includes vibrato rate, vibrato depth, volume, just and equal temperaments, stop switch, MIDI IN and MIDI THRU. Master tuning is still to be implemented, as is a HiFi audio output; the LoFi version is not as successful as the Pd implementation.

3 Performances with the Vocal Tract Organ

The Vocal Tract Organ provides one way of creating singing textures (Howard, 2014) and the Pd version has been employed in concerts for which the author has composed special pieces for performance. Vocal Vision I, the first piece, was composed for two sopranos and four-part Vocal Tract Organ and premièred at Woodend, Scarborough for the CREST Network concert (26 January 2013) and performed again for the 2013 World Voice Day (16 April 2013) (performance: WWW-6; score: WWW-7). The second piece, Vocal Vision II (performance: WWW-8; score: WWW-9), is a barbershop vocalise-style piece for bass, tenor and two-part Vocal Tract Organ (Howard, 2014b), and it was premièred at the author’s “From south to north, a vocal π” at the York Centre for Early Music as part of the 2013 Festival of Ideas (24 June 2013).

The third piece was performed at a black-tie, after-dinner short flashmob opera aria entertainment to highlight engineering impact in a musical context for the Royal Academy of Engineering 2013 Summer Soirée held at the University of York (27 June 2013) with a member of the British Royal Family present. The piece requested was “O mio babbino caro” from Giacomo Puccini’s opera Gianni Schicchi (1918) sung by a soprano to be accompanied on the Vocal Tract Organ. Since the usual keyboard reduction is an orchestral reduction that is unsuitable for an organ-like accompaniment, the author arranged a new accompaniment in a chorale-like style specially for this performance (WWW-10). A request for a repeat performance (no recording was allowed at the first performance) was made for a concert in the Picture Gallery at Royal Holloway, University of London on 15 March 2016 (WWW-11).

Once the modifications and additions have been made to the instrument, it is hoped that the notion of writing music for the Vocal Tract Organ might appeal to some of today’s composers (e.g., Miranda (2014); Wishart (2012)), who are pushing the boundaries of what is possible in vocal output and listener perception.

4 Conclusions

The Vocal Tract Organ is a new musical instrument that makes use of 3-D printed vocal tracts based on MRI images of human tracts for different vowels sitting atop loudspeakers that are driven by an appropriate larynx source waveform. The working demonstrated performance version is implemented in Pure Data (Pd) and requires a computer, six-channel audio output and MIDI input. The stand-alone version is in development and it makes use of Arduino MEGA boards for its larynx sources and offers the potential for a multi-stop Vocal Tract Organ that is portable. The musical potential of the Vocal Tract Organ has only been hinted at with three pieces performed at separate special events. Interest has been shown by a leading UK pipe organ builder in using 3-D printed tracts as the resonators for pipe organ reeds and experiments are in progress to assess their acoustic outputs and musical efficacy.
