A vocoder (a contraction of voice and encoder) is a category of voice codec that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption or voice transformation.
The vocoder was invented in 1938 by Homer Dudley at Bell Labs as a means of synthesizing human speech. This work was developed into the channel vocoder, which was used as a voice codec in telecommunications to code speech and conserve transmission bandwidth.
By encrypting the control signals, voice transmission can be secured against interception. Its primary use in this fashion is for secure radio communication. The advantage of this method of encryption is that none of the original signal is sent, only envelopes of the bandpass filters. The receiving unit needs to be set up in the same filter configuration to re-synthesize a version of the original signal spectrum.
The human voice consists of sounds generated by the opening and closing of the glottis by the vocal cords, which produces a periodic waveform with many harmonics. This basic sound is then filtered by the nose and throat (a complicated resonant piping system) to produce differences in harmonic content (formants) in a controlled way, creating the wide variety of sounds used in speech. There is another set of sounds, known as the unvoiced and plosive sounds, which are created or modified by the mouth in different fashions.
The vocoder examines speech by measuring how its spectral characteristics change over time. This results in a series of signals representing these modified frequencies at any particular time as the user speaks. In simple terms, the signal is split into a number of frequency bands (the larger this number, the more accurate the analysis) and the level of signal present at each frequency band gives the instantaneous representation of the spectral energy content. To recreate speech, the vocoder simply reverses the process, processing a broadband noise source by passing it through a stage that filters the frequency content based on the originally recorded series of numbers.
Specifically, in the encoder, the input is passed through a multiband filter, then each band is passed through an envelope follower, and the control signals from the envelope followers are transmitted to the decoder. The decoder applies these (amplitude) control signals to corresponding amplifiers of the filter channels for re-synthesis.
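The encoder and decoder stages described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reconstruction of any historical design: the second-order band-pass sections, the Q value, the five band centres, the 50 Hz envelope smoothing, the 20 ms frame hop and the uniform-noise carrier are all illustrative choices, and the function names are hypothetical.

```python
import math
import random

def bandpass_coeffs(f0, fs, q):
    """Second-order band-pass with unity gain at f0, normalized so a0 == 1."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def biquad(x, coeffs):
    """Run one band-pass section over the samples in x (direct form I)."""
    (b0, b1, b2), (a1, a2) = coeffs
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out

def envelope(x, fs, cutoff=50.0):
    """Envelope follower: full-wave rectifier plus one-pole low-pass."""
    c = 1.0 - math.exp(-2.0 * math.pi * cutoff / fs)
    level, out = 0.0, []
    for xn in x:
        level += c * (abs(xn) - level)
        out.append(level)
    return out

def encode(signal, fs, centers, q=4.0, hop=160):
    """Encoder: one envelope per band, sampled once every `hop` samples."""
    envs = [envelope(biquad(signal, bandpass_coeffs(f, fs, q)), fs)
            for f in centers]
    return [[env[n] for env in envs] for n in range(0, len(signal), hop)]

def decode(frames, n_samples, fs, centers, q=4.0, hop=160, seed=0):
    """Decoder: filter a noise carrier with the same bank and scale each
    band by the control value held for the current frame."""
    rng = random.Random(seed)
    carrier = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    out = [0.0] * n_samples
    for b, f in enumerate(centers):
        band = biquad(carrier, bandpass_coeffs(f, fs, q))
        for n in range(n_samples):
            gain = frames[min(n // hop, len(frames) - 1)][b]
            out[n] += gain * band[n]
    return out

FS = 8000                                          # assumed sampling rate, Hz
CENTERS = [250.0, 500.0, 1000.0, 2000.0, 3000.0]   # assumed band centres, Hz
tone = [math.sin(2.0 * math.pi * 1000.0 * n / FS) for n in range(2000)]
frames = encode(tone, FS, CENTERS)                 # the transmitted control signals
resynth = decode(frames, len(tone), FS, CENTERS)
```

For the 1000 Hz test tone, the largest control value in the settled frames belongs to the 1000 Hz band, and only the small table in `frames` would need to cross the link rather than the 2000 waveform samples.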
Information about the instantaneous frequency of the original voice signal (as distinct from its spectral characteristic) is discarded; it was not important to preserve this for the vocoder's original use as an encryption aid. It is this "dehumanizing" aspect of the vocoding process that has made it useful in creating special voice effects in popular music and audio entertainment.
The vocoder process sends only the parameters of the vocal model over the communication link, instead of a point-by-point recreation of the waveform. Since the parameters change slowly compared to the original speech waveform, the bandwidth required to transmit speech can be reduced. This allows more speech channels to utilize a given communication channel, such as a radio channel or a submarine cable.
Analog vocoders typically analyze an incoming signal by splitting the signal into multiple tuned frequency bands or ranges. A modulator and carrier signal are sent through a series of these tuned bandpass filters. In the example of a typical robot voice, the modulator is a microphone and the carrier is noise or a sawtooth waveform. There are usually between eight and 20 bands.
The amplitude of the modulator for each of the individual analysis bands generates a voltage that is used to control amplifiers for each of the corresponding carrier bands. The result is that frequency components of the modulating signal are mapped onto the carrier signal as discrete amplitude changes in each of the frequency bands.
Often there is an unvoiced band or sibilance channel. This is for frequencies that are outside the analysis bands for typical speech but are still important in speech. Examples are words that start with the letters s, f, ch or any other sibilant sound. These can be mixed with the carrier output to increase clarity. The result is recognizable speech, although somewhat "mechanical" sounding. Vocoders often include a second system for generating unvoiced sounds, using a noise generator instead of the fundamental frequency.
The channel vocoder algorithm keeps only the amplitude component of the analytic signal and ignores the phase component, which tends to result in an unclear voice; for methods of rectifying this, see phase vocoder.
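The amplitude and phase components mentioned here can be written out explicitly. The notation below is an assumption introduced for illustration: x_k(t) is the output of the k-th band filter and \hat{x}_k(t) is its Hilbert transform.

```latex
% Analytic signal of band k, split into amplitude and phase
z_k(t) = x_k(t) + j\,\hat{x}_k(t) = A_k(t)\, e^{j\varphi_k(t)},
\qquad
A_k(t) = \sqrt{x_k(t)^2 + \hat{x}_k(t)^2},
\qquad
\varphi_k(t) = \operatorname{atan2}\bigl(\hat{x}_k(t),\, x_k(t)\bigr)
```

In this notation, a channel vocoder transmits an estimate of each envelope A_k(t) and discards \varphi_k(t), whereas a phase vocoder retains the phase information as well.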
The development of a vocoder was started in 1928 by Bell Labs engineer Homer Dudley, who was granted patents for it, US application 2,151,091 on March 21, 1939, and US application 2,098,956 on Nov 16, 1937.
To demonstrate the speech synthesis ability of its decoder part, the Voder (Voice Operating Demonstrator) was introduced to the public at the AT&T building at the 1939-1940 New York World's Fair. The Voder consisted of a switchable pair of sound sources (an electronic oscillator for a pitched tone and a noise generator for hiss), a 10-band bank of resonator filters with variable-gain amplifiers acting as a vocal tract, and manual controllers, including a set of pressure-sensitive keys for filter control and a foot pedal for pitch control of the tone. The filters controlled by the keys converted the tone and the hiss into vowels, consonants, and inflections. This was a complex machine to operate, but a skilled operator could produce recognizable speech.[media 1]
Dudley's vocoder was used in the SIGSALY system, which was built by Bell Labs engineers in 1943. SIGSALY was used for encrypted high-level voice communications during World War II. The KO-6 voice coder was released in 1949 in limited quantities; it was a close approximation to the SIGSALY at 1200 bit/s. In 1953, the KY-9 THESEUS 1650 bit/s voice coder used solid-state logic to reduce the weight to 565 pounds (256 kg) from SIGSALY's 55 tons, and in 1961 the HY-2 voice coder, a 16-channel 2400 bit/s system, weighed 100 pounds (45 kg) and was the last implementation of a channel vocoder in a secure speech system.
Later work in this field has used digital speech coding. The most widely used speech coding technique is linear predictive coding (LPC), which was first proposed by Fumitada Itakura of Nagoya University and Shuzo Saito of Nippon Telegraph and Telephone (NTT) in 1966. Another speech coding technique, adaptive differential pulse-code modulation (ADPCM), was developed by P. Cummiskey, Nikil S. Jayant and James L. Flanagan at Bell Labs in 1973.
Even with the need to record several frequencies, and additional unvoiced sounds, the compression of vocoder systems is impressive. Standard speech-recording systems capture frequencies from about 500 Hz to 3,400 Hz, where most of the frequencies used in speech lie, typically using a sampling rate of 8 kHz (slightly greater than the Nyquist rate). The sampling resolution is typically 12 or more bits per sample (16 is standard), for a final data rate in the range of 96-128 kbit/s, but a good vocoder can provide a reasonably good simulation of voice with as little as 2.4 kbit/s of data.
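The data rates quoted above follow from simple arithmetic, which the short sketch below reproduces. The function and constant names are hypothetical; the numbers are the ones given in the text.

```python
def pcm_bit_rate(sample_rate_hz, bits_per_sample):
    """Uncompressed (PCM) data rate in bit/s for a single channel."""
    return sample_rate_hz * bits_per_sample

SAMPLE_RATE_HZ = 8000        # telephone-band sampling rate from the text
UPPER_BAND_EDGE_HZ = 3400    # highest captured frequency from the text
VOCODER_BIT_RATE = 2400      # "as little as 2.4 kbit/s"

# Sampling must exceed twice the highest frequency of interest.
nyquist_rate_hz = 2 * UPPER_BAND_EDGE_HZ

rate_12_bit = pcm_bit_rate(SAMPLE_RATE_HZ, 12)
rate_16_bit = pcm_bit_rate(SAMPLE_RATE_HZ, 16)

print(f"Nyquist rate: {nyquist_rate_hz} Hz (sampling at {SAMPLE_RATE_HZ} Hz)")
for bits, rate in ((12, rate_12_bit), (16, rate_16_bit)):
    ratio = rate / VOCODER_BIT_RATE
    print(f"{bits}-bit PCM: {rate / 1000:.0f} kbit/s, "
          f"about {ratio:.0f}x the vocoder rate")
```

On these figures, a 2.4 kbit/s vocoder needs roughly one fortieth to one fiftieth of the uncompressed data rate.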
"Toll quality" voice coders, such as ITU G.729, are used in many telephone networks. G.729 in particular has a final data rate of 8 kbit/s with superb voice quality. G.723 achieves slightly worse quality at data rates of 5.3 kbit/s and 6.4 kbit/s. Many vocoder systems use lower data rates, but below 5 kbit/s voice quality begins to drop rapidly.
Several vocoder systems are used in NSA encryption systems:
(ADPCM is not a proper vocoder but rather a waveform codec. ITU has gathered G.721 along with some other ADPCM codecs into G.726.)
Modern vocoders that are used in communication equipment and in voice storage devices today are based on the following algorithms:
Since the late 1970s, most non-musical vocoders have been implemented using linear prediction, whereby the target signal's spectral envelope (formant) is estimated by an all-pole IIR filter. In linear prediction coding, the all-pole filter replaces the bandpass filter bank of its predecessor and is used at the encoder to whiten the signal (i.e., flatten the spectrum) and again at the decoder to re-apply the spectral shape of the target speech signal.
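The whitening and re-shaping roles of the all-pole filter can be illustrated with a small pure-Python sketch. It estimates the predictor with the autocorrelation method and the Levinson-Durbin recursion, which is one common way of fitting an all-pole model and is an assumption here rather than something the text prescribes. The function names, the second-order model and the toy test signal are likewise illustrative.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n + k], for k = 0..max_lag."""
    return [sum(x[n] * x[n + k] for n in range(len(x) - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for predictor coefficients a[1..order],
    where the prediction is x_hat[n] = sum_k a[k] * x[n - k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]

def whiten(x, coeffs):
    """Encoder side: residual e[n] = x[n] - sum_k a[k] * x[n - k]."""
    return [x[n] - sum(c * x[n - k]
                       for k, c in enumerate(coeffs, 1) if n >= k)
            for n in range(len(x))]

def synthesize(residual, coeffs):
    """Decoder side: all-pole filter y[n] = e[n] + sum_k a[k] * y[n - k]."""
    y = []
    for n, e in enumerate(residual):
        y.append(e + sum(c * y[n - k]
                         for k, c in enumerate(coeffs, 1) if n >= k))
    return y

# Toy "speech": the impulse response of a known two-pole resonator
# (coefficient values chosen only for illustration).
TRUE_COEFFS = [1.3, -0.6]
signal = synthesize([1.0] + [0.0] * 199, TRUE_COEFFS)

estimated = levinson_durbin(autocorrelation(signal, 2), 2)
residual = whiten(signal, estimated)      # flat-spectrum excitation
rebuilt = synthesize(residual, estimated) # spectral shape re-applied
```

For this toy signal the estimated coefficients match the resonator that generated it, the residual collapses to a single impulse (a flat spectrum), and passing that residual back through the all-pole filter restores the original samples. Real speech is only approximately all-pole, so in practice the residual is not this clean.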
One advantage of this type of filtering is that the location of the linear predictor's spectral peaks is entirely determined by the target signal, and can be as precise as allowed by the time period to be filtered. This is in contrast with vocoders realized using fixed-width filter banks, where spectral peaks can generally only be determined to be within the scope of a given frequency band. LP filtering also has disadvantages in that signals with a large number of constituent frequencies may exceed the number of frequencies that can be represented by the linear prediction filter. This restriction is the primary reason that LP coding is almost always used in tandem with other methods in high-compression voice coders.
The waveform-interpolative (WI) vocoder was developed at AT&T Bell Laboratories around 1995 by W.B. Kleijn, and subsequently a low-complexity version was developed by AT&T for the DoD secure vocoder competition. Notable enhancements to the WI coder were made at the University of California, Santa Barbara. AT&T holds the core patents related to WI, and other institutes hold additional patents.
For musical applications, a source of musical sounds is used as the carrier, instead of extracting the fundamental frequency. For instance, one could use the sound of a synthesizer as the input to the filter bank, a technique that became popular in the 1970s.
Werner Meyer-Eppler, a German scientist with a special interest in electronic voice synthesis, published a thesis in 1948 on electronic music and speech synthesis from the viewpoint of sound synthesis. Later he was instrumental in the founding of the Studio for Electronic Music of WDR in Cologne, in 1951.
In 1968, Bruce Haack built a prototype vocoder, named "Farad" after Michael Faraday. It was first featured on "The Electronic Record For Children" released in 1969 and then on his rock album The Electric Lucifer released in 1970.[media 3]
In 1970, Wendy Carlos and Robert Moog built another musical vocoder, a ten-band device inspired by the vocoder designs of Homer Dudley. It was originally called a spectrum encoder-decoder and later referred to simply as a vocoder. The carrier signal came from a Moog modular synthesizer, and the modulator from a microphone input. The output of the ten-band vocoder was fairly intelligible but relied on specially articulated speech. Some vocoders use a high-pass filter to let some sibilance through from the microphone; this ruins the device for its original speech-coding application, but it makes the talking synthesizer effect much more intelligible.
The 1975 song "The Raven", from the album Tales of Mystery and Imagination by The Alan Parsons Project, features Alan Parsons performing vocals through an EMI vocoder. According to the album's liner notes, "The Raven" was the first rock song to feature a digital vocoder.
Pink Floyd also used a vocoder on three of their albums: first on Animals (1977), for the songs "Sheep" and "Pigs (Three Different Ones)"; then on A Momentary Lapse of Reason (1987), on "A New Machine Part 1" and "A New Machine Part 2"; and finally on The Division Bell (1994), on "Keep Talking".
Vocoders have appeared on pop recordings from time to time, most often simply as a special effect rather than a featured aspect of the work. However, many experimental electronic artists of the new-age music genre use the vocoder in a more comprehensive manner in specific works, such as Jean-Michel Jarre (on Zoolook, 1984) and Mike Oldfield (on QE2, 1980, and Five Miles Out, 1982).
The vocoder module and its use by Oldfield can be clearly seen on his "Live At Montreux 1981" DVD (track "Sheba").
There are also some artists who have made vocoders an essential part of their music, overall or during an extended phase. Examples include the German synthpop group Kraftwerk, the Japanese new wave group Polysics, Stevie Wonder ("Send One Your Love", "A Seed's a Star") and jazz/fusion keyboardist Herbie Hancock during his late 1970s period. In 1982, Neil Young used a Sennheiser Vocoder VSM201 on six of the nine tracks on Trans. Perhaps the most heard, yet often unrecognized, example of the use of a vocoder in popular music is on Michael Jackson's 1982 album Thriller, in the song "P.Y.T. (Pretty Young Thing)". During the first few seconds of the song, the background voicings "ooh-ooh, ooh, ooh", behind his spoken words, exemplify the heavily modulated sound of his voice through a vocoder. The bridge features a vocoder as well ("Pretty young thing/You make me sing"), courtesy of session musician Michael Boddicker.
Coldplay have used a vocoder in some of their songs. For example, in "Major Minus" and "Hurts Like Heaven", both from the album Mylo Xyloto (2011), Chris Martin's vocals are mostly vocoder-processed. "Midnight", from Ghost Stories (2014), also features Martin singing through a vocoder. The hidden track "X Marks The Spot" from A Head Full of Dreams has also been recorded through a vocoder.
Noisecore band Atari Teenage Riot have used vocoders in a variety of their songs and live performances, such as Live at the Brixton Academy (2002), alongside other digital audio technology both old and new.
Among the most consistent users of the vocoder in emulating the human voice are Daft Punk, who have used this instrument from their first album, Homework (1997), to their latest work, Random Access Memories (2013), and consider the convergence of technological and human voice "the identity of their musical project". For instance, the lyrics of "Around the World" (1997) are integrally vocoder-processed, "Get Lucky" (2013) features a mix of natural and processed human voices, and "Instant Crush" (2013) features Julian Casablancas singing into a vocoder.
"Robot voices" became a recurring element in popular music during the 20th century. Apart from vocoders, several other methods produce variations on this effect, including the Sonovox, the talk box, Auto-Tune,[media 4] linear prediction vocoders, speech synthesis,[media 5][media 6] ring modulation and comb filters.
Vocoders are used in television production, filmmaking and games, usually for robots or talking computers. The robot voices of the Cylons in Battlestar Galactica were created with an EMS Vocoder 2000. The 1980 version of the Doctor Who theme, as arranged and recorded by Peter Howell, has a section of the main melody generated by a Roland SVC-350 vocoder. A similar Roland VP-330 vocoder was used to create the voice of Soundwave, a character from the Transformers series.
In 1972, Isao Tomita's first electronic music album Electric Samurai: Switched on Rock was an early attempt at applying speech synthesis techniques in electronic rock and pop music. The album featured electronic renditions of contemporary rock and pop songs, while utilizing synthesized voices in place of human voices. In 1974, he utilized synthesized voices in his popular classical music album Snowflakes are Dancing, which became a worldwide success and helped to popularize electronic music. Emerson, Lake and Palmer used it for the album Brain Salad Surgery (1973).
The Vocoder (Voice Operated reCorDER) and Voder (Voice Operation DEmonstratoR) developed by the research physicist Homer Dudley, ... The Voder was first unveiled in 1939 at the New York World Fair (where it was demonstrated at hourly intervals) and later in 1940 in San Francisco. There were twenty trained operators known as the 'girls' who handled the machine much like a musical instrument such as a piano or an organ, ... This was done by manipulating fourteen keys with the fingers, a bar with the left wrist and a foot pedal with the right foot.
LPC methods are the most widely used in speech coding