Vocoder


A vocoder is a category of speech coding that analyzes and synthesizes the human voice signal for audio data compression, multiplexing, voice encryption or voice transformation.
The vocoder was invented in 1938 by Homer Dudley at Bell Labs as a means of synthesizing human speech. This work was developed into the channel vocoder, which was used as a voice codec for telecommunications, where speech coding conserves bandwidth in transmission.
By encrypting the control signals, voice transmission can be secured against interception. Its primary use in this fashion is for secure radio communication. The advantage of this method of encryption is that none of the original signal is sent, only envelopes of the bandpass filters. The receiving unit needs to be set up in the same filter configuration to re-synthesize a version of the original signal spectrum.
The vocoder has also been used extensively as an electronic musical instrument. The decoder portion of the vocoder, called a voder, can be used independently for speech synthesis.

Theory

The human voice consists of sounds generated by the periodic opening and closing of the glottis by the vocal cords, which produces an acoustic waveform with many harmonics. This initial sound is then filtered by movements in the nose, mouth and throat to produce fluctuations in harmonic content in a controlled way, creating the wide variety of sounds used in speech. There is another set of sounds, known as the unvoiced and plosive sounds, which are created or modified by a variety of sound-generating disruptions of airflow occurring in the vocal tract.
The vocoder analyzes speech by measuring how its spectral energy distribution characteristics fluctuate across time. This analysis results in a set of temporally parallel envelope signals, each representing the individual frequency band amplitudes of the user's speech. Put another way, the voice signal is divided into a number of frequency bands and the level of signal present at each frequency band, occurring simultaneously, is measured by an envelope follower, representing the spectral energy distribution across time. This set of envelope amplitude signals is called the "modulator".
To recreate speech, the vocoder reverses the analysis process: an initial broadband source, such as noise, is passed through the same set of band-pass filters, and the output level of each band is controlled, in real time, by the corresponding analyzed envelope amplitude signal from the modulator.
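As an illustration, the following minimal sketch (in Python, using NumPy and SciPy) implements this analysis/synthesis chain: the modulator is split into band-pass channels, each channel's envelope is followed, and the envelopes then scale the matching bands of a carrier. The band count, band edges, envelope cutoff, and function names are illustrative choices, not values prescribed by any particular vocoder design.

    # Minimal channel-vocoder sketch: analysis (envelope followers per band)
    # and synthesis (carrier bands scaled by those envelopes).
    import numpy as np
    from scipy.signal import butter, sosfilt

    def band_edges(n_bands, lo=100.0, hi=3400.0):
        """Logarithmically spaced band edges between lo and hi (Hz)."""
        return np.geomspace(lo, hi, n_bands + 1)

    def bandpass(x, lo, hi, fs):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        return sosfilt(sos, x)

    def envelope(x, fs, cutoff=50.0):
        """Envelope follower: rectify, then low-pass filter."""
        sos = butter(2, cutoff, btype="lowpass", fs=fs, output="sos")
        return np.maximum(sosfilt(sos, np.abs(x)), 0.0)

    def channel_vocoder(modulator, carrier, fs, n_bands=16):
        """Impose the modulator's per-band envelopes onto the carrier."""
        edges = band_edges(n_bands)
        out = np.zeros_like(carrier)
        for lo, hi in zip(edges[:-1], edges[1:]):
            env = envelope(bandpass(modulator, lo, hi, fs), fs)   # analysis
            out += bandpass(carrier, lo, hi, fs) * env            # synthesis
        return out / np.max(np.abs(out))                          # normalize

    # Example: the classic "robot voice" uses a sawtooth or noise carrier.
    fs = 8000
    t = np.arange(fs) / fs
    speech = np.random.randn(fs)              # stand-in for a recorded voice signal
    carrier = 2 * ((110 * t) % 1.0) - 1.0     # naive sawtooth at 110 Hz
    robot = channel_vocoder(speech, carrier, fs)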
The digital encoding process involves a periodic analysis of each of the modulator's multiband set of filter envelope amplitudes. This analysis results in a set of digital pulse code modulation stream readings. Then the pulse code modulation stream outputs of each band are transmitted to a decoder. The decoder applies the pulse code modulations as control signals to the corresponding amplifiers of the output filter channels.
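A rough sketch of this digital step, continuing the example above under the same assumptions: each band's envelope is sampled at a low frame rate and quantized to a small number of bits for transmission, and the decoder expands the readings back into gain-control signals. The frame rate and bit depth shown are illustrative, not values from any specific standard.

    import numpy as np

    def encode_envelopes(envelopes, fs, frame_rate=50, bits=8):
        """envelopes: array of shape (n_bands, n_samples) from the analysis stage."""
        step = fs // frame_rate                       # samples per frame
        frames = envelopes[:, ::step]                 # periodic readings per band
        peak = float(frames.max()) or 1.0
        codes = np.round(frames / peak * (2**bits - 1)).astype(np.uint8)
        return codes, peak                            # PCM-style codes plus a scale factor

    def decode_envelopes(codes, peak, fs, frame_rate=50, bits=8):
        gains = codes.astype(float) / (2**bits - 1) * peak
        return np.repeat(gains, fs // frame_rate, axis=1)  # hold each reading between frames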
Information about the fundamental frequency of the initial voice signal is discarded; it was not important to preserve this for the vocoder's original use as an encryption aid. It is this dehumanizing aspect of the vocoding process that has made it useful in creating special voice effects in popular music and audio entertainment.
Instead of a point-by-point recreation of the waveform, the vocoder process sends only the parameters of the vocal model over the communication link. Since the parameters change slowly compared to the original speech waveform, the bandwidth required to transmit speech can be reduced. This allows more speech channels to utilize a given communication channel, such as a radio channel or a submarine cable.
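A back-of-the-envelope comparison illustrates the saving. The 16 bands, 50 frames per second, and 8 bits per envelope value below are assumed figures for illustration only.

    waveform_rate = 8_000 * 8          # 8 kHz sampling x 8 bits/sample = 64,000 bit/s
    parameter_rate = 16 * 50 * 8       # bands x frames/s x bits/value  =  6,400 bit/s
    print(waveform_rate // parameter_rate)  # -> 10, an order-of-magnitude saving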
Analog vocoders typically analyze an incoming signal by splitting the signal into multiple tuned frequency bands or ranges. To reconstruct the signal, a carrier signal is sent through a series of these tuned band-pass filters. In the example of a typical robot voice, the carrier is noise or a sawtooth waveform. There are usually between 8 and 20 bands.
The amplitude of the modulator for each of the individual analysis bands generates a voltage that is used to control amplifiers for each of the corresponding carrier bands. The result is that frequency components of the modulating signal are mapped onto the carrier signal as discrete amplitude changes in each of the frequency bands.
Often there is an unvoiced band or sibilance channel. This is for frequencies that are outside the analysis bands for typical speech but are still important in speech. Examples are words that start with the letters s, f, ch or any other sibilant sound. Using this band produces recognizable speech, although somewhat mechanical sounding. Vocoders often include a second system for generating unvoiced sounds, using a noise generator instead of the fundamental frequency. This is mixed with the carrier output to increase clarity.
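A rough sketch of such an unvoiced/sibilance decision, continuing the Python example under assumed parameters: frames whose energy is concentrated above a split frequency are treated as unvoiced, and a noise carrier is mixed in for them. The split frequency, frame length, and threshold are illustrative assumptions.

    import numpy as np
    from scipy.signal import butter, sosfilt

    def unvoiced_mask(speech, fs, frame=256, split_hz=3000.0, threshold=0.5):
        """Mark frames as unvoiced when most of their energy lies above split_hz."""
        sos = butter(4, split_hz, btype="highpass", fs=fs, output="sos")
        high = sosfilt(sos, speech)
        mask = np.zeros(len(speech))
        for start in range(0, len(speech) - frame, frame):
            seg, hseg = speech[start:start + frame], high[start:start + frame]
            ratio = np.sum(hseg**2) / (np.sum(seg**2) + 1e-12)
            mask[start:start + frame] = float(ratio > threshold)
        return mask  # 1.0 where the frame is treated as unvoiced

    def carrier_with_noise(tone_carrier, speech, fs):
        """Crossfade between the pitched carrier and hiss, per frame."""
        noise = np.random.randn(len(tone_carrier))
        m = unvoiced_mask(speech, fs)
        return (1.0 - m) * tone_carrier + m * noise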
In the channel vocoder algorithm, only the amplitude component of each band's analytic signal is used; simply ignoring the phase component tends to result in an unclear voice. For methods of rectifying this, see phase vocoder.

History

The development of a vocoder was started in 1928 by Bell Labs engineer Homer Dudley, who was granted patents for it on November 16, 1937, and March 21, 1939.
To demonstrate the speech synthesis ability of its decoder section, the voder was introduced to the public at the AT&T building at the 1939–1940 New York World's Fair. The voder consisted of an electronic oscillator (a sound source of pitched tone) and a noise generator (for hiss), 10 bands of resonator filters, each controlled by a variable-gain amplifier, acting as a vocal tract, and manual controllers: a set of pressure-sensitive keys for filter control and a foot pedal for pitch control of the tone. The filters, controlled by the keys, converted the tone and the hiss into vowels, consonants, and inflections. This was a complex machine to operate, but a skilled operator could produce recognizable speech.
Dudley's vocoder was used in the SIGSALY system, which was built by Bell Labs engineers in 1943. SIGSALY was used for encrypted voice communications during World War II. The KO-6 voice coder was released in 1949 in limited quantities; it was a close approximation to SIGSALY. In 1953, the KY-9 THESEUS voice coder used solid-state logic to reduce the weight to a fraction of SIGSALY's, and in 1961 the HY-2 voice coder, a 16-channel system, became the last implementation of a channel vocoder in a secure speech system.
Later work in this field has used digital speech coding. The most widely used speech coding technique is linear predictive coding. Another speech coding technique, adaptive differential pulse-code modulation (ADPCM), was developed by P. Cummiskey, Nikil S. Jayant and James L. Flanagan at Bell Labs in 1973.

Applications

  • Terminal equipment for systems based on digital mobile radio
  • Digital voice scrambling and encryption
  • Cochlear implants: noise and tone vocoding is used to simulate the effects of cochlear implants.
  • Musical and other artistic effects

Modern implementations

Even with the need to record several frequencies and additional unvoiced sounds, the compression of vocoder systems is impressive. Standard speech-recording systems capture frequencies from about 500 to 3,400 Hz, where most of the frequencies used in speech lie, typically using a sampling rate of 8 kHz. The sampling resolution is typically 8 or more bits per sample, giving a data rate of at least 64 kbit/s (8,000 samples per second times 8 bits per sample), but a good vocoder can provide a reasonably good simulation of voice with only a few kilobits per second of data.
Toll-quality voice coders, such as ITU G.729, are used in many telephone networks. G.729 in particular has a final data rate of 8 kbit/s with superb voice quality. G.723 achieves slightly worse quality at data rates of 6.3 and 5.3 kbit/s. Many vocoder systems use even lower data rates, but as the rate drops further, voice quality begins to degrade rapidly.
Several vocoder systems are used in NSA encryption systems.
Modern vocoders used in communication equipment and in voice storage devices are based on specialized speech coding algorithms.
Vocoders are also currently used in psychophysics, linguistics, computational neuroscience and cochlear implant research.

Linear prediction-based

Since the late 1970s, most non-musical vocoders have been implemented using linear prediction, whereby the target signal's spectral envelope is estimated by an all-pole IIR filter. In linear prediction coding, the all-pole filter replaces the bandpass filter bank of its predecessor and is used at the encoder to whiten the signal and again at the decoder to re-apply the spectral shape of the target speech signal.
One advantage of this type of filtering is that the location of the linear predictor's spectral peaks is entirely determined by the target signal and can be as precise as allowed by the time period to be filtered. This is in contrast with vocoders realized using fixed-width filter banks, where the location of spectral peaks is constrained by the available fixed frequency bands. LP filtering also has disadvantages in that signals with a large number of constituent frequencies may exceed the number of frequencies that can be represented by the linear prediction filter. This restriction is the primary reason that LP coding is almost always used in tandem with other methods in high-compression voice coders.
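The following compact sketch (Python, NumPy/SciPy) illustrates the linear-prediction idea described above: all-pole coefficients are estimated for one frame by the autocorrelation method with the Levinson-Durbin recursion, the frame is whitened at the encoder with the resulting prediction-error filter, and the decoder re-applies the spectral envelope with the inverse all-pole filter. The frame length and prediction order are illustrative; a real coder would also quantize the coefficients and model the excitation rather than transmit the residual directly.

    import numpy as np
    from scipy.signal import lfilter

    def lpc(frame, order=10):
        """All-pole coefficients a = [1, a1, ..., a_order] via Levinson-Durbin."""
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
        a = np.zeros(order + 1)
        a[0] = 1.0
        err = r[0] + 1e-12
        for i in range(1, order + 1):
            acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
            k = -acc / err                              # reflection coefficient
            a[1:i + 1] = a[1:i + 1] + k * a[i - 1::-1]  # order-update of the predictor
            err *= 1.0 - k * k
        return a

    def encode_frame(frame, order=10):
        a = lpc(frame, order)
        residual = lfilter(a, [1.0], frame)   # whiten the frame at the encoder
        return a, residual

    def decode_frame(a, excitation):
        return lfilter([1.0], a, excitation)  # re-apply the spectral envelope

    # Round trip on a synthetic frame; resynthesis matches the input up to
    # numerical error because the two filters are exact inverses.
    frame = np.random.randn(240)
    a, res = encode_frame(frame)
    resynth = decode_frame(a, res)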