Speech perception


Speech perception is the process by which the sounds of language are heard, interpreted, and understood. The study of speech perception is closely linked to the fields of phonology and phonetics in linguistics and to cognitive psychology and perception in psychology. Research in speech perception seeks to understand how human listeners recognize speech sounds and use this information to understand spoken language. Speech perception research has applications in building computer systems that can recognize speech, in improving speech recognition for hearing- and language-impaired listeners, and in foreign-language teaching.
The process of perceiving speech begins at the level of the sound signal and the process of audition. After processing the initial auditory signal, speech sounds are further processed to extract acoustic cues and phonetic information. This speech information can then be used for higher-level language processes, such as word recognition.

Acoustic cues

Acoustic cues are sensory cues contained in the speech sound signal which are used in speech perception to differentiate speech sounds belonging to different phonetic categories. For example, one of the most studied cues in speech is voice onset time or VOT. VOT is a primary cue signaling the difference between voiced and voiceless plosives, such as "b" and "p". Other cues differentiate sounds that are produced at different places of articulation or manners of articulation. The speech system must also combine these cues to determine the category of a specific speech sound. This is often thought of in terms of abstract representations of phonemes. These representations can then be combined for use in word recognition and other language processes.
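As a rough illustration of how a single cue such as VOT can separate two categories, the sketch below classifies bilabial plosives by comparing VOT to a fixed threshold; the 25 ms boundary and the token values are illustrative assumptions, not measurements from any particular study.

```python
# Illustrative sketch: labelling bilabial plosives as /b/ or /p/ from VOT alone.
# The 25 ms boundary is an assumed, typical English value, not a measured constant.

VOT_BOUNDARY_MS = 25.0  # assumed category boundary for English bilabials

def classify_plosive(vot_ms: float) -> str:
    """Return '/b/' (voiced) or '/p/' (voiceless) based on a single VOT cue."""
    return "/b/" if vot_ms < VOT_BOUNDARY_MS else "/p/"

# Hypothetical VOT measurements (in milliseconds) for a few tokens.
for vot in [-60.0, 5.0, 20.0, 40.0, 80.0]:
    print(f"VOT = {vot:6.1f} ms -> perceived as {classify_plosive(vot)}")
```

In reality, of course, listeners weigh several cues together rather than a single threshold, which is exactly the difficulty discussed next.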
It is not easy to identify what acoustic cues listeners are sensitive to when perceiving a particular speech sound:
At first glance, the solution to the problem of how we perceive speech seems deceptively simple. If one could identify stretches of the acoustic waveform that correspond to units of perception, then the path from sound to meaning would be clear. However, this correspondence or mapping has proven extremely difficult to find, even after some forty-five years of research on the problem.

If a specific aspect of the acoustic waveform indicated one linguistic unit, a series of tests using speech synthesizers would be sufficient to determine such a cue or cues. However, there are two significant obstacles:
  1. One acoustic aspect of the speech signal may cue different linguistically relevant dimensions. For example, the duration of a vowel in English can indicate whether or not the vowel is stressed, or whether it is in a syllable closed by a voiced or a voiceless consonant, and in some cases it can distinguish the identity of vowels. Some experts even argue that duration can help distinguish what are traditionally called short and long vowels in English.
  2. One linguistic unit can be cued by several acoustic properties. For example, in a classic experiment, Alvin Liberman showed that the onset formant transitions of /d/ differ depending on the following vowel, but they are all interpreted as the phoneme /d/ by listeners. Both situations are illustrated in the sketch after this list.
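
The two situations above, one cue signalling several dimensions and one unit being signalled by several cue variants, can be sketched as a toy data structure; the entries simply restate the examples from the list and are not an exhaustive inventory.

```python
# Toy sketch of the many-to-many mapping between acoustic cues and
# linguistic dimensions; entries are illustrative, not exhaustive.

# (1) One acoustic cue can signal several linguistic dimensions.
cue_to_dimensions = {
    "vowel duration": [
        "stress of the syllable",
        "voicing of the following (coda) consonant",
        "vowel identity (traditional 'short' vs. 'long' vowels)",
    ],
}

# (2) One linguistic unit can be cued by several acoustic variants.
phoneme_to_cues = {
    "/d/": [
        "onset formant transitions before /i/",
        "onset formant transitions before /a/",
        "onset formant transitions before /u/",
    ],
}

for cue, dims in cue_to_dimensions.items():
    print(f"{cue} can signal: {', '.join(dims)}")
for phoneme, cues in phoneme_to_cues.items():
    print(f"{phoneme} can be cued by: {', '.join(cues)}")
```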

Linearity and the segmentation problem

Although listeners perceive speech as a stream of discrete units, this linearity is difficult to see in the physical speech signal. Speech sounds do not strictly follow one another; rather, they overlap. A speech sound is influenced by the ones that precede it and the ones that follow it. This influence can even be exerted at a distance of two or more segments.
Because the speech signal is not linear, there is a problem of segmentation. It is difficult to delimit a stretch of speech signal as belonging to a single perceptual unit. As an example, the acoustic properties of the phoneme /d/ will depend on the production of the following vowel.

Lack of invariance

The research and application of speech perception must deal with several problems which result from what has been termed the lack of invariance. Reliable constant relations between a phoneme of a language and its acoustic manifestation in speech are difficult to find. There are several reasons for this:

Context-induced variation

Phonetic environment affects the acoustic properties of speech sounds. For example, /u/ in English is fronted when surrounded by coronal consonants. Similarly, the voice onset time values marking the boundary between voiced and voiceless plosives differ for labial, alveolar, and velar plosives, and they shift under stress or depending on the position within a syllable.

Variation due to differing speech conditions

One important factor that causes variation is differing speech rate. Many phonemic contrasts are constituted by temporal characteristics, and they are therefore affected by changes in speaking tempo. Another major source of variation is articulatory care versus sloppiness, which is typical of connected speech.

Variation due to different speaker identity

The resulting acoustic structure of concrete speech productions depends on the physical and psychological properties of individual speakers. Men, women, and children generally produce voices with different pitch. Because speakers have vocal tracts of different sizes, the resonant frequencies (formants), which are important for recognition of speech sounds, vary in their absolute values across individuals. Research shows that infants at the age of 7.5 months cannot recognize information presented by speakers of different genders; however, by the age of 10.5 months, they can detect the similarities. Dialect and foreign accent can also cause variation, as can the social characteristics of the speaker and listener.

Perceptual constancy and normalization

Despite the great variety of different speakers and different conditions, listeners perceive vowels and consonants as constant categories. It has been proposed that this is achieved by means of the perceptual normalization process in which listeners filter out the noise to arrive at the underlying category. Vocal-tract-size differences result in formant-frequency variation across speakers; therefore a listener has to adjust his/her perceptual system to the acoustic characteristics of a particular speaker. This may be accomplished by considering the ratios of formants rather than their absolute values. This process has been called vocal tract normalization. Similarly, listeners are believed to adjust the perception of duration to the current tempo of the speech they are listening to – this has been referred to as speech rate normalization.
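A minimal sketch of formant-ratio (vocal tract) normalization follows, assuming hypothetical F1 and F2 values for two speakers; the point is only that ratios remain comparable across vocal-tract sizes even when absolute frequencies differ.

```python
# Minimal sketch of vocal-tract normalization via formant ratios.
# The formant values below are hypothetical, chosen only to illustrate that
# F2/F1 ratios are more comparable across speakers than absolute frequencies.

def formant_ratio(f1_hz: float, f2_hz: float) -> float:
    """Return the F2/F1 ratio, a roughly speaker-independent vowel description."""
    return f2_hz / f1_hz

# Hypothetical productions of the "same" vowel by two speakers with
# different vocal-tract sizes: absolute formants differ, ratios are close.
adult = (700.0, 1100.0)   # (F1, F2) in Hz
child = (850.0, 1350.0)   # (F1, F2) in Hz

print("adult F1, F2:", adult, "-> ratio:", round(formant_ratio(*adult), 2))
print("child F1, F2:", child, "-> ratio:", round(formant_ratio(*child), 2))
```

A rate-normalization sketch would be analogous, scaling measured durations by an estimate of the current speaking tempo instead of scaling frequencies.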
Whether or not normalization actually takes place, and what its exact nature is, remains a matter of theoretical controversy. Perceptual constancy is not a phenomenon specific to speech perception; it exists in other types of perception as well.

Categorical perception

Categorical perception is involved in processes of perceptual differentiation. People perceive speech sounds categorically, that is to say, they are more likely to notice the differences between categories than within categories. The perceptual space between categories is therefore warped, the centers of categories working like a sieve or like magnets for incoming speech sounds.
In an artificial continuum between a voiceless and a voiced bilabial plosive, each new step differs from the preceding one in the amount of VOT. The first sound is pre-voiced, i.e. it has a negative VOT. Then, as the VOT increases, it reaches zero, i.e. the plosive is a plain unaspirated voiceless stop. Gradually, adding the same amount of VOT at each step, the plosive eventually becomes a strongly aspirated voiceless bilabial. In this continuum of, for example, seven sounds, native English listeners will identify the first three sounds as /b/ and the last three sounds as /p/, with a clear boundary between the two categories. A two-alternative identification test will yield a discontinuous categorization function.
In tests of the ability to discriminate between two sounds with varying VOT values but having a constant VOT distance from each other, listeners are likely to perform at chance level if both sounds fall within the same category and at nearly 100% level if each sound falls in a different category.
The conclusion to draw from both the identification and the discrimination tests is that listeners have different sensitivity to the same relative increase in VOT depending on whether or not the boundary between categories was crossed. Similar perceptual adjustment is attested for other acoustic cues as well.
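The identification and discrimination patterns described above can be summarized in a small simulation; the logistic identification function, the assumed 25 ms boundary, and the slope are illustrative choices rather than parameters from any reported experiment.

```python
# Sketch of categorical perception along a 7-step VOT continuum.
# Identification is modelled as a steep logistic function of VOT;
# the 25 ms boundary and the slope are assumed values for illustration.

import math

BOUNDARY_MS = 25.0   # assumed /b/-/p/ boundary
STEEPNESS = 0.5      # assumed slope; larger values mean sharper categories

def p_voiceless(vot_ms: float) -> float:
    """Probability of labelling a token as /p/ in a two-alternative test."""
    return 1.0 / (1.0 + math.exp(-STEEPNESS * (vot_ms - BOUNDARY_MS)))

def discriminability(vot_a: float, vot_b: float) -> float:
    """Crude proxy for discrimination: two tokens are easier to tell apart
    when they are more likely to receive different labels."""
    return abs(p_voiceless(vot_a) - p_voiceless(vot_b))

continuum = [-20, 0, 20, 40, 60, 80, 100]  # VOT in ms, equal 20 ms steps
for vot in continuum:
    label = "/p/" if p_voiceless(vot) > 0.5 else "/b/"
    print(f"VOT {vot:4d} ms: P(/p/) = {p_voiceless(vot):.2f} -> {label}")

# The same 20 ms physical difference yields very different discriminability:
print("within the /b/ category (0 vs 20 ms):", round(discriminability(0, 20), 2))
print("across the boundary (20 vs 40 ms):   ", round(discriminability(20, 40), 2))
```

In this toy model the identification curve is nearly flat within each category and jumps at the boundary, reproducing the warped perceptual space described above.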

Top-down influences

In a classic experiment, Richard M. Warren replaced one phoneme of a word with a cough-like sound. Perceptually, his subjects restored the missing speech sound without any difficulty and could not accurately identify which phoneme had been disturbed, a phenomenon known as the phonemic restoration effect. Therefore, the process of speech perception is not necessarily uni-directional.
Another basic experiment compared recognition of naturally spoken words within a phrase versus the same words in isolation, finding that perception accuracy usually drops in the latter condition. To probe the influence of semantic knowledge on perception, Garnes and Bond similarly used carrier sentences where target words differed only in a single phoneme whose quality changed along a continuum. When put into different sentences that each naturally led to one interpretation, listeners tended to judge ambiguous words according to the meaning of the whole sentence. That is, higher-level language processes connected with morphology, syntax, or semantics may interact with basic speech perception processes to aid in recognition of speech sounds.
It may be the case that it is not necessary, and perhaps not even possible, for a listener to recognize phonemes before recognizing higher units, such as words. After obtaining at least a fundamental piece of information about the phonemic structure of the perceived entity from the acoustic signal, listeners can compensate for missing or noise-masked phonemes using their knowledge of the spoken language. Compensatory mechanisms might even operate at the sentence level, such as in learned songs, phrases, and verses, an effect backed up by neural coding patterns consistent with the missed continuous speech fragments despite the lack of all relevant bottom-up sensory input.