McClellan, S., Gibson, J.D., Ephraim, Y., Fussell, J.W., Wilcox, L.D., Bush, M.A., Gao, Y., Ramabhadran, B., Picheny, M. "Speech Signal Processing." The Electrical Engineering Handbook, Ed. Richard C. Dorf, Boca Raton: CRC Press LLC, 2000.

15 Speech Signal Processing

15.1 Coding, Transmission, and Storage
    General Approaches • Model Adaptation • Analysis-by-Synthesis • Particular Implementations • Speech Quality and Intelligibility • Standardization • Variable Rate Coding • Summary and Conclusions

15.2 Speech Enhancement and Noise Reduction
    Models and Performance Measures • Signal Estimation • Source Coding • Signal Classification • Comments

15.3 Analysis and Synthesis
    Analysis of Excitation • Fourier Analysis • Linear Predictive Analysis • Homomorphic (Cepstral) Analysis • Speech Synthesis

15.4 Speech Recognition
    Speech Recognition System Architecture • Signal Pre-Processing • Dynamic Time Warping • Hidden Markov Models • State-of-the-Art Recognition Systems

15.5 Large Vocabulary Continuous Speech Recognition
    Overview of a Speech Recognition System • Hidden Markov Models As Acoustic Models for Speech Recognition • Speaker Adaptation • Modeling Context in Continuous Speech • Language Modeling • Hypothesis Search • State-of-the-Art Systems • Challenges in Speech Recognition • Applications

15.1 Coding, Transmission, and Storage

Stan McClellan and Jerry D. Gibson

Interest in speech coding is motivated by a wide range of applications, including commercial telephony, digital cellular mobile radio, military communications, voice mail, speech storage, and future personal communications networks. The goal of speech coding is to represent speech in digital form with as few bits as possible while maintaining the intelligibility and quality required for the particular application.
At higher bit rates, such as 64 and 32 kbits/s, achieving good quality and intelligibility is not too difficult, but as the desired bit rate is lowered to 16 kbits/s and below, the problem becomes increasingly challenging. Depending on the application, many difficult constraints must be considered, including the issue of complexity. For example, for the 32-kbits/s speech coding standard, the ITU-T¹ not only required highly intelligible, high-quality speech, but the coder also had to have low delay, withstand independent bit error rates up to 10⁻², have acceptable performance degradation for several synchronous or asynchronous tandem connections, and pass some voiceband modem signals. Other applications may have different criteria. Digital cellular mobile radio in the U.S. has no low-delay or voiceband modem signal requirements, but the speech data rates required are under 8 kbits/s and the transmission medium (or channel) can be very noisy and have relatively long fades. These considerations affect the speech coder chosen for a particular application.

As speech coder data rates drop to 16 kbits/s and below, perceptual criteria taking into account human auditory response begin to play a prominent role. For time-domain coders, the perceptual effects are incorporated using a frequency-weighted error criterion. Frequency-domain coders include perceptual effects by allocating bits among spectral components according to their perceptual importance.

¹International Telecommunications Union, Telecommunications Standardization Sector, formerly the CCITT.

Contributors: Stan McClellan, University of Alabama at Birmingham; Jerry D. Gibson, Texas A&M University; Yariv Ephraim, AT&T Bell Laboratories and George Mason University; Jesse W. Fussell, Department of Defense; Lynn D. Wilcox, FX Palo Alto Lab; Marcia A. Bush, Xerox Palo Alto Research Center; Yuqing Gao, IBM T.J. Watson Research Center; Bhuvana Ramabhadran, IBM T.J. Watson Research Center; Michael Picheny, IBM T.J. Watson Research Center.
The focus of this article is the contrast among the three most important classes of speech coders that have representative implementations in several international standards: time-domain coders, frequency-domain coders, and hybrid coders. In the following, we define these classifications, look specifically at the important characteristics of representative, general implementations of each class, and briefly discuss the rapidly changing national and international standardization efforts related to speech coding.

General Approaches

Time Domain Coders and Linear Prediction

Linear Predictive Coding (LPC) is a modeling technique that has seen widespread application among time-domain speech coders, largely because it is computationally simple and applicable to the mechanisms involved in speech production. In LPC, general spectral characteristics are described by a parametric model based on estimates of autocorrelations or autocovariances. The model of choice for speech is the all-pole or autoregressive (AR) model. This model is particularly suited for voiced speech because the vocal tract can be well modeled by an all-pole transfer function. In this case, the estimated LPC model parameters correspond to an AR process which can produce waveforms very similar to the original speech segment. Differential Pulse Code Modulation (DPCM) coders (e.g., ITU-T G.721 ADPCM [CCITT, 1984]) and LPC vocoders (e.g., U.S. Federal Standard 1015 [National Communications System, 1984]) are examples of this class of time-domain predictive architecture. Code-excited coders (e.g., ITU-T G.728 [Chen, 1990] and U.S. Federal Standard 1016 [National Communications System, 1991]) also utilize LPC spectral modeling techniques.¹

Based on the general spectral model, a predictive coder formulates an estimate of a future sample of speech based on a weighted combination of the immediately preceding samples.
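As a concrete illustration of the autocorrelation-based AR modeling described above, the sketch below estimates LPC coefficients for one frame with the Levinson-Durbin recursion. This is a minimal pure-Python sketch; the helper names are our own, and a production coder would window each frame and guard against degenerate (all-zero) input.

```python
# Sketch: all-pole (AR) parameter estimation for one speech frame using
# the autocorrelation method and the Levinson-Durbin recursion.
# Hypothetical helper names; assumes a nondegenerate (nonzero) frame.

def autocorr(frame, order):
    """Biased autocorrelation estimates r[0..order]."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations; return ([a_1..a_N], error power)."""
    a = [0.0] * (order + 1)
    err = r[0]                             # zeroth-order prediction error power
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - i] for i in range(1, m))
        k = acc / err                      # reflection coefficient for stage m
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= (1.0 - k * k)               # error power shrinks at each stage
    return a[1:], err
```

For a decaying exponential frame s(n) = 0.9ⁿ, a first-order fit recovers a coefficient near 0.9, as expected for an AR(1)-like signal.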
The error in this estimate (the prediction residual) typically comprises a significant portion of the data stream of the encoded speech. The residual contains information that is important in speech perception and cannot be modeled in a straightforward fashion. The most familiar form of predictive coder is the classical Differential Pulse Code Modulation (DPCM) system shown in Fig. 15.1. In DPCM, the predicted value at time instant k, ŝ(k|k−1), is subtracted from the input signal at time k, s(k), to produce the prediction error signal e(k). The prediction error is then approximated (quantized) and the quantized prediction error, e_q(k), is coded (represented as a binary number) for transmission to the receiver. Simultaneously with the coding, e_q(k) is summed with ŝ(k|k−1) to yield a reconstructed version of the input sample, ŝ(k). Assuming no channel errors, an identical reconstruction, distorted only by the effects of quantization, is accomplished at the receiver. At both the transmitter and receiver, the predicted value at time instant k+1 is derived using reconstructed values up through time k, and the procedure is repeated.

FIGURE 15.1  Differential encoder transmitter with a pole-zero predictor.

¹However, codebook excitation is generally described as a hybrid coding technique.

The first DPCM systems had B̂(z) = 0 and Â(z) = Σ_{i=1}^{N} a_i z^{−i}, where {a_i, i = 1…N} are the LPC coefficients and z^{−1} represents unit delay, so that the predicted value was a weighted linear combination of previous reconstructed values, or

    ŝ(k|k−1) = Σ_{i=1}^{N} a_i ŝ(k−i).    (15.1)

Later work showed that letting B̂(z) = Σ_{j=1}^{M} b_j z^{−j} improves the perceived quality of the reconstructed speech¹ by shaping the spectrum of the quantization noise to match the speech spectrum, as well as improving noisy-channel performance [Gibson, 1984].
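The DPCM transmitter loop just described can be sketched in a few lines. This assumes B̂(z) = 0, a fixed first-order predictor, and a uniform scalar quantizer; these are illustrative choices only, not those of any standard coder.

```python
# Sketch of the Fig. 15.1 DPCM transmitter loop with B(z) = 0:
# predict from the reconstructed past, quantize the prediction error,
# and reconstruct exactly as the receiver will.

def uniform_quantize(x, step=0.1):
    """Round-to-nearest uniform scalar quantizer (illustrative)."""
    return step * round(x / step)

def dpcm_encode(samples, a1=0.9, step=0.1):
    recon_prev = 0.0
    codes, recon = [], []
    for s in samples:
        pred = a1 * recon_prev          # s_hat(k|k-1), from reconstructed past
        e = s - pred                    # prediction error e(k)
        eq = uniform_quantize(e, step)  # quantized error e_q(k), transmitted
        r = pred + eq                   # reconstructed sample s_hat(k)
        codes.append(eq)
        recon.append(r)
        recon_prev = r
    return codes, recon
```

Because the predictor runs on reconstructed values, the receiver (given only e_q(k)) reproduces the same ŝ(k), and the per-sample reconstruction error is bounded by half the quantizer step.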
To produce high-quality, highly intelligible speech, it is necessary that the quantizer and predictor parameters be adaptive to compensate for nonstationarities in the speech waveform.

Frequency Domain Coders

Coders that rely on spectral decomposition often use the usual set of sinusoidal basis functions from signal theory to represent the specific short-time spectral content of a segment of speech. In this case, the approximated signal consists of a linear combination of sinusoids with specified amplitudes and arguments (frequency, phase). For compactness, a countable subset of harmonically related sinusoids may be used. The two most prominent types of frequency domain coders are subband coders and multi-band coders.

Subband coders digitally filter the speech into nonoverlapping (as nearly as possible) frequency bands. After filtering, each band is decimated (effectively sampled at a lower rate) and coded separately using PCM, DPCM, or some other method. At the receiver, the bands are decoded, upsampled, and summed to reconstruct the speech. By allocating a different number of bits per sample to the subbands, the perceptually more important frequency bands can be coded with greater accuracy. The design and implementation of subband coders and the speech quality produced have been greatly improved by the development of digital filters called quadrature mirror filters (QMFs) [Johnston, 1980] and polyphase filters. These filters allow subband overlap at the encoder, which causes aliasing, but the reconstruction filters at the receiver can be chosen to eliminate the aliasing if quantization errors are small.

Multi-band coders perform a similar function by characterizing the contributions of individual sinusoidal components to the short-term speech spectrum. These parameters are then quantized, coded, transmitted, and used to configure a bank of tuned oscillators at the receiver.
Outputs of the oscillators are mixed in proportion to the distribution of spectral energy present in the original waveform. An important requirement of multi-band coders is a capability to precisely determine perceptually significant spectral components and track the evolution of their energy and phase. Recent developments related to multi-band coding emphasize the use of harmonically related components with carefully intermixed spectral regions of bandlimited white noise. Sinusoidal Transform Coders (STC) and Multi-Band Excitation coders (MBE) are examples of this type of frequency domain coder.

Model Adaptation

Adaptation algorithms for coder predictor or quantizer parameters can be loosely grouped based on the signals that are used as the basis for adaptation. Generally, forward adaptive coder elements analyze the input speech (or a filtered version of it) to characterize predictor coefficients, spectral components, or quantizer parameters in a blockwise fashion. Backward adaptive coder elements analyze a reconstructed signal, which contains quantization noise, to adjust coder parameters in a sequential fashion. Forward adaptive coder elements can produce a more efficient model of speech signal characteristics, but introduce delay into the coder's operation due to buffering of the signal. Backward adaptive coder elements do not introduce delay, but produce signal models that have lower fidelity with respect to the original speech due to the dependence on the noisy reconstructed signal. Most low-rate coders rely on some form of forward adaptation. This requires moderate to high delay in processing for accuracy of parameter estimation (autocorrelations/autocovariances for LPC-based coders, sinusoidal resolution for frequency-domain coders). The allowance of significant delay for many coder architectures has enabled a spectrally matched pre- or post-processing step to reduce apparent quantization noise and provide significant perceptual improvements.
Perceptual enhancements combined with analysis-by-synthesis optimization, and enabled by recent advances in high-power computing architectures such as digital signal processors, have tremendously improved speech coding results at medium and low rates.

¹In this case, the predicted value is ŝ(k|k−1) = Σ_{i=1}^{N} a_i ŝ(k−i) + Σ_{j=1}^{M} b_j e_q(k−j).

Analysis-by-Synthesis

A significant drawback to traditional "instantaneous" coding approaches such as DPCM lies in the perceptual or subjective relevance of the distortion measure and the signals to which it is applied. Thus, the advent of analysis-by-synthesis coding techniques poses an important milestone in the evolution of medium- to low-rate speech coding. An analysis-by-synthesis coder chooses the coder excitation by minimizing distortion between the original signal and the set of synthetic signals produced by every possible codebook excitation sequence. In contrast, time-domain predictive coders must produce an estimated prediction residual (innovations sequence) to drive the spectral shaping filter(s) of the LPC model, and the classical DPCM approach is to quantize the residual sequence directly using scalar or vector quantizers. The incorporation of frequency-weighted distortion in the optimization of analysis-by-synthesis coders is significant in that it de-emphasizes (increases the tolerance for) quantization noise surrounding spectral peaks. This effect is perceptually transparent since the ear is less sensitive to error around frequencies having higher energy [Atal and Schroeder, 1979]. This approach has resulted in significant improvements in low-rate coder performance, and recent increases in processor speed and power are crucial enabling techniques for these applications.
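A common realization of such a frequency-weighted error criterion (used in many LPC-based analysis-by-synthesis coders, though the details vary by standard) is a weighting filter W(z) = A(z)/A(z/γ), where A(z) = 1 − Σ a_i z^{−i} is the prediction-error filter and 0 < γ < 1 widens the formant regions so that noise there is penalized less. The function name and the value γ = 0.8 below are illustrative.

```python
# Sketch: filter an error signal through W(z) = A(z) / A(z/gamma),
# where A(z) = 1 - sum_i a[i] * z^-(i+1). With gamma < 1 the filter
# de-emphasizes error energy near spectral peaks (formants).

def perceptual_weight(err, a, gamma=0.8):
    n = len(a)
    ag = [a[i] * gamma ** (i + 1) for i in range(n)]  # coeffs of A(z/gamma)
    x_hist = [0.0] * n   # past inputs  (for the A(z) numerator)
    y_hist = [0.0] * n   # past outputs (for the 1/A(z/gamma) denominator)
    out = []
    for x in err:
        y = (x - sum(a[i] * x_hist[i] for i in range(n))
               + sum(ag[i] * y_hist[i] for i in range(n)))
        x_hist = [x] + x_hist[:-1]
        y_hist = [y] + y_hist[:-1]
        out.append(y)
    return out
```

At γ = 1 the filter reduces to W(z) = 1 (no weighting); at γ = 0 it reduces to the pure prediction-error filter A(z).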
Analysis-by-synthesis coders based on linear prediction are generally described as hybrid coders since they fall between waveform coders and vocoders.

Particular Implementations

Currently, three coder architectures dominate the fields of medium- and low-rate speech coding:

• Code-Excited Linear Prediction (CELP): an LPC-based technique which optimizes a vector of excitation samples (and/or pitch filter and lag parameters) using analysis-by-synthesis.
• Multi-Band Excitation (MBE): a direct spectral estimation technique which optimizes the spectral reconstruction error over a set of subbands using analysis-by-synthesis.
• Mixed-Excitation Linear Prediction (MELP): an optimized version of the traditional LPC vocoder which includes an explicit multiband model of the excitation signal.

Several realizations of these approaches have been adopted nationally and internationally as standard speech coding architectures at rates below 16 kbits/s (e.g., G.728, IMBE, U.S. Federal Standard 1016, etc.). The success of these implementations is due to LPC-based analysis-by-synthesis with a perceptual distortion criterion or short-time frequency-domain modeling of a speech waveform or LPC residual. Additionally, the coders that operate at lower rates all benefit from forward adaptation methods which produce efficient, accurate parameter estimates.

CELP

The general CELP architecture is described as a blockwise analysis-by-synthesis selection of an LPC excitation sequence. In low-rate CELP coders, a forward-adaptive linear predictive analysis is performed at 20 to 30 msec intervals. The gross spectral characterization is used to reconstruct, via linear prediction, candidate speech segments derived from a constrained set of plausible filter excitations (the "codebook"). The excitation vector that produces the synthetic speech segment with the smallest perceptually weighted distortion (with respect to the original speech) is chosen for transmission.
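The blockwise codebook search just described can be sketched as an exhaustive synthesize-and-compare loop. For brevity this toy version omits the excitation gain, the long-term (pitch) predictor, and the perceptual weighting filter, and uses a plain squared-error criterion.

```python
# Sketch of analysis-by-synthesis codebook selection: run every candidate
# excitation vector through the LPC synthesis filter and keep the index
# whose synthetic segment is closest to the target speech segment.

def lpc_synth(excitation, a):
    """All-pole synthesis: s(k) = e(k) + sum_i a[i] * s(k-1-i)."""
    hist = [0.0] * len(a)
    out = []
    for e in excitation:
        s = e + sum(a[i] * hist[i] for i in range(len(a)))
        hist = [s] + hist[:-1]
        out.append(s)
    return out

def codebook_search(target, codebook, a):
    best_idx, best_err = None, float("inf")
    for idx, cv in enumerate(codebook):
        synth = lpc_synth(cv, a)
        err = sum((t - s) ** 2 for t, s in zip(target, synth))
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx, best_err
```

A target segment synthesized from a known codebook entry is recovered exactly, illustrating why the search is optimal over the codebook for the chosen distortion measure.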
Typically, the excitation vector is optimized more often than the LPC spectral model. The use of vectors rather than scalars for the excitation is significant in bit-rate reduction. The use of perceptual weighting in the CELP reconstruction stage and analysis-by-synthesis optimization of the dominant low-frequency (pitch) component are key concepts in maintaining good quality encoded speech at lower rates. CELP-based speech coders are the predominant coding methodologies for rates between 4 kbits/s and 16 kbits/s due to their excellent subjective performance. Some of the most notable are detailed below.

• ITU-T Recommendation G.728 (LD-CELP) [Chen, 1990] is a low-delay, backward adaptive CELP coder. In G.728, a low algorithmic delay (less than 2.5 msec) is achieved by using 1024 candidate excitation sequences, each only 5 samples long. A 50th-order LPC spectral model is used, and the coefficients are backward-adapted based on the transmitted excitation.
• The speech coder standardized by the CTIA for use in the U.S. (time-division multiple-access) 8 kbits/s digital cellular radio systems is called vector sum excited linear prediction (VSELP) [Gerson and Jasiuk, 1990]. VSELP is a forward-adaptive form of CELP where two excitation codebooks are used to reduce the complexity of encoding.
• Other approaches to complexity reduction in CELP coders are related to "sparse" codebook entries which have few nonzero samples per vector and "algebraic" codebooks which are based on integer lattices [Adoul and Lamblin, 1987]. In this case, excitation code vectors can be constructed on an as-needed basis instead of being stored in a table. ITU-T standardization of a CELP algorithm which uses lattice-based excitations has resulted in the 8 kbps G.729 (ACELP) coder.
• U.S. Federal Standard 1016 [National Communications System, 1991] is a 4.8 kbps CELP coder.
It has both long- and short-term linear predictors which are forward adaptive, and so the coder has a relatively large delay (100 msec). This coder produces highly intelligible, good-quality speech in a variety of environments and is robust to independent bit errors.

Below about 4 kbps, the subjective quality of CELP coders is inferior to that of other architectures. Much research in variable-rate CELP implementations has resulted in alternative coder architectures which adjust their coding rates based on a number of channel conditions or sophisticated, speech-specific cues such as phonetic segmentation [Wang and Gersho, 1989; Paksoy et al., 1993]. Notably, most variable-rate CELP coders are implementations of finite-state CELP wherein a vector of speech cues controls the evolution of a state machine to prescribe mode-dependent bit allocations for coder parameters. With these architectures, excellent speech quality at average rates below 2 kbps has been reported.

MBE

The MBE coder [Hardwick and Lim, 1991] is an efficient frequency-domain architecture partially based on the concepts of sinusoidal transform coding (STC) [McAulay and Quatieri, 1986]. In MBE, the instantaneous spectral envelope is represented explicitly by harmonic estimates in several subbands. The performance of MBE coders at rates below 4 kbps is generally "better" than that of CELP-based schemes. An MBE coder decomposes the instantaneous speech spectrum into subbands centered at harmonics of the fundamental glottal excitation (pitch). The spectral envelope of the signal is approximated by samples taken at pitch harmonics, and these harmonic amplitudes are compared to adaptive thresholds (which may be determined via analysis-by-synthesis) to determine subbands of high spectral activity. Subbands that are determined to be "voiced" are labeled, and their energies and phases are encoded for transmission. Subbands having relatively low spectral activity are declared "unvoiced".
These segments are approximated by an appropriately filtered segment of white noise, or a locally dense collection of sinusoids with random phase. Careful tracking of the evolution of individual spectral peaks and phases in successive frames is critical in the implementation of MBE-style coders.

An efficient implementation of an MBE coder was adopted for the International Maritime Satellite (INMARSAT) voice processor, and is known as Improved MBE, or IMBE [Hardwick and Lim, 1991]. This coder optimizes several components of the general MBE architecture, including grouping neighboring harmonics for subband voicing decisions, using non-integer pitch resolution for higher speaker fidelity, and differentially encoding the log-amplitudes of voiced harmonics using a DCT-based scheme. The IMBE coder requires high delay (about 80 msec), but produces very good quality encoded speech.

MELP

The MELP coder [McCree and Barnwell, 1995] is based on the traditional LPC vocoder model where an LPC synthesis filter is excited by an impulse train (voiced speech) or white noise (unvoiced speech). The MELP excitation, however, has characteristics that are more similar to those of natural human speech. In particular, the MELP excitation can be a mixture of (possibly aperiodic) pulses with bandlimited noise. In MELP, the excitation spectrum is explicitly modeled using Fourier series coefficients and bandpass voicing strengths, and the time-domain excitation sequence is produced from the spectral model via an inverse transform. The synthetic excitation sequence is then used to drive an LPC synthesizer which introduces formant spectral shaping.

Common Threads

In addition to the use of analysis-by-synthesis techniques and/or LPC modeling, a common thread between low-rate, forward adaptive CELP, MBE, and MELP coders is the dependence on an estimate of the fundamental glottal frequency, or pitch period. CELP coders typically employ a pitch or long-term predictor to characterize
the glottal excitation. MBE coders estimate the fundamental frequency and use this estimate to focus subband decompositions on harmonics. MELP coders perform pitch-synchronous excitation modeling. Overall coder performance is enhanced in the CELP and MBE coders with the use of sub-integer lags [Kroon and Atal, 1991]. This is equivalent to performing pitch estimation using a signal sampled at a higher sampling rate to improve the precision of the spectral estimate. Highly precise glottal frequency estimation improves the "naturalness" of coded speech at the expense of increased computational complexity, and in some cases increased bit rate. Accurate characterization of pitch and LPC parameters can also be used to good advantage in postfiltering to reduce apparent quantization noise. These filters, usually derived from forward-adapted filter coefficients transmitted to the receiver as side-information, perform post-processing on the reconstructed speech which reduces perceptually annoying noise components [Chen and Gersho, 1995].

Speech Quality and Intelligibility

To compare the performance of two speech coders, it is necessary to have some indicator of the intelligibility and quality of the speech produced by each coder. The term intelligibility usually refers to whether the output speech is easily understandable, while the term quality is an indicator of how natural the speech sounds. It is possible for a coder to produce highly intelligible speech that is low quality in that the speech may sound very machine-like and the speaker is not identifiable. On the other hand, it is unlikely that unintelligible speech would be called high quality, but there are situations in which perceptually pleasing speech does not have high intelligibility. We briefly discuss here the most common measures of intelligibility and quality used in formal tests of speech coders.
DRT

The diagnostic rhyme test (DRT) was devised by Voiers [1977] to test the intelligibility of coders known to produce speech of lower quality. Rhyme tests are so named because the listener must determine which consonant was spoken when presented with a pair of rhyming words; that is, the listener is asked to distinguish between word pairs such as meat-beat, pool-tool, saw-thaw, and caught-taught. Each pair of words differs on only one of six phonemic attributes: voicing, nasality, sustention, sibilation, graveness, and compactness. Specifically, the listener is presented with one spoken word from a pair and asked to decide which word was spoken. The final DRT score is the percentage computed according to P = (R − W)/T × 100, where R is the number of correct choices, W is the number of incorrect choices, and T is the total number of word pairs tested. Usually, 75 ≤ DRT ≤ 95, with a good score being about 90 [Papamichalis, 1987].

MOS

The Mean Opinion Score (MOS) is an often-used performance measure [Jayant and Noll, 1984]. To establish a MOS for a coder, listeners are asked to classify the quality of the encoded speech in one of five categories: excellent (5), good (4), fair (3), poor (2), or bad (1). Alternatively, the listeners may be asked to classify the coded speech according to the amount of perceptible distortion present, i.e., imperceptible (5), barely perceptible but not annoying (4), perceptible and annoying (3), annoying but not objectionable (2), or very annoying and objectionable (1). The numbers in parentheses are used to assign a numerical value to the subjective evaluations, and the numerical ratings of all listeners are averaged to produce a MOS for the coder. A MOS between 4.0 and 4.5 usually indicates high quality. It is important to compute the variance of MOS values. A large variance, which indicates an unreliable test, can occur because participants do not know what categories such as good and bad mean.
It is sometimes useful to present examples of good and bad speech to the listeners before the test to calibrate the 5-point scale [Papamichalis, 1987]. The MOS values for a variety of speech coders and noise conditions are given in [Daumer, 1982].

DAM

The diagnostic acceptability measure (DAM) developed by Dynastat [Voiers, 1977] is an attempt to make the measurement of speech quality more systematic. For the DAM, it is critical that the listener crews be highly trained and repeatedly calibrated in order to get meaningful results. The listeners are each presented with encoded sentences taken from the Harvard 1965 list of phonetically balanced sentences, such as "Cats and dogs each hate the other" and "The pipe began to rust while new". The listener is asked to assign a number between 0 and 100 to characteristics in three classifications: signal qualities, background qualities, and total effect. The ratings of each characteristic are weighted and used in a multiple nonlinear regression. Finally, adjustments are made to compensate for listener performance. A typical DAM score is 45 to 55%, with 50% corresponding to a good system [Papamichalis, 1987].

The perception of "good quality" speech is a highly individual and subjective area. As such, no single performance measure has gained wide acceptance as an indicator of the quality and intelligibility of speech produced by a coder. Further, there is no substitute for subjective listening tests under the actual environmental conditions expected in a particular application. As a rough guide to the performance of some of the coders discussed here, we present the DRT, DAM, and MOS values in Table 15.1, which is adapted from [Spanias, 1994; Jayant, 1990]. From the table, it is evident that at 8 kbits/s and above, performance is quite good and that the 4.8 kbits/s CELP has substantially better performance than LPC-10e.
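The DRT and MOS computations described above are simple enough to state directly in code; `drt_score` and `mos` are illustrative helper names, and the variance reported by `mos` is the population variance across listeners.

```python
# Sketch: the corrected DRT percentage P = (R - W)/T * 100, and a mean
# opinion score with its variance (a large variance flags an unreliable test).

def drt_score(right, wrong, total):
    return (right - wrong) / total * 100.0

def mos(ratings):
    m = sum(ratings) / len(ratings)
    var = sum((r - m) ** 2 for r in ratings) / len(ratings)
    return m, var
```

For example, 92 correct and 4 incorrect choices over 96 word pairs give a DRT score of about 91.7, at the high end of the usual 75 to 95 range.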
Standardization

The presence of international, national, and regional speech coding standards ensures the interoperability of coders among various implementations. As noted previously, several standard algorithms exist among the classes of speech coders. The ITU-T (formerly CCITT) has historically been a dominant factor in international standardization of speech coders, such as G.711, G.721, G.728, G.729, etc. Additionally, the emergence of digital cellular telephony, personal communications networks, and multimedia communications has driven the formulation of various national or regional standard algorithms such as the GSM full- and half-rate standards for European digital cellular, the CTIA full-rate TDMA and CDMA algorithms and their half-rate counterparts for U.S. digital cellular, full- and half-rate Pitch-Synchronous CELP [Miki et al., 1993] for Japanese cellular, as well as speech coders for particular applications [ITU-TS, 1991].

The standardization efforts of the U.S. Federal Government for secure voice channels and military applications have had a historically significant impact on the evolution of speech coder technology. In particular, the recent re-standardization of the DoD 2400 bits/s vocoder algorithm has produced some competing algorithms worthy of mention here. Of the classes of speech coders represented among the algorithms competing to replace LPC-10, several implementations utilized STC or MBE architectures, some used CELP architectures, and others were novel combinations of multiband excitation with LPC modeling [McCree and Barnwell, 1995] or pitch-synchronous prototype waveform interpolation techniques [Kleijn, 1991]. The final results of the U.S. DoD standard competition are summarized in Table 15.2 for "quiet" and "office" environments. In the table, the column labeled "FOM" is the overall Figure of Merit used by the DoD Digital Voice Processing Consortium in selecting the coder.
The FOM is a unitless combination of complexity and performance components, and is measured with respect to FS-1016. The complexity of a coder is a weighted combination of the memory and processing power required. The performance of a coder is a weighted combination of four factors: quality (Q, measured via MOS), intelligibility (I, measured via DRT), speaker recognition (R), and communicability (C). Recognizability and communicability for each coder were measured by tests comparing processed vs. unprocessed data, and by the effectiveness of communication in application-specific cooperative tasks [Schmidt-Nielsen and Brock, 1996; Kreamer and Tardelli, 1996]. The MOS and DRT scores were measured in a variety of common DoD environments. Each of the four "finalist" coders ranked first in one of the four categories examined (Q, I, R, C), as noted in the table.

The results of the standardization process were announced in April 1996. As indicated in Table 15.2, the new 2400 bits/s Federal Standard vocoder replacing LPC-10e is a version of the Mixed Excitation Linear Prediction (MELP) coder which uses several specific enhancements to the basic MELP architecture. These enhancements include multi-stage VQ of the formant parameters based on frequency-weighted bark-scale spectral distortion, direct VQ of the first 10 Fourier coefficients of the excitation using bark-weighted distortion, and a gain coding technique which is robust to channel errors [McCree et al., 1996].

TABLE 15.1  Speech Coder Performance Comparisons

  Algorithm    Standardization   Identifier   Rate       MOS      DRT      DAM
  (acronym)    Body                           (kbits/s)
  μ-law PCM    ITU-T             G.711        64         4.3      95       73
  ADPCM        ITU-T             G.721        32         4.1      94       68
  LD-CELP      ITU-T             G.728        16         4.0      94(a)    70(a)
  RPE-LTP      GSM               GSM          13         3.5      —        —
  VSELP        CTIA              IS-54        8          3.5      —        —
  CELP         U.S. DoD          FS-1016      4.8        3.13(b)  90.7(b)  65.4(b)
  IMBE         Inmarsat          IMBE         4.1        3.4      —        —
  LPC-10e      U.S. DoD          FS-1015      2.4        2.24(b)  86.2(b)  50.3(b)

  (a) Estimated.  (b) From results of 1996 U.S. DoD 2400 bits/s vocoder competition.
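As a purely hypothetical illustration of a figure of merit of this general kind — performance as a weighted combination of quality, intelligibility, recognizability, and communicability, penalized by complexity, and normalized so that the reference coder (FS-1016 in the competition) scores zero — consider the sketch below. The weights are invented for illustration and are not the consortium's actual values.

```python
# Hypothetical FOM sketch: weighted performance ratios against a reference
# coder, minus a complexity penalty. All weights here are illustrative.

def figure_of_merit(scores, complexity, ref_scores, ref_complexity,
                    weights=(0.4, 0.3, 0.15, 0.15), c_weight=0.5):
    # scores = (Q, I, R, C); each is divided by the reference coder's value
    perf = sum(w * (s / r) for w, s, r in zip(weights, scores, ref_scores))
    penalty = c_weight * (complexity / ref_complexity - 1.0)
    return perf - 1.0 - penalty   # the reference coder itself scores 0.0
```

A coder that beats the reference on every factor at equal complexity gets a positive score, mirroring how the competing 2400 bits/s coders were ranked against FS-1016.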
Variable Rate Coding

Previous standardization efforts and the discussion here have centered on fixed-rate coding of speech, where a fixed number of bits per unit of time is used to represent speech in digital form. However, with recent developments in transmission architectures (such as CDMA), the implementation of variable-rate speech coding algorithms has become feasible. In variable-rate coding, the average data rate for conversational speech can be reduced by a factor of at least 2.

A variable-rate speech coding algorithm has been standardized by the CTIA for wideband (CDMA) digital mobile cellular telephony under IS-95. The algorithm, QCELP [Gardner et al., 1993], is the first practical variable-rate speech coder to be incorporated in a digital cellular system. QCELP is a multi-mode, CELP-type analysis-by-synthesis coder which uses blockwise spectral energy measurements and a finite-state machine to switch between one of four configurations. Each configuration has a fixed rate of 1, 2, 4, or 8 kbits/s with a predetermined allocation of bits among coder parameters (coefficients, gains, excitation, etc.). The subjective performance of QCELP in the presence of low background noise is quite good since the per-mode bit allocations and mode-switching logic are well-suited to high-quality speech. In fact, QCELP at an average rate of 4 kbits/s has been judged to be MOS-equivalent to VSELP, its 8 kbits/s fixed-rate cellular counterpart. A time-averaged encoding rate of 4 to 5 kbits/s is not uncommon for QCELP; however, the average rate tends toward the 8 kbits/s maximum in the presence of moderate ambient noise. The topic of robust fixed-rate and variable-rate speech coding in the presence of significant background noise remains an open problem. Much recent research in speech coding below 8 kbits/s has focused on multi-mode CELP architectures and efficient approaches to source-controlled mode selection [Das et al., 1995].
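Source-controlled mode selection of the kind QCELP performs can be sketched as a small energy-driven state machine choosing among the four fixed rates. The thresholds and the "hangover" (holding the full rate briefly after active speech so that word endings are not clipped) are illustrative values, not those of the actual QCELP algorithm.

```python
# Sketch: pick one of four QCELP-style rates (1, 2, 4, 8 kbits/s) per frame
# from a blockwise energy measurement, with an illustrative hangover rule.

RATES = (1, 2, 4, 8)  # kbits/s

def select_rates(frames, thresholds=(0.001, 0.01, 0.1), hangover=2):
    rates, hold = [], 0
    for frame in frames:
        energy = sum(x * x for x in frame) / len(frame)
        level = sum(energy > t for t in thresholds)   # 0 (silence) .. 3 (speech)
        if level == 3:
            hold = hangover                 # restart hangover on active speech
        elif hold > 0:
            level, hold = 3, hold - 1       # hold the full rate a bit longer
        rates.append(RATES[level])
    return rates
```

Averaging the per-frame rates over a conversation, with its natural pauses, is what yields the low time-averaged rates cited for QCELP.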
Multimode coders are able to quickly invoke a coding scheme and bit allocation specifically tailored to the local characteristics of the speech signal. This capability has proven useful in optimizing perceptual quality at low coding rates. In fact, the majority of algorithms currently proposed for half-rate European and U.S. digital cellular standards, as well as many algorithms considered for rates below 2.4 kbits/s, are multimode coders. The direct coupling between variable-rate (multimode) speech coding and the CDMA transmission architecture is an obvious benefit to both technologies.

TABLE 15.2  Speech Coder Performance Comparisons Taken from Results of 1996 U.S. DoD 2400 bits/s Vocoder Competition

                                          Quiet                     Office
  Algorithm
  (acronym)   FOM      Rank   Best   MOS    DRT    DAM      MOS    DRT    DAM
  MELP        2.616    1      I      3.30   92.3   64.5     2.96   91.2   52.7
  PWI         2.347    2      Q      3.28   90.5   70.0     2.88   88.4   55.5
  STC         2.026    3      R      3.08   89.9   63.8     2.82   91.5   54.1
  IMBE        2.991    *      C      2.89   91.4   62.3     2.71   91.1   52.4
  CELP        0.0      N/A    —      3.13   90.7   65.4     2.89   89.0   56.1
  LPC-10e     –9.19    N/A    —      2.24   86.2   50.3     2.09   85.2   48.4

  * Ineligible due to failure of the quality (MOS) criteria minimum requirements (better than CELP) in both quiet and office environments.

Summary and Conclusions

The availability of general-purpose and application-specific digital signal processing chips and the ever-widening interest in digital communications have led to an increasing demand for speech coders. The worldwide desire to establish standards in a host of applications is a primary driving force for speech coder research and development. The speech coders that are available today for operation at 16 kbits/s and below are conceptually quite exotic compared with products available less than 10 years ago. The re-standardization of U.S. Federal Standard 1015 (LPC-10) at 2.4 kbits/s with performance constraints similar to those of FS-1016 at 4.8 kbits/s is an indicator of the rapid evolution of speech coding paradigms and VLSI architectures.
Other standards to be established in the near term include the European (GSM) and U.S. (CTIA) half-rate speech coders for digital cellular mobile radio. For the longer term, the specification of standards for forthcoming mobile personal communications networks will be a primary focus in the next 5 to 10 years. In the preface to their book, Jayant and Noll [1984] state that “our understanding of speech and image coding has now reached a very mature point …” As of 1997, this statement rings truer than ever. The field is a dynamic one, however, and the wide range of commercial applications demands continual progress.

Defining Terms

Analysis-by-synthesis: Constructing several versions of a waveform and choosing the best match.
Predictive coding: Coding of time-domain waveforms based on a (usually) linear prediction model.
Frequency domain coding: Coding of frequency-domain characteristics based on a discrete time-frequency transform.
Hybrid coders: Coders that fall between waveform coders and vocoders in how they select the excitation.
Standard: An encoding technique adopted by an industry to be used in a particular application.
Mean Opinion Score (MOS): A popular method for classifying the quality of encoded speech based on a five-point scale.
Variable-rate coders: Coders that output different amounts of bits based on the time-varying characteristics of the source.

Related Topics

17.1 Digital Image Processing • 21.4 Example 3: Multirate Signal Processing

References

A. S. Spanias, “Speech coding: A tutorial review,” Proc. IEEE, 82, 1541–1575, October 1994.
A. Gersho, “Advances in speech and audio compression,” Proc. IEEE, 82, June 1994.
W. B. Kleijn and K. K. Paliwal, Eds., Speech Coding and Synthesis, Amsterdam, Holland: Elsevier, 1995.
CCITT, “32-kbit/s adaptive differential pulse code modulation (ADPCM),” Red Book, III.3, 125–159, 1984.
National Communications System, Office of Technology and Standards, Federal Standard 1015: Analog to Digital Conversion of Voice by 2400 bit/second Linear Predictive Coding, 1984.
J.-H. Chen, “High-quality 16 kb/s speech coding with a one-way delay less than 2 ms,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Albuquerque, NM, pp. 453–456, April 1990.
National Communications System, Office of Technology and Standards, Federal Standard 1016: Telecommunications: Analog to Digital Conversion of Radio Voice by 4800 bit/second Code Excited Linear Prediction (CELP), 1991.
J. Gibson, “Adaptive prediction for speech encoding,” IEEE ASSP Magazine, 1, 12–26, July 1984.
J. D. Johnston, “A filter family designed for use in quadrature mirror filter banks,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Denver, CO, pp. 291–294, April 1980.
B. Atal and M. Schroeder, “Predictive coding of speech signals and subjective error criteria,” IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 247–254, June 1979.
I. Gerson and M. Jasiuk, “Vector sum excited linear prediction (VSELP) speech coding at 8 kb/s,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Albuquerque, NM, pp. 461–464, April 1990.
J.-P. Adoul and C. Lamblin, “A comparison of some algebraic structures for CELP coding of speech,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Dallas, TX, pp. 1953–1956, April 1987.
S. Wang and A. Gersho, “Phonetically-based vector excitation coding of speech at 3.6 kbps,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Glasgow, Scotland, pp. 49–52, May 1989.
E. Paksoy, K. Srinivasan, and A. Gersho, “Variable rate speech coding with phonetic segmentation,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Minneapolis, MN, pp. II.155–II.158, April 1993.
J. Hardwick and J. Lim, “The application of the IMBE speech coder to mobile communications,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 249–252, May 1991.
R. McAulay and T. Quatieri, “Speech analysis/synthesis based on a sinusoidal representation,” IEEE Trans. Acoust., Speech, Signal Processing, 34, 744–754, August 1986.
A. McCree and T. Barnwell, “A mixed excitation LPC vocoder model for low bit rate speech coding,” IEEE Trans. Speech Audio Processing, 3, 242–250, July 1995.
P. Kroon and B. S. Atal, “On improving the performance of pitch predictors in speech coding systems,” in Advances in Speech Coding, B. S. Atal, V. Cuperman, and A. Gersho, Eds., Boston, Mass.: Kluwer, 1991, pp. 321–327.
J.-H. Chen and A. Gersho, “Adaptive postfiltering for quality enhancement of coded speech,” IEEE Trans. Speech and Audio Processing, 3, 59–71, January 1995.
W. Voiers, “Diagnostic evaluation of speech intelligibility,” in Speech Intelligibility and Recognition, M. Hawley, Ed., Stroudsburg, Pa.: Dowden, Hutchinson, and Ross, 1977.
P. Papamichalis, Practical Approaches to Speech Coding, Englewood Cliffs, N.J.: Prentice-Hall, 1987.
N. S. Jayant and P. Noll, Digital Coding of Waveforms, Englewood Cliffs, N.J.: Prentice-Hall, 1984.
W. Daumer, “Subjective evaluation of several different speech coders,” IEEE Trans. Commun., COM-30, 655–662, April 1982.
W. Voiers, “Diagnostic acceptability measure for speech communications systems,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 204–207, 1977.
N. Jayant, “High-quality coding of telephone speech and wideband audio,” IEEE Communications Magazine, 28, 10–20, January 1990.
S. Miki, K. Mano, H. Ohmuro, and T. Moriya, “Pitch synchronous innovation CELP (PSI-CELP),” Proc. European Conf. Speech Comm. Technol., Berlin, Germany, pp. 261–264, September 1993.
ITU-TS Study Group XV, Draft recommendation AV.25Y—Dual Rate Speech Coder for Multimedia Telecommunication Transmitting at 5.3 & 6.4 kbit/s, December 1991.
W. Kleijn, “Continuous representations in linear predictive coding,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 201–204, 1991.
A. Schmidt-Nielsen and D. Brock, “Speaker recognizability testing for voice coders,” Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 1149–1152, April 1996.
E. Kreamer and J. Tardelli, “Communicability testing for voice coders,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1153–1156, April 1996.
A. McCree, K. Truong, E. George, T. Barnwell, and V. Viswanathan, “A 2.4 kbit/s MELP coder candidate for the new U.S. Federal Standard,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 200–203, April 1996.
W. Gardner, P. Jacobs, and C. Lee, “QCELP: A variable rate speech coder for CDMA digital cellular,” in Speech and Audio Coding for Wireless Networks, B. S. Atal, V. Cuperman, and A. Gersho, Eds., Boston, Mass.: Kluwer, 1993, pp. 85–92.
A. Das, E. Paksoy, and A. Gersho, “Multimode and variable-rate coding of speech,” in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., Amsterdam: Elsevier, 1995, pp. 257–288.

Further Information

For further information on the state of the art in speech coding, see the articles by Spanias [1994] and Gersho [1994], and the book Speech Coding and Synthesis by Kleijn and Paliwal [1995].

15.2 Speech Enhancement and Noise Reduction

Yariv Ephraim

Voice communication systems are susceptible to interfering signals normally referred to as noise. The interfering signals may have harmful effects on the performance of any speech communication system. These effects depend on the specific system being used, on the nature of the noise and the way it interacts with the clean signal, and on the relative intensity of the noise compared to that of the signal. The latter is usually measured by the signal-to-noise ratio (SNR), which is the ratio of the power of the signal to the power of the noise.
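Expressed in decibels, the SNR is ten times the base-10 logarithm of this power ratio; a minimal sketch:

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: the ratio of the average power of
    the signal to the average power of the noise, on a log scale."""
    p_sig = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_sig / p_noise)
```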
The speech communication system may simply be a recording which was performed in a noisy environment, a standard digital or analog communication system, or a speech recognition system for human-machine communication. The noise may be present at the input of the communication system, in the channel, or at the receiving end. The noise may be correlated or uncorrelated with the signal. It may accompany the clean signal in an additive, multiplicative, or any other more general manner. Examples of noise sources include competitive speech; background sounds like music, a fan, machines, door slamming, wind, and traffic; room reverberation; and white Gaussian channel noise.

The ultimate goal of speech enhancement is to minimize the effects of the noise on the performance of speech communication systems. The performance measure is system dependent. For systems which comprise recordings of noisy speech, or standard analog communication systems, the goal of speech enhancement is to improve perceptual aspects of the noisy signal. For example, improving the quality and intelligibility of the noisy signal are common goals. Quality is a subjective measure which reflects on the pleasantness of the speech or on the amount of effort needed to understand the speech material. Intelligibility, on the other hand, is an objective measure which signifies the amount of speech material correctly understood. For standard digital communication systems, the goal of speech enhancement is to improve perceptual aspects of the encoded speech signal. For human-machine speech communication systems, the goal of speech enhancement is to reduce the error rate in recognizing the noisy speech signals.

To demonstrate the above ideas, consider a “hands-free” cellular radio telephone communication system. In this system, the transmitted signal is composed of the original speech and the background noise in the car. The background noise is generated by an engine, fan, traffic, wind, etc.
The transmitted signal is also affected by the radio channel noise. As a result, noisy speech with low quality and intelligibility is delivered by such systems. The background noise may have additional devastating effects on the performance of this system. Specifically, if the system encodes the signal prior to its transmission, then the performance of the speech coder may significantly deteriorate in the presence of the noise. The reason is that speech coders rely on some statistical model for the clean signal, and this model becomes invalid when the signal is noisy. For a similar reason, if the cellular radio system is equipped with a speech recognizer for automatic dialing, then the error rate of such a recognizer will be elevated in the presence of the background noise. The goals of speech enhancement in this example are to improve perceptual aspects of the transmitted noisy speech signals as well as to reduce the speech recognizer error rate.

Other important applications of speech enhancement include improving the performance of:

1. Pay phones located in noisy environments (e.g., airports)
2. Air-ground communication systems in which the cockpit noise corrupts the pilot’s speech
3. Teleconferencing systems where noise sources in one location may be broadcast to all other locations
4. Long distance communication over noisy radio channels

The problem of speech enhancement has been a challenge for many researchers for almost three decades. Different solutions with various degrees of success have been proposed over the years. An excellent introduction to the problem, and a review of the systems developed up until 1979, can be found in the landmark paper by Lim and Oppenheim [1979]. A panel of the National Academy of Sciences discussed in 1988 the problem and various ways to evaluate speech enhancement systems. The panel’s findings were summarized in Makhoul et al. [1989].
Modern statistical approaches for speech enhancement were recently reviewed in Boll [1992] and Ephraim [1992].

In this section the principles and performance of the major speech enhancement approaches are reviewed, and the advantages and disadvantages of each approach are discussed. The signal is assumed to be corrupted by additive statistically independent noise. Only a single noisy version of the clean signal is assumed available for enhancement. Furthermore, it is assumed that the clean signal cannot be preprocessed to increase its robustness prior to being affected by the noise. Speech enhancement systems which can either preprocess the clean speech signal or which have access to multiple versions of the noisy signal obtained from a number of microphones are discussed in Lim [1983].

This presentation is organized as follows. In the second section the speech enhancement problem is formulated and commonly used models and performance measures are presented. In the next section signal estimation for improving perceptual aspects of the noisy signal is discussed. In the fourth section source coding techniques for noisy signals are summarized, and the last section deals with recognition of noisy speech signals. Due to the limited number of references (10) allowed in this publication, tutorial papers are mainly referenced. Appropriate credit will be given by pointing to the tutorial papers which reference the original papers.

Models and Performance Measures

The goals of speech enhancement as stated in the first section are to improve perceptual aspects of the noisy signal, whether the signal is transmitted through analog or digital channels, and to reduce the error rate in recognizing noisy speech signals. Improving perceptual aspects of the noisy signal can be accomplished by estimating the clean signal from the noisy signal using perceptually meaningful estimation performance measures.
If the signal has to be encoded for transmission over digital channels, then source coding techniques can be applied to the given noisy signal. In this case, a perceptually meaningful fidelity measure between the clean signal and the encoded noisy signal must be used. Reducing the error rate in speech communication systems can be accomplished by applying optimal signal classification approaches to the given noisy signals. Thus the speech enhancement problem is essentially a set of signal estimation, source coding, and signal classification problems. The probabilistic approach to solving these problems requires explicit knowledge of the performance measure as well as the probability laws of the clean signal and noise process. Such knowledge, however, is not explicitly available. Hence, mathematically tractable performance measures and statistical models which are believed to be meaningful are used. In this section we briefly review the most commonly used statistical models and performance measures.

The most fundamental model for speech signals is the Gaussian autoregressive (AR) model. This model assumes that each 20- to 40-msec segment of the signal is generated from an excitation signal which is applied to a linear time-invariant all-pole filter. The excitation signal comprises a mixture of white Gaussian noise and a periodic sequence of impulses. The period of that sequence is determined by the pitch period of the speech signal. This model is described in Fig. 15.2. Generally, the excitation signal represents the flow of air through the vocal cords and the all-pole filter represents the vocal tract.

FIGURE 15.2  Gaussian autoregressive speech model.

The model for a given sample function of speech signals, which is composed of several consecutive 20- to 40-msec segments of that signal, is obtained from the sequence of AR models for the individual segments.
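The excitation-filter structure of Fig. 15.2 can be sketched directly; the AR coefficients, pitch period, and gain in the example below are illustrative placeholders, not values fitted to real speech.

```python
import numpy as np

def synthesize_segment(ar_coeffs, pitch_period, voiced, n, gain=1.0, seed=0):
    """Generate one 20- to 40-msec segment from the Gaussian AR speech model.

    The excitation e[t] is a periodic impulse train (voiced speech) or white
    Gaussian noise (unvoiced speech); it drives the all-pole vocal-tract
    filter y[t] = gain * e[t] - sum_k a[k] * y[t-1-k].
    """
    rng = np.random.default_rng(seed)
    if voiced:
        e = np.zeros(n)
        e[::pitch_period] = 1.0          # impulses at the pitch period
    else:
        e = rng.standard_normal(n)       # white Gaussian excitation
    a = np.asarray(ar_coeffs, dtype=float)
    y = np.zeros(n)
    for t in range(n):
        past = sum(a[k] * y[t - 1 - k] for k in range(min(len(a), t)))
        y[t] = gain * e[t] - past
    return y
```

A voiced segment is then, e.g., `synthesize_segment([-0.9], pitch_period=80, voiced=True, n=320)`, a decaying resonance re-excited at every pitch impulse.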
Thus, a linear time-varying AR model is assumed for each sample function of the speech signal. This model, however, is slowly varying in accordance with the slow temporal variation of the articulatory system. It was found that a set of approximately 2048 prototype AR models can reliably represent all segments of speech signals. The AR models are useful in representing the short time spectrum of the signal, since the spectrum of the excitation signal is white. Thus, the set of AR models represents a set of 2048 spectral prototypes for the speech signal.

The time-varying AR model for speech signals lacks the “memory” which assigns preference to one AR model to follow another AR model. This memory could be incorporated, for example, by assuming that the individual AR models are chosen in a Markovian manner. That is, given an AR model for the current segment of speech, certain AR models for the following segment of speech will be more likely than others. This results in the so-called composite source model (CSM) for the speech signal. A block diagram of a CSM is shown in Fig. 15.3. In general, this model is composed of a set of M vector subsources which are controlled by a switch. The position of the switch at each time instant is chosen randomly, and the output of one subsource is provided. The position of the switch defines the state of the source at each time instant. CSMs for speech signals assume that the subsources are Gaussian AR sources, and the switch is controlled by a Markov chain. Furthermore, the subsources are usually assumed statistically independent and the vectors generated from each subsource are also assumed statistically independent. The resulting model is known as a hidden Markov model (HMM) [Rabiner, 1989] since the output of the model does not contain the states of the Markovian switch.

The performance measure for speech enhancement is task dependent.
For signal estimation and coding, this measure is given in terms of a distortion measure between the clean signal and the estimated or the encoded signals, respectively. For signal classification applications the performance measure is normally the probability of misclassification. Commonly used distortion measures are the mean-squared error (MSE) and the Itakura-Saito distortion measures. The Itakura-Saito distortion measure is a measure between two power spectral densities, of which one is usually that of the clean signal and the other of a model for that signal [Gersho and Gray, 1991]. This distortion measure is normally used in designing speech coding systems and it is believed to be perceptually meaningful. Both measures are mathematically tractable and lead to intuitive estimation and coding schemes. Systems designed using these two measures are not necessarily optimal only in the MSE and Itakura-Saito sense; they may also be optimal in other, more meaningful senses (see a discussion in Ephraim [1992]).

FIGURE 15.3  Composite source model.

Signal Estimation

In this section we review the major approaches for speech signal estimation given noisy signals.

Spectral Subtraction

The spectral subtraction approach [Weiss, 1974] is the simplest and most intuitive and popular speech enhancement approach. This approach provides estimates of the clean signal as well as of the short time spectrum of that signal. Estimation is performed on a frame-by-frame basis, where each frame consists of 20–40 msec of speech samples. In the spectral subtraction approach the signal is Fourier transformed, and spectral components whose variance is smaller than that of the noise are nulled. The surviving spectral components are modified by an appropriately chosen gain function. The resulting set of nulled and modified spectral components constitutes the spectral components of the enhanced signal.
The signal estimate is obtained from the inverse Fourier transform of the enhanced spectral components. The short time spectrum estimate of the signal is obtained by squaring the enhanced spectral components. A block diagram of the spectral subtraction approach is shown in Fig. 15.4.

Gain functions motivated by different perceptual aspects have been used. One of the most popular functions results from linear minimum MSE (MMSE) estimation of each spectral component of the clean signal given the corresponding spectral component of the noisy signal. In this case, the value of the gain function for a given spectral component constitutes the ratio of the variances of the clean and noisy spectral components. The variance of the clean spectral component is obtained by subtracting an assumed known variance of the noise spectral component from the variance of the noisy spectral component. The resulting variance is guaranteed to be positive by the nulling process mentioned above. The variances of the spectral components of the noise process are normally estimated from silence portions of the noisy signal. A family of spectral gain functions proposed in Lim and Oppenheim [1979] is given by

    g_n = ( (|Z_n|^a − b·E{|V_n|^a}) / |Z_n|^a )^c,    n = 1, …, N        (15.2)

where Z_n and V_n denote the nth spectral components of the noisy signal and the noise process, respectively, and a > 0, b ≥ 0, c > 0. The MMSE gain function is obtained when a = 2, b = 1, and c = 1. Another commonly used gain function in the spectral subtraction approach is obtained by using a = 2, b = 1, and c = 1/2. This gain function results from estimating the spectral magnitude of the signal and combining the resulting estimate with the phase of the noisy signal. This choice of gain function is motivated by the relative importance of the spectral magnitude of the signal compared to its phase.
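A single-frame version of this gain family, including the nulling step, can be sketched as follows. This is a minimal sketch: real implementations operate on overlapping windowed frames, estimate the noise variances from silence portions, and the small floor constant below is an assumption added for numerical safety.

```python
import numpy as np

def spectral_subtraction(noisy_frame, noise_psd, a=2.0, b=1.0, c=1.0):
    """Spectral subtraction on one frame; a=2, b=1, c=1 gives the
    linear-MMSE-motivated gain, c=1/2 gives magnitude subtraction.

    noise_psd: E{|V_n|^a} for each bin, assumed estimated from silence.
    """
    Z = np.fft.rfft(noisy_frame)
    num = np.abs(Z) ** a - b * noise_psd
    num = np.maximum(num, 0.0)                       # null components below the noise level
    gain = (num / np.maximum(np.abs(Z) ** a, 1e-12)) ** c
    enhanced = gain * Z                              # modify surviving components
    return np.fft.irfft(enhanced, n=len(noisy_frame))
```

With a zero noise estimate the gain is unity on every significant component, so the frame passes through essentially unchanged; a larger `noise_psd` nulls progressively more low-energy bins.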
Since both cannot be simultaneously optimally estimated [Ephraim, 1992], only the spectral magnitude is optimally estimated, and it is combined with an estimate of the complex exponential of the phase which does not affect the spectral magnitude estimate. The resulting estimate of the phase can be shown to be the phase of the noisy signal within the HMM statistical framework. Normally, the spectral subtraction approach is used with b = 2, which corresponds to an artificially elevated noise level.

FIGURE 15.4  Spectral subtraction signal estimator.

The spectral subtraction approach has been very popular since it is relatively easy to implement; it makes minimal assumptions about the signal and noise; and, when carefully implemented, it results in reasonably clear enhanced signals. A major drawback of the spectral subtraction enhancement approach, however, is that the residual noise has annoying tonal characteristics referred to as “musical noise.” This noise consists of narrowband signals with time-varying frequencies and amplitudes. Another major drawback of the spectral subtraction approach is that its optimality in any given sense has never been proven. Thus, no systematic methodology for improving the performance of this approach has been developed, and all attempts to achieve this goal have been based on purely heuristic arguments. As a result, a family of spectral subtraction speech enhancement approaches has been developed and experimentally optimized. In a recent work [Ephraim et al., 1995] a version of the spectral subtraction was shown to be a signal subspace estimation approach which is asymptotically optimal (as the frame length approaches infinity) in the linear MMSE sense.

Empirical Averages

This approach attempts to estimate the clean signal from the noisy signal in the MMSE sense.
The conditional mean estimator is implemented using the conditional sample average of the clean signal given the noisy signal. The sample average is obtained from appropriate training sequences of the clean and noisy signals. This is equivalent to using the sample distribution or the histogram estimate of the probability density function (pdf) of the clean signal given the noisy signal. The sample average approach is applicable for estimating the signal as well as functionals of that signal, e.g., the spectrum, the logarithm of the spectrum, and the spectral magnitude.

Let {Y_t, t = 0, …, T} be training data from the clean signal, where Y_t is a K-dimensional vector in the Euclidean space R^K. Let {Z_t, t = 0, …, T} be training data from the noisy signal, where Z_t ∈ R^K. The sequence {Z_t} can be obtained by adding a noise training sequence {V_t, t = 0, …, T} to the sequence of clean signals {Y_t}. Let z ∈ R^K be a vector of the noisy signal from which the vector y of the clean signal is estimated. Let Y(z) = {Y_t : Z_t = z, t = 0, …, T} be the set of all clean vectors from the training data of the clean signal which could have resulted in the given noisy observation z. The cardinality of this set is denoted by |Y(z)|. Then, the sample average estimate of the conditional mean of the clean signal y given the noisy signal z is given by

    ŷ(z) = E{y | z} = ∫ y·p(y, z) dy / ∫ p(y, z) dy = (1/|Y(z)|) Σ_{Y_t ∈ Y(z)} Y_t        (15.3)

Obviously, this approach is only applicable for signals with a finite alphabet, since otherwise the set Y(z) is empty with probability one. For signals with continuous pdf’s, the approach can be applied only if those signals are appropriately quantized.

The sample average approach was first applied to enhancing speech signals by Porter and Boll in 1984 [Boll, 1992]. They, however, considered a simpler situation in which the true pdf of the noise was assumed known. In this case, enhanced signals with residual noise characterized as a blend of wideband noise and musical noise were obtained.
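The conditional sample average above can be sketched in a few lines, assuming finite-alphabet (quantized) signals so that exact matches of the noisy vector occur in the training data:

```python
import numpy as np

def sample_average_estimate(z, train_clean, train_noisy):
    """Conditional sample average E{y | z}: average all clean training
    vectors Y_t whose paired noisy vector Z_t equals the observation z.
    Signals are assumed quantized to a finite alphabet; otherwise the
    match set Y(z) is empty with probability one.
    """
    matches = [y for y, zt in zip(train_clean, train_noisy)
               if np.array_equal(zt, z)]
    if not matches:
        return None                     # Y(z) is empty: no estimate available
    return np.mean(matches, axis=0)     # (1/|Y(z)|) * sum of matching Y_t
```

The need to scan the full training data for every observed vector is exactly the memory and computation burden discussed below.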
The balance between the two types of residual noise depended on the functional of the clean signal which was estimated.

The advantages of the sample average approach are that it is conceptually simple and it does not require a priori assumptions about the form of the pdf’s of the signal and noise. Hence, it is a nonparametric estimation approach. This approach, however, has three major disadvantages. First, the estimator does not utilize any speech-specific information such as the periodicity of the signal and the signal’s AR model. Second, the training sequences from the signal and noise must be available at the speech enhancement unit. Furthermore, these training sequences must be applied for each newly observed vector of the noisy signal. Since the training sequences are normally very long, the speech enhancement unit must have extensive memory and computational resources. These problems are addressed in the model-based approach described next.

Model-Based Approach

The model-based approach [Ephraim, 1992] is a Bayesian approach for estimating the clean signal or any functional of that signal from the observed noisy signal. This approach assumes CSMs for the clean signal and noise process. The models are estimated from training sequences of those processes using the maximum likelihood (ML) estimation approach. Under ideal conditions the ML model estimate is consistent and asymptotically efficient. The ML model estimation is performed using the expectation-maximization (EM) or Baum iterative algorithm [Rabiner, 1989; Ephraim, 1992]. Given the CSMs for the signal and noise, the clean signal is estimated by minimizing the expected value of the chosen distortion measure. The model-based approach uses significantly more statistical knowledge about the signal and noise compared to either the spectral subtraction or the sample average approaches.
The MMSE signal estimator is obtained from the conditional mean of the clean signal given the noisy signal. If y_t ∈ R^K denotes the vector of the speech signal at time t, and z_0^t denotes the sequence of K-dimensional vectors of noisy signals {z_0, …, z_t} from time 0 to time t, then the MMSE estimator of y_t is given by

    ŷ_t = E{y_t | z_0^t} = Σ_{x̄_t} P(x̄_t | z_0^t) · E{y_t | z_t, x̄_t}        (15.4)

where x̄_t denotes the composite state of the noisy signal at time t. This state is given by x̄_t ≜ (x_t, x̃_t), where x_t is the Markov state of the clean signal at time t and x̃_t denotes the Markov state of the noise process at the same time instant. The MMSE estimator, Eq. (15.4), comprises a weighted sum of conditional mean estimators for the composite states of the noisy signal, where the weights are the probabilities of those states given the noisy observed signal. A block diagram of this estimator is shown in Fig. 15.5.

FIGURE 15.5  HMM-based MMSE signal estimator.

The probability P(x̄_t | z_0^t) can be efficiently calculated using the forward recursion associated with HMMs. For CSMs with Gaussian subsources, the conditional mean E{y_t | z_t, x̄_t} is a linear function of the noisy vector z_t, given by

    E{y_t | z_t, x̄_t} = S_{x_t} (S_{x_t} + S_{x̃_t})^(−1) z_t ≜ H_{x̄_t} z_t        (15.5)

where S_{x_t} and S_{x̃_t} denote the covariance matrices of the Gaussian subsources associated with the Markov states x_t and x̃_t, respectively. Since, however, P(x̄_t | z_0^t) is a nonlinear function of the noisy signal z_0^t, the MMSE signal estimator ŷ_t is a nonlinear function of the noisy signal z_0^t.

The MMSE estimator, Eq. (15.4), is intuitively appealing. It uses a predesigned set of filters {H_{x̄_t}} obtained from training data of speech and noise. Each filter is optimal for a pair of subsources of the CSMs for the clean signal and the noise process.
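The weighted sum of per-state filters in Eqs. (15.4) and (15.5) can be sketched as follows, assuming the subsource covariance matrices and the posterior state probabilities P(x̄_t | z_0^t) (normally produced by the HMM forward recursion, which is omitted here) are already available:

```python
import numpy as np

def mmse_hmm_estimate(z, state_probs, clean_covs, noise_covs):
    """Soft-decision estimate of Eq. (15.4): apply every predesigned filter
    H = S_x (S_x + S_v)^(-1) of Eq. (15.5) to the noisy vector z and weight
    the outputs by the posterior probabilities of the composite states.
    """
    est = np.zeros_like(z, dtype=float)
    for p, S_x, S_v in zip(state_probs, clean_covs, noise_covs):
        H = S_x @ np.linalg.inv(S_x + S_v)   # per-state Wiener-type filter
        est += p * (H @ z)                   # posterior-weighted filter output
    return est
```

Replacing the weighted sum with the single filter of the most probable state gives the "hard decision" alternative discussed next.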
Since each subsource represents a subset of signals from the corresponding source, each filter is optimal for a pair of signal subsets from the speech and noise. The set of predesigned filters covers all possible pairs of speech and noise signal subsets. Hence, for each noisy vector of speech there must exist an optimal filter in the set of predesigned filters. Since, however, a vector of the noisy signal could possibly be generated from any pair of subsources of the clean signal and noise, the most appropriate filter for a given noisy vector is not known. Consequently, in estimating the signal vector at each time instant, all filters are tried and their outputs are weighted by the probability that each filter is the correct one for the given noisy signal.

Other strategies for utilizing the predesigned set of filters are possible. For example, at each time instant only the most likely filter can be applied to the noisy signal. This approach is more intuitive than that of the MMSE estimation. It was first proposed in Drucker [1968] for a five-state model which comprises subsources for fricatives, stops, vowels, glides, and nasals. This approach was shown by Ephraim and Merhav [Ephraim, 1992] to be optimal only in an asymptotic MMSE sense.

The model-based MMSE approach provides reasonably good enhanced speech quality with significantly less structured residual noise than the spectral subtraction approach. This performance was achieved for white Gaussian input noise at 10 dB input SNR using 512–2048 filters. An improvement of 5–6 dB in SNR was achieved by this approach. The model-based approach, however, is more elaborate than the spectral subtraction approach, since it involves two steps of training and estimation, and training must be performed on sufficiently long data. The MMSE estimation approach is usually superior to the asymptotic MMSE enhancement approach.
The reason is that the MMSE approach applies a “soft decision” rather than a “hard decision” in choosing the most appropriate filter for a given vector of the noisy signal.

A two-state version of the MMSE estimator was first applied to speech enhancement by McAulay and Malpass in 1980 [Ephraim, 1992]. The two states corresponded to speech presence and speech absence (silence) in the noisy observations. The estimator for the signal, given that it is present in the noisy observations, was implemented by the spectral subtraction approach. The estimator for the signal in the “silence state” is obviously equal to zero. This approach significantly improved the performance of the spectral subtraction approach.

Source Coding

An encoder for the clean signal maps vectors of that signal onto a finite set of representative signal vectors referred to as codewords. The mapping is performed by assigning each signal vector to its nearest neighbor codeword. The index of the chosen codeword is transmitted to the receiver in a signal communication system, and the signal is reconstructed using a copy of the chosen codeword. The codewords are designed to minimize the average distortion resulting from the nearest neighbor mapping. The codewords may simply represent waveform vectors of the signal. In another important application of low bit-rate speech coding, the codewords represent a set of parameter vectors of the AR model for the speech signal. Such coding systems synthesize the signal using the speech model in Fig. 15.2. The synthesis is performed using the encoded vector of AR coefficients as well as the parameters of the excitation signal. Reasonably good speech quality can be obtained using this coding approach at rates as low as 2400–4800 bits/s [Gersho and Gray, 1991].

When only noisy signals are available for coding, the encoder operates on the noisy signal while representing the clean signal.
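The nearest-neighbor mapping described above can be sketched as follows. This is a toy waveform vector quantizer; the codebook used in the test is an assumed example, not one trained on speech.

```python
import numpy as np

def encode(vectors, codebook):
    """Nearest-neighbor mapping: each signal vector -> index of closest codeword."""
    # squared Euclidean distance from every vector to every codeword
    dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)

def decode(indices, codebook):
    """Receiver side: reconstruct the signal from a copy of the chosen codewords."""
    return codebook[indices]
```

Only the codeword indices cross the channel; the receiver holds its own copy of the codebook.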
In this case, the encoder is designed by minimizing the average distortion between the clean signal and the encoded signal. Specifically, let y denote the vector of the clean signal to be encoded, let z denote the corresponding given vector of the noisy signal, let q denote the encoder, and let d denote a distortion measure. Then, the optimal encoder is designed by

    min_q E{d(y, q(z))}    (15.6)

When the clean signal is available for encoding, the design problem is similarly defined, and it is obtained from Eq. (15.6) using z = y. The design problem in Eq. (15.6) is not standard, since the encoder operates on and represents different sources. The problem can be transformed into a standard coding problem by appropriately modifying the distortion measure. This was shown by Berger in 1971 and Ephraim and Gray in 1988 [Ephraim, 1992]. Specifically, define the modified distortion measure by

    d′(z, q(z)) ≜ E{d(y, q(z)) | z}    (15.7)

Then, by using iterated expectation in Eq. (15.6), the design problem becomes

    min_q E{d′(z, q(z))}    (15.8)

A useful class of encoders for speech signals is the class obtained from vector quantization. Vector quantizers are designed using the Lloyd algorithm [Gersho and Gray, 1991]. This is an iterative algorithm in which the codewords and the nearest neighbor regions are alternately optimized. This algorithm can be applied to design vector quantizers for clean and noisy signals using the modified distortion measure.

The problem of designing vector quantizers for noisy signals is related to the problem of estimating the clean signals from the noisy signals, as was shown by Wolf and Ziv in 1970 and Ephraim and Gray in 1988 [Ephraim, 1992]. Specifically, optimal waveform vector quantizers in the MMSE sense can be designed by first estimating the clean signal and then quantizing the estimated signal. Both estimation and quantization are performed in the MMSE sense.
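The Lloyd algorithm mentioned above alternates the two optimizations. Below is a minimal sketch for clean training vectors with the squared-error distortion; in the noisy case the modified distortion of Eq. (15.7) would take its place. The random initialization is an assumed convenience, not part of the algorithm's definition.

```python
import numpy as np

def lloyd(train, n_codewords, iters=20, seed=0):
    """Lloyd algorithm: alternately optimize the nearest-neighbor
    partition (encoder) and the codewords (centroids)."""
    rng = np.random.default_rng(seed)
    # initialize codewords with randomly chosen training vectors
    codebook = train[rng.choice(len(train), n_codewords, replace=False)].copy()
    for _ in range(iters):
        dist = ((train[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        cells = dist.argmin(axis=1)          # nearest-neighbor regions
        for i in range(n_codewords):
            members = train[cells == i]
            if len(members) > 0:             # centroid (codeword) update
                codebook[i] = members.mean(axis=0)
    return codebook
```

For squared error the optimal codeword of each cell is its centroid, so average distortion is non-increasing across iterations.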
Similarly, optimal quantization of the vector of parameters of the AR model for the speech signal in the Itakura-Saito sense can be performed in two steps of estimation and quantization. Specifically, the autocorrelation function of the clean signal, which approximately constitutes the sufficient statistics of that signal for estimating the AR model, is first estimated in the MMSE sense. Then, optimal vector quantization in the Itakura-Saito sense is applied to the estimated autocorrelation.

The estimation-quantization approach has been the most popular in designing encoders for speech signals given noisy signals. Since such a design requires explicit knowledge of the statistics of the clean signal and the noise process, and this knowledge is not available, as argued in the second section, a variety of suboptimal encoders has been proposed. Most of the research in this area focused on designing encoders for the AR model of the signal, due to the importance of such encoders in low bit-rate speech coding. The proposed encoders differ mainly in the estimators they used and in the functionals of the speech signal to which these estimators were applied. Important examples of functionals which have commonly been estimated include the signal waveform, the autocorrelation function, and the spectral magnitude. The primary set of estimators used for this application was obtained from the spectral subtraction approach and its derivatives. A version of the sample average estimator was also developed and applied to AR modeling by Juang and Rabiner in 1987 [Ephraim, 1992]. Recently, the HMM-based estimator of the autocorrelation function of the clean signal was used in AR model vector quantization [Ephraim, 1992].

The design of AR model-based encoders from noisy signals has been a very successful application of speech enhancement.
In this case, both the quality and intelligibility of the encoded signal can be improved compared to the case where the encoder is designed for the clean signal and the input noise is simply ignored. The reason is that the input noise has devastating effects on the performance of AR model-based speech coders, and any “reasonable” estimation approach can significantly improve the performance of those coders in noisy environments.

Signal Classification

In recognition of clean speech signals, a sample function of the signal is associated with one of the words in the vocabulary. The association or decision rule is designed to minimize the probability of classification error. When only noisy speech signals are available for recognition, a very similar problem results. Specifically, a sample function of the noisy signal is now associated with one of the words in the vocabulary in a way which minimizes the probability of classification error. The only difference between the two problems is that the sample functions of the clean and noisy signals from a given word have different statistics. The problem in both cases is that of partitioning the sample space of the given acoustic signals from all words in the vocabulary into L partition cells, where L is the number of words in the vocabulary. Let {W_i, i = 1, . . ., L} denote a set of words in a given vocabulary. Let z denote the acoustic noisy signal from some word in the vocabulary. Let W ≜ {w_1, . . ., w_L} be a partition of the sample space of the noisy signals. The probability of error associated with this partition is given by

    P_e(W) = 1 − Σ_{i=1}^{L} P(W_i) ∫_{w_i} p(z | W_i) dz    (15.9)

where P(W_i) is the a priori probability of occurrence of the ith word, and p(z | W_i) is the pdf of the noisy signal from the ith word. The minimization of P_e(W) is achieved by the well-known maximum a posteriori (MAP) decision rule.
Specifically, z is associated with the word W_i for which p(z | W_i)P(W_i) > p(z | W_j)P(W_j) for all j = 1, . . ., L and j ≠ i. Ties are broken arbitrarily. In the absence of noise, the noisy signal z becomes a clean signal y, and the optimal recognizer is obtained by using the same decision rule with z = y. Hence, the only difference between recognition of clean signals and recognition of noisy signals is that in the first case the pdf’s {p(y | W_i)} are used in the decision rule, while in the second case the pdf’s {p(z | W_i)} are used in the same decision rule. Note that optimal recognition of noisy signals requires explicit knowledge of the statistics of the clean signal and noise. Neither the clean signal nor any function of that signal needs to be estimated. Since, however, the statistics of the signal and noise are not explicitly available, as argued in the second section, parametric models are usually assumed for these pdf’s and their parameters are estimated from appropriate training data. Normally, HMMs with mixtures of Gaussian pdf’s at each state are attributed to both the clean signal and the noise process. It can be shown (similarly to the case of classification of clean signals dealt with by Merhav and Ephraim in 1991 [Ephraim, 1992]) that if the pdf’s of the signal and noise are precisely HMMs and the training sequences are significantly longer than the test data, then the MAP decision rule which uses estimates of the pdf’s of the signal and noise is asymptotically optimal.

A key issue in applying hidden Markov modeling for recognition of speech signals is the matching of the energy contour of the signal to the energy contour of the model for that signal. Energy matching is required for two main reasons. First, speech signals are not strictly stationary and hence their energy contours cannot be reliably estimated from training data. Second, recording conditions during training and testing vary.
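The MAP decision rule described at the start of this subsection can be sketched in a few lines, working with log-likelihoods for numerical stability. The per-word log-pdf values would in practice come from trained HMMs; here they are assumed to be given.

```python
import numpy as np

def map_decide(log_pdfs, priors):
    """MAP rule: choose the word W_i maximizing p(z | W_i) P(W_i).

    log_pdfs : log p(z | W_i) for each word (e.g., from per-word HMMs)
    priors   : a priori word probabilities P(W_i)
    """
    scores = np.asarray(log_pdfs) + np.log(np.asarray(priors))
    return int(np.argmax(scores))  # ties broken arbitrarily (first maximum)
```

Note that a skewed prior can overturn a likelihood advantage, which is exactly the a priori weighting in Eq. (15.9).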
An approach for gain adaptation was developed [Ephraim, 1992]. In this approach, HMMs for gain-normalized clean signals are designed and used together with gain contour estimates obtained from the noisy signals. The gain adaptation approach is implemented using the EM algorithm. This approach provides robust speech recognition at input SNRs greater than or equal to 10 dB.

The relation between signal classification and estimation was established in Kailath [1969] for continuous time signals contaminated by additive statistically independent Gaussian white noise. It was shown that minimum probability of error classification can be achieved by applying the MAP decision rule to the causal MMSE estimator of the clean signal. This interesting theoretical result provides the intuitive basis for a popular approach for recognition of noisy speech signals. In this approach, the clean signal or some feature vector of the signal is first estimated and then recognition is applied. In the statistical framework of hidden Markov modeling, however, the direct recognition approach presented earlier is significantly simpler, since both the clean signal and the noisy signal are HMMs [Ephraim, 1992]. Hence, the complexity of recognizing the estimated signal is the same as that of recognizing the noisy signal directly.

Other commonly used approaches for recognition of noisy speech signals were developed for systems which are based on pattern recognition. When clean signals are available for recognition, these systems match the input signal to the nearest neighbor acoustic templet which represents some word in the vocabulary. The templets mainly comprise spectral prototypes of the clean signals. The matching is performed using a distance measure between the clean input signal and the templet. When only noisy signals are available for recognition,
several modifications of the pattern matching approach were proposed. Specifically, adapting the templets of the clean signal to reflect the presence of the noise was proposed by Roe in 1987 [Ephraim, 1992]; choosing templets for the noisy signal which are more robust than those obtained from adaptation of the templets for the clean signal was often proposed; and using distance measures which are robust to noise, such as the projection measure proposed by Mansour and Juang in 1989 [Ephraim, 1992]. These approaches, along with the prefiltering approach in the sampled signal case, are fairly intuitive and are relatively easy to implement. It is difficult, however, to establish their optimality in any well-defined sense. Another interesting approach based on robust statistics was developed by Merhav and Lee [Ephraim, 1992]. This approach was shown to be asymptotically optimal in the minimum probability of error sense within the hidden Markov modeling framework.

The speech recognition problem in noisy environments has also been a successful application of speech enhancement. A significant reduction in the error rate due to the noise presence was achieved by the various approaches mentioned above.

Comments

Three major aspects of speech enhancement were reviewed. These comprise improving the perception of speech signals in noisy environments and increasing the robustness of speech coders and recognition systems in noisy environments. The inherent difficulties associated with these problems were discussed, and the main solutions, along with their strengths and weaknesses, were presented. This section is an introductory presentation of the speech enhancement problem. A comprehensive treatment of the subject can be found in Lim [1979], Makhoul et al. [1989], Boll [1992], and Ephraim [1992].
Significant progress in understanding the problem and in developing new speech enhancement systems was made during the 1980s with the introduction of statistical model-based approaches. The speech enhancement problem, however, is far from being solved, and major progress is still needed. In particular, no speech enhancement system which is capable of simultaneously improving both the quality and intelligibility of the noisy signal is currently known. Progress in this direction can be made if more reliable statistical models for the speech signal and noise process, as well as meaningful distortion measures, can be found.

Defining Terms

Autoregressive model: Statistical model for resonant signals.
Classifier: Maps signal utterances into a finite set of word units, e.g., syllables.
Encoder: Maps signal vectors into a finite set of codewords. A vector quantizer is a particular type of encoder.
Hidden Markov model: Statistical model comprising several subsources controlled by a Markovian process.
Intelligibility: Objective quantitative measure of speech perception.
Noise: Any interfering signal adversely affecting the communication of the clean signal.
Quality: Subjective descriptive measure of speech perception.
Signal: Clean speech sample to be communicated to a human or a machine.
Signal-to-noise ratio: Ratio of the signal power to the noise power, measured in decibels.
Speech enhancement: Improvement of perceptual aspects of speech signals.

Related Topics

48.1 Introduction • 73.2 Noise

References

S. F. Boll, “Speech enhancement in the 1980’s: noise suppression with pattern matching,” in Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, Eds., New York: Marcel Dekker, 1992.
H. Drucker, “Speech processing in a high ambient noise environment,” IEEE Trans. Audio Electroacoust., vol. 16, 1968.
Y. Ephraim, “Statistical model based speech enhancement systems,” Proc. IEEE, vol. 80, 1992.
Y. Ephraim and H. L.
Van Trees, “A signal subspace approach for speech enhancement,” IEEE Trans. Speech and Audio Processing, vol. 3, pp. 251–316, 1995.
A. Gersho and R. M. Gray, Vector Quantization and Signal Compression, Boston: Kluwer Academic Publishers, 1991.
T. Kailath, “A general likelihood-ratio formula for random signals in Gaussian noise,” IEEE Trans. Inform. Theory, vol. 15, 1969.
J. S. Lim, Ed., Speech Enhancement, Englewood Cliffs, N.J.: Prentice-Hall, 1983.
J. S. Lim and A. V. Oppenheim, “Enhancement and bandwidth compression of noisy speech,” Proc. IEEE, vol. 67, 1979.
J. Makhoul, T. H. Crystal, D. M. Green, D. Hogan, R. J. McAulay, D. B. Pisoni, R. D. Sorkin, and T. G. Stockham, Removal of Noise From Noise-Degraded Speech Signals, Washington, D.C.: National Academy Press, 1989.
L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, 1989.
M. R. Weiss, E. Aschkenasy, and T. W. Parsons, “Processing speech signals to attenuate interference,” in IEEE Symp. on Speech Recognition, Pittsburgh, 1974.

Further Information

A comprehensive treatment of the speech enhancement problem can be found in the four tutorial papers and book listed below.

J. S. Lim and A. V. Oppenheim, “Enhancement and bandwidth compression of noisy speech,” Proc. IEEE, vol. 67, 1979.
J. Makhoul, T. H. Crystal, D. M. Green, D. Hogan, R. J. McAulay, D. B. Pisoni, R. D. Sorkin, and T. G. Stockham, Removal of Noise From Noise-Degraded Speech Signals, Washington, D.C.: National Academy Press, 1989.
S. F. Boll, “Speech enhancement in the 1980’s: noise suppression with pattern matching,” in Advances in Speech Signal Processing, S. Furui and M. M. Sondhi, Eds., New York: Marcel Dekker, 1992.
Y. Ephraim, “Statistical model based speech enhancement systems,” Proc. IEEE, vol. 80, 1992.
J. S. Lim, Ed., Speech Enhancement, Englewood Cliffs, N.J.: Prentice-Hall, 1983.

15.3 Analysis and Synthesis

Jesse W.
Fussell

After an acoustic speech signal is converted to an electrical signal by a microphone, it may be desirable to analyze the electrical signal to estimate some time-varying parameters which provide information about a model of the speech production mechanism. Speech analysis is the process of estimating such parameters. Similarly, given some parametric model of speech production and a sequence of parameters for that model, speech synthesis is the process of creating an electrical signal which approximates speech. While analysis and synthesis techniques may be applied either to the continuous signal or to a sampled version of the signal, most modern analysis and synthesis methods are based on digital signal processing.

A typical speech production model is shown in Fig. 15.6. In this model the output of the excitation function is scaled by the gain parameter and then filtered to produce speech. All of these functions are time-varying.

FIGURE 15.6 A general speech production model.

For many models, the parameters are varied at a periodic rate, typically 50 to 100 times per second. Most speech information is contained in the portion of the signal below about 4 kHz. The excitation is usually modeled as either a mixture or a choice of random noise and periodic waveform. For human speech, voiced excitation occurs when the vocal folds in the larynx vibrate; unvoiced excitation occurs at constrictions in the vocal tract which create turbulent air flow [Flanagan, 1965]. The relative mix of these two types of excitation is termed “voicing.” In addition, the periodic excitation is characterized by a fundamental frequency, termed pitch or F0. The excitation is scaled by a factor designed to produce the proper amplitude or level of the speech signal. The scaled excitation function is then filtered to produce the proper spectral characteristics. While the filter may be nonlinear, it is usually modeled as a linear function.
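A minimal sketch of the production model of Fig. 15.6 follows: an excitation (pulse train or noise), a gain, and an all-pole filter. The coefficient and pitch values a caller would pass in are illustrative assumptions, not values taken from the text.

```python
import numpy as np

def synthesize(n_samples, voiced, pitch_period, gain, a, seed=0):
    """Fig. 15.6 in miniature: excitation -> gain -> all-pole filter.

    a : all-pole feedback coefficients a(1..p); the output obeys
        s(n) = G e(n) - sum_k a(k) s(n - k)
    """
    rng = np.random.default_rng(seed)
    if voiced:
        excitation = np.zeros(n_samples)             # periodic pulse train
        excitation[::pitch_period] = 1.0
    else:
        excitation = rng.standard_normal(n_samples)  # random-noise excitation
    speech = np.zeros(n_samples)
    for n in range(n_samples):
        acc = gain * excitation[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * speech[n - k]            # all-pole feedback
        speech[n] = acc
    return speech
```

With no feedback coefficients the output is just the scaled excitation; with a single coefficient the pulses acquire exponentially decaying tails, a crude stand-in for vocal-tract resonance.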
Analysis of Excitation

In a simplified form, the excitation function may be considered to be purely periodic, for voiced speech, or purely random, for unvoiced. These two states correspond to voiced phonetic classes such as vowels and nasals and unvoiced sounds such as unvoiced fricatives. This binary voicing model is an oversimplification for sounds such as voiced fricatives, which consist of a mixture of periodic and random components. Figure 15.7 is an example of a time waveform of a spoken /i/ phoneme, which is well modeled by only periodic excitation.

Both time domain and frequency domain analysis techniques have been used to estimate the degree of voicing for a short segment or frame of speech. One time domain feature, termed the zero crossing rate, is the number of times the signal changes sign in a short interval. As shown in Fig. 15.7, the zero crossing rate for voiced sounds is relatively low. Since unvoiced speech typically has a larger proportion of high-frequency energy than voiced speech, the ratio of high-frequency to low-frequency energy is a frequency domain technique that provides information on voicing.

Another measure used to estimate the degree of voicing is the autocorrelation function, which is defined for a sampled speech segment, S, as

    ACF(τ) = (1/N) Σ_{n=0}^{N−1} s(n) s(n − τ)    (15.10)

where s(n) is the value of the nth sample within the segment of length N. Since the autocorrelation function of a periodic function is itself periodic, voicing can be estimated from the degree of periodicity of the autocorrelation function. Figure 15.8 is a graph of the nonnegative terms of the autocorrelation function for a 64-ms frame of the waveform of Fig. 15.7. Except for the decrease in amplitude with increasing lag, which results from the rectangular window function which delimits the segment, the autocorrelation function is seen to be quite periodic for this voiced utterance.

FIGURE 15.7 Waveform of a spoken phoneme /i/ as in beet.
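The zero crossing rate and the autocorrelation of Eq. (15.10) are straightforward to compute. In this sketch, samples before the start of the segment are treated as zero, an assumption the text leaves open.

```python
import numpy as np

def zero_crossing_rate(s):
    """Count of sign changes in the frame (relatively low for voiced speech)."""
    return int(np.sum(np.signbit(s[1:]) != np.signbit(s[:-1])))

def acf(s, tau):
    """Eq. (15.10): ACF(tau) = (1/N) sum_n s(n) s(n - tau).

    Samples before the start of the segment are treated as zero,
    so the sum effectively starts at n = tau.
    """
    n_len = len(s)
    return sum(s[n] * s[n - tau] for n in range(tau, n_len)) / n_len
```

For a frame that is periodic with period P, acf peaks at lags that are multiples of P, which is the periodicity cue described above.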
If an analysis of the voicing of the speech signal indicates that a voiced or periodic component is present, another step in the analysis process may be to estimate the frequency (or period) of the voiced component. There are a number of ways in which this may be done. One is to measure the time lapse between peaks in the time domain signal. For example, in Fig. 15.7 the major peaks are separated by about 0.0071 s, for a fundamental frequency of about 141 Hz. Note that it would be quite possible to err in the estimate of fundamental frequency by mistaking the smaller peaks that occur between the major peaks for the major peaks. These smaller peaks are produced by resonance in the vocal tract which, in this example, happens to be at about twice the excitation frequency. This type of error would result in an estimate of pitch approximately twice the correct frequency.

The distance between major peaks of the autocorrelation function is a closely related feature that is frequently used to estimate the pitch period. In Fig. 15.8, the distance between the major peaks in the autocorrelation function is about 0.0071 s. Estimates of pitch from the autocorrelation function are also susceptible to mistaking the first vocal tract resonance for the glottal excitation frequency.

The absolute magnitude difference function (AMDF), defined as

    AMDF(τ) = (1/N) Σ_{n=0}^{N−1} |s(n) − s(n − τ)|    (15.11)

is another function which is often used in estimating the pitch of voiced speech. An example of the AMDF is shown in Fig. 15.9 for the same 64-ms frame of the /i/ phoneme. Note, however, that the minima of the AMDF are used as indicators of the pitch period. The AMDF has been shown to be a good pitch period indicator [Ross et al., 1974] and does not require multiplications.

Fourier Analysis

One of the more common processes for estimating the spectrum of a segment of speech is the Fourier transform [Oppenheim and Schafer, 1975].
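Equation (15.11) and the minimum-picking step can be sketched as follows. The lag search range is an assumed parameter corresponding to a plausible span of pitch periods; as above, the sum effectively starts at n = τ.

```python
import numpy as np

def amdf(s, tau):
    """Eq. (15.11): AMDF(tau) = (1/N) sum_n |s(n) - s(n - tau)|.

    Uses only subtractions and absolute values -- no multiplications.
    """
    n_len = len(s)
    return sum(abs(s[n] - s[n - tau]) for n in range(tau, n_len)) / n_len

def pitch_period(s, min_lag, max_lag):
    """The AMDF minimum over a plausible lag range indicates the pitch period."""
    return min(range(min_lag, max_lag + 1), key=lambda t: amdf(s, t))
```

For a frame exactly periodic at lag P, the AMDF drops to zero at P, so the minimum search recovers the period directly.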
The Fourier transform of a sequence is mathematically defined as

    S(e^{jω}) = Σ_{n=−∞}^{∞} s(n) e^{−jωn}    (15.12)

where s(n) represents the terms of the sequence. The short-time Fourier transform of a sequence is a time-dependent function, defined as

    S_m(e^{jω}) = Σ_{n=−∞}^{∞} w(m − n) s(n) e^{−jωn}    (15.13)

where the window function w(n) is usually zero except for some finite range, and the variable m is used to select the section of the sequence for analysis.

FIGURE 15.8 Autocorrelation function of one frame of /i/.

The discrete Fourier transform (DFT) is obtained by uniformly sampling the short-time Fourier transform in the frequency dimension. Thus an N-point DFT is computed using Eq. (15.14),

    S(k) = Σ_{n=0}^{N−1} s(n) e^{−j2πnk/N}    (15.14)

where the set of N samples, s(n), may have first been multiplied by a window function. An example of the magnitude of a 512-point DFT of the waveform of the /i/ from Fig. 15.7 is shown in Fig. 15.10. Note that for this figure, the 512 points in the sequence have been multiplied by a Hamming window defined by

    w(n) = 0.54 − 0.46 cos(2πn/(N − 1)),   0 ≤ n ≤ N − 1
         = 0,                              otherwise         (15.15)

FIGURE 15.9 Absolute magnitude difference function of one frame of /i/.

FIGURE 15.10 Magnitude of 512-point FFT of Hamming windowed /i/.

Since the spectral characteristics of speech may change dramatically in a few milliseconds, the length, type, and location of the window function are important considerations. If the window is too long, changing spectral characteristics may cause a blurred result; if the window is too short, spectral inaccuracies result. A Hamming window of 16 to 32 ms duration is commonly used for speech analysis.

Several characteristics of a speech utterance may be determined by examination of the DFT magnitude. In Fig. 15.10, the DFT of a voiced utterance contains a series of sharp peaks in the frequency domain.
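Equations (15.14) and (15.15) together give the windowed DFT magnitude plotted in Fig. 15.10. A direct sketch, using an FFT routine to evaluate the DFT:

```python
import numpy as np

def hamming(n_win):
    """Eq. (15.15): w(n) = 0.54 - 0.46 cos(2 pi n / (N - 1)), 0 <= n <= N - 1."""
    n = np.arange(n_win)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (n_win - 1))

def dft_magnitude(frame):
    """Eq. (15.14) applied to a Hamming-windowed frame; magnitude only."""
    return np.abs(np.fft.fft(frame * hamming(len(frame))))
```

The window tapers to 0.08 at its edges and reaches 1.0 at its center, which suppresses the spectral leakage a rectangular window would produce.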
These peaks, caused by the periodic sampling action of the glottal excitation, are separated by the fundamental frequency, which is about 141 Hz in this example. In addition, broader peaks can be seen, for example at about 300 Hz and at about 2300 Hz. These broad peaks, called formants, result from resonances in the vocal tract.

Linear Predictive Analysis

Given a sampled (discrete-time) signal s(n), a powerful and general parametric model for time series analysis is

    s(n) = −Σ_{k=1}^{p} a(k) s(n − k) + G Σ_{l=0}^{q} b(l) u(n − l)    (15.16)

where s(n) is the output and u(n) is the input (perhaps unknown). The model parameters are a(k) for k = 1, . . ., p; b(l) for l = 1, . . ., q; and G. b(0) is assumed to be unity. This model, described as an autoregressive moving average (ARMA) or pole-zero model, forms the foundation for the analysis method termed linear prediction. An autoregressive (AR) or all-pole model, for which all of the “b” coefficients except b(0) are zero, is frequently used for speech analysis [Markel and Gray, 1976].

In the standard AR formulation of linear prediction, the model parameters are selected to minimize the mean-squared error between the model and the speech data. In one of the variants of linear prediction, the autocorrelation method, the minimization is carried out for a windowed segment of data. In the autocorrelation method, minimizing the mean-square error of the time domain samples is equivalent to minimizing the integrated ratio of the signal spectrum to the spectrum of the all-pole model. Thus, linear predictive analysis is a good method for spectral analysis whenever the signal is produced by an all-pole system. Most speech sounds fit this model well.

One key consideration for linear predictive analysis is the order of the model, p. For speech, if the order is too small, the formant structure is not well represented. If the order is too large, pitch pulses as well as formants begin to be represented. Tenth- or twelfth-order analysis is typical for speech.
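A common way to solve the autocorrelation method's normal equations is the Levinson-Durbin recursion. The sketch below follows the sign convention of Eq. (15.16) with all “b” terms zero; it is a generic textbook recursion, not code from this chapter.

```python
import numpy as np

def levinson(r, p):
    """Levinson-Durbin recursion for the autocorrelation normal equations.

    Returns (a, err): a[0] = 1 and a[1..p] are the coefficients of the
    all-pole model s(n) = -sum_k a(k) s(n - k) + G u(n); err is the
    residual prediction-error energy.
    """
    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                       # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update lower-order coefficients
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def lpc(frame, order):
    """Autocorrelation method: correlate the (windowed) frame, then recurse."""
    r = np.array([np.dot(frame[: len(frame) - k], frame[k:])
                  for k in range(order + 1)])
    return levinson(r, order)
```

For a first-order AR signal with coefficient 0.6 the recursion recovers a(1) ≈ −0.6, consistent with the sign convention above.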
Figures 15.11 and 15.12 provide examples of the spectrum produced by eighth-order and sixteenth-order linear predictive analysis of the /i/ waveform of Fig. 15.7. Figure 15.11 shows there to be three formants at frequencies of about 300, 2300, and 3200 Hz, which are typical for an /i/.

Homomorphic (Cepstral) Analysis

For the speech model of Fig. 15.6, the excitation and filter impulse response are convolved to produce the speech. One of the problems of speech analysis is to separate, or deconvolve, the speech into these two components. One such technique is called homomorphic filtering [Oppenheim and Schafer, 1968]. The characteristic system for homomorphic deconvolution converts a convolution operation to an addition operation. The output of such a characteristic system is called the complex cepstrum. The complex cepstrum is defined as the inverse Fourier transform of the complex logarithm of the Fourier transform of the input. If the input sequence is minimum phase (i.e., the z-transform of the input sequence has no poles or zeros outside the unit circle), the sequence can be represented by the real portion of the transforms. Thus, the real cepstrum can be computed by calculating the inverse Fourier transform of the log-spectrum of the input.

Figure 15.13 shows an example of the cepstrum for the voiced /i/ utterance from Fig. 15.7. The cepstrum of such a voiced utterance is characterized by relatively large values in the first one or two milliseconds as well as by pulses of decaying amplitudes at multiples of the pitch period. The first two of these pulses can clearly be seen in Fig. 15.13 at time lags of 7.1 and 14.2 ms. The locations and amplitudes of these pulses may be used to estimate pitch and voicing [Rabiner and Schafer, 1978].
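The real cepstrum described above can be computed directly from FFTs, and a cepstral-peak search in an assumed lag range gives a pitch estimate. The small floor added before the logarithm is a numerical safeguard, not part of the definition.

```python
import numpy as np

def real_cepstrum(frame, nfft=512):
    """Real cepstrum: inverse FFT of the log magnitude of the frame's FFT."""
    mag = np.abs(np.fft.fft(frame, nfft))
    return np.fft.ifft(np.log(mag + 1e-12)).real  # tiny floor avoids log(0)

def cepstral_pitch_period(frame, fs, fmin=60.0, fmax=400.0):
    """Lag (in seconds) of the largest cepstral peak in a pitch-like range."""
    cep = real_cepstrum(frame)
    lo, hi = int(fs / fmax), int(fs / fmin)
    return (lo + int(np.argmax(cep[lo:hi]))) / fs
```

For a pulse train with period 64 samples at an 8-kHz rate, the dominant cepstral peak in the searched range sits at lag 64, i.e., a period of 8 ms (125 Hz).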
In addition to pitch and voicing estimation, a smooth log magnitude function may be obtained by windowing, or “liftering,” the cepstrum to eliminate the terms which contain the pitch information. Figure 15.14 is one such smoothed spectrum. It was obtained from the DFT of the cepstrum of Fig. 15.13 after first setting all terms of the cepstrum to zero except for the first 16.

Speech Synthesis

Speech synthesis is the creation of speech-like waveforms from textual words or symbols. In general, the speech synthesis process may be divided into three levels of processing [Klatt, 1982]. The first level transforms the text into a series of acoustic phonetic symbols, the second transforms those symbols to smoothed synthesis parameters, and the third level generates the speech waveform from the parameters. While speech synthesizers have been designed for a variety of languages, and the processes described here apply to several languages, the examples given are for English text-to-speech.

FIGURE 15.11 Eighth-order linear predictive analysis of an “i”.

FIGURE 15.12 Sixteenth-order linear predictive analysis of an “i”.

In the first level of processing, abbreviations such as “Dr.” (which could mean “doctor” or “drive”), numbers (“1492” could be a year or a quantity), special symbols such as “$”, upper case acronyms (e.g., NASA), and nonspoken symbols such as “’” (apostrophe) are converted to a standard form. Next, prefixes and perhaps suffixes are removed from the body of words prior to searching for the root word in a lexicon, which defines the phonetic representation for the word. The lexicon includes words which do not obey the normal rules of pronunciation, such as “of”. If the word is not contained in the lexicon, it is processed by an algorithm which contains a large set of rules of pronunciation.

In the second level, the sequences of words consisting of phrases or sentences are analyzed for grammar and syntax.
This analysis provides information to another set of rules which determine the stress, duration, and pitch to be added to the phonemic representation. This level of processing may also alter the phonemic representation of individual words to account for coarticulation effects. Finally, the sequences of parameters which specify the pronunciation are smoothed in an attempt to mimic the smooth movements of the human articulators (lips, jaw, velum, and tongue).

The last processing level converts the smoothed parameters into a time waveform. Many varieties of waveform synthesizers have been used, including formant, linear predictive, and filter-bank versions. These waveform synthesizers generally correspond to the synthesizers used in speech coding systems, which are described in the first section of this chapter.

FIGURE 15.13 Real cepstrum of /i/.

FIGURE 15.14 Smoothed spectrum of /i/ from 16 points of cepstrum.

Defining Terms

Cepstrum: Inverse Fourier transform of the logarithm of the Fourier power spectrum of a signal. The complex cepstrum is the inverse Fourier transform of the complex logarithm of the Fourier transform of the signal.
Pitch: Frequency of glottal vibration of a voiced utterance.
Spectrum or power density spectrum: Amplitude of a signal as a function of frequency, frequently defined as the Fourier transform of the autocovariance of the signal.
Speech analysis: Process of extracting time-varying parameters from the speech signal which represent a model for speech production.
Speech synthesis: Production of a speech signal from a model for speech production and a set of time-varying parameters of that model.
Voicing: Classification of a speech segment as being voiced (i.e., produced by glottal excitation), unvoiced (i.e., produced by turbulent air flow at a constriction), or some mix of those two.

Related Topic

14.1 Fourier Transforms

References

J.
Allen, “Synthesis of speech from unrestricted text,” Proc. IEEE, vol. 64, no. 4, pp. 433–442, 1976. J. L. Flanagan, Speech Analysis, Synthesis and Perception, Berlin: Springer-Verlag, 1965. D. H. Klatt, “The Klattalk Text-to-Speech System” IEEE Int. Conf. on Acoustics, Speech and Signal Proc., pp. 1589–1592, Paris, 1982. J. D. Markel and A. H. Gray, Jr., Linear Prediction of Speech, Berlin: Springer-Verlag, 1976. A. V. Oppenheim and R. W. Schafer, “Homomorphic analysis of speech,” IEEE Trans. Audio Electroacoust., pp. 221–226, 1968. A. V. Oppenheim and R.W. Schafer, Digital Signal Processing, Englewood Cliffs, N.J.: Prentice-Hall, 1975. D. O’Shaughnessy, Speech Communication, Reading, Mass.: Addison-Wesley, 1987. L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Englewood Cliffs, N.J.: Prentice-Hall, 1978. M. J. Ross, H .L. Shaffer, A. Cohen, R. Freudberg, and H. J. Manley, “Average magnitude difference function pitch extractor,” IEEE Trans. Acoustics, Speech and Signal Proc., vol. ASSP-22, pp. 353–362, 1974. R. W. Schafer and J. D. Markel, Speech Analysis, New York: IEEE Press, 1979. Further Information The monthly magazine IEEE Transactions on Signal Processing, formerly IEEE Transactions on Acoustics, Speech and Signal Processing, frequently contains articles on speech analysis and synthesis. In addition, the annual conference of the IEEE Signal Processing Society, the International Conference on Acoustics, Speech, and Signal Processing, is a rich source of papers on the subject. 15.4 Speech Recognition Lynn D.Wilcox and Marcia A. Bush Speech recognition is the process of translating an acoustic signal into a linguistic message. In certain applica- tions, the desired form of the message is a verbatim transcription of a sequence of spoken words. For example, in using speech recognition technology to automate dictation or data entry to a computer, transcription accuracy is of prime importance. 
In other cases, such as when speech recognition is used as an interface to a database query system or to index by keyword into audio recordings, word-for-word transcription is less critical. Rather, the message must contain only enough information to reliably communicate the speaker’s goal. The use of speech recognition technology to facilitate a dialog between person and computer is often referred to as “spoken language processing.”

Speech recognition by machine has proven an extremely difficult task. One complicating factor is that, unlike written text, no clear spacing exists between spoken words; speakers typically utter full phrases or sentences without pause. Furthermore, acoustic variability in the speech signal typically precludes an unambiguous mapping to a sequence of words or subword units, such as phones.1 One major source of variability in speech is coarticulation, or the tendency for the acoustic characteristics of a given speech sound or phone to differ depending upon the phonetic context in which it is produced. Other sources of acoustic variability include differences in vocal-tract size, dialect, speaking rate, speaking style, and communication channel.

Speech recognition systems can be constrained along a number of dimensions in order to make the recognition problem more tractable. Training the parameters of a recognizer to the speech of the user is one way of reducing variability and, thus, increasing recognition accuracy. Recognizers are categorized as speaker-dependent or speaker-independent, depending upon whether or not full training is required by each new user. Speaker-adaptive systems adjust automatically to the voice of a new talker, either on the basis of a relatively small amount of training data or on a continuing basis while the system is in use.

Recognizers can also be categorized by the speaking styles, vocabularies, and language models they accommodate. Isolated word recognizers require speakers to insert brief pauses between individual words. Continuous speech recognizers operate on fluent speech, but typically employ strict language models, or grammars, to limit the number of allowable word sequences. Wordspotters also accept fluent speech as input. However, rather than providing full transcription, wordspotters selectively locate relevant words or phrases in an utterance. Wordspotting is useful both in information-retrieval tasks based on keyword indexing and as an alternative to isolated word recognition in voice command applications.

Speech Recognition System Architecture

Figure 15.15 shows a block diagram of a speech recognition system. Speech is typically input to the system using an analog transducer, such as a microphone, and converted to digital form. Signal pre-processing consists of computing a sequence of acoustic feature vectors by processing the speech samples in successive time intervals. In some systems, a clustering technique known as vector quantization is used to convert these continuous-valued features to a sequence of discrete codewords drawn from a codebook of acoustic prototypes. Recognition of an unknown utterance involves transforming the sequence of feature vectors, or codewords, into an appropriate message. The recognition process is typically constrained by a set of acoustic models which correspond to the basic units of speech employed in the recognizer, a lexicon which defines the vocabulary of the recognizer in terms of these basic units, and a language model which specifies allowable sequences of vocabulary items. The acoustic models, and in some cases the language model and lexicon, are learned from a set of representative training data.

1 Phones correspond roughly to pronunciations of consonants and vowels.

FIGURE 15.15 Architecture for a speech recognition system.
These components are discussed in greater detail in the remainder of this chapter, as are the two recognition paradigms most frequently employed in speech recognition: dynamic time warping and hidden Markov models.

Signal Pre-Processing

An amplitude waveform and speech spectrogram of the sentence “Two plus seven is less than ten” are shown in Fig. 15.16. The spectrogram represents the time evolution (horizontal axis) of the frequency spectrum (vertical axis) of the speech signal, with darkness corresponding to high energy. In this example, the speech has been digitized at a sampling rate of 16 kHz, or roughly twice the highest frequency of relevant energy in a high-quality speech signal. In general, the appropriate sampling rate is a function of the communication channel. In telecommunications, for example, a bandwidth of 4 kHz, and, thus, a Nyquist sampling rate of 8 kHz, is standard.

The speech spectrum can be viewed as the product of a source spectrum and the transfer function of a linear, time-varying filter which represents the changing configuration of the vocal tract. The transfer function determines the shape, or envelope, of the spectrum, which carries phonetic information in speech. When excited by a voicing source, the formants, or natural resonant frequencies of the vocal tract, appear as black bands running horizontally through regions of the speech spectrogram. These regions represent voiced segments of speech and correspond primarily to vowels. Regions characterized by broadband high-frequency energy, and by extremely low energy, result from noise excitation and vocal-tract closures, respectively, and are associated with the articulation of consonantal sounds.

Feature extraction for speech recognition involves computing sequences of numeric measurements, or feature vectors, which typically approximate the envelope of the speech spectrum.
Spectral features can be extracted directly from the discrete Fourier transform (DFT) or computed using linear predictive coding (LPC) techniques. Cepstral analysis can also be used to deconvolve the spectral envelope and the periodic voicing source.

Each feature vector is computed from a frame of speech data defined by windowing N samples of the signal. While a better spectral estimate can be obtained using more samples, the interval must be short enough so that the windowed signal is roughly stationary. For speech data, N is chosen such that the length of the interval covered by the window is approximately 25 to 30 msec. The feature vectors are typically computed at a frame rate of 10 to 20 msec by shifting the window forward in time. Tapered windowing functions, such as the Hamming window, are used to reduce dependence of the spectral estimate on the exact temporal position of the window. Spectral features are often augmented with a measure of the short-time energy of the signal, as well as with measures of energy and spectral change over time [Lee, 1988].

FIGURE 15.16 Speech spectrogram of the utterance “Two plus seven is less than ten.” (Source: V. W. Zue, “The use of speech knowledge in automatic speech recognition,” Proc. IEEE, vol. 73, no. 11, pp. 1602–1615, © 1985 IEEE. With permission.)

For recognition systems which use discrete features, vector quantization can be used to quantize continuous-valued feature vectors into a set or codebook of K discrete symbols, or codewords [Gray, 1984]. The K codewords are characterized by prototypes y1 . . . yK. A feature vector x is quantized to the kth codeword if the distance from x to yk, d(x, yk), is less than the distance from x to any other codeword. The distance d(x, y) depends on the type of features being quantized. For features derived from the short-time spectrum and cepstrum, this distance is typically Euclidean or weighted Euclidean.
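As a concrete sketch of this codebook search (the codebook values below are invented for illustration, not taken from a trained system), nearest-codeword quantization under a Euclidean distance can be written as:

```python
import numpy as np

def quantize(x, codebook):
    """Return the index k of the codeword y_k nearest to feature vector x.

    codebook is a (K, D) array of prototypes y_1 ... y_K; the distance
    d(x, y_k) is Euclidean, as is typical for spectral or cepstral features.
    """
    distances = np.linalg.norm(codebook - x, axis=1)  # d(x, y_k) for all k
    return int(np.argmin(distances))

# Toy 2-D codebook with K = 3 prototypes.
codebook = np.array([[0.0, 0.0],
                     [1.0, 1.0],
                     [4.0, 0.0]])
print(quantize(np.array([0.9, 1.2]), codebook))  # nearest prototype: index 1
```

In a real recognizer the codebook itself would first be trained by a clustering procedure over a large corpus of feature vectors.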
For LPC-based features, the Itakura metric, which is based on spectral distortion, is typically used [Furui, 1989].

Dynamic Time Warping

Dynamic time warping (DTW) is a technique for nonlinear time alignment of pairs of spoken utterances. DTW-based speech recognition, often referred to as “template matching,” involves aligning feature vectors extracted from an unknown utterance with those from a set of exemplars or templates obtained from training data. Nonlinear feature alignment is necessitated by nonlinear time-scale warping associated with variations in speaking rate.

Figure 15.17 illustrates the time correspondence between two utterances, A and B, represented as feature-vector sequences of unequal length. The time warping function consists of a sequence of points F = c1, . . ., cK in the plane spanned by A and B, where ck = (ik, jk). The local distance between the feature vectors ai and bj on the warping path at point c = (i, j) is given as

d(c) = d(ai, bj)    (15.17)

The distance between A and B aligned with warping function F is a weighted sum of the local distances along the path,

D(F) = (1/N) Σ_{k=1}^{K} d(ck) wk    (15.18)

where wk is a nonnegative weighting function and N is the sum of the weights. Path constraints and weighting functions are chosen to control whether or not the distance D(F) is symmetric and the allowable degree of warping in each direction. Dynamic programming is used to efficiently determine the optimal time alignment between two feature-vector sequences [Sakoe and Chiba, 1978].

FIGURE 15.17 Dynamic time warping of utterances A and B. (Source: S. Furui, Digital Speech Processing, Synthesis and Recognition, New York: Marcel Dekker, 1989. With permission.)

In DTW-based recognition, one or more templates are generated for each word in the recognition vocabulary.
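A minimal dynamic-programming evaluation of the alignment distance D(F) of Eq. (15.18) might look as follows; in this sketch every step weight wk is taken as 1 (so the normalizer N is simply the path length), and the slope constraints used in practice are omitted for brevity:

```python
import numpy as np

def dtw_distance(A, B):
    """Length-normalized DTW distance between feature-vector sequences.

    A is an (I, D) array, B a (J, D) array.  The local distance d(a_i, b_j)
    is Euclidean; allowable path moves are diagonal, horizontal, vertical.
    """
    I, J = len(A), len(B)
    cost = np.full((I + 1, J + 1), np.inf)        # accumulated distance
    length = np.zeros((I + 1, J + 1), dtype=int)  # path length, for N
    cost[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])
            # best predecessor among (i-1, j-1), (i-1, j), (i, j-1)
            c, n = min((cost[i - 1, j - 1], length[i - 1, j - 1]),
                       (cost[i - 1, j],     length[i - 1, j]),
                       (cost[i,     j - 1], length[i,     j - 1]))
            cost[i, j] = c + d
            length[i, j] = n + 1
    return cost[I, J] / length[I, J]  # normalize by the number of steps
```

Identical sequences align along the diagonal with zero distance, and a time-stretched copy of an utterance still aligns closely, which is exactly the robustness to speaking-rate variation that DTW provides.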
For speaker-dependent recognition tasks, templates are typically created by aligning and averaging the feature vectors corresponding to several repetitions of a word. For speaker-independent tasks, clustering techniques can be used to generate templates which better model pronunciation variability across talkers.

In isolated word recognition, the distance D(F) is computed between the feature-vector sequence for the unknown word and the templates corresponding to each vocabulary item. The unknown is recognized as that word for which D(F) is a minimum. DTW can be extended to connected word recognition by aligning the input utterance to all possible concatenations of reference templates. Efficient algorithms for computing such alignments have been developed [Furui, 1989]; however, in general, DTW has proved most applicable to isolated word recognition tasks.

Hidden Markov Models1

Hidden Markov modeling is a probabilistic pattern matching technique which is more robust than DTW at modeling acoustic variability in speech and more readily extensible to continuous speech recognition. As shown in Fig. 15.18, hidden Markov models (HMMs) represent speech as a sequence of states, which are assumed to model intervals of speech with roughly stationary acoustic features. Each state is characterized by an output probability distribution which models variability in the spectral features or observations associated with that state. Transition probabilities between states model durational variability in the speech signal. The probabilities, or parameters, of an HMM are trained using observations (VQ codewords) extracted from a representative sample of speech data. Recognition of an unknown utterance is based on the probability that the speech was generated by the HMM. More precisely, an HMM is defined by:

1. A set of N states {S1 . . . SN}, where qt is the state at time t.
2. A set of K observation symbols {v1 . . . vK}, where Ot is the observation at time t.
3. A state transition probability matrix A = {aij}, where the probability of transitioning from state Si at time t to state Sj at time t + 1 is aij = P(qt+1 = Sj | qt = Si).
4. A set of output probability distributions B, where for each state j, bj(k) = P(Ot = vk | qt = Sj).
5. An initial state distribution π = {πi}, where πi = P(q1 = Si).

At each time t a transition to a new state is made, and an observation is generated. State transitions have the Markov property, in that the probability of transitioning to a state at time t depends only on the state at time t – 1. The observations are conditionally independent given the state, and the transition probabilities do not depend on time. The model is called hidden because the identity of the state at time t is unknown; only the output of the state is observed. It is common to specify an HMM by its parameters λ = (A, B, π).

1 Although the discussion here is limited to HMMs with discrete observations, output distributions such as Gaussians can be defined for continuous-valued features.

FIGURE 15.18 A typical HMM topology.

The basic acoustic unit modeled by the HMM can be either a word or a subword unit. For small recognition vocabularies, the lexicon typically consists of whole-word models similar to the model shown in Fig. 15.18. The number of states in such a model can either be fixed or be made to depend on word duration. For larger vocabularies, words are more often defined in the lexicon as concatenations of phone or triphone models. Triphones are phone models with left and right context specified [Lee, 1988]; they are used to model acoustic variability which results from the coarticulation of adjacent speech sounds.

In isolated word recognition tasks, an HMM is created for each word in the recognition vocabulary.
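The five elements above define a generative model. As a sketch (the state count, symbol count, and probability values below are invented for illustration), an observation sequence can be drawn from a discrete HMM as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-state, 3-symbol HMM, lambda = (A, B, pi).
A  = np.array([[0.7, 0.3],        # a_ij = P(q_{t+1} = S_j | q_t = S_i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],   # b_j(k) = P(O_t = v_k | q_t = S_j)
               [0.1, 0.3, 0.6]])
pi = np.array([0.8, 0.2])         # pi_i = P(q_1 = S_i)

def generate(T):
    """Sample a hidden state sequence and its observation sequence."""
    states, observations = [], []
    q = rng.choice(2, p=pi)                              # initial state
    for _ in range(T):
        states.append(int(q))
        observations.append(int(rng.choice(3, p=B[q])))  # emit a symbol
        q = rng.choice(2, p=A[q])                        # Markov transition
    return states, observations

states, obs = generate(10)
```

Only `obs` would be visible to a recognizer; the sequence `states` remains hidden, which is what makes the model “hidden.”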
In continuous speech recognition, on the other hand, a single HMM network is generated by expressing allowable word strings or sentences as concatenations of word models, as shown in Fig. 15.19. In wordspotting, the HMM network consists of a parallel connection of keyword models and a background model which represents the speech within which the keywords are embedded. Background models, in turn, typically consist of parallel connections of subword acoustic units such as phones [Wilcox and Bush, 1992].

The language model or grammar of a recognition system defines the sequences of vocabulary items which are allowed. For simple tasks, deterministic finite-state grammars can be used to define all allowable word sequences. Typically, however, recognizers make use of stochastic grammars based on n-gram statistics [Jelinek, 1985]. A bigram language model, for example, specifies the probability of a vocabulary item given the item which precedes it.

Isolated word recognition using HMMs involves computing, for each word in the recognition vocabulary, the probability P(O|λ) of the observation sequence O = O1 . . . OT. The unknown utterance is recognized as the word which maximizes this probability. The probability P(O|λ) is the sum over all possible state sequences Q = q1 . . . qT of the probability of O and Q given λ, or

P(O|λ) = Σ_Q P(O, Q|λ) = Σ_Q P(O|Q, λ) P(Q|λ) = Σ_{q1 . . . qT} π_{q1} b_{q1}(O1) a_{q1 q2} b_{q2}(O2) . . .    (15.19)

FIGURE 15.19 Language model, lexicon, and HMM phone models for a continuous speech recognition system. (Source: K. F. Lee, “Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System,” Ph.D. Dissertation, Computer Science Dept., Carnegie Mellon, April 1988. With permission.)

Direct computation of this sum is computationally infeasible for even a moderate number of states and observations.
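For a toy model, the sum of Eq. (15.19) can be evaluated by enumerating all N^T state sequences (the model parameters below are invented for illustration); the N^T growth of the enumeration is precisely what makes this direct approach impractical:

```python
import itertools
import numpy as np

# Illustrative discrete HMM, lambda = (A, B, pi).
A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probabilities a_ij
B  = np.array([[0.5, 0.4, 0.1],           # output distributions b_j(k)
               [0.1, 0.3, 0.6]])
pi = np.array([0.8, 0.2])                 # initial state distribution

def prob_obs_brute_force(O):
    """P(O|lambda): sum of P(O, Q|lambda) over every state sequence Q."""
    N, T = len(pi), len(O)
    total = 0.0
    for Q in itertools.product(range(N), repeat=T):  # all N**T sequences
        p = pi[Q[0]] * B[Q[0], O[0]]
        for t in range(1, T):
            p *= A[Q[t - 1], Q[t]] * B[Q[t], O[t]]
        total += p
    return total
```

As a sanity check, these probabilities summed over every possible observation sequence of a fixed length add to 1.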
However, an iterative algorithm known as the forward-backward procedure [Rabiner, 1989] makes this computation possible. Defining the forward variable α as

αt(i) = P(O1 . . . Ot, qt = Si | λ)    (15.20)

and initializing α1(i) = πi bi(O1), subsequent αt(i) are computed inductively as

αt+1(j) = [Σ_{i=1}^{N} αt(i) aij] bj(Ot+1)    (15.21)

By definition, the desired probability of the observation sequence given the model λ is

P(O|λ) = Σ_{i=1}^{N} αT(i)    (15.22)

Similarly, the backward variable β can be defined

βt(i) = P(Ot+1 . . . OT | qt = Si, λ)    (15.23)

The βs are computed inductively backward in time by first initializing βT(j) = 1 and computing

βt(i) = Σ_{j=1}^{N} aij bj(Ot+1) βt+1(j)    (15.24)

HMM-based continuous speech recognition involves determining an optimal word sequence using the Viterbi algorithm. This algorithm uses dynamic programming to find the optimal state sequence through an HMM network representing the recognizer vocabulary and grammar. The optimal state sequence Q* = (q1* … qT*) is defined as the sequence which maximizes P(Q|O, λ), or equivalently P(Q, O|λ). Let δt(i) be the joint probability of the optimal state sequence and the observations up to time t, ending in state Si at time t. Then

δt(i) = max P(q1 . . . qt–1, qt = Si, O1 . . . Ot | λ)    (15.25)

where the maximum is over all state sequences q1 . . . qt–1. This probability can be updated recursively by extending each partial optimal path using

δt+1(j) = [max_{i} δt(i) aij] bj(Ot+1)    (15.26)

At each time t, it is necessary to keep track of the optimal precursor to state j, that is, the state which maximized the above probability. Then, at the end of the utterance, the optimal state sequence can be retrieved by backtracking through the precursor list.

Training HMM-based recognizers involves estimating the parameters for the word or phone models used in the system. As with DTW, several repetitions of each word in the recognition vocabulary are used to train HMM-based isolated word recognizers.
For continuous speech recognition, word or phone exemplars are typically extracted from word strings or sentences [Lee, 1988]. Parameters for the models are chosen based on a maximum likelihood criterion; that is, the parameters λ maximize the likelihood of the training data O, P(O|λ). This maximization is performed using the Baum-Welch algorithm [Rabiner, 1989], a re-estimation technique based on first aligning the training data O with the current models, and then updating the parameters of the models based on this alignment.

Let ξt(i, j) be the probability of being in state Si at time t and state Sj at time t + 1, given the observation sequence O. Using the forward and backward variables αt(i) and βt(j), ξt(i, j) can be written as

ξt(i, j) = P(qt = Si, qt+1 = Sj | O, λ) = αt(i) aij bj(Ot+1) βt+1(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} αt(i) aij bj(Ot+1) βt+1(j)    (15.27)

An estimate of aij is given by the expected number of transitions from state Si to state Sj divided by the expected number of transitions from state Si. Define γt(i) as the probability of being in state Si at time t, given the observation sequence O,

γt(i) = P(qt = Si | O, λ) = Σ_{j=1}^{N} ξt(i, j)    (15.28)

Summing γt(i) over t yields a quantity which can be interpreted as the expected number of transitions from state Si. Summing ξt(i, j) over t gives the expected number of transitions from state i to state j. An estimate of aij can then be computed as the ratio of these two sums. Similarly, an estimate of bj(k) is obtained as the expected number of times being in state j and observing symbol vk divided by the expected number of times in state j:

âij = Σ_{t=1}^{T–1} ξt(i, j) / Σ_{t=1}^{T–1} γt(i),    b̂j(k) = Σ_{t: Ot = vk} γt(j) / Σ_{t=1}^{T} γt(j)    (15.29)

State-of-the-Art Recognition Systems

Dictation-oriented recognizers which accommodate isolated word vocabularies of many thousands of words in a speaker-adaptive manner are currently available commercially.
So too are speaker-independent, continuous speech recognizers for small vocabularies, such as the digits; similar products for larger (1000-word) vocabularies with constrained grammars are imminent. Speech recognition research is aimed, in part, at the development of more robust pattern classification techniques, including some based on neural networks [Lippmann, 1989], and at the development of systems which accommodate more natural spoken language dialogs between human and machine.

Defining Terms

Baum-Welch: A re-estimation technique for computing optimal values for HMM state transition and output probabilities.

Continuous speech recognition: Recognition of fluently spoken utterances.

Dynamic time warping (DTW): A recognition technique based on nonlinear time alignment of unknown utterances with reference templates.

Forward-backward: An efficient algorithm for computing the probability of an observation sequence from an HMM.

Hidden Markov model (HMM): A stochastic model which uses state transition and output probabilities to generate observation sequences.

Isolated word recognition: Recognition of words or short phrases preceded and followed by silence.

Signal pre-processing: Conversion of an analog speech signal into a sequence of numeric feature vectors or observations.

Viterbi: An algorithm for finding the optimal state sequence through an HMM given a particular observation sequence.

Wordspotting: Detection or location of keywords in the context of fluent speech.

References

S. Furui, Digital Speech Processing, Synthesis, and Recognition, New York: Marcel Dekker, 1989.
R. M. Gray, “Vector quantization,” IEEE ASSP Magazine, vol. 1, no. 2, pp.
4–29, April 1984.
F. Jelinek, “The development of an experimental discrete dictation recognizer,” Proc. IEEE, vol. 73, no. 11, pp. 1616–1624, Nov. 1985.
K. F. Lee, “Large-Vocabulary Speaker-Independent Continuous Speech Recognition: The SPHINX System,” Ph.D. Dissertation, Computer Science Department, Carnegie Mellon University, April 1988.
R. P. Lippmann, “Review of neural networks for speech recognition,” Neural Computation, vol. 1, pp. 1–38, 1989.
L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–285, Feb. 1989.
H. Sakoe and S. Chiba, “Dynamic programming algorithm optimization for spoken word recognition,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26, no. 1, pp. 43–49, Feb. 1978.
L. D. Wilcox and M. A. Bush, “Training and search algorithms for an interactive wordspotting system,” in Proceedings, International Conference on Acoustics, Speech and Signal Processing, San Francisco, March 1992, pp. II-97–II-100.
V. W. Zue, “The use of speech knowledge in automatic speech recognition,” Proc. IEEE, vol. 73, no. 11, pp. 1602–1615, Nov. 1985.

Further Information

Papers on speech recognition are regularly published in the IEEE Transactions on Speech and Audio Processing (formerly part of the IEEE Transactions on Acoustics, Speech and Signal Processing) and in the journal Computer Speech and Language. Speech recognition research and technical exhibits are presented at the annual IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), the biennial European Conference on Speech Communication and Technology (Eurospeech), and the biennial International Conference on Spoken Language Processing (ICSLP), all of which publish proceedings. Commercial applications of speech recognition technology are featured at annual American Voice Input-Output Society (AVIOS) and Speech Systems Worldwide meetings.
A variety of standardized databases for speech recognition system development are available from the National Institute of Standards and Technology in Gaithersburg, MD.

15.5 Large Vocabulary Continuous Speech Recognition

Yuqing Gao, Bhuvana Ramabhadran, and Michael Picheny

Speech recognition is the process of converting an acoustic signal to a textual message. High recognition accuracy is of prime importance in order for a speech interface to be of any practical use in a dictation task, or any kind of intelligent human–machine interaction. Speech recognition is made extremely difficult by co-articulation, variations in speaking styles, rates, vocal-tract size across speakers, and communication channels. Speech research has been underway for over four decades, and many problems have been addressed and solved fully or partially. High performance can be achieved on tasks such as isolated word recognition, small and middle vocabulary recognition, and recognition of speech in nonadverse conditions. Large vocabulary (over 30K words), speaker-independent, continuous speech recognition has been one of the major research targets for years. Although high recognition accuracies have been achieved for some large vocabulary tasks [7], significant challenges emerge as more and more applications make themselves viable for speech input.

Continuous Speech Recognition

Continuous speech recognition is significantly more difficult than isolated word recognition. Its complexity stems from the following three properties of continuous speech.

1. Word boundaries are unclear in continuous speech, whereas in isolated word recognition they are well-known and can be used to improve the accuracy and limit the search. For example, in the phrase “this ship,” the /s/ of “this” is often omitted. Similarly, in “we were away a year,” the whole sentence is one long vocalic segment, and the word boundaries are difficult to locate.

2.
Co-articulatory effects are much stronger than in isolated speech. Although we try to pronounce words as concatenated sequences of individual speech sounds (phones), our articulators possess inertia which retards their motion from one position to another. As a result, a phone is strongly influenced by the previous and the following phones. This effect occurs both within single words and between words, and is aggravated as the speaking rate increases.

3. Function words (articles, prepositions, pronouns, short verbs, etc.) tend to be poorly articulated. In particular, the phones are often shortened, skipped, or deleted.

As a result, speech recognition error rates increase drastically from isolated word to continuous speech. Moreover, the processing power needed to recognize continuous speech increases as well. The primary advantages of continuous speech are two-fold. First, typical speaking rates for continuous speech are 140 to 170 words per minute, while isolated word mode speakers seldom exceed 70 words per minute. Second, continuous speech is a natural mode of human communication. Forcing pauses between words introduces artificiality and reduces user friendliness. The unnaturalness of isolated word speech breaks the speaker’s train of thought.

Large Vocabulary

In the 1990s, the term “large vocabulary” has come to mean 30K words or more. Although the vocabulary size is certainly not the best measure of a task’s difficulty, it does affect the severity of many problems such as the acoustic confusability of words, the degradation in performance due to using sub-word unit models, and the computational complexity of the hypothesis search. Clearly, the number of confusable words grows substantially with the vocabulary size. As the vocabulary size increases, it becomes impractical to model each word individually, because neither the necessary training data nor the requisite storage is available. Instead, models must be based on sub-word units.
These sub-word models usually lead to degraded performance because they fail to capture co-articulation effects as well as whole-word models do. Additionally, the computational complexity of the search requires the introduction of efficient search methods such as “fast match” [26], which reject all but the most plausible word hypotheses to limit the computation effort. Those word hypotheses which survive the “fast match” are then subjected to the full detailed analysis. Naturally, this process may introduce search errors, reducing the accuracy.

Some of the key engineering challenges in building speech recognition systems are selecting a proper set of sub-word units (e.g., phones), assembling units into words (baseforms), modeling co-articulation effects, accommodating the different stress patterns of different languages, and modeling pitch contours for tone-based languages such as Mandarin.

Overview of a Speech Recognition System

The general architecture of a typical speech recognition system is given in Fig. 15.20. The speech signal is typically input to the system via a microphone or a telephone. Signal preprocessing consists of computing a series of acoustic vectors by processing the speech signal at regular time intervals (frames), which are typically 10 ms long. These acoustic vectors are usually a set of parameters, such as LPC cepstra [23] or filter bank outputs (PLP [30], RASTA [28], etc.). In order to capture the change in these vectors over time, they have been augmented with their time derivatives or discriminant projection techniques (e.g., see LDA [10, 29]).

The recognizer consists of three parts: the acoustic model, the language model, and the hypothesis search. The recognition process involves the use of acoustic models over these feature vectors to label them with their phonetic class. The acoustic models usually used are Hidden Markov Models.
Artificial Neural Networks [16] or Dynamic Time Warping [17] based models have also been used, but will not be covered in this chapter section. Context-dependent acoustic models [9, 10] are obtained by querying the phonetic context using the concept of tri-phones or decision trees (networks) [2] that are constructed from a large amount of training data. A multidimensional Gaussian mixture model is used to model the feature vectors of the training data that have similar phonetic contexts. These models are then used as a set of observation densities in continuous Hidden Markov Models (HMMs). Each feature vector is labeled with the context-dependent phonetic class which is the closest acoustic class to the feature vector. A sequence of labels thus obtained is used to obtain a set of candidate words that are then pruned with the help of a language model. A language model bases its prediction of the next word on the history of the words preceding it. Finally, a hypothesis search is conducted through all possible sequences of hypothesized words to determine the optimal word sequence given the acoustic observations.

Several adaptation techniques have been proposed to derive speaker-dependent systems from the speaker-independent system described above. These techniques modify/tune the parameters of the acoustic models to the specific speaker.

Hidden Markov Models As Acoustic Models for Speech Recognition

There are many ways to characterize the temporal sequence of speech sounds as represented by a sequence of spectral observations. The most common way is to model the temporal sequence of spectra in terms of a Markov chain to describe the way one sound changes to another by imposing an explicitly probabilistic structure on the representation of the evolving sequence. If we denote the spectral vector at time t by Ot, the observed spectral sequence, lasting from t = 1 to t = T, is then represented by

O = {Ot}_{t=1}^{T} = (O1, O2, …, OT)

Consider a first-order N-state Markov chain as illustrated for N = 3 in Fig.
15.21. Such a random process has the simplest memory: the value at time t depends only on the value at the preceding time and on nothing that went on before. However, it has a very useful property that leads to its application to the speech recognition problem: the states of the chain generate observation sequences, while the state sequence itself is hidden from the observer. The system can be described as being in one of N distinct states, S_1, S_2, …, S_N, at any discrete time instant t. We use the state variable q_t to denote the state of the system at time t. Assume that the Markov chain is time invariant (homogeneous), so the transition probabilities do not depend on time. The Markov chain is then described by a state transition probability matrix A = [a_ij], where

a_ij = P(q_t = S_j | q_t–1 = S_i), 1 ≤ i, j ≤ N   (15.30)

The transition probabilities satisfy the following constraints:

a_ij ≥ 0   (15.31)

Σ_{j=1}^{N} a_ij = 1, 1 ≤ i ≤ N   (15.32)

FIGURE 15.20 General architecture of a speech recognition system.

© 2000 by CRC Press LLC

Assume that at the initial time, t = 0, the state of the system q_0 is specified by an initial state probability vector π^T = [π_1, π_2, …, π_N]. Then for any state sequence q = (q_0, q_1, q_2, …, q_T), where q_t ∈ {S_1, S_2, …, S_N}, the probability of q being generated by the Markov chain is

P(q | A, π) = π_q0 a_q0q1 a_q1q2 … a_qT–1qT   (15.33)

Suppose now that the state sequence q is a sequence of speech sounds and cannot be observed directly. Instead, observation O_t is produced with the system in some unobserved state q_t (where q_t ∈ {S_1, S_2, …, S_N}).
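As a concrete illustration of the constraints in Eqs. (15.31)–(15.32) and the sequence probability of Eq. (15.33), the following sketch uses a made-up 3-state transition matrix and initial distribution (the numbers are purely illustrative, not from the text):

```python
import numpy as np

# Illustrative 3-state chain; A and pi are invented numbers.
A = np.array([[0.6, 0.3, 0.1],   # a_ij = P(q_t = S_j | q_t-1 = S_i)
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
pi = np.array([0.5, 0.3, 0.2])   # initial state probabilities

# Each row of A sums to one, the constraint of Eq. (15.32).
assert np.allclose(A.sum(axis=1), 1.0)

def state_sequence_prob(q, A, pi):
    """Eq. (15.33): P(q | A, pi) = pi_q0 * a_q0q1 * ... * a_qT-1qT."""
    p = pi[q[0]]
    for t in range(1, len(q)):
        p *= A[q[t - 1], q[t]]
    return p

print(state_sequence_prob([0, 1, 2, 2], A, pi))  # 0.5 * 0.3 * 0.3 * 0.7 = 0.0315
```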
Assume that the production of O_t in each possible state S_i, i = 1, 2, …, N, is stochastic and is characterized by a set of observation probability measures B = {b_i(O_t)}, i = 1, …, N, where

b_i(O_t) = P(O_t | q_t = S_i)   (15.34)

If the state sequence q that led to the observation sequence O = (O_1, O_2, …, O_T) is known, the probability of O being generated by the system is assumed to be

P(O | q, B) = b_q1(O_1) b_q2(O_2) … b_qT(O_T)   (15.35)

Therefore, the joint probability of O and q being produced by the system is

P(O, q | A, B, π) = π_q0 Π_{t=1}^{T} a_qt–1qt b_qt(O_t)   (15.36)

The probability of producing the observation sequence O by the random process, without assuming knowledge of the state sequence, is

P(O | A, B, π) = Σ_q P(O, q | A, B, π) = Σ_q π_q0 Π_{t=1}^{T} a_qt–1qt b_qt(O_t)   (15.37)

FIGURE 15.21 A first-order three-state hidden Markov model.

Continuous Parameter Hidden Markov Models

The triple (π, A, B) defines a Hidden Markov Model (HMM). More specifically, a hidden Markov model is characterized by the following:

1. A state space {S_1, S_2, …, S_N}. Although the states are not explicitly observed, for many applications there is often some physical significance attached to them. In the case of speech recognition, a state often corresponds to a phone or a portion (initial, middle, or final) of a phone. We denote the state at time t as q_t.

2. A set of observations O = (O_1, O_2, …, O_T). The observations can be discrete symbols chosen from a finite set, or continuous signals (or vectors). In speech recognition applications, although it is possible to convert continuous speech representations into a sequence of discrete symbols via vector quantization codebooks and other methods, serious degradation tends to result from such discretization of the signal. In this article, we focus on HMMs with continuous observation output densities to model continuous signals.

3.
The initial state distribution π = {π_i}, in which

π_i = P(q_0 = S_i), 1 ≤ i ≤ N

4. The state transition probability distribution A = {a_ij}, defined in Eq. (15.30).

5. The observation probability distribution B = {b_j(O_t)}, defined in Eq. (15.34).

Given this form of HMM, the following three basic problems of interest must be solved for the model to be useful in applications.

Task 1 (Evaluation): Given the observation sequence O = (O_1, O_2, …, O_T) and a model λ = (π, A, B), how does one efficiently compute P(O | λ)?

Task 2 (Estimation): Given the observation sequence O = (O_1, O_2, …, O_T), how does one solve the inverse problem of estimating the parameters of λ?

Task 3 (Decoding): Given the observation sequence O and a model λ, how does one deduce the most likely state sequence q that is optimal in some sense or best explains the observations?

The Evaluation Problem

With unbounded computational power, Eq. (15.37) could be used to compute P(O | λ). However, it involves on the order of 2TN^T calculations, because the summation in Eq. (15.37) runs over N^{T+1} possible state sequences. This is computationally infeasible even for small values of N and T. An iterative algorithm known as the forward-backward procedure makes this computation possible. Defining the forward variable α as

α_t(i) = P(O_1, …, O_t, q_t = S_i | λ)   (15.38)

and initializing α_1(i) = π_i b_i(O_1), subsequent α_t(i) are computed inductively as

α_t+1(j) = [Σ_{i=1}^{N} α_t(i) a_ij] b_j(O_t+1)   (15.39)

By definition, the desired probability of the observation sequence given the model λ is

P(O | λ) = Σ_{i=1}^{N} α_T(i)   (15.40)

An alternative is to use the backward procedure, defining the backward variable β as

β_t(i) = P(O_t+1, O_t+2, …, O_T | q_t = S_i, λ)   (15.41)

β_t(i) is the probability of the partial observation sequence from t + 1 to the end T, given state S_i and model λ. The initial values are β_T(i) = 1 for all i.
The values at times T – 1, T – 2, …, 1 can then be computed inductively:

β_t(i) = Σ_{j=1}^{N} a_ij b_j(O_t+1) β_t+1(j)   (15.42)

The probability of the observation sequence given the model λ can be expressed in terms of the forward and backward probabilities:

P(O | λ) = Σ_{i=1}^{N} α_t(i) β_t(i) = Σ_{i=1}^{N} α_T(i)   (15.43)

The forward and backward variables are very useful and will be used in the next section.

The Estimation Problem

Given an observation sequence, or a set of sequences (multiple utterances), the estimation problem is to find the "right" model parameter values, that is, those that specify a model most likely to produce the given sequence. In speech recognition, this is called training. There is no known closed-form analytic solution for the maximum likelihood model parameters. Nevertheless, we can choose λ = (π, A, B) such that its likelihood, P(O | λ), is locally maximized using an iterative procedure such as the Baum-Welch re-estimation method (a form of the EM [expectation-maximization] method [4]). The method introduces an auxiliary function Q(λ̂, λ) and maximizes it:

Q(λ̂, λ) = Σ_q P(O, q | λ) log P(O, q | λ̂)   (15.44)

The re-estimation technique consists of first aligning the training data O with the current models, and then updating the parameters of the models based on the alignment to obtain a new estimate λ̂. Let ζ_t(i, j) be the probability of being in state S_i at time t and state S_j at time t + 1, given the model λ and the observation sequence O:

ζ_t(i, j) = P(q_t = S_i, q_t+1 = S_j | O, λ) = P(q_t = S_i, q_t+1 = S_j, O | λ) / P(O | λ)   (15.45)

Using the forward and backward variables α_t(i) and β_t+1(j), ζ_t(i, j) can be written as

ζ_t(i, j) = α_t(i) a_ij b_j(O_t+1) β_t+1(j) / Σ_{i=1}^{N} Σ_{j=1}^{N} α_t(i) a_ij b_j(O_t+1) β_t+1(j)   (15.46)

An estimate of a_ij is given by the expected number of transitions from state S_i to state S_j, divided by the expected number of transitions from state S_i.
Define γ_t(i) as the probability of being in state S_i at time t, given the observation sequence O:

γ_t(i) = P(q_t = S_i | O, λ) = Σ_{j=1}^{N} ζ_t(i, j)   (15.47)

Summing γ_t(i) over t yields a quantity that can be interpreted as the expected number of transitions from state S_i. Summing ζ_t(i, j) over t gives the expected number of transitions from state S_i to S_j. An estimate of a_ij can be computed as the ratio of these two sums:

â_ij = Σ_{t=1}^{T–1} ζ_t(i, j) / Σ_{t=1}^{T–1} γ_t(i)   (15.48)

For the discrete observation case, an estimate of b_j(k) is obtained as the expected number of times of being in state S_j and observing symbol v_k, divided by the expected number of times in state S_j:

b̂_j(k) = Σ_{t: O_t = v_k} γ_t(j) / Σ_{t=1}^{T} γ_t(j)   (15.49)

The most general representation of a continuous observation density for HMMs is a finite mixture of the form

b_j(O) = Σ_{k=1}^{M} c_jk N(O, μ_jk, U_jk), 1 ≤ j ≤ N

where O is the observation vector being modeled, c_jk is the mixture coefficient for the kth mixture in state j, and N is any log-concave or elliptically symmetric density. Typically, we assume that N is Gaussian with mean vector μ_jk and covariance matrix U_jk for the kth mixture component in state j. The mixture weights c_jk satisfy the constraints:

Σ_{k=1}^{M} c_jk = 1, 1 ≤ j ≤ N   (15.50)

c_jk ≥ 0, 1 ≤ j ≤ N, 1 ≤ k ≤ M   (15.51)

Let γ_t(j, k) be the probability of being in state S_j at time t with the kth mixture component accounting for O_t:

γ_t(j, k) = P(q_t = S_j, k_t = k | O, λ)   (15.52)

The re-estimation formulas for the coefficients of the mixture density are:

ĉ_jk = Σ_{t=1}^{T} γ_t(j, k) / Σ_{t=1}^{T} Σ_{k=1}^{M} γ_t(j, k)   (15.53)

μ̂_jk = Σ_{t=1}^{T} γ_t(j, k) O_t / Σ_{t=1}^{T} γ_t(j, k)   (15.54)

Û_jk = Σ_{t=1}^{T} γ_t(j, k)(O_t – μ̂_jk)(O_t – μ̂_jk)^T / Σ_{t=1}^{T} γ_t(j, k)   (15.55)

Details on how the re-estimation formulas are derived from the auxiliary function Q(λ̂, λ) can be found in [25] and [23].
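One Baum-Welch update of the transition matrix (Eqs. (15.46)–(15.48)) can be sketched as follows; the forward and backward trellises are recomputed inside so that the sketch is self-contained, and `B[t, i]` again denotes b_i on the 0-indexed frame t:

```python
import numpy as np

def reestimate_transitions(A, B, pi):
    """One re-estimation step for A from a single observation sequence.
    Builds the alpha/beta trellises, accumulates zeta_t(i, j) over t
    (Eq. 15.46), and returns a_ij-hat per Eq. (15.48)."""
    T, N = B.shape
    alpha = np.zeros((T, N)); alpha[0] = pi * B[0]
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[t + 1]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    p_obs = alpha[-1].sum()                   # P(O | lambda)
    zeta_sum = np.zeros((N, N))
    for t in range(T - 1):
        # zeta_t(i, j) = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / P(O|lambda)
        zeta_sum += np.outer(alpha[t], B[t + 1] * beta[t + 1]) * A / p_obs
    gamma_sum = zeta_sum.sum(axis=1)          # expected transitions out of S_i
    return zeta_sum / gamma_sum[:, None]      # Eq. (15.48)
```

Each row of the returned matrix sums to one by construction, so the update preserves the constraint of Eq. (15.32).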
Viterbi Algorithm: One Solution for the Decoding Problem

The optimal state sequence q* = (q_1*, …, q_T*) is defined as the sequence that maximizes P(q | O, λ), or equivalently P(q, O | λ). Let δ_t(i) be the joint probability of the optimal state sequence and the observations up to time t, ending in state S_i at time t. Then,

δ_t(i) = max P(q_1, …, q_t–1, q_t = S_i, O_1, …, O_t | λ)   (15.56)

where the maximum is over all state sequences q_1, …, q_t–1. This probability can be updated recursively by extending each partial optimal path using

δ_t+1(j) = max_i δ_t(i) a_ij b_j(O_t+1)   (15.57)

At each time t, it is necessary to keep track of the optimal precursor of state j, that is, the state that maximized the above probability. Then, at the end of the utterance, the optimal state sequence can be retrieved by backtracking through the precursor list [11].

Speaker Adaptation

In spite of recent progress in the design of speaker-independent (SI) systems, error rates are still typically two to three times higher than those of equivalent speaker-dependent (SD) systems. Variability in both anatomical and personal characteristics contributes to this effect. Anatomical differences include the length of the vocal tract, the size of the nasal cavity, etc. Similarly, there are variable speaking habits, such as accent, speed, and loudness. The straightforward approach, which blindly mixes the statistics for all speakers, discards this useful information. On the other hand, the large amount of speaker-specific data required to train SD systems renders them impractical for many applications. However, it is possible to use a small amount of the new speaker's speech (adaptation data) to "tune" the SI models to the new speaker. Ideally, we would like to retain the robustness of well-trained SI models, yet improve the appropriateness of the models for the new speaker. Such methods are called speaker adaptation techniques.
The adaptation is said to be supervised if the true text transcript of the adaptation data is known; otherwise, the adaptation is said to be unsupervised.

Maximum a Posteriori Estimation

A widely used speaker adaptation method maximizes the posterior estimate of the HMM parameters [3]. Conventional maximum likelihood (ML) based algorithms assume the HMM parameters to be unknown but fixed, and the parameter estimators are derived entirely from the training observation sequence using the Baum-Welch algorithms. Sometimes, prior information about the HMM parameters is available, whether from subject matter considerations or from previous experience. Designers may wish to use this prior information, in addition to the sample observations, to infer the HMM parameters. The Maximum a Posteriori (MAP) framework naturally incorporates prior information into the estimation process, which is particularly useful for dealing with problems posed by sparse training data, where ML estimates become inaccurate. MAP parameter estimates approach the ML estimates when data is plentiful, but are governed by the prior information in the absence of data.

If λ is the parameter vector to be estimated from the observation O with probability density function (pdf) f(O | λ), and g is the prior pdf of λ, then the MAP estimate is defined as the maximum of the posterior pdf of λ, g(λ | O). Rather than maximizing the auxiliary function Q(λ̂, λ) as in Eq. (15.44), we instead maximize an auxiliary function that includes a contribution from the prior distribution of the model parameters:

R(λ̂, λ) = Q(λ̂, λ) + log g(λ̂)   (15.58)

The appropriate prior distributions are Gaussian distributions for the means, gamma distributions for the inverse variances, and Dirichlet distributions for the mixture weights [3].
The problem with MAP is that it adapts only those parameters for which explicit adaptation data is available, and it converges slowly for tasks where the adaptation data is limited and many parameters must be estimated. Many adaptation algorithms [15], [14] have been developed that attempt to generalize from "nearby" training data points to overcome this difficulty.

Transform-Based Adaptation

Another category of adaptation techniques uses a set of regression-based transforms to tune the means and variances of a hidden Markov model to the new speaker. Each transformation is applied to a number of HMMs and estimated from the corresponding data. Through this sharing of transformations and data, the method can produce improvements even if only a small amount of adaptation data is available for the new speaker, by using a global transform for all HMMs in the system. If more data is available, the number of transforms is increased.

Maximum Likelihood Linear Regression (MLLR)

The MLLR framework was first introduced in [27]. Consider the case of a continuous density HMM system with Gaussian output distributions. A particular Gaussian distribution, g, is characterized by a mean vector, μ_g, and a covariance matrix, U_g. Given a speech vector o_t, the probability of that vector being generated by the Gaussian distribution g is

b_g(o_t) = (2π)^{–d/2} |U_g|^{–1/2} exp[–(1/2)(o_t – μ_g)^T U_g^{–1}(o_t – μ_g)]

The adaptation of the mean vector is obtained by applying a transformation matrix W_g to the extended mean vector ζ_g to obtain an adapted mean vector μ̂_g:

μ̂_g = W_g ζ_g

where W_g is a d × (d + 1) matrix that maximizes the likelihood of the adaptation data, and ζ_g is defined as

ζ_g = [w, μ_1, …, μ_d]^T

where w is the offset term for the regression. The probability for the adapted system becomes

b_g(o_t) = (2π)^{–d/2} |U_g|^{–1/2} exp[–(1/2)(o_t – W_g ζ_g)^T U_g^{–1}(o_t – W_g ζ_g)]

The auxiliary function in Eq. (15.44) can be used here to estimate W_g.
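Applying an already-estimated MLLR transform to a Gaussian mean can be sketched as follows; `W` is assumed given (its estimation equation follows below), and the offset value of 1 in the extended mean is a common convention:

```python
import numpy as np

def mllr_adapt_mean(W, mu, offset=1.0):
    """MLLR mean adaptation: build the extended mean
    zeta = [offset, mu_1, ..., mu_d]^T and return W @ zeta,
    where W has shape d x (d+1). W is assumed to be already
    estimated from adaptation data."""
    zeta = np.concatenate(([offset], mu))
    return W @ zeta
```

As a degenerate check, choosing W = [b | I] (a bias column followed by the identity) reduces the adaptation to a simple mean shift μ + b.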
It can be shown that W_g can be estimated using the equation below:

Σ_{t=1}^{T} γ_g(t) U_g^{–1} o_t ζ_g^T = Σ_{t=1}^{T} γ_g(t) U_g^{–1} Ŵ_g ζ_g ζ_g^T   (15.59)

where γ_g(t) is the posterior probability of occupying Gaussian g at time t, given that the observation sequence O is generated. The MLLR algorithm can also be extended to transform the covariance matrices [38].

Cluster-Based Speaker Adaptation

Yet another category of speaker adaptation methodology is based on the fact that a speech training corpus contains a number of training speakers, some of whom are acoustically closer to the test speaker than others. Therefore, given a test speaker, if the acoustic models are re-estimated from a subset of the training speakers who are acoustically close to the test speaker, the system should be a better match to the test data of that speaker. A further improvement is obtained if the acoustic space of each of these selected training speakers is transformed, by using a transform-based adaptation method, to come closer to the test speaker. This scheme was shown to produce better speaker adaptation performance than other algorithms, for example MLLR [27] or MAP adaptation [3], when only a small amount of adaptation data was available. However, the implementation of this method required the entire training corpus to be available online for the adaptation process, which is not practical in many situations. This problem can be circumvented if a model is stored for each of the training speakers and the transformation is applied to the model; the transformed models are then combined to produce the speaker-adapted model. However, due to the large number of training speakers, storing the models of each training speaker would require a prohibitively large amount of storage. Also, we may not have sufficient data from each training speaker to robustly estimate the parameters of the speaker-dependent model. To solve this problem and retain the advantage of the method, a new algorithm was presented in [21].
The idea is to precluster the training speakers acoustically into clusters. For each cluster, an HMM system (called a cluster-dependent system) is trained using speech data from the speakers who belong to that cluster. When a test speaker's data is available, these cluster-dependent systems are ranked according to the distances between the test speaker and each cluster, and a subset of the clusters acoustically closest to the test speaker is chosen. Then the model for each of the selected clusters is transformed further to bring the model closer to the test speaker's acoustic space. Finally, these adapted cluster models are combined to form a speaker-adapted system. Hence, compared to [22], we now choose clusters that are acoustically close to the test speaker, rather than individual training speakers. This method solves the problem of excessive storage for the training speaker models because the number of clusters is far smaller than the number of training speakers, and it is relatively inexpensive to store a model for each cluster. Also, as each cluster contains a number of speakers, we have enough data to robustly estimate the parameters of the model for the cluster.

Vocal Tract Length Normalization (VTL)

Several attempts have been made to model variations in vocal tract length across speakers. The idea was originally introduced by Bamberg [42] and revived through a parametric approach in [39]. Assume a uniform tube of length L as the model of the vocal tract. Then each formant frequency will be proportional to 1/L. The first-order effect of a difference in vocal tract length is the scaling of the frequency axis.
The idea behind VTL is to rescale or warp the frequency axis during the signal processing step in a speech recognition system, to make speech from all speakers appear as if it was produced by a vocal tract of a single standard length. Such normalizations have led to significant gains in accuracy by reducing variability among speakers and allowing the pooling of training data for the construction of sharper models.

Three VTL methods have recently been proposed. In [39], a parametric method of normalization that counteracts the effect of varied vocal tract length is presented. This method is particularly useful when only a small amount of training data is available, and it requires the determination of the formant frequencies. In [40], an automated method is presented that uses a simple generic voiced-speech model to rapidly select appropriate frequency scales. This generic model is a mixture of 256 multivariate Gaussians with diagonal covariances, trained on the unwarped data. Different warp scales are selected to linearly transform the frequency axis of the speaker's data, and the resulting warped features are scored against the generic model. The warp scale that scores best is selected as the warp scale for that speaker. An iterative process updates the generic model with the new features obtained after warping each speaker's data with its best warp scale. Once the best warp scales for all speakers have been determined, SI models are built with the appropriately warped feature vectors. This warp selection method allows data from all speakers to be merged into one set of canonical models. In [41], a class of transforms is proposed that achieves a remapping of the frequency axis much like the conventional VTL methods. These mappings, known as all-pass transforms, are linear in the cepstral domain, which makes speaker normalization simple to implement. The parameters of these transforms are computed using conjugate gradient methods.
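The warp-scale selection loop of [40] can be sketched as follows. The single diagonal Gaussian used as the "generic model" here is a stand-in for the 256-component mixture of [40], and the feature values in the usage are purely illustrative:

```python
import numpy as np

def select_warp(features_by_warp, mean, var):
    """Given feature sets computed at several candidate warp factors
    (dict: warp -> frames x dims array), score each under a diagonal
    Gaussian generic model and return the best-scoring warp factor."""
    def loglik(X):
        # Sum of per-dimension Gaussian log-densities over all frames.
        return float((-0.5 * (np.log(2 * np.pi * var)
                              + (X - mean) ** 2 / var)).sum())
    return max(features_by_warp, key=lambda w: loglik(features_by_warp[w]))
```

In the full procedure, this selection alternates with re-training of the generic model on the newly warped features until the warp assignments stabilize.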
Modeling Context in Continuous Speech

Speech cannot be accurately modeled by a concatenation of elementary HMMs corresponding to the individual phones of a word baseform. A phone is a sub-word acoustic unit of speech. The realizations of the phones depend on their context. This is especially true for continuous speech, where the phenomenon called co-articulation is observed. Co-articulation occurs when the pronunciation of a phoneme is affected by the phones preceding and following it, as with the t in top and pot. This section discusses several methods that yield HMM building blocks that take phonetic context into account. A word is specified by its phonetic baseform; the phones are transformed into their appropriate allophones according to the context in which they appear, and a concatenation of the HMMs of these allophones results in the word HMM. Two standard approaches are used to make use of contextual information:

1. Tri-phones as building blocks
2. Decision trees that lead to general contextual building blocks

Tri-Phones

In order to take into account the influence of context on pronunciation, many speech recognizers base their modeling on the tri-phone concept, first introduced by Chow et al. [5, 24] and Lee et al. [7, 8] in the 1980s. In this concept, the pronunciation of a phone is influenced by the preceding and following phones (i.e., the triplet is used to model the realization of the phone). The phone p embedded in the context p_1 and p_2 is specified by the tri-phone p_1 p p_2, where p_1 and p_2 are the previous and following phones. Different such realizations of p are called allophones. This amounts to saying that the contextual influence of the preceding and following phones is most important. If this solution were carried out literally, the resulting allophone alphabet would be too large to be useful. For example, a phonetic alphabet of size M would produce M^3 allophones.
Even though, in practice, not all M^3 allophones occur, the number of possible allophones is still large. So the tri-phone method relies on an equivalence classification of the contexts of phones. One simple scheme involves the clustering of tri-phones into distinct categories that characterize them. The criterion used for clustering can be an information-theoretic measure such as likelihood or entropy. This is the concept behind generalized tri-phones [8]. Decision trees (described in the next section) can also be used to determine the distinct categories. Another scheme is to tie the HMM distributions of these tri-phones. More recently, methods that cluster individual states of the HMM [1] have been introduced. A drawback of tri-phones is that wider contexts (i.e., three, four, or five phones to the left) may be important.

Decision Trees

The purpose of the decision tree is to map a large number of conditions (i.e., phonetic contexts) to a small, manageable number of equivalence classes. Each terminal node of the tree represents a set of phonetic contexts. The aim in decision tree construction is to find the best possible definition of the equivalence classes [2]. During decoding, the acoustic models to be used for a given observation string are chosen based on the current acoustic context, that is, by pouring the data down the decision tree until a terminal node is reached and using the models at that terminal node to compute the likelihood of the data.

Decision Tree Construction

Maximum Likelihood (ML) estimation is one common technique used for constructing a decision tree; i.e., the aim is to find the classes that maximize the likelihood of the given training data. A binary decision tree is grown in the following fashion.

1. From among a set of questions (at the phonetic or lexeme level), the best question for partitioning the data into two classes (i.e., the question that maximizes the likelihood) is found.

2.
The above step is repeated recursively on each of the two classes until either there is insufficient data to continue or the best question is not sufficiently helpful.

The easiest way to construct a decision tree is to create, in advance, a list of possible questions for each variable that may be tested. Finding the best question at any given node consists of subjecting all the relevant variables to each of the questions on the corresponding list and picking the best combination of variable and question. In building an acoustic decision tree using phonetic context, at least 10 variables may be interrogated: 5 preceding phones and 5 following phones. Since all of these variables belong to the same phonetic alphabet, only one set of questions needs to be prepared, where each question is a subset of the phonetic alphabet. Typically, trees are less than 20 layers deep, as beyond that the algorithm runs out of data.

Let X_1 … X_n denote n discrete random variables whose values may be tested, and let Q_ij denote the jth predetermined question for X_i.

1. Starting at the root node, try splitting each node into two subnodes.

2. For each variable X_i, evaluate questions Q_i1, Q_i2, etc. Let Q_b denote the best question, estimated using any one of the criteria described earlier. The best question at a node is the question that maximizes the likelihood of the training data at that node after applying the question.

3. Find the best pair (X_i, Q_b), denoted (X_b, Q_b).

4. If the selected question is not sufficiently helpful (the gain in likelihood due to the split is not significant) or the node does not have sufficient data points, make the current node a leaf.

5. Otherwise, split the current node into two new subnodes according to the answer of question Q_b on variable X_b.

The algorithm stops when all nodes are either too small to split further or have been marked as leaves. Over-training is prevented by limiting the number of questions asked.
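The core of the procedure, choosing the question that maximizes the likelihood of the data after a binary split, can be sketched for discrete contexts as follows; the contexts, labels, and questions are toy examples, not from the text:

```python
import math
from collections import Counter

def loglik(labels):
    """Log-likelihood of a label sample under its own relative-frequency model."""
    n = len(labels)
    return sum(c * math.log(c / n) for c in Counter(labels).values())

def best_question(contexts, labels, questions):
    """Return the question (a predicate on the context) that yields the
    largest likelihood gain for a binary split, together with that gain."""
    base = loglik(labels)
    best, best_gain = None, 0.0
    for q in questions:
        yes = [l for c, l in zip(contexts, labels) if q(c)]
        no = [l for c, l in zip(contexts, labels) if not q(c)]
        if not yes or not no:      # degenerate split: skip
            continue
        gain = loglik(yes) + loglik(no) - base
        if gain > best_gain:
            best, best_gain = q, gain
    return best, best_gain
```

A node would be made a leaf when the returned gain falls below a significance threshold or too few data points remain, matching steps 4 and 5 above.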
Questions

A decision tree has a question associated with every non-terminal node. These can be grouped into discrete and continuous questions.

Discrete questions. If X is a discrete random variable that takes values in some finite alphabet R, then a question about X has the form: Is X an element of S, where S is a subset of R? Typically, questions are of the form "Is the preceding phone a vowel?" or "Is the following phone an unvoiced stop?"

Continuous questions. If X is a continuous random variable that takes real values, a question about X has the form: Is X ≤ t, where t is some real value. Instead of limiting the questions to a predefined set, we could search for the best subset of values taken by the random variable at any node and use the best question found. This implies generating questions on the fly during tree construction. The disadvantages of this approach are that too much CPU time is spent searching for the best subset and that, because there are so many subsets, there is too much freedom in the tree-growing algorithm, resulting in over-training or spurious questions that do not generalize well. Alternatively, all of the questions can be constructed in advance by experts; for example, phonetic questions can be generated by linguists or by algorithms [9].

Language Modeling

Humans recognize words based not only on what they hear, but also on what they have heard in the past, as well as what they anticipate hearing in the future. It is this capability that makes humans the best speech recognition systems. Modern speech recognition systems attempt to achieve this human capability through language modeling. Language modeling is the art and science of anticipating or predicting words or word sequences from nonacoustic sources of information, such as the context, structure, and grammar of the particular language, and previously heard word sequences.
In large vocabulary speech recognition, in which word sequences W are uttered to convey some message, the language model P(W) is of critical importance to the recognition accuracy. In most cases, the language model must be estimated from a large text corpus. For practical reasons, the word sequence probability P(W) is approximated by

P(W) = Π_{i=1}^{Q} P(w_i | w_i–N+1, …, w_i–1)   (15.60)

This is called an N-gram language model, where N is the number of words from the history that are used in the computation. Typically, N = 3, and these are referred to as trigram language models; Q is the number of words in the sequence being decoded. The conditional probabilities in Eq. (15.60) are estimated by the simple relative frequency approach described in [23]. The maximum entropy approach is another method of estimating the conditional probability distributions (described below). In cases where the text corpus is not large enough to reliably estimate the probabilities, smoothing techniques such as linear smoothing (deleted interpolation) are applied (also described below).

Perplexity is a measure of the performance of language models. Perplexity, defined in Eq. (15.61), is the average word branching factor of the language model:

B = 2^H = P(w_1, w_2, …, w_Q)^{–1/Q}   (15.61)

Perplexity is an important parameter in specifying the degree of sophistication of a recognition task, from the source uncertainty to the quality of the language model. Other techniques that have been used in language modeling include decision tree models [43] and automatically inferred linked grammars to model long-range correlations [44].

Smoothing

In computing the language model probabilities, we desire the following: that there be few parameters to estimate; that the available data be sufficient for the estimation of those parameters; and that the probability can be constructed at recognition time from the parameter values while occupying limited storage. Several smoothing techniques have been proposed to handle the scarcity of data [25].
These are essential in the construction of N-gram language models. They include linear smoothing, also known as deleted interpolation; backing-off; bucketing; and equivalence classification techniques. An extensive empirical study of these techniques for language modeling is given in [35]. A brief description of two smoothing techniques is covered in this section.

Linear smoothing is due to Jelinek and Mercer [37], where a class of smoothing models that involve linear interpolation is presented. The maximum-likelihood estimate is interpolated with the smoothed lower-order distribution defined analogously:

P_int(w_i | w_i–n+1, …, w_i–1) = λ P_ML(w_i | w_i–n+1, …, w_i–1) + (1 – λ) P_int(w_i | w_i–n+2, …, w_i–1)   (15.62)

To yield meaningful results, the data used to estimate the interpolation weights λ needs to be distinct from the data used to estimate P_ML. In held-out interpolation, a section of training data is reserved for this purpose. In [37], a technique called deleted interpolation is described, where different parts of the training data rotate in training either P_ML or the λ, and the results are then averaged.

The other widely used smoothing technique in speech recognition is the backing-off technique described by Katz [36]. Here, the Good-Turing estimate [36] is extended by adding the interpolation of higher-order models with lower-order models. This technique performs best for bigram models estimated from small training sets. The interpolated trigram language model probability is defined by

p(w_3 | w_1, w_2) = λ_3 f(w_3 | w_1, w_2) + λ_2 f(w_3 | w_2) + λ_1 f(w_3)   (15.63)

The backing-off technique suggests that if the count C(w_1, w_2) is sufficiently large, then f(w_3 | w_1, w_2) by itself is a better estimate of p(w_3 | w_1, w_2). Hence, for different ranges of counts, a different estimate of p(w_3 | w_1, w_2) is used. Several variations on the choice of this threshold and on the Good-Turing type function are described in [25].
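The smoothed trigram of Eq. (15.63) and the perplexity of Eq. (15.61) can be sketched together; the relative frequencies f(·) and the weights λ_i in the usage are illustrative inputs rather than values estimated by deleted interpolation:

```python
import math

def smoothed_trigram(f3, f2, f1, lambdas):
    """Eq. (15.63): p(w3|w1,w2) = l3*f(w3|w1,w2) + l2*f(w3|w2) + l1*f(w3)."""
    l1, l2, l3 = lambdas
    return l3 * f3 + l2 * f2 + l1 * f1

def perplexity(word_probs):
    """Eq. (15.61): B = P(w_1,...,w_Q)^(-1/Q), computed in the log domain
    from per-word conditional probabilities for numerical stability."""
    Q = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / Q)

# Illustrative values: f(w3|w1,w2)=0.5, f(w3|w2)=0.2, f(w3)=0.1,
# (l1, l2, l3) = (0.1, 0.3, 0.6).
print(smoothed_trigram(0.5, 0.2, 0.1, (0.1, 0.3, 0.6)))  # 0.37
```

As a sanity check on Eq. (15.61), a model that assigns every word probability 1/100 has perplexity 100, i.e., an average branching factor of 100 words.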
Maximum Entropy Based Language Models

A maximum-likelihood approach for automatically constructing maximum entropy models is presented in [34]. The maximum entropy method finds a model that simultaneously satisfies a set of constraints. Such a model is a method of estimating the conditional probability distributions. The principle is simple: model all that is known and assume nothing about that which is not known. Let x and y be a set of random variables such that P(y | x) is the probability that the model assigns an output y given x. Let f_i(x, y) be an indicator function (the expected value of this function is the feature function) that takes a binary value of 1 or 0 to fit the training data. If P satisfies

Σ_{x,y} P(x) P(y | x) f_i(x, y) = d(i)

where the d(i) are the constraints, then there must be a probability P that satisfies all the constraints uniformly. A mathematical measure of uniformity of conditional distributions is the conditional entropy, H. The solution to this problem is obtained by selecting the model with maximum entropy from the set of possible models, C, i.e.,

p* = argmax_{p ∈ C} H(p)    (15.64)

It can be shown that p* is well defined and that there is always a model with maximum entropy in any constrained set C. For simple cases, the above equation can be solved mathematically, but for the more general case, the use of Lagrange multipliers from constrained optimization theory is employed. This approach leads to the following statement. The maximum entropy model subject to a set of constraints has the parametric form given by Eq. (15.65),

P(y | x) = (1 / Q_λ(x)) exp( Σ_i λ_i f_i(x, y) )    (15.65)

where the Lagrange multipliers λ_i can be determined by maximizing the Lagrangian L = H + Σ_i λ_i (p(f_i) − p̂(f_i)). Q_λ(x) is the normalization constant, and p(f_i) and p̂(f_i) are the empirical and expected distributions.
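Evaluating the parametric form of Eq. (15.65) directly is straightforward. The sketch below uses one invented feature function and invented weights; it also shows that zero weights recover the uniform (maximum-entropy) distribution.

```python
import math

def maxent_prob(lambdas, features, x, y, outputs):
    """P(y|x) = exp(sum_i lambda_i f_i(x,y)) / Q_lambda(x), Eq. (15.65).
    `features` are binary indicator functions f_i(x, y); the normalizer
    Q_lambda(x) is computed by summing over all candidate outputs."""
    def score(cand):
        return math.exp(sum(l * f(x, cand) for l, f in zip(lambdas, features)))
    q = sum(score(cand) for cand in outputs)   # Q_lambda(x)
    return score(y) / q

# One invented feature: fires when the predicted word is "the".
f0 = lambda x, y: 1.0 if y == "the" else 0.0
outputs = ["the", "a"]
print(maxent_prob([0.0], [f0], "ctx", "the", outputs))          # uniform: 0.5
print(maxent_prob([math.log(3)], [f0], "ctx", "the", outputs))  # 3/(3+1) = 0.75
```

Fitting the λ_i so that the model expectations match the empirical constraints is the job of the iterative scaling algorithms mentioned below.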
Since the Lagrangian is the log-likelihood for the exponential model P, the solution states that the model with the maximum entropy is one that maximizes the likelihood of the training data. Details on the construction of maximum entropy models, techniques for the selection of features to be included in models, and the computation of the parameters of these models are addressed in [34]. Techniques for computing the parameters of such models, such as hill climbing, iterative projection, and iterative scaling algorithms, are described in [25].

Hypothesis Search

It is the aim of the speech recognizer to determine the best possible word sequence given the acoustic observation; that is, the word string Ŵ that satisfies

Ŵ = argmax_W P(W) P(A | W)    (15.66)

Since the word string is made up of a sequence of words, the search for Ŵ is done from among several possible hypotheses. Viterbi search, a time-synchronous search strategy, and tree search, a time-asynchronous search strategy, are presented here.

Viterbi Search

The Viterbi algorithm [11] introduced previously finds the most likely path through an HMM. Equation (15.66) demands that we find, for each candidate word string W, the probability of the set of paths that corresponds to the word string W, and then identify the word string whose set of paths has the highest probability. Section 2 described the hidden Markov model concept. The Viterbi algorithm finds the maximizing state sequence over successive levels i (there may be several competing sequences), deciding at the final level k among the competing sequences. At each level i, the paths whose probabilities fall below a threshold are purged.
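A minimal sketch of this level-wise search with purging follows. The two-state HMM and all its probabilities are invented for illustration; `beam=0.0` disables the pruning and gives the exact Viterbi algorithm.

```python
def viterbi_beam(obs, states, start_p, trans_p, emit_p, beam=0.0):
    """Time-synchronous Viterbi search. At each level, states whose best
    path probability falls below beam * (level maximum) are purged."""
    V = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = max(p for p, _ in V.values())
        live = {s: v for s, v in V.items() if v[0] >= beam * best}  # purge
        V = {}
        for s in states:
            V[s] = max((pp * trans_p[ps][s] * emit_p[s][o], pth + [s])
                       for ps, (pp, pth) in live.items())
    return max(V.values())   # (probability, state sequence)

# Classic toy weather HMM (all numbers invented for illustration).
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}
prob, path = viterbi_beam(["walk", "shop", "clean"], states,
                          start_p, trans_p, emit_p, beam=0.1)
```

On this tiny example the beam of 0.1 prunes nothing, so the result matches the exact search; in a large trellis the same threshold discards most of the states at each level.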
A traceback from the final state for which the probability is maximized to the start state in the purged trellis yields the most likely state sequence. Each word is represented as a concatenation of several HMMs, one corresponding to each of the phones that make up the word. If we are using a bigram language model, then P(W) = P(w_1) ∏_{i=2}^{n} P(w_i | w_{i−1}), and Ŵ corresponds to the most likely path through these HMMs. The number of states is proportional to the vocabulary size |V|. If we are using the trigram language model, then P(W) = P(w_1) P(w_2 | w_1) ∏_{i=3}^{n} P(w_i | w_{i−2}, w_{i−1}), and the graph becomes more complicated, with the number of states being proportional to |V|^2. No practical algorithms exist for finding the exact solution, but the Viterbi algorithm will find the most likely path through these HMMs, whose identity can then determine the recognized word string. One drawback of the Viterbi algorithm is the number of states that have to be evaluated for a bigram language model, even for a practical vocabulary size of 60,000 words. A shortcut that is commonly used is the beam search. Here, the maximal probability P^m_{i−1} over the states at stage i − 1, i.e., max_{s_1, …, s_{i−1}} P(s_1, s_2, …, s_{i−1}, y_1, …, y_{i−1} | s_0), is computed and used as the basis for computing a dynamic threshold to prune out all states in the trellis whose path probabilities fall below this threshold. Multi-pass search strategies have been proposed, beyond the thresholding used in the beam search, to handle more complex language models [6].

Tree Search

The search for the most likely word sequence can be thought of as searching for a path in a tree whose branches are labeled with the various words of the vocabulary V, such that there are |V| branches leaving each node, one for each word. Typically, in large vocabulary continuous speech recognition, this search through a tree of possible hypotheses turns out to be a very large computational effort.
Hence, the search is limited by a fast match approach [26] that will reject from consideration several branches of the tree without subjecting them to a detailed analysis. The Viterbi algorithm achieves the same kind of pruning using the beam search approach and multi-pass strategies.

Stack Search

Stack search algorithms for speech recognition have been used at IBM [19] and MIT Lincoln Labs [20]. This heuristic search algorithm helps to reduce computational and storage needs without sacrificing accuracy. Any tree search must be based on some evaluation criterion related to the search's purpose. The algorithm below is a popular algorithm used for the heuristic determination of minimum-cost paths:

1. Insert into a stack all the single-branch paths corresponding to the words of the vocabulary.
2. Sort these entries in descending order of a function F(w_i), where w_i ∈ V.
3. If the top entry in the stack is the end of the utterance, the search ends; otherwise, each entry in the stack is extended using F(·) for all possible words in the vocabulary and inserted into the stack while maintaining the stack order.

F(·) is

F(w_1^r) = max_k P(a_1^k, w_1^r)    (15.67)

where w_1^r denotes a word string of length r and a_1^m denotes the acoustic data to be recognized. The methods described in [14] incorporate the definition of an envelope that is used to mark partial paths in the stack as alive or dead; these considerably speed up the search. In [13], a tree search strategy called the envelope search is presented. This is a time-asynchronous search that combines aspects of the A* search with the time-synchronous Viterbi search. Several bi-directional search strategies have been proposed by Li et al. [18]. Kenny et al. discuss several aspects of A* algorithms in [12]. A different approach, involving majority decisions on observed discrete acoustic output strings leading to a polling fast match, is introduced by Bahl et al. in [15].
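The three steps above can be sketched with a priority queue standing in for the sorted stack. The scoring function and completion test below are toy stand-ins for the acoustic and language model terms behind F(·); the "true" utterance and the 0.01 length penalty are invented for the example.

```python
import heapq

def stack_search(vocab, score, is_complete, max_len=8):
    """Best-first stack decoding. The stack is kept sorted by F(.)
    (via a heap on -score); the top entry is popped, and either accepted
    as the end of the utterance or extended by every vocabulary word."""
    stack = [(-score((w,)), (w,)) for w in vocab]   # steps 1 and 2
    heapq.heapify(stack)
    while stack:
        _, seq = heapq.heappop(stack)               # top (highest F) entry
        if is_complete(seq):                        # step 3: utterance ends
            return seq
        if len(seq) < max_len:
            for w in vocab:                         # extend and re-insert
                ext = seq + (w,)
                heapq.heappush(stack, (-score(ext), ext))
    return None

# Toy stand-in for F: count words matching a "true" utterance,
# with a small penalty on length so shorter hypotheses are not starved.
target = ("speech", "is", "hard")
score = lambda seq: sum(a == b for a, b in zip(seq, target)) - 0.01 * len(seq)
best = stack_search(["speech", "is", "hard"], score, lambda s: s == target)
```

Because extensions of poorly scoring hypotheses sink in the queue, only a small fraction of the |V|^r possible word strings is ever expanded.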
The use of multiple stacks is yet another way to control the search procedure and is presented in [13].

Tree Search vs. Viterbi Search

A Viterbi search of a trellis finds the most likely succession of transitions through a composite HMM composed of word HMMs. The number of states in a trellis stage (determined by the end states of the word HMMs) must be limited to keep the search's storage and computational requirements feasible. The tree search imposes no such constraints on the number of end states, as long as the search does not prune out the correct path. Both algorithms are suboptimal in the sense that they are not guaranteed to find the most probable word string.

State-of-the-Art Systems

In the 1998 DARPA Hub-4E English Broadcast News Benchmark Test, an overall recognition error rate of 13.5% was achieved. This test includes recognition of baseline broadcast speech, spontaneous broadcast speech, speech over telephone channels, speech in the presence of background music, speech under degraded acoustic conditions, speech from non-native speakers, and all other kinds of speech. Details can be obtained from the NIST Web site. Another benchmark test is the Switchboard Task, which is the transcription of conversations between two people over the telephone. Error rates for this task are approximately 37%. In the Airline Travel Information System (ATIS) speech recognition evaluation conducted by DARPA, error rates close to 2% have been obtained. High recognition accuracies have been obtained for digit recognition, with error rates under 1% (TIMIT database [31]).

Challenges in Speech Recognition

Some of the issues that still arise in speech recognition, and make interesting research problems for the present and the future, include:

1. Accurate transcription of spontaneous speech compared to read speech is still a major challenge because of its inherent casual and incoherent nature, embedded disfluencies, and incomplete voicing of several phones or words.
2.
Recognizing speech between individuals and/or multiple speakers in a conference.
3. Robustness of recognition to different kinds of channels, background noise in the form of music, speech over speech, variation in distances between the speaker and the microphone, etc.
4. Recognition across age groups, speaking rates, and accents.
5. Building of a language model for an unknown domain and the addition of out-of-vocabulary (OOV) words.
6. In order to dynamically adapt to speakers (i.e., make use of their speech when no transcription is available), unsupervised adaptation is necessary. To do this accurately, we need a confidence measure on the decoded script.
7. Speech recognition systems do not have any understanding of decoded speech. To move toward understanding/machine translation, we need some post-processing of the transcription that could lead to intelligent conversational systems.

Applications

The use of speech as a means of input, in a fashion similar to the use of the keyboard and the mouse, has resulted in the application of speech recognition to a wide set of fields. These can be broadly divided into three segments: desktop, telephony, and embedded systems.

1. In the desktop area, continuous speech recognition has been used for dictation of text documents, for commands to navigate the desktop environment, and for Internet surfing. Dictation accuracies of the order of 96% and greater have been achieved. The main players in this field include IBM, with their ViaVoice series of products, Dragon Systems, L&H, Philips, and Microsoft.¹ Software tailored to dictation in specialized fields, such as radiology, general medicine, and the legal domain, has also been released by some of these companies. Recorded speech using hand-held digital recorders can also be transcribed subsequently by the same software.
2. Telephony is an emerging field for applications of speech recognition.
These include repertory dialing, automated call type recognition, credit card validation, directory listing retrieval, speaker identification, financial applications such as trading of stocks and mutual funds, banking, voice mail transcription, and so on; several companies have their own specialized products for telephony.
3. The use of speech input for embedded systems is a relatively new field, because only recently have handheld systems had adequate CPU and memory for accurate speech recognition.

Defining Terms

Acoustic Model: Any statistical or syntactic model that represents a speech signal.
Baseforms: Representation of a word as a sequence of phones.
Baum-Welch algorithm: A form of EM (expectation-maximization); an iterative procedure to estimate the parameters of a stochastic model by maximizing the likelihood of the data.
Cepstra: The Fourier transform of the logarithm of the power spectrum sampled at regular intervals.
Co-articulation: Pronunciation of phones being influenced by the previous and following phones.
Decision trees: A technique used to group several conditions into classes.
Forward-backward: A recursive algorithm for computing the posterior probability of an HMM using forward and backward variables.
Gaussian mixtures: Convex combination of Gaussian (a kind of probability distribution function) functions.
Hidden Markov Model (HMM): A stochastic model that uses state transition and output probabilities to generate observation sequences.
Hypothesis search: Search through a large set of hypotheses of word sequences to find the optimal word sequence.
Language Model: Language models predict words or word sequences from nonacoustic sources of information, such as context, structure, and grammar of the particular language.
Linear Prediction Coefficients (LPC): A representation of an analog signal using an autoregressive model.
MAP: Maximum a posteriori. Technique for speaker adaptation.
MLLR: Maximum Likelihood Linear Regression. Technique for speaker adaptation.
Phones: Sub-word acoustic units.

¹While this is not an exhaustive list of companies with continuous speech recognition software products, they are the leaders in the field to date.

Signal pre-processing: Conversion of an analog speech signal into a sequence of numeric feature vectors or observations.
Speaker adaptation: The process of using a small amount of data from a new speaker to tune a set of speaker-independent acoustic models to a new speaker.
Supervised and Unsupervised Adaptation: In speaker adaptation, the procedure is said to be supervised if the true transcription of the adaptation data is known, and is unsupervised otherwise.
Tri-phone: Context-dependent model of a phone as a function of its previous and succeeding phones.
Viterbi: An algorithm for finding the optimal state sequence through an HMM, given a particular observation sequence.

References

1. Young, S. J. and Woodland, P. C., State clustering in HMM-based continuous speech recognition, Computer Speech and Language, 8, 369, 1994.
2. Bahl, L. R. et al., Decision trees for phonological rules in continuous speech, ICASSP, 1991.
3. Gauvain, Jean-Luc and Lee, Chin-Hui, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, IEEE Transactions on Speech and Audio Processing, 2, 1994.
4. Baum, L. E., An inequality and associated maximization technique in statistical estimation of probabilistic functions of Markov processes, Inequalities, 3, 1, 1972.
5. Chow, Y. et al., BYBLOS: The BBN continuous speech recognition system, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 89, 1987.
6. Lee, C. H., Soong, F. K., and Paliwal, K. K., Automatic Speech and Speaker Recognition, Kluwer Academic Publishers, 1996.
7. Lee, K. and Hon, H., Large vocabulary speaker independent continuous speech recognition, IEEE International Conference on Acoustics, Speech and Signal Processing, 1988.
8.
Lee, K., Hon, H., Hwang, M., Mahajan, S., and Reddy, R., The Sphinx speech recognition system, IEEE International Conference on Acoustics, Speech and Signal Processing, 1989.
9. Bahl, L., de Souza, P., Gopalakrishnan, P. S., and Picheny, M., Context-dependent vector quantization for continuous speech recognition, ICASSP, 1993.
10. Bahl, L., de Souza, P., Gopalakrishnan, P. S., Nahamoo, D., and Picheny, M., Robust methods for using context dependent features and models in a continuous speech recognizer, ICASSP, I, 533, 1994.
11. Viterbi, A. J., Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Transactions on Information Theory, IT-13, 260, 1967.
12. Kenny, P. et al., A new fast match for very large vocabulary continuous speech recognition, ICASSP, II, 656, 1993.
13. Bahl, L. R., Gopalakrishnan, P. S., and Mercer, R. L., Search issues in large vocabulary speech recognition, Proceedings of the 1993 Workshop on Automatic Speech Recognition, 1993.
14. Gopalakrishnan, P. S., Bahl, L. R., and Mercer, R. L., A tree search strategy for large-vocabulary continuous speech recognition, ICASSP, I, 572, 1995.
15. Bahl, L. R., Bakis, R., de Souza, P. V., and Mercer, R. L., Obtaining candidate words by polling in a large vocabulary speech recognition system, ICASSP, I, 489, 1988.
16. Lippmann, R. P., Review of neural networks for speech recognition, in Readings in Speech Recognition, Waibel, A. and Lee, K. F., Eds., Morgan Kaufmann, San Mateo, CA, 1990.
17. Rabiner, L. R. and Levinson, S. E., Isolated and connected word recognition — Theory and selected applications, IEEE Transactions on Communications, COM-29, 621, 1981.
18. Li, Z., Boulianne, G., Laboute, P., Barszcz, M., Garudadri, H., and Kenny, P., Bi-directional graph search strategies for speech recognition, Computer Speech and Language, 10, 295, 1996.
19. Bahl, L. R., Jelinek, F., and Mercer, R. L., A maximum likelihood approach to continuous speech recognition, IEEE Trans. Pat.
Anal. and Mach. Int., PAMI-5, 179, 1983.
20. Paul, D., An efficient A* stack decoder algorithm for continuous speech recognition with a stochastic language model, Proc. DARPA Workshop on Speech and Natural Language, pp. 405, 1992.
21. Gao et al., Speaker adaptation based on pre-clustering training speakers, Eurospeech-97, pp. 2091, 1997.
22. Padmanabhan et al., Speaker clustering transformation for speaker adaptation in large vocabulary speech recognition systems, ICASSP, 1996.
23. Rabiner, L. and Juang, B.-H., Fundamentals of Speech Recognition, Prentice-Hall Signal Processing Series, 1993.
24. Schwartz, R., Chow, Y., Kimball, O., Roucos, S., Krasner, M., and Makhoul, J., Context-dependent modeling for acoustic-phonetic recognition of continuous speech, ICASSP, 1985.
25. Jelinek, F., Statistical Methods for Speech Recognition, MIT Press, 1997.
26. Bahl, L. R. et al., A fast approximate match for large vocabulary speech recognition, IEEE Transactions on Speech and Audio Processing, 1, 59, 1993.
27. Leggetter, C. J. and Woodland, P. C., Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Computer Speech and Language, Vol. 9, pp. 171, 1995.
28. Hermansky, H. and Morgan, N., RASTA processing of speech, IEEE Transactions on Speech and Audio Processing, 2, 587, 1994.
29. Hunt, M. J., A statistical approach to metrics for word and syllable recognition, Journal of the Acoustical Society of America, 66(S1), S35(A), 1979.
30. Hermansky, H., Perceptual linear predictive (PLP) analysis of speech, Journal of the Acoustical Society of America, 87(4), 1748, 1990.
31. Lamel, L., Speech database development: Design and analysis of the acoustic-phonetic corpus, Proceedings of the DARPA Speech Recognition Workshop, pp. 100, 1986.
32. Lee, K., Automatic Speech Recognition, Kluwer Academic, 1989.
33. Pallett, D. S., Fiscus, J. G., Fisher, J. S., Garofolo, W. M., Lund, B. A., Martin, A., and Przybocki, M.
A., 1994 Benchmark Tests for the ARPA Spoken Language Program, Proceedings of the Spoken Language Systems Technology Workshop, Austin, TX, Jan. 22–25, 1995.
34. Berger, A. L., Della Pietra, S. A., and Della Pietra, V. J., A maximum entropy approach to natural language processing, Computational Linguistics, Vol. 22, No. 1, pp. 39–73, Mar. 1996.
35. Chen, S. F. and Goodman, J., An empirical study of smoothing techniques for language modeling, Technical Report TR-10-98, Harvard University, August 1998.
36. Katz, S. M., Estimation of probabilities from sparse data for the language model component of a speech recognizer, IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-35(3):400–401, Mar. 1987.
37. Jelinek, F. and Mercer, R. L., Interpolated estimation of Markov source parameters from sparse data, Proceedings of the Workshop on Pattern Recognition in Practice, May 1980.
38. Gales, M., Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech and Language, Vol. 12, pp. 75–98, 1998.
39. Eide, E. et al., A parametric approach to vocal tract normalization, Proceedings of the 15th Annual Speech Research Symposium, CLSP, Baltimore, pp. 161–167, June 1995.
40. Wegmann, S. et al., Speaker normalization on conversational telephone speech, ICASSP-96, Vol. 1, pp. 339–341, May 1996.
41. McDonough, J. et al., Speaker adaptation with all-pass transforms, ICASSP-99, Vol. II, pp. 757–760, Phoenix, May 1999.
42. Bamberg, P., Vocal tract normalization, Verbex Internal Technical Report, 1981.
43. Bahl, L. R. et al., A tree based statistical language model for natural language speech, IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37(7), 1989.
44. Berger, A. et al., The Candide System for Machine Translation, Proceedings of the ARPA Conference on Human Language Technology, New Jersey, 1994.
45. Chen, S., Adaptation by correlation, Proceedings of the DARPA Speech Recognition Workshop, Virginia, 1997.
46.
Shinoda, K. and Lee, C.-H., Structural MAP speaker adaptation using hierarchical priors, IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pp. 381–388, 1997.

For Further Information

There are three major speech-related conferences each year, namely, the International Conference on Acoustics, Speech and Signal Processing (ICASSP), the International Conference on Spoken Language Processing (ICSLP), and the European Conference on Speech Communication and Technology (EUROSPEECH). Besides these, the Defense Advanced Research Projects Agency (DARPA) conducts workshops on the Broadcast News Transcription (transcription of live television broadcasts) and Switchboard (conversations between individuals over the telephone) tasks. Also, there are several conferences addressing specific issues such as phonetic sciences, robust methods for speech recognition in adverse conditions, etc. Journals related to speech include IEEE Transactions on Speech and Audio Processing, IEEE Transactions on Signal Processing, Computer Speech and Language, Speech Communication, and IEEE Transactions on Information Theory. Additional details on the statistical techniques used in speech recognition can be found in several books [23, 25, 32]. A good review of current techniques can also be found in [6].

Acknowledgements

The authors wish to thank Dr. Harry Printz and Dr. R. T. Ward of IBM T. J. Watson Research Center for their careful review of this manuscript and many useful suggestions.

© 2000 by CRC Press LLC