Pillai, S.U., Shim, T.I., Batalama, S.N., Kazakos, D., Daum, F. "Spectral Estimation and Modeling," The Electrical Engineering Handbook, Ed. Richard C. Dorf, Boca Raton: CRC Press LLC, 2000.

16 Spectral Estimation and Modeling

16.1 Spectral Analysis: Historical Perspective • Modern Spectral Analysis
16.2 Parameter Estimation: Bayesian Estimation • Mean-Square Estimation • Minimax Estimation • Maximum Likelihood Estimation • Other Parameter Estimation Schemes
16.3 Kalman Filtering: Kalman Filter Equations • Kalman Filter Examples • Extended Kalman Filter • Nonlinear Filters • Practical Issues

S. Unnikrishna Pillai, Polytechnic University • Theodore I. Shim, Polytechnic University • Stella N. Batalama, State University of New York at Buffalo • Dimitri Kazakos, University of Southwestern Louisiana • Fred Daum, Raytheon Company

16.1 Spectral Analysis

S. Unnikrishna Pillai and Theodore I. Shim

Historical Perspective

Modern spectral analysis dates back at least to Sir Isaac Newton [Newton, 1671], whose prism experiments with sunlight led him to discover that each color represented a particular wavelength of light and that sunlight contained all wavelengths. Newton used the word spectrum, a variant of the Latin word specter, to describe the band of visible light colors.

In the early eighteenth century, Bernoulli discovered that the solution to the wave equation describing a vibrating string can be expressed as an infinite sum containing weighted sine and cosine terms. Later, the French engineer Joseph Fourier, in his Analytical Theory of Heat [Fourier, 1822], extended Bernoulli's wave equation results to arbitrary periodic functions that may contain a finite number of jump discontinuities. Thus, for some T_0 > 0, if f(t) = f(t + T_0) for all t, then f(t) represents a periodic signal with period T_0, and in the case of real signals it has the Fourier series representation

f(t) = A_0 + 2 \sum_{k=1}^{\infty} \left( A_k \cos k\omega_0 t + B_k \sin k\omega_0 t \right)

where ω_0 = 2π/T_0,

A_k = \frac{1}{T_0} \int_0^{T_0} f(t) \cos k\omega_0 t \, dt, \qquad B_k = \frac{1}{T_0} \int_0^{T_0} f(t) \sin k\omega_0 t \, dt

and A_0 represents the dc term (k = 0). Moreover, the infinite sum on the right-hand side of the above expression converges to [f(t−0) + f(t+0)]/2. The total power P of the periodic function satisfies the relation

P = \frac{1}{T_0} \int_0^{T_0} |f(t)|^2 \, dt = A_0^2 + 2 \sum_{k=1}^{\infty} \left( A_k^2 + B_k^2 \right)

implying that the total power is distributed only among the dc term, the fundamental frequency ω_0 = 2π/T_0, and its harmonics kω_0, k ≥ 1, with 2(A_k² + B_k²) representing the power contained at the harmonic kω_0. For every periodic signal with finite power, since A_k → 0 and B_k → 0, the higher harmonics eventually become of decreasing importance. The British physicist Schuster [Schuster, 1898] used this observation to suggest that the partial power P_k = 2(A_k² + B_k²) at frequency kω_0, k = 0 → ∞, be displayed as the spectrum. Schuster termed this method the periodogram, and information over a multiple of periods was used to compute the Fourier coefficients and/or to smooth the periodogram, since depending on the starting time, the periodogram may contain irregular and spurious peaks.

A notable exception to the periodogram was the linear regression analysis introduced by the British statistician Yule [Yule, 1927] to obtain a more accurate description of the periodicities in noisy data. Because the sampled periodic process x(k) = cos kω_0 T containing a single harmonic component satisfies the recursive relation

x(k) = a x(k − 1) − x(k − 2)

where a = 2 cos ω_0 T represents the harmonic component, its noisy version x(k) + n(k) satisfies the recursion

x(k) = a x(k − 1) − x(k − 2) + n(k)

Yule interpreted this time series model as a recursive harmonic process driven by a noise process and used this form to determine the periodicity in the sequence of sunspot numbers.

© 2000 by CRC Press LLC
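Yule's idea can be tried numerically. The following is a minimal sketch (not part of the original chapter): since a noiseless harmonic satisfies x(k) + x(k − 2) = a x(k − 1), a least-squares fit of that regression to noisy samples recovers a = 2 cos ω_0 T, and hence the frequency. The signal, noise level, and seed are illustrative assumptions.

```python
import numpy as np

# Fit x(k) = a*x(k-1) - x(k-2) + n(k) by least squares and recover
# the harmonic frequency from a = 2*cos(w0*T).
rng = np.random.default_rng(0)
T = 1.0
w0 = 2 * np.pi * 0.1                     # true angular frequency (rad/sample)
k = np.arange(2000)
x = np.cos(w0 * T * k) + 0.05 * rng.standard_normal(k.size)

# Regress x(k) + x(k-2) on x(k-1): least-squares slope estimate of a
lhs = x[2:] + x[:-2]
mid = x[1:-1]
a_hat = np.dot(lhs, mid) / np.dot(mid, mid)
w0_hat = np.arccos(a_hat / 2) / T        # invert a = 2*cos(w0*T)
print(w0_hat)                            # close to w0
```

Note that measurement noise in the regressor x(k − 1) slightly attenuates a_hat (an errors-in-variables effect), which is one reason Yule-type fits are smoothed over long records.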
Yule further generalized the above recursion to

x(k) = a x(k − 1) + b x(k − 2) + n(k)

where a and b are arbitrary, to describe a truly autoregressive process. Since, for the right choice of a and b, the least-squares solution to the above autoregressive equation is a damped sinusoid, this generalization forms the basis for the modern-day parametric methods.

Modern Spectral Analysis

Norbert Wiener's classic work on Generalized Harmonic Analysis [Wiener, 1930] gave random processes a firm statistical foundation, and with the notion of ensemble average several key concepts were then introduced. The formalization of modern-day probability theory by Kolmogorov and his school also played an indispensable part in this development. Thus, if x(t) represents a continuous-time stochastic (random) process, then for every fixed t it behaves like a random variable with some probability density function f_x(x, t). The ensemble average or expected value of the process is given by

\mu_x(t) = E[x(t)] = \int_{-\infty}^{\infty} x \, f_x(x, t) \, dx

and the statistical correlation between two time instants t_1 and t_2 of the random process is described through its autocorrelation function

R_{xx}(t_1, t_2) = E[x(t_1) x^*(t_2)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x_1 x_2^* \, f_{x_1 x_2}(x_1, x_2, t_1, t_2) \, dx_1 \, dx_2 = R_{xx}^*(t_2, t_1)

where f_{x_1 x_2}(x_1, x_2, t_1, t_2) represents the joint probability density function of the random variables x_1 = x(t_1) and x_2 = x(t_2), and * denotes the complex conjugate (conjugate transpose, in general). Processes with autocorrelation functions that depend only upon the difference of the time instants t_1 and t_2 are known as wide sense stationary processes. Thus, if x(t) is wide sense stationary, then

E[x(t + τ) x^*(t)] = R_{xx}(τ) = R_{xx}^*(−τ)

To obtain the distribution of power versus frequency in the case of a stochastic process, one can make use of the Fourier transform based on a finite segment of the data.
Letting

P_T(\omega) = \frac{1}{2T} \left| \int_{-T}^{T} x(t) \, e^{-j\omega t} \, dt \right|^2

represent the power contained in a typical realization over the interval (−T, T), its ensemble average value as T → ∞ represents the true power contained at frequency ω. Thus, for wide sense stationary processes

S(\omega) = \lim_{T \to \infty} E[P_T(\omega)] = \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} \int_{-T}^{T} R_{xx}(t_1 - t_2) \, e^{-j\omega(t_1 - t_2)} \, dt_1 \, dt_2 = \int_{-\infty}^{\infty} R_{xx}(\tau) \, e^{-j\omega\tau} \, d\tau \ge 0    (16.1)

Moreover, the inverse relation gives

R_{xx}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} S(\omega) \, e^{j\omega\tau} \, d\omega    (16.2)

and hence

R_{xx}(0) = E[|x(t)|^2] = P = \frac{1}{2\pi} \int_{-\infty}^{\infty} S(\omega) \, d\omega

Thus S(ω) represents the power spectral density, and from Eqs. (16.1) and (16.2) the power spectral density and the autocorrelation function form a Fourier transform pair, the well-known Wiener–Khinchin theorem. If x(kT) represents a discrete-time wide sense stationary stochastic process, then

r_k = E\{x((n + k)T) \, x^*(nT)\} = r_{-k}^*

and the power spectral density is given by

S(\omega) = \sum_{k=-\infty}^{\infty} r_k \, e^{-jk\omega T}

or, in terms of the normalized variable θ = ωT,

S(\theta) = \sum_{k=-\infty}^{\infty} r_k \, e^{-jk\theta} = S(\theta + 2\pi) \ge 0    (16.3)

and the inverse relation gives the autocorrelations to be

r_k = \frac{1}{2\pi} \int_{-\pi}^{\pi} S(\theta) \, e^{jk\theta} \, d\theta = r_{-k}^*

Thus, the power spectral density of a discrete-time process is periodic. Such a process can be obtained by sampling a continuous-time process at t = kT, |k| = 0 → ∞, and if the original continuous-time process is band-limited with a two-sided bandwidth equal to 2ω_b = 2π/T, then the set of discrete samples so obtained is equivalent to the original process in a mean-square sense.
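The Wiener–Khinchin relation above can be checked numerically. The following sketch (an illustration, not from the text) averages periodograms of an assumed MA(1) process and compares the result with S(θ) = Σ_k r_k e^{−jkθ}; the process, record length, and trial count are arbitrary choices.

```python
import numpy as np

# MA(1): x(n) = w(n) + b*w(n-1) with unit-variance white noise w, so
# r_0 = 1 + b^2, r_{+-1} = b, and S(theta) = (1 + b^2) + 2*b*cos(theta).
rng = np.random.default_rng(1)
b = 0.5
theta = np.linspace(-np.pi, np.pi, 512, endpoint=False)
S_true = (1 + b**2) + 2 * b * np.cos(theta)

N, trials = 256, 400
F = np.exp(-1j * np.outer(theta, np.arange(N)))   # Fourier vectors
S_est = np.zeros_like(theta)
for _ in range(trials):
    w = rng.standard_normal(N + 1)
    x = w[1:] + b * w[:-1]
    S_est += np.abs(F @ x) ** 2 / N               # one periodogram realization
S_est /= trials                                   # ensemble average -> S(theta)
print(np.mean(np.abs(S_est - S_true)))            # small residual error
```

A single periodogram does not converge pointwise (its variance stays on the order of S(θ)²); it is the ensemble (or segment) averaging that makes the estimate settle onto S(θ), echoing Schuster's smoothing remark earlier.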
As Schur [Schur, 1917] observed, for discrete-time stationary processes the nonnegativity of the power spectrum is equivalent to the nonnegative definiteness of the Hermitian Toeplitz matrices, i.e.,

S(\theta) \ge 0 \iff T_k = \begin{pmatrix} r_0 & r_1 & \cdots & r_k \\ r_1^* & r_0 & \cdots & r_{k-1} \\ \vdots & \vdots & \ddots & \vdots \\ r_k^* & r_{k-1}^* & \cdots & r_0 \end{pmatrix} \ge 0, \quad k = 0 \to \infty    (16.4)

If x(nT) is the output of a discrete-time linear time-invariant causal system driven by w(nT), then we have the following representation:

H(z) = \sum_{k=0}^{\infty} h(kT) \, z^k, \qquad x(nT) = \sum_{k=0}^{\infty} h(kT) \, w((n - k)T)    (16.5)

In the case of a stationary input, the output is also stationary, and its power spectral density is given by

S_x(\theta) = |H(e^{j\theta})|^2 \, S_w(\theta)    (16.6)

where S_w(θ) represents the power spectral density of the input process. If the input is a white noise process, then S_w(θ) = σ² and

S_x(\theta) = \sigma^2 |H(e^{j\theta})|^2

Clearly if H(z) is rational, so is S_x(θ). Conversely, given a power spectral density

S_x(\theta) \ge 0    (16.7)

that satisfies the integrability condition

\int_{-\pi}^{\pi} S_x(\theta) \, d\theta < \infty    (16.8)

and the physical-realizability (Paley–Wiener) criterion

\int_{-\pi}^{\pi} \ln S_x(\theta) \, d\theta > -\infty    (16.9)

there exists a unique function

H(z) = \sum_{k=0}^{\infty} b_k z^k, \quad |z| < 1

that is analytic together with its inverse in |z| < 1 (the minimum phase factor) such that

S_x(\theta) = \lim_{r \to 1} |H(r e^{j\theta})|^2 = |H(e^{j\theta})|^2 \ \text{a.e.}    (16.10)

H(z) is known as the Wiener factor associated with S_x(θ); as Eq. (16.5) shows, when driven by white noise it generates a stochastic process x(nT) from past samples, and its power spectral density matches the given S_x(θ). In this context, given a finite set of autocorrelations r_0, r_1, . . ., r_n, the spectral extension problem is to obtain the class of all extensions K(θ) that match the given data, i.e.,

\frac{1}{2\pi} \int_{-\pi}^{\pi} K(\theta) \, e^{jk\theta} \, d\theta = r_k, \quad k = 0 \to n

Such an extension must automatically satisfy K(θ) ≥ 0 in addition to satisfying Eqs. (16.8) and (16.9).
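A small numerical sketch of Eqs. (16.6) and the inverse relation, using an assumed AR(1) Wiener factor H(z) = 1/(1 − az) (minimum phase in the |z| < 1, positive-powers-of-z convention used in this section), driven by unit-variance white noise; the value of a is illustrative.

```python
import numpy as np

# S_x(theta) = |H(e^{j theta})|^2 for H(z) = 1/(1 - a z), |a| < 1
a = 0.6
theta = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
S = 1.0 / np.abs(1 - a * np.exp(1j * theta)) ** 2

# Recover autocorrelations via r_k = (1/2pi) \int S(theta) e^{j k theta} d theta
# and compare with the closed-form AR(1) values r_k = a^k / (1 - a^2).
dtheta = 2 * np.pi / theta.size
r_hat = [(S * np.exp(1j * k * theta)).sum().real * dtheta / (2 * np.pi)
         for k in range(4)]
r_true = [a ** k / (1 - a ** 2) for k in range(4)]
print(r_hat, r_true)
```

Because S(θ) is smooth and periodic, the equispaced Riemann sum is essentially exact here, so the recovered r_k agree with the closed form to machine precision.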
The solution to this problem is closely related to the trigonometric moment problem, and it has a long and continuing history through the works of Schur [1917]; Nevanlinna; Akheizer and Krein [1962]; Geronimus [1954]; and Shohat and Tamarkin [1970], to name a few. If the given autocorrelations are such that the matrix T_n in Eq. (16.4) is singular, then there exists an m ≤ n such that T_{m−1} is positive definite (T_{m−1} > 0) and T_m is singular [det T_m = 0, det(·) representing the determinant]. In that case, there exists a unique vector X = (x_0, x_1, . . ., x_m)^T such that T_m X = 0 and, further, the autocorrelations have a unique extension given by

r_k = \sum_{i=1}^{m} P_i \, e^{jk\theta_i}, \quad |k| = 0 \to \infty    (16.11)

where e^{jθ_i}, i = 1 → m, are the m zeros of the polynomial x_0 + x_1 z + . . . + x_m z^m and P_i > 0. This gives

T_{m-1} = A^* \begin{pmatrix} P_1 & 0 & \cdots & 0 \\ 0 & P_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_m \end{pmatrix} A    (16.12)

where A is an m × m Vandermonde matrix given by

A = \begin{pmatrix} 1 & \lambda_1 & \cdots & \lambda_1^{m-1} \\ 1 & \lambda_2 & \cdots & \lambda_2^{m-1} \\ \vdots & \vdots & & \vdots \\ 1 & \lambda_m & \cdots & \lambda_m^{m-1} \end{pmatrix}, \quad \lambda_i = e^{j\theta_i}, \quad i = 1 \to m

and Eq. (16.12) can be used to determine P_k > 0, k = 1 → m. The power spectrum associated with Eq. (16.11) is given by

S(\theta) = \sum_{k=1}^{m} P_k \, \delta(\theta - \theta_k)

and it represents a discrete spectrum that corresponds to pure uncorrelated sinusoids with signal powers P_1, P_2, . . ., P_m. If the given autocorrelations satisfy T_n > 0, then from Eq. (16.4) every unknown r_k, k ≥ n + 1, must be selected so as to satisfy T_k > 0, and this gives

|r_{k+1} - \zeta_k|^2 \le R_k^2    (16.13)

where ζ_k = φ_k^T T_{k−1}^{−1} b_k, φ_k = (r_1, r_2, . . ., r_k)^T, b_k = (r_k, r_{k−1}, . . ., r_1)^T, and R_k = det T_k / det T_{k−1}. From Eq. (16.13), the unknowns can lie anywhere inside a sequence of circles with centers ζ_k and radii R_k, and as a result there are an infinite number of solutions to this problem.
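The singular case above can be exercised numerically. The following sketch (an illustration with assumed powers and frequencies, not from the text) builds T_m for two complex sinusoids, extracts its null vector, reads the frequencies off the polynomial zeros per Eq. (16.11), and solves for the powers in the spirit of Eq. (16.12).

```python
import numpy as np

P_true = np.array([2.0, 1.0])            # assumed sinusoid powers
th_true = np.array([0.5, 1.7])           # assumed frequencies (rad)
m = 2
r = [np.sum(P_true * np.exp(1j * k * th_true)) for k in range(m + 1)]

# Hermitian Toeplitz T_m as in Eq. (16.4); for a pure line spectrum it is singular.
T = np.array([[r[j - i] if j >= i else np.conj(r[i - j]) for j in range(m + 1)]
              for i in range(m + 1)])
evals, evecs = np.linalg.eigh(T)         # eigenvalues ascending; first ~ 0
x = evecs[:, 0]                          # null vector of T_m

zeros = np.roots(x[::-1])                # zeros of x_0 + x_1 z + x_2 z^2
th_hat = np.sort(np.angle(zeros))        # recovered frequencies on |z| = 1
# Powers from r_k = sum_i P_i e^{j k theta_i}, k = 0, 1
E = np.exp(1j * np.outer(np.arange(m), th_hat))
P_hat = np.linalg.solve(E, np.array(r[:m])).real
print(th_hat, P_hat)
```

This is the mechanism behind Pisarenko-type line-spectrum methods: the null vector of the singular Toeplitz matrix pins the frequencies, and the remaining linear system pins the powers.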
Schur and Nevanlinna have given an analytic characterization of these solutions in terms of bounded function extensions. A bounded function ρ(z) is analytic in |z| < 1 and satisfies the inequality |ρ(z)| ≤ 1 everywhere in |z| < 1. In a network theory context, Youla [1980] has also given a closed-form parametrization of this class of solutions. In that case, given r_0, r_1, . . ., r_n, the minimum phase transfer functions satisfying Eqs. (16.8) and (16.9) are given by

H_\rho(z) = \frac{\Gamma(z)}{P_n(z) - z \, \rho(z) \, \tilde{P}_n(z)}    (16.14)

where ρ(z) is an arbitrary bounded function that satisfies the inequality (Paley–Wiener criterion)

\int_{-\pi}^{\pi} \ln \left( 1 - |\rho(e^{j\theta})|^2 \right) d\theta > -\infty

and Γ(z) is the minimum phase factor obtained from the factorization

1 - |\rho(e^{j\theta})|^2 = |\Gamma(e^{j\theta})|^2

Further, P_n(z) represents the Levinson polynomial generated from r_0 → r_n through the recursion

\sqrt{1 - |s_n|^2} \, P_n(z) = P_{n-1}(z) - z \, s_n \, \tilde{P}_{n-1}(z)

that starts with P_0(z) = 1/\sqrt{r_0}, where

s_n = P_{n-1}(0) \left\{ P_{n-1}(z) \sum_{k=0}^{n} r_k z^k \right\}_n    (16.15)

represents the reflection coefficient at stage n. Here, {·}_n denotes the coefficient of z^n in the expansion {·}, and P̃_n(z) = z^n P_n^*(1/z^*) represents the polynomial reciprocal to P_n(z). Notice that the given information r_0 → r_n enters P_n(z) through Eq. (16.15). The power spectral density K(θ) = |H_ρ(e^{jθ})|² associated with Eq. (16.14) satisfies all the interpolation properties described before. In Eq. (16.14), the solution ρ(z) ≡ 0 gives H(z) = 1/P_n(z), a pure AR(n) system that coincides with Burg's maximum entropy extension. Clearly, if H_ρ(z) is rational, then ρ(z) must be rational and, more interestingly, every rational system must follow from Eq. (16.14) for a specific rational bounded function ρ(z).
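The reflection coefficients and the AR(n) maximum entropy extension mentioned above are usually computed with the Levinson–Durbin recursion. The sketch below uses the common a-coefficient (prediction-error filter) normalization rather than the chapter's normalized P_n(z), and its sign convention for the reflection coefficients is one of several found in the literature; the AR(2) test model is an illustrative assumption.

```python
import numpy as np

def levinson(r, n):
    """Levinson-Durbin: order-n prediction filter from autocorrelations r_0..r_n."""
    a = np.array([1.0])                  # prediction-error filter, a[0] = 1
    E = r[0]                             # prediction error power
    ks = []                              # reflection coefficients
    for m in range(1, n + 1):
        k = -np.dot(a, r[m:0:-1]) / E
        a_pad = np.concatenate([a, [0.0]])
        a = a_pad + k * a_pad[::-1]      # order update
        E *= 1 - k * k
        ks.append(k)
    return a, E, ks

# AR(2) test: x(n) = 1.2 x(n-1) - 0.5 x(n-2) + w(n); its normalized
# autocorrelations follow the Yule-Walker recursion from rho_1 = phi1/(1 - phi2).
phi = np.array([1.2, -0.5])
r = [1.0, phi[0] / (1 - phi[1])]
r.append(phi[0] * r[1] + phi[1] * r[0])
a, E, ks = levinson(np.array(r), 2)
print(a)     # recovers the AR polynomial coefficients [1, -1.2, 0.5]
```

With ρ(z) ≡ 0 in Eq. (16.14), the spectrum implied by this fit, σ²/|A(e^{jθ})|², is exactly the maximum entropy extension of the given r_0, r_1, r_2.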
Of course, the choice of ρ(z) brings in extra freedom, and this can be profitably used for system identification as well as rational and stable approximation of nonrational systems [Pillai and Shim, 1993].

Defining Terms

Autocorrelation function: The expected value of the product of two random variables generated from a random process for two time instants; it represents their interdependence.
Expected value (or mean) of a random variable: Ensemble average value of a random variable, given by integrating the random variable after scaling by its probability density function (weighted average) over the entire range.
Power spectrum: A nonnegative function that describes the distribution of power versus frequency. For wide sense stationary processes, the power spectrum and the autocorrelation function form a Fourier transform pair.
Probability density function: The probability of the random variable taking values between two real numbers x_1 and x_2 is given by the area under the nonnegative probability density function between those two points.
Random variable: A continuous or discrete valued variable that maps the set of all outcomes of an experiment into the real line (or complex plane). Because the outcomes of an experiment are inherently random, the final value of the variable cannot be predetermined.
Stochastic process: A real valued function of time t, which for every fixed t behaves like a random variable.

Related Topics

14.1 Fourier Transforms • 40.2 Spectrum, Specifications, and Measurement Techniques • 73.3 Stochastic Processes

References

N.I. Akheizer and M. Krein, Some Questions in the Theory of Moments, American Math. Soc. Monogr., 2, 1962.
J.B.J. Fourier, Théorie Analytique de la Chaleur (Analytical Theory of Heat), Paris, 1822.
L.Y. Geronimus, Polynomials Orthogonal on a Circle and Their Applications, American Math. Soc. Translation, 104, 1954.
I. Newton, Philos. Trans., vol. IV, p. 3076, 1671.
S.U. Pillai and T.I. Shim, Spectrum Estimation and System Identification, New York: Springer-Verlag, 1993.
I. Schur, "Über Potenzreihen, die im Innern des Einheitskreises beschränkt sind," Journal für die Reine und Angewandte Mathematik, vol. 147, pp. 205–232, 1917.
J.A. Shohat and J.D. Tamarkin, The Problem of Moments, American Math. Soc. Math. Surveys, 1, 1970.
N. Wiener, "Generalized harmonic analysis," Acta Math., vol. 55, pp. 117–258, 1930.
D.C. Youla, "The FEE: A New Tunable High-Resolution Spectral Estimator: Part I," Technical Note no. 3, Dept. of Electrical Engineering, Polytechnic Institute of New York, Brooklyn, New York, 1980; also RADC Report RADC-TR-81-397, AD A114996, 1982.
G.U. Yule, "On a method of investigating periodicities in disturbed series, with special reference to Wolfer's sunspot numbers," Philos. Trans. R. Soc. London, Ser. A, vol. 226, pp. 267–298, 1927.

16.2 Parameter Estimation

Stella N. Batalama and Dimitri Kazakos

Parameter estimation is the operation of assigning a value in a continuum of alternatives to an unknown parameter based on a set of observations involving some function of the parameter. The estimate is the value assigned to the parameter, and the estimator is the function of the observations that yields the estimate. The basic elements in parameter estimation are a vector parameter θ^m, a vector space E^m where θ^m takes its values, a stochastic process X(t) parameterized by θ^m, and a performance criterion or cost function. The estimate θ̂^m(x^n) based on the observation vector x^n = [x_1, x_2, . . ., x_n] is a solution of some optimization problem according to the performance criterion. In the following, the function f(x^n | θ^m) will denote the conditional joint probability density function of the random variables x_1, . . ., x_n. There are several parameter estimation schemes.
If the process X(t) is parametrically known, i.e., if its conditional joint probability density functions are known for each fixed value θ^m of the vector parameter, then the corresponding parameter estimation scheme is called parametric. If the statistics of the process X(t) are nonparametrically described, i.e., given θ^m ∈ E^m any joint probability density function of the process is a member of some nonparametric class of probability density functions, then the nonparametric estimation schemes arise.

Let Γ^n denote the n-dimensional observation space. Then an estimator θ̂^m(x^n) of a vector parameter θ^m is a function from the observation space Γ^n to the parameter space E^m. Since it is a function of random variables, it is itself a random variable (or random vector). There are certain stochastic properties of estimators that quantify their quality. In this sense, an estimator is said to be unbiased if its expected value is the true parameter value, i.e., if

E_θ\{\hat{θ}^m(x^n)\} = θ^m

where the subscript θ on the expectation symbol denotes that the expectation is taken according to the probability density function f(x^n | θ^m). In the case where the observation space is ℝ^n and the parameter is a scalar, this expectation is

E_θ\{\hat{θ}(x^n)\} = \int_{\mathbb{R}^n} \hat{θ}(x^n) \, f(x^n | θ) \, dx^n

The bias of the estimate is the Euclidean norm ||θ^m − E_θ{θ̂^m(x^n)}||. Thus, the bias measures the distance between the expected value of the estimate and the true value of the parameter; the estimator is unbiased when the bias is zero. Usually it is of interest to know the conditional variance of an unbiased estimate. The bias of the estimate θ̂^m(x^n) and the conditional variance E_θ{||θ̂^m(x^n) − E_θ{θ̂^m(x^n)}||² | θ^m} generally represent a trade-off. Indeed, an unbiased estimate may induce a relatively large variance, while the introduction of some low-level bias may result in a significant reduction of the induced variance. In general, the bias-versus-variance trade-off should be studied carefully for the correct evaluation of any given parameter estimate.

A parameter estimate is called efficient if its conditional variance equals a lower bound known as the Rao–Cramér bound. It will be useful to present this bound briefly. The Rao–Cramér bound gives a theoretical minimum for the variance of any unbiased estimate. More specifically, let θ̂(x^n) be an unbiased estimate of a scalar parameter θ given the observation vector x^n. Let f(x^n | θ) be given, twice continuously differentiable with respect to θ, and satisfying some other mild regularity conditions. Then

E_θ\left\{ \left[ \hat{θ}(x^n) - θ \right]^2 \right\} \ge \left( E_θ\left\{ -\frac{\partial^2}{\partial θ^2} \log f(x^n | θ) \right\} \right)^{-1}

Sometimes we need to consider the case where the sample size n increases to infinity. In such a case, an estimator is said to be consistent if

θ̂^m(x^n) → θ^m as n → ∞

Since the estimate θ̂^m(x^n) is a random variable, we have to specify in what sense the above limit holds. If it holds w.p. 1, we say that θ̂^m(x^n) is strongly consistent, or consistent w.p. 1. In a similar way we can define a weakly consistent estimator. As far as the asymptotic distribution of θ̂(x^n) as n → ∞ is concerned, it turns out that the central limit theorem can often be applied to θ̂(x^n) to infer that [θ̂(x^n) − θ], suitably normalized, is asymptotically normal with zero mean as n → ∞.

In order to examine certain parameter estimation schemes we first need to define some related functions. The penalty or cost function c[θ^m, θ̂^m(x^n)] is a scalar, nonnegative function whose values vary as θ^m varies in the parameter space E^m and as the sequence x^n takes different values in the observation space Γ^n. The conditional expected penalty c(θ^m, θ̂^m) induced by the parameter estimate and the penalty function is defined as

c(θ^m, θ̂^m) = E_θ\{c[θ^m, θ̂^m(x^n)]\}

If an a priori density function p(θ^m) is available, then the expected penalty c(θ̂^m, p) can be evaluated.
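A Monte Carlo sketch (an illustration, not from the text) of unbiasedness and the Rao–Cramér bound in the simplest setting: for x_i i.i.d. N(θ, σ²), the bound is σ²/n, and the sample mean attains it; the sample size, variance, and trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma2, theta = 50, 4.0, 1.0
# 20000 independent realizations of the sample-mean estimator
est = rng.normal(theta, np.sqrt(sigma2), size=(20000, n)).mean(axis=1)
print(est.mean())                 # ~ theta: (near-)zero bias
print(est.var(), sigma2 / n)      # empirical variance ~ Rao-Cramer bound 0.08
```

The sample mean is thus efficient for this model; the bias-versus-variance trade-off discussed above appears only when one moves to biased (e.g., shrinkage) estimators.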
The various existing parameter estimation schemes evolve as the solutions of optimization problems whose objective function is either the conditional expected penalty or the conditional density function f(x^n | θ^m).

Bayesian Estimation Scheme

In the Bayesian estimation scheme the available assets are:
1. A parametrically known stochastic process parameterized by θ^m, in other words, a given conditional joint density function f(x^n | θ^m) defined on the observation space Γ^n, where θ^m is a well-defined parameter vector.
2. A realization x^n from the underlying active process, where the implied assumption is that the process remains unchanged throughout the whole observation period.
3. A density function p(θ^m) defined on the parameter space E^m.
4. For each data sequence x^n, parameter vector θ^m, and parameter estimate θ̂^m(x^n), a given penalty scalar function c[θ^m, θ̂^m(x^n)].
5. A performance criterion, which is the minimization of the expected penalty c(θ̂^m, p).

The estimate θ̂_0^m that minimizes the expected penalty is called the optimal Bayesian estimate at p. Under some mild conditions the optimal Bayesian estimate θ̂_0^m(x^n) is the conditional expectation E{θ^m | x^n}. If the penalty function has the form c[θ^m, θ̂^m] = 1 − δ(||θ^m − θ̂^m||), where δ(·) is the delta function, then the optimal Bayesian estimate is called the maximum a posteriori estimate, since it happens to maximize the a posteriori density p(θ^m | x^n). Another special case of penalty function is ||θ^m − θ̂^m||². In this case the Bayesian estimate is called the minimum mean-square estimate and equals the conditional expectation E{θ^m | x^n}. In the following we present some more details about mean-square estimation, since it is one of the most popular schemes.
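A conjugate-Gaussian sketch of the Bayesian scheme (the prior, likelihood, and all numbers are illustrative assumptions, not from the text): with θ ~ N(μ_0, τ²) and x_i | θ ~ N(θ, σ²), the posterior is Gaussian, E{θ | x^n} has a closed form, and the MAP estimate coincides with it.

```python
import numpy as np

rng = np.random.default_rng(3)
mu0, tau2, sigma2, n, trials = 0.0, 1.0, 0.5, 10, 20000
theta = rng.normal(mu0, np.sqrt(tau2), trials)              # draws from the prior
x = rng.normal(theta[:, None], np.sqrt(sigma2), (trials, n))
xbar = x.mean(axis=1)

post_prec = 1 / tau2 + n / sigma2                            # posterior precision
theta_bayes = (mu0 / tau2 + n * xbar / sigma2) / post_prec   # E{theta | x^n}

mse_bayes = np.mean((theta_bayes - theta) ** 2)              # ~ 1/post_prec
mse_ml = np.mean((xbar - theta) ** 2)                        # ~ sigma2/n
print(mse_bayes, mse_ml)                                     # Bayes beats the mean
```

The Bayes estimate is a precision-weighted blend of the prior mean and the sample mean, which is exactly the low-level-bias-for-lower-variance trade mentioned earlier.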
Mean-Square Estimation

For simplicity we consider the case of estimating a single continuous-type random variable θ with density p(θ), instead of estimating a random vector, and we reduce the dimensionality of the observation space to one. In this framework the penalty function is the square of the estimation error, (θ − θ̂)², and the performance or optimality criterion is the minimization of the mean (expected) square value of the estimation error.

We first consider the case of estimating the random variable θ by a constant θ̂. This means that we wish to find a constant θ̂ such that the mean-square (MS) error

e = E\{(θ - \hat{θ})^2\} = \int_{-\infty}^{\infty} (θ - \hat{θ})^2 \, p(θ) \, dθ

is minimum. Since e depends on θ̂, it is minimum if

\frac{de}{d\hat{θ}} = -2 \int_{-\infty}^{\infty} (θ - \hat{θ}) \, p(θ) \, dθ = 0

i.e., if

\hat{θ} = E\{θ\} = \int_{-\infty}^{\infty} θ \, p(θ) \, dθ

The case where θ is to be estimated by a function θ̂(x) of the random variable (observation) x, and not by a constant, is examined next. In this case the MS error takes the form

e = E\{[θ - \hat{θ}(x)]^2\} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} [θ - \hat{θ}(x)]^2 \, p(θ, x) \, dθ \, dx

where p(θ, x) is the joint density of the random variables θ and x. In this case we need to find the function θ̂(x) that minimizes the MS error. It can be proved that this function is

\hat{θ}(x) = E\{θ | x\} = \int_{-\infty}^{\infty} θ \, p(θ | x) \, dθ

The function θ̂(x) is called the nonlinear MS estimate. As we have seen, when the penalty function is the quadratic function (θ − θ̂)², the optimal Bayesian estimate is the conditional expectation E{θ | x}. If x and θ are jointly Gaussian, this conditional expectation is a linear function of x. When the statistics are not Gaussian, however, the optimal Bayesian estimate is generally a nonlinear function of x. To resolve this problem we introduce suboptimal Bayesian schemes for the quadratic penalty function.
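The constant-estimate result above can be checked directly. The sketch below (illustrative distribution and numbers, not from the text) scans candidate constants c and confirms that the empirical MS error E{(θ − c)²} bottoms out at the sample mean; an asymmetric mixture is used so the answer is not obvious by symmetry.

```python
import numpy as np

rng = np.random.default_rng(7)
# Asymmetric mixture: 70% N(0, 1) and 30% N(4, 0.25); E{theta} = 1.2
theta = np.concatenate([rng.normal(0, 1, 70000), rng.normal(4, 0.5, 30000)])

cs = np.linspace(-1, 5, 601)
mse = [np.mean((theta - c) ** 2) for c in cs]
c_best = cs[int(np.argmin(mse))]
print(c_best, theta.mean())        # both near 0.7*0 + 0.3*4 = 1.2
```

The empirical MS error is an exact parabola in c, so the grid minimizer lands on the grid point nearest the sample mean, matching θ̂ = E{θ}.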
In particular, we consider only the class of linear parameter estimates and seek the one that minimizes the expected quadratic penalty. This estimate is called the linear MS estimate, and it is used in many applications because of the simplicity of the solution. The linear estimation problem is the estimation of a random variable θ in terms of a linear function Ax + B of x, i.e., θ̂(x) = Ax + B. In this case we need to find the constants A and B that minimize the MS error

e = E\{[θ - (Ax + B)]^2\}

A fundamental principle in MS estimation is the orthogonality principle. It states that the optimum linear MS estimate Ax + B of θ is such that the estimation error θ − (Ax + B) is orthogonal to the data x, i.e.,

E\{[θ - (Ax + B)] \, x\} = 0

Using this principle, it can be proved that e is minimum if

A = r \, \frac{σ_θ}{σ_x}, \qquad B = η_θ - A \, η_x

where η_x = E{x} and η_θ = E{θ} are the means of x and θ; σ_x² = E{(x − η_x)²} and σ_θ² = E{(θ − η_θ)²} are the corresponding variances (σ_x and σ_θ the standard deviations of x and θ); and

r = \frac{E\{(x - η_x)(θ - η_θ)\}}{σ_x \, σ_θ}

is the correlation coefficient of x and θ. The MS error then takes the form e = σ_θ²(1 − r²). The estimate θ̂(x) = Ax + B is called the nonhomogeneous linear estimate of θ in terms of x. If θ is estimated by a function θ̂(x) = ax, the estimate is called homogeneous. It can also be shown by the orthogonality principle that for the homogeneous estimate

a = \frac{E\{xθ\}}{E\{x^2\}}

Using the orthogonality principle it can be shown that if the random variables θ and x are Gaussian with zero mean, then the optimum nonlinear estimate of θ equals the linear estimate. In other words, if θ̂(x) = E{θ | x} is the optimum nonlinear estimate of θ and θ̂ = ax is the optimum linear estimate, then θ̂(x) = E{θ | x} = θ̂ = ax. This is true since the random variables θ and x have zero mean, E{θ} = E{x} = 0, and thus the linear estimate θ̂ has zero mean too, E{θ̂} = 0. This implies that the linear estimation error θ − θ̂ also has zero mean, E{θ − θ̂} = 0.
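The formulas for A and B and the orthogonality principle can be verified from sample moments. The sketch below (an illustration with an assumed dependent pair, not from the text) computes the nonhomogeneous linear estimate and checks both that the error is orthogonal to the data and that the MS error equals σ_θ²(1 − r²).

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200000
x = rng.standard_normal(N)
theta = 0.7 * x + 0.3 * rng.standard_normal(N) + 1.0   # some dependent pair

r = np.corrcoef(x, theta)[0, 1]
A = r * theta.std() / x.std()                          # A = r * sigma_theta / sigma_x
B = theta.mean() - A * x.mean()                        # B = eta_theta - A * eta_x
err = theta - (A * x + B)
print(np.mean(err * x))                                # ~ 0: orthogonality
print(np.mean(err ** 2), theta.var() * (1 - r ** 2))   # both equal the MS error
```

Because A and B are built from the same sample moments used in the check, the orthogonality E{(θ − Ax − B)x} = 0 holds here to floating-point accuracy, not merely approximately.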
On the other hand, the orthogonality principle implies that the linear estimation error ε = θ − θ̂ is orthogonal to the data, E{εx} = 0. Since ε is Gaussian, it is independent of x, and thus E{ε | x} = E{ε} = 0, which is equivalent to the following chain:

E\{θ - \hat{θ} | x\} = 0 \implies E\{θ | x\} - E\{\hat{θ} | x\} = 0 \implies E\{θ | x\} = E\{\hat{θ} | x\} = ax \implies \hat{θ}(x) = \hat{θ}

i.e., the nonlinear and the linear estimates coincide. In addition, since the linear estimation error ε = θ − θ̂ is independent of the data x, so is the squared error, i.e.,

E\{(θ - \hat{θ})^2 | x\} = E\{(θ - \hat{θ})^2\} = V

Thus, the conditional mean of θ given the data x equals its MS estimate, and the conditional variance equals the MS error. This simplifies the evaluation of conditional densities when Gaussian random variables are involved, because f(θ | x) is Gaussian and therefore has the form

f(θ | x) = \frac{1}{\sqrt{2\pi V}} \exp\left\{ -\frac{(θ - ax)^2}{2V} \right\}

Minimax Estimation Scheme

In the minimax estimation scheme the available assets are:
1. A parametrically known stochastic process parameterized by θ^m.
2. A realization x^n from the underlying active process.
3. A scalar penalty function c[θ^m, θ̂^m(x^n)] for each data sequence x^n, parameter vector θ^m, and parameter estimate θ̂^m(x^n).

The minimax schemes are solutions of saddle-point game formalizations, with payoff function the expected penalty c(θ̂^m, p) and with variables the parameter estimate θ̂^m and the a priori parameter density function p. If a minimax estimate θ̂_0^m exists, it is an optimal Bayesian estimate at some least favorable a priori distribution p_0.

Maximum Likelihood Estimation Scheme

Maximum likelihood estimation was first introduced by Fisher. It is a very powerful estimation procedure that yields many of the well-known estimation methods as special cases.
The essential difference between Bayesian and maximum likelihood parameter estimation is that in Bayesian estimation the parameter θ^m is considered to be random with a given density function, while in the maximum likelihood framework it is unknown but fixed. Consider a random process X(t) parameterized by θ^m, where θ^m is an unknown but fixed parameter vector of finite dimensionality m (e.g., θ^m ∈ ℝ^m). More specifically, the conditional joint probability density function f(x_1, . . ., x_n | θ^m) is well known for every θ^m, where x^n = [x_1, . . ., x_n] is a realization (or observation vector, or sample vector) of the process X(t). The problem is to find an estimate of the parameter vector θ^m based on the realization of X(t). (Note that the dimensionality of the parameter vector θ^m in the joint probability density function is assumed to be fixed.) The intuition behind the maximum likelihood method is that we choose those parameters [θ_1, . . ., θ_m] from which the actually observed sample vector is most likely to have come. This means that the estimator of θ^m is selected so that the observed sample vector becomes as "likely as possible." In this sense we call the conditional joint probability density function f(x^n | θ^m) the likelihood function l(θ^m). The likelihood function l(θ^m) is a deterministic function of the parameter vector θ^m once the observed variables x_1, . . ., x_n are inserted: now θ^m is variable and the sample vector x^n is fixed, whereas previously the conditional joint probability density function was considered a function of the observation vector x^n with θ^m fixed. The maximum likelihood estimator of θ^m is that value of the parameter vector for which the likelihood function is maximized. In many cases it is more convenient to work with another function, the log-likelihood function L(θ^m), rather than the likelihood function itself.
The log-likelihood function is the natural logarithm of l(θ^m). Since the logarithm is a monotone function, whenever L(θ^m) achieves its maximum value, l(θ^m) is maximized too, for the same value of the parameter vector θ^m. Thus the log-likelihood function is maximized at the value of the vector parameter θ^m for which the first partial derivatives with respect to θ_i, i = 1, . . ., m, are equal to zero, i.e.,

\left. \frac{\partial L(θ^m)}{\partial θ_i} \right|_{\hat{θ}_{ML}} = 0, \quad i = 1, \ldots, m

where θ̂_ML denotes the maximum likelihood estimate of the vector parameter θ^m. It can be shown that when the process X(t) is memoryless and stationary (i.e., when x_1, . . ., x_n are independent, identically distributed) the ML estimators are consistent, asymptotically efficient, and asymptotically Gaussian.

Example: Let x_i, i = 1, . . ., n, be Gaussian independent random variables with mean θ and variance σ_i²: x_i ~ N(θ, σ_i²). The mean θ is to be estimated and the Rao–Cramér bound is to be evaluated. Since θ is unknown but fixed, we use the maximum likelihood estimation scheme. The random variable x_i has the probability density function

f(x_i) = \frac{1}{\sqrt{2\pi} \, σ_i} \exp\left\{ -\frac{(x_i - θ)^2}{2σ_i^2} \right\}

Since the x_i, i = 1, . . ., n, are independent, the joint density function is

f(x_1, \ldots, x_n | θ) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi} \, σ_i} \exp\left\{ -\frac{(x_i - θ)^2}{2σ_i^2} \right\}

which is exactly the likelihood function for this estimation problem. The log-likelihood function is

\log f(x_1, \ldots, x_n | θ) = -\frac{n}{2} \log(2\pi) - \sum_{i=1}^{n} \log σ_i - \sum_{i=1}^{n} \frac{(x_i - θ)^2}{2σ_i^2}

We can maximize the log-likelihood function with respect to θ and find the maximizing value to be

\hat{θ}_{ML} = \frac{\sum_{i=1}^{n} x_i / σ_i^2}{\sum_{i=1}^{n} 1 / σ_i^2}

Note that for equal variances the maximum likelihood estimate coincides with the commonly used sample mean.
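A numerical companion to the example (the particular variances, true mean, and trial count are illustrative assumptions): the ML estimate is the inverse-variance weighted mean derived above, and it clearly outperforms the plain sample mean when the variances differ.

```python
import numpy as np

rng = np.random.default_rng(5)
theta = 2.0
sigma = np.array([0.5, 1.0, 2.0, 4.0])                # unequal noise levels
x = rng.normal(theta, sigma, size=(50000, sigma.size))

w = 1 / sigma ** 2                                    # inverse-variance weights
theta_ml = (x * w).sum(axis=1) / w.sum()              # ML estimate per trial
naive = x.mean(axis=1)                                # plain sample mean
print(theta_ml.mean(), theta_ml.var())                # unbiased; var = 1/sum(w)
print(naive.var())                                    # much larger variance
```

The empirical variance of θ̂_ML matches 1/Σ(1/σ_i²), which anticipates the Rao–Cramér computation that follows.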
The Rao-Cramér bound can be found as follows:

    E{[d/dθ log f(x_n | θ)]²} = −E{d²/dθ² log f(x_n | θ)} = Σ_{i=1}^n 1/σ_i²

so the variance of any unbiased estimate of θ is at least [Σ_{i=1}^n 1/σ_i²]^{−1}. In conclusion, we see that for Gaussian data the sample mean estimate is efficient because it coincides with the maximum likelihood estimate. When the data are contaminated with a fraction of samples coming from an unknown probability density function, the so-called outliers, the sample mean performs poorly even when the fraction of outliers is small. This observation gave birth to the branch of statistics called robust statistics.

Other Parameter Estimation Schemes

The Bayesian, minimax, and maximum likelihood estimation schemes described above make up the class of parametric parameter estimation procedures. The common characteristic of these procedures is the availability of a parametrically known stochastic process that generates the observation sequence x_n. When for every given parameter value θ_m the stochastic process that generates x_n is nonparametrically described, nonparametric estimation schemes arise. The latter schemes may evolve as the solutions of certain saddle-point games whose payoff function originates from the parametric maximum likelihood formalizations. It is assumed that, in addition to the nonparametrically described data-generating process, the only assets available are a realization x_n from the underlying active process and the parameter space Θ^m. Qualitative robustness in parameter estimation corresponds to local performance stability for small deviations from a nominal, parametrically known, data-generating process.

Defining Terms

Bayesian estimation: An estimation scheme in which the parameter to be estimated is modeled as a random variable with known probability density function.
Bias: The norm of the difference between the mean value of the estimate and the true parameter value.
Consistent estimator: An estimator whose value converges to the true parameter value as the sample size tends to infinity. If the convergence holds w.p.
1, then the estimator is called strongly consistent or consistent w.p. 1.
Efficient estimator: An estimator whose variance achieves the Rao-Cramér bound.
Estimate: Our best guess of the parameter of interest based on a set of observations.
Estimator: A mapping from the data space to the parameter space that yields the estimate.
Homogeneous linear estimator: An estimator which is a homogeneous linear function of the data.
Maximum likelihood estimate: An estimate that maximizes the probability density function of the data conditioned on the parameter.
Mean-square estimation: An estimation scheme in which the cost function is the mean-square error.
Minimax estimate: The optimum estimate for the least favorable prior distribution.
Nonhomogeneous linear estimator: An estimator which is a nonhomogeneous linear function of the data.
Nonlinear MS estimate: The optimum estimate under the mean-square performance criterion.
Nonparametric estimation: An estimation scheme in which no parametric description of the statistical model is available.
Orthogonality principle: The fundamental principle for MS estimates. It states that the estimation error is orthogonal to the data.
Parameter estimation: The procedure by which we combine all available data to obtain our best guess about a parameter of interest.
Parametric estimation: An estimation scheme in which the statistical description of the data is given according to a parametric family of statistical models.
Penalty or cost function: A nonnegative scalar function which represents the cost incurred by an inaccurate value of the estimate.
Qualitative robustness: A geometric formulation of robust estimation.
Robust estimation: An estimation scheme in which we optimize performance for the least favorable statistical environment among a specified statistical class.
Unbiased estimator: An estimator whose mean value is equal to the true parameter value.

Related Topics
73.1 Signal Detection • 73.3 Stochastic Processes

References
S. Haykin, Adaptive Filter Theory, Englewood Cliffs, N.J.: Prentice-Hall, 1991.
D. Kazakos and P. Papantoni-Kazakos, Detection and Estimation, New York: Computer Science Press, 1990.
L. Ljung and T. Söderström, Theory and Practice of Recursive Identification, Cambridge, Mass.: The MIT Press, 1983.
A. Papoulis, Probability, Random Variables, and Stochastic Processes, New York: McGraw-Hill, 1984.

Further Information
IEEE Transactions on Information Theory is a bimonthly journal that publishes papers on theoretical aspects of estimation theory and, in particular, on transmission, processing, and utilization of information. IEEE Transactions on Signal Processing is a monthly journal which presents applications of estimation theory to speech recognition and processing, acoustical signal processing, and communication. IEEE Transactions on Communications is a monthly journal presenting applications of estimation theory to data communication problems, synchronization of communication systems, channel equalization, and image processing.

16.3 Kalman Filtering

Fred Daum

The Kalman filter is a linear recursive algorithm that solves the least squares problem for time-varying linear systems with non-stationary noise. It estimates the state of a linear dynamical system given linear measurements corrupted by additive noise. It is an optimal estimator, assuming that the measurement noise is Gaussian and that all other relevant probability densities are also Gaussian. For example, the location of your car can be estimated using a Kalman filter to combine noisy measurements of distance from four or more satellites. As a second example, the position and velocity of an airplane can be estimated by a Kalman filter using noisy measurements of range, azimuth, and elevation from one or more radars.
As a third example, the future price of IBM stock can be predicted using a Kalman filter to combine noisy data on thousands of relevant economic variables, using a dynamic model of the stock market and the overall economy. The Kalman filter has been applied to many diverse real-world engineering problems, including spacecraft navigation, GPS navigation, robotics, air traffic control, missile guidance, chemical plant control, stock market prediction, weather prediction, speech recognition, speech encoding and compression, radar target tracking, satellite orbit estimation, and inertial navigation. See [Sorenson, 1985] for other applications.

Most real-world engineering problems have measurement equations or dynamical system equations that are nonlinear in the state vector. Therefore, the Kalman filter equations cannot be applied directly; rather, the problem must be approximated using linear equations. This linear approximation is very straightforward, and it is called the "extended Kalman filter" (EKF). One of the main reasons for the wide application of the Kalman filter is the ease with which a nonlinear system can be approximated by a linear system. The resulting approximation is often very good, resulting in good EKF performance. Unfortunately, the EKF performance is sometimes poor, in which case a plethora of alternative approximations can be attempted.

Kalman Filter Equations

The Kalman filter algorithm is shown as a block diagram in Fig. 16.1. The estimate of x is updated recursively as new measurements become available; each measurement update corresponds to one iteration of Fig. 16.1. The symbols in Fig. 16.1 are defined in Table 16.1. The Kalman filter uses both a model of the system dynamics

    x_k = Φ_k x_{k−1} + w_k    (16.16)

as well as a model of the measurements

    z_k = H_k x_k + v_k    (16.17)

These models are a combination of deterministic and random effects.
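Models (16.16) and (16.17) can be simulated directly. The sketch below assumes a two-state constant-velocity system with position-only measurements; the matrices and noise levels are illustrative choices, not from the text:

```python
import random

random.seed(1)

# Simulate (16.16)-(16.17) for an assumed constant-velocity system.
T = 1.0                        # sample period
Phi = [[1.0, T], [0.0, 1.0]]   # transition matrix, state = [position, velocity]
H = [1.0, 0.0]                 # we measure position only
q_std, r_std = 0.05, 0.5       # process / measurement noise standard deviations

def step(x):
    # x_k = Phi x_{k-1} + w_k  (process noise enters the velocity here)
    pos = Phi[0][0] * x[0] + Phi[0][1] * x[1]
    vel = Phi[1][0] * x[0] + Phi[1][1] * x[1] + random.gauss(0.0, q_std)
    return [pos, vel]

def measure(x):
    # z_k = H x_k + v_k
    return H[0] * x[0] + H[1] * x[1] + random.gauss(0.0, r_std)

x = [0.0, 1.0]                 # start at position 0 with unit velocity
zs = []
for _ in range(20):
    x = step(x)
    zs.append(measure(x))
```

Each simulated z_k is the deterministic part H_k x_k plus the random perturbation v_k, mirroring the decomposition described above.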
One can think of the true state vector, x_k, evolving in time according to a deterministic linear dynamical system:

    x_k = Φ_k x_{k−1}    (16.18)

with a random perturbation modeled by w_k. Likewise, the measurement model consists of a deterministic linear part:

    z_k = H_k x_k    (16.19)

with an additive random perturbation of v_k. As shown in Fig. 16.1, the Kalman filter predicts the state vector from one time (t_{k−1}) to the next (t_k) using the deterministic part of the linear dynamical system.

FIGURE 16.1 Block diagram of Kalman filter.

TABLE 16.1 Definition of Symbols

x_k — State vector of the linear dynamical system at time t_k; vector of dimension n; x_k = Φ_k x_{k−1} + w_k.
t_k — Time of the kth measurement; scalar.
k — Index of discrete time measurements; integer.
z_k — kth measurement; vector of dimension m; z_k = H_k x_k + v_k.
H_k — Measurement matrix at time t_k; m × n matrix.
R_k — Covariance matrix of measurement noise at time t_k; m × m matrix; R_k = E(v_k v_k^T).
v_k — Measurement noise at time t_k; vector of dimension m; Gaussian zero-mean random variable with covariance matrix R_k, statistically independent from sample to sample and statistically independent of x_k.
Φ_k — Transition matrix of the linear dynamical system from time t_{k−1} to t_k; n × n matrix.
w_k — Process noise; vector of dimension n; Gaussian zero-mean random variable with covariance matrix Q_k, statistically independent from sample to sample and statistically independent of x_k.
Q_k — Covariance matrix of process noise at time t_k; n × n matrix; Q_k = E(w_k w_k^T).
P_k — Error covariance matrix of x_k conditioned on Z_k; n × n matrix; P_k = E[(x̂_k − x_k)(x̂_k − x_k)^T | Z_k].
M_k — Error covariance matrix of x_k conditioned on Z_{k−1}; n × n matrix; M_k = E[(x̄_k − x_k)(x̄_k − x_k)^T | Z_{k−1}].
x̂_k — Estimate of x at time t_k conditioned on Z_k; vector of dimension n; x̂_k = E(x_k | Z_k).
x̄_k — Estimate of x at time t_k conditioned on Z_{k−1}; vector of dimension n; x̄_k = E(x_k | Z_{k−1}).
Z_k — Set of measurements up to and including time t_k; set of m-dimensional vectors; Z_k = {z_1, z_2, …, z_k}.
x̂_0 — Initial estimate of x at time t_0, prior to any measurements; vector of dimension n; x̂_0 = E(x_0).
P_0 — Initial error covariance matrix of x̂_0, prior to any measurements; n × n matrix; P_0 = E[(x̂_0 − x_0)(x̂_0 − x_0)^T].
E(·) — Expected value of (·); E(A) = ∫ A p(A) dA.
p(A) — Probability density of A; Gaussian probability density.
p(A|B) — Probability density of A conditioned on B; Gaussian probability density.
(·)^T — Transpose of (·); (A^T)_{ij} = A_{ji} for a matrix A with ijth element A_{ij}.
(·)^{−1} — Inverse of matrix (·); A^{−1} is the inverse of A if and only if A^{−1}A = AA^{−1} = I.
I — Identity matrix; n × n matrix; I_{ij} = 1 for i = j and 0 otherwise.

This is the best prediction of x(t_k) given x(t_{k−1}), assuming that w_k is a zero-mean random variable uncorrelated with x(t_{k−1}). Hence, the state vector prediction in the Kalman filter is intuitively exactly what one would expect. The state vector update is also intuitively reasonable. In particular, it is a linear combination of the predicted value of x and the latest measurement z_k. This linear combination of x̄_k and z_k is computed as a compromise between prediction errors in x̄_k and the measurement noise errors in z_k. That is, the Kalman filter gain is computed to optimally combine x̄_k and z_k, using the known models of system dynamics and measurements. More specifically, the Kalman filter gain is computed to minimize the following error criterion at the time of the kth measurement:

    J = (x̂_k − x̄_k)^T M_k^{−1} (x̂_k − x̄_k) + (z_k − H_k x̂_k)^T R_k^{−1} (z_k − H_k x̂_k)    (16.20)

The covariance matrix of the prediction errors is M_k, and R_k is the measurement error covariance matrix. If M_k is large, then the first term in J is weighted less because the prediction, x̄_k, is relatively inaccurate.
Similarly, if R_k is large, then the second term in J is weighted less because the measurement, z_k, is relatively inaccurate. The weighted least squares criterion, J, strikes a balance between prediction errors and measurement errors. To find the value of x̂_k that minimizes J, we can set the derivative of J with respect to x̂_k equal to zero:

    ∂J/∂x̂_k = 2(x̂_k − x̄_k)^T M_k^{−1} − 2(z_k − H_k x̂_k)^T R_k^{−1} H_k = 0    (16.21)

Using the fact that covariance matrices are symmetric, and rearranging terms, we get:

    (M_k^{−1} + H_k^T R_k^{−1} H_k) x̂_k = M_k^{−1} x̄_k + H_k^T R_k^{−1} z_k    (16.22)

and solving for x̂_k:

    x̂_k = (M_k^{−1} + H_k^T R_k^{−1} H_k)^{−1} [M_k^{−1} x̄_k + H_k^T R_k^{−1} z_k]    (16.23)

This can be put into the form:

    x̂_k = x̄_k + K_k [z_k − H_k x̄_k]    (16.24)

where:

    K_k = (M_k^{−1} + H_k^T R_k^{−1} H_k)^{−1} H_k^T R_k^{−1}    (16.25)

Further matrix algebra can be used to represent the Kalman filter gain in two other forms:

    K_k = M_k H_k^T (H_k M_k H_k^T + R_k)^{−1}    (16.26)

    K_k = P_k H_k^T R_k^{−1}    (16.27)

in which P_k is the error covariance matrix of x̂_k.

The above calculation of the Kalman filter gain is by no means a derivation of the Kalman filter; rather, it is merely a simple heuristic calculation that gives insight into these equations. The error criterion, J, is the logarithm of a Gaussian probability density of x conditioned on Z_k. A lucid derivation of the Kalman filter, from a Bayesian viewpoint, is given in Ho and Lee [1964]. In contrast, Kalman's original derivation does not use Bayesian methods; see Kalman [1960]. Other derivations of the Kalman filter are in Gelb [1974] and Jazwinski [1970]. The Kalman filter is "optimal" in a number of ways, under different assumptions, which are discussed in these references.
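The three gain expressions (16.25), (16.26), and (16.27) are algebraically equivalent, which is easy to verify numerically. The scalar sketch below (arbitrary illustrative values of M_k, H_k, R_k) computes all three, using the standard covariance update P = (I − K H)M for (16.27), and then applies the update (16.24):

```python
# Scalar check that the three gain forms agree; M, H, R are arbitrary.
M, H, R = 4.0, 0.5, 2.0

# (16.25): K = (M^-1 + H R^-1 H)^-1 H R^-1
K1 = (1.0 / (1.0 / M + H * H / R)) * (H / R)

# (16.26): K = M H (H M H + R)^-1
K2 = M * H / (H * M * H + R)

# (16.27): K = P H R^-1, with the updated covariance P = (1 - K H) M
P = (1.0 - K2 * H) * M
K3 = P * H / R

# state update (16.24): x_hat = x_bar + K (z - H x_bar)
x_bar, z = 1.0, 0.8
x_hat = x_bar + K2 * (z - H * x_bar)
```

With Φ_k = H_k = 1, Q_k = 0, and constant R_k, this same update reduces to a running average of the measurements, as Example 1 below shows.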
The Kalman filter is stable under rather mild assumptions, which can always be achieved in practice, by using a state vector with minimum dimension to realize the linear dynamical model and the linear measurement model. This corresponds to a completely controllable dynamical system and a completely observable measurement model. Stability of the Kalman filter under these conditions was proved in a beautiful paper by Kalman [1963]. Kalman's stability theory has great practical value because engineers do not need to check for Kalman filter stability as a separate issue. This result is all the more impressive when we consider that the Kalman filter is a time-varying linear dynamical system that can be very complex and have very high dimension. Nevertheless, the Kalman filter is automatically stable under the simple minimality assumption given above.

It is important to emphasize that both the linear dynamical system and the measurement model can be time-varying, and both the process noise and measurement noise can be nonstationary. That is, all of the following matrices can vary with time: Φ_k, H_k, Q_k, and R_k. Also, the discrete times at which measurements are made (t_k for k = 1, 2, …) are completely arbitrary; that is, the sample rate can be nonuniform.

Kalman Filter Examples

A good way to understand the Kalman filter equations is to look at some simple low-dimensional examples.

Example 1. Consider the problem of estimating the value of an unknown constant scalar, x, given a sequence of noisy measurements:

    z_k = x_k + v_k    (16.28)

where the variance of the measurement noise is constant, and the variance of the a priori estimate of x is infinite. For this example, using the Kalman filter notation:

    Φ_k = 1,  Q_k = 0,  H_k = 1,  R_k = constant = c,  P_0 = ∞

the corresponding Kalman filter equations given in Fig. 16.1 are

    x̄_k = x̂_{k−1}    (16.29)

    x̂_k = x̄_k + K_k (z_k − x̄_k)    (16.30)

    K_k = M_k/(M_k + c)    (16.31)
    M_k = P_{k−1}    (16.32)

    P_k = M_k − M_k²/(M_k + c)    (16.33)

    P_0 = ∞    (16.34)

which simplifies to

    x̂_k = x̂_{k−1} + K_k (z_k − x̂_{k−1})    (16.35)

    K_k = P_{k−1}/(P_{k−1} + c)    (16.36)

    P_k = P_{k−1} − P_{k−1}²/(P_{k−1} + c)    (16.37)

After some more algebra, it turns out that

    P_k^{−1} = P_{k−1}^{−1} + 1/c    (16.38)

where P_0^{−1} = 0, and hence

    P_k^{−1} = Σ_{j=1}^k 1/c    (16.39)

    P_k^{−1} = k/c    (16.40)

Therefore, the variance of estimation error after k measurements is

    P_k = c/k    (16.41)

which is exactly what we should expect for this example. Also, the Kalman filter gain is

    K_k = P_{k−1}/(P_{k−1} + c)    (16.42)

    K_k = 1/(1 + c P_{k−1}^{−1})    (16.43)

    K_k = 1/(1 + (k − 1))    (16.44)

    K_k = 1/k    (16.45)

which is intuitively very reasonable. Furthermore, the Kalman filter can now be written as:

    x̂_k = x̂_{k−1} + (1/k)(z_k − x̂_{k−1})    (16.46)

which has the solution

    x̂_k = (1/k) Σ_{j=1}^k z_j    (16.47)

The Kalman filter for this simple example is nothing more than our old friend, the arithmetic average.

Example 2. Consider the same problem as in Example 1, but with R_k not constant. It is easy to show that the Kalman filter in this case is

    x̂_k = x̂_{k−1} + (P_k/R_k)(z_k − x̂_{k−1})    (16.48)

where the estimation error variance after k measurements is given by

    1/P_k = Σ_{j=1}^k 1/R_j    (16.49)

and the Kalman filter estimate of x after k measurements is:

    x̂_k = [Σ_{j=1}^k z_j/R_j] / [Σ_{j=1}^k 1/R_j]    (16.50)

This result is intuitively very reasonable. In particular, the more accurate measurements (corresponding to small R_j) are weighted more heavily in estimating x; conversely, relatively inaccurate measurements (with large R_j) are given little weight.

Example 3. Consider the problem of estimating the value of a quantity, y, that changes linearly with time with an unknown rate of change, given a sequence of measurements of y corrupted by additive noise that is statistically independent from sample to sample. In the Kalman filter setup, we could model this problem as follows. Let the state vector be:

    x = [y, ẏ]^T    (16.51)
The transition matrix would be:

    Φ_k = [ 1   Δt_k ]
          [ 0    1   ]    (16.52)

where Δt_k = t_k − t_{k−1}. Furthermore, H_k = [1 0], Q_k = 0, R_k = constant = R, and P_0^{−1} = 0. Assuming a constant value of Δt_k = T, it turns out that the error covariance matrix is:

    M_k = (R/(k(k + 1))) [ 2(2k − 1)    6/T            ]
                         [ 6/T          12/(T²(k − 1)) ]    (16.53)

See [Sorenson, 1967] for details.

Extended Kalman Filter

In practical applications, it is rather rare to find a problem with dynamical equations and measurement equations that are linear in x. Nevertheless, engineers use the Kalman filter theory applied to a linear approximation of the actual nonlinear dynamics and measurements. This approximation is called the extended Kalman filter (EKF); it is very straightforward and popular. The Kalman filter itself is almost never used in real-world applications; rather, the EKF is essentially ubiquitous. Figure 16.2 shows a block diagram of the EKF. Note that the EKF uses the nonlinear dynamics and nonlinear measurement equations to predict x_k and z_k, rather than using a linear approximation. In contrast, the EKF uses linear approximations of f(x) and h(x) to compute the covariance matrices and the Kalman filter gain. The nonlinear dynamical model for x is:

    x_k = f(x_{k−1}) + w_k    (16.54)

and the nonlinear measurement model is:

    z_k = h(x_k) + v_k    (16.55)

Also, note in Fig. 16.2 that the estimate of x is used to compute the Kalman filter gain, unlike the Kalman filter, in which the filter gain and the error covariance matrices do not depend on x (see Fig. 16.1). Unlike the Kalman filter, there is no guarantee that the EKF is stable. Moreover, there is no reason to suppose that the EKF will give optimal performance. Although in many applications the EKF performance is good, it is well known that the EKF performance is often poor or far from optimal. Unfortunately, there is no theory that predicts when the EKF will give good performance; rather, engineers use Monte Carlo simulations to evaluate EKF performance.
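A minimal scalar EKF sketch makes the structure concrete: the nonlinear f and h are used for prediction, while their derivatives enter only the covariance and gain computations. The model f(x) = 0.9x, h(x) = x² and all noise levels here are illustrative assumptions, not from the text:

```python
import math
import random

random.seed(2)

# Illustrative scalar models (assumptions, not from the text):
# dynamics  x_k = f(x_{k-1}) + w_k  with f(x) = 0.9 x
# sensing   z_k = h(x_k) + v_k     with h(x) = x^2  (nonlinear)
f  = lambda x: 0.9 * x
F  = lambda x: 0.9          # df/dx, used to propagate the covariance
h  = lambda x: x * x
Hj = lambda x: 2.0 * x      # dh/dx, the Jacobian evaluated at the estimate
Q, R = 0.01, 0.1

def ekf_step(x_hat, P, z):
    # predict with the *nonlinear* f; propagate P with the linearization F
    x_bar = f(x_hat)
    M = F(x_hat) * P * F(x_hat) + Q
    # update: the predicted measurement uses the *nonlinear* h,
    # while the gain uses the Jacobian Hj evaluated at the prediction
    Hk = Hj(x_bar)
    K = M * Hk / (Hk * M * Hk + R)
    x_hat = x_bar + K * (z - h(x_bar))
    P = (1.0 - K * Hk) * M
    return x_hat, P

x_true, x_hat, P = 2.0, 1.5, 1.0
for _ in range(30):
    x_true = f(x_true) + random.gauss(0.0, math.sqrt(Q))
    z = h(x_true) + random.gauss(0.0, math.sqrt(R))
    x_hat, P = ekf_step(x_hat, P, z)
```

Note how the quality of the linearization depends on the current estimate: near x = 0 the Jacobian of h collapses, which is one concrete way EKF performance can degrade.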
There is a vast literature on methods to improve EKF performance, including second-order Taylor series, iteration of Fig. 16.2 to improve the linearization, tuning the process noise covariance matrix, decoupling the error covariance matrix, preferred ordering of the components of a vector-valued measurement, careful choice of coordinates (e.g., polar vs. Cartesian), hybrid coordinate systems, etc. There is no guarantee that any of these methods will improve EKF performance; in some cases, second-order corrections and/or iteration actually make EKF performance worse, contrary to intuition. Reviews of these techniques are given in Tanizaki [1996], as well as Wishner et al. [1969], Mehra [1971], Leskiw et al. [1982], Gelb [1974], Bellaire et al. [1995], Henriksen [1982], Fang [1976], Daum et al. [1983], and Jazwinski [1970].

Nonlinear Filters

Considering the frequently disappointing performance of the EKF noted in the previous section, there has been intense research to develop better nonlinear filters. An exact nonlinear recursive filter was derived by Beneš (for a certain class of nonlinear problems) in a seminal paper [1981]. The Beneš filter is "exact" in the sense that it computes an optimal estimate of x, without any approximations, in contrast to the EKF, which uses linear approximations. A generalization of the Beneš filter and the Kalman filter was developed by Daum [1986a; 1986b]. Figure 16.3 shows the superior performance of this new nonlinear filter compared to the EKF for certain practical applications; see Schmidt [1993] for details. A more general class of exact nonlinear recursive filters is based on the exponential family of probability densities.
The Kalman filter theory is based on a Gaussian density, which is a special case of the exponential family; see Daum [1988; 1997a, b] for a development of this theory, which is summarized in Table 16.2. Another alternative to the EKF, reported in Julier et al. [1995], is called the unscented filter; in contrast to the EKF, it does not use Jacobians, but rather evaluates multidimensional integrals by sampling at carefully selected points, much like Gauss-Hermite quadrature formulas. The unscented filter shows much better performance than the EKF in certain applications, with less computational complexity than the EKF.

Exact recursive filters for nonlinear estimation problems generally do not exist. This is not surprising, considering that the existence of an exact recursive filter corresponds to the following minor miracle:

    p(x, t_k | Z_k) = p(x, t_k | ψ_k)    (16.56)

in which Z_k has dimension growing with k, ψ_k has fixed dimension for all k, and ψ_k is a sufficient statistic for x. A recursive filter exists when there is a sufficient statistic with fixed finite dimension. In classical statistics, for parameter estimation, it is well known that this will happen (assuming certain regularity conditions) if and only if the conditional density is from an exponential family; see Daum [1988]. The theory of fixed finite-dimensional filters has also been developed from a completely different perspective, using Lie algebras; see Beneš [1987].

FIGURE 16.2 The extended Kalman filter (EKF) is a linear approximation.

Non-recursive filters generally have superior performance compared with the EKF, at the cost of higher computational complexity. For parameter estimation problems (corresponding to zero process noise, Q_k = 0), these non-recursive filters are popular in practical applications despite the increased computational complexity compared with the EKF. Gauss invented this type of non-recursive nonlinear filter over 200 years ago; see Sorenson [1980].
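A Gauss-style batch (non-recursive) estimator processes the whole measurement set at once, relinearizing on each pass. The sketch below fits a fixed scalar parameter observed through an assumed nonlinearity h(θ) = exp(θ); the model and noise level are illustrative assumptions, not from the text:

```python
import math
import random

random.seed(3)

# Batch nonlinear least squares for a fixed parameter theta (Q = 0),
# observed through an assumed nonlinearity h(theta) = exp(theta).
h  = lambda th: math.exp(th)
dh = lambda th: math.exp(th)   # dh/dtheta

theta_true = 1.0
zs = [h(theta_true) + random.gauss(0.0, 0.05) for _ in range(50)]

theta = 0.0                    # initial guess
for _ in range(25):            # Gauss-Newton iterations over the full batch
    r = [z - h(theta) for z in zs]              # residuals at current theta
    J = dh(theta)                               # Jacobian (same for every z_j)
    theta += sum(J * ri for ri in r) / (len(zs) * J * J)
```

Because every measurement shares the same Jacobian here, each pass is exactly a Newton step for solving h(θ) = mean(z); with measurement-dependent Jacobians the same loop becomes the usual Gauss-Newton iteration.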
On the other hand, non-recursive filters for estimating x, where x is a Markov process with non-zero process noise (Q_k ≠ 0), generally have much greater computational complexity. Nevertheless, with a sufficiently fast computer, it would be practical to implement such a non-recursive algorithm. The theory to design such algorithms is well known, and some Monte Carlo simulations have shown excellent performance relative to the EKF; see Sorenson [1988] and Kastella et al. [1997]. Presumably, with computers getting ten times faster (at fixed cost) every 5 years, the application of such non-recursive filters will become common in the future, despite very high computational complexity relative to the EKF.

Practical Issues

Data Association

One of the best ways to ruin the performance of a Kalman filter is to put the wrong data into it. In a dense multiple-target environment, for applications with sensors such as radar, passive infrared, sonar, acoustic, or optical, the question of which measurement originated from which target is a very serious issue. In addition, the measurements could be unresolved mixtures from two or more objects. There is a plethora of algorithms to mitigate these problems, as well as a vast literature on this subject, including Blackman [1986], Blackman et al. [1999], Bar-Shalom [1995], and Daum [1992].

FIGURE 16.3 New nonlinear filter vs. extended Kalman filter [see Schmidt (1993)].

TABLE 16.2 Exact Recursive Filters — conditional density p(x, t | Z_k), class of dynamics, and propagation equations for six filters: 1. Kalman (1960), Gaussian density; 2. Beneš (1981); 3.–5. Daum (1986); 6. Daum (1988), solution of a PDE for θ(x, t).
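One of the simplest members of that plethora of data association algorithms is nearest-neighbor gating. In this sketch (the function name, scalar measurements, and gate threshold are all illustrative assumptions), the filter keeps the measurement with the smallest normalized innovation and discards everything outside the gate:

```python
# Minimal nearest-neighbor gating sketch for data association.
def associate(z_pred, S, measurements, gate=9.0):
    """Pick the measurement closest to the predicted measurement z_pred,
    using the normalized squared innovation d2 = (z - z_pred)^2 / S,
    and reject it if it falls outside the gate."""
    best, best_d2 = None, float("inf")
    for z in measurements:
        d2 = (z - z_pred) ** 2 / S
        if d2 < best_d2:
            best, best_d2 = z, d2
    return best if best_d2 <= gate else None

# Predicted measurement 10.0 with innovation variance 1.0:
chosen = associate(10.0, 1.0, [3.0, 9.7, 25.0])   # picks 9.7
missed = associate(10.0, 1.0, [30.0, 50.0])       # everything outside the gate
```

In practice the innovation covariance S = H M Hᵀ + R comes from the filter itself, and the gate threshold is chosen from a chi-square tail probability.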
Ill-Conditioning

The Kalman filter covariance matrix can be extremely ill-conditioned in certain applications, as analyzed in Daum et al. [1983]. A Kalman filter is ill-conditioned if its performance is significantly degraded by numerical errors. Special factorizations of the covariance matrices have been developed to mitigate this problem; the standard reference is Bierman [1977]. There are many other methods to mitigate ill-conditioning, as discussed by Daum et al. [1983].

Adaptive Kalman Filters

The Kalman filter was derived assuming that the process noise covariance matrix, Q_k, as well as all other matrices (Φ_k, H_k, R_k), are known exactly a priori. In practice, however, these assumptions may be inappropriate. To mitigate this uncertainty, a number of adaptive Kalman filter algorithms can be used. Most of these algorithms consist of a bank of Kalman filters combined adaptively using Bayes' rule. This basic structure was invented by Magill [1965], and it has since evolved into a more sophisticated algorithm, called interacting multiple models, invented by Blom [1984]. A recent survey of this topic is given by Bar-Shalom et al. [1995].

Measurement Models

The Kalman filter theory assumes Gaussian measurement errors that are statistically independent from sample to sample, with zero mean and exactly known covariance matrix (R_k). However, in many practical applications, these are poor approximations of reality.
For example, in radar applications, the measurements of range, azimuth, and elevation are often biased, non-Gaussian, and correlated in time, owing to diverse physical effects including multipath, tropospheric refraction, ducting, ionospheric refraction, glint, RV wake, rocket exhaust plume, RFI, ECM, unresolved measurements, bias errors in time, location and angular orientation of the radar itself, radar hardware errors, etc. Gauss himself cautioned against naïve least squares fitting of data with bias and drift; see Gauss [1995].

Performance Evaluation

As shown in Fig. 16.1, the Kalman filter computes the covariance matrix of the estimation error (P_k). However, in practical applications, this theoretical covariance matrix may be extremely optimistic, owing to the effects noted earlier (nonlinearity, ill-conditioning, data association errors, unresolved data, errors in modeling both measurement errors and target dynamics) as well as bugs in the software itself. Therefore, the standard approach to evaluating Kalman filter performance is Monte Carlo simulation. However, no one in their right mind would believe the results of a complex Monte Carlo simulation without a back-of-the-envelope calculation that is in rough agreement with the simulation results. A good source of such simple formulas is Brookner [1998]. Obviously, the very best way to evaluate Kalman filter performance is to conduct extensive real-world testing. Unfortunately, the cost of this approach is often prohibitive or is deemed not cost-effective. A judicious combination of extensive Monte Carlo simulation and limited real-world testing is often the most practical approach to performance evaluation. The best possible performance for EKFs can be computed using theoretical lower bounds on the error covariance matrix, such as the Cramér-Rao bound (CRB) for parameter estimation.
For the standard Kalman filter setup with zero process noise (Q_k = 0), it turns out that the CRB is simply the Kalman filter error covariance matrix itself; see Taylor [1979]. On the other hand, for non-zero process noise, the available bounds are much more complex to compute and they are generally not tight; see Kerr [1989] for a detailed review of the state of the art. More generally, the theory of error bounds when data association errors are considered is developed in Daum [1997a, b].

Digital Realization

All of the algorithms discussed here are always implemented using digital computers, owing to their superior accuracy, flexibility, and dynamic range, as compared to currently available analog devices. Generally, 32-bit or 64-bit floating point arithmetic is required for most practical applications, although extra precision may be required for extremely ill-conditioned problems. The idea of using analog computers to implement Kalman filters (which is sometimes suggested in academic circles) is rather naïve, owing to the limited accuracy, limited dynamic range, and inflexibility of analog computers. Likewise, the filtering theory for continuous time measurements (which dominates the academic literature on nonlinear filtering) is also impractical, because measurements must be made in discrete time to accommodate digital computers. The naïve approximation of discrete time measurements by continuous time data generally results in poor performance, and it is not used by practical engineers. The academic literature is out of touch with such practical issues; for example, see Hazewinkel et al. [1981].

Defining Terms

Kalman filter: A recursive algorithm that estimates the state vector of a linear dynamical system given noisy measurements that are linear in the state vector. This algorithm was invented by Rudolf E. Kalman and published in 1960.
Extended Kalman filter: A recursive algorithm for estimating the state of nonlinear dynamical systems that uses the Kalman filter equations based on a linear approximation to the nonlinear dynamical system and/or nonlinear measurement equations.
State vector: A vector that specifies the state of a dynamical system. For example, the position and velocity of a leaf falling to the ground could be the state vector for the leaf. For deterministic systems, this corresponds to the initial conditions for a system of ordinary differential equations. For a special type of random process, called a Markov process, the future state is statistically independent of the past, conditioned on knowledge of the state at the present time.
Covariance matrix: A matrix that gauges the uncertainty in a vector using second moments. The diagonal elements of this matrix are the variances of uncertainty in the components of the vector.

References
Y. Bar-Shalom and X. Li, Multitarget-Multisensor Tracking, YBS, 1995.
R. Bellaire, E. W. Kamen, and S. M. Zabin, A new nonlinear iterated filter with applications to target tracking, SPIE Proceedings, San Diego, 1995.
V. E. Beneš, Nonlinear filtering: problems, examples, applications, Advances in Statistical Signal Processing, vol. 1, pp. 1–14, JAI Press, 1987.
V. E. Beneš, Exact finite-dimensional filters for certain diffusions with nonlinear drift, Stochastics, 5, 65–92, 1981.
G. J. Bierman, Factorization Methods for Discrete Sequential Estimation, New York: Academic Press, 1977.
S. S. Blackman and R. F. Popoli, Design and Analysis of Modern Tracking Systems, Artech House, 1999.
S. S. Blackman, Multi-Target Tracking with Radar Applications, Artech House, 1986.
H. A. P. Blom, A sophisticated tracking algorithm for ATC surveillance data, Proceedings of International Radar Conference, Paris, 1984.
E. Brookner, Tracking and Kalman Filtering Made Easy, John Wiley & Sons, 1998.
R. G. Brown and P. Y. C.
Hwang. Introduction to Random Signals and Applied Kalman Filtering, 3rd ed., John Wiley & Sons, 1997.
A. E. Bryson and Y. C. Ho. Applied Optimal Control, Blaisdell Publishing, 1969.
R. S. Bucy. Linear and nonlinear filtering, Proc. IEEE, 58, 854–864, 1970.
F. E. Daum (1997a). Virtual measurements for nonlinear filters, Proceedings of the IEEE Conference on Decision and Control, San Diego, pp. 1657–1662, 1997.
F. E. Daum (1997b). Cramér-Rao type bounds for random set problems, in Random Sets, J. Goutsias, R. Mahler, and H. T. Nguyen, Eds., pp. 165–183, Springer-Verlag, 1997.
F. E. Daum. Beyond Kalman filters: practical design of nonlinear filters, Proceedings of the SPIE Conference on Signal and Data Processing of Small Targets, pp. 252–262, Orlando, FL, April 1995.
F. E. Daum. A system approach to multiple target tracking, in Multitarget-Multisensor Tracking, Vol. II (Y. Bar-Shalom, Ed.), pp. 149–181, Artech House, 1992.
F. E. Daum. New exact nonlinear filters, in Bayesian Analysis of Time Series and Dynamic Models (J. C. Spall, Ed.), pp. 199–226, Marcel Dekker, New York, 1988.
F. E. Daum (1986a). Exact finite dimensional nonlinear filters, IEEE Trans. Autom. Control, AC-31(7), 616–622, 1986.
F. E. Daum (1986b). New nonlinear filters and exact solutions of the Fokker-Planck equations, in Proceedings of the American Control Conference, pp. 884–888, 1986.
F. E. Daum and R. J. Fitzgerald. Decoupled Kalman filters for phased array radar tracking, IEEE Trans. Autom. Control, AC-28, 269–283, 1983.
B. T. Fang. A nonlinear counterexample for batch and extended sequential estimation algorithms, IEEE Trans. Autom. Control, AC-21, 138–139, 1976.
C. F. Gauss. Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, translated by G. W. Stewart, SIAM, 1995.
A. Gelb, Ed. Applied Optimal Estimation, MIT Press, 1974.
M. Hazewinkel and J. C. Willems, Eds. Stochastic Systems: The Mathematics of Filtering and Identification and Applications, D.
Reidel, Dordrecht, The Netherlands, 1981.
R. Henriksen. The truncated second-order nonlinear filter revisited, IEEE Trans. Autom. Control, AC-27, 247–251, 1982.
Y. C. Ho and R. C. K. Lee. A Bayesian approach to problems in stochastic estimation and control, IEEE Trans. Autom. Control, AC-9, 333–339, 1964.
C. E. Hutchinson. The Kalman filter applied to aerospace and electronic systems, IEEE Trans. Aerosp. Electron. Syst., 500–504, 1984.
A. H. Jazwinski. Stochastic Processes and Filtering Theory, Academic Press, New York, 1970.
S. Julier, J. Uhlmann, and H. Durrant-Whyte. A new approach for filtering nonlinear systems, Proceedings of the American Control Conference, June 1995.
R. E. Kalman. New methods in Wiener filtering theory, in Proc. Symp. Eng. Appl. of Random Function Theory and Probability, F. Kozin and J. L. Bogdanoff, Eds., New York: Wiley, 1963.
R. E. Kalman. A new approach to linear filtering and prediction problems, Trans. ASME J. Basic Eng., 82D, 35–45, 1960.
K. Kastella and A. Zatezalo. Nonlinear filtering for detection, tracking and ID algorithms, ONR/NSWC Workshop on Filtering and Tracking, May 1997.
T. H. Kerr. Status of CR-like lower bounds for nonlinear filtering, IEEE Trans. Aerosp. Electron. Syst., 25, 590–601, 1989.
D. M. Leskiw and K. S. Miller. Nonlinear estimation with radar observations, IEEE Trans. Aerosp. Electron. Syst., AES-18, 192–200, 1982.
D. T. Magill. Optimal adaptive estimation of sampled stochastic processes, IEEE Trans. Autom. Control, AC-10, 434–439, 1965.
R. K. Mehra. A comparison of several nonlinear filters for reentry vehicle tracking, IEEE Trans. Autom. Control, AC-16, 307–319, 1971.
G. C. Schmidt. Designing nonlinear filters based on Daum's theory, AIAA Journal of Guidance, Control and Dynamics, 16, 371–376, 1993.
B. E. Schutz, J. D. McMillan, and B. D. Tapley. Comparison of statistical orbit determination methods, AIAA J., Nov. 1974.
H. W. Sorenson.
Recursive estimation for nonlinear dynamic systems, in Bayesian Analysis of Time Series and Dynamic Models (J. C. Spall, Ed.), pp. 127–165, Marcel Dekker, 1988.
H. W. Sorenson. Kalman Filtering: Theory and Applications, IEEE Press, New York, 1985.
H. W. Sorenson. Parameter Estimation, Marcel Dekker, 1980.
H. W. Sorenson. On the development of practical nonlinear filters, Inf. Sci., 7, 253–270, 1974.
H. W. Sorenson. On the error behavior in linear minimum variance estimation problems, IEEE Trans. Autom. Control, AC-12, 557–562, 1967.
H. Tanizaki. Nonlinear Filters, 2nd ed., Springer-Verlag, 1996.
J. H. Taylor. Cramér-Rao estimation error lower bound analysis for nonlinear systems with unknown deterministic variables, IEEE Trans. Autom. Control, April 1979.
R. P. Wishner, R. E. Larson, and M. Athans. Status of radar tracking algorithms, Symp. on Nonlinear Estimation Theory and Appl., 1970.
R. P. Wishner, J. A. Tabaczynski, and M. Athans. A comparison of three non-linear filters, Automatica, 5, 487–496, 1969.

Further Information

The best concise introduction to Kalman filtering is Chapter 12 of Bryson and Ho [1969]. The three best books on Kalman filters are Gelb [1974], Sorenson [1985], and Brown and Hwang [1997]. The standard reference on nonlinear filters is Jazwinski [1970]. The best journal for Kalman filter applications is the IEEE Transactions on Aerospace and Electronic Systems, which typically carries several practical papers on Kalman filters in each issue. Two good conferences with many papers on Kalman filters are the IEEE Conference on Decision and Control (mid-December annually) and the SPIE Conference on Signal and Data Processing (April each year).

© 2000 by CRC Press LLC