Pillai, S.U., Shim, T.I., Batalama, S.N., Kazakos, D., Daum, F., "Spectral Estimation and Modeling," The Electrical Engineering Handbook, Ed. Richard C. Dorf, Boca Raton: CRC Press LLC, 2000

16
Spectral Estimation and Modeling
16.1 Spectral Analysis
Historical Perspective • Modern Spectral Analysis
16.2 Parameter Estimation
Bayesian Estimation • Mean-Square Estimation • Minimax Estimation • Maximum Likelihood Estimation • Other Parameter Estimation Schemes
16.3 Kalman Filtering
Kalman Filter Equations • Kalman Filter Examples • Extended Kalman Filter • Nonlinear Filters • Practical Issues
16.1 Spectral Analysis
S. Unnikrishna Pillai and Theodore I. Shim
Historical Perspective
Modern spectral analysis dates back at least to Sir Isaac Newton [Newton, 1671], whose prism experiments
with sunlight led him to discover that each color represented a particular wavelength of light and that the
sunlight contained all wavelengths. Newton used the word spectrum, a variant of the Latin word specter, to
describe the band of visible light colors.
In the early eighteenth century, Bernoulli discovered that the solution to the wave equation describing a
vibrating string can be expressed as an infinite sum containing weighted sine and cosine terms. Later, the French
engineer Joseph Fourier in his Analytical Theory of Heat [Fourier, 1822] extended Bernoulli’s wave equation
results to arbitrary periodic functions that might contain a finite number of jump discontinuities. Thus, for some T₀ > 0, if f(t) = f(t + T₀) for all t, then f(t) represents a periodic signal with period T₀ and, in the case of real signals, it has the Fourier series representation

    f(t) = A₀ + 2 Σ_{k=1}^{∞} (A_k cos kω₀t + B_k sin kω₀t)

where ω₀ = 2π/T₀, and

    A_k = (1/T₀) ∫₀^{T₀} f(t) cos kω₀t dt,    B_k = (1/T₀) ∫₀^{T₀} f(t) sin kω₀t dt

with A₀ representing the dc term (k = 0). Moreover, the infinite sum on the right-hand side of the above expression converges to [f(t − 0) + f(t + 0)]/2. The total power P of the periodic function satisfies the relation

    P = (1/T₀) ∫₀^{T₀} |f(t)|² dt = A₀² + 2 Σ_{k=1}^{∞} (A_k² + B_k²)
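As a quick numerical check of these relations, the sketch below uses a hypothetical square-wave example (not from the text, T₀ = 2): the coefficients A₀, A_k, B_k are computed by direct integration, and the time-averaged power is compared with the harmonic sum, under the series convention with the factor of 2 used above.

```python
import math

# Hypothetical square-wave example (T0 = 2), checking the relations
#   f(t) = A0 + 2*sum_k (A_k cos k w0 t + B_k sin k w0 t)
#   P = A0^2 + 2*sum_k (A_k^2 + B_k^2)
T0 = 2.0
w0 = 2.0 * math.pi / T0
N = 4000                                   # integration grid over one period
dt = T0 / N
ts = [i * dt for i in range(N)]
f = [1.0 if math.sin(w0 * t) >= 0.0 else -1.0 for t in ts]

A0 = sum(f) * dt / T0                      # dc term
K = 99                                     # harmonics retained
Ak = [sum(f[i] * math.cos(k * w0 * ts[i]) for i in range(N)) * dt / T0
      for k in range(1, K + 1)]
Bk = [sum(f[i] * math.sin(k * w0 * ts[i]) for i in range(N)) * dt / T0
      for k in range(1, K + 1)]

# Total power two ways: time average vs. the harmonic (Parseval) sum.
P_time = sum(v * v for v in f) * dt / T0
P_freq = A0 ** 2 + 2.0 * sum(a * a + b * b for a, b in zip(Ak, Bk))
```

For this square wave, B_k ≈ 2/(kπ) for odd k and the A_k vanish; truncating at K harmonics loses only a small tail of the power sum, illustrating the remark below that the overharmonics become of decreasing importance.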
S. Unnikrishna Pillai, Polytechnic University
Theodore I. Shim, Polytechnic University
Stella N. Batalama, State University of New York at Buffalo
Dimitri Kazakos, University of Southwestern Louisiana
Fred Daum, Raytheon Company

© 2000 by CRC Press LLC
implying that the total power is distributed only among the dc term, the fundamental frequency ω₀ = 2π/T₀, and its harmonics kω₀, k ≥ 1, with 2(A_k² + B_k²) representing the power contained at the harmonic kω₀. For every periodic signal with finite power, since A_k → 0 and B_k → 0, eventually the overharmonics become of decreasing importance.

The British physicist Schuster [Schuster, 1898] used this observation to suggest that the partial power P_k = 2(A_k² + B_k²) at frequency kω₀, k = 0 → ∞, be displayed as the spectrum. Schuster termed this method the periodogram, and information over a multiple of periods was used to compute the Fourier coefficients and/or to smooth the periodogram, since depending on the starting time, the periodogram may contain irregular and spurious peaks. A notable exception to the periodogram was the linear regression analysis introduced by the British statistician Yule [Yule, 1927] to obtain a more accurate description of the periodicities in noisy data. Because the sampled periodic process x(k) = cos kω₀T containing a single harmonic component satisfies the recursive relation

    x(k) = ax(k − 1) − x(k − 2)

where a = 2 cos ω₀T represents the harmonic component, its noisy version x(k) + n(k) satisfies the recursion

    x(k) = ax(k − 1) − x(k − 2) + n(k)

Yule interpreted this time series model as a recursive harmonic process driven by a noise process and used this form to determine the periodicity in the sequence of sunspot numbers. Yule further generalized the above recursion to

    x(k) = ax(k − 1) + bx(k − 2) + n(k)

where a and b are arbitrary, to describe a truly autoregressive process, and since for the right choice of a, b the least-square solution to the above autoregressive equation is a damped sinusoid, this generalization forms the basis for the modern day parametric methods.

Modern Spectral Analysis
Norbert Wiener's classic work on Generalized Harmonic Analysis [Wiener, 1930] gave random processes a firm statistical foundation, and with the notion of ensemble average several key concepts were then introduced. The formalization of modern day probability theory by Kolmogorov and his school also played an indispensable part in this development. Thus, if x(t) represents a continuous-time stochastic (random) process, then for every fixed t, it behaves like a random variable with some probability density function f_x(x, t). The ensemble average or expected value of the process is given by

    μ_x(t) = E[x(t)] = ∫_{−∞}^{∞} x f_x(x, t) dx

and the statistical correlation between two time instants t₁ and t₂ of the random process is described through its autocorrelation function

    R_xx(t₁, t₂) = E[x(t₁)x*(t₂)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} x₁x₂* f_{x₁x₂}(x₁, x₂, t₁, t₂) dx₁ dx₂ = R*_xx(t₂, t₁)
where f_{x₁x₂}(x₁, x₂, t₁, t₂) represents the joint probability density function of the random variables x₁ = x(t₁) and x₂ = x(t₂), and * denotes the complex conjugate transpose in general. Processes with autocorrelation functions that depend only upon the difference of the time instants t₁ and t₂ are known as wide sense stationary processes. Thus, if x(t) is wide sense stationary, then

    E[x(t + τ)x*(t)] = R_xx(τ) = R*_xx(−τ)

To obtain the distribution of power versus frequency in the case of a stochastic process, one can make use of the Fourier transform based on a finite segment of the data. Letting

    P_T(ω) = (1/2T) |∫_{−T}^{T} x(t) e^{−jωt} dt|²

represent the power contained in a typical realization over the interval (−T, T), its ensemble average value as T → ∞ represents the true power contained at frequency ω. Thus, for wide sense stationary processes

    S(ω) = lim_{T→∞} E[P_T(ω)] = lim_{T→∞} (1/2T) ∫_{−T}^{T} ∫_{−T}^{T} R_xx(t₁ − t₂) e^{−jω(t₁−t₂)} dt₁ dt₂
         = lim_{T→∞} ∫_{−2T}^{2T} R_xx(τ) (1 − |τ|/2T) e^{−jωτ} dτ = ∫_{−∞}^{∞} R_xx(τ) e^{−jωτ} dτ ≥ 0        (16.1)

Moreover, the inverse relation gives

    R_xx(τ) = (1/2π) ∫_{−∞}^{∞} S(ω) e^{jωτ} dω        (16.2)

and hence

    R_xx(0) = E[|x(t)|²] = P = (1/2π) ∫_{−∞}^{∞} S(ω) dω

Thus S(ω) represents the power spectral density and, from Eqs. (16.1) and (16.2), the power spectral density and the autocorrelation function form a Fourier transform pair, the well-known Wiener–Khinchin theorem. If x(kT) represents a discrete-time wide sense stationary stochastic process, then

    r_k = E{x((n + k)T) x*(nT)} = r*_{−k}

and the power spectral density is given by

    S(ω) = Σ_{k=−∞}^{∞} r_k e^{−jkωT}

or in terms of the normalized variable θ = ωT,
    S(θ) = Σ_{k=−∞}^{∞} r_k e^{−jkθ} = S(θ + 2π) ≥ 0        (16.3)

and the inverse relation gives the autocorrelations to be

    r_k = (1/2π) ∫_{−π}^{π} S(θ) e^{jkθ} dθ = r*_{−k}

Thus, the power spectral density of a discrete-time process is periodic. Such a process can be obtained by sampling a continuous-time process at t = kT, |k| = 0 → ∞, and if the original continuous-time process is band-limited with a two-sided bandwidth equal to 2ω_b = 2π/T, then the set of discrete samples so obtained is equivalent to the original process in a mean-square sense.

As Schur [Schur, 1917] has observed, for discrete-time stationary processes the nonnegativity of the power spectrum is equivalent to the nonnegative definiteness of the Hermitian Toeplitz matrices, i.e.,

    S(θ) ≥ 0  ⟺  T_k = ( r₀        r₁          …   r_k
                          r₁*       r₀          …   r_{k−1}
                          ⋮         ⋮           ⋱   ⋮
                          r_k*      r_{k−1}*    …   r₀ )  ≥ 0,   k = 0 → ∞        (16.4)

If x(nT) is the output of a discrete-time linear time-invariant causal system driven by w(nT), then we have the following representation:

    w(nT) → H(z) = Σ_{k=0}^{∞} h(kT) z^k → x(nT) = Σ_{k=0}^{∞} h(kT) w((n − k)T)        (16.5)

In the case of a stationary input, the output is also stationary, and its power spectral density is given by

    S_x(θ) = |H(e^{jθ})|² S_w(θ)        (16.6)

where S_w(θ) represents the power spectral density of the input process. If the input is a white noise process, then S_w(θ) = σ² and

    S_x(θ) = σ² |H(e^{jθ})|²

Clearly if H(z) is rational, so is S_x(θ). Conversely, given a power spectral density

    S_x(θ) = Σ_{k=−∞}^{∞} r_k e^{−jkθ} ≥ 0        (16.7)

that satisfies the integrability condition

    ∫_{−π}^{π} S_x(θ) dθ < ∞        (16.8)
and the physical realizability (Paley–Wiener) criterion

    ∫_{−π}^{π} ln S_x(θ) dθ > −∞        (16.9)

there exists a unique function H(z), analytic together with its inverse in |z| < 1 (minimum phase factor), such that

    H(z) = Σ_{k=0}^{∞} b_k z^k,   |z| < 1        (16.10)

and

    S_x(θ) = lim_{r→1−0} |H(re^{jθ})|² = |H(e^{jθ})|²   a.e.

H(z) is known as the Wiener factor associated with S_x(θ) and, as Eq. (16.5) shows, when driven by white noise, it generates a stochastic process x(nT) from past samples and its power spectral density matches the given S_x(θ).

In this context, given a finite set of autocorrelations r₀, r₁, …, r_n, the spectral extension problem is to obtain the class of all extensions that match the given data, i.e., such an extension K(θ) must automatically satisfy K(θ) ≥ 0 and

    (1/2π) ∫_{−π}^{π} K(θ) e^{jkθ} dθ = r_k,   k = 0 → n

in addition to satisfying Eqs. (16.8) and (16.9).

The solution to this problem is closely related to the trigonometric moment problem, and it has a long and continuing history through the works of Schur [1917]; Nevanlinna; Akheizer and Krein [Akheizer and Krein, 1962]; Geronimus [1954]; and Shohat and Tamarkin [1970], to name a few. If the given autocorrelations are such that the matrix T_n in Eq. (16.4) is singular, then there exists an m ≤ n such that T_{m−1} is positive definite (T_{m−1} > 0) and T_m is singular [det T_m = 0, det(·) representing the determinant of (·)]. In that case, there exists a unique vector X = (x₀, x₁, …, x_m)^T such that T_m X = 0 and, further, the autocorrelations have a unique extension given by

    r_k = Σ_{i=1}^{m} P_i e^{jkθ_i},   |k| = 0 → ∞        (16.11)

where e^{jθ_i}, i = 1 → m, are the m zeros of the polynomial x₀ + x₁z + … + x_m z^m and P_i > 0. This gives

    T_{m−1} = A diag(P₁, P₂, …, P_m) A*        (16.12)
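The singular case can be sketched numerically for two sinusoids (all powers and frequencies below are hypothetical, not from the text): the null vector of the singular Toeplitz matrix yields a polynomial whose roots lie on the unit circle at the sinusoid frequencies (up to conjugation, depending on the sign convention adopted for r_k), and a small linear solve then recovers the powers.

```python
import cmath

P1, P2 = 2.0, 1.0        # hypothetical sinusoid powers
th1, th2 = 0.8, 2.1      # hypothetical frequencies in (0, pi)

def r(k):
    # r_k = P1 e^{j k th1} + P2 e^{j k th2}, cf. the extension formula with m = 2
    return P1 * cmath.exp(1j * k * th1) + P2 * cmath.exp(1j * k * th2)

# Null vector of the 3x3 Toeplitz matrix (entries T[p][q] = r(p - q)):
# fix x2 = 1 and solve the first two rows for x0, x1 by Cramer's rule;
# positive definiteness of the 2x2 leading block guarantees det != 0.
det = r(0) * r(0) - r(-1) * r(1)
x0 = (-r(-2) * r(0) + r(-1) * r(-1)) / det
x1 = (-r(0) * r(-1) + r(1) * r(-2)) / det

# Roots of x0 + x1 z + z^2 lie on the unit circle at the frequencies.
sq = cmath.sqrt(x1 * x1 - 4.0 * x0)
z1, z2 = (-x1 + sq) / 2.0, (-x1 - sq) / 2.0

# Powers from r(0) = p1 + p2 and r(-1) = p1*z1 + p2*z2.
p1 = (r(-1) - r(0) * z2) / (z1 - z2)
p2 = r(0) - p1
```

With this correlation convention the roots come out at e^{−jθᵢ}, so the frequencies are read off as the magnitudes of the root phases; the recovered powers match P₁, P₂ regardless of root ordering.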
where A is an m × m Vandermonde matrix given by

    A = ( 1            1            …   1
          λ₁           λ₂           …   λ_m
          λ₁²          λ₂²          …   λ_m²
          ⋮            ⋮                ⋮
          λ₁^{m−1}     λ₂^{m−1}     …   λ_m^{m−1} ),   λ_i = e^{jθ_i},  i = 1 → m

and Eq. (16.12) can be used to determine P_k > 0, k = 1 → m. The power spectrum associated with Eq. (16.11) is given by

    S(θ) = Σ_{k=1}^{m} P_k δ(θ − θ_k)

and it represents a discrete spectrum that corresponds to pure uncorrelated sinusoids with signal powers P₁, P₂, …, P_m.

If the given autocorrelations satisfy T_n > 0, from Eq. (16.4), every unknown r_k, k ≥ n + 1, must be selected so as to satisfy T_k > 0, and this gives

    |r_{k+1} − ζ_k|² ≤ R_k²        (16.13)

where ζ_k = f_k^T T_{k−1}^{−1} b_k, f_k = (r₁, r₂, …, r_k)^T, b_k = (r_k, r_{k−1}, …, r₁)^T, and R_k = det T_k / det T_{k−1}.

From Eq. (16.13), the unknowns could be anywhere inside a sequence of circles with center ζ_k and radius R_k, and as a result, there are an infinite number of solutions to this problem. Schur and Nevanlinna have given an analytic characterization of these solutions in terms of bounded function extensions. A bounded function ρ(z) is analytic in |z| < 1 and satisfies the inequality |ρ(z)| ≤ 1 everywhere in |z| < 1.

In a network theory context, Youla [1980] has also given a closed-form parametrization of this class of solutions. In that case, given r₀, r₁, …, r_n, the minimum phase transfer functions satisfying Eqs. (16.8) and (16.9) are given by

    H_ρ(z) = Γ(z) / [P_n(z) − zρ(z)P̃_n(z)]        (16.14)

where ρ(z) is an arbitrary bounded function that satisfies the inequality (Paley–Wiener criterion)

    ∫_{−π}^{π} ln[1 − |ρ(e^{jθ})|²] dθ > −∞

and Γ(z) is the minimum phase factor obtained from the factorization

    1 − |ρ(e^{jθ})|² = |Γ(e^{jθ})|²

Further, P_n(z) represents the Levinson polynomial generated from r₀ → r_n through the recursion

    √(1 − |s_n|²) P_n(z) = P_{n−1}(z) − zs_n P̃_{n−1}(z)
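The Levinson polynomial recursion is usually implemented in the equivalent autocorrelation domain as the Levinson-Durbin algorithm; the sketch below (a hypothetical AR(1) example, with a and σ² chosen arbitrarily) computes the reflection coefficients and prediction-error powers. For AR(1) data only the first reflection coefficient is nonzero, consistent with the pure AR interpretation of the maximum entropy extension mentioned below.

```python
# Levinson-Durbin sketch: an equivalent, autocorrelation-domain form of the
# polynomial recursion above (not the exact Eq. (16.15) formulation).
# Example data are hypothetical AR(1) autocorrelations
#   r_k = sigma2 * a^k / (1 - a^2).
def levinson(r, order):
    a_prev, E = [], r[0]
    ks, errs = [], []
    for n in range(1, order + 1):
        acc = r[n] - sum(a_prev[i] * r[n - 1 - i] for i in range(len(a_prev)))
        k = acc / E                      # reflection coefficient, |k| < 1
        a_prev = [a_prev[i] - k * a_prev[n - 2 - i]
                  for i in range(len(a_prev))] + [k]
        E *= (1.0 - k * k)               # prediction-error power update
        ks.append(k)
        errs.append(E)
    return ks, errs

a, sigma2 = 0.5, 1.0
r = [sigma2 * a ** k / (1.0 - a * a) for k in range(6)]
ks, errs = levinson(r, 5)
# For AR(1) autocorrelations: ks[0] = a, and ks[1:] vanish.
```

The condition |k| < 1 at every stage mirrors the positive definiteness condition T_k > 0, and the connection to bounded functions described above.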
that starts with P₀(z) = 1/√r₀, where

    s_n = { P_{n−1}(z) Σ_{k=1}^{n} r_k z^k }_n · P_{n−1}(0)        (16.15)

represents the reflection coefficient at stage n. Here, { }_n denotes the coefficient of z^n in the expansion { }, and P̃_n(z) ≜ z^n P_n*(1/z*) represents the polynomial reciprocal to P_n(z). Notice that the given information r₀ → r_n enters P_n(z) through Eq. (16.15). The power spectral density

    K(θ) = |H_ρ(e^{jθ})|²

associated with Eq. (16.14) satisfies all the interpolation properties described before. In Eq. (16.14), the solution ρ(z) ≡ 0 gives H(z) = 1/P_n(z), a pure AR(n) system that coincides with Burg's maximum entropy extension. Clearly, if H_ρ(z) is rational, then ρ(z) must be rational and, more interestingly, every rational system must follow from Eq. (16.14) for a specific rational bounded function ρ(z). Of course, the choice of ρ(z) brings in extra freedom, and this can be profitably used for system identification as well as rational and stable approximation of nonrational systems [Pillai and Shim, 1993].
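A minimal numerical sketch of the Wiener factor relation, using a hypothetical AR(1) model (a and σ² chosen arbitrarily): for H(z) = 1/(1 − az) with |a| < 1 driven by white noise, r_k = σ²a^{|k|}/(1 − a²), and the autocorrelation sum of Eq. (16.7) reproduces σ²|H(e^{jθ})|², while the inverse relation recovers r₀.

```python
import math, cmath

a, sigma2 = 0.6, 1.0     # hypothetical AR(1) parameter and noise variance

def S_from_H(theta):
    # S_x(theta) = sigma2 * |H(e^{j theta})|^2 with H(z) = 1/(1 - a z)
    return sigma2 * abs(1.0 / (1.0 - a * cmath.exp(1j * theta))) ** 2

def S_from_r(theta, kmax=200):
    # Truncated version of S_x(theta) = sum_k r_k e^{-j k theta}
    def r(k):
        return sigma2 * a ** abs(k) / (1.0 - a * a)
    return sum(r(k) * cmath.exp(-1j * k * theta)
               for k in range(-kmax, kmax + 1)).real

diffs = [abs(S_from_H(t) - S_from_r(t)) for t in (0.1, 1.0, 2.5)]

# Inverse relation: r_0 = (1/2pi) * integral of S over (-pi, pi),
# approximated by a midpoint rule on a uniform grid.
M = 2000
r0_num = sum(S_from_H(-math.pi + (i + 0.5) * 2.0 * math.pi / M)
             for i in range(M)) / M
```

The truncation error of the autocorrelation sum decays like a^kmax, so the two expressions agree to machine precision here.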
Defining Terms
Autocorrelation function: The expected value of the product of two random variables generated from a
random process for two time instants; it represents their interdependence.
Expected value (or mean) of a random variable: Ensemble average value of a random variable that is given
by integrating the random variable after scaling by its probability density function (weighted average)
over the entire range.
Power spectrum: A nonnegative function that describes the distribution of power versus frequency. For wide
sense stationary processes, the power spectrum and the autocorrelation function form a Fourier transform
pair.
Probability density function: The probability of the random variable taking values between two real numbers x₁ and x₂ is given by the area under the nonnegative probability density function between those two points.
Random variable: A continuous or discrete valued variable that maps the set of all outcomes of an experiment
into the real line (or complex plane). Because the outcomes of an experiment are inherently random,
the final value of the variable cannot be predetermined.
Stochastic process: A real valued function of time t, which for every fixed t behaves like a random variable.
Related Topics
14.1 Fourier Transforms • 40.2 Spectrum, Specifications, and Measurement Techniques • 73.3 Stochastic Processes
References
N.I. Akheizer and M. Krein, Some Questions in the Theory of Moments, American Math. Soc. Monogr., 2, 1962.
J.B.J. Fourier, Théorie Analytique de la Chaleur (Analytical Theory of Heat), Paris, 1822.
L. Y. Geronimus, Polynomials Orthogonal on a Circle and Their Applications, American Math. Soc., Translation,
104, 1954.
I. Newton, Philos. Trans., vol. IV, p. 3076, 1671.
S.U. Pillai and T.I. Shim, Spectrum Estimation and System Identification, New York: Springer-Verlag, 1993.
I. Schur, "Über Potenzreihen, die im Innern des Einheitskreises beschränkt sind," Journal für die reine und angewandte Mathematik, vol. 147, pp. 205–232, 1917.
J.A. Shohat and J.D. Tamarkin, The Problem of Moments, American Math. Soc., Math. Surveys, 1, 1970.
N. Wiener, "Generalized harmonic analysis," Acta Math., vol. 55, pp. 117–258, 1930.
D.C. Youla, "The FEE: A New Tunable High-Resolution Spectral Estimator: Part I," Technical Note no. 3, Dept. of Electrical Engineering, Polytechnic Institute of New York, Brooklyn, New York, 1980; also RADC Report RADC-TR-81-397, AD A114996, 1982.
G.U. Yule, “On a method of investigating periodicities in disturbed series, with special reference to Wolfer’s
sunspot numbers,” Philos. Trans. R. Soc. London, Ser. A, vol. 226, pp. 267–298, 1927.
16.2 Parameter Estimation
Stella N. Batalama and Dimitri Kazakos

Parameter estimation is the operation of assigning a value in a continuum of alternatives to an unknown parameter based on a set of observations involving some function of the parameter. The estimate is the value assigned to the parameter, and the estimator is the function of the observations that yields the estimate.

The basic elements in parameter estimation are a vector parameter θ^m, a vector space E^m where θ^m takes its values, a stochastic process X(t) parameterized by θ^m, and a performance criterion or cost function. The estimate θ̂^m(x^n) based on the observation vector x^n = [x₁, x₂, …, x_n] is a solution of some optimization problem according to the performance criterion. In the following, the function f(x^n|θ^m) will denote the conditional joint probability density function of the random variables x₁, …, x_n.

There are several parameter estimation schemes. If the process X(t) is parametrically known, i.e., if its conditional joint probability density functions are known for each fixed value θ^m of the vector parameter, then the corresponding parameter estimation scheme is called parametric. If the statistics of the process X(t) are nonparametrically described, i.e., given θ^m ∈ E^m any joint probability density function of the process is a member of some nonparametric class of probability density functions, then the nonparametric estimation schemes arise.
Let Γ^n denote the n-dimensional observation space. Then an estimator θ̂^m(x^n) of a vector parameter θ^m is a function from the observation space Γ^n to the parameter space E^m. Since this is a function of random variables, it is itself a random variable (or random vector).

There are certain stochastic properties of estimators that quantify their quality. In this sense an estimator is said to be unbiased if its expected value is the true parameter value, i.e., if

    E_θ{θ̂^m(x^n)} = θ^m

where the subscript θ on the expectation symbol denotes that the expectation is taken according to the probability density function f(x^n|θ^m). In the case where the observation space is ℝ^n and the parameter is a scalar, this is

    E_θ{θ̂(x^n)} = ∫_{ℝ^n} θ̂(x^n) f(x^n|θ) dx^n

The bias of the estimate is the Euclidean norm ||θ^m − E_θ{θ̂^m(x^n)}||. Thus, the bias measures the distance between the expected value of the estimate and the true value of the parameter. Clearly, the estimator is unbiased when the bias is zero.

Usually it is of interest to know the conditional variance of an unbiased estimate. The bias of the estimate θ̂^m(x^n) and the conditional variance

    E_θ{ ||θ̂^m(x^n) − E_θ{θ̂^m(x^n)}||² | θ^m }

generally represent a trade-off. Indeed, an unbiased estimate may induce relatively large variance. On the other hand, the introduction of some low-level bias may then result in a significant reduction of the induced variance.
In general, the bias versus variance trade-off should be studied carefully for the correct evaluation of any given parameter estimate. A parameter estimate is called efficient if its conditional variance equals a lower bound known as the Rao-Cramér bound, which we present briefly below.

The Rao-Cramér bound gives a theoretical minimum for the variance of any estimate. More specifically, let θ̂(x^n) be the estimate of a scalar parameter θ given the observation vector x^n. Let f(x^n|θ) be given, twice continuously differentiable with respect to θ, and satisfy also some other mild regularity conditions. Then

    E_θ{[θ̂(x^n) − θ]²} ≥ [ −E_θ{ ∂² log f(x^n|θ) / ∂θ² } ]^{−1}

Sometimes we need to consider the case where the sample size n increases to infinity. In such a case, an estimator is said to be consistent if

    θ̂^m(x^n) → θ^m   as n → ∞

Since the estimate θ̂^m(x^n) is a random variable, we have to specify in what sense the above limit holds. Thus, if the above limit holds with probability 1, we say that θ̂^m(x^n) is strongly consistent, or consistent w.p. 1. In a similar way we can define a weakly consistent estimator.

As far as the asymptotic distribution of θ̂(x^n) as n → ∞ is concerned, it turns out that the central limit theorem can often be applied to θ̂(x^n) to infer that [θ̂(x^n) − θ] is asymptotically normal with zero mean as n → ∞.

In order to examine certain parameter estimation schemes we need first to present the definitions of some related functions. The penalty or cost function c[θ^m, θ̂^m(x^n)] is a scalar, nonnegative function whose values vary as θ^m varies in the parameter space E^m and as the sequence x^n takes different values in the observation space Γ^n. The conditional expected penalty c(θ^m, θ̂^m) induced by the parameter estimate and the penalty function is a function defined as follows:

    c(θ^m, θ̂^m) = E_θ{ c[θ^m, θ̂^m(x^n)] }

If an a priori density function p(θ^m) is available, then the expected penalty c(θ̂^m, p) can be evaluated.

The various existing parameter estimation schemes evolve as the solutions of optimization problems whose objective function is either the conditional expected penalty or the conditional density function f(x^n|θ^m).

Bayesian Estimation Scheme
In the Bayesian estimation scheme the available assets are:
1. A parametrically known stochastic process parameterized by θ^m, in other words, a given conditional joint density function f(x^n|θ^m) defined on the observation space Γ^n, where θ^m is a well-defined parameter vector.
2. A realization x^n from the underlying active process, where the implied assumption is that the process remains unchanged throughout the whole observation period.
3. A density function p(θ^m) defined on the parameter space E^m.
4. For each data sequence x^n, parameter vector θ^m, and parameter estimate θ̂^m(x^n), a given penalty scalar function c[θ^m, θ̂^m(x^n)].
5. A performance criterion, which is the minimization of the expected penalty c(θ̂^m, p).

The estimate θ̂^m₀ that minimizes the expected penalty is called the optimal Bayesian estimate at p. Under some mild conditions the optimal Bayesian estimate θ̂^m₀(x^n) is the conditional expectation E{θ^m|x^n}.
If the penalty function has the form c[θ^m, θ̂^m] = 1 − δ(||θ^m − θ̂^m||), where δ(·) is the delta function, then the optimal Bayesian estimate is called the maximum a posteriori estimate, since it happens to maximize the a posteriori density p(θ^m|x^n).

Another special case of penalty function is the function ||θ^m − θ̂^m||². In this case the Bayesian estimate is called the minimum mean-square estimate and equals the conditional expectation E{θ^m|x^n}.

In the following we present some more details about mean-square estimation, since it is one of the most popular schemes.

Mean-Square Estimation
For simplicity of our discussion we consider the case of estimating a single continuous-type random variable θ with density p(θ) instead of estimating a random vector. We also reduce the dimensionality of the observation space to one. In this framework the penalty function will be the square of the estimation error, (θ − θ̂)², and the performance or optimality criterion will be the minimization of the mean (expected) square value of the estimation error.

We will first consider the case of estimating a random variable θ by a constant θ̂. This means that we wish to find a constant θ̂ such that the mean-square (MS) error

    e = E{(θ − θ̂)²} = ∫_{−∞}^{∞} (θ − θ̂)² p(θ) dθ

is minimum. Since e depends on θ̂, it is minimum if

    de/dθ̂ = −2 ∫_{−∞}^{∞} (θ − θ̂) p(θ) dθ = 0

i.e., if

    θ̂ = E{θ} = ∫_{−∞}^{∞} θ p(θ) dθ

The case where θ is to be estimated by a function θ̂(x) of the random variable (observation) x, and not by a constant, is examined next. In this case the MS error takes the form

    e = E{[θ − θ̂(x)]²} = ∫_{−∞}^{∞} ∫_{−∞}^{∞} [θ − θ̂(x)]² p(θ, x) dθ dx

where p(θ, x) is the joint density of the random variables θ and x. In this case we need to find the function θ̂(x) which minimizes the MS error. It can be proved that the function that minimizes the MS error is

    θ̂(x) = E{θ|x} = ∫_{−∞}^{∞} θ p(θ|x) dθ

The function θ̂(x) is called the nonlinear MS estimate.
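A Monte Carlo sketch of the conditional-mean property, on a hypothetical jointly Gaussian pair (θ ~ N(0,1) and x = θ + v with v ~ N(0,1) independent): here theory gives θ̂(x) = E{θ|x} = x/2 with MS error 1/2, and any other gain does worse.

```python
import random, statistics

random.seed(1)
N = 200_000
thetas = [random.gauss(0.0, 1.0) for _ in range(N)]
xs = [th + random.gauss(0.0, 1.0) for th in thetas]   # x = theta + v

def mse(gain):
    # Empirical mean-square error of the estimator gain * x
    return statistics.fmean((th - gain * x) ** 2 for th, x in zip(thetas, xs))

mse_opt = mse(0.5)   # the conditional-mean estimator E{theta | x} = x/2
mse_hi = mse(0.8)    # deliberately suboptimal gains
mse_lo = mse(0.2)
```

For a gain g the theoretical MS error is (1 − g)² + g², minimized at g = 1/2; the simulation reproduces this ordering.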
As we have seen, when the penalty function is the quadratic function (θ − θ̂)², the optimal Bayesian estimate is the conditional expectation E{θ|x}. If x and θ are jointly Gaussian, then the above conditional expectation is a linear function of x. But when the above statistics are not Gaussian, the optimal Bayesian estimate is generally a nonlinear function of x. Thus, to resolve this problem we introduce suboptimal Bayesian schemes for this quadratic penalty function. In particular, we consider only the class of linear parameter estimates and we try to find the estimate which minimizes the expected quadratic penalty. This estimate is called the linear MS estimate, and it is used in many applications because of the simplicity of the solution.

The linear estimation problem is the estimation of a random variable θ in terms of a linear function Ax + B of x, i.e., θ̂(x) = Ax + B. In this case we need to find the constants A and B in order to minimize the MS error

    e = E{[θ − (Ax + B)]²}

A fundamental principle in MS estimation is the orthogonality principle. This principle states that the optimum linear MS estimate Ax + B of θ is such that the estimation error θ − (Ax + B) is orthogonal to the data x, i.e.,

    E{[θ − (Ax + B)]x} = 0

Using the above principle, it can be proved that e is minimum if

    A = r σ_θ/σ_x   and   B = η_θ − Aη_x

where

    η_x = E{x},  η_θ = E{θ},  σ_x² = E{(x − η_x)²},  σ_θ² = E{(θ − η_θ)²},  r = E{(x − η_x)(θ − η_θ)} / (σ_x σ_θ)

i.e., η_x, η_θ are the means of x and θ; σ_x², σ_θ² are the corresponding variances; σ_x, σ_θ are the standard deviations of x and θ; and r is the correlation coefficient of x and θ. Thus the MS error takes the form e = σ_θ²(1 − r²).

The estimate

    θ̂(x) = Ax + B

is called the nonhomogeneous linear estimate of θ in terms of x. If θ is estimated by a function θ̂(x) = ax, the estimate is called homogeneous. It can also be shown by the orthogonality principle that for the homogeneous estimate

    a = E{xθ} / E{x²}

Using the orthogonality principle it can be shown that if the random variables θ and x are zero-mean Gaussian, then the optimum nonlinear estimate of θ equals the linear estimate. In other words, if θ̂(x) = E{θ|x} is the optimum nonlinear estimate of θ and θ̂ = ax is the optimum linear estimate, then θ̂(x) = E{θ|x} = θ̂ = ax. This is true since the random variables θ and x have zero mean, E{θ} = E{x} = 0, and thus the linear estimate θ̂ has zero mean too, E{θ̂} = 0. This implies that the linear estimation error ε = θ − θ̂ also has zero mean, E{ε} = 0.
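The moment formulas for A and B can be exercised on sample statistics. The sketch below uses a hypothetical model (θ = 3 + 2u and x = 1 + u + w, with u, w independent N(0,1), so that in theory A = 1, B = 2, and e = σ_θ²(1 − r²) = 2), then checks the orthogonality of the error to the data.

```python
import math, random, statistics

# Hypothetical data: theta = 3 + 2u, x = 1 + u + w, u and w ~ N(0,1) indep.
random.seed(7)
N = 200_000
us = [random.gauss(0.0, 1.0) for _ in range(N)]
thetas = [3.0 + 2.0 * u for u in us]
xs = [1.0 + u + random.gauss(0.0, 1.0) for u in us]

# Sample versions of eta_x, eta_theta, sigma_x, sigma_theta, and r.
eta_x, eta_t = statistics.fmean(xs), statistics.fmean(thetas)
sig_x = math.sqrt(statistics.fmean((x - eta_x) ** 2 for x in xs))
sig_t = math.sqrt(statistics.fmean((t - eta_t) ** 2 for t in thetas))
r = statistics.fmean((x - eta_x) * (t - eta_t)
                     for x, t in zip(xs, thetas)) / (sig_x * sig_t)

A = r * sig_t / sig_x                    # A = r sigma_theta / sigma_x
B = eta_t - A * eta_x                    # B = eta_theta - A eta_x
err = [t - (A * x + B) for t, x in zip(thetas, xs)]

orth = statistics.fmean(e * x for e, x in zip(err, xs))   # orthogonality ~ 0
ms_err = statistics.fmean(e * e for e in err)             # ~ sig_t^2 (1 - r^2)
```

Within the sample, the orthogonality E{[θ − (Ax + B)]x} = 0 and the identity e = σ_θ²(1 − r²) hold essentially exactly, while A, B themselves match the theoretical values only up to Monte Carlo error.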
On the other hand, the orthogonality principle can be applied, which implies that the linear estimation error ε is orthogonal to the data, E{εx} = 0. Since ε is Gaussian, it is independent of x and thus E{ε|x} = E{ε} = 0, which is equivalent to the following:

    E{θ − θ̂ | x} = 0  ⟹  E{θ|x} − E{θ̂|x} = 0  ⟹  E{θ|x} = E{θ̂|x} = ax  ⟹  θ̂(x) = ax = θ̂

i.e., the nonlinear and the linear estimates coincide.

In addition, since the linear estimation error ε = θ − θ̂ is independent of the data x, so is the square error, i.e.,

    E{(θ − θ̂)² | x} = E{(θ − θ̂)²} = V

Thus, the conditional mean of θ given the data x equals its MS estimate, and the conditional variance equals the MS error. This simplifies the evaluation of conditional densities when Gaussian random variables are involved, because since f(θ|x) is Gaussian, it has the form

    f(θ|x) = (1/√(2πV)) exp{ −[θ − ax]² / (2V) }

Minimax Estimation Scheme
In the minimax estimation scheme the available assets are:
1. A parametrically known stochastic process parameterized by θ^m.
2. A realization x^n from the underlying active process.
3. A scalar penalty function c[θ^m, θ̂^m(x^n)] for each data sequence x^n, parameter vector θ^m, and parameter estimate θ̂^m(x^n).

The minimax schemes are solutions of saddle-point game formalizations, with payoff function the expected penalty c(θ̂^m, p) and with variables the parameter estimate θ̂^m and the a priori parameter density function p. If a minimax estimate θ̂^m₀ exists, it is an optimal Bayesian estimate at some least favorable a priori distribution p₀.

Maximum Likelihood Estimation Scheme
Maximum likelihood estimation was first introduced by Fisher. It is a very powerful estimation procedure that yields many of the well-known estimation methods as special cases.

The essential difference between Bayesian and maximum likelihood parameter estimation is that in Bayesian estimation the parameter θ^m is considered to be random with a given density function, while in the maximum likelihood framework it is unknown but fixed.

Consider a random process X(t) parameterized by θ^m, where θ^m is an unknown but fixed parameter vector of finite dimensionality m (e.g., θ^m ∈ ℝ^m). More specifically, the conditional joint probability density function f(x₁, …, x_n|θ^m) is well known for every θ^m, where x^n = [x₁, …, x_n] is a realization (or observation vector or sample vector) of the process X(t).

The problem is to find an estimate of the parameter vector θ^m based on the realization of X(t). (Note that the dimensionality of the parameter vector θ^m in the joint probability density function is assumed to be fixed.) The intuition behind the maximum likelihood method is that we choose those parameters [θ₁, …, θ_m] from which the actually observed sample vector is most likely to have come. This means that the estimator of θ^m is selected so that the observed sample vector becomes as "likely as possible."
In this sense we call the conditional joint probability density function f(x
n
*q
m
) as likelihood function l(q
m
).
The likelihood function l(q
m
) is a deterministic function of the parameter vector q
m
once the observed variables
x
1
, …, x
n
are inserted. This means that q
m
is variable and the sample vector x
n
is fixed, while the conditional joint probability density function is considered as a function of the observation vector x^n with θ^m fixed. The maximum likelihood estimator of θ^m is that value of the parameter vector for which the likelihood function is maximized.
In many cases it is more convenient to work with another function, called the log-likelihood function L(θ^m), rather than with the likelihood function itself. The log-likelihood function is the natural logarithm of l(θ^m). Since the logarithm is a monotone function, it follows that whenever L(θ^m) achieves its maximum value, l(θ^m) is maximized too, for the same value of the parameter vector θ^m. Thus the log-likelihood function is maximized for that value of the vector parameter θ^m for which the first partial derivatives with respect to θ_i, i = 1, …, m, are equal to zero, i.e.,

    ∂L(θ^m)/∂θ_i |_{θ^m = θ̂_ML} = 0,    i = 1, …, m

where θ̂_ML denotes the maximum likelihood estimate of the vector parameter θ^m.
It can be shown that when the process X(t) is memoryless and stationary (i.e., when x_1, …, x_n are independent, identically distributed), the ML estimators are consistent, asymptotically efficient, and asymptotically Gaussian.
Example: Let x_i, i = 1, …, n, be Gaussian independent random variables with mean θ and variance σ_i²: x_i ~ N(θ, σ_i²). The mean θ is to be estimated and the Rao-Cramér bound is to be evaluated. Since θ is unknown but fixed, we will use the maximum likelihood estimation scheme. The random variable x_i has the probability density function

    f(x_i) = (1/√(2πσ_i²)) exp{−(x_i − θ)²/(2σ_i²)}

Since x_i, i = 1, …, n, are independent, the joint density function is

    f(x_1, …, x_n | θ) = ∏_{i=1}^n (1/√(2πσ_i²)) exp{−(x_i − θ)²/(2σ_i²)}

which is exactly the likelihood function for this estimation problem. The log-likelihood function is

    log f(x_1, …, x_n | θ) = −(n/2) log(2π) − ∑_{i=1}^n log σ_i − (1/2) ∑_{i=1}^n (x_i − θ)²/σ_i²

We can maximize the log-likelihood function with respect to θ and find the maximizing value to be equal to

    θ̂_ML = ( ∑_{i=1}^n x_i/σ_i² ) / ( ∑_{i=1}^n 1/σ_i² )

Note that for equal variances the maximum likelihood estimate coincides with the commonly used sample mean.
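The weighted-mean form of the maximum likelihood estimate above is easy to check numerically. A minimal sketch (the function name `ml_estimate` is ours, not from the text):

```python
def ml_estimate(xs, sigmas):
    """Maximum likelihood estimate of the common mean of independent
    Gaussian samples x_i ~ N(theta, sigma_i^2): a variance-weighted mean."""
    num = sum(x / s**2 for x, s in zip(xs, sigmas))
    den = sum(1.0 / s**2 for s in sigmas)
    return num / den

# For equal variances the ML estimate reduces to the sample mean.
xs = [1.0, 2.0, 4.0, 5.0]
equal = ml_estimate(xs, [2.0] * 4)
assert abs(equal - sum(xs) / len(xs)) < 1e-12

# The Rao-Cramer bound for this problem is 1 / sum(1/sigma_i^2).
sigmas = [1.0, 2.0, 0.5]
crb = 1.0 / sum(1.0 / s**2 for s in sigmas)
```

Samples with small σ_i dominate the estimate, which is why the sample mean is recovered only when all variances are equal.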
© 2000 by CRC Press LLC
The Rao-Cramér bound can be found as follows:

    E{ [ (d/dθ) log f(x_1, …, x_n | θ) ]² } = E{ [ ∑_{i=1}^n (x_i − θ)/σ_i² ]² } = ∑_{i=1}^n 1/σ_i²

so the variance of any unbiased estimator of θ is bounded below by ( ∑_{i=1}^n 1/σ_i² )^{−1}.
In conclusion, we see that for Gaussian data the sample mean estimate is efficient because it coincides with the maximum likelihood estimate.
When the data are contaminated with a fraction of data coming from an unknown probability density function, the so-called outliers, the sample mean performs poorly even when the fraction of outliers is small. This observation gave birth to the branch of statistics called robust statistics.
Other Parameter Estimation Schemes
The Bayesian, minimax, and maximum likelihood estimation schemes described above make up the class of parametric parameter estimation procedures. The common characteristic of these procedures is the availability of some parametrically known stochastic process that generates the observation sequence x^n. When for every given parameter value θ^m the stochastic process that generates x^n is nonparametrically described, the nonparametric estimation schemes arise. The latter schemes may evolve as the solutions of certain saddle-point games, whose payoff function originates from the parametric maximum likelihood formalizations. It is assumed that, in addition to the nonparametrically described data-generating process, the only assets available are a realization x^n from the underlying active process and the parameter space Θ^m. Qualitative robustness in parameter estimation corresponds to local performance stability for small deviations from a nominal, parametrically known, data-generating process.
Defining Terms
Bayesian estimation: An estimation scheme in which the parameter to be estimated is modeled as a random variable with known probability density function.
Bias: The norm of the difference between the true value of the estimate and its mean value.
Consistent estimator: An estimator whose value converges to the true parameter value as the sample size tends to infinity. If the convergence holds w.p. 1, then the estimator is called strongly consistent or consistent w.p. 1.
Efficient estimator: An estimator whose variance achieves the Rao-Cramér bound.
Estimate: Our best guess of the parameter of interest based on a set of observations.
Estimator: A mapping from the data space to the parameter space that yields the estimate.
Homogeneous linear estimator: An estimator which is a homogeneous linear function of the data.
Maximum likelihood estimate: An estimate that maximizes the probability density function of the data conditioned on the parameter.
Mean-square estimation: An estimation scheme in which the cost function is the mean-square error.
Minimax estimate: The optimum estimate for the least favorable prior distribution.
Nonhomogeneous linear estimator: An estimator which is a nonhomogeneous linear function of the data.
Nonlinear MS estimate: The optimum estimate under the mean-square performance criterion.
Nonparametric estimation: An estimation scheme in which no parametric description of the statistical model is available.
Orthogonality principle: The fundamental principle for MS estimates. It states that the estimation error is orthogonal to the data.
Parameter estimation: The procedure by which we combine all available data to obtain our best guess about a parameter of interest.
Parametric estimation: An estimation scheme in which the statistical description of the data is given according to a parametric family of statistical models.
Penalty or cost function: A nonnegative scalar function which represents the cost incurred by an inaccurate
value of the estimate.
Qualitative robustness: A geometric formulation of robust estimation.
Robust estimation: An estimation scheme in which we optimize performance for the least favorable statistical
environment among a specified statistical class.
Unbiased estimator: An estimator whose mean value is equal to the true parameter value.
Related Topics
73.1 Signal Detection • 73.3 Stochastic Processes
References
S. Haykin, Adaptive Filter Theory, Englewood Cliffs, N.J.: Prentice-Hall, 1991.
D. Kazakos and P. Papantoni-Kazakos, Detection and Estimation, New York: Computer Science Press, 1990.
L. Ljung and T. Söderström, Theory and Practice of Recursive Identification, Cambridge, Mass.: The MIT Press,
1983.
A. Papoulis, Probability, Random Variables, and Stochastic Processes, New York: McGraw-Hill, 1984.
Further Information
IEEE Transactions on Information Theory is a bimonthly journal that publishes papers on theoretical aspects of
estimation theory and in particular on transmission, processing, and utilization of information.
IEEE Transactions on Signal Processing is a monthly journal which presents applications of estimation theory
to speech recognition and processing, acoustical signal processing, and communication.
IEEE Transactions on Communications is a monthly journal presenting applications of estimation theory to data
communication problems, synchronization of communication systems, channel equalization, and image processing.
16.3 Kalman Filtering
Fred Daum
The Kalman filter is a linear recursive algorithm that solves the least squares problem for time-varying linear
systems with non-stationary noise. It estimates the state of a linear dynamical system given linear measurements
corrupted by additive noise. It is an optimal estimator, assuming that the measurement noise is Gaussian, and
assuming that all other relevant probability densities are also Gaussian.
For example, the location of your car can be estimated using a Kalman filter to combine noisy measurements
of distance from four or more satellites. As a second example, the position and velocity of an airplane can be
estimated by a Kalman filter using noisy measurements of range, azimuth, and elevation from one or more
radars. As a third example, the future price of IBM stock can be predicted using a Kalman filter to combine
noisy data on thousands of relevant economic variables, using a dynamic model of the stock market and the
overall economy.
The Kalman filter has been applied to solve many diverse real-world engineering problems, including
spacecraft navigation, GPS navigation, robotics, air traffic control, missile guidance, chemical plant control,
stock market prediction, weather prediction, speech recognition, speech encoding and compression, radar target
tracking, satellite orbit estimation, and inertial navigation. See [Sorenson, 1985] for other applications.
Most real-world engineering problems have measurement equations or dynamical system equations that are
nonlinear in the state vector. Therefore, the Kalman filter equations cannot be applied directly; rather, the
problem must be approximated using linear equations. This linear approximation is very straightforward, and
it is called the “extended Kalman filter” (EKF). One of the main reasons for the wide application of the Kalman
filter is the ease with which a nonlinear system can be approximated by a linear system. The resulting approx-
imation is often very good, resulting in good EKF performance. Unfortunately, the EKF performance is
sometimes poor, in which case a plethora of alternative approximations can be attempted.
Kalman Filter Equations
The Kalman filter algorithm is shown as a block diagram in Fig. 16.1. The estimate of x is updated recursively
as new measurements become available. Each measurement update corresponds to one iteration of Fig. 16.1.
The symbols in Fig. 16.1 are defined in Table 16.1.
The Kalman filter uses both a model of the system dynamics

    x_k = Φ_k x_{k−1} + w_k                                  (16.16)

as well as a model of the measurements

    z_k = H_k x_k + v_k                                      (16.17)
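A one-dimensional simulation of the two models above; the scalar constants `phi`, `q`, and `r` are arbitrary illustrative values, not from the text:

```python
import random

random.seed(0)

# Scalar instance of (16.16)-(16.17): Phi_k, H_k and the noise variances
# Q_k, R_k are held constant here purely for illustration.
phi, h, q, r = 0.95, 1.0, 0.1, 0.5

x = 0.0            # true state x_0
xs, zs = [], []
for k in range(100):
    x = phi * x + random.gauss(0.0, q ** 0.5)   # x_k = Phi*x_{k-1} + w_k
    z = h * x + random.gauss(0.0, r ** 0.5)     # z_k = H*x_k + v_k
    xs.append(x)
    zs.append(z)
```

The sequences `xs` and `zs` are, respectively, the hidden state trajectory and the noisy measurements that a Kalman filter would be given.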
These models are a combination of deterministic and random effects. One can think of the true state vector, x_k, evolving in time according to a deterministic linear dynamical system:

    x_k = Φ_k x_{k−1}                                        (16.18)

with a random perturbation modeled by w_k. Likewise, the measurement model consists of a deterministic linear part:

    z_k = H_k x_k                                            (16.19)

with an additive random perturbation of v_k. As shown in Fig. 16.1, the Kalman filter predicts the state vector from one time (t_{k−1}) to the next (t_k) using the deterministic part of the linear dynamical system. This is the
FIGURE 16.1 Block diagram of Kalman filter.
TABLE 16.1 Definition of Symbols

x_k      State vector of a linear dynamical system at time t_k; x_k = Φ_k x_{k−1} + w_k (vector of dimension n)
t_k      Time of the kth measurement (scalar)
k        Index of discrete-time measurements (integer)
z_k      kth measurement; z_k = H_k x_k + v_k (vector of dimension m)
H_k      Measurement matrix at time t_k; see above (m × n matrix)
R_k      Covariance matrix of measurement noise at time t_k; R_k = E(v_k v_k^T) (m × m matrix)
v_k      Measurement noise at time t_k; Gaussian zero-mean random variable with covariance matrix R_k, statistically independent from sample to sample, and statistically independent of x_k (vector of dimension m)
Φ_k      Transition matrix of the linear dynamical system from time t_{k−1} to t_k; see above (n × n matrix)
w_k      Process noise; Gaussian zero-mean random variable with covariance matrix Q_k, statistically independent from sample to sample, and statistically independent of x_k (vector of dimension n)
Q_k      Covariance matrix of process noise at time t_k; Q_k = E(w_k w_k^T) (n × n matrix)
P_k      Error covariance matrix of x_k conditioned on Z_k; P_k = E[(x̂_k − x_k)(x̂_k − x_k)^T | Z_k] (n × n matrix)
M_k      Error covariance matrix of x_k conditioned on Z_{k−1}; M_k = E[(x̄_k − x_k)(x̄_k − x_k)^T | Z_{k−1}] (n × n matrix)
x̂_k      Estimate of x at time t_k conditioned on Z_k; x̂_k = E(x_k | Z_k) (vector of dimension n)
x̄_k      Estimate of x at time t_k conditioned on Z_{k−1}; x̄_k = E(x_k | Z_{k−1}) (vector of dimension n)
Z_k      Set of measurements up to and including time t_k; Z_k = {z_1, z_2, …, z_k} (set of m-dimensional vectors)
x̂_o      Initial estimate of x at time t_o, prior to any measurements; x̂_o = E(x_o) (vector of dimension n)
P_o      Initial error covariance matrix of x̂_o, prior to any measurements; P_o = E[(x̂_o − x_o)(x̂_o − x_o)^T] (n × n matrix)
E(·)     Expected value of (·); E(A) = ∫ A p(A) dA
p(A)     Probability density of A; Gaussian probability density (function)
p(A|B)   Probability density of A conditioned on B; Gaussian probability density (function)
(·)^T    Transpose of (·); (A^T)_{ij} = A_{ji} for a matrix A with ijth element A_{ij} (operation)
(·)^{−1}  Inverse of matrix (·); A^{−1} is the inverse of matrix A if and only if A^{−1}A = AA^{−1} = I (operation)
I        Identity matrix; I_{ij} = 1 for i = j, 0 otherwise (n × n matrix)
best prediction of x(t_k) given x(t_{k−1}), assuming that w_k is a zero-mean random variable uncorrelated with x(t_{k−1}). Hence, the state vector prediction in the Kalman filter is intuitively exactly what one would expect. The state vector update is also intuitively reasonable. In particular, it is a linear combination of the predicted value of x and the latest measurement z_k. This linear combination of x̄_k and z_k is computed as a compromise between prediction errors in x̄_k and the measurement noise errors in z_k. That is, the Kalman filter gain is computed to optimally combine x̄_k and z_k, using the known models of system dynamics and measurements. More specifically, the Kalman filter gain is computed to minimize the following error criterion at the time of the kth measurement:
    J = (x̂_k − x̄_k)^T M_k^{−1} (x̂_k − x̄_k) + (z_k − H_k x̂_k)^T R_k^{−1} (z_k − H_k x̂_k)        (16.20)

The covariance matrix of the prediction errors is M_k, and R_k is the measurement error covariance matrix. If M_k is large, then the first term in J is weighted less because the prediction, x̄_k, is relatively inaccurate. Similarly, if R_k is large, then the second term in J is weighted less because the measurement, z_k, is relatively inaccurate. The weighted least squares criterion, J, strikes a balance between prediction errors and measurement errors. To find the value of x̂_k that minimizes J, we can set the derivative of J with respect to x̂_k equal to zero:

    ∂J/∂x̂_k = 2(x̂_k − x̄_k)^T M_k^{−1} − 2(z_k − H_k x̂_k)^T R_k^{−1} H_k = 0                      (16.21)

Using the fact that covariance matrices are symmetric, and rearranging terms, we get:

    (M_k^{−1} + H_k^T R_k^{−1} H_k) x̂_k = M_k^{−1} x̄_k + H_k^T R_k^{−1} z_k                        (16.22)

and solving for x̂_k:

    x̂_k = (M_k^{−1} + H_k^T R_k^{−1} H_k)^{−1} [M_k^{−1} x̄_k + H_k^T R_k^{−1} z_k]                 (16.23)

This can be put into the form:

    x̂_k = x̄_k + K_k (z_k − H_k x̄_k)                                                               (16.24)

where:

    K_k = (M_k^{−1} + H_k^T R_k^{−1} H_k)^{−1} H_k^T R_k^{−1}                                       (16.25)

Further matrix algebra can be used to represent the Kalman filter gain in two other forms:

    K_k = M_k H_k^T (H_k M_k H_k^T + R_k)^{−1}                                                      (16.26)

    K_k = P_k H_k^T R_k^{−1}                                                                        (16.27)

in which P_k is the error covariance matrix of x̂_k.
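The three gain expressions (16.25)–(16.27) are algebraically equivalent, which can be checked numerically with arbitrary (symmetric positive definite) test matrices; a sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 3, 2
M = np.eye(n) + 0.1 * np.ones((n, n))   # prediction covariance M_k (SPD)
R = np.diag([0.5, 2.0])                 # measurement covariance R_k
H = rng.standard_normal((m, n))         # measurement matrix H_k

Mi, Ri = np.linalg.inv(M), np.linalg.inv(R)

# Eq. (16.25): information form of the gain
K1 = np.linalg.inv(Mi + H.T @ Ri @ H) @ H.T @ Ri
# Eq. (16.26): innovation-covariance form
K2 = M @ H.T @ np.linalg.inv(H @ M @ H.T + R)
# Eq. (16.27): via the updated covariance P_k = (M^-1 + H^T R^-1 H)^-1
P = np.linalg.inv(Mi + H.T @ Ri @ H)
K3 = P @ H.T @ Ri

assert np.allclose(K1, K2) and np.allclose(K2, K3)
```

Form (16.26) is usually preferred in practice because it inverts an m × m matrix rather than an n × n one, and m is typically much smaller than n.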
The above calculation of the Kalman filter gain is by no means a derivation of the Kalman filter, but rather
it is merely a simple heuristic calculation that gives insight into these equations. The error criterion, J, is the
logarithm of a Gaussian probability density of x conditioned on Z
k
. A lucid derivation of the Kalman filter is
given in Ho and Lee [1964], from a Bayesian viewpoint. In contrast, Kalman’s original derivation does not use
Bayesian methods; see Kalman [1960]. Other derivations of the Kalman filter are in Gelb [1974] and Jazwinski
[1970]. The Kalman filter is “optimal” in a number of ways, under different assumptions, which are discussed
in these references.
The Kalman filter is stable under rather mild assumptions, which can always be achieved in practice, by
using a state vector with minimum dimension to realize the linear dynamical model and the linear measurement
model. This corresponds to a completely controllable dynamical system and a completely observable measurement
model. Stability of the Kalman filter under these conditions was proved in a beautiful paper by Kalman (1963).
Kalman’s stability theory has great practical value because engineers do not need to check for Kalman filter
stability as a separate issue. This result is all the more impressive when we consider that the Kalman filter is a
time-varying linear dynamical system that can be very complex and have very high dimension. Nevertheless,
the Kalman filter is automatically stable under the simple minimality assumption given above.
It is important to emphasize that both the linear dynamical system and the measurement model can be time-
varying, and both the process noise and measurement noise can be nonstationary. That is, all of the following
matrices can vary with time: Φ_k, H_k, Q_k, and R_k. Also, the discrete times at which measurements are made (t_k for k = 1, 2, …) are completely arbitrary; that is, the sample rate can be nonuniform.
Kalman Filter Examples
A good way to understand the Kalman filter equations is to look at some simple low-dimensional examples.
Example 1
Consider the problem of estimating the value of an unknown constant scalar, x, given a sequence of noisy measurements:

    z_k = x_k + v_k                                          (16.28)

where the variance of the measurement noise is constant, and the variance of the a priori estimate of x is infinite. For this example, using the Kalman filter notation:

    Φ_k = 1
    Q_k = 0
    H_k = 1
    R_k = constant = c
    P_o = ∞

the corresponding Kalman filter equations given in Fig. 16.1 are

    x̄_k = x̂_{k−1}                                           (16.29)

    x̂_k = x̄_k + K_k (z_k − x̄_k)                             (16.30)

    K_k = M_k/(M_k + c)                                      (16.31)
    M_k = P_{k−1}                                            (16.32)

    P_k = M_k − M_k²/(M_k + c)                               (16.33)

    P_o = ∞                                                  (16.34)

which simplifies to

    x̂_k = x̂_{k−1} + K_k (z_k − x̂_{k−1})                     (16.35)

    K_k = P_{k−1}/(P_{k−1} + c)                              (16.36)

    P_k = P_{k−1} − P_{k−1}²/(P_{k−1} + c)                   (16.37)

After some more algebra, it turns out that

    P_k^{−1} = P_{k−1}^{−1} + 1/c                            (16.38)

where P_o^{−1} = 0, and hence

    P_k^{−1} = ∑_{j=1}^k 1/c                                 (16.39)

    P_k^{−1} = k/c                                           (16.40)

Therefore, the variance of estimation error after k measurements is

    P_k = c/k                                                (16.41)

which is exactly what we should expect for this example. Also, the Kalman filter gain is

    K_k = P_{k−1}/(P_{k−1} + c)                              (16.42)

    K_k = 1/(1 + c P_{k−1}^{−1})                             (16.43)

    K_k = 1/(1 + (k − 1))                                    (16.44)
    K_k = 1/k                                                (16.45)

which is intuitively very reasonable. Furthermore, the Kalman filter can now be written as:

    x̂_k = x̂_{k−1} + (1/k)(z_k − x̂_{k−1})                    (16.46)

which has the solution

    x̂_k = (1/k) ∑_{j=1}^k z_j                                (16.47)

The Kalman filter for this simple example is nothing more than our old friend, the arithmetic average.
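A short sketch confirming that the Example 1 recursion with gain K_k = 1/k reproduces the arithmetic average:

```python
def kalman_constant(zs):
    """Kalman filter of Example 1: estimate a constant from noisy samples.
    With K_k = 1/k, the recursion reproduces the running arithmetic mean."""
    xhat = 0.0
    for k, z in enumerate(zs, start=1):
        xhat = xhat + (1.0 / k) * (z - xhat)   # Eq. (16.46)
    return xhat

zs = [2.0, 4.0, 9.0]
assert abs(kalman_constant(zs) - sum(zs) / len(zs)) < 1e-12
```

Note that the starting value of `xhat` is irrelevant: the first iteration has gain 1 and overwrites it with z_1, which is the filtered counterpart of the infinite prior variance P_o = ∞.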
Example 2
Consider the same problem as in Example 1, but with R_k not constant. It is easy to show that the Kalman filter in this case is

    x̂_k = x̂_{k−1} + (P_k/R_k)(z_k − x̂_{k−1})                (16.48)

where the estimation error variance after k measurements is given by

    P_k = 1/( ∑_{j=1}^k 1/R_j )                              (16.49)

and the Kalman filter estimate of x after k measurements is:

    x̂_k = ( ∑_{j=1}^k z_j/R_j ) / ( ∑_{j=1}^k 1/R_j )        (16.50)

This result is intuitively very reasonable. In particular, the more accurate measurements (corresponding to small R_j) are weighted more heavily in estimating x; conversely, relatively inaccurate measurements (with large R_j) are given little weight.
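The recursion (16.48)–(16.49) can be checked against the batch weighted average (16.50); a sketch with illustrative noise variances:

```python
def weighted_recursive(zs, rs):
    """Example 2 recursion: xhat_k = xhat_{k-1} + (P_k/R_k)(z_k - xhat_{k-1}),
    with 1/P_k = sum_{j<=k} 1/R_j.  Inaccurate samples (large R) get low weight."""
    inv_p = 0.0
    xhat = 0.0
    for z, r in zip(zs, rs):
        inv_p += 1.0 / r                 # Eq. (16.49)
        p = 1.0 / inv_p
        xhat += (p / r) * (z - xhat)     # Eq. (16.48)
    return xhat

zs = [1.0, 3.0, 10.0]
rs = [1.0, 1.0, 1e6]                     # the third sample is nearly worthless
batch = sum(z / r for z, r in zip(zs, rs)) / sum(1.0 / r for r in rs)  # Eq. (16.50)
assert abs(weighted_recursive(zs, rs) - batch) < 1e-9
```

With the third variance set to 1e6, the estimate stays very close to the average of the first two samples, illustrating how large-R_j measurements are effectively ignored.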
Example 3
Consider the problem of estimating the value of a quantity, y, that changes linearly with time with an unknown rate of change, given a sequence of measurements of y corrupted by additive noise that is statistically independent from sample to sample. In the Kalman filter setup, we could model this problem as follows. Let the state vector be:

    x = [y, ẏ]^T                                             (16.51)
The transition matrix would be:

    Φ_k = [ 1  Δt_k ]
          [ 0    1  ]                                        (16.52)

where Δt_k = t_k − t_{k−1}. Furthermore,

    H_k = [1  0]
    Q_k = 0
    R_k = constant
    P_o^{−1} = 0

Assuming a constant value of Δt_k = T, it turns out that the error covariance matrix after k measurements is:

    M_k = (R/(k(k + 1))) [ 2(2k − 1)        6/T            ]
                         [ 6/T              12/((k − 1)T²) ]        (16.53)

See [Sorenson, 1967] for details.
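This closed form can be checked numerically by running the covariance recursion of Fig. 16.1 with a large but finite initial covariance standing in for the infinite prior; the matrix in (16.53) is interpreted here as the covariance after the update at the kth measurement. A sketch:

```python
import numpy as np

T, R = 1.0, 1.0
Phi = np.array([[1.0, T], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
P = np.eye(2) * 1e9          # numerical stand-in for the infinite prior

for k in range(1, 6):
    M = Phi @ P @ Phi.T if k > 1 else P        # predict (Q_k = 0)
    S = (H @ M @ H.T).item() + R               # innovation variance
    K = (M @ H.T) / S                          # gain, Eq. (16.26)
    P = M - K @ (H @ M)                        # update covariance

k = 5
expected = (R / (k * (k + 1))) * np.array(
    [[2.0 * (2 * k - 1), 6.0 / T],
     [6.0 / T, 12.0 / ((k - 1) * T ** 2)]])
assert np.allclose(P, expected, atol=1e-4)
```

After five measurements the position variance is 0.6R and the rate variance 0.1R, matching the least-squares straight-line fit to five equally spaced points.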
Extended Kalman Filter
In practical applications, it is rather rare to find a problem with dynamical equations and measurement equations that are linear in x. Nevertheless, engineers use the Kalman filter theory applied to a linear approximation of the actual nonlinear dynamics and measurements. This approximation is called the extended Kalman filter (EKF); it is very straightforward and popular. The Kalman filter itself is almost never used in real-world applications, but rather the EKF is essentially ubiquitous.
Figure 16.2 shows a block diagram of the EKF. Note that the EKF uses the nonlinear dynamics and nonlinear measurement equations to predict x_k and z_k, rather than using a linear approximation. In contrast, the EKF uses linear approximations of f(x) and h(x) to compute the covariance matrices and the Kalman filter gain. The nonlinear dynamical model for x is:

    x_k = f(x_{k−1}) + w_k                                   (16.54)

and the nonlinear measurement model is:

    z_k = h(x_k) + v_k                                       (16.55)

Also, note in Fig. 16.2 that the estimate of x is used to compute the Kalman filter gain, unlike the Kalman filter, in which the filter gain and the error covariance matrices do not depend on x (see Fig. 16.1). Unlike the Kalman filter, there is no guarantee that the EKF is stable. Moreover, there is no reason to suppose that the EKF will give optimal performance. Although in many applications the EKF performance is good, it is well known that the EKF performance is often poor or far from optimal. Unfortunately, there is no theory that predicts when the EKF will give good performance, but rather engineers use Monte Carlo simulations to evaluate EKF performance. There is a vast literature on methods to improve the EKF performance, including second-order Taylor series, iteration of Fig. 16.2 to improve the linearization, tuning the process noise covariance matrix, decoupling the
error covariance matrix, preferred order of processing the components of a vector-valued measurement, careful
choice of coordinates (e.g., polar vs. Cartesian), hybrid coordinate systems, etc. There is no guarantee that any
of these methods will improve EKF performance; in some cases, second-order corrections and/or iteration
actually make EKF performance worse, contrary to intuition. Reviews of these techniques are given in Tanizaki
[1996], as well as Wishner et al. [1969], Mehra [1971], Leskiw et al. [1982], Gelb [1974], Bellaire et al. [1995],
Henriksen [1982], Fang [1976], Daum et al. [1983], and Jazwinski [1970].
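A minimal sketch of one EKF predict/update cycle in the spirit of Fig. 16.2; the particular f and h below are invented toy models, not from the text:

```python
import numpy as np

def ekf_step(x, P, z, f, F, h, Hj, Q, R):
    """One predict/update cycle of the extended Kalman filter: the nonlinear
    f and h are used to predict the state and the measurement, while their
    Jacobians F and Hj are used for the covariances and the gain."""
    # Predict through the nonlinear dynamics, Eq. (16.54)
    x_pred = f(x)
    Fk = F(x)
    M = Fk @ P @ Fk.T + Q
    # Update with the nonlinear measurement prediction, Eq. (16.55)
    Hk = Hj(x_pred)
    S = Hk @ M @ Hk.T + R
    K = M @ Hk.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = M - K @ Hk @ M
    return x_new, P_new

# Toy scalar example: mildly nonlinear dynamics, quadratic measurement.
f = lambda x: np.array([0.9 * x[0] + 0.1 * np.sin(x[0])])
F = lambda x: np.array([[0.9 + 0.1 * np.cos(x[0])]])
h = lambda x: np.array([x[0] ** 2])
Hj = lambda x: np.array([[2.0 * x[0]]])

x, P = np.array([1.0]), np.eye(1)
x, P = ekf_step(x, P, np.array([1.1]), f, F, h, Hj, 0.01 * np.eye(1), 0.1 * np.eye(1))
```

The only structural difference from the linear filter of Fig. 16.1 is that F and Hj are re-evaluated at the current estimate on every cycle, which is why the EKF gain, unlike the Kalman gain, depends on the data.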
Nonlinear Filters
Considering the frequently disappointing performance of the EKF noted in the previous section, there has been
intense research to develop better nonlinear filters. An exact nonlinear recursive filter was derived by Beneš
(for a certain class of nonlinear problems) in a seminal paper [1981]. The Beneš filter is “exact” in the sense
that it computes an optimal estimate of x, without any approximations, in contrast to the EKF, which uses
linear approximations. A generalization of the Beneš filter and the Kalman filter was developed by Daum [1986a;
1986b]. Figure 16.3 shows the superior performance of this new nonlinear filter compared to the EKF for certain
practical applications; see Schmidt [1993] for details. A more general class of exact nonlinear recursive filters
is based on the exponential family of probability densities. The Kalman filter theory is based on a Gaussian
density, which is a special case of the exponential family; see Daum [1988; 1997a, b] for a development of this
theory, which is summarized in Table 16.2.
Another alternative to the EKF, reported in Julier et al. [1995], is called the unscented filter, and in contrast
to the EKF, does not use Jacobians, but rather evaluates multidimensional integrals by sampling at carefully
selected points much like Gauss-Hermite quadrature formulas. The unscented filter shows much better per-
formance than the EKF in certain applications, with less computational complexity than the EKF.
Exact recursive filters for nonlinear estimation problems generally do not exist. This is not surprising,
considering that the existence of an exact recursive filter corresponds to the following minor miracle:
    p(x, t_k | Z_k) = p(x, t_k ; ψ_k)                        (16.56)

(the dimension of Z_k grows with k, while the dimension of ψ_k is fixed for all k)

FIGURE 16.2 The extended Kalman filter (EKF) is a linear approximation.
in which ψ_k is a sufficient statistic for x. A recursive filter exists when there is a sufficient statistic with fixed
finite dimension. In classical statistics, for parameter estimation, it is well known that this will happen (assuming
certain regularity conditions) if and only if the conditional density is from an exponential family; see Daum
[1988]. The theory of fixed finite dimensional filters has also been developed from a completely different
perspective, using Lie algebras; see Beneš [1987].
Non-recursive filters generally have superior performance compared with the EKF, at the cost of higher
computational complexity. For parameter estimation problems (corresponding to zero process noise, Q_k = 0),
these non-recursive filters are popular in practical applications despite the increased computational complexity
compared with the EKF. Gauss invented this type of non-recursive nonlinear filter over 200 years ago; see
Sorenson [1980]. On the other hand, non-recursive filters for estimating x, where x is a Markov process with
non-zero process noise (Q_k ≠ 0), generally have much greater computational complexity. Nevertheless, with a
sufficiently fast computer, it would be practical to implement such a non-recursive algorithm. The theory to
design such algorithms is well known, and some Monte Carlo simulations have shown excellent performance
relative to the EKF; see Sorenson [1988] and Kastella et al. [1997]. Presumably, with computers getting ten
times faster (at fixed cost) every 5 years, the application of such non-recursive filters will become common in
the future, despite very high computational complexity relative to the EKF.
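One brute-force alternative of the kind discussed above is a point-mass (grid) filter, which applies Bayes’ rule pointwise and avoids linearization entirely. A toy scalar sketch (all constants are illustrative):

```python
import numpy as np

# Point-mass (grid) Bayes update for a scalar state with the nonlinear
# measurement z = x**2 + v.  Brute-force, but free of linearization error.
grid = np.linspace(-3.0, 3.0, 601)
prior = np.exp(-0.5 * (grid - 1.0) ** 2)          # N(1, 1) prior, unnormalized
prior /= prior.sum()

z, r = 1.2, 0.1                                   # measurement and noise variance
likelihood = np.exp(-0.5 * (z - grid ** 2) ** 2 / r)
posterior = prior * likelihood                    # pointwise Bayes' rule
posterior /= posterior.sum()

estimate = float(grid @ posterior)                # conditional-mean estimate
```

The cost grows with the grid size raised to the state dimension, which is exactly the computational burden the text refers to; for a scalar state it is trivial, and the posterior here correctly favors the mode near x ≈ +1 implied by the prior.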
Practical Issues
Data Association
One of the best ways to ruin the performance of a Kalman filter is to put the wrong data into it. In a dense
multiple-target environment, for applications with sensors such as radar, passive infrared, sonar, acoustic, or
optical, the question of which measurement originated from which target is a very serious issue. In addition,
the measurements could be unresolved mixtures from two or more objects. There is a plethora of algorithms
to mitigate these problems, as well as a vast literature on this subject, including Blackman [1986]; Blackman
et al. [1999], Bar-Shalom [1995], and Daum [1992].
FIGURE 16.3 New nonlinear filter vs. extended Kalman filter [See Schmidt (1993)].
TABLE 16.2 Exact Recursive Filters

Each filter is specified by its conditional density p(x, t | Z_k), the class of dynamics for which it is exact, and propagation equations for the density parameters:

1. Kalman (1960): Gaussian conditional density; linear dynamics, ∂f/∂x = A(t); the mean and covariance propagate as ṁ = Am and Ṗ = AP + PA^T + GG^T.
2. Beneš (1981): conditional density of the form η exp[∫ f(x) dx]; nonlinear drift with symmetric Jacobian, ∂f/∂x = (∂f/∂x)^T, satisfying f^T f + tr(∂f/∂x) = x^T A x + b^T x + c.
3. Daum (1986): conditional density η P_ss^α(x) exp(·), built from a steady-state density P_ss(x); drift of the form f = α Q r + Dx + E, where r = ∂ log P_ss(x)/∂x.
4. Daum (1986): same conditional density as filter 3, with a related class of dynamics and propagation equations.
5. Daum (1986): conditional density η q^α(x, t) with r = ∂ log q(x, t)/∂x, where q(x, t) satisfies a partial differential equation determined by the drift.
6. Daum (1988): exponential-family conditional density p(x, t | Z_k) = exp[θ^T(x, t) ψ(Z_k, t)], where θ(x, t) is the solution of a PDE and ψ = (ψ_1, …, ψ_M)^T propagates by an ordinary differential equation dψ/dt = Aψ + Γ.
Ill-Conditioning
The Kalman filter covariance matrix can be extremely ill-conditioned in certain applications, as analyzed in
Daum et al. [1983]. A Kalman filter is ill-conditioned if its performance is significantly degraded by numerical
errors. Special factorizations of the covariance matrices have been developed to mitigate this problem; the
standard reference is Bierman [1977]. There are many other methods to mitigate ill-conditioning, as discussed
by Daum et al. [1983].
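Besides the factorized filters of Bierman [1977], a common in-place mitigation is the Joseph form of the covariance update, which is algebraically equal to P = (I − KH)M but preserves symmetry and nonnegativity in finite arithmetic. A sketch:

```python
import numpy as np

def update_joseph(M, H, R, K):
    """Joseph-stabilized covariance update: P = (I-KH) M (I-KH)^T + K R K^T.
    A sum of nonnegative-definite terms, so rounding errors cannot produce
    negative variances, unlike the short form P = (I-KH) M."""
    n = M.shape[0]
    A = np.eye(n) - K @ H
    return A @ M @ A.T + K @ R @ K.T

M = np.array([[1e8, 0.0], [0.0, 1e-8]])     # badly scaled prediction covariance
H = np.array([[1.0, 1.0]])
R = np.array([[1.0]])
K = M @ H.T @ np.linalg.inv(H @ M @ H.T + R)

P = update_joseph(M, H, R, K)
assert np.allclose(P, P.T)                   # symmetry preserved
assert np.linalg.eigvalsh(P).min() > -1e-9   # no negative variances
```

The extra matrix products cost more than the short-form update, which is the usual trade against numerical robustness.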
Adaptive Kalman Filters
The Kalman filter was derived assuming that the process noise covariance matrix, Q_k, as well as all other matrices
(Φ_k, H_k, R_k) are known exactly a priori. In practice, however, these assumptions may be inappropriate. To
mitigate this uncertainty a number of adaptive Kalman filter algorithms can be used. Most of these algorithms
consist of a bank of Kalman filters combined adaptively using Bayes’ rule. This basic structure was invented by
Magill [1965], and it has now evolved into a more sophisticated algorithm, called interacting multiple models
invented by Blom [1984]. A recent survey of this topic is given by Bar-Shalom et al. [1995].
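The bank-of-filters idea can be sketched for a scalar problem: each filter assumes a different measurement variance, and Bayes’ rule reweights the filters by the likelihood of their innovations. This is a toy illustration of the structure only, not Magill’s algorithm verbatim:

```python
import math, random

# Two Kalman filters for a constant scalar state, differing only in their
# assumed measurement variance R; the data actually have variance 4.0.
# All constants are illustrative.
random.seed(3)
true_r = 4.0
models = [{"r": 0.25, "xhat": 0.0, "p": 100.0, "w": 0.5},
          {"r": 4.0,  "xhat": 0.0, "p": 100.0, "w": 0.5}]

for _ in range(200):
    z = 1.0 + random.gauss(0.0, math.sqrt(true_r))   # truth: x = 1
    for mdl in models:
        s = mdl["p"] + mdl["r"]                      # innovation variance
        nu = z - mdl["xhat"]                         # innovation
        mdl["w"] *= math.exp(-0.5 * nu * nu / s) / math.sqrt(2 * math.pi * s)
        k = mdl["p"] / s                             # scalar Kalman gain
        mdl["xhat"] += k * nu
        mdl["p"] -= k * mdl["p"]
    total = sum(mdl["w"] for mdl in models)
    for mdl in models:
        mdl["w"] /= total                            # Bayes' rule, normalized

# The filter whose assumed R matches the data should dominate the weights.
assert models[1]["w"] > models[0]["w"]
```

The filter assuming the wrong (too small) R sees innovations far larger than its model predicts, so its likelihood, and hence its weight, collapses.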
Measurement Models
The Kalman filter theory assumes Gaussian measurement errors that are statistically independent from sample
to sample, with zero mean and exactly known covariance matrix (R_k). However, in many practical applications,
these are poor approximations of reality. For example, in radar applications, the measurements of range, azimuth,
and elevation are often biased, non-Gaussian, and correlated with time, owing to diverse physical effects
including multipath, tropospheric refraction, ducting, ionospheric refraction, glint, RV wake, rocket exhaust
plume, RFI, ECM, unresolved measurements, bias errors in time, location and angular orientation for the radar
itself, radar hardware errors, etc. Gauss himself cautioned against naïve least squares fitting of data with bias
and drift; see Gauss [1995].
Performance Evaluation
As shown in Fig. 16.1, the Kalman filter computes the covariance matrix of the estimation error (P_k). However,
in practical applications, this theoretical covariance matrix may be extremely optimistic, owing to the effects
noted earlier (nonlinearity, ill-conditioning, data association errors, unresolved data, errors in modeling both
measurement errors and target dynamics) as well as bugs in the software itself. Therefore, the standard approach
to evaluate Kalman filter performance is Monte Carlo simulation. However, no one in their right mind would
believe the results of a complex Monte Carlo simulation without a back-of-the-envelope calculation that is in
rough agreement with the simulation results. A good source of such simple formulas is Brookner [1998].
Obviously, the very best way to evaluate Kalman filter performance is to conduct extensive real-world testing.
Unfortunately, the cost and practicality of this approach is often prohibitive or is deemed to be not cost-effective.
A judicious combination of extensive Monte Carlo simulation and limited real-world testing is often the most
practical approach to performance evaluation.
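A back-of-the-envelope consistency check of this kind can itself be coded: for Example 1 the theoretical error variance after k measurements is c/k, so the Monte Carlo sample variance of the estimation error should agree with it. A sketch:

```python
import random

# Monte Carlo consistency check for Example 1 (constant scalar state,
# measurement noise variance c): theory predicts P_k = c/k.
random.seed(7)
runs, k, c = 2000, 10, 1.0
sq_errors = []
for _ in range(runs):
    x = random.uniform(-5.0, 5.0)              # true constant, drawn per run
    xhat = 0.0
    for j in range(1, k + 1):
        z = x + random.gauss(0.0, c ** 0.5)
        xhat += (z - xhat) / j                 # running mean, Eq. (16.47)
    sq_errors.append((xhat - x) ** 2)

sample_var = sum(sq_errors) / runs
assert abs(sample_var - c / k) < 0.02          # theory: P_k = c/k = 0.1
```

A sample variance far from the filter-reported P_k is the classic symptom of an optimistic (inconsistent) filter, whatever the cause.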
The best possible performance for EKFs can be computed using theoretical lower bounds on the error
covariance matrix, such as the Cramér-Rao bound (CRB) for parameter estimation. For the standard Kalman
filter setup with zero process noise (Q_k = 0), it turns out that the CRB is simply the Kalman filter error covariance
matrix itself; see Taylor [1979]. On the other hand, for non-zero process noise, the available bounds are much
more complex to compute and they are generally not tight; see Kerr [1989] for a detailed review of the state
of the art. More generally, the theory of error bounds when data association errors are considered is developed
in Daum [1997a, b].
Digital Realization
All of the algorithms discussed here are always implemented using digital computers, owing to their superior
accuracy, flexibility, and dynamic range, as compared to currently available analog devices. Generally, 32-bit
or 64-bit floating point arithmetic is required for most practical applications, although extra precision may be
required for extremely ill-conditioned problems.
The idea of using analog computers to implement Kalman filters (which is sometimes suggested in academic
circles) is rather naïve, owing to the limited accuracy, limited dynamic range, and inflexibility of analog
computers. Likewise, the filtering theory for continuous time measurements (which dominates the academic
literature on nonlinear filtering) is also impractical because measurements must be made in discrete time to
accommodate digital computers. The naïve approximation of discrete time measurements by continuous time
data generally results in poor performance, and it is not used by practical engineers. The academic literature
is out of touch with such practical issues; for example, see Hazewinkel et al. [1981].
Defining Terms
Kalman filter: A recursive algorithm that estimates the state vector of a linear dynamical system given noisy
measurements that are linear in the state vector. This algorithm was invented by Rudolf E. Kalman, and
it was published in 1960.
Extended Kalman filter: A recursive algorithm for estimating the state of nonlinear dynamical systems that
uses the Kalman filter equations based on a linear approximation to the nonlinear dynamical system
and/or nonlinear measurement equations.
State vector: A vector that specifies the state of a dynamical system. For example, the position and velocity
of a leaf falling to the ground could be the state vector for the leaf. For deterministic systems, this
corresponds to the initial conditions for a system of ordinary differential equations. For a special type
of random process, called a Markov process, the future state is statistically independent of the past,
conditioned on knowledge of the state at the present time.
Covariance matrix: A matrix that gauges the uncertainty in a vector using second moments. The diagonal
elements of this matrix are the variances of uncertainty in the components of the vector.
References
Y. Bar-Shalom and X. Li. Multitarget-Multisensor Tracking, YBS, 1995.
R. Bellaire, E. W. Kamen, and S. M. Zabin, A new nonlinear iterated filter with applications to target tracking,
SPIE Proceedings, San Diego, 1995.
V. E. Beneš, Nonlinear filtering: problems, examples, applications, Advances in Statistical Signal Processing,
Vol. 1, pp. 1–14, JAI Press, 1987.
V. E. Beneš, Exact finite-dimensional filters for certain diffusions with nonlinear drift, Stochastics, 5, 65–92, 1981.
G. J. Bierman, Factorization Methods for Discrete Sequential Estimation, New York: Academic, 1977.
S. S. Blackman and R. F. Popoli. Design and Analysis of Modern Tracking Systems, Artech House, 1999.
S. S. Blackman. Multi-Target Tracking with Radar Applications, Artech House Inc., 1986.
H. A. P. Blom. A sophisticated tracking algorithm for ATC surveillance data, Proceedings of International Radar
Conference, Paris, 1984.
E. Brookner. Tracking and Kalman Filtering Made Easy, John Wiley & Sons, 1998.
R. G. Brown and P. Y. C. Hwang, Introduction to Random Signals and Applied Kalman Filtering, third edition,
John Wiley & Sons, 1997.
A. E. Bryson and Y. C. Ho, Applied Optimal Control, Blaisdell Publishing, 1969.
R. S. Bucy, Linear and nonlinear filtering, Proc. IEEE, 58, 854–864, 1970.
F. E. Daum. (1997a). Virtual measurements for nonlinear filters, Proceedings of IEEE Control and Decision
Conference, San Diego, pp. 1657–1662, 1997.
F. E. Daum. (1997b). Cramér-Rao type bounds for random set problems, in Random Sets (J. Goutsias, R. Mahler,
and H. T. Nguyen, eds.), pp. 165–183, Springer-Verlag, 1997.
F. E. Daum. Beyond Kalman filters: practical design of nonlinear filters, Proceedings of the SPIE Conference on
Signal and Data Processing of Small Targets, pp. 252–262, Orlando, FL, April 1995.
F. E. Daum. A system approach to multiple target tracking, Multitarget-Multisensor Tracking, Volume II (Y. Bar-
Shalom, ed.), pp. 149–181, Artech House, 1992.
F. E. Daum. New exact nonlinear filters, Bayesian Analysis of Time Series and Dynamic Models (J.C. Spall, ed.),
pp. 199–226, Marcel Dekker, New York, 1988.
F. E. Daum. (1986a). Exact finite dimensional nonlinear filters, IEEE Trans. Autom. Control AC-31(7), 616–622,
1986.
F. E. Daum. (1986b). New nonlinear filters and exact solutions of the Fokker-Planck equations, in Proceedings
of the American Control Conference, pp. 884–888, 1986.
F. E. Daum and R. J. Fitzgerald. Decoupled Kalman filters for phased array radar tracking, IEEE Trans. Autom.
Control, AC-28, 269–283, 1983.
B. T. Fang. A nonlinear counterexample for batch and extended sequential estimation algorithms, IEEE Trans.
Autom. Control, AC-21, 138–139, 1976.
C. F. Gauss. Theoria Combinationis Observationum Erroribus Minimis Obnoxiae, translated by G. W. Stewart,
SIAM, 1995.
A. Gelb (Editor). Applied Optimal Estimation, MIT Press, 1974.
M. Hazewinkel and J. C. Willems, Eds. Stochastic Systems: The Mathematics of Filtering and Identification and
Applications, D. Reidel, Dordrecht, The Netherlands, 1981.
R. Henriksen. The truncated second-order nonlinear filter revisited, IEEE Trans. Autom. Control, AC-27, 247–251,
1982.
Y. C. Ho and R. C. K. Lee. A Bayesian approach to problems in stochastic estimation and control, IEEE Trans.
Autom. Control, AC-9, 333–339, 1964.
C. E. Hutchinson. The Kalman filter applied to aerospace and electronic systems, IEEE Trans. Aerosp. Electron.
Syst. 500–504, 1984.
A. H. Jazwinski. Stochastic Processes and Filtering Theory, Academic Press, New York, 1970.
S. Julier, J. Uhlmann, and H. Durrant-Whyte. A new approach to filtering nonlinear systems, Proceedings of
American Control Conference, June 1995.
R. E. Kalman. New methods in Wiener filtering theory, in Proc. Symp. Eng. Appl. of Random Function Theory
and Probability, F. Kozin and J. L. Bogdanoff, Eds. New York: Wiley, 1963.
R. E. Kalman. A new approach to linear filtering and prediction problems, Trans. ASME J. Basic Eng., 82D, 35–45,
1960.
K. Kastella and A. Zatezalo. Nonlinear filtering for detection, tracking and ID algorithms, ONR/NSWC Workshop
on Filtering and Tracking, May 1997.
T. H. Kerr. Status of CR-like lower bounds for nonlinear filtering, IEEE Trans. Aerosp. Electron. Syst., 25, 590–601,
1989.
D. M. Leskiw and K. S. Miller. Nonlinear estimation with radar observations, IEEE Trans. Aerosp. Electron. Syst.,
AES-18, 192–200, 1982.
D. T. Magill. Optimal adaptive estimation of sampled stochastic processes, IEEE Trans. Autom. Control, AC-10,
434–439, 1965.
R. K. Mehra. A comparison of several nonlinear filters for reentry vehicle tracking, IEEE Trans. Autom. Control,
AC-16, 307–319, 1971.
G. C. Schmidt. Designing nonlinear filters based on Daum’s Theory, AIAA Journal of Guidance, Control and
Dynamics, 16, 371–376, 1993.
B. E. Schutz, J. D. McMillan, and B. D. Tapley. Comparison of statistical orbit determination methods, AIAA
J., Nov. 1974.
H. W. Sorenson. Recursive estimation for nonlinear dynamic systems, Bayesian Analysis of Time Series and
Dynamic Models, J.C. Spall, Ed., pp. 127–165, Marcel Dekker, 1988.
H. W. Sorenson. Kalman Filtering: Theory and Applications, IEEE Press, New York, 1985.
H. W. Sorenson. Parameter Estimation, Marcel Dekker, 1980.
H. W. Sorenson. On the development of practical nonlinear filters, Inf. Sci. 7, 253–270, 1974.
H. W. Sorenson. On the error behavior in linear minimum variance estimation problems, IEEE Trans. Autom.
Control, AC-12, 557–562, 1967.
H. Tanizaki. Nonlinear Filters, 2nd ed., Springer-Verlag, 1996.
J. H. Taylor. Cramér-Rao estimation error lower bound analysis for nonlinear systems with unknown deter-
ministic variables, IEEE Trans. Autom. Control, April 1979.
R. P. Wishner, R. E. Larson, and M. Athans. Status of radar tracking algorithms, Symp. on Nonlinear Estimation
Theory and Appl., 1970.
R. P. Wishner, J. A. Tabaczynski, and M. Athans. A comparison of three non-linear filters, Automatica 5, 487–496,
1969.
Further Information
The best concise introduction to Kalman filtering is Chapter 12 in Bryson and Ho [1969]. The three best books
on Kalman filters are Gelb [1974], Sorenson [1985], and Brown and Hwang [1997]. The standard reference on
nonlinear filters is Jazwinski [1970]. The best journal on Kalman filter applications is the IEEE Transactions on
Aerospace and Electronic Systems, which typically has several practical papers on Kalman filters each issue. Two
good conferences with many papers on Kalman filters are the IEEE Conference on Decision and Control (mid-
December annually) and the SPIE Conference on Signal and Data Processing (April each year).
© 2000 by CRC Press LLC