16.322 Stochastic Estimation and Control, Fall 2004
Prof. Vander Velde
Page 1 of 8
Lecture 22
Last time:
1
1
1
22
1?
22
1?
?
?
11
N
k
kNxk
N
kNxk
x z
x
σ σ
σ σ
=+
=+
+
=
+
∑
∑
The estimate of x based on N data points can then be made without
reprocessing the first
1
N points. Their effect can be included simply by starting
with a pseudo observation which is equal to the estimate based on the first
1
N
points having a variance equal to the variance of the estimate based on
1
N .
The same is true of the variance of the estimate based on N .
11
22 2
1??
11 1
N
kNxx k
σ σσ
=+
=+
∑
A priori information about x can be included in exactly this way whether or not
it was derived from previous measurements. Whatever the source of the prior
information, it can be expressed as an a priori distribution ()f x , or at least as an
expected value and a variance. Take the expected value as a pseudo observation,
2
0
σ , and accumulate this data with the actual data using the standard formulae.
With the prior information included as a pseudo observation, the least squares
estimate is formed just as if there were no prior information. The result, for
normal variables at least, is identical to the estimators based on the conditional
distribution of x .
Bayes’ rule can be used to form the distribution ( )
12
| , ,...,
N
f xzz z starting from
the original a priori distribution ()f x
()
()
()()
()()
1
1
1
12
12
12
()
() |
(| )
() |
||
(|, )
||
etc.
fx
fxfz x
fxz
fufz udu
f xz fz x
fxzz
f uz fz udu
=
=
∫
∫
if the measurements are conditionally independent. Two disadvantages relative
to the previous method:
16.322 Stochastic Estimation and Control, Fall 2004
Prof. Vander Velde
Page 2 of 8
? More computation unless you know each conditional density is going to be
normal
? Must provide ()f x - the a priori distribution. This is both the advantage and
disadvantage of this method.
Other estimators include the effect of a priori information directly. Several
estimators are based on the conditional probability distribution of x given the
values of the observations. In this approach, we think of x as a random variable
having some distribution. This troubles some people since we know x is in fact
fixed at some value throughout all the experiment. However, the fact that we do
not know what the value is is expressed in terms of a distribution of possible
values for x . The extent of our a priori knowledge is reflected in the variance of
the a priori distribution we assign.
Having an a priori distribution for x , and the values of the observations, we can
in principle – and often in fact – calculate the conditional distribution of x given
the observations. This is in fact the a posteriori distribution, ( )
1
| ,...,
N
f xz z . This
distribution expresses the probability density for various values of x given the
values of the observations and the a priori distribution. Having this distribution,
one can define a number of reasonable estimates. One is the minimum variance
estimate – that value ?x which minimizes the error variance.
()( )
2
1
?Var | ,...,
N
x xfxz zdx
∞
?∞
=?
∫
But a derivative of this variance with respect to ?x shows the minimizing value of
?x to be
()
1
? | ,...,
N
x xf x z z dx
∞
?∞
=
∫
which is the conditional mean – the mean of the conditional distribution of x .
Another reasonable estimate of x based on this conditional distribution is the
value at the maximum probability density. This can perfectly well be called the
“maximum likelihood” estimate though it is not necessarily the same as the
maximum likelihood estimate we have just derived. Schweppe calls it the MAP
(“maximum a posteriori probability”) estimator. The two are related as follows:
The first is the x which maximizes
()
()
1
1
,..., ,
,..., |
()
N
N
f zzx
fz z x
fx
=
16.322 Stochastic Estimation and Control, Fall 2004
Prof. Vander Velde
Page 3 of 8
The second is the x which maximizes
()
()
()
()
()
()
()
1
1
1
1
1
1
1
, ,...,
| ,...,
,...,
,..., ,
,..., |
()
,...,
| ,...,
()
N
N
N
N
N
N
N
fxz z
fxz z
fz z
fz zx
fz z x
fx
f zz
fxz z
fx
=
=
=
The a priori distribution ()f x , and the distribution of x is also involved in the
joint distribution of the observations. Only if the distribution of x is flat for all x
can we guarantee that these two functions will have maxima at the same value of
x . But a flat distribution for x for all x is exactly the case when there is no a
priori knowledge about x . In that case all values of x have equal probability
density which is in fact an infinitesimal. We can consider ()f x in that case to be
the limit of almost any convenient distribution as the variance →∞.
If, on the other hand, we do have some prior information about x based on some
previous measurements or on physical reason, ()f x will have some finite shape
– often a normal shape – and the x which maximizes ( )
1
,..., |
N
f zzx will not
maximize ()
1
| ,...,
N
f xz z . In this case the latter choice of x is to be preferred
since it is the most probable value of x based on the a priori distribution of x
and the values of the observations, whereas the former depends only on the
observations.
We just note in passing that for a normal ( )
1
| ,...,
N
f xz z or any other distribution
which is symmetric about the maximum point, the most probable value is equal
to the conditional mean which is the minimum variance value of ?x . In the
absence of prior information, this is also the maximum likelihood estimate,
which is the least weighted squares estimate. This we found to be a linear
combination of the data. So in the case of a normal conditional distribution with
no prior information, the optimum linear estimate based on the data is the
minimum variance estimator. No nonlinear operation on the data can give a
smaller variance estimate.
16.322 Stochastic Estimation and Control, Fall 2004
Prof. Vander Velde
Page 4 of 8
For non-normal distributions, one often defines a linear estimator and optimizes
it based on minimum mean squared error. This gives the same estimate as that
derived here. But in that case it may be that some nonlinear operation on the
data could do better.
Also note that if ()
1
| ,...,
N
f xz z is known to be normal, as if all noises are normal,
the initial distribution is normal, and only linear operations are involved, then
the distribution is completely defined by its mean and variance – and only these
parameters need be computed. But in any other case, updating a general
distribution requires the entire distribution- and the estimation problem becomes
infinite in dimension.
Estimators based on the conditional distribution of x are thus very satisfying
theoretically, but are more complicated to derive than a simple least squares fit.
We have found the least squares estimate to equal these other estimators if there
is no a priori information. Fortunately it is possible to include a priori
information in the least squares format in such a way as to recast the problem in
the form of an equivalent estimation problem with no prior information. This is
done by introducing the a priori information in the form of an additional pseudo
observation. That this can be done is due to the fact that the least squares
estimate is a linear combination of the observations – and thus it is possible to
group the observations in any way.
Recursive formulation
(The other method is called batch processing.)
Suppose we had the estimate
0
?x based on any prior information and all
measurements already taken, and its variance
2
0
σ . Then we took just one more
measurement and wished to obtain an improved estimate of x immediately.
The formula says
()
()
0
2222
00
10
22 22
00
22
0
2
0
0022
0
00
?
??
11
??
??
zz
zz
z
z
xz
x xz
xzx
xkzx
σσσσ
σσ σσ
σσ
σ
σσ
+
== +
++
+
=+ ?
+
=+ ?
16.322 Stochastic Estimation and Control, Fall 2004
Prof. Vander Velde
Page 5 of 8
New estimate = Old estimate + Gain(Measurement residual)
This is the form of the modern recursive estimator. If this is formulated directly
in the case where several parameters are being estimated, the
( )
1
22
0 z
σσ
?
+ is a
matrix inversion. However, if only one scalar measurement is being processed,
the inversion is a scalar reciprocal.
()
()
()
22
0
222 22
10 0
1
22222
10 0
2
0
1
22 2
00
111
1
z
zz
zz
z
k
k
σ σ
σ σσ σσ
σσσσσ
σ
σσ σ
?
?
+
=+=
=+
=?
≡+
Extensions to this simple problem:
? Several estimated parameters instead of one
o No conceptual difficulty
o Get vector and matrix operations instead of scalar
? Dynamic parameters instead of static
o No real difficulty if they obey a set of linear differential equations
? Several simultaneous measurements instead of one
o No difficulty if they are linearly related to x
? Biased measurement noise
o Estimate the bias
? Correlated measurement noise
o Estimate the noise (different form of filter) or work with independent
measurement differences
? Non-normal noises
o Makes maximum likelihood difficult
o Requires full distribution rather than mean and variance
? Nonlinear system constraints or measurements
o Makes things very difficult
o Requires more information than mean and variance
Statistics in State Space Formulation
In the state space formulation we depend on the concept of the shaping filter
exclusively. Even if the input statistics are time varying, we suppose the input to
be generated by passing white noise through a suitable time-varying shaping
filter. In the non-stationary case it is not clear how to generate the shaping
system directly.
16.322 Stochastic Estimation and Control, Fall 2004
Prof. Vander Velde
Page 6 of 8
Non-stationary, multiple-input, multiple-output case
Note that this model may represent the shaping filter for a random process alone,
or a system driven by white noise, or a system driven by correlated noise with
the shaping filter included.
Shaping filter:
vAvBn
dCv
=+
=
&
Transfer function (state space model):
x Fx d
yHx
=+
=
&
Augmented system:
{
[]
0
0
0
B
A
C
x
x
v
FGC
x xnAxBn
AB
yH xCx
′
′
′
??
′=
??
??
????
′ ′ ′′ ′′+≡+
????
????
′′′=≡
=&
14243
14243
We will drop the primes and treat the input noise as white.
() () () () ()
() () ()
x tAtxtBtnt
yt Ctxt
+
=
=&
We will allow the possibility of a biased input noise.
()nt is defined.
()
11 22 121
() () () () ()
T
nt nt nt nt Nt t tδ
????
??=?
????
16.322 Stochastic Estimation and Control, Fall 2004
Prof. Vander Velde
Page 7 of 8
The elements of ()Nt are the intensities, or spectral densities, of the components
of ()nt. Never call it the variance of the white noise!
Propagation of the mean
() () () () ()
() () ()
x tAtxtBtnt
yt Ctxt
+
=
=&
So the mean is propagated directly by the system dynamics.
Propagation of the covariance matrix
() () () () () () () () ()
() () ()
TT
T
T
yt yt yt yt Ct xt xt xt xt C
CtXtCt
????????
??=??
????????
=
So we require the covariance matrix for the full state variable ()x t .
Can we derive a differential equation for the covariance matrix in the same way
as we did in the error propagation section?
() () () () ()
T
X t E xt xt xt xt
????
=? ?
????
For simplicity, define:
() () ()
() () ()
() () () () ()
xt xt xt
nt nt nt
x tAtxtBtnt
=?
=?
=+
%
%
&
%%%
[]
() () ()
() () () () ()
() () () () () () () () () ()
() () () () () () () () () () () ()
() () () () () (
T
TT
TTTTT
TT T
T
Xt xtxt
Xt xtxt xtxt
Atxt Btnt xt xt xt At nt Bt
Atxtxt Btntxt xtxt At xtnt Bt
AtXt XtAt Btnt
=
=+
??=+ + +
??
=+ +
%%
&&&
%% %%
%%%%% %
%% %% %% %%
% )() ()() ()
TTT
xt xtnt Bt+%%%
What is the correlation function between ()nt% and ()x t% ? It is not zero even
though ()nt% drives ()x t% and not ()x t% directly. Because ()nt% is a white noise, and
thus is infinite in magnitude almost everywhere, it has a non-negligible effect on
()x t% at the same point in time. But we need a more sophisticated calculus i.e., Ito
calculus, to figure out what it is.
So we cannot derive a differential equation for ()X t just by differentiating its
definition.
16.322 Stochastic Estimation and Control, Fall 2004
Prof. Vander Velde
Page 8 of 8
Instead, we can use a more round-about procedure. The response of the
differential equation for ()x t% can be expressed as
0
00
() (, ) ( ) (, ) ( ) ( )
t
t
x tttxt tBndφ ττττ=Φ +
∫
%% %
where (, )t τΦ is the transition matrix for the linear system with system matrix
()At . It satisfies the system homogeneous equation
(, ) () (, )
(,)
d
tAtt
dt
I
τ τ
ττ
Φ=Φ
Φ=
This approach works in this case because we can write down the form of the
solution to a set of linear differential equations. We cannot write down the form
of the solution to nonlinear differential equations so we cannot use this approach
in the case of nonlinear systems. However, the Ito calculus does apply to
nonlinear stochastic differential equations.