Ch. 3 Estimation
1 The Nature of Statistical Inference
It is argued that it is important to develop a mathematical model purporting to provide a generalized description of the data generating process. A probability model in the form of the parametric family of density functions Φ = {f(x;θ), θ ∈ Θ} and its various ramifications, formulated in the last chapter, provides such a mathematical model. By postulating Φ as a probability model for the distribution of the observations of interest, we can go on to consider questions about the unknown parameters θ (via estimation and hypothesis tests) as well as about further observations from the probability model (prediction).
In the next section the important concept of a sampling model is introduced as a way to link the postulated probability model, say Φ = {f(x;θ), θ ∈ Θ}, to the observed data x ≡ (x_1,...,x_n)′ available. The sampling model provides the second important ingredient needed to define a statistical model, the starting point of any "parametric" statistical inference.
In short, a statistical model is defined as comprising
(a). a probability model Φ = {f(x;θ),θ ∈ Θ}; and
(b). a sampling model x ≡ (X1,...,Xn)′.
The concept of a statistical model provides the starting point of all forms of statistical inference to be considered in the sequel. To be more precise, the concept of a statistical model forms the basis of what is known as parametric inference. There is also a branch of statistical inference known as non-parametric inference, where no Φ is assumed a priori.
1.1 The sampling model
A sampling model is introduced as a way to link the probability model postulated, say Φ = {f(x;θ), θ ∈ Θ}, and the observed data x ≡ (x_1,...,x_n)′ available. It is designed to model the relationship between them and refers to the way the observed data can be viewed in relation to Φ.
Definition 1:
A sample is defined to be a set of random variables (X_1,X_2,...,X_n) whose density function coincides with the "true" density function f(x;θ_0) as postulated by the probability model.
Data are generally drawn in one of two settings. A cross-section sample is a sample of a number of observational units all drawn at the same point in time. A time-series sample is a set of observations drawn on the same observational unit at a number of (usually evenly spaced) points in time. Many recent studies have been based on time-series cross sections, which generally consist of the same cross section observed at several points in time. The term panel data set is usually fitting for this sort of study.
Given that a sample is a set of r.v.’s related to Φ it must have a distribution
which we call the distribution of the sample.
Definition 2:
The distribution of the sample x ≡ (X_1,X_2,...,X_n)′ is defined to be the joint distribution of the r.v.'s X_1,X_2,...,X_n, denoted by

f_x(x_1,...,x_n;θ) ≡ f(x;θ).
The distribution of the sample incorporates both forms of relevant information, the probability as well as the sample information. It must come as no surprise to learn that f(x;θ) plays a very important role in statistical inference. The form of f(x;θ) depends crucially on the nature of the sampling model as well as on Φ. The simplest but most widely used form of sampling model is the one based on the idea of a random experiment E and is called a random sample.
Definition 3:
A set of random variables (X1,X2,...,Xn) is called a random sample from f(x;θ)
if the r.v.'s X_1,X_2,...,X_n are independently and identically distributed (i.i.d.). In this case the distribution of the sample takes the form

f(x_1,...,x_n;θ) = ∏_{i=1}^{n} f(x_i;θ) = [f(x;θ)]^n,

where the first equality is due to independence and the second is due to the fact that the r.v.'s are identically distributed.
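As a quick numerical illustration of this factorization (a sketch only; the N(θ,1) probability model, the sample size and the use of scipy.stats are assumptions made here for concreteness), the joint log-density of a random sample is just the sum of the individual log-densities:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
theta = 2.0                              # "true" parameter theta_0 under the assumed N(theta, 1) model
x = rng.normal(theta, 1.0, size=5)       # a random sample of size n = 5

# Independence: the joint log-density is the sum of the log f(x_i; theta) terms
log_joint = stats.norm.logpdf(x, loc=theta, scale=1.0).sum()
print(log_joint)
```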
A less restrictive form of sampling model is what we call an independent sample, where the identically distributed condition of the random sample is relaxed.
Definition 4:
A set of random variables (X_1,X_2,...,X_n) is said to be an independent sample from f(x_i;θ_i), i = 1,2,...,n, respectively, if the r.v.'s X_1,X_2,...,X_n are independent. In this case the distribution of the sample takes the form

f(x_1,...,x_n;θ) = ∏_{i=1}^{n} f(x_i;θ_i).

Usually the density functions f(x_i;θ_i), i = 1,2,...,n, belong to the same family but their numerical characteristics (moments, etc.) may differ.
If we relax the independence assumption as well we have what we can call a
non-random sample.
Definition 5:
A set of random variables (X_1,X_2,...,X_n)¹ is said to be a non-random sample from f(x_1,x_2,...,x_n;θ) if the r.v.'s X_1,X_2,...,X_n are non-i.i.d. In this case the only decomposition of the distribution of the sample possible is

f(x_1,...,x_n;θ) = ∏_{i=1}^{n} f(x_i | x_1,...,x_{i−1};θ_i),

given x_0, where f(x_i | x_1,...,x_{i−1};θ_i), i = 1,2,...,n, represents the conditional distribution of X_i given X_1,X_2,...,X_{i−1}.
¹ Here, we must regard this set of random variables as a sample of size 'one' from a multivariate point of view.
In the context of statistical inference we need to postulate both a probability model and a sampling model, and thus we define a statistical model as comprising both.
Definition 6:
A statistical model is defined as comprising
(a). a probability model Φ = {f(x;θ), θ ∈ Θ}; and
(b). a sampling model x ≡ (X1,X2,...,Xn)′.
It must be emphasized that the two important components of a statistical
model, the probability and sampling models, are clearly interrelated. For ex-
ample we cannot postulate the probability model Φ = {f(x;θ), θ ∈ Θ} if the
sample x is non-random. This is because if the r.v.’s X1,X2,...,Xn are not
independent the probability model must be defined in terms of their joint distri-
bution, i.e. Φ = {f(x_1,x_2,...,x_n;θ), θ ∈ Θ} (for example, stock prices). Moreover, in the case of an independent but not identically distributed sample we need to specify the individual density functions for each r.v. in the sample, i.e. Φ = {f_k(x_k;θ), θ ∈ Θ, k = 1,2,...,n}. The most important implication of this
relationship is that when the sampling model postulated is found to be inappro-
priate it means that the probability model has to be re-specified as well.
1.2 An overview of statistical inference
The statistical model in conjunction with the observed data enables us to consider the following questions:
(A). Are the observed data consistent with the postulated statistical model? (model misspecification)
(B). Assuming that the postulated statistical model is consistent with the observed data, what can we infer about the unknown parameter θ ∈ Θ?
(a). Can we decrease the uncertainty about θ by reducing the parameter space from Θ to Θ_0, where Θ_0 is a subset of Θ? (confidence estimation)
(b). Can we decrease the uncertainty about θ by choosing a particular value in Θ, say θ̂, as providing the most representative value of θ? (point estimation)
(c). Can we consider the question of whether θ belongs to some subset Θ_0 of Θ? (hypothesis testing)
(C). Assuming that a particular representative value θ̂ of θ has been chosen, what can we infer about further observations from the data generating process (DGP) as described by the postulated statistical model? (prediction)
2 Point Estimation
(Point) Estimation refers to our attempt to give a numerical value to θ. Let
(S,F,P(·)) be the probability space of reference with X a r.v. defined on this
space. The following statistical model is postulated:
(i) Φ = {f(x;θ), θ ∈ Θ}, Θ ⊆ R;
(ii) x ≡ (X_1,X_2,...,X_n)′ is a random sample from f(x;θ).
Estimation in the context of this statistical model takes the form of constructing a mapping h(·) : X → Θ, where X is the observation space and h(·) is a Borel function. The composite function (a statistic) θ̂ ≡ h(x) : S → Θ is called an estimator and its value h(x), x ∈ X, an estimate of θ. It is important to distinguish between the two because the former is a random variable and the latter is a real number.
Example:
Let f(x;θ) = (1/√(2π)) exp{−½(x − θ)²}, θ ∈ R, and let x be a random sample from f(x;θ). Then X = R^n and the following functions define estimators of θ:
1. θ̂_1 = (1/n) ∑_{i=1}^{n} X_i;
2. θ̂_2 = (1/k) ∑_{i=1}^{k} X_i, k = 1,2,...,n−1;
3. θ̂_3 = (1/n)(X_1 + X_n).
It is obvious that we can construct infinitely many such estimators. However, constructing "good" estimators is not so obvious. It is clear that we need some criteria to choose between these estimators. In other words, we need to formalize what we mean by a "good" estimator.
2.1 Finite sample properties of estimators
2.1.1 Unbiasedness
An estimator is constructed with the sole aim of providing us with the "most representative value" of θ in the parameter space Θ, based on the available information in the form of the statistical model. Given that the estimator θ̂ = h(x) is a r.v. (being a Borel function of the random vector x), any formalization of what we mean by a "most representative value" must be in terms of the distribution of θ̂, say f(θ̂). The obvious property to require a 'good' estimator θ̂ of θ to satisfy is that f(θ̂) is centered around θ.
Definition 7:
An estimator θ̂ of θ is said to be an unbiased estimator of θ if

E(θ̂) = ∫_{−∞}^{∞} θ̂ f(θ̂) dθ̂ = θ.

That is, the distribution of θ̂ has mean equal to the unknown parameter to be estimated.
Note that an alternative, but equivalent, way to define E(θ̂) is

E(θ̂) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} h(x) f(x;θ) dx,

where f(x;θ) = f(x_1,x_2,...,x_n;θ) is the distribution of the sample x.
It must be remembered that unbiasedness is a property based on the distribution of θ̂. This distribution is often called the sampling distribution of θ̂ in order to distinguish it from any other distribution of a function of r.v.'s.
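A minimal Monte Carlo sketch of the unbiasedness idea (using numpy; the N(θ,1) model and the estimators θ̂_1 and θ̂_3 from the earlier example are the assumptions here): averaging an estimator over many replications should give a value close to θ when it is unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 20, 50_000

samples = rng.normal(theta, 1.0, size=(reps, n))
theta1 = samples.mean(axis=1)                    # theta_hat_1 = sample mean
theta3 = (samples[:, 0] + samples[:, -1]) / n    # theta_hat_3 = (X_1 + X_n)/n

print(theta1.mean())   # close to theta = 2.0: unbiased
print(theta3.mean())   # close to 2*theta/n: biased (unless n = 2)
```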
2.1.2 Efficiency
Although unbiasedness seems at first sight to be a highly desirable property, it turns out that in most situations there are too many unbiased estimators for this property to be used as the sole criterion for judging estimators. The question which naturally arises is "How can we choose among unbiased estimators?". Given that the variance is a measure of dispersion, intuition suggests that the estimator with the smallest variance is in a sense better, because its distribution is more 'concentrated' around θ.
Definition 8:
An unbiased estimator θ̂ of θ is said to be relatively more efficient than some other unbiased estimator θ̃ if

Var(θ̂) < Var(θ̃).

In the case of biased estimators relative efficiency can be defined in terms of the mean square error (MSE), which takes the form

MSE(θ̂, θ_0) = E(θ̂ − θ_0)² = E[θ̂ − E(θ̂) + E(θ̂) − θ_0]² = Var(θ̂) + [Bias(θ̂, θ_0)]²,

the cross-product term being zero. Bias(θ̂, θ_0) = E(θ̂) − θ_0 is the bias of θ̂ relative to the value θ_0.
For any two estimators θ̂ and θ̃ of θ, if MSE(θ̂, θ) ≤ MSE(θ̃, θ) for all θ ∈ Θ, with strict inequality holding for some θ ∈ Θ, then θ̃ is said to be inadmissible.
Using the concept of relative efficiency we can compare the different estimators we happen to consider. This is, however, not very satisfactory, since there might be much better estimators in terms of MSE of which we know nothing. In order to avoid choosing the better of two inefficient estimators we need some absolute measure of efficiency. Such a measure is provided by the Cramér-Rao lower bound.
Definition 9:
The quantity

CR(θ) = [1 + dB(θ)/dθ]² / E[(∂ log f(x;θ)/∂θ)²]

is called the Cramér-Rao lower bound, where f(x;θ) is the distribution of the sample and B(θ) is the bias. It can be shown that for any estimator θ* of θ,

MSE(θ*, θ) ≥ CR(θ)

under the following regularity conditions on Φ:
(a). The set A = {x : f(x;θ) > 0} does not depend on θ;
(b). For each θ ∈ Θ the derivatives ∂^i log f(x;θ)/∂θ^i, i = 1,2,3, exist for all x ∈ X;
(c). 0 < E[(∂ log f(x;θ)/∂θ)²] < ∞ for all θ ∈ Θ.
In the case of unbiased estimators the inequality takes the form

Var(θ*) ≥ {E[(∂ log f(x;θ)/∂θ)²]}^{−1};

the inverse of the lower bound is called Fisher's information number and is denoted by I_n(θ).²
Definition 10 (multi-parameter Cramér-Rao Theorem):
An unbiased estimator θ̂ of θ is said to be fully efficient if

Var(θ̂) = {E[(∂ log f(x;θ)/∂θ)(∂ log f(x;θ)/∂θ)′]}^{−1} = {E[−∂² log f(x;θ)/∂θ∂θ′]}^{−1},

where

I_n(θ) = E[(∂ log f(x;θ)/∂θ)(∂ log f(x;θ)/∂θ)′] = E[−∂² log f(x;θ)/∂θ∂θ′]

is called the sample information matrix.
Proof (for the case where θ is 1×1):
Given that f(x_1,x_2,...,x_n;θ) is the joint density function of the sample, it possesses the property that

∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(x_1,x_2,...,x_n;θ) dx_1 ··· dx_n = 1,

or, more compactly,

∫_{−∞}^{∞} f(x;θ) dx = 1.
² It must be borne in mind that the information matrix is a function of the sample size n.
Assume that the domain of x is independent of θ (condition (a), which permits straightforward differentiation inside the integral sign) and that the derivative ∂f(·)/∂θ exists. Then differentiating the above equation with respect to θ results in

∫_{−∞}^{∞} [∂f(x;θ)/∂θ] dx = 0.    (1)

This equation can be re-expressed as

∫_{−∞}^{∞} [∂ ln f(x;θ)/∂θ] f(x;θ) dx = 0    (using d ln f(t)/dt = f′(t)/f(t)).

Therefore, it simply states that

E[∂ ln f(x;θ)/∂θ] = 0,
i.e., the expectation of the derivative of the natural logarithm of the likelihood
function of a random sample from a regular density is zero.
Likewise, differentiating (1) w.r.t. θ again provides

0 = ∫_{−∞}^{∞} [∂² ln f(x;θ)/∂θ²] f(x;θ) dx + ∫_{−∞}^{∞} [∂ ln f(x;θ)/∂θ] [∂f(x;θ)/∂θ] dx
  = ∫_{−∞}^{∞} [∂² ln f(x;θ)/∂θ²] f(x;θ) dx + ∫_{−∞}^{∞} [∂ ln f(x;θ)/∂θ]² f(x;θ) dx.

That is,

Var[∂ ln f(x;θ)/∂θ] = −E[∂² ln f(x;θ)/∂θ²].

Now consider the estimator h(x) of θ, whose expectation is

E(h(x)) = ∫ h(x) f(x;θ) dx.    (2)

Differentiating (2) w.r.t. θ we obtain

∂E(h(x))/∂θ = ∫ h(x) [∂f(x;θ)/∂θ] dx
            = ∫ h(x) [∂ ln f(x;θ)/∂θ] f(x;θ) dx
            = Cov[h(x), ∂ ln f(x;θ)/∂θ]    (since E[∂ ln f(x;θ)/∂θ] = 0).
By the Cauchy-Schwarz inequality (E|YZ| ≤ [E(Y²)]^{1/2}[E(Z²)]^{1/2}), we have

Cov[h(x), ∂ ln f(x;θ)/∂θ]² = [∂E(h(x))/∂θ]² ≤ Var(h(x)) · Var[∂ ln f(x;θ)/∂θ].    (3)

In light of (3), we have

Var(h(x)) ≥ [∂E(h(x))/∂θ]² / (−E[∂² ln f(x;θ)/∂θ²]).

If the estimator is unbiased, E(h(x)) = θ and

Var(h(x)) ≥ 1 / (−E[∂² ln f(x;θ)/∂θ²]).
Example:
For a random sample of size n from a normal distribution, the Cramér-Rao variance lower bound for an unbiased estimator of θ = (μ, σ²)′ is derived as follows.

f(x_1,x_2,...,x_n;θ) = f(x;θ) = ∏_{i=1}^{n} { (2π)^{−1/2} (σ²)^{−1/2} exp[ −½ ((x_i − μ)/σ)² ] }
                     = (2π)^{−n/2} (σ²)^{−n/2} exp[ −(1/(2σ²)) ∑_{i=1}^{n} (x_i − μ)² ],

ln f(x_1,x_2,...,x_n;θ) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) ∑_{i=1}^{n} (x_i − μ)²,

∂ ln L/∂μ = (1/σ²) ∑_{i=1}^{n} (x_i − μ),

∂ ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^{n} (x_i − μ)²,

∂² ln L/∂μ² = −n/σ²,

∂² ln L/∂(σ²)² = n/(2σ⁴) − (1/σ⁶) ∑_{i=1}^{n} (x_i − μ)²,

∂² ln L/∂μ∂σ² = −(1/σ⁴) ∑_{i=1}^{n} (x_i − μ).
The sample information matrix is

I_n(θ) = E[(∂ ln f(x;θ)/∂θ)(∂ ln f(x;θ)/∂θ)′] = E[−∂² ln f(x;θ)/∂θ∂θ′]

       = −E [ ∂² ln L/∂μ²      ∂² ln L/∂μ∂σ²  ]   =   [ n/σ²    0        ]
            [ ∂² ln L/∂μ∂σ²    ∂² ln L/∂(σ²)² ]       [ 0       n/(2σ⁴)  ].    (how?)

The Cramér-Rao variance lower bound therefore is

I_n^{−1}(μ, σ²) = [ σ²/n    0      ]
                  [ 0       2σ⁴/n  ].
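As a hedged numerical check of the first diagonal entry (a sketch using numpy; the parameter values, sample size and number of replications are arbitrary choices), the variance of the sample mean should match the Cramér-Rao bound σ²/n for μ:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 1.0, 4.0, 25, 200_000

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
xbar = x.mean(axis=1)

print(xbar.var())     # Monte Carlo estimate of Var(X_bar)
print(sigma2 / n)     # Cramer-Rao lower bound sigma^2 / n = 0.16
```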
2.1.3 Sufficiency
Efficiency can be seen as a property indicating that the estimator ”utilizes” all
the information contained in the statistical model. An important concept related
to the information of a statistical model is the concept of a sufficient statis-
tic introduced by Fisher (1922) as a way to reduce the sampling information
by discarding only the information of no relevance to any inference about θ. In
other words, a statistic τ(x) is said to be sufficient for θ if it makes no difference
whether we use x or τ(x) in inference concerning θ. Obviously in such a case we
would prefer to work with τ(x) instead of x, the former being of lower dimen-
sionality.
Definition 11 (sufficient statistic):
A statistic τ(·) : X → R^m, m < n (n being the sample size), is called sufficient for θ if the conditional distribution f(x | τ(x) = τ) is independent of θ, i.e. θ does not appear in f(x | τ(x) = τ) and the domain of f(·) does not involve θ.
Verifying this directly by deriving f(x | τ(x) = τ) and showing that it is independent of θ can be a very difficult exercise. One indirect way of verifying sufficiency is provided by the following theorem.
Theorem 1 (Fisher-Neyman factorization):
The statistic τ(x) is sufficient for θ if and only if there exists a factorization of the form

f(x;θ) = f(τ(x);θ) h(x),

where f(τ(x);θ) is the density function of τ(x) and depends on θ, and h(x) is some function of x independent of θ.
Even this result, however, is of no great help, because we have to have the statistic τ(x) as well as its distribution to begin with. Lehmann and Scheffé (1950) provide us with a very convenient way to derive minimal sufficient statistics.
Intuition suggests that, since efficiency is related to full utilization of the information in the statistical model, and sufficiency can be seen as a maximal reduction of such information without losing any relevant information as far as inference about θ is concerned, there must be a direct relationship between the two properties. A relationship along the lines that, when an efficient estimator is needed, we should look no further than the sufficient statistics is provided by the following theorem.
Theorem 2 (Rao-Blackwell):
Let τ(x) be a sufficient statistic for θ and κ(x) be an estimator of θ; then

E[h(x) − θ]² ≤ E[κ(x) − θ]²,  θ ∈ Θ,

where h(x) = E(κ(x) | τ(x) = τ), i.e. the conditional expectation of κ(x) given τ(x) = τ.
From the above discussion of the properties of unbiasedness, relative and full efficiency, and sufficiency we can see that these properties are directly related to the distribution of the estimator θ̂ of θ. As argued repeatedly, deriving the distribution of Borel functions of r.v.'s such as θ̂ = h(x) is a very difficult exercise and very few results are available in the literature. The existing results are mainly related to simple functions of normally distributed r.v.'s. For the cases where no such results are available (which is the rule rather than the exception) we have to resort to asymptotic results. This implies that we need to extend the above list of criteria for 'good' estimators to include asymptotic properties of estimators. These asymptotic properties refer to the behavior of θ̂_n as n → ∞. In order to emphasize the distinction between these asymptotic properties and the properties considered so far, we will call the latter finite sample properties. The finite sample properties are related directly to the distribution of θ̂_n, say f(θ̂_n). On the other hand, the asymptotic properties are related to the asymptotic distribution of θ̂_n.
2.2 Asymptotic Properties
2.2.1 Consistency
A natural property to require estimators to have is that as n → ∞ the probability of θ̂ being close to the true θ should increase as well. We formalize this idea using the concept of convergence in probability associated with the weak law of large numbers.
Definition 11:
An estimator θ̂_n = h(x) is said to be (weakly) consistent for θ if, for every ε > 0,

lim_{n→∞} Pr(|θ̂_n − θ| < ε) = 1,

and we write θ̂_n →_p θ.
A sufficient condition for weak consistency of θ̂_n is formalized in the following lemma.
Lemma (Convergence in Mean Square Error):
If θ̂_n is an estimator of θ which satisfies the following properties:
(a) lim_{n→∞} E(θ̂_n) = θ;
(b) lim_{n→∞} Var(θ̂_n) = 0,
then θ̂_n →_p θ.
It is important to note that these are only sufficient conditions for consistency (not necessary); that is, consistency is not equivalent to the above conditions, since for consistency Var(θ̂_n) need not even exist.
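A small simulation sketch of weak consistency (numpy again; the exponential model, the choice of ε and the sample sizes are illustrative assumptions, not part of the text): the fraction of replications in which |θ̂_n − θ| exceeds a fixed ε should shrink towards zero as n grows.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, eps, reps = 3.0, 0.2, 2000

for n in (10, 100, 1000, 10_000):
    # theta_hat_n = X_bar for an exponential(theta) sample, a consistent estimator of theta
    xbar = rng.exponential(theta, size=(reps, n)).mean(axis=1)
    print(n, np.mean(np.abs(xbar - theta) >= eps))   # Pr(|theta_hat_n - theta| >= eps) -> 0
```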
A stronger form of consistency associated with almost sure convergence is a
very desirable asymptotic property for estimators.
Definition 12:
An estimator θ̂_n is said to be a strongly consistent estimator of θ if

Pr( lim_{n→∞} θ̂_n = θ ) = 1,

and this is denoted by θ̂_n →_{a.s.} θ.
Theorem 3:
Given g : R^k → R^l and any sequence of random k×1 vectors b_n such that b_n →_{a.s.} b, where b is k×1, if g is continuous at b, then g(b_n) →_{a.s.} g(b).
2.2.2 Asymptotic Normality
A second form of convergence is convergence in distribution.
Definition 13:
Let X_n be a sequence of random variables indexed by the sample size, and assume that X_n has cdf F_n(x). If lim_{n→∞} |F_n(x) − F(x)| = 0 at all continuity points of F(x), where F(x) is the distribution function of a random variable X, then X_n converges in distribution to the random variable X, denoted by X_n →_d X, and F(x) is called the limiting distribution of X_n.
Extending the central limit theorem to θ̂_n leads to the property of asymptotic normality.
Definition 14:
An estimator θ̂_n is said to be asymptotically normal if two sequences {V_n(θ), n ≥ 1} and {θ_n, n ≥ 1} exist such that

[V_n(θ)]^{−1/2} (θ̂_n − θ_n) →_d Z ~ N(0,1),

where V_n(θ) is asymptotically equal to Var(θ̂_n).
In most cases of interest in practice, such as the case of a random sample, V_n(θ) is of order 1/n, denoted by V_n(θ) = O(1/n). In such a case asymptotic normality can be written in the form

√n (θ̂_n − θ_n) ~ N(0, V(θ)),

where V(θ) = n · V_n(θ) represents the asymptotic variance.
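The following sketch (an illustration with numpy/scipy under assumptions made here, namely an exponential model with V_n(θ) = θ²/n, which is O(1/n)) checks asymptotic normality of θ̂_n = X̄_n by comparing quantiles of the standardized estimator with those of N(0,1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
theta, n, reps = 3.0, 200, 20_000

xbar = rng.exponential(theta, size=(reps, n)).mean(axis=1)
z = np.sqrt(n) * (xbar - theta) / theta    # [V_n(theta)]^(-1/2) (theta_hat_n - theta), with V_n = theta^2/n

for q in (0.05, 0.5, 0.95):
    print(q, np.quantile(z, q), stats.norm.ppf(q))   # empirical vs N(0,1) quantiles
```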
Definition 15:
An estimator θ̂_n with V_n(θ) = O(1/n) is said to be asymptotically unbiased if

E[√n (θ̂_n − θ)] → 0 as n → ∞.

It must be emphasized that asymptotic unbiasedness is a stronger condition than lim_{n→∞} E(θ̂_n) = θ; the former specifies the rate of convergence.
In relation to the variance of the asymptotic normal distribution we can define
the concept of asymptotic efficiency.
Definition 16:
An asymptotically normal estimator θ̂_n is said to be asymptotically efficient if V_n(θ) = [I_n(θ)]^{−1}, where

I_n(θ) = E[(∂ log f(x;θ)/∂θ)(∂ log f(x;θ)/∂θ)′] = E[−∂² log f(x;θ)/∂θ∂θ′],

i.e. the asymptotic variance achieves the Cramér-Rao lower bound.
See Chapter 4 for detailed coverage of asymptotic theorems.
3 Methods of Estimation
The purpose of this section is to consider various methods for constructing "good" estimators of the unknown parameters θ. The methods to be discussed are the least-squares method, the method of moments and the maximum likelihood method.
3.1 The method of least-squares
The method of least squares was first introduced by Legendre in 1805 and Gauss in 1809 in the context of astronomical measurement. The problem as posed at the time was one of approximating a set of noisy observations y_i, i = 1,2,...,n, with some known functions g_i(θ_1,θ_2,...,θ_m), i = 1,2,...,n, which depend on the unknown parameters θ = (θ_1,...,θ_m)′, m < n. Legendre suggested minimizing the sum of squared errors

l(θ) = ∑_{i=1}^{n} [y_i − g_i(θ)]²    (the least-squares criterion).

Assuming differentiability of g_i(θ), i = 1,2,...,n, setting ∂l(θ)/∂θ = 0 gives rise to the so-called normal equations of the form

(−2) ∑_{i=1}^{n} [y_i − g_i(θ)] ∂g_i(θ)/∂θ_k = 0,  k = 1,2,...,m.

In this form the least-squares method has nothing to do with the statistical model developed above; it is merely an interpolation method in approximation theory.
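As a sketch of the normal equations in the simplest linear special case g_i(θ) = θ_1 + θ_2 x_i (the linear form, the simulated data and the use of numpy.linalg.lstsq are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)        # noisy observations y_i

# Design matrix for g_i(theta) = theta_1 + theta_2 * x_i
G = np.column_stack([np.ones(n), x])

# lstsq solves the normal equations G'G theta = G'y in a numerically stable way
theta_hat, *_ = np.linalg.lstsq(G, y, rcond=None)
print(theta_hat)                                # approximately [1.0, 2.0]
```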
Gauss extended the least-squares method to a statistical method by attaching a probabilistic structure to the error term. He posed the problem in the form

y_i = g_i(θ) + ε_i,  ε_i ~ NI(0, σ²),  i = 1,2,...,n.

In this form the problem can be viewed as one of estimation in the context of the statistical model

Φ = { f(y_i;θ) = (1/(σ√(2π))) exp( −(1/(2σ²)) [y_i − g_i(θ)]² ), θ ∈ Θ },

obtained by transferring the probabilistic assumption from ε_i to y_i, the observable r.v., and considering y = (Y_1,Y_2,...,Y_n)′ to be an independent sample from f(y_i;θ), i = 1,2,...,n. Gauss went on to derive what we have called the distribution of the sample,

f(y;θ) = (2πσ²)^{−n/2} exp( −(1/(2σ²)) ∑_{i=1}^{n} [y_i − g_i(θ)]² ),

and suggested that maximization of f(y;θ) with respect to θ gives rise to the same estimator of θ as minimization of the sum of squared errors

∑_{i=1}^{n} [y_i − g_i(θ)]².

As we will see below, this maximization can be seen as a forerunner of the maximum likelihood method.
3.2 The method of moments
In the context of a probability model Φ, however, the unknown parameters of interest are associated not only with the mean but also with the higher moments. This prompted Pearson in 1894 to suggest the method of moments as a general estimation method.
Let us assume that x = (X_1,X_2,...,X_n)′ is a random sample from f(x;θ), θ ∈ R^k. The raw moments of f(x;θ), μ′_r ≡ E(X^r), r ≥ 1, are by definition functions of the unknown parameters, since

μ′_r(θ) = ∫_{−∞}^{∞} x^r f(x;θ) dx,  r ≥ 1.

In order to apply the method we need to express the unknown parameters θ in the form

θ_i = g_i(μ′_1, μ′_2,..., μ′_k),  i = 1,2,...,k,

where the g_i are continuous functions. The method of moments, based on this substitution idea, proposes

θ̂_i = g_i(m_1, m_2,..., m_k),  i = 1,2,...,k,

as the estimator of θ_i, i = 1,2,...,k, where m_r = (1/n) ∑_{i=1}^{n} X_i^r, r ≥ 1, represents the sample raw moments. The justification of the method is based on the fact that if μ′_1, μ′_2,..., μ′_k are one-to-one functions of θ then, since

m_r →_{a.s.} μ′_r,  r ≥ 1,

it follows (by Slutsky's theorem) that

θ̂_i →_{a.s.} θ_i,  i = 1,2,...,k.
Example:
Let X_i ~ N(μ, σ²), i = 1,2,...,n; then μ′_1 = μ, μ′_2 = σ² + μ², and m_1 = (1/n) ∑_{i=1}^{n} X_i, m_2 = (1/n) ∑_{i=1}^{n} X_i². The method suggests

μ̂ = m_1 = X̄_n,

σ̂² = m_2 − (m_1)² = (1/n) ∑_{i=1}^{n} X_i² − X̄_n² = (1/n) ∑_{i=1}^{n} (X_i − X̄_n)².
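A brief numerical sketch of this example (numpy, with arbitrary true values chosen here; only the substitution m_1, m_2 → μ̂, σ̂² from the text is used):

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma2, n = 1.5, 4.0, 10_000
x = rng.normal(mu, np.sqrt(sigma2), n)

m1 = x.mean()               # first sample raw moment
m2 = (x**2).mean()          # second sample raw moment

mu_hat = m1                 # method-of-moments estimate of mu
sigma2_hat = m2 - m1**2     # method-of-moments estimate of sigma^2
print(mu_hat, sigma2_hat)   # close to (1.5, 4.0)
```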
Although the method of moments usually yields consistent estimators, they are in general inefficient. This point was taken up by Fisher in several papers in the 1920s and 1930s, arguing in favour of the MLE. The controversy between Pearson and Fisher about the relative merits of their respective methods of estimation ended in the mid-1930s with Fisher the winner and the absolute dominance thereafter of the maximum likelihood method.
The basic reason for the inefficiency of estimators based on the method of moments is not hard to find. It is due to the fact that the method does not use any information relating to the probability model Φ apart from the assumption that raw moments of order k exist.
3.3 The maximum likelihood method
The maximum likelihood method of estimation was formulated by Fisher and extended by various authors such as Cramér, Rao and Wald. In the current statistical literature the method of maximum likelihood is by far the most widely used method of estimation and plays a very important role in hypothesis testing.
3.3.1 The likelihood function
Consider the statistical model (univariate):
(a) Φ = {f(x;θ), θ ∈ Θ};
(b) x = (X1,X2,...,Xn)′ a sample from f(x;θ),
where x takes values in X = R^n, the observation space. The distribution of the sample D(x_1,x_2,...,x_n;θ) describes how the density changes as x takes different values in X for a given θ ∈ Θ. In deriving the likelihood function we reason as follows:
Since D(x;θ) incorporates all the information in the statistical model, it makes a lot of intuitive sense to reverse the argument used in deriving D(x;θ) and consider the question: which value of θ ∈ Θ is most strongly supported by a given sample realisation x? That is, the value of θ under which x would have had the highest 'likelihood' of arising must intuitively be our best choice of θ.
Using this intuitive argument the likelihood function is defined by

L(θ;x) = k(x) D(x;θ),  θ ∈ Θ,

where k(x) > 0 is a function of x only (not θ). In particular

L(·;x) : Θ → [0,1].

The presence of k(x) (which we usually choose arbitrarily to be equal to one) in the definition of L(θ;x) implies that the likelihood function is non-unique; any monotonic transformation of it represents the same information. In particular:
(a) log L(θ;x), the log-likelihood function; and
(b) ∂ log L(θ;x)/∂θ ≡ s(θ;x), the score function,
incorporate the same information as L(θ;x).
3.3.2 The Maximum Likelihood Estimator (MLE)
Given that the likelihood function represents the support given to the various θ ∈ Θ by the sample realisation x, it is natural to define the maximum likelihood estimator of θ to be a Borel function θ̂ : X → Θ such that

L(θ̂;x) = max_{θ∈Θ} L(θ;x);

there may be one, none or many such MLE's.
Note that

log L(θ̂;x) ≥ log L(θ*;x) for all θ* ∈ Θ.

In the case where L(θ;x) is differentiable the MLE can be derived as a solution of the equation

∂ log L(θ;x)/∂θ ≡ s(θ;x) = 0,

referred to as the likelihood equation.
Example (univariate normal distribution, i.i.d. sample):
Let X_1,X_2,...,X_n be a random sample from N(μ,σ²). Find the MLE of θ = (μ,σ²)′.
For random sampling from a normal distribution, the log-likelihood and its derivatives are

ln L(μ,σ²) = −(n/2) ln(2π) − (n/2) ln σ² − (1/(2σ²)) ∑_{i=1}^{n} (X_i − μ)²,

∂ ln L/∂μ = (1/σ²) ∑_{i=1}^{n} (X_i − μ),

∂ ln L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^{n} (X_i − μ)².
Therefore, μ̂ = (1/n) ∑_{i=1}^{n} X_i = X̄ and σ̂² = (1/n) ∑_{i=1}^{n} (X_i − X̄)² are indeed the MLE's, since

∂² ln L/∂μ² |_{θ=θ̂} = −n/σ̂² < 0,

∂² ln L/∂(σ²)² |_{θ=θ̂} = n/(2σ̂⁴) − (1/σ̂⁶) ∑_{i=1}^{n} (X_i − X̄)² = −n/(2σ̂⁴) < 0,

∂² ln L/∂μ∂σ² |_{θ=θ̂} = −(1/σ̂⁴) ∑_{i=1}^{n} (X_i − X̄) = 0.

We can deduce that

H(θ̂) = [ −n/σ̂²    0          ]
        [ 0          −n/(2σ̂⁴) ]

is a negative definite matrix.
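The closed-form result can be cross-checked numerically. The following is only an illustrative sketch (simulated data, scipy.optimize.minimize and a log-variance parameterization are choices made here, not part of the notes):

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(7)
x = rng.normal(2.0, 1.5, size=500)

def neg_loglik(params):
    mu, log_sigma2 = params                # parameterize sigma^2 through its log to keep it positive
    sigma = np.sqrt(np.exp(log_sigma2))
    return -stats.norm.logpdf(x, loc=mu, scale=sigma).sum()

res = optimize.minimize(neg_loglik, x0=[0.0, 0.0], method="BFGS")
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])

print(mu_hat, sigma2_hat)                  # numerical MLE
print(x.mean(), x.var())                   # closed form: X_bar and (1/n) * sum (X_i - X_bar)^2
```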
Example:
The expected values of the estimators μ̂ and σ̂² can be derived as follows. First,

E(μ̂) = E[ (1/n) ∑_{i=1}^{n} X_i ] = (1/n) E[ ∑_{i=1}^{n} X_i ] = (1/n) ∑_{i=1}^{n} E(X_i) = (1/n)·nμ = μ.
Denote by x = (X_1,X_2,...,X_n)′ and μ = (μ,μ,...,μ)′ the vector of this random sample and its mean vector, respectively. Then the joint distribution of this sample is

x ~ N(μ, σ²I).

We know that

∑_{i=1}^{n} (X_i − X̄)² = x′M⁰x = (x − μ)′M⁰(x − μ),

since M⁰(x − μ) = M⁰x − M⁰μ = M⁰x, where M⁰ = I − (1/n) ii′.
From the results on 'quadratic forms related to the normal distribution' in Section 6.2.3 of Ch. 2, we have

[∑_{i=1}^{n} (X_i − X̄)²] / σ² = x′M⁰x / σ² = (x − μ)′M⁰(x − μ) / σ² ~ χ²_{trace(M⁰)} ≡ χ²_{n−1}.

Therefore,

E[ ∑_{i=1}^{n} (X_i − X̄)² / σ² ] = n − 1,

or

E[ ∑_{i=1}^{n} (X_i − X̄)² ] = (n−1)σ².
The expected value of the MLE σ̂² of σ² is

E(σ̂²) = E[ (1/n) ∑_{i=1}^{n} (X_i − X̄)² ] = ((n−1)/n) σ² ≠ σ²,

so σ̂² is biased. An unbiased estimator of σ² is

s² = (1/(n−1)) ∑_{i=1}^{n} (X_i − X̄)².
Example:
Let X_1,X_2,...,X_n be a random sample from the density f(x;θ) = θ(1−θ)^x, x = 0,1,2,..., 0 < θ < 1. Find the MLE of θ.
Since this is a random sample, the likelihood function takes the form

L(θ; X_1,X_2,...,X_n) = f(X_1;θ) f(X_2;θ) ··· f(X_n;θ)
                      = θ(1−θ)^{X_1} · θ(1−θ)^{X_2} ··· θ(1−θ)^{X_n}
                      = θ^n (1−θ)^{∑_{i=1}^{n} X_i},

and the log-likelihood is therefore

ln L(θ; X_1,X_2,...,X_n) = n ln θ + ( ∑_{i=1}^{n} X_i ) ln(1−θ).

The FOC is

∂ ln L(θ; X_1,X_2,...,X_n)/∂θ = n/θ − ( ∑_{i=1}^{n} X_i )/(1−θ) = 0,

which has a solution at

θ̂ = 1 / (1 + (1/n) ∑_{i=1}^{n} X_i) = 1/(1 + X̄).
Exercise:
Check the second-order condition to make sure that the solution from the FOC in the last example is indeed an MLE.
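A quick sketch checking this closed form against a numerical maximizer (numpy/scipy and simulated data are illustrative assumptions; numpy's geometric draws count trials starting at 1, so subtracting 1 shifts the support to x = 0,1,2,... so that P(X = x) = θ(1−θ)^x):

```python
import numpy as np
from scipy import optimize

rng = np.random.default_rng(8)
theta_true = 0.3
x = rng.geometric(theta_true, size=2000) - 1      # support 0,1,2,..., pmf theta*(1-theta)^x

def neg_loglik(theta):
    return -(len(x) * np.log(theta) + x.sum() * np.log(1.0 - theta))

res = optimize.minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)                      # numerical MLE
print(1.0 / (1.0 + x.mean()))     # closed form 1/(1 + X_bar)
```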
Example (multivariate normal distribution, i.i.d. sample):
Let x ≡ (x_1′, x_2′,..., x_T′)′ be a random sample of size T from N_n(μ, Σ), i.e. x_t ~ N_n(μ, Σ), t = 1,2,...,T, x being an nT×1 vector. The likelihood function takes the form

L(μ,Σ;x) = ∏_{t=1}^{T} [ (2π)^{−n/2} |Σ|^{−1/2} exp{ −½ (x_t − μ)′Σ^{−1}(x_t − μ) } ]
         = (2π)^{−nT/2} |Σ|^{−T/2} exp{ −½ ∑_{t=1}^{T} (x_t − μ)′Σ^{−1}(x_t − μ) },

and the log-likelihood function therefore is

ln L(μ,Σ;x) = −(nT/2) ln(2π) − (T/2) ln|Σ| − ½ ∑_{t=1}^{T} (x_t − μ)′Σ^{−1}(x_t − μ).

Since

∑_{t=1}^{T} (x_t − μ)′Σ^{−1}(x_t − μ) = ∑_{t=1}^{T} (x_t − x̄_T)′Σ^{−1}(x_t − x̄_T) + T(x̄_T − μ)′Σ^{−1}(x̄_T − μ)
                                      = tr(Σ^{−1}Λ) + T(x̄_T − μ)′Σ^{−1}(x̄_T − μ),

where

x̄_T = (1/T) ∑_t x_t,   Λ = ∑_t (x_t − x̄_T)(x_t − x̄_T)′,

it follows that

ln L(μ,Σ;x) = −(nT/2) ln(2π) − (T/2) ln|Σ| − ½ tr(Σ^{−1}Λ) − (T/2)(x̄_T − μ)′Σ^{−1}(x̄_T − μ).

The FOC are

∂ ln L(μ,Σ;x)/∂μ = TΣ^{−1}(x̄_T − μ) = 0   ⇒   μ̂ = x̄_T,

and

∂ ln L(μ,Σ;x)/∂Σ^{−1} = (T/2)Σ − ½Λ = 0   ⇒   Σ̂ = (1/T) ∑_t (x_t − x̄_T)(x_t − x̄_T)′.
In deriving the first-order conditions above, we have used the following reasoning. Since Σ is positive definite, Σ^{−1} is also positive definite, so (x̄_T − μ)′Σ^{−1}(x̄_T − μ) ≥ 0, with equality if and only if μ = x̄_T. To maximize the remaining terms of ln L(μ,Σ;x) with respect to Σ we use the following lemma.
Lemma (Anderson, p. 62):
If D is positive definite of order p, the maximum of

f(G) = −T ln|G| − tr(G^{−1}D)

with respect to positive definite matrices G exists and occurs at G = (1/T)D.
Proof:
Let D = EE′ and E′G^{−1}E = H. Then G = EH^{−1}E′, |G| = |E||H^{−1}||E′| = |EE′|/|H| = |D|/|H|, and tr(G^{−1}D) = tr(G^{−1}EE′) = tr(E′G^{−1}E) = tr(H). Then the function to be maximized (with respect to positive definite H) is

f = −T ln|D| + T ln|H| − tr(H).

Let H = KK′, where K is lower triangular. Then the maximum of

f = −T ln|D| + T ln|K|² − tr(KK′) = −T ln|D| + ∑_{i=1}^{p} (T ln k_{ii}² − k_{ii}²) − ∑_{i>j} k_{ij}²

occurs at k_{ii}² = T, k_{ij} = 0 for i ≠ j; that is, at H = T·I. Then G = (1/T)EE′ = (1/T)D.
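To make the closed-form solution concrete, here is a hedged numpy sketch (the dimensions, the true parameter values and the simulated data are illustrative choices): μ̂ is the sample mean vector and Σ̂ averages the outer products of the deviations, dividing by T rather than T−1.

```python
import numpy as np

rng = np.random.default_rng(9)
n, T = 3, 5000
mu = np.array([1.0, -1.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])

X = rng.multivariate_normal(mu, Sigma, size=T)   # row t holds x_t', t = 1,...,T

mu_hat = X.mean(axis=0)                          # MLE of mu: the sample mean x_bar_T
dev = X - mu_hat
Sigma_hat = dev.T @ dev / T                      # MLE of Sigma: (1/T) * Lambda

print(mu_hat)
print(Sigma_hat)
```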
3.3.3 Numerical Method
It would be an erroneous conclusion that deriving the MLE is always a matter of simple differentiation. Let us consider an example where the derivation is not so straightforward.
Example:
Let z = (z_1′, z_2′,..., z_n′)′, where z_i = (X_i, Y_i)′, be a random sample from

f(x,y;ρ) = [(1−ρ²)^{−1/2} / (2π)] exp{ −(1/(2(1−ρ²))) (x² − 2ρxy + y²) };

then the log-likelihood function is

log L(ρ; X, Y) = −n log(2π) − (n/2) log(1−ρ²) − (1/(2(1−ρ²))) ∑_{i=1}^{n} (X_i² − 2ρX_iY_i + Y_i²),

and the FOC is

d log L/dρ = nρ/(1−ρ²) − ρ ∑_{i=1}^{n} (X_i² + Y_i²)/(1−ρ²)² + (1+ρ²) ∑_{i=1}^{n} X_iY_i/(1−ρ²)² = 0,

which, after multiplying through by (1−ρ²)², becomes

nρ̂(1 − ρ̂²) + (1 + ρ̂²) ∑_{i=1}^{n} X_iY_i − ρ̂ ( ∑_{i=1}^{n} X_i² + ∑_{i=1}^{n} Y_i² ) = 0.

This is a cubic equation in ρ̂ and hence there are three possible values for the MLE; an additional search is needed to locate the maximum using numerical methods.
A short tour of numerical methods
The template for most gradient methods in common use is Newton's method. The basis for Newton's method is a linear Taylor series approximation. Expanding the first-order conditions

∂F(θ)/∂θ = 0   (which may be nonlinear in θ, with no closed-form solution)

in a linear Taylor series around an arbitrary θ⁰ yields

∂F(θ)/∂θ ≈ g⁰ + H⁰(θ − θ⁰) = 0,

where the superscript indicates that the term is evaluated at θ⁰, and g and H are the gradient vector and Hessian matrix, respectively. Solving for θ and then equating θ to θ_{t+1} and θ⁰ to θ_t, we obtain the iteration

θ_{t+1} = θ_t − H_t^{−1} g_t.
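A minimal sketch of this iteration applied to the correlation example above (pure numpy; the simulated data, the finite-difference derivatives used in place of analytic ones, and the starting value are all assumptions of the sketch):

```python
import numpy as np

rng = np.random.default_rng(10)
rho_true, n = 0.6, 500
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, rho_true], [rho_true, 1.0]], size=n)
x, y = xy[:, 0], xy[:, 1]

def loglik(r):
    return (-n * np.log(2 * np.pi) - 0.5 * n * np.log(1 - r**2)
            - (x**2 - 2 * r * x * y + y**2).sum() / (2 * (1 - r**2)))

def newton(theta, h=1e-4, steps=20):
    for _ in range(steps):
        g = (loglik(theta + h) - loglik(theta - h)) / (2 * h)                    # gradient g_t
        H = (loglik(theta + h) - 2 * loglik(theta) + loglik(theta - h)) / h**2   # Hessian H_t
        theta = theta - g / H                                                    # theta_{t+1} = theta_t - H_t^{-1} g_t
    return theta

print(newton(0.0))   # should be near rho_true = 0.6
```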
3.3.4 Finite Sample Properties of MLE
(a) Invariance
Let θ̂ be an MLE of θ. If g(·) : Θ → R is a Borel function of θ, then an MLE of g(θ) exists and is given by g(θ̂). This means that if the MLE of θ is available, then for functions such as θ^k, e^θ, log θ, the MLE is obtained by substituting θ̂ in place of θ, i.e. θ̂^k, e^{θ̂} and log θ̂ are the MLE's of these functions. However, as in the case of σ̂², MLE's are not in general unbiased estimators. In one particular case, when unbiasedness is accompanied by full efficiency, however, the two coincide.
(b) Unbiasedness, full efficiency
In the case where Φ satisfies the regularity conditions and θ̂ is an unbiased estimator of θ whose variance achieves the Cramér-Rao lower bound, the likelihood equation has a unique solution equal to θ̂. This suggests that any unbiased fully efficient estimator θ̂ can be derived as a solution of the likelihood equation, as in the case of μ̂.
(c) Sufficiency
The property most emphasized by Fisher in support of the MLE was sufficiency. If τ(x) is a sufficient statistic for θ and a unique MLE θ̂ of θ exists, then θ̂ is a function of τ(x).
3.3.5 Asymptotic Properties (i.i.d. case and θ is a scalar) of MLE
Although MLE's enjoy several optimal finite sample properties, as seen above, their asymptotic properties provide the main justification for the almost universal appeal of the method of maximum likelihood. As argued below, under certain regularity conditions the MLE can be shown to be consistent, asymptotically normal and asymptotically efficient.
Let us begin the discussion of the asymptotic properties enjoyed by the MLE by considering the simplest possible case, where the statistical model is as follows:
(a). probability model, Φ = {f(x;θ), θ ∈ Θ}, θ being a scalar;
(b). sampling model, x ≡ (X_1,X_2,...,X_n)′ is a random sample from f(x;θ).
For our purposes it suffices to repeat the regularity conditions on Φ:
(RC 1) The set A = {x : f(x;θ) > 0} does not depend on θ.
(RC 2) For each θ ∈ Θ the derivatives ∂^i log f(x;θ)/∂θ^i, i = 1,2,3, exist for all x ∈ X.
(RC 3) 0 < E[(∂ log f(x;θ)/∂θ)²] < ∞ for all θ ∈ Θ.
(RC 4) For every θ ∈ Θ,

|∂^i log f(x;θ)/∂θ^i| ≤ h_i(x),  i = 1,2,3,

where the functions h_1(x) and h_2(x) are integrable over (−∞,∞), i.e.

∫_{−∞}^{∞} h_i(x) dx < ∞,  i = 1,2,

and

∫_{−∞}^{∞} h_3(x) f(x;θ) dx < k,

where k does not depend on θ.
Example:
Let x ≡ (X_1,X_2,...,X_n)′ be a random sample from f(x;θ) = 1/θ, where 0 ≤ x ≤ θ. Since the range of the random variables (X_1,X_2,...,X_n) depends on the parameter θ, this density violates regularity condition 1 (RC 1) above and is excluded from the properties below.
Example:
Let x ≡ (X_1,X_2,...,X_n)′ be a random sample from N(0,σ²). Then

∂³ log f(x;σ²)/∂(σ²)³ = −n/σ⁶ + 3 ∑_{i=1}^{n} X_i²/σ⁸ → ∞ as σ² → 0;

that is, the third derivative is not bounded on the open interval 0 < σ² < ∞, so regularity condition 4 (RC 4) is not satisfied.
Under the regularity conditions we can prove that the likelihood equation ∂ log L(θ;x)/∂θ = 0 admits a sequence of solutions {θ̂_n, n ≥ 1} such that:
(a) Consistency:

θ̂_n →_{a.s.} θ_0   (strong consistency),

which implies that

θ̂_n →_p θ_0   (weak consistency).

(b) Asymptotic normality:

√n (θ̂_n − θ_0) ~ N(0, I(θ)^{−1}) asymptotically.

(c) Asymptotic efficiency:

I(θ) = lim_{n→∞} [(1/n) I_n(θ)],

i.e. the asymptotic variance of θ̂_n equals the limit of the Cramér-Rao lower bound for consistent estimators (that is, Var(θ̂_n) ≈ (1/n) I(θ)^{−1} = I_n(θ)^{−1}).
Proof:
Taking a Taylor expansion of ∂ log L(θ̂_n;x)/∂θ around θ_0, we get

∂ log L(θ̂_n)/∂θ = ∂ log L(θ_0)/∂θ + [∂² log L(θ_0)/∂θ²](θ̂_n − θ_0) + o_p(n) = 0,

where o_p(n) collects the remainder terms, which are of smaller order than n.³ The above expansion is based on RC 2 and RC 4.
³ Here, o_p(n) comes from ∑_n o_p(1), and the o_p(1) comes from the fact that f(x;θ) is bounded, i.e. f(x;θ) is O_p(1) (see White, p. 28); then so is ∂ log f(x;θ)/∂θ, and the remainder of its Taylor expansion is o_p(1) as θ̂ → θ_0.
In view of the fact that log L(θ) = ∑_{i=1}^{n} log f(x_i;θ) and the f(x_i;θ), i = 1,2,...,n, are i.i.d., we can express the above Taylor expansion in the form

[ −(1/n) ∑_{i=1}^{n} ∂² log f(x_i;θ_0)/∂θ² ] (θ̂_n − θ_0) = (1/n) ∑_{i=1}^{n} ∂ log f(x_i;θ_0)/∂θ + o_p(1).

Using the strong law of large numbers for the i.i.d. r.v.'s,

A_n = (1/n) ∑_{i=1}^{n} ∂ log f(x_i;θ_0)/∂θ →_{a.s.} E[∂ log f(x;θ_0)/∂θ] = 0
and

B_n = −(1/n) ∑_{i=1}^{n} ∂² log f(x_i;θ_0)/∂θ² →_{a.s.} I_1(θ) = E[−∂² log f(x;θ_0)/∂θ²],

where ∑_{i=1}^{n} I_i(θ) = I_n(θ), i.e. I_n(θ) = n I_1(θ) in the i.i.d. case (see Arnold, p. 271).
This in turn implies that

(θ̂_n − θ_0) →_{a.s.} 0,  or  θ̂_n →_{a.s.} θ_0.
To show asymptotic normality we multiply all the terms of the Taylor expansion by √n to get

B_n √n (θ̂_n − θ_0) = (1/√n) ∑_{i=1}^{n} ∂ log f(x_i;θ_0)/∂θ + o_p(1).

Using the central limit theorem for i.i.d. r.v.'s we can show that

(1/√n) ∑_{i=1}^{n} ∂ log f(x_i;θ_0)/∂θ ~ N(0, I_1(θ)) asymptotically

(see the proof of the Cramér-Rao bound). Given that B_n →_{a.s.} I_1(θ), we can deduce that

√n (θ̂_n − θ_0) ~ N(0, I_1(θ)^{−1}) asymptotically,

or, equivalently,

(θ̂_n − θ_0) ~ N(0, I_n(θ)^{−1}),  i.e.  √n (θ̂_n − θ_0) ~ N(0, I(θ)^{−1}).
3.3.6 Estimating the asymptotic variance of the MLE (multivariate parameters, θ is k×1)
(a). If the form of the expected value of the second derivative of the log-likelihood is known, we can evaluate the information matrix at θ̂ to estimate the covariance matrix of the MLE:

[I_n(θ̂)]^{−1} = { −E[∂² ln L(θ)/∂θ∂θ′] |_{θ=θ̂} }^{−1}.

(b). If the expected value of the second derivative of the log-likelihood is complicated, two alternative estimators are available. The first evaluates the observed Hessian at θ̂:

[Î_n(θ̂)]^{−1} = { −∂² ln L(θ)/∂θ∂θ′ |_{θ=θ̂} }^{−1};

and (c). the second is the BHHH estimator:

[Ĩ_n(θ̂)]^{−1} = [ ∑_{i=1}^{n} ĝ_i ĝ_i′ ]^{−1},  where ĝ_i = ∂ ln f(x_i;θ)/∂θ |_{θ=θ̂} is k×1.
The reason for using the BHHH estimator is that

ln L = ∑_{i=1}^{n} ln f(x_i;θ).

Therefore,

∂ ln L/∂θ = g = ∑_{i=1}^{n} g_i,   where g_i = ∂ ln f(x_i;θ)/∂θ is k×1,

and

−E[∂² ln L/∂θ∂θ′] = E[(∂ ln L/∂θ)(∂ ln L/∂θ)′] = E[ ∑_{i=1}^{n} ∑_{j=1}^{n} g_i g_j′ ].

We can drop the terms with unequal subscripts, because ln f(x_i;θ), i = 1,2,...,n, come from a random sample, so that E[g_i g_j′] = E[g_i]E[g_j′] = 0 for i ≠ j, and we obtain

−E[∂² ln L/∂θ∂θ′] = E[ ∑_{i=1}^{n} g_i g_i′ ].
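A compact sketch of the three variance estimators for a one-parameter exponential model f(x;θ) = (1/θ)exp(−x/θ), where θ̂ = X̄ and the analytic information is I_n(θ) = n/θ² (the model, the simulated data and these closed forms are illustrative assumptions, not the Greene example cited below):

```python
import numpy as np

rng = np.random.default_rng(11)
theta_true, n = 2.0, 400
x = rng.exponential(theta_true, n)

theta_hat = x.mean()                       # MLE of theta in the exponential model

# (a) expected-information estimator: [I_n(theta_hat)]^{-1} = theta_hat^2 / n
var_a = theta_hat**2 / n

# (b) observed-Hessian estimator: -[ sum_i d^2 ln f(x_i;theta)/d theta^2 ]^{-1} at theta_hat
hess_i = 1 / theta_hat**2 - 2 * x / theta_hat**3
var_b = -1.0 / hess_i.sum()

# (c) BHHH estimator: [ sum_i g_i^2 ]^{-1} with g_i = d ln f(x_i;theta)/d theta at theta_hat
g_i = -1 / theta_hat + x / theta_hat**2
var_c = 1.0 / (g_i**2).sum()

print(var_a, var_b, var_c)                 # all close; here (a) and (b) even coincide exactly
```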
Example:
Reproduce the results from the Gauss program of Example 4.22 on p. 132 in Greene (4th ed.), or Example 17.4 on p. 482 in Greene (5th ed.), to learn the numerical method and the three asymptotic variance estimators.