Ch. 6 The Linear Model Under Ideal Conditions
The (multiple) linear model is used to study the relationship between a dependent variable (Y) and several independent variables (X1, X2, ..., Xk). That is,
\[
\begin{aligned}
Y &= f(X_1, X_2, \ldots, X_k) + \varepsilon && \text{(assume a linear function)} \\
  &= \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon \\
  &= x'\beta + \varepsilon,
\end{aligned}
\]
where Y is the dependent or explained variable, x = [X1 X2 ... Xk]′ are the independent or explanatory variables, and β = [β1 β2 ... βk]′ are unknown coefficients that we are interested in learning about, either through estimation or through hypothesis testing. The term ε is an unobservable random disturbance.
Suppose we have a sample of size T (allowing for non-random observations)¹ on the scalar dependent variable Y_t and the vector of explanatory variables x_t = (X_{t1}, X_{t2}, ..., X_{tk})′, i.e.,
\[
Y_t = x_t'\beta + \varepsilon_t, \qquad t = 1,2,\ldots,T.
\]
In matrix form, this relationship is written as
\[
y =
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_T \end{bmatrix}
=
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1k} \\
X_{21} & X_{22} & \cdots & X_{2k} \\
\vdots & \vdots &        & \vdots \\
X_{T1} & X_{T2} & \cdots & X_{Tk}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}
+
\begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_T \end{bmatrix}
=
\begin{bmatrix} x_1' \\ x_2' \\ \vdots \\ x_T' \end{bmatrix}\beta + \varepsilon
= X\beta + \varepsilon,
\]
where y is a T × 1 vector, X is a T × k matrix with rows x_t′, and ε is a T × 1 vector with elements ε_t.
¹Recall from Chapter 2 that we cannot postulate the probability model Φ if the sample is non-random; the probability model must then be defined in terms of the sample's joint distribution.
Our goal is to regard the last equation as a parametric probability and sampling model, and to make inferences about the unknown β_i's and the parameters governing ε.
1 The Probability Model: Gauss Linear Model
Assume that ε ~ N(0, Σ). If X is not stochastic, then by the results on "functions of random variables" (an n → n transformation) we have y ~ N(Xβ, Σ). That is, we have specified a probability and sampling model for y:
(Probability and Sampling Model)
\[
y \sim N\!\left(
\begin{bmatrix}
X_{11} & X_{12} & \cdots & X_{1k} \\
X_{21} & X_{22} & \cdots & X_{2k} \\
\vdots & \vdots &        & \vdots \\
X_{T1} & X_{T2} & \cdots & X_{Tk}
\end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix},\;
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1T} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2T} \\
\vdots & \vdots &        & \vdots \\
\sigma_{T1} & \sigma_{T2} & \cdots & \sigma_T^2
\end{bmatrix}
\right)
\equiv N(X\beta, \Sigma).
\]
That is, the sample joint density function is
\[
f(y;\theta) = (2\pi)^{-T/2}\,|\Sigma|^{-1/2}\exp\left\{-\tfrac{1}{2}(y - X\beta)'\Sigma^{-1}(y - X\beta)\right\},
\]
where θ = (β1, β2, ..., βk, σ1², σ12, ..., σT²)′. It is easily seen that the number of parameters in θ is larger than the sample size T. Therefore, some restrictions must be imposed on the probability and sampling model for estimation to be possible, as we shall see in the sequel.
One kind of restriction on θ is that Σ is a scalar matrix, Σ = σ²I. In that case, maximizing the likelihood of the sample (with respect to β) is equivalent to minimizing (y − Xβ)′(y − Xβ) = ε′ε = Σ_{t=1}^{T} ε_t², the sum of squared residuals; this constitutes the foundation of ordinary least squares estimation.
To summarize the discussion so far, we have made the following assumptions:
(a) The model y = Xβ + ε is correct (no problem of model misspecification);
(b) X is nonstochastic (hence regression came first from experimental science);
(c) E(ε) = 0 (easily satisfied by adding a constant to the regression);
(d) Var(ε) = E(εε′) = σ²·I (the disturbances have the same variance and are not autocorrelated);
(e) Rank(X) = k (for model identification);
(f) ε is normally distributed.
The above six assumptions are usually called the classical ordinary least squares assumptions, or the ideal conditions.
2 Estimation: Ordinary Least Squares Estimator
2.1 Estimation of β
Let us first consider the ordinary least squares (OLS) estimator, which is the value of β that minimizes the sum of squared errors (residuals), denoted SSE (recall the principle of estimation in Ch. 3):
\[
SSE(\beta) = (y - X\beta)'(y - X\beta) = \sum_{t=1}^{T}(Y_t - x_t'\beta)^2 = y'y - 2y'X\beta + \beta'X'X\beta.
\]
The first order conditions for a minimum are
\[
\frac{\partial SSE(\beta)}{\partial \beta} = -2X'y + 2X'X\beta = 0.
\]
If X′X is nonsingular (which is guaranteed by assumption (e) of the ideal conditions and Ch. 1, Sec. 3.5), this system of k equations in k unknowns can be uniquely solved for the ordinary least squares (OLS) estimator
\[
\hat{\beta} = (X'X)^{-1}X'y = \left[\sum_{t=1}^{T} x_t x_t'\right]^{-1}\sum_{t=1}^{T} x_t Y_t. \tag{1}
\]
To ensure that β̂ is indeed a minimizer, we require that
\[
\frac{\partial^2 SSE(\beta)}{\partial\beta\,\partial\beta'} = 2X'X
\]
be a positive definite matrix. This condition is satisfied by assumption (e) and Ch. 1, Sec. 5.6.1.
Denote by e the T × 1 vector of least squares residuals,
\[
e = y - X\hat{\beta}.
\]
Then it is obvious that
\[
X'e = X'(y - X\hat{\beta}) = X'y - X'X(X'X)^{-1}X'y = 0, \tag{2}
\]
i.e., the regressors are orthogonal to the OLS residuals. Therefore, if one of the regressors is a constant term, the sum of the residuals is zero, since the first element of X′e would be
\[
\begin{bmatrix} 1 & 1 & \cdots & 1 \end{bmatrix}
\begin{bmatrix} e_1 \\ e_2 \\ \vdots \\ e_T \end{bmatrix}
= \sum_{t=1}^{T} e_t = 0 \quad \text{(a scalar)}.
\]
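As a numerical illustration of (1) and (2), here is a minimal sketch in Python/NumPy; the data are simulated and all variable names are purely illustrative, not part of the text's example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 100, 3                        # sample size and number of regressors (illustrative)
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])  # first column: constant
beta = np.array([1.0, 2.0, -0.5])    # "true" coefficients used only for the simulation
y = X @ beta + rng.normal(size=T)

# OLS estimator from the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Residuals and the orthogonality condition X'e = 0 of equation (2)
e = y - X @ beta_hat
print(beta_hat)
print(X.T @ e)     # numerically zero
print(e.sum())     # zero because a constant is included
```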
2.2 Estimation of σ²
At this point we have the following notation:
\[
y = X\beta + \varepsilon = X\hat{\beta} + e.
\]
To estimate σ², the variance of ε, a simple and intuitive idea is to use the information in the sample residuals e.
Lemma:
The matrix M_X = I − X(X′X)⁻¹X′ is symmetric and idempotent. Furthermore, M_X X = 0.
Lemma:
e = M_X y = M_X ε. That is, we can interpret M_X as a matrix that produces the vector of least squares residuals in the regression of y on X.
Proof:
\[
\begin{aligned}
e &= y - X\hat{\beta} \\
  &= y - X(X'X)^{-1}X'y \\
  &= (I - X(X'X)^{-1}X')y \\
  &= M_X y \\
  &= M_X X\beta + M_X\varepsilon \\
  &= M_X\varepsilon.
\end{aligned}
\]
Using the fact that M_X is symmetric and idempotent, we have
Lemma:
e′e = ε′M_X′M_Xε = ε′M_Xε.
Theorem 1:
E(e′e) = σ²(T − k).
Proof:
\[
\begin{aligned}
E(e'e) &= E(\varepsilon'M_X\varepsilon) \\
&= E[\operatorname{trace}(\varepsilon'M_X\varepsilon)] && (\varepsilon'M_X\varepsilon \text{ is a scalar, hence equals its trace}) \\
&= E[\operatorname{trace}(M_X\varepsilon\varepsilon')] \\
&= \operatorname{trace}\,E(M_X\varepsilon\varepsilon') && \text{(why?)} \\
&= \operatorname{trace}(M_X\sigma^2 I_T) \\
&= \sigma^2\operatorname{trace}(M_X),
\end{aligned}
\]
but
\[
\operatorname{trace}(M_X) = \operatorname{trace}(I_T) - \operatorname{trace}(X(X'X)^{-1}X')
= \operatorname{trace}(I_T) - \operatorname{trace}((X'X)^{-1}X'X) = T - k.
\]
Corollary:
An unbiased estimator of σ² is
\[
s^2 = \frac{e'e}{T-k}.
\]
Exercise:
Reproduce the estimation results in Table 4.2, p. 52, for β̂, s²(X′X)⁻¹, and e′e.
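Since Table 4.2 is not reproduced here, the following sketch only shows how the quantities named in the exercise would be computed on simulated data (names illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

sse = e @ e                                   # e'e, the sum of squared residuals
s2 = sse / (T - k)                            # unbiased estimator of sigma^2
cov_beta_hat = s2 * np.linalg.inv(X.T @ X)    # s^2 (X'X)^{-1}
print(sse, s2)
print(np.sqrt(np.diag(cov_beta_hat)))         # coefficient standard errors
```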
2.3 Partitioned Regression Estimation
It is common to specify a multiple regression model when, in fact, interest centers on only one or a subset of the full set of variables. Let k₁ + k₂ = k; we can express the OLS result in partitioned form as
\[
y = X\hat{\beta} + e
= \begin{bmatrix} X_1 & X_2 \end{bmatrix}
\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} + e
= X_1\hat{\beta}_1 + X_2\hat{\beta}_2 + e,
\]
where X₁ and X₂ are T × k₁ and T × k₂, respectively, and β̂₁ and β̂₂ are k₁ × 1 and k₂ × 1, respectively.
What is the algebraic solution for β̂₂? Denote M₁ = I − X₁(X₁′X₁)⁻¹X₁′; then
\[
M_1 y = M_1X_1\hat{\beta}_1 + M_1X_2\hat{\beta}_2 + M_1 e = M_1X_2\hat{\beta}_2 + e,
\]
using the fact that M₁X₁ = 0 and M₁e = e. Premultiplying the above equation by X₂′ and using the fact that
\[
X'e = \begin{bmatrix} X_1' \\ X_2' \end{bmatrix} e
= \begin{bmatrix} X_1'e \\ X_2'e \end{bmatrix} = 0,
\]
we have
\[
X_2'M_1 y = X_2'M_1X_2\hat{\beta}_2 + X_2'e = X_2'M_1X_2\hat{\beta}_2.
\]
Therefore β̂₂ can be expressed in isolation as
\[
\hat{\beta}_2 = (X_2'M_1X_2)^{-1}X_2'M_1 y = (\tilde{X}_2'\tilde{X}_2)^{-1}\tilde{X}_2'\tilde{y},
\]
where
\[
\tilde{X}_2 = M_1X_2 \quad \text{and} \quad \tilde{y} = M_1 y
\]
are the residuals from the regressions of X₂ and y on X₁, respectively.
Theorem 2 (Frisch-Waugh):
The subvector β̂₂ is the set of coefficients obtained when the residuals from a regression of y on X₁ alone are regressed on the set of residuals obtained when each column of X₂ is regressed on X₁.
Example:
Consider a simple regression with a constant; the slope estimator can then also be obtained from a regression on demeaned data without a constant.
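A minimal numerical sketch of the Frisch-Waugh result on simulated data (variable names are illustrative): the coefficients on X₂ from the full regression coincide with those obtained by regressing the M₁-residuals of y on the M₁-residuals of X₂.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
X1 = np.column_stack([np.ones(T), rng.normal(size=T)])   # T x k1, includes a constant
X2 = rng.normal(size=(T, 2))                              # T x k2
y = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([2.0, -1.0]) + rng.normal(size=T)

# Full regression of y on [X1 X2]
X = np.hstack([X1, X2])
b_full = np.linalg.solve(X.T @ X, X.T @ y)
b2_full = b_full[X1.shape[1]:]                # coefficients on X2

# Frisch-Waugh: residualize y and X2 on X1, then regress residuals on residuals
M1 = np.eye(T) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
y_tilde = M1 @ y
X2_tilde = M1 @ X2
b2_fw = np.linalg.solve(X2_tilde.T @ X2_tilde, X2_tilde.T @ y_tilde)

print(np.allclose(b2_full, b2_fw))            # True
```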
2.4 The Restricted Least Squares Estimators
Suppose that we explicitly impose the restrictions of a hypothesis in the regression (as in the LM test, for example). The restricted least squares estimator is obtained as the solution to
\[
\min_{\beta}\; SSE(\beta) = (y - X\beta)'(y - X\beta) \quad \text{subject to } R\beta = q,
\]
where R is a known J × k matrix and q is the vector of values of these linear restrictions.
A Lagrangean function for this problem can be written as
\[
L^*(\beta,\lambda) = (y - X\beta)'(y - X\beta) + 2\lambda'(R\beta - q), \qquad \text{where } \lambda \text{ is } J \times 1.
\]
The solutions β̂* and λ̂ will satisfy the necessary conditions
\[
\frac{\partial L^*}{\partial \hat{\beta}_*} = -2X'(y - X\hat{\beta}_*) + 2R'\hat{\lambda} = 0,
\]
\[
\frac{\partial L^*}{\partial \hat{\lambda}} = 2(R\hat{\beta}_* - q) = 0
\qquad \left(\text{remember } \frac{\partial a'x}{\partial x} = a\right).
\]
Dividing through by 2 and expanding terms produces the partitioned matrix equation
\[
\begin{bmatrix} X'X & R' \\ R & 0 \end{bmatrix}
\begin{bmatrix} \hat{\beta}_* \\ \hat{\lambda} \end{bmatrix}
= \begin{bmatrix} X'y \\ q \end{bmatrix},
\]
or
\[
W d^* = v.
\]
Assuming that the partitioned matrix W is nonsingular,
\[
d^* = W^{-1}v.
\]
Using the partitioned inverse rule,
\[
\begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}^{-1}
= \begin{bmatrix}
A_{11}^{-1}(I + A_{12}F_2A_{21}A_{11}^{-1}) & -A_{11}^{-1}A_{12}F_2 \\
-F_2A_{21}A_{11}^{-1} & F_2
\end{bmatrix},
\qquad F_2 = (A_{22} - A_{21}A_{11}^{-1}A_{12})^{-1},
\]
we have the restricted least squares estimator
\[
\hat{\beta}_* = \hat{\beta} - (X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q),
\]
and
\[
\hat{\lambda} = [R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q).
\]
Exercise:
Show that Var(β̂*) − Var(β̂) is a nonpositive definite matrix.
The above result holds whether or not the restrictions are true. One way to interpret this reduction in variance is as the value of the information contained in the restrictions. See Table 6.2 at p. 103.
Let e* = y − Xβ̂*, i.e., the residual vector from the restricted least squares estimator. Then, using the familiar device,
\[
e_* = y - X\hat{\beta} - X(\hat{\beta}_* - \hat{\beta}) = e - X(\hat{\beta}_* - \hat{\beta}).
\]
The 'restricted' sum of squared residuals is
\[
e_*'e_* = e'e + (\hat{\beta}_* - \hat{\beta})'X'X(\hat{\beta}_* - \hat{\beta}) \ge e'e \tag{3}
\]
since X′X is a positive definite matrix.
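The restricted estimator β̂* and the inequality (3) can be checked numerically. The sketch below imposes an illustrative restriction Rβ = q on simulated data (the restriction and all names are assumptions for the illustration only):

```python
import numpy as np

rng = np.random.default_rng(2)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                        # unrestricted OLS

# Illustrative restriction: beta_2 + beta_3 = 1, i.e. R beta = q
R = np.array([[0.0, 1.0, 1.0]])
q = np.array([1.0])

A = XtX_inv @ R.T @ np.linalg.inv(R @ XtX_inv @ R.T)
b_star = b - A @ (R @ b - q)                 # restricted least squares estimator

e = y - X @ b
e_star = y - X @ b_star
print(R @ b_star - q)                        # zero: the restriction holds exactly
print(e_star @ e_star >= e @ e)              # True, as in (3)
```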
2.5 Measurement of Goodness of Fit
Denote the dependent variable's 'fitted values', computed from the explanatory variables and the OLS estimator, by ŷ = Xβ̂, so that y = ŷ + e.
Lemma:
e′e = y′y − ŷ′ŷ.
Proof:
Using the fact that X′y = X′Xβ̂, we have
\[
e'e = y'y - 2\hat{\beta}'X'y + \hat{\beta}'X'X\hat{\beta} = y'y - \hat{y}'\hat{y}.
\]
Three measures of variation are defined as follows:
\[
\text{(a) } SST \text{ (total variation)} = \sum_{t=1}^{T}(Y_t - \bar{Y})^2 = y'M^0y,
\]
\[
\text{(b) } SSR \text{ (regression variation)} = \sum_{t=1}^{T}(\hat{Y}_t - \bar{\hat{Y}})^2 = \hat{y}'M^0\hat{y},
\]
\[
\text{(c) } SSE \text{ (error variation)} = \sum_{t=1}^{T}(Y_t - \hat{Y}_t)^2 = e'e,
\]
where $\bar{Y} = \frac{1}{T}\sum_{t=1}^{T} Y_t$, $\bar{\hat{Y}} = \frac{1}{T}\sum_{t=1}^{T}\hat{Y}_t$, and $M^0 = I_T - \frac{1}{T}ii'$ is the matrix that transforms observations into deviations from their sample means (i being a column of ones).
Lemma:
If one of the regressors is a constant, then $\bar{Y} = \bar{\hat{Y}}$.
Proof:
Writing
\[
y = \hat{y} + e = X\hat{\beta} + e
= \begin{bmatrix} i & X_2 \end{bmatrix}
\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} + e
= i\hat{\beta}_1 + X_2\hat{\beta}_2 + e,
\]
where i is a column of ones, and using the fact that i′e = 0, we obtain the result.
Lemma:
If one of the regressors is a constant, then SST = SSR + SSE.
Proof:
Premultiplying y = ŷ + e by M⁰, we have
\[
M^0y = M^0\hat{y} + M^0e = M^0\hat{y} + e, \quad \text{since } M^0e = e \text{ (why?)}.
\]
Therefore,
\[
y'M^0y = \hat{y}'M^0\hat{y} + 2\hat{y}'M^0e + e'e = \hat{y}'M^0\hat{y} + e'e = SSR + SSE,
\]
using the fact that $\hat{y}'M^0e = \hat{\beta}'X'M^0e = \hat{\beta}'X'e = 0$.
Definition:
If one of the regressors is a constant, the coefficient of determination is defined as
\[
R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}.
\]
From (3) we know that e*′e* ≥ e′e. One kind of restriction is of the form Rβ = 0, which we may think of as a model with fewer regressors (but with the same dependent variable). It is apparent that the coefficient of determination from this restricted model, say R²*, is smaller. (Thus the R² in the longer regression cannot be smaller.) It is tempting to exploit this result by simply adding variables to the model; R² will continue to rise toward its limit. In view of this result, we sometimes report an adjusted R², computed as
\[
\bar{R}^2 = 1 - \frac{e'e/(T-k)}{y'M^0y/(T-1)}.
\]
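A short sketch of these goodness-of-fit measures on simulated data (the demeaning is done directly rather than through M⁰; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)

b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b

sst = np.sum((y - y.mean()) ** 2)          # y'M0y
sse = e @ e                                # e'e
r2 = 1.0 - sse / sst                       # coefficient of determination
r2_adj = 1.0 - (sse / (T - k)) / (sst / (T - 1))   # adjusted R^2
print(r2, r2_adj)
```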
3 Statistical Properties of the OLS Estimators
We now investigate the statistical properties of the OLS estimators β̂ and s².
3.1 Finite Sample Properties
3.1.1 Unbiasedness
Based on the six classical assumptions, the expected values of β̂ and s² are
\[
E(\hat{\beta}) = E[(X'X)^{-1}X'y] = E[(X'X)^{-1}X'(X\beta + \varepsilon)]
= E[\beta + (X'X)^{-1}X'\varepsilon]
= \beta + (X'X)^{-1}X'E(\varepsilon) = \beta
\]
(using assumptions (b) and (c)), and, by construction,
\[
E(s^2) = E\!\left(\frac{e'e}{T-k}\right) = \frac{(T-k)\sigma^2}{T-k} = \sigma^2.
\]
Therefore both β̂ and s² are unbiased estimators.
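Unbiasedness can be illustrated by a small Monte Carlo experiment with a fixed (nonstochastic) design matrix; this is only a sketch, and the number of replications is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)
T, k, sigma = 50, 3, 1.5
beta = np.array([1.0, 2.0, -0.5])
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])  # held fixed across replications

n_rep = 10_000
betas = np.empty((n_rep, k))
s2s = np.empty(n_rep)
for r in range(n_rep):
    y = X @ beta + rng.normal(scale=sigma, size=T)   # new disturbances each replication
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    betas[r] = b
    s2s[r] = e @ e / (T - k)

print(betas.mean(axis=0))   # close to beta
print(s2s.mean())           # close to sigma^2 = 2.25
```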
3.1.2 Efficiency
To investigate the efficiency of these two estimators, we first derive their variances. The variance-covariance matrix of β̂ is
\[
\begin{aligned}
\operatorname{Var}(\hat{\beta}) &= E[(\hat{\beta}-\beta)(\hat{\beta}-\beta)'] \\
&= E[(X'X)^{-1}X'\varepsilon\varepsilon'X(X'X)^{-1}] \\
&= (X'X)^{-1}X'E[\varepsilon\varepsilon']X(X'X)^{-1} \\
&= (X'X)^{-1}X'(\sigma^2 I)X(X'X)^{-1} \\
&= \sigma^2(X'X)^{-1}
\end{aligned}
\]
(using assumptions (b) and (d)).
With assumption (f) and the results on idempotent quadratic forms, we have
\[
\frac{e'e}{\sigma^2} = \frac{\varepsilon'M_X\varepsilon}{\sigma^2} \sim \chi^2_{T-k},
\]
that is,
\[
\operatorname{Var}\!\left(\frac{e'e}{\sigma^2}\right) = 2(T-k),
\]
or Var(e′e) = 2(T − k)σ⁴. The variance of s² (= e′e/(T − k)) is therefore
\[
\operatorname{Var}\!\left(\frac{e'e}{T-k}\right) = \frac{2\sigma^4}{T-k}.
\]
Theorem (Gauss-Markov):
The OLS estimator β̂ is the best linear unbiased estimator (BLUE) of β.
Proof:
Consider any estimator linear in y, say β̃ = Cy, and write C = (X′X)⁻¹X′ + D. Then
\[
E(\tilde{\beta}) = E[((X'X)^{-1}X' + D)(X\beta + \varepsilon)] = \beta + DX\beta,
\]
so that for β̃ to be unbiased we require DX = 0. Then the covariance matrix of β̃ is
\[
\begin{aligned}
E[(\tilde{\beta}-\beta)(\tilde{\beta}-\beta)']
&= E\{[(X'X)^{-1}X' + D]\varepsilon\varepsilon'[X(X'X)^{-1} + D']\} \\
&= \sigma^2[(X'X)^{-1}X'IX(X'X)^{-1} + DIX(X'X)^{-1} + (X'X)^{-1}X'ID' + DID'] \\
&= \sigma^2(X'X)^{-1} + \sigma^2 DD', \quad \text{since } DX = 0.
\end{aligned}
\]
Since DD′ is a positive semidefinite matrix (see Ch. 1, p. 20), the covariance matrix of β̃ equals the covariance matrix of β̂ plus a positive semidefinite matrix. Hence β̂ is efficient relative to any other linear unbiased estimator of β.
In fact, we can go a step further in discussing the efficiency of β̂.
Theorem:
Let the linear regression y = Xβ + ε satisfy the classical assumptions. Then the Cramér-Rao lower bounds for unbiased estimators of β and σ² are σ²(X′X)⁻¹ and 2σ⁴/T, respectively.
Proof:
The log-likelihood is
\[
\ln L(\beta,\sigma^2;y) = -\frac{T}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).
\]
Therefore,
\[
\frac{\partial \ln L}{\partial \beta} = \frac{1}{\sigma^2}(X'y - X'X\beta) = \frac{1}{\sigma^2}X'(y - X\beta),
\qquad
\frac{\partial \ln L}{\partial \sigma^2} = -\frac{T}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta),
\]
\[
\frac{\partial^2 \ln L}{\partial\beta\,\partial\beta'} = -\frac{1}{\sigma^2}X'X; \qquad
-E\!\left[\frac{\partial^2 \ln L}{\partial\beta\,\partial\beta'}\right] = \frac{X'X}{\sigma^2};
\]
\[
\frac{\partial^2 \ln L}{\partial(\sigma^2)^2} = \frac{T}{2\sigma^4} - \frac{1}{\sigma^6}(y - X\beta)'(y - X\beta); \qquad
-E\!\left[\frac{\partial^2 \ln L}{\partial(\sigma^2)^2}\right] = \frac{T}{2\sigma^4} \quad \text{(how?)};
\]
\[
\frac{\partial^2 \ln L}{\partial\beta\,\partial\sigma^2} = -\frac{1}{\sigma^4}X'(y - X\beta); \qquad
-E\!\left[\frac{\partial^2 \ln L}{\partial\beta\,\partial\sigma^2}\right] = 0.
\]
Therefore, the information matrix is
\[
I_T(\beta,\sigma^2) =
\begin{bmatrix} \dfrac{X'X}{\sigma^2} & 0 \\ 0 & \dfrac{T}{2\sigma^4} \end{bmatrix},
\]
and, in turn, the Cramér-Rao lower bounds for unbiased estimators of β and σ² are σ²(X′X)⁻¹ and 2σ⁴/T.
From the above theorem, the OLS estimator β̂ attains the Cramér-Rao bound and is therefore an absolutely efficient estimator, while s² is not. However, it can be shown that s² is indeed minimum variance unbiased via the alternative approach of complete, sufficient statistics; see, for example, Schmidt (1976), p. 14.
3.1.3 Exact Distribution of β̂ and s²
We now investigate the finite sample distributions of the OLS estimators.
Theorem:
β̂ has a multivariate normal distribution with mean β and covariance matrix σ²(X′X)⁻¹.
Proof:
By assumptions (c), (d) and (f), we know that ε ~ N(0, σ²I). Therefore, by the results on linear functions of a normal vector (Ch. 2, p. 27), we have
\[
\beta + (X'X)^{-1}X'\varepsilon \sim N\!\left(\beta,\;(X'X)^{-1}X'\sigma^2 IX(X'X)^{-1}\right),
\]
or
\[
\hat{\beta} \sim N\!\left(\beta,\;\sigma^2(X'X)^{-1}\right).
\]
Theorem:
s² is distributed as a χ² random variable multiplied by a constant:
\[
s^2 \sim \frac{\sigma^2}{T-k}\,\chi^2_{T-k}.
\]
Proof:
Since we have shown that e′e/σ² ~ χ²_{T−k}, the result follows immediately.
3.1.4 Independence of β̂ and s²
Lemma:
Let Q be a symmetric, idempotent T × T matrix, B an m × T matrix such that BQ = 0, and ε ~ N(0, σ²I_T). Then Bε and ε′Qε are distributed independently.
Proof:
See Section 7.2.4 of Chapter 2.
Theorem:
β̂ and s² are independent.
Proof:
s² = ε′M_Xε/(T − k) and β̂ − β = (X′X)⁻¹X′ε. Since (X′X)⁻¹X′M_X = 0, the above lemma implies that β̂ and s² are independent.
3.2 Asymptotic Properties
3.2.1 Consistency
We now investigate the properties of the OLS estimators as the sample size goes to infinity, T → ∞.
Theorem:
The OLS estimator β̂ is consistent.
Proof:
Denote lim_{T→∞}(X′X/T) = lim_{T→∞}(1/T)Σ_{t=1}^{T} x_t x_t′ by Q, and assume that Q is finite and nonsingular. (What does this mean?) Then lim_{T→∞}(X′X/T)⁻¹ is also finite. Therefore
\[
\lim_{T\to\infty}(X'X)^{-1}
= \lim_{T\to\infty}\frac{1}{T}\left(\frac{X'X}{T}\right)^{-1}
= \lim_{T\to\infty}\frac{1}{T}Q^{-1} = 0.
\]
Since β̂ is unbiased and its covariance matrix, σ²(X′X)⁻¹, vanishes asymptotically, it converges in probability to β and is therefore consistent.
Alternative proof:
Note that
\[
\hat{\beta} = (X'X)^{-1}X'y = \beta + (X'X)^{-1}X'\varepsilon
= \beta + \left(\frac{X'X}{T}\right)^{-1}\frac{X'\varepsilon}{T}.
\]
Observe that E(X′ε/T) = 0. Also,
\[
E\!\left(\frac{X'\varepsilon}{T}\right)\!\left(\frac{X'\varepsilon}{T}\right)'
= \frac{\sigma^2}{T}\left(\frac{X'X}{T}\right),
\]
so that
\[
\lim_{T\to\infty} E\!\left(\frac{X'\varepsilon}{T}\right)\!\left(\frac{X'\varepsilon}{T}\right)'
= \lim_{T\to\infty}\frac{\sigma^2}{T}Q = 0.
\]
But the facts that E(X′ε/T) = 0 and lim_{T→∞} E(X′ε/T)(X′ε/T)′ = 0 imply that plim X′ε/T = 0. Therefore
\[
\operatorname{plim}\hat{\beta} = \beta + Q^{-1}\operatorname{plim}\frac{X'\varepsilon}{T} = \beta.
\]
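Consistency can be visualized by re-estimating β on growing samples; the following sketch uses one simulated path with a fixed true β (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
beta = np.array([1.0, 2.0, -0.5])
T_max = 100_000
X_all = np.column_stack([np.ones(T_max), rng.normal(size=(T_max, 2))])
y_all = X_all @ beta + rng.normal(size=T_max)

for T in (100, 1_000, 10_000, 100_000):
    X, y = X_all[:T], y_all[:T]
    b = np.linalg.solve(X.T @ X, X.T @ y)
    print(T, np.max(np.abs(b - beta)))   # the maximum deviation shrinks as T grows
```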
3.2.2 Asymptotic Normality
Since, by assumption, X′X is O_p(T), we have (X′X)⁻¹ → 0. To express the limiting distribution of β̂, we need the following theorem.
Theorem:
The asymptotic distribution of √T(β̂ − β) is N(0, σ²Q⁻¹), where Q = lim_{T→∞}(X′X/T).
Proof:
For any sample size T, the distribution of √T(β̂ − β) is N(0, σ²(X′X/T)⁻¹). The limiting result is therefore immediate.
Theorem:
The asymptotic distribution of √T(s² − σ²) is N(0, 2σ⁴).
Proof:
The distribution of e′e/σ² is χ² with (T − k) degrees of freedom. Therefore
\[
\frac{e'e}{\sigma^2} = \sum_{t=1}^{T-k} v_t^2,
\]
where the v_t² are i.i.d. χ² variables with one degree of freedom. This is a sum of i.i.d. terms with mean 1 and variance 2; according to the Lindeberg-Lévy central limit theorem, it follows that
\[
\frac{1}{\sqrt{T-k}}\sum_{t=1}^{T-k}\left(\frac{v_t^2 - 1}{\sqrt{2}}\right) \xrightarrow{L} N(0,1).
\]
But this is equivalent to saying that
\[
\frac{1}{\sqrt{T-k}}\left(\frac{e'e}{\sigma^2} - (T-k)\right) \xrightarrow{L} N(0,2),
\]
or that
\[
\sqrt{T-k}\,(s^2 - \sigma^2) \xrightarrow{L} N(0, 2\sigma^4),
\]
or that
\[
\sqrt{T}\,(s^2 - \sigma^2) \xrightarrow{L} N(0, 2\sigma^4).
\]
From the above results we find that, although the variance of s² does not attain the Cramér-Rao lower bound in finite samples, it does so asymptotically.
Theorem:
s² is asymptotically efficient.
Proof:
The asymptotic variance of s² is 2σ⁴/T, which equals the Cramér-Rao lower bound.
4 Hypothesis Testing
4.1 Tests of a Single Linear Restriction on β: Tests Based on the t Distribution
Lemma:
Let R be a 1 × k vector, and define s* by
\[
s_* = \sqrt{s^2 R(X'X)^{-1}R'};
\]
then R(β̂ − β)/s* has a t distribution with (T − k) degrees of freedom.
Proof:
Clearly R(β̂ − β) is a scalar random variable with zero mean and variance σ²R(X′X)⁻¹R′; call this variance σ*². Then R(β̂ − β)/σ* ~ N(0,1), but this statistic is not a pivot since it contains the unknown parameter σ. We need some transformation of this statistic to remove the parameter.
We know that (T − k)s²/σ² ~ χ²_{T−k}; therefore
\[
\frac{s_*^2}{\sigma_*^2} = \frac{s^2}{\sigma^2} \sim \chi^2_{T-k}/(T-k).
\]
Finally, then,
\[
\frac{R(\hat{\beta}-\beta)}{s_*}
= \frac{R(\hat{\beta}-\beta)/\sigma_*}{\sqrt{s_*^2/\sigma_*^2}}
= \frac{N(0,1)}{\sqrt{\chi^2_{T-k}/(T-k)}} \sim t_{T-k}.
\]
The above result relies on the numerator and denominator being independent; this condition was shown to hold in Section 3.1.4 of this chapter.
Theorem (Test of a single linear restriction on β):
Let R be a known 1 × k vector and r a known scalar. Then, under the null hypothesis that Rβ = r, the test statistic
\[
\frac{R\hat{\beta} - r}{s_*} \sim t_{T-k}.
\]
Proof:
Under the null hypothesis,
\[
\frac{R\hat{\beta} - r}{s_*} = \frac{R\hat{\beta} - R\beta}{s_*} \sim t_{T-k}.
\]
Corollary (Test of significance of βᵢ):
Let
\[
s_{\hat{\beta}_i} = \sqrt{s^2\,(X'X)^{-1}_{ii}};
\]
then, under the null hypothesis that βᵢ = 0, the test statistic
\[
\frac{\hat{\beta}_i}{s_{\hat{\beta}_i}} \sim t_{T-k}.
\]
Proof:
This is a special case of the last theorem, with r = 0 and R a vector of zeros except for a one in the i-th position.
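A sketch of the t test of H₀: βᵢ = 0 on simulated data, using SciPy only for the t distribution tail probability (the data and the choice of i are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=T)   # last coefficient is truly zero

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (T - k)

i = 2                                          # test H0: beta_i = 0
t_stat = b[i] / np.sqrt(s2 * XtX_inv[i, i])    # beta_hat_i / its standard error
p_value = 2 * stats.t.sf(np.abs(t_stat), df=T - k)
print(t_stat, p_value)
```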
Example:
Reproduce all the results in Greene (5th ed.), p. 103, Table 6.2.
4.2 Tests of Several Linear Restrictions on β: Tests Based on the F Distribution
Theorem:
Let R be a known matrix of dimension m × k with rank m, and q a known m × 1 vector. Then, under the null hypothesis that Rβ = q, the statistic
\[
\frac{(R\hat{\beta} - q)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q)/m}{e'e/(T-k)}
= \frac{(R\hat{\beta} - q)'[s^2 R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q)}{m}
\sim F_{m,\,T-k}.
\]
Proof:
From the results on linear functions of a normal vector, we have
\[
R\hat{\beta} \sim N\!\left(R\beta,\;\sigma^2 R(X'X)^{-1}R'\right).
\]
Further, by the results on quadratic forms in a normal vector (Sec. 6.2.2 of Ch. 2), we have
\[
(R\hat{\beta} - R\beta)'[\sigma^2 R(X'X)^{-1}R']^{-1}(R\hat{\beta} - R\beta) \sim \chi^2_m.
\]
Then, under the null hypothesis that Rβ = q,
\[
(R\hat{\beta} - q)'[\sigma^2 R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q) \sim \chi^2_m. \tag{4}
\]
However, the statistic in (4) is not a pivot since it contains the unknown parameter σ². We need some transformation of this statistic to remove the parameter, as in the single-restriction test.
Recall that (T − k)s²/σ² = e′e/σ² ~ χ²_{T−k}. Forming the ratio below removes the unknown parameter σ² from (4):
\[
\frac{(R\hat{\beta} - q)'[\sigma^2 R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q)/m}{[(T-k)s^2/\sigma^2]/(T-k)}
= \frac{(R\hat{\beta} - q)'[s^2 R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q)}{m} \tag{5}
\]
\[
\equiv
\frac{(R\hat{\beta} - q)'[\sigma^2 R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q)/m}{(e'e/\sigma^2)/(T-k)}
= \frac{(R\hat{\beta} - q)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q)/m}{e'e/(T-k)}, \tag{6}
\]
whose numerator and denominator are distributed as χ²_m/m and χ²_{T−k}/(T − k), respectively. The statistics in (5) and (6) are therefore distributed as F_{m,T−k} provided the two χ² variables are independent; this is indeed the case, as can be proven along the same lines as in the single-restriction test.
Exercise:
Show that the two χ² variables in the last theorem are independent.
4.2.1 Tests of Several Linear Restrictions on β from the Restricted Least Squares Estimator
Recall that β̂* = β̂ − (X′X)⁻¹R′[R(X′X)⁻¹R′]⁻¹(Rβ̂ − q) and e*′e* = e′e + (β̂* − β̂)′X′X(β̂* − β̂), where β̂* and e* are the estimator and residuals from the restricted least squares problem. We find that
\[
\begin{aligned}
e_*'e_* - e'e &= (\hat{\beta}_* - \hat{\beta})'X'X(\hat{\beta}_* - \hat{\beta}) \\
&= (R\hat{\beta} - q)'[R(X'X)^{-1}R']^{-1}R(X'X)^{-1}X'X(X'X)^{-1}R'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q) \\
&= (R\hat{\beta} - q)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q).
\end{aligned}
\]
Therefore, under the null hypothesis that Rβ = q, we have a third statistic, from (6), that is also distributed as F_{m,T−k}:
\[
\frac{(R\hat{\beta} - q)'[R(X'X)^{-1}R']^{-1}(R\hat{\beta} - q)/m}{e'e/(T-k)}
= \frac{(e_*'e_* - e'e)/m}{e'e/(T-k)} \sim F_{m,\,T-k}, \tag{7}
\]
and a fourth statistic, from (7),
\[
\frac{\left(\dfrac{e_*'e_*}{y'M^0y} - \dfrac{e'e}{y'M^0y}\right)\!\Big/ m}{\left(\dfrac{e'e}{y'M^0y}\right)\!\Big/(T-k)}
= \frac{(R^2 - R_*^2)/m}{(1 - R^2)/(T-k)} \sim F_{m,\,T-k}, \tag{8}
\]
where R²* is the R-squared from the restricted estimation.
Corollary (Test of the significance of a regression):
If all the slope coefficients (i.e., all coefficients except the constant term) are zero, then R is (k − 1) × k (so m = k − 1) and q = 0. In this case R²* = 0. The statistic for testing the significance of the regression, H₀: Rβ = 0, therefore follows from (8): under the null hypothesis,
\[
\frac{R^2/(k-1)}{(1 - R^2)/(T-k)} \sim F_{k-1,\,T-k}.
\]
Exercise:
Use each of the above four F-ratios to compute the test statistic F = 109.84 in Greene (5th ed.), p. 99.
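A sketch computing the F statistic in the equivalent forms (6), (7), and (8) on simulated data; the restriction below is purely illustrative, not the one behind F = 109.84:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b

# H0: both slope coefficients are zero (m = 2)
R = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
q = np.zeros(2)
m = R.shape[0]
d = R @ b - q

# Form (6): Wald-type quadratic form
F_wald = (d @ np.linalg.solve(R @ XtX_inv @ R.T, d) / m) / (e @ e / (T - k))

# Form (7): restricted vs. unrestricted sums of squared residuals
b_star = b - XtX_inv @ R.T @ np.linalg.solve(R @ XtX_inv @ R.T, d)
e_star = y - X @ b_star
F_sse = ((e_star @ e_star - e @ e) / m) / (e @ e / (T - k))

# Form (8): in terms of R^2 (here R^2_* = 0 because only the constant remains)
sst = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - (e @ e) / sst
r2_star = 1.0 - (e_star @ e_star) / sst
F_r2 = ((r2 - r2_star) / m) / ((1.0 - r2) / (T - k))

print(F_wald, F_sse, F_r2)              # numerically identical
print(stats.f.sf(F_wald, m, T - k))     # p-value
```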
4.2.2 Tests of Structural Change
One of the more common applications of the F tests is in tests of structural
change. In specifying a regression model, we assume that its assumptions apply
to all the observations in our sample. It is straightforward, however, to test the
hypothesis that some or all of the regression coefficients are different in different
subsets of the data.
Theorem (Chow Test):
Suppose that one has T1 observations on a regression equation
y1 = X1β1 + ε1,
and T2 observations on another regression equation
y2 = X2β2 + ε2.
Suppose that X1 and X2 are made up of k regressors. Let SSE1 denote the
sum of squared errors in the regression of y1 on X1 and SSE2 denote the sum
of squared errors in the regression of y2 on X2. Finally let the ”joint regression”
equation be
\[
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
= \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}\beta
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix} \tag{9}
\]
and SSE be the sum of squared errors in the joint regression. Then, under the null hypothesis that β₁ = β₂ and that ε = [ε₁′ ε₂′]′ is distributed as N(0, σ²I_T), the statistic
\[
\frac{(SSE - SSE_1 - SSE_2)/k}{(SSE_1 + SSE_2)/(T - 2k)}
\]
is distributed as F_{k, T−2k}.
Proof:
The 'separated regression' model can be written as
\[
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
= \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}
\begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix}
+ \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \end{bmatrix}
= X\beta + \varepsilon. \tag{10}
\]
The OLS estimator of the separated model is therefore
\[
\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
= \hat{\beta} = (X'X)^{-1}X'y
= \begin{bmatrix} X_1'X_1 & 0 \\ 0 & X_2'X_2 \end{bmatrix}^{-1}
\begin{bmatrix} X_1'y_1 \\ X_2'y_2 \end{bmatrix}
= \begin{bmatrix} (X_1'X_1)^{-1}X_1'y_1 \\ (X_2'X_2)^{-1}X_2'y_2 \end{bmatrix}, \tag{11}
\]
and the residual vector is
\[
\begin{bmatrix} e_1 \\ e_2 \end{bmatrix}
= e = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
- \begin{bmatrix} X_1 & 0 \\ 0 & X_2 \end{bmatrix}
\begin{bmatrix} \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix}
= \begin{bmatrix} y_1 - X_1\hat{\beta}_1 \\ y_2 - X_2\hat{\beta}_2 \end{bmatrix}.
\]
The sum of squared residuals of the separated regression is e′e = e₁′e₁ + e₂′e₂ = SSE₁ + SSE₂, i.e., the combined sum of squared residuals from the two separate regressions, and it can be regarded as the 'error from the unrestricted model' relative to the joint regression (9). Regarding the sum of squared errors SSE from the joint regression as the error from the 'restricted' model, the result is apparent from (7).
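A sketch of the Chow test on simulated data with a deliberate break in the coefficients (all names and numbers are illustrative):

```python
import numpy as np
from scipy import stats

def sse_ols(y, X):
    """Sum of squared residuals from an OLS regression of y on X."""
    b = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ b
    return e @ e

rng = np.random.default_rng(8)
T1, T2, k = 60, 80, 2
X1 = np.column_stack([np.ones(T1), rng.normal(size=T1)])
X2 = np.column_stack([np.ones(T2), rng.normal(size=T2)])
y1 = X1 @ np.array([1.0, 2.0]) + rng.normal(size=T1)
y2 = X2 @ np.array([0.0, 3.0]) + rng.normal(size=T2)   # different coefficients: a break

sse1 = sse_ols(y1, X1)
sse2 = sse_ols(y2, X2)
sse_joint = sse_ols(np.concatenate([y1, y2]), np.vstack([X1, X2]))

T = T1 + T2
F = ((sse_joint - sse1 - sse2) / k) / ((sse1 + sse2) / (T - 2 * k))
print(F, stats.f.sf(F, k, T - 2 * k))   # large F, tiny p-value: reject beta1 = beta2
```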
5 Prediction
Let us consider a set of T₀ observations not included in the original sample of T observations. Specifically, let X₀ denote these T₀ observations on the regressors, and y₀ the corresponding observations on y.
Now let y₀ be forecast by
\[
\hat{y}_0 = X_0\hat{\beta},
\]
where β̂ = (X′X)⁻¹X′y is the OLS estimator based on the original T observations. Finally, let v₀ be the vector of forecast errors defined by
\[
v_0 = y_0 - X_0\hat{\beta} = y_0 - X_0(X'X)^{-1}X'y.
\]
Theorem:
E(v₀) = 0 and E(v₀v₀′) = σ²(I_{T₀} + X₀(X′X)⁻¹X₀′).
Proof:
\[
E(v_0) = E(y_0 - X_0\hat{\beta}) = E[X_0(\beta - \hat{\beta}) + \varepsilon_0] = 0,
\]
and
\[
\begin{aligned}
E(v_0v_0') &= E\{[X_0(\beta - \hat{\beta}) + \varepsilon_0][X_0(\beta - \hat{\beta}) + \varepsilon_0]'\} \\
&= E\{[\varepsilon_0 - X_0(X'X)^{-1}X'\varepsilon][\varepsilon_0 - X_0(X'X)^{-1}X'\varepsilon]'\} \\
&= \sigma^2 X_0(X'X)^{-1}X_0' + \sigma^2 I_{T_0} \quad (\text{since } E(\varepsilon\varepsilon_0') = 0) \\
&= \sigma^2\left(I_{T_0} + X_0(X'X)^{-1}X_0'\right).
\end{aligned}
\]
Theorem:
Suppose that we wish to predict a single value of Y₀ (T₀ = 1) associated with a 1 × k regressor vector X₀ = x₀′. Then
\[
\frac{v_0}{\sqrt{s^2\left(1 + x_0'(X'X)^{-1}x_0\right)}} \sim t_{T-k},
\]
where v₀ = Y₀ − x₀′β̂.
Proof:
Since v₀ is a linear function of a normal vector,
\[
v_0 \sim N\!\left(0,\;\sigma^2\left(I_{T_0} + X_0(X'X)^{-1}X_0'\right)\right),
\]
which here reduces to
\[
v_0 \sim N\!\left(0,\;\sigma^2\left(1 + x_0'(X'X)^{-1}x_0\right)\right).
\]
Then
\[
\frac{v_0\big/\sqrt{\sigma^2\left(1 + x_0'(X'X)^{-1}x_0\right)}}{\sqrt{[(T-k)s^2/\sigma^2]/(T-k)}}
= \frac{v_0}{\sqrt{s^2\left(1 + x_0'(X'X)^{-1}x_0\right)}} \sim t_{T-k}.
\]
Corollary:
The forecast interval for Y₀ would be formed using
\[
\text{forecast interval} = \hat{Y}_0 \pm t_{\alpha/2}\sqrt{s^2\left(1 + x_0'(X'X)^{-1}x_0\right)}.
\]
Example:
See Greene (5th ed.), p. 111.
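Since Greene's table is not reproduced here, the following sketch only shows how a point forecast and its interval would be computed for one out-of-sample observation on simulated data (x₀ and all names are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
T, k = 100, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, k - 1))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
e = y - X @ b
s2 = e @ e / (T - k)

x0 = np.array([1.0, 0.5, -1.0])                  # out-of-sample regressor values (1 x k)
y0_hat = x0 @ b                                  # point forecast
se_forecast = np.sqrt(s2 * (1.0 + x0 @ XtX_inv @ x0))
t_crit = stats.t.ppf(0.975, df=T - k)            # alpha = 0.05
print(y0_hat - t_crit * se_forecast, y0_hat + t_crit * se_forecast)
```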
5.1 Measuring the Accuracy of Forecasts
Various measures have been proposed for assessing the predictive accuracy of
forecast models. Two that are based on the residuals from the forecasts are the
root mean squared error
\[
RMSE = \sqrt{\frac{1}{T_0}\sum_i (Y_i - \hat{Y}_i)^2}
\]
and the mean absolute error
\[
MAE = \frac{1}{T_0}\sum_i |Y_i - \hat{Y}_i|,
\]
where T0 is the number of periods being forecasted.
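A minimal sketch of the two accuracy measures, given vectors of realized values and forecasts (the numbers below are illustrative only):

```python
import numpy as np

def rmse(y_actual, y_forecast):
    """Root mean squared forecast error."""
    d = np.asarray(y_actual) - np.asarray(y_forecast)
    return np.sqrt(np.mean(d ** 2))

def mae(y_actual, y_forecast):
    """Mean absolute forecast error."""
    d = np.asarray(y_actual) - np.asarray(y_forecast)
    return np.mean(np.abs(d))

y_actual = np.array([2.1, 1.8, 2.5, 3.0])
y_forecast = np.array([2.0, 2.0, 2.3, 2.8])
print(rmse(y_actual, y_forecast), mae(y_actual, y_forecast))
```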
Keep in mind, however, that the RMSE and MAE are themselves random variables. To compare predictive accuracy we therefore need a test statistic for the equality of forecast accuracy; see, for example, Diebold and Mariano (1995, JBES, p. 253).