Ch. 21 Univariate Unit Root Process
1 Introduction
Consider OLS estimation of an AR(1) process,
\[ Y_t = \rho Y_{t-1} + u_t, \]
where $u_t \sim i.i.d.(0,\sigma^2)$ and $Y_0 = 0$. The OLS estimator of $\rho$ is given by
\[ \hat{\rho}_T = \frac{\sum_{t=1}^T Y_{t-1}Y_t}{\sum_{t=1}^T Y_{t-1}^2} = \left(\sum_{t=1}^T Y_{t-1}^2\right)^{-1}\left(\sum_{t=1}^T Y_{t-1}Y_t\right), \]
and we also have
\[ (\hat{\rho}_T - \rho) = \left(\sum_{t=1}^T Y_{t-1}^2\right)^{-1}\left(\sum_{t=1}^T Y_{t-1}u_t\right). \tag{1} \]
When the true value of $\rho$ is less than 1 in absolute value, $Y_t$ (and hence $Y_{t-1}^2$) is a covariance-stationary process. Applying the LLN for a covariance-stationary process (see 9.19 of Ch. 4), we have
\[ \frac{1}{T}\sum_{t=1}^T Y_{t-1}^2 \xrightarrow{p} E\left[\frac{1}{T}\sum_{t=1}^T Y_{t-1}^2\right] = \left[\frac{T\sigma^2}{1-\rho^2}\right]\Big/\,T = \frac{\sigma^2}{1-\rho^2}. \tag{2} \]
Since $Y_{t-1}u_t$ is a martingale difference sequence with variance
\[ E(Y_{t-1}u_t)^2 = \sigma^2\cdot\frac{\sigma^2}{1-\rho^2} \]
and
\[ \frac{1}{T}\sum_{t=1}^T \left[\sigma^2\cdot\frac{\sigma^2}{1-\rho^2}\right] \to \sigma^2\cdot\frac{\sigma^2}{1-\rho^2}, \]
applying the CLT for a martingale difference sequence to the second term on the right-hand side of (1) gives
\[ \frac{1}{\sqrt{T}}\sum_{t=1}^T Y_{t-1}u_t \xrightarrow{L} N\left(0,\ \sigma^2\cdot\frac{\sigma^2}{1-\rho^2}\right). \tag{3} \]
Substituting (2) and (3) into (1), we have
\[ \sqrt{T}(\hat{\rho}_T - \rho) = \left[\frac{1}{T}\sum_{t=1}^T Y_{t-1}^2\right]^{-1}\cdot \sqrt{T}\left[\frac{1}{T}\sum_{t=1}^T Y_{t-1}u_t\right] \tag{4} \]
\[ \xrightarrow{L} \left[\frac{\sigma^2}{1-\rho^2}\right]^{-1} N\left(0,\ \sigma^2\cdot\frac{\sigma^2}{1-\rho^2}\right) \tag{5} \]
\[ \equiv N(0,\ 1-\rho^2). \tag{6} \]
(6) is not valid for the case when $\rho = 1$, however. To see this, recall that the variance of $Y_t$ when $\rho = 1$ is $t\sigma^2$, so the LLN in (2) is no longer valid: if we proceeded as before, we would obtain
\[ \frac{1}{T}\sum_{t=1}^T Y_{t-1}^2 \xrightarrow{p} E\left[\frac{1}{T}\sum_{t=1}^T Y_{t-1}^2\right] = \sigma^2\,\frac{\sum_{t=1}^T t}{T} \to \infty. \tag{7} \]
A similar argument shows that the CLT does not apply to $T^{-1/2}\sum_{t=1}^T Y_{t-1}u_t$. (Instead, $T^{-1}\sum_{t=1}^T Y_{t-1}u_t$ converges.) To obtain the limiting distribution of $(\hat{\rho}_T - \rho)$ in the unit root case, as we shall prove in the following, it turns out that we have to multiply $(\hat{\rho}_T - \rho)$ by $T$ rather than by $\sqrt{T}$:
\[ T(\hat{\rho}_T - \rho) = \left[\frac{1}{T^2}\sum_{t=1}^T Y_{t-1}^2\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^T Y_{t-1}u_t\right]. \tag{8} \]
Thus, the unit root coefficient estimator converges at a faster rate ($T$) than a coefficient estimator in a stationary regression (which converges at $\sqrt{T}$).
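The two convergence rates can be seen in a small Monte Carlo experiment. This is a sketch added to these notes (not part of them), assuming NumPy is available; the sample sizes and replication counts are arbitrary choices.

```python
# Monte Carlo sketch: under rho = 0.5 the spread of sqrt(T)*(rho_hat - rho)
# is roughly stable in T, while under rho = 1 it is T*(rho_hat - 1) that
# is roughly stable.
import numpy as np

rng = np.random.default_rng(0)

def ols_rho(y):
    # OLS slope of Y_t on Y_{t-1}, no intercept
    return (y[:-1] @ y[1:]) / (y[:-1] @ y[:-1])

def simulate(rho, T, reps=1000):
    est = np.empty(reps)
    for i in range(reps):
        u = rng.standard_normal(T)
        y = np.empty(T)
        y[0] = u[0]                  # Y_0 = 0, so Y_1 = u_1
        for t in range(1, T):
            y[t] = rho * y[t - 1] + u[t]
        est[i] = ols_rho(y)
    return est

for T in (100, 400):
    stationary = simulate(0.5, T)
    unit_root = simulate(1.0, T)
    print(T,
          np.std(np.sqrt(T) * (stationary - 0.5)),   # near sqrt(1 - rho^2)
          np.std(T * (unit_root - 1.0)))
```

Across the two sample sizes, both rescaled spreads stay of the same order, which is exactly the sense in which the rates $\sqrt{T}$ and $T$ are the right normalizations.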
2 Unit Root Asymptotic Theories
In this section, we develop tools to handle the asymptotics of unit root processes.
2.1 Random Walks and Wiener Process
Consider a random walk,
\[ Y_t = Y_{t-1} + \varepsilon_t, \]
where $Y_0 = 0$ and $\varepsilon_t$ is i.i.d. with mean zero and $Var(\varepsilon_t) = \sigma^2 < \infty$.
By repeated substitution we have
\[ Y_t = Y_{t-1} + \varepsilon_t = Y_{t-2} + \varepsilon_{t-1} + \varepsilon_t = Y_0 + \sum_{s=1}^t \varepsilon_s = \sum_{s=1}^t \varepsilon_s. \]
Before we can study the behavior of estimators based on random walks, we must understand in more detail the behavior of the random walk process itself. Thus, considering the random walk $\{Y_t\}$, we can write
\[ Y_T = \sum_{t=1}^T \varepsilon_t. \]
Rescaling, we have
\[ T^{-1/2}Y_T/\sigma = T^{-1/2}\sum_{t=1}^T \varepsilon_t/\sigma. \]
(It is important to note here that $\sigma^2$ should be read as $Var\left(T^{-1/2}\sum_{t=1}^T \varepsilon_t\right) = E\left[T^{-1}\left(\sum \varepsilon_t\right)^2\right] = \frac{T\sigma^2}{T} = \sigma^2$.) According to the Lindeberg--L\'evy CLT, we have
\[ T^{-1/2}Y_T/\sigma \xrightarrow{L} N(0,1). \]
More generally, we can construct a variable $Y_T(r)$ from the partial sum of $\varepsilon_t$:
\[ Y_T(r) = \sum_{t=1}^{[Tr]} \varepsilon_t, \]
where $0 \le r \le 1$ and $[Tr]$ denotes the largest integer that is less than or equal to $Tr$.
Applying the same rescaling, we define
\[ W_T(r) \equiv T^{-1/2}Y_T(r)/\sigma \tag{9} \]
\[ \phantom{W_T(r)} = T^{-1/2}\sum_{t=1}^{[Tr]} \varepsilon_t/\sigma. \tag{10} \]
Now
\[ W_T(r) = T^{-1/2}\left([Tr]\right)^{1/2}\left\{\left([Tr]\right)^{-1/2}\sum_{t=1}^{[Tr]} \varepsilon_t/\sigma\right\}, \]
and for a given $r$, the term in the braces $\{\cdot\}$ again obeys the CLT and converges in distribution to $N(0,1)$, whereas $T^{-1/2}([Tr])^{1/2}$ converges to $r^{1/2}$. It follows from standard arguments that $W_T(r)$ converges in distribution to $N(0,r)$.
We have written $W_T(r)$ so that it is clear that $W_T$ can be considered a function of $r$. Also, because $W_T(r)$ depends on the $\varepsilon_t$'s, it is random. Therefore, we can think of $W_T(r)$ as defining a random function of $r$, which we write $W_T(\cdot)$. Just as the CLT provides conditions ensuring that the rescaled random walk $T^{-1/2}Y_T/\sigma$ (which we can now write as $W_T(1)$) converges, as $T$ becomes large, to a well-defined limiting random variable (the standard normal), the functional central limit theorem (FCLT) provides conditions ensuring that the random function $W_T(\cdot)$ converges, as $T$ becomes large, to a well-defined limiting random function, say $W(\cdot)$. The word "functional" in Functional Central Limit Theorem appears because this limit is a function of $r$.
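The construction of $W_T(\cdot)$ from a single realized random walk can be made concrete with a short sketch (added here, not from the notes; the choices of $T$ and $\sigma$ are arbitrary):

```python
# Build the rescaled partial-sum process W_T(r) of equation (9) from one
# simulated random walk and evaluate it at a few dates r.
import numpy as np

rng = np.random.default_rng(1)
T, sigma = 1000, 2.0
eps = sigma * rng.standard_normal(T)
Y = np.cumsum(eps)                       # Y_t = eps_1 + ... + eps_t

def W_T(r):
    # W_T(r) = T^{-1/2} Y_{[Tr]} / sigma, a step function with W_T(0) = 0
    k = int(np.floor(T * r))
    return 0.0 if k == 0 else Y[k - 1] / (np.sqrt(T) * sigma)

print([round(W_T(r), 3) for r in (0.0, 0.25, 0.5, 1.0)])
```

Each simulated path of $\varepsilon_t$ produces one realization of the whole function $W_T(\cdot)$, which is the sense in which $W_T(\cdot)$ is a random function.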
Some further properties of the random walk, suitably rescaled, are the following.
Proposition:
If $Y_t$ is a random walk, then $Y_{t_4} - Y_{t_3}$ is independent of $Y_{t_2} - Y_{t_1}$ for all $t_1 < t_2 < t_3 < t_4$. Consequently, $W_T(r_4) - W_T(r_3)$ is independent of $W_T(r_2) - W_T(r_1)$ for all $[T r_i] = t_i$, $i = 1,\ldots,4$.
Proof:
Note that
\[ Y_{t_4} - Y_{t_3} = \varepsilon_{t_4} + \varepsilon_{t_4-1} + \ldots + \varepsilon_{t_3+1}, \]
\[ Y_{t_2} - Y_{t_1} = \varepsilon_{t_2} + \varepsilon_{t_2-1} + \ldots + \varepsilon_{t_1+1}. \]
Since $(\varepsilon_{t_2},\varepsilon_{t_2-1},\ldots,\varepsilon_{t_1+1})$ is independent of $(\varepsilon_{t_4},\varepsilon_{t_4-1},\ldots,\varepsilon_{t_3+1})$, it follows that $Y_{t_4}-Y_{t_3}$ and $Y_{t_2}-Y_{t_1}$ are independent.
Consequently,
\[ W_T(r_4) - W_T(r_3) = T^{-1/2}(\varepsilon_{t_4} + \varepsilon_{t_4-1} + \ldots + \varepsilon_{t_3+1})/\sigma \]
is independent of
\[ W_T(r_2) - W_T(r_1) = T^{-1/2}(\varepsilon_{t_2} + \varepsilon_{t_2-1} + \ldots + \varepsilon_{t_1+1})/\sigma. \]
Proposition:
For given $0 \le a < b \le 1$, $W_T(b) - W_T(a) \xrightarrow{L} N(0, b-a)$ as $T \to \infty$.
Proof:
By definition,
\[ W_T(b) - W_T(a) = T^{-1/2}\sum_{t=[Ta]+1}^{[Tb]} \varepsilon_t/\sigma = T^{-1/2}\left([Tb]-[Ta]\right)^{1/2} \times \left([Tb]-[Ta]\right)^{-1/2}\sum_{t=[Ta]+1}^{[Tb]} \varepsilon_t/\sigma. \]
The last term $([Tb]-[Ta])^{-1/2}\sum_{t=[Ta]+1}^{[Tb]}\varepsilon_t/\sigma \xrightarrow{L} N(0,1)$ by the CLT, and $T^{-1/2}([Tb]-[Ta])^{1/2} = \left(([Tb]-[Ta])/T\right)^{1/2} \to (b-a)^{1/2}$ as $T\to\infty$. Hence $W_T(b) - W_T(a) \xrightarrow{L} N(0, b-a)$.
In words, the random walk has independent increments, and those increments have a limiting normal distribution, with a variance reflecting the size of the interval $(b-a)$ over which the increment is taken.
It should not be surprising, therefore, that the limit of the sequence of functions $W_T(\cdot)$ constructed from the random walk preserves these properties in an appropriate sense. In fact, these properties form the basis of the definition of the Wiener process.
Definition:
Let $(S,\mathcal{F},P)$ be a complete probability space. Then $W : S \times [0,1] \to \mathbb{R}^1$ is a standard Wiener process if, for each $r \in [0,1]$, $W(\cdot,r)$ is $\mathcal{F}$-measurable, and in addition,
(1). The process starts at zero: $P[W(\cdot,0) = 0] = 1$.
(2). The increments are independent: if $0 \le a_0 \le a_1 \le \ldots \le a_k \le 1$, then $W(\cdot,a_i) - W(\cdot,a_{i-1})$ is independent of $W(\cdot,a_j) - W(\cdot,a_{j-1})$, $j = 1,\ldots,k$, $j \ne i$, for all $i = 1,\ldots,k$.
(3). The increments are normally distributed: for $0 \le a \le b \le 1$, the increment $W(\cdot,b) - W(\cdot,a)$ is distributed as $N(0, b-a)$.
In the definition, we have written $W(\cdot,a)$ for explicitness; whenever convenient, however, we will write $W(a)$ instead of $W(\cdot,a)$, analogous to our notation elsewhere. The Wiener process is also called a Brownian motion; Norbert Wiener (1924) provided the mathematical foundation for the theory of the random motions observed and described in 1827 by the nineteenth-century botanist Robert Brown.
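Properties (2) and (3) of the definition can be checked against the rescaled random walk by simulation. This is an added sketch, not from the notes, with arbitrarily chosen interval endpoints:

```python
# Monte Carlo sketch: increments of the rescaled random walk have variance
# close to b - a, and non-overlapping increments are nearly uncorrelated.
import numpy as np

rng = np.random.default_rng(2)
T, reps = 500, 4000
a, b, c, d = 0.1, 0.4, 0.6, 0.9
inc1, inc2 = np.empty(reps), np.empty(reps)
for i in range(reps):
    W = np.cumsum(rng.standard_normal(T)) / np.sqrt(T)   # sigma = 1
    inc1[i] = W[int(b * T) - 1] - W[int(a * T) - 1]      # W_T(b) - W_T(a)
    inc2[i] = W[int(d * T) - 1] - W[int(c * T) - 1]      # W_T(d) - W_T(c)
print(np.var(inc1), np.corrcoef(inc1, inc2)[0, 1])
```

The sample variance of the first increment is close to $b-a = 0.3$, and the sample correlation between the two non-overlapping increments is close to zero.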
2.2 Functional Central Limit Theorems
We earlier defined convergence in law for random variables, and now we need to extend the definition to cover random functions. Let $S(\cdot)$ represent a continuous-time stochastic process, with $S(r)$ representing its value at some date $r$ for $r \in [0,1]$. Suppose, further, that any given realization of $S(\cdot)$ is a continuous function of $r$ with probability 1. For $\{S_T(\cdot)\}_{T=1}^{\infty}$ a sequence of such continuous functions, we say that the sequence of probability measures induced by $\{S_T(\cdot)\}_{T=1}^{\infty}$ converges weakly to the probability measure induced by $S(\cdot)$, denoted by $S_T(\cdot) \Rightarrow S(\cdot)$, if all of the following hold:
(1). For any finite collection of $k$ particular dates,
\[ 0 \le r_1 < r_2 < \ldots < r_k \le 1, \]
the sequence of $k$-dimensional random vectors $\{y_T\}_{T=1}^{\infty}$ converges in distribution to the vector $y$, where
\[ y_T \equiv \begin{bmatrix} S_T(r_1) \\ S_T(r_2) \\ \vdots \\ S_T(r_k) \end{bmatrix}, \qquad y \equiv \begin{bmatrix} S(r_1) \\ S(r_2) \\ \vdots \\ S(r_k) \end{bmatrix}; \]
(2). For each $\epsilon > 0$, the probability that $S_T(r_1)$ differs from $S_T(r_2)$ by more than $\epsilon$ for any dates $r_1$ and $r_2$ within $\delta$ of each other goes to zero uniformly in $T$ as $\delta \to 0$;
(3). $P\{|S_T(0)| > \lambda\} \to 0$ uniformly in $T$ as $\lambda \to \infty$.
This definition applies to sequences of continuous functions, though the function in (9) is a discontinuous step function. Fortunately, the discontinuities occur at a countable set of points. Formally, $W_T(\cdot)$ can be replaced with a similar continuous function by interpolating between the steps.
The Functional Central Limit Theorem (FCLT) provides conditions under which $W_T$ converges to the standard Wiener process $W$. The simplest FCLT is a generalization of the Lindeberg--L\'evy CLT, known as Donsker's theorem.
Theorem: (Donsker)
Let $\varepsilon_t$ be a sequence of i.i.d. random scalars with mean zero. If $\sigma^2 \equiv Var(\varepsilon_t) < \infty$, $\sigma^2 \ne 0$, then $W_T \Rightarrow W$.
Because pointwise convergence in distribution, $W_T(\cdot,r) \xrightarrow{L} W(\cdot,r)$ for each $r \in [0,1]$, is necessary (but not sufficient) for weak convergence $W_T \Rightarrow W$, the Lindeberg--L\'evy CLT ($W_T(\cdot,1) \xrightarrow{L} W(\cdot,1)$) follows immediately from Donsker's theorem. Donsker's theorem is strictly stronger than Lindeberg--L\'evy, however: both use identical assumptions, but Donsker's theorem delivers a much stronger conclusion. Donsker called his result an invariance principle. Consequently, the FCLT is often referred to as an invariance principle.
So far, we have assumed that the sequence $\varepsilon_t$ used to construct $W_T$ is i.i.d. Nevertheless, just as central limit theorems hold when $\varepsilon_t$ is not necessarily i.i.d., versions of the FCLT hold for each CLT previously given in Chapter 4.
Theorem: (Continuous Mapping Theorem)
If $S_T(\cdot) \Rightarrow S(\cdot)$ and $g(\cdot)$ is a continuous functional, then $g(S_T(\cdot)) \Rightarrow g(S(\cdot))$.
In the above theorem, continuity of a functional $g(\cdot)$ means that for any $\epsilon > 0$ there exists a $\delta > 0$ such that if $h$ and $k$ are any continuous bounded functions on $[0,1]$, $h : [0,1] \to \mathbb{R}^1$ and $k : [0,1] \to \mathbb{R}^1$, with $|h(r) - k(r)| < \delta$ for all $r \in [0,1]$, then $|g(h(\cdot)) - g(k(\cdot))| < \epsilon$.
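To make the theorem concrete, here is an added sketch (not from the notes) using the continuous functional $g(S) = \sup_{r}|S(r)|$: by the continuous mapping theorem, $g(W_T)$ converges in distribution to $\sup_r |W(r)|$, and we can tabulate $g(W_T)$ on simulated paths.

```python
# Evaluate the continuous functional g(S) = sup_r |S(r)| on simulated
# rescaled random walks; by the CMT its distribution approximates that
# of sup_r |W(r)|.
import numpy as np

rng = np.random.default_rng(3)
T, reps = 400, 3000
sups = np.empty(reps)
for i in range(reps):
    W = np.cumsum(rng.standard_normal(T)) / np.sqrt(T)
    sups[i] = np.max(np.abs(W))          # g(W_T) for this path
print(np.median(sups))
```

The same device, with $g(S) = \int_0^1 S(r)^2\,dr$, is exactly what is used in the proofs of the lemma in Section 3.1 below.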
3 Regression with a Unit Root
3.1 Dickey-Fuller Test, Yt is an AR(1) process
Consider the following simple AR(1) process with a unit root,
\[ Y_t = \beta Y_{t-1} + u_t, \tag{11} \]
\[ \beta = 1, \tag{12} \]
where $Y_0 = 0$ and $u_t$ is i.i.d. with mean zero and variance $\sigma^2$.
We consider the three least-squares regressions
\[ Y_t = \hat{\beta} Y_{t-1} + \hat{u}_t, \tag{13} \]
\[ Y_t = \hat{\alpha} + \hat{\beta} Y_{t-1} + \hat{u}_t, \tag{14} \]
and
\[ Y_t = \hat{\alpha} + \hat{\beta} Y_{t-1} + \hat{\delta}t + \hat{u}_t, \tag{15} \]
where $\hat{\beta}$, $(\hat{\alpha},\hat{\beta})$, and $(\hat{\alpha},\hat{\beta},\hat{\delta})$ are the conventional least-squares regression coefficients. Dickey and Fuller (1979) were concerned with the limiting distributions of the estimators in regressions (13), (14), and (15) under the null hypothesis that the data are generated by (11) and (12).
We first provide the following asymptotic results for the sample moments, which are useful for deriving the asymptotics of the OLS estimator.
Lemma:
Let $u_t$ be an i.i.d. sequence with mean zero and variance $\sigma^2$, and let
\[ Y_t = u_1 + u_2 + \ldots + u_t \quad \text{for } t = 1,2,\ldots,T, \tag{16} \]
with $Y_0 = 0$. Then
(a) $T^{-\frac{1}{2}}\sum_{t=1}^T u_t \xrightarrow{L} \sigma W(1)$,
(b) $T^{-2}\sum_{t=1}^T Y_{t-1}^2 \xrightarrow{L} \sigma^2\int_0^1 [W(r)]^2\,dr$,
(c) $T^{-\frac{3}{2}}\sum_{t=1}^T Y_{t-1} \xrightarrow{L} \sigma\int_0^1 W(r)\,dr$,
(d) $T^{-1}\sum_{t=1}^T Y_{t-1}u_t \xrightarrow{L} \frac{1}{2}\sigma^2[W(1)^2 - 1]$,
(e) $T^{-\frac{3}{2}}\sum_{t=1}^T t u_t \xrightarrow{L} \sigma\left[W(1) - \int_0^1 W(r)\,dr\right]$,
(f) $T^{-\frac{5}{2}}\sum_{t=1}^T t Y_{t-1} \xrightarrow{L} \sigma\int_0^1 r W(r)\,dr$,
(g) $T^{-3}\sum_{t=1}^T t Y_{t-1}^2 \xrightarrow{L} \sigma^2\int_0^1 r[W(r)]^2\,dr$.
Joint weak convergence of the sample moments given above to their respective limits is easily established and will be used below.
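The key algebraic step behind item (d) is an exact identity that can be verified numerically. The following sketch (added, not from the notes) checks it on one simulated path:

```python
# Deterministic check of the identity used to prove item (d):
# sum_{t=1}^T Y_{t-1} u_t = (1/2) Y_T^2 - (1/2) sum_{t=1}^T u_t^2,  Y_0 = 0.
import numpy as np

rng = np.random.default_rng(4)
T = 300
u = rng.standard_normal(T)
Y = np.cumsum(u)
Ylag = np.concatenate(([0.0], Y[:-1]))   # (Y_0, Y_1, ..., Y_{T-1})
lhs = np.sum(Ylag * u)
rhs = 0.5 * Y[-1] ** 2 - 0.5 * np.sum(u ** 2)
print(lhs, rhs)
```

Because the identity holds path by path, the limit in (d) follows immediately from the limits of $T^{-1}Y_T^2$ and $T^{-1}\sum u_t^2$.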
Proof:
(a) is a straightforward result of Donsker's theorem with $r = 1$.
(b) First rewrite $T^{-2}\sum_{t=1}^T Y_{t-1}^2$ in terms of $W_T(r_{t-1}) \equiv T^{-1/2}Y_{t-1}/\sigma = T^{-1/2}\sum_{s=1}^{t-1}u_s/\sigma$, where $r_{t-1} = (t-1)/T$, so that $T^{-2}\sum_{t=1}^T Y_{t-1}^2 = \sigma^2 T^{-1}\sum_{t=1}^T W_T(r_{t-1})^2$. Because $W_T(r)$ is constant for $(t-1)/T \le r < t/T$, we have
\[ T^{-1}\sum_{t=1}^T W_T(r_{t-1})^2 = \sum_{t=1}^T \int_{(t-1)/T}^{t/T} W_T(r)^2\,dr = \int_0^1 W_T(r)^2\,dr. \]
The continuous mapping theorem applies to $h(W_T) = \int_0^1 W_T(r)^2\,dr$. It follows that $h(W_T) \Rightarrow h(W)$, so that $T^{-2}\sum_{t=1}^T Y_{t-1}^2 \Rightarrow \sigma^2\int_0^1 W(r)^2\,dr$, as claimed.
(c) The proof of item (c) is analogous to that of (b). First rewrite $T^{-3/2}\sum_{t=1}^T Y_{t-1}$ in terms of $W_T(r_{t-1}) \equiv T^{-1/2}Y_{t-1}/\sigma$, where $r_{t-1} = (t-1)/T$, so that $T^{-3/2}\sum_{t=1}^T Y_{t-1} = \sigma T^{-1}\sum_{t=1}^T W_T(r_{t-1})$. Because $W_T(r)$ is constant for $(t-1)/T \le r < t/T$, we have
\[ T^{-1}\sum_{t=1}^T W_T(r_{t-1}) = \sum_{t=1}^T \int_{(t-1)/T}^{t/T} W_T(r)\,dr = \int_0^1 W_T(r)\,dr. \]
The continuous mapping theorem applies to $h(W_T) = \int_0^1 W_T(r)\,dr$. It follows that $h(W_T) \Rightarrow h(W)$, so that $T^{-3/2}\sum_{t=1}^T Y_{t-1} \Rightarrow \sigma\int_0^1 W(r)\,dr$, as claimed.
(d) For a random walk, $Y_t^2 = (Y_{t-1}+u_t)^2 = Y_{t-1}^2 + 2Y_{t-1}u_t + u_t^2$, implying that $Y_{t-1}u_t = \frac{1}{2}\{Y_t^2 - Y_{t-1}^2 - u_t^2\}$ and then $\sum_{t=1}^T Y_{t-1}u_t = \frac{1}{2}\{Y_T^2 - Y_0^2\} - \frac{1}{2}\sum_{t=1}^T u_t^2$. Recall that $Y_0 = 0$, and thus it is convenient to write
\[ \sum_{t=1}^T Y_{t-1}u_t = \frac{1}{2}Y_T^2 - \frac{1}{2}\sum_{t=1}^T u_t^2. \]
From item (a) we know that $T^{-1}Y_T^2 = \left(T^{-1/2}\sum_{s=1}^T u_s\right)^2 \xrightarrow{L} \sigma^2 W(1)^2$, and $T^{-1}\sum_{t=1}^T u_t^2 \xrightarrow{p} \sigma^2$ by the LLN (Kolmogorov); hence
\[ T^{-1}\sum_{t=1}^T Y_{t-1}u_t \Rightarrow \tfrac{1}{2}\sigma^2[W(1)^2 - 1]. \]
(e) We first observe that
\[ \sum_{t=1}^T Y_{t-1} = u_1 + (u_1+u_2) + (u_1+u_2+u_3) + \ldots + (u_1+u_2+\ldots+u_{T-1}) \]
\[ = (T-1)u_1 + (T-2)u_2 + (T-3)u_3 + \ldots + [T-(T-1)]u_{T-1} = \sum_{t=1}^T (T-t)u_t = \sum_{t=1}^T T u_t - \sum_{t=1}^T t u_t, \]
or $\sum_{t=1}^T t u_t = T\sum_{t=1}^T u_t - \sum_{t=1}^T Y_{t-1}$. Therefore, $T^{-\frac{3}{2}}\sum_{t=1}^T t u_t = T^{-\frac{1}{2}}\sum_{t=1}^T u_t - T^{-\frac{3}{2}}\sum_{t=1}^T Y_{t-1}$. By applying the continuous mapping theorem to the joint convergence of items (a) and (c), we have
\[ T^{-\frac{3}{2}}\sum_{t=1}^T t u_t \Rightarrow \sigma\left[W(1) - \int_0^1 W(r)\,dr\right]. \]
The proofs of items (f) and (g) are analogous to those of (c) and (b). First rewrite $T^{-5/2}\sum_{t=1}^T t Y_{t-1}$ in terms of $W_T(r_{t-1}) \equiv T^{-1/2}Y_{t-1}/\sigma = T^{-1/2}\sum_{s=1}^{t-1}u_s/\sigma$, where $r_{t-1} = (t-1)/T$, so that $T^{-5/2}\sum_{t=1}^T t Y_{t-1} = \sigma T^{-2}\sum_{t=1}^T t W_T(r_{t-1})$. Because $W_T(r)$ is constant for $(t-1)/T \le r < t/T$, we have
\[ T^{-1}\sum_{t=1}^T (t/T) W_T(r_{t-1}) = \sum_{t=1}^T \int_{(t-1)/T}^{t/T} r\, W_T(r)\,dr \quad (\text{taking } r = t/T) \;=\; \int_0^1 r\, W_T(r)\,dr. \]
The continuous mapping theorem applies to $h(W_T) = \int_0^1 r\, W_T(r)\,dr$. It follows that $h(W_T) \Rightarrow h(W)$, so that $T^{-5/2}\sum_{t=1}^T t Y_{t-1} \Rightarrow \sigma\int_0^1 r W(r)\,dr$, as claimed.
(g) We similarly write $T^{-3}\sum_{t=1}^T t Y_{t-1}^2$ in terms of $W_T(r_{t-1})$, so that $T^{-3}\sum_{t=1}^T t Y_{t-1}^2 = \sigma^2 T^{-2}\sum_{t=1}^T t W_T(r_{t-1})^2$. Because $W_T(r)$ is constant for $(t-1)/T \le r < t/T$, we have
\[ T^{-1}\sum_{t=1}^T (t/T) W_T(r_{t-1})^2 = \sum_{t=1}^T \int_{(t-1)/T}^{t/T} r\, W_T(r)^2\,dr = \int_0^1 r\, W_T(r)^2\,dr. \]
The continuous mapping theorem applies to $h(W_T) = \int_0^1 r\, W_T(r)^2\,dr$. It follows that $h(W_T) \Rightarrow h(W)$, so that $T^{-3}\sum_{t=1}^T t Y_{t-1}^2 \Rightarrow \sigma^2\int_0^1 r W(r)^2\,dr$. This completes the proof of the lemma.
3.1.1 No Constant Term or Time Trend in the Regression; True Process Is a Random Walk
We first consider the case in which neither a constant term nor a time trend is included in the regression model, but the true process is a random walk. The asymptotic distributions of the OLS unit root coefficient estimator and the t-ratio test statistic are as follows.
Theorem 1:
Let the data $Y_t$ be generated by (11) and (12); then as $T \to \infty$, for the regression model (13),
\[ T(\hat{\beta}_T - 1) \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - 1\}}{\int_0^1 [W(r)]^2\,dr} \]
and
\[ \hat{t} = \frac{\hat{\beta}_T - 1}{\hat{\sigma}_{\hat{\beta}_T}} \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - 1\}}{\left\{\int_0^1 [W(r)]^2\,dr\right\}^{1/2}}, \]
where $\hat{\sigma}^2_{\hat{\beta}_T} = s_T^2 \Big/ \sum_{t=1}^T Y_{t-1}^2$ and $s_T^2$ denotes the OLS estimate of the disturbance variance:
\[ s_T^2 = \sum_{t=1}^T (Y_t - \hat{\beta}_T Y_{t-1})^2/(T-1). \]
Proof:
The deviation of the OLS estimator from the true value is characterized by
\[ T(\hat{\beta}_T - 1) = \frac{T^{-1}\sum_{t=1}^T Y_{t-1}u_t}{T^{-2}\sum_{t=1}^T Y_{t-1}^2}, \tag{17} \]
which is a continuous function of the quantities in Lemma 1(b) and 1(d). It follows that under the null hypothesis that $\beta = 1$, the OLS estimator $\hat{\beta}$ is characterized by
\[ T(\hat{\beta}_T - 1) \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - 1\}}{\int_0^1 [W(r)]^2\,dr}. \tag{18} \]
To prove the second part of this theorem, we first show the consistency of $s_T^2$. Notice that the population disturbance sum of squares can be written
\begin{align*}
(y_T - y_{T-1}\beta)'(y_T - y_{T-1}\beta)
&= (y_T - y_{T-1}\hat{\beta} + y_{T-1}\hat{\beta} - y_{T-1}\beta)'(y_T - y_{T-1}\hat{\beta} + y_{T-1}\hat{\beta} - y_{T-1}\beta) \\
&= (y_T - y_{T-1}\hat{\beta})'(y_T - y_{T-1}\hat{\beta}) + (y_{T-1}\hat{\beta} - y_{T-1}\beta)'(y_{T-1}\hat{\beta} - y_{T-1}\beta),
\end{align*}
where $y_T = [Y_1\ Y_2\ \ldots\ Y_T]'$ and the cross-product terms have vanished, since
\[ (y_T - y_{T-1}\hat{\beta})'y_{T-1}(\hat{\beta} - \beta) = 0 \]
by the OLS orthogonality condition ($X'e = 0$). Dividing the last equation by $T$,
\[ (1/T)(y_T - y_{T-1}\beta)'(y_T - y_{T-1}\beta) = (1/T)(y_T - y_{T-1}\hat{\beta})'(y_T - y_{T-1}\hat{\beta}) + (\hat{\beta} - \beta)'[(y_{T-1}'y_{T-1})/T](\hat{\beta} - \beta), \]
or
\[ (1/T)(y_T - y_{T-1}\hat{\beta})'(y_T - y_{T-1}\hat{\beta}) \tag{19} \]
\[ = (1/T)\sum_{t=1}^T u_t^2 - T^{1/2}(\hat{\beta} - \beta)'[(y_{T-1}'y_{T-1})/T^2]\,T^{1/2}(\hat{\beta} - \beta). \tag{20} \]
Now $(1/T)\sum_{t=1}^T u_t^2 \xrightarrow{p} E(u_t^2) \equiv \sigma^2$ by the LLN for an i.i.d. sequence, $T^{1/2}(\hat{\beta} - \beta) \xrightarrow{p} 0$ (since $T(\hat{\beta} - \beta) = O_p(1)$ by (18)), and $(y_{T-1}'y_{T-1})/T^2 \Rightarrow \sigma^2\int_0^1 [W(r)]^2\,dr$ from Lemma 1(b). We thus have
\[ T^{1/2}(\hat{\beta} - \beta)'[(y_{T-1}'y_{T-1})/T^2]\,T^{1/2}(\hat{\beta} - \beta) \xrightarrow{p} 0\cdot\sigma^2\int_0^1 [W(r)]^2\,dr\cdot 0 = 0. \]
Substituting these results into (20), we have
\[ (1/T)(y_T - y_{T-1}\hat{\beta})'(y_T - y_{T-1}\hat{\beta}) \xrightarrow{p} \sigma^2. \]
The OLS disturbance variance estimator
\[ s_T^2 = [1/(T-1)](y_T - y_{T-1}\hat{\beta})'(y_T - y_{T-1}\hat{\beta}) \tag{21} \]
\[ = [T/(T-1)](1/T)(y_T - y_{T-1}\hat{\beta})'(y_T - y_{T-1}\hat{\beta}) \tag{22} \]
\[ \xrightarrow{p} 1\cdot\sigma^2 = \sigma^2 \tag{23} \]
is therefore consistent.
Finally, we can express the t statistic alternatively as
\[ \hat{t}_T = T(\hat{\beta}_T - 1)\left\{T^{-2}\sum_{t=1}^T Y_{t-1}^2\right\}^{1/2} \Big/\, (s_T^2)^{1/2} \]
or
\[ \hat{t}_T = \frac{T^{-1}\sum_{t=1}^T Y_{t-1}u_t}{\left\{T^{-2}\sum_{t=1}^T Y_{t-1}^2\right\}^{1/2}(s_T^2)^{1/2}}, \]
which is a continuous function of the quantities in Lemma 1(b) and 1(d). It follows that under the null hypothesis that $\beta = 1$, the asymptotic distribution of the OLS t statistic is characterized by
\[ \hat{t}_T \xrightarrow{L} \frac{\frac{1}{2}\sigma^2\{[W(1)]^2 - 1\}}{\left\{\sigma^2\int_0^1 [W(r)]^2\,dr\right\}^{1/2}(\sigma^2)^{1/2}} = \frac{\frac{1}{2}\{[W(1)]^2 - 1\}}{\left\{\int_0^1 [W(r)]^2\,dr\right\}^{1/2}}. \tag{24} \]
This completes the proof of the theorem.
Statistical tables for the distributions in (18) and (24) for various sample sizes $T$ are reported in the sections labeled Case 1 in Tables B.5 and B.6, respectively. These finite-sample results assume Gaussian innovations.
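In the absence of the tables, the Case 1 distribution can be approximated by simulation. This is an added Monte Carlo sketch (not from the notes); the published Case 1 tables put the 5% critical value of the t statistic near $-1.95$.

```python
# Monte Carlo sketch of the Case 1 Dickey-Fuller t statistic in (24):
# simulate random walks, run regression (13), tabulate empirical quantiles.
import numpy as np

rng = np.random.default_rng(5)
T, reps = 250, 3000
tstats = np.empty(reps)
for i in range(reps):
    Y = np.cumsum(rng.standard_normal(T))
    ylag, y = Y[:-1], Y[1:]
    beta = (ylag @ y) / (ylag @ ylag)
    resid = y - beta * ylag
    s2 = (resid @ resid) / (len(y) - 1)          # s_T^2
    se = np.sqrt(s2 / (ylag @ ylag))             # sigma_hat for beta_hat
    tstats[i] = (beta - 1.0) / se
print(np.quantile(tstats, [0.05, 0.50, 0.95]))
```

Note how far the empirical 5% quantile sits from the $-1.645$ of a standard normal, which is the practical content of the nonstandard limit in (24).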
3.1.2 Constant Term but No Time Trend Included in the Regression; True Process Is a Random Walk
We next consider the case in which a constant term is added to the regression model, but the true process is a random walk. The asymptotic distributions of the OLS unit root coefficient estimator and the t-ratio test statistic are as follows.
Theorem 2:
Let the data $Y_t$ be generated by (11) and (12); then as $T \to \infty$, for the regression model (14),
\[ T(\hat{\beta}_T - 1) \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - 1\} - W(1)\cdot\int_0^1 W(r)\,dr}{\int_0^1 [W(r)]^2\,dr - \left[\int_0^1 W(r)\,dr\right]^2} \tag{25} \]
and
\[ \hat{t} = \frac{\hat{\beta}_T - 1}{\hat{\sigma}_{\hat{\beta}_T}} \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - 1\} - W(1)\cdot\int_0^1 W(r)\,dr}{\left\{\int_0^1 [W(r)]^2\,dr - \left[\int_0^1 W(r)\,dr\right]^2\right\}^{1/2}}, \tag{26} \]
where
\[ \hat{\sigma}^2_{\hat{\beta}_T} = s_T^2 \begin{bmatrix} 0 & 1 \end{bmatrix} \begin{bmatrix} T & \sum Y_{t-1} \\ \sum Y_{t-1} & \sum Y_{t-1}^2 \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ 1 \end{bmatrix} \]
and $s_T^2$ denotes the OLS estimate of the disturbance variance:
\[ s_T^2 = \sum_{t=1}^T (Y_t - \hat{\alpha}_T - \hat{\beta}_T Y_{t-1})^2/(T-2). \]
Proof:
The proof of this theorem is analogous to that of Theorem 1 and is omitted here.
Statistical tables for the distributions in (25) and (26) for various sample sizes $T$ are reported in the sections labeled Case 2 in Tables B.5 and B.6, respectively. These finite-sample results assume Gaussian innovations.
These statistics test the null hypothesis that $\beta = 1$. However, a maintained assumption on which the derivation of Theorem 2 is based is that the true value of $\alpha$ is zero. Thus, it might seem more natural to test for a unit root in this specification by testing the joint hypothesis that $\alpha = 0$ and $\beta = 1$. Dickey and Fuller (1981) derived the limiting distribution of the likelihood ratio test of the hypothesis that $(\alpha,\beta) = (0,1)$ and used Monte Carlo methods to calculate the distribution of the OLS F test of this hypothesis. Their values are reported under the heading Case 2 in Table B.7.
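The Case 2 t statistic of (26) can be simulated in the same way as Case 1, now including the constant of regression (14). This is an added sketch (not from the notes); the published 5% critical value for this statistic is near $-2.86$.

```python
# Monte Carlo sketch of the Case 2 Dickey-Fuller t statistic (26):
# random-walk data, regression with a constant term.
import numpy as np

rng = np.random.default_rng(6)
T, reps = 250, 3000
tstats = np.empty(reps)
for i in range(reps):
    Y = np.cumsum(rng.standard_normal(T))
    X = np.column_stack((np.ones(T - 1), Y[:-1]))
    y = Y[1:]
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ (X.T @ y)                   # (alpha_hat, beta_hat)
    resid = y - X @ coef
    s2 = (resid @ resid) / (len(y) - 2)
    se = np.sqrt(s2 * XtX_inv[1, 1])
    tstats[i] = (coef[1] - 1.0) / se
print(np.quantile(tstats, 0.05))
```

Including the constant shifts the whole distribution further to the left, which is why the Case 2 critical values are more negative than those of Case 1.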
3.1.3 Constant Term and Time Trend Included in the Regression; True Process Is a Random Walk With or Without Drift
We finally consider in this section the case in which a constant term and a linear trend are added to the regression model, but the true process is a random walk with drift. However, the true value of this drift turns out not to matter for the asymptotic distributions of the OLS unit root coefficient estimator and the t-ratio test statistic in this case.
Theorem 3:
Let the data $Y_t$ be generated by (11) and (12); then as $T \to \infty$, for the regression model (15),
\[ T(\hat{\beta} - 1) \Rightarrow \frac{\frac{1}{2}\left\{\left[W(1) - 2\int_0^1 W(r)\,dr\right]\left[W(1) + 6\int_0^1 W(r)\,dr - 12\int_0^1 rW(r)\,dr\right] - 1\right\}}{\int_0^1 [W(r)]^2\,dr - 4\left[\int_0^1 W(r)\,dr\right]^2 + 12\int_0^1 W(r)\,dr\int_0^1 rW(r)\,dr - 12\left[\int_0^1 rW(r)\,dr\right]^2} \]
and
\[ \hat{t} = \frac{\hat{\beta}_T - 1}{\hat{\sigma}_{\hat{\beta}_T}} \Rightarrow T(\hat{\beta} - 1)\div\sqrt{Q}, \]
where
\[ \hat{\sigma}^2_{\hat{\beta}_T} = s_T^2 \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} T & \sum_{t=1}^T \xi_{t-1} & \sum_{t=1}^T t \\ \sum_{t=1}^T \xi_{t-1} & \sum_{t=1}^T \xi_{t-1}^2 & \sum_{t=1}^T t\xi_{t-1} \\ \sum_{t=1}^T t & \sum_{t=1}^T t\xi_{t-1} & \sum_{t=1}^T t^2 \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \qquad \xi_t = Y_t - \alpha t, \]
$s_T^2$ denotes the OLS estimate of the disturbance variance:
\[ s_T^2 = \sum_{t=1}^T (Y_t - \hat{\alpha} - \hat{\beta}_T Y_{t-1} - \hat{\delta}t)^2/(T-3), \]
and
\[ Q \equiv \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & \int W(r)\,dr & 1/2 \\ \int W(r)\,dr & \int [W(r)]^2\,dr & \int rW(r)\,dr \\ 1/2 & \int rW(r)\,dr & 1/3 \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}. \]
Proof:
(a). Let the data generating process be
\[ Y_t = \alpha + Y_{t-1} + u_t, \]
and the regression model be
\[ Y_t = \alpha + \beta Y_{t-1} + \delta t + u_t. \tag{27} \]
Note that the regression model (27) can be equivalently rewritten as
\begin{align*}
Y_t &= (1-\beta)\alpha + \beta\big(Y_{t-1} - \alpha(t-1)\big) + (\delta + \beta\alpha)t + u_t \\
&\equiv \alpha^* + \beta^*\xi_{t-1} + \delta^* t + u_t, \tag{28}
\end{align*}
where $\alpha^* = (1-\beta)\alpha$, $\beta^* = \beta$, $\delta^* = \delta + \beta\alpha$, and $\xi_{t-1} = Y_{t-1} - \alpha(t-1)$. Moreover, under the null hypothesis that $\beta = 1$ and $\delta = 0$,
\[ \xi_t = Y_0 + u_1 + u_2 + \ldots + u_t; \]
that is, $\xi_t$ is the random walk described in Lemma 1. Under the maintained hypothesis, $\alpha = \alpha_0$, $\beta = 1$, and $\delta = 0$, which in (28) means that $\alpha^* = 0$, $\beta^* = 1$, and $\delta^* = \alpha_0$. The deviation of the OLS estimates from these true values is given by
\[ \begin{bmatrix} \hat{\alpha}^* \\ \hat{\beta} - 1 \\ \hat{\delta}^* - \alpha_0 \end{bmatrix} = \begin{bmatrix} T & \sum_{t=1}^T \xi_{t-1} & \sum_{t=1}^T t \\ \sum_{t=1}^T \xi_{t-1} & \sum_{t=1}^T \xi_{t-1}^2 & \sum_{t=1}^T t\xi_{t-1} \\ \sum_{t=1}^T t & \sum_{t=1}^T t\xi_{t-1} & \sum_{t=1}^T t^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_{t=1}^T u_t \\ \sum_{t=1}^T \xi_{t-1}u_t \\ \sum_{t=1}^T t u_t \end{bmatrix}, \tag{29} \]
or in shorthand,
\[ C = A^{-1}f. \]
From Lemma 1, the orders of probability of the individual terms in (29) are as follows:
\[ \begin{bmatrix} \hat{\alpha}^* \\ \hat{\beta} - 1 \\ \hat{\delta}^* - \alpha_0 \end{bmatrix} = \begin{bmatrix} O_p(T) & O_p(T^{\frac{3}{2}}) & O_p(T^2) \\ O_p(T^{\frac{3}{2}}) & O_p(T^2) & O_p(T^{\frac{5}{2}}) \\ O_p(T^2) & O_p(T^{\frac{5}{2}}) & O_p(T^3) \end{bmatrix}^{-1} \begin{bmatrix} O_p(T^{\frac{1}{2}}) \\ O_p(T) \\ O_p(T^{\frac{3}{2}}) \end{bmatrix}. \]
We define a rescaling matrix,
\[ \Upsilon_T = \begin{bmatrix} T^{\frac{1}{2}} & 0 & 0 \\ 0 & T & 0 \\ 0 & 0 & T^{\frac{3}{2}} \end{bmatrix}. \]
Premultiplying (29) by the rescaling matrix, we get
\[ \Upsilon_T C = \Upsilon_T A^{-1}\Upsilon_T \Upsilon_T^{-1} f = \left[\Upsilon_T^{-1} A \Upsilon_T^{-1}\right]^{-1}\Upsilon_T^{-1} f. \tag{30} \]
Substituting the results of Lemma A.1 into (30), we establish that
\[ \hat{b}_1 \Rightarrow Q^{-1}h_1, \tag{31} \]
where
\[ \hat{b}_1 \equiv \begin{bmatrix} T^{1/2}\hat{\alpha}^* \\ T(\hat{\beta} - 1) \\ T^{3/2}(\hat{\delta}^* - \alpha_0) \end{bmatrix}, \qquad Q \equiv \begin{bmatrix} 1 & \int W(r)\,dr & 1/2 \\ \int W(r)\,dr & \int [W(r)]^2\,dr & \int rW(r)\,dr \\ 1/2 & \int rW(r)\,dr & 1/3 \end{bmatrix}, \]
\[ h_1 \equiv \begin{bmatrix} \sigma W(1) \\ \frac{1}{2}\sigma^2\{[W(1)]^2 - 1\} \\ \sigma\left[W(1) - \int W(r)\,dr\right] \end{bmatrix}. \]
Thus, the asymptotic distribution of $T(\hat{\beta} - 1)$ is given by the middle row of (31); that is,
\[ T(\hat{\beta} - 1) \Rightarrow \frac{\frac{1}{2}\left\{\left[W(1) - 2\int_0^1 W(r)\,dr\right]\left[W(1) + 6\int_0^1 W(r)\,dr - 12\int_0^1 rW(r)\,dr\right] - 1\right\}}{\int_0^1 [W(r)]^2\,dr - 4\left[\int_0^1 W(r)\,dr\right]^2 + 12\int_0^1 W(r)\,dr\int_0^1 rW(r)\,dr - 12\left[\int_0^1 rW(r)\,dr\right]^2}. \]
Note that this distribution does not depend on either $\alpha$ or $\sigma$; in particular, it does not matter whether or not the true value of $\alpha$ is zero.
(b). The asymptotic distribution of the OLS t statistic can be found using calculations similar to those in (23). Notice that
\begin{align*}
T^2\hat{\sigma}^2_{\hat{\beta}_T} &= T^2 s_T^2 \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} T & \sum \xi_{t-1} & \sum t \\ \sum \xi_{t-1} & \sum \xi_{t-1}^2 & \sum t\xi_{t-1} \\ \sum t & \sum t\xi_{t-1} & \sum t^2 \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\
&= s_T^2 \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} T^{1/2} & 0 & 0 \\ 0 & T & 0 \\ 0 & 0 & T^{3/2} \end{bmatrix} \begin{bmatrix} T & \sum \xi_{t-1} & \sum t \\ \sum \xi_{t-1} & \sum \xi_{t-1}^2 & \sum t\xi_{t-1} \\ \sum t & \sum t\xi_{t-1} & \sum t^2 \end{bmatrix}^{-1} \begin{bmatrix} T^{1/2} & 0 & 0 \\ 0 & T & 0 \\ 0 & 0 & T^{3/2} \end{bmatrix} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\
&= s_T^2 \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & T^{-3/2}\sum \xi_{t-1} & T^{-2}\sum t \\ T^{-3/2}\sum \xi_{t-1} & T^{-2}\sum \xi_{t-1}^2 & T^{-5/2}\sum t\xi_{t-1} \\ T^{-2}\sum t & T^{-5/2}\sum t\xi_{t-1} & T^{-3}\sum t^2 \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\
&\xrightarrow{L}\ \sigma^2 \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & \sigma\int W(r)\,dr & 1/2 \\ \sigma\int W(r)\,dr & \sigma^2\int [W(r)]^2\,dr & \sigma\int rW(r)\,dr \\ 1/2 & \sigma\int rW(r)\,dr & 1/3 \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\
&= \sigma^2 \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 1 & \int W(r)\,dr & 1/2 \\ \int W(r)\,dr & \int [W(r)]^2\,dr & \int rW(r)\,dr \\ 1/2 & \int rW(r)\,dr & 1/3 \end{bmatrix}^{-1} \begin{bmatrix} 1 & 0 & 0 \\ 0 & \sigma & 0 \\ 0 & 0 & 1 \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \\
&= \begin{bmatrix} 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & \int W(r)\,dr & 1/2 \\ \int W(r)\,dr & \int [W(r)]^2\,dr & \int rW(r)\,dr \\ 1/2 & \int rW(r)\,dr & 1/3 \end{bmatrix}^{-1} \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} \equiv Q.
\end{align*}
From this result it follows that the asymptotic distribution of the OLS t test of the hypothesis that $\beta = 1$ is given by
\[ \hat{t}_T = T(\hat{\beta}_T - 1) \div \left(T^2\hat{\sigma}^2_{\hat{\beta}_T}\right)^{1/2} \Rightarrow T(\hat{\beta} - 1)\div\sqrt{Q}. \]
This completes the proof of Theorem 3.
Again, this distribution does not depend on $\alpha$ or $\sigma$. The small-sample distribution of the OLS t statistic under the assumption of Gaussian disturbances is presented under Case 4 in Table B.6. If this distribution were truly t, then a value below $-2.0$ would be sufficient to reject the null hypothesis. However, Table B.6 reveals that, because of the nonstandard distribution, the t statistic must fall below $-3.4$ before the null hypothesis of a unit root can be rejected.
The assumption that the true value of $\delta$ is equal to zero is again an auxiliary hypothesis upon which the asymptotic properties of the test depend. Thus, as in Section 3.1.2, it is natural to consider the OLS F test of the joint null hypothesis that $\delta = 0$ and $\beta = 1$. Though this F test statistic is calculated in the usual way, its asymptotic distribution is nonstandard, and the calculated F statistic should be compared with the values under Case 4 in Table B.7.
Remark:
To derive the asymptotic distributions in this chapter, the following result is useful:
\[ \begin{bmatrix} \alpha & 0 \\ 0 & \beta \end{bmatrix} \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} \alpha & 0 \\ 0 & \beta \end{bmatrix} = \begin{bmatrix} \alpha^2 a & \alpha\beta b \\ \alpha\beta c & \beta^2 d \end{bmatrix}. \]
Unit root tests are not confined to the simple AR(1) process of the original Dickey and Fuller (1979) work. There are two ways to generalize the unit root process. One is to assume that $Y_t$ in (11) is an AR(p) process; the other is to assume that $u_t$ is an ARMA(p,q) process (Said-Dickey) or a general nonparametric process satisfying certain memory and moment constraints (Phillips-Perron). These modifications of unit root models are discussed in the following.
3.2 Augmented Dickey-Fuller Test, Yt is an AR(p) process
Instead of (11) and (12), suppose that the data were generated from an AR(p) process,
\[ (1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p)Y_t = \varepsilon_t, \tag{32} \]
where $\varepsilon_t$ is an i.i.d. sequence with zero mean, variance $\sigma^2$, and finite fourth moment. It is helpful to write the autoregression (32) in a slightly different form. To do so, define
\[ \beta \equiv \phi_1 + \phi_2 + \ldots + \phi_p, \]
\[ \zeta_j \equiv -[\phi_{j+1} + \phi_{j+2} + \ldots + \phi_p] \quad \text{for } j = 1,2,\ldots,p-1. \]
Notice that for any values of $\phi_1,\phi_2,\ldots,\phi_p$, the following polynomials in $L$ are equivalent:
\begin{align*}
&(1-\beta L) - (\zeta_1 L + \zeta_2 L^2 + \ldots + \zeta_{p-1}L^{p-1})(1-L) \\
&= 1 - \beta L - \zeta_1 L + \zeta_1 L^2 - \zeta_2 L^2 + \zeta_2 L^3 - \ldots - \zeta_{p-1}L^{p-1} + \zeta_{p-1}L^p \\
&= 1 - (\beta + \zeta_1)L - (\zeta_2 - \zeta_1)L^2 - (\zeta_3 - \zeta_2)L^3 - \ldots - (\zeta_{p-1} - \zeta_{p-2})L^{p-1} - (-\zeta_{p-1})L^p \\
&= 1 - [(\phi_1+\phi_2+\ldots+\phi_p) - (\phi_2+\ldots+\phi_p)]L - [-(\phi_3+\phi_4+\ldots+\phi_p) + (\phi_2+\ldots+\phi_p)]L^2 \\
&\qquad - \ldots - [-(\phi_p) + (\phi_{p-1}+\phi_p)]L^{p-1} - (\phi_p)L^p \\
&= 1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p.
\end{align*}
Thus, the autoregression (32) can equivalently be written as
\[ \{(1-\beta L) - (\zeta_1 L + \zeta_2 L^2 + \ldots + \zeta_{p-1}L^{p-1})(1-L)\}Y_t = \varepsilon_t \tag{33} \]
or
\[ Y_t = \beta Y_{t-1} + \zeta_1\triangle Y_{t-1} + \zeta_2\triangle Y_{t-2} + \ldots + \zeta_{p-1}\triangle Y_{t-p+1} + \varepsilon_t. \tag{34} \]
Example:
In the case of p = 3, (34) is
\begin{align*}
Y_t &= \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \phi_3 Y_{t-3} + \varepsilon_t \\
&= (\phi_1+\phi_2+\phi_3)Y_{t-1} - (\phi_2+\phi_3)[Y_{t-1} - Y_{t-2}] - \phi_3[Y_{t-2} - Y_{t-3}] + \varepsilon_t \\
&= \beta Y_{t-1} + \zeta_1\triangle Y_{t-1} + \zeta_2\triangle Y_{t-2} + \varepsilon_t.
\end{align*}
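The p = 3 reparameterization is pure algebra and can be verified exactly on arbitrary numbers; the following added sketch (not from the notes, with arbitrarily chosen coefficients) checks it step by step:

```python
# Deterministic check of the p = 3 reparameterization:
# beta = phi1 + phi2 + phi3, zeta1 = -(phi2 + phi3), zeta2 = -phi3
# reproduce exactly the same Y_t from the same lags and innovation.
import numpy as np

rng = np.random.default_rng(7)
phi1, phi2, phi3 = 0.4, 0.3, 0.2
beta = phi1 + phi2 + phi3
zeta1, zeta2 = -(phi2 + phi3), -phi3
Y = rng.standard_normal(3).tolist()      # arbitrary starting values
for e in rng.standard_normal(50):
    y1, y2, y3 = Y[-1], Y[-2], Y[-3]
    direct = phi1 * y1 + phi2 * y2 + phi3 * y3 + e
    repar = beta * y1 + zeta1 * (y1 - y2) + zeta2 * (y2 - y3) + e
    assert abs(direct - repar) < 1e-9
    Y.append(direct)
print("reparameterization verified over 50 steps")
```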
Suppose that the process generating $Y_t$ contains a single unit root; that is, suppose one root of
\[ 1 - \phi_1 z - \phi_2 z^2 - \ldots - \phi_p z^p = 0 \tag{35} \]
is unity, so that
\[ 1 - \phi_1 - \phi_2 - \ldots - \phi_p = 0, \tag{36} \]
and all other roots of (35) are outside the unit circle. Notice that (36) implies that $\beta = 1$ in (34). Moreover, when $\beta = 1$,
\[ 1 - \phi_1 L - \phi_2 L^2 - \ldots - \phi_p L^p = (1-L)(1 - \zeta_1 L - \zeta_2 L^2 - \ldots - \zeta_{p-1}L^{p-1}), \]
and all the roots of $1 - \zeta_1 L - \zeta_2 L^2 - \ldots - \zeta_{p-1}L^{p-1} = 0$ lie outside the unit circle. Under the null hypothesis that $\beta = 1$, expression (34) can then be written as
\[ \triangle Y_t = \zeta_1\triangle Y_{t-1} + \zeta_2\triangle Y_{t-2} + \ldots + \zeta_{p-1}\triangle Y_{t-p+1} + \varepsilon_t \tag{37} \]
or
\[ \triangle Y_t = u_t, \tag{38} \]
where
\[ u_t = \psi(L)\varepsilon_t = (1 - \zeta_1 L - \zeta_2 L^2 - \ldots - \zeta_{p-1}L^{p-1})^{-1}\varepsilon_t. \]
That is, we may express this AR(p) process with a unit root as the AR(1) process with a unit root in (38), but with serially correlated $u_t$.
Dickey and Fuller (1979) proposed a test of the unit root in this AR(p) model, known as the Augmented Dickey-Fuller (ADF) test.
3.2.1 Constant Term but No Time Trend Included in the Regression; True Process Is Autoregressive with No Drift
Assume that the initial sample is of size $T + p$, with observations numbered $\{Y_{-p+1}, Y_{-p+2},\ldots,Y_T\}$, and condition on the first $p$ observations. We are interested in the properties of OLS estimation of
\begin{align}
Y_t &= \zeta_1\triangle Y_{t-1} + \zeta_2\triangle Y_{t-2} + \ldots + \zeta_{p-1}\triangle Y_{t-p+1} + \alpha + \beta Y_{t-1} + \varepsilon_t \tag{39} \\
&= x_t'\boldsymbol{\beta} + \varepsilon_t \tag{40}
\end{align}
under the null hypothesis that $\alpha = 0$ and $\beta = 1$, i.e., the DGP is
\[ \triangle Y_t = \zeta_1\triangle Y_{t-1} + \zeta_2\triangle Y_{t-2} + \ldots + \zeta_{p-1}\triangle Y_{t-p+1} + \varepsilon_t, \tag{41} \]
where $\boldsymbol{\beta} \equiv (\zeta_1,\zeta_2,\ldots,\zeta_{p-1},\alpha,\beta)'$ and $x_t \equiv (\triangle Y_{t-1},\triangle Y_{t-2},\ldots,\triangle Y_{t-p+1},1,Y_{t-1})'$. The asymptotic distributions of the OLS coefficient estimators $\hat{\boldsymbol{\beta}}_T$ are as follows.
Theorem 4:
Let the data $Y_t$ be generated by (41); then as $T \to \infty$, for the regression model (39),
\[ \Upsilon_T(\hat{\boldsymbol{\beta}}_T - \boldsymbol{\beta}) \xrightarrow{L} \begin{bmatrix} V & 0 \\ 0 & Q \end{bmatrix}^{-1} \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} = \begin{bmatrix} V^{-1}h_1 \\ Q^{-1}h_2 \end{bmatrix}, \tag{42} \]
where
\[ \Upsilon_T = \begin{bmatrix} \sqrt{T} & 0 & \cdots & 0 & 0 \\ 0 & \sqrt{T} & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \sqrt{T} & 0 \\ 0 & 0 & \cdots & 0 & T \end{bmatrix}, \qquad V = \begin{bmatrix} \gamma_0 & \gamma_1 & \cdots & \gamma_{p-2} \\ \gamma_1 & \gamma_0 & \cdots & \gamma_{p-3} \\ \vdots & \vdots & & \vdots \\ \gamma_{p-2} & \gamma_{p-3} & \cdots & \gamma_0 \end{bmatrix}, \]
\[ Q = \begin{bmatrix} 1 & \lambda\int W(r)\,dr \\ \lambda\int W(r)\,dr & \lambda^2\int [W(r)]^2\,dr \end{bmatrix}, \]
\[ h_1 \sim N_{p-1}(0,\ \sigma^2 V), \qquad h_2 \sim \begin{bmatrix} \sigma W(1) \\ \frac{1}{2}\sigma\lambda\{[W(1)]^2 - 1\} \end{bmatrix}, \qquad \gamma_j = E[(\triangle Y_t)(\triangle Y_{t-j})], \]
and
\[ \lambda = \sigma\cdot\psi(1) = \sigma/(1 - \zeta_1 - \zeta_2 - \ldots - \zeta_{p-1}). \tag{43} \]
This result reveals that in a regression of I(1) variables on I(1) and I(0) variables, the asymptotic distributions of the coefficients on the I(1) and I(0) variables are independent. Thus, the asymptotic distributions of $\sqrt{T}(\hat{\zeta}_j - \zeta_j)$, $j = 1,2,\ldots,p-1$, and $T(\hat{\beta} - \beta)$ are independent. This result can be used to show that the distribution of $\hat{\beta}_T$ in the ADF regression is the Dickey-Fuller distribution (taking into account the serial correlation in $u_t$; see (44)). Also, the asymptotic distribution of $\sqrt{T}(\hat{\zeta}_j - \zeta_j)$ is normal.
Therefore, the limiting distribution of $T(\hat{\beta} - \beta)$ is given by the second element of
\[ \begin{bmatrix} T^{1/2} & 0 \\ 0 & T \end{bmatrix} \begin{bmatrix} \hat{\alpha} - 0 \\ \hat{\beta}_T - 1 \end{bmatrix} \xrightarrow{L} \begin{bmatrix} 1 & \lambda\int W(r)\,dr \\ \lambda\int W(r)\,dr & \lambda^2\int [W(r)]^2\,dr \end{bmatrix}^{-1} \begin{bmatrix} \sigma W(1) \\ \frac{1}{2}\sigma\lambda\{[W(1)]^2 - 1\} \end{bmatrix} \]
\[ \equiv \begin{bmatrix} \sigma & 0 \\ 0 & \sigma/\lambda \end{bmatrix} \begin{bmatrix} 1 & \int W(r)\,dr \\ \int W(r)\,dr & \int [W(r)]^2\,dr \end{bmatrix}^{-1} \begin{bmatrix} W(1) \\ \frac{1}{2}\{[W(1)]^2 - 1\} \end{bmatrix}, \]
or
\[ T(\hat{\beta}_T - 1) \Rightarrow (\sigma/\lambda)\cdot\frac{\frac{1}{2}\{[W(1)]^2 - 1\} - W(1)\cdot\int_0^1 W(r)\,dr}{\int_0^1 [W(r)]^2\,dr - \left[\int_0^1 W(r)\,dr\right]^2}. \tag{44} \]
The parameter $\sigma/\lambda$ is the factor that corrects for the serial correlation in $u_t$. When $u_t$ is i.i.d., from (38) we have $\zeta_i = 0$ and $\lambda = \sigma$, so that $\sigma/\lambda = 1$, and this distribution reduces to the simple Dickey-Fuller distribution. We are now in a position to propose ADF test statistics which correct for $\sigma/\lambda$ and have the same distribution as the DF statistics.
Theorem 5 (ADF): Let the data $Y_t$ be generated by (41); then as $T\to\infty$, for the regression model (39),

(a).
$$\frac{T(\hat\beta_T - 1)}{1 - \hat\zeta_1 - \hat\zeta_2 - \dots - \hat\zeta_{p-1}} \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - 1\} - W(1)\cdot\int_0^1 W(r)\,dr}{\int_0^1[W(r)]^2\,dr - \left[\int_0^1 W(r)\,dr\right]^2},$$

(b).
$$t_T = \frac{\hat\beta_T - 1}{\{s_T^2\, e_{p+1}'(\sum x_t x_t')^{-1} e_{p+1}\}^{1/2}} \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - 1\} - W(1)\cdot\int_0^1 W(r)\,dr}{\left\{\int_0^1[W(r)]^2\,dr - \left[\int_0^1 W(r)\,dr\right]^2\right\}^{1/2}},$$

where $e_{p+1} = [0\ 0\ \dots\ 0\ 1]'$.
Proof:
(a). From (44) we have
$$T\cdot(\lambda/\sigma)(\hat\beta_T - 1) \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - 1\} - W(1)\cdot\int_0^1 W(r)\,dr}{\int_0^1[W(r)]^2\,dr - \left[\int_0^1 W(r)\,dr\right]^2}. \qquad (45)$$
Recall from (43) that
$$\lambda/\sigma = (1 - \zeta_1 - \zeta_2 - \dots - \zeta_{p-1})^{-1}.$$
Since $\hat\zeta_j$ is consistent by (42), this magnitude is clearly consistently estimated by
$$\widehat{\lambda/\sigma} = (1 - \hat\zeta_1 - \hat\zeta_2 - \dots - \hat\zeta_{p-1})^{-1}. \qquad (46)$$
It follows that
$$[T(\hat\beta_T - 1)\cdot(\lambda/\sigma)] - [T(\hat\beta_T - 1)\cdot(\widehat{\lambda/\sigma})] = T(\hat\beta_T - 1)\cdot[(\lambda/\sigma) - (\widehat{\lambda/\sigma})] = O_p(1)\cdot o_p(1) = o_p(1).$$
Thus $[T(\hat\beta_T - 1)\cdot(\widehat{\lambda/\sigma})]$ and $[T(\hat\beta_T - 1)\cdot(\lambda/\sigma)]$ have the same asymptotic distribution. This completes the proof of part (a).
To prove part (b), first multiplying the numerator and denominator of $t_T$ by $T$ yields
$$t_T = \frac{T(\hat\beta_T - 1)}{\{s_T^2\, e_{p+1}'\Upsilon_T(\sum x_t x_t')^{-1}\Upsilon_T e_{p+1}\}^{1/2}}. \qquad (47)$$
But
$$e_{p+1}'\Upsilon_T\Big(\sum x_t x_t'\Big)^{-1}\Upsilon_T e_{p+1} = e_{p+1}'\Big[\Upsilon_T^{-1}\Big(\sum x_t x_t'\Big)\Upsilon_T^{-1}\Big]^{-1} e_{p+1} \stackrel{L}{\longrightarrow} e_{p+1}'\begin{bmatrix} V^{-1} & 0 \\ 0 & Q^{-1} \end{bmatrix} e_{p+1} = \frac{1}{\lambda^2\left\{\int_0^1[W(r)]^2\,dr - \left[\int_0^1 W(r)\,dr\right]^2\right\}}.$$
Hence, from (45) and (47),
$$t_T \stackrel{L}{\longrightarrow} (\sigma/\lambda)\,\frac{\frac{1}{2}\{[W(1)]^2 - 1\} - W(1)\cdot\int_0^1 W(r)\,dr}{\int_0^1[W(r)]^2\,dr - \left[\int_0^1 W(r)\,dr\right]^2} \;\div\; \left\{\frac{\sigma^2}{\lambda^2\left[\int_0^1[W(r)]^2\,dr - \left(\int_0^1 W(r)\,dr\right)^2\right]}\right\}^{1/2}$$
$$= \frac{\frac{1}{2}\{[W(1)]^2 - 1\} - W(1)\cdot\int_0^1 W(r)\,dr}{\left\{\int_0^1[W(r)]^2\,dr - \left[\int_0^1 W(r)\,dr\right]^2\right\}^{1/2}}.$$
This is the same distribution as in (26). Thus, the usual $t$ test of $\beta = 1$ from OLS estimation of an AR($p$) can be compared with Theorem 2 and use Case 2 of Table B.6 without any correction for the fact that $u_t$ is serially correlated (or that lagged $\Delta Y$ terms are included in the regression).
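As a concrete illustration, both ADF statistics of Theorem 5 can be computed with plain least squares. The following is a hedged sketch (NumPy only, $p = 2$, all names illustrative), simulating the null model (41) with $\zeta_1 = 0.4$:

```python
import numpy as np

rng = np.random.default_rng(1)
T = 1000
eps = rng.standard_normal(T + 2)
dY = np.zeros(T + 2)
for t in range(1, T + 2):
    dY[t] = 0.4 * dY[t - 1] + eps[t]           # null model (41) with zeta_1 = 0.4
Y = np.cumsum(dY)

# ADF regression: Y_t = alpha + beta*Y_{t-1} + zeta_1*dY_{t-1} + error
y = Y[2:]
X = np.column_stack([np.ones(T), Y[1:-1], dY[1:-1]])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
s2 = resid @ resid / (T - 3)                   # OLS estimate of the error variance
se_beta = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])

coef_stat = T * (b[1] - 1.0) / (1.0 - b[2])    # Theorem 5(a): corrected coefficient statistic
t_stat = (b[1] - 1.0) / se_beta                # Theorem 5(b): the usual t ratio
```

Both statistics are then compared with the Dickey-Fuller tables for the regression with a constant term, as the theorem states.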
3.3 Augmented Dickey-Fuller Test, $Y_t$ is an ARMA($p,q$) process
The fact that the distribution of $\hat\beta_T$ in the ADF regression is the Dickey-Fuller distribution has been extended by Said and Dickey (1984) to the more general case in which, under the null hypothesis, the series of first differences is of the general ARMA($p,q$) form with unknown $p$ and $q$. They showed that a regression model such as (39) is still valid for testing the unit root null in the presence of serially correlated errors, provided the number of lagged $\Delta Y$ terms included as regressors increases with the sample size at a controlled rate, $T^{1/3}$. Essentially, the moving average terms are approximated by including enough autoregressive terms.

Consider the general ARIMA($p,1,q$) model defined by
$$Y_t = \beta Y_{t-1} + u_t, \qquad (48)$$
$$(1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p)u_t = (1 + \theta_1 L + \theta_2 L^2 + \dots + \theta_q L^q)\varepsilon_t \qquad (49)$$
with $Y_0 = 0$, $\varepsilon_t$ i.i.d., and $\beta = 1$. Equations (48) and (49) can equivalently be written as
$$(1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p)\Delta Y_t = (1 + \theta_1 L + \theta_2 L^2 + \dots + \theta_q L^q)\varepsilon_t. \qquad (50)$$
Therefore, (41) is a special case of (50) in which $\theta_j = 0$, $j = 1,2,\dots,q$. Rewrite (50) as
$$\eta(L)\Delta Y_t = \varepsilon_t, \qquad (51)$$
where $\eta(L) = (1 - \eta_1 L - \eta_2 L^2 - \dots) = (1 + \theta_1 L + \theta_2 L^2 + \dots + \theta_q L^q)^{-1}(1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p)$. That is,
$$\Delta Y_t = \eta_1\Delta Y_{t-1} + \eta_2\Delta Y_{t-2} + \eta_3\Delta Y_{t-3} + \dots + \varepsilon_t$$
or
$$Y_t = Y_{t-1} + \eta_1\Delta Y_{t-1} + \eta_2\Delta Y_{t-2} + \eta_3\Delta Y_{t-3} + \dots + \varepsilon_t. \qquad (52)$$
This motivates us to estimate the coefficients in (52) by regressing $Y_t$ on $Y_{t-1}, \Delta Y_{t-1}, \Delta Y_{t-2}, \dots, \Delta Y_{t-k}$, where $k$ is a suitably chosen integer. To obtain consistent estimators of the coefficients in (52), it is necessary to let $k$ grow as a function of $T$.

Consider a truncated version of (52),
$$Y_t = \alpha + \beta Y_{t-1} + \eta_1\Delta Y_{t-1} + \eta_2\Delta Y_{t-2} + \dots + \eta_k\Delta Y_{t-k} + e_{tk} = x_t'\boldsymbol{\beta} + e_{tk}, \qquad (53)$$
where $\boldsymbol{\beta} \equiv (\alpha,\beta,\eta_1,\eta_2,\dots,\eta_k)'$ and $x_t \equiv (1,Y_{t-1},\Delta Y_{t-1},\Delta Y_{t-2},\dots,\Delta Y_{t-k})'$. Notice that $e_{tk}$ is not white noise. Even so, the limiting distribution of the $t$ statistic for the coefficient on the lagged $Y_{t-1}$ (i.e., $\hat\beta_T$) from OLS estimation of (53) is the same Dickey-Fuller $t$-distribution as when $u_t$ is i.i.d..
Theorem 6 (Said-Dickey ADF): Let the data $Y_t$ be generated by (48) and (49), and let the regression model be (53). Assume that $T^{-1/3}k \to 0$ and that there exist $c > 0$, $r > 0$ such that $ck > T^{1/r}$; then as $T\to\infty$,
$$t_T = \frac{\hat\beta_T - 1}{\{s_T^2\, e_2'(\sum x_t x_t')^{-1} e_2\}^{1/2}} \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - 1\} - W(1)\cdot\int_0^1 W(r)\,dr}{\left\{\int_0^1[W(r)]^2\,dr - \left[\int_0^1 W(r)\,dr\right]^2\right\}^{1/2}},$$
where $e_2 = [0\ 1\ 0\ \dots\ 0]'$ selects the coefficient on $Y_{t-1}$ in (53).

The intuition behind letting $k$ grow with $T$ is clear from the fact that
$$e_{tk} = \eta_{k+1}\Delta Y_{t-k-1} + \eta_{k+2}\Delta Y_{t-k-2} + \dots + \varepsilon_t.$$
Then, as $k\to\infty$,
$$e_{tk} - \varepsilon_t = \eta_{k+1}\Delta Y_{t-k-1} + \eta_{k+2}\Delta Y_{t-k-2} + \dots \stackrel{p}{\longrightarrow} 0$$
from the absolute summability of the $\eta_j$, i.e., $\sum_{j=0}^{\infty}|\eta_j| < \infty$, which implies $\eta_j \to 0$ and $\eta_{k+1}\Delta Y_{t-k-1} + \eta_{k+2}\Delta Y_{t-k-2} + \dots \to 0$. It is therefore to be expected that an ARIMA($p,1,q$) process has the same asymptotic behavior as an ARIMA($p,1,0$) process, so the asymptotic results should be derived under the condition that $k \to \infty$. However, $k$ cannot increase too quickly relative to $T$; i.e., we need the condition $T^{-1/3}k \to 0$.
3.3.1 Choice of Lag Length in the ADF Test
It has been observed that the size and power properties of the ADF test are sensitive to the number of lagged terms ($k$) used. Several guidelines have been suggested for the choice of $k$. Ng and Perron (1995) examine these in detail. The guidelines are:

(a). Rules fixing $k$ at an arbitrary level independent of $T$. Overall, choosing a fixed $k$ is not desirable, judging from their detailed simulations.

(b). Rules fixing $k$ as a function of $T$. A commonly used rule is the one suggested by Schwert (1989), which is to choose
$$k = \mathrm{Int}\{c(T/100)^{1/d}\}.$$
Schwert suggests $c = 12$ and $d = 4$. The problem with such a rule is that it need not be optimal for all $p$ and $q$ in the ARMA($p,q$).

(c). Information-based rules. The information criteria suggest choosing $k$ to minimize an objective function that trades off parsimony against the reduction in the sum of squares. The objective function is of the form (see also p.5 of Ch. 16)
$$I_k = \log\hat\sigma_k^2 + \frac{kC_T}{T}.$$
The Akaike information criterion (AIC) chooses $C_T = 2$. The Schwarz Bayesian information criterion (BIC) chooses $C_T = \log T$. Ng and Perron argue that AIC and BIC are asymptotically the same for ARMA($p,q$) models and that both choose $k$ proportional to $\log T$.

(d). Sequential rules. Hall (1994) discusses a general-to-specific rule, which is to start with a large value of $k$ ($k_{\max}$), test the significance of the last coefficient, and reduce $k$ iteratively until a significant statistic is encountered.
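Rules (b) and (c) are straightforward to operationalize. The following sketch (NumPy; function names are illustrative, and the `penalty` argument carries $C_T$) implements Schwert's rule and the information-criterion minimization on a common estimation sample:

```python
import numpy as np

def schwert_k(T, c=12, d=4):
    """Schwert's rule of thumb: k = Int{c * (T/100)**(1/d)}."""
    return int(c * (T / 100.0) ** (1.0 / d))

def ic_lag(dY, kmax, penalty):
    """Choose k in 0..kmax minimizing I_k = log(sigma2_k) + k*penalty/T,
    estimated on a common sample so the criteria are comparable across k."""
    T_eff = len(dY) - kmax
    best_k, best_ic = 0, np.inf
    for k in range(kmax + 1):
        y = dY[kmax:]
        cols = [np.ones(T_eff)] + [dY[kmax - j:len(dY) - j] for j in range(1, k + 1)]
        X = np.column_stack(cols)
        resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
        ic = np.log(resid @ resid / T_eff) + k * penalty / T_eff
        if ic < best_ic:
            best_k, best_ic = k, ic
    return best_k

rng = np.random.default_rng(2)
eps = rng.standard_normal(600)
dY = np.zeros(600)
for t in range(1, 600):
    dY[t] = 0.5 * dY[t - 1] + eps[t]            # AR(1) first differences

kmax = schwert_k(600)                            # Schwert's upper bound, here 18
k_bic = ic_lag(dY, kmax, penalty=np.log(600))    # BIC: C_T = log T
k_aic = ic_lag(dY, kmax, penalty=2.0)            # AIC: C_T = 2
```

Because AIC's penalty is smaller than BIC's, AIC never selects fewer lags than BIC on the same sample.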
Ng and Perron (1995) compare AIC, BIC, and Hall's general-to-specific approach through a Monte Carlo study. Their major conclusions are:

(a). Both AIC and BIC choose very small values of $k$ (e.g., $k = 3$). This results in large size distortions, especially with MA errors. (Remark: The intuition behind this conclusion is that with MA errors the model is an AR($\infty$); if you use too few lags, your model does not look much like an AR($\infty$), and since the asymptotic result is derived under $k\to\infty$, your finite-sample distribution is far from the asymptotic distribution, which produces size distortion.)

(b). Hall's criterion tends to choose higher values of $k$. The higher $k_{\max}$ is, the higher is the chosen value of $k$. This results in the size being close to the nominal level, but of course at a loss of power. (Remark: The unit root test statistics derived so far have distributions under the null. On the other side, one can derive the behavior of the test statistics under the alternative hypothesis of a stationary or fractionally differenced process. These test statistics should diverge, to $-\infty$ say for a left-tailed test, when the alternative hypothesis is true, so that the null is rejected consistently, i.e., so that power tends to one. A common result is that the asymptotic behavior of the unit root test statistics under the alternative hypothesis is a function of $k$, or more precisely of $(T/k)$, say. For fixed $T$, $(T/k)$ is smaller the larger $k$ is. This causes the unit root statistic to diverge to $-\infty$ more slowly under the alternative hypothesis, and hence lowers the power. See Lee and Schmidt (1996) and Lee and Shie (2004).)

What this study suggests is that Hall's general-to-specific method is preferable to the others. DeJong et al. (1992) show that increasing $k$ typically results in a modest decrease in power but a substantial decrease in size distortion. If this is the case, the information criteria are at a disadvantage because they result in the choice of very small values of $k$. However, Stock (1994) presents opposing evidence arguing in favor of BIC compared with Hall's method.
3.4 Phillips-Perron Test, $u_t$ is a mixing process
In contrast to the ADF approach, where the AR(1) unit root process was extended to AR($p$) and ARMA($p,q$) processes with a unit root, Phillips (1987) and Phillips and Perron (1988) extend the random walk (the AR(1) process with a unit root) to a general setting that allows for weakly dependent and heterogeneously distributed innovations. The model Phillips (1987) considers is
$$Y_t = \beta Y_{t-1} + u_t, \qquad (54)$$
$$\beta = 1, \qquad (55)$$
where $Y_0 = 0$ (not necessary in Phillips's original paper) and $u_t$ is a weakly dependent and heterogeneously distributed innovation sequence to be specified below.
We consider the three least squares regressions
$$Y_t = \hat\beta Y_{t-1} + \hat u_t, \qquad (56)$$
$$Y_t = \hat\alpha + \hat\beta Y_{t-1} + \hat u_t, \qquad (57)$$
and
$$Y_t = \hat\alpha + \hat\beta Y_{t-1} + \hat\delta\big(t - \tfrac{1}{2}T\big) + \hat u_t, \qquad (58)$$
where $\hat\beta$, $(\hat\alpha,\hat\beta)$, and $(\hat\alpha,\hat\delta,\hat\beta)$ are the conventional least-squares regression coefficients. Phillips (1987) and Phillips and Perron (1988) were concerned with the limiting distributions of the regression coefficients in (56), (57), and (58) under the null hypothesis that the data are generated by (54) and (55).
So far, we have assumed that the sequence $u_t$ used to construct $W_T$ is i.i.d.. Nevertheless, just as we can obtain central limit theorems when $u_t$ is not necessarily i.i.d., so also can we obtain an FCLT when $u_t$ is not necessarily i.i.d.. Here we present a version of the FCLT, due to McLeish (1975), under very weak assumptions on $u_t$.

Theorem 7 (McLeish): Let $u_t$ satisfy
(a). $E(u_t) = 0$;
(b). $\sup_t E|u_t|^{\gamma} < \infty$ for some $\gamma > 2$;
(c). $\lambda^2 = \lim_{T\to\infty} E[T^{-1}(\sum u_t)^2]$ exists and $\lambda^2 > 0$; and
(d). $u_t$ is strong mixing with mixing coefficients $\alpha_m$ that satisfy $\sum_{m=1}^{\infty}\alpha_m^{1-2/\gamma} < \infty$;
then $W_T \Rightarrow W$, where $W_T(r) \equiv T^{-1/2}\sum_{t=1}^{[Tr]} u_t/\lambda$.

These conditions (a)–(d) allow for both temporal dependence (through mixing) and heteroskedasticity (as long as $\lambda^2 = \lim_{T\to\infty} E[T^{-1}(\sum u_t)^2]$ exists) in the process $u_t$. (Hint: when $u_t$ is an i.i.d. process, $\lambda^2 = \lim_{T\to\infty} E[T^{-1}(\sum u_t)^2] = \sigma^2$, and this result reduces to (10).)
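A quick way to see the role of $\lambda$ in Theorem 7 is to simulate AR(1) errors, for which $\lambda = \sigma_\varepsilon/(1-\phi)$: with this scaling, $W_T(1)$ is approximately standard normal. A sketch, NumPy assumed, with $\phi = 0.5$ and $\sigma_\varepsilon = 1$:

```python
import numpy as np

rng = np.random.default_rng(3)
T, reps, phi = 2000, 500, 0.5
lam = 1.0 / (1.0 - phi)        # long-run std. dev. of AR(1) u_t with unit-variance eps
WT1 = np.empty(reps)
for r in range(reps):
    eps = rng.standard_normal(T)
    u = np.zeros(T)
    for t in range(1, T):
        u[t] = phi * u[t - 1] + eps[t]
    WT1[r] = u.sum() / (np.sqrt(T) * lam)   # W_T(1) = T^{-1/2} * sum(u_t) / lambda
# WT1 should be approximately N(0,1); without dividing by lambda its variance
# would be lambda^2 = 4, not 1
```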
We first present the following asymptotic results for the sample moments, which are useful for deriving the asymptotics of the OLS estimator.

Lemma 2:
Let $u_t$ be a random sequence that satisfies the assumptions in Theorem 7, with $\sup_t E|u_t|^{\gamma+\eta} < \infty$ for some $\eta > 0$, and let
$$Y_t = u_1 + u_2 + \dots + u_t \quad \text{for } t = 1,2,\dots,T, \qquad (59)$$
with $Y_0 = 0$. Then
(a) $T^{-1/2}\sum_{t=1}^T u_t \stackrel{L}{\longrightarrow} \lambda W(1)$,
(b) $T^{-2}\sum_{t=1}^T Y_{t-1}^2 \Rightarrow \lambda^2\int_0^1 [W(r)]^2\,dr$,
(c) $T^{-3/2}\sum_{t=1}^T Y_{t-1} \Rightarrow \lambda\int_0^1 W(r)\,dr$,
(d) $T^{-1}\sum_{t=1}^T u_t^2 \stackrel{p}{\longrightarrow} \sigma_u^2 \equiv \lim_{T\to\infty} T^{-1}\sum_{t=1}^T E(u_t^2)$,
(e) $T^{-1}\sum_{t=1}^T Y_{t-1}u_t \Rightarrow \frac{1}{2}\{\lambda^2[W(1)]^2 - \sigma_u^2\}$,
(f) $T^{-3/2}\sum_{t=1}^T t\,u_t \Rightarrow \lambda[W(1) - \int_0^1 W(r)\,dr]$,
(g) $T^{-5/2}\sum_{t=1}^T t\,Y_{t-1} \Rightarrow \lambda\int_0^1 rW(r)\,dr$,
(h) $T^{-3}\sum_{t=1}^T t\,Y_{t-1}^2 \Rightarrow \lambda^2\int_0^1 r[W(r)]^2\,dr$.

Joint weak convergence of the sample moments given above to their respective limits is easily established and will be used below.
Proof:
The proofs of items (a), (b), (c), (f), (g), and (h) are analogous to those of Lemma 1. Item (d) follows from the LLN for a mixing process.
(e). For a random walk, $Y_t^2 = (Y_{t-1} + u_t)^2 = Y_{t-1}^2 + 2Y_{t-1}u_t + u_t^2$, implying that $Y_{t-1}u_t = \frac{1}{2}\{Y_t^2 - Y_{t-1}^2 - u_t^2\}$ and therefore $\sum_{t=1}^T Y_{t-1}u_t = \frac{1}{2}\{Y_T^2 - Y_0^2\} - \frac{1}{2}\sum_{t=1}^T u_t^2$. Recalling that $Y_0 = 0$, it is convenient to write $\sum_{t=1}^T Y_{t-1}u_t = \frac{1}{2}Y_T^2 - \frac{1}{2}\sum_{t=1}^T u_t^2$. From item (a) we know that $T^{-1}Y_T^2 = (T^{-1/2}\sum_{s=1}^T u_s)^2 \stackrel{L}{\longrightarrow} \lambda^2[W(1)]^2$, and $T^{-1}\sum_{t=1}^T u_t^2 \stackrel{p}{\longrightarrow} \sigma_u^2$ by the LLN (McLeish); hence
$$T^{-1}\sum_{t=1}^T Y_{t-1}u_t \Rightarrow \frac{1}{2}\{\lambda^2[W(1)]^2 - \sigma_u^2\}.$$
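Items (b) and (d) of Lemma 2 are easy to check numerically. Since $E\!\int_0^1[W(r)]^2dr = 1/2$, the sample moment $T^{-2}\sum Y_{t-1}^2$ should average about $\lambda^2/2$ across replications. A sketch (NumPy, AR(1) errors, illustrative parameter choices):

```python
import numpy as np

rng = np.random.default_rng(4)
T, reps, phi = 1000, 400, 0.5
lam2 = 1.0 / (1.0 - phi) ** 2      # lambda^2 for AR(1) errors with sigma_eps = 1
sig2_u = 1.0 / (1.0 - phi ** 2)    # sigma_u^2 = Var(u_t)
m_b = np.empty(reps)
for r in range(reps):
    eps = rng.standard_normal(T)
    u = np.zeros(T)
    for t in range(1, T):
        u[t] = phi * u[t - 1] + eps[t]
    Y = np.cumsum(u)
    m_b[r] = (Y[:-1] ** 2).sum() / T ** 2   # Lemma 2(b): => lambda^2 * int W^2
m_d = (u ** 2).mean()                        # Lemma 2(d), checked on the last replication
# m_b.mean() should be close to lam2/2 = 2; m_d should be close to sig2_u = 4/3
```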
3.4.1 No Constant Term or Time Trend in the Regression; True Process Is a Random Walk
We first consider the case where there is no constant term or time trend in the regression model, but the true process is a random walk. The asymptotic distributions of the OLS unit root coefficient estimator and the $t$-ratio test statistic are as follows.

Theorem 8:
Let the data $Y_t$ be generated by (54) and (55), and let $u_t$ be a random sequence that satisfies the assumptions in Theorem 7, with $\sup_t E|u_t|^{\gamma+\eta} < \infty$ for some $\eta > 0$. Then as $T\to\infty$, for the regression model (56),
$$T(\hat\beta_T - 1) \Rightarrow \frac{\frac{1}{2}([W(1)]^2 - \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2\,dr}$$
and
$$t = \frac{\hat\beta_T - 1}{\hat\sigma_{\hat\beta_T}} \Rightarrow \frac{(\lambda/2\sigma_u)\{[W(1)]^2 - \sigma_u^2/\lambda^2\}}{\{\int_0^1[W(r)]^2\,dr\}^{1/2}},$$
where $\hat\sigma_{\hat\beta_T} = [s_T^2 \div \sum_{t=1}^T Y_{t-1}^2]^{1/2}$ and $s_T^2$ denotes the OLS estimate of the disturbance variance:
$$s_T^2 = \sum_{t=1}^T (Y_t - \hat\beta_T Y_{t-1})^2/(T-1).$$
Proof:
The deviation of the OLS estimate from the true value is characterized by
$$T(\hat\beta_T - 1) = \frac{T^{-1}\sum_{t=1}^T Y_{t-1}u_t}{T^{-2}\sum_{t=1}^T Y_{t-1}^2}, \qquad (60)$$
which is a continuous function of the quantities in Lemma 2's (b) and (e). It follows that under the null hypothesis that $\beta = 1$, the OLS estimator $\hat\beta$ is characterized by
$$T(\hat\beta_T - 1) \Rightarrow \frac{\frac{1}{2}\{\lambda^2[W(1)]^2 - \sigma_u^2\}}{\lambda^2\int_0^1[W(r)]^2\,dr} = \frac{\frac{1}{2}([W(1)]^2 - \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2\,dr}. \qquad (61)$$
To prove the second part of the theorem, first note that since $\hat\beta$ is a consistent estimator of $\beta$ by (61), $s_T^2$ is a consistent estimator of $\sigma_u^2$, by arguments analogous to those in Theorem 2. We can then express the $t$ statistic alternatively as
$$t_T = T(\hat\beta_T - 1)\left\{T^{-2}\sum_{t=1}^T Y_{t-1}^2\right\}^{1/2} \div (s_T^2)^{1/2}$$
or
$$t_T = \frac{T^{-1}\sum_{t=1}^T Y_{t-1}u_t}{\left\{T^{-2}\sum_{t=1}^T Y_{t-1}^2\right\}^{1/2}(s_T^2)^{1/2}},$$
which is a continuous function of the quantities in Lemma 2's (b) and (e). It follows that under the null hypothesis that $\beta = 1$, the asymptotic distribution of the OLS $t$ statistic is characterized by
$$t_T \stackrel{L}{\longrightarrow} \frac{\frac{1}{2}\{\lambda^2[W(1)]^2 - \sigma_u^2\}}{\left\{\lambda^2\int_0^1[W(r)]^2\,dr\right\}^{1/2}(\sigma_u^2)^{1/2}} = \frac{(\lambda/2\sigma_u)\{[W(1)]^2 - \sigma_u^2/\lambda^2\}}{\{\int_0^1[W(r)]^2\,dr\}^{1/2}}. \qquad (62)$$
This completes the proof of the theorem.
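The location shift caused by $\sigma_u^2 \neq \lambda^2$ can be made visible by Monte Carlo: with positively autocorrelated $u_t$ we have $\sigma_u^2/\lambda^2 < 1$, so the numerator $\frac{1}{2}([W(1)]^2 - \sigma_u^2/\lambda^2)$ has positive mean and $T(\hat\beta_T - 1)$ shifts to the right relative to the i.i.d. case. A sketch (NumPy, illustrative parameter choices):

```python
import numpy as np

rng = np.random.default_rng(5)
T, reps = 500, 300

def coef_stat_draws(phi):
    """Draws of T*(beta_hat - 1) from the no-constant regression (56),
    with AR(1) errors u_t = phi*u_{t-1} + eps_t."""
    out = np.empty(reps)
    for r in range(reps):
        eps = rng.standard_normal(T)
        u = np.zeros(T)
        for t in range(1, T):
            u[t] = phi * u[t - 1] + eps[t]
        Y = np.cumsum(u)
        beta = (Y[:-1] @ Y[1:]) / (Y[:-1] @ Y[:-1])
        out[r] = T * (beta - 1.0)
    return out

stat_iid = coef_stat_draws(0.0)   # sigma_u^2 = lambda^2: the Dickey-Fuller case
stat_ar = coef_stat_draws(0.6)    # sigma_u^2/lambda^2 = 0.25: shifted to the right
```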
Theorem 8 extends (18) and (24) to the very general case of weakly dependent and heterogeneously distributed data. When $u_t$ is an i.i.d. sequence, $\lambda^2 = \lim T^{-1}E[(\sum u_t)^2] = \lim T^{-1}\sum E(u_t^2) = \sigma_u^2$, and the results of Theorem 8 reduce to those of Theorem 2.

Several features of the asymptotic distribution result for the least squares estimator, (61), are noteworthy. First, the scale factor here is $T$, not $\sqrt{T}$ as it previously has been. Thus $\hat\beta$ is "collapsing" to its limit at a much faster rate than before; this is sometimes called superconsistency. Next, the limiting distribution is no longer normal; instead, we have a distribution that is a somewhat complicated function of a Wiener process. When $\sigma_u^2 = \lambda^2$ (independence) we have the distribution of J. S. White (1958, p.1196), apart from an incorrect scaling there, as noted by Phillips (1987). For $\sigma_u^2 = \lambda^2$, this distribution is also the one tabulated by Dickey and Fuller (1979) in their famous work on testing for a unit root.

In the regression settings studied in previous chapters, serial correlation in $u_t$ in the presence of a lagged dependent variable as regressor leads to inconsistency of $\hat\beta$ for $\beta_0$, as discussed in Chapter 10. Here, however, the situation is quite different. Even though the regressor is a lagged dependent variable, $\hat\beta$ is consistent for $\beta_0 = 1$, despite the fact that conditions (c) and (d) of Theorem 7 permit $u_t$ to display considerable correlation.

The effect of the serial correlation is that $\sigma_u^2 \neq \lambda^2$. This results in a shift of the location of the asymptotic distribution away from zero (since $E([W(1)]^2 - \sigma_u^2/\lambda^2) \neq 0$, where $E([W(1)]^2) = E(\chi^2(1)) = 1$) relative to the $\sigma_u^2 = \lambda^2$ case (no serial correlation). Despite this effect of the serial correlation in $u_t$, we no longer have the serious adverse consequence of inconsistency of $\hat\beta$.

One way of understanding why this is so is succinctly expressed by Phillips (1987, p.283):

    Intuitively, when the data generating process has a unit root, the strength of the signal (as measured by the sample variation of the regressor $Y_{t-1}$) dominates the noise by a factor of $O(T)$, so that the effects of any regressor-error correlation are annihilated in the regression as $T\to\infty$.

Note, however, that even when $\sigma_u^2 = \lambda^2$, the asymptotic distribution given in (61) is not centered about zero, so an asymptotic bias is still present. The reason is that there generally exists a strong (negative) correlation between $[W(1)]^2$ and $(\int_0^1[W(r)]^2\,dr)^{-1}$, resulting from the fact that $[W(1)]^2$ and $[W(r)]^2$ are highly correlated for each $r$. Thus, even though $E([W(1)]^2 - \sigma_u^2/\lambda^2) = 0$ when $\sigma_u^2 = \lambda^2$, we do not have
$$E\left[\frac{\frac{1}{2}([W(1)]^2 - \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2\,dr}\right] = 0.$$
See Abadir (1995) for further details.
3.4.2 Estimation of $\lambda^2$ and $\sigma_u^2$
The limiting distributions given in Theorem 8 depend on the unknown parameters $\sigma_u^2$ and $\lambda^2$. These distributions are therefore not directly usable for statistical testing. However, both parameters may be consistently estimated, and the estimates may be used to construct modified statistics whose limiting distributions are independent of $(\lambda^2, \sigma_u^2)$. As we shall see, these new statistics provide very general tests for the presence of a unit root in (54).

As shown in Lemma 2(d), $T^{-1}\sum_{t=1}^T u_t^2 \stackrel{p}{\longrightarrow} \sigma_u^2$. This provides us with the simple estimator
$$s_u^2 = T^{-1}\sum_{t=1}^T (Y_t - Y_{t-1})^2 = T^{-1}\sum_{t=1}^T u_t^2,$$
which is consistent for $\sigma_u^2$ under the null hypothesis $\beta = 1$. Since $\hat\beta \to 1$ by Theorem 8, we may also use $\hat s_u^2 = T^{-1}\sum_{t=1}^T (Y_t - \hat\beta Y_{t-1})^2$ as a consistent estimator of $\sigma_u^2$.

Consistent estimation of $\lambda^2 = \lim_{T\to\infty} T^{-1}E(\sum_{t=1}^T u_t)^2$ is more difficult. We start by defining
$$\lambda_T^2 = T^{-1}E\Big(\sum_{t=1}^T u_t\Big)^2 = T^{-1}\sum_{t=1}^T E(u_t^2) + 2T^{-1}\sum_{\tau=1}^{T-1}\sum_{t=\tau+1}^T E(u_t u_{t-\tau})$$
and by introducing the approximation
$$\lambda_{Tl}^2 = T^{-1}\sum_{t=1}^T E(u_t^2) + 2T^{-1}\sum_{\tau=1}^{l}\sum_{t=\tau+1}^T E(u_t u_{t-\tau}).$$
We shall call $l$ the lag truncation number. For large $T$ and large $l < T$, $\lambda_{Tl}^2$ may be expected to be very close to $\lambda_T^2$ if the total contribution to $\lambda_T^2$ of covariances such as $E(u_t u_{t-\tau})$ with long lags $\tau > l$ is small. This will be true if $u_t$ satisfies the assumptions in Theorem 7. Formally, we have the following lemma.

Lemma 3:
If the sequence $u_t$ satisfies the assumptions in Theorem 7 and if $l\to\infty$ as $T\to\infty$, then $\lambda_T^2 - \lambda_{Tl}^2 \to 0$ as $T\to\infty$.

This lemma suggests that under suitable conditions on the rate at which $l\to\infty$ as $T\to\infty$, we may estimate $\lambda^2$ from a finite sample of data by sequentially estimating $\lambda_{Tl}^2$. We define
$$s_{Tl}^2 = T^{-1}\sum_{t=1}^T u_t^2 + 2T^{-1}\sum_{\tau=1}^{l}\sum_{t=\tau+1}^T u_t u_{t-\tau}. \qquad (63)$$
The following result establishes that $s_{Tl}^2$ is a consistent estimator of $\lambda^2$.

Theorem 9:
If
(a). $u_t$ satisfies all the assumptions in Theorem 7, except that part (b) is replaced by the stronger moment condition $\sup_t E|u_t|^{2\gamma} < \infty$ for some $\gamma > 2$; and
(b). $l\to\infty$ as $T\to\infty$ such that $l = o(T^{1/4})$;
then $s_{Tl}^2 \stackrel{p}{\longrightarrow} \lambda^2$ as $T\to\infty$.

According to this result, if we allow the number of estimated autocovariances to increase as $T\to\infty$ but control the rate of increase so that $l = o(T^{1/4})$, then $s_{Tl}^2$ yields a consistent estimator of $\lambda^2$. Inevitably, the choice of $l$ will be an empirical matter.

Rather than using the first differences $u_t = Y_t - Y_{t-1}$ in the construction of $s_{Tl}^2$, we could have used the residuals $\hat u_t = Y_t - \hat\beta Y_{t-1}$ from the least squares regression. Since $\hat\beta \to 1$, this estimator is also consistent for $\lambda^2$ under the null hypothesis $\beta = 1$.

We remark that $s_{Tl}^2$ as presently defined in (63) is not constrained to be nonnegative. When there are large negative sample serial covariances, $s_{Tl}^2$ can take on negative values. Newey and West (1987) have suggested a modification to variance estimators such as $s_{Tl}^2$ which ensures that they are nonnegative. In the present case, the modification yields
$$\hat s_{Tl}^2 = T^{-1}\sum_{t=1}^T \hat u_t^2 + 2T^{-1}\sum_{\tau=1}^{l} w_{\tau l}\sum_{t=\tau+1}^T \hat u_t\hat u_{t-\tau}, \qquad (64)$$
where
$$w_{\tau l} = 1 - \tau/(l+1), \qquad (65)$$
which puts higher weight on the lower-order autocovariances.
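Equations (63)–(65) translate directly into code. The sketch below (NumPy, with an illustrative bandwidth rule of thumb) implements the Bartlett-weighted estimator (64) and checks it against $\lambda^2 = 1/(1-\phi)^2 = 4$ for AR(1) data:

```python
import numpy as np

def newey_west_lrv(u, l):
    """Bartlett-weighted long-run variance, eqs. (64)-(65):
    T^{-1} sum u_t^2 + 2 T^{-1} sum_{tau<=l} w_{tau,l} sum_{t>tau} u_t u_{t-tau},
    with w_{tau,l} = 1 - tau/(l+1); nonnegative by construction."""
    u = np.asarray(u, dtype=float)
    T = len(u)
    s2 = u @ u / T
    for tau in range(1, l + 1):
        w = 1.0 - tau / (l + 1.0)
        s2 += 2.0 * w * (u[tau:] @ u[:-tau]) / T
    return s2

rng = np.random.default_rng(6)
phi, T = 0.5, 20000
eps = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = phi * u[t - 1] + eps[t]
l = int(4 * (T / 100.0) ** 0.25)   # an illustrative bandwidth rule of thumb
lrv = newey_west_lrv(u, l)          # approaches lambda^2 = 4, with some downward
                                    # Bartlett bias at finite l
```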
3.4.3 New Tests for a Unit Root
The consistent estimators $\hat s_u^2$ and $\hat s_{Tl}^2$ may be used to develop new tests for unit roots that apply under very general conditions. We define the statistics
$$Z_{\hat\beta} = T(\hat\beta - 1) - \frac{\frac{1}{2}(\hat s_{Tl}^2 - \hat s_u^2)}{T^{-2}\sum_{t=1}^T Y_{t-1}^2} \qquad (66)$$
and
$$Z_{\hat t} = t_T\cdot(\hat s_u^2/\hat s_{Tl}^2)^{1/2} - \frac{1}{2}(\hat s_{Tl}^2 - \hat s_u^2)\left[\hat s_{Tl}\left(T^{-2}\sum_{t=1}^T Y_{t-1}^2\right)^{1/2}\right]^{-1}. \qquad (67)$$
$Z_{\hat\beta}$ is a transformation of the standardized estimator $T(\hat\beta - 1)$, and $Z_{\hat t}$ is a transformation of the regression $t$ statistic as in (62). The limiting distributions of $Z_{\hat\beta}$ and $Z_{\hat t}$ are given by the following result.
Theorem 10 (Phillips 1987):
If the conditions of Lemma 3 are satisfied, then as $T\to\infty$,
(a).
$$Z_{\hat\beta} \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - 1\}}{\int_0^1[W(r)]^2\,dr}$$
and
(b).
$$Z_{\hat t} \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - 1\}}{\{\int_0^1[W(r)]^2\,dr\}^{1/2}}$$
under the null hypothesis that the data are generated by (54) and (55).

Proof:
(a). From (61) we have
$$T(\hat\beta_T - 1) \Rightarrow \frac{\frac{1}{2}([W(1)]^2 - \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2\,dr},$$
and from Lemma 2(b) and Theorem 9 we have
$$\frac{\frac{1}{2}(s_{Tl}^2 - s_u^2)}{T^{-2}\sum_{t=1}^T Y_{t-1}^2} \Rightarrow \frac{\frac{1}{2}(\lambda^2 - \sigma_u^2)}{\lambda^2\int_0^1[W(r)]^2\,dr} \equiv \frac{\frac{1}{2}(1 - \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2\,dr}.$$
Therefore, the test statistic $Z_{\hat\beta}$ is distributed as
$$Z_{\hat\beta} = T(\hat\beta_T - 1) - \frac{\frac{1}{2}(\hat s_{Tl}^2 - \hat s_u^2)}{T^{-2}\sum_{t=1}^T Y_{t-1}^2} \Rightarrow \frac{\frac{1}{2}([W(1)]^2 - \sigma_u^2/\lambda^2 - 1 + \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2\,dr} \equiv \frac{\frac{1}{2}([W(1)]^2 - 1)}{\int_0^1[W(r)]^2\,dr}.$$
(b). From (62) and the consistency of $\hat s_u^2$ and $\hat s_{Tl}^2$ we have
$$t_T\cdot(\hat s_u^2/\hat s_{Tl}^2)^{1/2} \Rightarrow \frac{\frac{1}{2}\{[W(1)]^2 - \sigma_u^2/\lambda^2\}}{\{\int_0^1[W(r)]^2\,dr\}^{1/2}}. \qquad (68)$$
Consider the following statistic:
$$\frac{\frac{1}{2}(\hat s_{Tl}^2 - \hat s_u^2)}{\hat s_{Tl}\left[T^{-2}\sum_{t=1}^T Y_{t-1}^2\right]^{1/2}} \Rightarrow \frac{\frac{1}{2}(\lambda^2 - \sigma_u^2)}{\lambda^2\left\{\int_0^1[W(r)]^2\,dr\right\}^{1/2}} \equiv \frac{\frac{1}{2}(1 - \sigma_u^2/\lambda^2)}{\left\{\int_0^1[W(r)]^2\,dr\right\}^{1/2}}. \qquad (69)$$
Combining (68) and (69), we have
$$Z_{\hat t} = t_T\cdot(\hat s_u^2/\hat s_{Tl}^2)^{1/2} - \frac{\frac{1}{2}(\hat s_{Tl}^2 - \hat s_u^2)}{\hat s_{Tl}\left[T^{-2}\sum_{t=1}^T Y_{t-1}^2\right]^{1/2}} \Rightarrow \frac{\frac{1}{2}([W(1)]^2 - \sigma_u^2/\lambda^2 - 1 + \sigma_u^2/\lambda^2)}{\left\{\int_0^1[W(r)]^2\,dr\right\}^{1/2}} \equiv \frac{\frac{1}{2}([W(1)]^2 - 1)}{\left\{\int_0^1[W(r)]^2\,dr\right\}^{1/2}}.$$
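The statistics (66)–(67) can be sketched in code as follows (NumPy, no-constant case, all names illustrative); $\hat s_{Tl}^2$ uses the Newey-West weights of (64)–(65):

```python
import numpy as np

def pp_test(Y, l):
    """Phillips' Z_beta and Z_t for the no-constant regression (56); a sketch of
    eqs. (66)-(67), using first-stage OLS residuals and Bartlett weights."""
    y, x = Y[1:], Y[:-1]
    T = len(y)
    beta = (x @ y) / (x @ x)
    u = y - beta * x
    s2_u = u @ u / T                            # estimate of sigma_u^2
    s2_Tl = s2_u
    for tau in range(1, l + 1):
        w = 1.0 - tau / (l + 1.0)
        s2_Tl += 2.0 * w * (u[tau:] @ u[:-tau]) / T   # estimate of lambda^2
    denom = (x @ x) / T ** 2                    # T^{-2} sum Y_{t-1}^2
    t_stat = (beta - 1.0) / np.sqrt(s2_u / (x @ x))
    Z_beta = T * (beta - 1.0) - 0.5 * (s2_Tl - s2_u) / denom
    Z_t = (t_stat * np.sqrt(s2_u / s2_Tl)
           - 0.5 * (s2_Tl - s2_u) / (np.sqrt(s2_Tl) * np.sqrt(denom)))
    return Z_beta, Z_t

rng = np.random.default_rng(7)
eps = rng.standard_normal(1000)
u = np.zeros(1000)
for t in range(1, 1000):
    u[t] = 0.5 * u[t - 1] + eps[t]              # serially correlated innovations
Y = np.cumsum(u)
Z_beta, Z_t = pp_test(Y, l=8)
```

Both corrected statistics are then referred to the Case 1 Dickey-Fuller tables, as Theorem 10 states.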
Theorem 10 demonstrates that the limiting distributions of the two statistics $Z_{\hat\beta}$ and $Z_{\hat t}$ are invariant within a very wide class of weakly dependent and possibly heterogeneously distributed innovations $u_t$. Moreover, the limiting distribution of $Z_{\hat\beta}$ is identical to that of $T(\hat\beta - 1)$ when $\lambda^2 = \sigma_u^2$, so the statistical tables reported in the section labeled Case 1 in Table B.5 are still usable.

The limiting distribution of $Z_{\hat t}$ given in Theorem 10 is identical to that of the regression $t_T$ statistic when $\lambda^2 = \sigma_u^2$. This is, in fact, the limiting distribution of the $t$ statistic when the innovations $u_t$ are i.i.d.$(0,\sigma^2)$. Therefore, the statistical tables reported in the section labeled Case 1 in Table B.6 are still usable.

Phillips and Perron (1988) analyze the asymptotic results of the OLS estimators when the regression contains a constant, as in (57), or a constant and a time trend, as in (58), under the assumption that the true data generating process is (54) and (55).
Theorem 11 (Phillips and Perron 1988):
If the conditions of Lemma 3 are satisfied, then as $T\to\infty$,
$$Z_{\hat\beta} = T(\hat\beta - 1) - \frac{\frac{1}{2}(\hat s_{Tl}^2 - \hat s_u^2)}{T^{-2}\sum_{t=1}^T (Y_{t-1} - \bar Y_{-1})^2},$$
$$Z_{\hat t} = \hat t_T\cdot(\hat s_u^2/\hat s_{Tl}^2)^{1/2} - \frac{1}{2}(\hat s_{Tl}^2 - \hat s_u^2)\left[\hat s_{Tl}\left(T^{-2}\sum_{t=1}^T (Y_{t-1} - \bar Y_{-1})^2\right)^{1/2}\right]^{-1},$$
for the regression (57) with a constant, and
$$Z_{\tilde\beta} = T(\tilde\beta - 1) - \frac{T^6}{24 D_X}(\tilde s_{Tl}^2 - \tilde s_u^2),$$
$$Z_{\tilde t} = \tilde t_T\cdot(\tilde s_u^2/\tilde s_{Tl}^2)^{1/2} - \frac{T^3(\tilde s_{Tl}^2 - \tilde s_u^2)}{4\sqrt{3}\,D_X^{1/2}\,\tilde s_{Tl}},$$
for the regression (58) with a constant and a time trend, have limiting distributions identical to those of the corresponding statistics when $\lambda^2 = \sigma_u^2$. Here $\bar Y_{-1} = \sum_{t=1}^{T-1} Y_t/(T-1)$, $D_X = \det(X'X)$ with regressors $X = (1, t, Y_{t-1})$, $\hat s_u^2 = T^{-1}\sum_{t=1}^T (Y_t - \hat\alpha - \hat\beta Y_{t-1})^2$, $\hat s_{Tl}^2 = T^{-1}\sum_{t=1}^T \hat u_t^2 + 2T^{-1}\sum_{\tau=1}^l w_{\tau l}\sum_{t=\tau+1}^T \hat u_t\hat u_{t-\tau}$, $\tilde s_u^2 = T^{-1}\sum_{t=1}^T [Y_t - \tilde\alpha - \tilde\beta Y_{t-1} - \tilde\delta(t - \frac{1}{2}T)]^2$, and $\tilde s_{Tl}^2 = T^{-1}\sum_{t=1}^T \tilde u_t^2 + 2T^{-1}\sum_{\tau=1}^l w_{\tau l}\sum_{t=\tau+1}^T \tilde u_t\tilde u_{t-\tau}$.
Exercise:
Reproduce Case 4 of Tables B.5 and B.6 on pp. 762 and 763, respectively, of Hamilton (1994). Two things should be noted:
(1). Confirm that the same results are obtained from non-Gaussian i.i.d. innovations.
(2). The constant $\alpha$ does not affect this distribution (so using $\alpha = 0$ and $\alpha = 10000$ will give identical results).

Exercise:
Reproduce Table 1 of Phillips and Perron (1988, p.344), from which you will have the chance to write your own unit root test (ADF and PP) programs in Gauss and to see what the size distortion and the lack of power of unit root tests are.
3.5 Phillips-Perron Test, $u_t$ is a MA($\infty$) process
Hamilton (1994, Ch. 17) has parameterized the Phillips-Perron test by assuming that the innovation in (54) is
$$u_t = \psi(L)\varepsilon_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}, \qquad (70)$$
where $\varepsilon_t$ is a white noise process $(0,\sigma_\varepsilon^2)$ and $\sum_{j=0}^{\infty} j\cdot|\psi_j| < \infty$.

3.5.1 Beveridge-Nelson Decomposition
Since (70) is a special case of the assumptions in Theorem 7 (McLeish), all we have to show is that the "long-run" variance $\lambda^2$ of Theorem 7 here equals $\sigma_\varepsilon^2\cdot\psi(1)^2$ (Hamilton, p.505, eq. 17.5.10). To do this, we need the Beveridge-Nelson decomposition.

Theorem 12 (Beveridge-Nelson (B-N) Decomposition):
Let
$$u_t = \psi(L)\varepsilon_t = \sum_{j=0}^{\infty}\psi_j\varepsilon_{t-j}, \qquad (71)$$
where $\varepsilon_t$ is a white noise process $(0,\sigma_\varepsilon^2)$ and $\sum_{j=0}^{\infty} j\cdot|\psi_j| < \infty$. Then
$$u_1 + u_2 + \dots + u_t = \psi(1)(\varepsilon_1 + \varepsilon_2 + \dots + \varepsilon_t) + \xi_t - \xi_0,$$
where $\psi(1) \equiv \sum_{j=0}^{\infty}\psi_j$, $\xi_t = \sum_{j=0}^{\infty}\alpha_j\varepsilon_{t-j}$, $\alpha_j = -(\psi_{j+1} + \psi_{j+2} + \psi_{j+3} + \dots)$, and $\sum_{j=0}^{\infty}|\alpha_j| < \infty$. (Therefore, $\xi_t$ is a stationary process, since it is an MA($\infty$) with absolutely summable coefficients.)
Proof:
Observe that
$$\sum_{s=1}^t u_s = \sum_{s=1}^t\sum_{j=0}^{\infty}\psi_j\varepsilon_{s-j}$$
$$= \{\psi_0\varepsilon_t + \psi_1\varepsilon_{t-1} + \psi_2\varepsilon_{t-2} + \dots + \psi_t\varepsilon_0 + \psi_{t+1}\varepsilon_{-1} + \dots\}$$
$$+ \{\psi_0\varepsilon_{t-1} + \psi_1\varepsilon_{t-2} + \psi_2\varepsilon_{t-3} + \dots + \psi_{t-1}\varepsilon_0 + \psi_t\varepsilon_{-1} + \dots\}$$
$$+ \{\psi_0\varepsilon_{t-2} + \psi_1\varepsilon_{t-3} + \psi_2\varepsilon_{t-4} + \dots + \psi_{t-2}\varepsilon_0 + \psi_{t-1}\varepsilon_{-1} + \dots\}$$
$$+ \dots + \{\psi_0\varepsilon_1 + \psi_1\varepsilon_0 + \psi_2\varepsilon_{-1} + \dots\}$$
$$= \psi_0\varepsilon_t + (\psi_0+\psi_1)\varepsilon_{t-1} + (\psi_0+\psi_1+\psi_2)\varepsilon_{t-2} + \dots + (\psi_0+\psi_1+\dots+\psi_{t-1})\varepsilon_1$$
$$+ (\psi_1+\psi_2+\dots+\psi_t)\varepsilon_0 + (\psi_2+\psi_3+\dots+\psi_{t+1})\varepsilon_{-1} + \dots$$
$$= (\psi_0+\psi_1+\psi_2+\dots)\varepsilon_t - (\psi_1+\psi_2+\psi_3+\dots)\varepsilon_t$$
$$+ (\psi_0+\psi_1+\psi_2+\dots)\varepsilon_{t-1} - (\psi_2+\psi_3+\psi_4+\dots)\varepsilon_{t-1}$$
$$+ (\psi_0+\psi_1+\psi_2+\dots)\varepsilon_{t-2} - (\psi_3+\psi_4+\psi_5+\dots)\varepsilon_{t-2} + \dots$$
$$+ (\psi_0+\psi_1+\psi_2+\dots)\varepsilon_1 - (\psi_t+\psi_{t+1}+\psi_{t+2}+\dots)\varepsilon_1$$
$$+ (\psi_1+\psi_2+\psi_3+\dots)\varepsilon_0 - (\psi_{t+1}+\psi_{t+2}+\psi_{t+3}+\dots)\varepsilon_0$$
$$+ (\psi_2+\psi_3+\psi_4+\dots)\varepsilon_{-1} - (\psi_{t+2}+\psi_{t+3}+\psi_{t+4}+\dots)\varepsilon_{-1} + \dots,$$
or
$$\sum_{s=1}^t u_s = \psi(1)\cdot\sum_{s=1}^t\varepsilon_s + \xi_t - \xi_0,$$
where
$$\xi_t = -(\psi_1+\psi_2+\psi_3+\dots)\varepsilon_t - (\psi_2+\psi_3+\psi_4+\dots)\varepsilon_{t-1} - (\psi_3+\psi_4+\psi_5+\dots)\varepsilon_{t-2} - \dots,$$
$$\xi_0 = -(\psi_1+\psi_2+\psi_3+\dots)\varepsilon_0 - (\psi_2+\psi_3+\psi_4+\dots)\varepsilon_{-1} - (\psi_3+\psi_4+\psi_5+\dots)\varepsilon_{-2} - \dots$$

This theorem states that for any serially correlated process $u_t$ satisfying (71), its partial sum ($\sum u_t$) can be written as the sum of a random walk ($\psi(1)\sum\varepsilon_t$), a stationary process $\xi_t$, and an initial condition $\xi_0$. Note that $\xi_t$ is stationary from the fact that $\xi_t = \sum_{j=0}^{\infty}\alpha_j\varepsilon_{t-j}$, where $\alpha_j = -(\psi_{j+1}+\psi_{j+2}+\dots)$ and $\{\alpha_j\}_{j=0}^{\infty}$ is absolutely summable:
$$\sum_{j=0}^{\infty}|\alpha_j| = |\psi_1+\psi_2+\psi_3+\dots| + |\psi_2+\psi_3+\psi_4+\dots| + |\psi_3+\psi_4+\psi_5+\dots| + \dots$$
$$\le \{|\psi_1|+|\psi_2|+|\psi_3|+\dots\} + \{|\psi_2|+|\psi_3|+|\psi_4|+\dots\} + \{|\psi_3|+|\psi_4|+|\psi_5|+\dots\} + \dots$$
$$= |\psi_1| + 2|\psi_2| + 3|\psi_3| + \dots = \sum_{j=0}^{\infty} j\cdot|\psi_j|,$$
which is bounded by the assumptions of Theorem 12.
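For an MA(1), $u_t = \varepsilon_t + \theta\varepsilon_{t-1}$, the decomposition is exact and easy to verify: $\psi(1) = 1 + \theta$, $\alpha_0 = -\theta$, and $\alpha_j = 0$ for $j \ge 1$, so $\xi_t = -\theta\varepsilon_t$. A sketch (NumPy assumed, illustrative names):

```python
import numpy as np

rng = np.random.default_rng(8)
theta = 0.7
eps = rng.standard_normal(501)                 # eps[0] plays the role of eps_0
u = eps[1:] + theta * eps[:-1]                 # u_t = eps_t + theta*eps_{t-1}, t = 1..500
partial_sums = np.cumsum(u)

psi_1 = 1.0 + theta                            # psi(1)
xi = -theta * eps[1:]                          # xi_t = alpha_0*eps_t with alpha_0 = -theta
xi_0 = -theta * eps[0]
bn = psi_1 * np.cumsum(eps[1:]) + xi - xi_0    # psi(1)*sum(eps) + xi_t - xi_0
# bn reproduces the partial sums exactly, term by term
```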
3.5.2 The Equality of the Long-Run Variances under Phillips's and Hamilton's Assumptions
We now show that the long-run variance $\lambda^2 = \lim_{T\to\infty} T^{-1}E[(\sum u_t)^2]$ of Theorem 7 equals $\psi(1)^2\sigma_\varepsilon^2$ (Hamilton, p.505, eq. 17.5.10). From the B-N decomposition we see that
$$u_1 + u_2 + \dots + u_T = \psi(1)(\varepsilon_1 + \varepsilon_2 + \dots + \varepsilon_T) + \xi_T - \xi_0;$$
therefore,
$$\lambda_T^2 = T^{-1}E[(u_1 + u_2 + \dots + u_T)^2]$$
$$= T^{-1}E\{\psi(1)^2(\varepsilon_1+\dots+\varepsilon_T)^2 + \xi_T^2 + \xi_0^2 + 2\psi(1)(\varepsilon_1+\dots+\varepsilon_T)\xi_T - 2\psi(1)(\varepsilon_1+\dots+\varepsilon_T)\xi_0 - 2\xi_T\xi_0\}$$
$$= T^{-1}\left[\psi(1)^2 T\sigma_\varepsilon^2 + E(\xi_T^2) + E(\xi_0^2) + 2\psi(1)\sigma_\varepsilon^2\sum_{j=0}^{T-1}\alpha_j - 0 - 2\sigma_\varepsilon^2\sum_{j=0}^{\infty}\alpha_j\alpha_{T+j}\right]$$
$$\to \psi(1)^2\sigma_\varepsilon^2$$
from the stationarity of $\xi_T$ and the absolute summability of the $\alpha_j$. (One has to show that $\sum_{j=0}^{\infty}|\alpha_j| < \infty$ implies that $\sum_{j=0}^{T-1}\alpha_j$ and $\sum_{j=0}^{\infty}\alpha_j\alpha_{T+j}$ remain bounded.)

Therefore, the results of Hamilton are all the same as those of Phillips as long as we replace $\lambda^2$ with $\psi(1)^2\sigma_\varepsilon^2$.
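The equality $\lambda^2 = \psi(1)^2\sigma_\varepsilon^2$ can also be checked by brute force for an MA(1) process, where $\psi(1)^2 = (1+\theta)^2$. A Monte Carlo sketch (NumPy, illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(9)
theta, T, reps = 0.7, 1000, 2000
vals = np.empty(reps)
for r in range(reps):
    eps = rng.standard_normal(T + 1)
    u = eps[1:] + theta * eps[:-1]      # MA(1) innovations
    vals[r] = u.sum() ** 2 / T          # one draw of T^{-1} (sum_t u_t)^2
lam2_hat = vals.mean()                  # estimates lambda^2 = psi(1)^2 * sigma_eps^2
# lam2_hat should be close to (1 + theta)^2 = 2.89
```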
4 Issues in Unit Root Testing
4.1 Size Distortion and Low Power of Unit Root Tests
Schwert (1989) first presented Monte Carlo evidence pointing out the size distortion problem of the commonly used unit root tests. He argued that the distribution of the Dickey-Fuller tests is far from the distribution reported by Dickey and Fuller if the underlying process contains a moving-average component (this is the meaning of size distortion: the distribution under the null hypothesis is not what you expected, and therefore the 5% critical value is misleading). He also suggests that the Phillips-Perron (PP) tests suffer from size distortions when the MA parameter is large, which is the case for many economic time series, as noted by Schwert (1989). The test with the least size distortion is the Said-Dickey (1984) high-order autoregressive $t$-test. Whereas Schwert complained about the size distortion of unit root tests, DeJong et al. (1992) complained about their low power, arguing that the unit root tests have low power against plausible trend-stationary alternatives. Similar problems concerning size distortion and low power were noticed by Agiakoglou and Newbold (1992).

The poor power problem is not unique to unit root tests. Cochrane argues that any test of the hypothesis $\theta = \theta_0$ has arbitrarily low power against the alternative $\theta_0 - \epsilon$ in small samples, but in many cases the difference between $\theta_0$ and $\theta_0 - \epsilon$ would not be considered important from a statistical or economic perspective. (For example, whether the expected height in a population is 170 or 171.) But the low power problem is particularly disturbing in the unit root case because of the discontinuity of the distribution theory near the unit root (the unit root test statistic has different asymptotic distributions under the null and under the alternative).

Mention must be made of a paper by Gonzalo and Lee (1996), who complain about the repetition of the phrase "lack of power of unit root tests." They show numerically that the lack of power and size distortions of the Dickey-Fuller tests for unit roots are similar to, and in many situations even smaller than, the lack of power and size distortions of the standard Student $t$-tests for stationary roots in an autoregressive model. But arguments like this miss an important point: there is no discontinuity of inference in the latter case, but there is in the case of unit root tests. Thus, the consequences of lack of power are vastly different in the two cases.

There have been several solutions to the problems of size distortion and low power of the ADF and PP tests. Some of these are modifications of the ADF and PP tests, and others are new tests. See Maddala and Kim (1999, p.103) for a good survey.
4.2 Tests With Stationarity as Null, the KPSS
Kwiatkowski, Phillips, Schmidt, and Shin (1992), often referred to as
KPSS, start with the model

Y_t = \psi + \delta t + \zeta_t + \varepsilon_t,

where \varepsilon_t is a stationary process and \zeta_t is a random walk given by

\zeta_t = \zeta_{t-1} + u_t, \quad u_t \sim i.i.d.(0, \sigma_u^2).

The null hypothesis of stationarity is formulated as

H_0 : \sigma_u^2 = 0, \quad i.e., \zeta_t is a constant.
The KPSS test statistic for this hypothesis is given by

LM = T^{-2} \sum_{t=1}^{T} S_t^2 \Big/ \hat{s}_{Tl}^2,

where

\hat{s}_{Tl}^2 = T^{-1} \sum_{t=1}^{T} e_t^2 + 2 T^{-1} \sum_{\tau=1}^{l} w_{\tau l} \sum_{t=\tau+1}^{T} e_t e_{t-\tau}

is a consistent estimator of the long-run variance \lim_{T \to \infty} T^{-1} E(S_T^2). Here w_{\tau l} is an
optimal weighting function that corresponds to the choice of a spectral window.
KPSS use the Bartlett window, as suggested by Newey and West (1987),

w_{\tau l} = 1 - \frac{\tau}{l+1},
and e_t are the residuals from the regression of Y_t on a constant and a time trend
(remember that the LM test statistic is constructed under the null hypothesis), and
S_t is the partial sum of the e_t, defined by

S_t = \sum_{i=1}^{t} e_i, \quad t = 1, 2, \ldots, T.
For consistency of \hat{s}_{Tl}^2, it is necessary that l \to \infty as T \to \infty. The rate l = o(T^{1/2})
is usually satisfactory. KPSS derive the asymptotic distribution of the LM
statistic and tabulate the critical values by simulation.
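Such a tabulation by simulation can be sketched as follows for the level-stationary case. Under the null with i.i.d. errors the long-run variance reduces to the error variance (so l = 0 is exact), and the statistic uses the T^{-2} normalization. The sample size and replication count below are illustrative choices, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
T, reps = 1_000, 5_000
stats = np.empty(reps)
for i in range(reps):
    eps = rng.standard_normal(T)   # i.i.d. errors under the null
    e = eps - eps.mean()           # residuals from a regression on a constant
    S = np.cumsum(e)               # partial sums S_t
    # With i.i.d. errors, s2_Tl with l = 0 is just the sample variance.
    s2 = e @ e / T
    stats[i] = S @ S / (T**2 * s2)

# Upper-tail critical values for the level-stationarity case; KPSS (1992)
# tabulate 0.463 at the 5% level for this case.
q90, q95, q99 = np.quantile(stats, [0.90, 0.95, 0.99])
print(q90, q95, q99)
```

With enough replications the simulated 5% quantile settles near the value KPSS report, illustrating how the critical values in their paper were obtained.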
For testing the null of level stationarity instead of trend stationarity, the test
is constructed in the same way, except that e_t is obtained as the residual from a
regression of Y_t on an intercept only. The test is an upper-tail test.
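A minimal sketch of the statistic's computation, covering both the trend- and level-stationary nulls, might look like the following. The function name, the default bandwidth rule, and the defaults are illustrative assumptions, not taken from KPSS.

```python
import numpy as np

def kpss_stat(y, regression="level", l=None):
    """Sketch of the KPSS statistic T^{-2} sum_t S_t^2 / s2_Tl.

    regression="level" detrends with an intercept only;
    regression="trend" uses an intercept and a linear time trend.
    """
    y = np.asarray(y, dtype=float)
    T = len(y)
    # Residuals e_t from the regression under the null hypothesis.
    if regression == "trend":
        X = np.column_stack([np.ones(T), np.arange(1, T + 1)])
    else:
        X = np.ones((T, 1))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ beta
    S = np.cumsum(e)                       # partial sums S_t
    # Bandwidth must satisfy l -> infinity with l = o(T^{1/2});
    # this ad hoc rule is an illustrative choice, not from the text.
    if l is None:
        l = int(np.floor(4 * (T / 100.0) ** 0.25))
    # Long-run variance estimate with Bartlett weights 1 - tau/(l+1).
    s2 = e @ e / T
    for tau in range(1, l + 1):
        s2 += 2.0 / T * (1.0 - tau / (l + 1.0)) * (e[tau:] @ e[:-tau])
    return float(S @ S / (T**2 * s2))
```

On a random walk the statistic is large and the null of stationarity is rejected against the upper-tail critical values, whereas on a stationary series it typically stays small.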
It has been suggested (see, e.g., KPSS, p.176, and Choi, 1994, p.721) that
tests using stationarity as the null can be used for confirmatory analysis, i.e.,
to confirm our conclusions about unit roots. However, if both tests fail to reject
their respective nulls, or if both reject them, we do not have a confirmation.
4.3 Panel Data Unit Root Tests
The principal motivation behind panel data unit root tests is to increase the
power of unit root tests by increasing the sample size. The alternative route of
increasing the sample size, using long time series, is argued to cause
problems arising from structural changes. However, it is not clear whether this
is more of a problem than cross-sectional heterogeneity, a problem with the use
of panel data.
It is often argued that the commonly used unit root tests, such as the ADF and
PP tests, are not very powerful, and that using panel data yields a more powerful
test. See Maddala and Kim (1999), p.134, for a good survey.