Ch. 21 Univariate Unit Root Process

1 Introduction

Consider OLS estimation of an AR(1) process, $Y_t = \rho Y_{t-1} + u_t$, where $u_t \sim i.i.d.(0, \sigma^2)$ and $Y_0 = 0$. The OLS estimator of $\rho$ is

$$\hat{\rho}_T = \frac{\sum_{t=1}^T Y_{t-1} Y_t}{\sum_{t=1}^T Y_{t-1}^2} = \left(\sum_{t=1}^T Y_{t-1}^2\right)^{-1}\left(\sum_{t=1}^T Y_{t-1} Y_t\right),$$

and therefore

$$\hat{\rho}_T - \rho = \left(\sum_{t=1}^T Y_{t-1}^2\right)^{-1}\left(\sum_{t=1}^T Y_{t-1} u_t\right). \quad (1)$$

When the true value of $\rho$ is less than 1 in absolute value, $Y_t$ (and hence $Y_t^2$) is a covariance-stationary process. Applying the LLN for a covariance-stationary process (see 9.19 of Ch. 4),

$$\frac{1}{T}\sum_{t=1}^T Y_{t-1}^2 \xrightarrow{p} E\left[\frac{1}{T}\sum_{t=1}^T Y_{t-1}^2\right] = \frac{1}{T}\cdot\frac{T\sigma^2}{1-\rho^2} = \frac{\sigma^2}{1-\rho^2}. \quad (2)$$

Since $Y_{t-1}u_t$ is a martingale difference sequence with variance $E(Y_{t-1}u_t)^2 = \sigma^2\cdot\frac{\sigma^2}{1-\rho^2}$, so that $\frac{1}{T}\sum_{t=1}^T \sigma^2\frac{\sigma^2}{1-\rho^2} \to \sigma^2\frac{\sigma^2}{1-\rho^2}$, applying the CLT for a martingale difference sequence to the second term on the right-hand side of (1) gives

$$\frac{1}{\sqrt{T}}\sum_{t=1}^T Y_{t-1}u_t \xrightarrow{L} N\left(0,\ \sigma^2\frac{\sigma^2}{1-\rho^2}\right). \quad (3)$$

Substituting (2) and (3) into (1), we have

$$\sqrt{T}(\hat{\rho}_T - \rho) = \left[\frac{1}{T}\sum_{t=1}^T Y_{t-1}^2\right]^{-1}\cdot \sqrt{T}\left[\frac{1}{T}\sum_{t=1}^T Y_{t-1}u_t\right] \quad (4)$$
$$\xrightarrow{L} \left[\frac{\sigma^2}{1-\rho^2}\right]^{-1} N\left(0,\ \sigma^2\frac{\sigma^2}{1-\rho^2}\right) \quad (5)$$
$$\equiv N(0,\ 1-\rho^2). \quad (6)$$

However, (6) is not valid when $\rho = 1$. To see this, recall that the variance of $Y_t$ when $\rho = 1$ is $t\sigma^2$; the LLN used in (2) then fails, since

$$E\left[\frac{1}{T}\sum_{t=1}^T Y_{t-1}^2\right] = \sigma^2\,\frac{\sum_{t=1}^T (t-1)}{T} \to \infty. \quad (7)$$

A similar argument shows that the CLT does not apply to $T^{-1/2}\sum_{t=1}^T Y_{t-1}u_t$. (Instead, $T^{-1}\sum_{t=1}^T Y_{t-1}u_t$ converges.) To obtain the limiting distribution of $\hat{\rho}_T - \rho$ in the unit root case, it turns out, as we shall prove below, that we must multiply $\hat{\rho}_T - \rho$ by $T$ rather than by $\sqrt{T}$:

$$T(\hat{\rho}_T - \rho) = \left[\frac{1}{T^2}\sum_{t=1}^T Y_{t-1}^2\right]^{-1}\left[\frac{1}{T}\sum_{t=1}^T Y_{t-1}u_t\right]. \quad (8)$$

Thus the unit root coefficient converges at a faster rate ($T$) than a coefficient in a stationary regression (which converges at $\sqrt{T}$).

2 Unit Root Asymptotic Theories

In this section we develop tools to handle the asymptotics of unit root processes.

2.1 Random Walks and the Wiener Process

Consider a random walk, $Y_t = Y_{t-1} + \varepsilon_t$, where $Y_0 = 0$ and $\varepsilon_t$ is i.i.d. with mean zero and $Var(\varepsilon_t) = \sigma^2 < \infty$. By repeated substitution,

$$Y_t = Y_{t-1} + \varepsilon_t = Y_{t-2} + \varepsilon_{t-1} + \varepsilon_t = Y_0 + \sum_{s=1}^t \varepsilon_s = \sum_{s=1}^t \varepsilon_s.$$

Before we can study the behavior of estimators based on random walks, we must understand in more detail the behavior of the random walk process itself. For the random walk $\{Y_t\}$ we can write $Y_T = \sum_{t=1}^T \varepsilon_t$. Rescaling, we have $T^{-1/2}Y_T/\sigma = T^{-1/2}\sum_{t=1}^T \varepsilon_t/\sigma$. (It is important to note here that $\sigma^2$ should be read as $Var(T^{-1/2}\sum_{t=1}^T \varepsilon_t) = E[T^{-1}(\sum\varepsilon_t)^2] = T\sigma^2/T = \sigma^2$.) According to the Lindeberg-Lévy CLT, $T^{-1/2}Y_T/\sigma \xrightarrow{L} N(0,1)$.

More generally, we can construct a variable $Y_T(r)$ from the partial sum of $\varepsilon_t$,

$$Y_T(r) = \sum_{t=1}^{[Tr]} \varepsilon_t,$$

where $0 \le r \le 1$ and $[Tr]$ denotes the largest integer less than or equal to $Tr$. Applying the same rescaling, we define

$$W_T(r) \equiv T^{-1/2}\, Y_T(r)/\sigma \quad (9)$$
$$= T^{-1/2}\sum_{t=1}^{[Tr]} \varepsilon_t/\sigma. \quad (10)$$
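As a quick numerical companion to the construction in (9)-(10), the following sketch (in Python; the sample size, $\sigma$, seed, and the function name `W_T` are our illustrative choices, not standard library objects) builds $W_T(r)$ from one simulated random walk.

```python
import numpy as np

# Minimal sketch of (9)-(10): the rescaled partial-sum function W_T(r)
# built from simulated i.i.d. innovations. T, sigma, and the seed are
# arbitrary illustrative choices.
rng = np.random.default_rng(0)
T, sigma = 500, 2.0
eps = rng.normal(0.0, sigma, size=T)
Y = np.cumsum(eps)                      # random walk: Y_t = sum_{s<=t} eps_s

def W_T(r):
    """W_T(r) = T^{-1/2} * Y_{[Tr]} / sigma, with W_T(0) = 0."""
    k = int(np.floor(T * r))
    return 0.0 if k == 0 else Y[k - 1] / (np.sqrt(T) * sigma)

print(W_T(0.5), W_T(1.0))   # W_T(1) is approximately N(0,1) across draws
```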
Now

$$W_T(r) = T^{-1/2}([Tr])^{1/2}\left\{([Tr])^{-1/2}\sum_{t=1}^{[Tr]} \varepsilon_t/\sigma\right\},$$

and for a given $r$ the term in braces again obeys the CLT and converges in distribution to $N(0,1)$, whereas $T^{-1/2}([Tr])^{1/2}$ converges to $r^{1/2}$. It follows from standard arguments that $W_T(r)$ converges in distribution to $N(0,r)$.

We have written $W_T(r)$ so that it is clear that $W_T$ can be considered a function of $r$. Also, because $W_T(r)$ depends on the $\varepsilon_t$'s, it is random. We can therefore think of $W_T(r)$ as defining a random function of $r$, which we write $W_T(\cdot)$. Just as the CLT provides conditions ensuring that the rescaled random walk $T^{-1/2}Y_T/\sigma$ (which we can now write as $W_T(1)$) converges, as $T$ becomes large, to a well-defined limiting random variable (the standard normal), the functional central limit theorem (FCLT) provides conditions ensuring that the random function $W_T(\cdot)$ converges, as $T$ becomes large, to a well-defined limiting random function, say $W(\cdot)$. The word "functional" in "functional central limit theorem" appears because this limit is a function of $r$.

Some further properties of the random walk, suitably rescaled, are as follows.

Proposition: If $Y_t$ is a random walk, then $Y_{t_4} - Y_{t_3}$ is independent of $Y_{t_2} - Y_{t_1}$ for all $t_1 < t_2 \le t_3 < t_4$. Consequently, $W_T(r_4) - W_T(r_3)$ is independent of $W_T(r_2) - W_T(r_1)$ for all $[T r_i] = t_i$, $i = 1, \dots, 4$.

Proof: Note that

$$Y_{t_4} - Y_{t_3} = \varepsilon_{t_4} + \varepsilon_{t_4-1} + \dots + \varepsilon_{t_3+1}, \qquad Y_{t_2} - Y_{t_1} = \varepsilon_{t_2} + \varepsilon_{t_2-1} + \dots + \varepsilon_{t_1+1}.$$

Since $(\varepsilon_{t_2}, \varepsilon_{t_2-1}, \dots, \varepsilon_{t_1+1})$ is independent of $(\varepsilon_{t_4}, \varepsilon_{t_4-1}, \dots, \varepsilon_{t_3+1})$, it follows that $Y_{t_4} - Y_{t_3}$ and $Y_{t_2} - Y_{t_1}$ are independent. Consequently $W_T(r_4) - W_T(r_3) = T^{-1/2}(\varepsilon_{t_4} + \dots + \varepsilon_{t_3+1})/\sigma$ is independent of $W_T(r_2) - W_T(r_1) = T^{-1/2}(\varepsilon_{t_2} + \dots + \varepsilon_{t_1+1})/\sigma$.

Proposition: For given $0 \le a < b \le 1$, $W_T(b) - W_T(a) \xrightarrow{L} N(0, b-a)$ as $T \to \infty$.

Proof: By definition,

$$W_T(b) - W_T(a) = T^{-1/2}\sum_{t=[Ta]+1}^{[Tb]} \varepsilon_t/\sigma = T^{-1/2}([Tb]-[Ta])^{1/2} \times ([Tb]-[Ta])^{-1/2}\sum_{t=[Ta]+1}^{[Tb]} \varepsilon_t/\sigma.$$

The last term $([Tb]-[Ta])^{-1/2}\sum_{t=[Ta]+1}^{[Tb]}\varepsilon_t/\sigma \xrightarrow{L} N(0,1)$ by the CLT, and $T^{-1/2}([Tb]-[Ta])^{1/2} = (([Tb]-[Ta])/T)^{1/2} \to (b-a)^{1/2}$ as $T \to \infty$. Hence $W_T(b) - W_T(a) \xrightarrow{L} N(0, b-a)$.

In words, the random walk has independent increments, and those increments have a limiting normal distribution, with a variance reflecting the size of the interval $(b-a)$ over which the increment is taken. It should not be surprising, therefore, that the limit of the sequence of functions $W_T(\cdot)$ constructed from the random walk preserves these properties in an appropriate sense. In fact, these properties form the basis of the definition of the Wiener process.

Definition: Let $(S, \mathcal{F}, P)$ be a complete probability space. Then $W: S \times [0,1] \to \mathbb{R}^1$ is a standard Wiener process if for each $r \in [0,1]$, $W(\cdot, r)$ is $\mathcal{F}$-measurable, and in addition:

(1). The process starts at zero: $P[W(\cdot, 0) = 0] = 1$.
(2). The increments are independent: if $0 \le a_0 \le a_1 \le \dots \le a_k \le 1$, then $W(\cdot, a_i) - W(\cdot, a_{i-1})$ is independent of $W(\cdot, a_j) - W(\cdot, a_{j-1})$, $j = 1, \dots, k$, $j \ne i$, for all $i = 1, \dots, k$.
(3). The increments are normally distributed: for $0 \le a \le b \le 1$, the increment $W(\cdot, b) - W(\cdot, a)$ is distributed as $N(0, b-a)$.

In the definition we have written $W(\cdot, a)$ for explicitness; whenever convenient, however, we will write $W(a)$ instead of $W(\cdot, a)$, analogous to our notation elsewhere. The Wiener process is also called a Brownian motion; Norbert Wiener (1924) provided the mathematical foundation for the theory of the random motion observed and described by the nineteenth-century botanist Robert Brown in 1827.
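Before moving on, a small Monte Carlo sketch (ours; $T$, the dates $a$, $b$, and the replication count are arbitrary) illustrating increment property (3): the sample variance of $W_T(b) - W_T(a)$ should be close to $b - a$.

```python
import numpy as np

# Check by simulation that Var[W_T(b) - W_T(a)] is close to b - a.
rng = np.random.default_rng(1)
T, reps, a, b = 400, 5000, 0.2, 0.7
eps = rng.normal(0.0, 1.0, size=(reps, T))   # sigma = 1
W = np.cumsum(eps, axis=1) / np.sqrt(T)      # W_T(t/T) for t = 1, ..., T
incr = W[:, int(b * T) - 1] - W[:, int(a * T) - 1]
print("sample variance:", incr.var(), " theory b - a:", b - a)
```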
2.2 Functional Central Limit Theorems

We earlier defined convergence in law for random variables; we now extend the definition to cover random functions. Let $S(\cdot)$ represent a continuous-time stochastic process, with $S(r)$ its value at some date $r$, $r \in [0,1]$. Suppose further that any given realization $S(\cdot)$ is a continuous function of $r$ with probability 1. For $\{S_T(\cdot)\}_{T=1}^{\infty}$ a sequence of such continuous functions, we say that the sequence of probability measures induced by $\{S_T(\cdot)\}_{T=1}^{\infty}$ converges weakly to the probability measure induced by $S(\cdot)$, denoted $S_T(\cdot) \Rightarrow S(\cdot)$, if all of the following hold:

(1). For any finite collection of $k$ particular dates $0 \le r_1 < r_2 < \dots < r_k \le 1$, the sequence of $k$-dimensional random vectors $\{y_T\}_{T=1}^{\infty}$ converges in distribution to the vector $y$, where
$$y_T \equiv (S_T(r_1), S_T(r_2), \dots, S_T(r_k))', \qquad y \equiv (S(r_1), S(r_2), \dots, S(r_k))';$$
(2). For each $\epsilon > 0$, the probability that $S_T(r_1)$ differs from $S_T(r_2)$ by more than $\epsilon$ for any dates $r_1$ and $r_2$ within $\delta$ of each other goes to zero uniformly in $T$ as $\delta \to 0$;
(3). $P\{|S_T(0)| > \lambda\} \to 0$ uniformly in $T$ as $\lambda \to \infty$.

This definition applies to sequences of continuous functions, whereas the function in (9) is a discontinuous step function. Fortunately, the discontinuities occur at a countable set of points; formally, $W_T(\cdot)$ can be replaced by a similar continuous function that interpolates between the steps.

The functional central limit theorem (FCLT) provides conditions under which $W_T$ converges to the standard Wiener process $W$. The simplest FCLT, a generalization of the Lindeberg-Lévy CLT, is known as Donsker's theorem.

Theorem (Donsker): Let $\varepsilon_t$ be a sequence of i.i.d. random scalars with mean zero. If $\sigma^2 \equiv Var(\varepsilon_t) < \infty$, $\sigma^2 \ne 0$, then $W_T \Rightarrow W$.

Because pointwise convergence in distribution, $W_T(\cdot, r) \xrightarrow{L} W(\cdot, r)$ for each $r \in [0,1]$, is necessary (but not sufficient) for weak convergence $W_T \Rightarrow W$, the Lindeberg-Lévy CLT ($W_T(\cdot, 1) \xrightarrow{L} W(\cdot, 1)$) follows immediately from Donsker's theorem. Donsker's theorem is strictly stronger than Lindeberg-Lévy, however: both use identical assumptions, but Donsker's theorem delivers a much stronger conclusion. Donsker called his result an invariance principle; consequently, the FCLT is often referred to as an invariance principle.

So far we have assumed that the sequence $\varepsilon_t$ used to construct $W_T$ is i.i.d. Nevertheless, just as we can obtain central limit theorems when $\varepsilon_t$ is not necessarily i.i.d., versions of the FCLT hold for each CLT previously given in Chapter 4.

Theorem (Continuous Mapping Theorem): If $S_T(\cdot) \Rightarrow S(\cdot)$ and $g(\cdot)$ is a continuous functional, then $g(S_T(\cdot)) \Rightarrow g(S(\cdot))$.

In the above theorem, continuity of a functional $g(\cdot)$ means that for any $\epsilon > 0$ there exists a $\delta > 0$ such that if $h$ and $k$ are any continuous bounded functions on $[0,1]$, $h: [0,1] \to \mathbb{R}^1$ and $k: [0,1] \to \mathbb{R}^1$, with $|h(r) - k(r)| < \delta$ for all $r \in [0,1]$, then $|g(h(\cdot)) - g(k(\cdot))| < \epsilon$.

3 Regression with a Unit Root

3.1 Dickey-Fuller Test, $Y_t$ is an AR(1) Process

Consider the following simple AR(1) process with a unit root,

$$Y_t = \beta Y_{t-1} + u_t, \quad (11)$$
$$\beta = 1, \quad (12)$$

where $Y_0 = 0$ and $u_t$ is i.i.d. with mean zero and variance $\sigma^2$. We consider the three least squares regressions

$$Y_t = \hat{\beta} Y_{t-1} + \hat{u}_t, \quad (13)$$
$$Y_t = \hat{\alpha} + \hat{\beta} Y_{t-1} + \hat{u}_t, \quad (14)$$
$$Y_t = \hat{\alpha} + \hat{\beta} Y_{t-1} + \hat{\delta} t + \hat{u}_t, \quad (15)$$

where $\hat{\beta}$, $(\hat{\alpha}, \hat{\beta})$, and $(\hat{\alpha}, \hat{\beta}, \hat{\delta})$ are the conventional least-squares regression coefficients.
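As an illustration of what these three regressions look like in practice, here is a sketch that fits (13), (14), and (15) by OLS on one simulated driftless random walk ($T$ and the seed are arbitrary; under the null all three $\hat\beta$'s should be near 1).

```python
import numpy as np

# Sketch: the three least squares regressions (13)-(15) on a simulated
# random walk with Y_0 = 0 and Gaussian u_t.
rng = np.random.default_rng(2)
T = 200
Y = np.concatenate(([0.0], np.cumsum(rng.normal(size=T))))
y, ylag, t = Y[1:], Y[:-1], np.arange(1.0, T + 1)

b13 = np.linalg.lstsq(ylag[:, None], y, rcond=None)[0]      # regression (13)
X14 = np.column_stack([np.ones(T), ylag])                   # regression (14)
b14 = np.linalg.lstsq(X14, y, rcond=None)[0]
X15 = np.column_stack([np.ones(T), ylag, t])                # regression (15)
b15 = np.linalg.lstsq(X15, y, rcond=None)[0]
print("beta-hat:", b13[0], b14[1], b15[1])
```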
Dickey and Fuller (1979) were concerned with the limiting distributions of the estimates in regressions (13), (14), and (15), that is of $\hat\beta$, $(\hat\alpha, \hat\beta)$, and $(\hat\alpha, \hat\beta, \hat\delta)$, under the null hypothesis that the data are generated by (11) and (12). We first provide the following asymptotic results for the sample moments, which are useful for deriving the asymptotics of the OLS estimators.

Lemma 1: Let $u_t$ be an i.i.d. sequence with mean zero and variance $\sigma^2$, and let

$$Y_t = u_1 + u_2 + \dots + u_t \quad \text{for } t = 1, 2, \dots, T, \quad (16)$$

with $Y_0 = 0$. Then

(a) $T^{-1/2}\sum_{t=1}^T u_t \xrightarrow{L} \sigma W(1)$;
(b) $T^{-2}\sum_{t=1}^T Y_{t-1}^2 \xrightarrow{L} \sigma^2\int_0^1 [W(r)]^2 dr$;
(c) $T^{-3/2}\sum_{t=1}^T Y_{t-1} \xrightarrow{L} \sigma\int_0^1 W(r)\,dr$;
(d) $T^{-1}\sum_{t=1}^T Y_{t-1}u_t \xrightarrow{L} \frac{1}{2}\sigma^2[W(1)^2 - 1]$;
(e) $T^{-3/2}\sum_{t=1}^T t\,u_t \xrightarrow{L} \sigma[W(1) - \int_0^1 W(r)\,dr]$;
(f) $T^{-5/2}\sum_{t=1}^T t\,Y_{t-1} \xrightarrow{L} \sigma\int_0^1 rW(r)\,dr$;
(g) $T^{-3}\sum_{t=1}^T t\,Y_{t-1}^2 \xrightarrow{L} \sigma^2\int_0^1 r[W(r)]^2 dr$.

Joint weak convergence of the sample moments above to their respective limits is easily established and will be used below.
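Before turning to the proof, a Monte Carlo sketch (ours; the sample size, replication count, and $\sigma$ are arbitrary) checking items (b) and (d) of the lemma: the sample moments are compared with draws of the limit functionals built from a discretized Wiener process.

```python
import numpy as np

# Numerical check of Lemma 1(b) and 1(d).
rng = np.random.default_rng(3)
T, reps, sigma = 500, 2000, 1.5

u = rng.normal(0.0, sigma, size=(reps, T))
Y = np.cumsum(u, axis=1)
Ylag = np.concatenate([np.zeros((reps, 1)), Y[:, :-1]], axis=1)
m_b = (Ylag ** 2).sum(axis=1) / T ** 2      # T^{-2} sum Y_{t-1}^2
m_d = (Ylag * u).sum(axis=1) / T            # T^{-1} sum Y_{t-1} u_t

# Limit draws from a discretized standard Wiener process on the same grid.
W = np.cumsum(rng.normal(size=(reps, T)), axis=1) / np.sqrt(T)
lim_b = sigma ** 2 * (W ** 2).mean(axis=1)      # sigma^2 int_0^1 W(r)^2 dr
lim_d = 0.5 * sigma ** 2 * (W[:, -1] ** 2 - 1)  # (1/2) sigma^2 [W(1)^2 - 1]
print(m_b.mean(), lim_b.mean())   # item (b): the means agree
print(m_d.std(), lim_d.std())     # item (d): the spreads agree (means are 0)
```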
Proof:
(a) is a straightforward consequence of Donsker's theorem with $r = 1$.

(b) First rewrite $T^{-2}\sum_{t=1}^T Y_{t-1}^2$ in terms of $W_T(r_{t-1}) \equiv T^{-1/2}Y_{t-1}/\sigma = T^{-1/2}\sum_{s=1}^{t-1}u_s/\sigma$, where $r_{t-1} = (t-1)/T$, so that $T^{-2}\sum_{t=1}^T Y_{t-1}^2 = \sigma^2 T^{-1}\sum_{t=1}^T W_T(r_{t-1})^2$. Because $W_T(r)$ is constant for $(t-1)/T \le r < t/T$,

$$T^{-1}\sum_{t=1}^T W_T(r_{t-1})^2 = \sum_{t=1}^T \int_{(t-1)/T}^{t/T} W_T(r)^2 dr = \int_0^1 W_T(r)^2 dr.$$

The continuous mapping theorem applies to $h(W_T) = \int_0^1 W_T(r)^2 dr$. It follows that $h(W_T) \Rightarrow h(W)$, so $T^{-2}\sum_{t=1}^T Y_{t-1}^2 \Rightarrow \sigma^2\int_0^1 W(r)^2 dr$, as claimed.

(c) The proof is analogous to that of (b). First rewrite $T^{-3/2}\sum_{t=1}^T Y_{t-1}$ in terms of $W_T(r_{t-1})$, so that $T^{-3/2}\sum_{t=1}^T Y_{t-1} = \sigma T^{-1}\sum_{t=1}^T W_T(r_{t-1})$. Because $W_T(r)$ is constant for $(t-1)/T \le r < t/T$,

$$T^{-1}\sum_{t=1}^T W_T(r_{t-1}) = \sum_{t=1}^T \int_{(t-1)/T}^{t/T} W_T(r)\,dr = \int_0^1 W_T(r)\,dr,$$

and the continuous mapping theorem applied to $h(W_T) = \int_0^1 W_T(r)dr$ gives $T^{-3/2}\sum_{t=1}^T Y_{t-1} \Rightarrow \sigma\int_0^1 W(r)dr$, as claimed.

(d) For a random walk, $Y_t^2 = (Y_{t-1}+u_t)^2 = Y_{t-1}^2 + 2Y_{t-1}u_t + u_t^2$, implying $Y_{t-1}u_t = \frac12\{Y_t^2 - Y_{t-1}^2 - u_t^2\}$ and hence $\sum_{t=1}^T Y_{t-1}u_t = \frac12\{Y_T^2 - Y_0^2\} - \frac12\sum_{t=1}^T u_t^2$. Recalling $Y_0 = 0$, it is convenient to write $\sum_{t=1}^T Y_{t-1}u_t = \frac12 Y_T^2 - \frac12\sum_{t=1}^T u_t^2$. From item (a) we know that $T^{-1}Y_T^2 = (T^{-1/2}\sum_{s=1}^T u_s)^2 \xrightarrow{L} \sigma^2 W(1)^2$, and $T^{-1}\sum_{t=1}^T u_t^2 \xrightarrow{p} \sigma^2$ by the LLN (Kolmogorov); hence $T^{-1}\sum_{t=1}^T Y_{t-1}u_t \Rightarrow \frac12\sigma^2[W(1)^2 - 1]$.

(e) We first observe that $\sum_{t=1}^T Y_{t-1} = u_1 + (u_1+u_2) + (u_1+u_2+u_3) + \dots + (u_1+\dots+u_{T-1}) = (T-1)u_1 + (T-2)u_2 + \dots + [T-(T-1)]u_{T-1} = \sum_{t=1}^T(T-t)u_t = T\sum_{t=1}^T u_t - \sum_{t=1}^T t\,u_t$, so that $\sum_{t=1}^T t\,u_t = T\sum_{t=1}^T u_t - \sum_{t=1}^T Y_{t-1}$. Therefore $T^{-3/2}\sum_{t=1}^T t\,u_t = T^{-1/2}\sum_{t=1}^T u_t - T^{-3/2}\sum_{t=1}^T Y_{t-1}$. Applying the continuous mapping theorem to the joint convergence of items (a) and (c),

$$T^{-3/2}\sum_{t=1}^T t\,u_t \Rightarrow \sigma\left[W(1) - \int_0^1 W(r)\,dr\right].$$

(f), (g) The proofs are analogous to those of (c) and (b). Rewrite $T^{-5/2}\sum_{t=1}^T t\,Y_{t-1} = \sigma T^{-1}\sum_{t=1}^T (t/T)\,W_T(r_{t-1})$. Because $W_T(r)$ is constant and $t/T - r \le 1/T$ for $(t-1)/T \le r < t/T$,

$$T^{-1}\sum_{t=1}^T (t/T)\,W_T(r_{t-1}) = \sum_{t=1}^T\int_{(t-1)/T}^{t/T} r\,W_T(r)\,dr + o_p(1) = \int_0^1 r\,W_T(r)\,dr + o_p(1),$$

so by the continuous mapping theorem $T^{-5/2}\sum_{t=1}^T t\,Y_{t-1} \Rightarrow \sigma\int_0^1 rW(r)dr$, as claimed. Similarly, $T^{-3}\sum_{t=1}^T t\,Y_{t-1}^2 = \sigma^2 T^{-1}\sum_{t=1}^T (t/T)\,W_T(r_{t-1})^2 \Rightarrow \sigma^2\int_0^1 r[W(r)]^2 dr$. This completes the proof of the lemma.

3.1.1 No Constant Term or Time Trend in the Regression; True Process Is a Random Walk

We first consider the case in which the regression model contains no constant term or time trend while the true process is a random walk. The asymptotic distributions of the OLS unit root coefficient estimator and of the t-ratio test statistic are as follows.

Theorem 1: Let the data $Y_t$ be generated by (11) and (12). Then as $T \to \infty$, for the regression model (13),

$$T(\hat\beta_T - 1) \Rightarrow \frac{\frac12\{[W(1)]^2 - 1\}}{\int_0^1[W(r)]^2dr}$$

and

$$\hat t = \frac{\hat\beta_T - 1}{\hat\sigma_{\hat\beta_T}} \Rightarrow \frac{\frac12\{[W(1)]^2 - 1\}}{\{\int_0^1[W(r)]^2dr\}^{1/2}},$$

where $\hat\sigma^2_{\hat\beta_T} = s_T^2\big/\sum_{t=1}^T Y_{t-1}^2$ and $s_T^2$ denotes the OLS estimate of the disturbance variance,

$$s_T^2 = \sum_{t=1}^T (Y_t - \hat\beta_T Y_{t-1})^2/(T-1).$$

Proof: The deviation of the OLS estimate from the true value is characterized by

$$T(\hat\beta_T - 1) = \frac{T^{-1}\sum_{t=1}^T Y_{t-1}u_t}{T^{-2}\sum_{t=1}^T Y_{t-1}^2}, \quad (17)$$

which is a continuous function of the quantities in Lemma 1(b) and 1(d); it follows that under the null hypothesis $\beta = 1$,

$$T(\hat\beta_T - 1) \Rightarrow \frac{\frac12\{[W(1)]^2 - 1\}}{\int_0^1[W(r)]^2dr}. \quad (18)$$

To prove the second part of the theorem, we first show the consistency of $s_T^2$. Notice that the population disturbance sum of squares can be written

$$(y - y_{-1}\beta)'(y - y_{-1}\beta) = (y - y_{-1}\hat\beta)'(y - y_{-1}\hat\beta) + (y_{-1}\hat\beta - y_{-1}\beta)'(y_{-1}\hat\beta - y_{-1}\beta),$$

where $y = [Y_1\ Y_2 \dots Y_T]'$, $y_{-1} = [Y_0\ Y_1 \dots Y_{T-1}]'$, and the cross product vanishes because $(y - y_{-1}\hat\beta)'y_{-1}(\hat\beta - \beta) = 0$ by the OLS orthogonality condition ($X'e = 0$). Dividing by $T$,

$$(1/T)(y - y_{-1}\hat\beta)'(y - y_{-1}\hat\beta) = (1/T)\Big(\sum_{t=1}^T u_t^2\Big) - T^{1/2}(\hat\beta-\beta)\,[(y_{-1}'y_{-1})/T^2]\,T^{1/2}(\hat\beta-\beta). \quad (19)\text{-}(20)$$

Now $(1/T)\sum_{t=1}^T u_t^2 \xrightarrow{p} E(u_t^2) \equiv \sigma^2$ by the LLN for an i.i.d. sequence; $T^{1/2}(\hat\beta - \beta) \xrightarrow{p} 0$, since $T(\hat\beta - \beta) = O_p(1)$ by (18); and $(y_{-1}'y_{-1})/T^2 \Rightarrow \sigma^2\int_0^1[W(r)]^2dr$ by Lemma 1(b). We thus have

$$T^{1/2}(\hat\beta-\beta)\,[(y_{-1}'y_{-1})/T^2]\,T^{1/2}(\hat\beta-\beta) \xrightarrow{p} 0\cdot\sigma^2\int_0^1[W(r)]^2dr\cdot 0 = 0,$$

and substituting these results into (20), $(1/T)(y - y_{-1}\hat\beta)'(y - y_{-1}\hat\beta) \xrightarrow{p} \sigma^2$. The OLS disturbance variance estimator

$$s_T^2 = \frac{1}{T-1}(y-y_{-1}\hat\beta)'(y-y_{-1}\hat\beta) = \frac{T}{T-1}\cdot\frac1T(y-y_{-1}\hat\beta)'(y-y_{-1}\hat\beta) \xrightarrow{p} 1\cdot\sigma^2 = \sigma^2 \quad (21)\text{-}(23)$$

is therefore consistent. Finally, we can express the t statistic alternatively as

$$\hat t_T = T(\hat\beta_T - 1)\Big\{T^{-2}\sum_{t=1}^T Y_{t-1}^2\Big\}^{1/2}\Big/(s_T^2)^{1/2} = \frac{T^{-1}\sum_{t=1}^T Y_{t-1}u_t}{\{T^{-2}\sum_{t=1}^T Y_{t-1}^2\}^{1/2}(s_T^2)^{1/2}},$$

which is again a continuous function of the quantities in Lemma 1(b) and 1(d); it follows that under the null hypothesis $\beta = 1$,

$$\hat t_T \xrightarrow{L} \frac{\frac12\sigma^2\{[W(1)]^2-1\}}{\{\sigma^2\int_0^1[W(r)]^2dr\}^{1/2}(\sigma^2)^{1/2}} = \frac{\frac12\{[W(1)]^2-1\}}{\{\int_0^1[W(r)]^2dr\}^{1/2}}. \quad (24)$$

This completes the proof of the theorem.
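The distribution in (18) is easy to simulate. The sketch below (with illustrative sample size and replication count) tabulates percentiles of $T(\hat\beta_T - 1)$ under the null, which can be compared with the Case 1 section of Table B.5.

```python
import numpy as np

# Simulate the Dickey-Fuller coefficient distribution of (18), Case 1.
rng = np.random.default_rng(4)
T, reps = 250, 10000
u = rng.normal(size=(reps, T))
Y = np.cumsum(u, axis=1)
Ylag = np.concatenate([np.zeros((reps, 1)), Y[:, :-1]], axis=1)
num = (Ylag * u).sum(axis=1) / T          # T^{-1} sum Y_{t-1} u_t
den = (Ylag ** 2).sum(axis=1) / T ** 2    # T^{-2} sum Y_{t-1}^2
stat = num / den                          # T(beta-hat - 1), see (17)
print(np.percentile(stat, [1.0, 5.0, 10.0, 50.0, 90.0, 95.0, 99.0]))
```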
Statistical tables for the distributions in (18) and (24) for various sample sizes $T$ are reported in the sections labeled Case 1 of Tables B.5 and B.6, respectively. These finite-sample results assume Gaussian innovations.

3.1.2 Constant Term but No Time Trend Included in the Regression; True Process Is a Random Walk

We next consider the case in which a constant term is added to the regression model while the true process is a random walk. The asymptotic distributions of the OLS unit root coefficient estimator and t-ratio test statistic are as follows.

Theorem 2: Let the data $Y_t$ be generated by (11) and (12). Then as $T\to\infty$, for the regression model (14),

$$T(\hat\beta_T - 1) \Rightarrow \frac{\frac12\{[W(1)]^2 - 1\} - W(1)\int_0^1 W(r)dr}{\int_0^1[W(r)]^2dr - \left[\int_0^1 W(r)dr\right]^2} \quad (25)$$

and

$$\hat t = \frac{\hat\beta_T - 1}{\hat\sigma_{\hat\beta_T}} \Rightarrow \frac{\frac12\{[W(1)]^2 - 1\} - W(1)\int_0^1 W(r)dr}{\left\{\int_0^1[W(r)]^2dr - \left[\int_0^1 W(r)dr\right]^2\right\}^{1/2}}, \quad (26)$$

where

$$\hat\sigma^2_{\hat\beta_T} = s_T^2\,[0\ \ 1]\begin{bmatrix} T & \sum Y_{t-1}\\ \sum Y_{t-1} & \sum Y_{t-1}^2\end{bmatrix}^{-1}\begin{bmatrix}0\\1\end{bmatrix}$$

and $s_T^2$ denotes the OLS estimate of the disturbance variance,

$$s_T^2 = \sum_{t=1}^T(Y_t - \hat\alpha_T - \hat\beta_T Y_{t-1})^2/(T-2).$$

Proof: The proof of this theorem is analogous to that of Theorem 1 and is omitted here.

Statistical tables for the distributions (25) and (26) for various sample sizes $T$ are reported in the sections labeled Case 2 of Tables B.5 and B.6, respectively. These finite-sample results assume Gaussian innovations.

These statistics test the null hypothesis $\beta = 1$. However, a maintained assumption on which the derivation of Theorem 2 is based is that the true value of $\alpha$ is zero. Thus it might seem more natural to test for a unit root in this specification by testing the joint hypothesis that $\alpha = 0$ and $\beta = 1$. Dickey and Fuller (1981) derived the limiting distribution of the likelihood ratio test of $(\alpha, \beta) = (0, 1)$ and used Monte Carlo methods to calculate the distribution of the OLS F test of this hypothesis. Their values are reported under the heading Case 2 in Table B.7.
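For concreteness, here is a sketch of the Case 2 t statistic in (26), OLS of $Y_t$ on a constant and $Y_{t-1}$, as one might code it; the function name and the simulated input are ours.

```python
import numpy as np

def df_t_case2(Y):
    """Case 2 Dickey-Fuller t statistic: regress Y_t on (1, Y_{t-1}) and
    return (beta-hat - 1)/se(beta-hat), with s_T^2 using T - 2 dof."""
    y, ylag = Y[1:], Y[:-1]
    T = len(y)
    X = np.column_stack([np.ones(T), ylag])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    s2 = ((y - X @ b) ** 2).sum() / (T - 2)
    se_beta = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    return (b[1] - 1.0) / se_beta

rng = np.random.default_rng(5)
Y = np.concatenate(([0.0], np.cumsum(rng.normal(size=300))))
print(df_t_case2(Y))   # compare with Case 2 critical values in Table B.6
```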
3.1.3 Constant Term and Time Trend Included in the Regression; True Process Is a Random Walk With or Without Drift

We finally consider in this section the case in which a constant term and a linear trend are added to the regression model while the true process is a random walk with drift. The true value of this drift turns out not to matter for the asymptotic distributions of the OLS unit root coefficient estimator and t-ratio test statistic in this case.

Theorem 3: Let the data $Y_t$ be generated by (11) and (12). Then as $T \to \infty$, for the regression model (15),

$$T(\hat\beta - 1) \Rightarrow \frac{\frac12\left\{\left[W(1) - 2\int_0^1 W(r)dr\right]\left[W(1) + 6\int_0^1 W(r)dr - 12\int_0^1 rW(r)dr\right] - 1\right\}}{\int_0^1[W(r)]^2dr - 4\left[\int_0^1 W(r)dr\right]^2 + 12\int_0^1 W(r)dr\int_0^1 rW(r)dr - 12\left[\int_0^1 rW(r)dr\right]^2}$$

and

$$\hat t = \frac{\hat\beta_T - 1}{\hat\sigma_{\hat\beta_T}} \Rightarrow \big(\text{the limit above}\big)\div\sqrt{Q},$$

where $\hat\sigma^2_{\hat\beta_T} = s_T^2\,[0\ 1\ 0]\,A_T^{-1}\,[0\ 1\ 0]'$ with

$$A_T = \begin{bmatrix} T & \sum\xi_{t-1} & \sum t\\ \sum\xi_{t-1} & \sum\xi_{t-1}^2 & \sum t\,\xi_{t-1}\\ \sum t & \sum t\,\xi_{t-1} & \sum t^2 \end{bmatrix}, \qquad \xi_t = Y_t - \alpha t,$$

$s_T^2$ denotes the OLS estimate of the disturbance variance,

$$s_T^2 = \sum_{t=1}^T(Y_t - \hat\alpha - \hat\beta_T Y_{t-1} - \hat\delta t)^2/(T-3),$$

and

$$Q \equiv [0\ 1\ 0]\begin{bmatrix} 1 & \int W(r)dr & 1/2\\ \int W(r)dr & \int[W(r)]^2dr & \int rW(r)dr\\ 1/2 & \int rW(r)dr & 1/3\end{bmatrix}^{-1}\begin{bmatrix}0\\1\\0\end{bmatrix}.$$

Proof:
(a). Let the data generating process be $Y_t = \alpha + Y_{t-1} + u_t$ and the regression model be

$$Y_t = \alpha + \beta Y_{t-1} + \delta t + u_t. \quad (27)$$

Note that the regression model (27) can be equivalently rewritten as

$$Y_t = (1-\beta)\alpha + \beta(Y_{t-1} - \alpha(t-1)) + (\delta + \beta\alpha)t + u_t \equiv \alpha^* + \beta^*\xi_{t-1} + \delta^* t + u_t, \quad (28)$$

where $\alpha^* = (1-\beta)\alpha$, $\beta^* = \beta$, $\delta^* = \delta + \beta\alpha$, and $\xi_{t-1} = Y_{t-1} - \alpha(t-1)$. Moreover, under the null hypothesis that $\beta = 1$ and $\delta = 0$, $\xi_t = Y_0 + u_1 + u_2 + \dots + u_t$; that is, $\xi_t$ is the random walk described in Lemma 1. Under the maintained hypothesis, $\alpha = \alpha_0$, $\beta = 1$, and $\delta = 0$, which in (28) means $\alpha^* = 0$, $\beta^* = 1$, and $\delta^* = \alpha_0$. The deviation of the OLS estimates from these true values is given by

$$\begin{bmatrix}\hat\alpha^*\\ \hat\beta - 1\\ \hat\delta^* - \alpha_0\end{bmatrix} = A_T^{-1}\begin{bmatrix}\sum u_t\\ \sum\xi_{t-1}u_t\\ \sum t\,u_t\end{bmatrix}, \quad (29)$$

or in shorthand $C = A^{-1}f$. From Lemma 1, the orders in probability of the individual terms in (29) are

$$\begin{bmatrix}\hat\alpha^*\\ \hat\beta-1\\ \hat\delta^*-\alpha_0\end{bmatrix} = \begin{bmatrix}O_p(T) & O_p(T^{3/2}) & O_p(T^2)\\ O_p(T^{3/2}) & O_p(T^2) & O_p(T^{5/2})\\ O_p(T^2) & O_p(T^{5/2}) & O_p(T^3)\end{bmatrix}^{-1}\begin{bmatrix}O_p(T^{1/2})\\ O_p(T)\\ O_p(T^{3/2})\end{bmatrix}.$$

We define the rescaling matrix $\Upsilon_T = \mathrm{diag}(T^{1/2},\ T,\ T^{3/2})$. Multiplying (29) by $\Upsilon_T$, we get

$$\Upsilon_T C = \Upsilon_T A^{-1}\Upsilon_T\Upsilon_T^{-1}f = \left[\Upsilon_T^{-1}A\Upsilon_T^{-1}\right]^{-1}\Upsilon_T^{-1}f. \quad (30)$$

Substituting the results of Lemma 1 into (30), and noting that $\Upsilon_T^{-1}A_T\Upsilon_T^{-1} \Rightarrow DQ^*D$, where $D = \mathrm{diag}(1, \sigma, 1)$ and $Q^*$ denotes the 3x3 matrix inverted in the definition of $Q$, we establish that

$$\hat b_1 \Rightarrow (DQ^*D)^{-1}h_1, \quad (31)$$

where

$$\hat b_1 \equiv \begin{bmatrix}T^{1/2}\hat\alpha^*\\ T(\hat\beta-1)\\ T^{3/2}(\hat\delta^*-\alpha_0)\end{bmatrix}, \qquad h_1 \equiv \begin{bmatrix}\sigma W(1)\\ \frac12\sigma^2\{[W(1)]^2-1\}\\ \sigma[W(1) - \int W(r)dr]\end{bmatrix}.$$

Since $h_1 = \sigma D\tilde h$ with $\tilde h \equiv (W(1),\ \frac12\{[W(1)]^2-1\},\ W(1)-\int W(r)dr)'$, we have $(DQ^*D)^{-1}h_1 = \sigma D^{-1}(Q^*)^{-1}\tilde h$, whose middle element is free of $\sigma$. The asymptotic distribution of $T(\hat\beta-1)$ is given by this middle row of (31); evaluating it yields the expression in the statement of the theorem. Note that this distribution depends on neither $\alpha$ nor $\sigma$; in particular, it does not matter whether or not the true value of $\alpha$ is zero.
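Before part (b), a simulation sketch (ours; the drift, sample sizes, and seed are arbitrary) of the $\Upsilon_T$ rescaling in (31): regressing on $(1, \xi_{t-1}, t)$ with $\xi_{t-1}$ built from the known drift $\alpha_0$, the rescaled deviations $T^{1/2}\hat\alpha^*$, $T(\hat\beta - 1)$, and $T^{3/2}(\hat\delta^* - \alpha_0)$ stay of comparable magnitude as $T$ grows.

```python
import numpy as np

# Illustrate the rates behind Upsilon_T = diag(T^{1/2}, T, T^{3/2}).
rng = np.random.default_rng(6)
alpha0 = 0.5
for T in (100, 1000, 10000):
    Y = np.concatenate(([0.0], np.cumsum(alpha0 + rng.normal(size=T))))
    y, ylag, t = Y[1:], Y[:-1], np.arange(1.0, T + 1)
    xi_lag = ylag - alpha0 * (t - 1.0)     # xi_{t-1} = Y_{t-1} - alpha0 (t-1)
    X = np.column_stack([np.ones(T), xi_lag, t])
    a_s, b_hat, d_s = np.linalg.lstsq(X, y, rcond=None)[0]
    print(T, np.sqrt(T) * a_s, T * (b_hat - 1.0), T ** 1.5 * (d_s - alpha0))
```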
(b). The asymptotic distribution of the OLS t statistic can be found by a calculation similar to that in (23). Notice that

$$T^2\hat\sigma^2_{\hat\beta_T} = s_T^2\,[0\ 1\ 0]\,\Upsilon_T A_T^{-1}\Upsilon_T\,[0\ 1\ 0]' = s_T^2\,[0\ 1\ 0]\left[\Upsilon_T^{-1}A_T\Upsilon_T^{-1}\right]^{-1}[0\ 1\ 0]',$$

using $\Upsilon_T[0\ 1\ 0]' = T\,[0\ 1\ 0]'$. Therefore, by Lemma 1 and $s_T^2 \xrightarrow{p}\sigma^2$,

$$T^2\hat\sigma^2_{\hat\beta_T} \xrightarrow{L} \sigma^2\,[0\ 1\ 0]\begin{bmatrix}1 & \sigma\int W(r)dr & 1/2\\ \sigma\int W(r)dr & \sigma^2\int[W(r)]^2dr & \sigma\int rW(r)dr\\ 1/2 & \sigma\int rW(r)dr & 1/3\end{bmatrix}^{-1}\begin{bmatrix}0\\1\\0\end{bmatrix}.$$

Writing the middle matrix as $DQ^*D$ with $D = \mathrm{diag}(1,\sigma,1)$, this limit equals

$$\sigma^2\,[0\ 1\ 0]\,D^{-1}(Q^*)^{-1}D^{-1}\begin{bmatrix}0\\1\\0\end{bmatrix} = [0\ 1\ 0]\,(Q^*)^{-1}\begin{bmatrix}0\\1\\0\end{bmatrix} \equiv Q.$$

From this result it follows that the asymptotic distribution of the OLS t test of the hypothesis $\beta = 1$ is given by

$$\hat t_T = T(\hat\beta_T - 1)\div(T^2\hat\sigma^2_{\hat\beta_T})^{1/2} \Rightarrow \big(\text{limit of } T(\hat\beta-1)\big)\div\sqrt{Q}.$$

This completes the proof of Theorem 3. Again, this distribution depends on neither $\alpha$ nor $\sigma$.

The small-sample distribution of the OLS t statistic under the assumption of Gaussian disturbances is presented under Case 4 in Table B.6. If this distribution were truly t, a value below $-2.0$ would be sufficient to reject the null hypothesis. However, Table B.6 reveals that, because of the nonstandard distribution, the t statistic must fall below $-3.4$ before the null hypothesis of a unit root can be rejected.

The assumption that the true value of $\delta$ equals zero is again an auxiliary hypothesis on which the asymptotic properties of the test depend. Thus, as in Section 3.1.2, it is natural to consider the OLS F test of the joint null hypothesis that $\delta = 0$ and $\beta = 1$. Though this F statistic is calculated in the usual way, its asymptotic distribution is nonstandard, and the calculated value should be compared with the entries under Case 4 in Table B.7.

Remark: In deriving the asymptotic distributions in this chapter, the following identity is useful:

$$\begin{bmatrix}\alpha & 0\\ 0 & \beta\end{bmatrix}\begin{bmatrix}a & b\\ c & d\end{bmatrix}\begin{bmatrix}\alpha & 0\\ 0 & \beta\end{bmatrix} = \begin{bmatrix}\alpha^2 a & \alpha\beta b\\ \alpha\beta c & \beta^2 d\end{bmatrix}.$$

Unit root tests are not confined to the simple AR(1) process of the original Dickey and Fuller (1979) work. There are two ways to generalize the unit root process. One is to assume that $Y_t$ in (11) is an AR(p) process; the other is to assume that $u_t$ is an ARMA(p,q) process (Said-Dickey) or a general nonparametric process satisfying certain memory and moment constraints (Phillips-Perron). These modifications of the unit root model are discussed in the following sections.

3.2 Augmented Dickey-Fuller Test, $Y_t$ is an AR(p) Process

Instead of (11) and (12), suppose that the data were generated by an AR(p) process,

$$(1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p)Y_t = \varepsilon_t, \quad (32)$$

where $\varepsilon_t$ is an i.i.d. sequence with zero mean, variance $\sigma^2$, and finite fourth moment. It is helpful to write the autoregression (32) in a slightly different form.
To do so, define

$$\beta \equiv \phi_1 + \phi_2 + \dots + \phi_p, \qquad \zeta_j \equiv -[\phi_{j+1} + \phi_{j+2} + \dots + \phi_p] \quad \text{for } j = 1, 2, \dots, p-1.$$

Notice that for any values of $\phi_1, \phi_2, \dots, \phi_p$ the following polynomials in $L$ are equivalent:

$$(1 - \beta L) - (\zeta_1 L + \zeta_2 L^2 + \dots + \zeta_{p-1}L^{p-1})(1 - L)$$
$$= 1 - (\beta + \zeta_1)L - (\zeta_2 - \zeta_1)L^2 - (\zeta_3 - \zeta_2)L^3 - \dots - (\zeta_{p-1} - \zeta_{p-2})L^{p-1} - (-\zeta_{p-1})L^p$$
$$= 1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p,$$

where the last equality follows by substituting the definitions of $\beta$ and the $\zeta_j$. Thus the autoregression (32) can equivalently be written

$$\{(1-\beta L) - (\zeta_1 L + \zeta_2 L^2 + \dots + \zeta_{p-1}L^{p-1})(1-L)\}Y_t = \varepsilon_t \quad (33)$$

or

$$Y_t = \beta Y_{t-1} + \zeta_1\Delta Y_{t-1} + \zeta_2\Delta Y_{t-2} + \dots + \zeta_{p-1}\Delta Y_{t-p+1} + \varepsilon_t. \quad (34)$$

Example: In the case of $p = 3$, (34) is

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \phi_3 Y_{t-3} + \varepsilon_t = (\phi_1+\phi_2+\phi_3)Y_{t-1} - (\phi_2+\phi_3)[Y_{t-1}-Y_{t-2}] - \phi_3[Y_{t-2}-Y_{t-3}] = \beta Y_{t-1} + \zeta_1\Delta Y_{t-1} + \zeta_2\Delta Y_{t-2}.$$

Suppose that the process that generated $Y_t$ contains a single unit root; that is, suppose one root of

$$(1 - \phi_1 z - \phi_2 z^2 - \dots - \phi_p z^p) = 0 \quad (35)$$

is unity, so that

$$1 - \phi_1 - \phi_2 - \dots - \phi_p = 0, \quad (36)$$

and all other roots of (35) lie outside the unit circle. Notice that (36) implies $\beta = 1$ in (34). Moreover, when $\beta = 1$,

$$1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p = (1-L)(1 - \zeta_1 L - \zeta_2 L^2 - \dots - \zeta_{p-1}L^{p-1}),$$

and all roots of $(1 - \zeta_1 L - \zeta_2 L^2 - \dots - \zeta_{p-1}L^{p-1}) = 0$ lie outside the unit circle. Under the null hypothesis that $\beta = 1$, expression (34) can then be written as

$$\Delta Y_t = \zeta_1\Delta Y_{t-1} + \zeta_2\Delta Y_{t-2} + \dots + \zeta_{p-1}\Delta Y_{t-p+1} + \varepsilon_t \quad (37)$$

or

$$\Delta Y_t = u_t, \quad (38)$$

where $u_t = \psi(L)\varepsilon_t = (1-\zeta_1 L - \zeta_2 L^2 - \dots - \zeta_{p-1}L^{p-1})^{-1}\varepsilon_t$. That is, we may express this AR(p) unit root process as the AR(1) unit root process (38), but with serially correlated $u_t$. Dickey and Fuller (1979) proposed a test of the unit root in this AR(p) model; it is known as the augmented Dickey-Fuller (ADF) test.

3.2.1 Constant Term but No Time Trend Included in the Regression; True Process Is Autoregressive with No Drift

Assume that the initial sample is of size $T + p$, with observations numbered $\{Y_{-p+1}, Y_{-p+2}, \dots, Y_T\}$, and condition on the first $p$ observations. We are interested in the properties of OLS estimation of

$$Y_t = \zeta_1\Delta Y_{t-1} + \zeta_2\Delta Y_{t-2} + \dots + \zeta_{p-1}\Delta Y_{t-p+1} + \alpha + \beta Y_{t-1} + \varepsilon_t = x_t'\boldsymbol{\beta} + \varepsilon_t \quad (39)\text{-}(40)$$

under the null hypothesis that $\alpha = 0$ and $\beta = 1$, i.e., the DGP is

$$\Delta Y_t = \zeta_1\Delta Y_{t-1} + \zeta_2\Delta Y_{t-2} + \dots + \zeta_{p-1}\Delta Y_{t-p+1} + \varepsilon_t, \quad (41)$$

where $\boldsymbol{\beta} \equiv (\zeta_1, \zeta_2, \dots, \zeta_{p-1}, \alpha, \beta)'$ and $x_t \equiv (\Delta Y_{t-1}, \Delta Y_{t-2}, \dots, \Delta Y_{t-p+1}, 1, Y_{t-1})'$. The asymptotic distribution of the OLS coefficient estimator $\hat{\boldsymbol\beta}_T$ is as follows.

Theorem 4: Let the data $Y_t$ be generated by (41). Then as $T\to\infty$, for the regression model (39),

$$\Upsilon_T(\hat{\boldsymbol\beta}_T - \boldsymbol\beta) \xrightarrow{L} \begin{bmatrix}V & 0\\ 0 & Q\end{bmatrix}^{-1}\begin{bmatrix}h_1\\ h_2\end{bmatrix} = \begin{bmatrix}V^{-1}h_1\\ Q^{-1}h_2\end{bmatrix}, \quad (42)$$

where $\Upsilon_T = \mathrm{diag}(\sqrt T, \dots, \sqrt T,\ T)$ is $(p+1)\times(p+1)$, with $\sqrt T$ applied to $(\hat\zeta_1, \dots, \hat\zeta_{p-1}, \hat\alpha)$ and $T$ to $\hat\beta$,

$$V = \begin{bmatrix}\gamma_0 & \gamma_1 & \dots & \gamma_{p-2}\\ \gamma_1 & \gamma_0 & \dots & \gamma_{p-3}\\ \vdots & & \ddots & \vdots\\ \gamma_{p-2} & \gamma_{p-3} & \dots & \gamma_0\end{bmatrix}, \qquad Q = \begin{bmatrix}1 & \lambda\int W(r)dr\\ \lambda\int W(r)dr & \lambda^2\int[W(r)]^2dr\end{bmatrix},$$

$h_1 \sim N_{p-1}(0, \sigma^2 V)$,

$$h_2 \sim \begin{bmatrix}\sigma W(1)\\ \frac12\sigma\lambda\{[W(1)]^2 - 1\}\end{bmatrix},$$

$\gamma_j = E[(\Delta Y_t)(\Delta Y_{t-j})]$, and

$$\lambda = \sigma\,\psi(1) = \sigma/(1 - \zeta_1 - \zeta_2 - \dots - \zeta_{p-1}). \quad (43)$$

The result reveals that in a regression of an I(1) variable on I(1) and I(0) variables, the asymptotic distributions of the coefficients on the I(1) and I(0) variables are independent. Thus the asymptotic distributions of $\sqrt T(\hat\zeta_j - \zeta_j)$, $j = 1, 2, \dots, p-1$, and $T(\hat\beta - \beta)$ are independent.
This result can be used to show that the distribution of $\hat\beta_T$ in the ADF regression is the Dickey-Fuller distribution, after accounting for the serial correlation in $u_t$; see (44). Also, the asymptotic distribution of $\sqrt T(\hat\zeta_j - \zeta_j)$ is normal. The limiting distribution of $T(\hat\beta - \beta)$ is therefore given by the second element of

$$\begin{bmatrix}T^{1/2} & 0\\ 0 & T\end{bmatrix}\begin{bmatrix}\hat\alpha - 0\\ \hat\beta_T - 1\end{bmatrix} \xrightarrow{L} \begin{bmatrix}1 & \lambda\int W(r)dr\\ \lambda\int W(r)dr & \lambda^2\int[W(r)]^2dr\end{bmatrix}^{-1}\begin{bmatrix}\sigma W(1)\\ \frac12\sigma\lambda\{[W(1)]^2-1\}\end{bmatrix} \equiv \begin{bmatrix}\sigma & 0\\ 0 & \sigma/\lambda\end{bmatrix}\begin{bmatrix}1 & \int W(r)dr\\ \int W(r)dr & \int[W(r)]^2dr\end{bmatrix}^{-1}\begin{bmatrix}W(1)\\ \frac12\{[W(1)]^2-1\}\end{bmatrix},$$

or

$$T(\hat\beta_T - 1) \Rightarrow (\sigma/\lambda)\cdot\frac{\frac12\{[W(1)]^2-1\} - W(1)\int_0^1 W(r)dr}{\int_0^1[W(r)]^2dr - \left[\int_0^1 W(r)dr\right]^2}. \quad (44)$$

The parameter $(\sigma/\lambda)$ is the factor that corrects for serial correlation in $u_t$. When $u_t$ is i.i.d., from (38) we have $\zeta_i = 0$ and $\lambda = \sigma$, so that $(\sigma/\lambda) = 1$ and the distribution reduces to the simple Dickey-Fuller distribution. We are now in a position to propose the ADF test statistics, which correct for $(\sigma/\lambda)$ and have the same distribution as DF.

Theorem 5 (ADF): Let the data $Y_t$ be generated by (41). Then as $T\to\infty$, for the regression model (39),

(a). $\dfrac{T(\hat\beta_T - 1)}{1 - \hat\zeta_1 - \hat\zeta_2 - \dots - \hat\zeta_{p-1}} \Rightarrow \dfrac{\frac12\{[W(1)]^2-1\} - W(1)\int_0^1 W(r)dr}{\int_0^1[W(r)]^2dr - \left[\int_0^1 W(r)dr\right]^2}$;

(b). $t_T = \dfrac{\hat\beta_T - 1}{\{s_T^2\,e_{p+1}'(\sum x_tx_t')^{-1}e_{p+1}\}^{1/2}} \Rightarrow \dfrac{\frac12\{[W(1)]^2-1\} - W(1)\int_0^1 W(r)dr}{\left\{\int_0^1[W(r)]^2dr - \left[\int_0^1 W(r)dr\right]^2\right\}^{1/2}}$,

where $e_{p+1} = [0\ 0\ \dots\ 0\ 1]'$.

Proof: (a). From (44) we have

$$T\,(\lambda/\sigma)(\hat\beta_T - 1) \Rightarrow \frac{\frac12\{[W(1)]^2-1\} - W(1)\int_0^1 W(r)dr}{\int_0^1[W(r)]^2dr - \left[\int_0^1 W(r)dr\right]^2}. \quad (45)$$

Recall from (43) that $\lambda/\sigma = (1-\zeta_1-\zeta_2-\dots-\zeta_{p-1})^{-1}$. Since $\hat\zeta_j$ is consistent by (42), this magnitude is clearly consistently estimated by

$$\widehat{(\lambda/\sigma)} = (1 - \hat\zeta_1 - \hat\zeta_2 - \dots - \hat\zeta_{p-1})^{-1}. \quad (46)$$

It follows that

$$[T(\hat\beta_T-1)\cdot(\lambda/\sigma)] - [T(\hat\beta_T-1)\cdot\widehat{(\lambda/\sigma)}] = T(\hat\beta_T-1)\cdot[(\lambda/\sigma) - \widehat{(\lambda/\sigma)}] = O_p(1)\cdot o_p(1) = o_p(1).$$

Thus $[T(\hat\beta_T-1)\cdot\widehat{(\lambda/\sigma)}]$ and $[T(\hat\beta_T-1)\cdot(\lambda/\sigma)]$ have the same asymptotic distribution. This completes the proof of part (a).

To prove part (b), we first multiply the numerator and denominator of $t_T$ by $T$, which gives

$$t_T = \frac{T(\hat\beta_T - 1)}{\{s_T^2\,e_{p+1}'\Upsilon_T(\sum x_tx_t')^{-1}\Upsilon_T e_{p+1}\}^{1/2}}. \quad (47)$$

But

$$e_{p+1}'\Upsilon_T\Big(\sum x_tx_t'\Big)^{-1}\Upsilon_T e_{p+1} = e_{p+1}'\Big[\Upsilon_T^{-1}\Big(\sum x_tx_t'\Big)\Upsilon_T^{-1}\Big]^{-1}e_{p+1} \xrightarrow{L} e_{p+1}'\begin{bmatrix}V^{-1} & 0\\ 0 & Q^{-1}\end{bmatrix}e_{p+1} = \frac{1}{\lambda^2\left\{\int_0^1[W(r)]^2dr - \left[\int_0^1 W(r)dr\right]^2\right\}}.$$

Hence, from (45), (47), and $s_T^2 \xrightarrow{p}\sigma^2$,

$$t_T \xrightarrow{L} (\sigma/\lambda)\,\frac{\frac12\{[W(1)]^2-1\} - W(1)\int_0^1 W(r)dr}{\int_0^1[W(r)]^2dr - \left[\int_0^1 W(r)dr\right]^2} \div \left\{\frac{\sigma^2}{\lambda^2}\cdot\frac{1}{\int_0^1[W(r)]^2dr - \left[\int_0^1 W(r)dr\right]^2}\right\}^{1/2} = \frac{\frac12\{[W(1)]^2-1\} - W(1)\int_0^1 W(r)dr}{\left\{\int_0^1[W(r)]^2dr - \left[\int_0^1 W(r)dr\right]^2\right\}^{1/2}}.$$

This is the same distribution as in (26).
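A compact sketch of the ADF statistics of Theorem 5 (our own helper, not a library routine; the AR(2)-with-unit-root example is illustrative):

```python
import numpy as np

def adf_stat(Y, p):
    """ADF regression (39): Y_t on (dY_{t-1}, ..., dY_{t-p+1}, 1, Y_{t-1}).
    Returns the corrected coefficient statistic of Theorem 5(a) and the
    t statistic of Theorem 5(b)."""
    n = len(Y) - 1
    dY = np.diff(Y)                         # dY[i] = Y_{i+1} - Y_i
    ts = np.arange(p, n + 1)                # usable dates t = p, ..., n
    lags = (np.column_stack([dY[ts - 1 - j] for j in range(1, p)])
            if p > 1 else np.empty((len(ts), 0)))
    X = np.column_stack([lags, np.ones(len(ts)), Y[ts - 1]])
    y = Y[ts]
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ b
    s2 = (e ** 2).sum() / (len(ts) - X.shape[1])
    T = len(ts)
    coef_stat = T * (b[-1] - 1.0) / (1.0 - b[:-2].sum())       # Theorem 5(a)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[-1, -1])
    return coef_stat, (b[-1] - 1.0) / se                       # 5(a), 5(b)

# Example: (1 - L)(1 - 0.5L) Y_t = eps_t, i.e., dY_t = 0.5 dY_{t-1} + eps_t.
rng = np.random.default_rng(7)
dY = np.zeros(400)
for t in range(1, 400):
    dY[t] = 0.5 * dY[t - 1] + rng.normal()
Y = np.concatenate(([0.0], np.cumsum(dY)))
print(adf_stat(Y, p=2))
```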
Thus, the usual t test of $\beta = 1$ from OLS estimation of the AR(p) can be compared with Theorem 2 and use Case 2 of Table B.6, without any correction for the fact that $u_t$ is serially correlated (or, equivalently, that lagged $\Delta Y$ terms are included in the regression).

3.3 Augmented Dickey-Fuller Test, $Y_t$ is an ARMA(p,q) Process

The fact that the distribution of $\hat\beta_T$ in the ADF regression is the Dickey-Fuller distribution was extended by Said and Dickey (1984) to the more general case in which, under the null hypothesis, the series of first differences is of general ARMA(p,q) form with unknown $p$ and $q$. They showed that a regression model such as (39) remains valid for testing the unit root null in the presence of serially correlated errors, provided the number of lagged $\Delta Y$ terms included as regressors increases with the sample size at a controlled rate, $T^{1/3}$. Essentially, the moving-average terms are approximated by including enough autoregressive terms.

Consider the general ARIMA(p,1,q) model defined by

$$Y_t = \beta Y_{t-1} + u_t, \quad (48)$$
$$(1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p)u_t = (1 + \theta_1 L + \theta_2 L^2 + \dots + \theta_q L^q)\varepsilon_t, \quad (49)$$

with $Y_0 = 0$, $\varepsilon_t$ i.i.d., and $\beta = 1$. Equations (48) and (49) can be written equivalently as

$$(1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p)\Delta Y_t = (1 + \theta_1 L + \theta_2 L^2 + \dots + \theta_q L^q)\varepsilon_t. \quad (50)$$

Therefore (41) is the special case of (50) in which $\theta_j = 0$, $j = 1, 2, \dots, q$. Rewrite (50) as

$$\eta(L)\Delta Y_t = \varepsilon_t, \quad (51)$$

where $\eta(L) = (1 - \eta_1 L - \eta_2 L^2 - \dots) = (1+\theta_1 L + \theta_2 L^2 + \dots + \theta_q L^q)^{-1}(1 - \phi_1 L - \phi_2 L^2 - \dots - \phi_p L^p)$. That is,

$$\Delta Y_t = \eta_1\Delta Y_{t-1} + \eta_2\Delta Y_{t-2} + \eta_3\Delta Y_{t-3} + \dots + \varepsilon_t$$

or

$$Y_t = Y_{t-1} + \eta_1\Delta Y_{t-1} + \eta_2\Delta Y_{t-2} + \eta_3\Delta Y_{t-3} + \dots + \varepsilon_t. \quad (52)$$

This motivates estimating the coefficients in (52) by regressing $Y_t$ on $Y_{t-1}, \Delta Y_{t-1}, \Delta Y_{t-2}, \dots, \Delta Y_{t-k}$, where $k$ is a suitably chosen integer. To obtain consistent estimates of the coefficients in (52) it is necessary to let $k$ be a function of $T$. Consider a truncated version of (52),

$$Y_t = \alpha + \beta Y_{t-1} + \eta_1\Delta Y_{t-1} + \eta_2\Delta Y_{t-2} + \dots + \eta_k\Delta Y_{t-k} + e_{tk} = x_t'\boldsymbol\beta + e_{tk}, \quad (53)$$

where $\boldsymbol\beta \equiv (\alpha, \beta, \eta_1, \eta_2, \dots, \eta_k)'$ and $x_t \equiv (1, Y_{t-1}, \Delta Y_{t-1}, \Delta Y_{t-2}, \dots, \Delta Y_{t-k})'$. Notice that $e_{tk}$ is not white noise. Even so, the limiting distribution of the t statistic on the coefficient of the lagged $Y_{t-1}$ (i.e., on $\hat\beta_T$) from OLS estimation of (53) is the same Dickey-Fuller t distribution as when $u_t$ is i.i.d.

Theorem 6 (Said-Dickey ADF): Let the data $Y_t$ be generated by (48) and (49) and let the regression model be (53). Assume $T^{-1/3}k \to 0$ and there exist $c > 0$, $r > 0$ such that $ck > T^{1/r}$. Then as $T\to\infty$,

$$t_T = \frac{\hat\beta_T - 1}{\{s_T^2\,e_2'(\sum x_tx_t')^{-1}e_2\}^{1/2}} \Rightarrow \frac{\frac12\{[W(1)]^2-1\} - W(1)\int_0^1 W(r)dr}{\left\{\int_0^1[W(r)]^2dr - \left[\int_0^1 W(r)dr\right]^2\right\}^{1/2}},$$

where $e_2 = [0\ 1\ 0\ \dots\ 0]'$ is the $(k+2)$-vector selecting the coefficient on $Y_{t-1}$.

The intuition for letting $k$ be a function of $T$ is clear from the fact that

$$e_{tk} = \eta_{k+1}\Delta Y_{t-k-1} + \eta_{k+2}\Delta Y_{t-k-2} + \dots + \varepsilon_t.$$

Then as $k\to\infty$,

$$e_{tk} - \varepsilon_t = \eta_{k+1}\Delta Y_{t-k-1} + \eta_{k+2}\Delta Y_{t-k-2} + \dots \xrightarrow{p} 0$$

from the absolute summability of the $\eta_j$, i.e., $\sum_{j=0}^\infty|\eta_j| < \infty$, which implies $\eta_j \to 0$. It is thus to be expected that the ARIMA(p,1,q) process here has the same asymptotic results as an ARIMA(p,1,0) process. The asymptotic result must therefore be derived under the condition $k\to\infty$; however, $k$ cannot increase too quickly relative to $T$, hence the condition $T^{-1/3}k\to 0$.

3.3.1 Choice of Lag Length in the ADF Test

It has been observed that the size and power properties of the ADF test are sensitive to the number of lagged terms ($k$) used. Several guidelines have been suggested for the choice of $k$; Ng and Perron (1995) examine these in detail. The guidelines are:

(a). Rules fixing $k$ at an arbitrary level independent of $T$.
Overall, their detailed simulations show that choosing a fixed $k$ is not desirable.

(b). Rules fixing $k$ as a function of $T$. A commonly used rule, suggested by Schwert (1989), is to choose $k = \mathrm{Int}\{c(T/100)^{1/d}\}$; Schwert suggests $c = 12$ and $d = 4$. The problem with such a rule is that it need not be optimal for all $p$ and $q$ in the ARMA(p,q).

(c). Information-based rules. The information criteria suggest choosing $k$ to minimize an objective function that trades off parsimony against the reduction in the sum of squares. The objective function has the form (see also p. 5 of Ch. 16)

$$I_k = \log\hat\sigma_k^2 + k\,\frac{C_T}{T}.$$

The Akaike information criterion (AIC) chooses $C_T = 2$. The Schwarz Bayesian information criterion (BIC) chooses $C_T = \log T$. Ng and Perron argue that AIC and BIC are asymptotically equivalent in ARMA(p,q) models and that both choose $k$ proportional to $\log T$.

(d). Sequential rules. Hall (1994) discusses a general-to-specific rule: start with a large value of $k$ ($k_{max}$), test the significance of the last coefficient, and reduce $k$ iteratively until a significant statistic is encountered.

Ng and Perron (1995) compare AIC, BIC, and Hall's general-to-specific approach in a Monte Carlo study (a code sketch of these rules appears below). The major conclusions are:

(a). Both AIC and BIC choose very small values of $k$ (e.g., $k = 3$). This results in large size distortions, especially with MA errors. (Remark: the intuition behind this conclusion is that with an MA error the model is an AR($\infty$); if too few lags are used, the fitted model looks very little like an AR($\infty$), and since the asymptotic result is derived under $k \to \infty$, the finite-sample distribution is far from the asymptotic distribution, producing size distortion.)

(b). Hall's criterion tends to choose higher values of $k$; the higher $k_{max}$ is, the higher the chosen value of $k$. This results in the size being near the nominal level, but of course with a loss of power. (Remark: the unit root statistics derived so far are distributions under the null. One can also derive the behavior of the test statistics under the alternative hypothesis of a stationary or fractionally differenced process. These statistics should diverge to $-\infty$, say, for a left-tailed test, when the alternative is true, so that the null is rejected with probability approaching one, i.e., the test has asymptotic power one. A common finding is that the asymptotic behavior of the unit root statistic under the alternative is a function of $k$, more precisely of $(T/k)$. For fixed $T$, $(T/k)$ is smaller when $k$ is larger; this slows the divergence of the statistic to $-\infty$ under the alternative and lowers power. See Lee and Schmidt (1996) and Lee and Shie (2004).)

What this study suggests is that Hall's general-to-specific method is preferable to the others. DeJong et al. (1992) show that increasing $k$ typically results in a modest decrease in power but a substantial decrease in size distortion. If this is the case, the information criteria are at a disadvantage because they result in the choice of very small values of $k$. However, Stock (1994) presents opposite evidence, arguing in favor of BIC over Hall's method.
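Here is the promised sketch of rules (b)-(d) in code (our helper functions; the 1.96 cutoff in Hall's rule and all simulation settings are illustrative choices):

```python
import numpy as np

def _fit(Y, k, ts):
    """OLS of Y_t on (1, Y_{t-1}, dY_{t-1}, ..., dY_{t-k}) over dates ts;
    returns the ML-style residual variance and the last lag's t ratio."""
    dY = np.diff(Y)
    cols = [np.ones(len(ts)), Y[ts - 1]] + [dY[ts - 1 - j] for j in range(1, k + 1)]
    X = np.column_stack(cols)
    b = np.linalg.lstsq(X, Y[ts], rcond=None)[0]
    e = Y[ts] - X @ b
    t_last = np.nan
    if k > 0:
        s2 = (e ** 2).sum() / (len(ts) - X.shape[1])
        t_last = b[-1] / np.sqrt(s2 * np.linalg.inv(X.T @ X)[-1, -1])
    return (e ** 2).mean(), t_last

def select_k(Y, kmax, rule="BIC"):
    n = len(Y) - 1
    ts = np.arange(kmax + 1, n + 1)            # common sample for all k
    T = float(len(ts))
    if rule == "hall":                         # general-to-specific rule (d)
        for k in range(kmax, 0, -1):
            if abs(_fit(Y, k, ts)[1]) > 1.96:
                return k
        return 0
    C = 2.0 if rule == "AIC" else np.log(T)    # I_k = log s2_k + k C_T / T
    crit = [np.log(_fit(Y, k, ts)[0]) + k * C / T for k in range(kmax + 1)]
    return int(np.argmin(crit))

rng = np.random.default_rng(8)
dY = np.zeros(500)
for t in range(1, 500):
    dY[t] = 0.5 * dY[t - 1] + rng.normal()
Y = np.concatenate(([0.0], np.cumsum(dY)))
print(select_k(Y, 8, "AIC"), select_k(Y, 8, "BIC"), select_k(Y, 8, "hall"))
```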
3.4 Phillips-Perron Test, $u_t$ is a Mixing Process

In contrast to the ADF approach, where the AR(1) unit root process is extended to AR(p) and ARMA(p,q) processes with a unit root, Phillips (1987) and Phillips and Perron (1988) extend the random walk (the AR(1) process with a unit root) to a general setting that allows for weakly dependent and heterogeneously distributed innovations. The model Phillips (1987) considers is

$$Y_t = \beta Y_{t-1} + u_t, \quad (54)$$
$$\beta = 1, \quad (55)$$

where $Y_0 = 0$ (not necessary in Phillips's original paper) and $u_t$ is a weakly dependent and heterogeneously distributed innovation sequence to be specified below. We consider the three least squares regressions

$$Y_t = \hat\beta Y_{t-1} + \hat u_t, \quad (56)$$
$$Y_t = \hat\alpha + \hat\beta Y_{t-1} + \hat u_t, \quad (57)$$
$$Y_t = \tilde\alpha + \tilde\beta Y_{t-1} + \tilde\delta\,(t - \tfrac12 T) + \tilde u_t, \quad (58)$$

where $\hat\beta$, $(\hat\alpha, \hat\beta)$, and $(\tilde\alpha, \tilde\delta, \tilde\beta)$ are the conventional least squares regression coefficients. Phillips (1987) and Phillips and Perron (1988) were concerned with the limiting distributions of these estimators under the null hypothesis that the data are generated by (54) and (55).

So far we have assumed that the sequence used to construct $W_T$ is i.i.d. Nevertheless, just as we can obtain central limit theorems when the innovations are not necessarily i.i.d., so also can we obtain FCLTs. Here we present a version of the FCLT, due to McLeish (1975), under very weak assumptions on $u_t$.

Theorem 7 (McLeish): Let $u_t$ satisfy
(a). $E(u_t) = 0$;
(b). $\sup_t E|u_t|^\gamma < \infty$ for some $\gamma > 2$;
(c). $\lambda^2 = \lim_{T\to\infty}E[T^{-1}(\sum u_t)^2]$ exists and $\lambda^2 > 0$; and
(d). $u_t$ is strong mixing with mixing coefficients $\alpha_m$ satisfying $\sum_{m=1}^\infty \alpha_m^{1-2/\gamma} < \infty$.
Then $W_T \Rightarrow W$, where $W_T(r) \equiv T^{-1/2}\sum_{t=1}^{[Tr]}u_t/\lambda$.

Conditions (a)-(d) allow for both temporal dependence (through mixing) and heteroskedasticity (as long as $\lambda^2 = \lim_{T\to\infty}E[T^{-1}(\sum u_t)^2]$ exists) in the process $u_t$. (Hint: when $u_t$ is an i.i.d. process, $\lambda^2 = \lim_{T\to\infty}E[T^{-1}(\sum u_t)^2] = \sigma^2$, and this result reduces to (10).)

We first provide the following asymptotic results for the sample moments, which are useful for deriving the asymptotics of the OLS estimator.

Lemma 2: Let $u_t$ be a random sequence satisfying the assumptions of Theorem 7 and, in addition, $\sup_t E|u_t|^{\gamma+\eta} < \infty$ for some $\eta > 0$; let

$$Y_t = u_1 + u_2 + \dots + u_t \quad\text{for } t = 1, 2, \dots, T, \quad (59)$$

with $Y_0 = 0$. Then

(a) $T^{-1/2}\sum_{t=1}^T u_t \xrightarrow{L} \lambda W(1)$;
(b) $T^{-2}\sum_{t=1}^T Y_{t-1}^2 \Rightarrow \lambda^2\int_0^1[W(r)]^2dr$;
(c) $T^{-3/2}\sum_{t=1}^T Y_{t-1} \Rightarrow \lambda\int_0^1 W(r)dr$;
(d) $T^{-1}\sum_{t=1}^T u_t^2 \xrightarrow{p} \sigma_u^2 \equiv \lim_{T\to\infty}T^{-1}\sum_{t=1}^T E(u_t^2)$;
(e) $T^{-1}\sum_{t=1}^T Y_{t-1}u_t \Rightarrow \frac12\{\lambda^2[W(1)^2] - \sigma_u^2\}$;
(f) $T^{-3/2}\sum_{t=1}^T t\,u_t \Rightarrow \lambda[W(1) - \int_0^1 W(r)dr]$;
(g) $T^{-5/2}\sum_{t=1}^T t\,Y_{t-1} \Rightarrow \lambda\int_0^1 rW(r)dr$;
(h) $T^{-3}\sum_{t=1}^T t\,Y_{t-1}^2 \Rightarrow \lambda^2\int_0^1 r[W(r)]^2dr$.

Joint weak convergence of the sample moments above to their respective limits is easily established and will be used below.

Proof: The proofs of items (a), (b), (c), (f), (g), and (h) are analogous to the corresponding proofs in Lemma 1. Item (d) is the result of an LLN for a mixing process.
(e). For a random walk, $Y_t^2 = (Y_{t-1}+u_t)^2 = Y_{t-1}^2 + 2Y_{t-1}u_t + u_t^2$, implying $Y_{t-1}u_t = \frac12\{Y_t^2 - Y_{t-1}^2 - u_t^2\}$ and hence $\sum_{t=1}^T Y_{t-1}u_t = \frac12\{Y_T^2 - Y_0^2\} - \frac12\sum_{t=1}^T u_t^2$. Recalling $Y_0 = 0$, it is convenient to write $\sum_{t=1}^T Y_{t-1}u_t = \frac12 Y_T^2 - \frac12\sum_{t=1}^T u_t^2$.
From item (a) we know that $T^{-1}Y_T^2 = (T^{-1/2}\sum_{s=1}^T u_s)^2 \xrightarrow{L} \lambda^2 W(1)^2$, and $T^{-1}\sum_{t=1}^T u_t^2 \xrightarrow{p} \sigma_u^2$ by the LLN (McLeish); hence $T^{-1}\sum_{t=1}^T Y_{t-1}u_t \Rightarrow \frac12\{\lambda^2[W(1)^2] - \sigma_u^2\}$.

3.4.1 No Constant Term or Time Trend in the Regression; True Process Is a Random Walk

We first consider the case in which the regression model contains no constant term or time trend while the true process is a random walk. The asymptotic distributions of the OLS unit root coefficient estimator and t-ratio test statistic are as follows.

Theorem 8: Let the data $Y_t$ be generated by (54) and (55), and let $u_t$ be a random sequence satisfying the assumptions of Theorem 7 with $\sup_t E|u_t|^{\gamma+\eta} < \infty$ for some $\eta > 0$. Then as $T\to\infty$, for the regression model (56),

$$T(\hat\beta_T - 1) \Rightarrow \frac{\frac12([W(1)]^2 - \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2dr}$$

and

$$t = \frac{\hat\beta_T - 1}{\hat\sigma_{\hat\beta_T}} \Rightarrow \frac{(\lambda/2\sigma_u)\{[W(1)]^2 - \sigma_u^2/\lambda^2\}}{\{\int_0^1[W(r)]^2dr\}^{1/2}},$$

where $\hat\sigma^2_{\hat\beta_T} = s_T^2\big/\sum_{t=1}^T Y_{t-1}^2$ and $s_T^2$ denotes the OLS estimate of the disturbance variance,

$$s_T^2 = \sum_{t=1}^T(Y_t - \hat\beta_T Y_{t-1})^2/(T-1).$$

Proof: The deviation of the OLS estimate from the true value is characterized by

$$T(\hat\beta_T - 1) = \frac{T^{-1}\sum_{t=1}^T Y_{t-1}u_t}{T^{-2}\sum_{t=1}^T Y_{t-1}^2}, \quad (60)$$

which is a continuous function of the quantities in Lemma 2(b) and 2(e); it follows that under the null hypothesis $\beta = 1$,

$$T(\hat\beta_T - 1) \Rightarrow \frac{\frac12\{\lambda^2[W(1)^2] - \sigma_u^2\}}{\lambda^2\int_0^1[W(r)]^2dr} = \frac{\frac12([W(1)]^2 - \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2dr}. \quad (61)$$

To prove the second part of the theorem, we first note that, since $\hat\beta$ is consistent by (61), $s_T^2$ is a consistent estimator of $\sigma_u^2$, by arguments analogous to those in the proof of Theorem 1. We can then express the t statistic alternatively as

$$t_T = T(\hat\beta_T - 1)\Big\{T^{-2}\sum_{t=1}^T Y_{t-1}^2\Big\}^{1/2}\Big/(s_T^2)^{1/2} = \frac{T^{-1}\sum_{t=1}^T Y_{t-1}u_t}{\{T^{-2}\sum_{t=1}^T Y_{t-1}^2\}^{1/2}(s_T^2)^{1/2}},$$

which is again a continuous function of the quantities in Lemma 2(b) and 2(e); it follows that under the null hypothesis $\beta = 1$,

$$t_T \xrightarrow{L} \frac{\frac12\{\lambda^2 W(1)^2 - \sigma_u^2\}}{\{\lambda^2\int_0^1[W(r)]^2dr\}^{1/2}(\sigma_u^2)^{1/2}} = \frac{(\lambda/2\sigma_u)\{[W(1)]^2 - \sigma_u^2/\lambda^2\}}{\{\int_0^1[W(r)]^2dr\}^{1/2}}. \quad (62)$$

This completes the proof of the theorem.

Theorem 8 extends (18) and (24) to the very general case of weakly dependent and heterogeneously distributed data. When $u_t$ is an i.i.d. sequence, $\lambda^2 = T^{-1}E[(\sum u_t)^2] = T^{-1}\sum E(u_t^2) = \sigma_u^2$, and the results of Theorem 8 reduce to those of Theorem 1.

Several features of the asymptotic distribution (61) of the least squares estimator are noteworthy. First, note that the scale factor here is $T$, not $\sqrt T$ as it previously has been: $\hat\beta$ "collapses" to its limit at a much faster rate than before. This is sometimes called superconsistency. Next, note that the limiting distribution is no longer normal; instead, we have a distribution that is a somewhat complicated functional of a Wiener process. When $\sigma_u^2 = \lambda^2$ (independence), this is the distribution of J. S. White (1958, p. 1196), apart from an incorrect scaling there, as noted by Phillips (1987). For $\sigma_u^2 = \lambda^2$, this distribution is also the one tabulated by Dickey and Fuller (1979) in their famous work on testing for a unit root.
In the regression settings studied in previous chapters, serial correlation in $u_t$ in the presence of a lagged dependent variable as regressor leads to inconsistency of $\hat\beta$ for $\beta_0$, as discussed in Chapter 10. Here, however, the situation is quite different. Even though the regressor is a lagged dependent variable, $\hat\beta$ is consistent for $\beta_0 = 1$, despite the fact that conditions (c) and (d) of Theorem 7 permit $u_t$ to display considerable correlation. The effect of the serial correlation is that $\sigma_u^2 \ne \lambda^2$, which shifts the location of the asymptotic distribution away from zero (since $E(W(1)^2 - \sigma_u^2/\lambda^2) \ne 0$, where $E(W(1)^2) = E(\chi^2_{(1)}) = 1$) relative to the $\sigma_u^2 = \lambda^2$ case (no serial correlation). Despite this effect of the serial correlation in $u_t$, we no longer have the serious adverse consequence of inconsistency of $\hat\beta$. One way of understanding why this is so is succinctly expressed by Phillips (1987, p. 283):

  Intuitively, when the data generating process has a unit root, the strength of the signal (as measured by the sample variation of the regressor $Y_{t-1}$) dominates the noise by a factor of O(T), so that the effects of any regressor-error correlation are annihilated in the regression as $T\to\infty$.

Note, however, that even when $\sigma_u^2 = \lambda^2$ the asymptotic distribution given in (61) is not centered at zero, so an asymptotic bias is still present. The reason for this is that there generally exists a strong (negative) correlation between $W(1)^2$ and $(\int_0^1[W(r)]^2dr)^{-1}$, resulting from the fact that $W(1)^2$ and $W(r)^2$ are highly correlated for each $r$. Thus, even though $E(W(1)^2 - \sigma_u^2/\lambda^2) = 0$ when $\sigma_u^2 = \lambda^2$, we do not have

$$E\left[\frac{\frac12([W(1)]^2 - \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2dr}\right] = 0.$$

See Abadir (1995) for further details.

3.4.2 Estimation of $\lambda^2$ and $\sigma_u^2$

The limiting distributions given in Theorem 8 depend on the unknown parameters $\sigma_u^2$ and $\lambda^2$. These distributions are therefore not directly usable for statistical testing. However, both parameters may be consistently estimated, and the estimates may be used to construct modified statistics whose limiting distributions are independent of $(\lambda^2, \sigma_u^2)$. As we shall see, these new statistics provide very general tests for the presence of a unit root in (54).

As shown in Lemma 2(d), $T^{-1}\sum_{t=1}^T u_t^2 \xrightarrow{p} \sigma_u^2$. This provides us with the simple estimator

$$s_u^2 = T^{-1}\sum_{t=1}^T(Y_t - Y_{t-1})^2 = T^{-1}\sum_{t=1}^T u_t^2,$$

which is consistent for $\sigma_u^2$ under the null hypothesis $\beta = 1$. Since $\hat\beta\to 1$ by Theorem 8, we may also use $\hat s_u^2 = T^{-1}\sum_{t=1}^T(Y_t - \hat\beta Y_{t-1})^2$ as a consistent estimator of $\sigma_u^2$.

Consistent estimation of $\lambda^2 = \lim_{T\to\infty}T^{-1}E(\sum_{t=1}^T u_t)^2$ is more difficult. We start by defining

$$\lambda_T^2 = T^{-1}E\Big(\sum_{t=1}^T u_t\Big)^2 = T^{-1}\sum_{t=1}^T E(u_t^2) + 2T^{-1}\sum_{\tau=1}^{T-1}\sum_{t=\tau+1}^T E(u_tu_{t-\tau})$$

and by introducing the approximation

$$\lambda_{Tl}^2 = T^{-1}\sum_{t=1}^T E(u_t^2) + 2T^{-1}\sum_{\tau=1}^{l}\sum_{t=\tau+1}^T E(u_tu_{t-\tau}).$$

We shall call $l$ the lag truncation number. For large $T$ and large $l < T$, $\lambda_{Tl}^2$ may be expected to be very close to $\lambda_T^2$ if the total contribution to $\lambda_T^2$ of covariances $E(u_tu_{t-\tau})$ at long lags $\tau > l$ is small. This will be true if $u_t$ satisfies the assumptions of Theorem 7. Formally, we have the following lemma.

Lemma 3: If the sequence $u_t$ satisfies the assumptions of Theorem 7 and if $l\to\infty$ as $T\to\infty$, then $\lambda_T^2 - \lambda_{Tl}^2 \to 0$ as $T\to\infty$.
This lemma suggests that, under suitable conditions on the rate at which $l\to\infty$ as $T\to\infty$, we may proceed to estimate $\lambda^2$ from a finite sample of data by estimating $\lambda_{Tl}^2$. We define

$$s_{Tl}^2 = T^{-1}\sum_{t=1}^T u_t^2 + 2T^{-1}\sum_{\tau=1}^{l}\sum_{t=\tau+1}^T u_tu_{t-\tau}. \quad (63)$$

The following result establishes that $s_{Tl}^2$ is a consistent estimator of $\lambda^2$.

Theorem 9: If
(a). $u_t$ satisfies all the assumptions of Theorem 7, except that part (b) is replaced by the stronger moment condition $\sup_t E|u_t|^{2\gamma} < \infty$ for some $\gamma > 2$; and
(b). $l\to\infty$ as $T\to\infty$ such that $l = o(T^{1/4})$;
then $s_{Tl}^2 \to \lambda^2$ as $T\to\infty$.

According to this result, if we allow the number of estimated autocovariances to increase as $T\to\infty$ but control the rate of increase so that $l = o(T^{1/4})$, then $s_{Tl}^2$ yields a consistent estimator of $\lambda^2$. Inevitably, the choice of $l$ will be an empirical matter.

Rather than using the first differences $u_t = Y_t - Y_{t-1}$ in the construction of $s_{Tl}^2$, we could have used the residuals $\hat u_t = Y_t - \hat\beta Y_{t-1}$ from the least squares regression. Since $\hat\beta\to 1$, this estimator is also consistent for $\lambda^2$ under the null hypothesis $\beta = 1$.

We remark that $s_{Tl}^2$ as defined in (63) is not constrained to be nonnegative: when there are large negative sample serial covariances, $s_{Tl}^2$ can take negative values. Newey and West (1987) have suggested a modification of variance estimators such as $s_{Tl}^2$ that ensures they are nonnegative. In the present case, the modification yields

$$\hat s_{Tl}^2 = T^{-1}\sum_{t=1}^T\hat u_t^2 + 2T^{-1}\sum_{\tau=1}^{l}w_{\tau l}\sum_{t=\tau+1}^T\hat u_t\hat u_{t-\tau}, \quad (64)$$

where

$$w_{\tau l} = 1 - \tau/(l+1), \quad (65)$$

which puts higher weight on more recent autocovariances.
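A sketch of the Newey-West estimator (64)-(65) (our function; the MA(1) example is illustrative, with long-run variance $(1+0.6)^2 = 2.56$ versus unconditional variance $1 + 0.36 = 1.36$):

```python
import numpy as np

def s2_Tl(u, l):
    """Newey-West long-run variance (64) with Bartlett weights (65)."""
    T = len(u)
    out = (u ** 2).sum() / T
    for tau in range(1, l + 1):
        w = 1.0 - tau / (l + 1.0)                 # w_{tau,l} = 1 - tau/(l+1)
        out += 2.0 * w * (u[tau:] * u[:-tau]).sum() / T
    return out

rng = np.random.default_rng(9)
e = rng.normal(size=100000)
u = e[1:] + 0.6 * e[:-1]                          # MA(1) errors
print(u.var(), s2_Tl(u, l=20))                    # approx 1.36 and 2.56
```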
3.4.3 New Tests for a Unit Root

The consistent estimators $\hat s_u^2$ and $\hat s_{Tl}^2$ may be used to develop new tests for unit roots that apply under very general conditions. We define the statistics

$$Z_{\hat\beta} = T(\hat\beta - 1) - \frac{\frac12(\hat s_{Tl}^2 - \hat s_u^2)}{T^{-2}\sum_{1}^{T}Y_{t-1}^2} \quad (66)$$

and

$$Z_{\hat t} = t_T\,(\hat s_u^2/\hat s_{Tl}^2)^{1/2} - \frac12(\hat s_{Tl}^2 - \hat s_u^2)\left[\hat s_{Tl}\Big(T^{-2}\sum_{1}^{T}Y_{t-1}^2\Big)^{1/2}\right]^{-1}. \quad (67)$$

$Z_{\hat\beta}$ is a transformation of the standardized estimator $T(\hat\beta - 1)$, and $Z_{\hat t}$ is a transformation of the regression t statistic in (62). The limiting distributions of $Z_{\hat\beta}$ and $Z_{\hat t}$ are given by the following result.

Theorem 10 (Phillips 1987): If the conditions of Lemma 3 are satisfied, then as $T\to\infty$,

(a). $Z_{\hat\beta} \Rightarrow \dfrac{\frac12\{[W(1)]^2 - 1\}}{\int_0^1[W(r)]^2dr}$ and

(b). $Z_{\hat t} \Rightarrow \dfrac{\frac12\{[W(1)]^2 - 1\}}{\{\int_0^1[W(r)]^2dr\}^{1/2}}$

under the null hypothesis that the data are generated by (54) and (55).

Proof: (a). From (61) we have

$$T(\hat\beta_T - 1) \Rightarrow \frac{\frac12([W(1)]^2 - \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2dr},$$

and from Lemma 2(b) and Theorem 9 we have

$$\frac{\frac12(\hat s_{Tl}^2 - \hat s_u^2)}{T^{-2}\sum_1^T Y_{t-1}^2} \Rightarrow \frac{\frac12(\lambda^2 - \sigma_u^2)}{\lambda^2\int_0^1[W(r)]^2dr} \equiv \frac{\frac12(1 - \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2dr}.$$

Therefore the test statistic $Z_{\hat\beta}$ is distributed as

$$Z_{\hat\beta} \Rightarrow \frac{\frac12([W(1)]^2 - \sigma_u^2/\lambda^2 - 1 + \sigma_u^2/\lambda^2)}{\int_0^1[W(r)]^2dr} \equiv \frac{\frac12([W(1)]^2 - 1)}{\int_0^1[W(r)]^2dr}.$$

(b). From (62) and the consistency of $\hat s_u^2$ and $\hat s_{Tl}^2$ we have

$$t_T\,(\hat s_u^2/\hat s_{Tl}^2)^{1/2} \Rightarrow \frac{\frac12\{[W(1)]^2 - \sigma_u^2/\lambda^2\}}{\{\int_0^1[W(r)]^2dr\}^{1/2}}. \quad (68)$$

Consider the statistic

$$\frac{\frac12(\hat s_{Tl}^2 - \hat s_u^2)}{\hat s_{Tl}\left[T^{-2}\sum_1^T Y_{t-1}^2\right]^{1/2}} \Rightarrow \frac{\frac12(\lambda^2 - \sigma_u^2)}{\lambda^2\{\int_0^1[W(r)]^2dr\}^{1/2}} \equiv \frac{\frac12(1 - \sigma_u^2/\lambda^2)}{\{\int_0^1[W(r)]^2dr\}^{1/2}}. \quad (69)$$

Combining (68) and (69),

$$Z_{\hat t} = t_T(\hat s_u^2/\hat s_{Tl}^2)^{1/2} - \frac{\frac12(\hat s_{Tl}^2 - \hat s_u^2)}{\hat s_{Tl}\left[T^{-2}\sum_1^T Y_{t-1}^2\right]^{1/2}} \Rightarrow \frac{\frac12([W(1)]^2 - \sigma_u^2/\lambda^2 - 1 + \sigma_u^2/\lambda^2)}{\{\int_0^1[W(r)]^2dr\}^{1/2}} \equiv \frac{\frac12([W(1)]^2 - 1)}{\{\int_0^1[W(r)]^2dr\}^{1/2}}.$$

Theorem 10 demonstrates that the limiting distributions of the two statistics $Z_{\hat\beta}$ and $Z_{\hat t}$ are invariant within a very wide class of weakly dependent and possibly heterogeneously distributed innovations $u_t$. Moreover, the limiting distribution of $Z_{\hat\beta}$ is identical to that of $T(\hat\beta - 1)$ when $\lambda^2 = \sigma_u^2$, so the statistical tables reported in the section labeled Case 1 of Table B.5 remain usable. The limiting distribution of $Z_{\hat t}$ is identical to that of the regression $t_T$ statistic when $\lambda^2 = \sigma_u^2$ (this is, in fact, the limiting distribution of the t statistic when the innovation $u_t$ is i.i.d.$(0, \sigma^2)$), so the statistical tables reported in the section labeled Case 1 of Table B.6 remain usable.

Phillips and Perron (1988) analyze the asymptotics of the OLS estimators when the regression contains a constant, (57), or a constant and a time trend, (58), under the assumption that the true data generating process is (54) and (55).

Theorem 11 (Phillips and Perron 1988): If the conditions of Lemma 3 are satisfied, then as $T\to\infty$, the limiting distributions of

$$Z_{\hat\beta} = T(\hat\beta - 1) - \frac{\frac12(\hat s_{Tl}^2 - \hat s_u^2)}{T^{-2}\sum_1^T(Y_{t-1} - \bar Y_{-1})^2}, \qquad Z_{\hat t} = \hat t_T(\hat s_u^2/\hat s_{Tl}^2)^{1/2} - \frac12(\hat s_{Tl}^2 - \hat s_u^2)\left[\hat s_{Tl}\Big(T^{-2}\sum_1^T(Y_{t-1} - \bar Y_{-1})^2\Big)^{1/2}\right]^{-1}$$

for regression (57), and of

$$Z_{\tilde\beta} = T(\tilde\beta - 1) - \frac{T^6}{24\,D_X}(\tilde s_{Tl}^2 - \tilde s_u^2), \qquad Z_{\tilde t} = \tilde t_T(\tilde s_u^2/\tilde s_{Tl}^2)^{1/2} - \frac{T^3(\tilde s_{Tl}^2 - \tilde s_u^2)}{4\sqrt{3}\,D_X^{1/2}\,\tilde s_{Tl}}$$

for regression (58), are identical to the corresponding distributions when $\lambda^2 = \sigma_u^2$. Here $\bar Y_{-1} = \sum_1^{T-1}Y_t/(T-1)$, $D_X = \det(X'X)$ with regressors $X = (1, t, Y_{t-1})$, $\hat s_u^2 = T^{-1}\sum_1^T(Y_t - \hat\alpha - \hat\beta Y_{t-1})^2$, $\tilde s_u^2 = T^{-1}\sum_1^T[Y_t - \tilde\alpha - \tilde\beta Y_{t-1} - \tilde\delta(t - \frac12 T)]^2$, and $\hat s_{Tl}^2$ and $\tilde s_{Tl}^2$ are the Newey-West estimators (64) constructed from the respective residuals.

Exercise: Reproduce Case 4 of Tables B.5 and B.6 in Hamilton (1994), pp. 762-763. Two things to note: (1) confirm that the same result is obtained with non-Gaussian i.i.d. innovations; (2) the constant $\alpha$ does not affect this distribution (using $\alpha = 0$ and $\alpha = 10000$ gives identical results).

Exercise: Reproduce Table 1 of Phillips and Perron (1988, p. 344), from which you will have the chance to write your own unit root test (ADF and PP) programs in Gauss and to see what the size distortion and lack of power of unit root tests look like.

3.5 Phillips-Perron Test, $u_t$ is an MA($\infty$) Process

Hamilton (1994, Ch. 17) has parameterized the Phillips-Perron test by assuming the innovation in (54) to be

$$u_t = \psi(L)\varepsilon_t = \sum_{j=0}^\infty\psi_j\varepsilon_{t-j}, \quad (70)$$

where $\varepsilon_t$ is a white noise process $(0, \sigma_\varepsilon^2)$ and $\sum_{j=0}^\infty j\,|\psi_j| < \infty$.

3.5.1 Beveridge-Nelson Decomposition

Since (70) is a subcase of the assumptions in Theorem 7 (McLeish), all we have to show is that the "long-run" variance $\lambda^2$ of Theorem 7 here equals $\sigma_\varepsilon^2\,\psi(1)^2$ (Hamilton, p. 505, eq. 17.5.10). To do this, we need the Beveridge-Nelson decomposition.

Theorem 12 (Beveridge-Nelson (B-N) Decomposition): Let

$$u_t = \psi(L)\varepsilon_t = \sum_{j=0}^\infty\psi_j\varepsilon_{t-j}, \quad (71)$$

where $\varepsilon_t$ is a white noise process $(0, \sigma_\varepsilon^2)$ and $\sum_{j=0}^\infty j\,|\psi_j| < \infty$. Then

$$u_1 + u_2 + \dots + u_t = \psi(1)(\varepsilon_1 + \varepsilon_2 + \dots + \varepsilon_t) + \xi_t - \xi_0,$$

where $\psi(1) \equiv \sum_{j=0}^\infty\psi_j$, $\xi_t = \sum_{j=0}^\infty\alpha_j\varepsilon_{t-j}$, $\alpha_j = -(\psi_{j+1} + \psi_{j+2} + \psi_{j+3} + \dots)$, and $\sum_{j=0}^\infty|\alpha_j| < \infty$.
(Therefore, $\xi_t$ is a stationary process, since it is an MA($\infty$) with absolutely summable coefficients.)

Proof:
Observe that

$$ \sum_{s=1}^{t} u_s = \sum_{s=1}^{t} \sum_{j=0}^{\infty} \psi_j \varepsilon_{s-j} $$
$$ = \{\psi_0\varepsilon_t + \psi_1\varepsilon_{t-1} + \psi_2\varepsilon_{t-2} + \ldots + \psi_t\varepsilon_0 + \psi_{t+1}\varepsilon_{-1} + \ldots\} $$
$$ \quad + \{\psi_0\varepsilon_{t-1} + \psi_1\varepsilon_{t-2} + \psi_2\varepsilon_{t-3} + \ldots + \psi_{t-1}\varepsilon_0 + \psi_t\varepsilon_{-1} + \ldots\} $$
$$ \quad + \{\psi_0\varepsilon_{t-2} + \psi_1\varepsilon_{t-3} + \psi_2\varepsilon_{t-4} + \ldots + \psi_{t-2}\varepsilon_0 + \psi_{t-1}\varepsilon_{-1} + \ldots\} $$
$$ \quad + \ldots + \{\psi_0\varepsilon_1 + \psi_1\varepsilon_0 + \psi_2\varepsilon_{-1} + \ldots\} $$
$$ = \psi_0\varepsilon_t + (\psi_0+\psi_1)\varepsilon_{t-1} + (\psi_0+\psi_1+\psi_2)\varepsilon_{t-2} + \ldots + (\psi_0+\psi_1+\ldots+\psi_{t-1})\varepsilon_1 $$
$$ \quad + (\psi_1+\psi_2+\ldots+\psi_t)\varepsilon_0 + (\psi_2+\psi_3+\ldots+\psi_{t+1})\varepsilon_{-1} + \ldots $$
$$ = (\psi_0+\psi_1+\psi_2+\ldots)\varepsilon_t - (\psi_1+\psi_2+\psi_3+\ldots)\varepsilon_t $$
$$ \quad + (\psi_0+\psi_1+\psi_2+\ldots)\varepsilon_{t-1} - (\psi_2+\psi_3+\psi_4+\ldots)\varepsilon_{t-1} $$
$$ \quad + (\psi_0+\psi_1+\psi_2+\ldots)\varepsilon_{t-2} - (\psi_3+\psi_4+\psi_5+\ldots)\varepsilon_{t-2} + \ldots $$
$$ \quad + (\psi_0+\psi_1+\psi_2+\ldots)\varepsilon_1 - (\psi_t+\psi_{t+1}+\psi_{t+2}+\ldots)\varepsilon_1 $$
$$ \quad + (\psi_1+\psi_2+\psi_3+\ldots)\varepsilon_0 - (\psi_{t+1}+\psi_{t+2}+\psi_{t+3}+\ldots)\varepsilon_0 $$
$$ \quad + (\psi_2+\psi_3+\psi_4+\ldots)\varepsilon_{-1} - (\psi_{t+2}+\psi_{t+3}+\psi_{t+4}+\ldots)\varepsilon_{-1} + \ldots $$

or

$$ \sum_{s=1}^{t} u_s = \psi(1)\sum_{s=1}^{t} \varepsilon_s + \xi_t - \xi_0, $$

where

$$ \xi_t = -(\psi_1+\psi_2+\psi_3+\ldots)\varepsilon_t - (\psi_2+\psi_3+\psi_4+\ldots)\varepsilon_{t-1} - (\psi_3+\psi_4+\psi_5+\ldots)\varepsilon_{t-2} - \ldots, $$
$$ \xi_0 = -(\psi_1+\psi_2+\psi_3+\ldots)\varepsilon_0 - (\psi_2+\psi_3+\psi_4+\ldots)\varepsilon_{-1} - (\psi_3+\psi_4+\psi_5+\ldots)\varepsilon_{-2} - \ldots $$

This theorem states that for any serially correlated process $u_t$ satisfying (71), its partial sum $\sum u_t$ can be written as the sum of a random walk $\psi(1)\sum \varepsilon_t$, a stationary process $\xi_t$, and an initial condition $\xi_0$. For example, if $u_t = \varepsilon_t + \theta\varepsilon_{t-1}$ is an MA(1), then $\psi(1) = 1+\theta$, $\alpha_0 = -\theta$, and $\alpha_j = 0$ for $j \ge 1$, so $\xi_t = -\theta\varepsilon_t$ and $\sum_{s=1}^{t} u_s = (1+\theta)\sum_{s=1}^{t}\varepsilon_s - \theta\varepsilon_t + \theta\varepsilon_0$. Notice that $\xi_t$ is stationary, since $\xi_t = \sum_{j=0}^{\infty} \alpha_j\varepsilon_{t-j}$, where $\alpha_j = -(\psi_{j+1} + \psi_{j+2} + \ldots)$ and $\{\alpha_j\}_{j=0}^{\infty}$ is absolutely summable:

$$ \sum_{j=0}^{\infty} |\alpha_j| = |\psi_1+\psi_2+\psi_3+\ldots| + |\psi_2+\psi_3+\psi_4+\ldots| + |\psi_3+\psi_4+\psi_5+\ldots| + \ldots $$
$$ \le \{|\psi_1|+|\psi_2|+|\psi_3|+\ldots\} + \{|\psi_2|+|\psi_3|+|\psi_4|+\ldots\} + \{|\psi_3|+|\psi_4|+|\psi_5|+\ldots\} + \ldots $$
$$ = |\psi_1| + 2|\psi_2| + 3|\psi_3| + \ldots = \sum_{j=0}^{\infty} j\,|\psi_j|, $$

which is bounded by the assumptions of Theorem 12.

3.5.2 The Equality of the Long-Run Variances under Phillips's and Hamilton's Assumptions

We now show that the long-run variance $\lambda^2 = \lim_{T\to\infty} T^{-1}E[(\sum u_t)^2]$ of Theorem 7 is equal to $\psi(1)^2\sigma^2_\varepsilon$ (Hamilton, p.505, eq. 17.5.10). From the B-N decomposition we see that

$$ u_1 + u_2 + \ldots + u_T = \psi(1)(\varepsilon_1 + \varepsilon_2 + \ldots + \varepsilon_T) + \xi_T - \xi_0, $$

therefore

$$ \lambda^2 = T^{-1}E[(u_1+u_2+\ldots+u_T)^2] $$
$$ = T^{-1}E\{\psi(1)^2(\varepsilon_1+\ldots+\varepsilon_T)^2 + \xi_T^2 + \xi_0^2 + 2\psi(1)(\varepsilon_1+\ldots+\varepsilon_T)\xi_T - 2\psi(1)(\varepsilon_1+\ldots+\varepsilon_T)\xi_0 - 2\xi_T\xi_0\} $$
$$ = T^{-1}\left[\psi(1)^2 T\sigma^2_\varepsilon + E(\xi_T^2) + E(\xi_0^2) + 2\psi(1)\sigma^2_\varepsilon\sum_{j=0}^{T-1}\alpha_j - 0 - 2\sigma^2_\varepsilon\sum_{j=0}^{\infty}\alpha_j\alpha_{T+j}\right] $$
$$ \to \psi(1)^2\sigma^2_\varepsilon $$

from the stationarity of $\xi_T$ and the absolute summability of $\alpha_j$. (One has to verify that $\sum_{j=0}^{\infty}|\alpha_j| < \infty$ implies that $\sum_{j=0}^{T-1}\alpha_j$ and $\sum_{j=0}^{\infty}\alpha_j\alpha_{T+j}$ remain bounded as $T \to \infty$, so that these terms vanish after division by $T$.) Therefore, Hamilton's results coincide with those of Phillips once we replace $\lambda^2$ by $\psi(1)^2\sigma^2_\varepsilon$. In the MA(1) example above, $\lambda^2 = (1+\theta)^2\sigma^2_\varepsilon$, which shrinks toward zero as $\theta \to -1$.

4 Issues in Unit Root Testing

4.1 Size Distortion and Low Power of Unit Root Tests

Schwert (1989) first presented Monte Carlo evidence on the size distortion problem of the commonly used unit root tests. He argued that the distribution of the Dickey-Fuller test statistics is far from the distribution tabulated by Dickey and Fuller when the underlying process contains a moving-average component (this is the meaning of size distortion: the distribution under the null hypothesis is not the one you expected, and therefore the 5% critical value is misleading). He also suggested that the Phillips-Perron (PP) tests suffer from size distortions when the MA parameter is large, which, as he noted, is the case for many economic time series. The test with the least size distortion is the Said-Dickey (1984) high-order autoregressive t-test.
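Schwert's finding is easy to illustrate by simulation. The sketch below (illustrative settings, not Schwert's exact design) generates a true unit root process whose first differences are MA(1) with $\theta = -0.8$ and applies the simple unaugmented Dickey-Fuller $t$-test with the asymptotic 5% Case-1 critical value of approximately $-1.95$; the empirical rejection rate should come out far above the nominal 5%, which is precisely the size distortion at issue.

```python
import numpy as np

rng = np.random.default_rng(0)
T, reps, theta = 100, 2000, -0.8       # illustrative settings
crit = -1.95                           # approx. 5% critical value, DF t-test, Case 1

rejections = 0
for _ in range(reps):
    eps = rng.standard_normal(T + 1)
    u = eps[1:] + theta * eps[:-1]               # MA(1) innovations
    Y = np.concatenate(([0.0], np.cumsum(u)))    # true unit root process, Y_0 = 0
    y_lag, y = Y[:-1], Y[1:]
    beta = (y_lag @ y) / (y_lag @ y_lag)         # OLS without constant
    resid = y - beta * y_lag
    s2 = (resid @ resid) / (T - 1)               # residual variance
    t_stat = (beta - 1.0) / np.sqrt(s2 / (y_lag @ y_lag))
    rejections += (t_stat < crit)

print("empirical size:", rejections / reps)      # nominal size is 0.05
```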
Whereas Schwert complained about the size distortion of unit root tests, DeJong et al. (1992) argued that unit root tests have low power against plausible trend-stationary alternatives. Similar problems of size distortion and low power were noted by Agiakoglou and Newbold (1992).

The poor power problem is not unique to unit root tests. Cochrane (1991) argued that any test of the hypothesis $\theta = \theta_0$ has arbitrarily low power against the alternative $\theta_0 - \epsilon$ in small samples, but in many cases the difference between $\theta_0$ and $\theta_0 - \epsilon$ would not be considered important from a statistical or economic perspective. (For example, it rarely matters whether the mean height in a population is 170 or 171.) But the low power problem is particularly disturbing in the unit root case because of the discontinuity of the distribution theory near the unit root (the unit root test statistic has a different asymptotic distribution under the null and under the alternative).

Mention must be made of a paper by Gonzalo and Lee (1996), who complain about the repetition of the phrase "lack of power of unit root tests". They show numerically that the lack of power and size distortion of the Dickey-Fuller tests for unit roots are similar to, and in many situations even smaller than, the lack of power and size distortion of the standard Student t-tests for stationary roots in an autoregressive model. But arguments like this miss the important point: there is no discontinuity of inference in the latter case, but there is in the case of unit root tests. Thus, the consequences of lack of power are vastly different in the two cases.

There have been several solutions to the problems of size distortion and low power of the ADF and PP tests. Some of these are modifications of the ADF and PP tests, and others are new tests. See Maddala and Kim (1999), p.103, for a good survey.

4.2 Tests with Stationarity as the Null: the KPSS Test

Kwiatkowski, Phillips, Schmidt, and Shin (1992), often referred to as KPSS, start with the model

$$ Y_t = \psi + \delta t + \zeta_t + \varepsilon_t, $$

where $\varepsilon_t$ is a stationary process and $\zeta_t$ is a random walk,

$$ \zeta_t = \zeta_{t-1} + u_t, \qquad u_t \sim \text{i.i.d.}(0, \sigma^2_u). $$

The null hypothesis of stationarity is formulated as

$$ H_0: \sigma^2_u = 0, \quad \text{i.e., } \zeta_t \text{ is a constant}. $$

The KPSS test statistic for this hypothesis is

$$ LM = \frac{T^{-2}\sum_{t=1}^{T} S_t^2}{\hat s^2_{Tl}}, $$

where

$$ \hat s^2_{Tl} = T^{-1}\sum_{t=1}^{T} e_t^2 + 2T^{-1}\sum_{\tau=1}^{l} w_{\tau l}\sum_{t=\tau+1}^{T} e_t e_{t-\tau} $$

is a consistent estimator of the long-run variance $\lim_{T\to\infty} T^{-1}E(S_T^2)$. Here $w_{\tau l}$ is an optimal weighting function that corresponds to the choice of a spectral window. KPSS use the Bartlett window, as suggested by Newey and West (1987),

$$ w_{\tau l} = 1 - \frac{\tau}{l+1}, $$

$e_t$ are the residuals from the regression of $Y_t$ on a constant and a time trend (remember that the LM test statistic is constructed under the null hypothesis), and $S_t$ is the partial sum of the $e_t$:

$$ S_t = \sum_{i=1}^{t} e_i, \qquad t = 1, 2, \ldots, T. $$

For consistency of $\hat s^2_{Tl}$ it is necessary that $l \to \infty$ as $T \to \infty$; the rate $l = o(T^{1/2})$ is usually satisfactory. KPSS derive the asymptotic distribution of the LM statistic and tabulate its critical values by simulation. For testing the null of level stationarity instead of trend stationarity, the test is constructed in the same way except that $e_t$ is obtained as the residual from a regression of $Y_t$ on an intercept only. The test is an upper-tail test.
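The statistic is simple to compute. Below is a minimal self-contained sketch of the level-stationarity version (regression on an intercept only, so the residuals are just deviations from the sample mean); the function name `kpss_level` is an illustrative choice.

```python
import numpy as np

def kpss_level(Y, l):
    """Sketch of the KPSS statistic for the null of level stationarity."""
    T = len(Y)
    e = Y - Y.mean()                  # residuals e_t under the null
    S = np.cumsum(e)                  # partial sums S_t
    s2 = np.sum(e**2) / T             # lag-0 term of the long-run variance
    for tau in range(1, l + 1):
        w = 1.0 - tau / (l + 1.0)     # Bartlett weight w_{tau,l}
        s2 += 2.0 * w * np.sum(e[tau:] * e[:-tau]) / T
    return np.sum(S**2) / (T**2 * s2) # reject stationarity for large values
```

For the trend-stationarity version, `e` would instead be the residuals from a regression of $Y_t$ on a constant and a time trend.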
It has been suggested (see, e.g., KPSS, p.176, and Choi, 1994, p.721) that tests using stationarity as the null can be used for confirmatory analysis, i.e., to confirm our conclusions from unit root tests. However, if both tests fail to reject their respective nulls, or if both reject, we do not have a confirmation.

4.3 Panel Data Unit Root Tests

The principal motivation behind panel data unit root tests is to increase the power of unit root tests by increasing the sample size. The alternative route of increasing the sample size, using long time series, is argued to cause problems arising from structural changes; however, it is not clear whether this is more of a problem than the cross-sectional heterogeneity that comes with the use of panel data. It is often argued that the commonly used unit root tests such as ADF and PP are not very powerful, and that using panel data one obtains a more powerful test. See Maddala and Kim (1999), p.134, for a good survey.