Ch. 17 Maximum Likelihood Estimation

The identification process having led to a tentative formulation for the model, we then need to obtain efficient estimates of the parameters. After the parameters have been estimated, the fitted model will be subjected to diagnostic checks. This chapter contains a general account of the likelihood method for estimating the parameters of the stochastic model.

Consider an ARMA model (from model identification) of the form
\[
Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_q\varepsilon_{t-q},
\]
with $\varepsilon_t$ white noise:
\[
E(\varepsilon_t) = 0, \qquad
E(\varepsilon_t\varepsilon_\tau) = \begin{cases} \sigma^2 & \text{for } t = \tau \\ 0 & \text{otherwise.} \end{cases}
\]
This chapter explores how to estimate the values of $(c, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma^2)$ on the basis of observations on $Y$. The primary principle on which estimation will be based is maximum likelihood. Let $\boldsymbol{\theta} = (c, \phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q, \sigma^2)'$ denote the vector of population parameters. Suppose we have observed a sample of size $T$, $(y_1, y_2, \ldots, y_T)$. The approach will be to calculate the joint probability density
\[
f_{Y_T, Y_{T-1}, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_1; \boldsymbol{\theta}), \tag{1}
\]
which might loosely be viewed as the probability of having observed this particular sample. The maximum likelihood estimate (MLE) of $\boldsymbol{\theta}$ is the value for which this sample is most likely to have been observed; that is, it is the value of $\boldsymbol{\theta}$ that maximizes (1).

This approach requires specifying a particular distribution for the white noise process $\varepsilon_t$. Typically we will assume that $\varepsilon_t$ is Gaussian white noise:
\[
\varepsilon_t \sim \text{i.i.d. } N(0, \sigma^2).
\]

1 MLE of a Gaussian AR(1) Process

1.1 Evaluating the Likelihood Function Using the (Scalar) Conditional Density

A stationary Gaussian AR(1) process takes the form
\[
Y_t = c + \phi Y_{t-1} + \varepsilon_t, \tag{2}
\]
with $\varepsilon_t \sim \text{i.i.d. } N(0, \sigma^2)$ and $|\phi| < 1$ (how do you know this at this stage?). For this case, $\boldsymbol{\theta} = (c, \phi, \sigma^2)'$.

Consider the p.d.f. of $Y_1$, the first observation in the sample. This is a random variable with mean and variance
\[
E(Y_1) = \mu = \frac{c}{1-\phi} \qquad \text{and} \qquad Var(Y_1) = \frac{\sigma^2}{1-\phi^2}.
\]
Since $\{\varepsilon_t\}_{t=-\infty}^{\infty}$ is Gaussian, $Y_1$ is also Gaussian. Hence,
\[
f_{Y_1}(y_1; \boldsymbol{\theta}) = f_{Y_1}(y_1; c, \phi, \sigma^2)
= \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2/(1-\phi^2)}}\,
\exp\left\{ -\frac{1}{2}\,\frac{\{y_1 - [c/(1-\phi)]\}^2}{\sigma^2/(1-\phi^2)} \right\}.
\]
Next consider the distribution of the second observation $Y_2$ conditional on observing $Y_1 = y_1$. From (2),
\[
Y_2 = c + \phi Y_1 + \varepsilon_2. \tag{3}
\]
Conditioning on $Y_1 = y_1$ means treating the random variable $Y_1$ as if it were the deterministic constant $y_1$. In that case, (3) gives $Y_2$ as the constant $(c + \phi y_1)$ plus the $N(0, \sigma^2)$ variable $\varepsilon_2$.
Hence,
\[
(Y_2 \mid Y_1 = y_1) \sim N\bigl((c + \phi y_1),\, \sigma^2\bigr),
\]
meaning that
\[
f_{Y_2|Y_1}(y_2 \mid y_1; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\,\frac{(y_2 - c - \phi y_1)^2}{\sigma^2} \right\}.
\]
The joint density of observations 1 and 2 is then just
\[
f_{Y_2, Y_1}(y_2, y_1; \boldsymbol{\theta}) = f_{Y_2|Y_1}(y_2 \mid y_1; \boldsymbol{\theta})\, f_{Y_1}(y_1; \boldsymbol{\theta}).
\]
Similarly, the distribution of the third observation conditional on the first two is
\[
f_{Y_3|Y_2, Y_1}(y_3 \mid y_2, y_1; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\,\frac{(y_3 - c - \phi y_2)^2}{\sigma^2} \right\},
\]
from which
\[
f_{Y_3, Y_2, Y_1}(y_3, y_2, y_1; \boldsymbol{\theta}) = f_{Y_3|Y_2, Y_1}(y_3 \mid y_2, y_1; \boldsymbol{\theta})\, f_{Y_2|Y_1}(y_2 \mid y_1; \boldsymbol{\theta})\, f_{Y_1}(y_1; \boldsymbol{\theta}).
\]
In general, the values of $Y_1, Y_2, \ldots, Y_{t-1}$ matter for $Y_t$ only through the value of $Y_{t-1}$, and the density of observation $t$ conditional on the preceding $t-1$ observations is
\[
f_{Y_t|Y_{t-1}, Y_{t-2}, \ldots, Y_1}(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1; \boldsymbol{\theta}) = f_{Y_t|Y_{t-1}}(y_t \mid y_{t-1}; \boldsymbol{\theta})
= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\,\frac{(y_t - c - \phi y_{t-1})^2}{\sigma^2} \right\}.
\]
The likelihood of the complete sample can thus be calculated as
\[
f_{Y_T, Y_{T-1}, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_1; \boldsymbol{\theta}) = f_{Y_1}(y_1; \boldsymbol{\theta}) \prod_{t=2}^{T} f_{Y_t|Y_{t-1}}(y_t \mid y_{t-1}; \boldsymbol{\theta}). \tag{4}
\]
The log likelihood function (denoted $\mathcal{L}(\boldsymbol{\theta})$) is therefore
\[
\mathcal{L}(\boldsymbol{\theta}) = \log f_{Y_1}(y_1; \boldsymbol{\theta}) + \sum_{t=2}^{T} \log f_{Y_t|Y_{t-1}}(y_t \mid y_{t-1}; \boldsymbol{\theta}). \tag{5}
\]
The log likelihood for a sample of size $T$ from a Gaussian AR(1) process is thus
\[
\mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log[\sigma^2/(1-\phi^2)]
- \frac{\{y_1 - [c/(1-\phi)]\}^2}{2\sigma^2/(1-\phi^2)}
- \frac{T-1}{2}\log(2\pi) - \frac{T-1}{2}\log(\sigma^2)
- \sum_{t=2}^{T} \frac{(y_t - c - \phi y_{t-1})^2}{2\sigma^2}. \tag{6}
\]

1.2 Evaluating the Likelihood Function Using the (Vector) Joint Density

A different description of the likelihood function for a sample of size $T$ from a Gaussian AR(1) process is sometimes useful. Collect the full set of observations in a $(T \times 1)$ vector,
\[
\mathbf{y} \equiv (Y_1, Y_2, \ldots, Y_T)'.
\]
The mean of this $(T \times 1)$ vector is
\[
E(\mathbf{y}) = (\mu, \mu, \ldots, \mu)' = \boldsymbol{\mu},
\]
where $\mu = c/(1-\phi)$. The variance-covariance matrix of $\mathbf{y}$ is
\[
\boldsymbol{\Omega} = E[(\mathbf{y}-\boldsymbol{\mu})(\mathbf{y}-\boldsymbol{\mu})'] = \frac{\sigma^2}{1-\phi^2}
\begin{bmatrix}
1 & \phi & \phi^2 & \cdots & \phi^{T-1} \\
\phi & 1 & \phi & \cdots & \phi^{T-2} \\
\vdots & & & \ddots & \vdots \\
\phi^{T-1} & \phi^{T-2} & \phi^{T-3} & \cdots & 1
\end{bmatrix}
= \sigma^2 \mathbf{V},
\]
where
\[
\mathbf{V} = \frac{1}{1-\phi^2}
\begin{bmatrix}
1 & \phi & \phi^2 & \cdots & \phi^{T-1} \\
\phi & 1 & \phi & \cdots & \phi^{T-2} \\
\vdots & & & \ddots & \vdots \\
\phi^{T-1} & \phi^{T-2} & \phi^{T-3} & \cdots & 1
\end{bmatrix}.
\]
The sample likelihood function is therefore the multivariate Gaussian density
\[
f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\theta}) = (2\pi)^{-T/2}\,|\boldsymbol{\Omega}^{-1}|^{1/2}\exp\left[ -\tfrac{1}{2}(\mathbf{y}-\boldsymbol{\mu})'\boldsymbol{\Omega}^{-1}(\mathbf{y}-\boldsymbol{\mu}) \right],
\]
with log likelihood
\[
\mathcal{L}(\boldsymbol{\theta}) = -\frac{T}{2}\log(2\pi) + \frac{1}{2}\log|\boldsymbol{\Omega}^{-1}| - \frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})'\boldsymbol{\Omega}^{-1}(\mathbf{y}-\boldsymbol{\mu}). \tag{7}
\]
Expressions (6) and (7) must represent the identical likelihood function.

It is easy to verify by direct multiplication that $\mathbf{L}'\mathbf{L} = \mathbf{V}^{-1}$, with
\[
\mathbf{L} =
\begin{bmatrix}
\sqrt{1-\phi^2} & 0 & 0 & \cdots & 0 \\
-\phi & 1 & 0 & \cdots & 0 \\
0 & -\phi & 1 & \cdots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & -\phi & 1
\end{bmatrix}.
\]
Then (7) becomes
\[
\mathcal{L}(\boldsymbol{\theta}) = -\frac{T}{2}\log(2\pi) + \frac{1}{2}\log|\sigma^{-2}\mathbf{L}'\mathbf{L}| - \frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})'\,\sigma^{-2}\mathbf{L}'\mathbf{L}\,(\mathbf{y}-\boldsymbol{\mu}). \tag{8}
\]
Define the $(T \times 1)$ vector $\tilde{\mathbf{y}}$ to be
\[
\tilde{\mathbf{y}} \equiv \mathbf{L}(\mathbf{y}-\boldsymbol{\mu}) =
\begin{bmatrix}
\sqrt{1-\phi^2}\,(Y_1-\mu) \\
(Y_2-\mu) - \phi(Y_1-\mu) \\
(Y_3-\mu) - \phi(Y_2-\mu) \\
\vdots \\
(Y_T-\mu) - \phi(Y_{T-1}-\mu)
\end{bmatrix}
=
\begin{bmatrix}
\sqrt{1-\phi^2}\,[Y_1 - c/(1-\phi)] \\
Y_2 - c - \phi Y_1 \\
Y_3 - c - \phi Y_2 \\
\vdots \\
Y_T - c - \phi Y_{T-1}
\end{bmatrix}.
\]
The last term in (8) can thus be written
\[
\frac{1}{2}(\mathbf{y}-\boldsymbol{\mu})'\,\sigma^{-2}\mathbf{L}'\mathbf{L}\,(\mathbf{y}-\boldsymbol{\mu})
= \frac{1}{2\sigma^2}\,\tilde{\mathbf{y}}'\tilde{\mathbf{y}}
= \frac{1}{2\sigma^2}(1-\phi^2)\,[Y_1 - c/(1-\phi)]^2 + \frac{1}{2\sigma^2}\sum_{t=2}^{T}(Y_t - c - \phi Y_{t-1})^2.
\]
The middle term in (8) is similarly
\[
\frac{1}{2}\log|\sigma^{-2}\mathbf{L}'\mathbf{L}|
= \frac{1}{2}\log\{\sigma^{-2T}|\mathbf{L}'\mathbf{L}|\}
= -\frac{T}{2}\log\sigma^2 + \frac{1}{2}\log\{|\mathbf{L}'|\,|\mathbf{L}|\}
= -\frac{T}{2}\log\sigma^2 + \log|\mathbf{L}|
= -\frac{T}{2}\log\sigma^2 + \frac{1}{2}\log(1-\phi^2),
\]
since $\mathbf{L}$ is triangular, so that $|\mathbf{L}|$ is the product of its diagonal elements $(\sqrt{1-\phi^2}, 1, \ldots, 1)$.

Thus equations (6) and (7) are just two different expressions for the same magnitude. Either expression accurately describes the log likelihood function.
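The equivalence of (6) and (7) is easy to confirm numerically. The following is a minimal sketch, not part of the original derivation, that evaluates both expressions on simulated AR(1) data; it assumes NumPy and SciPy are available, and the function names and parameter values are purely illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ar1_loglik_conditional_form(y, c, phi, sigma2):
    """Exact AR(1) log likelihood via the scalar decomposition (6)."""
    T = len(y)
    mu = c / (1 - phi)
    ll = -0.5 * np.log(2 * np.pi) - 0.5 * np.log(sigma2 / (1 - phi**2)) \
         - (y[0] - mu)**2 / (2 * sigma2 / (1 - phi**2))
    resid = y[1:] - c - phi * y[:-1]            # y_t - c - phi*y_{t-1}, t = 2, ..., T
    ll += -0.5 * (T - 1) * np.log(2 * np.pi * sigma2) - np.sum(resid**2) / (2 * sigma2)
    return ll

def ar1_loglik_joint_form(y, c, phi, sigma2):
    """Exact AR(1) log likelihood via the multivariate normal density (7)."""
    T = len(y)
    mu = c / (1 - phi)
    # Omega has (i, j) element sigma^2 * phi^{|i-j|} / (1 - phi^2)
    Omega = sigma2 / (1 - phi**2) * phi ** np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
    return multivariate_normal.logpdf(y, mean=np.full(T, mu), cov=Omega)

# Hypothetical check on simulated data: the two values should agree to numerical precision.
rng = np.random.default_rng(0)
c, phi, sigma2, T = 1.0, 0.6, 2.0, 200
y = np.empty(T)
y[0] = c / (1 - phi) + rng.normal(scale=np.sqrt(sigma2 / (1 - phi**2)))
for t in range(1, T):
    y[t] = c + phi * y[t - 1] + rng.normal(scale=np.sqrt(sigma2))
print(ar1_loglik_conditional_form(y, c, phi, sigma2))
print(ar1_loglik_joint_form(y, c, phi, sigma2))
```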
1.3 Exact Maximum Likelihood Estimators for the Gaussian AR(1) Process

The MLE $\hat{\boldsymbol{\theta}}$ is the value for which (6) is maximized. In principle, this requires differentiating (6) and setting the result equal to zero. In practice, when an attempt is made to carry this out, the result is a system of nonlinear equations in $\boldsymbol{\theta}$ and $(y_1, y_2, \ldots, y_T)$ for which there is no simple solution for $\boldsymbol{\theta}$ in terms of $(y_1, y_2, \ldots, y_T)$. Maximization of (6) thus requires the iterative or numerical procedures described on p. 21 of Chapter 3.

1.4 Conditional Maximum Likelihood Estimation

An alternative to numerical maximization of the exact likelihood function is to regard the value of $y_1$ as deterministic and maximize the likelihood conditioned on the first observation,
\[
f_{Y_T, Y_{T-1}, \ldots, Y_2|Y_1}(y_T, y_{T-1}, \ldots, y_2 \mid y_1; \boldsymbol{\theta}) = \prod_{t=2}^{T} f_{Y_t|Y_{t-1}}(y_t \mid y_{t-1}; \boldsymbol{\theta}),
\]
the objective then being to maximize
\[
\mathcal{L}^*(\boldsymbol{\theta}) = -\frac{T-1}{2}\log(2\pi) - \frac{T-1}{2}\log(\sigma^2) - \sum_{t=2}^{T} \frac{(y_t - c - \phi y_{t-1})^2}{2\sigma^2}
= -\frac{T-1}{2}\log(2\pi) - \frac{T-1}{2}\log(\sigma^2) - \sum_{t=2}^{T} \frac{\varepsilon_t^2}{2\sigma^2}. \tag{9}
\]
Maximization of (9) with respect to $c$ and $\phi$ is equivalent to minimization of
\[
\sum_{t=2}^{T} (y_t - c - \phi y_{t-1})^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}), \tag{10}
\]
which is achieved by an ordinary least squares (OLS) regression of $y_t$ on a constant and its own lagged value, where
\[
\mathbf{y} = \begin{bmatrix} y_2 \\ y_3 \\ \vdots \\ y_T \end{bmatrix}, \qquad
\mathbf{X} = \begin{bmatrix} 1 & y_1 \\ 1 & y_2 \\ \vdots & \vdots \\ 1 & y_{T-1} \end{bmatrix}, \qquad
\boldsymbol{\beta} = \begin{bmatrix} c \\ \phi \end{bmatrix}.
\]
The conditional maximum likelihood estimates of $c$ and $\phi$ are therefore given by
\[
\begin{bmatrix} \hat{c} \\ \hat{\phi} \end{bmatrix}
=
\begin{bmatrix} T-1 & \sum_{t=2}^{T} y_{t-1} \\[2pt] \sum_{t=2}^{T} y_{t-1} & \sum_{t=2}^{T} y_{t-1}^2 \end{bmatrix}^{-1}
\begin{bmatrix} \sum_{t=2}^{T} y_t \\[2pt] \sum_{t=2}^{T} y_{t-1} y_t \end{bmatrix}.
\]
The conditional maximum likelihood estimator of $\sigma^2$ is found by setting
\[
\frac{\partial \mathcal{L}^*}{\partial \sigma^2} = -\frac{T-1}{2\sigma^2} + \sum_{t=2}^{T} \frac{(y_t - c - \phi y_{t-1})^2}{2\sigma^4} = 0,
\]
or
\[
\hat{\sigma}^2 = \sum_{t=2}^{T} \frac{(y_t - \hat{c} - \hat{\phi} y_{t-1})^2}{T-1}.
\]
It is important to note that if you have a sample of size $T$ with which to estimate an AR(1) process by conditional MLE, you will use only $T-1$ observations of this sample.

2 MLE of a Gaussian AR(p) Process

This section discusses a Gaussian AR(p) process,
\[
Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t,
\]
with $\varepsilon_t \sim \text{i.i.d. } N(0, \sigma^2)$. In this case, the vector of population parameters to be estimated is $\boldsymbol{\theta} = (c, \phi_1, \phi_2, \ldots, \phi_p, \sigma^2)'$.

2.1 Evaluating the Likelihood Function

We first collect the first $p$ observations in the sample, $(Y_1, Y_2, \ldots, Y_p)$, in a $(p \times 1)$ vector $\mathbf{y}_p$, which has mean vector $\boldsymbol{\mu}_p$ with each element
\[
\mu = \frac{c}{1 - \phi_1 - \phi_2 - \cdots - \phi_p}
\]
and variance-covariance matrix given by
\[
\sigma^2 \mathbf{V}_p =
\begin{bmatrix}
\gamma_0 & \gamma_1 & \gamma_2 & \cdots & \gamma_{p-1} \\
\gamma_1 & \gamma_0 & \gamma_1 & \cdots & \gamma_{p-2} \\
\gamma_2 & \gamma_1 & \gamma_0 & \cdots & \gamma_{p-3} \\
\vdots & & & \ddots & \vdots \\
\gamma_{p-1} & \gamma_{p-2} & \gamma_{p-3} & \cdots & \gamma_0
\end{bmatrix}.
\]
The density of the first $p$ observations is then
\[
f_{Y_p, Y_{p-1}, \ldots, Y_1}(y_p, y_{p-1}, \ldots, y_1; \boldsymbol{\theta})
= (2\pi)^{-p/2}\,|\sigma^{-2}\mathbf{V}_p^{-1}|^{1/2} \exp\left[ -\frac{1}{2\sigma^2}(\mathbf{y}_p - \boldsymbol{\mu}_p)'\mathbf{V}_p^{-1}(\mathbf{y}_p - \boldsymbol{\mu}_p) \right]
= (2\pi)^{-p/2}\,(\sigma^{-2})^{p/2}\,|\mathbf{V}_p^{-1}|^{1/2} \exp\left[ -\frac{1}{2\sigma^2}(\mathbf{y}_p - \boldsymbol{\mu}_p)'\mathbf{V}_p^{-1}(\mathbf{y}_p - \boldsymbol{\mu}_p) \right].
\]
For the remaining observations in the sample, $(Y_{p+1}, Y_{p+2}, \ldots, Y_T)$, conditional on the first $t-1$ observations the $t$th observation is Gaussian with mean
\[
c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p}
\]
and variance $\sigma^2$. Only the $p$ most recent observations matter for this distribution. Hence, for $t > p$,
\[
f_{Y_t|Y_{t-1}, \ldots, Y_1}(y_t \mid y_{t-1}, \ldots, y_1; \boldsymbol{\theta})
= f_{Y_t|Y_{t-1}, \ldots, Y_{t-p}}(y_t \mid y_{t-1}, \ldots, y_{t-p}; \boldsymbol{\theta})
= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[ -\frac{(y_t - c - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p})^2}{2\sigma^2} \right].
\]
The likelihood function for the complete sample is then
\[
f_{Y_T, Y_{T-1}, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_1; \boldsymbol{\theta})
= f_{Y_p, Y_{p-1}, \ldots, Y_1}(y_p, y_{p-1}, \ldots, y_1; \boldsymbol{\theta}) \prod_{t=p+1}^{T} f_{Y_t|Y_{t-1}, \ldots, Y_{t-p}}(y_t \mid y_{t-1}, \ldots, y_{t-p}; \boldsymbol{\theta}),
\]
and the log likelihood is therefore
\[
\mathcal{L}(\boldsymbol{\theta}) = -\frac{p}{2}\log(2\pi) - \frac{p}{2}\log(\sigma^2) + \frac{1}{2}\log|\mathbf{V}_p^{-1}|
- \frac{1}{2\sigma^2}(\mathbf{y}_p - \boldsymbol{\mu}_p)'\mathbf{V}_p^{-1}(\mathbf{y}_p - \boldsymbol{\mu}_p)
- \frac{T-p}{2}\log(2\pi) - \frac{T-p}{2}\log(\sigma^2)
- \sum_{t=p+1}^{T} \frac{(y_t - c - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p})^2}{2\sigma^2}.
\]
Maximization of this exact log likelihood of an AR(p) process must be accomplished numerically.
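As an illustration of what such a numerical evaluation might look like, the sketch below computes the exact AR(p) log likelihood just derived; the negative of this function could then be handed to a generic optimizer (for example scipy.optimize.minimize). The main work is obtaining $\sigma^2\mathbf{V}_p$; here it is computed from the companion (VAR(1)) form of the process, which is one possible approach and not part of the original derivation. It assumes NumPy and SciPy, and all function names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def ar_p_exact_loglik(y, c, phi, sigma2):
    """Exact Gaussian AR(p) log likelihood: multivariate normal density of the
    first p observations times conditional normal densities for t = p+1, ..., T."""
    y = np.asarray(y, dtype=float)
    phi = np.asarray(phi, dtype=float)
    p, T = len(phi), len(y)
    mu = c / (1 - phi.sum())
    # Companion form: the stationary covariance Sigma_p of (Y_t, ..., Y_{t-p+1})
    # solves Sigma_p = F Sigma_p F' + Q, i.e. vec(Sigma_p) = (I - F x F)^{-1} vec(Q).
    F = np.zeros((p, p)); F[0, :] = phi; F[1:, :-1] = np.eye(p - 1)
    Q = np.zeros((p, p)); Q[0, 0] = sigma2
    Sigma_p = np.linalg.solve(np.eye(p * p) - np.kron(F, F), Q.ravel()).reshape(p, p)
    # Contribution of the first p observations
    ll = multivariate_normal.logpdf(y[:p], mean=np.full(p, mu), cov=Sigma_p)
    # Conditional Gaussian contributions for t = p+1, ..., T
    for t in range(p, T):
        mean_t = c + phi @ y[t - 1::-1][:p]     # c + phi_1*y_{t-1} + ... + phi_p*y_{t-p}
        ll += -0.5 * np.log(2 * np.pi * sigma2) - (y[t] - mean_t) ** 2 / (2 * sigma2)
    return ll
```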
2.2 Conditional Maximum Likelihood Estimates

The log of the likelihood conditional on the first $p$ observations assumes the simple form
\[
\mathcal{L}^*(\boldsymbol{\theta}) = \log f_{Y_T, Y_{T-1}, \ldots, Y_{p+1}|Y_p, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_{p+1} \mid y_p, \ldots, y_1; \boldsymbol{\theta})
= -\frac{T-p}{2}\log(2\pi) - \frac{T-p}{2}\log(\sigma^2)
- \sum_{t=p+1}^{T} \frac{(y_t - c - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p})^2}{2\sigma^2}
= -\frac{T-p}{2}\log(2\pi) - \frac{T-p}{2}\log(\sigma^2) - \sum_{t=p+1}^{T} \frac{\varepsilon_t^2}{2\sigma^2}. \tag{11}
\]
The values of $c, \phi_1, \ldots, \phi_p$ that maximize (11) are the same as those that minimize
\[
\sum_{t=p+1}^{T} (y_t - c - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p})^2.
\]
Thus, the conditional MLEs of these parameters can be obtained from an OLS regression of $y_t$ on a constant and $p$ of its own lagged values. The conditional MLE of $\sigma^2$ turns out to be the average squared residual from this regression:
\[
\hat{\sigma}^2 = \frac{1}{T-p} \sum_{t=p+1}^{T} (y_t - \hat{c} - \hat{\phi}_1 y_{t-1} - \hat{\phi}_2 y_{t-2} - \cdots - \hat{\phi}_p y_{t-p})^2.
\]
It is important to note that if you have a sample of size $T$ with which to estimate an AR(p) process by conditional MLE, you will use only $T-p$ observations of this sample.
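A minimal sketch of the OLS computation just described is given below, assuming NumPy; the function name is illustrative. The point is that no numerical optimizer is needed for the conditional MLE of a pure AR model.

```python
import numpy as np

def ar_p_conditional_mle(y, p):
    """Conditional MLE of a Gaussian AR(p): OLS of y_t on a constant and p lags,
    using observations t = p+1, ..., T (so T - p usable rows)."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    # Design matrix: rows [1, y_{t-1}, ..., y_{t-p}] for t = p+1, ..., T
    X = np.column_stack([np.ones(T - p)] + [y[p - j - 1 : T - j - 1] for j in range(p)])
    target = y[p:]
    beta_hat, *_ = np.linalg.lstsq(X, target, rcond=None)   # (c_hat, phi_1_hat, ..., phi_p_hat)
    resid = target - X @ beta_hat
    sigma2_hat = resid @ resid / (T - p)                      # average squared residual
    return beta_hat, sigma2_hat
```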
3 MLE of a Gaussian MA(1) Process

This section discusses a Gaussian MA(1) process,
\[
Y_t = \mu + \varepsilon_t + \theta \varepsilon_{t-1}, \tag{12}
\]
with $\varepsilon_t \sim \text{i.i.d. } N(0, \sigma^2)$. In this case, the vector of population parameters to be estimated is $\boldsymbol{\theta} = (\mu, \theta, \sigma^2)'$.

3.1 Evaluating the Likelihood Function

Collect the observations in the sample $(Y_1, Y_2, \ldots, Y_T)$ in a $(T \times 1)$ vector $\mathbf{y}$, which has mean vector $\boldsymbol{\mu}$ with each element $\mu$ and variance-covariance matrix given by
\[
\boldsymbol{\Omega} = E[(\mathbf{y}-\boldsymbol{\mu})(\mathbf{y}-\boldsymbol{\mu})'] = \sigma^2
\begin{bmatrix}
(1+\theta^2) & \theta & 0 & \cdots & 0 \\
\theta & (1+\theta^2) & \theta & \cdots & 0 \\
0 & \theta & (1+\theta^2) & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & (1+\theta^2)
\end{bmatrix}.
\]
The likelihood function is then
\[
f_{Y_T, Y_{T-1}, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_1; \boldsymbol{\theta}) = (2\pi)^{-T/2}\,|\boldsymbol{\Omega}|^{-1/2}\exp\left[ -\tfrac{1}{2}(\mathbf{y}-\boldsymbol{\mu})'\boldsymbol{\Omega}^{-1}(\mathbf{y}-\boldsymbol{\mu}) \right].
\]
Using a triangular factorization of the variance-covariance matrix, the likelihood function can be written
\[
f_{Y_T, Y_{T-1}, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_1; \boldsymbol{\theta}) = (2\pi)^{-T/2}\left[\prod_{t=1}^{T} d_{tt}\right]^{-1/2} \exp\left[ -\frac{1}{2}\sum_{t=1}^{T} \frac{\tilde{y}_t^2}{d_{tt}} \right],
\]
and the log likelihood is therefore
\[
\mathcal{L}(\boldsymbol{\theta}) = -\frac{T}{2}\log(2\pi) - \frac{1}{2}\sum_{t=1}^{T} \log d_{tt} - \frac{1}{2}\sum_{t=1}^{T} \frac{\tilde{y}_t^2}{d_{tt}},
\]
where
\[
d_{tt} = \sigma^2\,\frac{1 + \theta^2 + \theta^4 + \cdots + \theta^{2t}}{1 + \theta^2 + \theta^4 + \cdots + \theta^{2(t-1)}}
\]
and, starting from $\tilde{y}_1 = y_1 - \mu$,
\[
\tilde{y}_t = (y_t - \mu) - \frac{\theta\,[1 + \theta^2 + \theta^4 + \cdots + \theta^{2(t-2)}]}{1 + \theta^2 + \theta^4 + \cdots + \theta^{2(t-1)}}\,\tilde{y}_{t-1}.
\]
Maximization of this exact log likelihood of an MA(1) process must be accomplished numerically.

3.2 Evaluating the Likelihood Function Using the (Scalar) Conditional Density

Consider the p.d.f. of
\[
Y_1 = \mu + \varepsilon_1 + \theta\varepsilon_0,
\]
the first observation in the sample. This is a random variable with mean and variance
\[
E(Y_1) = \mu \qquad \text{and} \qquad Var(Y_1) = \sigma^2(1 + \theta^2).
\]
Since $\{\varepsilon_t\}_{t=-\infty}^{\infty}$ is Gaussian, $Y_1$ is also Gaussian. Hence,
\[
Y_1 \sim N(\mu, (1+\theta^2)\sigma^2),
\]
or
\[
f_{Y_1}(y_1; \boldsymbol{\theta}) = f_{Y_1}(y_1; \mu, \theta, \sigma^2) = \frac{1}{\sqrt{2\pi}\sqrt{\sigma^2(1+\theta^2)}}\exp\left\{ -\frac{1}{2}\,\frac{(y_1-\mu)^2}{\sigma^2(1+\theta^2)} \right\}.
\]
Next consider the distribution of the second observation $Y_2$ conditional on "observing" $Y_1 = y_1$. From (12),
\[
Y_2 = \mu + \varepsilon_2 + \theta\varepsilon_1. \tag{13}
\]
Following the method used to calculate the joint density of the complete sample for an AR process, one might hope to treat conditioning on $Y_1 = y_1$ as treating the random variable $Y_1$ as if it were the deterministic constant $y_1$, so that (13) would give $Y_2$ as the constant $(\mu + \theta\varepsilon_1)$ plus the $N(0, \sigma^2)$ variable $\varepsilon_2$. However, this is not the case here: observing $Y_1 = y_1$ gives no information about the realization of $\varepsilon_1$, because one cannot distinguish $\varepsilon_1$ from $\varepsilon_0$ even after the first observation $y_1$.

3.2.1 Conditional Maximum Likelihood Estimation

To make the conditional density $f_{Y_2|Y_1}(y_2 \mid y_1; \boldsymbol{\theta})$ feasible, we must impose an additional assumption, such as that we know with certainty that $\varepsilon_0 = 0$.

Suppose, then, that we know for certain that $\varepsilon_0 = 0$. Then
\[
(Y_1 \mid \varepsilon_0 = 0) \sim N(\mu, \sigma^2),
\]
or
\[
f_{Y_1|\varepsilon_0=0}(y_1 \mid \varepsilon_0 = 0; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\,\frac{(y_1-\mu)^2}{\sigma^2} \right\} = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{\varepsilon_1^2}{2\sigma^2} \right\}.
\]
Moreover, given observation of $y_1$, the value of $\varepsilon_1$ is then known with certainty as well:
\[
\varepsilon_1 = y_1 - \mu.
\]
Hence,
\[
(Y_2 \mid Y_1 = y_1, \varepsilon_0 = 0) \sim N\bigl((\mu + \theta\varepsilon_1),\, \sigma^2\bigr),
\]
meaning that
\[
f_{Y_2|Y_1, \varepsilon_0=0}(y_2 \mid y_1, \varepsilon_0 = 0; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\,\frac{(y_2 - \mu - \theta\varepsilon_1)^2}{\sigma^2} \right\}
= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{\varepsilon_2^2}{2\sigma^2} \right\}.
\]
Since $\varepsilon_1$ is known with certainty, $\varepsilon_2$ can be calculated from
\[
\varepsilon_2 = y_2 - \mu - \theta\varepsilon_1.
\]
Proceeding in this fashion, it is clear that given knowledge that $\varepsilon_0 = 0$, the full sequence $\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_T\}$ can be calculated from $\{y_1, y_2, \ldots, y_T\}$ by iterating on
\[
\varepsilon_t = y_t - \mu - \theta\varepsilon_{t-1}
\]
for $t = 1, 2, \ldots, T$, starting from $\varepsilon_0 = 0$. The conditional density of the $t$th observation can then be calculated as
\[
f_{Y_t|Y_{t-1}, Y_{t-2}, \ldots, Y_1, \varepsilon_0=0}(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, \varepsilon_0 = 0; \boldsymbol{\theta}) = f_{Y_t|\varepsilon_{t-1}}(y_t \mid \varepsilon_{t-1}; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{\varepsilon_t^2}{2\sigma^2} \right\}.
\]
The likelihood of the complete sample (conditional on $\varepsilon_0 = 0$) can thus be calculated as the product of these individual densities:
\[
f_{Y_T, Y_{T-1}, \ldots, Y_1|\varepsilon_0=0}(y_T, y_{T-1}, \ldots, y_1 \mid \varepsilon_0 = 0; \boldsymbol{\theta}) = f_{Y_1|\varepsilon_0=0}(y_1 \mid \varepsilon_0 = 0; \boldsymbol{\theta}) \prod_{t=2}^{T} f_{Y_t|Y_{t-1}, \ldots, Y_1, \varepsilon_0=0}(y_t \mid y_{t-1}, \ldots, y_1, \varepsilon_0 = 0; \boldsymbol{\theta}).
\]
The conditional log likelihood function (denoted $\mathcal{L}^*(\boldsymbol{\theta})$) is therefore
\[
\mathcal{L}^*(\boldsymbol{\theta}) = \log f_{Y_T, Y_{T-1}, \ldots, Y_1|\varepsilon_0=0}(y_T, y_{T-1}, \ldots, y_1 \mid \varepsilon_0 = 0; \boldsymbol{\theta}) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2) - \sum_{t=1}^{T} \frac{\varepsilon_t^2}{2\sigma^2}. \tag{14}
\]
The data entering the log likelihood function can be calculated from the relation
\[
(Y_t - \mu) = (1 + \theta L)\varepsilon_t
\]
and then (this is the reason why invertibility is needed)
\[
\varepsilon_t = (1 + \theta L)^{-1}(Y_t - \mu) = (Y_t - \mu) - \theta(Y_{t-1} - \mu) + \theta^2(Y_{t-2} - \mu) - \cdots + (-1)^{t-1}\theta^{t-1}(Y_1 - \mu) + (-1)^t \theta^t \varepsilon_0.
\]
Although it is simple to program this iteration on a computer, the log likelihood function is a fairly complicated nonlinear function of $\mu$ and $\theta$, so an analytical expression for the MLEs of $\mu$ and $\theta$ is not readily calculated. Hence even the conditional MLE for an MA(1) process must be found by numerical optimization.

It is important to note that if you have a sample of size $T$ with which to estimate an MA(1) process by conditional MLE, you will use all $T$ observations of this sample.
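A minimal sketch of that numerical optimization, under the conditioning assumption $\varepsilon_0 = 0$, is shown below. It assumes NumPy and SciPy; the function names, the reparameterization of $\sigma^2$, and the simulated data are all illustrative rather than part of the original text.

```python
import numpy as np
from scipy.optimize import minimize

def ma1_conditional_negloglik(params, y):
    """Negative of the conditional log likelihood (14) for a Gaussian MA(1),
    with residuals built by iterating eps_t = y_t - mu - theta*eps_{t-1},
    starting from eps_0 = 0."""
    mu, theta, log_sigma2 = params          # unconstrained parameterization of sigma^2
    sigma2 = np.exp(log_sigma2)
    T = len(y)
    eps = np.zeros(T + 1)                   # eps[0] plays the role of eps_0 = 0
    for t in range(1, T + 1):
        eps[t] = y[t - 1] - mu - theta * eps[t - 1]
    return 0.5 * T * np.log(2 * np.pi * sigma2) + np.sum(eps[1:] ** 2) / (2 * sigma2)

# Hypothetical usage on simulated MA(1) data
rng = np.random.default_rng(1)
mu0, theta0, sigma0 = 2.0, 0.5, 1.0
e = rng.normal(scale=sigma0, size=301)
y = mu0 + e[1:] + theta0 * e[:-1]
res = minimize(ma1_conditional_negloglik, x0=np.array([0.0, 0.0, 0.0]), args=(y,), method="BFGS")
mu_hat, theta_hat, sigma2_hat = res.x[0], res.x[1], np.exp(res.x[2])
```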
4 MLE of a Gaussian MA(q) Process

This section discusses a Gaussian MA(q) process,
\[
Y_t = \mu + \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_q\varepsilon_{t-q}, \tag{15}
\]
with $\varepsilon_t \sim \text{i.i.d. } N(0, \sigma^2)$. In this case, the vector of population parameters to be estimated is $\boldsymbol{\theta} = (\mu, \theta_1, \theta_2, \ldots, \theta_q, \sigma^2)'$.

4.1 Evaluating the Likelihood Function

Collect the observations in the sample $(Y_1, Y_2, \ldots, Y_T)$ in a $(T \times 1)$ vector $\mathbf{y}$, which has mean vector $\boldsymbol{\mu}$ with each element $\mu$ and variance-covariance matrix $\boldsymbol{\Omega}$. The likelihood function is then
\[
f_{Y_T, Y_{T-1}, \ldots, Y_1}(y_T, y_{T-1}, \ldots, y_1; \boldsymbol{\theta}) = (2\pi)^{-T/2}\,|\boldsymbol{\Omega}|^{-1/2}\exp\left[ -\tfrac{1}{2}(\mathbf{y}-\boldsymbol{\mu})'\boldsymbol{\Omega}^{-1}(\mathbf{y}-\boldsymbol{\mu}) \right].
\]
Maximization of this exact log likelihood of an MA(q) process must be accomplished numerically.

4.2 Evaluating the Likelihood Function Using the (Scalar) Conditional Density

Consider the p.d.f. of
\[
Y_1 = \mu + \varepsilon_1 + \theta_1\varepsilon_0 + \theta_2\varepsilon_{-1} + \cdots + \theta_q\varepsilon_{-q+1}.
\]
A simple approach is to condition on the assumption that the first $q$ values of $\varepsilon$ were all zero:
\[
\varepsilon_0 = \varepsilon_{-1} = \cdots = \varepsilon_{-q+1} = 0.
\]
Let $\boldsymbol{\varepsilon}_0$ denote the $(q \times 1)$ vector $(\varepsilon_0, \varepsilon_{-1}, \ldots, \varepsilon_{-q+1})'$. Then
\[
(Y_1 \mid \boldsymbol{\varepsilon}_0 = \mathbf{0}) \sim N(\mu, \sigma^2),
\]
or
\[
f_{Y_1|\boldsymbol{\varepsilon}_0=\mathbf{0}}(y_1 \mid \boldsymbol{\varepsilon}_0 = \mathbf{0}; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\,\frac{(y_1-\mu)^2}{\sigma^2} \right\} = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{\varepsilon_1^2}{2\sigma^2} \right\}.
\]
Next consider the distribution of the second observation $Y_2$ conditional on "observing" $Y_1 = y_1$. From (15),
\[
Y_2 = \mu + \varepsilon_2 + \theta_1\varepsilon_1 + \theta_2\varepsilon_0 + \cdots + \theta_q\varepsilon_{-q+2}. \tag{16}
\]
Moreover, given observation of $y_1$, the value of $\varepsilon_1$ is then known with certainty as well:
\[
\varepsilon_1 = y_1 - \mu.
\]
Hence,
\[
(Y_2 \mid Y_1 = y_1, \boldsymbol{\varepsilon}_0 = \mathbf{0}) \sim N\bigl((\mu + \theta_1\varepsilon_1),\, \sigma^2\bigr),
\]
meaning that
\[
f_{Y_2|Y_1, \boldsymbol{\varepsilon}_0=\mathbf{0}}(y_2 \mid y_1, \boldsymbol{\varepsilon}_0 = \mathbf{0}; \boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{1}{2}\,\frac{(y_2 - \mu - \theta_1\varepsilon_1)^2}{\sigma^2} \right\} = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{ -\frac{\varepsilon_2^2}{2\sigma^2} \right\}.
\]
Since $\varepsilon_1$ is known with certainty, $\varepsilon_2$ can be calculated from
\[
\varepsilon_2 = y_2 - \mu - \theta_1\varepsilon_1.
\]
Proceeding in this fashion, it is clear that given knowledge that $\boldsymbol{\varepsilon}_0 = \mathbf{0}$, the full sequence $\{\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_T\}$ can be calculated from $\{y_1, y_2, \ldots, y_T\}$ by iterating on
\[
\varepsilon_t = y_t - \mu - \theta_1\varepsilon_{t-1} - \theta_2\varepsilon_{t-2} - \cdots - \theta_q\varepsilon_{t-q}
\]
for $t = 1, 2, \ldots, T$, starting from $\boldsymbol{\varepsilon}_0 = \mathbf{0}$. The likelihood (conditional on $\boldsymbol{\varepsilon}_0 = \mathbf{0}$) of the complete sample can thus be calculated as the product of the individual conditional densities:
\[
f_{Y_T, Y_{T-1}, \ldots, Y_1|\boldsymbol{\varepsilon}_0=\mathbf{0}}(y_T, y_{T-1}, \ldots, y_1 \mid \boldsymbol{\varepsilon}_0 = \mathbf{0}; \boldsymbol{\theta}) = f_{Y_1|\boldsymbol{\varepsilon}_0=\mathbf{0}}(y_1 \mid \boldsymbol{\varepsilon}_0 = \mathbf{0}; \boldsymbol{\theta}) \prod_{t=2}^{T} f_{Y_t|Y_{t-1}, \ldots, Y_1, \boldsymbol{\varepsilon}_0=\mathbf{0}}(y_t \mid y_{t-1}, \ldots, y_1, \boldsymbol{\varepsilon}_0 = \mathbf{0}; \boldsymbol{\theta}).
\]
The conditional log likelihood function (denoted $\mathcal{L}^*(\boldsymbol{\theta})$) is therefore
\[
\mathcal{L}^*(\boldsymbol{\theta}) = \log f_{Y_T, Y_{T-1}, \ldots, Y_1|\boldsymbol{\varepsilon}_0=\mathbf{0}}(y_T, y_{T-1}, \ldots, y_1 \mid \boldsymbol{\varepsilon}_0 = \mathbf{0}; \boldsymbol{\theta}) = -\frac{T}{2}\log(2\pi) - \frac{T}{2}\log(\sigma^2) - \sum_{t=1}^{T} \frac{\varepsilon_t^2}{2\sigma^2}. \tag{17}
\]
It is important to note that if you have a sample of size $T$ with which to estimate an MA(q) process by conditional MLE, you will also use all $T$ observations of this sample.

5 MLE of a Gaussian ARMA(p, q) Process

This section discusses a Gaussian ARMA(p, q) process,
\[
Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \cdots + \phi_p Y_{t-p} + \varepsilon_t + \theta_1\varepsilon_{t-1} + \theta_2\varepsilon_{t-2} + \cdots + \theta_q\varepsilon_{t-q},
\]
with $\varepsilon_t \sim \text{i.i.d. } N(0, \sigma^2)$. In this case, the vector of population parameters to be estimated is $\boldsymbol{\theta} = (c, \phi_1, \phi_2, \ldots, \phi_p, \theta_1, \theta_2, \ldots, \theta_q, \sigma^2)'$.

5.1 Conditional Maximum Likelihood Estimates

The approximation to the likelihood function for an autoregression conditions on initial values of the $y$'s. The approximation to the likelihood function for a moving average process conditions on initial values of the $\varepsilon$'s. A common approximation to the likelihood function for an ARMA(p, q) process conditions on both $y$'s and $\varepsilon$'s.

The $(p+1)$th observation is
\[
Y_{p+1} = c + \phi_1 Y_p + \phi_2 Y_{p-1} + \cdots + \phi_p Y_1 + \varepsilon_{p+1} + \theta_1\varepsilon_p + \cdots + \theta_q\varepsilon_{p-q+1}.
\]
Conditional on $Y_1 = y_1, Y_2 = y_2, \ldots, Y_p = y_p$ and setting
\[
\varepsilon_p = \varepsilon_{p-1} = \cdots = \varepsilon_{p-q+1} = 0,
\]
we have
\[
Y_{p+1} \sim N\bigl((c + \phi_1 y_p + \phi_2 y_{p-1} + \cdots + \phi_p y_1),\, \sigma^2\bigr).
\]
The conditional log likelihood calculated from $t = p+1, \ldots, T$ is then
\[
\mathcal{L}^*(\boldsymbol{\theta}) = \log f(y_T, y_{T-1}, \ldots, y_{p+1} \mid y_p, \ldots, y_1, \varepsilon_p = \varepsilon_{p-1} = \cdots = \varepsilon_{p-q+1} = 0; \boldsymbol{\theta})
= -\frac{T-p}{2}\log(2\pi) - \frac{T-p}{2}\log(\sigma^2) - \sum_{t=p+1}^{T} \frac{\varepsilon_t^2}{2\sigma^2}, \tag{18}
\]
where the sequence $\{\varepsilon_{p+1}, \varepsilon_{p+2}, \ldots, \varepsilon_T\}$ can be calculated from $\{y_1, y_2, \ldots, y_T\}$ by iterating on
\[
\varepsilon_t = y_t - c - \phi_1 y_{t-1} - \phi_2 y_{t-2} - \cdots - \phi_p y_{t-p} - \theta_1\varepsilon_{t-1} - \theta_2\varepsilon_{t-2} - \cdots - \theta_q\varepsilon_{t-q}, \qquad t = p+1, p+2, \ldots, T.
\]
It is important to note that if you have a sample of size $T$ with which to estimate an ARMA(p, q) process by conditional MLE, you will use only $T-p$ observations of this sample.

From (9), (11), (14), (17), and (18) we see that all of the conditional log likelihood functions take the concise form
\[
-\frac{T^*}{2}\log(2\pi) - \frac{T^*}{2}\log(\sigma^2) - \sum_{t=t^*}^{T} \frac{\varepsilon_t^2}{2\sigma^2},
\]
where $T^*$ and $t^*$ are, respectively, the total number of observations used and the first observation used. The value $\hat{\boldsymbol{\theta}}$ that maximizes the conditional log likelihood is also called the conditional sum of squares estimator (CSS), denoted $\hat{\boldsymbol{\theta}}_{CSS}$.

6 Numerical Optimization

Refer to p. 21 of Chapter 3.

7 Statistical Properties of MLE

1. Refer to p. 23 of Chapter 3 for the consistency, asymptotic normality, and asymptotic efficiency of $\hat{\boldsymbol{\theta}}_{MLE}$.

2. Refer to p. 25 of Chapter 3 for three methods of estimating the asymptotic variance of $\hat{\boldsymbol{\theta}}_{MLE}$.

3. Refer to p. 9 of Chapter 5 for three asymptotically equivalent tests relating to $\hat{\boldsymbol{\theta}}_{MLE}$.

4. For a large number of observations, the CSS estimators will be equivalent to the MLE.
See Pierce, D. A. (1971), "Least Squares Estimation of a Mixed Autoregressive-Moving Average Process," Biometrika 58: 299-312.

Exercise: Using the data I give to you, identify what model it appears to be and estimate the model you identify with CSS.
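For orientation on the exercise, the following is a minimal sketch of how the CSS criterion implied by (18) might be minimized numerically for an ARMA(p, q). It assumes NumPy and SciPy; the function names, the choice of optimizer, and the starting values are illustrative only, and no stationarity or invertibility constraints are imposed.

```python
import numpy as np
from scipy.optimize import minimize

def arma_css_objective(params, y, p, q):
    """Conditional sum of squares for a Gaussian ARMA(p, q): residuals follow the
    recursion under (18), with the presample eps's fixed at zero and the sum
    taken over t = p+1, ..., T."""
    c = params[0]
    phi = params[1:1 + p]
    theta = params[1 + p:1 + p + q]
    T = len(y)
    eps = np.zeros(T + q)                   # leading q zeros play the role of the presample eps's
    for t in range(p, T):                   # t is the 0-based time index
        ar_part = phi @ y[t - 1::-1][:p] if p > 0 else 0.0
        ma_part = theta @ eps[t - 1 + q::-1][:q] if q > 0 else 0.0
        eps[t + q] = y[t] - c - ar_part - ma_part
    return np.sum(eps[p + q:] ** 2)

def arma_css_estimate(y, p, q):
    """Minimize the CSS criterion numerically; sigma^2 is the average squared residual."""
    y = np.asarray(y, dtype=float)
    x0 = np.zeros(1 + p + q)
    x0[0] = y.mean()                        # crude starting value for the constant
    res = minimize(arma_css_objective, x0, args=(y, p, q), method="Nelder-Mead")
    sigma2_hat = arma_css_objective(res.x, y, p, q) / (len(y) - p)
    return res.x, sigma2_hat                # (c_hat, phi_hats, theta_hats), sigma2_hat
```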