Ch. 3 Estimation

1 The Nature of Statistical Inference

It is argued that it is important to develop a mathematical model purporting to provide a generalized description of the data generating process. A probability model in the form of the parametric family of density functions $\Phi = \{f(x;\theta),\ \theta \in \Theta\}$, together with its various ramifications formulated in the last chapter, provides such a mathematical model. By postulating $\Phi$ as a probability model for the distribution of the observations of interest, we can go on to consider questions about the unknown parameter $\theta$ (via estimation and hypothesis testing) as well as about further observations from the probability model (prediction).

In the next section the important concept of a sampling model is introduced as a way to link the postulated probability model, say $\Phi = \{f(x;\theta),\ \theta \in \Theta\}$, to the observed data $\mathbf{x} \equiv (x_1,\ldots,x_n)'$ available. The sampling model provides the second ingredient needed to define a statistical model, the starting point of any "parametric" statistical inference. In short, a statistical model comprises

(a) a probability model $\Phi = \{f(x;\theta),\ \theta \in \Theta\}$; and
(b) a sampling model $\mathbf{x} \equiv (X_1,\ldots,X_n)'$.

The concept of a statistical model provides the starting point of all forms of statistical inference considered in the sequel. More precisely, it forms the basis of what is known as parametric inference; there is also a branch of statistical inference known as non-parametric inference, where no $\Phi$ is assumed a priori.

1.1 The sampling model

A sampling model is introduced as a way to link the postulated probability model, say $\Phi = \{f(x;\theta),\ \theta \in \Theta\}$, to the observed data $\mathbf{x} \equiv (x_1,\ldots,x_n)'$ available. It is designed to model the relationship between them and refers to the way the observed data can be viewed in relation to $\Phi$.

Definition 1:
A sample is defined to be a set of random variables $(X_1, X_2, \ldots, X_n)$ whose density functions coincide with the "true" density function $f(x;\theta_0)$ postulated by the probability model.

Data are generally drawn in one of two settings. A cross-section sample is a sample of a number of observational units all drawn at the same point in time. A time-series sample is a set of observations drawn on the same observational unit at a number of (usually evenly spaced) points in time. Many recent studies have been based on time-series cross sections, which generally consist of the same cross section observed at several points in time. The term panel data set is usually fitting for this sort of study.

Given that a sample is a set of r.v.'s related to $\Phi$, it must have a distribution, which we call the distribution of the sample.

Definition 2:
The distribution of the sample $\mathbf{x} \equiv (X_1, X_2, \ldots, X_n)'$ is defined to be the joint distribution of the r.v.'s $X_1, X_2, \ldots, X_n$, denoted by $f_x(x_1,\ldots,x_n;\theta) \equiv f(\mathbf{x};\theta)$.

The distribution of the sample incorporates both forms of relevant information, the probability as well as the sample information. It should come as no surprise that $f(\mathbf{x};\theta)$ plays a very important role in statistical inference. The form of $f(\mathbf{x};\theta)$ depends crucially on the nature of the sampling model as well as on $\Phi$. The simplest, but most widely used, form of sampling model is the one based on the idea of a random experiment $\mathcal{E}$ and is called a random sample.

Definition 3:
A set of random variables $(X_1, X_2, \ldots, X_n)$ is called a random sample from $f(x;\theta)$ if the r.v.'s $X_1, X_2, \ldots, X_n$ are independently and identically distributed (i.i.d.). In this case the distribution of the sample takes the form
\[
f(x_1,\ldots,x_n;\theta) = \prod_{i=1}^{n} f(x_i;\theta),
\]
the factorization being due to independence, with each factor having the common form $f(\cdot\,;\theta)$ because the r.v.'s are identically distributed.
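As a quick numerical illustration of this factorization (a sketch in Python, not part of the original notes; the parameter value and sample size are chosen arbitrarily), the log of the distribution of a random sample from $N(\theta,1)$ is simply the sum of the marginal log-densities.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n = 2.0, 5                      # hypothetical parameter value and sample size
x = rng.normal(theta, 1.0, size=n)     # a random (i.i.d.) sample from N(theta, 1)

def log_pdf(xi, theta):
    # log of the N(theta, 1) density f(x; theta)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (xi - theta) ** 2

# Under a random sample the log of the distribution of the sample is the sum
# of the marginal log-densities: log f(x_1,...,x_n; theta) = sum_i log f(x_i; theta).
joint_log_density = log_pdf(x, theta).sum()
print(joint_log_density)
```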
A less restrictive form of sampling model, which we call an independent sample, is one in which the identically-distributed condition of the random sample is relaxed.

Definition 4:
A set of random variables $(X_1, X_2, \ldots, X_n)$ is said to be an independent sample from $f(x_i;\theta_i)$, $i = 1,2,\ldots,n$, respectively, if the r.v.'s $X_1, X_2, \ldots, X_n$ are independent. In this case the distribution of the sample takes the form
\[
f(x_1,\ldots,x_n;\theta) = \prod_{i=1}^{n} f(x_i;\theta_i).
\]
Usually the density functions $f(x_i;\theta_i)$, $i = 1,2,\ldots,n$, belong to the same family, but their numerical characteristics (moments, etc.) may differ.

If we relax the independence assumption as well, we have what we may call a non-random sample.

Definition 5:
A set of random variables $(X_1, X_2, \ldots, X_n)$ is said to be a non-random sample from $f(x_1, x_2, \ldots, x_n;\theta)$ if the r.v.'s $X_1, X_2, \ldots, X_n$ are non-i.i.d. (Here we must regard this set of random variables as a sample of size 'one' from a multivariate point of view.) In this case the only decomposition of the distribution of the sample possible is
\[
f(x_1,\ldots,x_n;\theta) = \prod_{i=1}^{n} f(x_i \mid x_1,\ldots,x_{i-1};\theta_i),
\]
given $x_0$, where $f(x_i \mid x_1,\ldots,x_{i-1};\theta_i)$, $i = 1,2,\ldots,n$, represents the conditional distribution of $X_i$ given $X_1, X_2, \ldots, X_{i-1}$.

In the context of statistical inference we need to postulate both a probability and a sampling model, and thus we define a statistical model as comprising both.

Definition 6:
A statistical model is defined as comprising
(a) a probability model $\Phi = \{f(x;\theta),\ \theta \in \Theta\}$; and
(b) a sampling model $\mathbf{x} \equiv (X_1, X_2, \ldots, X_n)'$.

It must be emphasized that the two components of a statistical model, the probability and sampling models, are clearly interrelated. For example, we cannot postulate the probability model $\Phi = \{f(x;\theta),\ \theta \in \Theta\}$ if the sample $\mathbf{x}$ is non-random. This is because if the r.v.'s $X_1, X_2, \ldots, X_n$ are not independent, the probability model must be defined in terms of their joint distribution, i.e. $\Phi = \{f(x_1, x_2, \ldots, x_n;\theta),\ \theta \in \Theta\}$ (stock prices, for example). Moreover, in the case of an independent but not identically distributed sample we need to specify the individual density function of each r.v. in the sample, i.e. $\Phi = \{f_k(x_k;\theta),\ \theta \in \Theta,\ k = 1,2,\ldots,n\}$. The most important implication of this relationship is that when the postulated sampling model is found to be inappropriate, the probability model has to be re-specified as well.

1.2 An overview of statistical inference

The statistical model in conjunction with the observed data enables us to consider the following questions:

(A) Are the observed data consistent with the postulated statistical model? (model misspecification)

(B) Assuming that the postulated statistical model is consistent with the observed data, what can we infer about the unknown parameter $\theta \in \Theta$?
(a) Can we decrease the uncertainty about $\theta$ by reducing the parameter space from $\Theta$ to $\Theta_0$, where $\Theta_0$ is a subset of $\Theta$? (confidence estimation)
(b) Can we decrease the uncertainty about $\theta$ by choosing a particular value in $\Theta$, say $\hat{\theta}$, as providing the most representative value of $\theta$? (point estimation)
(c) Can we consider the question of whether $\theta$ belongs to some subset $\Theta_0$ of $\Theta$? (hypothesis testing)

(C) Assuming that a particular representative value $\hat{\theta}$ of $\theta$ has been chosen, what can we infer about further observations from the data generating process (DGP) as described by the postulated statistical model? (prediction)
2 Point Estimation

(Point) estimation refers to our attempt to give a numerical value to $\theta$. Let $(S, \mathcal{F}, P(\cdot))$ be the probability space of reference, with $X$ a r.v. defined on this space. The following statistical model is postulated:

(i) $\Phi = \{f(x;\theta),\ \theta \in \Theta\}$, $\Theta \subseteq \mathbb{R}$;
(ii) $\mathbf{x} \equiv (X_1, X_2, \ldots, X_n)'$ is a random sample from $f(x;\theta)$.

Estimation in the context of this statistical model takes the form of constructing a mapping $h(\cdot): \mathcal{X} \rightarrow \Theta$, where $\mathcal{X}$ is the observation space and $h(\cdot)$ is a Borel function. The composite function (a statistic) $\hat{\theta} \equiv h(\mathbf{x}): S \rightarrow \Theta$ is called an estimator, and its value $h(x)$, $x \in \mathcal{X}$, an estimate of $\theta$. It is important to distinguish between the two, because the former is a random variable and the latter is a real number.

Example:
Let $f(x;\theta) = \frac{1}{\sqrt{2\pi}}\exp\{-\frac{1}{2}(x-\theta)^2\}$, $\theta \in \mathbb{R}$, and let $\mathbf{x}$ be a random sample from $f(x;\theta)$. Then $\mathcal{X} = \mathbb{R}^n$ and the following functions define estimators of $\theta$:

1. $\hat{\theta}_1 = \frac{1}{n}\sum_{i=1}^{n} X_i$;
2. $\hat{\theta}_2 = \frac{1}{k}\sum_{i=1}^{k} X_i$, $k = 1,2,\ldots,n-1$;
3. $\hat{\theta}_3 = \frac{1}{n}(X_1 + X_n)$.

It is obvious that we can construct infinitely many such estimators. However, constructing "good" estimators is not so obvious. Clearly we need some criteria for choosing between these estimators; in other words, we need to formalize what we mean by a "good" estimator.

2.1 Finite sample properties of estimators

2.1.1 Unbiasedness

An estimator is constructed with the sole aim of providing us with the "most representative value" of $\theta$ in the parameter space $\Theta$, based on the available information in the form of the statistical model. Given that the estimator $\hat{\theta} = h(\mathbf{x})$ is a r.v. (being a Borel function of the random vector $\mathbf{x}$), any notion of a "most representative value" must be expressed in terms of the distribution of $\hat{\theta}$, say $f(\hat{\theta})$. The obvious property to require of a 'good' estimator $\hat{\theta}$ of $\theta$ is that $f(\hat{\theta})$ be centered around $\theta$.

Definition 7:
An estimator $\hat{\theta}$ of $\theta$ is said to be an unbiased estimator of $\theta$ if
\[
E(\hat{\theta}) = \int_{-\infty}^{\infty} \hat{\theta}\, f(\hat{\theta})\, d\hat{\theta} = \theta.
\]
That is, the distribution of $\hat{\theta}$ has mean equal to the unknown parameter to be estimated. Note that an alternative, but equivalent, way to define $E(\hat{\theta})$ is
\[
E(\hat{\theta}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(\mathbf{x}) f(\mathbf{x};\theta)\, d\mathbf{x},
\]
where $f(\mathbf{x};\theta) = f(x_1, x_2, \ldots, x_n;\theta)$ is the distribution of the sample $\mathbf{x}$.

It must be remembered that unbiasedness is a property based on the distribution of $\hat{\theta}$. This distribution is often called the sampling distribution of $\hat{\theta}$, in order to distinguish it from other distributions of functions of r.v.'s.

2.1.2 Efficiency

Although unbiasedness seems at first sight to be a highly desirable property, it turns out that in most situations there are too many unbiased estimators for this property to serve as the sole criterion for judging estimators. The question which naturally arises is "How can we choose among unbiased estimators?" Given that the variance is a measure of dispersion, intuition suggests that the estimator with the smallest variance is in a sense better, because its distribution is more 'concentrated' around $\theta$.
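A small Monte Carlo sketch (not part of the original notes; the values of $\theta$, $n$, $k$ and the number of replications are arbitrary) makes the comparison of the three estimators from the example above concrete: the simulated means reveal any bias and the simulated variances allow an efficiency comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, k, reps = 1.0, 20, 5, 100_000        # hypothetical settings

samples = rng.normal(theta, 1.0, size=(reps, n))
est1 = samples.mean(axis=1)                    # theta_hat_1: mean of all n observations
est2 = samples[:, :k].mean(axis=1)             # theta_hat_2: mean of the first k observations
est3 = (samples[:, 0] + samples[:, -1]) / n    # theta_hat_3: (X_1 + X_n)/n as defined above

for name, est in [("theta_hat_1", est1), ("theta_hat_2", est2), ("theta_hat_3", est3)]:
    print(f"{name}: mean = {est.mean():.4f}, variance = {est.var():.4f}")
# Comparing the printed means with theta = 1 shows which estimators are (un)biased,
# and comparing the variances shows which is relatively more efficient.
```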
Definition 8:
An unbiased estimator $\hat{\theta}$ of $\theta$ is said to be relatively more efficient than some other unbiased estimator $\tilde{\theta}$ if $\mathrm{Var}(\hat{\theta}) < \mathrm{Var}(\tilde{\theta})$.

In the case of biased estimators, relative efficiency can be defined in terms of the mean square error (MSE), which takes the form
\[
\mathrm{MSE}(\hat{\theta},\theta_0) = E(\hat{\theta} - \theta_0)^2 = E\big(\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta_0\big)^2 = \mathrm{Var}(\hat{\theta}) + [\mathrm{Bias}(\hat{\theta},\theta_0)]^2,
\]
the cross-product term being zero; here $\mathrm{Bias}(\hat{\theta},\theta_0) = E(\hat{\theta}) - \theta_0$ is the bias of $\hat{\theta}$ relative to the value $\theta_0$. For any two estimators $\hat{\theta}$ and $\tilde{\theta}$ of $\theta$, if
\[
\mathrm{MSE}(\hat{\theta},\theta) \le \mathrm{MSE}(\tilde{\theta},\theta), \quad \theta \in \Theta,
\]
with strict inequality holding for some $\theta \in \Theta$, then $\tilde{\theta}$ is said to be inadmissible.

Using the concept of relative efficiency we can compare whichever estimators we happen to consider. This is, however, not very satisfactory, since there might be much better estimators, in terms of MSE, of which we know nothing. In order to avoid choosing the better of two inefficient estimators we need some absolute measure of efficiency. Such a measure is provided by the Cramér-Rao lower bound.

Definition 9:
The quantity
\[
\mathrm{CR}(\theta) = \frac{\left[1 + \frac{dB(\theta)}{d\theta}\right]^2}{E\left[\left(\frac{\partial \log f(\mathbf{x};\theta)}{\partial \theta}\right)^2\right]}
\]
is the Cramér-Rao lower bound, where $f(\mathbf{x};\theta)$ is the distribution of the sample and $B(\theta)$ the bias. It can be shown that for any estimator $\theta^*$ of $\theta$,
\[
\mathrm{MSE}(\theta^*,\theta) \ge \mathrm{CR}(\theta),
\]
under the following regularity conditions on $\Phi$:
(a) the set $A = \{x: f(x;\theta) > 0\}$ does not depend on $\theta$;
(b) for each $\theta \in \Theta$ the derivatives $\partial^i \log f(x;\theta)/\partial\theta^i$, $i = 1,2,3$, exist for all $x \in \mathcal{X}$;
(c) $0 < E[(\partial/\partial\theta)\log f(\mathbf{x};\theta)]^2 < \infty$ for all $\theta \in \Theta$.

In the case of unbiased estimators the inequality takes the form
\[
\mathrm{Var}(\theta^*) \ge \left\{E\left[\left(\frac{\partial \log f(\mathbf{x};\theta)}{\partial \theta}\right)^2\right]\right\}^{-1};
\]
the inverse of the lower bound is called Fisher's information number and is denoted by $I_n(\theta)$. (It must be borne in mind that the information number is a function of the sample size $n$.)
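As a quick numerical check (a sketch with arbitrarily chosen values, anticipating the normal example worked out below): for a random sample from $N(\theta,\sigma^2)$ with $\sigma^2$ known, the bound for an unbiased estimator of $\theta$ is $\sigma^2/n$, and the sample mean attains it.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma2, n, reps = 0.0, 4.0, 25, 200_000   # hypothetical settings

x = rng.normal(theta, np.sqrt(sigma2), size=(reps, n))
xbar = x.mean(axis=1)                            # unbiased estimator of theta

cr_bound = sigma2 / n                            # Cramer-Rao lower bound 1/I_n(theta)
print("Var(xbar) =", xbar.var(), "  CR bound =", cr_bound)
# The two numbers agree up to Monte Carlo error: the sample mean is fully efficient here.
```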
Definition 10 (multi-parameter Cramér-Rao theorem):
An unbiased estimator $\hat{\theta}$ of $\theta$ is said to be fully efficient if
\[
\mathrm{Var}(\hat{\theta}) = \left\{E\left[\left(\frac{\partial \log f(\mathbf{x};\theta)}{\partial \theta}\right)\left(\frac{\partial \log f(\mathbf{x};\theta)}{\partial \theta}\right)'\right]\right\}^{-1} = \left\{E\left[-\frac{\partial^2 \log f(\mathbf{x};\theta)}{\partial \theta\, \partial \theta'}\right]\right\}^{-1},
\]
where
\[
I_n(\theta) = E\left[\left(\frac{\partial \log f(\mathbf{x};\theta)}{\partial \theta}\right)\left(\frac{\partial \log f(\mathbf{x};\theta)}{\partial \theta}\right)'\right] = E\left[-\frac{\partial^2 \log f(\mathbf{x};\theta)}{\partial \theta\, \partial \theta'}\right]
\]
is called the sample information matrix.

Proof (for the case where $\theta$ is $1 \times 1$):
Given that $f(x_1, x_2, \ldots, x_n;\theta)$ is the joint density function of the sample, it possesses the property that
\[
\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_n;\theta)\, dx_1 \cdots dx_n = 1,
\]
or, more compactly, $\int_{-\infty}^{\infty} f(\mathbf{x};\theta)\, d\mathbf{x} = 1$. Assume that the domain of $\mathbf{x}$ is independent of $\theta$ (condition (a), which permits differentiation inside the integral sign) and that the derivative $\partial f(\cdot)/\partial\theta$ exists. Then differentiating the above equation with respect to $\theta$ gives
\[
\int_{-\infty}^{\infty} \frac{\partial f(\mathbf{x};\theta)}{\partial \theta}\, d\mathbf{x} = 0. \qquad (1)
\]
Since $\frac{d}{dt}\ln f(t) = f'(t)/f(t)$, this equation can be re-expressed as
\[
\int_{-\infty}^{\infty} \frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta} f(\mathbf{x};\theta)\, d\mathbf{x} = 0,
\]
which simply states that
\[
E\left[\frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta}\right] = 0,
\]
i.e. the expectation of the derivative of the natural logarithm of the likelihood function of a random sample from a regular density is zero. Likewise, differentiating (1) w.r.t. $\theta$ again gives
\[
\int_{-\infty}^{\infty} \frac{\partial^2 \ln f(\mathbf{x};\theta)}{\partial \theta^2} f(\mathbf{x};\theta)\, d\mathbf{x} + \int_{-\infty}^{\infty} \frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta}\frac{\partial f(\mathbf{x};\theta)}{\partial \theta}\, d\mathbf{x}
= \int_{-\infty}^{\infty} \frac{\partial^2 \ln f(\mathbf{x};\theta)}{\partial \theta^2} f(\mathbf{x};\theta)\, d\mathbf{x} + \int_{-\infty}^{\infty} \left[\frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta}\right]^2 f(\mathbf{x};\theta)\, d\mathbf{x} = 0.
\]
That is,
\[
\mathrm{Var}\left[\frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta}\right] = -E\left[\frac{\partial^2 \ln f(\mathbf{x};\theta)}{\partial \theta^2}\right].
\]
Now consider the estimator $h(\mathbf{x})$ of $\theta$, whose expectation is
\[
E(h(\mathbf{x})) = \int h(\mathbf{x}) f(\mathbf{x};\theta)\, d\mathbf{x}. \qquad (2)
\]
Differentiating (2) w.r.t. $\theta$ we obtain
\[
\frac{\partial E(h(\mathbf{x}))}{\partial \theta} = \int h(\mathbf{x})\frac{\partial f(\mathbf{x};\theta)}{\partial \theta}\, d\mathbf{x} = \int h(\mathbf{x})\frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta} f(\mathbf{x};\theta)\, d\mathbf{x} = \mathrm{Cov}\left[h(\mathbf{x}),\ \frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta}\right]
\]
(since $E[\partial \ln f(\mathbf{x};\theta)/\partial\theta] = 0$). By the Cauchy-Schwarz inequality ($E|YZ| \le [E(Y^2)]^{1/2}[E(Z^2)]^{1/2}$) we have
\[
\mathrm{Cov}\left[h(\mathbf{x}),\ \frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta}\right]^2 = \left[\frac{\partial E(h(\mathbf{x}))}{\partial \theta}\right]^2 \le \mathrm{Var}(h(\mathbf{x})) \cdot \mathrm{Var}\left[\frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta}\right]. \qquad (3)
\]
In light of (3), we have
\[
\mathrm{Var}(h(\mathbf{x})) \ge \frac{\left[\frac{\partial E(h(\mathbf{x}))}{\partial \theta}\right]^2}{-E\left[\frac{\partial^2 \ln f(\mathbf{x};\theta)}{\partial \theta^2}\right]}.
\]
If the estimator is unbiased, $E(h(\mathbf{x})) = \theta$ and
\[
\mathrm{Var}(h(\mathbf{x})) \ge \frac{1}{-E\left[\frac{\partial^2 \ln f(\mathbf{x};\theta)}{\partial \theta^2}\right]}.
\]

Example:
For a random sample of size $n$ from a normal distribution, the Cramér-Rao variance lower bound for unbiased estimators of $\theta = (\mu,\sigma^2)'$ is derived as follows.
\[
f(x_1, x_2, \ldots, x_n;\theta) = f(\mathbf{x};\theta) = \prod_{i=1}^{n}\left\{(2\pi)^{-1/2}(\sigma^2)^{-1/2}\exp\left[-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2\right]\right\} = (2\pi)^{-n/2}(\sigma^2)^{-n/2}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right],
\]
\[
\ln f(x_1, x_2, \ldots, x_n;\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2,
\]
\[
\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu), \qquad
\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2,
\]
\[
\frac{\partial^2 \ln L}{\partial \mu^2} = -\frac{n}{\sigma^2}, \qquad
\frac{\partial^2 \ln L}{\partial (\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^{n}(x_i - \mu)^2, \qquad
\frac{\partial^2 \ln L}{\partial \mu\, \partial \sigma^2} = -\frac{1}{\sigma^4}\sum_{i=1}^{n}(x_i - \mu).
\]
The sample information matrix is
\[
I_n(\theta) = E\left[\left(\frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta}\right)\left(\frac{\partial \ln f(\mathbf{x};\theta)}{\partial \theta}\right)'\right] = E\left[-\frac{\partial^2 \ln f(\mathbf{x};\theta)}{\partial \theta\, \partial \theta'}\right]
= -E\begin{bmatrix} \frac{\partial^2 \ln L}{\partial \mu^2} & \frac{\partial^2 \ln L}{\partial \mu\, \partial \sigma^2} \\ \frac{\partial^2 \ln L}{\partial \mu\, \partial \sigma^2} & \frac{\partial^2 \ln L}{\partial (\sigma^2)^2} \end{bmatrix}
= \begin{bmatrix} \frac{n}{\sigma^2} & 0 \\ 0 & \frac{n}{2\sigma^4} \end{bmatrix}.
\]
(How? Use $E(x_i - \mu) = 0$ and $E(x_i - \mu)^2 = \sigma^2$ in the expectations above.) The Cramér-Rao variance lower bound therefore is
\[
I_n^{-1}(\mu,\sigma^2) = \begin{bmatrix} \frac{\sigma^2}{n} & 0 \\ 0 & \frac{2\sigma^4}{n} \end{bmatrix}.
\]
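The two identities established in the proof, $E[\partial \ln f(\mathbf{x};\theta)/\partial\theta] = 0$ and $\mathrm{Var}[\partial \ln f(\mathbf{x};\theta)/\partial\theta] = I_n(\theta)$, can be checked by simulation for the normal model just worked out; the sketch below (with arbitrarily chosen $\mu$, $\sigma^2$ and $n$) evaluates the score with respect to $\mu$ at the true parameter values.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2, n, reps = 1.0, 2.0, 30, 200_000    # hypothetical values

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
score_mu = (x - mu).sum(axis=1) / sigma2       # d lnL / d mu evaluated at the true parameters

print("E[score]   ~", score_mu.mean())         # approximately 0
print("Var[score] ~", score_mu.var())          # approximately n / sigma2, the (mu, mu) entry of I_n
print("n/sigma2   =", n / sigma2)
```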
2.1.3 Sufficiency

Efficiency can be seen as a property indicating that the estimator "utilizes" all the information contained in the statistical model. An important concept related to the information of a statistical model is that of a sufficient statistic, introduced by Fisher (1922) as a way to reduce the sampling information by discarding only the information of no relevance to any inference about $\theta$. In other words, a statistic $\tau(\mathbf{x})$ is said to be sufficient for $\theta$ if it makes no difference whether we use $\mathbf{x}$ or $\tau(\mathbf{x})$ in inference concerning $\theta$. Obviously, in such a case we would prefer to work with $\tau(\mathbf{x})$ instead of $\mathbf{x}$, the former being of lower dimensionality.

Definition 11 (sufficient statistic):
A statistic $\tau(\cdot): \mathcal{X} \rightarrow \mathbb{R}^m$, $n > m$ ($n$ is the size of the sample), is called sufficient for $\theta$ if the conditional distribution $f(\mathbf{x} \mid \tau(\mathbf{x}) = \tau)$ is independent of $\theta$, i.e. $\theta$ does not appear in $f(\mathbf{x} \mid \tau(\mathbf{x}) = \tau)$ and the domain of $f(\cdot)$ does not involve $\theta$.

Verifying this directly, by deriving $f(\mathbf{x} \mid \tau(\mathbf{x}) = \tau)$ and showing that it is independent of $\theta$, can be a very difficult exercise. One indirect way of verifying sufficiency is provided by the following theorem.

Theorem 1 (Fisher-Neyman factorization):
The statistic $\tau(\mathbf{x})$ is sufficient for $\theta$ if and only if there exists a factorization of the form
\[
f(\mathbf{x};\theta) = f(\tau(\mathbf{x});\theta)\, h(\mathbf{x}),
\]
where $f(\tau(\mathbf{x});\theta)$ is the density function of $\tau(\mathbf{x})$ and depends on $\theta$, and $h(\mathbf{x})$ is some function of $\mathbf{x}$ independent of $\theta$.

Even this result, however, is of no great help, because we have to have the statistic $\tau(\mathbf{x})$, as well as its distribution, to begin with. Lehmann and Scheffé (1950) provide a very convenient way to derive minimal sufficient statistics.

Intuition suggests that, since efficiency is related to full utilization of the information in the statistical model, and sufficiency can be seen as a maximal reduction of such information without losing anything relevant to inference about $\theta$, there must be a direct relationship between the two properties. A relationship along the lines that, when an efficient estimator is needed, we should look no further than the sufficient statistics, is provided by the following theorem.

Theorem 2 (Rao-Blackwell):
Let $\tau(\mathbf{x})$ be a sufficient statistic for $\theta$ and $\kappa(\mathbf{x})$ be an estimator of $\theta$. Then
\[
E(h(\mathbf{x}) - \theta)^2 \le E(\kappa(\mathbf{x}) - \theta)^2, \quad \theta \in \Theta,
\]
where $h(\mathbf{x}) = E(\kappa(\mathbf{x}) \mid \tau(\mathbf{x}) = \tau)$, i.e. the conditional expectation of $\kappa(\mathbf{x})$ given $\tau(\mathbf{x}) = \tau$.

From the above discussion of the properties of unbiasedness, relative and full efficiency, and sufficiency, we can see that these properties are directly related to the distribution of the estimator $\hat{\theta}$ of $\theta$. As argued repeatedly, deriving the distribution of Borel functions of r.v.'s such as $\hat{\theta} = h(\mathbf{x})$ is a very difficult exercise, and very few results are available in the literature; the existing results relate mainly to simple functions of normally distributed r.v.'s. For the cases where no such results are available (which is the rule rather than the exception) we have to resort to asymptotic results. This implies that we need to extend the above list of criteria for 'good' estimators to include asymptotic properties of estimators. These asymptotic properties refer to the behavior of $\hat{\theta}_n$ as $n \rightarrow \infty$. In order to emphasize the distinction between these asymptotic properties and the properties considered so far, we call the latter finite sample properties. The finite sample properties are related directly to the distribution of $\hat{\theta}_n$, say $f(\hat{\theta}_n)$; the asymptotic properties, on the other hand, are related to the asymptotic distribution of $\hat{\theta}_n$.

2.2 Asymptotic Properties

2.2.1 Consistency

A natural property to require of an estimator is that as $n \rightarrow \infty$ the probability of $\hat{\theta}_n$ being close to the true $\theta$ should increase as well. We formalize this idea using the concept of convergence in probability associated with the weak law of large numbers.

Definition 12:
An estimator $\hat{\theta}_n = h(\mathbf{x})$ is said to be (weakly) consistent for $\theta$ if, for every $\varepsilon > 0$,
\[
\lim_{n \rightarrow \infty} \Pr(|\hat{\theta}_n - \theta| < \varepsilon) = 1;
\]
we write $\hat{\theta}_n \xrightarrow{p} \theta$.

A sufficient condition for weak consistency of $\hat{\theta}_n$ is formalized in the following lemma.

Lemma (convergence in mean square error):
If $\hat{\theta}_n$ is an estimator of $\theta$ which satisfies
(a) $\lim_{n \rightarrow \infty} E(\hat{\theta}_n) = \theta$; and
(b) $\lim_{n \rightarrow \infty} \mathrm{Var}(\hat{\theta}_n) = 0$,
then $\hat{\theta}_n \xrightarrow{p} \theta$.

It is important to note that these are only sufficient conditions for consistency (not necessary); that is, consistency is not equivalent to the above conditions, since for consistency $\mathrm{Var}(\hat{\theta}_n)$ need not even exist.
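A small simulation sketch (with arbitrarily chosen $\theta$, $\varepsilon$ and sample sizes; not part of the original notes) illustrates weak consistency of the sample mean $\bar{X}_n$ for the mean $\theta$ of an $N(\theta,1)$ random sample: the estimated probability $\Pr(|\bar{X}_n - \theta| < \varepsilon)$ approaches one as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
theta, eps, reps = 0.0, 0.1, 1_000      # hypothetical values

for n in (10, 100, 1_000, 10_000):
    xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
    # Fraction of replications with |xbar - theta| < eps: an estimate of Pr(|xbar - theta| < eps)
    print(n, np.mean(np.abs(xbar - theta) < eps))
```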
A stronger form of consistency, associated with almost sure convergence, is a very desirable asymptotic property for estimators.

Definition 13:
An estimator $\hat{\theta}_n$ is said to be a strongly consistent estimator of $\theta$ if
\[
\Pr\left(\lim_{n \rightarrow \infty}\hat{\theta}_n = \theta\right) = 1,
\]
denoted by $\hat{\theta}_n \xrightarrow{a.s.} \theta$.

Theorem 3:
Given $g: \mathbb{R}^k \rightarrow \mathbb{R}^l$ and any sequence of random $k \times 1$ vectors $b_n$ such that $b_n \xrightarrow{a.s.} b$, where $b$ is $k \times 1$, if $g$ is continuous at $b$, then $g(b_n) \xrightarrow{a.s.} g(b)$.

2.2.2 Asymptotic Normality

A second form of convergence is convergence in distribution.

Definition 14:
Let $\{X_n\}$ be a sequence of random variables indexed by the sample size, and assume that $X_n$ has cdf $F_n(x)$. If $\lim_{n \rightarrow \infty}|F_n(x) - F(x)| = 0$ at all continuity points of $F(x)$, where $F(x)$ is the distribution function of a random variable $X$, then $X_n$ converges in distribution to the random variable $X$, denoted $X_n \xrightarrow{d} X$, and $F(x)$ is called the limiting distribution of $X_n$.

Extending the central limit theorem to $\hat{\theta}_n$ leads to the property of asymptotic normality.

Definition 15:
An estimator $\hat{\theta}_n$ is said to be asymptotically normal if two sequences $\{V_n(\theta),\ n \ge 1\}$ and $\{\theta_n,\ n \ge 1\}$ exist such that
\[
(V_n(\theta))^{-\frac{1}{2}}(\hat{\theta}_n - \theta_n) \xrightarrow{d} Z \sim N(0,1),
\]
where $V_n(\theta) = \lim_{n \rightarrow \infty}\mathrm{Var}(\hat{\theta}_n)$. In most cases of interest in practice, such as the case of a random sample, $V_n(\theta)$ is of order $1/n$, denoted $V_n(\theta) = O(1/n)$. In such a case asymptotic normality can be written in the form
\[
\sqrt{n}(\hat{\theta}_n - \theta_n) \sim N(0, V(\theta)),
\]
where $V(\theta) = n \cdot V_n(\theta)$ represents the asymptotic variance.

Definition 16:
An estimator $\hat{\theta}_n$ with $V_n(\theta) = O(1/n)$ is said to be asymptotically unbiased if
\[
E\left[\sqrt{n}(\hat{\theta}_n - \theta_n)\right] \rightarrow 0 \quad \text{as } n \rightarrow \infty.
\]
It must be emphasized that asymptotic unbiasedness is a stronger condition than $\lim_{n \rightarrow \infty} E(\hat{\theta}_n) = \theta$; the former specifies the rate of convergence. In relation to the variance of the asymptotic normal distribution we can define the concept of asymptotic efficiency.

Definition 17:
An asymptotically normal estimator $\hat{\theta}_n$ is said to be asymptotically efficient if $V_n(\theta) = (I_n(\theta))^{-1}$, where
\[
I_n(\theta) = E\left[\left(\frac{\partial \log f(\mathbf{x};\theta)}{\partial \theta}\right)\left(\frac{\partial \log f(\mathbf{x};\theta)}{\partial \theta}\right)'\right] = E\left[-\frac{\partial^2 \log f(\mathbf{x};\theta)}{\partial \theta\, \partial \theta'}\right],
\]
i.e. the asymptotic variance achieves the limit of the Cramér-Rao lower bound. See Chapter 4 for detailed coverage of asymptotic theorems.

3 Methods of Estimation

The purpose of this section is to consider various methods for constructing "good" estimators of the unknown parameters $\theta$. The methods to be discussed are the least-squares method, the method of moments and the maximum likelihood method.

3.1 The method of least-squares

The method of least squares was first introduced by Legendre in 1805 and Gauss in 1809 in the context of astronomical measurement. The problem as posed at the time was one of approximating a set of noisy observations $y_i$, $i = 1,2,\ldots,n$, by some known functions $g_i(\theta_1,\theta_2,\ldots,\theta_m)$, $i = 1,2,\ldots,n$, which depend on the unknown parameters $\theta = (\theta_1,\ldots,\theta_m)'$, $m < n$. Legendre suggested minimizing the sum of squared errors
\[
l(\theta) = \sum_{i=1}^{n}(y_i - g_i(\theta))^2,
\]
hence the name least squares. Assuming differentiability of $g_i(\theta)$, $i = 1,2,\ldots,n$, the condition $\partial l(\theta)/\partial\theta = 0$ gives rise to the so-called normal equations of the form
\[
(-2)\sum_{i=1}^{n}[y_i - g_i(\theta)]\frac{\partial}{\partial\theta_k}g_i(\theta) = 0, \quad k = 1,2,\ldots,m.
\]
In this form the least-squares method has nothing to do with the statistical model developed above; it is merely an interpolation method in approximation theory.
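As a concrete illustration (a sketch with simulated data; the linear specification $g_i(\theta) = \theta_1 + \theta_2 z_i$ and the parameter values are assumptions made here, not part of the original notes), in the linear case the normal equations reduce to a linear system that can be solved directly.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
z = rng.uniform(0, 10, size=n)
theta_true = np.array([1.0, 2.0])                       # hypothetical true parameters
y = theta_true[0] + theta_true[1] * z + rng.normal(0, 1, size=n)

# With g_i(theta) = theta_1 + theta_2 * z_i, the normal equations
# sum_i (y_i - g_i(theta)) * dg_i/dtheta_k = 0 become the linear system (G'G) theta = G'y.
G = np.column_stack([np.ones(n), z])
theta_ls = np.linalg.solve(G.T @ G, G.T @ y)
print(theta_ls)                                         # close to (1, 2)
```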
Gauss extended the least-squares method to a statistical method by attaching a probabilistic structure to the error term. He posed the problem in the form
\[
y_i = g_i(\theta) + \varepsilon_i, \quad \varepsilon_i \sim \mathrm{NI}(0,\sigma^2), \quad i = 1,2,\ldots,n.
\]
In this form the problem can be viewed as one of estimation in the context of the statistical model
\[
\Phi = \left\{f(y_i;\theta) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{1}{2\sigma^2}(y_i - g_i(\theta))^2\right),\ \theta \in \Theta\right\},
\]
obtained by transferring the probabilistic assumption from $\varepsilon_i$ to $y_i$, the observable r.v., with $\mathbf{y} = (Y_1, Y_2, \ldots, Y_n)'$ an independent sample from $f(y_i;\theta)$, $i = 1,2,\ldots,n$. Gauss went on to derive what we have called the distribution of the sample,
\[
f(\mathbf{y};\theta) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - g_i(\theta))^2\right),
\]
and observed that maximization of $f(\mathbf{y};\theta)$ with respect to $\theta$ gives rise to the same estimator of $\theta$ as minimization of the sum of squared errors $\sum_{i=1}^{n}(y_i - g_i(\theta))^2$. As we will see below, this maximization can be seen as a forerunner of the maximum likelihood method.

3.2 The method of moments

In the context of a probability model $\Phi$, however, the unknown parameters of interest are associated not only with the mean but also with higher moments. This prompted Pearson in 1894 to suggest the method of moments as a general estimation method.

Let us assume that $\mathbf{x} = (X_1, X_2, \ldots, X_n)'$ is a random sample from $f(x;\theta)$, $\theta \in \mathbb{R}^k$. The raw moments of $f(x;\theta)$, $\mu_r' \equiv E(X^r)$, $r \ge 1$, are by definition functions of the unknown parameters, since
\[
\mu_r'(\theta) = \int_{-\infty}^{\infty} x^r f(x;\theta)\, dx, \quad r \ge 1.
\]
In order to apply the method we need to express the unknown parameters $\theta$ in the form
\[
\theta_i = g_i(\mu_1',\mu_2',\ldots,\mu_k'), \quad i = 1,2,\ldots,k,
\]
where the $g_i$ are continuous functions. The method of moments, based on this substitution idea, proposes estimating $\theta_i$ by
\[
\hat{\theta}_i = g_i(m_1,m_2,\ldots,m_k), \quad i = 1,2,\ldots,k,
\]
where $m_r = (1/n)\sum_{i=1}^{n} X_i^r$, $r \ge 1$, are the sample raw moments. The justification of the method rests on the fact that if $\mu_1',\mu_2',\ldots,\mu_k'$ are one-to-one functions of $\theta$, then since $m_r \xrightarrow{a.s.} \mu_r'$, $r \ge 1$, it follows (by Slutsky's theorem) that $\hat{\theta}_i \xrightarrow{a.s.} \theta_i$, $i = 1,2,\ldots,k$.

Example:
Let $X_i \sim N(\mu,\sigma^2)$, $i = 1,2,\ldots,n$. Then $\mu_1' = \mu$, $\mu_2' = \sigma^2 + \mu^2$, and $m_1 = (1/n)\sum_{i=1}^{n} X_i$, $m_2 = (1/n)\sum_{i=1}^{n} X_i^2$. The method suggests
\[
\hat{\mu} = m_1 = \bar{X}_n, \qquad \hat{\sigma}^2 = m_2 - m_1^2 = \frac{1}{n}\sum_{i=1}^{n} X_i^2 - \bar{X}_n^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2.
\]

Although method-of-moments estimators are usually consistent, they are in general inefficient. This point was taken up by Fisher in several papers in the 1920s and 1930s, arguing in favor of the MLE. The controversy between Pearson and Fisher about the relative merits of their respective estimation methods ended in the mid-1930s with Fisher the winner, and the maximum likelihood method has been dominant ever since. The basic reason for the inefficiency of method-of-moments estimators is not hard to find: the method does not use any information relating to the probability model $\Phi$ apart from the assumption that raw moments of order $k$ exist.
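The sketch below implements the method-of-moments estimators from the normal example above; the particular values of $\mu$, $\sigma^2$ and $n$ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
mu, sigma2, n = 3.0, 4.0, 10_000        # hypothetical values
x = rng.normal(mu, np.sqrt(sigma2), size=n)

m1 = x.mean()                # first sample raw moment
m2 = (x**2).mean()           # second sample raw moment

mu_hat = m1                  # mu = mu'_1
sigma2_hat = m2 - m1**2      # sigma^2 = mu'_2 - (mu'_1)^2
print(mu_hat, sigma2_hat)    # close to (3, 4)
```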
3.3 The maximum likelihood method

The maximum likelihood method of estimation was formulated by Fisher and extended by various authors such as Cramér, Rao and Wald. In the current statistical literature the method of maximum likelihood is by far the most widely used method of estimation and plays a very important role in hypothesis testing.

3.3.1 The likelihood function

Consider the (univariate) statistical model:
(a) $\Phi = \{f(x;\theta),\ \theta \in \Theta\}$;
(b) $\mathbf{x} = (X_1, X_2, \ldots, X_n)'$ a sample from $f(x;\theta)$,
where $\mathbf{x}$ takes values in $\mathcal{X} = \mathbb{R}^n$, the observation space. The distribution of the sample $D(x_1, x_2, \ldots, x_n;\theta)$ describes how the density changes as $\mathbf{x}$ takes different values in $\mathcal{X}$ for a given $\theta \in \Theta$. In deriving the likelihood function we reason as follows: since $D(\mathbf{x};\theta)$ incorporates all the information in the statistical model, it makes intuitive sense to reverse the argument used in deriving $D(\mathbf{x};\theta)$ and ask which value of $\theta \in \Theta$ is most strongly supported by a given sample realisation $\mathbf{x} = x$. That is, the value of $\theta$ under which $x$ would have had the highest 'likelihood' of arising must intuitively be our best choice of $\theta$. Using this intuitive argument, the likelihood function is defined by
\[
L(\theta;\mathbf{x}) = k(\mathbf{x}) D(\mathbf{x};\theta), \quad \theta \in \Theta,
\]
where $k(\mathbf{x}) > 0$ is a function of $\mathbf{x}$ only (not of $\theta$). In particular $L(\cdot\,;\mathbf{x}): \Theta \rightarrow [0,1]$. The presence of $k(\mathbf{x})$ (which we arbitrarily choose to be equal to one) in the definition of $L(\theta;\mathbf{x})$ implies that the likelihood function is non-unique; any monotonic transformation of it represents the same information. In particular:
(a) $\log L(\theta;\mathbf{x})$, the log-likelihood function; and
(b) $\frac{\partial \log L(\theta;\mathbf{x})}{\partial \theta} \equiv s(\theta;\mathbf{x})$, the score function,
incorporate the same information as $L(\theta;\mathbf{x})$.

3.3.2 The Maximum Likelihood Estimator (MLE)

Given that the likelihood function represents the support given to the various $\theta \in \Theta$ by the realisation $\mathbf{x} = x$, it is natural to define the maximum likelihood estimator of $\theta$ to be a Borel function $\hat{\theta}: \mathcal{X} \rightarrow \Theta$ such that
\[
L(\hat{\theta};\mathbf{x}) = \max_{\theta \in \Theta} L(\theta;\mathbf{x});
\]
there may be one, none, or many such MLEs. Note that $\log L(\hat{\theta};\mathbf{x}) \ge \log L(\theta^*;\mathbf{x})$ for all $\theta^* \in \Theta$. In the case where $L(\theta;\mathbf{x})$ is differentiable, the MLE can be derived as a solution of the equations
\[
\frac{\partial \log L(\theta;\mathbf{x})}{\partial \theta} \equiv s(\theta;\mathbf{x}) = 0,
\]
referred to as the likelihood equations.

Example (univariate normal distribution, i.i.d. sample):
Let $X_1, X_2, \ldots, X_n$ be a random sample from $N(\mu,\sigma^2)$. Find the MLEs of $\theta = (\mu,\sigma^2)'$. For random sampling from a normal distribution, the log-likelihood and its derivatives are
\[
\ln L(\mu,\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(X_i - \mu)^2,
\]
\[
\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(X_i - \mu), \qquad
\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(X_i - \mu)^2.
\]
Therefore $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} X_i = \bar{X}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$ are indeed MLEs, since
\[
\left.\frac{\partial^2 \ln L}{\partial \mu^2}\right|_{\theta=\hat{\theta}} = -\frac{n}{\hat{\sigma}^2} < 0, \qquad
\left.\frac{\partial^2 \ln L}{\partial (\sigma^2)^2}\right|_{\theta=\hat{\theta}} = \frac{n}{2\hat{\sigma}^4} - \frac{1}{\hat{\sigma}^6}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2 = -\frac{n}{2\hat{\sigma}^4} < 0, \qquad
\left.\frac{\partial^2 \ln L}{\partial \mu\, \partial \sigma^2}\right|_{\theta=\hat{\theta}} = -\frac{1}{\hat{\sigma}^4}\sum_{i=1}^{n}(X_i - \bar{X}) = 0.
\]
We can deduce that
\[
H(\hat{\theta}) = \begin{bmatrix} -\frac{n}{\hat{\sigma}^2} & 0 \\ 0 & -\frac{n}{2\hat{\sigma}^4} \end{bmatrix}
\]
is a negative definite matrix.

Example:
The expected values of the estimators $\hat{\mu}$ and $\hat{\sigma}^2$ are derived as follows.
\[
E(\hat{\mu}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}E\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(X_i) = \frac{1}{n}\, n\mu = \mu.
\]
Denote by $\mathbf{x} = (X_1, X_2, \ldots, X_n)'$ and $\boldsymbol{\mu} = (\mu,\mu,\ldots,\mu)'$ the vector of this random sample and its mean vector, respectively. Then the joint distribution of this sample is $\mathbf{x} \sim N(\boldsymbol{\mu}, \sigma^2\mathbf{I})$. We know that
\[
\sum_{i=1}^{n}(X_i - \bar{X})^2 = \mathbf{x}'\mathbf{M}^0\mathbf{x} = (\mathbf{x} - \boldsymbol{\mu})'\mathbf{M}^0(\mathbf{x} - \boldsymbol{\mu}),
\]
since $\mathbf{M}^0(\mathbf{x} - \boldsymbol{\mu}) = \mathbf{M}^0\mathbf{x} - \mathbf{M}^0\boldsymbol{\mu} = \mathbf{M}^0\mathbf{x}$, where $\mathbf{M}^0 = \mathbf{I} - \frac{1}{n}\mathbf{i}\mathbf{i}'$. From the results on 'quadratic forms related to the normal distribution' in Section 6.2.3 of Ch. 2, we have
\[
\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sigma^2} = \frac{\mathbf{x}'\mathbf{M}^0\mathbf{x}}{\sigma^2} = \frac{(\mathbf{x} - \boldsymbol{\mu})'\mathbf{M}^0(\mathbf{x} - \boldsymbol{\mu})}{\sigma^2} \sim \chi^2_{\mathrm{tr}(\mathbf{M}^0)} \equiv \chi^2_{n-1}.
\]
Therefore
\[
E\left(\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{\sigma^2}\right) = n-1, \quad \text{or} \quad E\left(\sum_{i=1}^{n}(X_i - \bar{X})^2\right) = (n-1)\sigma^2.
\]
The expected value of the MLE of $\sigma^2$, $\hat{\sigma}^2$, is therefore
\[
E(\hat{\sigma}^2) = E\left(\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n}\right) = \frac{n-1}{n}\sigma^2 \ne \sigma^2,
\]
so $\hat{\sigma}^2$ is biased. An unbiased estimator of $\sigma^2$ is
\[
s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}.
\]
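The bias result can be checked by simulation (a sketch with arbitrarily chosen $\mu$, $\sigma^2$ and $n$): averaged over many samples, $\hat{\sigma}^2$ is close to $(n-1)\sigma^2/n$ while $s^2$ is close to $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma2, n, reps = 0.0, 1.0, 5, 200_000     # hypothetical settings

x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
sigma2_mle = x.var(axis=1, ddof=0)             # MLE: divides by n
s2 = x.var(axis=1, ddof=1)                     # unbiased estimator: divides by n - 1

print("E[sigma2_mle] ~", sigma2_mle.mean(), "  (n-1)/n * sigma2 =", (n - 1) / n * sigma2)
print("E[s2]         ~", s2.mean(), "  sigma2 =", sigma2)
```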
Example:
Let $X_1, X_2, \ldots, X_n$ be a random sample from the density
\[
f(x;\theta) = \theta(1-\theta)^x, \quad x = 0,1,2,\ldots, \quad 0 < \theta < 1.
\]
Find the MLE of $\theta$. Since this is a random sample, the likelihood function takes the form
\[
L(\theta;X_1,X_2,\ldots,X_n) = f(X_1;\theta)f(X_2;\theta)\cdots f(X_n;\theta) = \theta(1-\theta)^{X_1}\cdot\theta(1-\theta)^{X_2}\cdots\theta(1-\theta)^{X_n} = \theta^n(1-\theta)^{\sum_{i=1}^{n}X_i},
\]
and the log-likelihood is therefore
\[
\ln L(\theta;X_1,X_2,\ldots,X_n) = n\ln\theta + \sum_{i=1}^{n}X_i\,\ln(1-\theta).
\]
The first-order condition is
\[
\frac{\partial \ln L(\theta;X_1,X_2,\ldots,X_n)}{\partial \theta} = \frac{n}{\theta} - \frac{\sum_{i=1}^{n}X_i}{1-\theta} = 0,
\]
which has the solution
\[
\hat{\theta} = \frac{1}{1 + \frac{1}{n}\sum_{i=1}^{n}X_i} = \frac{1}{1 + \bar{X}}.
\]

Exercise:
Check the second-order condition to make sure that the solution of the FOC in the last example is indeed an MLE.

Example (multivariate normal distribution, i.i.d. sample):
Let $\mathbf{x} \equiv (\mathbf{x}_1',\mathbf{x}_2',\ldots,\mathbf{x}_T')'$ be a random sample of size $T$ from $N_n(\boldsymbol{\mu},\boldsymbol{\Sigma})$, i.e. $\mathbf{x}_t \sim N_n(\boldsymbol{\mu},\boldsymbol{\Sigma})$, $t = 1,2,\ldots,T$, $\mathbf{x}$ being an $nT \times 1$ vector. The likelihood function takes the form
\[
L(\boldsymbol{\mu},\boldsymbol{\Sigma};\mathbf{x}) = \prod_{t=1}^{T}\left[(2\pi)^{-n/2}|\boldsymbol{\Sigma}|^{-1/2}\exp\left\{-\frac{1}{2}(\mathbf{x}_t - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_t - \boldsymbol{\mu})\right\}\right]
= (2\pi)^{-nT/2}|\boldsymbol{\Sigma}|^{-T/2}\exp\left\{-\frac{1}{2}\sum_{t=1}^{T}(\mathbf{x}_t - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_t - \boldsymbol{\mu})\right\},
\]
and the log-likelihood function therefore is
\[
\ln L(\boldsymbol{\mu},\boldsymbol{\Sigma};\mathbf{x}) = -\frac{nT}{2}\ln(2\pi) - \frac{T}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{t=1}^{T}(\mathbf{x}_t - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_t - \boldsymbol{\mu}).
\]
Since
\[
\sum_{t=1}^{T}(\mathbf{x}_t - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_t - \boldsymbol{\mu}) = \sum_{t=1}^{T}(\mathbf{x}_t - \bar{\mathbf{x}}_T)'\boldsymbol{\Sigma}^{-1}(\mathbf{x}_t - \bar{\mathbf{x}}_T) + T(\bar{\mathbf{x}}_T - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}_T - \boldsymbol{\mu}) = \mathrm{tr}\,\boldsymbol{\Sigma}^{-1}\boldsymbol{\Lambda} + T(\bar{\mathbf{x}}_T - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}_T - \boldsymbol{\mu}),
\]
where
\[
\bar{\mathbf{x}}_T = \frac{1}{T}\sum_{t}\mathbf{x}_t, \qquad \boldsymbol{\Lambda} = \sum_{t}(\mathbf{x}_t - \bar{\mathbf{x}}_T)(\mathbf{x}_t - \bar{\mathbf{x}}_T)',
\]
we can write
\[
\ln L(\boldsymbol{\mu},\boldsymbol{\Sigma};\mathbf{x}) = -\frac{nT}{2}\ln(2\pi) - \frac{T}{2}\ln|\boldsymbol{\Sigma}| - \frac{1}{2}\mathrm{tr}\,\boldsymbol{\Sigma}^{-1}\boldsymbol{\Lambda} - \frac{T}{2}(\bar{\mathbf{x}}_T - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}_T - \boldsymbol{\mu}).
\]
The first-order conditions are
\[
\frac{\partial \ln L(\boldsymbol{\mu},\boldsymbol{\Sigma};\mathbf{x})}{\partial \boldsymbol{\mu}} = T\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}_T - \boldsymbol{\mu}) = 0 \ \Rightarrow\ \hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}_T,
\]
and
\[
\frac{\partial \ln L(\boldsymbol{\mu},\boldsymbol{\Sigma};\mathbf{x})}{\partial \boldsymbol{\Sigma}^{-1}} = \frac{T}{2}\boldsymbol{\Sigma} - \frac{1}{2}\boldsymbol{\Lambda} = 0 \ \Rightarrow\ \hat{\boldsymbol{\Sigma}} = \frac{1}{T}\sum_{t}(\mathbf{x}_t - \bar{\mathbf{x}}_T)(\mathbf{x}_t - \bar{\mathbf{x}}_T)'.
\]
In deriving the first-order condition for $\boldsymbol{\mu}$ above we have used the following reasoning: since $\boldsymbol{\Sigma}$ is positive definite, $\boldsymbol{\Sigma}^{-1}$ is also positive definite, so $(\bar{\mathbf{x}}_T - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\bar{\mathbf{x}}_T - \boldsymbol{\mu}) \ge 0$, with equality if and only if $\boldsymbol{\mu} = \bar{\mathbf{x}}_T$. To maximize the remaining terms of $\ln L(\boldsymbol{\mu},\boldsymbol{\Sigma};\mathbf{x})$ with respect to $\boldsymbol{\Sigma}$ we use the following lemma.

Lemma (Anderson, p. 62):
If $\mathbf{D}$ is positive definite of order $p$, the maximum of
\[
f(\mathbf{G}) = -T\ln|\mathbf{G}| - \mathrm{tr}\,\mathbf{G}^{-1}\mathbf{D}
\]
with respect to positive definite matrices $\mathbf{G}$ exists and occurs at $\mathbf{G} = (1/T)\mathbf{D}$.

Proof:
Let $\mathbf{D} = \mathbf{E}\mathbf{E}'$ and $\mathbf{E}'\mathbf{G}^{-1}\mathbf{E} = \mathbf{H}$. Then $\mathbf{G} = \mathbf{E}\mathbf{H}^{-1}\mathbf{E}'$ and
\[
|\mathbf{G}| = |\mathbf{E}||\mathbf{H}^{-1}||\mathbf{E}'| = |\mathbf{H}^{-1}||\mathbf{E}\mathbf{E}'| = |\mathbf{D}|/|\mathbf{H}|,
\]
and $\mathrm{tr}\,\mathbf{G}^{-1}\mathbf{D} = \mathrm{tr}\,\mathbf{G}^{-1}\mathbf{E}\mathbf{E}' = \mathrm{tr}\,\mathbf{E}'\mathbf{G}^{-1}\mathbf{E} = \mathrm{tr}\,\mathbf{H}$. Then the function to be maximized (with respect to positive definite $\mathbf{H}$) is
\[
f = -T\ln|\mathbf{D}| + T\ln|\mathbf{H}| - \mathrm{tr}\,\mathbf{H}.
\]
Let $\mathbf{H} = \mathbf{K}\mathbf{K}'$, where $\mathbf{K}$ is lower triangular. Then the maximum of
\[
f = -T\ln|\mathbf{D}| + T\ln|\mathbf{K}|^2 - \mathrm{tr}\,\mathbf{K}\mathbf{K}' = -T\ln|\mathbf{D}| + \sum_{i=1}^{p}(T\ln k_{ii}^2 - k_{ii}^2) - \sum_{i>j}k_{ij}^2
\]
occurs at $k_{ii}^2 = T$, $k_{ij} = 0$, $i \ne j$; that is, at $\mathbf{H} = T\mathbf{I}$. Then $\mathbf{G} = (1/T)\mathbf{E}\mathbf{E}' = (1/T)\mathbf{D}$.
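The closed-form MLEs $\hat{\boldsymbol{\mu}} = \bar{\mathbf{x}}_T$ and $\hat{\boldsymbol{\Sigma}} = \boldsymbol{\Lambda}/T$ can be computed directly from data; the sketch below uses a hypothetical $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ and checks that $\hat{\boldsymbol{\Sigma}}$ coincides with the sample covariance matrix computed with divisor $T$.

```python
import numpy as np

rng = np.random.default_rng(7)
T = 5_000
mu_true = np.array([1.0, -2.0])                      # hypothetical mean vector
Sigma_true = np.array([[2.0, 0.5], [0.5, 1.0]])      # hypothetical covariance matrix
x = rng.multivariate_normal(mu_true, Sigma_true, size=T)   # T x n data matrix

mu_hat = x.mean(axis=0)                              # MLE of mu: the sample mean vector
dev = x - mu_hat
Sigma_hat = dev.T @ dev / T                          # MLE of Sigma: divides by T, not T - 1

print(mu_hat)
print(Sigma_hat)
print(np.allclose(Sigma_hat, np.cov(x, rowvar=False, ddof=0)))   # True
```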
3.3.3 Numerical Method

It is an erroneous conclusion that deriving the MLE is always a matter of simple differentiation. Let us consider an example where the derivation is not so straightforward.

Example:
Let $\mathbf{z} = (\mathbf{z}_1',\mathbf{z}_2',\ldots,\mathbf{z}_n')'$, where $\mathbf{z}_i = (X_i, Y_i)$, be a random sample from
\[
f(x,y;\rho) = \frac{(1-\rho^2)^{-1/2}}{2\pi}\exp\left\{-\frac{1}{2(1-\rho^2)}(x^2 - 2\rho xy + y^2)\right\}.
\]
Then the log-likelihood function is
\[
\log L(\rho;\mathbf{X},\mathbf{Y}) = -n\log 2\pi - \frac{n}{2}\log(1-\rho^2) - \frac{1}{2(1-\rho^2)}\sum_{i=1}^{n}(X_i^2 - 2\rho X_i Y_i + Y_i^2),
\]
and the first-order condition is
\[
\frac{d\log L}{d\rho} = \frac{n\rho}{1-\rho^2} + \frac{(1+\rho^2)\sum_{i=1}^{n}X_iY_i - \rho\sum_{i=1}^{n}(X_i^2 + Y_i^2)}{(1-\rho^2)^2} = 0,
\]
which, multiplying through by $(1-\rho^2)^2$, becomes
\[
n\hat{\rho}(1-\hat{\rho}^2) + (1+\hat{\rho}^2)\sum_{i=1}^{n}X_iY_i - \hat{\rho}\left(\sum_{i=1}^{n}X_i^2 + \sum_{i=1}^{n}Y_i^2\right) = 0.
\]
This is a cubic equation in $\rho$, so there are three possible values for the MLE, and an additional search is needed to locate the maximum using numerical methods.

A short tour of numerical methods:
The template for most gradient methods in common use is Newton's method. The basis for Newton's method is a linear Taylor series approximation. Expanding the first-order conditions $\partial F(\theta)/\partial\theta = 0$ (which may be nonlinear in $\theta$) in a linear Taylor series around an arbitrary $\theta^0$ yields
\[
\frac{\partial F(\theta)}{\partial \theta} \approx \mathbf{g}^0 + \mathbf{H}^0(\theta - \theta^0) = 0,
\]
where the superscript indicates that the term is evaluated at $\theta^0$, and $\mathbf{g}$ and $\mathbf{H}$ are the gradient vector and Hessian matrix, respectively. Solving for $\theta$ and then equating $\theta$ to $\theta_{t+1}$ and $\theta^0$ to $\theta_t$, we obtain the iteration
\[
\theta_{t+1} = \theta_t - \mathbf{H}_t^{-1}\mathbf{g}_t.
\]
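A minimal sketch of the Newton iteration applied to the correlation example above (the sample size, true $\rho$ and starting value are arbitrary; the gradient and Hessian are approximated here by finite differences rather than taken from the analytic cubic):

```python
import numpy as np

rng = np.random.default_rng(8)
n, rho_true = 500, 0.6                         # hypothetical sample size and true correlation
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho_true], [rho_true, 1.0]], size=n)
x, y = z[:, 0], z[:, 1]

def loglik(rho):
    # log L(rho; X, Y) for the bivariate normal model above
    q = np.sum(x**2 - 2.0 * rho * x * y + y**2)
    return -n * np.log(2 * np.pi) - 0.5 * n * np.log(1 - rho**2) - q / (2 * (1 - rho**2))

rho, h = 0.0, 1e-5                             # arbitrary starting value; finite-difference step
for _ in range(50):
    g = (loglik(rho + h) - loglik(rho - h)) / (2 * h)                 # numerical gradient
    H = (loglik(rho + h) - 2 * loglik(rho) + loglik(rho - h)) / h**2  # numerical Hessian
    step = g / H
    rho = np.clip(rho - step, -0.99, 0.99)     # Newton update, kept inside (-1, 1)
    if abs(step) < 1e-8:
        break

print(rho)    # close to rho_true = 0.6
```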
3.3.4 Finite Sample Properties of the MLE

(a) Invariance. Let $\hat{\theta}$ be an MLE of $\theta$. If $g(\cdot): \Theta \rightarrow \mathbb{R}$ is a Borel function of $\theta$, then an MLE of $g(\theta)$ exists and is given by $g(\hat{\theta})$. This means that if the MLE of $\theta$ is available, then for functions such as $\theta^k$, $e^\theta$, $\log\theta$, the MLE is derived by substituting $\hat{\theta}$ in place of $\theta$, i.e. $\hat{\theta}^k$, $e^{\hat{\theta}}$ and $\log\hat{\theta}$ are MLEs of these functions. However, as in the case of $\hat{\sigma}^2$, MLEs are not in general unbiased estimators. In one particular case, when unbiasedness is accompanied by full efficiency, the two coincide.

(b) Unbiasedness and full efficiency. In the case where $\Phi$ satisfies the regularity conditions and $\hat{\theta}$ is an unbiased estimator of $\theta$ whose variance achieves the Cramér-Rao lower bound, the likelihood equation has a unique solution equal to $\hat{\theta}$. This suggests that any unbiased, fully efficient estimator $\hat{\theta}$ can be derived as a solution of the likelihood equation, as in the case of $\hat{\mu}$.

(c) Sufficiency. The property most emphasized by Fisher in support of the MLE was sufficiency. If $\tau(\mathbf{x})$ is a sufficient statistic for $\theta$ and a unique MLE $\hat{\theta}$ of $\theta$ exists, then $\hat{\theta}$ is a function of $\tau(\mathbf{x})$.

3.3.5 Asymptotic Properties of the MLE (i.i.d. case, $\theta$ a scalar)

Although MLEs enjoy several optimal finite sample properties, as seen above, their asymptotic properties provide the main justification for the almost universal appeal of the method of maximum likelihood. As argued below, under certain regularity conditions the MLE can be shown to be consistent, asymptotically normal and asymptotically efficient.

Let us begin the discussion of the asymptotic properties enjoyed by the MLE by considering the simplest possible case, where the statistical model is as follows:
(a) probability model: $\Phi = \{f(x;\theta),\ \theta \in \Theta\}$, $\theta$ being a scalar;
(b) sampling model: $\mathbf{x} \equiv (X_1, X_2, \ldots, X_n)'$ is a random sample from $f(x;\theta)$.

For our purposes it suffices to restate the regularity conditions on $\Phi$:
(RC 1) The set $A = \{x: f(x;\theta) > 0\}$ does not depend on $\theta$.
(RC 2) For each $\theta \in \Theta$ the derivatives $\partial^i \log f(x;\theta)/\partial\theta^i$, $i = 1,2,3$, exist for all $x \in \mathcal{X}$.
(RC 3) $0 < E[(\partial/\partial\theta)\log f(x;\theta)]^2 < \infty$ for all $\theta \in \Theta$.
(RC 4) For every $\theta \in \Theta$,
\[
\left|\frac{\partial^i \log f(x;\theta)}{\partial \theta^i}\right| \le h_i(x), \quad i = 1,2,3,
\]
where the functions $h_1(x)$ and $h_2(x)$ are integrable over $(-\infty,\infty)$, i.e.
\[
\int_{-\infty}^{\infty} h_i(x)\, dx < \infty, \quad i = 1,2,
\]
and
\[
\int_{-\infty}^{\infty} h_3(x) f(x;\theta)\, dx < k,
\]
where $k$ does not depend on $\theta$.

Example:
Let $\mathbf{x} \equiv (X_1, X_2, \ldots, X_n)'$ be a random sample from $f(x;\theta) = 1/\theta$, $0 \le x \le \theta$. The range of the random variables $(X_1, X_2, \ldots, X_n)$ depends on the parameter $\theta$, so this density violates regularity condition (RC 1) above and is excluded from the properties below.

Example:
Let $\mathbf{x} \equiv (X_1, X_2, \ldots, X_n)'$ be a random sample from $N(0,\sigma^2)$. Then
\[
\frac{\partial^3 \log f(\mathbf{x};\sigma^2)}{\partial (\sigma^2)^3} = -\frac{n}{\sigma^6} + \frac{3\sum X_i^2}{\sigma^8} \rightarrow \infty \quad \text{as } \sigma^2 \rightarrow 0,
\]
that is, the third derivative is not bounded on the open interval $0 < \sigma^2 < \infty$, so regularity condition (RC 4) is not satisfied.

Under the regularity conditions we can prove that the likelihood equation $\partial \log L(\theta;\mathbf{x})/\partial\theta = 0$ admits a sequence of solutions $\{\hat{\theta}_n,\ n \ge 1\}$ such that:
(a) consistency: $\hat{\theta}_n \xrightarrow{a.s.} \theta_0$ (strong consistency), which implies $\hat{\theta}_n \xrightarrow{p} \theta_0$ (weak consistency);
(b) asymptotic normality: $\sqrt{n}(\hat{\theta}_n - \theta_0) \sim N(0, I(\theta)^{-1})$;
(c) asymptotic efficiency:
\[
I(\theta) = \lim_{n \rightarrow \infty}\left[\frac{1}{n}I_n(\theta)\right],
\]
i.e. the asymptotic variance of $\hat{\theta}_n$ equals the limit of the Cramér-Rao lower bound for consistent estimators (that is, $\mathrm{Var}(\hat{\theta}_n) = \frac{1}{n}I(\theta)^{-1} = I_n(\theta)^{-1}$).

Proof:
Taking a Taylor expansion of $\partial \log L(\hat{\theta}_n;\mathbf{x})/\partial\theta$ around $\hat{\theta} = \theta_0$ we get
\[
\frac{\partial \log L(\hat{\theta}_n)}{\partial \theta} = \frac{\partial \log L(\theta_0)}{\partial \theta} + \frac{\partial^2 \log L(\theta_0)}{\partial \theta^2}(\hat{\theta}_n - \theta_0) + o_p(n) = 0,
\]
where $o_p(n)$ collects the remainder terms. (Here $o_p(n)$ arises as $\sum^n o_p(1)$; the $o_p(1)$ terms come from the fact that $f(x;\theta)$ is bounded, i.e. $f(x;\theta)$ is $O_p(1)$ (see White, p. 28), hence so is $\partial \log f(x;\theta)/\partial\theta$, and its Taylor expansion remainder is $o_p(1)$ as $\hat{\theta} \rightarrow \theta_0$.) This expansion is based on RC 2 and RC 4. In view of the fact that $\log L(\theta) = \sum_{i=1}^{n}\log f(x_i;\theta)$ and the $f(x_i;\theta)$, $i = 1,2,\ldots,n$, are i.i.d., we can express the above Taylor expansion in the form
\[
\left(-\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log f(x_i;\theta_0)}{\partial \theta^2}\right)(\hat{\theta}_n - \theta_0) = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log f(x_i;\theta_0)}{\partial \theta} + o_p(1).
\]
Using the strong law of large numbers for i.i.d. r.v.'s,
\[
A_n = \frac{1}{n}\sum_{i=1}^{n}\frac{\partial \log f(x_i;\theta_0)}{\partial \theta} \xrightarrow{a.s.} E\left(\frac{\partial \log f(x;\theta_0)}{\partial \theta}\right) = 0
\]
and
\[
B_n = \left(-\frac{1}{n}\sum_{i=1}^{n}\frac{\partial^2 \log f(x_i;\theta_0)}{\partial \theta^2}\right) \xrightarrow{a.s.} I_1(\theta) = E\left(-\frac{\partial^2 \log f(x;\theta_0)}{\partial \theta^2}\right),
\]
where $\sum_{i=1}^{n} I_i(\theta) = I_n(\theta)$, or $I_n(\theta) = nI_1(\theta)$ (see Arnold, p. 271). This in turn implies that $(\hat{\theta}_n - \theta_0) \xrightarrow{a.s.} 0$, i.e. $\hat{\theta}_n \xrightarrow{a.s.} \theta_0$.

To show normality we multiply all terms of the Taylor expansion by $\sqrt{n}$ to get
\[
B_n\sqrt{n}(\hat{\theta}_n - \theta_0) = \frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\partial \log f(x_i;\theta_0)}{\partial \theta} + o_p(\sqrt{n}).
\]
Using the central limit theorem for i.i.d. r.v.'s we can show that
\[
\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\frac{\partial \log f(x_i;\theta_0)}{\partial \theta} \sim N(0, I_1(\theta))
\]
(see the proof of the Cramér-Rao bound). Given that $B_n \xrightarrow{a.s.} I_1(\theta)$, we can deduce that
\[
\sqrt{n}(\hat{\theta}_n - \theta_0) \sim N(0, I_1(\theta)^{-1}) \equiv N(0, nI_n^{-1}(\theta)),
\]
or equivalently $(\hat{\theta}_n - \theta_0) \sim N(0, I_n(\theta)^{-1})$, or $\sqrt{n}(\hat{\theta}_n - \theta_0) \sim N(0, I(\theta)^{-1})$.
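The asymptotic normality result can be visualised by simulation. The sketch below uses the exponential model $f(x;\theta) = \theta e^{-\theta x}$ (an illustrative choice made here, not a model used in the notes), for which the MLE is $\hat{\theta}_n = 1/\bar{X}_n$ and $I(\theta) = 1/\theta^2$, so $\sqrt{n}(\hat{\theta}_n - \theta_0)$ should behave approximately like $N(0,\theta_0^2)$.

```python
import numpy as np

rng = np.random.default_rng(9)
theta0, n, reps = 2.0, 200, 50_000                   # hypothetical settings

x = rng.exponential(scale=1.0 / theta0, size=(reps, n))   # draws from f(x; theta) = theta * exp(-theta x)
theta_hat = 1.0 / x.mean(axis=1)                           # MLE of theta is 1 / xbar
z = np.sqrt(n) * (theta_hat - theta0)

# The limiting distribution is N(0, I(theta0)^{-1}) with I(theta0) = 1/theta0^2,
# so the limiting variance is theta0^2 = 4.
print("mean ~", z.mean(), "  variance ~", z.var(), "  theoretical variance =", theta0**2)
# The simulated mean is near 0 and the variance near 4, up to finite-sample and Monte Carlo error.
```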
3.3.6 Estimating the asymptotic variance of the MLE (multivariate parameter, $\theta$ is $k \times 1$)

(a) If the form of the expected value of the second derivative of the log-likelihood is known, we can evaluate the information matrix at $\hat{\theta}$ to estimate the covariance matrix of the MLE:
\[
[I_n(\hat{\theta})]^{-1} = \left\{-E\left[\frac{\partial^2 \ln L(\theta)}{\partial \theta\, \partial \theta'}\right]_{\theta=\hat{\theta}}\right\}^{-1}.
\]
(b) If the expected value of the second derivative of the log-likelihood is complicated, two alternative estimators are available. The first uses the observed Hessian,
\[
[\hat{I}_n(\hat{\theta})]^{-1} = \left\{-\left[\frac{\partial^2 \ln L(\hat{\theta})}{\partial \hat{\theta}\, \partial \hat{\theta}'}\right]\right\}^{-1},
\]
and (c) the second is the BHHH estimator,
\[
[\tilde{I}_n(\hat{\theta})]^{-1} = \left[\sum_{i=1}^{n}\hat{g}_i\hat{g}_i'\right]^{-1}, \qquad \text{where } \hat{g}_i = \left.\frac{\partial \ln f(x_i;\theta)}{\partial \theta}\right|_{\theta=\hat{\theta}} \text{ is } k \times 1.
\]
The reason for using the BHHH estimator is that
\[
\ln L = \sum_{i=1}^{n}\ln f(x_i;\theta), \qquad \text{so} \qquad \frac{\partial \ln L}{\partial \theta} = g = \sum_{i=1}^{n}g_i, \qquad g_i = \frac{\partial \ln f(x_i;\theta)}{\partial \theta} \ (k \times 1),
\]
and
\[
-E\left[\frac{\partial^2 \ln L}{\partial \theta\, \partial \theta'}\right] = E\left[\left(\frac{\partial \ln L}{\partial \theta}\right)\left(\frac{\partial \ln L}{\partial \theta}\right)'\right] = E\left[\sum_{i=1}^{n}\sum_{j=1}^{n}g_ig_j'\right];
\]
dropping the terms with unequal subscripts (since the $\ln f(x_i;\theta)$, $i = 1,2,\ldots,n$, come from a random sample, the cross terms have zero expectation), we obtain
\[
-E\left[\frac{\partial^2 \ln L}{\partial \theta\, \partial \theta'}\right] = E\left[\sum_{i=1}^{n}g_ig_i'\right].
\]

Example:
Reproduce the results from Gauss on p. 132, Example 4.22, in Greene (4th ed.), or p. 482, Example 17.4, in Greene (5th ed.), to learn the numerical method and the three asymptotic variance estimators.
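The Greene example is left to the reader. As a stand-in, the sketch below computes the three variance estimators, (a) analytic information, (b) observed Hessian and (c) BHHH, for the density $f(x;\theta) = \theta(1-\theta)^x$ used in an earlier example, for which $I_n(\theta) = n/(\theta^2(1-\theta))$ (a derivation made here, not in the original notes). For this model (a) and (b) coincide at the MLE, while the BHHH estimate differs slightly.

```python
import numpy as np

rng = np.random.default_rng(10)
theta0, n = 0.3, 2_000                       # hypothetical true value and sample size
x = rng.geometric(theta0, size=n) - 1        # numpy counts trials, so shift to support {0, 1, 2, ...}

theta_hat = 1.0 / (1.0 + x.mean())           # MLE from the earlier example

# (a) analytic information: I_n(theta) = n / (theta^2 (1 - theta))
var_a = 1.0 / (n / (theta_hat**2 * (1.0 - theta_hat)))

# (b) observed (negative) Hessian of the log-likelihood at theta_hat
H = -(n / theta_hat**2 + x.sum() / (1.0 - theta_hat)**2)
var_b = -1.0 / H

# (c) BHHH: inverse of the sum of squared per-observation scores at theta_hat
g_i = 1.0 / theta_hat - x / (1.0 - theta_hat)
var_c = 1.0 / (g_i**2).sum()

print(var_a, var_b, var_c)    # (a) and (b) are identical at the MLE here; (c) is close
```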