Ch. 2 Probability Theory

1 Descriptive Study of Data

1.1 Histograms and Their Numerical Characteristics

By the descriptive study of data we refer to the summarization and exposition (tabulation, grouping, graphical representation) of observed data, as well as the derivation of numerical characteristics such as measures of location, dispersion and shape. Although the descriptive study of data is an important facet of modelling with real data in its own right, in the present study it is mainly used to motivate the need for probability theory and statistical inference proper.

In order to make the discussion more specific, let us consider the after-tax personal income data of 23,000 households for 1999-2000 in the US. These data in raw form constitute 23,000 numbers between $5,000 and $100,000. This presents us with a formidable task in attempting to understand how income is distributed among the 23,000 households represented in the data. The purpose of descriptive statistics is to help us make some sense of such data.

A natural way to proceed is to summarize the data by allocating the numbers into classes (intervals). The number of intervals is chosen a priori and depends on the degree of summarization needed. Then we have the "Table of personal income in the US". The first column of the table shows the income intervals, the second column shows the number of incomes falling into each interval, and the third column the relative frequency for each interval. The relative frequency is calculated by dividing the number of observations in each interval by the total number of observations. The fourth column is the cumulative frequency. Summarizing the data in this table enables us to get some idea of how income is distributed among the various classes. If we plot the relative (cumulative) frequencies in a bar graph we get what is known as the (cumulative) histogram.

For further information on the distribution of income we could calculate various numerical characteristics describing the histogram's location, dispersion and shape. Such measures can be calculated directly in terms of the raw data. However, in the present case it is more convenient for expositional purposes to use the grouped data. The main reason for this is to introduce various concepts which will be reinterpreted in the context of probability.

The mean, as a measure of location, takes the form
\[
\bar{z} = \sum_{i=1}^{n} \phi_i z_i,
\]
where $\phi_i$ and $z_i$ refer to the relative frequency and the midpoint of interval $i$. The mode, as a measure of location, refers to the value of income that occurs most frequently in the data set. Another measure of location is the median, referring to the value of income in the middle when the incomes are arranged in ascending order. The easiest way to calculate the median is to plot the cumulative frequency graph.

Another important feature of the histogram is the dispersion of the relative frequencies around a measure of central tendency. The most frequently used measure of dispersion is the variance, defined by
\[
v^2 = \sum_{i=1}^{n} (z_i - \bar{z})^2 \phi_i,
\]
which is a measure of dispersion around the mean; $v$ is known as the standard deviation. We can extend the concept of the variance to
\[
m_k = \sum_{i=1}^{n} (z_i - \bar{z})^k \phi_i, \quad k = 3, 4, \ldots,
\]
defining what are known as higher central moments. These higher moments can be used to get a better idea of the shape of the histogram. For example, the standardized forms of the third and fourth moments, defined by
\[
SK = \frac{m_3}{v^3} \quad \text{and} \quad K = \frac{m_4}{v^4},
\]
known as the skewness and kurtosis coefficients, measure the asymmetry and the peakedness of the histogram, respectively. In the case of a symmetric histogram $SK = 0$, and the more peaked the histogram, the greater the value of $K$.
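These grouped-data formulas are easy to compute directly. The sketch below is a minimal Python illustration; the income intervals and relative frequencies are invented for the purpose of the example (the actual 23,000 observations are not reproduced in the text).

```python
# Sketch: numerical characteristics of a histogram from grouped data.
# The midpoints and relative frequencies below are hypothetical, purely
# for illustration; they are not the income data described in the text.

midpoints = [10_000, 25_000, 40_000, 55_000, 70_000, 90_000]   # z_i
rel_freq  = [0.30,   0.28,   0.20,   0.12,   0.07,   0.03]     # phi_i

mean = sum(p * z for p, z in zip(rel_freq, midpoints))                 # z-bar
var  = sum(p * (z - mean) ** 2 for p, z in zip(rel_freq, midpoints))   # v^2
std  = var ** 0.5                                                      # v

def central_moment(k):
    """k-th central moment m_k of the grouped data."""
    return sum(p * (z - mean) ** k for p, z in zip(rel_freq, midpoints))

skewness = central_moment(3) / std ** 3    # SK = m_3 / v^3
kurtosis = central_moment(4) / std ** 4    # K  = m_4 / v^4

print(f"mean={mean:.0f}  variance={var:.0f}  SK={skewness:.2f}  K={kurtosis:.2f}")
```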
1.2 Looking Ahead

The most important drawback of descriptive statistics is that the study of the observed data enables us to draw conclusions which relate only to the data in hand. The temptation in analyzing the above income data is to attempt to make generalizations beyond the data in hand, in particular about the distribution of income in the US as a whole (not just the 23,000 households in the sample). This, however, is not possible within the descriptive statistics framework. In order to be able to generalize beyond the data in hand we need to "model" the distribution of income in the US, and not just describe the observed data in hand. Such a general model is provided by probability theory, to be considered in Section 2.

It turns out that the model provided by probability theory owes a lot to the earlier developed descriptive statistics. In particular, most of the concepts which form the basis of probability theory were motivated by the descriptive statistics concepts considered above. The concepts of measures of location, dispersion and shape, as well as the frequency curve, were transplanted into probability theory with renewed interpretations. The frequency curve, when reinterpreted, becomes a density function purporting to model observable real world phenomena. As for the various measures, they will be reinterpreted in terms of the density function.

2 Probability

Why do we need probability theory in analyzing observed data? In the descriptive study of data considered in the last section, it was emphasized that the results cannot be generalized outside the observed data under consideration. Any question relating to the population from which the observed data were drawn cannot be answered within the descriptive statistics framework. In order to be able to do that we need the theoretical framework offered by probability theory. In effect, probability theory develops a mathematical model which provides the logical foundation of statistical inference procedures for analyzing observed data.

In developing a mathematical model we must first identify the important features, relations and entities in the real world phenomena and then devise the concepts and choose the assumptions with which to project a generalized description of these phenomena; an idealized picture of these phenomena. The model, as a consistent mathematical system, has "a life of its own" and can be analyzed and studied without direct reference to real world phenomena. (Think of analyzing the population: we do not have to refer to the information in the sample.)

By the 1920s there was a wealth of results and probability began to grow into a systematic body of knowledge. Although various people attempted a systematization of probability, it was the work of the Russian mathematician Kolmogorov which proved to be the cornerstone of a systematic approach to probability theory. Kolmogorov managed to relate the concept of probability to that of a measure in integration theory, and exploited to the full the analogies between set theory and the theory of functions on the one hand and the concept of a random variable on the other.
In a monumental monograph published in 1933 he proposed an axiomatization of probability theory, establishing it once and for all as a part of mathematics proper. There is no doubt that this monograph proved to be the watershed for the later development of probability theory, which grew enormously in importance and applicability.

2.1 The Axiomatic Approach

The axiomatic approach to probability proceeds from a set of axioms (accepted without question as obvious), which are based on many centuries of human experience, and the subsequent development is built deductively using formal logical arguments, like any other part of mathematics such as geometry or linear algebra. In mathematics an axiomatic system is required to be complete, non-redundant and consistent. By complete we mean that the set of axioms postulated should enable us to prove every other theorem in the theory in question using the axioms and mathematical logic. The notion of non-redundancy refers to the impossibility of deriving any axiom of the system from the other axioms. Consistency refers to the non-contradictory nature of the axioms.

A probability model is by construction intended to be a description of a chance mechanism giving rise to observed data. The starting point of such a model is provided by the concept of a random experiment, describing a simplistic and idealized process giving rise to the observed data.

Definition 1:
A random experiment, denoted by $\mathcal{E}$, is an experiment which satisfies the following conditions:
(a) all possible distinct outcomes are known a priori;
(b) in any particular trial the outcome is not known a priori; and
(c) it can be repeated under identical conditions.

The axiomatic approach to probability theory can be viewed as a formalization of the concept of a random experiment. In an attempt to formalize condition (a), that all possible distinct outcomes are known a priori, Kolmogorov devised the set $S$ which includes "all possible distinct outcomes" and has to be postulated before the experiment is performed.

Definition 2:
The sample space, denoted by $S$, is defined to be the set of all possible outcomes of the experiment $\mathcal{E}$. The elements of $S$ are called elementary events.

Example:
Consider the random experiment $\mathcal{E}$ of tossing a fair coin twice and observing the faces turning up. The sample space of $\mathcal{E}$ is
\[
S = \{(HT), (TH), (HH), (TT)\},
\]
with $(HT), (TH), (HH), (TT)$ being the elementary events belonging to $S$.

The second ingredient of $\mathcal{E}$ relates to (b), and in particular to the various forms events can take. A moment's reflection suggests that there is no particular reason why we should be interested in elementary outcomes only. We might be interested in such events as $A_1$: "at least one $H$", and $A_2$: "at most one $H$", and these are not elementary events; in particular
\[
A_1 = \{(HT), (TH), (HH)\} \quad \text{and} \quad A_2 = \{(HT), (TH), (TT)\}
\]
are combinations of elementary events. All such outcomes are called events associated with the same sample space $S$, and they are defined by combining elementary events. Understanding the concept of an event is crucial for the discussion which follows. Intuitively, an event is any proposition associated with $\mathcal{E}$ which may or may not occur at each trial. We say that event $A_1$ occurs when any one of the elementary events it comprises occurs. Thus, when a trial is made only one elementary event is observed, but a large number of events may have occurred.
For example, if the elementary event $(HT)$ occurs in a particular trial, then $A_1$ and $A_2$ have occurred as well.

Given that $S$ is a set whose members are the elementary events, this takes us immediately into the realm of set theory, and events can be formally defined to be subsets of $S$ formed by the set-theoretic operations ("$\cap$" intersection, "$\cup$" union, "$\bar{\ }$" complementation) on the elementary events. For example,
\[
A_1 = \{(HT)\} \cup \{(TH)\} \cup \{(HH)\} = \overline{\{(TT)\}} \subset S, \qquad
A_2 = \{(HT)\} \cup \{(TH)\} \cup \{(TT)\} = \overline{\{(HH)\}} \subset S.
\]
Two special events are $S$ itself, called the sure event, and the impossible event $\emptyset$, defined to contain no elements of $S$, i.e. $\emptyset = \{\ \}$; the latter is defined for completeness.

A third ingredient of $\mathcal{E}$ associated with (b), which Kolmogorov had to formalize, was the idea of uncertainty related to the outcome of any particular trial of $\mathcal{E}$. This he formalized in the notion of probabilities attributed to the various events associated with $\mathcal{E}$, such as $P(A_1)$, $P(A_2)$, expressing the "likelihood" of occurrence of these events. Although attributing probabilities to the elementary events presents no particular mathematical problem, doing the same for events in general is not as straightforward. The difficulty arises because if $A_1$ and $A_2$ are events, then $\bar{A}_1 = S - A_1$, $\bar{A}_2 = S - A_2$, $A_1 \cap A_2$, $A_1 \cup A_2$, etc., are also events, since the occurrence or non-occurrence of $A_1$ and $A_2$ implies the occurrence or non-occurrence of these events. This implies that for the attribution of probabilities to make sense we have to impose some mathematical structure on the set of all events, say $\mathcal{F}$, which reflects the fact that whichever way we combine these events, the end result is always an event. The temptation at this stage is to define $\mathcal{F}$ to be the set of all subsets of $S$, called the power set; surely, this covers all possibilities! In the above example, the power set of $S$ takes the form
\[
\mathcal{F} = \{S, \emptyset, \{(HT)\}, \{(TH)\}, \{(HH)\}, \{(TT)\}, \{(HT),(TH)\}, \{(HT),(HH)\}, \{(HT),(TT)\},
\]
\[
\{(TH),(HH)\}, \{(TH),(TT)\}, \{(HH),(TT)\}, \{(HT),(TH),(HH)\}, \{(HT),(TH),(TT)\},
\]
\[
\{(TH),(HH),(TT)\}, \{(HT),(HH),(TT)\}\}.
\]
Sometimes we are not interested in all the subsets of $S$, and we need to define the set of events independently of the power set by endowing it with a mathematical structure which ensures that no inconsistencies arise. This is achieved by requiring that $\mathcal{F}$ have a special mathematical structure: it is a $\sigma$-field related to $S$.

Definition 3:
Let $\mathcal{F}$ be a set of subsets of $S$. $\mathcal{F}$ is called a $\sigma$-field if:
(a) if $A \in \mathcal{F}$, then $\bar{A} \in \mathcal{F}$ (closure under complementation);
(b) if $A_i \in \mathcal{F}$, $i = 1, 2, \ldots$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$ (closure under countable union).

Note that (a) and (b) taken together imply the following:
(c) $S \in \mathcal{F}$, because $A \cup \bar{A} = S$;
(d) $\emptyset \in \mathcal{F}$ (from (c), $\bar{S} = \emptyset \in \mathcal{F}$); and
(e) if $A_i \in \mathcal{F}$, $i = 1, 2, \ldots$, then $\bigcap_{i=1}^{\infty} A_i \in \mathcal{F}$.

These suggest that a $\sigma$-field is a set of subsets of $S$ which is closed under complementation, countable unions and intersections. That is, any of these operations on the elements of $\mathcal{F}$ will give rise to an element of $\mathcal{F}$.

Example:
If we are only interested in events containing one each of $H$ and $T$, there is no point in defining the $\sigma$-field to be the power set; the collection $\mathcal{F}_c$ will do just as well, with fewer events to attribute probabilities to:
\[
\mathcal{F}_c = \{\{(HT),(TH)\}, \{(HH),(TT)\}, S, \emptyset\}.
\]

Exercise:
Check whether the set $\mathcal{F}_1 = \{\{(HT)\}, \{(TH),(HH),(TT)\}, S, \emptyset\}$ is a $\sigma$-field or not (a brute-force check is sketched below).
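Since $S$ is finite here, the closure conditions of Definition 3 can be checked mechanically. The following sketch, with the two-coin sample space hard-coded, verifies them for $\mathcal{F}_c$ and can be applied to the $\mathcal{F}_1$ of the Exercise; the representation of events as frozensets is just one convenient choice.

```python
# Sketch: brute-force check of the sigma-field conditions of Definition 3
# for the two-coin sample space. For a finite S, closure under countable
# unions reduces to closure under finite (pairwise) unions.
from itertools import combinations

S = frozenset({"HT", "TH", "HH", "TT"})

def is_sigma_field(F, S):
    F = {frozenset(A) for A in F}
    if S not in F or frozenset() not in F:
        return False
    # closure under complementation
    if any(S - A not in F for A in F):
        return False
    # closure under unions of pairs, which suffices for a finite collection
    return all(A | B in F for A, B in combinations(F, 2))

Fc = [set(), {"HT", "TH"}, {"HH", "TT"}, set(S)]
F1 = [set(), {"HT"}, {"TH", "HH", "TT"}, set(S)]
print(is_sigma_field(Fc, S))   # True
print(is_sigma_field(F1, S))   # check for the Exercise's F_1
```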
Let us now turn our attention to the various collections of events ($\sigma$-fields) that are relevant for econometrics.

Definition 4:
The Borel $\sigma$-field $\mathcal{B}$ is the smallest collection of sets (called the Borel sets) that includes:
(a) all open sets of $\mathbb{R}$;
(b) the complement $\bar{B}$ of any $B$ in $\mathcal{B}$;
(c) the union $\bigcup_{n=1}^{\infty} B_n$ of any sequence $\{B_n\}$ of sets in $\mathcal{B}$.

The Borel sets of $\mathbb{R}$ just defined are said to be generated by the open sets of $\mathbb{R}$. The same Borel sets would be generated by all the open half-lines of $\mathbb{R}$, all the closed half-lines of $\mathbb{R}$, all the open intervals of $\mathbb{R}$, or all the closed intervals of $\mathbb{R}$. The Borel sets are a "rich" collection of events for which probabilities can be defined. To see how the Borel field contains almost every conceivable subset of $\mathbb{R}$ when generated from the closed half-lines, consider the following example.

Example:
Let $S$ be the real line $\mathbb{R} = \{x : -\infty < x < \infty\}$ and let the set of events of interest be $J = \{B_x : x \in \mathbb{R}\}$, where $B_x = \{z : z \le x\} = (-\infty, x]$. How can we construct a $\sigma$-field $\sigma(J)$ on $\mathbb{R}$ from the events $B_x$? By definition $B_x \in \sigma(J)$; then:
(1) taking complements of $B_x$: $\bar{B}_x = \{z : z \in \mathbb{R}, z > x\} = (x, \infty) \in \sigma(J)$;
(2) taking countable unions of $B_x$: $\bigcup_{n=1}^{\infty} (-\infty, x - \tfrac{1}{n}] = (-\infty, x) \in \sigma(J)$;
(3) taking complements of (2): $\overline{(-\infty, x)} = [x, \infty) \in \sigma(J)$;
(4) from (3), for $y > x$, $[y, \infty) \in \sigma(J)$;
(5) from (4), $\overline{(-\infty, x] \cup [y, \infty)} = (x, y) \in \sigma(J)$;
(6) $\bigcap_{n=1}^{\infty} (x - \tfrac{1}{n}, x] = \{x\} \in \sigma(J)$.
This shows not only that $\sigma(J)$ is a $\sigma$-field, but that it includes almost every conceivable subset of $\mathbb{R}$; that is, it coincides with the $\sigma$-field generated by any of the above sets of subsets of $\mathbb{R}$, which we denote by $\mathcal{B}$, i.e. $\sigma(J) = \mathcal{B}$, the Borel field on $\mathbb{R}$.

Having solved the technical problem of attributing probabilities to events by postulating the existence of a $\sigma$-field $\mathcal{F}$ associated with the sample space $S$, Kolmogorov went on to formalize the concept of probability itself.

Definition 5:
A mapping $P : \mathcal{F} \to [0, 1]$ is a probability measure on $\{S, \mathcal{F}\}$ provided that:
(a) $P(\emptyset) = 0$;
(b) for any $A \in \mathcal{F}$, $P(\bar{A}) = 1 - P(A)$;
(c) for any disjoint sequence $\{A_i\}$ of sets in $\mathcal{F}$ (i.e. $A_i \cap A_j = \emptyset$ for all $i \ne j$), $P(\bigcup_{i=1}^{\infty} A_i) = \sum_{i=1}^{\infty} P(A_i)$.

Example:
Since $\{(HT)\} \cap \{(HH)\} = \emptyset$,
\[
P(\{(HT)\} \cup \{(HH)\}) = P(\{(HT)\}) + P(\{(HH)\}) = \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2}.
\]

To summarize the argument so far: Kolmogorov formalized conditions (a) and (b) of the random experiment $\mathcal{E}$ in the form of the trinity $(S, \mathcal{F}, P(\cdot))$ comprising the set of all outcomes $S$ (the sample space), a $\sigma$-field $\mathcal{F}$ of events related to $S$, and a probability function $P(\cdot)$ assigning probabilities to events in $\mathcal{F}$. For the coin example, if we choose as the $\sigma$-field of interest the one generated by the event "the first is $H$ and the second is $T$", namely $\mathcal{F} = \{\{(HT)\}, \{(TH),(HH),(TT)\}, \emptyset, S\}$, then $P(\cdot)$ is defined by
\[
P(S) = 1, \quad P(\emptyset) = 0, \quad P(\{(HT)\}) = \tfrac{1}{4}, \quad P(\{(TH),(HH),(TT)\}) = \tfrac{3}{4}.
\]
Because of its importance the trinity $(S, \mathcal{F}, P(\cdot))$ is given a name.

Definition 6:
A sample space $S$ endowed with a $\sigma$-field $\mathcal{F}$ and a probability measure $P(\cdot)$ is called a probability space; that is, we call the triple $(S, \mathcal{F}, P)$ a probability space.

As far as condition (c) of $\mathcal{E}$ is concerned, which is yet to be formalized, it will prove of paramount importance in the context of the limit theorems of Chapter 4.

2.2 Conditional Probability

So far we have considered probabilities of events on the assumption that no information is available relating to the outcome of a particular trial. Sometimes, however, additional information is available in the form of the known occurrence of some event $A$. For example, in the case of tossing a fair coin twice we might know that the first trial was heads. What difference does this information make to the original triple $(S, \mathcal{F}, P)$?
Firstly, knowing that the first trial was a head, the set of all possible outcomes now becomes
\[
S_A = \{(HT), (HH)\},
\]
since $(TH)$ and $(TT)$ are no longer possible. Secondly, the $\sigma$-field is taken to be
\[
\mathcal{F}_A = \{S_A, \emptyset, \{(HT)\}, \{(HH)\}\}.
\]
Thirdly, the probability set function becomes
\[
P_A(S_A) = 1, \quad P_A(\emptyset) = 0, \quad P_A(\{(HT)\}) = \tfrac{1}{2}, \quad P_A(\{(HH)\}) = \tfrac{1}{2}.
\]
Thus, knowing that the event $A$ (an $H$ in the first trial) has occurred transforms the original probability space $(S, \mathcal{F}, P)$ into the conditional probability space $(S_A, \mathcal{F}_A, P_A)$. The question that naturally arises is to what extent we can derive the above conditional probabilities without having to transform the original probability space. The following formula provides us with a way to calculate the conditional probability:
\[
P_A(A_1) = P(A_1 \mid A) = \frac{P(A_1 \cap A)}{P(A)}.
\]

Example:
Let $A_1 = \{(HT)\}$ and $A = \{(HT), (HH)\}$. Since $P(A_1) = \tfrac{1}{4}$, $P(A) = \tfrac{1}{2}$ and $P(A_1 \cap A) = P(\{(HT)\}) = \tfrac{1}{4}$,
\[
P_A(A_1) = P(A_1 \mid A) = \frac{1/4}{1/2} = \tfrac{1}{2},
\]
as above.

Using the above rule of conditional probability we can deduce that
\[
P(A_1 \cap A_2) = P(A_1 \mid A_2)\,P(A_2) = P(A_2 \mid A_1)\,P(A_1) \quad \text{for } A_1, A_2 \in \mathcal{F}.
\]
This is called the multiplication rule. Moreover, when knowing that $A_2$ has occurred does not change the original probability of $A_1$, i.e.
\[
P(A_1 \mid A_2) = P(A_1),
\]
we say that $A_1$ and $A_2$ are independent.

Independence is very different from mutual exclusiveness, in the sense that $A_1 \cap A_2 = \emptyset$ together with $P(A_1 \mid A_2) \ne P(A_1)$, and vice versa, can both arise. Independence is a probabilistic statement which ensures that the occurrence of one event does not influence the occurrence (or non-occurrence) of the other event. On the other hand, mutual exclusiveness is a statement which refers to the events (sets) themselves, not to the associated probabilities. Two events are said to be mutually exclusive when they cannot occur together.
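The conditional probabilities and the multiplication rule above can be reproduced directly from the probabilities of the elementary events. A minimal sketch follows; the events "first toss is H" and "second toss is H" are additional illustrative events, not taken from the text.

```python
# Sketch: conditional probability and independence in the two-coin experiment,
# computed directly from the probabilities of the elementary events.
P = {"HH": 0.25, "HT": 0.25, "TH": 0.25, "TT": 0.25}

def prob(event):                 # event: a set of elementary outcomes
    return sum(P[s] for s in event)

def cond(A1, A):                 # P(A1 | A) = P(A1 and A) / P(A)
    return prob(A1 & A) / prob(A)

A1 = {"HT"}                       # H then T
A  = {"HT", "HH"}                 # first toss is H
print(cond(A1, A))                # 0.5, as in the text

first_H  = {"HH", "HT"}
second_H = {"HH", "TH"}
# multiplication rule and independence of the two tosses:
print(prob(first_H & second_H), prob(first_H) * prob(second_H))  # both 0.25
```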
3 Random Variables and Probability Distributions

The model based on $(S, \mathcal{F}, P)$ does not provide us with a flexible enough framework. The basic idea underlying the construction of $(S, \mathcal{F}, P)$ was to set up a framework for studying probabilities of events as a prelude to analyzing problems involving uncertainty. One facet of $\mathcal{E}$ which can help us suggest a more flexible probability space is the fact that when the experiment is performed the outcome is often considered in relation to some quantifiable attribute, i.e. an attribute which can be expressed in numbers. It turns out that assigning numbers to qualitative outcomes makes possible a much more flexible formulation of probability theory. This suggests that if we could find a consistent way to assign numbers to outcomes, we might be able to change $(S, \mathcal{F}, P)$ into something more easily handled. The concept of a random variable is designed to do just that without changing the underlying probabilistic structure of $(S, \mathcal{F}, P)$.

3.1 The Concept of a Random Variable

Let us consider the possibility of defining a function $X(\cdot)$ which maps $S$ directly into the real line $\mathbb{R}$, that is,
\[
X(\cdot) : S \to \mathbb{R}_X,
\]
assigning a real number $x_1$ to each $s_1$ in $S$ by $x_1 = X(s_1)$, $x_1 \in \mathbb{R}$, $s_1 \in S$. The question arises as to whether every function from $S$ to $\mathbb{R}_X$ will provide us with a consistent way of attaching numbers to elementary events; consistent in the sense of preserving the event structure of the probability space $(S, \mathcal{F}, P)$. The answer, unsurprisingly, is no. This is because, although $X$ is a function defined on $S$, probabilities are assigned to events in $\mathcal{F}$, and the issue we have to face is how to define the values taken by $X$ for the different elements of $S$ in a way which preserves the event structure of $\mathcal{F}$.

What we require of $X^{-1}(\cdot)$, the inverse image of $X(\cdot)$, is to provide us with a correspondence between $\mathbb{R}_X$ and $S$ which reflects the event structure of $\mathcal{F}$, that is, one which preserves unions, intersections and complements. In other words, for each subset $N$ of $\mathbb{R}_X$, the inverse image $X^{-1}(N)$ must be an event in $\mathcal{F}$. This prompts us to define a random variable $X$ to be any function satisfying this event-preserving condition in relation to some $\sigma$-field defined on $\mathbb{R}_X$; for generality we always take the Borel field $\mathcal{B}$ on $\mathbb{R}$.

Definition 7:
A random variable $X$ is a real-valued function from $S$ to $\mathbb{R}$ which satisfies the condition that for each Borel set $B \in \mathcal{B}$ on $\mathbb{R}$, the set
\[
X^{-1}(B) = \{s : X(s) \in B, \ s \in S\}
\]
is an event in $\mathcal{F}$.

Example:
Define the function $X$: "the number of heads". Then $X(\{HH\}) = 2$, $X(\{TH\}) = 1$, $X(\{HT\}) = 1$ and $X(\{TT\}) = 0$. Further, $X^{-1}(2) = \{(HH)\}$, $X^{-1}(1) = \{(TH),(HT)\}$ and $X^{-1}(0) = \{(TT)\}$. In fact, it can be shown that the $\sigma$-field related to the random variable $X$ so defined is
\[
\mathcal{F}_X = \{S, \emptyset, \{(HH)\}, \{(TT)\}, \{(TH),(HT)\}, \{(HH),(TT)\}, \{(HT),(TH),(HH)\}, \{(HT),(TH),(TT)\}\}.
\]
We can verify that $X^{-1}(\{0\} \cup \{1\}) = \{(HT),(TH),(TT)\} \in \mathcal{F}_X$, $X^{-1}(\{0\} \cup \{2\}) = \{(HH),(TT)\} \in \mathcal{F}_X$ and $X^{-1}(\{1\} \cup \{2\}) = \{(HT),(TH),(HH)\} \in \mathcal{F}_X$.

Example:
Consider the random variable $Y$: "the number of heads in the first trial". Then $Y(\{HH\}) = Y(\{HT\}) = 1$ and $Y(\{TT\}) = Y(\{TH\}) = 0$. However, $Y$ does not preserve the event structure of $\mathcal{F}_X$, since $Y^{-1}(\{0\}) = \{(TH),(TT)\}$ is not an event in $\mathcal{F}_X$, and neither is $Y^{-1}(\{1\}) = \{(HH),(HT)\}$.

From the two examples above we see that the question "is $X(\cdot) : S \to \mathbb{R}_X$ a random variable?" does not make any sense unless some $\sigma$-field $\mathcal{F}$ is also specified. In the case of the function $X$ (the number of heads) in the coin-tossing example, we see that it is a random variable relative to the $\sigma$-field $\mathcal{F}_X$. This, however, does not preclude $Y$ from being a random variable with respect to some other $\sigma$-field $\mathcal{F}_Y$; for instance $\mathcal{F}_Y = \{S, \emptyset, \{(HH),(HT)\}, \{(TH),(TT)\}\}$. Intuition suggests that for any real-valued function $X(\cdot) : S \to \mathbb{R}$ we should be able to define a $\sigma$-field $\mathcal{F}_X$ on $S$ such that $X$ is a random variable. The concept of the $\sigma$-field generated by a random variable enables us to concentrate on particular aspects of an experiment without having to consider everything associated with the experiment at the same time. Hence, when we choose to define a random variable and the associated $\sigma$-field, we make an implicit choice about the features of the random experiment we are interested in.

How do we decide that some function $X(\cdot) : S \to \mathbb{R}$ is a random variable relative to a given $\sigma$-field $\mathcal{F}$? From the discussion of the $\sigma$-field $\sigma(J)$ generated by the set $J = \{B_x : x \in \mathbb{R}\}$, where $B_x = (-\infty, x]$, we know that $\mathcal{B} = \sigma(J)$, and if $X(\cdot)$ is such that
\[
X^{-1}((-\infty, x]) = \{s : X(s) \in (-\infty, x], \ s \in S\} \in \mathcal{F} \quad \text{for all } x \in \mathbb{R},
\]
then
\[
X^{-1}(B) = \{s : X(s) \in B, \ s \in S\} \in \mathcal{F} \quad \text{for all } B \in \mathcal{B}.
\]
In other words, when we want to establish that $X$ is a random variable, or to define $P_X(\cdot)$, we have to look no further than the half-closed intervals $(-\infty, x]$ and the $\sigma$-field $\sigma(J)$ they generate, whatever the range $\mathbb{R}_X$. Using the shorthand notation $\{X(s) \le x\}$ instead of $\{s : X(s) \in (-\infty, x], \ s \in S\}$ in the above example,
\[
X^{-1}((-\infty, x]) = \{s : X(s) \le x\} =
\begin{cases}
\emptyset & x < 0, \\
\{(TT)\} & 0 \le x < 1, \\
\{(TT),(TH),(HT)\} & 1 \le x < 2, \\
\{(TT),(TH),(HT),(HH)\} = S & x \ge 2,
\end{cases}
\]
and we can see that $X^{-1}((-\infty, x]) \in \mathcal{F}_X$ for all $x \in \mathbb{R}$; thus $X(\cdot)$ is a random variable with respect to $\mathcal{F}_X$.

A random variable $X$ relative to $\mathcal{F}$ maps $S$ into a subset of the real line, and the Borel field $\mathcal{B}$ on $\mathbb{R}$ now plays the role of $\mathcal{F}$.
In order to complete the model we need to assign probabilities to the elements $B$ of $\mathcal{B}$. Common sense suggests that the assignment of probabilities to the events $B \in \mathcal{B}$ must be consistent with the probabilities assigned to the corresponding events in $\mathcal{F}$. Formally, we need to define a set function $P_X(\cdot) : \mathcal{B} \to [0, 1]$ such that
\[
P_X(B) = P(X^{-1}(B)) \equiv P(\{s : X(s) \in B, \ s \in S\}) \quad \text{for all } B \in \mathcal{B}.
\]
For example, in the above example $P_X(\{0\}) = \tfrac{1}{4}$, $P_X(\{1\}) = \tfrac{1}{2}$, $P_X(\{2\}) = \tfrac{1}{4}$ and $P_X(\{0\} \cup \{1\}) = \tfrac{3}{4}$.

The question which arises is whether, in order to define the set function $P_X(\cdot)$, we need to consider all the elements of the Borel field $\mathcal{B}$. The answer is that we do not, because, as argued above, any such element of $\mathcal{B}$ can be expressed in terms of the semi-closed intervals $(-\infty, x]$. This implies that by choosing such semi-closed intervals "intelligently" we can define $P_X(\cdot)$ with the minimum of effort. For example, we may define
\[
P_X((-\infty, x]) =
\begin{cases}
0 & x < 0, \\
\tfrac{1}{4} & 0 \le x < 1, \\
\tfrac{3}{4} & 1 \le x < 2, \\
1 & x \ge 2.
\end{cases}
\]
As we can see, the semi-closed intervals were chosen to divide the real line at the points corresponding to the values taken by $X$. This way of defining the semi-closed intervals is clearly non-unique, but it will prove very convenient in the next subsection. In fact, the event and probability structure of $(S, \mathcal{F}, P(\cdot))$ is preserved in the induced probability space $(\mathbb{R}, \mathcal{B}, P_X(\cdot))$. We traded $S$, a set of arbitrary elements, for $\mathbb{R}$, the real line; $\mathcal{F}$, a $\sigma$-field of subsets of $S$, for $\mathcal{B}$, the Borel field on the real line; and $P(\cdot)$, a set function defined on arbitrary sets, for $P_X(\cdot)$, a set function defined on semi-closed intervals of the real line.

3.2 The Distribution and Density Functions

In the previous section the introduction of the concept of a random variable $X$ enabled us to trade the probability space $(S, \mathcal{F}, P(\cdot))$ for $(\mathbb{R}, \mathcal{B}, P_X(\cdot))$, which has a much more convenient mathematical structure. The latter probability space, however, is not yet simple enough, because $P_X(\cdot)$ is still a set function, albeit one defined on real-line intervals. In order to simplify it we need to transform it into a point function, with which we are much more familiar. Define a point function
\[
F(\cdot) : \mathbb{R} \to [0, 1],
\]
which is, seemingly, only a function of $x$. In fact, however, this function will do exactly the same job as $P_X(\cdot)$. Heuristically, this is achieved by defining $F(\cdot)$ as a point function by
\[
P_X((-\infty, x]) = F(x) - F(-\infty) \quad \text{for all } x \in \mathbb{R},
\]
and assigning the value zero to $F(-\infty)$.

Definition 8:
Let $X$ be a r.v. defined on $(S, \mathcal{F}, P(\cdot))$. The point function $F(\cdot) : \mathbb{R} \to [0, 1]$ defined by
\[
F(x) = P_X((-\infty, x]) = \Pr(X \le x) \quad \text{for all } x \in \mathbb{R}
\]
is called the distribution function (DF) of $X$ and satisfies the following properties:
(a) $F(x)$ is non-decreasing;
(b) $F(-\infty) = \lim_{x \to -\infty} F(x) = 0$ and $F(\infty) = \lim_{x \to \infty} F(x) = 1$;
(c) $F(x)$ is continuous from the right, i.e. $\lim_{h \to 0^+} F(x + h) = F(x)$ for all $x \in \mathbb{R}$.

The great advantage of $F(\cdot)$ over $P(\cdot)$ and $P_X(\cdot)$ is that the former is a point function and can be represented in the form of an algebraic formula, the kind of function we are familiar with from elementary mathematics.

Definition 9:
A random variable $X$ is called discrete if its range $\mathbb{R}_X$ is some subset of the set of integers $\mathbb{Z} = \{0, \pm 1, \pm 2, \ldots\}$.

Definition 10:
A random variable $X$ is called continuous if its distribution function $F(x)$ is continuous for all $x \in \mathbb{R}$ and there exists a non-negative function $f(\cdot)$ on the real line such that
\[
F(x) = \int_{-\infty}^{x} f(u)\,du \quad \text{for all } x \in \mathbb{R}.
\]
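For the discrete random variable $X$ = "number of heads" the DF of Definition 8 is a step function, and the values $P_X((-\infty, x])$ listed above can be reproduced directly. A minimal sketch:

```python
# Sketch: the distribution function F(x) = Pr(X <= x) of the discrete
# random variable X = "number of heads" in two tosses of a fair coin.
pmf = {0: 0.25, 1: 0.50, 2: 0.25}     # P_X({0}), P_X({1}), P_X({2})

def F(x):
    """Right-continuous, non-decreasing step function."""
    return sum(p for value, p in pmf.items() if value <= x)

for x in (-1, 0, 0.5, 1, 1.7, 2, 10):
    print(x, F(x))    # 0, 0.25, 0.25, 0.75, 0.75, 1.0, 1.0
```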
In defining the concept of a continuous r.v. we introduced the function $f(x)$, which is directly related to $F(x)$.

Definition 11:
Let $F(x)$ be the DF of the r.v. $X$. The non-negative function $f(x)$ defined by
\[
F(x) = \int_{-\infty}^{x} f(u)\,du \quad \text{for all } x \in \mathbb{R} \quad \text{(continuous case)},
\]
or
\[
F(x) = \sum_{u \le x} f(u) \quad \text{for all } x \in \mathbb{R} \quad \text{(discrete case)},
\]
is said to be the probability density function (pdf) of $X$.

Example:
Let $X$ be uniformly distributed on the interval $[a, b]$; we write $X \sim U(a, b)$. The DF of $X$ takes the form
\[
F(x) =
\begin{cases}
0 & x < a, \\
\dfrac{x - a}{b - a} & a \le x < b, \\
1 & x \ge b.
\end{cases}
\]
The corresponding pdf of $X$ is given by
\[
f(x) =
\begin{cases}
\dfrac{1}{b - a} & a \le x \le b, \\
0 & \text{elsewhere}.
\end{cases}
\]

Although we could use the distribution function $F(x)$ as the fundamental concept of our probability model, we prefer to adopt the density function $f(x)$ instead, because we gain in simplicity and added intuition. It enhances intuition to view the density function as distributing the probability mass over the range of $X$. The pdf satisfies the following properties:
(a) $f(x) \ge 0$ for all $x \in \mathbb{R}$;
(b) $\int_{-\infty}^{\infty} f(x)\,dx = 1$;
(c) $\Pr(a < X \le b) = \int_{a}^{b} f(x)\,dx$;
(d) $f(x) = \frac{d}{dx} F(x)$ at every point where the DF is differentiable.

3.3 The Notion of a Probability Model

For a continuous random variable it is impossible to obtain the pdf $f(x)$ from the random experiment $\mathcal{E}$ directly (either it is too costly or it is impossible to observe all values of $X$); instead, we model a particular real phenomenon using previous experience in modelling similar phenomena or a preliminary study of the data. With a parameterized probability model we transform the original uncertainty related to $\mathcal{E}$ into uncertainty related to the unknown parameters $\theta$ of $f(\cdot)$; in order to emphasize this we write the pdf as $f(x; \theta)$. We are now in a position to define our probability model in the form of a parametric family of density functions, which we denote by
\[
\Phi = \{f(x; \theta), \ \theta \in \Theta\}.
\]
$\Phi$ represents a set of density functions indexed by the unknown parameters $\theta$, which are assumed to belong to a parameter space $\Theta$.

Example (the Pareto distribution):
\[
\Phi = \left\{ f(x; \theta) = \frac{\theta x_0^{\theta}}{x^{\theta + 1}}, \ x > x_0, \ \theta \in \Theta \right\},
\]
where $x_0$ is a known number and $\Theta = \mathbb{R}_+$, the positive real line. For each value of $\theta$ in $\Theta$, $f(x; \theta)$ represents a different density.

When a particular parametric family of densities $\Phi$ is chosen as the appropriate probability model for a real phenomenon, we are in effect assuming that the observed data available were generated by the "chance mechanism" described by one of the densities in $\Phi$. The original uncertainty relating to the outcome of a particular trial of the experiment has now been transformed into uncertainty relating to the choice of the one $\theta$ in $\Theta$ which determines uniquely the density $f(x; \theta)$ that gave rise to the observed data. The task of estimating or testing hypotheses about $\theta$ using the observed data lies with statistical inference, taken up in the next chapter.

3.4 Some Univariate Distributions

1. Continuous distributions:

(i) The normal distribution. A random variable $X$, $x \in \mathbb{R}$, is normally distributed if its probability density function is given by
\[
f(x; \mu, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}, \quad \mu \in \mathbb{R}, \ \sigma^2 \in \mathbb{R}_+.
\]
We often express this by $X \sim N(\mu, \sigma^2)$. As far as the shape of the normal distribution and density functions is concerned, we note the following characteristics:

(a) The normal density is symmetric about $\mu$, i.e.
\[
f(\mu + k) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{k^2}{2\sigma^2} \right\} = f(\mu - k)
\ \Rightarrow\ \Pr(\mu \le X \le \mu + k) = \Pr(\mu - k \le X \le \mu), \quad k > 0,
\]
and, for the DF, $F(-x) = 1 - F(x + 2\mu)$.

(b) The density function attains its maximum at $x = \mu$:
\[
\frac{d f(x)}{dx} = -f(x)\,\frac{x - \mu}{\sigma^2} = 0 \ \Rightarrow\ x = \mu, \quad \text{and} \quad f(\mu) = \frac{1}{\sigma \sqrt{2\pi}}.
\]

(c) The density function has two points of inflection at $x = \mu \pm \sigma$:
\[
\frac{d^2 f(x)}{dx^2} = \frac{1}{\sigma^3 \sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\} \left[ \frac{(x - \mu)^2}{\sigma^2} - 1 \right] = 0 \ \Rightarrow\ x = \mu \pm \sigma.
\]

The density function of the standardized random variable $Z = \frac{X - \mu}{\sigma}$ is
\[
f(z) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\tfrac{1}{2} z^2 \right\},
\]
which does not depend on the unknown parameters $\mu$ and $\sigma$.
This is called the standard normal distribution, which we write as $Z \sim N(0, 1)$.

(ii) The exponential family of distributions. A leading member is the gamma distribution: a continuous random variable $X$ has a gamma distribution with parameters $\alpha > 0$ and $\lambda > 0$ if
\[
f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\lambda x}, \quad x \ge 0.
\]
Many familiar distributions are special cases, including the exponential ($\alpha = 1$) and the chi-squared ($\lambda = 1/2$, $\alpha = n/2$).

2. Discrete random variables:
(i) The Poisson distribution.
(ii) The binomial distribution.
(iii) The Bernoulli distribution.

3.5 Numerical Characteristics of Random Variables

In modelling real phenomena using a probability model of the form $\Phi = \{f(x; \theta), \theta \in \Theta\}$, we need to be able to postulate such models having only a general quantitative description of the random variable in question at our disposal a priori. Such information comes in the form of certain numerical characteristics of random variables, such as the mean, the variance, the skewness and kurtosis coefficients, and higher moments. Indeed, sometimes such numerical characteristics actually determine the type of probability density in $\Phi$. Moreover, the analysis of density functions is usually undertaken in terms of these numerical characteristics.

3.5.1 Mathematical Expectation

The mean of $X$, denoted by $E(X)$, is defined by
\[
E(X) = \int_{-\infty}^{\infty} x f(x)\,dx \quad \text{for a continuous r.v.}, \qquad
E(X) = \sum_{i} x_i f(x_i) \quad \text{for a discrete r.v.},
\]
when the integral and sum exist. We usually denote $E(X) = \mu$.

Example:
If $X \sim U(a, b)$, i.e. $X$ is a uniformly distributed r.v., then
\[
E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{a}^{b} \frac{x}{b - a}\,dx = \frac{1}{2} \cdot \frac{1}{b - a}\, x^2 \Big|_{a}^{b} = \frac{a + b}{2}.
\]

In the above example the mean of the r.v. $X$ exists. The condition which guarantees the existence of $E(X)$ is that
\[
E|X| = \int_{-\infty}^{\infty} |x| f(x)\,dx < \infty
\]
(since $|E(X)| \le E|X|$). One example where the mean does not exist is the case of a Cauchy distributed r.v. with pdf
\[
f(x) = \frac{1}{\pi (1 + x^2)}, \quad x \in \mathbb{R}.
\]
Then the expectation of $|X|$ would be
\[
E|X| = \int_{-\infty}^{\infty} |x| f(x)\,dx = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{|x|}{1 + x^2}\,dx
= \frac{2}{\pi} \int_{0}^{\infty} \frac{x}{1 + x^2}\,dx \ \ \text{(by symmetry)}
= \frac{1}{\pi} \lim_{a \to \infty} \log_e(1 + a^2) = \infty.
\]
That is, $E(X)$ does not exist for the Cauchy distribution.

Some properties of the expectation:
(a) $E(c) = c$ if $c$ is a constant.
(b) $E(aX_1 + bX_2) = aE(X_1) + bE(X_2)$ for any two r.v.'s $X_1$ and $X_2$ whose means exist, where $a, b$ are real constants.
(c) $\Pr(X \ge \varepsilon\, E(X)) \le 1/\varepsilon$ for a positive r.v. $X$ and $\varepsilon > 0$; this is the so-called Markov inequality.

3.5.2 The Variance

Related to the mean as a measure of location is the dispersion measure called the variance, defined by
\[
Var(X) = E[X - E(X)]^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = E(X^2) - \mu^2 \equiv \sigma^2.
\]
Note that the square root of the variance is referred to as the standard deviation.

Example:
Let $X \sim U(a, b)$; then
\[
Var(X) = \int_{a}^{b} \left( x - \frac{a + b}{2} \right)^2 \frac{1}{b - a}\,dx = \frac{(b - a)^2}{12}.
\]

Some properties of the variance:
(a) $Var(c) = 0$ for any constant $c$.
(b) $Var(aX) = a^2\, Var(X)$ for any constant $a$.
(c) Chebyshev's inequality: $\Pr(|X - E(X)| \ge k) \le Var(X)/k^2$.

3.5.3 Higher Moments

(a) The $r$-th raw moment is the moment about the origin $x = 0$, defined by
\[
\mu'_r \equiv E(X^r) = \int_{-\infty}^{\infty} x^r f(x)\,dx, \quad r = 0, 1, 2, \ldots
\]
(b) The $r$-th central moment is defined as the moment around $x = \mu$:
\[
\mu_r \equiv E(X - \mu)^r = \int_{-\infty}^{\infty} (x - \mu)^r f(x)\,dx, \quad r = 0, 1, 2, \ldots
\]
These higher moments are sometimes useful in providing us with further information relating to the distribution and density functions of r.v.'s. In particular, the third and fourth central moments, when standardized in the form
\[
\alpha_3 = \frac{\mu_3}{\sigma^3} \quad \text{and} \quad \alpha_4 = \frac{\mu_4}{\sigma^4},
\]
are referred to as the measures of skewness and kurtosis, and provide us with measures of asymmetry and of flatness of the peak, respectively.
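The moment calculations above are easy to check numerically. The sketch below approximates $E(X)$ and $Var(X)$ for a $U(a,b)$ random variable by crude numerical integration (the interval $[2,5]$ and the midpoint rule are arbitrary choices), and illustrates how the truncated integral defining $E|X|$ for the Cauchy density grows without bound as the truncation point increases.

```python
# Sketch: E(X) and Var(X) of a U(a, b) random variable by crude numerical
# integration, and the divergence of E|X| for the Cauchy density.
import math

def integrate(f, lo, hi, n=200_000):          # simple midpoint rule
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

a, b = 2.0, 5.0
f_unif = lambda x: 1.0 / (b - a)
mean = integrate(lambda x: x * f_unif(x), a, b)
var  = integrate(lambda x: (x - mean) ** 2 * f_unif(x), a, b)
print(mean, (a + b) / 2)            # both 3.5
print(var, (b - a) ** 2 / 12)       # both 0.75

# E|X| for the Cauchy density grows without bound with the truncation point:
cauchy = lambda x: 1.0 / (math.pi * (1.0 + x * x))
for A in (10, 1_000, 100_000):
    print(A, 2 * integrate(lambda x: x * cauchy(x), 0, A))
```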
4 Random Vectors and Their Distributions

The probability model formulated in the previous chapter was in the form of a parametric family of densities associated with a random variable $X$: $\Phi = \{f(x; \theta), \theta \in \Theta\}$. In practice, however, there are many observable phenomena where the outcome comes in the form of several quantitative attributes. For example, data on personal income might be related to number of children, social class, type of occupation, age class, etc. In order to be able to model such real phenomena we extend the single-r.v. framework to one for multidimensional r.v.'s, or random vectors, that is,
\[
\mathbf{x} = (X_1, X_2, \ldots, X_n)',
\]
where each $X_i$, $i = 1, 2, \ldots, n$, measures a particular quantifiable attribute of the random experiment's ($\mathcal{E}$'s) outcome.

4.1 Joint Distribution and Density Functions

Consider the random experiment $\mathcal{E}$ of tossing a fair coin twice. Define the function $X_1(\cdot)$ to be the number of heads and $X_2(\cdot)$ to be the number of tails. The function $(X_1(\cdot), X_2(\cdot)) : S \to \mathbb{R}^2$ is a two-dimensional vector function which assigns to each element $s$ of $S$ the pair of ordered numbers $(x_1, x_2)$, where $x_1 = X_1(s)$, $x_2 = X_2(s)$.

Definition 12:
A (bivariate) random vector $\mathbf{x}(\cdot)$ is a vector function $\mathbf{x}(\cdot) : S \to \mathbb{R}^2$ such that for any two real numbers $(x_1, x_2) \equiv \mathbf{x}$, the event
\[
\mathbf{x}^{-1}((-\infty, \mathbf{x}]) = \{s : -\infty < X_1(s) \le x_1, \ -\infty < X_2(s) \le x_2, \ s \in S\} \in \mathcal{F}.
\]
The random vector induces a probability space $(\mathbb{R}^2, \mathcal{B}^2, P_X(\cdot))$, where $\mathcal{B}^2$ ($\equiv \mathcal{B} \times \mathcal{B}$) are the Borel subsets of the plane and $P_X(\cdot)$ is a probability set function defined over events in $\mathcal{B}^2$, in a way which preserves the probability structure of the original probability space $(S, \mathcal{F}, P(\cdot))$. This is achieved by attributing to each $B \in \mathcal{B}^2$ the probability
\[
P_X(B) = P(\{s : (X_1(s), X_2(s)) \in B\}), \quad \text{or} \quad P_X((-\infty, \mathbf{x}]) = \Pr(X_1 \le x_1, X_2 \le x_2).
\]
We can go a step further and reduce $P_X(\cdot)$ to a point function $F(x_1, x_2)$, which we call the joint (cumulative) distribution function.

Definition 13:
Let $\mathbf{x} \equiv (X_1, X_2)$ be a random vector defined on $(S, \mathcal{F}, P(\cdot))$. The function $F(\cdot, \cdot) : \mathbb{R}^2 \to [0, 1]$ defined by
\[
F(\mathbf{x}) \equiv F(x_1, x_2) = P_X((-\infty, \mathbf{x}]) = \Pr(X_1 \le x_1, X_2 \le x_2) \equiv \Pr(\mathbf{x} \le \mathbf{x})
\]
is said to be the joint distribution function of $\mathbf{x}$.

Example:
In the coin-tossing example above, the random vector $\mathbf{x}(\cdot)$ takes the values $(1, 1)$, $(2, 0)$ and $(0, 2)$ with probabilities $\tfrac{1}{2}$, $\tfrac{1}{4}$ and $\tfrac{1}{4}$, respectively. In order to derive the joint distribution function (DF) we have to consider the events of the form $\{s : X_1(s) \le x_1, X_2(s) \le x_2, s \in S\}$ for all $(x_1, x_2) \in \mathbb{R}^2$; for example,
\[
\{s : X_1(s) \le x_1, X_2(s) \le x_2, \ s \in S\} =
\begin{cases}
\emptyset & x_1 < 0 \text{ or } x_2 < 0, \\
\{(TT)\} & \text{at } (x_1, x_2) = (0, 2), \\
\{(TH),(HT)\} & \text{at } (x_1, x_2) = (1, 1), \\
\{(HH)\} & \text{at } (x_1, x_2) = (2, 0), \\
S & x_1 \ge 2, \ x_2 \ge 2.
\end{cases}
\]
It is worth noting that here the events expressed in the Borel plane, $B \in \mathcal{B}^2$, do not accumulate in the way they do for a (single) random variable, since, for example, the event $(X_1 \le 0, X_2 \le 2)$ is not contained in $(X_1 \le 1, X_2 \le 1)$. The joint DF of $X_1$ and $X_2$ is given by
\[
F(x_1, x_2) =
\begin{cases}
0 & x_1 < 0 \text{ or } x_2 < 0, \\
\tfrac{1}{4} & 0 \le x_1 < 1, \ x_2 \ge 2, \\
\tfrac{3}{4} & 1 \le x_1 < 2, \ x_2 \ge 2, \\
1 & x_1 \ge 2, \ x_2 \ge 2.
\end{cases}
\]

Definition 14:
The joint DF of $X_1$ and $X_2$ is called continuous if there exists a non-negative function $f(x_1, x_2)$ such that
\[
F(x_1, x_2) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} f(u, v)\,dv\,du.
\]
The function $f(x_1, x_2)$ is called the joint density function of $X_1$ and $X_2$. This definition implies the following properties for $f(x_1, x_2)$:
(a) $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1\,dx_2 = 1$;
(b) $\Pr(a < X_1 \le b, \ c < X_2 \le d) = \int_{a}^{b} \int_{c}^{d} f(x_1, x_2)\,dx_2\,dx_1$;
(c) $f(x_1, x_2) = \frac{\partial^2}{\partial x_1 \partial x_2} F(x_1, x_2)$, if $f(\cdot)$ is continuous at $(x_1, x_2)$.
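For the discrete example above, the joint DF can be computed directly by summing the joint probabilities; the following sketch reproduces the values used in the text.

```python
# Sketch: the joint distribution of x = (X1, X2) = (no. of heads, no. of tails)
# in two tosses of a fair coin, and its joint distribution function.
joint_pmf = {(1, 1): 0.5, (2, 0): 0.25, (0, 2): 0.25}

def F(x1, x2):
    """F(x1, x2) = Pr(X1 <= x1, X2 <= x2)."""
    return sum(p for (v1, v2), p in joint_pmf.items() if v1 <= x1 and v2 <= x2)

print(F(0, 2), F(1, 2), F(1, 1), F(2, 2))   # 0.25  0.75  0.5  1.0
```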
4.2 Some Bivariate Distributions

1. The bivariate normal distribution:
\[
f(x_1, x_2; \theta) = \frac{(1 - \rho^2)^{-1/2}}{2\pi \sigma_1 \sigma_2}
\exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \left( \frac{x_1 - \mu_1}{\sigma_1} \right)^2
- 2\rho \left( \frac{x_1 - \mu_1}{\sigma_1} \right)\left( \frac{x_2 - \mu_2}{\sigma_2} \right)
+ \left( \frac{x_2 - \mu_2}{\sigma_2} \right)^2 \right] \right\},
\]
with $x_1, x_2 \in \mathbb{R}$ and $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho) \in \mathbb{R}^2 \times \mathbb{R}_+^2 \times [-1, 1]$.

The extension of the concept of a random variable $X$ to that of a random vector $\mathbf{x} = (X_1, X_2, \ldots, X_n)$ enables us to generalize the probability model
\[
\Phi = \{f(x; \theta), \ \theta \in \Theta\}
\]
to that of a parametric family of joint density functions
\[
\Phi = \{f(x_1, x_2, \ldots, x_n; \theta), \ \theta \in \Theta\}.
\]
This is a very important generalization, since in most applied disciplines the real phenomena to be modeled are usually multidimensional, in the sense that there is more than one quantifiable feature to be considered.

Notation:
For nonstochastic quantities: (1) $a, x, y$, etc.: scalar elements ($1 \times 1$); (2) $\mathbf{a}, \mathbf{x}, \mathbf{y}$: column vectors ($n \times 1$); (3) $\mathbf{A}, \mathbf{X}, \mathbf{Y}$: matrices ($n \times n$).
For stochastic quantities: (1) $X$: a random variable; $x$: the value that $X$ takes; $f_X(x)$: the probability that the random variable $X$ takes on the value $x$. (2) $\mathbf{x} = (X_1, X_2, \ldots, X_n)'$: a random vector; $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$: the value that $\mathbf{x}$ takes; $f_{\mathbf{x}}(\mathbf{x})$: the (joint) probability that $(X_1, X_2, \ldots, X_n)'$ takes on the values $(x_1, x_2, \ldots, x_n)'$. (3) $\mathbf{X}$, etc.: matrices (such as the regressor matrix and the variance-covariance matrix).

4.3 Marginal Distributions

Let $\mathbf{x} \equiv (X_1, X_2)$ be a bivariate random vector defined on $(S, \mathcal{F}, P)$ with joint distribution function $F(x_1, x_2)$. The question which naturally arises is whether we can separate $X_1$ and $X_2$ and consider them as individual random variables. The answer to this question leads us to the concept of a marginal distribution.

Definition 15:
The marginal distribution functions of $X_1$ and $X_2$ are defined by
\[
F_1(x_1) = \lim_{x_2 \to \infty} F(x_1, x_2) \quad \text{and} \quad F_2(x_2) = \lim_{x_1 \to \infty} F(x_1, x_2).
\]

Having separated $X_1$ and $X_2$, we need to see whether they can be considered as single r.v.'s defined on the same probability space. In defining a random vector we imposed the condition that $\{s : X_1(s) \le x_1, X_2(s) \le x_2\} \in \mathcal{F}$. In defining the marginal distribution we used the event $\{s : X_1(s) \le x_1, X_2(s) \le \infty\}$, which we know belongs to $\mathcal{F}$. This event, however, can be written as the intersection of two sets of the form
\[
\{s : X_1(s) \le x_1\} \cap \{s : X_2(s) \le \infty\},
\]
but the second set is $S$, i.e. $\{s : X_2(s) \le \infty\} = S$, which implies that
\[
\{s : X_1(s) \le x_1\} \cap \{s : X_2(s) \le \infty\} = \{s : X_1(s) \le x_1\},
\]
which indeed belongs to $\mathcal{F}$, and this is the condition needed for $X_1$ to be a r.v. with distribution function $F_1(x_1)$; the same is true for $X_2$.

The marginal density functions of $X_1$ and $X_2$ are defined by
\[
f_1(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2 \quad \text{and} \quad f_2(x_2) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1.
\]

Example:
For the random vector $(X_1, X_2)$ = (no. of heads, no. of tails) above, the marginal density $f_1(x_1)$ is "recovered" from
\[
f_1(0) = f(0, 0) + f(0, 1) + f(0, 2) = 0 + 0 + \tfrac{1}{4} = \tfrac{1}{4},
\]
\[
f_1(1) = f(1, 0) + f(1, 1) + f(1, 2) = 0 + \tfrac{1}{2} + 0 = \tfrac{1}{2},
\]
\[
f_1(2) = f(2, 0) + f(2, 1) + f(2, 2) = \tfrac{1}{4} + 0 + 0 = \tfrac{1}{4}.
\]

It is quite obvious that, knowing the joint density function of $X_1$ and $X_2$, we can derive their marginal density functions; the reverse, however, is not true in general. Knowledge of $f_1(x_1)$ and $f_2(x_2)$ is enough to derive $f(x_1, x_2)$ only when
\[
f(x_1, x_2) = f_1(x_1) \cdot f_2(x_2),
\]
in which case we say that $X_1$ and $X_2$ are independent r.v.'s. Independence in terms of the distribution functions takes the same form:
\[
F(x_1, x_2) = F_1(x_1) \cdot F_2(x_2).
\]
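The marginalization in the example, and the failure of independence for this particular pair (here $X_2 = 2 - X_1$), can be reproduced mechanically from the joint pmf. A minimal sketch:

```python
# Sketch: recovering the marginal densities from the joint pmf of the
# (no. of heads, no. of tails) example, and checking for independence.
from collections import defaultdict

joint_pmf = {(1, 1): 0.5, (2, 0): 0.25, (0, 2): 0.25}

f1, f2 = defaultdict(float), defaultdict(float)
for (x1, x2), p in joint_pmf.items():
    f1[x1] += p        # f1(x1) = sum over x2 of f(x1, x2)
    f2[x2] += p        # f2(x2) = sum over x1 of f(x1, x2)

print(dict(f1))        # {1: 0.5, 2: 0.25, 0: 0.25}
print(dict(f2))        # {1: 0.5, 0: 0.25, 2: 0.25}

# X1 and X2 are clearly not independent here (X2 = 2 - X1):
print(joint_pmf[(1, 1)], f1[1] * f2[1])    # 0.5 versus 0.25
```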
4.4 Conditional Distributions

In this section we consider the question of simplifying probability models by conditioning on some subset of the r.v.'s. In the context of the probability space $(S, \mathcal{F}, P(\cdot))$ the conditional probability of event $A_1$ given event $A_2$ is defined by
\[
P(A_1 \mid A_2) = \frac{P(A_1 \cap A_2)}{P(A_2)}, \quad P(A_2) > 0, \quad A_1, A_2 \in \mathcal{F}.
\]
Using an analogous definition in terms of density functions, we define the conditional density of $X_1$ given $X_2 = x_2$ to be
\[
f_{X_1 \mid X_2}(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f_2(x_2)}, \quad x_1 \in \mathbb{R}_{X_1}.
\]
Similarly,
\[
f_{X_2 \mid X_1}(x_2 \mid x_1) = \frac{f(x_1, x_2)}{f_1(x_1)}, \quad x_2 \in \mathbb{R}_{X_2},
\]
provided $f_1(x_1) > 0$ and $f_2(x_2) > 0$. (For a continuous random variable $X_2$ we have $\Pr(X_2 = x_2) = 0$, so the elementary definition of conditioning on the event $\{X_2 = x_2\}$ does not make sense; the mathematical apparatus needed to bypass this problem is beyond our scope here. See Billingsley (1979), pp. 354-407.)

Two things should be noted:
1. The conditional density is a proper density function, i.e. for a given $X_2 = x_2$,
(i) $f_{X_1 \mid X_2}(x_1 \mid x_2) \ge 0$; (ii) $\int_{-\infty}^{\infty} f_{X_1 \mid X_2}(x_1 \mid x_2)\,dx_1 = 1$.
2. Knowledge of all these conditional densities is equivalent to knowledge of the joint density, i.e.
\[
f(x_1, x_2) = f_{X_1 \mid X_2}(x_1 \mid x_2) \cdot f_2(x_2) = f_{X_2 \mid X_1}(x_2 \mid x_1) \cdot f_1(x_1), \quad (x_1, x_2) \in \mathbb{R}^2.
\]
An immediate implication of the last equation is that if $X_1$ and $X_2$ are independent, then
\[
f_{X_1 \mid X_2}(x_1 \mid x_2) = f_1(x_1), \quad x_1 \in \mathbb{R}_{X_1}.
\]

Lemma:
Let $g : \mathbb{R}^k \to \mathbb{R}^l$ be a continuous function and let $\mathbf{z}$ and $\mathbf{y}$ be independent. Then $g(\mathbf{z})$ and $g(\mathbf{y})$ are independent.
Proof: Let $A_1 = [\mathbf{z} : g(\mathbf{z}) \le \mathbf{a}_1]$ and $A_2 = [\mathbf{y} : g(\mathbf{y}) \le \mathbf{a}_2]$. Then
\[
F_{g(\mathbf{z})\,g(\mathbf{y})}(\mathbf{a}_1, \mathbf{a}_2) \equiv P[g(\mathbf{z}) \le \mathbf{a}_1, \ g(\mathbf{y}) \le \mathbf{a}_2] = P[\mathbf{z} \in A_1, \ \mathbf{y} \in A_2]
= P[\mathbf{z} \in A_1]\, P[\mathbf{y} \in A_2] = P[g(\mathbf{z}) \le \mathbf{a}_1]\, P[g(\mathbf{y}) \le \mathbf{a}_2]
= F_{g(\mathbf{z})}(\mathbf{a}_1)\, F_{g(\mathbf{y})}(\mathbf{a}_2)
\]
for all $\mathbf{a}_1, \mathbf{a}_2 \in \mathbb{R}^l$. Hence $g(\mathbf{z})$ and $g(\mathbf{y})$ are independent.

Exercise:
Let $\mathbf{X} = (X_1, X_2, X_3)$ be a continuous random vector having joint density
\[
f(x_1, x_2, x_3) = 6 \exp(-x_1 - x_2 - x_3), \quad 0 < x_1 < x_2 < x_3.
\]
Find the marginal pdf $f(x_2)$ and the conditional density of $X_3$ given $(X_1, X_2) = (x_1, x_2)$.

5 Functions of Random Variables

One of the most important problems in probability theory and statistical inference is to derive the distribution of a function $h(X_1, X_2, \ldots, X_n)$ when the distribution of the random vector $\mathbf{x} = (X_1, X_2, \ldots, X_n)$ is known. This problem is important for at least two reasons: (a) it is often the case that in modeling observable phenomena we are primarily interested in functions of random variables; and (b) in statistical inference the quantities of primary interest are commonly functions of random variables. It is no exaggeration to say that the whole of statistical inference is based on our ability to derive the distributions of various functions of r.v.'s.

5.1 (Single) Function of One Random Variable (one-to-one transformation)

Definition 16:
A function $h(\cdot) : \mathbb{R}_X \to \mathbb{R}$ is said to be a Borel function if for any $a \in \mathbb{R}$ the set $B_h = \{x : h(x) \le a\}$ is a Borel set, i.e. $B_h \in \mathcal{B}$, where $\mathcal{B}$ is the Borel field on $\mathbb{R}$.

The requirement that $h(\cdot)$ be a Borel function is an obvious condition to impose, given that we need $h(X)$ to be a random variable itself. Having ensured that the function $h(\cdot)$ of the r.v. $X$ is itself a r.v., $Y = h(X)$, we want to derive the distribution of $Y$ when the distribution of $X$ is known.

Lemma:
Let $X$ be a continuous r.v. and $Y = h(X)$, where $h(x)$ is differentiable for all $x \in \mathbb{R}_X$ and either $dh(x)/dx > 0$ for all $x$ or $dh(x)/dx < 0$ for all $x$. Then the density function of $Y$ is given by
\[
f_Y(y) = f_X(h^{-1}(y)) \left| \frac{d}{dy} h^{-1}(y) \right| \quad \text{for } a < y < b,
\]
where $|\cdot|$ stands for the absolute value and $a$ and $b$ refer to the smallest and largest values $y$ can take, respectively.

Example:
Let $X \sim N(\mu, \sigma^2)$ and $Y = (X - \mu)/\sigma$, which implies that $dh(x)/dx = 1/\sigma > 0$ for all $x$, since $\sigma > 0$ by definition; $h^{-1}(y) = \sigma y + \mu$ and $dh^{-1}(y)/dy = \sigma$. Thus, since
\[
f_X(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\},
\]
it follows that
\[
f_Y(y) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2}(\sigma y + \mu - \mu)^2 \right\} \cdot \sigma = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\tfrac{1}{2} y^2 \right\},
\]
i.e. $Y \sim N(0, 1)$, the standard normal distribution.
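The change-of-variables Lemma can also be checked by simulation. The sketch below applies it to a monotone transformation not covered by the worked example, $Y = \exp(X)$ with $X \sim N(0,1)$ (so $h^{-1}(y) = \log y$ and $dh^{-1}(y)/dy = 1/y$), and compares a probability computed from the Lemma's density with the relative frequency in a simulated sample; the interval $(1, 3]$ is an arbitrary choice.

```python
# Sketch: checking the change-of-variables Lemma by simulation for Y = exp(X),
# X ~ N(0, 1), so that h^{-1}(y) = log(y) and d h^{-1}(y)/dy = 1/y.
import math, random

def f_X(x):                      # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def f_Y(y):                      # density given by the Lemma
    return f_X(math.log(y)) * abs(1.0 / y)

# Pr(1 < Y <= 3) from the Lemma's density (crude numerical integration) ...
n = 100_000
h = (3.0 - 1.0) / n
p_formula = h * sum(f_Y(1.0 + (i + 0.5) * h) for i in range(n))

# ... against the relative frequency in a large simulated sample.
random.seed(0)
sample = (math.exp(random.gauss(0.0, 1.0)) for _ in range(200_000))
p_sim = sum(1 < y <= 3 for y in sample) / 200_000

print(round(p_formula, 3), round(p_sim, 3))   # both close to 0.364
```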
In cases where the conditions of the Lemma above are not satisfied, we need to derive the distribution from the relationship
\[
F_Y(y) = \Pr(h(X) \le y) = \Pr(X \in h^{-1}((-\infty, y])).
\]

Exercise:
Let $X \sim N(\mu, \sigma^2)$. Find the pdf of $Y = X^2$.

5.2 (Single) Function of Several Random Variables (n-to-one transformation)

As in the case of a single r.v., for a Borel function $h(\cdot) : \mathbb{R}^n \to \mathbb{R}$ and a random vector $\mathbf{x} = (X_1, X_2, \ldots, X_n)$, $h(\mathbf{x})$ is a random variable. Three commonly used functions of random variables (taking two random variables as an example) are:
1. the distribution of $X_1 + X_2$;
2. the distribution of $X_1 / X_2$;
3. the distribution of $Y = \min(X_1, X_2)$.

Exercise:
Let $X_i \sim U(-1, 1)$, $i = 1, 2, 3$, and $Y = X_1 + X_2 + X_3$. Find the pdf of $Y$.

5.3 Functions of Several Random Variables (n-to-n transformation)

After considering various simple functions of r.v.'s separately, let us consider them together. Let $\mathbf{x} = (X_1, X_2, \ldots, X_n)'$ be a random vector with joint probability density function $f_{\mathbf{x}}(x_1, x_2, \ldots, x_n)$ and define the one-to-one transformation
\[
Y_1 = h_1(X_1, X_2, \ldots, X_n), \quad Y_2 = h_2(X_1, X_2, \ldots, X_n), \quad \ldots, \quad Y_n = h_n(X_1, X_2, \ldots, X_n),
\]
whose inverses take the form $h_i^{-1}(\cdot) = g_i(\cdot)$, $i = 1, 2, \ldots, n$, that is,
\[
X_1 = g_1(Y_1, Y_2, \ldots, Y_n), \quad X_2 = g_2(Y_1, Y_2, \ldots, Y_n), \quad \ldots, \quad X_n = g_n(Y_1, Y_2, \ldots, Y_n).
\]
Assume: (a) $h_i(\cdot)$ and $g_i(\cdot)$ are continuous; (b) the partial derivatives $\partial X_i / \partial Y_j$, $i, j = 1, 2, \ldots, n$, exist and are continuous; and (c) the Jacobian of the inverse transformation
\[
J = \det\left( \frac{\partial (X_1, X_2, \ldots, X_n)'}{\partial (Y_1, Y_2, \ldots, Y_n)} \right) \ne 0.
\]
Then
\[
f(y_1, y_2, \ldots, y_n) = f(g_1(y_1, y_2, \ldots, y_n), \ldots, g_n(y_1, y_2, \ldots, y_n))\,|J|.
\]

Exercise:
Let $X_i \sim N(0, 1)$, $i = 1, 2$, be two independent r.v.'s and let $Y_1 = h_1(X_1, X_2) = X_1 + X_2$ and $Y_2 = h_2(X_1, X_2) = X_1 - X_2$. Find the joint pdf $f(y_1, y_2)$ and the marginal densities $f_1(y_1)$ and $f_2(y_2)$.

5.4 Functions of Normally Distributed Random Variables

Lemma 1:
If $X_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, n$, are independent r.v.'s, then $\sum_{i=1}^{n} X_i \sim N\!\left( \sum_{i=1}^{n} \mu_i, \sum_{i=1}^{n} \sigma_i^2 \right)$ (normal).

Lemma 2:
If $X_i \sim N(0, 1)$, $i = 1, 2, \ldots, n$, are independent r.v.'s, then $\sum_{i=1}^{n} X_i^2 \sim \chi^2(n)$, the chi-square distribution with $n$ degrees of freedom. In particular, if $Y \sim \chi^2(n)$ then the density function of $Y$ is
\[
f_Y(y; n) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, y^{(n/2) - 1} e^{-y/2}, \quad y > 0, \ n = 1, 2, \ldots,
\]
with $E(Y) = n$ and $Var(Y) = 2n$.

Lemma 3:
If $X_1 \sim N(0, 1)$ and $X_2 \sim \chi^2(n)$ are independent r.v.'s, then $X_1 / \sqrt{X_2 / n} \sim t(n)$, Student's $t$ with $n$ degrees of freedom. In particular, if $W \sim t(n)$ then the density function of $W$ is
\[
f_W(w; n) = \frac{1}{\sqrt{n\pi}}\, \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)} \left( 1 + \frac{w^2}{n} \right)^{-(n+1)/2}, \quad n > 0, \ w \in \mathbb{R},
\]
with
\[
E(W) = 0, \quad Var(W) = \frac{n}{n - 2}, \ n > 2, \quad \alpha_4 = 3 + \frac{6}{n - 4}, \ n > 4.
\]
These moments show that for large $n$ the $t$-distribution is very close to the standard normal.

Lemma 4:
If $X_1 \sim \chi^2(n_1)$ and $X_2 \sim \chi^2(n_2)$ are independent r.v.'s, then $(X_1 / n_1)/(X_2 / n_2) \sim F(n_1, n_2)$, Fisher's $F$ with $n_1$ and $n_2$ degrees of freedom. In particular, if $U \sim F(n_1, n_2)$ then the density function of $U$ is
\[
f_U(u; n_1, n_2) = \frac{\Gamma\!\left(\frac{n_1 + n_2}{2}\right)}{\Gamma\!\left(\frac{n_1}{2}\right)\Gamma\!\left(\frac{n_2}{2}\right)} \left( \frac{n_1}{n_2} \right)^{n_1/2} u^{\frac{1}{2}(n_1 - 2)} \left( 1 + \frac{n_1}{n_2} u \right)^{-\frac{1}{2}(n_1 + n_2)}, \quad u > 0,
\]
with
\[
E(U) = \frac{n_2}{n_2 - 2}, \ n_2 > 2, \qquad Var(U) = \frac{2 n_2^2 (n_1 + n_2 - 2)}{n_1 (n_2 - 2)^2 (n_2 - 4)}, \ n_2 > 4.
\]
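Lemmas like these can be illustrated by Monte Carlo simulation. The sketch below checks the moments stated in Lemma 2 for $n = 5$ (the sample size and seed are arbitrary choices).

```python
# Sketch: a Monte Carlo check of Lemma 2: if X_1, ..., X_n are independent
# N(0, 1), then the sum of X_i^2 has mean n and variance 2n (chi-square, n df).
import random

random.seed(0)
n, reps = 5, 200_000
draws = [sum(random.gauss(0, 1) ** 2 for _ in range(n)) for _ in range(reps)]

mean = sum(draws) / reps
var  = sum((d - mean) ** 2 for d in draws) / reps
print(round(mean, 2), round(var, 2))    # close to n = 5 and 2n = 10
```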
6 The General Notion of Expectation

In Section 3.5 we considered the notion of mathematical expectation in the context of the simple probability model $\Phi = \{f(x; \theta), \theta \in \Theta\}$ as a useful characteristic of the density function of a single random variable. Since then we have generalized the probability model to $\Phi = \{f(x_1, x_2, \ldots, x_n; \theta), \theta \in \Theta\}$ and put forward a framework in the context of which joint density functions can be analyzed. This includes marginalization, conditioning and functions of random variables. The purpose of this section is to consider the notion of expectation in the context of this more general framework. For simplicity of exposition we consider the case where $n = 2$.

6.1 Expectation of a Marginal Random Variable

The expectation of a marginal random variable follows directly from the definition:
\[
E(X_1) = \int x_1 f_{X_1}(x_1)\,dx_1 = \int x_1 \left( \int f_{X_1 X_2}(x_1, x_2)\,dx_2 \right) dx_1 = \int\!\!\int x_1 f_{X_1 X_2}(x_1, x_2)\,dx_2\,dx_1.
\]
Therefore, the expectation of the random vector, $E(\mathbf{x})$, is just the vector collecting the expectations of the marginal (individual) random variables, i.e. $E(\mathbf{x}) = (E(X_1), E(X_2), \ldots, E(X_n))'$.

6.2 Expectation of a Function of Random Variables

Let $(X_1, X_2)$ be a bivariate random vector with joint density function $f_{\mathbf{x}}(x_1, x_2)$ and let $h(\cdot) : \mathbb{R}^2 \to \mathbb{R}$ be a Borel function. Define $Y = h(X_1, X_2)$ and consider its expectation. This can be defined in two equivalent ways:
\[
\text{(a)} \quad E(Y) = \int_{-\infty}^{\infty} y\, f_Y(y)\,dy,
\]
or
\[
\text{(b)} \quad E(Y) = E(h(X_1, X_2)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x_1, x_2) f(x_1, x_2)\,dx_1\,dx_2.
\]

6.2.1 Forms of $h(X_1, X_2)$ of Particular Interest

For
\[
h(X_1, X_2) = (X_1 - E(X_1))^l (X_2 - E(X_2))^k,
\]
let $\mu_1 = E(X_1)$ and $\mu_2 = E(X_2)$. Then
\[
\mu_{lk} \equiv E(h(X_1, X_2)) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x_1 - \mu_1)^l (x_2 - \mu_2)^k f(x_1, x_2)\,dx_1\,dx_2
\]
are called the joint central moments of order $l + k$. Two especially interesting joint central moments are the covariance and the variance:

(a) Covariance ($l = k = 1$):
\[
Cov(X_1, X_2) = E((X_1 - \mu_1)(X_2 - \mu_2)) = E(X_1 X_2) - E(X_1) E(X_2).
\]
If $X_1$ and $X_2$ are independent then $Cov(X_1, X_2) = 0$; the converse is not true.

(b) Variance ($l = 2$, $k = 0$):
\[
Var(X_1) = E(X_1 - \mu_1)^2.
\]
For a linear function $\sum_i a_i X_i$ the variance is of the form
\[
Var\left( \sum_{i=1}^{n} a_i X_i \right) = \sum_{i=1}^{n} a_i^2\, Var(X_i) + \sum_{i \ne j}\sum a_i a_j\, Cov(X_i, X_j),
\]
where the $a_i$ are real constants. Using the covariance and the variance we can define the correlation coefficient by
\[
Corr(X_1, X_2) = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)\,Var(X_2)}},
\]
which has the property that $-1 \le Corr(X_1, X_2) \le 1$.

6.2.2 Properties of Expectation

1. Linearity: $E[a h_1(X_1, X_2) + b h_2(X_1, X_2)] = a E(h_1(X_1, X_2)) + b E(h_2(X_1, X_2))$, where $a$ and $b$ are constants and $h_1(\cdot)$, $h_2(\cdot)$ are Borel functions from $\mathbb{R}^2$ to $\mathbb{R}$. In particular, $E\left( \sum_{i=1}^{n} a_i X_i \right) = \sum_{i=1}^{n} a_i E(X_i)$.

2. If $X_1$ and $X_2$ are independent r.v.'s, then for every pair of Borel functions $h_1(\cdot), h_2(\cdot) : \mathbb{R} \to \mathbb{R}$,
\[
E(h_1(X_1) h_2(X_2)) = E(h_1(X_1)) \cdot E(h_2(X_2)),
\]
given that the above expectations exist. One particular case of interest is $h_1(X_1) = X_1$ and $h_2(X_2) = X_2$, in which case
\[
E(X_1 X_2) = E(X_1) \cdot E(X_2).
\]
This is, in some sense, linear independence, which is much weaker than independence. Moreover, given that $Cov(X_1, X_2) = E(X_1 X_2) - E(X_1) E(X_2)$, linear independence is equivalent to uncorrelatedness, since it implies that $Cov(X_1, X_2) = 0$.

6.3 Conditional Expectation

The conditional expectation of $X_1$ given that $X_2$ takes a particular value $x_2$ ($X_2 = x_2$) is defined by
\[
E(X_1 \mid X_2 = x_2) = \int_{-\infty}^{\infty} x_1 f_{X_1 \mid X_2}(x_1 \mid x_2)\,dx_1,
\]
and is a function of $x_2$. In general, for any Borel function $h(\cdot)$ whose expectation exists,
\[
E(h(X_1) \mid X_2 = x_2) = \int_{-\infty}^{\infty} h(x_1) f_{X_1 \mid X_2}(x_1 \mid x_2)\,dx_1.
\]
We have the following properties of the conditional expectation.

Properties of the Conditional Expectation. Let $X$, $X_1$ and $X_2$ be random variables on $(S, \mathcal{F}, P)$; then:
(a) $E[a_1 h(X_1) + a_2 h(X_2) \mid X = x] = a_1 E[h(X_1) \mid X = x] + a_2 E[h(X_2) \mid X = x]$, where $a_1, a_2$ are constants.
(b) If $X_1 \le X_2$, then $E(X_1 \mid X = x) \le E(X_2 \mid X = x)$.
(c) $E[h(X_1, X_2) \mid X_2 = x_2] = E[h(X_1, x_2) \mid X_2 = x_2]$.
(d) $E[h(X_1) \mid X_2 = x_2] = E[h(X_1)]$ if $X_1$ and $X_2$ are independent.
(e) $E[h(X_1)] = E_{X_2}\{E[h(X_1) \mid X_2 = x_2]\}$; this is the so-called law of iterated expectations.
(f) The conditional expectation $E(X_1 \mid X_2 = x_2)$ is a non-stochastic function of $x_2$, i.e. $E(X_1 \mid \cdot) : \mathbb{R}_{X_2} \to \mathbb{R}$. The graph of $(x_2, E(X_1 \mid X_2 = x_2))$ is called the regression curve.
(g) $E[h(X_1)\,g(X_2) \mid X_2 = x_2] = g(x_2)\,E[h(X_1) \mid X_2 = x_2]$.
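These quantities, and property (e) in particular, can be checked numerically on the discrete joint pmf of the (no. of heads, no. of tails) example used throughout. A minimal sketch:

```python
# Sketch: covariance, correlation and the law of iterated expectations
# (property (e) above) on the joint pmf of (no. of heads, no. of tails).
joint_pmf = {(1, 1): 0.5, (2, 0): 0.25, (0, 2): 0.25}

E  = lambda g: sum(p * g(x1, x2) for (x1, x2), p in joint_pmf.items())
EX1, EX2 = E(lambda a, b: a), E(lambda a, b: b)
cov  = E(lambda a, b: a * b) - EX1 * EX2
var1 = E(lambda a, b: (a - EX1) ** 2)
var2 = E(lambda a, b: (b - EX2) ** 2)
print(cov, cov / (var1 * var2) ** 0.5)          # -0.5 and -1.0 (X2 = 2 - X1)

# E(X1 | X2 = x2) and E[E(X1 | X2)]:
f2 = {0: 0.25, 1: 0.5, 2: 0.25}
def cond_mean_X1(x2):
    return sum(p * x1 for (x1, v2), p in joint_pmf.items() if v2 == x2) / f2[x2]

iterated = sum(f2[x2] * cond_mean_X1(x2) for x2 in f2)
print(iterated, EX1)                             # both equal 1.0
```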
As in the case of the ordinary expectation, we can define higher conditional moments:
(a) raw conditional moments:
\[
E(X_1^r \mid X_2 = x_2) = \int_{-\infty}^{\infty} x_1^r f_{X_1 \mid X_2}(x_1 \mid x_2)\,dx_1, \quad r \ge 1;
\]
(b) central conditional moments:
\[
E[(X_1 - E(X_1 \mid X_2 = x_2))^r \mid X_2 = x_2], \quad r \ge 2.
\]
Of particular interest is the conditional variance, sometimes called the skedasticity:
\[
Var(X_1 \mid X_2 = x_2) = E[(X_1 - E(X_1 \mid X_2 = x_2))^2 \mid X_2 = x_2] = E(X_1^2 \mid X_2 = x_2) - [E(X_1 \mid X_2 = x_2)]^2.
\]

Exercise:
Show that in a bivariate normal distribution, $Var(X_1 \mid X_2 = x_2) = \sigma_1^2 (1 - \rho^2)$; that is, the conditional variance is free of the conditioning variable (homoskedasticity).

7 Multivariate Normal Distribution

The multivariate normal distribution is by far the most important distribution in statistical inference, for a variety of reasons, including the fact that some of the statistics based on sampling from such a distribution have tractable distributions themselves. Before we consider the multivariate normal distribution, however, let us introduce some notation and various simple results related to random vectors and their distributions in general.

7.1 Multivariate Distributions

Let $\mathbf{x} \equiv (X_1, X_2, \ldots, X_n)'$ be an $n \times 1$ random vector defined on the probability space $(S, \mathcal{F}, P(\cdot))$. The mean vector $E(\mathbf{x})$ is defined by
\[
E(\mathbf{x}) = \begin{pmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_n) \end{pmatrix}
\equiv \begin{pmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{pmatrix} \equiv \boldsymbol{\mu}, \quad \text{an } n \times 1 \text{ vector},
\]
and the covariance matrix $Cov(\mathbf{x})$ by
\[
Cov(\mathbf{x}) = E(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})'
= \begin{pmatrix}
Var(X_1) & Cov(X_1, X_2) & \cdots & Cov(X_1, X_n) \\
Cov(X_2, X_1) & Var(X_2) & \cdots & Cov(X_2, X_n) \\
\vdots & \vdots & \ddots & \vdots \\
Cov(X_n, X_1) & Cov(X_n, X_2) & \cdots & Var(X_n)
\end{pmatrix}
= \begin{pmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2
\end{pmatrix}
\equiv \boldsymbol{\Sigma},
\]
where $\boldsymbol{\Sigma}$ is an $n \times n$ symmetric positive definite matrix.

Lemma 1:
If $\mathbf{x}$ has mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$, then for $\mathbf{z} = \mathbf{A}\mathbf{x} + \mathbf{b}$:
(a) $E(\mathbf{z}) = \mathbf{A}\,E(\mathbf{x}) + \mathbf{b} = \mathbf{A}\boldsymbol{\mu} + \mathbf{b}$;
(b) $Cov(\mathbf{z}) = E[(\mathbf{A}\mathbf{x} + \mathbf{b} - (\mathbf{A}\boldsymbol{\mu} + \mathbf{b}))(\mathbf{A}\mathbf{x} + \mathbf{b} - (\mathbf{A}\boldsymbol{\mu} + \mathbf{b}))'] = \mathbf{A}\,E(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})'\,\mathbf{A}' = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$.
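Lemma 1 is easy to verify by simulation. The sketch below draws a large sample with a chosen mean vector and covariance matrix (the particular mu, Sigma, A and b are arbitrary illustrative choices) and compares the sample moments of $\mathbf{z} = \mathbf{A}\mathbf{x} + \mathbf{b}$ with $\mathbf{A}\boldsymbol{\mu} + \mathbf{b}$ and $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$.

```python
# Sketch: a simulation check of Lemma 1 using numpy.
import numpy as np

rng = np.random.default_rng(0)
mu    = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
A = np.array([[1.0, 1.0,  0.0],
              [0.0, 2.0, -1.0]])
b = np.array([10.0, -5.0])

x = rng.multivariate_normal(mu, Sigma, size=200_000)   # 200000 x 3
z = x @ A.T + b                                        # 200000 x 2

print(np.round(z.mean(axis=0), 2), np.round(A @ mu + b, 2))   # sample vs A mu + b
print(np.round(np.cov(z, rowvar=False), 2))                   # sample covariance
print(np.round(A @ Sigma @ A.T, 2))                           # A Sigma A'
```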
x1 N( 1; 11) and x2 N( 2; 22): (b) The conditional distribution of x1 given x2 = x2 is normal as well: x1j(x2 = x2) N( 1:2; 11:2) where 1:2 = 1 + 12 122 (x2 2); 11:2 = 11 12 122 21: 44 7.2.2 Linear Function of a normal vector If x Nn( ; ), then Ax + b N(A + b;A A0). 7.2.3 Quadratic forms related to the normal distribution (a). Let x Nn( ; ), then (x )0 1(x ) 2(n). (b). Let x Nn( ;In), then for A an idempotent symmetric matrix, we have (x )0A(x ) 2(tr A). 7.2.4 Independence of Quadratic Form (a). If x Nn(0;In) and x0Ax and x0Bx are two idempotent quadratic form in x, x0Ax and x0Bx are independent if AB = 0. (b). A linear function, Lx, and an idempotent quadratic from x0Ax, in a stan- dard normal vector are statistically independent if LA = 0. Proof: For part (a), since A and B are both symmetric and idempotent, A = A0A and B = B0B. The quadratic forms are therefore x0Ax = x0A0Ax = x01x1 where x1 = Ax and x0Bx = x0B0Bx = x02x2 where x2 = Bx: Both vectors have zero mean vectors, so the covariance matrix of x1 and x2 is E(x1x02) = AIB0 = AB = 0: Since Ax and Bx are linear functions of a normally distributed random vector, they are, in turn, normally distributed. Their zero covariance matrix implies that they are statistically independent using the fact that continuous functions of two independent random vector are also independent. The proof of part (b) is similar and is omitted. 45 Example: Let Xi i:i:d N(0; 1), then Pni=1(Xi X)2 = x0M0x 2(n 1), where x = [X1; X2; :::; Xn]0. 46