Ch. 2 Probability Theory
1 Descriptive Study of Data
1.1 Histograms and Their Numerical Characteristics
By descriptive study of data we refer to the summarization and exposition (tab-
ulation, grouping, graphical representation) of observed data as well as the
derivation of numerical characteristics such as measures of location, dispersion
and shape.
Although the descriptive study of data is an important facet of modeling with
real data itself, in the present study it is mainly used to motivate the need for
probability theory and statistical inference proper.
In order to make the discussion more specific let us consider the after-tax personal income data of 23,000 households for 1999-2000 in the US. These data in raw form constitute 23,000 numbers between $5,000 and $100,000. This presents us with a formidable task in attempting to understand how income is distributed among the 23,000 households represented in the data. The purpose of descriptive statistics is to help us make some sense of such data. A natural way to proceed is to summarize the data by allocating the numbers into classes (intervals). The number of intervals is chosen a priori and depends on the degree of summarization needed. We then have the "Table of personal income in the US". The first column of the table shows the income intervals, the second column shows the number of incomes falling into each interval and the third column the relative frequency for each interval. The relative frequency is calculated by dividing the number of observations in each interval by the total number of observations. The fourth column is the cumulative frequency. Summarizing the data in this table enables us to get some idea of how income is distributed among the various classes. If we plot the relative (cumulative) frequencies in a bar graph we get what is known as the (cumulative) histogram.
For further information on the distribution of income we could calculate various numerical characteristics describing the histogram's location, dispersion and shape. Such measures can be calculated directly in terms of the raw data. However, in the present case it is more convenient for expositional purposes to use the grouped data. The main reason for this is to introduce various concepts which will be reinterpreted in the context of probability.
The mean as a measure of location takes the form
$$\bar{z} = \sum_{i=1}^{n} \phi_i z_i,$$
where $\phi_i$ and $z_i$ refer to the relative frequency and the midpoint of interval $i$.
The mode as a measure of location refers to the value of income that occurs most
frequently in the data set. Another measure of location is the median, referring to the value of income in the middle when the incomes are arranged in ascending order. The best way to calculate the median is
to plot the cumulative frequency graph.
Another important feature of the histogram is the dispersion of the relative frequencies around a measure of central tendency. The most frequently used measure of dispersion is the variance, defined by
$$v^2 = \sum_{i=1}^{n} (z_i - \bar{z})^2 \phi_i,$$
which is a measure of dispersion around the mean; $v$ is known as the standard deviation.
We can extend the concept of the variance to
$$m_k = \sum_{i=1}^{n} (z_i - \bar{z})^k \phi_i, \quad k = 3, 4, \ldots,$$
defining what are known as higher central moments. These higher moments can be used to get a better idea of the shape of the histogram. For example, the standardized forms of the third and fourth moments, defined by
$$SK = \frac{m_3}{v^3} \quad \text{and} \quad K = \frac{m_4}{v^4},$$
known as the skewness and kurtosis coefficients, measure the asymmetry and the peakedness of the histogram, respectively. In the case of a symmetric histogram $SK = 0$, and the value of $K$ increases with the peakedness of the histogram.
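These grouped-data formulas are easy to evaluate directly. The following is a minimal sketch (assuming Python with NumPy is available; the interval midpoints and relative frequencies are made-up illustrative numbers, not the actual income data) of computing the mean, variance, skewness and kurtosis coefficients from midpoints $z_i$ and relative frequencies $\phi_i$.

```python
import numpy as np

# Hypothetical grouped data: interval midpoints z_i (in $1000) and
# relative frequencies phi_i (which must sum to one).
z   = np.array([10.0, 25.0, 40.0, 60.0, 85.0])
phi = np.array([0.30, 0.35, 0.20, 0.10, 0.05])

mean = np.sum(phi * z)                    # z-bar = sum phi_i z_i
var  = np.sum(phi * (z - mean) ** 2)      # v^2   = sum phi_i (z_i - z-bar)^2
sd   = np.sqrt(var)
m3   = np.sum(phi * (z - mean) ** 3)      # third central moment
m4   = np.sum(phi * (z - mean) ** 4)      # fourth central moment
SK   = m3 / sd ** 3                       # skewness coefficient
K    = m4 / sd ** 4                       # kurtosis coefficient

print(mean, var, SK, K)
```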
1.2 Looking Ahead
The most important drawback of descriptive statistics is that the study of the
observed data enables us to draw certain conclusions which relate only to the
data in hand. The temptation in analyzing the above income data is to attempt to make generalizations beyond the data in hand, in particular about the distribution of income in the US as a whole (not just the 23,000 households in the sample). This, however, is not possible within the descriptive statistics framework. In order to be able to generalize beyond the data in hand we need to "model" the distribution of income in the US and not just describe the observed data in hand. Such a general model is provided by probability theory, to be considered in Section 2.
It turns out that the model provided by probability theory owes a lot to the earlier developed descriptive statistics. In particular, most of the concepts which form the basis of probability theory were motivated by the descriptive statistics concepts considered above. The concepts of measures of location, dispersion and shape, as well as the frequency curve, were transplanted into probability theory with renewed interpretations. The frequency curve, when reinterpreted, becomes a density function purporting to model observable real-world phenomena. As for the various measures, they will now be reinterpreted in terms of the density function.
2 Probability
Why do we need probability theory in analyzing observed data? In the descriptive study of data considered in the last section, it was emphasized that the results cannot be generalized outside the observed data under consideration. Any question relating to the population from which the observed data were drawn cannot be answered within the descriptive statistics framework. In order to be able to do that we need the theoretical framework offered by probability theory. In effect probability theory develops a mathematical model which provides the logical foundation of statistical inference procedures for analyzing observed data.
In developing a mathematical model we must first identify the important features, relations and entities in the real world phenomena and then devise the concepts and choose the assumptions with which to project a generalized description of these phenomena; an idealized picture of these phenomena. The model as a consistent mathematical system has "a life of its own" and can be analyzed and studied without direct reference to real world phenomena. (Think of analyzing the population: we do not have to refer to the information in the sample.)
By the 1920s there was a wealth of results and probability began to grow into a systematic body of knowledge. Although various people attempted a systematization of probability, it was the work of the Russian mathematician Kolmogorov which proved to be the cornerstone for a systematic approach to probability theory. Kolmogorov managed to relate the concept of probability to that of a measure in integration theory and exploited to the full the analogies between set theory and the theory of functions on the one hand and the concept of a random variable on the other. In a monumental monograph in 1933 he proposed an axiomatization of probability theory, establishing it once and for all as part of mathematics proper. There is no doubt that this monograph proved to be the watershed for the later development of probability theory, growing enormously in importance and applicability.
2.1 The Axiomatic Approach
The axiomatic approach to probability proceeds from a set of axioms (accepted
without questioning as obvious), which are based on many centuries of human
experience, and the subsequent development is built deductively using formal
logical arguments, like any other part of mathematics such as geometry or linear
algebra. In mathematics an axiomatic system is required to be complete, non-redundant and consistent. By complete we mean that the set of axioms postulated should enable us to prove every other theorem of the theory in question using the axioms and mathematical logic. The notion of non-redundancy refers to the impossibility of deriving any axiom of the system from the other axioms. Consistency refers to the non-contradictory nature of the axioms.
A probability model is by construction intended to be a description of a chance
mechanism giving rise to observed data. The starting point of such a model is
provided by the concept of a random experiment describing a simplistic and
idealized process giving rise to the observed data.
Definition 1:
A random experiment, denoted by E, is an experiment which satisfies the following conditions:
(a) all possible distinct outcomes are known a priori;
(b) in any particular trial the outcome is not known a priori; and
(c) it can be repeated under identical conditions.
The axiomatic approach to probability theory can be viewed as a formalization of the concept of a random experiment. In an attempt to formalize condition (a), that all possible distinct outcomes are known a priori, Kolmogorov devised the set S which includes "all possible distinct outcomes" and has to be postulated before the experiment is performed.
Definition 2:
The sample space, denoted by S, is defined to be the set of all possible outcomes of the experiment E. The elements of S are called elementary events.
Example:
Consider the random experiment E of tossing a fair coin twice and observing the
faces turning up. The sample space of E is
$$S = \{(HT), (TH), (HH), (TT)\},$$
with $(HT), (TH), (HH), (TT)$ being the elementary events belonging to $S$.
The second ingredient of E, related to (b), concerns the various forms events can take. A moment's reflection suggests that there is no particular reason why we should be interested in elementary outcomes only. We might be interested in such events as $A_1$ = 'at least one H' or $A_2$ = 'at most one H', and these are not elementary events; in particular
$$A_1 = \{(HT), (TH), (HH)\}$$
and
$$A_2 = \{(HT), (TH), (TT)\}$$
are combinations of elementary events. All such outcomes are called events associated with the same sample space S, and they are defined by combining elementary events. Understanding the concept of an event is crucial for the discussion which follows. Intuitively an event is any proposition associated with E which may occur or not at each trial. We say that event $A_1$ occurs when any one of the elementary events it comprises occurs. Thus, when a trial is made only one elementary event is observed, but a large number of events may have occurred. For example, if the elementary event $(HT)$ occurs in a particular trial, $A_1$ and $A_2$ have occurred as well.
Given that S is a set whose members are the elementary events, this takes us immediately into the realm of set theory, and events can be formally defined to be subsets of S formed by set-theoretic operations ($\cap$, intersection; $\cup$, union; $\overline{\phantom{A}}$, complementation) on the elementary events. For example,
$$A_1 = \{(HT)\} \cup \{(TH)\} \cup \{(HH)\} = \overline{\{(TT)\}} \subset S,$$
$$A_2 = \{(HT)\} \cup \{(TH)\} \cup \{(TT)\} = \overline{\{(HH)\}} \subset S.$$
Two special events are S itself, called the sure event, and the impossible event $\emptyset$, defined to contain no elements of S, i.e. $\emptyset = \{\ \}$; the latter is defined for completeness.
A third ingredient of E associated with (b) which Kolmogorov had to formalize was the idea of uncertainty related to the outcome of any particular trial of E. This he formalized in the notion of probabilities attributed to the various events associated with E, such as $P(A_1)$, $P(A_2)$, expressing the "likelihood" of occurrence of these events. Although attributing probabilities to the elementary events presents no particular mathematical problem, doing the same for events in general is not as straightforward. The difficulty arises because if $A_1$ and $A_2$ are events, then $\bar{A}_1 = S - A_1$, $\bar{A}_2 = S - A_2$, $A_1 \cap A_2$, $A_1 \cup A_2$, etc., are also events, because the occurrence or non-occurrence of $A_1$ and $A_2$ implies the occurrence or not of these events. This implies that for the attribution of probabilities to make sense we have to impose some mathematical structure on the set of all events, say $\mathcal{F}$, which reflects the fact that whichever way we combine these events, the end result is always an event. The temptation at this stage is to define $\mathcal{F}$ to be the set of all subsets of S, called the power set; surely, this covers all possibilities!
In the above example, the power set of S takes the form
$$\mathcal{F} = \{\emptyset, \{(HT)\}, \{(TH)\}, \{(HH)\}, \{(TT)\}, \{(HT),(TH)\}, \{(HT),(HH)\}, \{(HT),(TT)\},$$
$$\{(TH),(HH)\}, \{(TH),(TT)\}, \{(HH),(TT)\}, \{(HT),(TH),(HH)\}, \{(HT),(TH),(TT)\},$$
$$\{(HT),(HH),(TT)\}, \{(TH),(HH),(TT)\}, S\}.$$
Sometimes we are not interested in all the subsets of S; we then need to define a set of events independently of the power set by endowing it with a mathematical structure which ensures that no inconsistency arises. This is achieved by requiring that $\mathcal{F}$ have a special mathematical structure: it is a $\sigma$-field related to S.
Definition 3:
Let $\mathcal{F}$ be a set of subsets of S. $\mathcal{F}$ is called a $\sigma$-field if:
(a) if $A \in \mathcal{F}$, then $\bar{A} \in \mathcal{F}$ (closure under complementation);
(b) if $A_i \in \mathcal{F}$, $i = 1, 2, \ldots$, then $\left(\bigcup_{i=1}^{\infty} A_i\right) \in \mathcal{F}$ (closure under countable union).
Note that (a) and (b) taken together imply the following:
(c) $S \in \mathcal{F}$, because $A \cup \bar{A} = S$;
(d) $\emptyset \in \mathcal{F}$ (from (c), $\bar{S} = \emptyset \in \mathcal{F}$); and
(e) if $A_i \in \mathcal{F}$, $i = 1, 2, \ldots$, then $\left(\bigcap_{i=1}^{\infty} A_i\right) \in \mathcal{F}$.
These suggest that a $\sigma$-field is a set of subsets of S which is closed under complementation, countable unions and intersections. That is, any of these operations on the elements of $\mathcal{F}$ will give rise to an element of $\mathcal{F}$.
Example:
If we are interested in events with one each of H and T, there is no point in defining the $\sigma$-field to be the power set; the collection $\mathcal{F}_c$ will do just as well, with fewer events to attribute probabilities to:
$$\mathcal{F}_c = \{\{(HT),(TH)\}, \{(HH),(TT)\}, S, \emptyset\}.$$
Exercise:
Check whether the set
$$\mathcal{F}_1 = \{\{(HT)\}, \{(TH),(HH),(TT)\}, S, \emptyset\}$$
is a $\sigma$-field or not.
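For a finite sample space, the closure conditions of Definition 3 can be checked mechanically. The following is a minimal sketch (assuming Python; `is_sigma_field` is a helper written for this exercise, not a library routine) that tests the collection $\mathcal{F}_1$ of the exercise and the collection $\mathcal{F}_c$ of the example above.

```python
S = frozenset(['HT', 'TH', 'HH', 'TT'])

def is_sigma_field(F, S):
    """Check closure of a finite collection F of subsets of S under
    complementation and (finite, hence here countable) unions."""
    F = {frozenset(A) for A in F}
    if S not in F or frozenset() not in F:
        return False
    for A in F:
        if S - A not in F:            # closure under complementation
            return False
    for A in F:
        for B in F:
            if A | B not in F:        # closure under union
                return False
    return True

F1 = [{'HT'}, {'TH', 'HH', 'TT'}, S, set()]
Fc = [{'HT', 'TH'}, {'HH', 'TT'}, S, set()]
print(is_sigma_field(F1, S), is_sigma_field(Fc, S))
```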
Let us turn our attention to the various collections of events ($\sigma$-fields) that are relevant for econometrics.
Definition 4:
The Borel $\sigma$-field $\mathcal{B}$ is the smallest collection of sets (called the Borel sets) that includes:
(a) all open sets of $\mathbb{R}$;
(b) the complement $\bar{B}$ of any $B$ in $\mathcal{B}$;
(c) the union $\bigcup_{n=1}^{\infty} B_n$ of any sequence $\{B_n\}$ of sets in $\mathcal{B}$.
The Borel sets of $\mathbb{R}$ just defined are said to be generated by the open sets of $\mathbb{R}$. The same Borel sets would be generated by all the open half-lines of $\mathbb{R}$, all the closed half-lines of $\mathbb{R}$, all the open intervals of $\mathbb{R}$, or all the closed intervals of $\mathbb{R}$. The Borel sets are a "rich" collection of events for which probabilities can be defined. To see how the Borel field contains almost every conceivable subset of $\mathbb{R}$, starting from the closed half-lines, consider the following example.
Example:
Let S be the real line $\mathbb{R} = \{x : -\infty < x < \infty\}$ and let the set of events of interest be
$$J = \{B_x : x \in \mathbb{R}\},$$
where $B_x = \{z : z \le x\} = (-\infty, x]$. How can we construct a $\sigma$-field $\sigma(J)$ on $\mathbb{R}$ from the events $B_x$?
By definition $B_x \in \sigma(J)$; then:
(1) taking complements of $B_x$: $\bar{B}_x = \{z : z \in \mathbb{R}, z > x\} = (x, \infty) \in \sigma(J)$;
(2) taking countable unions of $B_x$: $\bigcup_{n=1}^{\infty} (-\infty, x - (1/n)] = (-\infty, x) \in \sigma(J)$;
(3) taking complements of (2): $\overline{(-\infty, x)} = [x, \infty) \in \sigma(J)$;
(4) from (3), for any $y > x$, $[y, \infty) \in \sigma(J)$;
(5) from (4), $\overline{(-\infty, x] \cup [y, \infty)} = (x, y) \in \sigma(J)$;
(6) $\bigcap_{n=1}^{\infty} (x - (1/n), x] = \{x\} \in \sigma(J)$.
This shows not only that $\sigma(J)$ is a $\sigma$-field but that it includes almost every conceivable subset of $\mathbb{R}$; that is, it coincides with the $\sigma$-field generated by the open subsets of $\mathbb{R}$, which we denote by $\mathcal{B}$, i.e. $\sigma(J) = \mathcal{B}$, the Borel field on $\mathbb{R}$.
Having solved the technical problem of attributing probabilities to events by postulating the existence of a $\sigma$-field $\mathcal{F}$ associated with the sample space S, Kolmogorov went on to formalize the concept of probability itself.
Definition 5:
A mapping $P : \mathcal{F} \to [0, 1]$ is a probability measure on $(S, \mathcal{F})$ provided that:
(a) $P(\emptyset) = 0$;
(b) for any $A \in \mathcal{F}$, $P(\bar{A}) = 1 - P(A)$;
(c) for any disjoint sequence $\{A_i\}$ of sets in $\mathcal{F}$ (i.e., $A_i \cap A_j = \emptyset$ for all $i \ne j$),
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i).$$
Example:
Since $\{(HT)\} \cap \{(HH)\} = \emptyset$,
$$P(\{(HT)\} \cup \{(HH)\}) = P(\{(HT)\}) + P(\{(HH)\}) = \tfrac{1}{4} + \tfrac{1}{4} = \tfrac{1}{2}.$$
To summarize the argument so far, Kolmogorov formalized conditions (a) and (b) of the random experiment E in the form of the trinity $(S, \mathcal{F}, P(\cdot))$ comprising the set of all outcomes S (the sample space), a $\sigma$-field $\mathcal{F}$ of events related to S, and a probability function $P(\cdot)$ assigning probabilities to the events in $\mathcal{F}$. For the coin example, if we choose $\mathcal{F} = \{\{(HT)\}, \{(TH),(HH),(TT)\}, \emptyset, S\}$ (the $\sigma$-field generated by the event 'the first toss is H and the second is T') to be the $\sigma$-field of interest, $P(\cdot)$ is defined by
$$P(S) = 1, \quad P(\emptyset) = 0, \quad P(\{(HT)\}) = \tfrac{1}{4}, \quad P(\{(TH),(HH),(TT)\}) = \tfrac{3}{4}.$$
Because of its importance the trinity $(S, \mathcal{F}, P(\cdot))$ is given a name.
Definition 6:
A sample space S endowed with a $\sigma$-field $\mathcal{F}$ and a probability measure $P(\cdot)$ is called a probability space. That is, we call the triple $(S, \mathcal{F}, P)$ a probability space.
As far as condition (c) of E is concerned, yet to be formalized, it will prove of paramount importance in the context of the limit theorems in Chapter 4.
2.2 Conditional Probability
So far we have considered probabilities of events on the assumption that no information is available relating to the outcome of a particular trial. Sometimes, however, additional information is available in the form of the known occurrence of some event A. For example, in the case of tossing a fair coin twice we might know that the first trial was heads. What difference does this information make to the original triple $(S, \mathcal{F}, P)$? Firstly, knowing that the first trial was a head, the set of all possible outcomes now becomes
$$S_A = \{(HT), (HH)\},$$
since $(TH), (TT)$ are no longer possible. Secondly, the $\sigma$-field is taken to be
$$\mathcal{F}_A = \{S_A, \emptyset, \{(HT)\}, \{(HH)\}\}.$$
Thirdly, the probability set function becomes
$$P_A(S_A) = 1, \quad P_A(\emptyset) = 0, \quad P_A(\{(HT)\}) = \tfrac{1}{2}, \quad P_A(\{(HH)\}) = \tfrac{1}{2}.$$
Thus, knowing that the event A (a head in the first trial) has occurred transforms the original probability space $(S, \mathcal{F}, P)$ into the conditional probability space $(S_A, \mathcal{F}_A, P_A)$. The question that naturally arises is to what extent we can derive the above conditional probabilities without having to transform the original probability space. The following formula provides us with a way to calculate the conditional probability:
$$P_A(A_1) = P(A_1 | A) = \frac{P(A_1 \cap A)}{P(A)}.$$
Example:
Let $A_1 = \{(HT)\}$ and $A = \{(HT), (HH)\}$; then since $P(A_1) = \tfrac{1}{4}$, $P(A) = \tfrac{1}{2}$ and $P(A_1 \cap A) = P(\{(HT)\}) = \tfrac{1}{4}$,
$$P_A(A_1) = P(A_1 | A) = \frac{1/4}{1/2} = \frac{1}{2},$$
as above.
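The same conditional probability can be checked by direct enumeration of the equally likely outcomes; a minimal sketch (Python, assuming each elementary event has probability 1/4):

```python
from fractions import Fraction

# Equally likely elementary events of tossing a fair coin twice.
P = {s: Fraction(1, 4) for s in ['HT', 'TH', 'HH', 'TT']}

A1 = {'HT'}            # event of interest
A  = {'HT', 'HH'}      # conditioning event: heads on the first trial

P_A        = sum(P[s] for s in A)
P_A1_and_A = sum(P[s] for s in A1 & A)
print(P_A1_and_A / P_A)   # P(A1|A) = (1/4)/(1/2) = 1/2
```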
Using the above rule of conditional probability we can deduce that
$$P(A_1 \cap A_2) = P(A_1 | A_2) \cdot P(A_2) = P(A_2 | A_1) \cdot P(A_1) \quad \text{for } A_1, A_2 \in \mathcal{F}.$$
This is called the multiplication rule. Moreover, when knowing that $A_2$ has occurred does not change the original probability of $A_1$, i.e.
$$P(A_1 | A_2) = P(A_1),$$
we say that $A_1$ and $A_2$ are independent.
Independence is very different from mutual exclusiveness, in the sense that $A_1 \cap A_2 = \emptyset$ together with $P(A_1 | A_2) \ne P(A_1)$, and vice versa, can both arise. Independence is a probabilistic statement which ensures that the occurrence of one event does not influence the occurrence (or non-occurrence) of the other event. On the other hand, mutual exclusiveness is a statement which refers to the events (sets) themselves, not the associated probabilities. Two events are said to be mutually exclusive when they cannot occur together.
3 Random Variables and Probability Distributions
The model based on $(S, \mathcal{F}, P)$ does not provide us with a flexible enough framework. The basic idea underlying the construction of $(S, \mathcal{F}, P)$ was to set up a framework for studying probabilities of events as a prelude to analyzing problems involving uncertainty. One facet of E which can help us suggest a more flexible probability space is the fact that when the experiment is performed the outcome is often considered in relation to some quantifiable attribute, i.e. an attribute which can be expressed in numbers. It turns out that assigning numbers to qualitative outcomes makes possible a much more flexible formulation of probability theory. This suggests that if we could find a consistent way to assign numbers to outcomes we might be able to change $(S, \mathcal{F}, P)$ into something more easily handled. The concept of a random variable is designed to do just that, without changing the underlying probabilistic structure of $(S, \mathcal{F}, P)$.
3.1 The Concept of a Random Variable
Let us consider the possibility of defining a function $X(\cdot)$ which maps S directly into the real line $\mathbb{R}$, that is,
$$X(\cdot) : S \to \mathbb{R}_X,$$
assigning a real number $x_1$ to each $s_1$ in S by $x_1 = X(s_1)$, $x_1 \in \mathbb{R}$, $s_1 \in S$. The question arises as to whether every function from S to $\mathbb{R}_X$ will provide us with a consistent way of attaching numbers to elementary events; consistent in the sense of preserving the event structure of the probability space $(S, \mathcal{F}, P)$. The answer, unsurprisingly, is no. This is because, although X is a function defined on S, probabilities are assigned to events in $\mathcal{F}$, and the issue we have to face is how to define the values taken by X for the different elements of S in a way which preserves the event structure of $\mathcal{F}$. What we require of the inverse mapping $X^{-1}(\cdot)$ is that it provide us with a correspondence between $\mathbb{R}_X$ and S which reflects the event structure of $\mathcal{F}$, that is, it preserves unions, intersections and complements. In other words, for each subset N of $\mathbb{R}_X$, the inverse image $X^{-1}(N)$ must be an event in $\mathcal{F}$. This prompts us to define a random variable X to be any function satisfying this event-preserving condition in relation to some $\sigma$-field defined on $\mathbb{R}_X$; for generality we always take the Borel field $\mathcal{B}$ on $\mathbb{R}$.
Definition 7:
A random variable X is a real-valued function from S to $\mathbb{R}$ which satisfies the condition that for each Borel set $B \in \mathcal{B}$ on $\mathbb{R}$, the set $X^{-1}(B) = \{s : X(s) \in B, s \in S\}$ is an event in $\mathcal{F}$.
Example:
Define the function X as "the number of heads"; then $X(\{HH\}) = 2$, $X(\{TH\}) = 1$, $X(\{HT\}) = 1$, and $X(\{TT\}) = 0$. Further we see that $X^{-1}(2) = \{(HH)\}$, $X^{-1}(1) = \{(TH),(HT)\}$ and $X^{-1}(0) = \{(TT)\}$. In fact, it can be shown that the $\sigma$-field related to the random variable X so defined is
$$\mathcal{F}_X = \{S, \emptyset, \{(HH)\}, \{(TT)\}, \{(TH),(HT)\}, \{(HH),(TT)\}, \{(HT),(TH),(HH)\}, \{(HT),(TH),(TT)\}\}.$$
We can verify that $X^{-1}(\{0\} \cup \{1\}) = \{(HT),(TH),(TT)\} \in \mathcal{F}_X$, $X^{-1}(\{0\} \cup \{2\}) = \{(HH),(TT)\} \in \mathcal{F}_X$ and $X^{-1}(\{1\} \cup \{2\}) = \{(HT),(TH),(HH)\} \in \mathcal{F}_X$.
Example:
Consider the random variable Y, "the number of heads in the first trial"; then $Y(\{HH\}) = Y(\{HT\}) = 1$ and $Y(\{TT\}) = Y(\{TH\}) = 0$. However, Y does not preserve the event structure of $\mathcal{F}_X$, since $Y^{-1}(\{0\}) = \{(TH),(TT)\}$ is not an event in $\mathcal{F}_X$, and neither is $Y^{-1}(\{1\}) = \{(HH),(HT)\}$.
From the two examples above, we see that the question "is $X(\cdot) : S \to \mathbb{R}_X$ a random variable?" does not make any sense unless some $\sigma$-field $\mathcal{F}$ is also specified. In the case of the function X (number of heads) in the coin-tossing example, we see that it is a random variable relative to the $\sigma$-field $\mathcal{F}_X$. This, however, does not preclude Y from being a random variable with respect to some other $\sigma$-field $\mathcal{F}_Y$; for instance $\mathcal{F}_Y = \{S, \emptyset, \{(HH),(HT)\}, \{(TH),(TT)\}\}$. Intuition suggests that for any real-valued function $X(\cdot) : S \to \mathbb{R}$ we should be able to define a $\sigma$-field $\mathcal{F}_X$ on S such that X is a random variable. The concept of a $\sigma$-field generated by a random variable enables us to concentrate on particular aspects of an experiment without having to consider everything associated with the experiment at the same time. Hence, when we choose to define a random variable and the associated $\sigma$-field we make an implicit choice about the features of the random experiment we are interested in.
How do we decide that some function $X(\cdot) : S \to \mathbb{R}$ is a random variable relative to a given $\sigma$-field $\mathcal{F}$? From the discussion of the $\sigma$-field $\sigma(J)$ generated by the set $J = \{B_x : x \in \mathbb{R}\}$, where $B_x = (-\infty, x]$, we know that $\mathcal{B} = \sigma(J)$, and if $X(\cdot)$ is such that
$$X^{-1}((-\infty, x]) = \{s : X(s) \in (-\infty, x], s \in S\} \in \mathcal{F} \quad \text{for all } (-\infty, x] \in \mathcal{B},$$
then
$$X^{-1}(B) = \{s : X(s) \in B, s \in S\} \in \mathcal{F} \quad \text{for all } B \in \mathcal{B}.$$
In other words, when we want to establish that X is a random variable, or to define $P_X(\cdot)$, we have to look no further than the half-closed intervals $(-\infty, x]$ and the $\sigma$-field $\sigma(J)$ they generate, whatever the range $\mathbb{R}_X$. Using the shorthand notation $\{X(s) \le x\}$ instead of $\{s : X(s) \in (-\infty, x], s \in S\}$, in the above example
$$X^{-1}((-\infty, x]) = \{s : X(s) \le x\} = \begin{cases} \emptyset & x < 0, \\ \{(TT)\} & 0 \le x < 1, \\ \{(TT),(TH),(HT)\} & 1 \le x < 2, \\ \{(TT),(TH),(HT),(HH)\} & x \ge 2. \end{cases}$$
We can see that $X^{-1}((-\infty, x]) \in \mathcal{F}_X$ for all $x \in \mathbb{R}$, and thus $X(\cdot)$ is a random variable with respect to $\mathcal{F}_X$.
A random variable X relative to $\mathcal{F}$ maps S into a subset of the real line, and the Borel field $\mathcal{B}$ on $\mathbb{R}$ now plays the role of $\mathcal{F}$. In order to complete the model we need to assign probabilities to the elements B of $\mathcal{B}$. Common sense suggests that the assignment of probabilities to the events $B \in \mathcal{B}$ must be consistent with the probabilities assigned to the corresponding events in $\mathcal{F}$. Formally, we need to define a set function $P_X(\cdot) : \mathcal{B} \to [0, 1]$ such that
$$P_X(B) = P(X^{-1}(B)) \equiv P(\{s : X(s) \in B, s \in S\}) \quad \text{for all } B \in \mathcal{B}.$$
In the above example,
$$P_X(\{0\}) = \tfrac{1}{4}, \quad P_X(\{1\}) = \tfrac{1}{2}, \quad P_X(\{2\}) = \tfrac{1}{4} \quad \text{and} \quad P_X(\{0\} \cup \{1\}) = \tfrac{3}{4}.$$
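The induced set function $P_X(\cdot)$ can be obtained mechanically from the original probability space by collecting the probabilities of the pre-images $X^{-1}(\{x\})$; a small sketch (Python, with the fair-coin probabilities assumed equal to 1/4):

```python
from fractions import Fraction
from collections import defaultdict

P = {'HT': Fraction(1, 4), 'TH': Fraction(1, 4),
     'HH': Fraction(1, 4), 'TT': Fraction(1, 4)}
X = {'HT': 1, 'TH': 1, 'HH': 2, 'TT': 0}       # X = number of heads

# P_X({x}) = P(X^{-1}({x})) = P({s : X(s) = x})
PX = defaultdict(Fraction)
for s, x in X.items():
    PX[x] += P[s]

print(dict(PX))          # {1: 1/2, 2: 1/4, 0: 1/4}
print(PX[0] + PX[1])     # P_X({0} U {1}) = 3/4
```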
The question which arises is whether, in order to define the set function $P_X(\cdot)$, we need to consider all the elements of the Borel field $\mathcal{B}$. The answer is that we do not need to do so because, as argued above, any such element of $\mathcal{B}$ can be expressed in terms of the semi-closed intervals $(-\infty, x]$. This implies that by choosing such semi-closed intervals 'intelligently' we can define $P_X(\cdot)$ with the minimum of effort. For example, we may define:
$$P_X((-\infty, x]) = \begin{cases} 0 & x < 0, \\ \tfrac{1}{4} & 0 \le x < 1, \\ \tfrac{3}{4} & 1 \le x < 2, \\ 1 & x \ge 2. \end{cases}$$
As we can see, the semi-closed intervals were chosen to divide the real line at the points corresponding to the values taken by X. This way of defining the semi-closed intervals is clearly non-unique but will prove very convenient in the next subsection.
In fact, the event and probability structure of $(S, \mathcal{F}, P(\cdot))$ is preserved in the induced probability space $(\mathbb{R}, \mathcal{B}, P_X(\cdot))$. We traded S, a set of arbitrary elements, for $\mathbb{R}$, the real line; $\mathcal{F}$, a $\sigma$-field of subsets of S, for $\mathcal{B}$, the Borel field on the real line; and $P(\cdot)$, a set function defined on arbitrary sets, for $P_X(\cdot)$, a set function defined on semi-closed intervals of the real line.
3.2 The Distribution and Density Functions
In the previous section the introduction of the concept of a random variable X enabled us to trade the probability space $(S, \mathcal{F}, P(\cdot))$ for $(\mathbb{R}, \mathcal{B}, P_X(\cdot))$, which has a much more convenient mathematical structure. The latter probability space, however, is not as yet simple enough, because $P_X(\cdot)$ is still a set function, albeit on real-line intervals. In order to simplify it we need to transform it into a point function, with which we are so familiar.
Define a point function
$$F(\cdot) : \mathbb{R} \to [0, 1],$$
which is, seemingly, only a function of x. In fact, however, this function will do exactly the same job as $P_X(\cdot)$. Heuristically, this is achieved by defining $F(\cdot)$ as a point function by
$$P_X((-\infty, x]) = F(x) - F(-\infty), \quad \text{for all } x \in \mathbb{R},$$
and assigning the value zero to $F(-\infty)$.
Definition 8:
Let X be a r.v. defined on $(S, \mathcal{F}, P(\cdot))$. The point function $F(\cdot) : \mathbb{R} \to [0, 1]$ defined by
$$F(x) = P_X((-\infty, x]) = \Pr(X \le x), \quad \text{for all } x \in \mathbb{R},$$
is called the distribution function (DF) of X and satisfies the following properties:
(a) $F(x)$ is non-decreasing;
(b) $F(-\infty) = \lim_{x \to -\infty} F(x) = 0$ and $F(\infty) = \lim_{x \to \infty} F(x) = 1$;
(c) $F(x)$ is continuous from the right, i.e. $\lim_{h \downarrow 0} F(x + h) = F(x)$, $\forall x \in \mathbb{R}$.
The great advantage of $F(\cdot)$ over $P(\cdot)$ and $P_X(\cdot)$ is that the former is a point function and can be represented in the form of an algebraic formula, the kind of function we are so familiar with from elementary mathematics.
Definition 9:
A random variable X is called discrete if its range $\mathbb{R}_X$ is some subset of the set of integers $\mathbb{Z} = \{0, \pm 1, \pm 2, \ldots\}$.
Definition 10:
A random variable X is called continuous if its distribution function $F(x)$ is continuous for all $x \in \mathbb{R}$ and there exists a non-negative function $f(\cdot)$ on the real line such that
$$F(x) = \int_{-\infty}^{x} f(u)\,du, \quad \forall x \in \mathbb{R}.$$
In defining the concept of a continuous r.v. we introduced the function $f(x)$, which is directly related to $F(x)$.
Definition 11:
Let $F(x)$ be the DF of the r.v. X. The non-negative function $f(x)$ defined by
$$F(x) = \int_{-\infty}^{x} f(u)\,du, \quad \forall x \in \mathbb{R} \quad \text{(continuous case)}$$
or
$$F(x) = \sum_{u \le x} f(u), \quad \forall x \in \mathbb{R} \quad \text{(discrete case)}$$
is said to be the probability density function (pdf) of X.
Example:
Let X be uniformly distributed on the interval $[a, b]$, written $X \sim U(a, b)$. The DF of X takes the form
$$F(x) = \begin{cases} 0 & x < a, \\ \dfrac{x - a}{b - a} & a \le x < b, \\ 1 & x \ge b. \end{cases}$$
The corresponding pdf of X is given by
$$f(x) = \begin{cases} \dfrac{1}{b - a} & a \le x \le b, \\ 0 & \text{elsewhere.} \end{cases}$$
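A quick numerical check of the relation $F(x) = \int_{-\infty}^{x} f(u)\,du$ for the uniform case; a sketch assuming Python with SciPy is available (the endpoints a = 2, b = 5 are arbitrary):

```python
from scipy.stats import uniform
from scipy.integrate import quad

a, b = 2.0, 5.0
U = uniform(loc=a, scale=b - a)   # SciPy parameterizes U(a, b) by loc and scale

x = 3.7
F_direct   = U.cdf(x)                 # (x - a)/(b - a)
F_integral = quad(U.pdf, a, x)[0]     # integral of the density up to x
print(F_direct, F_integral, (x - a) / (b - a))
```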
Although we could use the distribution function $F(x)$ as the fundamental concept of our probability model, we prefer to adopt the density function $f(x)$ instead, because we gain in simplicity and added intuition. It enhances intuition to view the density function as distributing probability mass over the range of X. The pdf satisfies the following properties:
(a) $f(x) \ge 0$, $\forall x \in \mathbb{R}$;
(b) $\int_{-\infty}^{\infty} f(x)\,dx = 1$;
(c) $\Pr(a < X < b) = \int_{a}^{b} f(x)\,dx$;
(d) $f(x) = \frac{d}{dx}F(x)$, at every point where the DF is differentiable.
3.3 The Notion of a Probability Model
For a continuous random variable it is impossible to obtain the pdf $f(x)$ from the random experiment E directly (it is either too costly or impossible to observe all of X); we have to model a particular real phenomenon using previous experience in modeling similar phenomena or a preliminary study of the data.
By a parameterized probability model we may transform the original uncertainty related to E into uncertainty related to the unknown parameters of $f(\cdot)$; in order to emphasize this we write the pdf as $f(x; \theta)$. We are now in a position to define our probability model in the form of a parametric family of density functions, which we denote by
$$\Phi = \{f(x; \theta),\ \theta \in \Theta\}.$$
$\Phi$ represents a set of density functions indexed by the unknown parameter $\theta$, which is assumed to belong to a parameter space $\Theta$.
Example
The Pareto distribution:
$$\Phi = \left\{ f(x; \theta) = \frac{\theta}{x_0}\left(\frac{x_0}{x}\right)^{\theta + 1},\ x > x_0,\ \theta \in \Theta \right\},$$
where $x_0$ is a known number and $\Theta = \mathbb{R}_+$ is the positive real line. For each value of $\theta$ in $\Theta$, $f(x; \theta)$ represents a different density.
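To see how different values of $\theta$ index different members of the family, one can simply evaluate the density over a grid; a minimal sketch (Python with NumPy, with $x_0 = 1$ chosen arbitrarily as the known lower bound):

```python
import numpy as np

def pareto_pdf(x, theta, x0=1.0):
    """Pareto density f(x; theta) = (theta/x0) * (x0/x)**(theta + 1), for x > x0."""
    x = np.asarray(x, dtype=float)
    return np.where(x > x0, (theta / x0) * (x0 / x) ** (theta + 1), 0.0)

x_grid = np.linspace(1.01, 5.0, 5)
for theta in (0.5, 1.0, 2.0):     # each theta picks out a different density in Phi
    print(theta, pareto_pdf(x_grid, theta))
```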
When a particular parametric family of densities is chosen as the appropriate probability model for modeling a real phenomenon, we are in effect assuming that the observed data available were generated by the "chance mechanism" described by one of the densities in $\Phi$. The original uncertainty relating to the outcome of a particular trial of the experiment has now been transformed into uncertainty relating to the choice of one $\theta$ in $\Theta$, say $\theta_0$, which determines uniquely the density $f(x; \theta_0)$ that gave rise to the observed data. The task of estimating $\theta$ or testing some hypothesis about $\theta$ using the observed data belongs to statistical inference, considered in the next chapter.
3.4 Some Univariate Distributions
1. Continuous Distributions:
(i). The normal distribution:
A random variable X, $x \in \mathbb{R}$, is normally distributed if its probability density function is given by
$$f(x; \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\}, \quad \mu \in \mathbb{R},\ \sigma^2 \in \mathbb{R}_+.$$
We often express this by $X \sim N(\mu, \sigma^2)$.
As far as the shape of the normal distribution and density functions are concerned, we note the following characteristics:
(a) The normal density is symmetric about $\mu$, i.e.
$$f(\mu + k) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{k^2}{2\sigma^2} \right\} = f(\mu - k)$$
$$\Rightarrow \Pr(\mu \le X \le \mu + k) = \Pr(\mu - k \le X \le \mu), \quad k > 0,$$
and for the DF,
$$F(-x) = 1 - F(x + 2\mu).$$
(b) The density function attains its maximum at $x = \mu$:
$$\frac{df(x)}{dx} = -f(x)\,\frac{2(x - \mu)}{2\sigma^2} = 0 \ \Rightarrow\ x = \mu, \quad \text{and } f(\mu) = \frac{1}{\sigma\sqrt{2\pi}}.$$
(c) The density function has two points of inflection at $x = \mu \pm \sigma$:
$$\frac{d^2 f(x)}{dx^2} = -\frac{1}{\sigma^3\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\}\left[ 1 - \frac{(x - \mu)^2}{\sigma^2} \right] = 0 \ \Rightarrow\ x = \mu \pm \sigma.$$
The density function of the random variable $Z = \frac{X - \mu}{\sigma}$ is
$$f(z) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\tfrac{1}{2} z^2 \right\},$$
which does not depend on the unknown parameters $\mu$, $\sigma$. This is called the standard normal distribution, which we write as $Z \sim N(0, 1)$.
(ii). Exponential family of distributions
A continuous random variable X has a gamma distribution with parameters $\alpha$ and $\lambda$, written
$$f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, e^{-\lambda x} x^{\alpha - 1}, \quad x \ge 0,\ \alpha > 0,\ \lambda > 0.$$
Many familiar distributions are special cases, including the exponential ($\alpha = 1$) and the chi-squared ($\lambda = 1/2$, $\alpha = n/2$).
2. Discrete Random variables:
(i). Poisson Distribution.
(ii). Binomial distribution.
(iii). Bernoulli distribution.
3.5 Numerical characteristics of random variables
In modeling real phenomena using a probability model of the form $\Phi = \{f(x;\theta),\ \theta \in \Theta\}$ we need to be able to postulate such models having only a general quantitative description of the random variable in question at our disposal a priori. Such information comes in the form of certain numerical characteristics of random variables, such as the mean, the variance, the skewness and kurtosis coefficients and higher moments. Indeed, sometimes such numerical characteristics actually determine the type of probability density in $\Phi$. Moreover, the analysis of density functions is usually undertaken in terms of these numerical characteristics.
3.5.1 Mathematical Expectation
The mean of X, denoted by $E(X)$, is defined by
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx \quad \text{for a continuous r.v.}$$
and
$$E(X) = \sum_{i} x_i f(x_i) \quad \text{for a discrete r.v.},$$
when the integral or sum exists. We always denote $E(X) = \mu$.
Example:
If $X \sim U(a, b)$, i.e. X is a uniformly distributed r.v., then
$$E(X) = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{a}^{b} x\,\frac{1}{b - a}\,dx = \frac{1}{2}\,\frac{1}{b - a}\, x^2 \Big|_{a}^{b} = \frac{a + b}{2}.$$
In the above example the mean of the r.v. X existed. The condition which guarantees the existence of $E(X)$ is that
$$E|X| = \int_{-\infty}^{\infty} |x| f(x)\,dx < \infty \quad (\text{since } |E(X)| \le E|X|).$$
One example where the mean does not exist is the case of a Cauchy distributed r.v. with pdf given by
$$f(x) = \frac{1}{\pi(1 + x^2)}, \quad x \in \mathbb{R}.$$
Then the absolute expectation of X would be
$$E|X| = \int_{-\infty}^{\infty} |x| f(x)\,dx = \frac{1}{\pi}\int_{-\infty}^{\infty} \frac{|x|}{1 + x^2}\,dx = \frac{2}{\pi}\int_{0}^{\infty} \frac{x}{1 + x^2}\,dx \quad \text{(by symmetry)}$$
$$= \frac{2}{\pi}\lim_{a \to \infty}\int_{0}^{a} \frac{x}{1 + x^2}\,dx = \frac{1}{\pi}\lim_{a \to \infty} \log_e(1 + a^2) = \infty.$$
That is, $E(X)$ does not exist for the Cauchy distribution.
Some properties of the expectation:
(a) $E(c) = c$, if c is a constant.
(b) $E(aX_1 + bX_2) = aE(X_1) + bE(X_2)$ for any two r.v.'s $X_1$ and $X_2$ whose means exist, where a, b are real constants.
(c) $\Pr(X \ge \varepsilon\,E(X)) \le 1/\varepsilon$ for a positive r.v. X and $\varepsilon > 0$; this is the so-called Markov inequality (a simulation check follows below).
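The following is a minimal simulation sketch of property (c) (Python with NumPy, using an exponential r.v. as an arbitrary positive example); it compares the empirical frequency of $\{X \ge \varepsilon E(X)\}$ with the Markov bound $1/\varepsilon$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=200_000)   # a positive r.v. with E(X) = 2

for eps in (1.5, 2.0, 3.0):
    freq  = np.mean(x >= eps * x.mean())       # empirical Pr(X >= eps * E(X))
    bound = 1.0 / eps                          # Markov bound
    print(eps, freq, bound)                    # freq should not exceed bound (up to noise)
```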
3.5.2 The Variance
Related to the mean as a measure of location is the dispersion measure called the variance, defined by
$$Var(X) = E[X - E(X)]^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = E(X^2) - \mu^2 = \sigma^2.$$
Note that the square root of the variance is referred to as the standard deviation.
Example:
Let $X \sim U(a, b)$; then
$$Var(X) = \int_{a}^{b} \left( x - \frac{a + b}{2} \right)^2 \frac{1}{b - a}\,dx = \frac{(b - a)^2}{12}.$$
Some properties of the variance:
(a) $Var(c) = 0$ for any constant c.
(b) $Var(aX) = a^2\,Var(X)$, for constant a.
(c) Chebyshev's inequality: $\Pr(|X - E(X)| \ge k) \le Var(X)/k^2$ (see the sketch below).
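Both the uniform-variance formula of the example and Chebyshev's inequality are easy to check by simulation; a sketch (Python with NumPy, with a = 2, b = 5 again arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 5.0
x = rng.uniform(a, b, size=200_000)

print(x.var(), (b - a) ** 2 / 12)          # sample variance vs (b - a)^2 / 12

k = 1.0
lhs = np.mean(np.abs(x - x.mean()) >= k)   # empirical Pr(|X - E(X)| >= k)
rhs = x.var() / k ** 2                     # Chebyshev bound Var(X)/k^2
print(lhs, rhs)                            # lhs should not exceed rhs
```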
3.5.3 Higher Moments
(a) The r-th raw moment is the moment about $x = 0$, defined by
$$\mu_r' \equiv E(X^r) = \int_{-\infty}^{\infty} x^r f(x)\,dx, \quad r = 0, 1, 2, \ldots$$
(b) The r-th central moment is defined as the moment around $x = \mu$:
$$\mu_r \equiv E(X - \mu)^r = \int_{-\infty}^{\infty} (x - \mu)^r f(x)\,dx, \quad r = 0, 1, 2, \ldots$$
These higher moments are sometimes useful in providing us with further information relating to the distribution and density functions of r.v.'s. In particular, the 3rd and 4th central moments, when standardized in the form
$$\alpha_3 = \frac{\mu_3}{\sigma^3} \quad \text{and} \quad \alpha_4 = \frac{\mu_4}{\sigma^4},$$
are referred to as measures of skewness and kurtosis and provide us with measures of asymmetry and peakedness, respectively.
4 Random Vectors and Their Distributions
The probability model formulated in the previous section was in the form of a parametric family of densities associated with a random variable X: $\Phi = \{f(x;\theta),\ \theta \in \Theta\}$. In practice, however, there are many observable phenomena where the outcome comes in the form of several quantitative attributes. For example, data on personal income might be related to number of children, social class, type of occupation, age class, etc. In order to be able to model such real phenomena we extend the single r.v. framework to one for multidimensional r.v.'s, or random vectors, that is,
$$\mathbf{x} = (X_1, X_2, \ldots, X_n)',$$
where each $X_i$, $i = 1, 2, \ldots, n$, measures a particular quantifiable attribute of the random experiment's (E) outcome.
4.1 Joint distribution and density functions
Consider the random experiment E of tossing a fair coin twice. Define the function $X_1(\cdot)$ to be the number of heads and $X_2(\cdot)$ to be the number of tails. The function $(X_1(\cdot), X_2(\cdot)) : S \to \mathbb{R}^2$ is a two-dimensional vector function which assigns to each element s of S the pair of ordered numbers $(x_1, x_2)$, where $x_1 = X_1(s)$, $x_2 = X_2(s)$.
Definition 12:
A (bivariate) random vector $\mathbf{x}(\cdot)$ is a vector function
$$\mathbf{x}(\cdot) : S \to \mathbb{R}^2,$$
such that for any pair of real numbers $\mathbf{x} = (x_1, x_2)$, the event
$$\mathbf{X}^{-1}((-\infty, \mathbf{x}]) = \{s : -\infty < X_1(s) \le x_1,\ -\infty < X_2(s) \le x_2,\ s \in S\} \in \mathcal{F}.$$
The random vector induces a probability space $(\mathbb{R}^2, \mathcal{B}^2, P_X(\cdot))$, where $\mathcal{B}^2$ ($\equiv \mathcal{B} \times \mathcal{B}$) are the Borel subsets of the plane and $P_X(\cdot)$ is a probability set function defined over events in $\mathcal{B}^2$, in a way which preserves the probability structure of the original probability space $(S, \mathcal{F}, P(\cdot))$. This is achieved by attributing to each $B \in \mathcal{B}^2$ the probability
$$P_X(B) = P(\{s : (X_1(s), X_2(s)) \in B\})$$
or
$$P_X((-\infty, \mathbf{x}]) = \Pr(X_1 \le x_1, X_2 \le x_2).$$
We can go a step further and reduce $P_X(\cdot)$ to a point function $F(x_1, x_2)$, which we call the joint (cumulative) distribution function.
Definition 13:
Let $\mathbf{x} \equiv (X_1, X_2)$ be a random vector defined on $(S, \mathcal{F}, P(\cdot))$. The function
$$F(\cdot, \cdot) : \mathbb{R}^2 \to [0, 1],$$
such that
$$F(\mathbf{x}) \equiv F(x_1, x_2) = P_X((-\infty, \mathbf{x}]) = \Pr(X_1 \le x_1, X_2 \le x_2) \equiv \Pr(\mathbf{X} \le \mathbf{x}),$$
is said to be the joint distribution function of $\mathbf{x}$.
Example:
In the coin-tossing example above, the random vector $\mathbf{x}(\cdot)$ takes the values $(1, 1)$, $(2, 0)$, $(0, 2)$ with probabilities $\tfrac{1}{2}$, $\tfrac{1}{4}$ and $\tfrac{1}{4}$, respectively. In order to derive the joint distribution function (DF) we have to define all the events of the form $\{s : X_1(s) \le x_1, X_2(s) \le x_2, s \in S\}$ for all $(x_1, x_2) \in \mathbb{R}^2$:
$$\{s : X_1(s) \le x_1, X_2(s) \le x_2, s \in S\} = \begin{cases} \emptyset & x_1 < 0 \text{ or } x_2 < 0, \\ \emptyset & 0 \le x_1 < 1,\ 0 \le x_2 < 2, \\ \{(TT)\} & 0 \le x_1 < 1,\ x_2 \ge 2, \\ \emptyset & 1 \le x_1 < 2,\ 0 \le x_2 < 1, \\ \{(TH),(HT)\} & 1 \le x_1 < 2,\ 1 \le x_2 < 2, \\ \{(TT),(TH),(HT)\} & 1 \le x_1 < 2,\ x_2 \ge 2, \\ \{(HH)\} & x_1 \ge 2,\ 0 \le x_2 < 1, \\ \{(TH),(HT),(HH)\} & x_1 \ge 2,\ 1 \le x_2 < 2, \\ S & x_1 \ge 2,\ x_2 \ge 2. \end{cases}$$
It is worth noting that here the events indexed by points of the plane do not accumulate in the way the half-lines $(-\infty, x]$ do in the single random variable case, since, for example, $\{X_1 \le 0, X_2 \le 2\} \not\subset \{X_1 \le 1, X_2 \le 1\}$.
The joint DF of $X_1$ and $X_2$ is given by
$$F(x_1, x_2) = \begin{cases} 0 & x_1 < 0 \text{ or } x_2 < 0, \\ 0 & 0 \le x_1 < 1,\ 0 \le x_2 < 2, \\ \tfrac{1}{4} & 0 \le x_1 < 1,\ x_2 \ge 2, \\ 0 & 1 \le x_1 < 2,\ 0 \le x_2 < 1, \\ \tfrac{1}{2} & 1 \le x_1 < 2,\ 1 \le x_2 < 2, \\ \tfrac{3}{4} & 1 \le x_1 < 2,\ x_2 \ge 2, \\ \tfrac{1}{4} & x_1 \ge 2,\ 0 \le x_2 < 1, \\ \tfrac{3}{4} & x_1 \ge 2,\ 1 \le x_2 < 2, \\ 1 & x_1 \ge 2,\ x_2 \ge 2. \end{cases}$$
Definition 14:
The joint DF of $X_1$ and $X_2$ is called continuous if there exists a non-negative function $f(x_1, x_2)$ such that
$$F(x_1, x_2) = \int_{-\infty}^{x_1} \int_{-\infty}^{x_2} f(u, v)\,dv\,du.$$
The function $f(x_1, x_2)$ is called the joint density function of $X_1$ and $X_2$. This definition implies the following properties for $f(x_1, x_2)$:
(a) $\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1\,dx_2 = 1$;
(b) $\Pr(a < X_1 \le b,\ c < X_2 \le d) = \int_{a}^{b}\int_{c}^{d} f(x_1, x_2)\,dx_2\,dx_1$;
(c) $f(x_1, x_2) = \frac{\partial^2}{\partial x_1 \partial x_2} F(x_1, x_2)$, if $f(\cdot)$ is continuous at $(x_1, x_2)$.
4.2 Some Bivariate distributions
1. Bivariate normal distribution
$$f(x_1, x_2; \theta) = \frac{(1 - \rho^2)^{-1/2}}{2\pi\sigma_1\sigma_2} \exp\left\{ -\frac{1}{2(1 - \rho^2)}\left[ \left(\frac{x_1 - \mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1 - \mu_1}{\sigma_1}\right)\left(\frac{x_2 - \mu_2}{\sigma_2}\right) + \left(\frac{x_2 - \mu_2}{\sigma_2}\right)^2 \right] \right\},$$
$x_1, x_2 \in \mathbb{R}$, and $\theta = (\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho) \in \mathbb{R}^2 \times \mathbb{R}_+^2 \times [-1, 1]$.
The extension of the concept of a random variable X to that of a random vector $\mathbf{x} = (X_1, X_2, \ldots, X_n)$ enables us to generalize the probability model
$$\Phi = \{f(x; \theta),\ \theta \in \Theta\}$$
to that of a parametric family of joint density functions
$$\Phi = \{f(x_1, x_2, \ldots, x_n; \theta),\ \theta \in \Theta\}.$$
This is a very important generalization since in most applied disciplines the real phenomena to be modeled are usually multidimensional, in the sense that there is more than one quantifiable feature to be considered.
Notation:
For nonstochastic cases:
(1) a, x, y, etc.: scalar element, $(1 \times 1)$.
(2) $\mathbf{a}, \mathbf{x}, \mathbf{y}$: column vector, $(n \times 1)$.
(3) $\mathbf{A}, \mathbf{X}, \mathbf{Y}$: matrix, $(n \times n)$.
For stochastic cases:
(1) X: random variable; x: the value that X takes; $f_X(x)$: the probability (density) that the random variable X takes on the value x.
(2) $\mathbf{x} = (X_1, X_2, \ldots, X_n)'$: random vector; $\mathbf{x} = (x_1, x_2, \ldots, x_n)'$: the value that $\mathbf{x}$ takes; $f_{\mathbf{x}}(\mathbf{x})$: the (joint) probability (density) that $(X_1, X_2, \ldots, X_n)'$ takes on the values $\mathbf{x} = (x_1, x_2, \ldots, x_n)$.
(3) $\mathbf{X}$, etc.: matrix (such as the regressor matrix and the variance-covariance matrix).
4.3 Marginal distributions
Let $\mathbf{x} \equiv (X_1, X_2)$ be a bivariate random vector defined on $(S, \mathcal{F}, P)$ with joint distribution function $F(x_1, x_2)$. The question which naturally arises is whether we could separate $X_1$ and $X_2$ and consider them as individual random variables. The answer to this question leads us to the concept of a marginal distribution.
Definition 15:
The marginal distribution functions of $X_1$ and $X_2$ are defined by
$$F_1(x_1) = \lim_{x_2 \to \infty} F(x_1, x_2)$$
and
$$F_2(x_2) = \lim_{x_1 \to \infty} F(x_1, x_2).$$
Having separated $X_1$ and $X_2$ we need to see whether they can be considered as single r.v.'s defined on the same probability space. In defining a random vector we imposed the condition that
$$\{s : X_1(s) \le x_1,\ X_2(s) \le x_2\} \in \mathcal{F}.$$
In the definition of the marginal distribution we used the event
$$\{s : X_1(s) \le x_1,\ X_2(s) \le \infty\},$$
which we know belongs to $\mathcal{F}$. This event, however, can be written as the intersection of two sets of the form
$$\{s : X_1(s) \le x_1\} \cap \{s : X_2(s) \le \infty\},$$
but the second set is S, i.e. $\{s : X_2(s) \le \infty\} = S$, which implies that
$$\{s : X_1(s) \le x_1\} \cap \{s : X_2(s) \le \infty\} = \{s : X_1(s) \le x_1\},$$
which indeed belongs to $\mathcal{F}$ and is exactly the condition needed for $X_1$ to be a r.v. with distribution function $F_1(x_1)$; the same is true for $X_2$.
The marginal density functions of $X_1$ and $X_2$ are defined by
$$f_1(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2$$
and
$$f_2(x_2) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1.$$
Example:
For the random vector $(X_1, X_2)$ = (no. of heads, no. of tails) above, the marginal density $f_1(x_1)$ is "recovered" from
$$f_1(0) = f(0, 0) + f(0, 1) + f(0, 2) = 0 + 0 + 1/4 = 1/4,$$
$$f_1(1) = f(1, 0) + f(1, 1) + f(1, 2) = 0 + 1/2 + 0 = 1/2,$$
$$f_1(2) = f(2, 0) + f(2, 1) + f(2, 2) = 1/4 + 0 + 0 = 1/4.$$
It is quite obvious that knowing the joint density function of $X_1$ and $X_2$ we can derive their marginal density functions; the reverse, however, is not true in general. Knowledge of $f_1(x_1)$ and $f_2(x_2)$ is enough to derive $f(x_1, x_2)$ only when
$$f(x_1, x_2) = f_1(x_1) \cdot f_2(x_2),$$
in which case we say that $X_1$ and $X_2$ are independent r.v.'s. Independence in terms of the distribution functions takes the same form:
$$F(x_1, x_2) = F_1(x_1) \cdot F_2(x_2).$$
4.4 Conditional distributions
In this section we consider the question of simplifying probability models by conditioning with respect to some subset of the r.v.'s.
In the context of the probability space $(S, \mathcal{F}, P(\cdot))$ the conditional probability of event $A_1$ given event $A_2$ is defined by
$$P(A_1 | A_2) = \frac{P(A_1 \cap A_2)}{P(A_2)}, \quad P(A_2) > 0,\ A_1, A_2 \in \mathcal{F}.$$
By using an analogous definition in terms of density functions, we define the conditional density of $X_1$ given $X_2 = x_2$ to be
$$f_{X_1|X_2}(x_1 | x_2) = \frac{f(x_1, x_2)}{f_2(x_2)}, \quad x_1 \in \mathbb{R}_{X_1}.$$
Similarly,
$$f_{X_2|X_1}(x_2 | x_1) = \frac{f(x_1, x_2)}{f_1(x_1)}, \quad x_2 \in \mathbb{R}_{X_2},$$
provided $f_1(x_1) > 0$ and $f_2(x_2) > 0$.
(For a continuous random variable $X_2$ the conditioning event $\{X_2 = x_2\}$ has probability zero, so taken literally this definition of the conditional density does not make sense; the mathematical apparatus needed to bypass this problem is beyond our scope. See Billingsley (1979), pp. 354-407.)
Two things should be noted:
1. The conditional density is a proper density function, i.e. for a given $X_2 = x_2$,
(i) $f_{X_1|X_2}(x_1 | x_2) \ge 0$;
(ii) $\int_{-\infty}^{\infty} f_{X_1|X_2}(x_1 | x_2)\,dx_1 = 1$.
2. Knowledge of the conditional and marginal densities is equivalent to knowledge of the joint density, i.e.
$$f(x_1, x_2) = f_{X_1|X_2}(x_1 | x_2) \cdot f_2(x_2) = f_{X_2|X_1}(x_2 | x_1) \cdot f_1(x_1), \quad (x_1, x_2) \in \mathbb{R}^2.$$
An immediate implication of the last equation is that if $X_1$ and $X_2$ are independent, then
$$f_{X_1|X_2}(x_1 | x_2) = f_1(x_1), \quad x_1 \in \mathbb{R}_{X_1}.$$
Lemma:
Let $g : \mathbb{R}^k \to \mathbb{R}^l$ be a continuous function and let $\mathbf{z}$ and $\mathbf{y}$ be independent random vectors. Then $g(\mathbf{z})$ and $g(\mathbf{y})$ are independent.
Proof:
Let $A_1 = \{\mathbf{z} : g(\mathbf{z}) \le \mathbf{a}_1\}$ and $A_2 = \{\mathbf{y} : g(\mathbf{y}) \le \mathbf{a}_2\}$. Then
$$F_{g(\mathbf{z})g(\mathbf{y})}(\mathbf{a}_1, \mathbf{a}_2) \equiv P[g(\mathbf{z}) \le \mathbf{a}_1,\ g(\mathbf{y}) \le \mathbf{a}_2] = P[\mathbf{z} \in A_1,\ \mathbf{y} \in A_2] = P[\mathbf{z} \in A_1] \cdot P[\mathbf{y} \in A_2]$$
$$= P[g(\mathbf{z}) \le \mathbf{a}_1] \cdot P[g(\mathbf{y}) \le \mathbf{a}_2] = F_{g(\mathbf{z})}(\mathbf{a}_1)\,F_{g(\mathbf{y})}(\mathbf{a}_2)$$
for all $\mathbf{a}_1, \mathbf{a}_2 \in \mathbb{R}^l$. Hence $g(\mathbf{z})$ and $g(\mathbf{y})$ are independent.
Exercise:
Let $\mathbf{X} = (X_1, X_2, X_3)$ be a continuous random vector having joint density
$$f(x_1, x_2, x_3) = 6\exp(-x_1 - x_2 - x_3), \quad 0 < x_1 < x_2 < x_3.$$
Find the marginal pdf $f_2(x_2)$ and the conditional density of $X_3$ given $(X_1, X_2) = (x_1, x_2)$.
5 Functions of Random Variables
One of the most important problems in probability theory and statistical inference is to derive the distribution of a function $h(X_1, X_2, \ldots, X_n)$ when the distribution of the random vector $\mathbf{x} = (X_1, X_2, \ldots, X_n)$ is known. This problem is important for at least two reasons:
(a) it is often the case that in modeling observable phenomena we are primarily interested in functions of random variables; and
(b) in statistical inference the quantities of primary interest are commonly functions of random variables. It is no exaggeration to say that the whole of statistical inference is based on our ability to derive the distribution of various functions of r.v.'s.
5.1 (Single) Function of one random variable (one → one transformation)
Definition 16:
A function $h(\cdot) : \mathbb{R}_X \to \mathbb{R}$ is said to be a Borel function if for any $a \in \mathbb{R}$ and $x \in \mathbb{R}_X$, the set $B_h = \{x : h(x) \le a\}$ is a Borel set, i.e. $B_h \in \mathcal{B}$, where $\mathcal{B}$ is the Borel field on $\mathbb{R}$.
Requiring $h(\cdot)$ to be a Borel function is an obvious condition to impose, given that we need $h(X)$ to be a random variable itself.
Having ensured that the function $h(\cdot)$ of the r.v. X is itself a r.v., $Y = h(X)$, we want to derive the distribution of Y when the distribution of X is known.
Lemma:
Let X be a continuous r.v. and $Y = h(X)$, where $h(x)$ is differentiable for all $x \in \mathbb{R}_X$ and either $dh(x)/dx > 0$ or $dh(x)/dx < 0$ for all x. Then the density function of Y is given by
$$f_Y(y) = f_X(h^{-1}(y)) \left| \frac{d}{dy} h^{-1}(y) \right| \quad \text{for } a < y < b,$$
where $|\cdot|$ stands for the absolute value and a and b refer to the smallest and largest values y can take, respectively.
Example:
Let $X \sim N(\mu, \sigma^2)$ and $Y = (X - \mu)/\sigma$, which implies that $dh(x)/dx = 1/\sigma > 0$ for all $x \in \mathbb{R}$, since $\sigma > 0$ by definition; $h^{-1}(y) = \sigma y + \mu$ and $dh^{-1}(y)/dy = \sigma$. Thus, since
$$f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}\left( \frac{x - \mu}{\sigma} \right)^2 \right\},$$
therefore
$$f_Y(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}\left( \frac{\sigma y + \mu - \mu}{\sigma} \right)^2 \right\} (\sigma) = \frac{1}{\sqrt{2\pi}} \exp\left\{ -\tfrac{1}{2} y^2 \right\},$$
i.e. $Y \sim N(0, 1)$, the standard normal distribution.
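The lemma can be checked numerically for this example by comparing the density it implies, and a histogram of simulated values of $Y = (X - \mu)/\sigma$, with the standard normal density; a sketch (Python with NumPy and SciPy, with $\mu = 3$, $\sigma = 2$ chosen arbitrarily):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
mu, sigma = 3.0, 2.0

x = rng.normal(mu, sigma, size=200_000)
y = (x - mu) / sigma                               # Y = h(X)

# Density implied by the lemma: f_X(h^{-1}(y)) * |dh^{-1}/dy| = f_X(sigma*y + mu) * sigma
grid = np.linspace(-3, 3, 7)
lemma_density = norm.pdf(sigma * grid + mu, loc=mu, scale=sigma) * sigma
print(np.allclose(lemma_density, norm.pdf(grid)))  # equals the N(0, 1) density

# Empirical check: histogram of y is close to the N(0, 1) density
hist, edges = np.histogram(y, bins=50, range=(-3, 3), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - norm.pdf(centers))))    # small discrepancy
```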
In cases where the conditions of the lemma above are not satisfied we need to derive the distribution from the relationship
$$F_Y(y) = \Pr(h(X) \le y) = \Pr(X \in h^{-1}((-\infty, y])).$$
Exercise:
Let $X \sim N(\mu, \sigma^2)$. Find the pdf of Y, where $Y = X^2$.
5.2 (Single) Function of several random variables (n → one transformation)
As in the case of a single r.v., for a Borel function $h(\cdot) : \mathbb{R}^n \to \mathbb{R}$ and a random vector $\mathbf{x} = (X_1, X_2, \ldots, X_n)$, $h(\mathbf{x})$ is a random variable.
Three commonly used functions of random variables (taking two random variables as an example) are:
1. the distribution of $X_1 + X_2$;
2. the distribution of $X_1 / X_2$;
3. the distribution of $Y = \min(X_1, X_2)$.
Exercise:
Let $X_i \sim U(-1, 1)$, $i = 1, 2, 3$, and $Y = X_1 + X_2 + X_3$. Find the pdf of Y.
5.3 Functions of several random variables (n → n transformation)
After considering various simple functions of r.v.'s separately, let us consider them together. Let $\mathbf{x} = (X_1, X_2, \ldots, X_n)'$ be a random vector with joint probability density function $f_{\mathbf{x}}(x_1, x_2, \ldots, x_n)$ and define the one-to-one transformation
$$Y_1 = h_1(X_1, X_2, \ldots, X_n),$$
$$Y_2 = h_2(X_1, X_2, \ldots, X_n),$$
$$\vdots$$
$$Y_n = h_n(X_1, X_2, \ldots, X_n),$$
whose inverses take the form $h_i^{-1}(\cdot) = g_i(\cdot)$, $i = 1, 2, \ldots, n$, that is,
$$X_1 = g_1(Y_1, Y_2, \ldots, Y_n),$$
$$X_2 = g_2(Y_1, Y_2, \ldots, Y_n),$$
$$\vdots$$
$$X_n = g_n(Y_1, Y_2, \ldots, Y_n).$$
Assume:
(a) $h_i(\cdot)$ and $g_i(\cdot)$ are continuous;
(b) the partial derivatives $\partial X_i / \partial Y_j$, $i, j = 1, 2, \ldots, n$, exist and are continuous; and
(c) the Jacobian of the inverse transformation
$$J = \det\left[ \frac{\partial (X_1, X_2, \ldots, X_n)'}{\partial (Y_1, Y_2, \ldots, Y_n)} \right] \ne 0.$$
Then
$$f(y_1, y_2, \ldots, y_n) = f(g_1(y_1, y_2, \ldots, y_n), \ldots, g_n(y_1, y_2, \ldots, y_n))\,|J|.$$
Exercise:
Let $X_i \sim U(0, 1)$, $i = 1, 2$, be two independent r.v.'s and let $Y_1 = h_1(X_1, X_2) = X_1 + X_2$, $Y_2 = h_2(X_1, X_2) = X_1 X_2$. Find the joint pdf $f(y_1, y_2)$ and the marginal densities $f_1(y_1)$ and $f_2(y_2)$.
5.4 Functions of normally distributed random variables
Lemma 1
If $X_i \sim N(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, n$, are independent r.v.'s, then $\left(\sum_{i=1}^{n} X_i\right) \sim N\left(\sum_{i=1}^{n} \mu_i,\ \sum_{i=1}^{n} \sigma_i^2\right)$ (normal).
Lemma 2
If $X_i \sim N(0, 1)$, $i = 1, 2, \ldots, n$, are independent r.v.'s, then $\left(\sum_{i=1}^{n} X_i^2\right) \sim \chi^2(n)$, chi-square with n degrees of freedom.
In particular, if $Y \sim \chi^2(n)$ then the density function of Y is
$$f_Y(y; n) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, y^{(n/2) - 1} e^{-y/2}, \quad y > 0,\ n = 1, 2, \ldots,$$
$$E(Y) = n, \quad Var(Y) = 2n.$$
Lemma 3
If $X_1 \sim N(0, 1)$ and $X_2 \sim \chi^2(n)$ are independent r.v.'s, then $X_1 / \sqrt{X_2 / n} \sim t(n)$, Student's t with n degrees of freedom.
In particular, if $W \sim t(n)$ then the density function of W is
$$f_W(w; n) = \frac{1}{\sqrt{n\pi}}\, \frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)} \left( 1 + \frac{w^2}{n} \right)^{-(n+1)/2}, \quad n > 0,\ w \in \mathbb{R},$$
$$E(W) = 0, \quad Var(W) = \frac{n}{n - 2},\ n > 2, \quad \alpha_4 = 3 + \frac{6}{n - 4},\ n > 4.$$
These moments show that for large n the t-distribution is very close to the standard normal.
Lemma 4
If $X_1 \sim \chi^2(n_1)$ and $X_2 \sim \chi^2(n_2)$ are independent r.v.'s, then $(X_1/n_1)/(X_2/n_2) \sim F(n_1, n_2)$, Fisher's F with $n_1$ and $n_2$ degrees of freedom.
In particular, if $U \sim F(n_1, n_2)$ then the density function of U is
$$f_U(u; n_1, n_2) = \frac{\Gamma\left(\frac{n_1 + n_2}{2}\right)}{\Gamma\left(\frac{n_1}{2}\right)\Gamma\left(\frac{n_2}{2}\right)} \left(\frac{n_1}{n_2}\right)^{n_1/2} u^{\frac{1}{2}(n_1 - 2)} \left( 1 + \frac{n_1}{n_2} u \right)^{-\frac{1}{2}(n_1 + n_2)}, \quad u > 0,$$
$$E(U) = \frac{n_2}{n_2 - 2},\ n_2 > 2, \quad Var(U) = \frac{2 n_2^2 (n_1 + n_2 - 2)}{n_1 (n_2 - 2)^2 (n_2 - 4)},\ n_2 > 4.$$
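These lemmas are easy to illustrate by simulation; the following sketch (Python with NumPy and SciPy assumed available) checks Lemma 2 by comparing the simulated distribution of $\sum_{i=1}^{n} X_i^2$ with the $\chi^2(n)$ distribution function at a few points:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(3)
n, reps = 5, 100_000

x = rng.standard_normal(size=(reps, n))
q = (x ** 2).sum(axis=1)                  # sum of n squared N(0, 1) draws

for c in (1.0, 4.0, 9.0, 15.0):
    print(c, np.mean(q <= c), chi2.cdf(c, df=n))   # empirical vs chi^2(n) CDF
```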
6 The General Notion of Expectation
In Section 3.5 we considered the notion of mathematical expectation in the context of the simple probability model
$$\Phi = \{f(x; \theta),\ \theta \in \Theta\}$$
as a useful characteristic of density functions of a single random variable. Since then we have generalized the probability model to
$$\Phi = \{f(x_1, x_2, \ldots, x_n; \theta),\ \theta \in \Theta\}$$
and put forward a framework in the context of which joint density functions can be analyzed. This includes marginalization, conditioning and functions of random variables. The purpose of this section is to consider the notion of expectation in the context of this more general framework. For simplicity of exposition we consider the case where n = 2.
6.1 Expectation of a marginal random variable
The expectation of a marginal random variable follows directly from the definition:
$$E(X_1) = \int x_1 f_{X_1}(x_1)\,dx_1 = \int x_1 \left[ \int f_{X_1 X_2}(x_1, x_2)\,dx_2 \right] dx_1 = \int\!\!\int x_1 f_{X_1 X_2}(x_1, x_2)\,dx_2\,dx_1.$$
Therefore, the expectation of the random vector, $E(\mathbf{x})$, is just the vector collecting the expectations of the marginal (individual) random variables, i.e. $E(\mathbf{x}) = (E(X_1), E(X_2), \ldots, E(X_n))'$.
6.2 Expectation of a function of random variables
Let $(X_1, X_2)$ be a bivariate random vector with joint density function $f_{\mathbf{x}}(x_1, x_2)$ and let $h(\cdot) : \mathbb{R}^2 \to \mathbb{R}$ be a Borel function. Define $Y = h(X_1, X_2)$ and consider its expectation. This can be defined in two equivalent ways:
(a)
$$E(Y) = \int_{-\infty}^{\infty} y f_Y(y)\,dy,$$
or
(b)
$$E(Y) = E(h(X_1, X_2)) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(x_1, x_2) f(x_1, x_2)\,dx_1\,dx_2.$$
6.2.1 Forms of h(X1; X2) of particular interest
For
$$h(X_1, X_2) = (X_1 - E(X_1))^l (X_2 - E(X_2))^k,$$
let
$$\mu_1 = E(X_1) \quad \text{and} \quad \mu_2 = E(X_2).$$
Then
$$\mu_{lk} \equiv E(h(X_1, X_2)) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x_1 - \mu_1)^l (x_2 - \mu_2)^k f(x_1, x_2)\,dx_1\,dx_2$$
are called joint central moments of order $l + k$. Two especially interesting joint central moments are the covariance and the variance:
(a) Covariance: $l = k = 1$,
$$Cov(X_1, X_2) = E((X_1 - \mu_1)(X_2 - \mu_2)) = E(X_1 X_2) - E(X_1)\,E(X_2).$$
If $X_1$ and $X_2$ are independent then $Cov(X_1, X_2) = 0$, but the converse is not true.
(b) Variance: $l = 2$, $k = 0$,
$$Var(X_1) = E(X_1 - \mu_1)^2.$$
For a linear function $\sum_i a_i X_i$ the variance is of the form
$$Var\left( \sum_{i=1}^{n} a_i X_i \right) = \sum_{i=1}^{n} a_i^2\,Var(X_i) + \sum_{i \ne j}\sum a_i a_j\,Cov(X_i, X_j),$$
where the $a_i$ are real constants.
Using the covariance and the variance we can define the correlation coefficient by
$$Corr(X_1, X_2) = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)\,Var(X_2)}},$$
which has the property that $-1 \le Corr(X_1, X_2) \le 1$.
6.2.2 Properties of expectation
1. Linearity: $E[a h_1(X_1, X_2) + b h_2(X_1, X_2)] = a E(h_1(X_1, X_2)) + b E(h_2(X_1, X_2))$, where a and b are constants and $h_1(\cdot)$, $h_2(\cdot)$ are Borel functions from $\mathbb{R}^2$ to $\mathbb{R}$. In particular, $E\left( \sum_{i=1}^{n} a_i X_i \right) = \sum_{i=1}^{n} a_i E(X_i)$.
2. If $X_1$ and $X_2$ are independent r.v.'s, then for every pair of Borel functions $h_1(\cdot), h_2(\cdot) : \mathbb{R} \to \mathbb{R}$,
$$E(h_1(X_1) h_2(X_2)) = E(h_1(X_1)) \cdot E(h_2(X_2)),$$
given that the above expectations exist.
One particular case of interest is when $h_1(X_1) = X_1$ and $h_2(X_2) = X_2$; then
$$E(X_1 X_2) = E(X_1) \cdot E(X_2).$$
This is in some sense linear independence, which is much weaker than independence. Moreover, given that
$$Cov(X_1, X_2) = E(X_1 X_2) - E(X_1) \cdot E(X_2),$$
linear independence is equivalent to uncorrelatedness, since it implies that $Cov(X_1, X_2) = 0$.
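The warning that zero covariance does not imply independence can be made concrete with $X_1 \sim N(0,1)$ and $X_2 = X_1^2$: they are clearly dependent, yet $Cov(X_1, X_2) = E(X_1^3) = 0$. A simulation sketch (Python with NumPy):

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.standard_normal(500_000)
x2 = x1 ** 2                                     # a deterministic (hence dependent) function of x1

cov = np.mean(x1 * x2) - x1.mean() * x2.mean()   # sample Cov(X1, X2), close to zero
print(cov)

# Dependence still shows up: E(X2 given |X1| > 2) is far from E(X2) = 1.
print(x2[np.abs(x1) > 2].mean(), x2.mean())
```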
6.3 Conditional expectation
The conditional expectation of $X_1$, given that $X_2$ takes a particular value $x_2$ ($X_2 = x_2$), is defined by
$$E(X_1 | X_2 = x_2) = \int_{-\infty}^{\infty} x_1 f_{X_1|X_2}(x_1 | x_2)\,dx_1,$$
and is a function of $x_2$.
In general, for any Borel function $h(\cdot)$ whose expectation exists,
$$E(h(X_1) | X_2 = x_2) = \int_{-\infty}^{\infty} h(x_1) f_{X_1|X_2}(x_1 | x_2)\,dx_1.$$
We have the following properties of the conditional expectation.
Properties of the Conditional Expectation:
Let X, $X_1$ and $X_2$ be random variables on $(S, \mathcal{F}, P)$; then
(a) $E[a_1 h(X_1) + a_2 h(X_2) | X = x] = a_1 E[h(X_1) | X = x] + a_2 E[h(X_2) | X = x]$, where $a_1, a_2$ are constants.
(b) If $X_1 \le X_2$, then $E(X_1 | X = x) \le E(X_2 | X = x)$.
(c) $E[h(X_1, X_2) | X_2 = x_2] = E[h(X_1, x_2) | X_2 = x_2]$.
(d) $E[h(X_1) | X_2 = x_2] = E[h(X_1)]$ if $X_1$ and $X_2$ are independent.
(e) $E[h(X_1)] = E_{X_2}\{E[h(X_1) | X_2 = x_2]\}$; this is the so-called law of iterated expectations (see the simulation sketch after this list).
(f) The conditional expectation $E(X_1 | X_2 = x_2)$ is a non-stochastic function of $x_2$, i.e. $E(X_1 | \cdot) : \mathbb{R}_{X_2} \to \mathbb{R}$. The graph of $(x_2, E(X_1 | X_2 = x_2))$ is called the regression curve.
(g) $E[h(X_1)\,g(X_2) | X_2 = x_2] = g(x_2)\,E[h(X_1) | X_2 = x_2]$.
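A minimal simulation sketch of the law of iterated expectations, property (e) (Python with NumPy, using an arbitrary bivariate setup in which $X_1 = 2X_2 + \text{noise}$, so that $E(X_1|X_2 = x_2) = 2x_2$):

```python
import numpy as np

rng = np.random.default_rng(5)
x2 = rng.normal(1.0, 1.0, size=500_000)
x1 = 2.0 * x2 + rng.normal(0.0, 1.0, size=x2.size)   # E(X1 | X2 = x2) = 2*x2

# Average the known conditional mean 2*X2 over X2 and compare with E(X1) = 2*E(X2) = 2.
print(np.mean(2.0 * x2), np.mean(x1))                # both close to 2
```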
As in the case of the ordinary expectation, we can define higher conditional moments:
(a) Raw conditional moments:
$$E(X_1^r | X_2 = x_2) = \int_{-\infty}^{\infty} x_1^r f_{X_1|X_2}(x_1 | x_2)\,dx_1, \quad r \ge 1.$$
(b) Central conditional moments:
$$E[(X_1 - E(X_1 | X_2 = x_2))^r | X_2 = x_2], \quad r \ge 2.$$
Of particular interest is the conditional variance, sometimes called the skedasticity:
$$Var(X_1 | X_2 = x_2) = E[(X_1 - E(X_1 | X_2 = x_2))^2 | X_2 = x_2] = E(X_1^2 | X_2 = x_2) - [E(X_1 | X_2 = x_2)]^2.$$
Exercise:
Show that in a bivariate normal distribution, $Var(X_1 | X_2 = x_2) = \sigma_1^2(1 - \rho^2)$. That is, the conditional variance is free of the conditioning variable (homoskedastic).
7 Multivariate Normal distribution
The multivariate normal distribution is by far the most important distribution in statistical inference, for a variety of reasons, including the fact that some of the statistics based on sampling from such a distribution have tractable distributions themselves. Before we consider the multivariate normal distribution, however, let us introduce some notation and various simple results related to random vectors and their distributions in general.
7.1 Multivariate distribution
Let $\mathbf{x} \equiv (X_1, X_2, \ldots, X_n)'$ be an $n \times 1$ random vector defined on the probability space $(S, \mathcal{F}, P(\cdot))$. The mean vector $E(\mathbf{x})$ is defined by
$$E(\mathbf{x}) = \begin{bmatrix} E(X_1) \\ E(X_2) \\ \vdots \\ E(X_n) \end{bmatrix} \equiv \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix} = \boldsymbol{\mu}, \quad \text{an } n \times 1 \text{ vector},$$
and the covariance matrix $Cov(\mathbf{x})$ by
$$Cov(\mathbf{x}) = E(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})' = \begin{bmatrix} Var(X_1) & Cov(X_1, X_2) & \cdots & Cov(X_1, X_n) \\ Cov(X_2, X_1) & Var(X_2) & \cdots & Cov(X_2, X_n) \\ \vdots & \vdots & \ddots & \vdots \\ Cov(X_n, X_1) & Cov(X_n, X_2) & \cdots & Var(X_n) \end{bmatrix}$$
$$= \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2 \end{bmatrix} = \Sigma,$$
where $\Sigma$ is an $n \times n$ symmetric positive definite matrix.
Lemma 1:
If $\mathbf{x}$ has mean $\boldsymbol{\mu}$ and covariance $\Sigma$, then for $\mathbf{z} = A\mathbf{x} + \mathbf{b}$:
(a) $E(\mathbf{z}) = A\,E(\mathbf{x}) + \mathbf{b} = A\boldsymbol{\mu} + \mathbf{b}$;
(b)
$$Cov(\mathbf{z}) = E[(A\mathbf{x} + \mathbf{b} - (A\boldsymbol{\mu} + \mathbf{b}))(A\mathbf{x} + \mathbf{b} - (A\boldsymbol{\mu} + \mathbf{b}))'] = A\,E(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})'\,A' = A\Sigma A'.$$
7.2 The Multivariate Normal distribution
Let $\mathbf{x} \equiv (X_1, X_2, \ldots, X_n)'$ be an $n \times 1$ random vector with $E(\mathbf{x}) = \boldsymbol{\mu}$ and $Cov(\mathbf{x}) = \Sigma$. If the joint density of $\mathbf{x}$ is of the form
$$f(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}) \right\},$$
then we say that $\mathbf{x}$ follows a multivariate normal distribution, denoted $\mathbf{x} \sim N_n(\boldsymbol{\mu}, \Sigma)$.
If R is the correlation matrix of $\mathbf{x}$, that is, $r_{ij} = \sigma_{ij}/(\sigma_i\sigma_j)$, then the density function of $\mathbf{x}$ can also be expressed as
$$f(\mathbf{x}) = (2\pi)^{-n/2} (\sigma_1\sigma_2\cdots\sigma_n)^{-1} |R|^{-1/2} \exp\left\{ -\tfrac{1}{2}\,\boldsymbol{\varepsilon}' R^{-1} \boldsymbol{\varepsilon} \right\},$$
where $\varepsilon_i = (x_i - \mu_i)/\sigma_i$, $i = 1, 2, \ldots, n$.
Three special cases are of interest.
(a) If all the variables are uncorrelated, then $\rho_{ij} = 0$ for $i \ne j$. Thus $R = I$, and the density becomes
$$f(\mathbf{x}) = (2\pi)^{-n/2} (\sigma_1\sigma_2\cdots\sigma_n)^{-1} \exp\left\{ -\tfrac{1}{2}\,\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} \right\} \qquad (1)$$
$$= f(x_1) f(x_2) \cdots f(x_n) = \prod_{i=1}^{n} f(x_i), \qquad (2)$$
where $f(x_i)$ is the $N(\mu_i, \sigma_i^2)$ density. That is, if normally distributed variables are uncorrelated, then they are independent.
(b) If $\boldsymbol{\mu} = \mathbf{0}$ and $\sigma_i^2 = \sigma^2$, then $X_i \sim N(0, \sigma^2)$ and the density in (1) becomes
$$f(\mathbf{x}) = (2\pi)^{-n/2} (\sigma^2)^{-n/2} \exp\left\{ -\tfrac{1}{2\sigma^2}\,\mathbf{x}'\mathbf{x} \right\}. \qquad (3)$$
(c) If $\sigma = 1$ then (3) becomes
$$f(\mathbf{x}) = (2\pi)^{-n/2} \exp\left\{ -\tfrac{1}{2}\,\mathbf{x}'\mathbf{x} \right\}. \qquad (4)$$
This is the multivariate standard normal distribution.
7.2.1 Marginal and Conditional Normal distributions
Let $\mathbf{x}_1$ be any subset of the random vector $\mathbf{x}$, including a single variable, and let $\mathbf{x}_2$ be the remaining variables. Partition $\boldsymbol{\mu}$ and $\Sigma$ likewise, so that
$$\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.$$
Then
(a) If $(\mathbf{x}_1, \mathbf{x}_2)'$ has a joint multivariate normal distribution, then the marginal distributions are also normal, i.e.
$$\mathbf{x}_1 \sim N(\boldsymbol{\mu}_1, \Sigma_{11})$$
and
$$\mathbf{x}_2 \sim N(\boldsymbol{\mu}_2, \Sigma_{22}).$$
(b) The conditional distribution of $\mathbf{x}_1$ given $\mathbf{x}_2 = \mathbf{x}_2$ is normal as well:
$$\mathbf{x}_1 | (\mathbf{x}_2 = \mathbf{x}_2) \sim N(\boldsymbol{\mu}_{1 \cdot 2}, \Sigma_{11 \cdot 2}),$$
where
$$\boldsymbol{\mu}_{1 \cdot 2} = \boldsymbol{\mu}_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2), \qquad \Sigma_{11 \cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
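A small numerical sketch of these partitioned formulas (Python with NumPy; the mean vector and covariance matrix below are made-up values for a 3-dimensional normal with $\mathbf{x}_1 = (X_1)$ and $\mathbf{x}_2 = (X_2, X_3)$):

```python
import numpy as np

mu    = np.array([1.0, 0.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Partition: x1 = (X1), x2 = (X2, X3).
mu1, mu2 = mu[:1], mu[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

x2_obs = np.array([0.5, 1.0])   # an arbitrary observed value of x2

mu_cond = mu1 + S12 @ np.linalg.solve(S22, x2_obs - mu2)   # mu_{1.2}
S_cond  = S11 - S12 @ np.linalg.solve(S22, S21)            # Sigma_{11.2}
print(mu_cond, S_cond)
```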
7.2.2 Linear Function of a normal vector
If $\mathbf{x} \sim N_n(\boldsymbol{\mu}, \Sigma)$, then $A\mathbf{x} + \mathbf{b} \sim N(A\boldsymbol{\mu} + \mathbf{b}, A\Sigma A')$.
7.2.3 Quadratic forms related to the normal distribution
(a) Let $\mathbf{x} \sim N_n(\boldsymbol{\mu}, \Sigma)$; then $(\mathbf{x} - \boldsymbol{\mu})'\Sigma^{-1}(\mathbf{x} - \boldsymbol{\mu}) \sim \chi^2(n)$.
(b) Let $\mathbf{x} \sim N_n(\boldsymbol{\mu}, I_n)$; then for A an idempotent symmetric matrix, we have $(\mathbf{x} - \boldsymbol{\mu})'A(\mathbf{x} - \boldsymbol{\mu}) \sim \chi^2(\mathrm{tr}\,A)$.
7.2.4 Independence of Quadratic Forms
(a) If $\mathbf{x} \sim N_n(\mathbf{0}, I_n)$ and $\mathbf{x}'A\mathbf{x}$ and $\mathbf{x}'B\mathbf{x}$ are two idempotent quadratic forms in $\mathbf{x}$, then $\mathbf{x}'A\mathbf{x}$ and $\mathbf{x}'B\mathbf{x}$ are independent if $AB = 0$.
(b) A linear function $L\mathbf{x}$ and an idempotent quadratic form $\mathbf{x}'A\mathbf{x}$ in a standard normal vector are statistically independent if $LA = 0$.
Proof:
For part (a), since A and B are both symmetric and idempotent, $A = A'A$ and $B = B'B$. The quadratic forms are therefore
$$\mathbf{x}'A\mathbf{x} = \mathbf{x}'A'A\mathbf{x} = \mathbf{x}_1'\mathbf{x}_1, \quad \text{where } \mathbf{x}_1 = A\mathbf{x},$$
and
$$\mathbf{x}'B\mathbf{x} = \mathbf{x}'B'B\mathbf{x} = \mathbf{x}_2'\mathbf{x}_2, \quad \text{where } \mathbf{x}_2 = B\mathbf{x}.$$
Both vectors have zero mean, so the covariance matrix of $\mathbf{x}_1$ and $\mathbf{x}_2$ is
$$E(\mathbf{x}_1\mathbf{x}_2') = A I B' = AB = 0.$$
Since $A\mathbf{x}$ and $B\mathbf{x}$ are linear functions of a normally distributed random vector, they are, in turn, normally distributed. Their zero covariance matrix implies that they are statistically independent, using the fact that continuous functions of two independent random vectors are also independent.
The proof of part (b) is similar and is omitted.
Example:
Let $X_i \overset{i.i.d.}{\sim} N(0, 1)$; then $\sum_{i=1}^{n}(X_i - \bar{X})^2 = \mathbf{x}'M^0\mathbf{x} \sim \chi^2(n - 1)$, where $\mathbf{x} = [X_1, X_2, \ldots, X_n]'$ and $M^0 = I_n - \frac{1}{n}\boldsymbol{\iota}\boldsymbol{\iota}'$ (with $\boldsymbol{\iota}$ a vector of ones) is symmetric and idempotent with $\mathrm{tr}\,M^0 = n - 1$.
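A simulation sketch of this example (Python with NumPy and SciPy), using the fact that $\mathbf{x}'M^0\mathbf{x} = \sum_i (X_i - \bar{X})^2$:

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
n, reps = 6, 100_000

x = rng.standard_normal(size=(reps, n))
q = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # x'M0x for each replication

# M0 is idempotent with trace n - 1, so q should be chi-square with n - 1 d.f.
for c in (2.0, 5.0, 10.0):
    print(c, np.mean(q <= c), chi2.cdf(c, df=n - 1))
```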