Lecture 3
Relationship Between Variables
Individuals and Variables
• Individuals are the objects described by a set of data. Individuals may be people, but they may also be animals or things.
• A variable is any characteristic of an individual. A variable can take different values for different individuals.
• The 1997 survey data set, for example, includes data about a sample of women of childbearing age.
• The women are the individuals described by the data set. For each individual, the data contain the values of variables such as date of birth, place of residence, and educational level.
• In practice, any set of data is accompanied by background information that helps us understand the data.
• The individuals described are the women. Each row records data on one individual. You will often see each row of data called a case. Each column contains the values of one variable for all the individuals.
• Most data sets follow this format: each row is an individual, and each column is a variable.
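
The row-and-column layout described above can be sketched in a few lines of Python. The values and the variable names (`birth_year`, `residence`, `education`) are invented for illustration and are not taken from the 1997 survey.

```python
# Hypothetical data set: all values are invented for illustration only.
# Each dict is one row (a case, i.e. one woman); each key is one variable.
cases = [
    {"id": 1, "birth_year": 1965, "residence": "urban", "education": "secondary"},
    {"id": 2, "birth_year": 1972, "residence": "rural", "education": "primary"},
    {"id": 3, "birth_year": 1958, "residence": "rural", "education": "none"},
]

# A row: the values of all variables for one individual.
first_case = cases[0]

# A column: the values of one variable for all individuals.
education_column = [case["education"] for case in cases]

print(first_case)
print(education_column)
```

Statistical packages store survey data in the same shape, with one case per row and one variable per column.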
Measuring center: the mean
• A description of a distribution almost always includes a measure of its center or average. The most common measure of center is the ordinary arithmetic average, or mean.
• To find the mean of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, \ldots, x_n$, their mean is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
Or in more compact notation:
$$\bar{x} = \frac{1}{n}\sum x_i$$
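
As a minimal sketch, the definition of the mean translates directly into Python. The function name `mean` and the five ages are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def mean(observations):
    """Arithmetic mean: add the values and divide by their count n."""
    n = len(observations)
    if n == 0:
        raise ValueError("mean of an empty data set is undefined")
    return sum(observations) / n

# Hypothetical ages at first marriage for five women (invented values).
ages = [19, 21, 22, 20, 23]
print(mean(ages))  # 21.0
```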
Example: mean age at first marriage
• Q105: When were you married for the first time?
Statistics: age at first marriage
N (valid)       4134
N (missing)      872
Mean           21.04
age at first marriage
Age             Frequency  Percent  Valid Percent  Cumulative Percent
11                      2      0.0            0.0                 0.0
12                      2      0.0            0.0                 0.1
13                      4      0.1            0.1                 0.2
14                     13      0.3            0.3                 0.5
15                     38      0.8            0.9                 1.4
16                    111      2.2            2.7                 4.1
17                    185      3.7            4.5                 8.6
18                    309      6.2            7.5                16.1
19                    528     10.5           12.8                28.8
20                    604     12.1           14.6                43.4
21                    609     12.2           14.7                58.2
22                    589     11.8           14.2                72.4
23                    435      8.7           10.5                82.9
24                    307      6.1            7.4                90.4
25                    191      3.8            4.6                95.0
26                     90      1.8            2.2                97.2
27                     58      1.2            1.4                98.6
28                     27      0.5            0.7                99.2
29                     18      0.4            0.4                99.7
30                      3      0.1            0.1                99.7
31                      7      0.1            0.2                99.9
32                      1      0.0            0.0                99.9
33                      2      0.0            0.0               100.0
34                      1      0.0            0.0               100.0
Total (valid)        4134     82.6          100.0
System missing        872     17.4
Total                5006    100.0
[Figure: histogram of age at first marriage. Horizontal axis: age at first marriage, class mid-points 11.5 to 34.5. Vertical axis: frequency, 0 to 700. Mean = 21.0, Std. Dev = 2.71, N = 4134.]
An important point
• Since a single year of age refers to a 12-month age range, accuracy in the calculations requires that the mid-point of the range be used to represent the average age of all members of the group.
Use 11.5, 12.5, 13.5, ..., 34.5 instead of 11, 12, 13, ..., 34.
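
The effect of the mid-point rule can be checked against the frequency table for age at first marriage. The sketch below assumes the frequencies are copied correctly from that table; the helper name `grouped_mean` is hypothetical. With completed years (11, 12, ..., 34) the calculation reproduces the reported mean of 21.04, and with mid-points every value, and therefore the mean, shifts up by half a year.

```python
# Frequencies of age at first marriage (completed years 11..34), copied
# from the frequency table in these notes.
frequencies = [2, 2, 4, 13, 38, 111, 185, 309, 528, 604, 609, 589,
               435, 307, 191, 90, 58, 27, 18, 3, 7, 1, 2, 1]
ages = list(range(11, 35))  # 11, 12, ..., 34

def grouped_mean(values, freqs):
    """Mean of grouped data: sum of value * frequency over total frequency."""
    total = sum(freqs)
    return sum(v * f for v, f in zip(values, freqs)) / total

mean_completed_years = grouped_mean(ages, frequencies)
mean_midpoints = grouped_mean([a + 0.5 for a in ages], frequencies)

print(sum(frequencies))                 # 4134 valid cases
print(round(mean_completed_years, 2))   # 21.04
print(round(mean_midpoints, 2))         # 21.54
```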
Measuring center: the median
• The median is the mid-point of a distribution, the number such that half the observations are smaller and the other half are larger. To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median is the center observation in the ordered list. Find the location of the median by counting (n+1)/2 observations up from the bottom (or down from the top) of the list.
3. If the number of observations n is even, the median is the mean of the two center observations in the ordered list. The location of the median is again (n+1)/2 from the bottom (or top) of the list; this is now a half-integer, which falls midway between the two center observations.
Examples
22 25 34 35 41 41 46 46 46 47 49
n = 11, location of the median = (11+1)/2 = 6
There is an odd number of observations, so there is one center observation. This is the median. It is 41.
9 22 32 33 39 39 42 46 49 52
n = 10, location of the median = (10+1)/2 = 5.5
The count of observations n = 10 is even. There is no center observation, but there is a center pair. These are the two 39s. The median is the average of these two observations, which is 39.
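
The three steps for finding the median can be sketched as a short Python function and checked against both examples. The function name `median` is hypothetical, and the ten-observation example is read here as 9, 22, 32, 33, 39, 39, 42, 46, 49, 52. The standard library's `statistics.median` is used only as a cross-check.

```python
import statistics

def median(observations):
    """Median by the location rule (n + 1) / 2 on the sorted list."""
    ordered = sorted(observations)           # step 1: arrange in order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:                           # step 2: odd n, one center value
        return ordered[middle]
    # step 3: even n, mean of the two center values
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_example = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49]
even_example = [9, 22, 32, 33, 39, 39, 42, 46, 49, 52]

print(median(odd_example))              # 41
print(median(even_example))             # 39.0
print(statistics.median(even_example))  # 39.0, same rule in the standard library
```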
The median age at first marriage
Statistics: age at first marriage
N (valid)       4134
N (missing)      872
Median         21.00
The formula for the median age at first marriage:
$$\text{Median} = l + \frac{\frac{N}{2} - F}{f} \times i$$
• l = lower limit of the age group containing the median
• N = total population
• F = cumulative frequency up to the age group containing the median (that is, the total frequency of all groups below it)
• f = frequency of the age group containing the median
• i = the size of the interval of the age group containing the median
• M = A + AM
[Figure: a line segment from A to B with the median M marked between them]
$$A + AM = l + \frac{\frac{N}{2} - F}{f} \times i$$
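
As a sketch of how the interpolation formula works in practice, the code below applies it to the frequency table for age at first marriage, treating each single year of age as a group of width i = 1. The function name `grouped_median` is hypothetical. Here N/2 = 2067, the median falls in the group with lower limit l = 21, F = 1796 and f = 609, so the interpolated median is 21 + 271/609, about 21.44. The value of 21.00 in the Statistics table is the completed-year age of the middle case, without interpolation.

```python
# Frequencies by completed years of age (11..34), from the frequency table.
frequencies = [2, 2, 4, 13, 38, 111, 185, 309, 528, 604, 609, 589,
               435, 307, 191, 90, 58, 27, 18, 3, 7, 1, 2, 1]
lower_limits = list(range(11, 35))
interval = 1  # i: each age group spans one year

def grouped_median(lower_limits, freqs, interval):
    """Median = l + ((N/2 - F) / f) * i for grouped data."""
    half = sum(freqs) / 2                    # N / 2
    cumulative = 0                           # F: cases below the current group
    for l, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= half:  # this group contains the median
            return l + (half - cumulative) / f * interval
        cumulative += f
    raise ValueError("no observations")

print(round(grouped_median(lower_limits, frequencies, interval), 2))  # 21.44
```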
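Comparing the mean and the median
4 8 12
Mean = (4+8+12)/3 = 8
Median = 8
4 8 120
Mean = (4+8+120)/3 = 44
Median = 8
The median, unlike the mean, is resistant: replacing 12 with the extreme value 120 moves the mean from 8 to 44 but leaves the median at 8.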
• The mean and median of a symmetric distribution are close together. If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the mean is farther out in the long tail than is the median.
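
The comparison can be reproduced with Python's standard `statistics` module. The first two data sets are the ones used in these notes; the right-skewed list is an invented example added only to show the mean being pulled toward the long tail.

```python
import statistics

original = [4, 8, 12]
with_outlier = [4, 8, 120]            # same data with one extreme value

for data in (original, with_outlier):
    print(data, statistics.mean(data), statistics.median(data))

# Hypothetical right-skewed data (invented): the long tail is on the right,
# so the mean lies farther out in that tail than the median.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
print(statistics.mean(right_skewed), statistics.median(right_skewed))
```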
Measuring spread: the standard deviation
• We use the mean to measure center and the standard deviation to measure spread.
• The standard deviation measures spread by looking at how far the observations are from their mean.
The standard deviation
• The variance of a set of observations is the average of the squares of the deviations of the observations from their mean. In symbols, the variance of n observations $x_1, x_2, \ldots, x_n$ is
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
Or, more compactly,
$$s^2 = \frac{1}{n - 1}\sum (x_i - \bar{x})^2$$
The standard deviation is the square root of the variance:
$$s = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$$
Calculating the standard deviation
1792 1666 1362 1614 1460 1867 1439
The mean:
$$\bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11200}{7} = 1600$$
Calculating the standard deviation
[Figure: number line showing the observations x = 1439 and x = 1792 and their deviations from the mean $\bar{x} = 1600$: deviation = -161 and deviation = 192.]
Observations   Deviations              Squared deviations
$x_i$          $x_i - \bar{x}$         $(x_i - \bar{x})^2$
1792           1792 - 1600 = 192       192^2 = 36864
1666           1666 - 1600 = 66        66^2 = 4356
1362           1362 - 1600 = -238      (-238)^2 = 56644
1614           1614 - 1600 = 14        14^2 = 196
1460           1460 - 1600 = -140      (-140)^2 = 19600
1867           1867 - 1600 = 267       267^2 = 71289
1439           1439 - 1600 = -161      (-161)^2 = 25921
               sum = 0                 sum = 214870
• The variance is the sum of the squared deviations divided by one less than the number of observations:
$$s^2 = \frac{214870}{6} = 35811.67$$
• The standard deviation is the square root of the variance:
$$s = \sqrt{35811.67} = 189.24$$
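
The worked calculation can be retraced step by step in Python. This is a sketch using the seven observations from these notes; the variable names are hypothetical. The final line cross-checks the result against `statistics.stdev`, which also divides by n - 1.

```python
import math
import statistics

observations = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(observations)
x_bar = sum(observations) / n                       # the mean
deviations = [x - x_bar for x in observations]      # x_i - x_bar
squared = [d ** 2 for d in deviations]              # (x_i - x_bar)^2

variance = sum(squared) / (n - 1)                   # divide by n - 1, not n
std_dev = math.sqrt(variance)

print(x_bar, sum(squared))                          # 1600.0 214870.0
print(round(variance, 2), round(std_dev, 2))        # 35811.67 189.24

# The standard library uses the same n - 1 definition.
print(round(statistics.stdev(observations), 2))     # 189.24
```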
Note that the “average” in the variance divides the sum by one fewer than the number of observations, that is, n-1 rather than n. The reason is that the deviations always sum to exactly 0, so that knowing n-1 of them determines the last one. Only n-1 of the squared deviations can vary freely, and we average by dividing the total by n-1. The number n-1 is called the degrees of freedom of the variance or standard deviation.
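
A short check, using the same seven observations, illustrates why only n-1 deviations are free: the deviations sum to zero, so the last one can be recovered from the other six. The variable names are hypothetical.

```python
observations = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
x_bar = sum(observations) / len(observations)
deviations = [x - x_bar for x in observations]

print(sum(deviations))                 # 0.0

# Knowing the first n - 1 deviations fixes the last one.
implied_last = -sum(deviations[:-1])
print(implied_last, deviations[-1])    # -161.0 -161.0
```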
Standard deviation
• s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
• s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise s > 0. As observations become more spread out about their mean, s gets larger.
• s has the same units of measurement as the original observations.
• Like the mean, s is not resistant. Strong skewness or a few outliers can greatly increase s.
Independent, Dependent and Control Variables
• Independent variables: also known as explanatory or predictor variables (i.e. they explain or predict the dependent variable) or covariates. Independent variables can be numeric or categorical data. The statistical procedures to be used to analyse the data will depend on whether the variables are numeric or categorical.
• Dependent variables: also called outcome or response variables (i.e. they are outcomes or responses to the independent variables). Most statistical procedures require the dependent variable to be either numeric or dichotomous. However, dependent variables with more than two categories are possible.
• Control variables: these are independent variables that have a known or expected relationship to the dependent variable. Therefore, we are not interested in examining their relationship to the dependent variable. Nonetheless they still have to be included in the analysis because they also have a known or expected relationship to the other independent variables. We have to control for their effect when we examine the relationship of the other independent variables to the dependent variable. Hence they are called control variables. Control variables can be numeric or categorical.
• Age is often a control variable in demographic analysis. This is because many demographic events are influenced by a person’s age so that we usually know the nature of the relationship between age and various demographic events such as getting married, giving birth and dying.
• For example, we know that the number of children a woman has is related to her age; older women tend to have had more children than younger women, all else being equal. However, other personal factors such as level of education can also be related to a person’s age. For example, older people tend to have fewer years of schooling than younger people. Thus, if we want to examine the relationship between women’s level of education and the number of children they have, we need to control for age in the analysis.
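
One simple way to control for age is to compare education groups within each age group rather than across the whole sample. The sketch below uses a tiny invented data set and a hypothetical helper `mean_children`. The records are constructed so that the apparent education difference disappears entirely once age is held fixed; real survey data would rarely behave this cleanly.

```python
from collections import defaultdict

# Hypothetical records, invented purely for illustration:
# (age group, education level, number of children)
women = [
    ("older", "low", 4), ("older", "low", 4), ("older", "low", 4),
    ("older", "high", 4),
    ("younger", "low", 1),
    ("younger", "high", 1), ("younger", "high", 1), ("younger", "high", 1),
]

def mean_children(records, key):
    """Mean number of children for each group defined by key(record)."""
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record[2])
    return {group: sum(v) / len(v) for group, v in groups.items()}

# Ignoring age: education appears strongly related to number of children.
overall = mean_children(women, key=lambda r: r[1])
print(overall)

# Controlling for age: compare education levels within each age group.
by_age = mean_children(women, key=lambda r: (r[0], r[1]))
print(by_age)
```

Stratifying by age group is only one approach; including age as an additional independent variable in a statistical model serves the same purpose.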
• Sex is also a common control variable in demographic analysis. If we know that males and females have different risks of experiencing a demographic event, we should control for sex in the data analysis.
Relationship Between Variables
• In survey data analysis, we are often interested in examining the relationship between two variables or the relationships between one or more independent variables and a dependent variable.
• The relationship between two variables, or between one or more independent variables and a dependent variable, may be one of causation or association. Usually the specification of a relationship of causation is based on theory or hypothesis.
• In examining a relationship of causation, the objective is to see whether the independent variable(s) have an effect on the dependent variable. Sometimes, the direction of causation may be unclear, in which case we can test only for an association between the variables. In this case, the objective is to see whether changes in one or more (independent) variables are accompanied by a real change in the other (dependent) variable.
• The type of relationship you specify to be analysed is determined by theoretical arguments. Therefore you need to be sure of the theory on which you are basing your data analysis. This will guide you in formulating a theoretical (or conceptual or analytical) framework that will in turn determine the statistical model with which you analyse your data.
• Hence, the principle of survey data analysis should be:
Statistical Tests
• One of the great mistakes in social research is the use of inappropriate statistical tests. A test is inappropriate if it is used on the wrong type of data or if the assumptions of the test are not met.
• Each type of statistical test requires a dependent and an independent variable. In its simplest form, a statistical model represents the independent variable influencing the dependent variable.
The following table describes the type of data and the statistical tests we will discuss.
[Figure: the independent variable influencing the dependent variable]