Lecture 3: Relationship Between Variables

Individuals and Variables
• Individuals are the objects described by a set of data. Individuals may be people, but they may also be animals or things.
• A variable is any characteristic of an individual. A variable can take different values for different individuals.

• The 1997 survey data set, for example, includes data about a sample of women of childbearing age.
• The women are the individuals described by the data set. For each individual, the data contain the values of variables such as date of birth, place of residence, and educational level.
• In practice, any set of data is accompanied by background information that helps us understand the data.

• The individuals described are the women. Each row records data on one individual. You will often see each row of data called a case. Each column contains the values of one variable for all the individuals.
• Most data sets follow this format: each row is an individual, and each column is a variable.

Measuring center: the mean
• A description of a distribution almost always includes a measure of its center or average. The most common measure of center is the ordinary arithmetic average, or mean.

• To find the mean of a set of observations, add their values and divide by the number of observations. If the n observations are $x_1, x_2, \ldots, x_n$, their mean is:
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]
Or in more compact notation:
\[ \bar{x} = \frac{1}{n} \sum x_i \]

Example: mean age at first marriage
• Q105: When were you married for the first time?
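As a rough check on this example, the mean formula can be applied to the frequency counts that the lecture's frequency table reports for age at first marriage (single ages 11 to 34). The sketch below is a minimal illustration in Python, not the procedure that produced the lecture's output; the names `freq_by_age` and `mean_from_frequencies` are made up for this example.

```python
# Frequency of each single age at first marriage (ages 11..34),
# copied from the lecture's frequency table for Q105.
ages = range(11, 35)
counts = [2, 2, 4, 13, 38, 111, 185, 309, 528, 604, 609, 589,
          435, 307, 191, 90, 58, 27, 18, 3, 7, 1, 2, 1]
freq_by_age = dict(zip(ages, counts))


def mean_from_frequencies(freq):
    """Mean of repeated values: sum(value * count) / sum(count)."""
    n = sum(freq.values())
    total = sum(value * count for value, count in freq.items())
    return total / n


n_valid = sum(freq_by_age.values())
mean_age = mean_from_frequencies(freq_by_age)
print(n_valid, round(mean_age, 2))  # 4134 21.04
```

Weighting each age by its count is equivalent to adding up all 4134 individual ages and dividing by 4134, which is exactly the definition of the mean given above.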
Statistics: age at first marriage
N Valid     4134
N Missing    872
Mean       21.04

age at first marriage

Age              Frequency   Percent   Valid Percent   Cumulative Percent
11                       2       0.0             0.0                  0.0
12                       2       0.0             0.0                  0.1
13                       4       0.1             0.1                  0.2
14                      13       0.3             0.3                  0.5
15                      38       0.8             0.9                  1.4
16                     111       2.2             2.7                  4.1
17                     185       3.7             4.5                  8.6
18                     309       6.2             7.5                 16.1
19                     528      10.5            12.8                 28.8
20                     604      12.1            14.6                 43.4
21                     609      12.2            14.7                 58.2
22                     589      11.8            14.2                 72.4
23                     435       8.7            10.5                 82.9
24                     307       6.1             7.4                 90.4
25                     191       3.8             4.6                 95.0
26                      90       1.8             2.2                 97.2
27                      58       1.2             1.4                 98.6
28                      27       0.5             0.7                 99.2
29                      18       0.4             0.4                 99.7
30                       3       0.1             0.1                 99.7
31                       7       0.1             0.2                 99.9
32                       1       0.0             0.0                 99.9
33                       2       0.0             0.0                100.0
34                       1       0.0             0.0                100.0
Total (valid)         4134      82.6           100.0
Missing (System)       872      17.4
Total                 5006     100.0

[Histogram: frequency (vertical axis, 0 to 700) by age at first marriage (horizontal axis, bars labelled 11.5 to 34.5). Std. Dev = 2.71, Mean = 21.0, N = 4134.]

An important point
• Since each single age refers to a 12-month age range, accurate calculation requires that the mid-point of the range be used to represent the average age of all members of the group. Use 11.5, 12.5, 13.5, ..., 34.5 instead of 11, 12, 13, ..., 34. Because every value then shifts by 0.5, the mean of 21.04 computed from the single ages becomes 21.54.

Measuring center: the median
• The median is the mid-point of a distribution, the number such that half the observations are smaller and the other half are larger. To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median is the center observation in the ordered list. Find the location of the median by counting (n+1)/2 observations up from the bottom of the list (or down from the top).
3. If the number of observations n is even, the median is the mean of the two center observations in the ordered list. The location of the median is again (n+1)/2 from the bottom (or top) of the list, which now falls halfway between the two center observations.

Examples
There is an odd number of observations, so there is one center observation. This is the median. It is 41.
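The counting rule above can be sketched in a few lines of Python. This is an illustrative helper only (the name `median_by_location` is an assumption, not something from the lecture), applied to the eleven observations of the odd-n example and to a small even-n list invented for illustration. Python's standard `statistics.median` serves as a cross-check.

```python
import statistics


def median_by_location(values):
    """Median via the (n + 1) / 2 location rule on the sorted list."""
    ordered = sorted(values)          # step 1: arrange in order of size
    n = len(ordered)
    location = (n + 1) / 2            # 1-based position of the median
    if n % 2 == 1:                    # step 2: odd n, one center observation
        return ordered[int(location) - 1]
    lower = ordered[n // 2 - 1]       # step 3: even n, average the center pair
    upper = ordered[n // 2]
    return (lower + upper) / 2


odd_example = [22, 25, 34, 35, 41, 41, 46, 46, 46, 47, 49]   # n = 11
even_example = [3, 5, 8, 10]                                  # hypothetical, n = 4

print(median_by_location(odd_example))    # 41
print(median_by_location(even_example))   # 6.5

# Cross-check against the standard library.
assert median_by_location(odd_example) == statistics.median(odd_example)
```

For the even-n list the location is (4+1)/2 = 2.5, halfway between the second and third ordered values, so the helper averages 5 and 8.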
n = 11, location of the median = (11+1)/2 = 6
22 25 34 35 41 41 46 46 46 47 49

The count of observations n = 10 is even. There is no center observation, but there is a center pair. These are the two 39s. The median is the average of these two observations, which is 39.
n = 10, location of the median = (10+1)/2 = 5.5
9 22 32 33 39 39 42 46 49 52

The median age at first marriage
Statistics: age at first marriage
N Valid     4134
N Missing    872
Median     21.00

The formula for the median age at first marriage:
\[ \text{Median} = l + \frac{\frac{N}{2} - F}{f} \times i \]
• l = lower limit of the age group containing the median
• N = total population
• F = cumulative frequency up to the age group containing the median
• f = frequency of the age group containing the median
• i = the size of the interval of the age group containing the median
For example, in the frequency table for age at first marriage, N/2 = 2067 falls in the age 21 group, so l = 21, F = 1796 (the cumulative frequency through age 20), f = 609 and i = 1, giving Median = 21 + (2067 - 1796)/609 ≈ 21.44.

• M = A + AM
[Figure: the age group containing the median drawn as a segment from A to B, with the median M marked between them.]
\[ \text{Median} = l + \frac{\frac{N}{2} - F}{f} \times i \]

Comparing the mean and the median
4 8 12: Mean = (4+8+12)/3 = 8, Median = 8
4 8 120: Mean = (4+8+120)/3 = 44, Median = 8
The median, unlike the mean, is resistant.

• The mean and median of a symmetric distribution are close together. If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the mean is farther out in the long tail than is the median.

Measuring spread: the standard deviation
• The mean is used to measure center, and the standard deviation to measure spread.
• The standard deviation measures spread by looking at how far the observations are from their mean.

The standard deviation
• The variance of a set of observations is the average of the squares of the deviations of the observations from their mean. In symbols, the variance of n observations $x_1, x_2, \ldots, x_n$ is
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} \]
Or, more compactly,
\[ s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2 \]
The standard deviation is the square root of the variance:
\[ s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2} \]

Calculating the standard deviation
1792 1666 1362 1614 1460 1867 1439
The mean:
\[ \bar{x} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11200}{7} = 1600 \]

Calculating the standard deviation
[Figure: number line marking the mean $\bar{x} = 1600$ and two observations, x = 1439 (deviation = -161) and x = 1792 (deviation = 192).]

Observations $x_i$   Deviations $x_i - \bar{x}$   Squared deviations $(x_i - \bar{x})^2$
1792                 1792 - 1600 = 192            192^2 = 36864
1666                 1666 - 1600 = 66             66^2 = 4356
1362                 1362 - 1600 = -238           (-238)^2 = 56644
1614                 1614 - 1600 = 14             14^2 = 196
1460                 1460 - 1600 = -140           (-140)^2 = 19600
1867                 1867 - 1600 = 267            267^2 = 71289
1439                 1439 - 1600 = -161           (-161)^2 = 25921
                     sum = 0                      sum = 214870

• The variance is the sum of the squared deviations divided by one less than the number of observations:
\[ s^2 = \frac{214870}{6} = 35811.67 \]

• The standard deviation is the square root of the variance:
\[ s = \sqrt{35811.67} = 189.24 \]

Note that the "average" in the variance divides the sum by one fewer than the number of observations, that is, n-1 rather than n. The reason is that the deviations always sum to exactly 0, so that knowing n-1 of them determines the last one. Only n-1 of the squared deviations can vary freely, and we average by dividing the total by n-1. The number n-1 is called the degrees of freedom of the variance or standard deviation.

Standard deviation
• s measures spread about the mean and should be used only when the mean is chosen as the measure of center.
• s = 0 only when there is no spread. This happens only when all observations have the same value. Otherwise s > 0. As observations become more spread out about their mean, s gets larger.
• s has the same units of measurement as the original observations.
• Like the mean, s is not resistant. Strong skewness or a few outliers can greatly increase s.

Independent, Dependent and Control Variables
• Independent variables: also known as explanatory or predictor variables (i.e. they explain or predict the dependent variable) or covariates. Independent variables can be numeric or categorical data.
The statistical procedures to be used to analyse the data will depend on whether the variables are numeric or categorical.

• Dependent variables: also called outcome or response variables (i.e. they are outcomes or responses to the independent variables). Most statistical procedures require the dependent variable to be either numeric or dichotomous. However, dependent variables with more than two categories are possible.

• Control variables: these are independent variables that have a known or expected relationship to the dependent variable. Therefore, we are not interested in examining their relationship to the dependent variable. Nonetheless, they still have to be included in the analysis because they also have a known or expected relationship to the other independent variables. We have to control for their effect when we examine the relationship of the other independent variables to the dependent variable. Hence they are called control variables. Control variables can be numeric or categorical.

• Age is often a control variable in demographic analysis. This is because many demographic events are influenced by a person's age, so that we usually know the nature of the relationship between age and various demographic events such as getting married, giving birth and dying.

• For example, we know that the number of children a woman has is related to her age; older women will have more children than younger women, all else being equal. However, other personal factors such as level of education can also be related to a person's age. For example, older people tend to have fewer years of schooling than younger people. Thus, if we want to examine the relationship between women's level of education and the number of children they have, we need to control for age in the analysis.

• Sex is also a common control variable in demographic analysis.
If we know that males and females have different risks of experiencing a demographic event, we should control for sex in the data analysis.

Relationship Between Variables
• In survey data analysis, we are often interested in examining the relationship between two variables, or the relationships between one or more independent variables and a dependent variable.
• The relationship between two variables, or between one or more independent variables and a dependent variable, may be one of causation or association. Usually, the specification of a relationship of causation is based on theory or hypothesis.

• In examining a relationship of causation, the objective is to see whether the independent variable(s) has an effect on the dependent variable. Sometimes the direction of causation may be unclear, in which case we can test only for an association between the variables. In this case, the objective is to see whether changes in one or more (independent) variables are accompanied by a real change in the other (dependent) variable.

• The type of relationship you specify to be analysed is determined by theoretical arguments. Therefore you need to be sure of the theory on which you are basing your data analysis. This will guide you in formulating a theoretical (or conceptual or analytical) framework that will in turn determine the statistical model with which you analyse your data.

• Hence, the principle of survey data analysis should be:

Statistical Tests
• One of the great mistakes in social research is the use of inappropriate statistical tests. A test is inappropriate if it is used on the wrong type of data or if the assumptions of the test are not met.
• Each type of statistical test requires a dependent and an independent variable. In its simplest form, a statistical model represents the independent variable influencing the dependent variable.

The following table describes the type of data and the statistical tests we will discuss.
The independent variable influencing the dependent variable