Sociology 3211: Quantitative Methods of Social Research
March 17, 2014

Contents

1 Data, Variables, and Statistics
  1.1 Branches of statistics
  1.2 Error
  1.3 Levels of Measurement
  1.4 Notes
2 Frequencies
  2.1 Frequency Tables
  2.2 Figures for Frequency Distributions
3 Measures of Central Tendency
  3.1 The mean
    3.1.1 Calculating the mean from a frequency table
  3.2 The Median
  3.3 Comparing the mean and median
4 Dispersion
  4.1 Standard Deviation
    4.1.1 Standardized Variables
    4.1.2 Guidelines for interpreting standardized scores
    4.1.3 Calculating the standard deviation from a frequency table
    4.1.4 Example
  4.2 The Interquartile Range
    4.2.1 IQR from a frequency table
  4.3 Some notes on the IQR
5 Comparing Group Means
  5.1 Association between variables
  5.2 One ordinal/interval, one dichotomy
  5.3 One ordinal/interval, one nominal
  5.4 Both Ordinal
6 Statistical Inference
  6.1 Standard error of a statistic
  6.2 Confidence Intervals
    6.2.1 Comparing group means: approximate method
    6.2.2 Comparing group means: exact method
  6.3 T-values and significance tests
  6.4 Comparing more than two groups
7 Cross-tabulations
  7.1 Independence and expected values
  7.2 Index of Dissimilarity
  7.3 Standardized residuals
  7.4 Chi-square test
  7.5 Examining Tables
8 Correlation and Simple Regression
  8.1 Correlation
    8.1.1 Calculating the correlation
    8.1.2 Standard error of the correlation
    8.1.3 Correlation matrix
    8.1.4 Correlations and scale
    8.1.5 Interpreting correlations
  8.2 A Visual Interpretation
  8.3 Regression
    8.3.1 Residuals
    8.3.2 Calculating regression coefficients
    8.3.3 Dependent and Independent Variables
    8.3.4 Analysis of Variance
  8.4 Transformations
    8.4.1 Dummy variables
    8.4.2 Change of Scale
    8.4.3 Non-linear transformations
    8.4.4 Ladder of Transformations
    8.4.5 Transformations and nonlinear relationships
    8.4.6 Choosing Transformations
9 Multiple Regression
  9.1 Example of a multiple regression
  9.2 Standardized Coefficients
  9.3 Direct, Indirect, and Total Effects
  9.4 Nominal variables in Regression
    9.4.1 Interpreting the coefficients
    9.4.2 Testing whether a nominal variable makes a difference
    9.4.3 Example
    9.4.4 Combining categories
10 Beyond Linear Regression
  10.1 Non-linear effects
  10.2 Interaction (specification) Effects

Chapter 1
Data, Variables, and Statistics

We will begin with some definitions.

• Data: any information that can be expressed as a number or one of a set of categories. For example, age can be expressed as a number, and marital status can be expressed as “married,” “divorced,” “widowed,” etc. In these cases, the way to express the idea as data is pretty straightforward. There are many cases in which there is more ambiguity: for example, someone’s political views. How can you reduce them to data? There is no perfect way to do it, but there are many possibilities. For example, surveys often ask people whether they would say that they are liberal, moderate, or conservative. Knowing which one of these terms someone picked obviously doesn’t tell you everything about what someone thinks, but it tells you something.

There is a lot of data in the modern world: some examples are crime rates, unemployment rates, election results; surveys of the public; rankings of things (colleges; nations ranked on qualities like freedom or corruption); performances of sports teams and individual athletes.
Statistics is essentially about how to organize and evaluate data. Since all sorts of information can be expressed as numbers, the general principles of statistics apply to lots of subject areas, not just sociology: e. g., weather, geology, medicine. But each subject area has some special features. For example, measuring “religious faith” is different from measuring temperature. So many departments have their own quantitative or statistical courses, and they can be quite a bit different, although they are all based on the same general principles.

• Variable: a characteristic that is measured as a number or a category name, and that differs from unit to unit. A variable is distinguished from a constant, which is a characteristic that’s the same for all units.

• Units (cases): the entities that the variables refer to. The most familiar unit is individual people. However, social scientists often analyze other units: for example, states of the US, points in time, organizations, households.

1.1 Branches of statistics

1. Univariate statistics: summarizes the values of a single variable for different units. The most common univariate statistics are the mean (average) and standard deviation. Of course, we could just list the values for a single variable for every case. But unless you have a very small number of units, a list of numbers gets overwhelming. Univariate statistics seeks to reduce the mass of information to a few numbers.

2. Bivariate and multivariate statistics: this involves the relations between variables–do different variables “go together”? E. g., do students in smaller classes learn more than students in larger classes? Bivariate statistics involves two variables, while multivariate involves more than two. Most research in the social sciences involves multivariate statistics.
A common situation is when you have one variable you want to predict or explain (the “dependent variable”) and a number of variables that we think might help to explain it (“independent variables”). For example, you might want to know which factors affect student performance. If you think about it, or ask people for their ideas, you will get a long list of possibilities. Statistics can help you figure out which of those have large effects, which have small effects, and which have no effect at all.

3. Statistical inference: this is about how sure you can be. Say that there’s a poll of 1500 voters that asks them how they voted in November 2012. 51% of men in the sample report voting for Mitt Romney; 45% of the women report voting for Romney. Can we conclude that women in general were less likely to vote for Romney than men were, or could this just be a matter of “the luck of the draw”? Or suppose you had information on a number of countries, and you found some pattern: for example, richer countries are more likely to be stable democracies. Is that evidence that there is a real connection between affluence and democracy, or could it just be a coincidence?

The alternative to inference is description, where you just discuss the data you have. Description and inference both can involve univariate, bivariate, or multivariate statistics, although in practice inference is more often applied to bivariate and multivariate statistics.

1.2 Error

A key aspect of statistics: there is always some uncertainty. Let’s take an example. One of the data sets we will use comes from a survey–a number of people were asked different questions, and numbers are used to represent their answers. Surveys are an important source of data in sociology, although not the only one. One of the variables is the number of children aged under 18 living in the household. We can be pretty sure that people know the answer, so there is little or no “error” in their answers.
Now suppose we want to predict the value of the variable. It probably will be possible to find some factors that predict it: for example, gender, age, marital status, ethnicity, income, education. But the number of children people have is also affected by a number of factors that you can’t measure, or even describe very clearly. So there will be error in terms of prediction. With most things involving people, there’s a part that you can’t predict. So any results of a statistical analysis are going to involve “most of the time” or “more often than not.” That is, there are always going to be exceptions to the rule, and often there will be a lot of exceptions. This is important to remember because people don’t always emphasize it when reporting the results of a statistical analysis–for example, a report that found a difference between men and women would talk about the difference, and often wouldn’t emphasize the variation within each group. But actually, the “error” is an important part of any statistical analysis.

For example, let’s take another variable in the data set. People were asked “During the past 30 days, for about how many days have you felt you did not get enough rest or sleep?” People could give any answer between 0 and 30.

          Lowest   Highest
Men          0        30
Women        0        30

Table 1.1: Minimum and maximum values of days not enough rest, men and women

So it’s clear that men aren’t all the same, and women aren’t all the same–in fact, both men and women cover the whole possible range. But suppose we looked at it another way.

          0 days   1-3 days   4-9 days   10 or more days
Men         42%       16%        14%           28%
Women       35%       16%        16%           33%

Table 1.2: Distribution of days not enough rest, men and women

There now is a pattern: women tend to report more days without enough rest or sleep. So a statement about a difference between men and women (in this sample) is accurate if it’s understood as involving a tendency or average.
(It also is unlikely to be a result of chance–that is, I can be pretty sure that it is true of Americans in general, not just people in this sample). But there are large differences within each sex, and a lot of overlap.

1.3 Levels of Measurement

1. Nominal: categories with no meaningful order. Marital status is an example. You can use numbers to represent the categories, but those numbers are arbitrary: they are just a way to tell them apart. Many mathematical operations don’t make sense with nominal variables–e. g., taking an average.

2. Ordinal: also discrete, but the categories have a meaningful order. For example, there is a question about general health: 1=excellent, 2=very good, 3=good, 4=fair, 5=poor. Clearly there is a natural order to these categories. It makes sense to say that ‘2’ is in between ‘1’ and ‘3’. The variable would still make sense if you reversed the order (5=excellent....1=poor), but not if you made any other changes. A more subtle point is that although “very good” is definitely in between “excellent” and “good,” it’s not certain whether it’s exactly midway in between, or closer to one of the categories than to another. So suppose you had two groups, each with two people. In group A, one person has excellent health and one has good health; in group B, both have very good health. Which group is healthier, or are they both the same? There is no definitive answer (some people would argue it’s not even a meaningful question).

3. Interval: the values have an order, and the distance between different values is defined. E. g., income. With an interval variable we can get a definite answer if we are comparing different groups. For example, if group A has one person who earns $20,000 a year and one who earns $120,000, while group B has one who earns $60,000 and one who earns $40,000, we can say that the total income is higher in group A.

4. Ratio: an interval variable that also has a definite zero point.
With ratio variables, you can say things like “person A is twice as old as person B.” If the zero point is arbitrary, statements like that have no meaning. E. g., IQ scores don’t have a meaningful zero point (they were designed to have a mean of about 100), so a person with an IQ of 140 isn’t twice as smart as a person with an IQ of 70. In fact, the statement “person A is twice as smart as person B” has no real meaning.

Interval and ratio variables can be “continuous” or “discrete.” Continuous means that the variable can have any value, at least within some range–for example, a person’s weight can be any positive number. Discrete means only a limited number of values are possible. For example, the number of children someone has is necessarily a whole number–you can’t have a value like 1.4 or 2.75. Many variables are continuous in principle but are measured as discrete. For example, in this data set a person’s weight is given in whole pounds, with no decimals, so it’s discrete. But if a variable has a lot of possible values, it is usually reasonable to regard it as continuous. What does “a lot” mean? There’s no absolute rule, but somewhere between 10 and 20 is often a good dividing line. Nominal and ordinal variables are always discrete.

5. Dichotomies: these are variables which have only two values, like agree/disagree or male/female. Dichotomies can be regarded as either ordinal or nominal variables.

The interval/ratio distinction isn’t very important in practice, because most interval variables are also ratio variables (like age, income, or weight), and most of the exceptions (like IQ) could be understood as ordinal rather than interval. The interval/ordinal distinction is potentially important, but it’s an issue for more advanced statistics, so I won’t talk about it much. But the distinction between nominal variables and the other types is very important: with nominal variables, many statistics cannot be used.

1.4 Notes

1.
Statistical programs don’t distinguish between nominal, ordinal, and interval level variables. That is, they can’t tell you not to do something that doesn’t make sense, like taking the average of a nominal variable.

2. Often there are “missing values”: cases for which no answer is recorded. For example, a person might refuse to tell you how old he or she is. Most statistical programs have special ways of dealing with missing data. The simplest is leaving those cases out of the statistical analyses. Before doing an analysis, you should check to see what will be done with “missing values.”

3. Sometimes the order of the categories in a variable isn’t the natural order, or isn’t the order you want to use. For example, with the variable on general health that I mentioned, if you stick with the original form, you’ll have to remember that high values mean WORSE health. This might get confusing, especially if you have other variables that are also “backwards.” Also, sometimes a variable is ordinal in principle, but the numbers aren’t assigned that way: for example, it’s pretty common to have a variable for which agree=1; disagree=2; not sure=3. If you put “not sure” in the middle, it’s an ordinal variable. So it’s often convenient to “recode” variables into a form that makes more sense. For example, you might create a “health” variable where excellent=5, very good=4..... poor=1.

Chapter 2
Frequencies

A frequency is a count of the number of cases that have a particular value, or that fall in a particular range of values. If you have a small number of possible values, you will usually want to know the count for each one. However, if you have an ordinal or interval variable with a lot of possible values, it is usually better to group them into ranges. If you have a continuous variable, you must group them into ranges. Consider weight in pounds: for many values, there are only a few people who have that exact number.
For example, there are three people who say they weigh 101 pounds, five who say they weigh 102, two who say they weigh 103. This is more detail than you need, so it would be more informative to group people into ranges, like less than 100 pounds, 100-109, 110-119. Note that when you do this, the ranges must be non-overlapping. You want to put everyone into exactly one category. So for example, you should not have ranges of 100-110, 110-120, because then it’s not clear how people who weigh exactly 110 pounds should be counted.

2.1 Frequency Tables

If you have a discrete variable, a “frequency table” provides a good way to show how many cases have each value. See the example in Figure 2.1, which is output from the SPSS program. The number in the left column is the value of the variable. In the data set “0” is used to represent people who currently smoke. The table also shows the “label” for that number. If it didn’t, we would have to refer to the “codebook” that tells us which number corresponds to which value. “Frequency” is the number of people who gave that answer: in this case, 666 currently smoke. “Percent” is the percent of the people who gave that answer. You calculate percent by taking the number who gave the answer, dividing it by the number of cases, which is 4232, and multiplying by 100: in this case 15.7% of the people surveyed said that they currently smoked.

Notice the values at the bottom of the table. They are indicated as “missing”–that is, we don’t have an answer for those cases. There are several different missing values. If you look in the codebook, you can see that 77 was used to indicate that people refused to answer or said that they didn’t know. -9 is used for cases marked “DNA,” which I think stands for “did not ask.” “System” is short for “system-missing,” which is an SPSS term. In SPSS, “system-missing” values are not represented by a number, but by a special symbol (a period).
Ordinary missing values are represented by a number, just like the real values, but you can instruct the program that cases with that value should be left out of all calculations. System-missing values are often used to represent cases that are left out because the question was not asked. In this example, I think that they were people who didn’t answer a previous question about whether they had ever smoked regularly. The people designing the survey figured that if they didn’t answer that, it was a waste of time asking any more questions about smoking. For most purposes the distinction between different types of missing values doesn’t matter, since in either case we don’t have an answer.

The table shows several “Totals.” One is the total number of missing values. Then there is the total number of cases for which we have a valid answer. That’s listed as the “Total” after the highest value. Finally, there’s the “grand total,” which is valid plus missing cases. The grand total is the same for all of the variables in the data set. For this data set, it’s always 4232–the number of people included. The number of missing and valid cases will differ from variable to variable. For some variables, it’s zero (e. g., sex), but for most variables there are some missing cases–for example, some people didn’t give their age, some didn’t give their income, etc.

The relative numbers in each category as a percent of all “valid” cases are shown under the “valid percent” column. Finally, the “cumulative percent” is the sum of the valid percents with that value or lower. This is useful for ordinal or interval variables, because it lets you give the percent of people who are below or above a particular level. In this case, it would seem reasonable to regard the variable as ordinal, so you could say that 21% of people had smoked within the past 5 years. The different columns of the frequency table show the same information in different ways.
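The relationship between the four columns can be checked with a little arithmetic. Here is a minimal sketch in Python using the counts from Figure 2.1 (this is just an illustration of the calculations, not anything SPSS does for you):

```python
# Valid counts from the "time since last smoked" table (Figure 2.1);
# the 77 and system-missing cases are excluded from the valid percents.
valid = [
    ("Current smoker", 666), ("Within last month", 12),
    ("Within last 3 months", 9), ("Within last 6 months", 17),
    ("Within last year", 34), ("Within last 5 years", 151),
    ("Within last 10 years", 118), ("More than 10 years", 880),
    ("Never smoked", 2309),
]
n_missing = 36
n_valid = sum(f for _, f in valid)           # 4196
grand_total = n_valid + n_missing            # 4232, same for every variable

cumulative = 0.0
for label, f in valid:
    percent = 100 * f / grand_total          # uses all cases
    valid_pct = 100 * f / n_valid            # uses only valid cases
    cumulative += valid_pct                  # running sum of valid percents
    print(f"{label:22s} {f:5d} {percent:6.1f} {valid_pct:6.1f} {cumulative:6.1f}")
```

For the first row this reproduces the SPSS output: 666/4232 gives a percent of 15.7, while 666/4196 gives a valid percent of 15.9.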
The essential information is in the first two columns–the values and the frequencies. Of course, it’s more convenient to have the computer program do calculations, so that’s why the SPSS output shows the columns with percentages. The potentially confusing thing is that SPSS tends to show all of the different things that people might be interested in, so sometimes you have to look through the output, identify the numbers you want, and ignore the others. For example, with a frequency table you usually will want to focus on the valid percents rather than the total percents.

2.2 Figures for Frequency Distributions

A “figure” or “graph” is a picture representing some statistical information. A “table” is a list of numbers. Both figures and tables are useful. A table can usually show more detail, but a figure is often easier to grasp–that is, someone can see something at a glance rather than having to think about the pattern of numbers in the table. Figures are particularly useful for showing the distribution of one variable. There are two major kinds of figures for that purpose. One is a “pie chart,” in which the size of a “slice” is proportional to the number of cases with a given value. The other is a bar chart, in which the height of a bar is proportional to the number of cases with a given value. Either one of these shows the same information that you have in a frequency table. Pie charts are used mostly for nominal variables, while bar charts are used for both nominal and ordinal variables. I’ll focus on bar charts, because most people who think about graphics don’t like pie charts. The reason is that people seem to have more difficulty in accurately judging the area of a “slice” than they do in judging the height of a bar. That is, you can easily see that one slice is bigger than another, but it’s harder to make accurate judgments on how much bigger, like 50%, twice as big, three times as big, etc.
There is another kind of graph that is similar to the bar chart–the histogram. It’s used for ordinal or interval variables with lots of possible values. For variables of this kind, there are many different values, and most of them may have only a few cases. As a result, a bar chart will look “jagged,” making it hard to pick out the important features. A histogram will group the values of the variable. For example, rather than showing the number of people who are 21, 22, 23, etc., it might show the number of people who are 20-29, 30-39, etc. That means you lose some detail, but the advantage is that it may be easier to see the main features of the data. With most statistical programs, including SPSS, the histogram command will automatically set up groups; however, you can usually manually change those values if you want to. Conventionally a bar chart has gaps between the bars, while a histogram puts them side-by-side. Examples of a pie chart, bar chart, and histogram are shown in Figures 2.2-2.4.

LASTSMK1 INTERVAL SINCE LAST SMOKED

                                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid    0 Current smoker                666      15.7            15.9                 15.9
         1 Within last month              12        .3              .3                 16.2
         2 Within last 3 months            9        .2              .2                 16.4
         3 Within last 6 months           17        .4              .4                 16.8
         4 Within last year               34        .8              .8                 17.6
         5 Within last 5 years           151       3.6             3.6                 21.2
         6 within last 10 years          118       2.8             2.8                 24.0
         7 more than 10 years            880      20.8            21.0                 45.0
         8 never smoked                 2309      54.6            55.0                100.0
         Total                          4196      99.1           100.0
Missing  77                                5        .1
         System                           31        .7
         Total                            36        .9
Total                                   4232     100.0

Figure 2.1: Frequency table for Time Since Last Smoked

Figure 2.2: Example of a pie chart

Figure 2.3: Example of a Bar Chart

Figure 2.4: Example of a Histogram

Chapter 3
Measures of Central Tendency

3.1 The mean

“Central Tendency” is the statistical term for what, in everyday language, you would call an “average” or “typical” value.
There are a number of statistics that represent central tendency, but the two major ones are the mean and the median, and those are the ones that I’ll talk about. The mean is the most common measure of central tendency. It’s sometimes called the “average,” but “average” is also used more broadly: for example, when someone says “average Americans” they usually just mean “typical” in a general way. So I will call the statistic the “mean” to avoid ambiguity.

To get the mean, add together all values of the variable (x), and divide by the number of cases (don’t include cases with missing values in either part). Or in terms of symbols:

x̄ = Σx / N

The symbol Σ (capital Greek letter sigma) stands for sum. x is the conventional way to designate an unspecified variable–whatever variable we happen to be interested in. Sometimes people write x_i, where the subscript represents the individual case, to make it clear that you’re summing the values for each case. When the values are listed individually, calculating the mean is straightforward. For example, suppose that these are scores on a test: 81, 97, 100, 67, 75. The sum of the values is 420 and N=5, so the mean is 84. Of course, if you have a lot of cases, it takes a while to get the mean, even with a calculator; that’s where computers are useful.

GENERAL HEALTH

                    Frequency   Percent   Valid Percent   Cumulative Percent
Valid    Excellent        768      18.1            18.2                 18.2
         Very good       1350      31.9            32.1                 50.3
         Good            1282      30.3            30.4                 80.7
         Fair             554      13.1            13.2                 93.9
         Poor             257       6.1             6.1                100.0
         Total           4211      99.5           100.0
Missing  7                  8        .2
         9                 13        .3
         Total             21        .5
Total                    4232     100.0

Figure 3.1: Frequency table for self-rated health

3.1.1 Calculating the mean from a frequency table

You can also calculate the mean from a frequency table. Let’s take a variable I’ve mentioned before, self-rated health. We can find N, the total number of people, listed in the frequency table as the total number of valid cases: 4211.
(As I said last week, the missing values are not used when calculating statistics). But what about the sum of the x? To get that, remember what the numbers in the table mean. There are 768 people with the value of 1 (excellent), 1350 with 2, and so on. So the numerator is:

(768*1)+(1350*2)+(1282*3)+(554*4)+(257*5)=10815

The symbol * means multiply (it’s used because the conventional multiplication symbol can be confused with the letter x). That is just a shorter way of writing the sum of 4211 individual values:

1+1+1...+1+2+2....2+3+3...3+4+4....4+5+5...5

Putting it all together, the mean is: 10815/4211=2.568. That’s almost exactly in between “very good” and “good,” just a little closer to “good.” That seems reasonable given the percentages.

Note that the mean is continuous, even if the original variable is discrete. So you don’t need to round the mean off to the nearest integer–you just give the decimal figure. You might wonder how many decimals to use. If the variable is a whole number, two decimals is a reasonable choice (in this case, 2.57). You could also do three, but anything more than three decimals is excessive. But you can be flexible–if the units are already precise, like weight in pounds, just one decimal would be reasonable.

Again, we could write the formula in symbols:

x̄ = Σfx / N

The small letter f means the frequency of that value of x, and fx is f times x (if two variables are written next to each other without a sign, it is assumed that you mean to multiply them). It’s possible to get confused about what N is. Sometimes people think that N is the number of distinct values (5 for the health categories). But N is always the number of cases (people).

A good way to check for mistakes is to remember what the mean is supposed to be: an average or typical value of the variable. Then remember what x is in this case (health) and what values x can have (1 through 5). Then ask whether the number you came up with makes sense in terms of the possible values of x.
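The same arithmetic is easy to script. Here is a minimal sketch in Python using the health frequencies from Figure 3.1 (an illustration of the formula, not SPSS output):

```python
# Frequency table for self-rated health: value -> count
# (1=excellent ... 5=poor, valid cases only)
freqs = {1: 768, 2: 1350, 3: 1282, 4: 554, 5: 257}

n = sum(freqs.values())                        # N is the number of cases, not 5
numerator = sum(x * f for x, f in freqs.items())  # the sum of fx
mean = numerator / n

print(n, numerator, round(mean, 3))
```

This reproduces the computation in the text: N=4211, the numerator is 10815, and the mean is 2.568. As a sanity check, the result falls between 1 and 5, the possible values of x.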
In this case, if everyone said that they had excellent health, the mean would be 1. If everyone said they had poor health, the mean would be 5. So the biggest possible value for the mean is 5, the smallest is 1. If you calculate the mean and get a number outside of that range, you know you made a mistake.

There are several commands in SPSS that will calculate the mean. One of them is “Descriptives.” Another is an option in “Frequencies.”

A few other points about the mean:

1. The order of the cases doesn’t matter in calculating the mean. For example, the mean of 8, 6, and 1 is 15/3=5. You get exactly the same number if they are listed 1, 6, 8 or 8, 1, 6 or any other possible way.

2. In a literal sense, you can take the mean of a nominal variable–that is, you can do the calculations and get a number. But that number will not have any sensible interpretation. You should take the mean only if the variable is ordinal or interval (including ratio). Because a dichotomy can be regarded as an ordinal variable, you can take the mean, although usually it’s not the natural thing to do. That is, we would usually describe a dichotomy by saying something like “60 percent of the cases are women,” rather than “the mean value of the variable sex is 1.6.”

3. The mean is a univariate statistic. That is, it involves one variable, and not the relations between variables. You can calculate the means for several variables, but from the means alone you can’t tell whether or how those variables are related.

4. You can compare the means for two different variables only if those variables are measured on the same scale. For example, there’s a variable for the number of days out of the last 30 your mental health was not good. The mean is 3.35. It would not be correct to say that because 3.35 is bigger than 2.57, people rate their mental health as worse than their health in general.
But it would be reasonable to compare the means for “mental health not good” and “physical health not good,” since those are both measured as days out of the last 30. The mean number of days physical health is not good (4.24) is higher than the mean number of days mental health is not good.

3.2 The Median

The median is the middle value if you arrange all of the values in order of size. It doesn’t matter if you arrange them from large to small or small to large; you get the same median. For example, say that you have only three cases, with values 5, 7, and 10. The median is 7. With only three cases, it’s obvious which the middle one is, but the rule is that it’s the (N+1)/2 case, where N is the number of valid cases. E. g., with 99 cases arranged by size, the median would be the 50th. If the number of cases is even, then (N+1)/2 isn’t a whole number. E. g., with N=100, you get the 50.5th case. There is no 50.5th case, but there is a 50th and 51st, so you can get the median by taking the average of the values of those two cases. So suppose we have four families, with 3, 2, 1, and 1 children. What is the median number of children? The second highest value is 2, the third highest is 1. The average of those two values is 1.5. Notice that it’s OK for the median to be a fraction, even though the number of children for any actual family has to be a whole number.

If the data are shown in a frequency table, you follow the same approach, but remember that there are many cases for each value. For example, let’s take the frequency table I used as an example (self-rated health). With 4211 cases, (N+1)/2=2106. If you wrote all the values in order, you’d have:

1,1,1....1,2,2...2,....5

The first 768 of those values would be 1; then the next 1350 values would be 2. 768+1350=2118 would take us past the 2106th case, so the median is 2. Writing all of the values in order and counting to 2106 would be a lot of wasted effort.
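That counting can be automated with cumulative frequencies. A minimal sketch in Python, using the health table again (the helper name is mine, not a standard function; for an even N whose two middle cases fall in different categories you would average them, which this sketch skips):

```python
def median_from_freqs(freqs):
    """Median of a frequency table {value: count} with ordered values."""
    n = sum(freqs.values())
    target = (n + 1) / 2           # position of the middle case
    cumulative = 0
    for value in sorted(freqs):
        cumulative += freqs[value]
        if cumulative >= target:   # this category contains the middle case
            return value

health = {1: 768, 2: 1350, 3: 1282, 4: 554, 5: 257}
print(median_from_freqs(health))   # 2: the running count 768+1350=2118 passes 2106
```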
It's easier to get there with the cumulative number of cases, as I just did. An even easier way to get the median from a frequency table is to follow this rule: the median is the value for which the cumulative percent passes 50. In this case, 1 just gets us to 18.2%, but then 2 takes us to 50.3%, so the answer is 2.

3.3 Comparing the mean and median

The mean and median are often close to each other. But sometimes there is a substantial difference. For example, the median number of days without enough rest or sleep is 3, while the mean is 7.66: more than twice as big. The reason the mean is bigger than the median is that no one is very far below the median (you can't be below zero), but there are some people who are far above the median–who say that they felt that they didn't have enough rest every day, or almost every day. The people who have very high numbers have a big impact on the mean, but not as much impact on the median. If everyone who had more than 5 days without enough rest managed to reduce themselves to exactly five days, the median would stay the same, because only the order matters, and reducing all the large values would not change the order.

In general, the mean will be different from the median when the distribution of the variable is "skewed" rather than symmetrical. The meaning of these terms can be understood by thinking of a histogram (or bar chart), in which the height of a bar represents the number of cases with a given value. A symmetrical distribution means that the right and left halves of the histogram are mirror images of each other. With a skewed distribution, they are not mirror images: the figure looks unbalanced. Often there is a "tail" going off to the right (high values) or left (low values). Figure 3.2 shows the histogram for number of days without enough rest. It is skewed. Figure 3.3 shows the histogram for height in inches. It is pretty symmetrical.
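The point about capping the large values can be checked with a quick sketch. The numbers here are hypothetical, not the class data; they just mimic a skewed "days without enough rest" distribution:

```python
# Capping everyone above 5 at exactly 5 changes the mean but not the
# median, because the median depends only on the order of the cases.
from statistics import mean, median

days = [0, 0, 1, 2, 3, 5, 10, 20, 30]   # made-up skewed values
capped = [min(d, 5) for d in days]      # everyone above 5 drops to 5

print(median(days), median(capped))     # 3 3 -- the median doesn't move
print(mean(days) > mean(capped))        # True -- the mean drops
```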
Usually variables that are skewed are skewed to the "right": a few values are much bigger than the median. This is particularly true when the variable has a lower limit but no upper limit. Then the mean will tend to be bigger than the median. But a variable can also be skewed to the left, although it's less common. For example, with the conventional 0-100 scale of grading tests, most people are concentrated near the top, so there may be a few that are far below everyone else. In this case, the mean will tend to be smaller than the median.

The greater sensitivity of the mean to extreme values might be regarded as a good feature or a bad feature. On the one hand, you could argue that it's important to take account of the unusual values–that is, to recognize that they're not just larger or smaller than the typical value, but much larger or smaller. On the other hand, you could argue that you shouldn't give too much weight to a small minority, especially because extreme values may result from some kind of mistake (for example, a data entry person putting in an incorrect number). So I would say that it's not a matter of one measure being clearly better: they both give different kinds of information, so you should consider both of them.

[Figure 3.2: Histogram of number of days without enough rest]

[Figure 3.3: Histogram of height]

Chapter 4: Dispersion

4.1 Standard Deviation

The standard deviation is a measure of "dispersion": that is, how "spread out" the values are. For example, suppose you have three values: 14, 12, 10. The mean is 12. What if you have the values 17, 15, 4? The mean is still 12, but the values are more spread out. What kind of statistic could we use to express this difference between the two distributions? The simplest possibility is the "range," which is defined as the largest value minus the smallest value. In the first example, the range is 4; in the second, it is 13. The range is simple to calculate and easy to understand.
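As a tiny Python sketch (illustrative only), the range calculation for the two three-value examples above:

```python
# Range = largest value minus smallest value.
def value_range(values):
    return max(values) - min(values)

print(value_range([14, 12, 10]))  # 4
print(value_range([17, 15, 4]))   # 13
```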
The drawback is that it depends on just two extreme values. Suppose you have one distribution that's like this (suppose it's scores on a test): 35, 79, 80, 83, 85, 87, 88, 92, 93, 98. The mean is 82. The range is 63. Then suppose you have a distribution like this: 55, 60, 70, 75, 75, 85, 100, 100, 100, 100. The mean is 82, and the range is only 45. But is it really valid to say that the first distribution is more spread out than the second? In the first, 9 of the 10 cases are within 18 points of each other (80 to 98)–there's just one that's a lot different. In the second, they are scattered pretty evenly over the range. So you could make an argument either way.

The standard deviation takes account of not just the highest and lowest values, but of all the values. The formula is:

s = √( Σ(x − x̄)² / (N − 1) )

Remember that x̄ is the symbol for the mean. So to get the standard deviation, you need to calculate the mean first. Then you take each value, subtract the mean, square the result, add them together, divide by N−1, and finally take the square root. (If you omit the last step, you have the "variance", which is sometimes used as a measure of dispersion. But a bigger standard deviation means a bigger variance, so they are just different ways of giving the same information.)

This is a lot of calculation, so people rarely calculate the standard deviation by hand. However, it's important to do it a few times in order to grasp what the standard deviation is. The standard deviation of the first example is 17.54. The standard deviation of the second is 17.51. So despite the difference in the distributions, the standard deviations are almost equal.

Some points about the standard deviation:

1. Like the mean, the standard deviation should be calculated only for ordinal or interval variables, not for nominal variables. (It can be calculated for dichotomies, but isn't very useful for them.)

2. The minimum possible value of the standard deviation is zero.
This will occur only if all cases have the same value.

3. There is no upper limit in principle.

4. Sometimes N is used in the denominator instead of N−1. There are arguments in favor of both formulas, but it doesn't make much practical difference unless the sample is very small.

5. Although the standard deviation is usually reported as a number, it has units–they are the same units as the original variable.

6. Usually most of the cases are within one standard deviation of the mean. This isn't an absolute rule, but it's a useful thing to remember.

7. Cases more than two or three standard deviations away from the mean are unusual. For example, the mean height for men in the US today is about 5 feet 10 inches, and the standard deviation is about 3.5 inches. So to be 2 standard deviations above the mean would mean you were 6 foot 5; two standard deviations below would be 5 foot 3.

4.1.1 Standardized Variables

Standardized variables are related to the idea of being within k standard deviations of the mean. A standardized variable is defined as

x* = (x − x̄) / s_x

That is, x minus the mean of x, all divided by the standard deviation of x. An asterisk is a common way of indicating that a variable is standardized. The mean of x* is 0 and the standard deviation is 1.0, regardless of the mean and standard deviation of the original variable. If you keep track of the units in the equation, you find that they cancel out. For example, suppose height is measured in inches. Then the standard deviation of height is also in inches, and the value of a standardized variable will be some number of inches divided by some number of inches. The result is a number, with no units.

What's the point of standardizing a variable? You can compare the values of standardized variables, even if the scale of the original variables is different. For example, say two students attend different schools. One uses a scale of 0-100 for grades, the other uses 0-4.
Say that the person at the school with the 100-point scale got an 80, the mean for all students was 78, and the standard deviation was 8. Then the person's standardized score is (80−78)/8=0.25. The student at the other school had a GPA of 2.9, the mean for all students was 2.6, and the standard deviation was 0.6. Then their standardized score is 0.5. So the second student did better in relative terms.

4.1.2 Guidelines for interpreting standardized scores

First, the sign:

+ above average
0 exactly average
− below average

These are exact rules: if something has a positive standardized score, it's larger than the mean. There are also rules for interpreting the magnitude of standardized scores, although they are approximate rather than absolute:

|x*| < 1: not unusual
1 < |x*| < 2: somewhat unusual
2 < |x*| < 3: unusual
3 < |x*|: very unusual

It's often a good idea to give unusual values special attention: think about whether there might be some mistake in measuring, why they might have the values they do, and whether they have a large influence on any of your calculations. This is especially true when you have some knowledge about the units (e.g., if the units are cities or states), because in that case you may be able to think of reasons why they are different, or check on the accuracy of the measurements using other sources. However, it can be useful even when the units are anonymous people, as in survey data.

Value   Frequency   Valid Percent
1       768         18.2
2       1350        32.0
3       1282        30.4
4       554         13.2
5       257         6.1
Total   4211        100.0

Table 4.1: Frequency Table for General Health

4.1.3 Calculating the standard deviation from a frequency table

As with the mean, you use the same basic formula, but have to take account of frequencies. Specifically, the formula is:

s = √( Σf(x − x̄)² / (N − 1) )

As with the mean, f is the frequency for a particular value of x, and N is the total number of cases (e.g., people).
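Both versions of the formula can be sketched in Python (illustrative only; the course uses SPSS). The raw-data version is checked against the two test-score distributions from section 4.1, and the frequency-table version against the general health table:

```python
# Standard deviation: raw-data formula and frequency-table formula.
import math

def stdev(values):
    """sqrt( sum((x - mean)^2) / (N - 1) ) for raw values."""
    n = len(values)
    m = sum(values) / n
    return math.sqrt(sum((x - m) ** 2 for x in values) / (n - 1))

def stdev_freq(freqs, mean):
    """sqrt( sum(f * (x - mean)^2) / (N - 1) ) for a frequency table."""
    n = sum(freqs.values())
    total = sum(f * (x - mean) ** 2 for x, f in freqs.items())
    return math.sqrt(total / (n - 1))

print(round(stdev([35, 79, 80, 83, 85, 87, 88, 92, 93, 98]), 2))      # 17.54
print(round(stdev([55, 60, 70, 75, 75, 85, 100, 100, 100, 100]), 2))  # 17.51

# General health frequency table, with the rounded mean 2.57 as in the text.
health = {1: 768, 2: 1350, 3: 1282, 4: 554, 5: 257}
print(round(stdev_freq(health, 2.57), 2))  # 1.11
```

Note that the frequency-table result matches the worksheet's 1.11 even though intermediate sums differ slightly, because the worksheet rounds the squared deviations before multiplying.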
You have to be careful about the order of operations: first square each deviation from the mean, then multiply the squared deviation by f. If you first multiply the deviation from the mean by f and then square that, you'll get a different (and incorrect) answer.

4.1.4 Example

The variable is general health. To get the standard deviation, we first need the mean. That was calculated in section 3.1.1: it is 2.57. Making a table can help in calculation–here is an example.

x      f      x̄      x − x̄    (x − x̄)²   f(x − x̄)²
1      768    2.57   -1.57    2.47       1889.3
2      1350   2.57   -0.57    0.32       432.0
3      1282   2.57   0.43     0.18       230.8
4      554    2.57   1.43     2.04       1130.2
5      257    2.57   2.43     5.90       1516.3
Total  4211                              5198.5

Table 4.2: Example of Worksheet for Calculating Standard Deviation

Note that the x − x̄ column won't add up to zero the way it does for individual data. That's because it doesn't take account of the frequency f. You could compute an f(x − x̄) column, which would add to zero. But it's not needed for computing the standard deviation, so you don't need to calculate it. Finally, the variance is 5198.5/4210=1.235. The standard deviation is the square root of that, which is 1.11.

4.2 The Interquartile Range

The interquartile range is another measure of dispersion. It is related to the median: the idea of the median is to divide the values into halves, and the idea of the IQR is to divide the data into quarters ("quartiles"). The first quartile is the value which is greater than 25% of the cases and smaller than 75%; the second quartile is the median (greater than half and less than half); and the third quartile is the value that's greater than 75% of the cases. The IQR is the value of the third quartile minus the value of the first quartile. So it's the same basic idea as the range, but rather than taking the extreme values, it takes values that are closer to the middle. The interpretation of the IQR is that it's the range that covers the middle 50% of the cases. How do you find the quartiles? The key number is (N+3)/4.
If you count (N+3)/4 cases from the bottom, you have the value of the first quartile; (N+3)/4 cases from the top gives you the third quartile. As with the median, you can get a fraction: in that case, just take the average of the two surrounding values. (For a more exact calculation, it should be closer to one of the values when you have a fraction of 1/4 or 3/4, and the average only if you have a fraction of 1/2. However, to simplify things, you can use the average for all fractions.) Then the difference is the IQR.

Let's take the examples from last time. N=10, so (N+3)/4=3.25; that is, in between the third and fourth value. The first data set is: 35, 79, 80, 83, 85, 87, 88, 92, 93, 98. The third value from the bottom is 80 and the fourth is 83, so the first quartile is 81.5. Third from the top is 92 and fourth from the top is 88, so the third quartile is 90. The difference 90−81.5 is 8.5. The second data set is: 55, 60, 70, 75, 75, 85, 100, 100, 100, 100. Here the first quartile is 72.5 and the third is 100. The IQR is 27.5.

Notice that the IQR of the second data set is bigger, even though the standard deviations are almost the same. That's because the IQR is just concerned with the cases in the middle, and those cases are more spread out in the second data set. The IQR is not sensitive to the exact values of the cases that aren't in the middle 50%. For example, if the lowest value in one of the samples was zero, the IQR would stay the same. In contrast, the standard deviation changes if any of the values change.

4.2.1 IQR from a frequency table

To find the first quartile, identify the value for which the cumulative percent passes 25; for the third quartile, identify the value for which the cumulative percent passes 75.

4.3 Some notes on the IQR

1. The minimum possible value is zero. This will occur when the 25th and 75th percentiles have the same value.
In contrast to the standard deviation, a value of zero does not necessarily mean that all cases have the same value–it just means that at least half of them do.

2. As with the standard deviation, there is no upper limit in principle.

3. The median and IQR also have the same units as the original variable.

4. You can generalize the idea of the IQR and compute ranges between various "percentiles": for example, the 90th percentile minus the 10th. The IQR is the most common choice, but that's just a convention.

Chapter 5: Comparing Group Means

So far, we've been talking about univariate statistics: the distribution of single variables. We'll now turn to bivariate statistics: the relationship between two variables. The great majority of research in the social sciences involves bivariate or multivariate statistics: univariate statistics is just a preliminary step.

5.1 Association between variables

The central question with bivariate statistics: is there an association between two variables? Association between two variables (call them x and y) means that knowing the value of one variable helps to predict the value of the other variable. E.g., say the two variables are month and temperature. There is an association between them. Knowing the month will help you to predict the temperature; knowing the temperature will help you predict the month. E.g., if you hear that the high temperature in Storrs was 90 on a particular day, you could reasonably guess which month it was. Lack of association is known as "independence." An example of variables that are independent is day of the week (Sunday, Monday, ...) and temperature: e.g., knowing that the high temperature in Storrs was 90 on a given day doesn't help you guess what day of the week it was.

Sometimes social scientists look at association for the purposes of prediction: e.g., an economist might want to predict what the unemployment rate will be at this time next year.
In order to do that, the economist could look at information about various economic conditions and unemployment rates in the following year. If there's an association, that means that the value of the economic conditions today can be used to predict unemployment next year. For the purposes of prediction, the economist wouldn't care why the association existed–just that it does.

But often social scientists look at association as part of a process of figuring out whether a variable x is a cause of y. An example: sometimes people say that birth order affects personality, success in life, and other things. How can you find out if that's true? A first step is to see if there actually are differences between people who were first children, people who were second, etc. If there are, the next step is to figure out why that association is there. If there aren't any differences between them, that suggests that birth order doesn't affect the outcomes you are interested in. It doesn't quite settle the question: as we'll see later, a relationship between two variables can be "hidden" by relationships involving other variables. But as a general rule, if there's nothing to start with, there's probably not much to explain. If there is a substantial association between two variables (even if it's not one you expected), it needs to be explained somehow. People often say "correlation does not imply causation," but that is not really true unless you add a qualification: "correlation between x and y does not imply any direct causation between x and y." If a correlation exists, there has to be a reason: something is causing something else.

It is also important to measure the strength of any association. Most variables of interest to sociologists have lots of causes. But some (probably most) of the influences will be small, others large. What statistics should you look at to see if there is an association between two variables?
There are different ones, depending on what kind of variables are involved.

5.2 One ordinal/interval, one dichotomy

Start with the case where one of the variables (y) is ordinal or interval and the other (x) is a dichotomy. Then you could first divide the cases into two groups, depending on the value of x, then calculate a statistic in the two groups separately and see if its values are different. Most often, the statistic people look at is a measure of central tendency (mean or median). However, sometimes a measure of dispersion is of interest. For example, you might want to see if dispersion in income (that is, income inequality) is higher in one country than another.

As an example of comparing means, there's a variable in the data set about length of time since last routine medical exam. It has five categories: within the last year, 1-2 years, 2-5 years, more than five years, or never. The frequency tables for men and women are in Table 5.1.

                   MEN             WOMEN           ALL
1  last year       1082  68.8%     1969  75.6%     3051  73.0%
2  1-2 years        191  12.1%      316  12.1%      507  12.1%
3  2-5 years        129   8.2%      159   6.1%      288   6.9%
4  5+ years         152   9.7%      135   5.2%      287   6.9%
5  never             18   1.1%       26   1.0%       44   1.1%
   TOTAL           1572            2605            4177

Table 5.1: Frequency tables for time since last checkup, men and women

Using those tables, you can calculate the mean for men and the mean for women. For example, the mean for men is:

(1082×1 + 191×2 + 129×3 + 152×4 + 18×5) / (1082 + 191 + 129 + 152 + 18)

which comes to 2549/1572 = 1.62. You can make the same kind of calculation for women, and get 1.44. That is, women tend to have had their last routine checkup more recently. What if we calculated the mean for everyone? It is 1.51. That is in between the means for men and women, but not exactly halfway in between. It is closer to the mean for women. Why? Because most of the people in the sample are women.
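The group means can be sketched in Python from the frequency tables above (illustrative only; SPSS does this with a "split file" or "means" procedure):

```python
# Means for men, women, and everyone from the checkup frequency tables
# (1 = within the last year ... 5 = never).
men   = {1: 1082, 2: 191, 3: 129, 4: 152, 5: 18}
women = {1: 1969, 2: 316, 3: 159, 4: 135, 5: 26}
everyone = {x: men[x] + women[x] for x in men}

def freq_mean(freqs):
    return sum(x * f for x, f in freqs.items()) / sum(freqs.values())

print(round(freq_mean(men), 2))       # 1.62
print(round(freq_mean(women), 2))     # 1.44
print(round(freq_mean(everyone), 2))  # 1.51 -- closer to the women's mean
```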
In fact, the mean for everyone can be calculated from the means and numbers for men and women:

(1.62×1572 + 1.44×2605) / (1572 + 2605)

You can calculate other statistics separately in each group: for example, the standard deviation, median, or IQR. However, when the variable has a small number of categories, the median and IQR are less useful for group comparisons, because they change in "jumps," so they aren't good at identifying small differences. In this example, the median is 1 for both men and women; the IQR is 1 for men and 0 for women. So usually comparisons involve means or standard deviations.

5.3 One ordinal/interval, one nominal

What if x is a nominal variable with more than two groups? You apply the same general idea: separate the cases by values of x, and calculate the mean in each group. You simply have more means to compare. For example, say that one variable is satisfaction with life (1-4, higher means less satisfied) and the other is employment status.

x  Status                 Mean   Frequency
1  Employed               1.55   1630
2  Self-employed          1.56   348
3  Unemp more than year   1.94   103
4  Unemp less than year   1.86   123
5  Homemaker              1.51   294
6  Student                1.55   58
7  Retired                1.55   1130
8  Unable to work         2.06   260
   Total                  1.60   3946

Table 5.2: Satisfaction with life by employment status

There are some differences that seem pretty clear: people who are unemployed or unable to work are less satisfied. However, when looking at the less obvious differences, you need to pay attention to the numbers in each group. For example, are people who are homemakers more satisfied than people who are employed? The means point in that direction, but there are only 294 homemakers, and the differences aren't that large, so maybe we can't be sure. We'll consider this issue more exactly under statistical inference, but basically, the smaller the group, the bigger the difference in means you need in order to be confident.

5.4 Both Ordinal

Suppose that x is an ordinal variable without too many categories.
For example, the data set has a measure of household income, which is recorded as one of eight categories. Then you can do the same thing as before: compute the mean of y in each category of x, and compare the means.

x  Income          Mean   Frequency
1  less than 10K   1.98   166
2  10-15K          1.89   180
3  15-20K          1.80   274
4  20-25K          1.68   368
5  25-35K          1.66   419
6  35-50K          1.63   515
7  50-75K          1.50   620
8  over 75K        1.42   934
   Total           1.60   3476

Table 5.3: Satisfaction with life by income

The difference between the ordinal and nominal cases is that when x is ordinal you're less interested in the exact means in the groups, and more interested in seeing if there's a general pattern. For example, suppose x is income, and y is satisfaction with life. It looks like there's a pattern: the more income, the more satisfaction (lower mean). If you look more closely, it seems like in the middle ranges (25-50,000) increases in income don't make as much difference as they do in the lower or the upper ranges. Maybe this means something, but maybe it's just a quirk of the sample. In any case, the first thing you should do is look at the general picture: is there a relationship of the form "the bigger x is, the bigger y is"? If not, is there another kind of relationship that you can describe simply? For example: "y is largest for middle values of x." Only then should you look for more subtle things. Usually if just one of the ordered categories is different from the surrounding ones, you can assume that this is just a matter of random variation.

What if you got no obvious pattern, just some means that were higher and some that were lower? That would be a sign (not conclusive evidence, but a sign) that maybe there is no relationship at all. To be sure, you have to use statistical inference, but generally if there is a relationship between two ordinal/interval variables, it will have a simple pattern.
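A quick way to check for the simple pattern "the bigger x, the smaller y" is to test whether the group means move in one direction across the ordered categories. A Python sketch using the income-category means from Table 5.3 (illustrative only):

```python
# Do the satisfaction means fall at every step up in income?
income_means = [1.98, 1.89, 1.80, 1.68, 1.66, 1.63, 1.50, 1.42]

monotone = all(a >= b for a, b in zip(income_means, income_means[1:]))
print(monotone)  # True -- consistent with "more income, more satisfaction"
```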
Chapter 6: Statistical Inference

Usually when you do a statistical analysis you want to reach a conclusion that applies outside of the particular cases on which you have information. For example, let's take an example I used before: gender and time since last medical checkup. In the class data set, the average time was higher for men than for women, but what if you want to say something about people in general, not just people in this data set? In this case, the goal is to use a smaller group (the sample) to make an estimate about a larger group (the population). You can't have absolute certainty in any conclusions, but you may be able to say that you are "pretty sure" or even "almost sure" that a conclusion about the population is correct. This is statistical inference (as distinct from description, which is entirely about the cases you observe).

Statistical inference is most straightforward when you have a random sample. The meaning of "random" here is different from the everyday meaning of haphazard or without any conscious method. In the basic kind of random sample, everyone has the same chance of being selected for the sample. You can think of it as a lottery where the prize is being chosen for the sample. You can give some people a higher chance, by giving them extra "tickets," while keeping the same basic procedure. Most surveys in sociology, or surveys of public opinion, are designed to provide random samples. For example, the class data set is intended to provide a random sample of American adults.

Sometimes you might want to generalize about some population other than the one from which the sample was taken. E.g., suppose that you wanted to generalize about people in Canada. The best way to do that would be to get a random sample of Canadians, but sometimes that's not available, so you might wonder if the American data can tell you anything about Canadians.
Generalizing outside the population might or might not be justified, but it isn't primarily a statistical issue. So we'll just consider going from a random sample to the population from which the sample was taken. A lot of data in sociology doesn't involve random samples. For example, the states of the United States are not a random sample of anything. However, you can apply statistical inference to this kind of data too. With data of this kind, the question isn't about the population–it's about whether any pattern we see could plausibly be explained by "chance."

6.1 Standard error of a statistic

A basic tool of statistical inference is the standard error of a statistic–for example, the standard error of a mean. The standard error of a statistic is an estimate of what would happen if you took numerous random samples from the same population and computed the statistic for each sample. If you had a lot of samples, you could compute the mean and the standard deviation of the sample statistic: the standard error is an estimate of that standard deviation. Why should we care about the standard deviation of a sample statistic? After all, we just have one sample from the population, with one sample mean, not lots of samples with different sample means. The reason is that the standard deviation is a guide to how different the statistic for a particular sample might be from the "true" (population) value.

Many statistics, including the sample mean, approximately follow a particular distribution, known as the "normal distribution." The figure shows what the normal distribution looks like. Even if the original variable (x) doesn't have a normal distribution, the sample mean of x will be approximately normal. You can use a table of the normal distribution to see exactly how much chance there is that the mean of a particular sample will be one, two, three, or however many standard deviations away from the mean of the population.
For the moment we just need a few facts: in a normal distribution, about 95% of the values are within two standard deviations of the mean, and about 99% are within 2.5 standard deviations of the mean. It turns out that we can estimate the standard deviation of a sample statistic even if we just have one sample. The formula for estimating the standard error of a statistic depends on the statistic. For the mean, the formula is pretty simple:

s(x̄) = s_x / √N

That is, the standard error of the sample mean is the standard deviation of x divided by the square root of N.

[Figure 6.1: Normal Distribution]

6.2 Confidence Intervals

A confidence interval is a range of possible population values of a statistic, given the value in a sample. If we call the sample statistic τ, and call the standard error of the statistic s(τ), the confidence interval runs from τ − k·s(τ) to τ + k·s(τ). Here k is a number based on how sure you want to be, obtained from the table for a normal distribution. The more confident you want to be, the larger k will be. E.g., if you want to be 99.9% sure that the population value of the statistic is in the confidence interval, you'll need a k of a little more than 3. To be 95% sure, you only need a k of about 2.

For example, there's a question in the data set about the number of days in the last 30 on which your mental health was not good. The mean is 3.35, and the standard deviation is 7.63. N is 4160 (N is the number of cases used to compute the statistic, so it will differ depending on the particular variables used). Applying the formula, the standard error of the sample mean is .118, which you can round off to .12. Finally, the 95% confidence interval is (3.11,3.59). (A standard way to indicate an interval is (lower,upper).) That is, we can be 95% sure that the mean in the population is between 3.11 and 3.59. If you want to put that conclusion in words, you could say something like you are "pretty sure" that the average opinion in the population is in that range.
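The standard error and confidence interval calculation can be sketched in Python (illustrative only), using the "days mental health not good" numbers just given:

```python
# Standard error of the mean and a 95% confidence interval (k = 2).
import math

mean, sd, n = 3.35, 7.63, 4160
se = sd / math.sqrt(n)
k = 2                                  # roughly 95% confidence
ci = (mean - k * se, mean + k * se)

print(round(se, 2))                    # 0.12
print(tuple(round(v, 2) for v in ci))  # (3.11, 3.59)
```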
You could also form a 99.9% confidence interval, which would be 2.99 to 3.71, and describe that with words like "almost certain."

6.2.1 Comparing group means: approximate method

Often social scientists are interested in comparing group means. A basic question: is there any difference between the groups? When you compute sample means for different groups, you almost always find that the mean is higher in one group than the other. But that's partly because there is almost always some chance difference between samples. For example, suppose you gave point values to cards: A=14, K=13, Q=12, J=11, 10=10, etc. Then when you deal a hand, that is a sample from the population. When you deal two hands from a pack, the mean point value will usually be different, even though the population value is always the same. That's the element of chance or "luck of the draw." The same thing happens with sampling from a population: some differences in a particular sample are just a matter of chance.

If you have two groups, you have a mean and a standard deviation for each group. So you can compute a confidence interval for each group. Call the two groups A and B, and say that the mean in A is higher.

Gender   Mean   s      N
Men      2.59   6.76   1573
Women    3.81   8.08   2587
All      3.35   7.63   4160

Table 6.1: Number of Days Mental Health not Good

From Table 6.1, you can calculate the standard errors of the means (.170 for men, .159 for women). Then you can calculate the 95% confidence intervals: (2.25,2.93) for men, and (3.49,4.13) for women. There is no overlap between them: the highest value for men is smaller than the lowest value for women. These are what we could call the highest and lowest plausible values of the population means for men and for women. Because there is no overlap, we can say that the population mean for women is pretty definitely higher than the mean for men.
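The comparison can be sketched in Python (illustrative only), recomputing the standard errors directly from each group's s and N in Table 6.1, so the last digits may differ slightly from rounded figures quoted in the text:

```python
# 95% confidence intervals (k = 2) for the two groups, and an overlap check.
import math

def ci(mean, s, n, k=2):
    se = s / math.sqrt(n)
    return (mean - k * se, mean + k * se)

men = ci(2.59, 6.76, 1573)
women = ci(3.81, 8.08, 2587)

print(tuple(round(v, 2) for v in men))    # (2.25, 2.93)
print(tuple(round(v, 2) for v in women))  # (3.49, 4.13)
print(men[1] < women[0])  # True -- no overlap between the intervals
```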
If there had been overlap, that would mean that it's possible that the population means are the same, or even that the population mean for men was higher. That is, we can't be sure that the population means of the two groups are different, and can't be sure about the direction of the difference if they are. Here is an example in which there is overlap.

Gender   Mean   s      N
Men      2.54   1.08   1582
Women    2.58   1.13   2629
All      2.57   1.11   4211

Table 6.2: Mean of self-rated health

The 95% confidence intervals are (2.49,2.59) for men and (2.54,2.62) for women. There are values that are in both confidence intervals. Here is an example for you to calculate:

Gender   Mean   s      N
Men      3.54   8.15   1570
Women    4.67   8.97   2570
All      4.24   8.69   4140

Table 6.3: Mean of Days Physical Health not Good

6.2.2 Comparing group means: exact method

The method I've described is only approximate. There's a more accurate way, which requires a little more calculation. This is based on the idea of looking at the difference between two means. We can regard that difference as a statistic, and estimate a standard error. Then we can create a confidence interval for the difference, and ask if it's possible that the difference is zero. The formula is:

s(x̄1 − x̄2) = √( s1²/N1 + s2²/N2 )

where the subscripts refer to the two groups. You can also say that this is the square root of the sum of the squared standard errors for the two groups. If you apply this formula to the difference in self-rated health, you get .035. The difference in the means is .04. So the 95% confidence interval is (−0.03,0.11). The value of zero is important, because it means no difference between the groups. So for the confidence interval of the difference to contain zero is like having the confidence intervals for the two groups overlap. In this case, the confidence interval includes zero, so again we conclude that it's possible that there's no real difference (or that men report worse health than women).
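The exact method can be sketched in Python (illustrative only), using the self-rated health numbers from Table 6.2:

```python
# Standard error of a difference in means: sqrt(s1^2/N1 + s2^2/N2),
# and a 95% confidence interval (k = 2) for the difference.
import math

m1, s1, n1 = 2.54, 1.08, 1582   # men
m2, s2, n2 = 2.58, 1.13, 2629   # women

se_diff = math.sqrt(s1**2 / n1 + s2**2 / n2)
diff = m2 - m1
ci = (diff - 2 * se_diff, diff + 2 * se_diff)

print(round(se_diff, 3))               # 0.035
print(tuple(round(v, 2) for v in ci))  # (-0.03, 0.11) -- includes zero
```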
The approximate method is more conservative, in the sense that you’re more likely to conclude that there might be no difference. That is, sometimes you would conclude that there might be no difference using the approximate method, but conclude that there is a difference using the exact method. The exact method is the one that you should use: I just began with the approximate method as a way to introduce the idea of comparing groups.

6.3 T-values and significance tests

As I’ve mentioned, the value zero is important for many statistics involving the relationship between variables, because it means no difference or no relationship. Statistical significance means that we can be reasonably confident that some statistic representing the relationship between variables is not zero. E. g., if someone says that “there is a statistically significant difference between men and women,” that is equivalent to saying that the confidence interval for the difference in means does not include zero. Sometimes people will just say “there is a significant difference,” but this is ambiguous, since in everyday terms “significant” has other meanings, like large or theoretically interesting. So if you’re talking about statistical significance, it’s a good idea to say “statistically significant” rather than just “significant.”

Say τ̂ is the observed value of a statistic and τ₀ is a hypothetical value you are interested in (most often the value is zero). That is, you are asking the question “is it possible that the population value of the statistic is τ₀?” Finally, s_τ is the standard error of the statistic. Then you can compute a ratio:

(τ̂ − τ₀)/s_τ

It’s called the “t-ratio” or “t-statistic” because it has a particular distribution known as the t-distribution. Using a table of the t-distribution, you can look up the statistical significance of the t-statistic. When the sample is large, the t-distribution becomes almost exactly like the normal distribution.
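As a sketch, the ratio can be applied to the two group comparisons from the previous section (numbers taken from Tables 6.1 and 6.2; the function name is my own):

```python
import math

def t_ratio(stat, se, hypothesized=0.0):
    return (stat - hypothesized) / se

# Self-rated health: difference 0.04, standard error about 0.035
t_health = t_ratio(0.04, 0.035)    # about 1.1, well under 2

# Days mental health not good: difference 3.81 - 2.59 = 1.22
se_mental = math.sqrt(6.76**2 / 1573 + 8.08**2 / 2587)
t_mental = t_ratio(1.22, se_mental)  # about 5.2, well over 2
```

The first ratio is too small to rule out a population value of zero; the second is large enough that chance is a poor explanation.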
I’ve been assuming, and will continue to assume, that the sample is large enough to just use a normal distribution. You can use the t-value to conduct a “significance test” of the hypothesis of no difference. The value of 2 for the t-statistic is the conventional standard for this test. It corresponds to the 95% confidence interval: if the t-ratio is bigger than 2 or smaller than -2, then zero is not in the 95% confidence interval. “Statistically significant” means we can reject the hypothesis of no difference: we can say that the proposition that there is no difference in the population is hard to square with the observed difference in the sample. So significance tests and confidence intervals give us essentially the same information. The difference is that significance tests focus on the hypothetical value of zero. “Statistically significant” means you can be pretty sure that the population value of a statistic is not zero. You can also be pretty sure about its sign: you can say that the statistic is pretty sure to be positive or pretty sure to be negative. The sign is usually the most basic thing that someone would want to know: for example, in comparing groups, the sign tells you which group mean is larger. “Not significant” means you can’t be sure whether the population value of the statistic is positive, negative, or zero. Note that “not significant” doesn’t mean positive evidence that the population value is zero, or even that the population value is “small.” To make a judgment about the size of any difference, you need to look at the confidence interval.

6.4 Comparing more than two groups

If one of the variables is a dichotomy, the t-test tells you whether the variables are related: “is there a difference between men and women?” is equivalent to “does gender make a difference?” When you have more than two groups, you can compare each pair of means in the way described in the previous sections. With k groups, you have k(k-1)/2 pairs.
You might find significant differences between some pairs but not others. In that case, it’s not clear whether you should say there is an association between the variables or not. We will later look at ways to test for an association, but for now I will just give a rough rule: if many of the differences between pairs are significant, there’s probably an association between the variables; if only a small fraction are, probably not.

Chapter 7
Cross-tabulations

The point of comparing means is to see if there’s an association between two variables. But you can’t compare means when both of the variables involved are nominal, because you can’t take the mean of a nominal variable. What if you want to look at the association between two nominal variables? You can use crosstabulations. Crosstabulations may also be useful when one or both of the variables are ordinal or interval but have only a small number of categories. “Small” is a matter of degree, but as a practical matter, you could say up to about seven: when you have more than that, cross-tabulations get hard to read. An example of a cross-tabulation: whether you have any health care coverage and whether there was a time in the last year when you needed to see a doctor but couldn’t because of cost. With two categories of coverage and two categories of whether you were unable to see a doctor because of cost, you have a total of four possibilities: health coverage and yes, health coverage and no, no coverage and yes, no coverage and no.

              Unable to afford doctor
              Yes    No     Total
Coverage      294    3510   3804
No Coverage   173    237    410
Total         467    3747   4214

Table 7.1: Crosstabulation of health coverage and whether unable to see doctor because of cost

You could imagine going through the whole list of people and classifying them into these four groups. That’s what the four numbers in the middle tell you.
The “total” columns tell you the same information that you would get in a frequency table: how many people with and without health coverage there are, and how many people were and were not unable to see a doctor because of cost. For example, the total number of people who were unable to see a doctor is equal to the number of people with health coverage who were unable plus the number of people without coverage who were unable.¹ The problem with this cross-tabulation is that it’s hard to compare the numbers in the four cells. If we just look at the numbers, we see the biggest groups are people who have coverage and never were unable to see a doctor, followed by people who have health care coverage and were unable to see a doctor. But this just means that most people have health coverage: it doesn’t tell us about whether there’s any difference between people who do and people who don’t. So it is more informative to give the table in terms of percentages. Table 2 gives the percentages calculated separately for people with and without health coverage (known as the “row percentages,” because the rows represent different values of health coverage). That is, 42.2% of people without health coverage were unable to see a doctor, while only 7.7% of the people with health coverage were. To get this table, you divide the cell values by the row totals: for example, (294/3804)*100=7.7. You could also do the calculations in the other direction if you were given Table 2: that is, you could go back and calculate the frequencies in Table 1. For example, 42.2*410/100=173.02, which rounds to 173. Because the percentages are different when you compare the two rows, it appears that the two variables have something to do with each other: people without health coverage are more likely to be unable to see a doctor because of cost. You could also compute the “column percentages,” which are shown in Table 3. You get these by dividing the cells by the totals for the columns.
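Both sets of percentages can be sketched from the counts in Table 1 (a sketch; the dictionary labels are my own):

```python
# Counts from the coverage / unable-to-see-doctor table
table = {
    ("Coverage",    "Yes"): 294, ("Coverage",    "No"): 3510,
    ("No Coverage", "Yes"): 173, ("No Coverage", "No"): 237,
}
rows, cols = ["Coverage", "No Coverage"], ["Yes", "No"]

row_totals = {r: sum(table[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(table[(r, c)] for r in rows) for c in cols}

# Row percentages: divide each cell by its row total
row_pct = {(r, c): 100 * table[(r, c)] / row_totals[r] for r in rows for c in cols}
# Column percentages: divide each cell by its column total
col_pct = {(r, c): 100 * table[(r, c)] / col_totals[c] for r in rows for c in cols}
```

Here `row_pct[("No Coverage", "Yes")]` comes out to about 42.2, and each row of `row_pct` sums to 100.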
For example, 63 percent of the people who couldn’t see a doctor because of cost had health coverage, while 93.7 percent of the people who never were unable to see a doctor had coverage. The row and column percentages give the same information in different forms.

¹These totals are based on the number of people who answer both questions: people who say don’t know or refuse to answer are usually left out of cross-tabulations. So they may be smaller than the totals you get in the frequency tables for the variables.

              Unable to afford doctor
              Yes           No            Total
Coverage      294   7.7%    3510  92.3%   3804  100%
No Coverage   173   42.2%   237   57.8%   410   100%
Total         467           3747          4214

Table 7.2: Crosstabulation with row percents

              Unable to afford doctor
              Yes           No            Total
Coverage      294   63.0%   3510  93.7%   3804
No Coverage   173   37.0%   237   6.3%    410
Total         467   100%    3747  100%    4214

Table 7.3: Crosstabulation with column percents

In this case, if people without health care coverage are more likely to be unable to see a doctor, then they are going to make up a larger share of the people who are unable to see a doctor. Note that “a larger share” doesn’t necessarily mean a majority. In fact, most of the people who were unable to see a doctor because of cost did have some health coverage. That’s because most people have health coverage: a small fraction of a large group can be a bigger number than a large fraction of a small group. Row percentages will always have to add to 100 (allowing for rounding error) when you go across the rows. Column percentages have to add to 100 percent when going down the columns. If you see a table and aren’t sure what the percentages mean, you can use these facts to figure out what they are. There is a convention that if one of the variables can be thought of as a cause and the other as an effect, the “cause” variable is usually used as the row variable, and the row percentages are shown. E.
g., in this case coverage status could be thought of as the cause and whether you were unable to see a doctor as the outcome. It wouldn’t seem reasonable to think of it the other way round. When you have a cause variable, many people find it more natural to use that as the base for the percentages. E. g., in this case, I’d say it’s easier to grasp Table 2 than Table 3. It’s not wrong to do it the other way, but it’s a good idea to follow this convention unless you have a special reason not to. However, there are a lot of cases where it’s not clear which variable is cause and which is effect, or when it seems like you could regard it as either one. E. g., general health and exercise. You could say that exercise affects health (presumably improves it). On the other hand, health could affect exercise, because healthier people are likely to find exercise easier and more enjoyable. This ambiguity is not a problem. The table has the same interpretation regardless of which is rows and which is columns, so if you’re not sure about cause and effect you can just make an arbitrary choice. What you can learn from looking at a cross-tabulation: do the variables have anything to do with each other? If the row percentages are different when you compare different row values or the column percentages are different when you compare columns, then you can say that the variables have something to do with each other. If the percentages are the same, then you can say that the variables are unconnected. But you might want to go beyond this, and distinguish between stronger and weaker connections. How can you do this? Many statistics have been developed for this purpose. Almost all of them are based on “residuals,” so we first need to learn how to calculate residuals.

7.1 Independence and expected values

Suppose that x is a variable. Then you can write x = x̂ + e. x̂ means a predicted value of x (sometimes called a “fitted value”).
e is the “error” or “residual,” and is computed by x − x̂. There are lots of possible predicted values: each one represents an idea about what kind of pattern there is in the values of x. In this case, the idea that we’re interested in evaluating is that the variables are independent. That is, two variables have nothing to do with each other; knowing the value of one is of no use in predicting the value of the other. More precisely, we could say that the distribution of one variable (call it y) is the same for every value of the other variable (which we’ll call x). The hypothesis of independence is widely used as a baseline. How do you get the predicted values under the hypothesis of independence? Let’s ask what numbers we would expect to see if there were no association. Then the row percentages would be the same for people with and without health coverage. For example, if 11.1% of people were unable to see a doctor because of cost, and whether that happened to you is independent of health coverage, then 11.1% of people with coverage would have been unable to see a doctor, and 11.1% of people without coverage would have been unable to see a doctor. Then compute the numbers in the hypothetical table by multiplying the percent who were unable to see a doctor by the numbers with and without coverage, and dividing by 100. For example, 11.1*410/100=45.51. That’s how many people without health coverage “should” have been unable to see a doctor, if the variables were independent. When you use the percentages, the predicted values are affected by rounding error. A more accurate way to do the calculations is to multiply the relevant row and column totals, and divide by the grand total. For example: 467*410/4214=45.44.

              Unable to afford doctor
              Yes      No       Total
Coverage      421.6    3382.4   3804
No Coverage   45.4     364.6    410
Total         467      3747     4214

Table 7.4: Predicted Values Under Independence
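The row-total-times-column-total calculation can be sketched as follows (a sketch; the labels are my own):

```python
row_totals = {"Coverage": 3804, "No Coverage": 410}
col_totals = {"Yes": 467, "No": 3747}
grand = 4214

# Predicted count under independence: row total * column total / grand total
expected = {(r, c): rt * ct / grand
            for r, rt in row_totals.items()
            for c, ct in col_totals.items()}
```

Here `expected[("No Coverage", "Yes")]` comes out to about 45.4, matching the calculation above.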
We can then compute the differences between the predicted and actual values, which are shown in Table 5. They are known as the “residuals.” Note that the residuals sum to zero if you go across the rows or down the columns.

              Unable to afford doctor
              Yes       No        Total
Coverage      -127.6    +127.6    0
No Coverage   +127.6    -127.6    0
Total         0         0         0

Table 7.5: Residuals from Model of Independence

7.2 Index of Dissimilarity

What if you wanted a statistic to show how well or badly the predictions from the model of independence fit the data? You would have to combine all of the residuals to get some kind of total. But since some of the residuals are positive and some are negative, the positive and negative residuals would cancel out, so just adding them wouldn’t work. What if you added up the absolute values of the residuals? That would give a sort of total error, which is a more reasonable measure: the lowest possible value would be zero, meaning a perfect fit. The problem is that the result would depend on the number of cases. So it might be better to adjust for the total number of cases. This is what the index of dissimilarity does. The formula:

Σ|e| / (2N)

where e (for error) is the residual. The maximum value of the index of dissimilarity depends on the total percentages in the rows and columns in a complicated way. Therefore, the index of dissimilarity is more a rough guide than an exact statistic. It’s useful when comparing tables, and you want to say that one has a stronger association than another. With the table of health coverage and being unable to see a doctor, the index of dissimilarity is 0.061. The index of dissimilarity can be interpreted as the proportion of cases that would have to be “moved” in order to make the model of independence exactly fit the data. In fact, it is sometimes used as an index of segregation. If you had people of different ethnicities living in different neighborhoods, you could make a cross-tabulation of neighborhood and ethnicity.
Complete integration would mean that all the neighborhoods contain the same mix of ethnicities. That is, the variables would be independent: knowing where someone lived would not give you any clue about their ethnicity. The index of dissimilarity can be interpreted as the minimum proportion of the population that would have to be moved in order to produce perfect integration.

7.3 Standardized residuals

With a two-by-two table, the residuals are all the same number, two positive and two negative. But with a bigger table, you can have a more complicated pattern. The predictions may be pretty good for some combinations, but bad for others. Residuals can be used to identify where the predictions fit or do not fit. Table 6 gives another example: type of community by marital status.

Community   Married       Div.         Widowed      Sep.        Never Married   Unmarried Couple   Total
City        673   50.5%   207  15.5%   184  13.8%   34   2.6%   202  15.1%      34   2.6%          1334  100.0%
Urban       564   59.8%   126  13.4%   115  12.2%   19   2.0%   104  11.0%      16   1.7%          944   100.0%
Suburban    321   62.5%   53   10.3%   72   14.0%   10   2.0%   43   8.4%       15   2.9%          514   100.0%
MSA         9     47.4%   2    10.5%   5    26.3%   0    0.0%   3    15.8%      0    0.0%          19    100.0%
Non-Urban   817   58.8%   194  14.0%   219  15.8%   30   2.2%   108  7.8%       21   1.5%          1389  100.0%
Total       2384  56.8%   583  13.9%   595  14.2%   93   2.2%   460  11.0%      86   2.1%          4200  100.0%

Table 7.6: Crosstabulation of community and marital status

If the residual is exactly zero, that means a perfect prediction. If the residual is near zero, that means a good prediction. A residual that’s much greater than zero or much less than zero means a bad prediction. But it seems reasonable to take the size of the predicted values into account too. E. g., if you predict that the Republicans will win 221 seats in the House of Representatives in 2014 and they actually win 216, you could regard that prediction as pretty good. If you predicted that someone would have 2 children and they actually had 7, you would regard that prediction as way off.
The “standardized residual” is designed to adjust for the size of the prediction. It is defined as

e/√n̂

where e is the residual and n̂ is the predicted count in a cell. It’s related to standardized scores, which we talked about before. To get a standardized score you subtract the mean and divide by the standard deviation. The mean residual is zero. The square root of the predicted value is an estimate of the standard deviation produced by chance variation. This means that one thing you can do with the standardized residuals is see if any are large: with an absolute value of more than 2.0, or especially more than 3.0. A large standardized residual suggests that your fitted value is far enough from the actual value to make it hard to explain as just a matter of chance.

Community   Married   Div.    Widowed   Sep.   Never Married   Unmarried Couple   Total
City        -84.2     22.2    -5.0      4.5    55.9            6.7                0
Urban       28.2      -4.8    -18.7     -1.9   0.6             -3.3               0
Suburban    29.2      -18.2   -0.8      -1.4   -13.3           4.5                0
MSA         -1.8      -0.6    2.3       -0.4   0.9             -0.4               0
Non-Urban   28.6      1.5     22.2      -0.8   -44.1           -7.4               0
Total       0         0       0         0      0               0                  0

Table 7.7: Residuals, community and marital status

Community   Married   Div.    Widowed   Sep.    Never Married   Unmarried Couple
City        -3.06     1.63    -0.36     0.82    4.62            1.28
Urban       1.22      -0.42   -1.62     -0.42   0.06            -0.76
Suburban    1.71      -2.16   -0.10     -0.41   -1.77           1.38
MSA         -0.54     -0.39   1.41      -0.65   0.64            -0.62
Non-Urban   1.02      0.11    1.58      -0.14   -3.58           -1.40

Table 7.8: Standardized residuals, community and marital status

7.4 Chi-square test

The chi-square statistic is used to test association in cross-tabulations. It is especially useful when both of the variables in the cross-tabulation are nominal, although it can also be used with ordinal variables.
Weight             Current       Former       Never          Total
1 Not overweight   88    6.2%    44   3.1%    1296   90.8%   1428   100%
2 Overweight       117   7.9%    60   4.1%    1302   88.0%   1479   100%
3 Obese            148   13.3%   45   4.1%    917    82.6%   1110   100%
Total              353   8.8%    149  3.7%    3515   87.5%   4017   100%

Table 7.9: Overweight status by asthma

Table 1 gives an example of a table of overweight status (ordinal, three categories) by asthma status (current, former, never). To calculate the chi-square statistic, follow these steps:

1. Calculate the predicted values assuming independence.
2. Calculate the residuals.
3. Calculate the standardized residuals.
4. Calculate the sum of the squares of the standardized residuals. Note that this will always be a positive number.
5. Calculate the “degrees of freedom” using the formula (I-1)(J-1), where I is the number of rows and J is the number of columns. In this example, I=3 and J=3, so the degrees of freedom equals 4.
6. Look up the “critical value” of the chi-square statistic with the appropriate number of degrees of freedom. If your chi-square is bigger than the critical value, there’s evidence of an association; if not, there is no clear evidence–that is, the data are consistent with the idea that the variables are independent.

Cell                      n̂        e       e/√n̂    e²/n̂
Not overweight, current   125.5    -37.5   -3.34   11.20
Not overweight, former    53.0     -9.0    -1.23   1.52
Not overweight, never     1249.5   46.5    1.31    1.73
Overweight, current       130.0    -13.0   -1.13   1.29
Overweight, former        54.9     5.1     0.69    0.48
Overweight, never         1294.2   7.8     0.22    0.05
Obese, current            97.5     50.5    5.11    26.10
Obese, former             41.2     3.8     0.60    0.36
Obese, never              971.3    -54.2   -1.74   3.04
Total                     4017     0               45.76

Table 7.10: Calculating the sum of squared standardized residuals

Before we start, if you just look at the table it seems that obese people are more likely to have asthma. Overweight people are in between, although they seem more like people who are not overweight.
That is, the two variables appear to have something to do with each other: if you know whether someone is overweight, you can make a better guess about whether they have asthma. To calculate the predicted values, use the formula n̂ = (Nrow × Ncol)/N. Nrow is the total for that row, and Ncol is the total for that column. For example, for people who are not overweight and currently have asthma, the predicted value is (353 × 1428)/4017 = 125.5. To calculate the chi-square statistic, it helps to make a table like Table 7.10. After you compute the sum of the squared standardized residuals, there’s not much calculation. You just have to look up the critical value in a table. With four degrees of freedom, the 5% critical value is 9.49. The value we see is much bigger than that, so we can conclude there’s pretty good evidence that the variables are not independent: that is, that there really are differences in the chance of having asthma depending on whether you are overweight. The 1% critical value is 13.27, and the 0.1% critical value is 18.47, so even someone who asked for stronger evidence would still have to agree. In fact, the chance of getting a value like this just by chance would be tiny, something like 3 in a billion.

7.5 Examining Tables

Suppose you obtain a statistically significant chi-square statistic. That means that there is evidence that the variables are related. However, you usually want to go beyond that and say how they are related. In some cases, it’s easy to see: for example, here you can say that the more overweight you are, the higher your chance of having asthma, but in other cases it’s more complex: for example, the categories may fall into several groups. The basic principle is to look for the rows and columns in which there are large standardized residuals. Those are the ones that are clearly different from the others.
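Putting the steps of the chi-square calculation together, the whole procedure can be sketched from the counts in the overweight-by-asthma table (a sketch; row and column order are as in the table):

```python
# Rows: not overweight, overweight, obese; columns: current, former, never
counts = [
    [88, 44, 1296],
    [117, 60, 1302],
    [148, 45, 917],
]
row_t = [sum(row) for row in counts]
col_t = [sum(col) for col in zip(*counts)]
n = sum(row_t)

chi2 = 0.0
for i, row in enumerate(counts):
    for j, obs in enumerate(row):
        exp = row_t[i] * col_t[j] / n      # predicted value under independence
        chi2 += (obs - exp) ** 2 / exp     # squared standardized residual

df = (len(counts) - 1) * (len(counts[0]) - 1)  # (I-1)(J-1) = 4
```

The sum comes out to about 45.8, matching Table 7.10, and is far above the 5% critical value of 9.49 for 4 degrees of freedom.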
Then try to see if you can give a plausible “story” about the pattern.

Chapter 8
Correlation and Simple Regression

8.1 Correlation

Correlation, like cross-tabulation, involves the association between variables. Association means that the variables aren’t independent–they have something to do with each other. However, “something to do with each other” is very general, so we usually want to go beyond saying that there’s some association–we want to be able to say something about particular kinds of association. Two forms of association are particularly important. Positive association: the bigger the value of x, the bigger the value of y (on the average). Negative association: the bigger the value of x, the smaller the value of y (again, on the average). Positive and negative association are meaningful concepts for the association between two ordinal or interval/ratio variables. They are not meaningful when a nominal variable is involved. For example, it would not be meaningful to say that there’s a positive association between ethnicity and years of education, because “the larger ethnicity is” doesn’t make sense. It would make sense to talk about positive or negative association between years of education and income, because “more education” and “more income” are both meaningful ideas. If you have a dichotomy, you can characterize association as positive or negative, even though that’s not the natural way to describe it. E. g., sex (M=1 and F=2) and number of days physical health was not good (0...30). Earlier in the course, we saw that women tend to have higher numbers for days physical health was not good. Since female is the higher value on sex, you could describe this as a positive association: “higher values of sex” (being female rather than male) go with higher values on the health question. Warning: it’s important to note what high values mean for each variable, and also to say that when you’re describing any results.
You can’t assume that the meaning of a higher value is what you expect from the variable name. For example, in the question on life satisfaction, higher numbers mean less satisfied (or more dissatisfied). So if you said that there’s a positive association between some variable and the life satisfaction variable, people might draw the wrong conclusion. It’s better to say something that seems obvious, e.g., that higher values of age mean older, than to run the risk that people will draw the wrong conclusion. In practice, that means you need to check the codebook or the “variable view” before considering the association between variables. A particular kind of positive and negative association is known as linear association. It is represented by the equation ŷ = α + βx, where ŷ represents a predicted value of y. It’s called linear because any equation of that form corresponds to a straight line on a graph, if position on the horizontal axis represents the value of a case on the one variable and position on the vertical axis represents the value on another. If one of the variables can be thought of as a cause and the other as an effect, the “cause” variable is traditionally put on the x axis. The idea of linear association is not meaningful for nominal variables, so correlation should not be used if one or both of the variables you are interested in is nominal. Correlation can be used if both of the variables are ordinal or interval.

8.1.1 Calculating the correlation

1. “Center” x and y by subtracting their means.
2. Compute the product (x − x̄)(y − ȳ).
3. Compute the squares of (x − x̄) and (y − ȳ).
4. The correlation is then

r = Σ(x − x̄)(y − ȳ) / (√(Σ(x − x̄)²) √(Σ(y − ȳ)²))

Let’s take an example–I’ll use a hypothetical example to make the calculations easier. Suppose we have two variables x and y, representing grades on two tests. They are measured 1-4, where 1 means D and 4 means A. Putting it all together, the correlation in this case is 4/(√8 × √8) = 0.5.
x    y    x − x̄   y − ȳ   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
4    3    +1      +1      1          1          1
4    4    +1      +2      1          4          2
2    2    -1      0       1          0          0
1    1    -2      -1      4          1          2
4    1    +1      -1      1          1          -1
3    1    0       -1      0          1          0
18   12   0       0       8          8          4

Table 8.1: Example of calculating a correlation

The correlation is a number between -1 and 1 that represents the linear relationship between two variables. A number that is farther from 0 represents a stronger relationship, in the sense that the variables predict each other more accurately. In terms of a graph, the correlation represents how closely the points are clustered around a straight line showing the relationship between x and y. If every point falls exactly on the line, the correlation is +1 if the line slopes up, -1 if it slopes down. If you have a horizontal line, the correlation is undefined. If the points are scattered all over with no pattern, the correlation is zero.

8.1.2 Standard error of the correlation

The standard error of a correlation is approximately (1 − r²)/√N (the letter r is often used for the correlation). You can use this formula to calculate confidence intervals or t-ratios involving the correlation.

8.1.3 Correlation matrix

When you have just two variables, you can simply give the correlation. But when you have more than two, it’s convenient to show the correlation between each pair in the form of a “matrix.” Table 2 shows a correlation matrix involving the variables sex (1=M 2=F), education, income, and satisfaction with life (1=very satisfied ... 4=not satisfied).

          Female   Educ     Income   Satis
Female    1.000    -.04     -.14     .00
  N       4232     4216     3686     3957
Educ      -.04     1.000    .44      -.13
  N       4212     4216     3679     3947
Income    -.14     .44      1.000    -.25
  N       3686     3679     3686     3476
Satis     .00      -.13     -.25     1.000
  N       3957     3947     3476     3957

Table 8.2: Example of a correlation matrix

To find the correlation between any pair of variables, locate the column for one and the row for the other. For example, the correlation between income and education is .44.
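The arithmetic in Table 8.1, along with the approximate standard-error formula, can be sketched as follows (a sketch reusing the grade data from the worked example):

```python
import math

# Grades on two tests, from Table 8.1
x = [4, 4, 2, 1, 4, 3]
y = [3, 4, 2, 1, 1, 1]

x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)             # 3 and 2
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))  # 4
sxx = sum((a - x_bar) ** 2 for a in x)                      # 8
syy = sum((b - y_bar) ** 2 for b in y)                      # 8
r = sxy / (math.sqrt(sxx) * math.sqrt(syy))                 # 0.5

# Approximate standard error of a correlation: (1 - r**2) / sqrt(N)
se_r = (1 - r ** 2) / math.sqrt(len(x))
```

The computed r matches the 0.5 worked out in the text; with only six cases the standard error is, unsurprisingly, large.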
Note that it doesn’t matter which is the row and which is the column–the table is symmetrical. In terms of the formula, it doesn’t matter which you call x and which you call y. The 1.000 in the diagonal means that the correlation of a variable with itself is one.

8.1.4 Correlations and scale

The correlation is not affected by the scale of a variable. For example, suppose a data set has a measure of height. Then the correlation of any variable with height is the same regardless of whether you measure height in inches or centimeters. Also, the correlation of height in inches with height in centimeters is 1.00. That is, if you know a person’s height in inches, you can predict their height in centimeters perfectly. So a correlation is similar to a standardized score, and different from a mean or standard deviation, in that respect. That is, if I say that the correlation between two variables is a particular value, you don’t have to know the units of the variables in order to interpret that. The only thing you need to know is what a higher value of each variable means.

8.1.5 Interpreting correlations

People sometimes assume that because the possible range of the absolute value of a correlation is from 0 to 1, correlations with values like 0.1 are too small to be of interest. This is wrong–the standards for what should count as a large or small correlation differ depending on the kind of variable you are talking about. In general, when you are dealing with data on individual people, the correlations are well under 0.5. When you are dealing with units like nations, correlations tend to be much larger. So the best way to judge a correlation between x and y is to look at the correlations of other variables with x and/or y. How does the correlation you found compare with other correlations that are generally thought to be “important”?

8.2 A Visual Interpretation

Correlation and regression can both be understood in terms of a “scatterplot” showing the values of x and y.
It’s easier to grasp a scatterplot when the values of the variables are continuous rather than limited to a small number of values, and there aren’t too many cases. The main data set for the class has lots of cases, and almost all of the variables have a limited number of categories. So to look at correlation and regression, I’ll use another data set, giving selected characteristics of nations that are members of the Organisation for Economic Cooperation and Development (the OECD includes most of the affluent nations, plus a few middle-income nations like Mexico and Turkey). Two of the variables are “Gini coefficients.” The Gini coefficient is a measure of inequality ranging from 0 (complete equality) to 1 (one person has all of the income in the country). The data includes the Gini coefficient before taxes and transfers (basically, inequality in what people earn), and the Gini coefficient after taxes and government transfers. The mean “before” value is about .46 and the mean “after” value is .32. That is, government taxes and spending usually make things more equal, which is to be expected. But we could also ask about the relationship between before and after values. You would expect a positive relationship: if a country starts out more equal (relative to others), it will end up more equal. That’s not logically necessary, but it seems more likely. But how strong will the relationship be? If every government reduced inequality by the same amount, then you could predict the amount of inequality perfectly by knowing how much equality there was before.

Figure 8.1: Scatterplot, inequality before and after taxes and transfers

There is a definite relationship–the higher the inequality in earnings, the higher the inequality after taxes and transfers. That is, it is a positive relationship. The correlation is 0.559.

8.3 Regression

Suppose you want a straight line representing the relationship.
Visually, you could try to draw a line that comes close to passing through all of the points, but different people might make somewhat different choices. So it’s desirable to have a definite standard. Any line can be represented by an equation

y = α + βx + e   (8.1)

Suppose you define the best line as the one that makes Σe² as small as possible. You could try out different values of α and β and then calculate the sum of the squared errors, but you don’t need to use this trial-and-error approach. There’s a formula for finding the values that give you the “least-squares” fit. In this case, it’s ŷ = −.083 + 0.870x.

Given an equation, you can put in values of x for different cases and get predicted values of y. For example, the value of x (Gini before) for the United States is 0.486. Applying the equation, the predicted value for the Gini after is .340. The actual value of the Gini after for the United States is .380. That means that the value of e (the residual) for the US is .04. So the US has more inequality after taxes and transfers than the equation predicts. Next we might ask whether we should consider that a large error or a small error. To do this, we can compute the standard deviation of the residuals, and then compute a standardized score. The standardized residual for the United States is 0.75: that is, the error is not unusually large.

The numbers in the regression equation are known as “coefficients.” What do the regression coefficients mean? The β coefficient tells you the effect of a one-unit increase in x on the predicted value of y. That is, it’s not just a number (like the correlation), but a number of y’s per x. In this case, both units are points on the Gini coefficient. However, the units will not normally be the same. An everyday example that helps to illustrate the nature of a regression coefficient is miles per hour.
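Plugging the United States into the fitted equation can be sketched in a few lines (the coefficients and Gini values are the ones quoted above):

```python
alpha, beta = -0.083, 0.870   # least-squares coefficients from the text

def predict(x):
    """Predicted value from the fitted line y-hat = alpha + beta * x."""
    return alpha + beta * x

gini_before_us = 0.486   # US Gini before taxes and transfers
gini_after_us = 0.380    # actual US Gini after taxes and transfers

y_hat = predict(gini_before_us)    # predicted Gini after
residual = gini_after_us - y_hat   # e = actual minus predicted
print(round(y_hat, 3))     # 0.34
print(round(residual, 2))  # 0.04: the US is less equal than predicted
```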
If you know the time (hours) that someone drives and the average speed (miles per hour) at which they drive, you can compute the distance (miles). With a regression, when we multiply x by β, we get a predicted value in the same units as y.

The α coefficient is usually of less interest than β. It gives the predicted value of y if x = 0. However, the value x = 0 is not always possible in principle, and even if it is, it may not exist in practice. In this case, it is possible in principle (everyone has exactly the same income), but no country is close to it (the lowest actual value for the Gini before is 0.344). And if we apply the equation with x = 0, we get a predicted value of −.083, which is impossible because the Gini coefficient can’t be less than zero. So usually the α coefficient is just treated as a number that you have to have in the equation, not as something that’s meaningful in its own right.

8.3.1 Residuals

In a good regression model, the residuals should represent unpredictable factors–that is, there should be no pattern, because a pattern means something is predictable. If there is a pattern, that means you should try to modify the regression to accommodate it. One way to assess the pattern is to look at the unusually large (positive or negative) residuals and think about whether those cases have anything in common. In this example, the largest standardized residual (2.41) is for Chile, and the second largest is for Mexico (2.39). Those are the only residuals greater than 2 (there are none less than −2), but there are two more that are close: Turkey at 1.93 and South Korea at 1.91. If we ask what those four countries have in common, one thing that comes to mind is that they are all relatively low income by the standards of this group (Korea is only a little below average, and the other three are well below). So that suggests that maybe less affluent countries don’t do as much to redistribute income as more affluent ones.
That’s just an idea, but later we’ll see how you could evaluate it.

8.3.2 Calculating regression coefficients

The formula for β is:

β = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²   (8.2)

Notice that the numerator is the same as in the formula for correlations, but the denominator is different: it involves only the independent variable x. This is related to an important difference between correlation and regression. The correlation is symmetrical: it doesn’t matter which variable you call x and which you call y; you get the same correlation. The regression coefficient is not symmetrical: you get a different value of β depending on which variable is dependent and which is independent.

People are usually primarily interested in β, since it tells you about the relation between the variables. The α coefficient is necessary, but usually isn’t the focus of interest. However, if you need to calculate it, the formula is:

α = ȳ − βx̄   (8.3)

That is, you first calculate β, and then use this formula. The reason this formula works is that with a least squares regression, the predicted value of y when x = x̄ is ȳ.

8.3.3 Dependent and Independent Variables

The correlation of x with y is the same as the correlation of y with x. But the coefficients in the regressions y = α + βx and x = α + βy will not normally be the same. Therefore, you need to think about which variable should be on the left. That is called the “dependent variable” and is normally symbolized by y. The variable on the right is called the “independent” or “predictor” variable and is normally symbolized by x. If you think in terms of cause and effect, x should be the potential cause and y should be the variable that is affected. For example, if you were doing a regression with age and income, income should be the y variable: it may be influenced by age, but age can’t be influenced by income. Often things are not this clear: for example, if you have opinions on two subjects, it is logically possible for the influence to go either way.
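Formulas (8.2) and (8.3) can be computed directly. A small sketch with invented data, which also shows the asymmetry between regressing y on x and regressing x on y:

```python
def ols_slope_intercept(x, y):
    """beta = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2);
    alpha = ybar - beta * xbar  (formulas 8.2 and 8.3)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = sum((a - xbar) ** 2 for a in x)
    beta = num / den
    alpha = ybar - beta * xbar
    return alpha, beta

# Invented data: the slope of y on x differs from the slope of x on y,
# even though the correlation is the same either way.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a_yx, b_yx = ols_slope_intercept(x, y)   # regress y on x
a_xy, b_xy = ols_slope_intercept(y, x)   # regress x on y
print(round(b_yx, 2), round(b_xy, 2))    # 0.6 1.0: not the same slope
```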
In such cases, you have to rely on “common sense” or outside information. For example, if I had the variables of satisfaction with life and self-rated health, I would choose satisfaction with life as the dependent variable. That’s because if someone said that they weren’t very satisfied and you asked why, it would make perfect sense if they said “because I am in poor health.” If someone said they were in poor health and you asked why, it would not seem as natural to say “because I’m not satisfied with my life.” But this is my judgment, not an issue that can be decided by statistics.

8.3.4 Analysis of Variance

The file Gini.pdf contains SPSS output involving a regression with Gini (after) as the dependent variable. The independent variable (Security) is a measure of the extent of government income security measures (like retirement, disability, and unemployment insurance). One of the goals of these measures is to increase the income of people who would otherwise be poor, so to the extent that they are effective in doing this, they should reduce the Gini index.

Look at the “ANOVA” (analysis of variance) table. One column is labelled “df” for “degrees of freedom.” We saw degrees of freedom before in the chi-square and F tests. In a regression, there are N−1 total degrees of freedom. They are divided into two groups: k “regression” degrees of freedom, where k is the number of independent variables in the regression (k=1 in a simple regression), and N−k−1 “residual” degrees of freedom.

The best way to understand the term “degrees of freedom” is as equivalent to pieces of information. That is, we have observations on a number of cases. Each case is another piece of information. The regression expresses the same information in a new way. That is, each value of y is written as y = ŷ + e. So the regression uses just one number estimated from the data (the regression coefficient β) to predict part of the variation in y.
The residuals (which are also estimated from the data) account for the rest of the variation in y.¹

Another column involves sums of squares. The total sum of squares is Σ(y − ȳ)². It is broken up into two parts: the regression and residual sums of squares. The regression sum of squares is Σ(ŷ − ȳ)², that is, the sum of squared deviations of the predicted values from the mean. The residual sum of squares is Σe².

The ratio of the regression sum of squares to the total sum of squares is called the R². The reason for the term is that it’s the square of the correlation (sometimes called r) between the predicted value of y and the actual value of y. In the case of simple regression, it’s the square of the correlation between x and y.² The R-square is a measure of how well you can predict y from the regression. It’s sometimes called “explained variance,” but “explained” really just means predicted, not explained in any deeper sense. Regression divides up the total sum of squares into the part that can be predicted by x and the part that can’t be predicted by x.³

The “standard error of the estimate” in the model summary is an estimate of the standard deviation of the residuals. The reason I call it an estimate is that the regression based on the observed data is an estimate of the regression based on the entire population. That is, you could imagine observing every case and doing the same regression. In that case, you would have the true regression coefficients, and therefore the “true” errors. But you actually observe only some of the population, so your regression coefficients are just estimates of the true coefficients. That means you have only an estimate of the errors.

¹ The reason for the −1 is that the variation is relative to the mean, and calculating the mean takes one degree of freedom. The α coefficient doesn’t predict any of the variation, because it applies to all cases.
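The sum-of-squares decomposition can be checked numerically. A sketch with invented data: fit the least-squares line, then confirm that the regression and residual sums of squares add up to the total, and that their ratio gives R².

```python
x = [1, 2, 3, 4, 5]   # invented data
y = [2, 4, 5, 4, 5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
beta = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
        / sum((a - xbar) ** 2 for a in x))
alpha = ybar - beta * xbar
yhat = [alpha + beta * a for a in x]   # predicted values

ss_total = sum((b - ybar) ** 2 for b in y)              # sum((y - ybar)^2)
ss_reg = sum((h - ybar) ** 2 for h in yhat)             # sum((yhat - ybar)^2)
ss_resid = sum((b - h) ** 2 for b, h in zip(y, yhat))   # sum(e^2)

# The decomposition: total = regression + residual.
print(round(ss_reg + ss_resid, 6) == round(ss_total, 6))  # True
r_squared = ss_reg / ss_total
print(round(r_squared, 2))  # 0.6
```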
² The correlation between x and the predicted value of y is 1.00 or −1.00, depending on whether the relationship between x and y is positive or negative.

³ You may remember that (a + b)² = a² + 2ab + b². The reason that the total sum of squares divides up into two sums of squares is that in a least squares regression, Σ(ŷ − ȳ)e is always zero. That is, the predicted values are uncorrelated with the residuals.

The distinction between residuals and errors is that residuals are values obtained from an observed sample, while errors are hypothetical values in the population.

8.4 Transformations

8.4.1 Dummy variables

You can include a dichotomy as an independent variable in a regression. A higher value of x then means being in one category rather than another. However, it’s conventional to convert dichotomies to “dummy” or “indicator” variables. These variables have the values 0 or 1; usually they are named for the category that is one. For example, you could convert the variable for sex (1=M, 2=F) to a dummy variable called “male” (1=M, 0=F) or one called “female” (0=M, 1=F). This isn’t strictly necessary, but there are practical advantages. One is that it helps people to understand the regression output: a significant effect of “sex” tells you that men and women are different, but not the direction of the difference. You can also make a dummy variable out of another type of variable: for example, for some purposes it might be useful to have a dummy variable for people aged 65 and above.

8.4.2 Change of Scale

Some statistics, like means, standard deviations, and regression coefficients, depend on the units of the variables. With these statistics, you may get very large or very small numbers. This is especially true for regression coefficients, because they depend on the scale of both the independent and dependent variables.
For example, in a regression with the Gini index as the dependent variable and per-capita GDP as the independent variable, the β coefficient is equal to −.00000306. In SPSS, this is written as −3.06E-06. The “E-06” means move the decimal point six places to the left. “3E+03” would mean move the decimal point three places to the right, that is, 3000. Using this notation, you can write any number, but people find it hard to deal with very large or very small numbers. So in these cases, you can make things easier by changing the scale: that is, multiplying or dividing one or both of the variables by multiples of ten.

Suppose we divided GDP by 1000, so it was GDP in thousands of dollars. Then the regression coefficient would be −.00306. We might go farther and divide GDP by 100,000. Then the regression coefficient would be −.306. We could also change the dependent variable, but in this case we would want to multiply it rather than divide it. For example, suppose we kept GDP as is, but multiplied the Gini coefficient by 1000. Then when we did the regression, the coefficient would be −.00306. All of this is purely for convenience. It doesn’t change the ANOVA table, or any of the predicted values or residuals.

Multiplying and dividing by multiples of ten is a special case of a “linear transformation.” A linear transformation is any rule for turning an old variable into a new one that involves multiplication by and/or addition of a constant. For example, you can get from Fahrenheit to Celsius temperatures by a linear transformation. A linear transformation does not change the predicted values, residuals, or conclusions from a linear regression.

8.4.3 Non-linear transformations

There are lots of transformations that are not linear. For example, powers: x², x³, etc. The square root and the reciprocal are also powers, because √x = x^0.5 and 1/x = x^−1. Another non-linear transformation is the “common logarithm,” which is defined by the relationship 10^log(x) = x.
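A quick sketch of how rescaling changes the slope (the GDP and Gini values below are invented; only the scaling behavior matters):

```python
def slope(x, y):
    """Least-squares slope of y on x (formula 8.2)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

gdp = [10000, 20000, 30000, 40000, 50000]   # invented per-capita GDP, dollars
gini = [0.45, 0.40, 0.34, 0.31, 0.28]       # invented Gini values

b = slope(gdp, gini)
b_thousands = slope([g / 1000 for g in gdp], gini)   # GDP in $1000s
b_scaled_y = slope(gdp, [g * 1000 for g in gini])    # Gini multiplied by 1000

# Dividing x by 1000, or multiplying y by 1000, both multiply the slope by 1000.
print(round(b_thousands / b, 0))   # 1000.0
print(round(b_scaled_y / b, 0))    # 1000.0
```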
For example, 2 is the logarithm of 100, because 10² = 100. You can use other “bases,” but 10 is useful because it makes it easy to get a sense of the size of x if you’re given log(x). For example, if the logarithm of x is 4.2, you can tell that x is greater than 10,000 and less than 100,000. Each increase of 1.0 in the common logarithm of x is equivalent to multiplying x by 10. Sometimes you see the “natural logarithm,” which is defined by the relationship e^log(x) = x; e is a number approximately equal to 2.718. Although e has a lot of interesting mathematical properties, natural logarithms are simply equal to common logarithms times a constant (about 2.3), so it doesn’t really matter which you use. We’ll just consider the common logarithm because it’s easier to interpret.

The logarithm is defined only for positive numbers. The logarithm of 1 is 0, and the logarithm of a number between 0 and 1 is a negative number. For example, the logarithm of .01 is −2. The logarithm goes to minus infinity as x goes to zero. However, you can make the log transformation apply to variables with a value of 0 by taking the logarithm of x+k, where k is a small positive number. For example, you could apply the log transformation to x+0.25 or x+0.10 rather than to x. This is useful for count variables, which often have values of 0. So this discussion of transformations applies only to variables that can’t be negative. However, there are a lot of variables like that: examples include GDP, the unemployment rate, crime rates, height, weight, and number of children.

8.4.4 Ladder of Transformations

The 0 power is not useful as a transformation, because x⁰ = 1 for all x, but the logarithm can be regarded as filling the place of x⁰. This gives you what has been called a “ladder of transformations”: xᵖ, where p is a number.
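A few of these facts about logarithms, checked directly with Python’s math module:

```python
import math

print(math.log10(100))               # 2.0: 10^2 = 100
print(math.log10(1))                 # 0.0: the log of 1 is zero
print(round(math.log10(0.01), 10))   # -2.0: logs of values below 1 are negative

# Natural log is a constant multiple of the common log (the constant is ln 10).
print(round(math.log(50) / math.log10(50), 4))   # 2.3026

# For count variables with zeros, shift by a small constant before logging.
counts = [0, 1, 4, 10]
shifted_logs = [math.log10(c + 0.25) for c in counts]   # defined even at 0
```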
As you go “up” the ladder, the distribution of the new variable w becomes stretched out to the right–the large values grow proportionately faster–see the table for an example:

  1/x    log(x)   √x     x     x²     x³      x⁴
   ∞      −∞      0      0     0      0       0
   1       0      1      1     1      1       1
  0.5     0.30    1.41   2     4      8       16
  0.33    0.48    1.73   3     9      27      81
  0.20    0.70    2.24   5     25     125     625
  0.10    1       3.16   10    100    1000    10000

Table 8.3: Powers of x for selected values of x

As you go “down” the ladder, large values get pulled in, so that the distribution becomes less stretched out to the right. Negative powers, like the inverse, also reverse the order–the largest values become the smallest. You can experiment with different transformations to see which works best.

8.4.5 Transformations and nonlinear relationships

The most important reason to transform variables is that the relationship between x and y might not follow a straight line. As an example, it looks like there is a negative relationship between GDP and the Gini coefficient, but it doesn’t seem to follow a straight line. Countries with a GDP of around $20,000 have substantially less inequality than countries with about $1000, but it’s not clear that countries with about $40,000 are much lower than countries with about $30,000. So we might get a better idea of the relationship if we transformed one or both variables.

Figure 8.2: Relationship between per-capita GDP and Gini coefficient

With powers greater than 1, the slope increases as x increases. With powers less than one, the slope decreases as x increases. So suppose that you think that a one-unit change in x has more impact on y when you start from a small value of x. Then you should use a transformation of x like the square root, and use the transformed variable (w) as the independent variable instead of x. If you think that a one-unit change in x has more impact when you start from a large value, you should use a transformation like x². As an alternative to transforming x, you can transform y.
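Table 8.3 can be reproduced with a short script (the `ladder_row` helper is just for illustration):

```python
import math

def ladder_row(x):
    """One row of Table 8.3: 1/x, log(x), sqrt(x), x, x^2, x^3, x^4."""
    if x == 0:
        # 1/0 and log(0) are taken as plus and minus infinity.
        return [float('inf'), float('-inf'), 0.0, 0, 0, 0, 0]
    return [1 / x, math.log10(x), math.sqrt(x), x, x**2, x**3, x**4]

header = ["1/x", "log(x)", "sqrt(x)", "x", "x^2", "x^3", "x^4"]
print("  ".join(f"{h:>8}" for h in header))
for x in [0, 1, 2, 3, 5, 10]:
    print("  ".join(f"{v:>8.2f}" for v in ladder_row(x)))
```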
Then the effects of going “up” or “down” the ladder are opposite–for example, if your dependent variable is y², a one-unit change in x will have a decreasing effect on y. For example, given the scatterplot for the relationship between GDP (x) and the Gini index (y), we could consider using a square root or log transformation for x, or a square or cube transformation for y.

8.4.6 Choosing Transformations

The transformations I’ve discussed are useful for representing two kinds of relationships, which are illustrated in the figures. In one, the slope increases as x increases. That is, the effect of a change in x on y is larger when you start from a high value of x. In the other, the slope decreases as x increases: the effect of a given change in x is larger when you start from a small value of x. Both kinds of relationships can be either positive or negative, so I show both positive and negative forms in each figure.

Figure 8.3: Increasing slopes

The rule: for increasing slopes, go “up” the ladder of transformations on x, or go “down” the ladder on y. For example, you might try representing the relationship in Figure 8.3 by using x² as an independent variable, or √y as the dependent variable. You could represent Figure 8.4 by using √x as the independent variable, or y² as the dependent variable.

Figure 8.4: Decreasing slopes

There is also the question of how far “up” or “down” the ladder to go. For example, √x, log(x), or 1/x? You can do it informally by plotting the transformed variable against the other variable and seeing if the line looks straight. A more formal method, which applies if you transform x, is to pick the transformation that gives you the largest regression sum of squares (or the largest R², which follows from the regression sum of squares). This method does not apply if you are transforming y.

Chapter 9: Multiple Regression

So far, we’ve been talking about regressions with just one independent variable.
But with most dependent variables you might be interested in, there are a number of factors that might make a difference, and often a large number of factors. That means you need “multiple regression”–regression including all of those factors as independent variables.

Suppose we have an idea that people who weigh more earn less, either because they are less productive or because of discrimination. So you do a regression with income as the dependent variable and weight (pounds/100) as the independent variable:

ŷ = 5.424 + .144x   (9.1)

The t-ratio is 1.74, so it wouldn’t usually be considered significant, but it is pretty close. In any case, the results don’t support the idea. But this regression omits an important variable. Men tend to be heavier than women, and men tend to earn more than women. Suppose we include a dummy variable for men. To distinguish the independent variables, we can call weight x₁ and male x₂. Then the regression is:

ŷ = 5.728 − .183x₁ + .680x₂   (9.2)

The t-ratio for weight is now 2.03: that is, the heavier someone is, the less they earn. The difference between the results is that the first regression compares people who weigh more to people who weigh less; the second compares people who weigh more to people of the same sex who weigh less.

With multiple regression, a coefficient represents the difference that a variable makes to the predicted value “controlling for” all of the other independent variables in the regression. The term “controlling for” comes from experiments, where you might be able to literally hold constant all of the variables except the one you’re interested in. You can’t usually hold variables constant in the social sciences, but you can think of matching cases so that they’re the same except for one independent variable. But if you take this literally, it’s impossible to have two cases that are literally the same except for one thing–e.g., two men who are the same except that one weighs 10 pounds more.
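A sketch of what equation (9.2) implies for a same-sex comparison (the coefficients are the ones from the equations above; the particular weight value is invented):

```python
def pred_simple(weight):
    """Equation 9.1: income predicted from weight (pounds/100) alone."""
    return 5.424 + 0.144 * weight

def pred_multiple(weight, male):
    """Equation 9.2: income predicted from weight plus a dummy for male."""
    return 5.728 - 0.183 * weight + 0.680 * male

w = 1.8  # an invented weight: 180 pounds, in units of pounds/100

# Compare two men who differ by 10 pounds (0.1 units), sex held constant.
diff = pred_multiple(w + 0.1, male=1) - pred_multiple(w, male=1)
print(round(diff, 4))   # -0.0183: within sex, heavier predicts lower income
```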
Even identical twins, who are genetically the same, would have different life experiences. However, many things about people aren’t relevant to their earnings, or make only a little difference, so it’s not necessary to match them with respect to those variables. So the realistic goal is to match people on all relevant factors. If you do that, the regression coefficient can be interpreted as the effect of the independent variable on the dependent variable.

People often distinguish between the independent variable you are interested in and “control variables.” The reason to include the control variables in a regression is that you need them to get accurate estimates of the effect of the variable you are interested in. The goal is to include all of the control variables that really do influence the dependent variable and exclude those that don’t. This goal presents a dilemma. If you want to make sure you include everything that makes a difference, you’ll include some variables that are unnecessary. If you want to make sure that you don’t include unnecessary variables, you run the risk of omitting some that really do make a difference. Usually, it’s considered better to include unnecessary variables than to omit ones that do make a difference, so when in doubt, you should include a control variable. However, there are some drawbacks to having unnecessary independent variables, so just doing a regression with every independent variable you have is not considered a good idea. So people try out different “specifications,” with the goal of finding the one that includes everything that really does make a difference to the dependent variable, but doesn’t include superfluous variables.

The interpretation of a regression coefficient is that if xⱼ increases by one unit while all other independent variables remain the same, then the predicted value of y will increase by βⱼ. Notice that βⱼ can be negative, in which case “increase by βⱼ” means that the predicted value of y becomes smaller.
More generally, if xⱼ changes by k and all other independent variables stay the same, the predicted value will change by βⱼk. Of course, some variables can’t change in a literal sense, but in that case you can think of comparing two cases which are the same with respect to all of the independent variables but one.

How do you decide if a variable really makes a difference? Look at the second column in the SPSS output, “Std. Error.” We’ve seen standard errors before, when dealing with the difference between the means in two groups. The idea is the same here–the standard error is an estimate of how different the sample value might be from the population value. That is, it tells you what you could expect to get if you could perform this regression on the whole population. As before, you get a 95% confidence interval by taking the estimate plus or minus two times the standard error. For example, for “male” the estimate is .68 and the standard error is .079. So the 95% confidence interval is about .52 to .84. The values in this confidence interval are all positive. That is, we can be confident that in the population, men would be found to have higher incomes than women. So we definitely do need to take account of sex when considering the effects of weight. If the confidence interval includes zero, that means we can’t be sure whether the variable makes any difference, so it’s considered all right to remove the variable. The 95% confidence interval is the usual standard.

An equivalent approach is to look at the column “t”, which is the coefficient estimate divided by the standard error. If the absolute value of the t-ratio for a control variable is less than 2.0, we can take it out. If the absolute value is greater than 2, we know we need to keep it. When you add or remove one control variable, the t-ratios for the other variables normally change.
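The plus-or-minus-two-standard-errors rule can be sketched as follows (the 2 is a rough stand-in for the 1.96 of the normal distribution; the coefficient and standard error are the ones quoted above for “male”):

```python
def ci_and_t(b, se):
    """Rough 95% CI (estimate +/- 2 standard errors) and the t-ratio."""
    return (b - 2 * se, b + 2 * se), b / se

(lo, hi), t = ci_and_t(0.68, 0.079)   # the "male" coefficient from the text
print(round(lo, 3), round(hi, 3))     # 0.522 0.838: interval excludes zero
print(round(t, 1))                    # 8.6: well above the 2.0 cutoff
```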
So if you start with a lot of potential control variables, it’s best to remove variables one at a time, rather than all at once. A reasonable approach would be to start from the smallest t-ratio and keep removing variables until everything left in the regression has a t-ratio of 2 or above. You can also start with the variable you’re interested in, and then add potential control variables. That’s what we did here. If the t-ratio is significant, keep it in and add another; if it’s not, take it out and add another in its place. Again, it’s usually best to make these changes one at a time. So in this case, we would say that sex needs to stay in the regression, and then think about whether there are other variables that might influence income; if so, we should add them.

Model Summary

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .270   .073       .071                .60633

Predictors: (Constant), NUMBER OF CHILDREN IN HOUSEHOLD, female, EDUCATION LEVEL, INCOME LEVEL, REPORTED AGE IN YEARS

ANOVA

Model        Sum of Squares   df     Mean Square   F        Sig.
Regression   99.568           5      19.914        54.166   .000
Residual     1269.463         3453   .368
Total        1369.031         3458

Dependent Variable: satisfaction

Coefficients

                                   B       Std. Error   Beta    t        Sig.
(Constant)                         2.627   .073                 35.829   .000
female                             .058    .021         .045    2.690    .007
REPORTED AGE IN YEARS              .004    .001         .099    5.148    .000
EDUCATION LEVEL                    .012    .011         .020    1.109    .267
INCOME LEVEL                       .079    .006         .265    14.197   .000
NUMBER OF CHILDREN IN HOUSEHOLD    .019    .011         .032    1.694    .090

Dependent Variable: satisfaction

Figure 9.1: Example of multiple regression

9.1 Example of a multiple regression

The SPSS output in Figure 9.1 shows the results from a regression of satisfaction with life (1=very dissatisfied, 2=dissatisfied, 3=satisfied, 4=very satisfied) on five variables: female (1=female, 0=male), age (in years), education (1=none, 2=elementary, 3=hs, 4=graduated hs, 5=attended college, 6=graduated from college), income (1=less than 10K, 2=10-15K, 3=15-20K, 4=20-25K, 5=25-35K, 6=35-50K, 7=50-75K, 8=75K and up), and number of children under 18 in the household. Some questions:

1. What is the predicted value for a 50-year-old man who has graduated from college, makes $100,000 per year, and has no children?
2. Suppose that the man from question 1 says he is “very satisfied.” What is his residual?
3. Suppose that the man from question 1 says he is “very dissatisfied.” What is his residual?
4. What kind of person will have the highest predicted value of satisfaction?
5. What kind of person will have the lowest predicted value of satisfaction?
6. What is the predicted value for a 70-year-old man who has graduated from college, makes $60,000 per year, and has no children?

9.2 Standardized Coefficients

You might want to say something about the relative importance of the different independent variables. You can’t do this just by looking at the coefficients, because those depend on the scale of the variables. For example, the coefficient for age is smaller than the coefficient for female. But the value of “female” cannot differ by more than one (it is zero or one), while the value of age can differ by 80 (an 18-year-old vs. a 98-year-old). The “standardized coefficients” are a way to compare the relative importance of different independent variables.
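Question 1 (and question 2) can be worked out directly from the coefficients in Figure 9.1. A sketch:

```python
# Coefficients from Figure 9.1 (satisfaction regression).
coefs = {"const": 2.627, "female": 0.058, "age": 0.004,
         "educ": 0.012, "income": 0.079, "children": 0.019}

def predict(female, age, educ, income, children):
    """Predicted satisfaction from the fitted equation."""
    return (coefs["const"] + coefs["female"] * female + coefs["age"] * age
            + coefs["educ"] * educ + coefs["income"] * income
            + coefs["children"] * children)

# Question 1: a 50-year-old man (female=0), college graduate (educ=6),
# $100,000 per year (top income category, 8), no children.
y_hat = predict(female=0, age=50, educ=6, income=8, children=0)
print(round(y_hat, 3))       # 3.531

# Question 2: if he says "very satisfied" (4), his residual is 4 - 3.531.
print(round(4 - y_hat, 3))   # 0.469
```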
A standardized coefficient is equal to β·(sx/sy), where sx and sy are the standard deviations of x and y.

Different independent variables will have different standard deviations, so the relative sizes of the standardized coefficients will differ from those of the unstandardized coefficients. However, the signs will always be the same. The farther a standardized coefficient is from zero, the bigger the impact of x on y. In this case, we can say that income is the most important variable, then age, then gender, then number of children, then education. You shouldn’t take small differences in the standardized coefficients too seriously: for example, number of children vs. gender. But it’s clear that income is the most important variable.

You can also think about how the unstandardized coefficients are related in terms of the original units. For example, gender makes about as much difference as about 15 years of age. Thinking in terms of the original units is useful when those units have a meaningful interpretation (as age and sex do).

9.3 Direct, Indirect, and Total Effects

I have said that a regression coefficient βⱼ can be interpreted as the expected change in y if xⱼ increased by one unit and all other x variables stayed the same. Or if xⱼ is not something that can literally change, you could think of comparing two cases that differ by one unit on xⱼ and are the same on the other independent variables: for example, suppose you compared a man and a woman with the same age, income, education, marital status, and number of children. However, this interpretation raises the question of whether it is reasonable to expect one of the x variables to change while everything else stays the same. With many things in social life, it seems reasonable to think that if one thing is different, then several other things will be different. For example, the effect of education on satisfaction is not statistically significant in the regression we looked at previously.
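A sketch of the formula (the standard deviations below are invented, chosen so the results land near the Beta column in Figure 9.1):

```python
def standardized(b, s_x, s_y):
    """Standardized coefficient: b * (s_x / s_y)."""
    return b * s_x / s_y

s_y = 0.63   # assumed standard deviation of satisfaction (hypothetical)

# Age: small unstandardized coefficient, but a large standard deviation.
print(round(standardized(0.004, 15.6, s_y), 3))   # 0.099

# Female: larger unstandardized coefficient, but a small standard deviation.
print(round(standardized(0.058, 0.49, s_y), 3))   # 0.045
```

So after standardizing, age matters more than gender here, even though its unstandardized coefficient is smaller.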
Does that mean that if your goal is to maximize satisfaction, getting more education is useless? No: if someone has more education, then they will probably earn more money, and income does have a significant effect on satisfaction in that regression.

When you have a number of x variables, you can usually classify them as more distant from or closer to the outcome. First you have things that are fixed at birth: for example, gender, age, and race or ethnicity. Then you have things that are established at different times in life. For example, education is usually finalized in adolescence and early adulthood, and then the kind of job someone gets is influenced by their education. Finally, even with opinions or feelings, there are some that seem to be prior to others. These are things that could be thought of as answers to a question “why.” Note that this isn’t a matter of statistics; it’s about knowledge we already have (for example, that some things are determined at birth), or about what seems more reasonable. Of course, what’s reasonable is open to dispute, but hopefully you could get some consensus. Implications:

1. The independent variables should all be potential causes of the dependent variable; there should not be anything that is more likely to be caused by the dependent variable. For example, if your dependent variable is education, income should not be among the independent variables: income isn’t a cause of education, it is caused by education.

2. If you include an independent variable, you should include all variables that come before or are simultaneous with it that seem like they might influence the dependent variable. For example, if you include education, you should include gender (before); if you include gender, you should include race (simultaneous).

3. If you don’t include the variables that come “after” xⱼ, the regression gives the “total effects” of xⱼ on y.
If you do include the variables that come "after" xj, the regression gives the "direct effects" of xj on y, after controlling for those variables.

5. The difference between the "total effects" and the "direct effects" is known as the "indirect effects" operating through the later variables.

As an example, suppose that we start with the model in Figure 9.1. The coefficient for education is .012, and is not statistically significant. Now let's remove income; it's legitimate to do that, since income comes "after" education. Now the coefficient for education is .080, and the t-statistic is over 8, which is much bigger than the critical value. Why did it change so much? Because people who have more education have higher incomes, and as the previous regression shows, people with higher incomes are more satisfied. An indirect effect is an effect that has two (or more) steps.

Variable     (1)            (2)            (3)
Constant     2.627 (.073)   2.869 (.068)   3.350 (.037)
Female       .058 (.021)    .006 (.021)    .002 (.021)
Age          .004 (.001)    .002 (.001)    .001 (.001)
Educ         .012 (.011)    .080 (.009)
Income       .079 (.006)
Children     .019 (.011)    .023 (.011)

Table 9.1: Regressions for Direct and Total Effects

In this case, there is a substantial indirect effect of education, and a smaller direct effect (or maybe no direct effect). Notice that effects can be either positive or negative, so the total effect of a variable is not necessarily bigger than the direct effect. For example, when we just include female and age, the coefficient for female is only .002. But when we add income, education, and marital status, the coefficient for female is .069. That is, there is a negative indirect effect that almost exactly offsets the positive direct effect. Most of that indirect effect is the result of income: women earn less, so that makes them less satisfied with their lives. But if you compare men and women with the same income, women are more satisfied with their lives.
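The arithmetic connecting total, direct, and indirect effects is simple enough to check directly. Using the education coefficients quoted above (.080 without income, .012 with income controlled), a Python sketch (variable names are mine):

```python
# Education coefficients from the example above:
# total effect = coefficient with the "later" variable (income) excluded,
# direct effect = coefficient with income controlled.
total_effect = 0.080
direct_effect = 0.012

# The indirect effect (here, operating through income) is the difference.
indirect_effect = total_effect - direct_effect
print(round(indirect_effect, 3))   # 0.068
```

Most of education's effect on satisfaction in this example is indirect, operating through income.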
9.4 Nominal variables in Regression

As I've mentioned, linear regression represents the idea that "the bigger x is, the bigger (or smaller) y is." That means that you can't directly use nominal variables in a regression, because the idea of bigger and smaller doesn't make sense for them. However, there is a way to include a nominal variable as an independent variable. (You can't have a nominal variable as a dependent variable in linear regression.) You can include a dichotomy in a regression by arbitrarily defining one category as larger: for example, male=0 and female=1. Then a positive coefficient means women have a higher predicted value than men, and a negative coefficient means women have a lower predicted value than men. Note that you could do this just as well by making male=1 and female=0. For nominal variables, you create a series of dummy variables, each equal to one if the nominal variable has a particular value and zero if it doesn't. The result is that every case has a value of one on one of the dummies, and zero on all the others. If there are K categories, you include K-1 of them. The other one is a baseline against which everything else is compared, just as when you have a dichotomy.

9.4.1 Interpreting the coefficients

The coefficients for the dummy variables representing a nominal variable have to be interpreted as a group. Each one shows the predicted value in that group relative to the baseline category. Note that the baseline category is not explicitly shown: you have to remember what it is. If you want to know how the other categories compare to each other, you can compare their coefficients. It is often convenient to arrange the coefficients on a sort of number line. The baseline category implicitly has a value of zero: make sure that you include it. Note that the coefficients will be different if you choose a different baseline, but their relative values will always be the same.
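Here is a minimal Python sketch of the K-1 dummy-coding scheme just described (the function name and the small example list are mine; the course itself uses SPSS for this):

```python
def make_dummies(values, baseline):
    """Create one 0/1 dummy variable per non-baseline category.
    The baseline category gets no dummy: baseline cases are 0 on all."""
    categories = [c for c in sorted(set(values)) if c != baseline]
    return {c: [1 if v == c else 0 for v in values] for c in categories}

# Hypothetical cases of a nominal variable with three categories.
status = ["married", "divorced", "married", "widowed"]
dummies = make_dummies(status, baseline="married")
print(dummies["divorced"])   # [0, 1, 0, 1] would be wrong; case 2 is divorced
```

Re-running make_dummies with a different baseline omits a different category, so all the coefficients shift, but the comparisons among categories stay the same.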
That is, by choosing a different baseline you are just changing the zero point, not the relations among the categories.

9.4.2 Testing whether a nominal variable makes a difference

The t-tests for the dummy variables involve comparisons of each category with the baseline category. That is, each one involves just one pair of categories. Therefore, they don't address the more general question of whether the nominal variable makes a difference. Also, the t-statistics will change when you use a different reference category, and unlike the coefficients, there is no uniform relationship among them. Therefore, they cannot be used to answer the general question of whether the variable makes a difference. If you see a lot of significant t-ratios, that shows that the variable makes a difference. However, the converse is not true: an absence of significant t-statistics doesn't mean that the variable does not make a difference. There are several ways to test whether a nominal variable makes a difference. The simplest is to compare the Mean Square Residual, or the Standard Error of the Estimate (which is the square root of the MSR), for the regressions with and without the nominal variable. If the model with the nominal variable has a smaller Mean Square Residual, you can say the nominal variable makes a difference. This is the method we will use.¹

¹A better one is to compute the Mean Square Residual divided by the residual degrees of freedom. Again, if it is smaller, you can say the nominal variable makes a difference.

9.4.3 Example

I will add marital status to the regression predicting satisfaction with life. There are six categories of marital status (see the table). I made dummy variables for the first five categories, leaving member of an unmarried
MARITAL STATUS

Category           Frequency   Percent   Valid Percent   Cumulative Percent
Married                 2396      56.6            56.8                 56.8
Divorced                 583      13.8            13.8                 70.7
Widowed                  597      14.1            14.2                 84.8
Separated                 93       2.2             2.2                 87.0
Never Married            460      10.9            10.9                 98.0
Unmarried couple          86       2.0             2.0                100.0
Valid Total             4215      99.6           100.0
Missing                   17        .4
Total                   4232     100.0

Figure 9.2: Frequency table for marital status

couple as the reference. The results of a regression including the extra variables are shown. Some things to notice:

• The mean square residual is smaller than in the regression without the marital status variables (.363 compared to .368). So we can say that marital status seems to make a difference.

• The most satisfied group is married people. They have a positive coefficient, meaning that they are more satisfied than the reference group (members of an unmarried couple).

• All of the other groups are less satisfied than members of an unmarried couple: that is, all have negative coefficients.

• The least satisfied group is people who are separated.

• The coefficients and t-statistics for all of the other variables change. Some are bigger and some are smaller than before. For example, education was .012 before marital status was included. When marital status is included, it is .019. It's still not statistically significant, but it's pretty close. Income is still very significant, but its estimated effect is smaller (.062 compared to .079).

9.4.4 Combining categories

One problem with including dummy variables to represent nominal variables is that the regression gets complicated, making it hard for people to grasp what's going on. Therefore, it is sometimes useful to combine or "collapse" categories of a nominal variable. It is legitimate to do this with categories that are similar in terms of their relation to the dependent variable and seem similar in principle. For example, in this case, I could combine married people with members of an unmarried couple.
The difference between the two groups is not statistically significant, and they can be thought of as similar in the sense that both involve living with a partner. Notice that widowed people are close to members of an unmarried couple in terms of satisfaction, but in principle it wouldn't seem reasonable to combine them. I also combined separated people with divorced people. They are pretty similar in satisfaction, and they are also similar in terms of the nature of their situation. The resulting regression is shown in the next table.

Model Summary: R = .295, R Square = .087, Adjusted R Square = .084, Std. Error of the Estimate = .60232.
ANOVA: Regression SS 118.948 (df 10, Mean Square 11.895); Residual SS 1248.363 (df 3441, Mean Square .363); Total SS 1367.310 (df 3451); F = 32.787, Sig. = .000.

Coefficients (Dependent Variable: satis)
Variable                             B   Std. Error    Beta        t   Sig.
(Constant)                       2.704         .098           27.731  .000
female                            .068         .022    .053    3.149  .002
REPORTED AGE IN YEARS             .003         .001    .088    3.932  .000
INCOME LEVEL                      .062         .006    .208   10.092  .000
EDUCATION LEVEL                   .019         .011    .032    1.750  .080
NUMBER OF CHILDREN IN HOUSEHOLD   .008         .012    .014     .727  .468
married                           .075         .070    .059    1.075  .283
divorced                         -.132         .074   -.073   -1.783  .075
widowed                          -.026         .078   -.014    -.332  .740
separated                        -.213         .096   -.051   -2.212  .027
n_married                        -.049         .075   -.024    -.655  .512

Figure 9.3: Regression including marital status dummies

Model Summary (combined categories): R = .294, R Square = .086, Adjusted R Square = .084, Std. Error of the Estimate = .60235.
Predictors: (Constant), divsep, REPORTED AGE IN YEARS, female, EDUCATION LEVEL, n_married, widowed, NUMBER OF CHILDREN IN HOUSEHOLD, INCOME LEVEL.
ANOVA: Regression SS 118.098 (df 8, Mean Square 14.762); Residual SS 1249.212 (df 3443, Mean Square .363); Total SS 1367.310 (df 3451); F = 40.687, Sig. = .000.

Coefficients (Dependent Variable: satis)
Variable                             B   Std. Error    Beta        t   Sig.
(Constant)                       2.758         .078           35.139  .000
female                            .069         .022    .053    3.159  .002
REPORTED AGE IN YEARS             .004         .001    .093    4.213  .000
INCOME LEVEL                      .063         .006    .210   10.200  .000
EDUCATION LEVEL                   .020         .011    .033    1.818  .069
NUMBER OF CHILDREN IN HOUSEHOLD   .009         .012    .015     .777  .437
widowed                          -.100         .037   -.053   -2.698  .007
n_married                        -.118         .037   -.058   -3.176  .002
divsep                           -.215         .031   -.127   -7.010  .000

Figure 9.4: Regression including marital status dummies, combining categories

The Mean Square Residual is the same, meaning that the models are equally good in terms of fitting the data, and I prefer the model with combined categories on the grounds that it is simpler. Notice that the t-ratios for the variables involving marital status are much bigger than they were before. That is because the reference category is different: it is now married people plus members of an unmarried couple. As a general rule, when the reference category contains a larger number of cases, the standard errors are smaller and the t-ratios are larger.

Chapter 10

Beyond Linear Regression

10.1 Non-linear effects

This issue is related to transformations, which were covered in the previous chapter. In a linear regression, every one-unit change in x has the same effect on y as every other one-unit change. For example, going from 18 to 19 has the same effect as going from 28 to 29, or 88 to 89. Of course, this may not be true.
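The constant-slope property of a linear regression can be seen directly in a small sketch (the intercept and slope here are illustrative numbers, not from the course data):

```python
# A linear model forces every one-unit step in x to change the
# prediction by the same amount, the slope b.
a, b = 2.0, 0.5   # hypothetical intercept and slope

def predict(x):
    return a + b * x

step_young = predict(19) - predict(18)   # change from age 18 to 19
step_old = predict(89) - predict(88)     # change from age 88 to 89
print(step_young, step_old)              # both equal b
```

Transformations of x, discussed next, are one way to relax this restriction.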
One way to allow for the possibility of non-linear relationships is to try transformations, like √x or log(x), as independent variables. Decisions on whether to transform variables should be made separately for each variable. For example, if you transform age, you don't have to transform income. But all of the transformations we have talked about are "monotonic": the bigger x is, the bigger f(x) is. This means that none of them can represent the situation where there is a "peak" or a "valley", where the highest or lowest predicted values of y occur for the middle values of x rather than for the highest or lowest values of x. How can we allow for non-monotonic effects? One way is to break up the values of x into a number of dummy variables. The exact number is flexible, but usually about five is a reasonable choice. Let's take age as an example. The regressions we've seen so far suggest that satisfaction increases with age. But we can check to see if there's a more complex relationship. I created four dummy variables: age 18-34, 35-49, 50-64, and 65 and up. In the regression, 18-34 is the reference group. The results show that the two middle-aged groups are somewhat less satisfied than the youngest group, while the oldest group is more satisfied. That is, there appears to be a non-monotonic effect.

Coefficients (Dependent Variable: satis)
Variable            B   Std. Error    Beta        t   Sig.
(Constant)      2.936         .063           46.696  .000
female           .069         .022    .054    3.206  .001
EDUCATION LEVEL  .021         .011    .035    1.922  .055
divsep          -.199         .031   -.117   -6.498  .000
widowed         -.121         .037   -.065   -3.304  .001
nmar            -.141         .036   -.069   -3.881  .000
INCOME LEVEL     .066         .006    .221   10.725  .000
ymid            -.048         .037   -.033   -1.295  .195
omid            -.066         .036   -.050   -1.833  .067
old              .123         .039    .090    3.199  .001

Figure 10.1: Example of using dummy variables for non-linear effects

The dummy variable model is just an approximation.
If you take it literally, it implies that age makes no difference between 18 and 34, and then suddenly your satisfaction falls. This seems very unlikely, especially since the group limits were arbitrary. A model that allows for non-monotonic effects that change gradually includes both x and x² as independent variables. This kind of model can produce a "U" shape or an upside-down "U" shape for the effect of the variable. This model is known as a quadratic regression. A practical issue is that if x is big, x² will be very big, so the coefficient for the squared term can be very small, even if it has an important effect. So before squaring, it is a good idea to rescale the variable if necessary. In this case, I created a variable called age00, which is age/100. Then I squared that. So someone who is 20 has values of .2 (rescaled x) and .04 (squared), someone who is 50 has .5 and .25, etc. The t-ratio for the squared term is 4.76. That means it is statistically significant; it should be in there. What are the implications of this model? Let's take an example: a never-married man who has an education of 5 (some college) and an income of 5 ($25,000-$35,000). Suppose that you have a 20-year-old man with those characteristics. His predicted value is:

3.203 + .023 × 5 − .149 + .065 × 5 − 1.476 × 0.2 + 1.658 × .04 = 3.265

What about a 40-year-old man with those characteristics?

3.203 + .023 × 5 − .149 + .065 × 5 − 1.476 × 0.4 + 1.658 × .16 = 3.168

What about a 60-year-old?

3.203 + .023 × 5 − .149 + .065 × 5 − 1.476 × 0.6 + 1.658 × .36 = 3.205

What about an 80-year-old?

3.203 + .023 × 5 − .149 + .065 × 5 − 1.476 × 0.8 + 1.658 × .64 = 3.374

So the quadratic model says that old people are the most satisfied; in that, it agrees with the linear regression. But it says the least satisfied people are not the young, but the middle-aged.
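The worked predictions above can be reproduced with a few lines of Python (the function name is mine; the coefficients are the ones from the quadratic regression, for the example case of a never-married man with education 5 and income 5):

```python
def predicted_satisfaction(age):
    """Predicted satisfaction from the quadratic model for the
    example case (never-married man, education 5, income 5).
    age00 is age divided by 100; the model includes age00 and age00**2."""
    a = age / 100.0
    return 3.203 + 0.023 * 5 - 0.149 + 0.065 * 5 - 1.476 * a + 1.658 * a ** 2

for age in (20, 40, 60, 80):
    print(age, round(predicted_satisfaction(age), 3))
```

Evaluating the function at a range of ages is an easy way to trace out the U-shaped pattern the text describes.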
First model (age00 only), Coefficients (Dependent Variable: satis)
Variable            B   Std. Error    Beta        t   Sig.
(Constant)      2.783         .072           38.900  .000
female           .069         .022    .053    3.180  .001
EDUCATION LEVEL  .020         .011    .033    1.823  .068
divsep          -.217         .031   -.128   -7.100  .000
widowed         -.099         .037   -.053   -2.672  .008
nmar            -.125         .036   -.061   -3.437  .001
INCOME LEVEL     .063         .006    .209   10.184  .000
age00            .323         .073    .084    4.444  .000

Second model (adding the squared term), Coefficients (Dependent Variable: satis)
Variable            B   Std. Error    Beta        t   Sig.
(Constant)      3.203         .113           28.229  .000
female           .074         .022    .057    3.396  .001
EDUCATION LEVEL  .023         .011    .038    2.055  .040
divsep          -.203         .031   -.120   -6.644  .000
widowed         -.144         .038   -.077   -3.771  .000
nmar            -.149         .037   -.073   -4.082  .000
INCOME LEVEL     .065         .006    .217   10.582  .000
age00          -1.476         .385   -.385   -3.836  .000
age002          1.658         .348    .486    4.760  .000

Figure 10.2: Example of quadratic regression

There are two general rules that are useful when considering quadratic regressions. First, the sign of the x² term tells you whether it is a U or an upside-down U shape. If it is positive, it's a U; if it's negative, it's an upside-down U. In this case, it is positive. Second, there is a formula that tells you where the "turning point" is. If β1 is the coefficient for x and β2 is the coefficient for x², it is:

−β1 / (2β2)

In this case, it is:

−(−1.476) / (2 × 1.658) = 1.476 / 3.316 = .445

Remember that x is age divided by 100, so that means the minimum satisfaction occurs at age 44.5. Say we round that off to the nearest whole number, 45, and plug that into the regression equation. The predicted value is:

3.203 + .023 × 5 − .149 + .065 × 5 − 1.476 × 0.45 + 1.658 × .2025 = 3.165

That is a little lower than the predicted value for a 40-year-old. So the quadratic regression tells us that satisfaction with life declines until people are in their mid-40s, and then starts to increase again. Note that the turning point may not occur within the actual values of x.
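The turning-point formula is easy to apply directly (a Python sketch; the variable names are mine, the coefficients are age00 and its square from the quadratic regression):

```python
# Turning point of a quadratic regression: x* = -b1 / (2 * b2),
# where b1 is the coefficient for x and b2 the coefficient for x**2.
b1 = -1.476   # coefficient for age00 (age / 100)
b2 = 1.658    # coefficient for age00 squared

turning_point = -b1 / (2 * b2)
print(round(turning_point * 100, 1))   # convert back to years of age
```

Since b2 is positive, this turning point is a minimum (a U shape). Here it falls at about age 44.5, inside the observed range of ages, but for other variables the computed turning point can fall outside the actual values of x.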
In that case, the relationship will be effectively an increasing slope or a decreasing slope, like you get with the transformations we've talked about so far. The advantage of the dummy variable approach compared to the quadratic regression is that it is more flexible: it is not limited to the two basic shapes (U and upside-down U). The disadvantage is that it involves more variables, and the coefficients are more strongly affected by sampling error, so it can be harder to see the pattern.

10.2 Interaction (specification) Effects

A standard regression assumes that the independent variable has a single effect that applies to all cases. If you think in terms of the independent variable as cause and the dependent as effect, that means a change in the independent variable always has the same effect on the dependent variable. For example, say that we're investigating the relationship between years of schooling and knowledge of some subject as measured by a test. The regression equation is:

y = α + βx + e

That means an additional year of school will produce an increase of β points in everyone. But that seems unlikely when you think about it. Maybe some kinds of people will tend to learn more: e.g., those who have more aptitude, those who study harder, those who get better teaching... These kinds of differences in the effect of x on y are known as interaction or specification effects. They mean that β is not a single number: it differs depending on the values of other variables. Note that interaction effects are not equivalent to saying that other variables also matter, that is, that we need to add other independent variables. Even if we have numerous independent variables, each has just one effect, given by its coefficient. If there are interaction effects, the regular regression coefficient is still meaningful: it gives an average effect of the independent variable. But you can go beyond the ordinary regression.
Research in the social sciences often deals with interaction effects. They often provide information that may be useful in comparative evaluation of theories, or practically important (e.g., suppose you discover that one teaching method works better for a particular kind of student, and a different one works better for another kind of student), or just unexpected and therefore a potential subject for more research. How can you allow for interaction effects? Suppose that one of the variables involved is a dichotomy: for example, say we are predicting weight and think that the effect of some variables might differ by gender. Then you can divide the sample into two parts, fit your regression separately on each, and compare the coefficients. To see if the group differences in the estimated effects of the variables are statistically significant, you can use a formula we've encountered before. The standard error of β1 − β2 is √(s1² + s2²). For example, the difference in the estimated effects of education is 2.224, and the standard error of that difference is 1.11. The 95% confidence interval is then about (.00, 4.44). That is, we can be pretty sure that education has more effect among women than among men.

Variable    All              Women           Men              All (interactions)
Constant    -127.61 (13.73)  -77.37 (18.11)  -193.82 (22.59)  -74.23 (17.61)
Male        8.94 (1.70)                                       -123.5 (27.63)
Age         -.101 (.035)     -.091 (.044)    -.148 (.056)     -.113 (.035)
Height      4.815 (.206)     4.086 (.272)    5.823 (.313)     4.06 (.269)
Educ        -3.343 (.548)    -4.161 (.715)   -1.937 (.851)    -4.216 (.709)
Male*Ht                                                       1.790 (.410)
Male*Ed                                                       2.291 (1.11)

Table 10.1: Example of Separate Regressions for Two Groups

This approach has two limitations. First, it works only for dichotomies or nominal variables with a few categories. For example, if we thought there might be an interaction involving age and education, we couldn't easily use this approach. We could divide age into two or three groups, but then we'd lose the distinctions between the groups.
Second, it lets the effects of all variables differ between the groups. What if we want to say that some variables have the same effects, while others have different effects? A more general approach to interactions is to create artificial variables that are the product of two other variables. For example, suppose we create two new variables, male × educ and male × height. Then we run a regression including those in addition to the other independent variables. The results are given in the last column of the table. The coefficients for age and height now show the effects among women. To get the effects among men, you add them to the relevant interaction coefficients. The interaction coefficients directly show the difference between the groups in the effects of the variables.
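The two computations used in this section, the standard error of a difference between group coefficients and the recovery of group-specific effects from an interaction model, can be sketched as follows (a Python sketch; the variable names are mine, the numbers are from Table 10.1):

```python
import math

# Education coefficients and standard errors from the separate
# regressions for women and men (Table 10.1).
b_educ_women, se_women = -4.161, 0.715
b_educ_men, se_men = -1.937, 0.851

# Standard error of the difference between two group coefficients:
# sqrt(s1**2 + s2**2), with an approximate 95% interval of +/- 2 SE.
diff = abs(b_educ_women - b_educ_men)         # 2.224
se_diff = math.sqrt(se_women**2 + se_men**2)  # about 1.11
ci_low, ci_high = diff - 2 * se_diff, diff + 2 * se_diff

# In the interaction model, the main coefficient is the effect among
# women (the group coded 0); adding the interaction coefficient gives
# the effect among men.
height_women = 4.06    # Height coefficient, last column of Table 10.1
male_x_height = 1.790  # Male*Ht interaction coefficient
height_men = height_women + male_x_height

print(round(se_diff, 2), round(ci_low, 2), round(ci_high, 2), round(height_men, 2))
```

The men's height effect recovered from the interaction model (about 5.85) is close to the 5.823 estimated in the separate regression for men, as it should be.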