Sociology 3211: Quantitative Methods of Social Research
March 17, 2014

Contents

1 Data, Variables, and Statistics
  1.1 Branches of statistics
  1.2 Error
  1.3 Levels of Measurement
  1.4 Notes

2 Frequencies
  2.1 Frequency Tables
  2.2 Figures for Frequency Distributions

3 Measures of Central Tendency
  3.1 The mean
    3.1.1 Calculating the mean from a frequency table
  3.2 The Median
  3.3 Comparing the mean and median

4 Dispersion
  4.1 Standard Deviation
    4.1.1 Standardized Variables
    4.1.2 Guidelines for interpreting standardized scores
    4.1.3 Calculating the standard deviation from a frequency table
    4.1.4 Example
  4.2 The Interquartile Range
    4.2.1 IQR from a frequency table
  4.3 Some notes on the IQR

5 Comparing Group Means
  5.1 Association between variables
  5.2 One ordinal/interval, one dichotomy
  5.3 One ordinal/interval, one nominal
  5.4 Both Ordinal

6 Statistical Inference
  6.1 Standard error of a statistic
  6.2 Confidence Intervals
    6.2.1 Comparing group means: approximate method
    6.2.2 Comparing group means: exact method
  6.3 T-values and significance tests
  6.4 Comparing more than two groups

7 Cross-tabulations
  7.1 Independence and expected values
  7.2 Index of Dissimilarity
  7.3 Standardized residuals
  7.4 Chi-square test
  7.5 Examining Tables

8 Correlation and Simple Regression
  8.1 Correlation
    8.1.1 Calculating the correlation
    8.1.2 Standard error of the correlation
    8.1.3 Correlation matrix
    8.1.4 Correlations and scale
    8.1.5 Interpreting correlations
  8.2 A Visual Interpretation
  8.3 Regression
    8.3.1 Residuals
    8.3.2 Calculating regression coefficients
    8.3.3 Dependent and Independent Variables
    8.3.4 Analysis of Variance
  8.4 Transformations
    8.4.1 Dummy variables
    8.4.2 Change of Scale
    8.4.3 Non-linear transformations
    8.4.4 Ladder of Transformations
    8.4.5 Transformations and nonlinear relationships
    8.4.6 Choosing Transformations

9 Multiple Regression
  9.1 Example of a multiple regression
  9.2 Standardized Coefficients
  9.3 Direct, Indirect, and Total Effects
  9.4 Nominal variables in Regression
    9.4.1 Interpreting the coefficients
    9.4.2 Testing whether a nominal variable makes a difference
    9.4.3 Example
    9.4.4 Combining categories

10 Beyond Linear Regression
  10.1 Non-linear effects
  10.2 Interaction (specification) Effects
Chapter 1
Data, Variables, and Statistics
We will begin with some definitions.
• Data: any information that can be expressed as a number or one of a
set of categories. For example, age can be expressed as a number, and
marital status can be expressed as “married,” “divorced,” “widowed,”
etc. In these cases, the way to express the idea as data is pretty
straightforward. There are many cases in which there is more
ambiguity: for example, someone’s political views. How can you
reduce them to data? There is no perfect way to do it, but there are
many possibilities. For example, surveys often ask people whether
they would say that they are liberal, moderate, or conservative.
Knowing which one of these terms someone picked obviously doesn’t
tell you everything about what someone thinks, but it tells you
something.
There is a lot of data in the modern world: some examples are crime
rates, unemployment rates, election results; surveys of the public;
rankings of things (colleges; nations ranked on qualities like freedom
or corruption); performances of sports teams and individual athletes.
Statistics is essentially about how to organize and evaluate data.
Since all sorts of information can be expressed as numbers, the
general principles of statistics apply to lots of subject areas, not just
sociology: e. g., weather, geology, medicine..... But each subject area
has some special features. For example, measuring “religious faith” is
different from measuring temperature. So many departments have
their own quantitative or statistical courses, and they can be quite a
bit different, although they are all based on the same general
principles.
• Variable: a characteristic that is measured as a number or a category
name, and that differs from unit to unit. A variable is distinguished
from a constant, which is a characteristic that’s the same for all units.
• Units (cases): The entity that the variable refers to. The most
familiar unit is individual people. However, often social scientists
analyze other units: for example, states of the US, points in time,
organizations, households.
1.1 Branches of statistics
1. Univariate statistics: summarizes the values of a single variable for
different units. The most common univariate statistics are the mean
(average) and standard deviation. Of course, we could just list the
values for a single variable for every case. But unless you have a very
small number of units, a list of numbers gets overwhelming.
Univariate statistics seeks to reduce the mass of information to a few
numbers.
2. Bivariate and multivariate statistics: this involves the relations
between variables–do different variables “go together”? E. g., do
students in smaller classes learn more than students in larger classes?
Bivariate statistics involves two variables, while multivariate involves
more than two. Most research in the social sciences involves
multivariate statistics. A common situation is when you have one
variable you want to predict or explain (the “dependent variable”)
and a number of variables that we think might help to explain it
(“independent variables”). For example, you might want to know
which factors affect student performance. If you think about it, or ask
people for their ideas, you will get a long list of possibilities. Statistics
can help you figure out which of those have large effects, which
have small effects, and which have no effect at all.
3. Statistical inference: This is about how sure you can be. Say that
there’s a poll of 1500 voters that asks them how they voted in
November 2012. 51% of men in the sample report voting for Mitt
Romney; 45% of the women report voting for Romney. Can we
conclude that women in general were less likely to vote for Romney
than men were, or could this just be a matter of “the luck of the
draw”? Or suppose you had information on a number of countries,
and you found some pattern: for example, richer countries are more
likely to be stable democracies. Is that evidence that there is a real
connection between affluence and democracy, or could it just be a
coincidence? The alternative to inference is description, where you
just discuss the data you have. Description and inference both can
involve univariate, bivariate, or multivariate statistics, although in
practice inference is more often applied to bivariate and multivariate
statistics.
1.2 Error
A key aspect of statistics: there is always some uncertainty. Let’s take an
example. One of the data sets we will use comes from a survey–a number of
people were asked different questions, and numbers are used to represent
their answers. Surveys are an important source of data in sociology,
although not the only one. One of the variables is the number of children
aged under 18 living in the household. We can be pretty sure that people
know the answer, so there is little or no “error” in their answers.
Now suppose we want to predict the value of the variable. It
probably will be possible to find some factors that predict it: for example,
gender, age, marital status, ethnicity, income, education. But the number of
children people have is also affected by a number of factors that you can’t
measure, or even describe very clearly. So there will be error in terms of
prediction. With most things involving people, there’s a part that you can’t
predict. So any results of a statistical analysis are going to involve “most of
the time” or “more often than not.” That is, there are always going to be
exceptions to the rule, and often there will be a lot of exceptions. This is
important to remember because people don’t always emphasize it when
reporting the results of a statistical analysis–for example, a report that
found a difference between men and women would talk about the difference,
and often wouldn’t emphasize the variation within each group. But
actually, the “error” is an important part of any statistical analysis.
For example, let’s take another variable in the data set. People
were asked “During the past 30 days, for about how many days have you
felt you did not get enough rest or sleep?” People could give any answer
between 0 and 30.
            Men   Women
Lowest        0       0
Highest      30      30

Table 1.1: Minimum and maximum values of days not enough rest, men and women
So it’s clear that men aren’t all the same, and women aren’t all
the same–in fact, both men and women cover the whole possible range.
But suppose we looked at it another way.
                    Men   Women
0 days              42%     35%
1-3 days            16%     16%
4-9 days            14%     16%
10 or more days     28%     33%

Table 1.2: Days without enough rest or sleep, men and women (percent)
There now is a pattern: women tend to report more days without
enough rest or sleep. So a statement about a difference between men and
women (in this sample) is accurate if it's understood as involving a tendency or
average. (It also is unlikely to be a result of chance–that is, I can be pretty
sure that it is true of Americans in general, not just people in this sample).
But there are large differences within each sex, and a lot of overlap.
1.3 Levels of Measurement:
1. Nominal: categories with no meaningful order. Marital status is an
example. You can use numbers to represent the categories, but those
numbers are arbitrary: they are just a way to tell them apart. Many
mathematical operations don’t make sense with nominal variables–e.
g., taking an average.
2. Ordinal: also discrete, but the categories have a meaningful order.
For example, there is a question about general health: 1=excellent,
2=very good, 3=good, 4=fair, 5=poor.
Clearly there is a natural order to these categories. It makes sense to
say that ‘2’ is in between ‘1’ and ‘3’. The variable would still make
sense if you reversed the order (5=excellent....1=poor), but not if you
made any other changes. A more subtle point is that although
“very good” is definitely in between “excellent” and “good,” it’s
not certain whether it’s exactly midway in between, or closer to one of
the categories than to another. So suppose you had two groups, each
with two people. In group A, one person has excellent health and one
has good health; in group B, both have very good health. Which
group is healthier, or are they both the same? There is no definitive
answer (some people would argue it’s not even a meaningful question).
3. Interval: the values have an order, and the distance between different
values is defined. E. g., income. With an interval variable we can get
a definite answer if we are comparing different groups. For example, if
group A has one person who earns $20,000 a year and one who earns
$120,000, while group B has one who earns $60,000 and one who
earns $40,000, we can say that the total income is higher in group A.
4. Ratio: an interval variable that also has a definite zero point. With
ratio variables, you can say things like “person A is twice as old as
person B.” If the zero point is arbitrary, such statements don’t make sense.
E. g., IQ scores don’t have a meaningful zero point (they were designed to
have a mean of about 100), so a person with an IQ of 140 isn’t twice as smart
as a person with an IQ of 70. In fact, the statement “person A is twice as smart
as person B” has no real meaning. Interval and ratio variables can be
“continuous” or “discrete.” Continuous means that the variable can
have any value, at least within some range–for example, a person’s
weight can be any positive number. Discrete means only a limited
number of values are possible. For example, the number of children
someone has is necessarily a whole number–you can’t have a value like
1.4 or 2.75.
Many variables are continuous in principle but are measured as
discrete. For example, in this data set a person’s weight is given in
whole pounds, with no decimals, so it’s discrete. But if a variable has
a lot of possible values, it is usually reasonable to regard it as
continuous. What does “a lot” mean? There’s no absolute rule, but
somewhere between 10 and 20 is often a good dividing line. Nominal
and ordinal variables are always discrete.
5. Dichotomies: these are variables which have only two values, like
agree/disagree or male/female. Dichotomies can be regarded as either
ordinal or nominal variables.
The interval/ratio distinction isn’t very important in practice,
because most interval variables are also ratio variables (like age, income, or
weight), and most of the exceptions (like IQ) could be understood as
ordinal rather than interval. The interval/ordinal distinction is potentially
important, but it’s an issue for more advanced statistics, so I won’t talk
about it much. But the distinction between nominal variables and the other
types is very important: with nominal variables, many statistics cannot be
used.
1.4 Notes
1. statistical programs don’t distinguish between nominal, ordinal, and
interval level variables. That is, they can’t tell you not to do
something that doesn’t make sense, like taking the average of a
nominal variable.
2. often there are “missing values”: cases for which no answer is
recorded. For example, a person might refuse to tell you how old he
or she is. Most statistical programs have special ways of dealing with
missing data. The simplest is leaving those cases out of the statistical
analyses. Before doing an analysis, you should check to see what will
be done with “missing values.”
3. sometimes the order of the categories in a variable isn’t the natural
order, or isn’t the order you want to use. For example, with the
variable on general health that I mentioned, if you stick with the
original form, you’ll have to remember that high values mean
WORSE health. This might get confusing, especially if you have
other variables that are also “backwards.” Also, sometimes a variable
is ordinal in principle, but the numbers aren’t assigned that way: for
example, it’s pretty common to have a variable for which agree=1;
disagree=2; not sure=3. If you put “not sure” in the middle, it’s an
ordinal variable. So it’s often convenient to “recode” variables into a
form that makes more sense. For example, you might create a
“health” variable where excellent=5, very good=4..... poor=1.
Chapter 2
Frequencies
A frequency is a count of the number of cases that have a particular value,
or that fall in a particular range of values. If you have a small number of
possible values, you will usually want to know the count for each one.
However, if you have an ordinal or interval variable with a lot of possible
values, it is usually better to group them into ranges. If you have a continuous
variable, you must group them into ranges. Consider weight in pounds: for
many values, there are only a few people who have that exact number. For
example, there are three people who say they weigh 101 pounds, five who
say they weigh 102, two who say they weigh 103. This is more detail than
you need, so it would be more informative to group people into ranges, like
less than 100 pounds, 100-109, 110-119. Note that when you do this, the
ranges must be non-overlapping. You want to put everyone into exactly one
category. So for example, you should not have ranges of 100-110, 110-120
because then it’s not clear how people who weigh exactly 110 pounds
should be counted.
2.1 Frequency Tables
If you have a discrete variable, a “frequency table” provides a good way to
show how many cases have each value. See the example in Figure 2.1,
which is output from the SPSS program.
The number in the left column is the value of the variable. In the
data set “0” is used to represent people who currently smoke. The table also
shows the “label” for that number. If it didn’t, we would have to refer to
the “codebook” that tells us which number corresponds to which value.
“Frequency” is the number of people who gave that answer, in this case 666
currently smoke. “Percent” is the percent of the people who gave that
answer. You calculate percent by taking the number who gave the answer
and dividing it by the total number of cases, which is 4232, and multiplying by
100. In this case 15.7% of the people surveyed said that they currently
smoked.
Notice the values at the bottom of the table. They are indicated
as “missing”–that is, we don’t have an answer for those cases. There are
several different missing values. If you look in the codebook, you can see
that 77 was used to indicate that people refused to answer or said that they
didn’t know. -9 is for people who are marked “DNA,” which I think stands for “did
not ask.” The “system” row is short for “system-missing”, which is an SPSS
term. In SPSS, “system-missing” values are not represented by a number,
but by a special symbol (a period). The other missing values are represented by a
number, just like the real values, but you can instruct the program that
cases with that value should be left out of all calculations. System-missing
values are often used to represent cases that are left out because the
question was not asked. In this example, I think that they were people who
didn’t answer a previous question about whether you have ever smoked
regularly. The people designing the survey figured that if they didn’t answer
that it was a waste of time asking any more questions about smoking.
For most purposes the distinction between different types of
missing values doesn’t matter, since in either case we don’t have an answer.
The table shows several “Totals.” One is the total number of missing
values. Then there is the total number of cases for which we have a valid
answer. That’s listed as the “Total” after the highest value. Finally, there’s
the “grand total,” which is valid plus missing cases. The grand total is the
same for all of the variables in the data set. For this data set, it’s always
4232–the number of people included. The number of missing and valid
cases will differ from variable to variable. For some variables, it’s zero (e.
g., sex), but for most variables there are some missing cases–for example,
some people didn’t give their age, some didn’t give their income, etc.
The relative numbers in each category as a percent of all “valid”
cases are shown under the “valid percent” column. Finally, the “cumulative
percent” is the sum of the valid percents with that value or lower. This is
useful for ordinal or interval variables, because it lets you give the percent
of people who are below or above a particular level. In this case, it would
seem reasonable to regard the variable as ordinal, so you could say that
21% of people had smoked within the past 5 years.
The different columns of the frequency table show the same
information in different ways. The essential information is in the first two
columns–the values and the frequencies. Of course, it’s more convenient to
have the computer program do calculations, so that’s why the SPSS output
shows the columns with percentages. The potentially confusing thing is
that SPSS tends to show all of the different things that people might be
interested in, so sometimes you have to look through the output, identify
the numbers you want, and ignore the others. For example, with a
frequency table you usually will want to focus on the valid percents rather
than the total percents.
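Since the percentage columns are all derived from the first two, a short Python sketch (not part of the original notes; the counts are copied from Figure 2.1) can reproduce the percent, valid percent, and cumulative percent columns, apart from small rounding differences:

    # counts copied from Figure 2.1; 36 cases have some kind of missing value
    valid = {"Current smoker": 666, "Within last month": 12, "Within last 3 months": 9,
             "Within last 6 months": 17, "Within last year": 34, "Within last 5 years": 151,
             "within last 10 years": 118, "more than 10 years": 880, "never smoked": 2309}
    missing = 36

    n_valid = sum(valid.values())        # 4196 cases with a valid answer
    grand_total = n_valid + missing      # 4232 cases in the data set

    cumulative = 0.0
    for label, f in valid.items():
        percent = 100 * f / grand_total        # share of all cases
        valid_percent = 100 * f / n_valid      # share of cases with a valid answer
        cumulative += valid_percent            # running total of the valid percents
        print(label, round(percent, 1), round(valid_percent, 1), round(cumulative, 1))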
2.2 Figures for Frequency Distributions
A “figure” or “graph” is a picture representing some statistical information.
A “table” is a list of numbers. Both figures and tables are useful. A table
can usually show more detail, but a figure is often easier to grasp–that is,
someone can see something at a glance rather than having to think about
the pattern of numbers in the table.
Figures are particularly useful for showing the distribution of one
variable. There are two major kinds of figures for that purpose. One is a
“pie chart” in which the size of a “slice” is proportional to the number of
cases with a given value. The other is a bar chart, in which the height of a
bar is proportional to the number of cases with a given value. Either one of
these shows the same information that you have in a frequency table. Pie
charts are used mostly for nominal variables, while bar charts are used for
both nominal and ordinal variables. I’ll focus on bar charts, because most
people who think about graphics don’t like pie charts. The reason is that
people seem to have more difficulty in accurately judging the area of a
“slice” than they do in judging the height of a bar. That is, you can easily
see that one slice is bigger than another, but it’s harder to make accurate
judgments on how much bigger, like 50%, twice as big, three times as big,
etc. There is another kind of graph that is similar to the bar chart–the
histogram. It’s used for ordinal or interval variables with lots of possible
values. For variables of this kind, there are many different values, and most
of them may have only a few cases. As a result, a bar chart will look
“jagged,” making it hard to pick out the important features. A histogram
will group the values of the variable. For example, rather than showing the
number of people who are 21, 22, 23, etc., it might show the number of
people who are 20-29, 30-39, etc. That means you lose some detail, but the
advantage is that it may be easier to see the main features of the data.
With most statistical programs, including SPSS, the histogram command
will automatically set up groups; however, you can usually manually change
those values if you want to. Conventionally a bar chart has gaps between
the bars, while a histogram puts them side-by-side. Examples of a pie
chart, bar chart, and histogram are shown in Figures 2.2-2.4.
LASTSMK1 INTERVAL SINCE LAST SMOKED

                                   Frequency   Percent   Valid Percent   Cumulative Percent
Valid    0 Current smoker                666      15.7            15.9                 15.9
         1 Within last month              12        .3              .3                 16.2
         2 Within last 3 months            9        .2              .2                 16.4
         3 Within last 6 months           17        .4              .4                 16.8
         4 Within last year               34        .8              .8                 17.6
         5 Within last 5 years           151       3.6             3.6                 21.2
         6 within last 10 years          118       2.8             2.8                 24.0
         7 more than 10 years            880      20.8            21.0                 45.0
         8 never smoked                 2309      54.6            55.0                100.0
         Total                          4196      99.1           100.0
Missing  77                                5        .1
         System                           31        .7
         Total                            36        .9
Total                                   4232     100.0

Figure 2.1: Frequency table for Time Since Last Smoked
Figure 2.2: Example of a pie chart
Figure 2.3: Example of a Bar Chart
Figure 2.4: Example of a Histogram
Chapter 3
Measures of Central Tendency
3.1 The mean
“Central Tendency” is the statistical term for what, in everyday language,
you would call an “average” or “typical” value. There are a number of
statistics that represent central tendency, but the two major ones are the
mean and the median, and those are the ones that I’ll talk about.
The mean is the most common measure of central tendency. It’s
sometimes called the “average,” but “average” is also used more broadly:
for example, when someone says “average Americans” they usually just
mean “typical” in a general way. So I will call the statistic the “mean” to
avoid ambiguity. To get the mean, add together all values of the variable
(x), divide by the number of cases (don’t include cases with missing values
in either part). Or in terms of symbols:
    x̄ = (Σ x) / N

The symbol Σ (capital Greek letter sigma) stands for sum. x is
the conventional way to designate an unspecified variable–whatever variable
we happen to be interested in. Sometimes people write xᵢ, where the
subscript represents the individual case, to make it clear that you’re
summing the values for each case. When the values are listed individually,
calculating the mean is straightforward. For example, suppose that these
are scores on a test: 81, 97, 100, 67, 75. The sum of the values is 420 and
N=5, so the mean is 84. Of course, if you have a lot of cases, it takes a while
to get the mean, even with a calculator; that’s where computers are useful.
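As a small illustration (not part of the original notes), here is a minimal Python sketch of that calculation for the five test scores; using None to mark a missing case is just an assumption made for the example, to show that missing values are left out of both the sum and the count:

    values = [81, 97, 100, 67, 75, None]          # None marks a hypothetical missing case
    valid = [v for v in values if v is not None]  # leave missing cases out of both parts
    mean = sum(valid) / len(valid)                # 420 / 5
    print(mean)                                   # 84.0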
GENERAL HEALTH

                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid    Excellent          768      18.1            18.2                 18.2
         Very good         1350      31.9            32.1                 50.3
         Good              1282      30.3            30.4                 80.7
         Fair               554      13.1            13.2                 93.9
         Poor               257       6.1             6.1                100.0
         Total             4211      99.5           100.0
Missing  7                    8        .2
         9                   13        .3
         Total               21        .5
Total                      4232     100.0

Figure 3.1: Frequency table for self-rated health
3.1.1 Calculating the mean from a frequency table
You can also calculate the mean from a frequency table. Let’s take a
variable I’ve mentioned before, self-rated health. We can find N, the total
number of people, listed in the the frequency table as the total number of
valid cases: 4211. (As I said last week, the missing values are not used
when calculating statistics). But what about the sum of the x? To get that,
remember what the numbers in the table mean. There are 768 people with
the value of 1 (excellent), 1350 with 2, and so on. So the numerator is:
(768*1)+(1350*2)+(1282*3)+(554*4)+(257*5)=10815
The symbol * means multiply (it’s used because the conventional
multiplication symbol can be confused with the letter x.) That is just a
shorter way of writing the sum of 4211 individual values:
1+1+1...+1+2+2....2+3+3...3+4+4....4+4+5...5
Putting it all together, the mean is: 10815/4211=2.568. That’s
almost exactly in between “very good” and “good”, just a little closer to
“good.” That seems reasonable given the percentages. Note that the mean
is continuous, even if the original variable is discrete. So you don’t need to
round the mean off to the nearest integer–you just give the decimal figure.
You might wonder how many decimals to use. If the variable is a whole
number, two decimals is a reasonable choice (in this case, 2.57). You could
also do three, but anything more than three decimals is excessive. But you
can be flexible–if the units are already precise, like weight in pounds, just
one decimal would be reasonable. Again, we could write the formula in
symbols:
    x̄ = (Σ f x) / N
The small letter f means the frequency of that value of x, and fx
is f times x (if two variables are written next to each other without a sign,
it is assumed that you mean to multiply them).
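As a rough check on the arithmetic, here is a minimal Python sketch (not from the notes) that applies this formula to the frequencies in Figure 3.1:

    # self-rated health: 1=excellent ... 5=poor, frequencies from Figure 3.1
    freqs = {1: 768, 2: 1350, 3: 1282, 4: 554, 5: 257}

    n = sum(freqs.values())                       # N = 4211 valid cases
    total = sum(f * x for x, f in freqs.items())  # sum of f times x = 10815
    print(round(total / n, 2))                    # 2.57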
It’s possible to get confused about what N is. Sometimes people
think that N is the number of distinct values (5 for health categories). But
N is always the number of cases (people). A good way to check for mistakes
is to remember what the mean is supposed to be: an average or typical
value of the variable. Then remember what is x in this case (health) and
what values x can have (1 through 5). Then ask whether the number you
came up with makes sense in terms of the possible values of x. In this case,
if everyone said that they had excellent health, the mean would be 1. If
everyone said they had poor health, the mean would be 5. So the biggest
possible value for the mean is 5, the smallest is 1. If you calculate the mean
and get a number outside of that range, you know you made a mistake.
There are several commands in SPSS that will calculate the mean.
One of them is “Descriptives”. Another is an option in “Frequencies.”
A few other points about the mean:
1. The order of the cases doesn’t matter in calculating the mean. For
example, the mean of 8, 6, and 1 is 15/3=5. You get exactly the same
number if they are listed 1, 6, 8 or 8, 1, 6 or any other possible way.
2. In a literal sense, you can take the mean of a nominal variable–that is,
you can do the calculations and get a number. But that number will
not have any sensible interpretation. You should take the mean only
if the variable is ordinal or interval (including ratio). Because a
dichotomy can be regarded as an ordinal variable, you can take the
mean, although usually it’s not the natural thing to do. That is, we
would usually describe a dichotomy by saying something like “60
percent of the cases are women,” rather than “the mean value of the
variable sex is 1.6.”
3. The mean is a univariate statistic. That is, it involves one variable,
and not the relations between variables. You can calculate the means
for several variables, but from the means alone you can’t tell whether
or how those variables are related.
4. You can compare the means for two different variables only if those
variables are measured on the same scale. For example, there’s a
variable for number of days out of the last 30 your mental health was
not good. The mean is 3.35. It would not be correct to say that
because 3.35 is bigger than 2.57, people rate their mental health as
worse than their health in general. But it would be reasonable to
compare the means for “mental health not good” and “physical
health not good” since those are both measured as days out of the
last 30. The mean number of days physical health is not good (4.24)
is higher than the mean number of days mental health is not good.
3.2 The Median
The median is the middle value if you arrange all of the values in order of
size. It doesn’t matter if you arrange them from large to small or small to
large, you get the same median. For example, say that you have only three
cases, with values 5, 7, and 10. The median is 7. With only three cases, it’s
obvious which the middle one is, but the rule is that it’s the (N+1)/2 case,
where N is the number of valid cases. E. g., with 99 cases arranged by size,
the median would be the 50th. If the number of cases is even, then
(N+1)/2 isn’t a whole number. E. g., with N=100, you get the 50.5th case.
There is no 50.5th case, but there is a 50th and 51st, so you can get the
median by taking the average of the values of those two cases. So suppose
we have four families, with 3, 2, 1, and 1 children. What is the median
number of children? The second highest value is 2, the third highest is 1.
The average of those two values is 1.5. Notice that it’s OK for the median
to be a fraction, even though the number of children for any actual family
has to be a whole number.
If the data are shown in a frequency table, you follow the same
approach, but remember that there are many cases for each value. For
example, let’s take the frequency table I used as an example (self-rated
health). With 4211 cases, (N+1)/2=2106. If you wrote all the values in
order, you’d have:
1,1,1....1,2,2...2,....5
The first 768 of those values would be 1; then the next 1350
values would be 2. 768+1350=2118 would take us past the 2106th case, so
the median is 2. Writing all of the values in order and counting to 2106
would be a lot of wasted effort. It’s easier to get there with the cumulative
number of cases, as I just did. An even easier way to get the median from a
frequency table is to follow this rule: the median is the value for which the
cumulative percent passes 50. In this case, 1 just gets us to 18.2%, but then
2 takes us to 50.3%, so the answer is 2.
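A minimal Python sketch (not from the notes) of the same cumulative-count reasoning, using the Figure 3.1 frequencies:

    freqs = {1: 768, 2: 1350, 3: 1282, 4: 554, 5: 257}  # self-rated health frequencies

    n = sum(freqs.values())      # 4211 valid cases
    middle = (n + 1) / 2         # 2106
    running = 0
    for value in sorted(freqs):
        running += freqs[value]        # cumulative number of cases so far
        if running >= middle:          # first value whose cumulative count passes the middle
            print("median =", value)   # prints: median = 2
            break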
3.3 Comparing the mean and median
The mean and median are often close to each other. But sometimes there is
a substantial difference. For example, the median number of days without
enough rest or sleep is 3, while the mean is 7.66: more than twice as big.
The reason the mean is bigger than the median is no one is very far below
the median (you can’t be below zero), but there are some people who are
far above the median–who say that they felt that they didn’t have enough
rest every day, or almost every day. The people who have very high
numbers have a big impact on the mean, but not as much impact on the
median. If everyone who had more than 5 days without enough rest
managed to reduce themselves to exactly five days, the median would stay
the same, because only the order matters and reducing all the large values
would not change the order.
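A small Python sketch can make the same point with made-up numbers (the ten values below are hypothetical, not taken from the data set): capping the large values leaves the median alone but pulls the mean down.

    from statistics import mean, median

    # hypothetical right-skewed values: days without enough rest for ten people
    days = [0, 0, 1, 2, 3, 3, 5, 10, 20, 30]
    print(mean(days), median(days))        # 7.4 and 3.0

    # cap everyone above 5 days at exactly 5: the order is unchanged,
    # so the median stays the same, but the mean drops
    capped = [min(d, 5) for d in days]
    print(mean(capped), median(capped))    # 2.9 and 3.0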
In general, the mean will be different from the median when the
distribution of the variable is “skewed” rather than symmetrical. The
meaning of these terms can be understood by thinking of a histogram (or
bar chart), in which the height of a bar represents the number of cases with
a given value. A symmetrical distribution means that the right and left
halves of the histogram are mirror images of each other. With a skewed
distribution, they are not mirror images: the figure looks unbalanced.
Often there is a “tail” going off to the right (high values) or left (low
values). Figure 3.2 shows the histogram for number of days without enough
rest. It is skewed. Figure 3.3 shows the histogram for height in inches. It is
pretty symmetrical.
Usually variables that are skewed are skewed to the “right”: a few
values are much bigger than the median. This is particularly true when the
variable has a lower limit but no upper limit. Then the mean will tend to
be bigger than the median. But a variable can also be skewed to the left,
although it’s less common. For example, with the conventional 0-100 scale
of grading tests, most people are concentrated near the top, so there may
be a few that are far below everyone else. In this case, the mean will tend
to be smaller than the median.
The greater sensitivity of the mean to extreme values might be
regarded as a good feature or a bad feature. On the one hand, you could
argue that it’s important to take account of the unusual values–that is, to
recognize that they’re not just larger or smaller than the typical value, but
much larger or smaller. On the other hand, you could argue that you
shouldn’t give too much weight to a small minority, especially because
extreme values may result from some kind of mistake (for example, a data
entry person putting in an incorrect number). So I would say that it’s not a
matter of one measure being clearly better: they both give different kinds
of information, so you should consider both of them.
Figure 3.2: Histogram of number of days without enough rest
Figure 3.3: Histogram of height
Chapter 4
Dispersion
4.1 Standard Deviation
The standard deviation is a measure of “dispersion”: that is, how “spread
out” the values are. For example, suppose you have three values: 14, 12, 10.
The mean is 12. What if you have the values 17, 15, 4? The mean is still
12, but the values are more spread out. What kind of statistic could we use
to express this difference between the two distributions?
The simplest possibility is the “range,” which is defined as the
largest value minus the smallest value. In the first example, the range is 4;
in the second, it is 13. The range is simple to calculate and easy to
understand. The drawback is that it depends on just two extreme values.
Suppose you have one distribution that’s like this (suppose it’s scores on a
test): 35, 79, 80, 83, 85, 87, 88, 92, 93, 98. The mean is 82. The range is 63.
Then suppose you have a distribution like this: 55, 60, 70, 75, 75, 85, 100,
100, 100, 100. The mean is 82, and the range is only 45. But is it really
valid to say that the first distribution is a more spread out than the second?
In the first, 9 of the 10 cases are within 19 points of each other (79 to
98)–there’s just one that’s a lot different. In the second, they are scattered
pretty evenly over the range. So you could make an argument either way.
The standard deviation takes account of not just the highest and
lowest values, but of all the values. The formula is:
    s = √( Σ (x − x̄)² / (N − 1) )
Remember that x̄ is the symbol for the mean. So to get the standard
deviation, you need to calculate the mean first. Then you take each value,
subtract the mean, square the result, add them together, divide by N-1,
and finally take the square root. (If you omit the last step, you have the
“variance”, which is sometimes used as a measure of dispersion. But a
bigger standard deviation means a bigger variance, so they are just different
ways of giving the same information). This is a lot of calculation, so people
rarely calculate the standard deviation by hand. However, it’s important to
do it a few times in order to grasp what the standard deviation is.
The standard deviation of the first example is 17.54. The
standard deviation of the second is 17.51. So despite the difference in the
distributions, the standard deviations are almost equal.
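To see the whole formula in one place, here is a minimal Python sketch (not from the notes) that reproduces those two standard deviations:

    from math import sqrt

    def sd(values):
        n = len(values)
        xbar = sum(values) / n
        # sum the squared deviations from the mean, divide by N-1, take the square root
        return sqrt(sum((v - xbar) ** 2 for v in values) / (n - 1))

    print(round(sd([35, 79, 80, 83, 85, 87, 88, 92, 93, 98]), 2))        # 17.54
    print(round(sd([55, 60, 70, 75, 75, 85, 100, 100, 100, 100]), 2))    # 17.51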
Some points about the standard deviation:
1. Like the mean, the standard deviation should be calculated only for
ordinal or interval variables, not for nominal variables. (It can be
calculated for dichotomies, but isn’t very useful for them).
2. The minimum possible value of the standard deviation is zero. This
will occur only if all cases have the same value.
3. There is no upper limit in principle.
4. Sometimes N is used in the denominator instead of N-1. There are
arguments in favor of both formulas, but it doesn’t make much
practical difference unless the sample is very small.
5. Although the standard deviation is usually reported as a number, it
has units–they are the same units as the original variable.
6. Usually most of the cases are within one standard deviation of the
mean. This isn’t an absolute rule, but it’s a useful thing to remember.
7. Cases more than two or three standard deviations away from the
mean are unusual. For example, the mean height for men in the US
today is about 5 feet 10 inches, and the standard deviation is about
3.5 inches. So to be 2 standard deviations above the mean would
mean you were 6 foot 5; two standard deviations below would be 5
foot 3.
4.1.1 Standardized Variables
Standardized variables are related to the idea of being within k standard
deviations of the mean. A standardized variable is defined as
    x∗ = (x − x̄) / sx
That is x minus the mean of x, all divided by the standard deviation of x.
An asterisk is a common way of indicating that a variable is standardized.
The mean of x* is 0 and the standard deviation is 1.0, regardless of the
mean and standard deviation of the original variable.
If you keep track of the units in the equation, you find that they
cancel out. For example, suppose height is measured in inches. Then the
standard deviation of height is also in inches, and the value of a
standardized variable will be some number of inches divided by some
number of inches. The result is a number, with no units.
What’s the point of standardizing a variable? You can compare
the values of standardized variables, even if the scale of the original
variables is different. For example, say two students attend different
schools. One uses a scale of 0-100 for grades, the other uses 0-4. Say that
the person at the place with the 100 point scale got an 80, the mean for all
students was 78, the sd is 8. Then the person’s standardized score is:
(80-78)/8=0.25. The student at the other school had a GPA of 2.9, the
mean for all students was 2.6, and the standard deviation is 0.6. Then their
standardized score is 0.5. So the second student did better in relative terms.
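A minimal Python sketch (not from the notes) of the two students' standardized scores:

    def standardize(value, xbar, s):
        # (value minus the mean) divided by the standard deviation
        return (value - xbar) / s

    print(round(standardize(80, 78, 8), 2))      # 0.25: student on the 0-100 scale
    print(round(standardize(2.9, 2.6, 0.6), 2))  # 0.5: student on the 0-4 scale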
4.1.2 Guidelines for interpreting standardized scores
First, the sign:
    +  above average
    0  exactly average
    −  below average
These are exact rules: if something has a positive standardized score, it’s larger than the mean.
There are also rules for interpreting the magnitude of standardized scores, although they are approximate rather than absolute:
    |x∗| < 1: not unusual
    1 < |x∗| < 2: somewhat unusual
    2 < |x∗| < 3: unusual
    |x∗| > 3: very unusual
It’s often a good idea to give unusual values special attention–think
about whether there might be some mistake in measuring, why they might have
the values they do, whether they have a large influence on any of your
calculations. This is especially true when you have some knowledge about the
units (e. g., if the units are cities or states), because in that case you may
be able to think of reasons why they are different, or check on the accuracy
of the measurements using other sources. However, it can be useful even when
the units are anonymous people, as in survey data.

              Frequency   Percent
Valid   1           768      18.2
        2          1350      32.0
        3          1282      30.4
        4           554      13.2
        5           257       6.1
        Total      4211     100.0

Table 4.1: Frequency Table for General Health
4.1.3 Calculating the standard deviation from a frequency table
As with the mean, you use the same basic formula, but have to take
account of frequencies. Specifically, the formula is:
    s = √( Σ f (x − x̄)² / (N − 1) )
As with the mean, f is the frequency for a particular value of x,
and N is the total number of cases (e. g., people). You have to be careful
about the order of things: first square the deviation from the mean, then
multiply the result by f. If you first multiply the deviation from the mean
by f and then square that result, you’ll get a different (and incorrect) answer.
4.1.4 Example
The variable is general health. To get the standard deviation, we first need
the mean. That was calculated in section 3.1.1: it is 2.57. Making a table
can help in calculation–here is an example.
x          f      x̄    x − x̄   (x − x̄)²   f (x − x̄)²
1        768   2.57   -1.57       2.47       1889.3
2       1350   2.57   -0.57       0.32        432.0
3       1282   2.57    0.43       0.18        230.8
4        554   2.57    1.43       2.04       1130.2
5        257   2.57    2.43       5.90       1516.3
Total   4211                                 5198.5

Table 4.2: Example of Worksheet for Calculating Standard Deviation
Note that the x − x̄ column won’t add up to zero the way it does
for individual data. That’s because it doesn’t take account of the frequency
of each value of x. You could compute a f (x − x̄) column, which would add to zero. But
it’s not needed for computing the standard deviation, so you don’t need to
calculate it.
Finally, the variance is 5198.5/4210=1.235. The standard
deviation is the square root of that, which is 1.11.
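As a rough check (not from the notes), here is a Python sketch of the same calculation from the Table 4.1 frequencies; it uses the unrounded mean, so it agrees with the worksheet to within rounding:

    from math import sqrt

    freqs = {1: 768, 2: 1350, 3: 1282, 4: 554, 5: 257}       # Table 4.1 frequencies

    n = sum(freqs.values())                                  # 4211
    xbar = sum(f * x for x, f in freqs.items()) / n          # about 2.57
    ss = sum(f * (x - xbar) ** 2 for x, f in freqs.items())  # sum of f(x - xbar)^2
    print(round(sqrt(ss / (n - 1)), 2))                      # about 1.11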
4.2 The Interquartile Range
The interquartile range is another measure of dispersion. It is related to the
median; the idea of the median is to divide the values into halves, and the
idea of the IQR is to divide the data into quarters (“quartiles”). The first
quartile is the value which is greater than 25% of the cases and smaller
than 75%; the second quartile is the median (greater than half and less
than half); and the third quartile is the value that’s greater than 75%. The
IQR is the value of the third quartile minus the value of the first quartile.
So it’s the same basic idea as the range, but rather than taking the extreme
values, it takes values that are closer to the middle. The interpretation of
the IQR is that it’s the range that covers the middle 50% of the cases.
How do you find the quartiles? The key number is (N+3)/4. If
you count (N+3)/4 cases from the bottom, you have the value of the first
quartile; (N+3)/4 cases from the top gives you the third quartile. As with
the median, you can get a fraction: in that case, just take the average of
the two surrounding values. (For a more exact calculation, it should be
closer to one of the values when you have a fraction of 1/4 or 3/4, and the
average only if you have a fraction of 1/2. However, to simplify things, you
can use the average for all fractions). Then the difference is the IQR.
Let’s take the examples from last time. N=10, so (N+3)/4=3.25;
that is, in between the third and fourth value. The first data set is:
35, 79, 80, 83, 85, 87, 88, 92, 93, 98
The third value from the bottom is 80 and the fourth is 83, so the first
quartile is 81.5. Third from the top is 92 and fourth from the top is 88, so
the third quartile is 90. The difference 90-81.5 is 8.5. The second data set
is:
55, 60, 70, 75, 75, 85, 100, 100, 100, 100
Here the first quartile is 72.5 and the third is 100. The IQR is 27.5. Notice
that the IQR of the second data set is bigger, even though the standard
deviation is almost the same. That’s because the IQR is just concerned
with the cases in the middle, and those cases are more spread out in the
second data set. The IQR is not sensitive to the exact values of the cases
that aren’t in the middle 50%. For example, if the lowest value in one of
the samples was zero, the IQR would stay the same. In contrast, the
standard deviation changes if any of the values change.
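Here is a minimal Python sketch (not from the notes) of the simplified (N+3)/4 rule described above, averaging the two surrounding values whenever the position is a fraction:

    import math

    def value_at(sorted_vals, p):
        # p is a 1-based position that may be fractional (e.g. 3.25)
        if p == int(p):
            return sorted_vals[int(p) - 1]
        lo, hi = math.floor(p), math.ceil(p)
        return (sorted_vals[lo - 1] + sorted_vals[hi - 1]) / 2   # average the two neighbors

    def iqr(values):
        vals = sorted(values)
        p = (len(vals) + 3) / 4          # the key position from the text
        q1 = value_at(vals, p)           # count up from the bottom
        q3 = value_at(vals[::-1], p)     # count down from the top
        return q3 - q1

    print(iqr([35, 79, 80, 83, 85, 87, 88, 92, 93, 98]))       # 8.5
    print(iqr([55, 60, 70, 75, 75, 85, 100, 100, 100, 100]))   # 27.5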
4.2.1 IQR from a frequency table
To find the first quartile, identify the value for which the cumulative
percent passes 25; for the third quartile, identify the value for which the
cumulative percent passes 75.
4.3 Some notes on the IQR
1. The minimum possible value is zero. This will occur when the 25th
and 75th percentiles have the same value. In contrast to the standard
deviation, a value of zero does not necessarily mean that all cases
have the same value–it just means that at least half of them do.
2. As with the standard deviation, there is no upper limit in principle.
3. The median and IQR also have the same units as the original variable.
4. You can generalize the idea of the IQR and compute ranges between
various “percentiles”: for example, the 90th percentile minus the 10th.
The IQR is the most common choice, but that’s just a convention.
Chapter 5
Comparing Group Means
So far, we’ve been talking about univariate statistics: the distribution of
single variables. We’ll now turn to bivariate statistics: the relationship
between two variables. The great majority of research in the social sciences
involves bivariate or multivariate statistics: univariate statistics is just a
preliminary step.
5.1 Association between variables
The central question with bivariate statistics: is there an association
between two variables? Association between two variables (call them x and
y) means that knowing the value of one variable helps to predict the value
of the other variable. E. g., say the two variables are month and
temperature. There is an association between them. Knowing the month
will help you to predict the temperature; knowing the temperature will help
you predict the month. E. g., if you hear that the high temperature in
Storrs was 90 on a particular day, you could reasonably guess about which
month it was. Lack of association is known as “independence.” An example
of variables that are independent is day of the week (Sunday, Monday.....)
and temperature-e. g., knowing that the high temperature in Storrs was 90
on a given day doesn’t help you guess what day of the week it was.
Sometimes social scientists look at association for the purposes of
prediction: e. g., an economist might want to predict what the
unemployment rate will be at this time next year. In order to do that, the
economist could look at information about various economic conditions and
unemployment rates in the following year. If there’s an association, that
means that the value of the economic conditions today can be used to
predict unemployment next year. For the purposes of prediction, the
economist wouldn’t care why the association existed–just that it does.
But often social scientists look at association as part of a process
of figuring out whether a variable x is a cause of y. An example: sometimes
people say that birth order affects personality, success in life, and other
things. How can you find if that’s true? A first step is to see if there
actually are differences between people who were first children, people who
were second, etc. If there are, the next step is to figure out why that
association is there. If there aren’t any differences between them, that
suggests that birth order doesn’t affect the outcomes you are interested in.
It doesn’t quite settle the question: as we’ll see later, a relationship
between two variables can be “hidden” by relationships involving other variables.
But as a general rule, if there’s nothing to start with, there’s probably not
much to explain.
If there is a substantial association between two variables (even if
it’s not one you expected), it needs to be explained somehow. People often
say “correlation does not imply causation,” but that is not really true
unless you add a qualification: “correlation between x and y does not imply
any direct causation between x and y.” If a correlation exists, there has to
be a reason: something is causing something else.
It is also important to measure the strength of any association.
Most variables of interest to sociologists have lots of causes. But some
(probably most) of the influences will be small, others large.
What statistics should you look at to see if there is an association
between two variables? There are different ones, depending on what kind of
variables are involved.
5.2 One ordinal/interval, one dichotomy
Start with the case where one of the variables (y) is ordinal or interval; the
other (x) is a dichotomy. Then you could first divide the cases into two
groups, depending on the value of x, then calculate a statistic in the two
groups separately and see if its values are different. Most often, the statistic
people look at is a measure of central tendency (mean or median).
However, sometimes a measure of dispersion is of interest. For example,
you might want to see if dispersion in income (that is, income inequality) is
higher in one country than another.
As an example of comparing means, there’s a variable in the data
set about length of time since last routine medical exam. It has five
categories: within the last year, 1-2 years, 2-5 years, more than five years,
or never. The frequency tables for men and women are in Table 5.1.

                     MEN            WOMEN             ALL
1  last year     1082  68.8%    1969  75.6%    3051  73.0%
2  1-2 years      191  12.1%     316  12.1%     507  12.1%
3  2-5 years      129   8.2%     159   6.1%     288   6.9%
4  5+ years       152   9.7%     135   5.2%     287   6.9%
5  never           18   1.1%      26   1.0%      44   1.1%
   TOTAL         1572           2605           4177

Table 5.1: Frequency tables for time since last checkup, men and women
Using those tables, you can calculate the mean for men and the
mean for women. For example, the mean for men is:
    (1082 × 1 + 191 × 2 + 129 × 3 + 152 × 4 + 18 × 5) / (1082 + 191 + 129 + 152 + 18)

which comes to 2549/1572 = 1.62. You can make the same kind of
calculation for women, and get 1.44. That is, women tend to have had their
last routine checkup more recently.
What if we calculated the mean for everyone? It is 1.51. That is
in between the mean for men and women, but not exactly halfway in
between. It is closer to the mean for women. Why? Because most of the
people in the sample are women. In fact, the mean for everyone can be
calculated from the means and numbers for men and women:
    (1.62 × 1572 + 1.44 × 2605) / (1572 + 2605)
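As a check on those numbers, here is a short Python sketch (not from the notes) that computes both group means from the Table 5.1 counts and then the weighted overall mean:

    # counts by category of time since last checkup (Table 5.1)
    men =   {1: 1082, 2: 191, 3: 129, 4: 152, 5: 18}
    women = {1: 1969, 2: 316, 3: 159, 4: 135, 5: 26}

    def group_mean(freqs):
        n = sum(freqs.values())
        return sum(f * x for x, f in freqs.items()) / n, n

    mean_m, n_m = group_mean(men)      # about 1.62 with N = 1572
    mean_w, n_w = group_mean(women)    # about 1.44 with N = 2605

    # the overall mean is the weighted average of the two group means
    overall = (mean_m * n_m + mean_w * n_w) / (n_m + n_w)
    print(round(overall, 2))           # 1.51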
You can calculate other statistics separately in each group: for
example, the standard deviation, median, or IQR. However, when the
variable has a small number of categories, the median and IQR are less
useful for group comparisons, because they change in “jumps,” so they
aren’t good at identifying small differences. In this example, the median is
1 for both men and women; the IQR is 1 for men and 0 for women. So
usually comparisons involve means or standard deviations.

x   Status                  Mean   Frequency
1   Employed                1.55        1630
2   Self-employed           1.56         348
3   Unemp more than year    1.94         103
4   Unemp less than year    1.86         123
5   Homemaker               1.51         294
6   Student                 1.55          58
7   Retired                 1.55        1130
8   Unable to work          2.06         260
    Total                   1.60        3946

Table 5.2: Satisfaction with life by employment status
5.3 One ordinal/interval, one nominal
What if x is a nominal variable with more than two groups? You apply the
same general idea: separate the cases by values of x, calculate the mean in
each group. You simply have more means to compare. For example, say
that one variable is satisfaction with life (1-4, higher means less satisfied)
and the other is employment status.
There are some differences that seem pretty clear: people who are
unemployed or unable to work are less satisfied. However, when looking at
the less obvious differences, you need to pay attention to the numbers in
each group. For example, are people who are homemakers more satisfied
than people who are employed? The means point in that direction, but
there are only 294 homemakers, and the differences aren’t that large, so
maybe we can’t be sure. We’ll consider this issue more exactly under
statistical inference, but basically, the smaller the group, the bigger the
difference in means you need in order to be confident.
5.4 Both Ordinal
Suppose that x is an ordinal variable without too many categories. For
example, the data set has a measure of household income, which is recorded
as one of eight categories. Then you can do the same thing as before:
compute the mean of y in each category of x, and compare the means. The
difference between the ordinal and nominal cases is that when x is ordinal
you’re less interested in the exact means in the groups, and more interested
in seeing if there’s a general pattern. For example, suppose x is income, and
y is satisfaction with life.

x   Income           Mean   Frequency
1   less than 10K    1.98         166
2   10-15K           1.89         180
3   15-20K           1.80         274
4   20-25K           1.68         368
5   25-35K           1.66         419
6   35-50K           1.63         515
7   50-75K           1.50         620
8   over 75K         1.42         934
    Total            1.60        3476

Table 5.3: Satisfaction with life by Income
It looks like there’s a pattern: the more income, the more
satisfaction (lower mean). If you look more closely, it seems like in the
middle ranges (25-50,000) increases in income don’t make as much
difference as they do in the lower or the upper ranges. Maybe this means
something, but maybe it’s just a quirk of the sample. In any case, the first
thing you should do is look at the general picture: is there a relationship of
the form “the bigger x is, the bigger y is.” If not, is there another kind of
relationship that you can describe simply? For example: “y is largest for
middle values of x.” Only then should you look for more subtle things.
Usually if just one of the ordered categories is different from the
surrounding ones you can assume that this is just a matter of random
variation.
What if you got no obvious pattern, just some means that were
higher and some that were lower? That would be a sign (not conclusive
evidence, but a sign) that maybe there is no relationship at all. To be sure,
you have to use statistical inference, but generally if there is a
relationship between two ordinal/interval variables, it will have a simple
pattern.
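The basic computation is the same whatever x is: separate the cases by the value of x and take the mean of y within each group. A minimal Python sketch (not from the notes; the (x, y) pairs below are made up for illustration):

    # hypothetical (x, y) pairs: x is a group code, y is the variable of interest
    cases = [(1, 2), (1, 3), (2, 2), (2, 1), (3, 1), (3, 1), (3, 2)]

    totals, counts = {}, {}
    for x, y in cases:
        totals[x] = totals.get(x, 0) + y    # running sum of y within each category of x
        counts[x] = counts.get(x, 0) + 1    # number of cases in each category

    for x in sorted(totals):
        print(x, round(totals[x] / counts[x], 2))   # mean of y for each value of x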
Chapter 6
Statistical Inference
Usually when you do a statistical analysis you want to reach a conclusion
that applies outside of the particular cases on which you have information.
For example, let’s take an example I used before, gender and time since last
medical checkup. In the class data set, the average time was higher for men
than for women, but what if you want to say something about people in
general, not just people in this data set? In this case, the goal is to use a
smaller group (the sample) to make an estimate about a larger group (the
population). You can’t have absolute certainty in any conclusions, but you
may be able to say that you are “pretty sure” or even “almost sure” that a
conclusion about the population is correct. This is statistical inference (as
distinct from description, which is entirely about the cases you observe).
Statistical inference is most straightforward when you have a
random sample. The meaning of “random” is different from the everyday
meaning of haphazard or without any conscious method. In the basic kind
of random sample, everyone has the same chance of being selected for the
sample. You can think of it as a lottery where the prize is being chosen for
the sample. You can give some people a higher chance, by giving them
extra “tickets,” while keeping the same basic procedure. Most surveys in
sociology, or surveys of public opinion, are designed to provide random
samples. For example, the class data set is intended to provide a random
sample of American adults. Sometimes you might want to generalize about
some population other than the one from which the sample was taken. E.
g., suppose that you wanted to generalize about people in Canada. The
best way to do that would be to get a random sample of Canadians, but
sometimes that’s not available, so you might wonder if the American data
can tell you anything about Canadians. Generalizing outside the
population might or might not be justified, but it isn’t primarily a
statistical issue. So we’ll just consider going from a random sample to the
population from which the sample was taken.
A lot of data in sociology doesn’t involve random samples. For
example, the states of the United States are not a random sample of
anything. However, you can apply statistical inference to this kind of data
too. With data of this kind, the question isn’t about the population–it’s
about whether any pattern we see could plausibly be explained by “chance.”
6.1 Standard error of a statistic
A basic tool of statistical inference is the standard error of a statistic–for
example, the standard error of a mean. The standard error of a statistic is
an estimate of what would happen if you took numerous random samples
from the same population and computed the statistic for each sample. If
you had a lot of samples, you could compute the mean and the standard
deviation of the sample statistic: the standard error is an estimate of the
standard deviation. Why should we care about the standard deviation of a
sample statistic? After all, we just have one sample from the
population, with one sample mean, not lots of samples with different
sample means. The reason is that the standard deviation is a guide to how
different the statistic for a particular sample might be from the “true”
(population) value.
Many statistics, including the sample mean, approximately follow
a particular distribution, known as the “normal distribution.” The figure
shows what the normal distribution looks like. Even if the original variable
(x) doesn’t have a normal distribution, the sample mean of x will be
approximately normal. You can use a table of the normal distribution to
see exactly how much chance there is that the mean of a particular sample
will be one, two, three, or however many standard deviations away from the
mean of the population. For the moment we just need a few facts: in a
normal distribution about 95% of the values are within two standard
deviations of the mean, and about 99% are within 2.5 standard deviations of
the mean.
Figure 6.1: Normal Distribution

It turns out that we can estimate the standard deviation of a
sample statistic even if we just have one sample. The formula for
estimating the standard error of a statistic depends on the statistic. For the
mean, the formula is pretty simple:
sx̄ = sx / √N
That is, the standard error of the sample mean is the standard deviation of
x divided by the square root of N.
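A minimal Python sketch of this formula (Python is used here only for illustration; it is not part of the course materials):

    import math

    def standard_error_of_mean(s, n):
        # standard deviation divided by the square root of the number of cases
        return s / math.sqrt(n)

    print(standard_error_of_mean(7.63, 4160))   # about .118, the value used in the example below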
6.2 Confidence Intervals
A confidence interval is a range of possible population values of a statistic
given the value in a sample. If we call the sample statistic τ, and call the
standard error of the statistic sτ, the confidence interval is: from τ − ksτ
to τ + ksτ. k is a number based on how sure you want to be, and obtained
from the table for a normal distribution. The more confident you want to
be, the larger k will be. E. g., if you want to be 99.9% sure that the
population value of the statistic is in the confidence interval, you’ll need a k
of a little more than 3. To be 95% sure, you only need a k of about 2.
For example, there’s a question in the data set about the number
of days in the last 30 on which your mental health was not good. The mean
is 3.35, and the standard deviation is 7.63. N is 4160 (N is the number of
cases used to compute the statistic, so it will differ depending on the
particular variables used). Applying the formula, the standard error of the
sample mean is .118, which you can round off to .12. Finally, the 95%
confidence interval is (3.11, 3.59).[1] That is, we can be 95% sure that the
mean in the population is between 3.11 and 3.59. If you want to put that
conclusion in words, you could say something like you are “pretty sure”
that the average in the population is in that range. You could also
form a 99.9% confidence interval, which would be 2.99 to 3.71, and describe
that with words like “almost certain.”
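As a rough check, here is a minimal Python sketch that reproduces these intervals from the mean, standard deviation, and N reported above (nothing else is assumed):

    import math

    mean, s, n = 3.35, 7.63, 4160            # days mental health not good
    se = s / math.sqrt(n)                    # about .118
    print(mean - 2 * se, mean + 2 * se)      # roughly (3.11, 3.59), the 95% interval
    print(mean - 3 * se, mean + 3 * se)      # roughly (3.00, 3.70); with k a little above 3 you get (2.99, 3.71)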
6.2.1 Comparing group means: approximate method
Often social scientists are interested in comparing group means. A basic
question: is there any difference between the groups? When you compute
sample means for different groups, you almost always find that the mean is
higher in one group than the other. But that’s partly because there is
almost always some chance difference between samples. For example,
suppose you gave point values to cards: A=14, K=13, Q=12, J=11, 10=10,
etc. Then when you deal a hand, that is a sample from the population.
When you deal two hands from a pack, the mean point value will usually be
different, even though the population value is always the same. That’s the
element of chance or “luck of the draw.” The same thing happens with
sampling from a population: some differences in a particular sample are
just a matter of chance.
If you have two groups, you have a mean and a standard
deviation for each group. So you can compute a confidence interval for each
group. Call the two groups A and B, and say that the mean in A is higher.
Gender   Mean   s      N
Men      2.59   6.76   1573
Women    3.81   8.08   2587
All      3.35   7.63   4160

Table 6.1: Number of Days Mental Health not Good
From Table 6.1, you can calculate the standard errors of the means
(.170 for men, .175 for women). Then you can calculate the 95% confidence
intervals, (2.25,2.93) for men, and (3.46,4.16) for women. There is no
overlap between them: the highest value for men is smaller than the lowest
value for women. These are what we could call the highest and lowest
plausible values of the population means for men and for women. Because
there is no overlap, we can say that the population mean for men is pretty
definitely higher than the mean for women.

[1] A standard way to indicate an interval is (lower, upper).
If there had been overlap, that would mean that it’s possible that
the means are the same, or even that the population mean for men was
higher. That is, we can’t be sure that the population means of the two
groups are different, and can’t be sure about the direction of the difference
if they are.
Here is an example in which there is overlap.

Gender   Mean   s      N
Men      2.54   1.08   1582
Women    2.58   1.13   2629
All      2.57   1.11   4211

Table 6.2: Mean of self-rated health

The 95% confidence
intervals are (2.49,2.59) for men and (2.54,2.62) for women. There are
values that are in both confidence intervals.
Here is an example for you to calculate:
Gender   Mean   s      N
Men      3.54   8.15   1570
Women    4.67   8.97   2570
All      4.24   8.69   4140

Table 6.3: Mean of Days Physical Health not Good
6.2.2 Comparing Group means: Exact Method
The method I’ve described is only approximate. There’s a more accurate
way, which requires a little more calculation. This is based on the idea of
looking at the difference between two means. We can regard that difference
as a statistic, and estimate a standard error. Then we can create a confidence
interval for the difference, and ask if it’s possible that the difference is zero.
The standard error of the difference between two means is:
s(x̄1 − x̄2) = √( s1²/N1 + s2²/N2 )
where the subscripts refer to the two groups. You can also say that this is
the square root of the sum of the squared standard errors for the two
groups. If you apply this formula to differences in self-rated health you get
.035. The difference in the means is .04. So the 95% confidence interval is
(-0.03,0.11). The value of zero is important, because it means no difference
between the groups. So for the confidence interval of the difference to
contain zero is like having the confidence interval for the two groups
overlap. In this case, the confidence interval includes zero, so again we
conclude that it’s possible that there’s no real difference (or that men
report worse health than women).
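A minimal Python sketch of the exact method, using the self-rated health figures from Table 6.2 (only the numbers reported there are assumed):

    import math

    mean_m, s_m, n_m = 2.54, 1.08, 1582     # men
    mean_w, s_w, n_w = 2.58, 1.13, 2629     # women

    diff = mean_w - mean_m                              # 0.04
    se_diff = math.sqrt(s_m**2 / n_m + s_w**2 / n_w)    # about .035
    print(diff - 2 * se_diff, diff + 2 * se_diff)       # roughly (-0.03, 0.11); zero is inside the interval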
The approximate method is more conservative, in the sense that
you’re more likely to conclude that there might be no difference. That is,
sometimes you would conclude that there might be no difference using the
approximate method, but conclude that there is a difference using the exact
method. The exact method is the one that you should use: I just began
with the approximate method as a way to introduce the idea of comparing
groups.
6.3 T-values and significance tests
As I’ve mentioned, the value zero is important for many statistics involving
the relationship between variables, because it means no difference or no
relationship. Statistical significance means that we can be reasonably
confident that some statistic representing the relationship between variables
is not zero. E. g., if someone says that “there is a statistically significant
difference between men and women” that is equivalent to saying that the
confidence interval for the difference in means does not include zero.
Sometimes people will just say “there is a significant difference” but this is
ambiguous, since in everyday terms “significant” has other meanings, like
large or theoretically interesting. So if you’re talking about statistical
significance, it’s a good idea to say “statistically significant” rather than
just “significant.”
Say τ̂ is the observed value of a statistic and τ0 is a hypothetical
value you are interested in (most often the value is zero). That is, you are
asking the question “is it possible that the population value of the statistic
is τ0 ?” Finally, sτ is the standard error of the statistic. Then you can
compute a ratio:
(τ̂ − τ0) / sτ
It’s called the “t-ratio” or “t-statistic” because it has a particular
distribution known as the t-distribution. Using a table of the t-distribution,
you can look up the statistical significance of the t-statistic. When the
sample is large, the t-distribution becomes almost exactly like the normal
distribution. I’ve been assuming, and will continue to assume, that the
sample is large enough to just use a normal distribution.
You can use the t-value to conduct a “significance test” of the
hypothesis of no difference. The value of 2 for the t-statistic is the
conventional standard for this test. It corresponds to the 95% confidence
interval: if the t-ratio is bigger than 2 or smaller than -2, then zero is not in
the 95% confidence interval. “Statistically significant” means we can reject
the hypothesis of no difference: we can say that the proposition that there
is no difference in the population is hard to square with the observed
difference in the sample.
So significance tests and confidence intervals give us essentially
the same information. The difference is that significance tests focus on the
hypothetical value of zero. “Statistically significant” means you can be
pretty sure that the population value of a statistic is not zero. You can
also be pretty sure about its sign: you can say that the statistic is pretty
sure to be positive or pretty sure to be negative. The sign is
usually the most basic thing that someone would want to know: for
example, in comparing groups, the sign tells you which group mean is
larger. “Not significant” means you can’t be sure whether the population
value of the statistic is positive, negative, or zero. Note that “not
significant” doesn’t mean positive evidence that the population value is
zero, or even that the population value is “small”. To make a judgment
about the size of any difference, you need to look at the confidence interval.
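As an illustration, a minimal Python sketch of the t-ratio for the days-of-poor-mental-health comparison in Table 6.1 (the hypothetical value is zero, as in the discussion above):

    import math

    mean_m, s_m, n_m = 2.59, 6.76, 1573     # men
    mean_w, s_w, n_w = 3.81, 8.08, 2587     # women

    diff = mean_w - mean_m                              # 1.22
    se_diff = math.sqrt(s_m**2 / n_m + s_w**2 / n_w)    # about 0.23
    t = (diff - 0) / se_diff
    print(t)   # roughly 5.2, well beyond 2, so the difference is statistically significant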
6.4 Comparing more than two groups
If one of the variables is a dichotomy, the t-test tells you whether the
variables are related: “is there a difference between men and women?” is
equivalent to “does gender make a difference?” When you have more than
two groups, you can compare each pair of means in the way described in
the previous sections. With k groups, you have k(k-1)/2 pairs. You might
find significant differences between some pairs but not others. In that case,
it’s not clear whether you should say there is an association between the
variables or not. We will later look at a way to test for an association, but for
now I will just give a rough rule: if many of the differences between pairs
are significant, there’s probably an association between the variables; if
only a small fraction are, probably not.
Chapter 7
Cross-tabulations
The point of comparing means is to see if there’s an association between
two variables. But you can’t compare means when both of the variables
involved are nominal, because you can’t take the mean of a nominal
variable. What if you want to look at the association between two nominal
variables? You can use crosstabulations. Crosstabulations may also be
useful when one or both of the variables are ordinal or interval but have
only a small number of categories. “Small” is a matter of degree, but as a
practical matter, you could say up to about seven: when you have more
than that, cross-tabulations get hard to read. An example of a
cross-tabulation: whether you have any health care coverage and whether
there was a time in the last year when you needed to see a doctor but
couldn’t because of cost.
With two categories of coverage and two categories of whether
you were unable to see a doctor because of cost, you have a total of four
possibilities: health coverage and yes, health coverage and no, no coverage
and yes, no coverage and no.
              Unable to afford doctor
              Yes    No     Total
Coverage      294    3510   3804
No Coverage   173    237    410
Total         467    3747   4214

Table 7.1: Crosstabulation of health coverage and whether unable to see doctor because of cost
You could imagine going through the whole list of people and
classifying them into these four groups. That’s what the four numbers in
the middle tell you. The “total” columns tell you the same information
that you would get in a frequency table: how many people with and
without health coverage there are, how many people were and were not
unable to see a doctor because of cost. For example, the total number of
people who were unable to see a doctor is equal to the number of people
with health coverage who were unable plus the number of people without
coverage who were unable.[1] The problem with this cross-tabulation is that
it’s hard to compare the numbers in the four cells. If we just look at the
numbers, we see the biggest groups are people who have coverage and never
were unable to see a doctor, followed by people who have health care
coverage and were unable to see a doctor. But this just means that most
people have health coverage: it doesn’t tell us about whether there’s any
difference between people who do and people who don’t. So it is more
informative to give the table in terms of percentages. Table 7.2 gives the
percentages calculated separately for people with and without health
coverage (known as the “row percentages,” because the rows represent
different values of health coverage). That is, 42.2% of people without health
coverage were unable to see a doctor, while only 7.7% of the people with
health coverage were. To get this table, you divide the cell values by the
row totals: for example, (294/3804)*100=7.7. You could also do the
calculations in the other direction if you were given Table 7.2: that is, you
could go back and calculate the frequencies in Table 7.1. For example,
42.2*410/100=173.02, which rounds to 173. Because the percentages are
different when you compare the two rows, it appears that the two variables
have something to do with each other: people without health coverage are
more likely to be unable to see a doctor because of cost.
You could also compute the “column percentages,” which are
shown in Table 7.3. You get these by dividing the cells by the totals for the
columns. For example, 63 percent of the people who couldn’t see a doctor
because of cost had health coverage, while 93.7 percent of the people who
never were unable to see a doctor had coverage. The row and column
percentages give the same information in different forms.
[1] These totals are based on the number of people who answer both questions: people who say don’t know or refuse to answer are usually left out of cross-tabulations. So they may be smaller than the totals you get in the frequency tables for the variables.
              Unable to afford doctor
              Yes           No             Total
Coverage      294 (7.7%)    3510 (92.3%)   3804 (100%)
No Coverage   173 (42.2%)   237 (57.8%)    410 (100%)
Total         467           3747           4214

Table 7.2: Crosstabulation with row percents
              Unable to afford doctor
              Yes           No             Total
Coverage      294 (63.0%)   3510 (93.7%)   3804
No Coverage   173 (37.0%)   237 (6.3%)     410
Total         467 (100%)    3747 (100%)    4214

Table 7.3: Crosstabulation with column percents
In this case, if people without health care coverage are more likely
to be unable to see a doctor, then they are going to make up a larger share
of the people who are unable to see a doctor. Note that “a larger share”
doesn’t necessarily mean a majority. In fact, most of the people who were
unable to see a doctor because of cost did have some health coverage.
That’s because most people have health coverage: a small fraction of a
large group can be a bigger number than a large fraction of a small group.
Row percentages will always have to add to 100 (allowing for
rounding error) when you go across the rows. Column percentages have to
add to 100 percent when going down the columns. If you see a table and
aren’t sure what the percentages mean, you can use these facts to figure out
what they are. There is a convention that if one of the variables can be
thought of as a cause and the other as an effect, the “cause” variable is
usually used as the row variable, and the row percentages are shown. E. g.,
in this case coverage status could be thought of as the cause and whether
you were unable to see a doctor as the outcome. It wouldn’t seem
reasonable to think of it the other way round. When you have a cause
variable, many people find it more natural to use that as the base for the
percentages. E. g., in this case, I’d say it’s easier to grasp Table 7.2 than
Table 7.3. It’s not wrong to do it the other way, but it’s a good idea to follow
this convention unless you have a special reason not to. However, there are
a lot of cases where it’s not clear which variable is cause and which is effect,
or when it seems like you could regard it as either one. E. g., general health
and exercise. You could say that exercise affects health (presumably
improves it). On the other hand, health could affect exercise, because
healthier people are likely to find exercise easier and more enjoyable. This
ambiguity is not a problem. The table has the same interpretation
regardless of which is rows and which is columns, so if you’re not sure
about cause and effect you can just make an arbitrary choice.
What you can learn from looking at a cross-tabulation: do the
variables have anything to do with each other? If the row percentages are
different when you compare different row values or the column percentages
are different when you compare columns, then you can say that the
variables have something to do with each other. If the percentages are the
same, then you can say that the variables are unconnected. But you might
want to go beyond this, and distinguish between stronger and weaker
connections. How can you do this? Many statistics have been developed for
this purpose. Almost all of them are based on “residuals,” so we first need
to learn how to calculate residuals.
7.1 Independence and expected values
Suppose that x is a variable. Then you can write x = x̂ + e. x̂ means a
predicted value of x (sometimes called a “fitted value”). e is the “error” or
“residual,” and is computed by x − x̂. There are lots of possible predicted
values: each one represents an idea about what kind of pattern there is in
the values of x. In this case, the idea that we’re interested in evaluating is
that the variables are independent. That is, two variables have nothing to
do with each other; knowing the value of one is of no use in predicting the
value of the other. More precisely, we could say that the distribution of one
variable (call it y) is the same for every value of the other variable (which
we’ll call x). The hypothesis of independence is widely used as a baseline.
How do you get the predicted values under the hypothesis of
independence? Let’s ask what numbers we would expect to see if there were
no association. Then the row percentages would be the same for people
with and without health coverage. For example, if 11.1% of people were
unable to see a doctor because of cost, and whether that happened to you
is independent of health coverage, then 11.1% of people with coverage
would have been unable to see a doctor, and 11.1% of people without
coverage would have been unable to see a doctor. Then compute the
numbers in the hypothetical table by multiplying the percent who were
unable to see a doctor by the numbers with and without coverage, and
dividing by 100. For example, 11.1*410/100=45.51. That’s how many
people without health coverage “should” have been unable to see a doctor,
if the variables were independent. When you use the percentages, the
predicted values are affected by rounding error. A more accurate way to do
the calculations is to multiply the relevant row and column totals, and
divide by the grand total. For example: 467*410/4214=45.44.

              Unable to afford doctor
              Yes      No       Total
Coverage      421.6    3382.4   3804
No Coverage   45.4     364.6    410
Total         467      3747     4214

Table 7.4: Predicted Values Under Independence
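A minimal Python sketch of this calculation for the 2×2 table (only the observed counts from Table 7.1 are assumed):

    import numpy as np

    observed = np.array([[294, 3510],
                         [173, 237]])      # rows: coverage / no coverage; columns: yes / no
    row_totals = observed.sum(axis=1)
    col_totals = observed.sum(axis=0)
    n = observed.sum()
    expected = np.outer(row_totals, col_totals) / n
    print(expected)   # roughly [[421.6, 3382.4], [45.4, 364.6]]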
We can then compute the difference between the actual and
predicted values, which are shown in Table 7.5. They are known as the
“residuals.” Note that the residuals sum to zero if you go across the rows or
down the columns.
              Unable to afford doctor
              Yes       No        Total
Coverage      -127.6    +127.6    0
No Coverage   +127.6    -127.6    0
Total         0         0         0

Table 7.5: Residuals from Model of Independence
7.2 Index of Dissimilarity
What if you wanted a statistic to show how well or badly the predictions
from the model of independence fit the data? You would have to combine
all of the residuals to get some kind of total. But since some of the
residuals are positive and some are negative, the positive and negative
residuals would cancel out, so just adding them wouldn’t work. What if you
added up the absolute values of the residuals? That would give a sort of
total error, which is a more reasonable measure: the lowest possible value
would be zero, meaning a perfect fit. The problem is that the result would
depend on the number of cases. So it might be better to adjust for the total
number of cases. This is what the index of dissimilarity does. The formula:
Σ|e| / (2N)
Where e (for error) is the residual. The maximum value of the index of
dissimilarity depends on the total percentages in the rows and columns in a
complicated way. Therefore, the Index of Dissimilarity is more a rough
guide than an exact statistic. It’s useful when comparing tables, and you
want to say that one has a stronger association than another. With the
table of health coverage and being unable to see a doctor, the index of
dissimilarity is 0.061. The index of dissimilarity can be interpreted as the
proportion of cases that would have to be “moved” in order to make the
model of independence exactly fit the data. In fact, it is sometimes used as
an index of segregation. If you had people of different ethnicities living in
different neighborhoods, you could make a cross-tabulation of neighborhood
and ethnicity. Complete integration would mean that all the neighborhoods
contain the same mix of ethnicities. That is, the variables would be
independent: knowing where someone lived would not give you any clue
about their ethnicity. The index of dissimilarity can be interpreted as the
minimum proportion of the population that would have to be moved in order
to produce perfect integration.
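A minimal sketch of the index for the health-coverage table (the observed counts are the ones in Table 7.1):

    import numpy as np

    observed = np.array([[294, 3510],
                         [173, 237]])
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    residuals = observed - expected
    dissimilarity = np.abs(residuals).sum() / (2 * observed.sum())
    print(dissimilarity)   # about 0.061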
7.3 Standardized residuals
With a two-by-two table, the residuals are all the same number, two
positive and two negative. But with a bigger table, you can have a more
complicated pattern. The predictions may be pretty good for some
combinations, but bad for others. Residuals can be used to identify where
the predictions fit or do not fit. Table 7.6 gives another example: type of
community by marital status.
              Marital Status
              Married        Div.          Widowed       Sep.        Never Married   Unmarried Couple   Total
City          673 (50.5%)    207 (15.5%)   184 (13.8%)   34 (2.6%)   202 (15.1%)     34 (2.6%)          1334 (100.0%)
Urban County  564 (59.8%)    126 (13.4%)   115 (12.2%)   19 (2.0%)   104 (11.0%)     16 (1.7%)          944 (100.0%)
Suburban      321 (62.5%)    53 (10.3%)    72 (14.0%)    10 (2.0%)   43 (8.4%)       15 (2.9%)          514 (100.0%)
MSA           9 (47.4%)      2 (10.5%)     5 (26.3%)     0 (0.0%)    3 (15.8%)       0 (0.0%)           19 (100.0%)
Non-Urban     817 (58.8%)    194 (14.0%)   219 (15.8%)   30 (2.2%)   108 (7.8%)      21 (1.5%)          1389 (100.0%)
Total         2384 (56.8%)   583 (13.9%)   595 (14.2%)   93 (2.2%)   460 (11.0%)     86 (2.1%)          4200 (100.0%)

Table 7.6: Crosstabulation of community and marital status
If the residual is exactly zero, that means a perfect prediction. If
the residual is near zero, that means a good prediction. A residual that’s
much greater than zero or much less than zero means a bad prediction. But
it seems reasonable to take the size of the predicted values into account too.
E. g., if you predict that the Republicans will win 221 seats in the House of
Representatives in 2014 and they actually win 216, you could regard that
prediction as pretty good. If you predicted that someone would have 2
children and they actually had 7, you would regard that prediction as way
off. The “standardized residual” is designed to adjust for the size of the
prediction. It is defined as
e / √n̂
where e is the residual and n̂ is the predicted count in a cell. It’s related to
standardized scores, which we talked about before. To get a standardized
score you subtract the mean and divide by the standard deviation. The
mean residual is zero. The square root of the predicted value is an estimate
of the standard deviation produced by chance variation. This means that
one thing you can do with the standardized residuals is see if any are large:
with an absolute value of more than 2.0, or especially more than 3.0. A
large standardized residual suggests that your fitted value is far enough
from the actual value to make it hard to explain as just a matter of chance.
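A minimal sketch of this calculation, using the same 2×2 health-coverage table as before (the larger tables below work the same way):

    import numpy as np

    observed = np.array([[294, 3510],
                         [173, 237]])
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    std_resid = (observed - expected) / np.sqrt(expected)
    print(std_resid.round(1))   # roughly [[-6.2, 2.2], [18.9, -6.7]]; all far beyond 2.0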
              Marital Status
              Married   Div.     Widowed   Sep.    Never Married   Unmarried Couple   Total
City          -84.2     22.2     -5.0      4.5     55.9            6.7                0
Urban County  28.2      -4.8     -18.7     -1.9    0.6             -3.3               0
Suburban      29.2      -18.2    -0.8      -1.4    -13.3           4.5                0
MSA           -1.8      -0.6     2.3       -0.4    0.9             -0.4               0
Non-Urban     28.6      1.5      22.2      -0.8    -44.1           -7.4               0
Total         0         0        0         0       0               0                  0

Table 7.7: Residuals, community and marital status
              Marital Status
              Married   Div.     Widowed   Sep.     Never Married   Unmarried Couple
City          -3.06     1.63     -0.36     0.82     4.62            1.28
Urban County  1.22      -0.42    -1.62     -0.42    0.06            -0.76
Suburban      1.71      -2.16    -0.10     -0.41    -1.77           1.38
MSA           -0.54     -0.39    -1.41     -0.65    0.64            -0.62
Non-Urban     1.02      0.11     1.58      -0.14    -3.58           -1.40

Table 7.8: Standardized residuals, community and marital status
7.4 Chi-square test
The chi-square statistic is used to test association in cross-tabulations. It is
especially useful when both of the variables in the cross-tabulation are
nominal, although it can also be used with ordinal variables.
Weight             Current       Former       Never          Total
1 Not overweight   88 (6.2%)     44 (3.1%)    1296 (90.8%)   1428 (100%)
2 Overweight       117 (7.9%)    60 (4.1%)    1302 (88.0%)   1479 (100%)
3 Obese            148 (13.3%)   45 (4.1%)    917 (82.6%)    1110 (100%)
Total              353 (8.8%)    149 (3.7%)   3515 (87.5%)   4017 (100%)

Table 7.9: Overweight status by asthma
Table 7.9 gives an example of a table of overweight status (ordinal,
three categories) by asthma status (current, former, never). To calculate
the chi-square statistic, follow these steps:
1. Calculate the predicted values assuming independence.
2. Calculate the residuals
3. Calculate the standardized residuals
4. Calculate the sum of the squares of the standardized residuals. Note
that this will always be a positive number.
5. Calculate the “degrees of freedom” using the formula (I-1)(J-1),
where I is the number of rows and J is the number of columns. In this
example, I=3 and J=3, so the degrees of freedom equals 4.
6. Look up the “critical value” of the chi-square statistic with the
appropriate number of degrees of freedom. If your chi-square is bigger
than the critical value, there’s evidence of an association; if not, there
is no clear evidence–that is, the data are consistent with the idea that
the variables are independent.
Before we start, if you just look at the table it seems that obese
people are more likely to have asthma. Overweight people are in between,
although they seem more like people who are not overweight. That is, the
two variables appear to have something to do with each other: if you know
whether someone is overweight, you can make a better guess about whether
they have asthma. To calculate the predicted values, use the formula
(Nrow × Ncol)/N, where Nrow is the total for that row and Ncol is the total
for that column. For example, for people who are not overweight and
currently have asthma, the predicted value is 353×1428/4017 = 125.5. To
calculate the chi-square statistic, it helps to make a table like Table 7.10.

Cell                      n̂        e       e/√n̂    e²/n̂
Not overweight, current   125.5    -37.5   -3.34    11.20
Not overweight, former    53.0     -9.0    -1.23    1.52
Not overweight, never     1249.5   46.5    1.31     1.73
Overweight, current       130.0    -13.0   -1.13    1.29
Overweight, former        54.9     5.1     0.69     0.48
Overweight, never         1294.2   7.8     0.22     0.05
Obese, current            97.5     50.5    5.11     26.10
Obese, former             41.2     3.8     0.60     0.36
Obese, never              971.3    -54.2   -1.74    3.04
Total                     4017     0                45.76

Table 7.10: Calculating the sum of squared standardized residuals
After you compute the sum of the squared standardized residuals,
there’s not much calculation. You just have to look up the critical value in
a table. With four degrees of freedom, the 5% critical value is 9.49. The
value we see is much bigger than that, so we can conclude there’s pretty
good evidence that the variables are not independent: that is, that there
really are differences between the chance of having asthma depending on
whether you are overweight. The 1% critical value is 13.27, and the 0.1%
critical value is 18.47, so even someone who asked for stronger evidence
would still have to agree. In fact, the chance of getting a value like this just
by chance would be tiny, something like 3 in a billion.
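A minimal Python sketch of the whole calculation for Table 7.9 (only the observed counts are assumed; the rounding differs slightly from the hand calculation):

    import numpy as np

    observed = np.array([[88, 44, 1296],     # not overweight: current, former, never
                         [117, 60, 1302],    # overweight
                         [148, 45, 917]])    # obese
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    chi_square = ((observed - expected) ** 2 / expected).sum()
    df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    print(chi_square, df)   # about 45.8 with 4 degrees of freedom, far beyond the 5% critical value of 9.49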
7.5 Examining Tables
Suppose you obtain a statistically significant chi-square statistic. That
means that there is evidence that the variables are related. However, you
usually want to go beyond that and say how they are related. In some cases,
it’s easy to see: for example, here you can say that the more overweight you
are, the higher your chance of having asthma, but in other cases it’s more
complex: for example, the categories may fall into several groups.
The basic principle is to look for the rows and columns in which
there are large standardized residuals. Those are the ones that are clearly
different from the others. Then try to see if you can give a plausible “story”
about the pattern.
Chapter 8
Correlation and Simple Regression
8.1 Correlation
Correlation, like cross-tabulation, involves the association between
variables. Association means that the variables aren’t independent–they
have something to do with each other. However, “something to do with
each other” is very general, so we usually want to go beyond saying that
there’s some association–we want to be able to say something about
particular kinds of association.
Two forms of association are particularly important. Positive
association: the bigger the value of x, the bigger the value of y (on the
average). Negative association: the bigger the value of x, the smaller the
value of y (again, on the average). Positive and negative association are
meaningful concepts for the association between two ordinal or
interval/ratio variables. They are not meaningful when a nominal variable
is involved. For example, it would not be meaningful to say that there’s a
positive association between ethnicity and years of education, because “the
larger ethnicity is” doesn’t make sense. It would make sense to talk about
positive or negative association between years of education and income,
because “more education” and “more income” are both meaningful ideas.
If you have a dichotomy, you can characterize association as
positive or negative, even though that’s not the natural way to describe it.
E. g., sex (M=1 and F=2) and number of days physical health was not
good (0...30). Earlier in the course, we saw that women tend to have higher
numbers for days physical health was not good. Since female is the higher
value on sex, you could describe this as a positive association: “higher values
of sex” (being female rather than male) go with higher values on the health
question.
Warning: it’s important to note what high values mean for each
variable, and also to say that when you’re describing any results. You can’t
assume that the meaning of a higher value is what you expect from the
variable name. For example, in the question on life satisfaction, higher
numbers mean less satisfied (or more dissatisfied). So if you said that
there’s a positive association between some variable and the life satisfaction
variable, people might draw the wrong conclusion. It’s better to say
something that seems obvious, e.g., that higher values of age mean older,
than to run the risk that people will draw the wrong conclusion. In
practice, that means you need to check the codebook or the “variable view”
before considering the association between variables.
A particular kind of positive and negative association is known as
linear association. It is represented by the equation ŷ = α + βx, where ŷ
represents a predicted value of y. It’s called linear because any equation of
that form corresponds to a straight line on a graph, if position on the
horizontal axis represents the value of a case on one variable and
position on the vertical axis represents the value on another. If one of the
variables can be thought of as a cause and the other as an effect, the
“cause” variable is traditionally put on the x axis. The idea of linear
association is not meaningful for nominal variables, so correlation should
not be used if one or both of the variables you are interested in is nominal.
Correlation can be used if both of the variables are ordinal or interval.
8.1.1 Calculating the correlation
1. “center” x and y by subtracting their means
2. compute the product (x − x̄)(y − ȳ)
3. compute the squares of (x − x̄) and (y − ȳ)
4. the correlation is then Σ(x − x̄)(y − ȳ) / (√Σ(x − x̄)² × √Σ(y − ȳ)²)
Let’s take an example–I’ll use a hypothetical example to make the
calculations easier. Suppose we have two variables x and y, representing
grades on two tests. They are measured 1-4, where 1 means D and 4 means A.

x    y    x − x̄   y − ȳ   (x − x̄)²   (y − ȳ)²   (x − x̄)(y − ȳ)
4    3    +1      +1      1          1          1
4    4    +1      +2      1          4          2
2    2    -1      0       1          0          0
1    1    -2      -1      4          1          2
4    1    +1      -1      1          1          -1
3    1    0       -1      0          1          0
18   12   0       0       8          8          4

Table 8.1: Example of calculating a correlation

Putting it all together, the correlation in this case is 4/(√8 × √8) = 4/8 = 0.5.
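A minimal Python sketch of the same calculation, using the hypothetical grades from Table 8.1:

    x = [4, 4, 2, 1, 4, 3]
    y = [3, 4, 2, 1, 1, 1]
    mx, my = sum(x) / len(x), sum(y) / len(y)                # 3 and 2
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))     # 4
    sxx = sum((a - mx) ** 2 for a in x)                      # 8
    syy = sum((b - my) ** 2 for b in y)                      # 8
    print(sxy / (sxx ** 0.5 * syy ** 0.5))                   # 0.5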
The correlation is a number between -1 and 1 that represents the
linear relationship between two variables. A number that is farther from 0
represents a stronger relationship, in the sense that the variables predict
each other accurately. In terms of a graph, the correlation represents how
closely the points are clustered around a straight line showing the
relationship between x and y. If every point falls exactly on the line, the
correlation is +1 if the line slopes up, -1 if it slopes down. If you have a
horizontal line, the correlation is undefined. If the points are scattered all
over with no pattern, the correlation is zero.
8.1.2 Standard error of the correlation
The standard error of a correlation is approximately √(1 − r²) / √N (the
letter r is often used for the correlation). You can use this formula to
calculate confidence intervals or t-ratios involving the correlation.
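For example, a minimal sketch using the correlation between income and satisfaction reported in Table 8.2 below (r = -.25, N = 3476):

    import math

    r, n = -0.25, 3476
    se = math.sqrt(1 - r ** 2) / math.sqrt(n)   # about .016
    print(r / se)                               # t-ratio of roughly -15, clearly statistically significant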
8.1.3 Correlation matrix
When you have just two variables, you can simply give the correlation. But
when you have more than two, it’s convenient to show the correlation
between each pair in the form of a “matrix.” Table 8.2 shows a correlation
matrix involving the variables sex (1=M 2=F), education, income, and
satisfaction with life (1=very satisfied ... 4=not satisfied).
          Female   Education   Income   Satis
Female    1.000    -.04        -.14     .00
  N       4232     4216        3686     3957
Educ      -.04     1.000       .44      -.13
  N       4212     4216        3679     3947
Income    -.14     .44         1.000    -.25
  N       3686     3679        3686     3476
Satis     .00      -.13        -.25     1.000
  N       3957     3947        3476     3957

Table 8.2: Example of a correlation matrix
To find the correlation between any pair of variables, locate the
column for one and the row for the other. For example, the correlation
between income and education is .44. Note that it doesn’t matter which is
the row and which is the column–the table is symmetrical. In terms of the
formula, it doesn’t matter which you call x and which you call y. The 1.000
in the diagonal means that the correlation of a variable with itself is one.
8.1.4 Correlations and scale
The correlation is not affected by the scale of a variable. For example,
suppose a data set has a measure of height. Then the correlation of any
variable with height is the same regardless of whether you measure height
by inches or centimeters. Also, the correlation of height in inches with
height in centimeters is 1.00. That is, if you know a person’s height in
inches, you can predict their height in centimeters perfectly. So a
correlation is similar to a standardized score, and different from a mean or
standard deviation, in that respect. That is, if I say that the correlation
between two variables is a particular value, you don’t have to know the
units of the variables in order to interpret that. The only thing you need to
know is what a higher value of each variable means.
8.1.5 Interpreting correlations
People sometimes assume that because the possible range of the absolute
value of a correlation is from 0 to 1, correlations with values like 0.1 are too
small to be of interest. This is wrong–the standards for what should count
as a large or small correlation differ depending on the kind of variable you
are talking about. In general, when you are dealing with data on individual
people, the correlations are well under 0.5. When you are dealing with
units like nations, correlations tend to be much larger. So the best way to
judge a correlation between x and y is to look at the correlation of other
variables with x and/or y. How does the correlation you found compare
with other correlations that are generally thought to be “important”?
8.2 A Visual Interpretation
Correlation and regression can both be understood in terms of a
“scatterplot” showing the values of x and y. It’s easier to grasp a scatterplot
when the values of the variables are continuous rather than limited to a
small number of values and there aren’t too many cases. The main data set
for the class has lots of cases, and almost all of the variables have a limited
number of categories. So to look at correlation and regression, I’ll use
another data set, giving selected characteristics of nations that are
members of the Organisation for Economic Cooperation and Development
(the OECD includes most of the affluent nations, plus a few middle-income
nations like Mexico and Turkey). Two of the variables are “Gini
coefficients.” The Gini coefficient is a measure of inequality ranging from 0
(complete equality) to 1 (one person has all of the income in the country).
The data includes the Gini coefficient before taxes and transfers (basically,
inequality in what people earn), and the Gini coefficient after taxes and
government transfers. The mean “before” value is about .46 and the mean
“after” value is .32. That is, government taxes and spending usually make
things more equal, which is to be expected. But we could also ask about
the relationship between before and after values. You would expect a
positive relationship: if a country starts out more equal (relative to others)
it will end up more equal. That’s not logically necessary, but it seems more
likely. But how strong will the relationship be? If every government reduces
inequality by the same amount, then you could predict the amount of
inequality perfectly by knowing how much equality there was before.

Figure 8.1: Scatterplot, inequality before and after taxes and transfers
There is a definite relationship–the higher the inequality in
earnings, the higher the inequality after taxes and transfers. That is, it is a
positive relationship. The correlation is 0.559.
8.3 Regression
Suppose you want a straight line representing the relationship. Visually,
you could try to draw a line that comes close to passing through all of the
points, but different people might make somewhat different choices. So it’s
desirable to have a definite standard. Any line can be represented by an
equation
y = α + βx + e     (8.1)

Suppose you define the best line as the one that makes Σe² as small as
possible. You could try out different values of α and β and then calculate
the sum of the squared errors, but you don’t need to use this trial-and-error
approach. There’s a formula for finding the values that give you the
“least-squares” fit. In this case, it’s ŷ = −.083 + 0.870x.
Given an equation, you can put in values of x for different cases,
and get predicted values of y. For example, the value of x (Gini before) for
the United States is 0.486. Applying the equation, the predicted value for
the Gini after is .340. The actual value of the Gini after for the United
States is .380. That means that the value of e (the residual) for the US is
.04. So the US has more inequality after taxes and transfers than the
equation predicts. Next we might ask whether we should consider that to
be a large error or a small error. To do this, we can compute the standard
deviation of the residuals, and then compute a standardized score. The
standardized residual for the United States is 0.75: that is, the error is not
unusually large.
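A minimal sketch of this prediction, using the coefficients and the United States values quoted above:

    alpha, beta = -0.083, 0.870                  # least-squares coefficients reported above
    gini_before_us = 0.486
    predicted = alpha + beta * gini_before_us    # about .340
    residual = 0.380 - predicted                 # actual minus predicted, about .04
    print(predicted, residual)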
The numbers in the regression equation are known as
“coefficients.” What do the regression coefficients mean? The β coefficient
tells you the effect of a one-unit increase in x on the predicted value of y.
That is, it’s not just a number (like the correlation), but a number of y’s
per x. In this case, both units are points on the Gini coefficient. However,
the units will not normally be the same. An everyday example that helps
to illustrate the nature of a regression coefficient is miles per hour. If you
know the time (hours) that someone drives, and the average speed (miles
per hour) at which they drive, you can compute the distance (miles). With
a regression, when we multiply x by β, we get a predicted value in the same
units as y.
The α coefficient is usually of less interest than β. It gives the
predicted value of y if x=0. However, the value x=0 is not always possible
in principle, and even if it is, it may not exist in practice. In this case, it is
possible in principle (everyone has exactly the same income), but no
country is close to it (the lowest actual value for the Gini before is 0.344.
And if we apply the equation with x=0, we get a predicted value of -.083,
which is impossible because the Gini coefficient can’t be less than zero. So
usually the α coefficient is just treated as a number that you have to have
in the equation, not as something that’s meaningful in its own right.
8.3.1 Residuals
In a good regression model, the residuals should represent unpredictable
factors–that is, there should be no pattern, because a pattern means
something predictable. If there is a pattern, that means you should try to
modify the regression to accommodate it. One way to try to assess the
pattern is to look at the unusually large (positive or negative) residuals and
think about whether those cases have anything in common. In this
example, the largest standardized residual (2.41) is for Chile and the second
largest is for Mexico (2.39). Those are the only residuals greater than 2
(there are none less than -2), but there are two more that are close: Turkey
at 1.93 and South Korea at 1.91. If we ask what those four countries have
in common, one thing that comes to mind is that they are all relatively low income
by the standards of this group (Korea is only a little below average, and the
other three are well below). So that suggests that maybe less affluent
countries don’t do as much to redistribute income as more affluent ones.
That’s just an idea, but later we’ll see how you could evaluate it.
8.3.2 Calculating regression coefficients
The formula for β is:
β = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²     (8.2)
Notice that the numerator is the same as in the formula for
correlations, but the bottom part is different: it just involves the
independent variable x. This is related to an important difference between
correlation and regression. The correlation is symmetrical: it doesn’t matter
which variable you call x and which you call y, you get the same correlation.
The regression coefficient is not symmetrical: you get a different value of β
depending on which is dependent and which is independent.
People are usually primarily interested in β, since it tells you
about the relations between the variables. The α coefficient is necessary,
but usually isn’t the focus of interest. However, if you need to calculate it,
the formula is:
α = ȳ − βx̄     (8.3)
That is, you first calculate β, and then use this formula. The reason that
this formula works is that with a least squares regression, the predicted
value of y when x = x̄ is ȳ.
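A minimal sketch of these formulas, applied to the hypothetical test-score data from Table 8.1:

    x = [4, 4, 2, 1, 4, 3]
    y = [3, 4, 2, 1, 1, 1]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    beta = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    alpha = my - beta * mx
    print(alpha, beta)   # 0.5 and 0.5, so the fitted line is ŷ = 0.5 + 0.5x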
8.3.3 Dependent and Independent Variables
The correlation of x with y is the same as the correlation of y with x. But
the coefficients in the regressions y = α + βx and x = α + βy will not
normally be the same. Therefore, you need to think about which variable
should be on the left. That is called the “dependent variable” and normally
symbolized by y. The variable on the right is called the “independent” or
“predictor” variable and normally symbolized by x. If you think in terms of
cause and effect, x should be the potential cause and y should be the
varaible that is affected. For example, if you were doing a regression with
age and income, income should be the y variable: it may be influenced by
age, but age can’t be influenced by income. Often things are not this clear:
for example, if you have opinions on two subjects, it is logically possible for
the influence to go either way. In such cases, you have to rely on “common
sense” or outside information. For example, if I had the variables of
satisfaction with life and self-rated health, I would choose satisfaction with
life as the dependent. That’s because if someone said that they weren’t
very satisfied and you asked why, it would make perfect sense if they said
“because I am in poor health.” If someone said they were in poor health
and you asked why, it would not seem as natural to say “because I’m not
satisfied with my life.” But this is my judgment, not an issue that can be
decided by statistics.
8.3.4 Analysis of Variance
The file Gini.pdf contains SPSS output involving a regression with Gini
(after) as the dependent variable. The independent variable (Security) is a
measure of the extent of government income security measures (like
retirement, disability, unemployment insurance). One of the goals of these
measures is to increase the income of people who would otherwise be poor,
so to the extent that they are effective in doing this, they should reduce the
Gini index.
Look at the “ANOVA” (analysis of variance) table. One column
is labelled “df” for “degrees of freedom”. We saw degrees of freedom before
in the chi-square and F tests. In a regression, there are N-1 total degrees of
freedom. They are divided into two groups: k “regression” degrees of
freedom, where k is the number of independent variables in the regression
(k=1 in a simple regression), and N-k-1 “residual” degrees of freedom.
The best way to understand the term “degrees of freedom” is as
equivalent to pieces of information. That is, we have observations on a
number of cases. Each case is another piece of information. The regression
expresses the same information in a new way. That is, each value of y is
written as y = ŷ + e. So the regression uses just one number estimated from
the data (the regression coefficient β) to predict part of the variation in y.
The residuals (which are also estimated from the data) account for the rest
of the variation in y.[1]
Another column involves sums of squares. The total sum of squares
is Σ(y − ȳ)². It is broken up into two parts: the regression and residual
sums of squares. The regression sum of squares is Σ(ŷ − ȳ)², that is, the
deviations of the predicted values from the mean. The residual sum of
squares is Σe². The ratio of the regression sum of squares to the total sum
of squares is called the R². The reason for the term is that it’s the square
of the correlation (sometimes called r) between the predicted value of y
and the actual value of y. In the case of simple regression, it’s the square
of the correlation between x and y.[2]
The R-square is a measure of how well you can predict y from the
regression. It’s sometimes called “explained variance,” but “explained”
really just means prediction, not explaining in any deeper sense. Regression
divides up the total sum of squares into the part that can be predicted by x
and the part that can’t be predicted by x.[3]
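A minimal sketch of this decomposition, again using the hypothetical Table 8.1 data and the fitted line ŷ = 0.5 + 0.5x from the earlier sketch (not the Gini regression in the SPSS output):

    x = [4, 4, 2, 1, 4, 3]
    y = [3, 4, 2, 1, 1, 1]
    my = sum(y) / len(y)
    y_hat = [0.5 + 0.5 * a for a in x]                          # predicted values
    total_ss = sum((b - my) ** 2 for b in y)                    # 8.0
    regression_ss = sum((p - my) ** 2 for p in y_hat)           # 2.0
    residual_ss = sum((b - p) ** 2 for b, p in zip(y, y_hat))   # 6.0
    print(regression_ss / total_ss)   # R-square = 0.25, the square of the correlation of 0.5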
The “standard error of the estimate” in the model summary is an
estimate of the standard deviation of the residuals. The reason I call it an
estimate is that the regression based on the observed data is an estimate of
the regression based on the entire population. That is, you could imagine
observing every case and doing the same regression. In that case, you
would have the true regression coefficients, and therefore the “true” errors.
But you actually only observe some of the population, so your regression
coefficients are just estimates of the true coefficients. That means you only have an estimate
of the errors. The distinction between residuals and errors is that residuals
are values obtained from an observed sample, while errors are hypothetical
values in the population.

[1] The reason for the -1 is that the variation is relative to the mean, and calculating the mean takes one degree of freedom. The α coefficient doesn’t predict any of the variation, because it applies to all cases.
[2] The correlation between x and the predicted value of y is 1.00 or -1.00, depending on whether the relationship between x and y is positive or negative.
[3] You may remember that (a + b)² = a² + 2ab + b². The reason that the total sum of squares divides up into two sums of squares is that in a least squares regression, Σ(ŷ − ȳ)e is always zero. That is, the predicted values are uncorrelated with the residuals.
8.4 Transformations
8.4.1 Dummy variables
You can include a dichotomy as an independent variable in a regression. A
higher value of x means being in one category rather than another.
However, it’s conventional to convert dichotomies to “dummy” or
“indicator” variables. These variables have the values 0 or 1; usually they
are named for the category that is one. For example, you could convert the
variable for sex (1=M, 2=F) to a dummy variable called “male” (1=M,
0=F) or one called “female” (0=M, 1=F). This isn’t strictly necessary, but
there are practical advantages. One is that it helps people to understand
the regression output: a significant effect of “sex” tells you that men and
women are different, but not about the direction of the difference. You can
also make a dummy variable out of another type of variable: for example,
for some purposes it might be useful to have a dummy variable for people
aged 65 and above.
8.4.2 Change of Scale
Some statistics, like means, standard deviations, and regression coefficients,
depend on the units of the variables. With these statistics, you may get
very large or very small numbers. This is especially true for regression
coefficients, because they depend on the scale of both independent and
dependent variables. For example, in a regression with the Gini index as
the dependent variable and per-capita GDP as the independent variable,
the β coefficient is equal to -.00000306. In SPSS, this is written as -3.06E-06.
The “E-06” means move the decimal place six places to the left. “3E+03”
would mean move the decimal place 3 places to the right, that is 3000.
Using this notation, you can write any number, but people find it hard to
deal with very large or small numbers. So in these cases, you can make it
easier by changing the scale: that is, multiply or divide one or both of the
variables by multiples of ten. Suppose we divided GDP by 1000, so it was
GDP in thousands of dollars. Then the regression coefficient would be
-.00306. We might go farther, and divide GDP by 100000. Then the
regression coefficient would be -.306. We could also change the dependent
variable, but in this case we would want to multiply it rather than divide it.
For example, suppose we kept GDP as is, but multiplied the Gini coefficient
by 1000. Then when we did the regression, the coefficient would be -.00306.
All of this is purely for convenience. It doesn’t change the
ANOVA table, or any of the predicted values or residuals. Multiplying and
dividing by multiples of ten is a special case of a “linear transformation.” A
linear transformation is any rule for turning an old variable into a new one
that involves multiplication and/or addition of a constant. For example,
you can get from Fahrenheit to Celsius temperatures by a linear
transformation. A linear transformation does not change the predicted
values, residuals, or conclusions from a linear regression.
8.4.3 Non-linear transformations
There are lots of transformations that are not linear. For example, powers:
x², x³, etc. The square root and the reciprocal are also powers, because
√x = x^0.5 and 1/x = x^(−1). Another non-linear transformation is the “common
logarithm,” defined by the relationship 10^log(x) = x. For example, 2 is the
logarithm of 100, because 10² = 100. You can use other “bases,” but 10 is
useful because it makes it easy to get a sense of the size of x if you’re given
log(x). For example, if the logarithm of x is 4.2, you can tell that x is
greater than 10,000 and less than 100,000. Each increase of 1.0 in the
common logarthim of x is equivalent to multiplying x by 10. Sometimes you
see the “natural logarithm,” which is defined by the relationship e^ln(x) = x;
e is a number approximately equal to 2.718. Although e has a lot of
interesting mathematical properties, natural logarithms are simply equal
to common logarithms times a constant (about 2.3), so it doesn’t really
matter which you use. We’ll just consider the common logarithm because
it’s easier to interpret. The logarithm is defined only for positive numbers.
The logarithm of 1 is 0, and the logarithm of a number between 0 and 1 is a
negative number. For example, the logarithm of .01 is -2. The logarithm
goes to minus infinity as x goes to zero. However, you can make the log
transformation apply to variables with a value of 0 by taking the logarithm
of x+k, where k is a small positive number. For example, you could apply
the log transformation to x+0.25 or x+0.10 rather than to x. This is useful
for count variables, which often have values of 0. So this discussion of
transformation just applies to variables that can’t be negative. However,
there are a lot of variables like that: examples include GDP, the
unemployment rate, crime rates, height, weight, number of children.
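A minimal sketch of these transformations (the GDP values and the offset of 0.25 are just illustrative, as above):

    import math

    gdp = [10000, 20000, 40000]                # hypothetical per-capita GDP values
    print([math.log10(v) for v in gdp])        # 4.0, about 4.3, about 4.6

    children = [0, 1, 2, 7]                    # a count variable with zeros
    print([math.log10(c + 0.25) for c in children])   # the offset keeps log(0) from being undefined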
8.4.4 Ladder of Transformations
The 0 power is not useful as a transformation, because x⁰ = 1 for all x, but the
logarithm can be regarded as filling the place of x⁰. This gives you what
has been called a “ladder of transformations”: x^p, where p is a number. As
you go “up” the ladder, the distribution of the new variable w becomes
stretched out to the right–the large values grow proportionately faster–see
the table for an example:
1/x     log(x)   √x     x    x²    x³     x⁴
∞       −∞       0      0    0     0      0
1       0        1      1    1     1      1
0.5     0.30     1.41   2    4     8      16
0.33    0.48     1.73   3    9     27     81
0.20    0.70     2.24   5    25    125    625
0.10    1        3.16   10   100   1000   10000

Table 8.3: Powers of x for selected values of x
As you go “down” the ladder, large values get pulled in, so that
the distribution becomes less stretched out to the right. Negative powers,
like the inverse, also reverse the order–the largest values become the
smallest. You can experiment with different transformations to see which
works best.
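Here is a sketch of that kind of experimentation in Python (the values of x
are the same ones used in Table 8.3):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 5.0, 10.0])

    # A few rungs of the ladder, from "down" to "up"
    ladder = {"1/x": 1 / x, "log(x)": np.log10(x), "sqrt(x)": np.sqrt(x),
              "x": x, "x^2": x ** 2, "x^3": x ** 3}

    for name, w in ladder.items():
        print(name, np.round(w, 2))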
8.4.5
Transformations and nonlinear relationships
The most important reason to transform variables is that the relationship
between x and y might not follow a straight line. As an example, it looks
like there is a negative relationship between GDP and the Gini coefficient,
but it doesn’t seem to follow a straight line. Countries with a GDP of
around $20000 have substantially less inequality than countries with about
$1000, but it’s not clear that countries with about $40000 are much lower
than countries with about $30000. So we might get a better idea of the
relationship if we transformed one or both variables.
Figure 8.2: Relationship between per-capita GDP and Gini coefficient
With powers greater than 1, the slope increases as x increases.
With powers less than one, the slope decreases as x increases. So suppose
that you think that a one-unit change in x has more impact on y when you
start from a small value of x. Then you should use a transformation of x
like the square root, and use the transformed variable (w) as the
independent variable instead of x. If you think that a one-unit change in x
has more impact when you start from a large value, you should use a
transformation like x².
As an alternative to transforming x, you can transform y. Then
the effects of going “up” or “down” the ladder are opposite–for example, if
your dependent variable is y², a one-unit change in x will have a decreasing
effect on y. For example, given the scatterplot for the relationship between
GDP (x) and the Gini index (y), we could consider using a square root or
log transformation for x, or a square or cube transformation for y.
8.4.6
Choosing Transformations
The transformations I’ve discussed are useful for representing two kinds of
relationships which are illustrated in the figures. In one, the slope increases
as x increases. That is, the effect of a change in x on y is larger when you
start from a high value of x. In the other, the slope decreases as x increases:
the effect of a given change in x is larger when you start from a small value
of x. Both kinds of relationships can be either positive or negative, so I
show both positive and negative forms in each figure.
Figure 8.3: Increasing slopes
The rule: for increasing slopes, go “up” the ladder of
transformations on x, or go “down” the ladder on y. For example, you
might try representing the relationship in Figure 8.3 by using x² as an
independent variable, or √y as the dependent variable. You could represent
Figure 8.4 by using √x as the independent variable, or y² as dependent.
There is also the question of how far “up” or “down” the ladder
to go: for example, √x, log(x), or 1/x? You can decide informally by
plotting the transformed variable against the other variable and seeing if
Figure 8.4: Decreasing slopes
the line looks straight. A more formal method, which applies if you
transform x, is to pick the transformation that gives you the largest
regression sum of squares (or the largest R², which follows from the
regression sum of squares). This method does not apply if you are
transforming y, because changing y changes the total sum of squares, so the
R² values from different transformations of y are not comparable.
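A sketch of that comparison in Python (made-up data standing in for GDP and
the Gini coefficient; the transformation of x with the largest R² would be
preferred):

    import numpy as np

    rng = np.random.default_rng(1)
    gdp = rng.uniform(1_000, 50_000, size=200)
    gini = 60 - 3.5 * np.log10(gdp) + rng.normal(0, 2, size=200)

    def r_squared(x, y):
        X = np.column_stack([np.ones_like(x), x])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        return 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()

    for name, w in [("x", gdp), ("sqrt(x)", np.sqrt(gdp)),
                    ("log(x)", np.log10(gdp)), ("1/x", 1 / gdp)]:
        print(name, round(r_squared(w, gini), 3))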
Chapter 9
Multiple Regression
So far, we’ve been talking about regressions with just one independent
variable. But with most dependent variables you might be interested in,
there are a number of factors that might make a difference, and often a large
number of factors. That means you need “multiple regression”–regression
including all of those factors as independent variables.
Suppose we have an idea that people who weigh more earn less,
either because they are less productive or because of discrimination. So you
do a regression with income as the dependent variable and weight
(pounds/100) as the independent variable:
ŷ = 5.424 + .144x
(9.1)
The t-ratio is 1.74, so it wouldn’t usually be considered significant,
but it is pretty close. In any case, the results don’t support the idea. But
this regression omits an important variable. Men tend to be heavier than
women, and men tend to earn more than women. Suppose we include a
dummy variable for men. To distinguish the independent variables, we can
call weight x1 and male x2. Then the regression is:
ŷ = 5.728 − .183x1 + .680x2
(9.2)
The t-ratio for weight is 2.03: that is, the heavier someone is, the
less they earn. The difference between the results is that the first
regression compares people who weigh more to people who weigh less; the
second compares people who weigh more to people of the same sex who
weigh less.
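The same pattern can be reproduced with a small sketch in Python (the data
here are made up and built so that men are both heavier and higher-earning;
they are not the data used in the text):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 2000
    male = rng.integers(0, 2, size=n)                        # dummy variable (x2)
    weight = 1.3 + 0.4 * male + rng.normal(0, 0.2, size=n)   # pounds/100 (x1)
    income = 6 - 0.5 * weight + 1.0 * male + rng.normal(0, 0.8, size=n)

    def coefs(y, *xs):
        X = np.column_stack([np.ones(len(y))] + list(xs))
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        return np.round(b, 3)

    print(coefs(income, weight))        # weight alone: the coefficient is positive
    print(coefs(income, weight, male))  # with the male dummy: the weight coefficient is negative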
With multiple regression, a coefficient represents the difference
that a variable makes to the predicted value “controlling for” all of the
other independent variables in the regression. The term “controlling for”
comes from experiments, where you might be able to literally adjust all of
the variables except the one you’re interested in.
You can’t usually hold variables constant in the social sciences,
but you can think of matching cases so that they’re the same except for one
independent variable. But if you take this literally, it’s impossible to have
two cases that are literally the same except for one thing–e. g., two men
who are the same except one weighs 10 pounds more. Even identical twins,
who are genetically the same, would have different life experiences.
However, many things about people aren’t relevant to their earnings, or
only make a little difference, so it’s not necessary to match them with
respect to those variables. So the realistic goal is to match people on all
relevant factors. If you do that, the regression coefficient can be interpreted
as the effect of the independent variable on the dependent variable.
People often distinguish between the independent variable you are
interested in and “control variables.” The reason to include the control
variables in a regression is because you need them to get accurate estimates
of the variable you are interested in. The goal is to include all of the
control variables that really do influence the dependent variable and
exclude those that don’t. This goal presents a dilemma. If you want to
make sure you include everything that makes a difference, you’ll include
some that are unnecessary. If you want to make sure that you don’t include
unnecessary variables, you run the risk of omitting some that really do
make a difference. Usually, it’s considered better to include unnecessary
ones than to omit ones that do make a difference, so when in doubt, you
should include a control variable. However, there are some drawbacks to
having unnecessary independent variables, so just doing a regression with
every independent variable you have is not considered a good idea. So
people try out different “specifications,” with the goal of finding the one
that includes everything that really does make a difference to the
dependent variable, but doesn’t include superfluous variables.
The interpretation of a regression coefficient is that if xj increases
by one unit while all other independent variables remain the same, then the
predicted value of y will increase by βj . Notice that β can be negative, in
which case “increase by βj ” means that the predicted value of y becomes
smaller. More generally, if xj changes by k and all other independent
variables stay the same, the predicted value will change by βj k. Of course,
some variables can’t change in a literal sense, but in that case you can
think of comparing two cases which are the same with respect to all of the
independent variables but one.
How do you decide if a variable really makes a difference? Look
at the second column in the SPSS output “Std. Error.” We’ve seen
standard errors before, when dealing with the difference between the means
in two groups. The idea is the same here–the standard error is an estimate
of how different the sample value might be from the population value. That
is, it tells you what you could expect to get if you could perform this
regression on the whole population. As before, you get a 95% confidence
interval by taking the estimate plus or minus two times the standard error.
For example, for “male” the estimate is .68 and the standard error is .079.
So the 95% confidence interval is about .52 to .84. The values in this
confidence interval are all positive. That is, we can be confident that in the
population men would be found to have higher incomes than women. So we
definitely do need to take account of gender when considering the effects of
weight. If the confidence interval includes zero, that means we can’t be
sure if the variable makes any difference, and it’s considered all right to remove
the variable. The 95% confidence interval is the usual standard. An
equivalent approach is to look at the column “t”, which is the coefficient
estimate divided by the standard error. If the absolute value of the t-ratio
for a control variable is less than 2.0, we can take it out. If the absolute
value is greater than 2, we know we need to keep it.
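For example, using the numbers quoted above for “male” (estimate .68,
standard error .079), the confidence interval and t-ratio can be checked with
a couple of lines (a sketch; any statistics package reports these directly):

    estimate = 0.680      # coefficient for "male"
    std_error = 0.079

    t_ratio = estimate / std_error        # about 8.6, well above 2
    ci_low = estimate - 2 * std_error     # about .52
    ci_high = estimate + 2 * std_error    # about .84
    print(round(t_ratio, 1), round(ci_low, 2), round(ci_high, 2))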
When you add or remove one control variable, the t-ratios for the
other variables normally change. So if you start with a lot of potential
control variables, it’s best to remove variables one at a time, rather than all
at once. A reasonable approach would be to start from the smallest t-ratio
and keep removing variables until everything left in the regression has a
t-ratio of 2 or above.
You can also start with the variable you’re interested in, and then
add potential control variables. That’s what we did here. If the t-ratio is
significant, keep it in and add another; if it’s not, take it out and add
another in its place. Again, it’s usually best to make these changes one at a
time. So in this case, we would say that sex needs to stay in the regression,
and then think about whether there are other variables that might influence
income; if so, we should add them.
Model Summary: R = .270, R Square = .073, Adjusted R Square = .071,
Std. Error of the Estimate = .60633

ANOVA: Regression SS = 99.568, df = 5, Mean Square = 19.914; Residual SS =
1269.463, df = 3453, Mean Square = .368; Total SS = 1369.031, df = 3458;
F = 54.166, Sig. = .000
Predictors: (Constant), NUMBER OF CHILDREN IN HOUSEHOLD, female,
EDUCATION LEVEL, INCOME LEVEL, REPORTED AGE IN YEARS

Coefficients (Dependent Variable: satisfaction)
Variable                          B       Std. Error   Beta    t        Sig.
(Constant)                        2.627   .073                 35.829   .000
female                            .058    .021         .045    2.690    .007
REPORTED AGE IN YEARS             .004    .001         .099    5.148    .000
EDUCATION LEVEL                   .012    .011         .020    1.109    .267
INCOME LEVEL                      .079    .006         .265    14.197   .000
NUMBER OF CHILDREN IN HOUSEHOLD   .019    .011         .032    1.694    .090

Figure 9.1: Example of multiple regression
9.1
Example of a multiple regression
The SPSS output in Figure 9.1 shows the results from a regression of
satisfaction with life (1=very dissatisfied, 2=dissatisfied, 3=satisfied,
4=very satisfied) on five variables: female (1=female, 0=male), age (in
years), education (1=none, 2=elementary, 3=hs, 4=graduated hs,
5=attended college, 6=graduated from college), income (1=less than 10K,
2=10-15K, 3=15-20K, 4=20-25K, 5=25-35K, 6=35-50K, 7=50-75K, 8=75K
and up), and number of children under 18 in the household. Some questions:
1. What is the predicted value for a 50-year-old man who has graduated
from college, makes $100,000 per year, and has no children?
2. Suppose that the man from question 1 says he is “very satisfied.”
What is his residual?
3. Suppose that the man from question 1 says he is “very dissatisfied.”
What is his residual?
4. What kind of person will have the highest predicted value of
satisfaction?
5. What kind of person will have the lowest predicted value of
satisfaction?
6. What is the predicted value for a 70-year-old man who has graduated
from college, makes $60,000 per year, and has no children?
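To see how the prediction works, here is a worked example for a hypothetical
case that is not one of the questions above: a 30-year-old woman who
graduated from high school (education 4), has an income of $15,000–20,000
(income 3), and has two children at home. Using the coefficients in Figure 9.1:

ŷ = 2.627 + .058 × 1 + .004 × 30 + .012 × 4 + .079 × 3 + .019 × 2 = 3.128

so her predicted satisfaction is a little above “satisfied” (3).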
9.2
Standardized Coefficients
You might want to say something about the relative importance of the
different independent variables. You can’t do this just by looking at the
coefficients, because those depend on the scale of the variable. For example,
the coefficient for age is smaller than the coefficient for female. But the
value of “female” cannot differ by more than one (it is zero or one), while
the value of age can differ by 80 (18-year-old vs. 98-year-old). The
“standardized coefficients” are a way to compare the relative importance of
different independent variables. A standardized coefficient is equal to

β × (sx/sy)

where sx is the standard deviation of the independent variable and sy is the
standard deviation of the dependent variable.
Different independent variables will have different standard deviations, so
the relative sizes of the standardized coefficients will differ from those of the
unstandardized coefficients. However, the signs will always be the same. The farther a
standardized coefficient is from zero, the bigger the impact of x on y. In
this case, we can say that income is the most important variable, then age,
then gender, then number of children, then education. You shouldn’t take
small differences in the standardized coefficients too seriously: for example,
number of children vs. gender. But it’s clear that income is the most
important variable.
You can also think about how the unstandardized coefficients
are related in terms of the original units. For example, gender makes about
as much difference as about 15 years of age. Thinking in terms of the
original units is useful when those units have a meaningful interpretation
(as age and sex do).
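A sketch of the calculation in Python (the data are made up; the standard
deviations of the variables stand in for sx and sy):

    import numpy as np

    rng = np.random.default_rng(3)
    income = rng.integers(1, 9, size=500).astype(float)   # hypothetical income levels 1-8
    satis = 2.6 + 0.08 * income + rng.normal(0, 0.6, size=500)

    X = np.column_stack([np.ones_like(income), income])
    b, *_ = np.linalg.lstsq(X, satis, rcond=None)

    standardized = b[1] * income.std() / satis.std()      # β × (sx / sy)
    print(round(b[1], 3), round(standardized, 3))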
9.3
Direct, Indirect, and Total Effects
I have said that a regression coefficient βj can be interpreted as the
expected change in y if xj increased by one unit and all other x variables
stayed the same. Or if xj is not something that can literally change, you
could think of comparing two cases that differ by one unit on xj and are
the same on the other independent variables: for example, suppose you
compared a man and a woman with the same age, income, education,
marital status, and number of children.
However, this interpretation raises the question of whether it is
reasonable to expect one of the x variables to change while everything else
stays the same. With many things in social life, it seems reasonable to think
that if one thing is different, then several other things will be different. For
example, the effect of education on satisfaction is not statistically
significant in the regression we looked at previously. Does that mean that if
your goal is to maximize satisfaction, getting more education is useless? No:
if someone has more education, then they will probably earn more money,
and income does have a significant effect on satisfaction in that regression.
When you have a number of x variables, you can usually classify
them as more distant or closer to the outcome. First you have things that
are fixed at birth: for example, gender, age, and race or ethnicity. Then you
have things that are established at different times in life. For example,
education is usually finalized in adolescence and early adulthood, then the
kind of job someone gets is influenced by their education. Finally, even
with opinions or feelings, there are some that seem to be prior to others.
These are things that could be thought of as answers to a question “why.”
Note that this isn’t a matter of statistics, it’s about knowledge we
already have (for example, that some things are determined at birth), or
about what seems more reasonable. Of course, what’s reasonable is open to
dispute, but hopefully you could get some consensus. Implications:
1. The independent variables should all be potential causes of the
dependent variable; there should not be anything that is more likely to
be caused by the dependent variable. For example, if your dependent
variable is education, income should not be among the independent
variables: income isn’t a cause of education, it is caused by education.
2. If you include an independent variable, you should include all
variables that come before or are simultaneous with it that seem like
they might influence the dependent variable. For example, if you
include education, you should include gender (before); if you include
gender, you should include race (simultaneous).
3. If you don’t include the variables that come “after” xj , the regression
gives the “total effects” of xj on y.
4. If you do include the variables that come “after” xj , the regression
gives the “direct effects” of xj on y, after controlling for those
variables.
5. The difference between the “total effects” and “direct effects” is
known as the “indirect effects” operating through the later variables.
As an example, suppose that we start with the model in Figure 9.1. The
coefficient for education is .012, and is not statistically significant. Now
let’s remove income–it’s legitimate to do that, since income comes “after”
education. Now the coefficient for education is .080, and the t-statistic is
over 8, which is much bigger than the critical value. Why did it change so
much? Because people who have more education have higher incomes, and
as the previous regression shows, people with higher incomes are more
satisfied. An indirect effect is an effect that has two (or more) steps. In this
Variable     (1)       (2)       (3)
Constant     2.627     2.869     3.350
             (.073)    (.068)    (.037)
Female       .058      .006      .002
             (.021)    (.021)    (.021)
Age          .004      .002      .001
             (.001)    (.001)    (.001)
Educ         .012      .080
             (.011)    (.009)
Income       .079
             (.006)
Children     .019      .023
             (.011)    (.011)

Table 9.1: Regressions for Direct and Total Effects
case, there is a substantial indirect effect of education, and a smaller direct
effect (or maybe no direct effect).
Notice that effects can be either positive or negative, so the total
effect of a variable is not necessarily bigger than the direct effect. For
example, when we just include female and age, the coefficient for female is
only .002. But when we add income, education, and marital status, the
coefficient for female is .069. That is, there is a negative indirect effect that
almost exactly offsets the positive direct effect. Most of that indirect effect
is the result of income–women earn less, so that makes them less satisfied
with their lives. But if you compare men and women with the same income,
women are more satisfied with their lives.
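Using the education coefficients from Table 9.1, the decomposition works out
like this (the total effect comes from the model without income, the direct
effect from the model with income):

total effect ≈ .080, direct effect ≈ .012, so indirect effect ≈ .080 − .012 = .068

that is, most of the effect of education on satisfaction operates through income.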
9.4
Nominal variables in Regression
As I’ve mentioned, linear regression represents the idea that “the bigger x
is, the bigger (or smaller) y is.” That means that you can’t directly use
nominal variables in a regression, because the idea of bigger and smaller
doesn’t make sense for them. However, there is a way to include a nominal
variable as an independent variable. (You can’t have a nominal variable as
a dependent variable in linear regression). You can include a dichotomy in
a regression by arbitrarily defining one category as larger: for example,
male=0 and female=1. Then a positive coefficient means women have a
higher predicted value than men, a negative coefficient that women have a
lower predicted value than men. Note that you could do this just as well by
making male=1 and female=0. For nominal variables, you create a series of
dummy variables, each being one if the nominal variable has a particular
value and zero if it doesn’t. The result is that every case has a value of one
on one of the dummies, and zero on all others. If there are K categories,
you include K-1 of them. The other one is a baseline against which
everything else is compared–just as when you have a dichotomy.
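A sketch of how the dummies might be built in Python (the category values and
the choice of baseline here are hypothetical):

    import numpy as np

    # Hypothetical marital status values for ten cases
    marital = np.array(["married", "divorced", "widowed", "married", "separated",
                        "never married", "unmarried couple", "married",
                        "never married", "divorced"])

    # One dummy per category, leaving "unmarried couple" as the baseline (K - 1 dummies)
    categories = ["married", "divorced", "widowed", "separated", "never married"]
    dummies = {c: (marital == c).astype(int) for c in categories}

    for name, column in dummies.items():
        print(name, column)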
9.4.1
Interpreting the coefficients
The coefficients for the dummy variables representing a nominal variable
have to be interpreted as a group. Each one shows a predicted value in that
group relative to the baseline category. Note that the baseline category is
not explicitly shown: you have to remember what it is. If you want to know
how the other categories compare to each other, you can compare their
coefficients. It is often convenient to arrange the coefficients in a sort of
number line. The baseline category implicitly has a value of zero: make
sure that you include it. Note that the coefficients will be different if you
choose a different baseline, but their relative values will always be the same.
That is, by choosing a different baseline you are just changing the zero
point, not the relations among the categories.
9.4.2
Testing whether a nominal variable makes a
difference
The t-tests for the dummy variables involve comparisons of the category
with the baseline category. That is, each one involves just one pair of
categories. Therefore, they don’t address the more general question of
whether the nominal variable makes a difference. Also, the t-statistics will
change when you use a different reference category, and unlike the
coefficients, there is no uniform relationship among them. Therefore, they
cannot be used to answer the general question of whether the variable
makes a difference. If you see a lot of significant t-ratios, that shows that
the variable makes a difference. However, the converse is not true: an
absence of significant t-statistics doesn’t mean that the variable does not
make a difference. There are several ways to test for whether a nominal
variable makes a difference. The simplest is to compare the Mean Square
Residual, or the Standard Error of the Estimate (which is the square root
of the MSR) for the regressions with and without the nominal variable. If
the model with the nominal variable has a smaller Mean Square Residual,
you can say the nominal variable makes a difference. This is the method we
will use.¹

¹ A better one is to compute the Mean Square Residual divided by the residual
degrees of freedom. Again, if it is smaller, you can say the nominal variable
makes a difference.
9.4.3
Example
I will add marital status to the regression predicting satisfaction with life.
There are six categories of marital status (see the table). I made dummy
variables for the first five categories, leaving members of an unmarried
couple as the reference.

MARITAL STATUS
                    Frequency   Percent   Valid Percent   Cumulative Percent
Married             2396        56.6      56.8            56.8
Divorced            583         13.8      13.8            70.7
Widowed             597         14.1      14.2            84.8
Separated           93          2.2       2.2             87.0
Never Married       460         10.9      10.9            98.0
Unmarried couple    86          2.0       2.0             100.0
Total (valid)       4215        99.6      100.0
Missing             17          .4
Total               4232        100.0

Figure 9.2: Frequency table for marital status

The results of a regression including the extra
variables are shown. Some things to notice:
• The mean square residual is smaller than in the regression without
the marital status variables (.363 compared to .368). So we can say
that marital status seems to make a difference.
• The most satisfied group is married people. They have a positive
coefficient, meaning that they are more satisfied than the reference
group (members of an unmarried couple).
• All of the other groups are less satisfied than members of an
unmarried couple: that is, all have negative coefficients.
• The least satisfied group is people who are separated.
• The coefficients and t-statistics for all of the other variables change.
Some are bigger and some are smaller than before. For example,
education was .012 before marital status was included. When marital
status is included, it is .019. It’s still not statistically significant, but
it’s pretty close. Income is still very significant, but its estimated
effect is smaller (.062 compared to .079).
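Arranged on a number line, as suggested in section 9.4.1, the marital status
coefficients from Figure 9.3 look like this (the reference category, members
of an unmarried couple, sits at zero):

separated (−.213) < divorced (−.132) < n_married (−.049) < widowed (−.026) < unmarried couple (0) < married (.075)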
9.4.4
Combining categories
One problem with including dummy variables to represent nominal variables
is that the regression gets complicated, making it hard for people to grasp
what’s going on. Therefore, it is sometimes useful to combine or “collapse”
categories of a nominal variable. It is legitimate to do this with categories
that are similar in terms of their relation to the dependent variable and
seem similar in principle. For example, in this case, I could combine
married people with members of an unmarried couple. The difference between
the two groups is not statistically significant, and they can be thought of as
similar in the sense that both involve living with a partner. Notice that
widowed people are close to members of an unmarried couple in terms of
satisfaction, but in principle it wouldn’t seem reasonable to combine them.
I also combined separated people with divorced people. They are
pretty similar in satisfaction, and they are also similar in terms of the
nature of their situation. The resulting regression is shown in the next
Model Summary: R = .295, R Square = .087, Adjusted R Square = .084,
Std. Error of the Estimate = .60232

ANOVA: Regression SS = 118.948, df = 10, Mean Square = 11.895; Residual SS =
1248.363, df = 3441, Mean Square = .363; Total SS = 1367.310, df = 3451;
F = 32.787, Sig. = .000
Predictors: (Constant), n_married, female, EDUCATION LEVEL, NUMBER OF CHILDREN
IN HOUSEHOLD, separated, divorced, widowed, INCOME LEVEL, REPORTED AGE IN
YEARS, married

Coefficients (Dependent Variable: satis)
Variable                          B       Std. Error   Beta    t        Sig.
(Constant)                        2.704   .098                 27.731   .000
female                            .068    .022         .053    3.149    .002
REPORTED AGE IN YEARS             .003    .001         .088    3.932    .000
INCOME LEVEL                      .062    .006         .208    10.092   .000
EDUCATION LEVEL                   .019    .011         .032    1.750    .080
NUMBER OF CHILDREN IN HOUSEHOLD   .008    .012         .014    .727     .468
married                           .075    .070         .059    1.075    .283
divorced                          -.132   .074         -.073   -1.783   .075
widowed                           -.026   .078         -.014   -.332    .740
separated                         -.213   .096         -.051   -2.212   .027
n_married                         -.049   .075         -.024   -.655    .512

Figure 9.3: Regression including marital status dummies
Model Summary: R = .294, R Square = .086, Adjusted R Square = .084,
Std. Error of the Estimate = .60235

ANOVA: Regression SS = 118.098, df = 8, Mean Square = 14.762; Residual SS =
1249.212, df = 3443, Mean Square = .363; Total SS = 1367.310, df = 3451;
F = 40.687, Sig. = .000
Predictors: (Constant), divsep, REPORTED AGE IN YEARS, female, EDUCATION LEVEL,
n_married, widowed, NUMBER OF CHILDREN IN HOUSEHOLD, INCOME LEVEL

Coefficients (Dependent Variable: satis)
Variable                          B       Std. Error   Beta    t        Sig.
(Constant)                        2.758   .078                 35.139   .000
female                            .069    .022         .053    3.159    .002
REPORTED AGE IN YEARS             .004    .001         .093    4.213    .000
INCOME LEVEL                      .063    .006         .210    10.200   .000
EDUCATION LEVEL                   .020    .011         .033    1.818    .069
NUMBER OF CHILDREN IN HOUSEHOLD   .009    .012         .015    .777     .437
widowed                           -.100   .037         -.053   -2.698   .007
n_married                         -.118   .037         -.058   -3.176   .002
divsep                            -.215   .031         -.127   -7.010   .000

Figure 9.4: Regression including marital status dummies, combining categories
table (Figure 9.4). The Mean Square Residual is the same, meaning that the models are
equally good in terms of fitting the data, and I prefer the model with
combined categories on the grounds that it is simpler. Notice that the
t-ratios for the variables involving marital status are much bigger than they
were before. That is because the reference category is different: it is now
married people plus members of an unmarried couple. As a general rule,
when the reference category contains a larger number of cases, the standard
errors are smaller and t-ratios are larger.
Chapter 10
Beyond Linear Regression
10.1
Non-linear effects
This issue is related to transformations, which were covered in the previous
chapter. In a linear regression, every one-unit change in x has the same
effect on y as every other one-unit change. For example, going from 18 to
19 has the same effect as going from 28 to 29, or 88 to 89. Of course, this
may not be true. One way to allow for the possibility of non-linear
relationships is to try transformations, like √x or log(x), as independent
variables. Decisions on whether to transform variables should be made
separately for each variable. For example, if you transform age you don’t
have to transform income.
But all of the transformations we have talked about are
“monotonic”–as x gets bigger, f(x) always moves in the same direction,
either steadily up or steadily down. This means that none of
them can represent the situation where there is a “peak” or “valley”–where
the highest or lowest predicted values of y occur for the middle values of x
rather than for the highest or lowest values of x. How can we allow for
non-monotonic effects?
One way is to break up the values of x into a number of dummy
variables. The exact number is flexible, but usually about five is a
reasonable choice. Let’s take age as an example. The regressions we’ve seen
so far suggest that satisfaction increases with age. But we can check to see
if there’s a more complex relationship. I created four dummy variables: age
18-34, 35-49, 50-64, and 65 and up. In the regression, 18-34 is the reference
group. The results show that the two middle-aged groups are somewhat
less satisfied than the youngest group, while the oldest group is more
satisfied. That is, there appears to be a non-monotonic effect.

Coefficients (Dependent Variable: satis)
Variable           B       Std. Error   Beta    t        Sig.
(Constant)         2.936   .063                 46.696   .000
female             .069    .022         .054    3.206    .001
EDUCATION LEVEL    .021    .011         .035    1.922    .055
divsep             -.199   .031         -.117   -6.498   .000
widowed            -.121   .037         -.065   -3.304   .001
nmar               -.141   .036         -.069   -3.881   .000
INCOME LEVEL       .066    .006         .221    10.725   .000
ymid               -.048   .037         -.033   -1.295   .195
omid               -.066   .036         -.050   -1.833   .067
old                .123    .039         .090    3.199    .001

Figure 10.1: Example of using dummy variables for non-linear effects
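The four age-group dummies used in Figure 10.1 could be constructed along
these lines (a sketch in Python; the cut-points and names follow the text,
but the ages here are hypothetical):

    import numpy as np

    age = np.array([22, 37, 45, 51, 63, 70, 85])       # hypothetical ages

    young = ((age >= 18) & (age <= 34)).astype(int)    # 18-34: reference group, not entered
    ymid = ((age >= 35) & (age <= 49)).astype(int)     # 35-49
    omid = ((age >= 50) & (age <= 64)).astype(int)     # 50-64
    old = (age >= 65).astype(int)                      # 65 and up

    print(ymid, omid, old)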
The dummy variable model is just an approximation. If you take
it literally, it implies that age makes no difference between 18 and 34, and
then suddenly your satisfaction falls. This seems very unlikely, especially
since the group limits were arbitrary. A model that allows for
non-monotonic effects that change gradually includes both x and x² as
independent variables. This kind of model can produce a “U” shape or an
upside-down “U” shape for the effect of the variable. This model is known
as a quadratic regression.
A practical issue is that if x is big, x² will be very big, so the
coefficient for the squared term can be very small, even if it has an
important effect. So before squaring, it is a good idea to rescale the
variable if necessary. In this case, I created a variable called age00, which is
age/100. Then I squared that. So someone who is 20 has values of .2
(rescaled x) and .04 (squared), someone who is 50 has .5 and .25, etc. The
t-ratio for the squared term is 4.76. That means it is statistically
significant–it should be in there.
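The rescaled and squared terms themselves take only a couple of lines to
construct (a sketch in Python; the ages are just examples):

    import numpy as np

    age = np.array([20, 40, 50, 60, 80])   # example ages
    age00 = age / 100                      # rescaled: .2, .4, .5, .6, .8
    age002 = age00 ** 2                    # squared term: .04, .16, .25, .36, .64
    print(age00, age002)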
What are the implications of this model? Let’s take an example:
a never-married man who has an education of 5 (some college) and an
income of 5 ($25,000-$35,000). Suppose that you have a 20-year-old man
with those characteristics. His predicted value is:
3.203 + .023 × 5 − .149 + .065 × 5 − 1.476 × 0.2 + 1.658 × .04 = 3.265
What about a 40-year-old man with those characteristics?
3.203 + .023 × 5 − .149 + .065 × 5 − 1.476 × 0.4 + 1.658 × .16 = 3.168
What about a 60-year-old?
3.203 + .023 × 5 − .149 + .065 × 5 − 1.476 × 0.6 + 1.658 × .36 = 3.205
What about an 80-year-old?
3.203 + .023 × 5 − .149 + .065 × 5 − 1.476 × 0.8 + 1.658 × .64 = 3.374
So the quadratic model says that old people are the most
satisfied–in that, it agrees with the linear regression. But it says the least
satisfied people are not the young, but the middle-aged.
Coefficients (Dependent Variable: satis)
Variable           B       Std. Error   Beta    t        Sig.
(Constant)         2.783   .072                 38.900   .000
female             .069    .022         .053    3.180    .001
EDUCATION LEVEL    .020    .011         .033    1.823    .068
divsep             -.217   .031         -.128   -7.100   .000
widowed            -.099   .037         -.053   -2.672   .008
nmar               -.125   .036         -.061   -3.437   .001
INCOME LEVEL       .063    .006         .209    10.184   .000
age00              .323    .073         .084    4.444    .000

Coefficients (Dependent Variable: satis)
Variable           B       Std. Error   Beta    t        Sig.
(Constant)         3.203   .113                 28.229   .000
female             .074    .022         .057    3.396    .001
EDUCATION LEVEL    .023    .011         .038    2.055    .040
divsep             -.203   .031         -.120   -6.644   .000
widowed            -.144   .038         -.077   -3.771   .000
nmar               -.149   .037         -.073   -4.082   .000
INCOME LEVEL       .065    .006         .217    10.582   .000
age00              -1.476  .385         -.385   -3.836   .000
age002             1.658   .348         .486    4.760    .000

Figure 10.2: Example of quadratic regression
There are two general rules that are useful when considering
quadratic regressions. First, the sign of the x² term tells you if it is a U or
an upside-down U shape. If it is positive, it’s a U; if it’s negative, it’s an
upside-down U. In this case, it is positive. Second, there is a formula that
tells you where the “turning point” is. If β1 is the coefficient for x and β2 is
the coefficient for x², the turning point is at

−β1 / (2β2)

In this case, it is

−(−1.476) / (2 × 1.658) = 1.476 / 3.316 = .445
Remember that x is age divided by 100, so that means the minimum
satisfaction occurs at age 44.5. Say we round that off to the nearest whole
number, 45, and plug that into the regression equation. The predicted
value is:
3.203 + .023 × 5 − .149 + .065 × 5 − 1.476 × 0.45 + 1.658 × .2025 = 3.165
That is a little lower than the predicted value for a 40-year old. So the
quadratic regression tells us that satisfaction with life declines until people
are in their mid-40s, and then starts to increase again.
Note that the turning point may not occur within the actual
values of x. In that case, the relationship will be effectively an increasing
slope or a decreasing slope like you get with the transformations we’ve
talked about so far.
The advantage of the dummy variable approach compared to the
quadratic regression is that it is more flexible: it is not limited to the two
basic shapes (U and upside-down U). The disadvantage is that it involves
more variables, and the coefficients are more strongly affected by sampling
error, so it can be harder to see the pattern.
10.2
Interaction (specification) Effects
A standard regression assumes that the independent variable has a single
effect that applies to all cases. If you think in terms of the independent
variable as cause and the dependent as effect, that means a change in the
independent variable always has the same effect on the dependent variable.
For example, say that we’re investigating the relationship between years of
schooling and knowledge of some subject as measured by a test. The
regression equation is:
y = α + βx + e
That means an additional year of school will produce an increase of β
points in everyone. But that seems unlikely when you think about it.
Maybe some kinds of people will tend to learn more: e. g., those who have
more aptitude, those who study harder, those who get better teaching, and so on.
These kinds of differences in the effect of x on y are known as
interaction or specification effects. They mean that β is not a single
number–it differs depending on the value of other variables. Note that
interaction effects are not equivalent to saying that other variables also
matter–that we need to add other independent variables. Even if we have
numerous independent variables, each has just one effect, given by its
coefficient.
If there are interaction effects, the regular regression coefficient is
still meaningful: it gives an average effect of the independent variable. But
you can go beyond the ordinary regression. Research in the social sciences
often deals with interaction effects. They often provide information that
may be useful in comparative evaluation of theories, or practically
important (e. g., suppose you discover that one teaching method works
better for a particular kind of student, a different one works better for
another kind of student), or just unexpected and therefore a potential
subject for more research.
How can you have interaction effects? Suppose that one of the
variables involved is a dichotomy: for example, say we are predicting weight
and think that the effect of some variables might differ by gender. Then
you can divide the sample into two parts, fit your regression separately on
each, and compare the coefficients.
To see if the group differences in the estimated effects of the
variables are statistically significant, you can use a formula we’ve
encountered before. The standard error of β1 − β2 is √(s1² + s2²). For
example, the difference in the estimated effects of education is 2.224, and
the standard error of that difference is 1.11. The 95% confidence interval is
then about (.00, 4.44). That is, we can be pretty sure that education has
more effect among women than among men.
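Using the women’s and men’s education coefficients and standard errors from
Table 10.1, the calculation works out like this:

β1 − β2 = −1.937 − (−4.161) = 2.224
√(s1² + s2²) = √(.851² + .715²) = √(.724 + .511) ≈ 1.11
95% confidence interval: 2.224 ± 2 × 1.11 ≈ (.00, 4.44)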
Variable    All        Women      Men        All
Constant    -127.61    -77.37     -193.82    -74.23
            (13.73)    (18.11)    (22.59)    (17.61)
Male        8.94                              -123.5
            (1.70)                            (27.63)
Age         -.101      -.091      -.148      -.113
            (.035)     (.044)     (.056)     (.035)
Height      4.815      4.086      5.823      4.06
            (.206)     (.272)     (.313)     (.269)
Educ        -3.343     -4.161     -1.937     -4.216
            (.548)     (.715)     (.851)     (.709)
Male*Ht                                       1.790
                                              (.410)
Male*Ed                                       2.291
                                              (.110)

Table 10.1: Example of Separate Regressions for Two Groups

This approach has two limitations. First, it works only for
dichotomies or nominal variables with a few categories. For example, if we
thought there might be an interaction involving age and education, we
couldn’t easily use this approach. We could divide age into two or three
groups, but then we’d lose the distinctions between the groups. Second, it
lets the effect of all variables differ between the groups. Suppose we want to
say that some variables have the same effects, while others have different
effects?
A more general approach to interactions is to create artificial
variables that are the product of two other variables. For example, suppose
we have two new variables male × educ and male × height. Then we run a
regression including those in addition to the other independent variables.
The results are given in the last column of the table. The coefficients for height
and education now show the effects among women. To get the effects among
men, you add them to the relevant interaction coefficients. The interaction
coefficients directly show the difference between the groups in the effects of
the variables.
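A sketch of how the product terms might be created (Python; the data here are
made up, not the data behind Table 10.1):

    import numpy as np

    rng = np.random.default_rng(4)
    n = 8
    male = rng.integers(0, 2, size=n)
    height = rng.normal(67, 4, size=n)      # hypothetical heights (inches)
    educ = rng.integers(10, 17, size=n)     # hypothetical years of schooling

    male_x_height = male * height           # product (interaction) variables
    male_x_educ = male * educ

    print(np.column_stack([male, height, male_x_height]))

For example, using the last column of Table 10.1, the effect of height is 4.06
among women and 4.06 + 1.790 ≈ 5.85 among men.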