Chapter 5-3 Dichotomous Predictor Variables
In this chapter, we will see how to use dichotomous predictor variables in a linear regression
model. What we cover applies to all types of regression models.
The linear regression model, which is what we have fitted in previous chapters, uses an
estimation method called the least-squares method. In this method, the sum of the squared
vertical distances of all the data points from the regression line is minimized.
That is, the regression estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, which are the Y intercept and slope for X, are
chosen so that the equation

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$

has the smallest possible value, which is the same as saying the linear regression line is
simultaneously as close as possible to all of the points in the scatterplot.
For one predictor, the $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimated using the following equations:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\quad\text{and}\quad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\,,
\quad\text{where } \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
These equations are shown in this chapter merely to make the following point:
Interval Scale Assumption
Linear regression, as well as the other forms of regression
taught in this course, assume that all predictor variables have
at least an interval scale. For linear regression, the outcome
variable is also assumed to have at least an interval scale.
This assumption is necessary so that arithmetic can be performed on the values of each predictor
variable. Clearly, arithmetic is done on the values of X in the above equations.
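To make this concrete, here is a minimal Stata sketch that carries out this arithmetic by hand. It
assumes a dataset with an outcome variable y and a predictor x is already in memory (placeholder
names), and its results can be checked against regress y x:

* compute the least-squares slope and intercept by hand, doing arithmetic
* directly on the values of x and y
quietly summarize x
scalar xbar = r(mean)
quietly summarize y
scalar ybar = r(mean)
generate double xy = (x - xbar)*(y - ybar)
generate double xx = (x - xbar)^2
quietly summarize xy
scalar sxy = r(sum)
quietly summarize xx
scalar sxx = r(sum)
display "slope b1     = " sxy/sxx
display "intercept b0 = " ybar - (sxy/sxx)*xbar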
_________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah
School of Medicine. Chapter 5-3. (Accessed February 13, 2012, at http://www.ccts.utah.edu/biostats/?pageId=5385).
Arithmetic Operations On a Dichotomous Variable
The following discussion is found in Chapter 2-6, page 4, but is worth repeating here because we
need to use it extensively in regression modeling.
It makes sense to do arithmetic on an interval-scaled variable, since this scale is
sufficiently close to our notion of integers and real numbers (the interval scale shares the property
of equal intervals with both of these number systems). It does not make sense to do arithmetic on
nominal and ordinal scales, since these scales do not have equal intervals (see box).
__________________________________________________________________________________________________________________
Measurement Scale (also called level of measurement)

Nominal scale       name
                    unordered categories
                    e.g., cancer therapies: chemo, radiation, surgery

Ordinal scale       name + order
                    ordered categories
                    e.g., quality of life: lousy, okay, great

Interval scale      name + order + equal intervals + arbitrary zero point
                    continuous measurement with arbitrary zero
                    e.g., body temperature: 0°F does not imply absence of
                    temperature (although perhaps absence of life). The 0
                    point is just a convention of the scale. Ratios do not
                    make sense--you would not say 101.8°F is 1.05 times as
                    hot as 97°F.

Ratio scale         name + order + equal intervals + absolute zero point
                    continuous measurement with absolute zero
                    e.g., hematocrit: 0% means no hematocrit, however
                    unlikely. Ratios make sense (at least arithmetically)--a
                    Hct of 48% is 1.2 times a Hct of 40%, although at
                    opposite ends of the normal range (so does not
                    necessarily equate to 1.2 times better health).

Dichotomous scale   a special case of the nominal scale, in that it always
                    has just two categories
                    e.g., gender: male or female

A second measurement scale scheme is:

Binary data                    (dichotomous scale)
Unordered categorical data     (nominal scale)
Ordered categorical data       (ordinal scale)
Continuous data                (interval & ratio scales)
__________________________________________________________________________________________________________________
Although it is rarely claimed as such, a dichotomous scale could be considered an interval scale,
since it has order (although perhaps an arbitrary order), it has equal intervals (one interval that is
equal to itself), and one of the categories can be selected to represent the 0 value.
This claim is made by Jum C. Nunnally, one of the best-known psychometric experts (Nunnally
and Bernstein, 1994, p.16):
“When there are only two categories, there is only one interval to consider, so that one
interval may be considered an ‘equal’ interval. That is why binary (dichotomous)
variables may be considered to form interval scales, the point noted above as being so
important to modern regression theory and elsewhere in statistics.”
Nunnally and Bernstein (1994, pp. 189-190) further state:
“As noted in the section titled ‘Another form of Partialling,’ categorical variables are now
used quite commonly in multivariate analysis thanks to Cohen (1968). This use reflects
the point made in Chapter 1 that a scale may be regarded as an interval scale when it
contains only two points. This is the basis of the analysis of variance. If the variable
takes on only two values, such as gender, one level may be coded 0 and the other coded
1…. A variable coded 0 or 1 is called a ‘dummy’ or ‘indicator’ variable. The independent
variable’s ‘scale’ has interval properties, by definition, because the scale has only two
points.”
Sarle (1997), on his web-site discussing measurement theory, states the same thing,
“What about binary (0/1) variables?
For a binary variable, the classes of one-to-one transformations, monotone
increasing/decreasing transformations, and affine transformations are identical--you can't
do anything with a one-to-one transformation that you can't do with an affine
transformation. Hence binary variables are at least at the interval level. If the variable
connotes presence/absence or if there is some other distinguishing feature of one
category, a binary variable may be at the ratio or absolute level.
Nominal variables are often analyzed in linear models by coding binary dummy
variables. This procedure is justified since binary variables are at the interval level or
higher.”
Using these arguments, we are justified in recoding nominal and ordinal predictor variables into
indicator, or dummy, variables, and including them directly in the regression equation. The
regression algorithm treats the indicator variables as interval scales, and performs arithmetic
directly on the 0-1 values.
This claim that dichotomous variables are actually interval scales is rarely taught in statistics
classes, so few people are even aware of why indicator variables work in regression models.
Demonstration That Treating a Dichotomous Variable as an Interval Scale is Reasonable
Statisticians are traditionally trained to think of a 0-1 variable as a “Bernoulli variable,” rather
than as a continuous “interval scale” variable. A Bernoulli variable has mean p and variance
p(1-p), where p is the probability of a 1 (Ross, 1998).
The derivation of this mean and variance for a Bernoulli variable, with the standard deviation
being the square root of the variance, is taught in the first semester of a master’s-degree-level
statistics program. The important point about the formulas is that they use only the nominal scale
property of the variable. That is, they are based on simply counting the number of occurrences of
the variable’s outcomes (how many 0’s and how many 1’s), and then doing arithmetic on the
counts. Arithmetic is not done on the values of the variable themselves.
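For reference, the derivation takes one line. For a variable $X$ taking the value 1 with probability
$p$ and the value 0 with probability $1-p$,

$$E(X) = 0\cdot(1-p) + 1\cdot p = p, \qquad E(X^2) = 0^2\cdot(1-p) + 1^2\cdot p = p,$$

$$\operatorname{Var}(X) = E(X^2) - [E(X)]^2 = p - p^2 = p(1-p).$$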
These formulas for the mean and standard deviation of a Bernoulli variable look very different
from the sample mean and sample standard deviation used in statistics:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad\text{(sample mean)}$$

and

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} \quad\text{(sample standard deviation)}$$
Let’s apply these standard formulas to a dichotomous variable and see what happens.
Reading in the Stata formatted data file, births.dta,
File
Open
Find the directory where you copied the course CD
Find the subdirectory datasets & do-files
Single click on births.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\births.dta", clear
* which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use births, clear
Requesting a frequency table for the dichotomous variable, lowbw, using Stata menus:
Statistics
Summaries, tables & tests
Tables
One-way tables
Categorical variable: lowbw
OK
tabulate lowbw
  low birth |
     weight |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        440       88.00       88.00
          1 |         60       12.00      100.00
------------+-----------------------------------
      Total |        500      100.00
We see that the lowbw variable is a 0-1 variable, or Bernoulli variable.
Using the Bernoulli formulas, we get

mean = p = 60/500 = 0.1200
variance = p(1-p) = 0.1200(0.8800) = 0.1056
standard deviation = $\sqrt{p(1-p)}$ = $\sqrt{0.1056}$ = 0.324962
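The same arithmetic can be done in Stata’s display calculator, using only the counts from the
frequency table:

display "mean = " 60/500
display "variance = " (60/500)*(440/500)
display "standard deviation = " sqrt((60/500)*(440/500))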
Notice how we just use the counts of the categories, the “Freq.” column of the frequency
table, and then do arithmetic on the counts, rather than on the values of the variable. That is, we
computed these statistics using only the nominal scale property of the variable (we just
counted the frequency of occurrence of the name, or label, given to the variable).
Now, using the ordinary statistical formulas for mean and standard deviation, which were
designed for interval scales, where arithmetic is done directly on the values of the variable,
Statistics
Summaries, tables & tests
Summary and descriptive statistics
Summary statistics
Variables: lowbw
Options: standard display
OK
summarize lowbw
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       lowbw |       500         .12     .325287          0          1
We see that the Bernoulli mean is exactly the same as when the ordinary formula for the mean is
applied, both giving 0.12.
We see that the Bernoulli standard deviation of 0.324962 does not quite match the ordinary
standard deviation formula value of 0.325287. However, that is only because the Bernoulli
formula is the population formula. The ordinary population formula for the standard deviation
divides by N rather than N-1,
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \quad\text{(population standard deviation)}$$

where sigma, σ, is the population standard deviation and mu, µ, is the population mean.

If we multiply our sample standard deviation by $\sqrt{\frac{n-1}{n}}$, then we have the population standard
deviation calculation:

$$\sqrt{\frac{n-1}{n}}\; s
= \sqrt{\frac{n-1}{n}} \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}
= \sqrt{\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n}}
= \sigma\,, \quad\text{where } \bar{X} \text{ is assumed to be equal to } \mu$$
When we do that,
display 0.325287*sqrt(499)/sqrt(500)
.32496155
which we see is an exact match to the Bernoulli formula, which gave .324962.
So, treating a dichotomous variable as an interval scale works for descriptive statistics.
That is, treating a dichotomous variable as an interval scale and then applying the ordinary
formulas produces a result identical to treating it as a nominal scale Bernoulli variable and then
applying the Bernoulli formulas.
Next, let’s see what happens with significance tests, checking whether interval scale significance
tests give an identical result to categorical significance tests.
Computing a t test, using lowbw as the outcome variable, using Stata menus:
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Two-group mean-comparison test
Variable name: lowbw
Group variable name: sex
OK
ttest lowbw, by(sex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |     264    .1022727    .0186842    .3035821    .0654831    .1390624
       2 |     236    .1398305    .0226235    .3475482    .0952598    .1844012
---------+--------------------------------------------------------------------
combined |     500         .12    .0145473     .325287    .0914185    .1485815
---------+--------------------------------------------------------------------
    diff |           -.0375578    .0291209               -.0947728    .0196572
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t = -1.2897
Ho: diff = 0                                     degrees of freedom =     498

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0989         Pr(|T| > |t|) = 0.1977          Pr(T > t) = 0.9011
Next, taking the more traditional statistical approach, comparing the proportions using a chi-square
test,
Statistics
Summaries, tables & tests
Tables
Two-way tables with measures of association
Row variable: lowbw
Column variable: sex
Test statistics: Pearson chi-squared
Cell contents: Within-column relative frequencies (i.e., column %’s)
OK
tabulate lowbw sex, chi2 column
+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

 low birth |      sex of baby
    weight |         1          2 |     Total
-----------+----------------------+----------
         0 |       237        203 |       440
           |     89.77      86.02 |     88.00
-----------+----------------------+----------
         1 |        27         33 |        60
           |     10.23      13.98 |     12.00
-----------+----------------------+----------
     Total |       264        236 |       500
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =   1.6645   Pr = 0.197
We discover that the two-tailed p values essentially agree between the t test (0.1977) and the
chi-square test (0.197). Also, notice that the column percents in the crosstabulation table agree
with the means in the t-test output. A proportion is nothing more than a mean of a 0-1 scored variable:
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1 + 0 + \dots + 1}{n} = p$$

so the mean of a 0-1 variable is the proportion of 1’s.
So, it works for significance tests.
We have verified, then, that treating a dichotomous outcome variable as an interval
scale, and then applying ordinary interval-scaled significance tests, provides the same result as
treating it as a categorical variable and applying categorical variable significance tests
(D’Agostino, 1972).
More precisely, D’Agostino (1972) published a similar demonstration, comparing one-way ANOVA to
the chi-square test. A one-way ANOVA with two groups is identically the t test, so his
demonstration applies to the one shown in this chapter. D’Agostino (1972, p. 32) concluded,
“We have seen for the situation studied that the one-way ANOVA procedure and the
standard chi-squared procedure are algebraically similar and under the null hypothesis
asymptotically equivalent. Pointing this out to students and users of statistical methods
may aid substantially in their understanding of statistical methodology. There really are
not two distinct ways of handling this problem.”
It seems kind of surprising that the chi-square test, which has the form

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
= \sum_i \frac{(\text{observed} - \text{expected})^2}{\text{expected}}
= \frac{N(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$

gives an identical result to the t test, since they have very different-looking formulas. In the
chi-square formula, the a, b, c, d are the cell counts of the 2 x 2 crosstabulation table, and N is the
total sample size (we are only doing arithmetic on the counts of values).
It turns out the two formulas are algebraically near-identical.
To see this, first we use the fact that the chi-square statistic is identically the square of the z test
for proportions (see box).
Then, notice that the z test for comparing two proportions,

$$z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}\,,
\quad\text{where } p(1-p) \text{ is the pooled variance,}$$

has exactly the same form as the equal-variance version of the two-sample t test,

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}\,,
\quad\text{where } s^2 \text{ is the pooled variance.}$$

The only difference is how the pooled variance in the denominator is estimated, which is why the
two statistics agree closely but not exactly.
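Plugging the cell counts from the crosstabulation above into the z formula verifies this
numerically, as a quick sketch in Stata’s display calculator:

display "z   = " (27/264 - 33/236)/sqrt(0.12*0.88*(1/264 + 1/236))
display "z^2 = " ((27/264 - 33/236)/sqrt(0.12*0.88*(1/264 + 1/236)))^2

This prints z = -1.2902, close to the t statistic of -1.2897 above, and z^2 = 1.6645, matching the
Pearson chi2(1) from the crosstabulation exactly.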
Suggested Use of This Knowledge
Do nothing with it. If you use a t test to compare two proportions, readers and editors, even
statistical editors, will think you are incompetent, since they will never have heard of all this.
Just be happy with now knowing why you can put a 0-1 variable into a regression equation.
Equivalence of the Chi-Square Test for a 2 × 2 Table and the Two-Proportions z Test (Altman,
1991, pp. 257-258).

Given a 2 × 2 table,

                 Group 1        Group 2
                 a              b
                 c              d
                 a+c = n1       b+d = n2        N = n1 + n2

we have p1 = a/(a+c), p2 = b/(b+d), and the pooled proportion is p = (a+b)/N.

Then, the z test for comparing two proportions is given by

$$z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Substituting, this is equivalent to

$$z = \frac{\dfrac{a}{a+c} - \dfrac{b}{b+d}}
{\sqrt{\dfrac{a+b}{N}\cdot\dfrac{c+d}{N}\left(\dfrac{1}{a+c} + \dfrac{1}{b+d}\right)}}$$

which, after some manipulation, gives the computational shortcut formula for the chi-square test

$$z^2 = \frac{N(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)} = \chi^2$$

Thus, the chi-square with 1 degree of freedom (the 2 × 2 table case) is identically the square of
the z test (the square of the standard normal distribution).
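This can be checked directly on the births data with Stata’s two-sample test of proportions (a
sketch, assuming births.dta is still in memory; prtest stores its test statistic in r(z)):

prtest lowbw, by(sex)
display "z^2 = " r(z)^2     // reproduces Pearson chi2(1) = 1.6645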
Modeling Categorical Variables (“Dummy Variable” Coding or “Indicator Variable”
Coding)
To use a nominal scale (unordered categories) or ordinal scale (ordered categories) in a
regression model, we first convert it to 0-1 variables so that we meet the interval
scale assumption.
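For a predictor with more than two categories, a full set of indicators is needed. As a sketch,
using a hypothetical 3-category variable named therapy (not in the births dataset), the generate()
option of tabulate builds one 0-1 indicator per category:

* creates indicators tx1, tx2, tx3, one per category of therapy;
* one indicator is then omitted from the model as the reference category
tabulate therapy, generate(tx)
regress bweight tx2 tx3

The rest of this chapter works through the two-category case.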
First, let’s try it for a dichotomous variable scored as 1 and 2, which is how sex is scored in the
births dataset,
Statistics
Linear models and related
Linear regression
Dependent variable: bweight
Independent variables: sex
OK
regress bweight sex
      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =   12.18
       Model |  4839398.61     1  4839398.61           Prob > F      =  0.0005
    Residual |   197926455   498   397442.68           R-squared     =  0.0239
-------------+------------------------------           Adj R-squared =  0.0219
       Total |   202765853   499  406344.395           Root MSE      =  630.43

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   -197.071   56.47605    -3.49   0.001    -308.0317   -86.11032
       _cons |   3426.973   87.78347    39.04   0.000     3254.501    3599.444
------------------------------------------------------------------------------
Comparing this to a t test, using Stata menus,
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Two-group mean-comparison test
Variable name: bweight
Group variable name: sex
OK
ttest bweight, by(sex)
Regression output:
      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =   12.18
       Model |  4839398.61     1  4839398.61           Prob > F      =  0.0005
    Residual |   197926455   498   397442.68           R-squared     =  0.0239
-------------+------------------------------           Adj R-squared =  0.0219
       Total |   202765853   499  406344.395           Root MSE      =  630.43

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         sex |   -197.071   56.47605    -3.49   0.001    -308.0317   -86.11032
       _cons |   3426.973   87.78347    39.04   0.000     3254.501    3599.444
------------------------------------------------------------------------------
t-test output:
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       1 |     264    3229.902    38.99802    633.6428    3153.113     3306.69
       2 |     236    3032.831    40.80225     626.816    2952.446    3113.215
---------+--------------------------------------------------------------------
combined |     500    3136.884     28.5077    637.4515    3080.874    3192.894
---------+--------------------------------------------------------------------
    diff |             197.071    56.47605                86.11032    308.0317
------------------------------------------------------------------------------
    diff = mean(1) - mean(2)                                      t =   3.4895
Ho: diff = 0                                     degrees of freedom =     498

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9997         Pr(|T| > |t|) = 0.0005          Pr(T > t) = 0.0003
We notice that the slope in the regression model is the same as the mean difference in the t test
output. The sign is different, but that is because the t test procedure subtracts group 2 from group
1 (subtract the 2nd row from the 1st row), whereas the regression model subtracts group 1 from
group 2 (the change from left to right on the number line).
The intercept term is rather strange. It is an extrapolation out to a sex of 0 (1 = male, 2 = female),
which is needed since the Y intercept occurs at an X equal to 0.
It is okay to use a dichotomous 1-2 variable in linear regression, then, as long as you don’t plan
to interpret the intercept term.
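To see exactly what the fitted line is doing, plug the two possible values of sex into the estimated
equation (a worked check using the output above):

$$\widehat{\text{bweight}} = 3426.973 - 197.071 \times \text{sex}$$

For males (sex = 1), this gives 3426.973 - 197.071 = 3229.902, and for females (sex = 2),
3426.973 - 394.142 = 3032.831, which are exactly the two group means in the t-test output.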
A more intuitive result comes from recoding the 1-2 variable into a 0-1 variable. These 0-1
variables are called dummy variables or indicator variables. The name indicator variable comes
from the fact that a score of 1 indicates the presence of the attribute.
A natural naming convention, then, is to give the indicator variable the name of what it indicates.

sex                ... recoded to ...    male
1 = male                                 1 = male
2 = female                               0 = female
Data
Create or change variable
Other variable transformation commands
Recode categorical variable
Main tab: Variables: sex
Required: (1=1)(2=0)
Options tab: Generate new variables: male
OK
recode sex (1=1)(2=0), generate(male)
Checking our work,
Statistics
Summaries, tables & tests
Tables
Two-way tables with measures of association
Row variable: sex
Column variable: male
Uncheck Test statistics: Pearson chi-squared
Uncheck Cell contents: Within-column relative frequencies
OK
tabulate sex male
    sex of |         male
      baby |         0          1 |     Total
-----------+----------------------+----------
         1 |         0        264 |       264
         2 |       236          0 |       236
-----------+----------------------+----------
     Total |       236        264 |       500
We see that 1 stayed 1, and 2 went to 0, so we did it correctly.
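An equivalent one-line recode, a common Stata idiom, uses a logical expression, which evaluates
to 1 when true and 0 when false (male2 is a scratch name here so the variable just created is not
disturbed):

* (sex == 1) is 1 for males and 0 for females; the if qualifier keeps
* missing values of sex missing rather than coding them as 0
generate byte male2 = (sex == 1) if !missing(sex)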
Requesting the regression again, this time with male instead of sex
Statistics
Linear models and related
Linear regression
Dependent variable: bweight
Independent variables: male
OK
regress bweight male
      Source |       SS       df       MS              Number of obs =     500
-------------+------------------------------           F(  1,   498) =   12.18
       Model |  4839398.61     1  4839398.61           Prob > F      =  0.0005
    Residual |   197926455   498   397442.68           R-squared     =  0.0239
-------------+------------------------------           Adj R-squared =  0.0219
       Total |   202765853   499  406344.395           Root MSE      =  630.43

------------------------------------------------------------------------------
     bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |    197.071   56.47605     3.49   0.001     86.11032    308.0317
       _cons |   3032.831   41.03753    73.90   0.000     2952.202    3113.459
------------------------------------------------------------------------------
Comparing this to the t test output,
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Two-group mean-comparison test
Variable name: bweight
Group variable name: male
OK
ttest bweight, by(male)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     236    3032.831    40.80225     626.816    2952.446    3113.215
       1 |     264    3229.902    38.99802    633.6428    3153.113     3306.69
---------+--------------------------------------------------------------------
combined |     500    3136.884     28.5077    637.4515    3080.874    3192.894
---------+--------------------------------------------------------------------
    diff |            -197.071    56.47605               -308.0317   -86.11032
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -3.4895
Ho: diff = 0                                     degrees of freedom =     498

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0003         Pr(|T| > |t|) = 0.0005          Pr(T > t) = 0.9997
We see that the linear regression constant term now correctly represents the mean female
birthweight (the mean of bweight when male = 0).
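As an aside, Stata 11 and later can build the indicator internally with factor-variable notation, so
the hand recode is not strictly required. As a sketch, ib2.sex declares category 2 (female) the base,
reproducing the male coefficient and constant above:

* factor-variable version: no recode needed; female (2) is the base category
regress bweight ib2.sex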
References
Altman DG. (1991). Practical Statistics for Medical Research. New York, Chapman &
Hall/CRC.
Cohen J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin
70:426-443.
D’Agostino RB. (1972). Relation between the chi-squared and ANOVA tests for testing the
equality of k independent dichotomous populations. The American Statistician
26(3):30-32.
Nunnally JC, Bernstein IH. (1994). Psychometric Theory. 3rd ed. New York, McGraw-Hill.
Sarle WS. (1997). Measurement theory: frequently asked questions. Version 3, Sep 14.
URL: ftp://ftp.sas.com/pub/neural/measurement.html