Chapter 5-3 Dichotomous Predictor Variables

In this chapter, we will see how to use dichotomous predictor variables in a linear regression model. What we cover applies to all types of regression models.

The linear regression model, which is what we have fitted in previous chapters, uses an estimation method called the least-squares method. In this method, the sum of the squared vertical distances of all the data points from the regression line is minimized. That is, the regression estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, which are the Y intercept and slope for X, for the equation

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2$$

are chosen so that this equation has the smallest possible value, which is the same as saying the linear regression line is as close as possible simultaneously to all of the points in the scatterplot.

For one predictor, $\hat{\beta}_0$ and $\hat{\beta}_1$ are estimated using the following equations:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \quad \text{and} \quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}, \quad \text{where } \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

These equations are shown in this chapter merely to make the following point:

Interval Scale Assumption

Linear regression, as well as the other forms of regression taught in this course, assumes that all predictor variables have at least an interval scale. For linear regression, the outcome variable is also assumed to have at least an interval scale.

This assumption is necessary so arithmetic can be performed on the values of each predictor variable. Clearly, arithmetic is done on the values of X in the above equations.

_________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual. Salt Lake City, UT: University of Utah School of Medicine. Chapter 5-3. (Accessed February 13, 2012, at http://www.ccts.utah.edu/biostats/?pageId=5385).

Chapter 5-3 (revision 13 Feb 2012) p. 1

Arithmetic Operations On a Dichotomous Variable

The following discussion is found in Chapter 2-6, page 4, but is worth repeating here because we need to use it extensively in regression modeling.
It makes sense to do arithmetic on an interval scaled variable, since this scale is sufficiently close to our notion of integers and real numbers (the interval scale shares the property of equal intervals with both of these number systems). It does not make sense to do arithmetic on nominal and ordinal scales, since these scales do not have equal intervals (see box).

__________________________________________________________________

Measurement Scale (also called level of measurement)

Nominal scale
    name
    unordered categories
    e.g., cancer therapies: chemo, radiation, surgery

Ordinal scale
    name + order
    ordered categories
    e.g., quality of life: lousy, okay, great

Interval scale
    name + order + equal intervals + arbitrary zero point
    continuous measurement with arbitrary zero
    e.g., body temperature: 0°F does not imply absence of temperature (although perhaps absence of life). The 0 point is just a convention of the scale. Ratios do not make sense: you would not say 101.8°F is 1.05 times as hot as 97°F.

Ratio scale
    name + order + equal intervals + absolute zero point
    continuous measurement with absolute zero
    e.g., hematocrit: 0% means no hematocrit, however unlikely. Ratios make sense (at least arithmetically): a Hct of 48% is 1.2 times a Hct of 40%, although at opposite ends of the normal range (so does not necessarily equate to 1.2 times better health).

Dichotomous scale (a special case of the nominal scale, in that it always has just two categories)
    e.g., gender: male or female

A second measurement scale scheme is:

    Binary data                   (dichotomous scale)
    Unordered categorical data    (nominal scale)
    Ordered categorical data      (ordinal scale)
    Continuous data               (interval & ratio scales)

__________________________________________________________________

Although it is rarely claimed as such, a dichotomous scale could be considered an interval scale, since it has order (although perhaps an arbitrary order), it has equal intervals (one interval that is equal to itself), and one of the categories can be selected to represent the 0 value.

This claim is made by Jum C. Nunnally, one of the best-known psychometric experts (Nunnally and Bernstein, 1994, p.16):

“When there are only two categories, there is only one interval to consider, so that one interval may be considered an ‘equal’ interval. That is why binary (dichotomous) variables may be considered to form interval scales, the point noted above as being so important to modern regression theory and elsewhere in statistics.”

Nunnally and Bernstein (1994, pp. 189-190) further state:

“As noted in the section titled ‘Another form of Partialling,’ categorical variables are now used quite commonly in multivariate analysis thanks to Cohen (1968). This use reflects the point made in Chapter 1 that a scale may be regarded as an interval scale when it contains only two points. This is the basis of the analysis of variance. If the variable takes on only two values, such as gender, one level may be coded 0 and the other coded 1…. A variable coded 0 or 1 is called a ‘dummy’ or ‘indicator’ variable. The independent variable’s ‘scale’ has interval properties, by definition, because the scale has only two points.”

Sarle (1997), on his web-site discussing measurement theory, states the same thing,

“What about binary (0/1) variables? For a binary variable, the classes of one-to-one transformations, monotone increasing/decreasing transformations, and affine transformations are identical--you can't do anything with a one-to-one transformation that you can't do with an affine transformation. Hence binary variables are at least at the interval level.
If the variable connotes presence/absence or if there is some other distinguishing feature of one category, a binary variable may be at the ratio or absolute level. Nominal variables are often analyzed in linear models by coding binary dummy variables. This procedure is justified since binary variables are at the interval level or higher.”

Using these arguments, we are justified in recoding nominal and ordinal predictor variables into indicator, or dummy, variables and including them directly in the regression equation. The regression algorithm treats the indicator variables as interval scales, and performs arithmetic directly on the 0-1 values.

This claim that dichotomous variables are actually interval scales is rarely taught in statistics classes, so few people are even aware of why indicator variables work in regression models.

Demonstration That Treating a Dichotomous Variable as an Interval Scale is Reasonable

Statisticians are traditionally trained to think of a 0-1 variable as a “Bernoulli variable,” rather than as a continuous “interval scale” variable. A Bernoulli variable has mean $p$ and variance $p(1-p)$, where $p$ is the probability of a 1 (Ross, 1998). The derivation of this mean and variance for a Bernoulli variable, with the standard deviation being the square root of the variance, is taught in the first semester of a master's degree level statistics program.

The important point about the formulas is that they use only the nominal scale property of the variable. That is, they are based on simply counting the number of occurrences of each of the variable's outcomes (how many 0’s and how many 1’s), and then doing arithmetic on the counts. Arithmetic is not done on the values of the variable themselves.
These formulas for the mean and standard deviation of a Bernoulli variable look very different from the sample mean and sample standard deviation used in statistics:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} \quad \text{(sample mean)}$$

and

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} \quad \text{(sample standard deviation)}$$

Let’s apply these standard formulas to a dichotomous variable and see what happens.

Reading in the Stata formatted data file, births.dta,

    File
    Open
    Find the directory where you copied the course CD
    Find the subdirectory datasets & do-files
    Single click on births.dta
    Open

    use "C:\Documents and Settings\u0032770.SRVR\Desktop\Biostats & Epi With Stata\datasets & do-files\births.dta", clear

    * which must be all on one line, or use:

    cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
    cd "Biostats & Epi With Stata\datasets & do-files"
    use births, clear

Requesting a frequency table for the dichotomous variable, lowbw, using Stata menus:

    Statistics
    Summaries, tables & tests
    Tables
    One-way tables
    Categorical variable: lowbw
    OK

    tabulate lowbw

      low birth |
         weight |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        440       88.00       88.00
              1 |         60       12.00      100.00
    ------------+-----------------------------------
          Total |        500      100.00

We see that the lowbw variable is a 0-1 variable, or Bernoulli variable. Using the Bernoulli formulas, we get

    mean = p = 60/500 = 0.1200
    variance = p(1-p) = 0.1200(0.8800) = 0.1056
    standard deviation = $\sqrt{p(1-p)}$ = 0.324962

Notice how we just use the counts of the categories, the “Freq.” column of the frequency table, and then do arithmetic on the counts, rather than on the values of the variable. That is, we computed these statistics using only the nominal scale property of the variable (we just counted the frequency of occurrence of the name, or label, given to each category).
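The Bernoulli arithmetic above can be reproduced in Stata from the two counts alone, without opening births.dta. The following is a minimal sketch; the scalar names n_ones, n_all and p_low are illustrative choices, not part of the course materials.

```stata
* Counts copied from the "Freq." column of the frequency table of lowbw
scalar n_ones = 60
scalar n_all  = 500

* Bernoulli formulas: arithmetic is done on the counts only
scalar p_low = n_ones/n_all
display "mean (p)              = " p_low
display "variance p(1-p)       = " p_low*(1 - p_low)
display "standard deviation    = " sqrt(p_low*(1 - p_low))
```

The three displayed values should agree with the hand calculation of 0.12, 0.1056 and 0.324962.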
Now, using the ordinary statistical formulas for mean and standard deviation, which were designed for interval scales, where arithmetic is done directly on the values of the variable,

    Statistics
    Summaries, tables & tests
    Summary and descriptive statistics
    Summary statistics
    Variables: lowbw
    Options: standard display
    OK

    summarize lowbw

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
           lowbw |       500         .12     .325287          0          1

We see that the Bernoulli mean is exactly the same as when the ordinary formula for the mean is applied, both giving 0.12.

We see that the Bernoulli standard deviation of 0.324962 does not quite match the ordinary standard deviation formula value of 0.325287. However, that is only because the Bernoulli formula is the population formula. The ordinary population formula for the standard deviation divides by N rather than N-1,

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \quad \text{(population standard deviation)}$$

where sigma, $\sigma$, is the population standard deviation and mu, $\mu$, is the population mean.

If we multiply our sample standard deviation by $\sqrt{(n-1)/n}$, then we have the population standard deviation calculation,

$$\sqrt{\frac{n-1}{n}}\; s = \sqrt{\frac{n-1}{n}} \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n}}$$

where $\bar{X}$ is assumed to be equal to $\mu$.

When we do that,

    display 0.325287*sqrt(499)/sqrt(500)

    .32496155

which we see matches the Bernoulli formula, which gave .324962.

So, treating a dichotomous variable as an interval scale works for descriptive statistics. That is, treating a dichotomous variable as an interval scale and then applying the ordinary formulas produces the same result as treating it as a nominal scale Bernoulli variable and then applying the Bernoulli formulas.

Next, let’s see what happens with significance tests, seeing if interval scale significance tests give the same result as categorical significance tests.
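Before turning to significance tests, the descriptive comparison can also be checked on a stand-in dataset. The sketch below does not use births.dta; it assumes only the counts from the frequency table (60 ones and 440 zeros) and builds a variable with that distribution, so the order of the observations is arbitrary.

```stata
* Stand-in data: 500 observations with 60 ones and 440 zeros
clear
set obs 500
generate byte lowbw = (_n <= 60)

* Ordinary (interval scale) formulas applied to the 0-1 values
summarize lowbw

* r(sd) divides by n-1; rescale it to the divide-by-n (population) form
display r(sd)*sqrt((r(N) - 1)/r(N))

* Bernoulli standard deviation computed from the proportion alone
display sqrt(r(mean)*(1 - r(mean)))
```

The last two displayed values should coincide, which is the point of the rescaling argument above.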
Computing a t test, using lowbw as the outcome variable, using Stata menus:

    Statistics
    Summaries, tables & tests
    Classical tests of hypotheses
    Two-group mean-comparison test
    Variable name: lowbw
    Group variable name: sex
    OK

    ttest lowbw, by(sex)

    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
           1 |     264    .1022727    .0186842    .3035821    .0654831    .1390624
           2 |     236    .1398305    .0226235    .3475482    .0952598    .1844012
    ---------+--------------------------------------------------------------------
    combined |     500         .12    .0145473     .325287    .0914185    .1485815
    ---------+--------------------------------------------------------------------
        diff |           -.0375578    .0291209               -.0947728    .0196572
    ------------------------------------------------------------------------------
        diff = mean(1) - mean(2)                                      t =  -1.2897
    Ho: diff = 0                                     degrees of freedom =      498

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.0989         Pr(|T| > |t|) = 0.1977          Pr(T > t) = 0.9011

Next, taking the more traditional statistical approach, comparing the proportions using a chi-square test,

    Statistics
    Summaries, tables & tests
    Tables
    Two-way tables with measures of association
    Row variable: lowbw
    Column variable: sex
    Test statistics: Pearson chi-squared
    Cell contents: Within-column relative frequencies (i.e., column %’s)
    OK

    tabulate lowbw sex, chi2 column

    +-------------------+
    | Key               |
    |-------------------|
    |     frequency     |
    | column percentage |
    +-------------------+

     low birth |      sex of baby
        weight |         1          2 |     Total
    -----------+----------------------+----------
             0 |       237        203 |       440
               |     89.77      86.02 |     88.00
    -----------+----------------------+----------
             1 |        27         33 |        60
               |     10.23      13.98 |     12.00
    -----------+----------------------+----------
         Total |       264        236 |       500
               |    100.00     100.00 |    100.00

              Pearson chi2(1) =   1.6645   Pr = 0.197

We discover that the two-tailed p values are nearly identical between the t test (0.1977) and the chi-square test (0.197).

Also, notice that the column percents for low birth weight in the crosstabulation table (10.23% and 13.98%) agree with the means in the t-test output (.1022727 and .1398305). A proportion is nothing more than a mean of a 0-1 scored variable:

$$\text{(mean)} \quad \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{1 + 0 + \cdots + 1}{n} = p \quad \text{(proportion)}$$

So, it works for significance tests.

We have verified, then, that treating a dichotomous outcome variable as an interval scale, and then applying ordinary interval scaled significance tests, provides essentially the same result as treating it as a categorical variable and applying categorical variable significance tests (D’Agostino, 1972).

D’Agostino (1972) published a similar demonstration, comparing one-way ANOVA to the chi-square test. A one-way ANOVA with two groups is identically the t test, so his demonstration applies to that shown in this chapter. D’Agostino (1972, p. 32) concluded,

“We have seen for the situation studied that the one-way ANOVA procedure and the standard chi-squared procedure are algebraically similar and under the null hypothesis asymptotically equivalent. Pointing this out to students and users of statistical methods may aid substantially in their understanding of statistical methodology. There really are not two distinct ways of handling this problem.”

It seems kind of surprising that the chi-square test, which has the form:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{N(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$

gives essentially the same result as the t test, since they have very different looking formulas. In the chi-square formula, the a, b, c, d are the cell counts of the 2 x 2 crosstabulation table, and N is the total sample size (we are only doing arithmetic on the counts of values).

It turns out the two formulas are very closely related algebraically. To see this, first we use the fact that the chi-square test is algebraically identical to the z test for proportions (see box). Then, notice that the z test for comparing two proportions

$$z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where p(1-p) is the pooled variance, has the same form as the equal variance version of the two-sample t test

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$

where s is the pooled standard deviation. The two statistics differ only in how the pooled variance is estimated, which is why the results above agree closely (t = -1.2897 squared is 1.6633, against a chi-square of 1.6645) without matching in every decimal place.

Suggested Use of This Knowledge

Do nothing with it. If you use a t test to compare two proportions, readers and editors, even statistical editors, will think you are incompetent, since they will never have heard about all this. Just be happy with now knowing why you can put a 0-1 variable into a regression equation.

__________________________________________________________________

Equivalence of the Chi-Square Test for a 2 × 2 Table and the Two-Proportions Z Test (Altman, 1991, pp 257-258).

Given a 2 × 2 table,

        Group 1        Group 2
           a              b
           c              d
        a+c = n1       b+d = n2       N = n1 + n2

We have p1 = a/(a+c), p2 = b/(b+d), and the pooled proportion is p = (a+b)/N. Then, the z test for comparing two proportions is given by

$$z = \frac{p_1 - p_2}{\sqrt{p(1-p)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

Substituting, this is equivalent to

$$z = \frac{\dfrac{a}{a+c} - \dfrac{b}{b+d}}{\sqrt{\dfrac{a+b}{N} \cdot \dfrac{c+d}{N}\left(\dfrac{1}{a+c} + \dfrac{1}{b+d}\right)}}$$

which, after some manipulation, gives the computation shortcut formula for the chi-square test

$$z^2 = \frac{N(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)} = \chi^2$$

Thus, the chi-square with 1 degree of freedom (the 2 × 2 table case) is identically the square of the z test (the square of the standard normal distribution).

__________________________________________________________________
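The identities in the box can be checked numerically with the cell counts from the lowbw by sex crosstabulation. This is a sketch only: the scalar names are illustrative, and the last line uses a standard identity that is not stated in the chapter, namely that for a 0-1 outcome the pooled-variance t statistic and the Pearson chi-square are related by chi2 = N t^2 / (t^2 + N - 2).

```stata
* Cell counts from the crosstabulation, in the box's layout:
* the lowbw = 1 row supplies a and b, the lowbw = 0 row supplies c and d
scalar cell_a = 27
scalar cell_b = 33
scalar cell_c = 237
scalar cell_d = 203
scalar n_tot  = cell_a + cell_b + cell_c + cell_d

* z test for two proportions, using the pooled proportion
scalar prop1  = cell_a/(cell_a + cell_c)
scalar prop2  = cell_b/(cell_b + cell_d)
scalar pooled = (cell_a + cell_b)/n_tot
scalar z_stat = (prop1 - prop2)/sqrt(pooled*(1 - pooled)*(1/(cell_a + cell_c) + 1/(cell_b + cell_d)))
display "z squared         = " z_stat^2

* Shortcut chi-square formula from the box
display "shortcut formula  = " n_tot*(cell_a*cell_d - cell_b*cell_c)^2/((cell_a + cell_b)*(cell_a + cell_c)*(cell_b + cell_d)*(cell_c + cell_d))

* Stata's own Pearson chi-square from the same counts
tabi 27 33 \ 237 203, chi2

* Converting the t statistic reported by ttest lowbw, by(sex)
scalar t_stat = -1.2897
display "from t statistic  = " n_tot*t_stat^2/(t_stat^2 + n_tot - 2)
```

The first three results should all equal the Pearson chi2(1) of 1.6645 shown earlier; the last agrees up to the rounding of t, which illustrates how close the t test and the chi-square test are for these data.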
Modeling Categorical Variables (“Dummy Variable” Coding or “Indicator Variable” Coding)

To use a nominal scale (unordered categories) or ordinal scale (ordered categories) variable in a regression model, we first convert it to 0-1 variables so that we meet the interval scale assumption.

First, let’s try it for a dichotomous variable scored as 1 and 2, which is how sex is scored in the births dataset,

    Statistics
    Linear models and related
    Linear regression
    Dependent variable: bweight
    Independent variables: sex
    OK

    regress bweight sex

          Source |       SS       df       MS              Number of obs =     500
    -------------+------------------------------           F(  1,   498) =   12.18
           Model |  4839398.61     1  4839398.61           Prob > F      =  0.0005
        Residual |   197926455   498   397442.68           R-squared     =  0.0239
    -------------+------------------------------           Adj R-squared =  0.0219
           Total |   202765853   499  406344.395           Root MSE      =  630.43

    ------------------------------------------------------------------------------
         bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             sex |   -197.071   56.47605    -3.49   0.001    -308.0317   -86.11032
           _cons |   3426.973   87.78347    39.04   0.000     3254.501    3599.444
    ------------------------------------------------------------------------------

Comparing this to a t test, using Stata menus,

    Statistics
    Summaries, tables & tests
    Classical tests of hypotheses
    Two-group mean-comparison test
    Variable name: bweight
    Group variable name: sex
    OK

    ttest bweight, by(sex)

t-test output (to be compared with the regression output above):

    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
           1 |     264    3229.902    38.99802    633.6428    3153.113     3306.69
           2 |     236    3032.831    40.80225     626.816    2952.446    3113.215
    ---------+--------------------------------------------------------------------
    combined |     500    3136.884     28.5077    637.4515    3080.874    3192.894
    ---------+--------------------------------------------------------------------
        diff |             197.071    56.47605                86.11032    308.0317
    ------------------------------------------------------------------------------
        diff = mean(1) - mean(2)                                      t =   3.4895
    Ho: diff = 0                                     degrees of freedom =      498

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.9997         Pr(|T| > |t|) = 0.0005          Pr(T > t) = 0.0003

We notice that the slope in the regression model has the same magnitude as the mean difference in the t test output, and the same standard error, t statistic and p value. The sign is different, but that is because the t test procedure subtracts group 2 from group 1 (subtracts the 2nd row from the 1st row), whereas the regression model subtracts group 1 from group 2 (change from left to right on the number line).

The intercept term is rather strange. It is an extrapolation out to a sex of 0 (1=male, 2=female), which is needed since the Y-intercept occurs at an X equal to 0.

It is okay to use a dichotomous 1-2 variable in linear regression, then, as long as you don’t plan to interpret the intercept term.

A more intuitive result comes from recoding the 1-2 variable into a 0-1 variable. These 0-1 variables are called dummy variables or indicator variables.
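To see where the odd-looking intercept for the 1-2 coding comes from, recall that with a predictor taking only two values the least-squares line passes through the two group means. The sketch below uses the group means copied from the t-test output; the scalar names are illustrative.

```stata
* Group means of bweight from the t-test output (sex: 1 = male, 2 = female)
scalar mean_male   = 3229.902
scalar mean_female = 3032.831

* Slope: change in mean bweight per one-unit change in sex (from 1 to 2)
scalar slope = (mean_female - mean_male)/(2 - 1)

* Intercept: the line extended back from sex = 1 to sex = 0
scalar intercept = mean_male - slope*1

display "slope     = " slope
display "intercept = " intercept
```

These should reproduce the sex coefficient of -197.071 and the _cons of 3426.973 from the regression output, showing that the intercept is the male mean pushed one further unit to the left, to a value of sex that does not exist in the data.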
The name indicator variable comes from the fact that a score of 1 indicates the presence of the attribute.

A natural naming convention, then, is to give the indicator variable the name of what it indicates.

    sex                                  male
    1 = male      ... recoded to ...     1 = male
    2 = female                           0 = female

    Data
    Create or change variable
    Other variable transformation commands
    Recode categorical variable
    Main tab: Variables: sex
              Required: (1=1)(2=0)
    Options tab: Generate new variables: male
    OK

    recode sex (1=1)(2=0), generate(male)

Checking our work,

    Statistics
    Summaries, tables & tests
    Tables
    Two-way tables with measures of association
    Row variable: sex
    Column variable: male
    Uncheck Test statistics: Pearson chi-squared
    Uncheck Cell contents: Within-column relative frequencies
    OK

    tabulate sex male

        sex of |         male
          baby |         0          1 |     Total
    -----------+----------------------+----------
             1 |         0        264 |       264
             2 |       236          0 |       236
    -----------+----------------------+----------
         Total |       236        264 |       500

We see that 1 stayed 1, and 2 went to 0, so we did it correctly.

Requesting the regression again, this time with male instead of sex

    Statistics
    Linear models and related
    Linear regression
    Dependent variable: bweight
    Independent variables: male
    OK

    regress bweight male

          Source |       SS       df       MS              Number of obs =     500
    -------------+------------------------------           F(  1,   498) =   12.18
           Model |  4839398.61     1  4839398.61           Prob > F      =  0.0005
        Residual |   197926455   498   397442.68           R-squared     =  0.0239
    -------------+------------------------------           Adj R-squared =  0.0219
           Total |   202765853   499  406344.395           Root MSE      =  630.43

    ------------------------------------------------------------------------------
         bweight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            male |    197.071   56.47605     3.49   0.001     86.11032    308.0317
           _cons |   3032.831   41.03753    73.90   0.000     2952.202    3113.459
    ------------------------------------------------------------------------------

Comparing this to the t test output,

    Statistics
    Summaries, tables & tests
    Classical tests of hypotheses
    Group mean comparison test
    Variable name: bweight
    Group variable name: male
    OK

    ttest bweight, by(male)

    Two-sample t test with equal variances
    ------------------------------------------------------------------------------
       Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
    ---------+--------------------------------------------------------------------
           0 |     236    3032.831    40.80225     626.816    2952.446    3113.215
           1 |     264    3229.902    38.99802    633.6428    3153.113     3306.69
    ---------+--------------------------------------------------------------------
    combined |     500    3136.884     28.5077    637.4515    3080.874    3192.894
    ---------+--------------------------------------------------------------------
        diff |            -197.071    56.47605               -308.0317   -86.11032
    ------------------------------------------------------------------------------
        diff = mean(0) - mean(1)                                      t =  -3.4895
    Ho: diff = 0                                     degrees of freedom =      498

        Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
     Pr(T < t) = 0.0003         Pr(|T| > |t|) = 0.0005          Pr(T > t) = 0.9997

We see that the linear regression constant term, 3032.831, now correctly represents the mean female birthweight, which is the mean for the group coded 0.

References

Altman DG. (1991). Practical Statistics for Medical Research. New York, Chapman & Hall/CRC.

Cohen J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin 70:426-443.

D’Agostino RB. (1972). Relation between the chi-squared and ANOVA tests for testing the equality of k independent dichotomous populations. The American Statistician 26(3):30-32.

Nunnally JC, Bernstein IH. (1994). Psychometric Theory. 3rd ed. New York, McGraw-Hill.

Sarle WS. (1997). Measurement theory: frequently asked questions. Version 3, Sep 14. URL: ftp://ftp.sas.com/pub/neural/measurement.html