Ph.D. COURSE IN BIOSTATISTICS
DAY 5

REGRESSION ANALYSIS

How do we describe and analyze the relationship or association between two quantitative variables?

Example: Relationship between height and pefr in 43 females and 58 males. Data from Bland, Table 11.4 (pefr.dta).

[Figure: scatter plot of pefr against height, with separate symbols for females and males.]

This type of data arises in two situations:

Situation 1: The data are a random sample of pairs of observations. In the example, both pefr and height are measured (observed) quantities, i.e. random variables, and neither variable is controlled by the investigator.

Situation 2: One of the variables is controlled by the investigator, and the other is subject to random variation. For example, in a dose-response experiment the dose is usually controlled by the investigator and the response is the measured quantity (random variable).

Purpose in both cases: To describe how the response (pefr) varies with the explanatory variable (height).

Note: A regression analysis is not symmetric in the two variables.

Terminology:
x = independent/explanatory variable (e.g. dose)
y = dependent/response variable
sex = grouping variable

Linear relationship
In the mathematical sense the simplest relationship between y and x is a straight line, i.e.

$$y = \alpha + \beta x$$

Statistical model
In the statistical sense this corresponds to the model

$$y = \alpha + \beta x + E$$

where $E$ represents the random variation around the straight line.

Example: does the description depend on sex?

Random variation
The random variation reflects several sources of variation: (1) measurement error, (2) biological (inter-individual) variation and (3) deviations in the relationship from a straight line.

In a linear regression analysis the combined contribution from these sources is described as independent "errors" from a normal distribution, $E \sim N(0, \sigma^2)$.
Statistical model
The data consist of pairs of observations $(x_i, y_i)$, $i = 1, \ldots, n$, and the statistical model takes the form

$$y_i = \alpha + \beta x_i + E_i, \qquad E_i \sim N(0, \sigma^2), \qquad i = 1, \ldots, n$$

where the $E_i$'s (or equivalently the $y_i$'s) are independent.

Unknown parameters
The model has 3 unknown parameters:
$\alpha$: intercept
$\beta$: slope
$\sigma^2$: (residual) variance

Example: do the parameters depend on sex?

Estimation
A linear regression can be performed by most statistical software and spreadsheets. The estimates of $\alpha$ and $\beta$ are obtained by the method of least squares, i.e. by minimizing the residual sum of squares

$$RSS = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$

Solution:

$$\hat{\beta} = \frac{\sum (y_i - \bar{y})(x_i - \bar{x})}{\sum (x_i - \bar{x})^2}, \qquad \hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}, \qquad \hat{\sigma}^2 = \frac{RSS}{n - 2}$$

In Stata the command is (regression for each group, only females shown):

    regress pefr height if sex==1

    -> sex = Female
      Source |       SS       df       MS             Number of obs =      43
    ---------+------------------------------          F(  1,    41) =    5.65
       Model |  12251.4221     1  12251.4221          Prob > F      =  0.0222
    Residual |  88856.2222    41  2167.22493          R-squared     =  0.1212
    ---------+------------------------------          Adj R-squared =  0.0997
       Total |  101107.644    42  2407.32487          Root MSE      =  46.553

    --------------------------------------------------------------------
        pefr |     Coef.   Std. Err.      t    P>|t|  [95% Conf. Interval]
    ---------+----------------------------------------------------------
      height |  2.912188   1.224836    2.38   0.022   .4385803   5.385795
       _cons | -9.170501   203.3699   -0.05   0.964  -419.8843   401.5433
    --------------------------------------------------------------------

In this output the residual mean square is $\hat{\sigma}^2$, Root MSE is $\hat{\sigma}$, the coefficient of height is $\hat{\beta}$ and the coefficient of _cons is $\hat{\alpha}$.

Note: $\hat{\alpha} = \bar{y} - \hat{\beta}\bar{x}$

Estimated regression line:

$$y = \hat{\alpha} + \hat{\beta} x = \bar{y} + \hat{\beta}(x - \bar{x})$$

The line passes through $(\bar{x}, \bar{y})$ with slope $\hat{\beta}$.

The sampling distribution of the estimates:

$$\hat{\alpha} \sim N\left(\alpha,\ \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}\right]\right)$$

$$\hat{\beta} \sim N\left(\beta,\ \frac{\sigma^2}{\sum (x_i - \bar{x})^2}\right)$$

$$\hat{\sigma}^2 \sim \sigma^2 \, \frac{\chi^2_{n-2}}{n - 2}$$

But note: $\hat{\alpha}$ and $\hat{\beta}$ are not independent estimates.

In the coefficient table above, the columns t and P>|t| give the t-tests and p-values of the hypotheses slope = 0 (top row) and intercept = 0 (bottom row). The last two columns give the confidence intervals for the parameters.

Test and confidence intervals
Stata gives a t-test of the hypothesis $\beta = 0$ and a t-test of the hypothesis $\alpha = 0$. The test statistics are computed as

$$t = \frac{\hat{\beta} - 0}{se(\hat{\beta})} \qquad \text{and} \qquad t = \frac{\hat{\alpha} - 0}{se(\hat{\alpha})}$$

These test statistics have a t-distribution with $n - 2$ degrees of freedom if the corresponding hypothesis is true.

The standard errors of the estimates are obtained from the sampling distribution by replacing the population variance $\sigma^2$ by the estimate $\hat{\sigma}^2$.

95% confidence intervals for the parameters are derived as in lecture 2, e.g. as

$$\hat{\beta} \pm t_{n-2}^{0.975} \cdot se(\hat{\beta})$$

where $t_{n-2}^{0.975}$ is the upper 97.5 percentile of a t-distribution with $n - 2$ degrees of freedom.

After the regress command other hypothesized values of the parameters can be assessed directly by

    test height = 2.5

     ( 1)  height = 2.5

           F(  1,    41) =    0.11
                Prob > F =    0.7382

Note: $F = t^2$.

Interpretation of the parameters
Intercept ($\alpha$): the expected pefr when height = 0, which makes no biological sense. For this reason the reference point on the x-axis is sometimes changed to a more meaningful value, e.g. x = height − 170.
Physical unit of intercept: as y, i.e. as pefr (litre/minute).

Slope ($\beta$): the expected difference in pefr between two (female) students A and B, where A is 1 cm taller than B.
Physical unit of slope: as y/x, i.e. as pefr/height (litre/minute/cm).

Standard deviation ($\sigma$): the standard deviation of the random variation around the regression line. Approximately 2/3 of the data points are within one standard deviation from the line. The estimate is often called the root mean square error.
Physical unit of standard deviation: as y, i.e. as pefr (litre/minute).
Change of units: If height in the example is measured in meters, the slope becomes $100 \cdot \hat{\beta} = 100 \cdot 2.91 = 291$ (litre/minute/meter).

Fitted value
For the i'th observation the fitted value (expected value) is

$$\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i = \bar{y} + \hat{\beta}(x_i - \bar{x})$$

Residual
The residual is the difference between the observed value and the fitted value:

$$r_i = y_i - \hat{y}_i$$

[Figure: scatter plot of pefr against height with the linear prediction (fitted line); an observation $(x_i, y_i)$ is marked.]

Checking the model assumptions
1. Look at the scatter plot of y against x. The model assumes a linear trend.
2. If the model is correct, the residuals have mean zero and approximately constant variance. Plot the residuals ($r$) against the fitted values ($\hat{y}$) or the explanatory variable x. The plot must not show any systematic structure, and the residuals must have approximately constant variation around zero.
3. The residuals represent estimated errors. Use a histogram and/or a Q-Q plot to check if the distribution is approximately normal.

Note: A Q-Q plot of the observed outcomes, the $y_i$'s, cannot be used to check the assumption of normality, since the $y_i$'s do not follow the same normal distribution (the mean depends on $x_i$).
The explanatory variable, the $x_i$'s, is not required to follow a normal distribution.

Stata: predicted values and residuals are obtained using two predict commands after the regress command:

    regress pefr height
    predict yhat, xb           (yhat is the name of a new variable)
    predict res, residuals     (res is the name of a new variable)

Plots for females
[Figure: residuals plotted against the linear prediction (left) and Q-Q plot of the residuals against the inverse normal (right).]
Both plots look OK!

Example: Non-linear regression
[Figure: scatter plot of y against x, residuals plotted against x, and Q-Q plot of the residuals.]
Note: The non-linear relationship between y and x is most easily seen from the plot of the residuals against x.
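The note on the non-linear example can be illustrated with a small sketch. It fits a straight line by the least-squares formulas to data that are in fact quadratic and prints the residuals. The data and the helper name `fit_line` are made up for this illustration and are not taken from the lecture's data sets.

```python
# Least-squares fit of a straight line to data that follow a curve.
# Synthetic data (an assumption for illustration), not from pefr.dta.

def fit_line(x, y):
    """Return (alpha_hat, beta_hat) from the least-squares formulas."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = sxy / sxx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

x = list(range(0, 31, 5))        # 0, 5, ..., 30
y = [xi ** 2 for xi in x]        # curved relationship, no random error

alpha_hat, beta_hat = fit_line(x, y)
fitted = [alpha_hat + beta_hat * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Residuals are positive at both ends and negative in the middle:
# a systematic structure that a correct linear model would not show.
for xi, ri in zip(x, residuals):
    print(f"x = {xi:2d}   residual = {ri:8.2f}")
```

The residuals always sum to zero for a least-squares line with an intercept, so it is the pattern against x, not the average, that reveals the lack of fit.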
Example: Variance heterogeneity
[Figure: scatter plot of y against x, residuals plotted against x, and Q-Q plot of the residuals.]
Note: Again, the fact that the variance increases with x is most easily seen from the plot of the residuals against x.

Regression models can serve several purposes:
1. Description of a relationship between two variables
2. Calibration
3. Confounder control and related problems, e.g. to describe the relationship between two variables after adjusting for one or several other variables
4. Prediction

Re 1. In the example about pefr and height we found a linear relationship, and the regression analysis identified the parameters of the "best" line as $y = \hat{\alpha} + \hat{\beta} x$.

Re 2. Example: much modern laboratory equipment does not measure the concentrations in your samples directly, but uses built-in regression techniques to calibrate the measurements against known standards.

Re 3. Example: Describe the relationship between birth weight and smoking habit when adjusting for parity and gestational age. This is a regression problem with multiple explanatory variables (multiple linear regression or analysis of covariance).

Example (test of no effect modification): In the data on pefr and height we may want to compare the relationship for males with that for females, i.e. assess whether sex is an effect modifier of this relationship.

The hypothesis of no effect modification is $\beta_{female} = \beta_{male}$, i.e. that the two regression lines are parallel.

A simple test of this hypothesis can be derived from the estimates of the two separate regression analyses. We have

    Group        Slope   Std. Err.
    Female    2.912188    1.224836
    Male      3.966202    1.227104

and an approximately standard normal test statistic is

$$z = \frac{\hat{\beta}_{female} - \hat{\beta}_{male}}{s.e.(\hat{\beta}_{female} - \hat{\beta}_{male})} = \frac{\hat{\beta}_{female} - \hat{\beta}_{male}}{\sqrt{s.e.^2(\hat{\beta}_{female}) + s.e.^2(\hat{\beta}_{male})}}$$

Inserting the values gives z = −0.608, i.e. p-value = 0.543. The slopes do not seem to be different.

Re 4. Example: Predicting the expected outcome for a specified x-value, e.g. predicting pefr for a female with height = 175 cm:

Stata:

    lincom _cons+height*175

     ( 1)  175 height + _cons = 0
    ---------------------------------------------------------------
      pefr |     Coef.   Std. Err.      t    P>|t|  [95% Conf. Interval]
    -------+-------------------------------------------------------
       (1) |  500.4623    13.1765   37.98   0.000   473.8518   527.0728
    ---------------------------------------------------------------

The t-test assesses the hypothesis that pefr = 0 for a 175 cm tall female (nonsense in this case). To test the hypothesis that pefr is e.g. 400, write

    lincom _cons+height*175-400

Note: Prediction using x-values outside the range of observed x-values (extrapolation) should in general be avoided.

DECOMPOSITION OF THE TOTAL VARIATION
If we ignore the explanatory variable, the total variation of the response variable y is the adjusted sum of squares (corrected total)

$$SS_{Total} = \sum (y_i - \bar{y})^2$$

When the explanatory variable x is included in the analysis we may ask: How much of the variation in y is explained by the variation in x? That is: How large would the variation in pefr be if all persons had the same height?
The deviation of an observation from the overall mean is split into a residual and the deviation of the fitted value from the overall mean:

$$y_i - \bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})$$

The sums of squares decompose in the same way:

$$SS_{Total} = \sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2 = SS_{Residual} + SS_{Model}$$

$SS_{Residual}$: variation about the regression = Residual
$SS_{Model}$: variation explained by the regression = Model

The degrees of freedom are decomposed in a similar way:

$$f_{Tot} = f_{Res} + f_{Mod}, \qquad n - 1 = (n - 2) + 1$$

Stata: All this appears in the analysis of variance table in the output from the regress command (MS = mean square = SS/df):

    -> sex = Female
      Source |       SS       df       MS             Number of obs =      43
    ---------+------------------------------          F(  1,    41) =    5.65
       Model |  12251.4221     1  12251.4221          Prob > F      =  0.0222
    Residual |  88856.2222    41  2167.22493          R-squared     =  0.1212
    ---------+------------------------------          Adj R-squared =  0.0997
       Total |  101107.644    42  2407.32487          Root MSE      =  46.553

The model and residual mean squares are two independent variance estimates. If the slope is 0, they both estimate the population variance $\sigma^2$.

The F-test of the hypothesis $\beta = 0$
Intuitively, if the ratio $SS_{Mod} / SS_{Res}$ is large, the model explains a large part of the variation and the slope must therefore differ from zero. This is formalized in the test statistic

$$F = MS_{Mod} / MS_{Res}$$

which follows an F-distribution (Lecture 2, page 44) if the hypothesis is true. Large values lead to rejection of the hypothesis.

Note: $F = 5.65 = 2.38^2 = t^2$, where t is the t-statistic for the hypothesis $\beta = 0$.

R-squared as a measure of explained variation
The total variation is reduced from 101107.644 to 88856.2222, i.e. the reduction is 12.12% or 0.1212, which is found in the right panel of the output as the R-squared value.

Adj R-squared is a similar measure of explained variation, but computed from the mean squares.

R-squared is also called the "coefficient of determination".
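As a check on the arithmetic, the summary statistics in the right panel can be recomputed from the sums of squares in the analysis of variance table. The sketch below only uses numbers copied from the output shown in these notes; the variable names are chosen for this illustration.

```python
import math

# Sums of squares and degrees of freedom copied from the regress output.
ss_model, df_model = 12251.4221, 1
ss_resid, df_resid = 88856.2222, 41
ss_total, df_total = ss_model + ss_resid, df_model + df_resid

ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid      # estimate of sigma^2
ms_total = ss_total / df_total

f_stat = ms_model / ms_resid                 # F = MS_Mod / MS_Res
r_squared = ss_model / ss_total              # explained share of SS_Total
adj_r_squared = 1 - ms_resid / ms_total      # same idea, using mean squares
root_mse = math.sqrt(ms_resid)               # estimate of sigma

# Slope and standard error for females, from the coefficient table.
t_slope = 2.912188 / 1.224836

print(f"F             = {f_stat:.2f}")
print(f"t^2           = {t_slope ** 2:.2f}")
print(f"R-squared     = {r_squared:.4f}")
print(f"Adj R-squared = {adj_r_squared:.4f}")
print(f"Root MSE      = {root_mse:.3f}")
```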
THE CORRELATION COEFFICIENT
A linear regression describes the relationship between two variables, but not the "strength" of this relation. The correlation coefficient is a measure of the strength of a linear relation.

Example: (fishoil.dta) Fish oil trial (see: day 2, page 11).
What is the relationship between the change in diastolic and in systolic blood pressure in the fish oil group?

[Figure: scatter plot of difsys against difdia.]

Use a linear regression analysis? There is no obvious choice of response. The problem is symmetric.

Here the sample correlation coefficient may be a more useful way to summarize the strength of the linear relationship between the two variables.

Pearson's correlation coefficient

$$r = r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$$

Basic properties of the correlation coefficient:
• $-1 \le r_{xy} \le 1$
• $r_{xy} = r_{yx}$, i.e. symmetric in x and y
• $r_{xy} \approx 0$ if x and y are independent
• $r_{xy} = \pm 1$ if the observations lie exactly on a straight line with positive/negative slope
• Change of origin and/or scale of x and/or y will not change the size of r (the sign is changed if the ordering is reversed)

Stata:

    correlate difsys difdia if grp==2

                 |   difsys   difdia
    -------------+------------------
          difsys |   1.0000
          difdia |   0.5911   1.0000

The correlation is positive, indicating a positive linear relationship.

The sample correlation coefficient r is an estimate of the population correlation coefficient $\rho$.

A test of the hypothesis $\rho = 0$ is identical to the t-test of the hypothesis $\beta = 0$. It can be shown that

$$t = \frac{\hat{\beta} - 0}{se(\hat{\beta})} = r \sqrt{\frac{n - 2}{1 - r^2}}$$

Stata: The command

    pwcorr difsys difdia, sig

gives the correlation coefficient and the p-value of this test.

For a linear regression: $r^2$ = R-squared = explained variation.

Use of correlation coefficients
Correlations are popular, but what do they tell us about the data?

Note: The correlation coefficient only measures the linear relationship.

[Figure: four scatter plots with r = 0.07, r = 0.85, r = 0.0 and r = 0.85.]

Conclusion: Always make a plot of the data!
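The point that r only measures the linear part of a relationship can be illustrated with a short sketch. It computes Pearson's r from the defining formula for an exact but curved relationship, for an exact straight line, and after a change of origin and scale. The data and the function name `pearson_r` are made up for this illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient from the defining formula."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y_curve = [xi ** 2 for xi in x]       # y is determined by x, but not linearly
y_line = [2 * xi + 1 for xi in x]     # exact straight line, positive slope

r_curve = pearson_r(x, y_curve)       # no linear component: r = 0
r_line = pearson_r(x, y_line)         # exact line: r = 1
r_rescaled = pearson_r([100 * xi + 5 for xi in x], y_line)  # new origin and scale
r_swapped = pearson_r(y_line, x)      # symmetric in x and y

print(r_curve, r_line, r_rescaled, r_swapped)
```

A value of r near zero therefore does not rule out a strong relationship, which is one reason to plot the data.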
Misuse of correlation coefficients
In general: A correlation should primarily be used to evaluate the association between two variables when the setting is truly symmetric.

The following examples illustrate misuse, or rather misinterpretation, of correlation coefficients.

Comparison of two measurement methods
Two studies, each comparing two methods of measuring heights of men. In both studies 10 men were measured twice, once with each method.

In such studies a correlation coefficient is often used to quantify the agreement (or disagreement) between the methods. This is a bad idea!

Example 1
[Figure: two scatter plots of measured heights (cm). Left panel: method 2 against method 1, n=10, r=0.9 (p<0.001). Right panel: method 4 against method 3, n=10, r=0.8 (p=0.005).]

Higher correlation in left panel. Is a higher correlation evidence of a better agreement? No, this is wrong!

A difference vs. average plot reveals that there is a large disagreement between method 1 and 2:

[Figure: difference against average plots. Left panel: method 1 − method 2 against (method 1 + method 2)/2, annotated 5.6 cm. Right panel: method 3 − method 4 against (method 3 + method 4)/2, annotated 0.2 cm.]

Compare the average disagreement between the two methods!

Note: The correlation coefficient does not give you any information on whether or not the observations are located around the line x = y, i.e. whether or not the methods show any systematic disagreement.

Example 2: Two other studies. The same basic set-up.
[Figure: two scatter plots of measured heights (cm). Left panel: method 2 against method 1, n=10, r=0.9 (p<0.001). Right panel: method 4 against method 3, n=10, r=0.8 (p=0.005).]

The plots show:
• No systematic disagreement (points are located around the line x = y).
• Correlation coefficient in left panel (method 1 vs 2) larger than correlation coefficient in right panel (method 3 vs 4).

Better agreement between method 1 and 2 than between method 3 and 4?
The answer is: No!

[Figure: difference against average plots. Left panel: method 1 − method 2 against (method 1 + method 2)/2, annotated s.d. = 2.8 cm. Right panel: method 3 − method 4 against (method 3 + method 4)/2, annotated s.d. = 1.6 cm.]

Compare the standard deviations of the differences (Limits of agreement = 2 x s.d., see Lecture 2, p. 29).

Note: The correlation is larger between method 1 and 2 because the variation in heights is larger in this study.

The correlation coefficient says more about the persons than about the measurement methods!

NON-PARAMETRIC METHODS FOR TWO-SAMPLE PROBLEMS
Non-parametric methods, or distribution-free methods, are a class of statistical methods which do not require a particular parametric form of the population distribution.

Advantages: Non-parametric methods are based on fewer and weaker assumptions and can therefore be applied to a wider range of situations.

Disadvantages: Non-parametric methods are mainly statistical tests. Use of these methods may therefore overemphasize significance testing, which is only a part of a statistical analysis.

Non-parametric tests do not depend on the observed values in the sample(s), but only on the ordering or ranking.

The non-parametric methods can therefore also be applied in situations where the outcome is measured on some ordinal scale, e.g. a complication registered as –, +, ++, or +++.

A large number of different non-parametric tests have been developed. Here only a few simple tests in widespread use will be discussed.

TWO INDEPENDENT SAMPLES: WILCOXON-MANN-WHITNEY RANK SUM TEST

Illustration of the basic idea
Consider a small experiment with 5 observations from two groups, active treatment and control: $x_1, x_2$ in one group and $y_1, y_2, y_3$ in the other.

Hypothesis of interest: the same distribution in the two samples, i.e. no effect of active treatment.

For the data values 15, 26, 14, 31, 21 (in arbitrary order) there are 120 (= 5!) different ways to allocate these five values to $x_1, x_2, y_1, y_2, y_3$.
Each allocation is characterized by the ordering of the units. Each ordering is equally likely if the hypothesis is true.

An ordering is determined by the ranks of the observations. If e.g.

$$x_2 = 14 < y_3 = 15 < x_1 = 21 < y_2 = 26 < y_1 = 31$$

then

$$rank(x_1) = 3,\ rank(x_2) = 1,\ rank(y_1) = 5,\ rank(y_2) = 4,\ rank(y_3) = 2$$

Basic idea: Compute the sum of the ranks in the treatment group. If this sum is large or small, the hypothesis is not supported by the data.

There are $\binom{5}{3} = \frac{5 \cdot 4 \cdot 3}{3 \cdot 2 \cdot 1} = 10$ different combinations of ranks for the 3 observations in the treatment group. Under the hypothesis each of these is equally likely (i.e. has probability 0.10).

    ranks    sum      ranks    sum
    1,2,3      6      1,4,5     10
    1,2,4      7      2,3,4      9
    1,2,5      8      2,3,5     10
    1,3,4      8      2,4,5     11
    1,3,5      9      3,4,5     12

The observed configuration is ranks 2, 4, 5 with sum 11.

[Figure: bar chart of the probability of each value (6 to 12) of the sum of ranks in the treatment group, with the observed value marked.]

We have p-value = 4 · 0.1 = 0.4, since the sums 6, 7, 11 and 12 are at least as extreme as the observed value.

Note: The distribution is symmetric.

General case
Data: Two samples of independent observations
Group 1: $x_1, x_2, \ldots, x_{n_1}$ from a population with distribution function $F_X$
Group 2: $y_1, y_2, \ldots, y_{n_2}$ from a population with distribution function $F_Y$

Let $N = n_1 + n_2$ denote the total number of observations.

Hypothesis: The x's and the y's are observations from the same (continuous) distribution, i.e. $F_X = F_Y$.

The alternatives of special interest: the y's are shifted upwards (or downwards).

Test statistic (Wilcoxon's rank-sum test):
$T_1$ = sum of ranks in group 1, or
$T_2$ = sum of ranks in group 2

A two-sided test will reject the hypothesis for large or small values of $T_1$ (or $T_2$).

Note: The two test statistics are equivalent since

$$T_2 = \frac{N(N + 1)}{2} - T_1$$

Some properties of the test statistic
If the hypothesis is true, the distribution of the test statistic is completely specified.
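For the five-observation illustration, the null distribution of the rank sum can be enumerated directly. The sketch below lists all combinations of 3 ranks out of 5 and computes the two-sided p-value for the observed rank sum 11; the variable names are chosen for this illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations

ranks = range(1, 6)          # ranks 1..5 of the five observations
group_size = 3               # observations whose ranks are summed
observed_sum = 5 + 4 + 2     # ranks of y1, y2, y3 in the ordering above

# Under the hypothesis every combination of ranks is equally likely.
sums = [sum(c) for c in combinations(ranks, group_size)]
total = len(sums)
counts = Counter(sums)
mean_sum = Fraction(sum(sums), total)

# Two-sided p-value: combinations at least as far from the mean
# as the observed rank sum.
extreme = sum(1 for s in sums
              if abs(s - mean_sum) >= abs(observed_sum - mean_sum))
p_value = Fraction(extreme, total)

for s in sorted(counts):
    print(f"sum = {s:2d}   probability = {counts[s] / total:.1f}")
print("two-sided p-value =", float(p_value))
```

Full enumeration is only practical for small samples, since the number of combinations grows quickly with the sample sizes.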
In particular, the distribution is symmetric and we have

$$n_1 (n_1 + 1)/2 \le T_1 \le n_1 (N + n_2 + 1)/2$$
$$n_2 (n_2 + 1)/2 \le T_2 \le n_2 (N + n_1 + 1)/2$$

Moreover, the mean and the variance are given by

$$E(T_1) = n_1 (N + 1)/2 \qquad E(T_2) = n_2 (N + 1)/2$$
$$Var(T_1) = Var(T_2) = n_1 n_2 (N + 1)/12$$

The formula for the variance is only valid if all observations are distinct.

If the data contain tied observations, i.e. observations taking the same value, then midranks, computed as the average value of the relevant ranks, are used. The variance is then smaller and a correction is necessary. The general variance formula becomes

$$Var(T_1) = Var(T_2) = \frac{n_1 n_2 (N + 1)}{12} \left[ 1 - \frac{1}{N^3 - N} \sum_{\text{sets of ties}} (k_i^3 - k_i) \right]$$

where $k_i$ = number of identical observations in the i'th set of tied values.

Finding the p-value
The exact distribution of the rank-sum statistic under the hypothesis is rather complicated, but is tabulated for small sample sizes, see e.g. Armitage, Berry & Matthews, Table A7 or Altman, Table B10.

Note: These tables are appropriate for untied data only. The p-value will be too large if the tables are used for tied data.

For larger sample sizes (e.g. N > 30) the distribution of the rank-sum statistic is usually approximated by a normal distribution with the same mean and variance, i.e. the test statistic

$$z = \frac{T_1 - E(T_1)}{\sqrt{Var(T_1)}}$$

is approximately a standard normal variate if the hypothesis is true.

Some programs (and textbooks) use a continuity correction, and the test statistic then becomes

$$z = \frac{|T_1 - E(T_1)| - 0.5}{\sqrt{Var(T_1)}}$$

Rank-sum test with Stata
Example. In the lectures day 2 we used a t-test to compare the change in diastolic blood pressure in pregnant women who were allocated to either supplementary fish oil or a control group.
The analogous non-parametric test is computed by the commands

    use fishoil.dta
    ranksum difdia, by(grp)

    Two-sample Wilcoxon rank-sum (Mann-Whitney) test

             grp |      obs    rank sum    expected
    -------------+---------------------------------
         control |      213       44953     45901.5
        fish oil |      217       47712     46763.5
    -------------+---------------------------------
        combined |      430       92665       92665

    unadjusted variance   1660104.25
    adjustment for ties     -3237.25
                          ----------
    adjusted variance     1656867.00

    Ho: difdia(grp==control) = difdia(grp==fish oil)
                 z =  -0.737
        Prob > |z| =   0.4612

Prob > |z| is the two-sided p-value. Stata computes the approximate standard normal variate without a continuity correction.

The rank-sum test can also be used to analyse a 2×C table with ordered categories.

In Lecture 4 (page 42) first parity births in skejby-cohort.dta were cross-classified according to mother's smoking habits and year of birth. To evaluate if the prevalence of smoking has changed, we use a rank-sum test to compare the distribution of birth year among smokers and non-smokers.

    ranksum year if parity==0, by(mtobacco)

gives

        mtobacco |      obs    rank sum    expected
    -------------+---------------------------------
          smoker |     1311     3473225     3527901
       nonsmoker |     4070    11007046    10952370
    -------------+---------------------------------
        combined |     5381    14480271    14480271

    unadjusted variance    2.393e+09
    adjustment for ties   -2.669e+08
                          ----------
    adjusted variance      2.126e+09

    Ho: year(mtobacco==smoker) = year(mtobacco==nonsmoker)
                 z =  -1.186
        Prob > |z| =   0.2357

Mann-Whitney's U test
Some statistical program packages compute a closely related test statistic, Mann-Whitney's U. This test is equivalent to the Wilcoxon rank-sum test, but is derived by a different argument.

Basic idea: Consider all pairs of observations (x, y) with one observation from each sample. Let
$U_{XY}$ = number of pairs with x < y
$U_{YX}$ = number of pairs with y < x
A pair with x = y is counted as ½ in both sums.

Extreme values of these test statistics suggest that the hypothesis is not supported by the data.
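The pair counts can be computed directly. The sketch below does this for the five data values of the earlier illustration (x = 21, 14 and y = 31, 26, 15) and also prints the two rank sums for comparison. The function names are made up for this illustration, and the simple ranking helper assumes that there are no ties.

```python
def u_counts(x, y):
    """Return (U_XY, U_YX); a pair with x == y counts 1/2 in both."""
    u_xy = u_yx = 0.0
    for xi in x:
        for yj in y:
            if xi < yj:
                u_xy += 1
            elif yj < xi:
                u_yx += 1
            else:
                u_xy += 0.5
                u_yx += 0.5
    return u_xy, u_yx

def rank_sums(x, y):
    """Rank sums (T1, T2) in the combined sample, assuming no ties."""
    combined = sorted(x + y)
    rank = {value: i + 1 for i, value in enumerate(combined)}
    return sum(rank[v] for v in x), sum(rank[v] for v in y)

x = [21, 14]          # x1, x2 from the illustration
y = [31, 26, 15]      # y1, y2, y3 from the illustration

u_xy, u_yx = u_counts(x, y)
t1, t2 = rank_sums(x, y)

print("U_XY =", u_xy, " U_YX =", u_yx)   # together they cover all n1*n2 pairs
print("T1 =", t1, " T2 =", t2)
```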
One may show that

$$U_{YX} = T_1 - n_1 (n_1 + 1)/2 \qquad U_{XY} = T_2 - n_2 (n_2 + 1)/2$$

The distributions of these test statistics are therefore simple translations of the distribution of the rank sum, and the same p-value is obtained.

General comments on the rank-sum test
For comparison of two independent samples the rank-sum test is a robust alternative to the t-test.

For detecting a shift in location the rank-sum test is never much less sensitive than the t-test, but may be much better if the distribution is far from a normal distribution.

The rank-sum test is not well suited for comparison of two populations which differ in spread, but have essentially the same mean.

Non-parametric methods are primarily statistical tests. For the shift in location situation, i.e. when X is distributed as $Y + \delta$, where $\delta$ is the unknown shift, we may estimate the shift parameter as the median of the $n_1 n_2$ differences between one observation from each sample, and a confidence interval for the shift parameter can then be obtained from the rank-sum test. This procedure is not included in Stata.

Note: A monotonic transformation of the data, e.g. by a logarithm, has no impact on the value of the rank-sum statistic.

TWO PAIRED SAMPLES: WILCOXON'S SIGNED RANK-SUM TEST
Basic problem: Analysis of paired data without assuming normality of the variation.

Data: A sample of n pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ of observations.

Question: Does the distribution of the x's differ from the distribution of the y's?

Preliminary model considerations: For a pair of observations we may write

$$x = \mu_x + e_1 \qquad y = \mu_y + e_2$$

where $\mu_x$ and $\mu_y$ represent the expected response of x and y, and where $e_1$ and $e_2$ are error terms.

Assume: Error terms from different pairs are independent and follow the same distribution.

If the error terms $e_1$ and $e_2$ follow the same distribution, then the difference $d = y - x$ has a symmetric distribution with median (and mean) $\delta = \mu_y - \mu_x$.
Statistical model: The n differences $d_1, d_2, \ldots, d_n$ are regarded as a random sample from a symmetric distribution F with median $\delta$.

Estimation: The population median $\delta$ is estimated by the sample median.

Hypothesis: The x's and the y's have the same distribution, or equivalently $\delta = 0$.

The sign test
A simple test statistic is based on the signs of the differences. If the median is 0, positive and negative differences should be equally likely, and the number of positive differences therefore follows a binomial distribution with p = 0.5.

If some differences are zero, the sample size is reduced accordingly.

Stata:

    signtest hgoral=craft

Wilcoxon's signed rank-sum test
The sign test utilizes only the sign of the differences, not their magnitude. A more powerful test is available if both sign and size of the differences are taken into account.

Basic idea: Sort the differences in ascending order of their absolute value (i.e. ignoring the sign of the differences). Use the sum of the ranks of the positive differences as the test statistic.

Wilcoxon's signed rank-sum test:
$T_+$ = sum of ranks of positive differences, when the differences are ranked in ascending order according to absolute value.

Alternatively, $T_-$, defined analogously, can be used. The two test statistics are equivalent.

Basic properties: With no ties and zeroes present in the sample of differences, the test statistic has a symmetric distribution and

$$0 \le T_+ \le n (n + 1)/2$$
$$E(T_+) = n (n + 1)/4$$
$$Var(T_+) = n (n + 1)(2n + 1)/24$$

Ties and zeroes among differences
Midranks are used if some of the differences have the same absolute value, i.e. these differences are given the average value of the ranks that would otherwise apply.

Differences that are equal to zero are not included in any of the test statistics.

A formula for the variance corrected for ties and zeroes exists and is used by Stata.

Zeroes are usually accounted for by ignoring these differences and reducing the sample size accordingly.
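The definition of $T_+$ and its mean and variance under the hypothesis can be sketched as follows. The function names and the differences are made up for this illustration. The function drops zero differences and uses midranks for tied absolute values, while the mean and variance formulas are the ones for data without ties and zeroes.

```python
import math

def signed_rank_t_plus(differences):
    """Return (T+, n): rank sum of positive differences and number of non-zero differences."""
    nonzero = sorted((d for d in differences if d != 0), key=abs)
    n = len(nonzero)
    t_plus = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and abs(nonzero[j]) == abs(nonzero[i]):
            j += 1                      # nonzero[i:j] share the same absolute value
        midrank = (i + 1 + j) / 2       # average of the ranks i+1, ..., j
        t_plus += midrank * sum(1 for d in nonzero[i:j] if d > 0)
        i = j
    return t_plus, n

def null_mean(n):
    return n * (n + 1) / 4

def null_variance(n):
    return n * (n + 1) * (2 * n + 1) / 24

d = [3, -1, 4, -2, 5, 7]                # hypothetical differences, no ties or zeroes
t_plus, n = signed_rank_t_plus(d)
z = (t_plus - null_mean(n)) / math.sqrt(null_variance(n))

print("T+ =", t_plus, " E(T+) =", null_mean(n), " Var(T+) =", null_variance(n))
print("z  =", round(z, 3))
print("n = 20: E(T+) =", null_mean(20), " Var(T+) =", null_variance(20))
```

With only 6 differences the normal approximation is not appropriate; z is computed here only to show the formula.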
Finding the p-value
The exact distribution of Wilcoxon's signed rank-sum statistic under the hypothesis is tabulated for small sample sizes (n ≤ 25), see e.g. Armitage, Berry & Matthews, Table A6 or Altman, Table B9.

Note: These tables are appropriate for untied data only. The p-value will be too large if the tables are used for data with ties.

Normal approximation
For larger sample sizes (n > 25) the distribution of the test statistic is approximated by a normal distribution with the same mean and variance, i.e. the test statistic

$$z = \frac{T_+ - E(T_+)}{\sqrt{Var(T_+)}}$$

is approximately a standard normal variate if the hypothesis is true. Stata computes this test statistic using a variance estimate that allows for ties and zeroes.

Some programs (and textbooks) use a continuity correction, and the test statistic then becomes

$$z = \frac{|T_+ - E(T_+)| - 0.5}{\sqrt{Var(T_+)}}$$

The continuity correction has little or no effect even for moderate sample sizes and can safely be ignored.

Wilcoxon's signed rank-sum test with Stata
Example. In the lectures day 3 we used a paired t-test to compare counts of T4 and T8 cells in blood from 20 individuals. The analogous non-parametric test is computed by the commands

    use tcounts.dta
    signrank t4=t8

    Wilcoxon signed-rank test

            sign |      obs   sum ranks    expected
    -------------+---------------------------------
        positive |       12         147         105
        negative |        8          63         105
            zero |        0           0           0
    -------------+---------------------------------
             all |       20         210         210

    unadjusted variance       717.50
    adjustment for ties         0.00
    adjustment for zeros        0.00
                          ----------
    adjusted variance         717.50

    Ho: t4 = t8
                 z =   1.568
        Prob > |z| =   0.1169

No correction is made, since these data have no ties or zeroes.

The p-value is larger than 0.05, so the difference between the distributions of T4 and T8 cells is not statistically significant.

Example continued
Diagnostic plots of these data (day 3, page 31 and 38) suggest that the counts initially should be log-transformed.
Note: Transformations of the basic data, the x's and the y's, may change the value of Wilcoxon's signed rank-sum test.

    signrank logt4=logt8

            sign |      obs   sum ranks    expected
    -------------+---------------------------------
        positive |       12         150         105
        negative |        8          60         105
            zero |        0           0           0
    -------------+---------------------------------
             all |       20         210         210

    unadjusted variance       717.50
    adjustment for ties         0.00
    adjustment for zeros        0.00
                          ----------
    adjusted variance         717.50

    Ho: logt4 = logt8
                 z =   1.680
        Prob > |z| =   0.0930

Note: The number of positive differences is unchanged, but the sum of their ranks has changed. The p-value has also changed (a little).

NON-PARAMETRIC CORRELATION COEFFICIENTS
Non-parametric correlation coefficients measure the strength of the association between continuous variables or between ordered categorical variables.

Spearman's rho
Data: A sample of n pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ of observations.

Procedure: Rank the x's and the y's, and let

$$R_i = rank(x_i) \qquad Q_i = rank(y_i)$$

Then Spearman's rho is defined as the usual correlation coefficient computed from the ranks, i.e.

$$\rho_S = \frac{\sum_i (R_i - \bar{R})(Q_i - \bar{Q})}{\sqrt{\sum_i (R_i - \bar{R})^2 \, \sum_i (Q_i - \bar{Q})^2}}$$

We have $-1 \le \rho_S \le 1$. If Y increases with X then $\rho_S$ is positive, and if Y decreases with X then $\rho_S$ is negative.

If X and Y are independent and the data have no tied observations, then

$$E(\rho_S) = 0 \qquad Var(\rho_S) = \frac{1}{n - 1}$$

From Spearman's rho a non-parametric test of independence between X and Y can be derived.

The exact distribution of Spearman's rho under the hypothesis of independence is complicated, but has been tabulated for small sample sizes, see e.g. Altman, Table B8.

Usually the p-value is found by computing the test statistic

$$t_S = \rho_S \sqrt{\frac{n - 2}{1 - \rho_S^2}}$$

which approximately has a t-distribution with $n - 2$ degrees of freedom. Stata's command spearman uses this approach to compute the p-value, see below.

Kendall's tau
Two pairs of observations $(X_i, Y_i), (X_j, Y_j)$ are called concordant if $X_i < X_j$ and $Y_i < Y_j$, or if $X_i > X_j$ and $Y_i > Y_j$, i.e. when the two pairs are ordered in the same way according to X and according to Y.

Similarly, a pair of pairs is called discordant if the ordering according to Y is a reversal of the ordering according to X.

Let
C = number of concordant pairs in the sample
D = number of discordant pairs in the sample
Ties are handled by adding ½ to both C and D. Then

$$C + D = n (n - 1)/2 = \text{number of pairs of pairs in the sample}$$

Let $S = C - D$. Then Kendall's tau (or tau-a) is defined as

$$\tau = \frac{S}{n (n - 1)/2}$$

Kendall's tau-b uses a slightly different denominator to allow for ties.

Properties of Kendall's tau
We have $-1 \le \tau \le 1$.

When X and Y are independent and no ties are present in the data, it can be shown that

$$E(\tau) = 0 \qquad Var(\tau) = \frac{2 (2n + 5)}{9 n (n - 1)}$$

Formulas valid for tied data are complicated.

Also from Kendall's tau a non-parametric test of independence between X and Y can be derived. The test statistic is usually based on a normal approximation to S, the numerator of Kendall's tau. A continuity correction is routinely applied. Stata's command ktau uses this approach to compute the p-value, see below.

Note: Both Spearman's rho and Kendall's tau are unchanged if one or both of the series of observations are transformed by a monotonically increasing transformation.

Non-parametric correlation coefficients with Stata
Example. Consider the data with counts of T4 and T8 cells in blood from 20 persons, but this time we want to describe the association between the two counts.

    spearman t4 t8

     Number of obs =      20
    Spearman's rho =  0.6511

    Test of Ho: t4 and t8 are independent
        Prob > |t| =  0.0019

    ktau t4 t8

      Number of obs =      20
    Kendall's tau-a =  0.5053
    Kendall's tau-b =  0.5053
    Kendall's score =      96
        SE of score =  30.822

    Test of Ho: t4 and t8 are independent
         Prob > |z| =  0.0021  (continuity corrected)

Kendall's score is S = C − D.

The hypothesis of independence is rejected in both cases. Persons with a high T4 value typically also have a high T8 value.

Note: The hypothesis of independence differs from the hypothesis tested with a paired two-sample test.

Example
Non-parametric correlation coefficients can also be used to analyse an R×C table with ordered categories.

In lecture 4 (page 42) births in December 1993 included in skejby-cohort.dta were cross-classified according to age of the mother and parity of the child.

                          Parity
    Age of mother      0      1     2-   Total
    -24               57     13      5      75
    25-29             70     40     20     130
    30-               53     52     33     138
    Total            180    105     58     343

The hypothesis of independence in this 3×3 table with ordered categories can be assessed by the following commands

    generate agecat=(mage>24)+(mage>29) if mage<.
    generate paricat=(parity>0)+(parity>1) if parity<.
    spearman agecat paricat if year==1993 & mon==12

Output

     Number of obs =     343
    Spearman's rho =  0.2807

    Test of Ho: agecat and paricat are independent
        Prob > |t| =  0.0000

For comparison the same analysis of the ungrouped data is

    spearman mage parity if year==1993 & mon==12

     Number of obs =     343
    Spearman's rho =  0.3224

    Test of Ho: mage and parity are independent
        Prob > |t| =  0.0000

As expected the correlation is stronger in the ungrouped data.

Note: The usual chi-square test of independence, which does not take the ordering into account, is also statistically significant. We get $X^2 = 28.57$ on 4 degrees of freedom and p-value = 0.000001.
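As a check on the ktau output for the T4 and T8 counts, the reported tau-a, standard error of the score and p-value can be recomputed from n = 20 and S = 96. The sketch uses the untied variance formula for tau given above, rescaled to S, and assumes that the continuity correction subtracts 1 from |S|. That assumption is not stated in these notes, but it reproduces the reported p-value.

```python
import math

n = 20          # number of observations, from the ktau output
score = 96      # Kendall's score S = C - D, from the ktau output

n_pairs = n * (n - 1) // 2                  # number of pairs of pairs
tau_a = score / n_pairs

# Var(S) = (n(n-1)/2)^2 * Var(tau) = n(n-1)(2n+5)/18 for untied data.
var_score = n * (n - 1) * (2 * n + 5) / 18
se_score = math.sqrt(var_score)

# Normal approximation; the correction by 1 is an assumption (see lead-in).
z = (abs(score) - 1) / se_score
p_two_sided = math.erfc(z / math.sqrt(2))   # 2 * P(Z > z) for standard normal Z

print(f"tau-a       = {tau_a:.4f}")
print(f"SE of score = {se_score:.3f}")
print(f"z           = {z:.3f}")
print(f"Prob > |z|  = {p_two_sided:.4f}")
```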