EPIB 698D Lecture 6
Raul Cruz-Cano
Spring 2013

DETERMINING NORMALITY (OR LACK THEREOF)
• One of the first steps in test selection should be investigating the distribution of the data.
• PROC UNIVARIATE can be used to determine whether or not your data are normal.
– If the population from which the data are obtained is normal, the mean and median should be equal or close to equal.
– The skewness coefficient, which is a measure of symmetry, should be near zero. Positive values for the skewness coefficient indicate that the data are right skewed, and negative values indicate that the data are left skewed.
– The kurtosis coefficient, which measures how heavy the tails of the distribution are relative to a normal distribution, should also be near zero (PROC UNIVARIATE reports excess kurtosis, which is 0 for a normal distribution).
– Positive values for the kurtosis coefficient indicate that the distribution of the data is steeper (more peaked) than a normal distribution, and negative values for kurtosis indicate that the distribution of the data is flatter than a normal distribution.

DETERMINING NORMALITY (OR LACK THEREOF)
• The NORMAL option in PROC UNIVARIATE produces a table with tests for normality.
– These include the Shapiro-Wilk statistic and the EDF goodness-of-fit tests: the Kolmogorov D, Anderson-Darling, and Cramér-von Mises statistics.
– In general, if the p-values are less than 0.05, then the data should be considered non-normally distributed.
– However, it is important to remember that these tests are heavily dependent on sample size.
– Strikingly non-normal data may have a p-value greater than 0.05 because of a small sample size. Therefore, graphical representations of the data should always be examined.
• The PLOTS option in PROC UNIVARIATE creates low-resolution stem-and-leaf, box, and normal probability plots.
– The stem-and-leaf plot is used to visualize the overall distribution of the data, and the box plot is a graphical representation of the 5-number summary.
– The normal probability plot is designed to investigate whether a variable is normally distributed. If the data are normal, the plot should display a straight diagonal line. Different departures from the straight diagonal line indicate different types of departures from normality.

DETERMINING NORMALITY
• The HISTOGRAM statement in PROC UNIVARIATE will produce high-resolution histograms.
• PROC UNIVARIATE is an invaluable tool for visualizing and summarizing data in order to gain an understanding of the underlying populations from which the data are obtained. To produce these results, the following code can be used:

PROC UNIVARIATE data=datafile normal plots;
   histogram;
   var variable1 variable2 ... variablen;
run;

• The determination of the normality of the data should result from evaluation of the graphical output in conjunction with the numerical output.
• In addition, the user might wish to look at subsets of the data; for example, a CLASS statement might be used to stratify by gender.

Normality Test: Box Plot (One-sample t-test?)

DATA relieftime;
   INPUT relief;
   DATALINES;
90
93
93
99
98
100
103
104
99
102
;

PROC UNIVARIATE DATA = relieftime normal plot;
   VAR relief;
   histogram relief / midpoints = 80 to 120 by 5 normal;
RUN;

When used in conjunction with the NORMAL option, the histogram will have a line indicating the shape of a normal distribution with the same mean and variance as the sample.

Tests for Normality
• The histogram shows most observations falling at the peak of the normal curve.
• The box plot shows that the mean falls on the median (*--+--*), indicating the data are not skewed.
• The formal tests of normality in the output are non-significant, indicating these data come from a normal distribution.
• We can assume the data are normally distributed and proceed with the one-sample t-test.

Normality Test: Box Plot (Paired t-test?)

DATA study;
   INPUT before after;
   DATALINES;
90 95
87 92
100 104
80 89
95 101
90 105
;

PROC UNIVARIATE DATA = study normal plot;
   VAR before after;
   histogram before after / normal;
RUN;

Tests for Normality
• There are so few data points that the histograms are difficult to interpret.
• The box plots for before and after both show the mean very close to the median, suggesting the data are not skewed.
• The tests of normality for before and after have p-values > alpha, indicating we do not reject the assumption of normality.
• We can proceed with the matched-pairs t-test.

Tests for Normality (Two-sample t-test?)

DATA response;
   INPUT group $ time;
   DATALINES;
c 80
c 93
c 83
c 89
c 98
t 100
t 103
t 104
t 99
t 102
;

PROC UNIVARIATE DATA = response normal plot;
   class group;
   var time;
   histogram time / midpoints = 80 to 120 by 5 normal;
RUN;

A few notes:
• The code specifies that the UNIVARIATE procedure be performed on the variable time, but that it be done by the CLASS variable group. This way you will have separate summary statistics, plots, and histograms for the treatment and control groups.

Tests for Normality
• The tests for normality for both the treatment and control groups are non-significant (p-value > alpha), indicating we can assume they come from a normal distribution.
• Because each group has only 5 subjects, the histograms are difficult to interpret, but there is no indication of non-normality.
• Proceed with the two-sample t-test.

Histograms for control and treatment groups
[Figure: side-by-side histograms of time (80 to 120) for the control (c) and treatment (t) groups, with Percent on the vertical axis.]

Another Example
• A semiconductor manufacturer produces printed circuit boards that are sampled to determine the thickness of their copper plating.
• The following statements create a data set named Trans, which contains the plating thicknesses (Thick) of 100 boards:

data Trans;
   input Thick @@;
   label Thick = 'Plating Thickness (mils)';
   datalines;
3.468 3.428 3.509 3.516 3.461 3.492 3.478 3.556 3.482 3.512
3.490 3.467 3.498 3.519 3.504 3.469 3.497 3.495 3.518 3.523
3.458 3.478 3.443 3.500 3.449 3.525 3.461 3.489 3.514 3.470
3.561 3.506 3.444 3.479 3.524 3.531 3.501 3.495 3.443 3.458
3.481 3.497 3.461 3.513 3.528 3.496 3.533 3.450 3.516 3.476
3.512 3.550 3.441 3.541 3.569 3.531 3.468 3.564 3.522 3.520
3.505 3.523 3.475 3.470 3.457 3.536 3.528 3.477 3.536 3.491
3.510 3.461 3.431 3.502 3.491 3.506 3.439 3.513 3.496 3.539
3.469 3.481 3.515 3.535 3.460 3.575 3.488 3.515 3.484 3.482
3.517 3.483 3.467 3.467 3.502 3.471 3.516 3.474 3.500 3.466
;
run;

Example

title 'Analysis of Plating Thickness';
proc univariate data=Trans;
   histogram Thick / normal(percents=20 40 60 80 midpercents) name='MyPlot';
run;

Q-Q Plots
• The following properties of Q-Q plots and probability plots make them useful diagnostics of how well a specified theoretical distribution fits a set of measurements:
– If the quantiles of the theoretical and data distributions agree, the plotted points fall on or near the line y = x.
– If the theoretical and data distributions differ only in their location or scale, the points on the plot fall on or near the line y = ax + b. The slope a and the intercept b are visual estimates of the scale and location parameters of the theoretical distribution.
• Q-Q plots are more convenient than probability plots for graphical estimation of the location and scale parameters because the horizontal axis of a Q-Q plot is scaled linearly.
• On the other hand, probability plots are more convenient for estimating percentiles or probabilities.

Q-Q Plot Example
• The data set Measures contains the measurements of the diameters of 50 steel rods in the variable Diameter:

data Measures;
   input Diameter @@;
   label Diameter = 'Diameter (mm)';
   datalines;
5.501 5.251 5.404 5.366 5.445 5.576 5.607 5.200 5.977 5.177
5.332 5.399 5.661 5.512 5.252 5.404 5.739 5.525 5.160 5.410
5.823 5.376 5.202 5.470 5.410 5.394 5.146 5.244 5.309 5.480
5.388 5.399 5.360 5.368 5.394 5.248 5.409 5.304 6.239 5.781
5.247 5.907 5.208 5.143 5.304 5.603 5.164 5.209 5.475 5.223
;
run;

symbol v=plus;
title 'Normal Q-Q Plot for Diameters';
proc univariate data=Measures noprint;
   qqplot Diameter / normal square vaxis=axis1;
   axis1 label=(a=90 r=0);
run;

Probability Plots
• The PROBPLOT statement creates a probability plot, which compares ordered variable values with the percentiles of a specified theoretical distribution. If the data distribution matches the theoretical distribution, the points on the plot form a linear pattern. Consequently, you can use a probability plot to determine how well a theoretical distribution models a set of measurements.
• Probability plots are similar to Q-Q plots, which you can create with the QQPLOT statement. Probability plots are preferable for graphical estimation of percentiles, whereas Q-Q plots are preferable for graphical estimation of distribution parameters.

Probability Plot Example

proc univariate data=Measures;
   probplot Diameter / normal(mu=est sigma=est) square ctext=blue;
run;

You can check against other distributions: lognormal, gamma, beta, etc.

Collinearity
• When a regressor is nearly a linear combination of other regressors in the model, the affected estimates are unstable and have high standard errors.
• This problem is called collinearity or multicollinearity.
• It is a good idea to find out which variables are nearly collinear with which other variables.
• Consequences of high multicollinearity:
– Increased standard errors of the estimates of the β's (decreased reliability).
– Often confusing and misleading results.
• The approach in PROC REG follows that of Belsley, Kuh, and Welsch (1980). PROC REG provides several methods for detecting collinearity with the COLLIN, COLLINOINT, TOL, and VIF options.

Collinearity
• The COLLIN option in the MODEL statement requests that a collinearity analysis be performed.
• Belsley, Kuh, and Welsch (1980) suggest that, when the condition index is around 10, weak dependencies might be starting to affect the regression estimates; when it is larger than 100, the estimates might have a fair amount of numerical error (although the statistical standard error almost always is much greater than the numerical error).
• For each variable, PROC REG produces the proportion of the variance of the estimate accounted for by each principal component. A collinearity problem occurs when a component associated with a high condition index contributes strongly (variance proportion greater than about 0.5) to the variance of two or more variables.

Collinearity
• The VIF option in the MODEL statement provides the variance inflation factors (VIF). These factors measure the inflation in the variances of the parameter estimates due to collinearities that exist among the regressor (independent) variables. There are no formal criteria for deciding whether a VIF is large enough to affect the predicted values.
• The TOL option requests the tolerance values for the parameter estimates.
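For reference, the quantities these two options report have simple closed forms (standard definitions, stated here for convenience rather than taken from the lecture slides): for the j-th regressor, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-square from regressing that regressor on all of the other regressors, and the tolerance reported by TOL is 1 - R_j^2, that is, 1 / VIF_j. An informal rule of thumb often treats VIF values above about 10 (tolerance below about 0.1) as a warning sign, which is consistent with the note above that there is no formal cutoff.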
Example: Collinearity
• Aerobic fitness (measured by the ability to consume oxygen) is fit to some simple exercise tests. The goal is to develop an equation to predict fitness based on the exercise tests rather than on expensive and cumbersome oxygen consumption measurements.

data fitness;
   input Age Weight Oxygen RunTime RestPulse RunPulse MaxPulse @@;
   datalines;
44 89.47 44.609 11.37 62 178 182
40 75.07 45.313 10.07 62 185 185
44 85.84 54.297 8.65 45 156 168
42 68.15 59.571 8.17 40 166 172
38 89.02 49.874 9.22 55 178 180
47 77.45 44.811 11.63 58 176 176
40 75.98 45.681 11.95 70 176 180
43 81.19 49.091 10.85 64 162 170
44 81.42 39.442 13.08 63 174 176
38 81.87 60.055 8.63 48 170 186
44 73.03 50.541 10.13 45 168 168
45 87.66 37.388 14.03 56 186 192
45 66.45 44.754 11.12 51 176 176
47 79.15 47.273 10.60 47 162 164
54 83.12 51.855 10.33 50 166 170
49 81.42 49.156 8.95 44 180 185
51 69.63 40.836 10.95 57 168 172
51 77.91 46.672 10.00 48 162 168
48 91.63 46.774 10.25 48 162 164
49 73.37 50.388 10.08 67 168 168
57 73.37 39.407 12.63 58 174 176
54 79.38 46.080 11.17 62 156 165
52 76.32 45.441 9.63 48 164 166
50 70.87 54.625 8.92 48 146 155
51 67.25 45.118 11.08 48 172 172
54 91.63 39.203 12.88 44 168 172
51 73.71 45.790 10.47 59 186 188
57 59.08 50.545 9.93 49 148 155
49 76.32 48.673 9.40 56 186 188
48 61.24 47.920 11.50 52 170 176
52 82.78 47.467 10.50 53 170 172
;
run;

proc reg data=fitness;
   model Oxygen = RunTime Age Weight RunPulse MaxPulse RestPulse / tol vif collin;
run;

In the output, look for dangerously high VIFs, look at the rows of the collinearity diagnostics table with large condition indexes, and see which predictors have a large proportion of their variance in each of those rows.

Solution
• Check correlations, then redefine variables:
– Remove or average redundant ones.
• Variable selection / model re-specification:
– Use forward or backward selection in the regression analysis.

Let's revisit the examples from previous classes...

data blood;
   INFILE 'F:\blood.txt';
   INPUT subjectID $ gender $ bloodtype $ age_group $ RBC WBC cholesterol;
run;

data blood1;
   set blood;
   if gender='Female' then sex=1; else sex=0;
   if bloodtype='A' then typeA=1; else typeA=0;
   if bloodtype='B' then typeB=1; else typeB=0;
   if bloodtype='AB' then typeAB=1; else typeAB=0;
   if age_group='Old' then Age_old=1; else Age_old=0;
run;

Check normality and collinearity of RBC, WBC and cholesterol:

PROC UNIVARIATE DATA = blood1 normal plot;
   class gender;
   var RBC;
   histogram RBC / normal;
   qqplot RBC / normal square;
RUN;

proc reg DATA = blood1;
   model cholesterol = RBC WBC / vif collin;
run;

title 'Paired Comparison';
data pressure;
   input SBPbefore SBPafter @@;
   diff_BP = SBPafter - SBPbefore;
   datalines;
120 128  124 131  130 131  118 127
140 132  128 125  140 141  135 137
126 118  130 132  126 129  127 135
;
run;

data piared;
   input lossa lossj;
   diff = lossa - lossj;
   datalines;
+4 -8
+3 -10
0 -12
-3 -16
-4 -18
-5 -20
-11 -21
-14 -24
-15 -26
-300 -30
;
run;

Check normality and collinearity of SBPbefore & SBPafter and lossa & lossj.

data nonparametric;
   input loss diet $;
   datalines;
+4 atkins
+3 atkins
0 atkins
-3 atkins
-4 atkins
-5 atkins
-11 atkins
-14 atkins
-15 atkins
-300 atkins
-8 jenny
-10 jenny
-12 jenny
-16 jenny
-18 jenny
-20 jenny
-21 jenny
-24 jenny
-26 jenny
-30 jenny
;
run;

proc ttest data = "c:\hsb2";
   class female;
   var write;
run;

Check normality of write and loss.
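These normality checks are left as exercises. One possible sketch is shown below (not from the original slides): it also examines the paired differences diff_BP and diff, uses CLASS statements to look within groups, and assumes the quoted physical path "c:\hsb2" can be given to PROC UNIVARIATE just as it is given to PROC TTEST above; only the normality part of the exercise is sketched.

PROC UNIVARIATE DATA = pressure normal plot;
   var SBPbefore SBPafter diff_BP;                  /* paired SBP example: also check the differences */
   histogram SBPbefore SBPafter diff_BP / normal;
RUN;

PROC UNIVARIATE DATA = piared normal plot;
   var lossa lossj diff;                            /* weight-loss pairs: also check the differences */
RUN;

PROC UNIVARIATE DATA = "c:\hsb2" normal plot;
   class female;                                    /* two-sample example: check write within each group */
   var write;
RUN;

PROC UNIVARIATE DATA = nonparametric normal plot;
   class diet;                                      /* check loss within each diet group */
   var loss;
RUN;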
data Clover;
   input Strain $ Nitrogen @@;
   datalines;
3DOK1 19.4   3DOK1 32.6   3DOK1 27.0   3DOK1 32.1   3DOK1 33.0
3DOK5 17.7   3DOK5 24.8   3DOK5 27.9   3DOK5 25.2   3DOK5 24.3
3DOK4 17.0   3DOK4 19.4   3DOK4 9.1    3DOK4 11.9   3DOK4 15.8
3DOK7 20.7   3DOK7 21.0   3DOK7 20.5   3DOK7 18.8   3DOK7 18.6
3DOK13 14.3  3DOK13 14.4  3DOK13 11.8  3DOK13 11.6  3DOK13 14.2
COMPOS 17.3  COMPOS 19.4  COMPOS 19.1  COMPOS 16.9  COMPOS 20.8
;
run;

Check normality of Nitrogen.
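As with the earlier exercises, one possible sketch of this last check (not from the original slides; the extra look within each strain via a CLASS statement is an added assumption, not something the prompt asks for):

PROC UNIVARIATE DATA = Clover normal plot;
   var Nitrogen;
   histogram Nitrogen / normal;
   qqplot Nitrogen / normal square;
RUN;

* Optional additional look at Nitrogen within each strain;
PROC UNIVARIATE DATA = Clover normal plot;
   class Strain;
   var Nitrogen;
RUN;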