School of Nursing 25-26 Sept 2008 – M. Higgins

“I Have a Bunch of Data – Now What?”
Data Screening, Exploring and Clean-Up
Melinda K. Higgins, Ph.D.
25 & 26 September 2008

Outline
I. Descriptive Statistics (univariate & bi-variate)
   I. Measures of Centrality
   II. Measures of Variability
   III. Distributions & Transformations
   IV. Tests of Normality
   V. Outliers
   VI. Missing Data
   VII. Correlations
II. Overall Flow Charts
III. Potential Statistical Analyses (Decision Tree)
IV. Contact Info

A Few Initial Considerations
• GROUPS – If the data are to be evaluated by group, evaluate the descriptive statistics BY group (e.g. the data might not be skewed overall, but one group may be skewed by itself). You may or may not want to transform.
• LONGITUDINAL DATA – If variables were measured over time, you will need to consider all the time points (e.g. you would NOT want to transform one time point and not the others).
• MULTIVARIATE MEASURES – Additional screening measures in bi-variate/multivariate combinations (multicollinearity, influential cases, leverage, Mahalanobis distance) are not covered in this lecture.

Measures of Central Tendency
• Mean = (ΣXi)/n
• Median: 50% below ≤ Median ≤ 50% above
  – For odd n, Median = middle value of the sorted X
  – For even n, Median = average of the 2 middle X's
• Trimmed Mean – mean recalculated after deleting _% or _# off the top and bottom of the sorted data (usually 5% or so)
• Mode – the value(s) repeated the most
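The four measures of central tendency listed above can be sketched in a few lines of Python. This is an illustrative sketch, not the SPSS implementation: the function names and the sample data are made up, and the trimmed mean simply drops a whole number of cases from each tail (rounded down), which is an assumption. Statistical packages may treat fractional case counts differently.

```python
from collections import Counter


def mean(xs):
    """Arithmetic mean: sum of the values divided by n."""
    return sum(xs) / len(xs)


def median(xs):
    """Middle value of the sorted data (average of the 2 middle values for even n)."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]
    return (s[mid - 1] + s[mid]) / 2


def trimmed_mean(xs, proportion=0.05):
    """Mean after dropping `proportion` of the cases from each end of the sorted data."""
    s = sorted(xs)
    k = int(len(s) * proportion)  # cases dropped per tail, rounded down (assumption)
    return mean(s[k:len(s) - k])


def modes(xs):
    """All values tied for the highest frequency."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


scores = [1, 2, 2, 3, 3, 4, 5, 6, 7, 47]  # made-up data with one extreme value
summary = {
    "mean": mean(scores),
    "median": median(scores),
    "trimmed_mean_10pct": trimmed_mean(scores, 0.10),
    "modes": modes(scores),
}
print(summary)
```

With one extreme value in the data, the mean is pulled well above the median and the trimmed mean, which is the reason to report more than one measure when screening.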
Measures of Variance
• (sample) Variance = sum of squared deviations from the mean / (n-1): $s^2 = \sum_i (X_i - \bar{X})^2 / (n - 1)$
• (sample) Standard Deviation = sqrt(variance)
• Range = max(X) – min(X)
• IQR – Interquartile Range = 75th Percentile(X) – 25th Percentile(X)

Distributions
• Stem and Leaf
• Dot plot
• Histogram
• Box plot

Boxplots (as Defined in SPSS)
• A boxplot shows five statistics (minimum, first quartile, median, third quartile, and maximum). It is useful for displaying the distribution of a scale variable and pinpointing outliers.
• The boundaries of the box are “Tukey's hinges.” The median is identified by a line inside the box. The length of the box is the interquartile range (IQR) computed from Tukey's hinges [i.e. 25th and 75th percentiles].
• Outliers: cases with values between 1.5 and 3 box lengths (box length = IQR) from either end of the box (“o”).
• Extremes: cases with values more than 3 box lengths from either end of the box (“*”).
• Whiskers at the ends of the box show the distance from the end of the box to the largest and smallest observed values that are less than 1.5 box lengths from either end of the box.

Distributions (cont'd) – Skewness and Kurtosis
• Skewness and Kurtosis are the two most commonly used measures to evaluate deviations from normality.
• Skewness measures the extent to which the distribution is not symmetric.
• Kurtosis measures the extent to which the distribution is more “pointed/narrow” or “flatter/wider” than the normal distribution.

Statistical Test: Skewness & Kurtosis
• Zs = (S_skew-0)/SE_skew
  – S_skew = skewness measure
  – SE_skew = std. error of skewness
• Zk = (S_kurt-0)/SE_kurt
  – S_kurt = kurtosis measure
  – SE_kurt = std. error of kurtosis
• |Zs| or |Zk| > 1.96 is significant at the 0.05 sig. level
• |Zs| or |Zk| > 2.58 is significant at the 0.01 sig. level
• |Zs| or |Zk| > 3.29 is significant at the 0.001 sig. level

Statistics
day1 = Hygiene (Day 1 of Glastonbury Festival); day2 = Hygiene (Day 2 of Glastonbury Festival); day3 = Hygiene (Day 3 of Glastonbury Festival)

                          day1       day2       day3
N Valid                   810        264        123
N Missing                 0          546        687
Mean                      1.7934     .9609      .976
Std. Error of Mean        .03319     .04436     .06404
Median                    1.7900     .7900      .760
Std. Deviation            .94449     .72078     .71028
Variance                  .892       .520       .504
Skewness                  8.865      1.095      1.03
Std. Error of Skewness    .086       .150       .218
Kurtosis                  170.450    .822       .732
Std. Error of Kurtosis    .172       .299       .433
Range                     20.00      3.44       3.39
Minimum                   .02        .00        .02
Maximum                   20.02      3.44       3.41

day1: Zs = 103.08, Zk = 990.98
day2: Zs = 7.3, Zk = 2.7

Additional Tests of Normality
• The following 2 tests compare the scores in the sample to a normally distributed set of scores with the same mean and std. deviation. If the test is non-significant (p > 0.05), the sample distribution is not significantly different from a normal distribution.
  – Kolmogorov-Smirnov
  – Shapiro-Wilk
• [NOTE: With larger sample sizes, these tests will be significant for small deviations from normality – use graphics/visual inspection.]

Tests of Normality
SPSS – Analyze/Explore/Normality Plots with Tests

           Kolmogorov-Smirnov(a)          Shapiro-Wilk
           Statistic   df     Sig.        Statistic   df     Sig.
day1       .083        810    .000        .654        810    .000
day2       .121        264    .000        .908        264    .000
day3       .140        123    .000        .908        123    .000

a. Lilliefors Significance Correction

Normal Probability Plots
[Figure: normal probability plots]

Transformations (SPSS COMPUTE and/or SAS Data Procedure)

Distribution shape                                 Transformation
Moderate positive skewness                         NEWX=SQRT(X)
Substantial positive skewness                      NEWX=LG10(X)
Substantial positive skewness (with zero)          NEWX=LG10(X+C)
Severe positive skewness, L-shaped                 NEWX=1/X
Severe positive skewness, L-shaped (with zero)     NEWX=1/(X+C)
Moderate negative skewness                         NEWX=SQRT(K-X)
Substantial negative skewness                      NEWX=LG10(K-X)
Severe negative skewness (J-shaped)                NEWX=1/(K-X)

C = constant added so that the smallest score is 1.
K = constant from which each score is subtracted so that the smallest score is 1 (usually the largest score + 1).

Statistics (transformed variables)

                          sqrtday1   lg10day1   sqrtday2   lg10day2
N Valid                   810        810        264        263
N Missing                 0          0          546        547
Mean                      1.3042     .2035      .9092      -.1589
Std. Error of Mean        .01069     .00819     .02259     .02437
Median                    1.3379     .2529      .8888      -.1024
Std. Deviation            .30429     .23297     .36703     .39514
Variance                  .093       .054       .135       .156
Skewness                  .840       -1.947     .255       -.835
Std. Error of Skewness    .086       .086       .150       .150
Kurtosis                  14.320     9.613      -.403      .968
Std. Error of Kurtosis    .172       .172       .299       .299
Range                     4.33       3.00       1.85       2.24
Minimum                   .14        -1.70      .00        -1.70
Maximum                   4.47       1.30       1.85       .54

Tests of Normality (transformed variables)

           Kolmogorov-Smirnov(a)          Shapiro-Wilk
           Statistic   df     Sig.        Statistic   df     Sig.
sqrtday1   .065        810    .000        .916        810    .000
lg10day1   .123        810    .000        .861        810    .000
sqrtday2   .045        264    .200*       .988        264    .027
lg10day2   .103        263    .000        .956        263    .000

*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

[Figure: panels labeled Original, SQRT, and LG10]

Outliers
• Review histograms and boxplots and look for extreme values
• Investigate values
  – Is it “real?”
  – Can it be corrected?
  – Should it be deleted (or left out of analyses)?
  – [consider clinical reasons; procedural reasons]
• Calculate z-scores (next page) and review the number of outliers
• Is there a pattern? (compare outliers to non-outliers)

Outliers
DESCRIPTIVES VARIABLES=day2 /SAVE.
* The /SAVE option creates the z-scores of day2 (saved as Zday2).
COMPUTE outlier1=abs(zday2).
EXECUTE.
RECODE outlier1 (3.29 thru Highest = 4) (2.58 thru Highest = 3) (1.96 thru Highest = 2) (Lowest thru 2 = 1).
EXECUTE.
VALUE LABELS outlier1
  1 'Absolute z-score less than 2'
  2 'Absolute z-score greater than 1.96'
  3 'Absolute z-score greater than 2.58'
  4 'Absolute z-score greater than 3.29'.
FREQUENCIES VARIABLES=outlier1
  /ORDER=ANALYSIS.

outlier1
                                                      Frequency   Percent   Valid Percent   Cumulative Percent
Valid    1.00 Absolute z-score less than 2            246         30.4      93.2            93.2
         2.00 Absolute z-score greater than 1.96      12          1.5       4.5             97.7
         3.00 Absolute z-score greater than 2.58      4           .5        1.5             99.2
         4.00 Absolute z-score greater than 3.29      2           .2        .8              100.0
         Total                                        264         32.6      100.0
Missing  System                                       546         67.4
Total                                                 810         100.0

Missing Data
• Look for patterns (MVA next slide) and/or the reason why data are missing
• Can compare subjects with missing data to subjects without missing data
• Can delete/ignore or impute based on a model
• Goal is to:
  – Minimize bias
  – Maximize utilization of information (data=$$)
  – Get good estimates of uncertainty
• Censoring (survival analysis – “loss to follow-up”)
• [SIDE NOTE: SPSS missing – strings vs. numeric data types]
NOTE: Missing Data Imputation – to be discussed in another lecture

Missing Data
MVA VARIABLES = timedrs attdrug atthouse income emplmnt mstatus race
  /TTEST PROB PERCENT=5
  /MPATTERN
  /EM.
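As a rough illustration of the first thing the missing-value analysis reports, the sketch below counts missing values per variable and flags variables at or above a percent cutoff, mirroring the PERCENT=5 setting in the command above. It is a Python sketch under stated assumptions, not the SPSS procedure: the function name `missing_summary`, the use of `None` to mark a missing value, and the generated data are all hypothetical.

```python
def missing_summary(rows, variables, flag_percent=5.0):
    """Per-variable count and percent of missing values (None marks missing).

    Returns {variable: (n_missing, percent_missing, flagged)}, where `flagged`
    is True when the percent missing reaches `flag_percent`.
    """
    n = len(rows)
    summary = {}
    for var in variables:
        n_missing = sum(1 for row in rows if row.get(var) is None)
        percent = 100.0 * n_missing / n
        summary[var] = (n_missing, round(percent, 1), percent >= flag_percent)
    return summary


# Made-up data shaped like the example: 465 cases, 26 of them without income.
rows = [{"timedrs": 2, "income": None if i < 26 else 4} for i in range(465)]
result = missing_summary(rows, ["timedrs", "income"])
print(result)
```

Only variables that reach the cutoff would then be used to split the cases into "present" and "missing" groups for comparison on the other variables.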
Univariate Statistics

                                           Missing            No. of Extremes(a,b)
            N      Mean     Std. Deviation   Count   Percent    Low    High
timedrs     465    7.90     10.948           0       .0         0      34
attdrug     465    7.69     1.156            0       .0         0      0
atthouse    464    23.54    4.484            1       .2         4      0
income      439    4.21     2.419            26      5.6        0      0
emplmnt     465    .47      .500             0       .0         0      0
mstatus     465    1.78     .416             0       .0         .      .
race        465    1.09     .284             0       .0         .      .

a. Number of cases outside the range (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
b. . indicates that the inter-quartile range (IQR) is zero.

Separate Variance t Tests(a) (indicator variable: income)

                  timedrs   attdrug   atthouse   income   emplmnt   mstatus   race
t                 .2        -1.1      -.2        .        -1.1      -1.0      -.4
df                32.2      29.6      28.6       .        28.0      29.0      27.3
P(2-tail)         .846      .289      .851       .        .279      .346      .662
# Present         439       439       438        439      439       439       439
# Missing         26        26        26         0        26        26        26
Mean(Present)     7.92      7.67      23.53      4.21     .46       1.77      1.09
Mean(Missing)     7.62      7.88      23.69      .        .58       1.85      1.12

For each quantitative variable, pairs of groups are formed by indicator variables (present, missing).
a. Indicator variables with less than 5% missing are not displayed.

None are significant (as compared to “INCOME”).

Missing and Extreme Value Patterns(a)
Missing Patterns (cases with missing values); variables: timedrs, attdrug, atthouse, income, emplmnt, mstatus, race

Case    # Missing   % Missing   Flags
52      1           14.3        S
64      1           14.3        + S
69      1           14.3        S
77      1           14.3        S
118     1           14.3        S
135     1           14.3        S
161     1           14.3        S
172     1           14.3        S
173     1           14.3        S
174     1           14.3        S
181     1           14.3        S
196     1           14.3        + S
203     1           14.3        + S
236     1           14.3        S
240     1           14.3        S
258     1           14.3        + S
304     1           14.3        S
321     1           14.3        S
325     1           14.3        S
352     1           14.3        S
378     1           14.3        S
379     1           14.3        S
409     1           14.3        + S
419     1           14.3        S
421     1           14.3        S
435     1           14.3        + S
253     1           14.3        + S

- indicates an extreme low value, while + indicates an extreme high value. The range used is (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
a. Cases and variables are sorted on missing patterns.
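The extreme-value rule quoted in the footnotes, (Q1 - 1.5*IQR, Q3 + 1.5*IQR), can be sketched as follows. This is a Python illustration with made-up data. The function name `tukey_extremes` is hypothetical, and the quartiles come from `statistics.quantiles` with the "inclusive" method, which is an assumption; SPSS may compute quartiles differently, so counts near the fences can differ.

```python
import statistics


def tukey_extremes(values, k=1.5):
    """Count values below Q1 - k*IQR and above Q3 + k*IQR.

    Returns (n_low, n_high), or None when the IQR is zero
    (the situation the output table marks with ".").
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return None
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    n_low = sum(1 for v in values if v < low_fence)
    n_high = sum(1 for v in values if v > high_fence)
    return n_low, n_high


visits = [1, 2, 2, 3, 3, 4, 4, 5, 6, 60]  # made-up counts with one extreme high value
print(tukey_extremes(visits))
print(tukey_extremes([1, 1, 1, 1, 2]))  # IQR is zero, so no fences can be formed
```

Passing `k=3` instead of the default gives the wider fence that the SPSS boxplot definition uses to separate "extremes" from "outliers".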
Gender (string), as entered

           Frequency   Percent   Valid Percent
(blank)    1           .6        .6
F          95          53.4      53.4
m          1           .6        .6
M          81          45.5      45.5
Total      178         100.0     100.0

We fixed “m”, but what about the subject with no gender (missing)?

Gender (string), after fixing “m”

           Frequency   Percent   Valid Percent
(blank)    1           .6        .6
F          95          53.4      53.4
M          82          46.1      46.1
Total      178         100.0     100.0

Statistics

             Gender   GenderNum
N Valid      178      177
N Missing    0        1
Mean                  1.4633
Median                1.0000
Mode                  1.00

GenderNum (numeric)

                   Frequency   Percent   Valid Percent
Valid    1.00      95          53.4      53.7
         2.00      82          46.1      46.3
         Total     177         99.4      100.0
Missing  System    1           .6
Total              178         100.0

Case Processing Summary: AnemicGroup (If Anemic, which group) * GenderNum (Gender Recoded Numeric) * AgeGroupHbg (Age Group for Hemoglobin Levels)

            N      Percent
Included    26     14.6%
Excluded    152    85.4%
Total       178    100.0%

1 case not counted!!

Case Processing Summary: AnemicGroup (If Anemic, which group) * AgeGroupHbg (Age Group for Hemoglobin Levels) * Gender (Gender String Variable)

            N      Percent
Included    27     15.2%
Excluded    151    84.8%
Total       178    100.0%

Case correctly counted

Case Summaries: AnemicGroup by GenderNum (Gender Recoded Numeric) and AgeGroupHbg (Age Group for Hemoglobin Levels)

GenderNum      AgeGroupHbg                      N     % of Total N
1.00 Female    2.00 Age 5 to less than 8        6     23.1%
               3.00 Age 8 to less than 12       5     19.2%
               4.00 Age 12 to less than 15      3     11.5%
               Total                            14    53.8%
2.00 Male      1.00 Age 2 to less than 5        1     3.8%
               2.00 Age 5 to less than 8        1     3.8%
               3.00 Age 8 to less than 12       7     26.9%
               4.00 Age 12 to less than 15      3     11.5%
               Total                            12    46.2%
Total          1.00 Age 2 to less than 5        1     3.8%
               2.00 Age 5 to less than 8        7     26.9%
               3.00 Age 8 to less than 12       12    46.2%
               4.00 Age 12 to less than 15      6     23.1%
               Total                            26    100.0%

Case Summaries: AnemicGroup by AgeGroupHbg (Age Group for Hemoglobin Levels) and Gender (Gender String Variable)

AgeGroupHbg                      Gender     N     % of Total N
1.00 Age 2 to less than 5        M          1     3.7%
                                 Total      1     3.7%
2.00 Age 5 to less than 8        F          6     22.2%
                                 M          1     3.7%
                                 Total      7     25.9%
3.00 Age 8 to less than 12       (blank)    1     3.7%
                                 F          5     18.5%
                                 M          7     25.9%
                                 Total      13    48.1%
4.00 Age 12 to less than 15      F          3     11.1%
                                 M          3     11.1%
                                 Total      6     22.2%
Total                            (blank)    1     3.7%
                                 F          14    51.9%
                                 M          12    44.4%
                                 Total      27    100.0%

This was an interesting case, as the designation of “anemia” depended only on age for children younger than 12, but on both age and gender for those older than 12. [Our missing gender was 8 yrs old.]

Correlations

Measures of Correlation
• [Parametric] R and R² (X vs Y or X1 vs X2): R = Pearson's correlation coefficient
• [Non-parametric] Spearman's rho and Kendall's tau-b – both based on ranks (see SPSS Help for further details)

Correlations (Pearson)

                                    comfort   role      involvement
comfort       Pearson Correlation   1         .341**    .162
              Sig. (2-tailed)                 .004      .165
              N                     76        68        75
role          Pearson Correlation   .341**    1         .381**
              Sig. (2-tailed)       .004                .001
              N                     68        68        67
involvement   Pearson Correlation   .162      .381**    1
              Sig. (2-tailed)       .165      .001
              N                     75        67        75

**. Correlation is significant at the 0.01 level (2-tailed).

Correlations (Kendall's tau_b)

                                        comfort   role      involvement
comfort       Correlation Coefficient   1.000     .361**    .117
              Sig. (2-tailed)           .         .000      .149
              N                         76        68        75
role          Correlation Coefficient   .361**    1.000     .158
              Sig. (2-tailed)           .000      .         .067
              N                         68        68        67
involvement   Correlation Coefficient   .117      .158      1.000
              Sig. (2-tailed)           .149      .067      .
              N                         75        67        75

Correlations (Spearman's rho)

                                        comfort   role      involvement
comfort       Correlation Coefficient   1.000     .506**    .169
              Sig. (2-tailed)           .         .000      .146
              N                         76        68        75
role          Correlation Coefficient   .506**    1.000     .227
              Sig. (2-tailed)           .000      .         .065
              N                         68        68        67
involvement   Correlation Coefficient   .169      .227      1.000
              Sig. (2-tailed)           .146      .065      .
              N                         75        67        75

**. Correlation is significant at the 0.01 level (2-tailed).

Correlations (cont'd)
For comfort vs. role, the Pearson correlation is .341, so R² = (0.341)² = .116.

Checklist for Data Screening
1. Inspect univariate descriptive stats – check for data accuracy/discrepancies
   a) Out-of-range values
   b) Plausible means and standard deviations
   c) Univariate outliers
2. Evaluate amount and patterns of missing data
3. Check pairwise plots for nonlinearity and heteroscedasticity [REGRESSION]
4. Identify and deal with nonnormal variables and univariate outliers
   a) Check skewness and kurtosis and probability plots
   b) Perform transforms (if desired)
   c) Check results of transformation
5. Identify multivariate outliers [REGRESSION]
6. Evaluate variables for multicollinearity and singularity [REGRESSION]

What do I do Now? – A Decision Tree for Picking Statistical Methods to Use
• Questions to Ask
  – Major Research Question?
    · Degree of Relationship Among Variables
    · Significant Group Difference
    · Prediction of Group Membership
    · Structure
    · Time/Course of Events
  – Number & Kind of Dependent Variables
    · Single vs Multiple & Discrete vs Continuous
  – Number & Kind of Independent Variables
    · Single vs Multiple & Discrete vs Continuous
  – Covariates? [yes/no]
• Decision Tree Yields Analytic Strategy and Goal of Analysis

Tabachnick, B.G. and Fidell, L.S. (2007) Using Multivariate Statistics (5th Ed.). New York: Pearson Education, Inc.
“How to talk to a Statistician”
• List of Hypotheses/Aims (end goals)
• List of Variables
  – Type, Measure (numeric, string, date/time, scales, categorical)
  – Independent, covariates, dependent (outcomes)
  – Names, Labels and Values [consistency (q1,q2,q3,…, item01,item02,…), length, consider graphics]
• Model (hypothesized, general idea – theoretical concerns)
• Graphics/figures/tables requested (reports, posters, grants)
• POWER – idea on “effect size” (how big a change do you hope to see) – clinical significance, prior results?

VIII. Statistical Resources and Contact Info
SON S:\Shared\Statistics_MKHiggins\website2\index.htm [updates in process]
Working to include tip sheets (for SPSS, SAS, and other software), lectures (PPTs and handouts), datasets, other resources and references

Statistics At Nursing Website: [website being updated] http://www.nursing.emory.edu/pulse/statistics/
And Blackboard Site (in development) for “Organization: Statistics at School of Nursing”

Contact
Dr. Melinda Higgins
[email protected]
Office: 404-727-5180 / Mobile: 404-434-1785