Data Screening - Melinda Higgins, Ph.D.
School of Nursing
25-26 Sept 2008 – M. Higgins
“I Have a Bunch of Data – Now What?”
Data Screening, Exploring
and Clean-Up
Melinda K. Higgins, Ph.D.
25 & 26 September 2008
Outline
I. Descriptive Statistics (univariate & bi-variate)
   I. Measures of Centrality
   II. Measures of Variability
   III. Distributions & Transformations
   IV. Tests of Normality
   V. Outliers
   VI. Missing Data
   VII. Correlations
II. Overall Flow Charts
III. Potential Statistical Analyses (Decision Tree)
IV. Contact Info
A Few Initial Considerations
• GROUPS – If the data are to be evaluated by group, you will
want to evaluate the descriptive statistics BY group (e.g.
the data might not be skewed overall, but one group may
be skewed by itself) – you may or may not want to transform.
• LONGITUDINAL DATA – If variables were measured over
time, you will need to consider all the time points (e.g.
you would NOT want to transform one time point and not
the others).
• MULTIVARIATE MEASURES – additional screening
measures in bi-variate/multivariate combinations
(multicollinearity, influential cases, leverage,
Mahalanobis distance) – not covered in this lecture.
Measures of Central Tendency
• Mean = (ΣXi)/n
• Median = middle of the sorted data (50% of values at or below it,
50% at or above it)
• For odd n, Median = middle value of sorted X
• For even n, Median = average of the 2 middle X’s
• Trimmed Mean – mean recalculated after deleting _% or
_# of cases off the top and bottom of the sorted data (usually 5% or so)
• Mode – value(s) repeated the most
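The four measures above can be sketched in Python using only the standard library. This is a minimal illustration, not SPSS's algorithm: the `scores` list is hypothetical, and `trimmed_mean` is a simple helper written for this example that drops whole cases from each tail (SPSS's 5% trimmed mean may handle fractional cases differently).

```python
from statistics import mean, median, multimode

def trimmed_mean(values, proportion=0.05):
    """Mean after dropping `proportion` of the sorted cases from each end."""
    data = sorted(values)
    k = int(len(data) * proportion)  # whole cases cut from each tail
    trimmed = data[k:len(data) - k] if k else data
    return mean(trimmed)

# Hypothetical scores with one extreme value (53).
scores = [2, 3, 3, 4, 5, 6, 7, 8, 9, 53]

print("mean:", mean(scores))                      # pulled upward by the extreme value
print("median:", median(scores))                  # average of the two middle values (even n)
print("10% trimmed mean:", trimmed_mean(scores, 0.10))
print("mode(s):", multimode(scores))              # list, because ties are possible
```

Comparing the mean with the median and trimmed mean is a quick first check for skewness or outliers: a large gap suggests a few extreme values are driving the mean.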
Measures of Variance
• (sample) Variance = sum of squared deviations from the
mean/(n-1):
$$ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} $$
• (sample) Standard Deviation = sqrt(variance)
• Range = max(X) – min(X)
• IQR – Interquartile Range = 75th Percentile(X) – 25th
Percentile (X)
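As a rough sketch, the same four measures of variability in Python (standard library only). The `scores` list is hypothetical. Percentile definitions differ between packages, so the quartiles from `statistics.quantiles` may not match SPSS output exactly.

```python
from statistics import variance, stdev, quantiles

# Hypothetical sample.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

sample_variance = variance(scores)      # sum of squared deviations / (n - 1)
sample_sd = stdev(scores)               # square root of the sample variance
value_range = max(scores) - min(scores)
q1, _, q3 = quantiles(scores, n=4)      # 25th, 50th and 75th percentiles
iqr = q3 - q1

print("variance:", sample_variance)
print("std. deviation:", sample_sd)
print("range:", value_range)
print("IQR:", iqr)
```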
Distributions
• Stem and Leaf
• Dot plot
• Histogram
• Box plot
Boxplots (as Defined in SPSS)
• A boxplot shows the five statistics (minimum,
first quartile, median, third quartile, and
maximum). It is useful for displaying the
distribution of a scale variable and pinpointing
outliers.
• The boundaries of the box are “Tukey’s
hinges.” The median is identified by a line
inside the box. The length of the box is the
interquartile range (IQR) computed from
Tukey’s hinges [i.e. 25th and 75th percentiles].
• Outliers. Cases with values that are between
1.5 and 3 box lengths (box length=IQR) from
either end of the box (“o”). Extremes. Cases
with values more than 3 box lengths from
either end of the box (“*”).
• Whiskers at the ends of the box show the
distance from the end of the box to the largest
and smallest observed values that are less
than 1.5 box lengths from either end of the box.
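The outlier and extreme rules above can be sketched as follows. This is an illustration under stated assumptions rather than SPSS's implementation: the hinges are computed as the medians of the lower and upper halves of the sorted data (the middle value is shared by both halves when n is odd), and the `scores` list and function names are hypothetical.

```python
from statistics import median

def tukey_hinges(values):
    """Lower and upper hinge: medians of the lower and upper halves."""
    data = sorted(values)
    half = (len(data) + 1) // 2  # the middle value is in both halves when n is odd
    return median(data[:half]), median(data[-half:])

def classify(values):
    """Label each value by its distance from the box, in box lengths."""
    lower, upper = tukey_hinges(values)
    box_length = upper - lower
    labelled = []
    for x in values:
        distance = max(lower - x, x - upper, 0)  # 0 when x lies inside the box
        if distance > 3 * box_length:
            label = "extreme (*)"
        elif distance > 1.5 * box_length:
            label = "outlier (o)"
        else:
            label = "inside whiskers"
        labelled.append((x, label))
    return labelled

# Hypothetical scores.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 30]
for value, label in classify(scores):
    print(value, label)
```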
Distributions (cont’d) –
Skewness and Kurtosis
• Skewness and Kurtosis are
the two most commonly used
measures to evaluate
deviations from normality.
• Skewness measures the
extent to which the
distribution is not symmetric.
• Kurtosis measures the extent
to which the distribution is
more “pointed/narrow” or
“flatter/wider” than the
normal distribution.
Statistical Test: Skewness & Kurtosis
• Zs = (S_skew-0)/SE_skew
• S_skew = Skewness measure
• SE_skew is the std. error of skewness
• Zk = (S_kurt-0)/SE_kurt
• S_kurt = Kurtosis measure
• SE_kurt is the std. error of kurtosis
• |Zs| or |Zk| values > 1.96 are significant at the 0.05 sig. level
• |Zs| or |Zk| values > 2.58 are significant at the 0.01 sig. level
• |Zs| or |Zk| values > 3.29 are significant at the 0.001 sig. level
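A small sketch of these Z tests in Python. The skewness, kurtosis and standard-error inputs are the day 1 and day 2 hygiene values reported in the lecture's SPSS output. The two standard-error functions use the usual small-sample formulas, which reproduce the reported standard errors for n = 810 and n = 264 when rounded. Treat them as an assumption about how the software computes them; the function names are illustrative only.

```python
from math import sqrt

def se_skewness(n):
    """Standard error of skewness for a sample of size n."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def se_kurtosis(n):
    """Standard error of kurtosis for a sample of size n."""
    return 2 * se_skewness(n) * sqrt((n * n - 1) / ((n - 3) * (n + 5)))

def z_ratio(statistic, std_error):
    """Z = (statistic - 0) / standard error."""
    return (statistic - 0) / std_error

def significance(z):
    """Compare |z| with the two-tailed cut-offs 1.96, 2.58 and 3.29."""
    size = abs(z)
    if size > 3.29:
        return "p < .001"
    if size > 2.58:
        return "p < .01"
    if size > 1.96:
        return "p < .05"
    return "not significant at .05"

zs_day1 = z_ratio(8.865, 0.086)   # day 1 skewness / its std. error
zs_day2 = z_ratio(1.095, 0.150)   # day 2 skewness / its std. error
zk_day2 = z_ratio(0.822, 0.299)   # day 2 kurtosis / its std. error

for name, z in [("Zs day1", zs_day1), ("Zs day2", zs_day2), ("Zk day2", zk_day2)]:
    print(name, round(z, 2), significance(z))
```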
Statistics
                          day1 Hygiene    day2 Hygiene    day3 Hygiene
                          (Day 1 of       (Day 2 of       (Day 3 of
                          Glastonbury     Glastonbury     Glastonbury
                          Festival)       Festival)       Festival)
N   Valid                 810             264             123
    Missing               0               546             687
Mean                      1.7934          .9609           .976
Std. Error of Mean        .03319          .04436          .06404
Median                    1.7900          .7900           .760
Std. Deviation            .94449          .72078          .71028
Variance                  .892            .520            .504
Skewness                  8.865           1.095           1.03
Std. Error of Skewness    .086            .150            .218
Kurtosis                  170.450         .822            .732
Std. Error of Kurtosis    .172            .299            .433
Range                     20.00           3.44            3.39
Minimum                   .02             .00             .02
Maximum                   20.02           3.44            3.41

Zs                        103.08          7.3
Zk                        990.98          2.7
Additional Tests of Normality
• The following 2 tests compare the scores in the sample
to a normally distributed set of scores with the same
mean and std. deviation. If the test is non-significant
(p>0.05), the sample distribution is not
significantly different from a normal distribution.
• Kolmogorov-Smirnov
• Shapiro-Wilk
• [NOTE: With larger sample sizes, these tests will be
significant even for small deviations from normality – use
graphics/visual inspection.]
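To make the idea of comparing the sample with a normal distribution of the same mean and standard deviation concrete, here is a hedged sketch of the Kolmogorov-Smirnov statistic D in plain Python. It computes only the statistic; the p-value (with the Lilliefors correction that SPSS applies) needs tables or simulation and is deliberately left out. Both samples and the function names are hypothetical.

```python
from math import erf, sqrt
from statistics import mean, stdev

def normal_cdf(x, mu, sigma):
    """Cumulative probability of a normal(mu, sigma) distribution at x."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def ks_statistic(values):
    """Largest gap between the sample's cumulative proportions and the
    normal curve that has the sample's own mean and std. deviation."""
    data = sorted(values)
    n = len(data)
    mu, sigma = mean(data), stdev(data)
    d = 0.0
    for i, x in enumerate(data, start=1):
        f = normal_cdf(x, mu, sigma)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]        # hypothetical, evenly spread
skewed = [1, 1, 1, 1, 2, 2, 3, 5, 9, 40]       # hypothetical, strong positive skew

print("D (symmetric):", round(ks_statistic(symmetric), 3))
print("D (skewed):", round(ks_statistic(skewed), 3))
```

A larger D means the sample departs further from the matching normal curve; whether that departure is significant depends on the sample size, which is why the slide recommends visual inspection for large samples.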
Tests of Normality
                               Kolmogorov-Smirnov(a)        Shapiro-Wilk
                               Statistic  df    Sig.        Statistic  df    Sig.
day1 Hygiene (Day 1 of
Glastonbury Festival)          .083       810   .000        .654       810   .000
day2 Hygiene (Day 2 of
Glastonbury Festival)          .121       264   .000        .908       264   .000
day3 Hygiene (Day 3 of
Glastonbury Festival)          .140       123   .000        .908       123   .000
a. Lilliefors Significance Correction
SPSS – Analyze/Explore/Normality Plots with Tests
Normal Probability Plots
Transformations
SPSS COMPUTE and/or SAS DATA step
Moderate positive skewness               NEWX=SQRT(X)
Substantial positive skewness            NEWX=LG10(X)
   (with zero)                           NEWX=LG10(X+C)
Severe positive skewness                 NEWX=1/X
   L-shaped (with zero)                  NEWX=1/(X+C)
Moderate negative skewness               NEWX=SQRT(K-X)
Substantial negative skewness            NEWX=LG10(K-X)
Severe negative skewness (J-shaped)      NEWX=1/(K-X)
C = constant added so that the smallest score is 1
K = constant from which each score is subtracted
so that the smallest score is 1.
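The transformations listed above can be sketched in Python as below. The helper names are hypothetical. C is taken as 1 minus the smallest score and K as the largest score plus 1, which is one way to satisfy the "smallest score is 1" rule. Adding C only when the data contain zero or negative values is this sketch's reading of "(with zero)".

```python
from math import sqrt, log10

def shift_constant(values):
    """C: added to every score so that the smallest score becomes 1."""
    return 1 - min(values)

def reflect_constant(values):
    """K: largest score + 1, so the smallest reflected score (K - X) is 1."""
    return max(values) + 1

def sqrt_transform(values):
    """Moderate positive skewness: NEWX = SQRT(X)."""
    return [sqrt(x) for x in values]

def log10_transform(values):
    """Substantial positive skewness: NEWX = LG10(X), or LG10(X + C) with zeros."""
    c = shift_constant(values) if min(values) <= 0 else 0
    return [log10(x + c) for x in values]

def inverse_transform(values):
    """Severe positive skewness: NEWX = 1/X, or 1/(X + C) with zeros."""
    c = shift_constant(values) if min(values) <= 0 else 0
    return [1 / (x + c) for x in values]

def reflected_sqrt(values):
    """Moderate negative skewness: NEWX = SQRT(K - X)."""
    k = reflect_constant(values)
    return [sqrt(k - x) for x in values]

print(sqrt_transform([0, 4, 9]))
print(log10_transform([0, 9, 99]))    # zero present, so C = 1 is added first
print(inverse_transform([0, 1, 3]))
print(reflected_sqrt([1, 6, 9]))      # K = 10
```

Note that reflecting (K - X) reverses the direction of the scores, so high values on the new variable correspond to low values on the original.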
Statistics
                          sqrtday1    lg10day1    sqrtday2    lg10day2
N   Valid                 810         810         264         263
    Missing               0           0           546         547
Mean                      1.3042      .2035       .9092       -.1589
Std. Error of Mean        .01069      .00819      .02259      .02437
Median                    1.3379      .2529       .8888       -.1024
Std. Deviation            .30429      .23297      .36703      .39514
Variance                  .093        .054        .135        .156
Skewness                  .840        -1.947      .255        -.835
Std. Error of Skewness    .086        .086        .150        .150
Kurtosis                  14.320      9.613       -.403       .968
Std. Error of Kurtosis    .172        .172        .299        .299
Range                     4.33        3.00        1.85        2.24
Minimum                   .14         -1.70       .00         -1.70
Maximum                   4.47        1.30        1.85        .54

Tests of Normality
              Kolmogorov-Smirnov(a)        Shapiro-Wilk
              Statistic  df    Sig.        Statistic  df    Sig.
sqrtday1      .065       810   .000        .916       810   .000
lg10day1      .123       810   .000        .861       810   .000
sqrtday2      .045       264   .200*       .988       264   .027
lg10day2      .103       263   .000        .956       263   .000
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction
[Figure panels: Original, SQRT, LG10]
Outliers
• Review histograms and boxplots and look for extreme
values
• Investigate values
• Is it “real”?
• Can it be corrected?
• Should it be deleted (or left out of analyses)?
• [consider clinical reasons; procedural reasons]
• Calculate z-scores (next page) and review the number of
outliers
• Is there a pattern? (compare outliers to non-outliers)
Outliers
DESCRIPTIVES
  VARIABLES=day2
  /SAVE.
* The /SAVE option creates Zday2, the z-score of “DAY2”.
COMPUTE outlier1=abs(zday2).
EXECUTE.
RECODE
  outlier1 (3.29 thru Highest = 4) (2.58 thru Highest = 3) (1.96 thru Highest = 2) (Lowest thru 2 = 1).
EXECUTE.
VALUE LABELS outlier1
  1 'Absolute z-score less than 2' 2 'Absolute z-score greater than 1.96' 3 'Absolute z-score greater than 2.58' 4 'Absolute z-score greater than 3.29'.
FREQUENCIES
  VARIABLES=outlier1
  /ORDER=ANALYSIS.
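For readers without SPSS, the same screening logic can be sketched in Python: standardize the variable, take absolute z-scores, and count how many cases fall in each band. This is an illustrative equivalent, not a translation checked against SPSS. The `day2` values are hypothetical, and `None` stands in for a system-missing value.

```python
from collections import Counter
from statistics import mean, stdev

def outlier_category(z):
    """Band an absolute z-score using the cut-offs 1.96, 2.58 and 3.29."""
    size = abs(z)
    if size >= 3.29:
        return 4
    if size >= 2.58:
        return 3
    if size >= 1.96:
        return 2
    return 1

def outlier_frequencies(values):
    """Count cases per band; missing values (None) are left out."""
    observed = [v for v in values if v is not None]
    m, s = mean(observed), stdev(observed)  # sample SD (n - 1)
    return Counter(outlier_category((v - m) / s) for v in observed)

# Hypothetical scores: 19 zeros, one missing value, one very large score.
day2 = [0] * 19 + [None] + [10]
frequencies = outlier_frequencies(day2)
for category in sorted(frequencies):
    print(category, frequencies[category])
```

In a normal distribution roughly 5% of cases are expected beyond 1.96, 1% beyond 2.58 and almost none beyond 3.29, so the counts are usually compared with those expectations.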
outlier1
                                                      Frequency  Percent  Valid Percent  Cumulative Percent
Valid    1.00 Absolute z-score less than 2            246        30.4     93.2           93.2
         2.00 Absolute z-score greater than 1.96      12         1.5      4.5            97.7
         3.00 Absolute z-score greater than 2.58      4          .5       1.5            99.2
         4.00 Absolute z-score greater than 3.29      2          .2       .8             100.0
         Total                                        264        32.6     100.0
Missing  System                                       546        67.4
Total                                                 810        100.0
Missing Data
• Look for patterns (MVA next slide) and/or reason why missing
• Can compare missing data subjects to non-missing data subjects
• Can delete/ignore or Impute based on model
• Goal is to:
• Minimize Bias
• Maximize utilization of information (data=$$)
• Get good estimates of uncertainty
• Censoring (survival analysis – “loss to follow-up”)
• [SIDE NOTE: SPSS missing – strings vs. numeric data types]
NOTE: Missing Data Imputation – to be discussed in another lecture
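Comparing subjects with missing data to subjects without, as suggested above, can be sketched with a separate-variance (Welch) t test. The records below are hypothetical, and the variable names are borrowed from the lecture's example only for flavor. The function returns the t statistic and the Welch-Satterthwaite degrees of freedom, leaving the p-value to a statistics package.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(present, missing):
    """Separate-variance t statistic and degrees of freedom."""
    n1, n2 = len(present), len(missing)
    v1, v2 = variance(present) / n1, variance(missing) / n2
    t = (mean(present) - mean(missing)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical cases: income is missing (None) for the last three.
records = [
    {"attdrug": 7, "income": 4}, {"attdrug": 8, "income": 2},
    {"attdrug": 9, "income": 6}, {"attdrug": 8, "income": 3},
    {"attdrug": 7, "income": 5}, {"attdrug": 9, "income": 1},
    {"attdrug": 8, "income": None}, {"attdrug": 9, "income": None},
    {"attdrug": 10, "income": None},
]

# Split attdrug scores by whether income is present or missing.
present = [r["attdrug"] for r in records if r["income"] is not None]
missing = [r["attdrug"] for r in records if r["income"] is None]

t, df = welch_t(present, missing)
print("t =", round(t, 2), "df =", round(df, 1))
```

A non-significant difference on the other variables suggests the cases with missing values do not differ systematically from the rest on those variables.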
Missing Data
MVA
  VARIABLES = timedrs attdrug atthouse income emplmnt mstatus race
  /TTEST PROB PERCENT=5
  /MPATTERN
  /EM.

Univariate Statistics
                                            Missing           No. of Extremes(a,b)
            N      Mean     Std. Deviation  Count  Percent    Low    High
timedrs     465    7.90     10.948          0      .0         0      34
attdrug     465    7.69     1.156           0      .0         0      0
atthouse    464    23.54    4.484           1      .2         4      0
income      439    4.21     2.419           26     5.6        0      0
emplmnt     465    .47      .500            0      .0         0      0
mstatus     465    1.78     .416            0      .0         .      .
race        465    1.09     .284            0      .0         .      .
a. Number of cases outside the range (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
b. . indicates that the inter-quartile range (IQR) is zero.

Separate Variance t Tests(a)
Indicator variable: income
                  timedrs  attdrug  atthouse  income  emplmnt  mstatus  race
t                 .2       -1.1     -.2       .       -1.1     -1.0     -.4
df                32.2     29.6     28.6      .       28.0     29.0     27.3
P(2-tail)         .846     .289     .851      .       .279     .346     .662
# Present         439      439      438       439     439      439      439
# Missing         26       26       26        0       26       26       26
Mean(Present)     7.92     7.67     23.53     4.21    .46      1.77     1.09
Mean(Missing)     7.62     7.88     23.69     .       .58      1.85     1.12
For each quantitative variable, pairs of groups are formed by indicator
variables (present, missing).
a. Indicator variables with less than 5% missing are not displayed.

None are significant (as compared to “INCOME”)
Missing Patterns (cases with missing values)
Missing and Extreme Value Patterns(a)
Case  # Missing  % Missing  emplmnt  timedrs  attdrug  mstatus  race  atthouse  income
52    1          14.3                                                           S
64    1          14.3                +                                          S
69    1          14.3                                                           S
77    1          14.3                                                           S
118   1          14.3                                                           S
135   1          14.3                                                           S
161   1          14.3                                                           S
172   1          14.3                                                           S
173   1          14.3                                                           S
174   1          14.3                                                           S
181   1          14.3                                                           S
196   1          14.3                +                                          S
203   1          14.3                +                                          S
236   1          14.3                                                           S
240   1          14.3                                                           S
258   1          14.3                +                                          S
304   1          14.3                                                           S
321   1          14.3                                                           S
325   1          14.3                                                           S
352   1          14.3                                                           S
378   1          14.3                                                           S
379   1          14.3                                                           S
409   1          14.3                +                                          S
419   1          14.3                                                           S
421   1          14.3                                                           S
435   1          14.3                +                                          S
253   1          14.3                +                                S
- indicates an extreme low value, while + indicates an extreme
high value. The range used is (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
a. Cases and variables are sorted on missing patterns.
Gender
                    Frequency  Percent  Valid Percent
Valid   (blank)     1          .6       .6
        F           95         53.4     53.4
        m           1          .6       .6
        M           81         45.5     45.5
        Total       178        100.0    100.0

We fixed “m” but what about
the subject with no gender
(missing)?

Gender (string)
                    Frequency  Percent  Valid Percent
Valid   (blank)     1          .6       .6
        F           95         53.4     53.4
        M           82         46.1     46.1
        Total       178        100.0    100.0

Statistics
                    Gender     GenderNum
N   Valid           178        177
    Missing         0          1
Mean                           1.4633
Median                         1.0000
Mode                           1.00

GenderNum (numeric)
                    Frequency  Percent  Valid Percent
Valid    1.00       95         53.4     53.7
         2.00       82         46.1     46.3
         Total      177        99.4     100.0
Missing  System     1          .6
Total               178        100.0
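The string-versus-numeric point can be illustrated with a short Python sketch. The counts (95 "F", 81 "M", one lowercase "m", one blank) come from the frequency tables above, and the coding 1 = Female, 2 = Male follows the lecture's GenderNum variable. The `recode_gender` helper itself is hypothetical.

```python
from collections import Counter

def recode_gender(raw):
    """Return 1 for female, 2 for male, None (missing) for anything else."""
    cleaned = (raw or "").strip().upper()   # fixes case, e.g. "m" becomes "M"
    return {"F": 1, "M": 2}.get(cleaned)

# Raw string values with the counts shown in the frequency table.
raw_gender = ["F"] * 95 + ["M"] * 81 + ["m", ""]
gender_num = [recode_gender(g) for g in raw_gender]

string_counts = Counter(raw_gender)     # the blank is counted as a "valid" string
numeric_counts = Counter(gender_num)    # the blank becomes missing (None)
valid = [g for g in gender_num if g is not None]

print("string:", dict(string_counts))
print("numeric:", dict(numeric_counts))
print("valid N:", len(valid), "mean:", round(sum(valid) / len(valid), 4))
```

A blank string counts as a valid value, while the numeric recode turns it into a missing value. That is why the valid N drops from 178 to 177.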
Case Processing Summary (GenderNum, numeric) – 1 case not counted!!
                                              Cases
                                   Included        Excluded        Total
                                   N    Percent    N    Percent    N    Percent
AnemicGroup If Anemic, which
group * GenderNum Gender
Recoded Numeric * AgeGroupHbg
Age Group for Hemoglobin Levels    26   14.6%      152  85.4%      178  100.0%

Case Processing Summary (Gender, string) – Case correctly counted
                                              Cases
                                   Included        Excluded        Total
                                   N    Percent    N    Percent    N    Percent
AnemicGroup If Anemic, which
group * AgeGroupHbg Age Group
for Hemoglobin Levels * Gender
Gender String Variable             27   15.2%      151  84.8%      178  100.0%

Case Summaries
AnemicGroup If Anemic, which group
GenderNum Gender    AgeGroupHbg Age Group
Recoded Numeric     for Hemoglobin Levels          N     % of Total N
1.00 Female         2.00 Age 5 to less than 8      6     23.1%
                    3.00 Age 8 to less than 12     5     19.2%
                    4.00 Age 12 to less than 15    3     11.5%
                    Total                          14    53.8%
2.00 Male           1.00 Age 2 to less than 5      1     3.8%
                    2.00 Age 5 to less than 8      1     3.8%
                    3.00 Age 8 to less than 12     7     26.9%
                    4.00 Age 12 to less than 15    3     11.5%
                    Total                          12    46.2%
Total               1.00 Age 2 to less than 5      1     3.8%
                    2.00 Age 5 to less than 8      7     26.9%
                    3.00 Age 8 to less than 12     12    46.2%
                    4.00 Age 12 to less than 15    6     23.1%
                    Total                          26    100.0%

Case Summaries
AnemicGroup If Anemic, which group
AgeGroupHbg Age Group          Gender Gender
for Hemoglobin Levels          String Variable    N     % of Total N
1.00 Age 2 to less than 5      M                  1     3.7%
                               Total              1     3.7%
2.00 Age 5 to less than 8      F                  6     22.2%
                               M                  1     3.7%
                               Total              7     25.9%
3.00 Age 8 to less than 12     (blank)            1     3.7%
                               F                  5     18.5%
                               M                  7     25.9%
                               Total              13    48.1%
4.00 Age 12 to less than 15    F                  3     11.1%
                               M                  3     11.1%
                               Total              6     22.2%
Total                          (blank)            1     3.7%
                               F                  14    51.9%
                               M                  12    44.4%
                               Total              27    100.0%

This was an interesting case, as the designation of “anemia” depends
only on age if less than 12, but on both age and gender if
12 or older. [Our missing gender was 8 yrs old.]
Correlations
Measures of Correlation
• [Parametric] R and R2 (X vs Y or X1 vs X2):
R = Pearson's correlation coefficient
• [Non-parametric] Spearman's rho and Kendall's tau-b – both
based on rank (see SPSS Help for further details)

Correlations
                                     comfort   role      involvement
comfort      Pearson Correlation     1         .341**    .162
             Sig. (2-tailed)                   .004      .165
             N                       76        68        75
role         Pearson Correlation     .341**    1         .381**
             Sig. (2-tailed)         .004                .001
             N                       68        68        67
involvement  Pearson Correlation     .162      .381**    1
             Sig. (2-tailed)         .165      .001
             N                       75        67        75
**. Correlation is significant at the 0.01 level (2-tailed).

Correlations
                                                      comfort   role      involvement
Kendall's tau_b  comfort      Correlation Coefficient  1.000     .361**    .117
                              Sig. (2-tailed)          .         .000      .149
                              N                        76        68        75
                 role         Correlation Coefficient  .361**    1.000     .158
                              Sig. (2-tailed)          .000      .         .067
                              N                        68        68        67
                 involvement  Correlation Coefficient  .117      .158      1.000
                              Sig. (2-tailed)          .149      .067      .
                              N                        75        67        75
Spearman's rho   comfort      Correlation Coefficient  1.000     .506**    .169
                              Sig. (2-tailed)          .         .000      .146
                              N                        76        68        75
                 role         Correlation Coefficient  .506**    1.000     .227
                              Sig. (2-tailed)          .000      .         .065
                              N                        68        68        67
                 involvement  Correlation Coefficient  .169      .227      1.000
                              Sig. (2-tailed)          .146      .065      .
                              N                        75        67        75
**. Correlation is significant at the 0.01 level (2-tailed).
Correlations
For comfort vs. role, the Pearson correlation is R = .341 (significant at the 0.01 level, 2-tailed; N = 68), so
R2 = (0.341)2 = .116
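As a hedged illustration of how these coefficients are obtained, the sketch below computes Pearson's R, R squared, and Spearman's rho (Pearson's R applied to ranks, with tied values given their average rank) in plain Python. The `x` and `y` lists are hypothetical. SPSS additionally handles missing values pair by pair, which this sketch does not attempt.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson's R computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

# Hypothetical data: y rises with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

r = pearson_r(x, y)
print("Pearson R:", round(r, 3), "R squared:", round(r ** 2, 3))
print("Spearman rho:", round(spearman_rho(x, y), 3))
print("R squared for R = .341:", round(0.341 ** 2, 3))
```

R squared is read as the proportion of variance the two variables share, so R = .341 corresponds to roughly 12% shared variance.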
Checklist for Data Screening
1. Inspect univariate descriptive stats – check for
data accuracy/discrepancies
a) Out-of-range values
b) Plausible means and standard deviations
c) Univariate outliers
2. Evaluate amount and patterns of missing data
3. Check pairwise plots for nonlinearity and
heteroscedasticity [REGRESSION]
4. Identify and deal with nonnormal variables and
univariate outliers
a) Check skewness and kurtosis and
probability plots
b) Perform transforms (if desired)
c) Check results of transformation
5. Identify multivariate outliers [REGRESSION]
6. Evaluate variables for multicollinearity and
singularity [REGRESSION]
What do I do Now? – A Decision Tree for
Picking Statistical Methods to Use
• Questions to Ask
• Major Research Question?
• Degree of Relationship Among Variables
• Significant Group Difference
• Prediction of Group Membership
• Structure
• Time/Course of Events
• Number & Kind of Dependent Variables
• Single vs Multiple & Discrete vs Continuous
• Number & Kind of Independent Variables
• Single vs Multiple & Discrete vs Continuous
• Covariates? [yes/no]
• Decision Tree Yields Analytic Strategy and Goal of Analysis
Tabachnick, B.G. and Fidell, L.S. (2007) Using Multivariate Statistics (5th Ed.). New York: Pearson Education, Inc.
“How to talk to a Statistician”
• List of Hypotheses/Aims (end goals)
• List of Variables
• Type, Measure (numeric, string, date/time, scales,
categorical)
• Independent, covariates, dependent (outcomes)
• Names, Labels and Values [consistency (q1,q2,q3,…,
item01,item02,…), length, consider graphics]
• Model (hypothesized, general idea – theoretical concerns)
• Graphics/figures/tables requested (reports, posters, grants)
• POWER – idea on “effect size” (how big a change do you
hope to see) – clinical significance, prior results?
VIII. Statistical Resources and Contact Info
SON S:\Shared\Statistics_MKHiggins\website2\index.htm
[updates in process]
Working to include tip sheets (for SPSS, SAS, and other software),
lectures (PPTs and handouts), datasets, other resources and
references
Statistics At Nursing Website: [website being updated]
http://www.nursing.emory.edu/pulse/statistics/
And Blackboard Site (in development) for
“Organization: Statistics at School of Nursing”
Contact
Dr. Melinda Higgins
[email protected]
Office: 404-727-5180 / Mobile: 404-434-1785