Assumption and Data Transformation
Problems in Data Analyses

General cases in data analysis:
• Assumption distortions
• Missing data
General Assumptions of ANOVA
• The error terms are random and normally distributed; the populations
  (for each condition) are normally distributed
• The variances of the different populations are homogeneous
  (homoscedasticity); the populations (for each condition) have equal
  variances
• The variances and means of the different populations are not
  correlated (independent)
• The main effects are additive
CRD ANOVA F-Test Assumptions
• Randomness and normality
• Homogeneity of variance
• Independence of errors
• Additivity
Randomized Block F Test
Assumptions
1. Normality
Populations are normally distributed
2. Homogeneity of Variance
Populations have equal variances
3. Independence of Errors
Independent random samples are drawn
4. The main effects are additive
5. No Interaction Between Blocks & Treatments
Random, Independent and Normally Distributed Errors
• The assumption of normality does not affect the validity of the
  analysis of variance too seriously
• There are tests for normality, but it is rather pointless to apply
  them unless the number of samples we are dealing with is fairly large
• Independence implies that there is no relation between the size of
  the error terms and the experimental grouping to which they belong
• It is important to avoid having all plots receiving a given treatment
  occupying adjacent positions in the field
• The best insurance against seriously violating the first assumption
  of the analysis of variance is to carry out the randomization
  appropriate to the particular design
Normality

Reason:
• ANOVA is an analysis of variance
• More specifically, it is an analysis of two variances: the ratio of
  two variances
• Statistical inference is based on the F distribution, which is given
  by the ratio of two chi-squared distributions
• It is therefore no surprise that each variance in the ANOVA ratio
  comes from a parent normal distribution

The calculations can always be carried out no matter what the
distribution is; they are algebraic operations that separate sums of
squares. Normality is only needed for statistical inference.
Diagnosis: Normality
• The points on the normality plot must more or less follow a line to
  claim that the data are "normally distributed".
• There are statistical tests to verify this formally.
• The ANOVA method we learn here is not sensitive to the normality
  assumption; that is, a mild departure from the normal distribution
  will not change our conclusions much.
Normality plot: normal scores vs. residuals
Normality Tests

A wide variety of tests can be performed to test whether the data
follow a normal distribution. Mardia (1980) provides an extensive list
for both the univariate and multivariate cases, categorized into two
types:
• Tests based on properties of the normal distribution, more
  specifically the first four moments of the normal distribution
  - Shapiro-Wilk's W (compares the ratio of the standard deviation to
    the variance, multiplied by a constant, to one)
  - Lilliefors-Kolmogorov-Smirnov test
  - Graphical methods based on the residual errors (residual plots)
• Goodness-of-fit tests
  - Kolmogorov-Smirnov D
  - Cramér-von Mises W²
  - Anderson-Darling A²
Checking for Normality
Reminder:
Normality of the RESIDUALS is assumed. The original data are
assumed normal also, but each group may have a different mean if
Ha is true. Practice is to first fit the model, THEN output the
residuals, then test for normality of the residuals. This APPROACH
is always correct.
TOOLS
1. Histogram and/or box-plot of all residuals (eij).
2. Normal probability (Q-Q) plot.
3. Formal test for normality.
Histogram of Residuals
proc glm data=stress;
  class sand;
  model resistance = sand / solution;
  output out=resid r=r_resis p=p_resis;
  title1 'Compression resistance in concrete beams as';
  title2 'a function of percent sand in the mix';
run;

proc capability data=resid;
  histogram r_resis / normal;
  ppplot r_resis / normal square;
run;
Formal Tests of Normality
• Kolmogorov-Smirnov test; Anderson-Darling test (both based on the
empirical CDF).
• Shapiro-Wilk’s test; Ryan-Joiner test (both are correlation based
tests applicable for n < 50).
• D’Agostino’s test (n>=50).
All quite conservative – they fail to reject the null
hypothesis of normality more often than they should.
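The formal tests above can be obtained in SAS with PROC UNIVARIATE; a
minimal sketch, assuming the resid dataset with residual r_resis
created by the PROC GLM/OUTPUT step shown earlier:

/* Formal normality tests on the residuals (sketch). Assumes the
   dataset RESID with residual variable R_RESIS from the earlier
   PROC GLM / OUTPUT step. */
proc univariate data=resid normal;
  var r_resis;
  qqplot r_resis / normal(mu=est sigma=est);  /* Q-Q plot with fitted normal line */
run;

The NORMAL option prints the Shapiro-Wilk, Kolmogorov-Smirnov,
Cramér-von Mises and Anderson-Darling statistics with their p-values.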
Shapiro-Wilk's W Test
Let e(1), e(2), ..., e(n) represent the data ranked from smallest to
largest.
H0: The population has a normal distribution.
HA: The population does not have a normal distribution.

  W = (1/d) [ Σ_{j=1..k} a_j ( e(n−j+1) − e(j) ) ]²

where
  d = Σ_{i=1..n} ( e(i) − ē )²
  k = n/2        if n is even
  k = (n−1)/2    if n is odd

The coefficients a_j come from a table.
R.R.: Reject H0 if W < W_0.05 (critical values of W_α come from a
table).
[Tables of Shapiro-Wilk coefficients a_j and critical values of W]
D'Agostino's Test
Let e(1), e(2), ..., e(n) represent the data ranked from smallest to
largest.
H0: The population has a normal distribution.
HA: The population does not have a normal distribution.

  D = Σ_{j=1..n} [ j − (n+1)/2 ] e(j) / ( n² s )

  where s = sqrt( (1/n) Σ_{j=1..n} ( e(j) − ē )² )

  Y = ( D − 0.28209479 ) √n / 0.02998598

Two-sided test: reject H0 if Y < Y_0.025 or Y > Y_0.975, where Y_0.025
and Y_0.975 come from a table of percentiles of the Y statistic.
The Consequences of Non-Normality
• The F-test is very robust against non-normal data, especially in a
  fixed-effects model
• A large sample size will approximate normality by the Central Limit
  Theorem (recommended sample size > 50)
• Simulations have shown that unequal sample sizes between treatment
  groups magnify any departure from normality
• A large deviation from normality leads to hypothesis test conclusions
  that are too liberal and to a decrease in power and efficiency
Remedial Measures for Non-Normality

Data transformation
• Be aware: transformations may lead to a fundamental change in the
  relationship between the dependent and the independent variable, and
  they are not always recommended.

Don't use the standard F-test.
• Modified F-tests
  - Adjust the degrees of freedom
  - Rank F-test (capitalizes on the F-test's robustness; see the sketch
    after this list)
• Randomization test on the F-ratio
• Other non-parametric tests if the distribution is unknown
• Make up our own test using a likelihood ratio if the distribution is
  known
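A minimal sketch of the rank F-test, assuming the same hypothetical
stress dataset (response resistance, factor sand) used in the earlier
examples: rank the response, then run the usual ANOVA on the ranks.

/* Rank F-test sketch: replace the response by its ranks and run the
   usual one-way ANOVA. Assumes the hypothetical dataset STRESS with
   response RESISTANCE and factor SAND. */
proc rank data=stress out=ranked;
  var resistance;
  ranks rank_resis;        /* rank-transformed response */
run;

proc glm data=ranked;
  class sand;
  model rank_resis = sand; /* F-test computed on the ranks */
run;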
Homogeneity of Variances

Eisenhart (1947) describes the problem of unequal variances as follows:
• the ANOVA model is based on the ratio of the mean squares of the
  factors to the residual mean square
• the residual mean square is the unbiased estimator of σ², the
  variance of a single observation
• the between-treatment mean square takes into account not only the
  differences between observations, σ², just like the residual mean
  square, but also the variance between treatments
• if there is non-constant variance among treatments, the residual mean
  square can be replaced with some overall variance, σ_a², and a
  treatment variance, σ_t², which is some weighted version of σ_a²
• the "neatness" of ANOVA is lost
Homogeneity of Variances
• The overall F-test is very robust against heterogeneity of variances,
  especially with fixed effects and equal sample sizes.
• Tests for treatment differences, like t-tests and contrasts, are
  severely affected, resulting in inferences that may be too liberal or
  too conservative
• Unequal variances can have a marked effect on the level of the test,
  especially if smaller sample sizes are associated with groups having
  larger variances
• Unequal variances will lead to biased conclusions
Ways to Solve the Problem of Heterogeneous Variances
• The data can be separated into groups such that the variances within
  each group are homogeneous
• A more advanced statistical test can be used rather than the analysis
  of variance
• Transform the data in such a way that the data will be homogeneous
Tests for Homogeneity of Variances
• Bartlett's Test
• Levene's Test
  Computes a one-way ANOVA on the absolute value (or sometimes the
  square) of the residuals, |y_ij − ŷ_i|, with t − 1 and N − t degrees
  of freedom
  Considered robust to departures from normality, but too conservative
• Brown-Forsythe Test
  A slight modification of Levene's test, where the median is
  substituted for the mean (Kuehl (2000) refers to it as the Levene
  (med) test)
• The Fmax Test (Hartley Test)
  Ratio of the largest variance of the treatment groups to the
  smallest, compared to a table of critical values
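Several of these tests are available in SAS through the HOVTEST option
of the MEANS statement in PROC GLM; a minimal sketch, again assuming
the hypothetical stress dataset:

/* Homogeneity-of-variance test for a one-way layout (sketch).
   HOVTEST=LEVENE(TYPE=ABS) requests Levene's test on |residuals|;
   HOVTEST=BF and HOVTEST=BARTLETT would request the Brown-Forsythe
   and Bartlett tests instead. */
proc glm data=stress;
  class sand;
  model resistance = sand;
  means sand / hovtest=levene(type=abs);
run;

As the warning later in this section notes, HOVTEST is only available
for unweighted one-way models.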
Bartlett's Test
Bartlett's test allows unequal replication, but requires normality.

  C = [ Σ_{i=1..t} (n_i − 1) ] log_e( s̄² ) − Σ_{i=1..t} (n_i − 1) log_e( s_i² )

where the pooled variance is

  s̄² = Σ_{i=1..t} (n_i − 1) s_i² / Σ_{i=1..t} (n_i − 1)

If C > χ²_{α, t−1}, apply the correction factor

  CF = 1 + [ 1 / (3(t − 1)) ] [ Σ_{i=1..t} 1/(n_i − 1) − 1 / Σ_{i=1..t} (n_i − 1) ]

Reject H0 if C/CF > χ²_{α, t−1}.
More work, but better power.
Levene's Test
More work, but a more powerful result.
Let z_ij = | y_ij − ỹ_i |, where ỹ_i is the sample median of the i-th
group.

  L = [ Σ_{i=1..t} n_i ( z̄_i. − z̄.. )² / (t − 1) ]
      / [ Σ_{i=1..t} Σ_{j=1..n_i} ( z_ij − z̄_i. )² / (n_T − t) ]

where n_T = Σ_{i=1..t} n_i.

Reject H0 if L > F_{α, df1, df2}, with df1 = t − 1 and df2 = n_T − t.
This is essentially an ANOVA on the z_ij (see the sketch below).
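A minimal sketch of computing this "ANOVA on the z_ij" by hand in SAS,
assuming the hypothetical stress dataset: compute each group's median,
form the absolute deviations, and run a one-way ANOVA on them.

/* Levene (med) / Brown-Forsythe test computed "by hand":
   one-way ANOVA on z_ij = |y_ij - median_i|. Assumes the hypothetical
   dataset STRESS with response RESISTANCE and factor SAND. */
proc means data=stress noprint nway;
  class sand;
  var resistance;
  output out=med median=med_resis;   /* one row per group with its median */
run;

proc sort data=stress; by sand; run;
proc sort data=med;    by sand; run;

data zdat;
  merge stress med;
  by sand;
  z = abs(resistance - med_resis);   /* absolute deviation from the group median */
run;

proc glm data=zdat;
  class sand;
  model z = sand;                    /* the F-test on z is the Levene (med) test */
run;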
Hartley's Test
A logical extension of the F-test for t = 2.
Requires equal replication, r, among groups, and requires normality.

  Fmax = s²_max / s²_min

Reject H0 if Fmax > F_{α, t, r−1} (from a table of critical values).

• Tabachnik and Fidell (2001) use the Fmax ratio more as a rule of
  thumb than with a table of critical values:
  - the Fmax ratio is no greater than 10, and
  - the sample sizes of the groups are approximately equal (the ratio
    of smallest to largest is no greater than 4)
Tests for Homogeneity of Variances

More importantly:
VARIANCE TESTS ARE ONLY FOR ONE-WAY ANOVA
WARNING:
Homogeneity of variance testing is only available for unweighted
one-way models.
Tests for Homogeneity of Variances
(Randomized Complete Block Design and/or Factorial Design)
• In a CRD, the variance of each treatment group is checked for
  homogeneity
• In a factorial/RCBD, each cell's variance should be checked
  H0: σ²_ij = σ²_i'j' for all i, j where i ≠ i', j ≠ j'
Tests for Homogeneity of Variances
(Latin-Square/Split-Plot Designs)
• If there is only one score per cell, homogeneity of variances needs
  to be shown for the marginals of each column and each row
  - Each factor for a Latin square
  - Whole plots and subplots for a split-plot
• If there are repetitions, homogeneity is to be shown within each
  cell, as in the RCBD
Remedial Measures for Heterogeneous Variances

Studies that do not involve repeated measures:
• If normality is violated, the data transformation necessary to
  normalize the data will usually stabilize the variances as well
• If the variances are still not homogeneous, non-ANOVA tests might be
  an option
Transformations to Achieve Homoscedasticity
What can we do if the homoscedasticity (equal variances) assumption is
rejected?
1. Declare that the ANOVA model is not an adequate model for the data;
   look for alternative models.
2. Try to "cheat" by forcing the data to be homoscedastic through a
   transformation of the response variable Y (variance-stabilizing
   transformations).
Independence
Lack of independence is a special case and the most common cause of
heterogeneity of variance.
• Independent observations
  - No correlation between error terms
  - No correlation between the independent variables and the errors
• Positively correlated data inflate the standard error
  - The estimates of the treatment means are more accurate than the
    standard error shows.
Independence Tests
• If we know how the data were collected, we can check whether any
  autocorrelation exists.
• The Durbin-Watson statistic looks at the correlation of each value
  with the value before it
  - The data must be sorted in the correct order for meaningful results
  - For example, samples collected at the same time would be ordered by
    time if we suspect the results could depend on time
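A minimal sketch of requesting the Durbin-Watson statistic in SAS,
assuming a hypothetical dataset field with response yield, a numeric
predictor dose, and a collection-order variable collect_time:

/* Durbin-Watson statistic sketch. The DW option on the MODEL statement
   of PROC REG prints the statistic; observations must first be sorted
   in the order in which they were collected. Dataset FIELD and the
   variables YIELD, DOSE and COLLECT_TIME are hypothetical. */
proc sort data=field;
  by collect_time;
run;

proc reg data=field;
  model yield = dose / dw;   /* DW = Durbin-Watson statistic for the residuals */
run;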
Independence
• A positive correlation between means and variances is often
  encountered when there is a wide range of sample means
• Data that often show a relation between variances and means are data
  based on counts and data consisting of proportions or percentages
• Transforming the data can frequently solve these problems
Remedial Measures for Dependent Data
• The first defense against dependent data is proper study design and
  randomization
  - Designs can be implemented that take the correlation into account,
    e.g., a crossover design
• Look for environmental factors that are unaccounted for
  - Add covariates to the model if they are causing the correlation,
    e.g., quantified learning curves
• If no underlying factors can be found that account for the
  autocorrelation
  - Use a different model, e.g., a random effects model (see the sketch
    after this list)
  - Transform the independent variables using the correlation
    coefficient
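A minimal sketch of a random effects (mixed) model in SAS, assuming a
hypothetical RCBD dataset rcbd with response yield, treatment trt, and
blocks block; treating the blocks as random models the within-block
correlation rather than ignoring it.

/* Mixed model sketch: treatment fixed, block random, so observations
   in the same block are allowed to be correlated. Dataset RCBD and
   variables YIELD, TRT, BLOCK are hypothetical. */
proc mixed data=rcbd;
  class trt block;
  model yield = trt;     /* fixed treatment effect */
  random block;          /* random block effect induces within-block correlation */
run;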
The Main Effects Are Additive
• For each design, there is a mathematical model called a linear
  additive model
• It means that the value of an experimental unit is made up of a
  general mean plus main effects plus an error term
• When the effects are not additive, there are multiplicative treatment
  effects
• In the case of multiplicative treatment effects, there are again
  transformations that will change the data to fit the additive model
Data Transformation
• There are two ways in which the ANOVA assumptions can be violated:
  1. The data may consist of measurements on an ordinal or a nominal
     scale
  2. The data may not satisfy at least one of the four requirements
• Two options are available to analyze such data:
  1. Use non-parametric data analysis
  2. Transform the data before analysis
Square Root Transformation
• It is used when we are dealing with counts of rare events
• The data tend to follow a Poisson distribution
• If there are counts less than 10, it is better to add 0.5 to the
  values
Square Root Transformation
The response is positive and continuous.

  z_i = sqrt( y_i )

This transformation works when we notice that the variance changes as a
linear function of the mean:

  σ_i² = k μ_i,   k > 0

• Useful for count data (Poisson distributed).
• For small values of Y, use sqrt(Y + 0.5).

Typical use: counts of items when the counts are between 0 and 10.

[Figure: scatter plot of sample variance vs. sample mean, showing a
linear relationship]
Logarithmic Transformation
• It is used when the standard deviations of the samples are roughly
  proportional to the means
• There is evidence of multiplicative rather than additive effects
• Data with negative values or zeros cannot be transformed; it is
  suggested to add 1 before the transformation
Logarithmic Transformation
The response is positive and continuous.

  Z = ln( Y )

This transformation tends to work when the variance is a linear
function of the square of the mean:

  σ_i² = k μ_i²,   k > 0

• Replace Y by Y + 1 if zeros occur.
• Useful if effects are multiplicative (see later).
• Useful if there is considerable heterogeneity in the data.

Typical uses:
1. Growth over time.
2. Concentrations.
3. Counts when the counts are greater than 10.

[Figure: scatter plot of sample variance vs. sample mean, with variance
increasing with the square of the mean]
Arcsine (Angular) Transformation
• It is used when we are dealing with counts expressed as percentages
  or proportions of the total sample
• Such data generally have a binomial distribution
• Such data normally show the typical characteristic that the variances
  are related to the means
Arcsine Square Root Transformation
The response is a proportion.

  Z = sin⁻¹( √Y ) = arcsin( √Y )

With proportions, the variance is a linear function of the mean times
(1 − mean), where the sample mean is the expected proportion:

  σ_i² = k μ_i ( 1 − μ_i )

• Y is a proportion (a decimal between 0 and 1).
• Zero counts should be replaced by 1/4, and N by N − 1/4, before
  converting to proportions.

Typical uses:
1. Proportion of seeds germinating.
2. Proportion responding.
Reciprocal Transformation
The response is positive and continuous.

  Z = 1 / Y

This transformation works when the variance is a linear function of the
fourth power of the mean:

  σ_i² = k μ_i⁴

• Use Y + 1 if zeros occur.
• Useful if the reciprocal of the original scale has meaning.

Typical use: survival time.
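The four variance-stabilizing transformations above can be applied in a
single DATA step; a minimal sketch, assuming a hypothetical dataset raw
with a count y, a proportion p, and a survival time surv:

/* Variance-stabilizing transformations (sketch). Dataset RAW and the
   variables Y (count), P (proportion) and SURV (survival time) are
   hypothetical. */
data transformed;
  set raw;
  sq_y   = sqrt(y + 0.5);     /* square root; +0.5 for small counts   */
  log_y  = log(y + 1);        /* natural log; +1 to allow zero counts */
  asin_p = arsin(sqrt(p));    /* arcsine square root for proportions  */
  rec_t  = 1 / surv;          /* reciprocal, e.g., for survival times */
run;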
Box-Cox Transformations (advanced)
The suggested transformation is

  z_i = ( y_i^λ − 1 ) / ( λ g^(λ−1) )   if λ ≠ 0
  z_i = g ln( y_i )                      if λ = 0

where g = exp( (1/n) Σ_{i=1..n} ln(y_i) ) is the geometric mean of the
original data.

The exponent λ is unknown. Hence the model can be viewed as having an
additional parameter, which must be estimated: choose the value of λ
that minimizes the residual sum of squares.
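In SAS, the Box-Cox family can be explored with PROC TRANSREG, which
profiles λ over a grid and flags the value minimizing the residual sum
of squares; a minimal sketch, again assuming the hypothetical stress
dataset:

/* Box-Cox sketch: PROC TRANSREG profiles lambda over the requested
   grid and reports the best value (with a confidence interval).
   Dataset STRESS, response RESISTANCE and factor SAND are
   hypothetical. */
proc transreg data=stress;
  model boxcox(resistance / lambda=-2 to 2 by 0.25) = class(sand);
run;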
General cases in data analysis:
• Assumption distortions
• Missing data
Missing Data
Missing observations are observations that were intended to be made but
were not. Reasons for missing data:
• An animal may die
• An experimental plot may be flooded out
• A worker may be ill and not turn up on the job
• A jar of jelly may be dropped on the floor
• The recorded data may be lost
Since most experiments are designed with at least some degree of
balance/symmetry, any missing observations will destroy that balance.
Missing Data
• In the presence of missing data, the research goal remains making
  inferences that apply to the population targeted by the complete
  sample, i.e. the goal remains what it would have been had we seen the
  complete data.
• However, both making inferences and performing the analysis are now
  more complex.
• We must make assumptions in order to draw inferences, and then use an
  appropriate computational approach for the analysis.
• Consider the causes and pattern of the missing data in order to make
  appropriate changes to the planned analysis of the data.
Missing Data
• Avoid adopting computationally simple solutions (such as analyzing
  only the complete data or carrying forward the last observation in a
  longitudinal study), which generally lead to misleading inferences.
• In a one-factor experiment, the data analysis can be carried out with
  a good estimated value, but a factorial experiment theoretically
  cannot be analyzed this way.
• In a one-factor CRD experiment, if there are missing data, the data
  can be analyzed with unequal replication numbers.
• In a one-factor RCBD experiment, if 1-2 complete blocks or treatments
  are missing but at least 2 blocks remain complete, the data analysis
  can simply proceed.
Missing Data
• In a one-factor RCBD/LS experiment, if there are 1-2 missing
  observations in a block or treatment, the data can be handled by:
  a. the appropriate method for unequal frequencies, or
  b. estimating the unknown value from the observed data
• The estimate of the missing observation is most frequently the value
  that minimizes the experimental error sum of squares when the regular
  analysis is performed
Missing Values in RCBDs
• Missing values (generally) result in a loss of orthogonality
• A single missing value can be imputed
  - The missing cell (y_i*j* = x) can be estimated by profile least
    squares:

  Ŷ_ij = ( tT + bB − S ) / [ (t − 1)(b − 1) ]

  where:
  - t = number of treatments
  - b = number of blocks
  - T = sum of the observations with the same treatment as the missing
    observation
  - B = sum of the observations in the same block as the missing
    observation
  - S = grand total of all the observed values
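A minimal sketch of computing this estimate directly, with hypothetical
numbers (t = 4 treatments, b = 3 blocks, and made-up totals T, B and
S):

/* Imputing a single missing RCBD value with the profile least squares
   formula. All numbers below are hypothetical, for illustration only. */
data impute;
  t = 4;        /* number of treatments                                   */
  b = 3;        /* number of blocks                                       */
  T_trt = 42;   /* sum of observed values for the missing obs' treatment  */
  B_blk = 57;   /* sum of observed values in the missing obs' block       */
  S_tot = 230;  /* grand total of all observed values                     */
  y_hat = (t*T_trt + b*B_blk - S_tot) / ((t - 1)*(b - 1));
  put 'Estimated missing value: ' y_hat;
run;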
Imputation
• The error df should be reduced by one, since M was estimated
• SAS can compute the F statistic, but the p-value will have to be
  computed separately
• The method is efficient only when a couple of cells are missing
• The usual Type III analysis is available, but be careful with the
  interpretation
• Little and Rubin use MLE and simulation-based approaches
• PROC MI in SAS v9 implements the Little and Rubin approaches (see the
  sketch below)
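A minimal sketch of multiple imputation with PROC MI, assuming a
hypothetical dataset trial with missing values in the numeric variables
y, x1 and x2:

/* Multiple imputation sketch. PROC MI creates NIMPUTE completed copies
   of the data, stacked in the output dataset and indexed by
   _IMPUTATION_; each copy is then analyzed and the results combined
   (e.g., with PROC MIANALYZE). Dataset TRIAL and variables Y, X1, X2
   are hypothetical. */
proc mi data=trial out=trial_mi nimpute=5 seed=20240101;
  var y x1 x2;
run;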
Missing Data in Latin Squares
• If only one plot is missing, you can use the following formula:

  Ŷ_ij(k) = [ t( R_i + C_j + T_k ) − 2G ] / [ (t − 1)(t − 2) ]

  where:
  - R_i = sum of the remaining observations in the i-th row
  - C_j = sum of the remaining observations in the j-th column
  - T_k = sum of the remaining observations in the k-th treatment
  - G = grand total of the available observations
  - t = number of treatments
• The total and error df must be reduced by 1
• The estimate is used only to obtain a valid ANOVA and should not be
  used in the computation of means
Relative Efficiency
Did blocking increase our precision for comparing treatment means in a
given experiment?

RE(RCB, CR): the relative efficiency of the randomized complete block
design compared to a completely randomized design:

  RE(RCB, CR) = MSE_CR / MSE_RCB
              = [ (b − 1) MSB + b(t − 1) MSE ] / [ (bt − 1) MSE ]

where MSE and MSB come from the RCB analysis.

If RE(RCB, CR) > 1, then blocking is efficient, because many more
observations would be required in a CRD than in the RCB (see the sketch
at the end of this section).
Relative Efficiency of LS
• To compare with an RBD using columns as blocks:

  RE = [ MSR + (t − 1) MSE ] / ( t MSE )

• To compare with an RBD using rows as blocks:

  RE = [ MSC + (t − 1) MSE ] / ( t MSE )

• To compare with a CRD:

  RE = [ MSR + MSC + (t − 1) MSE ] / [ (t + 1) MSE ]
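A minimal sketch of computing RE(RCB, CR) from an RCB ANOVA table, with
hypothetical mean squares and design sizes; the LS efficiencies above
can be computed the same way from MSR, MSC and MSE.

/* Relative efficiency of the RCB vs. a CRD, computed from the RCB
   ANOVA table. The mean squares and design sizes below are
   hypothetical. */
data rel_eff;
  b   = 4;      /* number of blocks                     */
  t   = 5;      /* number of treatments                 */
  msb = 12.6;   /* block mean square from the RCB ANOVA */
  mse = 3.2;    /* error mean square from the RCB ANOVA */
  re_rcb_cr = ((b - 1)*msb + b*(t - 1)*mse) / ((b*t - 1)*mse);
  put 'RE(RCB, CR) = ' re_rcb_cr;   /* > 1 means blocking was efficient */
run;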