Statistics
THE BISQUARE WEIGHTED ANALYSIS OF VARIANCE: A TECHNIQUE FOR
NONNORMAL DISTRIBUTIONS
Rebecca Anne Regeth & Wm Wren Stine
University of New Hampshire
The bisquare-weighted analysis of variance
(bANOVA) is a technique that we have developed
for comparing the weighted means of several groups
where the weights are designed to reduce the
influence of, or completely eliminate, outliers (Stine
& Regeth, In preparation). Unlike the analysis of
variance (ANOVA), the bANOVA maintains high
power with nonnormal distributions. It also has
nearly the power of the ANOVA when used with
normal distributions, making the bANOVA especially
useful when it is difficult to tell if the underlying
distribution is normal.
The ANOVA is used to compare the arithmetic
averages, or means, of two or more groups (also
called levels of an independent variable). One of the
assumptions of the ANOVA procedure is that each
group is sampled from a normally distributed
population (Kirk, 1995, p. 99). Unfortunately, normal
distributions are rare. For example, Micceri (1989)
examined 440 data sets and found that 28% were
symmetrical with moderate or heavy tails and 69%
were moderately asymmetric with moderate or heavy
tails.
Heavy-tailed distributions, in particular, are very
common. Bessel (1818, as cited in Hampel,
Ronchetti, Rousseeuw, & Stahel, 1986, p. 22)
examined a data set consisting of errors of
observation of 300 star positions. Although this
distribution has been widely cited as an example of a
normal distribution (e.g., Maxwell & Delaney, 1990,
p. 51), it appears to be heavy-tailed (Hampel et al.,
1986, p. 22; Stine & Regeth, In preparation).
Many statistics books state that the ANOVA is
robust with respect to violations of normality (e.g.,
Hays, 1981, p. 276; Keppel, 1982, pp. 85-86; Kirk,
1995, p. 99). However, there is considerable
evidence to suggest that even for distributions with
slightly heavy tails the procedure shows a drop in
power. Power drops considerably with heavier-tailed
distributions. The main reason that power drops with
heavy-tailed distributions is that the mean is
influenced by outliers.
One reason that previous Monte Carlo studies
have not shown the ANOVA to be non-robust with
respect to violations of normality is that these studies
have not controlled for the inflation of variance when
examining heavy-tailed distributions. We have found
that when the normal distribution is contaminated
with outliers, Type I error rates remain constant, but
Type II error rates increase, decreasing power (Stine
& Regeth, In preparation).
Methods for dealing with heavy-tailed
distributions include using data transformations,
using nonparametric statistics, or using a technique
that includes a measure of central tendency that is
not sensitive to outliers (i.e., the bisquare-weighted
average). The problem with using data
transformations is that the researcher must first
detect the nonnormal distribution (which is quite
difficult). A problem with using nonparametric
techniques (e.g., the rank ANOVA) is that these
procedures lose power when samples are drawn from
moderately heavy-tailed distributions (Stine & Regeth,
In preparation).
The bisquare-weighted ANOVA retains power
with heavy-tailed distributions and has nearly the
power of the ANOVA with normal distributions.
Additionally, the bANOVA can be used "blindly."
The researcher does not need to determine whether
or not the distribution of the underlying population is
normal.
There are three basic steps used in calculating
the bANOVA. The first step is to calculate the
bisquare-weighted average of each group in the
design. Next, each weight is used in order to
calculate a weighted ANOVA. Third, the F-ratio from
the weighted ANOVA is transformed to account for
the change in degrees of freedom that occurs when
the weights are used.
The bisquare-weighted average, in turn, is
calculated using an iterative process. The first
iteration uses the median as the measure of central
tendency (Eq. 1). Deviations from the median are
found for each score in the group. The median of the
absolute values of these deviations is the Median
Absolute Deviation, or MAD (Eq. 2); 1.483 times the
MAD is a robust estimate of the standard deviation
(Hampel et al., 1986, pp. 105 & 107). The weight for
each score is then calculated as a function of that
score's deviation from the median, with the deviation
scaled by 1.483 MAD (Eqs. 3 and 4). These weights
are then used to calculate the weighted average
(Eq. 5).
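$$bw^{(0)}(X_{\cdot j}) = \operatorname{median}\{X_{ij} \mid i = 1, \ldots, n_j\} \tag{1}$$

$$MAD_j = \operatorname{median}\{\, |X_{ij} - \operatorname{median}\{X_{ij}\}| \,\} \tag{2}$$

$$\varepsilon_{ij}^{(k)} = \frac{X_{ij} - bw^{(k-1)}(X_{\cdot j})}{1.483\, MAD_j} \tag{3}$$

$$w_{ij}^{(k)} = \begin{cases} \left[1 - \left(\varepsilon_{ij}^{(k)} / r\right)^2\right]^2 & \text{if } |\varepsilon_{ij}^{(k)}| < r \\ 0 & \text{otherwise} \end{cases} \tag{4}$$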
NESUG '96 Proceedings
$$bw^{(k)}(X_{\cdot j}) = \frac{\sum_{i=1}^{n_j} w_{ij}^{(k)} X_{ij}}{\sum_{i=1}^{n_j} w_{ij}^{(k)}} \tag{5}$$
The bisquare weighting formula (Eq. 4) assigns
weights near 1.0 to scores that are near the current
measure of central tendency. Scores that are far from
it get weights closer to zero. If the absolute value of
a scaled deviation (Eq. 3) reaches the rejection point
(r = 4), that is, if the score lies four or more robust
standard deviations (1.483 MAD) from the measure of
central tendency, the score is given a weight of zero.
This removes the score from the sample.
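As a rough illustration of Eqs. 1 to 5, the following Python sketch iterates the bisquare-weighted average for a single group. Python is used only for illustration; the authors' implementation is the SAS code in the appendix. The function name `bisquare_average`, the `max_iter` safeguard, and the handling of a zero MAD are assumptions that do not come from the paper. The defaults r = 4 and a stopping difference of 0.001 follow the values the paper reports using.

```python
from statistics import median


def bisquare_average(scores, r=4.0, tol=0.001, max_iter=50):
    """Return (bisquare-weighted average, final weights) for one group."""
    center = median(scores)                           # Eq. 1: start at the median
    mad = median([abs(x - center) for x in scores])   # Eq. 2: MAD about the median
    scale = 1.483 * mad                               # robust standard deviation
    if scale == 0:
        # Assumption (not from the paper): with a zero MAD, keep the median
        # and give every score full weight.
        return center, [1.0] * len(scores)
    weights = [1.0] * len(scores)
    for _ in range(max_iter):                         # max_iter is only a safeguard
        eps = [(x - center) / scale for x in scores]  # Eq. 3: scaled deviations
        weights = [(1 - (e / r) ** 2) ** 2 if abs(e) < r else 0.0
                   for e in eps]                      # Eq. 4: bisquare weights
        new_center = (sum(w * x for w, x in zip(weights, scores))
                      / sum(weights))                 # Eq. 5: weighted average
        converged = abs(new_center - center) <= tol
        center = new_center
        if converged:
            break
    return center, weights


average, weights = bisquare_average([1, 2, 3, 4, 10])
print(round(average, 3), [round(w, 3) for w in weights])
```

For the scores 1, 2, 3, 4, 10 the sketch should settle close to 2.5 with a weight of zero on the score of 10. The MAD is computed once, about the median, and only the center is updated between iterations, which is how the sketch reads Eqs. 2 and 3.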
In the next iteration, the bisquare-weighted
average from the first iteration, $bw^{(1)}(X_{\cdot j})$,
is used in Eq. (3) instead of the median. Weights are
assigned using Eq. (4) as before. Finally, a new
bisquare-weighted average is found from the new
weights (Eq. 5).
When the new bisquare-weighted average is
approximately equal to the old bisquare-weighted
average ($bw^{(k)}(X_{\cdot j}) \approx bw^{(k-1)}(X_{\cdot j})$),
the iterative process is stopped and the first step for
calculating the bANOVA is finished. The second step
involves using the weights ($w_{ij}^{(k)}$) from the
final iteration to calculate a weighted ANOVA. One then
transforms the original F-ratio using Eq. 6. The result
of the transformation ($F_{bw}$) is compared to a tabled
value using the degrees of freedom from the original
design.

$$F_{bw} = (0.534 + 0.001206\, df_{Error})\, F \tag{6}$$

An example of the calculation of a bisquare-weighted
average and a bisquare-weighted ANOVA (bANOVA)
Below is an example of a one-way ANOVA
design. There are three groups and five subjects per
group.

Group 1               Group 2             Group 3
21, 22, 23, 24, 25    1, 2, 3, 4, 10      1, 2, 3, 4, 100
Median = 23           Median = 3          Median = 3
Mean = 23             Mean = 4            Mean = 22

To calculate the bisquare-weighted average we
start with Eq. (1). The first iteration uses the median
as the measure of central tendency. Scaled
deviations from the median are found for each score
in the group. (MAD = 1.0)

X_i1    e_i1(1)
21      -1.349
22      -0.674
23       0.000
24       0.674
25       1.349

Next, the weights are calculated. The bisquare
formula assigns weights near 1.0 to scores that are
near the measure of central tendency. Scores that
are far from the measure get weights closer to zero.
If a score exceeds approximately four standard
deviations it is given a weight of zero, which removes
the score from the sample. For Group 1 the weights
sum to 4.459 and the weighted scores sum to 102.559.
The bisquare-weighted average is compared to
the median. If the two are approximately equal, the
procedure is finished for this group. In this case
$bw^{(1)}(X_{\cdot 1}) = 23 = bw^{(0)}(X_{\cdot 1})$, so there
is no need to continue to iterate. Next we will
calculate the weights for Group 2.
The first iteration for Group 2 is presented
below:

X_i2    e_i2(1)    w_i2(1)    w_i2(1) X_i2
1       -1.349     0.786      0.786
2       -0.674     0.944      1.888
3        0.000     1.000      3.000
4        0.674     0.944      3.776
10       4.720     0.000      0.000
Sum                3.674      9.450

Notice that in the last row (a score of 10) the
weight is zero, indicating that the score was beyond
approximately four robust estimates of the standard
deviation from the measure of central tendency and
is thus considered too extreme to keep. It was
removed from the sample at this point (i.e.,
$w_{52}^{(1)} = 0.0$).
However, the weight may become nonzero in subsequent iterations.
In the next iteration, the bisquare-weighted
average from the first iteration is used in Eq. (3)
instead of the median. The deviations of the scores
from the bisquare-weighted average are calculated.
As in the previous iteration, the absolute value of the
deviations is found. Weights are assigned using Eq.
(4). Finally, a new bisquare-weighted average is
found from the new weights (Eq. 5).
The second iteration for Group 2 is below:

X_i2    e_i2(2)    w_i2(2)    w_i2(2) X_i2
1       -1.060     .864       .864
2       -0.386     .981       1.963
3        0.288     .990       2.970
4        0.963     .888       3.550
10       5.009     .000       0.000
Sum                3.723      9.346

At this point, from the first iteration,
$bw^{(1)}(X_{\cdot 2}) = 2.572$ while, from this iteration,
$bw^{(2)}(X_{\cdot 2}) = 2.510$. The difference between these
numbers is 0.062. The researcher needs to decide
just how close these two numbers need to be. We
used a difference of 0.001. Therefore, the
procedure will be repeated until
$|bw^{(k)}(X_{\cdot 2}) - bw^{(k-1)}(X_{\cdot 2})| \le 0.001$.
At the 5th iteration, the difference was less than
0.001 with $bw^{(5)}(X_{\cdot 2}) = 2.500$. Here are the
results for Group 2:

X_i2    e_i2(5)    w_i2(5)    w_i2(5) X_i2
1       -1.012     .876       .876
2       -0.337     .986       1.972
3        0.337     .986       2.958
4        1.011     .876       3.505
10       5.057     .000       0.000
Sum                3.724      9.311

Group 3 took five iterations before the difference
was small
($|bw^{(5)}(X_{\cdot 3}) - bw^{(4)}(X_{\cdot 3})| \le 0.001$). Here
are the results for Group 3 ($bw^{(5)}(X_{\cdot 3}) = 2.500$):

X_i3    e_i3(5)    w_i3(5)    w_i3(5) X_i3
1       -1.012     .876       .876
2       -0.337     .986       1.972
3        0.337     .986       2.958
4        1.011     .876       3.505
100     65.745     .000       0.000
Sum                3.724      9.311

To show the usefulness of the bANOVA with a
heavy-tailed distribution, we have calculated an
ANOVA on the original scores and a weighted
ANOVA using the final weights ($w_{ij}^{(k)}$) from each of
the bisquare-weighted averages (which, of course,
gives the bANOVA once the F is transformed into
$F_{bw}$ using Eq. 6).

ANOVA summary table:
Source     SS         df    MS        F       p
Between    1143.33     2    571.67    .894    .434
Within     7670.00    12    639.17
Total      8813.33    14

bANOVA summary table:
Source     SS         df    MS        Fbw       p
Between    1172.18     2    586.09    187.77    .0001
Within       17.04    10      1.70
Total      1189.23    12

As you can see, the two extreme scores in
groups 2 and 3 had a large impact on the ANOVA.
However, these scores were given weights of zero
and therefore are not reflected in the Fbw and the
bANOVA summary table.

Description of SAS Routines.
The code presented in the appendix calculates a
two-way ANOVA and a two-way bANOVA. The
former is included for comparison to the latter. We
will briefly describe only the calculation of the
weights and bisquare-weighted averages for each
cell of the two-way design (using the NLIN
procedure) and, more briefly, the weighted ANOVA
(with the GLM procedure) using the weights from the
previous step. Most of the code presented for the
bANOVA merely creates a SAS data set that the
procedure NLIN can use for calculating the bisquare
averages. As no "interesting" routines are used in
these procedures (indeed, they are quite tedious),
we will avoid reviewing them in detail.
The SAS data set that NLIN reads (TEST)
contains columns for the variables IVA (naming the
levels of independent variable A), IVB (naming the
levels of independent variable B), DVA (the
dependent variable), MEDKY (which holds the
medians for each of the groups), DIFF (which simply
equals zero), and MAD (containing the median
absolute deviations for each of the groups). NLIN
calculates a bisquare-weighted average for each
group of the two-way design (BY IVA IVB) and
outputs the weights for each case of each cell.
These weights, as mentioned above, will be used by
GLM to calculate a weighted ANOVA.
Procedure NLIN requires one to use the PARMS
statement to supply an initial value for the parameter
to be estimated (with the estimate calculated
iteratively, starting with the initial value, in order to
minimize the sum of squared errors). For the
bisquare-weighted averages we wish to use the
median of a particular group for the initial value.
Unfortunately, one has to supply an explicit value
(not a variable) in the PARMS statement. So, in the
PARMS statement, we set B (which will be the
bisquare-weighted average for the cell defined by
the factorial combination of IVA with IVB) equal to 1.
The next statement redefines B to the initial value
that we will actually use. For iteration 0 (i.e., no
iterative fits have been calculated: _ITER_ = 0) and
the first case (_N_ = 1) we define the initial value of B
to be the median of the group (MEDKY).
In the MODEL statement we specify that we
wish to choose B (the bisquare-weighted average) so
that the difference between B and the dependent
variable (DVA) is as small as possible (i.e., the
difference is as close to zero, the value of DIFF, as
possible). NLIN also needs an expression for the
first derivative of the model with respect to the
parameter to be estimated. This derivative is
specified by the DER.B statement. R and SCALE
are the rejection point and the robust measure of
variability, respectively. RESID is the scaled
residual, where MODEL.DIFF is the estimated model
(the latter variable is created by the NLIN procedure
as the value of the predicted score for the given case
and iteration). The two sets of IF-THEN/ELSE
statements define the bisquare-weighting function
(where WS is the variable that contains the weights)
by first creating an influence function (PSYS) and
then generating the weights (see Hampel et al.,
1986, Ch. 2). We then replace the weight used by
the NLIN procedure (_WEIGHT_) with the weight we
have calculated. Finally, the calculated weight is
indicated for output to the SAS data set that we wish
to pass on to the GLM procedure (by the ID
statement) and this data set (named WEIGHTS) is
created by the OUTPUT statement.
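The relation between the influence function and the weights can be sketched in Python as follows. This is an illustrative translation rather than the authors' SAS code: the function names `psi` and `weight` are hypothetical stand-ins for the roles played by the PSYS and WS variables, and the residuals in the loop are the scaled deviations from the first iteration for Group 2 in the worked example.

```python
def psi(resid, r=4.0):
    """Bisquare influence function (the role played by PSYS)."""
    if abs(resid) < r:
        return resid * (1 - (resid / r) ** 2) ** 2
    return 0.0


def weight(resid, r=4.0):
    """Weight as psi(resid) / resid, set to 1 at a zero residual (the role of WS)."""
    if resid != 0:
        return psi(resid, r) / resid
    return 1.0


# Scaled residuals from the first iteration for Group 2 in the worked example.
for resid in (-1.349, -0.674, 0.0, 0.674, 4.720):
    print(resid, round(weight(resid), 3))
```

Because the influence function is odd in the residual, dividing it by the residual leaves a weight that depends only on the residual's magnitude, which is the weighting formula of Eq. 4.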
The GLM procedure calculates a weighted
ANOVA using the weights that we calculated in the
NLIN procedure (which were stored in the SAS data
set WEIGHTS). Notice that this code is identical to
that of the two-way ANOVA with the exception that a
WEIGHT statement is used. The weights (WS), of
course, are those from the NLIN procedure where
the bisquare-weighted averages were calculated for
each cell (or group) of the design. The procedures
that follow the GLM print results and calculate
transformed F-ratios. Again, the programming is
rather straightforward and will not be reviewed.
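To make the weighted ANOVA and the transformation of Eq. 6 concrete, here is a small Python sketch for a one-way design. It is an illustration under stated assumptions, not a translation of the GLM code: the function names `weighted_anova` and `transform_f` are hypothetical, cases with a weight of zero are left out of the degrees of freedom (which agrees with the error df of 10 in the bANOVA summary table), and the weights are values of Eq. 4 at the final averages of the worked example, rounded to three decimals.

```python
def weighted_anova(groups, weights):
    """One-way weighted ANOVA.

    Returns (ss_between, ss_within, df_between, df_error, f).
    """
    sum_w = [sum(w) for w in weights]
    means = [sum(wi * x for wi, x in zip(w, g)) / sw
             for g, w, sw in zip(groups, weights, sum_w)]
    grand = sum(sw * m for sw, m in zip(sum_w, means)) / sum(sum_w)
    ss_between = sum(sw * (m - grand) ** 2 for sw, m in zip(sum_w, means))
    ss_within = sum(wi * (x - m) ** 2
                    for g, w, m in zip(groups, weights, means)
                    for wi, x in zip(w, g))
    # Assumption: a case with zero weight does not count toward the df.
    n_used = sum(1 for w in weights for wi in w if wi > 0)
    df_between = len(groups) - 1
    df_error = n_used - len(groups)
    f = (ss_between / df_between) / (ss_within / df_error)
    return ss_between, ss_within, df_between, df_error, f


def transform_f(f, df_error):
    """Eq. 6: rescale the weighted F-ratio."""
    return (0.534 + 0.001206 * df_error) * f


groups = [[21, 22, 23, 24, 25], [1, 2, 3, 4, 10], [1, 2, 3, 4, 100]]
weights = [[0.786, 0.944, 1.000, 0.944, 0.786],   # rounded weights, Group 1
           [0.876, 0.986, 0.986, 0.876, 0.000],   # rounded weights, Group 2
           [0.876, 0.986, 0.986, 0.876, 0.000]]   # rounded weights, Group 3
ss_b, ss_w, df_b, df_e, f = weighted_anova(groups, weights)
f_bw = transform_f(f, df_e)
print(round(ss_b, 2), round(ss_w, 2), df_b, df_e, round(f_bw, 2))
```

With these inputs the sketch should land close to the bANOVA summary table of the worked example (an Fbw near 187.77), with small differences in the sums of squares caused by rounding the weights.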
To convert this code to calculate a bANOVA for a
one-way design (a single independent variable) one
simply removes IVB from the routines (hence,
statements such as BY IVA IVB become BY IVA).
The GLM procedure is then modified to calculate a
one-way ANOVA and the single resulting F-ratio is
transformed in the last procedure.
References
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., &
Stahel, W. A. (1986). Robust statistics: The
approach based on influence functions. NY:
Wiley.
Keppel, G. (1982). Design and analysis: A
researcher's handbook, 2nd Ed., Englewood Cliffs,
NJ: Prentice-Hall.
Kirk, R. E. (1995). Experimental design:
Procedures for the behavioral sciences, 3rd Ed.,
Belmont, CA: Brooks/Cole.
Maxwell, S. E., & Delaney, H. D. (1990). Designing
experiments and analyzing data: A model
comparison perspective. Pacific Grove, CA:
Brooks/Cole.
Micceri, T. (1989). The unicorn, the normal curve,
and other improbable creatures. Psychological
Bulletin, 105, 156-166.
Mosteller, F., & Tukey, J. W. (1977). Data analysis
and regression: A second course in statistics.
Reading, MA: Addison-Wesley.
SAS Institute, (1990). SAS language: Reference.
Version 6, First Edition, Cary, NC: SAS Institute.
Stine, W. W., & Regeth, R. A. (In preparation).
Parametric statistics and power in the face of
heavy tails:
I. Univariate omnibus tests for
completely randomized designs.
Appendix: SAS Routines for Calculating the
ANOVA and bANOVA for a two-way design.
Two SAS routines are presented for conducting
the Analysis of Variance techniques that we
described. They are appropriate for a Completely
Randomized Factorial design with two independent
variables that have p and q levels, respectively
(CRF-pq). The ANOVA routine calculates an Analysis
of Variance. The bANOVA routine calculates a
Bisquare-Weighted Analysis of Variance. The bANOVA
routines also include procedures that print the
weights for individual cases to assist the analyst in
identifying outliers, etc.
This code has been tested on SAS Production
Release 6.08 using a Digital Equipment Corporation
VAX-8820 under the VMS Version 5.5-2 operating
system. Perhaps the most effective way to use the
code presented below would be to optically scan the
text into a microcomputer (e.g., a Macintosh) and
then pass the code over to a machine with SAS. In
most environments, only the file specifications would
have to be altered for the code to work. The code
can also be used as a model for the development of
these analyses in other statistical packages.

ANOVA for CRF-pq

DATA RAWDATA;
/* Names the SAS data set that will be created
RAWDATA. */
INFILE 'NUMBER2.DAT';
/* Tells SAS that the data are in a file named
NUMBER2.DAT (VMS specific). */
INPUT IVA IVB DVA;
/* Indicates to SAS to use the first and second
columns of values from NUMBER2.DAT as the
values for the independent variables and the
third column of values as the values for the
dependent variable. Names these IVA, IVB, and
DVA. */
OUTPUT;
/* Tells SAS to put these values into RAWDATA.
*/
RUN;
/* Forces the execution of the DATA step. */
OPTIONS LINESIZE = 80;
/* Displays the output as 80 characters wide.
*/
PROC GLM DATA = RAWDATA;
/* Uses the General Linear model on the
RAWDATA data set. */
CLASS IVA IVB;
/* Classifies IVA and IVB as the nominal level
variables. */
MODEL DVA = IVA IVB IVA*IVB;
/* Defines DVA as the dependent variable and
IVA and IVB as the independent variables. */
RUN;
/* Forces the execution of the GLM step. */

bANOVA for the CRF-pq

/* This block reads the data from
NUMBER3.DAT into a SAS data file called
RAWDATA and sets up a variable named
DIFF. RAWDATA will contain 4 variables: IVA,
IVB, DVA, and DIFF. */
DATA RAWDATA;
/* Names the SAS data set that will be created
RAWDATA. */
INFILE 'NUMBER3.DAT';
/* Tells SAS that the data are in a file named
NUMBER3.DAT. */
INPUT IVA IVB DVA;
/* Tells SAS to use the first and second columns
of values from NUMBER3.DAT as the values for
the independent variables and the third column
of values as the values for the dependent
variable. Names these IVA, IVB, and DVA. */
DIFF = 0;
/* DIFF will be the difference between the mean
of the DVA and the bisquare weighted average
(see below). It is now set to zero. */
OUTPUT;
/* SAS puts these values into RAWDATA. */
RUN;
/* Forces the execution of the DATA step. */
OPTIONS LINESIZE = 80;
/* Displays the output as 80 characters wide.
*/
PROC UNIVARIATE DATA = RAWDATA NOPRINT;
/* Provides descriptives (such as the mean and
median) for the data set RAWDATA. */
VAR DVA;
/* Indicates that DVA is to be analyzed from the
data set. */
BY IVA IVB;
/* SAS executes this step for each group defined
by IVA and IVB. */
OUTPUT OUT = MEDIANS MEDIAN = MEDDVA;
/* Puts the results into a file called MEDIANS
and names the median MEDDVA. */
RUN;
/* Forces the execution of the PROC step. */
DATA EXTEND;
/* Creates a SAS data file called EXTEND. */
MERGE RAWDATA MEDIANS;
/* EXTEND contains the data from RAWDATA and
MEDIANS. */
BY IVA IVB;
/* Arranges the data by IVA and IVB. */
IF _N_ = 1 THEN DO;
/* This procedure extends the number of
groups in MEDIANS to match the number of
subjects in RAWDATA. */
  IVAOLD = 0;
  IVBOLD = 0;
END;
/* Ends the IF-THEN DO block. */
IF IVAOLD ^= IVA AND IVBOLD ^= IVB THEN DO;
/* This is a continuation of the above
procedure. For example, if MEDIANS has p
groups and RAWDATA has 30 subjects then
EXTEND will have 120 rows of data. */
  IVAOLD = IVA;
  IVBOLD = IVB;
  MED = MEDDVA;
  /* Copies the group median (MEDDVA) into
  MED. */
END;
/* Ends the IF-THEN DO block. */
ABDDVA = ABS(DVA - MED);
/* ABDDVA is the absolute value of the deviation
of DVA from the median (MED). */
RETAIN MEDDVA;
/* Tells SAS to keep the new medians. */
DROP IVAOLD IVBOLD MED;
/* Tells SAS to drop the old IVAs, IVBs, and
medians. */
OUTPUT;
/* Tells SAS to put these values into EXTEND.
*/
RUN;
/* Forces the execution of the DATA step. */
PROC DATASETS NOLIST;
/* Deletes the MEDIANS data file. */
DELETE MEDIANS;
RUN;
/* Forces the execution of the PROC step.
*/
/* This block sets up a data set called EXTEND
from the data contained in RAWDATA and
MEDIANS. MEDIANS contains the median values
for each group. EXTEND will contain 6
variables: IVA, IVB, DVA, DIFF, the group
medians (MEDDVA) and the absolute median
deviation for each group (ABDDVA). */
PROC UNIVARIATE DATA = EXTEND NOPRINT;
/* Gives descriptives on EXTEND such as the
mean and median for each group defined by the
IVA and IVB. */
VAR ABDDVA DVA;
/* Indicates that ABDDVA and DVA will be
analyzed from the data set EXTEND. */
BY IVA IVB;
/* SAS executes this step for each group defined
by the IVA and IVB. */
OUTPUT OUT = MEDABDEV MEDIAN = MAD2DVA MED2DVA;
/* Puts the results into a file called MEDABDEV
and names the median of ABDDVA MAD2DVA and
the median of DVA MED2DVA. */
RUN;
/* Forces the execution of the PROC step. */
PROC DATASETS NOLIST;
/* Deletes the EXTEND data file. */
DELETE EXTEND;
RUN;
/* Forces the execution of the PROC step. */
DATA TEST;
/* Creates a SAS data file called TEST that
contains the data from RAWDATA and
MEDABDEV. */
MERGE RAWDATA MEDABDEV;
BY IVA IVB;
/* Arranges the data by the IVA and IVB. */
IF _N_ = 1 THEN DO;
/* This procedure extends the number of
groups in MEDABDEV to match the number of
subjects in RAWDATA. */
  IVAOLD = 0;
  IVBOLD = 0;
END;
/* Ends the IF-THEN DO block. */
IF IVAOLD ^= IVA AND IVBOLD ^= IVB THEN DO;
/* This procedure copies several variables
under new names. */
  IVAOLD = IVA;
  /* Copies IVA to IVAOLD. */
  IVBOLD = IVB;
  /* Copies IVB to IVBOLD. */
  MAD = MAD2DVA;
  /* Copies MAD2DVA to MAD. */
  MEDKY = MED2DVA;
  /* Copies MED2DVA to MEDKY. */
END;
/* Ends the IF-THEN DO block. */
RETAIN MAD MEDKY;
/* Tells SAS to keep MAD and MEDKY. */
DROP MAD2DVA MED2DVA IVAOLD IVBOLD;
/* Tells SAS to drop MAD2DVA, MED2DVA,
IVAOLD, and IVBOLD. */
OUTPUT;
/* Tells SAS to put these values into TEST. */
RUN;
/* Forces the execution of the DATA step. */
PROC DATASETS NOLIST;
/* Deletes the RAWDATA and MEDABDEV files.
*/
DELETE RAWDATA MEDABDEV;
RUN;
/* Forces the execution of the PROC step. */
/* The following block calculates a bisquare
weighted average for each cell and outputs the
weights for each element in the cell into a file
called WEIGHTS. WEIGHTS will contain 7
variables: IVA, IVB, DVA, DIFF, MAD, MEDKY,
and WS. These weights will later be used to
calculate the weighted ANOVA. */
PROC NLIN DATA = TEST NOHALVE;
/* Fits a nonlinear regression model using the
least squares procedure. NOHALVE turns off
the step-size search during iteration. (See
Example 5; SAS Institute, 1990, p. 1165.) */
TITLE 'Tukey biweight';
/* Title for this section. */
BY IVA IVB;
/* Executes the procedure for each group. */
PARMS B = 1;
/* B represents the bisquare weighted average.
It is nominally set to 1.0. */
IF _ITER_ = 0 AND _N_ = 1 THEN B = MEDKY;
/* For the first iteration, B is initialized to the
median value. */
MODEL DIFF = (DVA - B);
/* The model is set as the difference (DIFF)
between the DVA and B. */
DER.B = -1;
/* The derivative of the model with respect to B
is -1. */
R = 4;
/* This defines the rejection point as 4 scale
units beyond the middle of the distribution. */
SCALE = 1.483 * MAD;
/* SCALE is the unit of measurement for the
residuals. MAD is the measure of scale. MAD
times 1.483 provides an unbiased estimate of
the standard deviation when sampling from a
normal distribution (Hampel et al., 1986, Ch.
2). */
RESID = (DIFF - MODEL.DIFF)/SCALE;
/* MODEL.DIFF is the estimated deviation of B
from DVA. RESID provides a scaled normalized
estimate of the residuals. */
IF ABS(RESID) < R THEN PSYS =
  RESID*((1 - (RESID/R)**2)**2);
/* If the absolute value of the residual is within
4 units of the middle of the distribution (R)
then the residual receives a non-zero weight
using the following influence function
(Mosteller & Tukey, 1977, p. 205). The
influence function is called the psi function
(Hampel et al., 1986, Ch. 2). */
ELSE PSYS = 0.;
/* If the absolute value of the residual is
outside of 4 units of the middle of the
distribution (R), then the residual is given a
weight of zero. */
IF RESID ^= 0 THEN WS = PSYS / RESID;
/* If the value of the residual is not equal to
zero, then the weight of the residual equals the
influence function value divided by the residual.
*/
ELSE WS = 1.;
/* If the residual equals zero, then the weight
for that case equals 1.0 (B at the center of the
distribution). */
_WEIGHT_ = WS;
/* This replaces SAS's NLIN weight value with
the weight value calculated from the biweight
procedure. */
ID WS;
/* This specifies that the variable WS will be
output to the SAS data set. */
OUTPUT OUT = WEIGHTS;
/* Puts results into a file called WEIGHTS. */
RUN;
/* Forces the execution of the PROC step. */
PROC DATASETS NOLIST;
/* Deletes the TEST data file. */
DELETE TEST;
RUN;
/* Forces the execution of the PROC step. */
PROC PRINT DATA = WEIGHTS;
/* Prints the data set with the weights for
outlier identification. */
RUN;
/* The following block calculates the weighted
ANOVA from the weights obtained from the
previous procedure. The input file WEIGHTS
contains 7 variables: IVA, IVB, DVA, DIFF, MAD,
MEDKY, and WS. The results are output into
the file ANOVA. */
PROC GLM DATA = WEIGHTS OUTSTAT = ANOVA
  NOPRINT;
/* This uses the general linear model on the
WEIGHTS data set and puts the results into
ANOVA. */
CLASS IVA IVB;
/* Classifies IVA and IVB as the variables of
interest. */
MODEL DVA = IVA IVB IVA*IVB;
/* This defines DVA as the dependent variable,
IVA and IVB as the independent variables, and
IVA*IVB as the interaction. */
WEIGHT WS;
/* This indicates that the DVA will be weighted
using the weights (WS) obtained in the above
procedure when calculating the ANOVA. */
RUN;
/* Forces the execution of the GLM step. */
PROC DATASETS NOLIST;
/* Deletes WEIGHTS. */
DELETE WEIGHTS;
RUN;
/* Forces the execution of the PROC step. */
/* This block calculates the F value for the
weighted ANOVA. It also calculates approximate
F values (APPFA, APPFB, and APPFAB) based
on the regression equation in equation (6) and
the corresponding probability values (PROBAFA,
PROBAFB, and PROBAFAB).
These values are listed in OUTF: NAME (DVA),
SOURCE (error, IVA, IVB, IVAB), TYPE (error,
SS1), DF, SS, F, PROB, DFERROR, SSERROR,
SSA, FA, APPFA, PROBA, PROBAFA, SSB, FB,
APPFB, PROBB, PROBAFB, SSAB, FAB,
APPFAB, PROBAB, and PROBAFAB. */
DATA OUTF;
/* Creates SAS data set called OUTF. */
SET ANOVA;
/* Uses the ANOVA file. */
IF _TYPE_ = 'ERROR' THEN DO;
/* This labels the error term in the ANOVA file
to DFERROR and the sum of squares term to
SSERROR. */
  DFERROR = DF;
  SSERROR = SS;
END;
/* Ends the IF-THEN DO block. */
RETAIN DFERROR SSERROR;
/* This tells SAS to keep the DFERROR and
SSERROR terms. */
IF _TYPE_ = 'SS1' AND _SOURCE_ = 'IVA' THEN DO;
/* This labels the sum of squares for the A
effect SSA and labels the F value for the A effect
FA. */
  SSA = SS;
  FA = F;
  APPFA = (.534 + (.001206 * DFERROR)) * FA;
  /* This calculates an approximate F value
  (APPFA value) based on the regression equation
  (see equation (6)). */
  PROBA = PROB;
  /* This labels the probability value for the A
  effect PROBA. */
  PROBAFA = 1 - PROBF(APPFA, DF, DFERROR, 0);
  /* This finds the probability value for the
  APPFA value. PROBF returns the lower-tail
  probability of the F distribution, so it is
  subtracted from 1 to give the significance
  level. */
END;
/* Ends the IF-THEN DO block. */
RETAIN SSA FA APPFA PROBA PROBAFA;
/* This tells SAS to keep these variables. */
IF _TYPE_ = 'SS1' AND _SOURCE_ = 'IVB' THEN DO;
/* This labels the sum of squares for the B
effect SSB. */
  SSB = SS;
  FB = F;
  /* This labels the F value for the B effect FB.
  */
  APPFB = (.534 + (.001206 * DFERROR)) * FB;
  /* This calculates an F value (APPFB value)
  based on the regression equation (see equation
  (6)). */
  PROBB = PROB;
  /* This labels the probability value for the B
  effect PROBB. */
  PROBAFB = 1 - PROBF(APPFB, DF, DFERROR, 0);
  /* This finds the probability value for APPFB
  (upper tail, as above). */
END;
/* Ends the IF-THEN DO block. */
RETAIN SSB FB APPFB PROBB PROBAFB;
/* This tells SAS to keep these variables. */
IF _TYPE_ = 'SS1' AND _SOURCE_ = 'IVA*IVB' THEN DO;
/* This labels the sum of squares for the AB
effect SSAB. */
  SSAB = SS;
  FAB = F;
  /* This labels the F value for the AB effect FAB.
  */
  APPFAB = (.534 + (.001206 * DFERROR)) * FAB;
  /* This calculates an F value (APPFAB value)
  based on the regression equation (see equation
  (6)). */
  PROBAB = PROB;
  /* This labels the probability value for the AB
  effect PROBAB. */
  PROBAFAB = 1 - PROBF(APPFAB, DF, DFERROR, 0);
  /* This finds the probability value for the
  APPFAB value (upper tail, as above). */
  OUTPUT;
  /* This puts the output into OUTF. */
END;
/* Ends the IF-THEN DO block. */
RUN;
/* Forces the execution of the DATA step. */
PROC PRINT DATA = OUTF;
/* Prints OUTF. */
RUN;
/* Forces the execution of the PROC step. */