Program to generate Atkinson's and resistant envelopes for normal
probability plots of regression residuals.
Rafael Flores
Instituto de Nutrición de Centro América y Panamá (INCAP).
Virginia F. Flack
University of California at Los Angeles (UCLA).
Introduction
Normal probability plots of residuals have been suggested as
tools to evaluate the normality assumption for regression residuals,
and they can be used for detection of one or more bad data values
(Daniel, 1959). The extent to which these problems can be detected
depends on the data set and the investigator's interpretative
skills. Draper and Smith (1981) insist that, to gain experience in
making decisions on normal probability plots, users should train
themselves using sample plots of different sizes similar to those
provided by Daniel and Wood (1980).
The magnitude of the residuals' fluctuations is a function
of the location of the design points for the regression. Atkinson
(1981) introduced a simulation-based method to produce a reference
set of fluctuations for these plots and therefore provide guidance
as to what sort of variability can be expected when one is using
a normal probability plot of regression residuals. BMDP (1987)
implemented this method of placing envelopes on the plot in its P2R
program.
The idea of using simulation as an aid for the interpretation
of a particular residual plot is to give the statistician a basis
for comparison of the observed plot with an "expected" plot, where
that comparison is derived from residuals from an appropriately
chosen error distribution.
Flack and Flores (1989) studied the envelope's stability
properties and joint residual vector inclusion levels, as well as
alternative resistant techniques for creating envelopes. They
recommended one of these resistant versions that showed good
stability and good sensitivity to outlying residuals.
This paper presents a SAS program that produces both Atkinson's
and resistant envelopes for normal probability plots of regression
residuals.
Notation
We have the regression model $Y = X\beta + \varepsilon$, with
$Y_{n \times 1}$, $X_{n \times p}$ and $\varepsilon_{n \times 1}$.
The error terms $\varepsilon_i$, $i = 1, \ldots, n$, are assumed to be
independent, identically distributed (i.i.d.) according to
$N(0, \sigma^2)$. The externally studentized $i$th residual is:

$$r^*_i = \frac{y_i - \hat{y}_i}{s_{(i)} \sqrt{1 - h_i}}$$

where $y_i$ is the $i$th observed y value, $\hat{y}_i$ is its ordinary
least squares estimator, $s_{(i)}$ is the estimator of $\sigma$ when
the $i$th observation has been omitted in the estimation, and $h_i$ is
the $i$th diagonal element of the hat matrix, $X(X'X)^{-1}X'$.
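As an illustration of the definition above, the following Python sketch computes externally studentized residuals using only the standard library. It is not the authors' SAS program: the function names `solve` and `rstudent` and the small data set are made up for this example, and the sketch relies on the standard deletion identity $(n-p-1)\,s_{(i)}^2 = SSE - e_i^2/(1-h_i)$, where $e_i = y_i - \hat{y}_i$, so the model does not have to be refitted n times.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    aug = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for c in range(n):
        # Partial pivoting: bring the largest entry of column c to the diagonal.
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                factor = aug[r][c] / aug[c][c]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def rstudent(X, y):
    """Externally studentized residuals for rows X (n x p, include a column
    of ones for the intercept) and response y."""
    n, p = len(X), len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    e = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    # h_i = x_i' (X'X)^{-1} x_i, the i-th diagonal element of the hat matrix.
    h = [sum(x * v for x, v in zip(row, solve(xtx, row))) for row in X]
    sse = sum(ei * ei for ei in e)
    out = []
    for ei, hi in zip(e, h):
        s2_del = (sse - ei * ei / (1.0 - hi)) / (n - p - 1)  # s_(i)^2
        out.append(ei / math.sqrt(s2_del * (1.0 - hi)))
    return out

# Made-up data: a straight line with one shifted response at index 2.
X = [[1.0, float(x)] for x in range(8)]
y = [0.1, 0.9, 5.0, 3.1, 3.9, 5.2, 5.9, 7.1]
print([round(r, 2) for r in rstudent(X, y)])
```

In the made-up data the shifted observation receives by far the largest residual in absolute value, which is the behaviour that makes the externally studentized form useful for flagging single outliers.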
Atkinson's envelope algorithm
Generate M n×1 vectors Z from the N(0,1) distribution. For each
of these vectors fit the model $Z = X\beta + \varepsilon$ to obtain M
simulated vectors $r^*_Z$ of externally studentized residuals. Order
the elements of each simulated residual vector and let $i$ index the
order statistics within each residual vector. For each
$i = 1, \ldots, n$, select $l_i = \min r^*_{Z(i)}$ and
$u_i = \max r^*_{Z(i)}$, the minimum and maximum being taken over the
M simulations. These upper and lower values for the $i$th order
statistic of the M simulated residual vectors form the upper and
lower edges of the diagnostic envelope, respectively.
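The steps of this algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the SAS program of this paper: to keep it short, the design matrix is taken to be a single column of ones (so $h_i = 1/n$ and $p = 1$), and the names `rstudent_intercept_only` and `atkinson_envelope` are hypothetical. Any function that returns externally studentized residuals for a given design can be passed in place of the intercept-only one. The seed value is borrowed from the program listing only for reproducibility.

```python
import math
import random

def rstudent_intercept_only(z):
    """Externally studentized residuals when X is one column of ones
    (illustrative special case: h_i = 1/n and p = 1)."""
    n = len(z)
    mean = sum(z) / n
    e = [zi - mean for zi in z]
    sse = sum(ei * ei for ei in e)
    h = 1.0 / n
    out = []
    for ei in e:
        s2_del = (sse - ei * ei / (1.0 - h)) / (n - 2)  # s_(i)^2 with p = 1
        out.append(ei / math.sqrt(s2_del * (1.0 - h)))
    return out

def atkinson_envelope(n, resid_fn, M=19, seed=5638):
    """Return (lower, upper): the min and max, over M simulations, of each
    order statistic of the simulated residual vectors."""
    rng = random.Random(seed)
    sims = []
    for _ in range(M):
        z = [rng.gauss(0.0, 1.0) for _ in range(n)]  # one N(0,1) response vector
        sims.append(sorted(resid_fn(z)))             # ordered simulated residuals
    lower = [min(s[i] for s in sims) for i in range(n)]
    upper = [max(s[i] for s in sims) for i in range(n)]
    return lower, upper

lower, upper = atkinson_envelope(28, rstudent_intercept_only)
print(lower[0], upper[-1])  # most extreme envelope values
```

Because each simulated vector is sorted before the minimum and maximum are taken, both envelope edges are themselves nondecreasing in i.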
Atkinson's suggestion of M = 19 gives envelope boundaries
which are estimates of the 5th and 95th percentiles of the
distribution of the $i$th order statistic of the externally
studentized residual vector, given X and i.i.d. $N(0, \sigma^2)$
errors. This method is restricted because the envelope limits for the
$i$th ordered residual are estimates of a $100k/(M+1)$ percentile of
that residual's appropriate distribution, with $k$ being an integer
$\le M$. It also depends on extreme order statistics among the
simulated residuals, making this procedure susceptible to creating
quite variable envelopes.
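The percentile levels quoted above follow from a simple rank argument, sketched here under the assumption that, when the model holds, the observed $i$th ordered residual and the M simulated ones are exchangeable, so the observed value is equally likely to take any of the M+1 ranks:

```latex
P\bigl(r^*_{(i)} < l_i\bigr) = P\bigl(r^*_{(i)} > u_i\bigr) = \frac{1}{M+1},
\qquad
M = 19:\quad \frac{100 \cdot 1}{19 + 1} = 5,
\qquad \frac{100 \cdot 19}{19 + 1} = 95 .
```

The lower and upper edges therefore estimate the $100 \cdot 1/(M+1)$ and $100 \cdot M/(M+1)$ percentiles, and a different level can only be reached by changing M.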
Resistant envelope algorithm
From the set of M simulated $i$th order statistics among the
externally studentized residuals, we define the $i$th lower and upper
diagnostic envelope bounds as (Flack and Flores, 1989):

$$l_i = F_{Li} - 1.5\,(F_{Ui} - F_{Li})$$
$$u_i = F_{Ui} + 1.5\,(F_{Ui} - F_{Li})$$

where $F_{Ui}$ and $F_{Li}$ are the upper and lower fourths of the M
$i$th order statistics. For each $i$, the lower and upper fourths are
the $f$ and $M+1-f$ order statistics, respectively, of the M simulated
values of $\{r^*_{Z(i)}\}$. The ideal $f$ of Hoaglin and Iglewicz
(1987) is used for estimating the fourths; it is the value
$f = M/4 + 5/12$.
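The resistant bounds for a single order statistic can be sketched in Python as below. The function name `resistant_bounds` is hypothetical. When f is not an integer, the sketch averages the two adjacent order statistics, which mirrors what the program listing in this paper does in its step 8; other interpolation rules are possible and would give slightly different fourths.

```python
import math

def resistant_bounds(values):
    """Lower and upper resistant envelope bounds from the M simulated
    values of one order statistic."""
    v = sorted(values)
    M = len(v)
    f = M / 4 + 5 / 12            # "ideal" depth of the fourths
    ft = math.floor(f)
    if f == ft:
        low_fourth = v[ft - 1]    # f-th order statistic (1-based)
        high_fourth = v[M - ft]   # (M+1-f)-th order statistic
    else:
        # Average order statistics ft and ft+1, and M-ft and M+1-ft.
        low_fourth = 0.5 * (v[ft - 1] + v[ft])
        high_fourth = 0.5 * (v[M - ft - 1] + v[M - ft])
    spread = high_fourth - low_fourth
    return low_fourth - 1.5 * spread, high_fourth + 1.5 * spread

print(resistant_bounds(range(1, 20)))  # -> (-8.0, 28.0)
```

With M = 19 the depth is f = 5.1667, so for the values 1 to 19 the fourths are 5.5 and 14.5, the fourth-spread is 9, and the bounds are 5.5 - 13.5 and 14.5 + 13.5.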
This resistant method gives more stable bound estimates for
a given number of simulations, and has a more flexible range of
levels of joint Gaussian residual vector inclusion than
Atkinson's method (Flack and Flores, 1989).
Application
A SAS program with curve smoothing was developed to implement
the algorithms described above; it is listed below.
The program was applied to the salinity data set reported by
Ruppert and Carroll (1980) and used in detail by Atkinson (1985).
This data set comprises 28 observations on the salinity of water
during the spring in Pamlico Sound, North Carolina. The
response (y) is the bi-weekly average of salinity. There are three
explanatory variables: the salinity in the previous two-week time
period (x1), a dummy variable for the time period during
March-April (x2) and the river discharge (x3). The first-order
multiple regression model was fitted.
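Written out in the notation of this paper, and assuming the intercept that PROC REG includes by default in the listing's `model y=x1-x3` statement, the fitted first-order model is:

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i,
\qquad i = 1, \ldots, 28 ,
```

so that n = 28 and the design matrix X has p = 4 columns.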
The full normal probability plot in Figure 1 shows that four
observations are on Atkinson's envelope or slightly outside
it, and therefore one should be cautious about those points. In
contrast, Atkinson (1985) did not see any particular feature of
interest in the half-normal plot of the externally studentized
residuals. This could be attributed to the fact that full
normal plots exclude residuals more frequently than half-normal
plots (Flack and Flores, 1989). The resistant envelope shows
clearly that there is one observation that is quite far outside
the limits and that more checking should be done. This observation is
the one identified by Atkinson (1985) using Cook's distance.
There is no suggestion of plot curvature associated with skewness.
When the Shapiro-Wilk test is applied to the externally
studentized residuals (Wetherill et al., 1986) we do not reject
the normality assumption at the 5% level (p=.085).
Conclusions
A SAS program is available to produce Atkinson's and
resistant envelopes for normal probability plots of regression
residuals.
We warn against using the diagnostic envelopes as pure
acceptance/rejection indicators for testing the assumption of
normality of errors.
The resistant alternative to the envelope generation method
proposed by Atkinson shows promise. The envelope edges can
highlight patterns such as one extreme point, which may fall
outside the bounds and/or force nearby points outside their
diagnostic limits.
Acknowledgements
This work was supported in part by the Instituto de Nutrición
de Centro América y Panamá (INCAP).
Rafael Flores: INCAP, Apartado Postal 1188, Guatemala, Guatemala.
(502)-2-723762.
Virginia F. Flack: Department of Biostatistics, School of Public
Health, UCLA, Los Angeles CA 90024-1772, USA. (213) 825-5250.
SAS, SAS/STAT, SAS/IML and SAS/GRAPH are registered trademarks of
SAS Institute Inc., Cary, NC, USA.
References
1. Atkinson, A.C. (1981) "Two graphical displays for outlying and
   influential observations in regression," Biometrika 68, 13-20.
2. Atkinson, A.C. (1985) Plots, transformations and regression,
   Clarendon Press, Oxford.
3. Daniel, C. (1959) "Use of half-normal plots in interpreting
   factorial two level experiments," Technometrics 1, 311-341.
4. Daniel, C. and F.S. Wood (1980) Fitting equations to data, 2nd.
   edition, Wiley.
5. Draper, N.R. and H. Smith (1981) Applied regression analysis,
   2nd. edition, Wiley.
6. Flack, V.F. and R. Flores (1989) "Using simulated envelopes
   in the evaluation of normal probability plots of regression
   residuals," Technometrics 25, 219-230.
7. Hardwick, J. (1987) "2R diagnostics," BMDP Communications 19,
   12-24.
8. Hoaglin, D.C., and B. Iglewicz (1987) "Fine-tuning some
   resistant rules for outlier labeling," JASA 82, 1147-1149.
9. Ruppert, D. and R.J. Carroll (1980) "Trimmed least squares
   estimation in the linear model," JASA 75, 828-838.
10. Wetherill, G.B., P. Duncombe, M. Kenward, J. Kollerstrom, S.R.
    Paul and B.J. Vowden (1986) Regression analysis with
    applications, Chapman and Hall.
Program
* 1. Fix the sample size (n) and the matrix of predictors to
     generate a set of estimated residuals with a fixed
     covariance matrix;
data yx;
   set sugi.salinity end=eof;
   if eof then call symput('n',left(put(_N_,2.)));
run;
* 2. Generate m=19 standard normal pseudo-random nx1 vectors
     denoted: n1, n2, ..., nm;
%let m=19;
%let seed=5638;
%macro deviates;
   data normal;
      tempseed=&seed;
      %do i=1 %to &n;
         %do j=1 %to &m;
            call rannor(tempseed, n&j);
         %end;
         drop tempseed;
         output;
      %end;
%mend deviates;
%deviates
data yxnl;
   merge yx normal;
* 3. Compute the least squares regression of Y on X. Save and
     sort the observed externally studentized residuals (estr0);
%let k=3;
%let j=0;
%macro extstres;
   %let ou=nr;
   %let rs=estr;
   proc reg data=yxnl;
      model y=x1-x&k/noprint;
      output out=&ou&j rstudent=&rs&j;
   data &ou&j;
      set &ou&j;
      keep &rs&j;
   proc sort;
      by &rs&j;
   * 4. Compute the regression of ni on X. Save and sort the
        simulated externally studentized residuals (estri);
   %do i=1 %to &m;
      proc reg data=yxnl;
         model n&i=x1-x&k/noprint;
         output out=&ou&i rstudent=&rs&i;
      data &ou&i;
         set &ou&i;
         keep &rs&i;
      proc sort;
         by &rs&i;
   %end;
%mend extstres;
%extstres
* 5. Calculate the expected values of normal order statistics (u).
     Harter, H.L. (1961) "Expected values of normal order
     statistics," Biometrika 48, 151-165;
data exvano;
   n=0;
   n=symget('n');
   alpha1=.315065+(.057974*(log10(n)))-(.009776*((log10(n))**2));
   alpha2=.327511+(.058212*(log10(n)))-(.007909*((log10(n))**2));
   do i=1 to n;
      if i ne 1 then go to between;
      do;
         u=probit((i-alpha1)/(n-(2*alpha1)+1)); output;
      end;
      between:
      if i>1 and i<n then
      do;
         u=probit((i-alpha2)/(n-(2*alpha2)+1)); output;
      end;
      if i=n then
      do;
         u=probit((i-alpha1)/(n-(2*alpha1)+1)); output;
      end;
   end;
   keep u;
proc sort;
   by u;
* 6. Put together the expected, observed and simulated values;
data uestr;
   merge exvano nr0 nr1 nr2 nr3 nr4 nr5 nr6 nr7 nr8 nr9 nr10 nr11
         nr12 nr13 nr14 nr15 nr16 nr17 nr18 nr19;
* 7. Order the elements within each observation of the m
     simulated residuals. Use the bubble sort;
proc iml;
use uestr;
read all into a;
start orderows;
   r=nrow(a);
   c=ncol(a);
   do i=1 to r;
      sswitch=0;
      do while (sswitch=0);
         ffswitch=0;
         k=3;
         m=c-1;
         do while (k <= m);
            j=k+1;
            b=a[i,k]; d=a[i,j];
            if b > d then
            do;
               temp=a[i,k];
               a[i,k]=a[i,j];
               a[i,j]=temp;
               ffswitch=1;
               k=k+1;
            end;
            else
               k=k+1;
         end;
         if ffswitch=0 then sswitch=1;
      end;
   end;
finish;
run orderows;
* 8. Get the resistant limits and concatenate them to the data
     set;
ar=a[,3:21];
m=ncol(a)-2;
f=m/4+5/12;
ft=int(f);
start quartile;
   if f=ft then do;
      tlow=ar[,f];
      i=m-f+1;
      thigh=ar[,i];
   end;
   else do;
      i=ft+1;
      tlow=0.5*(ar[,ft]+ar[,i]);
      i=m-ft+1;
      j=m-ft;
      thigh=0.5*(ar[,i]+ar[,j]);
   end;
finish;
run quartile;
fdiff=thigh-tlow;
tl=tlow-(1.5*fdiff);
th=thigh+(1.5*fdiff);
t=a||tl;
t=t||th;
* 9. Output the final data set;
create envelope from t;
append from t;
quit;
goptions device=hplj5p2;
title1 c=yellow f=triplex h=.23 in 'Figure 1';
title2 c=yellow f=triplex h=.20 in 'Atkinson''s and resistant envelopes';
footnote c=yellow f=triplex h=.15 in j=l '. Atkinson + resistant';
symbol1 c=cyan f=swissb v=O;
symbol2 c=magenta i=sm40 l=33 f=triplex v=.;
symbol3 c=magenta i=sm40 l=33 f=triplex v=.;
symbol4 c=red i=sm40 l=1 f=triplex v=+;
symbol5 c=red i=sm40 l=1 f=triplex v=+;
axis1 minor=(n=4)
      label=(a=90 f=triplex h=.17 in 'Studentized residuals')
      value=(f=triplex);
axis2 minor=(n=3)
      label=(f=triplex h=.17 in 'Expected normal order statistics')
      value=(f=triplex);
proc gplot data=envelope;
   plot col2*col1=1
        col3*col1=2
        col21*col1=3
        col22*col1=4
        col23*col1=5
        /overlay
        vaxis=axis1
        haxis=axis2;
run;
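For readers without SAS, step 5 of the listing (the approximation to the expected values of normal order statistics, credited in the listing's comment to Harter, 1961) can be sketched in Python. The constants are copied from the listing; the function name `expected_normal_order_stats` is hypothetical, and `statistics.NormalDist().inv_cdf` from the standard library stands in for the SAS `probit` function.

```python
import math
from statistics import NormalDist

def expected_normal_order_stats(n):
    """Approximate expected values of the n standard normal order statistics."""
    lg = math.log10(n)
    # One constant for the two extreme order statistics, another for the rest.
    alpha_end = 0.315065 + 0.057974 * lg - 0.009776 * lg ** 2
    alpha_mid = 0.327511 + 0.058212 * lg - 0.007909 * lg ** 2
    probit = NormalDist().inv_cdf  # standard normal quantile function
    u = []
    for i in range(1, n + 1):
        alpha = alpha_end if i in (1, n) else alpha_mid
        u.append(probit((i - alpha) / (n - 2 * alpha + 1)))
    return u

u = expected_normal_order_stats(28)
print(round(u[0], 3), round(u[-1], 3))  # smallest and largest plotting positions
```

The returned values are symmetric about zero and increasing, so they can be used directly as the horizontal coordinates of the normal probability plot.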
Figure 1. Atkinson's and resistant envelopes. Vertical axis:
studentized residuals; horizontal axis: expected normal order
statistics. Legend: . Atkinson, + resistant.