Program to generate Atkinson's and resistant envelopes for normal probability plots of regression residuals

Rafael Flores, Instituto de Nutrición de Centro América y Panamá (INCAP)
Virginia F. Flack, University of California at Los Angeles (UCLA)

Introduction

Normal probability plots of residuals have been suggested as tools to evaluate the normality assumption for regression residuals, and they can be used to detect one or more bad data values (Daniel, 1959). The extent to which these problems can be detected depends on the data set and on the investigator's interpretative skills. Draper and Smith (1981) insist that, to gain experience in making decisions from normal probability plots, users should train themselves on sample plots of different sizes similar to those provided by Daniel and Wood (1980).

The magnitude of the residuals' fluctuations is a function of the location of the design points for the regression. Atkinson (1981) introduced a simulation-based method that produces a reference set of fluctuations for these plots and therefore provides guidance as to what sort of variability can be expected in a normal probability plot of regression residuals. BMDP (1987) implemented this method of placing envelopes on the plot in its P2R program. The idea of using simulation as an aid for the interpretation of a particular residual plot is to give the statistician a basis for comparing the observed plot with an "expected" plot, where that comparison is derived from residuals generated from an appropriately chosen error distribution.

Flack and Flores (1989) studied the envelope's stability properties and joint residual vector inclusion levels, as well as alternative resistant techniques for creating envelopes. They recommended one of these resistant versions, which showed good stability and good sensitivity to outlying residuals.
This paper presents a SAS program that produces both Atkinson's and resistant envelopes for normal probability plots of regression residuals.

Notation

We have the regression model $Y = X\beta + \varepsilon$, with $Y_{n \times 1}$, $X_{n \times p}$ and $\varepsilon_{n \times 1}$. The error terms $\varepsilon_i$, $i = 1, \ldots, n$, are assumed to be independent, identically distributed (i.i.d.) according to $N(0, \sigma^2)$. The externally studentized $i$th residual is

\[ r_i = \frac{y_i - \hat{y}_i}{s_{(i)} \sqrt{1 - h_i}} \]

where $y_i$ is the $i$th observed y value, $\hat{y}_i$ is its ordinary least squares estimator, $s_{(i)}$ is the estimator of $\sigma$ when the $i$th observation has been omitted in the estimation, and $h_i$ is the $i$th diagonal element of the hat matrix, $X(X'X)^{-1}X'$.

Atkinson's envelope algorithm

Generate M $n \times 1$ vectors Z from the $N(0,1)$ distribution. For each of these vectors fit the model $Z = X\beta + \varepsilon$ to obtain M simulated vectors $r^*_Z$ of externally studentized residuals. Order the elements of each simulated residual vector and let $i$ index the order statistics within each residual vector. For each $i = 1, \ldots, n$, select $l_i = \min r^*_{Z(i)}$ and $u_i = \max r^*_{Z(i)}$, where the minimum and maximum are taken over the M simulated vectors. These lower and upper values for the $i$th order statistic of the M simulated residual vectors form the lower and upper edges of the diagnostic envelope, respectively.

Atkinson's suggestion of M = 19 gives envelope boundaries which are estimates of the 5th and 95th percentiles of the distribution of the $i$th order statistic of the externally studentized residual vector, given X and i.i.d. $N(0, \sigma^2)$ errors. This method is restricted because the envelope limits for the $i$th ordered residual are estimates of a $100k/(M+1)$ percentile of that residual's appropriate distribution, with $k$ being an integer $\le M$. It also depends on extreme order statistics among the simulated residuals, which makes the procedure susceptible to creating quite variable envelopes.

Resistant envelope algorithm

From the set of M simulated $i$th order statistics among the externally studentized residuals, we define the $i$th lower and upper diagnostic envelope bounds as (Flack and Flores, 1989):

\[ l_i = F_{L,i} - 1.5\,(F_{U,i} - F_{L,i}), \qquad u_i = F_{U,i} + 1.5\,(F_{U,i} - F_{L,i}) \]

where $F_{L,i}$ and $F_{U,i}$ are the lower and upper fourths of the M $i$th order statistics. For each $i$, these fourths are the $f$ and $M+1-f$ order statistics, respectively, of the M simulated values of $\{r^*_{Z(i)}\}$. The ideal $f$ of Hoaglin and Iglewicz (1987) is used for estimating the fourths; it is the value $f = M/4 + 5/12$. When $f$ is not an integer, the program listed below averages the two order statistics at depths int(f) and int(f) + 1. This resistant method gives more stable bound estimates for a given number of simulations, and has a more flexible range of levels of joint Gaussian residual vector inclusion than Atkinson's method (Flack and Flores, 1989).

Application

A SAS program with curve smoothing was developed to implement the algorithms described above; it is listed below. The program was applied to the salinity data set reported by Ruppert and Carroll (1980) and analyzed in detail by Atkinson (1985). This data set comprises 28 observations on the salinity of water during the spring in Pamlico Sound, North Carolina. The response (y) is the bi-weekly average of salinity. There are three explanatory variables: the salinity in the previous two-week time period (x1), a dummy variable for the time period during March-April (x2) and the river discharge (x3). The first-order multiple regression model was fitted.

The full normal probability plot in Figure 1 shows that four observations lie on Atkinson's envelope or slightly outside it, so those points should be treated with caution. In contrast, Atkinson (1985) did not see any particular feature of interest in the half-normal plot of the externally studentized residuals. This could be attributed to the fact that full normal plots exclude residuals more frequently than half-normal plots (Flack and Flores, 1989). The resistant envelope shows clearly that one observation is quite far outside the limits and that more checking should be done. This observation is the one identified by Atkinson (1985) using Cook's distance.
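The resistant limits discussed above come from the fourth-based rule in the resistant envelope algorithm. As a minimal sketch of that rule, written in Python purely for illustration, the functions below take the M simulated values of a single order statistic and return the Atkinson and resistant limits. The function names are hypothetical, and the averaging of adjacent order statistics for a non-integer f follows what the SAS listing does rather than anything stated in the algorithm description.

```python
from fractions import Fraction


def fourth_depth(m):
    """Depth f = M/4 + 5/12 (Hoaglin and Iglewicz, 1987), kept exact."""
    return Fraction(m, 4) + Fraction(5, 12)


def fourths(values):
    """Lower and upper fourths of the M simulated values of one order statistic.

    When f is not an integer, the order statistics at depths int(f) and
    int(f) + 1 are averaged, counting from each end (as in the listing).
    """
    x = sorted(values)
    m = len(x)
    if m < 3:
        raise ValueError("need at least 3 simulated values")
    f = fourth_depth(m)
    ft = int(f)  # integer part of f
    if f == ft:
        return x[ft - 1], x[m - ft]
    lower = 0.5 * (x[ft - 1] + x[ft])          # depths ft and ft + 1 from below
    upper = 0.5 * (x[m - ft] + x[m - ft - 1])  # the same depths from above
    return lower, upper


def resistant_bounds(values, k=1.5):
    """Resistant limits: fourths pushed outward by k times the fourth-spread."""
    lower, upper = fourths(values)
    spread = upper - lower
    return lower - k * spread, upper + k * spread


def atkinson_bounds(values):
    """Atkinson limits: the extremes of the simulated values."""
    return min(values), max(values)


# Toy input: the integers 1..19 stand in for M = 19 simulated values.
toy = list(range(1, 20))
print(fourth_depth(19))        # 31/6
print(resistant_bounds(toy))   # (-8.0, 28.0)
print(atkinson_bounds(toy))    # (1, 19)
```

With M = 19 the depth is f = 31/6, so the lower fourth averages the 5th and 6th ordered values and the upper fourth averages the 14th and 15th. Because f = (3M + 5)/12 and 3M + 5 is never a multiple of 12, the averaging branch is the one that runs for every integer M.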
There is no suggestion of plot curvature associated with skewness. When the Shapiro-Wilk test is applied to the externally studentized residuals (Wetherill et al., 1986), we do not reject the normality assumption at the 5% level (p = .085).

Conclusions

A SAS program is available to produce Atkinson's and resistant envelopes for normal probability plots of regression residuals. We warn against using the diagnostic envelopes as pure acceptance/rejection indicators for testing the assumption of normal errors. The resistant alternative to the envelope generation method proposed by Atkinson shows promise. The envelope edges can highlight patterns such as one extreme point, which may fall outside the bounds and/or force nearby points outside their diagnostic limits.

Acknowledgements

This work was supported in part by the Instituto de Nutrición de Centro América y Panamá (INCAP).

Notes

Rafael Flores: INCAP, Apartado Postal 1188, Guatemala, Guatemala. (502)-2-723762.
Virginia F. Flack: Department of Biostatistics, School of Public Health, UCLA, Los Angeles CA 90024-1772, USA. (213) 825-5250.
SAS, SAS/STAT, SAS/IML and SAS/GRAPH are registered trademarks of SAS Institute Inc., Cary, NC, USA.

References

1. Atkinson, A.C. (1981) "Two graphical displays for outlying and influential observations in regression," Biometrika 68, 13-20.
2. Atkinson, A.C. (1985) Plots, transformations and regression, Clarendon Press, Oxford.
3. Daniel, C. (1959) "Use of half-normal plots in interpreting factorial two level experiments," Technometrics 1, 311-341.
4. Daniel, C. and F.S. Wood (1980) Fitting equations to data, 2nd edition, Wiley.
5. Draper, N.R. and H. Smith (1981) Applied regression analysis, 2nd edition, Wiley.
6. Flack, V.F. and R. Flores (1989) "Using simulated envelopes in the evaluation of normal probability plots of regression residuals," Technometrics 25, 219-230.
7. Hardwick, J. (1987) "2R diagnostics," BMDP Communications 19, 12-24.
8. Hoaglin, D.C. and B. Iglewicz (1987) "Fine-tuning some resistant rules for outlier labeling," JASA 82, 1147-1149.
9. Ruppert, D. and R.J. Carroll (1980) "Trimmed least squares estimation in the linear model," JASA 75, 828-38.
10. Wetherill, G.B., P. Duncombe, M. Kenward, J. Kollerstrom, S.R. Paul and B.J. Vowden (1986) Regression analysis with applications, Chapman and Hall.

Program

    * 1. Fix the sample size (n) and the matrix of predictors to generate
         a set of estimated residuals with a fixed covariance matrix;
    data yx;
      set sugi.salinity end=eof;
      if eof then call symput('n', left(put(_n_, 2.)));

    * 2. Generate m=19 standard normal pseudo-random nx1 vectors
         denoted: n1, n2, ..., nm;
    %let m=19;
    %let seed=5638;
    %macro deviates;
      data normal;
        tempseed=&seed;
        %do i=1 %to &n;
          %do j=1 %to &m;
            call rannor(tempseed, n&j);
          %end;
          drop tempseed;
          output;
        %end;
    %mend deviates;
    %deviates
    data yxnl;
      merge yx normal;

    * 3. Compute the least squares regression of Y on X. Save and sort
         the observed externally studentized residuals (estr0);
    %let k=3;
    %let j=0;
    %macro extstres;
      %let ou=nr;
      %let rs=estr;
      proc reg data=yxnl;
        model y=x1-x&k / noprint;
        output out=&ou&j rstudent=&rs&j;
      data &ou&j;
        set &ou&j;
        keep &rs&j;
      proc sort;
        by &rs&j;
      * 4. Compute the regression of ni on X. Save and sort the
           simulated externally studentized residuals (estri);
      %do i=1 %to &m;
        proc reg data=yxnl;
          model n&i=x1-x&k / noprint;
          output out=&ou&i rstudent=&rs&i;
        data &ou&i;
          set &ou&i;
          keep &rs&i;
        proc sort;
          by &rs&i;
      %end;
    %mend;
    %extstres

    * 5. Calculate the expected values of normal order statistics (u).
         Harter, H.L. (1961) "Expected values of normal order statistics,"
         Biometrika 48, 151-165;
    data exvano;
      n=0;
      n=symget('n');
      alpha1=.315065+(.057974*(log10(n)))-(.009776*((log10(n))**2));
      alpha2=.327511+(.058212*(log10(n)))-(.007909*((log10(n))**2));
      do i=1 to n;
        * alpha1 applies to the two extreme order statistics, alpha2 to the rest;
        if i=1 or i=n then alpha=alpha1;
        else alpha=alpha2;
        u=probit((i-alpha)/(n-(2*alpha)+1));
        output;
      end;
      keep u;
    proc sort;
      by u;

    * 6. Put together the expected, observed and simulated values;
    data uestr;
      merge exvano nr0 nr1 nr2 nr3 nr4 nr5 nr6 nr7 nr8 nr9 nr10
            nr11 nr12 nr13 nr14 nr15 nr16 nr17 nr18 nr19;

    * 7. Order the elements within each observation of the m simulated
         residuals. Use the bubble sort;
    proc iml;
      use uestr;
      read all into a;
      start orderows;
        r=nrow(a);
        c=ncol(a);
        do i=1 to r;
          sswitch=0;
          do while (sswitch=0);
            ffswitch=0;
            k=3;
            m=c-1;
            do while (k <= m);
              j=k+1;
              b=a[i,k];
              d=a[i,j];
              if b > d then do;
                temp=a[i,k];
                a[i,k]=a[i,j];
                a[i,j]=temp;
                ffswitch=1;
                k=k+1;
              end;
              else k=k+1;
            end;
            if ffswitch=0 then sswitch=1;
          end;
        end;
      finish;
      run orderows;

      * 8. Get the resistant limits and concatenate them to the data set;
      ar=a[,3:21];
      m=ncol(a)-2;
      f=m/4+5/12;
      ft=int(f);
      start quartile;
        if f=ft then do;
          tlow=ar[,f];
          i=m-f+1;
          thigh=ar[,i];
        end;
        else do;
          i=ft+1;
          * the printed listing reads a[,i] here, which looks like a slip for ar[,i];
          tlow=0.5*(ar[,ft]+ar[,i]);
          i=m-ft+1;
          j=m-ft;
          thigh=0.5*(ar[,i]+ar[,j]);
        end;
      finish;
      run quartile;
      fdiff=thigh-tlow;
      tl=tlow-(1.5*fdiff);
      th=thigh+(1.5*fdiff);
      t=a||tl;
      t=t||th;

      * 9. Output the final data set;
      create envelope from t;
      append from t;
    quit;

    goptions device=hplj5p2;
    title1 c=yellow f=triplex h=.23 in 'Figure 1';
    title2 c=yellow f=triplex h=.20 in 'Atkinson''s and resistant envelopes';
    footnote c=yellow f=triplex h=.15 in j=l '. Atkinson + resistant';
    symbol1 c=cyan f=swissb v=O;
    symbol2 c=magenta i=sm40 l=33 f=triplex v=.;
    symbol3 c=magenta i=sm40 l=33 f=triplex v=.;
    symbol4 c=red i=sm40 l=1 f=triplex v=+;
    symbol5 c=red i=sm40 l=1 f=triplex v=+;
    axis1 minor=(n=4) label=(a=90 f=triplex h=.17 in 'Studentized residuals') value=(f=triplex);
    axis2 minor=(n=3) label=(f=triplex h=.17 in 'Expected normal order statistics') value=(f=triplex);
    proc gplot data=envelope;
      plot col2*col1=1 col3*col1=2 col21*col1=3 col22*col1=4 col23*col1=5
           / overlay vaxis=axis1 haxis=axis2;
    run;

Figure 1. Atkinson's and resistant envelopes. Vertical axis: studentized residuals. Horizontal axis: expected normal order statistics. Legend: "." Atkinson, "+" resistant.
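For readers without SAS, the simulation at the core of the listing (steps 2 to 4, followed by Atkinson's min/max rule) can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not a port of the program: it assumes a hypothetical design with an intercept and a single predictor, so that the hat values have a closed form, and all function names are made up for this sketch. Python's random generator is not SAS's RANNOR, so even with the same seed the numbers will not match Figure 1.

```python
import math
import random


def studentized(x, y):
    """Externally studentized residuals for the fit y = b0 + b1 * x."""
    n, p = len(x), 2  # p counts the intercept and the slope
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    resid = [yi - ybar - b1 * (xi - xbar) for xi, yi in zip(x, y)]
    hat = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]  # diagonal of the hat matrix
    sse = sum(e * e for e in resid)
    out = []
    for e, h in zip(resid, hat):
        # Variance estimate with observation i deleted (leave-one-out identity).
        s2_del = (sse - e * e / (1 - h)) / (n - p - 1)
        out.append(e / math.sqrt(s2_del * (1 - h)))
    return out


def atkinson_envelope(x, m=19, seed=5638):
    """Min and max, over m simulations, of each ordered studentized residual."""
    rng = random.Random(seed)
    sims = []
    for _ in range(m):
        z = [rng.gauss(0.0, 1.0) for _ in x]     # N(0,1) pseudo-responses
        sims.append(sorted(studentized(x, z)))   # order within each vector
    lower = [min(col) for col in zip(*sims)]     # i-th order statistic, all sims
    upper = [max(col) for col in zip(*sims)]
    return lower, upper


def outside(observed, lower, upper):
    """Positions (0-based, in sorted order) of residuals outside the envelope."""
    r = sorted(observed)
    return [i for i, (v, lo, hi) in enumerate(zip(r, lower, upper))
            if v < lo or v > hi]


# Hypothetical design: 28 equally spaced points (not the salinity data).
x = [float(i) for i in range(1, 29)]
lower, upper = atkinson_envelope(x)
print(len(lower), len(upper))
```

To check a fitted model, pass its externally studentized residuals to `outside` together with the two envelope edges; an empty list means every ordered residual lies inside the envelope. Replacing the min/max step by the fourth-based limits gives the resistant version.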