Normal Probability plot
Shibdas Bandyopadhyay
[email protected]
Indian Statistical Institute
Abstract
Normal probability plots are made to verify graphically the normality assumption for data from a univariate population that are mutually independent and identically distributed. The normal probability plot is a very common option in most statistical packages. In the context of design of experiments or regression, though the observations are assumed to be mutually independent and homoscedastic, they have different unknown expectations, so the raw data are inappropriate for a normality check. To overcome the problem of unequal expectations, it is common to use the residuals of a fitted regression model. The residuals have zero expectation, but they are heteroscedastic and also mutually dependent; it is thus inappropriate to use the residuals for a normality check. In this study, mutually independent homoscedastic components with zero mean are extracted from the residuals through principal component analysis; these are then used for the normal probability plot. The technique is illustrated with data.
Key words and phrases: Normal probability plot, principal component analysis.
AMS (1991) subject classification: 62P.
1. Introduction
Let $Y_1, Y_2, \ldots, Y_n$ be mutually independent with common mean $\mu$ and standard deviation $\sigma$. To check graphically whether the data are from a common normal distribution, one plots $Y_{(i)}$, the $i$th order statistic of $Y_1, Y_2, \ldots, Y_n$, against $\Phi^{-1}(c_i)$, $i = 1, 2, \ldots, n$; if the line plot is nearly linear, one is satisfied with the normality assumption. In the plot, $\sigma$ happens to be the slope of the straight line of $Y_{(i)}$ on $\Phi^{-1}(c_i)$; the $c_i$'s are chosen to estimate $\sigma$ 'efficiently' (David and Nagaraja, 2003). The $c_i$'s currently used in statistical packages such as Minitab are (Blom, 1958):

$$c_i = (i - 3/8)/(n + 1/4), \quad i = 1, 2, \ldots, n. \qquad (1.1)$$

The line plot of $Y_{(i)}$ on $\Phi^{-1}(c_i)$ is called the Normal Probability Plot.
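As a concrete illustration, the following is a minimal Python sketch of the plot just described; the sample, the seed, and the helper names `blom_positions` and `normal_probability_plot` are our illustrative assumptions, not from the paper.

```python
# A minimal sketch of the normal probability plot described above.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

def blom_positions(n):
    """Blom's (1958) plotting positions c_i = (i - 3/8)/(n + 1/4), as in (1.1)."""
    i = np.arange(1, n + 1)
    return (i - 3.0 / 8.0) / (n + 1.0 / 4.0)

def normal_probability_plot(y):
    """Plot ordered y against Phi^{-1}(c_i); near-linearity supports normality."""
    y_ordered = np.sort(np.asarray(y))            # Y_(1) <= ... <= Y_(n)
    quantiles = norm.ppf(blom_positions(len(y)))  # Phi^{-1}(c_i)
    plt.plot(quantiles, y_ordered, marker="o")
    plt.xlabel("Phi-inverse(c_i)")
    plt.ylabel("Ordered observations")
    plt.show()

# Example: 25 draws from N(10, 2^2); the plot should be roughly linear with slope near 2.
rng = np.random.default_rng(0)
normal_probability_plot(rng.normal(10.0, 2.0, size=25))
```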
While testing for $\mu$, it is natural to check the normality assumption using a normal probability plot. The use of the normal probability plot to check the normality assumption has been common in other situations also. In this study, we shall consider the use of the normal probability plot to check the normality assumption for the response in the context of regression and design of experiments.
Consider the standard linear regression model:

$$Y = X\beta + \varepsilon \qquad (1.2)$$

where $Y$ is the $n \times 1$ response vector, $X$ is the $n \times p$ design matrix of rank $r \le p$, $\beta$ is the $p \times 1$ vector of unknown parameters, and $\varepsilon$ is the $n \times 1$ unobservable vector of error components; the error components are assumed to be mutually independent and identically distributed with zero mean and standard deviation $\sigma$.
Though the $n$ components of $Y$ are independently distributed with common standard deviation $\sigma$, the components of $Y$ do not have a common mean $\mu$. The $i$th component $Y_i$ of $Y$ has mean $\mu_i = X_i'\beta$, where $X_i'$ is the $i$th row of $X$, $i = 1, 2, \ldots, n$. So a line plot of $Y_{(i)}$ on $\Phi^{-1}(c_i)$ is not meaningful for checking the normality of the $Y_i$'s. It has become standard practice, as in Minitab, to work with $\hat{\varepsilon}$, the $n \times 1$ vector of residuals:

$$\hat{\varepsilon} = Y - X\hat{\beta} \qquad (1.3)$$

and make a line plot of $\hat{\varepsilon}_{(i)}$, the $i$th order statistic of the components of $\hat{\varepsilon}$, on $\Phi^{-1}(c_i)$, with the $c_i$'s given by (1.1).
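A minimal sketch of this step, assuming a design matrix `X` and response vector `y` as in (1.2); `numpy.linalg.lstsq` returns a minimum-norm least-squares solution, one valid choice of $(X'X)^{-}X'Y$.

```python
# A minimal sketch of the residual computation in (1.3).
import numpy as np

def residuals(X, y):
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # beta_hat = (X'X)^- X'Y
    return y - X @ beta_hat                           # eps_hat of (1.3)
```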
We use match factory data (Roy et al., 1959) for illustration. The data are scores of $n = 25$ workers on three psychological tests $U_1, U_2, U_3$, together with their efficiency index $Y$. We fit the regression

$$Y = \beta_1 + \beta_2 U_1 + \beta_3 U_2 + \beta_4 U_3 \qquad (1.4)$$

(with $X_1 \equiv 1$, $X_2 = U_1$, $X_3 = U_2$ and $X_4 = U_3$). The resulting $1 \times 25$ vector of residuals $\hat{\varepsilon}'$ is:
( 3.33  -0.18  -0.88  -3.62  -5.16
 -2.24   0.92   3.42  -0.22  -0.52
 -1.61  -1.37  -1.27   1.31   0.12
  1.16   2.17   0.66   0.88  -3.07
  0.055 -2.28   0.69   3.87   3.84 ).
Fig. 1 is a line plot of $\hat{\varepsilon}_{(i)}$ on $\Phi^{-1}(c_i)$, with $c_i = (i - 3/8)/25.25$, $i = 1, 2, \ldots, 25$.

[Fig. 1: Normal Probability Plot with regression residuals; x-axis: $\Phi^{-1}(c_i)$, y-axis: regression residuals.]
But this line plot of $\hat{\varepsilon}_{(i)}$ on $\Phi^{-1}(c_i)$, with $c_i = (i - 3/8)/(n + 1/4)$, $i = 1, 2, \ldots, n$, is not appropriate for checking the normality of the $Y_i$'s. It is true that, when the mutually independent $Y_i$'s are normally distributed with mean $\mu_i = X_i'\beta$ and common standard deviation $\sigma$, the $\hat{\varepsilon}_i$'s are distributed as normal with mean zero; but their standard deviations are different multiples (depending on $X$) of $\sigma$. Also, the $\hat{\varepsilon}_i$'s are not mutually independent. So one needs a modification (Hocking, 2003).
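This covariance structure can be made concrete with a small sketch. Since $\mathrm{Cov}(\hat{\varepsilon}) = \sigma^2 H$ with $H = I_n - X(X'X)^{-}X'$ (see Section 2), unequal diagonal entries $h_{ii}$ exhibit the heteroscedasticity and non-zero off-diagonal entries the dependence; the design matrix below is an arbitrary simulated example, not the match factory data.

```python
# A small sketch of why raw residuals are unsuitable: Cov(eps_hat) = sigma^2 * H.
import numpy as np

rng = np.random.default_rng(1)
n, p = 25, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # intercept + 3 covariates

# H = I_n - X (X'X)^- X'; pinv supplies a g-inverse, valid even if rank r < p.
H = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T

print(np.round(np.diag(H), 3))  # unequal h_ii: residual variances differ
print(np.round(H[0, 1], 3))     # non-zero off-diagonal: residuals are correlated
```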
This study suggests a natural modification: extracting independent and identically distributed normal components from $\hat{\varepsilon} = Y - X\hat{\beta}$ using principal component analysis. It will not be possible to carry out the suggested modification using statistical tables and calculators; it is computer intensive and is to be done on a PC. One would need a principal component analysis module, which is common in most statistical packages, such as Eigen Analysis in Minitab.
2. Extraction of independent and identically distributed components using
principal component analysis
Consider the regression model $Y = X\beta + \varepsilon$ along with the assumptions stated after (1.2). One may write $\hat{\varepsilon}$, with $\hat{\beta} = (X'X)^{-}X'Y$, as

$$\hat{\varepsilon} = Y - X\hat{\beta} = (I_n - X(X'X)^{-}X')\,Y \equiv HY \qquad (2.1)$$

where $(X'X)^{-}$ is a g-inverse of $X'X$ and $H = I_n - X(X'X)^{-}X'$. It follows that $\hat{\varepsilon}$ has a singular normal distribution (the mass of the joint density of the $n$ components of $\hat{\varepsilon}$ lies in $(n-r)$ dimensions), with zero mean and covariance matrix $\sigma^2 H$, where $\mathrm{rank}(H) = n - r$.
Since $H$ is symmetric and idempotent of rank $(n-r)$, the characteristic roots of $H$ are 1 of multiplicity $(n-r)$ and 0 of multiplicity $r$. Using the spectral decomposition of $H$ we may write

$$H = P \begin{pmatrix} I_{n-r} & 0 \\ 0 & 0 \end{pmatrix} P', \qquad PP' = P'P = I_n.$$
$P$ is a non-stochastic orthogonal matrix and depends only on the design matrix $X$. $P'\hat{\varepsilon}$ has a singular normal distribution; the mass of the joint density of the $n$ components of $P'\hat{\varepsilon}$ lies in $(n-r)$ dimensions, with zero mean and covariance matrix $\sigma^2 \begin{pmatrix} I_{n-r} & 0 \\ 0 & 0 \end{pmatrix}$. Thus, if we write $P = (P^{(1)}\ P^{(2)})$, where $P^{(1)}$ consists of the first $(n-r)$ columns of $P$ (the characteristic vectors corresponding to the $(n-r)$ non-zero characteristic roots of $H$) and $P^{(2)}$ consists of the remaining $r$ columns of $P$, then the $(n-r)$ components of $P^{(1)\prime}\hat{\varepsilon}$ are independent and identically distributed normal with zero mean and standard deviation $\sigma$, while the remaining $r$ components of $P^{(2)\prime}\hat{\varepsilon}$ are identically zero (zero mean and zero variance).
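A minimal sketch of this extraction, assuming `X` and `y` as in (1.2); `numpy.linalg.eigh` returns the characteristic roots in ascending order, and since the roots of $H$ are (numerically) 0 or 1, a 0.5 threshold separates $P^{(2)}$ from $P^{(1)}$.

```python
# A minimal sketch of the extraction step via the spectral decomposition of H.
import numpy as np

def iid_components(X, y):
    """Return the (n - r) components of P^(1)' eps_hat, i.i.d. N(0, sigma^2)."""
    n = X.shape[0]
    H = np.eye(n) - X @ np.linalg.pinv(X.T @ X) @ X.T
    roots, P = np.linalg.eigh(H)   # spectral decomposition H = P diag(roots) P'
    P1 = P[:, roots > 0.5]         # eigenvectors for the (n - r) unit roots
    return P1.T @ (H @ y)          # P^(1)' eps_hat, since eps_hat = H y
```

The remaining $r$ projections $P^{(2)\prime}\hat{\varepsilon}$ come out numerically zero, matching the trailing zeros in (2.2) below.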
For the match factory data, $(P'\hat{\varepsilon})'$ is $1 \times 25$:

( 1.84  -0.36   1.30   1.40   0.59  -4.58
 -2.55  -1.93   0.26  -1.56   0.82   2.83
  0.68   2.77   3.97  -2.59  -3.63   2.11
  5.54  -0.86  -0.061  0      0      0      0 ).        (2.2)
Notice that each of the last four components of $P'\hat{\varepsilon}$, i.e. of $P^{(2)\prime}\hat{\varepsilon}$, is 0, as it should be.
Fig. 2 is a line plot of the $i$th order statistic of the 21 components of $P^{(1)\prime}\hat{\varepsilon}$ on $\Phi^{-1}(c_i)$, with $c_i = (i - 3/8)/21.25$ (since $r = p = 4$, $n - r = 21$), $i = 1, 2, \ldots, 21$.
[Fig. 2: Normal Probability Plot with PC of regression residuals; x-axis: $\Phi^{-1}(c_i)$, y-axis: PC of residuals.]
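For completeness, here is a sketch of how a plot in the style of Fig. 2 could be produced, reusing the hypothetical `blom_positions` and `iid_components` helpers from the earlier sketches; `X` and `y` are assumed inputs.

```python
# A sketch of a Fig. 2-style plot on the (n - r) extracted components.
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

z = iid_components(X, y)    # the (n - r) extracted components
m = len(z)                  # m = n - r (21 in the example above)
plt.plot(norm.ppf(blom_positions(m)), np.sort(z), marker="o")
plt.xlabel("Phi-inverse(c_i)")
plt.ylabel("PC of residuals")
plt.show()
```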
We do not wish to compare the two figures. We only want to point out that the suggested analysis with principal components is an appropriate method and is not difficult to implement in packages that have an eigen analysis module.
References
Blom, G. (1958). Statistical Estimates and Transformed Beta-Variables. Wiley, New York.
David, H. A. and Nagaraja, H. N. (2003). Order Statistics. Wiley-Interscience.
Hocking, R. R. (2003). Methods and Applications of Linear Models. Wiley-Interscience.
Roy, J., Chakravarty, I. M. and Laha, R. G. (1959). Handbook of Methods of Applied Statistics, Vol. 1. John Wiley & Sons, Inc.