PROBABILITY PLOTTING IN SAS
Daniel M. Chilko, West Virginia University
Gerry Hobbs, West Virginia University
E. James Harner, West Virginia University
Introduction
Bar charts or histograms are the simplest and most frequently used graphical representation of data. They reveal many properties of the data: the range of data values, the number of modes, whether the distribution is symmetric or skewed, the existence of outliers. Although bar charts reveal the general shape of the distribution, it is sometimes difficult to determine whether or not the data can be viewed as a sample from some hypothesized distribution. Probability plots are a graphical representation of data that focus on the distributional aspects of data.

Bar charts are easy to produce in SAS using PROC CHART and PROC GCHART. Probability plots are also easy to produce using SAS.

Probability plots

A random variable, x, is characterized by its distribution function F(x). For example, the graph of F(x) for a random variable with a standard normal distribution is shown in Figure 1. This graph can be turned into a straight line by transforming the x axis to F(x) (both axes would be probabilities) or by transforming the F(x) axis to x (both axes would be quantities).

A sample distribution function can be produced in SAS using PROC RANK with the PERCENT option and PROC PLOT. See Figure 2. For data from symmetric distributions, this function is characteristically S-shaped and its point of inflection makes it difficult to work with.

If the probability axis is transformed, the plot is now a probability plot. That is, a probability plot scales the probability axis of a sample distribution function according to some probability distribution such that, if we chose the correct distribution, the resulting plot is more or less a straight line.
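The sample distribution function can be sketched outside SAS as well. The Python fragment below is only an illustration of the idea, not the authors' SAS code: the function name and the data are hypothetical, and ties are given consecutive ranks rather than any special treatment.

```python
def sample_distribution(values):
    """Return (x, percent) pairs for the sample distribution function."""
    ordered = sorted(values)
    n = len(ordered)
    # The i-th smallest value (1-based) has i/n of the sample at or below it.
    return [(x, 100.0 * i / n) for i, x in enumerate(ordered, start=1)]


# Hypothetical measurements, used only to show the shape of the output.
data = [4.1, 2.7, 3.3, 5.0, 3.9]
for x, pct in sample_distribution(data):
    print(f"{x:5.1f} {pct:6.1f}")
```

Plotting the percent column against the ordered values gives the S-shaped curve described above.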
[Figure 1: Normal Distribution Function]

[Figure 2: Sample Cumulative Distribution Function (Percentage Copper)]
If $x_1 \le x_2 \le \dots \le x_n$ are the ordered observations from a sample of size n, the scaling of the axis is achieved by finding a set of values $y_1, y_2, \dots, y_n$ (say) such that

$$F(y_i) = p_i$$

where the $p_i$'s denote appropriately chosen fractions of the distribution corresponding to the $x_i$'s. Plotting the pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ should result in a straight line if the $x_i$'s are a sample from a distribution having a distribution function F(x).

Since it seems reasonable to consider the data as dependent upon the distribution function, probability plots are usually constructed with the variable of interest as the vertical axis and the probability distribution scaled values as the horizontal axis.

Kimball (1960) investigated the question of how to choose the values of $p_i$, for given size n, for use in probability plots.
Some commonly used values are

$$p_i = (i - .375)/(n + .25)$$

$$p_i = (i - .5)/n$$

$$p_i = i/(n + 1)$$

For illustration, we chose $p_i = i/(n + 1)$.

The problem of scaling the probability axis now becomes the problem of finding the inverse of the distribution function. That is,

$$y_i = F^{-1}(i/(n + 1))$$

Furthermore this scaling is independent of any scale and location parameters, so that the scaling reduces to finding the inverse of the distribution function of a "standardized" random variable. For example, if x has a probability distribution with location parameter $\lambda$ and a scale parameter $\beta$, a plot of the points $(x_1, z_1), (x_2, z_2), \dots, (x_n, z_n)$, where $z_i = (y_i - \lambda)/\beta$, would still result in a straight line.

Since construction of probability plots is independent of the estimation of scale and location parameters, one potential use is the estimation of these parameters from the plot itself (Ferrell, 1958). Interest may instead focus on the estimation of a particular percentile of the distribution. Estimation of parameters from a probability plot requires fitting a straight line to the plot. Harner et al. (1981) described the estimation of the 99th percentile of a distribution using various regression techniques to fit the straight line using SAS.

Normal probability plots

Normal probability plots provide an informal test of the hypothesis that a sample comes from a normal distribution and can be produced in SAS by using PROC RANK with the NORMAL option and PROC PLOT. Figure 3 is a normal probability plot for the percentage of copper in 12 samples from the Liberty Bell.

[Figure 3: Normal Probability Plot]
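The construction described above (choose fractions, invert the standard normal distribution function, pair the result with the ordered data) can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' SAS program: the rule labels `three_eighths`, `half`, and `n_plus_1`, the function names, and the sample values are all hypothetical.

```python
from statistics import NormalDist


def plotting_positions(n, rule="n_plus_1"):
    """Return the fractions p_1, ..., p_n for one of the three rules in the text."""
    idx = range(1, n + 1)
    if rule == "three_eighths":   # p_i = (i - .375)/(n + .25)
        return [(i - 0.375) / (n + 0.25) for i in idx]
    if rule == "half":            # p_i = (i - .5)/n
        return [(i - 0.5) / n for i in idx]
    if rule == "n_plus_1":        # p_i = i/(n + 1), the rule chosen for illustration
        return [i / (n + 1) for i in idx]
    raise ValueError(f"unknown rule: {rule}")


def normal_plot_points(values, rule="n_plus_1"):
    """Pair each ordered observation x_i with y_i = F^{-1}(p_i), F standard normal."""
    xs = sorted(values)
    std = NormalDist()  # mean 0, standard deviation 1
    ys = [std.inv_cdf(p) for p in plotting_positions(len(xs), rule)]
    # Scaled values go on the horizontal axis, the data on the vertical axis.
    return list(zip(ys, xs))


# Hypothetical sample, used only to show the output format.
sample = [12.1, 9.8, 10.4, 11.3, 10.9, 9.2, 10.0]
for y, x in normal_plot_points(sample):
    print(f"{y:7.3f} {x:6.1f}")
```

Because the scaling does not depend on location or scale, the standard normal is sufficient here; a sample with a different mean or variance changes only the intercept and slope of the line.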
Many statistical procedures assume the data to have a normal distribution. It is true that random variables having normal or near normal distributions occur quite often in nature, perhaps because the normal distribution is the limiting distribution of a random variable which represents the sum of a series of independent and identically distributed random variables.

Interpretation of non-linear plots
Quite often when data is plotted on a particular probability plot, the plot does not appear too straight. Abbot (1960) and King (1965) have investigated non-linear plots. Their studies show that many such plots have a simple and straightforward explanation.
Figure 4 shows a good fit in the middle of the plot but that the plot tends to flatten out at each end. A scarcity of values at the high end usually indicates an inspection and selection process that removes unacceptable values. A scarcity of values at the low end may indicate selection to a minimum specification or measuring equipment which may not have resolution below some particular value.
[Figures 4, 5, and 6: Normal Probability Plots]
A convex plot usually indicates a left-skewed distribution. A concave plot indicates a right-skewed distribution. See Figure 6. A log-normal probability plot is a good next step for this pattern. See Figure 7.
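A log-normal probability plot is a normal probability plot of the logarithms of the data. The Python sketch below compares the two scalings numerically. The correlation between ordered values and normal scores is used here only as a rough stand-in for judging straightness by eye; it is an assumption of this example and not a technique taken from the paper, and the data and function names are hypothetical.

```python
import math
from statistics import NormalDist


def normal_scores(n):
    """Standard normal quantiles at p_i = i/(n + 1)."""
    std = NormalDist()
    return [std.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]


def straightness(values):
    """Pearson correlation between the ordered values and their normal scores."""
    y = sorted(values)
    z = normal_scores(len(y))
    my = sum(y) / len(y)
    mz = sum(z) / len(z)
    syz = sum((a - my) * (b - mz) for a, b in zip(y, z))
    syy = sum((a - my) ** 2 for a in y)
    szz = sum((b - mz) ** 2 for b in z)
    return syz / math.sqrt(syy * szz)


# Hypothetical right-skewed measurements (all positive, so logs are defined).
data = [1.2, 1.9, 2.4, 3.1, 4.0, 5.3, 7.4, 11.0, 19.5]
print("normal plot:    ", round(straightness(data), 4))
print("log-normal plot:", round(straightness([math.log(v) for v in data]), 4))
```

A value closer to 1 on the log scale suggests that the log-normal plot is the straighter of the two.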
A plot characterized by two fairly straight portions connected by an S-shaped connection indicates a bimodal distribution. See Figure 5. The detection of two sources for the data when only one is expected can be an important benefit.
[Figure 7: Log-normal Probability Plot]

[Figure 8: Normal Probability Plot. Percentage Copper, 12 Samples from Liberty Bell]
Reference lines

An additional aid to the interpretation of a normal probability plot is a reference line which corresponds to a normal distribution with a specified mean and variance. PROC MEANS can be used to produce a data set containing the usual moment estimates. A short DATA step that processes this data set makes it easy to add reference lines to probability plots in SAS. See Figure 8.

The sample mean and variance are not robust and estimates of scale and location parameters based on order statistics are more useful when outliers are present in the data. Hillyer (1978) investigated the use of moment and quantile estimators in a context similar to probability plotting. The present authors duplicated his results using SAS. PROC SORT, for example, produces order statistics.

Half-normal probability plots

When a random variable has a normal distribution with mean zero, the absolute value of this random variable is said to have a half-normal distribution. In the linear regression framework

$$Y = XB + E$$

E is usually assumed to be a vector of independently and identically distributed random variables, each normally distributed with mean zero and constant variance.

In a regression analysis, the differences between observed and predicted values are called residuals. That is,

$$e = Y - \hat{Y}$$

where $\hat{Y} = Xb$ and b are the least squares estimates. If the underlying model assumptions are true, then the e's have normal distributions, each with mean zero. They do not, in general, have the same variance nor are they independently distributed.

Half-normal probability plots provide an informal test of the normality of the residuals in a regression analysis. Half-normal plots show more sensitivity to kurtosis at the expense of not revealing skewness. A detailed discussion of producing half-normal probability plots in SAS was given by Sall (1978). A useful exposition on the interpretation of half-normal plots was given by Daniel and Wood (1971).
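The half-normal scaling follows from the definition above: if Z is standard normal, then P(|Z| <= h) = 2*Phi(h) - 1, so the quantile at fraction p is Phi^{-1}((1 + p)/2). The Python sketch below applies this to a set of residuals. It is an illustration only; the residuals and the function name are hypothetical, and the fractions i/(n + 1) are carried over from the earlier illustration rather than prescribed for this plot.

```python
from statistics import NormalDist


def half_normal_plot_points(residuals):
    """Pair ordered absolute residuals with half-normal quantiles."""
    absolute = sorted(abs(r) for r in residuals)
    n = len(absolute)
    std = NormalDist()
    points = []
    for i, a in enumerate(absolute, start=1):
        p = i / (n + 1)
        # Invert P(|Z| <= h) = 2*Phi(h) - 1 = p for h.
        h = std.inv_cdf((1.0 + p) / 2.0)
        points.append((h, a))  # (horizontal, vertical)
    return points


# Hypothetical regression residuals.
residuals = [-1.4, 0.3, 0.9, -0.2, 2.1, -0.7, 0.5, -1.0]
for h, a in half_normal_plot_points(residuals):
    print(f"{h:6.3f} {a:5.2f}")
```

Taking absolute values folds the two tails together, which is why skewness is no longer visible in the resulting plot.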
Gamma probability plots

While the normal distribution is of singular importance in statistics, the gamma distribution is also encountered frequently. The general gamma distribution depends on a location, scale, and shape parameter. The general gamma distribution can be transformed to a standardized distribution with only a shape parameter. The chi-square and exponential distributions are special cases of the gamma distribution. A chi-square random variable with degrees of freedom d is a gamma random variable with shape parameter equal to d/2; an exponential random variable is a gamma distribution with shape parameter equal to 1. Wilk et al. (1962) described the construction and interpretation of gamma probability plots. The SAS function GAMINV can be used to produce gamma probability plots.
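For readers without SAS, the role played by an inverse gamma function can be sketched in Python. The fragment below is a stand-in written for this illustration, not a description of how GAMINV works internally: it evaluates the standardized gamma distribution function by its power series and inverts it by bisection. All function names are hypothetical, and the series is only intended for moderate arguments.

```python
import math


def gamma_cdf(x, shape):
    """Distribution function of the standardized gamma, via its power series."""
    if x <= 0:
        return 0.0
    term = 1.0 / shape
    total = term
    k = 1
    # Terms eventually shrink because x/(shape + k) falls below 1.
    while term > 1e-16 * total:
        term *= x / (shape + k)
        total += term
        k += 1
    return total * math.exp(-x + shape * math.log(x) - math.lgamma(shape))


def gamma_quantile(p, shape):
    """Invert gamma_cdf by bisection for 0 < p < 1."""
    lo, hi = 0.0, 1.0
    while gamma_cdf(hi, shape) < p:  # widen the bracket until it contains p
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if gamma_cdf(mid, shape) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def gamma_scores(n, shape):
    """Scaled values y_i = F^{-1}(i/(n + 1)) for a gamma probability plot."""
    return [gamma_quantile(i / (n + 1), shape) for i in range(1, n + 1)]


print([round(y, 3) for y in gamma_scores(5, 2.5)])
```

With shape 1 the scores reduce to exponential quantiles, and with shape d/2 they are proportional to chi-square quantiles on d degrees of freedom, which matches the special cases listed above.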
Exponential probability plot

The exponential distribution is often used to characterize failure or waiting time distributions. Figure 9 is an exponential probability plot produced by SAS for the waiting times between major train wrecks in the U.S. during the period from 1900 to 1960.
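For the exponential distribution the inverse has a closed form: the standardized distribution function is F(y) = 1 - exp(-y), so F^{-1}(p) = -ln(1 - p). The Python sketch below builds the plot coordinates and fits a line through the origin as one simple way to read a scale estimate off the plot. The waiting times are hypothetical (they are not the train wreck data), and the through-origin fit is an assumption of this example, since the paper does not say which fit was used.

```python
import math


def exponential_plot_points(waiting_times):
    """Pair ordered waiting times with exponential quantiles at i/(n + 1)."""
    xs = sorted(waiting_times)
    n = len(xs)
    ys = [-math.log(1.0 - i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(ys, xs))  # (horizontal, vertical)


def slope_through_origin(points):
    """Least squares slope of x on y with no intercept; estimates the scale."""
    sxy = sum(y * x for y, x in points)
    syy = sum(y * y for y, _ in points)
    return sxy / syy


# Hypothetical waiting times in days.
times = [12, 40, 95, 7, 210, 63, 150, 28]
points = exponential_plot_points(times)
print(round(slope_through_origin(points), 1))
```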
Chi-square probability plot

Sample variances or mean squares from a normal population have a chi-square distribution. A chi-square probability plot can be used to provide an informal test of the homogeneity of sample variances. Figure 10 is a chi-square probability plot for the sample variances of the amount of nitrogen in 5 red clover plants inoculated with 6 different bacteria strains.
[Figure 9: Exponential Probability Plot]

[Figure 10: Chi-square Probability Plot. The plot title, which describes the nitrogen content of red clover plants, is only partly legible.]
References

Abbot, W.H. (1960), Probability Charts, Private publication, St. Petersburg, FL.

Daniel, C. and F. Wood (1971), Fitting Equations to Data, John Wiley and Sons, New York, NY.

Ferrell, E.B. (1958), Plotting Experimental Data on Normal or Log-normal Probability Paper, Industrial Quality Control, 15, pp. 12-15.

Hanson, V.F., J.H. Carlson, K.M. Papauchado, and N.A. Nielson (1976), The Liberty Bell: Composition of the Famous Failure, American Scientist, 64, pp. 614-619.

Harner, E.J., G.H. Hobbs, E.C. Keller Jr., A.G. Everett, and D.M. Chilko (1981), Assessing Estimates of the 99th Percentile of a Distribution, Environmetrics Proceedings, (to appear).

Hillyer, M.J. (1978), Evaluation of the Effect of Distributional Assumptions on Statistical Forms of the Photochemical Oxidant Standard, Systems Applications, Inc., San Rafael, CA.

Kimball, B.F. (1960), On the Choice of Plotting Positions on Probability Paper, Journal of American Statistical Association, 55, pp. 546-560.

King, J.R. (1965), Graphical Data Analysis with Probability Papers, Technical and Engineering Aids for Management, Lowell, MA.

Sall, J.P. (1978), SAS Regression Applications, SAS Technical Report A-102, SAS Institute, Inc., Cary, NC.

Wilk, M.B., R. Gnanadesikan, and M.J. Huyett (1962), Probability Plotting for the Gamma Distribution, Technometrics, 4, pp. 1-20.