PROC DISCRETE:* A Procedure for Fitting Discrete Probability Distributions
James P. Geaghan, Louisiana State University
Charles E. Gates, Texas A & M University
George D. Williams, Louisiana State University
The DISCRETE procedure fits discrete
probability distributions to count data. The
eight distributions available in the procedure
are the Poisson, negative binomial, positive
binomial, Thomas double Poisson, Neyman type A,
Poisson-binomial, Poisson with zeroes and the
logarithmic distribution with zeroes.
Data may be entered as either raw data or as
classes and frequencies for an observed
distribution. All of the eight distributions,
or any subset, may be fitted to the observed
data set.
The PROC DISCRETE statement
PROC DISCRETE options;
The options and parameters below may be
requested in the PROC DISCRETE statement.
DATA = data set name
The DATA = option is used to specify the data set
to be used by DISCRETE. If DATA = is omitted, the
most recently created data set will be used.
The data set created by the PARMEST option
(described below) contains the BY variables,
estimates of the frequency totals (TOTAL), the
mean (MEAN) and the variance (VARIANCE). The
parameter estimates from the distributions are:
POIS - no additional estimates produced
POIZ - MEAN2, PLAMBDA and PTHETA
POSB - P
NEGB - NBP, NBK
THOM - TM and TLAMBDA
NEYA - M1 and M2
POIB - ALPHA, MU and QUE
LOGZ - LTHETA and LLAMBDA
MINCLASS = class value
MIN = class value
Specifies the minimum class value to be output
(default = 0). MINCLASS affects only the output;
all calculations still include the classes less
than MINCLASS.
NOSIMPLE
NOSIMP
Requests the program to not print the simple
statistics section of the output.
NOINTERMEDIATE
MAXCLASS = class value
MAX = class value
Specifies the maximum class value to be used
(default = largest observed value). MAXCLASS
causes all classes larger than the class value
specified to be pooled in the class specified as
the maximum class. MAXCLASS does not affect the
calculation of the mean and variance unless the
TRUNCATE option is used, but will affect other
calculations.
• NOTE: MAXCLASS must be specified to obtain
  the correct estimate of the probability of
  success (p) for the positive binomial
  distribution when the largest possible
  class does not occur in the data set.
• NOTE: the use of MAXCLASS to pool or
  truncate the upper tail will change
  the fit of some distributions.
NOINT
Requests the program to not print intermediate
results (iteration series or initial values).
NOPRINT
Requests the program to not print any of the
normal output. Notes and error messages are
transferred to the SAS log.
TRUNCATE
TRUNC
When used with a value specified as MAXCLASS=, the
TRUNCATE option causes values in classes exceeding
the maximum specified class to be deleted instead
of being pooled into the maximum class.
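The difference between pooling and truncating the upper tail can be illustrated with a short Python sketch. It is not part of PROC DISCRETE: the function name `apply_maxclass` and the dictionary layout (class value mapped to observed frequency) are assumptions made only for this illustration.

```python
def apply_maxclass(freqs, maxclass, truncate=False):
    """Pool or delete the classes above maxclass.

    freqs maps class value -> observed frequency (hypothetical layout).
    """
    # Classes up to and including maxclass are always kept.
    result = {x: f for x, f in freqs.items() if x <= maxclass}
    if not truncate:
        # Default behaviour: pool the upper tail into the maximum class.
        tail = sum(f for x, f in freqs.items() if x > maxclass)
        result[maxclass] = result.get(maxclass, 0) + tail
    return result


observed = {0: 10, 1: 7, 2: 4, 3: 2, 4: 1}
pooled = apply_maxclass(observed, 2)                     # like MAXCLASS=2
truncated = apply_maxclass(observed, 2, truncate=True)   # like MAXCLASS=2 TRUNCATE
```

With pooling the total frequency is unchanged, whereas truncation discards the observations in the upper tail, which is why the mean and variance change only under TRUNCATE.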
Individual distributions may be fitted by
requesting the following options. If no
distributions are requested, all eight
distributions will be fitted.
POIS - Poisson distribution
POSB - positive binomial distribution
NEGB - negative binomial distribution
THOM - Thomas double Poisson distribution
NEYA - Neyman type A distribution
POIB - Poisson-binomial distribution
POIZ - Poisson distribution with zeroes
LOGZ - logarithmic distribution with zeroes
ITERATIONS =
IT=
Specifies the maximum number of iterations
(default = 100) for all distributions with
iterative solutions.
PRECISION = integer value
P = integer value
Where precision = 10**(-"integer value") gives
the minimum precision for iterations
(default = 6, giving a precision of 0.000001).
PARMEST
Causes a SAS data set to be created which contains
the values of parameters estimated by PROC
DISCRETE. The values are contained in a data set
called "PARMEST". When both a PARMEST option and
an OUTPUT statement are specified, "PARMEST" is
the last data set created. Contained in the data
set are the BY variables, the estimates TOTAL,
MEAN and VARIANCE, and the parameter estimates
of the fitted distributions.
EXPMIN =
EX =
Classes are pooled to obtain a minimum expected
value for calculating the Chi square goodness of
fit. EXPMIN specifies the minimum desired expected
value of classes to be pooled. Default=1.0.
PBNUMBER = integer value
PB = integer value
Where "integer value" specifies the number of
recalculations of the Poisson-binomial
distribution (default = 4).
• The only distributions fitted when the
  variance is less than the mean are the
  Poisson, the Poisson with zeroes, and the
  positive binomial distribution.
Statements used with PROC DISCRETE
FREQUENCY variable name;
FREQ variable name;
Gives the name of the variable for the frequency
of the observed classes. If the FREQUENCY
statement is not given, the data are assumed to
be raw data with a frequency of one.
CLASSES variable name;
CLASS variable name;
The CLASSES statement must be given. If the
frequency variable is given in the FREQUENCY
statement, the corresponding class variable must
be specified in the CLASSES statement. If the
data are raw, the raw data variable name must be
specified by the CLASSES statement.
PROC DISCRETE'S OUTPUT
SIMPLE STATISTICS
Summarizes the number of classes, total
frequencies, range of class values and the mean
and variance for each group of BY variables.
The variance-mean ratio (V/M), which can be used
to test the equality of mean and variance for a
Poisson distribution (Elliot 1971), is given and
tested as $\chi^2 = (n-1)\,V/M$.
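The simple statistics reported by the procedure (total frequency, mean, variance and the variance-mean ratio test) can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure's own code: the function name `simple_statistics` is hypothetical, the variance is taken with an N - 1 denominator, and n in the test statistic is taken to be the total frequency, as in the usual index-of-dispersion test; the paper does not spell out either choice.

```python
def simple_statistics(freqs):
    """Mean, variance and variance-mean ratio from class frequencies.

    freqs maps class value -> observed frequency (hypothetical layout).
    """
    total = sum(freqs.values())
    mean = sum(x * f for x, f in freqs.items()) / total
    # Sample variance with an N - 1 denominator (an assumption).
    ss = sum(f * (x - mean) ** 2 for x, f in freqs.items())
    variance = ss / (total - 1)
    vm_ratio = variance / mean
    # Index-of-dispersion statistic, taking n as the total frequency
    # (an assumption about the n in the paper's formula).
    chi_square = (total - 1) * vm_ratio
    return total, mean, variance, vm_ratio, chi_square


observed = {0: 10, 1: 7, 2: 4, 3: 2, 4: 1}
total, mean, variance, vm_ratio, chi_square = simple_statistics(observed)
```

A ratio close to one is what a Poisson distribution would produce; the statistic would be referred to a chi-square distribution with n - 1 degrees of freedom.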
ITERATIONS
For the appropriate distributions, successive
iterations or recalculations of parameter values
are printed.
EXPECTED VALUES AND PROBABILITIES
The class value, probabilities for each class,
and the expected and observed frequencies are
printed for each of the fitted distributions.
Classes which have been pooled are also indicated.
OUTPUT OUT = data set name options;
When an OUTPUT statement is included, DISCRETE
creates a new SAS data set which has the name
given with "OUT=". The new data set will contain
the values of the classes used by DISCRETE, and
the probabilities for each of the distributions
fitted. Variable names for the probabilities must
be specified by giving the four letter name used
to identify the distribution as a DISCRETE
statement parameter followed by "=new variable
name". For example:
OUTPUT OUT=data set name
POIS=name1 POSB=name2 NEGB=name3 THOM=name4
NEYA=name5 POIB=name6 POIZ=name7 LOGZ=name8;
SUMMARY STATISTICS
The Chi square value for a Chi square goodness of
fit, the degrees of freedom, and the probability
of a greater Chi square value are given as summary
statistics. Parameters estimated for each
distribution are also given.
HOW THE DISTRIBUTIONS ARE FITTED.
Formulas used in fitting each of the distributions
are given below. Some equations have been modified
(some by taking logarithms) to provide exact
probabilities to data whose maximum class value
exceeds the computational limits of factorials.
BY variable names;
The BY statement may be used with PROC DISCRETE
if the dataset is sorted in the order of the BY
variables.
How DISCRETE treats missing values.
If the value of any SAS variable used in the
calculations is missing, the observation will be
omitted from the analysis.
The formulas below employ the following
common notation:
$x = 0, 1, \ldots$ = value of each class,
$f_x$ = the observed frequency of the xth class,
$N = \sum f_x$ = the total sample frequency,
$P_x$ = the expected proportion in the xth class,
$n$ = the number of classes from 0 to the largest
class observed,
$\sum$ when unspecified is for $x = 0$ to $n$.
SYSTEM OPTIONS
PROC DISCRETE will also respond to the SAS system
options LINESIZE, PAGESIZE, SKIP, and NOCENTER.
Notes on pooling expected values for the Chi
square goodness of fit test.
• Expected values are pooled to achieve a
  minimum value of one (unless a different
  minimum is specified with EXPMIN =).
• Pooling proceeds from zero until the minimum
  value criterion is met. Pooling continues
  after each group is created up to the
  maximum class as necessary. Pooled groups
  are indicated on the output.
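The pooling rule described in these notes can be sketched in Python as below. This is a minimal illustration, not the procedure's implementation: the names `pool_classes` and `chi_square` are hypothetical, and the handling of a leftover upper tail whose expected value stays below the minimum (merged here into the preceding group) is an assumption, since the paper does not describe that case.

```python
def pool_classes(expected, observed, expmin=1.0):
    """Pool adjacent classes, starting at class 0, until each group's
    expected frequency reaches expmin. Returns (expected, observed) pairs."""
    groups = []
    exp_sum = obs_sum = 0.0
    for e, o in zip(expected, observed):
        exp_sum += e
        obs_sum += o
        if exp_sum >= expmin:
            # Criterion met: close this group and start the next one.
            groups.append((exp_sum, obs_sum))
            exp_sum = obs_sum = 0.0
    if exp_sum > 0 or obs_sum > 0:
        # Leftover tail below expmin: merged into the last group
        # (an assumption; the paper does not describe this case).
        if groups:
            last_e, last_o = groups.pop()
            groups.append((last_e + exp_sum, last_o + obs_sum))
        else:
            groups.append((exp_sum, obs_sum))
    return groups


def chi_square(groups):
    """Pearson chi-square statistic over the pooled groups."""
    return sum((o - e) ** 2 / e for e, o in groups)


expected = [0.4, 0.7, 3.0, 5.0, 0.6, 0.3]
observed = [1, 0, 4, 4, 1, 0]
groups = pool_classes(expected, observed)   # default minimum of 1.0
stat = chi_square(groups)
```

In this example classes 0 and 1 are pooled into one group, classes 2 and 3 stand alone, and the two sparse upper classes are absorbed into the group for class 3, leaving three groups for the goodness of fit statistic.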
Notes on convergence problems
• If DISCRETE cannot converge on any
  distribution, a message indicating the
  problem is printed as a note. That
  distribution is not fitted, and the program
  proceeds to the next distribution.
POISSON DISTRIBUTION - Steel and Torrie (1980)
The distribution is described by
$$P_x = \frac{e^{-\mu}\,\mu^x}{x!}, \qquad x = 0, 1, 2, \ldots$$
The distribution is fitted using the maximum
likelihood estimator
$$\hat{\mu} = \bar{x} = \frac{\sum x f_x}{N}$$
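As an illustration of the Poisson fit, the following Python sketch estimates the mean from class frequencies and computes the expected frequency of each class. The function name `poisson_expected` and the dictionary layout are assumptions for this example. The probabilities are computed through logarithms (using `math.lgamma` for the log factorial), in the spirit of the remark above that some equations were rewritten with logarithms to avoid the computational limits of factorials; whether the procedure uses exactly this form is not stated in the paper.

```python
import math


def poisson_expected(freqs):
    """Fit a Poisson by its sample mean and return expected frequencies.

    freqs maps class value -> observed frequency (hypothetical layout).
    """
    total = sum(freqs.values())
    mu = sum(x * f for x, f in freqs.items()) / total
    expected = {}
    for x in range(max(freqs) + 1):
        # log P_x = -mu + x*log(mu) - log(x!), avoiding large factorials.
        log_p = -mu + x * math.log(mu) - math.lgamma(x + 1)
        expected[x] = total * math.exp(log_p)
    return mu, expected


observed = {0: 10, 1: 7, 2: 4, 3: 2, 4: 1}
mu, expected = poisson_expected(observed)
```

The expected frequencies would then be compared with the observed ones through the chi-square goodness of fit, with the upper tail beyond the largest observed class handled according to the MAXCLASS and pooling rules.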
NEGATIVE BINOMIAL DISTRIBUTION - Bliss and Fisher (1953)
The distribution is described by
$$(q - p)^{-k}$$
where $p, k > 0$, $q = 1 + p$ and $p = \bar{x}/k$. A general
term in the equation is
$$P_x = \frac{(k + x - 1)!}{x!\,(k-1)!}\,\frac{p^x}{q^{k+x}} \qquad \text{for } x = 0, 1, \ldots$$
The fully efficient method given by Fisher is to
estimate $k$ such that
$$\sum_{x} \frac{A_x}{k + x} - N \ln(1 + p)$$
is brought to zero (minimized in absolute value), where
$$A_x = \sum_{i = x+1}^{n} f_i .$$
NEYMAN TYPE A DISTRIBUTION - Douglas (1955)
The distribution is described by
$$P_x = e^{-m_1}\,\frac{m_2^{\,x}}{x!} \sum_{j=0}^{\infty} \frac{j^x}{j!}\,\left(m_1 e^{-m_2}\right)^j \qquad \text{for } x = 0, 1, \ldots$$
and $m_1, m_2 > 0$ (taking $0^0 = 1$, so that
$P_0 = e^{-m_1(1 - e^{-m_2})}$). The method of fitting $m_1$
and $m_2$ requires iteration to estimate both parameters.
The method used is that of Shenton (1949, in
Douglas 1955).
POISSON-BINOMIAL DISTRIBUTION - McGuire et al. (1957)
The quantity $F_x$ is first defined as
$$F_x = \alpha p n \sum_{i=0}^{x-1} \frac{(x-1)!}{i!\,(x-1-i)!}\,(n-1)^{[x-1-i]}\,p^{\,x-1-i}\,q^{\,n-x+i}\,F_i \qquad \text{for } x > 0$$
where $\alpha$ and $p$ are the parameters, $q = 1 - p$,
$n = 2, 3, 4, \ldots$ and $[j]$ indicates factorial powers:
$$(n-1)^{[0]} = 1,\quad (n-1)^{[1]} = n-1,\quad (n-1)^{[2]} = (n-1)(n-2),\quad (n-1)^{[3]} = (n-1)(n-2)(n-3),\ \text{etc.}$$
We then define
$$P_0 = F_0 = e^{-\alpha(1 - q^n)} \qquad \text{for } x = 0$$
$$P_x = F_x / x! \qquad \text{for } x > 0$$
The mean of the distribution is $\mu = \alpha p n$.
The moment estimators are:
$$\hat{\alpha} = \frac{(n-1)\,\bar{x}^2}{n\,(s^2 - \bar{x})}, \qquad \hat{p} = \frac{s^2 - \bar{x}}{(n-1)\,\bar{x}}$$
One method of fitting this distribution is to let
the program try n = 2, 3, ... , PBNUMBER and
tabulate the best fit (as defined by the minimum
$\chi^2$ value).
POSITIVE BINOMIAL DISTRIBUTION - Ondrick and
Griffiths (1969)
The distribution is described by
$$(q + p)^n$$
where $q = 1 - p$. The general term in the
expansion of the distribution is given by
$$P_x = \frac{n!}{(n - x)!\,x!}\,p^x q^{\,n-x} \qquad \text{for } x = 0, 1, 2, \ldots$$
The maximum likelihood solution for $p$ is
$$\hat{p} = \frac{\sum x f_x}{n N} = \frac{\bar{x}}{n}, \qquad q = 1 - p$$
THOMAS DOUBLE POISSON - Thomas (1949)
The distribution is defined as
$$P_x = e^{-m} \qquad \text{for } x = 0$$
$$P_x = \sum_{r=1}^{x} \frac{m^r e^{-m}}{r!}\,\frac{(r\lambda)^{x-r}\,e^{-r\lambda}}{(x-r)!} \qquad \text{for } x = 1, 2, \ldots$$
where $m, \lambda > 0$. When $\lambda = 0$ the distribution
degenerates to the Poisson distribution. The
maximum likelihood solution is
$$\hat{m} = -\ln(f_0 / N), \qquad \hat{\lambda} = -\ln\!\left(\frac{f_1}{f_0\,\hat{m}}\right)$$
POISSON DISTRIBUTION WITH ZEROES - Cohen (1960)
This distribution, called an extension of a
truncated Poisson distribution by Cohen (1960), is
defined as follows:
$$P_x = 1 - \theta \qquad \text{for } x = 0$$
$$P_x = \frac{\theta\,e^{-\lambda}\lambda^x}{(1 - e^{-\lambda})\,x!} \qquad \text{for } x > 0$$
Note that if $\theta = 0$, a degenerate distribution is
implied with all values at x = 0; if $\theta = 1$ then
a truncated Poisson distribution is implied; and
if $\theta = 1 - e^{-\lambda}$, the ordinary Poisson is implied.
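The special case noted for the Poisson with zeroes can be checked numerically. The Python sketch below evaluates $P_x = 1 - \theta$ for $x = 0$ and $P_x = \theta e^{-\lambda}\lambda^x / ((1 - e^{-\lambda})\,x!)$ for $x > 0$, then sets $\theta = 1 - e^{-\lambda}$ and compares the result with the ordinary Poisson. The function name `poisson_with_zeroes_pmf` and the value $\lambda = 1.5$ are arbitrary choices made for the illustration.

```python
import math


def poisson_with_zeroes_pmf(x, theta, lam):
    """P_x for the Poisson with zeroes: 1 - theta at x = 0, otherwise
    theta times the zero-truncated Poisson probability."""
    if x == 0:
        return 1.0 - theta
    # Poisson term computed on the log scale to avoid large factorials.
    log_pois = -lam + x * math.log(lam) - math.lgamma(x + 1)
    return theta * math.exp(log_pois) / (1.0 - math.exp(-lam))


lam = 1.5                        # arbitrary illustrative value
theta = 1.0 - math.exp(-lam)     # the value that should give an ordinary Poisson
probs = [poisson_with_zeroes_pmf(x, theta, lam) for x in range(60)]
poisson = [math.exp(-lam) * lam ** x / math.factorial(x) for x in range(60)]
```

With this choice of $\theta$ the two lists agree to rounding error, and the probabilities sum to one over the classes evaluated.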
The following are the maximum likelihood
estimators.
$$\hat{\theta} = \sum_{x=1}^{n} f_x \Big/ \sum_{x=0}^{n} f_x$$
and $\lambda$ is estimated from
$$\bar{x}' = \frac{\lambda}{1 - e^{-\lambda}}$$
where $\bar{x}' = \sum_{x=1}^{n} x f_x \big/ \sum_{x=1}^{n} f_x$ is
the mean of the non-zero sample observations.
$\lambda$ can be determined using the Newton-Raphson
scheme (in Nielsen 1964).
LOGARITHMIC DISTRIBUTION WITH ZEROES
The logarithmic distribution (without zeroes)
is given by Chakravarti et al. (1967). The
logarithmic with zeroes is as follows:
$$P_x = 1 - \lambda \qquad \text{for } x = 0$$
$$P_x = \frac{\lambda\,\theta^x}{-x \ln(1 - \theta)} \qquad \text{for } x > 0$$
where $0 \le \lambda \le 1$ and $0 < \theta < 1$. If $\lambda = 1$, the
usual logarithmic distribution is generated, but
if $\lambda = 0$ then the distribution is degenerate
with all values concentrated at the zero class.
The maximum likelihood solution for $\lambda$ is
$$\hat{\lambda} = \sum_{x=1}^{n} f_x \Big/ N$$
and for $\theta$ is
$$(1 - \theta)\,\ln(1 - \theta) \sum_{x=1}^{n} x f_x + \theta \sum_{x=1}^{n} f_x = 0 .$$
The latter is solved iteratively by the
Newton-Raphson method (in Nielsen 1964) until $\theta$
changes by less than $10^{-\text{PRECISION}}$ (default = $10^{-6}$).
A starting value of 0.5 is used to initiate the
iteration.
REFERENCES
Bliss, C.I. and R.A. Fisher. 1953. "Fitting the
negative binomial distribution to biological
data". Biometrics 9: 176-200.
Chakravarti, I.M., R.G. Laha and J. Roy. 1967.
Handbook of methods of applied statistics.
Vol I. John Wiley.
Cohen, A.C., Jr. 1960. "An extension of a
truncated Poisson distribution". Biometrics
446-450.
Douglas, J.B. 1955. "Fitting the Neyman type A
(two parameter) contagious distribution".
Biometrics 11: 149-158.
Elliot, J.M. 1971. Some methods for the
statistical analysis of samples of benthic
invertebrates. Freshwater Biol. Assn. Sci. Pub.
No. 25. Ambleside, Westmorland, England. 148 pp.
McGuire, J.U., T.A. Brindley and T.A. Bancroft.
1957. "The distribution of corn borer larvae
Pyrausta nubilalis (Hbn.) in field corn".
Biometrics 13: 65-78.
Nielsen, K.L. 1964. Methods in numerical
analysis. The MacMillan Co., N.Y. 408 pp.
Ondrick, C.W. and J.C. Griffiths. 1969. Discrete
distribution models of binomial, Poisson and
negative binomial: Computer Contribution 35.
Kansas Biol. Survey. 20 pp.
Steel, R.G. and J.H. Torrie. 1980. Principles
and Procedures of Statistics. McGraw-Hill Book
Co. N.Y. 633 pp.
Thomas, Marjorie. 1949. A generalization of
Poisson's binomial limit for use in ecology.
Biometrika 36: 18-25.
* PROC DISCRETE is supported by the authors,
not by the SAS Institute.
Contact author: James Geaghan
Dept. of Experimental Statistics
43 Ag. Admin. Bldg.,
L.S.U.
Baton Rouge, Louisiana
70803