PROC DISCRETE:* A Procedure for Fitting Discrete Probability Distributions

James P. Geaghan, Louisiana State University
Charles E. Gates, Texas A & M University
George D. Williams, Louisiana State University

The DISCRETE procedure fits discrete probability distributions to count data. The eight distributions available in the procedure are the Poisson, negative binomial, positive binomial, Thomas double Poisson, Neyman type A, Poisson-binomial, Poisson with zeroes and the logarithmic distribution with zeroes. Data may be entered as either raw data or as classes and frequencies for an observed distribution. All of the eight distributions, or any subset, may be fitted to the observed data set.

The PROC DISCRETE statement

PROC DISCRETE options;

The options and parameters below may be requested in the PROC DISCRETE statement.

DATA = data set name
The DATA = option is used to specify the data set to be used by DISCRETE. If DATA = is omitted, the most recently created data set will be used.

Individual distributions may be fitted by requesting the following options. If no distributions are requested, all eight distributions will be fitted.
POIS - Poisson distribution
POSB - positive binomial distribution
NEGB - negative binomial distribution
THOM - Thomas double Poisson distribution
NEYA - Neyman type A distribution
POIB - Poisson-binomial distribution
POIZ - Poisson distribution with zeroes
LOGZ - logarithmic distribution with zeroes

ITERATIONS =
IT =
Specifies the maximum number of iterations (default = 100) for all distributions with iterative solutions.

PRECISION = integer value
P = integer value
Precision = 10**(-integer value) is the minimum precision for iterations (default = 6, giving a precision of 0.000001).

NOSIMPLE
NOSIMP
Requests the program not to print the simple statistics section of the output.

NOINTERMEDIATE
NOINT
Requests the program not to print intermediate results (iteration series or initial values).

NOPRINT
Requests the program not to print any of the normal output. Notes and error messages are transferred to the SAS log.

PARMEST
Causes a SAS data set to be created which contains the values of parameters estimated by PROC DISCRETE. The values are contained in a data set called "PARMEST". When both a PARMEST option and an OUTPUT statement are specified, "PARMEST" is the last data set created. Contained in the PARMEST data set are the BY variables and estimates of the frequency totals (TOTAL), the mean (MEAN) and the variance (VARIANCE). The parameter estimates from the distributions are:
POIS - no additional estimates produced
POIZ - MEAN2, PLAMBDA and PTHETA
POSB - P
NEGB - NBP, NBK
THOM - TM and TLAMBDA
NEYA - M1 and M2
POIB - ALPHA, MU and QUE
LOGZ - LTHETA and LLAMBDA

MINCLASS = class value
MIN = class value
Specifies the minimum class value to be output (default = 0). MINCLASS affects only the output; all calculations are still performed for classes less than MINCLASS.

MAXCLASS = class value
MAX = class value
Specifies the maximum class value to be used (default = largest observed value). MAXCLASS causes all classes larger than the class value specified to be pooled in the class specified as the maximum class. MAXCLASS does not affect the calculation of the mean and variance unless the TRUNCATE option is used, but will affect other calculations.
• NOTE: MAXCLASS must be specified to obtain the correct estimate of the probability of success (p) for the positive binomial distribution when the largest possible class does not occur in the data set.
• NOTE: the use of MAXCLASS to pool or truncate the upper tail will change the fit of some distributions.

TRUNCATE
TRUNC
When used with a value specified as MAXCLASS =, the TRUNCATE option causes values in classes exceeding the maximum specified class to be deleted instead of being pooled into the maximum class.
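The difference between pooling with MAXCLASS and deleting with TRUNCATE can be illustrated with a minimal Python sketch. This is not the procedure's own code: the function name `apply_maxclass` and the sample frequencies are hypothetical, and the sketch only rebuilds the class table (it does not reproduce how DISCRETE computes the mean and variance).

```python
def apply_maxclass(freqs, maxclass, truncate=False):
    """Return a class -> frequency table limited to classes <= maxclass.

    freqs    : dict mapping class value -> observed frequency
    maxclass : largest class to keep (the MAXCLASS = value)
    truncate : False pools larger classes into maxclass (MAXCLASS alone),
               True deletes them (MAXCLASS with TRUNCATE)
    """
    table = {}
    for x, f in freqs.items():
        if x <= maxclass:
            table[x] = table.get(x, 0) + f
        elif not truncate:
            # Pool the upper tail into the maximum class.
            table[maxclass] = table.get(maxclass, 0) + f
        # With truncate=True the observation is simply dropped.
    return dict(sorted(table.items()))


# Hypothetical observed distribution: class value -> frequency.
observed = {0: 10, 1: 6, 2: 3, 3: 1, 5: 2}

pooled = apply_maxclass(observed, maxclass=3)
truncated = apply_maxclass(observed, maxclass=3, truncate=True)

print(pooled)     # classes above 3 are added to class 3
print(truncated)  # classes above 3 are removed
```

In the pooled table the total frequency is unchanged, whereas truncation reduces it, which is why the text notes that the mean and variance change only when TRUNCATE is used.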
EXPMIN =
EX =
Classes are pooled to obtain a minimum expected value for calculating the Chi square goodness of fit. EXPMIN specifies the minimum desired expected value of classes to be pooled. Default = 1.0.

PBNUMBER = integer value
PB = integer value
Where "integer value" specifies the number of recalculations of the Poisson-binomial distribution (default = 4).

• The only distributions fitted when the variance is less than the mean are the Poisson, the Poisson with zeroes, and the positive binomial distribution.

Statements used with PROC DISCRETE

FREQUENCY variable name;
FREQ variable name;
Gives the name of the variable for the frequency of the observed classes. If the FREQUENCY statement is not given, the data are assumed to be raw data with a frequency of one.

CLASSES variable name;
CLASS variable name;
The CLASSES statement must be given. If the frequency variable is given in the FREQUENCY statement, the corresponding class variable must be specified in the CLASSES statement. If the data are raw, the raw data variable name must be specified by the CLASSES statement.

OUTPUT OUT = data set name options;
When an OUTPUT statement is included, DISCRETE creates a new SAS data set which has the name given with "OUT=". The new data set will contain the values of the classes used by DISCRETE, and the probabilities for each of the distributions fitted. Variable names for the probabilities must be specified by giving the four letter name used to identify the distribution as a DISCRETE statement parameter followed by "=new variable name". For example:

OUTPUT OUT=data set name POIS=name1 POSB=name2 NEGB=name3 THOM=name4 NEYA=name5 POIB=name6 POIZ=name7 LOGZ=name8;

BY variable names;
The BY statement may be used with PROC DISCRETE if the data set is sorted in the order of the BY variables.

How DISCRETE treats missing values.
If the value of any SAS variable used in the calculations is missing, the observation will be omitted from the analysis.

SYSTEM OPTIONS
PROC DISCRETE will also respond to the SAS system options LINESIZE, PAGESIZE, SKIP, and NOCENTER.

Notes on pooling expected values for the Chi square goodness of fit test.
• Expected values are pooled to achieve a minimum value of one (unless a different minimum is specified with EXPMIN =).
• Pooling proceeds from zero until the minimum value criterion is met. Pooling continues after each group is created up to the maximum class as necessary. Pooled groups are indicated on the output.

Notes on convergence problems
• If DISCRETE cannot converge on any distribution, a message indicating the problem is printed as a note. That distribution is not fitted, and the program proceeds to the next distribution.

PROC DISCRETE'S OUTPUT

SIMPLE STATISTICS
Summarize the number of classes, total frequencies, range of class values and the mean and variance for each group of BY variables. The variance-mean ratio (V/M), which can be used to test the equality of mean and variance for a Poisson distribution (Elliot 1971), is given and tested as $\chi^2 = (n-1)\,(V/M)$.

ITERATIONS
For the appropriate distributions, successive iterations or recalculations of parameter values are printed.

EXPECTED VALUES AND PROBABILITIES
The class value, probabilities for each class, and the expected and observed frequencies are printed for each of the fitted distributions. Classes which have been pooled are also indicated.

SUMMARY STATISTICS
The Chi square value for a Chi square goodness of fit, the degrees of freedom, and the probability of a greater Chi square value are given as summary statistics. Parameters estimated for each distribution are also given.

HOW THE DISTRIBUTIONS ARE FITTED.

Formulas used in fitting each of the distributions are given below. Some equations have been modified (some by taking logarithms) to provide exact probabilities to data whose maximum class value exceeds the computational limits of factorials.

The formulas below employ the following common notation:
$x = 0, 1, \ldots$ = value of each class,
$f_x$ = the observed frequency of the xth class,
$N = \sum f_x$ = the total sample frequency,
$P_x$ = the expected proportion in the xth class,
$n$ = the number of classes from 0 to the largest class observed,
$\sum$ when unspecified is for $x = 0$ to $n$.
The sample mean and variance are written $\bar{x}$ and $s^2$.

POISSON DISTRIBUTION - Steel and Torrie (1980)

The distribution is described by
$$P_x = \frac{e^{-\mu}\,\mu^{x}}{x!}, \qquad x = 0, 1, 2, \ldots$$
The distribution is fitted using the maximum likelihood estimator
$$\hat{\mu} = \bar{x} = \frac{\sum x f_x}{N}.$$

NEGATIVE BINOMIAL DISTRIBUTION - Bliss and Fisher (1953)

The distribution is described by
$$(q - p)^{-k}$$
where $p, k > 0$, $q = 1 + p$ and $p = \bar{x}/k$. A general term in the equation is
$$P_x = \frac{(k + x - 1)!}{x!\,(k-1)!}\;\frac{p^{x}}{q^{k+x}}, \qquad x = 0, 1, \ldots$$
The fully efficient method given by Fisher is to estimate $k$ such that
$$z = \sum \frac{A_x}{k + x} - N \ln(1 + p)$$
is minimized in absolute value, where
$$A_x = \sum_{i=x+1}^{n} f_i .$$

POSITIVE BINOMIAL DISTRIBUTION - Ondrick and Griffiths (1969)

The distribution is described by
$$(q + p)^{n}$$
where $q = 1 - p$. The general term in the expansion of the distribution is given by
$$P_x = \frac{n!}{(n-x)!\,x!}\;p^{x} q^{\,n-x}, \qquad x = 0, 1, 2, \ldots$$
The maximum likelihood solution for $p$ is
$$\hat{p} = \frac{\bar{x}}{n}.$$

THOMAS DOUBLE POISSON - Thomas (1949)

The distribution is defined as
$$P_x = \begin{cases} e^{-m} & \text{for } x = 0 \\[4pt] \displaystyle\sum_{r=1}^{x} \frac{m^{r} e^{-m}}{r!}\;\frac{(r\lambda)^{x-r} e^{-r\lambda}}{(x-r)!} & \text{for } x = 1, 2, \ldots \end{cases}$$
where $m, \lambda > 0$. When $\lambda = 0$ the distribution degenerates to the Poisson distribution. The maximum likelihood solution is
$$m = -\ln(f_0/N), \qquad \lambda = -\ln\!\left(\frac{f_1}{f_0\, m}\right).$$

NEYMAN TYPE A DISTRIBUTION - Douglas (1955)

The distribution is described by
$$P_x = \frac{m_2^{\,x}}{x!}\; e^{-m_1} \sum_{j=0}^{\infty} \frac{j^{x}}{j!}\,\bigl(m_1 e^{-m_2}\bigr)^{j}, \qquad x = 0, 1, \ldots$$
and $m_1, m_2 > 0$. The method of fitting $m_1$ and $m_2$ requires iteration to estimate both parameters. The method used is that of Shenton (1949, in Douglas 1955).

POISSON-BINOMIAL DISTRIBUTION - McGuire et al. (1957)

The first value of $F_x$ is defined as
$$P_0 = F_0 = e^{-\alpha(1 - q^{n})} \qquad \text{for } x = 0.$$
We then define
$$F_x = \alpha p n \sum_{i=0}^{x-1} \frac{(x-1)!}{i!\,(x-1-i)!}\;(n-1)^{[x-1-i]}\; p^{\,x-i-1}\, q^{\,n-x+i}\, F_i \qquad \text{for } x > 0,$$
$$P_x = F_x / x!$$
where $\alpha$ and $p$ are the parameters, $q = 1 - p$, $n = 2, 3, 4, \ldots$ and $[j]$ indicates factorial moments:
$(n-1)^{[0]} = 1$
$(n-1)^{[1]} = n-1$
$(n-1)^{[2]} = (n-1)(n-2)$
$(n-1)^{[3]} = (n-1)(n-2)(n-3)$, etc.
The moment estimators are:
$$\hat{p} = \frac{s^{2} - \bar{x}}{(n-1)\,\bar{x}}, \qquad \hat{\alpha} = \frac{\bar{x}}{\hat{p}\,n} = \frac{(n-1)\,\bar{x}^{2}}{n\,(s^{2} - \bar{x})}, \qquad \hat{q} = 1 - \hat{p}.$$
One method of fitting this distribution is to let the program try n = 2, 3, ..., PBNUMBER and tabulate the best fit (as defined by the minimum $\chi^2$ value).

POISSON DISTRIBUTION WITH ZEROES - Cohen (1960)

This distribution, called an extension of a truncated Poisson distribution by Cohen (1960), is defined as follows:
$$P_x = \begin{cases} 1 - \theta & \text{for } x = 0 \\[4pt] \dfrac{\theta\, e^{-\lambda} \lambda^{x}}{(1 - e^{-\lambda})\, x!} & \text{for } x > 0 \end{cases}$$
Note that if $\theta = 0$, a degenerate distribution is implied with all values at $x = 0$; if $\theta = 1$ then a truncated Poisson distribution is implied and if $\theta = 1 - e^{-\lambda}$, the ordinary Poisson is implied.

The following are the maximum likelihood estimators.
$$\hat{\theta} = \sum_{x=1}^{n} f_x \Big/ \sum_{x=0}^{n} f_x$$
and $\lambda$ is estimated from
$$\bar{x}' = \frac{\lambda}{1 - e^{-\lambda}}$$
where
$$\bar{x}' = \sum_{x=1}^{n} x f_x \Big/ \sum_{x=1}^{n} f_x$$
is the mean of the non-zero sample observations. $\lambda$ can be determined using the Newton-Raphson scheme (in Nielsen 1964).

LOGARITHMIC DISTRIBUTION WITH ZEROES

The logarithmic distribution (without zeroes) is given by Chakravarti et al. (1967). The logarithmic with zeroes is as follows:
$$P_x = \begin{cases} 1 - \lambda & \text{for } x = 0 \\[4pt] \dfrac{\lambda\, \theta^{x}}{-x \ln(1 - \theta)} & \text{for } x > 0 \end{cases}$$
where $0 \le \lambda \le 1$ and $0 < \theta < 1$. If $\lambda = 1$, the usual logarithmic distribution is generated, but if $\lambda = 0$ then the distribution is degenerate with all values concentrated at the zero class. The maximum likelihood solution for $\lambda$ is
$$\hat{\lambda} = \sum_{x=1}^{n} f_x \Big/ N$$
and for $\theta$ is
$$(1 - \theta)\,\ln(1 - \theta) \sum_{x=1}^{n} x f_x + \theta \sum_{x=1}^{n} f_x = 0 .$$
The latter is solved iteratively by the Newton-Raphson method (in Nielsen 1964) until $\theta$ changes by less than $10^{-\text{PRECISION}}$ (default = $10^{-6}$). A starting value of 0.5 is used to initiate the iteration.

REFERENCES

Bliss, C.I. and R.A. Fisher. 1953. "Fitting the negative binomial distribution to biological data". Biometrics 9: 176-200.

Chakravarti, I.M., R.G. Laha and J. Roy. 1967. Handbook of methods of applied statistics. Vol I. John Wiley.

Cohen, A.C., Jr. 1960. "An extension of a truncated Poisson distribution". Biometrics 446-450.

Douglas, J.B. 1955. "Fitting the Neyman type A (two parameter) contagious distribution". Biometrics 11: 149-158.

Elliot, J.M. 1971. Some methods for the statistical analysis of samples of benthic invertebrates. Freshwater Biol. Assn. Sci. Pub. No. 25. Ambleside, Westmorland, England. 148 pp.

McGuire, J.U., T.A. Brindley and T.A. Bancroft. 1957. "The distribution of corn borer larvae Pyrausta nubilalis (HBN.) in field corn". Biometrics 13: 65-78.

Nielsen, K.L. 1964. Methods in numerical analysis. The MacMillan Co., N.Y. 408 pp.

Ondrick, C.W. and J.C. Griffiths. 1969. Discrete distribution models of binomial, Poisson and negative binomial: Computer Contribution 35. Kansas Biol. Survey. 20 pp.

Steel, R.G. and J.H. Torrie. 1980. Principles and Procedures of Statistics. McGraw-Hill Book Co., N.Y. 633 pp.

Thomas, Marjorie. 1949. A generalization of Poisson's binomial limit for use in ecology. Biometrika 36: 18-25.

* PROC DISCRETE is supported by the authors, not by the SAS Institute.

Contact author:
James Geaghan
Dept. of Experimental Statistics
43 Ag. Admin. Bldg., L.S.U.
Baton Rouge, Louisiana 70803
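As an illustration of the Poisson fit and of the pooling rule for the Chi square goodness of fit described under HOW THE DISTRIBUTIONS ARE FITTED, the following Python sketch may be helpful. It is not the authors' implementation. The function names and the sample frequencies are hypothetical, and three details are assumptions: the probability beyond the largest class is assigned to the last class, a final group that falls short of the minimum is merged into the preceding group, and the degrees of freedom follow the usual rule (groups minus one minus the number of estimated parameters).

```python
import math


def poisson_probs(mean, n):
    """Poisson probabilities for classes 0..n, computed on the log scale.

    Working with logarithms (lgamma(x + 1) = ln x!) avoids overflowing
    factorials for large class values. Assumes mean > 0.
    """
    probs = [math.exp(-mean + x * math.log(mean) - math.lgamma(x + 1))
             for x in range(n + 1)]
    # Assumption: the upper tail beyond class n is added to the last class.
    probs[-1] += 1.0 - sum(probs)
    return probs


def pool_classes(observed, expected, expmin=1.0):
    """Pool classes from zero upward until each group's expected value
    reaches expmin; return a list of (observed, expected) group totals."""
    groups = []
    obs_acc = exp_acc = 0.0
    for o, e in zip(observed, expected):
        obs_acc += o
        exp_acc += e
        if exp_acc >= expmin:
            groups.append((obs_acc, exp_acc))
            obs_acc = exp_acc = 0.0
    if exp_acc > 0.0:
        # Assumption: a short final group is merged into the previous one.
        if groups:
            o, e = groups.pop()
            groups.append((o + obs_acc, e + exp_acc))
        else:
            groups.append((obs_acc, exp_acc))
    return groups


def chi_square(groups):
    """Chi square goodness-of-fit statistic over pooled groups."""
    return sum((o - e) ** 2 / e for o, e in groups)


# Hypothetical observed frequencies for classes 0, 1, 2, 3.
observed = [10, 6, 3, 1]
N = sum(observed)
mean = sum(x * f for x, f in enumerate(observed)) / N  # ML estimate of the Poisson mean
probs = poisson_probs(mean, len(observed) - 1)
expected = [N * p for p in probs]
groups = pool_classes(observed, expected, expmin=1.0)
chi2 = chi_square(groups)
df = len(groups) - 1 - 1  # one estimated parameter (the mean)

print(f"mean = {mean:.4f}, groups = {len(groups)}, chi-square = {chi2:.4f}, df = {df}")
```

With these frequencies the last class has an expected value below one, so it is pooled with the class before it and the test is computed on three groups.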