Bayesian Estimation of the Number of Unknown Species: Incorporating a Model for the Discovery Process

Simon P. Wilson (1), Brett Houlding (1), and Mark Costello (2)

(1) Discipline of Statistics, Trinity College Dublin, Ireland. Correspondence to [email protected].
(2) Department of Marine Sciences, University of Auckland, Leigh Marine Laboratory, New Zealand.

Abstract

The problem of estimating the number of existing species has been reviewed many times within the statistics literature. The typical case is to estimate the unknown number of species $S$ from a sample of $n$ individuals in which $R \le S$ distinct species are observed with abundances $n_1, \ldots, n_R$. Unfortunately, records of animal and plant species discovery usually contain only the number of sampled species $R$ and their dates of discovery $t_1, \ldots, t_R$. As such, here we propose a model for estimating the number of species $S$ using only the species discovery dates $t_1, \ldots, t_R$ and the yearly number of distinct author surnames $a_1, \ldots, a_T$, the latter being considered a proxy for a latent 'effort process' in species sampling.

1. Species Estimation

A community contains an unknown number of species $S$ with unknown abundances $N_1, \ldots, N_S$. The total population of life within the community is thus $N = \sum_{i=1}^{S} N_i$. Typically, species abundance models define probability distributions for the $N_i$ in terms of a set of parameters $\theta$. Here we assume species abundances follow a Log-normal distribution, which, even though being a continuous distribution whilst populations are integer-valued, is a common modelling assumption. Hence:

\[ \log(N_i) \sim \mathcal{N}(\theta_1, \theta_2^2), \quad \text{for } i = 1 : S \qquad (1) \]

A sample of individuals is assumed to have been taken sequentially over a period of $T$ years, with $n(t)$ individuals being sampled in year $t$. This has resulted in a total of $R$ distinct species being identified. Furthermore, it is assumed that each individual had an equal chance of being sampled, and that sampling was with replacement, so that the probability of a randomly sampled individual being of species $i$ is $p_i = N_i / N$.

Denoting by $M_i$ the sample number in which the $i$-th species is first observed (hence $1 = M_1 < M_2 < \cdots < M_R \le n$), we define $m_i = M_i - M_{i-1}$ to be the number of individuals sampled between the $(i-1)$-th and $i$-th species discoveries.

The distribution of $m_1, \ldots, m_R$ given $N_1, \ldots, N_R$ can then be determined as follows: There are $m_i$ individuals sampled between the discovery of the $(i-1)$-th and $i$-th species. This happens if and only if $m_i - 1$ individuals are sampled from the first $i - 1$ species (meaning no new species are observed) and then the $m_i$-th sample is a first observation of new species $i$. Under the assumption of sampling with replacement, this occurs with probability $\left(\sum_{j=1}^{i-1} p_j\right)^{m_i - 1} p_i$. Hence, assuming that $m_i \ge 1$ for all $i$ and that $S \ge R$, we have for all $R$ discovered species:

\[ P(m_1, \ldots, m_R \mid N_1, \ldots, N_S, R) = \prod_{i=2}^{R} \left( \sum_{j=1}^{i-1} p_j \right)^{m_i - 1} p_i \qquad (2) \]

[Figure 1: DAG for Species Estimation using inter-discovery sample sizes.]

As such, if the values of $m_1, \ldots, m_R$ were available, then such data, taken in combination with a prior distribution over $\theta$, could be implemented in Bayes' Theorem to determine a posterior distribution for $N_1, \ldots, N_S$. Moreover, techniques such as Reversible Jump Markov Chain Monte Carlo (RJMCMC) can then be implemented to permit inference over the total number of species $S$. Figure 1 above provides a graphical representation for this part of the species estimation model.

Unfortunately, however, records of animal and plant species discovery usually only list the year of discovery, and not the number of samples collected between discoveries, e.g., the most comprehensive database of known species: the Catalogue of Life. As such, to estimate the total number of unknown species using such limited information, a model is required for the discovery process itself.

2. The Discovery Process

If we assume that the available data consists not of the interdiscovery sample sizes, but instead the years $t_1, \ldots, t_R$ of species discovery, then if the number of individuals sampled in any given year were also known (i.e., the values $n(1), \ldots, n(T)$ were available), the following constraints would bound the values of $M_1, \ldots, M_R$:

\[ \max\left\{ M_{i-1}, \sum_{j=1}^{t_i - 1} n(j) \right\} < M_i \le \sum_{j=1}^{t_i} n(j) \qquad (3) \]

That is to say, for the $M_i$-th individual to have been sampled in year $t_i$, the cumulative number of individuals sampled up to and inclusive of year $t_i - 1$ must be less than $M_i$. In addition, the cumulative number of individuals sampled up to and inclusive of the year in which the $M_i$-th individual is sampled must be greater than or equal to $M_i$.

As such, and observing that $M_i = 1 + \sum_{j=2}^{i} m_j$, the interdiscovery sample sizes can be considered as latent variables whose probability mass function follows that defined in Equation (2), but truncated so as to satisfy the constraints of Equation (3). The information leading to inference on the species abundances, and hence the number of unknown species, would then be provided through the species discovery years and the yearly sample sizes. Figure 2 below represents this aspect of the inference problem graphically.

[Figure 2: DAG displaying the inference model for the latent inter-discovery sample sizes.]

Again, however, it is unlikely that ecologists or databases will maintain a record of the number of samples taken annually, and so these too must be assumed unavailable. Moreover, models for how many individuals have been sampled, as a function of time, do not appear to have been considered within the statistical or biological literatures.

To account for this problem, we propose the use of a latent 'effort process' that is used to represent an underlying rate of sampling effort $E_t > 0$. The value of this process is also unknown; however, it does not have to be something that can be potentially observed or interpreted, so long as there is data available that is related to it.

A simple possibility is to model $E_t$ as a discrete Gaussian process whereby $e_t = \log(E_t)$ follows a discrete-time Gaussian random walk. The joint distribution of $e_1, \ldots, e_T$ will then be an Intrinsic Gaussian Markov Random Field (IGMRF):

\[ e_t \mid e_{t-1}, \sigma_e^2 \sim \mathcal{N}(e_{t-1}, \sigma_e^2), \quad \text{for } t = 2 : T \qquad (4) \]

3. A Proxy for the Effort Process

The number of individuals sampled in a given year is then assumed to be a random process whose probability distribution is determined by $e_t$. A common technique that is suitable for the discrete effort process suggested here is to assume the number of individuals sampled in a given year follows a Poisson distribution with mean $E_t = \exp(e_t)$:

\[ n(t) \mid e_t \sim \mathrm{Po}(\exp(e_t)), \quad \text{for } t = 1 : T \qquad (5) \]

The advantage of defining the 'effort process' is that we can hopefully measure some proxy of it, and relate that to the effort rate $E_t$. One such possibility is the yearly number of distinct author surnames in papers concerning species discovery. Defining $a_t$ to be the number of distinct author surnames in year $t$, a simple model is to assume that $a_t$ is Poisson distributed with mean $\exp(e_t - \alpha)$, for some parameter $\alpha$:

\[ a_t \mid e_t, \alpha \sim \mathrm{Po}(\exp(e_t - \alpha)), \quad \text{for } t = 1 : T \qquad (6) \]

The parameter $\alpha$ can be interpreted in the following way: Since we expect $\exp(e_t)$ individuals to be sampled in year $t$, and $\exp(e_t - \alpha)$ authors to be active in that year, we expect $\exp(\alpha)$ individuals to be sampled per author. In other words, $\alpha$ is the natural logarithm of the number of individuals we expect an author to sample in a given year. Finally, we are able to generate the $a_t$ by examining databases of species discovery. Figure 3 below represents the proposed inference procedure concerning yearly sample sizes.

[Figure 3: DAG displaying the inference model for the yearly sample sizes.]

Taken collectively, the model for estimating the number of unknown species by incorporating a model for the discovery process which utilizes a proxy for a latent effort process is presented graphically in Figure 4 below.

[Figure 4: Full DAG for species estimation problem. Node labels: $\sigma_e$, $\alpha$, $e_j$, $t_i$, $a_j$, $S$, $\theta$, $m_i$, $N_k$, $n(j)$; plate indices: $k = 1 : S$, $i = 1 : R$, $j = 1 : T$.]

Acknowledgement

This work is funded by the STATICA project, a Principal Investigator program of Science Foundation Ireland, Grant number 08/IN.1/I1879.
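
Illustrative sketch

The generative side of the model in Equations (1) to (6) can be illustrated with a short forward simulation. The Python sketch below is not the authors' inference procedure (it involves no MCMC or RJMCMC); it only draws synthetic data from the stated distributions and evaluates the log of Equation (2). The function names, the default parameter values, and the choice to fix the starting value of the random walk at `e1` are all assumptions made for illustration and do not come from the poster.

```python
import numpy as np


def simulate_discovery(S=50, theta1=3.0, theta2=1.0, T=30,
                       sigma_e=0.2, e1=3.0, alpha=2.0, seed=0):
    """Forward-simulate Equations (1), (4), (5), (6) and the discovery years.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)

    # Equation (1): log(N_i) ~ Normal(theta1, theta2^2), i = 1..S.
    N = rng.lognormal(mean=theta1, sigma=theta2, size=S)
    p = N / N.sum()  # p_i = N_i / N

    # Equation (4): Gaussian random walk for e_t, t = 2..T.
    # Fixing the starting value e_1 = e1 is an assumption of this sketch.
    steps = rng.normal(0.0, sigma_e, size=T - 1)
    e = e1 + np.concatenate(([0.0], np.cumsum(steps)))

    # Equation (5): n(t) | e_t ~ Poisson(exp(e_t)).
    n = rng.poisson(np.exp(e))
    # Equation (6): a_t | e_t, alpha ~ Poisson(exp(e_t - alpha)).
    a = rng.poisson(np.exp(e - alpha))

    # Sample individuals with replacement, year by year, and record the
    # first year (1-based) in which each species is seen.
    first_year = {}
    for year in range(1, T + 1):
        seen = rng.choice(S, size=n[year - 1], p=p)
        for s in np.unique(seen):
            first_year.setdefault(int(s), year)

    t = np.array(sorted(first_year.values()), dtype=int)  # t_1 <= ... <= t_R
    return {"N": N, "e": e, "n": n, "a": a, "t": t}


def interdiscovery_log_prob(m, p):
    """Log of Equation (2).

    m: inter-discovery sample sizes m_1..m_R (m_1 is not used, since M_1 = 1).
    p: species probabilities, ordered by discovery, with length >= R.
    """
    m = np.asarray(m, dtype=float)
    p = np.asarray(p, dtype=float)
    R = m.size
    cum = np.cumsum(p)[: R - 1]  # sum_{j=1}^{i-1} p_j for i = 2..R
    return float(np.sum((m[1:] - 1.0) * np.log(cum) + np.log(p[1:R])))
```

A call such as `simulate_discovery()` returns the simulated abundances, the latent effort path, the yearly sample sizes, the author counts, and the sorted discovery years; in the setting described above, only the discovery years and the author counts would be treated as observed.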