Download Bayesian Estimation of the Number of Unknown Species

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Theoretical ecology wikipedia , lookup

Transcript
Bayesian Estimation of the Number of Unknown Species:
Incorporating a Model for the Discovery Process
1
1
2
Simon P. Wilson , Brett Houlding , and Mark Costello
1
Discipline of Statistics, Trinity College Dublin, Ireland. Correspondance to [email protected].
2
Department of Marine Sciences, University of Auckland, Leigh Marine Laboratory, New Zealand.
Abstract
The problem of estimating the number of existing species has been reviewed many times within the statistics literature. The typical case is to
estimate the unknown number of species S from a sample of n individuals in which R ≤ S distinct species are observed with abundances
n1, . . . , nR. Unfortunately, records of animal and plant species discovery usually contain only the number of sampled species R and their dates of
discovery t1, . . . , tR. As such, here we propose a model for estimating the number of species S using only the species discovery dates t1, . . . , tR
and the yearly number of distinct author surnames a1, . . . , aT , the latter being considered a proxy for a latent ‘effort process’ in species sampling.
1. Species Estimation
2. The Discovery Process
3. A Proxy for the Effort Process
A community contains an unknown number of species S with
unknown abundances N1, . . . , NS . The total population of life
PS
within the community is thus N = i=1 Ni. Typically, species
abundance models define probability distributions for the Ni in
terms of a set of parameters θ. Here we assume species abundances follow a Log-normal distribution, which, even though being
a continuous distribution whilst populations are integer-valued, is
a common modelling assumption. Hence:
If we assume that the available data consists not of the interdiscovery sample sizes, but instead the years t1, . . . , tR of
species discovery, then if the number of individuals sampled in any given year were also known (i.e., the values n(1), . . . , n(T ) were available), the following constraints
would bound the values of M1, . . . , MR:
The number of individuals sampled in a given year is then assumed to be a random process whose probability distribution is
determined by et. A common technique that is suitable for the
discrete effort process suggested here is to assume the number of
individuals sampled in a given year follows a Poisson distribution
with mean Et = exp(et):
log(Ni) ∼
N (θ1, θ22),
for i = 1 : S
max Mi−1,
tX
i −1
n(j) < Mi ≤
j=1
(1)
ti
X
n(j)
(3)
A sample of individuals is assumed to have been taken sequentially
over a period of T years, with n(t) individuals being sampled in
year t. This has resulted in a total of R distinct species being
identified. Furthermore, it is assumed that each individual had
an equal chance of being sampled, and that sampling was with
replacement, so that the probability of a randomly sampled indiPi
vidual being of species i is pi = Ni/N .
As such, and observing that Mi = 1 + j=2 mj , the interdiscovery sample sizes can be considered as latent variables
Denoting by Mi the sample number in which the i-th species is whose probability mass function follows that defined in Equafirst observed, (hence 1 = M1 < M2 < · · · < MR ≤ n), we tion (2), but truncated so as to satisfy the constraints of Equadefine mi = Mi − Mi−1 to be the number of individuals sampled tion (3). The information leading to inference on the species
between the (i − 1)-th and i-th species discoveries.
abundances, and hence the number of unknown species, would
then be provided through the species discovery years and the
The distribution of m1, . . . , mR given N1, . . . , NR can then be yearly sample sizes. Figure 2 below represents this aspect of
determined as follows: There are mi individuals sampled between the inference problem graphically.
the discovery of the (i − 1)-th and i-th species. This happens if
and only if mi − 1 individuals are sampled from the first i − 1
species (meaning no new species are observed) and then the mi-th
sample is a first observation of new species i. Under the assumpti
tion of sampling with replacement, this occurs with probability
Pi−1
( j=1 pj )mi−1pi. Hence, assuming that mi ≥ 1 for all i and that
S ≥ R, we have for all R discovered species:
P (m1, . . . , mR|N1, . . . , NS , R) =
(
i=2 j=1
S
θ
pj )
mi−1
pi
(2)
Nk
k=1:S
n(t)|et ∼ Po exp(et) , for t = 1 : T
(5)
j=1
That is to say, for the Mi-th individual to have been sampled
in year ti, the cumulative number of individuals sampled up
to and inclusive of year ti − 1 must be less than Mi. In
addition, the cumulative number of individuals sampled up
to and inclusive of the year in which the Mi-th individual is
sampled, must be greater than or equal to Mi.
R X
i−1
Y
mi
i=1:R
n( j )
The advantage of defining the ‘effort process’ is that we can hopefully measure some proxy of it, and relate that to the effort rate
Et. One such possibility is the yearly number of distinct author
surnames in papers concerning species discovery. Defining at to be
the number of distinct author surnames in year t, a simple model
is to assume that at is Poisson distributed with mean exp(et − α),
for some parameter α:
at|et, α ∼ Po exp(et − α)) , for t = 1 : T
(6)
The parameter α can be interpreted in the following way: Since
we expect exp(et) individuals to be sampled in year t, then we expect exp(α) individuals to be sampled per author. In other words,
α is the natural logarithm of the number of individuals we expect
an author to sample in a given year. Finally, we are able to generate the at by examining databases of species discovery. Figure
3 below represents the proposed inference procedure concerning
yearly sample sizes.
σe
α
ej
aj
j=1:T
n( j )
Figure 2: DAG displaying the inference model for the latent inter-discovery sample sizes.
j=1:T
Figure 3: DAG displaying the inference model for the yearly
Again, however, it is unlikely that ecologists or databases will sample sizes.
mi
Nk
maintain a record of the number of samples taken annually,
k=1:S
i=1:R
and so these too must be assumed unavailable. Moreover, Taken collectively, the model for estimating the number of unmodels for how many individuals have been sampled, as a known species by incorporating a model for the discovery profunction of time, do not appear to have been considered within cess which utilizes a proxy for a latent effort process, is presented
graphically in Figure 4 below.
Figure 1: DAG for Species Estimation using inter-discovery the statistical or biological literatures.
sample sizes.
σe
α
To account for this problem, we propose the use of a latent
As such, if the values of m1, . . . , mR were available, then such ‘effort process’ that is used to represent an underlying rate
data, taken in combination with a prior distribution over θ, could of sampling effort Et > 0. The value of this process is also
be implemented in Bayes’ Theorem to determine a posterior dis- unknown, however, it does not have to be something that can
ej
ti
aj
tribution for N1, . . . , NS . Moreover, techniques such as Reversible be potentially observed or interpreted, so long as there is data
S
θ
Jump Markov Chain Monte Carlo (RJMCMC) can then be im- available that is related to it.
plemented to permit inference over the total number of species S.
Figure 1 above provides a graphical representation for this part of A simple possibility is to model Et as a discrete Gaussian
mi
Nk
n( j )
the species estimation model.
k=1:S
i=1:R
j=1:T
process whereby et = log(Et) follows a discrete-time Gaussian
random walk. The joint distribution of e1, . . . , eT will then
Unfortunately, however, records of animal and plant species dis- be an Intrinsic Gaussian Markov Random Field (IGMRF):
covery usually only list the year of discovery, and not the number
Figure
4:
Full
DAG
for
species
estimation
problem.
of samples collected between discoveries, e.g., the most compreet|et−1, σe2 ∼ N et−1, σe2 , for t = 2 : T
(4)
hensive database of known species: the Catalogue of Life. As
Acknowledgement
such, to estimate the total number of unknown species using such
This work is funded by the STATICA project, a Principal Inlimited information, a model is required for the discovery process
vestigator program of Science Foundation Ireland, Grant number
itself.
08/IN.1/I1879.