Poisson approximation for occurrence times: A new approach and an application to genetics

Miguel Abadi
Instituto de Matemática, Estatística e Ciência da Computação, Universidade Estadual de Campinas [[email protected]]

One basic exercise in probability is the convergence of the binomial distribution B(n, p) to the Poisson distribution P(λ) as n grows and the product np converges to λ. That is, for an independent process, the number of successes in repeated coin tossing converges to the Poisson distribution. In order to model realistic situations, generalizations of this fact were recently developed. In particular, to study the number of occurrences of a certain observable, we are interested in:
• observables other than a success in a single coin toss;
• processes other than independent ones;
• giving not just a convergence theorem but an approximation theorem that provides an explicit error term for the approximation.

The most famous tool for proving this convergence is probably the method due to Chen and Stein. For a description of it we suggest Arratia, Goldstein and Gordon (1990). One feature of their approach is that it provides bounds for the total variation distance between the two distributions.

We present an alternative new method introduced by Abadi and Vergne (2005a), which uses several previous results obtained by Galves and Schmitt (1997), Collet, Galves and Schmitt (1999), Abadi (2001), and Abadi (2004). Our results apply to “words” of any length and can therefore be easily adapted to any observable by writing it as a disjoint union of words. Our results are established in the setting of mixing processes, which broadly covers ergodic Markov chains and Gibbs measures, even though the technique is a very general one. A crucial difference between our approach and the Chen-Stein method is that we prove a pointwise error bound: this allows us to control the error over the tail of the distribution.
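The difference between a total variation bound and a pointwise bound can be seen numerically in the classical independent case. The following is a minimal sketch, not the method of Abadi and Vergne: it only compares B(n, λ/n) with P(λ) directly, and the function names (binomial_pmf, poisson_pmf, total_variation, pointwise_error, poisson_tail) are hypothetical helpers introduced for illustration.

```python
import math


def binomial_pmf(n, p, k):
    """P(B(n, p) = k) for 0 < p < 1, computed in log space to avoid overflow."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def poisson_pmf(lam, k):
    """P(P(lam) = k), computed in log space to stay stable for large k."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def total_variation(n, lam):
    """Total variation distance between B(n, lam/n) and P(lam)."""
    p = lam / n
    l1 = sum(abs(binomial_pmf(n, p, k) - poisson_pmf(lam, k))
             for k in range(n + 1))
    # Poisson mass above n, where the binomial puts no mass.
    beyond_n = max(0.0, 1.0 - sum(poisson_pmf(lam, k) for k in range(n + 1)))
    return 0.5 * (l1 + beyond_n)


def pointwise_error(n, lam, k):
    """|P(B(n, lam/n) = k) - P(P(lam) = k)| at a single point k."""
    return abs(binomial_pmf(n, lam / n, k) - poisson_pmf(lam, k))


def poisson_tail(lam, k):
    """P(P(lam) >= k)."""
    return max(0.0, 1.0 - sum(poisson_pmf(lam, j) for j in range(k)))


if __name__ == "__main__":
    lam = 2.0  # assumed value of the limit of n * p, chosen for illustration
    for n in (10, 100, 1000):
        print(n, total_variation(n, lam), pointwise_error(n, lam, 10))
    # Far enough in the tail, the probability itself drops below the total
    # variation distance, so a total variation bound alone gives no useful
    # relative control there.
    print(poisson_tail(lam, 10), total_variation(1000, lam))
```

Under these assumptions, the printed values suggest why a single number such as the total variation distance is too coarse for tail events, while an error estimate that depends on the point k can still be informative there.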
We illustrate the power of our approach with the following application. In genetic analysis, one goal is to identify “words” in the DNA sequence that have some specific functionality. One way to do this is to test whether a specific word appears in the DNA sequence with a frequency that differs (either higher or lower) from the expected (randomly generated) one. Clearly, the expected frequency depends on the model of the sequence. We present some simulations in which this model is an ergodic Markov chain. In such a case, our results apply, and the aim is to test whether the number of occurrences of the specific word is close to a Poisson random variable. If it is, we can assume that the word is randomly generated. If it differs, we identify it as a word with some specific functionality.

Our results hold for more general processes, allowing us to use the same approach even when the sequence is modeled by more general processes that seem more adequate for modeling DNA sequences. A formal presentation of the theoretical results can be found in the paper Statistics and error terms of occurrence times in mixing processes, which can be downloaded from http://www.ime.usp.br/~miguel/statis.pdf. The applications are presented in Abadi and Vergne (2005b).

1. Abadi, M. (2001). Exponential approximation for hitting times in mixing processes. Math. Phys. Elec. Journal 7, 2.
2. Abadi, M. (2004). Sharp error terms and necessary conditions for exponential hitting times in mixing processes. Annals of Probability 32, no. 1A, 243-264.
3. Abadi, M. and Vergne, N. (2005a). Statistics and error terms of occurrence times in mixing processes. Submitted. Can be downloaded from: http://www.ime.usp.br/~abadi
4. Abadi, M. and Vergne, N. (2005b). Poisson approximation in biological context. Unicamp and Univ. Evry.
5. Arratia, R., Goldstein, L. and Gordon, L. (1990). Poisson approximation and the Chen-Stein method. With comments and a rejoinder by the authors. Stat. Sci. 5, 403-434.
6. Collet, P., Galves, A. and Schmitt, B. (1999). Repetition times for Gibbsian sources. Nonlinearity 12, 1225-1237.
7. Galves, A. and Schmitt, B. (1997). Inequalities for hitting times in mixing dynamical systems. Random Comput. Dyn. 5, 337-348.