Download Powerpoint - Biomathematics and Statistics Scotland

Searching for applications of EVT in biology

Adam Butler, Biomathematics & Statistics Scotland
UK extremes, April 2007
Acknowledgements: Len Thomas, Clive Anderson, Dirk Husmeier

Overview

• Biologists are frequently interested in the properties of extreme or rare events (e.g. extinction, long-range dispersal, genetic mutation), but extreme value theory (EVT) is not widely known or used in many branches of biology
• Some possible reasons:
  - Biological sciences have tended to be data-poor, relative to e.g. hydrology
  - Focus on testing of scientific hypotheses rather than risk assessment
  - Difficulty in deriving a meaningful quantitative definition of an extreme event
• Opportunities arise from the large datasets now available in modern biology (e.g. genetics, ecological modelling), & from an increasing focus on quantitative risk assessment

Genetics

• “…a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences…” (Wikipedia)
• EVT has been used for sequence alignment since the early 90s (Karlin et al., 1990; Mott, 1992; Mott & Tribe, 1999), and is now embedded within standard software (BLAST, FASTA)
• Basic idea is to compare the target sequence with a (very) large database of known sequences, by:
  1) defining a similarity score
  2) using a fast algorithm to search for the best match(es) within the database
  3) using EVT to evaluate the statistical significance of this match
• Theoretical arguments are used to justify the use of a Gumbel model for the best score
• Current interest is in the alignment of multiple sequences (Frommlet & Futschik, 2004; Wang & Sen, 2006), & this requires the use of multivariate extreme value methods

Ecology

Review papers: Gaines & Denny (1993), Katz et al. (2005)

• Disturbance: study the extremes of environmental processes that are known to lead to ecological disturbance (sediment rates, fire sizes, frost days)
• Longevity & survival: study the maximum lifespan or size of an individual
• Population dynamics: evaluate the probability of extinction or explosion of a population
• Dispersal & spread: spatial spread (of diseases, pollen, invasive species, native species responding to climate change) is known to be influenced by long-range dispersal events: use EVT to analyse dispersal data? Issues: spatial structure; censoring &/or non-reporting; mixtures
• Ecological modelling: study the properties of extreme events simulated by complex process-based ecological models (e.g. mass extinction events)
  - Deterministic models: find the region of the parameter space associated with the process exceeding a particular level
  - Stochastic models: calculate the probability of the process exceeding a threshold for a given parameter set

EVT for complex stochastic models: some vague ideas

Y(θ) ~ CSM(θ), where CSM is a complex stochastic model; likelihood of CSM intractable, θ high dimensional

Possible approach if simulation is quick & we have real data x…?: EVT + ABC (approximate Bayesian computation)
  1. generate a value from the prior, θ ~ π(θ)
  2. use the model to simulate a dataset y(θ) ~ CSM(θ)
  3. fit y(θ)|{y(θ) > u} ~ GPD (generalised Pareto distribution) to estimate P(Y(θ) > v), for v >> u
  4. accept θ if P(Y(θ) > v) lies within a 95% confidence interval about P(X > v), else reject
Or perhaps could use ABC-MCMC on (θ, v) with pseudo-prior on v

Possible approach if simulation is slow & we do not have data…?: EVT + GP (Gaussian process)
• Run CSM for a relatively small set of parameter values θ
• Assume y(θ)|{y(θ) > u} ~ GPD(ψ(θ))
• Assume ψ(θ) ~ N(μ, Σ)
• Impose structure on Σ & fit by hierarchical Bayes
• Could be used to draw inferences about P(Y(θ) > v) for v >> u, even if we have not simulated from CSM(θ)

Some references

Karlin, S., Dembo, A. and Kawabata, T. (1990) Statistical composition of high-scoring segments from molecular sequences. The Annals of Statistics, 18, 571-581.
Mott, R. (1992) Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bulletin of Mathematical Biology, 54, 59-75.
Gaines, S. D. and Denny, M. W. (1993) The largest, smallest, highest, lowest, longest, and shortest: extremes in ecology. Ecology, 74, 1677-1692.
Mott, R. and Tribe, R. (1999) Approximate statistics of gapped alignments. Journal of Computational Biology, 6, 91-112.
Frommlet, F. and Futschik, A. (2004) On the dependence structure of sequence alignment scores calculated with multiple scoring matrices. Statistical Applications in Genetics and Molecular Biology, 3, article 24.
Katz, R. W., Brush, G. S. and Parlange, M. B. (2005) Statistics of extremes: modeling ecological disturbances. Ecology, 86, 1124-1134.
Wang, L. and Sen, P. K. (2006) Extreme value theory in some statistical analysis of genomic sequences. Extremes, 8, 295-310.
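
The four EVT + ABC steps listed under "some vague ideas" can be sketched in code. This is only a minimal illustration under stated assumptions, not the author's implementation: the slides do not say how the GPD is fitted, so a simple method-of-moments fit is swapped in here; the 95% interval about P(X > v) is taken as a given input rather than computed from real data; and the prior, the toy simulator and all function names (fit_gpd_moments, gpd_tail_prob, simulated_tail_prob, abc_evt, toy_prior, toy_csm) are hypothetical.

```python
import math
import random
import statistics


def fit_gpd_moments(excesses):
    """Method-of-moments GPD fit to threshold excesses; returns (xi, sigma).

    Uses mean = sigma / (1 - xi) and var = sigma^2 / ((1 - xi)^2 (1 - 2 xi)),
    so it assumes the shape parameter satisfies xi < 1/2.
    """
    m = statistics.mean(excesses)
    s2 = statistics.variance(excesses)
    ratio = m * m / s2
    return 0.5 * (1.0 - ratio), 0.5 * m * (1.0 + ratio)


def gpd_tail_prob(p_u, xi, sigma, u, v):
    """P(Y > v) for v > u, given P(Y > u) = p_u and a GPD(xi, sigma) for excesses."""
    z = (v - u) / sigma
    if abs(xi) < 1e-9:
        return p_u * math.exp(-z)  # exponential limit as xi -> 0
    base = 1.0 + xi * z
    if base <= 0.0:
        return 0.0  # v lies beyond the fitted upper end point (xi < 0)
    return p_u * base ** (-1.0 / xi)


def simulated_tail_prob(y, u, v, min_excesses=20):
    """Step 3: fit a GPD above u and extrapolate to v; None if too few excesses."""
    excesses = [yi - u for yi in y if yi > u]
    if len(excesses) < min_excesses:
        return None
    xi, sigma = fit_gpd_moments(excesses)
    return gpd_tail_prob(len(excesses) / len(y), xi, sigma, u, v)


def abc_evt(prior_sample, simulate, u, v, interval, n_draws, rng):
    """Rejection ABC: keep theta when the simulated tail probability is in `interval`."""
    lo, hi = interval
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)               # step 1: draw from the prior
        y = simulate(theta, rng)                # step 2: simulate a dataset
        p_v = simulated_tail_prob(y, u, v)      # step 3: GPD tail estimate
        if p_v is not None and lo <= p_v <= hi:  # step 4: accept or reject
            accepted.append(theta)
    return accepted


# Toy stand-ins (assumptions): a uniform prior and an exponential "model"
# with mean theta, in place of a real complex stochastic model.
def toy_prior(rng):
    return rng.uniform(0.5, 3.0)


def toy_csm(theta, rng):
    return [rng.expovariate(1.0 / theta) for _ in range(2000)]


rng = random.Random(1)
# The interval is a placeholder for a 95% CI about P(X > v) from real data.
accepted = abc_evt(toy_prior, toy_csm, u=2.0, v=8.0,
                   interval=(1e-4, 1e-3), n_draws=200, rng=rng)
print(len(accepted), "of 200 prior draws accepted")
```

The accepted values form an approximate posterior sample for θ. In practice the interval would come from the real data, and the threshold u would need checking with the usual GPD diagnostics.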

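Step 3 of the sequence-alignment recipe in the Genetics slide (using a Gumbel model to judge the significance of the best score) can be illustrated as follows. This is a rough sketch under assumptions and not how BLAST or FASTA compute significance: the Gumbel parameters are estimated by the method of moments from a sample of best scores against unrelated sequences, and that sample is synthetic. The function names and the numbers used are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(scores):
    """Method-of-moments estimates (mu, beta) of a Gumbel model for maxima."""
    beta = statistics.stdev(scores) * math.sqrt(6.0) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """P(best score >= score) under a Gumbel(mu, beta) model."""
    return 1.0 - math.exp(-math.exp(-(score - mu) / beta))


def simulate_null_best_scores(n, mu, beta, rng):
    """Synthetic stand-in for best scores against unrelated sequences."""
    scores = []
    for _ in range(n):
        unif = max(rng.random(), 1e-300)  # guard against log(0)
        scores.append(mu - beta * math.log(-math.log(unif)))  # inverse Gumbel CDF
    return scores


rng = random.Random(0)
# Assumed null distribution, for illustration only.
null_scores = simulate_null_best_scores(5000, mu=30.0, beta=4.0, rng=rng)
mu_hat, beta_hat = fit_gumbel_moments(null_scores)
p = gumbel_p_value(55.0, mu_hat, beta_hat)
print(f"mu = {mu_hat:.2f}, beta = {beta_hat:.2f}, p-value for score 55 = {p:.2g}")
```

A small p-value indicates that a best score this high would be unusual if the target sequence were unrelated to everything in the database.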