Searching for applications
of EVT in biology
Adam Butler, Biomathematics & Statistics Scotland
UK extremes, April 2007
Acknowledgements: Len Thomas, Clive Anderson, Dirk Husmeier
Overview
• Biologists are frequently interested in properties of extreme or rare
events (e.g. extinction, long-range dispersal, genetic mutation),
but EVT is not widely known or used in many branches of biology
• Some possible reasons:
• Biological sciences have tended to be data-poor, relative to e.g. hydrology
• Focus on testing of scientific hypotheses rather than risk assessment
• Difficulty in deriving a meaningful quantitative definition of an extreme event
• Opportunities arise from the large datasets generated in modern
biology (e.g. genetics, ecological modelling), & from an increasing
focus on quantitative risk assessment
Genetics
• “…a sequence alignment is a way of arranging the primary
sequences of DNA, RNA, or protein to identify regions of
similarity that may be a consequence of functional, structural, or
evolutionary relationships between the sequences…” (Wikipedia)
• EVT has been used for sequence alignment since the early 1990s
(Karlin et al., 1990; Mott, 1992; Mott & Tribe, 1999), and is now
embedded within standard software (BLAST, FASTA)
• Basic idea is to compare the target sequence with a (very) large
database of known sequences, by:
1) defining a similarity score
2) using a fast algorithm to search for the best match(es) within the database
3) using EVT to evaluate the statistical significance of this match
• Theoretical arguments are used to justify the use of a Gumbel
model for the best score
• Current interest is in the alignment of multiple sequences
(Frommlet & Futschik, 2004; Wang & Sen, 2006), & this requires
the use of multivariate extreme value methods
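Steps 1) to 3) above can be sketched in a few lines of Python. This is a toy illustration only, not how BLAST or FASTA work: the ungapped scoring function, the +1/-1 scores, the random "database" used as a null reference, and the method-of-moments Gumbel fit are all assumptions made for the example.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649

def local_score(a, b, match=1, mismatch=-1):
    """Toy similarity score: best ungapped local alignment over all diagonals."""
    best = 0
    for offset in range(-(len(a) - 1), len(b)):
        run = 0
        for i in range(len(a)):
            j = i + offset
            if 0 <= j < len(b):
                run = max(0, run + (match if a[i] == b[j] else mismatch))
                best = max(best, run)
    return best

def fit_gumbel_moments(scores):
    """Method-of-moments Gumbel fit (an assumption; ML fitting is also common)."""
    scale = statistics.stdev(scores) * math.sqrt(6) / math.pi
    loc = statistics.mean(scores) - EULER_GAMMA * scale
    return loc, scale

def gumbel_pvalue(score, loc, scale):
    """P(S >= score) under a Gumbel(loc, scale) model for the best score."""
    return -math.expm1(-math.exp(-(score - loc) / scale))

rng = random.Random(0)

def random_seq(n):
    return "".join(rng.choice("ACGT") for _ in range(n))

target = random_seq(40)

# Null distribution: scores of the target against unrelated random sequences
null_scores = [local_score(target, random_seq(100)) for _ in range(200)]
loc, scale = fit_gumbel_moments(null_scores)

# A database entry containing a planted copy of part of the target
hit = random_seq(30) + target[5:35] + random_seq(40)
p_hit = gumbel_pvalue(local_score(target, hit), loc, scale)
p_typical = gumbel_pvalue(statistics.median(null_scores), loc, scale)
```

Here `p_hit` should come out far smaller than `p_typical`, which is the sense in which the Gumbel model flags the planted match as statistically significant.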
Ecology
Review papers: Gaines & Denny (1993), Katz et al. (2005)
• Disturbance
Study the extremes of environmental processes that are known to
lead to ecological disturbance: sediment rates, fire sizes, frost days
• Longevity & survival
Study the maximum lifespan or size of an individual
• Population dynamics
Evaluate the probability of extinction or explosion of a population
• Dispersal & spread
Spatial spread (of diseases, pollen, invasive species, native
species responding to climate change) known to be influenced by
long-range dispersal events: use EVT to analyse dispersal data?
Issues: spatial structure; censoring &/or non-reporting; mixtures
• Ecological modelling
Study the properties of extreme events simulated by complex
process-based ecological models – e.g. mass extinction events
Deterministic models: find the region of the parameter space
associated with the process exceeding a particular level
Stochastic models: calculate the probability of the process
exceeding a threshold for a given parameter set
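For the stochastic case, the exceedance probability for a given parameter set can in principle be estimated by brute-force simulation. The sketch below uses a stochastic Ricker population model purely as a hypothetical stand-in for a process-based ecological model; the model, the parameter names (`r`, `k`, `sigma`) and the level of 150 are assumptions for illustration.

```python
import math
import random

def simulate_peak(r, k, sigma, rng, n0=10.0, years=50):
    """Peak population size from one run of a stochastic Ricker model."""
    n, peak = n0, n0
    for _ in range(years):
        n = n * math.exp(r * (1.0 - n / k) + rng.gauss(0.0, sigma))
        peak = max(peak, n)
    return peak

def exceedance_probability(level, r, k, sigma, runs=2000, seed=1):
    """Monte Carlo estimate of P(peak population > level) for one parameter set."""
    rng = random.Random(seed)
    hits = sum(simulate_peak(r, k, sigma, rng) > level for _ in range(runs))
    return hits / runs

# Same growth rate and carrying capacity, low versus high environmental noise
p_low = exceedance_probability(level=150.0, r=0.5, k=100.0, sigma=0.1)
p_high = exceedance_probability(level=150.0, r=0.5, k=100.0, sigma=0.4)
```

When the event is rare, as it is likely to be in the low-noise case here, the raw proportion of exceedances rests on very few simulated events. That is the situation in which fitting a tail model to the simulated output becomes attractive.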
EVT for complex stochastic
models: some vague ideas
Y(θ) ~ CSM(θ), likelihood of CSM intractable, θ high dimensional
Possible approach if simulation is quick & we have real data x…?:
EVT + ABC:
1. generate a value from the prior, θ ~ π(θ)
2. use the model to simulate a dataset y(θ) ~ CSM(θ)
3. fit y(θ)|{y(θ) > u} ~ GPD to estimate P(Y(θ) > v), for v >> u
4. accept θ if P(Y(θ) > v) lies within a 95% confidence
interval about P(X > v), else reject
Or perhaps could use ABC-MCMC on (θ, v) with pseudo-prior on v
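The four EVT + ABC steps above might look as follows in Python. This is a minimal sketch under strong assumptions: an exponential simulator stands in for the complex stochastic model, the prior is uniform, the GPD is fitted by the method of moments rather than maximum likelihood, and the 95% interval about P(X > v) is a simple bootstrap percentile interval. None of these choices come from the slides.

```python
import math
import random

def gpd_tail_prob(sample, u, v):
    """Estimate P(Y > v) from a GPD fitted to excesses over u (method of moments)."""
    excess = [y - u for y in sample if y > u]
    if len(excess) < 10:          # too few exceedances to fit anything
        return 0.0
    m = sum(excess) / len(excess)
    s2 = sum((e - m) ** 2 for e in excess) / (len(excess) - 1)
    xi = 0.5 * (1.0 - m * m / s2)             # shape estimate
    sigma = 0.5 * m * (m * m / s2 + 1.0)      # scale estimate
    rate = len(excess) / len(sample)          # empirical P(Y > u)
    if abs(xi) < 1e-9:
        return rate * math.exp(-(v - u) / sigma)
    base = 1.0 + xi * (v - u) / sigma
    return rate * base ** (-1.0 / xi) if base > 0 else 0.0

def simulate_csm(theta, n, rng):
    """Hypothetical stand-in for the CSM: exponential draws with mean theta."""
    return [rng.expovariate(1.0 / theta) for _ in range(n)]

rng = random.Random(42)
u, v, n = 4.0, 12.0, 2000
x = simulate_csm(2.0, n, rng)   # plays the role of the real data (theta = 2)

# Approximate 95% bootstrap interval about the estimate of P(X > v)
boot = sorted(gpd_tail_prob(rng.choices(x, k=n), u, v) for _ in range(200))
lower, upper = boot[5], boot[194]

accepted = []
for _ in range(300):
    theta = rng.uniform(0.5, 4.0)       # 1. draw theta from the prior
    y = simulate_csm(theta, n, rng)     # 2. simulate a dataset from the model
    p = gpd_tail_prob(y, u, v)          # 3. GPD-based estimate of P(Y > v)
    if lower <= p <= upper:             # 4. accept or reject
        accepted.append(theta)

posterior_mean = sum(accepted) / len(accepted) if accepted else float("nan")
```

The accepted values of `theta` form an approximate posterior sample. In this toy setting they should concentrate around the value used to generate `x`, although the acceptance rule compares only one tail summary and so discards much of the information in the data.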
Y(θ) ~ CSM(θ), likelihood of CSM intractable, θ high dimensional
Possible approach if simulation is slow & we do not have data…?
EVT + GP:
Run CSM for a relatively small set of parameter values θ
Assume y(θ)|{y(θ) > u} ~ GPD(φ(θ))
Assume φ(θ) ~ N(μ, Σ)
Impose structure on Σ & fit by hierarchical Bayes
Could be used to draw inferences about P(Y(θ) > v) for v >> u,
even if we have not simulated from CSM(θ)
Some references
Karlin, S., Dembo, A. and Kawabata, T. (1990) Statistical composition of high-scoring
segments from molecular sequences. The Annals of Statistics, 18, 571-581.
Mott, R. (1992) Maximum-likelihood estimation of the statistical distribution of Smith-Waterman
local sequence similarity scores. Bulletin of Mathematical Biology, 54, 59-75.
Gaines, S.D. and Denny, M. W. (1993) The largest, smallest, highest, lowest, longest, and
shortest: extremes in ecology. Ecology, 74, 1677-1692.
Mott, R. and Tribe, R. (1999) Approximate statistics of gapped alignments. Journal of
Computational Biology, 6, 91-112.
Frommlet, F. and Futschik, A. (2004) On the dependence structure of sequence alignment
scores calculated with multiple scoring matrices. Statistical Applications in Genetics
and Molecular Biology, 3, article 24.
Katz, R. W., Brush, G.S. and Parlange, M.B. (2005) Statistics of extremes: modeling
ecological disturbances. Ecology, 86, 1124-1134.
Wang, L. and Sen, P. K. (2006) Extreme value theory in some statistical analysis of
genomic sequences. Extremes, 8, 295-310.