Applications of scan statistics in molecular biology and neuroscience
by Chan Hock Peng
Dept of Statistics and Applied Probability
Outline
• 1. General introduction
• 2. Applications in molecular biology
(weighted scan statistics)
• 3. Tail probability computations
• 4. Applications in neuroscience
(template matching problem)
• 5. Tail probability computations
• 6. Extensions and other applications
Notation
• M_u: The maximum score in any window of length u.
• λ: The underlying rate of events occurring under normal circumstances.
• n: The length of the interval under consideration.
Example 1
• (USA Today, 1996) On Feb 22, the US Navy suspended all operations of the F-14 jet after the third crash in one month.
• Three crashes in a month was seven times the expected rate based on a 5-year period.
• M_30 = 3, n = 5*365, λ = 1/70.
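A rough way to get a feel for how unusual M_30 = 3 is: simulate the crashes as a homogeneous Poisson process and count how often some 30-day window holds at least 3 events. This is only a Monte Carlo sketch. The continuous-time Poisson model, the function names, and the number of replications are assumptions made for illustration, not part of the talk.

```python
import random

def scan_statistic(times, u):
    """Largest number of events falling in any window of length u."""
    times = sorted(times)
    best, left = 0, 0
    for right, t in enumerate(times):
        # Shrink the window until it spans less than u.
        while t - times[left] >= u:
            left += 1
        best = max(best, right - left + 1)
    return best

def simulate_poisson_process(rate, n, rng):
    """Event times of a homogeneous Poisson process on [0, n)."""
    times = []
    t = rng.expovariate(rate)
    while t < n:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def mc_tail_probability(k, u, n, rate, reps=2000, seed=0):
    """Monte Carlo estimate of P{M_u >= k} (assumed Poisson model)."""
    rng = random.Random(seed)
    hits = sum(
        scan_statistic(simulate_poisson_process(rate, n, rng), u) >= k
        for _ in range(reps)
    )
    return hits / reps

if __name__ == "__main__":
    # Example 1 parameters: u = 30 days, n = 5*365 days, rate = 1/70 per day.
    print(mc_tail_probability(k=3, u=30, n=5 * 365, rate=1 / 70))
```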
Example 2
• (Home News, 1995) In a 10-month period, 11 residents died at a Tennessee state institution. The number was twice what was expected.
• The judge was angry and ordered the mental health commissioner to spend one in four weekends at the institution.
• M_10 = 11, n = ?, λ = 11/20.
Clusters of DAM sites in
E.Coli DNA
• Karlin and Brendel (1992).
• DAM site: occurrence of the pattern GATC.
• Important in repair and replication of DNA.
• M_245 = 8, n = 4.7 million, λ = 1.1/250.
• P-value approx. of Naus (1982): P{M_245 ≥ 8} ≈ 0.87, P{M_245 ≥ 10} ≈ 0.03.
Palindromes in DNA
• A-T and C-G are complementary
bases.
• Complement of CCACGTGG is
GGTGCACC.
• CCACGTGG is a palindromic pattern because its complement reads the same as itself backwards.
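The definition above can be checked mechanically. A minimal sketch, with function names chosen here for illustration: take the base-wise complement and compare it with the reversed sequence.

```python
# Watson-Crick base pairing: A-T and C-G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(seq):
    """Base-by-base complement of a DNA string (no reversal)."""
    return "".join(COMPLEMENT[base] for base in seq)

def is_palindromic(seq):
    """True if the complement equals the sequence read backwards."""
    return complement(seq) == seq[::-1]

if __name__ == "__main__":
    print(complement("CCACGTGG"))      # GGTGCACC
    print(is_palindromic("CCACGTGG"))  # True
```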
Palindromic sequences in
viruses
• Masse et al. (1992) & Leung et al. (1994).
• Palindromic sequences cluster around the origin of replication.
• An event occurs if there is a palindromic pattern of length at least 10 base pairs.
• HCMV sequence. M_1000 = 10, n = 229354, λ = 0.001. p-value = 0.00195.

Extensions to general scoring
functions (weighted scan)
• In Chew, Choi and Leung (2005),
longer palindromic patterns are given
larger weights.
• For example, a pattern of length k
can be given score of k/10.
• p-value computations ?
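To make the weighted scan concrete, here is a small sketch that gives a palindrome of length k the score k/10 and finds the largest total score in any window of length u. The positions and lengths in the usage example are hypothetical, and the sliding-window shortcut assumes all scores are positive, which holds for k/10.

```python
def weighted_scan(positions, lengths, u):
    """Maximum total score k/10 over windows of length u.

    positions: start positions of palindromic patterns (hypothetical input)
    lengths:   pattern lengths k, each contributing a score of k/10
    Assumes positive scores, so the best window can end at an event.
    """
    events = sorted(zip(positions, lengths))
    best = 0.0
    window_sum = 0.0
    left = 0
    for pos, k in events:
        window_sum += k / 10
        # Drop events that are u or more behind the current one.
        while pos - events[left][0] >= u:
            window_sum -= events[left][1] / 10
            left += 1
        best = max(best, window_sum)
    return best

if __name__ == "__main__":
    # Hypothetical data: three patterns close together, one far away.
    print(weighted_scan([100, 350, 900, 2500], [10, 12, 18, 10], u=1000))
```

With every length set to 10, each event scores 1 and the function reduces to the unweighted count scan.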
Other applications of
weighted scan
• Rajewsky et al. (2002) & Lifanov et
al. (2003).
• Scanning for clusters of transcription
factor binding sites.
• Position weight matrices to score words for similarity to a given motif.
• Siepel et al. (2005). Searching for
segments of high evolutionary
conservation.
P-value computations for
weighted scan
• Chan and Zhang (2006).
\[
P\{M_u \ge k\} \approx 1 - \exp\left\{ -\frac{(n-u)\, e^{-u I(k/u)}\, (k/u - \lambda)\, \nu}{\sqrt{2\pi u K''}} \right\}
\]
where
• I is a large deviation rate function.
• ν is an overshoot function.
• K is the moment generating function of the scores.
Template matching in
neuroscience
• Neurons are basic units of
information processing in brain.
• Generate small and highly peaked
electric potentials known as spikes.
• Pattern of spikes modeled as point or
counting process, e.g. Poisson
process.
Template pattern
• Dave and Margoliash (2000) and Mooney (2000): w = (w^(1), ..., w^(d)) are the spike patterns of a zebra finch when it is listening to a bird song.
• Each w^(i) contains the times at which spikes were generated for the ith neuron in an interval of time [0,T).
Longer spike train patterns
• Let y = (y^(1), ..., y^(d)) be the corresponding spike train patterns when the finch is sleeping, observed over a longer period of time [0,a).
• If w matches well with a segment of y, then there is evidence of bird song replay and hence song learning during sleep.
Scoring function
• Consider a kernel function f, e.g. let f(x) = 1 if x < 0.025 ms and f(x) = -0.3 if x > 0.025 ms, where x is the distance from a point of y to the nearest point of w.
• For the illustration below, consider
d=1 and T=0.2ms.
• Let w={.01, .05, .09, .12}.
• Let y ={.32, .75, 1.03, 1.15, 1.25 }.
• To check if there is a match between w and the segment of y starting at time t=1, compare w = {.01,.05,.09,.12} against y1 = {.03,.15}, the points of y in [1, 1.2) shifted back by 1.
• The point .03 provides a score of 1 because there is a point in w less than 0.025ms away.
• The point .15 provides a score of -0.3 because the nearest point in w is more than 0.025ms away.
• Overall score at time t=1 is 1-0.3=0.7.
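The illustration can be reproduced in a few lines. This is a sketch for d=1 only. The function names are made up here, and a distance of exactly 0.025 is treated as a miss, which is a choice the slide leaves open.

```python
def kernel(x, tol=0.025, hit=1.0, miss=-0.3):
    """Score for a spike whose nearest template spike is x away.
    x == tol is counted as a miss (an assumption; the slide leaves it open)."""
    return hit if x < tol else miss

def match_score(w, y, t, T):
    """Score of template w against the segment of y starting at time t.
    Assumes w is non-empty."""
    total = 0.0
    for s in y:
        shifted = s - t            # position of the spike within the window
        if not (0 <= shifted < T):
            continue               # spike lies outside [t, t+T)
        nearest = min(abs(shifted - p) for p in w)
        total += kernel(nearest)
    return total

if __name__ == "__main__":
    w = [0.01, 0.05, 0.09, 0.12]
    y = [0.32, 0.75, 1.03, 1.15, 1.25]
    print(match_score(w, y, t=1.0, T=0.2))  # about 0.7, as on the slide
```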
Scan statistics
• For d>1, add up scores over all neurons
starting at same time t.
• Scan statistic M_T is the maximum possible score over all t in the interval [0,a-T).
• Chi (2004) obtained an approximation of log P{M_T ≥ c}.
• Chan & Loh (2005) obtained a more precise approximation of P{M_T ≥ c}.
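A crude numerical stand-in for M_T is to evaluate the summed score on a grid of starting times t and take the largest value. The sketch below does this for d neurons. The grid step and the function names are assumptions, and a grid can only bound the true maximum over continuous t from below.

```python
def kernel(x, tol=0.025):
    """Example kernel: 1 within tol of a template spike, -0.3 otherwise."""
    return 1.0 if x < tol else -0.3

def neuron_score(w_i, y_i, t, T):
    """Score for one neuron: template w_i against y_i on [t, t+T).
    Assumes w_i is non-empty."""
    return sum(
        kernel(min(abs((s - t) - p) for p in w_i))
        for s in y_i
        if 0 <= s - t < T
    )

def scan_statistic(w, y, a, T, step=0.01):
    """Grid approximation of M_T: max over t in [0, a-T) of the
    score summed over all neurons. `step` is an assumed grid spacing."""
    best = float("-inf")
    i = 0
    while i * step < a - T:
        t = i * step
        total = sum(neuron_score(w_i, y_i, t, T) for w_i, y_i in zip(w, y))
        best = max(best, total)
        i += 1
    return best

if __name__ == "__main__":
    w = [[0.01, 0.05, 0.09, 0.12]]          # d = 1 template
    y = [[0.32, 0.75, 1.03, 1.15, 1.25]]    # d = 1 observed train
    print(scan_statistic(w, y, a=1.3, T=0.2))
```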
Assumptions and related
information
• Each w^(i) is stationary while y^(1), ..., y^(d) are independent Poisson processes.
• Separate formulas when kernel f is
continuous and when it is not continuous.
• The number of times a large score c is exceeded is a Poisson random variable.
Table of approximations
c        MC (s.e.)          C&L
0.017    0.0387 (0.0019)    0.0383
0.018    0.0237 (0.0012)    0.0241
0.019    0.0158 (0.0008)    0.0149
0.020    0.0095 (0.0005)    0.0091
0.021    0.0054 (0.0003)    0.0055
0.022    0.0033 (0.0002)    0.0033
Future works
• Higher dimension Poisson processes
e.g. 2 or 3 dimensional.
• Applications in astronomy and
imaging.
• Varying window-sizes.