A Statistical Method for Finding
Transcriptional Factor Binding Sites
Authors: Saurabh Sinha and Martin Tompa
Presenter: Christopher Schlosberg
CS598ss
Regulation of Gene Expression
Difficulties of Motif Finding
 Regulatory sequences don’t necessarily follow the same
orientation as the coding sequence or as each
other
 Multiple binding sites might exist for each
regulated gene
 Large variation in the binding sites of a single
factor. Variations are not well understood.
Previous & Proposed Methods for
Finding Motifs
 Previous Methods:
 Find longer, general motifs
 Use local search algorithms (Gibbs sampling,
Expectation Maximization, greedy algorithms)
 Proposed Method:
 TFBSs are short enough to use enumerative methods
 Enumerative statistical methods guarantee global
optimality and affordability
Proposed Method Highlights
 Allows variations in the binding site instances of a given transcription
factor
 Allows for motifs to include “spacers”
 Allows for overlapping occurrences (in both orientations), which
leads to complex dependencies
 Statistical significance of a motif (s) is based on the frequencies of
shorter (more frequent) oligonucleotides
 Use of Markov chain to model background genomic distribution
 Use of z-score to measure statistical significance
 Allows for multiple binding sites
Characteristics of a Motif
 Any single TFBS has significant variation
 Many motifs have spacers of 1-11 bp
 Variation often occurs as a transition (e.g. purine →
purine) rather than a transversion (e.g. pyrimidine →
purine)
 Less often, variation occurs between a pair of complementary
bases.
 Indels are uncommon
Proposed Motif Definition
 Motif will be a string with Σ= {A,C,G,T, R,Y,S,W,N}
 A,C,G,T (DNA bp), R (purine), Y (pyrimidine), S (strong), W
(weak), N (spacer)
 TF database (SCPD) confirms this model of variation
 Of 50 binding site consensus sequences, 31 fit exactly (62%)
 Another 10 fit if slight variations are allowed
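A minimal sketch of how a motif over this alphabet could be matched against a sequence, assuming N matches any single base and using Python regular expressions. The function names and the example motifs are illustrative, not taken from the paper.

```python
import re

# Character classes for the motif alphabet on the slide.
# Assumption: the spacer N matches any single base.
CLASSES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]",    # purine
    "Y": "[CT]",    # pyrimidine
    "S": "[CG]",    # strong
    "W": "[AT]",    # weak
    "N": "[ACGT]",  # spacer
}

def motif_to_regex(motif):
    """Translate a motif over {A,C,G,T,R,Y,S,W,N} into a compiled regex."""
    body = "".join(CLASSES[ch] for ch in motif)
    # Zero-width lookahead so that overlapping occurrences are all reported.
    return re.compile("(?=(" + body + "))")

def find_occurrences(motif, seq):
    """Return the start positions where the motif matches on the given strand."""
    return [m.start() for m in motif_to_regex(motif).finditer(seq)]

print(find_occurrences("RCGY", "ACGTGCGC"))
```

Only one strand is scanned here; occurrences in the other orientation would need the reverse complement of the motif as well.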
Measure of Statistical Significance
 Given a set of coregulated S. cerevisiae genes, the input to the
problem is the corresponding set of 800 bp upstream sequences, each with
its 3’ end at the gene’s translation start site.
 Model must measure from input sequences:
 Absolute number of occurrences (Ns) of motif (s)
 Background genomic distribution
 X is a set of random DNA sequences with the same number and
lengths as the input sequences
 Generated by Markov chain of order m
 Transition probabilities determined by (m+1)-mer frequencies in the full
complement of 6000+ upstream sequences (each 800 bp in length)
 Background model chooses m=3
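The background model above can be sketched as follows: estimate order-m transition probabilities from (m+1)-mer counts, then sample random sequences from the chain. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, m must be at least 1, the starting context is drawn uniformly from the observed contexts, and a context never seen in training falls back to a uniform base.

```python
import random
from collections import Counter, defaultdict

def train_markov(seqs, m=3):
    """Estimate order-m transition probabilities from (m+1)-mer counts."""
    counts = defaultdict(Counter)
    for seq in seqs:
        for i in range(len(seq) - m):
            counts[seq[i:i + m]][seq[i + m]] += 1
    model = {}
    for context, successors in counts.items():
        total = sum(successors.values())
        model[context] = {base: n / total for base, n in successors.items()}
    return model

def sample_sequence(model, length, m=3, rng=random):
    """Generate one random sequence of the given length from the chain."""
    # Assumption: start from a uniformly chosen observed m-mer context.
    seq = list(rng.choice(sorted(model)))
    while len(seq) < length:
        probs = model.get("".join(seq[-m:]))
        if probs:
            bases, weights = zip(*sorted(probs.items()))
            seq.append(rng.choices(bases, weights=weights)[0])
        else:
            # Assumption: unseen context, so fall back to a uniform base.
            seq.append(rng.choice("ACGT"))
    return "".join(seq[:length])
```

Calling `sample_sequence` once per input sequence, with matching lengths, would give one realization of the random set X.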
z-score
 Xs – random variable: the number of occurrences of motif (s) in X
 E(Xs) – expectation, σ(Xs) – standard deviation
 zs – number of standard deviations by which the observed value Ns exceeds
its expectation
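Written out from the definitions on this slide, the z-score is the observed count minus its expectation, in units of the standard deviation:

```latex
z_s = \frac{N_s - E(X_s)}{\sigma(X_s)}
```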
Implications
 Possibility of overlap of a motif with itself (in either
orientation)
 Previous study of pattern autocorrelation
 Generalized computation of SD, treating motif as a finite
set of strings
 Higher order Markov chains
 Spacers handled at no extra computational cost
 Handles motif in either orientation
Algorithm
 Enumerates over each input sequence
 Tabulates number Ns of occurrences of each motif in
either direction
 Compute expectation and SD for each motif s.t. Ns>0
 Calculate z-score
 Rank motifs by z-score
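The steps above can be sketched roughly as below. Note the substitution: the paper computes the expectation and SD analytically, whereas this sketch estimates them empirically from a handful of background sequence sets, purely to keep the example short. The function names and the `background_sets` parameter are hypothetical, and a motif equal to its own reverse complement is scanned only once here, which may differ from the paper's palindrome handling.

```python
import re
from statistics import mean, pstdev

# Regex character classes for the motif alphabet; N (spacer) matches any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]", "N": "[ACGT]"}
# Complement over the motif alphabet (R<->Y; S, W, N are self-complementary).
COMPLEMENT = str.maketrans("ACGTRYSWN", "TGCAYRSWN")

def reverse_complement(motif):
    return motif.translate(COMPLEMENT)[::-1]

def count_occurrences(motif, seqs):
    """Count overlapping matches of the motif in either orientation."""
    total = 0
    # A set, so a motif equal to its reverse complement is scanned once.
    for pattern in {motif, reverse_complement(motif)}:
        regex = re.compile("(?=" + "".join(IUPAC[ch] for ch in pattern) + ")")
        total += sum(1 for seq in seqs for _ in regex.finditer(seq))
    return total

def rank_motifs(motifs, seqs, background_sets):
    """Rank motifs with Ns > 0 by z-score (empirical E and SD, an assumption)."""
    ranked = []
    for s in motifs:
        n_s = count_occurrences(s, seqs)
        if n_s == 0:
            continue
        samples = [count_occurrences(s, bg) for bg in background_sets]
        expectation, sd = mean(samples), pstdev(samples)
        if sd == 0:
            continue  # z-score is undefined for this sample
        ranked.append((s, (n_s - expectation) / sd))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```

Each element of `background_sets` is a list of sequences standing in for one draw of X.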
Algorithm Analysis
 For a single motif, complexity is O(c²k²)
 k – # of nonspacer characters in motif
 c – # of instantiations of R, Y, S, W in motif
 Only modest values of k
 Linear dependence on genome size
 Can trim variance calculation to optimize
Number of Occurrences
 Convert motif s into a multiset W
 Add reverse complements for each string in W
 Motif s occurs at a position in X iff some string in W occurs
at the same position
 Xs – total # of occurrences (in X) of the members of W
 Handling Palindromes
 Wi – member of W
 |W| = T
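A small sketch of the conversion described on this slide, assuming R, Y, S and W are instantiated while the spacer N is left as a wildcard (in line with c counting only instantiations of R, Y, S, W). The function names are illustrative. Strings that equal their own reverse complement simply appear with multiplicity 2 here; the palindrome handling on the slide is not reproduced.

```python
from collections import Counter
from itertools import product

# Instantiations of each motif character; N is kept as a wildcard (assumption).
EXPANSIONS = {"A": "A", "C": "C", "G": "G", "T": "T",
              "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "N": "N"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")  # N is left unchanged

def reverse_complement(s):
    return s.translate(COMPLEMENT)[::-1]

def motif_to_multiset(motif):
    """Expand a motif into the multiset W, including reverse complements."""
    forward = ["".join(p) for p in product(*(EXPANSIONS[ch] for ch in motif))]
    W = Counter(forward)
    W.update(reverse_complement(w) for w in forward)
    return W

W = motif_to_multiset("RCGY")
T = sum(W.values())  # |W| = T, counting multiplicities
print(W, T)
```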
Number of Occurrences Cont’d
Expectation
 Linearity of Expectation
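A sketch of the linearity step, assuming X_{W_i} denotes the number of occurrences of the string W_i in X (notation introduced here for illustration) and ignoring any adjustment for palindromes:

```latex
X_s = \sum_{i=1}^{T} X_{W_i}
\quad\Longrightarrow\quad
E(X_s) = \sum_{i=1}^{T} E(X_{W_i})
```

Linearity of expectation holds even when the counts X_{W_i} are dependent on one another, so no independence assumption is needed for this step.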
Variance
 B term
 C term
C Term
 A term
A Term
Overlapping Concatenation
 CW (like W) is potentially a multiset
 One-to-one correspondence
C Term Simplification
A Term Revisited
Si1Si2 Term & Approximation
 Kleffe and Borodovsky (1992) Approximation
B Term
B Term Cont’d
Summary
Higher Order Markov Models
 Variance calculations remain the same except for Si1Si2
term
 Experiments use m = 3
Experimental Results & Future
Considerations
 17 coregulated sets of genes
 Known TF with known binding site consensus
 In 9 experiments, the known consensus was among the 3
highest-scoring motifs
 Future Topics:
 Non-centered spacers
 Enumeration Loop optimization
 Filtering repeats
Question
 E(Xs) is more straightforward to calculate
than σ(Xs). Under the assumptions
given in the paper, name one reason why
σ(Xs) is harder to compute.