Characterization of transcription factor binding sites
by high-throughput SELEX
Overview of the HTPSELEX Database
Transcription Factor Binding Sites: Features and Facts
• Degenerate sequence motifs
• Typical length: 6-20 bp
• Low information content: 8-12 bits (1 site per 250-4000 bp)
• Quantitative recognition mechanism: measurable affinity of different sites may vary over three orders of magnitude
• Regulatory function often depends on cooperative interactions with neighboring sites
Representation of the binding specificity by a scoring
matrix (also referred to as weight matrix)
Position    1    2    3    4    5    6    7    8    9
A         -10  -10  -14  -12  -10    5   -2  -10   -6
C           5  -10  -13  -13   -7  -15  -13    3   -4
G          -3  -14  -13  -11    5  -12  -13    2   -7
T          -5    5    5    5  -10   -9    5  -11    5
Strong binding site:
   C    T    T    T    G    A    T    C    T
   5  + 5  + 5  + 5  + 5  + 5  + 5  + 3  + 5  =  43

Random sequence:
   A    C    G    T    A    C    G    T    A
 -10  -10  -13  + 5  -10  -15  -13  -11  - 6  = -83
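The scoring rule illustrated by the two examples (add up the weight of the observed base at each matrix position) can be sketched in a few lines of Python. The matrix values are those from the slide; the names `WEIGHTS`, `MATRIX_LENGTH` and `score` are illustrative choices, not part of the original material.

```python
# Weight matrix from the slide: for each base, the weights at positions 1-9.
WEIGHTS = {
    "A": [-10, -10, -14, -12, -10, 5, -2, -10, -6],
    "C": [5, -10, -13, -13, -7, -15, -13, 3, -4],
    "G": [-3, -14, -13, -11, 5, -12, -13, 2, -7],
    "T": [-5, 5, 5, 5, -10, -9, 5, -11, 5],
}
MATRIX_LENGTH = 9


def score(site: str) -> int:
    """Sum the weight of the observed base at each position of a 9-bp site."""
    site = site.upper()
    if len(site) != MATRIX_LENGTH:
        raise ValueError(f"expected a {MATRIX_LENGTH}-bp site, got {len(site)} bp")
    return sum(WEIGHTS[base][i] for i, base in enumerate(site))


print(score("CTTTGATCT"))  # strong binding site from the slide: 43
print(score("ACGTACGTA"))  # random sequence from the slide: -83
```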
Physical interpretation of a weight matrix
Weight matrix elements represent relative binding energies between DNA
base-pairs and protein surface areas (base-pair acceptor sites).
A weight matrix column describes the base preferences of a base-pair
acceptor site.
Berg-von Hippel model of protein-DNA interactions
The weight matrix score expresses the binding free energy of the protein-DNA complex in arbitrary units:

$$ -\Delta G(x) = S(x) + \text{const.}, \qquad S(x) = \sum_{i=1}^{N} w_i(x_i) $$

It is convenient to express the binding free energy in dimension-free −RT units:

$$ E(x) = \sum_{i=1}^{N} \varepsilon_i(x_i), \qquad \varepsilon_i(b) = -\frac{w_i(b)}{RT} $$

On a relative scale, the binding constant for sequence x is given by:

$$ K_{rel}(x) = e^{E(x)} $$

For sequences longer than the weight matrix:

$$ K_{rel}(x) = \frac{1}{\max_i \, e^{-E(x_i \ldots x_{i+N-1})}} \quad \text{or} \quad K_{rel}(x) = \frac{1}{\sum_i e^{-E(x_i \ldots x_{i+N-1})}} $$

(index i runs over all subsequence starting positions on both strands)
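The two formulas for sequences longer than the matrix can be sketched as follows, scanning every window on both strands. This is only an illustration: it reuses the 9-column example matrix from the earlier slide, and the scale `rt=10.0` is an arbitrary assumption, since the weights are given in arbitrary units. All function names are hypothetical.

```python
import math

# Example weight matrix from the earlier slide (weights w_i(b), positions 1-9).
WEIGHTS = {
    "A": [-10, -10, -14, -12, -10, 5, -2, -10, -6],
    "C": [5, -10, -13, -13, -7, -15, -13, 3, -4],
    "G": [-3, -14, -13, -11, 5, -12, -13, 2, -7],
    "T": [-5, 5, 5, 5, -10, -9, 5, -11, 5],
}
N = 9
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def energy(site, rt=10.0):
    """E(x) = sum_i eps_i(x_i) with eps_i(b) = -w_i(b) / RT (rt is an assumed scale)."""
    return sum(-WEIGHTS[base][i] / rt for i, base in enumerate(site))


def windows(seq):
    """Yield every N-bp window on the forward and the reverse-complement strand."""
    seq = seq.upper()
    reverse = seq.translate(COMPLEMENT)[::-1]
    for strand in (seq, reverse):
        for start in range(len(strand) - N + 1):
            yield strand[start:start + N]


def k_rel_max(seq, rt=10.0):
    """K_rel = 1 / max_i exp(-E(window_i)): only the best window counts."""
    return 1.0 / max(math.exp(-energy(w, rt)) for w in windows(seq))


def k_rel_sum(seq, rt=10.0):
    """K_rel = 1 / sum_i exp(-E(window_i)): all windows contribute."""
    return 1.0 / sum(math.exp(-energy(w, rt)) for w in windows(seq))


fragment = "GGCTTTGATCTGG"  # hypothetical fragment containing the strong site
print(k_rel_max(fragment), k_rel_sum(fragment))
```

With this sign convention a smaller K_rel corresponds to a higher-scoring (more favourable) site, and the summed form is never larger than the maximum form because every window adds a positive term to the denominator.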
Berg-von Hippel Theory – Information Content
The energy terms of a weight matrix can be computed from the base frequencies p_i(b) found in in vitro or in vivo selected binding sites:

$$ \varepsilon_i(b) = -\frac{1}{\lambda} \ln \frac{p_i(b)}{q(b)} $$

q(b) is the background frequency of base b.
λ is an unknown parameter related to the stringency of the binding conditions.
The information content of a binding site has been defined as the relative entropy of the base frequency matrix with respect to the background base frequencies:

$$ IC = \sum_{i=1}^{N} \sum_{b=A}^{T} p_i(b) \log_2 \frac{p_i(b)}{q(b)} $$
Paradox: λ depends on selection conditions (e.g. the protein
concentration) - therefore the base frequencies observed in selected
binding sites do not reflect a protein-intrinsic property.
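The paradox can be made concrete with a small numerical sketch. Inverting the energy formula gives p_i(b) proportional to q(b)·exp(−λ·ε_i(b)), so one and the same set of energies produces different base frequencies for different values of λ. The energies below are hypothetical, a uniform background is assumed, and the function names are illustrative.

```python
import math

UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed background q(b)


def selected_frequencies(eps, lam, q=UNIFORM):
    """Base frequencies of one column: p(b) proportional to q(b) * exp(-lam * eps(b))."""
    unnormalized = {b: q[b] * math.exp(-lam * eps[b]) for b in q}
    total = sum(unnormalized.values())
    return {b: value / total for b, value in unnormalized.items()}


def energies_from_frequencies(p, lam, q=UNIFORM):
    """eps(b) = -(1/lam) * ln(p(b) / q(b)), defined up to an additive constant."""
    return {b: -math.log(p[b] / q[b]) / lam for b in q}


# Hypothetical energy terms for a single matrix column.
eps = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}

for lam in (0.5, 1.0, 2.0):
    p = selected_frequencies(eps, lam)
    print(lam, {b: round(value, 3) for b, value in p.items()})
```

The printed frequencies become more skewed towards the preferred base as λ grows, although the underlying energies are unchanged. Energies recovered with `energies_from_frequencies` agree with the input only if the same λ is used, and then only up to a constant shift.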
Weight matrices/profiles from a biochemical viewpoint
A weight matrix expresses the sequence specificity of a DNA-binding
protein.
A column describes the base preferences of a surface area of the DNA-binding protein.
Weights of a weight matrix can be interpreted as additive binding
energy contributions. No interactions between binding site positions!
According to the Berg-von Hippel theory, negated binding energies are
proportional to the logarithms of the base frequencies observed in an in
vivo or in vitro selected set of binding sites.
Weight matrices can thus be used to compute relative binding energies
or dissociation constants for oligonucleotides of any sequence, which in
turn can be experimentally determined by gel shift experiments.
An accurate weight matrix for the binding specificity of a transcription
factor is one that accurately predicts binding constants.
Experimental techniques for estimating the parameters
of a TF specificity matrix
Competitive bandshifts (EMSA) → rel. binding constants of oligonucleotides
Alignment of in vivo sites → base frequency matrix (from 10-100 sequences)
in vitro selection (SELEX) → base frequency matrix (up to 200 sequences)
SAGE/SELEX → base frequency matrix (up to 10’000 binding sequences)
Exhaustive mutagenesis + Krel assay → intrinsic specificity matrix
Protein binding arrays + magic algorithm → intrinsic specificity matrix
Some problems and limitations:
– A base probability matrix is generated by an alignment or probabilistic
modeling algorithm → no direct observation
– Krel usually not very precise (within factor of 2)
– Point mutations may create binding site in other frame
Modeling of a Transcription Factor Binding Site from High
Throughput SELEX Data Using a Hidden Markov Modeling
Approach
Emmanuelle Roulet, Nicolas Mermod (Center for Biotechnology UNIL-EPFL, Lausanne, Switzerland)
Anamaria A Camargo, Andrew JG Simpson (Ludwig Institute of Cancer
Research, Sao Paulo, Brazil)
Philipp Bucher (Swiss Institute for Experimental Cancer Research and
Swiss Institute of Bioinformatics, Epalinges s/Lausanne, Switzerland)
Nat. Biotechnol. 20, 831-835 (2002)
Motivation and Goals of the Project
Motivation: Accurate and reliable computational tools to predict transcription factor binding
sites are still not available.
Potential reasons:
1. Lack of adequate experimental data
2. Lack of adequate computational models
3. Lack of an adequate method to estimate the parameters of a computational model from
the experimental data
Goal: To develop a combined computational-experimental protocol to derive an accurate
predictive model of the sequence specificity of a DNA-binding protein
Potential benefits:
1. Being able to predict transcription factor binding in genome sequences.
2. Insights into molecular mechanisms of sequence-specific protein-DNA interactions
3. Ability to rationally design gene control regions of desired properties for biotechnological
applications
Our Approach to the Problem of Characterizing the
Sequence-Specificity of a DNA Binding Transcription
Factor
1. Choice of a quantitative predictive model for representing the binding
specificity. Our choice: a profile-HMM
2. Choice of an experimental method to generate data for estimating the
model parameters. Our choice: a SELEX experiment
3. Choice of a machine learning algorithm to estimate the model parameters
from the data. Our choice: the Baum-Welch HMM training algorithm
4. Validation of the approach and optimization of the experimental parameters
by a computer simulation of steps 2 and 3
5. Adjustment of experimental protocol to produce the necessary data as
suggested by the computer simulation
6. Generation of the experimental data
7. Building a binding site model from the data
8. A posteriori validation of the model by cross-validation and comparison with
independent experimental results
Study Object: Transcription Factor CTF/NFI
Dimeric DNA-binding protein recognizing a palindromic
sequence motif with consensus sequence TTGGC(N5)GCCAA
First isolated as a replication factor of Adenovirus type 2
Later independently isolated as a CCAAT-box binding
transcription factor
Can activate transcription of a reporter gene in transfected
cells
Recently shown to be implicated in regulatory pathways
related to tumor progression and immune response
Biochemical mechanism of gene regulation still elusive
Old CTF/NFI Binding Site Profile
Example: TGGGCATATAGCCAC
Score: 10-1+10+10+10 +0 +10+10+10+10+9 = 88
Random sequence library
5’ –TCCATCTCTTCTGTATGTCGAGATCTA.N(25).TAGATCTCCTAACCGACTCCGTTAATT-3’
Second strand synthesis by PCR
Bgl II
Primer 1
Bgl II
5’–TCCATCTCTTCTGTATGTCGAGATCTA.N(25).TAGATCTCCTAACCGACTCCGTTAATT-3’
3’–AGGTAGAGAAGACATACAGATCTAGAT.N(25).ATCTAGAGGATTGGCTGAGGCAATTAA-5’
Primer 2
Selection of binding sequences (gel shift)
Selection cycles
Amplification
Digestion Bgl II
5’ –GATCTA..N(25)..TA
AT..N(25)..TACTAG-3’
Concatemerization and cloning
5’-GATCTA…N(25)…TAGATCTA…N(25)…TAGATCTA…N(25)…TA
AT…N(25)…ATCTAGAT…N(25)…ATCTAGAT…N(25)…ATCTAG-3’
site 1
site 2
site 3
HTS sequencing
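After sequencing, each clone read has to be cut back into individual selected sites. Assuming the concatemer layout drawn above (repeating units of the form GATCTA, 25 selected bases, TA), a minimal sketch of that step could look like the following. The regular expression and the function name are illustrative assumptions, not the pipeline actually used for the database, and the sketch ignores sequencing errors, inserts of other lengths and units in reverse orientation.

```python
import re

# One concatemer unit as drawn on the slide: GATCTA + N(25) + TA.
UNIT = re.compile(r"GATCTA([ACGT]{25})TA")


def extract_sites(clone_sequence):
    """Return the 25-bp random inserts found in a clone read, in order."""
    return [match.group(1) for match in UNIT.finditer(clone_sequence.upper())]


# Hypothetical inserts, concatenated the way the slide shows.
inserts = ["A" * 25, "C" * 25, "ACGTT" * 5]
concatemer = "".join("GATCTA" + insert + "TA" for insert in inserts)
print(extract_sites("GGGG" + concatemer + "CCCC"))
```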
Principle of the Baum-Welch hidden Markov model training
algorithm
Initial model:
Training sequences:
AACAGCGTGCCAACTAGTGATCACA
CCACAACFFACGCCCAAATAACCAA
GTTAGTGGACCGCTTCCAGCAATCT
ATCACGGCACCCCATTTTTCTGTCT
TGGTAAATTAATAATAAAACAGTGG
GCGCGTGATTTGGCATCGTCCCATA
AAGTTGGCTTTTCACCAATAGCGAG
...
How does it work?
1. The initial model serves as the current model.
2. Training sequences are aligned to the current model.
3. New base and transition frequencies are estimated from the multiple alignment generated by step 2. The new model becomes the current model.
4. Steps 2 and 3 are repeated until convergence is reached.
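The loop described in steps 1 to 4 can be illustrated with a deliberately simplified sketch. It is not the Baum-Welch algorithm itself: instead of the expected counts that Baum-Welch obtains from the forward-backward procedure, each sequence is assigned to its single best window (a hard-assignment, Viterbi-style variant), the model is a gapless base frequency matrix without transition probabilities, and only the forward strand is scanned. The training sequences, the seed model and all names are hypothetical.

```python
import math

BASES = "ACGT"


def best_window(seq, model):
    """Step 2: start of the window that best matches the current model."""
    width = len(model)

    def window_score(start):
        # Log-odds against an assumed uniform background of 0.25 per base.
        return sum(math.log(model[i][seq[start + i]] / 0.25) for i in range(width))

    return max(range(len(seq) - width + 1), key=window_score)


def reestimate(seqs, starts, width, pseudocount=1.0):
    """Step 3: new base frequencies from the aligned windows."""
    model = []
    for i in range(width):
        counts = {b: pseudocount for b in BASES}
        for seq, start in zip(seqs, starts):
            counts[seq[start + i]] += 1
        total = sum(counts.values())
        model.append({b: c / total for b, c in counts.items()})
    return model


def train(seqs, initial_model, max_iterations=50):
    """Steps 1-4: iterate alignment and re-estimation until the alignment is stable."""
    model, previous = initial_model, None
    for _ in range(max_iterations):
        starts = [best_window(seq, model) for seq in seqs]
        if starts == previous:  # convergence: nothing moved
            break
        model = reestimate(seqs, starts, len(model))
        previous = starts
    return model


def consensus(model):
    return "".join(max(column, key=column.get) for column in model)


# Hypothetical seed model: a weak preference for the half-site TTGGCA.
seed = [{b: (0.7 if b == c else 0.1) for b in BASES} for c in "TTGGCA"]
toy_sequences = ["ACGATTGGCATCCGA", "TTGGCAGGATCCATG", "GGCATCCTTGGCAAC", "CATGACGTTGGCAGT"]
trained = train(toy_sequences, seed)
print(consensus(trained))  # TTGGCA
```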
Trained model:
Doing the Experiment
Results – CTF/NFI
Clone statistics
Cycle   Seq. reads   Clones   Colonies   Clones with detectable inserts
0       468          425      427        295
1       623          364      553        111
2       545          392      447        208
3       2234         1445     1619       1187
4       378          215      318        102
Site statistics
Cycle   Sites    Different sites   Diff. sites, err < 0.01/bp   Diff. sites, err < 0.001/bp
0       2262     2262              1482                         825
1       1678     1678              1227                         954
2       1572     1572              731                          203
3       8813     8813              7385                         5585
4       1156     1156              552                          309
SUM     15481    15481             11377                        7876
New CTF/NFI model
Hidden Markov Model (frequencies given in %):
Scoring profile (relative energy units):
Predicted and observed evolution of SELEX populations
[Figure with two panels; affinity axis labelled from low to high]
Panel: Theoretically predicted affinity profiles of successive SELEX cycles (Djordjevic & Sengupta 2006)
Panel: Weight matrix scores for successive CTF/NFI HTP SELEX populations (Roulet et al. 2002)
Major Differences between New and Old CTF/NFI Binding
Site Models
The new model contains a sixth half-site position reducing the major
spacer length class to 3. This extends the consensus half-site motif
to TTGGCA.
Alternative spacer length classes N4 and N5 (N6 and N7 according to
the old numbering system) receive much more severe penalties in
the new profile. Based on the estimated frequencies, it is not certain
whether these binding modes have occurred at all during SELEX
amplification.
The G mismatch at the first position of the half-site weight matrix has
a much lower weight in the new model.
Quality Assessment of the New Model: Comparison of
Predicted Binding Scores with in vitro measured Binding
Constants
Data from Meisterernst et al. (1988). Nucl. Acids Res. 16, 4419-4435
Beyond simple weight matrices: correlated dinucleotide analysis
HTP SELEX sequencing totals for members of the TCF family

SELEX library   Total number of sites   Total number of unique sites   Error rate <0.01% per bp   Error rate <0.001% per bp

LEF1/TCF-1α
LEF1_2          2125                    2125                           2067                       1893
LEF1_3          7064                    7046                           6169                       6263
LEF1_5          1503                    1471                           1366                       1128
LEF1_6          3072                    2500                           2327                       2144
LEF1_7          397                     379                            359                        328
SUM             14161                   13521                          12288                      11756

LEF1/TCF-1α with β-catenin
LBC_5           1967                    1963                           1833                       1700
LBC_6           6116                    5311                           5129                       4800
SUM             8083                    7274                           6962                       6500

TCF4
TCF4_3          11951                   11951                          11937                      11683
PSSM of LEF1/TCF-1α SELEX cycle 3
Position   1      2      3      4      5      6      7      8      9      10
Sequence   C      C      T      T      T      G      A      T      C      A
A          0.093  0.013  0.018  0.002  0.004  0.014  0.968  0.154  0.011  0.042
C          0.411  0.851  0.019  0.005  0.003  0.034  0.001  0.004  0.562  0.080
G          0.292  0.093  0.003  0.001  0.005  0.936  0.010  0.001  0.422  0.047
T          0.203  0.044  0.961  0.991  0.988  0.016  0.020  0.840  0.004  0.831
PSSM of LEF1/TCF-1α SELEX cycle 6
Position   1      2      3      4      5      6      7      8      9      10
Sequence   C      C      T      T      T      G      A      T      C      A
A          0.033  0.001  0.002  0.001  0.001  0.001  0.994  0.017  0.002  0.003
C          0.682  0.989  0.004  0.005  0.004  0.003  0.001  0.003  0.777  0.020
G          0.182  0.005  0.001  0.001  0.001  0.995  0.004  0.001  0.220  0.003
T          0.103  0.005  0.993  0.993  0.994  0.002  0.001  0.979  0.001  0.973
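As a rough illustration of how selection sharpens the base frequencies, the information content formula from the Berg-von Hippel slide can be applied to the two tables above. The sketch below assumes a uniform background of 0.25 per base, which is an assumption and not a value stated in the slides; the variable and function names are illustrative.

```python
import math

# Base frequencies copied from the cycle 3 and cycle 6 tables (positions 1-10).
CYCLE3 = {
    "A": [0.093, 0.013, 0.018, 0.002, 0.004, 0.014, 0.968, 0.154, 0.011, 0.042],
    "C": [0.411, 0.851, 0.019, 0.005, 0.003, 0.034, 0.001, 0.004, 0.562, 0.080],
    "G": [0.292, 0.093, 0.003, 0.001, 0.005, 0.936, 0.010, 0.001, 0.422, 0.047],
    "T": [0.203, 0.044, 0.961, 0.991, 0.988, 0.016, 0.020, 0.840, 0.004, 0.831],
}
CYCLE6 = {
    "A": [0.033, 0.001, 0.002, 0.001, 0.001, 0.001, 0.994, 0.017, 0.002, 0.003],
    "C": [0.682, 0.989, 0.004, 0.005, 0.004, 0.003, 0.001, 0.003, 0.777, 0.020],
    "G": [0.182, 0.005, 0.001, 0.001, 0.001, 0.995, 0.004, 0.001, 0.220, 0.003],
    "T": [0.103, 0.005, 0.993, 0.993, 0.994, 0.002, 0.001, 0.979, 0.001, 0.973],
}


def information_content(freqs, background=0.25):
    """IC = sum over positions and bases of p * log2(p / q), in bits."""
    total = 0.0
    for i in range(len(freqs["A"])):
        for base in "ACGT":
            p = freqs[base][i]
            if p > 0:  # a zero frequency contributes nothing to the sum
                total += p * math.log2(p / background)
    return total


print(round(information_content(CYCLE3), 2), "bits after cycle 3")
print(round(information_content(CYCLE6), 2), "bits after cycle 6")
```

Every column is more strongly skewed after cycle 6 than after cycle 3, so the second value is the larger one. This is in line with the earlier remark that observed base frequencies depend on the selection conditions.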
Base frequency tables for DNA binding sites of TCF family
members derived by HTP SELEX
Sequence Logos for binding sites of TCF family proteins
Lef-1
Lef-1/beta-catenin
Tcf-4
Comparison of our TCF4 binding site with motif obtained
by affinity measurements
Sequence Logo pasted from Hallikas et al. (2006). Cell 124:21.
Motif obtained by competition assays with complete single base-substitution
series. Note: at least one significant position is missing because of a priori
restriction of motif extension.
Overview of HTPSELEX Database
Contents – from raw data to HMMs:
• Single-read sequencing chromatograms
• Clone sequences (assembled by Phred/Phrap)
• Site sequences with estimated sequencing errors
• HMMs for binding sites in two formats (decodeanhmm, MAMOT)
Additional features:
• Quality-controlled sequence download
• Access to selected low-throughput SELEX data
• Experimental and computational protocols
Example of a HTPSELEX clone entry

ID   LBC_5_00003 standard; DNA; UNC; 1023 BP.
XX
AC   LBC_5_00003
XX
DT   5-Jun-2005
XX
DE   5' Sequence of SELEX/SAGE Clone : LBC_5_00003 of cycle 5
XX
KW   HTP SELEX/SAGE, invitro transcription factor binding sites
XX
OS   unidentified
OC   unidentified
XX
RN   [1]
RA   Emmanuelle Roulet, Stephane Busso, Anamaria A.Camargo, Andrew J.G Simpson,
RA   Nicolas Mermod, and Philipp Bucher.
RT   High-throughput SELEX-SAGE method for quantitative modelling of
RT   transcription-factor binding sites.
RL   Nature Biotechnology 20:831-835(2000)
XX
DR   TRACES;LBC_5_003TF.scf
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1023
FT                   /mol_type="unassigned DNA"
FT                   /organism="unidentified"
FT                   /tissue_type="SELEX"
FT   misc_binding    110..142
FT                   /bound_moiety ="LEF1/TCF with beta catenin "
FT                   /label="LBC_5_00003_1"
FT                   /note="Base quality score is 2.8361e-03"
FT   misc_binding    143..175
FT                   /bound_moiety ="LEF1/TCF with beta catenin "
FT                   /label="LBC_5_00003_2"
FT                   /note="Base quality score is 1.2369e-03"
XX
SQ   Sequence 1023 BP; 230 A; 291 C; 260 G; 242 T; 0 other;
AAAACCTAAT ATAAGGGGCA GATTAGGGCC CTCTCGATGC TGCTCGAGCG GCCGCCAGTG
TGATGGATAT CTGCAGAATT CCAGCACACT GGCGGCCGTT ACTAGTGGAT CTATTGGCGG