Characterization of transcription factor binding sites by high-throughput SELEX
Overview of the HTPSELEX Database

Transcription Factor Binding Sites: Features and Facts
• Degenerate sequence motifs
• Typical length: 6-20 bp
• Low information content: 8-12 bits (1 site per 250-4000 bp)
• Quantitative recognition mechanism: measurable affinity of different sites may vary over three orders of magnitude
• Regulatory function often depends on cooperative interactions with neighboring sites

Representation of the binding specificity by a scoring matrix (also referred to as weight matrix)

       1    2    3    4    5    6    7    8    9
  A  -10  -10  -14  -12  -10    5   -2  -10   -6
  C    5  -10  -13  -13   -7  -15  -13    3   -4
  G   -3  -14  -13  -11    5  -12  -13    2   -7
  T   -5    5    5    5  -10   -9    5  -11    5

Strong binding site:  C  T  T  T  G  A  T  C  T   →   5 + 5 + 5 + 5 + 5 + 5 + 5 + 3 + 5 = 43
Random sequence:      A  C  G  T  A  C  G  T  A   →   -10 - 10 - 13 + 5 - 10 - 15 - 13 - 11 - 6 = -83

Physical interpretation of a weight matrix
Weight matrix elements represent relative binding energies between DNA base-pairs and protein surface areas (base-pair acceptor sites). A weight matrix column describes the base preferences of a base-pair acceptor site.

Berg-von Hippel model of protein-DNA interactions
The weight matrix score expresses the binding free energy of the protein-DNA complex in arbitrary units:
\[ -\Delta G(x) = S(x) + \mathrm{const.}, \qquad S(x) = \sum_{i=1}^{N} w_i(x_i) \]
It is convenient to express the binding free energy in dimension-free −RT units:
\[ E(x) = \sum_{i=1}^{N} \varepsilon_i(x_i), \qquad \varepsilon_i(b) = \frac{-w_i(b)}{RT} \]
On a relative scale, the binding constant for sequence x is given by:
\[ K_{rel}(x) = e^{E(x)} \]
For sequences longer than the weight matrix:
\[ K_{rel}(x) = \frac{1}{\max_i \, e^{-E(x_i \ldots x_{i+N-1})}} \qquad \text{or} \qquad K_{rel}(x) = \frac{1}{\sum_i e^{-E(x_i \ldots x_{i+N-1})}} \]
(index i runs over all subsequence starting positions on both strands)

Berg-von Hippel Theory – Information Content
The energy terms of a weight matrix can be computed from the base frequencies $p_i(b)$ found in in vitro or in vivo selected binding sites:
\[ \varepsilon_i(b) = -\frac{1}{\lambda} \ln \frac{p_i(b)}{q(b)} \]
$q(b)$ is the background frequency of base b. $\lambda$ is an unknown parameter related to the stringency of the binding conditions.
The information content of a binding site has been defined as the relative entropy of the base frequency matrix with respect to the background base frequencies:
\[ IC = \sum_{i=1}^{N} \sum_{b=A}^{T} p_i(b) \log_2 \frac{p_i(b)}{q(b)} \]
Paradox: $\lambda$ depends on the selection conditions (e.g. the protein concentration); therefore the base frequencies observed in selected binding sites do not reflect a protein-intrinsic property.

Weight matrices/profiles from a biochemical viewpoint
• A weight matrix expresses the sequence specificity of a DNA-binding protein.
• A column describes the base preferences of a surface area of the DNA-binding protein.
• Weights of a weight matrix can be interpreted as additive binding energy contributions. No interactions between binding site positions!
• According to the Berg-von Hippel theory, negated binding energies are proportional to the logarithms of the base frequencies observed in an in vivo or in vitro selected set of binding sites.
• Weight matrices can thus be used to compute relative binding energies or dissociation constants for oligonucleotides of any sequence, which in turn can be experimentally determined by gel shift experiments.
• An accurate weight matrix for the binding specificity of a transcription factor is one that accurately predicts binding constants.

Experimental techniques for estimating the parameters of a TF specificity matrix
• Competitive bandshifts (EMSA) → rel.
binding constants of oligonucleotides
• Alignment of in vivo sites → base frequency matrix (from 10-100 sequences)
• In vitro selection (SELEX) → base frequency matrix (up to 200 sequences)
• SAGE/SELEX → base frequency matrix (up to 10’000 binding sequences)
• Exhaustive mutagenesis + Krel assay → intrinsic specificity matrix
• Protein binding arrays + magic algorithm → intrinsic specificity matrix
Some problems and limitations:
– A base probability matrix is generated by an alignment or probabilistic modeling algorithm → no direct observation
– Krel is usually not very precise (within a factor of 2)
– Point mutations may create a binding site in another frame

Modeling of a Transcription Factor Binding Site from High Throughput SELEX Data Using a Hidden Markov Modeling Approach
Emmanuelle Roulet, Nicolas Mermod (Center for Biotechnology UNIL-EPFL, Lausanne, Switzerland)
Anamaria A Camargo, Andrew JG Simpson (Ludwig Institute of Cancer Research, Sao Paulo, Brazil)
Philipp Bucher (Swiss Institute for Experimental Cancer Research and Swiss Institute of Bioinformatics, Epalinges s/Lausanne, Switzerland)
Nat. Biotechnol. 20, 831-835 (2002)

Motivation and Goals of the Project
Motivation: Accurate and reliable computational tools to predict transcription factor binding sites are still not available.
Potential reasons:
1. Lack of adequate experimental data
2. Lack of adequate computational models
3. Lack of an adequate method to estimate the parameters of a computational model from the experimental data
Goal: To develop a combined computational-experimental protocol to derive an accurate predictive model of the sequence specificity of a DNA-binding protein
Potential benefits:
1. Being able to predict transcription factor binding in genome sequences
2. Insights into molecular mechanisms of sequence-specific protein-DNA interactions
3. Ability to rationally design gene control regions of desired properties for biotechnological applications

Our Approach to the Problem of Characterizing the Sequence-Specificity of a DNA Binding Transcription Factor
1. Choice of a quantitative predictive model for representing the binding specificity. Our choice: a profile-HMM
2. Choice of an experimental method to generate data for estimating the model parameters. Our choice: a SELEX experiment
3. Choice of a machine learning algorithm to estimate the model parameters from the data. Our choice: the Baum-Welch HMM training algorithm
4. Validation of the approach and optimization of the experimental parameters by a computer simulation of steps 2 and 3
5. Adjustment of the experimental protocol to produce the necessary data as suggested by the computer simulation
6. Generation of the experimental data
7. Building a binding site model from the data
8. A posteriori validation of the model by cross-validation and comparison with independent experimental results

Study Object: Transcription Factor CTF/NFI
• Dimeric DNA-binding protein recognizing a palindromic sequence motif with consensus sequence TTGGC(N5)GCCAA
• First isolated as a replication factor of Adenovirus type 2
• Later independently isolated as a CCAAT-box binding transcription factor
• Can activate transcription of a reporter gene in transfected cells
• Recently shown to be implicated in regulatory pathways related to tumor progression and immune response
• Biochemical mechanism of gene regulation still elusive

Old CTF/NFI Binding Site Profile
Example: TGGGCATATAGCCAC
Score: 10 - 1 + 10 + 10 + 10 + 0 + 10 + 10 + 10 + 10 + 9 = 88

Experimental scheme
Random sequence library:
5’-TCCATCTCTTCTGTATGTCGAGATCTA.N(25).TAGATCTCCTAACCGACTCCGTTAATT-3’
Second strand synthesis by PCR (Primer 1 and Primer 2; the random region is flanked by Bgl II sites):
5’-TCCATCTCTTCTGTATGTCGAGATCTA.N(25).TAGATCTCCTAACCGACTCCGTTAATT-3’
3’-AGGTAGAGAAGACATACAGATCTAGAT.N(25).ATCTAGAGGATTGGCTGAGGCAATTAA-5’
Selection cycles: selection of binding sequences (gel shift), then amplification
Digestion with Bgl II:
5’-GATCTA..N(25)..TA
        AT..N(25)..TACTAG-3’
Concatemerization and cloning (site 1, site 2, site 3):
5’-GATCTA…N(25)…TAGATCTA…N(25)…TAGATCTA…N(25)…TA
        AT…N(25)…ATCTAGAT…N(25)…ATCTAGAT…N(25)…ATCTAG-3’
HTS sequencing

Principle of the Baum-Welch hidden Markov model training algorithm
Initial model: [figure]
Training sequences:
AACAGCGTGCCAACTAGTGATCACA
CCACAACFFACGCCCAAATAACCAA
GTTAGTGGACCGCTTCCAGCAATCT
ATCACGGCACCCCATTTTTCTGTCT
TGGTAAATTAATAATAAAACAGTGG
GCGCGTGATTTGGCATCGTCCCATA
AAGTTGGCTTTTCACCAATAGCGAG
...
How does it work?
1. The initial model serves as the current model.
2. Training sequences are aligned to the current model.
3. New base and transition frequencies are estimated from the multiple alignment generated by step 2. The new model becomes the current model.
4. Steps 2 and 3 are repeated until convergence is reached.
Trained model: [figure]

Doing the Experiment

Results – CTF/NFI
Clone statistics

  Cycle   Seq. reads   Clones   Colonies   Clones with detectable inserts
  0          468         425       427        295
  1          623         364       553        111
  2          545         392       447        208
  3         2234        1445      1619       1187
  4          378         215       318        102

Site statistics

  Cycle   Sites   Different sites   Diff. sites, err < 0.01/bp   Diff. sites, err < 0.001/bp
  0        2262        2262                 1482                          825
  1        1678        1678                 1227                          954
  2        1572        1572                  731                          203
  3        8813        8813                 7385                         5585
  4        1156        1156                  552                          309
  SUM     15481       15481                11377                         7876

New CTF/NFI model
Hidden Markov Model (frequencies given in %): [figure]
Scoring profile (relative energy units): [figure]

Predicted and observed evolution of SELEX populations
[Figure: theoretically predicted affinity profiles of successive SELEX cycles (Djordjevic & Sengupta 2006); affinity axis from low to high]
[Figure: weight matrix scores for successive CTF/NFI HTP SELEX populations (Roulet et al. 2002); score axis from low to high]

Major Differences between New and Old CTF/NFI Binding Site Models
• The new model contains a sixth half-site position, reducing the major spacer length class to 3. This extends the consensus half-site motif to TTGGCA.
• Alternative spacer length classes N4 and N5 (N6 and N7 according to the old numbering system) receive much more severe penalties in the new profile. Based on the estimated frequencies, it is not certain whether these binding modes occurred at all during SELEX amplification.
• The G mismatch at the first position of the half-site weight matrix has a much lower weight in the new model.

Quality Assessment of the New Model: Comparison of Predicted Binding Scores with in vitro measured Binding Constants
Data from Meisterernst et al. (1988). Nucl. Acids Res. 16, 4419-4435

Beyond simple weight matrices: correlated dinucleotide analysis

HTP SELEX Sequencing totals for members of the TCF family

  SELEX library   Total number of sites   Total number of unique sites   Error rate <0.01% per bp   Error rate <0.001% per bp
  LEF1/TCF-1α
  LEF1_2                 2125                     2125                        2067                       1893
  LEF1_3                 7064                     7046                        6169                       6263
  LEF1_5                 1503                     1471                        1366                       1128
  LEF1_6                 3072                     2500                        2327                       2144
  LEF1_7                  397                      379                         359                        328
  SUM                   14161                    13521                       12288                      11756
  LEF1/TCF-1α with β-catenin
  LBC_5                  1967                     1963                        1833                       1700
  LBC_6                  6116                     5311                        5129                       4800
  SUM                    8083                     7274                        6962                       6500
  TCF4
  TCF4_3                11951                    11951                       11937                      11683

PSSM of LEF1/TCF-1α, SELEX cycle 3

  Position     1      2      3      4      5      6      7      8      9     10
  Consensus    C      C      T      T      T      G      A      T      C      A
  A          0.093  0.013  0.018  0.002  0.004  0.014  0.968  0.154  0.011  0.042
  C          0.411  0.851  0.019  0.005  0.003  0.034  0.001  0.004  0.562  0.080
  G          0.292  0.093  0.003  0.001  0.005  0.936  0.010  0.001  0.422  0.047
  T          0.203  0.044  0.961  0.991  0.988  0.016  0.020  0.840  0.004  0.831

PSSM of LEF1/TCF-1α, SELEX cycle 6

  Position     1      2      3      4      5      6      7      8      9     10
  Consensus    C      C      T      T      T      G      A      T      C      A
  A          0.033  0.001  0.002  0.001  0.001  0.001  0.994  0.017  0.002  0.003
  C          0.682  0.989  0.004  0.005  0.004  0.003  0.001  0.003  0.777  0.020
  G          0.182  0.005  0.001  0.001  0.001  0.995  0.004  0.001  0.220  0.003
  T          0.103  0.005  0.993  0.993  0.994  0.002  0.001  0.979  0.001  0.973

Base frequency tables for DNA binding sites of TCF family members derived by HTP SELEX

Sequence Logos for binding sites of TCF family proteins
[Figure: sequence logos for Lef-1, Lef-1/beta-catenin and Tcf-4]

Comparison of our TCF4 binding site with motif obtained by affinity measurements
Sequence logo pasted from Hallikas et al. (2006). Cell 124:21. Motif obtained by competition assays with complete single base-substitution series.
Note: at least one significant position is missing because of a priori restriction of motif extension.

Overview of HTPSELEX Database
Contents – from raw data to HMMs:
• Single-read sequencing chromatograms
• Clone sequences (assembled by Phred/Phrap)
• Site sequences with estimated sequencing errors
• HMMs for binding sites in two formats (decodeanhmm, MAMOT)
Additional features:
• Quality-controlled sequence download
• Access to selected low-throughput SELEX data
• Experimental and computational protocols

Example of a HTPSELEX clone entry

  ID   LBC_5_00003 standard; DNA; UNC; 1023 BP.
  XX
  AC   LBC_5_00003
  XX
  DT   5-Jun-2005
  XX
  DE   5' Sequence of SELEX/SAGE Clone : LBC_5_00003 of cycle 5
  XX
  KW   HTP SELEX/SAGE, invitro transcription factor binding sites
  XX
  OS   unidentified
  OC   unidentified
  XX
  RN   [1]
  RA   Emmanuelle Roulet, Stephane Busso, Anamaria A.Camargo, Andrew J.G Simpson,
  RA   Nicolas Mermod, and Philipp Bucher.
  RT   High-throughput SELEX-SAGE method for quantitative modelling of
  RT   transcription-factor binding sites.
  RL   Nature Biotechnology 20:831-835(2000)
  XX
  DR   TRACES;LBC_5_003TF.scf
  XX
  FH   Key             Location/Qualifiers
  FH
  FT   source          1..1023
  FT                   /mol_type="unassigned DNA"
  FT                   /organism="unidentified"
  FT                   /tissue_type="SELEX"
  FT   misc_binding    110..142
  FT                   /bound_moiety ="LEF1/TCF with beta catenin "
  FT                   /label="LBC_5_00003_1"
  FT                   /note="Base quality score is 2.8361e-03"
  FT   misc_binding    143..175
  FT                   /bound_moiety ="LEF1/TCF with beta catenin "
  FT                   /label="LBC_5_00003_2"
  FT                   /note="Base quality score is 1.2369e-03"
  XX
  SQ   Sequence 1023 BP; 230 A; 291 C; 260 G; 242 T; 0 other;
       AAAACCTAAT ATAAGGGGCA GATTAGGGCC CTCTCGATGC TGCTCGAGCG GCCGCCAGTG
       TGATGGATAT CTGCAGAATT CCAGCACACT GGCGGCCGTT ACTAGTGGAT CTATTGGCGG
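
Worked example: scoring sequences with the 9-position scoring matrix
The scoring-matrix slide near the start of this deck adds one weight per position to obtain a site score, and the Berg-von Hippel slide notes that longer sequences are evaluated over all starting positions on both strands. The Python sketch below reproduces that arithmetic for the 9-column example matrix. The names `WEIGHTS`, `score`, `reverse_complement` and `best_hit` are illustrative choices made here and are not taken from the HTPSELEX software; taking the single best window corresponds to the "max" variant of the longer-sequence formula, while the summed variant is not shown.

```python
# Scoring matrix copied from the example slide: one row per base,
# one column per binding-site position (1-9).
WEIGHTS = {
    "A": [-10, -10, -14, -12, -10, 5, -2, -10, -6],
    "C": [5, -10, -13, -13, -7, -15, -13, 3, -4],
    "G": [-3, -14, -13, -11, 5, -12, -13, 2, -7],
    "T": [-5, 5, 5, 5, -10, -9, 5, -11, 5],
}
WIDTH = 9
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def score(site):
    """Sum the position-specific weights for a site of exactly WIDTH bases."""
    if len(site) != WIDTH:
        raise ValueError(f"expected {WIDTH} bases, got {len(site)}")
    return sum(WEIGHTS[base][i] for i, base in enumerate(site))


def reverse_complement(seq):
    """Return the reverse complement of an upper-case ACGT sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def best_hit(seq):
    """Return (score, start, strand) of the best-scoring window.

    Both strands are scanned. For the "-" strand, start is the 0-based
    offset within the reverse-complemented sequence (an assumption of
    this sketch, chosen for simplicity).
    """
    if len(seq) < WIDTH:
        raise ValueError("sequence is shorter than the matrix")
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for start in range(len(s) - WIDTH + 1):
            hits.append((score(s[start:start + WIDTH]), start, strand))
    return max(hits)


print(score("CTTTGATCT"))         # strong binding site from the slide: 43
print(score("ACGTACGTA"))         # random sequence from the slide: -83
print(best_hit("GGCTTTGATCTGG"))  # (43, 2, '+')
```

Because the weights are additive, the highest attainable score is the sum of the column maxima (43 for this matrix), which is why the strong site on the slide cannot be outscored by any other 9-mer.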