Download Molecular Cell Biology - Biomedical Informatics

Bioinformatics approaches for studying of gene regulation. By Ilya Ioshikhes, Ph.D. Department of Biomedical Informatics.

Molecular Biology of the Cell, 3rd edn. Part I. Introduction to the Cell. Chapter 3. Macromolecules: Structure, Shape, and Information. Nucleic Acids.
Figure 3-19. Information flow in protein synthesis. (A) The nucleotides in an mRNA molecule are joined together to form a complementary copy of a segment of one strand of DNA. (B) They are then matched three at a time to complementary sets of three nucleotides in the anticodon regions of tRNA molecules. At the other end of each type of tRNA molecule, a specific amino acid is held in a high-energy linkage, and when matching occurs, this amino acid is added to the end of the growing polypeptide chain. Thus translation of the mRNA nucleotide sequence into an amino acid sequence depends on complementary base-pairing between codons in the mRNA and corresponding tRNA anticodons. The molecular basis of information transfer in translation is therefore very similar to that in DNA replication and transcription. Note that the mRNA is both synthesized and translated starting from its 5' end.

Molecular Cell Biology. 9. Molecular Structure of Genes and Chromosomes. 9.5. Organizing Cellular DNA into Chromosomes.
Figure 9-30. Structure of the nucleosome. (a) Ribbon diagram of the nucleosome shown face-on (left) and from the side (right). One DNA strand is shown in green and the other in brown. H2A is yellow; H2B, red; H3, blue; H4, green. (b) Space-filling model shown from the side. DNA is shown in white; histones are colored as in (a). H2A, H2A′, H2B, H2B′, H3, and H4 indicate the positions of the respective histone N-terminal tails visible in this view. The H2A′ N-terminal tail interacts with the upper loop of DNA, while the H2A N-terminal tail (only partially seen in this view) interacts with the bottom loop of DNA.
The N-terminal tail of one H4 extends from the bottom of the nucleosome and interacts with the neighboring histone octamer in the crystal lattice (not shown). The N-terminal tails of histones H2B, H2B′, H3, and H3′ pass between the two loops of DNA. The N-terminal tails of H2A, H4, H3, and H2B include an additional 3, 15, 19, and 23 residues, respectively, that are not visualized in the crystal structure because they are not highly structured. They extend further from the surface of the nucleosome, where they may participate in nucleosome-nucleosome interactions in the 30 nm fiber (see Figure 9-31) or interact with other chromatin-associated proteins. [From K. Luger et al., 1997, Nature 389:251; courtesy of T. J. Richmond.]

Molecular Biology of the Cell, 3rd edn. Part II. Molecular Genetics. Chapter 8. The Cell Nucleus. The Global Structure of Chromosomes.
Figure 8-30. Model of chromatin packing. This schematic drawing shows some of the many orders of chromatin packing postulated to give rise to the highly condensed mitotic chromosome.

Agalioti T, Lomvardas S, Parekh B, Yie J, Maniatis T, Thanos D. “Ordered recruitment of chromatin modifying and general transcription factors to the IFN-beta promoter.” Cell. 2000 Nov 10;103(4):667-78.

Characteristic features of gene regulation mechanisms:
• Large number and variety of participating regulatory elements: thousands of transcription factors (TFs), chromatin, DNA methylation etc.
• None of these elements is either absolutely necessary or sufficient for the regulatory processes.
• There are many DNA sequence motifs (signals) related to these agents: TF binding sites, nucleosome sequence pattern, CpG islands etc.
• The majority of those signals are very weak.
• Gene expression is regulated by a large number of weak signals interacting with each other in sophisticated ways.

Possible approaches to that study:
• Exhaustive analysis of signals caused by 1-2 elements, with gradual generalization of results.
• From intuitive model to sequence analysis. • From known sequence features to their quantitative analysis. • From sequences to revealing common sequence motifs. • In depth analysis of known features. SEQ_1 Frog Xenopus borealis ACCURACY 1 bp NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGCTTGGCAGGACAAGGGCAG CTCTGCAAACTGTAAAACCGGACAAAGGCTTTCCCCTGGCTTACACGCAA AAGGGAAGGGCCTTTCCTGAGGAGGTGAGCGGCAACCTGGACTCGGGGAT GGCGCTGGAAGTGATCTGCTTGGATTTTGCTCAAGACTTGGATGCAAGGG CTATCCCGATGAGCTGACAAGGGCCTTGGGAGGGGGGCGGGGGCTGTGCA GATAACAAGCTGTCCACTTCCAGGCACTGCCCTTCCGTGGCTCCCGTAGC > SEQ_2 Frog Xenopus borealis ACCURACY 1 bp GGGCTCCGCCCXTTCGGAAGGATGCTAGGGAGCCGGAGAGAGCGCAGAGA GGCGGGGTGAAAGGGATGGGGGGAGCTGAGGCAGGAGGGCAGGCTGTCAA GGCCGGGCTTGTTTTCCTGCCTGGGGGAAAAGACCCTGGCATGGGGAGGA GCTGGGCCCCCCCCAGAAGGCAGCACAAGGGGAGGAAAAGTCAGCCTTGT GCTCGCCTACGGCCATACCACCCTGAAAGTGCCCGATATCGTCTGATCTC GGAAGCCAAGCAGGGTCGGGCCTGGTTAGTACTTGGATGGGAGACCGCCT GGGAATACCAGGTGTCGTAGGCTTTTGCACTTTTGCCATTCTGAGTAACA GCAGGGGGCAGTCTCCTCCATGCATTTTTCTTTCCCCGAACAGCTGCCTG > SEQ_3 African Green Monkey ACCURACY 1 bp ACTGCTCTGTGTTCTGTTAATTCATCTCACAGAGTTACATCTTTCCCTTC AAGAAGCCTTTCGCTAAGGCTGTTCTTGTGGAATTGGCAAAGGGATATTT GGAAGCCCATAGAGGGCTATGGTGAAAAAGGAAATATCTTCCGTTCAAAA CTGGAAAGAAGCTTTCTGAGAAACTGCTCTGTGTTCTGTTAATTCATCTC ACAGAGTTACATCTTTCCCTTCAAGAAGCCTTTCGCTAAGGCTGTTCTTG TGGAATTGGCAAAGGGATATTTGGAAGCCCATAGAGGGCTATGGTGAAAA AGGAAATATCTTCCGTTCAAAACTGGAAAGAAGCTTTCTGAGAAACTGCT CTGTGTTCTGTTAATTCATCTCACAGAGTTACATCTTTCCCTTCAAGAAG > SEQ_4 Mouse ACCURACY 1 bp AAAATGAGAAACATCCACTTGACGACTTGAAAAATGACGAAATCACTAAA AAACGTGAAAAATGAGAAATGCACACTGAAGGACCTGGAATATGGCGAGA AAACTGAAAATCACGGAAAATGAGAAATACACACTTTAGGACGTGAAATA TGGCGAGGAAAACTGAAAAAGGTGGAAAATTTAGAAATGTCCACTGTAGG ACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCC ACTTGACGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAG AAATGCACACTGAAGGACCTGGAATATGGCGAGAAAACTGAAAATCACGG AAAATGAGAAATACACACTTTAGGACGTGAAATATGGCGAGGAAAACTGA > SEQ_5 
Psammechinus miliaris (sea urchin) ACCURACY 1 bp
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNAGCTTATAATCATCCTTATACACGCG
CAGTCGATGAGATGAAAAGTTCATTAACGCTACATTTACAGTGTTTTGGG
CAATTCTCCCTCCCCCCCCCCCTCTCTCTCTCTCTCTCTCTCTCTCTCTC
TCTCCCTTCCTCTAAATATGTTGNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
> SEQ_6 Yeast Saccharomyces cerevisiae ACCURACY 1 bp
AGTACAGAGGTCAATGGCAGTAATGGCACTTGGTGCGGCTTCTGTGCCAG
TAATGTGGCTTTCTCTAACAAGTTGGATGCACATCGGCGAGAGACAAGGC
TTTAGAATACGGTCACAGATATTGGAGGCATATTTGGAGGAAAAGCCAAT
GGAATGGTACGACAATAATGAAAAATTGTTAGGAGATTTTACTCAAATCA
ACAGATGTGTGGAAGAGCTAAGATCAAGCTCCGCAGAGGCATCAGCCATA
ACTTTCCAGAATTTAGTTGCAATATGTGCGCTTCTGGGGACGTCATTCTA
CTATTCTTGGTCATTAACTTTAATTATTCTTTGCAGCTCTCCAATAATCA
CATTTTTTGCAGTGGTGTTTTCCAGAATGATTCATGTATATTCAGAGAAG
> SEQ_7 Yeast Saccharomyces cerevisiae ACCURACY 1 bp
TTCTCTATTCTGCCACTATACAATTTATTGTTTTCCACAAAGGGTAAAGG
TACTTTAAGAAAATAGTTTCTTATTTTTTTTGCCATGTAATTACCTAATA
GGGAAATTTACACGCTGCTTCGCACATATACAATTGTTTCAGATATGAAA
ACTGTTGCATTATTGCCGTTCATCATTTAAATACCAGAGCTTATAAACCT
GGATATGGCTGAACTATCTCCCGTTGTTACGTTCACACAGAGAGCTTTCA
AGTGCCGCTGAAAATTCCACTAGGAAACAAAGAACAAGCTACGTCATGAA
CTTTTTAAGTTTTAAGACTACAAAACACTATCACATTTTCAGGTACGTGA

The sequences are entirely dissimilar, with no conserved regions. Conventional evolution-based approaches to sequence alignment (like BLAST) are hardly applicable. Dinucleotides (AA/TT first) are the primary subject of alignment.

Possible number of configurations: $\prod_i (2\,Ac_i + 1)$. For 204 sequences with accuracy (Ac.) from 1 to 55 bp, this is roughly $51^{204}$ configurations.

Algorithms of multiple sequence alignment.
• Alignment of the most accurately mapped nucleosome sequences.
• Multicycle consecutive alignment: AA/TT matrices Mi of Ac.-sorted sequences are aligned one by one to the pattern derived on the previous step. Results of 10,000 cycles are averaged.
• Quasi-exhaustive consecutive alignment: keeps track of several “suboptimal” alignments; the alignment with the highest $SIM = \sum_{ij} (M_i * M_j)$ is final.
• Alignment with simulated annealing strategy: a new alignment is accepted if $SIM_{k+1} > SIM_k$; otherwise it is accepted with probability $P = e^{-E/T}$, where $-E = SIM_{k+1} - SIM_k$. T is a decreasing “temperature” factor.
• Multiple alignment by positional entropy criterion using Gibbs sampling strategy.

Approach 2
Chromatin structure of promoter sequences and regularity in positioning of TF sites: an example of an intuitive conceptual model.

TF-nucleosome correlation.
• Putative TF binding sites are mapped on promoter sequences.
• The distribution of each TF site over all sequences is calculated.
• Scanning with a “nucleosomal” 145 bp window through the distributions of all TF sites.
• Calculation of the spectral distribution for each TF inside the window at every scanning point.
• Evaluation of the number N of TFs with the main “nucleosomal” period of 10.1-10.5 bp in their spectra.
• Evaluation of the difference between N and the statistically expected number R of such TFs: $dS(\mathrm{StD}) = (N - R)/\sqrt{R}$.

Left: Order of events leading to transcription initiation from the IFN-ß promoter. I and II represent nucleosomes positioned in the promoter area. Derived from Agalioti, T., Lomvardas, S., Parekh, B., Yie, J., Maniatis, T., and Thanos, D. 2000. Ordered recruitment of chromatin modifying and general transcription factors to the IFN-ß promoter. Cell 103:667-678.
Right: Nucleosome positioning at the pS2 promoter. Derived from Sewack, G.F. and Hansen, U. 1997. Nucleosome positioning and transcription-associated chromatin alterations on the human estrogen-responsive pS2 promoter. J. Biol. Chem. 272:31118-31129.
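The significance measure used in the last step of the TF-nucleosome procedure, dS = (N - R)/SQRT(R), can be sketched in a few lines of Python. This is only an illustration: the function name and the example counts are hypothetical and are not taken from the study.

```python
import math


def ds_significance(observed: float, expected: float) -> float:
    """Deviation of an observed count from its expectation,
    expressed in units of sqrt(expected), i.e. dS = (N - R) / sqrt(R)."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (observed - expected) / math.sqrt(expected)


# Hypothetical numbers, for illustration only:
# N = TFs showing a 10.1-10.5 bp main period inside the window,
# R = number of such TFs expected by chance.
n_periodic = 25
r_expected = 16.0
print(ds_significance(n_periodic, r_expected))  # 2.25
```

The same function applies wherever the slides compare a count inside a window with the count expected outside it.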
To further optimize the findings and increase the statistical significance of the results, we varied the length of the windows. The calculations indicate the most statistically significant effect, 6.68 StD, for the windows (-46…+121) and (-46…+124), which cover the TSS. The size of these windows (167–170 bp) is similar to that of the chromatosome.

Nucleosome-TF correlation.
• Very consistent effect of high statistical significance.
• Obtained on two large, representative and essentially independent data sets.
• Obtained by two independent approaches.
• Has many correlations with known experimental data.

Approach 3
Large-scale human promoter mapping using CpG islands. (Program CpG_promoter by Quadratic Discriminant Analysis, QDA.) Quantitative analysis of a known sequence feature.

Definition of CpG island
• Length > 200 bp
• C + G content > 50%
• CpG ratio Obs/Exp > 0.6
(Gardiner-Garden and Frommer, J. Mol. Biol. 196, 261-282 (1987))

Molecular Biology of the Cell, 3rd edn. Part II. Molecular Genetics. Chapter 9. Control of Gene Expression. The Molecular Genetic Mechanisms That Create Specialized Cell Types.
Figure 9-70. The CG islands surrounding the promoter in three mammalian housekeeping genes. The yellow boxes show the extent of each island. Note also that, as for most genes in mammals, the exons (dark red) are very short relative to the introns (light red). (Adapted from A.P. Bird, Trends Genet. 3:342-347, 1987.)

SN and SP
Sensitivity SN is the proportion of True Positive (TP) predictions out of all actual positives: SN = TP / (TP + FN)
Specificity SP is the proportion of True Positive (TP) predictions out of all positive predictions: SP = TP / (TP + FP)

Results of promoter mapping (Test Set 2)
• 135 genes
• 68 have a CpG island around the promoter
• 63 recognized
• SN = 0.47 (0.93)
• SP = 0.34 (1 Pos./26 kb; the actual density is 1/36 kb)
• Promoter Scan gives SN = 0.44, SP = 0.06 (1 Pos./4.7 kb)

Approach 4
Revealing of regulatory mechanisms in promoter sequences. From sequence to model.
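The three CpG island criteria listed under Approach 3 (length > 200 bp, C + G content > 50%, CpG Obs/Exp > 0.6) can be checked for a candidate region roughly as follows. This is a minimal sketch, not the CpG_promoter program: the function names are hypothetical, and the Obs/Exp ratio is assumed to be the usual (number of CpG × N) / (number of C × number of G) form.

```python
def cpg_stats(seq: str) -> tuple[int, float, float]:
    """Return (length, C+G fraction, CpG observed/expected ratio) for a DNA string."""
    s = seq.upper()
    n = len(s)
    c = s.count("C")
    g = s.count("G")
    cpg = s.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return n, gc_fraction, obs_exp


def is_cpg_island(seq: str,
                  min_len: int = 200,
                  min_gc: float = 0.5,
                  min_obs_exp: float = 0.6) -> bool:
    """Apply the three thresholds to one candidate region."""
    n, gc_fraction, obs_exp = cpg_stats(seq)
    return n > min_len and gc_fraction > min_gc and obs_exp > min_obs_exp


# Toy inputs, for illustration only
print(is_cpg_island("CG" * 150))   # True: 300 bp, all C/G, many CpGs
print(is_cpg_island("AT" * 150))   # False: no C or G at all
```

A genome-scale scan would slide such a test along the sequence and merge overlapping positive windows; that step is left out here.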
(Work in progress) Alternative Architecture Types of Human Pol II Promoters Molecular Cell Biology 4. Nucleic Acids, the Genetic Code, and the Synthesis of Macromolecules 4.3. Nucleic Acid Synthesis Figure 4-15. Transcription of DNA into RNA is catalyzed by RNA polymerase, which can initiate the synthesis of strands de novo on DNA templates. The nucleotide at the 5′ end of an RNA strand retains all three of its phosphate groups; all subsequent nucleotides release pyrophosphate (PPi) when added to the chain and retain only their α phosphate (red). The released PPi is subsequently hydrolyzed by pyrophosphatase to Pi, driving the equilibrium of the overall reaction toward chain elongation. In most cases, only one DNA strand is transcribed into RNA. The Cell II. The Flow of Genetic Information 6. RNA Synthesis and Processing Eukaryotic RNA Polymerases and General Transcription Factors Figure 6.14. RNA polymerase II holoenzyme The holoenzyme consists of a preformed complex of RNA polymerase II, the general transcription factors TFIIB, TFIIE, TFIIF, and TFIIH, and several other proteins that activate transcription. This complex can be recruited directly to a promoter via interaction with TFIID (TBP + TAFs). An Introduction to Genetic Analysis 11. Regulation of Gene Transcription Transcription: an overview of gene regulation in eukaryotes. Figure 11-29. (a) Assembly of the RNA Polymerase II initiation complex begins with the binding of transcription factor TFIID to the TATA box. TFIID is composed of one TATA box-binding subunit called TBP (dark blue) and more than eight other subunits (TAFs), represented by one large symbol (light blue). Inhibitors can bind to the TFIID-promoter complex, blocking the binding of other general transcription factors. Binding of TFIIA to the TFIID-promoter complex (to form the D-A complex) prevents inhibitor binding. TFIIB then binds to the D-A complex, followed by binding of a preformed complex between TFIIF and RNA polymerase II. 
Finally, TFIIE, TFIIH, and TFIIJ must add to the complex, in that order, for transcription to be initiated. (From H.Lodish, D.Baltimore, A.Berk, S.L.Zipursky, P.Matsudaira, and J.Darnell, Molecular Cell Biology, 3d ed. Copyright © 1995 by Scientific American Books) Figure 10-52. Structure of the complex formed between TBP, promoter DNA, and TFIIB. In in vitro transcription systems, TFIIB binds to the assembled TBP – promoter DNA complex. Shown here are the C-terminal domain of Arabidopsis TBP and the C-terminal domain of human TFIIB. Transcription initiation in vivo also requires TFIIA, which binds to the TBP – promoter DNA complex on the side opposite to where TFIIB binds. TFIIA is thought to bind before TFIIB does. [Adapted from D. B. Nikolov et al., 1995, Nature 377:119.] Molecular Cell Biology Fourth Edition Harvey Lodish (Massachusetts Institute of Technology) Arnold Berk (U. of California, Los Angeles) Lawrence Zipursky (U. of California, Los Angeles) Paul Matsudaira (Massachusetts Institute of Technology) David Baltimore (California Institute of Technology) James Darnell (Rockefeller U.) Molecular Biology of the Cell, 3rd edn. Part II. Molecular Genetics Chapter 9. Control of Gene Expression How Genetic Switches Work 20 Figure 9-34. The gene control region of a typical eucaryotic gene. The promoter is the DNA sequence where the general transcription factors and the polymerase assemble. The most important feature of the promoter is the TATA box, a short sequence of T-A and A-T base pairs that is recognized by the general transcription factor TFIID. The start point of transcription is typically located about 25 nucleotide pairs downstream from the TATA box. The regulatory sequences serve as binding sites for gene regulatory proteins, whose presence on the DNA affects the rate of transcription initiation. These sequences can be located adjacent to the promoter, far upstream of it, or even downstream of the gene. 
DNA looping is thought to allow gene regulatory proteins bound at any of these positions to interact with the proteins that assemble at the promoter. Whereas the general transcription factors that assemble at the promoter are similar for all polymerase II transcribed genes, the gene regulatory proteins and the locations of their binding sites relative to the promoter are different for each gene.

A total of 1871 non-redundant human promoter sequences from the Eukaryotic Promoter Database (EPD) release 75 (http://www.epd.isb-sib.ch) and 8793 human promoters from the Database of Transcriptional Start Sites (DBTSS) (http://www.dbtss.hgc.jp/index.html) were used for statistical analyses as two separate datasets. We also constructed a small test set of 27 human promoters with MSS. This set was used to analyze the statistics of core-promoter elements in MSS promoters. Each promoter was considered several times, once for each known TSS, so the total number of sequences in this set is 107.

Molecular Cell Biology. 10. Regulation of Transcription Initiation. 10.4. Regulatory Sequences in Eukaryotic Protein-Coding Genes.
Figure 10-30. Comparison of nucleotide sequences upstream of the start site in 60 different vertebrate protein-coding genes. Each sequence was aligned to maximize homology in the region from −35 to −20. The tabulated numbers are the percentage frequency of each base at each position. Maximum homology occurs over a six-base region, referred to as the TATA box, whose consensus sequence is shown at the bottom. The initial base in mRNAs encoded by genes containing a TATA box most frequently is an A. [See R. Breathnach and P. Chambon, 1981, Ann. Rev. Biochem. 50:349; P. Bucher, 1990, J. Mol. Biol. 212:563.]
To extract a subset of promoter sequences containing the TATA box or Inr element at their functional positions, the positional weight matrices (PWM) with optimal cut-off values were applied (Bucher, 1990). We define the TATA or Inr element as being present at a certain position if the PWM score at this position exceeds the cut-off value, and define the element to be absent at this position otherwise. Since there are no matrices for DPE and BRE, we matched 5 out of 5 letters and 6 out of 7 for the DPE and BRE consensuses (Smale and Kadonaga, 2003), respectively. We used the same parameters to extract subsets containing known synergetic combinations, yet the respective elements had to be placed at their experimentally defined synergetic distance from one another. The distances between the elements in the remaining combinations were chosen based on the positions of the respective elements in the known combinations.

To estimate the statistical significance of the occurrence frequency of an element or synergetic combination in the respective functional window, we calculated a statistical significance parameter, dS, measured in units of standard deviation (StD = $\sqrt{N_{out}}$):

$dS = (N_{in} - N_{out}) / \sqrt{N_{out}}$,

where $N_{in}$ is the number of occurrences of an element or combination inside its functional window and $N_{out}$ is the number of occurrences of that element or combination in the average interval of the same length outside the functional window.

Figure 1. The occurrence frequency (the percentage of sequences having a considered motif centered at a particular position) distribution of the TATA box motifs based on scanning of EPD (blue curve) and DBTSS (magenta curve) sequences by PWM (Bucher, 1990). The TSS is placed at position +1. The straight horizontal gray line depicts the average amount of TATA motifs found in the randomly generated sequence with the same percentage of each of four nucleotides as in the EPD promoter sequences, namely A = 20.7%, C = 29.3%, G = 29.5%, and T = 20.5%.
The shadow rectangles indicate standard deviation calculated based on 1871 random sequences (short rectangle) and on 8973 random sequences (long rectangle), respectively. Figure 2. The occurrence frequency distribution of the Inr motifs based on scanning of EPD (blue curve) and DBTSS (magenta curve) sequences by PWM (Bucher, 1990). The TSS is placed at position +1. The straight horizontal gray line depicts the average amount of TATA motifs found in the randomly generated sequence with the same percentage of each of four nucleotides as in the EPD promoter sequences, namely A = 20.7%, C = 29.3%, G = 29.5%, and T = 20.5. The shadow rectangles indicate standard deviation calculated based on 1871 random sequences (short rectangle) and on 8973 random sequences (long rectangle), respectively. According to these data, half of the promoters, 49.0% (48.4%), have the Inr element at a functional position, only 21.8% (10.4%) have TATA box, 24.6% (24.6%) contain DPE, and 24.5% (25.5%) have BRE. The majority of the promoters, 77.3% (74.3%), have at least one of four core-promoter elements at its functional position and 41.8% (44.1%) have only one element including TATA – 5.5% (2.9%), Inr – 20.1% (23.0%), DPE – 6.6% (8.4%), and BRE – 9.6% (9.8%) Figure 1. Occurrence frequency distribution of combination TATA_Inr for EPD (blue) and DBTSS (magenta). TSS is placed at position +1. Figure 2. Occurrence frequency distribution of combination Inr_DPE for EPD (blue) and DBTSS (magenta). Figure 3. Occurrence frequency distribution of combination TATA_BRE for EPD (blue) and DBTSS (magenta). Figure 4. Occurrence frequency distribution of combination Inr_BRE for EPD (blue) and DBTSS (magenta). Figure 5. Occurrence frequency distribution of combination DPE_BRE for EPD (blue) and DBTSS (magenta). The value at each position is an 11-point sliding average. Figure 6. Occurrence frequency distribution of combination TATA_DPE for EPD (blue) and DBTSS (magenta). 
Note the common features of the aforementioned combinations: (1) all of them involve TFIID, and TBP binds to DNA regardless of the presence or absence of a TATA box; (2) TFIID covers the TSS area; (3) the distance from the TSS to the edge of the complex is approximately the same (~30–40 bp). Combinations BRE_Inr, BRE_DPE and TATA_DPE also satisfy these requirements. These combinations are present in a number of promoters comparable with the three previous combinations, with comparable statistical significance (Table 4). They may therefore also be considered as possible synergetic combinations of core-promoter elements (Fig. 1D–F).

We found that 83 (76.9%) of the MSS promoters contain at least one core-promoter element in the functional position relative to the TSS. This percentage is practically the same as for all promoters from both datasets. The statistical significance of the presence of any one of the four elements in the functional position is comparatively high for a relatively small dataset: dS = 3.5 StD, P-value = 0.0005. Remarkably, the portion of MSS promoters containing BRE (29.6%) is larger than the average in the EPD/DBTSS datasets. Thus the presence of the BRE element in the CpG+ and MSS promoters is comparable with the presence of the TATA box in the CpG-less promoters.

An example of MSS promoter.
Figure 1. An example of an MSS promoter sequence (36, GenBank Accession #X52601, TSS positions marked by shadow) containing all four core promoter elements at functional positions relative to a TSS (marked by bold letters of the same color as the respective core element).

Nature Structural & Molecular Biology 11, 1031 - 1033 (2004) doi:10.1038/nsmb1104-1031
Another piece in the transcription initiation puzzle
Francisco J Asturias
The author is at the Department of Cell Biology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, California 92037, USA.
A new report provides evidence that the TFIIB-RNAPII interaction depends on the presence of additional factors and highlights the importance of structural characterization of the entire preinitiation complex.

Beyond core-promoter

The occurrence frequency (the percentage of sequences having a considered motif centered at a particular position) distribution of the GC-box sites. The distribution is obtained by scanning 8973 human promoters from DBTSS (magenta: positive strand; red: negative strand; dark blue: both strands) and 1871 human promoters from EPD (green: both strands). The value at each position is an eleven-point sliding average. The TSS is placed at position +1. The straight horizontal line depicts the average amount of GC-box sites found in both strands of the randomly generated sequence with the same percentage of each of four nucleotides as in the training set of promoter sequences.

The flowchart of the optimization process. The input parameters are a promoter database, an initial PWM (or motif consensus), a set of experimentally defined sites, and a “functional window”. The first step is the extraction of the dataset of putative sites.
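The extraction of putative sites, the first step of the flowchart, amounts to scoring every position of a window with the PWM and keeping the positions whose score exceeds the cut-off. The sketch below illustrates that idea only. It assumes a log-odds matrix built from hypothetical counts with a uniform background and a pseudocount; the function names, the toy motif and the cut-off are illustrative and are not the matrices of Bucher (1990).

```python
import math

BASES = "ACGT"


def log_odds_pwm(counts, background=None, pseudocount=1.0):
    """Turn per-position base counts (list of dicts) into a log2-odds matrix."""
    background = background or {b: 0.25 for b in BASES}
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm


def scan(seq, pwm, cutoff, window=None):
    """Return (start, site, score) for every site lying inside `window`
    (0-based, half-open) whose PWM score exceeds `cutoff`."""
    seq = seq.upper()
    width = len(pwm)
    lo, hi = window if window else (0, len(seq))
    lo, hi = max(lo, 0), min(hi, len(seq))
    hits = []
    for i in range(lo, hi - width + 1):
        site = seq[i:i + width]
        if any(ch not in BASES for ch in site):
            continue  # skip ambiguous bases such as N
        score = sum(pwm[j][ch] for j, ch in enumerate(site))
        if score > cutoff:
            hits.append((i, site, score))
    return hits


# Toy 4-bp motif built from hypothetical counts, for illustration only
toy_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
toy_pwm = log_odds_pwm(toy_counts)
for start, site, score in scan("GGTATAGG", toy_pwm, cutoff=4.0):
    print(start, site, round(score, 2))
```

Restricting `window` to the functional window of an element gives the set of putative sites that the later optimization cycles would refine.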
There are two levels of optimization at the beginning: cutoff value and motif length. The Correlation Coefficient (CC) is used as the optimization parameter. Each cycle brings a portion of new sites typical for this particular window and excludes some atypical sites, increasing the influence of sites from that window. This influence is strongly limited by the requirement to be as close as possible to the previous matrix, expressed by the definition of CC. All aforementioned steps should be repeated for each window from the functional window. As a result we will have a set of optimal matrices, one matrix for each considered window. Each matrix has its own sensitivity and specificity.

$$CC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP + FN)\,(TN + FP)\,(TP + FP)\,(TN + FN)}}$$

Sensitivity (Sn): the percentage of experimentally confirmed sites recognized by the respective matrix.

Specificity (Sp): to compare the specificity of two matrices we will suppose that the majority of sites found by these matrices in the randomly generated DNA sequences are false positives. If this is true, the ratio of the occurrence frequencies found by the new and original matrices is inversely proportional to the ratio of their specificities. Therefore, we will consider the averaged occurrence frequency of sites in the randomly generated sequences as a parameter describing the specificity of the PWM.

4-row mononucleotide versus 16-row dinucleotide matrices
The majority of practically used PWMs are 4-row mononucleotide matrices based on the ‘additivity hypothesis’, which considers the contributions from each position of the binding site as independent and additive (Berg and von Hippel, 1987). Some experimental evidence (Man and Stormo, 2001; Bulyk, M., Johnson, P., and Church, G., 2002) and theoretical considerations (Zhang and Marr, 1993) show that a dinucleotide approach (accounting for dependence between adjacent nucleotides of a TFBS) could in some cases be the more appropriate approximation.
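The difference between the two matrix types can be made concrete with plain count matrices. The sketch below is illustrative only: the function names and toy sites are hypothetical, and it assumes aligned, equal-length sites over the alphabet A, C, G, T. It shows two sets of sites that are indistinguishable to a 4-row matrix but distinct in a 16-row one.

```python
from itertools import product

BASES = "ACGT"
DINUCS = ["".join(pair) for pair in product(BASES, repeat=2)]  # 16 rows


def mono_matrix(sites):
    """4-row count matrix: one column per site position."""
    length = len(sites[0])
    matrix = {b: [0] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site):
            matrix[base][i] += 1
    return matrix


def dinuc_matrix(sites):
    """16-row count matrix: one column per pair of adjacent positions."""
    length = len(sites[0])
    matrix = {d: [0] * (length - 1) for d in DINUCS}
    for site in sites:
        for i in range(length - 1):
            matrix[site[i:i + 2]][i] += 1
    return matrix


# Two toy site sets with identical per-position base counts
set_a = ["AC", "GT"]
set_b = ["AT", "GC"]
print(mono_matrix(set_a) == mono_matrix(set_b))    # True
print(dinuc_matrix(set_a) == dinuc_matrix(set_b))  # False
```

The 16-row matrix has four times as many parameters per column, which is why it needs a larger set of sites to estimate reliably.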
Using the same methodology, we built the 16-row dinucleotide matrices. The limitations of small experimental datasets have convinced researchers to use less accurate, but fairly reliable, 4-row matrices (Benos, P., Bulyk, M., and Stormo, G., 2002). There is no such limitation in our case since we use a large set of putative sites.

The sensitivity/specificity ratios for the original and new matrices for GC-box. Specificity: the averaged occurrence frequency of GC-box sites found by the original matrix (circle at the upper left corner) and two sets of new 4-row (squares) and 16-row (diamonds) matrices. The x-axis is sensitivity: the percentage of recognized sites from a control set of experimentally defined sites.

Figure 3. The occurrence frequency distribution of the HMG1 sites. The rest as for Sp1.
Figure 4. The occurrence frequency distribution of the PAX2 sites. The rest as for Sp1.
Figure 5. The occurrence frequency distribution of the NRF2 sites. The rest as in Figure 2.

A pair of closely positioned TF binding sites that acquire new regulatory properties due to direct or indirect interactions between the corresponding transcription factors is called a composite element (CE). As a case study, we performed clustering of putative binding sites predicted by the MATCH program in the vicinity of putative binding sites for TF STAT-1. Clear over-representation of putative binding sites was obtained for transcription factors AML-1a, AP-2, CDX-a, c-Ets-1, c-Myb, c-REL, ELK-1, EN-1, GKLF, HSF-1, HSF-2, IK-1, IK-2, IK-3, LYF-1, MSX-1, Myo-D, NF-AT, NF-κB, NRF-2, Oct-1, P300, Pax-4, Pax-6, RFX-1, SRY, TST-1. In contrast, putative binding sites for GATA-1, MZF-1, and Sp1 were clearly under-represented in that area. Although some of the results might be a mere consequence of shared motifs for the respective binding sites, others warrant a different interpretation and may point to potential CEs.
Influence of variant histone H2A.Z on local chromatin dynamics (In-depth chromatin analysis by structural modeling)

Gaussian Network Model (Bahar et al., 1997)
• The dynamics of the interactions is controlled by the connectivity (or Kirchhoff) matrix Γ, by analogy with the statistical mechanical theory of elasticity originally developed by Flory and coworkers for polymer networks.
• The elements of Γ are defined as
$$\Gamma_{ij} = \begin{cases} -1 & \text{if } i \neq j \text{ and } R_{ij} \le r_c \\ 0 & \text{if } i \neq j \text{ and } R_{ij} > r_c \\ -\sum_{k \neq i} \Gamma_{ik} & \text{if } i = j \end{cases}$$
• Here $r_c$ is the cutoff distance defining the range of interaction of residues, each residue being represented by its α-carbon, and $R_{ij}$ is the distance between the ith and jth residues.
• The value of $r_c$ = 7 Å includes the neighboring residues located in the first coordination shell near a central residue.
• Note that the columns (or rows) of Γ are interdependent (all sum up to zero), and thus Γ cannot be inverted; instead it is reconstructed after removal of its zero eigenvalue and corresponding eigenvector.

Anisotropic Network Model (Atilgan et al., 2001)
• The anisotropic network model (ANM) is an extension of the GNM to the 3N-d space of collective modes.
• The inter-residue 'distances' are controlled by harmonic potentials in the GNM; the ANM adopts the further assumption that the three (x, y and z) components of the inter-residue separation vectors obey Gaussian dynamics. Γ is replaced by its 3N x 3N counterpart (1/γ)H, where H is the Hessian matrix of the second derivatives of the intermolecular potential $V = (\gamma/2)\, \Delta R^{T}\, \Gamma\, \Delta R$.

Inhibitor binding alters the directions of motions in HIV-1 reverse transcriptase.
"Anisotropy of fluctuation dynamics of proteins with an elastic network model" Atilgan, AR, Durell, SR, Jernigan, RL, Demirel, MC, Keskin, O. & Bahar, I. Biophys. J. 80, 505-515, 2001.

Molecular Biology of the Cell, 3rd edn. Part I. Introduction to the Cell. Chapter 2. Small Molecules, Energy, and Biosynthesis. The Chemical Components of a Cell. Panel 2-5: The 20 amino acids involved in the synthesis of proteins.
Molecular Biology of the Cell, 3rd edn.
© 1994 by Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts, and James D. Watson. Part I. Introduction to the Cell. Chapter 2. Small Molecules, Energy, and Biosynthesis. The Chemical Components of a Cell. Panel 2-6: A survey of the major types of nucleotides and their derivatives encountered in cells.

Going beyond:
• To other species (promoter-chromatin architecture in Drosophila and yeast).
• TF regulatory modules.
• Post-transcriptional regulation (RNAi).
• From sequence analysis to molecular modeling and vice versa.
• Still beyond…

Acknowledgements
• Prof. Ed Trifonov (Weizmann Institute / University of Haifa)
• Prof. Michael Q. Zhang (Cold Spring Harbor Lab NY)
• Prof. Ivet Bahar (University of Pittsburgh)
• Prof. Gary Stormo (Washington University, St. Louis)
• Prof. Alex Bolshoy (Weizmann Inst. / Haifa U.)
• Prof. Mark Borodovsky (Georgia Institute of Technology, Atlanta)
• K. Derenshteyn (GIT)

Ioshikhes’ group:
• Dr. Naum Gershenzon
• Dr. Li Wang
• Dr. Amutha Ramaswamy
(Dept. Biomedical Informatics, Ohio State University)

Summary
“Do you see anything there?” … “Just a suggestion, perhaps. But wait an instant!” He stood upon a chair, and holding up the light in his left hand, he curved his right arm over the broad hat and round the long ringlets. “Good heavens!” I cried in amazement. The face of Stapleton had sprung out of the canvas. “The fellow is a Baskerville – that is evident.”
Arthur Conan Doyle, “The Hound of the Baskervilles”
