Predicting Gene Expression from Sequence
Michael A. Beer and Saeed Tavazoie
Cell 117, 185-198 (16 April 2004)

The Authors
Mike Beer, Postdoctoral Researcher; Ph.D., Princeton (1995)
Saeed Tavazoie, Professor, Dept. of Molecular Biology

The Question
• Transcription factor (TF) binding sites are relatively well characterized in Saccharomyces cerevisiae
• But the presence of a TF binding site alone is not sufficient to predict the expression of a gene
• Multiple regulatory factors are often involved
• How do you identify the elaborate rules for gene regulation?

Simple regulatory structures
Each possible combination of TFs must be tested in the lab, which is a hugely time-consuming task.

Problems with predicting gene regulation
• Regulatory motif sequences have low consensus; e.g., the well-known “TATA box” has a consensus of TATA(A/T)A(A/T)(A/G)
• Numerous transcription factors can bind to any one motif
• Many genes have multiple known motifs upstream of the ATG

Example of cis-regulatory logic
From Yuh et al (1998), Science 279, 1896-1902

The Approach
1. Using microarray expression data, the authors built clusters of genes with similar expression patterns.
From brain expression data in Wen et al (1998), PNAS 95, 334-339

The Approach, con’t.
2. Within each cluster of genes with similar expression patterns, the 800 bp upstream of the ATG is searched for consensus sequence motifs.

The Approach, con’t.
3. The authors built a Bayesian network using the TF sequence motifs as parent nodes and the expression data as data values.
4. This can be applied to a gene of interest by identifying the upstream TF motifs for that gene and finding the model(s) that best fit those motifs.
5. If the expression data are within the parameters predicted by the model, then there is a decent chance that the associated gene regulatory structure can be verified experimentally.
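To make the degenerate consensus and the upstream window from the slides above concrete, here is a minimal sketch. It uses plain regular-expression matching as a stand-in; it is not the motif-finding method used in the paper. The function names (`find_motif_sites`, `upstream_window`) and the toy sequence are hypothetical and chosen for illustration only.

```python
import re

# The consensus quoted above, TATA(A/T)A(A/T)(A/G), written as a regex.
TATA_CONSENSUS = "TATA[AT]A[AT][AG]"

def upstream_window(sequence, atg_index, width=800):
    """Return up to `width` bases immediately upstream of the ATG at `atg_index`."""
    start = max(0, atg_index - width)
    return sequence[start:atg_index]

def find_motif_sites(upstream, pattern=TATA_CONSENSUS):
    """Return (offset_from_ATG, matched_sequence) for every match.

    `upstream` is read 5'->3', so its last base is at position -1
    relative to the start codon. A lookahead is used so that
    overlapping matches are also reported.
    """
    regex = re.compile(f"(?=({pattern}))")
    n = len(upstream)
    return [(m.start() - n, m.group(1)) for m in regex.finditer(upstream.upper())]

# Hypothetical toy sequence (not real yeast data): one TATA-like site, then an ATG.
toy = "GGCC" + "TATAAATA" + "CGCGCGCGCG" + "ATGAAA"
atg = toy.index("ATG")
sites = find_motif_sites(upstream_window(toy, atg))
print(sites)
```

Reporting positions as negative offsets from the ATG keeps the output directly comparable with the "distance from the start codon" constraint discussed later in the slides.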
Two examples from yeast
Both clusters have at least 10 genes each, and there is some confidence that genes with the same upstream TF binding sites will exhibit the same expression pattern as these clusters.

Constructing the models
Using expression data from 30 microarrays, the authors identified 5547 genes with “significant” expression levels in yeast, and these data were used to construct 49 models of expression patterns.

Predictive accuracy
These 49 models were applied to five test sets of expression data, using only the upstream 800 bp region as input. The expression pattern was correctly predicted for 1898 of the 2587 genes in the test set(s). This amounts to 73% accuracy (random assignment would give 1/49, or about 2%).

Application to C. elegans
Given the larger amount of regulatory sequence in higher organisms, and the potential for more complex regulation, the authors had low expectations for applying this model to C. elegans. Using 2000 bp of upstream sequence, and microarray expression data including Hill (2000), the authors were surprised to find that they could predict expression patterns for roughly half of the genes in the C. elegans dataset.

An example from C. elegans

Is it really so simple?
Gene regulation involves a complex combinatorial dance of numerous factors aside from the presence or absence of TF binding sites. The authors have deliberately limited their scope to cis-acting upstream factors, ignoring regulatory elements in introns or downstream regions, as well as the effects of operons, alternative splicing, histone modifications, methylation, et cetera.

Model constraints
Several pieces of information were found to significantly improve the predictive accuracy of the models:
A. Motif orientation ( <--- or ---> )
B. Distance from the start codon
C. The particular order of the various TF binding sites
D. The presence of multiple copies of the same TF binding site
All of these factors were included in the model as priors.
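The four constraints listed above can be pictured as features computed from one promoter's motif hits. The sketch below is only an illustration of that idea, not the authors' encoding: the `MotifHit` record, the motif names "M1" and "M2", and the -200 bp `near_cutoff` are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MotifHit:
    motif: str      # hypothetical motif name, e.g. "M1"
    position: int   # offset relative to the ATG (negative = upstream)
    strand: str     # "+" for --->, "-" for <---

def constraint_features(hits, near_cutoff=-200):
    """Summarise one promoter's motif hits as orientation, distance,
    order, and copy-number features. `near_cutoff` is an assumed threshold."""
    by_motif = {}
    for h in hits:
        by_motif.setdefault(h.motif, []).append(h)

    feats = {}
    for motif, group in by_motif.items():
        # A. orientation: is any copy on the forward strand?
        feats[f"{motif}_forward"] = any(h.strand == "+" for h in group)
        # B. distance: is the copy closest to the ATG within the cutoff?
        closest = max(h.position for h in group)
        feats[f"{motif}_near_atg"] = closest >= near_cutoff
        # D. copy number
        feats[f"{motif}_copies"] = len(group)

    # C. order: motif names from most distal to most proximal,
    # keeping the first occurrence of each name.
    ordered = sorted(hits, key=lambda h: h.position)
    feats["order"] = tuple(dict.fromkeys(h.motif for h in ordered))
    return feats

# Hypothetical hits for a single promoter
hits = [
    MotifHit("M1", -350, "+"),
    MotifHit("M2", -120, "-"),
    MotifHit("M1", -90, "+"),
]
features = constraint_features(hits)
print(features)
```

A flat dictionary like this is one convenient way to hand such constraints to a downstream classifier; the paper's actual representation may differ.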
Why is distance from the start codon significant?
From Harbison et al (2004), Nature 431, 99-104

The number of copies of a TF binding site is relevant
From Molecular Biology of the Cell, 4th edition

Motif combinatorics and predictive accuracy
Combinatoric models are more accurate than single-TF models (unless a gene is under the control of only one TF).

The order of various TFs is significant

Future directions
Because of the sensitivity of the model(s), even a very small amount of ambiguity can yield junk results. For this reason, SAGE data are not particularly suitable: only unique SAGE tags can be said to be unambiguous, which in turn excludes all sorts of potentially useful data. However, we could use the microarray-based predictions to pick gene regulatory structures to investigate.
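The point about SAGE tags can be illustrated with a small filter that keeps only tags mapping to exactly one gene. This is a sketch under stated assumptions: the tag sequences and gene names are made up, and `unambiguous_tags` is a hypothetical helper, not part of any SAGE toolkit.

```python
from collections import defaultdict

def unambiguous_tags(tag_gene_pairs):
    """Keep only tags that map to exactly one gene.

    `tag_gene_pairs` is an iterable of (tag, gene) tuples; a tag seen
    with two or more different genes is treated as ambiguous and dropped.
    """
    genes_by_tag = defaultdict(set)
    for tag, gene in tag_gene_pairs:
        genes_by_tag[tag].add(gene)
    return {
        tag: next(iter(genes))
        for tag, genes in genes_by_tag.items()
        if len(genes) == 1
    }

# Hypothetical tag-to-gene assignments
pairs = [
    ("CATGAAAA", "geneA"),
    ("CATGCCCC", "geneB"),
    ("CATGCCCC", "geneC"),   # same tag, different gene: ambiguous
    ("CATGAAAA", "geneA"),   # repeated observation of the same mapping
]
unique = unambiguous_tags(pairs)
print(unique)
```

In this toy input half of the distinct tags are discarded, which is the kind of data loss the slide refers to.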
