Discovery of transcription networks
Lecture 3, Nov 2012
Regulatory Genomics, Weizmann Institute
Prof. Yitzhak Pilpel

[Figure: hierarchical clustering of expression profiles. Panel title: "Promoter Motifs and expression profiles". Motifs shown: CGGCCCCGCGGA, CTCCTCCCCCCCTTC, TGGCCAATCA, ATGTACGGGTG.]

AlignACE Example
A cluster of genes may contain a common motif in their promoters. 300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.
http://statgen.ncsu.edu/~dahlia/journalclub/S01/jmb1205.pdf

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO3

Find a needle in a haystack
The aligned sites hidden in the sequences above:

AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA
**********

Computational Identification of Cis-regulatory Elements Associated with Groups of Functionally Related Genes in Saccharomyces cerevisiae. J.D. Hughes, P.W. Estep, S. Tavazoie, G.M. Church. Journal of Molecular Biology (2000)

Example
GAL4 is one of the yeast genes required for growth on galactose.
http://www.cifn.unam.mx/Computational_Genomics/old_research/BIOL2.html

Motif Representation
Aligned sites G1-G5 (one column per site, one row per motif position):

Position  G1  G2  G3  G4  G5
1         A   A   G   A   A
2         G   A   A   G   G
3         A   A   A   A   A
4         A   T   T   A   A
5         G   G   G   G   G
6         A   A   A   A   A

Base frequencies at each position:

Base  1    2    3    4    5    6
A     0.8  0.4  1    0.6  0    1
C     0    0    0    0    0    0
G     0.2  0.6  0    0    1    0
T     0    0    0    0.4  0    0

Finding New Motif
• By lab work
• By comparison to known motifs in other species
• By searching upstream regions of a set of potentially co-regulated genes

The genes bound by the TF Abf1 can be clustered into several groups, some of which contain a motif.
[Figure: sporulation experiment. Groups labelled NCGTNNNNARTGAT, CGATGAGMTK, and NCGTNNNNARTGAT & CGATGAGMTK.]

Search Space
• Size of search space: $(L - W + 1)^N \approx L^N$
• L = 600, W = 15, N = 10: size $\approx 10^{27}$
• Exact search methods are not feasible

AlignACE Example: Input Data Set
The input is the set of seven upstream sequences listed above (HIS7, ARO4, ILV6, THR4, ARO1, HOM2, PRO3). 300-600 bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.
Based on slides from G. Church, Computational Biology course at Harvard

K-means
• Start with random positions of centroids.
• Assign data points to centroids.
• Move centroids to center of assigned points.
• Iterate till minimal cost.
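The four K-means steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the lecture: the function name `kmeans`, the squared Euclidean distance, the choice of random data points as starting centroids, and the stopping rule (no change in assignments) are all assumptions made for the example.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups.

    Returns (centroids, labels). Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Step 1: start with random positions of centroids
    # (assumption: k distinct data points are used as starting positions).
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(n_iter):
        # Step 2: assign each data point to its nearest centroid
        # (assumption: squared Euclidean distance).
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Step 4: stop once assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the center of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels


# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
```

For expression data, each point would be one gene's normalized profile across the time points. Because the result depends on the random start, running with several seeds and keeping the lowest-cost solution is a common precaution.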
[Figure: K-means clustering, iteration 3.]

AlignACE Example: Initial Seeding
Initial sites taken from the input data set:

TGAAAAATTC
GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC

MAP score = -10.0

AlignACE Example: Sampling (Add?)
Candidate site: TCTCTCTCCA
How much better is the alignment with this site as opposed to without?

AlignACE Example: Sampling (Add? Remove.)
Candidate site: TGAAAAAATG
How much better is the alignment with this site as opposed to without?

AlignACE Example: Column Sampling
Current sites:

GACATCGAAA
GCACTTCGGC
GAGTCATTAC
GTAAATTGTC
CCACAGTCCG
TGTGAAGCAC

How much better is the alignment with this new column structure?

GACATCGAAAC
GCACTTCGGCG
GAGTCATTACA
GTAAATTGTCA
CCACAGTCCGC
TGTGAAGCACA

AlignACE Example: The Best Motif

AAAAGAGTCA
AAATGACTCA
AAGTGAGTCA
AAAAGAGTCA
GGATGAGTCA
AAATGAGTCA
GAATGAGTCA
AAAAGAGTCA

MAP score = 20.37

The MAP Score
• MAP: maximum a priori log-likelihood score
• This is what the algorithm tries to optimize.
• Measures the degree of over-representation of the motif in the input sequence relative to expectation in a random sequence.

[Equation: full MAP score formula; not recoverable from the slide text.] Symbols used in the formula:
• $\mathrm{B}$, $\Gamma$ = standard Beta and Gamma functions
• $N$ = number of aligned sites; $T$ = number of total possible sites
• $F_{jb}$ = number of occurrences of base $b$ at position $j$ ($F$ = sum)
• $G_b$ = background genomic frequency for base $b$
• $\beta_b = n \times G_b$ for $n$ pseudocounts ($\beta$ = sum)
• $W$ = width of motif; $C$ = number of columns in motif ($W \ge C$)

The MAP Score (approximation)

$$\mathrm{MAP} \approx N \log_{10} \frac{N}{\mathrm{exp}}$$

• $N$ = number of aligned sites
• exp = expected number of sites in the input sequence, compared with a random model

For a motif of length 7, $P = (1/4)^7 \approx$ 1 site every 16,000 bases. For a sequence of 64,000 bases, exp = 4.

Some examples

Motif (length)  Genes (1,000 bp promoter each)  Times found  Expected  MAP score
AGGGTAA (7)     16                              10           ~1        10
GTAGATG (7)     16                              2            ~1        0.60206
CCGTGAG (7)     160                             10           ~10       0
GATGTA (6)      16                              2            ~4        -0.60206
AGGGTA (6)      16                              10           4         4.089354
A (1)           16                              2504         ~2500     1.73
AAAAAAA (7)     16                              5            ~1.5      2.614394
GGGGGGG (7)     16                              5            ~0.5      5

The MAP Score: Properties
With $\mathrm{MAP} \approx N \log_{10}(N/\mathrm{exp})$:
a) The motif should be "strong".
b) The input sequence can't be too long.

For a motif of length 7, $P = (1/4)^7 \approx$ 1 site every 16,000 bases. With a genome length of ~12 Mb:

$$\mathrm{exp} = \frac{1}{16000} \times 2 \times 12 \times 10^{6} = 1500$$

(the factor 2 presumably counts both strands). The motif needs more than 1500 sites to get a positive MAP score:

$$\mathrm{MAP} = N \log_{10} \frac{N}{\mathrm{exp}} = 1500 \log_{10} \frac{1500}{1500} = 0$$

Problem: most transcription factor binding sites will only occur in dozens to hundreds of genes, so $N \ll \mathrm{exp}$.
Solution: cluster genes before searching for motifs.
[Figure: genes plotted by expression at time-point 1 and time-point 3.]

Group Specificity Score
How well does a motif target the genes used to find it, compared with the whole genome?
[Figure: Venn diagram. All genome (N); ORFs in the motif group (S1); ORFs with best sites (S2); intersection (X).]

What is the probability of having such a large intersection? The probability of an intersection of exactly $x$ ORFs is

$$\frac{\binom{S_1}{x}\binom{N - S_1}{S_2 - x}}{\binom{N}{S_2}}$$

and the score sums this over all intersections of size $x$ or larger:

$$S = \sum_{i=x}^{\min(S_1, S_2)} \frac{\binom{S_1}{i}\binom{N - S_1}{S_2 - i}}{\binom{N}{S_2}}$$

• $N$ = total # of ORFs in the genome (6226)
• $S_1$ = # ORFs used to align the motif
• $S_2$ = # targets in the genome (~100 ORFs with best ScanACE scores)
• $x$ = size of the intersection of $S_1$ and $S_2$

Positional Bias Score
Measures the degree of preference for positioning in a particular range upstream of the translational start.
[Figure: histogram of # ORFs versus site position, from -600 bp to the translational start, with a 50 bp window marked.]

• Find the best 200 sites in the genome.
• Restrict sites to a segment of length [s = 600 bp] from the translation start.
• t = # sites in the segment
• Choose a window size [w = 50 bp].
• m = # sites in the most enriched window

What is the probability of having m or more sites in a window of size w? The probability of exactly $m$ sites in the window is

$$\binom{t}{m}\left(\frac{w}{s}\right)^{m}\left(1 - \frac{w}{s}\right)^{t - m}$$

and the score sums this over $m$ or more sites:

$$P = \sum_{i=m}^{t} \binom{t}{i}\left(\frac{w}{s}\right)^{i}\left(1 - \frac{w}{s}\right)^{t - i}$$

Lecture Topics
• Introduction to DNA regulatory motifs
• AlignACE: a motif finding algorithm
• Assessment of motifs
• AlignACE results on yeast genome
• Summary & Conclusions

Comparisons of motifs
• The CompareACE program finds the best alignment between two motifs and calculates the correlation between the two position-specific scoring matrices.
• Similar motifs: CompareACE score > 0.7

Clustering motifs by similarity
Motifs A, B, C and D are compared pairwise with CompareACE. Two example position-specific matrices:

Base  1    2    3    4    5    6
A     0.8  0.4  1    0.6  0    1
C     0    0    0    0    0    0
G     0.2  0.6  0    0    1    0
T     0    0    0    0.4  0    0

Base  1    2    3    4    5    6
A     0.4  0.4  1    0.6  0    0
C     0    0    0    0    0    1
G     0.6  0.6  0    0    1    0
T     0    0    0    0.4  0    0

Pairwise CompareACE scores:

     A    B    C    D
A    1.0
B    0.9  1.0
C    0.1  0.2  1.0
D    0.0  0.1  0.8  1.0

Hierarchical clustering gives cluster 1: A, B and cluster 2: C, D.

[Figure: Most Group Specific Motifs]
[Figure: Most Positional Biased Motifs]

Negative Controls
• 250 AlignACE runs on randomly created groups of ORFs, of size 20, 40, 60, 80, and 100 ORFs.
[Figure: MAP scores, random versus real groups.]
• With a MAP cutoff of 10 and a Group Specificity cutoff of $10^{-10}$: false positives = 10-20%

Positive Controls
• 29 listed TFs with five or more known binding sites were chosen.
• AlignACE was run on the upstream regions of the corresponding regulated genes.
• An appropriate motif was found in 21/29 cases.
• False negative rate = ~10-30%

The data
• Organism: Saccharomyces cerevisiae
• Microarray experiment: Affymetrix microarrays of 6,220 mRNAs
• Data: gathered by Cho et al.
• 15 time points, spanning about 4 hours across two cell cycles
• Genome sequence

[Figure: Typical clusters of genes in the data]

Variance normalization and clustering of expression time series
• The 3,000 most variable ORFs were chosen, based on the normalized dispersion in expression level of each gene across the time points (s.d./mean).
• The 15 time points were used to construct a 3,000 by 15 data matrix.
• The variance of each gene was normalized across the 15 conditions by subtracting the mean across the time points from the expression level of each gene and dividing by the standard deviation across the time points.
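The selection and normalization steps described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `dispersion`, `most_variable` and `normalize_profile` are hypothetical, the population standard deviation is used (the slides do not say which estimator was applied), and expression levels are assumed to have a non-zero mean.

```python
from statistics import mean, pstdev


def dispersion(profile):
    """Normalized dispersion of one gene across time points: s.d. / mean."""
    m = mean(profile)
    if m == 0:
        raise ValueError("dispersion is undefined for a zero-mean profile")
    return pstdev(profile) / m


def most_variable(genes, n):
    """Names of the n genes with the largest normalized dispersion.

    `genes` maps a gene name to its list of expression levels.
    """
    return sorted(genes, key=lambda g: dispersion(genes[g]), reverse=True)[:n]


def normalize_profile(profile):
    """Subtract the mean across time points and divide by the standard deviation."""
    m = mean(profile)
    sd = pstdev(profile)
    if sd == 0:
        # A flat profile carries no shape information; map it to zeros.
        return [0.0] * len(profile)
    return [(x - m) / sd for x in profile]


# Toy example with three genes and three time points (made-up numbers).
genes = {"g1": [10.0, 10.0, 10.0], "g2": [1.0, 5.0, 9.0], "g3": [4.0, 5.0, 6.0]}
selected = most_variable(genes, 2)
normalized = {g: normalize_profile(genes[g]) for g in selected}
```

After this step every selected gene has mean 0 and standard deviation 1 across the time points, so clustering compares the shapes of the profiles rather than their absolute levels.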
Before and after mean-variance normalization
[Figure: two panels, "Before normalization" (gene expression versus time) and "After normalization" (normalized expression versus time), each showing Gene1, Gene2 and Gene3.]

Representation of expression data
[Figure: normalized expression data from microarrays; each gene (Gene 1, Gene 2, ...) is a point whose coordinates are its expression at each time point (Time-point 1, ...).]

K-means
• Start with random positions of centroids.
[Figure: iteration 0. Legend: position of data point $X_i$; position of centroid $C$.]

Choosing K
Since we don't know the number of clusters in advance, we need a way to estimate it. To choose the number of clusters K, the sum of squared errors is calculated for different values of K. A clear break point indicates the "natural" number of clusters in the data.
[Figure: sum of squared errors versus K.]

Significant enrichment of functional categories within clusters
• Each gene was mapped into one of 199 functional categories (according to the MIPS database).
• For each cluster, P-values were calculated for observing the frequencies of genes from particular functional categories.
• There was significant grouping of genes within the same cluster.

The hypergeometric score
P-values were calculated for finding at least $k$ ORFs from a particular functional category within a cluster of size $n$:

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i}\binom{g - f}{n - i}}{\binom{g}{n}}$$

where $f$ is the total number of genes within the functional category and $g$ is the total number of genes within the genome (6,220). P-values greater than $3 \times 10^{-4}$ are not reported, as their total expectation within the cluster would be higher than 0.05, since 199 MIPS categories were tested (ref. 15).

Challenge: generalize the hypergeometric score to more than two sets.
[Figure: overlapping sets labelled Chr V, Functional group, Expression cluster.]

Sequence: MCB element
[Figure: consensus nucleotides of the motif.]
This motif was later mapped to the literature and confirmed to be the well-known MCB element, which is known to control the periodicity of the genes that peak at G1-S.
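The hypergeometric enrichment score used above for functional categories can be computed directly with exact integer arithmetic. The sketch below is an illustration, not the code used for the lecture: the function name `hypergeom_tail` and the toy numbers are assumptions. The same upper tail, with $(k, n, f, g)$ replaced by $(x, S_2, S_1, N)$, corresponds to the Group Specificity Score described earlier in the lecture.

```python
from math import comb


def hypergeom_tail(k, n, f, g):
    """P(X >= k) for a hypergeometric variable X.

    X counts the genes of one functional category (f genes in total)
    that fall in a cluster of n genes drawn without replacement from
    a genome of g genes. Hypothetical helper for illustration.
    """
    upper = min(n, f)  # X cannot exceed the cluster size or the category size
    total = comb(g, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible combinations contribute nothing to the sum.
    favourable = sum(comb(f, i) * comb(g - f, n - i) for i in range(k, upper + 1))
    return favourable / total


# Toy check on a tiny "genome": 10 genes, 4 in the category, cluster of 3.
p_at_least_2 = hypergeom_tail(2, 3, 4, 10)

# Illustrative call at genome scale (made-up k, n and f; g = 6,220 as in the slides).
p_example = hypergeom_tail(10, 100, 50, 6220)
```

Summing the upper tail directly is equivalent to one minus the lower tail, but it avoids subtracting two nearly equal numbers when the P-value is very small. A P-value from this function would then be compared with the $3 \times 10^{-4}$ reporting threshold.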
The existence of the motif in all ORFs of each cluster
[Figure: MCB element occurrence across clusters.]

Location of the motif: MCB element
[Figure: distribution of the distance of the motif from the ATG (bp).]

SCB element
This motif (later found to be the SCB element) was the second-highest-scoring motif within this cluster. The SCB element is also a very well-known cis-regulatory element that contributes to the periodicity of the genes within the G1-S regulon.
[Figure label: ribonucleotide reductase]

Determining the cell-cycle periodicity of clusters
Fourier analysis allows the genes to be ranked according to their cell-cycle periodicity.
[Figure: expression matrix with rows ordered from high periodicity to low periodicity; example profile of expression versus time.]
Explain FFT… (including ORs variability)

[Figure: Periodic clusters]
[Figure: Non periodic clusters]

And this was just the beginning…

What if two motifs are derived from a cluster?
• Collaboration
• Co-occurrence (AND)
• Redundancy (OR)
http://longitude.weizmann.ac.il/publications/PilpelNatGent01.pdf

Logic of interaction of motifs
[Figure: expression level of genes with only M1, only M2, and M1 AND M2; G2 marked on the time axis.]

Synergistic motifs
A combination of two motifs is called 'synergistic' if the expression coherence score of the genes that have both motifs is significantly higher than the scores of the genes that have either of the motifs alone.
[Figure: example with Mcm1 and SFF.]

A global map of combinatorial expression control
Pilpel et al., Nature Genetics 2001
[Figure: network of motif combinations across conditions (heat-shock, cell cycle, sporulation, diauxic shift, MAPK signaling, DNA damage). Annotations: high connectivity; hubs; alternative partners in various conditions. Motifs shown: STRE, PHO4, CCA, ALPHA1, mRPE8, mRPE57, AFT1, PDR, SWI5, MIG1, mRPE69, RAP1, mRPE72, GCN4, CSRE, SFF', mRPE34, MCB, mRPE58, MCM1, mRPE6, RPN4, ECB, BAS1, SCB, LYS14, ABF1, SFF, STE12, ALPHA2, MCM1', ALPHA1', HAP234, mRRPE, PAC, mRRSE3.]

The human cell cycle
[Figure: G1-phase, S-phase, G2-phase, M-phase.]

The proliferation cluster genes are cell cycle periodic
[Figure: gene expression of the proliferation cluster across samples, with G2/M, G1/S and CHR marked.]
[Figure: distribution of cell cycle periodicity. Proportion of genes versus CCP score, for all genes and for proliferation genes.]

The cell cycle motifs are enriched among the proliferation cluster genes
[Figure: positions of the CHR, ELK1, CDE, E2F and NFY motifs relative to the TSS. Legend includes "Not in the cluster, mutated in cancer".]

Regulation of the proliferation cluster: significant motifs

Motif  P-value
NFY    3.74×10^-11
CDE    5.31×10^-10
E2F    2.37×10^-9
ELK1   3.10×10^-6
CHR    1.42×10^-5

Searched: 1000 bp upstream; 326 MatInspector motifs.
[Figure: sequence logos of the motifs.]

Potential regulatory motifs in 3’ UTRs
Finding 3’ UTR elements associated with high/low transcript stability (in yeast)
[Figure: entire genome; example elements AAGCTTCC and CCTACAAC.]