Regulatory Sequence Analysis
Regulatory regions and regulatory elements
Jacques van Helden
[email protected]

The non-coding genome

Organism                   Year  Size (Mb)  Genes   Genes/Mb  Coding (%)  Non-coding (%)  Repetitive (%)  Transcribed (%)
Mycoplasma genitalium      1995        0.6     481       802          90              10
Haemophilus influenzae     1995        1.8   1 717       954          86              14
Escherichia coli           1997        4.6   4 289       932          87              13
Saccharomyces cerevisiae   1996       12     6 286       524          72              28
Arabidopsis thaliana       2001      120    27 000       225          30              70
Caenorhabditis elegans     1998       97    19 000       196          27              73
Drosophila melanogaster    2000      165    16 000        97          15              85
Homo sapiens               2001    3 200    31 000        10           3              97              46               28

Transcriptional activation
[Figure labels: transcriptional activator, DNA-binding domain, activation domain, enhancer, RNA polymerase, initiation]

Transcriptional repression
• Prevent RNA polymerase from accessing DNA
• Competition for factor binding site
• Factor titration
• Prevent transcription factor from interacting with RNA polymerase (bind with activation domain)

Transcription factor-DNA interfaces
[Figure: four panels, A to D]

Phosphate utilization in yeast
[Figure: regulatory network of phosphate utilization. Recoverable labels:
• Genes and products: PHO2 (Pho2p); PHO4 (Pho4p); PHO81 (Pho81p); PHO3 and PHO5, PHO11, PHO12 (acid phosphatase, EC 3.1.3.2, secreted); PHO8 (alkaline phosphatase, EC 3.1.3.1); PHO84, PHO86, PHO87, PHO88, PHO89 (Pi transporter)
• Complexes and forms: Pho85-Pho80 complex (EC 2.7.1.-), Pho4p (nucleus), Pho4p-phosphate (cytoplasm)
• Metabolites and compartments: orthophosphoric monoester, H2O, alcohol, Pi, extracellular space
• Edge types: expression, codes for, up-regulates, inhibits, catalyzes, transports, translocates]

Pho4p binding sites

gene    start  end   sequence
PHO5    -260   -242  ..GCACTCACACGTGGGACTA
PHO5    -260   -245  ..GCACTCACACGTGGGA
PHO5    -262   -239  TGGCACTCACACGTGGGACTAGCA
PHO8    -540   -522  ...TCGGGCCACGTGCAGCGAT
PHO8    -736   -718  ..ttacccgCACGCTTaatat
PHO81   -350   -332  ...TTATGGCACGTGCGAATAA
PHO84   -421   -403  ..TTTCCAGCACGTGGGGCGG
PHO84   -442   -425  ...TAGTTCCACGTGGACGTG
PHO84   -879   -874  .aaaagtgtCACGTGataaaaat
PHO84   -267   -250  ..taatacgCACGTTTTTaa
PHO84   -592   -575  ....TTACGCACGTTGGTGCTG
PHO5    -368   -349  ...AATTAGCACGTTTTCGCATA
PHO5    (?)    (?)   ..AAATTAGCACGTTTCGC
PHO5    -370   -347  .TAAATTAGCACGTTTTCGCATAGA

IUPAC ambiguous nucleotide code

Code  Nucleotides   Mnemonic
A     A             Adenine
C     C             Cytosine
G     G             Guanine
T     T             Thymine
R     A or G        puRine
Y     C or T        pYrimidine
W     A or T        Weak hydrogen bonding
S     G or C        Strong hydrogen bonding
M     A or C        aMino group at common position
K     G or T        Keto group at common position
H     A, C or T     not G
B     G, C or T     not A
V     G, A, C       not T
D     G, A or T     not C
N     G, A, C or T  aNy

Pho4p binding specificity - matrix descriptions

Pho4p (26 sites)
A   14   0   5   7   6   0  26   0   0   0   0   3
C    2   8   5  16   6  26   0  26   0   1   0   4
G    4   2   1   1  12   0   0   0  26   0  16  12
T    6  16  15   2   2   0   0   0   0  25  10   7

Pho4p.cacgtg (18 sites)
A    2  17   0   0   0   0   2   1
C   16   0  18   0   0   0   6   3
G    0   1   0  18   0  18   9  12
T    0   0   0   0  18   0   1   2

Pho4p.cacgtt
[Rows A, C, G, T; the column layout is not recoverable from the extracted text. Counts as extracted: 7 0 0 1 0 1 0 7 2 1 0 5 5 3 0 0 | 1 0 8 0 3 8 0 8 4 0 0 0 0 0 0 0 8 4 2 4 5 5 5 3 5 0 2 11 13 1 1 3 0 0 8 0 0 0 0 8 0 0 0 8 1 0 2 5]

Sequence logo
[Figure: sequence logos for Rap1, Rpn4, Gcn4, HSE, Mig1, Cbf1]

Shannon uncertainty

• Hs(j): uncertainty of column j of a PSSM
• Hg: uncertainty of the background (e.g. a genome)

$$H_s(j) = -\sum_{i=1}^{A} f_{ij} \log_2(f_{ij})$$
$$H_g = -\sum_{i=1}^{A} p_i \log_2(p_i)$$

Here f_ij is the frequency of residue i in column j, p_i is the background frequency of residue i, and A is the alphabet size.

Properties of the uncertainty (for a 4-letter alphabet)
• min(H) = 0: no uncertainty at all, the nucleotide is completely specified (e.g. p={1,0,0,0})
• H = 1: uncertainty between two letters (e.g. p={0.5,0,0,0.5})
• max(H) = 2: complete uncertainty, two bits of information are required to specify the choice among the four alternatives (e.g. p={0.25,0.25,0.25,0.25})

Schneider (1986) defines an information content Rseq based on Shannon's uncertainty, summed over the w columns of the matrix:

$$R_{seq}(j) = H_g - H_s(j)$$
$$R_{seq} = \sum_{j=1}^{w} R_{seq}(j)$$
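The uncertainty and information-content formulas above can be tried out on the Pho4p.cacgtg count matrix from the earlier slide. The following is a minimal Python sketch, not the author's implementation: the function names are hypothetical, the background is assumed to be equiprobable (so Hg = 2 bits), and no pseudo-counts or small-sample correction are applied.

```python
import math

# Pho4p.cacgtg count matrix from the slide above (rows A, C, G, T; 8 columns).
COUNTS = {
    "A": [2, 17, 0, 0, 0, 0, 2, 1],
    "C": [16, 0, 18, 0, 0, 0, 6, 3],
    "G": [0, 1, 0, 18, 0, 18, 9, 12],
    "T": [0, 0, 0, 0, 18, 0, 1, 2],
}

# Assumption: equiprobable background; a real genome would use its own frequencies.
BACKGROUND = [0.25, 0.25, 0.25, 0.25]


def uncertainty(probs):
    """Shannon uncertainty H = -sum(p * log2(p)), taking 0 * log2(0) as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def column_frequencies(counts, j):
    """Relative frequencies f_ij of the four residues in column j."""
    total = sum(row[j] for row in counts.values())
    return [row[j] / total for row in counts.values()]


def information_content(counts, background):
    """Per-column R_seq(j) = H_g - H_s(j)."""
    h_g = uncertainty(background)
    width = len(next(iter(counts.values())))
    return [h_g - uncertainty(column_frequencies(counts, j)) for j in range(width)]


r_seq_j = information_content(COUNTS, BACKGROUND)
r_seq = sum(r_seq_j)  # total information content of the matrix, in bits

for j, r in enumerate(r_seq_j, start=1):
    print(f"column {j}: Rseq(j) = {r:.2f} bits")
print(f"Rseq = {r_seq:.2f} bits")
```

Fully conserved columns (such as the third column, where all 18 sites carry a C) reach the 2-bit maximum, while the flanking columns contribute much less.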
Source: Schneider (1986)

Schneider logos

Schneider (1990) proposes a graphical representation, based on his earlier entropy measure (H), of the importance of each residue at each position of an alignment. He provides a new formula for Rseq.

$$H_s(j) = -\sum_{i=1}^{A} f_{ij} \log_2(f_{ij})$$
$$R_{seq}(j) = 2 - H_s(j) + e(n)$$
$$h_{ij} = f_{ij} R_{seq}(j)$$

• Hs(j): uncertainty of column j
• Rseq(j): information content of column j
• e(n): correction for small samples (pseudo-weight)
• hij: height of residue i in column j of the logo

Remarks
• This information content does not include any correction for the prior residue probabilities (pi).
• This information content is expressed in bits.
• Boundaries
  • min(Rseq) = 0: equiprobable residues
  • max(Rseq) = 2: perfect conservation of one residue, all the others are forbidden

http://www.lecb.ncifcrf.gov/~toms/icons/tata.gif

References - Sequence logo
• Schneider, T.D., G.D. Stormo, L. Gold, and A. Ehrenfeucht. 1986. Information content of binding sites on nucleotide sequences. J Mol Biol 188: 415-431.
• Schneider, T.D. and R.M. Stephens. 1990. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18: 6097-6100.
• Tom Schneider's publications online: http://www.lecb.ncifcrf.gov/~toms/paper/index.html

Methionine Biosynthesis in S. cerevisiae
[Figure: methionine biosynthesis pathway and its regulators. Recoverable content:

Step                                                          Enzyme (EC)                                               Gene(s)
L-Aspartate + ATP → L-aspartyl-4-P + ADP                      Aspartate kinase (2.7.2.4)                                HOM3
L-aspartyl-4-P + NADPH → L-aspartic semialdehyde + NADP+ + Pi Aspartate semialdehyde dehydrogenase (1.2.1.11)           HOM2
L-aspartic semialdehyde + NADPH → L-Homoserine + NADP+        Homoserine dehydrogenase (1.1.1.3)                        HOM6
L-Homoserine + Acetyl-CoA → O-acetyl-homoserine + CoA         Homoserine O-acetyltransferase (2.3.1.31)                 MET2
O-acetyl-homoserine + Sulfide → Homocysteine                  O-acetylhomoserine (thiol)-lyase (4.2.99.10)              MET17
Homocysteine → L-Methionine                                   Methionine synthase, vit B12-independent (2.1.1.14)       MET6
L-Methionine + H2O + ATP → S-Adenosyl-L-Methionine + Pi + PPi S-adenosyl-methionine synthetase I and II (2.5.1.6)       SAM1, SAM2

• Methyl donor for the methionine synthase step: 5-methyltetrahydropteroyltri-L-glutamate → 5-tetrahydropteroyltri-L-glutamate
• Connected pathways: aspartate biosynthesis, threonine biosynthesis, sulfur assimilation, cysteine biosynthesis
• Regulators: Met31p and Met32p (MET31, MET32); Cbf1p/Met4p/Met28p complex (CBF1, MET4, MET28); Gcn4p (GCN4); Met30p (MET30)]

Met4p binding sites

gene    start  end   sequence
MET3    -367   -349  GAAAAGTCACGTGTAATTT
MET3    -384   -366  AAAAGGTCACGTGACCAGA
MET14   -235   -217  CTAATTTCACGTGATCAAT
MET16   -185   -167  ATCATTTCACGTGGCTAGT
ECM17   -311   -293  ATTTCATCACGTGCGTATT
ECM17   -339   -321  .TTTGTCCACGTGATATTTC
MET10   -255   -237  .CCACACCACGTGAGCTTAT
MET10   -237   -219  .TAGAAGCACGTGACCACAA
MET2    -360   -342  GTATTTTCACGTGATGCGC
MET2    -554   -536  TAATAATCACGTGATATTT
MET17   -306   -288  .AAATGGCACGTGAAGCTGT
MET17   -332   -314  TTGAGGTCACATGATCGCA
MET6    -540   -522  GCCACATCACGTGCACATT
MET6    -502   -484  AATATTTCACGTGACTTAC
SAM2    -329   -311  .TCTACCCACGTGACTATAA
SAM2    -381   -363  .TCTTCACATGTGATTCATC

Met4p count matrix (16 sites), core block of 8 columns
A    2   0  16   0   1   0   0  12
C    0  16   0  15   0   0   0   0
G    4   0   0   0  15   0  16   4
T   10   0   0   1   0  16   0   0

[Four further columns of the Met4p matrix, each given as A C G T counts; their position relative to the core block is not recoverable from the extracted text: (13 1 1 1), (11 0 1 4), (3 0 4 9), (3 3 4 6)]

Met31p binding sites

gene    start  end   sequence
MET14   -202   -182  CCTCAAAAAATGTGGCAATGG
MET2    -313   -293  TGCAAAAAATTGTGGATGCAC
MET17   -227   -207  TCATGAAAACTGTGTAACATA
MET6    -313   -293  GTCGCAAAACTGTGGTAGTCA
SAM2    -306   -286  GCTTGAAAACTGTGGCGTTTT
SAM1    -283   -263  ACAGGAAAACTGTGGTGGCGC
MET19   -173   -153  ATAAGCAAACTGTGGGTTCAT
MUP3    -188   -168  CGGAAAAAACTGTGGCGTCGC
MET8    -184   -164  GGAAAAAAAATGTGAAAATCG
MET1    -232   -212  CATAATAAACTGTGAACGGAC
MET3    -259   -239  ACAAAGCCACAGTTTTACAAC
MET28   -159   -139  CTAACACCACAGTTTTGGGCG
MET8    -434   -414  TCTTGTCCGCAGTTTTATCTG
MET30   -168   -148  GGGAAGCCACAGTTTGCGCGG
MET6    -405   -385  CTATCGAACTCGTTTAGTCGC

Met31p count matrix (14 sites)
A    5  11  14  14  14   2   0   0   0   0   2   5
C    2   2   0   0   0  11   0   0   1   0   0   5
G    5   0   0   0   0   0   0  14   0  14  11   1
T    2   1   0   0   0   1  14   0  13   0   1   3

Characteristics of yeast regulatory sites
• Short DNA sequences (5-30 bp)
• Highly conserved core (5-8 bp), with partly conserved flanking nucleotides
• Pair of very short oligonucleotides (3 nt) separated by a non-conserved segment (0-20 bp)
• Located upstream of the regulated gene
• Within 800 bp from the start codon
• Strand-insensitive
• Efficiency does not depend on strand position

Pattern matching vs pattern discovery
Set of DNA sequences → Pattern known?
• Yes → Pattern matching → Matching positions
• No → Pattern discovery → Putative regulatory patterns

Questions and approaches
1. If we know the consensus for a given transcription factor, can we predict its binding sites in a DNA sequence?
   → Pattern matching against a sequence
2. Can we scan a sequence for matches with the consensus of all the currently known transcription factors?
   → Matching a library of patterns against a sequence
3. Starting from a set of co-expressed genes, can we predict cis-acting elements involved in their transcriptional regulation?
   → Pattern discovery within a sequence set
4. Can we detect regulatory signals by searching conserved elements in non-coding sequences of orthologous genes?
   → Phylogenetic footprinting
5. Can we predict gene regulation on the basis of the presence of regulatory motifs in their regulatory regions?
   → Gene classification on the basis of pattern scores

Typical situations: pattern discovery
• Selected sequence set
  e.g. family of 20 co-regulated genes, obtained from a DNA chip experiment → identify putative regulatory sites
• Genome-scale pattern discovery
  e.g. all upstream sequences → identify transcription initiation signals
  e.g. all downstream sequences → identify 3' maturation signals

Typical situations: pattern matching
• Selected genes, selected patterns
  e.g. 10 genes known to be regulated by a factor → search matching positions
• Selected genes, library of patterns
  → infer putative action of any previously known transcription factor
• All genes, selected patterns
  → classify all the genes of a genome according to putative regulatory properties

Differences between species

                  yeast              E. coli                            higher organisms
location          upstream           upstream, overlapping initiation   upstream, downstream, within introns
distance range    -800 to -1 bp      -400 to +50 bp                     over 100s of kb
position effect   often irrelevant   often essential                    often irrelevant
strand            insensitive        sensitive or symmetric             insensitive

[Remaining rows, whose column layout is not recoverable from the extracted text: spaced pair of 3nt ~5-8 conserved bp rare frequent most common ~5-8 conserved bp core repeated sites occasional composite elements frequent]
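Pattern matching with a known consensus can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the course: the function names are hypothetical, the consensus `CACGTK` (matching CACGTG or CACGTT) is an assumption chosen to cover both Pho4p core variants listed earlier, and the IUPAC code is translated into a regular expression that is searched on both strands because yeast sites are described above as strand-insensitive.

```python
import re

# IUPAC ambiguous nucleotide code, as tabulated earlier in the slides.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

# Complement of each IUPAC letter (e.g. R = A/G pairs with Y = T/C).
COMPLEMENT = str.maketrans("ACGTRYWSMKHBVDN", "TGCAYRWSKMDVBHN")


def reverse_complement(pattern):
    """Reverse complement of a sequence or IUPAC pattern."""
    return pattern.upper().translate(COMPLEMENT)[::-1]


def iupac_to_regex(pattern):
    """Turn an IUPAC pattern into a regex with one character class per position."""
    return "".join("[" + IUPAC[letter] + "]" for letter in pattern.upper())


def find_matches(pattern, sequence):
    """Return sorted (start, end, strand, matched_text) tuples.

    Coordinates are 0-based, end-exclusive, on the direct strand.
    Strand is "D" (direct) or "R" (pattern matched as reverse complement).
    A lookahead is used so that overlapping matches are all reported.
    """
    sequence = sequence.upper()
    hits = set()
    for strand, oriented in (("D", pattern), ("R", reverse_complement(pattern))):
        regex = re.compile("(?=(" + iupac_to_regex(oriented) + "))")
        for match in regex.finditer(sequence):
            hits.add((match.start(1), match.end(1), strand, match.group(1)))
    return sorted(hits)


# Two Pho4p sites from the table above (PHO5 and PHO84), without alignment padding.
pho5_site = "GCACTCACACGTGGGACTA"
pho84_site = "taatacgCACGTTTTTaa"

print(find_matches("CACGTK", pho5_site))
print(find_matches("CACGTK", pho84_site))
```

Because CACGTG is its own reverse complement, the PHO5 site is reported on both strands, whereas the CACGTT variant in the PHO84 site matches on the direct strand only. A position-specific scoring matrix would give graded scores instead of this yes/no match.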