Survey
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project
Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm Mathieu Blanchette Martin Tompa Computer Science & Engineering University of Washington Outline • How are genes regulated? • What is phylogenetic footprinting? • First solution • Improvements and extensions • Application to regulation of several important genes 2 Regulation of Genes • What turns genes on and off? • When is a gene turned on or off? • Where (in which cells) is a gene turned on? • How many copies of the gene product are produced? 3 Regulation of Genes Transcription Factor RNA polymerase DNA Regulatory Element Coding region 4 Regulation of Genes Transcription Factor RNA polymerase DNA Regulatory Element Coding region 5 Goal Identify regulatory elements in DNA sequences. These are: • Binding sites for proteins • Short substrings (5-25 nucleotides) • Up to 1000 nucleotides (or farther) from gene • Inexactly repeating patterns (“motifs”) 6 Phylogenetic Footprinting (Tagle et al. 1988) Functional sequences evolve slower than nonfunctional ones. • Consider a set of orthologous sequences from different species • Identify unusually well conserved regions 7 Substring Parsimony Problem Given: • phylogenetic tree T, • set of orthologous sequences at leaves of T, • length k of motif • threshold d Problem: • Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d. This problem is NP-hard. 8 Small Example AGTCGTACGTGAC... (Human) AGTAGACGTGCCG... (Chimp) ACGTGAGATACGT... (Rabbit) GAACGGAGTACGT... (Mouse) TCGTGACGGTGAT... (Rat) Size of motif sought: k = 4 9 Solution ACGT AGTCGTACGTGAC... AGTAGACGTGCCG... ACGT ACGTGAGATACGT... ACGT ACGG GAACGGAGTACGT... TCGTGACGGTGAT... Parsimony score: 1 mutation 10 CLUSTALW multiple sequence alignment (rbcS gene) Cotton Pea Tobacco Ice-plant Turnip Wheat Duckweed Larch ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC Cotton Pea Tobacco Ice-plant Turnip Wheat Duckweed Larch CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC-------ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA Cotton Pea Tobacco Ice-plant Turnip Wheat Duckweed Larch ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA Cotton Pea Tobacco Ice-plant Larch Turnip Wheat Duckweed T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG 11 An Exact Algorithm (generalizing Sankoff and Rousseau 1975) Wu [s] = best parsimony score for subtree rooted at node u, if u is labeled with string s. … … 4k entries ACGG: 1 ACGT: 0 ACGG: + ACGT: 0 ... AGTCGTACGTG ... … ACGG: ACGT :0 ... … ACGGGACGTGC ACGG: 2 ACGT: 1 … ACGG: ACGT :0 ... ... … ACGG: ACGT :0 ... … ACGG: 1 ACGT: 1 ... … ACGG: 0 ACGT: 2 ... … ACGTGAGATAC GAACGGAGTAC TCGTGACGGTG ACGG: 0 ACGT: + ... 12 Recurrence Wu [s] = min ( Wv [t] + d(s, t) ) v: child t of u 13 Running Time Wu [s] = min ( Wv [t] + d(s, t) ) v: child t of u O(k 42k ) time per node 14 Running Time Wu [s] = min ( Wv [t] + d(s, t) ) v: child t of u O(k 42k ) time per node Number of species Average sequence length Total time O(n k (42k + l )) Motif length 15 Improvements • Better algorithm reduces time from 2k k O(n k (4 + l )) to O(n k (4 + l )) • By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k) • Amenable to many useful extensions (e.g., allow insertions and deletions) 16 Application to -actin Gene Gilthead sea bream (678 bp) Medaka fish (1016 bp) Common carp (696 bp) Grass carp (917 bp) Chicken (871 bp) Human (646 bp) Rabbit (636 bp) Rat (966 bp) Mouse (684 bp) Hamster (1107 bp) 17 Common carp ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACA TTGGCATGGCTTTTGTTATTTTTGGCGCTTGACTCAGG ATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTTTTTTTTTTTTTT TTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTA T GTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAA CCTGTACACTGAC GGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCA A TAATTCAAATAAAAGT GCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCC CTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC Chicken TTGGCATGGCTTTATTTGTTTTTTCTTTTGGC ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGA GC TTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACGCCCCCAAAGTTCTACAATG CATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAAT AGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCA GCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACAC ACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGG CCTGTACACTGAC GGAGGGAGGGGCTA TTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGC TGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGT GATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCT GGGCTCAGTGGGACTGCAGCTGTGCT Human TTGGCATGGCTTTATTTGTTTTTTTTGTTTTGTT GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGA TTGGTTTTTTTTTTTTTTTTGGC TTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCA CAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTC TCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATA CTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGG CCTGTACACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGT AGGCAGCCAGGGCTTA TGGGGGCAGCAGAGGGTG Parsimony score over 10 vertebrates: 0 1 2 18 Motifs Absent from Some Species • Find motifs – with small parsimony score – that span a large part of the tree • Example: in tree of 10 species spanning 760 Myrs, find all motifs with – – – – score 0 spanning at least 250 Myrs score 1 spanning at least 350 Myrs score 2 spanning at least 450 Myrs score 3 spanning at least 550 Myrs 19 Application to c-fos Gene 10 Puffer fish 7 2 Chicken 2 1 Pig 2 2 1 0 1 Mouse Hamster Human Asked for motifs of length 10, with 0 mutations over tree of size 6 1 mutation over tree of size 11 2 mutations over tree of size 16 3 mutations over tree of size 21 4 mutations over tree of size 26 Found: 0 mutations over tree of size 8 1 mutation over tree of size 16 3 mutations over tree of size 21 4 mutations over tree of size 28 20 Application to c-fos Gene Motif Score Conserved in Known? CAGGTGCGAATGTTC 0 4 mammals TTCCCGCCTCCCCTCCCC 0 4 mammals GAGTTGGCTGcagcc 3 puffer + 4 mammals GTTCCCGTCAATCcct 1 chicken + 4 mammals yes CACAGGATGTcc 4 all 6 yes AGGACATCTG 1 chicken + 4 mammals yes GTCAGCAGGTTTCCACG 0 4 mammals yes TACTCCAACCGC 0 4 mammals yes 21 Other Genes Similar results for the following genes: •insulin •c-myc promoter and intron •growth hormone •interleukin-3 •histone H1 •-globin •dihydrofolate reductase •fibroin •myogenin •prolactin •thyroglobulin •γ-actin 3´ UTR •rbcS •rbcL 22 Conclusions • Guaranteed optimality for question posed • Time linear in the number of species and the total sequence lengths, exponential in the parsimony score • Practical on real biological data sets • Discovered highly conserved regions, both known and not (yet) known • Available at http://bio.cs.washington.edu/software.html 23