Download Matin Tompa`s Lecture Slides - Department of Computer Science

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Discovery of Regulatory
Elements by a Phylogenetic
Footprinting Algorithm
Mathieu Blanchette
Martin Tompa
Computer Science & Engineering
University of Washington
Outline
• How are genes regulated?
• What is phylogenetic footprinting?
• First solution
• Improvements and extensions
• Application to regulation of several
important genes
2
Regulation of Genes
• What turns genes on and off?
• When is a gene turned on or off?
• Where (in which cells) is a gene turned on?
• How many copies of the gene product are
produced?
3
Regulation of Genes
Transcription Factor
RNA polymerase
DNA
Regulatory Element
Coding region
4
Regulation of Genes
Transcription Factor
RNA polymerase
DNA
Regulatory Element
Coding region
5
Goal
Identify regulatory elements in DNA
sequences. These are:
• Binding sites for proteins
• Short substrings (5-25 nucleotides)
• Up to 1000 nucleotides (or farther) from
gene
• Inexactly repeating patterns (“motifs”)
6
Phylogenetic Footprinting
(Tagle et al. 1988)
Functional sequences evolve slower than
nonfunctional ones.
• Consider a set of orthologous sequences from
different species
• Identify unusually well conserved regions
7
Substring Parsimony Problem
Given:
• phylogenetic tree T,
• set of orthologous sequences at leaves of T,
• length k of motif
• threshold d
Problem:
• Find each set S of k-mers, one k-mer from
each leaf, such that the “parsimony” score of
S in T is at most d.
This problem is NP-hard.
8
Small Example
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)
Size of motif sought: k = 4
9
Solution
ACGT
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGT
ACGTGAGATACGT...
ACGT
ACGG
GAACGGAGTACGT...
TCGTGACGGTGAT...
Parsimony score: 1 mutation
10
CLUSTALW multiple sequence alignment (rbcS gene)
Cotton
Pea
Tobacco
Ice-plant
Turnip
Wheat
Duckweed
Larch
ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC
Cotton
Pea
Tobacco
Ice-plant
Turnip
Wheat
Duckweed
Larch
CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC-------ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA
Cotton
Pea
Tobacco
Ice-plant
Turnip
Wheat
Duckweed
Larch
ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA
Cotton
Pea
Tobacco
Ice-plant
Larch
Turnip
Wheat
Duckweed
T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG
11
An Exact Algorithm
(generalizing Sankoff and Rousseau 1975)
Wu [s] = best parsimony score for subtree rooted at node u,
if u is labeled with string s.
…
…
4k entries
ACGG: 1
ACGT: 0
ACGG: +
ACGT: 0
...
AGTCGTACGTG
...
…
ACGG:
ACGT :0
...
…
ACGGGACGTGC
ACGG: 2
ACGT: 1
…
ACGG:
ACGT :0
...
...
…
ACGG:
ACGT :0
...
…
ACGG: 1
ACGT: 1
...
…
ACGG: 0
ACGT: 2
...
…
ACGTGAGATAC
GAACGGAGTAC
TCGTGACGGTG
ACGG: 0
ACGT: +
...
12
Recurrence
Wu [s] =  min ( Wv [t] + d(s, t) )
v: child t
of u
13
Running Time
Wu [s] =  min ( Wv [t] + d(s, t) )
v: child t
of u
O(k  42k )
time per node
14
Running Time
Wu [s] =  min ( Wv [t] + d(s, t) )
v: child t
of u
O(k  42k )
time per node
Number of
species
Average
sequence
length
Total time O(n k (42k + l ))
Motif length
15
Improvements
• Better algorithm reduces time from
2k
k
O(n k (4 + l )) to O(n k (4 + l ))
• By restricting to motifs with parsimony
score at most d, greatly reduce the
number of table entries computed
(exponential in d, polynomial in k)
• Amenable to many useful extensions
(e.g., allow insertions and deletions)
16
Application to -actin Gene
Gilthead sea bream (678 bp)
Medaka fish (1016 bp)
Common carp (696 bp)
Grass carp (917 bp)
Chicken (871 bp)
Human (646 bp)
Rabbit (636 bp)
Rat (966 bp)
Mouse (684 bp)
Hamster (1107 bp)
17
Common carp
ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACA
TTGGCATGGCTTTTGTTATTTTTGGCGCTTGACTCAGG
ATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTTTTTTTTTTTTTT
TTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTA T
GTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAA
CCTGTACACTGAC
GGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCA A
TAATTCAAATAAAAGT
GCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCC
CTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC
Chicken
TTGGCATGGCTTTATTTGTTTTTTCTTTTGGC
ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGA
GC
TTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACGCCCCCAAAGTTCTACAATG
CATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAAT AGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCA
GCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACAC
ACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGG
CCTGTACACTGAC
GGAGGGAGGGGCTA
TTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGC
TGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGT
GATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCT
GGGCTCAGTGGGACTGCAGCTGTGCT
Human
TTGGCATGGCTTTATTTGTTTTTTTTGTTTTGTT
GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGA
TTGGTTTTTTTTTTTTTTTTGGC
TTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCA
CAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTC
TCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATA
CTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGG
CCTGTACACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGT
AGGCAGCCAGGGCTTA
TGGGGGCAGCAGAGGGTG
Parsimony score over 10 vertebrates: 0 1
2
18
Motifs Absent from Some Species
• Find motifs
– with small parsimony score
– that span a large part of the tree
• Example: in tree of 10 species spanning
760 Myrs, find all motifs with
–
–
–
–
score 0 spanning at least 250 Myrs
score 1 spanning at least 350 Myrs
score 2 spanning at least 450 Myrs
score 3 spanning at least 550 Myrs
19
Application to c-fos Gene
10
Puffer fish
7
2
Chicken
2
1
Pig
2
2
1
0
1
Mouse
Hamster
Human
Asked for motifs of length 10, with
0 mutations over tree of size 6
1 mutation over tree of size 11
2 mutations over tree of size 16
3 mutations over tree of size 21
4 mutations over tree of size 26
Found:
0 mutations over tree of size 8
1 mutation over tree of size 16
3 mutations over tree of size 21
4 mutations over tree of size 28
20
Application to c-fos Gene
Motif
Score
Conserved in
Known?
CAGGTGCGAATGTTC
0
4 mammals
TTCCCGCCTCCCCTCCCC
0
4 mammals
GAGTTGGCTGcagcc
3
puffer + 4 mammals
GTTCCCGTCAATCcct
1
chicken + 4 mammals
yes
CACAGGATGTcc
4
all 6
yes
AGGACATCTG
1
chicken + 4 mammals
yes
GTCAGCAGGTTTCCACG
0
4 mammals
yes
TACTCCAACCGC
0
4 mammals
yes
21
Other Genes
Similar results for the following genes:
•insulin
•c-myc promoter and intron
•growth hormone
•interleukin-3
•histone H1
•-globin
•dihydrofolate reductase
•fibroin
•myogenin
•prolactin
•thyroglobulin
•γ-actin 3´ UTR
•rbcS
•rbcL
22
Conclusions
• Guaranteed optimality for question posed
• Time linear in the number of species and the
total sequence lengths, exponential in the
parsimony score
• Practical on real biological data sets
• Discovered highly conserved regions, both
known and not (yet) known
• Available at
http://bio.cs.washington.edu/software.html
23
Related documents