Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Homologous recombination wikipedia , lookup
DNA replication wikipedia , lookup
DNA repair protein XRCC4 wikipedia , lookup
Zinc finger nuclease wikipedia , lookup
DNA polymerase wikipedia , lookup
DNA profiling wikipedia , lookup
DNA nanotechnology wikipedia , lookup
Microsatellite wikipedia , lookup
Exhaustive search CS 466 Saurabh Sinha A fruitfly has a bacterial infection • When attacked by bacteria, the fruitfly’s immune system kicks in • Many genes that were lying “dormant” now producing their proteins, to fight the infection. (Some otherwise active genes may now become inactive.) • Which genes are these ? Looking for differentially expressed genes • Measure the activity level of all genes in normal fly and in infected fly • Find genes whose activity levels are significantly different between the two conditions • How to measure gene activity level ? An Introduction to Bioinformatics Algorithms www.bioalgorithms.info DNA Arrays--Technical Foundations • In a single experiment, the expression levels of hundreds or thousands genes within a cell are measured • An array works by exploiting the ability of a given mRNA molecule to hybridize to the DNA template. • An “array” containing many short DNA fragments called “probes”. Each probe’s sequence is part of some gene, and therefore can bind to mRNA of that gene • The amount of mRNA bound to each probe on the array reflects the amount of mRNA of that gene in the cell May, 11, 2004 http://www.ncbi.nih.gov/About/primer/microarrays.html 4 An Introduction to Bioinformatics Algorithms www.bioalgorithms.info DNA Microarray Millions of DNA strands build up on each location. May, 11, 2004 Tagged probes become hybridized to the DNA chip’s microarray. http://www.affymetrix.com/corporate/media/image_library/image_library_1.affx 5 An experiment on a microarray In this schematic: GREEN represents Control DNA RED represents Sample DNA YELLOW represents a combination of Control and Sample DNA BLACK represents areas where neither the Control nor Sample DNA Each color in an array represents either healthy (control) or diseased (sample) tissue. The location and intensity of a color tell us whether the gene is present in the control and/or sample DNA. May 11,2004 http://www.ncbi.nih.gov/About/primer/microarrays.html 10 Differentially expressed genes • Find a set of genes differentially expressed in the infected fly • These are perhaps the ones orchestrating the immune response • Look at promoters of these genes • Find that the substring TCGGGGATTTCC occurs often (modulo minor spelling mistakes) in these promoters Regulatory motif • Someone tells you that TCGGGGATTTCC is the preferred binding site of the “NFkB” transcription factor • Infer that NFkB is turning on the immunity ! • What if you did not know that NFkB binds TCGGGGATTTCC ? • Could you have just gazed at the promoter sequences, and discovered this binding site ? Chapter 4.5-4.9 Finding motifs ab initio • Enumerate all possible strings of some fixed (small) length • For each such string (“motif”) count its occurrences in the promoters • Report the most frequently occurring or the most statistically significant motif • Does the true motif pop out ? Simple statistics • Consider 10 promoters, each 100 bp long • Suppose a secret motif ATGCAACT has been “planted” in each promoter • Our enumerative method counts every possible “8-mer” • Expected number of occurrences of an 8-mer is 10 x 100 x (1/4)8 ≈ 0.015 • Most likely, an arbitrary 8-mer will occur at most once, may be twice • 10 occurrences of ATGCAACT will stand out Variation in binding sites • Motif occurrences will not always be exact copies of the consensus string • The transcription factor can usually tolerate some variability in its binding sites • It’s possible that none of the 10 occurrences of our motif ATGCAACT is actually this precise string A new motif model • To define a motif, let’s say we know where the motif occurrence starts in the sequence • The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st) Motifs: Profiles and Consensus a G g t a c T t C c A Alignment a c g a c g C c g t t t t a T C a c A c c g g A g t t t G _________________ Profile A C G T 3 2 0 0 0 4 1 0 1 0 4 0 0 0 0 5 3 1 0 1 1 4 0 0 1 0 3 1 0 0 1 4 _________________ Consensus A C G T A C G T • Line up the patterns by their start indexes s = (s1, s2, …, st) • Construct matrix profile with frequencies of each nucleotide in columns • Consensus nucleotide in each position has the highest score in column Profile matrices • Suppose there were t sequences to begin with • Consider a column of a profile matrix • The column may be (t, 0, 0, 0) – A perfectly conserved column • The column may be (t/4, t/4, t/4, t/4) – A completely uniform column • “Good” profile matrices should have more conserved columns Scoring Motifs l • Given s = (s1, … st) and DNA a G g t a c T t C c A t a c g t a c g t T A g t a c g t C c A t C c g t a c g G _________________ l Score(s,DNA) = max count(k , i) i 1 k{ A,T ,C ,G} A C G T 3 0 1 0 3 1 1 0 2 4 0 0 1 4 0 0 0 1 4 0 0 0 3 1 0 0 0 5 1 0 1 4 _________________ Consensus a c g t a c g t Score 3+4+4+5+3+4+3+4=30 t Goal • Goal is to find the starting positions s=(s1,…st) to maximize the Score(s,DNA) of the resulting profile matrix • This is one formulation of the “motif finding problem” Another formulation • Hamming distance between two strings v and w is dH(v,w) = number of mismatches between v and w • Given an array of starting positions s=(s1,…st), we define dH(v, s) = ∑idH(v,si) • Define: TotalDist(v, DNA) = mins dH(v,s) • Computing TotalDist is easy – find closest string to v in each input sequence The median string problem • Find v that minimizes TotalDist(v) • A double minimization (mins, minv) • Equivalent to motif finding problem – Show this Naïve time complexity • Motif finding problem: Consider every (s1,…st): O((n-l+1)t) • Median string problem: Consider every l-mer: O(4l). Relatively fast ! • Common form of both expressions: Find a vector of L variables, each variable can take k values: O(kL) Enumeration • Want to generate all strings in {1,2,3,4}L 11…11 11…12 11…13 11…14 . . 44…44 “Search Tree” for enumeration Here, L = 2 -4 1 2 11 2 11 12 3 3 2- 3- 4- 4 13 14 21 22 23 24 31 32 33 34 41 42 43 44 Analyzing Search Trees • Characteristics of the search trees: – The sequences are contained in its leaves – The parent of a node is the prefix of its sequence • How can we move through the tree? “Search Tree” for enumeration -- Order of steps 4 1 2 11 2 11 12 1 3 2- 3- 4- 4 3 13 14 2 3 21 4 22 5 23 6 24 7 31 8 32 9 10 33 34 11 41 12 13 42 43 14 44 15 An algorithm to enumerate ! NEXTLEAF(a, L, k) for i := L to 1 if ai < k ai := ai+1 return a ai := 1 return a ALLLEAVES(L, k) a := (1,..,1) while true output a a := NEXTLEAF(a,L,k) if a = (1,..,1) return Increment the least significant digit; and “carry over” to next position if necessary A new task: visiting all the vertices in tree • Not just the leaves, but the internal nodes also – – • • Why? Why not only the leaves of the tree? We’ll see later. PreOrder traversal of a tree PreOrder(node): 1. Visit node 2. PreOrder(left child of node) 3. PreOrder(right child of node) • How about a non- “recursive” version? Visit the Next Vertex 1. NextVertex(a,i,L,k) 2. if i < L 3. a i+1 1 4. return ( a,i+1) 5. else 6. for j L to 1 7. if aj < k 8. aj aj +1 9. return( a,j ) 10. return(a,0) At an “internal node” At a “leaf node”, may need to “climb up” // // // // a : the array of digits i : prefix length L: max length k : max digit value In words • If at an internal node, just go one level deeper. • If at a leaf node, – go to next leaf – if moved to a non-sibling in this process, jump up Example • Moving to the next vertex: Current Location 1- 11 12 13 -- 2- 14 21 22 23 3- 24 31 32 33 4- 34 41 42 43 44 Example • Moving to the next vertices: Location after 5 next vertex moves -- 1- 11 12 13 2- 14 21 22 23 3- 24 31 32 33 4- 34 41 42 43 44 Bypassing • What if we wish to skip an entire subtree (at some internal node) during the tree traversal ? Bypass Move: Example • Bypassing the descendants of “2-”: Current Location 1- 11 12 13 -- 2- 14 21 22 23 3- 24 31 32 33 4- 34 41 42 43 44 Example • Bypassing the descendants of “2-”: Next Location -- Somehow, know that 2* is useless 1- 11 12 13 2- 14 21 22 23 3- 24 31 32 33 4- 34 41 42 43 44 Bypassing BYPASS(a, i, L, k) for j := i to 1 if aj < k aj := aj+1 return (a,j) return (a,0) Brute Force Solution for the Motif finding problem 1. 2. 3. 4. 5. 6. 7. 8. 9. BruteForceMotifSearchAgain(DNA, t, n, l) s (1,1,…, 1) bestScore Score(s,DNA) while forever s NextLeaf (s, t, n-l+1) if (Score(s,DNA) > bestScore) bestScore Score(s, DNA) bestMotif (s1,s2 , . . . , st) return bestMotif O(l(n-l+1)t) Can We Do Better? • Sets of s=(s1, s2, …,st) may have a weak profile for the first i positions (s1, s2, …,si) • Every row of alignment may add at most l to Score • Optimism: if all subsequent (t-i) positions (si+1, …st) add (t – i ) * l to Score(s,i,DNA) • If Score(s,i,DNA) + (t – i ) * l < BestScore, it makes no sense to search in vertices of the current subtree – Use ByPass() • “Branch and bound” strategy – This saves us from looking at (n – l + 1)t-i leaves Pseudocode for Branch and Bound Motif Search 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. BranchAndBoundMotifSearch(DNA,t,n,l) s (1,…,1) bestScore 0 i1 while i > 0 if i < t optimisticScore Score(s, i, DNA) +(t – i ) * l if optimisticScore < bestScore (s, i) Bypass(s,i, n-l +1) else (s, i) NextVertex(s, i, n-l +1) else if Score(s,DNA) > bestScore bestScore Score(s) bestMotif (s1, s2, s3, …, st) (s,i) NextVertex(s,i,t,n-l + 1) return bestMotif The median string problem • Enumerate 4l strings v • For each v, compute TotalDist(v, DNA) – This requires linear scan of DNA, i.e., O(nt) • Overall: O(nt4l) • Improvement by branch and bound ? • During enumeration of l-mers, suppose we are at some prefix v’, and find that TotalDist(v’,DNA) > BestDistanceSoFar. • Why enumerate further ? Bounded Median String Search 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. BranchAndBoundMedianStringSearch(DNA,t,n,l ) s (1,…,1) bestDistance ∞ i1 while i > 0 if i < l prefix string corresponding to the first i nucleotides of s optimisticDistance TotalDistance(prefix,DNA) if optimisticDistance > bestDistance (s, i ) Bypass(s,i, l, 4) else (s, i ) NextVertex(s, i, l, 4) else word nucleotide string corresponding to s if TotalDistance(s,DNA) < bestDistance bestDistance TotalDistance(word, DNA) bestWord word (s,i ) NextVertex(s,i,l, 4) return bestWord Today’s summary • • • • • DNA arrays, measuring gene expression The motif finding problem The median string problem Enumeration with search trees Branch and bound