Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Motif finding: Lecture 1 CS 498 CXZ From DNA to Protein: In words 1. DNA = nucleotide sequence • Alphabet size = 4 (A,C,G,T) 2. DNA mRNA (single stranded) • AATACGAAGTAA Alphabet size = 4 (A,C,G,U) AAUACGAAGUAA 3. mRNA amino acid sequence • Alphabet size = 20 4. Amino acid sequence “folds” into 3dimensional molecule called protein Asn Thr Lys Stop Gene expression • Process of making a protein from a gene as template • Transcription, then translation • Can be regulated Transcription • Process of making a single stranded mRNA using double stranded DNA as template • Only genes are transcribed, not all DNA Step 1: From DNA to mRNA Transcription SOURCE: http://academy.d20.co.edu/kadets/lundberg/DNA_animations/rna.dcr Transcriptional regulation TRANSCRIPTION FACTOR GENE ACAGTGA PROTEIN Transcriptional regulation TRANSCRIPTION FACTOR GENE ACAGTGA PROTEIN The importance of gene regulation QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Genetic regulatory network controlling the development of the body plan of the sea urchin embryo Davidson et al., Science, 295(5560):1669-1678. • That was the “circuit” responsible for development of the sea urchin embryo • Nodes = genes • Switches = gene regulation • Change the switches and the circuit changes • Gene regulation significance: – Development of an organism – Functioning of the organism – Evolution of organisms Binding sites and motifs Binding sites • Binding sites of transcription factor “Bicoid”, collected experimentally http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. TAAT C C C http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. Motif (“Consensus String”) WAAT C C N QuickTime™ and a TIFF (Uncompressed) decompressor are needed to see this picture. http://webdisk.berkeley.edu/~dap5/data_04/motifs/bicoid.gif W = T or A N = A,C,G,T Motif Motif • Common sequence “pattern” in the binding sites of a transcription factor • A succinct way of capturing variability among the binding sites Alternative way to represent motif 1 1 9 9 0 0 0 1 A 6 0 0 0 0 9 8 7 C 1 0 0 0 QuickTime™ 1 0 0 1 G and a TIFF (Uncompressed) decompressor are needed to see this picture. 1 8 0 0 8 0 1 0 T Position weight matrix (PWM) Or simply, “weight matrix” Motif representation • Consensus string – May allow “degenerate” symbols in string, e.g., N = A/C/G/T; W = A/T; S = C/G; R = A/G; Y = T/C etc. • Position weight matrix – More powerful representation – Probabilistic treatment The motif finding problem • Suppose a transcription factor (TF) controls five different genes • Each of the five genes should have binding sites for TF in their promoter region Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Binding sites for TF The motif finding problem • Now suppose we are given the promoter regions of the five genes G1, G2, … G5 • Can we find the binding sites of TF, without knowing about them a priori ? – Binding sites are similar to each other, but not necessarily identical • This is the motif finding problem • To find a motif that represents binding sites of an unknown TF A variant of motif finding • Given a motif (e.g., consensus string, or weight matrix), find the binding sites in an input sequence • For consensus string, problem is trivial – For each position l in input sequence, check if substring starting at position l matches the motif. • For weight matrix, not so trivial Binding sites from a weight matrix motif • Given a string 1 1 9 s9 of 0 0length 0 1 A l = 7 • s = s1s2…sl 6 0 0 0 0 9 8 7 C W • Pr(s | W) =1 0 0 0 W 1 s0 k0 1 G k .11 .11 1 1 0 0 0 .11 .67 0 0 0 0 1 .89 .78 .11 0 0 0 .11 0 0 .11 A C G k • Example: 1 8 0 0 8 0 1 0 T Pr(CTAATCCG) = 0.67 x 0.89Counts x 1 xof 1each x 0.89 base each column x 1 x0.89 xIn 0.11 .11 .89 0 0 .89 0 .11 0 T Probability of each base In each column Wk = probability of base in column k Binding sites from a weight matrix motif • Given sequence S (e.g., 1000 base-pairs long) • For each substring s of S, – Compute Pr(s|W) – If Pr(s|W) > some threshold, call that a binding site • Look at S, as well as its “reverse complement” – Rev.Compl. of AGTTACACCA is TGGTGTAACT – (That’s what is on the other strand of DNA) Ab initio motif finding • The original motif finding problem • To find a motif that represents binding sites of an unknown TF Ab initio motif finding • Define a motif score, find the motif with maximum score over all possible motifs in search space (motif model) • Consensus string model => exhaustive search algorithm, guarantee on finding the optimal motif • PWM model => local search, not guaranteed to find optimal motif. Ab initio motif finding consensus string motifs • A precise motif model defines the search space (I.e., a list of all candidate motifs). • The motif model also prescribes exactly how to determine if a substring is a match to a particular motif. • Define motif model precisely Ab initio motif finding consensus string motifs • E.g., string over alphabet {A,C,G,T} of fixed length l. If l = 4, all 256 strings AAAA, AAAT, AAAC, …, TTTT, are “candidate motifs”. • E.g., string over alphabet {A,C,G,T} of fixed length l, and allowing up to d mismatches. If AAAA is a motif, and d=1, then AAAT, AATA etc. are also counted as matches to motif. • E.g., string over extended alphabet {A,C,G,T,N} of fixed length l. Here “N” stands for any character (A,C,G,or T.) – If AANAA is the motif, then AACAA, AAGAA, AATAA or AAAAA are all counted as matches to this motif. Ab initio motif finding consensus string motifs • Define a motif score, i.e., a real number associated with each candidate motif, in relation to the input sequences. • E.g., count Ns of a motif s in input sequences(s). • E.g., some function of the motif count Ns. – E.g., Zs = (Ns - Es)/s – Es is the expected count of motif s in random sequences; and – s is the variance of the count in random sequences Ab initio motif finding consensus string motifs • For each motif s in the search space, – Compute the score of s • Output the highest scoring motifs. • This is the “enumerative” algorithm. • Guaranteed to produce the optimal motif, since every possible motif is considered. • Guarantee possible due to small search space. (E.g., 4l where l is the motif length). • Cant handle large values of l (e.g., > 10) : exponential growth of running time. Ab initio motif finding PWM motifs • • • • Local search techniques, e.g., Gibbs sampling Expectation Maximization Greedy Gibbs sampling: The search space • Input: a set of sequences {S1,S2,…,Sn} • Input: motif length l • Candidate motif: A set of substrings {s1,s2,…,sn}, each of length l, one from each Si. • Search space: all possible candidate motifs – O(Ln) where L is length of each Si. Gibbs sampling: algorithm • Consider any candidate motif {s1,s2,…,sn},where each si is of length l • Let Wk be the frequency of base in the kth position of the candidate motif – Pr(s|W) = W sk k k • Let “background” (genome-wide) frequency of nucleotide be q Gibbs sampling: algorithm • Let current motif be Wt = {s1,s2,…,sn} • Pick one si to replace • For each substring s’ in Si, replace si with s’ and compute W sk k Pr(s'|W ) Pr(s') Pr(s | background) qsk Gibbs sampling: algorithm • Pick s’ with probability proportional to Pr(s’) as computed • Replace si with s’ to obtain new current motif Mt+1 • Keep updating motif • Report the motif with maximum score