Download Slides

CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project Homework problem 3.1 • Count separately the number of character comparisons and the number of steps needed to find the next matching character using the bad character rule • Question: can you give an example? Extended bad character rule k T: xpbctbxabpqqaabpqz P: tpabxab In this iteration: *^^ Find T(k) in P that is # of comparison = 3 immediately left to i, shift P to align T(k) with that position P: tpabxab char Position in P a 6, 3 b 7, 4 p 2 t 1 x 5 Table lookup: 2 Results for some real genes llr = 394 E-value = 2.0e-023 llr = 347 E-value = 9.8e-002 llr = 110 E-value = 1.4e+004 Strategies to improve results • Combine results from different algorithms usually helpful – Ones that appeared multiple times are probably more interesting • Except simple repeats like AAAAA or ATATATATA – Cluster motifs into groups according to their similarities Strategies to improve results • Compare with known motifs in database – TRANSFAC – JASPAR • Issues: – How to determine similarity between motifs? • Alignment between matrices – How similar is similar? • Empirically determine some threshold Strategies to improve results • Statistical test of significance – Enrichment in target sequences vs background sequences Target set T Assumed to contain a common motif, P Background set B Assumed to not contain P, or with very low frequency Ideal case: every sequence in T has P, no sequence in B has P Statistical test for significance P Background set + target set B+T Target set T Size = N Appear in n sequences Appear in m sequences Size = M • Intuitively: if n / N >> m / M – P is enriched (over-represented) in T – Statistical significance? Hypergeometric distribution • A box with M balls, of which N are red, and the rest are blue. – Red ball: target sequences – Blue ball: background sequences • If we randomly draw m balls from the box, • what’s the probability we’ll see n red balls?  N  M  N     n mn  Hypegeom ( M , N , m, n)    M    m – If probability very small, we are probably not drawing randomly • Total # of choices: (M choose m) • # of choices to have n red balls: (N choose n) x (M-N choose m-n) Cumulative hypergeometric test for motif significance • We are interested in: if we randomly pick m balls, how likely that we’ll see at least n red balls?  N  M  N     min( m , N )  i  m  i   cHypegeom ( M , N , m, n)   M  i n   m This can be interpreted as the p-value for the null hypothesis that we are randomly picking. Alternative hypothesis: our selection favors red balls. Equivalent: the target set T is enriched with motif P. Or: P is over-represented in T.  N  M  N     n 1  i  m  i    1  M  i 0   m Examples • • • • • • • Yeast genome has 6000 genes Select 50 genes believed to be co-regulated by a common TF Found a motif for these 50 genes It appeared in 20 out of these 50 genes In the whole genome, 100 genes have this motif M = 6000, N = 50, m = 100+20 = 120, n = 20 Intuition: – m/M = 120/6000. In Genome, 1 out 50 genes have the motif – N = 50, would expect only 1 gene in the target set to have the motif – 20-fold enrichment • P-value = 6 x 10-22 • n = 5. 5-fold enrichment. P-value = 0.003 • Normally a very low p-value is needed, e.g. 10-10 0 0 10 10 hypegeom(6000, 50, 120, n) chypegeom(6000, 50, 120, n) hypegeom(6000, 50, 120, n) chypegeom(6000, 50, 120, n) -1 10 -20 10 -2 10 -3 -40 10 10 -4 10 -60 10 -5 10 -6 10 -80 10 -7 10 10 0 2 4 6 n -100 0 10 20 30 n 40 50 8 10 ROC curve for motif significance • Motif is usually a PWM • Any word will have a score – – – – Typical scoring function: Log P(W | M) / P(W | B) W: a word. M: a PWM. B: background model • To determine whether a sequence contains a motif, a cutoff has to be decided – With different cutoffs, you get different number of genes with the motif – Hyper-geometric test first assumes a cutoff – It may be better to look at a range of cutoffs ROC curve for motif significance P Target set T Size = N • • • • • • Given a score cutoff Appeared in n sequences Appeared in m sequences Background set + target set B+T Size = M With different score cutoff, will have different m and n Assume you want to use P to classify T and B Sensitivity: n / N Specificity: (M-N-m+n) / (M-N) False Positive Rate = 1 – specificity: (m – n) / (M-N) With decreasing cutoff, sensitivity , FPR  ROC curve for motif significance A good cutoff Lowest cutoff. Every sequence has the motif. Sensitivity = 1. specificity = 0. sensitivity 1 ROC-AUC: area under curve. • 1: perfect separation. • 0.5: random. Motif 1 Motif 2 Random 0 0 1-specificity Motif 1 is better than motif 2. 1 Highest cutoff. No motif can pass the cutoff. Sensitivity = 0. specificity = 1. Other strategies • Cross-validation – Randomly divide sequences into 10 sets, hold 1 set for test. – Do motif finding on 9 sets. Does the motif also appear in the testing set? • Phylogenetic conservation information – Does a motif also appears in the homologous genes of another species? – Strongest evidence – However, will not be able to find species-specific ones Other strategies • Finding motif modules – Will two motifs always appear in the same gene? • Location preference – Some motifs appear to be in certain location • E.g., within 50-150bp upstream to transcription start – If a detect motif has strong positional bias, may be a sign of its function • Evidence from other types of data sources – Do the genes having the motif always have similar activities (gene expression levels) across different conditions? – Interact with the same set of proteins? – Similar functions? – etc. To search for new instances • Usually many false positives • Score cutoff is critical • Can estimate a score cutoff from the “true” binding sites Motif finding Scoring function A set of scores for the “true” sites. Take mean - std as a cutoff. (or a cutoff such that the majority of “true” sites can be predicted). To search for new instances • Use other information, such as positional biases of motifs to restrict the regions that a motif may appear • Use gene expression data to help: the genes having the true motif should have similar activities • Phylogenetic conservation is the key Final project • Write a review paper on a topic that we didn’t cover in lectures Or • Implement an algorithm and do some experiments • Compare several algorithms (existing implementation ok) • Combine several algorithms to form a pipeline (e.g. gene expression + motif analysis) • Final: – 5 -10 pages report (single space, single column, 12pt) + 15 minutes presentation Possible topics for term paper • Possible topics: – Haplotype inferencing – Computational challenges associated with new microarray technologies – Phylogenetic footprinting – Small RNA gene / target prediction (siRNA, mRNA, …) – Biomedical text mining – Protein structure prediction – Topology of biological networks An example project • • • • • Given a gene expression data (say cell cycle) Cluster genes using k-means Find motifs using several algorithms (Cluster and combine similar motifs) Rank motifs according to their specificity to the target sequences comparing to the other clusters • Get their logos • Use the sequences to search the whole genome for more genes with the motif • Do they have any functional significance?

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Slides