* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Presentation
Survey
Document related concepts
Transcript
Motif Discovery in Protein Sequences using Messy De Bruijn Graph Mehmet Dalkilic and Rupali Patwardhan Goal The goal of this project is to develop an algorithm that can take advantage of the properties of De Bruijn graphs for discovering motifs in protein sequences. Outline of Presentation Motivation and Background Approach Implementation Applications Future Work Motivation Most of the popular motif discovery algorithms being used right now depend on statistical significance to find the motif. This project explores computational and graph theoretic ways of doing the same thing without using statistical significance. Such an approach could drastically reduce the time required to search for motifs. What is a De Bruijn Graph? De Bruijn Graph is a graph whose nodes are sequences of symbols from some alphabet and whose edges indicate the sequences which might overlap. The parameters are nodelength(n) and overlap(k). So if n=4 and k=3, an edge ACAT CATS represents the sequence 'ACATS' Example If we have a sequence ABCDEFG, and we take nodelength=4 and overlap=3, we will can represent this same sequence by the following De Bruijn Graph ABCDEFG ABCD Node Length = 4 Overlap = 3 BCDE CDEF DEFG Applying this to Identify Repeating Sub-sequences If we have a bunch of sequences, we can go on adding corresponding nodes and edges to our De Bruijn graph. If any sub-sequence is repeated, the corresponding edge will already be present in that graph. So we just increment the weight of that edge. Eventually the edges corresponding to highly repeated sequences will have higher weights. Now we can find the motif by simply following the graph along these edges with weights above a specified threshold . Example Sequence 1: PAKARCDEKD Sequence 2: ARCDEKHKH Constructing the De Bruijn Graph for these sequences … PAKARCDEKD ARCDEKHKH PAKA 1 AKAR 1 KARC 1 ARCD 2 1 2 CDEK DEKD 1 DEKH 1 1 EKHK KHKH RCDE Making them Messy In the context of protein sequences, some amino acid residues can be substituted without affecting the function of the protein. So a sequence could be considered 'similar' to an edge though its not exactly same. Similarity is determined in the context of a standard scoring matrix, such as BLOSUM62. In that case, we increment weights of all edges that represent sequences that are ‘similar’ to the one in question. Example Consider the same 2 sequences as before, but with K replaced by R in one of them. PAKARCDERD ARCDEKHKH As per BLOSUM62, K and R have a positive substitution score. PAKARCDERD ARCDEKHKH PAKA 1 AKAR 1 KARC 1 ARCD 2 1 1.75 CDER DERD RCDE 1 CDEK 1 1 DEKH 1 EKHK KHKH Another Example > Sequence 1 DMLKLCDKADDKMNDRLDDYLKLDD > Sequence 2 EAKDKFDFKDFKLCDKADDARTYVH > Sequence 3 GTYYYCPGHKLCDEADDFFHVDDTE > Sequence 4 LKLCDKANDYRPYYPITDPLMMNHI > Sequence 5 GTYKPGHKLCDEADDFFHENDTEKYC > Sequence 6 KLCDKADDYRPYYPITDPLGATAKHI Another Example > Sequence 1 DMLKLCDKADDKMNDRLDDYLKLDD > Sequence 2 EAKDKFDFKDFKLCDKADDARTYVH > Sequence 3 GTYYYCPGHKLCDEADDFFHVDDTE > Sequence 4 LKLCDKANDYRPYYPITDPLMMNHI > Sequence 5 GTYKPGHKLCDEADDFFHENDTEKYC > Sequence 6 KLCDKADDYRPYYPITDPLGATAKHI Sample output … http://biokdd.informatics.indiana.edu/r patward/L519/project/ex1.html http://biokdd.informatics.indiana.edu/rp atward/L519/project/ttt.gif Results When 41 sequences belonging to PS00021 family were given as input The best motif output was YCRNPD The Prosite Reg Ex for this family is [FY]-C-R-N-P-[DNR]. http://biokdd.informatics.indiana.edu/r patward/L519/project/PS00021_op.ht ml Possible Applications To predict if a given protein sequence is likely to belong to a particular protein family or not. To construct regular expressions for protein families. To fine-tune the results of clustering algorithms, by helping to decide whether to merge two clusters or not. Do preprocessing to improve the performance of other motif discovery algorithms. Limitation of this Approach The motif should have at least 3 continuous amino acid residues. So the program runs into trouble if the motif consists of alternate residues. For example, something like AxAxCxDxAxGxC (x could be any residue). The problem is due to the need for overlaps, which is inherent nature of De Bruijn Graphs. Future Work We would like to integrate a machinelearning aspect to dynamically change the node length and other parameters to find the optimal motif. We also want to try to extend this approach to do clustering itself. Link to the Implementation http://biokdd.informatics.indiana.edu/rp atward/L519/project.html Acknowledgement I would like to thank Dr. Mehmet Dalkilic for his ideas and support.