Pattern Recognition
CIS 786
Prof. Barry Cohen
Pavan Tipirneni, Niranjan Mulay, Rana Farha, Ketal Patel

What is Pattern Recognition?
• A technique to identify interesting patterns of events, such as amino acids, nucleotides, or gene expression levels, that appear a number of times in a particular set of data.

Pattern Recognition in Molecular Biology
• Human Genome Project
• Protein analysis
• Gene Expression & DNA Micro Analysis
• Drug Discovery

Pattern Discovery in Proteins
• Three main steps
- Proteins related to a query sequence are found by searching the database for similar sequences.
- Sequences revealed by this initial screen are then used as query sequences to search for other family members.
- This process is repeated until exhaustion.

Tandem Repeats
• These are two or more contiguous, approximate copies of a pattern of nucleotides.
• These duplicates occur as a result of mutational events in which an original segment of DNA, the pattern, is converted into a sequence of individual copies.
• They have been linked to a number of different diseases.
• They might play a role in gene regulation and in the development of immune system cells.

Types of Patterns
• Deterministic: the pattern either matches a given string or it does not.
• Probabilistic: each sequence is assigned a probability that it was generated by a model. The higher the probability, the better the match between sequence and pattern.

TEIRESIAS Algorithm
• TEIRESIAS searches for patterns consisting of characters of the alphabet Σ and wild-card characters ‘.’.
• Ambiguous character: a character corresponding to a subset of Σ. Ex. A-[LF]-G
• Wild-card or don’t-care: a special kind of ambiguous character that matches any character in Σ. Ex. N in nucleotide sequences, X in protein sequences; it is also denoted by ‘.’.
• Flexible gap: a gap of variable length. Ex. X(4,6) matches any gap of length 4, 5 or 6. X(I) denotes a fixed gap of length I.
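The pattern elements listed above (ambiguous characters, wild-cards, fixed and flexible gaps) map naturally onto regular expressions. The following is a minimal sketch, assuming dash-separated patterns in the style of the A-[LF]-G example; the function name pattern_to_regex is hypothetical and is not part of TEIRESIAS.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate a dash-separated pattern such as 'A-[LF]-G' or
    'C-X(4,6)-H' into an equivalent regular expression string."""
    parts = []
    for element in pattern.split("-"):
        # X(4,6) is a flexible gap, X(3) is a fixed gap
        gap = re.fullmatch(r"[Xx]\((\d+)(?:,(\d+))?\)", element)
        if gap:
            low, high = gap.group(1), gap.group(2)
            parts.append(f".{{{low},{high}}}" if high else f".{{{low}}}")
        elif element in ("X", "x", "."):
            # a single wild-card matches any one character
            parts.append(".")
        else:
            # a literal character or a bracketed ambiguous character
            parts.append(element)
    return "".join(parts)

if __name__ == "__main__":
    for p in ("A-[LF]-G", "C-X(4,6)-H", "C-X(2)-H"):
        print(p, "->", pattern_to_regex(p))
```

Treating X as a wild-card is an assumption that suits protein-style patterns; for nucleotide patterns the same role would be played by N.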
(L,W) Patterns
Pattern P is an (L,W) pattern iff:
• P is a string of characters from Σ and wild cards ‘.’.
• P starts and ends with a character from Σ.
• Any subpattern of P (i.e. a substring starting and ending with a character from Σ) containing exactly L non-wildcard characters has length at most W.
Ex. For L=3 and W=5: AF..CH..E

Algorithm
• Idea: If a pattern P is an (L,W) pattern occurring in at least K sequences, then its subpatterns are also (L,W) patterns occurring in at least K sequences.
• Necessary condition: K >= 2
• P is more specific than Q if we can get Q from P by removing several characters from P and replacing several non-wildcard characters with wildcard characters.
• Ex: AB.CD.E is more specific than AB..D.

Two Phases
The algorithm works in two phases.
Scanning phase: finds all (L,W) patterns occurring in at least K sequences that contain exactly L non-wildcards.
Pruned exhaustive search:
• Find a short pattern that appears in K input sequences.
• Extend it as long as the support does not drop below K.
• Once we find a pattern that cannot be extended further, the pattern is maximal and can be written to the output.

Convolution Phase
For each elementary pattern P, try to extend the pattern with other elementary patterns.
Extend pattern P:
• While there exists an elementary pattern Q which can be glued to the left side of P:
   - Take the Q which is largest in suffix ordering.
   - Let R be the pattern resulting from gluing Q to the left side of P.
   - If pattern R has at least K occurrences and is maximal with respect to the set of already reported patterns:
      - Try to extend pattern R with other elementary patterns.
      - If pattern R has the same number of occurrences as pattern P, then P is not maximal and we do not need to search for other extensions of P.
   - Otherwise pattern P is not a significant pattern.
• Repeat the same process for the elementary patterns which can be glued to the right side of P.
• Report pattern P.
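The (L,W) definition and the support threshold K can be checked mechanically. The sketch below follows the three conditions as stated on the slide; the helper names is_lw_pattern and support are hypothetical, and the code assumes an alphabet of plain letters so that a pattern can be reused directly as a regular expression.

```python
import re

def is_lw_pattern(pattern: str, L: int, W: int) -> bool:
    """Check the slide's (L,W) conditions for a pattern over letters and '.'."""
    # P must start and end with a character from the alphabet
    if not pattern or pattern[0] == "." or pattern[-1] == ".":
        return False
    # positions of the non-wildcard characters
    solid = [i for i, ch in enumerate(pattern) if ch != "."]
    # every window of exactly L non-wildcards must span at most W positions
    # (vacuously true if the pattern has fewer than L non-wildcards)
    return all(
        solid[i + L - 1] - solid[i] + 1 <= W
        for i in range(len(solid) - L + 1)
    )

def support(pattern: str, sequences: list[str]) -> int:
    """Number of sequences that contain at least one match of the pattern."""
    regex = re.compile(pattern)  # '.' already means "any character" in a regex
    return sum(1 for seq in sequences if regex.search(seq))

if __name__ == "__main__":
    print(is_lw_pattern("AF..CH..E", L=3, W=5))
    print(support("A.C", ["ABC", "AXC", "AC"]))
```

For the slide's example AF..CH..E with L=3, every window of three non-wildcards (AF..C, F..CH, CH..E) has length exactly 5, so the pattern qualifies for W=5 but not for W=4.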
Demonstration
• http://cbcsrv.watson.ibm.com/Ttwpd.html
• Example for the convolution phase:
QK…LLI.K.PFQ…R.I
FQ…R.IAQ..K.D.R
QK…LLI.K.PFQ…R.I.AQ..K.D.R

Snapshots
[Screenshots: For L=2 W=3 K=2; For L=2 W=4,5,6,7,8 K=2; For L=2 W=9…. K=2]

Pattern Discovery Approaches
• Different Pattern Discovery Approaches
• Depth First Approach of PRATT

Other Approaches
• Sequence pattern discovery
• Structural pattern discovery [7]
• Enumeration (brute force)
• Pruning (divide-and-conquer)
• NP-hard – machine learning

What is PRATT?
• Pattern discovery software
• Uses pattern graphs
• Uses a depth-first algorithm

Depth First Algorithm
[Figures]

Depth First in Pattern Discovery
Sequences: abb, aab, bab
K (supp) = 2
empty
b supp=3
a supp=3
ab supp=3
ba supp=1
aa supp=1
aba supp=0
bb supp=1
abb supp=1
Result is ab, b and a.

Advantages
• Fast on average inputs [6]
• Finds maximal patterns [6]
• Practically linear-time algorithm

SPLASH: Structural Pattern Localization Analysis by Sequential Histograms
• Pattern discovery is usually reduced to an enumeration and verification problem or a multiple alignment problem.
• Either of these classes of problems is NP-hard, so most of the solutions that have been proposed use heuristics or ad hoc constraints to discover patterns effectively. E.g.:
• Probabilistic algorithms such as MEME maximize a likelihood function.
• Enumeration algorithms such as PRATT limit the maximum size of discovered patterns to avoid exponential requirements on system memory.
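The small depth-first example earlier in this part (sequences abb, aab, bab with K=2) can be reproduced with a few lines of code. This is only a sketch of depth-first enumeration with support pruning over plain substrings; it does not use PRATT's pattern graphs, and the function names are hypothetical.

```python
def support(pattern: str, sequences: list[str]) -> int:
    """Number of sequences that contain the pattern as a substring."""
    return sum(1 for seq in sequences if pattern in seq)

def depth_first_patterns(sequences: list[str], k: int) -> dict[str, int]:
    """Enumerate all substrings with support >= k, depth first.

    A branch is abandoned as soon as its support drops below k,
    because no extension of an infrequent pattern can be frequent.
    """
    alphabet = sorted(set("".join(sequences)))
    found: dict[str, int] = {}

    def extend(prefix: str) -> None:
        for ch in alphabet:
            candidate = prefix + ch
            supp = support(candidate, sequences)
            if supp >= k:
                found[candidate] = supp
                extend(candidate)  # go deeper before trying siblings

    extend("")
    return found

if __name__ == "__main__":
    print(depth_first_patterns(["abb", "aab", "bab"], k=2))
```

On the three example sequences the search keeps a, ab and b (each with support 3) and prunes aa, ba, bb, aba and abb, which agrees with the result stated on the slide.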
• Splash is a deterministic pattern discovery algorithm which can find sparse amino acid or nucleic acid patterns matching identically in a set of protein or DNA sequences.
• Splash can deal with very general patterns that are defined through arbitrary homology metrics. This means Splash is not limited to the detection of identity in signals but can just as easily detect similarity.

Pattern discovery by Splash
Given a set of protein or DNA sequences A1, A2, ..., An, Splash will discover patterns of the form T(T ∪ ‘.’)*T, where T is an amino acid, a nucleic acid, or a class of amino acids, and ‘.’ is a wild card character. T is called a token.
E.g.:
String 1: A L C A L F A A G S K Q
String 2: K C A Q W S G G R N P S
Pattern: CA.[FW]..G

Constraints:
• Minimum support: there are two choices:
a) The pattern must occur at least j0 times in the set of sequences.
b) The pattern must occur in at least j0 independent sequences.
• Density constraint: patterns must have at least k0 matching tokens in each substring of length w0 that starts with a token. These parameters can be set independently.
• Identical matches: either one or two characters in the pattern must match identically.
• Length: patterns are reported only if they have at least l0 tokens.

Algorithm:
An initial density constraint (k0, lmin) and minimum support j0 are chosen.

How it works:
• Splash uses the MOTIF algorithm as its starting point and combines it with the maximality principle. It works as follows:
1) Enumerate all L-tuples of amino acids that appear in the input set and for which the distance between the first and the last amino acid is bounded from above by W. Those L-tuples whose number of instances exceeds the threshold are used as anchor regions to induce local alignment patterns. This is the principle of MOTIF.
2) If fewer than n0 patterns are found, decrease the density constraint while progressively increasing the value of l0.
3) If the value of lmax is exceeded without discovering at least n0 patterns, the minimum support j0 is decreased and the procedure is repeated.
4) If a predefined support threshold jmin is reached without any pattern being discovered, the procedure is halted and no pattern is reported.
Note: Patterns are reported only if their z-score is greater than or equal to a predefined threshold z0. The z-score is the number of standard deviations by which the observed count lies away from the mean of the expected number of patterns of that type in a randomized database. It is a measure of the statistical significance of the pattern computed by Splash.

Performance: A comparison with PRATT

Applications:
• Exhaustive motif discovery
• Hierarchical motif discovery
• Remote homology detection
• Analysis of data from gene expression arrays
• Phylogeny
• The analysis of promoter regions
• Analysis and prediction of protein secondary and tertiary structure

Exhaustive Motif Discovery
Splash can be used to exhaustively analyze a sequence database for all non-overlapping motifs that are statistically significant. This is useful to identify, in order of relative sequence support, all regions of a protein family that have been preserved by evolution and may therefore play a structural role.

Example with Trypsin
Trypsin protein patterns

Comparison with TEIRESIAS:
1) TEIRESIAS takes exponential time for sparse patterns, a disadvantage which Splash overcomes.
2) TEIRESIAS enumerates only patterns consistent with the data set. Splash is not limited to a fixed alphabet size.
3) Patterns have to match identically in TEIRESIAS. They do not have to when using Splash, as it uses a homology metric rather than a distance metric.
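The z-score mentioned in the note above is a standard quantity and can be illustrated with a few lines of code. The sketch below assumes that the mean and standard deviation are estimated empirically from pattern counts in several randomized (for example shuffled) databases; this estimation procedure and the function name z_score are assumptions for illustration, not a description of how Splash itself computes significance.

```python
from statistics import mean, pstdev

def z_score(observed: float, randomized_counts: list[float]) -> float:
    """Standard deviations between the observed count and the mean
    count seen in randomized databases."""
    mu = mean(randomized_counts)
    sigma = pstdev(randomized_counts)  # population standard deviation
    if sigma == 0:
        raise ValueError("randomized counts have zero variance")
    return (observed - mu) / sigma

if __name__ == "__main__":
    # hypothetical counts of one pattern type in eight shuffled databases
    background = [2, 4, 4, 4, 5, 5, 7, 9]
    print(z_score(10, background))
```

A pattern would then be reported only if this value is at least the chosen threshold.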
Pattern recognition techniques are used in:
• Text mining
• Protein structure characterization and prediction
• Promoter signal detection
• Gene expression analysis

Tools Provided by IBM Bioinformatics and Pattern Discovery Group
• Protein annotation w/Biodictionary
• Gene expression analysis
• Sequence pattern discovery
• Multiple sequence alignment
• Gene discovery
• Motif discovery

Protein annotation w/Biodictionary
• An important task is to find the membership of a sequence in a protein family, metal binding, domains of the amino acid sequence, and structural conformation such as helix or turn.
• It uses the TEIRESIAS algorithm.
Input
• The sequence is entered in FASTA format.
• The query is searched against the patterns available in a database called the Biodictionary.
Output
• A plot of the similarities that the query sequence has with other sequences in the database, in descending order.
• Features such as active sites, binding sites, modified sites, signals and various domains that can be identified in the processed query.
FASTA format sequence:
>APE_HUMAN_fragment
LRVRLASHLRKLRKRLLRDA

Gene Expression
• Gene expression is the process by which a gene’s coded information is converted into functional products (RNA or protein) in the cell.
• This task analyzes gene expression data using an application of the TEIRESIAS algorithm.
• The technique is designed to gain a quantitative measure of gene expression.

Gene Expression Data
• The data are log values of the ratio of the mRNA concentration in the cell line of interest to that in a reference sample.
• Induction and repression give values of opposite sign.
• Unaffected genes give a value of zero.
Input
• An M x N matrix with the level of expression of the ith gene in the jth species, time point or experimental condition.
• The tool analyzes this M x N matrix using the TEIRESIAS algorithm application.
Output
• A set of patterns composed of multiple numerical intervals, and a plot of the expression ratio of each gene over the time points.
• Highlighting a specific pattern and clicking on sequences lists the input with the pattern underlined for easy identification.
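The sign convention of the gene expression data described above follows directly from taking the logarithm of a ratio. The sketch below builds such an M x N matrix of log ratios; the base-2 logarithm, the function name log_ratio_matrix and the sample numbers are assumptions for illustration, since the slides do not state the base.

```python
import math

def log_ratio_matrix(sample: list[list[float]],
                     reference: list[list[float]]) -> list[list[float]]:
    """Element-wise log2(sample / reference) for two M x N matrices of
    mRNA concentrations (rows are genes, columns are conditions)."""
    return [
        [math.log2(s / r) for s, r in zip(sample_row, reference_row)]
        for sample_row, reference_row in zip(sample, reference)
    ]

if __name__ == "__main__":
    # hypothetical concentrations for one gene under three conditions
    sample = [[4.0, 1.0, 0.25]]
    reference = [[1.0, 1.0, 1.0]]
    # induced -> positive, unaffected -> zero, repressed -> negative
    print(log_ratio_matrix(sample, reference))
```

A fourfold induction and a fourfold repression give values of equal magnitude and opposite sign, while an unchanged concentration gives zero.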
• Clicking on the plot provides a graph of the expression ratio of each gene in the pattern over the j time points or conditions.
Output
• Choosing the derivative option for the input assigns a “+”, “-” or “=” sign to each numerical value with respect to the jth condition or time point.
• Inverse regulation doubles the input data set: it then contains the original data set and the data set with switched signs. This helps to identify genes which are oppositely regulated.

Conclusion
• The TEIRESIAS algorithm application is very efficient compared to many previously used algorithms.
• You can avoid redundancy.
• A wide variety of protein family databases has been created, such as PRINTS, BLOCKS and PROSITE.

References
[1] http://cbcsrv.watson.ibm.com/Help/aboutTspd.htm
[2] Rigoutsos and Floratos, “Combinatorial Pattern Discovery in Biological Sequences: The TEIRESIAS Algorithm”.
[3] Brona Brejova, Chrysanne DiMarco, Tomas Vinar, Sandra Romero Hidalgo, Gina Holguin, Cheryl Patten, “Finding Patterns in Biological Sequences”.
[4] Andreas Wespi, Marc Dacier and Herve Debar, “An Intrusion Detection System Based on the TEIRESIAS Pattern Discovery Algorithm”.
[5] http://www.celera.com/company/home.cfm?ppage=overview&cpage=faq
[6] http://citeseer.nj.nec.com/cache/papers/cs/21059/http:zSzzSzwww.cs.nyu.eduzSzcswebzSzResearchzSzTheseszSzfloratos_aristidis.pdf/floratos99pattern.pdf
[7] Juris Viksna, David Gilbert, “Pattern Matching and Pattern Discovery Algorithms for Protein Topologies”.
[8] Holm, L., Park, J.: DaliLite workbench for protein structure comparison. Bioinformatics 16 (2000) 566–567.
[9] Orengo, C.A., Michie, A.D., Jones, S., Swindells, M.B.: CATH – a hierarchic classification of protein domain structures. Structure 5 (1997) 1093–1108.
[10] Inge Jonassen, “Efficient discovery of conserved patterns using pattern graphs”, ISSN 0333-3590, March 1996.
[11] J. Vilo. Discovering frequent patterns from strings. Technical Report C-1998-9, Department of Computer Science, University of Helsinki, P. O. Box 26, FIN-00014, University of Helsinki, May 1998.
[12] http://www.soi.city.ac.uk/~drg/seminars/nato_asi/sld026.htm
[13] Brona Brejova, Chrysanne DiMarco, Tomas Vinar, Sandra Romero Hidalgo, “Finding Patterns in Biological Sequences”.
[14] Isidore Rigoutsos, Aris Floratos, Laxmi Parida, Yuan Gao and Daniel Platt, “The Emergence of Pattern Discovery Techniques in Computational Biology”.
[15] Andrea Califano, “SPLASH: Structural Pattern Localization Analysis by Sequential Histograms”.
[16] Andrey Rzhetsky, William Noble Grundy, Reina E. Riemann, Andrea Califano, “An Investigation of distant homology detection methods for multidomain protein families”.
[17] Gustavo Stolovitzky, Andrea Califano, “Statistical Significance of Patterns in Biosequences”.
[18] www.research.ibm.com/splash
[19] Anthony P. Burgard, Gregory L. Moore, and Costas D. Maranas, “Review of the TEIRESIAS-Based Tools of the IBM Bioinformatics and Pattern Discovery Group”.
