Pattern Recognition
CIS 786
Prof. Barry Cohen
Pavan Tipirneni
Niranjan Mulay
Rana Farha
Ketal Patel
What is Pattern Recognition?
• A technique for identifying interesting patterns in events
such as amino acids, nucleotides or gene expression levels
that occur repeatedly in a particular set of data.
Pattern Recognition in Molecular Biology
• Human Genome Project
• Protein analysis
• Gene Expression & DNA Microarray Analysis
• Drug Discovery
Pattern Discovery in Proteins
• Three main steps
- Proteins related to a query sequence are found by
searching the database for similar sequences.
- Sequences revealed by this initial screen are then used
as query sequences to search for other family members.
- This process is repeated until no new family members are found.
Tandem Repeats
• These are two or more contiguous, approximate copies of a
pattern of nucleotides.
• These duplicates occur as a result of mutational events in which
an original segment of DNA, the pattern, is converted into a
sequence of individual copies.
• They have been linked to a number of different diseases.
• They might play a role in gene regulation and in the
development of immune system cells.
Types of Patterns
Deterministic
A sequence either matches a given pattern or it does not.
Probabilistic
Each sequence is assigned the probability that
it was generated by a model.
The higher the probability, the better the
match between sequence and pattern.
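The two pattern types can be contrasted in a few lines. This is a minimal sketch: the regular expression, the position-specific probability table `model` and the function `probability` are hypothetical illustrations, not taken from any particular tool.

```python
import re
from math import prod

# Deterministic: a sequence either matches the pattern or it does not.
deterministic = re.compile(r"A[LF]G")
print(bool(deterministic.fullmatch("ALG")))   # True
print(bool(deterministic.fullmatch("AVG")))   # False

# Probabilistic: a (hypothetical) model gives, for each position,
# the probability of emitting each character.
model = [
    {"A": 0.9, "G": 0.1},
    {"L": 0.6, "F": 0.3, "V": 0.1},
    {"G": 1.0},
]

def probability(seq, model):
    """Probability that `seq` is generated by the position-specific model."""
    if len(seq) != len(model):
        return 0.0
    return prod(column.get(ch, 0.0) for column, ch in zip(model, seq))

# A higher probability means a better match between sequence and pattern.
for seq in ("ALG", "AVG", "TLG"):
    print(seq, probability(seq, model))
```

Note that "AVG" is rejected by the deterministic pattern but still receives a small non-zero probability from the model.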
TEIRESIAS Algorithm
• TEIRESIAS searches for patterns consisting of characters of the
alphabet Σ and wild-card characters ‘.’.
• An ambiguous character is a character corresponding to a subset
of Σ.
Ex. A-[LF]-G
• A wild card (or don’t-care) is a special kind of ambiguous
character that matches any character in Σ. Ex. N in nucleotide
sequences and X in protein sequences; both are also denoted by ‘.’.
• A flexible gap is a gap of variable length. Ex. X(4,6) matches any
gap of length 4, 5 or 6. X(I) denotes a fixed gap of length I.
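The notation above maps closely onto regular expressions. The following is a minimal sketch that assumes hyphen-separated elements as in the A-[LF]-G example; the function name `pattern_to_regex` is hypothetical.

```python
import re

def pattern_to_regex(pattern: str) -> str:
    """Translate a pattern such as 'A-[LF]-x(4,6)-G' into a regular expression.

    Assumed element syntax (illustrative only):
      A        a character of the alphabet
      [LF]     an ambiguous character (any one of the listed characters)
      x or .   a wild card (any single character)
      x(4)     a fixed gap of length 4
      x(4,6)   a flexible gap of length 4 to 6
    """
    parts = []
    for element in pattern.split("-"):
        gap = re.fullmatch(r"[xX.]\((\d+)(?:,(\d+))?\)", element)
        if gap:
            low, high = gap.group(1), gap.group(2)
            parts.append(f".{{{low},{high}}}" if high else f".{{{low}}}")
        elif element in ("x", "X", "."):
            parts.append(".")
        else:
            parts.append(element)  # a literal or a [..] class is already valid regex
    return "".join(parts)

regex = pattern_to_regex("A-[LF]-x(4,6)-G")
print(regex)                                 # A[LF].{4,6}G
print(bool(re.search(regex, "MALKKKKGT")))   # True: gap of length 4
print(bool(re.search(regex, "MALKKGT")))     # False: gap of length 2
```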
(L,W) Patterns
Pattern P is an (L,W) pattern iff
• P is a string of characters from Σ and wild cards ‘.’.
• P starts and ends with a character from Σ.
• Any subpattern of P (i.e. a substring starting
and ending with a character from Σ) containing
exactly L non-wildcard characters has length at
most W.
Ex. For L=3 and W=5
AF..CH..E
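The definition can be checked mechanically. This is a minimal sketch; the function name `is_lw_pattern` is hypothetical, and requiring at least L regular characters is an assumption made here.

```python
def is_lw_pattern(pattern: str, L: int, W: int, wildcard: str = ".") -> bool:
    """Check the (L,W) condition for a pattern written with '.' wild cards."""
    if not pattern or pattern[0] == wildcard or pattern[-1] == wildcard:
        return False  # must start and end with a character from the alphabet
    # Positions of the non-wildcard (regular) characters.
    solid = [i for i, ch in enumerate(pattern) if ch != wildcard]
    if len(solid) < L:
        return False  # assumption: at least L regular characters are required
    # Every run of L consecutive regular characters must span at most W positions.
    return all(solid[i + L - 1] - solid[i] + 1 <= W
               for i in range(len(solid) - L + 1))

print(is_lw_pattern("AF..CH..E", L=3, W=5))    # True
print(is_lw_pattern("AF..C.H..E", L=3, W=5))   # False: 'F..C.H' spans 6 positions
```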

Algorithm
• Idea: If a pattern P is an (L,W) pattern occurring in at least K
sequences, then its subpatterns are also (L,W) patterns occurring
in at least K sequences.
• Necessary condition: K >= 2
• P is more specific than Q if we can get Q from P by removing
characters from P and replacing non-wildcard characters with
wild-card characters.
• Ex: AB.CD.E is more specific than AB..D
Two Phases
The algorithm works in two phases.
Scanning phase: finds all (L,W) patterns that contain exactly L
non-wildcard characters and occur in at least K sequences.
Pruned exhaustive search:
• find a short pattern that appears in at least K input sequences
• extend it as long as the support does not drop below K
• once a pattern cannot be extended further, it is maximal
and can be written to the output.
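The scanning phase can be sketched by brute force on small inputs. This is an illustration only, not the TEIRESIAS implementation: the function name `scan_elementary` is hypothetical, support is counted as the number of sequences containing the pattern, and L >= 2 is assumed.

```python
from collections import defaultdict
from itertools import combinations

def scan_elementary(sequences, L, W, K):
    """Return {pattern: support} for patterns with exactly L regular characters
    and length at most W that occur in at least K sequences (assumes L >= 2)."""
    seen_in = defaultdict(set)        # pattern -> indices of sequences containing it
    for idx, seq in enumerate(sequences):
        for start in range(len(seq)):
            for width in range(L, W + 1):
                end = start + width   # the pattern covers seq[start:end]
                if end > len(seq):
                    break
                # First and last positions are regular; choose the L-2 inner ones.
                for inner in combinations(range(start + 1, end - 1), L - 2):
                    keep = {start, end - 1, *inner}
                    pattern = "".join(seq[i] if i in keep else "."
                                      for i in range(start, end))
                    seen_in[pattern].add(idx)
    return {p: len(s) for p, s in seen_in.items() if len(s) >= K}

print(scan_elementary(["ABCD", "AXCD"], L=2, W=3, K=2))
# {'A.C': 2, 'CD': 2}
```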
Convolution Phase
For each elementary pattern P, try to extend the pattern with other
elementary patterns.
Extend Pattern P:
• While there exists an elementary pattern Q that can be glued to
the left side of P:
  - Take the Q that is largest in the suffix ordering.
  - Let R be the pattern resulting from gluing Q to the left side of P.
  - If R has at least K occurrences and is maximal with respect to
    the set of already reported patterns:
      - Try to extend R with other elementary patterns.
      - If R has the same number of occurrences as P, then P is
        not maximal and we do not need to search for other
        extensions of P.
  - Otherwise R is not a significant pattern.
• Repeat the same process for the elementary patterns that can be
glued to the right side of P.
• Report pattern P.
Demonstration
• http://cbcsrv.watson.ibm.com/Ttwpd.html
• example for convolution phase
QK...LLI.K.PFQ...R.I
FQ...R.IAQ..K.D.R
QK...LLI.K.PFQ...R.I.AQ..K.D.R
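Gluing two patterns on a shared overlap can be sketched at the string level. This sketch assumes the overlap must contain exactly L-1 regular characters. The helper names are hypothetical, the small patterns are made up for illustration, and the real algorithm also has to combine the occurrence lists of both patterns, which is omitted here.

```python
def regular_prefix(pattern, n, wildcard="."):
    """Shortest prefix of `pattern` containing exactly n regular characters."""
    count = 0
    for i, ch in enumerate(pattern):
        if ch != wildcard:
            count += 1
            if count == n:
                return pattern[: i + 1]
    return None  # fewer than n regular characters

def regular_suffix(pattern, n, wildcard="."):
    """Shortest suffix of `pattern` containing exactly n regular characters."""
    reversed_prefix = regular_prefix(pattern[::-1], n, wildcard)
    return None if reversed_prefix is None else reversed_prefix[::-1]

def glue(p, q, L):
    """Glue q to the right side of p if they overlap on L-1 regular characters."""
    overlap = regular_suffix(p, L - 1)
    if overlap is None or overlap != regular_prefix(q, L - 1):
        return None
    return p + q[len(overlap):]

print(glue("AB.CD", "CD..E", L=3))   # AB.CD..E
print(glue("AB.CD", "CE..F", L=3))   # None
```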
Snapshots
For L=2 W=3 K=2
For L=2 W=4,5,6,7,8 K=2
For L=2 W=9…. K=2
Pattern Discovery Approaches
• Different Pattern Discovery Approaches
• Depth First Approach of PRATT
Other Approaches
• Sequence pattern discovery
• Structural pattern discovery [7]
• Enumeration (Brute Force)
• Pruning (Divide-n-conquer)
• NP hard – machine learning
What is PRATT?
• Pattern discovery software
• Uses pattern graphs
• Uses a depth-first algorithm
Depth First Algorithm
Depth-First Search in Pattern Discovery
Sequences:
abb
aab
bab
K (minimum support) = 2
Search tree:
empty
  a supp=3
    aa supp=1
    ab supp=3
      aba supp=0
      abb supp=1
  b supp=3
    ba supp=1
    bb supp=1
Result is ab, b and a.
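The search above can be reproduced with a short recursive routine. This is a minimal sketch of depth-first enumeration over plain substrings, not PRATT itself; the function names are hypothetical and patterns are extended on the right only.

```python
def support(pattern, sequences):
    """Number of sequences that contain `pattern` as a substring."""
    return sum(pattern in s for s in sequences)

def depth_first(sequences, K, alphabet):
    """Return {pattern: support} for every substring pattern with support >= K."""
    found = {}

    def extend(prefix):
        for ch in alphabet:
            candidate = prefix + ch
            supp = support(candidate, sequences)
            if supp >= K:      # otherwise prune: extensions cannot gain support
                found[candidate] = supp
                extend(candidate)

    extend("")
    return found

print(depth_first(["abb", "aab", "bab"], K=2, alphabet="ab"))
# {'a': 3, 'ab': 3, 'b': 3}
```

Pruning is safe because extending a pattern can never increase its support.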
Advantages
• Fast on average inputs[6]
• Finds maximal patterns [6]
• Practically linear time algorithm
SPLASH: Structural Pattern Localization Analysis
by Sequential Histograms
• Pattern discovery is usually reduced to an enumeration
and verification problem or a multiple alignment
problem.
• Both classes of problems are NP-hard, so most
of the solutions that have been proposed use heuristics
or ad hoc constraints to discover patterns effectively.
Eg:
• Probabilistic algorithms such as MEME maximize a likelihood
function.
• Enumeration algorithms such as PRATT limit the maximum size
of discovered patterns to avoid exponential requirements on system
memory.
• Splash is a deterministic pattern discovery algorithm that can
find sparse amino acid or nucleic acid patterns matching identically
in a set of protein or DNA sequences.
• Splash can deal with very general patterns that are defined
through arbitrary homology metrics. This means Splash is not
limited to detecting identity in signals but can just as easily
detect similarity.
Pattern discovery by Splash
Given a set of protein or DNA sequences A1, A2, ..., An,
Splash discovers patterns of the form T(T ∪ ‘.’)*T, where T,
called a token, is an amino acid, a nucleic acid or a class of
amino acids, and ‘.’ is a wild-card character.
Eg:
String 1: A L C A L F A A G S K Q
String 2: K C A Q W S G G R N P S
Pattern: CA.[FW]..G
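The example pattern can be verified against both strings with a regular expression, since the pattern notation coincides with regex syntax here. This is only a check of the example, not a sketch of how Splash discovers the pattern.

```python
import re

sequences = ["ALCALFAAGSKQ", "KCAQWSGGRNPS"]   # String 1 and String 2, spaces removed
pattern = re.compile(r"CA.[FW]..G")

for seq in sequences:
    match = pattern.search(seq)
    print(seq, match.start(), match.group())
# ALCALFAAGSKQ 2 CALFAAG
# KCAQWSGGRNPS 1 CAQWSGG
```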
Constraints:
• Minimum support: there are two choices
a) The pattern must occur at least j0 times in the set of sequences.
b) The pattern must occur in at least j0 independent sequences.
• Density constraint: patterns must have at least k0 matching tokens
in each substring of length w0 that starts with a token. These
parameters can be set independently.
• Identical matches: either one or two characters in the pattern must
match identically.
• Length: patterns are reported only if they have at least l0 tokens.
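The two minimum-support choices count different things, as the following sketch shows. The function names and the toy sequences are hypothetical. Patterns are written as regular expressions without capturing groups, and overlapping occurrences are counted, which is an assumption made here.

```python
import re

def occurrence_support(regex, sequences):
    """Choice (a): total number of occurrences, counting overlapping ones."""
    lookahead = re.compile(f"(?={regex})")   # zero-width, so overlaps are found
    return sum(len(lookahead.findall(s)) for s in sequences)

def sequence_support(regex, sequences):
    """Choice (b): number of independent sequences containing the pattern."""
    return sum(1 for s in sequences if re.search(regex, s))

toy = ["AAAA", "ATAG", "GGGG"]
print(occurrence_support("A.A", toy))   # 3: two in 'AAAA', one in 'ATAG'
print(sequence_support("A.A", toy))     # 2
```

A pattern that repeats many times inside one sequence can satisfy choice (a) without satisfying choice (b).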
Algorithm:
An initial density constraint (k0, lmin) and minimum support j0 are
chosen.
How it works:
• Splash uses the MOTIF algorithm as its starting point and combines
it with the maximality principle.
It works as follows:
1) Enumerate all L-tuples of amino acids that appear in the input set
such that the distance between the first and last elements of the
tuple is bounded above by W. Those L-tuples whose number of
instances exceeds the threshold are used as anchor regions to induce
local alignment patterns. This is the principle of MOTIF.
2) If fewer than n0 patterns are found, decrease the density
constraint while progressively increasing the value of l0.
3) If the value of lmax is exceeded without discovering at least n0
patterns, the minimum support j0 is decreased and the procedure is
repeated.
4) If a predefined support threshold jmin is reached without any
pattern being discovered, the procedure is halted and no pattern is
reported.
Note:
Patterns are reported only if their z-score is greater than or equal to
a predefined threshold z0.
The z-score, computed by Splash, is a measure of the statistical
significance of a pattern: the number of standard deviations
separating the observation from the mean number of patterns of that
type expected in a randomized database.
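The z-score filter can be written out as a small calculation. The numbers below are made up for illustration, and the function name `z_score` and the threshold value are hypothetical. How Splash estimates the mean and standard deviation for a randomized database is not shown here.

```python
def z_score(observed: float, expected_mean: float, expected_std: float) -> float:
    """Number of standard deviations between the observation and the expected mean."""
    if expected_std <= 0:
        raise ValueError("expected_std must be positive")
    return (observed - expected_mean) / expected_std

z0 = 3.0   # hypothetical reporting threshold
z = z_score(observed=12.0, expected_mean=2.0, expected_std=2.5)
print(z)          # 4.0
print(z >= z0)    # True: the pattern would be reported
```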
Performance:
A comparison with PRATT
Applications:
•Exhaustive Motif discovery
•Hierarchical Motif discovery
•Remote Homology Detection
•Analysis of data from gene expression arrays
•Phylogeny
•The analysis of promoter regions
•Analysis and prediction of protein secondary and tertiary
structure.
Exhaustive Motif discovery
Splash can be used to exhaustively analyze a sequence database for
all non-overlapping motifs that are statistically significant. This is
useful for identifying, in order of relative sequence support, all
regions of a protein family that have been preserved by evolution
and may therefore play a structural role.
Example with Trypsin
Trypsin protein patterns
Comparison with TEIRESIAS:
1) Teiresias takes exponential time on sparse patterns,
a disadvantage that Splash overcomes.
2) Teiresias enumerates only patterns consistent with the data
set. Splash is not limited to a fixed alphabet size.
3) Patterns have to match identically in Teiresias; they do not
have to in Splash, which uses a homology metric rather
than a distance metric.
Where pattern recognition techniques are used
Text mining
Protein structure characterization and prediction
Promoter signal detection
Gene Expression analysis
Tools Provided by IBM Bioinformatics and
Pattern Discovery Group
Protein Annotation w/Biodictionary
Gene Expression analysis
Sequence pattern discovery
Multiple sequence alignment
Gene discovery
Motif Discovery
Protein annotation w/Biodictionary
An important task is to determine a sequence's membership in a
protein family, metal-binding sites, domains of the amino acid
sequence, and structural conformation such as helix or
turn
It uses the TEIRESIAS algorithm
Input
The sequence is entered in FASTA format
The query is searched against the patterns available in a
database called Biodictionary
Output
A plot of the similarities that the query sequence has with
other sequences in the database, in descending order
Features such as active sites, binding sites, modified
sites, signals and various domains that can be
identified in the processed query
FASTA format sequence:
>APE_HUMAN_fragment
LRVRLASHLRKLRKRLLRDA
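Reading such an input takes only a few lines. This is a minimal sketch of a FASTA reader; the function name `parse_fasta` is hypothetical and no validation of the residue alphabet is attempted.

```python
def parse_fasta(text):
    """Return {header: sequence} from FASTA-formatted text."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]            # the header line starts a new record
            records[header] = ""
        elif header is not None:
            records[header] += line      # a sequence may span several lines
    return records

example = """>APE_HUMAN_fragment
LRVRLASHLRKLRKRLLRDA
"""
print(parse_fasta(example))
# {'APE_HUMAN_fragment': 'LRVRLASHLRKLRKRLLRDA'}
```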
Gene Expression
Gene expression is the process by which a gene's coded
information is converted into functional products, such as
proteins, in the cell.
This task analyzes gene expression data using an
application of the TEIRESIAS algorithm
A technique designed to obtain a quantitative measure of
gene expression
Gene Expression
Data are log values of the ratio of the mRNA concentration in the
cell line of interest to that in a reference sample.
Induction and repression of genes give values of opposite sign
Unaffected genes give a log ratio of zero
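The sign behaviour follows directly from the logarithm. This is a minimal sketch: the concentrations are made-up numbers, the function name `log_ratio` is hypothetical, and base 2 is an assumption, since only "log values" is stated.

```python
import math

def log_ratio(sample: float, reference: float) -> float:
    """log2 of the ratio of sample to reference mRNA concentration."""
    return math.log2(sample / reference)

print(log_ratio(400.0, 100.0))   # 2.0   induced gene: positive
print(log_ratio(25.0, 100.0))    # -2.0  repressed gene: negative
print(log_ratio(100.0, 100.0))   # 0.0   unaffected gene: zero
```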
Input
An M x N matrix with the level of expression of the ith gene in
the jth species, time point or experimental condition
The tool analyzes this gene expression data using an
application of the TEIRESIAS algorithm
MxN Matrix as an input
Output
A set of patterns composed of multiple numerical
intervals, and a plot of the expression ratio of each gene
over the time points
Highlighting a specific pattern and clicking on
Sequences lists the input with the pattern underlined
for easy identification.
Clicking on Plot provides a graph of the expression ratio
of each gene in the pattern over the j time points or conditions
Output
Choosing the derivative option for the input assigns a
“+”, “-” or “=” sign to each numerical value
with respect to the jth condition or time point.
Inverse Regulation doubles the input data set: it
contains the original data set plus the data set with
switched signs
This helps to identify genes that are
oppositely regulated
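Both options can be sketched on a small matrix. This is an interpretation, not the tool's code: the derivative option is assumed to compare each value with the previous time point, and the function names, the `tolerance` parameter and the data are hypothetical.

```python
def derivative_signs(row, tolerance=0.0):
    """Encode the change between consecutive time points as '+', '-' or '='."""
    signs = []
    for previous, current in zip(row, row[1:]):
        delta = current - previous
        if delta > tolerance:
            signs.append("+")
        elif delta < -tolerance:
            signs.append("-")
        else:
            signs.append("=")
    return "".join(signs)

def with_inverse(matrix):
    """Double the data set: the original rows plus the rows with switched signs."""
    return matrix + [[-value for value in row] for row in matrix]

genes = [
    [0.0, 1.0, 1.0, -0.5],    # gene 1
    [0.0, -1.0, -1.0, 0.5],   # gene 2, regulated oppositely to gene 1
]
print([derivative_signs(row) for row in genes])   # ['+=-', '-=+']

doubled = with_inverse(genes)
print(len(doubled))              # 4
print(doubled[2] == genes[1])    # True: inverted gene 1 has the same profile as gene 2
```

After doubling, an ordinary search for shared patterns also finds pairs such as gene 1 and gene 2, because the inverted copy of one matches the original of the other.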
Conclusion
The TEIRESIAS algorithm application is very efficient
compared to many previously used algorithms
It avoids redundancy
Pattern discovery has produced a wide variety of protein family
databases such as PRINTS, BLOCKS and PROSITE
References
[1] http://cbcsrv.watson.ibm.com/Help/aboutTspd.htm
[2] Rigoutsos and Floratos, Combinatorial Pattern Discovery in
Biological Sequences: The TEIRESIAS Algorithm
[3] Finding Patterns in Biological Sequences by Brona Brejova,
Chrysanne DiMarco, Tomas Vinar, Sandra Romero Hidalgo, Gina
Holguin, Cheryl Patten.
[4] Andreas Wespi, Marc Dacier and Herve Debar : An Intrusion
Detection System Based on the TEIRESIAS Pattern Discovery
Algorithm
[5] http://www.celera.com/company/home.cfm?ppage=overview&cpage=faq
[6] http://citeseer.nj.nec.com/cache/papers/cs/21059/http:zSzzSzwww.cs.nyu.eduzSzcswebzSzResearchzSzTheseszSzfloratos_aristidis.pdf/floratos99pattern.pdf
[7] Juris Viksna, David Gilbert, Pattern Matching and pattern
discovery algorithms for protein topologies
[8] Holm, L., Park, J.: DaliLite workbench for protein structure
comparison.Bioinformatics 16 (2000) 566–567.
[9] Orengo, C.A., Michie, A.D., Jones, S., Swindells, M.B.: CATH
– a hierarchic classification of protein domain structures. Structure
5 (1997) 1093–1108.
[10] Inge Jonassen, Efficient discovery of conserved patterns using
pattern graph, ISSN 0333-3590, March 1996
[11] J. Vilo. Discovering frequent patterns from strings. Technical
Report C-1998-9, Department of Computer Science, University of
Helsinki, P.O. Box 26, FIN-00014, University of Helsinki, May
1998.
[12] http://www.soi.city.ac.uk/~drg/seminars/nato_asi/sld026.htm
[13] Brona Brejova, Chrysanne DiMarco, Tomas Vinar, Sandra
Romero Hidalgo, Finding Patterns in Biological Sequences
[14] Isidore Rigoutsos, Aris Floratos, Laxmi Parida, Yuan
Gao and Daniel Platt “The Emergence of Pattern Discovery
Techniques in Computational Biology”.
[15] Andrea Califano “SPLASH:Structural Pattern
Localization Analysis by Sequential Histograms”.
[16] Andrey Rzhetsky, William Noble Grundy, Reina
E.Riemann, Andrea Califano “An Investigation of distant
homology detection methods for multidomain protein
families”.
[17] Gustavo Stolovitzky, Andrea Califano “Statistical
Significance of Patterns in Biosequences”.
[18] www.research.ibm.com/splash
[19] Anthony P. Burgard, Gregory L. Moore, and Costas D.
Maranas “Review of the TEIRESIAS-Based Tools of the IBM
Bioinformatics and Pattern Discovery Group”.