Download Clustering Sequences in a Metric Space

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Microsoft Jet Database Engine wikipedia , lookup

Concurrency control wikipedia , lookup

Database wikipedia , lookup

Relational model wikipedia , lookup

Extensible Storage Engine wikipedia , lookup

ContactPoint wikipedia , lookup

Clusterpoint wikipedia , lookup

Database model wikipedia , lookup

Transcript
Clustering Sequences in a
Metric Space
The MoBIoS Project
Rui Mao, Daniel P. Miranker, Jacob N. Sarvela and Weijia Xu
Department of Computer Sciences, University of Texas
Austin, TX 78712, USA
{rmao, miranker}@cs.utexas.edu
Research supported in part by the Texas Higher Education Coordinating Board, Texas Advanced Research Program.
Immediate Goal:
Use Metric Space Indexing to
Support Homology Search
1.
2.
3.
Develop tree-based index structure to speed
homology search.
Maintain the use of an evolutionary model of
similarity.
Deliver full Smith-Waterman sensitivity.
2
Metric Space

A metric space[CPRZ97a] is a pair, M=(D,d), where D is
a domain of indexing keys, and d is distance function
with the following properties:
d(Ox,Oy) = d (Oy,Ox)
d(Ox,Oy) > 0, d(Ox,Ox) = 0
d(Ox,Oy) <= d(Ox,Oz) + d(Oz,Oy)
(symmetry)
(non negativity)
(triangle inequality)
3
Metric Space Indexing
 Metric space indexing exploits intrinsic clustering
of the data.
 Hierarchical structure
avoids linear scan of entire database.
in the best case leads to search time logarithmic to
database size.
4
Challenges

Local alignment does not form a metric.
Local Alignment produces a set of answers, a distance
function produce a single number.

Popular evolutionary models (PAM) are not metrics.
PAM Matrices are based on log-odds
–
Negative Values
PAM Matices are Asymmetric
–
Let Pr(x,y) be the probability that amino acid x, mutates to amino
acid y.
–
Pr(x,y)  Pr(y,x)
More similar sequences score higher, not lower
–
Identical sequences must be distance 0 apart
5
From PAM to mPAM - Symmetry
PAM: The computation of PAM
matrix computed frequency of one
amino acid mutating to another.
 mPAM: We model that a pair of
amino acids, one in each sequence,
evolved from a common ancestor
[Gonnet & Korostensky]

X
Y
Z
The probability that amino acid Y and
amino acid Z are from same ancestor
amino acid x is:
Pr(y,z)= f(x)Pr(x,y)Pr(x,z)
6
From PAM to mPAM – Distance vs.
Similarity
PAM computed logodds based on
frequency of mutations
 mPAM: Compute the
expected time for a
particular mutation to
occur.

More frequent
mutations will occur, on
average, in less time.
mPAM matrix
7
Computing Local Alignments
from an Index

Divide the database into small fixed size pieces
Build a metric-space index based on global alignment

Divide the query into small fixed size pieces
For each query piece, use index to find results based on
global alignment.
– Like BLASTS hot spot index, but is fully sensitive

Chain the results together
Intuitively like BLASTS extension of hot-spots
Best algorithm is the last step of “A Sublinear Algorithm for
Approximate Keyword Searching” [Myers94]
8
Initial Results: M-Tree
 M-tree [CPRZ97b]is an open-source Metric-
space indexing package.
 Results for global alignment of Yeast peptide
sequences of length 10.
 Compare M-tree clustering result with farthestfirst traversal bulk load clustering result.
9
fraction of leaves visited to all the leaves vs. database size
fraction of leaves visited to all the leaves
0.79
0.78
0.77
0.76
0.75
0.74
0.73
0
2000
4000
6000
8000
10000
12000
database size
For a set of queries, the average fraction of number of leaves visited in
the searching to the total leaf number decreases while the database size
goes up. This shows that is the database is clustered well.
10
Covering radius vs. tree level of M-tree
350
radii of routing objects
300
average radius
250
min radius
200
max radius
150
100
50
0
0
1
2
3
4
5
6
7
8
tree lev el
The covering radius of routing objects of one level of M-tree
decreases while descending the tree. This shows the database the
hierarchically clustered well.
11
Covering radius vs. tree level of farthest-first
traversal bulk load
radii of routing objects
350
average radius
300
min radius
250
max radius
200
150
100
50
0
0
2
4
tree level
6
8
The covering radius of routing objects of one database tree level created by farthestfirst traversal bulk load decreases while descending the tree. This shows the database
the hierarchically clustered well. The radii here are significantly smaller than those of Mtree, which means that we can build a new index structure that is better than M-tree.
12
Long-term Goal:
Biologists Need a New Kind of DBMS
Traditional relational databases:
 Data is dynamic
 Workload:
Biological databases
 Data is write-once.
 Workload:
Ad-hoc queries based
Data clustering (mining)
Regular, exact, periodic
queries
– Billing
– Customer service
Transactional
– inventory
– bank accounts


Biological data types are
non-relational
Biological data types do
cluster in metric-spaces
Genomic/proteomic sequences
Mass-Spectrometer signatures
Molecular Models
13
MoBIoS Architecture
(Molecular Biological Information System)
14
Storage Manager
Metric-Space Index Structure:
 Persistent representation
Multiple hierarchical trees.
Choice of metric distance functions, including user
defined

Results in:
Efficient clustering of the database
Search time logarithmic to the database size
15
Query Engine
MoBIoS SQL (M-SQL)

Built-in biological data
types
Sequence
Mass-spectra data


Examples:
Homology look-up
Gene fusion experiment
Embodies evolutionary
semantics of bioinformatic
investigation
16
M-SQL Program for MS/MS Protein Identifcation
// Return proteins in the intersection of recorded spectra sufficiently
// similar, range1, to the measured spectra of the first MS, and proteins
// which have a digested fragment computed to be sufficiently similar in
// sequence to the sequencing determined by the second MS
// Database is loaded with genomic and proteomic information
Create table protein_sequences (accesion_id int, sequence peptide, …,
primary metrickey(sequence, mPAM250);
Create table digested_sequences(accession_id int, fragment peptide,
enzyme varchar, ms_peak int…, primary key(enzyme, accession_id);
Create index fragment_sequence on digested_sequences (fragment)
metric(mPAM250);
Create table mass_spectra(accession_id int, enzyme varchar,
spectrum spectrum, primary metrickey(spectrum, cosine_distance);
SELECT
Prot.accesion_id, Prot.sequence
FROM protein_sequences Prot, digested_sequences DS,
mass_spectra MS
WHERE
MS.enzyme = DS.enzyme = E and
Cosine_Distance(S, MS.spectrum, range1) and
DS.accession_id = MS.accession_id = Prot.accesion_id and
DS.ms_peak = P and
MPAM250(PS, DS.sequence, range2)
17
Mining Engine
(No much idea???)
Primitives for relating
clusters to each other.

Possible syntax
Gene expression
 Protein family

18
Applications
 Homology search
 Proteomics
MS/MS and Ion-Trap MS need both MS signature
and sequence data to analyze results
 Gene Expression
Built in clustering algorithms
 Sequence Assembly
19
Properties of Biological
Databases
Data is write-once.
 Workload:

Ad-hoc retrieval queries based on evolutionary criteria
Data clustering and categorization (mining)

Many biological data types are non-relational
Genomic/proteomic sequences
Structural and functional annotations to sequences.
Mass-Spectrometer signatures
Molecular Models
20
Existing Methods
 BLAST
Build index of the query sequence, linear scan the
database
 BLAT
Build index of the database, search the database
based on exact match of fixed length segments
 SST
Tree-structured index for vector space object
21
From PAM matrix to mPAM
• PAM matrix is one of most
commonly used substitution matrix
to compute the similarity between
two peptide sequence under an
evolutional model.
• PAM matrix can not be used
directly for metric distance
indexing technique.
– Similarity score don’t have
reflexivity properties.
– There are negative values.
– Doesn’t satisfy triangular
inequality rules
Figure-2 Log odds matrix for 250 PAMs. (DayHoff
1978)
22
Metric-Space Indexing to Speed
Homology Search
1.
2.
3.
4.
Split the database and build metric space
index structure
Split the query sequence
Search the query segments in the metric
indexing database
Chain the search results
23
Results
Drosophila distance distribution on mPAM
#pairs
800
600
400
200
0
0
5000
10000
Distance
15000
24
Max. Abs. Log-odds of Sub-seq. Lengths
8
7
M.A.L.
6
5
4
3
2
1
0
0
20
40
Length
60
80
Min length: 3 Max length: 80 Threshold: ln10 (2.303) Segment number: 1M
Trial number: 1M Bucket number: 100 Sequential search range: 80
25
Length:30 #DistCalcu VS Index size
40000
Radi
us:0
5
30000
10
20000
15
#calcu
50000
10000
20
0
0
10000 20000
size
30000 40000
25
30
35
40
26
References
•
•
•
•
•
[CPRZ97a] P. Ciaccia, M. Patella, F. Rabitti, and P. Zezula. Indexing metric
spaces with M-tree. In Atti del Quinto Convegno Nazionale SEBD, Verona, Italy,
June 1997.
[CPRZ97b] P. Ciaccia, M. Patella, and P. Zezula, “M-Tree: An Efficient Access
Method for Similarity Search in Metric Spaces”. Proc. VLDB, 1997.
[DSO78] Dayhoff M.O., Schwartz R. and Orcutt B.C. (1978) Atlas of protein
sequence and structure. Vol. 5, Suppl. 3, Ed. M. O. Dayhoff.
[MT]
The M-Tree Project Homepage, http://wwwdb.deis.unibo.it/Mtree/index.html
[SW81]
Temple F. Smith and Mchael S. Waterman. Identification of common
molecular subsequences. Journal of Molecular Biology, 147:195-197, 1981.
27