Indexing and Data Mining in
Multimedia Databases
Christos Faloutsos
CMU
www.cs.cmu.edu/~christos
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources
USC 2001
C. Faloutsos
Problem
Given a large collection of (multimedia)
records, find similar/interesting things, i.e.:
• Allow fast, approximate queries, and
• Find rules/patterns
Sample queries
• Similarity search
– Find pairs of branches with similar sales
patterns
– Find medical cases similar to Smith's
– Find pairs of sensor series that move in sync
– Find shapes like a spark-plug
Sample queries – cont’d
• Rule discovery
– Clusters (of branches; of sensor data; ...)
– Forecasting (total sales for next year?)
– Outliers (e.g., unexpected part failures; fraud
detection)
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Related projects @ CMU and resources
Indexing - Multimedia
Problem:
• given a set of (multimedia) objects,
• find the ones similar to a desirable query
object
[Figure: three time sequences, $price vs. day (1 to 365)]
distance function: by expert
‘GEMINI’ - Pictorially
[Figure: each sequence S1, ..., Sn (value vs. day, 1 to 365) is mapped to a point F(S1), ..., F(Sn) in feature space; axes: e.g., avg and e.g., std]
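A rough Python sketch of the idea pictured on this slide: each sequence is summarized by two features (average and standard deviation, as in the figure), the cheap feature-space distance discards most candidates, and only the survivors are compared in full. The function names, the toy data and the linear scan (standing in for a spatial access method) are illustrative assumptions, not code from the talk. For Euclidean distance the filter causes no false dismissals, because sqrt(n) times the feature distance never exceeds the true distance between two length-n sequences.

```python
import math
from statistics import fmean, pstdev


def features(seq):
    """Map a sequence to a 2-d feature point: (average, population std)."""
    return (fmean(seq), pstdev(seq))


def range_query(query, sequences, eps):
    """Return the names of all sequences within Euclidean distance eps of query.

    Filter step: sqrt(n) * feature distance lower-bounds the true distance,
    so pruning on it never discards a qualifying sequence.
    Refine step: compute the full distance only for the survivors.
    """
    n = len(query)
    fq = features(query)
    hits = []
    for name, seq in sequences.items():
        if math.sqrt(n) * math.dist(fq, features(seq)) > eps:
            continue  # safely pruned in feature space
        if math.dist(tuple(query), tuple(seq)) <= eps:
            hits.append(name)
    return hits


# Hypothetical toy data: three short 'price' sequences.
series = {
    "a": (1.0, 2.0, 3.0, 4.0),
    "b": (1.0, 2.0, 3.0, 5.0),
    "c": (10.0, 20.0, 30.0, 40.0),
}
hits = range_query((1.0, 2.0, 3.0, 4.0), series, eps=1.5)
```

In a real system the feature points would be precomputed and stored in an index, so the filter step would not touch every sequence.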
Remaining issues
• how to extract features automatically?
• how to merge similarity scores from
different media?
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
– Visualization: Fastmap
– Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions
FastMap
Pairwise distances:
      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0
[Figure: the distance matrix is to be mapped to points (??), with distances of ~1 and ~100 preserved]
FastMap
• Multi-dimensional scaling (MDS) can do
that, but in O(N**2) time
• We want a linear algorithm: FastMap
[SIGMOD95]
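The cosine-law projection that FastMap is built on can be sketched in Python roughly as follows. The sketch uses the 5-object distance matrix from the FastMap slide; the pivot heuristic, the function name and the choice k = 2 are simplifications made for illustration, so this should be read as a sketch under those assumptions rather than as the [SIGMOD95] implementation.

```python
import math


def fastmap(dist, k):
    """Map n objects, given only a distance matrix, to points in k-d space."""
    n = len(dist)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, col):
        # Squared distance left over after the first `col` coordinates.
        rest = dist[i][j] ** 2
        for c in range(col):
            rest -= (coords[i][c] - coords[j][c]) ** 2
        return max(rest, 0.0)

    for col in range(k):
        # Pivot heuristic: farthest object from object 0, then farthest from that.
        b = max(range(n), key=lambda j: d2(0, j, col))
        a = max(range(n), key=lambda j: d2(b, j, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # all remaining distances are zero
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine law: projection of object i on the line through a and b.
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * dab)
    return coords


D = [
    [0, 1, 1, 100, 100],
    [1, 0, 1, 100, 100],
    [1, 1, 0, 100, 100],
    [100, 100, 100, 0, 1],
    [100, 100, 100, 1, 0],
]
coords = fastmap(D, 2)
```

Each output dimension needs only the distances from every object to two pivots, which is where the linear cost in N comes from.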
Applications: time sequences
• given n co-evolving time sequences
• visualize them + find rules [ICDE00]
[Figure: exchange rate vs. time, for DEM, JPY, HKD]
Applications - financial
• currency exchange rates [ICDE00]
[Figure: plotted points labeled FRF, GBP, JPY, HKD, USD(t), USD(t-5)]
Applications - financial
• currency exchange rates [ICDE00]
[Figure: plotted points labeled FRF, DEM, HKD, JPY, USD(t), USD(t-5), USD, GBP]
Application: VideoTrails
[ACM MM97]
VideoTrails - usage
• scene-cut detection (about 10% errors)
• scene classification (e.g., dialogue vs. action)
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
– Visualization: Fastmap
– Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions
Merging similarity scores
• e.g., video: text, color, motion, audio
– weights change with the query!
• solution 1: user specifies weights
• solution 2: user gives examples
– and we ‘learn’ what he/she wants: relevance feedback
(Rocchio, MARS, MindReader)
– but: how about disjunctive queries?
‘FALCON’
[Figure: stock patterns shaped like Vs and inverted Vs]
Trader wants only ‘unstable’ stocks
“Single query point” methods
[Figure: Rocchio: example points (+) and a single query point (x)]
“Single query point” methods
[Figure: Rocchio, MARS and MindReader: example points (+) and a single query point (x) for each method]
The averaging effect in action...
Main idea: FALCON Contours
[Wu+, vldb2000]
[Figure: example points (+) with contours around them; axes: feature1 (e.g., temperature) and feature2 (e.g., frequency)]
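One way to obtain contours that hug several separate groups of good examples is to score a candidate by a generalized mean of its distances to all the examples, with a negative exponent. The Python sketch below assumes that form; the exponent alpha = -5, the function name and the toy points are assumptions made for illustration, and [Wu+, vldb2000] should be consulted for the actual definition.

```python
import math


def aggregate_dissimilarity(x, good_points, alpha=-5.0):
    """Generalized mean of the distances from x to every good example.

    With a negative alpha the smallest distance dominates, so a point close
    to ANY good example gets a low score (a soft 'OR' over the examples).
    """
    dists = [math.dist(x, g) for g in good_points]
    if any(d == 0.0 for d in dists):
        return 0.0  # x coincides with a good example
    mean_pow = sum(d ** alpha for d in dists) / len(dists)
    return mean_pow ** (1.0 / alpha)


# Two hypothetical good examples, far apart (a disjunctive query).
good = [(0.0, 0.0), (10.0, 10.0)]
near_one = aggregate_dissimilarity((0.0, 1.0), good)   # next to an example
centroid = aggregate_dissimilarity((5.0, 5.0), good)   # the 'average' point
```

Here the point next to one example scores better than the centroid of the examples, whereas a single-query-point method would favor the centroid.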
Conclusions for indexing +
visualization
• GEMINI: fast indexing, exploiting off-the-shelf SAMs
• FastMap: automatic feature extraction in
O(N) time
• FALCON: relevance feedback for
disjunctive queries
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources
Data mining & fractals –
Road map
• Motivation – problems / case study
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
Problem #1 - spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol)
- ‘spiral’ and ‘elliptical’ galaxies
(stores & households; mpg & MTBF...)
- patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability?
Problem#2: dim. reduction
• given attributes x1, ... xn
– possibly, non-linearly correlated
• drop the useless ones
(Q: why?
A: to avoid the ‘dimensionality curse’)
Answer:
• Fractals / self-similarities / power laws
What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
...
zero area;
infinite length!
Definitions (cont’d)
• Paradox: Infinite perimeter ; Zero area!
• ‘dimensionality’: between 1 and 2
• actually: Log(3)/Log(2) = 1.58… (long
story)
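The ‘long story’ can be shortened to a self-similarity argument, sketched here for a starting triangle with unit side. The symbols N_k (number of copies after k steps), r_k (side of each copy), P_k (total perimeter) and A_k (total area) are introduced only for this illustration.

```latex
\begin{align*}
N_k &= 3^k, \qquad r_k = 2^{-k} \\
D &= \frac{\log N_k}{\log (1/r_k)} = \frac{k \log 3}{k \log 2}
   = \frac{\log 3}{\log 2} \approx 1.585 \\
P_k &= 3 \left(\tfrac{3}{2}\right)^{k} \to \infty, \qquad
A_k = A_0 \left(\tfrac{3}{4}\right)^{k} \to 0
\end{align*}
```

Each step keeps 3 of the 4 half-size sub-triangles, which is why the perimeter grows without bound while the area vanishes.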
Intrinsic (‘fractal’) dimension
E.g.: #cylinders; miles/gallon
• Q: fractal dimension of a line?
[Figure: points along a line in the x-y plane]
Intrinsic (‘fractal’) dimension
• Q: fractal dimension of a line?
• A: nn(<= r) ~ r^1
(‘power law’: y = x^a)
• Q: f.d. of a plane?
• A: nn(<= r) ~ r^2
f.d. = slope of log(nn) vs. log(r), where nn(<= r) is the number of neighbors within distance r
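The ‘slope of log(nn) vs. log(r)’ recipe can be tried out with a few lines of Python. This is a minimal sketch: it counts pairs within each radius by brute force and fits a least-squares line; the function name, the radii, the sample sizes and the random seed are all assumptions chosen for illustration. On finite samples the estimate comes out a little below the ideal 1 and 2 because of edge effects.

```python
import math
import random
from bisect import bisect_right


def correlation_dimension(points, radii):
    """Least-squares slope of log(#pairs within r) vs. log(r)."""
    dists = sorted(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )
    xs, ys = [], []
    for r in radii:
        pairs = bisect_right(dists, r)  # number of pairs at distance <= r
        if pairs > 0:
            xs.append(math.log(r))
            ys.append(math.log(pairs))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(0)
radii = [0.02 * 1.5 ** k for k in range(6)]          # 0.02 up to about 0.15
line = [(t, t) for t in (rng.random() for _ in range(400))]
plane = [(rng.random(), rng.random()) for _ in range(600)]

fd_line = correlation_dimension(line, radii)    # expected to be close to 1
fd_plane = correlation_dimension(plane, radii)  # expected to be close to 2
```

The brute-force pair count is quadratic in the number of points, which is acceptable for a demonstration but not for large datasets.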
Sierpinski triangle
[Figure: ‘correlation integral’: log(#pairs within <= r) vs. log(r); slope 1.58]
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
• Conclusions
Solution#1: spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B.
Nichol - ‘BOPS’ plot - [sigmod2000])
•clusters?
•separable?
•attraction/repulsion?
•data ‘scrubbing’ –
duplicates?
Solution#1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r), for ell-ell, spi-spi and spi-ell pairs]
- 1.8 slope
- plateau!
- repulsion!
Solution#1: spatial d.m.
[w/ Seeger, Traina, Traina, SIGMOD00]
spatial d.m.
Heuristic on choosing # of clusters
[Figure: characteristic distances r1 and r2]
Solution#1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r), for ell-ell, spi-spi and spi-ell pairs]
- 1.8 slope
- plateau!
- repulsion!
- duplicates
Problem #2: Dim. reduction
[Figure: three 2-d point sets, y vs. x: (a) Quarter-circle, (b) Line, (c) Spike]
Solution:
• drop the attributes that don’t increase the
‘partial f.d.’ (PFD)
• Definition: the PFD of attribute set A is the f.d. of the
cloud of points projected onto A [w/ Traina, Traina,
Wu, SBBD00]
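A simplified backward-elimination sketch of this idea in Python: repeatedly drop the attribute whose removal lowers the estimated f.d. the least, and stop once a removal would cost more than a tolerance. The estimator, the tolerance of 0.4, the radii and the synthetic data (attribute 1 is the square of attribute 0, attribute 2 is independent) are all assumptions made for illustration; the procedure in [SBBD00] may differ in its details.

```python
import math
import random
from bisect import bisect_right


def fractal_dim(points, radii):
    """Least-squares slope of log(#pairs within r) vs. log(r)."""
    dists = sorted(math.dist(p, q)
                   for i, p in enumerate(points) for q in points[i + 1:])
    xs, ys = [], []
    for r in radii:
        pairs = bisect_right(dists, r)
        if pairs > 0:
            xs.append(math.log(r))
            ys.append(math.log(pairs))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sum((x - mx) ** 2 for x in xs)


def project(points, attrs):
    """Keep only the listed attributes of every point."""
    return [tuple(p[a] for a in attrs) for p in points]


def select_attributes(points, radii, tol=0.4):
    """Drop attributes one by one while the f.d. falls by at most tol."""
    keep = list(range(len(points[0])))
    current = fractal_dim(project(points, keep), radii)
    while len(keep) > 1:
        best_attr, best_fd = None, -math.inf
        for a in keep:
            rest = [b for b in keep if b != a]
            fd = fractal_dim(project(points, rest), radii)
            if fd > best_fd:
                best_attr, best_fd = a, fd
        if current - best_fd > tol:
            break  # every remaining attribute carries information
        keep.remove(best_attr)
        current = best_fd
    return keep


rng = random.Random(1)
data = []
for _ in range(300):
    a, b = rng.random(), rng.random()
    data.append((a, a * a, b))  # attribute 1 is non-linearly tied to attribute 0

radii = [0.05 * 1.4 ** k for k in range(5)]
kept = select_attributes(data, radii)
```

On this data one of the two correlated attributes is expected to be dropped, while the independent attribute survives.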
Problem #2: dim. reduction
[Figure: (a) Quarter-circle, (b) Line, (c) Spike, y vs. x; global FD = 1; axis labels: PFD = 1, PFD ~ 1, PFD = 0]
Problem #2: dim. reduction
[Figure: (a) Quarter-circle, (b) Line, (c) Spike, y vs. x; global FD = 1; axis labels: PFD = 1, PFD ~ 1, PFD = 0]
Notice: ‘max variance’ would fail here
Problem #2: dim. reduction
[Figure: (a) Quarter-circle, (b) Line, (c) Spike, y vs. x; global FD = 1; axis labels: PFD = 1, PFD ~ 1, PFD = 0]
Notice: SVD would fail here
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
– fractals
– power laws
• Conclusions
disk traffic
• Not Poisson, not(?) iid - BUT: self-similar
• How to model it?
[Figure: #bytes vs. time]
traffic
• disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02])
[Figure: #bytes vs. time, with a 20% / 80% split marked]
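The 80-20 ‘law’ suggests a simple way to generate bursty, self-similar traffic: split the total volume recursively, giving 80% to one half of the time interval and 20% to the other. The Python sketch below does that; the function name, the number of levels, the random choice of which half gets the larger share and the total volume are assumptions for illustration, not parameters taken from [ICDE’02].

```python
import random


def bursty_traffic(total, levels, bias=0.8, seed=0):
    """Recursively split `total` into 2**levels time bins, 80-20 style."""
    rng = random.Random(seed)
    bins = [float(total)]
    for _ in range(levels):
        nxt = []
        for volume in bins:
            big, small = volume * bias, volume * (1.0 - bias)
            # Randomly decide which half of the interval gets the big share.
            nxt.extend((big, small) if rng.random() < 0.5 else (small, big))
        bins = nxt
    return bins


traffic = bursty_traffic(total=1_000_000, levels=10)  # 1024 time bins
```

With bias = 0.5 the same code yields perfectly uniform traffic, which makes the burstiness of the 80-20 setting easy to see by comparison.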
Traffic
Many other time-sequences are
bursty/clustered: (such as?)
Tape accesses
# tapes needed, to retrieve n records?
(# days down, due to failures / hurricanes / communication noise...)
[Figure: Tape#1 ... Tape#N, vs. time]
Tape accesses
[Figure: # tapes retrieved vs. # qual. records; curves: ‘50-50 = Poisson’ and ‘real’]
More apps: Brain scans
• Oct-trees; brain-scans
[Figure: log(#octants) vs. octree levels; fd = 2.63]
GIS points
Cross-roads of
Montgomery county:
•any rules?
GIS
A: self-similarity:
• intrinsic dim. = 1.51
• avg #neighbors(<= r) ~ r^D, with D = 1.51
[Figure: log(#pairs within <= r) vs. log(r); slope 1.51]
Examples: LB county
• Long Beach county of CA (road end-points)
More fractals:
• cardiovascular system: 3 (!)
• stock prices (LYCOS) - random walks: 1.5
[Figure: LYCOS stock price, over 1 year and over 2 years]
• Coastlines: 1.2-1.58 (?)
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
– fractals
– power laws
• Conclusions
Fractals <-> Power laws
self-similarity ->
• <=> fractals
• <=> scale-free
• <=> power-laws (y=x^a, F=C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
Zipf’s law
Bible: RANK-FREQUENCY plot (in log-log scales)
[Figure: log(freq) vs. log(rank); labeled points: “the”, “and”]
Zipf’s (first) Law:
Zipf’s law
• similarly for first names (slope ~-1)
• last names (~ -0.7)
• etc
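Slopes like the ones quoted above can be measured by sorting the frequencies, taking logs of rank and frequency, and fitting a straight line. The Python sketch below does this on a synthetic vocabulary whose word counts are built to follow 1000/rank; the function name and the synthetic data are assumptions for illustration, not data from the talk.

```python
import math
from collections import Counter


def rank_frequency_slope(words):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


# Synthetic 'text': the word of rank i occurs about 1000 / i times.
words = []
for rank in range(1, 51):
    words.extend([f"w{rank}"] * round(1000 / rank))

slope = rank_frequency_slope(words)  # expected to be close to -1
```

Replacing `words` with the tokens of a real text gives the rank-frequency slope for that text.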
More power laws
• Energy of earthquakes (Gutenberg-Richter
law) [simscience.org]
[Figure: amplitude vs. day; log(count) vs. magnitude]
Clickstream data
<url, u-id, ....>
[Figure: Web site traffic: log(count) vs. log(freq); Zipf]
Lotka’s law
• library science (Lotka’s law of publication
count); and citation counts:
(citeseer.nj.nec.com 6/2001)
[Figure: log(count) vs. log(#citations); labeled point: J. Ullman]
Korcak’s law
Scandinavian lakes: area vs. complementary cumulative count (log-log axes)
[Figure: log(count(>= area)) vs. log(area)]
More power laws: Korcak
Japan islands: area vs. cumulative count (log-log axes)
[Figure: log(count(>= area)) vs. log(area)]
(Korcak’s law: Aegean islands)
Olympic medals:
[Figure: log(# medals) vs. log rank; labeled points: USA, Russia, China; linear fit y = -0.9676x + 2.3054, R^2 = 0.9458]
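Read as a power law, the fitted line on this plot translates as follows. The step assumes the logarithms on both axes are base 10, which the axis range suggests but the slide does not state.

```latex
\begin{align*}
\log_{10}(\text{medals}) &= -0.9676 \, \log_{10}(\text{rank}) + 2.3054 \\
\text{medals} &\approx 10^{2.3054} \cdot \text{rank}^{-0.9676}
               \approx 202 \cdot \text{rank}^{-0.9676}
\end{align*}
```

The exponent is close to -1, i.e., close to the slope seen in the Zipf plots.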
SALES data – store#96
[Figure: count of products vs. # units sold]
TELCO data
[Figure: count of customers vs. # of service units]
More power laws on the Internet
degree vs. rank, for Internet domains (log-log) [sigcomm99]
[Figure: log(degree) vs. log(rank); slope -0.82]
Even more power laws:
• Income distribution (Pareto’s law)
• duration of UNIX jobs [Harchol-Balter]
• Distribution of UNIX file sizes
• Web graph [CLEVER-IBM; Barabasi]
Overall Conclusions:
‘Find similar/interesting things’ in multimedia
databases
• Indexing: feature extraction (‘GEMINI’)
– automatic feature extraction: FastMap
– Relevance feedback: FALCON
Conclusions - cont’d
• New tools for Data Mining: Fractals/power
laws:
– appear everywhere
– lead to skewed distributions (not Gaussian,
Poisson, uniform, or independent)
– ‘correlation integral’ for separability/cluster
detection
– PFD for dimensionality reduction
Resources:
• Software and papers:
– www.cs.cmu.edu/~christos
– Fractal dimension (FracDim)
– Separability (sigmod 2000, kdd2001)
– Relevance feedback for query by content
(FALCON – vldb 2000)
Resources
• Manfred Schroeder “Chaos, Fractals and
Power Laws”