Indexing and Data Mining in
Multimedia Databases
Christos Faloutsos
CMU
www.cs.cmu.edu/~christos
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources
USC 2001
C. Faloutsos
2
Problem
Given a large collection of (multimedia)
records, find similar/interesting things, i.e.:
• Allow fast, approximate queries, and
• Find rules/patterns
Sample queries
• Similarity search
– Find pairs of branches with similar sales
patterns
– Find medical cases similar to Smith's
– Find pairs of sensor series that move in sync
– Find shapes like a spark-plug
Sample queries (cont’d)
• Rule discovery
– Clusters (of branches; of sensor data; ...)
– Forecasting (total sales for next year?)
– Outliers (e.g., unexpected part failures; fraud detection)
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Related projects @ CMU and resources
Indexing - Multimedia
Problem:
• given a set of (multimedia) objects,
• find the ones similar to a desirable query
object
[Figure: three plots of $price vs. day (day 1 to 365); distance function: by expert]
‘GEMINI’ - Pictorially
[Figure: each sequence S1, ..., Sn (plotted over days 1 to 365) is mapped to a feature point F(S1), ..., F(Sn); example features: e.g., avg and e.g., std]
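The picture can be sketched in code. This is a minimal sketch, not the GEMINI implementation: it assumes Euclidean distance between equal-length sequences and uses the two example features from the slide (avg, std). The names `features` and `range_query` and the toy data are made up for illustration. Scaling both features by sqrt(n) makes the feature-space distance a lower bound of the true distance, so the quick filter step causes no false dismissals; false alarms are discarded in the refinement step.

```python
import math

def features(s):
    """F(S): map a sequence to 2 features, sqrt(n)*avg and sqrt(n)*std."""
    n = len(s)
    avg = sum(s) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in s) / n)  # population std
    root_n = math.sqrt(n)
    return (root_n * avg, root_n * std)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(query, sequences, eps):
    """Indices of the sequences within eps of the query (filter, then refine)."""
    fq = features(query)
    # A real system would precompute the feature points and store them in a
    # spatial access method; a linear scan keeps this sketch short.
    candidates = [i for i, s in enumerate(sequences)
                  if euclidean(fq, features(s)) <= eps]   # quick filter
    return [i for i in candidates
            if euclidean(query, sequences[i]) <= eps]     # refinement

# Hypothetical toy data: three short "price" sequences.
sequences = [[1.0, 2.0, 3.0, 4.0],
             [1.1, 2.1, 2.9, 4.2],
             [10.0, 10.0, 10.0, 10.0]]
hits = range_query([1.0, 2.0, 3.0, 4.1], sequences, eps=0.5)
```

The design point is that the filter only ever compares low-dimensional feature points, while the expensive full-sequence distance is computed for the surviving candidates only.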
Remaining issues
• how to extract features automatically?
• how to merge similarity scores from
different media?
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
– Visualization: Fastmap
– Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions
FastMap
Distance matrix:
      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0
[Figure: the objects mapped to points (??): two groups whose members are ~1 apart, with the groups ~100 apart]
FastMap
• Multi-dimensional scaling (MDS) can do
that, but in O(N**2) time
• We want a linear algorithm: FastMap
[SIGMOD95]
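A compact sketch of the idea, as it is commonly described: pick two distant pivot objects, place every object on the line through them using the cosine law, then repeat on the residual distances. The pivot heuristic, the function name `fastmap` and its signature are assumptions of this sketch and are not taken from [SIGMOD95]. Each coordinate needs only O(N) distance evaluations.

```python
import math

def fastmap(dist, n, k):
    """Map objects 0..n-1 to k-d points, given only a distance function."""
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, col):
        # Squared distance left over after the first `col` coordinates.
        s = dist(i, j) ** 2
        for c in range(col):
            s -= (coords[i][c] - coords[j][c]) ** 2
        return max(s, 0.0)  # clamp: the input need not be exactly Euclidean

    for col in range(k):
        # Pivot heuristic: farthest object from object 0, then farthest from that.
        b = max(range(n), key=lambda j: residual_sq(0, j, col))
        a = max(range(n), key=lambda j: residual_sq(b, j, col))
        dab_sq = residual_sq(a, b, col)
        if dab_sq == 0.0:
            break  # nothing left to explain
        dab = math.sqrt(dab_sq)
        for i in range(n):
            # Cosine law: projection of object i on the line (a, b).
            coords[i][col] = (residual_sq(a, i, col) + dab_sq
                              - residual_sq(b, i, col)) / (2 * dab)
    return coords

# The 5-object distance matrix from the FastMap slide.
D = [[0, 1, 1, 100, 100],
     [1, 0, 1, 100, 100],
     [1, 1, 0, 100, 100],
     [100, 100, 100, 0, 1],
     [100, 100, 100, 1, 0]]
points = fastmap(lambda i, j: D[i][j], n=5, k=2)
```

On this matrix the first coordinate already separates the two groups: O1, O2, O3 land close together, and O4, O5 land about 100 away.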
Applications: time sequences
• given n co-evolving time sequences
• visualize them + find rules [ICDE00]
[Figure: exchange rate vs. time for DEM, JPY, HKD]
Applications - financial
• currency exchange rates [ICDE00]
[Figure: plot with labeled points FRF, GBP, JPY, HKD, USD(t), USD(t-5)]
Applications - financial
• currency exchange rates [ICDE00]
[Figure: plot with labeled points FRF, DEM, HKD, JPY, USD, GBP, USD(t), USD(t-5)]
Application: VideoTrails
[ACM MM97]
VideoTrails - usage
• scene-cut detection (about 10% errors)
• scene classification (e.g., dialogue vs action)
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
– Visualization: Fastmap
– Relevance feedback: FALCON
• Data Mining / Fractals
• Conclusions
Merging similarity scores
• e.g., video: text, color, motion, audio
– weights change with the query!
• solution 1: user specifies weights
• solution 2: user gives examples
– and we ‘learn’ what he/she wants: relevance feedback (Rocchio, MARS, MindReader)
– but: how about disjunctive queries?
‘FALCON’
[Figure: stock patterns shaped like Vs and inverted Vs]
Trader wants only ‘unstable’ stocks
“Single query point” methods
[Figure: three panels of ‘+’ and ‘x’ marks: Rocchio, MARS, MindReader]
The averaging effect in action...
Main idea: FALCON Contours
[Wu+, vldb2000]
[Figure: ‘+’ marks in the plane of feature1 (e.g., temperature) vs. feature2 (e.g., frequency)]
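One way to obtain contours that wrap around several separate groups of good examples is to score a candidate by a generalized mean of its distances to all the good examples, with a negative exponent so that being close to any one example is enough. The sketch below illustrates that idea and contrasts it with a single query point (the centroid). The function names, the exponent `alpha=-2.0` and the toy points are assumptions of this sketch; the exact formulation used by FALCON should be checked against [Wu+, vldb2000].

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def aggregate_dissimilarity(x, good_points, alpha=-2.0):
    """Generalized mean of the distances from x to every good example.
    With alpha < 0 it behaves like a soft 'min' (a fuzzy OR)."""
    dists = [euclidean(x, g) for g in good_points]
    if any(d == 0.0 for d in dists):
        return 0.0  # x coincides with a good example
    mean = sum(d ** alpha for d in dists) / len(dists)
    return mean ** (1.0 / alpha)

def centroid_distance(x, good_points):
    """'Single query point' baseline: distance to the average good example."""
    dim = len(good_points[0])
    centroid = [sum(g[k] for g in good_points) / len(good_points)
                for k in range(dim)]
    return euclidean(x, centroid)

# Hypothetical disjunctive query: good examples form two far-apart groups.
good = [(0.0, 0.0), (0.5, 0.5), (10.0, 10.0), (10.5, 9.5)]
near_a_cluster = (0.2, 0.3)       # close to the first group
between_clusters = (5.25, 5.0)    # the centroid, far from every good example
```

With these points the centroid baseline prefers `between_clusters`, while the aggregate score prefers `near_a_cluster`, which is the behavior a disjunctive query needs.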
Conclusions for indexing +
visualization
• GEMINI: fast indexing, exploiting off-the-shelf SAMs (spatial access methods)
• FastMap: automatic feature extraction in
O(N) time
• FALCON: relevance feedback for
disjunctive queries
Outline
Goal: ‘Find similar / interesting things’
• Problem - Applications
• Indexing - similarity search
• New tools for Data Mining: Fractals
• Conclusions
• Resources
Data mining & fractals – Road map
• Motivation – problems / case study
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
Problem #1 - spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol)
- ‘spiral’ and ‘elliptical’ galaxies
(stores & households; mpg & MTBF...)
- patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability??
Problem #2: dim. reduction
• given attributes x1, ... xn
– possibly, non-linearly correlated
• drop the useless ones
(Q: why?
A: to avoid the ‘dimensionality curse’)
Answer:
• Fractals / self-similarities / power laws
What is a fractal?
= self-similar point set, e.g., Sierpinski triangle:
...
zero area;
infinite length!
Definitions (cont’d)
• Paradox: Infinite perimeter ; Zero area!
• ‘dimensionality’: between 1 and 2
• actually: Log(3)/Log(2) = 1.58… (long
story)
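The short version of the ‘long story’ is the usual self-similarity argument, given here as an informal sketch and not as a rigorous definition: doubling the size of a D-dimensional object yields 2^D copies of the original.

```latex
% Doubling the side: a line gives 2 = 2^1 copies, a square gives 4 = 2^2 copies.
% A Sierpinski triangle of doubled side consists of 3 copies of the original:
\[
  2^{D} = 3
  \quad\Longrightarrow\quad
  D = \frac{\log 3}{\log 2} \approx 1.58
\]
```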
Intrinsic (‘fractal’) dimension
• Q: fractal dimension of a line?
[Figure: points along a line in the x-y plane; e.g., attributes #cylinders and miles / gallon]
Intrinsic (‘fractal’) dimension
• Q: fractal dimension of a line?
• A: nn ( <= r ) ~ r^1, where nn ( <= r ) is the number of neighbors within distance r
(‘power law’: y=x^a)
• Q: fd of a plane?
• A: nn ( <= r ) ~ r^2
fd == slope of ( log(nn) vs log(r) )
Sierpinski triangle
[Figure: log(#pairs within <= r) vs. log(r), i.e., the ‘correlation integral’; a line with slope 1.58]
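The slope can be estimated directly from a point set. The sketch below counts the pairs within each radius and fits a least-squares line in log-log scales. It is a quadratic-time illustration, not the FracDim software; the function names, the chosen radii and the way the Sierpinski points are generated are assumptions of this sketch.

```python
import math
from bisect import bisect_right

def correlation_dimension(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r), by least squares."""
    dists = sorted(math.dist(p, q)
                   for i, p in enumerate(points) for q in points[i + 1:])
    xs = [math.log(r) for r in radii]
    ys = [math.log(bisect_right(dists, r)) for r in radii]  # log pair counts
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def sierpinski(depth):
    """One corner point per small triangle: 3**depth points, unit side."""
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    pts = [(0.0, 0.0)]
    for _ in range(depth):
        # Shrink the current set by 2 towards each of the three corners.
        pts = [((x + cx) / 2, (y + cy) / 2)
               for x, y in pts for cx, cy in corners]
    return pts

radii = [0.03, 0.06, 0.12, 0.24]
line = [(i / 499, 0.0) for i in range(500)]
fd_line = correlation_dimension(line, radii)              # expected near 1
fd_triangle = correlation_dimension(sierpinski(6), radii)  # expected near 1.58
```

The estimates are approximate: the radii must stay well above the point spacing and well below the size of the data set, otherwise the line in the log-log plot bends.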
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
• Conclusions
Solution #1: spatial d.m.
Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000])
• clusters?
• separable?
• attraction/repulsion?
• data ‘scrubbing’ – duplicates?
Solution #1: spatial d.m.
[w/ Seeger, Traina, Traina, SIGMOD00]
[Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi and spi-ell pairs; annotations: 1.8 slope, plateau!, repulsion!]
spatial d.m.
Heuristic on choosing # of clusters
[Figure: radii r1 and r2]
Solution #1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi and spi-ell pairs; annotations: 1.8 slope, plateau!, repulsion!!, duplicates]
Problem #2: Dim. reduction
[Figure: three data sets in the x-y plane: (a) Quarter-circle, (b) Line, (c) Spike]
Solution:
• drop the attributes that don’t increase the ‘partial f.d.’ (PFD)
• definition: the PFD of attribute set A is the f.d. of the cloud of points projected onto A [w/ Traina, Traina, Wu, SBBD00]
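The rule can be sketched as a greedy pass over the attributes: keep an attribute only if adding it raises the PFD noticeably. This is an illustration under stated assumptions, not the algorithm of [SBBD00]: the PFD is estimated with a simple quadratic-time correlation-integral slope, and the threshold `min_gain=0.3`, the radii, the function names and the toy data are all made up here.

```python
import math
from bisect import bisect_right

def pfd(points, attrs, radii):
    """Partial fractal dimension: slope of log(#pairs within <= r) vs. log(r)
    for the points projected on the attributes in `attrs`."""
    proj = [[p[a] for a in attrs] for p in points]
    dists = sorted(math.dist(p, q)
                   for i, p in enumerate(proj) for q in proj[i + 1:])
    xs = [math.log(r) for r in radii]
    ys = [math.log(bisect_right(dists, r)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def select_attributes(points, radii, min_gain=0.3):
    """Keep an attribute only if it increases the PFD by more than min_gain."""
    selected, current = [], 0.0
    for a in range(len(points[0])):
        candidate = pfd(points, selected + [a], radii)
        if candidate - current > min_gain:
            selected.append(a)
            current = candidate
    return selected

# Hypothetical data: attribute 1 is a non-linear function of attribute 0
# (x squared), attribute 2 varies independently on a 20 x 20 grid.
points = [(i / 19, (i / 19) ** 2, j / 19)
          for i in range(20) for j in range(20)]
kept = select_attributes(points, radii=[0.08, 0.16, 0.32])
```

Here attribute 1 is dropped even though its correlation with attribute 0 is non-linear, because the projected cloud is still a curve and the PFD does not grow.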
Problem #2: dim. reduction
[Figure: the three data sets (a) Quarter-circle, (b) Line, (c) Spike; annotations: global FD=1, and per-attribute PFD values of ~1, 1 and 0]
Notice: ‘max variance’ would fail here
Notice: SVD would fail here
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
– fractals
– power laws
• Conclusions
disk traffic
• Not Poisson, not(?) iid - BUT: self-similar
• How to model it?
[Figure: #bytes vs. time]
traffic
• disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02])
[Figure: #bytes vs. time, with a 20% / 80% split]
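One common way to turn the 80-20 ‘law’ into a generator of bursty traffic is a recursive split: 80% of the volume goes to one half of the time interval, 20% to the other, and the same rule is applied inside each half. The sketch below follows that reading. The function name `b_model`, the random choice of the heavier half and the parameter values are assumptions of this sketch, and whether it matches [ICDE’02] in every detail is not confirmed here.

```python
import random

def b_model(total, levels, b=0.8, rng=None):
    """Spread `total` over 2**levels time bins; at every split a fraction b
    goes to one randomly chosen half and 1 - b to the other."""
    rng = rng or random.Random(0)   # fixed seed: reproducible sketch
    series = [float(total)]
    for _ in range(levels):
        nxt = []
        for volume in series:
            heavy, light = b * volume, (1 - b) * volume
            nxt.extend((heavy, light) if rng.random() < 0.5 else (light, heavy))
        series = nxt
    return series

traffic = b_model(total=1_000_000, levels=10)   # 1024 bins of #bytes
```

The total volume is preserved, yet a handful of bins carry most of it: the busiest bin holds 0.8^10 of the total, about 10.7%, where an even spread would give each bin under 0.1%.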
Traffic
Many other time-sequences are bursty/clustered: (such as?)
Tape accesses
# tapes needed, to retrieve n records?
(# days down, due to failures / hurricanes / communication noise...)
[Figure: Tape#1 ... Tape#N along a time axis]
Tape accesses
[Figure: # tapes retrieved vs. # qual. records; curves: ‘50-50 = Poisson’ and ‘real’]
More apps: Brain scans
• Oct-trees; brain-scans
[Figure: Log(#octants) vs. octree levels; slope 2.63 = fd]
GIS points
Cross-roads of Montgomery county:
• any rules?
GIS
A: self-similarity:
• intrinsic dim. = 1.51
• avg#neighbors(<= r) ~ r^D
[Figure: log(#pairs(within <= r)) vs. log(r); slope 1.51]
Examples: LB county
• Long Beach county of CA (road end-points)
More fractals:
• cardiovascular system: 3 (!)
• stock prices (LYCOS) - random walks: 1.5
[Figure: LYCOS stock price over 1 year and over 2 years]
• Coastlines: 1.2-1.58 (?)
Road map
• Motivation – problems / case studies
• Definition of fractals and power laws
• Solutions to posed problems
• More examples
– fractals
– power laws
• Conclusions
Fractals <-> Power laws
self-similarity ->
• <=> fractals
• <=> scale-free
• <=> power-laws (y=x^a, F=C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
Zipf’s law
Bible: RANK-FREQUENCY plot (in log-log scales)
[Figure: log(freq) vs. log(rank); the top-ranked words are “the” and “and”]
Zipf’s (first) Law: freq ~ 1 / rank
Zipf’s law
• similarly for first names (slope ~-1)
• last names (~ -0.7)
• etc
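The slope of a rank-frequency plot can be measured with a few lines of code: count the items, sort the counts in decreasing order, and fit a line to log(freq) vs. log(rank). This is a minimal sketch; the function name and the synthetic ‘text’ (built so that the word of rank k occurs about 1000 / k times) are assumptions for illustration, not data from the slides.

```python
import math
from collections import Counter

def rank_frequency_slope(tokens):
    """Least-squares slope of log(freq) vs. log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Hypothetical synthetic text: word "w<k>" occurs round(1000 / k) times.
tokens = [f"w{k}" for k in range(1, 101) for _ in range(round(1000 / k))]
slope = rank_frequency_slope(tokens)   # close to -1 for this input
```

Feeding it real tokens (words of a text, first names, last names) gives the slopes that the plots on these slides report.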
More power laws
• Energy of earthquakes (Gutenberg-Richter
law) [simscience.org]
[Figure: amplitude vs. day; log(count) vs. magnitude]
Clickstream data
<url, u-id, ....>
Web Site Traffic
[Figure: log(count) vs. log(freq); Zipf]
Lotka’s law
• library science (Lotka’s law of publication
count); and citation counts:
(citeseer.nj.nec.com 6/2001)
[Figure: log(count) vs. log(#citations); labeled point: J. Ullman]
Korcak’s law
Scandinavian lakes: area vs complementary cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]
More power laws: Korcak
Japan islands: area vs cumulative count (log-log axes)
[Figure: log(count( >= area)) vs. log(area)]
(Korcak’s law: Aegean islands)
Olympic medals:
[Figure: log(# medals) vs. log rank; labeled points: USA, Russia, China; linear fit y = -0.9676x + 2.3054, R^2 = 0.9458]
SALES data – store#96
[Figure: count of products vs. # units sold]
TELCO data
[Figure: count of customers vs. # of service units]
More power laws on the Internet
[Figure: log(degree) vs. log(rank); slope -0.82]
degree vs rank, for Internet domains (log-log) [sigcomm99]
Even more power laws:
• Income distribution (Pareto’s law);
• duration of UNIX jobs [Harchol-Balter]
• Distribution of UNIX file sizes
• Web graph [CLEVER-IBM; Barabasi]
Overall Conclusions:
‘Find similar/interesting things’ in multimedia
databases
• Indexing: feature extraction (‘GEMINI’)
– automatic feature extraction: FastMap
– Relevance feedback: FALCON
Conclusions - cont’d
• New tools for Data Mining: Fractals/power laws:
– appear everywhere
– lead to skewed distributions (not Gaussian, Poisson, uniform or independent)
– ‘correlation integral’ for separability/cluster detection
– PFD for dimensionality reduction
Resources:
• Software and papers:
– www.cs.cmu.edu/~christos
– Fractal dimension (FracDim)
– Separability (sigmod 2000, kdd2001)
– Relevance feedback for query by content (FALCON – vldb 2000)
Resources
• Manfred Schroeder “Fractals, Chaos, Power Laws”