Mining Knowledge about
Changes, Differences, and
Trends
Guozhu Dong
Wright State University
Dayton, Ohio
Outline
• Introduction
– Knowledge discovery from databases (KDD)
– Knowledge about changes, differences, & trends
• Contributions
– Changes between datasets (KDD 99 & more)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Introduction -- KDD (1)
• Mountains of data, everywhere!
– Use them → better service, better cures, …
• Aims of KDD
– Mine valid, novel, potentially useful patterns
– Classifiers, clustering, associations, insights, …
• History
– Traditional scientific discovery = manual mining
– Ancestry of KDD: statistics, machine learning, pattern recognition, databases, …
– Field started in the 1990s
• Data forms
– Market basket data (transactions)
– Relational data
– Data cubes (relational + concept hierarchies)
Introduction – KDD (2)
• Main tasks for KDD
– Identifying “useful pattern types”
– Giving algorithms for mining them
– Finding ways for using them
• Our contributions are along these lines
Example knowledge patterns about changes, differences, & trends (CDT)
• Compare dataset A against dataset B, looking for patterns capturing CDT
– Cancer tissues vs normal tissues (gene groups → drug design)
– Loyal customers vs disloyal customers
– Data_1999 vs Data_2000 (emerging trends)
• Compare cells in a data cube, looking for similar cells with big measure differences
– “Gradients”
• Analyze trends in MDML (multidimensional multi-level) manner on a set of time series in a data cube
Traditional approaches to “mining” CDT
• Compare histograms or pie charts of datasets
[Bar chart of quarterly sales (1st Qtr – 4th Qtr) for East, West, and North]
– Gain a little, miss a lot
• Study time series, one or two at a time
• Summaries
• Limitations:
– Only offer a high-level view, on very few “factors/variables”
– But miss knowledge on many factor groups, many insights
Outline
• Introduction
– Knowledge discovery from databases
– Changes, differences, and trends
• Contributions
– Changes between datasets (KDD 99 etc)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Emerging Patterns between Two Datasets
[Side-by-side tables of binned expression values (L/H) for genes g1–g4 in normal tissues vs cancer tissues]
• EP: patterns with a high frequency ratio between the datasets
• E.g. {g1=L, g2=H, g3=L}; freq ratio = infinite
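The frequency-ratio definition above can be sketched directly in code. This is a brute-force illustration only (not the border-based mining algorithm covered later); the `g1=L` item encoding and the toy datasets are hypothetical.

```python
from itertools import combinations

def support(pattern, dataset):
    """Fraction of rows (each a set of items like 'g1=L') containing pattern."""
    return sum(pattern <= row for row in dataset) / len(dataset)

def growth_rate(pattern, d_from, d_to):
    """Frequency ratio of pattern from d_from to d_to; infinite for jumping EPs."""
    s_from, s_to = support(pattern, d_from), support(pattern, d_to)
    if s_from == 0:
        return float("inf") if s_to > 0 else 0.0
    return s_to / s_from

def emerging_patterns(d_from, d_to, min_ratio, max_len=2):
    """Brute-force enumeration of EPs up to max_len items (illustration only)."""
    items = sorted(set().union(*d_to))
    return [frozenset(c)
            for k in range(1, max_len + 1)
            for c in combinations(items, k)
            if growth_rate(frozenset(c), d_from, d_to) >= min_ratio]
```

An EP with a pattern absent from one dataset but present in the other gets an infinite ratio, matching the slide's example.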
Colon tumor gene expression
• 40 tumor, 22 normal colon tissue samples
• 6500 genes/sample (Affymetrix Hum6000 micro-array gene chip)
• Original GE data: 100s of samples × 1000s of dimensions, e.g.

g1  g2  g3  g4
20  90  25  80
24  95  23  28
80  20  25  85
25  89  85  25

• Last slide: binned data
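Turning raw expression values into the binned L/H items of the previous slide can be done, for instance, with a per-gene median split. The actual discretization used in the papers may differ; this is an illustrative sketch only.

```python
def bin_by_median(samples):
    """Discretize each gene's expression into 'L'/'H' items by a per-gene
    median split.  samples: list of dicts mapping gene name -> value.
    Returns one item set per sample, e.g. {'g1=L', 'g2=H'}."""
    genes = list(samples[0])
    medians = {}
    for g in genes:
        vals = sorted(s[g] for s in samples)
        n = len(vals)
        medians[g] = (vals[(n - 1) // 2] + vals[n // 2]) / 2
    return [{f"{g}={'H' if s[g] > medians[g] else 'L'}" for g in genes}
            for s in samples]
```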
Top minimal EPs w/ infinite freq ratio

Normal EPs                   Freq. in normal
{25 33 37 41 43 57 59 69}    77.3%
{25 33 37 41 43 47 57 69}    77.3%
{29 33 35 37 41 43 57 69}    77.3%
{29 33 37 41 43 47 57 69}    77.3%
{6 43 57}                    77.3%
{6 47 57}                    77.3%
{6 57 69}                    77.3%
……

Cancer EPs    Freq. in cancer
{2 10}        70%
{3 10}        67.5%
{10 20}       67.5%
{10 21}       67.5%
{21 58}       65%
{15 40 56}    62.5%
{21 40 56}    62.5%

Minimal EPs with infinite ratio (jumping EPs): all their proper subsets occur in both classes of tissues
Papers using EP techniques appeared in Cancer Cell (cover, 3/02) & in Bioinformatics
EP Types of Particular Interest (1)
• Minimal jumping EPs for normal tissues
– Properly expressed gene groups important for normal cell functioning, but destroyed in all colon cancer tissues
– Restore these → ?cure colon cancer?
• Minimal jumping EPs for cancer tissues
– Bad gene groups that occur in some cancer tissues but never occur in normal tissues
– Disrupt these → ?cure colon cancer?
• ? Possible targets for drug design ?
• Good for classification (later)!
EP Types of Particular Interest (2)
• Emerging trends in timestamped DBs
– E.g. enrollment of US students in major Canadian univ’s increased by 86% during 99–02, to 5000
– This was news in US papers (Oct 02)
– Perhaps an opportunity for Canadian universities
• Note: Dominating trends → not opportunities (either you have won or you are out)
Related work
• Classification/discriminant rules
– We’re not limited to classification/high level rules
• Association rules
– We are more tightly coupled with objectives of
application (divide data into “good” and “bad”)
• Changes in models of datasets
– Only compare fitted decision trees
• Other work usually assumes frequency
threshold; we may not
EP Mining Algorithms
• Border-based approach (KDD 99)
– Produces border descriptions of desired collections of EPs (structured & concise)
– Manipulates borders to get the answer
• Constraint-based approach (KDD 00)
– Look ahead, bound, prune
• Tree-based approach (Bailey et al, 01)
– Organize data in a tree manner to encourage sharing/reducing work
• Still room for improvement (high dimensions)
Borders describe large collections
• <{12,13}, {12345,12456}>
– L (min): 12, 13
– R (max): 12345, 12456
– Represents every set X with S ⊆ X ⊆ T for some S in L and some T in R
– E.g. 123, 124, 125, 126, 134, 135, 1234, 1235, 1245, 1246, 1256, 1345 (= {1,3,4,5}), …
Border-Diff: Effect
• <{{}},{1234}> − <{{}},{34,24,23}> = <{1,234},{1234}>
– [Subset lattice of 1234: {}; 1, 2, 3, 4; 12, 13, 14, 23, 24, 34; 123, 124, 134, 234; 1234 — don’t expand the collections]
• Similar to: [1,100] − [1,50] = (50,100]
• Good for: jumping EPs; EPs in rectangle regions, …
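The Border-Diff result above can be checked with a brute-force sketch: the left bound of the difference is the set of minimal subsets of the universe not covered by any right-bound set. The real Border-Diff manipulates borders directly instead of enumerating subsets; this version is for illustration only.

```python
from itertools import combinations

def border_diff(universe, right_bounds):
    """Minimal subsets of `universe` not contained in any set of `right_bounds`,
    i.e. the left bound of <{{}},{universe}> - <{{}},right_bounds>.
    Brute force for illustration; real Border-Diff avoids enumeration."""
    rights = [frozenset(r) for r in right_bounds]
    minimal = []
    elems = sorted(universe)
    for k in range(1, len(elems) + 1):          # ascending size => minimality
        for combo in combinations(elems, k):
            s = frozenset(combo)
            if any(s <= r for r in rights):     # still covered: not in the diff
                continue
            if any(m < s for m in minimal):     # a smaller diff set inside s
                continue
            minimal.append(s)
    return set(minimal)
```

On the slide's example this yields {1} and {234}, matching <{1,234},{1234}>.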
EP-based Classification
• Classification by aggregating power of EPs

Normal EPs          Freq. in normal
{25 33 37 41 43}    80%
{25 33 37 41 63}    77.3%
{29 33 35 37 41}    77.3%
{6 43 67}           77.3%
{6 47 77}           77.3%
{6 57 69}           60%

Cancer EPs          Freq. in cancer
{2 10}              70%
{3 10}              67.5%
{10 20}             67.5%
{21 58}             65%
{15 40 56}          62.5%
{21 40 56}          62.5%

• T = {2 6 10 25 33 37 41 43 47 57 69}
– Normal score(T) = 0.8 + 0.6 = 1.4
– Cancer score(T) = 0.7
– Class(T) = Normal
– May also normalize scores …
• We gave several proposals since 1999
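The score aggregation above is a one-line sum over each class's EP table (CAEP-style; the normalization mentioned on the slide is omitted here).

```python
def classify(instance, eps_by_class):
    """CAEP-style scoring: sum the frequencies of each class's EPs that are
    contained in the instance, then predict the highest-scoring class.
    eps_by_class: {class_name: [(pattern_set, frequency), ...]}."""
    scores = {cls: sum(freq for pattern, freq in eps if pattern <= instance)
              for cls, eps in eps_by_class.items()}
    return max(scores, key=scores.get), scores
```

Running it on the slide's tables reproduces Normal score 1.4 vs Cancer score 0.7.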
EP-based Classification
• Very high accuracy: Outperforms best of five
other classifiers in 2/3 of 30 UCI datasets
• Outperforms SVM on gene expression data
• Variants
– Using different subsets of selected EPs
– Perhaps instance-driven for EP discovery
and score computation
Why EP-based classifiers are good
• Use discriminating power of low support
EPs, together with high support ones
• Use multi-feature conditions, not just
single-feature conditions
• Select from larger pools of discriminative
conditions
– Compare: The search space of patterns for
decision trees is limited by early choices.
• Combine power of a diversified committee
of “experts” (EPs)
• Decision is highly understandable
Outline
• Introduction
– Knowledge discovery from databases
– Changes, differences, and trends
• Contributions
– Changes between datasets (KDD 99 & more)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Decision support in data cubes
• Used for learning from consolidated historical data: anomalies, unusual factor combinations (Wal-Mart success story; initial idea: Codd et al 93)
• Focus on modeling & analysis of data for decision makers, not daily operations
• Data organized around major subjects or factors, such as customer, product, time, sales
• Contain huge numbers of summaries at different levels of detail
• OLAP operators provided for data analysis
Data Cubes -- Base Cells
• Sales volume (measure) as a function of product, time, and location (dimensions)
• Hierarchical summarization paths:
– Product: Industry → Category → Product
– Location: Region → Country → City → Office
– Time: Year → Quarter → Month/Week → Day
• Base cells sit at the finest level of each path
Data Cubes: Derived Cells
[3-D cube over Time (1Qtr–4Qtr, sum), Location (U.S.A, Canada, Mexico), and Product (TV, PC, VCR, sum); e.g. the cell (TV, *, Mexico)]
• Aggregates: sum, count, avg, max, min, …
• Derived cells offer different levels of detail
Gradient problem
• Find pairs of similar cells (conditions) having big changes in measure values
– Q: Find pairs of similar conditions having big changes in total sale price
– A: Sales of trucks in West went down 20% from 99 to 00; sales of (SUVs, East, June01) are 10% higher than (SUVs, West, June01) ……
• Similar cells: ancestor/descendant pairs, sibling pairs
• Considered by Imielinski et al as the Cubegrade Problem
• No constraint → costly (see next slide)
Huge Space of Cuboids and Cells
[Cuboid lattice, coarse to fine: *** at the top; A**, *B*, **C; AB*, A*C, *BC; ABC at the bottom]
• Each node is a cuboid; *: ALL
• Each cuboid represents a set of cells
• Cuboids (and cells) form lattices
Constrained Gradient Mining
• Csig: (cnt ≥ 100)
• Cprb: (city=“Van”, cust_grp=“busi”, prod_grp=“*”)
• Cgrad(cg, cp): (avg_price(cg) / avg_price(cp) ≥ 1.3)
• (c4, c2) satisfies Cgrad!

Dimensions and measures:

cid  Yr  City  Cst_grp  Prd_grp  Cnt    Avg_price
c1   00  Van   Busi     PC       300    2100
c2   *   Van   Busi     PC       2800   1800
c3   *   Tor   Busi     PC       7900   2350
c4   *   *     busi     PC       58600  2250

• c4 is an ancestor of c1, c2, c3; c2 and c3 are siblings
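The constrained-gradient definition can be sketched as a brute-force check over materialized cells (the LiveSet-driven algorithm on the following slides exists precisely to avoid this all-pairs search). The similarity test here is a simplification: two cells count as similar when every dimension matches or is '*' in one of them, which folds together the ancestor/descendant and sibling cases. The ratio threshold in the test is illustrative.

```python
def gradient_pairs(cells, min_cnt, min_ratio):
    """Brute-force constrained-gradient check.  cells maps a cell id to
    (dims, cnt, avg_price), with '*' meaning ALL in a dimension.  Two cells
    are treated as similar when every dimension agrees or is '*' in one of
    them -- a simplification of the ancestor/descendant/sibling relations."""
    def similar(a, b):
        return all(x == y or x == "*" or y == "*" for x, y in zip(a, b))
    return [(g, p)
            for g, (gd, gcnt, gavg) in cells.items()
            for p, (pd, pcnt, pavg) in cells.items()
            if g != p and similar(gd, pd)
            and gcnt >= min_cnt and pcnt >= min_cnt
            and gavg / pavg >= min_ratio]
```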
LiveSet-Driven Algorithm -- Main Idea
• Compute iceberg of probe cells P using Csig & Cprb
• Use P and Cgrad to find gradients
– Traverse gradient cells in coarse-to-fine manner, using iceberg H-cubing (SIGMOD 01)
– Deal with all potential probe cells in one traversal (as live set of probe cells)
– Dynamically prune live set during traversal
LiveSet
• LiveSet(c): set of probe cells cp that may form a gradient–probe pair w/ some descendant of current cell c
– View current cell as a “set of potential gradient cells”
• Csig: cnt ≥ 100
• Cgrad(cg, cp): (cnt(cg) / cnt(cp) ≥ 2)

cid  Yr  City  Cstgrp  Prdgrp  Cnt   Avgprice
p1   00  Van   Edu     PC      100   1500
p2   99  Tor   *       PC      4000  1800
p3   *   Mon   Busi    PC      1500  8000
p4   *   Edm   *       Ski     2000  10000
p5   *   Whi   *       Ski     1000  10050

• p1, …, p5: global probe cells
• Current cell c1 = (*, *, Edu, *), cnt = 800
• LiveSet(c1) = {p2, p4}
2-Way Pruning of Gradient Cells and Probe Cells Using LiveSet
• Prune current gradient cell c if LiveSet(c) = {}
• Prune probe cells cp if cp can be ignored in searching c’s descendants
– Use min-max boundary check: if the constraint is cnt(cg)/cnt(cp) ≥ 2 and the cnt values in the live set are 10, 18, 32, … (min(cnt) = 10), then 19/10 < 2 → gradient cells w/ cnt ≤ 19 can be pruned
• Handle non-anti-monotone constraints, using weaker constraints for pruning (SIGMOD 01)
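The min-max boundary check above amounts to a tiny predicate:

```python
def can_prune_gradient(cnt_g, liveset_cnts, min_ratio=2):
    """Min-max boundary check: a gradient cell whose count is cnt_g (an upper
    bound for its descendants under cnt) can be pruned when even the smallest
    probe count in the live set cannot reach the required ratio."""
    return cnt_g / min(liveset_cnts) < min_ratio
```

With live-set counts 10, 18, 32 this prunes any gradient cell with cnt ≤ 19, as on the slide.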
Pruning Probe Cells by Dimension Matching Analysis
• Derive LiveSet of child c2 from LiveSet of parent c1
– Since LiveSet(c2) ⊆ LiveSet(c1)
• Discard probe cells in LiveSet(c2) that are unmatchable with c2

cid  Yr  City  Cst_grp  Prd_grp  Cnt   Avg_price  # of mismatches (with c3)
p1   00  Van   Edu      PC       100   1500       1
p2   99  Tor   *        PC       4000  1800       1
p3   *   Mon   Busi     PC       1500  8000       1, 1*

• LiveSet(c1) = {p1, p2, p3}, c1 = (00, Tor, *, *)
• LiveSet(c2) = {p1, p2}, c2 = (00, Tor, *, PC)
An efficient H-cubing method using H-tree
• H-tree: efficient way to organize data, & to promote sharing/reuse of computation
[Diagram: header table of attribute values (Edu, Hhd, Bus, …, Jan, Feb, …, Tor, Van, Mon, …) with sum/cnt aggregates (e.g. Sum: 2285) and side-links into an H-tree rooted at root, with branches such as Edu → Jan → Tor (Sum: 1765, Cnt: 2) and auxiliary info (bins) at the nodes]
H-cubing: Computing Cells Involving Dimension City
[Diagram: a header table HTor collects, via side-links, all H-tree branches passing through Tor, enabling computation of cells such as from (*, *, Tor) to (*, Jan, Tor)]
Scalability on Number of Probe Cells
Scalability on Gradient Threshold
Scalability on Significance Threshold
Scalability on Number of Tuples
[experimental figures]
Outline
• Introduction
– Knowledge discovery from databases
– Changes, differences, and trends
• Contributions
– Changes between datasets (KDD 99 & more)
– Changes in data cubes (VLDB 01 & SIGMOD 01)
– Trends in data cubes (VLDB 02)
• Concluding remarks
Multi-Dimensional Trends Analysis of Sets of Time-Series -- Overview
• Consider applications having many time series
– Stocks, power grids, sensor nets, internet, gene expressions for toxicology, …
• Needs for MDML trends analysis
– Mining/monitoring unusual patterns/events, in MDML manner
• Regression cube for time series
– Store regression base cube
– Support MDML OLAP of regressions
• Results also useful for MDML data stream monitoring
Why MDML trends analysis
• Many time series
– E.g. prices of 10000s of stocks; one time series per stock
• Objectives
– Understand behavior of stocks/stock groups
– Find patterns of stock groups
– Monitor unusual events
– Find “groups of stocks” -- variables -- with interesting patterns (MDML search)
Regression based trends analysis
• A time series: (ti, zi), i = 1..n
• Linear regression model is a linear fitting curve z = a0 + a1·t, with least square error
• Can generalize regression to z = a0 + a1·f1(t) + a2·f2(t) + … + ak·fk(t), where each fi is a fixed function of t
• Common tool for trends analysis
• But limited to situations where the “variables” (groups of time series) are known
Regression cube for time series
• There is one initial time series per base cell
• Too costly to fully store all time series
• Regression base cube
– Only store regression parameters of base cells (4 values vs 10000s)
– Can we support MDML OLAP of regressions, using only the regression base cube, in a lossless manner?
• Answer is yes, for “roll up” both on standard dimensions and on the time dimension
Aggregation in Standard Dimensions
• Two component cells → aggregated cell
• We can derive the regression of the aggregated cell from the regression parameters of the component cells
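For aggregation along a standard dimension, the component cells share the same time ticks and the aggregated measure is their sum at each tick. Least squares is linear in z, so the fitted parameters of the aggregated cell are simply the sums of the component parameters; a sketch with simple linear regression:

```python
def linfit(ts, zs):
    """Least-squares line z = a0 + a1*t via the normal equations."""
    n, st = len(ts), sum(ts)
    stt = sum(t * t for t in ts)
    sz = sum(zs)
    stz = sum(t * z for t, z in zip(ts, zs))
    a1 = (n * stz - st * sz) / (n * stt - st * st)
    return (sz - a1 * st) / n, a1

def aggregate_std_dim(p1, p2):
    """Component cells share time ticks and the measure is summed, so the
    least-squares parameters of the aggregated cell are just the sums."""
    return p1[0] + p2[0], p1[1] + p2[1]
```

This is what makes the aggregation lossless: no raw series is needed, only the stored parameters.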
Aggregation in Time Dimension
• Cells of 2 adjacent time intervals → aggregated cell
• We can derive the regression of the aggregated cell from the regression parameters of the component cells
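For aggregation along the time dimension, keeping a small fixed set of sufficient statistics per interval makes concatenating adjacent intervals lossless. This is one simple realization of the idea; the compressed per-cell representation used in the VLDB 02 paper differs in detail.

```python
def suffstats(ts, zs):
    """(n, sum t, sum t^2, sum z, sum t*z): enough to recover the linear fit,
    so a cell need not keep its raw time series."""
    return (len(ts), sum(ts), sum(t * t for t in ts), sum(zs),
            sum(t * z for t, z in zip(ts, zs)))

def merge(s1, s2):
    """Aggregating two adjacent time intervals = adding their statistics."""
    return tuple(a + b for a, b in zip(s1, s2))

def fit_from_stats(s):
    """Solve the normal equations for z = a0 + a1*t from the statistics."""
    n, st, stt, sz, stz = s
    a1 = (n * stz - st * sz) / (n * stt - st * st)
    return (sz - a1 * st) / n, a1
```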
Remarks on Regression Cube
• Efficient storage; scalable (independent of number of tuples in data cells)
• Lossless aggregation without accessing raw data
• Fast and efficient aggregation
• Regression models of data cells at all levels
• Results cover a large and popular class of regressions (linear, polynomial, and other models)
Concluding remarks
• Mining knowledge about changes, differences, & trends (CDT) is useful & exciting
• Traditional approaches focus on a high-level view
• We considered CDT mining in transactions, relations, & data cubes
• We used discovered CDT patterns for classification, niche mining, & bioinformatics & medical studies
• Future work: mining useful CDT knowledge for bioinformatics, bio-medicine, business, …
References: Changes,
Differences, & Trends
• S. D. Bay and M. J. Pazzani. Detecting group differences: Mining
contrast sets. Data Mining and Knowledge Discovery, 2001.
• Y. Cai, N. Cercone, and J. Han. Attribute-oriented induction in
relational databases. In Knowledge Discovery in Databases,
AAAI/MIT Press, 1991.
• G. Dong and K. Deshpande. Efficient mining of niches and set
routines. In Pacific-Asia Conf. On Knowledge Discovery & Data Mining,
2001.
• G. Dong and J. Li. Efficient mining of emerging patterns: Discovering
trends and differences. In Proc. of the 5th ACM SIGKDD Int'l Conf. On
Knowledge Discovery and Data Mining, 1999.
• G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by
aggregating emerging patterns. In Proc. 2nd Int'l Conf. on Discovery
Science, Tokyo, 1999.
• V. Ganti, J. Gehrke, R. Ramakrishnan, and W. Y. Loh. A framework for
measuring changes in data characteristics. In PODS, 1999.
• J. Li, G. Dong, and K. Ramamohanarao. Instance-based classification
by emerging patterns. In European Conf. of Principles and Practice of
Knowledge Discovery in Databases, Lyon, France, 2000.
References: Changes, Difference
and Trends (Cont’d)
• J. Li, G. Dong, K. Ramamohanarao. Making use of the most expressive
jumping emerging patterns for classification. In Proc Pacific Asia Conf.
on Knowledge Discovery & Data Mining, 2000.
• J. Li, K. Ramamohanarao, G. Dong. Combining the strength of pattern
frequency and distance for classification. In Pacific-Asia KDD, 2001.
• J. Li, L. Wong. Identifying good diagnostic genes or genes groups from
gene expression data by using the concept of emerging patterns.
Bioinformatics. 18:725--734, 2002.
• Bing Liu, Wynne Hsu, Heng-Siew Han, and Yiyuan Xia. Mining changes
for real-life applications. In DaWaK, 2000.
• Bing Liu, Wynne Hsu, and Yiming Ma. Discovering the set of
fundamental rule changes. In KDD, 2001.
• Eng-Juh Yeoh, …, Jinyan Li, …,Limsoon Wong, James R. Downing.
Classification, subtype discovery, and prediction of outcome in
pediatric acute lymphoblastic leukemia by gene expression profiling.
Cancer Cell, 1:133—143, March 2002.
• X. Zhang, G. Dong, K. Ramamohanarao. Exploring constraints to
efficiently mine emerging patterns from large high-dimensional
datasets. In KDD, 2000.
References: Changes and Trends
(Data Cubes)
• S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R.
Ramakrishnan, and S. Sarawagi. On the computation of
multidimensional aggregates. VLDB'96.
• K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and
iceberg cubes. SIGMOD'99.
• S. Chaudhuri and U. Dayal. An overview of data warehousing and
OLAP technology. ACM SIGMOD Record, 26:65-74, 1997.
• Y. Chen, G. Dong, J. Han, B. W. Wah, J. Wang. Multi-Dimensional
Regression Analysis of Time-Series Data Streams. VLDB 2002.
• E. F. Codd, S. B. Codd, and C. T. Salley. Providing OLAP (on-line
analytical processing) to user-analysts: an IT mandate. Tech Report,
Codd Associates, 1993.
• G. Dong, J. Han, J. Lam, J. Pei, K. Wang. Mining Multi-Dimensional
Constrained Gradients in Data Cubes. VLDB 2001.
• M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D.
Ullman. Computing iceberg queries efficiently. VLDB'98.
References: Changes and Trends (Data Cubes) (Cont’d)
• J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. Data Mining and Knowledge Discovery, 1:29-54, 1997.
• J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. SIGMOD'01.
• V. Harinarayan, A. Rajaraman, and J. D. Ullman. Implementing data cubes efficiently. SIGMOD'96.
• T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. Tech Report, Computer Science, Rutgers Univ, Aug. 2000.
• L. V. S. Lakshmanan, J. Pei, J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB 2002.
• K. Ross and D. Srivastava. Fast computation of sparse datacubes. VLDB'97.
• S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
• Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.
Extra Slides
• Just in case …
Base Cells: Tuples of a Relation

Product  Location   Time      Sale
Printer  Manhattan  Jan 1999  100K
Laptop   Queens     Jan 1999  800K
…        …          …         …
Data Cubes: OLAP OPs
• Rollup, drilldown, pivot, slice/dice
[3-D cube over Time (1Qtr–4Qtr, sum), Location (U.S.A, Canada, Mexico), and Product (TV, PC, VCR, sum)]
Experimental Results
• Constraints:
– Csig is on cnt
– Cprb selects a set of cells
– Cgrad(cg, cp): (avg_price(cg) / avg_price(cp) ≥ s)
• Data set
– 10 dimensions
– 10k–20k tuples
– Cardinality 10 for each dimension
– Measure range: 100–1000
• All-Pairs: one independent search per probe cell