Quantitative Association Rules
- Up to now: associations of boolean attributes only
- Now: numerical attributes, too
- Example:

  Original database:

      ID | age | marital status | # cars
       1 |  23 | single         |      0
       2 |  38 | married        |      2

  Boolean database:

      ID | age: 20..29 | age: 30..39 | m-status: single | m-status: married | ...
       1 |           1 |           0 |                1 |                 0 | ...
       2 |           0 |           1 |                0 |                 1 | ...

WS 2003/04, Data Mining Algorithms
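The boolean conversion above can be sketched in a few lines. This is an illustrative snippet, not the slides' implementation; the record fields and decade-sized age bins are assumptions.

```python
# Hypothetical sketch: turn the numeric table into a boolean database
# by binning "age" into decades and expanding the categorical
# attributes into indicator items.

records = [
    {"id": 1, "age": 23, "marital_status": "single", "cars": 0},
    {"id": 2, "age": 38, "marital_status": "married", "cars": 2},
]

def to_boolean(record):
    """Map one record to a set of boolean attributes (items)."""
    items = set()
    decade = (record["age"] // 10) * 10          # e.g. 23 -> 20
    items.add(f"age: {decade}..{decade + 9}")    # "age: 20..29"
    items.add(f"m-status: {record['marital_status']}")
    items.add(f"# cars: {record['cars']}")
    return items

boolean_db = [to_boolean(r) for r in records]
print(sorted(boolean_db[0]))  # ['# cars: 0', 'age: 20..29', 'm-status: single']
```

Each record then supports a set of boolean items, so the standard itemset machinery applies unchanged.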
Static Discretization of Quantitative Attributes

- Discretized prior to mining using a concept hierarchy.
- Numeric values are replaced by ranges.
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets.
- Mining from data cubes can be much faster.

  [Figure: lattice of cuboids: (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)]
Quantitative Association Rules: Ideas

- Static discretization
  - Discretization of all attributes before mining the association rules
  - E.g., by using a generalization hierarchy for each attribute
  - Substitute numerical attribute values by ranges or intervals
- Dynamic discretization
  - Discretization of the attributes during association rule mining
  - Goal (e.g.): maximization of confidence
  - Unification of neighboring association rules into a generalized rule
Quantitative Association Rules

- Numeric attributes are dynamically discretized
  - such that the confidence or compactness of the mined rules is maximized.
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
- Cluster "adjacent" association rules to form general rules using a 2-D grid.
- Example:
    age(X, "30-34") ∧ income(X, "24K - 48K") ⇒ buys(X, "high resolution TV")
Quantitative Association Rules: Basic Notions

[Srikant & Agrawal 1996a]

- Let I = {i1, ..., im} be a set of literals called "attributes".
- Let IV = I × ℕ+ be a set of attribute-value pairs.
- Let D be a set of data objects R, R ⊆ IV.
  - Each attribute may occur at most once in a data object.
- Let IR = {⟨x, u, o⟩ ∈ I × ℕ+ × ℕ+ | u ≤ o},
  where ⟨x, u, o⟩ denotes an attribute x with an assigned range of values [u..o].
- Attribute(X) for X ⊆ IR: the set {x | ⟨x, u, o⟩ ∈ X}.
Quantitative Association Rules: Basic Notions (2)

- Quantitative item: an element of IR.
- Quantitative itemset: a set X ⊆ IR.
- A data object R supports a set X ⊆ IR iff
  for each ⟨x, u, o⟩ ∈ X there is a pair ⟨x, v⟩ ∈ R with u ≤ v ≤ o.
- Support of a quantitative itemset X in D:
  percentage of data objects in D that support X.
- Quantitative association rule: X ⇒ Y where
  X ⊆ IR, Y ⊆ IR, and Attribute(X) ∩ Attribute(Y) = ∅.
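The support definition above translates directly into code. A minimal sketch, assuming data objects are dicts from attribute to value and quantitative items are (attribute, lower, upper) triples:

```python
# A data object R supports itemset X iff every range in X is matched
# by the object's value for that attribute.

def supports(record, itemset):
    return all(
        attr in record and lo <= record[attr] <= hi
        for (attr, lo, hi) in itemset
    )

def support(itemset, db):
    """Fraction of data objects in db that support the itemset."""
    return sum(supports(r, itemset) for r in db) / len(db)

db = [{"age": 23, "cars": 0}, {"age": 38, "cars": 2}]
X = {("age", 30, 39), ("cars", 2, 2)}
print(support(X, db))  # 0.5
```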
Quantitative Association Rules: Basic Notions (3)

- Support s of a quantitative association rule X ⇒ Y in D:
  support of the set X ∪ Y in D.
- Confidence c of a quantitative association rule X ⇒ Y in D:
  percentage of transactions that contain set Y within the subset of transactions that contain set X.
- Itemset X' is a generalization of an itemset X (and X a specialization of X') if
  - X and X' contain the same attributes, and
  - the ranges in the elements of X are completely contained in the respective ranges of the elements of X'.
- This corresponds to the ancestor and descendant relationships in hierarchical association rules.
Quantitative Association Rules: Method

- Discretization of numerical attributes
  - Choose appropriate ranges
  - Preserve the original order of the ranges
- Transformation of categorical attributes to contiguous integers
- Transformation of the data records in D
  - according to the transformations of the attributes
- Determine the support of each individual attribute-value pair in D
Quantitative Association Rules: Method (2)

- Merge neighboring attribute values into ranges
  - while the support of the resulting ranges is smaller than maxsup
  - yields the frequently occurring 1-itemsets
- Find all frequent quantitative itemsets
  - by using a variant of the Apriori algorithm
- Determine quantitative association rules from the frequent itemsets
- Remove uninteresting rules
  - Remove rules that have an interest smaller than min-interest
  - Similar interest measure as for hierarchical association rules
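The merging step above can be sketched as a greedy scan over the base intervals (after the attribute has been mapped to consecutive integers). The per-interval counts, total, and maxsup threshold below are hypothetical, and the greedy policy is only one possible reading of the slide:

```python
# Greedily merge neighbouring base intervals as long as the merged
# range's support stays at or below maxsup.

def merge_intervals(counts, total, maxsup):
    """counts: per-base-interval frequencies, in attribute order.
    Returns (lo, hi) index ranges whose support stays <= maxsup."""
    ranges, lo, acc = [], 0, 0
    for i, c in enumerate(counts):
        if acc and (acc + c) / total > maxsup:
            ranges.append((lo, i - 1))   # close the current range
            lo, acc = i, 0               # start a new one at interval i
        acc += c
    ranges.append((lo, len(counts) - 1))
    return ranges

# 4 base age intervals with frequencies 1, 2, 1, 1 out of 5 records:
print(merge_intervals([1, 2, 1, 1], 5, 0.5))  # [(0, 0), (1, 1), (2, 3)]
```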
Quantitative Association Rules: Example

Original database:

    ID  | age | marital status | # cars
    100 |  23 | single         |      1
    200 |  25 | married        |      1
    300 |  29 | single         |      0
    400 |  34 | married        |      2
    500 |  38 | married        |      2

Preprocessing: map age intervals and marital status values to integers:

    interval | integer          value   | integer
    20..24   |       1          married |       1
    25..29   |       2          single  |       2
    30..34   |       3
    35..39   |       4

Database after preprocessing:

    ID  | age | marital status | # cars
    100 |   1 |              2 |      1
    200 |   2 |              1 |      1
    300 |   2 |              2 |      0
    400 |   3 |              1 |      2
    500 |   4 |              1 |      2
Quantitative Association Rules: Example (2)

Original database:

    ID  | age | marital status | # cars
    100 |  23 | single         |      1
    200 |  25 | married        |      1
    300 |  29 | single         |      0
    400 |  34 | married        |      2
    500 |  38 | married        |      2

Frequent itemsets (absolute support counts):

    Itemset                               | Support
    ( <age: 30..39> )                     |       2
    ( <marital status: married> )         |       3
    ( <marital status: single> )          |       2
    ( <#cars: 0..1> )                     |       3
    ( <age: 30..39> <m-status: married> ) |       2
    ...                                   |     ...

Resulting rules:

    Rule                                                 | Support | Confidence
    <age: 30..39> and <m-status: married> --> <#cars: 2> |     40% |       100%
    <age: 20..29> --> <#cars: 0..1>                      |     60% |        67%
Quantitative Association Rules: Partitioning of Numerical Attributes

- Problem: minimum support
  - Too many intervals → too small a support for each individual interval
  - Too few intervals → too small a confidence of the rules
- Solution
  - Partition the domain into many intervals
  - Additionally, take ranges into account that arise from merging adjacent intervals
  - On average, there are O(n²) ranges that contain a given value
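The O(n²) claim above is easy to verify by counting: with n base intervals, a value in interval k lies in every merged range [i..j] with i ≤ k ≤ j, of which there are (k+1)(n−k). A quick check (the choice of n = 10 is arbitrary):

```python
# Count the contiguous merged ranges that contain a given base interval.

def ranges_containing(n, k):
    """Number of ranges [i..j], 0 <= i <= j < n, with i <= k <= j."""
    return (k + 1) * (n - k)

n = 10
avg = sum(ranges_containing(n, k) for k in range(n)) / n
print(avg)  # 22.0 for n = 10; grows quadratically in n
```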
ARCS (Association Rule Clustering System)

How does ARCS work?

1. Binning
2. Find frequent predicate sets
3. Clustering
4. Optimize
Limitations of ARCS

- Only quantitative attributes on the LHS of rules.
- Only 2 attributes on the LHS (2-D limitation).
- An alternative to ARCS:
  - non-grid-based
  - equi-depth binning
  - clustering based on a measure of partial completeness
  - "Mining Quantitative Association Rules in Large Relational Tables" by R. Srikant and R. Agrawal
Mining Distance-based Association Rules

- Binning methods do not capture the semantics of interval data:

    Price ($) | Equi-width (width $10) | Equi-depth (depth 2) | Distance-based
            7 | [0,10]                 | [7,20]               | [7,7]
           20 | [11,20]                | [22,50]              | [20,22]
           22 | [21,30]                | [51,53]              | [50,53]
           50 | [31,40]                |                      |
           51 | [41,50]                |                      |
           53 | [51,60]                |                      |

- Distance-based partitioning gives a more meaningful discretization, considering:
  - the density / number of points in an interval
  - the "closeness" of points in an interval
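Two of the three partitionings in the table above can be reproduced in a few lines. The "distance-based" column is shown here with a simple gap threshold, which is only a stand-in for the real clustering step; the threshold value is an assumption:

```python
prices = [7, 20, 22, 50, 51, 53]

# Equi-depth, depth 2: consecutive pairs of the sorted values.
equi_depth = [(prices[i], prices[i + 1]) for i in range(0, len(prices), 2)]
print(equi_depth)  # [(7, 20), (22, 50), (51, 53)]

# Distance-based (stand-in): start a new interval whenever the gap
# to the previous value exceeds a threshold.
def gap_partition(values, max_gap):
    parts = [[values[0]]]
    for a, b in zip(values, values[1:]):
        if b - a > max_gap:
            parts.append([])
        parts[-1].append(b)
    return [(p[0], p[-1]) for p in parts]

print(gap_partition(prices, 5))  # [(7, 7), (20, 22), (50, 53)]
```

Note how the gap-based grouping keeps 20 and 22 together and separates the outlier 7, matching the distance-based column, while equi-depth splits near neighbours apart.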
Clusters and Distance Measurements

- S[X] is a set of N tuples t1, t2, ..., tN, projected on the attribute set X.
- The diameter of S[X]:

    d(S[X]) = ( Σ_{i=1}^{N} Σ_{j=1}^{N} dist_X(t_i[X], t_j[X]) ) / ( N (N − 1) )

- dist_X: a distance metric, e.g. Euclidean or Manhattan distance.
Clusters and Distance Measurements (Cont.)

- The diameter d assesses the density of a cluster C_X, where

    d(C_X) ≤ d0    and    |C_X| ≥ s0

- Finding clusters and distance-based rules:
  - The density threshold d0 replaces the notion of support.
  - A modified version of the BIRCH clustering algorithm is used.
Chapter 8: Mining Association Rules

- Introduction
  - Transaction databases, market basket data analysis
- Simple Association Rules
  - Basic notions, Apriori algorithm, hash trees, interestingness
- Hierarchical Association Rules
  - Motivation, notions, algorithms, interestingness
- Quantitative Association Rules
  - Motivation, basic idea, partitioning numerical attributes, adaptation of the Apriori algorithm, interestingness
- Constraint-based Association Mining
- Summary
Constraints for Association Rules: Motivation

- Too many frequent itemsets
  - Efficiency problem
- Too many association rules
  - Evaluation problem
- Sometimes constraints are known in advance, e.g.:
  - "only rules containing product a but not product b"
  - "only rules containing items with an overall price of > 100"
- → Constraints for frequent itemsets
Constraints for Association Rules

Types of constraints [Ng, Lakshmanan, Han & Pang 1998]:

- Domain constraints
  - S θ v, θ ∈ { =, ≠, <, ≤, >, ≥ }
    - Example: S.price < 100
  - v θ S, θ ∈ { ∈, ∉ }
    - Example: snacks ∉ S.type
  - V θ S or S θ V, θ ∈ { ⊆, ⊂, ⊄, =, ≠ }
    - Example: {snacks, wines} ⊆ S.type
- Aggregation constraints
  - agg(S) θ v, where agg ∈ {min, max, sum, count, avg} and θ ∈ { =, ≠, <, ≤, >, ≥ }
    - Examples: count(S1.type) = 1, avg(S2.price) > 100
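Checking these constraint families against a candidate itemset is straightforward once items carry attributes. A minimal sketch; the item records, attribute names, and thresholds are hypothetical:

```python
# Candidate itemset S: each item has a type and a price (made-up data).
S = [{"type": "snacks", "price": 3}, {"type": "wines", "price": 120}]

types = {i["type"] for i in S}
prices = [i["price"] for i in S]

# Domain constraints
assert all(p < 200 for p in prices)        # S.price < 200
assert "snacks" in types                   # snacks ∈ S.type
assert {"snacks", "wines"} <= types        # {snacks, wines} ⊆ S.type

# Aggregation constraints
assert sum(prices) > 100                   # sum(S.price) > 100
assert min(prices) == 3 and max(prices) == 120
print("all constraints satisfied")
```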
Further Types of Constraints

- Interactive, exploratory mining of gigabytes of data?
  - Could it be real? Yes, by making good use of constraints!
- What kinds of constraints can be used in mining?
  - Knowledge type constraints: classification, association, etc.
  - Data constraints: SQL-like queries
    - E.g., find product pairs sold together in Vancouver in Dec. '98.
  - Dimension/level constraints:
    - E.g., in relevance to region, price, brand, customer category.
  - Rule constraints:
    - E.g., small sales (price < $10) trigger big sales (sum > $200).
  - Interestingness constraints:
    - E.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%).
Constrained Association Query: Optimization Problem

- Given a CAQ = { (S1, S2) | C }, the algorithm should be:
  - sound: it only finds frequent sets that satisfy the given constraints C
  - complete: all frequent sets that satisfy the given constraints C are found
- A naïve solution:
  - Apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one.
- Our approach:
  - Comprehensively analyze the properties of the constraints and try to push them as deeply as possible into the frequent set computation.
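The naïve solution above can be sketched as brute-force frequent set enumeration followed by a constraint filter. The tiny transaction database, minimum support, and constraint C below are made up for illustration (real Apriori would prune candidates level by level rather than enumerate all subsets):

```python
from itertools import combinations

db = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}]
minsup = 2 / 3
items = sorted(set().union(*db))

def frequent_sets():
    """Enumerate all itemsets meeting minsup (brute force)."""
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = set(cand)
            if sum(s <= t for t in db) / len(db) >= minsup:
                yield s

def constraint(s):          # C: itemsets must contain item "b"
    return "b" in s

# Step 1: find all frequent sets; step 2: test C one by one.
result = [s for s in frequent_sets() if constraint(s)]
print([tuple(sorted(s)) for s in result])  # [('b',), ('a', 'b'), ('b', 'c')]
```

The point of the "push constraints inside" approach is to avoid generating candidates like {a} and {c} at all when C can already rule them out.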
Constraints for Association Rules: Application of Constraints

- When computing the association rules
  - Solves the evaluation problem
  - Does not solve the efficiency problem
- When computing the frequent itemsets
  - Possibly solves the efficiency problem
  - Question at candidate generation time:
    Which itemsets can be excluded by applying the constraints?
Constraints for Association Rules: Anti-Monotonicity

- Definition
  - If a set S violates an anti-monotone constraint, then every superset of S violates the same constraint.
- Examples
  - sum(S.price) ≤ v is anti-monotone
  - sum(S.price) ≥ v is not anti-monotone
  - sum(S.price) = v is partially anti-monotone
- Application
  - Incorporate anti-monotone constraints into candidate generation
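The pruning opportunity can be illustrated with the sum(S.price) ≤ v example. Assuming non-negative prices (the price table and threshold below are hypothetical), a violating set can never be repaired by adding items, so it is safe to prune it during candidate generation:

```python
# Anti-monotone constraint: sum(S.price) <= v with non-negative prices.
price = {"a": 30, "b": 80, "c": 10}
v = 100

def satisfies(itemset):
    return sum(price[i] for i in itemset) <= v

assert not satisfies({"a", "b"})        # 110 > 100: violated ...
assert not satisfies({"a", "b", "c"})   # ... so every superset is too
assert satisfies({"a", "c"})            # 40 <= 100: may still be extended
print("pruning is safe for sum(S.price) <= v")
```

By contrast, sum(S.price) ≥ v is not anti-monotone: a violating set might still satisfy the constraint after adding more items, so it cannot be pruned the same way.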
Constraints for Association Rules: Anti-Monotonicity of Constraint Types

    Type of constraint          | anti-monotone?
    S θ v, θ ∈ { =, ≤, ≥ }      | yes
    v ∈ S                       | no
    S ⊇ V                       | no
    S ⊆ V                       | yes
    S = V                       | partially
    min(S) ≤ v                  | no
    min(S) ≥ v                  | yes
    min(S) = v                  | partially
    max(S) ≤ v                  | yes
    max(S) ≥ v                  | no
    max(S) = v                  | partially
    count(S) ≤ v                | yes
    count(S) ≥ v                | no
    count(S) = v                | partially
    sum(S) ≤ v                  | yes
    sum(S) ≥ v                  | no
    sum(S) = v                  | partially
    avg(S) θ v, θ ∈ { =, ≤, ≥ } | no
    (frequency constraint)      | (yes)
Why Is the Big Pie Still There?

- More on constraint-based mining of associations
- From association to correlation and causal structure analysis
  - Association does not necessarily imply correlation or causal relationships
- From intra-transaction associations to inter-transaction associations
  - E.g., break the barriers of transactions (Lu et al., TOIS '99)
- From association analysis to classification and clustering analysis
  - E.g., clustering association rules
Summary

- Association rule mining
  - is probably the most significant contribution from the database community to KDD
  - A large number of papers have been published
- Many interesting issues have been explored
- An interesting research direction:
  - Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
References

- R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.
- R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, 487-499, Santiago, Chile.
- R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
- R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.
- S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.
- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97, 255-264, Tucson, Arizona, May 1997.
- K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
- D. W. Cheung, J. Han, V. Ng, and C. Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. ICDE'96, 106-114, New Orleans, LA.
- M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
References (2)

- G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.
- Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
- T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96, 13-23, Montreal, Canada.
- E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, 277-288, Tucson, Arizona.
- J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, Sydney, Australia.
- J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-431, Zurich, Switzerland.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.
- T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
- M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.
References (3)

- F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New York, NY.
- B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.
- H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), 12:1-12:7, Seattle, Washington.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
- R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.
- R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
- R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. SIGMOD'98, 13-24, Seattle, Washington.
- N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.
References (4)

- J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95, 175-186, San Jose, CA, May 1995.
- J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00, 11-20, Dallas, TX, May 2000.
- J. Pei and J. Han. Can we push more constraints into frequent pattern mining? KDD'00, Boston, MA, Aug. 2000.
- G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, 229-238. AAAI/MIT Press, 1991.
- B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.
- S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY.
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, 343-354, Seattle, WA.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95, 432-443, Zurich, Switzerland.
- A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
References (5)

- C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98, 594-605, New York, NY.
- R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.
- R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96, 1-12, Montreal, Canada.
- R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73, Newport Beach, California.
- H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept. 1996.
- D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
- K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97, 96-103, Newport Beach, CA, Aug. 1997.
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.
- M. Zaki. Generating non-redundant association rules. KDD'00, Boston, MA, Aug. 2000.
- O. R. Zaiane, J. Han, and H. Zhu. Mining recurrent items in multimedia with progressive resolution refinement. ICDE'00, 461-470, San Diego, CA, Feb. 2000.