Data Mining Algorithms, WS 2003/04

Quantitative Association Rules (8-79)

Up to now: associations of boolean attributes only. Now: numerical attributes, too.

Example:

Original database:

  ID | age | marital status | # cars
   1 |  23 | single         | 0
   2 |  38 | married        | 2

Boolean database:

  ID | age: 20..29 | age: 30..39 | m-status: single | m-status: married | ...
   1 |      1      |      0      |        1         |         0         | ...
   2 |      0      |      1      |        0         |         1         | ...

Static Discretization of Quantitative Attributes (8-80)

- Attributes are discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges.
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans.
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets.

      ()
      (age)  (income)  (buys)
      (age, income)  (age, buys)  (income, buys)
      (age, income, buys)

- Mining from data cubes can be much faster.

Quantitative Association Rules: Ideas (8-81)

- Static discretization
  - Discretization of all attributes before mining the association rules, e.g. by using a generalization hierarchy for each attribute
  - Substitute numerical attribute values by ranges or intervals
- Dynamic discretization
  - Discretization of the attributes during association rule mining
  - Goal (e.g.): maximization of confidence
  - Unification of neighboring association rules into a generalized rule

Quantitative Association Rules (8-82)

- Numeric attributes are dynamically discretized such that the confidence or compactness of the mined rules is maximized.
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
- Cluster "adjacent" association rules to form general rules using a 2-D grid.
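The static discretization of the age / marital-status / # cars table above can be sketched in code. This is an illustration only, using the slide's two records; the attribute names and the `booleanize` helper are chosen here, not taken from the lecture:

```python
# Sketch of static discretization: each quantitative attribute is replaced
# by boolean range-membership attributes, reproducing the boolean database
# shown in the table above.

RECORDS = [
    {"id": 1, "age": 23, "marital_status": "single", "cars": 0},
    {"id": 2, "age": 38, "marital_status": "married", "cars": 2},
]

AGE_RANGES = [(20, 29), (30, 39)]  # ranges from the slide's example


def booleanize(record):
    """Map one record to the 0/1 attributes of the boolean database."""
    row = {"id": record["id"]}
    for lo, hi in AGE_RANGES:
        row[f"age: {lo}..{hi}"] = int(lo <= record["age"] <= hi)
    for status in ("single", "married"):
        row[f"m-status: {status}"] = int(record["marital_status"] == status)
    return row


boolean_db = [booleanize(r) for r in RECORDS]
```

On the two example records this yields exactly the boolean rows of the table: record 1 sets `age: 20..29` and `m-status: single` to 1, record 2 sets `age: 30..39` and `m-status: married` to 1.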
- Example: age(X, "30-34") ∧ income(X, "24K-48K") ⇒ buys(X, "high resolution TV")

Quantitative Association Rules: Basic Notions (8-83) [Srikant & Agrawal 1996a]

- Let I = {i1, ..., im} be a set of literals, called "attributes".
- Let IV = I × ℕ⁺ be a set of attribute-value pairs.
- Let D be a set of data objects R, R ⊆ IV, where each attribute may occur at most once in a data object.
- Let IR = {⟨x, u, o⟩ ∈ I × ℕ⁺ × ℕ⁺ | u ≤ o}, where ⟨x, u, o⟩ denotes an attribute x with an assigned range of values [u..o].
- Attribute(X) for X ⊆ IR: the set {x | ⟨x, u, o⟩ ∈ X}.

Quantitative Association Rules: Basic Notions (2) (8-84)

- Quantitative item: an element of IR; quantitative itemset: a set X ⊆ IR.
- A data object R supports a set X ⊆ IR if for each ⟨x, u, o⟩ ∈ X there is a pair ⟨x, v⟩ ∈ R with u ≤ v ≤ o.
- Support of a quantitative itemset X in D: percentage of data objects in D that support X.
- Quantitative association rule: X ⇒ Y where X ⊆ IR, Y ⊆ IR, and Attribute(X) ∩ Attribute(Y) = ∅.

Quantitative Association Rules: Basic Notions (3) (8-85)

- Support s of a quantitative association rule X ⇒ Y in D: support of the set X ∪ Y in D.
- Confidence c of a quantitative association rule X ⇒ Y in D: percentage of transactions that contain set Y within the subset of transactions that contain set X.
- Itemset X' is a generalization of an itemset X (X is a specialisation of X', resp.)
  if X and X' contain the same attributes and the ranges in the elements of X are completely contained in the respective ranges of the elements in X'.
- This corresponds to the ancestor/descendant relationships in hierarchical association rules.

Quantitative Association Rules: Method (8-86)

- Discretization of numerical attributes
  - Choose appropriate ranges; preserve the original order of the ranges
- Transformation of categorical attributes to contiguous integers
- Transformation of the data records in D according to the transformations of the attributes
- Determine the support of each individual attribute-value pair in D

Quantitative Association Rules: Method (2) (8-87)

- Merge neighboring attribute values into ranges while the support of the resulting ranges is smaller than maxsup → frequently occurring 1-itemsets
- Find all frequent quantitative itemsets by using a variant of the Apriori algorithm
- Determine quantitative association rules from the frequent itemsets
- Remove uninteresting rules: remove rules with an interest smaller than min-interest (similar interest measure as for hierarchical association rules)

Quantitative Association Rules: Example (8-88)

Preprocessing:

  ID  | age | marital status | # cars
  100 |  23 | single         | 1
  200 |  25 | married        | 1
  300 |  29 | single         | 0
  400 |  34 | married        | 2
  500 |  38 | married        | 2

  age interval | integer        marital status | integer
  20..24       | 1              married        | 1
  25..29       | 2              single         | 2
  30..34       | 3
  35..39       | 4

Transformed database:

  ID  | age | marital status | # cars
  100 |  1  | 2              | 1
  200 |  2  | 1              | 1
  300 |  2  | 2              | 0
  400 |  3  | 1              | 2
  500 |  4  | 1              | 2

Quantitative Association Rules: Example (2) (8-89)

Frequent itemsets (with absolute support):

  ( <age: 30..39> )                        2
  ( <marital status: married> )            3
  ( <marital status: single> )             2
  ( <#cars: 0..1> )                        3
  ( <age: 30..39> <m-status: married> )    2
  ...

Rules found:

  <age: 30..39> ∧ <m-status: married> ⇒ <#cars: 2>
  <age: 20..29> ⇒ <#cars: 0..1>
These rules have support 40% and confidence 100%, resp. support 60% and confidence 67%.

Quantitative Association Rules: Partitioning of Numerical Attributes (8-90)

- Problem: minimum support
  - Too many intervals → too small a support for each individual interval
  - Too few intervals → too small a confidence of the rules
- Solution
  - Partition the domain into many intervals
  - Additionally, take ranges into account that arise from merging adjacent intervals
  - On average, there are O(n²) ranges that contain a given value

ARCS (Association Rule Clustering System) (8-91)

How does ARCS work?
1. Binning
2. Find frequent predicate sets
3. Clustering
4. Optimize

Limitations of ARCS (8-92)

- Only quantitative attributes on the LHS of rules
- Only 2 attributes on the LHS (2-D limitation)
- An alternative to ARCS: non-grid-based, equi-depth binning, clustering based on a measure of partial completeness ("Mining Quantitative Association Rules in Large Relational Tables" by R. Srikant and R. Agrawal)

Mining Distance-based Association Rules (8-93)

- Binning methods do not capture the semantics of interval data:

  Price ($) | Equi-width (width $10) | Equi-depth (depth 2) | Distance-based
  7         | [0,10]                 | [7,20]               | [7,7]
  20        | [11,20]                | [22,50]              | [20,22]
  22        | [21,30]                | [51,53]              | [50,53]
  50        | [31,40]                |                      |
  51        | [41,50]                |                      |
  53        | [51,60]                |                      |

- Distance-based partitioning gives a more meaningful discretization, considering the density/number of points in an interval and the "closeness" of points in an interval.

Clusters and Distance Measurements (8-94)

- S[X] is a set of N tuples t1, t2, ..., tN projected on the attribute set X
- The diameter of S[X]:

  d(S[X]) = ( Σ_{i=1..N} Σ_{j=1..N} dist_X(t_i[X], t_j[X]) ) / ( N (N − 1) )

- dist_X: a distance metric, e.g. Euclidean or Manhattan distance

Clusters and Distance Measurements (Cont.) (8-95)
- The diameter d assesses the density of a cluster C_X, where d(C_X) ≤ d₀ and |C_X| ≥ s₀
- Finding clusters and distance-based rules:
  - the density threshold d₀ replaces the notion of support
  - a modified version of the BIRCH clustering algorithm is used

Chapter 8: Mining Association Rules (8-96)

- Introduction: transaction databases, market basket data analysis
- Simple Association Rules: basic notions, Apriori algorithm, hash trees, interestingness
- Hierarchical Association Rules: motivation, notions, algorithms, interestingness
- Quantitative Association Rules: motivation, basic idea, partitioning numerical attributes, adaptation of the Apriori algorithm, interestingness
- Constraint-based Association Mining
- Summary

Constraints for Association Rules: Motivation (8-97)

- Too many frequent itemsets → efficiency problem
- Too many association rules → evaluation problem
- Sometimes constraints are known in advance, e.g.:
  - "only rules containing product a but not product b"
  - "only rules containing items with an overall price of > 100"
- → Constraints for frequent itemsets

Constraints for Association Rules (8-98)

Types of constraints [Ng, Lakshmanan, Han & Pang 1998]:

- Domain constraints
  - S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}; example: S.price < 100
  - v θ S, θ ∈ {∈, ∉}; example: snacks ∉ S.type
  - V θ S or S θ V, θ ∈ {⊆, ⊂, ⊄, =, ≠}; example: {snacks, wines} ⊆ S.type
- Aggregation constraints
  - agg(S) θ v, where agg ∈ {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}
  - examples: count(S1.type) = 1, avg(S2.price) > 100

Further Types of Constraints (8-99)

- Interactive, exploratory mining of giga-bytes of data? Could it be real? — Only by making good use of constraints!
- What kinds of constraints can be used in mining?
  - Knowledge type constraint: classification, association, etc.
  - Data constraint: SQL-like queries, e.g. "find product pairs sold together in Vancouver in Dec. '98"
  - Dimension/level constraints: in relevance to region, price, brand, customer category
  - Rule constraints: e.g. small sales (price < $10) triggers big sales (sum > $200)
  - Interestingness constraints: e.g. strong rules (min_support ≥ 3%, min_confidence ≥ 60%)

Constrained Association Query Optimization Problem (8-100)

- Given a CAQ = { (S1, S2) | C }, the algorithm should be:
  - sound: it only finds frequent sets that satisfy the given constraints C
  - complete: all frequent sets that satisfy the given constraints C are found
- A naïve solution: apply Apriori to find all frequent sets, and then test them for constraint satisfaction one by one.
- A better approach: comprehensively analyze the properties of the constraints and push them as deeply as possible inside the frequent set computation.

Constraints for Association Rules: Application of Constraints (8-101)

- Applying constraints when computing the association rules
  - solves the evaluation problem
  - does not solve the efficiency problem
- Applying constraints when computing the frequent itemsets
  - possibly solves the efficiency problem
  - question at candidate generation time: which itemsets can be excluded by applying the constraints?

Constraints for Association Rules: Anti-Monotony (8-102)

- Definition: If a set s violates an anti-monotone constraint, then every superset of s violates the same constraint.
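Pushing an anti-monotone constraint into level-wise candidate generation can be sketched as follows. This is an illustration under assumed data: the item names and prices are hypothetical, and the constraint sum(S.price) ≤ v is used because any itemset that violates it can be pruned immediately — all of its supersets would violate it too:

```python
# Hypothetical item prices (not from the lecture).
PRICES = {"a": 30, "b": 45, "c": 60, "d": 80}
V = 100  # anti-monotone constraint: sum(S.price) <= 100


def satisfies(itemset):
    """Check the anti-monotone constraint sum(S.price) <= V."""
    return sum(PRICES[i] for i in itemset) <= V


def constrained_candidates(max_size):
    """Level-wise (Apriori-style) generation; violating candidates are
    dropped at once, since every superset would violate the constraint."""
    items = sorted(PRICES)
    frontier = [frozenset([i]) for i in items if satisfies([i])]
    all_sets = list(frontier)
    for _ in range(max_size - 1):
        grown = {s | {i} for s in frontier for i in items
                 if i not in s and satisfies(s | {i})}
        frontier = sorted(grown, key=sorted)
        all_sets += frontier
    return all_sets


cands = constrained_candidates(3)
```

With these prices, {a, d} (price sum 110) is pruned at level 2, so no 3-itemset containing it is ever generated — the pruning happens inside candidate generation, not as a post-filter.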
- Examples
  - sum(S.price) ≤ v is anti-monotone
  - sum(S.price) ≥ v is not anti-monotone
  - sum(S.price) = v is partially anti-monotone
- Application: include anti-monotone constraints in candidate generation

Constraints for Association Rules: Anti-Monotony of Constraint Types (8-103)

  Constraint                 | anti-monotone?
  ---------------------------+---------------
  S θ v, θ ∈ {=, ≤, ≥}       | yes
  v ∈ S                      | no
  S ⊇ V                      | no
  S ⊆ V                      | yes
  S = V                      | partially
  min(S) ≤ v                 | no
  min(S) ≥ v                 | yes
  min(S) = v                 | partially
  max(S) ≤ v                 | yes
  max(S) ≥ v                 | no
  max(S) = v                 | partially
  count(S) ≤ v               | yes
  count(S) ≥ v               | no
  count(S) = v               | partially
  sum(S) ≤ v                 | yes
  sum(S) ≥ v                 | no
  sum(S) = v                 | partially
  avg(S) θ v, θ ∈ {=, ≤, ≥}  | no
  (frequent constraint)      | (yes)

Chapter 8: Mining Association Rules (8-104)

- Introduction: transaction databases, market basket data analysis
- Simple Association Rules: basic notions, Apriori algorithm, hash trees, interestingness
- Hierarchical Association Rules: motivation, notions, algorithms, interestingness
- Quantitative Association Rules: motivation, basic idea, partitioning numerical attributes, adaptation of the Apriori algorithm, interestingness
- Constraint-based Association Mining
- Summary

Why Is the Big Pie Still There? (8-105)

- More on constraint-based mining of associations
- From association to correlation and causal structure analysis
  - association does not necessarily imply correlation or causal relationships
- From intra-transaction associations to inter-transaction associations
  - e.g., break the barriers of transactions (Lu et al., TOIS '99)
- From association analysis to classification and clustering analysis
  - e.g., clustering association rules

Summary (8-106)

- Association rule mining is probably the most significant contribution from the database community to KDD
- A large number of papers have been published, and many interesting issues have been explored
- An interesting research direction: association analysis in other types of data — spatial data, multimedia data, time series data, etc.

References (8-107)

- R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.
- R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, 487-499, Santiago, Chile.
- R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
- R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.
- S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.
- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD'97, 255-264, Tucson, Arizona, May 1997.
- K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
- D. W. Cheung, J. Han, V. Ng, and C. Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. ICDE'96, 106-114, New Orleans, LA.
- M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
References (2) (8-108)

- G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.
- Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
- T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96, 13-23, Montreal, Canada.
- E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, 277-288, Tucson, Arizona.
- J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, Sydney, Australia.
- J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-431, Zurich, Switzerland.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.
- T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of the ACM, 39:58-64, 1996.
- M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.
- M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.

References (3) (8-109)

- F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New York, NY.
- B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.
- H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), 12:1-12:7, Seattle, Washington.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
- R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.
- R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
- R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD'98, 13-24, Seattle, Washington.
- N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.

References (4) (8-110)

- J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95, 175-186, San Jose, CA, May 1995.
- J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00, 11-20, Dallas, TX, May 2000.
- J. Pei and J. Han. Can we push more constraints into frequent pattern mining? KDD'00, Boston, MA, Aug. 2000.
- G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, 229-238. AAAI/MIT Press, 1991.
- B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.
- S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY.
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications.
  SIGMOD'98, 343-354, Seattle, WA.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95, 432-443, Zurich, Switzerland.
- A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.

References (5) (8-111)

- C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98, 594-605, New York, NY.
- R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.
- R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96, 1-12, Montreal, Canada.
- R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73, Newport Beach, California.
- H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept. 1996.
- D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
- K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97, 96-103, Newport Beach, CA, Aug. 1997.
- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.
- M. Zaki. Generating non-redundant association rules. KDD'00, Boston, MA, Aug. 2000.
- O. R. Zaiane, J. Han, and H. Zhu. Mining recurrent items in multimedia with progressive resolution refinement. ICDE'00, 461-470, San Diego, CA, Feb. 2000.