Extending Association Analysis
Michael Steinbach
Ph.D. Defense
© 2005 M. Steinbach

Outline
Introduction
Extending association analysis to non-binary data and non-traditional patterns
– Generalizing the notion of support
– Generalizing the notion of confidence
Creating new types of association patterns
Analyzing the structure of association patterns
Conclusions and future work

Traditional Association Analysis
Association analysis analyzes relationships among items (attributes) in binary transaction data
– Example data: market basket data
– Data can be represented as a binary matrix
– Applications in business and science

Set-based representation of the data:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Binary matrix representation of the same data:

TID  Bread  Milk  Diapers  Beer  Eggs  Coke
1    1      1     0        0     0     0
2    1      0     1        1     1     0
3    0      1     1        1     0     1
4    1      1     1        1     0     0
5    1      1     1        0     0     1

Two types of patterns:
– Itemsets: collections of items. Example: {Milk, Diaper}
– Association rules: X → Y, where X and Y are itemsets. Example: {Milk} → {Diaper}

Traditional Association Analysis …
Association measures evaluate the strength of an association pattern
– Support and confidence are the most commonly used
– The support, σ(X), of an itemset X is the number of transactions that contain all the items of the itemset
  Frequent itemsets have support greater than a specified threshold
  Different types of itemset patterns are distinguished by a measure and a threshold
– The confidence of an association rule is given by conf(X → Y) = σ(X ∪ Y) / σ(X)
  An estimate of the conditional probability of Y given X

Traditional Association Analysis …
Process of finding interesting patterns:
1. Find frequent itemsets using a support threshold
2. Find association rules for the frequent itemsets
3. Sort the association rules by confidence
Support filtering is necessary
– To eliminate spurious patterns
– For efficiency, we need the anti-monotone property: X ⊆ Y implies σ(Y) ≤ σ(X)
Confidence is used because of its interpretation as a conditional probability
[Figure: the lattice of all itemsets over items A-E, from the empty set (null) up to ABCDE. Given d items, there are 2^d possible candidate itemsets.]

Extending Association Analysis
Why extend association analysis?
– To address limitations of existing schemes for association analysis
– To create new kinds of useful patterns
– To better understand the structure of the association patterns in a data set

Limitations of Association Analysis
Traditional association analysis does not apply to
– Non-binary data
  The data must be transformed into binary transaction data to apply traditional association analysis techniques; order and magnitude information can be lost
  One can often "make it work" by coding combinations of values, but this adds complexity and explodes the number of items
  Limited solutions exist, e.g., Min-Apriori for document data (Han, Karypis, Kumar 1997)
– Non-traditional association patterns
  Error-Tolerant Itemsets (ETIs) (Yang, Fayyad, and Bradley 2001)
  General Boolean formulas (Bollman-Sdorra et al. 2001; Srikant et al. 1997)

Limitations of Association Analysis …
Support and confidence are not appropriate for all applications
Example involving coffee and tea:
– Every customer in a grocery store purchases coffee
– Only 1/4 of the customers purchase tea
– conf(tea → coffee) = 1
– But this is misleading, because any item implies coffee
– This problem is common when the items have a skewed support distribution
– This cross-support problem can be addressed by using other measures, such as h-confidence (the hyperclique pattern)
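The support and confidence definitions above can be sketched in a few lines of Python, using the market-basket example from the slides.

```python
# Minimal sketch of traditional support and confidence on the
# market-basket example from the slides.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """sigma(X): number of transactions containing every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(X, Y):
    """conf(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support(X | Y) / support(X)

print(support({"Milk", "Diaper"}))       # 3
print(confidence({"Milk"}, {"Diaper"}))  # 3/4 = 0.75
```

Note that support as defined here is anti-monotone: adding items to an itemset can only shrink the set of transactions that contain it.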
Limitations of Association Analysis …
Lack of knowledge of the structure of association patterns
– The support threshold is critical
  If too high, no patterns; if too low, too many patterns
– At some support threshold, algorithms to find association patterns "hit the wall"
– Finding patterns with low support is particularly difficult; see LPMiner (Seno, Karypis 2001)
[Figure: from the Summary of Results, Frequent Itemset Mining Implementations 2003]

Overview and Contributions
The presentation and contributions fall into three categories:
1. A mathematical framework to extend association analysis to non-binary data and non-traditional patterns
   Generalizing the notion of support
   – Extend the hyperclique pattern (Xiong et al. 2003) to continuous data
   Generalizing the notion of confidence
   – Define a notion of confidence for Error-Tolerant Itemsets

Overview and Contributions
2. A framework for creating new types of association measures (and their accompanying itemset patterns)
   Any pairwise association or proximity measure can serve as the basis for defining a measure of itemset strength
   – Examples: cosine, confidence, correlation
   All such measures have the anti-monotone property
3. Analyzing the structure of association patterns
   Introduce the notion of support envelopes
   Can visualize the structure of association patterns

Publications Related to Thesis
– Steinbach, M., Tan, P., Xiong, H., and Kumar, V., Generalizing the Notion of Support. KDD '04, pp. 689-694, Seattle, WA, August 22-25, 2004.
– Steinbach, M. and Kumar, V., Generalizing the Notion of Confidence. ICDM '05, to appear, Houston, TX, November 27-30, 2005.
– Steinbach, M., Tan, P., and Kumar, V., Support Envelopes: A Technique for Exploring the Structure of Association Patterns. KDD '04, pp. 689-694, Seattle, WA, August 22-25, 2004.

Additional Publications
Books:
– P.-N. Tan, M. Steinbach, and V.
Kumar, Introduction to Data Mining, Pearson Addison-Wesley, May 2005.
Book Chapters:
– V. Kumar, P.-N. Tan, and M. Steinbach, Data Mining, in Handbook of Data Structures and Applications, CRC Press, 2004.
– M. Steinbach, L. Ertoz, and V. Kumar, Challenges of Clustering High Dimensional Data, in New Vistas in Statistical Physics - Applications in Econophysics, Bioinformatics, and Pattern Recognition, Springer-Verlag, 2004.
– L. Ertoz, M. Steinbach, and V. Kumar, Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach, in Clustering and Information Retrieval, Kluwer Academic Publishers, 2003.
– P. Zhang, M. Steinbach, V. Kumar, S. Shekhar, P.-N. Tan, S. Klooster, and C. Potter, Discovery of Patterns of Earth Science Data Using Data Mining, in Next Generation of Data Mining Applications, IEEE Press, 2005.

Additional Publications …
Journal Articles:
– H. Xiong, G. Pandey, M. Steinbach, and V. Kumar, Enhancing Data Analysis with Noise Removal, IEEE Transactions on Knowledge and Data Engineering (TKDE), 2006, accepted for publication as a regular paper.
– C. Potter, P.-N. Tan, M. Steinbach, S. Klooster, V. Kumar, R. Myneni, and V. Genovese, Major Disturbance Events in Terrestrial Ecosystems Detected using Global Satellite Data Sets, Global Change Biology, 2003.
– C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, S. Shekhar, R. Nemani, and R. Myneni, Global Teleconnections of Ocean Climate to Terrestrial Carbon Flux, J. of Geophysical Research, Vol. 108, No. D17, 4556, 2003.
– C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, S. Shekhar, and C. Carvalho, Understanding Global Teleconnections of Climate to Regional Model Estimates of Amazon Ecosystem Carbon Fluxes, Global Change Biology, 2003.
– C. Potter, S. Klooster, M. Steinbach, P. Tan, V. Kumar, R. Myneni, and V. Genovese, Variability in Terrestrial Carbon Sinks Over Two Decades: Part 1 - North America, Earth Interactions, 2003.
Conferences:
– H. Xiong, M.
Steinbach, and V. Kumar, Privacy Leakage in Multi-relational Databases via Pattern based Semi-supervised Learning, in Proc. of the ACM Conference on Information and Knowledge Management (CIKM 2005), Bremen, Germany, 2005.
– H. Xiong, M. Steinbach, P.-N. Tan, and V. Kumar, HICAP: Hierarchical Clustering with Pattern Preservation, in Proc. 2004 SIAM International Conf. on Data Mining (SDM 2004), pp. 279-290, Florida, 2004.
– M. Steinbach, P.-N. Tan, V. Kumar, S. Klooster, and C. Potter, Discovery of Climate Indices Using Clustering, KDD 2003, pp. 446-455.
– L. Ertöz, M. Steinbach, and V. Kumar, Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data, SDM 2003.

Additional Publications …
Workshops:
– M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, and S. Klooster, Temporal Data Mining for the Discovery and Analysis of Ocean Climate Indices, KDD Workshop on Temporal Data Mining, 2002.
– M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, and S. Klooster, Data Mining for the Discovery of Ocean Climate Indices, The Fifth Workshop on Scientific Data Mining, 2nd SIAM International Conference on Data Mining, 2002.
– V. Kumar, M. Steinbach, P.-N. Tan, S. Klooster, C. Potter, and A. Torregrosa, Mining Scientific Data: Discovery of Patterns in the Global Climate System, Joint Statistical Meeting, 2001.
– M. Steinbach, P.-N. Tan, V. Kumar, C. Potter, S. Klooster, and A. Torregrosa, Clustering Earth Science Data: Goals, Issues and Results, KDD Workshop on Mining Scientific Datasets, 2001.
– P.-N. Tan, M. Steinbach, V. Kumar, C. Potter, S. Klooster, and A. Torregrosa, Finding Spatio-Temporal Patterns in Earth Science Data, KDD Workshop on Temporal Data Mining, 2001.
– M. Steinbach, G. Karypis, and V. Kumar, Efficient Algorithms for Creating Product Catalogs, Web Mining Workshop, 1st SIAM International Conference on Data Mining, Chicago, IL, 2001.
– L. Ertoz, M. Steinbach, and V.
Kumar, Finding Topics in Collections of Documents: A Shared Nearest Neighbor Approach, TextMine '01, Workshop on Text Mining, 1st SIAM International Conference on Data Mining, Chicago, IL, April 2001.
– M. Steinbach, G. Karypis, and V. Kumar, A Comparison of Document Clustering Techniques, TextMining Workshop, KDD 2000, Boston, MA, August 2000.

Outline
Introduction
Extending association analysis to non-binary data and non-traditional patterns
– Generalizing the notion of support
– Generalizing the notion of confidence
Creating new types of association patterns
Analyzing the structure of association patterns
Conclusions and future work

Generalizing Support: Problem Statement
Challenge: create a framework for generalizing support that
– Handles non-binary data (ordinal, continuous)
– Handles new types of patterns
– Allows people to more easily express, explore, and communicate new types of association patterns
Motivating examples for continuous data: document data and microarray data (www.biology.ucsc.edu/mcd/research.html)

Proposed Approach
Support, σ, can be viewed as being composed of two steps (functions):
– Evaluate the strength of a pattern in each object (transaction); the evaluation vector is given by v = eval(X)
– A summarization (norm) function then measures the strength of the pattern across all transactions
Example:
– eval = ∧ (logical and)
– X = {Milk, Diapers}
– norm = sum
σ(X) = (norm ∘ eval)(X) = norm(eval(X)) = norm(v)
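The two-step composition σ(X) = norm(eval(X)) can be sketched as follows, using the {Milk, Diapers} example from the slides (eval = logical AND, norm = sum); `gen_support` is a hypothetical helper name.

```python
# Sketch of generalized support as a composition norm(eval(X)).
# Columns are item vectors over the five example transactions.
data = {
    "Milk":    [1, 0, 1, 1, 1],
    "Diapers": [0, 1, 1, 1, 1],
}

def eval_and(X):
    """Evaluation vector v: logical AND across the item columns.
    For 0/1 data, AND of each row equals the row-wise min."""
    cols = [data[item] for item in X]
    return [min(vals) for vals in zip(*cols)]

def norm_sum(v):
    """Summarization function: sum of the evaluation vector."""
    return sum(v)

def gen_support(X, eval_fn=eval_and, norm_fn=norm_sum):
    return norm_fn(eval_fn(X))

print(eval_and(["Milk", "Diapers"]))     # [0, 0, 1, 1, 1]
print(gen_support(["Milk", "Diapers"]))  # 3
```

Swapping `eval_fn` for min/range and `norm_fn` for an L2-squared sum reproduces the other members of the framework discussed below.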
[Figure: Milk = (1, 0, 1, 1, 1) and Diapers = (0, 1, 1, 1, 1) over transactions 1-5; summarizing all the per-transaction evaluations with a single number gives v = Milk ∧ Diapers = (0, 0, 1, 1, 1) and norm(v) = 3.]

Evaluation and Summarization Functions
Evaluation functions:
– Boolean functions constructed from and (∧), or (∨), and not (¬)
– min, max, range
– product
– Special purpose: Error-Tolerant Itemsets
Summarization functions:
– Vector norms: L1, L2, and L2 squared
– Sums, average, weighted average, weighted vector norms

Usefulness of Support Framework
Traditional support results from a number of choices:
– eval ∈ {∧, min, product}
– norm ∈ {L1, L2 squared, sum}
– Any of these nine combinations gives traditional support for binary data
– But for continuous data, these support measures differ
The framework can extend a recently developed association pattern, the hyperclique pattern (Xiong et al. 2003), to continuous data:
– eval = min
– norm = L2 squared
It has also led to the creation of a new kind of pattern defined by range support:
– eval = range
– norm = L2 squared

Outline
Introduction
Extending association analysis to non-binary data and non-traditional patterns
– Generalizing the notion of support
– Generalizing the notion of confidence
Creating new types of association patterns
Analyzing the structure of association patterns
Conclusions and future work

Generalizing Confidence: Problem Statement
Challenge: create a framework for generalizing confidence that
– Handles non-binary data (ordinal, continuous)
– Handles new types of patterns
– Allows people to more easily express, explore, and communicate new types of association patterns

Example: Error-Tolerant Itemsets
A (strong) error-tolerant itemset (ETI) can have a fraction ε of the items missing in each transaction.
Example: see the data in the table.
– Let ε = 5/8; in other words, each transaction only needs to contain 3/8 (37.5%) of the items.
– X = {i1, i2, i3, i4} and Y = {i5, i6, i7, i8} are both ETIs with a support of 4.
[Figure: the transaction table and the standard-confidence computation.]

A Framework for Generalizing Confidence
Proposed approach: confidence can be viewed as being composed of two steps (functions):
– Evaluate the strength of a pattern in each object (transaction) for the two sets of attributes (items), X and Y (X ∩ Y = ∅)
  The evaluation functions can be the same as before, e.g., min, max, range, Boolean functions, etc.
– Measure the strength of the relationship between the resulting pair of pattern evaluation vectors, vX and vY
  Confidence functions can be a measure of prediction or proximity:
  – Measure the extent to which the strength of one association pattern can be used to predict another, such as confidence, or
  – Capture the proximity (similarity or dissimilarity) between the two association patterns: Euclidean distance, correlation, cosine, Bregman divergence

Confidence for Boolean Support Functions
A Boolean support function
– Has an evaluation function that returns a binary evaluation vector indicating the presence or absence of a pattern in each transaction
– Uses the sum, L1, or L2 squared summarization function
The goal is to define confidence for Boolean support functions so that conf(X → Y) can be interpreted as an estimate of the conditional probability of Y given X.
The key observation is that we must work with the evaluation vectors and the basic definition of conditional probability.
Thus, conf(X → Y) = prob(vY | vX) = prob(vX ∧ vY) / prob(vX)
Another way to express this is conf(X → Y) = traditional confidence(vX, vY)

Example: Error-Tolerant Itemsets …
Returning to the ETI example, with X = {i1, i2, i3, i4} and Y = {i5, i6, i7, i8}:
conf(X → Y) = prob(vY | vX) = prob(vX ∧ vY) / prob(vX) = support(vX ∧ vY) / support(vX) = 0 / 4 = 0
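The evaluation-vector form of confidence above can be sketched directly. The vectors below are hypothetical, chosen so that each ETI holds in four disjoint transactions, which reproduces the slide's result conf = 0/4 = 0.

```python
# Sketch of generalized confidence for Boolean support functions:
# work directly with the binary evaluation vectors.
# Hypothetical evaluation vectors over eight transactions.
vX = [1, 1, 1, 1, 0, 0, 0, 0]  # transactions where ETI X is present
vY = [0, 0, 0, 0, 1, 1, 1, 1]  # transactions where ETI Y is present

def gen_conf(vX, vY):
    """conf(X -> Y) = support(vX AND vY) / support(vX)."""
    both = sum(x & y for x, y in zip(vX, vY))
    return both / sum(vX)

print(gen_conf(vX, vY))  # 0 / 4 = 0.0
```

Even though both ETIs have support 4, neither predicts the other: the two patterns never hold in the same transaction.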
Confidence for Continuous Data
One approach is to define a confidence measure for continuous data that agrees with traditional confidence for binary data:
– Normalize attributes to have an L1 norm of 1
– The eval function is min; the norm function is L1
– Confidence is then defined so that it is consistent with the binary case
Another approach is to drop the requirement of consistency with the binary case, as in Min-Apriori (Han, Karypis, Kumar 1997):
– Normalize attributes to have an L1 norm of 1
– The eval function is min; the norm function is L1
– Use the traditional definition of confidence: conf(X → Y) = σ(X ∪ Y) / σ(X)

Example: Min-Apriori
This approach is inconsistent with traditional confidence.
[Figure: original data, normalized data, and evaluation vectors, with the standard and Min-Apriori confidence values.]

Outline
Introduction
Extending association analysis to non-binary data and non-traditional patterns
– Generalizing the notion of support
– Generalizing the notion of confidence
Creating new types of association patterns
Analyzing the structure of association patterns
Conclusions and future work

New Association Patterns: Motivation
There are many pairwise measures of association or proximity among items (attributes)
– Each measure has specific properties and applications
– E.g., the cosine measure is good for sparse data, while correlation is more appropriate for dense data
[Figure: table of interestingness measures (Tan and Kumar 02)]
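The Min-Apriori-style confidence for continuous data described earlier (L1-normalized attributes, eval = min, summarization = L1) can be sketched as follows; the small term-by-document matrix and the names `min_support` and `min_apriori_conf` are hypothetical.

```python
# Sketch of Min-Apriori-style support and confidence for continuous data.
# Hypothetical term frequencies over four documents (attributes as columns).
raw = {
    "term1": [2.0, 0.0, 1.0, 1.0],
    "term2": [1.0, 1.0, 2.0, 0.0],
}

def l1_normalize(col):
    """Rescale an attribute so its values sum to 1 (L1 norm of 1)."""
    s = sum(col)
    return [x / s for x in col]

norm_data = {a: l1_normalize(c) for a, c in raw.items()}

def min_support(attrs):
    """Support: eval = row-wise min, summarization = L1 (sum)."""
    cols = [norm_data[a] for a in attrs]
    v = [min(vals) for vals in zip(*cols)]
    return sum(v)

def min_apriori_conf(X, Y):
    """Traditional form: sigma(X u Y) / sigma(X)."""
    return min_support(X + Y) / min_support(X)

print(min_apriori_conf(["term1"], ["term2"]))  # 0.5 / 1.0 = 0.5
```

Each normalized single attribute has support exactly 1, so the denominator here is 1.0 and the confidence equals the support of the pair.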
Proposed Approach
Proposed approach: use pairwise measures of association or proximity
– Find values for all pairs of attributes (or sets of attributes)
– Apply the min function to obtain a single value
Example: if X = {i1, i2, i3} and our pairwise measure is cosine, then we can define a measure of itemset strength, ψ, by
  ψ(X) = min( cosine(i1, i2), cosine(i1, i3), cosine(i2, i3) )
A set of attributes X is a clique association pattern with respect to a threshold γ and a pairwise association measure φ if φ(i, j) ≥ γ for all i, j ∈ X (φ can be cosine, correlation, confidence, …)

Proposed Approach …
There are actually three approaches (φ is a pairwise measure):
Subset-Subset
– min{ φ(X, Y) } over all itemsets X and Y
– All-confidence (φ = confidence) is an example (Omiecinski 2003)
– All-subsets patterns: all-subsets cosine, all-subsets correlation, all-subsets confidence
Item-Subset
– min{ φ(X, Y) } over all itemsets X and Y, where X is a single item
– H-confidence (φ = confidence) is an example (Xiong 2003)
– Hyperclique patterns: h-cosine, h-correlation, h-confidence
Item-Item
– min{ φ(X, Y) } over all itemsets X and Y, where X and Y are single items
– Clique patterns: cosine clique, correlation clique, confidence clique

Proposed Approach …
When one or both of the itemsets are not single items (attributes), most pairwise measures cannot be applied directly
– Confidence is an exception
We can use the approach proposed for generalizing confidence:
– Compute the evaluation vector of the itemset
– Then apply the pairwise measure to the two vectors: the evaluation vector and the original attribute vector

An Experiment
We compared the performance of h-confidence, cosine clique, and confidence clique patterns.
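The item-item clique pattern can be sketched as below: take the minimum of a pairwise measure (here cosine) over all item pairs and compare it against a threshold. The item vectors and the threshold are hypothetical illustrations.

```python
# Sketch of an item-item clique association pattern with cosine as the
# pairwise measure. The item vectors below are hypothetical.
from itertools import combinations
from math import sqrt

items = {
    "i1": [1, 1, 0, 1],
    "i2": [1, 1, 1, 0],
    "i3": [0, 1, 0, 1],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def clique_strength(X, measure=cosine):
    """psi(X) = min of the pairwise measure over all item pairs in X."""
    return min(measure(items[i], items[j]) for i, j in combinations(X, 2))

def is_clique_pattern(X, threshold, measure=cosine):
    return clique_strength(X, measure) >= threshold

print(is_clique_pattern(["i1", "i2"], 0.5))        # True  (cos = 2/3)
print(is_clique_pattern(["i1", "i2", "i3"], 0.5))  # False (min pair ~ 0.41)
```

Because adding items can only lower the minimum over pairs, this measure is anti-monotone, as claimed above for the whole family.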
The h-confidence hyperclique pattern is important because it has many applications: clustering, classification, and data cleaning
– Typically applied to objects instead of items
– The purity of its patterns is excellent
– However, h-confidence patterns often don't cover many objects; better coverage may mean better application performance
– Cosine and confidence are closely related to h-confidence

Experimental Results
We used several document data sets with class labels for the documents.
Patterns were found on documents, and goodness was measured by the entropy of the patterns.
Three quantities are reported:
– Number of patterns
– Average entropy of the patterns
– Coverage of documents
We also evaluated the cosine cliques for the original (non-binary) data.

Experimental Results – LA1 and FBIS
[Figure: number of patterns and average entropy vs. the number of attributes in the pattern, for la1 (level=50) and fbis (level=70), comparing h-confidence, cosine (original data), cosine (binary data), and confidence.]

Experimental Results – CranMed and tr45
[Figure: number of patterns vs. the number of attributes in the pattern, for cranmed (level=30) and tr45 (level=50), comparing h-confidence, cosine (binary data), and confidence, plus cosine on the original data for cranmed.]
[Figure: average entropy vs. the number of attributes in the pattern for cranmed (level=30) and tr45 (level=50), same methods as above.]

Experimental Results – Percent Coverage
[Figure: percent coverage vs. the number of attributes in the pattern for la1 (level=50), fbis (level=70), cranmed (level=30), and tr45 (level=50).]

Outline
Introduction
Extending association analysis to non-binary data and non-traditional patterns
– Generalizing the notion of support
– Generalizing the notion of confidence
Creating new types of association patterns
Analyzing the structure of association patterns
Conclusions and future work

Describing Association Patterns: Support Envelopes
The support envelope for a binary transaction data set and a pair of positive integers (m, n)
– Is a subset of all items and transactions
– Contains all association patterns involving m or more transactions and n or more items
[Figure: a binary matrix with the envelope's rows and columns highlighted.]
– m is the support
– n is the length of the itemset
– Applies to itemsets and their variants (frequent, maximal, closed)
– Applies to Error-Tolerant Itemsets (ETIs)

Simple Example
Idea: instead of finding all association patterns containing at least m transactions and n items, find the items and transactions containing all such patterns.
– For an example using the data set below, find the set of items and transactions that contain all patterns with at least 3 transactions and at least 3 items.

trans/item  A   B   C   D   E   row sum
1           1   0   1   1   1   4
2           0   1   0   1   0   2
3           0   1   1   1   1   4
4           0   0   1   0   1   2
5           0   1   0   1   0   2
6           0   1   0   1   0   2
7           1   0   1   1   1   4
8           1   0   1   1   1   4
9           1   0   0   1   0   2
10          1   0   1   1   1   4
11          0   1   1   1   0   3
12          1   0   0   0   1   2
col sum     6   5   7   10  7

Support Envelope Algorithm (SEA)
The algorithm to find a support envelope is simple:
1: input: a data matrix and a pair of positive integers (m, n)
2: repeat
3:   eliminate all rows whose sum is less than n
4:   eliminate all columns whose sum is less than m
5: until there is no change
6: return the set of remaining rows and columns

Support Envelopes Form a Lattice
Each box in the lattice represents a support envelope, written as (m, n) followed by its transactions and items. The entire lattice of envelopes is called the support lattice. For the example data:
– (5, 2): {1-12}, {A-E}
– (6, 1): {1-12}, {A, C, D, E}
– (7, 1): {1-12}, {C, D, E}
– (10, 1): {1-3, 5-11}, {E}
– (2, 3): {1, 3, 7, 8, 10, 11}, {A, B, C, D, E}
– (6, 2): {1, 3, 4, 7-12}, {A, C, D, E}
– (1, 4): {1, 3, 7, 8, 10}, {A, B, C, D, E}
– (4, 3): {1, 3, 7, 8, 10}, {A, C, D, E}
– (5, 3): {1, 3, 7, 8, 10}, {C, D, E}
– (4, 4): {1, 7, 8, 10}, {A, C, D, E}
Envelopes on the lattice boundary form what we call the support boundary; there are at most min(M, N) such envelopes.

Visualizing Support Envelopes for Mushroom
One of the support envelopes, (576, 23), is denser than its surrounding neighbors.
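The SEA pseudocode above translates almost line for line into Python. The sketch below applies it to the 12x5 example matrix from the Simple Example slide (rows are transactions 1-12, columns are items A-E); `support_envelope` is a hypothetical function name.

```python
# Sketch of the Support Envelope Algorithm (SEA): repeatedly drop rows
# with fewer than n surviving items and columns with fewer than m
# surviving transactions, until nothing changes.
def support_envelope(matrix, m, n):
    rows = set(range(len(matrix)))
    cols = set(range(len(matrix[0])))
    changed = True
    while changed:
        changed = False
        for r in list(rows):                      # eliminate light rows
            if sum(matrix[r][c] for c in cols) < n:
                rows.discard(r); changed = True
        for c in list(cols):                      # eliminate light columns
            if sum(matrix[r][c] for r in rows) < m:
                cols.discard(c); changed = True
    return rows, cols

# The example data set (transactions 1-12 over items A-E).
M = [
    [1, 0, 1, 1, 1], [0, 1, 0, 1, 0], [0, 1, 1, 1, 1], [0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0], [0, 1, 0, 1, 0], [1, 0, 1, 1, 1], [1, 0, 1, 1, 1],
    [1, 0, 0, 1, 0], [1, 0, 1, 1, 1], [0, 1, 1, 1, 0], [1, 0, 0, 0, 1],
]

rows, cols = support_envelope(M, 3, 3)
print(sorted(r + 1 for r in rows))           # [1, 3, 7, 8, 10]
print(sorted("ABCDE"[c] for c in cols))      # ['A', 'C', 'D', 'E']
```

The result matches the support lattice shown above: all patterns with at least 3 transactions and 3 items live inside transactions {1, 3, 7, 8, 10} and items {A, C, D, E}.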
An Interesting Dense Envelope for Mushroom
One of the columns was column 48, 'gill-color:buff'
– There are exactly 1728 instances of item 48, every one of which occurs with 13 other items (one of which is 'poisonous')
– The co-occurrence of 14 items is larger than is typical for this data set
(Support envelope (576, 23).)

Outline
Introduction
Generalizing Support
Generalizing Confidence
Generalizing Association Patterns
Support Envelopes
Conclusions and Future Work

Conclusions and Future Work: Generalizing Support
We described a framework for generalizing support that is based on the simple but useful observation that support can be viewed as the composition of two functions:
– A function that evaluates the strength or presence of a pattern in each object, and
– A function that summarizes these evaluations with a single number.
Future work:
– Efficient implementations
– Exploring applications of the continuous hyperclique and range patterns
– New types of support for non-binary data and non-traditional association patterns

Conclusions and Future Work: Generalizing Confidence
We described a framework for generalizing confidence that is based on the simple but useful observation that confidence can be defined in terms of two functions:
– A function that evaluates the strength or presence of a pattern in each object, and
– A function that summarizes the relationship between the two evaluation vectors with a single number.
Future work:
– Exploring applications of the different measures of confidence
– Creating new types of confidence based on interestingness and proximity measures
Conclusions and Future Work: New Patterns
We described a framework for creating a wide variety of new association measures from any pairwise association or proximity measure
– These measures are guaranteed to have the anti-monotone property
– Specific instances of these measures, the cosine and confidence cliques, were proposed and found to be strictly superior to the hyperclique pattern
Future work:
– Research is needed to determine which measures (out of the large number possible) are useful for association analysis and what additional properties they might have
– A more detailed study using more and different types of data sets is needed for the cosine and confidence clique patterns
– More efficient algorithms are needed

Conclusions and Future Work: Support Envelopes
Support envelopes are a new tool for exploring association structure
– Support envelopes form a lattice, with at most M * N envelopes
– Envelopes on the boundary are especially interesting:
  They bound the maximum sizes of association patterns
  There are at most min(M, N) boundary envelopes
– Association structure can be visualized by plotting support envelopes
– Efficient algorithms exist
Future work:
– Parallel/distributed implementations of the support envelope code
– Investigation of the basic approach and its variations for binary data
– Application of support envelopes to other kinds of data or patterns
– Support envelopes for a cube
– Continuous data

Thank You! Questions?
Hyperclique Pattern for Binary Data
Definition: the h-confidence of an itemset X = {i1, i2, …, im} is the minimum confidence with which one item implies the others, or
  hconf(X) = σ(X) / max{ σ(i1), σ(i2), …, σ(im) }
If X = {A, B, C}, σ(X) = 0.06, σ(A) = σ(B) = 0.1, and σ(C) = 0.6, then hconf(X) = 0.06 / 0.6 = 0.1
H-confidence is such that
– It is non-increasing in the size of the itemset (anti-monotone)
– Two items belong to the same hyperclique only if their supports are similar (cross-support)
– The items in an itemset with h-confidence h have a pairwise cosine similarity of at least h (high affinity)
It has been used for clustering, noise removal, classification, etc.

Hyperclique Pattern for Continuous Data
To extend the hyperclique pattern to continuous data:
– Choose the eval function to be min
– Choose the norm function to be L2 squared
– We write this support function as σ_min,L2squared
This support is between 0 and 1 and is anti-monotone.
If we normalize attributes to have an L2 norm of 1, then hconf(X) = σ_min,L2squared(X).
This is true whether the attributes are binary or continuous.

Example: Continuous Hyperclique
Compute the support of {term1, term2, term3} using the σ_min,L2squared support function:
– Attributes are normalized to have an L2 norm of 1
– Compute support by taking the min across rows and then the sum of squares of that result
[Figure: original data, normalized data, and pairwise cosine tables.]
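Both the binary h-confidence example and the continuous extension (eval = min, summarization = L2 squared on L2-normalized attributes) can be sketched as follows. The binary supports come from the slide's example; the continuous attribute values are hypothetical.

```python
# Sketch of h-confidence for binary data and of the continuous
# min / L2-squared support that extends it.
from math import sqrt

def hconf_binary(item_supports, itemset_support):
    """hconf(X) = sigma(X) / max over the single-item supports."""
    return itemset_support / max(item_supports)

# The slide's example: sigma(X)=0.06, sigma(A)=sigma(B)=0.1, sigma(C)=0.6.
print(round(hconf_binary([0.1, 0.1, 0.6], 0.06), 3))  # 0.1

def l2_normalize(col):
    nrm = sqrt(sum(x * x for x in col))
    return [x / nrm for x in col]

def min_l2sq_support(cols):
    """Continuous support: sum of squares of the row-wise min."""
    v = [min(vals) for vals in zip(*cols)]
    return sum(x * x for x in v)

# Hypothetical continuous attributes, L2-normalized before computing support.
a = l2_normalize([1.0, 2.0, 2.0])
b = l2_normalize([2.0, 1.0, 2.0])
print(round(min_l2sq_support([a, b]), 3))  # 0.667
```

For binary 0/1 attributes normalized the same way, this continuous support reduces to the binary h-confidence, which is the equivalence claimed on the slide.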