Data Mining

Data Mining (DM) / Knowledge Discovery in Databases (KDD):
"The nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [Frawley et al., 1992]

Need for Data Mining: Increased Ability to Generate Data
- Remote sensors and satellites
- Bar codes for commercial products
- Computerization of businesses

Need for Data Mining: Increased Ability to Store Data
- Media: bigger magnetic disks, CD-ROMs
- Better database management systems
- Data warehousing technology
[Chart: volume of stored data, growing steadily from 1970 to 2000]

Need for Data Mining: Examples
- Wal-Mart records 20,000,000 transactions/day
- Healthcare transactions yield multi-gigabyte databases
- Mobil Oil's exploration data store holds 100 terabytes
- Human Genome Project: multiple gigabytes and increasing
- Astronomical object catalogs: terabytes of images
- NASA EOS: 1 terabyte/day

Something for Everyone
- Bell Atlantic, MCI, Land's End, Visa, Bank of New York, FedEx

Market Analysis and Management
- Customer profiling: data mining can tell you what types of customers buy what products (clustering or classification) and what products are often bought together (association rules)
- Identifying customer requirements: discover the relationship between personal characteristics and the probability of purchase
- Discover correlations between purchases

Fraud Detection and Management
- Applications: widely used in health care, retail, credit card services, telecommunications, etc.
- Approach: use historical data to build models of fraudulent behavior, then use data mining to identify similar instances
- Examples: auto insurance, money laundering, medical insurance

[Slide: IBM advertisement]

The DM Step in the KDD Process
[Diagram: data mining at the intersection of statistics, database technology, AI, and hardware]

Mining Association Rules
- Association rule mining: finding associations or correlations among a set of items or objects in transaction databases, relational databases, and data warehouses
- Applications: basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, etc.
- Rule form: "Body → Head [support, confidence]"
  - Buys=Diapers → Buys=Beer [0.5%, 60%]
  - Major=CS ^ Class=DataMining → Grade=A [1%, 75%]

Rule Measures: Support and Confidence
[Venn diagram: customers who buy diapers, customers who buy beer, and the overlap who buy both]
- Find all rules X & Y → Z with minimum support and confidence
  - support s: the probability that a transaction contains {X, Y, Z}
  - confidence c: the conditional probability that a transaction containing {X, Y} also contains Z

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

For minimum support 50% and minimum confidence 50%:
- A → C (support 50%, confidence 66.6%)
- C → A (support 50%, confidence 100%)

Association Rule: Formal Definition
- Given a set of items I = {i1, i2, ..., im} and a set of transactions D, where each transaction T in D is a set of items
- An association rule is an implication X → Y, where X and Y are itemsets with X ⊆ I, Y ⊆ I, and X ∩ Y = ∅
- Writing σ(Z) for the number of transactions in D that contain Z, the rule meets minimum confidence c when c ≤ σ(X ∪ Y) / σ(X), i.e. at least a fraction c of the transactions in D that contain X also contain Y
- A minimum support s is also met: s ≤ σ(X ∪ Y) / |D|

Mining Strong Association Rules in Transaction DBs
- Measurement of rule strength in a transaction DB: A → B [support, confidence]
  - support = Prob(A ∪ B) = (# of transactions containing all the items in A ∪ B) / (total # of transactions)
  - confidence = Prob(B | A) = (# of transactions that contain both A and B) / (# of transactions containing A)
- We are often interested only in strong associations, i.e. support ≥ min_sup and confidence ≥ min_conf
- Examples: milk → bread [5%, 60%]; tire ^ auto_accessories → auto_services [2%, 80%]
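To make the support and confidence definitions concrete, here is a minimal Python sketch (not part of the original slides; the function names are my own) that recomputes the figures for the four-transaction example above.

```python
# Minimal sketch: support and confidence over the four-transaction
# example database from the slides. Each transaction is a set of items.

transactions = [
    {"A", "B", "C"},  # TID 2000
    {"A", "C"},       # TID 1000
    {"A", "D"},       # TID 4000
    {"B", "E", "F"},  # TID 5000
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(body, head, db):
    """Conditional probability that a transaction with `body` also has `head`."""
    return support(body | head, db) / support(body, db)

print(support({"A", "C"}, transactions))       # 0.5   (A -> C support 50%)
print(confidence({"A"}, {"C"}, transactions))  # 0.666 (A -> C confidence 66.6%)
print(confidence({"C"}, {"A"}, transactions))  # 1.0   (C -> A confidence 100%)
```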
Methods for Mining Associations
- Apriori
- Partition technique
- Sampling technique
- Anti-skew
- Multi-level or generalized association
- Constraint-based or query-based association

Apriori (Levelwise)
- Scan the database multiple times
- On the ith scan, find all large itemsets of size i with minimum support
- Use the large itemsets from scan i as input to scan i+1
- Create candidates: itemsets of size i+1 all of whose subsets are large itemsets
- Notation: Lk = the set of large k-itemsets; Ck = the set of candidate large itemsets of size k
- Note: if {A,B} is not a large itemset, then no superset of it can be one either

Mining Association Rules: Example

Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Min. support 50%, min. confidence 50%:

Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A,C} | 50%

For rule A → C: support = support({A,C}) = 50%; confidence = support({A,C}) / support({A}) = 66.6%
Apriori principle: any subset of a frequent itemset must be frequent.

Example trace on the same database, with minsup = 0.25 and minconf = 0.5 (support counts in parentheses):
L1 = {(A, 3), (B, 2), (C, 2), (D, 1), (E, 1), (F, 1)}
C2 = {(A,B), (A,C), (A,D), (A,E), (A,F), (B,C), ..., (E,F)}
L2 = {(A,B, 1), (A,C, 2), (A,D, 1), (B,C, 1), (B,E, 1), (B,F, 1), (E,F, 1)}
C3 = {(A,B,C), (A,B,D), (A,C,D), (B,C,E), (B,C,F), (B,E,F)}
L3 = {(A,B,C, 1), (B,E,F, 1)}
C4 = {}, L4 = {}; the algorithm terminates.

Possible rules (c = confidence, s = support count):
A→B (c=.33, s=1), B→A (c=.5, s=1), A→C (c=.67, s=2), C→A (c=1.0, s=2),
A→D (c=.33, s=1), D→A (c=1.0, s=1), B→C (c=.5, s=1), C→B (c=.5, s=1),
B→E (c=.5, s=1), E→B (c=1, s=1), B→F (c=.5, s=1), F→B (c=1, s=1),
A→B^C (c=.33, s=1), B→A^C (c=.5, s=1), C→A^B (c=.5, s=1),
A^B→C (c=1, s=1), A^C→B (c=.5, s=1), B^C→A (c=1, s=1),
B→E^F (c=.5, s=1), E→B^F (c=1, s=1), F→B^E (c=1, s=1),
B^E→F (c=1, s=1), B^F→E (c=1, s=1), E^F→B (c=1, s=1)

Partitioning
- Requires only two passes through the external database
- Divide the database into n partitions, each of which fits in main memory
- Scan 1: process one partition in memory at a time, finding local large itemsets
- Candidate large itemsets = the union of all local large itemsets (a superset of the actual large itemsets; it contains false positives)
- Scan 2: calculate support and determine the actual large itemsets
- If the data is skewed, partitioning may not work well: the chance that a local large itemset is a global large itemset may be small

Partitioning: Will any large itemsets be missed?
No. Let ti(l) be the number of transactions in partition i containing itemset l, and |ti| the size of partition i. If l is in no local large itemset list Li, then
  t1(l)/|t1| < MS, t2(l)/|t2| < MS, ..., tn(l)/|tn| < MS,
so
  t1(l) + t2(l) + ... + tn(l) < MS × (|t1| + |t2| + ... + |tn|),
i.e. l is not large in the whole database. Every globally large itemset is therefore locally large in at least one partition.

How do run times compare?
[Bar chart: run times of Apriori (A) vs. Partition (P) on databases D1-D4 at minimum supports 0.5, 0.33, and 0.25]

PlayTennis Training Examples

Day | Outlook | Temperature | Humidity | Wind | PlayTennis
D1 | Sunny | Hot | High | Weak | No
D2 | Sunny | Hot | High | Strong | No
D3 | Overcast | Hot | High | Weak | Yes
D4 | Rain | Mild | High | Weak | Yes
D5 | Rain | Cool | Normal | Weak | Yes
D6 | Rain | Cool | Normal | Strong | No
D7 | Overcast | Cool | Normal | Strong | Yes
D8 | Sunny | Mild | High | Weak | No
D9 | Sunny | Cool | Normal | Weak | Yes
D10 | Rain | Mild | Normal | Weak | Yes
D11 | Sunny | Mild | Normal | Strong | Yes
D12 | Overcast | Mild | High | Strong | Yes
D13 | Overcast | Hot | Normal | Weak | Yes
D14 | Rain | Mild | High | Strong | No

Association Rule Visualization
- DBMiner: association rule graph
- Clementine (UK, bought by SPSS): "The Web" node shows the strength of associations in the data, i.e. how often field values coincide
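The levelwise loop described above is straightforward to express in code. The following is a minimal sketch of my own (not the slides' implementation): it builds candidates Ck from the large itemsets L(k-1), prunes any candidate with an infrequent (k-1)-subset (the Apriori principle), and counts support with one pass per level. Note that with this full subset pruning, C3 for the example database comes out smaller than the C3 listed in the trace above, which prunes less aggressively; the resulting L1-L3 are the same.

```python
# Minimal levelwise Apriori sketch. Transactions and itemsets are sets;
# min_sup is a fraction of the database size.
from itertools import combinations

def apriori(db, min_sup):
    n = len(db)
    # L1: frequent 1-itemsets
    items = {i for t in db for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in db) / n >= min_sup}
    large, k = [], 1
    while freq:
        large.append(freq)
        k += 1
        # Candidate generation: unions of frequent (k-1)-itemsets of size k
        cands = {a | b for a in freq for b in freq if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        # Support-counting scan over the database
        freq = {c for c in cands
                if sum(c <= t for t in db) / n >= min_sup}
    return large

db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
for level, sets in enumerate(apriori(db, 0.25), start=1):
    print(f"L{level}:", [sorted(s) for s in sets])
```

Running it on the example database with min_sup = 0.25 reproduces L1, L2, and L3 from the trace, and terminates at L4 = {}.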
Multi-Level Association
- A descendant of an infrequent itemset cannot be frequent
- A transaction database can be encoded by dimensions and levels
[Taxonomy diagram: food at the root; bread (wheat, white) and milk (2%, skim) below it; brands such as Old Mills, Wonder, Sunset, Fraser, Dairyland, and Foremost at the leaves]

Encoded transactions (each item is a digit string, one digit per taxonomy level):

TID | Items
T1 | {111, 121, 211, 221}
T2 | {111, 211, 222, 323}
T3 | {112, 122, 221, 411}
T4 | {111, 121}
T5 | {111, 122, 211, 221, 413}

Encoding Hierarchical Information in the Transaction Database
- A taxonomy for the relevant data items: food → milk (2%, ..., chocolate), bread (...), ...
- Conversion of bar_code into generalized_item_id (GID):

GID | bar_code_set | category | content_spec | brand
112 | {17325, 34823, 91265} | milk | 2% | Foremost
141 | {29394, 77454, 89157} | milk | skim | Dairy_land
171 | {73295, 99184, 79520} | milk | chocolate | Dairy_land
212 | {88452, 35672, 98427, 31205} | bread | wheat | Wonder
... | {..., ...} | ... | ... | ...
711 | {32514, 78152} | fruit_juice | orange | Minute_maid

Mining Surprising Temporal Patterns [Chakrabarti et al.]
- 1990: find prevalent rules that hold over large fractions of the data; useful for promotions and store arrangement; intensively researched. "Milk and cereal sell together!"
- 1995: prevalent != interesting; analysts already know about prevalent rules; interesting rules are those that deviate from prior expectation; mining's payoff is in finding surprising phenomena. "Milk and cereal sell together!"
- 1998: "Milk and cereal sell together!" Zzzz...

Association Rules: Strengths & Weaknesses
- Strengths: understandable and easy to use; useful
- Weaknesses:
  - Brute-force methods can be expensive in memory and time: Apriori is O(CD), where C = the sum of the sizes of the candidate itemsets (up to 2^n possible, with n = # of items) and D = the size of the database
  - Association does not necessarily imply correlation
  - Validation? Maintenance?

Clustering
- Group similar items together (example: sorting laundry)
- Similar items may have important attributes or functionality in common
- Group customers with similar interests and spending patterns
- A form of unsupervised learning
- Cluster objects into classes so as to maximize intraclass similarity and minimize interclass similarity

Clustering Techniques
- Partition-based: enumerate all partitions and score by some criterion; K-means
- Hierarchical
- Model-based: hypothesize a model for each cluster and find the model that best fits the data; AutoClass, Cobweb

Clustering Goal (an illustration)
- Suppose you must transmit the coordinates of points drawn randomly from a dataset, and you are allowed only 2 bits per point. What encoder/decoder will lose the least information?
- Idea One: break the plane into a 2x2 grid (cells 00, 01, 10, 11) and decode each bit pair as the middle of its grid cell
- Idea Two: break the plane into the same grid, but decode each bit pair as the centroid of the data that falls in that grid cell

K-Means Clustering
1. Ask the user how many clusters to find (e.g., k = 5)
2. Randomly guess k cluster-center locations
3. Each data point finds its closest center
4. Each cluster finds the new centroid of its points
5. Repeat steps 3-4 until convergence

K-Means Issues (a minimal sketch follows)
- Computationally efficient
- Initialization
- Termination condition
- Distance measure
- What should k be?
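As a concrete companion to the five steps above, here is a minimal pure-Python K-means sketch (my own illustration; a real system would use an optimized library such as numpy or scikit-learn). It assumes points are 2-D tuples and Euclidean distance as the distance measure.

```python
# Minimal K-means sketch following the five steps on the slide.
import random
import math

def kmeans(points, k, iters=100):
    centers = random.sample(points, k)        # step 2: random initial centers
    for _ in range(iters):
        # step 3: assign each point to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[j].append(p)
        # step 4: recompute each center as the centroid of its points
        new_centers = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centers[j]
            for j, pts in enumerate(clusters)
        ]
        if new_centers == centers:            # step 5: stop on convergence
            break
        centers = new_centers
    return centers, clusters

pts = [(1, 1), (1.2, 0.8), (5, 5), (5.1, 4.9), (9, 1)]
centers, clusters = kmeans(pts, k=2)
print(centers)
```

Because step 2 guesses the initial centers at random, different runs can converge to different clusterings, which is exactly the initialization issue flagged in the list above.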
Hierarchical Clustering
1. Each point starts as its own cluster
2. Find the most similar pair of clusters
3. Merge them into a parent cluster
4. Repeat

Hierarchical Clustering Issues
- This is bottom-up (agglomerative) clustering; one can also perform top-down (divisive) clustering
- How should similarity between clusters be defined?
- What is the stopping criterion?
(A minimal agglomerative sketch appears at the end of this document.)

Cluster Visualization: IBM
- One row per cluster (the number gives the cluster's size as a percentage)
- Charts show fields, ordered by influence
- Pie charts: the outer ring shows the distribution for the whole population, the inner ring for the cluster
- Bar charts: the solid bar is for the cluster, the transparent bar for the whole population

Structure-Based Brushing
- The user selects a region of interest with the magenta triangle
- The user selects the level of detail with the colored polyline

Spatial Associations
An example spatial association query:

  FIND SPATIAL ASSOCIATION RULE DESCRIBING "Golf Course"
  FROM Washington_Golf_courses, Washington
  WHERE CLOSE_TO(Washington_Golf_courses.Obj, Washington.Obj, "3 km")
    AND Washington.CFCC <> "D81"
  IN RELEVANCE TO Washington_Golf_courses.Obj, Washington.Obj, CFCC
  SET SUPPORT THRESHOLD 0.5

[Slide: Partek visualization tool]

Conclusions
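Finally, the agglomerative sketch promised in the Hierarchical Clustering section: a minimal bottom-up implementation of the four steps. Two choices the slides leave open are filled in here as assumptions of mine: single-link similarity between clusters (the distance of their closest pair of points) and stopping when k clusters remain.

```python
# Minimal bottom-up (agglomerative) clustering sketch.
import math

def single_link(c1, c2):
    """Distance between clusters = distance of their closest pair of points."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, k=1):
    clusters = [[p] for p in points]          # step 1: one cluster per point
    while len(clusters) > k:                  # step 4: repeat until k remain
        # step 2: find the most similar (closest) pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        # step 3: merge the pair into a parent cluster
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [(1, 1), (1.2, 0.8), (5, 5), (5.1, 4.9), (9, 1)]
print(agglomerative(pts, k=2))
```

Swapping `single_link` for a complete-link or average-link function changes only the similarity definition, which is one of the open issues listed in the Hierarchical Clustering section.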