Data Mining: Concepts and Techniques
Slides for Textbook (potpourri)
© Jiawei Han and Micheline Kamber
http://www.cs.sfu.ca
Potpourri composed by Yannis Theodoridis (May 2001)
January 20, 2006

Contents
- Introduction
- Data Warehouses
- Data Preprocessing
- Data Mining Functionality
- Association Rules
- Classification
- Clustering
- Trend Analysis
- Social Impact
- A prototype system: DBMiner

What Is Data Mining?
- Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data in large databases
- Alternative names and their “inside stories”:
  - Data mining: a misnomer?
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- What is not data mining?
  - Expert systems or small ML/statistical programs

Data Mining Applications
- Data mining is a young discipline with wide and diverse applications
  - A nontrivial gap exists between general principles of data mining and domain-specific, effective data mining tools for particular applications
- Some application domains (covered in this chapter)
  - Biomedical and DNA data analysis
  - Financial data analysis
  - Retail industry
  - Telecommunication industry

Commercial Data Mining tools
- Commercial data mining systems have little in common
  - Different data mining functionality or methodology
  - May even work with completely different kinds of data sets
- Selecting a system requires a multidimensional view
- Data types: relational, transactional, text, time sequence, spatial?
- System issues
  - Running on only one or on several operating systems?
  - A client/server architecture?
  - Provide Web-based interfaces and allow XML data as input and/or output?

Commercial Data Mining tools
- Data sources
  - ASCII text files, multiple relational data sources
  - Support ODBC connections (OLE DB, JDBC)?
- Data mining functions and methodologies
  - One vs. multiple data mining functions
  - One vs. variety of methods per function
  - More data mining functions and methods per function provide the user with greater flexibility and analysis power
- Coupling with DB and/or data warehouse systems
  - Four forms of coupling: no coupling, loose coupling, semitight coupling, and tight coupling
  - Ideally, a data mining system should be tightly coupled with a database system

Commercial Data Mining tools
- Scalability
  - Row (or database size) scalability
  - Column (or dimension) scalability
  - Curse of dimensionality: it is much more challenging to make a system column scalable than row scalable
- Visualization tools
  - “A picture is worth a thousand words”
  - Visualization categories: data visualization, mining result visualization, mining process visualization, and visual data mining
- Data mining query language and graphical user interface
  - Easy-to-use and high-quality graphical user interface
  - Essential for user-guided, highly interactive data mining

Examples of Data Mining Systems (1)
- IBM Intelligent Miner
  - A wide range of data mining algorithms
  - Scalable mining algorithms
  - Toolkits: neural network algorithms, statistical methods, data preparation, and data visualization tools
  - Tight integration with IBM's DB2 relational database system
- Microsoft SQL Server 2000
  - Integrates DB and OLAP with mining
  - Supports the OLE DB for DM standard

Examples of Data Mining Systems (2)
- SGI MineSet
  - Multiple data mining algorithms and advanced statistics
  - Advanced visualization tools
- SAS Enterprise Miner
  - A variety of statistical analysis tools
  - Data warehouse tools and multiple data mining algorithms
- Clementine (SPSS)
  - An integrated data mining development environment for end-users and developers
  - Multiple data mining algorithms and visualization tools

Data Mining: A KDD Process
- Data mining: the core of the knowledge discovery process
[Figure: the KDD process as a pipeline: Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

Data Mining and Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
- Making Decisions
- Data Presentation: Visualization Techniques
- Data Mining: Information Discovery
- Data Exploration: Statistical Analysis, Querying and Reporting
- Data Warehouses / Data Marts: OLAP, MDA
- Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown beside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA

Architecture of a Typical DM System
[Figure: layered architecture, top to bottom]
- Graphical user interface
- Pattern evaluation
- Data mining engine (with Knowledge-base)
- Database or data warehouse server
- Data cleaning & data integration; Filtering
- Databases; Data Warehouse

Data Mining Functionalities (1)
- Concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Association (correlation and causality)
  - Multi-dimensional vs. single-dimensional association
  - age(X, “20..29”) ^ income(X, “20..29K”) ⇒ buys(X, “PC”) [support = 2%, confidence = 60%]
  - contains(T, “computer”) ⇒ contains(T, “software”) [1%, 75%]

Data Mining Functionalities (2)
- Classification and Prediction
  - Finding models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Presentation: decision-tree, classification rule, neural network
  - Prediction: predict some unknown or missing numerical values
- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

Data Mining Functionalities (3)
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - It can be considered as noise or exception but is quite useful in fraud detection and rare events analysis
- Trend and evolution analysis
  - Trend and deviation: regression analysis
  - Sequential pattern mining, periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses

Why Data Preprocessing?
- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - Data warehouse needs consistent integration of quality data

Association Rule: Basic Concepts
- Given: (1) database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done
- Applications
  - * ⇒ Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
  - Home Electronics ⇒ * (what other products should the store stock up on?)
  - Attached mailing in direct marketing
  - Detecting “ping-pong”ing of patients, faulty “collisions”

Rule Measures: Support & Confidence
[Figure: Venn diagram: customer buys beer, customer buys diaper, customer buys both]
- Find all the rules X & Y ⇒ Z with minimum confidence and support
  - support, s: probability that a transaction contains {X, Y, Z}
  - confidence, c: conditional probability that a transaction having {X, Y} also contains Z

  Transaction ID   Items Bought
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

- With minimum support 50% and minimum confidence 50%, we have
  - A ⇒ C (50%, 66.6%)
  - C ⇒ A (50%, 100%)

Visualization of Association Rule Using Plane Graph
[Figure: association rules displayed as a plane graph]

Visualization of Association Rule Using Rule Graph
[Figure: association rules displayed as a rule graph]

Rule Mining: A Road Map
- Boolean vs. quantitative associations (based on the types of values handled)
  - buys(x, “SQLServer”) ^ buys(x, “DMBook”) ⇒ buys(x, “UMiner”) [0.2%, 60%]
  - age(x, “30..39”) ^ income(x, “42..48K”) ⇒ buys(x, “PC”) [1%, 75%]
- Single dimension vs. multiple dimensional associations (see the examples above)
- Single level vs. multiple-level analysis
  - What brands of beers are associated with what brands of diapers?
- Various extensions
  - Correlation, causality analysis
    - Association does not necessarily imply correlation or causality
  - Maxpatterns and closed itemsets
  - Constraints enforced
    - E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?

Mining Association Rules: An Example
Min. support 50%, min. confidence 50%

  Transaction ID   Items Bought
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

  Frequent Itemset   Support
  {A}                75%
  {B}                50%
  {C}                50%
  {A, C}             50%

- For rule A ⇒ C:
  - support = support({A, C}) = 50%
  - confidence = support({A, C}) / support({A}) = 66.6%
- The Apriori principle: any subset of a frequent itemset must be frequent

Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - A subset of a frequent itemset must also be a frequent itemset
    - i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules.

The Apriori Algorithm
- Join step: C_k is generated by joining L_{k-1} with itself
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
- Pseudo-code:

    C_k: candidate itemsets of size k
    L_k: frequent itemsets of size k

    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do begin
        C_{k+1} = candidates generated from L_k;
        for each transaction t in database do
            increment the count of all candidates in C_{k+1} that are contained in t
        L_{k+1} = candidates in C_{k+1} with min_support
    end
    return ∪_k L_k;

The Apriori Algorithm: Example

  Database D
  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

  Scan D -> C1          L1
  itemset   sup.        itemset   sup.
  {1}       2           {1}       2
  {2}       3           {2}       3
  {3}       3           {3}       3
  {4}       1           {5}       3
  {5}       3

  C2          Scan D -> C2 with counts     L2
  itemset     itemset   sup                itemset   sup
  {1 2}       {1 2}     1                  {1 3}     2
  {1 3}       {1 3}     2                  {2 3}     2
  {1 5}       {1 5}     1                  {2 5}     3
  {2 3}       {2 3}     2                  {3 5}     2
  {2 5}       {2 5}     3
  {3 5}       {3 5}     2

  C3            Scan D -> L3
  itemset       itemset   sup
  {2 3 5}       {2 3 5}   2

Candidates Generation
- Suppose the items in L_{k-1} are listed in an order
- Step 1: self-joining L_{k-1}

    insert into C_k
    select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
    from L_{k-1} p, L_{k-1} q
    where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}

- Step 2: pruning

    forall itemsets c in C_k do
        forall (k-1)-subsets s of c do
            if (s is not in L_{k-1}) then delete c from C_k

How to Count Supports of Candidates?
- Why is counting supports of candidates a problem?
  - The total number of candidates can be huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - Leaf node of hash-tree contains a list of itemsets and counts
  - Interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction

Example of Generating Candidates
- L3 = {abc, abd, acd, ace, bcd}
- Self-joining: L3 * L3
  - abcd from abc and abd
  - acde from acd and ace
- Pruning:
  - acde is removed because ade is not in L3
- C4 = {abcd}

Mining Distance-based Association Rules
- Binning methods do not capture the semantics of interval data

  Price($)   Equi-width (width $10)   Equi-depth (depth 2)   Distance-based
  7          [0,10]                   [7,20]                 [7,7]
  20         [11,20]                  [22,50]                [20,22]
  22         [21,30]                  [51,53]                [50,53]
  50         [31,40]
  51         [41,50]
  53         [51,60]

- Distance-based partitioning, more meaningful discretization considering:
  - density/number of points in an interval
  - “closeness” of points in an interval

Clusters and Distance Measurements
- S[X] is a set of N tuples t_1, t_2, …, t_N, projected on the attribute set X
- The diameter of S[X]:

  $$ d(S[X]) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \mathrm{dist}_X(t_i[X], t_j[X])}{N(N-1)} $$

- dist_X: distance metric, e.g., Euclidean distance or Manhattan distance

Clusters and Distance Measurements (Cont.)
- The diameter, d, assesses the density of a cluster C_X, where

  $$ d(C_X) \le d_0^X, \qquad |C_X| \ge s_0 $$

- Finding clusters and distance-based rules
  - the density threshold, d_0, replaces the notion of support
  - modified version of the BIRCH clustering algorithm

Interestingness Measurements
- Objective measures
  - Two popular measurements: support and confidence
- Subjective measures (Silberschatz & Tuzhilin, KDD95)
  - A rule (pattern) is interesting if it is
    - unexpected (surprising to the user); and/or
    - actionable (the user can do something with it)

Criticism to Support and Confidence
- Example 1: (Aggarwal & Yu, PODS98)
  - Among 5000 students
    - 3000 play basketball
    - 3750 eat cereal
    - 2000 both play basketball and eat cereal
  - play basketball ⇒ eat cereal [40%, 66.7%] is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
  - play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

               basketball   not basketball   sum(row)
  cereal       2000         1750             3750
  not cereal   1000         250              1250
  sum(col.)    3000         2000             5000

Criticism to Support and Confidence
- Example 2:
  - X and Y: positively correlated
  - X and Z: negatively related
  - yet the support and confidence of X ⇒ Z dominate
- We need a measure of dependent or correlated events (the slide writes P(A ∪ B) in the numerator, meaning that a transaction contains both A and B):

  $$ \mathrm{corr}_{A,B} = \frac{P(A \wedge B)}{P(A)\,P(B)} $$

  X   1 1 1 1 0 0 0 0
  Y   1 1 0 0 0 0 0 0
  Z   0 1 1 1 1 1 1 1

  Rule     Support   Confidence
  X ⇒ Y    25%       50%
  X ⇒ Z    37.50%    75%

- P(B|A)/P(B) is also called the lift of rule A ⇒ B

Other Interestingness Measures: Interest
- Interest (correlation, lift):

  $$ \frac{P(A \wedge B)}{P(A)\,P(B)} $$

  - takes both P(A) and P(B) into consideration
  - P(A ∧ B) = P(B) · P(A) if A and B are independent events
  - A and B are negatively correlated if the value is less than 1, and positively correlated if it is greater than 1

  X   1 1 1 1 0 0 0 0
  Y   1 1 0 0 0 0 0 0
  Z   0 1 1 1 1 1 1 1

  Itemset   Support   Interest
  X, Y      25%       2
  X, Z      37.50%    0.9
  Y, Z      12.50%    0.57

Association Rules: Summary
- Association rule mining
  - probably the most significant contribution from the database community in KDD
  - A large number of papers have been published
- Many interesting issues have been explored
- An interesting research direction
  - Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
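
Worked sketch: Apriori and rule measures
The level-wise search and the support, confidence, and lift measures described in the association-rule slides can be tried out with a short Python sketch. It is an illustration under stated assumptions, not the implementation used by DBMiner or by the textbook: the function names `apriori` and `rule_measures` are hypothetical, the minimum support is given as an absolute count (2 is the value implied by the tables of the Apriori example), and the join step is a simplified union of pairs of frequent k-itemsets rather than the ordered prefix self-join; after pruning, both variants leave the same candidates.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset.

    Hypothetical helper: min_count is an absolute support count.
    """
    transactions = [frozenset(t) for t in transactions]

    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 1
    while level:
        prev = set(level)
        # Join step (simplified): union pairs of frequent k-itemsets
        # that differ in exactly one item, giving (k+1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k))
        }
        # Count step: one scan of the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent


def rule_measures(transactions, lhs, rhs):
    """Return (support, confidence, lift) of the rule lhs => rhs.

    Assumes lhs and rhs each occur in at least one transaction.
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    n = len(transactions)
    n_lhs = sum(1 for t in transactions if lhs <= set(t))
    n_rhs = sum(1 for t in transactions if rhs <= set(t))
    n_both = sum(1 for t in transactions if (lhs | rhs) <= set(t))
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)  # P(B|A) / P(B)
    return support, confidence, lift


# Database D from the Apriori example (TIDs 100, 200, 300, 400).
D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
frequent = apriori(D, min_count=2)
print(frequent[frozenset({2, 3, 5})])  # 2

# Basketball/cereal example: 5000 students rebuilt from the contingency table.
students = (
    [{"basketball", "cereal"}] * 2000
    + [{"basketball"}] * 1000
    + [{"cereal"}] * 1750
    + [set()] * 250
)
s, c, l = rule_measures(students, {"basketball"}, {"cereal"})
print(round(s, 3), round(c, 3), round(l, 3))  # 0.4 0.667 0.889
```

On database D the sketch finds nine frequent itemsets, ending with {2 3 5} at a count of 2. For the basketball example the lift falls below 1, which is the negative correlation that support and confidence alone fail to show.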

Training Dataset
This follows an example from Quinlan’s ID3

  age      income   student   credit_rating   buys_computer
  <=30     high     no        fair            no
  <=30     high     no        excellent       no
  31…40    high     no        fair            yes
  >40      medium   no        fair            yes
  >40      low      yes       fair            yes
  >40      low      yes       excellent       no
  31…40    low      yes       excellent       yes
  <=30     medium   no        fair            no
  <=30     low      yes       fair            yes
  >40      medium   yes       fair            yes
  <=30     medium   yes       excellent       yes
  31…40    medium   no        excellent       yes
  31…40    high     yes       fair            yes
  >40      medium   no        excellent       no

Output: A Decision Tree for “buys_computer”

  age?
    <=30:    student?
               no:  no
               yes: yes
    31…40:   yes
    >40:     credit rating?
               excellent: no
               fair:      yes

Presentation of Classification Results
[Figure: presentation of classification results]

AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., Splus
- Use the Single-Link method and the dissimilarity matrix.
- Merge nodes that have the least dissimilarity
- Go on in a non-descending fashion
- Eventually all nodes belong to the same cluster
[Figure: three scatter plots, both axes ranging from 0 to 10]

DBSCAN: Density Based Spatial Clustering of Applications with Noise
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points, with Eps = 1cm and MinPts = 5]

[Figure: reachability-distance plotted against the cluster-order of the objects; axis marks: undefined, ε, ε‘]

Constraint-Based Clustering Analysis
- Clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem

Mining Time-Series and Sequence Data
[Figure: time-series plot]

Mining Time-Series and Sequence Data: Trend analysis
- A time series can be illustrated as a time-series graph which describes a point moving with the passage of time
- Categories of Time-Series Movements
  - Long-term or trend movements (trend curve)
  - Cyclic movements or cycle variations, e.g., business cycles
  - Seasonal movements or seasonal variations
    - i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years.
  - Irregular or random movements

Social Impacts: Threat to Privacy and Data Security?
- Is data mining a threat to privacy and data security?
- “Big Brother”, “Big Banker”, and “Big Business” are carefully watching you
- Profiling information is collected every time
  - You use your credit card, debit card, supermarket loyalty card, or frequent flyer card, or apply for any of the above
  - You surf the Web, reply to an Internet newsgroup, subscribe to a magazine, rent a video, join a club, or fill out a contest entry form
  - You pay for prescription drugs, or present your medical care number when visiting the doctor
- Collection of personal data may be beneficial for companies and consumers, but there is also potential for misuse

Protect Privacy and Data Security
- Fair information practices
  - International guidelines for data privacy protection
  - Cover aspects relating to data collection, purpose, use, quality, openness, individual participation, and accountability
  - Purpose specification and use limitation
  - Openness: individuals have the right to know what information is collected about them, who has access to the data, and how the data are being used
- Develop and use data security-enhancing techniques
  - Blind signatures
  - Biometric encryption
  - Anonymous databases

OLAP (Summarization) Display Using MS/Excel 2000
[Figure: screenshot]

Market-Basket-Analysis (Association): Ball graph
[Figure: screenshot]

Display of Association Rules in Rule Plane Form
[Figure: screenshot]

Display of Decision Tree (Classification Results)
[Figure: screenshot]

Display of Clustering (Segmentation) Results
[Figure: screenshot]

3D Cube Browser
[Figure: screenshot]

Trends in Data Mining (1)
- Scalable data mining methods
  - Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns
- Application exploration
  - Development of application-specific data mining systems
  - Invisible data mining (mining as built-in function)
- Integration of data mining with database systems, data warehouse systems, and Web database systems
- Quality assessment

Trends in Data Mining (2)
- Standardization of data mining language
  - A standard will facilitate systematic development, improve interoperability, and promote the education and use of data mining systems in industry and society
- Visual data mining
- Uncertainty handling
- New methods for mining complex types of data
  - More research is required towards the integration of data mining methods with existing data analysis techniques for the complex types of data
- Web mining
- Privacy protection and information security in data mining
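
Worked sketch: single-link agglomerative clustering (AGNES)
As a supplement to the AGNES slide earlier in the deck, the merge procedure it describes (repeatedly merge the two clusters with the least dissimilarity until one cluster remains) can be sketched in a few lines of Python. This is a naive illustration and not the Splus implementation: the function name `single_link_agnes` and the one-dimensional sample points are hypothetical, and dissimilarities are recomputed on every pass instead of being read from a stored dissimilarity matrix.

```python
def single_link_agnes(points):
    """Return the merge history as a list of (dissimilarity, merged cluster).

    Hypothetical helper for 1-D points; the single-link dissimilarity of
    two clusters is the smallest distance between any pair of members.
    """
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        best = None  # (dissimilarity, index i, index j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        merges.append((d, merged))
        # Replace the two merged clusters with their union.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical sample data, chosen so that no two merge distances tie.
history = single_link_agnes([1.0, 2.0, 4.5, 9.0, 10.5])
for height, cluster in history:
    print(height, cluster)
# 1.0 [1.0, 2.0]
# 1.5 [9.0, 10.5]
# 2.5 [1.0, 2.0, 4.5]
# 4.5 [1.0, 2.0, 4.5, 9.0, 10.5]
```

The merge dissimilarities come out in non-descending order, and the last merge places all points in the same cluster, as the slide states.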