Data Mining

Outline
• What is data mining?
• Data mining tasks
  – Association
  – Classification
  – Clustering
• Data mining algorithms
• Are all the patterns interesting?

What is Data Mining?
• The huge number of databases and web pages makes manual information extraction next to impossible (remember the favored statement: "I will bury them in data!")
• Many other disciplines (statistics, AI, information retrieval) lack scalable algorithms to extract information and/or rules from databases
• The need to find relationships among data

What is Data Mining?
• Discovery of useful, possibly unexpected, data patterns
• Subsidiary issues:
  – Data cleansing
  – Visualization
  – Warehousing

Examples
• A big objection was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
• The Rhine Paradox: a great example of how not to conduct scientific research.

Rhine Paradox (1)
• David Rhine was a parapsychologist in the 1950's who hypothesized that some people had Extra-Sensory Perception (ESP).
• He devised an experiment where subjects were asked to guess 10 hidden cards, each red or blue.
• He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

Rhine Paradox (2)
• He told these people they had ESP and called them in for another test of the same type.
• Alas, he discovered that almost all of them had lost their ESP.
• What did he conclude?
  – Answer on the next slide.

Rhine Paradox (3)
• He concluded that you shouldn't tell people they have ESP; it causes them to lose it.

A Concrete Example
• This example illustrates a problem with intelligence gathering.
• Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
• We want to find people who at least twice have stayed at the same hotel on the same day.

The Details
• 10^9 people being tracked.
• 1000 days.
• Each person stays in a hotel 1% of the time (10 days out of 1000).
• Hotels hold 100 people (so 10^5 hotels).
• If everyone behaves randomly (i.e., there are no evil-doers), will the data mining detect anything suspicious?

Calculations (1)
• Probability that persons p and q will be at the same hotel on a given day d:
  – 1/100 * 1/100 * 10^-5 = 10^-9
• Probability that p and q will be at the same hotel on two given days:
  – 10^-9 * 10^-9 = 10^-18
• Pairs of days:
  – about 5*10^5

Calculations (2)
• Probability that p and q will be at the same hotel on some two days:
  – 5*10^5 * 10^-18 = 5*10^-13
• Pairs of people:
  – about 5*10^17
• Expected number of suspicious pairs of people:
  – 5*10^17 * 5*10^-13 = 250,000

Conclusion
• Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
• Analysts have to sift through 250,010 candidate pairs to find the 10 real cases.
  – Not gonna happen.
  – But how can we improve the scheme?

Appetizer
• Consider a file consisting of 24,471 records. The file contains at least two condition attributes, A and D:

  A \ D    D = 0    D = 1    total
  A = 0     9272      232     9504
  A = 1    14695      272    14967
  total    23967      504    24471

Appetizer (cont'd)
• Probability that a person has A: P(A) = 0.6
• Probability that a person has D: P(D) = 0.02
• Conditional probability that a person has D given that it has A: P(D|A) = P(AD)/P(A) = (272/24471)/0.6 = 0.02
• P(A|D) = P(AD)/P(D) = 0.54
• What can we say about dependencies between A and D?
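As a quick sanity check of the hotel arithmetic a few slides up, the expected number of suspicious pairs can be recomputed directly. This is only a sketch in Python; every quantity is taken from the slides.

  # Sanity check of the hotel example (numbers from the slides).
  n_people = 1e9      # people being tracked
  n_days   = 1000     # days observed
  p_hotel  = 0.01     # probability a given person is in some hotel on a given day
  n_hotels = 1e5      # hotels, each holding about 100 people

  # P(two specific people are in the same hotel on one given day)
  p_same_day = p_hotel * p_hotel / n_hotels          # 1e-9
  # P(same hotel on two specific days)
  p_two_days = p_same_day ** 2                       # 1e-18
  pairs_of_days   = n_days * (n_days - 1) / 2        # ~5e5
  pairs_of_people = n_people * (n_people - 1) / 2    # ~5e17
  expected_suspicious = pairs_of_people * pairs_of_days * p_two_days
  print(expected_suspicious)                         # ~250,000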
Appetizer (3)
• So far we have not asked anything that statistics could not have asked. So is data mining just another word for statistics?
• We hope that the response will be a resounding NO.
• The major difference is that statistical methods work with random data samples, whereas the data in databases is not necessarily random.
• The second difference is the size of the data set.
• The third difference is that statistical samples do not contain "dirty" data.

Architecture of a Typical Data Mining System
(layered, from top to bottom)
• Graphical user interface
• Pattern evaluation
• Data mining engine
• Knowledge base
• Database or data warehouse server
• Data cleaning, data integration and filtering
• Databases and data warehouse

Data Mining Tasks
• Association (correlation and causality)
  – Multi-dimensional vs. single-dimensional association
  – age(X, "20..29") ^ income(X, "20..29K") -> buys(X, "PC") [support = 2%, confidence = 60%]
  – contains(T, "computer") -> contains(T, "software") [1%, 75%]
  – What is support? The percentage of the tuples in the database that have age between 20 and 29, income between 20K and 29K, and buy a PC.
  – What is confidence? The probability that a person buys a PC given that the person is between 20 and 29 and has income between 20K and 29K.
• Clustering (putting data that are close together into the same cluster)
  – What does "close together" mean?

Distances between Data
• Distance is a measure of dissimilarity between data: d(i,j) >= 0; d(i,j) = d(j,i); d(i,j) <= d(i,k) + d(k,j)
• Euclidean distance between <x1, x2, ..., xk> and <y1, y2, ..., yk>: d = sqrt((x1 - y1)^2 + ... + (xk - yk)^2)
• Standardize variables by finding the standard deviation and dividing each xi by the standard deviation of X
• Covariance(X,Y) = (1/k) * Sum_i (xi - mean(X)) * (yi - mean(Y))
• Boolean variables and their distances

Data Mining Tasks
• Outlier analysis
  – Outlier: a data object that does not comply with the general behavior of the data
  – It can be treated as noise or exception, but is quite useful in fraud detection and rare-events analysis
• Trend and evolution analysis
  – Trend and deviation: regression analysis
  – Sequential pattern mining, periodicity analysis
  – Similarity-based analysis
• Other pattern-directed or statistical analyses

Are All the "Discovered" Patterns Interesting?
• A data mining system/query may generate thousands of patterns, not all of which are interesting.
  – Suggested approach: human-centered, query-based, focused mining
• Interestingness measures: a pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
  – Objective: based on statistics and structure of patterns, e.g., support, confidence, etc.
  – Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Are All the "Discovered" Patterns Interesting? - Example

  tea \ coffee    0     1    total
  0               5    70       75
  1               5    20       25
  total          10    90      100

• The conditional probability that if one buys coffee, one also buys tea is 20/90 = 2/9.
• The conditional probability that if one buys tea, she also buys coffee is 20/25 = 0.8.
• However, the probability that she buys coffee at all is 0.9.
• So, is it a significant inference that if a customer buys tea she also buys coffee? Are buying tea and buying coffee independent activities?
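A small sketch of what the tea/coffee table implies; the counts are exactly those in the table above, and the base rates and conditionals are recomputed in Python.

  buys = {  # (tea, coffee) -> number of customers
      (0, 0): 5, (0, 1): 70,
      (1, 0): 5, (1, 1): 20,
  }
  n = sum(buys.values())                              # 100 customers
  p_coffee = (buys[(0, 1)] + buys[(1, 1)]) / n        # 0.9
  p_tea    = (buys[(1, 0)] + buys[(1, 1)]) / n        # 0.25
  p_both   = buys[(1, 1)] / n                         # 0.2

  print(p_both / p_tea)              # P(coffee | tea)  = 0.8, below the 0.9 base rate
  print(p_both / p_coffee)           # P(tea | coffee) ~= 0.22
  print(p_both - p_tea * p_coffee)   # -0.025: slightly negative association, not independence

So the rule "tea => coffee" has high confidence mostly because nearly everyone buys coffee; the basketball/cereal example later in these notes makes the same point.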
How to Measure Interestingness
• RI = |X & Y| - |X||Y|/N
• Support and confidence: |X & Y|/N is the support and |X & Y|/|X| the confidence of X -> Y
• Chi^2: (|X & Y| - E[|X & Y|])^2 / E[|X & Y|]
• J-measure: J(X -> Y) = P(Y) * ( P(X|Y)*log(P(X|Y)/P(X)) + (1 - P(X|Y))*log((1 - P(X|Y))/(1 - P(X))) )
• Sufficiency: S(X -> Y) = P(X|Y)/P(X|!Y); Necessity: N(X -> Y) = P(!X|Y)/P(!X|!Y). The interestingness of Y -> X is NC = 1 - N(X -> Y)*P(Y) if N(X -> Y) is less than 1, and 0 otherwise.

Can We Find All and Only Interesting Patterns?
• Find all the interesting patterns: completeness
  – Can a data mining system find all the interesting patterns?
  – Association vs. classification vs. clustering
• Search for only interesting patterns: optimization
  – Can a data mining system find only the interesting patterns?
  – Approaches:
    • First generate all the patterns and then filter out the uninteresting ones.
    • Generate only the interesting patterns — mining query optimization.

Clustering
• Partition the data set into clusters; one can then store the cluster representation only
• Can be very effective if data is clustered, but not if data is "smeared"
• Clustering can be hierarchical and be stored in multidimensional index tree structures
• There are many choices of clustering definitions and clustering algorithms.

Example: Clusters and Outliers
(figure: a 2-D scatter plot with dense clusters of points and a few isolated outliers)

Sampling
• Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
• Choose a representative subset of the data
  – Simple random sampling may have very poor performance in the presence of skew
• Develop adaptive sampling methods
  – Stratified sampling:
    • Approximate the percentage of each class (or subpopulation of interest) in the overall database
    • Used in conjunction with skewed data
• Sampling may not reduce database I/Os (a page is read at a time).

Sampling
(figure: sampling from the raw data)

Sampling
(figure: raw data vs. a cluster/stratified sample)

Discretization
• Three types of attributes:
  – Nominal — values from an unordered set
  – Ordinal — values from an ordered set
  – Continuous — real numbers
• Discretization: divide the range of a continuous attribute into intervals
  – Some classification algorithms only accept categorical attributes.
  – Reduce data size by discretization
  – Prepare for further analysis

Discretization
• Discretization reduces the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

Discretization
(flowchart: sort the attribute, select a cut point, evaluate the measure; if not satisfied, split/merge and repeat; otherwise stop)

Discretization
• Dynamic vs. static
• Local vs. global
• Top-down vs. bottom-up
• Direct vs. incremental

Discretization – Quality Evaluation
• Total number of intervals
• The number of inconsistencies
• Predictive accuracy
• Complexity

Discretization - Binning
• Equal width – the range between the min and max values is split into equal-width intervals
• Equal frequency – each bin contains approximately the same number of data points
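A minimal sketch of the two binning schemes just described; the nine price values and the choice of three bins are purely illustrative.

  def equal_width_bins(values, k):
      # Split [min, max] into k intervals of equal width.
      lo, hi = min(values), max(values)
      width = (hi - lo) / k
      return [min(int((v - lo) / width), k - 1) for v in values]

  def equal_frequency_bins(values, k):
      # Put roughly the same number of values into each bin, by rank.
      order = sorted(range(len(values)), key=lambda i: values[i])
      per_bin = len(values) / k
      bins = [0] * len(values)
      for rank, i in enumerate(order):
          bins[i] = min(int(rank / per_bin), k - 1)
      return bins

  prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
  print(equal_width_bins(prices, 3))      # width 10: [0, 0, 1, 1, 1, 2, 2, 2, 2]
  print(equal_frequency_bins(prices, 3))  # 3 per bin: [0, 0, 0, 1, 1, 1, 2, 2, 2]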
Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S,T) = (|S1|/|S|) * Ent(S1) + (|S2|/|S|) * Ent(S2)
• The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
• The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., the gain Ent(S) - E(S,T) falls below a threshold.
• Experiments show that it may reduce data size and improve classification accuracy.

Data Mining Primitives, Languages, and System Architectures
• Data mining primitives: what defines a data mining task?
• A data mining query language
• Design of graphical user interfaces based on a data mining query language
• Architecture of data mining systems

Why Data Mining Primitives and Languages?
• Data mining should be an interactive process
  – The user directs what is to be mined
• Users must be provided with a set of primitives to communicate with the data mining system
• Incorporating these primitives in a data mining query language gives
  – More flexible user interaction
  – A foundation for the design of graphical user interfaces
  – Standardization of data mining industry and practice

What Defines a Data Mining Task?
• Task-relevant data
• Type of knowledge to be mined
• Background knowledge
• Pattern interestingness measures
• Visualization of discovered patterns

Task-Relevant Data (Minable View)
• Database or data warehouse name
• Database tables or data warehouse cubes
• Condition for data selection
• Relevant attributes or dimensions
• Data grouping criteria

Types of knowledge to be mined
• Characterization
• Discrimination
• Association
• Classification/prediction
• Clustering
• Outlier analysis
• Other data mining tasks

A Data Mining Query Language (DMQL)
• Motivation
  – A DMQL can provide the ability to support ad-hoc and interactive data mining
  – By providing a standardized language like SQL
    • Hope to achieve an effect similar to the one SQL has had on relational databases
    • Foundation for system development and evolution
    • Facilitates information exchange, technology transfer, commercialization and wide acceptance
• Design
  – DMQL is designed around the primitives described earlier

Syntax for DMQL
• Syntax for the specification of
  – task-relevant data
  – the kind of knowledge to be mined
  – concept hierarchy specification
  – interestingness measures
  – pattern presentation and visualization
• Putting it all together — a DMQL query

Syntax for task-relevant data specification
• use database database_name, or use data warehouse data_warehouse_name
• from relation(s)/cube(s) [where condition]
• in relevance to att_or_dim_list
• order by order_list
• group by grouping_list
• having condition

Specification of task-relevant data
(example figure)

Syntax for specifying the kind of knowledge to be mined
• Characterization
  Mine_Knowledge_Specification ::=
    mine characteristics [as pattern_name]
    analyze measure(s)
• Discrimination
  Mine_Knowledge_Specification ::=
    mine comparison [as pattern_name]
    for target_class where target_condition
    {versus contrast_class_i where contrast_condition_i}
    analyze measure(s)
• Association
  Mine_Knowledge_Specification ::=
    mine associations [as pattern_name]

Syntax for specifying the kind of knowledge to be mined (cont.)
• Classification
  Mine_Knowledge_Specification ::=
    mine classification [as pattern_name]
    analyze classifying_attribute_or_dimension
• Prediction
  Mine_Knowledge_Specification ::=
    mine prediction [as pattern_name]
    analyze prediction_attribute_or_dimension
    {set {attribute_or_dimension_i = value_i}}

Syntax for concept hierarchy specification
• To specify what concept hierarchies to use:
  use hierarchy <hierarchy> for <attribute_or_dimension>
• We use different syntax to define different types of hierarchies
  – schema hierarchies
    define hierarchy time_hierarchy on date as [date, month, quarter, year]
  – set-grouping hierarchies
    define hierarchy age_hierarchy for age on customer as
      level1: {young, middle_aged, senior} < level0: all
      level2: {20, ..., 39} < level1: young
      level2: {40, ..., 59} < level1: middle_aged
      level2: {60, ..., 89} < level1: senior

Syntax for concept hierarchy specification (cont.)
  – operation-derived hierarchies
    define hierarchy age_hierarchy for age on customer as
      {age_category(1), ..., age_category(5)} := cluster(default, age, 5) < all(age)
  – rule-based hierarchies
    define hierarchy profit_margin_hierarchy on item as
      level_1: low_profit_margin < level_0: all
        if (price - cost) < $50
      level_1: medium_profit_margin < level_0: all
        if ((price - cost) > $50) and ((price - cost) <= $250)
      level_1: high_profit_margin < level_0: all
        if (price - cost) > $250

Syntax for interestingness measure specification
• Interestingness measures and thresholds can be specified by the user with the statement:
  with <interest_measure_name> threshold = threshold_value
• Example:
  with support threshold = 0.05
  with confidence threshold = 0.7

Syntax for pattern presentation and visualization specification
• Syntax that allows users to specify the display of discovered patterns in one or more forms:
  display as <result_form>
• To facilitate interactive viewing at different concept levels, the following syntax is defined:
  Multilevel_Manipulation ::= roll up on attribute_or_dimension
                            | drill down on attribute_or_dimension
                            | add attribute_or_dimension
                            | drop attribute_or_dimension

Putting it all together: the full specification of a DMQL query
  use database AllElectronics_db
  use hierarchy location_hierarchy for B.address
  mine characteristics as customerPurchasing
  analyze count%
  in relevance to C.age, I.type, I.place_made
  from customer C, item I, purchases P, items_sold S, works_at W, branch B
  where I.item_ID = S.item_ID and S.trans_ID = P.trans_ID and P.cust_ID = C.cust_ID
    and P.method_paid = "AmEx" and P.empl_ID = W.empl_ID and W.branch_ID = B.branch_ID
    and B.address = "Canada" and I.price >= 100
  with noise threshold = 0.05
  display as table

DMQL and SQL
• DMQL: describe the general characteristics of graduate students in the Big-University database
  use Big_University_DB
  mine characteristics as "Science_Students"
  in relevance to name, gender, major, birth_place, birth_date, residence, phone#, gpa
  from student
  where status in "graduate"
• Corresponding SQL statement:
  Select name, gender, major, birth_place, birth_date, residence, phone#, gpa
  from student
  where status in {"Msc", "MBA", "PhD"}

Decision Trees Example
• Conducted a survey to see which customers were interested in a new model car
• Want to select customers for an advertising campaign

  sale (training set):
  custId   car      age   city   newCar
  c1       taurus   27    sf     yes
  c2       van      35    la     yes
  c3       van      40    sf     yes
  c4       taurus   22    sf     yes
  c5       merc     50    la     no
  c6       taurus   25    la     no
One Possibility
  age < 30?
    yes -> city = sf?
             yes -> likely
             no  -> unlikely
    no  -> car = van?
             yes -> likely
             no  -> unlikely

Another Possibility
  car = taurus?
    yes -> city = sf?
             yes -> likely
             no  -> unlikely
    no  -> age < 45?
             yes -> likely
             no  -> unlikely

Issues
• A decision tree cannot be "too deep":
  – there would not be statistically significant amounts of data for the lower decisions
• Need to select the tree that most reliably predicts outcomes

Top-Down Induction of Decision Tree
• Attributes = {Outlook, Temperature, Humidity, Wind}; PlayTennis = {yes, no}
  Outlook?
    sunny    -> Humidity?  (high -> no, normal -> yes)
    overcast -> yes
    rain     -> Wind?      (strong -> no, weak -> yes)

Entropy and Information Gain
• S contains s_i tuples of class C_i for i = 1, ..., m
• The information required to classify an arbitrary tuple:
  I(s1, s2, ..., sm) = - Sum_{i=1..m} (s_i / s) * log2(s_i / s)
• The entropy of attribute A with values {a1, a2, ..., av}:
  E(A) = Sum_{j=1..v} ((s_1j + ... + s_mj) / s) * I(s_1j, ..., s_mj)
• The information gained by branching on attribute A:
  Gain(A) = I(s1, s2, ..., sm) - E(A)

Example: Analytical Characterization
• Task
  – Mine general characteristics describing graduate students using analytical characterization
• Given
  – attributes name, gender, major, birth_place, birth_date, phone#, and gpa
  – Gen(ai) = concept hierarchies on ai
  – Ui = attribute analytical thresholds for ai
  – Ti = attribute generalization thresholds for ai
  – R = attribute relevance threshold

Example: Analytical Characterization (cont'd)
• 1. Data collection
  – target class: graduate student
  – contrasting class: undergraduate student
• 2. Analytical generalization using Ui
  – attribute removal
    • remove name and phone#
  – attribute generalization
    • generalize major, birth_place, birth_date and gpa
    • accumulate counts
  – candidate relation: gender, major, birth_country, age_range and gpa

Example: Analytical Characterization (3)
• 3. Relevance analysis
  – Calculate the expected information required to classify an arbitrary tuple:
    I(s1, s2) = I(120, 130) = -(120/250)*log2(120/250) - (130/250)*log2(130/250) = 0.9988
  – Calculate the entropy of each attribute, e.g., major:
    For major = "Science":     s11 = 84 (graduate students), s21 = 42 (undergraduate students), I(s11, s21) = 0.9183
    For major = "Engineering": s12 = 36, s22 = 46, I(s12, s22) = 0.9892
    For major = "Business":    s13 = 0,  s23 = 42, I(s13, s23) = 0

Example: Analytical Characterization (4)
• Calculate the expected information required to classify a given sample if S is partitioned according to the attribute:
  E(major) = (126/250)*I(s11, s21) + (82/250)*I(s12, s22) + (42/250)*I(s13, s23) = 0.7873
• Calculate the information gain for each attribute:
  Gain(major) = I(s1, s2) - E(major) = 0.2115
  – Information gain for all attributes:
    Gain(gender)        = 0.0003
    Gain(birth_country) = 0.0407
    Gain(major)         = 0.2115
    Gain(gpa)           = 0.4490
    Gain(age_range)     = 0.5971

Example: Analytical Characterization (5)
• 4. Derivation of the initial working relation (W0)
  – R = 0.1
  – remove irrelevant/weakly relevant attributes from the candidate relation => drop gender, birth_country
  – remove the contrasting-class candidate relation

  Initial target class working relation W0: graduate students
  major         age_range   gpa         count
  Science       20-25       Very_good   16
  Science       25-30       Excellent   47
  Science       20-25       Excellent   21
  Engineering   20-25       Excellent   18
  Engineering   25-30       Excellent   18

• 5. Perform attribute-oriented induction on W0 using Ti
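The gain numbers above can be reproduced with a short sketch; only the class counts from the slides (120/130 overall and the per-major counts) are used.

  import math

  def info(counts):
      # I(s1,...,sm) = -sum (si/s) * log2(si/s), skipping empty classes.
      s = sum(counts)
      return -sum((c / s) * math.log2(c / s) for c in counts if c > 0)

  def expected_info(partitions):
      # E(A): info of each partition weighted by the partition's share of the tuples.
      total = sum(sum(p) for p in partitions)
      return sum(sum(p) / total * info(p) for p in partitions)

  # Graduate vs. undergraduate counts per value of `major`.
  major = [[84, 42], [36, 46], [0, 42]]   # Science, Engineering, Business
  i_all = info([120, 130])                # ~0.9988
  e_major = expected_info(major)          # ~0.7873
  print(round(i_all - e_major, 4))        # Gain(major) ~ 0.2115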
What Is Association Mining?
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Applications:
  – Basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.
• Examples:
  – Rule form: "Body -> Head [support, confidence]"
  – buys(x, "diapers") -> buys(x, "beers") [0.5%, 60%]
  – major(x, "CS") ^ takes(x, "DB") -> grade(x, "A") [1%, 75%]

Association Rule Mining
• Sales records (market-basket data):

  tran1   cust33   p2, p5, p8
  tran2   cust45   p5, p8, p11
  tran3   cust12   p1, p9
  tran4   cust40   p5, p8, p11
  tran5   cust12   p2, p9
  tran6   cust12   p9

• Trend: products p5 and p8 are often bought together
• Trend: customer 12 likes product p9

Association Rule
• Rule: {p1, p3, p8}
• Support: the number of baskets in which these products appear together
• High-support set: support above a threshold s
• Problem: find all high-support sets

Association Rule: Basic Concepts
• Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
• Find: all rules that correlate the presence of one set of items with that of another set of items
  – E.g., 98% of people who purchase tires and auto accessories also get automotive services done
• Applications
  – * => Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
  – Home Electronics => * (what other products should the store stock up on?)
  – Attached mailing in direct marketing
  – Detecting "ping-pong"ing of patients, faulty "collisions"

Rule Measures: Support and Confidence
(figure: Venn diagram of customers buying beer, diaper, or both)
• Find all the rules X & Y => Z with minimum confidence and support
  – support, s: the probability that a transaction contains {X, Y, Z}
  – confidence, c: the conditional probability that a transaction containing {X, Y} also contains Z

  Transaction ID   Items Bought
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

• With minimum support 50% and minimum confidence 50%, we have
  – A => C (50%, 66.6%)
  – C => A (50%, 100%)

Mining Association Rules — An Example
• Min. support 50%, min. confidence 50%; for the transactions above, the frequent itemsets are:

  Frequent itemset   Support
  {A}                75%
  {B}                50%
  {C}                50%
  {A, C}             50%

• For the rule A => C:
  – support = support({A, C}) = 50%
  – confidence = support({A, C}) / support({A}) = 66.6%
• The Apriori principle: any subset of a frequent itemset must be frequent

Mining Frequent Itemsets: the Key Step
• Find the frequent itemsets: the sets of items that have minimum support
  – A subset of a frequent itemset must also be a frequent itemset
    • i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
• Use the frequent itemsets to generate association rules.

The Apriori Algorithm
• Join step: Ck is generated by joining Lk-1 with itself
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k
  L1 = {frequent items};
  for (k = 1; Lk is not empty; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support
  end
  return the union of all Lk;
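A compact sketch of the Apriori loop just outlined, applied to the four-transaction A/B/C example above. It illustrates the join/prune/count steps only; it is not an optimized implementation (no hash tree).

  from itertools import combinations

  def apriori(transactions, min_support):
      # Returns {frozenset(itemset): support count} for all frequent itemsets.
      transactions = [frozenset(t) for t in transactions]
      min_count = min_support * len(transactions)

      def count(candidates):
          counts = {c: 0 for c in candidates}
          for t in transactions:
              for c in candidates:
                  if c <= t:
                      counts[c] += 1
          return {c: s for c, s in counts.items() if s >= min_count}

      items = {frozenset([i]) for t in transactions for i in t}
      frequent = count(items)               # L1
      L_k, k = set(frequent), 2
      while L_k:
          # Join: candidate k-itemsets as unions of pairs of frequent (k-1)-itemsets.
          candidates = {a | b for a in L_k for b in L_k if len(a | b) == k}
          # Prune: every (k-1)-subset of a candidate must itself be frequent.
          candidates = {c for c in candidates
                        if all(frozenset(s) in L_k for s in combinations(c, k - 1))}
          L_next = count(candidates)
          frequent.update(L_next)
          L_k, k = set(L_next), k + 1
      return frequent

  db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
  for itemset, cnt in sorted(apriori(db, 0.5).items(),
                             key=lambda x: (len(x[0]), sorted(x[0]))):
      print(sorted(itemset), cnt)   # {A}:3  {B}:2  {C}:2  {A,C}:2

Running it on these four transactions reproduces the frequent itemsets listed on the slide ({A} 75%, {B} 50%, {C} 50%, {A,C} 50%).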
The Apriori Algorithm — Example
• Database D (minimum support = 2 transactions):

  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

• Scan D for the counts of the candidate 1-itemsets C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
  => L1 = {1}:2, {2}:3, {3}:3, {5}:3
• C2 (from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}; scan D for their counts:
  {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
  => L2 = {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
• C3 (from L2): {2 3 5}; scan D
  => L3 = {2 3 5}:2

How to Generate Candidates?
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
• Step 2: pruning
  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck

How to Count Supports of Candidates?
• Why is counting the supports of candidates a problem?
  – The total number of candidates can be very large
  – One transaction may contain many candidates
• Method:
  – Candidate itemsets are stored in a hash tree
  – A leaf node of the hash tree contains a list of itemsets and counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction

Example of Generating Candidates
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
  – abcd from abc and abd
  – acde from acd and ace
• Pruning:
  – acde is removed because ade is not in L3
• C4 = {abcd}

Criticism of Support and Confidence
• Example 1 (Aggarwal & Yu, PODS '98):
  – Among 5000 students
    • 3000 play basketball
    • 3750 eat cereal
    • 2000 both play basketball and eat cereal
  – play basketball => eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
  – play basketball => not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.

               basketball   not basketball   sum (row)
  cereal       2000         1750             3750
  not cereal   1000          250             1250
  sum (col.)   3000         2000             5000

Criticism of Support and Confidence (cont.)
• Example 2:

  X   1 1 1 1 0 0 0 0
  Y   1 1 0 0 0 0 0 0
  Z   0 1 1 1 1 1 1 1

  – X and Y are positively correlated; X and Z are negatively related
  – yet the support and confidence of X => Z dominate:

  Rule    Support   Confidence
  X=>Y    25%       50%
  X=>Z    37.50%    75%

• We need a measure of dependent or correlated events:
  corr(A,B) = P(A^B) / (P(A) * P(B))
• P(B|A)/P(B) is also called the lift of the rule A => B

Other Interestingness Measures: Interest
• Interest (correlation, lift): P(A^B) / (P(A) * P(B))
  – takes both P(A) and P(B) into consideration
  – P(A^B) = P(A) * P(B) if A and B are independent events
  – A and B are negatively correlated if the value is less than 1; otherwise A and B are positively correlated
• For the X, Y, Z table above:

  Itemset   Support   Interest
  X,Y       25%       2
  X,Z       37.50%    0.9
  Y,Z       12.50%    0.57
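A quick check of the interest (lift) values computed from the eight X/Y/Z rows above:

  rows = [  # (X, Y, Z) columns from the table
      (1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 1),
      (0, 0, 1), (0, 0, 1), (0, 0, 1), (0, 0, 1),
  ]
  n = len(rows)

  def p(pred):
      return sum(1 for r in rows if pred(r)) / n

  def interest(i, j):
      # interest(A, B) = P(A and B) / (P(A) * P(B))
      return p(lambda r: r[i] and r[j]) / (p(lambda r: r[i]) * p(lambda r: r[j]))

  print(round(interest(0, 1), 2))  # X,Y -> 2.0   (positively correlated)
  print(round(interest(0, 2), 2))  # X,Z -> ~0.86 (negatively correlated)
  print(round(interest(1, 2), 2))  # Y,Z -> ~0.57 (negatively correlated)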
Classification vs. Prediction
• Classification:
  – predicts categorical class labels
  – classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
• Prediction:
  – models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
  – credit approval
  – target marketing
  – medical diagnosis
  – treatment effectiveness analysis

Classification Process: Model Construction
• Training data:

  NAME   RANK             YEARS   TENURED
  Mike   Assistant Prof   3       no
  Mary   Assistant Prof   7       yes
  Bill   Professor        2       yes
  Jim    Associate Prof   7       yes
  Dave   Assistant Prof   6       no
  Anne   Associate Prof   3       no

• A classification algorithm produces a classifier (model), e.g.:
  IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process: Use the Model in Prediction
• Testing data:

  NAME      RANK             YEARS   TENURED
  Tom       Assistant Prof   2       no
  Merlisa   Associate Prof   7       no
  George    Professor        5       yes
  Joseph    Assistant Prof   7       yes

• Unseen data: (Jeff, Professor, 4) — tenured?

Supervised vs. Unsupervised Learning
• Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data is classified based on the training set
• Unsupervised learning (clustering)
  – The class labels of the training data are unknown
  – Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Training Dataset
• This follows an example from Quinlan's ID3:

  age     income   student   credit_rating   buys_computer
  <=30    high     no        fair            no
  <=30    high     no        excellent       no
  31…40   high     no        fair            yes
  >40     medium   no        fair            yes
  >40     low      yes       fair            yes
  >40     low      yes       excellent       no
  31…40   low      yes       excellent       yes
  <=30    medium   no        fair            no
  <=30    low      yes       fair            yes
  >40     medium   yes       fair            yes
  <=30    medium   yes       excellent       yes
  31…40   medium   no        excellent       yes
  31…40   high     yes       fair            yes
  >40     medium   yes       excellent       no

Output: A Decision Tree for "buys_computer"
  age?
    <=30  -> student?        (no -> no,        yes -> yes)
    31…40 -> yes
    >40   -> credit_rating?  (excellent -> no, fair -> yes)
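One way to see the tree operationally: a minimal sketch that encodes it as a nested dictionary and walks it to classify a sample. The attribute and value names follow the table above; the classify helper is only illustrative, not part of any library.

  tree = {
      "attribute": "age",
      "branches": {
          "<=30":  {"attribute": "student",
                    "branches": {"no": "no", "yes": "yes"}},
          "31…40": "yes",
          ">40":   {"attribute": "credit_rating",
                    "branches": {"excellent": "no", "fair": "yes"}},
      },
  }

  def classify(node, sample):
      # Walk the tree until a leaf (a plain class label) is reached.
      while isinstance(node, dict):
          node = node["branches"][sample[node["attribute"]]]
      return node

  print(classify(tree, {"age": "<=30", "student": "yes", "credit_rating": "fair"}))  # yes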
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner
  – At the start, all the training examples are at the root
  – Attributes are categorical (if continuous-valued, they are discretized in advance)
  – Examples are partitioned recursively based on selected attributes
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
• Conditions for stopping partitioning
  – All samples for a given node belong to the same class
  – There are no remaining attributes for further partitioning — majority voting is employed for classifying the leaf
  – There are no samples left

Information Gain (ID3/C4.5)
• Select the attribute with the highest information gain
• Assume there are two classes, P and N
  – Let the set of examples S contain p elements of class P and n elements of class N
  – The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as
    I(p, n) = -(p/(p+n)) * log2(p/(p+n)) - (n/(p+n)) * log2(n/(p+n))

Information Gain in Decision Tree Induction
• Assume that using attribute A, the set S is partitioned into sets {S1, S2, ..., Sv}
  – If Si contains pi examples of P and ni examples of N, the entropy, or the expected information needed to classify objects in all subtrees Si, is
    E(A) = Sum_{i=1..v} ((pi + ni)/(p + n)) * I(pi, ni)
• The encoding information that would be gained by branching on A:
  Gain(A) = I(p, n) - E(A)

Attribute Selection by Information Gain Computation
• Class P: buys_computer = "yes"; class N: buys_computer = "no"
• I(p, n) = I(9, 5) = 0.940
• Compute the entropy for age:

  age      pi   ni   I(pi, ni)
  <=30     2    3    0.971
  30…40    4    0    0
  >40      3    2    0.971

  E(age) = (5/14)*I(2,3) + (4/14)*I(4,0) + (5/14)*I(3,2) = 0.69
• Hence Gain(age) = I(p, n) - E(age) ≈ 0.25
• Similarly:
  Gain(income)        = 0.029
  Gain(student)       = 0.151
  Gain(credit_rating) = 0.048

Gini Index (IBM IntelligentMiner)
• If a data set T contains examples from n classes, the gini index gini(T) is defined as
  gini(T) = 1 - Sum_{j=1..n} pj^2
  where pj is the relative frequency of class j in T.
• If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as
  gini_split(T) = (N1/N) * gini(T1) + (N2/N) * gini(T2)
• The attribute that provides the smallest gini_split(T) is chosen to split the node (one needs to enumerate all possible splitting points for each attribute).
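A small sketch of the two Gini formulas, evaluated on the 9-yes/5-no buys_computer data; the candidate binary split shown (ages {31…40, >40} versus {<=30}) is only illustrative.

  def gini(class_counts):
      # gini(T) = 1 - sum_j pj^2
      n = sum(class_counts)
      return 1.0 - sum((c / n) ** 2 for c in class_counts)

  def gini_split(left_counts, right_counts):
      # Weighted Gini of a binary split T -> (T1, T2).
      n1, n2 = sum(left_counts), sum(right_counts)
      n = n1 + n2
      return n1 / n * gini(left_counts) + n2 / n * gini(right_counts)

  # Split on age: {31…40, >40} has 7 yes / 2 no, {<=30} has 2 yes / 3 no.
  print(round(gini([9, 5]), 3))                # Gini before the split (~0.459)
  print(round(gini_split([7, 2], [2, 3]), 3))  # Weighted Gini after the split (~0.394)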
Extracting Classification Rules from Trees
• Represent the knowledge in the form of IF-THEN rules
• One rule is created for each path from the root to a leaf
• Each attribute-value pair along a path forms a conjunction
• The leaf node holds the class prediction
• Rules are easier for humans to understand
• Example:
  IF age = "<=30" AND student = "no"  THEN buys_computer = "no"
  IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  IF age = "31…40"                    THEN buys_computer = "yes"
  IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  IF age = ">40" AND credit_rating = "fair"      THEN buys_computer = "yes"

Avoid Overfitting in Classification
• The generated tree may overfit the training data
  – Too many branches, some of which may reflect anomalies due to noise or outliers
  – The result is poor accuracy for unseen samples
• Two approaches to avoid overfitting
  – Prepruning: halt tree construction early — do not split a node if this would result in the goodness measure falling below a threshold
    • Difficult to choose an appropriate threshold
  – Postpruning: remove branches from a "fully grown" tree — get a sequence of progressively pruned trees
    • Use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size
• Separate training (2/3) and testing (1/3) sets
• Use cross-validation, e.g., 10-fold cross-validation
• Use all the data for training
  – but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution
• Use the minimum description length (MDL) principle:
  – halt growth of the tree when the encoding is minimized

Scalable Decision Tree Induction Methods in Data Mining Studies
• SLIQ (EDBT'96 — Mehta et al.)
  – builds an index for each attribute; only the class list and the current attribute list reside in memory
• SPRINT (VLDB'96 — J. Shafer et al.)
  – constructs an attribute-list data structure
• PUBLIC (VLDB'98 — Rastogi & Shim)
  – integrates tree splitting and tree pruning: stop growing the tree earlier
• RainForest (VLDB'98 — Gehrke, Ramakrishnan & Ganti)
  – separates the scalability aspects from the criteria that determine the quality of the tree
  – builds an AVC-list (attribute, value, class label)

Bayesian Theorem
• Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes' theorem:
  P(h|D) = P(D|h) * P(h) / P(D)
• MAP (maximum a posteriori) hypothesis:
  h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) * P(h)
• Practical difficulty: requires initial knowledge of many probabilities, and has significant computational cost

Bayesian Classification
• The classification problem may be formalized using a-posteriori probabilities:
• P(C|X) = probability that the sample tuple X = <x1, ..., xk> is of class C
• E.g., P(class = N | outlook = sunny, windy = true, ...)
• Idea: assign to sample X the class label C such that P(C|X) is maximal

Estimating A-Posteriori Probabilities
• Bayes theorem: P(C|X) = P(X|C) * P(C) / P(X)
• P(X) is constant for all classes
• P(C) = relative frequency of class C samples
• The C such that P(C|X) is maximum is the C such that P(X|C) * P(C) is maximum
• Problem: computing P(X|C) is infeasible!
Naïve Bayesian Classification
• Naïve assumption: attribute independence
  P(x1, ..., xk | C) = P(x1|C) * ... * P(xk|C)
• If the i-th attribute is categorical: P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
• If the i-th attribute is continuous: P(xi|C) is estimated through a Gaussian density function
• Computationally easy in both cases

Play-Tennis Example: Estimating P(xi|C)

  Outlook    Temperature   Humidity   Windy   Class
  sunny      hot           high       false   N
  sunny      hot           high       true    N
  overcast   hot           high       false   P
  rain       mild          high       false   P
  rain       cool          normal     false   P
  rain       cool          normal     true    N
  overcast   cool          normal     true    P
  sunny      mild          high       false   N
  sunny      cool          normal     false   P
  rain       mild          normal     false   P
  sunny      mild          normal     true    P
  overcast   mild          high       true    P
  overcast   hot           normal     false   P
  rain       mild          high       true    N

• P(p) = 9/14, P(n) = 5/14
• outlook:     P(sunny|p) = 2/9,  P(sunny|n) = 3/5;  P(overcast|p) = 4/9,  P(overcast|n) = 0;  P(rain|p) = 3/9,  P(rain|n) = 2/5
• temperature: P(hot|p) = 2/9,    P(hot|n) = 2/5;    P(mild|p) = 4/9,      P(mild|n) = 2/5;    P(cool|p) = 3/9,  P(cool|n) = 1/5
• humidity:    P(high|p) = 3/9,   P(high|n) = 4/5;   P(normal|p) = 6/9,    P(normal|n) = 2/5
• windy:       P(true|p) = 3/9,   P(true|n) = 3/5;   P(false|p) = 6/9,     P(false|n) = 2/5

Play-Tennis Example: Classifying X
• An unseen sample X = <rain, hot, high, false>
• P(X|p)*P(p) = P(rain|p)*P(hot|p)*P(high|p)*P(false|p)*P(p) = 3/9 * 2/9 * 3/9 * 6/9 * 9/14 = 0.010582
• P(X|n)*P(n) = P(rain|n)*P(hot|n)*P(high|n)*P(false|n)*P(n) = 2/5 * 2/5 * 4/5 * 2/5 * 5/14 = 0.018286
• Sample X is classified in class n (don't play)

Association-Based Classification
• Several methods for association-based classification
  – ARCS: quantitative association mining and clustering of association rules (Lent et al. '97)
    • It beats C4.5 in (mainly) scalability and also accuracy
  – Associative classification (Liu et al. '98)
    • It mines high-support, high-confidence rules of the form "cond_set => y", where y is a class label
  – CAEP (Classification by Aggregating Emerging Patterns) (Dong et al. '99)
    • Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
    • Mine EPs based on minimum support and growth rate

What Is Prediction?
• Prediction is similar to classification
  – First, construct a model
  – Second, use the model to predict unknown values
• The major method for prediction is regression
  – Linear and multiple regression
  – Non-linear regression
• Prediction is different from classification
  – Classification refers to predicting a categorical class label
  – Prediction models continuous-valued functions

Regression Analysis and Log-Linear Models in Prediction
• Linear regression: Y = α + βX
  – The two parameters, α and β, specify the line and are to be estimated from the data at hand,
  – using the least-squares criterion on the known values Y1, Y2, ..., X1, X2, ....
• Multiple regression: Y = b0 + b1*X1 + b2*X2
  – Many nonlinear functions can be transformed into the above.
• Log-linear models:
  – The multi-way table of joint probabilities is approximated by a product of lower-order tables.
  – Probability: p(a, b, c, d) = αab * βac * χad * δbcd
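Going back to the play-tennis example a few slides up, the two scores can be reproduced with a minimal naive Bayes sketch; the probabilities are the relative frequencies read off the 14-row table.

  # Classify X = <rain, hot, high, false> with the naive independence assumption.
  p_class = {"P": 9 / 14, "N": 5 / 14}
  cond = {
      "P": {"rain": 3 / 9, "hot": 2 / 9, "high": 3 / 9, "false": 6 / 9},
      "N": {"rain": 2 / 5, "hot": 2 / 5, "high": 4 / 5, "false": 2 / 5},
  }
  x = ["rain", "hot", "high", "false"]

  for c in ("P", "N"):
      score = p_class[c]
      for value in x:
          score *= cond[c][value]       # P(X|C) as a product of per-attribute terms
      print(c, round(score, 6))         # P: 0.010582   N: 0.018286

  # The larger score wins, so X is classified as N (don't play).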
General Applications of Clustering
• Pattern recognition
• Spatial data analysis
  – create thematic maps in GIS by clustering feature spaces
  – detect spatial clusters and explain them in spatial data mining
• Image processing
• Economic science (especially market research)
• WWW
  – Document classification
  – Cluster weblog data to discover groups of similar access patterns

Examples of Clustering Applications
• Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: identification of areas of similar land use in an earth observation database
• Insurance: identifying groups of motor insurance policy holders with a high average claim cost
• City planning: identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: observed earthquake epicenters should be clustered along continent faults

What Is Good Clustering?
• A good clustering method will produce high-quality clusters with
  – high intra-class similarity
  – low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.

Types of Data in Cluster Analysis
• Data matrix: n objects by p variables, with entry x_if giving the value of variable f for object i
• Dissimilarity matrix: an n-by-n table whose entry d(i,j) is the dissimilarity between objects i and j; only the lower triangle is needed, since d(i,i) = 0 and d(i,j) = d(j,i)

Measure the Quality of Clustering
• Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i,j)
• There is a separate "quality" function that measures the "goodness" of a cluster.
• The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
• Weights should be associated with different variables based on applications and data semantics.
• It is hard to define "similar enough" or "good enough" – the answer is typically highly subjective.

Similarity and Dissimilarity Between Objects
• Distances are normally used to measure the similarity or dissimilarity between two data objects
• A popular choice is the Minkowski distance:
  d(i,j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)
  where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer
• If q = 1, d is the Manhattan distance:
  d(i,j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

Similarity and Dissimilarity Between Objects
• If q = 2, d is the Euclidean distance:
  d(i,j) = sqrt( |x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2 )
  – Properties
    • d(i,j) >= 0
    • d(i,i) = 0
    • d(i,j) = d(j,i)
    • d(i,j) <= d(i,k) + d(k,j)
• One can also use weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.
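A minimal sketch tying the data matrix and the dissimilarity matrix together using the Minkowski distance just defined; the four 2-D objects are illustrative.

  def minkowski(x, y, q):
      # q = 1 gives Manhattan distance, q = 2 Euclidean distance.
      return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

  def dissimilarity_matrix(data, q=2):
      # Lower triangle of d(i, j), with zeros on the diagonal.
      n = len(data)
      return [[minkowski(data[i], data[j], q) for j in range(i)] + [0.0]
              for i in range(n)]

  data = [(1, 2), (3, 5), (2, 0), (4, 5)]   # data matrix: 4 objects, 2 variables
  for row in dissimilarity_matrix(data, q=2):
      print([round(v, 2) for v in row])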
Binary Variables
• A contingency table for binary data comparing objects i and j:

                object j
                1      0      sum
  object i  1   a      b      a+b
            0   c      d      c+d
       sum      a+c    b+d    p

• Simple matching coefficient (invariant, if the binary variable is symmetric):
  d(i,j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (noninvariant if the binary variable is asymmetric):
  d(i,j) = (b + c) / (a + b + c)

Dissimilarity between Binary Variables
• Example:

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

  – gender is a symmetric attribute
  – the remaining attributes are asymmetric binary
  – let the values Y and P be set to 1, and the value N be set to 0
  – d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  – d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  – d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Major Clustering Methods
• Partitioning algorithms: construct various partitions and then evaluate them by some criterion
• Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
• Density-based: based on connectivity and density functions
• Grid-based: based on a multiple-level granularity structure
• Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to those models

Partitioning Algorithms: Basic Concept
• Partitioning method: construct a partition of a database D of n objects into a set of k clusters
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  – Global optimum: exhaustively enumerate all partitions
  – Heuristic methods: k-means and k-medoids algorithms
  – k-means (MacQueen '67): each cluster is represented by the center of the cluster
  – k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method
• Given k, the k-means algorithm is implemented in 4 steps:
  – Partition the objects into k nonempty subsets
  – Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
  – Assign each object to the cluster with the nearest seed point.
  – Go back to step 2; stop when no new assignments are made.

The K-Means Clustering Method
• Example
  (figure: successive iterations of k-means on a 2-D point set, showing the reassignment of points and the movement of the cluster centroids)

Comments on the K-Means Method
• Strength
  – Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
  – Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
• Weakness
  – Applicable only when the mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance
  – Unable to handle noisy data and outliers
  – Not suitable for discovering clusters with non-convex shapes
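A minimal sketch of the four k-means steps above, assuming numeric 2-D points and Euclidean distance; the data, k = 2, and the random initialization are only illustrative.

  import math, random

  def kmeans(points, k, max_iter=100):
      centroids = random.sample(points, k)        # initial seed points
      for _ in range(max_iter):
          # Assign each point to the cluster with the nearest centroid.
          clusters = [[] for _ in range(k)]
          for p in points:
              i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
              clusters[i].append(p)
          # Recompute each centroid as the mean of its cluster.
          new_centroids = [
              tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[i]
              for i, c in enumerate(clusters)
          ]
          if new_centroids == centroids:          # stop when nothing changes
              break
          centroids = new_centroids
      return centroids, clusters

  points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
  centers, groups = kmeans(points, k=2)
  print(centers)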