INTRODUCTION TO DATA MINING
Pinakpani Pal
Electronics & Communication Sciences Unit
Indian Statistical Institute
[email protected]

Main Sources
• Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 2007
• Willi Klosgen and Jan M. Zytkow, "Handbook of Data Mining and Knowledge Discovery", 2002
• R. Srikant, "Fast Algorithms for Mining Association Rules and Sequential Patterns", Ph.D. Thesis, University of Wisconsin-Madison, 1996
• M. J. Zaki, "Parallel and distributed association mining: a survey," IEEE Concurrency, 7(4), pp. 14-25, 1999

Prelude
• Data Mining is a method of finding interesting trends or patterns in large datasets.
• The collected data may be incomplete, heterogeneous and historical.
• Since the data volume is very large, efficiency and scalability are two very important criteria for data mining algorithms.
• Data Mining tools are expected to involve minimal user intervention.

Prelude
• Data mining deals with finding patterns in data that are either
– user-defined (pre-specified by the user),
– interesting (judged with the help of an interestingness measure), or
– valid (validity criteria pre-defined).
• Discovered patterns help and guide the appropriate authority in taking future decisions, so Data Mining is regarded as a tool for Decision Support.

Data Mining Communities
• Statistics: provides the background for the algorithms.
• Artificial Intelligence: provides the heuristics required for machine learning / conceptual clustering.
• Database: provides the platform for storage and retrieval of raw and summary data.

Data Mining
Mining knowledge from large amounts of data.
Evolution:
• Data collection
• Database creation
• Data management
– Data storage
– Retrieval
– Transaction processing
• Advanced data analysis: data warehousing and data mining

Data Mining Components
• Information Repository: single or multiple heterogeneous data sources
• Data Server: stores and retrieves the relevant data
• Knowledge Base: concept hierarchies, constraints, thresholds, metadata
• Pattern Extraction: characterization, discrimination, association, classification, prediction, clustering, various statistical analyses
• Pattern Evaluation: interestingness measures

Stages of the Data Mining Process
Misconception: data mining systems can autonomously dig out all of the valuable knowledge from a given large database, without human intervention.
Steps:
• Data Collection
– web crawling / warehousing
• Data Preprocessing & Feature Extraction
– Data cleaning: elimination of erroneous and irrelevant data
– Data integration: combining data from multiple sources
– Data selection / reduction: accepting only the attributes of the data that are interesting for the problem domain
– Data transformation: normalization, aggregation
• Pattern Extraction & Evaluation
– Identification of data mining primitives and interestingness measures is done at this stage.
• Visualization of data
– making the results easily understandable
• Evaluation of results
– not every software-discovered fact is useful for human beings!
Data Preprocessing
Data Cleaning: data may be incomplete, noisy and inconsistent. Attempts are made to identify outliers, smooth out noise, fill in missing values and correct inconsistencies.

Data Preprocessing
Data Integration: data analysis may involve integrating data from different sources, as in a Data Warehouse. The sources may include databases, data cubes or flat files.

Data Preprocessing
Data Reduction: since both the data volume and the attribute set may be too large, data reduction becomes necessary. It includes activities like removal of irrelevant and redundant attributes, data compression, and aggregation or generation of summary data.

Data Preprocessing
Transformation: data need to be transformed or consolidated into forms suitable for mining. This may include generalization, normalization (e.g. attribute values converted from absolute values to ranges), construction of new attributes, etc. A small normalization sketch follows.
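As a concrete illustration of the transformation step, here is a minimal sketch of min-max normalization of a numeric attribute to the range [0, 1]. The attribute name and the sample values are illustrative, not from the slides.

# Min-max normalization: map each value v to (v - min) / (max - min).
def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:                              # degenerate attribute: all values equal
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

incomes = [12000, 48000, 30000, 95000]        # hypothetical raw attribute values
print(min_max_normalize(incomes))             # [0.0, 0.4337..., 0.2168..., 1.0]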
Patterns
• Descriptive: characterizing general properties of the data
• Predictive: performing inference on the current data in order to make predictions
• Discovery should cover:
– multiple kinds of patterns, to accommodate different user expectations / applications (the user may specify hints to guide the search)
– patterns at various granularities

Frequent Patterns
Patterns that occur frequently in the data. Types:
• Itemsets
• Subsequences
• Substructures (sub-graphs, sub-trees, sub-lattices)

Discovery of Association Rules
The goal is to identify the features or items in a problem domain that tend to appear together. These features or items are said to be associated. The process is to find the set of all subsets of items or attributes that frequently occur in many database records or transactions and, additionally, to extract rules on how a subset of items influences the presence of another subset.

Association Rule: Example
A user studying the buying habits of customers may choose to mine association rules of the form:
P(X: customer, W) ∧ Q(X, Y) → buys(X, Z) [support = n%, confidence = m%]
Meta-rules such as the following can be specified:
occupation(X, "student") ∧ age(X, "20...29") → buys(X, "mobile") [1.4%, 70%]

Association Rule: Single/Multi
Single-dimensional association rule:
buys(X, "computer") → buys(X, "antivirus") [1.1%, 55%]
or, equivalently, "computer" → "antivirus" (A → B) [1.1%, 55%]
Multi-dimensional association rule:
occupation(X, "student") ∧ age(X, "20...29") → buys(X, "mobile") [1.4%, 70%]

Metrics for Interestingness Measures
Interestingness measures in knowledge discovery help to identify the relevance of the patterns discovered during the mining process.

Interestingness Measures
• Used to confine the number of uninteresting patterns returned by the process.
• Based on the structure of the patterns and the statistics underlying them.
• Each measure is associated with a threshold, which can be controlled by the user: patterns not meeting the threshold are not presented to the user.

Interestingness Measures: Objective
Objective measures of pattern interestingness:
• simplicity
• utility (support)
• certainty (confidence)
• novelty

Interestingness Measures: Simplicity
Simplicity: a pattern's interestingness is based on its overall simplicity for human comprehension, e.g. rule length is a simplicity measure.

Interestingness Measures: Support
Utility (support): the usefulness of a pattern.
support(A → B) = P(A ∪ B)
The support of an association rule {A} → {B} is the percentage of all transactions under analysis that contain the itemset A ∪ B.

Interestingness Measures: Confidence
Certainty (confidence): assesses the validity or trustworthiness of a pattern. Confidence is a certainty measure:
confidence(A → B) = P(B | A)
The confidence of an association rule {A} → {B} is the percentage of transactions containing A that also follow the rule. Association rules that satisfy both the confidence and the support thresholds are referred to as strong association rules.

Interestingness Measures: Novelty
Novelty: patterns contributing new information to the given pattern set are called novel patterns, e.g. data exceptions. Removing redundant patterns is a strategy for detecting novelty.

Market Basket Data Analysis
Let a transaction be defined as the variety of items purchased by a customer in one visit, irrespective of the quantity of each item purchased. The problem is to find the items that a customer tends to buy together.

An association rule is an expression of the form X → Y, where X and Y are sets of items. The intuitive meaning of the expression is that transactions that contain X tend to contain Y as well. The inverse may not be true. Since only the presence or absence of items is considered, and not the quantity purchased, these rules are called Binary Association Rules.

The purpose is to study consumers' purchase patterns in departmental stores. Consider four possible transactions:
1 – {Pen, Ink, Diary, Writing Pad}
2 – {Pen, Ink, Diary}
3 – {Pen, Diary}
4 – {Pen, Ink, Writing Pad}

A possible association rule: "Purchase of Pen implies the purchase of Ink or Diary"
{Pen} → {Ink} or {Pen} → {Diary}
Basically, a rule is of the form {LHS} → {RHS}, where both {LHS} and {RHS} are sets of items, called itemsets, and {LHS} ∩ {RHS} = ∅.
• {Pen, Ink} is a 2-itemset.

Binary Association Rule Mining
Two-step process:
1. Find all frequent itemsets: an itemset will be considered for mining rules if its support is above a threshold called minsup.
2. Generate strong association rules from the frequent itemsets: acceptance of a rule is once again through a threshold, called minconf.

Finding Frequent Itemsets
If there are N items in a market basket and association is studied over all possible item combinations, 2^N combinations are to be checked.

All nonempty subsets of a frequent itemset must also be frequent (the anti-monotone property).

Apriori Algorithm
An itemset is frequent when its occurrence in the total dataset exceeds minsup. If there exist N items, the algorithm attempts to compute frequent itemsets from 1-itemsets up to N-itemsets.

The algorithm has two steps:
1. Join step: candidate frequent k-itemsets are computed by joining the (k−1)-itemsets.
2. Prune step: if a k-itemset fails to cross the minsup threshold, all supersets of that k-itemset are no longer considered for association rule discovery.
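Before the formal algorithm, a minimal sketch of how support and confidence are computed on the four-transaction market basket example above; nothing here goes beyond those sample transactions.

transactions = [
    {"Pen", "Ink", "Diary", "Writing Pad"},
    {"Pen", "Ink", "Diary"},
    {"Pen", "Diary"},
    {"Pen", "Ink", "Writing Pad"},
]

def support(itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # fraction of transactions containing the LHS that also contain the RHS
    return support(lhs | rhs) / support(lhs)

print(support({"Pen", "Ink"}))          # 0.75: {Pen, Ink} appears in 3 of 4 baskets
print(confidence({"Pen"}, {"Ink"}))     # 0.75: 3 of the 4 Pen-buyers also buy Ink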
Apriori Algorithm
• Let Lk be the set of frequent k-itemsets.
• Let Ck be the set of candidate k-itemsets.
Each member of these sets has two fields: itemset and support count.

1. Let k ← 1
2. Generate L1, the frequent itemsets of length 1
3. If (Lk = ∅) or (k = N), go to Step 7
4. k ← k + 1
5. Generate Lk, the frequent itemsets of length k, by Join and Prune
6. Go to Step 3
7. Stop. Output: ∪k Lk

Join()
forall (i, j) where i ∈ Lk−1 and j ∈ Lk−1, i ≠ j
  select all possible k-itemsets and insert into Ck
endfor
Example: if L3 = {⟨{1 2 3}, s123⟩, ⟨{1 2 4}, s124⟩, ⟨{1 3 4}, s134⟩, ⟨{1 3 5}, s135⟩, ⟨{2 3 4}, s234⟩}
then C4 = {⟨{1 2 3 4}, s1234⟩, ⟨{1 3 4 5}, s1345⟩}

Prune()
forall itemsets c ∈ Ck do
  forall (k−1)-subsets s of c do
    if (s ∉ Lk−1) then delete c from Ck endif
  endfor
endfor
The candidates surviving the prune step and the support count form Lk (Lk ← frequent members of Ck).
Example: {1 3 4 5} is deleted since its subset {1 4 5} is not in L3, giving L4 = {⟨{1 2 3 4}, s1234⟩}.

Rule Generation
Rule generation needs to check only the minimum confidence threshold: because rules are generated from frequent itemsets, they automatically satisfy the minimum support threshold.
Given a frequent itemset li, find all non-empty subsets f ⊂ li such that the rule f → (li − f) satisfies the minimum confidence requirement.
• If |li| = k, then there are 2^k − 2 candidate association rules.

Algorithm:
forall li, i ≥ 2 do
  call genrule(li, li)
endfor

genrule(lk, fm)
F ← {(m−1)-itemsets fm−1 | fm−1 ⊂ fm}
forall fm−1 ∈ F do
  conf ← sup(lk) / sup(fm−1)
  if (conf ≥ minconf) then
    print rule "fm−1 → (lk − fm−1)", conf, sup(lk)
    if (m−1 > 1) then call genrule(lk, fm−1) endif
  endif
endfor

If {A, B, C, D} is a frequent itemset, the candidate rules are:
{ABC}→{D}, {ABD}→{C}, {ACD}→{B}, {BCD}→{A}, {AB}→{CD}, {AC}→{BD}, {AD}→{BC}, {BC}→{AD}, {BD}→{AC}, {CD}→{AB}, {A}→{BCD}, {B}→{ACD}, {C}→{ABD}, {D}→{ABC}

In general, confidence does not have an anti-monotone property: c({ABC} → {D}) can be larger or smaller than c({AB} → {D}). But the confidence of rules generated from the same itemset does have an anti-monotone property: confidence is anti-monotone w.r.t. the number of items on the RHS of the rule. E.g., for L = {A, B, C, D}:
c({ABC} → {D}) ≥ c({AB} → {CD}) ≥ c({A} → {BCD})

Case Study
To find the association among the species of trees present in a forest: the problem is to find a set of association rules that indicate which species usually appear together, and also whether a set of species ensures the presence of another set of species with a minimum degree of confidence specified a priori.

Data Collection
A forest area is divided into a number of transects. A group of surveyors walks through each transect to identify the different species of trees and their numbers of occurrences.
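Before turning to the survey data, here is a minimal runnable sketch of the Apriori loop described above (join, prune, support counting), reusing the four-transaction market basket example; the minsup value is illustrative.

from itertools import combinations

def apriori(transactions, minsup):
    # returns {frequent itemset: support}
    n = len(transactions)
    support = lambda s: sum(s <= t for t in transactions) / n
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items if support(frozenset([i])) >= minsup}
    frequent = {s: support(s) for s in Lk}
    k = 2
    while Lk:
        # Join: unions of (k-1)-itemsets that yield a k-itemset
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: drop candidates having an infrequent (k-1)-subset
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Support counting
        Lk = {c for c in Ck if support(c) >= minsup}
        frequent.update({c: support(c) for c in Lk})
        k += 1
    return frequent

transactions = [
    {"Pen", "Ink", "Diary", "Writing Pad"},
    {"Pen", "Ink", "Diary"},
    {"Pen", "Diary"},
    {"Pen", "Ink", "Writing Pad"},
]
for itemset, sup in apriori(transactions, minsup=0.5).items():
    print(set(itemset), sup)

Rule generation then walks over each frequent itemset exactly as in the genrule() pseudocode above, using the stored supports to compute confidences.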
Data
[Table (layout reconstructed; the individual counts are not recoverable from the source): each row is a species (1 … 398), each column a transect (1 … 1008), and each cell the number of occurrences of that species in that transect.]

Converting the Data
[The same species × transect table, with each count replaced by 1 (present) or 0 (absent).]

Drawbacks
Support and confidence, as used by Apriori, allow many rules that are not necessarily interesting. There are two options for extracting interesting rules:
• using subjective knowledge
• using objective measures (measures better than confidence)

Subjective Approaches
• Visualization: users are allowed to interactively verify the discovered rules.
• Template-based approach: filter out rules that do not fit user-specified templates.
• Subjective interestingness measures: filter out rules that are obvious (bread → butter) and rules that are non-actionable (do not lead to profits).

Objective Measures
TID  A  B  C  D
 1   1  1  0  0
 2   0  0  1  0
 3   1  1  1  1
 4   1  0  0  0
 5   0  1  0  1
 6   1  1  0  0
 7   0  1  1  1
 8   1  0  1  1
 9   1  1  0  0
10   1  0  1  1

support(A) = 0.7, support(B) = 0.6, support(C) = 0.5, support(D) = 0.5
support(AB) = 0.4, support(CD) = 0.4, minsup = 0.3
With equal supports, how do we choose between AB and CD?

Dissociation
• The dissociation of an itemset is the percentage of transactions in which one or more, but not all, of its items are absent.
dissociation(AB) = 0.5, dissociation(CD) = 0.2
• Extract frequent itemsets from a set of transactions under high association but low dissociation.

Togetherness
Let Si be the subset of transactions containing item i. Then
SA ∩ SB = the subset of transactions containing both A and B,
SA ∪ SB = the subset of transactions containing either A or B, and
togetherness(AB) = |SA ∩ SB| / |SA ∪ SB|
Similar to minsup, a threshold min_togetherness can be defined to find frequent itemsets.

Objective Measures
Weka uses other objective measures:
• lift(A → B) = confidence(A → B) / support(B) = support(A ∪ B) / (support(A) × support(B))
• leverage(A → B) = support(A ∪ B) − support(A) × support(B)
• conviction(A → B) = support(A) × support(¬B) / support(A ∧ ¬B)
Conviction inverts the lift ratio and also accounts for the RHS not being true.
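A short sketch computing the measures above on the 10-transaction table; each transaction is represented as the set of items with value 1 in its row.

rows = [
    {"A","B"}, {"C"}, {"A","B","C","D"}, {"A"}, {"B","D"},
    {"A","B"}, {"B","C","D"}, {"A","C","D"}, {"A","B"}, {"A","C","D"},
]
n = len(rows)
sup = lambda s: sum(s <= t for t in rows) / n

def lift(lhs, rhs):
    return sup(lhs | rhs) / (sup(lhs) * sup(rhs))

def leverage(lhs, rhs):
    return sup(lhs | rhs) - sup(lhs) * sup(rhs)

def conviction(lhs, rhs):
    conf = sup(lhs | rhs) / sup(lhs)
    return float("inf") if conf == 1 else (1 - sup(rhs)) / (1 - conf)

def dissociation(itemset):
    # transactions containing some, but not all, items of the itemset
    return sum(bool(itemset & t) and not itemset <= t for t in rows) / n

def togetherness(itemset):
    both = sum(itemset <= t for t in rows)          # |S_A ∩ S_B|
    either = sum(bool(itemset & t) for t in rows)   # |S_A ∪ S_B|
    return both / either

print(lift({"A"}, {"B"}))          # 0.4 / (0.7 * 0.6) ≈ 0.952
print(leverage({"A"}, {"B"}))      # 0.4 - 0.42 = -0.02
print(conviction({"C"}, {"D"}))    # (1 - 0.5) / (1 - 0.8) = 2.5
print(dissociation({"C", "D"}))    # 0.2
print(togetherness({"A", "B"}))    # 4/9 ≈ 0.444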
Modifications of the Apriori Algorithm
To reduce computation time:
• hash-based techniques
• transaction reduction
• sampling
• dynamic itemset counting

Frequent Pattern Mining Variations
• type of value handled
• levels of abstraction
• number of data dimensions
• kinds of patterns to be mined
• completeness of patterns to be mined
• kind of rules to be mined

Type of Value Handled
Binary / Boolean:
• The absence of items helps in improving the discovery of association rules but does not directly contribute to rule mining.
Quantitative:
• In certain applications, the absence of items may sometimes be as important as their presence.
• In medical applications, it has been found that both the presence and the absence of symptoms need to be considered in discovering association rules.

Quantitative Association Rules
For numeric attributes like age, salary etc., binary association rule mining is not applicable. There are two basic approaches to the treatment of quantitative attributes:
• static discretisation
• dynamic discretisation

Static Discretisation
Quantitative attributes are discretised using predefined concept hierarchies: the original numeric values of the attribute are replaced by interval labels. For example, income may be replaced by intervals "0…10K", "11…20K" and so on.

Dynamic Discretisation
Quantitative attributes are discretised (clustered) into "bins" based on the distribution of the data. After verification against the minsup and minconf thresholds, rules such as the following may be obtained:
age(x, 5) → studies(x, "in school")
age(x, 6) → studies(x, "in school")
⁞
age(x, 17) → studies(x, "in school")
age(x, 18) → studies(x, "in school")

ARCS (Association Rule Clustering System), used for mining quantitative rules, may also be used for classification with rules of the form
Aquant1 ∧ Aquant2 ∧ … ∧ Aquantn → Acat
where Aquant1, Aquant2 etc. are tests on numeric attribute ranges and Acat is the class label assigned after the training step.

Using ARCS, a composite rule may be formed as
age(x, "5…18") → studies(x, "in school")
In a similar way, two-dimensional quantitative rules can also be formed:
age(x, "25…40") ∧ income(x, "20K…40K") → buys(x, "new car")

Levels of Abstraction
[Taxonomy figure: All → Pen, Writing Pad, Ink; Pen → Fountain, Dot; Writing Pad → Ruled, Blank; Ink → Bottle, Cartridge; with leaf-level brands such as Pilot, Parker, Link, Oxford, Pioneer.]

Multilevel Association Rules
Mining can use:
• uniform minimum support,
• reduced minimum support at lower levels, or
• group-based minimum support.

Rules over Taxonomies
The items used for rule mining may not all be at the same level; there can be an in-built taxonomy among the items. An example taxonomy for market basket data:
Clothes → Outerwear, Shirts
Outerwear → Track Suits, Track Pants
Footwear → Shoes, Sneakers
This taxonomy implies: Track Suits is-a Outerwear, Outerwear is-a Clothes, etc.

The application domain may need rules at different levels of the taxonomy.
Trivial rule: if Ŷ is an ancestor of Y, then the rule Y → Ŷ is trivial, e.g.
Shoes → Footwear (a rule with 100% confidence).

Rules across Levels
• The rule Outerwear → Sneakers does not imply either Track Suits → Sneakers or Track Pants → Sneakers. So a rule at a higher level does not imply the same rule at a lower level of the taxonomy.
• The rule Track Suits → Sneakers definitely implies the rule Outerwear → Sneakers. So a rule at a lower level definitely implies the same rule at a higher level of the taxonomy.

Interest Measure
• Find rules whose support is more than R times the expected value, or whose confidence is more than R times the expected value, for some user-specified constant R.

Rule (with Taxonomies) Generation Steps
1. Find the frequent itemsets (a counting sketch follows this list).
2. Use the frequent itemsets to generate the desired rules.
3. Prune all uninteresting rules from this set.
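For step 1, support must be counted over the taxonomy: a common approach in generalized rule mining is to extend each transaction with the ancestors of its items and then run ordinary frequent itemset mining (e.g. Apriori). A minimal sketch, using the clothes/footwear taxonomy from the slides:

parent = {            # child -> parent in the taxonomy
    "Track Suits": "Outerwear", "Track Pants": "Outerwear",
    "Outerwear": "Clothes", "Shirts": "Clothes",
    "Shoes": "Footwear", "Sneakers": "Footwear",
}

def ancestors(item):
    out = set()
    while item in parent:
        item = parent[item]
        out.add(item)
    return out

def extend(transaction):
    # add every ancestor of every purchased item
    return set(transaction) | {a for i in transaction for a in ancestors(i)}

t = {"Track Suits", "Sneakers"}
print(extend(t))  # {'Track Suits', 'Outerwear', 'Clothes', 'Sneakers', 'Footwear'}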
The Database
TID  Items
1    Shirts
2    Track Suits, Sneakers
3    Track Pants, Sneakers
4    Shoes
5    Shoes
6    Track Suits
minsup = 30%, minconf = 60%

Frequent Itemsets & Taxonomies
Itemset                  Support (out of 6)
{Track Suits}            2
{Outerwear}              3
{Clothes}                4
{Shoes}                  2
{Sneakers}               2
{Footwear}               4
{Outerwear, Sneakers}    2
{Clothes, Sneakers}      2
{Outerwear, Footwear}    2
{Clothes, Footwear}      2

Rules
Rule                   Sup%  Conf%
Outerwear → Sneakers   33    66
Outerwear → Footwear   33    66
Sneakers → Outerwear   33    100
Sneakers → Clothes     33    100

Rules under Item Constraints
Some applications may need association rules under user-specified constraints on items. When a taxonomy is present, these constraints may be specified using the taxonomy, as a Boolean expression such as
(Track Suits ∧ Shoes) ∨ (descendants(Clothes) ∧ ¬ancestors(Sneakers))
This allows rules that contain either both Track Suits and Shoes, or Clothes or any descendant of Clothes, and that do not contain Sneakers or its ancestor Footwear.

Exploiting the hierarchy does not stop the generation of association rules among items at the same level. These types of association rules are therefore called Generalized Association Rules.

Number of Data Dimensions
• Single dimension, discrete predicate:
buys(X, "Pen") → buys(X, "Ink")
• Multiple dimensions, discrete predicates:
age(X, "9..21") ∧ occupation(X, "Student") → buys(X, "Pen")
• Multiple occurrences of a predicate:
age(X, "9..21") ∧ occupation(X, "Student") ∧ buys(X, "Pen") → buys(X, "Ink")

Sequential Patterns
A sequential pattern always provides an order.
• In a market basket application, the interest is not in the set of items appearing in one transaction but in inter-transaction purchase patterns, so the transactions need to be ordered.

It is assumed that a customer can have only one transaction at a given transaction time.
• An itemset (I) is a non-empty set of items (ij): I = {i1 i2 … in}
• A sequence (s) is an ordered list of itemsets or events (ej): s = ⟨e1 e2 … em⟩, where ei occurs before ej for i < j.

A sequence is contained in another sequence if each itemset of the first sequence is contained in some itemset of the second sequence, in order. The sequence ⟨(3) (4 5) (8)⟩ is contained in ⟨(7) (3 8) (9) (4 5 6) (8)⟩ since (3) ⊆ (3 8), (4 5) ⊆ (4 5 6) and (8) ⊆ (8). The sequence ⟨(3) (5)⟩ is not contained in ⟨(3 5)⟩, and vice versa.

• In a set of sequences, a sequence s is maximal if it is not contained in any other sequence.
• For a sequence to be frequent, it must at least cross the minimum support threshold.
• A frequent sequence is called a sequential pattern.
• A sequential pattern of length l is called an l-pattern.
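A minimal sketch of the containment test just defined: sequence a is contained in sequence b if each itemset of a is a subset of some itemset of b, preserving order.

def contained(a, b):
    j = 0
    for itemset in a:
        # find the next itemset of b that covers this itemset of a
        while j < len(b) and not set(itemset) <= set(b[j]):
            j += 1
        if j == len(b):
            return False
        j += 1          # subsequent itemsets of a must match later in b
    return True

print(contained([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(contained([{3}, {5}], [{3, 5}]))                                    # False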
Discovery of Sequential Patterns
CustId  Date        Items
001     13/05/2012  30
001     14/05/2012  90
002     13/05/2012  10, 20
002     15/05/2012  30
002     16/05/2012  40, 60, 70
003     17/05/2012  30, 50, 70
004     13/05/2012  30
004     14/05/2012  40, 70
004     16/05/2012  90
005     13/05/2012  90

Supports of 1-sequences (out of 5 customers), with minsup = 25%:
{(10)}: 1, {(20)}: 1, {(30)}: 4, {(40)}: 2, {(50)}: 1, {(60)}: 1, {(70)}: 3, {(90)}: 3

• L1 = {⟨(30)⟩, ⟨(40)⟩, ⟨(70)⟩, ⟨(90)⟩}
• Candidate 2-sequences (4² ordered pairs plus the 2-itemsets):
C2 = {⟨(30) (30)⟩, ⟨(30) (40)⟩, ⟨(30) (70)⟩, ⟨(30) (90)⟩, …, ⟨(90) (90)⟩, ⟨(30 40)⟩, …, ⟨(70 90)⟩}
Observed supports of 2-sequences:
⟨(10 20)⟩: 1, ⟨(10) (30)⟩: 1, ⟨(20) (30)⟩: 1, ⟨(30) (40)⟩: 2, ⟨(30) (60)⟩: 1, ⟨(30) (70)⟩: 2, ⟨(30) (90)⟩: 2, ⟨(40) (90)⟩: 1, ⟨(70) (90)⟩: 1, ⟨(40 70)⟩: 2

• L2 = {⟨(30) (40)⟩, ⟨(30) (70)⟩, ⟨(30) (90)⟩, ⟨(40 70)⟩}
Candidate 3-sequences:
C3 = {⟨(30) (30) (70)⟩, ⟨(30) (30) (90)⟩, ⟨(30) (40 70)⟩, …, ⟨(40) (30) (70)⟩, ⟨(40) (30) (90)⟩, ⟨(40) (40 70)⟩, …}
Observed support: ⟨(30) (40 70)⟩: 2

Grouping the transactions into one sequence per customer:
1: ⟨(30) (90)⟩
2: ⟨(10 20) (30) (40 60 70)⟩
3: ⟨(30 50 70)⟩
4: ⟨(30) (40 70) (90)⟩
5: ⟨(90)⟩
If the minsup for maximal sequences is 0.25 (say), the acceptable sequential patterns are ⟨(30) (90)⟩ and ⟨(30) (40 70)⟩.

Specification of Time Windows
• The user may define a time window within which the patterns are to be discovered.
• If a pattern lacks adequate support within a single time window and crosses minsup only across different time windows, it is not considered a valid sequential pattern.
• This helps in studying seasonal purchase patterns in market basket analysis.

Sequential Patterns over Taxonomies
As in rule mining, the items under consideration may not all be at the same level of a taxonomy (Clothes → Outerwear, Shirts; Outerwear → Track Suits, Track Pants; Footwear → Shoes, Sneakers). If a sequential pattern ⟨(Track Suits) (Shoes)⟩ is found in the available transactions, it also supports patterns like ⟨(Outerwear) (Shoes)⟩, ⟨(Outerwear) (Footwear)⟩ etc. These are called generalized sequential patterns.
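A sketch of sequence support counting over the customer sequences above (support = fraction of customers whose sequence contains the candidate); the containment test is the same one sketched earlier, repeated here so the block runs on its own.

def contained(a, b):
    j = 0
    for itemset in a:
        while j < len(b) and not set(itemset) <= set(b[j]):
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

customer_sequences = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

def seq_support(candidate):
    hits = sum(contained(candidate, s) for s in customer_sequences)
    return hits / len(customer_sequences)

print(seq_support([{30}, {90}]))       # 0.4 (customers 1 and 4)
print(seq_support([{30}, {40, 70}]))   # 0.4 (customers 2 and 4)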
Data Classification
• Classification is a method by which the data instances in a problem domain are distributed among different pre-defined classes or concepts.
• Usually a data instance is placed in only one class.
• For the purpose of classification, definite criteria / rules are defined for membership of each class.

• Classification is usually done under the supervision of domain experts of the problem domain under consideration, so the classification process involves supervised learning.
• Clustering, on the other hand, is the result of unsupervised learning: the class or concept label of each data instance or cluster is not known. The number of such classes or concepts is pre-defined intuitively.

The classification process has two steps:
1. Build the model from a training data set: learning a mapping function y = f(X), where y is the associated class label for an instance X.
2. Classify unknown data.

Comparison of Classification Methods
Properties for the comparison:
• Predictive accuracy: the ability of a model to correctly predict the class label for a new data instance.
• Speed: the computation cost, in terms of time, required to generate the model (i.e. to train the classes) and then to classify data.
• Robustness: the ability of a model to make correct classifications under noisy data or data with missing values.
• Scalability: the response of a model in the training and classification steps as the data volume increases.

Classification by Decision Tree Induction
• A decision tree is a tree structure.
• Classification is done against a concept.
• The tree is formed by testing an attribute or attribute combination at each node.
• Each branch of the tree corresponds to an outcome of this test.
• The leaf nodes represent the classes.

Decision Tree (concept: Buy New Car)
INCOME
├─ ≤20K → MARITAL STATUS: Single → YES; Married → NO
├─ 20-50K → AGE: <40 → YES; >40 → NO
└─ >50K → YES

Decision Tree Induction Algorithm
1. The tree starts as a single node on which the training samples are tested.
2. If all the training samples are of the same class, the node becomes a leaf and is labeled with that class.
3. Running an attribute selection algorithm, an attribute is chosen for tree generation (the attribute INCOME in the example).
4. A branch is created for each value of the chosen attribute and the samples are partitioned accordingly (three branches under INCOME).
5. The algorithm repeats steps 3 and 4 recursively to form the decision tree for the samples at each partition. Once an attribute has been considered at a node, it is not considered at any of its descendant nodes.
6. The recursive procedure stops when
i. all samples at a node belong to the same class according to the domain expert;
ii. there is no other attribute on which the samples can be further partitioned (majority voting may be employed here to convert the node to a leaf labeled with the class that covers the majority of its samples);
iii. there are no tuples for a given branch.
(A compact sketch of this induction loop appears below, after the pruning discussion.)

Tree Pruning
Tree pruning is done to avoid overfitting the data at different nodes. Statistical measures are used to identify and remove branches that are not reliable enough. This results in faster classification and better classification of unknown data.
• Prepruning
• Postpruning

Prepruning
The tree generation process is stopped after every partitioning; all newly generated nodes become leaf nodes, with sample membership decided by majority voting. The goodness of the partitioning is then tested by measures like χ², information gain etc. If any result falls below a pre-specified threshold, further partitioning of the affected subset of samples is stopped.
• A high threshold would generate an over-simplified tree, while a low threshold may cause hardly any pruning.
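Returning to the induction algorithm above, here is a compact sketch using information gain for attribute selection over categorical attributes; the training samples are illustrative, not the slides' data.

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, attributes):
    if len(set(labels)) == 1:                       # step 2: pure node -> leaf
        return labels[0]
    if not attributes:                              # step 6(ii): majority voting
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                                    # step 3: information gain
        rem = 0.0
        for v in {r[a] for r in rows}:
            sub = [l for r, l in zip(rows, labels) if r[a] == v]
            rem += len(sub) / len(labels) * entropy(sub)
        return entropy(labels) - rem
    best = max(attributes, key=gain)
    tree = {}
    for v in {r[best] for r in rows}:               # steps 4-5: branch and recurse
        sub_rows = [r for r in rows if r[best] == v]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == v]
        tree[v] = build_tree(sub_rows, sub_labels, attributes - {best})
    return {best: tree}

rows = [                                            # hypothetical training samples
    {"income": "<=20K", "marital": "single",  "age": "<40"},
    {"income": "<=20K", "marital": "married", "age": ">40"},
    {"income": "20-50K", "marital": "single",  "age": "<40"},
    {"income": "20-50K", "marital": "married", "age": ">40"},
    {"income": ">50K",  "marital": "married", "age": "<40"},
]
labels = ["yes", "no", "yes", "no", "yes"]
print(build_tree(rows, labels, {"income", "marital", "age"}))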
Postpruning
• Branches are removed from a fully grown tree. The expected error rate at each non-leaf node is computed as if its sub-tree were pruned. This is compared with the combined error rates along each of its branches, weighted by the proportion of participating samples. If the expected error rate after pruning is lower, the subtree is removed.

Classification Rule Generation
Each path of a decision tree from the root to a leaf gives rise to an IF-THEN classification rule. From the decision tree in the example, rules may be formed such as:
IF income ≤ 20K AND marital-status = "married" THEN buys-new-car = "no"
IF income > 50K THEN buys-new-car = "yes"
etc.

Either during rule generation or during postpruning, redundant paths are pruned. For example, if the following rules are found:
IF income ≤ 20K AND marital-status = "married" THEN buys-new-car = "no"
IF income ≤ 20K AND marital-status = "widow" THEN buys-new-car = "no"
the two paths are pruned to one:
IF income ≤ 20K AND marital-status = ("married" OR "widow") THEN buys-new-car = "no"

Other well-known classification methods are Bayesian classification, classification by backpropagation, k-nearest neighbor classifiers etc.

Case Study: Dynamic Classification Hierarchy
Classification of archaeological data:
• A classification hierarchy is created over a back-end database to generate and update association rules. The classification hierarchy is continuously restructured as the database is updated.
• On the arrival of a new instance, the system tries to place it in the existing hierarchy. If it fails to classify the instance, it treats it as an Exception to the class found to be the closest (a small sketch of this loop follows the list below).
• The system initiates restructuring when the number of Exceptions exceeds a predefined threshold. Three important operations are used:
1. ADD: adds a new branch to the hierarchy.
2. FUSE: merges more than one class into one.
3. BREAK: decomposes a class into more than one class.
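A minimal sketch of the classify-or-exception behaviour described above. The leaf-matching criterion (leaf attribute set contained in the instance), the leaf classes and the threshold are illustrative assumptions, not the case study's actual implementation.

leaves = {                 # hypothetical leaf classes: name -> attribute set
    "C11": {"a0", "a1", "a2", "a3", "a4"},
    "C12": {"a0", "a1", "a2", "a5", "a6"},
    "C21": {"a0", "b0", "b1", "b2"},
}
exceptions = []
EXCEPTION_THRESHOLD = 3    # illustrative restructuring trigger

def classify(instance):
    for name, attrs in leaves.items():
        if attrs <= instance:              # exact match at leaf level
            return name
    exceptions.append(instance)            # closest-class bookkeeping omitted
    if len(exceptions) > EXCEPTION_THRESHOLD:
        print("restructure: consider ADD / FUSE / BREAK")
    return None

print(classify({"a0", "a1", "a2", "a3", "a4"}))              # 'C11'
print(classify({"a0", "a3", "a4", "b0", "b1", "b2", "b3"}))  # None -> exception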
Initial Transactions
• Universal attribute set: A = {a0, a1, a2, a3, a4, a5, a6, b0, b1, b2, b3, b4, b5, b6}
Transactions:
I1 = {a0, a1, a2, a3, a4}
I2 = {a0, a1, a2, a5, a6}
I3 = {a0, b0, b1, b2}
I4 = {a0, b0, b5, b6}
I5 = {a0, b0, b5, b6}

Initial Hierarchy
Exact match at leaf-level classes; 5 leaf classes:
C0 {a0}
├─ C1 {a1, a2}
│   ├─ C11 {a3, a4}
│   └─ C12 {a5, a6}
└─ C2 {b0}
    ├─ C21 {b1, b2}
    ├─ C22 {b3, b4}
    └─ C23 {b5, b6}

Add
I6 = {a0, a3, a4, b0, b1, b2, b3, b4} matches only approximately, up to an intermediate level (an exception). A new leaf C24 {b1, b2, b3, b4} is added under C2. A large number of exceptions may generate a new class.

Fuse
[Figures, reconstructed reading: before the fuse, C0 {a0} has peer children C1 {a1, a2} (with children C11 … C1n) and C2 {a1, a2, a3, a4} (with children C21 … C2m); after the fuse, the common attributes are factored out and C2 {a3, a4}, together with C11 … C1n, sits under C1 {a1, a2}.]
• The fuse of two peer classes K1 and K2 with K1ᴬ ⊆ K2ᴬ is not allowed if there exists any other peer class K3 with K3ᴬ ⊆ K2ᴬ.

Further Transactions
• Universal attribute set: A = {a0, a1, a2, a3, a4, a5, a6, b0, b1, b2, b3, b4, b5, b6}
Transactions:
I7 = {a0, a3, a4, b0, b1, b2, b3, b4}
I8 = {a0, a5, a6, b0, b1, b2, b3, b4}
I9 = {a0, a3, a5, b0, b1, b2, b3, b4}
I10 = {a0, a3, a5, b0, b1, b2, b5}
I11 = {a0, a3, b0, b1, b2, b3, b4}

Break
[Figure, reconstructed reading: C0 {a0} with C1 {a1, a2} (children C11 {a3, a4}, C12 {a5, a6}) and C2 {b0} (children C21 {b1, b2}, C22 {b3, b4}, C23 {b5, b6}, C24 {b1, b2, b3, b4}); C24 is decomposed into new subclasses C41 {a3, a4} and C42 {a5, a6}.]

Cluster Analysis
• The process of partitioning a set of data objects into groups of similar objects is called clustering. Objects belonging to the same cluster are supposed to be similar, whereas those in different clusters should be dissimilar under the same similarity measure.

A good clustering algorithm should have the following properties:
• scalability
• ability to handle different data types
• insensitivity to the order of input records
• working under minimum user intervention
• constraint-based clustering
• acceptance of high dimensionality

Clustering Algorithms
• Partitioning methods: given n objects or data instances, a partitioning method constructs k partitions, where k ≤ n. Each group/partition must have at least one object, and each object must belong to exactly one group (this may not hold for a fuzzy partitioning algorithm).

k-Means Algorithm, or a Centroid-based Technique
Accepts an input parameter k and partitions the n objects into k clusters such that intra-cluster similarity is high and inter-cluster similarity is low. Similarity is measured with respect to the mean value of the objects in a cluster, called the centroid of the cluster.

1. Arbitrarily choose k objects out of the n as the initial cluster centers.
2. Assign or reassign each object to the cluster to which it is most similar, with respect to the mean values.
3. Re-compute the cluster means.
4. Repeat steps 2 and 3 until there is no further change, or until an exit condition is met.

k-means is an iterative algorithm that works on the convergence of a squared-error criterion of the form
E = Σ(i=1..k) Σ(p∈Ci) |p − mi|²
where E is the sum of squared errors over all objects, p is a given object and mi is the centroid of cluster Ci.
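A short self-contained sketch of the k-means loop just described, for points in the plane; taking the first k points as initial centers is an illustrative simplification of step 1.

import math

def kmeans(points, k, max_iter=100):
    centers = points[:k]                          # step 1 (here: first k points)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                          # step 2: nearest-center assignment
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        new_centers = [                           # step 3: recompute the means
            tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:                # step 4: stop when stable
            break
        centers = new_centers
    return centers, clusters

points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(points, k=2)
print(centers)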
k-Medoids Algorithm
The k-means algorithm is sensitive to outliers: a very large value may distort the distribution of data among the clusters. To overcome this, a medoid is used instead of the mean as the reference point of a cluster. A medoid is the most centrally located object in a cluster.

1. Arbitrarily choose k objects out of the n as the initial medoids.
2. Assign each remaining object to the cluster with the nearest medoid.
3. Randomly select a non-medoid object, Orandom.
4. Compute the total cost S of swapping a medoid Oj with Orandom (the cost function computes the difference in squared-error value if the current medoid is replaced by the non-medoid object; the total cost of swapping is the sum of the costs incurred by all non-medoid objects).
5. If S < 0, swap Oj with Orandom to form the new set of k medoids.
6. Repeat steps 2 to 5 until there is no change.

To judge the quality of a replacement of Oj by Orandom, each non-medoid object p is examined under the following four cases:
• p belongs to the cluster of Oj, and after Oj is replaced by Orandom, p is closest to Oi (i ≠ j): reassign p to Oi.
• p belongs to the cluster of Oj, and after Oj is replaced by Orandom, p is closest to Orandom: reassign p to Orandom.
• p belongs to the cluster of Oi (i ≠ j), and after Oj is replaced by Orandom, p is still closest to Oi: the assignment of p does not change.
• p belongs to the cluster of Oi (i ≠ j), and after Oj is replaced by Orandom, p is closest to Orandom: reassign p to Orandom.

Parallel Association Rule Mining Algorithms
Challenges include:
• synchronization and communication minimization
• disk I/O minimization
• workload balancing

Strategies:
• Distributed vs. shared-memory architecture: shared memory needs more synchronization (locking etc.), whereas for distributed memory, message passing incurs a higher communication overhead.
• Data vs. task parallelism.
• Static vs. dynamic parallelism.

Sources & References
1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", 2007.
2. Willi Klosgen and Jan M. Zytkow, "Handbook of Data Mining and Knowledge Discovery", 2002.
3. R. Srikant, "Fast Algorithms for Mining Association Rules and Sequential Patterns", Ph.D. Thesis, University of Wisconsin-Madison, 1996.
4. R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases," Proc. ACM SIGMOD, pp. 207-216, 1993.
5. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proc. International Conference on Very Large Databases, 1994.
6. J. S. Park, M. S. Chen and P. S. Yu, "An effective hash based algorithm for mining association rules," Proc. ACM SIGMOD, 1995.
7. R. Srikant, Q. Vu and R. Agrawal, "Mining association rules with item constraints," Proc. International Conference on Knowledge Discovery in Databases, 1997.
8. K. Ali, S. Manganaris and R. Srikant, "Partial classification using association rules," Proc. International Conference on Knowledge Discovery in Databases, 1997.
9. S. Pal and A. Bagchi, "Association against Dissociation: some pragmatic considerations for Frequent Itemset generation under Fixed and Variable Thresholds," ACM SIGKDD Explorations, Vol. 7, Issue 2, Dec. 2005, pp. 151-159.
10. S. Ray and A. Bagchi, "Rule Generation by Boolean Minimization – Experience with Coronary Bifurcation Stenting in Angioplasty," ReTIS 2006.
11. S. Maitra and A. Bagchi, "Dynamic restructuring of classification hierarchy towards data mining," Proc. International Conference on Management of Data, 1998.
12. T. G. Dietterich and R. S. Michalski, "Discovering patterns in sequences of events," Artificial Intelligence, vol. 25, pp. 187-232, 1985.
13. R. Agrawal and R. Srikant, "Mining sequential patterns," Proc. IEEE International Conference on Data Engineering, 1995.
14. R. Srikant and R. Agrawal, "Mining sequential patterns: generalizations and performance improvements," Proc. International Conference on Extending Database Technology, 1996.
15. M. J. Zaki, "Parallel & distributed association mining: a survey," IEEE Concurrency, 7(4), pp. 14-25, 1999.

Research Challenges
Areas:
• Query languages
• Architecture
• Text mining
• Multimedia mining
• Spatial / temporal analysis
• Graph mining

THANK YOU