Introduction
Let us try to understand the technology in a business context. Like all decision support systems, data mining delivers information. Decision support progresses to data mining.

Results of Data Mining
1. Forecasting what may happen in the future
2. Classifying people or things into groups by recognizing patterns
3. Clustering people or things into groups based on their attributes
4. Associating what events are likely to occur together
5. Sequencing what events are likely to lead to later events

Data Mining Is Not
• Brute-force crunching of bulk data
• "Blind" application of algorithms
• Going to find relationships where none exist
• Presenting data in different ways
• A database-intensive task
• A difficult-to-understand technology requiring an advanced degree in computer science

What is Data Mining?
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Valid: the patterns hold in general.
Novel: we did not know the pattern beforehand.
Useful: we can devise actions from the patterns.
Understandable: we can interpret and comprehend the patterns.

Why Use Data Mining Today?
Competitive pressure! "The secret of success is to know something that nobody else knows." (Aristotle Onassis)
Human analysis skills are inadequate: the volume and dimensionality of the data, and the high data growth rate.
• Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
• Personalization, CRM
• The real-time enterprise
• "Systemic listening"
• Security, homeland defense

Data Mining Features
1. It is a knowledge discovery process.
2. Data mining helps to understand the substance of the data in a special, unsuspected way.
3. It unearths patterns and trends in the raw data you never knew existed.
4. Data mining centres around the automated discovery of new facts and relationships in data.
5. Data mining tools enable you to uncover hidden information.
6. The assumption is that more useful knowledge lies hidden beneath the surface.

Examples of What People Are Doing with Data Mining (Applications)
• Fraud/non-compliance: anomaly detection; isolate the factors that lead to fraud, waste and abuse; target auditing and investigative efforts more effectively.
• Credit/risk scoring, intrusion detection, parts failure prediction.
• Recruiting/attracting customers; maximizing profitability (cross-selling, identifying profitable customers).
• Service delivery and customer retention: build profiles of customers likely to use which services.

Data Mining versus OLAP
OLAP (On-Line Analytical Processing) provides a very good view of what is happening, but it cannot predict what will happen in the future or why it is happening.

The Knowledge Discovery Process
Data mining is part of the process of knowledge discovery: it discovers knowledge or information that you never knew was present in your data. The steps are:
1. Identify the business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes

Data Mining Steps in Detail
1. Data preprocessing
   • Data selection: identify target datasets and relevant fields
   • Data cleaning: remove noise and outliers
   • Data transformation: create common units, generate new fields
2. Data mining model construction
3. Model evaluation

What is a Data Mining Model?
A data mining model is a description of a specific aspect of a dataset. It produces output values for an assigned set of input values. Examples: linear regression model, classification model, clustering.

Data Mining Models (contd.)
A data mining model can be described at two levels:
• Functional level: describes the model in terms of its intended usage. Examples: classification, clustering.
• Representational level: a specific representation of a model. Examples: log-linear model, classification tree, nearest-neighbour method.

Association Rule Mining (or Market Basket Analysis)

Introduction
As noted earlier, a huge amount of data is stored electronically in many retail outlets due to barcoding of goods sold. It is good business sense to try to find some useful information from these mountains of data. A conceptually simple yet interesting technique is to find association rules from these large databases.

Association rule mining (or market basket analysis) searches for interesting customer habits by looking at associations. The classical example is the one where a store in the USA was reported to have discovered that people buying bread tend also to buy cheese (it is not certain whether this is actually true). There are applications in marketing, store layout, customer segmentation, medicine, finance, and many more.
Question: suppose you are able to find that two products x and y (say bread and milk, or crackers and cheese) are frequently bought together. How can you use that information?

A Simple Example
Consider the following ten transactions of a shop selling only nine items. We will return to it after explaining some terminology.

Transaction ID  Items
10    Bread, Cheese, Newspaper
20    Bread, Cheese, Juice
30    Bread, Milk
40    Cheese, Juice, Milk, Coffee
50    Sugar, Tea, Coffee, Biscuits, Newspaper
60    Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70    Bread, Cheese
80    Bread, Cheese, Juice, Coffee
90    Bread, Milk
100   Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper

Terminology
Let the number of different items sold be n and the number of transactions be N. Let the set of items be {i1, i2, …, in}; the number of items may be large, perhaps several thousand. Let the set of transactions be {t1, t2, …, tN}. Each transaction ti contains a subset of items from the itemset {i1, i2, …, in}; these are the things a customer buys when they visit the supermarket.

We want to find groups of items that tend to occur together frequently. Association rules are often written as X → Y, meaning that whenever X appears Y also tends to appear. X and Y may be single items or sets of items, but the same item does not appear in both.

Suppose X and Y appear together in only 10% of the transactions, but whenever X appears there is an 80% chance that Y also appears.
The 10% presence of X and Y together is called the support (or prevalence) of the rule, and 80% is called the confidence (or predictability) of the rule. These are measures of the interestingness of the rule.
• Confidence denotes the strength of the association between X and Y.
• Support indicates the frequency of the pattern. A minimum support is necessary if an association is going to be of some business value.
• If the chance of finding an item X in the N transactions is x%, then the probability of X is P(X) = x/100, since probability values are always between 0.0 and 1.0.
Question: why are a minimum support and a minimum confidence necessary?

• Now suppose we have two items X and Y with probabilities P(X) = 0.2 and P(Y) = 0.1. What does the product P(X) × P(Y) mean?
• What is likely to be the chance of both items X and Y appearing together, that is, P(X and Y) = P(X ∪ Y)?
• Suppose we know P(X ∪ Y); then what is the probability of Y appearing if we know that X already exists? It is written as P(Y|X).
• The support for X → Y is the probability of both X and Y appearing together, that is, P(X ∪ Y).
• The confidence of X → Y is the conditional probability of Y appearing given that X exists. It is written as P(Y|X) and read as "P of Y given X".

The Task and the Naïve Algorithm
We want to find all associations that have at least p% support with at least q% confidence, such that
• all rules satisfying any user constraints are found,
• the rules are found efficiently from large databases, and
• the rules found are actionable.

Applications
Although we are only considering basket analysis, the technique has many other applications, as noted earlier. For example, we may have many patients coming to a hospital with various diseases and from various backgrounds. We therefore have a number of attributes (items), and we may be interested in knowing which attributes appear together and perhaps which attributes are frequently associated with some particular disease.

Example
We have 9 items and 10 transactions (the table above). Find the support and confidence for each pair. How many pairs are there?
There are 9 × 8 = 72 ordered pairs, half of them duplicates, leaving 36. That is too many to analyse in a class, so we use an even simpler example with n = 4 items (Bread, Cheese, Juice, Milk) and N = 4 transactions. We want to find rules with a minimum support of 50% and a minimum confidence of 75%.

Transaction ID  Items
100   Bread, Cheese
200   Bread, Cheese, Juice
300   Bread, Milk
400   Cheese, Juice, Milk
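On a dataset this small the counting can be checked by brute force: enumerate every possible itemset and count the transactions that contain it. The sketch below (plain Python; the variable names are ours, not from the slides) produces exactly the frequency table that follows.

from itertools import combinations

transactions = {
    100: {"Bread", "Cheese"},
    200: {"Bread", "Cheese", "Juice"},
    300: {"Bread", "Milk"},
    400: {"Cheese", "Juice", "Milk"},
}
items = sorted(set().union(*transactions.values()))
N = len(transactions)

# For every possible itemset, count how many transactions contain it.
for k in range(1, len(items) + 1):
    for itemset in combinations(items, k):
        frequency = sum(1 for t in transactions.values() if set(itemset) <= t)
        print(itemset, "frequency =", frequency, "support =", f"{frequency / N:.0%}")

With four items this loop visits only 15 itemsets; with n items it would visit 2^n − 1, which is exactly the scalability problem discussed later.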
Example
Counting over the N = 4 transactions gives the following frequencies. All four items have the 50% support we require, only two pairs have 50% support, and no 3-itemset has the support. We will look at the two pairs that have the minimum support to find whether they have the required confidence.

Itemset  Frequency
Bread  3
Cheese  3
Juice  2
Milk  2
(Bread, Cheese)  2
(Bread, Juice)  1
(Bread, Milk)  1
(Cheese, Juice)  2
(Cheese, Milk)  1
(Juice, Milk)  1
(Bread, Cheese, Juice)  1
(Bread, Cheese, Milk)  0
(Bread, Juice, Milk)  0
(Cheese, Juice, Milk)  1
(Bread, Cheese, Juice, Milk)  0

Terminology
Items or itemsets that have the minimum support are called frequent. In our example, all four items and two pairs are frequent. We will now determine whether the two pairs {Bread, Cheese} and {Cheese, Juice} lead to association rules with 75% confidence. Every pair {A, B} can lead to two rules, A → B and B → A, if both satisfy the minimum confidence. The confidence of A → B is given by the support for A and B together divided by the support of A.

Rules
We have four possible rules and their confidences are as follows:
Bread → Cheese with confidence of 2/3 = 67%
Cheese → Bread with confidence of 2/3 = 67%
Cheese → Juice with confidence of 2/3 = 67%
Juice → Cheese with confidence of 2/2 = 100%
Therefore only the last rule, Juice → Cheese, has confidence above the minimum 75% and qualifies. Rules that have more than the user-specified minimum confidence are called confident.
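The same arithmetic can be checked mechanically from the support counts; a minimal sketch, where the supports dictionary simply restates the counts from the frequency table above:

supports = {
    frozenset({"Bread"}): 3, frozenset({"Cheese"}): 3, frozenset({"Juice"}): 2,
    frozenset({"Bread", "Cheese"}): 2, frozenset({"Cheese", "Juice"}): 2,
}

def confidence(lhs, rhs):
    # confidence(LHS -> RHS) = support(LHS and RHS together) / support(LHS)
    return supports[frozenset(lhs) | frozenset(rhs)] / supports[frozenset(lhs)]

for lhs, rhs in [({"Bread"}, {"Cheese"}), ({"Cheese"}, {"Bread"}),
                 ({"Cheese"}, {"Juice"}), ({"Juice"}, {"Cheese"})]:
    print(lhs, "->", rhs, f"{confidence(lhs, rhs):.0%}")   # 67%, 67%, 67%, 100%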
Problems with the Naïve Algorithm
This simple algorithm works well with four items, since there were only 16 combinations that we needed to look at, but if the number of items is, say, 100, the number of combinations is much larger, in the billions. The number of combinations becomes about a million with 20 items, since the number of combinations is 2^n with n items (why?). The naïve algorithm can be improved to deal more effectively with larger data sets.

Improved Naïve Algorithm
Rather than counting all possible item combinations, we can look at each transaction and count only the combinations that actually occur, as shown below (that is, we do not count itemsets with zero frequency).

Transaction ID  Items  Combinations
100  Bread, Cheese  {Bread, Cheese}
200  Bread, Cheese, Juice  {Bread, Cheese}, {Bread, Juice}, {Cheese, Juice}, {Bread, Cheese, Juice}
300  Bread, Milk  {Bread, Milk}
400  Cheese, Juice, Milk  {Cheese, Juice}, {Cheese, Milk}, {Juice, Milk}, {Cheese, Juice, Milk}

Removing the zero frequencies leads to the smaller table below. We then proceed as before, but the improvement is not large and we need better techniques.

Itemset  Frequency
Bread  3
Cheese  3
Juice  2
Milk  2
(Bread, Cheese)  2
(Bread, Juice)  1
(Bread, Milk)  1
(Cheese, Juice)  2
(Cheese, Milk)  1
(Juice, Milk)  1
(Bread, Cheese, Juice)  1
(Cheese, Juice, Milk)  1

The Classical Apriori Algorithm
The basic algorithm to find association rules was proposed in 1993, and an improved algorithm followed in 1994. This classical Apriori algorithm may be simply described by a two-step approach:
Step 1: discover all frequent (single) items that have support above the minimum support required.
Step 2: use the set of frequent items to generate the association rules that have a high enough confidence level.
A more formal description follows.

Apriori Algorithm Terminology
A k-itemset is a set of k items (e.g. for k = 10, the items 1, 2, 3, …, 10).
The set Ck is the set of candidate k-itemsets that are potentially frequent (e.g. {1, 2, 3}, {2, 3, 4}).
The set Lk is the subset of Ck consisting of the k-itemsets that are frequent (e.g. {1, 2}).

Step-by-step Apriori Algorithm
Step 1 (computing L1): scan all transactions and find all frequent items that have support above the required p%. Let these frequent items be labelled L1.
Step 2 (apriori-gen function): use the frequent items in L1 to build all possible item pairs, like {Bread, Cheese} if Bread and Cheese are in L1. The set of these item pairs is called C2, the candidate set.
Question: do all members of C2 have the required support? Why?
Step 3 (pruning): scan all transactions and find all pairs in the candidate pair set C2 that are frequent. Let these frequent pairs be L2.
Step 4 (general rule): a generalization of Step 2. Build the candidate set of k-itemsets, Ck, by combining frequent itemsets in the set Lk-1.
Step 5 (pruning): a generalization of Step 3. Scan all transactions and find all itemsets in Ck that are frequent. Let these frequent itemsets be Lk.
Step 6 (continue): continue with Step 4 unless Lk is empty.
Step 7 (stop): stop when Lk is empty.

An Example
This example is similar to the last example, but we have added two more items and another transaction, making five transactions and six items. We want to find association rules with 50% support and 75% confidence.

Transaction ID  Items
100  Bread, Cheese, Eggs, Juice
200  Bread, Cheese, Juice
300  Bread, Milk, Yogurt
400  Bread, Juice, Milk
500  Cheese, Juice, Milk

First find L1. 50% support requires that each frequent item appears in at least three of the five transactions. Therefore L1 is given by:

Item  Frequency
Bread  4
Cheese  3
Juice  4
Milk  3

The candidate 2-itemsets, C2, therefore contain six pairs. These pairs and their frequencies are:

Item pair  Frequency
(Bread, Cheese)  2
(Bread, Juice)  3
(Bread, Milk)  2
(Cheese, Juice)  3
(Cheese, Milk)  1
(Juice, Milk)  2

Question: why did we not select (Eggs, Milk)?

Deriving Rules
L2 has only two frequent item pairs, {Bread, Juice} and {Cheese, Juice}. After these two frequent pairs, there are no candidate 3-itemsets (why?), since we do not have two 2-itemsets that share the same first item (why?). The two frequent pairs lead to the following possible rules:
Bread → Juice, Juice → Bread, Cheese → Juice, Juice → Cheese.
The confidence of these rules is obtained by dividing the support for both items in the rule by the support of the item on the left-hand side of the rule. The confidences of the four rules therefore are:
Bread → Juice: 3/4 = 75%
Juice → Bread: 3/4 = 75%
Cheese → Juice: 3/3 = 100%
Juice → Cheese: 3/4 = 75%
Since all of them have at least the minimum 75% confidence, they all qualify. We are done, since there are no 3-itemsets.
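The steps above can be put together into a compact program. The sketch below is one possible reading of the algorithm, not code from the slides: the candidate-generation step simply joins frequent (k−1)-itemsets whose union has k items and prunes any candidate with an infrequent subset. Run on the five-transaction example it reproduces L1 = {Bread, Cheese, Juice, Milk} and L2 = {{Bread, Juice}, {Cheese, Juice}}.

from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its frequency count."""
    N = len(transactions)
    min_count = min_support * N
    # Step 1: frequent 1-itemsets (L1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 2
    while frequent:
        # Steps 2 and 4 (apriori-gen): join L(k-1) with itself to build Ck ...
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                # ... keeping a candidate only if every (k-1)-subset is frequent.
                if len(union) == k and all(frozenset(s) in frequent
                                           for s in combinations(union, k - 1)):
                    candidates.add(union)
        # Steps 3 and 5 (pruning): scan the transactions and keep the frequent candidates (Lk).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1   # Steps 6 and 7: repeat until Lk is empty.
    return result

transactions = [
    {"Bread", "Cheese", "Eggs", "Juice"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk", "Yogurt"},
    {"Bread", "Juice", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
for itemset, count in sorted(apriori(transactions, 0.5).items(), key=lambda kv: len(kv[0])):
    print(set(itemset), count)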
Questions
1. How do we compute C3 from L2, or more generally Ci from Li-1?
2. If there were several members of the set C3, how would we derive L3 by pruning C3?
3. Why did we not consider (Bread, Cheese, Juice) in the last example?
4. Assume that the 3-itemset {A, B, M} has the minimum support. What do we do next? How do we derive association rules?

Answers
To compute C3 from L2, or more generally Ci from Li-1, we join members of Li-1 to other members of Li-1 by first sorting the itemsets in lexicographic order and then joining those itemsets in which the first (i−2) items are common. Observe that if an itemset in C3 is (a, b, c), then L2 must have had the itemsets (a, b), (b, c) and (a, c), since all subsets of an itemset in C3 must be frequent. Why? To repeat, itemsets in a candidate set Ci or a frequent set Li will be frequent only if every subset is frequent.

For example, {A, B, M} will be frequent only if {A, B}, {A, M} and {B, M} are frequent, which in turn requires each of A, B and M also to be frequent. We did not consider (Bread, Cheese, Juice) because the pairs (Bread, Cheese) and (Bread, Juice) would both have had to be in L2, but (Bread, Cheese) was not, so the three-itemset could not be frequent.

Example
If the set {A, B, M} has the minimum support, we can find association rules that satisfy the conditions of support and confidence by first generating all nonempty proper subsets of {A, B, M} and using each of them on the left-hand side, with the remaining symbols on the right-hand side. The subsets of {A, B, M} are A, B, M, AB, AM and BM. Therefore the possible rules involving all three items are A → BM, B → AM, M → AB, AB → M, AM → B and BM → A. Now we need to test their confidence.

To test the confidence of the possible rules, we proceed as we have done before. We know that confidence(A → B) = P(B|A) = P(A ∪ B)/P(A). This confidence is the likelihood of finding B if A has already been found in a transaction: the ratio of the support for A and B together to the support for A by itself. The confidence of all these rules can thus be computed.
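A short sketch of this rule-generation step: take one frequent itemset, enumerate every nonempty proper subset as a left-hand side, and keep the rules that reach the minimum confidence. The support counts in the example dictionary below are made up purely for illustration; they are not from the slides.

from itertools import combinations

def rules_from_itemset(itemset, support, min_confidence):
    """Generate the confident rules LHS -> RHS from one frequent itemset.

    `support` maps frozensets to support counts;
    confidence(LHS -> RHS) = support(whole itemset) / support(LHS)."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            conf = support[itemset] / support[lhs]
            if conf >= min_confidence:
                rules.append((set(lhs), set(itemset - lhs), conf))
    return rules

# Hypothetical counts for {A, B, M} and its subsets, for illustration only.
support = {
    frozenset("A"): 6, frozenset("B"): 5, frozenset("M"): 4,
    frozenset("AB"): 4, frozenset("AM"): 3, frozenset("BM"): 3,
    frozenset("ABM"): 3,
}
for lhs, rhs, conf in rules_from_itemset("ABM", support, 0.75):
    print(lhs, "->", rhs, f"confidence = {conf:.0%}")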
Efficiency
Consider the example of a supermarket database which might have several thousand items, including 1000 frequent items, and several million transactions. Which part of the Apriori algorithm will be the most expensive to compute? Why?

The algorithm that constructs the candidate set Ci to find the frequent set Li is crucial to the performance of the Apriori algorithm. The larger the candidate set, the higher the processing cost of discovering the frequent itemsets, since the transactions must be scanned for each candidate. Given that the numbers of early candidate itemsets are very large, the initial iterations dominate the cost. In a supermarket database with about 1000 frequent items, there will be almost half a million candidate pairs in C2 that need to be searched for to find L2.

Generally the number of frequent pairs out of half a million pairs will be small, and therefore (why?) the number of 3-itemsets should be small. It is therefore the generation of the frequent set L2 that is the key to improving the performance of the Apriori algorithm.

Classification
Classification is a classical problem in statistics and machine learning: the separation of objects, or the ordering of objects, into classes. Usually a set of classes is pre-defined. A set of training samples is used for building class models, and each new object is then assigned to one of the classes using the model.
Typical applications: credit approval, treatment effectiveness analysis.

A Priori Classification
Classification based on training samples whose class labels are known is called supervised learning. New data is classified based on the model developed during training. Each object has a number of attributes. One special attribute is the class label, called the output or dependent attribute; its value is known for the training samples but not for the others. The other attributes are independent attributes. Attributes may be numerical (e.g. income) or categorical (ordered, e.g. course grade, or unordered, e.g. gender, job title, race). The class label is categorical.

A Posteriori Classification
No training data with known class labels is available, and the classes may not be known a priori. Usually clustering is used in such situations, although some unsupervised classification techniques do exist.

Classification is a prediction technique: given the values of the independent attributes, the class can be predicted. Regression is another prediction technique. Classification is used to predict categorical values, while regression is used to predict continuous or ordered values.

Two Phases
I. Model construction phase:
1. Each training sample belongs to a predefined class given by the class label attribute.
2. A model is built.
II. Usage and testing phase:
1. Each test object (not used for training) is classified.
2. The accuracy of the model is estimated: if the class label of some test samples is known, it is compared with the result from the model, and the accuracy rate is the percentage of test samples correctly classified by the model.

Quality Evaluation
There are many quality measures. Han and Kamber list these:
• Predictive accuracy
• Efficiency, or speed to construct the model and to use it
• Robustness in handling noise and missing values
• Scalability: efficiency in handling large, disk-resident data
• Interpretability: understanding and insight provided by the model
• Goodness of rules (size, compactness of the model)

Decision Tree
1. A decision tree is an approach like that needed to support the game of twenty questions: each internal node of the tree denotes a question, or a test on the value of an independent attribute, each branch represents an outcome of the test, and each leaf represents a class.
2. Assume that each object has a number of independent attributes and a dependent attribute.

Example (from Han and Kamber)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

A Decision Tree (from Han and Kamber)
The root tests age. If age <= 30, the tree then tests student (no → buys_computer = no, yes → buys_computer = yes). If age is 31…40, the class is buys_computer = yes. If age > 40, the tree tests credit_rating (excellent → no, fair → yes).
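One simple way to hold such a tree in code and use it to classify a new object is a nested structure in which each internal node carries the attribute to test and one branch per value. The representation below is our illustration, not something defined in the slides.

# The buys_computer tree above: internal nodes are (attribute, branches) pairs, leaves are class labels.
tree = ("age", {
    "<=30":  ("student", {"no": "no", "yes": "yes"}),
    "31..40": "yes",
    ">40":   ("credit_rating", {"excellent": "no", "fair": "yes"}),
})

def classify(node, obj):
    """Starting at the root, follow the branch matching the object's attribute value until a leaf."""
    while isinstance(node, tuple):
        attribute, branches = node
        node = branches[obj[attribute]]
    return node

print(classify(tree, {"age": "<=30", "student": "yes", "credit_rating": "fair"}))  # -> yes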
Decision Tree
To classify an object, the appropriate attribute value is used at each node, starting from the root, to determine the branch taken. The path found by the tests at each node leads to a leaf node, which is the class the model believes the object belongs to. The decision tree is an attractive technique since the results are easy to understand, and the rules can often be expressed in natural language, e.g. "if the student has GPA > 3.0 and class attendance > 90% then the student is likely to get a D".

Basic Algorithm
1. The training data is S. Discretise all continuous-valued attributes. Let the root node contain S.
2. If all objects S in the root node belong to the same class, then stop.
3. Split the next leaf node by selecting an attribute A from amongst the independent attributes that best divides or splits the objects in the node into subsets, and create a decision tree node.
4. Split the node according to the values of A.
5. Stop if either of the following conditions is met, otherwise continue with step 3:
   • the data in each subset belongs to a single class;
   • there are no remaining attributes on which the sample may be further divided.

Building a Decision Tree
The aim is to build a decision tree consisting of a root node, a number of internal nodes, and a number of leaf nodes. Building the tree starts with the root node, then splits the data into two or more child nodes, splits those into lower-level nodes, and so on until the process is complete. The method uses induction based on the training data. We illustrate it using a simple example.

An Example (from the text)
The first five attributes are symptoms and the last attribute is the diagnosis. All attributes are categorical. We wish to predict the diagnosis class.

Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
Yes          Yes    Yes             Yes         Yes       Strep throat
No           No     No              Yes         Yes       Allergy
Yes          Yes    No              Yes         No        Cold
Yes          No     Yes             No          No        Strep throat
No           Yes    No              Yes         No        Cold
No           No     No              Yes         No        Allergy
No           No     Yes             No          No        Strep throat
Yes          No     No              Yes         Yes       Allergy
No           Yes    No              Yes         Yes       Cold
Yes          Yes    No              Yes         Yes       Cold

Consider each of the attributes in turn to see which would be a "good" one to start with.
• Sore Throat: No → 2 Allergy, 2 Cold, 1 Strep throat; Yes → 2 Strep throat, 2 Cold, 1 Allergy. Sore throat does not predict the diagnosis.
• Fever: No → 3 Allergy, 2 Strep throat; Yes → 4 Cold, 1 Strep throat. Fever is better, but not perfect.
• Swollen Glands: No → 3 Allergy, 4 Cold; Yes → 3 Strep throat. Good: Swollen Glands = Yes means Strep throat.
• Congestion: No → 2 Strep throat; Yes → 3 Allergy, 4 Cold, 1 Strep throat. Not helpful.
• Headache: No → 2 Cold, 2 Strep throat, 1 Allergy; Yes → 2 Allergy, 2 Cold, 1 Strep throat. Not helpful.

Decision Tree
Continuing the process one more step finds Fever as the next split attribute, and the final result is: Swollen Glands = Yes → Diagnosis = Strep throat; Swollen Glands = No and Fever = Yes → Diagnosis = Cold; Swollen Glands = No and Fever = No → Diagnosis = Allergy.
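The by-hand comparison of candidate split attributes can be automated by tabulating, for each attribute, how the diagnoses distribute over its values; the attribute whose values give the purest groups (here Swollen Glands) is the best first split. A small sketch over the table above (real tree-building algorithms score the same tabulation with a measure such as information gain):

from collections import Counter

# The ten training rows from the table above:
# (sore_throat, fever, swollen_glands, congestion, headache, diagnosis)
rows = [
    ("Yes", "Yes", "Yes", "Yes", "Yes", "Strep throat"),
    ("No",  "No",  "No",  "Yes", "Yes", "Allergy"),
    ("Yes", "Yes", "No",  "Yes", "No",  "Cold"),
    ("Yes", "No",  "Yes", "No",  "No",  "Strep throat"),
    ("No",  "Yes", "No",  "Yes", "No",  "Cold"),
    ("No",  "No",  "No",  "Yes", "No",  "Allergy"),
    ("No",  "No",  "Yes", "No",  "No",  "Strep throat"),
    ("Yes", "No",  "No",  "Yes", "Yes", "Allergy"),
    ("No",  "Yes", "No",  "Yes", "Yes", "Cold"),
    ("Yes", "Yes", "No",  "Yes", "Yes", "Cold"),
]
attributes = ["Sore Throat", "Fever", "Swollen Glands", "Congestion", "Headache"]

for i, name in enumerate(attributes):
    # How do the diagnosis classes distribute over each value of this attribute?
    by_value = {}
    for row in rows:
        by_value.setdefault(row[i], Counter())[row[-1]] += 1
    print(name, dict(by_value))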
An Example
This approach does not work if there are many attributes and a large training set. We need an algorithm to select the attribute that best discriminates among the target classes as the split attribute. The tree continues to grow until it is no longer possible to find better ways to split the objects.

What is Clustering?
Clustering is the process of grouping data into classes, or clusters, so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters.

Cluster Analysis Terminology
1. Unsupervised classification.
2. Aims to decompose or partition a data set into clusters.
3. Inter-cluster (between-groups) distance is maximized and intra-cluster (within-group) distance is minimized.
4. It is different from classification, as there are no predefined classes and no training data, and we may not even know how many clusters to expect.

Cluster Analysis Features
• Not a new field; it is a branch of statistics.
• Many old algorithms are available.
• Renewed interest due to data mining and machine learning.
• New algorithms are being developed, in particular to deal with large amounts of data.
• Evaluating a clustering method is difficult.

Desired Features of Cluster Analysis
1. Scalability
2. Only one scan of the dataset
3. Ability to stop and resume
4. Minimal input parameters
5. Robustness
6. Ability to discover different cluster shapes
7. Different data types
8. Result independent of data input order

Questions
• How do we define a cluster?
• How do we find whether objects are similar or not?
• How do we define some concept of distance between individual objects and sets of objects?

Types of Data
• Numerical data: continuous variables on a roughly linear scale, e.g. marks, age, weight. Units can affect the analysis: a distance in metres is obviously a larger number than the same distance in kilometres, so we may need to scale or normalise the data.
• Binary data: many variables are binary, e.g. gender, married or not, undergraduate or postgraduate.
• Nominal data: similar to binary but can take more than two states, e.g. colour, staff position.
• Ordinal data: similar to nominal but the different values are ordered in a meaningful sequence.
• Ratio-scaled data: nonlinear-scale data.

Distance
Distance is a simple, well-understood concept. Distance has the following properties (assume x and y to be vectors):
• distance is always positive;
• the distance from x to x is zero;
• the distance from x to y cannot be greater than the sum of the distances from x to z and from z to y;
• the distance from x to y is the same as from y to x.
Three common distance measures are:
1. Manhattan distance, the sum of absolute differences: D(x, y) = Σ|xi − yi|
2. Euclidean distance: D(x, y) = (Σ(xi − yi)²)^½
3. Maximum difference, or Chebyshev distance: D(x, y) = maxi |xi − yi|

Examples
Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) compute the Euclidean distance between the two objects;
(b) compute the Manhattan distance between the two objects.
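A worked check of this exercise, with the three distance measures written out as small functions (plain Python, illustrative only):

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def chebyshev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (22, 1, 42, 10), (20, 0, 36, 8)
print(manhattan(x, y))   # 2 + 1 + 6 + 2 = 11
print(euclidean(x, y))   # sqrt(4 + 1 + 36 + 4) = sqrt(45), roughly 6.7
print(chebyshev(x, y))   # 6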
Categorical Data
It is difficult to find a sensible distance measure between, say, India and Australia, between a male and a female, or between a student doing an MCA and one doing a BE.
1. One possibility is to convert all categorical data into numeric data.
2. Another possibility is to treat the distance between two categorical values as a function that only takes the value 1 or 0.
Using the approach of treating the distance between two categorical values as 1 or 0, we are essentially just comparing how many attribute values are the same.

Distance Measure
We need to look carefully at the scales of the different attributes' values. For example, GPA varies from 0 to 4 while the age attribute varies from, say, 20 to 50. The total distance, the sum of the distances between all attribute values, is then dominated by the age attribute and is affected little by variations in GPA. To avoid such problems the data should be scaled or normalized.

Clustering
It is not always easy to predict how many clusters to expect from a set of data. One may choose a number of clusters based on some knowledge of the data, and the result may be an acceptable set of clusters. This does not, however, mean that a different number of clusters would not provide an even better set of clusters. Choosing the number of clusters is a difficult problem.

Types of Clustering Methods
1. Partitioning methods: given n objects, make k (≤ n) partitions (clusters) of the data and use an iterative relocation method. It is assumed that each cluster has at least one object and each object belongs to only one cluster. K-Means clustering is the best-known partitioning method.
2. Hierarchical methods: start with one cluster and then split it into smaller clusters, or start with each object in an individual cluster and then try to merge similar clusters.
3. Density-based methods: for each data point in a cluster, at least a minimum number of points must exist within a given radius.
4. Grid-based methods: the object space is divided into a grid.
5. Model-based methods: a model is assumed, perhaps based on a probability distribution.

The K-Means Method
K-Means is the most commonly used method.
1. Select the number of clusters; let this number be k.
2. Pick k seeds as the centroids of the k clusters. The seeds may be picked randomly.
3. Compute the Euclidean distance of each object in the dataset from each of the centroids.
4. Allocate each object to the cluster it is nearest to, based on the distances computed in the previous step.
5. Compute the centroids of the clusters by computing the means of the attribute values of the objects in each cluster.
6. Check whether the stopping criterion has been met (i.e. cluster membership is unchanged). If yes, go to step 7; if not, go to step 3.
7. (Optional) One may decide to stop at this stage, or to split a cluster or combine two clusters, until a stopping criterion is met.
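A compact sketch of the loop described in steps 3 to 6 (our own minimal implementation, for illustration; points and seeds are tuples of numbers):

import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_means(points, seeds, max_iterations=100):
    """Return (centroids, assignments) after iterating until membership no longer changes."""
    centroids = [tuple(s) for s in seeds]
    assignments = None
    for _ in range(max_iterations):
        # Steps 3-4: allocate every object to the nearest centroid.
        new_assignments = [min(range(len(centroids)),
                               key=lambda c: euclidean(p, centroids[c]))
                           for p in points]
        if new_assignments == assignments:   # Step 6: membership unchanged, so stop.
            break
        assignments = new_assignments
        # Step 5: recompute each centroid as the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

The worked example that follows applies exactly this procedure, seeding the three clusters with the first three records.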
The K-Means Method
Essentially the algorithm tries to build clusters with a high level of similarity within each cluster and a low level of similarity between clusters. The method requires the means of the variables to be computed. It is scalable and efficient. The algorithm does not find a global minimum; rather, it terminates at a local minimum.

Example

Student  Age  Marks1  Marks2  Marks3
S1   18  73  75  57
S2   18  79  85  75
S3   23  70  70  52
S4   20  55  55  55
S5   22  85  86  87
S6   19  91  90  89
S7   20  70  65  65
S8   21  53  56  59
S9   19  82  82  60
S10  40  76  60  78

1. Let the three seeds be the first three records:

Student  Age  Marks1  Marks2  Marks3
S1  18  73  75  57
S2  18  79  85  75
S3  23  70  70  52

2. Now we compute the distances between the objects based on the four attributes.
3. K-Means requires Euclidean distances, but we also use the sum of absolute differences to show how the distance metric can change the results.

Distances to the three seeds, and the resulting cluster allocations:

Student  Age  Marks1  Marks2  Marks3  Manhattan (to C1/C2/C3)  Cluster  Euclidean (to C1/C2/C3)  Cluster
S1   18  73  75  57   Seed 1     C1   Seed 1    C1
S2   18  79  85  75   Seed 2     C2   Seed 2    C2
S3   23  70  70  52   Seed 3     C3   Seed 3    C3
S4   20  55  55  55   42/76/36   C3   27/43/22  C3
S5   22  85  86  87   57/23/67   C2   34/14/41  C2
S6   19  91  90  89   66/32/82   C2   40/19/47  C2
S7   20  70  65  60   23/41/21   C3   13/24/14  C1
S8   21  53  56  59   44/74/40   C3   28/42/23  C3
S9   19  82  82  60   20/22/36   C1   12/16/19  C1
S10  47  75  76  77   60/52/58   C2   33/34/32  C3

Note the different allocations made by the two distance metrics. K-Means uses the Euclidean distance. The means of the new clusters are also different, as shown below. We now start with the new means and compute Euclidean distances again; the process continues until there is no change.

Cluster Means
Using the Manhattan allocation:  C1: 18.5, 77.5, 78.5, 58.5;  C2: 26.5, 82.8, 80.3, 82;  C3: 21, 62, 61.5, 57.8
Original seeds:  Seed 1: 18, 73, 75, 57;  Seed 2: 18, 79, 85, 75;  Seed 3: 23, 70, 70, 52
Using the Euclidean allocation:  C1: 19.0, 75.0, 74.0, 60.7;  C2: 19.7, 85.0, 87.0, 83.7;  C3: 26.0, 63.5, 60.3, 60.8

Distances in the next iteration (to the new cluster means C1: 18.5, 77.5, 78.5, 58.5; C2: 24.8, 82.8, 80.3, 82; C3: 21, 62, 61.5, 57.8):

Student  Age  Marks1  Marks2  Marks3  Distance (to C1/C2/C3)  Cluster
S1   18  73  75  57   8/52/36   C1
S2   18  79  85  75   30/18/63  C2
S3   23  70  70  52   22/67/28  C1
S4   20  55  55  55   46/91/26  C3
S5   22  85  86  87   51/7/78   C2
S6   19  91  90  89   60/15/93  C2
S7   20  70  65  60   19/56/22  C1
S8   21  53  56  59   44/89/22  C3
S9   19  82  82  60   16/32/48  C1
S10  47  75  76  77   52/63/43  C3

Comments
The Euclidean distance is used in the example since that is what K-Means requires.
If the Manhattan distance is used instead, the method is called the K-Median method. The results of using the Manhattan distance and the Euclidean distance are quite different. The last iteration changed the cluster membership of S3 from C3 to C1.

Hierarchical Clustering
Hierarchical clustering is quite different from partitioning. This method attempts to capture the structure of the data by constructing a tree of clusters. There are two types:
• Agglomerative: gradually merging different objects into clusters (bottom-up approach).
• Divisive: dividing large clusters into smaller ones (top-down approach).

Computing the Distance Between Two Clusters
1. Single link method (nearest neighbour): the distance between two clusters is the distance between the two closest points, one from each cluster.
2. Complete linkage method (furthest neighbour): the distance between two clusters is the distance between the two furthest points, one from each cluster.
3. Centroid method: the distance between two clusters is the distance between the two centroids, or centres of gravity, of the two clusters.
4. Unweighted pair-group average method: the distance between two clusters is the average distance between all pairs of objects in the two clusters. This means p × n distances need to be computed if p and n are the numbers of objects in the two clusters.
(The slides illustrate each of these with two small clusters of points: single link needs the smallest pairwise distance, complete link the largest, the centroid method the distance between the two centroids, and the average link method all pairwise distances, e.g. 16 distances for two clusters of 4 objects each.)

Agglomerative Clustering
1. Each object is assigned to a cluster of its own.
2. Build a distance matrix by computing the distances between every pair of objects.
3. Find the two nearest clusters.
4. Merge them and remove them from the list.
5. Update the matrix by including the distance of all objects from the new cluster.
6. Continue until all objects belong to one cluster.
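A minimal sketch of this procedure using the centroid method and the Manhattan distance, as in the worked example that follows (our own illustrative code; it merges until a chosen number of clusters remains rather than all the way to one cluster):

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def centroid(cluster):
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))

def agglomerative(points, target_clusters=1):
    """Centroid-method agglomerative clustering of a list of numeric tuples."""
    clusters = [[p] for p in points]            # step 1: every object in its own cluster
    while len(clusters) > target_clusters:      # step 6: keep going until done
        # steps 2-3: find the two nearest clusters (distance between centroids)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = manhattan(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # steps 4-5: merge the pair; distances to the merged cluster are recomputed on the next pass
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters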
Example
We now consider the simple example about the students' marks, using the Manhattan distance between two objects and the centroid method for the distance between clusters. We first calculate a matrix of distances.

Student       Age  Marks1  Marks2  Marks3
971234 (S1)   18   73      75      57
994321 (S2)   18   79      85      75
965438 (S3)   23   70      70      52
987654 (S4)   20   55      55      55
968765 (S5)   22   85      86      87
978765 (S6)   19   91      90      89
984567 (S7)   20   70      65      60
985555 (S8)   21   53      56      59
998888 (S9)   19   82      82      60
995544 (S10)  47   75      76      77

Distance Matrix
(The matrix of all pairwise distances is shown on the slide.) The matrix gives the distance of each object from every other object. The smallest distance, 8, is between object S4 and object S8. They are combined and the new cluster is put where object S4 was. We then compute the distances from this cluster and update the distance matrix.

The smallest distance is now 15, between the objects S5 and S6. They are combined into a cluster, and S5 and S6 are removed. We compute the distances from this cluster and update the distance matrix again.

Looking at the shortest distance again, S3 and S7 are at a distance of 16 apart. We merge them and put the new cluster C3 where S3 was. The updated distance matrix then shows that C2 and S1 have the smallest distance, and they are merged in the next step. We stop short of finishing the example; continuing in this way eventually merges everything into a single tree of clusters, as shown on the final slide.

Thank you. Note: you can study search engines from my slides.