CSE3212 Data Mining: Defining a Data Mining Task

Data Mining Approaches

Defining a Data Mining Task

To define a data mining task, one needs to answer the following questions:
1. What data set do I want to mine?
2. What kind of knowledge do I want to mine?
3. What background knowledge could be useful?
4. How do I measure if the results are interesting?
5. How do I display what I have discovered?

Task-relevant Data

Generally we wish to mine only a subset of a database, not the whole database. It may be that we only want to study something specific, e.g. trends in postgraduate students:
• the countries they come from
• the degree program they are doing
• their age
• the time (duration) they have taken to finish the degree
• whether they have been awarded a scholarship
Building the database subset may be a subtask before data mining can be done.

What to be mined? Or the Approaches

What kind of knowledge are we after?
• Classification
• Estimation
• Prediction
• Clustering
• Description
• Affinity Grouping
• Outliers
• …

Classification

Classification involves considering the features of some object and then assigning it to some pre-defined class, for example:
• spotting fraudulent insurance claims
• deciding which phone numbers are fax numbers
• deciding which customers are high-value
The features that are considered are known as the independent attributes or variables, while the attribute that constitutes the pre-defined classes is called the dependent attribute or variable.
First build a model based on the known data, then use the model to classify other data for which the class label is not known. This is known as supervised learning.

Estimation

Estimation deals with numerically valued outcomes rather than the discrete categories that occur in classification.
Examples of estimation:
• estimating the number of children in a family
• estimating family income

Prediction

Prediction is essentially the same as classification and estimation, but involves future behaviour.
• Historical data is used to build a model explaining behaviour (outputs) for known inputs.
• The model developed is then applied to current inputs to predict future outputs.
Examples:
• predicting which customers will respond to a promotion
• classifying loan applications

Clustering

Clustering is also sometimes referred to as segmentation (though this has other meanings in other fields).
• In clustering there are no pre-defined classes. Self-similarity is used to group records.
• The user must attach meaning to the clusters formed.
• Clustering often precedes some other data mining task. For example, once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster.
• Clustering is known as unsupervised learning.

Description

• A good description of data can provide understanding of behaviour.
• The description of the behaviour can suggest an explanation for it as well.
• Statistical measures can be useful in describing data, as can techniques that generate rules.

Deviation Detection

• Records whose attributes deviate from the norm by significant amounts are also called outliers.
• Application areas include fraud detection, quality control and tracing defects.
• Visualization techniques and statistical techniques are useful in finding outliers.
• A cluster which contains only a few records may in fact represent outliers.

Affinity Grouping

Affinity grouping is also referred to as market basket analysis. A common example is the discovery of which items are frequently sold together at a supermarket. If this is known, decisions can be made about:
• arranging items on shelves
• which items should be promoted together
• which items should not simultaneously be discounted

Market Basket Analysis

When a customer buys a shirt, in 70% of cases, he or she will also buy a tie!
We find this happens in 13.5% of all purchases. In this rule, the customer buying a shirt is the rule body, the customer also buying a tie is the rule head, 70% is the confidence and 13.5% is the support.

The Usefulness of Market Basket Analysis

• Some rules are useful: unknown, unexpected and indicative of some action to take.
• Some rules are trivial: known by anyone familiar with the business.
• Some rules are inexplicable: they seem to have no explanation and do not suggest a course of action.

“The key to success in business is to know something that nobody else knows” (Aristotle Onassis)

Co-Occurrence Table

Customer  Items
1         orange juice (OJ), cola
2         milk, orange juice, window cleaner
3         orange juice, detergent
4         orange juice, detergent, cola
5         window cleaner, cola

           OJ  Cleaner  Milk  Cola  Detergent
OJ          4     1       1     2       2
Cleaner     1     2       1     1       0
Milk        1     1       1     0       0
Cola        2     1       0     3       1
Detergent   2     0       0     1       2

From the Co-Occurrence Table

We can say that people who buy orange juice will also buy cola (or detergent):

orange juice ⇒ cola

This association rule is satisfied by 2 out of 5 customers (1 and 4), hence its support is 2/5 = 40%. However, four customers (1, 2, 3 and 4) have purchased orange juice, hence the confidence of the rule is only 2/4 = 50%.

Question: Are the support and confidence measures good enough?

The rule has one item (or attribute) on the left-hand side and one on the right-hand side. How do you find rules which have more than one item on the left-hand side (multi-attribute rules)?

Support and Confidence

Support: the percentage of transactions in a transaction database that the given rule satisfies. This can be taken as the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of item sets X and Y.

Confidence: assesses the degree of certainty of the detected association. This can be taken as the conditional probability P(Y|X), that is, the probability that a transaction containing X also contains Y.

More formally:

Support(X ⇒ Y) = P(X ∪ Y)
Confidence(X ⇒ Y) = P(Y|X)
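The co-occurrence counts and the support and confidence of "orange juice ⇒ cola" can be reproduced with a short sketch. This is only an illustration in Python, not part of the original slides; the names `baskets`, `ITEMS`, `co_occurrence`, `support` and `confidence` are assumptions chosen for this example.

```python
# The five customer baskets from the co-occurrence example.
baskets = {
    1: {"OJ", "Cola"},
    2: {"Milk", "OJ", "Cleaner"},
    3: {"OJ", "Detergent"},
    4: {"OJ", "Detergent", "Cola"},
    5: {"Cleaner", "Cola"},
}
ITEMS = ["OJ", "Cleaner", "Milk", "Cola", "Detergent"]


def co_occurrence(baskets, items):
    """Count, for every pair of items, the baskets that contain both.

    The diagonal entry (a, a) is the number of baskets containing a.
    """
    table = {(a, b): 0 for a in items for b in items}
    for basket in baskets.values():
        for a in basket:
            for b in basket:
                table[(a, b)] += 1
    return table


def support(baskets, body, head):
    """Fraction of all baskets containing every item of body and head."""
    both = set(body) | set(head)
    hits = sum(1 for basket in baskets.values() if both.issubset(basket))
    return hits / len(baskets)


def confidence(baskets, body, head):
    """Fraction of the baskets containing body that also contain head."""
    body, head = set(body), set(head)
    with_body = [b for b in baskets.values() if body.issubset(b)]
    if not with_body:  # the condition never occurs, so no evidence either way
        return 0.0
    hits = sum(1 for b in with_body if head.issubset(b))
    return hits / len(with_body)


table = co_occurrence(baskets, ITEMS)
print(table[("OJ", "Cola")])                   # 2
print(support(baskets, {"OJ"}, {"Cola"}))      # 0.4
print(confidence(baskets, {"OJ"}, {"Cola"}))   # 0.5
```

Because `body` and `head` are sets, the same functions accept a rule with several items on the left-hand side, which a two-dimensional co-occurrence table cannot express directly.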
What is a Rule?

A rule has the form:

If condition then result

Note:

If nappies and Thursday then beer

is usually better than (in the sense that it is more actionable)

If Thursday then nappies and beer

because it has just one item in the result. If a 3-way combination is the most common, then consider rules with just 1 item in the result, e.g.

If A and B, then C
If A and C, then B

Is the Rule a Useful Predictor?

Confidence is the ratio of the number of transactions with all the items in the rule to the number of transactions with just the items in the condition. Consider:

if B and C then A

If this rule has a confidence of 0.33, it means that when B and C occur in a transaction, there is a 33% chance that A also occurs.

Consider the following table of probabilities of items and their combinations:

Combination      Probability
A                0.45
B                0.42
C                0.40
A and B          0.25
A and C          0.20
B and C          0.15
A and B and C    0.05

Now consider the following rules:

Rule                p(condition)  p(condition and result)  confidence
If A and B then C   0.25          0.05                     0.20
If A and C then B   0.20          0.05                     0.25
If B and C then A   0.15          0.05                     0.33

It is tempting to choose “If B and C then A”, because it is the most confident (33%), but there is a problem. This rule is actually worse than just saying that A randomly occurs in the transaction, which happens 45% of the time.

A measure called improvement indicates how much better a rule is at predicting the result than just assuming the result in the first place:

Improvement = p(condition and result) / (p(condition) × p(result))

When improvement > 1, the rule is better at predicting the result than random chance.
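The confidence and improvement arithmetic can be checked with a small sketch that uses only the probabilities given for A, B, C and their combinations. The dictionary `p` and the function names are assumptions made for this illustration; items are written as single letters so that a string such as "BC" stands for "B and C".

```python
# Probabilities of items and their combinations, copied from the slide table.
p = {
    frozenset("A"): 0.45,
    frozenset("B"): 0.42,
    frozenset("C"): 0.40,
    frozenset("AB"): 0.25,
    frozenset("AC"): 0.20,
    frozenset("BC"): 0.15,
    frozenset("ABC"): 0.05,
}


def confidence(condition, result):
    """p(condition and result) / p(condition)."""
    cond = frozenset(condition)
    both = cond | frozenset(result)
    return p[both] / p[cond]


def improvement(condition, result):
    """p(condition and result) / (p(condition) * p(result))."""
    return confidence(condition, result) / p[frozenset(result)]


for condition, result in [("AB", "C"), ("AC", "B"), ("BC", "A"), ("A", "B")]:
    print(
        f"if {' and '.join(condition)} then {result}: "
        f"confidence={confidence(condition, result):.2f} "
        f"improvement={improvement(condition, result):.2f}"
    )
```

Under these inputs every three-item rule has an improvement below 1, while "if A then B" comes out above 1.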
Consider the improvement for our rules:

Rule                support  confidence  improvement
If A and B then C   0.05     0.20        0.50
If A and C then B   0.05     0.25        0.60
If B and C then A   0.05     0.33        0.74
If A then B         0.25     0.56        1.32

When improvement < 1, negating the result produces a better rule. For example,

if B and C then not A

has a confidence of 0.67 and thus an improvement of 0.67/0.55 = 1.22. Negated rules may not be as useful as the original association rules when it comes to acting on the results.

None of the rules with three items shows any improvement. The best rule in the data actually has only two items: “if A then B”. A predicts the occurrence of B 1.32 times better than chance.

Choosing the Right Set of Items

• Choosing the right level of detail (the creation of classes and a taxonomy)
• Virtual items may be added to take advantage of information that goes beyond the taxonomy
• Anonymous versus signed transactions

Multi-attribute Rule

A rule with 2 items on the left-hand side and one item on the right-hand side (e.g. If A and B then C) would require the co-occurrence matrix to be 3-dimensional. How do you visualise a three-dimensional co-occurrence matrix? What happens for higher dimensions?

The Process for Market Basket Analysis

A co-occurrence cube would show associations in three dimensions; more dimensions are hard to visualize. We must:
• choose the right set of items
• generate rules by deciphering the counts in the co-occurrence matrix
• overcome the practical limits imposed by many items in large numbers of transactions

An Example

Consider the following database:

Student(sid, name1, dob, country, degree, startsem, address1, telephone, address2, email, scholarship, ..)
Enrolment(sid, subject-id, mark, tutegroup, tutor,..)
Subject(sub-id, name, school-id, whenstarted, lecturer,..)
School(name, id, ..)

Not all of this data is needed for decision making. Let us extract some data from this database.
Example

yob   country    degree    startsem  numsubjects  scholarship
1965  Thailand   MIT       991       5            25%
1970  Canada     BIT       992       4            0
1967  Australia  LLB       993       3            30%
1966  Australia  LLB       983       4            40%
1972  Australia  Bcom      973       5            10%
1972  India      BIT/Bcom  991       5            10%
1982  Sweden     MSc(IT)   991       3            10%

Is this information useful for decision making? Not really!

Data Cube

We could look at the information as

yob X country X degree X startsem X numsubjects X scholarship

In fact it is natural to think of enterprise data as multidimensional.

Example

The university management may be interested in retrieving information like:
• How many students are doing BIT? How many students are from Thailand? How many students started in 1998? (queries involving only one variable)
• How many students doing BIT are from Thailand? How many MIT students started in 981? How many students from Thailand started in 993? (queries involving two variables)
• How many students doing MIT from Thailand started in 981? (query involving three variables)

Special types of database systems, called data cube systems, are often used for answering such queries.

Data Cube

The example queries discussed earlier may be represented by a three-dimensional data cube with each edge representing one of the variables, viz. startsem, country and degree. A point inside the cube is an intersection of the coordinates defined by the edges of the cube. The coordinates of the point define the meaning of the data at that point.

Let us look at a simple two-dimensional situation:

country X degree

If we had a 2-dimensional matrix then we could find out the number of students for any country (x) and any degree (y). For decision making this may be useful information.

But in the two-dimensional situation, we don’t just want to find out the number of students for any country (x) and any degree (y). We may have many other queries.
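The one-, two- and three-variable counting queries can be sketched with an in-memory cube built from the seven sample students. This is a minimal illustration, not how a real data cube system is implemented; the wildcard `ALL` and the function `build_cube` are hypothetical names introduced here.

```python
from collections import Counter

# (country, degree, startsem) for the seven sample students in the example table.
students = [
    ("Thailand", "MIT", "991"),
    ("Canada", "BIT", "992"),
    ("Australia", "LLB", "993"),
    ("Australia", "LLB", "983"),
    ("Australia", "Bcom", "973"),
    ("India", "BIT/Bcom", "991"),
    ("Sweden", "MSc(IT)", "991"),
]

ALL = "*"  # hypothetical wildcard meaning "summed over this dimension"


def build_cube(rows):
    """Count students for every cell of the cube, including the sum cells.

    Each row contributes to 2**3 = 8 cells: its own (country, degree, startsem)
    cell plus every combination with one or more coordinates replaced by ALL.
    """
    cube = Counter()
    for country, degree, sem in rows:
        for c in (country, ALL):
            for d in (degree, ALL):
                for s in (sem, ALL):
                    cube[(c, d, s)] += 1
    return cube


cube = build_cube(students)
print(cube[("Australia", ALL, ALL)])      # one variable: 3
print(cube[("Australia", "LLB", ALL)])    # two variables: 2
print(cube[("Thailand", "MIT", "991")])   # three variables: 1
print(cube[(ALL, ALL, "991")])            # students who started in 991: 3
print(cube[(ALL, ALL, ALL)])              # all students: 7
```

Precomputing the sum cells trades memory for lookup speed: every query above becomes a single dictionary access. A combination that never occurs simply returns 0, because `Counter` yields 0 for missing keys.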
Such queries include:
1. How many students are doing MIT?
2. How many students are from Thailand?
3. How many Asian students are doing Law degrees?

Thus there is a kind of hierarchy that we wish to use, for example the world, the continents, the regions, the countries, etc. In degrees, we may want a hierarchy of university, schools, UG and PG, and individual degrees.

Consider a slightly more complex situation in which we have three dimensions:

country X degree X startsem

We could then find out the number of students for any country (x), any degree (y) and any start semester (z). We may now look at this information as a 3-dimensional cube.

A Sample Data Cube

[Figure: total enrolments shown as a cube with axes country (U.S.A, Malaysia, Australia, sum), degree (LLB, BComp, MIT, sum) and semester (991, 992, 993, 001, sum).]

Data Cube

Number of students as a function of country, degree and semester.
Dimensions: country, degree, sem
Hierarchical summarization paths:
• country: continent, region, country
• degree: school, ug/pg, degree
• sem: year, semester

Data Cube

Each edge of the cube is called a dimension. A user normally has a number of different dimensions from which the given data may be analyzed. A user therefore has a multidimensional conceptual view of the data, which is represented by the cube. The points inside a cube provide aggregations. For example, a point may provide the number of students from Malaysia admitted to BComp in year 2006.

Outlier Analysis

Outlier analysis identifies data objects that do not comply with the general behaviour or model of the data.
Often outliers are ignored, but in applications like fraud detection the outliers are the objects of interest.

Strengths and Weaknesses

Strengths of market basket analysis:
• Clear, understandable results
• Supports undirected data mining
• Works on variable-length data
• Is simple to understand

Weaknesses:
• Requires exponentially more computational effort as the problem size grows
• Suits items in transactions, but not all problems fit this description
• It can be difficult to determine the right set of items to analyse
• It does not handle rare items well; simply considering the level of support will exclude these items

We need an algorithm to find the association rules.
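As a rough illustration of what such an algorithm has to do, the sketch below tests every itemset by brute force on the five customer baskets from the co-occurrence example. It is deliberately naive and is not the algorithm the course goes on to present; the function name `find_rules` and its thresholds are assumptions for this example. The nested loops over all item combinations also show why the effort grows so quickly with the number of items.

```python
from itertools import combinations

# The five customer baskets from the co-occurrence example.
baskets = [
    {"OJ", "Cola"},
    {"Milk", "OJ", "Cleaner"},
    {"OJ", "Detergent"},
    {"OJ", "Detergent", "Cola"},
    {"Cleaner", "Cola"},
]


def find_rules(baskets, min_support, min_confidence, max_size=3):
    """Exhaustively test every itemset of 2..max_size items.

    Only rules with a single item in the result are generated.
    Returns a list of (condition, result, support, confidence) tuples.
    """
    items = sorted(set().union(*baskets))
    n = len(baskets)
    rules = []
    for size in range(2, max_size + 1):
        for itemset in combinations(items, size):
            together = sum(1 for b in baskets if set(itemset).issubset(b))
            if together == 0 or min_support > together / n:
                continue  # too rare: this is where rare items get excluded
            for result in itemset:
                condition = tuple(i for i in itemset if i != result)
                with_condition = sum(
                    1 for b in baskets if set(condition).issubset(b)
                )
                conf = together / with_condition
                if conf >= min_confidence:
                    rules.append((condition, result, together / n, conf))
    return rules


for condition, result, sup, conf in find_rules(baskets, 0.4, 0.6):
    print(f"if {' and '.join(condition)} then {result}: "
          f"support={sup:.2f} confidence={conf:.2f}")
```

With a minimum support of 0.4 and a minimum confidence of 0.6, only two rules survive on this data: "if Cola then OJ" and "if Detergent then OJ". The rule "if OJ then Cola" is dropped because its confidence is only 0.5.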
