Lecture 16: Data mining and knowledge discovery
(Negnevitsky, Pearson Education, 2005)

- Introduction, or what is data mining?
- Data warehouse and query tools
- Decision trees
- Case study: Profiling people with high blood pressure
- Summary

What is data mining?

Data is what we collect and store; knowledge is what helps us to make informed decisions.

- The extraction of knowledge from data is called data mining.
- Data mining can also be defined as the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.
- The ultimate goal of data mining is to discover knowledge.

Data warehouse

- Modern organisations must respond quickly to any change in the market. This requires rapid access to current data, which is normally stored in operational databases.
- However, an organisation must also determine which trends are relevant. This task is accomplished with access to historical data stored in large databases called data warehouses.
- The main characteristic of a data warehouse is its capacity: a data warehouse is really big, holding millions, even billions, of data records.
- The data stored in a data warehouse is
  - time dependent – linked together by the times of recording, and
  - integrated – all relevant information from the operational databases is combined and structured in the warehouse.

Query tools

- A data warehouse is designed to support decision-making in the organisation. The information needed can be obtained with query tools.
- Query tools are assumption-based: a user must ask the right questions.

How is data mining applied in practice?

- Many companies use data mining today, but refuse to talk about it.
- In direct marketing, data mining is used for targeting people who are most likely to buy certain products and services.
- In trend analysis, it is used to determine trends in the marketplace, for example, to model the stock market.
- In fraud detection, data mining is used to identify insurance claims, cellular phone calls and credit card purchases that are most likely to be fraudulent.

Data mining tools

Data mining is based on intelligent technologies already discussed in this book. It often applies such tools as neural networks and neuro-fuzzy systems. However, the most popular tool used for data mining is the decision tree.

Decision trees

A decision tree can be defined as a map of the reasoning process. It describes a data set by a tree-like structure. Decision trees are particularly good at solving classification problems.

An example of a decision tree (a direct-mail response data set):

Household – responded: 112, not responded: 888 (total: 1000), split on Homeownership:
  yes – responded: 9, not responded: 334 (total: 343)
  no – responded: 103, not responded: 554 (total: 657), split on Household Income:
    ≤ $20,700 – responded: 14, not responded: 158 (total: 172)
    > $20,700 – responded: 89, not responded: 396 (total: 485), split on Savings Accounts:
      yes – responded: 86, not responded: 188 (total: 274)
      no – responded: 3, not responded: 208 (total: 211)

A decision tree consists of nodes, branches and leaves.

- The top node is called the root node. The tree always starts from the root node and grows down by splitting the data at each level into new nodes. The root node contains the entire data set (all data records), and child nodes hold respective subsets of that set.
- All nodes are connected by branches.
- Nodes at the end of branches are called terminal nodes, or leaves.
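This anatomy maps directly onto a recursive data structure. Below is a minimal Python sketch, not from the lecture, that encodes the direct-mail tree above and classifies a record by walking from the root down to a leaf; the record field names (homeowner, income_over_20700, savings_account) are illustrative assumptions.

```python
# A decision tree as nodes, branches and leaves (illustrative sketch).
class Node:
    def __init__(self, predictor=None, branches=None, counts=None):
        self.predictor = predictor      # attribute tested here; None for a leaf
        self.branches = branches or {}  # branch label -> child Node
        self.counts = counts            # (responded, not_responded) at a leaf

def classify(node, record):
    """Walk from the root down to a leaf and return that leaf's class counts."""
    while node.predictor is not None:
        node = node.branches[record[node.predictor]]
    return node.counts

# The direct-mail example tree from the slide above.
tree = Node("homeowner", {
    "yes": Node(counts=(9, 334)),
    "no": Node("income_over_20700", {
        "no": Node(counts=(14, 158)),
        "yes": Node("savings_account", {
            "yes": Node(counts=(86, 188)),
            "no": Node(counts=(3, 208)),
        }),
    }),
})

record = {"homeowner": "no", "income_over_20700": "yes", "savings_account": "yes"}
print(classify(tree, record))  # (86, 188): 86 of the 274 households in this leaf responded
```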
How does a decision tree select splits?

- A split in a decision tree corresponds to the predictor with the maximum separating power. The best split does the best job of creating nodes where a single class dominates.
- One of the best-known methods of calculating a predictor's power to separate data is based on the Gini coefficient of inequality.

The Gini coefficient

The Gini coefficient is a measure of how well the predictor separates the classes contained in the parent node. Gini, an Italian economist, introduced it as a rough measure of the amount of inequality in a country's income distribution.

Computation of the Gini coefficient

[Figure: Lorenz curve – cumulative % of income plotted against cumulative % of population, together with the diagonal of perfect equality.]

The Gini coefficient is calculated as the area between the curve and the diagonal divided by the area below the diagonal. For a perfectly equal wealth distribution, the Gini coefficient is equal to zero.

Selecting an optimal decision tree: (a) splits selected by the Gini measure

Class A: 100, Class B: 50 (total: 150), split on Predictor 1:
  yes – A: 63, B: 38 (101), split on Predictor 2:
    yes – A: 4, B: 37 (41), split on Predictor 3:
      yes – A: 0, B: 36 (36)
      no – A: 4, B: 1 (5)
    no – A: 59, B: 1 (60)
  no – A: 37, B: 12 (49), split on Predictor 4:
    yes – A: 25, B: 4 (29), split on Predictor 5:
      yes – A: 2, B: 1 (3)
      no – A: 23, B: 3 (26)
    no – A: 12, B: 8 (20), split on Predictor 6:
      yes – A: 1, B: 8 (9)
      no – A: 11, B: 0 (11)

Selecting an optimal decision tree: (b) splits selected by guesswork

Class A: 100, Class B: 50 (total: 150), split on Predictor 5:
  yes – A: 19, B: 14 (33), split on Predictor 2:
    yes – A: 17, B: 6 (23)
    no – A: 2, B: 8 (10)
  no – A: 81, B: 36 (117), split on Predictor 3:
    yes – A: 46, B: 21 (67), split on Predictor 1:
      yes – A: 37, B: 14 (51), split on Predictor 4:
        yes – A: 8, B: 9 (17)
        no – A: 29, B: 5 (34)
      no – A: 9, B: 7 (16)
    no – A: 35, B: 15 (50), split on Predictor 6:
      yes – A: 23, B: 9 (32)
      no – A: 12, B: 6 (18)

Gain chart of Class A

[Figure: gain chart – % of Class A captured versus % of the population, comparing the Gini splits with manual split selection.]

Can we extract rules from a decision tree?

The path from the root node to a bottom leaf reveals a decision rule. For example, the rule associated with the right bottom leaf of the Gini-split tree (a) above can be represented as follows:

if (Predictor 1 = no)
and (Predictor 4 = no)
and (Predictor 6 = no)
then class = Class A
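How a split's separating power is scored can be made concrete. A standard formulation in tree induction, consistent with the lecture's Gini measure though stated differently (the slide defines it via the Lorenz curve), is Gini impurity: the best split minimises the size-weighted impurity of the child nodes. A minimal sketch using the class counts from trees (a) and (b) above:

```python
# Gini impurity: 1 - sum(p_i^2); zero means the node is pure (one class only).
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# A split's score: size-weighted average impurity of its children (lower is better).
def split_impurity(children):
    total = sum(sum(child) for child in children)
    return sum(sum(child) / total * gini(child) for child in children)

root = (100, 50)              # Class A, Class B counts at the root
pred1 = [(63, 38), (37, 12)]  # Predictor 1, the Gini-selected root split in tree (a)
pred5 = [(19, 14), (81, 36)]  # Predictor 5, the guesswork root split in tree (b)

print(round(gini(root), 3))             # 0.444
print(round(split_impurity(pred1), 3))  # 0.437
print(round(split_impurity(pred5), 3))  # 0.440: Predictor 5 separates the classes less well
```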
Case study 10: Profiling people with high blood pressure

A typical task for decision trees is to determine conditions that may lead to certain outcomes. Blood pressure can be categorised as optimal, normal or high. Optimal pressure is below 120/80, normal is between 120/80 and 130/85, and hypertension is diagnosed when blood pressure is over 140/90.

A data set for a hypertension study

Community Health Survey: Hypertension Study (California, U.S.A.) – a sample record; the respondent's answers are marked with ✓:

Gender: ✓ Male | Female
Age: 18–34 years | 35–50 years | ✓ 51–64 years | 65 or more years
Race: ✓ Caucasian | African American | Hispanic | Asian or Pacific Islander
Marital Status: Married | Separated | ✓ Divorced | Widowed | Never Married
Household Income: Less than $20,700 | $20,701–$45,000 | ✓ $45,001–$75,000 | $75,001 and over
Alcohol Consumption: Abstain from alcohol | Occasional (a few drinks per month) | ✓ Regular (one or two drinks per day) | Heavy (three or more drinks per day)
Smoking: Nonsmoker | 1–10 cigarettes per day | ✓ 11–20 cigarettes per day | More than one pack per day
Caffeine Intake: Abstain from coffee | ✓ One or two cups per day | Three or more cups per day
Salt Intake: Low-salt diet | ✓ Moderate-salt diet | High-salt diet
Physical Activities: None | ✓ One or two times per week | Three or more times per week
Height: 1 7 cm | Weight: 9 3 kg
Blood Pressure: Optimal | Normal | ✓ High

Data cleaning

Decision trees are only as good as the data they represent. Unlike neural networks and fuzzy systems, decision trees do not tolerate noisy and polluted data, so the data must be cleaned before we can start data mining. We might find that fields such as Alcohol Consumption or Smoking have been left blank or contain incorrect information.

Data enriching

From such variables as weight and height we can easily derive a new variable, obesity. This variable is calculated with the body-mass index (BMI), that is, the weight in kilograms divided by the square of the height in metres. Men with a BMI of 27.8 or higher, and women with a BMI of 27.3 or higher, are classified as obese.

A data set for a hypertension study (continued)

Obesity: ✓ Obese | Not Obese
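Deriving the obesity flag is a one-line computation per record. A minimal sketch, assuming the gender-specific cut-offs quoted above; the function and field names, and the sample values, are illustrative:

```python
# Data enriching: derive an 'obese' flag from weight and height via the body-mass index.
def bmi(weight_kg, height_cm):
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2

def is_obese(gender, weight_kg, height_cm):
    # Case-study thresholds: men >= 27.8, women >= 27.3.
    threshold = 27.8 if gender == "male" else 27.3
    return bmi(weight_kg, height_cm) >= threshold

print(round(bmi(92, 175), 1))     # 30.0
print(is_obese("male", 92, 175))  # True: above the 27.8 cut-off for men
```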
Growing a decision tree

Blood Pressure – optimal: 319 (32%), normal: 528 (53%), high: 153 (15%), total: 1000, split on Age:
  18–34 years – optimal: 88 (56%), normal: 64 (41%), high: 5 (3%), total: 157
  35–50 years – optimal: 208 (35%), normal: 340 (57%), high: 48 (8%), total: 596
  51–64 years – optimal: 21 (12%), normal: 90 (52%), high: 62 (36%), total: 173
  65 or more years – optimal: 2 (3%), normal: 34 (46%), high: 38 (51%), total: 74

Growing a decision tree (continued): the 51–64 years node split on Obesity:
  Obese – optimal: 3 (3%), normal: 53 (49%), high: 51 (48%), total: 107
  Not Obese – optimal: 18 (27%), normal: 37 (56%), high: 11 (17%), total: 66

Growing a decision tree (continued): the Obese node split on Race:
  Caucasian – optimal: 2 (5%), normal: 24 (55%), high: 17 (40%), total: 43
  African American – optimal: 0 (0%), normal: 13 (35%), high: 24 (65%), total: 37
  Hispanic – optimal: 0 (0%), normal: 11 (58%), high: 8 (42%), total: 19
  Asian – optimal: 1 (12%), normal: 5 (63%), high: 2 (25%), total: 8

Solution space of the hypertension study

The solution space is first divided into four rectangles by age; the 51–64 age group is then divided into those who are obese and those who are not; finally, the group of obese people is divided by race.

[Figure: the solution space partitioned into rectangles labelled with the node totals 157, 596, 74, 66, 43, 37, 19 and 8.]

Hypertension study: forcing a split

Here a Gender split is forced on the 35–50 and 51–64 age groups of the tree above:

  35–50 years (total: 596), split on Gender:
    Male – optimal: 111 (36%), normal: 168 (55%), high: 28 (9%), total: 307
    Female – optimal: 97 (34%), normal: 172 (59%), high: 20 (7%), total: 289
  51–64 years (total: 173), split on Gender:
    Male – optimal: 11 (13%), normal: 48 (56%), high: 27 (31%), total: 86
    Female – optimal: 10 (12%), normal: 42 (48%), high: 35 (40%), total: 87

Advantages of decision trees

- The main advantage of the decision-tree approach to data mining is that it visualises the solution; it is easy to follow any path through the tree.
- Relationships discovered by a decision tree can be expressed as a set of rules, which can then be used in developing an expert system.

Drawbacks of decision trees

- Continuous data, such as age or income, have to be grouped into ranges, which can unwittingly hide important patterns.
- Handling of missing and inconsistent data: decision trees can produce reliable outcomes only when they deal with "clean" data.
- Inability to examine more than one variable at a time. This confines trees to problems that can be solved by dividing the solution space into several successive rectangles.

In spite of all these limitations, decision trees have become the most successful technology used for data mining. The ability to produce clear sets of rules makes decision trees particularly attractive to business professionals.
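To close with the rule-extraction point: every root-to-leaf path is an if-then rule. The following is a minimal sketch, not from the lecture, that walks a small tree (here, the 51–64 branch of the hypertension tree grown above, with each leaf labelled by its dominant class) and prints one rule per leaf.

```python
# Convert a decision tree into a set of if-then rules, one per root-to-leaf path.
def rules(tree, conditions=()):
    if isinstance(tree, str):  # a leaf: the accumulated conditions form a rule
        print("if " + " and ".join(conditions) + " then " + tree)
        return
    predictor, branches = tree
    for value, subtree in branches.items():
        rules(subtree, conditions + ("(" + predictor + " = " + value + ")",))

# The 51-64 branch of the hypertension tree, leaves labelled with the dominant class.
tree = ("Age", {
    "51-64 years": ("Obesity", {
        "obese": ("Race", {
            "African American": "high blood pressure (65%)",
            "Caucasian": "normal blood pressure (55%)",
        }),
        "not obese": "normal blood pressure (56%)",
    }),
})

rules(tree)
# e.g. if (Age = 51-64 years) and (Obesity = obese) and (Race = African American)
#      then high blood pressure (65%)
```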