Lecture 14: Data mining and knowledge discovery
Introduction, or what is data mining?
Data warehouse and query tools
Decision trees
Case study: profiling people with high blood pressure
Summary
(Slides are based on Negnevitsky, Pearson Education, 2005.)

What is data mining?
Data is what we collect and store; knowledge is what helps us to make informed decisions. The extraction of knowledge from data is called data mining. Data mining can also be defined as the exploration and analysis of large quantities of data in order to discover meaningful patterns and rules. The ultimate goal of data mining is to discover knowledge.

Why data mining
The explosive growth of data: from terabytes to petabytes
– Data collection and data availability
» Automated data collection tools, database systems, the Web, a computerized society
– Major sources of abundant data
» Business: Web, e-commerce, transactions, stocks, …
» Science: remote sensing, bioinformatics, scientific simulation, …
» Society and everyone: news, digital cameras, YouTube
– We are drowning in data, but starving for knowledge!

Why Not Traditional Data Analysis?
Tremendous amount of data
– Algorithms must be highly scalable to handle terabytes of data
High dimensionality of data
– Microarray data may have tens of thousands of dimensions
High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structured data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
– Software programs, scientific simulations
New and sophisticated applications

Knowledge Discovery (KDD) Process
– Data mining is the core of the knowledge discovery process.
[Figure: the KDD pipeline – databases → data cleaning and data integration → data warehouse → selection of task-relevant data → data mining → pattern evaluation]

KDD Process: Several Key Steps
Learning the application domain
– relevant prior knowledge and goals of the application
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of the effort!)
Data reduction and transformation
– find useful features, dimensionality/variable reduction, invariant representation
Choosing the functions of data mining
– summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: the search for patterns of interest
Pattern evaluation and knowledge presentation
– visualization, transformation, removing redundant patterns, etc.
Use of the discovered knowledge

Data Mining: Confluence of Multiple Disciplines
[Figure: data mining at the intersection of database technology, machine learning, pattern recognition, statistics, algorithms, visualization and other disciplines]

Architecture: Typical Data Mining System
[Figure: layered architecture – a graphical user interface over pattern evaluation and a data mining engine, both consulting a knowledge base; underneath, a database or data warehouse server fed by data cleaning, integration and selection from databases, a data warehouse, the World-Wide Web and other information repositories]

Data Mining Functionalities (1)
Frequent patterns, association, correlation vs. causality
– Diaper → Beer [support 0.5%, confidence 75%] (correlation or causality?)
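The bracketed figures on the Diaper → Beer rule are its support (0.5%) and confidence (75%). A minimal sketch of how the two measures are computed, over a made-up transaction list:

```python
# Support and confidence for an association rule such as Diaper -> Beer.
# The transactions below are invented for illustration.

transactions = [
    {"diaper", "beer", "bread"},
    {"diaper", "beer"},
    {"diaper", "milk"},
    {"bread", "milk"},
    {"diaper", "beer", "milk"},
]

def support(itemset, transactions):
    """Fraction of all transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"diaper", "beer"}, transactions))       # 0.6
print(confidence({"diaper"}, {"beer"}, transactions))  # 0.75
```

Algorithms such as Apriori (see the top-ten list below) search for all rules whose support and confidence exceed user-set thresholds.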
Classification and prediction
– Construct models (functions) that describe and distinguish classes or concepts for future prediction
» e.g., classify countries based on climate, or classify cars based on gas mileage
– Predict unknown or missing numerical values

Data Mining Functionalities (2)
Cluster analysis
– The class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
– Maximizing intra-class similarity and minimizing inter-class similarity
Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– Noise or exception? Useful in fraud detection and rare-event analysis

Data Mining Functionalities (3)
Trend and evolution analysis
– Trend and deviation: e.g., regression analysis
– Sequential pattern mining: e.g., digital camera → large SD memory card
– Periodicity analysis
– Similarity-based analysis
Other pattern-directed or statistical analyses
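Cluster analysis, described above, has to form classes without any labels. A minimal sketch of k-means (algorithm #11 in the candidate list that follows), on made-up two-dimensional data; an illustration, not a production implementation:

```python
# Minimal k-means: alternate between assigning points to their nearest
# center and recomputing centers, until the centers stop moving.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X[100:] += 4.0                      # two loose blobs of invented data

k = 2
centers = X[rng.choice(len(X), k, replace=False)]
for _ in range(100):
    # assign each point to its nearest center (maximize intra-class similarity)
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(centers)   # roughly recovers the two blob centers, near (0, 0) and (4, 4)
```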
Top-10 Most Popular DM Algorithms: 18 Identified Candidates
(I) Classification
– #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
– #2. CART: Breiman, L., Friedman, J., Olshen, R. and Stone, C. Classification and Regression Trees. Wadsworth, 1984.
– #3. k-Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. Discriminant Adaptive Nearest Neighbor Classification. TPAMI 18(6), 1996.
– #4. Naive Bayes: Hand, D. J. and Yu, K. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398, 2001.
(II) Statistical Learning
– #5. SVM: Vapnik, V. N. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.
– #6. EM: McLachlan, G. and Peel, D. Finite Mixture Models. J. Wiley, New York, 2000.
Association Analysis
– #7. Apriori: Agrawal, R. and Srikant, R. Fast Algorithms for Mining Association Rules. In VLDB '94.
– #8. FP-Tree: Han, J., Pei, J. and Yin, Y. Mining Frequent Patterns without Candidate Generation. In SIGMOD '00.
(III) Link Mining
– #9. PageRank: Brin, S. and Page, L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In WWW-7, 1998.
– #10. HITS: Kleinberg, J. M. Authoritative Sources in a Hyperlinked Environment. SODA, 1998.
(IV) Clustering
– #11. K-Means: MacQueen, J. B. Some Methods for Classification and Analysis of Multivariate Observations. In Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability, 1967.
– #12. BIRCH: Zhang, T., Ramakrishnan, R. and Livny, M. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In SIGMOD '96.
Bagging and Boosting
– #13. AdaBoost: Freund, Y. and Schapire, R. E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 55(1), 119-139, 1997.
(V) Sequential Patterns
– #14. GSP: Srikant, R. and Agrawal, R. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. 5th Int. Conf. on Extending Database Technology, 1996.
– #15. PrefixSpan: Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U. and Hsu, M.-C. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.
Integrated Mining
– #16. CBA: Liu, B., Hsu, W. and Ma, Y. Integrating Classification and Association Rule Mining. KDD-98.
(VI) Rough Sets
– #17. Finding reducts: Pawlak, Z. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Norwell, MA, 1992.
Graph Mining
– #18. gSpan: Yan, X. and Han, J. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

Top-10 Algorithms Finally Selected at ICDM '06
#1: C4.5 (61 votes)
#2: K-Means (60 votes)
#3: SVM (58 votes)
#4: Apriori (52 votes)
#5: EM (48 votes)
#6: PageRank (46 votes)
#7: AdaBoost (45 votes)
#7: kNN (45 votes)
#7: Naive Bayes (45 votes)
#10: CART (34 votes)

Conferences and Journals on Data Mining
KDD conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Other related conferences
– ACM SIGMOD
– VLDB
– (IEEE) ICDE
– WWW, SIGIR
– ICML, CVPR, NIPS
Journals
– Data Mining and Knowledge Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD

Data warehouse
Modern organisations must respond quickly to any change in the market. This requires rapid access to current data, normally stored in operational databases. However, an organisation must also determine which trends are relevant. This task is accomplished with access to historical data stored in large databases called data warehouses.
The main characteristic of a data warehouse is its capacity. A data warehouse is really big – it includes millions, even billions, of data records. The data stored in a data warehouse is time dependent – linked together by the times of recording – and integrated – all relevant information from the operational databases is combined and structured in the warehouse.

Query tools
A data warehouse is designed to support decision-making in the organisation. The information needed can be obtained with query tools. Query tools are assumption-based – a user must ask the right questions.
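To make "assumption-based" concrete: with a query tool the analyst must already suspect a relationship before asking about it. A hypothetical sketch using Python's built-in sqlite3 (the table and figures are invented):

```python
# "Assumption-based" querying: the analyst must already hypothesize that
# homeownership matters before this question can even be posed.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mailing (household_id INTEGER, homeowner TEXT, responded INTEGER)")
con.executemany(
    "INSERT INTO mailing VALUES (?, ?, ?)",
    [(1, "yes", 0), (2, "no", 1), (3, "no", 0), (4, "yes", 0), (5, "no", 1)],
)

for row in con.execute(
    "SELECT homeowner, AVG(responded) AS response_rate "
    "FROM mailing GROUP BY homeowner"
):
    print(row)
```

Data mining, by contrast, is meant to surface predictors such as homeownership without the analyst naming them in advance.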
How is data mining applied in practice?
Many companies use data mining today but refuse to talk about it. In direct marketing, data mining is used for targeting people who are most likely to buy certain products and services. In trend analysis, it is used to determine trends in the marketplace, for example, to model the stock market. In fraud detection, data mining is used to identify insurance claims, cellular phone calls and credit card purchases that are most likely to be fraudulent.

Motivation: finding latent relationships in data
– What products were often purchased together? Beer and diapers?!
– What are the subsequent purchases after buying a PC?
– What kinds of DNA are sensitive to this new drug?
– Can we automatically classify web documents?

Applications
– Market basket data analysis (shelf space planning, increasing sales, promotion)
– Cross-marketing
– Catalogue design
– Sale campaign analysis
– Web log (click stream) analysis
– DNA sequence analysis

Data mining tools
Data mining is based on intelligent technologies already discussed in this book. It often applies such tools as neural networks and neuro-fuzzy systems. However, the most popular tool used for data mining is the decision tree.

Decision trees
A decision tree can be defined as a map of the reasoning process. It describes a data set by a tree-like structure. Decision trees are particularly good at solving classification problems.

ID3: a training set of examples, each described by (height, hair colour, eye colour) and labelled with class w or e:
(tall, blond, blue) → w
(short, silver, blue) → w
(short, black, blue) → w
(tall, blond, brown) → w
(tall, silver, blue) → w
(short, blond, blue) → w
(short, black, brown) → e
(tall, silver, black) → e
(short, black, brown) → e
(tall, black, brown) → e
(tall, black, black) → e
(short, blond, black) → e

A decision tree consists of nodes, branches and leaves. The top node is called the root node. The tree always starts from the root node and grows down by splitting the data at each level into new nodes. The root node contains the entire data set (all data records), and child nodes hold respective subsets of that set. All nodes are connected by branches. Nodes that are at the end of branches are called terminal nodes, or leaves.

How does a decision tree select splits?
A split in a decision tree corresponds to the predictor with the maximum separating power. The best split does the best job of creating nodes where a single class dominates. One of the best-known methods of calculating a predictor's power to separate data is based on the Gini coefficient of inequality.
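One common concrete form of this idea is the Gini impurity used by CART-style trees: the best split is the one whose child nodes have the lowest weighted impurity. A sketch applied to the ID3 training set above (an illustration of the principle, not necessarily the book's exact procedure):

```python
# Score each attribute of the (height, hair, eyes) data set above by the
# weighted Gini impurity of the children its split would produce.
records = [
    ("tall", "blond", "blue", "w"), ("short", "silver", "blue", "w"),
    ("short", "black", "blue", "w"), ("tall", "blond", "brown", "w"),
    ("tall", "silver", "blue", "w"), ("short", "blond", "blue", "w"),
    ("short", "black", "brown", "e"), ("tall", "silver", "black", "e"),
    ("short", "black", "brown", "e"), ("tall", "black", "brown", "e"),
    ("tall", "black", "black", "e"), ("short", "blond", "black", "e"),
]

def gini(rows):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(rows)
    if n == 0:
        return 0.0
    classes = {r[-1] for r in rows}
    return 1.0 - sum((sum(r[-1] == c for r in rows) / n) ** 2 for c in classes)

def split_score(rows, attr):
    """Weighted impurity of the children after splitting on attribute `attr`."""
    score = 0.0
    for v in {r[attr] for r in rows}:
        part = [r for r in rows if r[attr] == v]
        score += len(part) / len(rows) * gini(part)
    return score

for attr, name in enumerate(["height", "hair", "eyes"]):
    print(name, round(split_score(records, attr), 3))
# Lowest weighted impurity wins: eyes (0.125) beats hair (0.369) and height (0.5).
```

Here eye colour produces the purest child nodes, so it would be chosen as the first split.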
Major Issues in Data Mining (1)
Mining methodology
– Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
– Performance: efficiency, effectiveness and scalability
– Pattern evaluation: the interestingness problem
– Incorporation of background knowledge
– Handling noise and incomplete data
– Parallel, distributed and incremental mining methods
– Integration of the discovered knowledge with existing knowledge: knowledge fusion

Major Issues in Data Mining (2)
User interaction
– Data mining query languages and ad hoc mining
– Expression and visualization of data mining results
– Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
– Domain-specific data mining and invisible data mining
– Protection of data security, integrity and privacy

Summary (1)
Data mining: discovering interesting patterns from large amounts of data
A natural evolution of database technology, in great demand, with wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation and knowledge presentation

Summary (2)
Mining can be performed in a variety of information repositories
Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
Data mining systems and architectures
Major issues in data mining

Thank you

An example of a decision tree
Household (responded: 112, not responded: 888, total: 1000)
– Homeownership = Yes: responded: 9, not responded: 334, total: 343
– Homeownership = No: responded: 103, not responded: 554, total: 657
  » Household income ≤ $20,700: responded: 14, not responded: 158, total: 172
  » Household income ≥ $20,701: responded: 89, not responded: 396, total: 485
    · Savings account = Yes: responded: 86, not responded: 188, total: 274
    · Savings account = No: responded: 3, not responded: 208, total: 211

The Gini coefficient
The Gini coefficient is a measure of how well the predictor separates the classes contained in the parent node. Gini, an Italian economist, introduced a rough measure of the amount of inequality in the income distribution in a country.

Computation of the Gini coefficient
[Figure: a Lorenz curve – cumulative % of population on the horizontal axis against cumulative % of income on the vertical axis, with the diagonal line of perfect equality]
The Gini coefficient is calculated as the area between the curve and the diagonal divided by the area below the diagonal. For a perfectly equal wealth distribution, the Gini coefficient is equal to zero.
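That definition can be computed directly: sort the incomes, build the Lorenz curve, and take the area between the curve and the diagonal relative to the area under the diagonal. A sketch with made-up incomes and a trapezoidal approximation of the curve:

```python
# Gini coefficient of an income distribution: the area between the Lorenz
# curve and the diagonal, divided by the area under the diagonal (0.5).
import numpy as np

incomes = np.sort(np.array([10, 20, 35, 50, 50, 80, 150, 400], dtype=float))

pop = np.arange(len(incomes) + 1) / len(incomes)                      # cumulative % population
lorenz = np.concatenate([[0.0], np.cumsum(incomes)]) / incomes.sum()  # cumulative % income

# trapezoidal area under the Lorenz curve
area_under_lorenz = ((lorenz[1:] + lorenz[:-1]) / 2 * np.diff(pop)).sum()
gini = (0.5 - area_under_lorenz) / 0.5

print(round(gini, 3))   # 0 for perfect equality, approaching 1 for extreme inequality
```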
Selecting an optimal decision tree: (a) splits selected by the Gini coefficient
Root (Class A: 100, Class B: 50, total: 150)
– Predictor 1 = yes (Class A: 63, Class B: 38, total: 101)
  » Predictor 2 = yes (Class A: 4, Class B: 37, total: 41)
    · Predictor 3 = yes (Class A: 0, Class B: 36, total: 36)
    · Predictor 3 = no (Class A: 4, Class B: 1, total: 5)
  » Predictor 2 = no (Class A: 59, Class B: 1, total: 60)
– Predictor 1 = no (Class A: 37, Class B: 12, total: 49)
  » Predictor 4 = yes (Class A: 25, Class B: 4, total: 29)
    · Predictor 5 = yes (Class A: 2, Class B: 1, total: 3)
    · Predictor 5 = no (Class A: 23, Class B: 3, total: 26)
  » Predictor 4 = no (Class A: 12, Class B: 8, total: 20)
    · Predictor 6 = yes (Class A: 1, Class B: 8, total: 9)
    · Predictor 6 = no (Class A: 11, Class B: 0, total: 11)

Selecting an optimal decision tree: (b) splits selected by guesswork
Root (Class A: 100, Class B: 50, total: 150)
– Predictor 5 = yes (Class A: 19, Class B: 14, total: 33)
  » Predictor 2 = yes (Class A: 17, Class B: 6, total: 23)
  » Predictor 2 = no (Class A: 2, Class B: 8, total: 10)
– Predictor 5 = no (Class A: 81, Class B: 36, total: 117)
  » Predictor 3 = yes (Class A: 46, Class B: 21, total: 67)
    · Predictor 1 = yes (Class A: 37, Class B: 14, total: 51)
      - Predictor 4 = yes (Class A: 8, Class B: 9, total: 17)
      - Predictor 4 = no (Class A: 29, Class B: 5, total: 34)
    · Predictor 1 = no (Class A: 9, Class B: 7, total: 16)
  » Predictor 3 = no (Class A: 35, Class B: 15, total: 50)
    · Predictor 6 = yes (Class A: 23, Class B: 9, total: 32)
    · Predictor 6 = no (Class A: 12, Class B: 6, total: 18)

Gain chart of Class A
[Figure: gain chart – % of Class A captured against % of population; the curve for the Gini splits lies well above the curve for manual split selection]

Can we extract rules from a decision tree?
The path from the root node to a bottom leaf reveals a decision rule. For example, the rule associated with the right bottom leaf of the tree built with Gini splits can be represented as follows:
if (Predictor 1 = no) and (Predictor 4 = no) and (Predictor 6 = no)
then class = Class A

Case study: profiling people with high blood pressure
A typical task for decision trees is to determine conditions that may lead to certain outcomes. Blood pressure can be categorised as optimal, normal or high. Optimal pressure is below 120/80, normal is between 120/80 and 130/85, and hypertension is diagnosed when blood pressure is over 140/90.
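These thresholds translate directly into a categorisation function. A minimal sketch; note that the slide does not say how to label readings falling between 130/85 and 140/90, so the "borderline" label below is an assumption:

```python
# Categorize a blood pressure reading using the case study's thresholds:
# optimal below 120/80, normal up to 130/85, high above 140/90.

def bp_category(systolic, diastolic):
    if systolic < 120 and diastolic < 80:
        return "optimal"
    if systolic <= 130 and diastolic <= 85:
        return "normal"
    if systolic > 140 or diastolic > 90:
        return "high"
    return "borderline"   # gap the slide leaves unspecified (assumption)

print(bp_category(115, 75))   # optimal
print(bp_category(128, 82))   # normal
print(bp_category(150, 95))   # high
```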
A data set for a hypertension study
Community Health Survey: Hypertension Study (California, U.S.A.)
Gender: Male | Female
Age: 18–34 years | 35–50 years | 51–64 years | 65 or more years
Race: Caucasian | African American | Hispanic | Asian or Pacific Islander
Marital status: Married | Separated | Divorced | Widowed | Never married
Household income: Less than $20,700 | $20,701–$45,000 | $45,001–$75,000 | $75,001 and over

A data set for a hypertension study (continued)
Alcohol consumption: Abstain from alcohol | Occasional (a few drinks per month) | Regular (one or two drinks per day) | Heavy (three or more drinks per day)
Smoking: Nonsmoker | 1–10 cigarettes per day | 11–20 cigarettes per day | More than one pack per day
Caffeine intake: Abstain from coffee | One or two cups per day | Three or more cups per day
Salt intake: Low-salt diet | Moderate-salt diet | High-salt diet
Physical activities: None | One or two times per week | Three or more times per week
Weight (kg) and height (cm): free entry, e.g., 93 kg, 170 cm
Blood pressure: Optimal | Normal | High

Data cleaning
Decision trees are only as good as the data they represent. Unlike neural networks and fuzzy systems, decision trees do not tolerate noisy and polluted data. Therefore, the data must be cleaned before we can start data mining. We might find that such fields as Alcohol Consumption or Smoking have been left blank or contain incorrect information.

Data enriching
From such variables as weight and height we can easily derive a new variable, obesity. This variable is calculated with the body-mass index (BMI), that is, the weight in kilograms divided by the square of the height in metres. Men with a BMI of 27.8 or higher and women with a BMI of 27.3 or higher are classified as obese.

A data set for a hypertension study (continued)
Obesity: Obese | Not obese
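The enrichment described above is a one-line computation. A sketch using the slide's cutoffs (27.8 for men, 27.3 for women) and the example figures of 93 kg and 170 cm:

```python
# Derive the obesity field from weight and height:
# BMI = weight (kg) / height (m)^2, with gender-specific cutoffs.

def is_obese(weight_kg, height_cm, gender):
    bmi = weight_kg / (height_cm / 100) ** 2
    cutoff = 27.8 if gender == "male" else 27.3
    return bmi >= cutoff

print(is_obese(93, 170, "male"))   # BMI ~32.2 -> True
```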
Growing a decision tree
Blood pressure (root): optimal: 319 (32%), normal: 528 (53%), high: 153 (15%), total: 1000
Split by Age:
– 18–34 years: optimal: 88 (56%), normal: 64 (41%), high: 5 (3%), total: 157
– 35–50 years: optimal: 208 (35%), normal: 340 (57%), high: 48 (8%), total: 596
– 51–64 years: optimal: 21 (12%), normal: 90 (52%), high: 62 (36%), total: 173
– 65 or more years: optimal: 2 (3%), normal: 34 (46%), high: 38 (51%), total: 74

Growing a decision tree (continued)
The 51–64 years node is split by Obesity:
– Obese: optimal: 3 (3%), normal: 53 (49%), high: 51 (48%), total: 107
– Not obese: optimal: 18 (27%), normal: 37 (56%), high: 11 (17%), total: 66

Growing a decision tree (continued)
The Obese node is split by Race:
– Caucasian: optimal: 2 (5%), normal: 24 (55%), high: 17 (40%), total: 43
– African American: optimal: 0 (0%), normal: 13 (35%), high: 24 (65%), total: 37
– Hispanic: optimal: 0 (0%), normal: 11 (58%), high: 8 (42%), total: 19
– Asian: optimal: 1 (12%), normal: 5 (63%), high: 2 (25%), total: 8

Solution space of the hypertension study
The solution space is first divided into four rectangles by age; then the 51–64 age group is further divided into those who are obese and those who are not; and finally the group of obese people is divided by race.
[Figure: the solution space partitioned into rectangles with group sizes 157, 596, 74, 66, 43, 37, 19 and 8]

Hypertension study: forcing a split
Blood pressure (root): optimal: 319 (32%), normal: 528 (53%), high: 153 (15%), total: 1000
Split by Age, with the 35–50 and 51–64 nodes then forced to split by Gender:
– 18–34 years: optimal: 88 (56%), normal: 64 (41%), high: 5 (3%), total: 157
– 35–50 years: optimal: 208 (35%), normal: 340 (57%), high: 48 (8%), total: 596
  » Male: optimal: 111 (36%), normal: 168 (55%), high: 28 (9%), total: 307
  » Female: optimal: 97 (34%), normal: 172 (59%), high: 20 (7%), total: 289
– 51–64 years: optimal: 21 (12%), normal: 90 (52%), high: 62 (36%), total: 173
  » Male: optimal: 11 (13%), normal: 48 (56%), high: 27 (31%), total: 86
  » Female: optimal: 10 (12%), normal: 42 (48%), high: 35 (40%), total: 87
– 65 or more years: optimal: 2 (3%), normal: 34 (46%), high: 38 (51%), total: 74

Advantages of decision trees
The main advantage of the decision-tree approach to data mining is that it visualises the solution: it is easy to follow any path through the tree. Relationships discovered by a decision tree can be expressed as a set of rules, which can then be used in developing an expert system.

Drawbacks of decision trees
Continuous data, such as age or income, have to be grouped into ranges, which can unwittingly hide important patterns.
Handling of missing and inconsistent data – decision trees can produce reliable outcomes only when they deal with "clean" data.
Inability to examine more than one variable at a time. This confines trees to problems that can be solved by dividing the solution space into several successive rectangles.

In spite of all these limitations, decision trees have become the most successful technology used for data mining. Their ability to produce clear sets of rules makes decision trees particularly attractive to business professionals.
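As a closing illustration of that advantage, an illustrative sketch with scikit-learn: fit a tree on synthetic data loosely echoing the hypertension study (the data, feature names and planted rule are all invented) and read it back out as rules with export_text:

```python
# Fit a Gini-based decision tree and print it as human-readable rules.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 1000
age = rng.integers(18, 80, size=n)
obese = rng.integers(0, 2, size=n)
# planted (synthetic) rule: high blood pressure is likelier for older, obese people
high_bp = ((age > 50) & (obese == 1) & (rng.random(n) < 0.8)) | (rng.random(n) < 0.05)

X = np.column_stack([age, obese])
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, high_bp)
print(export_text(tree, feature_names=["age", "obese"]))
```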