
CS-470: Data Mining
Fall 2009

Organizational Details
Class Meeting: 4:00-6:45pm, Tuesday, Room SCIT215
Instructor: Dr. Igor Aizenberg
Office: Science and Technology Building, 104C
Phone: (903) 334-6654
e-mail: [email protected]
Office hours: Monday, Wednesday 10am-6pm; Tuesday 11pm-3pm
Class Web Page: http://www.eagle.tamut.edu/faculty/igor/CS-470.htm

Text Book
• R. J. Roiger, M. W. Geatz, Data Mining: A Tutorial-Based Primer, Addison Wesley, 2003, ISBN 0-201-74128-8

Control
Exams (open book, open notes):
    Exam 1: October 6, 2009
    Exam 2: November 10, 2009
    Exam 3: December 8, 2009
Homework

Grading
Grading Method:
    Homework and preparation: Exam 1: 30% Exam 2: 30% Exam 3: 10% 30%
Grading Scale:
    90%+: A
    80%+: B
    70%+: C
    60%+: D
    less than 60%: F

Data Mining: A First View

Data Mining: A Definition
The process of employing one or more machine learning techniques to automatically analyze and extract knowledge from data.
The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules.

What Is Data Mining?
• Data mining (knowledge discovery in databases) is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories.
• Machine learning and data mining are interested in the process of discovering knowledge that may be structurally or semantically more complex: models, graphs, new theorems or theories, in particular to assist scientific discovery.

Why Data Mining? Potential Applications
• Database analysis and decision support
    – Market analysis and management: target marketing, customer relation management, market basket analysis, cross selling, market segmentation
    – Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
    – Fraud detection and management
• Other Applications
    – Text mining (news group, email, documents) and Web analysis.
    – Intelligent query answering.
    – Medical decision support.

Market Analysis and Management (1)
• Where are the data sources for analysis?
    – Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
    – Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
    – Conversion of a single to a joint bank account: marriage, etc.
• Cross-market analysis
    – Associations/correlations between product sales
    – Prediction based on the association information

[Figures: Market Analysis and Financial Time Series Prediction]

Market Analysis and Management (2)
• Customer profiling
    – data mining can tell you what types of customers buy what products (clustering or classification)
• Identifying customer requirements
    – identifying the best products for different customers
    – use prediction to find what factors will attract new customers
• Provides summary information
    – various multidimensional summary reports
    – statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management
• Finance planning and asset evaluation
    – cash flow analysis and prediction
    – contingent claim analysis to evaluate assets
    – cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning
    – summarize and compare the resources and spending
• Competition
    – monitor competitors and market directions
    – group customers into classes and set a class-based pricing procedure
    – set pricing strategy in a highly competitive market

Fraud Detection and Management (1)
• Applications
    – widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
• Approach
    – use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
• Examples
    – auto insurance: detect a group of people who stage accidents to collect on insurance
    – money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
    – medical insurance: detect professional patients, rings of doctors, and rings of references

Fraud Detection and Management (2)
• Detecting inappropriate medical treatment
    – The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1m/yr).
• Detecting telephone fraud
    – Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
    – British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
• Retail
    – Analysts estimate that 38% of retail shrink is due to dishonest employees.

Other Applications
• Sports
    – IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for the New York Knicks and Miami Heat
• Astronomy
    – JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
• Internet Web Surf-Aid
    – IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Induction-based Learning
The process of forming general concept definitions by observing specific examples of concepts to be learned.

Four Levels of Learning
• Facts
• Concepts
• Procedures
• Principles

Facts
A fact is a simple statement of truth.

Concepts
A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.
Procedures
A procedure is a step-by-step course of action to achieve a goal.

Principles
Principles are general truths or laws that are basic to other truths.

What Can Computers Learn?

Computers & Learning
Computers are good at learning concepts. Concepts are the output of a data mining session.

Three Concept Views
• Classical View
• Probabilistic View
• Exemplar View

Classical View
All concepts have definite defining properties.

Probabilistic View
People store and recall concepts as generalizations created by observations.

Exemplar View
People store and recall likely concept exemplars that are used to classify unknown instances.

Methods of Learning

Supervised Learning
• Build a learner model using data instances of known origin.
• Use the model to determine the outcome of new instances of unknown origin.

Supervised Learning: A Decision Tree Example

Decision Tree
A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold

Decision tree for the data in Table 1.1:

Swollen Glands
    Yes: Diagnosis = Strep Throat
    No: Fever
        No: Diagnosis = Allergy
        Yes: Diagnosis = Cold

Table 1.2 • Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?
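
As a minimal sketch, the decision tree above can be written as two nested tests and applied to the unclassified instances of Table 1.2. The function name `diagnose`, the dictionary layout, and the attribute keys are illustrative assumptions, not part of the course material.

```python
# Sketch of the slide's decision tree as nested attribute tests.
# The names (diagnose, unknown, the dict keys) are hypothetical.

def diagnose(patient):
    """Classify one patient record with the three-leaf tree."""
    if patient["swollen_glands"] == "Yes":
        return "Strep throat"
    if patient["fever"] == "Yes":
        return "Cold"
    return "Allergy"

# Unclassified instances copied from Table 1.2.
unknown = {
    11: {"sore_throat": "No", "fever": "No", "swollen_glands": "Yes",
         "congestion": "Yes", "headache": "Yes"},
    12: {"sore_throat": "Yes", "fever": "Yes", "swollen_glands": "No",
         "congestion": "No", "headache": "Yes"},
    13: {"sore_throat": "No", "fever": "No", "swollen_glands": "No",
         "congestion": "No", "headache": "Yes"},
}

predictions = {pid: diagnose(record) for pid, record in unknown.items()}
for pid, label in predictions.items():
    print(pid, label)
```

Only two of the five attributes are ever tested: sore throat, congestion, and headache play no part in the tree's decisions.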
Production Rules
IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy

Unsupervised Clustering
A data mining method that builds models from data without predefined classes.

The “Acme Investors” Dataset of customers maintaining a brokerage account

Table 1.3 • Acme Investors Incorporated

Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5.0           M    40–49  Golf                 60–79K

The “Acme Investors” Dataset & Supervised Learning
1. Can I develop a general profile of an online investor? (output attribute: transaction method)
2. Can I determine if a new customer is likely to open a margin account? (output attribute: margin account)
3. Can I build a model to predict the average number of trades per month for a new investor? (output attribute: trades/month)
4. What characteristics differentiate female and male investors? (output attribute: sex)

Alternative: The “Acme Investors” Dataset & Unsupervised Clustering
1. What attribute similarities group customers of Acme Investors together?
2. What differences in attribute values segment the customer database?
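
One way to make the first clustering question concrete is to count the attributes on which two customers agree. The sketch below uses a simple matching measure over the categorical columns of Table 1.3 (Trades/Month is left out because it is numeric). The measure and all names in the code are assumptions chosen for illustration; the slides do not prescribe a particular similarity function.

```python
from itertools import combinations

# Categorical attributes of Table 1.3 (hypothetical key names).
ATTRIBUTES = ("account_type", "margin", "method", "sex",
              "age", "recreation", "income")

customers = {
    1005: ("Joint", "No", "Online", "F", "30-39", "Tennis", "40-59K"),
    1013: ("Custodial", "No", "Broker", "F", "50-59", "Skiing", "80-99K"),
    1245: ("Joint", "No", "Online", "M", "20-29", "Golf", "20-39K"),
    2110: ("Individual", "Yes", "Broker", "M", "30-39", "Fishing", "40-59K"),
    1001: ("Individual", "Yes", "Online", "M", "40-49", "Golf", "60-79K"),
}

def shared_attributes(a, b):
    """Names of the attributes on which two customer records agree."""
    return [name for name, x, y in zip(ATTRIBUTES, a, b) if x == y]

def matching_similarity(a, b):
    """Fraction of attributes with identical values (simple matching)."""
    return len(shared_attributes(a, b)) / len(ATTRIBUTES)

# Similarity for every unordered pair of customers.
similarity = {
    (i, j): matching_similarity(customers[i], customers[j])
    for i, j in combinations(customers, 2)
}

for (i, j), score in sorted(similarity.items(), key=lambda kv: -kv[1]):
    print(i, j, round(score, 2), shared_attributes(customers[i], customers[j]))
```

A clustering algorithm would go on to group customers whose pairwise similarity is high; with only five records this sketch stops at the similarity table.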
Clustering
• Clustering is the task of segmenting a heterogeneous population into a number of more homogeneous subgroups (clusters).

Clustering: Two Approaches
• A clustering algorithm requires us to provide an initial best estimate about the total number of clusters in the data (supervised).
• A clustering algorithm uses some method in an attempt to determine a best number of clusters (unsupervised).

Classification
• Classification deals with discrete outcomes: yes or no; big or small; strange or not strange; yellow, green or red; etc.
• Estimation is often used to perform a classification task: estimating the number of children in a family; estimating a family’s total household income; etc.
• Neural networks and regression models are well-suited tools for classification/estimation.

Prediction
• Prediction is the same as classification or estimation, except that the records are classified according to some predicted future behavior or estimated future value.
• Any of the techniques used for classification and estimation can be adapted for use in prediction.

Classification and Prediction: Implementation
• To implement both classification and prediction, we should use training examples in which the value of the variable to be predicted, or the class membership of the instance to be classified, is already known.

Is Data Mining Appropriate for My Problem?

Will Data Mining Help Me?
• Can we clearly define the problem?
• Do potentially meaningful data exist?
• Do the data contain hidden knowledge, or are the data useful for reporting purposes only?
• Will the cost of processing the data be less than the likely increase in profit seen by applying any potential knowledge gained from the data mining?

Data Mining or Data Query?
• Shallow Knowledge
• Multidimensional Knowledge
• Hidden Knowledge
• Deep Knowledge

Shallow Knowledge
Shallow knowledge is factual. It can be easily stored and manipulated in a database.
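
Shallow knowledge of this kind is what an ordinary database query returns. As a minimal sketch, assume a hypothetical `customers` table holding a few columns of Table 1.3, stored in an in-memory SQLite database through Python's standard `sqlite3` module. The table and column names are illustrative, not taken from the course material.

```python
import sqlite3

# In-memory database with a hypothetical schema based on Table 1.3.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, account_type TEXT, "
    "margin_account TEXT, transaction_method TEXT, trades_per_month REAL)"
)
rows = [
    (1005, "Joint", "No", "Online", 12.5),
    (1013, "Custodial", "No", "Broker", 0.5),
    (1245, "Joint", "No", "Online", 3.6),
    (2110, "Individual", "Yes", "Broker", 22.3),
    (1001, "Individual", "Yes", "Online", 5.0),
]
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?)", rows)

# Factual question 1: which customers hold a margin account?
margin_ids = [
    row[0]
    for row in conn.execute(
        "SELECT customer_id FROM customers "
        "WHERE margin_account = 'Yes' ORDER BY customer_id"
    )
]

# Factual question 2: average trades per month among online customers.
avg_online = conn.execute(
    "SELECT AVG(trades_per_month) FROM customers "
    "WHERE transaction_method = 'Online'"
).fetchone()[0]

print(margin_ids, avg_online)
conn.close()
```

Both questions name exactly what is being looked for, which is the mark of a data query. Nothing is discovered here that was not already stated in the stored records.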
Multidimensional Knowledge
Multidimensional knowledge is also factual. On-line Analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge.

Hidden Knowledge
Hidden knowledge represents patterns or regularities in data that cannot be easily found using a database query. However, data mining algorithms can find such patterns with ease.

Deep Knowledge
Deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.

Data Mining or Data Query?
• Shallow Knowledge can be extracted with a database query language such as SQL
• Multidimensional Knowledge can be extracted with On-line Analytical Processing (OLAP) tools
• Hidden Knowledge represents patterns and regularities in data that cannot be easily found
• Deep Knowledge can be found if we are given some direction about what we are looking for

Data Mining vs. Data Query
• Use data query if you already almost know what you are looking for.
• Use data mining to find regularities in data that are not obvious.

A Simple Data Mining Process Model

Knowledge Discovery in Databases (KDD)
The application of the scientific method to data mining. Data mining is one step of the KDD process.

Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process.
[Figure: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, Pattern Evaluation]

The Data Warehouse
The data warehouse is a historical database designed for decision support.

A Simple Data Mining Process Model
[Figure: Operational Database, Data Warehouse, SQL Queries, Data Mining, Interpretation & Evaluation, Result Application]
1. Assemble a collection of data to analyze
2. Present these data to a data mining tool
3. Interpret the results
4. Apply the results to a new problem or situation
