Part I: Data Mining Fundamentals
Chapter 1: Data Mining: A First View

Jason C. H. Chen, Ph.D.
Professor of MIS
School of Business Administration
Gonzaga University
Spokane, WA 99223
[email protected]

A/W & Dr. Chen, Data Mining

1.1 Data Mining: A Definition

Data Mining
• The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.

Induction-based Learning
• The process of forming general concept definitions by observing specific examples of concepts to be learned.

Knowledge Discovery in Databases (KDD)
• The application of the scientific method to data mining. Data mining is one step of the KDD process.

Data Mining Examples
• A telephone company used a data mining tool to analyze its customer data warehouse. The tool found about 10,000 supposedly residential customers who were spending over $1,000 per month on phone bills.
• After further study, the phone company discovered that these customers were really small business owners trying to avoid paying business rates.

Other Data Mining Examples
• 65% of customers who did not use the credit card in the last six months are 88% likely to cancel their accounts.
• If age < 30 and income <= $25,000 and credit rating < 3 and credit amount > $25,000, then the minimum loan term is 10 years.
• 82% of customers who bought a new TV 27" or larger are 90% likely to buy an entertainment center within the next 4 weeks.

1.2 What Can Computers Learn?

Four Levels of Learning
• Fact: a simple statement of truth.
• Concept: a set of objects, symbols, or events grouped together because they share certain characteristics.
• Procedure: a step-by-step course of action to achieve a goal. We use procedures in our everyday functioning as well as in the solution of difficult problems.
• Principle: the highest level of learning. Principles are general truths or laws that are basic to other truths.
Source: Merrill and Tennyson, 1977 (p. 5 of the text)

Concepts
• Computers are good at learning concepts. Concepts are the output of a data mining session.

Three Concept Views
• Classical View: asserts that all concepts have definite defining properties.
• Probabilistic View: concepts are represented by properties that are probable of concept members.
• Exemplar View: a given instance is judged to be an example of a particular concept if it is similar enough to a set of one or more known examples of the concept.

Figure: A hierarchy of data mining strategies
Data Mining Strategies
  - Unsupervised Clustering (no output attributes)
  - Supervised Learning
      - Classification (categorical/discrete output; current behavior)
      - Estimation (numeric output)
      - Prediction (future outcome; categorical or numeric)
  - Market Basket Analysis

Supervised Learning
Supervised learning is the process of building classification models using data instances of known origin. It has two purposes:
1. Build a learner (classification) model using data instances of known origin. This is an induction process.
2. Use the model to determine the outcome of new instances of unknown origin. This is a deduction process.

Supervised Learning: A Decision Tree Example

Decision Tree
• A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
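As a minimal sketch of this definition, the Python below encodes the tree that Figure 1.1 gives for the training data of Table 1.1 and uses it, deductively, on the three unclassified patients of Table 1.2. The class names `Leaf` and `Test`, the function `classify`, and the dictionary layout are illustrative assumptions, not part of the text; only the attribute tests, outcomes, and patient values come from the slides.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """Terminal node: holds a decision outcome."""
    outcome: str


@dataclass
class Test:
    """Non-terminal node: tests one attribute and branches on its value."""
    attribute: str
    branches: Dict[str, "Node"]


Node = Union[Leaf, Test]


def classify(node: Node, instance: Dict[str, str]) -> str:
    """Follow attribute tests from the root until a terminal node is reached."""
    while isinstance(node, Test):
        node = node.branches[instance[node.attribute]]
    return node.outcome


# The tree of Figure 1.1 (hypothetical encoding of the slide's diagram).
FIGURE_1_1 = Test("Swollen Glands", {
    "Yes": Leaf("Strep throat"),
    "No": Test("Fever", {
        "Yes": Leaf("Cold"),
        "No": Leaf("Allergy"),
    }),
})

# The unclassified instances of Table 1.2.
TABLE_1_2 = {
    11: {"Sore Throat": "No", "Fever": "No", "Swollen Glands": "Yes",
         "Congestion": "Yes", "Headache": "Yes"},
    12: {"Sore Throat": "Yes", "Fever": "Yes", "Swollen Glands": "No",
         "Congestion": "No", "Headache": "Yes"},
    13: {"Sore Throat": "No", "Fever": "No", "Swollen Glands": "No",
         "Congestion": "No", "Headache": "Yes"},
}

for patient_id, patient in TABLE_1_2.items():
    print(patient_id, classify(FIGURE_1_1, patient))
# 11 Strep throat
# 12 Cold
# 13 Allergy
```

Note that the tree consults only Swollen Glands and Fever; the remaining attributes are carried in each instance but never tested.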
Table 1.1: Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold

Figure 1.1: A decision tree for the data in Table 1.1

Swollen Glands?
  Yes -> Diagnosis = Strep Throat
  No  -> Fever?
           No  -> Diagnosis = Allergy
           Yes -> Diagnosis = Cold

Table 1.2: Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?

Production Rules
We can translate any decision tree into a set of production rules. They are rules of the form:
IF <antecedent conditions> THEN <consequent conditions>
• IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
• IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
• IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy

Unsupervised Clustering
• A data mining method that builds models from data without predefined classes (see Table 1.3).
• Data instances are grouped together based on a similarity scheme defined by the clustering system.
• With the help of one or several evaluation techniques, it is up to us to decide the meaning of the formed clusters.
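To make "grouped together based on a similarity scheme" concrete, here is a small sketch that groups the five customers of Table 1.3 without any predefined class. Everything about the scheme is an assumption chosen for illustration, not the method of any particular clustering system: similarity is the fraction of categorical attributes on which two customers agree, an instance joins the first cluster whose first member (its "leader") is at least `threshold` similar, and the threshold of 0.4 is arbitrary. The numeric Trades/Month column is left out to keep the measure simple.

```python
from typing import Dict, List

# Categorical attributes of Table 1.3 (Trades/Month omitted; see lead-in).
TABLE_1_3: Dict[int, Dict[str, str]] = {
    1005: {"account": "Joint", "margin": "No", "method": "Online", "sex": "F",
           "age": "30-39", "recreation": "Tennis", "income": "40-59K"},
    1013: {"account": "Custodial", "margin": "No", "method": "Broker", "sex": "F",
           "age": "50-59", "recreation": "Skiing", "income": "80-99K"},
    1245: {"account": "Joint", "margin": "No", "method": "Online", "sex": "M",
           "age": "20-29", "recreation": "Golf", "income": "20-39K"},
    2110: {"account": "Individual", "margin": "Yes", "method": "Broker", "sex": "M",
           "age": "30-39", "recreation": "Fishing", "income": "40-59K"},
    1001: {"account": "Individual", "margin": "Yes", "method": "Online", "sex": "M",
           "age": "40-49", "recreation": "Golf", "income": "60-79K"},
}


def similarity(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Fraction of attributes on which two instances have the same value."""
    matches = sum(1 for key in a if a[key] == b[key])
    return matches / len(a)


def leader_clusters(instances: Dict[int, Dict[str, str]],
                    threshold: float) -> List[List[int]]:
    """Single pass: join the first cluster whose leader is similar enough,
    otherwise start a new cluster with this instance as its leader."""
    clusters: List[List[int]] = []
    for customer_id, row in instances.items():
        for cluster in clusters:
            leader = instances[cluster[0]]
            if similarity(row, leader) >= threshold:
                cluster.append(customer_id)
                break
        else:
            clusters.append([customer_id])
    return clusters


clusters = leader_clusters(TABLE_1_3, threshold=0.4)
print(clusters)  # [[1005, 1245], [1013], [2110, 1001]]
```

The result depends on both the threshold and the order in which instances are visited, and the code attaches no meaning to the groups. Deciding what a cluster such as {1005, 1245} represents is still left to the analyst.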
Table 1.3: Acme Investors Incorporated

Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5             M    40–49  Golf                 60–79K

Possible Questions

Questions for supervised learning
1. Can I develop a general profile of an online investor? If so, what characteristics distinguish online investors from investors who use a broker?
2. Can I determine whether a new customer who does not initially open a margin account is likely to do so in the future?
3. Can I build a model that accurately predicts the average number of trades per month for a new investor?
4. What characteristics differentiate female and male investors?

Questions for unsupervised learning
1. What attribute similarities group customers of Acme Investors together?
2. What differences in attribute values segment the customer database?

1.3 Is Data Mining Appropriate for My Problem?

Data Mining or Data Query?
• Shallow Knowledge: factual; tools used: DBMS/SQL.
• Multidimensional Knowledge: factual; tools used: OLAP.
• Hidden Knowledge: patterns or regularities in data that cannot be easily found; tools used: data mining.
• Deep Knowledge: knowledge stored in a database that can only be found if we are given some direction.

Data Mining vs. Data Query: An Example
• Use data query if you already know, or nearly know, what you are looking for.
• Use data mining to find regularities in data that are not obvious.

1.4 Expert Systems or Data Mining?
Expert System and Knowledge Engineer
• An expert system is a computer program that emulates the problem-solving skills of one or more human experts.
• A knowledge engineer is a person trained to interact with an expert in order to capture the expert's knowledge.

Figure: two routes to the same rule
Data -> Data Mining Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat
Human Expert -> Knowledge Engineer -> Expert System Building Tool -> If Swollen Glands = Yes Then Diagnosis = Strep Throat

1.5 A Simple Data Mining Process Model

Figure 1.3: A simple data mining process model
(Components shown: Operational Database, Data Warehouse, SQL Queries, Data Mining, Interpretation & Evaluation, Result Application)

Characteristics of Data Warehouse
• Data Warehouse:
  - Definition: a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
  - Subject-oriented: e.g., customers, patients, students, products
  - Integrated: consistent naming conventions, formats, and encoding structures; drawn from multiple data sources
  - Time-variant: can be used to study trends and changes
  - Non-updatable: read-only, periodically refreshed
• Data Mart:
  - A data warehouse that is limited in scope

A four-step process for performing a data mining session
1. Assembling the data
  - Operational database (relational databases and flat files) vs. data warehouse
2. Mining the data (giving the data to a mining tool)
  - Instances for building the model or testing the model
3. Interpreting the results
4. Result application

1.7 Data Mining Applications (p. 24)
• Fraud detection
• Health care
• Business and finance
• Scientific applications
• Sports and gaming

Figure: Customer Intrinsic Value
(Plot of Intrinsic (Predicted) Value against Actual Value, with customers A, B, and C marked)