Part I Data Mining Fundamentals

Chapter 1 Data Mining: A First View

1.1 Data Mining: A Definition

Data Mining
The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.

Induction-based Learning
The process of forming general concept definitions by observing specific examples of concepts to be learned.

Knowledge Discovery in Databases (KDD)
The application of the scientific method to data mining. Data mining is one step of the KDD process.

1.2 What Can Computers Learn?

Four Levels of Learning
• Facts
• Concepts
• Procedures
• Principles

Facts
A fact is a simple statement of truth.

Concepts
A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.

Procedures
A procedure is a step-by-step course of action to achieve a goal.

Principles
Principles are general truths or laws that are basic to other truths.

Computers & Learning
Computers are good at learning concepts. Concepts are the output of a data mining session.

Three Concept Views
• Classical View
• Probabilistic View
• Exemplar View

Classical View
All concepts have definite defining properties.

Probabilistic View
People store and recall concepts as generalizations created by observations.

Exemplar View
People store and recall likely concept exemplars that are used to classify unknown instances.

Supervised Learning
• Build a learner model using data instances of known origin.
• Use the model to determine the outcome for new instances of unknown origin.

Supervised Learning: A Decision Tree Example

Decision Tree
A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
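The decision tree definition can be made concrete with a small sketch. The code below is only an illustration, not the chapter's own implementation: the class names `Test` and `Leaf` and the function `classify` are hypothetical, the tree is the one shown in Figure 1.1, and the three unclassified patients are taken from Table 1.2.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    """Terminal node: holds a decision outcome."""
    outcome: str


@dataclass
class Test:
    """Non-terminal node: tests one attribute and branches on its value."""
    attribute: str
    branches: Dict[str, "Node"]


Node = Union[Leaf, Test]

# The tree of Figure 1.1 (hypothetical encoding).
TREE = Test("Swollen Glands", {
    "Yes": Leaf("Strep throat"),
    "No": Test("Fever", {
        "Yes": Leaf("Cold"),
        "No": Leaf("Allergy"),
    }),
})

# Unclassified instances from Table 1.2, keyed by patient ID.
UNKNOWN = {
    11: {"Sore Throat": "No", "Fever": "No", "Swollen Glands": "Yes",
         "Congestion": "Yes", "Headache": "Yes"},
    12: {"Sore Throat": "Yes", "Fever": "Yes", "Swollen Glands": "No",
         "Congestion": "No", "Headache": "Yes"},
    13: {"Sore Throat": "No", "Fever": "No", "Swollen Glands": "No",
         "Congestion": "No", "Headache": "Yes"},
}


def classify(node: Node, instance: Dict[str, str]) -> str:
    """Follow attribute tests from the root until a terminal node is reached."""
    while isinstance(node, Test):
        node = node.branches[instance[node.attribute]]
    return node.outcome


for patient_id, instance in UNKNOWN.items():
    print(patient_id, classify(TREE, instance))
```

Under these assumptions, patients 11, 12, and 13 are classified as Strep throat, Cold, and Allergy respectively. Only Swollen Glands and Fever are ever tested; the other attributes do not appear in the tree.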
Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold

Swollen Glands
  Yes: Diagnosis = Strep Throat
  No: Fever
    Yes: Diagnosis = Cold
    No: Diagnosis = Allergy

Figure 1.1 A decision tree for the data in Table 1.1

Table 1.2 • Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?

Production Rules
IF Swollen Glands = Yes
THEN Diagnosis = Strep Throat

IF Swollen Glands = No & Fever = Yes
THEN Diagnosis = Cold

IF Swollen Glands = No & Fever = No
THEN Diagnosis = Allergy

Unsupervised Clustering
A data mining method that builds models from data without predefined classes.

The Acme Investors Dataset

Table 1.3 • Acme Investors Incorporated

Customer ID  Account Type  Margin Account  Transaction Method  Trades/Month  Sex  Age    Favorite Recreation  Annual Income
1005         Joint         No              Online              12.5          F    30–39  Tennis               40–59K
1013         Custodial     No              Broker              0.5           F    50–59  Skiing               80–99K
1245         Joint         No              Online              3.6           M    20–29  Golf                 20–39K
2110         Individual    Yes             Broker              22.3          M    30–39  Fishing              40–59K
1001         Individual    Yes             Online              5.0           M    40–49  Golf                 60–79K

The Acme Investors Dataset & Supervised Learning
1. Can I develop a general profile of an online investor?
2. Can I determine if a new customer is likely to open a margin account?
3. Can I build a model to predict the average number of trades per month for a new investor?
4. What characteristics differentiate female and male investors?
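Unsupervised clustering, as defined above, groups instances without a predefined class, so it needs some notion of similarity between instances. The sketch below is not a clustering algorithm and is not taken from the chapter. It assumes a simple matching measure (the count of categorical attributes on which two customers agree) and applies it to the five customers of Table 1.3. The function name `shared_attributes` is hypothetical, and the numeric Trades/Month column is left out for simplicity.

```python
from itertools import combinations

# Categorical attributes from Table 1.3 (Trades/Month omitted: it is numeric).
CUSTOMERS = {
    1005: {"Account Type": "Joint", "Margin Account": "No",
           "Transaction Method": "Online", "Sex": "F", "Age": "30-39",
           "Favorite Recreation": "Tennis", "Annual Income": "40-59K"},
    1013: {"Account Type": "Custodial", "Margin Account": "No",
           "Transaction Method": "Broker", "Sex": "F", "Age": "50-59",
           "Favorite Recreation": "Skiing", "Annual Income": "80-99K"},
    1245: {"Account Type": "Joint", "Margin Account": "No",
           "Transaction Method": "Online", "Sex": "M", "Age": "20-29",
           "Favorite Recreation": "Golf", "Annual Income": "20-39K"},
    2110: {"Account Type": "Individual", "Margin Account": "Yes",
           "Transaction Method": "Broker", "Sex": "M", "Age": "30-39",
           "Favorite Recreation": "Fishing", "Annual Income": "40-59K"},
    1001: {"Account Type": "Individual", "Margin Account": "Yes",
           "Transaction Method": "Online", "Sex": "M", "Age": "40-49",
           "Favorite Recreation": "Golf", "Annual Income": "60-79K"},
}


def shared_attributes(a, b):
    """Return the sorted names of attributes on which two customers agree."""
    return sorted(name for name in a if a[name] == b[name])


# Rank every pair of customers by the number of matching attribute values.
pairs = sorted(
    combinations(CUSTOMERS, 2),
    key=lambda pair: len(shared_attributes(CUSTOMERS[pair[0]], CUSTOMERS[pair[1]])),
    reverse=True,
)

for first, second in pairs:
    common = shared_attributes(CUSTOMERS[first], CUSTOMERS[second])
    print(first, second, len(common), common)
```

With this measure, three pairs agree on three attributes each (1005 and 1245, 1245 and 1001, 2110 and 1001), while customers 1013 and 1001 agree on none. A real clustering method would use such a similarity measure to form groups; the choice of measure here is only an assumption.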
The Acme Investors Dataset & Unsupervised Clustering
1. What attribute similarities group customers of Acme Investors together?
2. What differences in attribute values segment the customer database?

1.3 Is Data Mining Appropriate for My Problem?

Data Mining or Data Query?
• Shallow Knowledge
• Multidimensional Knowledge
• Hidden Knowledge
• Deep Knowledge

Shallow Knowledge
Shallow knowledge is factual. It can be easily stored and manipulated in a database.

Multidimensional Knowledge
Multidimensional knowledge is also factual. Online Analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge.

Hidden Knowledge
Hidden knowledge represents patterns or regularities in data that cannot be easily found using a database query. However, data mining algorithms can find such patterns with ease.

Deep Knowledge
Deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.

Data Mining vs. Data Query: An Example
• Use a data query if you already know, or almost know, what you are looking for.
• Use data mining to find regularities in data that are not obvious.

1.4 Expert Systems or Data Mining?

Expert System
A computer program that emulates the problem-solving skills of one or more human experts.

Knowledge Engineer
A person trained to interact with an expert in order to capture their knowledge.

Data → Data Mining Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat
Human Expert → Knowledge Engineer → Expert System Building Tool → If Swollen Glands = Yes Then Diagnosis = Strep Throat

Figure 1.2 Data mining vs. expert systems

1.5 A Simple Data Mining Process Model

[Diagram with the stages: Operational Database, Data Warehouse, SQL Queries, Data Mining, Interpretation & Evaluation, Result Application]

Figure 1.3 A simple data mining process model

Assembling the Data
• The Data Warehouse
• Relational Databases and Flat Files

The Data Warehouse
The data warehouse is a historical database designed for decision support.

Mining the Data

Interpreting the Results

Result Application

1.6 Why Not Simple Search?
• Nearest Neighbor Classifier
• K-nearest Neighbor Classifier

Nearest Neighbor Classifier
Classification is performed by searching the training data for the instance closest in distance to the unknown instance.

1.7 Data Mining Applications

Customer Intrinsic Value

[Plot of customers, marked X or _, on the axes Intrinsic (Predicted) Value and Actual Value]

Figure 1.4 Intrinsic vs. actual customer value
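Returning to Section 1.6, the nearest neighbor classifier can be sketched in a few lines. The slides do not say how distance is measured, so this sketch assumes the distance between two patients is the number of attributes on which they differ (a Hamming distance). The training data is Table 1.1, the query instances are from Table 1.2, and the function names are hypothetical.

```python
# Attribute order: Sore Throat, Fever, Swollen Glands, Congestion, Headache.
# Each row of Table 1.1: (patient ID, attribute values, diagnosis).
TRAINING = [
    (1, ("Yes", "Yes", "Yes", "Yes", "Yes"), "Strep throat"),
    (2, ("No", "No", "No", "Yes", "Yes"), "Allergy"),
    (3, ("Yes", "Yes", "No", "Yes", "No"), "Cold"),
    (4, ("Yes", "No", "Yes", "No", "No"), "Strep throat"),
    (5, ("No", "Yes", "No", "Yes", "No"), "Cold"),
    (6, ("No", "No", "No", "Yes", "No"), "Allergy"),
    (7, ("No", "No", "Yes", "No", "No"), "Strep throat"),
    (8, ("Yes", "No", "No", "Yes", "Yes"), "Allergy"),
    (9, ("No", "Yes", "No", "Yes", "Yes"), "Cold"),
    (10, ("Yes", "Yes", "No", "Yes", "Yes"), "Cold"),
]

# Unclassified instances from Table 1.2, in the same attribute order.
UNKNOWN = {
    11: ("No", "No", "Yes", "Yes", "Yes"),
    12: ("Yes", "Yes", "No", "No", "Yes"),
    13: ("No", "No", "No", "No", "Yes"),
}


def distance(a, b):
    """Assumed measure: count of attributes whose values differ."""
    return sum(x != y for x, y in zip(a, b))


def nearest_neighbor(instance):
    """Return (patient ID, diagnosis) of the closest training instance.

    Ties are broken by training order, which is an arbitrary choice.
    """
    patient_id, _, diagnosis = min(
        TRAINING, key=lambda row: distance(row[1], instance)
    )
    return patient_id, diagnosis


for patient_id, instance in UNKNOWN.items():
    print(patient_id, nearest_neighbor(instance))
```

With this distance, patient 11 is closest to patient 2 and is labelled Allergy, whereas the decision tree of Figure 1.1 labels the same patient Strep throat. The search weighs every attribute equally, while the tree tests only Swollen Glands and Fever, so the two approaches can disagree.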