Introduction To Data Mining

What Is Data Mining?
• A tool
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Core of KDD
• Integration of multiple technologies

Part of KDD (Knowledge Discovery in Databases)
[Figure: the KDD process: Data → Selection → Target Data → Preprocessing → Preprocessed Data → Data Mining → Patterns → Interpretation/Evaluation → Knowledge]
Adapted from: U. Fayyad et al. (1996), “From Data Mining to Knowledge Discovery: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press

Integration of Multiple Technologies
[Figure: Data Mining surrounded by Artificial Intelligence, Machine Learning, Database Management, Statistics, Visualization, Algorithms, and other knowledge]

Why Data Mining?
• We are drowning in data (the data explosion problem) but starving for knowledge!
• Solution: data warehousing and data mining
  – Data warehousing and on-line analytical processing
  – Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
• Many potential applications
  – Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  – Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  – Health care
  – …

Data Mining Process
[Figure: data mining process (state the problem + hypothesis; operational database; data warehouse; SQL queries; data mining; interpretation & evaluation; knowledge base; result application)]

Knowledge from Data Mining
• Association rules
• Sequential associations
• Classification rules
• Clustering
• Deviation detection
• …

Association Rules
• Identify associations in the data, e.g. market-basket analysis: “Find groups of items commonly purchased together” (correlation [A,B] and causality [A->B])
• Indicate the significance of each association (an association is only interesting if its confidence exceeds a certain threshold)
• Not every association is interesting (some are too trivial, some are negative associations)
• Example: people who purchase fish are likely to purchase wine

Sequential Associations
• Find event sequences that are unusually likely
• Requires a “training” event list and known “interesting” events
• Must be robust in the face of additional “noise” events
Uses:
• Failure analysis and prediction
Technologies:
• Dynamic programming (dynamic time warping)
• “Custom” algorithms
Example: “Find common sequences of warnings/faults within 10 minute periods”
  – Warn 2 on Switch C preceded by Fault 21 on Switch B
  – Fault 17 on any switch preceded by Warn 2 on any switch

Time    Switch  Event
21:10   B       Fault 21
21:11   A       Warn 2
21:13   C       Warn 2
21:20   A       Fault 17

Classification Rules
• Classify a set of data based on the values of certain attributes
• Requires “training data” that has predefined attributes
Uses:
• Profiling
Technologies:
• Decision tree generation (results are human understandable)
• Neural nets
Example: “Route documents to most likely interested parties”
  – English or non-English?
  – Domestic or foreign?
[Figure: Training Data → tool produces classifier → Groups]

Clustering
• Group a set of data based on the conceptual clustering principle (i.e. maximizing the intraclass similarity and minimizing the interclass similarity)
• No “training data”: there are no predefined attributes
Uses:
• Demographic analysis
Technologies:
• Self-organizing maps
• Probability densities
• Conceptual clustering
Example: “Group people with similar travel profiles”
Clusters:
  – George, Patricia
  – Jeff, Evelyn, Chris
  – Rob

Deviation Detection
• Find unexpected values, outliers
Uses:
• Failure analysis
• Anomaly discovery for analysis
Technologies:
• Clustering/classification methods
• Statistical techniques
• Visualization
Example: “Find unusual occurrences in IBM stock prices”

Sample date  Event            Occurrences
58/07/04     Market closed    317 times
59/01/06     2.5% dividend    2 times
59/04/04     50% stock split  7 times
73/10/09     Not traded       1 time

Date      Close          Volume  Spread
58/07/02  369.50         314.08  .022561
58/07/03  369.25         313.87  .022561
58/07/04  Market Closed
58/07/07  370.00         314.50  .022561

Popular Data Mining Techniques
• Supervised
  – Decision trees
  – Rule induction
  – Regression models
  – Neural networks
  – …
• Unsupervised
  – K-means clustering
  – Self-organizing maps
  – …

Supervised vs. Unsupervised
• Supervised algorithms
  – Learning by example:
    - Use training data in which the value of the response variable is already known
    - Create a model by running the algorithm on the training data
    - Identify a class label for incoming new data
  – Driven by real business problems and historical data
• Unsupervised algorithms
  – Do not use training data
  – Patterns may not be known in advance

Supervised Algorithms

Decision Trees
A decision tree is a tree structure in which non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
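As a minimal sketch of that structure, the Python below stores a tree as nested dictionaries: each non-terminal node names the attribute it tests, and each terminal node is simply a diagnosis string. The tree is the one drawn in Figure 1.1 and the three instances are the unclassified patients of Table 1.2, both of which appear in this section. The dictionary layout and the names `TREE`, `classify`, and `UNKNOWN` are illustrative assumptions, not part of the original slides.

```python
# Non-terminal nodes are dicts that name the attribute to test;
# terminal nodes are plain strings holding the decision outcome.
# The shape follows Figure 1.1 (hypothetical encoding).
TREE = {
    "attribute": "swollen_glands",
    "yes": "Strep throat",
    "no": {
        "attribute": "fever",
        "yes": "Cold",
        "no": "Allergy",
    },
}


def classify(node, instance):
    """Walk from the root to a terminal node, following one test per level."""
    while isinstance(node, dict):
        branch = "yes" if instance[node["attribute"]] else "no"
        node = node[branch]
    return node


# Patients 11-13 from Table 1.2 (True = Yes, False = No).
UNKNOWN = {
    11: {"sore_throat": False, "fever": False, "swollen_glands": True,
         "congestion": True, "headache": True},
    12: {"sore_throat": True, "fever": True, "swollen_glands": False,
         "congestion": False, "headache": True},
    13: {"sore_throat": False, "fever": False, "swollen_glands": False,
         "congestion": False, "headache": True},
}

if __name__ == "__main__":
    for patient_id, instance in UNKNOWN.items():
        print(patient_id, classify(TREE, instance))
```

With this tree, patient 11 is classified as strep throat, patient 12 as a cold, and patient 13 as an allergy. Only two of the five attributes are ever tested, which is part of why such trees are easy to read.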
• Advantages of decision trees
  – Understandable
  – Relatively fast
  – Easy to translate into SQL queries
• Disadvantages of decision trees
  – Limited to one output attribute
  – Decision tree algorithms can be unstable
• Types of decision trees
  – CHAID: Chi-Square Automatic Interaction Detection
  – CART: Classification and Regression Trees
  – …

Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
1            Yes          Yes    Yes             Yes         Yes       Strep throat
2            No           No     No              Yes         Yes       Allergy
3            Yes          Yes    No              Yes         No        Cold
4            Yes          No     Yes             No          No        Strep throat
5            No           Yes    No              Yes         No        Cold
6            No           No     No              Yes         No        Allergy
7            No           No     Yes             No          No        Strep throat
8            Yes          No     No              Yes         Yes       Allergy
9            No           Yes    No              Yes         Yes       Cold
10           Yes          Yes    No              Yes         Yes       Cold

Swollen Glands?
  Yes → Diagnosis = Strep Throat
  No  → Fever?
          No  → Diagnosis = Allergy
          Yes → Diagnosis = Cold
Figure 1.1 A decision tree for the data in Table 1.1

Table 1.2 • Data Instances with an Unknown Classification

Patient ID#  Sore Throat  Fever  Swollen Glands  Congestion  Headache  Diagnosis
11           No           No     Yes             Yes         Yes       ?
12           Yes          Yes    No              No          Yes       ?
13           No           No     No              No          Yes       ?

Rule Induction
• The extraction of useful, independent if-then rules from data based on statistical significance (IF = antecedent, THEN = consequence)
• If rules make conflicting predictions, resolve the conflict according to confidence
• Advantages and disadvantages
  – Understandable
  – May not cover all possible situations
E.g.
IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy

Neural Networks
• Non-linear predictive models that learn through training and resemble biological neural networks in structure
• A means of efficiently modeling large and complex problems in which there may be hundreds of predictor variables that have many interactions
• Disadvantages
  – Difficult to understand
  – Can require significant amounts of time to train and to prepare data
  – …
[Figure: Input Layer, Hidden Layer, Output Layer]
Figure 2.2 A multilayer fully connected neural network

Regression Models
• Statistical techniques
• Use existing values to forecast what other values will be
  Y = a + b1(X1) + b2(X2) + b3(X3) + b4(X4) + b5(X5) …
• Many types of regression (linear regression, logistic regression, …)

K-Means Clustering
• Unsupervised algorithm
• Steps of the algorithm
  1. Choose a value for K, the total number of clusters.
  2. Randomly choose K points as cluster centers.
  3. Assign the remaining instances to their closest cluster center.
  4. Calculate a new cluster center for each cluster.
  5. Repeat steps 3-4 until the cluster centers do not change.
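The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, the use of Euclidean distance, and the sample points are all assumptions added for the example. An empty cluster simply keeps its previous center, which is one of several possible conventions.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster a list of numeric tuples into k groups (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 2: randomly choose K of the points as the initial cluster centers.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 3: assign every instance to its closest cluster center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 4: the new center of each cluster is the mean of its members.
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        # Step 5: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Hypothetical sample data: two well-separated groups of 2-D points.
POINTS = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]

if __name__ == "__main__":
    # Step 1: choose K.
    centers, clusters = kmeans(POINTS, k=2)
    for center, members in zip(centers, clusters):
        print(center, members)
```

Because the starting centers are random, different seeds can give different clusterings on less cleanly separated data, so in practice the algorithm is often run several times.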
Table 2.3 • The Credit Card Promotion Database

Income Range ($)  Magazine Promotion  Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Sex     Age
40–50K            Yes                 No               No                        No                     Male    45
30–40K            Yes                 Yes              Yes                       No                     Female  40
40–50K            No                  No               No                        No                     Male    42
30–40K            Yes                 Yes              Yes                       Yes                    Male    43
50–60K            Yes                 No               Yes                       No                     Female  38
20–30K            No                  No               No                        No                     Female  55
30–40K            Yes                 No               Yes                       Yes                    Male    35
20–30K            No                  Yes              No                        No                     Male    27
30–40K            Yes                 No               No                        No                     Male    43
30–40K            Yes                 Yes              Yes                       No                     Female  41
40–50K            No                  Yes              Yes                       No                     Female  43
20–30K            No                  Yes              Yes                       No                     Male    29
50–60K            Yes                 Yes              Yes                       No                     Female  39
40–50K            No                  Yes              No                        No                     Male    55
20–30K            No                  No               Yes                       Yes                    Female  19

A Hypothesis for the Credit Card Promotion Database
A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.

Cluster 1 (# Instances: 3)
  Sex: Male => 3, Female => 0
  Age: 43.3
  Credit Card Insurance: Yes => 0, No => 3
  Life Insurance Promotion: Yes => 0, No => 3
Cluster 2 (# Instances: 5)
  Sex: Male => 3, Female => 2
  Age: 37.0
  Credit Card Insurance: Yes => 1, No => 4
  Life Insurance Promotion: Yes => 2, No => 3
Cluster 3 (# Instances: 7)
  Sex: Male => 2, Female => 5
  Age: 39.9
  Credit Card Insurance: Yes => 2, No => 5
  Life Insurance Promotion: Yes => 7, No => 0
Figure 2.3 An unsupervised clustering of the credit card database

Choosing a Data Mining Technique
• Know which kind of knowledge you want to get
• Know your data
  – What is the interaction between input and output attributes?
  – What is the distribution of the data?
  – Which attributes best define the data?
• Know the differences among data mining techniques

Questions to Determine Data Mining Applicability
1. Can the problem be clearly defined?
2. Does potentially meaningful data exist?
3. Does the data contain hidden knowledge or is it just filled with facts?
4. Is the “juice worth the squeeze?”

Data Mining vs. OLAP
Data mining:
• Discovery-based (inductive process)
• Mines the data warehouse and other sources
• Can provide information you didn’t expect
OLAP:
• Verification-based (deductive process)
• DSS tool for the data warehouse
• Pre-defined queries

Data Mining vs. Data Query
Data mining:
• For hidden knowledge
• Tries to get an answer that is as accurate as possible
• Results are an analysis of the data
• Data needs to be prepared before producing results
Data query:
• For a specific question
• The answer to a query is 100% accurate if the data is correct
• Results are a subset of the data
• Data need not be prepared
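The difference between a query result and an analysis result can be illustrated on three columns of Table 2.3. In the sketch below, `query` returns a subset of the rows, while `confidence` summarizes the data as the fraction of rows matching an antecedent that also match a consequent, which is the measure mentioned under Association Rules. The function names, the tuple layout, and the two example rules are assumptions made for illustration; a real data mining tool would search for such rules rather than be handed them.

```python
# (sex, life_insurance_promotion, credit_card_insurance) for the 15 rows
# of Table 2.3, in table order.
ROWS = [
    ("Male", "No", "No"), ("Female", "Yes", "No"), ("Male", "No", "No"),
    ("Male", "Yes", "Yes"), ("Female", "Yes", "No"), ("Female", "No", "No"),
    ("Male", "Yes", "Yes"), ("Male", "No", "No"), ("Male", "No", "No"),
    ("Female", "Yes", "No"), ("Female", "Yes", "No"), ("Male", "Yes", "No"),
    ("Female", "Yes", "No"), ("Male", "No", "No"), ("Female", "Yes", "Yes"),
]


def query(rows, predicate):
    """Data query: the result is the subset of rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def confidence(rows, antecedent, consequent):
    """Analysis: share of rows satisfying the antecedent that also satisfy
    the consequent. Returns None when no row matches the antecedent."""
    matched = [row for row in rows if antecedent(row)]
    if not matched:
        return None
    return sum(1 for row in matched if consequent(row)) / len(matched)


def is_female(row):
    return row[0] == "Female"


def took_life_insurance(row):
    return row[1] == "Yes"


def has_card_insurance(row):
    return row[2] == "Yes"


if __name__ == "__main__":
    # A query answers a specific question with rows taken from the data.
    print(query(ROWS, is_female))
    # An analysis produces a statement about the data instead of rows.
    print(confidence(ROWS, is_female, took_life_insurance))
    print(confidence(ROWS, has_card_insurance, took_life_insurance))
```

On these 15 rows, 6 of the 7 female card holders took the life insurance promotion, as did all 3 holders of credit card insurance. With so few instances these figures only illustrate the idea and say nothing reliable about real customers.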