Overview of Data Mining Methods

Data mining techniques
What techniques do, examples, advantages & disadvantages

Contents
- Reviews data mining tools
- Compares data mining perspectives
- Discusses data mining functions
- Presents four sets of data used to demonstrate tools in subsequent chapters
- Shows the Enterprise Miner structure for data mining analysis in the appendix

Data mining applications
- Automobile insurance company: fraud detection
- Business applications: loan evaluation, customer segmentation, employee evaluation…
- Data mining tools are categorized by the tasks of classification, estimation, prediction, clustering, and summarization.
- Classification, estimation, and prediction are predictive, while clustering and summarization are descriptive.

History
- Statistics
- AI: genetic algorithms, neural networks
- Analogies with biology
- Memory-based reasoning
- Link analysis from graph theory
See Table 4.1.

Data mining perspectives
Methods can be viewed from different perspectives. Data mining methods include:
- Cluster analysis (Chapter 5)
- Regression of various forms (best-fit methods, Chapter 6)
- Discriminant analysis (use of regression for classification, Chapter 6)
- Line fitting through the operations research tool of multiple objective linear programming (Chapter 9)
- AI: ANN (Chapter 7)
- Rule induction (decision trees, Chapter 8)
- Genetic algorithms (supplement)
See page 55 for more descriptions.

Techniques
Statistical
- Market-Basket Analysis: find groups of items
- Memory-Based Reasoning: case based
- Cluster Detection: undirected (quantitative)
Artificial Intelligence
- Link Analysis: MCI’s Friends & Family
- Decision Trees, Rule Induction: production rule
- Neural Networks: automatic pattern detection
- Genetic Algorithms: keep best parameters

Models
- Regression: Y = a + bX
- Classification: assign new record to class
- Predictive: assign value to new record
- Clustering: groups for data
- Time-series: assign future value
- Links: patterns in data

Fitting
Underfitting: not enough detail
- leaves out important variables
Overfitting: too much detail
- memorizes training set, but doesn’t help with new data
- data set too small
- redundancy in data

Comparison of Features
Feature | Rules | Neural Net | CaseBase | Genetic
Noisy data | Good | Very good | Good | Very good
Missing data | Good | Good | Very good | Good
Large sets | Very good | Poor | Good | Good
Different types | Good | Numerical | Very good | Transform
Accuracy | High | Very high | High | High
Explanation | Very good | Poor | Very good | Good
Integration | Good | Good | Good | Very good
Ease | Easy | Difficult | Easy | Difficult

Data Mining Functions
- Classification: identify categories in data
- Prediction: formula to predict future observations
- Association: rules using relationships among entities
- Detection: anomalies (unusual) & irregularities (fraud detection)

Financial Applications
Technique | Application | Problem Type
Neural net | Forecast stock price | Prediction
NN, Rule | Forecast bankruptcy; Fraud detection | Prediction; Detection
NN, Case | Forecast interest rate | Prediction
NN, visual | Late loan detection | Detection
Rule | Credit assessment; Risk classification | Prediction; Classification
Rule, Case | Corporate bond rate | Prediction

Telecom Applications
Technique | Application | Problem Type
Neural net, Rule induction | Forecast network behavior | Prediction
Rule induction | Churn; Fraud detection | Classification; Detection
Case based | Call tracking | Classification

Marketing Applications
Technique | Application | Problem Type
Rule induction | Market segment; Cross-selling | Classification; Association
Rule induction, visual | Lifestyle analysis; Performance analysis | Classification; Association
Rule induction, genetic, visual | Reaction to promotion | Prediction
Case based | Online sales support | Classification

Web Applications
Technique | Application | Problem Type
Rule induction, Visualization | User browsing similarity analysis | Classification, Association
Rule-based heuristics | Web page content similarity | Association

Other Applications
Technique | Application | Problem Type
Neural net | Software cost | Detection
Neural net, rule induction | Litigation assessment | Prediction
Rule induction | Insurance fraud; Healthcare except. | Detection; Detection
Case based | Insurance claim; Software quality | Prediction; Classification
Genetic algorithm | Budget spending | Classification

Data Sets
- Loan Applications: classification
- Job Applications: classification
- Insurance Fraud: detection
- Expenditure Data: prediction

Loan Data
650 observations
Outcomes (binary):
- On-time (cost of error: $300)
- Late (default) (cost of error: $2,000)
Variables: Age, Income, Assets, Debts, Want, Credit
- Credit is ordinal
- Transform: Assets, Debts, & Want → Risk

Job Application Data
500 observations
Outcomes (ordinal): Unacceptable, Minimal, Acceptable, Excellent
Variables: Age, State, Degree, Major, Experience
- State is nominal; degree & major are ordinal
- State is superfluous

Insurance Claim Data
5000 observations
Outcomes (binary):
- OK (cost of error: $500)
- Fraudulent (cost of error: $2,500)
Variables: Age, Gender, Claim, Tickets, Prior claims, Attorney
- Gender & attorney are nominal; tickets & prior claims are categorical

Expenditure Data
10,000 observations
Outcomes:
- Could predict response in a number of categories
- Others
Variables:
- Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards
- Churn, proportion of income spent on seven categories
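
The unequal error costs listed for the loan and insurance claim data sets suggest scoring a classifier by the total cost of its mistakes rather than by its error count alone. The sketch below is illustrative only. It assumes that each "cost of error" applies to a case whose actual outcome is the one listed beside it (for example, $2,000 when a loan that turns out late is classified as on-time). The function name and the toy labels are hypothetical and are not taken from the data sets themselves.

```python
from collections import Counter

# Cost of misclassifying a case, keyed by its ACTUAL outcome.
# Assumption: each "cost of error" figure applies to cases whose true
# outcome is the one listed next to it.
LOAN_ERROR_COST = {"on-time": 300, "late": 2000}
INSURANCE_ERROR_COST = {"ok": 500, "fraudulent": 2500}


def total_error_cost(actual, predicted, error_cost):
    """Sum the cost of every misclassified case."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Count the errors by the true outcome of the misclassified case.
    errors = Counter(a for a, p in zip(actual, predicted) if a != p)
    return sum(error_cost[outcome] * n for outcome, n in errors.items())


# Hypothetical toy labels, not drawn from the 650-observation loan data.
actual = ["on-time", "on-time", "late", "late", "on-time"]
predicted = ["on-time", "late", "on-time", "late", "on-time"]
print(total_error_cost(actual, predicted, LOAN_ERROR_COST))  # prints 2300
```

Under this reading, two models with the same number of errors can differ sharply in cost: one missed late loan outweighs six on-time applicants wrongly flagged as late.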