Overview of Methods
Data mining techniques
What techniques do, examples, advantages & disadvantages

History
• Statistics
• AI:
  – genetic algorithms, neural networks
    • analogies with biology
  – memory-based reasoning
  – link analysis from graph theory

McGraw-Hill/Irwin ©2007 The McGraw-Hill Companies, Inc. All rights reserved

Techniques
• Statistical
  – Market-Basket Analysis: find groups of items
  – Memory-Based Reasoning: case based
  – Cluster Detection: undirected (quantitative MBA)
• Artificial Intelligence
  – Link Analysis: MCI’s Friends & Family
  – Decision Trees, Rule Induction: production rule
  – Neural Networks: automatic pattern detection
  – Genetic Algorithms: keep best parameters

Models
• Regression: Y = a + bX
• Classification: assign new record to class
• Predictive: assign value to new record
• Clustering: groups for data
• Time-series: assign future value
• Links: patterns in data

Fitting
• Underfitting: not enough detail
  – leave out important variables
• Overfitting: too much detail
  – memorizes training set, but doesn’t help with new data
    • data set too small
    • redundancy in data

Comparison of Features
Feature           Rules       Neural Net   CaseBase    Genetic
Noisy data        Good        Very good    Good        Very good
Missing data      Good        Good         Very good   Good
Large sets        Very good   Poor         Good        Good
Different types   Good        Numerical    Very good   Transform
Accuracy          High        Very high    High        High
Explanation       Very good   Poor         Very good   Good
Integration       Good        Good         Good        Very good
Ease              Easy        Difficult    Easy        Difficult
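The regression model Y = a + bX from the Models slide and the overfitting point from the Fitting slide can be illustrated together. The sketch below is only an illustration under stated assumptions: the data points are made up, `fit_line`, `memorizer`, and `mse` are hypothetical helper names, and a 1-nearest-neighbour lookup stands in for a model that "memorizes the training set".

```python
# Sketch only: made-up data; a 1-nearest-neighbour lookup plays the role
# of an overfit model that memorizes its training set.

def fit_line(xs, ys):
    """Ordinary least squares for Y = a + bX; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def memorizer(xs, ys):
    """Return a predictor that echoes the Y of the closest training X."""
    pairs = list(zip(xs, ys))

    def predict(x):
        return min(pairs, key=lambda p: abs(p[0] - x))[1]

    return predict


def mse(predict, xs, ys):
    """Mean squared error of a predictor on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical training set: roughly Y = 2 + 3X with small disturbances.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [5.5, 7.5, 11.3, 13.7, 17.4, 19.6]

# Hypothetical new records, generated from the undisturbed relationship.
new_x = [1.4, 2.6, 3.4, 4.6, 5.4]
new_y = [2 + 3 * x for x in new_x]

a, b = fit_line(train_x, train_y)
line = lambda x: a + b * x
memo = memorizer(train_x, train_y)

print("fitted line: Y = %.2f + %.2f X" % (a, b))
print("training error  line: %.3f  memorizer: %.3f"
      % (mse(line, train_x, train_y), mse(memo, train_x, train_y)))
print("new-data error  line: %.3f  memorizer: %.3f"
      % (mse(line, new_x, new_y), mse(memo, new_x, new_y)))
```

On this toy data the memorizer reproduces every training record exactly yet does worse than the fitted line on the new records, which is the pattern the Fitting slide describes.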
Data Mining Functions
• Classification
  – Identify categories in data
• Prediction
  – Formula to predict future observations
• Association
  – Rules using relationships among entities
• Detection
  – Anomalies & irregularities (fraud detection)

Financial Applications
Technique             Application               Problem Type
Neural net            Forecast stock price      Prediction
NN, Rule; NN, Case    Forecast bankruptcy       Prediction
                      Fraud detection           Detection
                      Forecast interest rate    Prediction
NN, visual            Late loan detection       Detection
Rule                  Credit assessment         Prediction
                      Risk classification       Classification
Rule, Case            Corporate bond rate       Prediction

Telecom Applications
Technique                 Application                 Problem Type
Neural net, Rule induct   Forecast network behavior   Prediction
Rule induct               Churn                       Classification
                          Fraud detection             Detection
Case based                Call tracking               Classification

Marketing Applications
Technique                      Application             Problem Type
Rule induct                    Market segment          Classification
                               Cross-selling           Association
Rule induct, visual            Lifestyle analysis      Classification
                               Performance analysis    Association
Rule induct, genetic, visual   Reaction to promotion   Prediction
Case based                     Online sales support    Classification

Web Applications
Technique                    Application                        Problem Type
Rule induct, Visualization   User browsing similarity analysis  Classification, Association
Rule-based heuristics        Web page content similarity        Association

Other Applications
Technique                 Application              Problem Type
Neural net                Software cost            Detection
Neural net, rule induct   Litigation assessment    Prediction
Rule induct               Insurance fraud          Detection
                          Healthcare exceptions    Detection
Case based                Insurance claim          Prediction
                          Software quality         Classification
Genetic algorithm         Budget spending          Classification

Data Sets
• Loan Applications: classification
• Job Applications: classification
• Insurance Fraud: detection
• Expenditure Data: prediction

Loan Data
• 650 observations
• OUTCOMES (binary):
  – On-time (cost of error: $300)
  – Late (default) (cost of error: $2,000)
• Variables
  – Age, Income, Assets, Debts, Want, Credit
    • Credit ordinal
  – Transform: Assets, Debts, & Want → Risk

Job Application Data
• 500 observations
• OUTCOMES (ordinal):
  – Unacceptable
  – Minimal
  – Acceptable
  – Excellent
• Variables
  – Age, State, Degree, Major, Experience
    • State nominal; degree & major ordinal
    • State is superfluous

Insurance Claim Data
• 5,000 observations
• OUTCOMES (binary):
  – OK (cost of error: $500)
  – Fraudulent (cost of error: $2,500)
• Variables
  – Age, Gender, Claim, Tickets, Prior claims, Attorney
    • Gender & attorney nominal; tickets & prior claims categorical

Expenditure Data
• 10,000 observations
• OUTCOMES:
  – Could predict response in a number of categories
  – Others
• Variables:
  – Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards
  – Churn, proportion of income spent on seven categories
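The unequal error costs listed for the loan and insurance claim data sets suggest scoring a classifier by dollars lost rather than by error count. The sketch below is a minimal illustration, not the textbook's procedure. It assumes the $300 and $500 figures are the costs of misclassifying an on-time loan or an OK claim, and the $2,000 and $2,500 figures are the costs of misclassifying a late loan or a fraudulent claim. The function names and the five example records are hypothetical.

```python
# Sketch only: cost figures come from the slides; their assignment to
# error types, the helper names, and the example records are assumptions.

def total_error_cost(actual, predicted, cost_if_good, cost_if_bad, bad_label):
    """Sum the dollar cost of every misclassified record."""
    cost = 0
    for truth, guess in zip(actual, predicted):
        if truth == guess:
            continue  # correct classifications cost nothing here
        cost += cost_if_bad if truth == bad_label else cost_if_good
    return cost


def flag_threshold(cost_if_good, cost_if_bad):
    """Probability of 'bad' above which flagging has the lower expected cost.

    Flagging a record costs (1 - p) * cost_if_good on average; not flagging
    costs p * cost_if_bad. Flagging is cheaper once p exceeds this ratio.
    """
    return cost_if_good / (cost_if_good + cost_if_bad)


# Hypothetical loan outcomes and model predictions.
actual = ["on-time", "on-time", "late", "late", "on-time"]
predicted = ["on-time", "late", "on-time", "late", "on-time"]

loan_cost = total_error_cost(actual, predicted,
                             cost_if_good=300, cost_if_bad=2000,
                             bad_label="late")
loan_cut = flag_threshold(300, 2000)
claim_cut = flag_threshold(500, 2500)

print("loan error cost:", loan_cost)
print("flag loan as late when P(late) > %.3f" % loan_cut)
print("flag claim as fraudulent when P(fraud) > %.3f" % claim_cut)
```

Under these assumptions both thresholds fall well below 0.5, because missing a late loan or a fraudulent claim is several times more expensive than a false alarm.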