Overview of Methods
Data mining techniques
What techniques do, examples, advantages & disadvantages
History
• Statistics
• AI:
– genetic algorithms, neural networks
• analogies with biology
– memory-based reasoning
– link analysis from graph theory
McGraw-Hill/Irwin
©2007 The McGraw-Hill Companies, Inc. All rights reserved
Techniques
• Statistical
– Market-Basket Analysis - find groups of items
– Memory-Based Reasoning - case based
– Cluster Detection - undirected (quantitative MBA)
• Artificial Intelligence
– Link Analysis - MCI’s Friends & Family
– Decision Trees, Rule Induction - production rule
– Neural Networks - automatic pattern detection
– Genetic Algorithms - keep best parameters
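As a rough illustration of how market-basket analysis can "find groups of items", the sketch below counts how often pairs of items appear together (support) and how often one item accompanies another (confidence). The transactions and the helper names `pair_support` and `confidence` are hypothetical, invented for this example; they are not taken from the slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}

def confidence(baskets, antecedent, consequent):
    """Estimate P(consequent in basket | antecedent in basket)."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    hits = sum(consequent in b for b in with_antecedent)
    return hits / len(with_antecedent)

support = pair_support(transactions)
bread_to_milk = confidence(transactions, "bread", "milk")
```

Here bread and milk share two of the four baskets, and milk appears in two of the three baskets that contain bread. Practical tools prune candidate item groups rather than counting every pair, so this only shows the two measures themselves.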
Models
• Regression: Y = a + bX
• Classification: assign new record to class
• Predictive: assign value to new record
• Clustering: groups for data
• Time-series: assign future value
• Links: patterns in data
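The regression model Y = a + bX can be fitted by ordinary least squares. The following is a minimal sketch that assumes a single predictor; the function name `fit_line` and the data points are hypothetical and chosen so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = a + bX; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept
    return a, b

# Hypothetical data lying exactly on Y = 1 + 2X.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

a, b = fit_line(xs, ys)
predicted = a + b * 4.0  # value assigned to a new record with X = 4
```

Because the sample points lie exactly on a line, the fit recovers a = 1 and b = 2, and the new record with X = 4 is assigned the value 9.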
Fitting
• Underfitting: not enough detail
– leave out important variables
• Overfitting: too much detail
– memorizes training set, but doesn’t help with new data
• data set too small
• redundancy in data
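The overfitting point can be illustrated with a model that literally memorizes its training set. In the sketch below, a one-nearest-neighbor classifier scores perfectly on the training data but copies a noisy label onto nearby new records, while a simple threshold rule does worse on training data and better on new data. The data, the noisy point at x = 3 and the cutoff of 5 are all assumptions made up for this example.

```python
# Hypothetical (x, label) pairs; the true pattern is label 1 when x >= 5,
# and the point (3, 1) is deliberate noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
test = [(2.6, 0), (3.4, 0), (5.2, 1), (7.5, 1)]

def nearest_neighbor_predict(x):
    """Memorizing model: return the label of the closest training point."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def threshold_predict(x, cutoff=5.0):
    """Simple rule with one hand-picked parameter (an assumption here)."""
    return 1 if x >= cutoff else 0

def accuracy(predict, data):
    """Fraction of records in data that predict labels correctly."""
    return sum(predict(x) == label for x, label in data) / len(data)

memorizer_train = accuracy(nearest_neighbor_predict, train)
memorizer_test = accuracy(nearest_neighbor_predict, test)
rule_train = accuracy(threshold_predict, train)
rule_test = accuracy(threshold_predict, test)
```

With this toy data the memorizing model is right on every training record but on only half of the new records, whereas the threshold rule misses the one noisy training record and classifies all four new records correctly.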
Comparison of Features
                 Rules       Neural Net   CaseBase     Genetic
Noisy data       Good        Very good    Good         Very good
Missing data     Good        Good         Very good    Good
Large sets       Very good   Poor         Good         Good
Different types  Good        Numerical    Very good    Transform
Accuracy         High        Very high    High         High
Explanation      Very good   Poor         Very good    Good
Integration      Good        Good         Good         Very good
Ease             Easy        Difficult    Easy         Difficult
Data Mining Functions
• Classification
– Identify categories in data
• Prediction
– Formula to predict future observations
• Association
– Rules using relationships among entities
• Detection
– Anomalies & irregularities (fraud detection)
Financial Applications
Technique    Application              Problem Type
Neural net   Forecast stock price     Prediction
NN, Rule     Forecast bankruptcy      Prediction
             Fraud detection          Detection
NN, Case     Forecast interest rate   Prediction
NN, visual   Late loan detection      Detection
Rule         Credit assessment        Prediction
             Risk classification      Classification
Rule, Case   Corporate bond rate      Prediction
Telecom Applications
Technique                 Application               Problem Type
Neural net, Rule induct   Forecast network behav.   Prediction
Rule induct               Churn                     Classification
                          Fraud detection           Detection
Case based                Call tracking             Classification
Marketing Applications
Technique                      Application             Problem Type
Rule induct                    Market segment          Classification
                               Cross-selling           Association
Rule induct, visual            Lifestyle analysis      Classification
                               Performance analy.      Association
Rule induct, genetic, visual   Reaction to promotion   Prediction
Case based                     Online sales support    Classification
Web Applications
Technique                    Application                      Problem Type
Rule induct, Visualization   User browsing similarity analy.  Classification, Association
Rule-based heuristics        Web page content similarity      Association
Other Applications
Technique                 Application            Problem Type
Neural net                Software cost          Detection
Neural net, rule induct   Litigation assessment  Prediction
Rule induct               Insurance fraud        Detection
                          Healthcare except.     Detection
Case based                Insurance claim        Prediction
                          Software quality       Classification
Genetic algor.            Budget spending        Classification
Data Sets
• Loan Applications
– classification
• Job Applications
– classification
• Insurance Fraud
– detection
• Expenditure Data
– prediction
Loan Data
• 650 observations
• OUTCOMES (binary):
– On-time (cost of error: $300)
– Late (default) (cost of error: $2,000)
• Variables
– Age, Income, Assets, Debts, Want, Credit
• Credit ordinal
– Transform: Assets, Debts, & Want →Risk
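Because the two kinds of error carry different costs, a loan model is more usefully judged by total error cost than by a raw error count. The sketch below assumes that each cost applies when a case whose actual outcome is that class gets misclassified; the slide does not state this pairing explicitly. The five applicants and the name `total_error_cost` are hypothetical.

```python
# Costs listed with each outcome on the slide. Reading each one as the cost
# of misclassifying a case whose ACTUAL outcome is that class is an assumption.
COST_OF_ERROR = {"on-time": 300, "late": 2000}

def total_error_cost(actual, predicted):
    """Sum the cost of every misclassified case, keyed by its actual class."""
    return sum(COST_OF_ERROR[a] for a, p in zip(actual, predicted) if a != p)

# Hypothetical results for five applicants, invented for illustration.
actual = ["on-time", "on-time", "late", "late", "on-time"]
predicted = ["on-time", "late", "late", "on-time", "on-time"]

cost = total_error_cost(actual, predicted)
errors = sum(a != p for a, p in zip(actual, predicted))
```

Both mistakes count equally as errors, yet under this reading the misclassified late loan accounts for $2,000 of the $2,300 total.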
Job Application Data
• 500 observations
• OUTCOMES (ordinal):
– Unacceptable
– Minimal
– Acceptable
– Excellent
• Variables
– Age, State, Degree, Major, Experience
• State nominal; degree & major ordinal
• State is superfluous
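One possible way to prepare such records is to drop the superfluous State field and turn the ordinal outcome into a rank. This is only a sketch: the outcome order follows the slide, while the field values for the sample applicant and the helper name `prepare` are hypothetical.

```python
# Outcome levels in the order listed on the slide, lowest to highest.
OUTCOME_LEVELS = ["Unacceptable", "Minimal", "Acceptable", "Excellent"]
OUTCOME_RANK = {name: rank for rank, name in enumerate(OUTCOME_LEVELS)}

def prepare(record):
    """Drop the superfluous State field and rank-encode the outcome."""
    cleaned = {key: value for key, value in record.items() if key != "State"}
    cleaned["Outcome"] = OUTCOME_RANK[record["Outcome"]]
    return cleaned

# Hypothetical applicant; every field value is invented for illustration.
applicant = {
    "Age": 28,
    "State": "CA",
    "Degree": "BS",
    "Major": "IS",
    "Experience": 4,
    "Outcome": "Acceptable",
}

row = prepare(applicant)
```

Degree and Major could be rank-encoded the same way once their level orderings are known; the slides do not list those levels, so they are left untouched here.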
Insurance Claim Data
• 5000 observations
• OUTCOMES (binary):
– OK (cost of error: $500)
– Fraudulent (cost of error: $2,500)
• Variables
– Age, Gender, Claim, Tickets, Prior claims, Attorney
• Gender & attorney nominal, tickets & prior claims categorical
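Unequal error costs also shift the point at which a claim is worth flagging. The sketch below assumes that wrongly flagging an OK claim costs $500 and wrongly accepting a fraudulent claim costs $2,500, which is one reading of the figures above, and that a model supplies a fraud probability for each claim. The function names are hypothetical.

```python
# Error costs from the slide; which mistake each cost belongs to is an
# assumption made for this sketch.
COST_ERROR_OK = 500       # an OK claim wrongly flagged as fraudulent
COST_ERROR_FRAUD = 2500   # a fraudulent claim wrongly accepted as OK

def break_even_probability():
    """Fraud probability at which flagging and accepting cost the same."""
    return COST_ERROR_OK / (COST_ERROR_OK + COST_ERROR_FRAUD)

def decide(p_fraud):
    """Pick the action with the lower expected error cost."""
    cost_if_flagged = (1 - p_fraud) * COST_ERROR_OK
    cost_if_accepted = p_fraud * COST_ERROR_FRAUD
    return "flag" if cost_if_accepted > cost_if_flagged else "accept"

threshold = break_even_probability()
```

Under these assumptions the break-even probability is 1/6, so a claim with a 20% estimated chance of fraud is flagged while one with a 10% chance is accepted, even though both are more likely to be OK than fraudulent.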
Expenditure Data
• 10,000 observations
• OUTCOMES:
– Could predict response in a number of categories
– Others
• Variables:
– Age, Gender, Marital, Dependents, Income, Job years, Town years, Education years, Drivers license, Own home, Number of credit cards
– Churn, proportion of income spent on seven categories