Decision Tree Algorithms
• Rule based
• Suitable for automatic generation
Decision trees
• Logical branching
• Historical:
– ID3 – an early rule-generating system
• Branches: the different possible values
• Nodes: points from which branches emanate
McGraw-Hill/Irwin
©2007 The McGraw-Hill Companies, Inc. All rights reserved
Goal-Driven Data Mining
• Define goal
– Identify fraudulent cases
• Develop rules identifying attributes attaining that goal
– IF attorney = Smith, THEN better check
Tree Structure
• Sorts out data
– IF-THEN rules
– Loan variables:
• Age: {young, middle, old}
• Income: {low, average, high}
• Risk: {low, medium, high}
• An exhaustive tree enumerates all combinations
– 3 × 3 × 3 = 27 combinations, classifying all cases
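The exhaustive enumeration over the three loan attributes can be sketched in a few lines (attribute names and values taken from the slide; the code itself is illustrative):

```python
from itertools import product

# Attribute domains from the loan example on this slide
ages = ["young", "middle", "old"]
incomes = ["low", "average", "high"]
risks = ["low", "medium", "high"]

# An exhaustive tree has one leaf per attribute combination
combinations = list(product(ages, incomes, risks))
print(len(combinations))  # 3 x 3 x 3 = 27
```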
Types of Trees
• Classification tree
– Variable values are classes
– Finite set of conditions
• Regression tree
– Variable values are continuous numbers
– Used for prediction or estimation
Rule Induction
• Automatically processes data
– Classification (logical, easier)
– Regression (estimation, messier)
• Searches through data for patterns & relationships
– Pure knowledge discovery
• Assumes no prior hypothesis
• Disregards human judgment
Example
• Three variables:
– Age
– Income
– Risk
• Outcomes:
– On-time
– Late
Combinations

Variable  Value    Cases  OT  Late  Pr(OT)
Age       Young      12    8    4    0.67
Age       Middle      5    4    1    0.80
Age       Old         3    3    0    1.00
Income    Low         5    3    2    0.60
Income    Average     9    7    2    0.78
Income    High        6    5    1    0.83
Risk      High        9    5    4    0.55
Risk      Average     1    0    1    0.00
Risk      Low        10   10    0    1.00

(OT = on-time; 20 cases in total)
Basis for Classification
• If a category has all outcomes of a certain kind, that makes a good rule
– IF Income = High, they always paid
• ENTROPY: a measure of content
– Actually a measure of randomness
Entropy formula

Information = -{p/(p+n)} log2{p/(p+n)} - {n/(p+n)} log2{n/(p+n)}

The lower the measure, the greater the information content.
Can be used to automatically select the variable with the most productive rule potential.
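The formula translates directly into a short function (a sketch; `p` counts on-time cases and `n` late cases, as in the loan example):

```python
from math import log2

def information(p, n):
    """Entropy (in bits) of a group with p on-time and n late cases."""
    h = 0.0
    for count in (p, n):
        if count:  # treat 0 * log2(0) as 0
            frac = count / (p + n)
            h -= frac * log2(frac)
    return h

# A pure group has no randomness; a 50/50 split has the most
print(round(information(10, 0), 3))  # 0.0
print(round(information(5, 5), 3))   # 1.0
```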
Entropy
• Young: [-(8/12)log2(8/12) - (4/12)log2(4/12)] × 12/20 = 0.918 × 12/20 = 0.551
• Middle: [-(4/5)log2(4/5) - (1/5)log2(1/5)] × 5/20 = 0.722 × 5/20 = 0.180
• Old: [-(3/3)log2(3/3) - (0/3)log2(0/3)] × 3/20 = 0.000
• SUM (Age): 0.731
• Income: 0.782
• Risk: 0.446
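The per-variable sums (Age 0.731, Income 0.782, Risk 0.446) can be reproduced with a short script, assuming the (on-time, late) counts from the combinations table:

```python
from math import log2

def entropy(p, n):
    """Entropy in bits of a group with p on-time and n late cases."""
    h = 0.0
    for c in (p, n):
        if c:  # 0 * log2(0) is taken as 0
            h -= (c / (p + n)) * log2(c / (p + n))
    return h

# (on-time, late) counts per attribute value, from the combinations table
counts = {
    "Age":    [(8, 4), (4, 1), (3, 0)],   # Young, Middle, Old
    "Income": [(3, 2), (7, 2), (5, 1)],   # Low, Average, High
    "Risk":   [(5, 4), (0, 1), (10, 0)],  # High, Average, Low
}

# Weight each group's entropy by its share of the 20 cases
weighted = {
    var: sum((p + n) / 20 * entropy(p, n) for p, n in groups)
    for var, groups in counts.items()
}
for var, h in weighted.items():
    print(var, round(h, 3))  # Age 0.731, Income 0.782, Risk 0.446

print(min(weighted, key=weighted.get))  # Risk
```

Risk has the lowest weighted entropy, i.e. the greatest information content, so it is the most productive variable to split on first.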
Rule
1. IF Risk = Low THEN OT
2. ELSE Late
All Rules
1. IF Risk = Low THEN OT
2. IF Risk NOT Low & Age = Middle THEN Late
3. IF Risk NOT Low & Age NOT Middle & Income = High THEN Late
4. ELSE OT
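Applied in order, the four rules amount to a simple function (a sketch, using the lowercase value labels from the tree-structure slide):

```python
def classify(age, income, risk):
    """Apply the induced rules in order; returns 'OT' or 'Late'."""
    if risk == "low":    # Rule 1
        return "OT"
    if age == "middle":  # Rule 2
        return "Late"
    if income == "high": # Rule 3
        return "Late"
    return "OT"          # Rule 4 (ELSE)

print(classify(age="middle", income="average", risk="average"))  # Late
```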
Sample Case
• Age 36 → Middle
• Income $70K/year → Average
• Risk: Assets $42K, Debts $40K, Wants $5K → Average
• Rule 2 applies, says Late
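Mapping the raw case to categorical values needs cutoffs that the slide does not state; the bands below are purely illustrative assumptions used to show the shape of such a mapping:

```python
def age_band(age_years):
    # Hypothetical cutoffs (not given in the source): <30 young, 30-55 middle
    if age_years < 30:
        return "young"
    return "middle" if age_years <= 55 else "old"

def risk_band(assets, debts, requested):
    # Hypothetical rule: compare assets against debts plus the requested loan,
    # with a $10K tolerance band on either side (all figures in $K)
    margin = assets - (debts + requested)
    if margin > 10:
        return "low"
    return "average" if margin >= -10 else "high"

# Sample case: age 36, assets $42K, debts $40K, wants $5K
print(age_band(36), risk_band(assets=42, debts=40, requested=5))
```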
Fuzzy Decision Trees
• Have assumed distinct (crisp) outcomes
• Many data points are not that clear-cut
• Fuzzy: a membership function represents belief (between 0 and 1)
• Fuzzy relationships have been incorporated into decision tree algorithms
Fuzzy Example
• Age: Young 0.3, Middle 0.9, Old 0.2
• Income: Low 0.0, Average 0.8, High 0.3
• Risk: Low 0.1, Average 0.8, High 0.3
• Definitions:
– The sum will not necessarily equal 1.0
– If ambiguous, select the alternative with the larger membership value
– Aggregate with the mean
Fuzzy Model
• IF Risk = Low THEN OT
– Membership function: 0.1
• IF Risk NOT Low & Age = Middle THEN Late
– Risk NOT Low: MAX(0.8, 0.3) = 0.8
– Age = Middle: 0.9
– Membership function: mean = 0.85
Fuzzy Model cont.
• IF Risk NOT Low & Age NOT Middle & Income = High THEN Late
– Risk NOT Low: MAX(0.8, 0.3) = 0.8
– Age NOT Middle: MAX(0.3, 0.2) = 0.3
– Income = High: 0.3
– Membership function: mean = 0.467
Fuzzy Model cont.
• IF Risk NOT Low & Age NOT Middle & Income NOT High THEN OT
– Risk NOT Low: MAX(0.8, 0.3) = 0.8
– Age NOT Middle: MAX(0.3, 0.2) = 0.3
– Income NOT High: MAX(0.0, 0.8) = 0.8
– Membership function: mean = 0.633
Fuzzy Model cont.
• The highest membership function is 0.85, for Rule 2
• Conclusion: Late
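The four rule memberships can be recomputed directly from the fuzzy example's values (a sketch: NOT is taken as the maximum over the remaining values, and a rule's conditions are aggregated with the mean, as the slides prescribe):

```python
# Membership values from the fuzzy example slide
age = {"young": 0.3, "middle": 0.9, "old": 0.2}
income = {"low": 0.0, "average": 0.8, "high": 0.3}
risk = {"low": 0.1, "average": 0.8, "high": 0.3}

def mean(values):
    return sum(values) / len(values)

# NOT x = max membership over the other values of that variable
risk_not_low = max(risk["average"], risk["high"])        # 0.8
age_not_middle = max(age["young"], age["old"])           # 0.3
income_not_high = max(income["low"], income["average"])  # 0.8

memberships = {
    "Rule 1 (OT)": risk["low"],
    "Rule 2 (Late)": mean([risk_not_low, age["middle"]]),
    "Rule 3 (Late)": mean([risk_not_low, age_not_middle, income["high"]]),
    "Rule 4 (OT)": mean([risk_not_low, age_not_middle, income_not_high]),
}
for rule, m in memberships.items():
    print(rule, round(m, 3))
```

Note that Rule 2's mean, (0.8 + 0.9)/2 = 0.85, is the largest of the four values.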
Applications
• Inventory Prediction
• Clinical Databases
• Software Development Quality
Inventory Prediction
• Groceries
– Maybe over 100,000 SKUs
– Barcode data input
• Data mining to discover patterns
– Random sample of over 1.6 million records
– 30 months
– 95 outlets
– Test sample of 400,000 records
• Rule induction more workable than regression
– 28,000 rules
– Very accurate, with up to 27% improvement
Clinical Database
• Headache
– Over 60 possible causes
• Exclusive reasoning uses negative rules
– Use when symptom absent
• Inclusive reasoning uses positive rules
• Probabilistic rule induction expert system
– Headache: training sample of over 50,000 cases, 45 classes, 147 attributes
– Meningitis: 1,200 samples on 41 attributes, 4 outputs
Clinical Database cont.
• Used AQ15, C4.5
– Average accuracy 82%
• Expert System
– Average accuracy 92%
• Rough Set Rule System
– Average accuracy 70%
• Using both positive & negative rules from
rough sets
– Average accuracy over 90%
Software Development Quality
• Telecommunications company
• Goal: find patterns in modules under development that are likely to contain faults discovered by customers
– Typical module: several million lines of code
– Probability of a fault averaged 0.074
• Apply greater effort to those modules
– Specification, testing, inspection
Software Quality
• Preprocessed data
• Reduced data
• Used CART (Classification and Regression Trees)
– Could specify prior probabilities
• First model: 9 rules, 6 variables
– Better at cross-validation
– But variable values not available until late in development
• Second model: 4 rules, 2 variables
– About the same accuracy, with data available earlier
Decision Trees
• Very effective & useful
• Automatic machine learning
– Thus unbiased (but omits human judgment)
• Can handle very large data sets
– Not much affected by missing data
• Lots of software available