Lecture 8 (2/23)
Decision Trees and Association
Rules
Prof. Sin-Min Lee
Department of Computer Science
Data Mining: A KDD Process
– Data mining: the core of the knowledge discovery process
[Diagram: Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining process model -DM
Search in State Spaces
Decision Trees
•A decision tree is a special case of a state-space
graph.
•It is a rooted tree in which each internal node
corresponds to a decision, with one subtree for
each possible outcome of that decision.
•Decision trees can be used to model problems in
which a series of decisions leads to a solution.
•The possible solutions of the problem correspond
to the paths from the root to the leaves of the
decision tree.
Decision Trees
•Example: The n-queens problem
•How can we place n queens on an n×n chessboard so that no two
queens can capture each other?
A queen can move any
number of squares
horizontally, vertically, and
diagonally.
Here, the possible target
squares of the queen Q are
marked with an x.
•Let us consider the 4-queens problem.
•Question: How many possible configurations of a
4×4 chessboard containing 4 queens are there?
•Answer: There are 16!/(12!·4!) =
(13·14·15·16)/(2·3·4) = 13·7·5·4 = 1820 possible
configurations.
•Shall we simply try them out one by one until we
encounter a solution?
•No, it is generally useful to think about a search
problem more carefully and discover constraints
on the problem’s solutions.
•Such constraints can dramatically reduce the size
of the relevant state space.
Obviously, in any solution of the n-queens problem,
there must be exactly one queen in each column of
the board.
Otherwise, the two queens in the same column could
capture each other.
Therefore, we can describe the solution of this problem
as a sequence of n decisions:
Decision 1: Place a queen in the first column.
Decision 2: Place a queen in the second column.
.
.
.
Decision n: Place a queen in the n-th column.
Backtracking in Decision Trees
[Diagram: the decision tree for the 4-queens problem. Starting from the empty board, the 1st, 2nd, 3rd, and 4th queens are placed column by column; when a column has no safe square left, that branch is abandoned and the search backtracks to the previous decision.]
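A minimal Python sketch of this backtracking search (mine, not from the slides): each recursion level is one column decision from the sequence above, and a dead end makes the call return None, which backtracks to the previous decision.

```python
# Backtracking search for the n-queens problem (illustrative sketch).
# placement[c] is the row chosen for the queen in column c.
def solve_n_queens(n, placement=()):
    col = len(placement)                      # next column to decide
    if col == n:                              # all n decisions made: a solution leaf
        return placement
    for row in range(n):
        # constraint: the new queen must not share a row or a diagonal with earlier queens
        if all(row != r and abs(row - r) != col - c
               for c, r in enumerate(placement)):
            solution = solve_n_queens(n, placement + (row,))
            if solution is not None:
                return solution
    return None                               # dead end: backtrack

print(solve_n_queens(4))                      # (1, 3, 0, 2)
```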
Neural Network
Many inputs and a single output
Trained on signal and background sample
Well understood and mostly accepted in HEP
Decision Tree
Many inputs and a single output
Trained on signal and background sample
Used mostly in life sciences & business
Decision tree
Basic Algorithm
• Initialize top node to all examples
• While impure leaves available
– select next impure leaf L
– find splitting attribute A with maximal information gain
– for each value of A add a child to L
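A minimal ID3-style sketch of this loop in Python (my own names and structure, not the lecture's); it recurses instead of keeping an explicit worklist of impure leaves, but the split chosen at each node is the same maximal-information-gain attribute. Applied to the weather table on the next slide, the first attribute it selects is outlook.

```python
# Illustrative ID3-style decision tree induction (sketch, not lecture code).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # gain = entropy(parent) - weighted entropy of the children after splitting on attr
    split = {}
    for row, label in zip(rows, labels):
        split.setdefault(row[attr], []).append(label)
    n = len(labels)
    return entropy(labels) - sum(len(p) / n * entropy(p) for p in split.values())

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                    # pure leaf: stop
        return labels[0]
    if not attrs:                                # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    children = {}
    for value in {row[best] for row in rows}:    # one child per value of the attribute
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree([rows[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [a for a in attrs if a != best])
    return (best, children)
```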
Decision tree
Find good splits
• Sufficient statistics to compute info gain: count matrix
outlook    temperature  humidity  windy  play
sunny      hot          high      FALSE  no
sunny      hot          high      TRUE   no
overcast   hot          high      FALSE  yes
rainy      mild         high      FALSE  yes
rainy      cool         normal    FALSE  yes
rainy      cool         normal    TRUE   no
overcast   cool         normal    TRUE   yes
sunny      mild         high      FALSE  no
sunny      cool         normal    FALSE  yes
rainy      mild         normal    FALSE  yes
sunny      mild         normal    TRUE   yes
overcast   mild         high      TRUE   yes
overcast   hot          normal    FALSE  yes
rainy      mild         high      TRUE   no
outlook        play   don't play
  sunny          2        3
  overcast       4        0
  rainy          3        2
gain: 0.25 bits

temperature    play   don't play
  hot            2        2
  mild           4        2
  cool           3        1
gain: 0.03 bits

humidity       play   don't play
  high           3        4
  normal         6        1
gain: 0.15 bits

windy          play   don't play
  FALSE          6        2
  TRUE           3        3
gain: 0.05 bits
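The gain values follow directly from these count matrices; a short Python check (mine, not the slides'):

```python
# Recompute information gain from a (play, don't play) count matrix (sketch).
from math import log2

def H(counts):                                    # entropy of a class-count vector
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def gain(matrix):                                 # one (play, don't play) row per attribute value
    class_totals = [sum(col) for col in zip(*matrix)]
    n = sum(class_totals)
    return H(class_totals) - sum(sum(row) / n * H(row) for row in matrix)

print(round(gain([(2, 3), (4, 0), (3, 2)]), 2))   # outlook     -> 0.25
print(round(gain([(2, 2), (4, 2), (3, 1)]), 2))   # temperature -> 0.03
print(round(gain([(3, 4), (6, 1)]), 2))           # humidity    -> 0.15
print(round(gain([(6, 2), (3, 3)]), 2))           # windy       -> 0.05
```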
Decision trees
• Simple depth-first construction
• Needs entire data to fit in memory
• Unsuitable for large data sets
• Need to “scale up”
Decision Trees
Planning Tool
Decision Trees
• Enable a business to quantify decision
making
• Useful when the outcomes are uncertain
• Places a numerical value on likely or
potential outcomes
• Allows comparison of different possible
decisions to be made
Decision Trees
• Limitations:
– How accurate is the data used in the construction of the
tree?
– How reliable are the estimates of the probabilities?
– Data may be historical – does this data relate to real
time?
– Necessity of factoring in the qualitative factors –
human resources, motivation, reaction, relations with
suppliers and other stakeholders
Process, Advantages, Disadvantages
[Diagram: analysis flow in which the output of a trained decision tree is used in a binned likelihood fit to set a limit.]
Decision Trees from Data Base

Ex Num   Att Size   Att Colour   Att Shape   Concept Satisfied
1        med        blue         brick       yes
2        small      red          wedge       no
3        small      red          sphere      yes
4        large      red          wedge       no
5        large      green        pillar      yes
6        large      red          pillar      no
7        large      green        sphere      yes

Choose target: Concept satisfied
Use all attributes except Ex Num
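One concrete way to grow such a tree is with scikit-learn; the sketch below is my own illustration (scikit-learn is not named in the lecture). Because it grows binary splits over one-hot encoded attributes, its printed tree is shaped differently from the multiway tree behind the rules on the next slide, though it separates the same yes/no examples.

```python
# Fit a decision tree to the Size/Colour/Shape table (illustrative sketch).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Size":      ["med", "small", "small", "large", "large", "large", "large"],
    "Colour":    ["blue", "red", "red", "red", "green", "red", "green"],
    "Shape":     ["brick", "wedge", "sphere", "wedge", "pillar", "pillar", "sphere"],
    "Satisfied": ["yes", "no", "yes", "no", "yes", "no", "yes"],
})

# One-hot encode the categorical attributes; Ex Num is excluded, as the slide says.
X = pd.get_dummies(data[["Size", "Colour", "Shape"]])
y = data["Satisfied"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```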
Rules from Tree
IF (SIZE = large AND ((SHAPE = wedge) OR (SHAPE = pillar AND COLOUR = red)))
OR (SIZE = small AND SHAPE = wedge)
THEN NO

IF (SIZE = large AND ((SHAPE = pillar AND COLOUR = green) OR (SHAPE = sphere)))
OR (SIZE = small AND SHAPE = sphere)
OR (SIZE = medium)
THEN YES
Association Rule
• Used to find all rules in basket data
• Basket data is also called transaction data
• Analyze how items purchased by customers in a shop are related
• Discover all rules that have:
– support greater than minsup specified by the user
– confidence greater than minconf specified by the user
• Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player
Association Rule
• Let I = {i1, i2, …, im} be the total set of items
– D is a set of transactions
– d is one transaction, consisting of a set of items: d ⊆ I
• Association rule X => Y:
– where X ⊂ I, Y ⊂ I and X ∩ Y = ∅
– support = (# of transactions containing X ∪ Y) / |D|
– confidence = (# of transactions containing X ∪ Y) / (# of transactions containing X)
Association Rule
• Example of transaction data:
– CD player, music CD, music book
– CD player, music CD
– music CD, music book
– CD player
• I = {CD player, music CD, music book}
• |D| = 4
• # of transactions containing both CD player and music CD = 2
• # of transactions containing CD player = 3
• CD player => music CD (sup = 2/4, conf = 2/3)
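A short check of these numbers (a sketch of mine, not lecture code):

```python
# Support and confidence for the CD-player example (illustrative sketch).
transactions = [
    {"CD player", "music CD", "music book"},
    {"CD player", "music CD"},
    {"music CD", "music book"},
    {"CD player"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    return support(X | Y) / support(X)

print(support({"CD player", "music CD"}))        # 0.5       -> sup  = 2/4
print(confidence({"CD player"}, {"music CD"}))   # 0.666...  -> conf = 2/3
```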
Association Rule
• How are association rules mined from large databases?
• Two-step process:
– find all frequent itemsets
– generate strong association rules from the frequent itemsets
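A simplified level-wise enumeration of the first step for the CD-player transactions, under a 50% minsup (my sketch; the real Apriori algorithm builds candidate k-itemsets by joining frequent (k-1)-itemsets, which this toy version replaces with plain enumeration):

```python
# Find all frequent itemsets level by level (simplified sketch, not full Apriori).
from itertools import combinations

transactions = [{"CD player", "music CD", "music book"},
                {"CD player", "music CD"},
                {"music CD", "music book"},
                {"CD player"}]
minsup = 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
frequent, k = [], 1
while True:
    level = [set(c) for c in combinations(items, k) if support(set(c)) >= minsup]
    if not level:
        break
    frequent += level
    items = sorted(set().union(*level))   # keep only items still in some frequent k-itemset
    k += 1

for itemset in frequent:
    print(sorted(itemset), support(itemset))
```

Step two would then keep, for each frequent itemset, only the rules X => Y whose confidence reaches minconf.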
Association Rules
• antecedent => consequent
– if <antecedent> then <consequent>
– beer => diaper (Walmart)
– economy bad => higher unemployment
– higher unemployment => higher unemployment benefits cost
• Rules are associated with a population, a support, and a confidence
Association Rules
• Population: instances such as grocery store
purchases
• Support
– % of population satisfying antecedent and consequent
• Confidence
– % consequent true when antecedent true
2. Association rules: Support
Every association rule has a support and a confidence.
“The support is the percentage of transactions that demonstrate the rule.”
Example: Database with transactions (customer_# : item_a1, item_a2, …)
1: 1, 3, 5.
2: 1, 8, 14, 17, 12.
3: 4, 6, 8, 12, 9, 104.
4: 2, 1, 8.
support {8,12} = 2 (or 50%: 2 of 4 customers)
support {1,5} = 1 (or 25%: 1 of 4 customers)
support {1} = 3 (or 75%: 3 of 4 customers)
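These counts are easy to verify mechanically; a tiny sketch (mine):

```python
# Count how many of the four transactions contain each itemset (sketch).
transactions = [{1, 3, 5}, {1, 8, 14, 17, 12}, {4, 6, 8, 12, 9, 104}, {2, 1, 8}]

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

print(support_count({8, 12}))   # 2  (50%)
print(support_count({1, 5}))    # 1  (25%)
print(support_count({1}))       # 3  (75%)
```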
2. Association rules: Support
An itemset is called frequent if its support is equal to or greater than an agreed-upon minimal value, the support threshold.
Adding to the previous example:
if the threshold is 50%,
then the itemsets {8,12} and {1} are called frequent.
2. Association rules: Confidence
Every association rule has a support and a confidence.
An association rule is of the form: X => Y
• X => Y: if someone buys X, he also buys Y
The confidence is the conditional probability that, given X present in a transaction, Y will also be present.
Confidence measure, by definition:
confidence(X => Y) = support(X, Y) / support(X)
2. Association rules: Confidence
We should only consider rules derived from itemsets with high support that also have high confidence.
“A rule with low confidence is not meaningful.”
Rules do not explain anything; they just point out hard facts in data volumes.
3. Example
Example: Database with transactions (customer_# : item_a1, item_a2, …)
1: 3, 5, 8.
2: 2, 6, 8.
3: 1, 4, 7, 10.
4: 3, 8, 10.
5: 2, 5, 8.
6: 1, 5, 6.
7: 4, 5, 6, 8.
8: 2, 3, 4.
9: 1, 5, 7, 8.
10: 3, 8, 9, 10.
conf( {5} => {8} ) ?
supp({5}) = 5, supp({8}) = 7, supp({5,8}) = 4,
then conf( {5} => {8} ) = 4/5 = 0.8 or 80%
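A quick check of this computation (sketch, not lecture code):

```python
# conf({5} => {8}) over the ten transactions above (illustrative sketch).
transactions = [{3, 5, 8}, {2, 6, 8}, {1, 4, 7, 10}, {3, 8, 10}, {2, 5, 8},
                {1, 5, 6}, {4, 5, 6, 8}, {2, 3, 4}, {1, 5, 7, 8}, {3, 8, 9, 10}]

def supp(itemset):
    return sum(itemset <= t for t in transactions)

print(supp({5}), supp({8}), supp({5, 8}))   # 5 7 4
print(supp({5, 8}) / supp({5}))             # conf({5} => {8}) = 0.8
```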
3. Example
Example: Database with transactions (customer_# : item_a1, item_a2, …) (the same ten transactions as above)
conf( {5} => {8} ) ? 80% Done. conf( {8} => {5} ) ?
supp({5}) = 5, supp({8}) = 7, supp({5,8}) = 4,
then conf( {8} => {5} ) = 4/7 = 0.57 or 57%
3. Example
conf( {5} => {8} ) ? 80% Done.
conf( {8} => {5} ) ? 57% Done.
Rule ( {5} => {8} ) is more meaningful than Rule ( {8} => {5} ).
3. Example
Example: Database with transactions (customer_# : item_a1, item_a2, …) (the same ten transactions as above)
conf( {9} => {3} ) ?
supp({9}) = 1, supp({3}) = 4, supp({3,9}) = 1,
then conf( {9} => {3} ) = 1/1 = 1.0 or 100%. OK?
3. Example
conf( {9} => {3} ) = 100%. Done.
Notice: high confidence, low support.
-> Rule ( {9} => {3} ) is not meaningful.
Association Rules
• Population
– MS, MSA, MSB, MA, MB, BA
– M = Milk, S = Soda, A = Apple, B = Beer
• Support (M => S) = 3/6
– (MS, MSA, MSB) / (MS, MSA, MSB, MA, MB, BA)
• Confidence (M => S) = 3/5
– (MS, MSA, MSB) / (MS, MSA, MSB, MA, MB)