Lecture 4: kNN, Decision Trees
Data Mining methods as an artificial intelligence tool
Agnieszka Nowak - Brzezinska
Decision Trees, k-Nearest Neighbor, and Basket Analysis
Lecture 4
BASKET ANALYSIS
• Data mining (the advanced analysis step of the
"Knowledge Discovery in Databases" process, or KDD),
an interdisciplinary subfield of computer science, is the
computational process of discovering patterns in large
data sets involving methods at the intersection of
artificial intelligence, machine learning, statistics, and
database systems.
• The overall goal of the data mining process is to extract
information from a data set and transform it into an
understandable structure for further use.
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami in the context of frequent itemsets and association rule mining
• Motivation: finding inherent regularities in data
  – What products were often purchased together? Beer and diapers?!
  – What are the subsequent purchases after buying a PC?
  – What kinds of DNA are sensitive to this new drug?
  – Can we automatically classify web documents?
• Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.
Support and Confidence - Example
• What is the support and confidence of the following rules?
• {Beer} → {Bread}
• {Bread, PeanutButter} → {Jelly}

support(X → Y) = support(X ∪ Y)
confidence(X → Y) = support(X ∪ Y) / support(X)
Association Rule Mining Problem Definition
• Given a set of transactions T = {t1, t2, …, tn} and two thresholds, minsup and minconf,
• Find all association rules X → Y with support ≥ minsup and confidence ≥ minconf
• I.e., we want rules with high confidence and support
• We call these rules interesting
• We would like to:
  • Design an efficient algorithm for mining association rules in large data sets
  • Develop an effective approach for distinguishing interesting rules from spurious ones
Basic Concepts: Frequent Patterns

Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both.]

• itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• (absolute) support, or support count, of X: frequency or number of occurrences of the itemset X
• (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
• An itemset X is frequent if X's support is no less than a minsup threshold
Basic Concepts: Association Rules

Tid | Items bought
10 | Beer, Nuts, Diaper
20 | Beer, Coffee, Diaper
30 | Beer, Diaper, Eggs
40 | Nuts, Eggs, Milk
50 | Nuts, Coffee, Diaper, Eggs, Milk

• Find all the rules X → Y with minimum support and confidence
  – support, s: probability that a transaction contains X ∪ Y
  – confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Frequent patterns: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules (many more!):
  Beer → Diaper (60%, 100%)
  Diaper → Beer (60%, 75%)
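As a quick check of these numbers, here is a minimal Python sketch (not part of the original slides) that computes support and confidence for the beer/diaper rules directly from the five transactions above:

```python
# Support and confidence on the five-transaction example above.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(X u Y) / support(X) for the rule X -> Y."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6  -> 60% support
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> 100% confidence
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> 75% confidence
```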
Measures of Predictive Ability
Support refers to the percentage of baskets where
the rule was true (both left and right-side
products were present).
Confidence measures what percentage of baskets
that contained the left-hand product also
contained the right-hand product.
Lift measures how many times Confidence is larger
than the expected (baseline) Confidence. A lift
value that is greater than 1 is desirable.
Support and Confidence: An Illustration

Baskets: {A, B, C}, {A, C, D}, {B, C, D}, {A, D, E}, {B, C, E}

Rule       | Support | Confidence | Lift
A → D      | 2/5     | 2/3        | 2
C → A      | 2/5     | 2/4        | 1
A → C      | 2/5     | 2/3        | 2
B & C → D  | 1/5     | 1/3        | 0.50
Problem Decomposition
1. Find all sets of items that have minimum
support (frequent itemsets)
2. Use the frequent itemsets to generate the
desired rules
Problem Decomposition – Example

Transaction ID | Items Bought
1 | Shoes, Shirt, Jacket
2 | Shoes, Jacket
3 | Shoes, Jeans
4 | Shirt, Sweatshirt

Frequent Itemset  | Support
{Shoes}           | 75%
{Shirt}           | 50%
{Jacket}          | 50%
{Shoes, Jacket}   | 50%

For min support = 50% (= 2 transactions) and min confidence = 50%:

For the rule Shoes → Jacket:
• Support = sup({Shoes, Jacket}) = 50%
• Confidence = sup({Shoes, Jacket}) / sup({Shoes}) = 50% / 75% = 66.6%

Jacket → Shoes has 50% support and 100% confidence.
The Apriori Algorithm — Example
Min support = 50% (= 2 transactions)

Database D:
TID | Items
100 | 1 3 4
200 | 2 3 5
300 | 1 2 3 5
400 | 2 5

Scan D → C1:
itemset | sup
{1} | 2
{2} | 3
{3} | 3
{4} | 1
{5} | 3

L1:
itemset | sup
{1} | 2
{2} | 3
{3} | 3
{5} | 3

C2 (after scanning D):
itemset | sup
{1 2} | 1
{1 3} | 2
{1 5} | 1
{2 3} | 2
{2 5} | 3
{3 5} | 2

L2:
itemset | sup
{1 3} | 2
{2 3} | 2
{2 5} | 3
{3 5} | 2

C3: {2 3 5}; scan D → L3:
itemset | sup
{2 3 5} | 2
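The level-wise search in this example can be sketched compactly in Python. This is a minimal illustration of the generate-and-test loop, not a full Apriori implementation (a complete version would also prune candidates whose (k-1)-subsets are infrequent and would then derive association rules):

```python
# Sketch of Apriori's level-wise search on database D above.
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
min_support_count = 2  # 50% of 4 transactions

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

def apriori():
    # L1: frequent 1-itemsets
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i])) >= min_support_count]
    frequent = list(level)
    k = 2
    while level:
        # Candidate generation: join the previous level with itself,
        # keeping only unions of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Keep only candidates that meet the minimum support count.
        level = [c for c in candidates if support_count(c) >= min_support_count]
        frequent.extend(level)
        k += 1
    return frequent

for itemset in apriori():
    print(sorted(itemset), support_count(itemset))
# Ends with the frequent 3-itemset {2, 3, 5}, support count 2, as in L3 above.
```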
KNN
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions
as legitimate or fraudulent
• Classifying secondary structures of protein
as alpha-helix, beta-sheet, or random
coil
• Categorizing news stories as finance,
weather, entertainment, sports, etc
KNN - Definition
KNN is a simple algorithm that stores
all available cases and classifies new
cases based on a similarity measure
KNN – different names
• K-Nearest Neighbors
• Memory-Based Reasoning
• Example-Based Reasoning
• Instance-Based Learning
• Case-Based Reasoning
• Lazy Learning
KNN – Short History
• Nearest neighbors have been used in statistical estimation and pattern recognition since the beginning of the 1970s (non-parametric techniques).
• People reason by remembering and learn by doing.
• Thinking is reminding, making analogies.

• The k-Nearest Neighbors (kNN) method provides a simple approach to calculating predictions for unknown observations.
• It calculates a prediction by looking at similar observations and uses some function of their response values, such as an average, to make the prediction.
• Like all prediction methods, it starts with a training set, but instead of producing a mathematical model it determines the optimal number of similar observations to use in making the prediction.
• During the learning phase, the best number of similar observations (k) is chosen.
+/- of kNN
+:
• Noise: kNN is relatively insensitive to errors or
outliers in the data.
• Large sets: kNN can be used with large training
sets.
-:
• Speed: kNN can be computationally slow when it
is applied to a new data set since a similar score
must be generated between the observations
presented to the model and every member of the
training set.
• A kNN model uses the k most similar neighbors to
the observation to calculate a prediction.
• Where a response variable is continuous, the
prediction is the mean of the nearest neighbors.
• Where a response variable is categorical, the
prediction could be presented as a mean or a
voting scheme could be used, that is, select the
most common classification term.
http://people.revoledu.com/kardi/tutorial/KNN/index.html
K Nearest Neighbor (KNN):
• The training set includes classes.
• Examine the K items nearest to the item to be classified.
• The new item is placed in the class with the largest number of these close items.
• O(q) for each tuple to be classified. (Here q is the size of the training set.)
KNN
The test sample (green
circle) should be classified
either to the first class of
blue squares or to the
second class of red
triangles.
If k = 3 it is assigned to the
second class because there
are 2 triangles and only 1
square inside the inner
circle.
If k = 5 it is assigned to the
first class (3 squares vs. 2
triangles inside the outer
circle).
K-nearest neighbor algorithm
Assumptions:
• We have a training set of observations in which each element belongs to one of the given classes (Y).
• We have some new observation for which we do not know the class, and we want to find it using the kNN algorithm.
To calculate the distance from A (2,3) to B (7,8):
[Figure: points A and B plotted in the plane.]
D(A,B) = sqrt((7-2)^2 + (8-3)^2) = sqrt(25 + 25) = sqrt(50) = 7.07

If we have 3 points A(2,3), B(7,8) and C(5,1):
[Figure: points A, B and C plotted in the plane.]
D(A,B) = sqrt((7-2)^2 + (8-3)^2) = sqrt(25 + 25) = sqrt(50) = 7.07
D(A,C) = sqrt((5-2)^2 + (1-3)^2) = sqrt(9 + 4) = sqrt(13) = 3.61
D(B,C) = sqrt((7-5)^2 + (8-1)^2) = sqrt(4 + 49) = sqrt(53) = 7.28
K-NN
• Step 1: find the k nearest neighbors of the given object.
• Step 2: choose the class from the neighbors (choose the class which is most frequent).
[Figure: a new case is classified with k = 3 and again with k = 5; the assigned class is the majority class among the selected neighbors.]
What if we have more dimensions?

    V1   V2   V3   V4   V5
A   0.7  0.8  0.4  0.5  0.2
B   0.6  0.8  0.5  0.4  0.2
C   0.8  0.9  0.7  0.8  0.9

D(A,B) = sqrt((0.7-0.6)^2 + (0.8-0.8)^2 + (0.4-0.5)^2 + (0.5-0.4)^2 + (0.2-0.2)^2) = sqrt(0.01 + 0.01 + 0.01) = sqrt(0.03) = 0.17
D(A,C) = sqrt((0.7-0.8)^2 + (0.8-0.9)^2 + (0.4-0.7)^2 + (0.5-0.8)^2 + (0.2-0.9)^2) = sqrt(0.01 + 0.01 + 0.09 + 0.09 + 0.49) = sqrt(0.69) = 0.83
D(B,C) = sqrt((0.6-0.8)^2 + (0.8-0.9)^2 + (0.5-0.7)^2 + (0.4-0.8)^2 + (0.2-0.9)^2) = sqrt(0.04 + 0.01 + 0.04 + 0.16 + 0.49) = sqrt(0.74) = 0.86
We are looking for the smallest distance. (A & B).
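The same calculation can be packed into a short function; a minimal Python sketch using the three five-dimensional points above:

```python
import math

# Euclidean distance between the five-dimensional points A, B and C above.
points = {
    "A": [0.7, 0.8, 0.4, 0.5, 0.2],
    "B": [0.6, 0.8, 0.5, 0.4, 0.2],
    "C": [0.8, 0.9, 0.7, 0.8, 0.9],
}

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# All pairwise distances; the smallest is D(A, B) = 0.17.
for a, b in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(a, b, round(euclidean(points[a], points[b]), 2))
```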
kNN advantages and disadvantages:
Advantages:
• Noise: kNN is relatively insensitive to errors or outliers in the data.
• Large sets: kNN can be used with large training sets.
Disadvantage:
• Speed: kNN can be computationally slow when it is applied to a new
data set since a similar score must be generated between the
observations presented to the model and every member of the
training set.
SSE
To assess the different values of k, the sum of squares of error (SSE) evaluation criterion is used: SSE is the sum, over the assessed observations, of (actual value - predicted value)^2. Smaller SSE values indicate that the predictions are closer to the actual values, and the SSE criterion is used to assess the quality of each candidate model.

Table for detecting the best value of k
• The Euclidean distance calculation was selected to represent the distance between observations. To calculate an optimal value for k, different values of k were tried between 2 and 20.
In this example, the value of k with the lowest SSE value is 6, and this value is selected for use with the kNN model.
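As an illustration of this selection loop, here is a minimal Python sketch: for each candidate k, every observation is predicted from its k nearest neighbors with that observation held out, and the squared errors are summed. The tiny one-dimensional data set below is invented for the example and is not the car data set used later:

```python
# Illustrative data: (x, y) pairs; y is the continuous response.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1),
        (5.0, 9.8), (6.0, 12.2), (7.0, 13.9), (8.0, 16.1)]

def knn_predict(x, train, k):
    """Average the responses of the k training points closest to x."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in neighbors) / k

def sse_for_k(k):
    total = 0.0
    for i, (x, y) in enumerate(data):
        train = data[:i] + data[i + 1:]        # hold this observation out
        total += (y - knn_predict(x, train, k)) ** 2
    return total

for k in range(2, 6):
    print(k, round(sse_for_k(k), 3))  # pick the k with the smallest SSE
```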
Observation to be predicted
• To illustrate, a data set of cars will be used and a model built to test
the car fuel efficiency (MPG).
• The following variables will be used as descriptors within the
model: Cylinders, Displacement, Horsepower, Weight, Acceleration,
Model Year and Origin.
Predicting
• Once a value for k has been set in the training phase, the model can
now be used to make predictions.
• For example, an observation x has values for the descriptor
variables but not for the response. Using the same technique for
determining similarity as used in the model building phase,
observation x is compared against all observations in the training
set.
• A distance is computed between x and each training set
observation. The closest k observations are selected and a
prediction is made, for example, using the average value
The observation (Dodge Aspen) was presented to the kNN model built to predict car fuel efficiency (MPG). The Dodge Aspen observation was compared to all observations in the training set and a Euclidean distance was computed.
The six observations with the smallest distance scores are selected, as shown in the table. The prediction is the average of these top six observations, that is, 19.5.
The cross-validated prediction is shown alongside the actual value.
Nearest Neighbor Classification
• Input
  • A set of stored records
  • k: the number of nearest neighbors
• Output: the class label of the unknown record
• Compute the distance: d(p, q) = sqrt( sum_i (p_i - q_i)^2 )
• Identify the k nearest neighbors
• Determine the class label of the unknown record based on the class labels of its nearest neighbors (i.e. by taking a majority vote), as sketched below
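A minimal Python sketch of these three steps on a small invented 2-D data set (the points and labels are illustrative only):

```python
import math
from collections import Counter

# Stored records: (point, class label).
records = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
           ((3.0, 3.2), "B"), ((3.1, 2.9), "B")]

def classify(query, k=3):
    # 1. Compute the distance from the query to every stored record.
    by_distance = sorted(records, key=lambda r: math.dist(query, r[0]))
    # 2. Identify the k nearest neighbors.
    k_nearest = by_distance[:k]
    # 3. Majority vote over their class labels.
    votes = Counter(label for _, label in k_nearest)
    return votes.most_common(1)[0][0]

print(classify((1.0, 0.9)))  # -> "A"
```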
K Nearest Neighbors
• K Nearest Neighbors
– Advantage
• Simple
• Powerful
• Requires no training time
– Disadvantage
• Memory intensive
• Classification/estimation is slow
KNN = k nearest neighbors
[Figure: samples plotted by Gene 1 (x axis) and Gene 2 (y axis), with an unlabeled point marked "?". If red = brain tumor and yellow = healthy – do I have a brain tumor?]
KNN is another method for classification. For each point it looks at its k nearest neighbors.
KNN = k nearest neighbors
[Figure: the same Gene 1 / Gene 2 plot as above.]
For each point it looks at its k nearest neighbors. For example, with k = 3 the method looks at a point's 3 nearest neighbors to decide how to classify it. If the majority of them are red, it will classify the point as red.
KNN - exercise
[Figure: the same Gene 1 / Gene 2 plot with an unlabeled point "?".]
In the above example – how will the point be classified by KNN with k = 1?
KNN Classification
[Figure: scatter plot of Loan ($0 to $250,000, y axis) against Age (0 to 70, x axis), with Default and Non-Default customers marked.]
KNN Classification – Distance

Age | Loan     | Default | Distance
25  | $40,000  | N | 102,000
35  | $60,000  | N | 82,000
45  | $80,000  | N | 62,000
20  | $20,000  | N | 122,000
35  | $120,000 | N | 22,000
52  | $18,000  | N | 124,000
23  | $95,000  | Y | 47,000
40  | $62,000  | Y | 80,000
60  | $100,000 | Y | 42,000
48  | $220,000 | Y | 78,000
33  | $150,000 | Y | 8,000
48  | $142,000 | ? |

D = sqrt((x1 - x2)^2 + (y1 - y2)^2)
KNN Classification – Standardized Distance

Age   | Loan | Default | Distance
0.125 | 0.11 | N | 0.7652
0.375 | 0.21 | N | 0.5200
0.625 | 0.31 | N | 0.3160
0     | 0.01 | N | 0.9245
0.375 | 0.50 | N | 0.3428
0.8   | 0.00 | N | 0.6220
0.075 | 0.38 | Y | 0.6669
0.5   | 0.22 | Y | 0.4437
1     | 0.41 | Y | 0.3650
0.7   | 1.00 | Y | 0.3861
0.325 | 0.65 | Y | 0.3771
0.7   | 0.61 | ? |

Xs = (X - Min) / (Max - Min)
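A minimal Python sketch of this standardization, using the Age and Loan columns from the unstandardized table; the rescaled query row comes out as roughly (0.7, 0.61), and the resulting distances match the Distance column above up to rounding:

```python
import math

# Min-max standardization Xs = (X - min) / (max - min) on the Age/Loan data.
ages  = [25, 35, 45, 20, 35, 52, 23, 40, 60, 48, 33]
loans = [40000, 60000, 80000, 20000, 120000, 18000,
         95000, 62000, 100000, 220000, 150000]
query_age, query_loan = 48, 142000

def rescale(value, column):
    return (value - min(column)) / (max(column) - min(column))

rows = [(rescale(a, ages), rescale(l, loans)) for a, l in zip(ages, loans)]
query = (rescale(query_age, ages), rescale(query_loan, loans))  # ~(0.70, 0.61)

for row in rows:
    print(round(math.dist(row, query), 4))  # compare with the Distance column
```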
KNN Regression - Distance

Age | Loan     | House Price Index | Distance
25  | $40,000  | 135 | 102,000
35  | $60,000  | 256 | 82,000
45  | $80,000  | 231 | 62,000
20  | $20,000  | 267 | 122,000
35  | $120,000 | 139 | 22,000
52  | $18,000  | 150 | 124,000
23  | $95,000  | 127 | 47,000
40  | $62,000  | 216 | 80,000
60  | $100,000 | 139 | 42,000
48  | $220,000 | 250 | 78,000
33  | $150,000 | 264 | 8,000
48  | $142,000 | ?   |

D = sqrt((x1 - x2)^2 + (y1 - y2)^2)
KNN Regression – Standardized Distance

Age   | Loan | House Price Index | Distance
0.125 | 0.11 | 135 | 0.7652
0.375 | 0.21 | 256 | 0.5200
0.625 | 0.31 | 231 | 0.3160
0     | 0.01 | 267 | 0.9245
0.375 | 0.50 | 139 | 0.3428
0.8   | 0.00 | 150 | 0.6220
0.075 | 0.38 | 127 | 0.6669
0.5   | 0.22 | 216 | 0.4437
1     | 0.41 | 139 | 0.3650
0.7   | 1.00 | 250 | 0.3861
0.325 | 0.65 | 264 | 0.3771
0.7   | 0.61 | ?   |

Xs = (X - Min) / (Max - Min)
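A minimal Python sketch of the regression step: take the k rows closest to the query and average their House Price Index. With k = 3 and the unstandardized distances above, the three nearest rows are those at distances 8,000, 22,000 and 42,000:

```python
# kNN regression: the prediction is the mean response of the k nearest rows.
rows = [  # (Age, Loan, House Price Index)
    (25, 40000, 135), (35, 60000, 256), (45, 80000, 231), (20, 20000, 267),
    (35, 120000, 139), (52, 18000, 150), (23, 95000, 127), (40, 62000, 216),
    (60, 100000, 139), (48, 220000, 250), (33, 150000, 264),
]
query = (48, 142000)

def distance(row):
    return ((row[0] - query[0]) ** 2 + (row[1] - query[1]) ** 2) ** 0.5

k = 3
nearest = sorted(rows, key=distance)[:k]
prediction = sum(hpi for _, _, hpi in nearest) / k
print(round(prediction, 1))  # mean House Price Index of the 3 nearest rows
```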
KNN – Number of Neighbors
• If K=1, select the nearest neighbor
• If K>1,
– For classification select the most frequent
neighbor.
– For regression calculate the average of K
neighbors.
Distance – Categorical Variables

X    | Y      | Distance
Male | Male   | 0
Male | Female | 1

If x = y then D = 0
If x ≠ y then D = 1
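This rule is a one-liner in code; a small Python sketch:

```python
# Distance for categorical values: 0 if they match, 1 otherwise.
def categorical_distance(x, y):
    return 0 if x == y else 1

print(categorical_distance("Male", "Male"))    # 0
print(categorical_distance("Male", "Female"))  # 1
```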
DECISION TREES
Example of a Decision Tree

Training Data:
Tid | Refund | Marital Status | Taxable Income | Cheat
1  | Yes | Single   | 125K | No
2  | No  | Married  | 100K | No
3  | No  | Single   | 70K  | No
4  | Yes | Married  | 120K | No
5  | No  | Divorced | 95K  | Yes
6  | No  | Married  | 60K  | No
7  | Yes | Divorced | 220K | No
8  | No  | Single   | 85K  | Yes
9  | No  | Married  | 75K  | No
10 | No  | Single   | 90K  | Yes

Model: Decision Tree (splitting attributes Refund, MarSt, TaxInc):
Refund?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
Another Example of Decision Tree
(Same training data as above.)

MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task

Training Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
1  | Yes | Large  | 125K | No
2  | No  | Medium | 100K | No
3  | No  | Small  | 70K  | No
4  | Yes | Medium | 120K | No
5  | No  | Large  | 95K  | Yes
6  | No  | Medium | 60K  | No
7  | Yes | Large  | 220K | No
8  | No  | Small  | 85K  | Yes
9  | No  | Medium | 75K  | No
10 | No  | Small  | 90K  | Yes

Test Set:
Tid | Attrib1 | Attrib2 | Attrib3 | Class
11 | No  | Small  | 55K  | ?
12 | Yes | Medium | 80K  | ?
13 | Yes | Large  | 110K | ?
14 | No  | Small  | 95K  | ?
15 | No  | Large  | 67K  | ?

Training Set → Induction (Tree Induction algorithm) → Learn Model → Model (Decision Tree)
Test Set → Apply Model → Deduction → Class
Apply Model to Test Data

Test Data:
Refund | Marital Status | Taxable Income | Cheat
No | Married | 80K | ?

Start from the root of the tree:
Refund = No, so follow the No branch to MarSt.
MarSt = Married, so follow the Married branch, which ends in the leaf NO.
Assign Cheat to "No".
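The fitted tree is just a sequence of tests, so applying it can be written as nested conditionals; a minimal Python sketch of the tree above applied to the test record:

```python
# The decision tree above written as nested conditionals.
def predict_cheat(refund, marital_status, taxable_income):
    if refund == "Yes":
        return "No"
    # Refund = No
    if marital_status == "Married":
        return "No"
    # Single or Divorced: test taxable income
    return "No" if taxable_income < 80_000 else "Yes"

# Test record: Refund = No, Marital Status = Married, Taxable Income = 80K
print(predict_cheat("No", "Married", 80_000))  # -> "No"
```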
Decision Tree Classification Task
(The induction/deduction workflow and the training and test sets are the same as shown earlier.)
Decision tree
[Figure: a small decision tree whose root tests "distance < 20 km" (yes / no) and whose inner node tests "weather" (sunny / rainy).]
Concepts: root, inner node, leaf, edges
Decision tree construction
[Figure: a two-dimensional data set with classes 1 and 2 is split by axis-parallel thresholds a1, a2 and a3 on the attributes y1 and y2; each partition of the plane corresponds to a decision tree built from tests such as y2 < a1, y1 < a2 and y1 < a3.]
Partition for a node
1. Quantitative data: comparison with some threshold value, i.e. a test of the form yi > ai with "yes" and "no" branches.
2. Qualitative data: each possible value has to be used, i.e. the node branches on yi into one edge per value yi1, yi2, …, yik.

The partition of a node for qualitative data
1. For each attribute yi calculate the value of some given measure.
2. Choose the attribute which is optimal in the sense of the chosen measure.
3. From the given node create a number of edges equal to the number of values of attribute yi.
[Figure: node t branches on the values yi1, yi2, …, yik of attribute yi into child nodes t1, t2, …, tk.]
• Decision trees are often generated by hand to
precisely and consistently define a decision
making process.
• However, they can also be generated
automatically from the data.
• They consist of a series of decision points
based on certain variables
Splitting Criteria - Dividing Observations
• It is common for the split at each level to be a
two-way split.
• There are methods that split more than two
ways.
• However, care should be taken using these
methods since splitting the set in many ways
early in the construction of the tree may result in
missing interesting relationships that become
exposed as the tree growing process continues.
Any variable type can be split using a
two-way split:
• Dichotomous: Variables with two values are the most
straightforward to split since each branch represents a specific
value. For example, a variable Temperature may have only two
values, hot and cold. Observations will be split based on those with
hot and those with cold temperature values.
• Nominal: Since nominal values are discrete values with no order, a
two-way split is accomplished with one subset being comprised of a
set of observations that equal a certain value and the other subset
being those observations that do not equal that value. For example,
a variable Color that can take the values red, green, blue, and black
may be split two-ways. Observations which have Color
equaling red, for example, generate one subset and those not equaling red
create the other subset, that is, green, blue and black.
Ordinal: In the case where a variable's discrete values are ordered, the
resulting subsets may be made up of more than one value, as long as
the ordering is retained. For example, a variable Quality with possible
values low, medium, high, and excellent may be split two-ways in three
possible ways. For example, observations equaling low or medium go in one
subset and observations equaling high or excellent in another subset.
Another example is where low values are in one set and medium, high,
and excellent values are in the other set.
Continuous: For variables with continuous values to be split two-ways,
a specific cutoff value needs to be determined, where on one side of
the split are values less than the cutoff and on the other side of the
split are values greater than or equal to the cutoff. For example, a
variable Weight which can take any value between 0 and 1,000 with a
selected cutoff of 200. The first subset would be those observations
where the Weight is below 200 and the other subset would be those
observations where the Weight is greater than or equal to 200.
A splitting criterion has two
components:
• (1) the variable to split on and
• (2) values of the variable to split on.
To determine the best split, all possible splits of
all variables must be considered. Since it is
necessary to rank the splits, a score should be
calculated for each split.
There are many ways to rank the split.
The following describes two approaches for prioritizing splits, based on
whether the response is categorical or continuous.
• The objective for an optimal split is to create
subsets which results in observations with a
single response value. In this example, there are
20 observations prior to splitting.
• The response variable (Temperature) has two
possible values, hot and cold. Prior to the split,
the response has an even distribution with the
number of observations where the Temperature
equals hot is ten and with the number of
observations where the Temperature equals cold
is also ten.
• Different criteria are considered for splitting these observations which
results in different distributions of the response variables for each subset
(N2 and N3):
• Split a: Each subset contains ten observations. All ten observations in N2
have hot temperature values, whereas the ten observations in node N3
are all cold.
• Split b: Again each subset (N2 and N3) contains ten observations.
However, in this example there is an even distribution of hot and cold
values in each subset.
• Split c: In this case the splitting criterion results in two subsets where
node N2 has nine observations (one hot and eight cold) and node N3 has
11 observations (nine hot and two cold).
• Split a is the best split since each node contains observations where
the response is one or the other category.
• Split b results in the same even split of hot and cold values (50%
hot, 50% cold) in each of the resulting nodes (N2 and N3) and
would not be considered a good split.
• Split c is a good split; however, this split is not so clean as split a
since there are values of both hot and cold in both subsets.
• The proportion of hot and cold values is biased, in node N2 towards
cold values and in N3 towards hot values. When determining the
best splitting criteria, it is important to determine how clean each
split is, based on the proportion of the different categories of the
response variable (or impurity).
• S is a sample of training examples
• p is the proportion of positive examples in S
• Entropy measures the impurity of S
• Entropy(S) = -p log2(p) - (1-p) log2(1-p)
misclassification, Gini, and entropy
• There are three primary methods for calculating impurity:
misclassification, Gini, and entropy.
• In scenario 1, all ten observations have value cold whereas
in scenario 2, one observation has value hot and nine
observations have value cold.
• For each scenario, an entropy score is calculated.
• Cleaner splits result in lower scores.
• In scenario 1 and scenario 11, the split cleanly breaks the
set into observations with only one value. The score for
these scenarios is 0. In scenario 6, the observations are split
evenly across the two values and this is reflected in a score
of 1. In other cases, the score reflects how well the two
values are split.
• In order to determine the best split, we now need to calculate a ranking based on how cleanly each split separates the response data.
• This is calculated on the basis of the impurity before and after the split.
• The formula for this calculation, Gain, is shown below:

  Gain = Entropy(parent) - sum over j = 1..k of (N(vj) / N) * Entropy(vj)

• N is the number of observations in the parent node,
• k is the number of possible resulting nodes, and
• N(vj) is the number of observations in the jth child node,
• where vj is the set of observations for the jth node.
• It should be noted that the Gain formula can be used with other impurity methods by replacing the entropy calculation.
ID3 Algorithm
• The ID3 algorithm is considered a very simple decision tree algorithm (Quinlan, 1986).
• ID3 uses information gain as its splitting criterion.
• The growing stops when all instances belong to a single value of the target feature or when the best information gain is not greater than zero.
• ID3 does not apply any pruning procedures, nor does it handle numeric attributes or missing values.
C4.5 Algorithm
• C4.5 is an evolution of ID3, presented by the same author (Quinlan, 1993).
• It uses gain ratio as its splitting criterion.
• The splitting ceases when the number of instances to be split is below a certain threshold.
• Error-based pruning is performed after the growing phase. C4.5 can handle numeric attributes.
• It can induce from a training set that incorporates missing values by using a corrected gain ratio criterion.
Example: Decision Tree for PlayTennis
Example: Data for PlayTennis
Decision Tree for PlayTennis
3.4 The Basic Decision Tree Learning
Algorithm
• Main loop:
1. A ← the "best" decision attribute for the next node
2. Assign A as decision attribute for node
3. For each value of A, create new descendant of
node
4. Sort training examples to leaf nodes
5. If training examples perfectly classified, Then
STOP, Else iterate over new leaf nodes
• Which attribute is best?
Entropy
S is a sample of training examples
p⊕ is the proportion of positive examples in S
p⊖ is the proportion of negative examples in S
Entropy measures the impurity of S
Entropy(S) = - p⊕ log2 p⊕ - p⊖ log2 p⊖
Information Gain
Gain(S, A) = expected reduction in entropy due to
sorting on A
Training Examples
Selecting the Next Attribute(1/2)
Which attribute is the best classifier?
Selecting the Next Attribute(2/2)
Ssunny = {D1,D2,D8,D9,D11}
Gain (Ssunny , Humidity) = .970 - (3/5) 0.0 - (2/5) 0.0 = .970
Gain (Ssunny , Temperature) = .970 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = .570
Gain (Ssunny, Wind) = .970 - (2/5) 1.0 - (3/5) .918 = .019
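A minimal Python sketch of the entropy and gain calculations behind these numbers. The PlayTennis table itself is not reproduced here, so the Humidity values below are simply chosen to be consistent with the split quoted above (three High examples, all negative, and two Normal examples, both positive):

```python
import math

def entropy(labels):
    """Entropy(S) = -sum over label values v of p_v * log2(p_v)."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def gain(labels, attribute_values):
    """Entropy before the split minus the weighted entropy after it."""
    total = entropy(labels)
    for v in set(attribute_values):
        subset = [l for l, a in zip(labels, attribute_values) if a == v]
        total -= (len(subset) / len(labels)) * entropy(subset)
    return total

# S_sunny: 2 positive and 3 negative examples.
play     = ["+", "+", "-", "-", "-"]
humidity = ["Normal", "Normal", "High", "High", "High"]
print(round(entropy(play), 3))         # 0.971 (the .970 above, rounded)
print(round(gain(play, humidity), 3))  # 0.971: a perfectly clean split
```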
Converting A Tree to Rules
IF (Outlook = Sunny) ∧ (Humidity = High)
THEN PlayTennis = No
IF (Outlook = Sunny) ∧ (Humidity = Normal)
THEN PlayTennis = Yes
….
Factors Affecting Sunburn

Name  | Hair   | Height  | Weight  | Lotion | Result
Sarah | blonde | average | light   | no  | positive
Dana  | blonde | tall    | average | yes | negative
Alex  | brown  | short   | average | yes | negative
Annie | blonde | short   | average | no  | positive
Emily | red    | average | heavy   | no  | positive
Peter | brown  | tall    | heavy   | no  | negative
John  | brown  | average | heavy   | no  | negative
Katie | blonde | short   | light   | yes | negative
Phase 1: From Data to Tree
Perform average entropy calculations on the complete data set for each of the four attributes:

Hair   (b1 = blonde, b2 = red, b3 = brown):    Average Entropy = 0.50
Height (b1 = short, b2 = average, b3 = tall):  Average Entropy = 0.69
Weight (b1 = light, b2 = average, b3 = heavy): Average Entropy = 0.94
Lotion (b1 = no, b2 = yes):                    Average Entropy = 0.61
The attribute "hair color" is selected as the first test
because it minimizes the entropy.
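These average entropies can be recomputed directly from the table above; a short Python sketch of the weighted ("average") entropy of the result after splitting on each attribute:

```python
import math

# The sunburn data set: (hair, height, weight, lotion, result).
data = [
    ("blonde", "average", "light",   "no",  "positive"),
    ("blonde", "tall",    "average", "yes", "negative"),
    ("brown",  "short",   "average", "yes", "negative"),
    ("blonde", "short",   "average", "no",  "positive"),
    ("red",    "average", "heavy",   "no",  "positive"),
    ("brown",  "tall",    "heavy",   "no",  "negative"),
    ("brown",  "average", "heavy",   "no",  "negative"),
    ("blonde", "short",   "light",   "yes", "negative"),
]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

def average_entropy(column):
    """Weighted entropy of the result after splitting on one attribute."""
    values = [row[column] for row in data]
    labels = [row[4] for row in data]
    total = 0.0
    for v in set(values):
        subset = [l for l, x in zip(labels, values) if x == v]
        total += len(subset) / len(data) * entropy(subset)
    return total

for name, col in [("hair", 0), ("height", 1), ("weight", 2), ("lotion", 3)]:
    print(name, round(average_entropy(col), 2))
# hair: 0.5, height: 0.69, weight: 0.94, lotion: 0.61 -> hair is chosen first.
```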
Splitting on hair color gives three subsets:

Hair color = blonde
Name  | Hair   | Height  | Weight  | Lotion | Result
Sarah | blonde | average | light   | no  | positive
Dana  | blonde | tall    | average | yes | negative
Annie | blonde | short   | average | no  | positive
Katie | blonde | short   | light   | yes | negative

Hair color = red
Name  | Hair | Height  | Weight | Lotion | Result
Emily | red  | average | heavy  | no | positive

Hair color = brown
Name  | Hair  | Height  | Weight  | Lotion | Result
Alex  | brown | short   | average | yes | negative
Peter | brown | tall    | heavy   | no  | negative
John  | brown | average | heavy   | no  | negative
• Similarly, we now choose another test to separate out the sunburned individuals from the inhomogeneous blonde-haired subset, {Sarah, Dana, Annie, Katie}.
• The attribute "lotion" is selected because it minimizes the entropy in the blonde hair subset.
• Thus, using the "hair color" and "lotion" tests together ensures the proper identification of all the samples.
The completed decision tree:
[Figure: hair color = blonde → test lotion used (no → sunburned, yes → not sunburned); hair color = red → sunburned; hair color = brown → not sunburned.]
Decision tree

Age     | Gene1 | Gene2 | Smoker | Operation
<20     | high  | high  | yes | yes
<20     | high  | high  | yes | yes
<20     | low   | low   | no  | no
[20,40] | low   | high  | yes | yes
[20,40] | high  | high  | no  | yes
[20,40] | high  | low   | yes | no
>40     | low   | low   | yes | no
>40     | high  | low   | no  | no
>40     | low   | high  | no  | no

Decision tree built from these data:
Gene2?
  high → Age > 40?
           Yes → Operation = no
           No  → Operation = yes
  low  → Operation = no
Decision trees are automatically built from “train data” and are
used for classification.
They also tell us which features are most important.
reduced error pruning - example

No. | colour | age | type | y   | class
1 | black | 11 | B | yes | +
2 | white | 23 | B | yes | -
3 | black | 22 | A | no  | -
4 | black | 18 | B | no  | +
5 | black | 15 | B | yes | -
6 | white | 27 | B | no  | +

[Figure: a decision tree with tests on age (< 20 / ≥ 20), type (A / B), colour (white / black) and y (yes / no), annotated with the numbers of positive and negative examples reaching each node.]
reduced error pruning
[Figures: the tree above is pruned step by step; a subtree is replaced by a leaf whenever doing so does not increase the error on the pruning examples, until only the final pruned tree remains.]
Advantages and Disadvantages of Decision Trees
• 1. Decision trees are self-explanatory and, when compacted, they are also easy to follow. Furthermore, decision trees can be converted to a set of rules.
• 2. Decision trees can handle both nominal and numeric input attributes.
• 3. Decision trees are capable of handling datasets that may have errors.
• 4. Decision trees are capable of handling datasets that may have missing values.
QUALITY OF THE CLASSIFICATION
Training and testing
Each data set to which we apply some data mining algorithm has to be tested. That is why each data set is divided into two subsets:
• Train set
• Test set

All data → (random partition) → Train set, used for decision tree induction, and Test set, used for testing the accuracy.

Train set → classification algorithm → classifier
  (e.g. IF age < 31 OR car type = "sports" THEN risk = high)
Test set (Age, Car Type, Risk) → classifier → predicted risk → accuracy
New data (Age, Car Type) → classifier → risk
[Figure: a plot with sensitivity on the x axis and specificity on the y axis.]
Sensitivity vs Specificity
• Sensitivity (also called the true positive rate, or the
recall rate in some fields) measures the proportion of
actual positives which are correctly identified as such
(e.g. the percentage of sick people who are correctly
identified as having the condition).
• Specificity measures the proportion of negatives which
are correctly identified as such (e.g. the percentage of
healthy people who are correctly identified as not
having the condition, sometimes called the true
negative rate).
• These two measures are closely related to the concepts
of type I and type II errors.
                | Actual true | Actual false
Predicted true  |      8      |      1
Predicted false |      2      |      7

Sensitivity = TP / (TP + FN) = 8 / (8 + 2) = 0.8
Specificity = TN / (FP + TN) = 7 / (7 + 1) = 0.88
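The same two numbers can be obtained with a short Python sketch from the confusion matrix above:

```python
# Confusion matrix from the table above.
TP, FN = 8, 2   # actual true:  8 predicted true, 2 predicted false
FP, TN = 1, 7   # actual false: 1 predicted true, 7 predicted false

sensitivity = TP / (TP + FN)   # true positive rate (recall)
specificity = TN / (TN + FP)   # true negative rate

print(round(sensitivity, 2))   # 0.8
print(round(specificity, 2))   # 0.88
```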