For written notes on this lecture, please read chapter 3 of The Practical Bioinformatician.
CS2220: Introduction to Computational Biology
Unit 1: Essence of Knowledge Discovery
Li Xiaoli
Information about Instructor
• Instructor
– Adj AP Li Xiaoli
– Email:
[email protected]
[email protected]
– Web http://www1.i2r.a-star.edu.sg/~xlli/
– Office:
NUS COM2-03-33
Fusionopolis (One North MRT station)
Institute for Infocomm Research, A*Star
Knowledge Discovery/Data Mining
The first two lectures of this course aim to
introduce some concepts and techniques of
knowledge discovery, including
– Association rule mining
– Clustering/unsupervised machine learning techniques
– Classification/supervised machine learning
techniques
– Classification performance measures
– Feature selection techniques
Benefits of learning knowledge discovery
At the end of the course, students should
• Have good knowledge of basic data mining concepts
• Be familiar with representative data mining techniques
• Understand how to use a data mining tool, Weka
• Given a biological dataset or task, have some knowledge of appropriate data mining techniques and tools for analyzing the data and addressing the task
What is Knowledge Discovery / Data Mining?
[Figure: Jonathan’s blocks and Jessica’s blocks, with the question “Whose block is this?” The learned rules: Jonathan’s blocks are blue or circle; Jessica’s blocks are all the rest.]
What is Knowledge Discovery / Data Mining?
Question: Can you explain how?
The Steps of Data Mining
• Training data gathering
• Feature generation
– k-grams, colour, texture, domain know-how, ...
• Feature selection
– Entropy, χ2, CFS, t-test, domain know-how, ...
• Feature integration and model construction
– SVM, ANN, PCL, CART, C4.5, kNN, ...
Some classifiers / machine learning methods
We are drowning in data
But starving for knowledge
Data Mining
What is Knowledge Discovery?
• Knowledge Discovery from Databases (KDD): the
overall process of non-trivial extraction of
implicit, previously unknown and potentially
useful knowledge from large amounts of data
• Exploration & analysis, by automatic or
semi-automatic means, of large quantities of data
in order to discover meaningful patterns
Data Mining: The Core Step of a
KDD Process
[Figure: the steps of the KDD process, with data mining as the core step.]
Although “Data Mining” is just one of the many steps,
it is usually used to refer to the whole process of KDD
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern
recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: data mining at the intersection of statistics/AI, machine learning/pattern recognition, and database systems.]
Data Mining Tools
• Weka
• RapidMiner
• R
• Orange
• IBM SPSS Modeler
• SAS Data Mining
• Oracle Data Mining
• ……
Data Mining Tasks
• Prediction Methods
– Use some variables to predict unknown or future
values of other variables
• Description Methods
– Find human-interpretable patterns that describe
the data
From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996
Data Mining Tasks...
• Association Rule Discovery [Descriptive]
• Clustering [Descriptive]
• Classification [Predictive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Outlier Detection (Deviation Detection) [Predictive]
Association Rule Mining:
“Market Basket Analysis”
1. Association Rules (Descriptive)
• Finding groups of items that tend to
occur together
• Also known as “frequent itemset
mining” or “market basket analysis”
– Grocery store: Beer and Diaper
– Amazon.com: People who bought this book
also bought other books
– Find gene mutations that frequently occur in patients
A Real-world Application:
AMAZON BOOK RECOMMENDATIONS
Frequent patterns and association rules
Frequent pattern: {data mining, mining the social web, data analysis with open source tools}
Association rule: {data mining} → {machine learning in action}
An Intuitive Example of Association
Rules from Transactional Data
Shopping Cart (transaction)   Items Bought
1   {fruits, milk, beef, eggs, …}
2   {sugar, toothbrush, ice-cream, …}
3   {pacifier, formula, blanket, …}
…   …
n   {battery, juice, beef, egg, chicken, …}
What is Association Rule Mining?
• Given a set of transactions, find rules that will
predict the occurrence of items based on the
occurrences of other items in the transaction
Market-Basket transactions:

TID   Items
1     Bread, Milk
2     Bread, Diaper, Beer, Eggs
3     Milk, Diaper, Beer, Coke
4     Bread, Milk, Diaper, Beer
5     Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Coke}
{Bread} → {Milk}

Explanation and usefulness of {Bread} → {Milk}:
“People who buy bread may also buy milk.”
• Stores might want to offer specials on bread to get people to buy more milk.
• Stores might want to put bread and milk close to each other.
Note that implication means co-occurrence, not causality!
Definition: Frequent Itemset
• Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items
• Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5 = 40%
• Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold, which is usually defined by a user, possibly interactively.
(Computed over the market-basket transactions shown above.)
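To make these definitions concrete, here is a minimal Python sketch (an illustration only, not part of the original slides; the function names are ours) that computes the support count and support of an itemset over the five market-basket transactions shown above:

```python
# Illustrative sketch: support count and support of an itemset
# over the five market-basket transactions shown above.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, transactions) / len(transactions)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))   # 2
print(support(X, transactions))         # 0.4
print(support(X, transactions) >= 0.3)  # True: frequent at an assumed minsup of 0.3
```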
Definition: Association Rule
• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets, and X ∩ Y = ∅
– Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
– Support (s)
• Fraction of transactions that contain both X and Y:
s(X → Y) = P(X ∪ Y) = (# trans containing X ∪ Y) / |T|
– Confidence (c)
• Measures how often items in Y appear in transactions that contain X:
c(X → Y) = P(Y | X) = (# trans containing X ∪ Y) / (# trans containing X)

Example: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

Support: how extensively can the rule be applied?
Confidence: how reliable is the rule when we apply it?
(Computed over the same market-basket transactions as above.)
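In the same illustrative style (again not the slides' own code), the metrics of the rule {Milk, Diaper} → {Beer} follow directly from the definitions above:

```python
# Illustrative sketch: support and confidence of a rule X -> Y
# over the same five transactions as above.

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def rule_metrics(X, Y, transactions):
    """Return (support, confidence) of the rule X -> Y."""
    n_xy = sum(1 for t in transactions if (X | Y) <= t)  # transactions containing X and Y
    n_x = sum(1 for t in transactions if X <= t)         # transactions containing X
    return n_xy / len(transactions), n_xy / n_x

s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(round(s, 2), round(c, 2))   # 0.4 0.67
```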
Association Rule Mining Task
• An association rule r is strong if
– Support(r) ≥ min_sup
– Confidence(r) ≥ min_conf
• Given a set of transactions T, the goal of association
rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
2. Clustering (Descriptive)
• Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
• Automatically learn the structure of data
Main principle: intra-cluster distances are minimized; inter-cluster distances are maximized.
The Input of Clustering
Objects / instances / points, each described by n features f1, f2, …, fn:
D1 = (V11, V12, …, V1n)
D2 = (V21, V22, …, V2n)
……
Dm = (Vm1, Vm2, …, Vmn)

f1, f2, …, fn could be various features like race, gender, income, education level, medical features (e.g. weight, blood pressure, total cholesterol, LDL (low-density lipoprotein cholesterol), HDL (high-density lipoprotein cholesterol), triglycerides (fats carried in the blood), x-ray), and biological features (e.g. gene expression, protein interaction, gene functions).
Unsupervised Learning
Purely learn from the unlabeled data
Each instance is described by nine features: Clump Thickness, Uniformity of Cell Size, Uniformity of Cell Shape, Marginal Adhesion, Single Epithelial Cell Size, Bare Nuclei, Bland Chromatin, Normal Nucleoli, Mitoses.

ID1: 5, 1, 1, 1, 2, 1, 3, 1, 1
ID2: 5, 4, 4, 5, 7, 10, 3, 2, 1
ID3: 3, 1, 1, 1, 2, 2, 3, 1, 1
ID4: 8, 10, 10, 8, 7, 10, 9, 7, 1
We can learn relationships or structures from data by clustering, e.g. maybe ID1 and ID3 should be in one cluster?
Clustering Definition
• Unsupervised learning uses data without labels, in contrast to supervised learning/classification
• Given a set of data points/objects, each having a set
of attributes, and a similarity measure among them,
find clusters such that
– Data points in one cluster are more similar to one
another.
– Data points in separate clusters are less similar to
one another
• Similarity Measures:
– Euclidean Distance
– Cosine Similarity
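As a small illustration of the two similarity measures just listed (an assumed sketch, not taken from the slides), Euclidean distance and cosine similarity between two rows of the earlier table can be computed like this:

```python
import math

def euclidean_distance(a, b):
    """Smaller distance means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Closer to 1 means more similar (cosine of the angle between the vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# ID1 and ID3 from the unsupervised-learning table shown earlier
id1 = [5, 1, 1, 1, 2, 1, 3, 1, 1]
id3 = [3, 1, 1, 1, 2, 2, 3, 1, 1]
print(euclidean_distance(id1, id3))  # small distance: plausibly the same cluster
print(cosine_similarity(id1, id3))   # close to 1
```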
Clustering Techniques and
Applications
• We will learn some standard clustering
methods, e.g. K-means and hierarchical
clustering, in the coming weeks
• We will demonstrate how to apply
clustering methods for gene expression
data analysis
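As a preview only (a minimal sketch under simplifying assumptions; K-means is covered properly in a later lecture, and this is not the course's reference implementation), the basic K-means loop looks roughly like this:

```python
import random

def kmeans(points, k, n_iter=100):
    """Tiny K-means sketch: points is a list of numeric vectors."""
    centroids = random.sample(points, k)                 # pick k points as initial centroids
    for _ in range(n_iter):
        # assignment step: put each point into the cluster of its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
            clusters[d.index(min(d))].append(p)
        # update step: move each centroid to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = [sum(xs) / len(cl) for xs in zip(*cl)]
    return clusters, centroids

clusters, centroids = kmeans([[1, 1], [1, 2], [8, 8], [9, 8]], k=2)
print(clusters)   # the two nearby pairs end up in separate clusters
```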
3. Classification (Predictive)
• Also called Supervised Learning
• Learn from past experience/labels, and use
the learned knowledge to classify new data
• Knowledge learned by intelligent machine
learning algorithms
• Examples:
– Clinical diagnosis for patients
– Cell type classification
– Secondary structure classification of protein as
alpha-helix, beta-sheet, or random coil in biology
An Example: Classification
Each instance has the same nine attributes as above, plus a Class attribute (2 for benign, 4 for malignant):

ID1: 5, 1, 1, 1, 2, 1, 3, 1, 1 → Class 2
ID2: 5, 4, 4, 5, 7, 10, 3, 2, 1 → Class 2
ID3: 3, 1, 1, 1, 2, 2, 3, 1, 1 → Class 4
ID4: 8, 10, 10, 8, 7, 10, 9, 7, 1 → Class 4
Find a model for class attribute as a function of the values of other attributes.
f(clump Thickness, Uniformity of Cell Size, …, Mitoses) = Class
Classification Data
• A classification application involves more than one class of data. E.g.,
– Normal vs disease cells for a diagnosis problem
• Training data is a set of instances (samples,
points) with known class labels
• Test data is a set of instances whose class
labels are to be predicted
Classification: Definition
• Given a collection of records (training set:
History )
– Each record contains a set of attributes; one of the key attributes is the class.
• Find a model for class attribute as a
function of the values of other attributes.
• Goal: previously unseen records should be
assigned a class as accurately as possible
(Future).
Prediction based on the past examples.
Let History tell us the future!
Classification: Evaluation
• How good is your prediction? How do you
evaluate your results?
• A test set is used to determine the accuracy
of the model.
• Usually, the given data set is divided into
training and test sets, with training set used
to build the model and test set used to
validate it
• Usually, the overlap between the training set and the test set is empty
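A minimal sketch of this protocol in Python (illustrative only; MajorityClassifier is a trivial stand-in invented for the example, not a classifier you would actually use):

```python
import random
from collections import Counter

class MajorityClassifier:
    """Trivial stand-in model: always predicts the most common training label."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self
    def predict(self, x):
        return self.label

def estimate_accuracy(X, y, train_fraction=0.7, seed=0):
    """Split into disjoint training and test sets, build on train, score on test."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(train_fraction * len(idx))
    train, test = idx[:cut], idx[cut:]          # no overlap between the two sets
    model = MajorityClassifier().fit([X[i] for i in train], [y[i] for i in train])
    correct = sum(model.predict(X[i]) == y[i] for i in test)
    return correct / len(test)

# toy data: 6 records with labels "normal" / "disease"
X = [[5, 1], [5, 4], [3, 1], [8, 10], [2, 1], [7, 9]]
y = ["normal", "normal", "normal", "disease", "normal", "disease"]
print(estimate_accuracy(X, y))
```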
Construction of a Classifier
[Figure: training samples → Build Classifier → classifier; a test instance → Apply Classifier → prediction.]
Estimate Accuracy: Wrong Way
[Figure: training samples → Build Classifier → classifier → Apply Classifier (back on the same training samples) → predictions → Estimate Accuracy → accuracy.]
Exercise: Why is this way of estimating accuracy wrong?
We will elaborate on how to estimate classification accuracy in the next lecture.
Typical Notations
• Training data
{(x1, y1), (x2, y2), …, (xm, ym)}, where
xj (1 ≤ j ≤ m) are n-dimensional vectors, e.g. xj = (Vj1, Vj2, …, Vjn), and
yj are from a discrete space Y. E.g., Y = {normal, disease}, yj = normal.
• Test data
{(u1, ?), (u2, ?), …, (uk, ?)}
Process
Training data X with class labels Y are used to learn a classifier (a mapping, a hypothesis) f such that Y = f(X). Applying f to test data U, i.e. f(U), gives the predicted class labels.
Relational Representation
of Gene Expression Data
m samples (rows) by n features (columns, order of 1000 genes), plus a class label:

        gene1  gene2  gene3  gene4  …  genen   class
        x11    x12    x13    x14    …  x1n     P
        x21    x22    x23    x24    …  x2n     N
        x31    x32    x33    x34    …  x3n     P
        ………………………………….
        xm1    xm2    xm3    xm4    …  xmn     N
Features (aka Attributes)
• Categorical features: describe qualitative aspects of an object
– color = {red, blue, green}
• Continuous or numerical features: describe quantitative aspects of an object
– gene expression
– age
– blood pressure
• Discretization: Transform a continuous attribute into
a categorical attribute (e.g. age)
Discretization of Continuous Attribute
• Transformation of a continuous attribute to a categorical attribute involves two subtasks
– (1) Decide how many categories
• After the values are sorted, they are then divided into n
intervals by specifying n-1 split points (equal interval
bin, equal frequency bin, or clustering)
– (2) Determine how to map the values of the
continuous attribute to these categories
• All the values in one interval are mapped to the same
categorical value
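The two binning strategies named above can be sketched as follows (illustrative only; the ages and the number of bins are made up for the example):

```python
def equal_interval_bins(values, n):
    """Split the value range into n intervals of equal width; return the n-1 split points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    return [lo + width * i for i in range(1, n)]

def equal_frequency_bins(values, n):
    """Choose n-1 split points so that each interval holds roughly the same number of values."""
    s = sorted(values)
    return [s[len(s) * i // n] for i in range(1, n)]

def discretize(value, split_points):
    """Map a continuous value to a bin index: all values in one interval get the same category."""
    return sum(value > p for p in split_points)

ages = [23, 25, 31, 34, 45, 52, 58, 63, 70, 77]
splits = equal_interval_bins(ages, 3)                 # [41.0, 59.0]
print(splits, [discretize(a, splits) for a in ages])  # categories 0, 1, 2
```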
An Example of a Classification Task
Overall Picture of
Classification/Supervised Learning
[Figure: data from many domains (biomedical, financial, government, scientific, ……) feed into classifiers such as decision trees, KNN, SVM, Bayesian classifiers and neural networks, labelled “Classifiers (Medical Doctors)”.]
Requirements of
Biomedical Classification
• Accurate
– High accuracy
– High sensitivity
– High specificity
– High precision
• High comprehensibility
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory-based or Instance-based Methods
• Naïve Bayes Classifiers
• Neural Networks
• Support Vector Machines
• Ensemble Classification
Illustrating Classification Task
Importance of Rule-Based Methods
• Systematic selection of a small number of features used for the decision making
⇒ increases comprehensibility of the knowledge patterns
• C4.5 and CART are two commonly used
rule induction algorithms---a.k.a. decision
tree induction algorithms
Structure of Decision Trees
[Figure: a decision tree. The root node tests x1 (e.g. x1 > a1); internal nodes test x2, x3, x4 (e.g. x2 > a2); leaf nodes carry the class labels A and B.]
• If x1 > a1 & x2 > a2, then it is class A
• C4.5 and CART are two of the most widely used
• Easy interpretation
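The path-to-rule reading of such a tree can be written directly as code; a minimal Python sketch, where a1, a2 and the class labels are just the symbols from the figure and the numeric thresholds are made up:

```python
def classify(x1, x2, a1=5.0, a2=3.0):
    """One root-to-leaf path of the tree above, read as a rule:
    if x1 > a1 and x2 > a2 then class A; everything else is lumped into B for this sketch."""
    if x1 > a1 and x2 > a2:
        return "A"
    return "B"

print(classify(x1=7.2, x2=4.1))  # "A"
print(classify(x1=2.0, x2=4.1))  # "B"
```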
Elegance of Decision Trees
Every path from the root to a leaf forms a decision rule.
[Figure: the same tree, with leaves labelled A and B; each root-to-leaf path is one rule.]
Example of a Decision Tree
Splitting attributes and training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

The induced tree:
Refund?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
Leaf nodes carry the class.
Another Example of Decision Tree
(The same training data as on the previous slide.)

An alternative tree:
MarSt?
  Married → NO
  Single, Divorced → Refund?
                       Yes → NO
                       No  → TaxInc?
                               < 80K → NO
                               > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: the training set is fed to a tree induction (learning) algorithm to learn a model, a decision tree (induction); the model is then applied to the test set to predict its class labels (deduction).]
Apply Model to Test Data
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and, at each node, follow the branch that matches the test record:
• Refund = No → go to the MarSt node
• Marital Status = Married → reach the leaf labelled NO
• Assign Cheat = “No” to the test record
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
We want to learn a classification tree model from training data, and we want to apply it to predict future test data as accurately as possible.
Most Discriminatory Feature and Partition
• Discriminatory Feature
– Every feature can be used to partition the training data
– If the partitions contain a pure class of training
instances, then this feature is most discriminatory
• Partition
– Categorical feature
• Number of partitions of the training data is equal to the
number of values of this feature
– Numerical feature
• Two partitions
Outlook is an example of a categorical feature; Temp and Humidity are numerical features.

Instance #  Outlook   Temp  Humidity  Windy  class
1           Sunny     75    70        true   Play
2           Sunny     80    90        true   Don’t
3           Sunny     85    85        false  Don’t
4           Sunny     72    95        true   Don’t
5           Sunny     69    70        false  Play
6           Overcast  72    90        true   Play
7           Overcast  83    78        false  Play
8           Overcast  64    65        true   Play
9           Overcast  81    75        false  Play
10          Rain      71    80        true   Don’t
11          Rain      65    70        true   Don’t
12          Rain      75    80        false  Play
13          Rain      68    80        false  Play
14          Rain      70    96        false  Play
A categorical feature is partitioned based on its number of possible values. With the 14 training instances in total:

Outlook = sunny:     instances 1, 2, 3, 4, 5        → P, D, D, D, P
Outlook = overcast:  instances 6, 7, 8, 9           → P, P, P, P
Outlook = rain:      instances 10, 11, 12, 13, 14   → D, D, P, P, P
A numerical feature is generally partitioned by choosing a “cutting point”. With the 14 training instances in total:

Temperature <= 70:  instances 5, 8, 11, 13, 14             → P, P, D, P, P
Temperature > 70:   instances 1, 2, 3, 4, 6, 7, 9, 10, 12  → P, D, D, D, P, P, P, D, P
Steps of Decision Tree Construction
• Select the “best” feature as root node of
the whole tree
• Partition dataset into subsets using this
feature so that the subsets are as “pure”
as possible
• After partition by this feature, select the
best feature (wrt the subset of training
data) as root node of this sub-tree
• Recurse until the partitions become pure or almost pure
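These steps can be outlined in a few lines of Python (a sketch only; it assumes some best_feature chooser such as the Gini-based criterion discussed on the following slides, and it is not C4.5 or CART itself):

```python
from collections import Counter

def build_tree(rows, labels, features, best_feature):
    """Recursive outline of decision-tree construction.
    rows: list of dicts {feature: value}; best_feature: a function that picks the most
    discriminatory feature for the current subset of training data (e.g. by Gini)."""
    if len(set(labels)) == 1 or not features:           # pure partition (or no feature left)
        return Counter(labels).most_common(1)[0][0]     # leaf: majority class
    f = best_feature(rows, labels, features)            # "best" feature for this node
    tree = {f: {}}
    for value in set(r[f] for r in rows):               # partition by this feature's values
        subset = [(r, y) for r, y in zip(rows, labels) if r[f] == value]
        sub_rows = [r for r, _ in subset]
        sub_labels = [y for _, y in subset]
        remaining = [g for g in features if g != f]
        tree[f][value] = build_tree(sub_rows, sub_labels, remaining, best_feature)
    return tree

# Demo with a trivial chooser (always picks the first feature) on toy weather-style data:
toy = [{"Outlook": "Sunny"}, {"Outlook": "Rain"}, {"Outlook": "Sunny"}]
print(build_tree(toy, ["Don't", "Play", "Don't"], ["Outlook"],
                 best_feature=lambda rows, labels, fs: fs[0]))
```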
How to determine the Best Split?
Which features/attributes do you want to use first?

Before splitting: 10 records of class 0 (C0) and 10 records of class 1 (C1). Suppose we have three features, i.e. Own Car, Car Type, Student ID.

Own Car?     Yes: C0: 6, C1: 4        No: C0: 4, C1: 6
Car Type?    Family: C0: 1, C1: 3     Sports: C0: 8, C1: 0     Luxury: C0: 1, C1: 7
Student ID?  c1: C0: 1, C1: 0 … c10: C0: 1, C1: 0     c11: C0: 0, C1: 1 … c20: C0: 0, C1: 1

Which test condition is the best? The split should help us to distinguish different classes.
How to determine the Best Split
• Greedy approach:
– Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
– C0: 5, C1: 5 (non-homogeneous, high degree of impurity)
– C0: 9, C1: 1 (homogeneous, low degree of impurity)
A pure node provides certainty and confidence for future test cases.
Measures of Node Impurity
• Gini Index
• Entropy (Do it yourself)
• Misclassification error (Do it
yourself)
How to Find the Best Split
Before splitting, the parent node has C0: N00 and C1: N01 records; its impurity is M0.

Splitting on attribute A gives Node N1 (C0: N10, C1: N11) and Node N2 (C0: N20, C1: N21), with impurities M1 and M2; the overall (weighted) impurity of feature A is M12.

Splitting on attribute B gives Node N3 (C0: N30, C1: N31) and Node N4 (C0: N40, C1: N41), with impurities M3 and M4; the overall impurity of feature B is M34.

Gain = M0 - M12 versus M0 - M34.
Which attribute (A or B) should we choose? We choose the one that can reduce more impurity and bring more gain.
Examples for computing Impurity using GINI for a node

Gini(t) = 1 - Σ_j [p(j | t)]²

where p(j | t) is the relative frequency (probability) of class j at node t.

C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1,  Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0
C1 = 1, C2 = 5:  P(C1) = 1/6,      P(C2) = 5/6,      Gini = 1 - (1/6)² - (5/6)² = 0.278
C1 = 2, C2 = 4:  P(C1) = 2/6,      P(C2) = 4/6,      Gini = 1 - (2/6)² - (4/6)² = 0.444
C1 = 3, C2 = 3:  P(C1) = 3/6,      P(C2) = 3/6,      Gini = 1 - (3/6)² - (3/6)² = 0.5
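The same numbers can be reproduced with a few lines of Python (a sketch of the formula above, not any particular package's implementation):

```python
def gini(counts):
    """Gini(t) = 1 - sum_j p(j|t)^2, where counts are the class counts at node t."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print((c1, c2), round(gini([c1, c2]), 3))   # 0.0, 0.278, 0.444, 0.5
```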
Measure of Impurity: GINI for a node
• Gini index for a given node t:
Gini(t) = 1 - Σ_j [p(j | t)]²

C1 = 0, C2 = 6:  p(C1|t) = 0,    p(C2|t) = 1,    Gini = 0.000
C1 = 1, C2 = 5:  p(C1|t) = 1/6,  p(C2|t) = 5/6,  Gini = 0.278
C1 = 2, C2 = 4:  p(C1|t) = 2/6,  p(C2|t) = 4/6,  Gini = 0.444
C1 = 3, C2 = 3:  p(C1|t) = 0.5,  p(C2|t) = 0.5,  Gini = 0.500

– Maximum (1 - 1/nc) when records are equally distributed among all classes, i.e. p(ci|t) = 1/nc
– Minimum (0) when all records belong to one class, implying the most interesting information.
Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of the split is computed as

Gini_split = Σ_{i=1..k} (n_i / n) × Gini(i)

where n_i = number of records at child i, and n = number of records at node p.
[Figure: node p split into Node 1 (Yes) and Node 2 (No).]
Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighing partitions: larger and purer partitions are sought for.

Parent: C1 = 6, C2 = 6, Gini = 0.500

Splitting on attribute B:
Node N1 (B = Yes): C1 = 5, C2 = 2, Gini(N1) = 1 - (5/7)² - (2/7)² = 0.4082
Node N2 (B = No):  C1 = 1, C2 = 4, Gini(N2) = 1 - (1/5)² - (4/5)² = 0.3200

After using attribute B:
Gini(Children) = 7/12 × 0.4082 + 5/12 × 0.3200 = 0.3715

What is the gain?
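A short illustrative sketch that reproduces these numbers (the tiny difference in the last digit arises because the slide rounds the child Ginis before weighting):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Gini_split = sum_i (n_i / n) * Gini(i) over the child nodes."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

parent = [6, 6]                # C1 = 6, C2 = 6
children = [[5, 2], [1, 4]]    # N1 (B = Yes), N2 (B = No)
print(round(gini(parent), 3))                          # 0.5
print(round(gini_split(children), 4))                  # 0.3714 (slide: 0.3715 after rounding)
print(round(gini(parent) - gini_split(children), 4))   # gain of about 0.1286
```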
Categorical Attributes:
Computing Gini Index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split:
  CarType:  Family  Sports  Luxury
  C1:       1       2       1
  C2:       4       1       1
  Gini = 0.393

Two-way split (find the best partition of values):
  CarType:  {Sports, Luxury}  {Family}
  C1:       3                 1
  C2:       2                 4
  Gini = 0.400

  CarType:  {Sports}  {Family, Luxury}
  C1:       2         2
  C2:       1         5
  Gini = 0.419

Which one do you want to choose to split?
Continuous Attributes:
Computing Gini Index
• Use binary decisions based on one value, e.g. Taxable Income > 97K? (Yes / No)
• Several choices for the splitting value; each splitting value has a count matrix associated with it
– Class counts in each of the partitions, A < v and A ≥ v
– v could be any value (e.g. 97)
• Simple method to choose the best v
– For each v, scan the database to gather the count matrix and compute its Gini index

(Training data: the same Tid / Refund / Marital Status / Taxable Income / Cheat table as above.)
Continuous Attributes:
Computing Gini Index...
• For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the count
matrix and computing gini index
– Choose the split position that has the least gini index
(largest gain)
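A compact sketch of this sort-and-scan idea (illustrative Python over the Taxable Income values from the table below; for clarity the count matrix is recomputed at every candidate split instead of being updated incrementally):

```python
def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels):
    """Sort the attribute values, scan candidate cut points, return (split, gini)."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        split = (pairs[i - 1][0] + pairs[i][0]) / 2      # midpoint between adjacent values
        left = [y for v, y in pairs if v <= split]
        right = [y for v, y in pairs if v > split]
        g = (len(left) * gini([left.count(c) for c in classes]) +
             len(right) * gini([right.count(c) for c in classes])) / len(pairs)
        if g < best[1]:
            best = (split, g)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Taxable Income (in K)
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # (97.5, 0.3): the same partition as the slide's split at 97
```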
Sorted values of Taxable Income, with the Cheat label of each record:
  60(No)  70(No)  75(No)  85(Yes)  90(Yes)  95(Yes)  100(No)  120(No)  125(No)  220(No)

Candidate split positions, the count matrix (<= split / > split) for each, and the resulting Gini:

Split     55     65     72     80     87     92     97     110    122    172    230
Yes <=/>  0/3    0/3    0/3    0/3    1/2    2/1    3/0    3/0    3/0    3/0    3/0
No  <=/>  0/7    1/6    2/5    3/4    3/4    3/4    3/4    4/3    5/2    6/1    7/0
Gini      0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420

The split at 97 has the least Gini index (0.300), i.e. the largest gain.
Characteristics of Decision Trees
• Easy to implement
• Single coverage of training data (elegance)
• Divide-and-conquer splitting strategy
• Easy to explain to medical doctors or biologists
Example Use of Decision Tree Methods:
Proteomics Approaches to Biomarker Discovery
• In prostate and bladder cancers (Adam et al.
Proteomics, 2001)
• In serum samples to detect breast cancer (Zhang
et al. Clinical Chemistry, 2002)
• In serum samples to detect ovarian cancer
(Petricoin et al. Lancet; Li & Rao, PAKDD 2004)
Contact: [email protected] if you have questions
Copyright 2015 © Wong Limsoon, Li Xiaoli