Learning Decision Trees
Brief tutorial by M Werner
Medical Diagnosis Example
• Goal – Diagnose a disease from a blood test
• Clinical Use
  – Blood sample is obtained from a patient
  – Blood is tested to measure current expression of various proteins, say by using a DNA microarray
  – Data is analyzed to produce a Yes or No answer
Data Analysis
• Use a decision tree such as:
[Tree diagram: each internal node tests a protein expression level (P1 > K1, P2 > K2, P3 > K3, P4 > K4) and branches on Y/N; the leaves give the Yes/No diagnosis.]
How to Build the Decision Tree
• Start with samples of blood from patients known to either have the disease or not (training set).
• Suppose there are 20 patients: 10 are known to have the disease and 10 are known not to.
• From the training set, get expression levels for all proteins of interest.
• i.e. if there are 20 patients and 50 proteins, we get a 50 x 20 array of real numbers (a sketch of this layout follows below)
  – Rows are proteins
  – Columns are patients
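As an illustration only (the array shape and names below are assumptions for the sketch, with synthetic values rather than real blood-test data), the training set described above could be laid out like this in Python:

```python
import numpy as np

n_proteins, n_patients = 50, 20
rng = np.random.default_rng(0)

# Rows are proteins, columns are patients, as in the slide.
# Real data would come from the blood tests; here it is synthetic.
expression = rng.normal(size=(n_proteins, n_patients))

# Known outcomes for the training set: 10 diseased (1), 10 healthy (0).
labels = np.array([1] * 10 + [0] * 10)

print(expression.shape)  # (50, 20)
print(labels.sum())      # 10 patients known to have the disease
```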
Choosing the decision nodes
• We would like the tree to be as short as possible
• Start with all 20 patients in one group (10 have the disease, 10 don't)
• Choose a protein and a level that gains the most information
[Diagram: the 10/10 root group is split two candidate ways. A possible splitting condition Px > Kx yields subgroups 9/3 (mostly diseased) and 1/7 (mostly not diseased). An alternative splitting condition Py > Ky yields subgroups 7/7 and 3/3.]
How to determine information gain
• Purity – A measure of the degree to which the patients in a group share the same outcome.
• A group that splits 1/7 is fairly pure – most patients don't have the disease
• 0/8 is even purer
• 4/4 is the opposite of pure. This group is said to have high entropy. Knowing that a patient is in this group does not make her more or less likely to have the disease.
• The decision tree should reduce entropy as test conditions are evaluated
Measuring Purity (Entropy)
• Let f(i,j) = Prob(Outcome = j in node i)
• i.e. If node 2 has a 9/3 split
  – f(2,0) = 9/12 = .75
  – f(2,1) = 3/12 = .25
• Gini impurity: I_G(i) = 1 − Σ_j f(i,j)^2
• Entropy: H(i) = −Σ_j f(i,j) log2 f(i,j) (see the sketch below)
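Both measures can be computed directly from a group's class counts. The helper functions below are a minimal sketch (not part of the tutorial); the example calls use the splits mentioned on the previous slides.

```python
from math import log2

def gini(counts):
    """Gini impurity: 1 - sum_j f(i,j)^2 for a group with the given class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum_j f(i,j) * log2 f(i,j), with 0*log2(0) taken as 0."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([9, 3]))  # ~0.811  (the 9/3 node: f = 0.75, 0.25)
print(entropy([1, 7]))  # ~0.544  (fairly pure: mostly not diseased)
print(entropy([0, 8]))  # 0.0     (perfectly pure)
print(entropy([4, 4]))  # 1.0     (the opposite of pure: maximum entropy)
print(gini([9, 3]))     # 0.375
```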
Computing Entropy

Node with an 11/1 split (12 patients):

         frequency   prob       log2(prob)   prob × log2(prob)
  yes    11          0.916667   -0.12553     -0.11507
  no     1           0.083333   -3.58496     -0.29875
  total  12                                   entropy = 0.413817

Node with a 9.9/0.1 split (10 patients):

         frequency   prob       log2(prob)   prob × log2(prob)
  yes    9.9         0.99       -0.0145      -0.01435
  no     0.1         0.01       -6.64386     -0.06644
  total  10                                   entropy = 0.080793
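As a quick check (a sketch, not part of the original slides), the same numbers can be reproduced in a few lines:

```python
from math import log2

def entropy_from_counts(yes, no):
    total = yes + no
    return -sum(p * log2(p) for p in (yes / total, no / total) if p > 0)

print(entropy_from_counts(11, 1))     # ~0.413817, matching the first table
print(entropy_from_counts(9.9, 0.1))  # ~0.080793, matching the second table
```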
Goal is to use a test which best reduces total entropy in the subgroups.
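For example, the two candidate splits of the 10/10 root group shown earlier can be compared by the weighted entropy of their subgroups. The snippet below is a minimal sketch of that comparison (not code from the tutorial):

```python
from math import log2

def entropy(yes, no):
    total = yes + no
    return -sum(p * log2(p) for p in (yes / total, no / total) if p > 0)

def weighted_entropy(groups):
    """Average subgroup entropy, weighted by subgroup size."""
    total = sum(yes + no for yes, no in groups)
    return sum((yes + no) / total * entropy(yes, no) for yes, no in groups)

parent = entropy(10, 10)  # 1.0 for the 10/10 root group

gain_px = parent - weighted_entropy([(9, 3), (1, 7)])  # ~0.30 bits gained
gain_py = parent - weighted_entropy([(7, 7), (3, 3)])  # 0.0: no information gained

print(gain_px, gain_py)   # Px > Kx is the better test; Py > Ky tells us nothing
```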
Building the Tree
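As a sketch only (the use of scikit-learn and the synthetic data are assumptions, not the tutorial's own method), a tree of this kind can be grown with the entropy criterion, which picks at each node the test that best reduces entropy:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# scikit-learn expects patients as rows, so this is the transpose of the
# 50 x 20 proteins-by-patients array described earlier (synthetic values here).
X = rng.normal(size=(20, 50))       # 20 patients x 50 protein levels
y = np.array([1] * 10 + [0] * 10)   # 10 diseased, 10 healthy

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Print the learned tests on protein levels and the resulting leaves.
print(export_text(clf, feature_names=[f"P{i+1}" for i in range(50)]))
```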
Links
• http://www.ece.msstate.edu/research/isip/publications/courses/ece_8463/lectures/current/lecture_27/lecture_27.pdf
• Decision Trees & Data Mining
• Andrew Moore Tutorial