Learning Decision Trees
Brief tutorial by M Werner
Medical Diagnosis Example
• Goal – Diagnose a disease from a blood test
• Clinical Use
  – Blood sample is obtained from a patient
  – Blood is tested to measure current expression of various proteins, say by using a DNA microarray
  – Data is analyzed to produce a Yes or No answer
Data Analysis
• Use a decision tree such as:
[Tree diagram: each internal node tests a protein expression level (P1 > K1, P2 > K2, P3 > K3, P4 > K4) and branches on Y/N; the leaves give the Yes/No diagnosis.]
How to Build the Decision Tree
• Start with samples of blood from patients known to either have the disease or not (training set).
• Suppose there are 20 patients: 10 are known to have the disease and 10 are known not to.
• From the training set, get expression levels for all proteins of interest.
• i.e. if there are 20 patients and 50 proteins, we get a 50 x 20 array of real numbers (a sketch of this layout follows below)
  – Rows are proteins
  – Columns are patients
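As an illustration only (the array shape and names below are assumptions for the sketch, with synthetic values rather than real blood-test data), the training set described above could be laid out like this in Python:

```python
import numpy as np

n_proteins, n_patients = 50, 20
rng = np.random.default_rng(0)

# Rows are proteins, columns are patients, as in the slide.
# Real data would come from the blood tests; here it is synthetic.
expression = rng.normal(size=(n_proteins, n_patients))

# Known outcomes for the training set: 10 diseased (1), 10 healthy (0).
labels = np.array([1] * 10 + [0] * 10)

print(expression.shape)  # (50, 20)
print(labels.sum())      # 10 patients known to have the disease
```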
Choosing the decision nodes
• We would like the tree to be as short as possible
• Start with all 20 patients in one group (10 have the disease, 10 don't)
• Choose a protein and a level that gains the most information
[Diagram: the 10/10 root group is split two candidate ways. A possible splitting condition Px > Kx yields subgroups 9/3 (mostly diseased) and 1/7 (mostly not diseased). An alternative splitting condition Py > Ky yields subgroups 7/7 and 3/3.]
How to determine information gain
• Purity – A measure of the degree to which the patients in a group share the same outcome.
• A group that splits 1/7 is fairly pure – most patients don't have the disease
• 0/8 is even purer
• 4/4 is the opposite of pure. This group is said to have high entropy. Knowing that a patient is in this group does not make her more or less likely to have the disease.
• The decision tree should reduce entropy as test conditions are evaluated
Measuring Purity (Entropy)
• Let f(i,j) = Prob(Outcome = j in node i)
• i.e. If node 2 has a 9/3 split
  – f(2,0) = 9/12 = .75
  – f(2,1) = 3/12 = .25
• Gini impurity: I_G(i) = 1 − Σ_j f(i,j)^2
• Entropy: H(i) = −Σ_j f(i,j) log2 f(i,j) (see the sketch below)
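Both measures can be computed directly from a group's class counts. The helper functions below are a minimal sketch (not part of the tutorial); the example calls use the splits mentioned on the previous slides.

```python
from math import log2

def gini(counts):
    """Gini impurity: 1 - sum_j f(i,j)^2 for a group with the given class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Entropy: -sum_j f(i,j) * log2 f(i,j), with 0*log2(0) taken as 0."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([9, 3]))  # ~0.811  (the 9/3 node: f = 0.75, 0.25)
print(entropy([1, 7]))  # ~0.544  (fairly pure: mostly not diseased)
print(entropy([0, 8]))  # 0.0     (perfectly pure)
print(entropy([4, 4]))  # 1.0     (the opposite of pure: maximum entropy)
print(gini([9, 3]))     # 0.375
```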
Computing Entropy

Node with an 11/1 split (12 patients):

         frequency   prob       log2(prob)   prob × log2(prob)
  yes    11          0.916667   -0.12553     -0.11507
  no     1           0.083333   -3.58496     -0.29875
  total  12                                   entropy = 0.413817

Node with a 9.9/0.1 split (10 patients):

         frequency   prob       log2(prob)   prob × log2(prob)
  yes    9.9         0.99       -0.0145      -0.01435
  no     0.1         0.01       -6.64386     -0.06644
  total  10                                   entropy = 0.080793
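As a quick check (a sketch, not part of the original slides), the same numbers can be reproduced in a few lines:

```python
from math import log2

def entropy_from_counts(yes, no):
    total = yes + no
    return -sum(p * log2(p) for p in (yes / total, no / total) if p > 0)

print(entropy_from_counts(11, 1))     # ~0.413817, matching the first table
print(entropy_from_counts(9.9, 0.1))  # ~0.080793, matching the second table
```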
Goal is to use a test which best reduces total entropy in the subgroups.
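For example, the two candidate splits of the 10/10 root group shown earlier can be compared by the weighted entropy of their subgroups. The snippet below is a minimal sketch of that comparison (not code from the tutorial):

```python
from math import log2

def entropy(yes, no):
    total = yes + no
    return -sum(p * log2(p) for p in (yes / total, no / total) if p > 0)

def weighted_entropy(groups):
    """Average subgroup entropy, weighted by subgroup size."""
    total = sum(yes + no for yes, no in groups)
    return sum((yes + no) / total * entropy(yes, no) for yes, no in groups)

parent = entropy(10, 10)  # 1.0 for the 10/10 root group

gain_px = parent - weighted_entropy([(9, 3), (1, 7)])  # ~0.30 bits gained
gain_py = parent - weighted_entropy([(7, 7), (3, 3)])  # 0.0: no information gained

print(gain_px, gain_py)   # Px > Kx is the better test; Py > Ky tells us nothing
```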
Building the Tree
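As a sketch only (the use of scikit-learn and the synthetic data are assumptions, not the tutorial's own method), a tree of this kind can be grown with the entropy criterion, which picks at each node the test that best reduces entropy:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# scikit-learn expects patients as rows, so this is the transpose of the
# 50 x 20 proteins-by-patients array described earlier (synthetic values here).
X = rng.normal(size=(20, 50))       # 20 patients x 50 protein levels
y = np.array([1] * 10 + [0] * 10)   # 10 diseased, 10 healthy

clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
clf.fit(X, y)

# Print the learned tests on protein levels and the resulting leaves.
print(export_text(clf, feature_names=[f"P{i+1}" for i in range(50)]))
```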
Links
• http://www.ece.msstate.edu/research/isip/publications/courses/ece_8463/lectures/current/lecture_27/lecture_27.pdf
• Decision Trees & Data Mining
• Andrew Moore Tutorial