Intelligent Data Analysis and Probabilistic Inference
Data Mining Tutorial 2

Question 1: Decision Tree Classification / Entropy / Information Gain

a) Describe briefly the operation of decision tree classification. Make sure to discuss how to derive a decision tree and how to use a tree in classifying unseen cases.
b) In the context of decision tree classification, what is meant by entropy and information gain? Describe how they can be used in deriving a decision tree.
c) What is the maximum depth of a decision tree if all the attributes of the training data set are discrete?

Question 2: Decision Tree Classification / Entropy / Information Gain

Given the following set of training examples:

Instance  1  2  3  4  5  6  7  8  9  10
A1        T  T  F  M  F  T  F  M  F  T
A2        N  O  N  N  N  O  O  N  O  N
Class     1  0  1  0  1  1  0  1  0  0

a) What is the entropy of this collection of training examples with respect to the target classification?
b) Build a decision tree T from this training data set, using information gain as the metric for choosing the nodes of the tree.
c) Build a decision tree T' in the reverse order of the tree of part b (i.e. if the root node condition in T was chosen as the attribute "A1", let it now be the attribute "A2", and vice versa). What do you conclude?

Question 3: Data Cleaning / Entropy-based Discretization / Decision Trees

a) Explain what is meant by data reduction, explain why it may be needed before submitting data to a mining algorithm, and provide examples of four different data reduction methods.
b) Explain briefly what concept hierarchies mean. Illustrate your answer with a simple example.
c) Comment briefly on why discretization of numerical data columns may be needed before submitting data to a mining algorithm.
d) Describe briefly what is meant by entropy measures and explain how they can be used as a means for discretization of numerical columns. Illustrate your explanation with pseudo code.

mmg/yg 2003
e) Suggest how the simple decision tree induction algorithm can be modified to cope with splitting numerical input variables. Describe how entropy measures can help in this case.

Solution to Question 2

Entropy of the full training set: five examples of each class, so Entropy = 1.

Candidate split on attribute A1 (class counts per branch; the sub-counts in brackets are Class=0/Class=1 broken down by A2):

  A1=T: Class=0: 2, Class=1: 2  (A2=N: 1/1, A2=O: 1/1)   Entropy = 1
  A1=F: Class=0: 2, Class=1: 2  (A2=N: 0/2, A2=O: 2/0)   Entropy = 1
  A1=M: Class=0: 1, Class=1: 1  (A2=N: 1/1, A2=O: 0/0)   Entropy = 1

  Gain(A1) = 1 - 4/10*1 - 4/10*1 - 2/10*1 = 0

Candidate split on attribute A2 (sub-counts Class=0/Class=1 by A1):

  A2=N: Class=0: 2, Class=1: 4  (A1=T: 1/1, A1=F: 0/2, A1=M: 1/1)
        Entropy = -2/6 log2(2/6) - 4/6 log2(4/6) = 0.918
  A2=O: Class=0: 3, Class=1: 1  (A1=T: 1/1, A1=F: 2/0, A1=M: 0/0)
        Entropy = -(3/4 log2(3/4) + 1/4 log2(1/4)) = 0.811

  Gain(A2) = 1 - 6/10*0.918 - 4/10*0.811 = 0.125

Gain(A1) < Gain(A2), so split on A2 first.

Resulting tree (node sizes and class proportions):

  Root: 10 instances, c(0) = 50%, c(1) = 50%
    A2=O: 4 instances, c(0) = 75%, c(1) = 25%
      A1=T: 2 instances, c(0) = 50%,  c(1) = 50%
      A1=F: 2 instances, c(0) = 100%, c(1) = 0%
      A1=M: 0 instances
    A2=N: 6 instances, c(0) = 33%, c(1) = 67%
      A1=T: 2 instances, c(0) = 50%, c(1) = 50%
      A1=F: 2 instances, c(0) = 0%,  c(1) = 100%
      A1=M: 2 instances, c(0) = 50%, c(1) = 50%
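The entropy and gain values in the worked solution above can be checked with a short script. This is a sketch in Python (not part of the original tutorial); the data list encodes the ten training instances from Question 2 as (A1, A2, Class) triples.

```python
import math

# Training set from Question 2 (instance 10 is A1=T, A2=N, Class=0).
data = [
    ("T", "N", 1), ("T", "O", 0), ("F", "N", 1), ("M", "N", 0), ("F", "N", 1),
    ("T", "O", 1), ("F", "O", 0), ("M", "N", 1), ("F", "O", 0), ("T", "N", 0),
]

def entropy(rows):
    """Shannon entropy (base 2) of the class labels in rows."""
    n = len(rows)
    h = 0.0
    for c in set(r[-1] for r in rows):
        p = sum(1 for r in rows if r[-1] == c) / n
        h -= p * math.log2(p)
    return h

def info_gain(rows, attr):
    """Information gain of splitting rows on the attribute at index attr."""
    gain = entropy(rows)
    for v in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == v]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

print(entropy(data))       # 1.0 bit (five examples of each class)
print(info_gain(data, 0))  # Gain(A1): approximately 0
print(info_gain(data, 1))  # Gain(A2): approximately 0.125
```

Since Gain(A2) > Gain(A1), the script agrees with the solution's choice of A2 as the root split.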
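For Question 3 (d), the use of entropy as a discretization criterion can be sketched in code. This is one possible formulation, under two assumptions not stated in the tutorial: candidate cut points are midpoints between consecutive distinct sorted values, and recursion stops when the best cut reduces entropy by less than a hypothetical min_gain threshold (classical entropy-based discretization instead uses an MDL stopping test).

```python
import math

def class_entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    n = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        h -= p * math.log2(p)
    return h

def best_split(values, labels):
    """Best binary cut on a numerical column: returns (threshold, weighted
    entropy), trying midpoints between consecutive distinct sorted values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_h = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        h = (len(left) * class_entropy(left)
             + len(right) * class_entropy(right)) / n
        if h < best_h:
            best_t, best_h = t, h
    return best_t, best_h

def discretize(values, labels, min_gain=0.01):
    """Recursively collect cut points while the best cut still reduces
    entropy by at least min_gain (a simplified stopping rule)."""
    t, h = best_split(values, labels)
    if t is None or class_entropy(labels) - h < min_gain:
        return []
    left = [(v, l) for v, l in zip(values, labels) if v <= t]
    right = [(v, l) for v, l in zip(values, labels) if v > t]
    return (discretize(*zip(*left), min_gain) + [t]
            + discretize(*zip(*right), min_gain))

print(discretize([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # [6.5]
```

The same best_split routine also suggests an answer to part (e): a decision tree learner can handle a numerical input variable by evaluating the binary test "value <= t" at the entropy-minimising threshold t, and comparing its gain against the ordinary discrete-attribute splits.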