Intelligent Data Analysis and Probabilistic Inference
Data Mining Tutorial 2
Question 1: Decision Tree Classification/Entropy/Information Gain
a) Describe briefly the operation of decision tree classification. Make sure to discuss how to
derive a decision tree and how to use the tree to classify unseen cases.
b) In the context of decision tree classification, what is meant by entropy and information
gain? Describe how they can be used in deriving a decision tree.
c) What is the maximum depth of a decision tree if all the attributes of the training data set
are discrete?
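As a reference for parts (a) and (b), entropy and information gain can be sketched in a few lines of Python (a minimal sketch; the function names are illustrative, not from the course notes):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy of the parent set minus the weighted entropy of the
    partitions induced by the attribute at attribute_index."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    weighted = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - weighted
```

Tree derivation repeatedly picks the attribute with the highest information gain as the next node; an unseen case is classified by following the branch matching its attribute value at each node until a leaf is reached.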
Question 2: Decision Tree Classification/Entropy/Information Gain
Given the following set of training examples:
Instance   A1   A2   Class Value
1          T    N    1
2          T    O    0
3          F    N    1
4          M    N    0
5          F    N    1
6          T    O    1
7          F    O    0
8          M    N    1
9          F    O    0
10         T    N    0
a) What is the entropy of this collection of training examples with respect to the target
classification?
b) Build a decision tree T from this training data set, using information gain as the
metric for choosing the nodes of the tree.
c) Build a decision tree T’’ in the reverse order of the tree of part b (i.e. if the root node
condition in T was chosen as the attribute “A1”, let it now be the attribute “A2”, and vice
versa). What do you conclude?
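For parts (a) and (b), the entropy of the collection and the gains for A1 and A2 can be checked numerically (a sketch; the dataset is transcribed from the table above):

```python
from math import log2

# Training set from the table: (A1, A2, class)
data = [
    ('T', 'N', 1), ('T', 'O', 0), ('F', 'N', 1), ('M', 'N', 0),
    ('F', 'N', 1), ('T', 'O', 1), ('F', 'O', 0), ('M', 'N', 1),
    ('F', 'O', 0), ('T', 'N', 0),
]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

def gain(data, attr):  # attr: 0 for A1, 1 for A2
    labels = [c for *_, c in data]
    weighted = 0.0
    for value in {row[attr] for row in data}:
        subset = [c for *row, c in data if row[attr] == value]
        weighted += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - weighted

print("Entropy of collection:", entropy([c for *_, c in data]))
print("Gain(A1):", gain(data, 0))  # ≈ 0
print("Gain(A2):", gain(data, 1))  # ≈ 0.125
```

With five examples in each class the collection has entropy 1; the gains agree with the worked solution at the end of this sheet.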
Question 3: Data Cleaning/Entropy-based Discretization/Decision Trees
a) Explain what is meant by data reduction and why it may be needed before
submitting data to a mining algorithm, and give examples of four different data
reduction methods.
b) Explain briefly what concept hierarchies mean. Illustrate your answer with a simple
example.
mmg/yg 2003
c) Comment briefly on why discretization of numerical data columns may be needed before
submitting data to a mining algorithm.
d) Describe briefly what is meant by entropy measures and explain how they can be used as
a means for discretization of numerical columns. Illustrate your explanation with pseudo
code.
e) Suggest how the simple decision tree induction algorithm can be modified to cope with
splitting numerical input variables. Describe how entropy measures can help in this case.
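One common approach for parts (d) and (e) (a sketch, not the course's specific algorithm): sort the column, take the midpoint between each pair of adjacent distinct values as a candidate cut point, and keep the cut that minimises the weighted entropy of the two resulting partitions. The same search yields the threshold condition for a numeric node in a decision tree.

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(v) / n) * log2(labels.count(v) / n)
                for v in set(labels))

def best_cut(values, labels):
    """Entropy-based binary discretization: try the midpoint between
    each pair of adjacent sorted values and return the cut point with
    the lowest weighted entropy of the two partitions."""
    pairs = sorted(zip(values, labels))
    best = (float('inf'), None)  # (weighted entropy, cut point)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= cut]
        right = [l for v, l in pairs if v > cut]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        best = min(best, (weighted, cut))
    return best[1]
```

For example, best_cut([1, 2, 3, 8, 9, 10], [0, 0, 0, 1, 1, 1]) returns 5.5, the midpoint that separates the two classes perfectly. Applied recursively, this yields multi-interval discretization.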
Solution to Question 2:

Entropy of the full collection = 1 (five examples of class 1, five of class 0).

Split on attribute A1:
  A1 = T: instances {1, 2, 6, 10}, two of class 1 and two of class 0, Entropy = 1
  A1 = F: instances {3, 5, 7, 9}, two of class 1 and two of class 0, Entropy = 1
  A1 = M: instances {4, 8}, one of class 1 and one of class 0, Entropy = 1
  Gain(A1) = 1 - 4/10*1 - 4/10*1 - 2/10*1 = 0

Split on attribute A2:
  A2 = N: instances {1, 3, 4, 5, 8, 10}, four of class 1 and two of class 0
    Entropy = -2/6 log2(2/6) - 4/6 log2(4/6) = 0.918
  A2 = O: instances {2, 6, 7, 9}, one of class 1 and three of class 0
    Entropy = -(3/4 log2(3/4) + 1/4 log2(1/4)) = 0.811
  Gain(A2) = 1 - 6/10*0.918 - 4/10*0.811 = 0.125

Gain(A1) < Gain(A2) => split on A2 at the root.
Resulting tree (root split on A2, then split on A1):

Root: 10 instances, c(0) = 50%, c(1) = 50%
  A2 = N: 6 instances, c(0) = 33%, c(1) = 67%
    A1 = T: 2 instances, c(0) = 50%, c(1) = 50%
    A1 = F: 2 instances, c(0) = 0%, c(1) = 100%
    A1 = M: 2 instances, c(0) = 50%, c(1) = 50%
  A2 = O: 4 instances, c(0) = 75%, c(1) = 25%
    A1 = T: 2 instances, c(0) = 50%, c(1) = 50%
    A1 = F: 2 instances, c(0) = 100%, c(1) = 0%
    A1 = M: 0 instances, c(0) = 0%, c(1) = 0%
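The class distributions in the tree above can be verified by partitioning the training set first on A2 and then on A1 (a sketch; the rows are transcribed from the Question 2 table):

```python
from collections import Counter

# (A1, A2, class) rows from the Question 2 table
data = [
    ('T', 'N', 1), ('T', 'O', 0), ('F', 'N', 1), ('M', 'N', 0),
    ('F', 'N', 1), ('T', 'O', 1), ('F', 'O', 0), ('M', 'N', 1),
    ('F', 'O', 0), ('T', 'N', 0),
]

def distribution(rows):
    """Percentage of each class value among the given rows."""
    counts = Counter(c for _, _, c in rows)
    n = len(rows) or 1  # empty leaves report 0% for both classes
    return {cls: round(100 * counts[cls] / n) for cls in (0, 1)}

for a2 in ('N', 'O'):
    branch = [r for r in data if r[1] == a2]
    print(f"A2={a2}: {len(branch)} instances, {distribution(branch)}")
    for a1 in ('T', 'F', 'M'):
        leaf = [r for r in branch if r[0] == a1]
        print(f"  A1={a1}: {len(leaf)} instances, {distribution(leaf)}")
```

The printed counts and percentages reproduce the branch and leaf statistics of the tree.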