CSCI6405 Fall 2003
Data Mining and Data Warehousing
Instructor: Qigang Gao, Office: CS219,
Tel:494-3356, Email: [email protected]
Teaching Assistant: Christopher Jordan,
Email: [email protected]
Office Hours: TR, 1:30 - 3:00 PM
Lecture Outline
Part III: Data Mining Methods/Algorithms
4. Data mining primitives (ch4, optional)
5. Classification data mining (ch7)
6. Association data mining (ch6)
7. Characterization data mining (ch5)
8. Clustering data mining (ch8)
Part IV: Mining Complex Types of Data
9. Mining the Web (Ch9)
10. Mining spatial data (Ch9)
Project Presentations
Ass3: Oct (14) 16 – Oct 30
Ass4: Oct 30 – Nov 13
Project Due: Dec 8
~prof6405/Doc/proj.guide
5. CLASSIFICATION AND PREDICTION (Ch7)
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Other Classification Methods
Summary
Induction by sorting examples

 +: X3,X4,X5,X7,X9,X11,X12,X13   Yes for PlayTennis
 -: X1,X2,X6,X8,X14              No for PlayTennis
Attributes: {Outlook, Temperature, Humidity, Wind}

Outlook
 |- sunny:    +: X9,X11         -: X1,X2,X8   [2+,3-]
 |     Humidity   (remaining attributes {Temperature, Humidity, Wind})
 |      |- high:   +: none       -: X1,X2,X8  [0+,3-]  No
 |      |- normal: +: X9,X11     -: none      [2+,0-]  Yes
 |- overcast: +: X3,X7,X12,X13  -: none       [4+,0-]  Yes
 |- rain:     +: X4,X5,X10      -: X6,X14     [3+,2-]
       Wind   (remaining attributes {Temperature, Humidity, Wind})
        |- strong: +: none       -: X6,X14    [0+,2-]  No
        |- weak:   +: X4,X5,X10  -: none      [3+,0-]  Yes
Day  Outlook   Temperature  Humidity  Wind    PlayTennis
---  --------  -----------  --------  ------  ----------
 1   sunny     hot          high      weak    no
 2   sunny     hot          high      strong  no
 3   overcast  hot          high      weak    yes
 4   rain      mild         high      weak    yes
 5   rain      cool         normal    weak    yes
 6   rain      cool         normal    strong  no
 7   overcast  cool         normal    strong  yes
 8   sunny     mild         high      weak    no
 9   sunny     cool         normal    weak    yes
10   rain      mild         normal    weak    yes
11   sunny     mild         normal    strong  yes
12   overcast  mild         high      strong  yes
13   overcast  hot          normal    weak    yes
14   rain      mild         high      strong  no
Review: Information Measure
The message changes the probability of the desired outcome from P1(outcome_before) to
P2(outcome_after). The quantity of information in the message "The cow is in cell (4,7)",
about finding the cow, is therefore:

  Information received = log_2 P2 - log_2 P1
                       = log_2 (P2 / P1)
                       = log_2 64 = 6 bits
  (here P1 = 1/64, since the cow could be in any of 64 equally likely cells, and P2 = 1)

In general:

  Information = log_2 [ P2(outcome_after) / P1(outcome_before) ]
This formula works provided that
1) the probability before receiving the answer is greater than zero and
2) the probability after receiving the answer is greater than or equal to the probability
before receiving the answer.
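As a quick check of this measure, here is a minimal C sketch (the function name and the 64-cell assumption are illustrative, not from the course code):

  #include <math.h>
  #include <stdio.h>

  /* Information (in bits) received when the probability of the outcome
     rises from p_before to p_after: log_2(p_after / p_before). */
  double info_bits(double p_before, double p_after)
  {
      return log2(p_after / p_before);
  }

  int main(void)
  {
      /* Cow example: 64 equally likely cells before, one certain cell after. */
      printf("%.1f bits\n", info_bits(1.0 / 64.0, 1.0));  /* prints 6.0 */
      return 0;
  }

(Compile with the math library, e.g. gcc -lm.)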
E.g., Classification for the concept PlayTennis with the value (outcome) "Yes":

• P1(message_before) = 9/14
  The probability of an unknown example being a "Yes" example without asking any question.

• P2(message_after) = ?
  A question may be asked, such as "What is Outlook?" or "What is Temperature?",
  from the attribute list (Outlook, Temperature, Humidity, Wind).

• What is Outlook? Then the message can be one of:
  P2_1(Outlook = sunny) = 2/5,
  P2_2(Outlook = overcast) = 4/4,
  P2_3(Outlook = rain) = 3/5.

How can we quantify the information about both the "Yes" and "No" classes?
Entropy
Entropy: Measure class impurity of a data set
Measure Impurity:
Entropy can characterize the impurity of an arbitrary collection of examples.
The concept of entropy, used in information theory, is closely related to
the concept of entropy studied in thermodynamics, which is a measure of
the disorder in a system of particles. Entropy is a measure of how much
confusion, uncertainty, or disorder there is in a situation.
Suppose p1, p2, ...,pn is a set of probabilities, such that p1+p2+...+pn = 1.
E.g: 9/14 + 5/14 = 1 for PlayTennis
The entropy for this set of probabilities is defined to be the value:
  Entropy(S) = - ∑_{i=1}^{n} p_i * log_2(p_i)

where S is the data set,
p_i is the proportion of S belonging to the i-th class of the target,
and n is the number of target classes.
E.g., For Play Tennis: Entropy (9+,5-) = - (9/14)log_2(9/14) - (5/14)log_2(5/14)
= 0.940
The entropy is maximized when p1=p2=...=pn=1/n.
The entropy is minimized (zero) if pi=1 for some i, and pj = 0 for all j =/= i.
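As a sanity check on these values, here is a small stand-alone C sketch (not the course's gain.c; the names are illustrative) that evaluates the formula for the PlayTennis set [9+,5-] and for the two extreme cases just mentioned:

  #include <math.h>
  #include <stdio.h>

  /* Entropy of class probabilities p[0..n-1] that sum to 1.
     Terms with p_i = 0 contribute 0 by convention. */
  double entropy(const double p[], int n)
  {
      double e = 0.0;
      for (int i = 0; i < n; i++)
          if (p[i] > 0.0)
              e -= p[i] * log2(p[i]);
      return e;
  }

  int main(void)
  {
      double play_tennis[2] = { 9.0 / 14.0, 5.0 / 14.0 };
      printf("Entropy([9+,5-]) = %.3f\n", entropy(play_tennis, 2));  /* 0.940 */

      double pure[2] = { 1.0, 0.0 }, even[2] = { 0.5, 0.5 };
      printf("pure: %.3f  even: %.3f\n",
             entropy(pure, 2), entropy(even, 2));  /* 0.000 and 1.000 */
      return 0;
  }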
If the target attribute T has n values (the more general case):

  Entropy(S) = ∑_{i=1}^{n} -p_i log_2(p_i)

where i runs over the n values of T.
E.g., if the target attribute were Outlook = {sunny, overcast, rain},
then n = 3.
Entropy (cont)

E.g.,  +: X3,X4,X5,X7,X9,X11,X12,X13   Yes for PlayTennis
       -: X1,X2,X6,X8,X14              No for PlayTennis

Outlook
 |- sunny:    +: X9,X11         -: X1,X2,X8   [2+,3-]
 |- overcast: +: X3,X7,X12,X13  -: none       [4+,0-]  Yes
 |- rain:     +: X4,X5,X10      -: X6,X14     [3+,2-]

Entropy(S) = Entropy([9+,5-]) = -(9/14)log_2(9/14) - (5/14)log_2(5/14) = 0.940
Entropy(S_sunny) = Entropy([2+,3-]) = -(2/5)log_2(2/5) - (3/5)log_2(3/5) = 0.971
Entropy(S_overcast) = Entropy([4+,0-]) = -(4/4)log_2(4/4) - (0/4)log_2(0/4) = 0
Entropy(S_rain) = Entropy([3+,2-]) = -(3/5)log_2(3/5) - (2/5)log_2(2/5) = 0.971

Expected entropy (measuring the overall result of a partition):

  E(A) = ∑_{j=1}^{v} [ (s_1j + ... + s_mj) / s ] * Entropy(S_j)
Review: from information to information gain
Outcome                  Question               Message
-----------------------  ---------------------  ------------------------
Find the cow             Where is the cow?      Cell (5,7), P = 1
Play tennis (Yes, No)    What is the outlook?   {Sunny, Overcast, Rain}
                         Temperature?           {hot, mild, cool}
                         Humidity?              {high, normal}
                         Windy?                 {weak, strong}
* A decision tree represents such a series of questions.
The answer to the first question determines what follow-up question is asked next.
If the questions are well chosen, a surprisingly short series is enough to accurately
classify an incoming record.
* The basic idea behind the decision tree algorithm is to test the most important
attribute first. By "most important," we mean the one that makes the most
difference to the classification of an example. This way, we hope to get the correct
classification with a small number of tests, meaning that all paths in the tree will be
short and the tree as a whole will be small.
Summary of Attribute Selection Measure:

* Select the attribute with the highest information gain.
* S contains s_i tuples of class C_i, for i = {1, ..., m}.
* Entropy measures the information required to classify an arbitrary tuple set:

    E(s_1, s_2, ..., s_m) = - ∑_{i=1}^{m} (s_i / s) log_2(s_i / s)

* Expected entropy of attribute A with values {a_1, a_2, ..., a_v}:

    EE(A) = ∑_{j=1}^{v} (s_j / s) * E(s_1j, ..., s_mj)

  where s_j = s_1j + ... + s_mj is the size of the j-th subset.

* Information gained by branching on attribute A:

    Gain(A) = E(s_1, s_2, ..., s_m) - EE(A)
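These three formulas can be checked with a short C sketch (a stand-alone illustration built on per-branch class counts; it is not the course implementation). Applied to the Outlook partition of the 14 examples, [2+,3-], [4+,0-] and [3+,2-], it reproduces the gain worked out on the next slide:

  #include <math.h>
  #include <stdio.h>

  #define MAX_CLASSES 8

  /* E(s_1,...,s_m): entropy from the class counts of one tuple set. */
  double entropy_counts(const int counts[], int m)
  {
      int total = 0;
      for (int i = 0; i < m; i++)
          total += counts[i];
      double e = 0.0;
      for (int i = 0; i < m; i++) {
          if (counts[i] > 0) {
              double p = (double)counts[i] / total;
              e -= p * log2(p);
          }
      }
      return e;
  }

  /* Gain(A) = E(S) - EE(A), where subsets[j][i] is the number of class-i
     tuples in the j-th branch created by attribute A (v branches, m classes). */
  double gain(int subsets[][MAX_CLASSES], int v, int m)
  {
      int class_total[MAX_CLASSES] = {0};
      int total = 0;
      for (int j = 0; j < v; j++)
          for (int i = 0; i < m; i++) {
              class_total[i] += subsets[j][i];
              total += subsets[j][i];
          }
      double ee = 0.0;                       /* expected entropy EE(A) */
      for (int j = 0; j < v; j++) {
          int sj = 0;
          for (int i = 0; i < m; i++)
              sj += subsets[j][i];
          ee += (double)sj / total * entropy_counts(subsets[j], m);
      }
      return entropy_counts(class_total, m) - ee;
  }

  int main(void)
  {
      /* Outlook splits [9+,5-] into sunny [2+,3-], overcast [4+,0-], rain [3+,2-]. */
      int outlook[3][MAX_CLASSES] = { {2, 3}, {4, 0}, {3, 2} };
      /* Prints 0.247; the slides report 0.246 (difference is rounding only). */
      printf("Gain(S,Outlook) = %.3f\n", gain(outlook, 3, 2));
      return 0;
  }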
E.g., the set S has 14 examples, i.e., |S| = 14 with [9+,5-] for the
target PlayTennis = {yes, no}, so Entropy(S) = 0.940. When we consider the
attribute Outlook, the information about the target PlayTennis receives a gain:
Outlook = {sunny, overcast, rain} partitions the set into subsets:
S_sunny = [2+,3-] = 5
S_overcast = [4+,0-] = 4
S_rain = [3+,2-] = 5
Gain(S,Outlook) = Entropy(S) - (5/14)Entropy(S_sunny) - (4/14)Entropy(S_overcast)
- (5/14)Entropy(S_rain) = 0.246
Where,
Entropy(S) = Entropy([9+,5-])= -(9/14)log_2(9/14) - (5/14)log_2(5/14) = 0.940
Entropy(S_sunny) = Entropy([2+,3-])= -(2/5)log_2(2/5) - (3/5)log_2(3/5) = 0.971
Entropy(S_overcast) = Entropy([4+,0-]) = -(4/4)log_2(4/4) - (0/4)log_2(0/4) = 0
Entropy(S_rain) = Entropy([3+,2-])= -(3/5)log_2(3/5) - (2/5)log_2(2/5) = 0.971
The information gain calculations for all four attributes:

  Gain(S,Outlook)     = 0.246
  Gain(S,Humidity)    = 0.151
  Gain(S,Windy)       = 0.048
  Gain(S,Temperature) = 0.029

The attribute Outlook leads to the best partition of S in terms of the
target PlayTennis. In other words, its partitioned subsets are overall more
pure (more homogeneous) than those of any other attribute, so Outlook is the
best classifier at this node.
* Illustration:
The partially learned decision tree resulting from the first step of ID3:

S = {X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14}   [9+,5-]

Outlook
 |- sunny:    {X1,X2,X8,X9,X11}   [2+,3-]  ?   Which attribute should be tested here?
 |- overcast: {X3,X7,X12,X13}     [4+,0-]  Yes
 |- rain:     {X4,X5,X6,X10,X14}  [3+,2-]  ?   Which attribute should be tested here?
How to build a decision tree
The representative algorithms:
* ID3 (Iterative Dichotomiser 3),
Quinlan, J.R., "Induction of decision trees",
Machine Learning, Vol 1, No.1, pp 81-106, 1986.
* C4.5/C5.0, Quinlan, J.R., "C4.5: Programs for Machine
Learning", San Francisco: Morgan Kaufmann, 1993.
* Other popular algorithms:
CART, CHAID, Chi-squared, ...
Main features of ID3 algorithm:
• ID3 is specialized to learning boolean-valued functions
• ID3 is a greedy search algorithm that grows the tree top-down, at each node selecting the attribute that best classifies
the local training examples. This process continues until the
tree perfectly classifies the training examples or until all
attributes have been used.
General procedure of building up decision tree:
1) At each decision level, decide which attribute is the best classifier.
At each level, the available attributes are evaluated to find the one that is
most informative for classifying the current data set with respect to the target.
The evaluation is based on a statistical test that determines how well each
attribute alone classifies the training examples.
E.g., at level 0 (the root), the available attributes are all attributes except
the target, and the current data set is the original input data.
2) The data set is divided into subsets according to the values of the selected
classifier.
3) If a subset is pure (i.e., all examples in it belong to the same class), the
branch ends in a leaf node. Otherwise the node becomes the root of a new subtree.
4) The process is repeated using the data set associated with each descendant
node to select the best attribute to test at that point in the tree. The learning
process is accomplished through the construction of the decision tree.
ID3 Algorithm
ID3(Examples, Target, Attributes)
/* Examples are the training examples.
Target is the attribute whose value is to be predicted by the tree.
Attributes is a list of other attributes that may be tested by the learned decision tree.
Returns a decision tree that correctly classifies the given Examples.
*/
• Create a Root node for the tree
• If all Examples are positive, return the single-node tree Root, with label = "+".
• If all Examples are negative, return the single-node tree Root, with label = "-".
• If Attributes list is empty, Return the single-node tree Root, with label = most
common value of Target in Examples (majority voting).
/* You may provide confidence measure (#%) on the classification, or give a warning
message if two classes tie.
*/
• Otherwise Begin
- A ← the attribute from Attributes that best classifies Examples.
/* The best attribute is the one with highest information gain measured by
Gain(S, A), where S is the set of Examples.
*/
- The decision attribute for Root ← A.
- For each possible value, v_i, of A,
- Add a new tree branch below Root, corresponding to the test A = v_i.
- Let Examples_(v_i), be the subset of Examples that have value v_i for A
- If Examples_(v_i) is empty
- Then below this new branch add a leaf node with label = most
common value of Target in Examples.
- Else below this new branch add the subtree
ID3(Examples_(v_i), Target, Attributes - {A})
• End
• Return Root
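For concreteness, here is a compact, self-contained C sketch of this recursion applied to the 14 PlayTennis examples. It is only an illustration of the pseudocode above: the data encoding, function names and output format are my own assumptions, not the course's demo code under ~prof6405Ass/ass3-demo/.

  #include <math.h>
  #include <stdio.h>

  /* Attribute encoding: Outlook(0)=sunny/overcast/rain, Temperature(1)=hot/mild/cool,
     Humidity(2)=high/normal, Wind(3)=weak/strong; column 4 is PlayTennis (1 = yes). */
  static const char *attr_name[4] = { "Outlook", "Temperature", "Humidity", "Wind" };
  static const char *attr_val[4][3] = {
      { "sunny", "overcast", "rain" },
      { "hot", "mild", "cool" },
      { "high", "normal", "" },
      { "weak", "strong", "" }
  };
  static const int n_vals[4] = { 3, 3, 2, 2 };

  static const int data[14][5] = {            /* days 1..14 of the table */
      {0,0,0,0,0}, {0,0,0,1,0}, {1,0,0,0,1}, {2,1,0,0,1}, {2,2,1,0,1},
      {2,2,1,1,0}, {1,2,1,1,1}, {0,1,0,0,0}, {0,2,1,0,1}, {2,1,1,0,1},
      {0,1,1,1,1}, {1,1,0,1,1}, {1,0,1,0,1}, {2,1,0,1,0}
  };

  /* Entropy of the subset given by the row indices rows[0..n-1]. */
  static double set_entropy(const int *rows, int n)
  {
      int pos = 0;
      for (int i = 0; i < n; i++)
          pos += data[rows[i]][4];
      if (n == 0 || pos == 0 || pos == n)
          return 0.0;
      double p = (double)pos / n;
      return -p * log2(p) - (1 - p) * log2(1 - p);
  }

  /* Gain(S, A) = Entropy(S) - expected entropy after splitting on attribute a. */
  static double gain(const int *rows, int n, int a)
  {
      double ee = 0.0;
      for (int v = 0; v < n_vals[a]; v++) {
          int sub[14], m = 0;
          for (int i = 0; i < n; i++)
              if (data[rows[i]][a] == v)
                  sub[m++] = rows[i];
          if (m > 0)
              ee += (double)m / n * set_entropy(sub, m);
      }
      return set_entropy(rows, n) - ee;
  }

  /* ID3: label pure nodes, otherwise split on the highest-gain unused attribute. */
  static void id3(const int *rows, int n, int used[4], int depth)
  {
      int pos = 0;
      for (int i = 0; i < n; i++)
          pos += data[rows[i]][4];
      if (pos == n) { printf("%*s=> yes\n", depth * 4, ""); return; }
      if (pos == 0) { printf("%*s=> no\n",  depth * 4, ""); return; }

      int best = -1;
      double best_gain = -1.0;
      for (int a = 0; a < 4; a++) {
          if (used[a]) continue;
          double g = gain(rows, n, a);
          if (g > best_gain) { best_gain = g; best = a; }
      }
      if (best < 0) {                         /* no attributes left: majority vote */
          printf("%*s=> %s\n", depth * 4, "", 2 * pos >= n ? "yes" : "no");
          return;
      }
      used[best] = 1;
      for (int v = 0; v < n_vals[best]; v++) {
          int sub[14], m = 0;
          for (int i = 0; i < n; i++)
              if (data[rows[i]][best] == v)
                  sub[m++] = rows[i];
          printf("%*s%s = %s\n", depth * 4, "", attr_name[best], attr_val[best][v]);
          if (m == 0)                         /* empty branch: majority of parent */
              printf("%*s=> %s\n", (depth + 1) * 4, "", 2 * pos >= n ? "yes" : "no");
          else
              id3(sub, m, used, depth + 1);
      }
      used[best] = 0;
  }

  int main(void)
  {
      int rows[14], used[4] = {0};
      for (int i = 0; i < 14; i++)
          rows[i] = i;
      id3(rows, 14, used, 0);
      return 0;
  }

Compiled with -lm, this sketch prints the same tree as the slides: Outlook at the root, Humidity under sunny, Wind under rain, and a Yes leaf under overcast.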
Implementation of ID3 algorithm
Two key components: greedy search and information gain measure
1. Decide root attribute A from the attributes of S that
best classifies Examples. The best attribute is the one with the highest information gain
(i.e., with the lowest expected entropy).
Root = Outlook after evaluating:
Gain(S,Outlook) = 0.246
Gain(S,Humidity) = 0.151
Gain(S,Windy) = 0.048
Gain(S,Temperature) = 0.029
The branches of Outlook are created below the root for each of its possible
values: sunny, overcast, and rain.
2. For each possible value, v_i, of A,
- Add a new tree branch below Root, corresponding to the test A = v_i.
- Let Examples_(v_i), be the subset of Examples that have value v_i for A.
- if a subset is not a leaf then call ID3 (Examples_(v_i), Target_attribute,
Attributes - {A})
* Illustration:
The partially learned decision tree resulting from the first step of ID3:

S = {X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14}   [9+,5-]

Outlook
 |- sunny:    {X1,X2,X8,X9,X11}   [2+,3-]  ?   Which attribute should be tested here?
 |- overcast: {X3,X7,X12,X13}     [4+,0-]  Yes
 |- rain:     {X4,X5,X6,X10,X14}  [3+,2-]  ?   Which attribute should be tested here?
Note:
1) The training examples are sorted to the corresponding descendant nodes.
2) The overcast descendant has only positive examples and therefore
becomes a leaf node with classification Yes.
3) The nodes Sunny and Rain will be further expanded, by selecting the attribute
with highest information gain relative to the new subsets of examples.
E.g., S_sunny = {X1,X2,X8,X9,X11}
Gain(S_sunny,Humidity) = .970 - (3/5)0.0 - (2/5)0.0 = .970
Gain(S_sunny,Temperature) = .970 - (2/5)0.0 - (2/5)1.0 - (1/5)0.0 = .570
Gain(S_sunny,Windy) = .970 - (2/5)1.0 - (3/5).918 = .019
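These three numbers can be verified with a few lines of C (a stand-alone check, not the course code), using the [+,-] counts of each branch of S_sunny:

  #include <math.h>
  #include <stdio.h>

  /* Two-class entropy from positive/negative counts (0 for a pure set). */
  static double H(int pos, int neg)
  {
      if (pos == 0 || neg == 0)
          return 0.0;
      double p = (double)pos / (pos + neg);
      return -p * log2(p) - (1 - p) * log2(1 - p);
  }

  int main(void)
  {
      double hs = H(2, 3);                       /* Entropy(S_sunny) = 0.971 */
      /* Humidity:    high [0+,3-], normal [2+,0-]            */
      printf("Humidity:    %.3f\n", hs - 3.0/5*H(0,3) - 2.0/5*H(2,0));
      /* Temperature: hot [0+,2-], mild [1+,1-], cool [1+,0-] */
      printf("Temperature: %.3f\n", hs - 2.0/5*H(0,2) - 2.0/5*H(1,1) - 1.0/5*H(1,0));
      /* Windy:       weak [1+,2-], strong [1+,1-]            */
      printf("Windy:       %.3f\n", hs - 3.0/5*H(1,2) - 2.0/5*H(1,1));
      return 0;
  }

The output (0.971, 0.571, 0.020) matches the slide's .970, .570 and .019 up to rounding, so Humidity is chosen for the sunny branch.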
Outlook
 |- sunny:    {X1,X2,X8,X9,X11}  [2+,3-]
 |     Humidity
 |      |- high:   {X1,X2,X8}  [0+,3-]  No
 |      |- normal: {X9,X11}    [2+,0-]  Yes
 |- overcast: {X3,X7,X12,X13}  [4+,0-]  Yes
 |- rain:     {X4,X5,X6,X10,X14}  [3+,2-]  ?   Which attribute should be tested here?
Implementation Example
Example of ID3 implementation (executable code): ~prof6405Ass/ass3-demo/
To run the program:
$ ID3
You then have the following interface:
What is the name of the file containing your data? data1
/* do not include any data files in your submission assign3 */
Please choose an attribute (by number):
1. Humidity
2. Windy
3. PlayTennis
Attribute: 3
Target attribute is: PlayTennis
If outlook is sunny, then
if Humidity is high, then PlayTennis is N.
if Humidity is normal, then PlayTennis is P.
If outlook is overcast, then PlayTennis is P.
If outlook is rain, then
if Windy is false, then PlayTennis is P.
if Windy is true, then PlayTennis is N.
* The code structure of the implementation uses a top-down, modular approach.
Overview of the Program Code:
--------------------------------------------
ID3.c:        main()
readdata.c:   readdata()
getattrib.c:  getattrib()
maketree.c:   maketree()
functions.c:  binary(), sameresults(), oneattrib()
gain.c:       choose(), gain(), entropy()
printtree.c:  printtree()
The following is the code structure:
main() --> readdata()
       --> getattrib() --> binary()
       --> maketree()  --> sameresults()
                       --> oneattrib()
                       --> choose()
                       --> gain() --> entropy()
       --> printtree()
...
sameresults():
This function loops through all tuples, setting a counter if a second value is encountered.
If only one value is present, a leaf node is created.
oneattrib():
This function is called if the only attribute left is the target attribute. As the target
attribute is binary, the number of occurrences of each value is counted, with the leaf
node being assigned the value which occurs most often.
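The course's functions.c is not reproduced in these notes, so the following is only a hypothetical sketch of what these two helpers might look like (the tuple layout, the signatures and the small demo in main() are all assumptions):

  #include <stdio.h>
  #include <string.h>

  #define MAX_LEN 32

  /* sameresults: returns 1 if every tuple has the same target value,
     so the node can immediately become a leaf. */
  int sameresults(char target[][MAX_LEN], int ntuples)
  {
      for (int i = 1; i < ntuples; i++)
          if (strcmp(target[i], target[0]) != 0)
              return 0;
      return 1;
  }

  /* oneattrib: only the (binary) target attribute is left; count the
     occurrences of each value and label the leaf with the majority one. */
  const char *oneattrib(char target[][MAX_LEN], int ntuples,
                        const char *value1, const char *value2)
  {
      int count1 = 0;
      for (int i = 0; i < ntuples; i++)
          if (strcmp(target[i], value1) == 0)
              count1++;
      return (2 * count1 >= ntuples) ? value1 : value2;
  }

  int main(void)
  {
      char t[3][MAX_LEN] = { "yes", "yes", "no" };
      printf("same=%d  majority=%s\n", sameresults(t, 3), oneattrib(t, 3, "yes", "no"));
      return 0;
  }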
Summary of decision tree induction
- Decision tree is a practical method for classification mining
* Efficient heuristic search strategy
Greedy search (or divide and conquer):
It is a hill-climbing search without backtracking over other attributes.
It chooses the test (attribute) that best discriminates among the target classes and
builds branches based upon it, inferring the decision tree by growing it from the root downward.
This process is repeated on each branch until it ends in a leaf node.
* Transparent representation of mined decision rules
Classification Rule:
There is a unique path from the root to each leaf. That path is an expression of the
rule used to classify the records.
The attractiveness of tree-based methods is due in large part to the fact that, in
contrast to neural networks, decision trees represent rules. Rules can readily be
expressed in English so that we humans can understand them.
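To make this concrete, the PlayTennis tree learned above can be written directly as nested conditionals; this small C sketch is illustrative only (string-valued attributes and the function name are my own choice, not the course code):

  #include <stdio.h>
  #include <string.h>

  /* One rule per root-to-leaf path of the learned PlayTennis tree:
     Outlook at the root, Humidity under sunny, Wind under rain. */
  const char *play_tennis(const char *outlook, const char *humidity,
                          const char *wind)
  {
      if (strcmp(outlook, "sunny") == 0)
          return strcmp(humidity, "high") == 0 ? "no" : "yes";
      if (strcmp(outlook, "overcast") == 0)
          return "yes";
      /* outlook == "rain" */
      return strcmp(wind, "strong") == 0 ? "no" : "yes";
  }

  int main(void)
  {
      /* Day 10 of the table: rain, normal humidity, weak wind -> yes. */
      printf("%s\n", play_tennis("rain", "normal", "weak"));
      return 0;
  }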
* The key issues shared by different algorithms:
-How to select a best attribute to test (attribute-selection measures).
-How to prune the decision tree (the different pruning strategies).
-How to handle noise or missing data.
-The abilities of handling different target-attribute domains.
* Effectiveness measure:
- We measure the effectiveness of a decision tree, taken as a whole, by
applying it to a collection of previously unseen records and observing the
percentage classified correctly.
- We must also pay attention to the quality of the individual branches of
the tree. Each path through the tree represents a rule, and some rules are
better than others. Sometimes the predictive power of the whole tree can
be improved by pruning back some of its weaker branches.
* At each node in the tree, we can measure:
- The number of records entering the node.
- The way those records would be classified if this was a leaf node.
- The percentage of records classified correctly at this node.
- Properties of ID3 search strategy
* ID3's search space:
- It is the hypothesis space of all decision trees: a complete space of finite
discrete-valued functions, relative to the available attributes.
- Because every finite discrete-valued function can be represented by some
decision tree, ID3 avoids one of the major risks of methods that search
incomplete hypothesis spaces (such as methods that consider only conjunctive
hypotheses): that the hypothesis space might not contain the target function.
* As with other inductive learning methods, ID3 can be characterized as searching a
space of hypotheses for one that fits the training examples best.
E.g., The search space for PlayTennis classification
E.g., The search space for PlayTennis classification

Search-tree root (set S)
 |- Outlook     (sunny | overcast | rain)  -> remaining {Temperature, Humidity, Windy}, e.g. for S_sunny, ...
 |- Temperature (hot | mild | cool)        -> remaining {Outlook, Humidity, Windy}, e.g. for S_hot, ...
 |- Humidity    (high | normal)            -> remaining {Outlook, Temperature, Windy}, e.g. for S_high, ...
 |- Windy       (strong | weak)            -> remaining {Outlook, Temperature, Humidity}, e.g. for S_strong, ...

The branch that tests Outlook first leads to the shortest decision tree.
The best tree based on the information gain heuristic function:

Outlook
 |- sunny:    Humidity
 |             |- High   -> No
 |             |- Normal -> Yes
 |- overcast: Yes
 |- rain:     Wind
               |- Strong -> No
               |- Weak   -> Yes