CSCI6405 Fall 2003
Data Mining and Data Warehousing

* Instructor: Qigang Gao, Office: CS219, Tel: 494-3356, Email: [email protected]
* Teaching Assistant: Christopher Jordan, Email: [email protected]
* Office Hours: TR, 1:30 - 3:00 PM
Lecture Outline

Part I: Overview of DM and DW
  1. Introduction (Ch1)
  2. Data preprocessing (Ch3)
Part II: DW and OLAP
  3. Data warehousing and OLAP (Ch2)
Part III: Data Mining Methods/Algorithms
  4. Data mining primitives (Ch4, optional)
  5. Classification data mining (Ch7)
  6. Association data mining (Ch6)
  7. Characterization data mining (Ch5)
  8. Clustering data mining (Ch8)
Part IV: Mining Complex Types of Data
  9. Mining the Web (Ch9)
  10. Mining spatial data (Ch9)
Project Presentations

Ass1 Due: Sep 23 Tue
Ass2: Sep 23 - Oct 16
Ass3: Oct (14) 16 - Oct 30
Ass4: Oct 30 - Nov 13
Project Due: Dec 8
~prof6405/Doc/proj.guide
5. CLASSIFICATION AND PREDICTION (Ch7)

* What is classification? What is prediction?
* Issues regarding classification and prediction
* Classification by decision tree induction
* Bayesian classification
* Other classification methods
* Summary
Algorithm for Decision Tree Induction

* Basic algorithm (a greedy algorithm); a Python sketch follows below:
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
* Conditions for stopping partitioning:
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed to classify the leaf)
  - There are no samples left
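To make the procedure above concrete, here is a minimal Python sketch of the greedy top-down construction, assuming the training examples are dictionaries mapping attribute names to categorical values and using information gain (defined on the following slides) as the selection measure; the function names and data layout are illustrative, not taken from the slides.

```python
from collections import Counter
from math import log2

def entropy(examples, target):
    """Entropy of the class distribution of `target` over `examples`."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(examples, attr, target):
    """Expected reduction in entropy from partitioning on `attr`."""
    total = len(examples)
    expected = 0.0
    for value in set(ex[attr] for ex in examples):
        subset = [ex for ex in examples if ex[attr] == value]
        expected += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - expected

def id3(examples, attributes, target):
    """Top-down recursive divide-and-conquer tree construction (greedy)."""
    classes = [ex[target] for ex in examples]
    # Stop: all samples at this node belong to the same class.
    if len(set(classes)) == 1:
        return classes[0]
    # Stop: no remaining attributes -> majority vote at the leaf.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Select the test attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    rest = [a for a in attributes if a != best]
    for value in set(ex[best] for ex in examples):
        # Only values that occur in `examples` are expanded, so each subset
        # is non-empty; an empty branch would get a majority-class leaf.
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = id3(subset, rest, target)
    return tree
```

Calling id3(examples, ["Outlook", "Temperature", "Humidity", "Wind"], "PlayTennis") on the 14 PlayTennis examples (encoded as dicts) would reproduce the tree sorted out on the next slide.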
Induction by sorting examples

+: X3,X4,X5,X7,X9,X11,X12,X13   (Yes for PlayTennis)
-: X1,X2,X6,X8,X14              (No for PlayTennis)
Candidate attributes: {Outlook, Temperature, Humidity, Wind}

Outlook
|-- sunny     +: X9,X11         -: X1,X2,X8   [2+,3-]  -> test Humidity, remaining {Tem, Hum, Win}
|               |-- high    +: (none)      -: X1,X2,X8  [0+,3-]  -> No
|               |-- normal  +: X9,X11      -: (none)    [2+,0-]  -> Yes
|-- overcast  +: X3,X7,X12,X13  -: (none)     [4+,0-]  -> Yes
|-- rain      +: X4,X5,X10      -: X6,X14     [3+,2-]  -> test Wind, remaining {Tem, Hum, Win}
                |-- strong  +: (none)      -: X6,X14    [0+,2-]  -> No
                |-- weak    +: X4,X5,X10   -: (none)    [3+,0-]  -> Yes
Attribute selection measure: Information Gain

Which attribute is the best classifier?

* We need to measure how well a given attribute separates the training
  examples according to their target classification. We want to select the
  attribute that is most useful for classifying the current data set. What is
  a good quantitative measure of the worth of an attribute?
  - Information gain (a probability-based measure)
* ID3 uses information gain to select among the candidate attributes at each
  step while growing the tree.
Information measure

Concept of quantity of information:

The quantity of information in a message may be formally defined in terms of
the number of bits necessary to encode the factor by which the probability of
the particular outcome has been increased.

E.g., classification of a specified target:

  Q(message) = log_2 P2(outcome_after) - log_2 P1(outcome_before)

where
  Q:  the quantity of information contained in a message
  P1: the probability of the outcome before receiving the message
  P2: the probability of the outcome after receiving the message
Information measure (cont)

E.g., suppose a cow has strayed into a pasture represented as an 8 x 8 array
of "cells". The outcome is finding the cow, i.e., getting the information
about where the cow is.

Case 1: Without knowing which cell the cow is in, the probability of finding it is:
  P1(find-the-cow_before) = 1/64

Case 2: If someone sends a message telling which cell the cow is in, e.g. cell
(4,7), then we have acquired some information. How much information did we
receive about finding the cow?
  P2(find-the-cow_after) = 1

The probability has gone from 1/64 to 1, a factor-of-64 increase.
Information measure (cont)

The probability has gone from 1/64 to 1, a factor-of-64 increase.

The quantity of information in a message can be formally defined in terms of
the number of bits necessary to encode the factor by which the probability of
the particular outcome has been increased:

E.g., log_2(64) = 6 bits
Information measure (cont)

The quantity of information in the message "The cow is in cell (4,7)", about
finding the cow, is therefore:

  Information received = log_2 P2(outcome_after) - log_2 P1(outcome_before)
                       = log_2 (P2 / P1)
                       = log_2 64 = 6 bits

  Information = log_2 ( P2(outcome_after) / P1(outcome_before) )

This formula works provided that
  1) the probability before receiving the answer is greater than zero, and
  2) the probability after receiving the answer is greater than or equal to
     the probability before receiving the answer.
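As a quick check of the formula, a few lines of Python (illustrative, not from the slides) reproduce the 6-bit result for the cow example:

```python
from math import log2

def information_bits(p_before: float, p_after: float) -> float:
    """Bits of information received when the probability of the outcome
    rises from p_before to p_after (p_before > 0, p_after >= p_before)."""
    return log2(p_after / p_before)

# Cow-in-the-pasture example: 1/64 -> 1 gives log2(64) = 6 bits.
print(information_bits(1 / 64, 1))  # 6.0
```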
Information measure (cont)

E.g., classification for the concept PlayTennis with the value (outcome) "Yes":

* P1(message_before) = 9/14
  The probability of an unknown example being a "Yes" example without asking
  any question.
* P2(message_after) = ?
  Questions may be asked, such as "What is Outlook?" or "What is Temperature?",
  from the attribute list (Outlook, Temperature, Humidity, Wind).
* What is Outlook? Then the message can give:
  P2_1(Outlook = sunny)    = 2/5,
  P2_2(Outlook = overcast) = 4/4,
  P2_3(Outlook = rain)     = 3/5.

How do we quantify the information about both the "Yes" and "No" classes?
-> Entropy
Information measure: Entropy

Measuring impurity:

Entropy characterizes the impurity of an arbitrary collection of examples.

The concept of entropy used in information theory is closely related to the
concept of entropy studied in thermodynamics, which is a measure of the
disorder in a system of particles. Entropy is a measure of how much confusion,
uncertainty, or disorder there is in a situation.
Entropy (cont)

Suppose p1, p2, ..., pn is a set of probabilities such that p1 + p2 + ... + pn = 1.
E.g., 9/14 + 5/14 = 1 for PlayTennis.

The entropy for this set of probabilities is defined to be:

  Entropy(S) = - ∑_{i=1..n} p_i * log_2(p_i)

where S is the data set, p_i is the proportion of S belonging to class i of
the target, and n is the number of target classes.

E.g., for PlayTennis:
  Entropy([9+,5-]) = -(9/14)log_2(9/14) - (5/14)log_2(5/14) = 0.940

The entropy is maximized when p1 = p2 = ... = pn = 1/n.
The entropy is minimized (zero) if p_i = 1 for some i and p_j = 0 for all j != i.
(See the graph illustration.)
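A small Python helper (an illustrative sketch, not from the slides) computes the entropy of a class distribution from raw counts and reproduces the PlayTennis value along with the maximal and minimal cases:

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as per-class counts."""
    total = sum(counts)
    # By convention, a class with zero count contributes 0 (0 * log2(0) -> 0).
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([9, 5]))    # ≈ 0.940 for S = [9+,5-]
print(entropy([7, 7]))    # 1.0: maximal, uniform split
print(entropy([14, 0]))   # -0.0, i.e. 0: minimal, pure collection
```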
Entropy (cont)

E.g., +: X3,X4,X5,X7,X9,X11,X12,X13   (Yes for PlayTennis)
      -: X1,X2,X6,X8,X14              (No for PlayTennis)

Outlook
|-- sunny     +: X9,X11          -: X1,X2,X8   [2+,3-]
|-- overcast  +: X3,X7,X12,X13   -: (none)     [4+,0-]  -> Yes
|-- rain      +: X4,X5,X10       -: X6,X14     [3+,2-]

Entropy(S)          = Entropy([9+,5-]) = -(9/14)log_2(9/14) - (5/14)log_2(5/14) = 0.940
Entropy(S_sunny)    = Entropy([2+,3-]) = -(2/5)log_2(2/5) - (3/5)log_2(3/5) = 0.971
Entropy(S_overcast) = Entropy([4+,0-]) = -(4/4)log_2(4/4) - (0/4)log_2(0/4) = 0
Entropy(S_rain)     = Entropy([3+,2-]) = -(3/5)log_2(3/5) - (2/5)log_2(2/5) = 0.971

Expected entropy (measuring the overall result of a partition):

  E(A) = ∑_{j=1..v} ((s_1j + ... + s_mj) / s) * Entropy(s_j)
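The branch entropies and the expected entropy of the Outlook partition can be checked with a short script (an illustrative sketch reusing the entropy helper shown earlier):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

# Class counts [yes, no] in each Outlook branch of the 14-example set S.
branches = {"sunny": [2, 3], "overcast": [4, 0], "rain": [3, 2]}
size_S = 14

for name, counts in branches.items():
    print(name, entropy(counts))   # sunny ≈ 0.971, overcast 0, rain ≈ 0.971

# Expected entropy E(Outlook) = sum_j (|S_j| / |S|) * Entropy(S_j)
expected = sum(sum(c) / size_S * entropy(c) for c in branches.values())
print(expected)                    # ≈ 0.694
```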
Information Gain

Information gain measures the expected reduction in entropy:

  Gain(S, A) = Entropy(S) - ∑_{j=1..k} (|S_j| / |S|) * Entropy(S_j)

where k is the number of different values of the attribute A currently under
consideration, and j ranges from 1 to k.

Gain(S, A) is the information about the purity change of the examples with
respect to the target classification when partitioning on attribute A.

A good partition produces a small expected entropy, and in turn a bigger gain.
Information Gain (cont)

E.g., the set S has 14 examples, i.e. |S| = 14 = [9+,5-], for the target
PlayTennis = {yes, no}. So we have Entropy(S) = 0.940. When we consider the
attribute Outlook, the information about the set S on the target PlayTennis
receives a gain:

The original set: S = [9+,5-], |S| = 14
Outlook = {sunny, overcast, rain} partitions the set into subsets:
  S_sunny    = [2+,3-], |S_sunny|    = 5
  S_overcast = [4+,0-], |S_overcast| = 4
  S_rain     = [3+,2-], |S_rain|     = 5

Gain(S, Outlook) = Entropy(S) - (5/14)Entropy(S_sunny) - (4/14)Entropy(S_overcast)
                   - (5/14)Entropy(S_rain) = 0.246

where
  Entropy(S)          = Entropy([9+,5-]) = -(9/14)log_2(9/14) - (5/14)log_2(5/14) = 0.940
  Entropy(S_sunny)    = Entropy([2+,3-]) = -(2/5)log_2(2/5) - (3/5)log_2(3/5) = 0.971
  Entropy(S_overcast) = Entropy([4+,0-]) = -(4/4)log_2(4/4) - (0/4)log_2(0/4) = 0
  Entropy(S_rain)     = Entropy([3+,2-]) = -(3/5)log_2(3/5) - (2/5)log_2(2/5) = 0.971
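The same numbers can be plugged into a small gain function (an illustrative sketch; only the class counts from the slide are used):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given as per-class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def gain(parent_counts, branch_counts):
    """Gain(S, A) = Entropy(S) - sum_j (|S_j| / |S|) * Entropy(S_j)."""
    total = sum(parent_counts)
    expected = sum(sum(b) / total * entropy(b) for b in branch_counts)
    return entropy(parent_counts) - expected

# S = [9+,5-]; Outlook splits S into [2+,3-], [4+,0-], [3+,2-].
print(gain([9, 5], [[2, 3], [4, 0], [3, 2]]))  # ≈ 0.247 (the slide's 0.246
                                               # rounds the entropies first)
```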
Summary of Attribute Selection Measure:
Information Gain (ID3/C4.5)

* Select the attribute with the highest information gain
* S contains s_i tuples of class C_i for i = 1, ..., m
* Entropy measures the information required to classify an arbitrary tuple:

  E(s_1, s_2, ..., s_m) = - ∑_{i=1..m} (s_i / s) * log_2(s_i / s)

* Expected entropy of attribute A with values {a_1, a_2, ..., a_v}:

  EE(A) = ∑_{j=1..v} ((s_1j + ... + s_mj) / s) * E(s_1j, ..., s_mj)

* Information gained by branching on attribute A:

  Gain(A) = E(s_1, s_2, ..., s_m) - EE(A)
Example of attribute selection

E.g., the set S has 14 examples, i.e. |S| = 14 = [9+,5-], for the target
PlayTennis = {yes, no}. So we have Entropy(S) = 0.940.

When we consider the attribute Outlook, the information about the set S on the
target PlayTennis receives a gain:

Outlook = {sunny, overcast, rain}
  S_sunny = [2+,3-], |S_sunny| = 5;  S_overcast = [4+,0-], |S_overcast| = 4;  S_rain = [3+,2-], |S_rain| = 5

Gain(S, Outlook) = E(S) - (5/14)E(S_sunny) - (4/14)E(S_overcast) - (5/14)E(S_rain)
                 = 0.246

where
  Entropy(S)          = Entropy([9+,5-]) = -(9/14)log_2(9/14) - (5/14)log_2(5/14) = 0.940
  Entropy(S_sunny)    = Entropy([2+,3-]) = -(2/5)log_2(2/5) - (3/5)log_2(3/5) = 0.971
  Entropy(S_overcast) = Entropy([4+,0-]) = -(4/4)log_2(4/4) - (0/4)log_2(0/4) = 0
  Entropy(S_rain)     = Entropy([3+,2-]) = -(3/5)log_2(3/5) - (2/5)log_2(2/5) = 0.971
Example of attribute selection (cont)

The information gain calculations for all four attributes:

  Gain(S, Outlook)     = 0.246
  Gain(S, Humidity)    = 0.151
  Gain(S, Wind)        = 0.048
  Gain(S, Temperature) = 0.029

The attribute Outlook leads to the best partition of S in terms of the target
PlayTennis: its subsets are overall more pure (more homogeneous) than those of
the other partitions, so Outlook is the best classifier.
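These four gains can also be reproduced end to end from the raw data. The sketch below assumes the standard 14-example PlayTennis table, which the slides' X1..X14 appear to follow; the attribute values in the table are an assumption, since the slides only show the class labels.

```python
from math import log2

# Assumed PlayTennis table: (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("sunny", "hot", "high", "weak", "no"),           # X1
    ("sunny", "hot", "high", "strong", "no"),         # X2
    ("overcast", "hot", "high", "weak", "yes"),       # X3
    ("rain", "mild", "high", "weak", "yes"),          # X4
    ("rain", "cool", "normal", "weak", "yes"),        # X5
    ("rain", "cool", "normal", "strong", "no"),       # X6
    ("overcast", "cool", "normal", "strong", "yes"),  # X7
    ("sunny", "mild", "high", "weak", "no"),          # X8
    ("sunny", "cool", "normal", "weak", "yes"),       # X9
    ("rain", "mild", "normal", "weak", "yes"),        # X10
    ("sunny", "mild", "normal", "strong", "yes"),     # X11
    ("overcast", "mild", "high", "strong", "yes"),    # X12
    ("overcast", "hot", "normal", "weak", "yes"),     # X13
    ("rain", "mild", "high", "strong", "no"),         # X14
]
ATTRS = {"Outlook": 0, "Temperature": 1, "Humidity": 2, "Wind": 3}

def entropy(rows):
    """Entropy of the class labels (last field) of the given rows."""
    total = len(rows)
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain(rows, attr_index):
    """Gain(S, A) for the attribute stored at `attr_index`."""
    total = len(rows)
    expected = 0.0
    for value in set(r[attr_index] for r in rows):
        subset = [r for r in rows if r[attr_index] == value]
        expected += len(subset) / total * entropy(subset)
    return entropy(rows) - expected

for name, idx in ATTRS.items():
    print(name, round(gain(DATA, idx), 3))
# Outlook ≈ 0.247, Humidity ≈ 0.152, Wind ≈ 0.048, Temperature ≈ 0.029
# (the slide's 0.246 and 0.151 come from rounding the entropies first)
```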
Example of attribute selection (cont)

* Illustration: the partially learned decision tree resulting from the first
  step of ID3:

S: {X1,X2,X3,X4,X5,X6,X7,X8,X9,X10,X11,X12,X13,X14}   [9+,5-]

Outlook
|-- sunny:    {X1,X2,X8,X9,X11}   [2+,3-]  -> ?   Which attribute should be tested here?
|-- overcast: {X3,X7,X12,X13}     [4+,0-]  -> Yes
|-- rain:     {X4,X5,X6,X10,X14}  [3+,2-]  -> ?   Which attribute should be tested here?
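To answer the question at the sunny branch, the same gain computation can be run on the five sunny examples. The attribute values below are an assumption taken from the standard PlayTennis table, since the slides only show the class labels for this subset.

```python
from math import log2

# Sunny subset {X1, X2, X8, X9, X11}: (Temperature, Humidity, Wind, PlayTennis),
# values assumed from the standard PlayTennis table.
SUNNY = [
    ("hot", "high", "weak", "no"),        # X1
    ("hot", "high", "strong", "no"),      # X2
    ("mild", "high", "weak", "no"),       # X8
    ("cool", "normal", "weak", "yes"),    # X9
    ("mild", "normal", "strong", "yes"),  # X11
]
ATTRS = {"Temperature": 0, "Humidity": 1, "Wind": 2}

def entropy(rows):
    """Entropy of the class labels (last field) of the given rows."""
    total = len(rows)
    counts = {}
    for r in rows:
        counts[r[-1]] = counts.get(r[-1], 0) + 1
    return -sum(c / total * log2(c / total) for c in counts.values())

def gain(rows, idx):
    """Gain of splitting `rows` on the attribute stored at `idx`."""
    total = len(rows)
    expected = 0.0
    for value in set(r[idx] for r in rows):
        subset = [r for r in rows if r[idx] == value]
        expected += len(subset) / total * entropy(subset)
    return entropy(rows) - expected

for name, idx in ATTRS.items():
    print(name, round(gain(SUNNY, idx), 3))
# Humidity ≈ 0.971, Temperature ≈ 0.571, Wind ≈ 0.02:
# Humidity has the highest gain, so it is tested at the sunny node,
# matching the sorted tree shown earlier in these slides.
```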