Data Mining
Spring 2007
• Noisy data
• Data Discretization using Entropy-based and ChiMerge methods
Noisy Data
• Noise: random error; the data is present but not correct.
– Data Transmission error
– Data Entry problem
• Removing noise
– Data Smoothing (rounding, averaging within a window).
– Clustering/merging and Detecting outliers.
• Data Smoothing
– First sort the data and partition it into (equi-depth) bins.
– Then smooth the values in each bin using bin means, bin medians, bin boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
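As a rough illustration, the binning example above can be reproduced in a few lines of plain Python (a minimal sketch; the function names are illustrative, and a value equally close to both bin boundaries is sent to the lower one):

```python
# Minimal sketch of equi-depth binning and smoothing for the price example above.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

def equi_depth_bins(values, depth):
    """Sort the values and split them into bins of `depth` elements each."""
    values = sorted(values)
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin by the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equi_depth_bins(prices, depth=4)
print(bins)                       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins)) # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```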
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar
values are organized into groups or “clusters”.
• Values that fall outside of the set of clusters may be considered outliers.
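One simple way to operationalize this is to cluster the values and flag members of very small clusters. A minimal sketch, assuming scikit-learn is available; the number of clusters, the sample data, and the "small cluster" cutoff are illustrative choices, not part of the original slides:

```python
# Sketch: cluster the values, then treat members of very small clusters as outliers.
import numpy as np
from sklearn.cluster import KMeans

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120], dtype=float)
X = values.reshape(-1, 1)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
sizes = np.bincount(labels)          # how many values fall in each cluster
min_size = 2                         # clusters smaller than this are treated as outliers
outliers = values[sizes[labels] < min_size]
print(outliers)                      # expected: the extreme value 120 sits alone in its cluster
```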
Data Discretization
• The task of attribute (feature)-discretization techniques is
to discretize the values of continuous features into a small
number of intervals, where each interval is mapped to a
discrete symbol.
• Advantages:
– Simplified data description; easy-to-understand data and final data-mining results.
– Only a small number of interesting rules is mined.
– End-result processing time decreases.
– End-result accuracy improves.
Effect of Continuous Data on Results Accuracy
Training data:

age    income   age   buys_computer
<=30   medium    9    no
<=30   medium   10    no
<=30   medium   11    no
<=30   medium   12    no
Data mining step: discover only those rules with support (frequency) >= 1.
• If ‘age <= 30’ and income = ‘medium’ and age = ‘9’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘10’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘11’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age = ‘12’ then buys_computer = ‘no’
Because one of the test values (age = 13) never occurs in the training data, no rule covers it; the prediction accuracy drops to 66.7% (2 of the 3 test records are classified correctly).
Test data (class label to be predicted):

age    income   age   buys_computer
<=30   medium    9    ?
<=30   medium   11    ?
<=30   medium   13    ?
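If the raw ages were discretized into the same interval that the other rules use, the unseen value 13 would fall into an existing bin and could be covered by a single rule. A minimal sketch with pandas (the bin edges and labels are illustrative):

```python
# Sketch: map raw ages onto coarse intervals so unseen values share a bin with training data.
import pandas as pd

train_ages = pd.Series([9, 10, 11, 12])
test_ages = pd.Series([9, 11, 13])

bins = [0, 30, 40, 120]                # (0, 30], (30, 40], (40, 120]
labels = ['<=30', '31..40', '>40']

print(pd.cut(train_ages, bins=bins, labels=labels).tolist())  # ['<=30', '<=30', '<=30', '<=30']
print(pd.cut(test_ages, bins=bins, labels=labels).tolist())   # ['<=30', '<=30', '<=30']
```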
Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2),
  where Ent(S1) = - Σ_i p_i · log2(p_i).
• Here p_i is the probability of class i in S1, determined by dividing the number of samples of class i in S1 by the total number of samples in S1 (Ent(S2) is defined analogously).
Entropy-Based Discretization
• The boundary that minimizes the entropy function over all
possible boundaries is selected as a binary discretization.
• The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g., until the information gain Ent(S) - E(S, T) drops below a small threshold δ.
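A minimal sketch of this selection step in plain Python (illustrative function names; candidate boundaries are taken as the midpoints between adjacent distinct values, as in the example that follows):

```python
# Sketch: choose the binary split boundary T that minimizes E(S, T).
from collections import Counter
from math import log2

def entropy(labels):
    """Ent(S) = -sum_i p_i * log2(p_i) over the class distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_boundary(values, labels):
    """Return (T, E(S, T)) for the candidate boundary with the smallest post-split entropy."""
    pairs = sorted(zip(values, labels))
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    best = None
    for t in candidates:
        left = [lab for v, lab in pairs if v <= t]
        right = [lab for v, lab in pairs if v > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or e < best[1]:
            best = (t, e)
    return best
```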
Example 1
ID   Age   Grade
1    21    F
2    22    F
3    24    P
4    25    F
5    27    P
6    27    P
7    27    P
8    35    P
9    41    P
• Let Grade be the class attribute. Use entropy-based discretization to divide the range of ages into discrete intervals.
• There are 6 possible boundaries, each the midpoint of two adjacent distinct ages, e.g. (21+22)/2 = 21.5 and (22+24)/2 = 23. They are 21.5, 23, 24.5, 26, 31, and 38.
• Let us consider the boundary at T = 21.5:
  Let S1 = {21}
  Let S2 = {22, 24, 25, 27, 27, 27, 35, 41}
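The six candidate boundaries above can be checked mechanically (plain Python):

```python
# Candidate boundaries = midpoints between adjacent distinct ages.
ages = sorted(set([21, 22, 24, 25, 27, 27, 27, 35, 41]))
print([(a + b) / 2 for a, b in zip(ages, ages[1:])])   # [21.5, 23.0, 24.5, 26.0, 31.0, 38.0]
```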
Example 1 (cont’)
ID   Age   Grade
1    21    F
2    22    F
3    24    P
4    25    F
5    27    P
6    27    P
7    27    P
8    35    P
9    41    P
• The numbers of elements in S1 and S2 are:
  |S1| = 1
  |S2| = 8
• The entropy of S1 is
  Ent(S1) = - P(Grade=F) · log2 P(Grade=F) - P(Grade=P) · log2 P(Grade=P)
          = - (1/1) · log2(1/1) - (0/1) · log2(0/1)
          = 0        (using the convention 0 · log2 0 = 0)
• The entropy of S2 is
  Ent(S2) = - P(Grade=F) · log2 P(Grade=F) - P(Grade=P) · log2 P(Grade=P)
          = - (2/8) · log2(2/8) - (6/8) · log2(6/8)
          ≈ 0.811
Example 1 (cont’)
• Hence, the entropy after partitioning at T = 21.5 is
  E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
          = (1/9) · Ent(S1) + (8/9) · Ent(S2)
          = (1/9) · 0 + (8/9) · 0.811
          ≈ 0.721
Example 1 (cont’)
• The entropy after partitioning is computed for every boundary:
  E(S, 21.5), E(S, 23), E(S, 24.5), ..., E(S, 38)
• Select the boundary with the smallest entropy. Suppose the best boundary is T = 23.
• Now recursively apply entropy-based discretization to both partitions.
ID   Age   Grade
1    21    F
2    22    F
3    24    P
4    25    F
5    27    P
6    27    P
7    27    P
8    35    P
9    41    P
ChiMerge (Kerber92)
• This discretization method uses a merging approach.
• ChiMerge's view:
– First sort the data on the attribute that is being discretized.
– List all possible boundaries/intervals. In the last example the interval boundaries were 0, 21.5, 23, 24.5, 26, 31, and 38.
– For every pair of adjacent intervals, compute a χ² test of class independence:
  • {0, 21.5} and {21.5, 23}
  • {21.5, 23} and {23, 24.5}
  • ...
– Pick the pair of adjacent intervals with the lowest χ² value (i.e., the most similar class distributions) and merge them.
ChiMerge -- The Algorithm
1. Compute the χ² value for each pair of adjacent intervals.
2. Merge the pair of adjacent intervals with the lowest χ² value.
3. Repeat steps 1 and 2 until the χ² values of all adjacent pairs exceed a threshold.
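A minimal sketch of this loop in plain Python (illustrative function names; when an expected frequency is zero, a small constant is substituted so the division is always defined, a common practical convention):

```python
# Sketch of the ChiMerge loop: keep merging the adjacent pair of intervals with the
# lowest chi-square value until every adjacent pair exceeds the threshold.

def chi2_pair(counts_a, counts_b):
    """Chi-square statistic for the 2 x k table formed by two adjacent intervals."""
    classes = set(counts_a) | set(counts_b)
    r1, r2 = sum(counts_a.values()), sum(counts_b.values())
    n = r1 + r2
    stat = 0.0
    for cls in classes:
        col = counts_a.get(cls, 0) + counts_b.get(cls, 0)
        for row_total, observed in ((r1, counts_a.get(cls, 0)), (r2, counts_b.get(cls, 0))):
            expected = (row_total * col / n) or 0.1     # small constant if expected == 0
            stat += (observed - expected) ** 2 / expected
    return stat

def chimerge(values, labels, threshold=2.706):
    """Return the lower bound of each interval that survives merging."""
    data = sorted(zip(values, labels))
    intervals = []                                      # each item: [lower_bound, class_counts]
    for v, lab in data:
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][lab] = intervals[-1][1].get(lab, 0) + 1
        else:
            intervals.append([v, {lab: 1}])
    while len(intervals) > 1:
        chis = [chi2_pair(intervals[i][1], intervals[i + 1][1])
                for i in range(len(intervals) - 1)]
        best = chis.index(min(chis))
        if chis[best] > threshold:                      # every pair differs significantly: stop
            break
        for lab, cnt in intervals[best + 1][1].items(): # merge interval best+1 into best
            intervals[best][1][lab] = intervals[best][1].get(lab, 0) + cnt
        del intervals[best + 1]
    return [iv[0] for iv in intervals]
```

The default threshold 2.706 is the χ² value for one degree of freedom at the 90% significance level, the cutoff used in the worked example below.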
Chi-Square Test
            Class 1   Class 2   Σ
Int 1       o11       o12       R1
Int 2       o21       o22       R2
Σ           C1        C2        N

χ² = Σ_i Σ_j (o_ij - e_ij)² / e_ij

o_ij = observed frequency of interval i for class j
e_ij = expected frequency = (R_i · C_j) / N
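For a 2 x 2 table this statistic is a few lines of NumPy (a minimal sketch; numpy is assumed to be available):

```python
# Chi-square statistic for a contingency table of observed frequencies o_ij.
import numpy as np

def chi2_statistic(observed):
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)   # R_i
    col_totals = observed.sum(axis=0, keepdims=True)   # C_j
    n = observed.sum()                                 # N
    expected = row_totals * col_totals / n             # e_ij = (R_i * C_j) / N
    return ((observed - expected) ** 2 / expected).sum()

# Second merge step of the example below: intervals [0, 7.5] and [7.5, 10]
print(round(chi2_statistic([[2, 1], [2, 0]]), 3))      # 0.833 (the slide rounds termwise to 0.834)
```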
ChiMerge Example
Data set (feature F, class K):

Sample:  1   2   3   4   5   6   7   8   9   10  11  12
F:       1   3   7   8   9   11  23  37  39  45  46  59
K:       1   2   1   1   1   2   2   1   2   1   1   1
• Interval points for feature F are: 0, 2, 5, 7.5, 8.5, 10, etc.
ChiMerge Example (cont.)
χ² is minimal for the adjacent intervals [7.5, 8.5] and [8.5, 10]:

                      K=1     K=2     Σ
Interval [7.5, 8.5]   A11=1   A12=0   R1=1
Interval [8.5, 10]    A21=1   A22=0   R2=1
Σ                     C1=2    C2=0    N=2

χ² = Σ_i Σ_j (A_ij - E_ij)² / E_ij
A_ij = observed frequency of interval i for class j
E_ij = expected frequency = (R_i · C_j) / N

Based on the table's values, we can calculate the expected values:
E11 = 2/2 = 1
E12 = 0/2 ≈ 0.1 (a small value is used in place of 0 to avoid division by zero)
E21 = 2/2 = 1
E22 = 0/2 ≈ 0.1

and the corresponding χ² test:
χ² = (1 - 1)²/1 + (0 - 0.1)²/0.1 + (1 - 1)²/1 + (0 - 0.1)²/0.1 = 0.2

For one degree of freedom (d = 1), χ² = 0.2 < 2.706 (MERGE!)
ChiMerge Example (cont.)
Additional iterations:

                     K=1     K=2     Σ
Interval [0, 7.5]    A11=2   A12=1   R1=3
Interval [7.5, 10]   A21=2   A22=0   R2=2
Σ                    C1=4    C2=1    N=5

E11 = 12/5 = 2.4
E12 = 3/5 = 0.6
E21 = 8/5 = 1.6
E22 = 2/5 = 0.4

χ² = (2 - 2.4)²/2.4 + (1 - 0.6)²/0.6 + (2 - 1.6)²/1.6 + (0 - 0.4)²/0.4
χ² = 0.834

For one degree of freedom (d = 1), χ² = 0.834 < 2.706 (MERGE!)
ChiMerge Example (cont.)
                       K=1     K=2     Σ
Interval [0, 10.0]     A11=4   A12=1   R1=5
Interval [10.0, 42.0]  A21=1   A22=3   R2=4
Σ                      C1=5    C2=4    N=9

E11 = 2.78
E12 = 2.22
E21 = 2.22
E22 = 1.78

χ² = 2.72 > 2.706 (NO MERGE!)
Final discretization: [0, 10], [10, 42], and [42, 60]
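The stopping test on this last pair can be checked numerically with the same formula (plain Python):

```python
# Verify the final ChiMerge test: intervals [0, 10] and [10, 42], classes K=1 and K=2.
observed = [[4, 1],    # interval [0, 10]:  4 samples of K=1, 1 of K=2
            [1, 3]]    # interval [10, 42]: 1 sample of K=1,  3 of K=2
row = [sum(r) for r in observed]                      # R_i
col = [sum(c) for c in zip(*observed)]                # C_j
n = sum(row)                                          # N
chi2 = sum((observed[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))
print(round(chi2, 2), round(chi2, 2) > 2.706)         # 2.72 True  -> no merge
```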
References
– Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, August 2000 (Chapter 3).
– Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, 2003 (Chapter 3).