National Yunlin University of Science and Technology
Unsupervised Learning with Mixed Numeric and Nominal Data
Advisor: Dr. Hsu
Graduate: Yu-Cheng Chen
Authors: Cen Li, Gautam Biswas
IEEE Transactions on Knowledge and Data Engineering, 2002
Outline

- Motivation
- Objective
- Introduction
- Background
- SBAC
- Experimental results
- Conclusions
- Personal Opinion
Motivation

- Traditional clustering algorithms assume features are either numeric or categorical valued.
- The majority of useful data, however, is described by a mixture of numeric and nominal valued features.
Objective

- Develop unsupervised learning techniques that exhibit good performance on mixed data.
Introduction

- Traditional approaches used to handle mixed data include the following (a sketch of the first two follows below):
  - Binary encoding.
  - Discretizing numeric attributes.
  - Generalizing criterion functions to handle mixed data.
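As an illustration of the first two workarounds (not part of the paper; the record, categories, and bin count are made up), a minimal Python sketch:

```python
# Illustrative only: one-hot (binary) encoding of a nominal feature and
# equal-width discretization of a numeric feature.

def one_hot(value, categories):
    """Encode a nominal value as a binary vector over the known categories."""
    return [1 if value == c else 0 for c in categories]

def discretize(value, low, high, bins):
    """Map a numeric value in [low, high] to an equal-width bin index."""
    width = (high - low) / bins
    index = int((value - low) / width)
    return min(max(index, 0), bins - 1)  # clamp to a valid bin

# Hypothetical record: nominal value 'b', numeric value 7.5 in [0, 10].
print(one_hot('b', ['a', 'b', 'c']))  # [0, 1, 0]
print(discretize(7.5, 0.0, 10.0, 4))  # 3
```

Both transformations lose information (the binary codes hide value frequencies, the bins hide magnitudes), which motivates the third approach and the SBAC measure introduced later.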
Background

- COBWEB/3
  - Uses the CU measure for categorical attributes, built from the term Σ_i Σ_j P(A_i = V_ij)².
  - For numeric attributes, this term has to be redefined (next slide).
Background (cont.)

- COBWEB/3
  - The CU measure for numeric attributes is defined as sketched below.
  - The overall CU is defined as sketched below.
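As a reference, a hedged reconstruction of these definitions following the usual COBWEB/3 / CLASSIT formulation (an assumption, not copied from the slide; σ_ik is the standard deviation of attribute A_i within class C_k, σ_ip is its standard deviation in the parent node, and K is the number of classes):

```latex
% Assumed COBWEB/3 / CLASSIT form (sketch; not taken verbatim from the slides)
\[
  CU_k \;=\; \frac{P(C_k)}{2\sqrt{\pi}} \sum_{i} \left( \frac{1}{\sigma_{ik}} - \frac{1}{\sigma_{ip}} \right),
  \qquad
  CU \;=\; \frac{1}{K} \sum_{k=1}^{K} CU_k .
\]
```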
Background (cont.)

- COBWEB/3
  - Limitations:
    - The normal distribution assumption for numeric data.
    - The accuracy of the estimate is suspect when the sample size is small.
    - When all objects in Ck share a single value for an attribute, σik = 0 and 1/σik → ∞, so 1/σik is set to 1 whenever σik < 1 (a minimal sketch of this guard follows below).
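A minimal sketch of that guard, assuming it is applied directly to the 1/σ term used in the numeric CU:

```python
def inverse_sigma(sigma):
    """1/sigma term used in the numeric CU, with the COBWEB/3 guard from the
    slide: when sigma < 1 (including sigma == 0, i.e. all objects in the class
    share the value), return 1 instead of letting 1/sigma blow up."""
    return 1.0 if sigma < 1.0 else 1.0 / sigma

print(inverse_sigma(0.0))  # 1.0
print(inverse_sigma(0.5))  # 1.0
print(inverse_sigma(2.0))  # 0.5
```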
Background (cont.)

- ECOBWEB tries to remedy two disadvantages of COBWEB/3:
  - The normal distribution assumption.
  - The case where σik = 0.
Background (cont.)

- ECOBWEB
  - Limitations:
    - The choice of the parameters has a significant effect on the CU computation.
Background (cont.)

- AUTOCLASS
  - Uses a Bayesian method for clustering.
  - Derives the most probable class distribution for the data given prior information.
  - Limitations:
    - Computational complexity is too high.
    - Overfitting problem.
SBAC System

- SBAC
  - Uses a similarity measure defined by Goodall.
  - Adopts a hierarchical agglomerative approach to build partition structures.
  - The similarity is decided by the uncommonality of feature value matches.
    - Example: X1 = {a, b}, X2 = {a, b}, X3 = {c, d}, X4 = {c, d}
    - If (P(a) = P(b)) >= (P(c) = P(d)), the similarity of X3 and X4 should be greater than that of X1 and X2.
SBAC System

- Summary
  - For numeric feature values, the similarity takes into account:
    - The feature value difference.
    - The uniqueness of the feature value pair.
  - Example: x1 = {1, 5}, x2 = {6, 10}, with Σ_{t=1..5} P_t = 0.8 and Σ_{v=6..10} P_v = 0.2.
SBAC System

- Computing Similarity for Numeric Attributes
  - We define the More Similar Feature Segment Set (MSFSS): the set of all pairs of values for feature k that are equally or more similar to the pair ((Vi)k, (Vj)k).
  - Example:
    MSFSS(7.5, 9) = {(5.5, 5.5), (6, 6), (7.5, 7.5), (9, 9), (10.5, 10.5), (5.5, 6), (6, 7.5), (7.5, 9), (9, 10.5)}
    MSFSS(9, 10.5) = {(5.5, 5.5), (6, 6), (7.5, 7.5), (9, 9), (10.5, 10.5), (5.5, 6), (9, 10.5)}
SBAC System

- The dissimilarity of the pair, (Dij)k, is defined as the summation of the probabilities of picking a pair of values (Vl)k, (Vm)k ∈ MSFSS((Vi)k, (Vj)k) (a rough sketch follows below).
- The similarity of the pair ((Vi)k, (Vj)k) is defined as (Sij)k = 1 - (Dij)k.
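A rough Python sketch of that summation step, under two assumptions not spelled out on this slide: the probability of a value pair is that of drawing the two values from the n observations without replacement (the same form used in the nominal worked example later), and the MSFSS is supplied as input, since its construction also weighs segment uncommonality:

```python
from collections import Counter

def numeric_dissimilarity(msfss, column):
    """(D_ij)_k: sum of the probabilities of drawing each value pair in
    MSFSS((V_i)_k, (V_j)_k) from the observed feature column.  Sketch only;
    assumes two draws without replacement, mirroring the nominal example."""
    freq = Counter(column)
    n = len(column)
    total = 0.0
    for v, w in msfss:
        if v == w:
            total += freq[v] * (freq[v] - 1) / (n * (n - 1))
        else:
            total += 2 * freq[v] * freq[w] / (n * (n - 1))
    return total

# Hypothetical feature column and the MSFSS(9, 10.5) set from the example above.
column = [5.5, 5.5, 6, 6, 7.5, 7.5, 9, 9, 10.5, 10.5]
msfss = [(5.5, 5.5), (6, 6), (7.5, 7.5), (9, 9), (10.5, 10.5), (5.5, 6), (9, 10.5)]
d = numeric_dissimilarity(msfss, column)
print(round(d, 3), round(1 - d, 3))  # dissimilarity and similarity S = 1 - D
```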
SBAC System

- For nominal feature values, the similarity is computed analogously:
  - We define the More Similar Feature Value Set (MSFVS): the set of all pairs of values for feature k that are equally or more similar to the pair ((Vi)k, (Vi)k).
  - Example: f(a) = 3, f(b) = 3, f(c) = 4
    MSFVS(c, c) = {(a, a), (b, b), (c, c)}
    MSFVS(b, b) = {(a, a), (b, b)}
SBAC System

- The probability of picking a pair (Vl)k, (Vl)k ∈ MSFVS((Vi)k) is the probability of drawing two objects with that value when sampling without replacement (see the worked example on the next slide).
- The dissimilarity of the pair, (Dii)k, is defined as the summation of these probabilities.
SBAC System

Worked example with f(a) = 3, f(b) = 3, f(c) = 4 and n = 10:
MSFVS(c, c) = {(a, a), (b, b), (c, c)}
MSFVS(b, b) = {(a, a), (b, b)}
D(c, c) = pa² + pb² + pc² = 3(3-1)/(10(10-1)) + 3(3-1)/(10(10-1)) + 4(4-1)/(10(10-1)) = 0.267
S(c, c) = 1 - D(c, c) = 0.733
D(a, a) = pa² + pb² = 3(3-1)/(10(10-1)) + 3(3-1)/(10(10-1)) = 0.133
S(a, a) = 1 - D(a, a) = 0.867
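The same computation as a short Python sketch (a reading of the worked example above, not the authors' code); it reproduces S(c, c) = 0.733 and S(a, a) = 0.867:

```python
from collections import Counter

def nominal_similarity(value, column):
    """Goodall-style similarity S(v, v) for a nominal value, following the
    worked example: MSFVS(v, v) holds every value whose frequency is <= f(v),
    D(v, v) sums the pair probabilities f(u)(f(u) - 1) / (n(n - 1)) over that
    set, and S(v, v) = 1 - D(v, v).  A sketch, not the authors' implementation."""
    freq = Counter(column)
    n = len(column)
    msfvs = [u for u, f in freq.items() if f <= freq[value]]
    d = sum(freq[u] * (freq[u] - 1) / (n * (n - 1)) for u in msfvs)
    return 1.0 - d

column = ['a'] * 3 + ['b'] * 3 + ['c'] * 4        # f(a)=3, f(b)=3, f(c)=4, n=10
print(round(nominal_similarity('c', column), 3))  # 0.733
print(round(nominal_similarity('a', column), 3))  # 0.867
```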
SBAC System

- Aggregating Similarity from Multiple Features
  - Assuming the per-feature results are expressed through Fisher's χ² statistic: χ² = -2 Σ_i ln(P_i)
  - For numeric features:
  - For nominal features:
SBAC System

- Combining the two types of features:
  P_ij = exp(-χ²_ij / 2) · Σ_{k=0}^{t_d + t_c - 1} (χ²_ij / 2)^k / k!
  where t_d and t_c denote the numbers of nominal (discrete) and numeric (continuous) features.
- Example data objects: 6 = {c, 9}, 9 = {a, 7.5}, 5 = {c, 10.5}, 8 = {c, 9}
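A small Python sketch of this aggregation as written on the last two slides: the per-feature values P_i are combined into Fisher's χ² and then into the closed-form sum, which is the tail probability of a χ² variable with 2(t_d + t_c) degrees of freedom. The input values below are hypothetical:

```python
import math

def combine_features(per_feature_p):
    """Fisher's aggregation as given on the slides:
    chi2 = -2 * sum(ln P_i), then
    P = exp(-chi2 / 2) * sum_{k=0}^{m-1} (chi2 / 2)**k / k!,
    where m = t_d + t_c is the total number of features.  Sketch only; the
    per-feature P_i are assumed to come from the similarity definitions above."""
    m = len(per_feature_p)
    chi2 = -2.0 * sum(math.log(p) for p in per_feature_p)
    half = chi2 / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(m))

# Hypothetical per-feature values for one nominal and one numeric feature.
print(round(combine_features([0.733, 0.711]), 3))
```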
SBAC System

- The agglomerative clustering algorithm (a sketch follows the threshold description below).
SBAC System

- The predefined threshold t:
  - We set t = 0.3 × D(root); with D(root) = 0.876, this gives t = 0.263.
  - If the drop in dissimilarity is larger than t, then stop.
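A compact sketch of the agglomerative step together with that stopping rule, read bottom-up as "stop merging once the merge dissimilarity jumps by more than t". The average-linkage choice and the toy dissimilarity values are assumptions for illustration, not taken from the paper:

```python
import itertools

def agglomerate(dissim, t):
    """Hierarchical agglomerative clustering over a precomputed pairwise
    dissimilarity dict {(i, j): d} with i < j.  Average linkage is an
    assumption made for this sketch.  Merging stops when the dissimilarity
    of the next merge exceeds the previous one by more than the threshold t."""
    clusters = {i: [i] for i in {x for pair in dissim for x in pair}}

    def linkage(a, b):
        pairs = [(min(i, j), max(i, j)) for i in clusters[a] for j in clusters[b]]
        return sum(dissim[p] for p in pairs) / len(pairs)

    last = 0.0
    while len(clusters) > 1:
        (a, b), d = min(
            ((pair, linkage(*pair)) for pair in itertools.combinations(clusters, 2)),
            key=lambda item: item[1])
        if d - last > t:
            break                      # the jump exceeds the threshold: cut here
        clusters[a] = clusters[a] + clusters.pop(b)
        last = d
    return list(clusters.values())

# Tiny illustrative dissimilarity matrix over four objects (values made up).
D = {(0, 1): 0.1, (0, 2): 0.8, (0, 3): 0.9,
     (1, 2): 0.85, (1, 3): 0.95, (2, 3): 0.15}
print(agglomerate(D, t=0.3))           # [[0, 1], [2, 3]] for this toy matrix
```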
Experimental results

- Artificial data:
  - 180 data points in three classes: G1, G2, G3.
  - Two nominal and two numeric attributes.
  - Each class has 60 data points.
Experimental results (cont.)

[Figures: clustering results of COBWEB, SBAC, AUTOCLASS, and ECOBWEB on the artificial data]
Experimental results (cont.)

- Real data:
  - Hand Written Character (8OX) Data
    - Numeric features
    - 45 objects
  - Mushroom Data
    - Nominal features
    - 200 objects (100 of them poisonous)
  - Heart disease Data
    - Mixed features
    - 303 patients
Experimental results (cont.)

- Results: [tables/figures for the three real data sets]
Conclusions

- This paper proposed a new similarity measure that assigns greater weight to feature value matches that are uncommon in the population.
- The approach performs better in clustering than the other approaches do.
Personal Opinion

- The time complexity of this approach is too high.
- The processes of computing similarity and clustering are too cumbersome.