Relevant characteristics extraction from semantically unstructured data

PhD title: Data mining in unstructured data
Daniel I. MORARIU, MSc
PhD Supervisor: Lucian N. VINŢAN
Sibiu, 2006
Contents

- Prerequisites
- Correlation of the SVM kernel's parameters
  - Polynomial kernel
  - Gaussian kernel
- Feature selection using Genetic Algorithms
  - Chromosome encoding
  - Genetic operators
- Meta-classifier with SVM
  - Non-adaptive method – Majority Vote
  - Adaptive methods
    - Selection based on Euclidean distance
    - Selection based on cosine
- Initial data set scalability
- Choosing training and testing data sets
- Conclusions and further work
Prerequisites

- Reuters Database Processing
  - 806791 total documents, 126 topics, 366 regions, 870 industry codes
  - Industry category selection – "system software"
    - 7083 documents (4722 training / 2361 testing)
    - 19038 attributes (features)
    - 24 classes (topics)
- Data representation (see the sketch below)
  - Binary
  - Nominal
  - Cornell SMART
- Classifier using Support Vector Machine techniques – kernels
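The slides name the three representations without defining them. A minimal sketch, assuming the standard weighting schemes (binary presence, frequency normalized by the document's maximum term frequency for nominal, and a double-logarithmic weight for Cornell SMART); these formulas are assumptions, not taken from the slides:

```python
import math

def binary(freq: int) -> float:
    # 1 if the term occurs in the document, 0 otherwise
    return 1.0 if freq > 0 else 0.0

def nominal(freq: int, max_freq: int) -> float:
    # term frequency normalized by the largest term frequency in the document
    return freq / max_freq if max_freq > 0 else 0.0

def cornell_smart(freq: int) -> float:
    # double-logarithmic damping: 1 + ln(1 + ln(f)) for f > 0, else 0
    return 0.0 if freq == 0 else 1.0 + math.log(1.0 + math.log(freq))
```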
Correlation of the SVM kernel's parameters

- Polynomial kernel
  $k(x, x') = (2d + \langle x, x' \rangle)^d$
- Gaussian kernel
  $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{n \cdot C}\right)$

Polynomial kernel parameter's correlation

- Commonly used kernel:
  $k(x, x') = (\langle x, x' \rangle + b)^d$
  - d – the degree of the kernel
  - b – the offset (bias)
- Our suggestion: correlate the bias with the degree, b = 2d (see the sketch below), giving
  $k(x, x') = (2d + \langle x, x' \rangle)^d$

Bias – Polynomial kernel

$k(x, x') = (\langle x, x' \rangle + b)^d$  versus  $k(x, x') = (2d + \langle x, x' \rangle)^d$
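A minimal sketch of the suggested kernel, with the bias tied to the degree (b = 2d) as on the slide; numpy and the function name are implementation choices, not from the slides:

```python
import numpy as np

def polynomial_kernel(x: np.ndarray, y: np.ndarray, d: int) -> float:
    # suggested correlation: bias b = 2*d, so k(x, x') = (2d + <x, x'>)^d
    return float((2.0 * d + np.dot(x, y)) ** d)
```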
[Figure: Influence of the bias – nominal representation of input data. Accuracy (%) vs. values of the bias b (0 to 1309) for kernel degrees d = 1 to d = 4; our choice b = 2d is marked.]
Gaussian kernel parameter's correlation

- Commonly used kernel:
  $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{C}\right)$
  - C – usually represents the dimension of the set
- Our suggestion: divide by the number of relevant features,
  $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{n \cdot C}\right)$
  - n – the number of distinct features greater than 0
n – Gaussian kernel

$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{C}\right)$  versus  $k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{n \cdot C}\right)$

[Figure: Influence of n – Cornell SMART data representation. Accuracy (%) vs. values of parameter n (1, 10, 50, 100, 500, 654, 1000, 1309, auto) for C = 1.0, 1.3, 1.8, 2.1.]
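A sketch of the modified Gaussian kernel. How n is computed per pair is not spelled out on the slides, so counting the features that are non-zero in at least one of the two vectors is an assumption:

```python
import numpy as np

def gaussian_kernel(x: np.ndarray, y: np.ndarray, C: float) -> float:
    # n – number of distinct features greater than 0 (assumed here to be
    # the features that are non-zero in at least one of the two vectors)
    n = max(int(np.count_nonzero((x != 0) | (y != 0))), 1)
    return float(np.exp(-np.sum((x - y) ** 2) / (n * C)))
```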
Feature selection using Genetic Algorithms

- Chromosome encoding:
  $c = (w_0, w_1, \ldots, w_{19038}, b)$
- Fitness: $Fitness(c_i) = SVM(c_i)$, where
  $f(c) = f((w_1, w_2, \ldots, w_n, b)) = \sum_{i=1}^{m} \langle w, x_i \rangle + b$
- Methods of selecting parents
  - Roulette Wheel
  - Gaussian selection
- Genetic operators
  - Selection
  - Mutation
  - Crossover
Methods of selecting the parents

- Roulette Wheel
  - each individual is represented by a space that corresponds proportionally to its fitness
- Gaussian selection
  - maximum value (m = 1) and dispersion (σ = 0.4)
  - $P(c_i) = \exp\left(-\frac{1}{2}\left(\frac{fitness(c_i) - m}{\sigma}\right)^2\right)$

The process of obtaining the next generation

1. Selection – the best chromosome is copied from the old population into the new population; two parents are selected.
2. Crossover – two children are created from the selected parents using crossover with a split point; one of the parents is randomly eliminated.
3. Mutation – the sign of a random number of elements is changed.
4. If more chromosomes are needed in the set, repeat from the selection step; otherwise the new generation is complete (see the sketch below).
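A sketch of one generation step under the flow above, using the Gaussian selection from the previous slide (m = 1, σ = 0.4). The single-point crossover, the mutation count, and mutating every surviving chromosome are illustrative assumptions:

```python
import math
import random

M, SIGMA = 1.0, 0.4  # maximum value and dispersion from the slides

def select_parent(population, fitnesses):
    # Gaussian selection: P(c_i) = exp(-1/2 * ((fitness(c_i) - m)/sigma)^2),
    # drawn roulette-wheel style over these weights
    weights = [math.exp(-0.5 * ((f - M) / SIGMA) ** 2) for f in fitnesses]
    return random.choices(population, weights=weights, k=1)[0]

def crossover(p1, p2):
    # split the parents at one random point and swap the tails
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def mutate(c):
    # change the sign of a random number of elements
    c = list(c)
    k = random.randrange(1, max(2, len(c) // 10))
    for i in random.sample(range(len(c)), k):
        c[i] = -c[i]
    return c

def next_generation(population, fitness, pop_size):
    scores = [fitness(c) for c in population]
    # elitism: the best chromosome is copied into the new population
    new_pop = [max(population, key=fitness)]
    while len(new_pop) < pop_size:
        p1 = select_parent(population, scores)
        p2 = select_parent(population, scores)
        child1, child2 = crossover(p1, p2)
        survivor = random.choice([p1, p2])  # randomly eliminate one parent
        for c in (child1, child2, survivor):
            if len(new_pop) < pop_size:
                new_pop.append(mutate(c))
    return new_pop
```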
[Figure: GA_FS versus SVM_FS for 1309 features, polynomial kernel. Accuracy (%) vs. kernel degree (D1.0 to D5.0) for GA-BIN, GA-NOM, GA-CS, SVM-BIN, SVM-NOM, SVM-CS.]
[Figure: Training time, polynomial kernel, d = 2, nominal representation. Time (minutes) vs. number of features (475, 1309, 2488, 8000) for GA_FS, SVM_FS, IG_FS.]
[Figure: GA_FS versus SVM_FS for 1309 features, Gaussian kernel. Accuracy (%) vs. parameter C (C1.0 to C3.1) for GA-BIN, GA-CS, SVM-BIN, SVM-CS.]
[Figure: Training time, Gaussian kernel, C = 1.3, binary representation. Time (minutes) vs. number of features (475, 1309, 2488, 8000) for GA_FS, SVM_FS, IG_FS.]
Meta-classifier with SVM

- Set of SVMs:
  - Polynomial degree 1, Nominal
  - Polynomial degree 2, Binary
  - Polynomial degree 2, Cornell SMART
  - Polynomial degree 3, Cornell SMART
  - Gaussian C = 1.3, Binary
  - Gaussian C = 1.8, Cornell SMART
  - Gaussian C = 2.1, Cornell SMART
  - Gaussian C = 2.8, Cornell SMART
- Upper limit (94.21%)
Meta-classifier methods

- Non-adaptive method
  - Majority Vote – each classifier votes a specific class for the current document
- Adaptive methods – compute the similarity between the current sample and the error samples from the classifier's own queue (see the sketch below)
  - Selection based on Euclidean distance
    - First good classifier
    - The best classifier
    - $Eucl(x, x') = \sqrt{\sum_{i=1}^{n} ([x]_i - [x']_i)^2}$
  - Selection based on cosine
    - First good classifier
    - The best classifier
    - Using average
    - $\cos(x, x') = \frac{\langle x, x' \rangle}{\|x\| \cdot \|x'\|} = \frac{\sum_{i=1}^{n} [x]_i [x']_i}{\sqrt{\sum_{i=1}^{n} [x]_i^2} \cdot \sqrt{\sum_{i=1}^{n} [x']_i^2}}$
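A sketch of the voting and one selection rule. The formulas are from the slide; the error-queue layout, the distance threshold, and the exact "first good classifier" logic are assumptions filled in around them:

```python
import numpy as np

def majority_vote(classifiers, x):
    # non-adaptive: each SVM votes one class label for the current document
    votes = [clf.predict(x) for clf in classifiers]  # labels assumed hashable
    return max(set(votes), key=votes.count)

def euclidean(x, y):
    # Eucl(x, x') = sqrt(sum_i ([x]_i - [x']_i)^2)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def cosine(x, y):
    # cos(x, x') = <x, x'> / (||x|| * ||x'||)
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    return float(np.dot(x, y) / denom) if denom > 0 else 0.0

def first_good_classifier(x, error_queues, min_dist):
    # SBED, 'first good classifier' flavour: pick the first SVM whose queued
    # error samples all lie far enough (Euclidean) from the current sample
    for clf_id, queue in enumerate(error_queues):
        if all(euclidean(x, err) > min_dist for err in queue):
            return clf_id
    return None  # no classifier qualifies; caller falls back (e.g. to voting)
```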
[Figure: Selection based on Euclidean distance – classification accuracy. Accuracy (%) vs. steps (1 to 13) for FC-SBED and BC-SBED, against the upper limit.]
[Figure: Selection based on cosine – classification accuracy. Accuracy (%) vs. steps (1 to 13) for FC-SBCOS, BC-SBCOS, and BC-SBCOS with average, against the upper limit.]
[Figure: Comparison between SBED and SBCOS – classification accuracy. Accuracy (%) vs. steps (1 to 13) for Majority Vote, SBED, and SBCOS, against the upper limit.]
[Figure: Comparison between SBED and SBCOS – processing time. Time (minutes) vs. steps (1 to 13) for Majority Vote, SBED, and SBCOS.]
Initial data set scalability

1. Normalize each sample (7053 samples).
2. Group the initial set based on distance (4474 groups).
3. Take the relevant vector of each group (4474 vectors).
4. Use the relevant vectors in the classification process.
5. Select only the support vectors (847).
6. Take the samples grouped in the selected support vectors (4256 samples).
7. Make the classification with the 4256 samples (see the sketch below).
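A sketch of this reduction pipeline. The slides only say "group based on distance", so KMeans centers and majority labels per group are stand-ins for the actual grouping and relevant vectors; scikit-learn is assumed, and y is assumed to be an integer label array:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def scalable_training(X, y, n_groups):
    # 1. normalize each sample
    X = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    # 2-3. group the set by distance and take one relevant vector per group
    #      (KMeans is a stand-in for the grouping on the slide)
    km = KMeans(n_clusters=n_groups, n_init=1).fit(X)
    reps = km.cluster_centers_
    rep_labels = np.array([np.bincount(y[km.labels_ == g]).argmax()
                           for g in range(n_groups)])  # majority label per group
    # 4-5. classify the relevant vectors, keep only the support vectors
    svm = SVC(kernel="poly", degree=2).fit(reps, rep_labels)
    # 6. take the samples grouped in the selected support vectors
    keep = np.isin(km.labels_, svm.support_)
    # 7. make the final classification with the reduced sample set
    return SVC(kernel="poly", degree=2).fit(X[keep], y[keep])
```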
[Figure: Polynomial kernel – 1309 features, nominal representation. Influence of the kernel degree: accuracy (%) vs. degree (D1.0 to D5.0) for SVM-7053 and SVM-4256.]
[Figure: Gaussian kernel – 1309 features, Cornell SMART. Accuracy (%) vs. parameter C (1.0, 1.3, 1.8, 2.1, 2.8) for SVM-7053 and SVM-4256.]

[Figure: Training time, Gaussian kernel. Time (minutes) vs. parameter C (C1.0 to C2.8) for 7053-Bin, 7053-CS, 4256-Bin, 4256-CS.]
Choosing training and testing data sets

[Figure: 1309 features – polynomial kernel. Accuracy (%) vs. kernel degree (D1.0 to D5.0, plus the average) for the average over the old set and the average over the new set.]
Choosing training and testing data sets

[Figure: 1309 features – Gaussian kernel. Accuracy (%) vs. parameter C (C1.0 to C2.8, plus the average) for the average over the old set and the average over the new set.]
Conclusions – other results

- Using our parameter correlation:
  - 3% better for the polynomial kernel
  - 15% better for the Gaussian kernel
- A reduced number of features, between 2.5% (475) and 6% (1309) of the total attributes, suffices
- GA_FS is faster than SVM_FS
- The polynomial kernel works best with nominal representation and a small degree
- The Gaussian kernel works best with Cornell SMART representation
- The Reuters database is linearly separable
- SBED is better and faster than SBCOS
- Classification accuracy decreases by 1.2% when the data set is reduced
Further work

- Feature extraction and selection
  - Association rules between words (Mutual Information)
  - The synonymy and polysemy problems
  - Using families of words (WordNet)
- Web mining application
- Classifying larger text data sets
- A better method of grouping data
- Using classification and clustering together