Relevant characteristics extraction from semantically unstructured data
PhD title: Data mining in unstructured data
Daniel I. MORARIU, MSc
PhD supervisor: Lucian N. VINŢAN
Sibiu, 2006

Contents
* Prerequisites
* Correlation of the SVM kernel's parameters
  * Polynomial kernel
  * Gaussian kernel
* Feature selection using Genetic Algorithms
  * Chromosome encoding
  * Genetic operators
* Meta-classifier with SVM
  * Non-adaptive method: Majority Vote
  * Adaptive methods
    * Selection based on Euclidean distance
    * Selection based on cosine
* Initial data set scalability
* Choosing training and testing data sets
* Conclusions and further work

Prerequisites
* Reuters database processing: 806791 documents in total, 126 topics, 366 regions, 870 industry codes
* Industry category selected: "system software"
* Data representation: 7083 documents (4722 for training / 2361 for testing), 19038 attributes (features), 24 classes (topics)
* Feature weighting: Binary, Nominal, Cornell SMART
* Classifier: Support Vector Machine techniques with different kernels

Correlation of the SVM kernel's parameters

Polynomial kernel
* Commonly used form: k(x, x') = (x · x' + b)^d, where d is the degree of the kernel and b is the offset (bias)
* Our suggestion: correlate the bias with the degree, b = 2d, which gives k(x, x') = (2d + x · x')^d
* [Figure: Influence of the bias on accuracy, nominal representation of the input data; curves for d = 1, 2, 3, 4 over bias values b from 0 to 1309; accuracy between 65% and 90%; our choice b = 2d is marked.]

Gaussian kernel
* Commonly used form: k(x, x') = exp(-||x - x'||^2 / C), where C usually represents the dimension of the input set
* Our suggestion: scale C by n, the number of distinct features greater than zero, which gives k(x, x') = exp(-||x - x'||^2 / (n · C))
* [Figure: Influence of n on accuracy, Cornell SMART data representation; curves for C = 1.0, 1.3, 1.8, 2.1 over n ∈ {1, 10, 50, 100, 500, 654, 1000, 1309, auto}; accuracy between 50% and 90%.]
* (A code sketch of both modified kernels is given at the end of this section.)

Feature selection using Genetic Algorithms

Chromosome encoding
* A chromosome encodes the weights and the bias of an SVM decision function: c = (w_0, w_1, ..., w_19038, b)
* Fitness(c_i) = SVM(c_i), evaluated through the decision function over the m training samples:
  f(c) = f((w_1, w_2, ..., w_n, b)) = Σ_{i=1}^{m} (⟨w, x_i⟩ + b)

Methods of selecting the parents (see the selection sketch at the end of this section)
* Roulette Wheel: each individual is assigned a slice of the wheel proportional to its fitness
* Gaussian selection, with maximum value m = 1 and dispersion σ = 0.4:
  P(c_i) = exp(-(1/2) · ((fitness(c_i) - m) / σ)^2)

Genetic operators
* Selection
* Crossover
* Mutation

The process of obtaining the next generation
* Selection: select two parents from the current generation; the best chromosome is copied unchanged from the old population into the new one
* Crossover: create two children from the selected parents by splitting the parents, then randomly eliminate one of the parents
* Mutation: randomly change the sign of a random number of elements
* Repeat while more chromosomes are needed in the set; when no more are needed, the new generation is complete
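A minimal sketch of the two kernel correlations proposed above, as plain Python; this is an illustration rather than the thesis's implementation, and computing n per vector pair as the number of positions that are non-zero in at least one of the two vectors is our reading of "distinct features greater than 0":

```python
import numpy as np

def polynomial_kernel(x, x2, d):
    """Polynomial kernel with the proposed correlation b = 2*d:
    k(x, x') = (2*d + <x, x'>)^d."""
    return (2.0 * d + np.dot(x, x2)) ** d

def gaussian_kernel(x, x2, C):
    """Gaussian kernel divided by n*C instead of C:
    k(x, x') = exp(-||x - x'||^2 / (n * C)).
    Here n counts positions that are non-zero in at least one of the
    two vectors (our reading of 'distinct features greater than 0')."""
    n = max(np.count_nonzero((x != 0) | (x2 != 0)), 1)  # guard: n >= 1
    return np.exp(-np.sum((x - x2) ** 2) / (n * C))

# Tiny usage example on two toy feature vectors
x = np.array([0.0, 1.0, 2.0, 0.0])
y = np.array([1.0, 1.0, 0.0, 0.0])
print(polynomial_kernel(x, y, d=2))  # (2*2 + 1)^2 = 25.0
print(gaussian_kernel(x, y, C=1.3))  # n = 3 non-zero positions
```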
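And a minimal sketch of the two parent-selection schemes; population representation and function names are illustrative, and fitness is assumed to lie in [0, 1] (e.g. SVM accuracy):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def roulette_wheel_select(population, fitness):
    """Each individual gets a wheel slice proportional to its fitness."""
    p = np.asarray(fitness, dtype=float)
    return population[rng.choice(len(population), p=p / p.sum())]

def gaussian_select(population, fitness, m=1.0, sigma=0.4):
    """P(c_i) = exp(-0.5 * ((fitness(c_i) - m) / sigma)^2), with the
    maximum value m = 1 and dispersion sigma = 0.4; here we sample
    proportionally to the normalized weights."""
    f = np.asarray(fitness, dtype=float)
    w = np.exp(-0.5 * ((f - m) / sigma) ** 2)
    return population[rng.choice(len(population), p=w / w.sum())]

# Usage: chromosomes with their SVM-evaluated fitness values
pop = ["c1", "c2", "c3"]
fit = [0.82, 0.91, 0.75]
parent_a = roulette_wheel_select(pop, fit)
parent_b = gaussian_select(pop, fit)
```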
GA_FS versus SVM_FS, 1309 features, polynomial kernel
[Figure: Accuracy versus kernel degree (d = 1.0 to 5.0) for GA-BIN, GA-NOM, GA-CS, SVM-BIN, SVM-NOM and SVM-CS.]

Training time, polynomial kernel, d = 2, nominal representation
[Figure: Training time in minutes versus number of features (475, 1309, 2488, 8000) for GA_FS, SVM_FS and IG_FS.]

GA_FS versus SVM_FS, 1309 features, Gaussian kernel
[Figure: Accuracy versus parameter C (1.0, 1.3, 1.8, 2.1, 2.8, 3.1) for GA-BIN, GA-CS, SVM-BIN and SVM-CS; accuracy between 81.5% and 84%.]

Training time, Gaussian kernel, C = 1.3, binary representation
[Figure: Training time in minutes versus number of features (475, 1309, 2488, 8000) for GA_FS, SVM_FS and IG_FS.]

Meta-classifier with SVM
* Set of eight SVMs:
  * Polynomial, degree 1, Nominal
  * Polynomial, degree 2, Binary
  * Polynomial, degree 2, Cornell SMART
  * Polynomial, degree 3, Cornell SMART
  * Gaussian, C = 1.3, Binary
  * Gaussian, C = 1.8, Cornell SMART
  * Gaussian, C = 2.1, Cornell SMART
  * Gaussian, C = 2.8, Cornell SMART
* Upper limit of the meta-classifier: 94.21%

Meta-classifier methods
* Non-adaptive method: Majority Vote, where each classifier votes for a specific class for the current document
* Adaptive methods: compute the similarity between the current sample and the error samples stored in each classifier's self queue (a similarity sketch is given at the end of this section)
  * Selection based on Euclidean distance, used with the first good classifier or with the best classifier:
    Eucl(x, x') = √( Σ_{i=1}^{n} ([x]_i - [x']_i)^2 )
  * Selection based on cosine, used with the first good classifier, the best classifier, or the average:
    cos(x, x') = ⟨x, x'⟩ / (||x|| · ||x'||) = Σ_{i=1}^{n} [x]_i [x']_i / ( √(Σ_{i=1}^{n} [x]_i^2) · √(Σ_{i=1}^{n} [x']_i^2) )

[Figure: Selection based on Euclidean distance; classification accuracy over steps 1 to 13 for FC-SBED and BC-SBED against the upper limit; accuracy between 78% and 96%.]
[Figure: Selection based on cosine; classification accuracy over steps 1 to 13 for FC-SBCOS, BC-SBCOS and BC-SBCOS with average against the upper limit; accuracy between 80% and 96%.]
[Figure: Comparison between SBED and SBCOS; classification accuracy over steps 1 to 13 for Majority Vote, SBED and SBCOS against the upper limit.]
[Figure: Comparison between SBED and SBCOS; processing time in minutes over steps 1 to 13 for Majority Vote, SBED and SBCOS.]

Initial data set scalability (a code sketch of this pipeline is given at the end of this section)
* Normalize each sample (7053 samples)
* Group the initial set based on distance (4474 groups)
* Take the relevant vector of each group (4474 vectors)
* Use the relevant vectors in the classification process
* Select only the support vectors (847)
* Take the samples grouped in the selected support vectors (4256 samples)
* Make the classification with the 4256 samples

Polynomial kernel, 1309 features, nominal representation
[Figure: Influence of the kernel degree (d = 1.0 to 5.0) on accuracy for SVM-7053 versus SVM-4256; accuracy between 74% and 88%.]

Gaussian kernel, 1309 features, Cornell SMART
[Figure: Influence of parameter C (1.0, 1.3, 1.8, 2.1, 2.8) on accuracy for SVM-7053 versus SVM-4256; accuracy between 70% and 90%.]

Training time
[Figure: Training time in minutes versus parameter C for 7053-Bin, 7053-CS, 4256-Bin and 4256-CS.]
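The scalability pipeline above can be sketched as follows. This is only an illustration under loud assumptions: KMeans stands in for the thesis's distance-based grouping, the group centroid stands in for the "relevant vector", and labeling a group by its majority class is our assumption.

```python
import numpy as np
from sklearn.cluster import KMeans        # stand-in for the distance-based grouping
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

def reduce_training_set(X, y, n_groups):
    X = normalize(X)                                    # 1. normalize each sample
    km = KMeans(n_clusters=n_groups, n_init=1).fit(X)   # 2. group based on distance
    reps = km.cluster_centers_                          # 3. one "relevant vector" per group
    rep_y = np.array([np.bincount(y[km.labels_ == g]).argmax()
                      for g in range(n_groups)])        # majority class (assumption)
    svm = SVC(kernel="rbf").fit(reps, rep_y)            # 4. classify the relevant vectors
    keep = np.isin(km.labels_, svm.support_)            # 5-6. keep samples grouped in the
    return X[keep], y[keep]                             #      selected support vectors

# Toy usage (the thesis goes 7053 samples -> 4474 groups -> 4256 samples kept)
X = np.random.rand(200, 30)
y = np.random.randint(0, 3, size=200)
X_red, y_red = reduce_training_set(X, y, n_groups=50)
final_model = SVC(kernel="rbf").fit(X_red, y_red)       # 7. final classification
```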
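Returning to the adaptive meta-classifier methods above: both SBED and SBCOS reduce to computing one of the two similarity measures between the current sample and the error samples kept in a classifier's self queue. A minimal sketch follows, with the queue policy simplified to "prefer the classifier whose past errors look least like the current sample" (our paraphrase, not the thesis's exact rule):

```python
import numpy as np

def euclidean_distance(x, x2):
    """Eucl(x, x') = sqrt(sum_i ([x]_i - [x']_i)^2); smaller = more similar."""
    return np.sqrt(np.sum((x - x2) ** 2))

def cosine_similarity(x, x2):
    """cos(x, x') = <x, x'> / (||x|| * ||x'||); larger = more similar."""
    denom = np.linalg.norm(x) * np.linalg.norm(x2)
    return float(np.dot(x, x2)) / denom if denom > 0 else 0.0

def select_classifier_sbed(sample, self_queues):
    """self_queues: classifier id -> list of samples it misclassified.
    Prefer the classifier whose recorded errors are farthest from the
    current sample, i.e. the one least likely to repeat an error on it."""
    def distance_to_errors(queue):
        return min(euclidean_distance(sample, e) for e in queue)
    return max(self_queues, key=lambda cid: distance_to_errors(self_queues[cid]))

# Usage with two classifiers and toy error queues
q = {0: [np.array([1.0, 0.0])],
     1: [np.array([0.9, 0.1]), np.array([0.8, 0.2])]}
print(select_classifier_sbed(np.array([0.0, 1.0]), q))  # -> 0
```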
Choosing the training and testing data sets, 1309 features, polynomial kernel
[Figure: Accuracy versus kernel degree (d = 1.0 to 5.0, plus the average) for the average over the old set and the average over the new sets; accuracy between 74% and 88%.]

Choosing the training and testing data sets, 1309 features, Gaussian kernel
[Figure: Accuracy versus parameter C (1.0, 1.3, 1.8, 2.1, 2.8, plus the average) for the average over the old set and the average over the new sets; accuracy between 70% and 90%.]

Conclusions – other results
* Using our parameter correlation, accuracy is 3% better for the polynomial kernel and 15% better for the Gaussian kernel
* The number of features can be reduced to between 2.5% (475 features) and 6% (1309 features) of the original set
* GA_FS is faster than SVM_FS
* Best configurations: the polynomial kernel with nominal representation and a small degree, and the Gaussian kernel with Cornell SMART representation
* The Reuters database is linearly separable
* SBED is better and faster than SBCOS
* Classification accuracy decreases by only 1.2% when the data set is reduced

Further work
* Feature extraction and selection: association rules between words (mutual information); the synonymy and polysemy problem; using families of words (WordNet)
* Web mining application
* Classifying larger text data sets: a better method of grouping the data; using classification and clustering together