Genetic-Algorithm-Based Instance and Feature Selection
Instance Selection and Construction for Data Mining, Ch. 6
H. Ishibuchi, T. Nakashima, and M. Nii

Abstract
A GA-based approach for selecting a small number of instances from a given data set in a pattern classification problem.
The goal is to improve the classification ability of a nearest neighbor classifier by searching for an appropriate reference set.

Genetic Algorithm
Coding: binary string of length (n+m)
$$S = a_1 a_2 \cdots a_n \, s_1 s_2 \cdots s_m$$
$a_i$: inclusion or exclusion of the i-th feature
$s_p$: inclusion or exclusion of the p-th instance
Fitness function: minimize $|F|$, minimize $|P|$, and maximize $g(S)$
$|F|$: number of selected features
$|P|$: number of selected instances
$g(S)$: classification performance

Genetic Algorithm
Performance measure (first one): $g_A(S)$
The number of correctly classified instances.
Minimize $|P|$ subject to $g_A(S) = m$.
Performance measure (second one): $g_B(S)$
When an instance $x_q$ is included in the reference set, $x_q$ is not selected as its own nearest neighbor:
$$d_F(x_{p^*}, x_q) = \begin{cases} \min\{\, d_F(x_p, x_q) \mid x_p \in P \,\}, & \text{if } x_q \notin P \\ \min\{\, d_F(x_p, x_q) \mid x_p \in P - \{x_q\} \,\}, & \text{if } x_q \in P \end{cases}$$
Fitness:
$$\mathrm{fitness}(S) = W_g \cdot g(S) - W_F \cdot |F| - W_P \cdot |P|$$

Genetic Algorithm
1. Initialization
2. Genetic operation: iterate the following procedure $N_{pop}/2$ times to generate $N_{pop}$ strings
   1. Randomly select a pair of strings
   2. Apply a uniform crossover
   3. Apply a mutation operator
3. Generation update: select the $N_{pop}$ best strings from the $2N_{pop}$ strings
4. Termination test

Numerical Example

Biased Mutation
An effective way to decrease the number of selected instances is to bias the mutation probability.
In the biased mutation, a much larger probability is assigned to the mutation from $s_p = 1$ to $s_p = 0$ than to the mutation from $s_p = 0$ to $s_p = 1$.

Data sets: 2 artificial + 4 real
Normal distribution with small overlap
Normal distribution with large overlap
Iris data
Appendicitis data
Cancer data
Wine data

Parameter Specifications
Population size: 50
Crossover probability: 1.0
Mutation probability:
  $P_m = 0.01$ for feature selection
  $P_m(1 \to 0) = 0.1$ for instance selection
  $P_m(0 \to 1) = 0.01$ for instance selection
Stopping condition: 500 generations
Weight values: $W_g = 5$; $W_F = 1$; $W_P = 1$
Performance measure: $g_A(S)$ or $g_B(S)$
30 trials for each data set

Performance on Training Data

Performance on Test Data
Leaving-one-out procedure (iris and appendicitis)
10-fold cross-validation (cancer and wine)

Effect of Feature Selection

Effect on NN

Some Variants
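The coding, the two performance measures, the fitness function, the biased mutation, and the generation update described in the slides can be sketched in Python roughly as follows. This is a minimal illustration, not the authors' implementation. All function names (`dist`, `g`, `fitness`, `uniform_crossover`, `biased_mutation`, `run_ga`) are hypothetical. The squared Euclidean distance, the random 0/1 initialization, and counting an instance as misclassified when no reference instance is available are assumptions that the slides do not state. Default parameter values follow the Parameter Specifications slide.

```python
import random

def dist(x, y, feats):
    # Assumption: squared Euclidean distance restricted to the selected features F.
    return sum((x[i] - y[i]) ** 2 for i in feats)

def g(S, X, y, n, exclude_self):
    """Count instances correctly classified by 1-NN on the reference set P.

    exclude_self=False corresponds to gA(S); exclude_self=True to gB(S),
    where x_q may not be chosen as its own nearest neighbor.
    """
    feats = [i for i in range(n) if S[i] == 1]
    refs = [p for p in range(len(X)) if S[n + p] == 1]
    correct = 0
    for q in range(len(X)):
        cands = [p for p in refs if not (exclude_self and p == q)]
        if not cands:
            continue  # assumption: no reference instance means a misclassification
        nearest = min(cands, key=lambda p: dist(X[p], X[q], feats))
        correct += int(y[nearest] == y[q])
    return correct

def fitness(S, X, y, n, Wg=5, WF=1, WP=1, exclude_self=False):
    # fitness(S) = Wg * g(S) - WF * |F| - WP * |P|
    return Wg * g(S, X, y, n, exclude_self) - WF * sum(S[:n]) - WP * sum(S[n:])

def uniform_crossover(a, b, rng):
    # Each bit position is inherited from either parent with probability 0.5.
    mask = [rng.random() < 0.5 for _ in a]
    c1 = [u if m else v for u, v, m in zip(a, b, mask)]
    c2 = [v if m else u for u, v, m in zip(a, b, mask)]
    return c1, c2

def biased_mutation(S, n, rng, pm_feat=0.01, pm_10=0.1, pm_01=0.01):
    # Feature bits flip with pm_feat; instance bits flip 1->0 far more often than 0->1.
    out = []
    for k, bit in enumerate(S):
        if k < n:
            p = pm_feat
        else:
            p = pm_10 if bit == 1 else pm_01
        out.append(1 - bit if rng.random() < p else bit)
    return out

def run_ga(X, y, n_pop=50, generations=500, seed=0, exclude_self=False):
    rng = random.Random(seed)
    n, m = len(X[0]), len(X)  # n features, m instances: string length n + m

    def fit(S):
        return fitness(S, X, y, n, exclude_self=exclude_self)

    # Assumption: each bit of the initial strings is drawn uniformly at random.
    pop = [[rng.randint(0, 1) for _ in range(n + m)] for _ in range(n_pop)]
    for _ in range(generations):  # termination test: fixed number of generations
        children = []
        for _ in range(n_pop // 2):
            a, b = rng.sample(pop, 2)  # randomly select a pair of strings
            for child in uniform_crossover(a, b, rng):
                children.append(biased_mutation(child, n, rng))
        # Generation update: keep the n_pop best of the 2 * n_pop strings.
        pop = sorted(pop + children, key=fit, reverse=True)[:n_pop]
    return pop[0]
```

Calling `run_ga(X, y)` with `X` as a list of feature vectors and `y` as the class labels returns the best string found; its first `n` bits mark the selected features and the remaining `m` bits mark the selected reference instances. Passing `exclude_self=True` switches the performance measure from gA(S) to gB(S).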