Automatic Method for Data
Preprocessing for the GAME
Inductive Modelling Method
Miroslav Čepek
[email protected]
Miloslav Pavlicek, Pavel Kordik
Miroslav Šnorek
Computational Intelligence Group
Department of Computer Science and Engineering
Faculty of Electrical Engineering
Czech Technical University in Prague
ICIM 2008
Automatic preprocessing




The GAME neural network (like all other data
mining methods) depends heavily on data
preprocessing.
Preprocessing involves the selection, setup and
ordering of preprocessing methods.
We want to automate this stage.
We use a genetic algorithm to find an optimal
sequence of methods.
International Conference on Inductive Modelling, Kyiv 2008
Miroslav Cepek, [email protected]
GAME Neural Network


Group of Adaptive Models Evolution (GAME)
uses inductive modelling.
The structure of the model is created
inductively (data-driven modelling).
Main Ideas of Automatic
Preprocessing



The main idea is to use genetic algorithms to
find the optimal order and optimal setup of data
preprocessing methods.
In the first stage we use a simple genetic
algorithm.
Because we want to find the sequence that
best suits the GAME ANN, we use a
reduced GAME ANN for fitness function
evaluation.
Single individual in automatic
preprocessing

Each individual in our automatic preprocessing consists of a list
of preprocessing methods.

Each method can be applied to different attributes.

Each method has its own setup.

Methods are applied one by one.

Some methods change the structure of the dataset
(e.g. PCA) and must be treated separately.
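
The individual described above can be sketched as a plain list of method records. This is only an illustration, not the GAME implementation: the names `MethodGene`, `scale`, `shift` and `apply_individual` are hypothetical, and the two toy methods stand in for real preprocessing methods.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, float]
Dataset = List[Row]


@dataclass
class MethodGene:
    """One preprocessing step: which method, on which attributes, with what setup."""
    name: str
    apply: Callable[[Dataset, List[str], Dict[str, float]], Dataset]
    attributes: List[str]
    config: Dict[str, float] = field(default_factory=dict)


def scale(data: Dataset, attributes: List[str], config: Dict[str, float]) -> Dataset:
    # Hypothetical method: multiply the chosen attributes by a constant factor.
    factor = config.get("factor", 1.0)
    return [{k: (v * factor if k in attributes else v) for k, v in row.items()}
            for row in data]


def shift(data: Dataset, attributes: List[str], config: Dict[str, float]) -> Dataset:
    # Hypothetical method: add a constant offset to the chosen attributes.
    offset = config.get("offset", 0.0)
    return [{k: (v + offset if k in attributes else v) for k, v in row.items()}
            for row in data]


def apply_individual(individual: List[MethodGene], data: Dataset) -> Dataset:
    # Methods are applied one by one, in chromosome order.
    for gene in individual:
        data = gene.apply(data, gene.attributes, gene.config)
    return data


individual = [
    MethodGene("scale", scale, ["x"], {"factor": 2.0}),
    MethodGene("shift", shift, ["x", "y"], {"offset": 1.0}),
]
result = apply_individual(individual, [{"x": 1.0, "y": 10.0}])
```

Because the methods are applied in sequence, reversing the list generally gives a different dataset, which is why the order is part of what the genetic algorithm searches over. Methods that change the dataset structure (such as PCA) would need extra handling that this sketch omits.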
GA for Automatic Preprocessing

The genetic algorithm proceeds in the standard way.
[Figure: flowchart of the genetic algorithm]
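
A generic generational loop, as a minimal sketch of what "standard" usually means here. The slide does not give details, so keeping track of the best individual seen so far is an assumption, and the bit-string problem (fitness = number of ones) is only a toy stand-in for preprocessing chromosomes.

```python
import random
from typing import Callable, List, Tuple

Individual = List[int]


def evolve(
    population: List[Individual],
    fitness: Callable[[Individual], float],
    select: Callable[[List[Individual]], Individual],
    crossover: Callable[[Individual, Individual], Tuple[Individual, Individual]],
    mutate: Callable[[Individual], Individual],
    generations: int,
) -> Individual:
    # Remember the best individual seen so far (an assumption, not from the slides).
    best = max(population, key=fitness)
    for _ in range(generations):
        offspring: List[Individual] = []
        while len(offspring) < len(population):
            child_a, child_b = crossover(select(population), select(population))
            offspring.extend([mutate(child_a), mutate(child_b)])
        population = offspring[:len(population)]
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    return best


rng = random.Random(0)


def select(pop: List[Individual]) -> Individual:
    # Tournament of three, fitness = number of ones.
    return max(rng.sample(pop, 3), key=sum)


def crossover(a: Individual, b: Individual) -> Tuple[Individual, Individual]:
    cut = rng.randint(0, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(ind: Individual) -> Individual:
    # Flip one randomly chosen bit.
    i = rng.randrange(len(ind))
    return ind[:i] + [1 - ind[i]] + ind[i + 1:]


start = [[rng.randint(0, 1) for _ in range(12)] for _ in range(20)]
best = evolve(start, sum, select, crossover, mutate, generations=30)
```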
GA Properties

Selection: tournament selection

Several individuals are drawn at random from the
population, and the one with the highest fitness is
selected.

Crossover: standard one-point crossover.

Mutation

adds or removes preprocessing methods from an
individual,

changes the order of methods,

changes the configuration of methods.
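
The three operators can be sketched as follows. This is an illustration under assumptions, not the authors' code: a gene is reduced to a `(method name, parameter)` pair, the method names in `METHODS` are hypothetical, and the crossover cut is limited to the shorter parent so that it works on lists of different lengths.

```python
import random
from typing import Callable, List, Tuple

Gene = Tuple[str, float]          # (method name, one numeric setting); a simplification
Individual = List[Gene]
METHODS = ["normalize", "impute_mean", "remove_outliers", "pca"]  # hypothetical names


def tournament_select(population: List[Individual],
                      fitness: Callable[[Individual], float],
                      k: int, rng: random.Random) -> Individual:
    # Draw k individuals at random and keep the fittest one.
    return max(rng.sample(population, k), key=fitness)


def one_point_crossover(a: Individual, b: Individual,
                        rng: random.Random) -> Tuple[Individual, Individual]:
    # One cut point shared by both parents; tails are exchanged.
    cut = rng.randint(0, min(len(a), len(b)))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def mutate(ind: Individual, rng: random.Random) -> Individual:
    child = list(ind)  # never modify the parent in place
    op = rng.choice(["add", "remove", "swap", "reconfigure"])
    if op == "add" or not child:
        child.insert(rng.randint(0, len(child)), (rng.choice(METHODS), rng.random()))
    elif op == "remove":
        child.pop(rng.randrange(len(child)))
    elif op == "swap" and len(child) > 1:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    else:
        # Change the configuration of one method, keeping the method itself.
        i = rng.randrange(len(child))
        child[i] = (child[i][0], rng.random())
    return child


rng = random.Random(42)
population = [
    [("normalize", 0.5)],
    [("impute_mean", 0.1), ("pca", 0.9)],
    [("remove_outliers", 0.3)],
]
# Toy fitness: longer chromosomes win; a real run would use model accuracy.
parent = tournament_select(population, fitness=len, k=3, rng=rng)
```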
Fitness Recalculation


Fitness is the average accuracy of several simple
GAME models generated from data
preprocessed by the given individual.

The accuracy of the models is not always the same, because
a genetic algorithm is involved in training.

Using several models gives more consistent
results.
We assume that a better simple model also
means better complex models.
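
A minimal sketch of this fitness evaluation, assuming two hypothetical callbacks: `preprocess` applies an individual to the training data, and `train_and_score` trains one simple model and returns its accuracy. The noisy stand-in trainer below only imitates the run-to-run variation; it is not a GAME model.

```python
import random
from statistics import mean
from typing import Any, Callable, List


def fitness(individual: List[Any],
            preprocess: Callable[[List[Any]], Any],
            train_and_score: Callable[[Any], float],
            n_models: int = 5) -> float:
    # Preprocess once, then average the accuracy of several independently
    # trained simple models to smooth out training randomness.
    data = preprocess(individual)
    return mean(train_and_score(data) for _ in range(n_models))


rng = random.Random(1)


def noisy_trainer(data: Any) -> float:
    # Stand-in for training a simple model: accuracy varies from run to run.
    return 0.8 + rng.uniform(-0.05, 0.05)


demo_fitness = fitness(["normalize"], lambda ind: ind, noisy_trainer, n_models=5)
```

Raising `n_models` reduces the variance of the fitness estimate at the cost of proportionally more training time per individual.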
Outline of the Experiment


The complete dataset is split into a training and
a testing part.
A given portion of values is removed from the
training data.

Several GAME models are created on the raw data.

Instances with missing values are removed; then
several GAME models are created.

Automatic preprocessing is performed: the best
individual is selected, its preprocessing methods
are applied, and several GAME models are created.
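
The data preparation steps of this outline can be sketched as below. The slides do not state how the split is made or which values are removed, so the 70/30 split, the uniformly random choice of cells, and the function names are all assumptions; rows are taken to hold input attributes only, with `None` marking a missing value.

```python
import random
from typing import List, Optional, Tuple

Row = List[Optional[float]]


def split(data: List[Row], train_fraction: float,
          rng: random.Random) -> Tuple[List[Row], List[Row]]:
    # Shuffle a copy, then cut it into a training and a testing part.
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def remove_values(rows: List[Row], portion: float, rng: random.Random) -> List[Row]:
    # Blank out a given portion of all cells, chosen uniformly at random.
    cells = [(i, j) for i, row in enumerate(rows) for j in range(len(row))]
    damaged = [list(row) for row in rows]
    for i, j in rng.sample(cells, round(portion * len(cells))):
        damaged[i][j] = None
    return damaged


def drop_incomplete(rows: List[Row]) -> List[Row]:
    # Baseline treatment: discard every instance that has a missing value.
    return [row for row in rows if None not in row]


rng = random.Random(7)
dataset = [[rng.random() for _ in range(4)] for _ in range(100)]
train, test = split(dataset, 0.7, rng)
raw_train = remove_values(train, 0.05, rng)   # input for the "raw data" models
clean_train = drop_incomplete(raw_train)      # input for the "instances removed" models
```

The third variant of the experiment would pass `raw_train` through the best evolved individual instead of `drop_incomplete`.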
Artificial data
Best Chromosomes
The best individuals for selected amounts of missing values. Part
a) shows the best chromosome for 1% of missing values, part b)
shows the best individual for 5% of missing values, and part c) for 20% of
missing values.
Best Chromosomes



Chromosomes for simple problems (few
missing values) are quite simple.
Chromosomes for complicated problems (many
missing values) are quite
complicated.
In this sense the algorithm behaves as expected.
Manually vs Automatically selected
methods.
Results



The graph shows that GAME is unable to handle
missing values: results on the raw data are quite
poor.
When instances with missing data are removed,
accuracy increases sharply.
When automatic preprocessing is used,
accuracy is even better.
Conclusion



We proposed an algorithm for the automatic selection
and ordering of data preprocessing methods.
We performed a first experiment with our
method.
It works for artificial data; in future we have
to show that it also works for more complicated
and real-world data.
Thank you for your attention.
[email protected]