Time Series Discretization Using Evolutionary Programming
Abstract. In this work, we present a novel algorithm for time series
discretization. Our approach includes the optimization of the word size
and the alphabet as one parameter. Using evolutionary programming,
the search for a good discretization scheme is guided by a cost function
which considers three criteria: the entropy regarding the classification,
the complexity measured as the number of different strings needed to
represent the complete dataset, and the compression rate assessed as
the length of the discrete representation. Our proposal is compared with
some of the most representative algorithms in the specialized literature, tested on a well-known benchmark of time series data sets. The
statistical analysis of the classification accuracy shows that the overall
performance of our algorithm is highly competitive.
Keywords: Time series, Discretization, Evolutionary Algorithms, Optimization
1 Introduction
Many real-world applications related to information processing generate temporal data [12]. In most cases, this kind of data requires huge storage.
Therefore, it is desirable to compress this information while maintaining its most
important features. Many approaches focus mainly on data compression;
however, they do not rely on significant information measured with entropy [11,
13]. In those approaches, the dimensionality reduction is given by the transformation of a time series of length N into a data set of n coefficients, where n < N
[7]. The two main characteristics of a time series are the number of segments
(word size) and the number of values (alphabet) required to represent its continuous values. Fig. 1 shows a time series with a grid that represents the cut points
for word size and alphabet.
Most of the discretization algorithms require, as an input, the parameters
of word size and alphabet. However, in real-world applications it might be very
difficult to know in advance their best values. Hence, their definitions require a
careful analysis of the time series data set [9, 13].
Among the approaches proposed to deal with data discretization we can find
those which work with one time series at a time, such as the one proposed by
Mörchen [14]. His algorithm is centered on the search for persistent states (the
most frequent values) in time series. However, such states are not common in
many real-world time series applications. Another representative approach was
proposed by Dimitrova [3], where a multi-connected graph representation for
time series was employed. The links between nodes carry Euclidean distance
values which, under this representation, are used to eliminate links in order to
obtain a path that defines the discretization scheme. Nonetheless, this way of
defining the discretization process could be a disadvantage because not all the
time series in a data set will necessarily have the same discretization scheme.

Fig. 1: Word size and alphabet representation. In this case the time series has a
word size = 9 and an alphabet = 5 (values A and E do not appear in the time
series)
Keogh [13] proposed the Symbolic Aggregate Approximation (SAX) approach.
This algorithm is based on the Piecewise Aggregate Approximation (PAA), a
dimensionality reduction algorithm [8]. After PAA is applied, the values are then
transformed into categorical values through a probability distribution function.
The algorithm requires the alphabet and the word size as inputs. This is SAX’s
main disadvantage because it is not clear how to define them from a given time
series data set.
There are other approaches based on search algorithms. García-López [2] proposed EBLA2, which performs a greedy search for entropy minimization in order
to automatically find the word size and alphabet. The main disadvantage of this approach is the tendency of the greedy search to get trapped in
local optima. Therefore, in [6] simulated annealing was used as a search algorithm and the results improved. Finally, in [1], a genetic algorithm was used
to guide the search; however, the solution was incomplete in the sense that the
algorithm considered the minimization of the alphabet as a first stage and attempted to reduce the word size in a second stage. In this way, some solutions
could not be generated.
In this work, we present a new approach in which both the word size and
the alphabet are optimized at the same time. Due to its simplicity with respect
to other evolutionary algorithms, evolutionary programming (EP) is adopted
as a search algorithm (e.g., no recombination and parent selection mechanisms
are performed; just mutation and replacement need to be designed). Furthermore,
the number of strings and the length of the discretized series are optimized
as well.
The contents of this paper are organized as follows: Section 2 details the
proposed algorithm. After that, Section 3 presents the obtained results and a
comparison against other approaches. Finally, Section 4 draws some conclusions
and presents future work.
2 Our approach
In this section we first define the discretization problem. Thereafter, EP is
introduced and its adaptation to solve the problem of interest is detailed in four
steps: (1) solution encoding, (2) fitness function definition, (3) mutation operator
and (4) replacement technique.
2.1 Statement of the problem
The discretization process refers to the transformation of continuous values into
discrete values. Formally, the domain is represented as $\{x \mid x \in \mathbb{R}\}$, where $\mathbb{R}$ is the
set of real numbers, and the discretization scheme is $D = \{[d_0, d_1], (d_1, d_2], \ldots, (d_{n-1}, d_n]\}$,
where $d_0$ and $d_n$ are the minimum and maximum values of $x$, respectively. Each pair in $D$ represents an interval, and each continuous value is
mapped to one of the elements of the discrete set $\{1, \ldots, m\}$, where $m$ is called the
discretization degree and $d_i, i = 1, \ldots, n$, are the limits of the intervals, also
known as cut points. The discretization process has to be done over both
characteristics: the length (word size) and the interval of values
taken by the continuous variable (alphabet).
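To make the mapping concrete, the following Python sketch (illustrative, not part of the original method; all names are our own) applies the interior cut points of a scheme $D$ to a single continuous value:

import bisect

def discretize_value(x, cut_points, symbols):
    """Map a continuous value x to a discrete symbol.

    cut_points holds the interior limits d_1, ..., d_{n-1} of the
    intervals in D; symbols holds one label per interval.
    """
    # bisect_left returns the index of the interval containing x,
    # with a value equal to d_i falling in the lower interval,
    # matching the half-open intervals (d_i, d_{i+1}].
    return symbols[bisect.bisect_left(cut_points, x)]

# Example: alphabet {A, B, C} with cut points 0.5 and 1.5
print(discretize_value(0.7, [0.5, 1.5], ["A", "B", "C"]))  # -> B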
Within our approach we use a modified version of the PAA algorithm [8].
PAA requires the number of segments for the time series as an input value.
Moreover, all the partitions have an equal length. In our proposed approach each
segment is calculated through the same idea as in PAA, by using mean values.
However, partitions will not necessarily have equal lengths. This difference can
be stated as follows: let $C$ be a time series of length $n$ represented as a vector
$C = c_1, \ldots, c_n$, and let $T = t_1, t_2, \ldots, t_m$ be the discretization scheme over the word size,
where $(t_i, t_{i+1}]$ is the time interval of segment $i$ of $C$. The element $i$ of the
discretized series is then given by:

$$\bar{c}_i = \frac{1}{t_{i+1} - t_i} \sum_{j = t_i + 1}^{t_{i+1}} c_j$$
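A minimal sketch of this variable-length PAA, assuming the segment boundaries are given as indices into the series (illustrative names, not the authors' code):

import numpy as np

def variable_paa(series, boundaries):
    """Mean value of each (possibly unequal) segment of a time series.

    boundaries = [t_1, ..., t_m] with t_1 = 0 and t_m = len(series);
    segment i covers the half-open slice series[t_i:t_{i+1}].
    """
    return np.array([series[a:b].mean()
                     for a, b in zip(boundaries[:-1], boundaries[1:])])

c = np.array([1.0, 2.0, 3.0, 4.0, 10.0, 12.0])
print(variable_paa(c, [0, 4, 6]))  # unequal segments -> [2.5, 11.0]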
2.2 Evolutionary programming
EP is a simple but powerful evolutionary algorithm where evolution is simulated
at species level, i.e., no crossover is considered [5]. Instead, asexual reproduction
is implemented by a mutation operator. The main steps in EP are:
1. Population initialization.
2. Evaluation of solutions.
3. Offspring generation by mutation.
4. Replacement.
From the steps mentioned above, the following elements must be defined so as
to adapt EP to the time series discretization problem: (a) solution encoding, (b)
fitness function to evaluate solutions, (c) mutation operator and (d) replacement
mechanism. They are described below.
Solution encoding As in other evolutionary algorithms, in EP a complete solution of the problem must be encoded in each individual. A complete discretization scheme is encoded as shown in Fig. 2a, where the word size is encoded first
with integer values, followed by the alphabet represented by real numbers, which
must be sorted so as to apply the scheme to the time series data set [17] as shown
in Fig. 2b.
(a) Solution encoding: each block represents a cut point; the first
part of the segment (before the red line) encodes the word size,
and the second part represents the alphabet. The first part indicates
that the first segment goes from position 1 to position 23, the
second segment goes from position 24 to position 45, and so on.
In the same manner, the second part shows the alphabet intervals
(see Fig. 2b)
(b) Solution decoding: the decoded solution from Fig. 2a, after
sorting the values for the word size and the alphabet, is applied
to a time series. The solution can be seen as a grid (i.e., the word
size over the x-axis and the alphabet over the y-axis)
Fig. 2: Solution encoding-decoding
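The encoding of Fig. 2 can be sketched as follows (a sketch under our reading of the figure; the bounds lo/hi and the limits on the number of cut points are assumptions, not values from the paper):

import random

def random_individual(series_len, max_word, max_alpha, lo, hi):
    """A candidate scheme in the spirit of Fig. 2a: integer cut points
    for the word size followed by real-valued cut points for the
    alphabet. Assumes max_word <= series_len - 1."""
    word = random.sample(range(1, series_len), random.randint(2, max_word))
    alpha = [random.uniform(lo, hi) for _ in range(random.randint(1, max_alpha))]
    return {"word": word, "alpha": alpha}

def decode(individual, series_len):
    """Sort both parts so the scheme can be applied as a grid (Fig. 2b)."""
    segments = [0] + sorted(individual["word"]) + [series_len]
    amplitude_cuts = sorted(individual["alpha"])
    return segments, amplitude_cuts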
Fitness function Different measures have been reported in the specialized literature to determine the quality of discretization schemes, such as information
criterion [3], persistence states [14], information entropy maximization (IEM),
information gain, entropy maximization, Paterson-Niblett and minimum description length (MDL) [11, 4]. Our fitness function, which aims to bias EP to
promising regions of the search space, is based on three elements:
1. Classification accuracy (accuracy) based on entropy.
2. String reduction level (num_strings).
3. Compression level (num_cutpoints).
Those three values are normalized and added into one single value using the
relationship in Eq. 1 for individual $j$ in the population $Pop$:

$$Fitness(Pop_j) = \alpha \cdot accuracy + \beta \cdot num\_strings + \gamma \cdot num\_cutpoints \qquad (1)$$

where $\alpha$, $\beta$ and $\gamma$ are weights whose values determine the importance of each
element.
The whole evaluation process for a given individual, i.e., discretization scheme,
requires the following steps. First, the discretization scheme is applied over the
complete time series data set $S$. Then, $NS$ strings are obtained, where $NS$ is
equal to the number of time series in the data set $S$. An $m \times ns$ matrix called
$M$ is generated, where $m$ is the number of different classes and $ns$ is the number
of different strings obtained. From this discretized data set, $S_U$ is computed as
the list of unique strings. Each of these strings has its own class label $C$. The
first element of Eq. 1 ($accuracy$) is computed through an entropy calculation over
the columns of the matrix $M$, as indicated in Eq. 2:
$$accuracy = \sum_{j=0}^{\#S_U} Entropy(Col_j) \qquad (2)$$
where $\#S_U$ is the number of different strings and $Col_j$ is column $j$ of the
matrix $M$. The second element, $num\_strings$, is calculated as in Eq. 3:

$$num\_strings = (\#S_U - \#C)/(N + \#C) \qquad (3)$$

where $\#S_U$ is the number of different strings, $N$ is the number of time series
in the data set and $\#C$ is the number of existing classes. Finally, the third
element, $num\_cutpoints$, is computed as in Eq. 4:

$$num\_cutpoints = size\_individual/(2 \cdot length\_series) \qquad (4)$$
where $size\_individual$ is the number of partitions (word size) of the particular
discretization scheme and $length\_series$ is the length of the original time
series. In summary, the first element represents how well a particular individual
(discretization scheme) is able to correctly classify the data base, the second
element assesses the complexity of the representation in terms of the different
patterns needed to encode the data, and the third element is a measure of the
compression rate reached using a particular discretization scheme.
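Under the definitions above, the whole cost function can be sketched as follows (a sketch under our reading of Eqs. 1-4, with the class-by-string count matrix $M$ assumed to be built beforehand; this is not the authors' code):

import numpy as np

def column_entropy(col):
    """Shannon entropy of one column of the class-by-string matrix M."""
    p = col / col.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def fitness(M, num_series, series_len, word_size,
            alpha=0.9009, beta=0.0900, gamma=0.0090):
    """Weighted sum of Eqs. 2-4 as in Eq. 1 (weights from Section 3).

    M counts, for each class (row), how many series of that class
    map to each distinct string (column).
    """
    num_classes, ns = M.shape
    accuracy = sum(column_entropy(M[:, j]) for j in range(ns))      # Eq. 2
    num_strings = (ns - num_classes) / (num_series + num_classes)   # Eq. 3
    num_cutpoints = word_size / (2 * series_len)                    # Eq. 4
    return alpha * accuracy + beta * num_strings + gamma * num_cutpoints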
Mutation operator The mutation operator is applied to every individual in
the population in order to generate one offspring per individual. We need a value
N M U T ∈ [1, 2, 3] to define how many changes will be made to an individual.
Each time an individual is mutated the N M U T value is calculated. Each change
consists on choosing a position of the vector defined in Fig. 2a and generate a
new valid value at random.
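A sketch of this operator, assuming a change replaces one encoded cut point with a fresh valid value (whether changes may also add or remove cut points is not specified in the paper, so that is not modeled here):

import random

def mutate(parent, series_len, lo, hi):
    """Apply NMUT in {1, 2, 3} random changes to a copy of the parent."""
    child = {"word": list(parent["word"]), "alpha": list(parent["alpha"])}
    for _ in range(random.randint(1, 3)):          # NMUT
        if random.random() < 0.5:                  # change in the word-size part
            i = random.randrange(len(child["word"]))
            child["word"][i] = random.randint(1, series_len - 1)
        else:                                      # change in the alphabet part
            i = random.randrange(len(child["alpha"]))
            child["alpha"][i] = random.uniform(lo, hi)
    return child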
Replacement mechanism The replacement mechanism consists of sorting
the current population and its offspring together by their fitness values and letting
the first half survive to the next generation, while the second half is eliminated.
The pseudocode of our EP algorithm to tackle the time series discretization
problem is presented in Algorithm 1, where a population of individuals, i.e.,
valid schemes, is generated at random. After that, each individual m generates
one offspring by the mutation implemented as one to three random changes
in the encoding. The set of current individuals Pop and the set Offspring
are merged into one set called Pop′, which is sorted based on fitness, and the
first half remains for the next generation. The process finishes when a number
MAXGEN of generations is computed, and the best discretization scheme is
then used to discretize the data base to be classified by the k-nearest neighbors
(KNN) algorithm.
Algorithm 1 EP pseudocode
1: Pop = ∅
2: for m = 0 to popsize do
3:   Popm = Valid_Scheme() % Generate individuals at random
4: end for
5: for k = 0 to MAXGEN do
6:   Offspring = ∅
7:   for m = 0 to popsize do
8:     Offspringm = Mutation(Popm) % Create a new individual by mutation
9:   end for
10:  Pop′ = Replacement(Pop + Offspring) % Select the best ones
11:  Pop = Pop′
12: end for
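For reference, Algorithm 1 translates almost line-for-line into Python; this sketch assumes the cost of Eq. 1 is to be minimized and reuses the hypothetical helpers sketched in the previous subsections:

def evolve(popsize, maxgen, new_individual, mutate, fitness):
    """EP main loop of Algorithm 1 (illustrative transcription)."""
    pop = [new_individual() for _ in range(popsize)]    # lines 1-4
    for _ in range(maxgen):                             # line 5
        offspring = [mutate(ind) for ind in pop]        # lines 6-9
        merged = sorted(pop + offspring, key=fitness)   # line 10
        pop = merged[:popsize]                          # line 11: keep best half
    return pop[0]                                       # best scheme found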
3 Experiments and results
The EP algorithm was tested on twenty data sets (available from the well-known
UCR time series classification/clustering web page [10]). A summary of the features
of each data set is presented in Table 1. The EP algorithm was executed with the
following parameter values, found experimentally after preliminary tests: popsize = 250,
MAXGEN = 50, α = 0.9009, β = 0.0900 and γ = 0.0090.
Table 1: Data sets used in the experiments
The quality of the solutions obtained by the EP algorithm was computed by
using the best discretization scheme obtained from a set of five independent runs
in the k-nearest neighbors classifier with K = 1. Other K values were tested
(K = 3 and K = 5) but the performance decreased in all cases. Therefore, the
results were not included in this paper. The low number of runs (5) is due to the
time required (more than an hour) for the algorithm to process one single run for
a given data set. The distance measure used in the k-nearest neighbors algorithm
was Euclidean distance. The algorithms used for comparison were GENEBLA
and SAX. The raw data was also included as a reference. Based on the fact
that SAX requires the word length and alphabet as inputs, it was run using
the parameters obtained by the EP algorithm and also by those obtained by
GENEBLA.
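The evaluation step can be sketched as a plain 1-nearest-neighbor classifier with Euclidean distance (our own illustrative code; the discretized series are assumed to be given as numeric vectors of equal length):

import numpy as np

def knn1_error_rate(train_X, train_y, test_X, test_y):
    """Fraction of test series misclassified by their single nearest
    training series under Euclidean distance (K = 1)."""
    errors = 0
    for x, y in zip(test_X, test_y):
        distances = np.linalg.norm(train_X - x, axis=1)
        if train_y[int(np.argmin(distances))] != y:
            errors += 1
    return errors / len(test_y)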
Table 2 summarizes the error rate on the twenty time series data sets for K =
1 for each evaluated algorithm (EP, SAX(EP), GENEBLA, SAX(GENEBLA))
and the raw data. Values range from zero to one, where lower is better.
The values in parentheses indicate the confidence in the significance of the
differences observed, based on statistical tests applied to the samples of results
per algorithm. In all cases the differences were significant.
From the results in Table 2 it can be noticed that different performances
were provided by the compared algorithms. EP obtained the lowest error rate in
nine data sets. On the other hand, GENEBLA had better results in just three
data sets. Regarding the combinations with SAX, GENEBLA-SAX performed
slightly better than EP-SAX, obtaining the best results in five and three data
sets, respectively.
It is worth noticing that EP provided its best performance in data sets with a
lower number of classes (between 2 and 4): CBF, Face Four, Coffee, Gun Point,
ECG200 and Two Pattern (Fish is the only exception). On the other hand,
GENEBLA performed better in data sets with more classes (Adiac with 37 and
Face All with 14). Another interesting finding is that SAX seems to help
EP achieve the best performance among the compared approaches on those data
sets with a higher number of classes (Lighting7 with 7, 50words with 50 and
Swedish Leaf with 15). In contrast, the combination of GENEBLA with SAX
helps the former deal with some data sets with a lower number of classes (Beef
with 5, Lighting2 with 2, Synthetic Control and OSU Leaf with 6 and Wafer
with 2).
Finally, there was no clear pattern relating the best-performing algorithm to
the sizes of the training and test sets or to the time series length.
Table 2: Error rate obtained by each compared approach on the 20 data sets.
The best result for each data set is highlighted with a gray background. Raw data
is presented only as a reference.
4 Conclusions and Future Work
We presented a novel time series discretization algorithm based on EP. The
proposed algorithm was able to automatically find the parameters for a good
discretization scheme considering the optimization of accuracy and compression
rate. Moreover, and as far as we know, this is the first approach that considers
the word length and the alphabet optimization at the same time. A simple
mutation operator was able to sample the search space by generating new and
competitive solutions. Our EP algorithm is easy to implement, and the results
obtained in 20 different data sets were highly competitive with respect to previously
proposed methods, including the raw data, i.e., the original time series,
which means that the EP algorithm is able to obtain the important information
of a continuous time series and disregard unimportant data. The EP algorithm
provided a high performance with respect to GENEBLA in problems with a low
number of classes. However, if EP is combined with SAX, the approach is able
to outperform GENEBLA and also GENEBLA-SAX in problems with a higher
number of classes.
The future work consists of a further analysis of the EP algorithm, such
as the effect of the weights on the search as well as of the number of changes in
the mutation operator. Furthermore, other classification techniques (besides k-nearest
neighbors) need to be tested. Finally, Pareto dominance will be explored
with the aim of dealing with the three objectives considered in the fitness function
[15].
References
1. García-López, D.A., Héctor-Gabriel, A.-M., Ernesto, C.-P.: Discretization of Time Series Dataset with a Genetic Search. In: Proceedings of the 8th Mexican International Conference on Artificial Intelligence (MICAI '09). ISBN 978-3-642-05257-6 (2009)
2. Acosta-Mesa, H.G., Nicandro, C.R., Daniel-Alejandro, G.-L.: Entropy Based Linear Approximation Algorithm for Time Series Discretization. In: Advances in Artificial Intelligence and Applications, Research in Computing Science, vol. 32, pp. 214-224
3. Dimitrova, E.S., McGee, J., Laubenbacher, R.: Discretization of Time Series Data (2005). arXiv:q-bio/0505028
4. Fayyad, U., Irani, K.: Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In: Proceedings of the 13th International Joint Conference on Artificial Intelligence (1993)
5. Fogel, L.: Intelligence Through Simulated Evolution: Forty Years of Evolutionary Programming. Wiley Series on Intelligent Systems (1999)
6. García-López, D.A.: Algoritmo de Discretización de Series de Tiempo Basado en Entropía y su Aplicación en Datos Colposcópicos [Time Series Discretization Algorithm Based on Entropy and its Application to Colposcopic Data]. Universidad Veracruzana (2007)
7. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. The Morgan Kaufmann Series in Data Management Systems (2001)
8. Keogh, E., Chakrabarti, K., Pazzani, M., Mehrotra, S.: Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. ACM Trans. Database Syst. (2002)
9. Keogh, E., Lonardi, S., Ratanamahatana, C.A.: Towards Parameter-Free Data Mining. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2004)
10. Keogh, E., Xi, X., Wei, L., Ratanamahatana, C.A.: The UCR Time Series Classification/Clustering Homepage (2006). http://www.cs.ucr.edu/~eamonn/time_series_data/
11. Kurgan, L., Cios, K.: CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering (2004)
12. Last, M., Kandel, A., Bunke, H.: Data Mining in Time Series Databases. World Scientific, Singapore (2004)
13. Lin, J., Keogh, E., Lonardi, S., Chiu, B.: A Symbolic Representation of Time Series, with Implications for Streaming Algorithms. In: Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (2003)
14. Mörchen, F., Ultsch, A.: Optimizing Time Series Discretization for Knowledge Discovery. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining (2005)
15. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation (2002)
16. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer (2009)
17. Chiu, C., Nanh, S.C.: An Adapted Covering Algorithm Approach for Modeling Airplanes Landing Gravities. Expert Systems with Applications 26, 443-450 (2004)