Incremental Learning
Abdelhamid Bouchachia
Department of Informatics
University of Klagenfurt
Universitaetsstr. 65-67
Klagenfurt, 9020
Austria
voice: +43 463 2700 3525
fax: +43 463 2700 3599
email: [email protected]
Incremental Learning
Abdelhamid Bouchachia, University of Klagenfurt, Austria
INTRODUCTION
Data mining and knowledge discovery is about creating a comprehensible model of the data. Such a model may take different forms, ranging from simple association rules to complex reasoning systems. One of the fundamental aspects this model has to fulfill is adaptivity. This
aspect aims at making the process of knowledge extraction continually maintainable and subject
to future update as new data become available. We refer to this process as knowledge learning.
Knowledge learning systems are traditionally built from data samples in an off-line, one-shot experiment. Once the learning phase is exhausted, the learning system is no longer capable of learning further knowledge from new data, nor is it able to update itself in the future. In this chapter, we consider the problem of incremental learning (IL). We show how, in contrast to off-line or batch learning, IL learns knowledge, be it symbolic (e.g., rules) or sub-symbolic (e.g.,
numerical values) from data that evolves over time. The basic idea motivating IL is that as new
data points arrive, new knowledge elements may be created and existing ones may be modified
allowing the knowledge base (respectively, the system) to evolve over time. Thus, the acquired
knowledge becomes self-corrective in light of new evidence. This update is of paramount
importance to ensure the adaptivity of the system. However, the update should be meaningful (capturing only the interesting events brought by the arriving data) and selective (safely ignoring unimportant ones). From a perceptual standpoint, IL is also a fundamental problem of cognitive development.
Indeed, the perceiver usually learns how to make sense of its sensory inputs in an incremental
manner via a filtering procedure.
In this chapter, we outline the background of IL from the machine learning and data mining perspectives before highlighting our own IL research, the open challenges, and the future trends of IL.
BACKGROUND
IL is a key issue in applications where data arrives over long periods of time and/or
where storage capacities are very limited. Most of the knowledge learning literature reports on learning models built in a one-shot experiment. Once the learning stage is exhausted, the induced knowledge is no longer updated. Thus, the performance of the system depends heavily on the data
used during the learning (knowledge extraction) phase. Shifts of trends in the arriving data
cannot be accounted for.
Algorithms with an IL ability are of increasing importance in many innovative
applications, e.g., video streams, stock market indexes, intelligent agents, user profile learning,
etc. Hence, there is a need to devise learning mechanisms that are capable of accommodating new data in an incremental way while keeping the system in use. Such a problem has been
studied in the framework of adaptive resonance theory (Carpenter et al., 1991). This theory has
been proposed to efficiently deal with the stability-plasticity dilemma. Formally, a learning
algorithm is totally stable if it keeps the acquired knowledge in memory without any catastrophic
forgetting. However, it is not required to accommodate new knowledge. On the contrary, a
learning algorithm is completely plastic if it is able to continually learn new knowledge without
any requirement on preserving the knowledge previously learned. The dilemma aims at
accommodating new data (plasticity) without forgetting (stability) by generating knowledge
elements over time whenever the new data conveys new knowledge elements worth considering.
Basically, there are two schemes for accommodating new data. Retraining the algorithm from scratch using both old and new data is known as the revolutionary strategy. In contrast, the evolutionary strategy continues to train the algorithm using only the new data (Michalski, 1985). The first
scheme fulfills only the stability requirement, whereas the second is a typical IL scheme that is
able to fulfill both stability and plasticity. The goal is to make a tradeoff between the stability
and plasticity ends of the learning spectrum as shown in Fig.1.
[Figure 1: The learning spectrum, ranging from favoring stability to favoring plasticity, with incremental learning positioned as a tradeoff between the two.]
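To make the two schemes concrete, the following minimal sketch (our own illustration, not taken from any of the cited works; function names are hypothetical) contrasts a revolutionary update, which retrains a class prototype from scratch on all data seen so far, with an evolutionary update, which adjusts the prototype using only the new batch and a running count:

```python
import numpy as np

def revolutionary_update(all_batches):
    """Retrain from scratch: requires storing every batch seen so far."""
    data = np.vstack(all_batches)
    return data.mean(axis=0)

def evolutionary_update(prototype, count, new_batch):
    """Incremental update: uses only the new batch plus a running count."""
    for x in np.asarray(new_batch, dtype=float):
        count += 1
        prototype = prototype + (x - prototype) / count  # running mean
    return prototype, count

# Both schemes end up with the same prototype here, but the evolutionary
# one never touches the old batches (plasticity without storing the past).
batch1 = np.random.randn(50, 3)
batch2 = np.random.randn(30, 3) + 1.0

proto, n = evolutionary_update(np.zeros(3), 0, batch1)
proto, n = evolutionary_update(proto, n, batch2)
print(np.allclose(proto, revolutionary_update([batch1, batch2])))  # True
```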
As noted in (Polikar et al., 2000), there are many approaches that address some aspects of IL. They exist under different names, such as on-line learning, constructive learning, lifelong learning, and evolutionary learning. Therefore, a definition of IL turns out to be vital (a minimal interface sketch is given after these points):

- IL should be able to accommodate plasticity by learning knowledge from new data. This data can refer either to the already known structure or to a new structure of the system.
- IL can use only the new data and should not have access at any time to the previously used data in order to update the existing system.
- IL should preserve the stability of the system by avoiding forgetting.
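The three requirements above can be captured in a minimal interface sketch (our own illustration; the class and method names are hypothetical). An incremental learner exposes a single update method that receives only the new batch and must never require the previously seen data:

```python
from abc import ABC, abstractmethod
from typing import Sequence

class IncrementalLearner(ABC):
    """Contract for an IL algorithm in the sense defined above."""

    @abstractmethod
    def learn(self, new_data: Sequence, new_labels: Sequence = None) -> None:
        """Accommodate the new batch only (plasticity). The learner must not
        store or re-request earlier batches, and it should avoid catastrophic
        forgetting of what it has already learned (stability)."""

    @abstractmethod
    def predict(self, x):
        """Answer queries at any time, between two updates."""
```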
It is worth noting that IL research flows in three directions: clustering, classification, and association rule mining. In the context of classification and clustering, many IL approaches have been introduced. A typical incremental approach is discussed in (Parikh & Polikar, 2007), which consists of combining an ensemble of multilayer perceptron networks (MLPs) to accommodate new data. Similar work was done in (Chakraborty & Pal, 2003), also using MLPs. Note here that stand-alone MLPs, like many other neural networks, need retraining in
order to learn from the new data. Other IL algorithms were proposed in (Fritzke, 1994) and in
(Domeniconi & Gunopulos, 2001). The former algorithm is based on radial basis function
networks (RBFs), while the latter aims at constructing incremental support vector machine
classifiers. Actually, there exist four neural models that are inherently incremental: (i) adaptive
resonance theory (ART) (Carpenter et al., 1991), (ii) min-max neural networks (Simpson, 1992),
(iii) nearest generalized exemplar (Salzberg, 1991), and (iv) neural gas model (Fritzke, 1995).
The first three incremental models aim at learning hyper-rectangle categories, while the last one
aims at building point-prototyped categories.
It is important to mention that there exist many classification approaches that are referred
to as IL approaches and which rely on neural networks. These range from retraining
misclassified samples to various weighting schemes (Freeman & Saad, 1997; Grippo, 2000). All of them are about sequential learning, where input samples are presented to the algorithm sequentially but iteratively. However, sequential learning works only in closed-ended environments, where the classes to be learned must already be reflected in the readily available training data; more importantly, prior knowledge can be forgotten if the classes are unbalanced.
In contrast to sub-symbolic learning, few authors have studied incremental symbolic
learning, where the problem is incrementally learning simple classification rules (Maloof &
Michalski, 2004; Reinke & Michalski, 1988; Utgoff, 1988).
In addition, the concept of incrementality has been discussed in the context of association
rule mining (ARM). The goal of ARM is to generate all association rules of the form X → Y whose support and confidence are greater than a user-specified minimum support and minimum confidence, respectively. The motivation underlying incremental ARM stems from the fact that
databases grow over time. The association rules mined need to be updated as new items are
inserted in the database. Incremental ARM aims at using only the incremental part to infer new
rules. However, this is usually done by processing the incremental part separately and scanning
the older database if necessary. Some of the algorithms proposed are FUP (Cheung et al., 1996),
temporal windowing (Rainsford et al., 1997), and DELI (Lee & Cheung, 1997).
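For illustration only (this is not one of the cited algorithms, and the class name and the naive enumeration of all small itemsets are our own assumptions), the sketch below keeps itemset counts as the incremental part of a transaction database arrives, so that the support and confidence of a rule X → Y can be recomputed without rescanning the old transactions:

```python
from collections import Counter
from itertools import combinations

class RuleCounts:
    """Maintains itemset counts so that support and confidence can be
    updated incrementally as new transactions arrive."""

    def __init__(self, max_size=3):
        self.n = 0                # total transactions seen so far
        self.counts = Counter()   # itemset (sorted tuple) -> occurrence count
        self.max_size = max_size

    def add_transactions(self, transactions):
        for t in transactions:    # only the incremental part is scanned
            self.n += 1
            items = sorted(set(t))
            for k in range(1, self.max_size + 1):
                for itemset in combinations(items, k):
                    self.counts[itemset] += 1

    def support(self, itemset):
        return self.counts[tuple(sorted(itemset))] / self.n

    def confidence(self, X, Y):
        # confidence(X -> Y) = support(X union Y) / support(X)
        return self.support(set(X) | set(Y)) / self.support(X)

rc = RuleCounts()
rc.add_transactions([{"a", "b"}, {"a", "b", "c"}, {"a"}])
rc.add_transactions([{"b", "c"}])          # new increment, old data untouched
print(rc.support({"a", "b"}), rc.confidence({"a"}, {"b"}))
```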
In contrast to static databases, IL is more visible in data stream ARM. The nature of data
imposes such an incremental treatment of data. Usually data continually arrives in the form of
high-speed streams. IL is particularly relevant for online streams since data is discarded as soon
as it has been processed. Many algorithms have been introduced to maintain association rules
(Charikar et al., 2004; Chang & Lee, 2004; Domingos & Hulten, 2000; Giannella et al., 2003;
Lin et al., 2005; Yu et al., 2004). Furthermore, many classification and clustering algorithms, which are not fully incremental, have been developed in the context of stream data (Aggarwal et al., 2004; Guha et al., 2000; Last, 2002).
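As a simple illustration of maintaining frequent items over a stream with bounded memory, the sketch below uses a Misra-Gries style summary (a generic technique, not one of the specific algorithms cited above): it keeps at most k counters and discards each item as soon as it has been processed.

```python
def misra_gries(stream, k):
    """Approximate frequent-item counts over a one-pass stream using at
    most k counters; items are discarded once processed."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters  # any item with true frequency > n/(k+1) survives

stream = ["a", "b", "a", "c", "a", "b", "a", "d", "a"]
print(misra_gries(stream, k=2))  # 'a' is guaranteed to be retained
```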
FOCUS
IL has a large spectrum of investigation facets. In the following, we focus on classification and clustering, which are key issues in many domains such as data mining, pattern recognition, knowledge discovery, and machine learning. In particular, we focus on two research
avenues which we have investigated: (i) incremental fuzzy classifiers (IFC) (Bouchachia & Mittermeir, 2006) and (ii) incremental learning by function decomposition (ILFD) (Bouchachia, 2006a).
The motivation behind IFC is to infer knowledge in the form of fuzzy rules from data that
evolves over time. To accommodate IL, appropriate mechanisms are applied in all steps of the
fuzzy system construction:
(1) Incremental supervised clustering: Given a labeled data set, the first step is to cluster this
data with the aim of achieving high purity and separability of clusters. To do that, we
have introduced a clustering algorithm that is incremental and supervised. These two
characteristics are vital for the whole process (a minimal sketch of this step is given after this list). The resulting labeled cluster prototypes are projected onto each feature axis to generate fuzzy partitions.
(2) Fuzzy partitioning and accommodation of change: Fuzzy partitions are generated in two steps. Initially, each cluster is mapped onto a triangular partition. In order to
optimize the shape of the partitions, the number and the complexity of rules, an
aggregation of these triangular partitions is performed. As new data arrives, these
partitions are systematically updated without referring to the previously used data. The
consequents of the rules are then updated accordingly.
(3) Incremental feature selection: To find the most relevant features (which results in
compact and transparent rules), an incremental version of Fisher’s interclass separability
criterion is devised. As new data arrives, some features may be replaced by new ones in the rules. Hence, the rules' premises are dynamically updated. At any time during the life of the classifier, the rule base should reflect the semantic content of the data seen so far.
To the best of our knowledge, there is no previous work on feature selection algorithms
that observe the notion of incrementality.
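The sketch below illustrates the spirit of step (1): prototypes carry a class label, each labeled sample either refines the nearest prototype of its own class or spawns a new one when it is too far away, and old samples are never revisited. The distance threshold and the update rule are our own simplifying assumptions for illustration, not the actual IFC clustering algorithm.

```python
import numpy as np

class IncrementalSupervisedClustering:
    """Labeled prototypes, updated one sample at a time (illustrative only)."""

    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.prototypes = []   # list of (center, label, count)

    def learn_one(self, x, label):
        x = np.asarray(x, dtype=float)
        same = [(i, np.linalg.norm(x - c)) for i, (c, lab, _) in
                enumerate(self.prototypes) if lab == label]
        if same:
            i, d = min(same, key=lambda t: t[1])
            if d <= self.threshold:
                c, lab, n = self.prototypes[i]
                # Move the nearest same-class prototype toward the sample.
                self.prototypes[i] = (c + (x - c) / (n + 1), lab, n + 1)
                return
        # Too far from every prototype of its class: create a new one.
        self.prototypes.append((x, label, 1))

clf = IncrementalSupervisedClustering(threshold=1.5)
for x, y in [([0, 0], "A"), ([0.5, 0], "A"), ([5, 5], "B"), ([0.2, 0.1], "A")]:
    clf.learn_one(x, y)
print(len(clf.prototypes))  # 2 prototypes: one per class region
```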
In another research axis, IL has been thoroughly investigated in the context of neural
networks. In (Bouchachia, 2006a; Bouchachia, 2006b) we have proposed a novel IL algorithm
based on function decomposition (ILFD) that is realized by a neural network. ILFD uses
clustering and vector quantization techniques to deal with classification tasks. The main
motivation behind ILFD is to enable an on-line classification of data lying in different regions of the space, allowing the generation of non-convex and, more generally, disconnected partitions (not lying in the same contiguous region of the space). Hence, each class can be approximated by a sufficient number of categories centered around their prototypes.
Furthermore, ILFD differs from the aforementioned learning techniques (see the Background section) in the following respects:

- Most of those techniques rely on geometric shapes, such as hyper-rectangles and hyper-ellipses, to represent the categories, whereas the ILFD approach is not tied to a particular shape, since different types of distances can be used to obtain different shapes.
- Usually, there is no explicit mechanism (except in the neural gas model) to deal with redundant and dead categories. The ILFD approach uses two procedures to get rid of such categories: the first, called the dispersion test, eliminates redundant category nodes; the second, called the staleness test, prunes categories that become stale.
- While all of those techniques modify the position of the winner when the network is presented with a data vector, the learning mechanism in ILFD consists of reinforcing the winning category from the class of the data vector and pushing away the second winner from a neighboring class in order to reduce the overlap between categories (a minimal sketch of this update is given after this list).
- While the other approaches are either self-supervised or need to match the input with all existing categories, ILFD compares the input first only with categories having the same label as the input and then, separately, with categories of other labels.
- ILFD can also deal with the problem of partially labeled data. Indeed, even unlabeled data can be used during the training stage.
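To illustrate the push-pull mechanism described in the third point above, the sketch below reinforces the winning category of the sample's class and pushes away the closest category of a different class, in the style of learning vector quantization. The learning rates and the restriction to two categories per step are our own simplifications; this is not the ILFD algorithm itself.

```python
import numpy as np

def push_pull_update(categories, x, label, lr_pull=0.1, lr_push=0.05):
    """categories: list of dicts {'center': ndarray, 'label': class}.
    Pull the nearest same-class category toward x, push the nearest
    other-class category away (illustrative LVQ-style step)."""
    x = np.asarray(x, dtype=float)
    same = [c for c in categories if c["label"] == label]
    other = [c for c in categories if c["label"] != label]
    if same:
        winner = min(same, key=lambda c: np.linalg.norm(x - c["center"]))
        winner["center"] += lr_pull * (x - winner["center"])    # reinforce
    if other:
        rival = min(other, key=lambda c: np.linalg.norm(x - c["center"]))
        rival["center"] -= lr_push * (x - rival["center"])      # push away

cats = [{"center": np.array([0.0, 0.0]), "label": "A"},
        {"center": np.array([1.0, 1.0]), "label": "B"}]
push_pull_update(cats, [0.2, 0.1], "A")
print(cats[0]["center"], cats[1]["center"])
```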
Moreover, the characteristics of ILFD can be compared to other models such as fuzzy ARTMAP
(FAM), min-max neural networks (MMNN), nearest generalized exemplar (NGE), and growing
neural gas (GNG) as shown in Tab. I (Bouchachia et al., 2007).
TABLE I: Characteristics of some IL algorithms

Characteristics              FAM        MMNN       NGE        GNG          ILFD
Online learning              Y          Y          Y          Y            Y
Type of prototypes           Hyperbox   Hyperbox   Hyperbox   Graph node   Cluster center
Generation control           Y          Y          Y          Y            Y
Shrinking of prototypes      N          Y          Y          U            U
Deletion of prototypes       N          N          N          Y            Y
Overlap of prototypes        Y          N          N          U            U
Growing of prototypes        Y          Y          Y          U            U
Noise resistance             U          Y          U          U            U
Sensitivity to data order    Y          Y          Y          Y            Y
Normalization                Y          Y          Y/N        N            Y/N

Legend: Y: yes; N: no; U: unknown/undefined
In our research, we have tried to stick to the spirit of IL. To put it clearly, an IL algorithm, in our
view, should fulfill the following characteristics:

- Ability of life-long learning, dealing with both plasticity and stability
- Old data is never used in subsequent learning stages
- No prior knowledge about the (topological) structure of the system is needed
- Ability to incrementally tune the structure of the system
- No prior knowledge about the statistical properties of the data is needed
- No prior knowledge about the number of existing classes or the number of categories per class, and no prototype initialization, is required.
FUTURE TRENDS
The problem of incrementality remains a key aspect in learning systems. The goal is to
achieve adaptive systems that are equipped with self-correction and evolution mechanisms.
However, many issues, which can be seen as shortcomings of existing IL algorithms, remain open and are therefore worth investigating:

- Order of data presentation: All of the proposed IL algorithms suffer from sensitivity to the order of data presentation. Usually, the inferred classifiers are biased by this order: different presentation orders result in different classifier structures and therefore in different accuracy levels. It is therefore very relevant to look closely at developing algorithms whose behavior is independent of the presentation order, which is usually a desired property.

- Category proliferation: The problem of category proliferation in the context of clustering and classification refers to the generation of a large number of categories. This number is in general proportional to the granularity of the categories: a fine category size implies a large number of categories, whereas a larger size implies fewer categories. Usually, each IL algorithm has a parameter that controls the process of category generation, and the problem is to find the appropriate value of such a parameter. This is clearly related to the problem of plasticity, which plays a central role in IL algorithms. Hence the questions: how can we distinguish between rare events and outliers, and which value of the controlling parameter allows making such a distinction? This remains a difficult issue.

- Number of parameters: One of the most important shortcomings of the majority of IL algorithms is the large number of user-specified parameters involved. It is usually hard to find the optimal values of these parameters. Furthermore, they are very sensitive to the data; in general, to obtain high accuracy, the settings must be changed from one data set to another. In this context, there is a real need to develop algorithms that do not depend heavily on many parameters or that can optimize such parameters.

- Self-consciousness & self-correction: Distinguishing between noisy input data and rare events is crucial not only for category generation but also for correction. In the current approaches, IL systems cannot correct wrong decisions made previously, because each sample is treated once and any decision about it has to be taken at that time. Now, assume that at processing time a sample x was considered noise while in reality it was a rare event, and that the same rare event is later discovered by the system. Ideally, the system should recall that the sample x has to be reconsidered. Current algorithms are not able to adjust the system by re-examining old decisions. Thus, IL systems have to be equipped with some memory in order to become smart enough to do so.

- Data drift: One of the most difficult questions worth looking at is related to drift. Little, if any, attention has been paid to the application and evaluation of the aforementioned IL algorithms in the context of drifting data, although a changing environment is one of the crucial assumptions behind all of these algorithms. Furthermore, while there are many publicly available datasets for testing systems in a static setting, there are very few benchmark data sets for dynamically changing problems, and those that exist are usually artificial. It is very important for the IL community to have a repository, similar to the UCI repository at Irvine, in order to evaluate the proposed algorithms in evolving environments.
As a final aim, research in the IL framework has to focus on incremental yet stable algorithms that are transparent, self-corrective, less sensitive to the order of data arrival, and whose parameters are less sensitive to the data itself.
CONCLUSION
Building adaptive systems that are able to deal with nonstandard learning settings is one of the key research avenues in machine learning, data mining, and knowledge discovery.
Adaptivity can take different forms, but the most important one is certainly incrementality. Such
systems are continuously updated as more data becomes available over time. The appealing
features of IL, if taken into account, will help integrate intelligence into knowledge learning
systems. In this chapter we have tried to outline the current state of the art in this research area
and to show the main problems that remain unsolved and require further investigation.
REFERENCES
Aggarwal, C., Han, J., Wang, J., & Yu, P. (2004). On demand classification of data streams.
International Conference on Knowledge Discovery and Data Mining, pages:503-508.
Bouchachia, A. & Mittermeir, R. (2006). Towards fuzzy incremental classifiers. Soft Computing,
11(2):193-207, January 2007.
Bouchachia, A., Gabrys, B. & Sahel, Z. (2007). Overview of some incremental learning
algorithms. To appear in proc. of the 16th IEEE international conference on fuzzy systems, IEEE
Computer Society, 2007
Bouchachia, A. (2006a). Learning with incrementality. The 13th International conference on
neural information processing, LNCS 4232, pages: 137-146.
Bouchachia, A. (2006b). Incremental learning via function decomposition. The 5th International
conference on machine learning and applications, pages: 63-68, IEEE Computer Society, 2006.
Carpenter, G., Grossberg, S., & Rosen, D. (1991). Fuzzy ART: Fast stable learning and
categorization of analog patterns by an adaptive resonance system. Neural Networks, 4(6):759–
771.
Chakraborty, D., & Pal, N. (2003). A novel training scheme for multilayered perceptrons to
realize proper generalization and incremental learning. IEEE Transactions on Neural Networks,
14(1):1-14.
Chang, J., & Lee, W. (2004). A sliding window method for finding recently frequent itemsets over online data streams. Journal of Information Science and Engineering, 20(4):753-762.
Charikar, M., Chen, K., & Farach-Colton, M. (2004). Finding frequent items in data streams.
International Colloquium on Automata, Languages and Programming, pages: 693–703.
Cheung, D., Han, J., Ng, V., & Wong, C. (1996). Maintenance of discovered association rules in large databases: An incremental updating technique. IEEE International Conference on Data Mining, pages: 106-114.
Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. The ACM 6th International Conference on
Knowledge Discovery and Data Mining, pages: 71-80.
Domeniconi, C. & Gunopulos, D. (2001). Incremental Support Vector Machine Construction.
International Conference on Data Mining, pages: 589-592.
Freeman, J. & Saad, D. (1997). On-line learning in radial basis function networks. Neural
Computation, 9:1601-1622.
Fritzke, B. (1994). Fast learning with incremental RBF networks. Neural Processing Letters,
1(1):25.
Fritzke, B. (1995). A growing neural gas network learns topologies. Advances in neural
information processing systems, pages 625-632.
Giannella, C., Han, J., Pei, J., Yan, X., & Yu, P. (2003). Mining frequent patterns in data streams
at multiple time granularities. Workshop on Data Mining: Next Generation Challenges and
future Directions, AAAI.
Grippo, L. (2000). Convergent on-line algorithms for supervised learning in neural networks.
IEEE Trans. on Neural Networks, 11:1284–1299.
Guha, S., Mishra, N., Motwani, R., & O'Callaghan, L. (2000). Clustering data streams. IEEE
Symposium on Foundations of Computer Science, pages:359-366.
Last, M. (2002). Online classification of non-stationary data streams, Intelligent Data Analysis,
6(2):129-147.
Lee, S., & Cheung, D. (1997). Maintenance of discovered association rules: When to update? SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.
Lin, C., Chiu, D., Wu, Y., & Chen, A. (2005). Mining frequent itemsets from data streams with a time-sensitive
sliding window. International SIAM Conference on Data Mining.
Maloof, M., & Michalski, R. (2004). Incremental learning with partial instance memory.
Artificial Intelligence 154:95–126.
Michalski, R. (1985). Knowledge repair mechanisms: evolution vs. revolution. International
Machine Learning Workshop, pages 116–119.
Parikh, D. & Polikar, R. (2007). An ensemble-based incremental learning approach to data
fusion. IEEE Transactions on Systems, Man, and Cybernetics, 37(2):437-450.
Polikar, R., Udpa, L., Udpa, S. & Honavar, V. (2000). Learn++: An incremental learning
algorithm for supervised neural networks. IEEE Trans. on Systems, Man, and Cybernetics,
31(4):497–508.
Rainsford, C., Mohania, M., & Roddick, J. (1997). A temporal windowing approach to the
incremental maintenance of association rules. International Database Workshop, Data Mining,
Data Warehousing and Client/Server Databases, pages:78-94.
Reinke, R., & Michalski, R. (1988). Machine intelligence, chapter: Incremental learning of
concept descriptions: a method and experimental results, pages 263–288.
Salzberg, S. (1991). A nearest hyperrectangle learning method. Machine learning, 6:277–309.
Simpson, P. (1992). Fuzzy min-max neural networks. Part 1: Classification. IEEE Trans. Neural
Networks, 3(5):776-786.
Utgoff, P. (1988). ID5: An incremental ID3. International Conference on Machine Learning,
pages 107–120.
Yu, J., Chong, Z., Lu, H., & Zhou, A. (2004). False positive or false negative: mining frequent
itemsets from high speed transactional data streams. International Conference on Very Large
Databases, pages: 204-215.
KEY TERMS AND THEIR DEFINITIONS
Knowledge learning: The process of automatically extracting knowledge from data.
Incrementality: The characteristic of an algorithm that is capable of processing data which
arrives over time sequentially in a stepwise manner without referring to the previously seen data.
Stability: A learning algorithm is totally stable if it keeps the acquired knowledge in memory
without any catastrophic forgetting.
Plasticity: A learning algorithm is completely plastic if it is able to continually learn new
knowledge without any requirement on preserving previously seen data.
Data drift: Unexpected change over time of the data values (according to one or more dimensions).
Keywords: online learning, incrementality, adaptivity, model evolution, stability-plasticity