Project Acronym: FIRST
Project Title: Large scale information extraction and integration infrastructure for supporting financial decision making
Project Number: 257928
Instrument: STREP
Thematic Priority: ICT-2009-4.3 Information and Communication Technology

D2.3 Scaling Strategy

Work Package: WP2 – Technical analysis, scaling strategy and architecture
Due Date: 30/09/2011
Submission Date: 30/09/2011
Start Date of Project: 01/10/2010
Duration of Project: 36 Months
Organisation Responsible for Deliverable: ATOS
Version: 1.0
Status: Final Version
Author(s): Mateusz Radzimski, Murat Kalender (ATOS), Miha Grcar, Marko Brakus, Igor Mozetic, Elena Ikonomovska, Saso Dzeroski (JSI), Markus Gsell (IDMS), Tobias Hausser (UHOH), Joao Gama
Reviewer(s): Tomas Pariente (ATOS), Michael Siering, Mykhalio Saienko (GUF)
Nature: R – Report | P – Prototype | D – Demonstrator | O – Other
Dissemination level: PU – Public | CO – Confidential, only for members of the consortium (including the Commission) | RE – Restricted to a group specified by the consortium (including the Commission Services)
Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)
Revision history
Version | Date | Modified by | Comments
0.1 | 11/07/2011 | Mateusz Radzimski (ATOS) | First version of ToC provided
0.2 | 3/08/2011 | Murat Kalender (ATOS) | First contribution to "Analytical pipeline scaling techniques"
0.3 | 5/08/2011 | Mateusz Radzimski (ATOS) | First contributions to "Scaling strategy outline" and "common scaling plan"
0.4 | 9/08/2011 | Markus Gsell (IDMS) | First contribution to "Information integration services"
0.5 | 10/08/2011 | Mateusz Radzimski (ATOS), Murat Kalender (ATOS) | Further contributions to "Global Scaling Strategy" chapter. Minor editorial changes.
0.6 | 17/08/2011 | Miha Grcar (JSI) | Contributions to chapters "Data acquisition and preprocessing services" and "Decision support and visualisation services", various smaller contributions to other parts of the document.
0.7 | 5/09/2011 | Tobias Haeusser (UHOH) | Contribution to "Information Extraction services" chapter.
0.8 | 12/09/2011 | Miha Grcar, Marko Brakus, Igor Mozetic, Elena Ikonomovska, Saso Dzeroski (JSI), Joao Gama | Update of chapter 3.5 "Decision support and visualisation services"
0.9 | 12/09/2011 | Mateusz Radzimski (ATOS) | Contribution to "Integration infrastructure" chapter, various contributions to "Global scaling strategy" chapter.
0.9.5 | 13/09/2011 | Mateusz Radzimski, Murat Kalender (ATOS) | Editorial changes
0.96 | 14/09/2011 | Mateusz Radzimski (ATOS) | "Executive Summary" and "Conclusion" chapters, minor editorial changes. Document reaches "final draft" status and is sent for internal review.
0.97 | 15/09/2011 | Markus Gsell (IDMS) | Contributions and editing of "Information integration services"
0.98 | 30/09/2011 | Mateusz Radzimski (ATOS), Achim Klein (UHOH), Murat Kalender (ATOS) | Addressing reviewers' comments. Final version.
1.0 | 30/09/2011 | Tomás Pariente (ATOS) | Final QA and preparation for submission
Copyright © 2011, FIRST Consortium
The FIRST Consortium (www.project-first.eu) grants third parties the right to use and distribute
all or parts of this document, provided that the FIRST project and the document are properly
referenced.
THIS DOCUMENT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
DOCUMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
Executive Summary
The scaling strategy constitutes an important part of the technical design of the overall FIRST project. It specifies technical details and devises a plan for achieving the scalability goals with regard to processing large data volumes in a timely manner, in order to comply with the project objectives.
This document provides a general overview of the suitable scaling techniques that are to be applied within the project. On the one hand, it covers scaling of the overall system architecture and describes possible scenarios for performance improvement. On the other hand, it presents methods for scaling the individual technical components that correspond to the major functionalities of the system.
This document influences the development process of the FIRST system by defining a scalability roadmap with defined milestones and objectives that concern every technical aspect of the project. It aims at continuous and iterative improvement of the system performance and throughput until the target capabilities are reached.
Table of Contents
Executive Summary ...................................................................................................... 4
Abbreviations and acronyms ....................................................................................... 7
1. Introduction ............................................................................................................ 8
2. Global scaling strategy .......................................................................................... 9
2.1. Scaling strategy outline ..................................................................................... 9
2.2. Scaling analytical pipeline ............................................................................... 11
2.2.1 Overview of pipeline scaling scenarios .................................................... 11
2.2.2 Handling data peaks in the analytical pipeline.......................................... 15
2.3. Summary ......................................................................................................... 17
3. Individual scaling plans ....................................................................................... 18
3.1. Data acquisition and preprocessing services .................................................. 18
3.2. Semantic resources......................................................................................... 19
3.3. Information extraction services ........................................................................ 20
3.4. Information integration services ...................................................................... 23
3.5. Decision support and visualisation services .................................................... 27
3.5.1 Scaling techniques for clustering and classification ................................. 28
3.5.2 Learning model trees from data streams .................................................. 28
3.6. Integration infrastructure ................................................................................. 29
4. Conclusions .......................................................................................................... 31
References ................................................................................................................... 32
Annex 1. Clustering for topic and trend detection .............................................. 35
Annex a. Introduction .............................................................................................. 35
Annex b. Document streams ................................................................................... 35
Annex c. Clustering document streams................................................................... 35
Annex d. Topic detection ......................................................................................... 37
Annex e. Trend detection ........................................................................................ 37
Annex f. Visualization ............................................................................................. 37
Annex g. Clustering for active learning .................................................................... 38
Annex h. Conclusions.............................................................................................. 38
Annex 2. Learning model trees from data streams ............................................. 39
Annex a. Introduction .............................................................................................. 39
Annex b. Related work ............................................................................................ 39
Annex c. The FIMT-DD algorithm ........................................................................... 41
Annex d. Conclusions.............................................................................................. 42
Index of Figures
Figure 1: Scaling strategy in relation with other workpackages ..................................................... 8
Figure 2: FIRST scaling strategy outline ...................................................................................... 10
Figure 3: Scale-up and scale-out approach ................................................................................... 12
Figure 4: Load balancing of requests (ZeroMQ, 2011). ............................................................... 12
Figure 5: Scalability test of analytical pipeline using the parallelization technique ..................... 13
Figure 6: Pipeline splitting scenario .............................................................................................. 14
Figure 7: ZeroMQ space-time scalability experiment result (ØMQ (version 0.3) tests, 2011) .... 15
Figure 8: Performance comparisons of the approaches for handling data peaks .......................... 16
Figure 9: WP3 and WP4 pipeline integration with extra buffer for data peak handling ............... 17
Figure 10: Data acquisition and preprocessing pipeline at M12 (taken from (FIRST D2.2
Conceptual and technical integrated architecture design, 2011)) ................................................. 18
Figure 11: Information extraction scaling approach ..................................................................... 21
Index of Tables
Table 1: Scaling strategy for prototype release cycles .................................................................. 11
Table 2: Scaling plan for the data acquisition and preprocessing pipeline ................................... 19
Table 3: Scaling plan for the semantic resources .......................................................................... 20
Table 4: Development plan and scaling plan for Information Extraction ..................................... 22
Table 5: Rough estimate of database operations ........................................................................... 26
Table 6: Scaling plan for the knowledge base............................................................................... 27
Table 7: Scaling plan for the decision-support models ................................................................. 28
Table 8: Scaling plan for Integration infrastructure ...................................................................... 30
Abbreviations and acronyms
DoW – Description of Work
WP – Workpackage
TBD – To be defined
SOA – Service Oriented Architecture
NP – Nondeterministic Polynomial Time
ESB – Enterprise Service Bus
RUP – Rational Unified Process
CPU – Central Processing Unit
REQ/REP – Request/Reply
M12 – Month 12
RSS – RDF Site Summary (also dubbed Really Simple Syndication)
HTML – Hypertext Markup Language
PDF – Portable Document Format
XML – Extensible Markup Language
SVM – Support Vector Machines
JAPE – Java Annotation Patterns Engine
NoSQL – sometimes referred to as Not Only SQL
UC – Use Case
1. Introduction
This document provides important insights into the plan for realizing the scalability goals of the FIRST system. Given that the primary objective of the system is to analyze large volumes of data within a short processing time, most of the technical aspects must be designed with performance in mind. One conclusion is that all components dealing with data processing should offer high scalability, conforming to the envisaged capacity of the system. This means choosing the best algorithms and state-of-the-art techniques for accomplishing certain tasks within the analytical processing pipeline (see (FIRST D2.1 Technical requirements and state-of-the-art, 2011) and (FIRST D2.2 Conceptual and technical integrated architecture design, 2011)).
However, ensuring scalability at the component level is only one side of the coin. The separate components must be further integrated into a common architecture that is robust enough to keep up with the performance and system requirements (see (FIRST D1.2 Use case requirements specification, 2011)). Therefore, architecture scalability is another important factor of the scaling strategy, ensuring coherence with the system design. This document encompasses both a global strategy that applies at the system integration level and local, component-specific plans, and it will further influence the development process in FIRST (see Figure 1).
[Figure omitted: the requirements and system objectives (WP1) and the integrated architecture (D2.2) influence the scaling strategy (D2.3), which in turn influences the technical components developed in WP3–WP7.]
Figure 1: Scaling strategy in relation with other workpackages
2. Global scaling strategy
2.1. Scaling strategy outline
The global scaling strategy describes a roadmap for achieving the project goals for the whole FIRST system with regard to scalability. The main goal is to ensure that the whole system and all its components are able to fulfil the baseline project requirements of processing high volumes of data and providing timely results. The methodology followed to reach these goals is based on an incremental release and evaluation approach, resembling the Rational Unified Process (RUP) (IBM Rational Unified Process v7.0, 2008), where all functionalities are improved from a functional (implementation plan) and a scalability (scaling strategy) point of view throughout the project lifetime.
The scaling strategy is aligned with the prototype release cycle and divided into several milestones, each of which provides objectives for overall system scaling. By providing a systematic approach and a set of measurable goals, we enable continuous evaluation of the results in order to detect and respond to risks as early as possible, which is crucial in developing the research prototype.
The scaling strategy plan encompasses a twofold view of the project:
- Scalability of the overall FIRST system (hereafter called "global scaling strategy")
- Scalability of the particular FIRST building blocks (called "individual scaling plans")
The idea of distinguishing between these two aspects is to analyse scaling challenges and approaches from the global perspective ("global scaling strategy") and from the point of view of the different technical components ("individual scaling plans"). The individual scaling plans depict a bottom-up view and explain how the FIRST components contribute to achieving scalability and what the challenges and achievable goals are. They focus on providing a lower-level, component-based overview of scaling issues, such as the choice of algorithms or the usage of proper technological solutions. While this deliverable mentions how those techniques contribute to achieving the scalability goals, technical details will be presented in the deliverables of the respective workpackages.
The global scaling strategy, on the other hand, presents project-wide approaches that are orthogonal to the individual scaling plans and common to most technical components. Its focus is on integration aspects, providing a scaling infrastructure that enables techniques such as scaling up, scaling out, parallelization and proper resource utilisation. The central aspects of analysis covered in this view are the scaling methods for the FIRST analytical pipeline. The global scaling strategy also takes into account the limits of the individual components and aligns them into the common plan, ensuring that the required system capabilities are met by the individual components. In this sense, the global and individual scaling strategies influence each other.
Following an incremental building approach, the scaling strategy adapts to the devised project development plan and to the prototype release cycles. Therefore, a scalability goal is defined for each prototype release milestone. The general approach for preparing the infrastructure to handle the envisaged amount of data and to comply with the scaling requirements is to (1) constantly improve the algorithms in order to provide results in near real-time, and (2) scale such solutions to handle vast amounts of data (see Figure 2).
The high-level view of the scaling strategy, given in the DoW, suggests that we should first scale the data volume (from small to large historical datasets) and then change the processing paradigm (from dataset processing to real-time data stream processing). In reality, we first need to change the processing paradigm and then scale the data volume (from relatively "slow" data streams to vast data streams). Since the final goal of the project is not to process large historical datasets but rather to process data streams in near-real time, the effort put into scaling from small to large (historical) datasets would not be reflected in the final product. Switching to stream processing earlier in the process therefore allows us to uncover and address any possible limitations of this approach. The outline of these techniques is presented in the following chapters.
[Figure omitted: the two axes range from small to vast amounts of data and from historical data with little time constraints to live data and news feeds with near real-time response; from P1 to P2 the focus is on ensuring that algorithms are time-performant and scalable, and from P2 to P3 on scaling the data volume.]
Figure 2: FIRST scaling strategy outline
Table 1 outlines the scaling strategy related to the prototype release cycle as envisaged in the DoW.

M12 (milestone M2) – M12 early prototype ("Early Bird"): Early demo providing first insight into the WP3 Data Acquisition and WP4 Information Extraction prototypes.

M18 (milestone M3) – Stage 1: 1st Prototype (P1): First preliminary release of the Integrated Financial Market Information system, showing some components of the FIRST Analytical Pipeline (WP3, WP4, WP6) and visualisation prototypes at work. The integration prototype is also present, allowing for lightweight pipeline integration and employing the messaging approach at its core. The purpose of the 1st prototype is to show how some use case tasks are realized by the FIRST system. No scaling goals are defined, but the prototype and the infrastructure form a testbed for performance tests and continuous scalability improvements.

M24 (milestone M4) – Stage 2: 2nd Prototype (P2), live data (near real-time): The 2nd prototype is oriented at improving algorithms in order to shorten the time required to analyze incoming data, ensuring efficiency with regard to resources and computation time. Although it may already be capable of processing larger amounts of data, its validation focuses on near real-time provision of results.

M33 (milestone M5) – Final version (P3), vast amounts of data: The final version of the software is able to handle the target data load – vast amounts of data in a short time. It is validated against the final requirements with regard to timeliness and data volume (number of data sources and number of processed articles per day). Algorithms and infrastructure are prepared for efficient scaling with regard to data load within the specified limits. The global scaling strategy and the individual scaling plans are fully implemented.

Table 1: Scaling strategy for prototype release cycles
2.2. Scaling analytical pipeline
2.2.1 Overview of pipeline scaling scenarios
The analytical pipeline is the core of the project; therefore, scalability of the analytical pipeline is crucial for the overall scalability of the FIRST project. The preliminary basis for the scaling strategy has been outlined in (FIRST D2.1 Technical requirements and state-of-the-art, 2011). This section devises scaling scenarios such as pipeline parallelisation, pipeline splitting, load balancing and pipeline multiplication.
Pipeline parallelisation is proposed in (FIRST D2.1 Technical requirements and state-of-the-art, 2011) as a scaling strategy to increase the throughput and decrease the latency of the pipeline. Parallelisation means processing several inputs coming from upstream pipeline components with multiple identical components that work in parallel. In an optimal scenario, it simply means adding more processing units for those components that work more slowly than the other components in the pipeline. More processing units can be provided by scaling horizontally or vertically. Scaling horizontally is achieved by adding more nodes (computers) to a system. Scaling vertically is achieved by adding more resources, such as additional CPUs and memory, to a node in the system (see Figure 3). In pipeline processing, vertical scaling can be achieved by running the whole pipeline on a faster machine, while scaling horizontally means running multiple pipelines (or pipeline fragments) on more machines.
Figure 3: Scale-up and scale-out approach1
Parallelisation requires distribution of work between multiple identical components. The workload has to be distributed effectively between the components to achieve optimal resource utilization and maximize throughput, which is called load balancing. Inputs between components in the FIRST analytical pipeline are delivered using the messaging approach described in (FIRST D2.2 Conceptual and technical integrated architecture design, 2011), which supports techniques for improving scalability such as parallelisation, pipeline splitting and load balancing. ZeroMQ2 has been chosen as the messaging implementation. Figure 4 shows load balancing of work between three parallel services using ZeroMQ messaging. There are four messages (R1, R2, R3, and R4): two of them are sent to Consumer A, and the other two are sent to Consumer B and Consumer C.
Figure 4: Load balancing of requests (ZeroMQ, 2011).
1 Source: Best Practices in building scalable cloud-ready Service based systems, CodeCamp 11, http://igorshare.wordpress.com/2009/03/29/codecamp-11-presentation-best-practices-in-building-scalable-cloudready-service-based-systems/
2 http://www.zeromq.org/
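To illustrate how such messaging-based load balancing can be wired up, the following sketch uses the ZeroMQ PUSH/PULL pattern through the Python binding (pyzmq); the endpoint, port and message payloads are illustrative assumptions rather than the actual FIRST configuration. A PUSH socket fair-queues outgoing messages over all connected PULL sockets, so an additional consumer process, on the same or on another machine, automatically takes its share of the load.

    import zmq

    def producer(endpoint="tcp://*:5557", n_messages=4):
        """Emit messages; ZeroMQ distributes them round-robin over all connected consumers."""
        ctx = zmq.Context.instance()
        push = ctx.socket(zmq.PUSH)
        push.bind(endpoint)
        for i in range(n_messages):
            push.send_json({"id": "R%d" % (i + 1), "payload": "document text ..."})

    def consumer(name, endpoint="tcp://localhost:5557"):
        """One of several identical components working in parallel (run as a separate process)."""
        ctx = zmq.Context.instance()
        pull = ctx.socket(zmq.PULL)
        pull.connect(endpoint)            # may connect from a different machine (pipeline splitting)
        while True:
            msg = pull.recv_json()
            print(name, "processing", msg["id"])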
Parallel components do not have to run on the same computer. The pipeline can be split by distributing components among several machines. For example, the data acquisition service may run on one computer and multiple information extraction services may run on another. In this way, components can occupy more resources than when they share a machine with other components. ZeroMQ uses sockets to connect applications over the TCP protocol, which enables pipeline splitting and distributed processing over the network. Since TCP/IP connectivity is scalable, the throughput of the pipeline would increase roughly in proportion to the number of computers.
To measure the scalability of the ZeroMQ messaging approach, a pipeline parallelisation experiment was carried out. A test system was prepared that transfers messages between two types of components: one component produces messages and the other consumes them. The message consumer is a time-consuming processing component, which slows down the pipeline. The throughput of the test system was observed with a varying number of message consumer components. The experiment was run on a 48-core machine. Figure 5 shows the experiment results: the throughput increases linearly with the number of components, so we can conclude that the analytical pipeline and the ZeroMQ-based messaging approach are highly scalable.
Figure 5: Scalability test of analytical pipeline using the parallelization technique
Parallelisation of the FIRST pipeline will be done statically, based on the average latency of each component and the available processing power. First, the latency of each component will be analyzed. Based on the results, slower components will be given more instances that work in parallel. For example, if the data acquisition component retrieves 2 documents per second and information extraction can process 1 document per second, there will be one data acquisition component and two information extraction components in the pipeline. After balancing the pipeline, the resource consumption of the pipeline will be observed. If it does not use all the resources of the computer, multiple pipelines will be executed in parallel to consume all available resources and increase the overall throughput.
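The static balancing described above can be derived directly from measured per-component throughput. The following sketch, with purely illustrative figures that follow the 2-documents-per-second versus 1-document-per-second example, computes how many parallel instances of each component are needed so that no stage throttles the pipeline.

    import math

    # Throughput of a single instance of each component (documents per second);
    # the numbers are illustrative, not measured FIRST values.
    throughput = {
        "data_acquisition": 2.0,
        "information_extraction": 1.0,
    }

    def instances_needed(throughput, target_rate=None):
        """Parallel instances per component needed to sustain target_rate documents/second.

        If no target rate is given, balance against the fastest single component."""
        if target_rate is None:
            target_rate = max(throughput.values())
        return {name: math.ceil(target_rate / rate) for name, rate in throughput.items()}

    print(instances_needed(throughput))   # {'data_acquisition': 1, 'information_extraction': 2}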
Another approach for scaling out the pipeline is to perform pipeline splitting, i.e. separating pipeline components and running them on more machines (see Figure 6). This improves scalability when pipeline components need more computing resources than a single machine can offer. Pipeline splitting is made possible by decoupling components and using messaging to integrate them across distributed machines. In such a scenario, every component can occupy more resources. For computing-intensive tasks this may result in shorter computation time and lower pipeline delays.
[Figure omitted: a whole pipeline of five components occupying 100% of the available resources of a single processing unit (node) is split, via messaging, into two pipeline parts running on two nodes, each part occupying 100% of the available resources of its node.]
Figure 6: Pipeline splitting scenario
The scalability of the messaging integration approach is a very important factor for the scalability of the analytical pipeline, but it is also important how it is applied in the project. Poor architectural design may result in serious scalability problems, e.g. if the messaging middleware blocked the message sender and receiver until a message is transferred. Waiting for a message transfer would cause performance and scalability problems, because the components would depend on each other and one would block the other. To handle this issue, the messaging integration solution described in (FIRST D2.2 Conceptual and technical integrated architecture design, 2011) is implemented following an asynchronous communication pattern with a multi-threaded design. Separate threads are responsible for receiving and transferring messages for each component. Additionally, each messaging thread keeps a buffer queue of messages to support a constant flow of data. In this way, components do not block each other.
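A minimal sketch of this asynchronous pattern is given below: the component hands outgoing messages to a dedicated sender thread through a bounded in-memory queue, so it never blocks on the actual transfer. The class name, buffer size and transfer callback are illustrative; in FIRST the transfer itself is performed by the ZeroMQ-based middleware.

    import queue
    import threading

    class AsyncSender:
        """Decouples a pipeline component from message transfer via a buffered queue."""

        def __init__(self, transfer_fn, buffer_size=100):
            self._queue = queue.Queue(maxsize=buffer_size)   # bounded queue: space scalability
            self._transfer = transfer_fn                     # e.g. a ZeroMQ send operation
            threading.Thread(target=self._run, daemon=True).start()

        def send(self, message):
            # Returns immediately as long as the buffer is not full, so the producing
            # component is not blocked by a slow consumer or by the network transfer.
            self._queue.put(message)

        def _run(self):
            while True:                                      # a separate thread performs the transfer
                self._transfer(self._queue.get())

    # Usage sketch:
    # sender = AsyncSender(transfer_fn=lambda m: print("sent", m))
    # sender.send({"document": "..."})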
The scalability of a system can be analyzed along various dimensions. So far, we have analyzed the analytical pipeline in terms of load scalability, i.e. the ability to increase resource consumption in order to handle heavier loads. Space and space-time scalability are two other important types (Bondi, 2000).
Space scalability is the ability to handle an increasing number of items without consuming an excessive amount of memory. Space scalability is addressed in the analytical pipeline by limiting the queue size of the messaging threads to a fixed number. In our experiments, we observed that the memory consumption of the analytical pipeline is stable and does not increase with heavier loads.
Space-time scalability is the ability to handle large items (big messages in our context) without decreasing the throughput of the system. Messages within the analytical pipeline may vary in size, and the integration system has to handle all types of messages without a performance loss. The vendors of ZeroMQ have published a space-time scalability experiment; as shown in Figure 7, the time needed to send a message increases linearly with its size. Thus, we can conclude that ZeroMQ is also scalable in the space-time dimension.
Figure 7: ZeroMQ space-time scalability experiment result (ØMQ (version 0.3) tests, 2011)
2.2.2 Handling data peaks in the analytical pipeline
Information exchange between components in the analytical pipeline is done by sending messages between them. The message receiver module of the FIRST messaging system keeps received messages in a queue. When data peaks occur in the pipeline, components cannot process all the received messages and their queues overflow. In order to handle the queue overflow problem, a new messaging channel has been added to the integration system to inform the message sender component about the status of the message receiver's queue.
Since the analytical components constantly observe their own input queues, a data peak can be identified by each individual component. A data peak (from the perspective of a specific component) occurs when the number of data items (requests) in the queue exceeds a certain predefined threshold (e.g., 100 items). If this situation occurs, the component fires its peak-handling logic in order to reduce the number of items in the queue. The strategies for reducing the number of queued items range from simple and pragmatic to relatively complex solutions.
A very simple solution may involve using control messages to pause and resume the traffic. When the queue size exceeds the maximum value, a "wait" message is sent to the message producer. In this case, the message producer stops sending messages until it receives a "continue" message from the message consumer. After the message receiver has consumed all messages in the queue, a "continue" message is sent to the message producer and messaging between the components resumes. However simple, such a solution slows down data processing and only moves the problem to the message producer, causing overflow at earlier stages of the pipeline. The more complex solutions include, for example, semantic load shedding, where the content in the queue is clustered in order to select representative instances. This ensures that the different topics identified in the queue are all represented in the final model and thus in the end-user application. In FIRST, we do not plan to resort to such complex solutions but rather to one of the pragmatic alternatives. The two pragmatic approaches are:
- dropping the request that tries to enter a full queue,
- dropping every second request from the queue when the queue fills up (i.e., sampling).
For the applications in FIRST, the second approach is more appropriate, as, in contrast to the first approach, it still allows recent content to pass through the pipeline. For more information on data reduction techniques, see (FIRST D2.1 Technical requirements and state-of-the-art, 2011), Section 2.5.2.1 and (Barbara & others, 1997). Additionally, the messaging control channel can be used for controlling the behaviour of the data peak handling.
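A minimal sketch of the second pragmatic approach, under the assumptions stated above (a threshold of 100 queued items and an in-memory queue of requests), is given below: once a data peak is detected, every second item is dropped from the queue, i.e. the stream is sampled, so that recent content keeps flowing through the pipeline.

    from collections import deque

    PEAK_THRESHOLD = 100     # predefined queue-length threshold (example value from the text)

    def handle_data_peak(input_queue):
        """If the queue signals a data peak, keep only every second item (sampling)."""
        if len(input_queue) <= PEAK_THRESHOLD:
            return input_queue                                   # no peak: leave the queue untouched
        return deque(item for i, item in enumerate(input_queue) if i % 2 == 0)

    # Example: a queue of 150 queued requests is reduced to 75, preserving their order.
    print(len(handle_data_peak(deque(range(150)))))              # 75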
Another aspect of handling data peaks is whether we store the "dropped" data for later processing (when the queues are empty again), e.g. by (i) using a broker-based messaging system and sending new messages to a broker, or (ii) writing incoming messages to files for later processing.
For the first approach, a stable broker-based system is required, which also handles queue overflow through its persistence mechanisms. Performance is the selection criterion between the two approaches. Broker-based messaging (e.g. ActiveMQ1) and a file storage feature have been tested in the messaging system for this purpose. Briefly, in the experiment these features were activated after the "wait" messages had been received: newly arriving messages were sent to the broker or written to a file, and when the message receiver was ready to consume new messages again, they were transferred to it. The performance of these two approaches was tested on the test dataset that was used for evaluating the pipeline and request-reply patterns in the previous section. In the experiments, all messages were sent via ActiveMQ for the broker approach, while for the file storage approach they were sent with ZeroMQ after being written to files and read back. Figure 8 shows the performance of these two approaches with regard to the overall throughput.
Figure 8: Performance comparisons of the approaches for handling data peaks
The file storage approach performs significantly better than the broker-based approach in the experiments. Figure 9 shows an example architecture of the Data Acquisition and Information Extraction pipeline integration with data peak handling using a file buffer.
1 http://activemq.apache.org/
Figure 9: WP3 and WP4 pipeline integration with extra buffer for data peak handling
Even though this mechanism is relatively easy to implement, the essence of the solution lies in extending the message buffer from a memory-based buffer to a disk buffer (a minimal sketch of this idea is given below). Using such a solution makes sense when data arrives in large bursts of messages that could not be processed otherwise. As a trade-off, messages might be significantly delayed while they wait for their turn to be "replayed" back into the pipeline, which in the long run might be an unwanted side effect. In a near real-time system, such as the one we aim to develop in FIRST, we would rather drop some messages while keeping the average processing time short than ensure that all messages are processed regardless of circumstances. In the future we aim at supporting these scenarios; however, the final decision depends on further experiments on live data.
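As a rough sketch of this disk-buffer idea (the file name and message format are hypothetical), overflow messages received while the downstream component signals "wait" are appended to a file and replayed into the pipeline once a "continue" message arrives:

    import json
    import os

    BUFFER_FILE = "peak_buffer.jsonl"    # hypothetical spill file, one JSON message per line

    def spill_to_disk(message, path=BUFFER_FILE):
        """While the consumer signals 'wait': extend the in-memory buffer onto disk."""
        with open(path, "a", encoding="utf-8") as f:
            f.write(json.dumps(message) + "\n")

    def replay_from_disk(send_fn, path=BUFFER_FILE):
        """On 'continue': feed the buffered messages back into the pipeline, oldest first."""
        if not os.path.exists(path):
            return
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                send_fn(json.loads(line))
        os.remove(path)                   # the buffer has been replayed; start fresh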
2.3. Summary
The global scaling strategy gives a technical overview of the scenarios that can be applied to improve the performance of the FIRST analytical pipeline. The target system will combine the aforementioned techniques, choosing the most appropriate method based on the performance of the individual components, the target data volumes and the analysis of performance bottlenecks. This demonstrates that the messaging-based integration approach and the flexibility of the analytical pipeline bring more possibilities for scaling the overall FIRST architecture and ensure that the scalability goals can be met in the further development of the project.
3. Individual scaling plans
Individual scaling plans, as opposed to the global scaling strategy, aim at providing a roadmap and scaling plans for the individual technical workpackages. Those plans focus on internal and specific aspects of pipeline processing, such as the choice of algorithms or the improvement of data handling. They are realized separately within the technical components of each workpackage according to their own objectives. However, the outcomes of each plan are aligned with the project plan and the prototype release cycle according to the DoW.
The following subchapters correspond to the following workpackages: WP3: Data acquisition and preprocessing services and Semantic resources; WP4: Information extraction services; WP5: Information integration services; WP6: Decision support and visualisation services; WP7: Integration infrastructure.
3.1. Data acquisition and preprocessing services
The data acquisition and preprocessing pipeline, shown in Figure 10, consists of relatively
elementary operations that do not need to be replaced with sophisticated online alternatives. In
addition, all these operations are trivially parallelizable (i.e., each document can be processed
independently). This allows us to devise a workflow with multiple parallel preprocessing
pipelines as evident from Figure 10. Load balancing is employed to send the acquired data
through the preprocessing pipelines.
[Figure omitted: one RSS reader per site (80 readers); load balancing distributes the acquired documents across parallel processing pipelines, each consisting of a boilerplate remover, language detector, duplicate detector, sentence splitter, tokenizer, POS tagger, semantic annotator and ZeroMQ emitter.]
Figure 10: Data acquisition and preprocessing pipeline at M12 (taken from (FIRST D2.2 Conceptual and technical integrated architecture design, 2011))
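Because each document is processed independently, the preprocessing stages can be expressed as a chain of per-document functions and distributed over parallel workers. The sketch below uses a process pool purely as an illustration of this trivial parallelism; the stage functions are placeholders, not the actual FIRST components.

    from multiprocessing import Pool

    # Placeholder stages (boilerplate removal, language detection, duplicate detection,
    # sentence splitting, tokenization, POS tagging, semantic annotation, ...).
    def remove_boilerplate(doc): return doc
    def detect_language(doc): return doc
    def tokenize(doc): return doc

    STAGES = [remove_boilerplate, detect_language, tokenize]

    def preprocess(doc):
        """Run one document through all stages; documents do not depend on each other."""
        for stage in STAGES:
            doc = stage(doc)
        return doc

    def preprocess_stream(documents, n_pipelines=4):
        """Load-balance acquired documents across n parallel preprocessing pipelines."""
        with Pool(processes=n_pipelines) as pool:
            return pool.map(preprocess, documents)

    if __name__ == "__main__":
        print(len(preprocess_stream([{"url": "doc%d" % i} for i in range(10)])))   # 10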
As the project progresses, the data acquisition and preprocessing pipeline will be scaled up mainly from two perspectives: (i) with respect to the number of sites from which the data is acquired and (ii) with respect to the number of components (i.e., functionality). Table 2 shows the scale-up of the data acquisition pipeline from the preliminary version (at M7) to now (M12) and presents the scale-up plan for the remainder of the project.
Ver. 1 (Apr–Jun 2011, M7–M9):
- Scale: number of sites: 39; number of RSS feeds: 1,950 (~50 per site on average); avg. number of documents per site per day: 870; total new documents per day: 33,950.
- Functionality: RSS acquisition only.

Ver. 2 (Jun–Sep 2011, M9–M12) – now:
- Scale: number of sites: 80; number of RSS feeds: 2,472 (~30 per site on average); avg. number of documents per site per day: 425; total new documents per day: 34,000.
- Functionality: boilerplate removal added.

Ver. 3 (Sep 2011–Sep 2012, M12–M24):
- Scale: number of sites: 160; number of RSS feeds: 4,800 (~30 per site on average); avg. number of documents per site per day: 425; total new documents per day: 68,000.
- Functionality: added language detector, duplicate detector, sentence splitter, tokenizer, POS tagger, ZeroMQ emitter.

Ver. 4 (Sep 2012–Sep 2013, M24–M36):
- Scale: unchanged.
- Functionality: unchanged.

Table 2: Scaling plan for the data acquisition and preprocessing pipeline
The average number of RSS feeds per site and the average number of acquired documents per site per day decrease from Ver. 1 to Ver. 2. This is mainly due to the fact that we included a lot of blogs in Ver. 2: a blog usually provides a single RSS feed and only a few posts per day or week, while a larger news Web site provides a range of RSS feeds and hundreds of news items per day. Another reason for the drop in the average number of documents per site per day, and consequently in the total number of new documents per day, is the new filtering policy. In Ver. 2, we only accept HTML and plain text documents that are 10 MB or less in size. In Ver. 1, non-textual content (such as video, audio, PDF, and XML) was also accepted and its size was not limited.
The only component that could benefit from the fact that we are dealing with streams is the
boilerplate remover. The currently implemented solution is based on language-independent
features and employs a decision tree to determine the class of a text block (Kohlschütter,
Fankhauser, & Nejdl, 2010). This solution processes each document separately and is unaware
of the fact that it operates in a stream-based environment. We recently devised a pragmatic
stream-based boilerplate remover that exhibits high content recall (at some expense of
precision). The algorithm is currently being tested and, if deemed suitable, will replace the
currently employed solution at some point during the second project year.
3.2. Semantic resources
The FIRST ontology contains two important aspects of knowledge about financial markets: (i)
real-world entities such as companies and stock indices and their interrelations and (ii) the
corresponding lexical knowledge required to identify these entities in texts. The ontology is thus
fit for the purpose of information extraction rather than representing a basis for logic-based
reasoning.
We distinguish between the static and dynamic part of the ontology. The static part contains
knowledge that does not change frequently (i.e., does not adapt to the stream in real time). It
contains the knowledge about financial indices, instruments, companies, countries, industrial
sectors, sentiment-bearing words, and financial topics. This part of the ontology will scale up in terms of coverage (i.e., how many financial indices, topics, and sentiment-bearing words the ontology covers) and in terms of aspects (i.e., which different types of information are available in the ontology, e.g., industrial sectors, sentiment vocabularies, topic taxonomies…).
The dynamic part will include two aspects of knowledge that will be constantly updated with respect to the data stream: (i) the topic taxonomy and (ii) the sentiment vocabulary. (Note that these two aspects are also included in the static part, where they do not adapt to the stream but rather represent UC-specific knowledge and existing semantic resources, e.g., McDonald's financial word lists <http://www.nd.edu/~mcdonald/Word_Lists.html>.) The dynamic part of the ontology will scale up mostly in terms of the maximum throughput of the topic detection algorithm and the sentiment vocabulary extractor.
The topic detection component will be based on an online hierarchical clustering algorithm (see Annex 1). Rather than processing a whole dataset of documents in batch mode, an online clustering algorithm is able to rapidly update a hierarchy of document clusters whenever a new document enters the system. The sentiment vocabulary extractor, on the other hand, will employ an active learning approach based on Support Vector Machines (SVM) (Joachims, 2006; Tong & Koller, 2000; Saveski & Grcar, 2011). For this purpose, we will employ an online variant of SVM (Cauwenberghs & Poggio, 2001).
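To make the online principle concrete, the following much-simplified sketch shows flat (non-hierarchical) online clustering: each new document, represented as a term-weight dictionary, is assigned to its nearest centroid, which is then updated incrementally instead of re-clustering the whole dataset. The similarity threshold and data representation are illustrative assumptions, not the actual FIRST algorithm described in Annex 1.

    import math

    class OnlineClustering:
        """Simplified flat online clustering over term-weight dictionaries."""

        def __init__(self, similarity_threshold=0.5):
            self.threshold = similarity_threshold
            self.centroids = []          # list of (term -> weight) dicts
            self.sizes = []

        @staticmethod
        def _cosine(a, b):
            dot = sum(w * b.get(t, 0.0) for t, w in a.items())
            na = math.sqrt(sum(w * w for w in a.values()))
            nb = math.sqrt(sum(w * w for w in b.values()))
            return dot / (na * nb) if na and nb else 0.0

        def add(self, doc):
            """Update the clustering with one new document as it arrives from the stream."""
            best, best_sim = None, 0.0
            for i, c in enumerate(self.centroids):
                sim = self._cosine(doc, c)
                if sim > best_sim:
                    best, best_sim = i, sim
            if best is None or best_sim < self.threshold:
                self.centroids.append(dict(doc))      # no similar topic: open a new cluster
                self.sizes.append(1)
                return len(self.centroids) - 1
            n, c = self.sizes[best], self.centroids[best]
            for t in c:                               # exact running mean of the cluster members
                c[t] = c[t] * n / (n + 1)
            for t, w in doc.items():
                c[t] = c.get(t, 0.0) + w / (n + 1)
            self.sizes[best] = n + 1
            return best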
Table 3 gives the current state of the semantic resources in FIRST and also presents the
scaling plan for after M12.
Now (Feb–Sep 2011, M5–M12):
- Coverage: collection of existing semantic resources (T3.1).
- Aspects: sentiment vocabularies, topic taxonomies, lexical resources, glossaries, financial Web sites…
- Throughput: N/A.

Sep 2011–Mar 2012 (M12–M18):
- Coverage: ontology "spawned" from 16 financial indices.
- Aspects: indices, stocks, companies, countries, industrial sectors, sentiment vocabulary, topic taxonomies.
- Throughput: N/A.

Mar–Sep 2012 (M18–M24):
- Coverage: ontology "spawned" from >1000 financial indices.
- Aspects: events added.
- Throughput: running in near-real time on 1 selected Web site (testing).

Sep 2012–Sep 2013 (M24–M36):
- Coverage: unchanged.
- Aspects: unchanged.
- Throughput: running in near-real time on approximately 160 Web sites.

Table 3: Scaling plan for the semantic resources
3.3. Information extraction services
The information extraction service is based on the JAPE engine, which brings the benefit of a powerful rule engine but also the disadvantage of being time-consuming (for a detailed description see (FIRST D4.1 First semantic information extraction prototype, 2011)). In early experiments the information extraction service could take several minutes per blog document. Because the information extraction engine also has to fulfil the requirement of handling a couple of documents per minute, the whole service will provide an internal managed process pool, where the documents are dispatched using a load balancer combined with a messaging approach.
The first version of the managed process pool will have a configurable number of parallel processes, which will be started by the Process Observer/Manager component. The Process Observer/Manager also manages the different states (WAITING, BUSY) of the running information extraction processes by constantly monitoring them and taking appropriate actions. If the BUSY state does not change within a defined timeout, the process is killed and restarted by the Process Observer/Manager.
[Figure omitted: a load balancer dispatches messages to information extraction processes 1…n in a managed process pool, supervised by a process observer/manager.]
Figure 11: Information extraction scaling approach
Figure 11 shows the internal approach of the information extraction services in the scope of the global processing pipeline. Incoming data is first dispatched by the load balancer component, which distributes it across the registered information extraction components according to the process observer/manager; data is sent only to components that are in the WAITING state. The number of information extraction processes in the pool is subject to further experiments and depends on the amount of data and the processing time. Also, there can be more than one managed process pool within the system, and these can be further scaled according to the global scaling strategy scenarios (using the messaging approach, e.g. with the ZeroMQ implementation).
This additional internal process pool is necessary for two major reasons:
1) The local machines can run more than one process in parallel; however, a single JAPE run needs considerable time (currently up to several minutes) to process one document in an atomic step.
2) In some cases we observed that a JAPE-based process might take too long, thus blocking or exhausting resources available for other documents (e.g. due to a crash or hang-up). If it does not respond within an appropriate time, the built-in observer mechanism kills the process and restarts it in order to return it to the process pool.
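The following sketch outlines this observer/manager idea under the assumptions stated above (the WAITING/BUSY states and the kill-and-restart timeout); the command used to start an extraction process and the timeout value are hypothetical, and the actual dispatch of documents to a JAPE process is abstracted away.

    import subprocess
    import time

    TIMEOUT_SECONDS = 300          # example timeout after which a BUSY process is considered hung

    class ManagedProcess:
        """One information extraction process supervised by the observer/manager."""

        def __init__(self, cmd):
            self.cmd = cmd
            self.state = "WAITING"         # WAITING: ready for a document, BUSY: extracting
            self.busy_since = None
            self.proc = subprocess.Popen(cmd)

        def dispatch(self, document):
            self.state, self.busy_since = "BUSY", time.time()
            # ... hand the document to the process, e.g. via the messaging layer ...

        def watchdog(self):
            """Kill and restart the process if it has not left the BUSY state within the timeout."""
            if self.state == "BUSY" and time.time() - self.busy_since > TIMEOUT_SECONDS:
                self.proc.kill()
                self.proc = subprocess.Popen(self.cmd)
                self.state, self.busy_since = "WAITING", None

    def load_balance(pool, document):
        """Send the document to the first extraction process that is ready."""
        for process in pool:
            if process.state == "WAITING":
                process.dispatch(document)
                return True
        return False                       # all processes BUSY: leave the document queued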
The current approach focuses on the accuracy of the information extraction process. As mentioned before, its performance is still far from optimal. If this proves infeasible in further experiments, other solutions that trade accuracy off against performance will be explored as an alternative, in order to avoid potential risks in scaling the overall pipeline.

Table 4 presents the development plan and scaling goals for the information extraction components.

Part 0: "M12 early prototype" (Feb–Sep 2011, M5–M12) – now:
- Natural language processing: boilerplate remover.
- Entity extraction: financial instruments (stocks, stock indexes), companies, orientation terms.
- Extraction coverage: direct sentiments regarding financial instruments' price and companies' reputation.
- Extraction scaling: n/a.
- Extraction throughput: 1 document per minute (one process).

Part 1: Integrated functional prototype (Oct 2011–Mar 2012, M13–M18):
- Natural language processing: language detector, duplicate detector, sentence splitter, tokenizer, and part-of-speech tagger annotations.
- Entity extraction: indicators, topic taxonomies.
- Extraction coverage: direct sentiments regarding financial instruments' volatility.
- Extraction scaling: analysis of performance bottlenecks; improve the prototype with software engineering methods.
- Extraction throughput: 2-3 documents per minute (one process).

Part 2: Live feeds (April 2012–Sep 2012, M19–M24):
- Natural language processing: unchanged.
- Entity extraction: events, locations.
- Extraction coverage: indirect sentiments.
- Extraction scaling: initial process pool for load balancing.
- Extraction throughput: 5 documents per minute.

Part 3: Large amounts of data (Sep 2012–June 2013, M24–M33):
- Natural language processing: unchanged.
- Entity extraction: unchanged.
- Extraction coverage: unchanged.
- Extraction scaling: advanced process pool for large amounts of data.
- Extraction throughput: 50 documents per minute, 68,000 documents per day.

Table 4: Development plan and scaling plan for Information Extraction
3.4. Information integration services
When storage solutions are confronted with high loads of data insertions and/or data retrievals, performance bottlenecks may become a severe issue. Neither data contributors nor data consumers want to spend much time waiting for their requests to be completed. Performance issues may arise for a variety of reasons:
- Blocked or limited resources.
Such issues occur when database resources that are required to perform a certain task are blocked by another operation, and processing has to wait until the required resource is released by the other task. Resources in this context encompass physical resources (e.g. hard-drive access to alter a file), virtual resources (e.g. a database table that is write-locked while operations that alter its content are running) or logical resources (e.g. database connections).
- Costly (time-consuming) database operations due to improper database design.
With an inappropriate database design, users may be forced to conduct complex, time-consuming queries, e.g. using many joins, for frequent tasks.
- Costly (time-consuming) database operations due to improper database management.
With an improperly administrated database, queries can take unnecessarily long, e.g. when no appropriate indices are maintained.
- Costly (time-consuming) database operations due to inappropriate queries.
Unnecessarily complex queries can harm query performance.
Solutions to counter these causes of performance bottlenecks can be implemented on different layers: not only on the storage layer itself but also on the access layer.
The first and most crucial decision is the choice of the physical storage system. Candidate solutions include the plain file system, relational databases, non-relational (NoSQL) database approaches such as document-oriented key-value databases, or any hybrid combination thereof. Each of these data storage solutions bears its individual advantages and disadvantages, which need to be weighed in the light of the respective requirements towards the data items to be stored and retrieved. While, for example, storing items in the file system allows efficient random access to the stored items, the expressive power of queries is obviously limited to filenames or creation dates. In a relational database, on the other hand, the expressive power of queries is quite high, while such complex queries may harm performance, as large tables might have to be scanned and potentially numerous sub-queries have to be conducted. Therefore, the choice of storage solution shall be made on an individual basis, choosing the approach that is most appropriate for the respective type of data and the expected frequency of data insertions, updates, and retrievals.
Each storage solution brings its own inherent optimization possibilities. Besides any automatic query optimizers, optimizations can also be achieved with the administrative functionality provided by the storage solution as well as with the degrees of freedom provided in the database design. For relational databases, for example, the set of options to enhance performance includes:
- Normalization of database tables.
The normalization of database tables avoids redundant storage of data, which shall improve performance and avoid inconsistencies due to updates.
- Setting appropriate indices.
By defining indices (with respect to columns often used in queries) the efficiency of queries can be increased, as previously determined meta-information from the indices can be used rather than performing full table scans to identify all the rows that match certain criteria. In order to define useful indices, the frequently used queries should be analyzed. (There are two types of indices: clustered and non-clustered. As the former actually changes the physical sequence of entries in a table accordingly, there can only be one clustered index per table, while several non-clustered indices can be defined.) As the database content changes over time, indices shall be rebuilt from time to time and may require some fine-tuning (e.g. with respect to the fill factor that is maintained upon index creation).
- Partitioning.
Database performance may be harmed by very large tables. To counter this, a table may be (horizontally) partitioned, i.e. different rows of the table are assigned to different physical partitions (e.g. physically different disks). Thereby the search effort may be reduced, as less often retrieved parts of the data – e.g. older entries – may be swapped out. As, furthermore, the size of the respective indices (one per partition) is smaller, the search effort is further reduced.
- Database cluster.
If physical alternatives are available, the database can be federated among several servers to distribute the workload.
Although the aforementioned concepts for enhancing performance have emerged in the context of relational databases, similar optimization approaches exist for other storage paradigms as well. Many NoSQL databases also offer indexing or distribution of storage among several servers. With sharding, there is even a concept that combines features of horizontal partitioning and clustering.
Despite all these performance optimization approaches, it may still prove useful – depending on the actual request pattern – to cache some database entries, e.g. the most recent ones, or to de-normalize the table structure to some extent to cater for better response times to queries that would otherwise span several database tables and require costly joins.
While de-normalization would be part of the database (re)design and therefore reside on the storage layer, caching recent database entries may be part of optimizations on the access layer or some kind of intermediate layer. Further potential to increase performance may be realized on the access layer by providing best-practice implementations for often used queries, e.g. by providing prepared statements or using stored procedures where appropriate.
Depending on the insertion patterns, the access layer may also arrange individual inserts into a bulk insert, where several inserts are bundled into one transaction. Usually, indices are updated upon each insert; by bundling several inserts into a transaction, the index update is performed only once, which increases the responsiveness of the database.
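A minimal sketch of this bundling idea is given below, using SQLite purely for illustration; the table layout and values are hypothetical, and the actual FIRST storage back-ends may differ. All rows are inserted inside one transaction, so the commit (and the index maintenance that comes with it) happens only once per batch.

    import sqlite3

    def bulk_insert(db_path, rows):
        """Insert many sentiment records in a single transaction instead of one commit per row."""
        con = sqlite3.connect(db_path)
        try:
            with con:                      # one transaction for the whole batch
                con.execute("CREATE TABLE IF NOT EXISTS sentiment "
                            "(doc_id TEXT, instrument TEXT, score REAL)")
                con.executemany("INSERT INTO sentiment VALUES (?, ?, ?)", rows)
        finally:
            con.close()

    bulk_insert("first.db", [("doc1", "DAX", 0.7), ("doc2", "DAX", -0.2)])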
The issue of blocked resources can be addressed by the access layer in several ways (a simplified sketch of connection pooling follows after this list):
- Pooling of resources.
Establishing database connections is an expensive operation. To avoid this procedure as often as possible, the access layer shall maintain a connection pool to serve new requests for a connection; a new connection is created only when no appropriate connection is available in the pool. Whenever a connection is released by its user (e.g. when a query has completed), the connection is returned to the pool to be available for re-use by the next query. Such connection pooling is already offered by many database drivers; where it is not available, the access layer shall provide it. In a similar way, threads that process database operations may be pooled internally.
- Queuing insertions.
In case of a blocked resource, the access layer shall not refuse a request or block itself by waiting for the required resource to be released. Instead, the request shall be accepted and queued in a worker thread until it can be conducted. However, this should only occur in exceptional circumstances, as it is imperative to minimize the occurrence of blocked resources.
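The connection pooling mentioned in the list above could, in its simplest form, look like the following sketch (illustrative only; in practice the pooling built into the database driver would be preferred where available):

    import queue
    import sqlite3
    from contextlib import contextmanager

    class ConnectionPool:
        """Reuse database connections instead of establishing a new one per request."""

        def __init__(self, db_path, size=5):
            self._db_path = db_path
            self._pool = queue.Queue()
            for _ in range(size):
                self._pool.put(sqlite3.connect(db_path, check_same_thread=False))

        @contextmanager
        def connection(self):
            try:
                con = self._pool.get_nowait()        # reuse an idle connection from the pool
            except queue.Empty:
                con = sqlite3.connect(self._db_path, check_same_thread=False)   # pool exhausted
            try:
                yield con
            finally:
                self._pool.put(con)                  # return it to the pool for the next query

    # Usage sketch:
    # pool = ConnectionPool("first.db")
    # with pool.connection() as con:
    #     con.execute("SELECT 1")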
In the following, rough capacity estimations regarding the amount of data items to be stored are outlined. This outline obviously only covers the storage requirements known so far; for example, storage requirements raised by the decision support components are not reflected, as they were yet to be fully defined at the time of writing. Consequently, the data storage component will need to adapt at a later stage in order to properly store the models, predictions, and potentially other relevant data and metadata from these components.
Ontology archive:
- number of concurrent users (insertion and retrieval): 1 (WP3)
- in regular intervals a serialization of the most up-to-date ontology will be archived
- retrieval of single ontologies occasionally for back-testing purposes

Storing annotated document corpora:
- number of concurrent users (insertion and retrieval): 1 (WP3)
- archiving of a vast amount of annotated documents
- occasional (bulk) retrieval of documents, mainly for back-testing purposes

Computed sentiment-related information:
- number of concurrent users (insertions): 1 (WP4)
- number of concurrent users (retrieval): 1-3 (WP6, WP7, WP8)
- storing of a vast amount of fully annotated GATE documents
- frequent retrieval of sentiments and/or further attributes
The requirements of the ontology archiving task will probably change only slightly over time. The frequency with which the most up-to-date ontology is archived may increase; the current expectation is one archiving request per day. However, even if this changed dramatically to one archiving request every ten minutes, the knowledge base should be able to accommodate that without any noticeable impact on overall performance.
However, the requirements regarding storage of annotated document corpora and computed
sentiment-related information are directly driven by the scaling of the data acquisition
component maintained by WP3.
According to the scaling plan outlined for data acquisition in section 3.1, 68,000 new documents
are to be expected per day at the final stage of the project. This figure represents all documents
that are retrieved from the data acquisition components. Based on the ontology, documents that are irrelevant in the context of FIRST will be filtered out, so the subsequent components in the processing pipeline will actually receive fewer documents. Nevertheless, for the purpose of a worst-case estimation, the following calculations assume that all acquired documents are passed on to the subsequent components in the pipeline. The estimate of 68,000 new documents per day would then cause the same number of new annotated document corpora to be stored per day, i.e. 68,000 insertions into the knowledge base. For each of these documents, annotations will be set, sentiments will be computed and the related database tables will have to be updated. It is assumed that, per processed document, 20 sentiment-related database tables will have to be updated, which causes 1,360,000 update operations per day on the knowledge base. In order not to ignore future system load caused by the decision support components, this worst-case estimation assumes that they will cause the same number of update operations, i.e. another 1,360,000 per day. As for
annotated document corpora, only occasional retrieval is expected; the number of retrieval operations per day is therefore estimated at around 50% of the total insert operations. That leads to a total of 4,182,000 database operations per day, or on average 48.40 database operations per second (4,182,000 operations spread over 86,400 seconds; see Table 5). As storage is spread among different storage solutions, many of these operations will not impact each other and can be conducted in parallel, so the overall number of operations per second should be achievable.
Insert operations for document corpora: 68,000
Insert operations for sentiment-related information: 1,360,000
Insert operations for decision support components: 1,360,000
Grand total insert operations per day: 2,788,000
Estimated retrieval operations (50% of grand total inserts) per day: 1,394,000
Grand total database operations per day: 4,182,000
Grand total database operations per second: 48.40
Table 5: Rough estimate of database operations
When the required storage space is estimated in a similar way from these figures, however, another potential bottleneck becomes apparent. Assuming that each annotated document corpus may require 25 KB and that each insert operation for sentiment-related information and decision support data may require 10 KB, the total accumulated storage required within one year would be roughly 10.55 TB (ignoring any potential database overhead and assuming a factor of 1000, rather than 1024, for converting KB to MB, GB and TB). However, as these figures are very rough and probably overestimate the actual load, they will need to be reviewed once the data acquisition pipeline is set up.
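For traceability, the worst-case figures used above can be reproduced with a few lines of code; the per-item sizes are the rough assumptions stated in the text, not measured values.

public class CapacityEstimate {
    public static void main(String[] args) {
        long documentsPerDay = 68_000;
        long sentimentUpdatesPerDay = documentsPerDay * 20;          // 1,360,000
        long decisionSupportUpdatesPerDay = sentimentUpdatesPerDay;  // worst-case assumption: same load
        long insertsPerDay = documentsPerDay + sentimentUpdatesPerDay + decisionSupportUpdatesPerDay;
        long retrievalsPerDay = insertsPerDay / 2;                   // 50% of all inserts
        long operationsPerDay = insertsPerDay + retrievalsPerDay;
        double operationsPerSecond = operationsPerDay / 86_400.0;

        // Assumed item sizes: 25 KB per annotated corpus, 10 KB per sentiment/DSS insert.
        double kbPerDay = documentsPerDay * 25.0
                + (sentimentUpdatesPerDay + decisionSupportUpdatesPerDay) * 10.0;
        double tbPerYear = kbPerDay * 365 / 1_000_000_000.0;         // factor 1000 per unit step

        System.out.printf("operations/day=%d, operations/s=%.2f, storage/year=%.2f TB%n",
                operationsPerDay, operationsPerSecond, tbPerYear);
        // Prints roughly: operations/day=4182000, operations/s=48.40, storage/year=10.55 TB
    }
}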
The plan for implementing scaling approaches for Information integration services is presented
in Table 6.
Functionality / Coverage
• Now, Jul–Sep 2011 (M10–M12): Basic availability of storage solutions (filesystem, MongoDB, RDBMS) for the ontology, the document corpora and the sentiment-related information.
• Sep 2011–Mar 2012 (M12–M18): Unchanged storage solutions; pipeline components do store data. Provide access interface to WP6/WP7 clients.
• Mar–Sep 2012 (M18–M24): Expand to cater for requirements from the DSS (WP6) and the Integrated FIS (WP7).
• Sep 2012–Sep 2013 (M24–M36): Expand to cater for requirements from the end-user prototypes (WP8); unchanged otherwise.

Performance
• Now, Jul–Sep 2011 (M10–M12): N/A
• Sep 2011–Mar 2012 (M12–M18): N/A
• Mar–Sep 2012 (M18–M24): Scaling performance along with the scaling of the pipeline components, both in terms of the number of processed sources and in terms of near-real-time processing of sources.
• Sep 2012–Sep 2013 (M24–M36): Unchanged.

Table 6: Scaling plan for the knowledge base
3.5. Decision support and visualisation services
To devise a scaling plan for the decision support models, it is first important to identify the
models that will need to be developed for the purpose of the use cases (UC). At this moment,
this is possible only with some speculation and may change as the project progresses.
In UC #1, i.e. the market surveillance use case, the detection of market sounding and pump-and-dump scenarios will most likely be attempted by employing near-duplicate detection techniques (see (FIRST D2.1 Technical requirements and state-of-the-art, 2011), Section 2.1.6). In the first project year, we have implemented a hash-based near-duplicate detector (Manku, Jain, & Sarma, 2007) as part of the data acquisition and preprocessing pipeline (see (FIRST D2.1 Technical requirements and state-of-the-art, 2011), Section 2.1.5). The implemented algorithm is highly scalable (suitable for Web-scale applications) but suffers from a drawback that hinders its use in the FIRST UC scenarios. Specifically, the algorithm finds all documents, encountered in the stream, whose hash codes differ from that of the current document in 3 or fewer bits. There is no intuitive interpretation of how these bits translate into words, sentences, or paragraphs (the “how many bits for a word” dilemma). Our preliminary experiments showed that many near-duplicates are not discovered, especially if the texts are short. This motivated the development of a new (pragmatic) near-duplicate detection algorithm based on an inverted index. If deemed suitable, the new algorithm will replace the currently employed solution at some point during the second project year.
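For illustration, the following sketch shows only the distance criterion used by the hash-based detector (the fingerprint computation and indexing themselves follow Manku, Jain, & Sarma (2007) and are not shown here): two documents are treated as near-duplicates when their 64-bit fingerprints differ in at most 3 bits.

public final class FingerprintDistance {

    private static final int MAX_DIFFERING_BITS = 3;

    private FingerprintDistance() { }

    /** Number of bits in which the two 64-bit fingerprints differ (Hamming distance). */
    public static int hammingDistance(long fingerprintA, long fingerprintB) {
        return Long.bitCount(fingerprintA ^ fingerprintB);
    }

    /** The near-duplicate criterion: at most 3 differing hash-code bits. */
    public static boolean isNearDuplicate(long fingerprintA, long fingerprintB) {
        return hammingDistance(fingerprintA, fingerprintB) <= MAX_DIFFERING_BITS;
    }
}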
For the tasks in UC #2, i.e., Reputational risk assessment, qualitative multi-attribute models
(Žnidaršič, Bohanec, & Zupan, 2008) are planned to be used (see (FIRST D2.1 Technical
requirements and state-of-the-art, 2011) section 2.5.3). Qualitative models do not present a
scaling issue as they are based on extremely efficient algorithms.
In UC #3, i.e., Retail brokerage use case, the model that clearly exhibits a scaling issue is the
topic detection algorithm required to (i) detect emerging topics and (ii) visualize topic trends. In
addition to the topic hierarchy model, the portfolio optimization task requires another type of
model which can be either qualitative (as in the case of UC #2) or quantitative (e.g., decision
tree). We will scale both of these models up by implementing efficient online (stream-based)
variants. More information on online decision trees and topic detection algorithms is given in
Section 3.5.2.
In addition to the models and algorithms discussed above, we also plan to develop a topic
space visualization algorithm ((Grcar, Podpecan, Jursic, & Lavrac, 2010); see also (FIRST D2.1
Technical requirements and state-of-the-art, 2011) Section 4.2 and Annex 5) and employ it for
providing insights into which topics are being discussed in the context of financial markets.
Table 7 shows the scale-up plan for the models that will be employed in the context of the
FIRST use cases.
Scale
• Oct 2011–Mar 2012 (M13–M18): Experimenting with datasets created from the acquired data (historical data).
• Mar–Sep 2012 (M18–M24): Experimenting with simulated streams (historical data).
• Sep 2012–Mar 2013 (M24–M30): Running in near-real time on 1 selected Web site (testing).
• Mar–Sep 2013 (M30–M36): Running in near-real time on approximately 160 Web sites.

Table 7: Scaling plan for the decision-support models
3.5.1 Scaling techniques for clustering and classification
From a high-level perspective, we plan to employ (i) qualitative modelling techniques, (ii) visualization techniques, and (iii) machine learning techniques. As already mentioned, qualitative models do not present a scaling issue as they are based on extremely efficient algorithms. Visualization and machine learning techniques, on the other hand, need to be adapted to work with intensive data streams in near-real time. There are several general-purpose scaling techniques at hand, already presented to some extent in (FIRST D2.1 Technical requirements and state-of-the-art, 2011), Section 2.6, such as pipelining, parallelization and warm starts1. However, sometimes these are not applicable or the resulting process is still not efficient enough. In such cases, we need to resort to stream-based alternatives. These are entirely different algorithms, designed with the awareness that they operate in a stream-based environment. In the following subsections, we present several stream-based algorithms from two major categories of machine learning algorithms: (i) unsupervised learning (i.e., clustering) and (ii) supervised learning (i.e., classification). To address the FIRST scenarios, we put the stream-based clustering methods into the context of topic detection, trend detection, and visualization. In addition, we discuss online model trees (i.e., a variant of stream-based decision trees), which are glass-box models and are likely to be employed in FIRST.

Details of clustering for topic and trend detection techniques are discussed in Annex 1.

1 Warm starts are possible in practically every iterative optimization method. This means that, when new data enters the system (or some outdated data “leaves”), we start the algorithm with the result from the previous run, and consequently it converges faster (i.e., requires fewer iterations to converge). Warm starts can be used, for example, with k-means clustering, stress majorization and other iterative graph layout methods, least-squares solvers, support vector machines (SVM), and many other iterative methods.
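To illustrate the warm-start technique mentioned above (see footnote 1), the following sketch shows a plain Lloyd-style k-means in which the centroids of the previous run are reused as the starting point of the next run; this is an illustrative, simplified implementation rather than the one used in FIRST.

import java.util.Arrays;

public class WarmStartKMeans {

    /**
     * One k-means run. If previousCentroids is non-null, it is used as the
     * starting point (warm start); otherwise the first k points are used
     * (assumes points.length >= k).
     */
    public static double[][] cluster(double[][] points, int k,
                                     double[][] previousCentroids, int maxIterations) {
        int dim = points[0].length;
        double[][] centroids = previousCentroids != null
                ? deepCopy(previousCentroids)
                : deepCopy(Arrays.copyOf(points, k));

        for (int iter = 0; iter < maxIterations; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];

            // Assignment step: each point goes to its nearest centroid.
            for (double[] p : points) {
                int best = nearest(centroids, p);
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += p[d];
            }

            // Update step: recompute centroids; stop when nothing moves any more.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // keep empty clusters where they are
                for (int d = 0; d < dim; d++) {
                    double mean = sums[c][d] / counts[c];
                    if (Math.abs(mean - centroids[c][d]) > 1e-9) changed = true;
                    centroids[c][d] = mean;
                }
            }
            if (!changed) break;
        }
        return centroids;
    }

    private static int nearest(double[][] centroids, double[] p) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0;
            for (int d = 0; d < p.length; d++) {
                double diff = centroids[c][d] - p[d];
                dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }

    private static double[][] deepCopy(double[][] src) {
        double[][] copy = new double[src.length][];
        for (int i = 0; i < src.length; i++) copy[i] = src[i].clone();
        return copy;
    }
}

When new documents enter the system, cluster(...) is simply called again with the centroids returned by the previous call, which typically requires far fewer iterations than a cold start.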
3.5.2 Learning model trees from data streams
The problem of real-time extraction of meaningful patterns from time-changing data streams is
of increasing importance in machine learning and data mining. Regression in time-changing
data streams is a relatively unexplored topic, despite many possible applications. In decision
support and visualization services we will use an efficient and incremental stream mining
algorithm, FIMT-DD (Ikonomovska E, 2011), which is able to learn regression and model trees
from possibly unbounded, high-speed and time-changing data streams. To the best of our
knowledge, there is no other general-purpose algorithm for the incremental learning of regression/model trees that is able to perform explicit change detection and informed adaptation. The
algorithm performs online and in real-time, observes each example only once at the time of
arrival, and maintains at any time a ready-to-use model tree. The tree leaves contain linear
models induced online from the examples assigned to them. The algorithm has mechanisms for
drift detection and model adaptation, which enable it to maintain accurate and updated
regression models at any time. The drift detection mechanism exploits the structure of the tree
in the process of local change detection. As a response to local drift, the algorithm is able to
update the tree structure only locally. This approach improves the any-time performance (i.e.,
availability of an up-to-date model at any time) and greatly reduces the costs of adaptation.
Details of the FIMT-DD algorithm are presented in Annex 2.
3.6. Integration infrastructure
The integration infrastructure has a special and distinct role in the scaling strategy. It can be considered to have an impact on both the global and the local level. On the one hand, it provides the technological means for realizing the global scaling strategy. On the other hand, the integration infrastructure also encompasses the graphical front-ends and the services necessary for implementing the whole Integrated Financial Market Information System (FIRST D2.2 Conceptual and technical integrated architecture design, 2011). It should therefore ensure that the rest of the infrastructure keeps pace with the results of data processing.
The “global” aspect of integration infrastructure scaling is approached by providing a lightweight
messaging middleware that integrates components that take part in pipeline processing. The
infrastructure will support the pipeline scaling scenarios as already described in the global scaling strategy (see Chapter 2).
The “local” aspect has been addressed by providing a coherent system design. From a
performance point of view, required infrastructure characteristics have already been taken into
account in the architecture definition (FIRST D2.2 Conceptual and technical integrated
architecture design, 2011). It outlines approaches such as pipeline processing for data analysis, push-based services and asynchronous data exchange.
The choice of a messaging-based approach and a flexible, lightweight architecture supports the overall scaling goals. In the next steps, we plan to adjust and fine-tune the integration middleware in order to comply with the requirements of the analytical pipeline. From early experiments ((FIRST D2.2 Conceptual and technical integrated architecture design, 2011), Section 4.2) we learned that the overall throughput of the messaging middleware does not constrain the data processing estimates; infrastructure scaling is therefore more of a feature-oriented scaling plan that will improve the global throughput of the FIRST system by properly managing the data stream flow according to the pipeline scaling scenarios. The scaling plan is presented in Table 8.
Coverage
• Now, Feb–Sep 2011 (M5–M12): Analysis and choice of the most suitable integration approach, experiments with messaging.
• Sep 2011–Mar 2012 (M12–M18): Early version of the pipeline integration prototype. Testbed for the first scaling experiments.
• Mar–Sep 2012 (M18–M24): Advanced prototype of the integration infrastructure. Integration of all pipeline components. Supporting the chosen global scaling scenario. Monitoring of performance allows for further architecture fine-tuning. GUI and high-level services keep up with the running pipeline.
• Sep 2012–Sep 2013 (M24–M36): Final integration infrastructure, supporting the devised scalability goals and able to handle the target data volume in a timely manner. Global scaling techniques are supported by the architecture and fine-tuned in an optimal way. The system works on the target deployment infrastructure.

Aspects
• Now, Feb–Sep 2011 (M5–M12): N/A
• Sep 2011–Mar 2012 (M12–M18): Reliable messaging established between WP3 and WP4.
• Mar–Sep 2012 (M18–M24): 1 selected scaling scenario supported.
• Sep 2012–Sep 2013 (M24–M36): All scaling scenarios supported.

Throughput
• Now, Feb–Sep 2011 (M5–M12): N/A
• Sep 2011–Mar 2012 (M12–M18): N/A
• Mar–Sep 2012 (M18–M24): Scale the overall infrastructure to keep up with test data.
• Sep 2012–Sep 2013 (M24–M36): Scale the overall infrastructure to keep up with the live stream of data acquisition (around 68,000 documents daily).

Table 8: Scaling plan for Integration infrastructure
4. Conclusions
This document highlights the most important aspects of reaching system scalability, as devised in the project goals. It presents various techniques (both at the architectural level and at the level of individual technical components) and a roadmap for achieving the scalability goals. By following an incremental development process (divided into milestones with defined goals), it ensures that the progress towards scalability can be tracked and risks can be minimised.
We also demonstrated that the architectural decisions and the integration approach greatly support the scalability of the overall system, with the emphasis on the analytical pipeline. For example, the lightweight messaging approach applied to pipeline processing provides the flexibility needed to apply various scaling-out scenarios.
During further development, the combination of component scaling techniques and pipeline scaling scenarios will be applied to ensure that the FIRST system meets its performance goals.
References
Barbara, D., & others. (1997). The New Jersey data reduction report. Technical Committee on
Data Engineering , 20, 3-45.
Bondi, A. B. (2000). Characteristics of Scalability and Their Impact on performance.
Proceedings of the 2nd international workshop on Software and performance (pp. 195-203). ACM.
C. C. Aggarwal, J. H. (2003). A framework for clustering evolving data streams.
C. C. Aggarwal, J. H. (2004). A Framework for Projected Clustering of High Dimensional Data
Streams.
Cauwenberghs, G., & Poggio, T. (2001). Incremental and Decremental Support Vector Machine
Learning. Proceedings of NIPS 2001 .
Feng Cao, E. M. (2006). Density-based clustering over an evolving data stream with noise. SIAM
Conference on Data Mining.
FIRST D1.2 Usecase requirements specification. (2011).
FIRST D2.1 Technical requirements and state-of-the-art. (2011).
FIRST D2.2 Conceptual and technical integrated architecture design. (2011).
FIRST D4.1 First semantic information extraction prototype. (2011).
Fisher, D. H. (1987). Knowledge Acquisition Via Incremental Conceptual Clustering. Machine
Learning .
Grcar, M., Podpecan, V., Jursic, M., & Lavrac, N. (2010). Efficient Visualization of Document
Streams. Proceedings of Discovery Science 2010 (pp. 174–188). Canberra: Springer-Verlag Berlin Heidelberg.
IBM Rational Unified Process v7.0. (2008).
Ikonomovska, E., Gama, J., & Dzeroski, S. (2011). Learning model trees from evolving data streams. Data Min. Knowl. Discov., 23(1), 128-168.
Joachims, T. (2006). Training Linear SVMs in Linear Time. Proceedings of the ACM
Conference on KDD 2006 .
Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate Detection using Shallow Text
Features. Proceedings of The Third ACM International Conference on Web Search and
Data Mining, WSDM 2010. New York.
L. O. Callaghan, N. M. (2003). Streaming-data algorithms for high-quality clustering.
Liu YB, C. J. (Jan. 2008). Clustering text data streams. JOURNAL OF COMPUTER SCIENCE
AND TECHNOLOGY 23(1) , 112–128.
Manku, G. S., Jain, A., & Sarma, A. D. (2007). Detecting Near-Duplicates for Web Crawling.
Proceedings of WWW 2007.
N. Sahoo, J. C. (2006). Incremental hierarchical clustering of text documents. In Proceedings of
the ACM International Conference on Information and Knowledge Management (CIKM),
(pp. 357-366).
ØMQ (version 0.3) tests. (2011). Retrieved July 28, 2011, from ØMQ:
http://www.zeromq.org/results:0mq-tests-v03
Saveski, M., & Grcar, M. (2011). Web Services for Stream Mining: A Stream-Based Active
Learning Use Case. Proceedings of the PlanSoKD Workshop at ECML-PKDD 2011 .
Tong, S., & Koller, D. (2000). Support Vector Machine Active Learning with Applications to
Text Classification. Proceedings of ICML 2000 .
Tsymbal, A. (2004). The problem of concept drift: definitions and related work.
ZeroMQ. (2011). ØMQ - The Guide. Retrieved July 28, 2011, from ØMQ:
http://zguide.zeromq.org/page:all
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Conference on Management of Data.
Žnidaršič, M., Bohanec, M., & Zupan, B. (2008). Modelling impacts of cropping systems:
Demands and solutions for DEX methodology. European Journal of Operational Research
, 189, 594-608.
James Allan. Topic Detection and Tracking: Event-Based Information Organization. Kluwer
Academic Publishers, Norwell, MA, USA, 2002.
L. O. Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, "Streaming-data algorithms
for high-quality clustering," 2003.
Susan Havre, Elizabeth Hetzler, Paul Whitney, Lucy Nowell, “ThemeRiver: Visualizing thematic changes in large document collections”, IEEE Transactions on Visualization and Computer Graphics, 2002.
Feng Cao, Martin Ester, Weining Qian, Aoying Zhou,“Density-based clustering over an evolving
data stream with noise”, 2006. In 2006 SIAM Conference on Data Mining.
Ronen Feldman, James Sanger, The Text Mining Handbook: Advanced Approaches in
Analyzing Unstructured Data, 2007
Sanjoy Dasgupta, Daniel Hsu, “Hierarchical Sampling for Active Learning”, Proceedings of the
25th International Conference on Machine Learning, Finland, 2008.
Aggarwal CC (2006) Data streams: models and algorithms. Springer, New York
Breiman L, Friedman JH, Olshen RA, Stone CJ (1998) Classification and regression trees. CRC
Press, Boca Raton, FL
Chaudhuri P, Huang M, Loh W, Yao R (1994) Piecewise polynomial regression trees. Stat Sin
4:143-167
Chen Y, Dong G, Han J, Wah BW, Wang J (2002) Multi-dimensional regression analysis of
time-series data streams. In: Proc the 28th int conf on very large databases. Morgan
Kaufmann, San Francisco, pp 323-334
Dobra A, Gherke J (2002) SECRET: a scalable linear regression tree algorithm. In: Proc 8th
ACM SIGKDD int conf on knowledge discovery and data mining. ACM Press, New York,
pp 481-487
Domingos P, Hulten G (2000) Mining high speed data streams. In: Proc 6th ACM SIGKDD int
conf on knowledge discovery and data mining. ACM Press, New York, pp 71-80
Gama J, Rocha R, Medas P (2003) Accurate decision trees for mining high-speed data streams.
In: Proc 9th ACM SIGKDD int conf on knowledge discovery and data mining. ACM
Press, New York, pp 523-528
Gama J, Medas P, Rocha R (2004) Forest trees for on-line data. In: Proc 2004 ACM symposium
on applied computing. ACM Press, New York, pp 632-636
Gao J, Fan W, Han J, Yu PS (2007) A general framework for mining concept-drifting data
streams with skewed distributions. In: Proc 7th int conf on data mining, SIAM,
Philadelphia, PA
Gratch J (1996) Sequential inductive learning. In: Proc 13th natl conf on artificial intelligence
and 8th innovative applications of artificial intelligence conf, vol 1. AAAI Press, Menlo
Park, CA, pp 779- 786
Hoeffding W (1963) Probability inequalities for sums of bounded random variables. J Am Stat Assoc 58:13-30
Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proc 7th ACM
SIGKDD int conf on knowledge discovery and data mining. ACM Press, New York, pp
97-106
Ikonomovska E, Gama J, Dzeroski S (2011). Learning model trees from evolving data streams.
Data Min. Knowl. Discov. 23(1), pp 128-168
Jin R, Agrawal G (2003) Efficient decision tree construction on streaming data. In: Proc 9th
ACM SIGKDD int conf on knowledge discovery and data mining. ACM Press, New York,
pp 571-576
Karalic A (1992) Employing linear regression in regression tree leaves. In: Proc 10th European
conf on artificial intelligence. Wiley, New York, pp 440-441
Loh W (2002) Regression trees with unbiased variable selection and interaction detection. Stat Sin 12:361-386
Malerba D, Appice A, Ceci M, Monopoli M (2002) Trading-off local versus global effects of
regression nodes in model trees. In: Proc 13th int symposium on foundations of intelligent
systems, LNCS, vol 2366. Springer, Berlin, pp 393-402
Musick R, Catlett J, Russell S (1993) Decision theoretic sub-sampling for induction on large
databases. In: Proc 10th int conf on machine learning. Morgan Kaufmann, San Francisco,
pp 212-219
Pfahringer B, Holmes G, Kirkby R (2008) Handling numeric attributes in Hoeffding trees. In:
Proc 12th Pacific-Asian conf on knowledge discovery and data mining, LNCS, vol 5012.
Springer, Berlin, pp 296-307
Potts D, Sammut C (2005) Incremental learning of linear model trees. J Mach Learn 61:5-48. doi:10.1007/s10994-005-1121-8
Quinlan JR (1992) Learning with continuous classes. In: Proc 5th Australian joint conf on
artificial intelligence. World Scientific, Singapore, pp 343-348
Rajaraman K, Tan AH (2001) Topic detection, tracking, and trend analysis using self-organizing
neural networks. In: Proc 5th Pacific-Asian conf on knowledge discovery and data mining,
LNCS, vol 2035. Springer, Berlin, pp 102-107
Siciliano R, Mola F (1994) Modeling for recursive partitioning and variable selection. In: Proc
int conf on computational statistics. Physica Verlag, Heidelberg, pp 172-177
Torgo L (1997) Functional models for regression tree leaves. In: Proc 14th int conf on machine
learning. Morgan Kaufmann, San Francisco, pp 385-393
VFML (2003) A toolkit for mining high-speed time-changing data streams. http://www.cs.washington.edu/dm/vfml. Accessed 19 Jan 2010
Vogel DS, Asparouhov O, Scheffer T (2007) Scalable look-ahead linear regression trees. In:
Berkhin P, Caruana R, Wu X (eds) Proc 13th ACM SIGKDD int conf on knowledge
discovery and data mining, KDD. ACMK, San Jose, CA, pp 757-764
WEKA 3 (2005) Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka. Accessed 19 Jan 2010
Widmer G, Kubat M (1996) Learning in the presence of concept drifts and hidden contexts. J Mach Learn 23:69-101. doi:10.1007/BF00116900
Annex 1. Clustering for topic and trend detection
Annex a. Introduction
In this section, we focus on the problem of clustering an online data stream of text documents
with the aim of visualizing and discovering topics and trends. It is infeasible, due to the real-time
nature of the problem, to store such a stream of documents and process it with better known
offline clustering methods. Therefore, we rely on the more novel streaming clustering methods.
Several online clustering methods are described. Further on, we describe topics and trends in
the data, in our case obtained from the hierarchical clustering of documents. Topics and trends
are presented to the user by employing appropriate visualization techniques, such as dendrograms and the canyon flow. The presentation is made interactive by incorporating these visualization techniques into the user interface.
Annex b. Document streams
A data stream represents an ordered (usually temporally ordered) sequence of data items (e.g., user click streams, network data packets, published text documents). We are concerned with text documents (news and blogs) as the incoming data items. The number of documents is unbounded and the time between publications is uneven. Web pages are usually obtained through an RSS reader and processed into pure text documents. As the time of acquisition and the time of publication of a document are inconsistent, we can only achieve an approximate temporal ordering of the incoming documents. One solution is to employ a preliminary buffer to sort the documents by their publication date; this may be inappropriate due to time constraints. Nevertheless, a sliding time window of fixed size may be used, in the manner of a first-in-first-out queue, for document clustering.
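A minimal sketch of such a buffer (class and method names are illustrative): incoming documents are kept in a priority queue ordered by their publication time, and a document is only released to the clustering step once it falls outside the sliding time window, so that late arrivals within the window are still emitted in approximately correct order.

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class ReorderingBuffer<D> {

    private final Duration window;   // e.g. Duration.ofMinutes(10)
    private final PriorityQueue<TimedDocument<D>> buffer =
            new PriorityQueue<>(Comparator.comparing((TimedDocument<D> td) -> td.published));

    public ReorderingBuffer(Duration window) {
        this.window = window;
    }

    /** Buffer a newly acquired document together with its publication time. */
    public void add(D document, Instant published) {
        buffer.add(new TimedDocument<>(document, published));
    }

    /** Release, in publication order, all documents that are older than the sliding window. */
    public List<D> drainOlderThan(Instant now) {
        List<D> released = new ArrayList<>();
        Instant cutOff = now.minus(window);
        while (!buffer.isEmpty() && buffer.peek().published.isBefore(cutOff)) {
            released.add(buffer.poll().document);
        }
        return released;
    }

    private static final class TimedDocument<D> {
        final D document;
        final Instant published;

        TimedDocument(D document, Instant published) {
            this.document = document;
            this.published = published;
        }
    }
}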
Annex c. Clustering document streams
To handle an indefinite, unevenly distributed incoming stream of text documents, we employ streaming (online) clustering algorithms. These online algorithms process documents without access to either the whole past or the future of the document stream, whereas offline algorithms cluster a completely known data set. We have made an overview of streaming clustering algorithms suitable for text document streams.
We can define several application-specific requirements that text stream clustering algorithms should comply with. The algorithm should not need to be given the actual number of clusters, as it is unknown and changes over time. As the problem of clustering n documents into k clusters is NP-hard, the algorithm should scale well with the number of data items (text documents) and the number of dimensions (terms). It should also support clusters of arbitrary size and shape (e.g., different from a hyper-sphere). As an outlier document may represent a newly emerging topic (topic drift), forming a singleton cluster for an outlier is important. Each cluster is expected to reflect a meaningful topic, with its sub-clusters as sub-topics (in the case of hierarchical clustering). One document can often be related to several topics (i.e., fit between several clusters); therefore, soft (fuzzy) clustering could also be considered.
The issues that require special consideration in the stream clustering are the topic (concept)
drift (i.e., evolving data) and the time constraints. The problem of concept drift (Tsymbal, 2004)
arises naturally from the real world, where concepts (topics) are often not stable but change
with time, thus making the model built on previous data obsolete. Concept drift is seen in
clustering in the form of new emerging clusters, which can often be confused with outliers, or
existing clusters being reduced in size.
The two most rudimentary clustering methods are also the two most researched and improved
ones – k-means and k-medians. K-means, being fairly simple, produces a solution that is only guaranteed to be a local optimum and is also sensitive to outliers. It also requires random
access to the data, which is inherently inappropriate for the streams. K-medians selects one of
the documents for the cluster representative and is thus less sensitive to outliers at the cost of
computational complexity. These algorithms also require the number of clusters to be given as a
parameter, which is unsuitable for an evolving data stream.
COBWEB (Fisher, 1987) is an incremental method for hierarchical conceptual clustering.
Although it was originally designed for categorical data, (N. Sahoo, 2006) describes its usage
for text documents. A notion of hierarchical clustering includes the clustering problem and also
the problem of determining useful concepts for each cluster. Hierarchical clustering of data into
a classification tree is performed as a hill-climbing bidirectional search through a space of
classification trees utilizing four basic operators, namely merging, splitting, inserting and
passing of nodes. Although the method is both incremental and computationally feasible, it is
not well suited for data streams. Its main disadvantage is its memory-consuming, unbalanced tree structure.
BIRCH (Zhang, 1996) is arguably the most basic method among these. It was intended for very large offline datasets and can therefore be used on streams only to some extent. In its two steps, the BIRCH method builds a tree with information about clusters in a single pass through the data and then refines the tree by removing the sparse clusters as outliers. The information about each cluster is contained in a clustering feature triple of the CF (clustering feature) tree. The limitations of the method are the sensitive threshold for the number of documents a cluster must contain in order not to be regarded as an outlier, and the radial size of the cluster, which may present a problem if a cluster of documents is not ovally shaped. There are no guarantees about the SSE (sum of squared errors) of its performance. The method does not directly support topic drift or the removal of outdated documents.
STREAM (L. O. Callaghan, 2003) is the first method designed especially for stream clustering. It
is also based on two steps, where the documents are first clustered in a k-median way weighted
by the number of documents in a cluster and secondly, the medians are clustered up to a
hierarchy. The main disadvantages of this method are its time complexity and its poor handling of evolving data.
Both BIRCH and STREAM are inappropriate for evolving data (topic drift) as they generate
clusters based on the history of the whole dataset.
CluStream (C. C. Aggarwal J. H., 2003) is the first one to give more attention to evolving data
(topic drift) and outliers. It has two base steps. One is an online micro-clustering component,
which stores summary statistics about the streaming data in a manner of snapshots in time and
the other is an off-line macro-clustering component, which uses the stored summary statistics in
conjunction with the user input to build real data clusters for a given period of time. Such a two-phase approach gives significant insight to the user. Its disadvantage is the time complexity of adding documents to and removing documents from the model, which is linearly dependent on the number of clusters in the model. Also, its predefined number of micro-clusters is inappropriate for evolving data.
HPStream (C. C. Aggarwal J. H., 2004) is a method of streaming clustering, specialized for
high-dimensional data. It uses a fading cluster structure method and the projection-based
clustering. It outperforms CluStream in cluster purity by about 20%, at the cost of speed.
However, it cannot detect clusters of arbitrary orientations.
DenStream (Feng Cao, 2006) has micro-cluster structures, which successfully summarize
clusters of arbitrary size. A novel pruning strategy gets rid of the outliers while it enables the
growth of the new clusters. The purity of the clusters is on average 20% better than with
CluStream.
OCTSM (Liu YB, Jan. 2008) uses semantic smoothing adjusted to stream clustering, which
proves to be better than the more common TF-IDF scheme for clustering text documents
with respect to their semantic similarity. It employs a fading function (aging) to account for the
evolution of the data stream. A novel cluster statistics structure named cluster profile is
introduced. The cluster profile captures real-time semantics of text data.
ClusTree (Kranen et al, 2009) is a parameter-free algorithm capable of adapting to the speed of
the input stream and detecting concept-drift and outliers in the stream. A ClusTree itself is a
compact and adaptive index structure, which maintains stream summaries. Aging of data items
is incorporated to reflect the greater importance of the newer data.
StreamKM++ (Ackermann, 2010) employs coresets (weighted sets) for non-uniform sampling of
the data stream based on the k-means++ procedure. Fast computation of the coresets is
enabled by building the coreset tree, which is a binary tree associated with hierarchical divisive
clustering. Its advantages are suitability for large number of clusters and scalability with the
number of dimensions. Although the method is slower than BIRCH, it creates significantly better
clusters in terms of the sum of squared error measure.
Annex d. Topic detection
In the task of topic detection, one wants to detect meaningful groups (clusters) of text
documents that are related to the same topic. The task arises from the concept drift, present in
evolving streams of text documents. We are given no prior information about the number or
names of the topics. The definition of a topic is “a seminal event or activity, along with all directly
related events and activities” (Allan, 2002) or “something nontrivial happening in a certain time
at a certain place”. As the notion of a topic includes all the related events and activities, it is
reasonable to base it on hierarchical cluster (commonly represented as a dendrogram), where
each cluster represents a topic with more specific related topics in the child clusters below.
Emerging and disappearing topics are shown as growing and shrinking clusters, respectively.
Similarity between the documents represented with stemmed-term frequencies is based on the
vector space cosine similarity. Consequently, this makes the clustering independent of the
domain and most of the languages. Alternatively, a semantic smoothing as in the OCTSM (Liu,
Cai, Yin, Fu, 2008) can be used instead of the TF-IDF scheme.
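For illustration, a sketch of the cosine similarity computation on sparse stemmed-term frequency vectors (represented here simply as maps from term to weight) is given below.

import java.util.Map;

public final class CosineSimilarity {

    private CosineSimilarity() { }

    /** Cosine similarity between two sparse term-weight vectors (0 if either vector is empty). */
    public static double similarity(Map<String, Double> a, Map<String, Double> b) {
        // Iterate over the smaller vector to keep the cost proportional to its size.
        Map<String, Double> smaller = a.size() <= b.size() ? a : b;
        Map<String, Double> larger = smaller == a ? b : a;

        double dot = 0.0;
        for (Map.Entry<String, Double> entry : smaller.entrySet()) {
            Double other = larger.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    private static double norm(Map<String, Double> vector) {
        double sumOfSquares = 0.0;
        for (double weight : vector.values()) {
            sumOfSquares += weight * weight;
        }
        return Math.sqrt(sumOfSquares);
    }
}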
Annex e. Trend detection
In general, a trend is a long-term temporal variation in the statistical properties of a process. In
our case, a trend describes a change in the number of documents related to some topic over a
long-enough period of time. A positive trend means that a topic, either existing or emerging, is
gaining in the number of documents. In contrast, a negative trend corresponds to a decrease in
the number of documents related to a topic over a long-enough period of time. Trends can be
identified and analyzed by observing derivatives of the function of topic strength. On the other
hand, topics and trends can be nicely visualized with a ThemeRiver-like algorithm called
Canyon Flow. Technically speaking, the algorithm provides a “view” of a hierarchy of document
clusters. The underlying algorithm is essentially a hierarchical bisecting clustering algorithm
employed on the bag-of-words representation of documents. We briefly discuss trend
visualization in the following section.
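As a small illustration (under the simplifying assumption that topic strength is measured as the number of documents per fixed time bin), a trend can be estimated as the slope of a least-squares line fitted to the recent values of that series; a clearly positive slope indicates an emerging topic and a clearly negative one a disappearing topic.

public final class TrendEstimator {

    private TrendEstimator() { }

    /**
     * Least-squares slope of the topic strength series (documents per time bin).
     * Positive values indicate a growing topic, negative values a shrinking one.
     */
    public static double slope(double[] countsPerBin) {
        int n = countsPerBin.length;
        if (n < 2) {
            return 0.0;
        }
        double meanX = (n - 1) / 2.0;   // bins are indexed 0, 1, ..., n-1
        double meanY = 0.0;
        for (double y : countsPerBin) {
            meanY += y;
        }
        meanY /= n;

        double covXY = 0.0;
        double varX = 0.0;
        for (int x = 0; x < n; x++) {
            covXY += (x - meanX) * (countsPerBin[x] - meanY);
            varX += (x - meanX) * (x - meanX);
        }
        return covXY / varX;
    }
}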
Annex f. Visualization
The clustered stream of text documents is visualized with the aim to easily identify and analyze
topics and trends. One of the most intuitive representations of hierarchical document clustering,
despite the quadratic time usually needed to construct it, is certainly the dendrogram. In it, lower levels represent clusters containing conceptually more similar documents, and upper levels more distant ones. Thus, clusters in the lower levels of the dendrogram are more specialized
topics, whereas the upper ones are more general.
There are very few methods for the stream clustering visualization (Silic, 2010). They are mostly
applied to lower dimensional data or based on projection to lower dimensional space.
Nevertheless, we are more interested in changes of topic over time, rather than document
clustering itself. To convey the change of topic strength over time, a (multi)line chart is more
appropriate. Each different line in the chart shows the number or percentage of the documents
related to a specific topic at a specific time. A canyon flow chart, referred to as ThemeRiver in
(Havre et al, 2002), lends itself even better to display trends over a certain period in time. Each
(differently colored) region corresponds to the percentage of documents related to that topic
during a specific time period. Ideally, the user is able to interactively select a specific region in
the canyon flow chart and explore it, which corresponds to going down from a cluster to its sub-clusters in the dendrogram. Also, each region could be colored with different shades of the same color so as to present its sub-clusters. Similar techniques are NewsRiver, LensRiver,
EventRiver, briefly explained in (Silic, 2010).
Annex g. Clustering for active learning
A convenient application of hierarchical clustering is also in active learning (Dasgupta & Hsu,
2008). Active learning is a machine-learning technique designed to actively query the domain
expert for new labels by putting forward data instances that, upon being labelled, contribute
most to the model being built. As annotating the dataset is costly, it is important to make it as
efficient as possible in terms of time and number of items annotated. Simple random sampling
requires sampling too much of the items to be accurate enough. An intuitive margin-based
heuristics might be used to bias the sampling of the data items towards the decision boundary
to make it more precise. This method does not converge well for complicated boundaries.
Hierarchical clustering provides the necessary granularity to approximate the decision
boundaries, which do not always align with the clusters of the data (i.e., clusters do not always
contain only one class). Clusters are hierarchically queried top-down. More samples are taken
from more numerous and impure clusters. Those clusters which are pure enough are pruned
and not queried any more as they supposedly comprise equally labeled items. Sampling stops
when all the clusters are pruned. The resulting dataset is labeled significantly better than it
would be with margin-based sampling or simple random sampling. In addition, (Dasgupta &
Hsu, 2008) in their experiments prefer using Latent Dirichlet Allocation to create document topic
mixture models and Kullback-Leibler divergence as the notion of distance between documents
over the common TF-IDF scheme for textual data.
Annex h. Conclusions
Mining streaming data is a relatively novel research field. Detecting topics and changes in such
real-time data makes stream clustering even more challenging. An overview of both original and
novel methods appropriate for stream clustering was presented and application-specific
requirements were stated. Unfortunately, not many experiments were presented for textual data
in the related articles. Further testing of different algorithms on textual streams will help
determine the most suitable one. Efficient topic detection depends mostly on the clustering
algorithm's ability to distinguish outliers from emerging clusters and should be taken into
consideration. To easily spot emerging and disappearing topics, efficient visualization is
indispensable. Many interactive river-like charts, depicting topics as differently colored regions
are used for this purpose.
Annex 2. Learning model trees from data streams
Annex a. Introduction
In the last decade, data streams (Aggarwal 2006) have been receiving growing attention, due to the broad recognition of applications emerging in many areas.
Examples include financial applications (stock exchange transactions), telecommunication data
management (call records), Web applications (customers click stream data), surveillance
(audio/video data), bank-system management (credit card/ATM transactions, etc.), monitoring
patient health, and many others. Such data are typically represented as evolving time series of
data items, arriving continuously at high speeds, and having dynamically changing distributions.
Modeling and predicting the temporal behavior of streaming data can provide valuable
information contributing to the success of time-critical operations.
The task of regression analysis is one of the most commonly addressed topics in the areas of
machine learning and statistics. Regression and model trees are often used for this task due to
their interpretability and good predictive performance. However, regression on time-changing
data streams is a relatively unexplored and typically nontrivial problem. Fast and continuous
data feeds as well as the time-changing distributions make traditional regression tree learning
algorithms unsuitable for data streams. This section describes an efficient and incremental
algorithm for learning regression and model trees from possibly unbounded, high-speed, and
time-changing data streams. There are four main features of the algorithm: an efficient splitting-attribute selection in the incremental growing process; an effective approach for computing the
linear models in the leaves; an efficient method for handling numerical attributes; and change
detection and adaptation mechanisms embedded in the learning algorithm.
Annex b. Related work
We first consider related work on batch and incremental learning of model trees. We then turn
our attention to related methods for learning from stationary data streams: learning
decision/classification trees and linear regression/neural networks. Finally, we take a look at
methods for on-line change detection and management of concept drift.
Batch learning of model trees
Regression and model trees are known to provide efficient solutions to complex nonlinear
regression problems due to the divide-and-conquer approach applied to the instance-space.
Their main strength in solving a complex problem lies in recursively fitting different models in
each subspace of the instance-space. Regression trees use the mean for the target variable as
the prediction for each sub-space. Model trees improve upon the accuracy of regression trees
by using more complex models in the leaf nodes.
The splitting of the instance-space is performed recursively by choosing a split that maximizes
some error reduction measure, with respect to the examples that are assigned to the current
region (node). In one category are algorithms like M5 (Quinlan 1992), CART (Breiman et al.
1998), HTL (Torgo 1997). These use variants of the variance reduction measure used in CART,
like standard deviation reduction or the fifth root of the variance in the WEKA (WEKA 3 2005)
implementation of the M5 algorithm. Another category of algorithms are those aiming to find
better globally optimal partitions by using more complex error reduction methods at the cost of
increased computational complexity. Examples are RETIS (Karalic 1992), SUPPORT
(Chaudhuri et al. 1994), SECRET (Dobra and Gherke 2002), GUIDE (Loh 2002), SMOTI
(Malerba et al. 2002) and LLRT (Vogel et al. 2007).
All existing batch approaches to building regression/model trees assume that the training set is
finite and stationary. For this reason, they require all the data for training to be available on the
disk or in the main memory before the learning process begins. When given very large training
sets, batch algorithms have shown to be prohibitively expensive both in memory and time. In
this spirit, several efforts have been made in speeding up learning on large datasets. Notable
examples are SECRET (Dobra and Gherke 2002) and the LLRT algorithm (Vogel et al. 2007).
Their main weakness is that they require storing all the examples in the main memory. This
becomes a major problem when datasets are larger than the main memory. In such situations,
users are forced to do sub-sampling or apply other data reduction methods, which is nontrivial
because of the danger of underfitting. Another characteristic of large data is that it is typically
collected over a long time period or generated rapidly by continuous, possibly distributed
sources of data. In both of these scenarios, there is a high probability of non-stationary
relations, which in learning problems takes the form of concept drift. We have noted that none of
the existing batch algorithms for learning regression trees is able to deal with concept drift.
Incremental learning of model trees
Mining data streams raises many new problems previously not encountered in data mining. One
crucial issue is the real-time response requirement, which severely constrains the use of
complex data mining algorithms that perform multiple passes over the data. Although regression
and model trees are an interesting and efficient class of learners, little research has been done
in the area of incremental regression or model tree induction.
To the best of our knowledge, there is only one paper (Potts and Sammut 2005) addressing the
problem of incremental learning of model trees. The authors follow the method proposed by
Siciliano and Mola (1994), applying it in an incremental way.
Learning decision trees from stationary data streams
The problem of incremental decision tree induction has received considerable attention within the
data mining community. There is a large literature on incremental decision tree learning, but our
focus is on the line of research initiated by Musick et al. (1993), which motivates sampling
strategies for speeding up the learning process. They note that only a small sample from the
distribution might be enough to confidently determine the best splitting attribute. Example
algorithms from this line of research are the Sequential ID3 (Gratch 1996), VFDT (Domingos
and Hulten 2000), UFFT (Gama et al. 2003), and the NIP-H and NIP-N algorithms (Jin and
Agrawal 2003).
Other regression methods in stream mining
One of the most notable and successful examples of regression on data streams is the multidimensional linear regression analysis of time-series data streams (Chen et al. 2002). It is
based on the OLAP technology for streaming data. This system enables an online computation
of linear regression over multiple dimensions and tracking unusual changes of trends according
to the user's interest.
Some attempts have been also made in applying artificial neural networks over streaming data.
In Rajaraman and Tan (2001), the authors address the problems of topic detection, tracking and
trend analysis over streaming data. The incoming stream of documents is analyzed by using
Adaptive Resonance Theory (ART) networks.
On-line change detection and management of concept drift
The nature of change in streams is diverse. Changes may occur in the context of learning due
to changes in hidden variables or changes in the intrinsic properties of the observed variables.
Often these changes make the model built on old data inconsistent with new data, and regular
updating of the model is necessary.
As Gao et al. (2007) have noted, the joint probability, which represents the data distribution P(x,
y) = P(y|x) * P(x), can evolve over time in three different ways: (1) changes in P(x) known as
virtual concept drift (sampling shift); (2) changes in the conditional probability P(y|x); and (3)
changes in both P(x) and P(y|x). We are in particular interested in detecting changes in the
conditional probability, which in the literature is usually referred to as concept drift. Further, a
change can occur abruptly or gradually, leading to abrupt or gradual concept drift.
With respect to the region of the instance space affected by a change, concept drift can be
categorized as local or global. In the case of local concept drift, the distribution changes only
over a constrained region of the instance space (set of ranges for the measured attributes). In
the case of global concept drift, the distribution changes over the whole region of the instance
space, that is, for all the possible values of the target/class and the attributes.
Annex c. The FIMT-DD algorithm
The problem of learning model trees from data streams raises several important issues typical
for the streaming scenario. First, the dataset is no longer finite and available prior to learning.
As a result, it is impossible to store all the data in memory and learn from them as a whole.
Second, multiple sequential scans over the training data are not allowed. An algorithm must
therefore collect the relevant information at the speed it arrives and incrementally make splitting
decisions. Third, the training dataset may consist of data from several different distributions.
Thus the model needs continuous monitoring and updating whenever a change is detected. We
have developed an incremental algorithm for learning model trees to address these issues,
named Fast Incremental Model Trees with Drift Detection (FIMT-DD).
The algorithm starts with an empty leaf and reads examples in the order of arrival. Each example is passed down the tree to a leaf, where the necessary statistics are updated. Given the first portion of instances, the algorithm finds the best split for each attribute and then ranks the attributes according to some evaluation measure. If the splitting criterion is satisfied, it makes a split on the best attribute, creating two new leaves, one for each branch of the split. New instances arriving at a recently created split are passed down the branch corresponding to the outcome of the split test on their attribute values. The change detection tests are updated with every example from the stream. If a change is detected, an adaptation of the tree structure is performed.
Splitting criterion
In the literature, several authors have studied the problem of efficient feature, attribute, or model
selection over large databases. The idea was first introduced by Musick et al. (1993) under the
name of decision theoretic sub-sampling, with an immediate application to speed up the basic
decision tree induction algorithm. One of the solutions they propose, which is relevant for our
work, is the utilization of the Hoeffding bound (Hoeffding 1963) in the attribute selection process
in order to decide whether the best attribute can be confidently chosen on a given subsample.
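A minimal sketch of this decision rule is given below (the merit of a candidate split, e.g. its standard deviation reduction, is computed elsewhere): given the range R of the merit values, a confidence parameter δ and n observed examples, the Hoeffding bound guarantees that the observed mean differs from the true mean by at most ε = sqrt(R² ln(1/δ) / (2n)) with probability 1 − δ, so the best attribute can be selected as soon as its advantage over the second-best one exceeds ε.

public final class HoeffdingSplitDecision {

    private HoeffdingSplitDecision() { }

    /** Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2 * n)). */
    public static double hoeffdingBound(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /**
     * Returns true if the best candidate split can be confidently selected,
     * i.e. its merit exceeds that of the second-best split by more than epsilon.
     */
    public static boolean canSplit(double bestMerit, double secondBestMerit,
                                   double range, double delta, long n) {
        double epsilon = hoeffdingBound(range, delta, n);
        return (bestMerit - secondBestMerit) > epsilon;
    }
}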
Numerical attributes
The efficiency of the split selection procedure is highly dependent on the number of possible
split points. For numerical attributes with a large number of distinct values, both memory and
computational costs can be very high. The common approach in the batch setting is to perform
a preprocessing phase, typically partitioning the range of numerical attributes (discretization).
This requires an initial pass of the data prior to learning, as well as sorting operations.
Preprocessing is not an option with streaming data and sorting can be very expensive. The
range of possible values for numerical attributes is also unknown and can vary in case of
sampling shift. For classification tasks on data streams, a number of interesting solutions have
been proposed: on-line discretization (with a pre-specified number of bins) (Domingos and
Hulten 2000), Gaussian-based methods for two-class problems (Gama et al. 2004), an equi-width adaptation to multi-class problems (Pfahringer et al. 2008), and an exhaustive method based on binary search trees (Gama et al. 2003). They are either sensitive to skewed distributions or appropriate only for classification problems. We have developed a time-efficient method for handling numerical attributes based on an E-BST structure, which is an
adaptation of the exhaustive method proposed in Gama et al. (2003), tailored for regression
trees.
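The following sketch illustrates the kind of bookkeeping involved (simplified to a sorted map rather than the actual E-BST): for every distinct value of a numerical attribute, counts, sums and sums of squares of the target are maintained incrementally, so that the standard deviation reduction of any candidate split point can be computed without revisiting the examples.

import java.util.Map;
import java.util.TreeMap;

public class NumericSplitStatistics {

    private static final class Stats {
        long count;
        double sum;
        double sumOfSquares;
    }

    // One entry per distinct attribute value, kept in sorted order.
    private final TreeMap<Double, Stats> perValue = new TreeMap<>();

    /** Update the statistics with one example (attribute value and target value). */
    public void update(double attributeValue, double target) {
        Stats stats = perValue.computeIfAbsent(attributeValue, v -> new Stats());
        stats.count++;
        stats.sum += target;
        stats.sumOfSquares += target * target;
    }

    /** Standard deviation reduction of the candidate split "attribute <= splitPoint". */
    public double sdReduction(double splitPoint) {
        long nLeft = 0, nRight = 0;
        double sumLeft = 0, sumRight = 0, sqLeft = 0, sqRight = 0;
        for (Map.Entry<Double, Stats> entry : perValue.entrySet()) {
            Stats s = entry.getValue();
            if (entry.getKey() <= splitPoint) {
                nLeft += s.count; sumLeft += s.sum; sqLeft += s.sumOfSquares;
            } else {
                nRight += s.count; sumRight += s.sum; sqRight += s.sumOfSquares;
            }
        }
        long n = nLeft + nRight;
        if (nLeft == 0 || nRight == 0) {
            return 0.0; // the split does not separate the examples at all
        }
        double sdAll = sd(n, sumLeft + sumRight, sqLeft + sqRight);
        return sdAll - (nLeft / (double) n) * sd(nLeft, sumLeft, sqLeft)
                     - (nRight / (double) n) * sd(nRight, sumRight, sqRight);
    }

    private static double sd(long n, double sum, double sumOfSquares) {
        double mean = sum / n;
        return Math.sqrt(Math.max(0.0, sumOfSquares / n - mean * mean));
    }
}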
Linear models in leaves
Existing batch approaches compute the linear models either in the pruning phase or in the growing phase. In the latter approach, the algorithms need to perform heavy computations
necessary for maintaining the pre-computed linear models for every possible split point. While
efforts have been made in reducing the computational complexity, we observe that none of the
proposed methods would be applicable when dealing with high speed data streams, which are
described by many numerical attributes having large domains of unique values. For this reason,
we propose the most lightweight method for inducing linear models, based on the idea of on-line
training of perceptrons. The trained perceptrons will represent the linear models fitted
separately in each sub-space of the instance-space.
An important difference between our proposed method and the batch ones is that the process of
learning linear models in the leaves will not explicitly reduce the size of the regression tree. The
split selection process is invariant to the existence of linear models in the leaves. However, if
the linear model fits well to the examples assigned to the leaf, no further splitting would be
necessary and pre-pruning can be applied.
The basic idea is to train perceptrons in the leaves of the tree by updating the weights after
each consecutive example. We use the simplest approach: no attribute selection is performed.
All the numerical attributes are included in the regression equation which is represented by a
perceptron without an activation function. The weights of the links are the parameters of the
linear equation.
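A minimal sketch of such a leaf model is given below (the fixed learning rate and the absence of attribute normalization are simplifications): the perceptron has no activation function, so its output is a plain weighted sum of the attribute values, and after each example its weights are adjusted with the delta rule.

public class LeafPerceptron {

    private final double[] weights;   // one weight per numerical attribute
    private double bias;
    private final double learningRate;

    public LeafPerceptron(int numberOfAttributes, double learningRate) {
        this.weights = new double[numberOfAttributes];
        this.learningRate = learningRate;
    }

    /** Prediction is a plain weighted sum (no activation function). */
    public double predict(double[] attributes) {
        double output = bias;
        for (int i = 0; i < weights.length; i++) {
            output += weights[i] * attributes[i];
        }
        return output;
    }

    /** Delta-rule update after each example that reaches this leaf. */
    public void update(double[] attributes, double target) {
        double error = target - predict(attributes);
        for (int i = 0; i < weights.length; i++) {
            weights[i] += learningRate * error * attributes[i];
        }
        bias += learningRate * error;
    }
}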
Change detection
When local concept drifts occur, most of the existing methods discard the whole model simply
because its accuracy on the current data drops. Despite the drop in accuracy, parts of the
model can still be good for the regions not affected by the drift. In such situations, we propose to
update only the affected parts of the model. An example of a system that possesses this
capability is the CVFDT system (Hulten et al. 2001). In CVFDT, splitting decisions are
repeatedly re-evaluated over a window of most recent examples. This approach has a major
drawback: maintaining the necessary counts for class distributions at each node requires a
significant amount of additional memory and computation (especially when the tree becomes
large). We address this problem by using a lightweight on-line change detection test for
continuous signals.
Discussion of the algorithm design choices
The FIMT-DD algorithm is based on a compromise between the accuracy achieved by a model
tree and the time required to learn the model tree. It therefore offers approximate solutions in
real-time. For making splitting decisions, any method can be used that has high confidence in
choosing the best attribute, given the observed data. The Hoeffding bound was chosen due to
its nice statistical properties and its independence of the underlying data distribution. The
growing process is stable because the splitting decisions are supported statistically, so the risk
of overfitting is low. This is an advantage over batch algorithms, where splits in the lower levels
of the tree are chosen using smaller subsets of the data.
To ensure the any-time property of the model tree, we chose perceptrons as linear models in
the leaves. This approach does not reduce the size of the model tree, but improves its accuracy
by reducing the bias as well as the variance component of the error.
The choice of the change detection mechanism was supported by three arguments: the method
is computationally inexpensive, performs explicit change detection, and enables local granular
adaptation. Change detection requires the setting of several parameters, which enable the user
to tune the level of sensitivity to changes and the robustness.
Annex d. Conclusions
In this section, we presented an algorithm for learning model trees from time-changing data
streams. To the best of our knowledge, FIMT-DD is the first algorithm for learning model trees
from time-changing data streams with explicit drift detection. The algorithm is able to learn very
fast (in a very short time per example) and the only memory it requires is for storing sufficient
statistics at tree leaves. The model tree is available for use at any time in the course of learning,
offering an excellent processing and prediction time per example.
In terms of accuracy, FIMT-DD is competitive with batch algorithms even for medium-sized
datasets and has smaller values for the variance component of the error. It effectively maintains
an up-to-date model even in the presence of different types of concept drifts. The algorithm
enables local change detection and adaptation, avoiding the costs of re-growing the whole tree
when only local changes are necessary.