Project Acronym: FIRST
Project Title: Large scale information extraction and integration infrastructure for supporting financial decision making
Project Number: 257928
Instrument: STREP
Thematic Priority: ICT-2009-4.3 Information and Communication Technology

Deliverable: D2.3 Scaling Strategy

Work Package: WP2 – Technical analysis, scaling strategy and architecture
Due Date: 30/09/2011
Submission Date: 30/09/2011
Start Date of Project: 01/10/2010
Duration of Project: 36 Months
Organisation Responsible for Deliverable: ATOS
Version: 1.0
Status: Final Version
Author(s): Mateusz Radzimski, Murat Kalender (ATOS), Miha Grcar, Marko Brakus, Igor Mozetic, Elena Ikonomovska, Saso Dzeroski (JSI), Markus Gsell (IDMS), Tobias Hausser (UHOH), Joao Gama
Reviewer(s): Tomas Pariente (ATOS), Michael Siering, Mykhalio Saienko (GUF)
Nature: R – Report, P – Prototype, D – Demonstrator, O – Other
Dissemination level: PU – Public, CO – Confidential (only for members of the consortium, including the Commission), RE – Restricted to a group specified by the consortium (including the Commission Services)
Project co-funded by the European Commission within the Seventh Framework Programme (2007-2013)

Revision history

Version | Date | Modified by | Comments
0.1 | 11/07/2011 | Mateusz Radzimski (ATOS) | First version of ToC provided
0.2 | 3/08/2011 | Murat Kalender (ATOS) | First contribution to "Analytical pipeline scaling techniques"
0.3 | 5/08/2011 | Mateusz Radzimski (ATOS) | First contributions to "Scaling strategy outline" and "common scaling plan"
0.4 | 9/08/2011 | Markus Gsell (IDMS) | First contribution to "Information integration services"
0.5 | 10/08/2011 | Mateusz Radzimski (ATOS), Murat Kalender (ATOS) | Further contributions to "Global Scaling Strategy" chapter. Minor editorial changes.
0.6 | 17/08/2011 | Miha Grcar (JSI) | Contributions to chapters "Data acquisition and preprocessing services" and "Decision support and visualisation services", various smaller contributions to other parts of the document.
0.7 | 5/09/2011 | Tobias Haeusser (UHOH) | Contribution to "Information Extraction services" chapter.
0.8 | 12/09/2011 | Miha Grcar, Marko Brakus, Igor Mozetic, Elena Ikonomovska, Saso Dzeroski (JSI), Joao Gama | Update of chapter 3.5 "Decision support and visualisation services"
0.9 | 12/09/2011 | Mateusz Radzimski (ATOS) | Contribution to "Integration infrastructure" chapter, various contributions to "Global scaling strategy" chapter. Editorial changes.
0.9.5 | 13/09/2011 | Mateusz Radzimski, Murat Kalender (ATOS) | "Executive Summary" and "Conclusion" chapters, minor editorial changes.
0.96 | 14/09/2011 | Mateusz Radzimski (ATOS) | Document reaches "final draft" status and is sent for internal review.
0.97 | 15/09/2011 | Markus Gsell (IDMS) | Contributions and editing of "Information integration services"
0.98 | 30/09/2011 | Mateusz Radzimski (ATOS), Achim Klein (UHOH), Murat Kalender (ATOS) | Addressing reviewers' comments. Final version.
1.0 | 30/09/2011 | Tomás Pariente (ATOS) | Final QA and preparation for submission

Copyright © 2011, FIRST Consortium

The FIRST Consortium (www.project-first.eu) grants third parties the right to use and distribute all or parts of this document, provided that the FIRST project and the document are properly referenced.

THIS DOCUMENT IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS DOCUMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Executive Summary

The scaling strategy constitutes an important part of the technical design of the overall FIRST project. It specifies technical details and devises a plan for achieving the scalability goals with regard to processing big data volumes in a timely manner, in order to comply with the project objectives.

This document provides a general overview of suitable scaling techniques that are to be applied within the project. On the one hand, it encompasses scaling of the overall system architecture and describes possible scenarios for performance improvement. On the other hand, it presents methods for scaling the individual technical components that correspond to the major functionalities of the system.

This document influences the development process of the FIRST system by defining a scalability roadmap with defined milestones and objectives that concern every technical aspect of the project. It aims at continuous and iterative improvement of system performance and throughput until the target capabilities are reached.

Table of Contents

Executive Summary  4
Abbreviations and acronyms  7
1. Introduction  8
2. Global scaling strategy  9
2.1. Scaling strategy outline  9
2.2. Scaling analytical pipeline  11
2.2.1 Overview of pipeline scaling scenarios  11
2.2.2 Handling data peaks in the analytical pipeline  15
2.3. Summary  17
3. Individual scaling plans  18
3.1. Data acquisition and preprocessing services  18
3.2. Semantic resources  19
3.3. Information extraction services  20
3.4. Information integration services  23
3.5. Decision support and visualisation services  27
3.5.1 Scaling techniques for clustering and classification  28
3.5.2 Learning model trees from data streams  28
3.6. Integration infrastructure  29
4. Conclusions  31
References  32
Annex 1. Clustering for topic and trend detection  35
Annex a. Introduction  35
Annex b. Document streams  35
Annex c. Clustering document streams  35
Annex d. Topic detection  37
Annex e. Trend detection  37
Annex f. Visualization  37
Annex g. Clustering for active learning  38
Annex h. Conclusions  38
Annex 2. Learning model trees from data streams  39
Annex a. Introduction  39
Annex b. Related work  39
Annex c. The FIMT-DD algorithm  41
Annex d. Conclusions  42

Index of Figures

Figure 1: Scaling strategy in relation with other workpackages  8
Figure 2: FIRST scaling strategy outline  10
Figure 3: Scale-up and scale-out approach  12
Figure 4: Load balancing of requests (ZeroMQ, 2011)  12
Figure 5: Scalability test of analytical pipeline using the parallelization technique  13
Figure 6: Pipeline splitting scenario  14
Figure 7: ZeroMQ space-time scalability experiment result (ØMQ (version 0.3) tests, 2011)  15
Figure 8: Performance comparisons of the approaches for handling data peaks  16
Figure 9: WP3 and WP4 pipeline integration with extra buffer for data peak handling  17
Figure 10: Data acquisition and preprocessing pipeline at M12 (taken from (FIRST D2.2 Conceptual and technical integrated architecture design, 2011))  18
Figure 11: Information extraction scaling approach  21

Index of Tables

Table 1: Scaling strategy for prototype release cycles  11
Table 2: Scaling plan for the data acquisition and preprocessing pipeline  19
Table 3: Scaling plan for the semantic resources  20
Table 4: Development plan and scaling plan for Information Extraction  22
Table 5: Rough estimate of database operations  26
Table 6: Scaling plan for the knowledge base  27
Table 7: Scaling plan for the decision-support models  28
Table 8: Scaling plan for Integration infrastructure  30

Abbreviations and acronyms

DoW – Description of Work
WP – Workpackage
TBD – To be defined
SOA – Service Oriented Architecture
NP – Nondeterministic Polynomial Time
ESB – Enterprise Service Bus
RUP – Rational Unified Process
CPU – Central Processing Unit
REQ/REP – Request/Reply
M12 – Month 12
RSS – RDF Site Summary (also dubbed Really Simple Syndication)
HTML – Hypertext Markup Language
PDF – Portable Document Format
XML – Extensible Markup Language
SVM – Support Vector Machines
JAPE – Java Annotation Patterns Engine
NoSQL – sometimes referred to as Not Only SQL
UC – Use Case

1. Introduction

This document provides important insights into the plan for realising the scalability goals of the FIRST system. Given that the primary objective of the system is to analyse big volumes of data while reducing processing time, most technical aspects must be designed with performance in mind. One conclusion is that all components dealing with data processing should offer high scalability, conforming to the envisaged capacity of the system. This means choosing the best algorithms and state-of-the-art techniques for accomplishing the tasks within the analytical processing pipeline (see (FIRST D2.1 Technical requirements and state-of-the-art, 2011) and (FIRST D2.2 Conceptual and technical integrated architecture design, 2011)).

However, ensuring scalability at the component level is only one side of the coin. The separate components must be further integrated in a common architecture that is robust enough to keep up with the performance and system requirements (see (FIRST D1.2 Usecase requirements specification, 2011)). Therefore architecture scalability is another important factor of the scaling strategy, ensuring coherency with the system design.

This document encompasses both a global strategy that applies at the system integration level and local, component-specific plans, and it will further influence the development process in FIRST (see Figure 1).

Figure 1: Scaling strategy in relation with other workpackages (the scaling strategy, driven by the requirements and system objectives of WP1 and the integrated architecture of D2.2, influences the technical workpackages WP3 to WP7)

2. Global scaling strategy

2.1. Scaling strategy outline

The global scaling strategy describes a roadmap for achieving the project goals for the whole FIRST system with regard to scalability. The main goal is to ensure that the whole system and all its components are able to fulfil the baseline project requirements of processing high volumes of data and providing timely results.
The methodology followed to reach these goals is based on an incremental release and evaluation approach resembling the Rational Unified Process (RUP) (IBM Rational Unified Process v7.0, 2008), where all functionalities are improved from both the functional (implementation plan) and the scalability (scaling strategy) points of view throughout the project lifetime. The scaling strategy is aligned with the prototype release cycle and divided into several milestones, each of which provides objectives for overall system scaling. By providing a systematic approach and a set of measurable goals, we enable constant evaluation of the results in order to detect and respond to risks as early as possible, which is crucial in developing the research prototype.

The scaling strategy plan encompasses a twofold view of the project:

- Scalability of the overall FIRST system (hereafter called the "global scaling strategy")
- Scalability of the particular FIRST building blocks (called the "individual scaling plans")

The idea of distinguishing between these two aspects is to analyse scaling challenges and approaches from the global perspective ("global scaling strategy") and from the point of view of the different technical components ("individual scaling plans").

The individual scaling plans depict a bottom-up view and explain how the FIRST components contribute to achieving scalability, and what the challenges and achievable goals are. They focus on providing a lower-level, component-based overview of scaling issues, such as the choice of algorithms or the usage of proper technological solutions. While this deliverable mentions how those techniques contribute to achieving the scalability goals, the technical details will be presented in the deliverables of the respective workpackages.

The global scaling strategy, on the other hand, presents project-wide approaches that are orthogonal to the individual scaling plans and common to most technical components. Its focus is on integration aspects, to provide a scaling infrastructure that enables techniques such as scaling up, scaling out, parallelisation and proper resource utilisation. The central aspect of analysis covered in this view is the set of FIRST analytical pipeline scaling methods. The global scaling strategy also takes into account the limits of the individual components and aligns them into the common plan, ensuring that the system capabilities are met by the individual components. In this sense, the global and individual scaling strategies influence each other.

Following an incremental building approach, the scaling strategy adapts to the devised project development plan and to the prototype release cycles. Therefore a scalability goal is defined for each prototype release milestone. The general approach for preparing the infrastructure to handle the envisaged amount of data and to comply with the scaling requirements is to 1) constantly improve algorithms in order to provide results in near real-time, and 2) scale such solutions to handle vast amounts of data (see Figure 2). The high-level view on the scaling strategy, given in the DoW, suggests that we should first scale the data volume (from small to large historical datasets) and then change the processing paradigm (from dataset processing to real-time data stream processing). In reality, we first need to change the processing paradigm and then scale the data volume (from relatively "slow" data streams to vast data streams).
Since the final goal of the project is not to process large historical datasets but rather to process data streams in near-real time, the effort put into scaling from small to large (historical) datasets would not be reflected in the final product. Therefore, switching to stream processing earlier in the process allows us to uncover and address any possible limitations of such an approach. The outline of these techniques is presented in the following chapters.

Figure 2: FIRST scaling strategy outline (the strategy first ensures that algorithms are time-performant and scalable, moving from historical data with little time constraints (P1) to live data and news feeds with near real-time response (P2), and then scales the data volume from small to vast amounts of data (P3))

Table 1 outlines the scaling strategy related to the prototype release cycle as envisaged in the DoW.

Month | Milestone | Stage | Description
M12 | M2 | M12 Early prototype ("Early Bird") | Early demo providing first insight into the WP3 Data Acquisition and WP4 Information Extraction prototypes.
M18 | M3 | Stage 1: 1st Prototype (P1) | First preliminary release of the Integrated Financial Market Information system, showing some components of the FIRST analytical pipeline (WP3, WP4, WP6) and visualisation prototypes at work. The integration prototype is also present, allowing for lightweight pipeline integration and employing the messaging approach at its core. The purpose of the 1st prototype is to show how some use case tasks are realised by the FIRST system. No scaling goals are defined, but the prototype and the infrastructure form a testbed for performance tests and continuous scalability improvements.
M24 | M4 | Stage 2: 2nd Prototype (P2) – Live data (near real-time) | The 2nd prototype is oriented at improving algorithms in order to shorten the time required to analyse incoming data, ensuring efficiency with regard to resources and computation time. Although it may already be capable of processing larger amounts of data, its validation focuses on near real-time provision of results.
M33 | M5 | Final Version (P3) – Vast amounts of data | The final version of the software is able to handle the target data load – vast amounts of data in a short time. It is validated against the final requirements with regard to timeliness and data volume (number of data sources and number of processed articles per day). Algorithms and infrastructure are prepared for efficient scaling with regard to data load within the specified limits. The global scaling strategy and individual scaling plans are fully implemented.

Table 1: Scaling strategy for prototype release cycles

2.2. Scaling analytical pipeline

2.2.1 Overview of pipeline scaling scenarios

The analytical pipeline is the core of the project; therefore, scalability of the analytical pipeline is crucial for the overall scalability of the FIRST project. The preliminary basis for the scaling strategy has been outlined in (FIRST D2.1 Technical requirements and state-of-the-art, 2011). This section devises scaling scenarios such as pipeline parallelisation, pipeline splitting, load balancing and pipeline multiplication.

Pipeline parallelisation is proposed in (FIRST D2.1 Technical requirements and state-of-the-art, 2011) as a scaling strategy to increase the throughput and decrease the latency of the pipeline. Parallelisation means processing several inputs coming from components in a pipeline with other identical components that work in parallel.
In the optimal scenario, it simply means adding more processing units for those components that work more slowly than the other components in the pipeline. More processing units can be provided to components by scaling horizontally or vertically. Scaling horizontally (scale-out) is achieved by adding more nodes (computers) to a system. Scaling vertically (scale-up) is achieved by adding more resources, such as additional CPUs and memory, to a node in the system (see Figure 3). In pipeline processing, vertical scaling can be achieved by running the whole pipeline on a faster machine, while horizontal scaling means running multiple pipelines (or pipeline fragments) on more machines.

Figure 3: Scale-up and scale-out approach (source: Best Practices in building scalable cloud-ready Service based systems, CodeCamp 11, http://igorshare.wordpress.com/2009/03/29/codecamp-11-presentation-best-practices-in-building-scalable-cloudready-service-based-systems/)

Parallelisation requires distribution of work between multiple identical components. The workload has to be distributed effectively between components to achieve optimal resource utilisation and maximise throughput; this is called load balancing. Inputs between components in the FIRST analytical pipeline are delivered using the messaging approach described in (FIRST D2.2 Conceptual and technical integrated architecture design, 2011), which supports techniques for improving scalability such as parallelisation, pipeline splitting and load balancing. ZeroMQ (http://www.zeromq.org/) has been chosen as the messaging implementation. Figure 4 shows load balancing of work between three parallel services using ZeroMQ messaging. There are four messages (R1, R2, R3, and R4): two of them are sent to Consumer A, and the remaining two are distributed between Consumer B and Consumer C.

Figure 4: Load balancing of requests (ZeroMQ, 2011)

Parallel components do not have to work on the same computer. The pipeline can be split by distributing components among several machines. For example, the data acquisition service may work on one computer while multiple information extraction services work on another. In this way, components can occupy more resources than when sharing a machine with other components. ZeroMQ uses sockets to connect applications over the TCP protocol, which enables pipeline splitting and distributed processing over the network. Since TCP/IP connectivity scales across the network, the throughput of the pipeline can increase roughly in proportion to the number of computers.

To measure the scalability of the ZeroMQ messaging approach, a pipeline parallelisation experiment has been carried out. A test system was prepared that transfers messages between two components: one component produces messages and the other consumes them. The message consumer is a time-consuming processing component, which slows down the pipeline. The throughput of the test system was observed with a changing number of message consumer components. The experiment was carried out on a 48-core machine. Figure 5 shows the experiment results: the throughput increases linearly with the number of consumer components. We can therefore conclude that the analytical pipeline and the messaging approach with the ZeroMQ implementation are highly scalable.

Figure 5: Scalability test of the analytical pipeline using the parallelisation technique
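To make the messaging pattern concrete, the following minimal sketch shows how a load-balanced pipeline stage can be wired with ZeroMQ PUSH/PULL sockets. It is an illustrative sketch only, assuming the JeroMQ/jzmq Java binding; the endpoint, the message format and the process() placeholder are hypothetical and do not reflect the actual FIRST component interfaces.

import org.zeromq.ZMQ;

// Minimal PUSH/PULL sketch of pipeline parallelisation with ZeroMQ.
// The producer binds a PUSH socket; every consumer that connects a PULL socket
// automatically receives a fair-queued share of the work items, whether it runs
// in the same process, on the same machine, or on another node.
public class PipelineParallelisationSketch {

    // e.g. the data acquisition side
    static void runProducer() {
        ZMQ.Context ctx = ZMQ.context(1);
        ZMQ.Socket sender = ctx.socket(ZMQ.PUSH);
        sender.bind("tcp://*:5557");                       // illustrative endpoint
        for (int i = 0; i < 100; i++) {
            sender.send("document-" + i);                  // one work item per message
        }
        sender.close();
        ctx.term();
    }

    // e.g. one information extraction instance; start several of these to scale out
    static void runConsumer() {
        ZMQ.Context ctx = ZMQ.context(1);
        ZMQ.Socket receiver = ctx.socket(ZMQ.PULL);
        receiver.connect("tcp://localhost:5557");
        while (!Thread.currentThread().isInterrupted()) {
            String document = receiver.recvStr();          // blocks until a work item arrives
            process(document);                             // placeholder for component logic
        }
        receiver.close();
        ctx.term();
    }

    private static void process(String document) { /* component-specific processing */ }
}

Because PULL consumers can connect from any machine, the same wiring supports both parallelisation on a single node and pipeline splitting across several nodes.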
Parallelisation of the FIRST pipeline will be done statically, based on the average latency of each component and the available processing power. First, the latency of each component will be analysed. Based on the results, slow components will be given more instances that work in parallel. For example, if the data acquisition component retrieves 2 documents per second and information extraction can process 1 document per second, there will be one data acquisition component and two information extraction components in the pipeline. After balancing the pipeline, the resource consumption of the pipeline will be observed. If it does not use all the resources of the computer, multiple pipelines will be executed in parallel to consume all available resources and increase the overall throughput.

Another approach for scaling out the pipeline is pipeline splitting, i.e. separating pipeline components and running them on more machines. It improves scalability when pipeline components need more computing resources than a single machine can offer. Pipeline splitting is made possible by decoupling components and using messaging to integrate them across distributed machines (see Figure 6). In such a scenario, every component can occupy more resources. For computing-intensive tasks this may result in shorter computation time and lower pipeline delays.

Figure 6: Pipeline splitting scenario (instead of the whole pipeline occupying 100% of the resources of a single node, pipeline parts are distributed over several nodes connected by messaging, with each part occupying 100% of the resources of its node)

The scalability of the messaging integration approach is a very important factor for the scalability of the analytical pipeline, but it is also important how it is applied in the project. Poor architectural design may result in serious scalability problems, for instance if the messaging middleware blocked the message sender and receiver until a message is transferred. Waiting for a message transfer would cause performance and scalability problems, because the components would depend on each other and one would block the other. To handle this issue, the messaging integration solution mentioned in (FIRST D2.2 Conceptual and technical integrated architecture design, 2011) is implemented following an asynchronous communication pattern with a multi-threaded design: separate threads are responsible for receiving and transferring messages for each component. Additionally, each messaging thread keeps a buffer queue of messages to support a constant flow of data. In this way the components do not block each other.

The scalability of a system can be analysed along various dimensions. So far, we have analysed the analytical pipeline in terms of load scalability, i.e. the ability to increase resource consumption in order to handle heavier loads. Space and space-time scalability are two other important scalability types (Bondi, 2000). Space scalability is the ability to handle an increasing number of items without consuming an excessive amount of memory. Space scalability is handled in the analytical pipeline by limiting the queue size of the messaging threads to a fixed number.
In our experiments, we observed that the memory consumption of the analytical pipeline is stable and does not increase with heavier loads. Space-time scalability is the ability to handle large items (big messages in our context) without decreasing the throughput of the system. Messages within the analytical pipeline can vary in size, and the integration system has to handle all types of messages without a performance loss. The vendors of ZeroMQ have published a space-time scalability experiment. The result, shown in Figure 7, is that the time for sending big messages increases linearly with their size. Thus, we can conclude that ZeroMQ is also scalable in the space-time dimension.

Figure 7: ZeroMQ space-time scalability experiment result (ØMQ (version 0.3) tests, 2011)

2.2.2 Handling data peaks in the analytical pipeline

Information exchange between components in the analytical pipeline is done by sending messages between them. The message receiver module of the FIRST messaging system keeps received messages in a queue. When data peaks occur in the pipeline, components cannot process all the received messages and their queues overflow. In order to handle the queue overflow problem, a new messaging channel is added to the integration system to inform the message sender component about the status of the message receiver queue.

Since the analytical components constantly observe their own input queues, a data peak can be identified by each individual component. A data peak (from the perspective of a specific component) happens when the number of data items (requests) in the queue exceeds a certain predefined threshold (e.g., 100 items). If this situation occurs, the component fires its peak-handling logic in order to reduce the number of items in the queue.

The strategies for reducing the number of queued items range from simple and pragmatic to relatively complex solutions. A very simple solution may involve using control messages to pause and resume the traffic. When the queue size exceeds the maximum value, a "wait" message is sent to the message producer. In this case, the message producer stops sending messages until it receives a "continue" message from the message consumer. After the message receiver has consumed all messages in the queue, a "continue" message is sent to the message producer and messaging between the components resumes. However simple, such a solution slows down data processing and only moves the problem to the message producer, causing overflow at earlier stages of the pipeline. The complex solutions include, for example, semantic load shedding, where the content in the queue is clustered in order to select representative instances. This ensures that the different topics identified in the queue are all represented in the final model and thus in the end-user application. In FIRST, we do not plan to resort to such complex solutions but rather to one of the pragmatic alternatives. The two pragmatic approaches are:

- dropping the request that tries to enter a full queue;
- dropping every second request from the queue when it fills up (i.e., sampling).

For the applications in FIRST, the second approach is more appropriate: in contrast to the first approach, it allows recent content to pass through the pipeline.
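As an illustration of this pragmatic policy, the sketch below shows a bounded message queue that fires its peak-handling logic once a threshold is exceeded and then drops every second queued item. It is a simplified, self-contained example and not the actual FIRST messaging implementation; the class name and threshold value are illustrative.

import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch of the sampling-based peak handling described above:
// when the queue grows past a configurable threshold, every second queued item
// is dropped, so that recent content can still enter the pipeline.
public class PeakHandlingQueue {

    private final Deque<String> queue = new ArrayDeque<>();
    private final int threshold;                 // e.g. 100 items, as in the example above

    public PeakHandlingQueue(int threshold) {
        this.threshold = threshold;
    }

    public synchronized void enqueue(String message) {
        queue.addLast(message);
        if (queue.size() > threshold) {
            dropEverySecondItem();               // peak-handling logic fires
        }
    }

    public synchronized String dequeue() {
        return queue.pollFirst();                // null if empty; a real consumer would block
    }

    // Keeps every other message, halving the backlog while preserving a sample of the
    // queued content. Dropped items could alternatively be written to a file buffer for
    // later replay, as discussed below.
    private void dropEverySecondItem() {
        Deque<String> kept = new ArrayDeque<>();
        boolean keep = true;
        for (String m : queue) {
            if (keep) {
                kept.addLast(m);
            }
            keep = !keep;
        }
        queue.clear();
        queue.addAll(kept);
    }
}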
For more information on data reduction techniques, see (FIRST D2.1 Technical requirements and state-of-the-art, 2011), Section 2.5.2.1 and (Barbara & others, 1997). Additionally, the messaging control channel can be used for controlling the behaviour of data peak handling.

Another aspect of handling data peaks is whether we store "dropped" data for later processing (when the queues are empty again), i.e. (i) using a broker-based messaging system and sending new messages to a broker, or (ii) writing incoming messages to files for further processing. The first approach requires a stable broker-based system, which also handles queue overflow with its persistence mechanisms. Performance is the selection criterion between the two approaches. Broker-based messaging (e.g. ActiveMQ, http://activemq.apache.org/) and a file storage feature have been tested in the messaging system for this purpose. Briefly, in the experiment these features were activated after receiving the "wait" messages: incoming messages were then sent to the broker or written to a file. When the message receiver was ready to consume new messages, these messages were transferred to it. The performance of the two approaches was tested on the test dataset that was used for evaluating the performance of the pipeline and request-reply patterns in the previous section. In the experiments, all messages were sent via ActiveMQ for the broker approach, and sent with ZeroMQ after writing the messages to files and reading them back for the file storage approach. Figure 8 shows the performance of these messaging systems with regard to overall throughput.

Figure 8: Performance comparisons of the approaches for handling data peaks

The file storage approach performs significantly better than the broker-based approach in the experiments. Figure 9 shows the example architecture of the Data Acquisition and Information Extraction pipeline integration with data peak handling using a file buffer.

Figure 9: WP3 and WP4 pipeline integration with extra buffer for data peak handling

Even though this mechanism is relatively easy to implement, the essence of the solution is extending the message buffer from a memory-based buffer to a disk buffer. Using such a solution makes sense when data comes in large bursts of messages that could not otherwise be processed at all. As a trade-off, messages might be significantly delayed while they wait for their turn to be "replayed" back into the pipeline. In the long run this might be an unwanted side effect. In a near-real-time system, such as the one we aim to develop in FIRST, we would rather opt for dropping some messages while keeping the average processing time short than ensure that all messages are processed regardless of circumstances. In the future we aim at supporting these scenarios; however, the final decision depends on further experiments on live data.

2.3. Summary

The global scaling strategy gives a technical overview of scenarios that can be applied to improve the performance of the FIRST analytical pipeline. The target system will combine the aforementioned techniques, choosing the most appropriate method based on the performance of the individual components, the target data volumes and the analysis of performance bottlenecks.
This demonstrates that the messaging-based integration approach and the flexibility of the analytical pipeline bring more possibilities for scaling the overall FIRST architecture and ensure that the scalability goals can be met in the further project development process.

3. Individual scaling plans

Individual scaling plans, as opposed to the global scaling strategy, aim at providing a roadmap and scaling plans for the individual technical workpackages. These plans focus on internal and specific aspects of pipeline processing, such as the choice of algorithms or the improvement of data handling. They are realised separately within the technical components of each workpackage, according to their own objectives. However, the outcomes of each plan are aligned with the project plan and the prototype release cycle according to the DoW. The following subchapters correspond to the following workpackages:

- WP3: Data acquisition and preprocessing services, and Semantic resources
- WP4: Information extraction services
- WP5: Information integration services
- WP6: Decision support and visualisation services
- WP7: Integration infrastructure

3.1. Data acquisition and preprocessing services

The data acquisition and preprocessing pipeline, shown in Figure 10, consists of relatively elementary operations that do not need to be replaced with sophisticated online alternatives. In addition, all these operations are trivially parallelizable (i.e., each document can be processed independently). This allows us to devise a workflow with multiple parallel preprocessing pipelines, as evident from Figure 10. Load balancing is employed to send the acquired data through the preprocessing pipelines.

Figure 10: Data acquisition and preprocessing pipeline at M12 (taken from (FIRST D2.2 Conceptual and technical integrated architecture design, 2011)): one RSS reader per site (80 readers) feeds, via load balancing, several parallel preprocessing pipelines, each consisting of a boilerplate remover, language detector, duplicate detector, sentence splitter, tokenizer, POS tagger, semantic annotator and ZeroMQ emitter.

As the project progresses, the data acquisition and preprocessing pipeline will be scaled up from mainly two perspectives: (i) with respect to the number of sites from which the data is acquired and (ii) with respect to the number of components (i.e., functionality). Table 2 shows the scale-up of the data acquisition pipeline from the preliminary version (at M7) to now (M12) and presents the scale-up plan for the remainder of the project.

 | Ver. 1: Apr–Jun 2011 (M7–M9) | Ver. 2 (now): Jun–Sep 2011 (M9–M12) | Ver. 3: Sep 2011–Sep 2012 (M12–M24) | Ver. 4: Sep 2012–Sep 2013 (M24–M36)
Scale | Number of sites: 39; number of RSS feeds: 1,950 (~50 per site on average); avg. number of documents per site per day: 870; total new documents per day: 33,950 | Number of sites: 80; number of RSS feeds: 2,472 (~30 per site on average); avg. number of documents per site per day: 425; total new documents per day: 34,000 | Number of sites: 160; number of RSS feeds: 4,800 (~30 per site on average); avg. number of documents per site per day: 425; total new documents per day: 68,000 | Unchanged
Functionality | RSS acquisition only | Boilerplate removal added | Added: language detector, duplicate detector, sentence splitter, tokenizer, POS tagger, ZeroMQ emitter | Unchanged

Table 2: Scaling plan for the data acquisition and preprocessing pipeline

The average number of RSS feeds per site and the average number of acquired documents per site per day decrease from Ver. 1 to Ver. 2. This is mainly due to the fact that we included a lot of blogs in Ver. 2. A blog usually provides one single RSS feed and only a few posts per day or week, while a larger news Web site provides a range of RSS feeds and hundreds of news items per day. Another reason for the drop in the average number of documents per site per day, and consequently in the total number of new documents per day, is the new filtering policy. In Ver. 2, we only accept HTML and plain-text documents that are 10 MB or less in size. In Ver. 1, non-textual content (such as video, audio, PDF, and XML) was also accepted and its size was not limited.

The only component that could benefit from the fact that we are dealing with streams is the boilerplate remover. The currently implemented solution is based on language-independent features and employs a decision tree to determine the class of a text block (Kohlschütter, Fankhauser, & Nejdl, 2010). This solution processes each document separately and is unaware of the fact that it operates in a stream-based environment. We recently devised a pragmatic stream-based boilerplate remover that exhibits high content recall (at some expense of precision). The algorithm is currently being tested and, if deemed suitable, will replace the currently employed solution at some point during the second project year.

3.2. Semantic resources

The FIRST ontology contains two important aspects of knowledge about financial markets: (i) real-world entities such as companies and stock indices and their interrelations, and (ii) the corresponding lexical knowledge required to identify these entities in texts. The ontology is thus fit for the purpose of information extraction rather than representing a basis for logic-based reasoning.

We distinguish between the static and the dynamic part of the ontology. The static part contains knowledge that does not change frequently (i.e., does not adapt to the stream in real time). It contains the knowledge about financial indices, instruments, companies, countries, industrial sectors, sentiment-bearing words, and financial topics. This part of the ontology will scale up in terms of coverage (i.e., how many financial indices, topics, and sentiment-bearing words the ontology covers) and in terms of aspects (i.e., which different types of information are available in the ontology, e.g., industrial sectors, sentiment vocabularies, topic taxonomies, and so on).

The dynamic part will include two aspects of knowledge that will be constantly updated with respect to the data stream: (i) the topic taxonomy and (ii) the sentiment vocabulary. (Note that these two aspects are also included in the static part, where they do not adapt to the stream but rather represent UC-specific knowledge and existing semantic resources, e.g., McDonald's financial word lists, http://www.nd.edu/~mcdonald/Word_Lists.html.) The dynamic part of the ontology will scale up mostly in terms of the maximum throughput of the topic detection algorithm and the sentiment vocabulary extractor. The topic detection component will be based on an online hierarchical clustering algorithm (see Annex 1). Rather than processing a complete dataset of documents efficiently, an online clustering algorithm is able to rapidly update a hierarchy of document clusters whenever a new document enters the system.
The sentiment vocabulary extractor, on the other hand, will employ an active learning approach based on Support Vector Machines (SVM) (Joachims, 2006; Tong & Koller, 2000; Saveski & Grcar, 2011). For this purpose, we will employ an online variant of SVM (Cauwenberghs & Poggio, 2001). Table 3 gives the current state of the semantic resources in FIRST and presents the scaling plan for after M12.

 | Now: Feb–Sep 2011 (M5–M12) | Sep 2011–Mar 2012 (M12–M18) | Mar–Sep 2012 (M18–M24) | Sep 2012–Sep 2013 (M24–M36)
Coverage | Collection of existing semantic resources (T3.1) | Ontology "spawned" from 16 financial indices | Ontology "spawned" from >1000 financial indices | Unchanged
Aspects | Sentiment vocabularies, topic taxonomies, lexical resources, glossaries, financial Web sites… | Indices, stocks, companies, countries, industrial sectors, sentiment vocabulary, topic taxonomies | Events added | Unchanged
Throughput | N/A | N/A | Running in near-real time on 1 selected Web site (testing) | Running in near-real time on approximately 160 Web sites

Table 3: Scaling plan for the semantic resources

3.3. Information extraction services

The information extraction service is based on the JAPE engine, which brings the benefit of a powerful rule engine, but also the disadvantage of being time-consuming (for a detailed description see (FIRST D4.1 First semantic information extraction prototype, 2011)). In early experiments the information extraction service could require several minutes per blog document. Because the information extraction engine also has to fulfil the requirement of handling a couple of documents per minute, the whole service will provide an internal managed process pool, where the documents are dispatched using a load balancer combined with a messaging approach. The first version of the managed process pool will have a configurable number of parallel processes, which will be started from the Process Observer/Manager component. The Process Observer/Manager also manages the different states (WAITING, BUSY) of the running information extraction processes by constantly monitoring them and performing the proper actions. If the busy state does not change within a defined timeout, the process will be killed and restarted by the Process Observer/Manager.

Figure 11: Information extraction scaling approach (a load balancer dispatches messages to information extraction processes 1..n in a managed process pool, supervised by the Process Observer/Manager)

Figure 11 shows the internal approach of the information extraction services within the scope of the global processing pipeline. Incoming data is first dispatched by the load balancer component, which distributes the data across the registered information extraction components according to the process observer/manager component. Data is sent only to components that are in the WAITING state. The number of information extraction processes in the pool is subject to further experiments and depends on the amount of data and the processing time. Also, there can be more than one managed process pool within the system, and these can be further scaled according to the global scaling strategy scenarios (using the messaging approach, i.e. the ZeroMQ implementation).
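The following simplified sketch illustrates the managed process pool idea described above: a load balancer dispatches documents only to WAITING workers, and an observer restarts any worker that remains BUSY beyond a timeout. Class and method names are illustrative assumptions, and the real implementation manages external JAPE processes rather than the in-memory objects shown here.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified sketch of a managed process pool with load balancing and observation
// (state names taken from the text; everything else is illustrative).
public class ManagedProcessPoolSketch {

    enum State { WAITING, BUSY }

    static class Worker {
        volatile State state = State.WAITING;
        volatile long busySince = 0L;
        void assign(String document) {
            state = State.BUSY;
            busySince = System.currentTimeMillis();
            // hand the document over to the external JAPE process here
        }
        void restart() {
            // kill the external process, start a fresh one, and rejoin the pool
            state = State.WAITING;
        }
    }

    private final Map<String, Worker> pool = new ConcurrentHashMap<>();
    private final long timeoutMillis;

    public ManagedProcessPoolSketch(int poolSize, long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
        for (int i = 0; i < poolSize; i++) {
            pool.put("ie-process-" + i, new Worker());
        }
    }

    // Load balancer: dispatch to the first WAITING worker, or report that the pool is busy.
    public boolean dispatch(String document) {
        for (Worker w : pool.values()) {
            if (w.state == State.WAITING) {
                w.assign(document);
                return true;
            }
        }
        return false; // caller may queue the document or route it to another pool
    }

    // Process observer: restart workers whose BUSY state has not changed within the timeout.
    public void observe() {
        long now = System.currentTimeMillis();
        for (Worker w : pool.values()) {
            if (w.state == State.BUSY && now - w.busySince > timeoutMillis) {
                w.restart();
            }
        }
    }
}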
This additional internal process pool will be necessary for two major reasons:

1) The local machines can run more than one process in parallel; however, a single JAPE run needs considerable time (currently up to several minutes) for one document in an atomic process.

2) In some cases we observed that a JAPE-based process might take too long and thus block or exhaust the resources available for other documents (e.g. due to a crash or hang-up). If it does not answer within an appropriate time, the built-in observer mechanism kills the process and restarts it in order to return it to the process pool.

The current approach focuses on the accuracy of the information extraction process. As mentioned before, its performance is still far from optimal. However, if it proves infeasible in further experiments, other solutions that might trade off accuracy against performance will be explored. That can be done as an alternative solution in order to avoid potential risks in the scaling of the overall pipeline. Table 4 presents the scaling goals for the information extraction components.

 | Part 0: "M12 early prototype", Feb–Sep 2011 (M5–M12) | Part 1: Integrated functional prototype, Oct 2011–Mar 2012 (M13–M18) | Part 2: Live feeds, April 2012–Sep 2012 (M19–M24) | Part 3: Large amounts of data, Sep 2012–June 2013 (M24–M33)
Natural language processing | Boilerplate remover | Language detector, duplicate detector, sentence splitter, tokenizer, and part-of-speech tagger annotations | Unchanged | Unchanged
Entity extraction | Financial instruments (stocks, stock indexes), companies, orientation terms | Indicators, topic taxonomies | Events, locations | Unchanged
Extraction coverage | Direct sentiments regarding financial instruments' price and companies' reputation | Direct sentiments regarding financial instruments' volatility | Indirect sentiments | Unchanged
Extraction scaling | n/a | Analysis of performance bottlenecks. Improve prototype with software engineering methods. | Initial process pool for load balancing. | Advanced process pool for large amounts of data.
Extraction throughput | 1 document per minute (one process) | 2-3 documents per minute (one process) | 5 documents per minute | 50 documents per minute, 68,000 documents per day

Table 4: Development plan and scaling plan for Information Extraction

3.4. Information integration services

When storage solutions are confronted with high loads of data insertions and/or data retrievals, performance bottlenecks may become a severe issue. Neither data contributors nor data consumers want to spend much time waiting for their requests to be completed. Performance issues may arise for a variety of reasons:

- Blocked or limited resources. Such issues occur when database resources that are required to perform a certain task are blocked by another operation, and processing has to wait until the required resource is released by the other task. Resources in this context encompass physical resources (e.g. hard-drive access to alter a file), virtual resources (e.g. a database table that is write-locked while operations are running that alter its content) and logical resources (e.g. database connections).

- Costly (time-consuming) database operations due to improper database design. With an inappropriate database design, users may be forced to conduct complex, time-consuming queries, e.g. using many joins, for frequent tasks.
- Costly (time-consuming) database operations due to improper database management. With an improperly administrated database, queries can take unnecessarily long, e.g. when no appropriate indices are maintained.

- Costly (time-consuming) database operations due to inappropriate queries. Inappropriate, i.e. unnecessarily complex, queries can harm query performance.

Solutions to counter these causes of performance bottlenecks can be implemented on different layers: not only on the storage layer itself but also on the access layer.

The first and most crucial decision is the choice of the physical storage system. Candidate solutions include the plain file system, relational databases, non-relational (NoSQL) database approaches such as document-oriented or key-value databases, or any hybrid combination thereof. Each of these data storage solutions bears its individual advantages and disadvantages, which need to be weighed in the light of the requirements towards the respective data items to be stored and retrieved. While, for example, storing items in the file system allows efficient random access to the stored items, the expressive power of queries is obviously limited to filenames or creation dates. On the other hand, the expressive power of queries in a relational database is quite high, while such complex queries may harm performance, as large tables might have to be scanned and potentially numerous sub-queries have to be conducted. Therefore, the choice of storage solution shall be made on an individual basis, choosing the approach that is most appropriate for the respective type of data and the expected frequency of data insertions, data updates, and data retrievals.

Each storage solution brings its own inherent optimization possibilities. Besides any automatic query optimizers, optimizations can also be done with the administrative functionality provided by the storage solutions as well as with the degrees of freedom provided in database design. For relational databases, for example, the set of options to enhance performance includes:

- Normalization of database tables. The normalization of database tables avoids the redundant storage of data, which shall improve performance and avoid inconsistencies due to updates.

- Setting appropriate indices. By defining indices (with respect to columns often used in queries), the efficiency of queries can be increased, as previously determined meta-information from the indices can be used rather than performing full table scans to identify all the rows that match certain criteria. In order to define useful indices, the frequently used queries should be analyzed. As database content changes over time, indices shall be rebuilt from time to time and may require some fine-tuning (e.g. with respect to the fill factor that is maintained upon index creation). (There are two types of indices: clustered and non-clustered. As the former actually changes the physical sequence of entries in a table, there can only be one clustered index per table, while several non-clustered indices can be defined.)

- Partitioning. Database performance may be harmed by very large tables. To counter this, a table may be (horizontally) partitioned, i.e. different rows of the table are assigned to different physical partitions (e.g. physically a different disk). Thereby the search effort may be reduced, as less frequently retrieved parts of the data, e.g. older entries, may be swapped out. As, furthermore, the size of the respective indices (one per partition) is smaller, the search effort is further reduced.

- Database cluster. If physical alternatives are available, the database can be federated among several servers to distribute the workload.
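As a small illustration of the indexing guideline above, the sketch below creates a composite, non-clustered index that matches a hypothetical frequently used query. The connection string, table, column and index names are assumptions for illustration and are not the actual FIRST schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustrative only: support a frequent query pattern with a matching index, e.g.
//   SELECT ... FROM sentiment WHERE instrument_id = ? AND published_date BETWEEN ? AND ?
public class IndexTuningSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:<placeholder-url>");
             Statement st = con.createStatement()) {
            st.execute("CREATE INDEX idx_sentiment_instr_date "
                     + "ON sentiment (instrument_id, published_date)");
        }
    }
}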
Although the aforementioned concepts for enhancing performance have emerged in the context of relational databases, similar optimization approaches exist for other storage paradigms as well. Many NoSQL databases also offer indexing or distribution of storage among several servers. Sharding is even a concept that combines features of horizontal partitioning and clustering.

Despite all these performance optimization approaches, it may still prove useful, depending on the actual request pattern, to cache some database entries (e.g. the most recent ones) or to de-normalize the table structure to some extent, in order to provide better response times for queries that would otherwise span several database tables and require costly joins. While de-normalization would be part of database (re)design and therefore reside on the storage layer, caching recent database entries may be part of optimizations on the access layer or on some kind of intermediate layer.

Further potential to increase performance may be realized on the access layer by providing best-practice implementations for frequently used queries, e.g. by providing prepared statements or using stored procedures where appropriate. Depending on the insertion patterns, the access layer may also arrange individual inserts into a bulk insert, where several inserts are bundled into a transaction. Usually, indices are updated upon each insert; by bundling several inserts into a transaction, the index update is only performed once, which increases the responsiveness of the database (a sketch of such a bulk insert is given after the following list).

The issue of blocked resources can be addressed by the access layer in several ways:

- Pooling of resources. Establishing database connections is an expensive operation. To avoid this procedure as often as possible, the access layer shall maintain a connection pool to serve new requests for a connection. Only when no appropriate connection is available there is a new one created. Whenever a connection is released by its user (e.g. when a query has completed), the connection is returned to the pool to be available for re-use by the next query. Such connection pooling is already offered by many database drivers; where it is not available, the access layer shall provide it. In a similar way, threads to process database operations may be pooled internally.

- Queuing insertions. In the case of a blocked resource, the access layer shall not refuse a request or block itself by waiting for the required resource to be released. Instead, the request shall be accepted and queued in a worker thread until it can be conducted. However, this should only occur in exceptional circumstances, as it is imperative to minimize the occurrence of blocked resources.
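The following sketch illustrates the access-layer bulk insert mentioned above: several inserts are collected into one JDBC batch with a prepared statement and committed as a single transaction. The connection string, table and column names are placeholders, not the actual FIRST knowledge base schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

// Sketch of bundling several inserts into one batch and one transaction, so that
// commit overhead (and, per the rationale above, index maintenance) is paid per
// batch rather than per row. Each row is {documentId, instrumentId, score}.
public class BulkInsertSketch {

    public static void insertSentiments(List<String[]> rows) throws Exception {
        try (Connection con = DriverManager.getConnection("jdbc:<placeholder-url>")) {
            con.setAutoCommit(false);                       // start an explicit transaction
            String sql = "INSERT INTO sentiment (document_id, instrument_id, score) VALUES (?, ?, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.setString(3, row[2]);
                    ps.addBatch();                          // collect instead of executing one by one
                }
                ps.executeBatch();
            }
            con.commit();                                   // one commit for the whole bulk insert
        }
    }
}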
Ontology Archive: number of concurrent users (insertion and retrieval): 1 (WP3) in regular intervals a serialization of the most up to date ontology will be archived Retrieval of single ontologies occasionally for back-testing purposes Storing annotated document corpora number of concurrent users (insertion and retrieval): 1 (WP3) archiving of vast amount of annotated documents Occasional (bulk) retrieval of documents mainly for back-testing purposes Computed sentiment-related information number of concurrent users (insertions): 1 (WP4) number of concurrent users (retrieval): 1-3 (WP6, WP7, WP8) storing of vast amount of fully annotated GATE documents Frequent retrieval of sentiments and/or further attributes The requirements of the ontology archiving task, will probably only change slightly over time. The frequency, with which the most up-to-date ontology is archived, may increase. The current expectation is to have one archiving request per day. Though, even if this would dramatically change to one archiving request every ten minutes the knowledge base should be able to adapt that without any noticeable impact on overall performance. However, the requirements regarding storage of annotated document corpora and computed sentiment-related information are directly driven by the scaling of the data acquisition component maintained by WP3. According to the scaling plan outlined for data acquisition in section 3.1, 68,000 new documents are to be expected per day at the final stage of the project. This figure represents all documents that are retrieved from the data acquisition components. Based on the ontology, documents will be filtered out that are irrelevant in the context of FIRST. Therefore, the subsequent components in the processing pipeline will actually receive fewer documents. Nevertheless, for the purpose of a worst case estimation, the following calculations assume that all acquired documents will be passed to the subsequent components in the pipeline. The estimate of 68,000 new documents per day would cause the same number of new annotated document corpora to be stored per day, i.e. 68,000 insertions to the knowledge base. As mentioned before, it is assumed for the purpose of this calculation that all those 68,000 documents are forwarded to the subsequent components of the processing pipeline. Therefore, for each document annotations will be set, sentiments will be computed and the related database tables will have to be updated. It is assumed that per processed document 20 sentiment-related database tables will have to be updated, which causes 1,360,000 update operations per day on the knowledge base. In order to not ignore future system load that will be caused by decision support components, it is assumed for this worst case estimation, that the decision support components will cause the same amount of update operations, i.e. 1,360,000 per day. As for © FIRST consortium Page 25 of 43 D2.3 annotated document corpora only occasional retrieval is expected, the number of retrieval operations per day is estimated to be at around 50% of the total insert operations. That leads to a total of 4.182 million database operations per day, or on average 48.40 database operations per second (see Table 5). As storage is spread among different storage solutions, many of these operations will not impact each other, can be conducted in parallel and the overall number of operations per second should achievable. 
Insert operations for document corpora per day: 68,000
Insert operations for sentiment-related information per day: 1,360,000
Insert operations for decision support components per day: 1,360,000
Grand total insert operations per day: 2,788,000
Estimated retrieval operations per day (50% of grand total inserts): 1,394,000
Grand total database operations per day: 4,182,000
Grand total database operations per second: 48.40
Table 5: Rough estimate of database operations
However, when the required storage space is estimated in a similar way from these figures, another potential bottleneck becomes apparent. Assuming that an annotated document corpus may require 25 KB, and that each insert operation for sentiment-related information and for decision support data may require 10 KB, the total accumulated storage required within one year would be roughly 10.55 TB¹. As these figures are very rough and probably overestimate the actual load, they will need to be reviewed once the data acquisition pipeline is set up.
The plan for implementing the scaling approaches for the information integration services is presented in Table 6.
Now (Jul–Sep 2011, M10–M12):
- Functionality coverage: basic availability of the storage solutions (filesystem, MongoDB, RDBMS) for the ontology, document corpora and sentiment-related information.
- Performance: N/A
Sep 2011–Mar 2012 (M12–M18):
- Functionality coverage: pipeline components do store data.
- Performance: N/A
Mar–Sep 2012 (M18–M24):
- Functionality coverage: expand to cater for requirements from the DSS (WP6) and the Integrated FIS (WP7); provide an access interface to WP6/WP7 clients.
- Performance: scaling performance along with the scaling of the pipeline components, both in terms of the number of processed sources and in terms of near real-time processing of sources.
Sep 2012–Sep 2013 (M24–M36):
- Functionality coverage: expand to cater for requirements from the end-user prototypes (WP8); otherwise unchanged.
- Performance: unchanged.
Table 6: Scaling plan for the knowledge base
¹ These figures ignore any potential database overhead and assume the factor 1,000 (rather than 1,024) when converting from KB to MB, GB and TB.
3.5. Decision support and visualisation services
To devise a scaling plan for the decision support models, it is first important to identify the models that will need to be developed for the purpose of the use cases (UC). At this moment, this is possible only with some speculation and may change as the project progresses.
In UC #1, i.e. the Market surveillance use case, the detection of market sounding and pump-and-dump scenarios will most likely be attempted by employing near-duplicate detection techniques (see (FIRST D2.1 Technical requirements and state-of-the-art, 2011), Section 2.1.6). In the first project year, we have implemented a hash-based near-duplicate detector (Manku, Jain, & Sarma, 2007) as part of the data acquisition and preprocessing pipeline (see (FIRST D2.1 Technical requirements and state-of-the-art, 2011), Section 2.1.5). The implemented algorithm is highly scalable (suitable for Web-scale applications) but suffers from a drawback that hinders its use in the FIRST UC scenarios. Specifically, the algorithm finds all documents encountered in the stream that differ from the current document in 3 or fewer hash-code bits. There is no intuitive interpretation of how these bits translate into words, sentences, or paragraphs (the "how many bits for a word" dilemma). Our preliminary experiments showed that many near-duplicates are not discovered, especially if the texts are short.
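To make this concrete, the sketch below shows the fingerprint comparison that underlies a hash-based detector of this kind (Manku, Jain, & Sarma, 2007): each document is reduced to a 64-bit fingerprint, and two documents count as near-duplicates when the fingerprints differ in at most three bits. It is only an illustration of the principle (the published algorithm avoids the linear scan shown here by using permuted fingerprint tables), not the code used in the FIRST pipeline, and it also shows why the threshold is hard to interpret: the three bits have no direct correspondence to words or sentences.

import java.util.List;

public class HashNearDuplicates {

    static final int K = 3; // maximum allowed Hamming distance between fingerprints

    /** Returns true if the two 64-bit fingerprints differ in at most K bits. */
    static boolean nearDuplicate(long fp1, long fp2) {
        return Long.bitCount(fp1 ^ fp2) <= K;
    }

    /** Counts previously seen fingerprints within Hamming distance K of the current one. */
    static int countNearDuplicates(long current, List<Long> seenFingerprints) {
        int hits = 0;
        for (long fp : seenFingerprints) {
            if (nearDuplicate(current, fp)) {
                hits++;
            }
        }
        return hits;
    }
}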
This limitation motivated the development of a new (pragmatic) near-duplicate detection algorithm based on an inverted index. If deemed suitable, the new algorithm will replace the currently employed solution at some point during the second project year.
For the tasks in UC #2, i.e. Reputational risk assessment, qualitative multi-attribute models (Žnidaršič, Bohanec, & Zupan, 2008) are planned to be used (see (FIRST D2.1 Technical requirements and state-of-the-art, 2011), Section 2.5.3). Qualitative models do not present a scaling issue as they are based on extremely efficient algorithms.
In UC #3, i.e. the Retail brokerage use case, the model that clearly exhibits a scaling issue is the topic detection algorithm required to (i) detect emerging topics and (ii) visualize topic trends. In addition to the topic hierarchy model, the portfolio optimization task requires another type of model, which can be either qualitative (as in the case of UC #2) or quantitative (e.g., a decision tree). We will scale both models up by implementing efficient online (stream-based) variants. More information on online decision trees and topic detection algorithms is given in Section 3.5.2.
In addition to the models and algorithms discussed above, we also plan to develop a topic space visualization algorithm ((Grcar, Podpecan, Jursic, & Lavrac, 2010); see also (FIRST D2.1 Technical requirements and state-of-the-art, 2011), Section 4.2 and Annex 5) and employ it for providing insights into which topics are being discussed in the context of financial markets.
Table 7 shows the scale-up plan for the models that will be employed in the context of the FIRST use cases.
Oct 2011–Mar 2012 (M13–M18): experimenting with datasets created from the acquired data (historical data).
Mar–Sep 2012 (M18–M24): experimenting with simulated streams (historical data).
Sep 2012–Mar 2013 (M24–M30): running in near-real time on 1 selected Web site (testing).
Mar–Sep 2013 (M30–M36): running in near-real time on approximately 160 Web sites.
Table 7: Scaling plan for the decision-support models
3.5.1 Scaling techniques for clustering and classification
From a high-level perspective, we plan to employ (i) qualitative modelling techniques, (ii) visualization techniques, and (iii) machine learning techniques. As already mentioned, qualitative models do not present a scaling issue as they are based on extremely efficient algorithms. Visualization and machine learning techniques, on the other hand, need to be adapted to work with intensive data streams in near-real time. There are several general-purpose scaling techniques at hand, already presented to some extent in (FIRST D2.1 Technical requirements and state-of-the-art, 2011), Section 2.6, such as pipelining, parallelization and warm starts¹. However, sometimes these are not applicable or the resulting process is still not efficient enough. In such cases, we need to resort to stream-based alternatives. These are entirely different algorithms, designed with the awareness that they operate in a stream-based environment. In the following subsections, we present several stream-based algorithms from the two major categories of machine learning algorithms: (i) unsupervised learning (i.e., clustering) and (ii) supervised learning (i.e., classification). To address the FIRST scenarios, we put the stream-based clustering methods into the context of topic detection, trend detection, and visualization. In addition, we discuss online model trees (i.e., a variant of stream-based decision trees), which are glass-box models and are likely to be employed in FIRST. Details of clustering for topic and trend detection techniques are discussed in Annex 1.
¹ Warm starts are possible in practically every iterative optimization method. This means that, when new data enters the system (or some outdated data "leaves" it), we start the algorithm with the result from the previous run, and consequently it converges faster (i.e., requires fewer iterations). Warm starts can be used, for example, with k-means clustering, stress majorization and other iterative graph layout methods, least-squares solvers, support vector machines (SVM), and many other iterative methods.
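As an illustration of the warm-start technique mentioned in footnote 1 above, the following sketch reruns k-means on an updated set of document vectors starting from the previously computed centroids rather than from scratch, so far fewer iterations are needed. It assumes documents are represented as dense TF-IDF vectors and is a minimal illustration, not the FIRST implementation.

import java.util.List;

public class WarmStartKMeans {

    /** Runs a bounded number of Lloyd iterations, starting from the given centroids. */
    static double[][] run(List<double[]> docs, double[][] centroids, int maxIter) {
        int k = centroids.length, dim = centroids[0].length;
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] doc : docs) {                 // assignment step
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int d = 0; d < dim; d++) {
                        double diff = doc[d] - centroids[c][d];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                counts[best]++;
                for (int d = 0; d < dim; d++) sums[best][d] += doc[d];
            }
            for (int c = 0; c < k; c++) {               // update step
                if (counts[c] > 0) {
                    for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return centroids;
    }

    /** Warm start: copy the centroids of the previous run and continue from them. */
    static double[][] warmStart(List<double[]> updatedDocs, double[][] previousCentroids) {
        double[][] start = new double[previousCentroids.length][];
        for (int c = 0; c < previousCentroids.length; c++) {
            start[c] = previousCentroids[c].clone();
        }
        return run(updatedDocs, start, 10);             // typically converges in a few iterations
    }
}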
3.5.2 Learning model trees from data streams
The problem of real-time extraction of meaningful patterns from time-changing data streams is of increasing importance in machine learning and data mining. Regression on time-changing data streams is a relatively unexplored topic, despite many possible applications. In the decision support and visualization services we will use an efficient and incremental stream mining algorithm, FIMT-DD (Ikonomovska E, 2011), which is able to learn regression and model trees from possibly unbounded, high-speed and time-changing data streams. To the best of our knowledge, there is no other general-purpose algorithm for the incremental learning of regression/model trees that performs explicit change detection and informed adaptation. The algorithm operates online and in real time, observes each example only once at the time of arrival, and maintains a ready-to-use model tree at any time. The tree leaves contain linear models induced online from the examples assigned to them. The algorithm has mechanisms for drift detection and model adaptation, which enable it to maintain accurate and up-to-date regression models at any time. The drift detection mechanism exploits the structure of the tree in the process of local change detection. In response to local drift, the algorithm is able to update the tree structure only locally. This approach improves the any-time performance (i.e., the availability of an up-to-date model at any time) and greatly reduces the costs of adaptation. Details of the FIMT-DD algorithm are presented in Annex 2.
3.6. Integration infrastructure
The integration infrastructure has a special and distinct role in the scaling strategy, as it has an impact at both the global and the local level. On the one hand, it provides the technological means for realizing the global scaling strategy. On the other hand, the integration infrastructure also encompasses the graphical front-ends and the services necessary for implementing the whole Integrated Financial Market Information System (FIRST D2.2 Conceptual and technical integrated architecture design, 2011). It should therefore ensure that the rest of the infrastructure keeps pace with the results of data processing. The "global" aspect of integration infrastructure scaling is approached by providing a lightweight messaging middleware that integrates the components taking part in pipeline processing. The infrastructure will support the pipeline scaling scenarios as already described in the global scaling strategy (see Chapter 2). The "local" aspect has been addressed by providing a coherent system design.
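The sketch below illustrates the kind of push-based, asynchronous pipeline integration described above, assuming the Java binding of ØMQ/ZeroMQ (org.zeromq), the messaging library referenced in the early experiments. A pipeline stage pulls documents from the previous stage and pushes its results to the next one; running several such workers on the same endpoints is one way to realize the scaling-out scenarios of Chapter 2. The endpoint addresses and the process() placeholder are hypothetical.

import org.zeromq.ZMQ;

public class PipelineStage {
    public static void main(String[] args) {
        ZMQ.Context ctx = ZMQ.context(1);

        ZMQ.Socket upstream = ctx.socket(ZMQ.PULL);      // documents from the previous stage
        upstream.connect("tcp://localhost:5557");        // hypothetical endpoint

        ZMQ.Socket downstream = ctx.socket(ZMQ.PUSH);    // results to the next stage
        downstream.connect("tcp://localhost:5558");      // hypothetical endpoint

        while (!Thread.currentThread().isInterrupted()) {
            byte[] document = upstream.recv(0);          // blocking receive
            byte[] result = process(document);           // e.g. annotation or sentiment extraction
            downstream.send(result, 0);
        }
        upstream.close();
        downstream.close();
        ctx.term();
    }

    private static byte[] process(byte[] document) {
        return document;                                 // placeholder for the actual pipeline component
    }
}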
From a performance point of view, the required infrastructure characteristics have already been taken into account in the architecture definition (FIRST D2.2 Conceptual and technical integrated architecture design, 2011), which outlines approaches such as pipeline processing for data analysis, push-based services and asynchronous data exchange. The choice of a messaging-based approach and of a flexible, lightweight architecture supports the overall scaling goals. In the next steps we plan to adjust and fine-tune the integration middleware in order to comply with the requirements of the analytical pipeline. From early experiments ((FIRST D2.2 Conceptual and technical integrated architecture design, 2011), Section 4.2) we learned that the overall throughput of the messaging middleware does not constrain the data processing estimates. Infrastructure scaling is therefore more of a feature-oriented scaling plan that will improve the global throughput of the FIRST system by properly managing the data stream flow according to the pipeline scaling scenarios. The scaling plan is presented in Table 8.
Now (Feb–Sep 2011, M5–M12):
- Coverage: analysis and choice of the most suitable integration approach; experiments with messaging.
- Aspects: N/A
- Throughput: N/A
Sep 2011–Mar 2012 (M12–M18):
- Coverage: early version of the pipeline integration prototype; testbed for the first scaling experiments.
- Aspects: reliable messaging established between WP3 and WP4.
- Throughput: N/A
Mar–Sep 2012 (M18–M24):
- Coverage: advanced prototype of the integration infrastructure; integration of all pipeline components; supporting the chosen global scaling scenario; monitoring of performance allows for further architecture fine-tuning; GUI and high-level services keep up with the running pipeline.
- Aspects: 1 selected scaling scenario supported.
- Throughput: scale the overall infrastructure to keep up with test data.
Sep 2012–Sep 2013 (M24–M36):
- Coverage: final integration infrastructure, supporting the devised scalability goals and able to handle the target data volume in a timely manner; global scaling techniques are supported by the architecture and fine-tuned in an optimal way; the system works on the target deployment infrastructure.
- Aspects: all scaling scenarios supported.
- Throughput: scale the overall infrastructure to keep up with the live stream of data acquisition (around 68,000 documents daily).
Table 8: Scaling plan for Integration infrastructure
4. Conclusions
This document highlights the most important aspects of reaching the system scalability devised in the project goals. It presents various techniques (both at the architectural level and at the level of individual technical components) and a roadmap for achieving the scalability goals. By following an incremental development process, divided into milestones with defined goals, it ensures that the progress towards scalability can be tracked and that risks can be minimised. We also demonstrated that the architectural decisions and the integration approach strongly support the scalability of the overall system, with the emphasis on the analytical pipeline. For example, the lightweight messaging approach applied to pipeline processing provides the flexibility needed to apply various scaling-out scenarios. During further development, the combination of component scaling techniques and pipeline scaling scenarios will be applied to ensure that the FIRST system meets its performance goals.
References
Barbara, D., & others. (1997). The New Jersey data reduction report. Technical Committee on Data Engineering, 20, 3-45.
Bondi, A. B. (2000).
Characteristics of Scalability and Their Impact on performance. Proceedings of the 2nd international workshop on Software and performance (pp. 195-203). ACM. C. C. Aggarwal, J. H. (2003). A framework for clustering evolving data streams. C. C. Aggarwal, J. H. (2004). A Framework for Projected Clustering of High Dimensional Data Streams. Cauwenberghs, G., & Poggio, T. (2001). Incremental and Decremental Support Vector Machine Learning. Proceedings of NIPS 2001 . Feng Cao, E. M. (2006). Density-based clustering over an evolving data stream with noise. SIAM Conference on Data Mining. FIRST D1.2 Usecase requirements specification. (2011). FIRST D2.1 Technical requirements and state-of-the-art. (2011). FIRST D2.2 Conceptual and technical integrated architecture design. (2011). FIRST D4.1 First semantic information extraction prototype. (2011). Fisher, D. H. (1987). Knowledge Acquisition Via Incremental Conceptual Clustering. Machine Learning . Grcar, M., Podpecan, V., Jursic, M., & Lavrac, N. (2010). Efficient Visualization of Document Streams. Proceedings of Discovery Science 2010 (pp. 174–188). Canberra: SpringerVerlag Berlin Heidelberg. IBM Rational Unified Process v7.0. (2008). Ikonomovska E, G. J. (2011). Learning model trees from evolving data streams. Data Min. Knowl. Discov 23(1) , 128-168. Joachims, T. (2006). Training Linear SVMs in Linear Time. Proceedings of the ACM Conference on KDD 2006 . Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate Detection using Shallow Text Features. Proceedings of The Third ACM International Conference on Web Search and Data Mining, WSDM 2010. New York. L. O. Callaghan, N. M. (2003). Streaming-data algorithms for high-quality clustering. Liu YB, C. J. (Jan. 2008). Clustering text data streams. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 23(1) , 112–128. Manku, G. S., Jain, A., & Sarma, A. D. (2007). Detecting Near-Duplicates for Web Crawling. Proceedings of WWW 2007. N. Sahoo, J. C. (2006). Incremental hierarchical clustering of text documents. In Proceedings of the ACM International Conference on Information and Knowledge Management (CIKM), (pp. 357-366). ØMQ (version 0.3) tests. (2011). Retrieved July 28, 2011, from ØMQ: http://www.zeromq.org/results:0mq-tests-v03 Saveski, M., & Grcar, M. (2011). Web Services for Stream Mining: A Stream-Based Active Learning Use Case. Proceedings of the PlanSoKD Workshop at ECML-PKDD 2011 . Tong, S., & Koller, D. (2000). Support Vector Machine Active Learning with Applications to Text Classification. Proceedings of ICML 2000 . Tsymbal, A. (2004). The problem of concept drift: definitions and related work. ZeroMQ. (2011). ØMQ - The Guide. Retrieved July 28, 2011, from ØMQ: http://zguide.zeromq.org/page:all Zhang, R. a. (1996). BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD Conference on Management of Data. © FIRST consortium Page 32 of 43 D2.3 Žnidaršič, M., Bohanec, M., & Zupan, B. (2008). Modelling impacts of cropping systems: Demands and solutions for DEX methodology. European Journal of Operational Research , 189, 594-608. James Allan. Topic Detection and Tracking: Event-Based Information Organization. Kluwer Academic Publishers, Norwell, MA, USA, 2002. L. O. Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, "Streaming-data algorithms for high-quality clustering," 2003. 
Susan Havre , Ieee Computer Society , Elizabeth Hetzler , Paul Whitney, Lucy Nowell, “ThemeRiver: Visualizing thematic changes in large document collections”, IEEE Transactions on Visualization and Computer Graphics, 2002. Feng Cao, Martin Ester, Weining Qian, Aoying Zhou,“Density-based clustering over an evolving data stream with noise”, 2006. In 2006 SIAM Conference on Data Mining. Ronen Feldman, James Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, 2007 Sanjoy Dasgupta, Daniel Hsu, “Hierarchical Sampling for Active Learning”, Proceedings of the 25th International Conference on Machine Learning, Finland, 2008. Aggarwal CC (2006) Data streams: models and algorithms. Springer, New York Breiman L, Friedman JH, Olshen RA, Stone CJ (1998) Classification and regression trees. CRC Press, Boca Raton, FL Chaudhuri P, Huang M, Loh W, Yao R (1994) Piecewise polynomial regression trees. Stat Sin 4:143-167 Chen Y, Dong G, Han J, Wah BW, Wang J (2002) Multi-dimensional regression analysis of time-series data streams. In: Proc the 28th int conf on very large databases. Morgan Kaufmann, San Francisco, pp 323-334 Dobra A, Gherke J (2002) SECRET: a scalable linear regression tree algorithm. In: Proc 8th ACM SIGKDD int conf on knowledge discovery and data mining. ACM Press, New York, pp 481-487 Domingos P, Hulten G (2000) Mining high speed data streams. In: Proc 6th ACM SIGKDD int conf on knowledge discovery and data mining. ACM Press, New York, pp 71-80 Gama J, Rocha R, Medas P (2003) Accurate decision trees for mining high-speed data streams. In: Proc 9th ACM SIGKDD int conf on knowledge discovery and data mining. ACM Press, New York, pp 523-528 Gama J, Medas P, Rocha R (2004) Forest trees for on-line data. In: Proc 2004 ACM symposium on applied computing. ACM Press, New York, pp 632-636 Gao J, Fan W, Han J, Yu PS (2007) A general framework for mining concept-drifting data streams with skewed distributions. In: Proc 7th int conf on data mining, SIAM, Philadelphia, PA Gratch J (1996) Sequential inductive learning. In: Proc 13th natl conf on artificial intelligence and 8th innovative applications of artificial intelligence conf, vol 1. AAAI Press, Menlo Park, CA, pp 779- 786 Hoeffding W (1963) Probability for sums of bounded random variables. J Am Stat Assoc 58:1330 Hulten G, Spencer L, Domingos P (2001) Mining time-changing data streams. In: Proc 7th ACM SIGKDD int conf on knowledge discovery and data mining. ACM Press, New York, pp 97-106 Ikonomovska E, Gama J, Dzeroski S (2011). Learning model trees from evolving data streams. Data Min. Knowl. Discov. 23(1), pp 128-168 Jin R, Agrawal G (2003) Efficient decision tree construction on streaming data. In: Proc 9th ACM SIGKDD int conf on knowledge discovery and data mining. ACM Press, New York, © FIRST consortium Page 33 of 43 D2.3 pp 571-576 Karalic A (1992) Employing linear regression in regression tree leaves. In: Proc 10th European conf on artificial intelligence. Wiley, New York, pp 440-441 Loh W (2002) Regression trees with unbiased variable selection and interaction detection (2002). Stat Sin 12:361-386 Malerba D, Appice A, Ceci M, Monopoli M (2002) Trading-off local versus global effects of regression nodes in model trees. In: Proc 13th int symposium on foundations of intelligent systems, LNCS, vol 2366. Springer, Berlin, pp 393-402 Musick R, Catlett J, Russell S (1993) Decision theoretic sub-sampling for induction on large databases. In: Proc 10th int conf on machine learning. 
Morgan Kaufmann, San Francisco, pp 212-219
Pfahringer B, Holmes G, Kirkby R (2008) Handling numeric attributes in Hoeffding trees. In: Proc 12th Pacific-Asian conf on knowledge discovery and data mining, LNCS, vol 5012. Springer, Berlin, pp 296-307
Potts D, Sammut C (2005) Incremental learning of linear model trees. J Mach Learn 61:5-48. doi:10.1007/s10994-005-1121-8
Quinlan JR (1992) Learning with continuous classes. In: Proc 5th Australian joint conf on artificial intelligence. World Scientific, Singapore, pp 343-348
Rajaraman K, Tan A (2001) Topic detection, tracking, and trend analysis using self-organizing neural networks. In: Proc 5th Pacific-Asian conf on knowledge discovery and data mining, LNCS, vol 2035. Springer, Berlin, pp 102-107
Siciliano R, Mola F (1994) Modeling for recursive partitioning and variable selection. In: Proc int conf on computational statistics. Physica Verlag, Heidelberg, pp 172-177
Torgo L (1997) Functional models for regression tree leaves. In: Proc 14th int conf on machine learning. Morgan Kaufmann, San Francisco, pp 385-393
VFML (2003) A toolkit for mining high-speed time-changing data streams. http://www.cs.washington.edu/dm/vfml. Accessed 19 Jan 2010
Vogel DS, Asparouhov O, Scheffer T (2007) Scalable look-ahead linear regression trees. In: Berkhin P, Caruana R, Wu X (eds) Proc 13th ACM SIGKDD int conf on knowledge discovery and data mining, KDD. ACM, San Jose, CA, pp 757-764
WEKA 3 (2005) Data Mining Software in Java. http://www.cs.waikato.ac.nz/ml/weka. Accessed 19 Jan 2010
Widmer G, Kubat M (1996) Learning in the presence of concept drifts and hidden contexts. J Mach Learn 23:69-101. doi:10.1007/BF00116900
Annex 1. Clustering for topic and trend detection
Annex a. Introduction
In this section, we focus on the problem of clustering an online data stream of text documents with the aim of visualizing and discovering topics and trends. Due to the real-time nature of the problem, it is infeasible to store such a stream of documents and process it with the better-known offline clustering methods. Therefore, we rely on the more novel streaming clustering methods, several of which are described below. Further on, we describe topics and trends in the data, in our case obtained from the hierarchical clustering of documents. Topics and trends are presented to the user by employing appropriate visualization techniques, such as the dendrogram and the canyon flow chart; the presentation is made interactive by incorporating these visualization techniques into the user interface.
Annex b. Document streams
A data stream is an ordered (usually temporally ordered) sequence of data items (e.g., user click streams, network data packets, published text documents). We are concerned with text documents (news and blogs) as incoming data items. The number of documents is unbounded and the time between publications is uneven. Web pages are usually obtained through an RSS reader and processed into plain-text documents. As the time of acquisition and the time of publication of a document differ, we can only achieve an approximate temporal ordering of the incoming documents. One solution is to employ a preliminary buffer that sorts the documents by publication date, although this may be inappropriate due to time constraints. Nevertheless, a sliding time window of fixed size may be used as a first-in-first-out queue for document clustering.
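The sketch below illustrates such a fixed-size sliding time window as a simple first-in-first-out buffer: newly acquired documents enter at the tail, and documents older than the window length are evicted before the buffer is handed to the clustering algorithm. The Document type and the window length are assumptions made purely for the illustration.

import java.util.ArrayDeque;
import java.util.Deque;

public class SlidingWindowBuffer {

    static class Document {
        final long publishedMillis;
        final String text;
        Document(long publishedMillis, String text) {
            this.publishedMillis = publishedMillis;
            this.text = text;
        }
    }

    private final Deque<Document> window = new ArrayDeque<>();
    private final long windowLengthMillis;

    SlidingWindowBuffer(long windowLengthMillis) {
        this.windowLengthMillis = windowLengthMillis;
    }

    /** Adds a newly acquired document and evicts those that fell out of the time window. */
    void add(Document doc) {
        window.addLast(doc);                                     // FIFO: newest at the tail
        long cutoff = doc.publishedMillis - windowLengthMillis;
        while (!window.isEmpty() && window.peekFirst().publishedMillis < cutoff) {
            window.removeFirst();                                // oldest documents leave first
        }
    }

    /** The documents currently handed to the stream clustering algorithm. */
    Iterable<Document> currentWindow() {
        return window;
    }
}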
Annex c. Clustering document streams
To handle an indefinite, unevenly distributed incoming stream of text documents, we employ streaming (online) clustering algorithms. These online algorithms process documents without knowledge of either the whole past or the future of the document stream, whereas offline algorithms cluster a completely known data set. We have made an overview of streaming clustering algorithms suitable for text document streams.
Several application-specific requirements can be defined that text stream clustering algorithms should comply with. The algorithm should not need to be given the actual number of clusters, as this number is unknown and changes over time. As the problem of clustering n documents into k clusters is NP-hard, the algorithm should scale well with the number of data items (text documents) and the number of dimensions of the text representation. It should also support clusters of arbitrary size and shape (e.g., different from hyper-spheres). As an outlier document may represent a new emerging topic (topic drift), forming a singleton cluster for an outlier is important. Each cluster is expected to reflect a meaningful topic, with its sub-clusters as sub-topics (in the case of hierarchical clustering). One document can often be related to several topics (i.e., fit between several clusters); therefore, soft (fuzzy) clustering could also be considered.
The issues that require special consideration in stream clustering are topic (concept) drift (i.e., evolving data) and time constraints. The problem of concept drift (Tsymbal, 2004) arises naturally from the real world, where concepts (topics) are often not stable but change with time, thus making a model built on previous data obsolete. In clustering, concept drift manifests itself as new emerging clusters, which can easily be confused with outliers, or as existing clusters shrinking in size.
The two most rudimentary clustering methods are also the two most researched and improved ones: k-means and k-medians. K-means, being fairly simple, produces a solution that is only guaranteed to be a local optimum and is also sensitive to outliers. It also requires random access to the data, which is inherently inappropriate for streams. K-medians selects one of the documents as the cluster representative and is thus less sensitive to outliers, at the cost of computational complexity. These algorithms also require the number of clusters to be given as a parameter, which is unsuitable for an evolving data stream.
COBWEB (Fisher, 1987) is an incremental method for hierarchical conceptual clustering. Although it was originally designed for categorical data, (N. Sahoo, 2006) describes its usage for text documents. The notion of hierarchical conceptual clustering includes the clustering problem and also the problem of determining useful concepts for each cluster. Hierarchical clustering of the data into a classification tree is performed as a hill-climbing bidirectional search through a space of classification trees utilizing four basic operators, namely merging, splitting, inserting and passing of nodes. Although the method is both incremental and computationally feasible, it is not well suited for data streams; its main disadvantage is its memory-consuming unbalanced tree structure.
BIRCH (Zhang, 1996) is arguably the most basic of these methods. It was intended for very large offline datasets and can therefore be used on streams only to some extent.
In its two steps, BIRCH methods builds the tree with information about clusters in a single pass through the data and then refines the tree by removing the sparse clusters as outliers. The information about each cluster is contained in a clustering feature triple of the CF (clustering feature) tree. The limitation of the method is the sensitive threshold for the numbers of documents a cluster must have not to be regarded as an outlier and the radial size of the cluster, which may present a problem if the cluster of documents is not ovally shaped. There are no guarantees about the SSE (sum of the squared errors) for its performance. The method does not directly support the topic drift and removal of the outdated documents. STREAM (L. O. Callaghan, 2003) is the first method designed especially for stream clustering. It is also based on two steps, where the documents are first clustered in a k-median way weighted by the number of documents in a cluster and secondly, the medians are clustered up to a hierarchy. The main disadvantages of this method are time complexity and evolving data. Both BIRCH and STREAM are inappropriate for evolving data (topic drift) as they generate clusters based on the history of the whole dataset. CluStream (C. C. Aggarwal J. H., 2003) is the first one to give more attention to evolving data (topic drift) and outliers. It has two base steps. One is an online micro-clustering component, which stores summary statistics about the streaming data in a manner of snapshots in time and the other is an off-line macro-clustering component, which uses the stored summary statistics in conjunction with the user input to build real data clusters for a given period of time. Such a twophase approach gives a significant insight to the user. Its disadvantage is time complexity of adding and removing documents to and from the model, which is linearly dependent on the number of clusters in the model. Also, its predefined number of micro-clusters is inappropriate for evolving data. HPStream (C. C. Aggarwal J. H., 2004) is a method of streaming clustering, specialized for high-dimensional data. It uses a fading cluster structure method and the projection-based clustering. It outperforms CluStream in cluster purity for about 20%, at the cost of speed. However, it cannot detect clusters of arbitrary orientations. DenStream (Feng Cao, 2006) has micro-cluster structures, which successfully summarize clusters of arbitrary size. A novel pruning strategy gets rid of the outliers while it enables the growth of the new clusters. The purity of the clusters is on average 20% better than with CluStream. OCTSM (Liu YB, Jan. 2008) uses semantic smoothing adjusted to stream clustering, which shows to be better than the more common TF-IDF scheme for clustering the text documents with respect to their semantic similarity. It employs a fading function (aging) to account for the evolution of the data stream. A novel cluster statistics structure named cluster profile is introduced. The cluster profile captures real-time semantics of text data. ClusTree (Kranen et al, 2009) is a parameter-free algorithm capable of adapting to the speed of © FIRST consortium Page 36 of 43 D2.3 the input stream and detecting concept-drift and outliers in the stream. A ClusTree itself is a compact and adaptive index structure, which maintains stream summaries. Aging of data items is incorporated to reflect the greater importance of the newer data. 
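Several of the methods above (DenStream, OCTSM, ClusTree) rely on such a fading (aging) mechanism. The sketch below shows the usual form of an exponential fading function, assuming a tunable decay rate lambda; it is only meant to illustrate how older documents gradually lose influence on a cluster, not to reproduce any particular algorithm.

public class FadingFunction {

    private final double lambda; // larger lambda means faster forgetting of old documents

    FadingFunction(double lambda) {
        this.lambda = lambda;
    }

    /** Weight of an item that is ageSeconds old: f(t) = 2^(-lambda * t). */
    double weight(double ageSeconds) {
        return Math.pow(2.0, -lambda * ageSeconds);
    }

    /** Faded weight of a whole cluster, given the ages of its member documents. */
    double clusterWeight(double[] memberAgesSeconds) {
        double sum = 0.0;
        for (double age : memberAgesSeconds) {
            sum += weight(age);
        }
        return sum;
    }
}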
StreamKM++ (Ackermann, 2010) employs weighted sets coresets for non-uniform sampling of the data stream based on the k-means++ procedure. Fast computation of the coresets is enabled by building the coreset tree, which is a binary tree associated with hierarchical divisive clustering. Its advantages are suitability for large number of clusters and scalability with the number of dimensions. Although the method is slower than BIRCH, it creates significantly better clusters in terms of the sum of squared error measure. Annex d. Topic detection In the task of topic detection, one wants to detect meaningful groups (clusters) of text documents that are related to the same topic. The task arises from the concept drift, present in evolving streams of text documents. We are given no prior information about the number or names of the topics. The definition of a topic is “a seminal event or activity, along with all directly related events and activities” (Allan, 2002) or “something nontrivial happening in a certain time at a certain place”. As the notion of a topic includes all the related events and activities, it is reasonable to base it on hierarchical cluster (commonly represented as a dendrogram), where each cluster represents a topic with more specific related topics in the child clusters below. Emerging and disappearing topics are shown as growing and shrinking clusters, respectively. Similarity between the documents represented with stemmed-term frequencies is based on the vector space cosine similarity. Consequently, this makes the clustering independent of the domain and most of the languages. Alternatively, a semantic smoothing as in the OCTSM (Liu, Cai, Yin, Fu, 2008) can be used instead of the TF-IDF scheme. Annex e. Trend detection In general, a trend is a long-term temporal variation in the statistical properties of a process. In our case, a trend describes a change in the number of documents related to some topic over a long-enough period of time. A positive trend means that a topic, either existing or emerging, is gaining in the number of documents. In contrast, a negative trend corresponds to a decrease in the number of documents related to a topic over a long-enough period of time. Trends can be identified and analyzed by observing derivatives of the function of topic strength. On the other hand, topics and trends can be nicely visualized with a ThemeRiver-like algorithm called Canyon Flow. Technically speaking, the algorithm provides a “view” of a hierarchy of document clusters. The underlying algorithm is essentially a hierarchical bisecting clustering algorithm employed on the bag-of-words representation of documents. We briefly discuss trend visualization in the following section. Annex f. Visualization The clustered stream of text documents is visualized with the aim to easily identify and analyze topics and trends. One of the most intuitive representations of hierarchical document clustering, despite the quadratic time usually needed to construct it, is certainly dendrogram. In it, lower levels represent clusters containing conceptually more similar documents and upper more distanced ones. Thus, clusters in the lower levels of the dendrogram are more specialized topics, whereas the upper ones are more general. There are very few methods for the stream clustering visualization (Silic, 2010). They are mostly applied to lower dimensional data or based on projection to lower dimensional space. 
Nevertheless, we are more interested in changes of topic over time, rather than document clustering itself. To convey the change of topic strength over time, a (multi)line chart is more appropriate. Each different line in the chart shows the number or percentage of the documents related to specific topic at a specific time. A canyon flow chart, referred to as ThemeRiver in (Havre et al, 2002), lends itself even better to display trends over a certain period in time. Each (differently colored) region corresponds to the percentage of documents related to that topic during a specific time period. Ideally, the user is able to interactively select a specific region in © FIRST consortium Page 37 of 43 D2.3 the canyon flow chart and explore it, which corresponds to going down the cluster to its subclusters in the dendrogram. Also, each region could be colored with different shades of the same color as to present its sub-clusters. Similar techniques are NewsRiver, LensRiver, EventRiver, briefly explained in (Silic, 2010). Annex g. Clustering for active learning A convenient application of hierarchical clustering is also in active learning (Dasgupta & Hsu, 2008). Active learning is a machine-learning technique designed to actively query the domain expert for new labels by putting forward data instances that, upon being labelled, contribute most to the model being built. As annotating the dataset is costly, it is important to make it as efficient as possible in terms of time and number of items annotated. Simple random sampling requires sampling too much of the items to be accurate enough. An intuitive margin-based heuristics might be used to bias the sampling of the data items towards the decision boundary to make it more precise. This method does not converge well for complicated boundaries. Hierarchical clustering provides the necessary granularity to approximate the decision boundaries, which do not always align with the clusters of the data (i.e., clusters do not always contain only one class). Clusters are hierarchically queried top-down. More samples are taken from more numerous and impure clusters. Those clusters which are pure enough are pruned and not queried any more as they supposedly comprise equally labeled items. Sampling stops when all the clusters are pruned. The resulting dataset is labeled significantly better than it would be with margin-based sampling or simple random sampling. In addition, (Dasgupta & Hsu, 2008) in their experiments prefer using Latent Dirichlet Allocation to create document topic mixture models and Kullback-Leibler divergence as the notion of distance between documents over the common TF-IDF scheme for textual data. Annex h. Conclusions Mining streaming data is a relatively novel research field. Detecting topics and changes in such real-time data makes stream clustering even more challenging. An overview of both original and novel methods appropriate for stream clustering was presented and application-specific requirements were stated. Unfortunately, not many experiments were presented for textual data in the related articles. Further testing of different algorithms on textual streams will help determine the most suitable one. Efficient topic detection depends mostly on the clustering algorithm's ability to differ outliers from emerging clusters and should be taken into consideration. To easily spot emerging and disappearing topics, efficient visualization is indispensable. 
Many interactive river-like charts, depicting topics as differently colored regions, are used for this purpose.
Annex 2. Learning model trees from data streams
Annex a. Introduction
In the last decade, data streams (Aggarwal 2006) have been receiving growing attention in many research areas, due to the broad recognition of emerging applications. Examples include financial applications (stock exchange transactions), telecommunication data management (call records), Web applications (customer click-stream data), surveillance (audio/video data), bank-system management (credit card/ATM transactions, etc.), monitoring patient health, and many others. Such data are typically represented as evolving time series of data items, arriving continuously at high speeds, and having dynamically changing distributions. Modeling and predicting the temporal behavior of streaming data can provide valuable information contributing to the success of time-critical operations.
The task of regression analysis is one of the most commonly addressed topics in the areas of machine learning and statistics. Regression and model trees are often used for this task due to their interpretability and good predictive performance. However, regression on time-changing data streams is a relatively unexplored and typically nontrivial problem. Fast and continuous data feeds as well as time-changing distributions make traditional regression tree learning algorithms unsuitable for data streams. This section describes an efficient and incremental algorithm for learning regression and model trees from possibly unbounded, high-speed, and time-changing data streams. The algorithm has four main features: an efficient splitting-attribute selection in the incremental growing process; an effective approach for computing the linear models in the leaves; an efficient method for handling numerical attributes; and change detection and adaptation mechanisms embedded in the learning algorithm.
Annex b. Related work
We first consider related work on batch and incremental learning of model trees. We then turn our attention to related methods for learning from stationary data streams: learning decision/classification trees and linear regression/neural networks. Finally, we take a look at methods for on-line change detection and management of concept drift.
Batch learning of model trees
Regression and model trees are known to provide efficient solutions to complex nonlinear regression problems due to the divide-and-conquer approach applied to the instance space: their main strength lies in recursively fitting different models in each subspace of the instance space. Regression trees use the mean of the target variable as the prediction for each subspace, while model trees improve upon the accuracy of regression trees by using more complex models in the leaf nodes. The splitting of the instance space is performed recursively by choosing a split that maximizes some error reduction measure with respect to the examples assigned to the current region (node). One category comprises algorithms like M5 (Quinlan 1992), CART (Breiman et al. 1998) and HTL (Torgo 1997). These use variants of the variance reduction measure used in CART, such as the standard deviation reduction or the fifth root of the variance in the WEKA (WEKA 3 2005) implementation of the M5 algorithm.
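As a small illustration of such a measure, the sketch below computes the standard deviation reduction (SDR) of a candidate split: the standard deviation of the target values in the node minus the size-weighted standard deviations in the two resulting partitions. This is a generic illustration of the measure named above, not code taken from any of the cited systems.

public class SdrSplitScore {

    /** Population standard deviation of the target values. */
    static double sd(double[] y) {
        double mean = 0.0;
        for (double v : y) mean += v;
        mean /= y.length;
        double var = 0.0;
        for (double v : y) var += (v - mean) * (v - mean);
        return Math.sqrt(var / y.length);
    }

    /** SDR(split) = sd(parent) - |left|/|parent| * sd(left) - |right|/|parent| * sd(right). */
    static double sdr(double[] parent, double[] left, double[] right) {
        int n = parent.length;
        return sd(parent)
                - ((double) left.length / n) * sd(left)
                - ((double) right.length / n) * sd(right);
    }
}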
Another category of algorithms are those aiming to find better globally optimal partitions by using more complex error reduction methods at the cost of increased computational complexity. Examples are RETIS (Karalic 1992), SUPPORT (Chaudhuri et al. 1994), SECRET (Dobra and Gherke 2002), GUIDE (Loh 2002), SMOTI (Malerba et al. 2002) and LLRT (Vogel et al. 2007). All existing batch approaches to building regression/model trees assume that the training set is finite and stationary. For this reason, they require all the data for training to be available on the disk or in the main memory before the learning process begin. When given very large training sets, batch algorithms have shown to be prohibitively expensive both in memory and time. In this spirit, several efforts have been made in speeding up learning on large datasets. Notable © FIRST consortium Page 39 of 43 D2.3 examples are SECRET (Dobra and Gherke 2002) and the LLRT algorithm (Vogel et al. 2007). Their main weakness is that they require storing all the examples in the main memory. This becomes a major problem when datasets are larger than the main memory. In such situations, users are forced to do sub-sampling or apply other data reduction methods, which is nontrivial because of the danger of underfitting. Another characteristic of large data is that it is typically collected over a long time period or generated rapidly by a continuous, possibly distributed sources of data. In both of these scenarios, there is a high probability of non-stationary relations, which in learning problems takes the form of concept drift. We have noted that none of the existing batch algorithms for learning regression trees is able to deal with concept drift. Incremental learning of model trees Mining data streams raises many new problems previously not encountered in data mining. One crucial issue is the real-time response requirement, which severely constrains the use of complex data mining algorithms that perform multiple passes over the data. Although regression and model trees are an interesting and efficient class of learners, little research has been done in the area of incremental regression or model tree induction. To the best of our knowledge, there is only one paper (Potts and Sammut 2005) addressing the problem of incremental learning of model trees. The authors follow the method proposed by Siciliano and Mola (1994), applying it in an incremental way. Learning decision trees from stationary data streams The problem of incremental decision tree induction has received appropriate attention within the data mining community. There is a large literature on incremental decision tree learning, but our focus is on the line of research initiated by Musick et al. (1993), which motivates sampling strategies for speeding up the learning process. They note that only a small sample from the distribution might be enough to confidently determine the best splitting attribute. Example algorithms from this line of research are the Sequential ID3 (Gratch 1996), VFDT (Domingos and Hulten 2000), UFFT (Gama et al. 2003), and the NIP-H and NIP-N algorithms (Jin and Agrawal 2003). Other regression methods in stream mining One of the most notable and successful examples of regression on data streams is the multidimensional linear regression analysis of time-series data streams (Chen et al. 2002). It is based on the OLAP technology for streaming data. 
This system enables an online computation of linear regression over multiple dimensions and tracking unusual changes of trends according to the user's interest. Some attempts have been also made in applying artificial neural networks over streaming data. In Rajaraman and Tan (2001), the authors address the problems of topic detection, tracking and trend analysis over streaming data. The incoming stream of documents is analyzed by using Adaptive Resonance Theory (ART) networks. On-line change detection and management of concept drift The nature of change in streams is diverse. Changes may occur in the context of learning due to changes in hidden variables or changes in the intrinsic properties of the observed variables. Often these changes make the model built on old data inconsistent with new data, and regular updating of the model is necessary. As Gao et al. (2007) have noted, the joint probability, which represents the data distribution P(x, y) = P(y|x) * P(x), can evolve over time in three different ways: (1) changes in P(x) known as virtual concept drift (sampling shift); (2) changes in the conditional probability P(y|x); and (3) changes in both P(x) and P(y|x). We are in particular interested in detecting changes in the conditional probability, which in the literature is usually referred to as concept drift. Further, a change can occur abruptly or gradually, leading to abrupt or gradual concept drift. With respect to the region of the instance space affected by a change, concept drift can be © FIRST consortium Page 40 of 43 D2.3 categorized as local or global. In the case of local concept drift, the distribution changes only over a constrained region of the instance space (set of ranges for the measured attributes). In the case of global concept drift, the distribution changes over the whole region of the instance space, that is, for all the possible values of the target/class and the attributes. Annex c. The FIMT-DD algorithm The problem of learning model trees from data streams raises several important issues typical for the streaming scenario. First, the dataset is no longer finite and available prior to learning. As a result, it is impossible to store all the data in memory and learn from them as a whole. Second, multiple sequential scans over the training data are not allowed. An algorithm must therefore collect the relevant information at the speed it arrives and incrementally make splitting decisions. Third, the training dataset may consist of data from several different distributions. Thus the model needs continuous monitoring and updating whenever a change is detected. We have developed an incremental algorithm for learning model trees to address these issues, named Fast Incremental Model Trees with Drift Detection (FIMT-DD). The algorithm starts with an empty leaf and reads examples in the order of arrival. Each example is traversed to a leaf where the necessary statistics are updated. Given the first portion of instances, the algorithm finds the best split for each attribute, and then ranks the attributes according to some evaluation measure. If the splitting criterion is satisfied it makes a split on the best attribute, creating two new leafs, one for each branch of the split. Upon arrival of new instances to a recently created split, they are passed down along the branches corresponding to the outcome of the test in the split for their values. Change detection tests are updated with every example from the stream. 
If a change is detected, an adaptation of the tree structure is performed.
Splitting criterion
In the literature, several authors have studied the problem of efficient feature, attribute, or model selection over large databases. The idea was first introduced by Musick et al. (1993) under the name of decision-theoretic sub-sampling, with an immediate application to speeding up the basic decision tree induction algorithm. One of the solutions they propose, which is relevant for our work, is the utilization of the Hoeffding bound (Hoeffding 1963) in the attribute selection process in order to decide whether the best attribute can be confidently chosen on a given subsample.
Numerical attributes
The efficiency of the split selection procedure is highly dependent on the number of possible split points. For numerical attributes with a large number of distinct values, both memory and computational costs can be very high. The common approach in the batch setting is to perform a preprocessing phase, typically partitioning the range of numerical attributes (discretization). This requires an initial pass over the data prior to learning, as well as sorting operations. Preprocessing is not an option with streaming data, and sorting can be very expensive. The range of possible values for numerical attributes is also unknown and can vary in the case of sampling shift. For classification tasks on data streams, a number of interesting solutions have been proposed: on-line discretization with a pre-specified number of bins (Domingos and Hulten 2000), Gaussian-based methods for two-class problems (Gama et al. 2004), an equal-width adaptation to multi-class problems (Pfahringer et al. 2008), and an exhaustive method based on binary search trees (Gama et al. 2003). They are either sensitive to skewed distributions or appropriate only for classification problems. We have developed a time-efficient method for handling numerical attributes based on an E-BST structure, which is an adaptation of the exhaustive method proposed by Gama et al. (2003), tailored for regression trees.
Linear models in leaves
Existing batch approaches compute the linear models either in the pruning phase or in the growing phase. In the latter approach, the algorithms need to perform the heavy computations necessary for maintaining pre-computed linear models for every possible split point. While efforts have been made to reduce the computational complexity, we observe that none of the proposed methods would be applicable when dealing with high-speed data streams described by many numerical attributes with large domains of unique values. For this reason, we propose a very lightweight method for inducing linear models, based on the idea of on-line training of perceptrons. The trained perceptrons represent the linear models fitted separately in each subspace of the instance space. An important difference between our proposed method and the batch ones is that the process of learning linear models in the leaves does not explicitly reduce the size of the regression tree. The split selection process is invariant to the existence of linear models in the leaves. However, if the linear model fits the examples assigned to the leaf well, no further splitting is necessary and pre-pruning can be applied. The basic idea is to train perceptrons in the leaves of the tree by updating the weights after each consecutive example. We use the simplest approach: no attribute selection is performed.
All the numerical attributes are included in the regression equation which is represented by a perceptron without an activation function. The weights of the links are the parameters of the linear equation. Change detection When local concept drifts occur, most of the existing methods discard the whole model simply because its accuracy on the current data drops. Despite the drop in accuracy, parts of the model can still be good for the regions not affected by the drift. In such situations, we propose to update only the affected parts of the model. An example of a system that possesses this capability is the CVFDT system (Hulten et al. 2001). In CVFDT, splitting decisions are repeatedly re-evaluated over a window of most recent examples. This approach has a major drawback: maintaining the necessary counts for class distributions at each node requires a significant amount of additional memory and computation (especially when the tree becomes large). We address this problem by using a lightweight on-line change detection test for continuous signals. Discussion of the algorithm design choices The FIMT-DD algorithm is based on a compromise between the accuracy achieved by a model tree and the time required to learn the model tree. It therefore offers approximate solutions in real-time. For making splitting decisions, any method can be used that has high confidence in choosing the best attribute, given the observed data. The Hoeffding bound was chosen due to its nice statistical properties and the independence of the underlying data distribution. The growing process is stable because the splitting decisions are supported statistically, so the risk of overfitting is low. This is an advantage over batch algorithms, where splits in the lower levels of the tree are chosen using smaller subsets of the data. To ensure the any-time property of the model tree, we chose perceptrons as linear models in the leaves. This approach does not reduce the size of the model tree, but improves its accuracy by reducing the bias as well as the variance component of the error. The choice of the change detection mechanism was supported by three arguments: the method is computationally inexpensive, performs explicit change detection, and enables local granular adaptation. Change detection requires the setting of several parameters, which enable the user to tune the level of sensitivity to changes and the robustness. Annex d. Conclusions In this section, we presented an algorithm for learning model trees from time-changing data streams. To the best of our knowledge, FIMT-DD is the first algorithm for learning model trees from time-changing data streams with explicit drift detection. The algorithm is able to learn very fast (in a very short time per example) and the only memory it requires is for storing sufficient statistics at tree leaves. The model tree is available for use at any time in the course of learning, © FIRST consortium Page 42 of 43 D2.3 offering an excellent processing and prediction time per example. In terms of accuracy, the FIMT-DD is competitive with batch algorithms even for medium sized datasets and has smaller values for the variance component of the error. It effectively maintains an up-to-date model even in the presence of different types of concept drifts. The algorithm enables local change detection and adaptation, avoiding the costs of re-growing the whole tree when only local changes are necessary. © FIRST consortium Page 43 of 43