Eindhoven University of Technology
Department of Mathematics and Computer Science
Architecture of Information Systems Research Group

Realizing a Process Cube Allowing for the Comparison of Event Data

Master Thesis
Tatiana Mamaliga

Supervisors:
prof. dr. ir. W.M.P. van der Aalst
MSc J.C.A.M. Buijs
dr. G.H.L. Fletcher

Final version
Eindhoven, August 2013

Contents

1 Introduction
  1.1 Context
  1.2 Challenges - Then & Now
  1.3 Assignment Description
  1.4 Approach
  1.5 Thesis Structure

2 Preliminaries
  2.1 Business Intelligence
  2.2 Process Mining
    2.2.1 Concepts and Definitions
    2.2.2 ProM Framework
  2.3 OLAP
    2.3.1 Concepts and Definitions
    2.3.2 The Many Flavors of OLAP

3 Process Cube
  3.1 Process Cube Concept
  3.2 Process Cube by Example
    3.2.1 From XES Data to Process Cube Structure
    3.2.2 Applying OLAP Operations to the Process Cube
    3.2.3 Materialization of Process Cells
  3.3 Requirements
  3.4 Comparison to Other Hypercube Structures

4 OLAP Open Source Choice
  4.1 Existing OLAP Open Source Tools
  4.2 Advantages & Disadvantages
  4.3 Palo - Motivation of Choice

5 Implementation
  5.1 Architectural Model
  5.2 Event Storage
  5.3 Load/Unload of the Database
  5.4 Basic Operations on the Database Subsets
    5.4.1 Dice & Slice
    5.4.2 Pivoting
    5.4.3 Drill-down & Roll-up
  5.5 Integration with ProM
  5.6 Result Visualization

6 Case Study and Benchmarking
  6.1 Evaluation of Functionality
    6.1.1 Synthetic Benchmark
    6.1.2 Real-life Log Data Example
  6.2 Performance Analysis
  6.3 Discussion

7 Conclusions & Future Work
  7.1 Summary of Contributions
  7.2 Limitations
    7.2.1 Conceptual Level
    7.2.2 Implementation Level
  7.3 Further Research

Abstract

Continuous efforts to improve processes require a deep understanding of their inner workings.
In this context, the process mining discipline aims at discovering process behavior from historical records, i.e., event logs. Process mining results can be used for the analysis of process dynamics. However, mining realistic event logs is difficult due to complex interdependencies within a process. Therefore, to gain more in-depth knowledge about a certain process, it can be split into subprocesses, which can then be separately analysed and compared. Typical tools for process mining, e.g., ProM, are designed to handle a single event log at a time, which does not particularly facilitate the comparison of multiple processes. To tackle this issue, Van der Aalst proposed in [4] to organize the event log in a cubic data structure, called the process cube, with a selection of the event attributes forming the dimensions of the cube. Although multidimensional data structures are already employed in various business intelligence tools, the data they handle has a static character. This is in stark contrast to process mining, since event data characterizes a dynamic process that evolves in time. The aim of this thesis is to develop a framework that supports the construction of the process cube and permits multidimensional filtering on it, in order to separate subcubes for further processing. We start from the OLAP foundation and reformulate its corresponding operations for event logs. Moreover, the semantics of a traditional OLAP aggregate are changed: numerical aggregates are substituted by sublog data. With these adjustments, a tool is developed and integrated as a plug-in in ProM to support the aforementioned operations on event logs. The user can unload sublogs from the process cube, pass them as parameters to other plug-ins in ProM and visualize different results simultaneously. During the development of the tool, we had to deal with a shortcoming of multidimensional database technologies when storing event logs, i.e., the sparsity of the resulting process cube.
Sparsity in multidimensional data structures occurs when a large number of cells in a cube are empty, i.e., there are missing data values at the intersection of dimensions. Taking a single attribute of an event log as a dimension in the process cube results in a very sparse multidimensional data structure. As a result, the computational time required to unload a sublog for processing increases dramatically. This shortcoming was addressed by designing a hybrid database structure that combines a high-speed in-memory multidimensional database with a sparsity-immune relational database. Within this solution, only a subset of the event attributes actually contributes to the construction of the process cube, whereas the rest are stored in the relational database and used only for event log reconstruction. The hybrid database solution proved to provide the flexibility needed for real-life logs, while keeping response times acceptable for efficient user interaction. The applicability of the tool was demonstrated using two event logs: a synthetic event log and a real-life event log from the CoSeLoG project. The thesis concludes with a detailed loading and unloading performance analysis of the developed hybrid structure, for different database configurations.

Keywords: event log, relational database, in-memory database, OLAP, process mining, visualization, performance analysis

Chapter 1 Introduction

The greatest challenge to any thinker is stating the problem in a way that will allow a solution.
Bertrand Russell, British author, mathematician and philosopher (1872-1970)

This thesis completes my graduation project for the Computer Science and Engineering master at Eindhoven University of Technology (TU/e). The project was conducted in the Architecture of Information Systems (AIS) group. The AIS group has a distinct research reputation and is specialized in process modeling and analysis, process mining and Process-Aware Information Systems (PAIS).
The process mining field, detailed further in this chapter, provides valuable analysis techniques and tools, but also faces a series of challenges. The main issues are large data streams and rapid changes over time. This project creates a proof-of-concept prototype, which takes the so-called process cube concept as a starting point for possible solutions to the above-mentioned challenges. The outcome is further used for the visual comparison of event data. This chapter describes the assignment within its scientific context. Section 1.1 provides the research background. Section 1.2 enumerates the most important advances in process mining and identifies the current issues in the field. Section 1.3 specifies the problem and the project objectives. Section 1.4 continues with a short summary of the problem solution. Finally, Section 1.5 provides an overview of the remaining chapters of the thesis.

1.1 Context

Technology has become an integral part of any organization. For example, current systems and installations are heavily controlled and monitored remotely by integrated internet technologies [23]. Moreover, employing automated solutions in any line of business has become a trend. As a result, Enterprise Systems software, offering a seamless integration of all the information flowing through a company [22], is used in any modern organization. Enterprise Information Systems (EIS) keep businesses running, improve service times and thus attract more clients. Still, as in every complex system, there are multiple points where things can go wrong. System errors, fraud, security issues and inefficient distribution of tasks are just a few to mention. To cope with these issues, EIS had to extend their function-oriented enterprise applications with Business Intelligence (BI) techniques. That is, BI applications have been installed to support management in measuring a company's performance and deriving appropriate decisions [39].
Among the most important functions of BI are online analytical processing (OLAP), data mining, business performance management and predictive analytics. Being aware of the existing problems in an organization and applying standardized solutions to them is usually not enough. Consider a doctor who always prescribes painkillers independent of the patient's complaints. Of course, such pills will temporarily relieve the pain, but they will not treat the real disease. A good doctor should run tests, identify the root causes of the health problem and only then give an adequate treatment. This is what the process mining field tries to accomplish. It goes beyond analyzing merely individual data records and focuses on the underlying process which glues event data together. A deep understanding of the inner workings of a process can point to notorious deviations, persistent bottlenecks and unnecessary rework. All in all, technology has a major impact on organizations and has proved to be an enabler for business process improvement. Therefore, by means of business intelligence, and process mining in particular, new opportunities are constantly exploited to keep pace with challenges such as change.

1.2 Challenges - Then & Now

In the context of today's rapidly changing environment, organizations are looking for new solutions to keep their businesses running efficiently. Slogans such as "Driving the Change" (Renault), "Changes for the Better" (Mitsubishi Semiconductor), "Empowering Change" (Credit Suisse First Boston) and "New Thinking. New Possibilities" (Hyundai) are used more and more often. Furthermore, different areas of business research are trying to keep up with change, and process mining is not an exception. In 2011, the Process Mining Manifesto [7] was released to describe the state of the art in process mining on the one hand, and its current challenges on the other hand.
A year later, the project proposal "Mining Process Cubes from Event Data (PROCUBE)" [4] suggested the so-called process cube as a solution direction for some of these challenges. In the context of currently employed process mining solutions and using the Process Mining Manifesto as a reference, the PROCUBE project proposal presents several challenges that process mining is currently facing:

From "small" event data to "big" event data. Due to increased storage capacity and advanced technologies, the vast amount of available event data has become difficult to control and analyse. Most of the traditional process mining techniques operate on event logs whose size does not exceed several thousand cases and a couple of hundred thousand events (for example, the BPI Challenge [2] files). However, nowadays corporations work on a different scale of event logs. Giants like Royal Dutch Shell, Walmart and IBM would rather consider millions of events (per day or even per second), and this number will continue to grow. Ways to ensure that event data growth will not affect the relevance of process mining techniques are constantly sought.

From homogeneous to heterogeneous processes. With the increasing complexity of an event log, chances are that the variability in its corresponding process increases as well. For example, events in an event log can present different levels of abstraction. However, many mining techniques assume that all events in an event log are logged at the same level of abstraction. In that sense, the diverse event log characteristics have to be properly considered.

From one to many processes. Many companies have their agencies spread across the globe. Let's take SAP AG as an example. Its research and development units alone are located on four continents, but it has regional offices all around the world. That is, SAP units are executing basically the same set of processes. Still, this does not exclude possible variations.
For instance, there might be various influences due to the characteristics of a certain SAP distribution region (Germany, India, Brazil, Israel, Canada, China, and others). Traditional process mining is oriented toward stand-alone business processes. However, it is of great importance to be able to compare the business processes of different organizations (or units of an organization). For example, efficient and less efficient paths in different processes can be identified. Inefficient paths can be substituted, and efficient paths can be applied to the rest of the processes to improve performance.

From steady-state to transient behavior. Change has a major impact not only on the size of event logs and on the necessity of dealing with many processes together, but also on the state of a business process. For example, companies should be able to quickly adjust to different business requirements. As a result, their corresponding processes undergo different modifications. Current process mining techniques assume business processes to be in a steady state [5]. However, it is important to understand the changing nature of a process and to react appropriately. The notion of concept drift was introduced in process mining [33] to capture these second-order dynamics. Its target is to discover and analyze the dynamics of a process by detecting and adapting to change patterns in the ongoing work.

From offline to online. As previously mentioned, systems produce an overwhelming amount of information. The idea of storing it as historical event data for later analysis, as is currently done, may not seem as appealing anymore. Instead, the emphasis should be more on the present and the future of an event. That is, an event should be analysed on-the-fly and predictions about its occurrence should be made based on existing historical data. As such, online analysis of event data is yet another process mining challenge.
Each of the issues discussed above is extremely challenging. Analysing large-scale event logs is difficult with the current process mining techniques. Solutions to mitigate some of the issues that appear when dealing with large-scale event logs are proposed in [14], e.g., event log simplification and dealing with less-structured processes. A framework for time-based operational support is described in [8]. In [16], an approach is offered to compare collections of process models corresponding to different Dutch municipalities. Nevertheless, there is still the need for more elaborate solutions and a unified way of approaching them.

1.3 Assignment Description

Stand-alone process analysis is the common way of analysing processes in today's process mining approaches. However, inspecting a process as a single entity impedes observing differences and similarities with other processes. Let's take a simple example from the airline industry. There is a constant discussion about which of the low-cost airlines, Ryanair or Wizzair, offers better services. There are both advantages and disadvantages to traveling with either of the two. Generally, Ryanair is considered more punctual than Wizzair [1]. To determine why Ryanair is more on time with its flights than Wizzair, we compare their processes. We notice that while at Wizzair the luggage is checked only once, Ryanair is very strict with the luggage procedure and checks it twice before embarking. As a result, passengers and crew are not busy with "fitting" luggage that does not fit, and the hallway of the aircraft is kept free for new passengers arriving on board. By minimizing the turnaround time, the airline's punctuality improves. The luggage-checking procedure may not be the only factor that improves Ryanair's punctuality, but it is clear from the comparison of the two airline processes that it contributes to reducing flight delays.
In conclusion, the comparison of the two processes helped in answering a specific question and in identifying parts of these processes that can be further improved. When it comes to the comparison of large processes, it is difficult to inspect them entirely at a glance. Splitting and merging different parts of a process can offer more insightful details. Let's consider the following scenario. In the car manufacturing process, there is a final polishing inspection step. Several resources check whether there is a scratch on a car that needs to be polished. During the last two weeks, it was noticed that one polishing crew worked slower than the others. To identify the cause of this issue, the car manufacturing process is analysed. First, the process is split by department type and the polishing department is selected. Then, only the process corresponding to the resources of this specific crew is isolated. The following aspects are inspected: the car type, the engine type and the color type. When filtering by car type and engine type, there seem to be no patterns indicating a potential delay. However, when inspecting the subprocesses corresponding to different car colors, a pattern emerges. The average working time for polishing a red car is much higher than for polishing cars of a different color. Since red cars take, in general, more time to be polished than other cars, this indicates that there is a problem in the painting department. The red-colored cars are not painted properly and therefore need constant polishing. While at the beginning it seemed that the crew was responsible for the delays, in fact the crew members were just polishing more red-colored cars. Since red-colored cars required more polishing due to a painting issue, the crew worked slower compared to the other crews. Without filtering the initial process, it would have been difficult to identify such detailed problems.

[1] http://www.flightontime.info/scheduled/scheduled.html
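The filtering steps of this scenario can be sketched in a few lines of code. The sketch below is purely illustrative: the event attributes (department, crew, color, duration) and the data are invented for this example and do not come from a real log.

```python
# Hypothetical sketch of the polishing scenario: events are plain dicts,
# and filtering means keeping only the events that match given attributes.
from collections import defaultdict

events = [
    {"department": "polishing", "crew": "A", "color": "red",  "duration": 40},
    {"department": "polishing", "crew": "A", "color": "blue", "duration": 15},
    {"department": "polishing", "crew": "A", "color": "red",  "duration": 45},
    {"department": "polishing", "crew": "B", "color": "blue", "duration": 14},
    {"department": "painting",  "crew": "A", "color": "red",  "duration": 30},
]

def slice_events(log, **criteria):
    """Keep only events whose attributes match all given criteria."""
    return [e for e in log if all(e.get(k) == v for k, v in criteria.items())]

# Step 1: split by department and isolate the suspect crew.
crew_a = slice_events(events, department="polishing", crew="A")

# Step 2: group the sublog by color and compare average working times.
by_color = defaultdict(list)
for e in crew_a:
    by_color[e["color"]].append(e["duration"])
averages = {color: sum(d) / len(d) for color, d in by_color.items()}
print(averages)  # red cars take noticeably longer than blue ones
```

The same drill-down could, of course, be repeated for car type or engine type; the point is that each filter preserves the original event data and only narrows the view.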
Taking into consideration the discussion above, the goal of this master project can be defined as follows:

GOAL: Create a proof-of-concept tool to allow the comparison of multiple processes.

In other words, the aim is to support integrated analysis of multiple processes, while examining different views of a process. Together with the main goal, there are some other targets: filtering processes while preserving the initial dataset, merging different parts of a process, visualizing process mining results simultaneously and placing them next to each other to facilitate comparison. In the following, we present the approach we propose to reach the enumerated objectives.

1.4 Approach

Figure 1.1: The process cube. Concept proposed in the PROCUBE project.

To accomplish the goal, we base our approach on the process cube concept, introduced in [4] and shown in Figure 1.1. A process cube is a structure composed of process cells. Each process cell (or collection of cells) can be used to generate an event log and derive process mining results [4]. Note that traditional process mining algorithms are always applied to a specific event log without systematically considering the multidimensional nature of event data. In this project, the process cube is materialized as an online analytical processing (OLAP) hypercube structure. Besides the built-in multidimensional structure, one can benefit from the functionality of the OLAP operations and hopefully from the good performance of OLAP implementations. Transactional databases are designed to store and clean data, but are not tailored towards analysis. OLAP, on the other hand, is chosen here to harbor complex event data for further process analysis, in view of its analysis-optimized databases and its specialized "drilling" operations. Organizing event data in OLAP multidimensional structures makes it easy to retrieve event data and to choose a perspective from which to view it.
There are also many ways to divide event data, e.g., one can always drill down and up in the multidimensional structure and inspect event data at different granularity levels. Finally, the retrieved event data can be used to obtain different process-related characteristics, e.g., process models, that can be further analysed and compared. There are, however, some challenges with respect to this approach, mainly due to the fact that OLAP does not handle event data, but enterprise data:

• Only the aggregation of large collections of numerical data is supported by OLAP tools.
• Process-related aspects are entirely missing in the OLAP framework.
• Overlapping cell (event) classes are not possible in OLAP cubes.

Figure 1.2: Master Project Scope.

Nevertheless, adjustments can be made to OLAP tools to accommodate the process cube requirements. The approach considers several steps, shown also in Figure 1.2. First, event logs are introduced among the OLAP data sources. Hence, it becomes possible to load XES event logs into the OLAP database. Second, the process cube is created to support the materialization of an event log. Moreover, the process cube is designed to allow the visualization of cells with overlapping event data. Finally, different process mining results can be produced for any section of the cube and further exported as images. The materialization of the process cube as an OLAP cube allows us to define our objective even more precisely: the goal is to create a proof-of-concept tool that exploits OLAP features to accommodate process mining solutions such that the comparison of multiple processes is possible.

1.5 Thesis Structure

To describe the approach, the master thesis is structured as follows:

Present a literature study on employed concepts and technologies (Chapter 2) Concepts from the process mining and business intelligence fields will be introduced. Then, a discussion on the implemented OLAP and database technologies will follow.
Elaborate on process cube functionality (Chapter 3) The process cube notion will be clearly defined together with its structure. The requirements needed to attain the envisioned process cube functionality will be listed.

Explain the Palo software choice (Chapter 4) Based on the requirements from Chapter 3, a collection of technological solutions that could support the process cube structure is generated. After analyzing the pros and cons of each solution, the choice of the Palo OLAP server is described and motivated.

Recall the most relevant implementation steps (Chapter 5) After presenting the architecture of the project, the implementation steps are described. The main functionality consists of: loading/unloading an XES file into/from the in-memory database, enabling the adjusted OLAP operations on event logs and visualizing process mining results.

Report on the testing process and on the system test results (Chapter 6) The functionality of the software is tested and its performance is evaluated for different event logs and process cubes.

Conclude with general remarks on the project (Chapter 7) The thesis concludes with a series of comments and observations on both the implemented solution and further research possibilities.

Chapter 2 Preliminaries

2.1 Business Intelligence

Business Intelligence (BI) incorporates all technologies and methods that aim at providing actionable information that can be used to support decision making. An alternative definition states that BI systems combine data gathering, data storage, and knowledge management with analytical tools to present complex internal and competitive information to planners and decision makers [41]. All in all, BI represents a mixture of multiple disciplines (e.g., data warehousing, data mining, OLAP, process mining, etc.), as shown in Figure 2.1, all with the same main goal of turning raw data into useful and reliable information for further business improvements.
Figure 2.1: BI - a confluence of multiple disciplines.

Even though they are presented here as totally separate disciplines, there are various attempts to interconnect some of them to obtain more powerful analysis results. For example, data mining is integrated with OLAP techniques [31, 45]. Data warehousing and OLAP technologies are more and more used in conjunction [13, 18]. Of the above-mentioned BI disciplines, process mining and OLAP are detailed in Section 2.2 and Section 2.3, as being particularly relevant for this project.

2.2 Process Mining

2.2.1 Concepts and Definitions

The idea of process mining is to discover, monitor and improve real processes (i.e., not assumed processes) by extracting knowledge from event logs readily available in today's systems [3]. The content and the level of detail of a process description depend on the goal of the conducted process mining project and the employed process mining techniques. The set of real executions is fixed and is given by the event data from an existing event log. There are basically three types of process mining projects [3]. The goal of the first, the data-driven process mining project, is to arrive at a process description that is as detailed as possible, without necessarily having a specific question in mind. This can be accomplished in two ways: by a superficial analysis covering multiple process perspectives, or by an in-depth analysis of a limited number of aspects. The second, the question-driven process mining project, aims at obtaining a process description from which an answer to a concrete question can be derived. A possible question can be: "How does the decision to increase the duration of handling an invoice influence the process?" The third type, the goal-driven process mining project, consists of looking for weaker parts in the resulting process description that can be considered for improving a specific aspect, e.g., better response times.
Figure 2.2: Process mining: discovery, conformance, enhancement.

Establishing the type of process mining project to conduct is followed by choosing the relevant process mining techniques to apply to the event log. Process mining comes in three flavors: discovery, conformance and enhancement. Figure 2.2 [1] shows these three main process mining categories. Discovery techniques take the event log as input and return the real process as output. Conformance checking techniques check whether reality, as recorded in the log, conforms to the model and vice versa [7]. Enhancement techniques produce an extended process model which gives additional insights into the process, e.g., existing bottlenecks. Regardless of the process mining technique, an event log is always given as input, as also shown in Figure 2.2. The content of an event log can vary greatly from process to process. Nevertheless, there is a fixed skeleton expected to be found in any event log. Figure 2.3, from [3], presents the structure of an event log.

Figure 2.3: Structure of event logs.

Generally, the event data in an event log correspond to a process. A process is composed of cases, or completed process instances. In turn, a case consists of events. Events should be ordered within a case. Preserving the order is important as it influences the control flow of the process. An event corresponds to an activity, e.g., register request, pay compensation. A trace represents a sequence of activities. Both events and cases are characterized by attributes, e.g., activity, time, resource, costs. The data source used for process mining is the event log. Event data from different information systems are stored in event logs. Since event logs can be recorded not only for process mining purposes (e.g., for debugging errors), there is no unique format used at creation. Handling various event log formats for process analysis is time consuming.

[1] http://www.processmining.org/research/start
Therefore, event logs need to be standardized by converting raw event data to a single event log format. One such format is MXML, which emerged in 2003. More recently, the popularity of the XES event log standard has grown. Below, we present an overview of the XES event log structure, with the details relevant for this master thesis. A more in-depth discussion of the XES format can be found in [15], and more up-to-date information on XES can be found at http://www.xes-standard.org/. Figure 2.4, taken from [29], shows the XES meta-model. Besides traces and events, with their corresponding attributes, the log object contains a series of other elements.

Figure 2.4: The XES Meta-model.

The global attributes for traces and events are usually used to quickly find the existing attributes in the XES log. The purpose of event classifiers is to assign each event to a pre-defined category. Events within the same category can be compared with the ones from another category. XES logs are also characterized by extensions. Extensions are used to resolve ambiguity in the log by introducing a set of commonly understood attributes and attaching semantics to them. Attributes have assigned values which correspond to a specific type of data. Based on the type of data, attributes can be classified into five categories: String attributes, Date attributes, Int attributes, Float attributes and Boolean attributes. These attribute types correspond to the standard XML types xs:string, xs:dateTime, xs:long, xs:double and xs:boolean. To understand the separation between required and flexible event log aspects, a formalization of the above-highlighted concepts is given. The process mining book [3] is used as a reference.

Definition 1 (Event, attribute [3]). Let E be the event universe, i.e., the set of all possible event identifiers.
Events may be characterized by various attributes, e.g., an event may have a timestamp, correspond to an activity, be executed by a particular person, have associated costs, etc. Let AN be a set of attribute names. For any event e ∈ E and name n ∈ AN: #n(e) is the value of attribute n for event e. If event e does not have an attribute named n, then #n(e) = ⊥ (null value).

Notation 1. For a given set A, A* is the set of all finite sequences over A.

Definition 2 (Case, trace, event log [3]). Let C be the case universe, i.e., the set of all possible case identifiers. Cases, like events, have attributes. For any case c ∈ C and name n ∈ AN: #n(c) is the value of attribute n for case c (#n(c) = ⊥ if case c has no attribute named n). Each case has a special mandatory attribute trace: #trace(c) ∈ E*. (In the remainder, we assume #trace(c) ≠ ⟨⟩, i.e., traces in a log contain at least one event.) ĉ = #trace(c) is a shorthand for referring to the trace of a case. A trace is a finite sequence of events σ ∈ E* such that each event appears only once, i.e., for 1 ≤ i < j ≤ |σ|: σ(i) ≠ σ(j). For any sequence δ = ⟨a1, a2, ..., an⟩ over A, δset = {a1, a2, ..., an}. δset converts a sequence into a set, e.g., δset(⟨d, a, a, a, a, a, a, d⟩) = {a, d}. a is an element of δ, denoted a ∈ δ, if and only if a ∈ δset. An event log is a set of cases L ⊆ C such that each event appears at most once in the entire log, i.e., for any c1, c2 ∈ L such that c1 ≠ c2: δset(ĉ1) ∩ δset(ĉ2) = ∅.

2.2.2 ProM Framework

A large number of algorithms are produced as a result of process mining research. Ranging from algorithms that provide just a helicopter view on the process (Dotted Chart) to ones that give an in-depth analysis (LTL Checker), many of them are implemented in the ProM Framework in the form of plugins.

Figure 2.5: ProM Framework Overview.

Figure 2.5, based on [24], shows an overview of the ProM Framework. It includes the main types of ProM plugins and the relations between them.
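As an illustration of Definitions 1 and 2, the following small Python sketch (our own toy encoding, not part of the thesis or of ProM; all attribute names and values are hypothetical) models events as identifiers with attribute maps and checks the trace and log constraints:

```python
from typing import Any, Optional

# #n(e): attribute tables, event id -> {attribute name -> value} (toy data)
event_attrs: dict[int, dict[str, Any]] = {
    1: {"activity": "register request", "resource": "John"},
    2: {"activity": "pay compensation", "resource": "Mary"},
}

def attr(e: int, n: str) -> Optional[Any]:
    """#n(e): the value of attribute n for event e, or None (the null value)."""
    return event_attrs.get(e, {}).get(n)

def is_valid_trace(sigma: list[int]) -> bool:
    """A trace is a finite sequence in which each event appears only once."""
    return len(sigma) == len(set(sigma))

def is_valid_log(cases: dict[str, list[int]]) -> bool:
    """In an event log, each event appears at most once in the entire log."""
    seen: set[int] = set()
    for trace in cases.values():
        if not is_valid_trace(trace) or seen & set(trace):
            return False
        seen |= set(trace)
    return True
```

For example, attr(1, "activity") yields "register request" and attr(1, "cost") yields None, while is_valid_log({"c1": [1], "c2": [1, 2]}) is False because event 1 occurs in two cases.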
Before applying any mining technique, an event log can be filtered using a Log filter. Further, the filtered event log can be mined using a Mining plugin and then stored as a Frame result. The Visualization engine ensures that frame results can be visualized. A (filtered) event log, but also different models, e.g., Petri nets, LTL formulas, can be loaded into ProM using an Import plugin. Both the Conversion plugin and the Analysis plugin use mining results as input. While the first plugin is specialized in converting the result to a different format, the second plugin is focused on the analysis of the result. The ProM framework includes five types of process mining plugins, as shown in Figure 2.5:

• Mining plugins - mine models from event logs.
• Analysis plugins - implement property analysis on a mining result.
• Import plugins - allow import of objects, e.g., Petri nets, LTL formulas, etc.
• Export plugins - allow export of objects to various formats, e.g., EPC, Petri net, DOT, etc.
• Conversion plugins - make conversions between different data formats, e.g., from EPC to Petri net.

Figure 2.6: Examples of process mining plugins: Log Dialog and Dotted Chart (helicopter view), Fuzzy Miner (discovery), Social Networks based on Working Together (organizational perspective).

Figure 2.6 presents some examples of plugins in ProM: the Log Dialog, the Dotted Chart, the Fuzzy Miner [30] and the Working Together Social Network [9]. There are, however, more than 400 plugins available in ProM 6.2, covering a wide spectrum.
Plugin objectives can vary from providing process information at a glance, e.g., Log Dialog, Dotted Chart, to providing automated process discovery, e.g., Heuristics Miner [53] and Fuzzy Miner, and offering detailed analysis for the verification of process models, e.g., Woflan analysis, for performance aspects, e.g., Performance Analysis with Petri net, and for the organizational perspective, e.g., Social Network miner.

2.3 OLAP

2.3.1 Concepts and Definitions

On-Line Analytical Processing (OLAP) is a method to support decision making in situations where raw data on measures such as sales or profit needs to be analysed at different levels of statistical aggregation [42]. Introduced in 1993 by Codd [20] as a more generic name for "multidimensional data analysis", OLAP embraces the multidimensionality paradigm as a means to provide fast access to data when analysing it from different views.

Figure 2.7: Traditional OLAP cube. At the intersection of the three dimensions: regions, time and sales information, an aggregate (e.g., profit margin %) can be derived. Both the time and regions dimensions contain a hierarchy (e.g., 2012Jan, 2012Feb, 2012Mar are months of 2012).

In comparison with its On-Line Transactional Processing (OLTP) counterpart, OLAP is optimized for analysing data, rather than for storing data originating from multiple sources while avoiding redundancy. Therefore, OLAP is mostly based on historical data, i.e., data that can be aggregated, and not on instantaneous data, which is quite challenging to analyse, sort, group or compare "on-the-fly". Multidimensional data analysis is possible due to a multidimensional fact-based structure, called an OLAP cube. An OLAP cube is a specialized data structure that stores data in a way optimized for analysis. Figure 2.7 presents the traditional OLAP cube structure. Designed to support enterprise data analysis, an OLAP cube is usually built around a business fact.
A fact describes an occurrence of a business operation (e.g., a sale), which can be quantified by one or more measures of interest (e.g., the total amount of the sale, sales cost, profit margin %). Generally, the measure of interest is a real number. A business operation can be characterized by multiple dimensions of analysis (e.g., time, region, etc.). Let DAi, 1 ≤ i ≤ n, be the sets of elements of the dimensions of analysis. Then, the measure of interest MI can be defined as a function MI : DA1 × DA2 × ... × DAn → R. For example, if region, time and sales are the dimensions of analysis, as in Figure 2.7, then MI(Germany, 2012Mar, ProfitMargin) = 11. Moreover, elements of a dimension of analysis can be organized in a hierarchy, e.g., the Europe region is herein represented by countries like Netherlands, Germany and Belgium. A natural hierarchical organization can be observed among time elements. Consider the tree structure in Figure 2.8. The root of the tree is the year 2012. This element has three children: 2012Jan, 2012Feb and 2012Mar, corresponding to months. Finally, each month element has days of the week as children elements. Let Hi be the set of hierarchy elements, i.e., Hi = {2012, 2012Jan, 2012Feb, 2012Mar, 2012JanMon, 2012JanThu, ...}. The children function, children : Hi → P(Hi), returns the children elements of the argument. For example, children(2012) = {2012Jan, 2012Feb, 2012Mar}.

Figure 2.8: Example of hierarchy tree structure on the time dimension.

The allLeaves function, allLeaves : Hi → P(Hi), returns all leaf elements of the subtree with the function argument as root node. For example, allLeaves(2012) = {2012JanMon, 2012JanThu, 2012FebWed, 2012MarTue, 2012MarFri}. Note that a hierarchy is an undirected graph in which any two nodes are connected by a simple path (i.e., a tree), with the following property: for any node h ∈ Hi and any two children h1, h2 ∈ children(h), allLeaves(h1) ∩ allLeaves(h2) = ∅.
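The children and allLeaves functions can be sketched directly over the hierarchy of Figure 2.8 (a minimal Python illustration of ours; member names follow the text):

```python
# The time hierarchy of Figure 2.8 as a child map (leaves have no entry)
children_map: dict[str, list[str]] = {
    "2012": ["2012Jan", "2012Feb", "2012Mar"],
    "2012Jan": ["2012JanMon", "2012JanThu"],
    "2012Feb": ["2012FebWed"],
    "2012Mar": ["2012MarTue", "2012MarFri"],
}

def children(h: str) -> set[str]:
    """children(h): the direct children of hierarchy element h."""
    return set(children_map.get(h, []))

def all_leaves(h: str) -> set[str]:
    """allLeaves(h): all leaves of the subtree rooted at h (a leaf is its own leaf)."""
    kids = children(h)
    if not kids:
        return {h}
    return set().union(*(all_leaves(k) for k in kids))
```

Here all_leaves("2012") returns the five leaf members listed above, and for any two siblings the leaf sets are disjoint, matching the stated tree property.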
Dimensions of analysis, hierarchies and measures of interest can be used to construct an OLAP cube, like the one in Figure 2.7. Dimensions of an OLAP cube are defined by CD = D1 × D2 × ... × Dn. For any 1 ≤ i ≤ n, Di ⊆ Hi is the set of dimension elements. Hierarchies are defined by CH = H1 × H2 × ... × Hn. For example, the time dimension contains elements from the hierarchy shown in Figure 2.8. Let D1 be the cube dimension corresponding to time; then a possible content of D1 is {2012Jan, 2012Feb, 2012Mar}. It is not necessary for a dimension to contain all the hierarchy elements. Together with dimensions, hierarchies are elements of an OLAP cube structure CS = (CD, CH). Measures of interest are functions specific for the dimensions of analysis. For the dimensions of the cube, the aggregate function CA, CA : H1 × H2 × ... × Hn → R, is used as an equivalent of a measure of interest. The only difference is that aggregates can be computed from multiple measure of interest results or from other aggregates. For example, the aggregate sales cost for the entire month 2012Jan is the sum of the measure of interest results corresponding to 2012JanMon and 2012JanThu. To make the reasoning in terms of OLAP more precise and to strengthen the understanding of various cube-related concepts, we provide a formalization of the core OLAP notions. An OLAP cube presents a multidimensional view on data from different sides (dimensions). Each dimension consists of a number of dimension attributes or values, which can also be called dimension elements or members. Members in a dimension can be organized into a hierarchy and correspond, as such, to a hierarchical level. These concepts are further formalized in Definition 3.

Definition 3 (OLAP cube). Let
Di, 1 ≤ i ≤ n, be a set of dimension elements, where n is the number of dimensions,
Hi, 1 ≤ i ≤ n, be a set of hierarchy elements,
CD = D1 × D2 × ... × Dn be the cube dimensions,
CH = H1 × H2 × ...
× Hn be the cube hierarchies,
children : Hi → P(Hi) be the function returning the children of h ∈ Hi,
allLeaves : Hi → P(Hi) be the function returning all leaves of h ∈ Hi,
such that for any h ∈ Hi and h1, h2 ∈ children(h): allLeaves(h1) ∩ allLeaves(h2) = ∅,
CS = (CD, CH) be the cube structure,
CA : CH → R be the cube aggregate function.
An OLAP cube is defined as OC = (CS, CA).

Given the multidimensional structure of an OLAP cube, the risk exists of having it populated with sparse data. Sparsity appears when, at the intersection of dimensions, there is often no corresponding measure of interest and thus an empty cell. Such behavior occurs in multidimensional cubes with a large number of sparse dimensions. A dimension is considered sparse when it has a large number of members that in most cases appear only once in the original data source, and data values are missing for the majority of member combinations. On the contrary, in a dense dimension a data value exists for almost every dimension member. So far, we focused on the multidimensional structure of the OLAP cube. However, learning how to employ it is particularly interesting, as it gives a feeling of OLAP's usefulness and applicability. Therefore, we further discuss one of the main features of OLAP: the OLAP operations. In [18], Chaudhuri and Dayal enumerate among the typical OLAP operations: slice and dice for selection and projection, drill-up (or roll-up) and drill-down for data grouping and ungrouping, and pivoting (or rotation) for re-orienting the multidimensional view of data. There are also other OLAP operations, e.g., ranking, drill-across [44]. However, the operations mentioned in [18] are considered sufficient for a meaningful exploration of the data. The dice operation returns a subcube by selecting a subset of members on certain dimensions.

Definition 4 (Dice operation). Let OC = (CS, CA) and Di′ ⊆ Di for all 1 ≤ i ≤ n.
The dice operation is diceCD′(OC) = OC′, where
OC′ = (CS′, CA′), CS′ = (CD′, CH′),
CH′ = H1′ × H2′ × ... × Hn′, with Hi′ = {h ∈ Hi | ∃v ∈ Di′ : allLeaves(v) ∩ allLeaves(h) ≠ ∅},
children′ : Hi′ → P(Hi′), children′(h) = children(h) ∩ Hi′,
allLeaves′ : Hi′ → P(Hi′), allLeaves′(h) = allLeaves(h) ∩ Hi′,
such that for any h ∈ Hi′ and h1, h2 ∈ children′(h): allLeaves′(h1) ∩ allLeaves′(h2) = ∅,
CA′ : CH′ → R, CA′(h1, ..., hn) = CA(h1, ..., hn) for (h1, ..., hn) ∈ CH′.

The slice operation is a special case of the dice operation. It produces a subcube by selecting a single member on one of the dimensions.

Definition 5 (Slice operation). Let OC = (CS, CA). The slice operation is slicek,v(OC) = OC′, where 1 ≤ k ≤ n, v ∈ Dk, and OC′ = diceCD′(OC) with CD′ = D1 × ... × Dk−1 × {v} × Dk+1 × ... × Dn.

Note that an OLAP cell can be defined as an OLAP subcube obtained by slicing each of the OLAP cube dimensions. Let OC = (CS, CA). The OLAP cell is slice1,v1(slice2,v2(... (slicen−1,vn−1(slicen,vn(OC))) ...)) = OC′. By slice and dice operations, various OLAP subcubes are isolated. To make them useful for analysis purposes, the data from the cube should be visualized. Although the cube is a multidimensional structure, only two dimensions can be visualized at a time. The pivoting (or rotation) operation changes the visualization perspective of the OLAP cube by swapping two dimensions Di and Dj.

Definition 6 (Pivoting operation). Let OC = (CS, CA) with CD = D1 × D2 × ... × Di × ... × Dj × ... × Dn and CH = H1 × H2 × ... × Hi × ... × Hj × ... × Hn. The pivoting operation is pivoti,j(OC) = OC′, where 1 ≤ i, j ≤ n, OC′ = (CS′, CA′), CS′ = (CD′, CH′), CD′ = D1 × D2 × ... × Dj × ... × Di × ... × Dn, CH′ = H1 × H2 × ... × Hj × ... × Hi × ...
× Hn,
children′ : Hi′ → P(Hi′), children′(h) = children(h),
allLeaves′ : Hi′ → P(Hi′), allLeaves′(h) = allLeaves(h),
such that for any h ∈ Hi′ and h1, h2 ∈ children′(h): allLeaves′(h1) ∩ allLeaves′(h2) = ∅,
CA′ : CH′ → R, CA′(h1, ..., hj, ..., hi, ..., hn) = CA(h1, ..., hi, ..., hj, ..., hn) for (h1, ..., hj, ..., hi, ..., hn) ∈ CH′.

The roll-up operation consolidates some of the elements of a dimension into one element, which corresponds to a hierarchically superior level.

Definition 7 (Roll-up operation). Let OC = (CS, CA) and v ∈ Hk, where 1 ≤ k ≤ n. The roll-up operation is rollupk,v(OC) = OC′, where OC′ = (CS′, CA) with CS′ = (CD′, CH), and CD′ = D1 × ... × Dk−1 × ((Dk \ children(v)) ∪ {v}) × ... × Dn.

The drill-down operation refines a member of a dimension into a set of members, corresponding to a hierarchically inferior level.

Definition 8 (Drill-down operation). Let OC = (CS, CA) and v ∈ Dk, where 1 ≤ k ≤ n. The drill-down operation is drilldownk,v(OC) = OC′, where OC′ = (CS′, CA) with CS′ = (CD′, CH), and CD′ = D1 × ... × Dk−1 × ((Dk \ {v}) ∪ children(v)) × ... × Dn.

2.3.2 The Many Flavors of OLAP

Before the introduction of the OLAP principle, relational databases were the most widely used technology for enterprise databases. Relational databases are stable and trustworthy and can be used for storing, updating and retrieving data. However, they provide limited functionality to support user views of data. Most notably lacking was the ability to consolidate, view, and analyze data according to multiple dimensions, in ways that make sense to one or more specific enterprise analysts at any given point in time [20]. Consequently, OLAP facilities were designed to compensate for the limitations of conventional relational databases. The OLAP server functionality had to be implemented on top of an existing database technology.
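To make the dice, slice, roll-up and drill-down operations concrete, the following minimal Python sketch (our own illustration, not the thesis' implementation; hierarchies and aggregates are omitted, and the toy hierarchy data is hypothetical) represents the cube dimensions CD as a list of member sets:

```python
children_map = {"2012Mar": {"2012MarMon", "2012MarThu"}}  # toy hierarchy

def children(h: str) -> set[str]:
    return children_map.get(h, set())

def dice(cd: list[set[str]], selection: list[set[str]]) -> list[set[str]]:
    """Dice (dimension part only): keep only the selected members, D'_i subset of D_i."""
    assert all(sel <= d for sel, d in zip(selection, cd))
    return [set(sel) for sel in selection]

def slice_(cd: list[set[str]], k: int, v: str) -> list[set[str]]:
    """Slice: a dice selecting the single member v on dimension k (0-indexed here)."""
    return dice(cd, [d if i != k else {v} for i, d in enumerate(cd)])

def roll_up(cd: list[set[str]], k: int, v: str) -> list[set[str]]:
    """Roll-up: replace the children of v on dimension k by v itself."""
    return [d if i != k else (d - children(v)) | {v} for i, d in enumerate(cd)]

def drill_down(cd: list[set[str]], k: int, v: str) -> list[set[str]]:
    """Drill-down: refine member v on dimension k into its children."""
    return [d if i != k else (d - {v}) | children(v) for i, d in enumerate(cd)]
```

For example, with cd = [{"NL", "DE"}, {"2012MarMon", "2012MarThu"}], roll_up(cd, 1, "2012Mar") yields [{"NL", "DE"}, {"2012Mar"}], and drill_down reverses it.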
Relational databases were considered to be amongst the most reliable and popular types of databases [21]. Naturally, one of the proposed solutions was to add OLAP characteristics on top of a relational model. This is how the ROLAP (Relational OLAP) category came into existence. The OLAP layer provides a multidimensional view, calculation of derived data, and slice, dice and drill-down intelligence, while the relational database gives acceptable performance by employing a star-schema or snowflake data model [21, 43]. Being the most appropriate database type for OLTP due to its design, the relational database is not as good an option for OLAP [20, 25]. Even though it offers close to real-time data loading and has advantages in terms of capacity, ROLAP suffers from slow query performance and is not always efficient when aggregating large amounts of data. Instead, a multidimensional database approach was deemed more suitable [11, 54]. Known under the name of MOLAP (Multi-dimensional OLAP), this type of OLAP is created to achieve the highest possible query performance. Still, MOLAP has its own deficiencies. MOLAP works best for cubes with a limited number of sparse dimensions; sparse data within large cubes often causes performance problems. Hence, the advantages of ROLAP are the disadvantages of MOLAP and vice versa. Therefore, the HOLAP (Hybrid OLAP) version was introduced as a combination of the two, to compensate for the deficiencies of each technology [46]. HOLAP is one of the OLAP types that has gone mainstream among next-generation OLAP systems. Additional technologies, such as in-memory OLAP, are considered for speed-oriented systems. Nonetheless, depending on data characteristics (e.g., summarized, detailed), one or a combination of these technologies can be considered.
Even though multi-hybrid models (e.g., MOLAP and real-time in-memory for analysis and HOLAP for drill-through) are designed to incorporate most of the OLAP benefits, there is still no generic OLAP architecture or standard procedure to guarantee optimal performance independent of the requirements. With the growth of available memory capacity and because memory prices are decreasing with time, the feasibility of storing large databases in memory increases. As a consequence, disk-based databases are replaced more and more often with in-memory database technology. While conventional disk-based database systems (DRDB) store data on disk, main memory database systems (MMDB) [26] store and access data directly from the main physical memory. Therefore, the response times and transaction throughputs of an MMDB are considerably better than those of a disk-based database system. Obviously, a DRDB still has advantages in terms of capacity. There are very large databases that simply cannot fit in memory, e.g., a database containing NASA space data (with images). However, it is difficult for a DRDB to compete with the speed of an MMDB. That is, a database of a reasonable size stored in memory outperforms a database stored on disk.

Chapter 3
Process Cube

In Section 1.3, the goal of this master project was described as creating a proof-of-concept tool to allow comparison of multiple processes. In Section 1.4, the process cube was introduced as a means to achieve this goal. Both process mining and OLAP aspects were described in Chapter 2. Being the central component of the system, the process cube links the process mining framework to the existing OLAP technology. By storing event logs in OLAP multidimensional structures, event data can be used to obtain and compare process mining results. In this chapter, the concept of the process cube is explained in detail, together with an example that shows its functionality and a comparison with other hypercube structures.
Before proceeding with the process cube materialization in Chapter 4, a set of requirements is established and enumerated at the end of the chapter.

3.1 Process Cube Concept

In Section 2.2.1, the definitions of an event with attributes (Definition 1) and of a case with attributes (Definition 2) were given. Section 2.3.1 includes the definition of an OLAP cube (Definition 3) with its corresponding operations (Definitions 4, 5, 6, 7, 8). In this section, the process cube and process cell notions are introduced by adding event log aspects to the OLAP cube definition. For a further elaboration and formalization of the process cube concept, see the paper [6], which was published towards the end of this project.

Figure 3.1: Process Cube Concept.

Figure 3.1, taken from [4], shows relevant process cube characteristics and is therefore representative for the definitions of the different process cube concepts given below (e.g., process cube, process cell). A detailed discussion of the elements of Figure 3.1 is presented in [6]. A process cube is a multidimensional structure built from event log data in a way that facilitates further meaningful process mining analysis. A process cube is composed of a set of process cells [4], and the main difference between a process cube and an OLAP cube lies in its cell characteristics. In contrast to the OLAP cube, there is no real-valued measure of interest quantifying a business operation. While OLAP structures are designed for the analysis of business operations, the process cube aims at analyzing processes. Therefore, each dimension of analysis is composed of event attributes. Consequently, the content of a cell in the process cube changes from real numbers to events. While in OLAP the dimensions of analysis are used to populate the cube, in the case of process cubes the events of an event log are used to create the dimensions of analysis. Hence, instead of the MI function, the event members function is defined as EM : E → DA1 × ... × DAn.
Note that, to differentiate between two events with the same attributes, the event id is added as a dimension of analysis. Consequently, for each event there will be a unique combination of dimension of analysis members.

Definition 9 (Process cube). Let
Di, 1 ≤ i ≤ n, be a set of dimension elements, where n is the number of dimensions,
Hi, 1 ≤ i ≤ n, be a set of hierarchy elements,
CD = D1 × D2 × ... × Dn be the cube dimensions,
CH = H1 × H2 × ... × Hn be the cube hierarchies,
children : Hi → P(Hi) be the function returning the children of h ∈ Hi,
allLeaves : Hi → P(Hi) be the function returning all leaves of h ∈ Hi,
such that for any h ∈ Hi and h1, h2 ∈ children(h): allLeaves(h1) ∩ allLeaves(h2) = ∅,
CS = (CD, CH) be the process cube structure,
CE : CH → P(E) be the cell event function, CE(h1, h2, ..., hn) = {e ∈ E | (d1, d2, ..., dn) = CC(e), di ∈ allLeaves(hi), 1 ≤ i ≤ n}, for (h1, h2, ..., hn) ∈ CH, where CC(e) denotes the tuple of dimension of analysis members of event e (cf. the event members function EM introduced above).
A process cube is defined as PC = (CS, CE).

Note that a process cell can be defined as a subcube obtained by slicing each of the process cube dimensions. Let PC = (CS, CE). The process cell is slice1,v1(slice2,v2(... (slicen−1,vn−1(slicen,vn(PC))) ...)) = PC′. Each cell in the process cube corresponds to a set of events [4], returned by the cell event function CE. The process cube, as defined above, is a structure that does not allow overlapping of events in its cells. To allow the comparison of different processes using the process cube, a table of visualization is created. The table of visualization is used to visualize only two dimensions at a time. Multiple slice and dice operations can be performed by selecting different elements of the two dimensions. Each slice, dice, roll-up or drill-down is considered to be a filtering operation. Hence, a new filter is created with each OLAP operation. Filters are added as rows/columns in the table of visualization.
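The cell event function CE of Definition 9 can be sketched as follows (a toy Python illustration of ours; the CC mapping, event ids and hierarchy data are hypothetical, and only three dimensions are used):

```python
# CC: event id -> tuple of dimension of analysis members (toy data)
CC: dict[int, tuple[str, str, str]] = {
    101: ("gold", "John", "2012DecSun"),
    102: ("silver", "John", "2012DecSun"),
}

# allLeaves for each hierarchy element; a leaf maps to itself (toy data)
leaves: dict[str, set[str]] = {
    "gold": {"gold"}, "silver": {"silver"}, "John": {"John"},
    "2012": {"2012DecSun"}, "2012Dec": {"2012DecSun"}, "2012DecSun": {"2012DecSun"},
}

def CE(h1: str, h2: str, h3: str) -> set[int]:
    """CE(h1, h2, h3): all events whose members are leaves of (h1, h2, h3)."""
    return {e for e, (d1, d2, d3) in CC.items()
            if d1 in leaves[h1] and d2 in leaves[h2] and d3 in leaves[h3]}
```

Here CE("gold", "John", "2012") yields {101} and CE("silver", "John", "2012Dec") yields {102}; cells at distinct leaf tuples never overlap, since CC assigns each event a single member tuple.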
Note that, unlike the cells of the process cube, the cells of the table of visualization may contain overlapping events. That is because there is no restriction on selecting the same dimension members for two filtering operations. Given a process cube PC, a process model MPC is the result of a process discovery algorithm, such as the Alpha Miner, the Heuristics Miner or other related algorithms, applied to PC. However, there are various process mining algorithms whose results are not necessarily process models. Instead, they can offer other insightful process-related information. For example, the Dotted Chart Analysis provides metrics (e.g., average interval between events) related to events and their distribution over time. Likewise, process cubes are not limited to process models. Therefore, we refer to process mining results simply as models. So far, we described the process cube as a hypercube structure with a finite number of dimensions. In [4], a special process cube is presented, with three dimensions: case type (ct), event class (ec) and time window (tw). Figure 3.2, taken from [4], contains a table corresponding to a fragment of an event log. Let the event data from the event log be used to construct a process cube PC. Then, the ct, ec and tw dimensions are established as follows. The case type dimension is based on the properties of a case. For example, the case type dimension can be represented by the type of the customer, in which case the members of ct are gold and silver, i.e., D1 = {gold, silver}, H1 = D1.

Figure 3.2: Event log excerpt.

The event class dimension is based on the properties of an event. For example, ec can be represented by the resource and include, as such, the following members: D2 = {John}, H2 = D2. The time window dimension is based on timestamps. A time window can refer to years, months, days of the week, quarters or any other relevant period of time.
Due to its natural hierarchical structure, the tw dimension can be organized as a hierarchy, e.g., 2012 → 2012Dec → 2012DecSun. We consider D3 = {2012DecSun} and H3 = {2012, 2012Dec, 2012DecSun}. Let
D1 = {gold, silver}, D2 = {John} and D3 = {2012DecSun},
H1 = {gold, silver}, H2 = {John} and H3 = {2012, 2012Dec, 2012DecSun},
CD = D1 × D2 × D3 be the cube dimensions,
CH = H1 × H2 × H3 be the cube hierarchies,
for h1, h2 ∈ H3, h1 = 2012, h2 = 2012Dec: children(h1) = {2012Dec}, children(h2) = {2012DecSun}, allLeaves(h1) = {2012DecSun}, allLeaves(h2) = {2012DecSun},
CS = (CD, CH) be the process cube structure.
For h1 ∈ H1, h1 = gold: allLeaves(h1) = {gold}; for h2 ∈ H2, h2 = John: allLeaves(h2) = {John}; for h3 ∈ H3, h3 = 2012: allLeaves(h3) = {2012DecSun}. Then CE(h1, h2, h3) = {35654423}, with CC(35654423) = (gold, John, 2012DecSun). For the rest of the elements of CH, CE is defined in the same way. The process cube is defined as PC = (CS, CE).

Each process cell l can be used to discover a process model Ml. However, a process model can also be discovered from a group of cells Q (MQ), or from the entire process cube PC (MPC).

Figure 3.3: A process model discovered from an extended version of the event log in Figure 3.2 using the Alpha Miner algorithm.

Figure 3.3 shows a process model discovered from all the event data of the process cube PC. MPC is the process model discovered, using the Alpha Miner algorithm, from the set of events returned by CE. This is possible if considering the process cube as corresponding to a single cell in the table of visualization.

3.2 Process Cube by Example

In the previous section, the process cube was introduced together with a formalization of its relevant concepts. In this section, we continue by describing its functionality by means of an example.

Figure 3.4: Functionality in three steps: 1. From XES data to process cube structure. 2. Applying OLAP operations to the process cube. 3.
Materialization of process cells.

We propose a functionality-in-three-steps approach, as depicted in Figure 3.4. In the first step, the event data for this example is presented in a XES-like format. The event data is then used to construct a process cube prototype. While building the process cube, its various characteristics are clearly specified by referring to the definitions from Section 3.1. The aim of the second step is to show ways of exploring the process cube. In that sense, a range of OLAP operations (e.g., slice, dice, roll-up, drill-down, pivoting) are applied to it. As such, the process cube is prepared for the last step: the process cube analysis. More precisely, in the third step, it is described how parts of the process cube are materialized in event logs and then used to obtain process models. These models can then be compared to discover similarities and dissimilarities between their underlying processes.

3.2.1 From XES Data to Process Cube Structure

Table 3.1 contains the event data used in this example to illustrate the process cube functionality. This data is needed to build the process cube structure. In practice, explicit case ids and/or event ids may be missing. By Definition 1 and Definition 2, both events and cases are represented by unique identifiers. Therefore, when these identifiers do not exist in the original data source, they can be automatically generated when extracting the data. The definition of the process cube (Definition 9) describes the process cube as an n-dimensional structure. Thus, establishing the dimensions is an important step in the creation of a process cube. There is no unique way of deciding on the dimensions of a process cube. One possibility is to select each case attribute and event attribute as a dimension. When applied to our example, this choice leads to a process cube with 5 dimensions. Should the case id and the event id also be considered, the final structure is a 7-dimensional process cube structure.
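Generating the missing identifiers during extraction can be as simple as attaching a counter to each record (a hedged sketch of ours; the field names are hypothetical):

```python
from itertools import count

def add_ids(raw_events: list[dict]) -> list[dict]:
    """Attach a fresh, unique event id to each raw event record."""
    ids = count(1)
    return [{"event id": next(ids), **e} for e in raw_events]

log = add_ids([
    {"activity": "01 HOOFD 010", "resource": "560464"},
    {"activity": "01 HOOFD 020", "resource": "560464"},
])
```

The same scheme applies to case ids when grouping raw records into cases.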
By considering each different attribute value as a dimension member, the resulting process cube has 4 × 2 × 2 × 43 × 43 × 14 × 2 = 828,352 process cells. It is easy to notice that the case id, event id and timestamp are sparse dimensions, causing the entire process cube to be sparse. Sparsity was discussed in Section 2.3.1. Another possibility is to limit the number of dimensions to three, as suggested in [4]. Based on the case properties, the case type dimension can contain members created from both the parts and the sum leges attributes. The parts attribute specifies for what building parts a building permit can be requested, e.g., Bouw, Milieu. The sum leges attribute gives the total cost of a building permit application, e.g., 138.55, 179.8. At this point, it is important to establish a representative dimension member, as it can influence further analysis. This can be achieved, for instance, by employing data mining techniques.

Table 3.1: Event log example.

case id | parts | sum leges
1 | Bouw | 138.55
2 | Bouw | 138.55
3 | Milieu | 179.8
4 | Bouw | 138.55

event id | timestamp | activity | resource
1 | 2012-02-21T11:52:13 | 01 HOOFD 010 | 560464
2 | 2012-02-21T11:56:31 | 01 HOOFD 020 | 560464
3 | 2012-02-21T12:15:07 | 01 HOOFD 040 | 560925
4 | 2012-02-21T12:19:22 | 01 HOOFD 050 | 560464
5 | 2012-02-21T12:50:18 | 01 HOOFD 055 | 560464
6 | 2012-02-21T14:09:49 | 01 HOOFD 060 | 560925
7 | 2012-03-08T12:03:11 | 01 HOOFD 010 | 560464
8 | 2012-03-08T12:07:53 | 01 HOOFD 020 | 560464
9 | 2012-03-08T12:31:15 | 01 HOOFD 040 | 560925
10 | 2012-03-08T13:22:08 | 01 HOOFD 060 | 560925
11 | 2012-03-08T13:35:47 | 01 HOOFD 065 | 560925
12 | 2012-03-08T14:53:34 | 01 HOOFD 120 | 560925
13 | 2012-03-08T15:20:55 | 01 HOOFD 260 | 560464
14 | 2012-03-08T15:36:19 | 09 AH I 010 | 560925
15 | 2012-03-08T15:56:41 | 01 HOOFD 430 | 560925
16 | 2012-03-12T09:03:52 | 01 HOOFD 010 | 560464
17 | 2012-03-12T09:08:21 | 01 HOOFD 020 | 560464
18 | 2012-03-12T09:17:39 | 01 HOOFD 040 | 560925
19 | 2012-03-12T09:42:48 | 01 HOOFD 050 | 560925
20 | 2012-03-12T10:15:07 | 06 VD 010 | 560925
21 | 2012-03-12T10:24:56 | 01 HOOFD 120 | 560925
22 | 2012-03-12T10:49:01 | 01 HOOFD 180 | 560925
23 | 2012-03-12T11:18:19 | 01 HOOFD 260 | 560925
24 | 2012-03-15T13:11:06 | 01 HOOFD 010 | 560464
25 | 2012-03-15T13:15:27 | 01 HOOFD 020 | 560464
26 | 2012-03-15T13:37:42 | 01 HOOFD 040 | 560925
27 | 2012-03-15T14:02:18 | 01 HOOFD 050 | 560925
28 | 2012-03-15T14:19:32 | 01 HOOFD 065 | 560925
29 | 2012-03-15T15:06:11 | 01 HOOFD 120 | 560464
30 | 2012-03-15T15:46:37 | 01 HOOFD 180 | 560464
31 | 2012-03-15T16:10:44 | 01 HOOFD 260 | 560464
32 | 2012-03-15T16:42:01 | 01 HOOFD 380 | 560464
33 | 2012-03-15T16:53:26 | 01 HOOFD 430 | 560925

For this example, we describe a simple two-step approach. First, cases are grouped in clusters based on their properties. It is obvious that cases 1, 2 and 4 belong to one cluster, as they all have the same case properties, and case 3 belongs to another cluster. Secondly, a classification (decision tree learning) algorithm is used on the clustering results. In this example, we expect to identify, after classification, a representative number, e.g., 150, for the sum leges attribute that would differentiate between the two clusters. Consequently, the following two case type dimension members can be considered representative: (parts = Bouw, sum leges < 150) and (parts = Milieu, sum leges >= 150). The difficulty of this approach is that it requires data mining knowledge to store the event data in the process cube. There is also a middle-ground approach. For instance, the number of dimensions can still be kept small, but not necessarily limited to three. Moreover, one dimension can contain a single property instead of a combination of properties. In this case, the attributes that do not end up as dimensions can still be stored in a cell. For this example, we consider 4 dimensions: parts, activity, resource and timestamp. The parts dimension has two elements, D1 = {Bouw, Milieu}. The resource dimension also has two elements, D2 = {560464, 560925}. The activity dimension consists of 14 elements, e.g., 01 HOOFD 010, 09 AH I 010 and others.
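The cell counts quoted for the two dimension choices can be checked directly from the attribute cardinalities (a quick arithmetic verification of ours; the 4-dimensional variant uses the timestamp dimension coarsened to three members, as derived next):

```python
from math import prod

# one dimension per attribute, plus case id and event id (7 dimensions)
full_cube = prod([4, 2, 2, 43, 43, 14, 2])
# parts, activity, resource and the coarsened timestamp (4 dimensions)
compact_cube = prod([2, 14, 3, 2])

print(full_cube, compact_cube)  # 828352 168
```

The sparse 7-dimensional variant is thus almost five thousand times larger than the compact one.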
While the first three dimensions have a relatively small number of members, the last dimension consists of 43 different members. To reduce this number, only the year, the month and the day of the week are considered for the timestamp dimension and the rest is stored in the cell. Consequently, the size of the timestamp dimension is reduced to three members: 2012FebTue, 2012MarMon and 2012MarThu. As a result, the process cube PC consists of 2 × 14 × 3 × 2 = 168 process cells.

To show the content of a process cell of the process cube PC, we use the CE function on a set of selected hierarchy elements. For h1 ∈ H1, h1 = Bouw, allLeaves(h1) = {Bouw}; h2 ∈ H2, h2 = 560925, allLeaves(h2) = {560925}; h3 ∈ H3, h3 = 01 HOOFD 040, allLeaves(h3) = {01 HOOFD 040}; h4 ∈ H4, h4 = 2012MarThu, allLeaves(h4) = {2012MarThu}; the CE function returns CE(h1, h2, h3, h4) = {9, 26}. Both CC(9) = (Bouw, 560925, 01 HOOFD 040, 2012MarThu) and CC(26) = (Bouw, 560925, 01 HOOFD 040, 2012MarThu) return the same tuple of hierarchy elements. Event data that is not yet stored as dimension values can still be stored in the process cell containing events 9 and 26, as shown in Table 3.2.

case id  sum leges    event id  timestamp
2        138.55       9         2012-03-08T12:31:15
4        138.55       26        2012-03-15T13:37:42

Table 3.2: Event data corresponding to the process cell defined by CE(h1, h2, h3, h4) = {9, 26}.

3.2.2 Applying OLAP Operations to the Process Cube

In Section 2.3.1, the following OLAP operations were described: slice, dice, pivoting, roll-up and drill-down. In this section we show, by means of an example, how these operations can be applied to a process cube.

[Figure: a 3D subcube spanned by D1 (parts), D2 (resource) and D3 (activity), drawn once per leaf of the D4 (timestamp) hierarchy.]

Figure 3.5: Process cube by example.
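The reduced timestamp dimension and the CE lookup described above can be sketched as follows. This is a toy reimplementation, not the thesis code: the four events are the 01 HOOFD 040 occurrences from Table 3.1, and the member-name format `2012MarThu` is reproduced with `strftime`.

```python
from datetime import datetime

# The four "01 HOOFD 040" events from Table 3.1, reduced to the chosen
# dimensions: (parts, full timestamp, activity, resource).
EVENTS = {
    3:  ("Bouw",   "2012-02-21T12:15:07", "01 HOOFD 040", "560925"),
    9:  ("Bouw",   "2012-03-08T12:31:15", "01 HOOFD 040", "560925"),
    18: ("Milieu", "2012-03-12T09:17:39", "01 HOOFD 040", "560925"),
    26: ("Bouw",   "2012-03-15T13:37:42", "01 HOOFD 040", "560925"),
}

def ts_member(iso):
    """Reduce a full timestamp to its year-month-weekday dimension member."""
    return datetime.fromisoformat(iso).strftime("%Y%b%a")  # e.g. 2012MarThu

def CE(parts, ts, activity, resource):
    """Return the ids of the events stored in the addressed process cell."""
    return {eid for eid, (p, t, a, r) in EVENTS.items()
            if (p, ts_member(t), a, r) == (parts, ts, activity, resource)}

print(CE("Bouw", "2012MarThu", "01 HOOFD 040", "560925"))  # {9, 26}
```

Events 9 and 26 both fall on a Thursday in March 2012, so they land in the same cell, while events 3 (a February Tuesday) and 18 (a March Monday, and a Milieu case) do not.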
With orange, 2012FebTue and 2012MarThu are selected for the timestamp dimension and are used for dicing the process cube. With green, a subcube is illustrated, which is the result of slicing the previous subcube on the 560464 member of the resource dimension. With red, a subcube is illustrated, which is the result of slicing the previous subcube on the 560925 member of the resource dimension.

Figure 3.5 illustrates the 4-dimensional process cube PC, constructed in the previous step. To represent the 4D structure in a 2D plane, first the members of the timestamp hierarchy are displayed on the left. The root element of the hierarchy is the 2012 year, followed by the month elements, 2012Feb and 2012Mar, and having the days of the week as the leaf nodes, 2012FebTue, 2012MarMon and 2012MarThu. To each leaf member of the timestamp dimension corresponds a 3D subcube as the one on the right.

For the process cube PC, we choose to first perform a dice, by selecting the 2012FebTue and the 2012MarThu members on the timestamp dimension. Let PC = (CS, CE) and Di' = Di for all 1 ≤ i ≤ 3, D4' = {2012FebTue, 2012MarThu}. The dice operation is dice_CD'(PC) = PC', where PC' = (CS', CE'), CS' = (CD', CH'), CH' = H1 × H2 × H3 × H4'. With allLeaves(2012) = {2012FebTue, 2012MarMon, 2012MarThu} and allLeaves(2012FebTue) = {2012FebTue}, we obtain allLeaves(2012) ∩ allLeaves(2012FebTue) = {2012FebTue}, ..., and H4' = {2012, 2012Feb, 2012Mar, 2012FebTue, 2012MarThu}. For h ∈ H4, h = 2012Mar: children(h) = {2012MarMon, 2012MarThu}, children'(h) = children(h) ∩ H4' = {2012MarThu}, ...; for h ∈ H4, h = 2012Mar: allLeaves(h) = {2012MarMon, 2012MarThu}, allLeaves'(h) = allLeaves(h) ∩ H4' = {2012MarThu}, .... Finally, CE'(h1, ..., h4) = CE(h1, ..., h4), for (h1, ..., h4) ∈ CH'.
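The hierarchy pruning performed by the dice can be sketched in a few lines; the helper structures below are our own illustration, not the thesis data structures. Note that a slice is simply a dice whose selected member set is a singleton.

```python
# Illustrative sketch of dice on one hierarchical dimension: only elements
# that still reach a selected leaf survive, and children/allLeaves are
# intersected with the kept set, as in the formulas above.
H4_children = {
    "2012": ["2012Feb", "2012Mar"],
    "2012Feb": ["2012FebTue"],
    "2012Mar": ["2012MarMon", "2012MarThu"],
}

def all_leaves(h, children):
    kids = children.get(h, [])
    return {h} if not kids else set().union(*(all_leaves(k, children) for k in kids))

def dice(children, selected_leaves):
    # A slice on a single member is a dice with selected_leaves = {member}.
    every = set(children) | {k for kids in children.values() for k in kids}
    kept = {h for h in every if all_leaves(h, children) & selected_leaves}
    return {h: [k for k in kids if k in kept]
            for h, kids in children.items() if h in kept}

pruned = dice(H4_children, {"2012FebTue", "2012MarThu"})
print(pruned["2012Mar"])  # ['2012MarThu']
```

The kept elements are exactly 2012, 2012Feb, 2012Mar, 2012FebTue and 2012MarThu, i.e., H4', and the children of 2012Mar are reduced to the single member 2012MarThu.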
Further, two slice operations are performed on the diced subcube PC', by selecting first the 560464 and then the 560925 member of the resource dimension. The resulting subcubes PC'_1 and PC'_2 are still 4D structures, although they have only one member on the resource dimension. The corresponding 3D subcubes, with the timestamp dimension left aside for presentation reasons, are depicted in Figure 3.5. The PC_560464 subcube is represented with green and the PC_560925 subcube is represented with red. The slice operation where the 560464 resource is selected is slice_{2,560464}(PC') = PC_560464, with PC_560464 = dice_{CD_560464}(PC') and CD_560464 = D1' × {560464} × D3' × D4'. The slice operation where the 560925 resource is selected is slice_{2,560925}(PC') = PC_560925, with PC_560925 = dice_{CD_560925}(PC') and CD_560925 = D1' × {560925} × D3' × D4'.

While the slice and dice operations are used to select parts of a process cube, the pivoting, roll-up and drill-down operations help in visualizing the selections. As mentioned in Section 2.3.1, only two dimensions out of all the process cube dimensions can be visualized at a time. For example, in Figure 3.5, the dimensions parts and resource can be easily visualized. This part of the cube indicates which resources are responsible for handling cases for Bouw and which for Milieu. It is possible to also visualize the activity dimension, but not all its elements can be clearly distinguished. By the pivoting (or rotation) operation, the visualization perspective of the process cube can be changed. For example, by selecting the dimension activity on the x-axis instead of the dimension parts, and the dimension parts on the y-axis instead of the dimension activity, the cube is rotated and a new side of it can be visualized. Such a change makes it easy to distinguish the activities corresponding to the Bouw and Milieu parts, together with their corresponding cells. The pivoting operation is pivot_{1,3}(PC') = PC'_p.
PC'_p = (CS'_p, CE'_p), CS'_p = (CD'_p, CH'_p), CD'_p = D3' × D2' × D1' × D4', CH'_p = H3' × H2' × H1' × H4', children'(h) = children(h), allLeaves'(h) = allLeaves(h), CE'_p(h3, h2, h1, h4) = CE(h1, h2, h3, h4).

The roll-up and drill-down operations have an impact when applied to a dimension with a hierarchical structure. Through a roll-up operation, members of a hierarchically inferior level are replaced with a member of a hierarchically superior level. For this example, we consider the timestamp dimension with its elements 2012FebTue, 2012MarMon and 2012MarThu. A roll-up operation on the children of 2012Mar replaces the current timestamp elements with 2012FebTue and 2012Mar. The roll-up operation is then rollup_{4,2012Mar}(PC') = PC'_r, where PC'_r = (CS'_r, CE) with CS'_r = (CD'_r, CH), and CD'_r = D1' × D2' × D3' × ((D4' \ children(2012Mar)) ∪ {2012Mar}).

While the roll-up operation folds elements from an inferior hierarchical level into elements of a superior one, the drill-down operation expands members from hierarchically superior levels. We consider again the timestamp dimension. For the previous PC'_r subcube, a drill-down operation on the 2012Mar element replaces the current dimension elements with 2012FebTue, 2012MarMon and 2012MarThu. The drill-down operation is then drilldown_{4,2012Mar}(PC'_r) = PC'_d, where PC'_d = (CS'_d, CE) with CS'_d = (CD'_d, CH), and CD'_d = D1' × D2' × D3' × ((D4' \ {2012Mar}) ∪ children(2012Mar)).

3.2.3 Materialization of Process Cells

In the previous step, the applicability of the OLAP operations was shown by means of an example. The main emphasis was on the changes that occurred at the dimension level. Naturally, the question arises as to what happens at the cell level. The last step of our approach gives an answer to this question. We rely in our explanation on Figure 3.6, presented in more detail in [6].

Figure 3.6: Partitioning of the process cube. The split operation is realized by drill-down.
The functionality of the merge operation is given by roll-up. The left part of Figure 3.6 shows the process cube created from an extended version of the event log in Figure 3.2. In the process cube, the top part depicts a simplified event log corresponding to the process cube. The step of extracting an event log based on the event data from the process cube or from parts of it (process cells or groups of cells) is known as the materialization step. The resulting event logs are then given as input to different process mining algorithms. The outcome is a set of process models which can be visualized. Back to our example, the event log shown at the top of the process cube is used to obtain the process model shown at the bottom, by applying the Alpha Miner algorithm on it.

The right part of Figure 3.6 shows the result of splitting the process cube from the left on its case type and event class dimensions. In the figure, two types of splitting can be identified. Vertical splits consider an entire case for separation. For example, by splitting on the case type dimension, cases 1, 4, 5, 6 are separated from cases 2, 3, 7, 8. The results of a horizontal split are no longer whole cases, but rather parts of cases corresponding to subsets of activities. For example, by splitting on the event class dimension, activities A, C are representative for the cell given by CE(silver customer, sales, 2012) and activities C, D, E, F, G are representative for the cell given by CE(silver customer, delivery, 2012). Note that activity C is present in both cells, i.e., activity C can be executed in both the sales and the delivery department. This is possible as the activity attribute is not a dimension in the process cube and therefore, the same activity can be present in multiple cells.

(a) The resulting process model after slicing on the 560464 resource. (b) The resulting process model after slicing on the 560925 resource. Figure 3.7: Process mining results for PC_560464 and for PC_560925.
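The distinction between the two split flavours can be sketched as follows. The toy log and the case-type labels below are invented for illustration; they are not the actual data behind Figure 3.6.

```python
# Invented toy data: two cases whose events are tagged with an event class.
log = {
    1: [("A", "sales"), ("C", "sales"), ("D", "delivery")],
    2: [("C", "delivery"), ("E", "delivery"), ("G", "delivery")],
}
case_type = {1: "silver customer", 2: "gold customer"}  # assumed case attribute

def vertical_split(log, wanted_type):
    """Vertical split: whole cases are kept together or separated."""
    return {c: evs for c, evs in log.items() if case_type[c] == wanted_type}

def horizontal_split(log, event_class):
    """Horizontal split: every case survives, but keeps only the events
    of the selected event class (possibly none)."""
    return {c: [e for e in evs if e[1] == event_class] for c, evs in log.items()}

print(vertical_split(log, "silver customer"))  # only case 1 survives
print(horizontal_split(log, "sales"))          # case 1 loses D; case 2 empties
```

As in the figure, an activity such as C may survive in several horizontal fragments, since the split criterion is the event class, not the activity itself.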
When related to the OLAP operations, the split operation is realized by the drill-down operation and the merge operation is realized by the roll-up operation. In the second step, based on a process cube example, several OLAP operations were presented. After "playing" with the process cube, one is interested in materializing the selected parts of the process cube and obtaining meaningful process mining results. The PC_560464 and PC_560925 subcubes are among the subcubes obtained in the second step. Figure 3.7a presents the resulting process model M_PC560464 for the process cube PC_560464. Similarly, Figure 3.7b presents the resulting process model M_PC560925 for the process cube PC_560925. Now the two process models can be compared to find differences and similarities. An immediate similarity is that both processes contain the same activities 01 HOOFD 050 and 01 HOOFD 120. There is a large number of differences, related both to the activities and to the control flow. One could start by noticing that one process starts with activity 01 HOOFD 010, while the other starts with activity 01 HOOFD 040.

3.3 Requirements

Now that we have established the desired functionality of a process cube, the next step is to find technologies and methods to turn the process cube concept into a real application. There is no fixed recipe that guarantees the achievement of this goal. Multiple tools are available that can accommodate the desired process cube functionality and there is certainly more than one solution to approach the problem. Nevertheless, there is a list of requirements that should be met, independent of the chosen technology and the solution for implementation. As our goal is to create a proof-of-concept tool that exploits OLAP features to accommodate process mining solutions such that the comparison of multiple processes is possible, and based on the process cube functionality presented in this chapter, the following requirements are derived:

1. The system shall include an OLAP Server with support for traditional OLAP operations.
2. External tools shall be open to adjustments. They shall offer the possibility to add new functionality and change the existing one.
3. The application shall be programmed in Java to enable integration with ProM.
4. External tools shall provide means to enable their employment in a Java-written system.

The first requirement is quite straightforward, considering the goal of this project. The OLAP Server organizes data in multidimensional structures, which facilitates the inspection of the stored data from different perspectives. In that sense, the OLAP Server can also be used to examine the different views of a process. Employing traditional OLAP operations on the OLAP multidimensional structures provides quick and facile filtering. By means of this functionality, the integrated analysis of multiple processes can be supported. Since the OLAP Server is an indispensable component of the system, it has to be either created from scratch or employed from an external tool. Creating an OLAP Server from scratch undoubtedly implies a vast amount of work. Under the circumstances, employing an already existing OLAP Server to save time seems to be a plausible idea. Moreover, parts of an OLAP Client application can also be reused to save time. However, in this case, the second requirement has to be considered. The existing OLAP tools cannot handle event logs and do not support process-mining analysis. Therefore, an external OLAP tool shall allow adding this functionality and changing the existing one, should this be the case. This is possible only if the external tool is open source. The ProM Framework was introduced in Section 2.5 as a platform hosting multiple plugins that represent the result of implementing different process mining algorithms. Clearly, it is wise to use the already existing process mining techniques, as they provide sufficient methodology to perform process analysis.
However, to facilitate the easy integration with ProM, Java is the preferred programming language. The fourth requirement comes as a consequence of the third requirement. External parties must possess interfacing capabilities with the system. Since the main application has to be written in Java, external tools should be either Java-based or provide a Java Application Programming Interface (API) to allow their employment in the system.

3.4 Comparison to Other Hypercube Structures

Before starting with the process cube implementation, a literature study is performed to identify the cubes with the closest functionality and requirements to the process cube. The reason for doing this is threefold. First, one can find similarities with other hypercube structures, in which case some of their functionality can be reused. Secondly, identifying limitations of the current multidimensional structures helps in clarifying what is still to be done. Finally, previous work on similar OLAP cubes can suggest where one could expect difficulties. Data loaded in traditional OLAP cubes come from different sources, e.g., multiple data warehouses. Due to the considerable growth of stored data, simple ways of data representation are sought to conveniently keep data outside local databases. OLAP cubes are also adjusted to handle data in different formats. For example, OLAP cubes can be specified on XML data [34]. Still, OLAP cubes cannot support data in the XES format, typical for event logs, because of the specific characteristics of event data. OLAP cubes are designed to work with numerical measures, and various ways of computing numerical aggregates are explored, from traditional sum, count and average to sorting-based algorithms [10] and agglomerative hierarchical clustering [40]. In [45], several measures are proposed to summarize process behavior in a multidimensional process model.
Among those, instance-based measures (e.g., average throughput time of all process instances), event-based measures (e.g., average execution time of process events) and flow-based measures (e.g., average waiting time between process events) are the most relevant. In recent years, non-numerical data has also been considered in an OLAP setting. OLAP cubes have been extended to graphs [52], to sequences [37, 38] and to text data [36]. Creating a Text Cube became possible by employing information retrieval techniques and selecting term frequency and inverted index measures. In [45], the Event Cube is presented. Unlike other OLAP cubes, this multidimensional structure is constructed for the inspection of different perspectives of a business process, which in fact coincides with the purpose of the process cube. To accomplish this, event information is summarized by means of different measures of interest. For instance, the control-flow measure is used to directly apply the Multidimensional Heuristics Miner process discovery algorithm. The difficulty with respect to this approach is that traditional process mining techniques have to be extended with multidimensional capabilities, in the same way as was done for the Flexible Heuristics Miner: the Multidimensional Heuristics Miner was introduced as a generalization of the Flexible Heuristics Miner, to handle multidimensional process information. Of course, extending existing process mining techniques requires a lot of effort. Therefore, we propose a conceptually clearer and more generic approach. That is, instead of adjusting all process mining techniques to multidimensionality, the OLAP multidimensional structure can be adjusted to allow employing existing process mining techniques, without the need to change them.
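As an aside, the instance-based measures mentioned above are simple to compute directly from raw event data. The sketch below is illustrative only: it computes the average throughput time of two toy cases whose first and last timestamps are reused from the event log in Table 3.1.

```python
from datetime import datetime

# Toy log: case id -> list of event timestamps (first and last event of
# cases 1 and 2 from Table 3.1); illustrative, not the thesis data model.
log = {
    1: ["2012-02-21T11:52:13", "2012-02-21T14:09:49"],
    2: ["2012-03-08T12:03:11", "2012-03-08T15:56:41"],
}

def throughput_seconds(events):
    """Instance-based measure: time between a case's first and last event."""
    ts = sorted(datetime.fromisoformat(t) for t in events)
    return (ts[-1] - ts[0]).total_seconds()

avg = sum(throughput_seconds(e) for e in log.values()) / len(log)
print(avg)  # 11133.0 seconds, i.e. about 3.1 hours
```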
All in all, the process cube is unique as it allows the storage of event data in its multidimensional structure, which is further used for process analysis purposes by employing existing process mining techniques. This approach creates a bridge between process mining and OLAP, as methods from both fields are interchangeably applied. The advantage is that quick discovery and analysis of business processes and of their corresponding sub-processes is facilitated in an integrated way. Moreover, no changes to the applied traditional process mining techniques are needed.

Chapter 4

OLAP Open Source Choice

Based on the conceptual aspects previously introduced, in the following chapters we continue with describing the prototype solution. Before going into detail with respect to the implementation, in this chapter we give the motivation for our technology choice. The process cube formalization from Chapter 3 indicates the need for process mining and OLAP support. For process mining, the selected framework is ProM, introduced in Section 2.2.1, as it is the leading open source tool for process mining. Other commercial process mining systems exist, e.g., Futura Reflect, Fluxicon, Comprehend, ARIS Process Performance Manager [12], but ProM contains many plugins that allow effective process mining discovery and analysis. A part of these plugins is chosen for this project. Besides the OLAP database, we also use a classical relational database to store event data; this database is only used for event log reconstruction. There is a vast array of possibilities when it comes to available relational database systems, e.g., Oracle Database, Microsoft SQL Server, MySQL, IBM DB2, SAP Sybase, just to name a few. As there are no special benefits of using one relational database over another, in our project we choose MySQL, as it is one of the most widely used database systems in the world.
For OLAP, on the other hand, it is difficult to make an immediate decision with respect to the tool selection. There are multiple technologies available, which vary in terms of the used database type, e.g., classical relational, multidimensional, hybrid; the storage location, e.g., in-memory or on-disk; the storage method, e.g., column-based or row-based databases; the way data relationships are kept, e.g., matrix or non-matrix (polynomial) databases; and so on. Therefore, in this chapter, the different OLAP tools and their characteristics are further detailed, together with the corresponding advantages and disadvantages. Finally, a single OLAP system is selected for our application.

4.1 Existing OLAP Open Source Tools

For a potential OLAP tool to be used in this project, supporting conventional OLAP functionality is not sufficient. Several requirements were listed in Section 3.3. From those, two are particularly important to consider when choosing an external OLAP tool. The tool has to be open source, to allow changes in its functionality, and should provide support for further Java development, to enable the integration of ProM (which is written in Java) and OLAP capabilities on a single platform. OLAP tools can be split into OLAP servers and OLAP clients. OLAP clients are the user interfaces to the OLAP servers. Even though the open source OLAP servers and clients are not as powerful as commercial solutions [49], they encourage community-based development by being free to use and modify. In our case, when integrating process mining solutions in OLAP technology, we expect to encounter differences with existing functionality. Therefore, in this project, an open source tool which allows adding new solutions is preferred over a more "powerful", but non-extensible commercial tool. To provide an overview of the existing OLAP open source tools, we refer to the following sources [1, 27, 28, 48, 49, 50].
From those, [1, 49, 50] contain the work of Thomsen and Pedersen, and include a periodic survey of open source tools for business intelligence. The first survey [49], published in 2005, refers to three OLAP servers, Bee, Lemur and Mondrian, and two OLAP clients, Bee and JPivot, which were the only ones implemented at the time. In the survey from 2011 [1], only two OLAP servers are presented, Mondrian and Palo. That is because the Bee and Lemur servers were discontinued and a new OLAP server, Palo, was created. In [28], we find the same Mondrian and Palo OLAP servers mentioned again. By 2011, there were already several OLAP clients available, e.g., JPalo, JPivot, JRubik, FreeAnalysis, JMagallanes OLAP & Reports. There are also several integrated BI suites. Both [27] and [50] refer to the Jaspersoft BI Suite, Pentaho and SpagoBI. All these BI suites use the Mondrian OLAP engine and the JPivot OLAP client graphical interface. Recently, the Palo BI Suite was released, which works with the Palo multidimensional OLAP server and the Palo for Excel client. As every OLAP client uses a specific OLAP server, selecting an OLAP server automatically narrows the client choice. In the following, we offer a summary of the two previously introduced OLAP servers, Mondrian and Palo. These servers are quite different from each other, mainly because they use different types of databases to store the data. The first one, Mondrian, stores data in relational databases, and is therefore called a ROLAP server; the other, Palo, stores data in multidimensional databases, and is therefore considered a MOLAP server.

4.2 Advantages & Disadvantages

The storage engine used, ROLAP or MOLAP, has a considerable influence on the characteristics of the OLAP servers, e.g., implementation design and methods, query mechanisms, performance. Therefore, we start this section with a discussion on ROLAP and MOLAP engines.
Then, we emphasize the advantages and disadvantages of the Mondrian and Palo OLAP servers by comparing and contrasting their characteristics, e.g., performance, scalability, flexibility. The major advantage of ROLAP is that the relational database technology is well standardized, e.g., SQL2, and is readily available off-the-shelf [17]. The disadvantage is that the query language is not powerful and flexible enough to support true OLAP capabilities [51]. The multidimensional model and its operations have to be mapped onto relations and SQL queries [19]. The main advantage of MOLAP is that its model closely matches the multidimensional model, allowing for powerful and flexible queries in terms of OLAP processing [17]. In general, the main disadvantage of MOLAP is that no real standard for MOLAP exists. Moreover, in particular situations, different problems can occur, e.g., scalability issues when it comes to very large databases, or sparsity issues for sparse data. In [21], Colliat deems that multidimensional databases are several orders of magnitude faster than relational databases in terms of both data retrieval and calculation. MOLAP servers have faster access times than ROLAP servers because data is partitioned and stored in dimensions, which allows retrieving data corresponding to any combination of dimension members with a single I/O. In a ROLAP server, on the other hand, due to intrinsic mismatches between OLAP-style querying and SQL (e.g., lack of sequential processing and column aggregation), performance bottlenecks are common [18]. Generally, MOLAP provides more space-efficient storage, as data is kept in dimensions and a dimension may correspond to multiple data values. However, this does not hold for sparse data, as in that case data values are missing for the majority of member combinations. ROLAP systems work better with non-aggregate data, and aggregate data management comes at a high cost.
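To illustrate the ROLAP side of this mapping, a roll-up over one dimension becomes a plain SQL GROUP BY on a relational fact table. The table layout and the event counts below are illustrative, not taken from the thesis.

```python
import sqlite3

# Sketch of how a ROLAP engine maps a multidimensional query onto SQL:
# the "cube" is a relational fact table, and rolling the resource
# dimension up becomes an aggregation per parts member.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE fact (parts TEXT, resource TEXT, events INTEGER)")
con.executemany("INSERT INTO fact VALUES (?, ?, ?)", [
    ("Bouw", "560464", 12), ("Bouw", "560925", 13),
    ("Milieu", "560464", 3), ("Milieu", "560925", 5),
])

# Roll-up the resource dimension away: aggregate event counts per parts.
rows = con.execute(
    "SELECT parts, SUM(events) FROM fact GROUP BY parts ORDER BY parts"
).fetchall()
print(rows)  # [('Bouw', 25), ('Milieu', 8)]
```

A MOLAP engine would instead address the cells of a multidimensional array directly, which is precisely the mismatch the surrounding discussion is about.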
MOLAP, on the other hand, works better with aggregate data. This is to be expected, considering the table-based structure of a relational database and the structure of a multidimensional database, which is organized in dimensions and has a built-in hierarchy. An advantage of ROLAP is that it is immune to sparse data, i.e., sparsity influences neither its performance nor its storage efficiency. On the other hand, sparsity is a limitation for MOLAP servers, and can hinder some of their benefits considerably. For example, a sparse MOLAP does not provide space-efficient storage and runs into considerable performance issues. Therefore, MOLAP servers typically include provisions for handling sparse arrays. For example, the sparsity problem is known to be solved in the case of the commercial Essbase multidimensional database management system, by adjusting the structure of the MOLAP server to handle sparse and dense dimensions separately.

Now that the advantages and disadvantages in terms of the employed OLAP engine have been presented, in the following we discuss the advantages and disadvantages of the Mondrian and Palo OLAP server tools. Before continuing our discussion, we remark that both Mondrian and Palo satisfy the requirement of being compatible with a Java-written system. Mondrian is implemented in Java and offers cross-platform capabilities. As for Palo, the initial Palo MOLAP engine was programmed in C++. However, today various serial interfaces in VBA, PHP, C++, Java and .NET allow Palo OLAP to be extended.

Performance

Performance is a characteristic where Palo generally outperforms Mondrian. First, the Palo MOLAP engine offers faster query response times [19] than the ROLAP engine of Mondrian. Secondly, the in-memory feature of the Palo server improves the speed even further, as in-memory databases are naturally faster than disk-based databases.
Nevertheless, while not as fast as the Palo MOLAP server, the Mondrian ROLAP server is also known to provide acceptable performance [50].

Scalability

The in-memory characteristic is an advantage (faster data retrieval) but also a disadvantage of Palo. A database which is memory-based automatically becomes memory-limited. Undoubtedly, memory capacity grows very quickly, but so does the volume of available data. There are advances made to compensate for the memory need. For example, 3D-stacked memories such as the Micron Hybrid Memory Cube are available¹. Nevertheless, at the moment, scalability is considered an advantage of Mondrian and a disadvantage of Palo.

Flexibility

Both Mondrian and Palo provide different types of flexibility. Being a ROLAP server, Mondrian is more flexible regarding cube redefinition and provides better support for frequent updates [43]. On the other hand, the in-memory database of Palo does not require indexes, recalculation or pre-aggregations. As analysis is possible at a detailed level without any preprocessing [28], Palo is more flexible in that sense.

4.3 Palo - Motivation of Choice

Considering all the features of Mondrian and Palo presented in Section 4.2, it can be noticed that, in general, the advantages of one technology are the disadvantages of the other. Moreover, both Mondrian and Palo satisfy the requirements from Section 3.3, e.g., open source, Java-compatible, with OLAP capabilities. Consequently, either of the two OLAP servers could be used in this master project. We choose the Palo in-memory multidimensional OLAP server, and in the following we motivate our choice. First, we adopt the Palo technology because we want to explore new and innovative technologies. Mondrian stores data in relational databases. Relational databases are simple and powerful solutions, but they have already been used for decades. Palo stores data in a multidimensional in-memory database.
Both multidimensional OLAP and in-memory technologies are relatively new compared to relational databases. Being still in their infancy, they provide various research challenges which are interesting to explore. Secondly, we believe that the Palo technologies have a real future perspective. With decreasing memory prices and the growth of the available memory capacity, there are real chances that in-memory databases will be used more often. Moreover, there are promising performance results recorded for MOLAP engines. While there are different techniques employed to speed up relational query processing (e.g., index structures, data partitioning), there is not too much that can be done to further improve ROLAP performance. On the other hand, we see Palo as a technology with the potential to develop performance-wise. All in all, we choose Palo because it uses new technology and has real chances to grow in the future. Since the JPalo client is the only one that uses the Palo MOLAP server, JPalo is the OLAP client choice for this project.

¹ http://www.edn.com/design/integrated-circuit-design/4402995/More-than-Moore-memory-grows-up

Chapter 5

Implementation

In the previous chapter we discussed the storage technologies to be used and we motivated the use of Palo. In this chapter, we describe our implementation using Palo, ProM and MySQL capabilities. We start by describing the system components and the way they are interconnected. Then, we focus on three main aspects:

• Storing the event data in the process cube.
• Preparing the process cube for analysis purposes, e.g., by filtering on dimensions.
• Comparing process cells by visualizing the corresponding process mining results.

5.1 Architectural Model

Figure 5.1: The PROCUBE System. It contains components, external parties and the corresponding communications between both internal and external elements of the system.
As explained in Section 3.3, our implementation is integrated in ProM, i.e., our application runs as a ProM plugin. The implemented plugin is called the PROCUBE plugin. Together with Palo and MySQL, the PROCUBE plugin forms the PROCUBE system. In this section we describe the architecture of the PROCUBE system. The main components of the PROCUBE system, together with the external parties and the way they communicate with each other, are shown in Figure 5.1. The system interacts with three external tools: ProM, MySQL and Palo. ProM is the host framework of the system, since the PROCUBE application runs as a plugin in ProM. The relational database of MySQL is used to store data from the event logs that is not relevant for multidimensional processing. Palo is employed for its OLAP capabilities. It is composed of two main parts, the Palo Server and the Palo Client. While no changes are made to the Palo Server in this project, the Palo Client is appropriately adjusted to allow operations on event data. The Palo Server comes with an in-memory multidimensional database, for storage purposes, and an OLAP cube, built on top of the database, suited to support OLAP functionality. The flow of the event data through the system starts with the loading of an event log. This function is performed by the Load component. Its role is to pre-process the incoming event data from an event log and to load it into the MySQL and Palo databases in such a way that it is properly stored and ready for further use. Also at loading, the Palo cube is created from the event data residing in the Palo database. Immediately after loading, the process cube can be used to recreate the initially loaded event log. However, there is no benefit from having merely this functionality. As such, the system also contains a Filtering component. Its purpose is to perform various filtering operations on the process cube such that the different perspectives of the cube can be inspected.
Note that filtering is used to extract parts of the process cube and not to modify its structure. Filtering is based on the traditional OLAP operations: slice, dice, roll-up and drill-down. Besides filtering, pivoting is another useful OLAP operation that is employed. It allows rotating the cube to visualize it from a different angle. Once created, the filtered parts of the process cube are used to unload the corresponding event data, from which an event log is then materialized. The Unload component is responsible for taking the required data from both the relational and the in-memory database and creating an event log out of it. The resulting event log is given as input to a ProM plugin. The output is a process mining result that can be visualized. Not all the existing ProM plugins are considered; a representative list of ProM plugins is selected for this purpose. Finally, a GUI component was specially created to show different process mining results simultaneously. The advantage of such a component is that it facilitates the comparison of multiple process mining results by placing them next to each other.

5.2 Event Storage

The simplest and most intuitive way to store event data in a process cube is to select all the attributes in the event log as dimensions. To guarantee that an event is unique in terms of its dimensions, an event id is assigned to each event. The same holds for cases: a case id is assigned to each case. Both the event id and the case id are considered as dimensions. Even though this approach is the easiest one, in many cases it can create considerable problems with respect to both storage space and performance. This is because such a way of storing event data leads to extreme sparsity in the process cube. There are two possible ways to cope with the sparsity problem. The first solution is to reduce the number of dimensions.
By reducing the number of dimensions, only a subset of the entire set of attributes is selected to form the dimensions. Consequently, the problem of where and how to store the rest of the event and case attributes appears. Moreover, events are no longer uniquely identified by dimensions, which implies having more than one event corresponding to a cell. An immediate solution is to save the rest of the event data in the process cell. The difficulty with this approach is that the Palo Server, like other OLAP servers, allows only a limited number of characters per cell; in the case of Palo, the limit is 255. Moreover, today's OLAP servers work with numerical values rather than with text. This limitation forces us to look for a new solution.

Figure 5.2: Event storage. Numbers represent cell ids and indicate the existence of a cell with a corresponding set of events.

The solution we applied consists of giving a unique identifier to each cell and saving the rest of the event data corresponding to the cell in a relational database. Figure 5.2 illustrates the approach. On the left-hand side, a cube consisting of three dimensions (task, timestamp and last phase) is shown. The numbers in cells, e.g., 6, 7, 10, 11, represent cell ids. On the right-hand side, there is a table with case and event properties. This table is actually saved in the relational database. A row of the table stores data corresponding to an event. The cell id is a column in the table, and it indicates which event corresponds to which cell. For example, for the cell with id 11, three events, namely 27, 28 and 29, can be identified in the table. For each of these events, properties that are not among the dimensions in the process cube are stored in the relational database. The solution presented above does not fully guarantee that sparsity is sufficiently limited.
For instance, if the dimensions stored in the in-memory multidimensional database are all sparse, i.e., contain a large number of members that hardly repeat in the log, then the sparsity problem is still present. Examples of sparse dimensions are the event id, because there is one member for each new event, and the timestamp, since almost every event can have a unique timestamp. Therefore, the second solution consists of reducing the number of elements per dimension. The Palo Server, like other multidimensional OLAP servers, offers a very useful feature called a hierarchy: members in a dimension can be hierarchically organized. An event log can contain different types of attributes: binary, numerical, time, categorical, etc. For the time attributes, there is already a natural built-in hierarchy that can be directly employed, e.g., year → month → day of week. For example, the timestamp 2012-02-21T11:52:13 belongs to the year 2012, the month is 2012Feb and the day of week is 2012FebTue. Hierarchies can be used to reduce the number of members per dimension. For the time example, only the year, month and day of week can be stored in-memory, while the actual timestamp can be saved in the relational database. For the rest of the attributes, it is also possible to construct hierarchies, but it is not as straightforward as for attributes of time type. That is, to obtain a meaningful hierarchy for a set of categorical attribute values, applying clustering and classification techniques would be useful. The time hierarchy is implemented in our project for any dimension which contains elements of date type. For the rest of the attributes, no hierarchy is established, since this is not easy to solve in a generic way. As a consequence, even though solutions to limit sparsity were applied, the sparsity problem can still occur, should the user select some sparse non-time dimensions to be stored in the multidimensional database.
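The construction of such time-hierarchy member names can be sketched as follows. This is a minimal illustration; the class and method names are ours and not part of the Palo or ProM APIs:

```java
import java.time.LocalDateTime;
import java.time.format.TextStyle;
import java.util.List;
import java.util.Locale;

// Illustrative sketch of the year -> month -> day-of-week member naming
// described above; TimeHierarchy and members() are hypothetical names.
class TimeHierarchy {
    // Returns the hierarchy members for a timestamp,
    // e.g. 2012-02-21T11:52:13 -> [2012, 2012Feb, 2012FebTue].
    static List<String> members(LocalDateTime ts) {
        String year = String.valueOf(ts.getYear());
        String month = year + ts.getMonth().getDisplayName(TextStyle.SHORT, Locale.ENGLISH);
        String dayOfWeek = month + ts.getDayOfWeek().getDisplayName(TextStyle.SHORT, Locale.ENGLISH);
        return List.of(year, month, dayOfWeek);
    }
}
```

Only these three member names would then be kept in the in-memory database, while the full timestamp goes to the relational database.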
5.3 Load/Unload of the Database

In Section 2.2.1, the XES meta-model was presented. Of all the elements of the XES structure, attributes are the most relevant when employing a multidimensional structure for analysis. Case attributes and event attributes are used to create the dimensions of a hypercube together with their corresponding members. Therefore, they have to be loaded into the Palo in-memory database so that they can be easily accessed for the process cube creation. As discussed in the previous section, due to sparsity issues, the user is asked to decide upon a smaller set of attributes to be used as dimensions in the process cube. The rest of the attributes are stored in a relational database (RDB), as explained in Section 5.2. Besides traces, events and their corresponding attributes, the log also keeps information regarding the classifiers, the extensions and the global attributes. Even though unnecessary for OLAP operations, these elements are indispensable for the event log reconstruction. Therefore, they are stored separately in RDB tables and used later for unloading purposes. The loading of an event log into the databases consists of two steps. First, a special tree structure is created from the event data to facilitate the construction of the process cube. Secondly, the created structure is used for building the process cube and storing parts of the event data in the RDB in an easy-to-access manner. We use pseudocode to present both steps.

Algorithm Parsing(log)
1. ▷ log is the event log read from the file
2. Create a log id that uniquely identifies the log
3. Create tables in the RDB with the attributes of the log, the classifiers, the extensions and the globals
4. ▷ rootNode is the root node of a tree structure
5. ▷ eventCoordinates is a list of attribute values for all events in the log
6. Determine the number of traces in the log (nt)
7. for i ← 1 to nt
8.   do traces[i] ← log.getTraces();
9.      rootNode.addNodes(traces[i].getAttributes());
10.     Determine the number of events in traces[i] (ne)
11.     for j ← 1 to ne
12.       do eventCoordinates ← NULL;
13.          events[j] ← traces[i].getEvents();
14.          rootNode.addNodes(events[j].getAttributes());
15.          eventCoordinates.setEvent(log id, traces[i].getAttributes(),
16.                                    events[j].getAttributes());
17.          j ← j + 1;
18.     i ← i + 1;
19. return rootNode, eventCoordinates

In the first step, the classifiers, the extensions and the global attributes are extracted from the XES log structure and loaded into RDB tables. In that sense, a log id is assigned to the log and is used to distinguish the classifiers, extensions and global attributes of this log from those of other already existing or yet to be created logs. Traces and events with their attributes are added to a tree structure with the rootNode as the root element of the tree. The rootNode contains all the links of the tree. Nodes are added to the tree structure in the following way: the first hierarchical level of the tree presents properties of cases and events, the next level contains the values of the properties. Other hierarchical levels are also possible. In this project, we implemented hierarchies for time attributes. As such, in case of time attributes, years, months and days of week form the levels of the tree. Besides the rootNode, a set of event coordinates is determined for each event, on lines 15-16 of the Parsing algorithm. Event coordinates give all the information needed to place an event back in an event log. Since an event is part of a trace and a trace belongs to a log, trace and log information is also included in the event coordinates. Consequently, event coordinates are composed of the log id, the trace id with the corresponding trace attributes, and the event id with the event attributes.

Algorithm Loading(rootNode, eventCoordinates)
1. ▷ Create the process cube PC
2. Determine the number of dimensions nd in the rootNode
3. Allow the user to select a subset Md of all available dimensions
4. for each i ∈ Md
5.   do Di ← rootNode.getChildren(i).getLeafs();
6.      if rootNode.getChildren(i) is a time attribute
7.        then Hi ← createHierarchy(rootNode.getAttribute(i));
8. Create PC with the dimensions Di, i ∈ Md, with unique cell values
9. Determine the total number of events in the log (nte)
10. for i ← 1 to nte
11.   do k ← 0;
12.      columnValues ← NULL
13.      for j ← 1 to nd
14.        do if j ∈ Md
15.             then k ← k + 1;
16.                  mk ← eventCoordinates.getEvent(i).getAttribute(j);
17.             else columnValues.addAttribute(eventCoordinates.getEvent(i).getAttribute(j));
18.      columnValues.addAttribute(getCell(m1, ..., mk));
19.      RDB.addRow(columnValues);

Once the rootNode and the eventCoordinates are created, they can be used to build the process cube PC. All the trace and event attributes accessible from the rootNode are potential dimensions of the process cube. Due to sparsity issues, the user is allowed to select a subset of these to be the actual dimensions of the cube. Of course, selecting all the dimensions is also possible. For each of the chosen dimensions, its corresponding member elements and hierarchy are added, in lines 5 to 7 of the Loading algorithm. After populating the dimensions with elements, the process cube PC is created based on these dimensions. At this point, the process cube PC has dimensions and elements, but does not have any values in the cells. The eventCoordinates provides both the coordinates of a cell and the set of its corresponding events. In Section 5.2, it was explained that event data cannot be directly stored in a cell, due to cell limitations. Instead, each cell is given a cell id and the rest of the event data which is not yet saved in the PC can be stored in RDB tables, with the cell id as a column. As such, the members of the PC dimensions are identified in eventCoordinates, line 16, and are used as parameters for the getCell(m1, ..., mk) function which identifies a cell, line 18.
The members that are not among the PC dimension members are added to the RDB together with the cell id, line 19.

Algorithm Unloading(PC)
1. ▷ log is the event log to be created after unloading
2. ▷ trace is a trace of the event log
3. ▷ event is an event of the event log
4. log ← NULL;
5. Add all the classifiers, extensions and globals to the log, from the RDB tables
6. ▷ eventList is a list with the corresponding coordinates of all the events
7. ▷ attributeList is a list with all the attributes corresponding to an event
8. Create the eventList from both PC dimensions and RDB columns
9. Determine the number of events in the eventList (ne)
10. for i ← 1 to ne
11.   do attributeList ← eventList.getEvent(i).getAttributes();
12.      trace ← NULL;
13.      event ← NULL;
14.      Determine the number of attributes in eventList (na)
15.      for j ← 1 to na
16.        do attribute ← attributeList.getAttribute(j);
17.           if attribute is a log attribute
18.             then logAttributes.add(attribute);
19.           else if attribute is a trace attribute
20.             then traceAttributes.add(attribute);
21.           else eventAttributes.add(attribute);
22.      event.addAttributes(eventAttributes);
23.      if logAttributes are in log
24.        if there is a trace with the traceAttributes in log
25.          ▷ k is the position of the trace in log
26.          then log.getTrace(k).add(event);
27.          else trace.addAttributes(traceAttributes);
28.               trace.add(event);
29.               log.add(trace);
30.      else trace.addAttributes(traceAttributes);
31.           trace.add(event);
32.           log.addAttributes(logAttributes);
33.           log.add(trace);
34. return log;

Figure 5.1, presented earlier, shows the basic flow of event data in the system. From the event log, event data is loaded into both the Palo and MySQL databases; it can be retrieved from them at unloading and used to recreate the initial event log. Even though such a functionality does not yet add any value, it can be used to test the correctness of loading and unloading event data to and from the relational and OLAP structures.
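In miniature, the load/unload round trip described by these algorithms can be sketched as follows. This is an illustrative stand-in that uses plain maps in place of the Palo cube and the MySQL table; all class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: events are stored by cell id (Loading) and grouped
// back into traces by their trace attributes (Unloading). Real storage
// uses Palo for the cube and MySQL for the rows.
class MiniCube {
    private final Map<List<String>, Integer> cellIds = new HashMap<>(); // cube coordinates -> cell id
    private final List<Map<String, String>> rows = new ArrayList<>();   // stand-in for the RDB table
    private int nextCellId = 1;

    // Loading: assign a cell id to the dimension coordinate (cf. getCell(m1, ..., mk))
    // and store the remaining attributes as an RDB row with a cellId column.
    void load(List<String> coordinate, Map<String, String> otherAttributes) {
        int cellId = cellIds.computeIfAbsent(coordinate, c -> nextCellId++);
        Map<String, String> row = new LinkedHashMap<>(otherAttributes);
        row.put("cellId", String.valueOf(cellId));
        rows.add(row);
    }

    // Unloading: rebuild traces by grouping events on a trace attribute;
    // events with the same trace attribute end up in the same trace.
    Map<String, List<Map<String, String>>> unload(String traceAttribute) {
        Map<String, List<Map<String, String>>> traces = new LinkedHashMap<>();
        for (Map<String, String> row : rows)
            traces.computeIfAbsent(row.get(traceAttribute), t -> new ArrayList<>()).add(row);
        return traces;
    }
}
```

Two events with identical coordinates share one cell id, mirroring Figure 5.2, and the grouping in unload mirrors the line-24 check of the Unloading algorithm.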
In what follows, we describe the unloading procedure to complete the scenario. For the Unloading algorithm presented in this thesis, we consider the complete list of events from the initially loaded event log. Nevertheless, this list can be filtered and, as a result, only a subset of all events may be considered at unloading. In any case, the pseudocode does not change; only in line 8 is the eventList created differently, this time based on the filtering results. First, the initially NULL log is populated with classifiers, extensions and global attributes from the RDB tables. Then, event data from both the RDB and the Palo OLAP cube is extracted and used to create an eventList structure. The eventList structure is similar to the eventCoordinates structure created in the Parsing algorithm, in the sense that the eventList contains enough information to place events back in event logs. For instance, the event id gives the order of the event in the log. Note that information like the log id, the case id and the event id is discarded when constructing the event log, as it was created at loading and was not initially part of the log. The eventList contains a list of three types of attributes: log attributes, trace attributes and event attributes. The event attributes, for instance, can be used to create an event, as in line 22. The trace attributes can be used to create a trace. However, since a trace may correspond to multiple events, we check, in line 24, whether a trace with the same attributes already exists. Then, the created event is added either to the already existing trace or to a newly created trace. A similar test is required when adding the log attributes to the log, to avoid repeating data in the new event log.

5.4 Basic Operations on the Database Subsets

Once the event data is loaded in the databases, the question arises what the system can do with it. First, the system benefits from the multidimensional structure of the OLAP cube.
In that sense, inspecting different dimensions of the cube is possible. Moreover, the system supports a set of basic OLAP operations, e.g., slice, dice, drill-down, roll-up and pivoting. Filters can be created that slice or dice the cube in various ways. Default filters exist for the drill-down and roll-up operations that can be applied on request to specific chosen dimensions. Each filter is stored for further use, unless explicitly deleted. Not only can the event data in the cube be filtered, it can also be visualized from different perspectives. This functionality is offered by the pivoting operation.

Figure 5.3: Dice operation. (a) Dice filtering: five elements are selected on the EVENT conceptEXT name dimension. (b) Dice filtering result: while the event log corresponding to PC has 33 events, the event log corresponding to PCdiced has only 14 events.

5.4.1 Dice & Slice

A dice operation is realized when multiple members are selected for one or more dimensions. Given a process cube PC, the result of a dice is a subcube PCdiced for which only a subset of members is selected on particular dimensions, while the rest is the same as in the initial cube. Figure 5.3a shows a dice filter applied on the EVENT conceptEXT name dimension. With dice, multiple elements of a dimension can be selected. In Figure 5.3a there are five task names selected; the rest of the elements are simply discarded for the EVENT conceptEXT name dimension. The result of the dice operation is shown in Figure 5.3b. Of the 33 events present in the event log corresponding to the process cube PC, only 14 are considered for PCdiced. The number of cases remains the same. A dice operation can influence more than one dimension. For example, together with the filter on the EVENT conceptEXT name dimension, a subset of timestamps can be selected on the EVENT TIME timeEXT timestamp dimension¹. A dice operation allows the selection of any element of the time hierarchy.
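Selecting a non-leaf member of the hierarchy amounts to selecting all of the leaf members below it, which can be sketched as follows (illustrative names, not the Palo API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch: dicing on a hierarchy member selects all the leaf
// members below it. The tree and member names are hypothetical examples.
class HierarchyDice {
    // children maps each non-leaf member to its direct children.
    static List<String> leavesUnder(String member, Map<String, List<String>> children) {
        List<String> leaves = new ArrayList<>();
        if (!children.containsKey(member)) { // a leaf selects only itself
            leaves.add(member);
            return leaves;
        }
        for (String child : children.get(member))
            leaves.addAll(leavesUnder(child, children));
        return leaves;
    }
}
```

For instance, selecting a month member whose children are 2012FebMon, 2012FebWed and 2012FebThu yields exactly those three day-of-week leaves.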
For example, one can select the years 2012 and 2013 out of a set of years containing 2010, 2011, 2012 and 2013. The month level can also be considered for a dice. For instance, selecting the month 2012Feb in 2012 is also a dice, since it contains the following set of elements: 2012FebMon, 2012FebWed and 2012FebThu. For dimensions with numerical members, a dice filter can be created by selecting a certain range. For example, for the SUMLeges dimension, all the events with SUMLeges between 100.5 and 500.2 can be selected.

¹ In the dimension name, the TIME tag is used to recognize a dimension corresponding to a time attribute. Other examples of such dimensions are: EVENT TIME dueDate, EVENT TIME plannedDate, EVENT TIME createdDate.

Figure 5.4: Slice operation. (a) Slice filtering: only a single event name, 01 HOOFD 060, is selected on the EVENT conceptEXT name dimension. (b) Slice filtering result: while the event log corresponding to PC has 33 events, the event log corresponding to PCsliced has only 2 events.

The slice operation is a particular type of dice. That is, a slice is performed when only a single member of one dimension is selected and the other members of that dimension are filtered out. Given a process cube PC, the result of a slice is a subcube PCsliced with the same dimensions as the cube PC, except for one, which has just a single member selected from the initial set of the dimension's members. Figure 5.4a shows a slice filter applied on the EVENT conceptEXT name dimension. From all the elements of this dimension, only 01 HOOFD 060 is selected. After creation, the slice filter is saved and, on request, is applied on the event data of the process cube. That is, only events with the event name 01 HOOFD 060 are considered for the new PCsliced cube. Figure 5.4b depicts the slice result on the process cube. In the top window, a Log Dialog shows information on the initial event log.
Note that the entire event log contains 4 cases and 33 events. The bottom window illustrates a Log Dialog containing information on the event log created after slicing. The new event log contains only 2 cases and 2 events. Consequently, there are only 2 events with the name 01 HOOFD 060 and they belong to 2 different cases. For a dimension with time attributes, a slice can be performed by selecting a leaf member, situated at the day-of-week hierarchical level. For example, for a timestamp dimension containing 2012 at the year level, 2012Feb at the month level and 2012FebTue at the day-of-week level, a slice can be executed by selecting the 2012FebTue element. Note that such a slice filters out all the events except for the ones that occurred on a Tuesday in February 2012, not the events on all Tuesdays of 2012 or on all Tuesdays in general.

5.4.2 Pivoting

The subcubes obtained after slice and dice operations can be visualized. In this project, the traditional 2D visualization is considered for the process cube visualization. As such, only two dimensions of the process cube can be visualized simultaneously. This is possible through the table of visualization. The rows of a table of visualization contain two dimensions of the process cube and also the corresponding filters created by the user. Even though based on the elements of two process cube dimensions, the dimensions of visualization are usually not identical to the former. The main difference is that their elements can be both results of filtering and elements of different hierarchical levels. In that sense, two neighboring visualization cells can contain overlapping data, while this is never the case for two neighboring cells of the process cube. The restriction of visualizing only two dimensions at a time places no constraint on which two dimensions to select.
That is, any combination is possible and either of the two dimensions can be substituted with a new PC dimension at any time. By swapping one dimension for another, the visualization perspective of the PC cube changes. This operation is known as pivoting, or the rotation operation.

Figure 5.5: The result of the pivoting operation. The rotation is obtained by replacing the concept names dimension with the timestamp dimension, while SUMLeges is replaced by the concept names dimension.

Figure 5.5 shows the effect of the pivoting operation on the visualization table. In the visualization table at the top of the image, SUMLeges and the event names are the two dimensions of visualization. In the second table of visualization, the same process cube is visualized through the event names and the timestamp dimensions. Also, while the event names dimension was initially on the x axis, in the second table it is moved to the y axis.

5.4.3 Drill-down & Roll-up

The drill-down operation is realized by unfolding a member situated at a hierarchically superior position into a set of members at a lower hierarchical level. Figure 5.6 shows a table of visualization with one dimension corresponding to the timestamp and another dimension corresponding to the event name. Elements of the timestamp dimension can be selected from a hierarchy. For example, the 2012 member is selected and a drill-down operation is performed on it. As months follow years in the time hierarchy, all the months corresponding to year 2012 are shown. Based on the definition of drill-down from Section 2.3.1, the children of 2012 are added to the timestamp dimension of the table of visualization and the 2012 element is removed. In our project, we also keep the 2012 element, because it is useful to compare process mining results corresponding to elements on different hierarchical levels, e.g., the process of 2012 with the process of 2012Mar.

Figure 5.6: Drill-down operation on the timestamp dimension.
Year 2012 is drilled down to its months.

The roll-up operation is realized by folding certain members of a dimension into one member which is hierarchically superior.

Figure 5.7: Roll-up operation on the timestamp dimension. The months corresponding to year 2012 are folded back.

Figure 5.7 shows a table of visualization corresponding to the same timestamp and event name dimensions. Based on the definition of roll-up from Section 2.3.1, the children of 2012 are removed from the timestamp dimension of the table of visualization and the 2012 element is added. In our project, there is no need to add the 2012 element, as it is already present from the drill-down operation.

5.5 Integration with ProM

After filtering and selecting a particular side of the process cube for visualization, the Unloading algorithm presented in Section 5.3 is applied to materialize event logs for the different visualization cells. The resulting event logs are given as input to a ProM plugin. Each ProM plugin has a plugin context object that is required to run in the ProM framework. Some plugins are impossible to use outside ProM, for example, due to the absence of a specific predefined plugin context. Therefore, to allow more flexibility, our application is adjusted to run in ProM. Hundreds of ProM plugins could potentially be used. However, we select only a predefined list of plugins to run in our application. The reason for this is twofold. First, not all of the existing plugins are relevant for the purpose of the PROCUBE tool. One of the objectives is to provide the user a means to visually compare multiple subprocesses. Visual comparison of several subprocesses becomes difficult when there is a different visual representation for each process. In that sense, plugins that provide immediate visualization results are quite handy. If the user has to make changes to get a specific result, repeating them for each visualization window can become troublesome.
For example, the user can miss a step, and then the results that are compared are not the intended ones. Also, any change in one window implies changes in all windows. Naturally, manual changes take time, while automatic changes are impossible, due to the different event data per cell. Another problem is that the graphical space is limited. Running multiple plugins that provide in-depth analysis, e.g., the LTL Checker, in parallel is not very practical due to space restrictions, while repeating the changes for each individual process is very time consuming. In conclusion, we aim at quick, high-level analysis with immediate results on multiple sublogs, rather than time-consuming, in-depth analysis on a single log or very few logs. Another type of ProM plugins are the ones created to filter event logs. Since filtering is already implemented in the PROCUBE tool, part of the functionality of these plugins is redundant. The second reason is related to the fact that providing a generic way of calling all the ProM plugins is difficult to realize. Each plugin has its own specific input and output parameters and also its own methods. A solution for calling all plugins in a generic way would be to create a Wrapper that uniformly integrates all ProM plugins. For this project, we focus mainly on plugins that return a JComponent, which can be directly used to display the result. The Alpha Miner, for instance, returns a Petri net object. In that case, the visualization component for the Petri net has to be created first, and only then can the visualization result be shown.

Table 5.1: The list of ProM plugins used in the PROCUBE tool.
1. Log Dialog
2. Dotted Chart
3. Fuzzy Miner
4. Heuristics Miner
5. Working-Together Social Network
6. Handover-of-Work Social Network
7. Similar-Task Social Network
8. Reassignment Social Network
9. Subcontracting Social Network

Moreover, some plugins require going through a sequence of wizard screens to get to the final result.
Even when creating a predefined set of parameters to avoid following the wizard screens, a new set of parameters is required for each individual plugin. Furthermore, for our project, it is not possible to set the parameters only once, beforehand, and use them for all the visualization cells. That is because the parameters of the initial event log usually do not correspond to those of the sublogs resulting after filtering, as the corresponding event data is different. In that sense, for such plugins, following the wizard sequence for each sublog individually is a must. Again, in this case, plugins with immediate results are preferred over the ones preceded by a sequence of wizard screens. Derived from all the considerations mentioned above, Table 5.1 provides the list of plugins currently used in our project. The Log Dialog and the Dotted Chart give a panoramic view on the sublog processes. The Heuristics Miner and the Fuzzy Miner are used to discover process models from sublogs. The Social Network plugins provide details on the resource perspective of the sublogs. There is no doubt that plugins such as the Basic Performance and the Conformance Checker would add considerable value to the process analysis and would allow for more extensive use case analysis. Therefore, we suggest adding such plugins as potential further work.

5.6 Result Visualization

The main visualization challenge of the project is to display multiple process mining results at the same time, in an integrated way. The size of the physical screen is the main limiting factor when it comes to displaying multiple windows. Therefore, we apply several solutions to cope with this issue. First of all, we create a new frame, detachable from the main frame, and use it to place all process mining results. Thus, should two screens be available, the table of visualization can be placed on one screen, while the plugin results can be displayed on the second screen.
On this new frame, windows are organized next to each other, in an easy-to-identify way. Even though such a frame layout is already sufficient for the visualization of the plugin results, we decided to make some changes, as it lacked the desired flexibility. Hence, replacing the windows with dockable ones, to allow moving them around, is one of the most important visualization features supported in the project. A large part of the dockable functionality is taken from DockingFrames 1.1.2² and adjusted for the project's needs. In the following, we explain the framework of the windows, with details related to the layout of the windows frame. Then, we give a list with the frame functionality items. Finally, we show the result visualization obtained using the PROCUBE plugin. Figure 5.8, taken from [47], shows the framework based on which dockable windows are created. Dockables are not stand-alone windows. They require the support of a main window (the Main-Frame). The main window is usually a JFrame. As long as this frame is visible, so are the rest of the components on it. Non-dockable panels are just directly connected to the main frame. Consequently, the main frame can consist of several panels, with different data displayed on them. To support floating panels, however, an additional layer is added between the panels and the main frame. The components of this layer are the so-called Stations. Among their purposes is to allow the user to drag & drop panels and to minimize or maximize windows. A central controller is used to wire all the objects of the framework together. It manages the way elements look and their position in the frame, and it monitors all the occurring changes within windows. Further, each panel is wrapped into a dockable.

Figure 5.8: Dockables functionality. Panels are wrapped into dockables. Dockables are put onto stations which lay on the main frame. As such, dockables can be moved to different stations.
Dockables are the final components and they are the ones that actually offer the floating behaviour. To display dockables in a certain layout, a Grid component is used. The matrix of the grid gives an organized way of displaying windows on the screen. For our project, the matrix of the grid component corresponds to the matrix of the table of visualization. That is, the plugin results for different cells are shown in the same order as the one used to display the cells in the visualization table. In view of the above approach, the following visualization capabilities are supported:
• Default layout with all the dockables normalized. Normalized dockables are placed on the main visualization frame, in the way cells are displayed in the visualization table.
• Dockables can be maximized. A maximized dockable takes all the space it can, most of the time by covering other dockables.
• Dockables can be minimized. Minimized dockables are not immediately visible. They can be restored to a normal state by pressing the minimization button again.
• Dockables can be extended. Once extended, dockables have their own window, independent of the main visualization frame. This functionality is very useful as it allows, for example, moving windows with plugin results to different screens.
• Via drag & drop, dockables can be placed on any part of the screen. For example, by dragging one dockable onto the place of another, the two are swapped with each other.
• When multiple plugin results are available for the same visualization cell, each result window is a new tab in a tabbed pane. That makes it easy to quickly identify plugin results corresponding to the same visualization cell.
• Unnecessary windows can always be closed.

² http://dock.javaforge.com/

Figure 5.9: Visualization of plugin results in the PROCUBE tool. Each plugin result is displayed in a dockable window and can be part of a tabbed pane.

Figure 5.9 shows several windows with plugin results.
Two Log Dialogs, a Fuzzy Miner, two Heuristics Miners and a Social Network form the visualization results. Multiple tabs can be distinguished, since multiple plugin results exist for the same visualization cell. All the windows are dockable. After undocking a window, the remaining windows are automatically rearranged on the screen.

Chapter 6 Case Study and Benchmarking

In the previous chapter, the implementation of the process cube was described as a combination of external technology (Palo, MySQL, ProM) and newly-introduced process-cube-related features. In this chapter, we continue with an evaluation of the functionality using different event logs and an assessment of the PROCUBE system performance. The results presented in the chapter are based on the event data of an artificial digital photo copier event log and of a Dutch municipality event log.

6.1 Evaluation of Functionality

In this section we choose both a synthetic and a real-life event log to ascertain the capabilities of the PROCUBE system. The functionality that is evaluated comprises loading an event log into the relational and in-memory databases, executing OLAP operations on the process cube, unloading an event log from the databases, generating ProM results based on the event log and visualizing ProM results.

6.1.1 Synthetic Benchmark

The synthetic event log we use in this section is taken from the collection of synthetic event logs found at http://data.3tu.nl/repository/collection:event_logs_synthetic. It is an artificial event log for a simple digital copier, also used as a running example in [33]. The copier is specialized in copying, scanning and printing documents. As such, users can request copy/scan or print services. The standard procedure followed by the copier is image creation, image processing for quality enhancement, and then, depending on the request, either printing the image or just sending it to the user.
The generation of the image for a print request differs from the one for a copy/scan request. The digital photo copier event log contains 100 process instances, 76 event classes and 40995 events. Traces can be separated, based on their Class attribute, into Print and Copy/Scan. For each event, the name of the activity is given, the lifecycle transition, which indicates whether an activity is started or completed, and a timestamp of the recorded activity. In the following, based on the digital photo copier process described in [33], we select a few scenarios and use them to present the capabilities of the PROCUBE tool. In Figure 9 from [33], two subprocesses, ‘Interpret’ and ‘Fusing’, are isolated. For our first scenario, the target is to load the entire digital photo copier event log into the databases and filter it in such a way that, after unloading and applying the Fuzzy Miner plugin, the ‘Interpret’ subprocess from Figure 9 in [33] is obtained. At loading, the TRACE Class and the EVENT conceptEXT name attributes are selected as dimensions of the process cube. After loading, we perform a dice operation on the EVENT conceptEXT name dimension of the process cube, by selecting the following subset of elements: Interpretation, Post Script, Unformatted Text and Page Control Language.

Figure 6.1: The ‘Interpret’ subprocess, obtained by dicing the process cube on the task name.

Further, an event log is materialized from the filtered event data and is used as a parameter for the Fuzzy Miner plugin. The result is shown in Figure 6.1. The correspondence between our result and the one in [33] is easily noticed.

Figure 6.2: The ‘Interpret’ subprocess with its corresponding branches. The visualization results allow for easy comparison of subprocesses.

For further testing, we consider a second scenario, where the same ‘Interpret’ process is taken, but now the subprocesses of each of the three branches of the ‘Interpret’ process are isolated by filtering on the task name.
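The dice operation used in these scenarios can be sketched in a few lines. This is a conceptual Python illustration of dicing on an event attribute, not the Palo/ProM implementation; the attribute key `"concept:name"` stands in for the EVENT conceptEXT name dimension.

```python
# Sketch of a dice operation: keep only events whose value for a chosen
# dimension lies in a selected subset of members; traces left without
# events are dropped from the materialized log.

KEEP = {"Interpretation", "Post Script", "Unformatted Text",
        "Page Control Language"}

def dice(log, dimension, members):
    """Filter a log (list of traces, each a list of event dicts) on one
    dimension; returns the diced log."""
    diced = []
    for trace in log:
        events = [e for e in trace if e.get(dimension) in members]
        if events:                      # drop traces that become empty
            diced.append(events)
    return diced

# Toy log with two traces (attribute names are illustrative).
log = [
    [{"concept:name": "Interpretation"}, {"concept:name": "Fusing"}],
    [{"concept:name": "Capture Image"}],
]
filtered = dice(log, "concept:name", KEEP)
```

After the dice, only the first trace survives, reduced to its ‘Interpretation’ event, analogous to isolating the ‘Interpret’ subprocess before running the Fuzzy Miner.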
Figure 6.3 shows the main visualization frame with four windows. The first window, on top, gives the same ‘Interpret’ process model. The three windows at the bottom illustrate the subprocesses of the three branches of the process. Such visualization results are powerful for larger processes. First of all, multiple filtering results of the same process can be visualized at the same time. After filtering, the initial process is not discarded; it can be reused again and again for filtering purposes. Presenting processes next to each other highlights similarities and differences between them.

Figure 6.3: Zooming in on the first part of the copier process model and on the first part of its corresponding ‘Print’ and ‘Copy/Scan’ subprocesses.

In the last scenario, the entire copier process model is discovered using the Heuristics Miner plugin. First, two slice operations are performed on the TRACE Class dimension. Their results are used to discover the ‘Print’ and the ‘Copy/Scan’ subprocesses. The resulting process models are quite large, which makes it difficult to visualize them entirely. Therefore, we zoom in on the first part of the processes. By placing all the models in parallel, the paths for the ‘Print’ and ‘Copy/Scan’ subprocesses can be distinguished in the copier process model. One branch of the process starts with the ‘Copy/Scan’, ‘Collect Copy/Scan’ and ‘Place Doc’ activities, corresponding to the ‘Copy/Scan’ subprocess, and the other branch starts with the ‘Remote Print’, ‘Read Print’ and ‘Rasterization’ tasks, corresponding to the ‘Print’ subprocess. The same behavior is shown for this part of the process in Figure 7 from [33]. By zooming in on the rest of the subprocesses, their entire behavior can be observed and their control-flows can be compared.

6.1.2 Real-life Log Data Example

For the real-life example, we select one of the event logs of a Dutch municipality, known under the name WABO1.
The WABO1 event log consists of 691 process instances, 254 event classes and 22130 events. The data captures process events from October 2010 till November 2012, with an overall duration of 758 days. At the case level, the following attributes are available:

• parts attribute, specifies for which building parts the permit is requested: “Bouw” (355 cases), “Sloop” (52 cases), “Kap” (32 cases), etc.
• SUMleges attribute, gives the total cost of a building permit application, e.g., 192.78, 284.55, 1992.06.
• last phase attribute, denotes the outcome of a permit request application. Usually a case finalizes with “Vergunning verleend” (permit granted, in 344 cases) or “Vergunning geweigerd” (permit declined, in 2 cases). However, there are a number of cases that end with “Procedure afgebroken” (procedure aborted, in 74 cases).
• caseStatus attribute, indicates whether a case is still open (“O”) or already closed (“G”). For a closed case, no further objections are possible. However, for an open case, objections can still be expected.

Event attributes give information related to the lifecycle of an event, the resource that executes a task or is responsible for it, and different time characteristics, e.g., the time when a task was created or the time when an event was recorded. The lifecycle of an event comprises only a single transition: complete. That is, all the work items in the event log are completed. There are 19 resources that execute tasks. The majority of the tasks are performed by resource number 560872 (30.764 %).

Figure 6.4: Dotted charts for a process of a Dutch municipality using absolute time. The influx of new cases is rather constant over time (top chart). The influx of new cases is decreasing over time (bottom left chart). For the bottom right chart, no pattern is identified.

Figure 6.4 shows three dotted charts for three of the subprocesses of a Dutch municipality using absolute time.
These subprocesses are obtained by slicing the process cube on the TRACE last phase dimension. In all three cases, absolute, real times are used. Moreover, cases are sorted by the time of the first event. The top chart corresponds to the building permit request applications finalized by granting a permit. For this subprocess, the initial events form an almost straight line. Consequently, there is a close to constant arrival rate of new cases. The bottom left chart corresponds to canceled applications. The dotted chart shows that the influx of incoming new cases that are eventually canceled is decreasing over time. The last chart, on the bottom right part of the image, corresponds to declined cases. Due to the small number of declined applications, it is difficult to identify a pattern in the arrival of such cases. Figure 6.5 shows three dotted charts for the same subprocesses using relative time, i.e., all cases start at time zero, with emphasis on the duration of a case. Typically, both approved and canceled cases are handled in 1-2 months, although a large share of them is finished already after 10-20 days. Nevertheless, there are cases that take up to 1.5 years to complete. For instance, the duration of handling the declined cases is quite large: one of the cases takes a year before it is finally rejected. Such behavior is also present for approved and canceled cases, although only sporadically, as exceptions. Since the event data comes from a real-life log, we do not exclude the possibility of recording errors for such cases.

Figure 6.5: Dotted charts for a process of a Dutch municipality using relative time. The duration of handling a building permit request that is eventually approved is typically about 1-2 months. The same holds for canceled applications. Requests that are declined take longer to be handled.
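The relative-time view of the dotted chart amounts to shifting each case so that its first event is at time zero; case duration is then the time of the last shifted event. A minimal sketch (illustrative, not the ProM dotted chart code):

```python
# Sketch: convert a case's absolute timestamps to relative time (first event
# at zero), as used in the relative-time dotted charts; the maximum relative
# time is the handling duration of the case.

from datetime import datetime, timedelta

def relative_times(case):
    """Shift a case's event timestamps so the earliest event is at zero."""
    start = min(case)
    return [t - start for t in case]

# Toy case with three events (dates are made up for illustration).
case = [datetime(2011, 3, 1), datetime(2011, 3, 15), datetime(2011, 4, 20)]
rel = relative_times(case)
duration = max(rel)          # handling time of the case
print(duration.days)         # -> 50
```

Sorting cases by such durations is what makes the 10-20 day and 1-2 month clusters visible in Figure 6.5.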
Figure 6.6: Representation of the Working-Together Social Network for resources working on Aanhoudingsgrond van toepassing (AH) activities and on Waw-aanvraag buiten behandeling (AWB) activities.

Mining social networks is yet another ProM feature supported in the PROCUBE plugin. The social network miners presented in [9] can be directly applied to the event logs of the subprocesses of a process cube. In this section, we present an example of a Working-Together Social Network for resources in the WABO1 process, working on Aanhoudingsgrond van toepassing (AH) activities and on Waw-aanvraag buiten behandeling (AWB) activities. In both networks, a cluster of resources working together and several isolated resources can be distinguished. Except for a few isolated resources, i.e., 560589, 560999 and 560950, the AH network contains the same elements as the AWB one. This is not the case when it comes to resource interactions in the working-together clusters. Even though the AH cluster contains almost the same resources, its chain of interaction changes. That is, compared to the AWB network, in the AH one only 560912 still works directly with 2670601 and only 3273854 still works directly with 560925. A rather large percentage of the resources involved in the entire process, i.e., 19 resources, are also present in the networks: 84% in the first network and 68% in the second network. This indicates that the majority of the resources may not be specialized in a particular type of activity, but rather execute different types of activities depending on the case. Other network graphs and plugins can be used to fully confirm this statement. Consequently, placing social networks next to each other offers a parallel view of people’s interaction within an organization in various situations, e.g., when handling different tasks.
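The working-together idea from [9] can be illustrated with a simplified sketch: two resources are connected whenever they perform work in the same case. This is a conceptual Python illustration, not the ProM social network miner (which also supports weighted and directed variants).

```python
# Simplified sketch of a working-together network: an undirected edge is
# added between every pair of resources that appear in the same case.

from itertools import combinations

def working_together(cases):
    """cases: list of resource-name lists, one per case.
    Returns the set of undirected edges (sorted pairs)."""
    edges = set()
    for resources in cases:
        for a, b in combinations(sorted(set(resources)), 2):
            edges.add((a, b))
    return edges

# Toy data using WABO1-style resource numbers (pairings are made up).
cases = [["560872", "560912", "2670601"],
         ["560872", "560925"]]
edges = working_together(cases)
```

Building such edge sets per subcube (e.g., one for AH activities, one for AWB activities) is what allows the side-by-side comparison of networks shown in Figure 6.6.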
6.2 Performance Analysis

In this section the performance of the PROCUBE system with respect to loading and unloading operations is analysed. Clearly, loading time affects the productivity of the system only once, when the event log data is loaded into the databases, whereas the unloading operation can be performed multiple times, i.e., whenever a process mining technique is applied to the events in the cube (possibly a subcube). The time required by these operations has to be small enough to guarantee adequate user interaction with the tool. In what follows, the PROCUBE tool is subjected to several tests.

Test 1. For the first test, subsets of the WABO1 event log are loaded into and unloaded from the database. These subsets contain 160, 338, 687, 1368, 2732, 5505, 11061, and 22130 events. The latter sublog is actually the entire WABO1 event log. The loading and unloading speed is assessed for each sublog in 4 distinct configurations of the in-memory database: 2D, with dimensions TRACE parts and EVENT timestamp; 3D, which adds EVENT orgEXT resources to the 2D dimensions; 4D, which adds EVENT created to the 3D dimensions; and 5D, which adds the TRACE termName dimension to 4D. This test illustrates the dependency of the loading and unloading time for a typical selection of dimensions.

Test 2. The second test illustrates the effects of sparse dimensions on the loading and unloading performance. This test is performed on two 2D configurations and follows the methodology of Test 1. The dimensions of these two cubes are summarized in Table 6.1.

Cube          | Dimension              | Nr. of members
Low sparsity  | TRACE termName         | 12
              | EVENT orgEXT resources | 20
High sparsity | EVENT taskDescription  | 73
              | EVENT conceptEXT name  | 629

Table 6.1: Summary of dimensions for the 2D cubes in Test 2.

Test 3. For the last test, the WABO1 event log is split into several non-overlapping sublogs and the total unloading time of these sublogs is compared to the unloading of the entire WABO1 event log.
This test illustrates that the filtering operations and extraction of sublogs do not incur any additional penalty on the unloading time.

Figure 6.7: Loading times for Test 1 (2D, 3D, 4D and 5D configurations; time in seconds against the number of events, both axes logarithmic).

Test 1

Let us begin by showing the loading times for this test setup in Figure 6.7. Although both scales on the figure axes are logarithmic, it is easy to see that the loading time increases linearly with the number of events in the log. Moreover, the loading time is practically independent of the number of cube dimensions. The latter remark suggests that the loading times per dimension into the relational database and the in-memory database are about the same, i.e., if one of the dimensions is moved from the relational database to the cube, the loading time does not change. Moreover, loading implies just one constant set of operations per event, and is therefore independent of the number of dimensions in the created cube. Of course, the amount of memory used for the cube increases with the number of dimensions.

Figure 6.8: Unloading times for Test 1 (2D, 3D, 4D and 5D configurations; time in seconds against the number of events, both axes logarithmic).

The situation during unloading is completely different, however. The unloading times for the same databases are shown in Figure 6.8. The time spent unloading the event log from the database increases considerably for larger numbers of cube dimensions. Of course, the unloading time heavily depends on the number of cube cells that do not have any events corresponding to them. These empty cells do not affect the loading time into the database, but they consume memory. The opposite is true during unloading, when each cell has to be verified. Hence, time is spent on empty cells, but these cells do not contribute any information to the resulting log.
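The per-cell verification described above can be sketched as a loop over all member combinations, where empty cells are detected cheaply in memory and only occupied cells trigger a relational query. This is an illustrative sketch under assumed names (`unload`, `fetch_events`), not the actual Palo/MySQL code.

```python
# Sketch of the hybrid unloading loop: probe each candidate cell in the
# in-memory cube first; only non-empty cells are fetched from the
# relational database (names and data are illustrative).

from itertools import product

def unload(cube, dim_members, fetch_events):
    """Visit every combination of dimension members; skip cells whose id
    is None (empty), otherwise fetch their events."""
    events = []
    for coord in product(*dim_members):
        cell_id = cube.get(coord)        # cheap in-memory lookup
        if cell_id is None:
            continue                     # empty cell: no SQL query issued
        events.extend(fetch_events(cell_id))
    return events

# Toy example: a sparse 2x3 cube with only two occupied cells.
cube = {("Print", "2011"): 7, ("Copy/Scan", "2012"): 8}
dims = [["Print", "Copy/Scan"], ["2010", "2011", "2012"]]
store = {7: ["e1", "e2"], 8: ["e3"]}
result = unload(cube, dims, lambda cid: store[cid])
```

All six cells are visited, but only two queries are issued, which is why empty cells cost time without contributing events.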
Generally, the sparsity of a cube increases with the number of dimensions, and so does the number of empty cells. For this particular case study, unloading an event log with 11061 events takes 27 s for a 2D cube and 688 s for a 5D cube, which illustrates a super-linear increase in the unloading time. A similar tendency can be observed with respect to the number of events in the log: the sparsity of the cube appears to increase at a super-linear rate with the number of events as well. These observations can be intuitively explained by two facts. First, all the dependencies in the hyper-cubic structures are multiplicative rather than additive, hence the sparsity is expected to rise exponentially. Secondly, event logs contain attributes which characterize the events very precisely, e.g., the timestamp or the name of a resource. Obviously, finding two events happening at exactly the same time is very difficult, to say the least, and hardly any resource is engaged in all activities. Hence, due to this precision of event logs, sparsity is unavoidable when a process cube is constructed, and unfortunately, the unloading time rises exponentially with the number of dimensions and events in typical situations.

Test 2

As mentioned previously, for this test we compare the loading and unloading times of cube configurations with different levels of sparsity.

Figure 6.9: Loading times for Test 2 (non-sparse and sparse configurations; time in seconds against the number of events, both axes logarithmic).

It can be seen in Figure 6.9 that the loading time does not vary much between the two cubes. The sparser cube appears to take only slightly longer to load. This behavior is expected and was explained by the results of Test 1. The examples from Test 1 show that the unloading time heavily depends on the number of in-memory dimensions and the number of events. However, the unloading time also depends on the sparsity of the cube.
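The multiplicative nature of the cell count, which drives this sparsity, can be made concrete with a small calculation (an illustrative sketch; the `sparsity` bound is a simplification assuming at most one event per cell):

```python
# Sketch: the number of cells in a cube is the product of the dimension
# sizes, so adding a dimension multiplies (not adds to) the cell count.

from math import prod

def cell_count(dim_sizes):
    return prod(dim_sizes)

def sparsity(n_events, dim_sizes):
    """Lower bound on the fraction of empty cells: at most one event per
    cell can be occupied, so at least 1 - n_events / n_cells are empty."""
    cells = cell_count(dim_sizes)
    return max(0.0, 1.0 - n_events / cells)

# Dimensions of 12 and 20 members give 240 cells; adding a 73-member
# dimension gives 17520 cells for the same number of events.
small = cell_count([12, 20])          # 240
large = cell_count([12, 20, 73])      # 17520
frac_empty = sparsity(240, [12, 20, 73])
```

With 240 events, the 2D cube could in principle be fully occupied, while the 3D cube is at least 98.6% empty, matching the intuition that dependencies are multiplicative rather than additive.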
The unloading times for the two cube configurations with the same number of events and dimensions but different sparsity are illustrated in Figure 6.10. Observe that the difference between the unloading times of the higher and lower sparsity cubes for the entire WABO1 event log is more than 10-fold.

Figure 6.10: Unloading times for Test 2 (non-sparse and sparse configurations; time in seconds against the number of events, both axes logarithmic).

One might expect a larger difference, as the ratio between the numbers of cells in the cubes is actually about 191, i.e., 73 × 629 cells of the sparse cube divided by 12 × 20 cells of the non-sparse cube, where 73, 629, 12 and 20 represent the numbers of elements of the dimensions of the cubes. Although all the cells have to be visited while unloading the event log, the hybrid nature of the database prevents a huge increase in the required time. The processing time required for empty cells is considerably lower than for cells with events, i.e., if an empty cell is detected, no query is issued to the relational database and the algorithm jumps to the next cell. Hence, with a 191-fold increase in the number of cells, the overall computational load increases only 10-fold.

Test 3

For the purpose of this test, the WABO1 event log with 22,130 events was loaded with the following two dimensions: EVENT timestamp and TRACE caseStatus. Furthermore, the drill-down operation is applied along the timestamp dimension.

Cell Name       | All EVENT | NO VALUE | 2010 | 2011 | 2012 | SUM
Unload time (s) | 61.9      | 0.001    | 4.4  | 32.5 | 26.3 | 63.2

Table 6.2: Summary of the unload time for Test 3.

In Table 6.2 we provide the unloading time for each cell in the visualization table. The column SUM stands for the sum of all columns except All EVENT. Observe that the time to unload the entire WABO1 event log from the database is only marginally lower than the cumulative time required for its separate components.
This result shows that the filtering operation does not incur any performance penalties on the developed database structure. Applying the same operation on event data stored in the relational database would require complex queries, and as such, would slow down the process. Therefore, fast filtering along the process cube dimensions is herein demonstrated, and it represents a benefit of the multidimensional database technologies.

6.3 Discussion

There are three main observations derived from the experimental results.

Observation 1. The loading time of an event log is practically independent of the number of dimensions of analysis. This fact is illustrated in Figure 6.7 and is a result of the loading algorithm. The event log is loaded into the database event by event, and for each event a constant number of operations is performed. Hence, the loading time depends only on the number of events.

Observation 2. The sparsity of the process cube heavily impacts the unloading performance. For the selected cell in the table of visualization, all combinations of the members of the dimensions of analysis which correspond to this cell are computed during the unload. For each combination, it is verified whether the associated process cube cell contains any events. Hence, a fixed amount of time is spent checking whether the cell is empty, i.e., the cell id is retrieved from the multidimensional database; if the cell id is NULL, the cell is empty and no further actions are performed with respect to this cell. If the cell contains events, additional time is spent unloading the event data from the relational database. Obviously, checking empty cells negatively impacts the unloading time. This is illustrated by the results of the second test, where with 191 times more cells to verify and the same number of events to unload compared to a normally sparse cube, the unloading time is 10 times larger.

Observation 3.
Manually splitting and analysing sparse dimensions, e.g., with several hundred dimension members, would be very time consuming and would probably overload the user. Realistically, only dimensions with at most 20 members are fit to be included in the process cube structure. Selecting such dimensions ensures low sparsity of the resulting process cube and results in good responsiveness of the developed tool. Test 1 was based on a typical selection of analysis dimensions and therefore its results characterize the operation speed of the tool in case of regular sparsity. Moreover, it was observed that the developed tool, including the processing step, e.g., the Log Dialog, delivers the result within 10 s for event logs smaller than 2000 events and process cubes with about 3 to 4 normally sparse dimensions of analysis. This performance is respectable and makes the tool applicable to different processes. Moreover, the main focus of the tool is to compare selected parts of the event log; thus, only small sections of the process cube will be unloaded for comparison in typical situations. Test 3 shows that the unloading time reduces when only a part of the cube is unloaded, which means that for 2000 events and 4 analysis dimensions, the average time of an operation will be far lower than 10 s. Furthermore, even if the entire cube is split into subcubes and all these subcubes are unloaded simultaneously, no performance penalty will occur, i.e., all subcubes will be processed within 10 s.

Chapter 7 Conclusions & Future Work

7.1 Summary of Contributions

This master thesis builds on the ideas presented in the PROCUBE project proposal [4]. The proposal suggests organizing event data from logs in process cubes in such a way that discovery, analysis and comparison of multiple processes is possible. The main goal of this master project was to build a framework to support process cube exploration.
The goal was achieved by following a series of steps, which the thesis describes in detail. We started by identifying the problem context. The role of business intelligence and, in particular, process mining in the functionality and performance of enterprise information systems was investigated. Further, the reader was introduced to the business intelligence area, with emphasis on process mining and OLAP technologies. As concepts from both process mining and OLAP were repeatedly employed throughout the thesis, a formalization was given for all the adherent notions. The formalization of OLAP and of process-cube-related notions is one of the contributions of this thesis. Further elaboration and formalization of the process cube concept can be found in [6]. The next step in the project was to describe the central element of the project, the process cube. Process cubes realize the link between the process mining framework and the existing OLAP technology. While process mining focuses on process analysis, OLAP technology is used for its built-in hypercube structures allowing for operations like slice, dice, roll-up, drill-down and pivoting. As such, process cubes are defined by introducing the event-related aspects into the formalization of the OLAP cubes. Along with the process cube formalization, an example was presented to illustrate the process cube capabilities. This stage of the project was an important one, as it helped in establishing and clarifying the process cube functionality before its actual implementation. Since databases, OLAP and process mining tools already exist, we decided to reuse current technologies to save time. Choosing a framework for process mining was easy, as ProM is clearly the leading open source framework and expertise is readily available at TU/e. Selecting a suitable OLAP technology was not as straightforward, though, because the applied methods and principles vary quite a lot from one OLAP tool to another.
Finally, we selected the Palo in-memory multidimensional OLAP database. In-memory tools are known for their increased speed. Moreover, unlike relational databases, multidimensional databases already have the built-in multidimensional structure that is natural for OLAP cubes and therefore facilitates OLAP analysis. Being relatively new, this technology is still undergoing a lot of changes and improvements. Nevertheless, it is deemed to have a bright future, especially because of its current and envisioned performance benefits. The main contribution of the thesis is the creation of a basic prototype supporting the notion of a process cube in a process mining context, with the following functionality: XES event logs are introduced as data sources for OLAP applications; the OLAP process cube is created from event data; the cube can be visualized from different perspectives; and one can “play” with the cube before starting the analysis, by applying different OLAP operations. One of the challenges we encountered after finishing the application was that MOLAP performance worsened with increasing sparsity of the loaded data. We were aware of the sparsity problem from the very beginning; however, we did not expect such poor performance results. One potential explanation is that we used an open source version of Palo from 2011, which might not include the latest performance improvements found in the commercial tool. Moreover, sparsity is still an open issue for many multidimensional tools. Only Essbase is known to provide a solution to this problem at the moment, but it is not open source. We hope that Palo will also release a new version with the sparsity problem solved. In the meantime, we offered an interim solution to improve the performance for sparse data: replacing the in-memory database with a hybrid structure that stores part of the event data in-memory and the other part in a relational database.
The advantage of such a strategy is that it reduces the number of dimensions in the cube and thus makes it less sparse. The limitation is that only a part of the event data can be used for filtering purposes. Furthermore, we reduced the number of elements per dimension by implementing the hierarchy feature for time data. By allowing time data to be stored in a hierarchical structure, the sparsity of some very sparse dimensions, like the timestamp, is reduced considerably. Finally, we tested the PROCUBE system to determine its capabilities. The information stored in event logs is inherently multidimensional, and as such, efficient application of process mining tools requires multidimensional filtering of the event database. Multidimensional, and in particular in-memory, database technology is developed for exactly that purpose. However, the performed tests show that event logs generally result in sparse multidimensional database structures, which incurs severe performance penalties when unloading parts of the event log for further processing. The proposed hybridization of the database structure, i.e., keeping only the strictly necessary dimensions in memory and the rest in a relational database, makes an efficient trade-off between the flexibility of the complete process cube and the responsiveness of the user interaction. Nevertheless, a complete understanding of the sparsity concept is required for efficient use of the developed tool, as only a limited number of dimensions, e.g., up to 4D for the WABO1 event log, can be used for on-line analysis.

7.2 Limitations

In this section we describe two types of limitations of the in-memory multidimensional OLAP process cube approach. First, limitations at the conceptual level are presented, followed by implementation limitations.
7.2.1 Conceptual Level

Cell Number Explosion Problem

The cell number explosion problem, also known as sparsity, is common for multidimensional structures, where it is not possible to store data in a compact way, resulting in a large number of missing values at the intersections of dimensions. As such, a process cube exceeding a certain number of dimensions, with a large number of elements per dimension and with many missing cell values, leads to sparsity problems and high execution times for analysis.

Visualization Limitations

In the following, we present two types of limitations related to the visualization of process mining results. The first is related to the difficulty of visualizing hypercube structures, while the second is related to the difficulty of visualizing multiple cell results. Generally, the visualization of hypercube structures is not an easy task. On the one hand, multidimensionality is not the natural way in which people visualize. On the other hand, there are hardly any tools that provide multidimensional visualizations of more than three dimensions. In our case, we visualize only two dimensions of the process cube at a time. This is a simple, yet powerful visualization that allows efficient visual comparison of cell results. One caveat is that the growth of the number of compared cells can become an issue. Fitting multiple results on a single screen can impair the visualization of the results, thus impeding the comparison between cells. This issue becomes even worse in the case of large results. In the process mining area, the curse of dimensionality problem is well known: large and complex models are usually unreadable. Visual comparison of such models is not supported in this project; it is still an open research problem in the area.
7.2.2 Implementation Level

Filtering on a Subset of Attributes

The hybrid approach adopted in this project, of storing event data in both in-memory and relational databases, resulted in considerable performance gains. However, it lacks flexibility with respect to the log filtering possibilities and to changing dimensions in the cube. That is, the user is allowed to select a subset of attributes to be considered as dimensions in the process cube, while the rest of the attributes and other log information are stored in relational databases. Selecting only a subset of attributes limits the log filtering possibilities. Moreover, changing one dimension of the cube implies creating a new process cube, by selecting all the dimensions again.

Limited Set of Supported Plugins

The PROCUBE plugin uses only a limited set of ProM plugins to obtain process mining results. There are two reasons for this limitation. First, not all existing ProM plugins are suitable for the visual comparison of multiple subprocesses. The PROCUBE tool is designed to work with plugins that provide quick, direct process mining results. Secondly, there are plugins that cannot be used without following a sequence of wizards, which is problematic in the PROCUBE setting, as this procedure would have to be repeated for each process cell individually.

Performance Issues for Sparse Dimensions

Our methods are oriented toward reducing the number of sparse dimensions and the sparsity within dimensions. Still, if the user selects all the attributes for creating cube dimensions and there are sparse dimensions among those, the unloading of event data becomes very slow.

7.3 Further Research

The process cube notion offers a wide range of new research questions and challenges. We will not enumerate them in this section. Instead, we give some points of reference for improving and extending the current approach.

Data Mining for the Construction of Hierarchies

Hierarchies are one of the most powerful elements of OLAP structures.
In our tool, the hierarchy feature is supported only for dimensions with time values. However, meaningful hierarchical structures can also be constructed for other types of dimensions. Machine learning techniques, e.g., hierarchical clustering, can be applied to obtain clusters of dimension elements that can be used to create a hierarchy. Moreover, data mining techniques can be used to combine elements of multiple dimensions into a single dimension. This can be accomplished by a meaningful partitioning of the elements; for instance, algorithms for partitioning large categorical data exist [35].

Reuse of Precomputed Models

Knowledge of the discovered processes can be reused by storing precomputed information instead of only creating models on-the-fly. Since producing large models on-the-fly takes time, performance can be improved by saving parts of the created models, or aggregates of entire models, for further reuse.

Further Visualization Improvement

The visualization proposed in this thesis is based on a simple, traditional 2D visualization. Undoubtedly, more advanced visualization techniques can be found, with the advantage of being more representative for analysis and more user-friendly. One example is the icicle plot construction [32], which can be used to enhance the hierarchical representation of dimensions and facilitate the comparison between two sub-processes.

Bibliography

[1] A Survey of Open Source Tools for Business Intelligence. In David Taniar and Li Chen, editors, Integrations of Data Warehousing, Data Mining and Database Technologies, pages 237–257. Information Science Reference, 2011.
[2] Business Process Intelligence Challenge (BPIC). In 8th International Workshop on Business Process Intelligence, 2012.
[3] W. M. P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, 2011.
[4] W. M. P. van der Aalst.
Mining Process Cubes from Event Data (PROCUBE), project proposal (under review). 2012.
[5] W. M. P. van der Aalst. Process Mining: Making Knowledge Discovery Process Centric. SIGKDD Explorations Newsletter, 13(2):45–49, 2012.
[6] W. M. P. van der Aalst. Process Cubes: Slicing, Dicing, Rolling Up and Drilling Down Event Data for Process Mining. In M. Song, M. Wynn, and J. Liu, editors, Asia Pacific Conference on Business Process Management (AP-BPM 2013), Lecture Notes in Business Information Processing, 2013.
[7] W. M. P. van der Aalst, A. Adriansyah, A. K. A. de Medeiros, F. Arcieri, T. Baier, T. Blickle, R. P. Jagadeesh Chandra Bose, P. van den Brand, R. Brandtjen, J. C. A. M. Buijs, A. Burattin, J. Carmona, M. Castellanos, J. Claes, J. Cook, N. Costantini, F. Curbera, E. Damiani, M. de Leoni, P. Delias, B. F. van Dongen, M. Dumas, S. Dustdar, D. Fahland, D. R. Ferreira, W. Gaaloul, F. van Geffen, S. Goel, C. W. Günther, A. Guzzo, P. Harmon, A. H. M. ter Hofstede, J. Hoogland, J. Espen Ingvaldsen, K. Kato, R. Kuhn, A. Kumar, M. La Rosa, F. Maggi, D. Malerba, R. S. Mans, A. Manuel, M. McCreesh, P. Mello, J. Mendling, M. Montali, H. Motahari Nezhad, M. zur Muehlen, J. Munoz-Gama, L. Pontieri, J. Ribeiro, A. Rozinat, H. Seguel Pérez, R. Seguel Pérez, M. Sepúlveda, J. Sinur, P. Soffer, M. S. Song, A. Sperduti, G. Stilo, C. Stoel, K. Swenson, M. Talamo, W. Tan, C. Turner, J. Vanthienen, G. Varvaressos, H. M. W. Verbeek, M. Verdonk, R. Vigo, J. Wang, B. Weber, M. Weidlich, A. J. M. M. Weijters, L. Wen, M. Westergaard, and M. T. Wynn. Process Mining Manifesto. In BPM 2011 Workshops, Part I.
[8] W. M. P. van der Aalst, M. Pesic, and M. Song. Beyond Process Mining: From the Past to Present and Future. In Proceedings of the 22nd International Conference on Advanced Information Systems Engineering, CAiSE '10, pages 38–52, 2010.
[9] W. M. P. van der Aalst, H. A. Reijers, and M. Song. Discovering Social Networks from Event Logs. Computer Supported Cooperative Work, 14(6):549–593, 2006.
[10] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the Computation of Multidimensional Aggregates. 1996.
[11] R. Agrawal, A. Gupta, and S. Sarawagi. Modeling Multidimensional Databases. In Proceedings of the Thirteenth International Conference on Data Engineering, ICDE '97, pages 232–243, 1997.
[12] I.-M. Ailenei. Process Mining Tools: A Comparative Analysis. Master's thesis, Eindhoven University of Technology, 2011.
[13] A. Berson and S. J. Smith. Data Warehousing, Data Mining, and OLAP. 1997.
[14] R. P. Jagadeesh Chandra Bose. Process Mining in the Large: Preprocessing, Discovery, and Diagnostics. PhD thesis, Eindhoven University of Technology, 2012.
[15] J. C. A. M. Buijs. Mapping Data Sources to XES in a Generic Way. Master's thesis, Eindhoven University of Technology, 2010.
[16] J. C. A. M. Buijs, B. F. van Dongen, and W. M. P. van der Aalst. Towards Cross-Organizational Process Mining in Collections of Process Models and Their Executions. In Business Process Management Workshops (2), pages 2–13, 2011.
[17] J. W. Buzydlowski, I.-Y. Song, and L. Hassell. A Framework for Object-Oriented On-Line Analytic Processing. In Proceedings of the 1st ACM International Workshop on Data Warehousing and OLAP, DOLAP '98, pages 10–15, 1998.
[18] S. Chaudhuri and U. Dayal. An Overview of Data Warehousing and OLAP Technology. SIGMOD Record, 26(1):65–74, 1997.
[19] S. Chaudhuri, U. Dayal, and V. Narasayya. An Overview of Business Intelligence Technology. Communications of the ACM, 54(8):88–98, August 2011.
[20] E. F. Codd, S. B. Codd, and C. T. Salley. Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate, 1993. White paper.
[21] G. Colliat. OLAP, Relational, and Multidimensional Database Systems. SIGMOD Record, 25(3):64–69, 1996.
[22] T. H. Davenport. Putting the Enterprise into the Enterprise System. Harvard Business Review, 76(4):121–131, 1998.
[23] K. Dhinesh Kumar, H. Roth, and L. Karunamoorthy.
Critical Success Factors for the Implementation of Integrated Automation Solutions with PC Based Control. In Proceedings of the 10th Mediterranean Conference on Control and Automation, 2002.
[24] B. F. van Dongen, A. K. A. de Medeiros, H. M. W. Verbeek, A. J. M. M. Weijters, and W. M. P. van der Aalst. The ProM Framework: A New Era in Process Mining Tool Support. In Proceedings of the 26th International Conference on Applications and Theory of Petri Nets, ICATPN '05, pages 444–454, 2005.
[25] R. Finkelstein. MDD: Database Reaches the Next Dimension. In Database Programming and Design, pages 27–38, 1995.
[26] H. Garcia-Molina and K. Salem. Main Memory Database Systems: An Overview. IEEE Transactions on Knowledge and Data Engineering, 4(6):509–516, 1992.
[27] M. Golfarelli. Open Source BI Platforms: A Functional and Architectural Comparison. In Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery, DaWaK '09, 2009.
[28] O. Grabova, J. Darmont, J.-H. Chauchat, and I. Zolotaryova. Business Intelligence for Small and Middle-Sized Enterprises. SIGMOD Record, 39(2), 2010.
[29] C. W. Günther. XES Standard Definition. Fluxicon Process Laboratories, pages 13–14, 2009.
[30] C. W. Günther and W. M. P. van der Aalst. Fuzzy Mining - Adaptive Process Simplification Based on Multi-Perspective Metrics. BPM, pages 328–343, 2007.
[31] J. Han. OLAP Mining: An Integration of OLAP with Data Mining. In Proceedings of the 7th IFIP 2.6 Working Conference on Database Semantics (DS-7), pages 1–9, 1997.
[32] D. Holten and J. J. van Wijk. Visual Comparison of Hierarchically Organized Data. In Proceedings of the 10th Joint Eurographics / IEEE - VGTC Conference on Visualization, EuroVis '08, 2008.
[33] R. P. Jagadeesh Chandra Bose, W. M. P. van der Aalst, I. Žliobaitė, and M. Pechenizkiy. Handling Concept Drift in Process Mining. In Proceedings of the 23rd International Conference on Advanced Information Systems Engineering, CAiSE '11, pages 391–405, 2011.
[34] M. R. Jensen, T. H. Møller, and T. B. Pedersen. Specifying OLAP Cubes on XML Data. Journal of Intelligent Information Systems, 17(2-3):255–280, 2001.
[35] G. V. Kass. An Exploratory Technique for Investigating Large Quantities of Categorical Data. Journal of the Royal Statistical Society, 29(2):119–127, 1980.
[36] C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text Cube: Computing IR Measures for Multidimensional Text Database Analysis. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM '08, 2008.
[37] M. Liu, E. A. Rundensteiner, K. Greenfield, C. Gupta, S. Wang, I. Ari, and A. Mehta. E-Cube: Multidimensional Event Sequence Processing Using Concept and Pattern Hierarchies. In International Conference on Data Engineering, pages 1097–1100, 2010.
[38] E. Lo, B. Kao, W.-S. Ho, S. D. Lee, C. K. Chui, and D. W. Cheung. OLAP on Sequence Data. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, 2008.
[39] F. Melchert, R. Winter, and M. Klesse. Aligning Process Automation and Business Intelligence to Support Corporate Performance Management. In AMCIS '04, pages 507–507, 2004.
[40] R. B. Messaoud, O. Boussaid, and S. Rabaséda. A New OLAP Aggregation Based on the AHC Technique. In Proceedings of the 7th ACM International Workshop on Data Warehousing and OLAP, DOLAP '04, 2004.
[41] S. Negash. Business Intelligence. Communications of the Association for Information Systems, 13(1):177–195, 2004.
[42] T. Niemi, J. Nummenmaa, and P. Thanisch. Constructing OLAP Cubes Based on Queries. In Proceedings of the 4th ACM International Workshop on Data Warehousing and OLAP, DOLAP '01, 2001.
[43] T. B. Pedersen and C. S. Jensen. Multidimensional Database Technology. Computer, 34(12):40–46, December 2001.
[44] D. Riazati, J. A. Thom, and X. Zhang. Drill Across and Visualization of Cubes with Non-conformed Dimensions. In Nineteenth Australasian Database Conference, volume 75, pages 85–93, 2008.
[45] J.
Ribeiro. Multidimensional Process Discovery. Beta Dissertation Series D165, 2013.
[46] C. Salka. Ending the MOLAP/ROLAP Debate: Usage Based Aggregation and Flexible HOLAP (Abstract). In Proceedings of the Fourteenth International Conference on Data Engineering, February 23-27, 1998, Orlando, Florida, USA, page 180, 1998.
[47] B. Sigg. DockingFrames 1.1.1 - Common. pages 7–8, 2012.
[48] Stratebi. Open Source B.I. Comparative. 2010.
[49] C. Thomsen and T. B. Pedersen. A Survey of Open Source Tools for Business Intelligence. In Proceedings of the 7th International Conference on Data Warehousing and Knowledge Discovery, DaWaK '05, 2005.
[50] C. Thomsen and T. B. Pedersen. A Survey of Open Source Tools for Business Intelligence. International Journal of Data Warehousing and Mining, 5(3):56–75, 2009.
[51] E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. Wiley, 2002.
[52] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation for Graph Summarization. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, 2008.
[53] A. J. M. M. Weijters and A. K. A. de Medeiros. Process Mining with the HeuristicsMiner Algorithm. 2006.
[54] K. Withee. Microsoft Business Intelligence for Dummies. Wiley Publishing, 2010.