Department of Mathematics and Computer Science
Architecture of Information Systems Research Group
Realizing a Process Cube Allowing
for the Comparison of Event Data
Master Thesis
Tatiana Mamaliga
Supervisors:
prof. dr. ir. W.M.P. van der Aalst
MSc J.C.A.M. Buijs
dr. G.H.L. Fletcher
Final version
Eindhoven, August 2013
Contents
1 Introduction
  1.1 Context
  1.2 Challenges - Then & Now
  1.3 Assignment Description
  1.4 Approach
  1.5 Thesis Structure
2 Preliminaries
  2.1 Business Intelligence
  2.2 Process Mining
    2.2.1 Concepts and Definitions
    2.2.2 ProM Framework
  2.3 OLAP
    2.3.1 Concepts and Definitions
    2.3.2 The Many Flavors of OLAP
3 Process Cube
  3.1 Process Cube Concept
  3.2 Process Cube by Example
    3.2.1 From XES Data to Process Cube Structure
    3.2.2 Applying OLAP Operations to the Process Cube
    3.2.3 Materialization of Process Cells
  3.3 Requirements
  3.4 Comparison to Other Hypercube Structures
4 OLAP Open Source Choice
  4.1 Existing OLAP Open Source Tools
  4.2 Advantages & Disadvantages
  4.3 Palo - Motivation of Choice
5 Implementation
  5.1 Architectural Model
  5.2 Event Storage
  5.3 Load/Unload of the Database
  5.4 Basic Operations on the Database Subsets
    5.4.1 Dice & Slice
    5.4.2 Pivoting
    5.4.3 Drill-down & Roll-up
  5.5 Integration with ProM
  5.6 Result Visualization
6 Case Study and Benchmarking
  6.1 Evaluation of Functionality
    6.1.1 Synthetic Benchmark
    6.1.2 Real-life Log Data Example
  6.2 Performance Analysis
  6.3 Discussion
7 Conclusions & Future Work
  7.1 Summary of Contributions
  7.2 Limitations
    7.2.1 Conceptual Level
    7.2.2 Implementation Level
  7.3 Further Research
Abstract
Continuous efforts to improve processes require a deep understanding of a process's inner workings. In this context, the process mining discipline aims at discovering process behavior from historical records, i.e., event logs. Process mining results can be used for the analysis of process dynamics. However, mining realistic event logs is difficult due to complex interdependencies within a process. Therefore, to gain more in-depth knowledge about a certain process, it can be split into subprocesses, which can then be separately analysed and compared. Typical tools for process mining, e.g., ProM, are designed to handle a single event log at a time, which does not particularly facilitate the comparison of multiple processes. To tackle this issue, Van der Aalst proposed in [4] to organize the event log in a cubic data structure, called a process cube, with a selection of the event attributes forming the dimensions of the cube.
Although multidimensional data structures are already employed in various business intelligence tools, the data used has a static character. This is in stark contrast to process mining, since event data characterizes a dynamic process that evolves in time. The aim of this thesis is to develop a framework that supports the construction of the process cube and permits multidimensional filtering on it, in order to separate subcubes for further processing. We start from the OLAP foundation and reformulate its corresponding operations for event logs. Moreover, the semantics of a traditional OLAP aggregate are changed: numerical aggregates are substituted by sublog data. With these adjustments, a tool is developed and integrated as a plugin in ProM to support the aforementioned operations on event logs. The user can unload sublogs from the process cube, pass them as parameters to other plug-ins in ProM and visualize different results simultaneously.
During the development of the tool, we had to deal with a shortcoming of multidimensional database technologies when storing event logs, i.e., the sparsity of the resulting process cube. Sparsity in multidimensional data structures occurs when a large number of cells in a cube are empty, i.e., there are missing data values at the intersection of dimensions. Taking a single attribute of an event log as a dimension in the process cube results in a very sparse multidimensional data structure. As a result, the computational time required to unload a sublog for processing increases dramatically. This shortcoming was addressed by designing a hybrid database structure that combines a high-speed in-memory multidimensional database with a sparsity-immune relational database. Within this solution, only a subset of the event attributes actually contribute to the construction of the process cube, whereas the rest are stored in the relational database and used only for event log reconstruction. The hybrid database solution proved to provide the flexibility needed for real-life logs, while keeping response times acceptable for efficient user interaction. The applicability of the tool was demonstrated using two event logs: a synthetic event log and a real-life event log from the CoSeLoG project. The thesis concludes with a detailed loading and unloading performance analysis of the developed hybrid structure, for different database configurations.
Keywords: event log, relational database, in-memory database, OLAP, process mining, visualization, performance analysis
Chapter 1
Introduction
The greatest challenge to any thinker is stating the problem in a way
that will allow a solution.
Bertrand Russell, British author, mathematician, & philosopher (1872 - 1970)
This thesis completes my graduation project for the Computer Science and Engineering master
at Eindhoven University of Technology (TU/e). The project was conducted in the Architecture
of Information Systems (AIS) group. The AIS group has a distinct research reputation and
is specialized in process modeling and analysis, process mining and Process-Aware Information
Systems (PAIS).
The process mining field, detailed further in this chapter, provides valuable analysis techniques
and tools, but also faces a series of challenges. The main issues are large data streams and rapid changes
over time. This project creates a proof-of-concept prototype, which considers the so-called process
cube concept as a starting point for possible solutions to the above-mentioned challenges. The
outcome is further used for visual comparison of event data.
This chapter describes the assignment within its scientific context. Section 1.1 provides the
research background. Section 1.2 enumerates the most important advances in process mining
and identifies the current issues in the field. Section 1.3 specifies the problem and the project
objectives. Section 1.4 continues with a short summary of the proposed solution. Finally, Section 1.5 provides an overview of the remaining chapters of the thesis.
1.1 Context
Technology has become an integral part of any organization. For example, current systems and
installations are heavily controlled and monitored remotely by integrated internet technologies
[23]. Moreover, employing automated solutions in any line-of-business has become a trend. As a
result, Enterprise Systems software, offering a seamless integration of all the information flowing
through a company [22], is used in any modern organization.
Enterprise Information Systems (EIS) keep businesses running, improve service times and thus attract more clients. Still, like in every complex system, there are multiple points where things can go wrong. System errors, fraud, security issues and inefficient distribution of tasks are just a few to mention. To cope with these issues, EIS had to extend their function-oriented enterprise applications with Business Intelligence (BI) techniques. That is, BI applications have been installed to support management in measuring a company's performance and deriving appropriate decisions [39]. Among the most important functions of BI are online analytical processing (OLAP), data mining, business
performance management and predictive analytics.
Being aware of the existing problems in an organization and applying standardized solutions to solve them is usually not enough. Consider a doctor who always prescribes painkillers independent of the patient's complaints. Of course, such pills will temporarily relieve the pain, but they will not treat the real disease. A good doctor should run tests, identify the root causes of the health problem and only then give an adequate treatment. This is what the process mining field tries to accomplish. It goes beyond merely analyzing individual data records and rather focuses on the underlying process which glues event data together. A deep understanding of the inside of a process can point to notable deviations, persistent bottlenecks and unnecessary rework.
All in all, technology has a major impact on organizations and has proved to be an enabler for business process improvement. Therefore, by means of business intelligence, and process mining in particular, new opportunities are constantly being exploited to keep pace with challenges such as
change.
1.2 Challenges - Then & Now
In the context of today’s rapidly changing environment, organizations are looking for new solutions to keep their businesses running efficiently. Slogans such as “Driving the Change” (Renault),
“Changes for the Better” (Mitsubishi Semiconductor), “Empowering Change” (Credit Suisse First
Boston), “New Thinking. New Possibilities” (Hyundai) are used more and more often. Furthermore, different areas of business research are trying to keep up with the change and process mining
is not an exception.
In 2011, the Process Mining Manifesto [7] was released to describe the state of the art in process mining on the one hand, and its current challenges on the other. A year later, the project proposal "Mining Process Cubes from Event Data (PROCUBE)" in [4] suggested the so-called process cube as a solution direction for some of these challenges. In the context of currently
employed process mining solutions and using the Process Mining Manifesto as a reference, the
PROCUBE project proposal presents several challenges that process mining is currently facing:
From “small” event data to “big” event data.
Due to increased storage capacity and advanced technologies, the vast amount of available event data has become difficult to control and analyse. Most of the traditional process mining techniques operate with event logs whose size does not exceed several thousand cases and a couple of hundred thousand events (for example, the BPI Challenge [2] files). However, nowadays corporations work on a different scale of event logs. Giants like Royal Dutch Shell, Walmart and IBM rather deal with millions of events (per day or even per second), and this
number will continue to grow. Ways to ensure that event data growth will not affect the
importance of process mining techniques are constantly sought.
From homogeneous to heterogeneous processes.
With the increasing complexity of an event log, chances are that the variability in its corresponding process increases as well. For example, events in an event log can present different
levels of abstraction. However, many mining techniques assume that all events in an event log
are logged at the same level of abstraction. In that sense, the diverse event log characteristics
have to be properly considered.
From one to many processes.
Many companies have their offices spread across the globe. Let's take SAP AG as an example. Its research and development units alone are located on four continents, and it has regional offices all around the world. That is, SAP units are executing basically the
same set of processes. Still, this does not exclude possible variations. For instance, there
might be various influences due to the characteristics of a certain SAP distribution region
(Germany, India, Brazil, Israel, Canada, China, and others). Traditional process mining is oriented towards stand-alone business processes. However, it is of great importance to be able to compare business processes of different organizations (or units of an organization). For
example, efficient and less efficient paths in different processes can be identified. Inefficient
paths can be substituted and efficient paths can be applied to the rest of the processes to
improve performance.
From steady-state to transient behavior.
Change has a major impact not only on the size of event logs and on the necessity
of dealing with many processes together, but also on the state of a business process. For
example, companies should be able to quickly adjust to different business requirements. As a
result, their corresponding processes undergo different modifications. Current process mining
techniques assume business processes to be in a steady-state [5]. However, it is important
to understand the changing nature of a process and to react appropriately. The notion of
concept drift was introduced in process mining [33] to capture these second-order dynamics.
Its target is to discover and analyze the dynamics of a process by detecting and adapting to
change patterns in the ongoing work.
From offline to online.
As previously mentioned, systems produce an overwhelming amount of information. The
idea of storing it as historical event data for later analysis, as it is currently done, may not seem as appealing anymore. Instead, the emphasis should be more on the present and the future of an event. That is, an event should be analysed on-the-fly, and predictions about its possible occurrence should be made based on existing historical data. As such,
online analysis of event data is yet another process mining challenge.
Each of the issues discussed above is extremely challenging. Analysing large-scale event logs is difficult with the current process mining techniques. Solutions to mitigate some of the issues that appear when dealing with large-scale event logs are proposed in [14], e.g., event log simplification, dealing with less-structured processes and others. A framework for time-based
operational support is described in [8]. In [16], an approach is offered to compare collections of
process models corresponding to different Dutch municipalities. Nevertheless, there is still the
need for more elaborate solutions and a unified way of approaching them.
1.3 Assignment Description
Stand-alone process analysis is the common way of analysing processes in today’s process mining
approaches. However, inspecting a process as a single entity impedes observing differences and
similarities with other processes. Let’s take a simple example from the airline industry. There is a
constant discussion about which of the low-cost airlines, Ryanair or Wizzair, offers better services.
There are both advantages and disadvantages of traveling with either of these two. Generally,
Ryanair is considered more punctual than Wizzair (see http://www.flightontime.info/scheduled/scheduled.html). To determine why Ryanair is more on time with its flights than Wizzair, we compare their processes. We notice that while at Wizzair the luggage is checked only once, Ryanair is very strict with the luggage procedure and checks it twice before boarding. As a result, passengers and crew are not busy with "fitting" luggage that does not fit, and the aisle of the aircraft is kept free for new passengers that come on board. By minimizing the turnaround time, the airline's punctuality improves. The procedure of checking the luggage may not be the only factor that improves the punctuality of Ryanair, but it is clear from the comparison of the two airline processes that it contributes to reducing flight delays.
In conclusion, the comparison of the two processes helped in answering a specific question and
identifying parts of these processes that can be further improved.
When it comes to comparison of large processes, it is difficult to inspect processes entirely
at a glance. Splitting and merging different parts of a process can offer more insightful details.
Let’s consider the following scenario. In the car manufacturing process, there is a final polishing
inspection step. Several resources check whether there is a scratch on a car that needs to be
polished. During the last two weeks, it was noticed that one polishing crew worked slower than
the others. To identify the cause of this issue, the car manufacturing process is analysed. First,
the process is split by department type and the polishing department is selected. Then, only the
process corresponding to the resources of this specific crew is isolated. The following aspects are inspected: the car type, the engine type and the color type. When filtering by car type and engine
type, it seems that there are no patterns indicating a potential delay. However, when inspecting
the subprocesses corresponding to different car colors, a pattern emerges. The average working
time of polishing a red car is much higher compared to that of polishing cars of a different color. Since red cars take, in general, more time to be polished than other cars, this indicates that there is a problem in the painting department. The red-colored cars are not painted properly and therefore need constant polishing. While at the beginning it seemed that the crew was responsible for the delays, in fact, the crew members were just polishing more red-colored cars. Since red-colored cars required more polishing due to a painting issue, the crew worked slower compared to
the other crews. Without filtering the initial process, it would have been difficult to identify such
detailed problems.
Taking into consideration the discussion above, the goal of this master project can be defined
as follows:
GOAL: Create a proof-of-concept tool to allow comparison of multiple processes.
In other words, the aim is to support integrated analysis on multiple processes, while examining
different views of a process. Together with the main goal, there are some other targets: filtering processes while preserving the initial dataset, merging different parts of a process, visualizing process
mining results simultaneously and placing them next to each other to facilitate comparison. In
the following, we present the approach we propose to reach the enumerated objectives.
1.4 Approach
Figure 1.1: The process cube. Concept proposed in the PROCUBE project.
To accomplish the goal, we base our approach on the process cube concept, introduced in [4]
and shown in Figure 1.1. A process cube is a structure composed of process cells. Each process cell
(or collection of cells) can be used to generate an event log and derive process mining results [4].
Note that traditional process mining algorithms are always applied to a specific event log without
systematically considering the multidimensional nature of event data.
In this project, the process cube is materialized as an online analytical processing (OLAP)
hypercube structure. Besides the built-in multidimensional structure, one can benefit from the functionality of the OLAP operations and hopefully from the good performance of OLAP implementations. Transactional databases are designed to store and clean data, but are not tailored towards analysis. OLAP, on the other hand, is herein chosen to harbor complex event data for further process analysis, in view of its analysis-optimized databases and its specialized "drilling" operations. Organizing event data in OLAP multidimensional structures makes it easy to retrieve event data and to choose a perspective from which to view it. There are also many ways to divide event data,
e.g., one can always drill down and up in the multidimensional structure and inspect event data
at different granularity levels. Finally, the retrieved event data can be used to obtain different
process-related characteristics, e.g., process models, that can be further analysed and compared.
There are, however, some challenges with respect to this approach, mainly due to the fact that OLAP does not handle event data, but enterprise data:
• Only the aggregation of large collections of numerical data is supported by the OLAP tools.
• Process-related aspects are entirely missing in the OLAP framework.
• Overlapping of cell (event) classes is not possible in OLAP cubes.
Figure 1.2: Master Project Scope.
Nevertheless, adjustments can be made to OLAP tools to accommodate process cube requirements. The approach considers several steps shown also in Figure 1.2. First, event logs are
introduced among OLAP data sources. Hence, it becomes possible to load XES event logs in the
OLAP database. Second, the process cube is created to support the materialization of an event
log. Moreover, the process cube is designed to allow the visualization of cells with overlapping
event data. Finally, different process mining results can be produced for any section of the cube
and further exported as images.
The materialization of the process cube as an OLAP cube allows us to define our objective even more precisely: the goal is to create a proof-of-concept tool that exploits OLAP features to accommodate
process mining solutions such that the comparison of multiple processes is possible.
1.5 Thesis Structure
To describe the approach, the master thesis is structured as follows:
Present a literature study on employed concepts and technologies (Chapter 2)
Concepts from process mining and business intelligence fields will be introduced. Then, a
discussion on the implemented OLAP and database technologies will follow.
Elaborate on process cube functionality (Chapter 3)
The process cube notion will be clearly defined together with its structure. The requirements
needed to achieve the envisioned process cube functionality will be listed.
Explain Palo software choice (Chapter 4)
Based on the requirements from Chapter 3, a collection of technological solutions that could
support the process cube structure is generated. After analyzing the pros and the cons of
each solution, the choice to use Palo OLAP server is described and motivated.
Recall the most relevant implementation steps (Chapter 5)
After presenting the architecture of the project, the implementation steps are described.
The main functionality consists of loading/unloading an XES file into/from the in-memory
database, enabling the adjusted OLAP operations on event logs and visualizing process
mining results.
Report on the testing process and on the system test results (Chapter 6)
The functionality of the software is tested and its performance is evaluated for different event
logs and process cubes.
Conclude with general remarks on the project (Chapter 7)
The thesis concludes with a series of comments and observations on both the implemented
solution and further research possibilities.
Chapter 2
Preliminaries
2.1 Business Intelligence
Business Intelligence (BI) incorporates all technologies and methods that aim at providing actionable information that can be used to support decision making. An alternative definition states that
BI systems combine data gathering, data storage, and knowledge management with analytical tools
to present complex internal and competitive information to planners and decision makers [41].
All in all, BI represents a mixture of multiple disciplines (e.g., data warehousing, data mining,
OLAP, process mining, etc.), as shown in Figure 2.1, all with the same main goal of turning
raw data into useful and reliable information for further business improvements.
Figure 2.1: BI - a confluence of multiple disciplines.
Even though herein presented as totally separate disciplines, there are various attempts to interconnect some
of them for obtaining more powerful analysis results. For example, data mining is integrated with
OLAP techniques [31, 45]. Data warehousing and OLAP technologies are more and more used
in conjunction [13, 18]. From the above-mentioned BI disciplines, process mining and OLAP are
detailed in Section 2.2 and in Section 2.3, as being particularly relevant for this project.
2.2 Process Mining
2.2.1 Concepts and Definitions
The idea of process mining is to discover, monitor and improve real processes (i.e., not assumed
processes) by extracting knowledge from event logs readily available in today's systems [3]. The
content and the level of detail of a process description depend on the goal of the conducted
process mining project and the employed process mining techniques. The set of real executions is
fixed and is given by the event data from an existing event log.
There are basically three types of process mining projects [3]. The goal of the first, the data-driven process mining project, is to arrive at a process description which should be as detailed as possible, without necessarily having a specific question in mind. This can be accomplished in two
ways: by a superficial analysis, covering multiple process perspectives or by an in-depth analysis,
on a limited number of aspects. The second, the question-driven process mining project, aims at
obtaining a process description from which an answer to a concrete question can be derived. A
possible question can be: “How does the decision to increase the duration of handling an invoice
influence the process?" The third type, the goal-driven process mining project, consists of looking for weaker parts in the resulting process description that can be considered for improving a specific
aspect, e.g., better response times.
Figure 2.2: Process mining: discovery, conformance, enhancement.
Establishing the type of the process mining project to conduct is followed by choosing the
relevant process mining techniques to apply on the event log. Process mining comes in three
flavors: discovery, conformance and enhancement. Figure 2.2 1 shows these three main process
mining categories. Discovery techniques take the event log as input and return the real process
as output. Conformance checking techniques check whether reality, as recorded in the log, conforms to the model and vice versa [7]. Enhancement techniques produce an extended process model which gives additional insights into the process, e.g., existing bottlenecks.
Regardless of the process mining technique, an event log is always given as input, shown also
in Figure 2.2. The content of an event log can vary greatly from process to process. Nevertheless, there is a fixed skeleton expected to be found in any event log.
1 http://www.processmining.org/research/start
Figure 2.3: Structure of event logs.
Figure 2.3, from [3], presents the
structure of an event log. Generally, event data from an event log correspond to a process. A
process is composed of cases or completed process instances. In turn, a case consists of events.
Events should be ordered within a case. Preserving the order is important as it influences the
control flow of the process. An event corresponds to an activity, e.g., register request, pay compensation. A trace represents the sequence of events of a case. Both events and cases are characterized by
attributes, e.g., activity, time, resource, costs.
The data source used for process mining is an event log. Event data of different information
systems are stored in event logs. Since event logs can be recorded not only for process mining
purposes (e.g., for debugging errors), there is no unique format used at creation. Handling various
event log formats for process analysis is time-consuming. Therefore, event logs need to be standardized by converting raw event data to a single event log format. One such format is MXML, which emerged in 2003. Recently, the popularity of the XES event log standard has grown. In the following, we present an overview of the XES event log structure, with relevant details for this master thesis. A more in-depth discussion of the XES format can be found in [15] and more up-to-date information on XES can be found at http://www.xes-standard.org/.
Figure 2.4, taken from [29], shows the XES meta-model. Besides traces and events, with their corresponding attributes, the log object contains a series of other elements.
Figure 2.4: The XES Meta-model.
The global attributes for traces and events are usually used to quickly find the existing attributes in the XES
log. The purpose of event classifiers is to assign each event to a pre-defined category. Events
within the same category can be compared with the ones from another category. XES logs are
also characterized by extensions. Extensions are used to resolve the ambiguity in the log by
introducing a set of commonly understood attributes and attaching semantics to them. Attributes
have assigned values which correspond to a specific type of data. Based on the type of data, attributes can be classified into five categories: String attributes, Date attributes, Int attributes,
Float attributes, and Boolean attributes. These attribute types correspond to the standard XML
types: xs:string, xs:dateTime, xs:long, xs:double and xs:boolean.
To understand the separation between required and flexible event log aspects, a formalization
of the above-highlighted concepts is given. The process mining book [3] is used as reference.
Definition 1 (Event, attribute [3]). Let E be the event universe, i.e., the set of all possible
event identifiers. Events may be characterized by various attributes, e.g., an event may have a
timestamp, correspond to an activity, is executed by a particular person, has associated costs, etc.
Let AN be a set of attribute names. For any event e ∈ E and name n ∈ AN : #n (e) is the value
of attribute n for event e. If event e does not have an attribute named n, then #n(e) = ⊥ (null value).
Notation 1. For a given set A, A∗ is the set of all finite sequences over A.
Definition 2 (Case, trace, event log [3]). Let C be the case universe, i.e., the set of all possible case identifiers. Cases, like events, have attributes. For any case c ∈ C and name n ∈ AN : #n(c) is the value of attribute n for case c (#n(c) = ⊥ if case c has no attribute named n). Each case has a special mandatory attribute trace: #trace(c) ∈ E∗.² ĉ = #trace(c) is a shorthand for referring to the trace of a case.
A trace is a finite sequence of events σ ∈ E∗ such that each event appears only once, i.e., for 1 ≤ i < j ≤ |σ| : σ(i) ≠ σ(j).
For any sequence δ = ⟨a1, a2, . . . , an⟩ over A, δset(δ) = {a1, a2, . . . , an}. δset converts a sequence into a set, e.g., δset(⟨d, a, a, a, a, a, a, d⟩) = {a, d}. a is an element of δ, denoted as a ∈ δ, if and only if a ∈ δset(δ).
An event log is a set of cases L ⊆ C such that each event appears at most once in the entire log, i.e., for any c1, c2 ∈ L such that c1 ≠ c2 : δset(ĉ1) ∩ δset(ĉ2) = ∅.
² In the remainder, we assume #trace(c) ≠ ⟨⟩, i.e., traces in a log contain at least one event.
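As an aside, the following minimal Python sketch illustrates Definitions 1 and 2; it is not part of the thesis implementation, and all event identifiers, attribute names and values are hypothetical. Events and cases are modelled as attribute maps, #n(e) as a lookup that returns None for undefined attributes, and an event log as a set of cases whose traces share no events.

    # Illustrative sketch of Definitions 1 and 2 (hypothetical data, not the thesis tool).
    def attr(obj, name):
        """#n(e): the value of attribute `name`, or None (undefined) if absent."""
        return obj.get(name)

    # Two events with attributes.
    e1 = {"id": "e1", "activity": "register request", "resource": "John"}
    e2 = {"id": "e2", "activity": "check ticket", "resource": "Pete"}

    # A case has attributes and a mandatory trace: a sequence of events
    # in which each event appears only once.
    c1 = {"id": 1, "customer": "gold", "trace": [e1, e2]}

    def is_trace(sigma):
        ids = [e["id"] for e in sigma]
        return len(ids) == len(set(ids))      # no event occurs twice within one trace

    def is_event_log(cases):
        seen = set()
        for c in cases:
            ids = {e["id"] for e in c["trace"]}
            if seen & ids:                    # an event may appear at most once in the log
                return False
            seen |= ids
        return True

    print(attr(e1, "resource"))               # John
    print(attr(e1, "cost"))                   # None: attribute undefined for this event
    print(is_trace(c1["trace"]), is_event_log([c1]))   # True True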
2.2.2 ProM Framework
A large number of algorithms are produced as a result of process mining research. Ranging from
algorithms that provide just a helicopter view of the process (Dotted Chart) to ones that give an in-depth analysis (LTL Checker), many of them are implemented in the ProM Framework in the
form of plugins.
Figure 2.5: ProM Framework Overview.
Figure 2.5, based on [24], shows an overview of the ProM Framework. It includes the main
types of ProM plugins and the relations between them. Before applying any mining technique, an
event log can be filtered using a Log filter. Further, the filtered event log can be mined using the
Mining plugin and then stored as a Frame result. The Visualization engine ensures that frame
results can be visualized. A (filtered) event log, but also different models, e.g., Petri nets, LTL
formulas, can be loaded into ProM using an Import plugin. Both the Conversion plugin and the Analysis plugin use mining results as input. While the first plugin is specialized in converting the result to a different format, the second plugin is focused on the analysis of the result.
Figure 2.6: Examples of process mining plugins: Log Dialog and Dotted Chart (helicopter view), Fuzzy Miner (discovery), Social Networks based on Working Together (organizational perspective).
The ProM framework includes five types of process mining plugins, as shown in Figure 2.5:
• Mining plugins - mine models from event logs.
• Analysis plugins - implement property analysis on a mining result.
• Import plugins - allow the import of objects such as Petri nets, LTL formulas, etc.
• Export plugins - allow export of objects to various formats, e.g., EPC, Petri net, DOT, etc.
• Conversion plugins - make conversions between different data formats, e.g., from EPC to
Petri net.
Figure 2.6 presents some examples of plugins in ProM: the Log Dialog, the Dotted Chart, the
Fuzzy Miner [30] and the Working Together Social Network [9]. There are, however, more than
400 plug-ins available in ProM 6.2, covering a wide spectrum. Plugin objectives can vary from providing process information at a glance, e.g., Log Data, Dotted Chart, to providing automated process discovery, e.g., Heuristics Miner [53] and Fuzzy Miner, and offering detailed analysis for
verification of process models, e.g., Woflan analysis, for performance aspects, e.g., Performance
Analysis with Petri net, and for the organizational perspective, e.g., Social Network miner.
2.3 OLAP
2.3.1 Concepts and Definitions
On-Line Analytical Processing (OLAP) is a method to support decision making in situations where
raw data on measures such as sales or profit needs to be analysed at different levels of statistical
aggregation [42]. Introduced in 1993 by Codd [20] as a more generic name for “multidimensional
data analysis”, OLAP embraces the multidimensionality paradigm as a means to provide fast
access to data when analysing it from different views.
Figure 2.7: Traditional OLAP cube. At the intersection of the three dimensions: regions, time
and sales information, an aggregate (e.g., profit margin %) can be derived. Both time and regions
dimensions contain a hierarchy (e.g., 2012Jan, 2012Feb, 2012Mar are months of 2012).
In comparison with its On-Line Transactional Processing (OLTP) counterpart, OLAP is optimized for analysing data, rather than storing data originating from multiple sources to avoid
redundancy. Therefore, OLAP is mostly based on historical data, e.g., data that can be aggregated, and not on instantaneous data which is quite challenging to analyse, sort, group or compare
“on-the-fly”.
Multidimensional data analysis is possible due to a multidimensional fact-based structure,
called an OLAP cube. An OLAP cube is a specialized data structure to store data in an optimized
way for analysis.
Figure 2.7 presents the traditional OLAP cube structure. Designed to support enterprise data
analysis, an OLAP cube is usually built around a business fact. A fact describes an occurrence
of a business operation (e.g., sale), which can be quantified by one or more measures of interest (e.g., the total amount of the sale, sales cost, profit margin %). Generally, the measure of
interest is a real number. A business operation can be characterized by multiple dimensions of
analysis (e.g., time, region, etc). Let DAi , 1 ≤ i ≤ n be the set of elements of the
Qndimensions of
analysis. Then, the measure of interest M I can be defined as a function M I : i=1 DAi → R.
For example, if region, time and sales are the dimensions of analysis, as in Figure 2.7, then
M I(Germany, 2012M ar, P rof itM argin) = 11.
Moreover, elements of a dimension of analysis can be organized in a hierarchy, e.g., the
Europe region is herein represented by countries like N etherlands, Germany and Belgium.
A natural hierarchical organization can be observed among time elements. Consider the tree
structure in Figure 2.8. The root of the tree is the 2012 year. This element has three children: 2012Jan, 2012F eb and 2012M ar, corresponding to months. Finally, each month element has days of week as children elements. Let Hi be the set of hierarchy elements, i.e.,
Hi = {2012, 2012Jan, 2012F eb, 2012M ar, 2012JanM on, 2012JanT hu, . . .}. The children
function, children : Hi → P(Hi ) returns the children elements of the argument. For example,
children(2012) = {2012Jan, 2012F eb, 2012M ar}. The allLeaves function, allLeaves : Hi →
17
Figure 2.8: Example of hierarchy tree structure on time dimension.
P(Hi ) returns all leaf elements corresponding to the subtree with the function argument as a root
node. For example, allLeaves(2012) = {2012JanM on, 2012JanT hu, 2012F ebW ed, 2012M arT ue,
2012M arF ri}. Note that a hierarchy is a undirected graph, in which any two nodes are connected
by a simple path, with the following property: for any node h ∈ Hi , any two children h1 , h2
∈ children(h), allLeaves(h1 ) ∩ allLeaves(h2 ) = ∅.
Dimensions of analysis, hierarchies and measures of interest can be used to construct an OLAP
cube, like the one in Figure 2.7. Dimensions of an OLAP cube are defined by CD = D1 × D2 ×
. . . × Dn . For any 1 ≤ i ≤ n, Di ⊆ Hi is the set of dimension elements. Hierarchies are defined
by CH = H1 × H2 × . . . × Hn . For example, the time dimension contains elements from the
hierarchy shown in Figure 2.8. Let D1 be the cube dimension corresponding to time, then a
possible content of D1 is {2012Jan, 2012Feb, 2012Mar}. It is not necessary for a dimension to contain all the hierarchy elements. Together with dimensions, hierarchies are elements of an OLAP cube structure CS = (CD, CH). Measures of interest are functions specific to the dimensions of analysis. For the dimensions of the cube, the aggregate function CA : H1 × . . . × Hn → R is used as an equivalent of a measure of interest. The only difference is that aggregates can be computed from multiple measure of interest results or from other aggregates. For example, the aggregate sales cost for the entire month 2012Jan is the sum of the measure of interest results corresponding to 2012JanMon and 2012JanThu.
To make the reasoning in terms of OLAP more precise and to strengthen the understanding
of various cube-related concepts, we provide a formalization of the core OLAP notions.
An OLAP cube presents a multidimensional view on data from different sides (dimensions).
Each dimension consists of a number of dimension attributes or values, which can be also called
dimension elements or members. Members in a dimension can be organized into a hierarchy and
correspond, as such, to a hierarchical level. These concepts are further formalized in Definition 3.
Definition 3. (OLAP cube)
Let Di , 1 ≤ i ≤ n be a set of dimension elements, where n is the number of dimensions,
Hi , 1 ≤ i ≤ n be a set of hierarchy elements,
CD = D1 × D2 × . . . × Dn be the cube dimensions,
CH = H1 × H2 × . . . × Hn be the cube hierarchies,
children : Hi → P(Hi ), where children(h) is the function returning the children of h ∈ Hi ,
allLeaves : Hi → P(Hi ), where allLeaves(h) is the function returning all leaves of h ∈ Hi ,
h ∈ Hi , h1 , h2 ∈ children(h), allLeaves(h1 ) ∩ allLeaves(h2 ) = ∅,
CS = (CD, CH) be the cube structure,
CA : CH → R be the cube aggregate function,
An OLAP cube is defined as OC = (CS, CA).
Given the multidimensional structure of an OLAP cube, the risk exists of having it populated
with sparse data. Sparsity appears when, at the intersection of dimensions, there is often no corresponding measure of interest and thus an empty cell. Such behavior occurs in multidimensional cubes with a large number of sparse dimensions. A dimension is considered a sparse dimension when it has a large number of members that, in most cases, appear only once in the original data source, and data values are missing for the majority of member combinations.
On the contrary, in a dense dimension, a data value exists for almost every dimension member.
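The following back-of-the-envelope sketch (hypothetical dimension sizes, not the benchmark of Chapter 6) shows why taking event attributes, including the event id, as dimensions produces an extremely sparse cube: the number of possible cells grows multiplicatively, while the number of filled cells stays bounded by the number of events.

    # Illustrative sparsity estimate for a cube built from event attributes.
    dims = {"event_id": 10_000, "activity": 20, "resource": 50, "day": 365}

    total_cells = 1
    for size in dims.values():
        total_cells *= size                   # 3.65 billion possible cells

    filled_cells = 10_000                     # at most one filled cell per event
    print(1 - filled_cells / total_cells)     # about 0.9999973: almost all cells are empty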
So far, we focused on the OLAP cube multidimensional structure. However, learning how to employ it is particularly interesting, as it gives a feeling of OLAP's usefulness and applicability. Therefore, we further discuss one of the main features of OLAP, the OLAP operations. In [18], Chaudhuri and Dayal enumerate among the typical OLAP operations: slice and dice for
selection and projection, drill-up (or roll-up) and drill-down, for data grouping and ungrouping,
and pivoting (or rotation) for re-orienting the multidimensional view of data. There are also other
OLAP operations, e.g., ranking, drill-across [44]. However, the operations mentioned in [18] are
considered sufficient for a meaningful exploration of the data.
The dice operation returns a subcube by selecting a subset of members on certain dimensions.
Definition 4 (Dice operation). Let OC, OC = (CS, CA) and Di′ ⊆ Di for all 1 ≤ i ≤ n. The dice operation is diceCD′(OC) = OC′, where
OC′ = (CS′, CA′),
CS′ = (CD′, CH′),
CH′ = H1′ × H2′ × . . . × Hn′,
Hi′ = {h ∈ Hi | ∃v ∈ Di′, allLeaves(v) ∩ allLeaves(h) ≠ ∅},
children′ : Hi′ → P(Hi′), children′(h) = children(h) ∩ Hi′,
allLeaves′ : Hi′ → P(Hi′), allLeaves′(h) = allLeaves(h) ∩ Hi′,
h ∈ Hi′, h1, h2 ∈ children′(h), allLeaves′(h1) ∩ allLeaves′(h2) = ∅,
CA′ : CH′ → R, CA′(h1, . . . , hn) = CA(h1, . . . , hn), for (h1, . . . , hn) ∈ CH′.
The slice operation is a special case of the dice operation. It produces a subcube by selecting a single member for one of its dimensions.
Definition 5 (Slice operation). Let OC, OC = (CS, CA). The slice operation is slicek,v(OC) = OC′, where 1 ≤ k ≤ n, v ∈ Dk, and OC′ = diceCD′(OC) with CD′ = D1 × . . . × Dk−1 × {v} × Dk+1 × . . . × Dn.
Note that an OLAP cell can be defined as an OLAP subcube obtained by slicing each of
the OLAP cube dimensions. Let OC, OC = (CS, CA). The OLAP cell is slice1,v1(slice2,v2(. . . slicen−1,vn−1(slicen,vn(OC)) . . .)) = OC′.
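A minimal Python sketch (illustrative only) of Definitions 4 and 5: a cube is represented as a mapping from coordinate tuples to values, dice keeps the cells whose coordinates lie in the selected member sets, and slice is the special case that pins one dimension to a single member. All data values are hypothetical.

    # Illustrative sketch of the dice and slice operations (Definitions 4 and 5).
    cube = {
        ("Netherlands", "2012Jan", "Sales"):  120.0,
        ("Netherlands", "2012Feb", "Sales"):   95.0,
        ("Germany",     "2012Jan", "Sales"):  210.0,
        ("Germany",     "2012Feb", "Profit"):  30.0,
    }

    def dice(cube, selections):
        """Keep cells whose i-th coordinate is in selections[i] (None keeps all members)."""
        def keep(coord):
            return all(sel is None or c in sel for c, sel in zip(coord, selections))
        return {coord: v for coord, v in cube.items() if keep(coord)}

    def slice_(cube, k, member):
        """Slice: a dice that selects a single member on dimension k."""
        selections = [None] * len(next(iter(cube)))
        selections[k] = {member}
        return dice(cube, selections)

    print(dice(cube, [{"Germany"}, None, None]))     # the Germany subcube
    print(slice_(cube, 1, "2012Jan"))                # all cells for January 2012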
By slice and dice operations, various OLAP subcubes are isolated. To make them useful
for analysis purposes, the data from the cube should be visualized. Although the cube is a
multidimensional structure, only two dimensions can be visualized at a time.
The pivoting (or rotation) operation changes the visualization perspective of the OLAP cube by swapping two dimensions Di and Dj.
Definition 6 (Pivoting operation). Let OC, OC = (CS, CA) with CD = D1 × D2 × . . . × Di × . . . × Dj × . . . × Dn and CH = H1 × H2 × . . . × Hi × . . . × Hj × . . . × Hn. The pivoting operation is pivoti,j(OC) = OC′, where 1 ≤ i, j ≤ n,
OC′ = (CS′, CA′),
CS′ = (CD′, CH′),
CD′ = D1 × D2 × . . . × Dj × . . . × Di × . . . × Dn,
CH′ = H1 × H2 × . . . × Hj × . . . × Hi × . . . × Hn,
children′ : Hi′ → P(Hi′), children′(h) = children(h),
allLeaves′ : Hi′ → P(Hi′), allLeaves′(h) = allLeaves(h),
h ∈ Hi′, h1, h2 ∈ children′(h), allLeaves′(h1) ∩ allLeaves′(h2) = ∅,
CA′ : CH′ → R, CA′(h1, . . . , hj, . . . , hi, . . . , hn) = CA(h1, . . . , hi, . . . , hj, . . . , hn), for (h1, . . . , hj, . . . , hi, . . . , hn) ∈ CH′.
The roll-up operation consolidates some of the elements of a dimension into one element, which
corresponds to a hierarchically superior level.
Definition 7 (Roll-up operation). Let OC, OC = (CS, CA) and v ∈ Hk, where 1 ≤ k ≤ n. The roll-up operation is rollupk,v(OC) = OC′, where OC′ = (CS′, CA) with CS′ = (CD′, CH), and CD′ = D1 × . . . × Dk−1 × ((Dk \ children(v)) ∪ {v}) × Dk+1 × . . . × Dn.
The drill-down operation refines a member of a dimension into a set of members, corresponding
to a hierarchically inferior level.
Definition 8 (Drill-down operation). Let OC, OC = (CS, CA) and v ∈ Dk, where 1 ≤ k ≤ n. The drill-down operation is drilldownk,v(OC) = OC′, where OC′ = (CS′, CA) with CS′ = (CD′, CH), and CD′ = D1 × . . . × Dk−1 × ((Dk \ {v}) ∪ children(v)) × Dk+1 × . . . × Dn.
2.3.2 The Many Flavors of OLAP
Before the introduction of the OLAP principle, relational databases were the most widely used technology for enterprise databases. Relational databases are stable and trustworthy and can be used
for storing, updating and retrieving data. However, they provide limited functionality to support
user views of data. Most notably lacking was the ability to consolidate, view, and analyze data
according to multiple dimensions, in ways that make sense to one or more specific enterprise analysts at any given point in time [20]. Consequently, OLAP facilities were designed to compensate
for the limitations of the conventional relational databases.
The OLAP Server functionality had to be implemented on top of an existing database technology. Relational databases were considered to be amongst the most reliable and popular types of
databases [21]. Naturally, one of the proposed solutions was to add OLAP characteristics on top
of a relational model. This is how the ROLAP (Relational OLAP) category came into existence.
The OLAP layer provides a multidimensional view, calculation of derived data, slice, dice and
drill-down intelligence and the relational database gives an acceptable performance by employing
a Star-schema or Snowflake data model [21, 43].
While being the most appropriate database type for OLTP due to its design, the relational database is not as good an option for OLAP [20, 25]. Even though it offers close to real-time data loading and has advantages in terms of capacity, ROLAP presents slow query performance and is not
always efficient when aggregating large amounts of data.
Instead, a multidimensional database approach was deemed more suited [11, 54]. Known under the name of MOLAP (Multi-dimensional OLAP), this type of OLAP is created to achieve the highest possible query performance. Still, MOLAP has its own deficiencies. MOLAP works best for cubes with a limited number of sparse dimensions. Sparse data within large cubes
often causes performance problems.
Hence, the advantages of ROLAP are the disadvantages of MOLAP and vice versa. Therefore,
the HOLAP (Hybrid OLAP) version was introduced as the combination of the two, to compensate
for the deficiencies of each technology [46]. HOLAP is one of the OLAP types that has gone mainstream among the next-generation OLAP solutions. Additional technologies, such as in-memory OLAP, are considered for speed-oriented systems. Nonetheless, depending on data characteristics (e.g., summarized, detailed), one or a combination of these technologies can be considered. Even though multi-hybrid models (e.g., MOLAP and real-time in-memory for analysis and HOLAP for drill-through) are designed to incorporate most of the OLAP benefits, there is still no generic OLAP architecture or
standard procedure to guarantee optimal performance independent of the requirements.
With the growth of available memory capacity and because memory prices are decreasing with
time, the feasibility of storing large databases in memory increases. As a consequence, disk-based databases are replaced more and more often with in-memory database technology. While conventional disk-based database systems (DRDB) store data on disk, main memory database systems (MMDB) [26] store and access data directly from the main physical memory. Therefore, the response times and transaction throughputs of an MMDB are considerably better than those of a disk-based database system. Obviously, a DRDB still has advantages in terms of capacity. There are very large databases that simply cannot fit in memory, e.g., databases containing NASA space data (with images). However, it is difficult for a DRDB to compete with the speed of an MMDB. That
is, a database of a reasonable size stored in-memory outperforms a database stored on disk.
Chapter 3
Process Cube
In Section 1.3, the goal of this master project was described as to create a proof-of-concept tool
to allow comparison of multiple processes. In Section 1.4, the process cube was introduced as a
means to satisfy the goal. Both process mining and OLAP aspects were described in Chapter 2.
Being the central component of the system, the process cube links the process mining framework
to the existing OLAP technology. By storing event logs in OLAP multidimensional structures,
event data can be used to obtain and compare process mining results. In this chapter, the concept
of the process cube is explained in detail, together with an example that shows its functionality
and a comparison with other hypercube structures. Before proceeding with the process cube
materialization in Chapter 4, a set of requirements is established and enumerated at the end of
the chapter.
3.1 Process Cube Concept
In Section 2.2.1, the definitions of an event with attributes (Definition 1) and of a case with
attributes (Definition 2) were given. Section 2.3.1 includes the definition of an OLAP cube (Definition 3) with its corresponding operations (Definitions 4, 5, 6, 7, 8). In this section, the process
cube and process cell notions are introduced by adding event log aspects into the OLAP cube
definition. For a further elaboration and formalization of the process cube concept see the paper
[6], which was published towards the end of this project.
Figure 3.1: Process Cube Concept.
Figure 3.1, taken from [4], shows relevant process cube characteristics and is therefore representative of the definitions of the different process cube concepts given below (e.g., process cube, process cell). A detailed discussion of the elements of Figure 3.1 is presented in [6].
A process cube is a multidimensional structure built from event log data in a way that facilitates
further meaningful process mining analysis. A process cube is composed of a set of process cells [4]
and the main difference between a process cube and an OLAP cube lies in its cell characteristics.
In contrast to the OLAP cube, there is no real measure of interest quantifying a business operation.
While OLAP structures are designed for business operations analysis, the process cube aims
at analyzing processes. Therefore, each dimension of analysis is composed of event attributes.
Consequently, the content of a cell in the process cube changes from real numbers to events.
While in OLAP, dimensions of analysis are used to populate the cube, in the case of process cubes the events of an event log are used to create the dimensions of analysis. Hence, instead of the MI function, the event members function is defined as EM : E → DA1 × . . . × DAn. Note that to
differentiate between two events with the same attributes, the event id is added as a dimension of
analysis. Consequently, for each event there will be a unique combination of dimension of analysis
members.
Definition 9. (Process cube)
Let Di , 1 ≤ i ≤ n be a set of dimension elements, where n is the number of dimensions,
Hi , 1 ≤ i ≤ n be a set of hierarchy elements,
CD = D1 × D2 × . . . × Dn be the cube dimensions,
CH = H1 × H2 × . . . × Hn be the cube hierarchies,
children : Hi → P(Hi ), where children(h) is the function returning the children of h ∈ Hi ,
allLeaves : Hi → P(Hi ), where allLeaves(h) is the function returning all leaves of h ∈ Hi ,
h ∈ Hi , h1 , h2 ∈ children(h), allLeaves(h1 ) ∩ allLeaves(h2 ) = ∅,
CS = (CD, CH) be the process cube structure,
CE : CH → P(E) be the cell event function, CE(h1, h2, . . . , hn) = {e ∈ E | (d1, d2, . . . , dn) = EM(e), di ∈ allLeaves(hi), 1 ≤ i ≤ n}, for (h1, h2, . . . , hn) ∈ CH.
A process cube is defined as P C = (CS, CE).
Note that a process cell can be defined as a subcube obtained by slicing each of the process cube dimensions. Let PC, PC = (CS, CE). The process cell is slice1,v1(slice2,v2(. . . slicen−1,vn−1(slicen,vn(PC)) . . .)) = PC′. Each cell in the process cube corresponds to a set of events [4], returned by the cell event function CE.
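Since each process cell, or group of cells, can be used to generate an event log, the sketch below (illustrative only; attribute names are hypothetical and this is not the thesis implementation) shows how the events selected through such cells could be regrouped into traces, i.e., how a sublog could be materialized.

    # Illustrative sketch: regrouping the events of selected process cells into a sublog.
    from collections import defaultdict

    def materialize_sublog(events, selected_cells, coord):
        """events: event dicts; coord(e): cell coordinates of e;
        selected_cells: coordinate tuples chosen via slice/dice operations."""
        cases = defaultdict(list)
        for e in events:
            if coord(e) in selected_cells:
                cases[e["case_id"]].append(e)
        # Order events within each trace, since the order determines the control flow.
        return {cid: sorted(es, key=lambda e: e["time"]) for cid, es in cases.items()}

    events = [
        {"case_id": 1, "id": "e1", "activity": "register", "resource": "John", "time": 1},
        {"case_id": 1, "id": "e2", "activity": "check",    "resource": "Pete", "time": 2},
        {"case_id": 2, "id": "e3", "activity": "register", "resource": "John", "time": 3},
    ]
    coord = lambda e: (e["activity"], e["resource"])
    sublog = materialize_sublog(events, {("register", "John"), ("check", "Pete")}, coord)
    print(sublog)   # case 1 keeps two events, case 2 keeps one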
The process cube, as defined above, is a structure that does not allow overlapping of events
in its cells. To allow the comparison of different processes using the process cube, a table of
visualization is created. The table of visualization is used to visualize only two dimensions at a
time. Multiple slice and dice operations can be performed by selecting different elements of the
two dimensions. Each slice, dice, roll-up or drill-down is considered to be a filtering operation.
Hence, a new filter is created with each OLAP operation. Filters are added as rows/columns in
the table of visualization. Note that unlike the cells of the process cube, the cells of the table of
visualization may contain overlapping events. That is because there is no restriction in selecting
the same dimension members for two filtering operations.
Given a process cube PC, a process model MPC is the result of a process discovery algorithm,
such as Alpha Miner, Heuristic Miner or other related algorithms, used on P C. However, there
are various process mining algorithms whose results are not necessarily process models. Instead,
they can offer some insightful process-related information. For example, Dotted Chart Analysis
provides metrics (e.g., average interval between events) related to events and their distribution
over time. Process cubes are likewise not limited to process models. Therefore, we refer to process
mining results just as models.
So far, we described the process cube as being a hypercube structure, with a finite number
of dimensions. In [4], a special process cube is presented, with three dimensions: case type (ct),
event class (ec) and time window (tw).
Figure 3.2: Event log excerpt.

Figure 3.2, taken from [4], contains a table corresponding to a fragment of an event log. Let
the event data from the event log be used to construct a process cube P C. Then, the ct, ec and
tw dimensions are established as follows. The case type dimension is based on the properties of
a case. For example, the case type dimension can be represented by the type of the customer,
in which case, the members of ct are gold and silver, i.e., D1 = {gold, silver}, H1 = D1. The
event class dimension is based on the properties of an event. For example, ec can be represented
by the resource and include, as such, the following members: D2 = {John}, H2 = D2 . The time
window dimension is based on timestamps. A time window can refer to years, months, days of
week, quarters or any other relevant period of time. Due to its natural hierarchical structure, the tw
dimension can be organized as a hierarchy, e.g., 2012 → 2012Dec → 2012DecSun. We consider
D3 = {2012DecSun} and H3 = {2012, 2012Dec, 2012DecSun}.
Let
D1 = {gold, silver}, D2 = {John} and D3 = {2012DecSun}
H1 = {gold, silver}, H2 = {John} and H3 = {2012, 2012Dec, 2012DecSun}
CD = D1 × D2 × D3 be the cube dimensions,
CH = H1 × H2 × H3 be the cube hierarchies,
h1 , h2 ∈ H3 , h1 = 2012, children(h1 ) = {2012Dec}, h2 = 2012Dec, children(h2 ) = {2012DecSun},
h1 , h2 ∈ H3 , h1 = 2012, allLeaves(h1 ) = {2012DecSun}, h2 = 2012Dec, allLeaves(h2 ) = {2012DecSun},
CS = (CD, CH) be the process cube structure,
h1 ∈ H1 , h1 = gold, allLeaves(h1 ) = {gold}, h2 ∈ H2 , h2 = John, allLeaves(h2 ) =
{John}, h3 ∈ H3 , h3 = 2012, allLeaves(h3 ) = {2012DecSun}.
CE(h1 , h2 , h3 ) = {35654423}, CC(35654423) = (gold, John, 2012DecSun).
For the rest of the elements of CH, CE is defined in the same way.
The process cube is defined as P C = (CS, CE).
Each process cell l can be used to discover a process model, Ml. However, a process model can also
be discovered from a group of cells Q, MQ, or from the entire process cube P C, MP C. Figure 3.3
shows a process model discovered from all the event data in the process cube P C. MP C is the
process model discovered with the Alpha Miner algorithm from the set of events returned by CE.
This is possible when considering the process cube corresponding to a single cell in the table of
visualization.

Figure 3.3: A process model discovered from an extended version of the event log in Figure 3.2 using the Alpha Miner algorithm.
3.2 Process Cube by Example
In the previous section, the process cube was introduced together with a formalization of its
relevant concepts. In this section, we continue with describing its functionality by means of an
example.
Figure 3.4: Functionality in three steps: 1. From XES data to process cube structure. 2. Applying
OLAP operations to the process cube. 3. Materialization of process cells.
We propose an approach with functionality in three steps, as depicted in Figure 3.4. In the first step,
the event data for this example is presented in a XES-like format. The event data is then used to
construct a process cube prototype. While building the process cube, its various characteristics
are clearly specified by referring to definitions from Section 3.1. The aim of the second step is to
show ways of exploring the process cube. In that sense, a range of OLAP operations (e.g., slice,
dice, roll-up, drill-down, pivoting) are applied to it. As such, the process cube is prepared for the
last step - the process cube analysis. More precisely, in the third step, it is described how parts
of the process cube are materialized as event logs and then used to obtain process models. These
models can then be compared to discover similarities and dissimilarities between their underlying
processes.
3.2.1 From XES Data to Process Cube Structure
Table 3.1 contains the event data used in this example to illustrate the process cube functionality.
This data is needed to build the process cube structure. In practice, explicit case ids and/or
the event ids may be missing. From Definition 1 and Definition 2, both events and cases are
represented by unique identifiers. Therefore, when these identifiers do not exist in the original
data source, they can be automatically generated when extracting the data.
The definition of the process cube (Definition 9) describes the process cube as a n−dimensional
structure. Thus, establishing the dimensions is an important step in the creation of a process cube.
There is no unique way of deciding on the dimensions of a process cube. One possibility is to select each
case attribute and event attribute as a dimension. When applied to our example, this choice leads
to a process cube with 5 dimensions. Should the case id and the event id be also considered, the
final structure is a 7-dimensional process cube structure. By considering each different attribute
value as a dimension member, the resulting process cube has 4 × 2 × 2 × 43 × 43 × 14 × 2 = 828, 352
process cells. It is easy to notice that the case id, event id and timestamp are sparse dimensions,
causing the entire process cube to be sparse. Sparsity was discussed in Section 2.3.1.
case id | parts  | sum leges
1       | Bouw   | 138.55
2       | Bouw   | 138.55
3       | Milieu | 179.8
4       | Bouw   | 138.55

event id | timestamp           | activity     | resource
1        | 2012-02-21T11:52:13 | 01 HOOFD 010 | 560464
2        | 2012-02-21T11:56:31 | 01 HOOFD 020 | 560464
3        | 2012-02-21T12:15:07 | 01 HOOFD 040 | 560925
4        | 2012-02-21T12:19:22 | 01 HOOFD 050 | 560464
5        | 2012-02-21T12:50:18 | 01 HOOFD 055 | 560464
6        | 2012-02-21T14:09:49 | 01 HOOFD 060 | 560925
7        | 2012-03-08T12:03:11 | 01 HOOFD 010 | 560464
8        | 2012-03-08T12:07:53 | 01 HOOFD 020 | 560464
9        | 2012-03-08T12:31:15 | 01 HOOFD 040 | 560925
10       | 2012-03-08T13:22:08 | 01 HOOFD 060 | 560925
11       | 2012-03-08T13:35:47 | 01 HOOFD 065 | 560925
12       | 2012-03-08T14:53:34 | 01 HOOFD 120 | 560925
13       | 2012-03-08T15:20:55 | 01 HOOFD 260 | 560464
14       | 2012-03-08T15:36:19 | 09 AH I 010  | 560925
15       | 2012-03-08T15:56:41 | 01 HOOFD 430 | 560925
16       | 2012-03-12T09:03:52 | 01 HOOFD 010 | 560464
17       | 2012-03-12T09:08:21 | 01 HOOFD 020 | 560464
18       | 2012-03-12T09:17:39 | 01 HOOFD 040 | 560925
19       | 2012-03-12T09:42:48 | 01 HOOFD 050 | 560925
20       | 2012-03-12T10:15:07 | 06 VD 010    | 560925
21       | 2012-03-12T10:24:56 | 01 HOOFD 120 | 560925
22       | 2012-03-12T10:49:01 | 01 HOOFD 180 | 560925
23       | 2012-03-12T11:18:19 | 01 HOOFD 260 | 560925
24       | 2012-03-15T13:11:06 | 01 HOOFD 010 | 560464
25       | 2012-03-15T13:15:27 | 01 HOOFD 020 | 560464
26       | 2012-03-15T13:37:42 | 01 HOOFD 040 | 560925
27       | 2012-03-15T14:02:18 | 01 HOOFD 050 | 560925
28       | 2012-03-15T14:19:32 | 01 HOOFD 065 | 560925
29       | 2012-03-15T15:06:11 | 01 HOOFD 120 | 560464
30       | 2012-03-15T15:46:37 | 01 HOOFD 180 | 560464
31       | 2012-03-15T16:10:44 | 01 HOOFD 260 | 560464
32       | 2012-03-15T16:42:01 | 01 HOOFD 380 | 560464
33       | 2012-03-15T16:53:26 | 01 HOOFD 430 | 560925

Table 3.1: Event Log Example

Another possibility is to limit the number of dimensions to three, as suggested in [4]. Based
on the case properties, the case type dimension can contain members created from both the parts
and the sum leges attributes. The parts attribute specifies for which building parts a building
permit can be requested, e.g., Bouw, Milieu. The sum leges attribute gives the total cost of a
building permit application, e.g., 138.55, 179.8. At this point, it is important to establish a
representative dimension member, as it can influence further analysis. This can be achieved, for
instance, by employing data mining techniques. For this example, we describe a simple two-step
approach. First, cases are grouped in clusters, based on their properties. It is obvious that cases
1, 2 and 4 belong to one cluster, as they all have the same case properties, and case 3 belongs
to another cluster. Secondly, a classification (decision tree learning algorithm) is used on the
clustering results. In this example, we expect to identify, after classification, a representative
number, e.g., 150, for the sum leges attribute that would differentiate between the two clusters.
Consequently, the following two case type dimension members can be considered representative:
parts = Bouw, sum leges < 150 and parts = Milieu, sum leges >= 150. The difficulty of this
approach is that it requires data mining knowledge to store the event data in the process cube.
There is also a middle-ground approach. For instance, the number of dimensions can still be
kept small, but not necessarily limited to three. Moreover, one dimension can contain a single
property instead of a combination of properties. In this case, the attributes that do not end up as
dimensions can still be stored in a cell. For this example, we consider 4 dimensions: parts, activity,
resource and timestamp. The parts dimension has two elements, D1 = {Bouw, Milieu}. The
resource dimension also has two elements, D2 = {560464, 560925}. The activity dimension consists
of 15 elements, e.g., 01 HOOFD 010, 09 AH I 010 and others. While the first three dimensions
have a relatively small number of members, the last dimension consists of 43 different members.
To reduce this number, only the year, the month and the day of the week are considered for the
timestamp dimension and the rest is stored in the cell. Consequently, the size of the timestamp
dimension is reduced to three: 2012FebTue, 2012MarMon and 2012MarThu. As a result, the
process cube P C consists of 2 × 14 × 3 × 2 = 168 process cells.
To show what the content of a process cell is for the process cube P C, we use the CE function on
a set of selected hierarchy elements. For h1 ∈ H1, h1 = Bouw, allLeaves(Bouw) = {Bouw}, h2 ∈
H2, h2 = 560925, allLeaves(h2) = {560925}, h3 ∈ H3, h3 = 01 HOOFD 040, allLeaves(h3) =
{01 HOOFD 040}, h4 ∈ H4, h4 = 2012MarThu, allLeaves(h4) = {2012MarThu}, the CE
function returns CE(h1, h2, h3, h4) = {9, 26}. Both
CC(9) = (Bouw, 560925, 01 HOOFD 040, 2012MarThu) and
CC(26) = (Bouw, 560925, 01 HOOFD 040, 2012MarThu)
return the same tuple of hierarchy elements. Event data that is not yet stored as dimension values
can still be stored in the process cell containing events 9 and 26, as shown in Table 3.2.
case id | sum leges | event id | timestamp
2       | 138.55    | 9        | 2012-03-08T12:31:15
4       | 138.55    | 26       | 2012-03-15T13:37:42

Table 3.2: Event data corresponding to the process cell defined by CE(h1, h2, h3, h4) = {9, 26}.
3.2.2 Applying OLAP Operations to the Process Cube
In Section 2.3.1, the following OLAP operations were described: slice, dice, pivoting, roll-up and
drill-down. In this section, we show, by means of an example, how these operations can be applied
on a process cube.
[Figure 3.5 depicts the four dimensions D1 (parts), D2 (resource), D3 (activity) and D4 (timestamp), with the timestamp hierarchy H4: 2012 → 2012Feb, 2012Mar → 2012FebTue, 2012MarMon, 2012MarThu.]

Figure 3.5: Process cube by example. With orange, 2012FebTue and 2012MarThu are selected for the timestamp dimension and are used for dicing the process cube. With green, a subcube is illustrated, which is the result of slicing the previous subcube on the 560464 member of the resource dimension. With red, a subcube is illustrated, which is the result of slicing the previous subcube on the 560925 member of the resource dimension.
Figure 3.5 illustrates the 4-dimensional process cube P C, constructed in the previous step. To
represent the 4D structure in a 2D plane, first the members of the timestamp hierarchy are displayed
on the left. The root element of the hierarchy is the year 2012, followed by the month elements,
2012Feb and 2012Mar, and having the days of the week as the leaf nodes, 2012FebTue, 2012MarMon
and 2012MarThu. To each leaf member of the timestamp dimension corresponds a 3D subcube
like the one on the right.
For the process cube P C, we choose to first perform a dice, by selecting the 2012FebTue and the
2012MarThu members on the timestamp dimension. Let P C = (CS, CE) and Di′ = Di for
all 1 ≤ i ≤ 3, D4′ = {2012FebTue, 2012MarThu}. The dice operation is diceCD′(P C) = P C′,
where
P C′ = (CS′, CE′),
CS′ = (CD′, CH′),
CD′ = D1′ × D2′ × D3′ × D4′,
CH′ = H1 × H2 × H3 × H4′,
allLeaves(2012) = {2012FebTue, 2012MarMon, 2012MarThu},
allLeaves(2012FebTue) = {2012FebTue}.
Then, allLeaves(2012) ∩ allLeaves(2012FebTue) = {2012FebTue}, . . .
H4′ = {2012, 2012Feb, 2012Mar, 2012FebTue, 2012MarThu},
h ∈ H4, h = 2012Mar, children(h) = {2012MarMon, 2012MarThu},
children′(h) = children(h) ∩ H4′, children′(h) = {2012MarThu}, . . .
h ∈ H4, h = 2012Mar, allLeaves(h) = {2012MarMon, 2012MarThu},
allLeaves′(h) = allLeaves(h) ∩ H4′, allLeaves′(h) = {2012MarThu}, . . .
CE′(h1, . . . , h4) = CE(h1, . . . , h4), for (h1, . . . , h4) ∈ CH′.
Further, two slice operations are performed on the diced subcube P C′, by selecting first the
560464 and then the 560925 member of the resource dimension. The resulting subcubes, P C560464 and
P C560925, are still 4D structures, although they have only one member on the resource dimension. The
corresponding 3D subcubes, with the timestamp dimension left aside due to representation issues, are
depicted in Figure 3.5. The P C560464 subcube is represented in green and the P C560925 subcube
is represented in red.
The slice operation where the 560464 resource is selected is slice2,560464(P C′) = P C560464,
P C560464 = diceCD560464(P C′) with CD560464 = D1′ × {560464} × D3′ × D4′. The slice operation where
the 560925 resource is selected is slice2,560925(P C′) = P C560925, P C560925 = diceCD560925(P C′) with
CD560925 = D1′ × {560925} × D3′ × D4′.
While slice and dice operations are used to select parts of a process cube, pivoting, roll-up and
drill-down operations help in visualizing the selections. As mentioned in Section 2.3.1, only two
dimensions out of all the process cube dimensions, can be visualized at a time. For example, in
Figure 3.5, dimensions parts and resource can be easily visualized. This part of the cube indicates
which resources are responsible for handling cases for Bouw and which for M ilieu. It is possible
to visualize also the activity dimension, but not all its elements can be clearly distinguished.
Through the pivoting (or rotation) operation, the visualization perspective of the process cube can be
changed. For example, by selecting the activity dimension on the x axis instead of the parts dimension
and the parts dimension on the y axis instead of the activity dimension, the cube is rotated and a new side
of it can be visualized. Such a change makes it easy to distinguish the activities corresponding to
Bouw and M ilieu parts, together with their corresponding cells.
The pivoting operation is pivot1,3(P C′) = P Cp′.
P Cp′ = (CSp′, CEp′),
CSp′ = (CDp′, CHp′),
CDp′ = D3′ × D2′ × D1′ × D4′,
CHp′ = H3′ × H2′ × H1′ × H4′,
children′(h) = children(h),
allLeaves′(h) = allLeaves(h),
CEp′(h3, h2, h1, h4) = CE′(h1, h2, h3, h4).
The roll-up and drill-down operations have an impact when applied on a dimension with a
hierarchical structure. Through a roll-up operation, members of a hierarchically inferior level
are replaced with a member of a hierarchically superior level. For this example, we consider the
timestamp dimension with its elements 2012FebTue, 2012MarMon and 2012MarThu. A roll-up
operation on the children of 2012Mar replaces the current timestamp elements with 2012FebTue
and 2012Mar.
The roll-up operation is then rollup4,2012Mar(P C′) = P Cr′, where P Cr′ = (CSr′, CE) with
CSr′ = (CDr′, CH), and CDr′ = D1′ × D2′ × D3′ × ((D4′ \ children(2012Mar)) ∪ {2012Mar}).
While the roll-up operation folds elements from an inferior hierarchical level into elements of
a superior one, the drill-down operation expands members from hierarchically superior levels. We
consider again the timestamp dimension. For the previous P Cr′ subcube, a drill-down operation on
the 2012Mar element replaces the current dimension elements with 2012FebTue, 2012MarMon
and 2012MarThu.
The drill-down operation is then drilldown4,2012Mar(P Cr′) = P Cd′, where P Cd′ = (CSd′, CE)
with CSd′ = (CDd′, CH), and CDd′ = D1′ × D2′ × D3′ × ((D4′ \ {2012Mar}) ∪ children(2012Mar)).
3.2.3 Materialization of Process Cells
In the previous step, the applicability of the OLAP operations was shown by means of an example.
The main emphasis was on the changes that occurred at the dimension level. Naturally, the
question arises as what happens at the cell level. The last step of our approach gives an answer
to this question. We rely in our explanation on Figure 3.6, presented in more detail in [6].
Figure 3.6: Partitioning of the process cube. The split operation is realized by drill-down. The
functionality of the merge operation is given by roll-up.
The left part of Figure 3.6 shows the process cube created from an extended version of the event
log in Figure 3.2. In the process cube, the top part depicts a simplified event log corresponding
to the process cube. The step of extracting an event log based on the event data from the process
cube or from parts of it (process cells or groups of cells) is known as the materialization step. The
resulting event logs are then given as input to different process mining algorithms. The outcome
is a set of process models which can be visualized. Returning to our example, the event log shown at
the top of the process cube is used to obtain the process model shown at the bottom, by applying
the Alpha Miner algorithm on it.
The right part of Figure 3.6 shows the result of splitting the process cube from the left on
its case type and event class dimensions. In the figure, two types of splitting can be identified.
A vertical split separates entire cases. For example, by splitting on the case type
dimension, cases 1, 4, 5, 6 are separated from cases 2, 3, 7, 8. The results of a horizontal split are no
longer whole cases, but rather parts of cases corresponding to subsets of activities. For example,
by splitting on the event class dimension, activities A, C are representative for the cell given by
CE(silver customer, sales, 2012) and activities C, D, E, F, G are representative for the cell given by
CE(silver customer, delivery, 2012). Note that activity C is present in both cells, i.e., activity C
can be executed in both the sales and delivery departments. This is possible as the activity attribute
is not a dimension in the process cube and therefore, the same activity can be present in multiple
cells.

(a) The resulting process model after slicing on the 560464 resource. (b) The resulting process model after slicing on the 560925 resource.
Figure 3.7: Process mining results for P C560464 and for P C560925.
When related to the OLAP operations, the split operation is realized by the drill-down operation and the merge operation is realized by the roll-up operation.
In the second step, based on a process cube example, several OLAP operations were presented.
After “playing” with the process cube, one is interested in materializing the selected parts of
the process cube and obtaining meaningful process mining results. The P C560464 and P C560925
subcubes are among the subcubes obtained in the second step. Figure 3.7a presents the resulting
process model MP C560464 for the process cube P C560464. Similarly, Figure 3.7b presents the resulting process model MP C560925 for the process cube P C560925. Now the two process models can
be compared to find differences and similarities. An immediate similarity is that both processes
contain the same activities 01 HOOFD 050 and 01 HOOFD 120. There are a large number of
differences, related both to the activities and also to the control flow. One could start by noticing that one process starts with activity 01 HOOFD 010, while the other starts with activity
01 HOOFD 040.
3.3 Requirements
Now that we have established the desired functionality of a process cube, the next step is to
find technologies and methods to turn the process cube concept into a real application. There
is no fixed recipe that guarantees the achievement of this goal. Multiple tools are available that
can accommodate the desired process cube functionality and there is certainly more than one
solution to approach the problem. Nevertheless, there is a list of requirements that should be met,
independent of the chosen technology and the solution for implementation.
As our goal is to create a proof-of-concept tool that exploits OLAP features to accommodate
process mining solutions such that the comparison of multiple processes is possible, and based on
the process cube functionality presented in this chapter, the following requirements are derived:
1. The system shall include an OLAP Server with support for traditional OLAP operations.
2. External tools shall be open to adjustments. They shall offer the possibility to add new
functionality and change the existing one.
3. The application shall be programmed in Java to enable integration with ProM.
4. External tools shall provide means to enable their employment in a Java-written system.
The first requirement is quite straightforward, considering the goal of this project. The OLAP
Server organizes data in multidimensional structures, which facilitates the inspection of the stored
data from different perspectives. In that sense, the OLAP Server can be also used to examine the
different views of a process. Employing traditional OLAP operations on the OLAP multidimensional structures, provides quick and facile filtering. By means of this functionality, the integrated
analysis on multiple processes can be supported.
Since the OLAP Server is an indispensable component of the system, it has to be either
created from scratch or employed from an external tool. Creating an OLAP Server from scratch,
undoubtedly implies a vast amount of work. Under the circumstances, employing an already
existing OLAP Server, to save time, seems to be a plausible idea. Moreover, parts of an OLAP
Client application can be also reused to save time. However, in this case, the second requirement
has to be considered. The existing OLAP tools cannot handle event logs and do not support
process-mining analysis. Therefore, an external OLAP tool shall allow adding this functionality
and changing the existing one, should this be the case. This is possible only if the external tool is
open-source.
The ProM Framework was introduced in Section 2.2.2 as a platform hosting multiple plugins that
implement different process mining algorithms. Clearly, it is wise
to use the already existing process mining techniques as they provide sufficient methodology to
perform process analysis. However, to facilitate the easy integration with ProM, Java is the
preferred programming language.
The fourth requirement comes as a consequence of the third requirement. External parties
must possess interfacing capabilities with the system. Since the main application has to be written
in Java, external tools should be either Java-based or provide a Java Application Programming
Interface (API) to allow their employment in the system.
3.4 Comparison to Other Hypercube Structures
Before starting with the process cube implementation, a literature study is performed to identify
the cubes with the closest functionality and requirements to the process cube. The reason for
doing this is threefold. First, one can find similarities with other hypercube structures, in which
case, some of its functionality can be reused. Secondly, identifying limitations of the current
multidimensional structures, helps in clarifying what is still to be done. Finally, previous work on
similar OLAP cubes can suggest where one could expect difficulties.
Data loaded in traditional OLAP cubes come from different sources, e.g., multiple data warehouses. Due to the considerable growth of stored data, simple ways of data representation are
sought to conveniently keep data outside local databases. OLAP cubes are also adjusted to handle
data in different formats. For example, OLAP cubes can be specified on XML data [34]. Still,
OLAP cubes cannot support data in XES format, typical for event logs, because of the specific
characteristics of event data.
OLAP cubes are designed to work with numerical measures, and various ways of computing
numerical aggregates are explored, from traditional sum, count and average to sorting-based algorithms [10] and agglomerative hierarchical clustering [40]. In [45], several measures are proposed
to summarize process behavior in a multidimensional process model. Among those, instance-based
measures (e.g., average throughput time of all process instances), event-based measures (e.g., average execution time of process events), flow-based measures (e.g., average waiting time between
process events), are the most relevant.
In the last years, also non-numerical data have been considered in an OLAP setting. OLAP
cubes have been extended to graphs [52], sequences [37, 38] and also to text data [36]. Creating
a Text Cube became possible by employing information retrieval techniques and selecting term
frequency and inverted index measures.
In [45], the Event Cube is presented. Unlike other OLAP cubes, this multidimensional structure is constructed for the inspection of different perspectives of a business process, which in fact,
coincides with the purpose of the process cube. To accomplish this, event information is summarized by means of different measures of interest. For instance, the control-flow measure is used to
directly apply the Multidimensional Heuristics Miner process discovery algorithm. The difficulty
with respect to this approach, is that traditional process mining techniques have to be extended
with multidimensional capacity, in the same way as it was done for the Flexible Heuristics Miner:
the Multidimensional Heuristics Miner was introduced as a generalization of the Flexible Heuristics Miner, to handle multidimensional process information. Of course, extending existing process
mining techniques requires a lot of effort. Therefore, we propose a more conceptually clear and
more generic approach. That is, instead of adjusting all process mining techniques to multidimensionality, the OLAP multidimensional structure can be adjusted to allow employing existing
process mining techniques, without the need of changing them.
All in all, the process cube is unique as it allows the storage of event data in its multidimensional
structure, which is further used for process analysis purposes by employing existing process mining
techniques. This approach creates a bridge between process mining and OLAP, as methods from
both fields are interchangeably applied. The advantage is that quick discovery and analysis of
business processes and of their corresponding sub-processes is facilitated in an integrated way.
Moreover, no changes to the applied traditional process mining techniques are needed.
Chapter 4
OLAP Open Source Choice
Based on the conceptual aspects previously introduced, in the following chapters we continue with
describing the prototype solution. Before going into detail with respect to the implementation, in
this chapter we give the motivation for our technology choice.
The process cube formalization from Chapter 3 indicates the need for process mining and
OLAP support. For process mining, the selected framework is ProM, introduced in Section 2.2.2,
as it is the leading open source tool for process mining. Other commercial process mining systems
exist, e.g., Futura Reflect, Fluxicon, Comprehend, ARIS Process Performance Manager [12], but
ProM contains many plugins that allow effective process mining discovery and analysis. A part of
these plugins are chosen for this project. In addition to the OLAP database, we also use a classical
relational database to store event data, which is only used for event log reconstruction. There is
a vast array of possibilities when it comes to available relational database systems, e.g., Oracle
Database, Microsoft SQL Server, MySQL, IBM DB2, SAP Sybase, just to name a few. As there
are no special benefits of using one relational database over another, in our project we choose
MySQL, as it is one of the most widely used database systems in the world.
For OLAP, on the other hand, it is difficult to make an immediate decision with respect to
the tool selection. There are multiple technologies available, which vary in terms of the used
database type, e.g., classical relational, multidimensional, hybrid; the storage location, e.g., in-memory or on-disk; the storage method, e.g., column-based or row-based databases; the way data
relationships are kept, e.g., matrix or non-matrix (polynomial) databases and so on. Therefore,
in this chapter, the different OLAP tools and their characteristics are further detailed, together
with the corresponding advantages and disadvantages. Finally, a single OLAP system is selected
for our application.
4.1 Existing OLAP Open Source Tools
For a potential OLAP tool to be used in this project, supporting conventional OLAP functionality
is not sufficient. Several requirements were listed in Section 3.3. From those, two are particularly
important to consider when choosing an OLAP external tool. The tool has to be open source,
to allow changes in its functionality, and should provide support for further Java development,
to enable the integration of ProM (which is written in Java) and OLAP capabilities on a single
platform. OLAP tools can be split into OLAP servers and OLAP clients. OLAP clients are the user
interfaces to the OLAP servers.
Even though the open source OLAP servers and clients are not as powerful as commercial
solutions [49], they encourage the community-based development by being free to use and modify.
In our case, when integrating process mining solutions in OLAP technology, we expect to encounter
differences with existing functionality. Therefore, in this project, an open source tool which allows
to add new solutions is preferred over a more “powerful”, but non-extensible commercial tool.
To provide an overview of the existing OLAP open source tools, we refer to the following
sources [1, 27, 28, 48, 49, 50]. From those, [1, 49, 50] contain the work of Thomsen and Pedersen,
and include a periodic survey of open source tools for business intelligence. The first survey [49],
published in 2005, refers to three OLAP servers, Bee, Lemur and Mondrian and two OLAP clients,
Bee and JPivot, which are the only ones implemented at the time. In the survey from 2011 [1], only
two OLAP servers are presented, Mondrian and Palo. That is because Bee and Lemur servers were
discontinued and a new OLAP server, Palo was created. In [28], we find again the same Mondrian
and Palo OLAP servers mentioned. By 2011, there are already several OLAP clients available,
e.g., JPalo, JPivot, JRubik, FreeAnalysis, JMagallanes OLAP & Reports. There are also several
integrated BI Suites. Both [27] and [50] refer to Jasper Soft BI Suite, Pentaho and SpagoBI. All
these BI suites use the Mondrian OLAP engine and the JPivot OLAP client graphical interface.
Recently, the Palo BI Suite was released that is working with the Palo multidimensional OLAP
server and the Palo for Excel client.
As every OLAP client uses a specific OLAP server, selecting an OLAP server, automatically
narrows the client choice. In the following, we offer a summary on the two previously introduced
OLAP servers, Mondrian and Palo. These servers are quite different from each other, mainly
because they use different types of databases to store the data. The first one, Mondrian, stores
data in relational databases, and it is therefore called a ROLAP server, and the other, Palo, stores
data in multidimensional databases, and it is therefore considered a MOLAP server.
4.2 Advantages &amp; Disadvantages
The storage engine used, ROLAP or MOLAP, has a considerable influence on the characteristics
of the OLAP servers, e.g., implementation design and methods, query mechanisms, performance.
Therefore, we start this section with a discussion on ROLAP and MOLAP engines. Then, we
emphasize the advantages and disadvantages of Mondrian and Palo OLAP servers by comparing
and contrasting their characteristics, e.g., performance, scalability, flexibility.
The major advantage of ROLAP is that the relational database technology is well standardized,
e.g. SQL2, and is readily available off-the-shelf [17]. The disadvantage is that the query language
is not powerful and flexible enough to support true OLAP capabilities [51]. The multidimensional
model and its operations have to be mapped into relations and SQL queries [19].
The main advantage of MOLAP is that its model closely matches the multidimensional model,
allowing for powerful and flexible queries in terms of OLAP processing [17]. In general, the main
disadvantage of MOLAP is that no real standard for MOLAP exists. However, for particular situations, different problems can occur, e.g., scalability issues when it comes to very large databases,
sparsity issues for sparse data.
In [21], Colliat deems that multidimensional databases are several orders of magnitude faster
than relational databases in terms of data retrieval and several orders of magnitude faster in
terms of calculation. MOLAP servers have faster access times than ROLAP servers because
data is partitioned and stored in dimensions, which allows retrieving data corresponding to any
combination of dimension members with a single I/O. In a ROLAP, on the other hand, due to
intrinsic mismatches between OLAP-style querying and SQL (e.g., lack of sequential processing
and column aggregation), performance bottlenecks are common [18].
Generally, MOLAP provides more space-efficient storage, as data is kept in dimensions and a
dimension may correspond to multiple data values. However, this is not valid for sparse data, as
in this case, data values are missing for the majority of member combinations.
ROLAP systems work better with non-aggregate data and aggregate data management is done
at high cost. The MOLAP, on the other hand, works better with aggregate data. This is actually
expected, considering the table-based structure of a relational database and the structure of a
multidimensional database, which is organized in dimensions and has a built-in hierarchy.
An advantage of ROLAP is that it is immune to sparse data, i.e., sparsity does not influence
its performance, nor its storage efficiency. On the other hand, sparsity is a limitation for MOLAP
servers, which can hinder some of its benefits considerably. For example, a sparse MOLAP does not
provide space-efficient storage and runs into considerable performance issues. Therefore, MOLAP
servers typically include provisions for handling sparse arrays. For example, the sparsity problem is
known to be solved in the case of the commercial Essbase multidimensional database management
system, by adjusting the structure of the MOLAP server to handle separately sparse and dense
dimensions.
Now that the advantages and disadvantages in terms of the employed OLAP engine were
presented, in the following, we discuss the advantages and disadvantages of Mondrian and Palo
OLAP server tools. Before continuing our discussion, we would like to remark that both Mondrian
and Palo satisfy the requirement of being compatible with a Java-written system. Mondrian is
implemented in Java and offers cross-platform capabilities. As for Palo, the initial
Palo MOLAP engine was programmed in C++. However, today various serial interfaces in VBA,
PHP, C++, Java and .NET allow Palo OLAP to be extended.
Performance
Performance is a characteristic where generally Palo outruns Mondrian. First, the Palo
MOLAP engine offers faster query response times [19] than the ROLAP engine of Mondrian.
Secondly, the in-memory feature of the Palo server, improves the speed even further, as
naturally, in-memory databases are faster than the disk-based databases. Nevertheless, if
not as fast as Palo MOLAP server, the Mondrian ROLAP server is also known to provide
an acceptable performance [50].
Scalability
The in-memory characteristic is both an advantage (faster data retrieval) and a disadvantage of Palo. A database which is memory-based automatically becomes memory-limited. Undoubtedly, the memory capacity grows very quickly, but so does the volume of
available data. There are advances made to compensate for the memory need. For example,
3-D stacked in-memory technologies such as the Micron hybrid memory cube are available1. Nevertheless,
at the moment, scalability is considered an advantage of Mondrian and a disadvantage of
Palo.
Flexibility
Both Mondrian and Palo provide different types of flexibility. Being a ROLAP server,
Mondrian is more flexible regarding the cube redefinition and provides better support for
frequent updates [43]. On the other hand, the in-memory database of Palo does not require
indexes, recalculation and pre-aggregations. As analysis is possible to a detailed level without
any preprocessing [28], Palo is more flexible in that sense.
4.3 Palo - Motivation of Choice
Considering all the features of both Mondrian and Palo presented in Section 4.2, it can be noticed
that, in general, the advantages of one technology are the disadvantages of another technology.
Moreover, both Mondrian and Palo satisfy the requirements from Section 3.3, e.g., open source,
Java-compatible, with OLAP capabilities. Consequently, either of the two OLAP servers can be
used in this master project. We choose the Palo in-memory multidimensional OLAP server and
in the following, we give a motivation for our choice.
First, we adopt Palo technology because we want to explore new and innovative technologies.
Mondrian stores data in relational databases. Relational databases are simple and powerful solutions, but they have already been used for decades. Palo stores data in a multidimensional in-memory
database. Both multidimensional OLAPs and in-memory technologies are relatively new compared to relational databases. Being still in their infancy, they provide various research challenges
which are interesting to explore.
Secondly, we believe that Palo technologies have a real future perspective. With decreasing
memory prices and the growth of the available memory capacity, there are real chances that in-memory databases will be used more often. Moreover, there are promising performance results
1 http://www.edn.com/design/integrated-circuit-design/4402995/More-than-Moore-memory-grows-up
recorded for MOLAP engines. While there are different techniques employed to speed up relational
query processing (e.g. index structures, data partitioning), there is not too much that can be done
to further improve ROLAP performance. On the other hand, we see Palo as a technology with
potential to develop performance-wise.
All in all, we choose Palo because it uses new technology and it has real chances to grow in the
future. Since the JPalo client is the only one that uses the Palo MOLAP server, JPalo is the OLAP
client of choice for this project.
Chapter 5
Implementation
In the previous chapter we discussed the storage technologies to be used and we motivated the
use of Palo. In this chapter, we describe our implementation using Palo, ProM and MySQL
capabilities. We start by describing the system components and the way they are interconnected.
Then, we focus on three main aspects:
• Storing the event data in the process cube.
• Preparing the process cube for analysis purposes, e.g., by filtering on dimensions.
• Comparing process cells by visualizing the corresponding process mining results.
5.1 Architectural Model
Figure 5.1: The PROCUBE System. It contains components, external parties and the corresponding communications between both internal and external elements of the system.
As explained in Section 3.3, our implementation is integrated in ProM, i.e., our application runs
as a ProM plugin. The implemented plugin is called the PROCUBE plugin. Together with Palo
and MySQL, the PROCUBE plugin forms the PROCUBE system. In this section we describe the
architecture of the PROCUBE system. The main components of the PROCUBE system, together
with the external parties and the way they communicate with each other, are shown in Figure
5.1. The system interacts with three external tools: ProM, MySQL and PALO. ProM is the
host framework of the system, since the PROCUBE application runs as a plugin in ProM. The
relational database of MySQL is used to store data from the event logs that is not relevant for
multidimensional processing. Palo is employed for its OLAP capabilities. It is composed of two
main parts, the Palo Server and the Palo Client. While there are no changes made to the Palo
Server in this project, the Palo Client is appropriately adjusted to allow operations on event data.
Palo Server comes with an in-memory multidimensional database, for storage purposes, and an
OLAP cube, built on top of the database, that is suited to support OLAP functionality.
The flow of the event data in the system starts with the loading of an event log. This function
is performed by the Load component. Its role is to pre-process the incoming event data from an
event log and to load it in MySQL and Palo databases in such a way that it is properly stored
and ready for further use. Also at loading, the Palo cube is created from the event data residing
in the Palo database.
Immediately after loading, the process cube can be used to recreate the initially loaded event
log. However, there is no benefit from having merely this functionality. As such, the system
contains also a Filtering component. Its purpose is to perform various filtering operations on the
process cube such that the different perspectives of the cube can be inspected. Note that filtering
is used to extract parts of the process cube and not to modify its structure. Filtering is based on
the traditional OLAP operations: slice, dice, roll-up and drill-down. In addition to filtering, pivoting
is another useful OLAP operation that is employed. It allows rotating the cube to visualize it
from a different angle.
Once created, the filtered parts of the process cube are used to unload the corresponding event
data, from which an event log is then materialized. The Unload component is responsible for
taking the required data from both the relational and the in-memory database and creating an
event log out of it. The resulting event log is given as an input to a ProM plugin. The output is
a process mining result that can be visualized. Not all the existing ProM plugins are considered.
A representative list of ProM plugins is selected for this purpose.
Finally, a GUI component was specially created to show simultaneously different process mining
results. The advantage of such a component is that it facilitates the comparison of multiple process
mining results by placing them next to each other.
5.2 Event Storage
The simplest and most intuitive way to store event data in a process cube is by selecting all the
attributes in the event log as dimensions. To guarantee that an event is unique in terms of its
dimensions, an event id is assigned to each event. The same holds for cases, a case id is assigned
to each case. Both the event id and the case id are considered as dimensions. Even though this
approach is the easiest one, it can create in many cases considerable problems with respect to both
storage space and performance. This is because such a way of storing event data leads to extreme
sparsity in the process cube.
There are two possible ways to cope with the sparsity problem. The first solution is to reduce
the number of dimensions. By reducing the number of dimensions, only a subset from the entire
set of attributes is selected to form the dimensions. Consequently, the problem of where and how
to store the rest of the event and case attributes appears. Moreover, events are no longer uniquely
identified by dimensions, which implies having more than one event corresponding to a cell. An
immediate solution is to save the rest of the event data in the process cell. The difficulty with
this approach is that Palo Server, as well as other OLAP servers, allows for a limited number of
characters per cell. In the case of Palo, the number is 255. Moreover, today’s OLAP servers work
with numerical values, rather than with text. This limitation forces us to look for a new solution.
Figure 5.2: Event storage. Numbers represent cell ids and indicate the existence of a cell with a
corresponding set of events.
The solution we applied consists of giving a unique identifier to each cell and saving the rest
of the event data corresponding to the cell in a relational database. Figure 5.2 illustrates the
approach. On the left-hand side, a cube consisting of three dimensions (task, timestamp and last
phase) is shown. The numbers in cells, e.g., 6, 7, 10, 11, represent cell ids. On the right-hand
side, there is a table with case and event properties. This table is actually saved in the relational
database. A row of the table stores data corresponding to an event. The cell id is a column in
the table, and it indicates which event corresponds to which cell. For example, for the cell with
id 11, three events, namely 27, 28 and 29, can be identified in the table. For each of these events,
properties that are not among the dimensions in the process cube are stored in the relational
database.
The solution presented above does not fully guarantee that sparsity is sufficiently limited. For
instance, if the dimensions stored in the in-memory multidimensional database are all sparse, i.e.,
contain a large number of members that are hardly repeating in the log, then the sparsity problem
is still present. Examples of sparse dimensions are the event id, because there is one member
for each new event and the timestamp, since almost each event can have a unique timestamp.
Therefore, the second solution consists of reducing the number of elements per dimension.
Palo Server, as well as other multi-dimensional OLAP servers, offers a very useful feature,
called hierarchy. That is, members in a dimension can be hierarchically organized. An event log
can contain different types of attributes: binary, numerical, time, categorical, etc. For the time
attributes, there is already a natural built-in hierarchy that can be directly employed, e.g., year →
month → day of week. For example, the timestamp 2012-02-21T11:52:13 belongs to the year 2012,
the month is 2012Feb and the day of the week is 2012FebTue. Hierarchies can be used to reduce the
number of members per dimension. For the time example, only year, month and day of week can
be stored in-memory, while the actual timestamp can be saved in the relational database. For the
rest of the attributes, it is also possible to construct hierarchies, but it is not so straightforward
as for the attributes of time type. That is, to have a meaningful hierarchy for a set of categorical
attribute values, applying clustering and classification techniques would be useful.
The time hierarchy is implemented in our project for any dimension which contains elements of
date type. As for the rest of the attributes, there is no hierarchy established, since this is not easy
to solve in a generic way. As a consequence, even though solutions to limit sparsity were applied,
the sparsity problem can still occur, should the user select some sparse non-time dimensions to be
stored in the multidimensional database.
5.3 Load/Unload of the Database
In Section 2.2.1, the XES meta-model was presented. From all the elements of the XES structure,
attributes are the most relevant when employing a multidimensional structure for analysis. Case
attributes and event attributes are used to create the dimensions of a hypercube together with
their corresponding members. Therefore, they have to be loaded in the Palo in-memory database
so that they can be easily accessed for the process cube creation. As discussed in the previous section,
due to sparsity issues, the user is asked to decide upon a smaller set of attributes to be used
as dimensions in the process cube. The rest of the attributes are stored in relational databases
(RDB), as explained in Section 5.2.
In addition to traces, events and their corresponding attributes, the log also keeps information
regarding the classifiers, the extensions and the global attributes. Even though unnecessary for
OLAP operations, these elements are indispensable for the event log reconstruction. Therefore,
they are stored separately in RDB tables and used later for unloading purposes.
The loading of an event log into databases consists of two steps. First, a special tree structure
is created from event data to facilitate the construction of the process cube. Secondly, the created
structure is used for building the process cube and storing parts of event data in RDB in an
easy-to-access manner. We use pseudocode to present both steps.
Algorithm Parsing(log)
1.  ▷ log, gives the event log from the file
2.  Create a log id, that uniquely identifies the log
3.  Create tables in the RDB, with the attributes of the log, the classifiers, the extensions and the globals
4.  ▷ rootNode is the root node of a tree structure
5.  ▷ eventCoordinates is a list of attribute values for all events in the log
6.  Determine the number of traces in the log (nt)
7.  for i ← 1 to nt
8.      do traces[i] ← log.getTraces();
9.         rootNode.addNodes(traces[i].getAttributes());
10.        Determine the number of events in traces[i] (ne)
11.        for j ← 1 to ne
12.            do eventCoordinates ← NULL;
13.               events[j] ← traces[i].getEvents();
14.               rootNode.addNodes(events[j].getAttributes());
15.               eventCoordinates.setEvent(log id, traces[i].getAttributes(),
16.                   events[j].getAttributes());
17.               j ← j + 1;
18.        i ← i + 1;
19. return rootNode, eventCoordinates
In the first step, the classifiers, the extensions and the global attributes are extracted from the
XES log structure and loaded in RDB tables. In that sense, a log id is assigned to the log and is
used to distinguish the classifiers, extensions and global attributes of this log from those of other
already existing or to-be-created logs. Traces and events with their attributes are added to
a tree structure with the rootNode as the root element of the tree. The rootNode contains all the
links of the tree. Nodes are added to the tree structure in the following way: the first hierarchical
level of the tree presents properties of cases and events, the next level contains the values of the
properties. Other hierarchical levels are also possible. In this project, we implemented hierarchies
for time attributes. As such, in case of time attributes, years, months and days of week form the
levels of the tree.
In addition to the rootNode, a set of event coordinates is determined for each event, on lines 15-16
of the Parsing algorithm. Event coordinates give all the necessary information that can be used
to place an event back in an event log. Since an event is part of a trace and a trace belongs to
a log, also trace and log information is included in the event coordinates. Consequently, event
coordinates are composed of the log id, the trace id with the corresponding trace attributes and
the event id with the event attributes.
Algorithm Loading(rootNode, eventCoordinates)
1.  ▷ Create the process cube PC
2.  Determine the number of dimensions nd in the rootNode
3.  Allow the user to select a subset Md of all available dimensions
4.  for each i ∈ Md
5.      do Di ← rootNode.getChildren(i).getLeafs();
6.         if rootNode.getChildren(i) is a time attribute
7.             then Hi ← createHierarchy(rootNode.getAttribute(i));
8.  Create PC with the dimensions Di, i ∈ Md with unique cell values
9.  Determine the total number of events in the log (nte)
10. for i ← 1 to nte
11.     do k ← 0;
12.        columnValues ← NULL;
13.        for j ← 1 to nd
14.            do if j ∈ Md
15.                   then k ← k + 1;
16.                        mk ← eventCoordinates.getEvent(i).getAttribute(j);
17.                   else columnValues.addAttribute(eventCoordinates.getEvent(i).getAttribute(j));
18.        columnValues.addAttribute(getCell(m1, . . . , mk));
19.        RDB.addRow(columnValues);
Once the rootNode and the eventCoordinates are created, they can be used to build the process
cube PC. All the trace and event attributes accessible from the rootNode, are potential dimensions
of the process cube. Due to sparsity issues, the user is allowed to select a subset of these to be
the actual dimensions of the cube. Of course, selecting all the dimensions is also possible. For
each of the chosen dimensions, its corresponding member elements and the hierarchy are added,
in line 5 to 7, in the Loading algorithm. After populating dimensions with elements, the process
cube PC is created, based on these dimensions. At this point, the process cube PC has dimensions
and elements, but does not have any values in the cells. The eventCoordinates provides both the
coordinates of the cell and the set of its corresponding events. In Section 5.2, it was explained
that event data cannot be directly stored in a cell, due to cell limitations. Instead, each cell
is given a cell id and the rest of event data which is not yet saved in the PC can be stored in
RDB tables, with cell id as a column. As such, members of the PC dimensions are identified in
eventCoordinates, line 16, and are used as parameters for the getCell(m1 , . . . , mk ) function which
identifies a cell, line 18. The members that are not among PC dimension members, are added in
the RDB together with the cell id, line 19.
Algorithm Unloading(P C)
1.  ▷ log, is the event log to be created after unloading
2.  ▷ trace, is a trace of the event log
3.  ▷ event, is an event of the event log
4.  log ← NULL;
5.  Add all the classifiers, extensions and globals to the log, from the RDB tables
6.  ▷ eventList is a list with the corresponding coordinates of all the events
7.  ▷ attributeList is a list with all the attributes corresponding to an event
8.  Create the eventList from both PC dimensions and RDB columns
9.  Determine the number of events in the eventList (ne)
10. for i ← 1 to ne
11.     do attributeList ← eventList.getEvent(i).getAttributes();
12.        trace ← NULL;
13.        event ← NULL;
14.        Determine the number of attributes in eventList (na)
15.        for j ← 1 to na
16.            do attribute ← attributeList.getAttribute(j);
17.               if attribute is a log attribute
18.                   then logAttributes.add(attribute);
19.                   else if attribute is a trace attribute
20.                       then traceAttributes.add(attribute);
21.                       else eventAttributes.add(attribute);
22.        event.addAttributes(eventAttributes);
23.        if logAttributes are in log
24.            if there is a trace with the traceAttributes in log
25.                ▷ k is the position of the trace in log
26.                then log.getTrace(k).add(event);
27.                else trace.addAttributes(traceAttributes);
28.                     trace.add(event);
29.                     log.add(trace);
30.            else trace.addAttributes(traceAttributes);
31.                 trace.add(event);
32.                 log.addAttributes(logAttributes);
33.                 log.add(trace);
34. return log;
Figure 5.1, presented earlier, shows the basic flow of event data in the system. From the event
log, event data is loaded in both Palo and MySQL databases and can be retrieved from those at
unloading and used to recreate the initial event log. Even though such functionality does not
yet add any value, it can still be used to test the correctness of loading and unloading event data in
and from relational and OLAP structures. In what follows, we describe the unloading procedure
to complete the scenario.
For the Unloading algorithm presented in this thesis, we consider the complete list of events
from the initially loaded event log. Nevertheless, this list can be filtered and, as a result, only a
subset of the total events can be considered at unloading. In any case, there is no change with respect
to the pseudocode; only in line 8 the eventList is created differently, this time based on the filtering
results.
First, the initially NULL log is populated with classifiers, extensions and global attributes from
RDB tables. Then, event data from both the RDB and the Palo OLAP cube is extracted and used to
create an eventList structure. The eventList structure is similar to the eventCoordinates structure
created in the Parsing algorithm, in the sense that the eventList contains enough information to
place events back in event logs. For instance, the event id gives the order of the event in the log.
Note that information like the log id, the case id and the event id is discarded when constructing
the event log, as it was created at loading and was not initially part of the log.
The eventList contains a list of three types of attributes: log attributes, trace attributes and
event attributes. The event attributes, for instance, can be used to create an event, as in line
22. The trace attributes can be used to create a trace. However, since a trace may correspond
to multiple events, we check, in line 24, whether a trace with the same attributes already exists.
Then, the created event is added either to the already existing trace or to the trace that is newly
created. A similar test is required when adding the log attributes to the log, to avoid repeating
data in the new event log.
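As a complement to the pseudocode, the following sketch shows, in terms of the OpenXES API on which ProM is based, how an event could be placed back into an existing or a new trace. The helper findMatchingTrace and the simplistic attribute comparison are assumptions made for illustration; the actual implementation may organize this lookup differently.

    import java.util.List;
    import org.deckfour.xes.factory.XFactory;
    import org.deckfour.xes.factory.XFactoryRegistry;
    import org.deckfour.xes.model.XAttribute;
    import org.deckfour.xes.model.XEvent;
    import org.deckfour.xes.model.XLog;
    import org.deckfour.xes.model.XTrace;

    public class LogRebuilder {

        private final XFactory factory = XFactoryRegistry.instance().currentDefault();

        // Rebuilds the trace/event structure from flat attribute lists (cf. lines 22-33 of the algorithm).
        public void addEvent(XLog log, List<XAttribute> traceAttributes, List<XAttribute> eventAttributes) {
            XEvent event = factory.createEvent();
            for (XAttribute a : eventAttributes) {
                event.getAttributes().put(a.getKey(), a);
            }
            XTrace trace = findMatchingTrace(log, traceAttributes);
            if (trace == null) {                       // no trace with these attributes exists yet
                trace = factory.createTrace();
                for (XAttribute a : traceAttributes) {
                    trace.getAttributes().put(a.getKey(), a);
                }
                log.add(trace);
            }
            trace.add(event);                          // an XTrace is a list of XEvents
        }

        // Simplistic match on concept:name only; the real check compares all trace attributes.
        private XTrace findMatchingTrace(XLog log, List<XAttribute> traceAttributes) {
            for (XAttribute a : traceAttributes) {
                if (!"concept:name".equals(a.getKey())) {
                    continue;
                }
                for (XTrace t : log) {
                    XAttribute name = t.getAttributes().get("concept:name");
                    if (name != null && name.toString().equals(a.toString())) {
                        return t;
                    }
                }
            }
            return null;
        }
    }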
5.4 Basic Operations on the Database Subsets
Once the event data is loaded in the databases, the question arises what the system can do with it. First, the system benefits from the multidimensional structure of the OLAP cube: different dimensions of the cube can be inspected. Moreover, the system supports a set of
Figure 5.3: Dice operation. (a) Dice filtering: five elements are selected on the EVENT conceptEXT name dimension. (b) Dice filtering result: while the event log corresponding to PC has 33 events, the event log corresponding to PC_diced has only 14 events.
basic OLAP operations, e.g., slice, dice, drill-down, roll-up and pivoting. Filters can be created
that slice or dice the cube in various ways. Default filters exist for the drill-down and roll-up operations, which can be applied on request to specific chosen dimensions. Each filter is stored for further use unless it is explicitly deleted. Not only can the event data in the cube be filtered, it can also be visualized from different perspectives. This functionality is offered by the
pivoting operation.
5.4.1 Dice & Slice
A dice operation is realized when multiple members are selected for one or more dimensions. Given a process cube PC, the result of a dice is a subcube PC_diced for which only a subset of members is selected on particular dimensions and which is otherwise identical to the initial cube.
Figure 5.3a shows a dice filter applied on the EVENT conceptEXT name dimension. With dice,
multiple elements of a dimension can be selected. In Figure 5.3a there are five task names selected
and the rest of the elements are just discarded for the EVENT conceptEXT name dimension.
The result of the dice operation is shown in Figure 5.3b. From 33 events present in the event log
corresponding to the process cube PC, only 14 are considered for PC_diced. The number of
cases remains the same.
A dice operation can influence more than one dimension. For example, together with the
filter on the EVENT conceptEXT name dimension, a subset of timestamps can be selected on the EVENT TIME timeEXT timestamp dimension1. A dice operation allows the selection of any element of the time hierarchy. For example, one can select the years 2012 and 2013 out of a set of years containing 2010, 2011, 2012 and 2013. The month level can also be used for dicing. For instance, selecting the 2012Feb month of 2012 is also a dice, since it contains the following set of elements: 2012FebMon, 2012FebWed and 2012FebThu.
For dimensions with numerical members, a dice filter can be created by selecting a certain range. For example, for the SUMLeges dimension, all the events with SUMLeges between 100.5 and 500.2 can be selected.

1 In the dimension name, the TIME tag is used to recognize a dimension corresponding to a time attribute. Other examples of such dimensions are EVENT TIME dueDate, EVENT TIME plannedDate and EVENT TIME createdDate.

Figure 5.4: Slice operation. (a) Slice filtering: only a single event name, 01 HOOFD 060, is selected on the EVENT conceptEXT name dimension. (b) Slice filtering result: while the event log corresponding to PC has 33 events, the event log corresponding to PC_sliced has only 2 events.
The slice operation is a particular type of dice. That is, a slice is performed when only a single
member of one dimension is selected and the other members corresponding to the dimension are
filtered out. Given a process cube PC, the result of a slice is a subcube PC_sliced with the same dimensions as the cube PC, except for one, which has just a single member selected from the initial set of the dimension members.
Figure 5.4a shows a slice filter applied on the EVENT conceptEXT name dimension. From all
the elements of this dimension, only 01 HOOFD 060 is selected. After creation, the slice filter is
saved and, at request, is applied on the event data of the process cube. That is, only events with
the event name 01 HOOFD 060 are considered for the new PC_sliced cube. Figure 5.4b depicts the
slice result on the process cube. In the top window, a Log Dialog shows information on the initial
event log. Note that the entire event log contains 4 cases and 33 events. The bottom window
illustrates a Log Dialog containing information on the event log created after slicing. The new
event log contains only 2 cases and 2 events. Consequently, there are only 2 events with the name
01 HOOFD 060 and they belong to 2 different cases.
For a dimension with time attributes, the slice can be performed while selecting a leaf member,
situated at the day of week hierarchical level. For example, for a timestamp dimension containing
2012 at the year level, 2012Feb at the month level and 2012FebTue at the day-of-week level, a slice can be executed by selecting the 2012FebTue element. Note that such a slice filters out all the events except those that occurred on a Tuesday in February 2012, not on all Tuesdays of 2012 or on all Tuesdays in general.
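To make the two filter types concrete, the sketch below models a dice as a selection of members per dimension, with slice as the special case of a single selected member on one dimension. The class and method names are illustrative only and do not correspond to actual PROCUBE classes.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;

    // A dice filter keeps, per dimension, the set of members that remain selected.
    public class DiceFilter {

        private final Map<String, Set<String>> selectedMembers = new HashMap<>();

        public void dice(String dimension, Set<String> members) {
            selectedMembers.put(dimension, members);
        }

        // A slice is a dice with exactly one member selected on one dimension.
        public void slice(String dimension, String member) {
            selectedMembers.put(dimension, Set.of(member));
        }

        // An event passes the filter if, for every constrained dimension, its member is selected.
        public boolean accepts(Map<String, String> eventCoordinates) {
            for (Map.Entry<String, Set<String>> entry : selectedMembers.entrySet()) {
                String member = eventCoordinates.get(entry.getKey());
                if (member == null || !entry.getValue().contains(member)) {
                    return false;
                }
            }
            return true;
        }
    }

For the example of Figure 5.4, slice("EVENT conceptEXT name", "01 HOOFD 060") keeps only the 2 events with that task name, while the dice of Figure 5.3 selects five task names on the same dimension.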
5.4.2 Pivoting
The subcubes obtained after slice and dice operations can be visualized. In this project, the
traditional 2D visualization is considered for the process cube visualization. As such, only two
dimensions of the process cube can be visualized simultaneously. This is possible through the
table of visualization. The rows of a table of visualization contain two dimensions of the process
cube and also the corresponding filters created by the user. Even though they are based on the elements of two process cube dimensions, the dimensions of visualization are usually not identical to the former. The main difference is that their elements can be both results of filtering and elements of different hierarchical levels. In that sense, two neighboring visualization cells can contain overlapping data, while this is never the case for two neighboring cells of the process cube.
The restriction of visualizing only two dimensions at a time has no influence on which two
dimensions to select. That is, any combination is possible and either of the two dimensions can be substituted with a new PC dimension at any time. By swapping from one dimension to another, the visualization perspective of the PC cube changes. This operation is known as pivoting or the
rotation operation.
Figure 5.5: The result of the pivoting operation. Rotation is obtained by replacing the concept
names dimension with the timestamp dimension and the SUMLeges is replaced by the concept
names dimension.
Figure 5.5 shows the effect of the pivoting operation on the visualization table. In the visualization table from the top of the image, the SUMLeges and the event names are the two dimensions
of visualization. In the second table of visualization, the same process cube is visualized through
the event names and the timestamp dimensions. Also, while the event names dimension was initially on the x axis, in the second table it is moved to the y axis.
5.4.3 Drill-down & Roll-up
The drill-down operation is realized by unfolding a member situated at a hierarchically superior position into the set of members at the next lower hierarchical level.
Figure 5.6 shows a table of visualization with one dimension corresponding to the timestamp
and another dimension corresponding to the event name. Elements of the timestamp dimension
can be selected from a hierarchy. For example, the 2012 member is selected and a drill-down
operation is performed on it. Since, in the time hierarchy, months follow years, all the months corresponding to year 2012 are shown. Based on the definition of drill-down from Section 2.3.1, the children of 2012 are added to the timestamp dimension of the table of visualization and the 2012 element is removed. In our project, we also keep the 2012 element, because it is useful to compare process mining results corresponding to elements on different hierarchical levels, e.g., the process of 2012 with the process of 2012Mar.
Figure 5.6: Drill-down operation on the timestamp dimension. Year 2012 is drilled-down to its
months.
The roll-up operation is realized by folding certain members of a dimension into one member,
which is hierarchically superior.
Figure 5.7: Roll-up operation on the timestamp dimension. The months corresponding to year
2012 are folded back.
Figure 5.7 shows a table of visualization corresponding to the same timestamp and event name
dimensions. Based on the definition of roll-up from Section 2.3.1, the children of 2012 are removed
from the timestamp dimension of the table of visualization and the 2012 element is added. In
our project, there is no need to add the 2012 element, as it is already present from the drill-down
operation.
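The sketch below illustrates this variant of drill-down on a simplified time hierarchy: the children of a member are added to the visualization dimension while the parent is kept, so that, for example, 2012 can be compared with 2012Mar. The TimeMember structure is a simplification introduced only for this example.

    import java.util.ArrayList;
    import java.util.List;

    // Simplified hierarchy node, e.g., "2012" with children "2012Jan", ..., "2012Dec".
    class TimeMember {
        final String name;
        final List<TimeMember> children = new ArrayList<>();
        TimeMember(String name) { this.name = name; }
    }

    public class VisualizationDimension {

        private final List<TimeMember> shownMembers = new ArrayList<>();

        public void show(TimeMember member) {
            shownMembers.add(member);
        }

        // Drill-down: add the children, but keep the parent, so that results on
        // different hierarchical levels can be compared side by side.
        public void drillDown(TimeMember member) {
            if (shownMembers.contains(member) && !member.children.isEmpty()) {
                shownMembers.addAll(member.children);
            }
        }

        // Roll-up: remove the children again; the parent is already present,
        // so nothing needs to be added back.
        public void rollUp(TimeMember member) {
            shownMembers.removeAll(member.children);
        }
    }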
5.5 Integration with ProM
After filtering and selecting a particular side of the process cube for visualization, the Unloading
algorithm, presented in Section 5.3, is applied to materialize event logs for different visualization
cells. The resulting event logs are given as input to a ProM plugin. Each ProM plugin has a plugin context object, which is required to run in the ProM framework. Some plugins cannot be used outside ProM, for example, due to the absence of a specific predefined plugin context.
Therefore, to allow more flexibility, our application is adjusted to run in ProM.
Hundreds of ProM plugins could potentially be used. However, we select only a predefined
list of plugins to run in our application. The reason for this is twofold. First, not all of the
existing plugins are relevant for the purpose of the PROCUBE tool. One of the objectives is to
provide the user a means to visually compare multiple subprocesses. Visual comparison of several
subprocesses becomes difficult when there is a different visual representation for each process. In
that sense, plugins that provide immediate visualization results are quite handy. If the user has
to make changes to get a specific result, repeating them for each visualization window can become
troublesome. For example, the user can miss a step, and then the results that are compared are
not the intended ones. Also, any change in one window implies changes in all windows. Naturally,
manual changes take time, while automatic changes are impossible, due to different event data
per cell. Another problem is that the graphical space is limited. Running in parallel multiple
plugins that provide in-depth analysis, e.g., the LTL Checker, is not very practical, also due to space
restrictions, while repeating the changes for each individual process is very time consuming. In
conclusion, we aim at quick superficial analysis, with immediate results on multiple sublogs rather
than time-consuming, in-depth analysis on a single or very few logs.
Another type of ProM plugins are the ones created to filter event logs. Since filtering is already
implemented in the PROCUBE tool, part of the functionality of these plugins is redundant.
The second reason is related to the fact that providing a generic way of calling all the ProM
plugins is difficult to realize. Each plugin has its own specific input and output parameters and
also its own methods. A solution for calling all plugins in a generic way is to create a Wrapper
that would uniformly integrate all ProM plugins. For this project, we focus mainly on plugins
that return a JComponent, which can be directly used to display the result. The Alpha Miner,
for instance, returns a Petri net object. In that case, the visualization component for the Petri
net has to be created first, and only then can the visualization result be shown.
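A minimal sketch of such a wrapper is given below. It only illustrates the JComponent-oriented integration described above: the ResultVisualizer interface and the LogDialogVisualizer class are hypothetical names introduced here, not actual ProM or PROCUBE classes.

    import javax.swing.JComponent;
    import javax.swing.JLabel;
    import org.deckfour.xes.model.XLog;

    // Hypothetical wrapper: every supported plugin is adapted to one common interface
    // that takes a materialized sublog and returns a Swing component to display.
    interface ResultVisualizer {
        String getName();
        JComponent visualize(XLog sublog);
    }

    // Example adapter for a plugin with an immediate, parameter-free result.
    class LogDialogVisualizer implements ResultVisualizer {

        @Override
        public String getName() {
            return "Log Dialog";
        }

        @Override
        public JComponent visualize(XLog sublog) {
            // In the real tool, the corresponding ProM plugin is invoked with its plugin
            // context here; this placeholder only summarizes the sublog.
            int events = sublog.stream().mapToInt(t -> t.size()).sum();
            return new JLabel(sublog.size() + " cases, " + events + " events");
        }
    }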
Nr   ProM Plugin
1.   Log Dialog
2.   Dotted Chart
3.   Fuzzy Miner
4.   Heuristics Miner
5.   Working-Together Social Network
6.   Handover-of-Work Social Network
7.   Similar-Task Social Network
8.   Reassignment Social Network
9.   Subcontracting Social Network

Table 5.1: The list of ProM plugins used in the PROCUBE tool.
Moreover, some plugins require going through a sequence of wizard screens to get to the final
result. Even if a predefined set of parameters were created to avoid following the wizard screens, a new set of parameters would be required for each individual plugin. Furthermore, for our project, it is not possible to set the parameters only once, beforehand, and use them for all the visualization cells. That is because the parameters of the initial event log usually do not correspond to those of the sublogs resulting from filtering, as the corresponding event data is different. In that sense, for
such plugins, following the wizard sequence for each sublog individually is a must. Again, in this
case, plugins with immediate results are preferred over the ones preceded by a sequence of wizard
screens.
Based on the considerations mentioned above, Table 5.1 provides the list of plugins currently used in our project. The Log Dialog and the Dotted Chart give a panoramic view of the sublog processes. The Heuristics Miner and the Fuzzy Miner are used to discover process models from sublogs. The Social Network plugins provide details on the resource perspective of the sublogs. There is no doubt that plugins such as the Basic Performance analysis and the Conformance Checker would add considerable value to the process analysis and would allow for more extensive use case analysis. Therefore, we suggest adding such plugins as potential further work.
5.6 Result Visualization
The main visualization challenge of the project is to display multiple process mining results at the
same time, in an integrated way. The size of the physical screen is the main limiting factor when
it comes to displaying multiple windows. Therefore, we apply several solutions to cope with this
issue. First of all, we create a new frame, detachable from the main frame, and use it to place all
process mining results. Thus, should two screens be available, the table of visualization can be
placed on one screen, while the plugin results can be displayed on the second screen. On this new
frame, windows are organized next to each other, in an easy-to-identify way. Even though such a
frame layout is already enough for the visualization of the plugin results, we decided to make some
changes as it was lacking the desired flexibility. Hence, replacing the windows with dockable ones
to allow moving them around is one of the most important visualization features that is supported
in the project. A large part of the dockable functionality is taken from the DockingFrames 1.1.2 library (http://dock.javaforge.com/) and adjusted for the project needs.
In the following, we explain the framework of the windows, with details related to the layout
of the windows frame. Then, we give a list of the supported frame functionality. Finally, we show
the result visualization obtained using the PROCUBE plugin.
Figure 5.8, taken from [47], shows the framework based on which dockable windows are created. Dockables are not stand-alone windows. They require the support of a main window (the Main-Frame). The main window is most of the time a JFrame. As long as this frame is visible, so are the rest of the components on it. Non-dockable panels are just directly connected to the main frame. Consequently, the main frame can consist of several panels, with different data displayed on them. To support floating panels, however, an additional layer is added between the panels and the main frame. The components of this layer are the so-called Stations. Among their purposes is also to allow the user to drag & drop panels and to minimize or maximize windows. A central controller is used to wire all the objects of the framework together. It manages the way elements look and their position in the frame and it monitors all the occurring changes within windows. Further, each panel is wrapped into a dockable. Dockables are the final components and they are the ones that actually offer the floating behaviour.

Figure 5.8: Dockables functionality. Panels are wrapped into dockables. Dockables are put onto stations which lay on the main-frame. As such, dockables can be moved to different stations.
To display dockables in a certain layout, a Grid component is used. The matrix of the grid
gives an organized way of displaying windows on the screen. For our project, the matrix of the grid component corresponds to the matrix of the table of visualization. That is, the plugin results for different cells are shown in the same order as the cells in the visualization table.
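A minimal sketch of this setup, based on the DockingFrames Common API, is shown below. It only illustrates how plugin result components could be wrapped into dockables and laid out in a grid mirroring the visualization table; the exact calls in the PROCUBE implementation may differ.

    import javax.swing.JComponent;
    import javax.swing.JFrame;
    import bibliothek.gui.dock.common.CControl;
    import bibliothek.gui.dock.common.CGrid;
    import bibliothek.gui.dock.common.DefaultSingleCDockable;

    public class ResultFrame {

        // Places one dockable per visualization cell, in the same row/column
        // positions as the cells of the visualization table.
        public void showResults(JFrame mainFrame, JComponent[][] cellResults) {
            CControl control = new CControl(mainFrame);    // central controller wiring all dockables
            mainFrame.add(control.getContentArea());

            CGrid grid = new CGrid(control);
            for (int row = 0; row < cellResults.length; row++) {
                for (int col = 0; col < cellResults[row].length; col++) {
                    if (cellResults[row][col] == null) {
                        continue;                          // empty visualization cell
                    }
                    DefaultSingleCDockable dockable = new DefaultSingleCDockable(
                            "cell-" + row + "-" + col,     // unique identifier
                            "Cell (" + row + ", " + col + ")",
                            cellResults[row][col]);
                    dockable.setCloseable(true);           // unnecessary windows can always be closed
                    grid.add(col, row, 1, 1, dockable);    // x, y, width, height in grid coordinates
                }
            }
            control.getContentArea().deploy(grid);         // apply the grid layout
        }
    }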
In view of the above approach, the following visualization capabilities are supported:
• Default layout with all the dockables normalized. Normalized dockables are placed on the
main visualization frame, in the way cells are displayed in the visualization table.
• Dockables can be maximized. A maximized dockable takes all the space it can, most of the
time, by covering other dockables.
• Dockables can be minimized. Minimized dockables are not visible right away. They can be
restored to a normal state by pressing the minimization button again.
• Dockables can be extended. Once extended, dockables have their own window, independent
of the main visualization frame. This functionality is very useful as it allows, for example,
moving windows with plugin results to different screens.
• By the drag & drop operation, dockables can be placed on any part of the screen. For
example, by dragging one dockable onto the place of another, the two are swapped with each other.
• When multiple plugin results are available for the same visualization cell, each result window is a new tab in a tabbed pane. That makes it easy to quickly identify plugin results
corresponding to the same visualization cell.
• Unnecessary windows can always be closed.
Figure 5.9: Visualization of plugin results in the PROCUBE tool. Each plugin result is displayed
in a dockable window and can be part of a tabbed pane.
Figure 5.9 shows several windows with plugin results. Two Log Dialogs, a Fuzzy Miner, two
Heuristics Miners and a Social Network form the visualization results. Multiple tabs can be
distinguished since multiple plugin results exist for the same visualization cell. All the windows
are dockable. After undocking a window, the rest of the windows are automatically rearranged on the screen.
Chapter 6
Case Study and Benchmarking
In the previous chapter, the implementation of the process cube was described as a combination
of external technology (Palo, MySQL, ProM) and newly-introduced process-cube-related features.
Further, we continue with an evaluation of the functionality on different event logs and an assessment of the PROCUBE system performance. The results presented in this chapter are based on the
event data of an artificial digital photo copier event log and on event data of a Dutch municipality
event log.
6.1 Evaluation of Functionality
In this section we choose both a synthetic and a real-life event log to ascertain the capabilities of
the PROCUBE system. The functionality that is evaluated comprises loading an event log into
relational and in-memory databases, executing OLAP operations on the process cube, unloading
an event log from databases, generating ProM results based on the event log and visualizing ProM
results.
6.1.1 Synthetic Benchmark
The synthetic event log we use in this section is taken from the collection of synthetic event
logs, found at http://data.3tu.nl/repository/collection:event_logs_synthetic. It is an
artificial event log for a simple digital copier, also used as a running example in [33]. The copier is
specialized in copying, scanning and printing of documents. As such, users can request copy/scan
or print services. The standard procedure followed by a copier is image creation, image processing
for quality enhancement, and then, depending on the request, either printing the image or just
sending it to the user. The generation of the image for a print request differs from the one for a
copy/scan request.
The digital photo copier event log contains 100 process instances, 76 event classes and 40995
events. Traces can be separated, based on their Class attribute, into Print and Copy/Scan. For each event, the name of the activity is given, along with the lifecycle transition, attesting whether the activity is started or completed, and the timestamp at which the activity was recorded.
In the following, based on the digital photo copier process described in [33], we select a few
scenarios and use them to present the capabilities of the PROCUBE tool.
In Figure 9 from [33], two subprocesses, ‘Interpret’ and ‘Fusing’, are isolated. For our first scenario, the target is to load the entire digital photo copier event log into the databases and filter it in such a way that, after unloading and applying the Fuzzy Miner plugin, the ‘Interpret’ subprocess from Figure 9 in [33] is obtained. At loading, the TRACE Class and the EVENT conceptEXT name
attributes are selected as dimensions of the process cube. After loading, we perform a dice operation on the EVENT conceptEXT name dimension of the process cube, by selecting the following
subset of elements: Interpretation, Post Script, Unformatted Text and Page Control Language.
Figure 6.1: The ‘Interpret’ subprocess, obtained by dicing the process cube on the task name.
Further, an event log is materialized from the filtered event data and is used as a parameter for
the Fuzzy Miner plugin. The result is shown in Figure 6.1. The correspondence between our result
and the one in [33] can be easily noticed.
Figure 6.2: The ‘Interpret’ subprocess with its corresponding branches. The visualization results
allow for easy comparison of subprocesses.
For further testing, we consider a second scenario, where the same ‘Interpret’ process is taken,
but now subprocesses of each of the three branches of the ‘Interpret’ process are isolated, by
filtering on the task name. Figure 6.2 shows the main visualization frame with four windows. The first window, on top, gives the same ‘Interpret’ process model. The three windows at the bottom illustrate the subprocesses of the three branches of the process. Such visualization results are powerful for larger processes. First of all, multiple filtering results of the same process can
be visualized at the same time. After filtering, the initial process is not discarded; it can be reused again and again for filtering purposes. Presenting processes next to each other highlights similarities and differences between them.
Figure 6.3: Zooming-in on the first part of the copier process model and on the first part of its
corresponding ‘Print’ and ‘Copy/Scan’ subprocesses.
In the last scenario, the entire copier process model is discovered, using the Heuristics Miner
plugin. First, two slice operations are performed on the TRACE Class dimension. Their results
are used to discover the ‘Print’ and the ‘Copy/Scan’ subprocesses. The resulting process models are quite large, which makes it difficult to visualize them entirely. Therefore, we zoom in on the first part of the processes (see Figure 6.3). By placing all the models in parallel, the paths for the ‘Print’ and
‘Copy/Scan’ subprocesses can be distinguished in the copier process model. One branch of the
process starts with the ‘Copy/Scan’, ‘Collect Copy/Scan’ and ‘Place Doc’ activities, corresponding
to the ‘Copy/Scan’ subprocess, and the other branch starts with the ‘Remote Print’, ‘Read Print’
and ‘Rasterization’ tasks, corresponding to the ‘Print’ subprocess. The same behavior is shown
for this part of the process in Figure 7 from [33]. By zooming in on the rest of the subprocesses, their entire behavior can be observed and their control flows can be compared.
6.1.2 Real-life Log Data Example
For the real-life example, we select one of the event logs of a Dutch municipality, known under
the name of WABO1. The WABO1 event log consists of 691 process instances, 254 event classes
and 22130 events. The data captures process events from October 2010 till November 2012 with
an overall duration of 758 days.
At the case level, the following attributes are available:
• parts attribute, specifies for which building parts the permit is requested: “Bouw” (355 cases), “Sloop” (52 cases), “Kap” (32 cases), etc.
• SUMleges attribute, gives the total cost of a building permit application, e.g., 192.78, 284.55,
1992.06.
• last phase attribute, denotes the outcome of a permit request application. Usually a case finalizes with “Vergunning verleend” (permit granted, in 344 cases) or “Vergunning geweigerd” (permit declined, in 2 cases). However, there are a number of cases that end up with “Procedure afgebroken” (procedure aborted, in 74 cases).
• caseStatus attribute, indicates whether a case is still open (“O”) or already closed (“G”). For a case that is closed, no further objections are possible. However, for an open case, objections can still be expected.
Event attributes give information related to the lifecycle of an event, the resource that executes
a task or is responsible for it and different time characteristics, e.g., the time when a task was
created or the time when an event was recorded. The lifecycle of an event comprises only a single
transition: complete. That is, all the work items in the event log are completed. There are 19
resources that execute tasks. The majority of the tasks are performed by resource number 560872
(30.764 %).
Figure 6.4: Dotted charts for a process of a Dutch municipality using absolute time. The influx of new cases is rather constant over time (top chart). The influx of new cases is decreasing over time (bottom left chart). For the bottom right chart, no pattern could be identified.
Figure 6.4 shows three dotted charts for three of the subprocesses of a Dutch municipality using absolute time. These subprocesses are obtained by slicing the process cube on the
TRACE last phase dimension. In all three cases, absolute, real times are used. Moreover, cases
are sorted by the time of the first event. The top chart corresponds to the building permit request applications that finalize with a permit being granted. For this subprocess, the initial events form an almost straight line. Consequently, there is a close to constant arrival rate of new cases. The bottom left chart corresponds to canceled applications. The dotted chart shows that the influx of incoming new cases that are eventually canceled is decreasing over time. The last chart, in the bottom right part of the image, corresponds to declined cases. Due to the small number of declined applications, it is difficult to identify a pattern in the arrival of such cases.
Figure 6.5 shows three dotted charts for three of the subprocesses of a Dutch municipality using relative time, i.e., all cases start at time zero, with emphasis on the duration of a case. Typically, both approved and canceled cases are handled in 1-2 months, although a large portion of those are already finished after 10-20 days. Nevertheless, there are cases that take up to 1.5 years to complete. For instance, the duration of handling the declined cases is quite large. For one of the cases, it takes a year before it is finally rejected. Such behavior is also present for approved and canceled cases, however only very sporadically, as exceptions. Since the event data comes from a real-life log, we do not exclude the possibility of recording errors for such cases.
Figure 6.5: Dotted charts for a process of a Dutch municipality using relative time. The duration
of handling a building permit request that is eventually approved is typically about 1-2 months. The same is valid for canceled applications. Requests for applications that are declined take longer to handle.
Figure 6.6: Representation of the Working-Together Social Network for resources working at
Aanhoudingsgrond van toepassing (AH) type of activities and on Waw-aanvraag buiten behandeling
(AWB) type of activities.
Mining social networks is yet another ProM feature supported in the PROCUBE plugin. The
social network miners, presented in [9], can be directly applied on the event logs of the subprocesses
of a process cube. In this section, we present an example of a Working-Together Social Network
for resources in the WABO1 process, working at Aanhoudingsgrond van toepassing (AH) type of
activities and on Waw-aanvraag buiten behandeling (AWB) type of activities. In both networks, a cluster of resources working together and several isolated resources can be distinguished. Except
for a few isolated resources, i.e., 560589, 560999 and 560950, the AH network contains the same
elements as the AWB one. This is not the case when it comes to resource interactions in the
working-together clusters. Even though it contains almost the same resources, its corresponding
chain of interaction changes. That is, compared to the AWB network, in the AH one, only 560912
still works directly with 2670601 and only 3273854 still works directly with 560925. A rather
large percentage of the 19 resources involved in the entire process are also present in the networks: 84% in the first network and 68% in the second network. This indicates that the majority of the resources may not be specialized in a particular type of activity, but rather execute different types of activities depending on the case. Other network graphs and plugins can be used to further support this statement. Consequently, placing social networks next to each other offers a parallel view of people’s interaction within an organization in various situations, e.g., when
handling different tasks.
6.2 Performance Analysis
In this section the performance of the PROCUBE system with respect to loading and unloading
operations is analysed. Clearly, loading time affects the productivity of the system only once, when
the event log data is loaded into the databases, whereas the unloading operation can be performed
multiple times, i.e., whenever a process mining technique is applied to the events in the cube
(possibly a subcube). The time required by these operations has to be small enough to guarantee
adequate user interaction with the tool. In what follows, the PROCUBE tool is subject to several
tests.
Test 1. For the first test, subsets of the WABO1 event log are loaded and unloaded from the
database. These subsets contain 160, 338, 687, 1368, 2732, 5505, 11061, and 22130 events.
The latter sublog is actually the entire WABO1 event log. The loading and unloading speed
is assessed for each sublog in 4 distinct configurations of the in-memory database, i.e., 2D
with dimensions TRACE parts and EVENT timestamp, 3D which contains the dimensions
from 2D and EVENT orgEXT resources, 4D adds EVENT created to 3D dimensions, and
the 5D configuration adds the TRACE termName dimension to 4D. This test illustrates the dependency of the loading and unloading time on a typical selection of dimensions.
Test 2. The second test illustrates the effects of sparse dimensions on the loading and unloading
performance. This test is performed on two 2D configurations and follows the methodology
from Test 1. The dimensions of these two cubes are summarized in Table 6.1.
Cube            Dimension                   Nr. of members
Low sparsity    TRACE termName              12
                EVENT orgEXT resources      20
High sparsity   EVENT taskDescription       73
                EVENT conceptEXT name       692

Table 6.1: Summary of dimensions for the 2D cubes in Test 2.
Test 3. For the last test, the WABO1 event log is split into several non-overlapping sublogs and
the total unloading time of these sublogs is compared to the unloading of the entire WABO1
event log. This test illustrates that the filtering operations and the extraction of sublogs do not incur any additional penalty on the unloading time.
Figure 6.7: Loading times for Test 1 (log-log plot of time in seconds versus number of events, for the 2D, 3D, 4D and 5D configurations).
Test 1
Let us begin by showing the loading times for this test setup in Figure 6.7. Although both scales on the figure axes are logarithmic, it is easy to see that the loading time increases linearly with
respect to the number of events in the log. Moreover, the loading time is practically independent of
the number of cube dimensions. The latter remark suggests that the loading time per dimension into the relational database and into the in-memory database is about the same, i.e., if one of the dimensions
is moved from the relational database to the cube, the loading time does not change. Moreover,
the loading implies just one constant set of operations per event, therefore it is independent of the
number of dimensions in the created cube. Of course, the amount of memory used for the cube
increases with the number of dimensions.
Figure 6.8: Unloading times for Test 1 (log-log plot of time in seconds versus number of events, for the 2D, 3D, 4D and 5D configurations).
The situation during the unloading is completely different however. The unloading time for
the same databases is shown in Figure 6.8. The time spent for unloading the event log from
the database increases considerably for larger numbers of cube dimensions. Of course, unloading
time heavily depends on the number of cube cells that do not have any events corresponding to
them. These empty cells do not affect the loading time into the database, but consume memory.
The opposite is true during unload, when each cell has to be verified. Hence, time is spent on
empty cells, but these cells do not contribute with any information to the resulting log. Generally,
the sparsity of a cube increases with the increase of the number of dimensions, and as such, the
number of empty cells does too. For this particular case study, unloading an event log with 11061
events takes 27 s for a 2D cube, and 688 s for a 5D cube, which illustrates a super-linear increase
in the unloading time. A similar tendency can be observed with respect to the number of events in the log. It appears that the sparsity of the cube increases with the number of events in the log at a super-linear rate as well. These observations can be intuitively explained by two facts. First, all the dependencies in the hyper-cubic structures are multiplicative rather than additive, hence the sparsity is expected to rise exponentially. Secondly, event logs contain attributes which characterize the events very precisely, e.g., a timestamp or the name of a resource. Obviously, finding two events happening at exactly the same time is, to say the least, very difficult, and hardly any resource is engaged in all activities. Hence, due to this precision of event logs, sparsity is
unavoidable when a process cube is constructed, and unfortunately, the unloading time complexity
rises exponentially with the number of dimensions and events for typical situations.
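The multiplicative effect can be made explicit with a small back-of-the-envelope formula; it is only a sketch of the trend, not a precise model of the measured times. For a process cube with dimensions D_1, ..., D_k, the number of cells that may have to be visited at unloading is

\[
  \#\text{cells} \;=\; \prod_{i=1}^{k} |D_i|,
\]

while the number of non-empty cells is bounded by the number of events |E|. The fraction of non-empty cells is therefore at most

\[
  \frac{|E|}{\prod_{i=1}^{k} |D_i|},
\]

which shrinks rapidly as dimensions are added or as dimensions with many members, such as timestamps, are included. This is consistent with the super-linear growth of the unloading time observed above.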
Test 2
As mentioned previously, for this test we compare the loading and unloading times of cube configurations with different levels of sparsity.
Figure 6.9: Loading times for Test 2 (log-log plot of time in seconds versus number of events, for the non-sparse and sparse cube configurations).
It can be seen in Figure 6.9 that the loading time does not vary much between the two cubes. The sparser cube appears to take only slightly longer to load. This behavior is expected and was already explained with the results of Test 1. The examples from Test 1 show that the unloading time heavily depends on the number of in-memory dimensions and the number of events. However, the unloading time also depends on the sparsity of the cube. The unloading times for the two cube configurations with the same number of events and dimensions but different sparsity are illustrated in Figure 6.10. Observe that the difference between the unloading times of the higher and lower sparsity cubes for the entire WABO1 event log is more than tenfold.
Figure 6.10: Unloading times for Test 2 (log-log plot of time in seconds versus number of events, for the non-sparse and sparse cube configurations).
One might expect a larger difference, as the ratio between the number of cells in the cubes is
actually about 191, i.e., 73 × 629 cells of a sparse cube divided by 12 × 20 cells of a non-sparse
cube, where 73, 629, 12 and 20 represent the number of elements of the dimensions of the cubes.
Although all the cells have to be visited while unloading the event log, the hybrid nature of the database prevents a huge increase in the required time. The processing time required for empty cells is considerably lower than for cells with events, i.e., if an empty cell is detected, then no query is issued to the relational database and the algorithm jumps to the next cell. Hence, with a 191-fold increase in the number of cells, the overall computational load increases only tenfold.
Test 3
For the purpose of this test, the WABO1 event log with 22,130 events was loaded with the following two dimensions: EVENT timestamp and TRACE caseStatus. Furthermore, the drill-down operation is applied along the timestamp dimension.
Cell Name         All EVENTS   NO VALUE   2010   2011   2012   SUM
Unload time (s)   61.9         0.001      4.4    32.5   26.3   63.2

Table 6.2: Summary of the unload time for Test 3.
In Table 6.2 we provide the unloading time for each cell in the visualization table. The column
SUM stands for the sum of all columns except All EVENTS. Observe that the time to unload
the entire WABO1 event log from the database is only marginally lower than the cumulative time
required for its separate components. This result shows that the filtering operation does not incur any performance penalties on the developed database structure. Applying the same operation on the event data stored in the relational database would require complex queries and, as such, would slow down the process. Therefore, fast filtering along the process cube dimensions is demonstrated here, and it represents a benefit of the multidimensional database technologies.
6.3 Discussion
There are three main observations derived from the experimental results.
Observation 1. The loading time of an event log is practically independent of the number of dimensions of analysis. This fact is illustrated in Figure 6.7 and is a result of the loading
algorithm. The event log is loaded into the database event by event, and for each event a
constant number of operations is performed. Hence, the loading time is dependent only on
the number of events.
Observation 2. Sparsity of the process cube heavily impacts the unloading performance. For
the selected cell in the table of visualization, all combinations of the members of the dimensions of analysis which correspond to this cell are computed during the unload. For each combination, it is verified whether the associated process cube cell contains any events. Hence, a fixed amount of time is spent to check whether the cell is empty, i.e., the cell id is retrieved from the multidimensional database; if the cell id is NULL, the cell is empty and no further actions are performed with respect to this cell. If the cell contains events, additional time is spent to unload the event data from the relational database. Obviously, checking empty cells negatively impacts the unloading time. This is illustrated by the results of the second test, where, with 191 times more cells to verify and the same number of events to unload compared to a normally sparse cube, the unloading time is 10 times larger. A minimal sketch of this empty-cell check is given after the observations.
Observation 3. Manual splitting and analysis of sparse dimensions, e.g., with several hundred dimension members, would be very time consuming and would probably overload the user. Realistically, only dimensions with at most 20 members are fit to be included
in the process cube structure. Selection of such dimensions ensures low sparsity of the
resulting process cube, and results in good responsiveness of the developed tool. Test 1 was
based on a typical selection of analysis dimensions and therefore, its results characterize the
operation speed of the tool in the case of regular sparsity. Moreover, it was observed that the developed tool, including the processing step (e.g., the Log Dialog), delivers the result within 10 s for event logs smaller than 2000 events and process cubes with about 3 to 4 normally sparse
dimensions of analysis. This performance is respectable and makes the tool applicable to
different processes. Moreover, the main focus of the tool is to compare selected parts of the
event log, thus, only small sections of the process cube will be unloaded for comparison in
typical situations. Test 3 shows that the unloading time decreases when only a part of the cube is unloaded, which means that for 2000 events and 4 analysis dimensions, the average time of an operation will be far lower than 10 s. Furthermore, even if the entire cube is split into subcubes and all these subcubes are unloaded simultaneously, no performance penalty will occur, i.e., all subcubes will be processed within 10 s.
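The empty-cell check of Observation 2 can be sketched as follows. The PaloCubeView and RelationalStore types are placeholders introduced for this example; in the implementation, the lookup corresponds to a Palo cell read and a MySQL query, respectively.

    import java.util.List;

    public class CellUnloader {

        // Placeholder for the in-memory cube: returns the cell id stored at the given
        // member combination, or null if the cell is empty.
        interface PaloCubeView {
            Long getCellId(String[] memberCombination);
        }

        // Placeholder for the relational side: returns the event rows stored for a cell id.
        interface RelationalStore {
            List<String[]> getEventRows(long cellId);
        }

        public int unload(PaloCubeView cube, RelationalStore rdb, List<String[]> combinations) {
            int unloadedEvents = 0;
            for (String[] combination : combinations) {
                Long cellId = cube.getCellId(combination);
                if (cellId == null) {
                    continue;                                    // empty cell: cheap check, no RDB query
                }
                List<String[]> rows = rdb.getEventRows(cellId);  // only non-empty cells cost a query
                unloadedEvents += rows.size();
                // the rows would be turned into XES events here, as in the Unloading algorithm
            }
            return unloadedEvents;
        }
    }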
Chapter 7
Conclusions & Future Work
7.1 Summary of Contributions
This master thesis builds on the ideas presented in the PROCUBE project proposal [4]. The
proposal suggests organizing event data from logs in process cubes in such a way that discovery, analysis and comparison of multiple processes become possible. The main goal of this master project
was to build a framework to support process cube exploration. The goal was achieved by following
a series of steps, which the thesis describes in detail.
We started by identifying the problem context. The role of business intelligence and process
mining, in particular, in the functionality and performance of enterprise information systems, was
investigated. Further, the reader was introduced to the business intelligence area, with emphasis
on process mining and OLAP technologies. As concepts from both process mining and OLAP were
repeatedly employed throughout the thesis, a formalization was given for all the adherent notions.
The formalization of OLAP and of process-cube-related notions is one of the contributions of this
thesis. Further elaboration and formalization of the process cube concept can be found in [6].
The next step in the project was to describe the central element of the project, the process
cube. Process cubes realize the link between the process mining framework and the existing
OLAP technology. While process mining focuses on process analysis, OLAP technology is used
for its built-in hypercube structures allowing for operations like slice, dice, roll-up, drill-down
and pivoting. As such, process cubes are defined by introducing the event-related aspects in the
formalization of the OLAP cubes. Along with the process cube formalization, an example was
presented to illustrate the process cube capabilities. This stage of the project was an important
one, as it helped in establishing and clarifying the process cube functionality before its actual
implementation.
Since databases, OLAP and process mining tools already exist, we decided to reuse the current
technologies to save time. Choosing a framework for process mining was easy, as ProM is clearly
the leading open source framework and expertise is readily available at TU/e. Selecting a suitable OLAP technology was not as straightforward, though, because the applied methods and principles vary quite a lot from one OLAP tool to another. Finally, we selected the Palo in-memory multidimensional OLAP database. In-memory tools are known for their increased speed. Moreover, unlike relational databases, multidimensional databases already have the built-in multidimensional structure that is natural for OLAP cubes and, therefore, facilitates OLAP analysis. Being relatively new, this technology is still undergoing a lot of changes and improvements. Nevertheless, it is deemed to have a bright future, especially because of its current and envisioned performance
benefits.
The main contribution of the thesis is creating a basic prototype supporting the notion of
process cube in a process mining context, with the following functionality: XES event logs are
introduced as data sources for OLAP applications; the OLAP process cube is created from event
data; the cube can be visualized from different perspectives; one can “play” with the cube before starting the analysis, by applying different OLAP operations. One of the challenges we encountered after finishing the application was that MOLAP performance worsened with increasing sparsity of the loaded data. We were aware of the sparsity problem from the very beginning; however, we did not expect such poor performance results. One of the potential explanations is that
we used an open source version of Palo from 2011, which might not include the latest performance
improvements that can be found in the commercial tool. Moreover, sparsity is still an open issue
for many multidimensional tools. Only Essbase is known to provide a solution to this problem at
the moment, but it is not open source. We hope that Palo will also release a new version with the
sparsity problem solved. In the meantime, we offered an interim solution to improve the performance
for sparse data.
The solution we provided for dealing with sparsity was to replace the in-memory database with a hybrid structure that stores part of the event data in memory and the other part in a relational
database. The advantage of such a strategy is that it reduces the number of dimensions in the
cube and thus, makes it less sparse. The limitation is that only a part of the event data can
be used for filtering purposes. Furthermore, we reduced the number of elements per dimension
by implementing the hierarchical feature for time data. By allowing time data to be stored in a
hierarchical structure, the sparsity of some very sparse dimensions, such as the timestamp, is reduced considerably.
Finally, we tested the PROCUBE system to determine its capabilities. The information stored
in event logs is inherently multidimensional, and as such, efficient application of process mining
tools requires multidimensional filtering of the event database. The multidimensional, and as a
particular case, in-memory database technology is developed for exactly that purpose. However,
the performed tests show that event logs generally result in sparse multidimensional database
structures, which incurs severe performance penalties when unloading parts of the event log for
further processing. The proposed hybridization of the database structure, i.e., keeping only strictly
necessary dimensions in memory and the rest in a relational database, makes an efficient trade-off between the flexibility of the complete process cube and the responsiveness of the user interaction. Nevertheless, a thorough understanding of the sparsity concept is required for efficient use of the developed tool, as only a limited number of dimensions, e.g., up to 4D for the WABO1 event log, can be used for on-line analysis.
7.2 Limitations
In this section we describe two types of limitations of the in-memory multidimensional OLAP
process cube approach. First, limitations at the conceptual level are presented, followed by implementation limitations.
7.2.1 Conceptual Level
Cell Number Explosion Problem
The cell number explosion problem, also known as sparsity, is common for multidimensional structures, where it is not possible to store data in a compact way, resulting in a large number of missing values at the intersections of dimensions. As such, a process cube
exceeding a certain number of dimensions, with a large number of elements per dimension
and with a lot of missing cell values, leads to sparsity problems and high execution times for
analysis.
Visualization Limitations
In the following, we present two types of limitations related to the visualization of process
mining results. The first is related to the difficulty in visualizing hypercube structures, while
the second one is related to the difficulty in visualizing multiple cell results.
Generally, the visualization of the hypercube structures is not an easy task. On one hand,
multidimensionality is not the natural way in which people can visualize. On the other hand,
there are hardly any tools that provide multidimensional visualizations on more than three
dimensions. In our case, we visualize only two dimensions of the process cube at a time.
This is a simple, yet powerful visualization that allows efficient visual comparison of cell
results. The only caveat is that the growing number of compared cells can become an issue. Fitting multiple results on a single screen can impair the visualization of the results, thus impeding the comparison between cells. This issue becomes even worse in the case of large results. In the process mining area, the curse of dimensionality problem is well known. This is the case for large and complex models, which are usually unreadable. Visual comparison of
such models is not supported in this project, but this is still a research problem in the area.
7.2.2 Implementation Level
Filtering on a Subset of Attributes
The hybrid approach adopted in this project, of storing event data in both in-memory and relational databases, resulted in considerable performance gains. However, it lacks flexibility
with respect to the log filtering possibilities and changing dimensions in the cube. That
is, the user is allowed to select a subset of attributes to be considered as dimensions in
the process cube, while the rest of the attributes and other log information are stored in
relational databases. Selecting only a subset of attributes, limits the log filtering possibilities.
Moreover, changing one dimension of the cube implies creating a new process cube by selecting all the dimensions again.
Limited Set of Supported Plugins
The PROCUBE plugin uses only a limited set of ProM plugins to obtain process mining results. There are two reasons for this limitation. First, not all the existing ProM plugins are
suitable for visual comparison of multiple subprocesses. The PROCUBE tool is designed to
work with plugins that provide quick, direct process mining results. Secondly, there are plugins that cannot be used without following a sequence of wizards, which is problematic in the
PROCUBE settings, as this procedure should be repeated for each process cell individually.
Performance Issues for Sparse Dimensions
Our methods are oriented towards reducing the number of sparse dimensions and the sparsity
within dimensions. Still, if the user selects all the attributes for creating cube dimensions
and there are sparse dimensions among those, the unloading of event data becomes very
slow.
7.3 Further Research
The process cube notion offers a wide range of new research questions and challenges. We will
not enumerate them in this section. Instead, we give some points of reference for improving and
extending the current approach.
Data Mining for the Construction of Hierarchies
Hierarchies are one of the most powerful elements of the OLAP structures. In our tool, the
hierarchy feature is supported only for dimensions with time values. However, meaningful
hierarchical structures can also be constructed for other types of dimensions. Machine learning techniques, e.g., hierarchical clustering, can be applied to obtain clusters of dimension elements that can be used to create a hierarchy. Moreover, data mining techniques can be used to combine elements of multiple dimensions to create a single dimension. That can be accomplished by a meaningful partitioning of the elements; algorithms for partitioning large categorical data, for instance, exist [35].
Reuse of Precomputed Models
Knowledge of the discovered processes can be reused by storing this precomputed information, instead of only creating models on-the-fly. Since producing large models on-the-fly takes time, performance can be improved by saving parts of the created models, or aggregates of the entire models, for further reuse.
Further Visualization Improvement
The visualization proposed in this thesis is based on the simple, traditional 2D visualization.
Undoubtedly, more advanced visualization techniques can be found, with the advantage
of being more representative for analysis and more user-friendly. One such example is the icicle plot construction [32], which can be used to enhance the hierarchical representation of dimensions and facilitate the comparison between two sub-processes.
Bibliography
[1] A Survey of Open Source Tools for Business Intelligence. In David Taniar and Li Chen,
editors, Integrations of Data Warehousing, Data Mining and Database Technologies, pages
237–257. Information Science Reference, 2011.
[2] Business Process Intelligence Challenge (BPIC). In 8th International Workshop on Business Process Intelligence, 2012.
[3] W. M. P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of
Business Processes. Springer, 2011.
[4] W. M. P. van der Aalst. Mining Process Cubes from Event Data (PROCUBE), project
proposal (under review). 2012.
[5] W. M. P. van der Aalst. Process Mining: Making Knowledge Discovery Process Centric.
SIGKDD Explorations Newsletter, 13(2):45–49, 2012.
[6] W. M. P. van der Aalst. Process Cubes: Slicing, Dicing, Rolling Up and Drilling Down
Event Data for Process Mining. In J. Liu M. Song, M.Wynn, editor, Asia Pacific conference
on Business Process Management (AP-BPM 2013), Lecture Notes in Business Information
Processing, 2013.
[7] W. M. P. van der Aalst, A. Adriansyah, A. K. A. de Medeiros, F. Arcieri, T. Baier, T. Blickle,
R. P. Jagadeesh Chandra Bose, P. van den Brand, R. Brandtjen, J. C. A. M. Buijs, A. Burattin, J. Carmona, M. Castellanos, J. Claes, J. Cook, N. Costantini, F. Curbera, E. Damiani,
M. de Leoni, P. Delias, B. F. van Dongen, M. Dumas, S. Dustdar, D. Fahland, D. R. Ferreira,
W. Gaaloul, F. van Geffen, S. Goel, C. W. Günther, A. Guzzo, P. Harmon, A. H. M. ter Hofstede, J. Hoogland, J. Espen Ingvaldsen, K. Kato, R. Kuhn, A. Kumar, M. La Rosa, F. Maggi,
D. Malerba, R. S. Mans, A. Manuel, M. McCreesh, P. Mello, J. Mendling, M. Montali,
H. Motahari Nezhad, M. zur Muehlen, J. Munoz-Gama, L. Pontieri, J. Ribeiro, A. Rozinat,
H. Seguel Pérez, R. Seguel Pérez, M. Sepúlveda, J. Sinur, P. Soffer, M. S. Song, A. Sperduti,
G. Stilo, C. Stoel, K. Swenson, M. Talamo, W. Tan, C. Turner, J. Vanthienen, G. Varvaressos, H. M. W. Verbeek, M. Verdonk, R. Vigo, J. Wang, B. Weber, M. Weidlich, A. J. M. M.
Weijters, L. Wen, M. Westergaard, and M. T. Wynn. Process Mining Manifesto. In BPM
2011 Workshops, Part I.
[8] W. M. P. van der Aalst, M. Pesic, and M. Song. Beyond Process Mining: From the Past
to Present and Future. In Proceedings of the 22nd international conference on Advanced
information systems engineering, CAiSE’10, pages 38–52, 2010.
[9] W. M. P. van der Aalst, H. A. Reijers, and M. Song. Discovering Social Networks from Event
Logs. Computer Supported Cooperative Work, 14(6):549–593, 2006.
[10] S. Agarwal, R. Agrawal, P. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and
S. Sarawagi. On the Computation of Multidimensional Aggregates. 1996.
[11] R. Agrawal, A. Gupta, and S. Sarawagi. Modeling Multidimensional Databases. In Proceedings of the Thirteenth International Conference on Data Engineering, ICDE ’97, pages
232–243, 1997.
[12] I.-M. Ailenei. Process Mining Tools: A Comparative Analysis. Master’s thesis, Eindhoven
University of Technology, 2011.
[13] A. Berson and S. J. Smith. Data Warehousing, Data Mining, and Olap. 1997.
[14] R.P. Jagadeesh Chandra Bose. Process Mining in the Large: Preprocessing, Discovery, and
Diagnostics. PhD thesis, Eindhoven University of Technology, 2012.
[15] J. C. A. M. Buijs. Mapping Data Sources to XES in a Generic Way. Master’s thesis, Eindhoven
University of Technology, 2010.
[16] J. C. A. M. Buijs, B. F. van Dongen, and W. M. P. van der Aalst. Towards CrossOrganizational Process Mining in Collections of Process Models and Their Executions. In
Business Process Management Workshops (2), pages 2–13, 2011.
[17] J. W. Buzydlowski, I.-Y. Song, and L. Hassell. A Framework for Object-Oriented On-Line
Analytic Processing. In Proceedings of the 1st ACM international workshop on Data warehousing and OLAP, DOLAP ’98, pages 10–15, 1998.
[18] S. Chaudhuri and U. Dayal. An Overview of Data Warehousing and OLAP Technology.
SIGMOD Record, 26(1):65–74, 1997.
[19] S. Chaudhuri, U. Dayal, and V. Narasayya. An Overview of Business Intelligence Technology.
Commun. ACM, 54(8):88–98, August 2011.
[20] E. F. Codd, S. B. Codd, and C. T. Salley. Providing OLAP (On-Line Analytical Processing)
to User-Analysis: An IT Mandate, 1993. White paper.
[21] G. Colliat. OLAP, Relational, and Multidimensional Database Systems. SIGMOD Record,
25(3):64–69, 1996.
[22] T. H. Davenport. Putting the Enterprise into the Enterprise System. Harvard Business
Review, 76(4):121–131, 1998.
[23] K. Dhinesh Kumar, H. Roth, and L. Karunamoorthy. Critical Success Factors for the Implementation of Integrated Automation Solutions with PC Based Control. In Proceedings of the
10th Mediterranean Conference on Control and Automation, 2002.
[24] B. F. van Dongen, A. K. A. de Medeiros, H. M. W. Verbeek, A. J. M. M. Weijters, and W. M.
P. van der Aalst. The ProM Framework: A New Era in Process Mining Tool Support. In
Proceedings of the 26th international conference on Applications and Theory of Petri Nets,
ICATPN’05, pages 444–454, 2005.
[25] R. Finkelstein. MDD: Database Reaches the Next Dimension. In Database Programming and
Design, pages 27–38, 1995.
[26] H. Garcia-Molina and K. Salem. Main Memory Database Systems: An Overview. IEEE
Transactions on Knowledge and Data Engineering, 4(6):509–516, 1992.
[27] M. Golfarelli. Open Source BI Platforms: A Functional and Architectural Comparison. In
Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery, DaWaK ’09, 2009.
[28] O. Grabova, J. Darmont, J.-H. Chauchat, and I. Zolotaryova. Business Intelligence for Small
and Middle-Sized Entreprises. SIGMOD Record, 39(2), 2010.
[29] C. W. Günther. XES Standard Definition. Fluxicon Process Laboratories, pages 13–14, 2009.
[30] C. W. Günther and W. M. P. van der Aalst. Fuzzy Mining – Adaptive Process Simplification Based on Multi-Perspective Metrics. In Proceedings of the 5th International Conference on Business Process Management, BPM 2007, pages 328–343, 2007.
[31] J. Han. OLAP Mining: An Integration of OLAP with Data Mining. In Proceedings of the 7th IFIP 2.6 Working Conference on Database Semantics (DS-7), pages 1–9, 1997.
[32] D. Holten and J. J. van Wijk. Visual Comparison of Hierarchically Organized Data. In
Proceedings of the 10th Joint Eurographics / IEEE - VGTC conference on Visualization,
EuroVis’08, 2008.
[33] R. P. Jagadeesh Chandra Bose, W. M. P. van der Aalst, I. Žliobaitė, and M. Pechenizkiy.
Handling Concept Drift in Process Mining. In Proceedings of the 23rd international conference
on Advanced Information Systems Engineering, CAiSE’11, pages 391–405, 2011.
[34] M. R. Jensen, T. H. Møller, and T. B. Pedersen. Specifying OLAP Cubes on XML Data.
Journal of Intelligent Information Systems, 17(2-3):255–280, 2001.
[35] G. V. Kass. An Exploratory Technique for Investigating Large Quantities of Categorical
Data. Journal of the Royal Statistical Society, 29(2):119–127, 1980.
[36] C. X. Lin, B. Ding, J. Han, F. Zhu, and B. Zhao. Text Cube: Computing IR Measures for
Multidimensional Text Database Analysis. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, ICDM ’08, 2008.
[37] M. Liu, E. A. Rundensteiner, K. Greenfield, C. Gupta, S. Wang, I. Ari, and A. Mehta. E-Cube: Multidimensional event sequence processing using concept and pattern hierarchies. In
International Conference on Data Engineering, pages 1097–1100, 2010.
[38] E. Lo, B. Kao, W.-S. Ho, S. D. Lee, C. K. Chui, and D. W. Cheung. OLAP on Sequence
Data. In Proceedings of the 2008 ACM SIGMOD international conference on Management
of data, SIGMOD ’08, 2008.
[39] F. Melchert, R. Winter, and M. Klesse. Aligning Process Automation and Business Intelligence to Support Corporate Performance Management. In AMCIS’04, pages 507–507, 2004.
[40] R. B. Messaoud, O. Boussaid, and S. Rabaséda. A New OLAP Aggregation Based on the
AHC Technique. In Proceedings of the 7th ACM international workshop on Data warehousing
and OLAP, DOLAP ’04, 2004.
[41] S. Negash. Business Intelligence. Communications of the Association for Information Systems,
13(1):177–195, 2004.
[42] T. Niemi, J. Nummenmaa, and P. Thanisch. Constructing OLAP Cubes Based on Queries.
In Proceedings of the 4th ACM international workshop on Data warehousing and OLAP,
DOLAP ’01, 2001.
[43] T. B. Pedersen and C. S. Jensen. Multidimensional Database Technology. Computer, 34(12):40–46, December 2001.
[44] D. Riazati, J. A. Thom, and X. Zhang. Drill Across and Visualization of Cubes with Non-conformed Dimensions. In Nineteenth Australasian Database Conference, volume 75, pages
85–93, 2008.
[45] J. Ribeiro. Multidimensional Process Discovery. PhD thesis, Eindhoven University of Technology, Beta Dissertation Series D165, 2013.
[46] C. Salka. Ending the MOLAP/ROLAP Debate: Usage Based Aggregation and Flexible
HOLAP (Abstract). In Proceedings of the Fourteenth International Conference on Data Engineering, February 23-27, 1998, Orlando, Florida, USA, page 180, 1998.
[47] B. Sigg. DockingFrames 1.1.1 - Common. pages 7–8, 2012.
[48] Stratebi. Open Source B.I. Comparative. White paper, 2010.
[49] C. Thomsen and T. B. Pedersen. A survey of open source tools for business intelligence. In
Proceedings of the 7th international conference on Data Warehousing and Knowledge Discovery, DaWaK’05, 2005.
[50] C. Thomsen and T. B. Pedersen. A Survey of Open Source Tools for Business Intelligence.
International Journal of Data Warehousing and Mining, 5(3):56–75, 2009.
[51] E. Thomsen. OLAP Solutions: Building Multidimensional Information Systems. John Wiley & Sons, 2002.
[52] Y. Tian, R. A. Hankins, and J. M. Patel. Efficient Aggregation for Graph Summarization.
In Proceedings of the 2008 ACM SIGMOD international conference on Management of data,
SIGMOD ’08, 2008.
[53] A. J. M. M. Weijters and A. K. A. de Medeiros. Process Mining with the HeuristicsMiner Algorithm. Technical report, Eindhoven University of Technology, 2006.
[54] K. Withee. Microsoft Business Intelligence for Dummies. Wiley Publishing, 2010.