MAKERERE UNIVERSITY, FACULTY OF COMPUTING AND IT
Profiling Event Log data
A PhD Draft Proposal submitted to School of Graduate Studies, Makerere University
Peter Khisa Wakholi
4/28/2008
Process-aware information systems generate logs resulting from activities undertaken during
the execution of business processes. For unstructured business processes, several factors that
depend on the attributes and interests of the actors determine the course of execution. Models
extracted from these processes are spaghetti-like and hard to understand. This study seeks to explore
how profiles extracted from the logs can be used to understand the underlying behavior in order to
generate more descriptive information about the processes and subsequently explain behavior and
predict how incomplete processes will proceed.
1 Introduction
1.1 The City Metaphor
Imagine a large city; it will have many places and residents. The places can have distinct
address locations. In each location, an activity that defines the place is undertaken; for
example places that sell wines and spirits would be bars. Furthermore, places can have
similar functionality and can therefore be categorized (e.g. bars).
Each day the residents engage in activities that take them to various places in the city. The
choice of which place to go to depends on the individual’s interests. For example, in order to
work, a resident who is a student will go to school. After work, he may choose to stop at
another place, depending on his interests. For example, one who likes ‘hanging out’ would
probably stop at a bar before going home. Most people follow a particular weekly pattern,
although occasional deviations from that pattern do occur.
Imagine that it were possible to log every place visited by the residents. By looking at these
logs, one would be able to learn a lot about an individual. This information
would be derived from the nature of the places visited. For example, if someone goes to a
place called school very often, then we can conclude that the person either studies or works
in a school. With more knowledge about the attributes of the individual (e.g. age) it is
possible to classify that individual as a student by following some association rule, for
example a rule stating that people below 22 who visit schools on a regular basis are
students and follow a particular sequence of events. In essence, we are using the logs,
underlying process models and attributes to develop a profile of an individual. The profile
can even be extended to fit other individuals with similar characteristics.
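The rule in the city example can be written down directly. The following is a minimal sketch only: the visit log, the ages, the threshold of three visits and the "school staff" label are all hypothetical values chosen for illustration, not data or rules from this study.

```python
from collections import Counter

# Hypothetical visit log: (resident, place_category) pairs, one per visit.
VISITS = [
    ("amina", "school"), ("amina", "school"), ("amina", "bar"),
    ("amina", "school"), ("brian", "school"), ("brian", "school"),
    ("brian", "school"), ("carol", "bar"), ("carol", "office"),
]
# Hypothetical known attributes of each resident.
AGES = {"amina": 19, "brian": 41, "carol": 20}


def profile(resident, visits, ages, min_visits=3, student_age_limit=22):
    """Apply the example rule: regular school visitors under 22 are students."""
    counts = Counter(place for who, place in visits if who == resident)
    if counts["school"] >= min_visits:
        # Regular school visitors at or above the age limit are assumed to work there.
        return "student" if ages[resident] < student_age_limit else "school staff"
    return "unknown"


for name in AGES:
    print(name, profile(name, VISITS, AGES))
```

The point of the sketch is that the profile combines two sources, the log (visit counts) and the known attributes (age), which is the combination the proposal builds on.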
For a large population, the process of discovering profiles would be tedious. It would be
prudent to start by clustering the logs based on some similar characteristics. Again the
criteria used are very important. For example, based on a certain age group, one would get a
cluster of activities that people in that age group do. A different approach could be based on
a sequence of events. In this case, we would cluster according to the places people visit and
related attributes like how long they spend there.
Profiling is defined as the extrapolation of information about something, based on known
qualities (Wikipedia). The profile developed can be useful in many ways. For example, a
profile of a typical weekday's activity can be developed, such as home to university to bar to
home on a Wednesday. It is possible to draw a map of all the places an individual visits
and define the rules followed to visit a place. Therefore, given an individual with some
characteristics, it should be possible to predict what route that individual would take on a
particular day. The reverse is also true; given a set of logs, it should also be possible to tell
what kind of individual we are dealing with. The accuracy of such predictions depends
largely on the level of knowledge derived. It is therefore important that the derived profile
is complete and consistent.
1.2 Profiling based on Information System event logs
The scenario described above can be translated to information systems. Today many of the
activities occurring in processes are either supported or monitored by information
systems. Many systems, for example ERP, WFM, CRM and PDM systems,
support a wide variety of business processes while recording well-structured and detailed
event logs. Whereas it is sometimes possible to have a clear process that defines the workflow (e.g. in a
manufacturing plant), in other cases it is not possible to define the process in advance (e.g. patient flow
in a hospital).
It is in situations with such unstructured processes that profiling based on event logs
would be most relevant. As in the city described above, the sequence of events for a
particular case largely depends on the interests and attributes of the actors. If information about the
actors is available, it is possible to use the event log and attributes to discover interesting
information about their characteristics based on behaviour. The case is different in
situations where only the log exists (e.g. web-based logs). In such cases, profiling would
be based only on event logs. A complete approach to profiling must therefore cover both
scenarios.
Unlike the city metaphor, where the number of places to be visited is very large and effectively
unbounded, information systems typically have a finite number of places. Each place has an
attribute stamp, which in essence is the characteristic that an actor needs in order to reach it. The
transition from one place to another is an event in a sequence. A collection of sequences
forms the process model that defines the system. These processes are unstructured in that
they do not follow a predefined process model. However, by process mining based on the
logs it is possible to create a process model. This can be enhanced by creating profiles,
which are the rules that determine the navigation pattern and characteristics of the actors.
A range of techniques, from basic statistics to machine learning techniques like
clustering, association rules and sequential mining, can be used to develop profiles. If, by definition,
a complete profile is one that provides a clear picture of actors and their behaviour, a
combination of the techniques would be necessary. Basic statistics could give a general
overview of the most frequently occurring patterns and attributes. Clustering can be done based on
known attributes of the actors or based on patterns in the log if used in combination with
sequential mining. Association rules can be used to develop rules that define a sequence of
events. However, these profiles are just sets of event logs. Process mining can be used to
discover process models that describe the possible activities an actor can engage in.
As in the case of the large city metaphor above, a complete profile would involve using the
various techniques above in combination. For an unstructured event log, this involves
identifying appropriate techniques that can derive meaning from attributes of the cases
involved, sequence of events, clusters of cases and related association rules. A complete
profile log can then be mined to discover the processes and the rules that define flow
within the process. This can then provide a basis for behavioral analysis, that is, discovering
previously unknown attributes. Furthermore, it can be used to accurately predict
how incomplete processes will proceed.
1.3 Statement of the Problem
Unstructured processes do not follow predetermined process models. The paths followed
largely depend on the attributes of the interacting entities and on environmental factors.
Event logs from such processes would provide useful information about profiles of the
entities involved. Process mining such logs produces spaghetti-like models from which it is hard
to draw meaningful information.
Profiling would help in developing a better understanding of the underlying process models
by:
1. Extracting meaningful process models from logs
2. Determining the rules that define the control flow
3. Determining the characteristics of the actors in the model
There are many knowledge discovery techniques that can be used to provide different
aspects of the profile. It is important that the profile discovered is complete in order to gain
insight about the entities involved and predict a course of action for incomplete processes.
In order to build a complete profile for this purpose a combination of the various
techniques available is necessary. The research proposes to provide a model that uses a
combination of techniques, by analyzing existing profiling techniques and customizing
them for unstructured event log data.
2 Research Questions
The main research question in this project is “How can complete and highly accurate
profiles be developed from event log data in order to discover structure of underlying
process models and the nature of the actors?”
In order to achieve new knowledge in this area, the following sub questions will be investigated.
1. What techniques can be used to extract process-related profiles based on event
log data?
2. How can these techniques be deployed to develop a complete profile for
unstructured processes?
3. How can mining these log data based on these profiles be used to develop process
models in order to discover structure in the unstructured processes?
4. What interpretation or meaning can be attributed to observed behavior in the
profiles?
3 Research Motivation
The purpose of this research project is to develop a methodology for profiling based on
event log data and known case attributes. It seeks to develop a deeper understanding of the
subject area with a view of establishing the capabilities and limitations of the available
statistical and machine learning algorithms.
The knowledge developed from this research should give a better understanding of processes
mined from unstructured event logs. Furthermore, it should be possible to determine which
profile suits a case based on some characteristics, which can then be used for prediction.
These concepts could be used in real-life situations to get a better understanding of the
reasons for some observed behaviour in a system. Additionally, by predicting the possible
sequence of a new case, runtime parameters can be optimally configured thereby
increasing efficiency.
4 Review of Related literature
4.1 What is profiling?
Profiles based on data have been widely used in database design and development.
Bitpipe (www.bitpipe.com) defines data profiling as “The use of analytical techniques about data
for the purpose of developing a thorough knowledge of its content, structure and quality”.
To carry out a data profiling project, it is important to define the objectives of the profiling
effort. Appropriate data sampling techniques can then be used to draw data samples. These
samples can then be used to identify a complete profile, which can form the basis of
analysis.
Profiling based on event log data can be defined as the practice of tracking information
about processes by monitoring their execution. This can be done by analyzing the case
perspective, process perspective and resource perspective to assess their behaviour,
predict certain characteristics and to configure optimum runtime parameters. The case
perspective here refers to the attributes that change as the process is executing. The
process perspective, on the other hand, refers to the actual sequence of events that are
followed. The resource perspective refers to the resources that are utilised in executing the
process.
4.2 Event Log Based Profiling
Event logs typically contain a lot of information about process execution, resource
utilization and the cases being executed. This information is in the form of timestamps,
transactional information, information on users, data attributes, etc.
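To make the contents of an event log concrete, the sketch below shows one possible in-memory representation and how events can be grouped into per-case traces. The field names (case_id, activity, resource, timestamp) and the hospital-style records are assumptions made for illustration; real logs will carry whatever fields the recording system provides.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    case_id: str         # the case (e.g. a patient) the event belongs to
    activity: str        # what was done
    resource: str        # who or what performed it
    timestamp: datetime  # when it happened


def traces(events):
    """Group events by case and order each case's activities by timestamp."""
    by_case = defaultdict(list)
    for e in sorted(events, key=lambda e: e.timestamp):
        by_case[e.case_id].append(e.activity)
    return dict(by_case)


# Hypothetical log records, deliberately interleaved across two cases.
LOG = [
    Event("p1", "register", "clerk", datetime(2008, 4, 28, 9, 0)),
    Event("p2", "register", "clerk", datetime(2008, 4, 28, 9, 5)),
    Event("p1", "triage", "nurse", datetime(2008, 4, 28, 9, 10)),
    Event("p1", "consult", "doctor", datetime(2008, 4, 28, 9, 40)),
    Event("p2", "consult", "doctor", datetime(2008, 4, 28, 9, 50)),
]

print(traces(LOG))
```

Grouping by case gives the process perspective (the sequence of activities), while the resource field supports the resource perspective.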
Log-based profiles have been developed for unstructured processes such as computer
forensics and web usage. The basic goal is to develop information about the user's behavior,
which can include known attributes of the user. Profiles can be developed for individual
users or user groups. Often, machine learning techniques like clustering, association mining
and sequential mining are used to create user profiles. Simple statistical analysis based on
frequency of occurrences can also provide useful information. Process mining can be used
to discover process models for a particular user or groups of users with similar
characteristics.
4.3 Profiling Techniques
To develop profiles, data mining techniques are commonly used. Tamas Abraham (2006) notes that this
approach is used in web log data mining in order to market content and services tailored to
an individual on the basis of knowledge about their preferences and behaviour.
Techniques commonly used include association rule mining and clustering. A newer
technique that is gaining ground in event log analysis is sequence mining.
According to Fayyad et al. (1996), data mining techniques can be categorised as prediction
methods and description methods. Prediction methods use some variables to predict unknown
or future values of other variables. Examples of prediction methods include classification,
regression and deviation detection. Description methods find human-interpretable patterns
that describe data. Examples of description methods include clustering, association rule
discovery and sequential pattern discovery. The techniques used in profiling are explored
in the sections that follow.
4.3.1 Classification
Classification techniques assign records in a collection to classes based on a set of
attributes (Tan, Steinbach and Kumar, 2004). The goal is to find a model for the class attribute
as a function of the values of the other attributes. Previously unseen records should be
assigned a class as accurately as possible. Usually, the given data set is divided into training
and test sets, with the training set used to build the model and the test set used to validate it (Tan,
Steinbach and Kumar, 2004).
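The training and test split described above can be illustrated with a very small classifier. This is a sketch under stated assumptions: the records, the two features (age and school visits per week) and the choice of a 1-nearest-neighbour classifier are all illustrative and are not prescribed by the cited authors.

```python
def nearest_neighbour(train, record):
    """Return the class label of the training record closest to `record`."""
    def dist(a, b):
        # Squared Euclidean distance between two feature tuples.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _features, label = min(train, key=lambda row: dist(row[0], record))
    return label


# Hypothetical records: (age, school visits per week) -> class label.
DATA = [
    ((18, 5), "student"), ((20, 4), "student"), ((21, 5), "student"),
    ((35, 0), "other"), ((42, 1), "other"), ((50, 0), "other"),
    ((19, 5), "student"), ((45, 0), "other"),
]
# The first six records build the model; the last two validate it.
TRAIN, TEST = DATA[:6], DATA[6:]

correct = sum(nearest_neighbour(TRAIN, x) == y for x, y in TEST)
accuracy = correct / len(TEST)
print(f"accuracy on held-out records: {accuracy:.2f}")
```

Keeping the test records out of the model is what makes the accuracy figure a fair estimate for previously unseen records.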
4.3.2 Association rule Mining
Association rule mining produces dependency rules that predict the occurrence of an
item based on the occurrences of other items in a collection. Association rules can be used to
create a system profile by considering the most frequently occurring behaviour as normal
(R. Vaarandi 2003). Association rule algorithms are used to detect relationships between
event types.
Association rules can be used to build a rule set that describes the behaviour of data within
a level of confidence. Such information can be obtained from log files. An association rule
algorithm can, for example, produce the rule “if events of type A and B occur within 5 seconds,
they will be followed by an event of type C within 60 seconds”. Association rules have been
used for forensic purposes based on log data.
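The confidence of the example rule can be estimated by counting how often its left-hand side occurs and how often the right-hand side then follows. The sketch below makes two simplifying assumptions that are not in the source: the A event must precede the B event, and the 60 seconds are measured from the B event. The sample events are hypothetical.

```python
def rule_confidence(events, window=5, horizon=60):
    """Confidence of: A and B within `window` s => C within `horizon` s.

    `events` is a list of (timestamp_seconds, event_type) sorted by time.
    """
    triggers = hits = 0
    for i, (t_a, kind) in enumerate(events):
        if kind != "A":
            continue
        # Find the first B that occurs within `window` seconds after this A.
        t_b = next(
            (t for t, k in events[i + 1:] if k == "B" and t - t_a <= window),
            None,
        )
        if t_b is None:
            continue  # the left-hand side of the rule did not occur
        triggers += 1
        # The rule holds if some C follows the B within `horizon` seconds.
        if any(k == "C" and 0 <= t - t_b <= horizon for t, k in events):
            hits += 1
    return hits / triggers if triggers else 0.0


# Hypothetical log: the rule fires twice and holds once.
EVENTS = [
    (0, "A"), (3, "B"), (40, "C"),
    (100, "A"), (102, "B"), (300, "C"),
    (400, "A"), (450, "B"),
]
print(rule_confidence(EVENTS))
```

A rule would normally be kept in the profile only if its confidence (and the number of times it fires) exceeds thresholds chosen by the analyst.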
4.3.3 Cluster Analysis
Clustering divides a set of data into meaningful or useful groups. R. Vaarandi (2003)
describes clustering as a technique used to group similar objects into clusters based on
some patterns. Cluster analysis can be used in isolation or as a part of several other
techniques, depending on the goal (Tan, Steinbach and Kumar). If the goal is to create
meaningful groups of data then this technique can be used in isolation. In other cases, it
provides a useful starting point for data summarisation.
The basic concepts of clustering are widely used to describe groups of objects that have
common characteristics. These techniques are widely applied in biology,
climate, psychology and medicine, business and information retrieval (Tan, Steinbach and
Kumar). Clustering techniques abstract properties of the data being analysed. It is
therefore important to make sure the techniques provide the most representative
prototypes.
When clustering serves as a starting point for other analysis, its uses can be classified as
summarisation, compression and efficiently finding the nearest neighbour (Tan, Steinbach and
Kumar). Summarisation applies an analysis technique to a representative set of cluster
prototypes, because many techniques are not practical for large data sets.
Compression is sometimes known as vector quantisation, because each object is
represented by the index of its cluster prototype in a table. Finding the nearest neighbour can
require calculating the pairwise distance between all points, a cost that cluster prototypes can reduce.
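As an illustration of how cluster prototypes arise, the following is a minimal k-means sketch. K-means is used here only as a familiar example of prototype-based clustering; the source does not name a specific algorithm, and the points (age, hours spent at a bar per week) and starting centroids are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid (squared distance).
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else old
            for c, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical actors described by (age, hours at a bar per week).
POINTS = [(18, 1), (20, 2), (19, 0), (40, 8), (45, 10), (42, 9)]
centroids, clusters = kmeans(POINTS, [(18, 1), (45, 10)])
print(centroids)
print(clusters)
```

Each final centroid is a prototype in the sense used above: storing only the index of the nearest centroid for every point is a simple form of vector quantisation.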
4.3.4 Sequence Pattern Mining
Given a set of objects, each associated with its own timeline of events, sequential pattern
discovery finds rules that predict strong sequential dependencies among
different events (Tan, Steinbach and Kumar). Rules are formed by first discovering
patterns. Event occurrences in the patterns are governed by timing constraints.
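A very small instance of this idea is to look for ordered pairs of events that occur close together in many timelines. The sketch below is an assumption-laden simplification: it considers only patterns of length two, uses a single maximum-gap timing constraint, and runs on hypothetical timelines; full sequential pattern mining algorithms handle longer patterns and richer constraints.

```python
from collections import Counter


def frequent_pairs(timelines, max_gap, min_support):
    """Find ordered pairs (a, b) where b follows a within `max_gap` time units.

    `timelines` maps an object id to a time-sorted list of (time, event).
    Support is the number of timelines containing the pair at least once.
    """
    support = Counter()
    for events in timelines.values():
        seen = set()
        for i, (t_a, a) in enumerate(events):
            for t_b, b in events[i + 1:]:
                if t_b - t_a > max_gap:
                    break  # later events are even further away
                seen.add((a, b))
        support.update(seen)  # count each pair once per timeline
    return {pair: n for pair, n in support.items() if n >= min_support}


# Hypothetical timelines: time in hours since the start of the day's log.
TIMELINES = {
    "u1": [(0, "home"), (1, "university"), (9, "bar"), (11, "home")],
    "u2": [(0, "home"), (2, "university"), (10, "home")],
    "u3": [(0, "home"), (1, "university"), (8, "bar"), (9, "home")],
}
print(frequent_pairs(TIMELINES, max_gap=3, min_support=2))
```

Here the timing constraint excludes "university then bar", because the two events are more than three hours apart in every timeline.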
4.4 Applications of Log Data Profiling Techniques
4.4.1 Computer Forensics
In addition, a clearly identified line pattern can be included in the final profile of the system.
These techniques can be used to detect anomalies by creating clusters of anomalies.
Clustering can be used to create system profiles so that anomalies in a process can be
detected. Clustering techniques divide a data set into groups, each having similar
characteristics. This can be used as a precursor to association rule mining to detect
relationships between event types.
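The idea of a system profile that separates normal from anomalous behaviour can be sketched with a simple frequency threshold. Note that this is a stand-in, not the clustering approach itself: event types that make up a sufficient share of a training log are treated as normal, and everything else is flagged. The event names and the 5% threshold are hypothetical.

```python
from collections import Counter


def build_profile(training_events, min_share=0.05):
    """Treat event types making up at least `min_share` of the log as normal."""
    counts = Counter(training_events)
    total = sum(counts.values())
    return {kind for kind, n in counts.items() if n / total >= min_share}


def anomalies(profile, new_events):
    """Return the events whose type is not part of the normal profile."""
    return [e for e in new_events if e not in profile]


# Hypothetical training log of 100 events.
TRAINING = ["login"] * 50 + ["read"] * 40 + ["logout"] * 9 + ["delete_log"] * 1
normal = build_profile(TRAINING)
print(anomalies(normal, ["login", "delete_log", "read", "format_disk"]))
```

Both rare event types and event types never seen in training are reported, which is the behaviour a forensic investigator would usually want from a first-pass filter.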
4.4.2 Web Services
Many techniques have been developed for mining web log data in order to derive some
meaning. Chi (2002) illustrates how association rule mining can be used to enable an
understanding of user goals when navigating the web. To develop complete and accurate
profiles, an understanding of customer web browsing is necessary. Chi (2002) proposes a
method that infers major groupings of web traffic through association rule mining. Shah,
Joshi and Wurman (2002) use data mining to understand the auction process by exploring
common bidding patterns. Through this they propose new bidding engagements and rules
for classifying strategies. Furthermore, they seek to suggest economic motivations for such
behaviour.
To develop profiles, data mining techniques have been used with modification to enhance
relevance to the web domain. Ypma and Heskes (2002) use Markov models to model the
click streams of web surfers. They use prior knowledge and various Markov modelling
techniques to obtain web page categorizations based on weblogs. Hay, Wets and Vanhoof
present a new algorithm called Multidimensional Sequence Alignment Method (MDSAM).
MDSAM is used for mining navigation patterns on a web site. It examines sequences
composed of several information types, such as visited pages and time spent on
pages. It identifies profiles showing visited pages, time spent on pages and the
order in which pages are visited on a website. Yang, Parthasarathy and Reddy (2002) provide
an approach which is based on association rule mining. Their algorithm discovers
association rules that are constrained (and ordered) temporally. This is based on the
premise that pages accessed recently have a greater influence on pages that will be
accessed in the near future. Oyanagi, Kubota and Nakase explore issues in sequence pattern
mining of web data. They observe that the Apriori algorithm suffers from inherent difficulties in finding long
sequential patterns and in finding interesting patterns among a huge amount of results,
and they propose a new method for finding sequence patterns by matrix clustering.
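To illustrate the Markov modelling of click streams mentioned above, the sketch below estimates a plain first-order Markov chain from sessions and uses it to predict the next page. This is deliberately simpler than the mixtures of hidden Markov models used by Ypma and Heskes, and the sessions are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities from click streams."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return {
        page: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for page, nexts in counts.items()
    }


def predict_next(model, page):
    """Most likely next page, or None if the page was never left in training."""
    nexts = model.get(page)
    return max(nexts, key=nexts.get) if nexts else None


# Hypothetical sessions, each an ordered list of visited pages.
SESSIONS = [
    ["home", "catalog", "item", "cart"],
    ["home", "catalog", "item"],
    ["home", "search", "item", "cart"],
    ["home", "catalog", "cart"],
]
MODEL = transition_probabilities(SESSIONS)
print(MODEL["home"])
print(predict_next(MODEL, "home"))
```

The same structure applies to event logs in general: replacing pages with activities gives a simple predictor for the next step of an incomplete case.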
4.5 Application in Process Mining
Process mining is the extraction of non-trivial and useful information from event logs
recorded by information systems. Many techniques have been developed to automatically
discover a process model based on some event log. For unstructured processes, the models
produced are spaghetti-like and difficult to read (Medeiros et al., 2008). Clustering
techniques have been developed within the PROM framework to develop process models
based on coherent sets of cases.
A number of data mining techniques have been implemented in process mining. The PROM
framework provides a variety of plug-ins that can be used to develop profiles. These plug-ins
can be classified as relating to mining or analysis of event logs.
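The basic step behind many process discovery techniques, counting which activity directly follows which, can be sketched in a few lines. This is an illustrative sketch and not the algorithm of any particular PROM plug-in; the traces and the frequency threshold are hypothetical.

```python
from collections import Counter


def directly_follows(traces):
    """Count how often activity a is immediately followed by activity b."""
    edges = Counter()
    for trace in traces:
        edges.update(zip(trace, trace[1:]))
    return edges


def filter_edges(edges, min_count):
    """Keep only frequent edges, a crude way to thin a spaghetti-like model."""
    return {edge: n for edge, n in edges.items() if n >= min_count}


# Hypothetical traces, one list of activities per case.
TRACES = [
    ["register", "triage", "consult", "discharge"],
    ["register", "consult", "discharge"],
    ["register", "triage", "consult", "lab", "consult", "discharge"],
]
EDGES = directly_follows(TRACES)
print(filter_edges(EDGES, min_count=2))
```

Running the same counting on a cluster of similar cases, rather than on the whole log, is one way a profile could yield a simpler model than mining the full unstructured log.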
4.5.1 Mining Plug-ins
• Social Network Miner
• Heuristics miner
• Fuzzy miner
• Association rule miner
4.5.2 Analysis Plug-ins
• LTL checker
• Log summary
• Performance sequence diagram analysis
• Log clustering
• Basic log statistics
5 Methodology
A study of data and process mining algorithms is required to address the research
questions and their associated knowledge areas. The proposed research will take place in
three stages. The first stage will be the preparation stage, in which the profiling methods are
developed. The second stage will develop a tool deployed within the PROM framework to validate and refine
the methods developed based on real world data. The final stage will be the development of
guidelines for interpretation of profile information developed from unstructured
processes.
The proposed project will involve the following stages:
5.1 Preparation stage
The preparation stage will involve a review of relevant literature on data and process
mining. Various techniques will be reviewed in order to determine the
most appropriate ones for profiling based on event log data. This stage will also seek to
determine how some techniques can be modified to suit unstructured event log data. The
result will be the development of methods and mechanisms by which a combination of
these algorithms can be used to develop complete and accurate profiles based on event log
data.
5.2 Tool Development and Method Testing
This stage will develop one or more tools that implement the above methods within the PROM
framework. In order to validate the methods developed, experiments based on real world
data will be run and the profiles generated studied. The profiles generated will be
compared to the data to determine their accuracy and completeness.
5.3 Development of Guidelines
It is unlikely that any one method developed would work for all cases, so the methods
developed will address the general case. This stage will therefore
seek to develop a list of guidelines that could be used to accurately develop profiles based
on possible scenarios.
5.4 Results
The proposed research includes aspects of an academic discipline and a professional field.
The academic discipline will be influenced by inputs from the IS research teams, while the
professional field will be influenced by industry-related information systems. Similarly, the results of
the research should provide opportunities for application to real problems and be usable by
professionals in the field. This will be strengthened by the development of methods,
modified algorithms and guidelines that can be used to develop profiles.
The results of this research are intended to contribute to knowledge in information
systems, and more specifically to developing profiles using data and process mining
techniques. When these concepts are applied in process mining, the results will enable a
better understanding of process models and the ability to accurately predict events yet to
occur.
6 References
PLEASE NOTE THAT REFERENCING IS YET TO BE DONE PROPERLY.
1. http://en.wikipedia.org/wiki/Profiling
2. http://www.bitpipe.com/tlist/Data-Profiling.html
3. http://en.wikipedia.org/wiki/Data_profiling
4. http://www.tdwi.org/Publications/WhatWorks/display.aspx?id=7312
5. Fayyad et al., 1996. Advances in Knowledge Discovery and Data Mining.
6. R. Vaarandi, 2003. A Data Clustering Algorithm for Mining Patterns From Event Logs.
7. Tan, Steinbach and Kumar. Introduction to Data Mining (book).
8. Brij et al. Web site of the WebKDD 2002 Workshop, http://db.cs.ualberta.ca/webkdd02/.
9. Tamas Abraham, 2006. Event Sequence Mining to Develop Profiles for Computer
Forensic Investigation Purposes. Australian Computer Society.
10. Shigeru Oyanagi, Kazuto Kubota and Akihiko Nakase, 2002. Mining WWW Access Sequence by
Matrix Clustering. WEBKDD 2002 - Mining Web Data for Discovering Usage Patterns and
Profiles.
11. Alexander Ypma and Tom Heskes. Automatic Categorization of Web Pages and User
Clustering with Mixtures of Hidden Markov Models.
12. Birgit Hay, Geert Wets and Koen Vanhoof. Web Usage Mining by Means of
Multidimensional Sequence Alignment Methods.
13. Hui Yang and Srinivasan Parthasarathy. On the Use of Constrained Associations for
Web Log Mining.
14. Harshit S. Shah, Neeraj R. Joshi, Ashish Sureka and Peter R. Wurman. Mining eBay:
Bidding Strategies and Shill Detection.
15. Ed H. Chi, Adam Rosien and Jeffrey Heer. LumberJack: Intelligent Discovery and
Analysis of Web User Traffic Composition.