MAKERERE UNIVERSITY, FACULTY OF COMPUTING AND IT

Profiling Event Log Data

A PhD Draft Proposal submitted to the School of Graduate Studies, Makerere University

Peter Khisa Wakholi
4/28/2008

Process-aware information systems generate logs resulting from activities undertaken during the transaction of business processes. For unstructured business processes, several factors that depend on the attributes and interests of the actors determine how the process is executed. Models extracted from these processes are spaghetti-like and hard to understand. This study seeks to explore how profiles extracted from the logs can be used to understand the underlying behavior, in order to generate more descriptive information about the processes and subsequently explain behavior and predict the course of incomplete processes.

1 Introduction

1.1 The City Metaphor

Imagine a large city with many places and residents. Each place has a distinct address. At each location, an activity that defines the place is undertaken; for example, places that sell wines and spirits are bars. Places with similar functionality can therefore be grouped into categories (e.g. bars). Each day the residents engage in activities that take them to various places in the city. The choice of place depends on the individual’s interests. For example, a resident who is a student will go to school to work. After work, he may choose to pass by another place, depending on his interests; one who likes ‘hanging out’ would probably pass by a bar before going home. Most people follow a particular weekly pattern, with deviations that seldom occur.

Imagine that it were possible to log every place visited by the residents. By looking at these logs, one would be able to learn a lot about an individual. This information would be derived from the nature of the places visited.
For example, if someone goes to a place called school very often, we can conclude that the person either studies or works in a school. With more knowledge about the attributes of the individual (e.g. age), it is possible to classify that individual as a student by following some association rule, for example a rule stating that people below 22 who visit schools on a regular basis are students and follow a particular sequence of events. In essence, we are using the logs, the underlying process models and the attributes to develop a profile of an individual. The profile can even be extended to fit other individuals with similar characteristics.

For a large population, the process of discovering profiles would be tedious. It would be prudent to start by clustering the logs based on some similar characteristics. Again, the criteria used are very important. For example, clustering by age group yields the cluster of activities that people in that age group do. A different approach could be based on the sequence of events. In this case, we would cluster according to the places people visit and related attributes such as how long they spend there.

Profiling is defined as the extrapolation of information about something, based on known qualities (Wikipedia). The profile developed can be useful in many ways. For example, a profile of a typical weekday’s activity could be: home to university to bar to home on a Wednesday. It is possible to draw a map of all the places an individual visits and define the rules followed to visit a place. Therefore, given an individual with some characteristics, it should be possible to predict what route that individual would take on a particular day. The reverse is also true: given a set of logs, it should be possible to tell what kind of individual we are dealing with. The accuracy of such predictions depends largely on the level of knowledge derived.
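The student rule from the metaphor can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the visit log, the residents' names and ages, the `min_visits` threshold for "on a regular basis", and the label "school staff" for older regular visitors are all hypothetical and are not taken from any real data set.

```python
from collections import Counter

# Hypothetical visit log: one (resident, place category) pair per visit.
VISITS = [
    ("alice", "school"), ("alice", "school"), ("alice", "bar"),
    ("alice", "school"), ("bob", "school"), ("bob", "school"),
    ("bob", "school"), ("carol", "bar"), ("carol", "office"),
]

# Hypothetical known attribute (age) of each resident.
AGES = {"alice": 20, "bob": 35, "carol": 28}


def visit_counts(visits):
    """Count how often each resident visited each place category."""
    counts = {}
    for resident, place in visits:
        counts.setdefault(resident, Counter())[place] += 1
    return counts


def profile(resident, counts, ages, min_visits=3, student_age=22):
    """Apply the rule: a regular school visitor below 22 is a student.

    min_visits is an assumed threshold for 'on a regular basis'.
    """
    regular = counts.get(resident, Counter())["school"] >= min_visits
    if not regular:
        return "unknown"
    return "student" if ages[resident] < student_age else "school staff"


counts = visit_counts(VISITS)
labels = {name: profile(name, counts, AGES) for name in AGES}
```

The rule combines both sources named in the text: the log supplies the visiting pattern and the attribute table supplies the age. Either source alone would leave the first two residents indistinguishable.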
It is therefore important that the derived profile is complete and consistent.

1.2 Profiling based on Information System event logs

The scenario described above can be translated to information systems. Today many of the activities occurring in processes are either supported or monitored by information systems. Many systems, for example ERP, WFM, CRM and PDM systems, support a wide variety of business processes while recording well-structured and detailed event logs. Whereas in some settings a clear process defines the workflow (e.g. in a manufacturing plant), in others it is not possible to define the process in advance (e.g. patient flow in a hospital). It is in situations with such unstructured processes that profiling based on event logs would be most relevant.

As in the city described above, the sequence of events for a particular case largely depends on the interests and attributes of the actors. If information about the actors is available, the event log and the attributes can be used together to discover interesting information about their characteristics based on behaviour. The case is different in situations where only the log exists (e.g. web-based logs). In such cases, profiling can only be based on the event logs. Two scenarios therefore exist for developing a complete profile: one in which actor attributes accompany the log, and one in which only the log is available.

Unlike the city in the metaphor, where the number of places to be visited is very large and practically unbounded, information systems typically have a finite number of places. Each place has an attribute stamp, which in essence is the characteristic that an actor needs in order to get to it. The transition from one place to another is an event in a sequence. A collection of sequences forms the process model that defines the system. These models are unstructured in that they do not follow a predefined process model. However, by process mining based on the logs it is possible to create a process model.
This can be enhanced by creating profiles, which are the rules that determine the navigation pattern and characteristics of the actors. A range of techniques, from basic statistics to machine learning techniques such as clustering, association rules and sequential mining, can be used to develop profiles. If, by definition, a complete profile is one that provides a clear picture of actors and their behaviour, a combination of the techniques would be necessary. Basic statistics could give a general overview of the most frequently occurring patterns and attributes. Clustering can be done based on known attributes of the actors, or based on patterns in the log if used in combination with sequential mining. Association rules can be used to develop rules that define a sequence of events. However, these profiles are still just sets of event logs. Process mining can be used to discover process models that describe the possible activities an actor can engage in.

As in the large city metaphor above, a complete profile would involve using the various techniques above in combination. For an unstructured event log, this involves identifying appropriate techniques that can derive meaning from the attributes of the cases involved, the sequence of events, clusters of cases and related association rules. A complete profile log can then be mined to discover the processes and the rules that define flow within the process. This can then provide a basis for behavioral analysis, discovering previously unknown attributes. Furthermore, it can be used to accurately predict the course of incomplete processes.

1.3 Statement of the Problem

Unstructured processes do not follow predetermined process models. The paths followed largely depend on the attributes of the interacting entities and on environmental factors. Event logs from such processes would provide useful information about the profiles of the entities involved. Process mining such logs produces spaghetti-like models from which it is hard to draw meaningful information.
Profiling would help in developing a better understanding of the underlying process models by:

1. Extracting meaningful process models from logs
2. Determining the rules that define the control flow
3. Determining the characteristics of the actors in the model

There are many knowledge discovery techniques that can be used to provide different aspects of the profile. It is important that the profile discovered is complete, in order to gain insight about the entities involved and predict a course of action for incomplete processes. Building a complete profile for this purpose requires a combination of the various techniques available. The research proposes to provide a model that uses a combination of techniques, by analyzing existing profiling techniques and customizing them for unstructured event log data.

2 Research Questions

The main research question in this project is “How can complete and highly accurate profiles be developed from event log data in order to discover the structure of underlying process models and the nature of the actors?” In order to achieve new knowledge in this area, the following sub-questions will be investigated.

1. What techniques can be used to extract process-related profiles based on event log data?
2. How can these techniques be deployed to develop a complete profile for unstructured processes?
3. How can mining the log data based on these profiles be used to develop process models, in order to discover structure out of the unstructured processes?
4. What interpretation or meaning can be attributed to observed behavior in the profiles?

3 Research Motivation

The purpose of this research project is to develop a methodology for profiling based on event log data and known case attributes. It seeks to develop a deeper understanding of the subject area with a view to establishing the capabilities and limitations of the available statistical and machine learning algorithms.
The knowledge developed from this research should give a better understanding of processes mined from unstructured event logs. Furthermore, it should be possible to determine which profile suits a case based on some characteristics, which can then be used for prediction. These concepts could be used in real-life situations to get a better understanding of the reasons for some observed behaviour in a system. Additionally, by predicting the possible sequence of a new case, runtime parameters can be optimally configured, thereby increasing efficiency.

4 Review of Related Literature

4.1 What is profiling?

Profiles based on data have been widely used in database design and development. www.bitpipe.com defines data profiling as “The use of analytical techniques about data for the purpose of developing a thorough knowledge of its content, structure and quality”. To carry out a data profiling project, it is important to define the objectives of the profiling effort. Appropriate data sampling techniques can then be used to draw data samples. These samples can then be used to identify a complete profile, which can form the basis of analysis.

Profiling based on event log data can be defined as the practice of tracking information about processes by monitoring their execution. This can be done by analyzing the case perspective, the process perspective and the resource perspective to assess their behaviour, predict certain characteristics and configure optimum runtime parameters. The case perspective refers to the attributes that change as the process is executing. The process perspective refers to the actual sequence of events that are followed. The resource perspective refers to the resources that are utilised in executing the process.

4.2 Event Log Based Profiling

Event logs typically contain a lot of information about the process execution, the utilization of resources and the cases being executed.
This information is in the form of timestamps, transactional information, information on users, data attributes, etc. Log-based profiles have been developed for unstructured processes such as computer forensics and web usage. The basic goal is to develop information about the user's behavior, and the profile can include known attributes about the user. Profiles can be developed for individual users or user groups. Often, machine learning techniques like clustering, association mining and sequential mining are used to create user profiles. Simple statistical analysis based on frequency of occurrences can also provide useful information. Process mining can be used to discover process models for a particular user or for groups of users with similar characteristics.

4.3 Profiling Techniques

To develop profiles, data mining techniques are commonly used. Tamas (2006) notes that this approach is used in web log data mining in order to market content and services tailored to an individual on the basis of knowledge about their preferences and behaviour. Techniques commonly used include association rule mining and clustering. A newer technique that is gaining ground for event log data is sequence mining.

According to Fayyad et al. (1996), data mining techniques can be categorised as prediction methods and description methods. Prediction methods use some variables to predict unknown or future values of other variables. Examples of prediction methods include classification, regression and deviation detection. Description methods find human-interpretable patterns that describe data. Examples of description methods include clustering, association rule discovery and sequential pattern discovery. The techniques used in profiling are explored in the sections that follow.

4.3.1 Classification

Classification techniques extract classes from a collection of records based on a set of attributes (Tan, Steinbach and Kumar, 2004).
The goal is to find a model for the class attribute as a function of the values of the other attributes. Previously unseen records should be assigned a class as accurately as possible. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it (Tan, Steinbach and Kumar, 2004).

4.3.2 Association Rule Mining

Association rule mining produces dependency rules that predict the occurrence of an item based on occurrences of other items in a collection. Association rules can be used to create a system profile by considering the most frequently occurring behaviour as normal (R. Vaarandi 2003). Association rule algorithms are used to detect relationships between event types, and can build a rule set that describes the behaviour of the data within a level of confidence. Such information can be obtained from log files. For example, an association rule algorithm may provide the rule “if events of type A and B occur within 5 seconds, they will be followed by an event of type C within 60 seconds”. Association rules have been used for forensic purposes based on log data.

4.3.3 Cluster Analysis

Clustering divides a set of data into meaningful or useful groups. R. Vaarandi (2003) describes clustering as a technique used to group objects into clusters of similar objects based on some patterns. Cluster analysis can be used in isolation or as a part of several other techniques, depending on the goal (Tan, Steinbach and Kumar). If the goal is to create meaningful groups of data, then this technique can be used in isolation. In other cases, it provides a useful starting point for data summarisation.

The basic concepts of clustering are widely used to describe groups of objects that have common characteristics. These techniques are widely applied in biology, climate, psychology and medicine, business and information retrieval (Tan, Steinbach and Kumar).
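As an illustration of clustering applied to event log cases, the sketch below represents each case by how often each activity occurs in it and groups the cases with a small k-means loop. It is a minimal sketch under stated assumptions, not the method proposed in this study: the hospital-style log, the activity names, the choice of activity counts as features and the fixed seed cases are all hypothetical.

```python
from collections import Counter

# Hypothetical event log: case id -> ordered list of activity names.
LOG = {
    "c1": ["register", "triage", "xray", "discharge"],
    "c2": ["register", "triage", "xray", "xray", "discharge"],
    "c3": ["register", "lab", "lab", "consult"],
    "c4": ["register", "lab", "consult", "consult"],
}


def to_vectors(log):
    """Turn each trace into a vector of activity counts (assumed features)."""
    activities = sorted({a for trace in log.values() for a in trace})
    vectors = {}
    for case, trace in log.items():
        counts = Counter(trace)
        vectors[case] = [counts[a] for a in activities]
    return vectors


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def kmeans(vectors, seeds, iterations=10):
    """Plain k-means; the seed cases give deterministic initial centroids."""
    centroids = [list(vectors[s]) for s in seeds]
    assignment = {}
    for _ in range(iterations):
        # Assign every case to its nearest centroid.
        assignment = {
            case: min(range(len(centroids)),
                      key=lambda i: sq_dist(vec, centroids[i]))
            for case, vec in vectors.items()
        }
        # Move each centroid to the mean of its members.
        for i in range(len(centroids)):
            members = [vectors[c] for c, k in assignment.items() if k == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


clusters = kmeans(to_vectors(LOG), seeds=["c1", "c3"])
```

Counting activities ignores their order, so two cases with the same activities in a different sequence look identical here. Capturing order would need sequence-based features.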
Clustering techniques abstract properties of the data being analysed. It is therefore important to make sure the techniques provide the most representative prototypes. The uses of these prototypes can be classified as summarisation, compression and efficiently finding the nearest neighbour (Tan, Steinbach and Kumar). Summarisation applies an analysis to a representative sample of the data, because many analysis techniques are not practical for large data sets. Compression techniques are sometimes known as vector quantisation, because each object is represented by an entry in a table of prototypes. Finding nearest neighbours can require calculating the pairwise distance between all points; working with cluster prototypes reduces the number of distances that must be computed.

4.3.4 Sequence Pattern Mining

Given a set of objects, each associated with its own timeline of events, sequential pattern discovery finds rules that predict strong sequential dependencies among different events (Tan, Steinbach and Kumar). Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

4.4 Applications of Log Data Profiling Techniques

4.4.1 Computer Forensics

A clearly identified line pattern can also be included in the final profile of the system. These techniques can be used to detect anomalies by creating clusters of anomalies. Clustering can be used to create system profiles so that anomalies in a process can be detected. Clustering techniques divide a data set into groups, each having similar characteristics. This can be used as a precursor to association rule mining to detect relationships between event types.

4.4.2 Web Services

Many techniques have been developed for mining web log data in order to derive some meaning. Chi (2002) illustrates how association rule mining can be used to enable an understanding of user goals when navigating the web. To develop complete and accurate profiles, an understanding of customer web browsing is necessary.
Chi (2002) proposes a method that infers major groupings of web traffic through association rule mining. Shah, Joshi and Wurman (2002) use data mining to understand the auction process by exploring common bidding patterns. Through this they propose new bidding engagements and rules for classifying strategies. Furthermore, they seek to suggest economic motivations for such behaviour.

To develop profiles, data mining techniques have been used with modifications that enhance their relevance to the web domain. Ypma and Heskes (2002) use Markov models to model the click streams of web surfers. They use prior knowledge and various Markov modelling techniques to obtain web page categorizations based on weblogs. Hay, Wets and Vanhoof present a new algorithm called the Multidimensional Sequence Alignment Method (MDSAM). MDSAM is used for mining navigation patterns on a web site. It examines sequences composed of several information types, such as visited pages and the time spent on pages. It identifies profiles showing the visited pages, the time spent on them and the order in which pages are visited on a website.

Yang, Parthasarathy and Reddy (2002) provide an approach based on association rule mining. Their algorithm discovers association rules that are constrained (and ordered) temporally. This is based on the premise that pages accessed recently have a greater influence on pages that will be accessed in the near future. Oyanagi, Kubota and Nakase explore issues in sequence pattern mining of web data. The Apriori algorithm suffers from inherent difficulties in finding long sequential patterns and in finding interesting patterns among a huge number of results. They propose a new method for finding sequence patterns by matrix clustering.

4.5 Application in Process Mining

Process mining is the extraction of non-trivial and useful information from event logs recorded by information systems.
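A simple example of information that can be extracted from an event log is the directly-follows relation: how often one activity is immediately followed by another within a case. The sketch below counts it for a small log. It is an illustration only, assuming a hypothetical log with activities named a to d; it is not the algorithm of any particular PROM plug-in.

```python
from collections import Counter

# Hypothetical event log: each trace is the ordered activities of one case.
LOG = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["a", "b", "c", "d"],
]


def directly_follows(traces):
    """Count how often activity src is immediately followed by dst."""
    counts = Counter()
    for trace in traces:
        # Pair each event with its successor in the same trace.
        for src, dst in zip(trace, trace[1:]):
            counts[(src, dst)] += 1
    return counts


dfg = directly_follows(LOG)
```

Drawing every pair in `dfg` as an arc gives a graph of the observed behaviour. In this log both (b, c) and (c, b) occur, and with many such variants the graph quickly becomes dense.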
Many techniques have been developed to automatically discover a process model based on some event log. For unstructured processes, the models produced are spaghetti-like and difficult to read (Medeiros et al, 2008). Clustering techniques have been developed within the PROM framework to develop process models based on coherent sets of cases.

A number of data mining techniques have been implemented in process mining. The PROM framework provides a variety of plug-ins that can be used to develop profiles. These plug-ins can be classified as relating to either mining or analysis of event logs.

4.5.1 Mining Plug-ins

- Social network miner
- Heuristics miner
- Fuzzy miner
- Association rule miner

4.5.2 Analysis Plug-ins

- LTL checker
- Log summary
- Performance sequence diagram analysis
- Log clustering
- Basic log statistics

5 Methodology

A study of data and process mining algorithms is required to address the research questions and their associated knowledge areas. The proposed research will take place in three stages: preparation, tool development and method verification, and development of profiling guidelines. The first stage will review the literature and select suitable techniques. The second stage will develop a tool deployed within the PROM framework to validate and refine the methods developed, based on real world data. The final stage will develop guidelines for the interpretation of profile information derived from unstructured processes. The stages are described below.

5.1 Preparation Stage

The preparation stage will involve a review of relevant literature on data and process mining. This stage will review various techniques in order to determine the most appropriate techniques for profiling based on event log data. It will also seek to determine how some techniques can be modified to suit unstructured event log data.
The result will be the development of methods and mechanisms in which a combination of these algorithms can be used to develop complete and accurate profiles based on event log data.

5.2 Tool Development and Method Testing

This stage will develop one or more tools that implement the above methods within the PROM framework. In order to validate the methods developed, experiments based on real world data will be run and the generated profiles studied. The profiles will be compared to the data to determine their accuracy and completeness.

5.3 Development of Guidelines

No single method can be expected to work for all cases, so the methods developed will address the general case. This stage will therefore seek to develop a list of guidelines that can be used to accurately develop profiles for the possible scenarios.

5.4 Results

The proposed research includes aspects of an academic discipline and a professional field. The academic discipline will be influenced by inputs from the IS research teams, while the professional field will be influenced by industry-related information systems. Similarly, the results of the research should provide opportunities for application to real problems and be usable by professionals in the field. This will be strengthened by the development of methods, modified algorithms and guidelines that can be used to develop profiles.

The results of this research are intended to contribute to knowledge in information systems, and more specifically to developing profiles using data and process mining techniques. In applying these concepts to process mining, the results will enable a better understanding of process models and the ability to accurately predict events yet to occur.

6 References

Please note that referencing is yet to be done properly.

1. http://en.wikipedia.org/wiki/Profiling
2. http://www.bitpipe.com/tlist/Data-Profiling.html
3. http://en.wikipedia.org/wiki/Data_profiling
4. http://www.tdwi.org/Publications/WhatWorks/display.aspx?id=7312
5. Fayyad et al., Advances in Knowledge Discovery and Data Mining, 1996.
6. R. Vaarandi, A Data Clustering Algorithm for Mining Patterns From Event Logs, 2003.
7. Tan, Steinbach and Kumar, Introduction to Data Mining (book).
8. Brij et al., Web site of the WebKDD 2002 Workshop, http://db.cs.ualberta.ca/webkdd02/.
9. Tamas Abraham, Event Sequence Mining to Develop Profiles for Computer Forensic Investigation Purposes, Australian Computer Society, 2006.
10. Shigeru Oyanagi, Kazuto Kubota and Akihiko Nakase, Mining WWW Access Sequence by Matrix Clustering, WEBKDD 2002 - Mining Web Data for Discovering Usage Patterns and Profiles, 2002.
11. Alexander Ypma and Tom Heskes, Automatic Categorization of Web Pages and User Clustering with Mixtures of Hidden Markov Models.
12. Birgit Hay, Geert Wets and Koen Vanhoof, Web Usage Mining by Means of Multidimensional Sequence Alignment Methods.
13. Hui Yang and Srinivasan Parthasarathy, On the Use of Constrained Associations for Web Log Mining.
14. Harshit S. Shah, Neeraj R. Joshi, Ashish Sureka and Peter R. Wurman, Mining eBay: Bidding Strategies and Shill Detection.
15. Ed H. Chi, Adam Rosien and Jeffrey Heer, LumberJack: Intelligent Discovery and Analysis of Web User Traffic Composition.