
Spatio-Temporal Data Mining with Event Logs from High Volume Logistics Information

Rodrigo Miguel Tavares Gonçalves

Thesis to obtain the Master of Science Degree in Mechanical Engineering

Supervisors: Prof. João Miguel da Costa Sousa, Prof. Rui Jorge de Almeida e Santos Nogueira

Examination Committee
Chairperson: Prof. João Rogério Caldas Pinto
Supervisor: Prof. João Miguel da Costa Sousa
Member of the Committee: Prof. Carlos Baptista Cardeira

September 2015

Acknowledgments

I would like to express my sincere gratitude to my advisor Prof. João Sousa for the continuous support during my MSc study and related research. His guidance helped me throughout the research and the writing of this thesis. Besides my advisor, I would like to thank Prof. Rui Jorge de Almeida for his insightful comments and encouragement, and also for the hard questions that encouraged me to widen my research from various perspectives. This work was performed as part of the DAIPEX project, grant funded by Dinalog.

Abstract

In logistics, software aids for transportation planning and scheduling are often based on approximations and abstractions that do not take into account real-world data. The aim of this work is to provide an analysis and methodology, based on real-world data, on how to obtain probability density functions for the prediction of activity durations. Such information can be used in planning algorithms, such as those for the vehicle routing problem, that are capable of dealing with stochastic time-windows. Given a large spatio-temporal database of events, where each event consists of the fields event ID, time, location, and event type, the aim is to extract valuable information about activity durations. The process is not straightforward, since the log is human-influenced, which creates uncertainty about the time at which the events are logged.
In order to overcome this, a novel framework is proposed: it uses the spatio-temporal trajectories to identify regions of interest (ROIs) based on speed, and builds an ROI activity time-line using the activities extracted from event logs. The framework's ability to estimate activity durations was tested in three different environments: Amsterdam Airport Schiphol, the Port of Rotterdam and a single-vehicle scenario. Experimental results validate the usefulness of the approach at finding probability density functions for the prediction of activity durations at specific locations. Furthermore, it provides a detailed activity time-line for the trajectory, or for a specific portion of it such as a region of interest, which allows close monitoring of the vehicle's workflow.

Keywords: Spatio-Temporal, Event Logs, Logistics, Event Mining, Trajectories.

Contents

Acknowledgments
Abstract
List of Tables
List of Figures
Nomenclature
1 Introduction
  1.1 Sequential Pattern Mining
    1.1.1 AprioriAll Algorithm
    1.1.2 Generalized Sequential Patterns Algorithm
    1.1.3 SPADE Algorithm
    1.1.4 PrefixSpan Algorithm
  1.2 Geographical distance
    1.2.1 Haversine formula
  1.3 Clustering Methods
    1.3.1 Hierarchical Clustering
  1.4 Contributions
  1.5 Outline
2 Spatio-Temporal Event Log Mining
  2.1 Identify Regions of Interest
  2.2 Identify Activities
  2.3 Characterize Activities
3 Analysis of a Logistics Database
  3.1 Spatio-Temporal Data Analysis
  3.2 Event Log Analysis
    3.2.1 Sequential Pattern Mining in Event Logs
4 Processing High Volume Logistics Data
  4.1 Spatio-Temporal Data - Trajectories
    4.1.1 Regions of Interest
  4.2 Event Logs
  4.3 Activity Duration Estimation
    4.3.1 Single Activity
    4.3.2 Multiple Activities
  4.4 Clustering Locations
5 Real World Examples
  5.1 Amsterdam Airport Schiphol
  5.2 Port of Rotterdam
  5.3 International Truck
6 Discussion
  6.1 Conclusions
  6.2 Future Work
References
A Event List

List of Tables

1.1 Difference between databases
3.1 Data point arrangement in the database
3.2 Load and Unload activity related events
4.1 Example of the identification of activities from the event log data
4.2 Example of activity table
4.3 Customer table: the table contains the data from activities performed in the customer area

List of Figures

1.1 Great Circle: a great circle of a sphere is the intersection of the sphere and a plane which passes through its center point.
1.2 Spherical triangle: law of cosines
2.1 Activity duration estimation framework
2.2 Data acquired by a global positioning system
    (a) Trajectory database; (b) Trajectory graph
2.3 Framework to extract regions of interest
2.4 Activity time-line of a region of interest
2.5 Empty time: the time available in the neighbourhood of an activity in the ROI time-line
3.1 Data Files
3.2 Data Structure
3.3 Load and Unload event count
    (a) Load Events; (b) Unload Events
3.4 Other activity related events count
3.5 Log In & Log Out event count
3.6 Most frequent events in the database
3.7 Activity time distribution
4.1 Effect of the speed filter on an average speed profile of the international truck with = 6
4.2 Average speed profile of the international truck with = 6
4.3 Average Speed Histograms with α = 0.5
    (a) All trucks; (b) TruckID 1141; (c) TruckID 7234; (d) TruckID 1849
4.4 Split regions of interest (right figure) due to a low sbound
4.5 Choosing the correct value for the sbound parameter
    (a) Number of ROIs per truck against sbound; (b) Total time spent in ROIs per truck against sbound
4.6 An example of extracting regions of interest with T = 0
    (a) Regions of interest; (b) Trajectory graph
4.7 Examples of regions of interest for sbound = 15 km/h
    (a) Single activity; (b) Multiple activities; (c) Split ROI due to low sbound; (d) Effect of the human behaviour
4.8 Averaging latitude and longitude coordinates to obtain a single point representation of the activity location
4.9 ROI activity time-lines examples
    (a) Wrongly introduced activities; (b) Correctly introduced activities
4.10 Empty time: the time available in the neighbourhood of an activity in the ROI time-line
4.11 Different types of estimation cases
    (a) Single activity case; (b) Multiple activities case
4.12 Single activity situations: different cases when estimating the duration of a single activity
    (a) Activity is the only activity in the ROI; (b) Activity is the first activity in the ROI; (c) Activity is the last activity in the ROI; (d) Activity is in between other activities
4.13 Multiple activities situation: defining the interval for estimation
4.14 Legend: short activities, long activities and other activities
4.15 Hypothesis 1
4.16 Hypothesis 2
4.17 Hypothesis 3
4.18 Hypothesis 4
4.19 Mean locations of load activities in Schiphol Airport
4.20 Dendrogram for unload activities at Schiphol Airport
4.21 Load activities from Schiphol Airport clustered
    (a) Load Clusters at Schiphol Airport; (b) Load Clusters at Schiphol Airport Zoom 1; (c) Load Clusters at Schiphol Airport Zoom 2
5.1 Load activities from Schiphol Airport
    (a) Location of KML Cargo site cluster; (b) KML Cargo site satellite view and load locations
5.2 PDF of the original activity durations from KML Cargo site
    (a) Original load durations of KML Cargo site; (b) Original unload durations of KML Cargo site
5.3 PDF of the estimated activity durations from KML Cargo site using hypothesis 1
    (a) Estimated load durations; (b) Estimated unload durations
5.4 PDF of the estimated activity durations from KML Cargo site using hypothesis 2
    (a) Estimated load durations; (b) Estimated unload durations
5.5 PDF of the estimated activity durations from KML Cargo site using hypothesis 3
    (a) Estimated load durations; (b) Estimated unload durations
5.6 PDF of the estimated activity durations from KML Cargo site using hypothesis 4
    (a) Estimated load durations; (b) Estimated unload durations
5.7 PDF for the original service times from KML Cargo site
5.8 PDF of the estimated service times for KML Cargo
    (a) Estimated service times using hypothesis 1; (b) Estimated service times using hypothesis 2; (c) Estimated service times using hypothesis 3; (d) Estimated service times using hypothesis 4
5.9 Rotterdam Load & Unload Activities
5.10 Activities Locations at the port of Rotterdam
    (a) Load locations at the port of Rotterdam; (b) Unload locations at the port of Rotterdam
5.11 PDF for the activities original durations at the port of Rotterdam
    (a) Original load durations; (b) Original unload durations
5.12 PDF for the estimated activity durations using hypothesis 1, 2, 3 and 4
    (a) Estimated load durations; (b) Estimated unload durations
5.13 International Truck activity time-line for all the event log
5.14 Productivity of the international truck
A.1 Link Speeds and Classifications with α = 0.5
    (a) TID 1141 Link Speeds; (b) TID 1141 Link Classification; (c) TID 1849 Link Speeds; (d) TID 1849 Link Classification; (e) TID 7234 Link Speeds; (f) TID 7234 Link Classification

List of Equations

1.1 The Law of Haversines
1.2 Haversine Function
1.3 Haversine of a Central Angle
1.4 Haversine Formula
2.1 Link Classification
2.2 Set of Activities
2.3 Set of Events
2.4 Assigning activities to ROIs
3.1 Cardinality of Load Events
3.2 Cardinality of Activities
4.1 Average Speed
4.2 Speed Filter
4.3 Hypothesis 1
4.4 Hypothesis 2
4.5 Hypothesis 4

Nomenclature

s̄  Average speed in a trajectory link
x  Trajectory link length in km
User-defined parameter: speed filter in seconds
Longitude coordinate in degrees
Latitude coordinate in degrees
A  Set of activities present in the event log
a  An activity present in the event log
A*  Set of activities present in the event log whose duration is to be estimated
C  Set of events that identify a special occurrence of an activity
c  Event that identifies a special occurrence of an activity
e  An event from the event log
F  Set of events that indicate the end of an activity
f  Event that indicates the end of an activity
l  Trajectory link
m  Number of data-points in an activity
N  Number of data-points in a trajectory
p  Data-point from a trajectory
R  Candidate region of interest
r  Earth's radius in km
S  Set of events that indicate the beginning of an activity
s  Event that indicates the beginning of an activity
sbound  User-defined parameter for link classification
T  User-defined parameter: minimum length of stay inside an ROI in minutes
t  Log date of a data-point

Chapter 1 Introduction

Transportation companies often find that their day-to-day transportation execution does not conform to the
transportation plan that they made in advance. To a large extent this is caused by the fact that the software that aids in the creation of transportation plans does not take into account the real-world complexity of transportation and logistics [1]. Rather, it uses approximations and abstractions that do not do justice to that complexity. As a consequence, the transportation plans generated by transportation planning software often lead to violated time windows [2], unnecessary delays, underutilized transportation capacity, etc. The real-world complexity of transportation planning is caused by the high level of detail that is required to get executable plans, the size of the instances found in reality, and the large volumes of data that must be collected and processed to gather the information required to create the plan [3].

The aim of this work is to provide an analysis and methodology on how to estimate the duration of process-related activities, such as load and unload, based on spatio-temporal data and event logs with uncertainty related to human behaviour. Further, the acquired information is categorized according to location to obtain probability density functions of activity durations at specific locations. Such information can be used in software applications for transportation planning, such as those for vehicle routing problems that can handle stochastic time-windows.

Currently, mobile communications and positioning systems are well-established technologies. GPS-equipped devices are able to provide valuable spatio-temporal data with increasingly fine spatial resolution. The use of GPS-enabled devices allows us to describe the movement of an object (i.e. its trajectory) as a sequence of spatial locations sampled at consecutive time-stamps.
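As a minimal sketch of this representation, and not the implementation used in this work, a trajectory can be held as a time-ordered list of data-points, from which the average speed of each link (pair of consecutive points) follows. The names `DataPoint`, `haversine_km` and `link_speeds_kmh`, and the Earth radius of 6371 km, are assumptions introduced for illustration; the great-circle distance anticipates the haversine formula of Section 1.2.1.

```python
from dataclasses import dataclass
from datetime import datetime
from math import asin, cos, radians, sin, sqrt
from typing import List

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (r in the nomenclature)


@dataclass(frozen=True)
class DataPoint:
    """One sample of a trajectory: log date plus position in degrees."""
    t: datetime
    lat: float
    lon: float


def haversine_km(p: DataPoint, q: DataPoint) -> float:
    """Great-circle distance between two data-points, in km."""
    phi1, phi2 = radians(p.lat), radians(q.lat)
    dphi = phi2 - phi1
    dlam = radians(q.lon - p.lon)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def link_speeds_kmh(trajectory: List[DataPoint]) -> List[float]:
    """Average speed of each link between consecutive data-points."""
    speeds = []
    for p, q in zip(trajectory, trajectory[1:]):
        hours = (q.t - p.t).total_seconds() / 3600
        if hours <= 0:
            continue  # skip duplicated or out-of-order time-stamps
        speeds.append(haversine_km(p, q) / hours)
    return speeds


# Hypothetical two-point trajectory: one degree of longitude along the
# equator, covered in exactly one hour.
trajectory = [
    DataPoint(datetime(2015, 1, 1, 8, 0), 0.0, 0.0),
    DataPoint(datetime(2015, 1, 1, 9, 0), 0.0, 1.0),
]
speeds = link_speeds_kmh(trajectory)
```

Links with a non-positive time difference are skipped here; this is a design choice of the sketch, not a rule stated in the text.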
Spatio-temporal patterns in trajectories, which represent movement patterns of objects, can provide useful information for high-quality location-based services, such as traffic flow control, location-aware advertising, etc. [4], [5], [6]. In addition, many operating systems, software frameworks, and programs include a logging system. Event logs are able to record events taking place in the execution of a system. This provides a chronological record of a sequence of activities that is essential to gather information about complex systems. The problem is that, in many cases, events are introduced by humans, the system users, which leads to uncertainty in the data. When system logs are human-dependent, it is not assured that event records are consistent with reality. For example, if an activity is characterized by a start and an end event, its duration can be calculated as the time difference between these events. However, if the events are recorded before or after the actual occurrences, this no longer holds true.

In order to overcome this problem, a new algorithm that combines spatio-temporal and event log databases is proposed. The goal is to develop a processing mechanism that can efficiently aggregate information from high-volume business data streams to provide up-to-date management information for operational business decision makers, in the form of time distributions of expected durations for load and unload activities at specific locations. The algorithm will be tested in harbour and airport services with heavily fluctuating, unpredictable behaviour and in road transport processes with high volumes of real-time data.

1.1 Sequential Pattern Mining

Data mining techniques, also known as knowledge discovery tools in databases, are used in order to find valid, novel, potentially useful and understandable patterns in data [7].
Sequential pattern mining, as a sub-field of data mining, is concerned with finding statistically relevant patterns, in the form of frequent sub-sequences, among data examples whose values are delivered in a sequence. It is strongly motivated by its utility as a tool to obtain knowledge from customer purchase databases [8], DNA sequences [9], web logs [10], event logs and medical time series [11]. Sequences are common, occurring in any metric space that facilitates either total or partial ordering [12]. Acquiring knowledge about them is an important data mining research problem with broad applications, since the detection of frequent (totally or partially ordered) sub-sequences can be extremely useful to support decision making. Sequential pattern mining has arisen as a technology to discover such sub-sequences.

Over the last years there has been a substantial increase in temporal, spatial and spatio-temporal data mining publications due to the continuous growth of this sub-field of data mining. The high volume of available data, mainly through the internet, and the prominent advantages that data mining analysis provides to the market are some of the principal reasons for its development. Since many application domains have a temporal or spatial context, time and space are components that must be taken into account in data mining processes. The following paragraphs briefly explain the basics of temporal, spatial and spatio-temporal data mining.

Temporal data mining is about the analysis of events ordered by one (or more) dimensions of time, and it can be approached in two different ways. On the one hand, it focuses on the discovery of causal relationships among temporally oriented events, which corresponds to the discovery of “rules” [13]. On the other hand, it is the discovery of similar patterns within the same time sequence, giving rise to the term “sequence pattern mining”.
Spatial data mining can be seen as the multi-dimensional equivalent of temporal data mining; however, due to the complexity of spatial data types, spatial relationships, and spatial autocorrelation, it turns out to be much more difficult to extract interesting and useful patterns from this type of dataset. The same problem occurs with spatio-temporal data mining, which requires explicit or implicit modelling of spatio-temporal autocorrelation and constraints [14].

Many algorithms have been proposed for mining sequential patterns. Broadly, these algorithms can be divided into two classes according to their method: Apriori-based candidate generation methods [15], [8] and pattern-growth methods [16], [17]. In the following sections, algorithms for both approaches are presented.

1.1.1 AprioriAll Algorithm

Sequential pattern mining was first introduced by Agrawal and Srikant [8] in 1995, over transaction databases (known as basket data). The aim was to find frequent item sets bought by customers in order to obtain typical behaviours according to the user's point of view. This would support the decision-making problem faced by most large retail organizations.
The sequential pattern mining problem was defined as follows: “Given a database of sequences, where each sequence consists of a list of transactions ordered by transaction time and each transaction is a set of items, sequential pattern mining is to discover all sequential patterns with a user-specified minimum support, where the support of a sequential pattern is the percentage of data-sequences that contain the pattern.” [8]

To find all sequential patterns, Agrawal and Srikant divided the mining problem into five phases:

(i) Sort phase - consists of creating customer-sequences from the original database by finding the transactions with the same transaction-id and ordering them according to the transaction time;
(ii) Large item set phase - the item sets with the user-specified minimum support, the litemsets, are found. The support of an item set i is defined as the fraction of customers who bought the items in i in a single transaction;
(iii) Transformation phase - each transaction in the customer-sequences is replaced by the set of all litemsets contained in that transaction;
(iv) Sequence phase - the actual mining phase, where the set of litemsets is used to find the desired sequences (the ones that satisfy the minimum support constraint, called large sequences);
(v) Maximal phase - finds the maximal sequences among the set of large sequences. A sequence s is said to be maximal if, in a set of sequences, s is not contained in any other sequence.

For the sequence phase, two families of algorithms were presented: count-all, where all the large sequences are taken into account, including the non-maximal sequences, and count-some. The AprioriAll algorithm, based on the Apriori algorithm for mining association rules [18], was shown to perform better than the other two approaches in [8]. It uses a breadth-first search strategy to count the support of itemsets and a candidate generation function which exploits the downward closure property of support.
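The notions of containment and support used in this definition can be illustrated with a short sketch. It is not the AprioriAll algorithm itself, only the support computation that its sequence phase relies on; the function names `contains`, `support` and `seq`, and the toy database of five customer-sequences, are assumptions introduced for illustration.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[int]
CustomerSequence = List[Itemset]


def contains(sequence: CustomerSequence, pattern: CustomerSequence) -> bool:
    """True if each element of `pattern` is a subset of a distinct
    transaction of `sequence`, with the order of elements preserved."""
    pos = 0
    for element in pattern:
        # Greedily advance to the earliest transaction holding this element.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(database: List[CustomerSequence], pattern: CustomerSequence) -> float:
    """Fraction of customer-sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in database) / len(database)


def seq(*transactions) -> CustomerSequence:
    """Helper to write a sequence of transactions compactly."""
    return [frozenset(t) for t in transactions]


# Toy database: one time-ordered list of transactions per customer.
database = [
    seq({30}, {90}),
    seq({10, 20}, {30}, {40, 60, 70}),
    seq({30, 50, 70}),
    seq({30}, {40, 70}, {90}),
    seq({90}),
]

s_30 = support(database, seq({30}))
s_30_90 = support(database, seq({30}, {90}))
s_30_4070 = support(database, seq({30}, {40, 70}))
s_90_30 = support(database, seq({90}, {30}))
```

In this toy database the pattern ⟨(30)(90)⟩ is contained in two of the five customer-sequences, while its sub-pattern ⟨(30)⟩ is contained in four. This is the downward closure property: a pattern can never have higher support than any of its sub-patterns, so candidates built on an infrequent sub-pattern can be discarded without counting them.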
However, the problem definition presented above had some limitations. The absence of time constraints and taxonomies, together with the rigid definition of a transaction, makes these algorithms unsuitable for broader problems.

1.1.2 Generalized Sequential Patterns Algorithm

In 1996, Srikant and Agrawal [15] generalized the problem of sequential pattern mining. Time constraints that specify a minimum and/or maximum time period between adjacent elements in a pattern were added and, to overcome the rigid definition of a transaction, the restriction that the items in an element of a sequential pattern must come from the same transaction was relaxed by allowing the items to be present in a set of transactions whose transaction-times are within a user-specified time window. Hierarchy was also introduced: given a user-defined taxonomy on items, sequential patterns are allowed to include items across all levels of the taxonomy. Since many applications require all patterns and their supports, the count-some algorithms from [8], which find only maximal sequential patterns, were abandoned. Although it was possible to extend the AprioriAll algorithm to handle time constraints and taxonomies, incorporating sliding windows was not feasible. Apart from that, the performance of the algorithm was poor, since it had to perform the data transformation on-the-fly during each pass while finding sequential patterns. The Generalized Sequential Patterns (GSP) algorithm was presented and the problem definition reformulated: “Given a database D of data-sequences, a taxonomy T, user-specified min-gap and max-gap time constraints and a user-specified sliding-window size, the problem of mining sequential patterns is to find all sequences whose support is greater than the user-specified minimum support.
Each sequence represents a sequential pattern, also called a frequent sequence.” [15]

The GSP algorithm, like AprioriAll, assumes a horizontal database layout, which means that the database is formed by a set of input-sequences. Each input-sequence has a set of events, along with the items contained in each event. GSP makes multiple passes over the data. The first pass determines the support of each item, that is, the number of data-sequences that include the item. The items that satisfy the minimum support are the frequent items. Each such item yields a 1-element frequent sequence, and on each subsequent pass the frequent sequences from the previous pass, the seed set, are used to generate new potentially frequent sequences, called candidate sequences, with one more item than the seed sequences. During the pass over the data, the algorithm computes the support of each candidate sequence and determines which of them are actually frequent. These frequent candidate sequences become the seed set for the next pass. When no frequent sequences are found or no candidate sequences are generated, the algorithm terminates. GSP is a complete algorithm in that it guarantees finding all sequences that have a user-specified minimum support. Furthermore, it is up to twenty times faster than the previously presented AprioriAll algorithm.

1.1.3 SPADE Algorithm

Using the GSP algorithm as a basis for comparison, Zaki developed SPADE, a new algorithm for fast mining of sequential patterns in large databases [19]. Both the GSP and AprioriAll algorithms require as many full database scans as the length of the longest frequent sequence, which is clearly a very expensive process. SPADE decomposes the original problem into smaller sub-problems using equivalence classes on frequent sequences, so that all sequences are discovered in only three database scans: one for frequent 1-sequences, another for frequent 2-sequences, and one more for generating all other frequent sequences.
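The level-wise scheme shared by AprioriAll and GSP, and the reason why it needs one database pass per pattern length, can be made concrete with a small sketch. It is a deliberately simplified illustration and not the GSP algorithm of [15]: every element is a single item, there are no time constraints, sliding windows or taxonomies, and the candidate pruning step is omitted. The names `is_subsequence` and `levelwise_mine`, and the toy database, are assumptions introduced here.

```python
from typing import Dict, List, Tuple

Seq = Tuple[str, ...]


def is_subsequence(pattern: Seq, sequence: Seq) -> bool:
    """True if the items of `pattern` appear in `sequence` in the same order."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def levelwise_mine(database: List[Seq], min_count: int) -> Tuple[Dict[Seq, int], int]:
    """Return the frequent sequences with their support counts and the
    number of passes made over the database."""
    items = sorted({x for s in database for x in s})
    candidates: List[Seq] = [(x,) for x in items]
    frequent: Dict[Seq, int] = {}
    passes = 0
    while candidates:
        passes += 1  # one full scan of the database per candidate length
        counts = {c: sum(is_subsequence(c, s) for s in database) for c in candidates}
        seeds = [c for c in candidates if counts[c] >= min_count]
        frequent.update({c: counts[c] for c in seeds})
        # Join step: a and b are joined when a without its first item
        # equals b without its last item, giving a one-item-longer candidate.
        candidates = sorted({a + (b[-1],) for a in seeds for b in seeds if a[1:] == b[:-1]})
    return frequent, passes


database = [
    ("a", "b", "c"),
    ("a", "c", "b", "c"),
    ("a", "b"),
    ("b", "c"),
]
frequent, passes = levelwise_mine(database, min_count=2)
```

With a minimum support of two data-sequences, the longest frequent sequence in this toy database is ⟨a b c⟩, and the sketch performs three passes to find it, one per pattern length. It is this dependence of the number of scans on the pattern length that SPADE avoids.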
In contrast to the previous algorithms, SPADE uses a vertical id-list database format. The sequence enumeration is done with a lattice-based approach. Each input-sequence in the database has a unique identifier called sid, and each event in a given input-sequence also has a unique identifier called eid. This allows the creation of an id-list for each item, where each entry is a (sid, eid) pair in which the item occurs; support can then be checked via simple id-list joins. Two different search strategies for enumerating the frequent sequences were used: breadth-first and depth-first search.

Given the vertical id-list database, the frequent 1-sequences can be computed in a single scan by incrementing the support for each new sid encountered in the id-list. For the second step of the algorithm, finding the frequent 2-sequences, a preprocessing step gathers all 2-sequences above a user-specified lower bound. Frequent sequences are then generated by joining the id-lists of all pairs of atoms (including a self-join) and checking the number of distinct sid values of the resulting id-list against the minimum support. The sequences found to be frequent at the current level form the atoms of classes for the next level. This recursive process is repeated until all frequent sequences have been enumerated.

Other Apriori-based algorithms were developed, however only with minor improvements in aspects such as performance and the handling of large search spaces. The SPAM algorithm [20], in the same fashion as SPADE, uses a vertical bitmap representation of the database combined with efficient support counting and pruning mechanisms. Yang introduced LAPIN-SPAM to overcome the ineffectiveness in handling long patterns, by using the last position of an item to judge whether a sequence can be extended.
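The id-list idea can be illustrated with a minimal sketch. It is not Zaki's implementation: the function names (`build_id_lists`, `temporal_join`, `support`) and the toy database are assumptions made for illustration, and only the temporal join that turns two frequent items into a 2-sequence is shown.

```python
from collections import defaultdict

def build_id_lists(database):
    """Vertical format: map each item to its id-list of (sid, eid) pairs."""
    id_lists = defaultdict(list)
    for sid, events in database.items():
        for eid, items in events:
            for item in items:
                id_lists[item].append((sid, eid))
    return id_lists

def support(id_list):
    """Support is the number of distinct input-sequences (sid values)."""
    return len({sid for sid, _ in id_list})

def temporal_join(first, second):
    """id-list of the 2-sequence 'first -> second'.

    Keeps the (sid, eid) entries of `second` that occur after some
    occurrence of `first` in the same input-sequence.
    """
    earliest = {}
    for sid, eid in first:
        if sid not in earliest or eid < earliest[sid]:
            earliest[sid] = eid
    return [(sid, eid) for sid, eid in second
            if sid in earliest and earliest[sid] < eid]

# Hypothetical toy database: sid -> list of (eid, set of items).
toy_db = {
    1: [(1, {"A", "C"}), (2, {"B"}), (3, {"D", "F"})],
    2: [(1, {"C"}), (2, {"F", "B"}), (3, {"A", "F"})],
    3: [(1, {"A", "B"}), (2, {"E"}), (3, {"C"})],
}
id_lists = build_id_lists(toy_db)
c_then_f = temporal_join(id_lists["C"], id_lists["F"])
print(support(id_lists["A"]), support(c_then_f))
```

The support of a candidate is obtained from the joined id-list alone, without rescanning the original database, which is the property the text attributes to the vertical format.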
1.1.4 PrefixSpan Algorithm

The previously presented sequential pattern mining methods follow a candidate generation-and-test approach based on the Apriori heuristic proposed in association mining [18]: “any super-pattern of a non-frequent pattern cannot be frequent.” Because of this, Apriori-based algorithms require multiple scans of the database. They also generate huge sets of candidates, especially 2-item candidates, which reduces the performance of the mining process and makes it unsuitable for mining long sequential patterns.

For frequent pattern mining, a frequent pattern growth method called FP-growth [22] was developed for efficient mining of frequent patterns without candidate generation. The method uses a data structure called FP-tree to store compressed frequent patterns of a transaction database and recursively mines the projected conditional FP-trees to achieve high performance.

In [17] a new approach to this problem was proposed. With the FreeSpan algorithm, the idea was to integrate the mining of frequent sequences with that of frequent patterns and to use projected sequence databases to confine the search and the growth of subsequence fragments. By scanning the database once, FreeSpan creates a frequent item list, based on the support, which is also the set of length-1 sequential patterns. The sequence database is then recursively projected into a set of smaller projected databases based on the current sequential pattern(s), and sequential patterns are grown in each projected database by exploring only locally frequent fragments.

In 2004, Pei et al. introduced an improved method, PrefixSpan (Prefix-projected Sequential pattern mining) [16]. Unlike FreeSpan, which creates projected databases based on the current set of frequent patterns without a particular ordering (i.e., growth direction), PrefixSpan projects databases by growing frequent prefixes.
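Prefix projection can be sketched in a few lines. The sketch below is a simplification, not the algorithm of [16]: it assumes that every element of a sequence is a single item (no itemsets), and the function name `prefixspan` and the example sequences are hypothetical.

```python
def prefixspan(sequences, min_support):
    """Return {pattern: support} for sequences whose elements are single items."""
    results = {}

    def mine(prefix, projected):
        # Count each item at most once per projected sequence.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item, count in sorted(counts.items()):
            if count < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Project: keep only what follows the first occurrence of the item.
            new_projected = [s[s.index(item) + 1:] for s in projected if item in s]
            mine(pattern, new_projected)

    mine((), [list(s) for s in sequences])
    return results

# Hypothetical event sequences, one per trajectory.
example = [list("ACDGF"), list("DFACB"), list("AFECG")]
patterns = prefixspan(example, min_support=3)
print(patterns)
```

No candidate sequences are generated: each recursive call only counts the items that actually occur in the projected database of the current prefix.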
Sequential pattern mining for logistic event logs

Despite the usefulness of the algorithms presented above, their applicability to broader problems, and to real-world data, is very limited. Firstly, all of them imply a priori knowledge of the sequence itself. Since they are designed to mine transaction databases, the algorithms do not tackle the problem of defining a proper sequence. In a transaction database, sequences are given by a list of transactions ordered by transaction time, where each transaction is a set of items bought by a specific customer. Each sequence is the purchase history of a customer.

(a) Transaction database
Customer ID     Customer Sequence
1               ⟨(AC), (B), (DF)⟩
2               ⟨(C), (FB), (AF)⟩
3               ⟨(AB), (E), (C)⟩

(b) Event log data
Trajectory ID   Trajectory Sequence
1               ⟨A, C, D, G, ..., F⟩
2               ⟨D, F, A, C, ..., B⟩
3               ⟨A, F, E, C, ..., G⟩

Table 1.1: Difference between databases

In logistics event logs, however, sequences are given by the full list of logged events from a trajectory. There are no sets of items, since events are not concurrent and they are logged continuously, in contrast with discrete transactions. This produces extremely long sequences, with thousands of events, which makes finding patterns almost impossible since the support is always very small. Sequences have to be narrowed to include only the portion of the trajectory that is intended to be studied. For instance, in a load and unload process, a sequence could be given by the events that occurred from 30 minutes before to 30 minutes after the activity. It could also be defined using only the events before, or only the events after, the process. Defining a useful sequence is not trivial and the results obtained are highly dependent on that choice. Hence, the work of this thesis will diverge from sequential pattern mining in the strict sense of the term.
Instead, from the event log data point of view, it will introduce a methodology to process and filter event logs in order to obtain sequences of a higher degree of granularity, with elements such as activities rather than events.

1.2 Geographical distance

When dealing with GPS-generated databases, determining the distance between locations is a crucial part of the analysis. Knowing the distance between two positions and the time taken to travel from one to the other, the average speed over that segment of the trajectory can be obtained. Knowledge of the average speed of a moving object makes it possible to determine whether the object is stopped or in motion. Global Positioning Systems are explicitly designed to store, handle, and retrieve spatially referenced data. In addition to basic Euclidean or straight-line distances, there is a need for more complex forms of analysis that incorporate a higher level of detail, such as accounting for the curvature of the earth. Depending on the nature of the data, the type of coordinates, the application and the required accuracy, there are several formulations for this calculation.

1.2.1 Haversine formula

Given the latitude and longitude of two points, it is possible to determine the shortest distance between them by using the Haversine formula [23]. This formula is mathematically equivalent to the spherical law of cosines, which uses spherical geometry to calculate the great-circle distance between two points on the globe. Nevertheless, it is often preferred since it is less sensitive to the round-off errors that can occur when measuring distances between points that are located very close together [24]. Instead, the error can occur for antipodal points (i.e. points that are on opposite sides of the earth). Although it is an accurate formulation, it does not take into account the ellipsoidal shape of the earth, treating it as a sphere of radius r.
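As a numerical illustration, the Haversine distance between two latitude/longitude pairs can be sketched as follows. The function name `haversine_km`, the use of degrees as input, and the mean earth radius of 6371 km are assumptions made for this example and are not taken from the thesis.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of the spherical earth model

def haversine_km(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    # haversin(x / r) for the two points.
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 so that rounding near antipodal points cannot break asin.
    return 2 * radius * math.asin(math.sqrt(min(1.0, h)))

# One degree of latitude along a meridian.
print(haversine_km(52.0, 4.0, 53.0, 4.0))
```

The clamp on `h` is a defensive choice for the antipodal case mentioned above, where floating-point rounding can push the argument of the square root slightly above one.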
Figure 1.1: Great Circle: a great circle of a sphere is the intersection of the sphere and a plane which passes through its center point.

The Haversine formula is a particular case of the law of haversines: given a unit sphere, a triangle on the surface of the sphere is defined by the great circles (see fig. 1.1) connecting three points u, v, and w on the sphere.

Figure 1.2: Spherical triangle: law of cosines

If the lengths of the sides that connect those points are a, b and c, and the angle of the corner opposite to c is C, the law of haversines states:

$$\operatorname{haversin}(c) = \operatorname{haversin}(a - b) + \sin(a)\sin(b)\operatorname{haversin}(C) \tag{1.1}$$

where haversin is the haversine function, given by:

$$\operatorname{haversin}(\theta) = \sin^2\left(\frac{\theta}{2}\right) = \frac{1 - \cos(\theta)}{2} \tag{1.2}$$

The lengths a, b and c are equal to the angles, in radians, subtended by those sides from the center of the sphere. In order to obtain the Haversine formula, used to calculate the shortest distance between two points, the point u is taken to be the north pole, while v and w are the two points whose distance x is to be determined. In such a case, given the latitudes $(\varphi_1, \varphi_2)$ and longitudes $(\lambda_1, \lambda_2)$ of the two points, a and b become $a = \frac{\pi}{2} - \varphi_1$ and $b = \frac{\pi}{2} - \varphi_2$, C is the longitude separation $\Delta\lambda = \lambda_2 - \lambda_1$, and $c = \frac{x}{r}$. The law of haversines becomes:

$$\operatorname{haversin}\left(\frac{x}{r}\right) = \operatorname{haversin}(\varphi_2 - \varphi_1) + \cos(\varphi_1)\cos(\varphi_2)\operatorname{haversin}(\lambda_2 - \lambda_1) \tag{1.3}$$

Note that x is the shortest distance between the two points along a great circle (fig. 1.1) and r is the radius of the sphere. By applying the inverse haversine to equation 1.3, it is possible to solve for the desired distance x:

$$x = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos(\varphi_1)\cos(\varphi_2)\sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right) \tag{1.4}$$

1.3 Clustering Methods

Another important data mining technique is data clustering. Clustering algorithms aim at dividing a set of objects into groups (clusters), where objects in each cluster are similar to each other (and as dissimilar as possible to objects from other clusters) [25].
Clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, marketing, medical diagnostics, computational biology, and many others [26]. In this work, clustering methods are used to group the locations of activities, which makes it possible to create probability density functions for activity durations according to their location.

The clustering problem is often defined as follows: given a set of points with n attributes in the data space $\mathbb{R}^n$, find a partition of the points into clusters so that points within each cluster are close (similar) to each other. In order to determine how close (similar) two points x and y are to each other, a distance function d(x, y) is employed.

There are several clustering methods, and they can be broadly divided into two classes: hierarchical clustering and objective function-based clustering. In the latter, as the name states, an objective function is needed and the data is partitioned by optimizing it. Such an objective function usually minimizes the distances to the cluster centers. One of the biggest drawbacks of this type of clustering is that the number of clusters has to be specified in advance. Hierarchical clustering, on the other hand, is a connectivity-based clustering method that does not need the number of clusters to be specified.

1.3.1 Hierarchical Clustering

The hierarchical clustering method is based on the idea that objects that are nearby are more related than those that are farther away. It builds a cluster hierarchy (i.e. a tree of clusters), also known as a dendrogram. Every cluster node contains child clusters. Such an approach allows exploring data at different levels of granularity.
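A minimal sketch of the bottom-up variant of this idea is given below. It assumes planar points, the Euclidean distance, and a merge rule based on the smallest distance between members of two clusters; the function name `single_linkage_clusters` and the cut-off parameter are hypothetical and chosen only for illustration.

```python
import math

def single_linkage_clusters(points, cut_distance):
    """Merge the two closest clusters until they are farther apart than the cut."""
    clusters = [[p] for p in points]  # start from singleton clusters

    def linkage(c1, c2):
        # Distance between clusters: smallest pairwise Euclidean distance.
        return min(math.dist(p, q) for p in c1 for q in c2)

    while len(clusters) > 1:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        distance, i, j = min(pairs)
        if distance > cut_distance:
            break  # equivalent to cutting the cluster tree at this height
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical activity locations in arbitrary planar units.
locations = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (50, 50)]
groups = single_linkage_clusters(locations, cut_distance=2.0)
print(groups)
```

The number of clusters is never given as input; it follows from the cut distance, which is the property that distinguishes this family of methods from objective function-based clustering. This naive version recomputes all pairwise distances at every merge and is only meant for small examples.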
Hierarchical clustering methods are categorized into agglomerative, which start with one-point (singleton) clusters and recursively merge the two or more most appropriate clusters, and divisive, which start with one cluster containing all data points and recursively split the most appropriate cluster. In order to decide which clusters should be merged (for agglomerative), or where a cluster should be divided (for divisive), a measure of similarity between sets of objects is required. In most methods of hierarchical clustering, this is achieved by using a metric (a measure of distance between pairs of objects), such as the Euclidean distance, and a linkage criterion, which specifies the similarity of sets as a function of the pairwise distances of objects in the sets. The most common linkage criteria are: complete-linkage, which uses the maximal distance; single-linkage, which uses the minimal distance; and average-linkage, which uses the mean of the distances. After linking all the objects in the data set into a hierarchical cluster tree, the tree is pruned at a user-specified value. All branches at or below each cut are grouped into a single cluster.

1.4 Contributions

In this thesis, an efficient data mining method has been developed for high volume spatio-temporal data with event logs. The focus is on addressing the aforementioned challenges of dealing with the uncertainty of human-dependent event logs. The thesis brings together the following contributions:

• The proposal of a framework to overcome the problem of estimating the duration of relevant activities from human-dependent event log data. The central piece of the framework is the use of a spatio-temporal time-window, as a time constraint, to build an activity time-line that allows a proper estimation of activity durations.
The idea is based on the correlation between time and space;
• A novel algorithm called Spatio-Temporal Event Log Miner to build activity time-lines of regions of interest (ROIs) from logistics spatio-temporal datasets with event logs. The proposed algorithm tackles the problem of extracting regions of interest from trajectory datasets based on the average speed of the object being tracked;
• An approach to avoid misleading calculations of the average speed of an object when it is obtained by the time-differential method. Trajectory databases have an associated positioning error, inherent to global positioning systems, that is amplified through differentiation;
• A higher level of detail for software aids in the creation of transportation plans, obtained by using real-world data-driven methodologies on how to filter event logs for activity extraction and on which assumptions to make in order to achieve accurate duration estimations based on the notion of activity time-lines;
• A complete methodology to obtain probability density functions of activity durations based on their location from real-world logistic data. Such information is extremely useful for planning algorithms, like the vehicle routing problem, that are capable of dealing with stochastic time-windows.

1.5 Outline

The remainder of the thesis is organized as follows. In chapter 2, the overall framework for mining spatio-temporal data with associated event logs is described and formalized. The task of identifying regions of interest based on speed is addressed, along with the definition of a set of activities from event logs. The concept of activity time-lines is introduced and the activity duration estimation process is outlined. A spatio-temporal event log (STEL) mining algorithm is presented. The database used to test the STEL algorithm is analysed in chapter 3. In order to define the set of activities, events from the event log are divided into classes and categorized according to the related occurrences.
The set of activities and the corresponding events are identified. Spatio-temporal data is briefly introduced by presenting the locations that are going to be mined. Chapter 4 describes the various steps carried out in developing the data model and its implementation on a logistics database. In section 4.1 the average speed of trucks is calculated and regions of interest are identified. Activities are extracted from the event logs and classified in section 4.2. In section 4.3 the duration of the activities is estimated based on the assumptions and hypotheses formulated. Finally, in section 4.4 the information is clustered to enable a frequency analysis at each location. The utility of the algorithm is shown in chapter 5, where the results are presented. Probability density functions for the duration of load/unload activities and service times are obtained for two different logistic environments: the Amsterdam Airport Schiphol and the Port of Rotterdam. Results for a single truck analysis are also shown, with the obtained activity time-lines. Lastly, conclusions about the effectiveness of the algorithm in mining both fleets and single trucks are discussed in chapter 6, along with possible future work to improve the STEL algorithm.

Chapter 2

Spatio-Temporal Event Log Mining

Classical data mining techniques often perform poorly when applied to spatio-temporal data sets [27]. Such data sets are embedded in continuous space, whereas classical datasets, like transactions, are discrete. In addition, patterns are often local, while classical data mining techniques normally focus on global patterns. The large number of events to mine is also a drawback for such approaches. For these reasons, a new algorithm is proposed. The spatio-temporal event log mining (STEL) algorithm is designed to mine spatio-temporal databases in conjunction with event logs.
As information systems become more intertwined with the operational processes they support, multitudes of events are recorded by today’s information systems, providing detailed information about the history of processes [28]. Despite the omnipresence of event data, most organizations still diagnose problems based on fiction, through approximations and abstractions, rather than facts. Hence, the goal of event mining is to use such event data to extract process-related information so that the task of planning and scheduling becomes more accurate and reliable [29].

More specifically, STEL is meant to estimate the duration of process activities that are logged using human-based event logs. As stated previously, human behaviour is subject to mistakes. Such human errors create uncertainty in the event log, since events are not guaranteed to be logged consistently with reality [30]. In other words, users are able to log events before, or after, the corresponding occurrences actually happen, which makes the estimated duration of the activities differ from reality.

The spatio-temporal data, in this work, comes in the form of a trajectory database and serves two main purposes: dealing with the uncertainty related to the time at which the events are recorded in the event log, and categorizing the extracted knowledge according to the location where the activities took place. This makes it possible to have a priori information about expected durations for activities that are performed at certain locations. Such information can be extremely useful in planning and scheduling applications for logistics-related software. It provides estimates based on real-world data rather than on approximations and abstractions that do not take into account the complexity of transportation problems.
The framework for the STEL algorithm is shown in figure 2.1 and can be described in three main steps: identify regions of interest, identify the performed activities, and merge the information acquired from both analyses to create an activity time-line for each region of interest, so that it is possible to estimate activity durations. The steps of the algorithm are further described in the following sections.

Identify regions of interest - the spatio-temporal data of the trajectory database provides sampled positions of the object being tracked. The distance between two consecutive positions is calculated and, using the time difference between acquisitions, the average speed of the object between those positions is obtained. Regions of interest can then be identified based on the average speed of the object, creating a time-window that defines the boundary for the activity duration estimation;

Identify activities - the event log data provides a sequence of events that were logged during the trajectory. Such events can be of many types, from text messages and warnings to activity-related events. Hence, activities must be extracted from the event log by analysing the set of activity-related events;

Create activity time-lines - once the activities and regions of interest of the trajectory are known, each activity is assigned to the corresponding ROI using the log times. Once all activities are assigned, regions of interest can be described by their activity time-line, which contains all the activities that took place. The duration estimation of the activities is done based on the activity time-lines.

Figure 2.1: Activity duration estimation framework

2.1 Identify Regions of Interest

Using the spatio-temporal data, regions of interest are identified based on the average speed of moving objects. Despite the term “region”, its significance is in terms of time rather than the location itself. Each ROI is characterized by a start date, an end date and a corresponding duration.
They can be seen as time portions of the trajectories where the object being tracked was standing still, or moving below a certain speed.

From the user point of view, the concept of trajectory is based on the evolving position (perceived as a point) of an object travelling, in some space, during a given time interval. Thus, a trajectory is by definition a spatio-temporal concept. A GPS trajectory can be formally defined as in [31], [32], [33]:

Definition 1. A trajectory is a finite sequence of space-time points $\langle p_0, p_1, \ldots, p_N \rangle$, where $p_i = (\varphi_i, \lambda_i, t_i)$ and N is the total number of data-points in a trajectory. The $(\varphi_i, \lambda_i) \in \mathbb{R}^2$ are spatial coordinates, and the $t_i \in \mathbb{R}^+$ are timestamps, with $t_i < t_{i+1}$ for $i = 0, 1, \ldots, N-1$.

Each $(\varphi_i, \lambda_i)$ pair represents the recorded position of a moving object at time $t_i$, obtained from a GPS-enabled device. Each trajectory is associated with a truck with a unique truck ID. A trajectory is then formed by a sequence of segments called trajectory links.

Definition 2. A trajectory link $l_j$ is a straight line between two consecutive points $p_i = (\varphi_i, \lambda_i, t_i)$ and $p_{i+1} = (\varphi_{i+1}, \lambda_{i+1}, t_{i+1})$ of the same trajectory, where $i \in \mathbb{N}_0$ and $j = i$.

Figure 2.2: Data acquired by a global positioning system: (a) trajectory data base, listing the latitude, longitude and time of each point $p_1, p_2, p_3, \ldots, p_N$; (b) trajectory graph, with the links joining consecutive points.

If an object takes time t to travel a distance x, it maintains an average speed of x/t over that time. The range of speed that the vehicle maintains while in a certain area will be used, together with a minimum duration of stay, to define a ROI. Since the position and time of each data point are available, it is possible to estimate the average speed of a trajectory link. By using the latitude and longitude of two consecutive data points, the distance x between them (i.e. the length of the link) can be obtained.
The average speed of each link can then be calculated using the timestamps of the data points $p_i$. Once the average speed of each link is known, the speed ranges for the classification of the links can be identified and the extraction of regions of interest becomes possible. The links are classified according to their average speed into two classes: stopped and moving. As presented in Definition 3, ROIs are defined in terms of the average speed, and thus the definition of $s_{bound}$ is crucial, since it is the boundary that decides whether or not a trajectory link is considered a candidate ROI.

$$\begin{aligned} \text{Stopped:} \quad & 0 \le \bar{s} \le s_{bound} \\ \text{Moving:} \quad & \bar{s} > s_{bound} \end{aligned} \tag{2.1}$$

Conceptually, a region of interest is intended to be a region where moving objects pause or wait in order to complete activities that are difficult or impossible to carry out while in motion. In this work, a region of interest is formulated as follows:

Definition 3. A region R is a region of interest if at least one trajectory link $l \in R$ of the tracked object has its average speed in the interval $[0, s_{bound}]$, and the object remains in R for at least a time T before leaving R. That is, $\sum_{i=j}^{n} (t_{i+1} - t_i) \ge T$ with $R = [l_j, l_n]$. The parameters $s_{bound}$ and T are user-defined.

The framework used to extract ROIs is presented in the following figure:

Figure 2.3: Framework to extract regions of interest

As stated previously, the aim of the thesis is to obtain probability distribution functions for the estimation of activity durations. In this case, such activities are performed while the vehicle is stopped, and regions of interest are defined with that goal. However, the definition of region of interest is not confined to areas where the object is stopped or below a certain speed. It is also possible to estimate the durations of activities done while in movement by changing the speed interval of Definition 3.
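The link classification and the minimum-stay condition can be sketched as follows. This is an interpretation under stated assumptions rather than the STEL implementation: a ROI is taken to be a maximal run of consecutive links whose average speed lies in $[0, s_{bound}]$, link k joins points k and k+1, timestamps are strictly increasing, and the names `link_speeds` and `extract_rois` are hypothetical.

```python
def link_speeds(timestamps, link_lengths):
    """Average speed of each link: link k joins points k and k + 1."""
    return [length / (timestamps[k + 1] - timestamps[k])
            for k, length in enumerate(link_lengths)]

def extract_rois(timestamps, speeds, s_bound, min_duration):
    """Return (start, end) times of stopped runs lasting at least min_duration."""
    rois, run_start = [], None
    # A sentinel "moving" link closes a run that reaches the end of the trajectory.
    for k, speed in enumerate(list(speeds) + [float("inf")]):
        stopped = 0.0 <= speed <= s_bound
        if stopped and run_start is None:
            run_start = k  # the run starts at point k
        elif not stopped and run_start is not None:
            t_start, t_end = timestamps[run_start], timestamps[k]
            if t_end - t_start >= min_duration:
                rois.append((t_start, t_end))
            run_start = None
    return rois

# Hypothetical trajectory: times in seconds, link lengths in metres.
times = [0, 60, 120, 600, 1500, 1560, 1620]
lengths = [900.0, 800.0, 10.0, 20.0, 700.0, 5.0]
speeds = link_speeds(times, lengths)
print(extract_rois(times, speeds, s_bound=1.0, min_duration=600))
```

In this example the last link is slow but lasts only 60 seconds, so it is discarded by the minimum-stay parameter. A different speed interval could be substituted in the `stopped` test to target activities performed while moving.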
For instance, if the aim were to estimate the duration of an activity such as driving, in order to obtain travel times between locations, the speed interval of Definition 3 could be defined as $(s_{bound}, \infty)$.

2.2 Identify Activities

The event log provides a record of specific events at specific timestamps. Such events can be seen as atomic occurrences, with no time duration. They do not provide explicit knowledge about activities; instead, they come in the form of an event sequence. Such events can be of any type, from warnings and text messages to activities. Hence, activities must be extracted from the event logs in order to be characterized and studied. Given the large number of event types present in event logs, it is necessary to analyse the event log so that the events related to activities, and the activities themselves, are identified. This study can be seen in section 3.2. By looking at the event log it is possible to identify keywords related to activities, such as load, unload, wait, etc. From the keywords found in the event log, the set of activities is defined:

$$A = \{a_1, a_2, \ldots, a_k\} \tag{2.2}$$

Definition 4. An activity $a_k$, where k is the index that identifies an activity, is a finite sequence of data-points $\langle p_0, p_1, \ldots, p_m \rangle$, where $p_i = (\varphi_i, \lambda_i, t_i, e_i)$, such that $e_0 = s_k$ and $e_m = f_k$. The $(\varphi_i, \lambda_i) \in \mathbb{R}^2$ are spatial coordinates, the $t_i \in \mathbb{R}^+$ are timestamps, with $t_i < t_{i+1}$ for $i = 0, 1, \ldots, m-1$, and the $e_i$ are the recorded events. $s_k$ and $f_k$ denote the events that indicate the start and the end of an activity, respectively.

In the same empirical manner, three sets of events are defined: S, the set of events $s_k$ that indicate the start of an activity $a_k$; F, the set of events $f_k$ that indicate the end of an activity $a_k$; and C, the set of identifier events $c_k$ whose presence in the sequence $a_k$ indicates a special occurrence.

$$\begin{aligned} S &= \{s_1, s_2, \ldots, s_k\} \\ F &= \{f_1, f_2, \ldots, f_k\} \\ C &= \{c_1, c_2, \ldots, c_k\} \end{aligned} \tag{2.3}$$

This is the point where the human influence takes place. The lack of correlation between the time at which the events take place and the time at which the $s_k$ and $f_k$ events are recorded leads to wrong estimates of activity durations, as seen in section 4.2. To overcome this, the STEL algorithm estimates activity durations based on the creation of a time-line for each region of interest.

Given the full list of activities for each trajectory, activities need to be associated with the corresponding, previously found, regions of interest so that the time-line can be built. Time-lines are composed of the start dates $t_j$ and end dates $t_n$ of the regions of interest and the start and end dates of the activities that were performed in those time-spans. The activity dates are the dates at which the $s_k$ and $f_k$ events were recorded, respectively. An activity is considered to be part of a region of interest if at least one of the following conditions is satisfied, where $t_0$ and $t_m$ are the start and end dates of the activity:

$$\begin{aligned} t_j \le t_0 \le t_n \\ t_j \le t_m \le t_n \end{aligned} \tag{2.4}$$

Each region of interest will then have its own activities assigned to it, and the time-line can be built for each one of them. Each time-line can be seen as a description of the ROI, since it contains all the activities performed in the ROI time-span. An example is shown in figure 2.4.

Figure 2.4: Activity time-line of a region of interest

2.3 Characterize Activities

Using the activity time-lines obtained for the regions of interest, it is possible to estimate, under some assumptions, the duration of the process-related activities. Log systems, in general, keep track of the activities that occur during a certain process and, if there is no activity taking place, they are able to change their status to idle, which makes the interpretation of the log file easier.
However, log systems that are human-dependent are not completely aware of the process. They lack sensors to log activity-related events. The dependence on humans to log certain events leads to emptiness in the logs, in the sense that the system has no knowledge of what is happening in reality. For instance, in a person's day-to-day life there is no emptiness where activities are concerned: whether working, sleeping or waiting, a person is always performing an activity. In parallel, if a system can keep track of all the activity types, there should be no empty times between activities. The only reason for that to happen in such event logs is the human dependence of the systems. Hence, assuming that the system is capable of tracking every activity that occurs during a process, the time-lines of regions of interest should be completely filled.

The STEL algorithm uses the duration of stay within the ROI, from the spatio-temporal data, in conjunction with the start and end times of the activities, from the event log data, to estimate more reliable durations for certain activities. This is done by “stretching” the activity blocks based on the empty time available in the neighbourhood of such activities.

However, not all activities should have their duration estimated. Despite the human dependence of systems when logging certain activities, systems are fully aware of activities that are performed on the system itself. For instance, when a log-in activity is performed, the system knows when the user started and when the user finished. Such activities, despite the need for interaction between the user and the system, are logged accurately. Thus, there is a need to define a subset of activities $A^*$ whose duration is going to be estimated. The events related to the activities that do not belong to the subset $A^*$ are assumed to have been recorded without any time difference from reality. Figure 2.5 shows an example.
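The assignment rule of equation 2.4 and the empty time around an activity can be sketched as follows. The sketch only measures the gaps; how the activity blocks are actually stretched depends on the assumptions of section 4.3 and is not reproduced here. The names `build_timeline` and `empty_time`, the tuple layout and the example activities are assumptions made for illustration.

```python
def build_timeline(roi, activities):
    """Keep activities whose start or end date falls inside the ROI (equation 2.4)."""
    t_j, t_n = roi
    inside = [a for a in activities
              if t_j <= a[1] <= t_n or t_j <= a[2] <= t_n]
    return sorted(inside, key=lambda a: a[1])  # order by start date

def empty_time(roi, timeline, estimated):
    """Empty time before and after each activity that belongs to A*."""
    t_j, t_n = roi
    gaps = []
    for i, (name, t_0, t_m) in enumerate(timeline):
        if name not in estimated:
            continue  # logged accurately, duration is not re-estimated
        previous_end = timeline[i - 1][2] if i > 0 else t_j
        next_start = timeline[i + 1][1] if i + 1 < len(timeline) else t_n
        gaps.append((name, max(0, t_0 - previous_end), max(0, next_start - t_m)))
    return gaps

# Hypothetical ROI and activities as (name, start, end), in seconds.
roi = (100, 1000)
activities = [("unload", 700, 750), ("login", 110, 120),
              ("load", 300, 400), ("drive", 1200, 1500)]
timeline = build_timeline(roi, activities)
print(empty_time(roi, timeline, estimated={"load", "unload"}))
```

The activity named "drive" falls outside the ROI and is dropped, while "login" stays in the time-line but is excluded from the estimation because it is not in the assumed subset $A^*$.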
Figure 2.5: Empty time: the time available in the neighbourhood of an activity in the ROI time-line. The time-line runs from $t_j$ to $t_n$ and distinguishes activities whose time is to be estimated ($a_k \in A^*$) from activities whose time is not to be estimated ($a_k \notin A^*$).

Characterizing the activities with respect to their duration requires assumptions to be made. It is therefore mandatory to have a deep knowledge of the activities that are part of the event log and, more importantly, to be aware of how those activities are recorded in the event log. Such assumptions have to be made in accordance with the data that is being dealt with. In this case, the STEL algorithm is tested on two different logistics environments: the Amsterdam Schiphol Airport and the Port of Rotterdam. The assumptions made to estimate activity durations are explained in section 4.3.

Algorithm 1: STEL Algorithm

Define: A, A*, S, F, C, s_bound, T;
for i = 1 : number of trucks do
    for k = 2 : N do
        Calculate link length x_k using equation 1.4;
        Calculate link average speed s̄_k;
    end
    for k = 1 : N do
        Classify links according to equation 2.1;
    end
    for k = 1 : length(A) do
        for j = 1 : N do
            Identify the indexes of S, F, C events;
            Extract and identify activities as in Definition 4;
        end
    end
    for k = 1 : N do
        Find regions of interest as in Definition 3;
    end
    for k = 1 : number of ROIs do
        Build time-line using equation 2.4;
        for j = 1 : number of activities do
            if a_j ∈ A* then
                Estimate activity duration;
            end
        end
    end
end

Chapter 3

Analysis of a Logistics Database

The logistics data used to test the STEL algorithm was collected by a TomTom global positioning system (GPS) from a fleet of logistics trucks from DHL Global Forwarding - Schiphol and Jan de Rijk Logistics.
The system collects the trajectory of the vehicles, by recording latitude and longitude coordinates with a timestamp, and keeps a record of any event that occurred, such as "Start of Load", "End of Load", "Task Received from terminal", "Task Finished", etc. In total, 159 event types were found in the database. A complete list of events, their corresponding codes and their number of occurrences is available in the appendix. Each data entry is formed by the ID of the truck, the position, given by latitude and longitude coordinates, and a timestamp. The events are recorded with a description and their corresponding activity ID. The data is arranged as follows:

Truck ID | Latitude | Longitude | Activity ID | YYYY-MM-DD HH:MM:SS | Event Description
Table 3.1: Data point arrangement in the database

Since the positions, and the corresponding timestamps, are constantly being recorded, sometimes there is no event to assign to the data point. Thus, when the system records a data point and no specific event occurs, a "Basic record" is kept under the event description field, with the activity ID set to zero. With this in mind, data points can be recorded in two ways:
1. An event is triggered, leading to a data point record. The events can be triggered by the system user, in this case the driver, or by the system itself, depending on the type of event;
2. No event took place, but the position is still tracked. The event description field is recorded as a "Basic record" event.

3.1 Spatio-Temporal Data Analysis

As said previously, the spatio-temporal component of the data is given in the form of latitude and longitude coordinates and the corresponding date at which the data point was recorded. The database contains trajectories from trucks performing activities in the Rotterdam and Schiphol areas. There is also a specific truck performing activities across Germany, the Netherlands, Belgium and France, which is going to be referred to as the international truck.
There are 42 different trucks in Rotterdam, accounting for a total of 1972 data points. In contrast to the Rotterdam area, the Schiphol area contains a large amount of data concentrated on the Schiphol Airport: there are 135 thousand data points distributed among 276 trucks. The international truck trajectory is formed by 2468 data points belonging to a single truck with TruckID 1141. The database was collected over a time span of 10 days. Figure 3.1 shows the locations where the data points were recorded. Each data point has the form of table 3.1.

[Figure: scatter plots of the recorded data points, Latitude [°] against Longitude [°], in three panels.]
Figure 3.1: (a) Rotterdam; (b) Schiphol; (c) International Truck.

Trajectories are sampled with a maximum period of 15 minutes; however, since data points are also recorded when events are triggered, the period can be in the order of seconds. There are also cases where the system records more than one event simultaneously. This leads to some drawbacks when estimating the average speed of the trucks: the position errors inherent to the GPS are amplified through differentiation, leading to incorrect speed calculations. This matter is further addressed in section 4.1.

For a better comprehension and visualization of the database, it is restructured while the data is being processed. The acquired information is stored as a structure, indexed by the truck identification, with several fields that can be accessed for future analysis. Figure 3.2 shows the set of fields.

[Figure: the DATA structure, indexed by TruckID, with the fields Original Data, Distance & Time Intervals, Speed, Total Time, Spatio-Temporal Analysis, Event Log Analysis, Merged Data and Coordinates.]
Figure 3.2: Data Structure

With a total of 284 trucks, the dataset is divided according to the TruckID (TID).
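The division of the dataset by TruckID can be sketched as follows. This is only an illustration in Python, not the structure used in this work: the `DataPoint` record mirrors the columns of table 3.1, while the field names, the `split_by_truck` helper and the sample rows (coordinates, dates and the non-zero activity ID) are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class DataPoint(NamedTuple):
    # Columns of table 3.1; the field names are assumptions.
    truck_id: int
    latitude: float
    longitude: float
    activity_id: int
    timestamp: datetime
    description: str


def split_by_truck(points):
    """Group data points by truck ID and sort each trajectory by time."""
    trajectories = defaultdict(list)
    for point in points:
        trajectories[point.truck_id].append(point)
    for trajectory in trajectories.values():
        trajectory.sort(key=lambda p: p.timestamp)
    return dict(trajectories)


# Hypothetical rows, for illustration only.
rows = [
    DataPoint(1141, 50.10, 6.20, 0, datetime(2015, 1, 1, 9, 0, 0), "Basic record"),
    DataPoint(7, 52.31, 4.76, 0, datetime(2015, 1, 1, 8, 30, 0), "Basic record"),
    DataPoint(1141, 50.00, 6.00, 1, datetime(2015, 1, 1, 8, 0, 0), "Start of Load"),
]
by_truck = split_by_truck(rows)
print(sorted(by_truck))     # [7, 1141]
print(len(by_truck[1141]))  # 2
```

Each value of the resulting dictionary is one time-ordered trajectory, to which further fields (intervals, speeds, regions of interest) can be attached.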
In this way, the data of each truck is treated as a single trajectory and its related information is kept under the structure fields. The Original Data field, as the name states, is where the original data of the trajectory is kept in the form of table 3.1. Using the original trajectory data, the distance and time intervals between two data points are calculated, as described in section 4.1, and stored in the Distance & Time Intervals field. From these, a Speed vector is calculated, containing the average speed of the truck between data points. The Total Time field corresponds to a vector with the amount of time elapsed from the beginning until each data point of the trajectory. The Spatio-Temporal Analysis field refers to the data processed in section 4.1.1, where regions of interest are extracted based on the speed of the trucks. The Event Log Analysis field is where the information about identified activities is kept. Each truck event log is scanned in order to find groups of data points that represent an activity. Once activities are identified, they are merged with the previously found regions of interest, creating the necessary tools to estimate activity durations. This process is described in section 4.3 and the resulting knowledge is kept under the Merged Data field. Once the duration of activities is estimated, their locations are clustered so that it is possible to predict the duration of activities according to the locations where they are performed. The clustered activities and their data are stored in the Coordinates field.

3.2 Event Log Analysis

In this section an intensive study of the event log data is presented. It is important to state that the following analysis, and the information extracted from it, was strictly based on an objective and discerning analysis of the dataset, with no a priori knowledge given by the involved companies.
Event logs provide a record of ephemeral occurrences that can be related to several types of circumstances, from received messages to warnings or activities. However, in this work the estimation of activity durations is based on time-lines that are described by the performed activities. Such activities are embedded in the event logs as sequences of specific events. Hence, it is necessary to categorize events according to the occurrences that they are related to. Due to the huge number of event types, a selective analysis has to be made. Keeping the focus on the goal of the project, events that are not related to any form of activity will not be taken into account. Events can be divided into 9 main categories:

"Start of/End of" - events characterized by this prefix indicate that a specific task/occurrence has started (ended). It can be a user-introduced event, for activities such as load and unload, but also an automatically-introduced event, such as a speed limit violation;
"Cancellation of" - for the user-introduced events (e.g. activities), a cancellation option is available in case of user mistake;
"Report of" - users are able to report when tasks are completed; however, it is not a common practice among most of the users, since the number of occurrences is not correlated with the number of activities. For instance, the number of reported loads is 27 against 869 loads performed. This class of events will not be taken into account;
"Navigation" - navigation events are analogous to GPS actions such as introducing the destination and arriving at the destination. Those events will also be dismissed;
"Driving" - driving events are automatically recorded. They represent warnings related to driving times due to restrictions. "Driving times state" and "Driving times driving violation" are examples of such events. This class does not represent any activity and so these events are not going to be considered;
"Activity Midnight" - the presence of this event indicates that a certain activity is being accomplished during night time. The exact interval to be considered as a midnight activity is unknown. However, such an event is of great relevance since it can indicate a different pattern in the activity duration;
"Texts and Tasks" - texts and tasks events can be seen as a form of communication between the office and the terminal in the trucks. "Text message received", "Text message read", "Task received from terminal" and "Task accepted" are examples;
Others - all the other events that do not fall in any previous category and are not relevant in this analysis. "Crossed country border" and "Deceleration limit violation" are some examples.

From the event categories described above, "Start of/End of", "Cancellation of" and "Activity Midnight" are the only ones that are activity related. However, depending on the activity being performed, there are automatically recorded events as well as user-introduced events. When the human factor is involved, errors and mistakes can happen, making the database noisy, with an uncertainty related to the lack of time coordination between records and reality, as explained previously. This creates the need to distinguish the activities that are logged by the system from the ones recorded by users.

Looking at the "Start of/End of" class of events, it is possible to identify the following activities: arrive, break, costs, drive, garage, gas, load, unload, log in, log out, passage, rest, sign up, wait, peak RPM limit violation and speed limit violation. Hence, the set of activities from equation 2.2 and the event sets from equation 2.3 are defined as follows:

\[
\begin{aligned}
A &= \{\, \text{Arrive}, \text{Break}, \text{Costs}, \text{Drive}, \text{Garage}, \text{Gas}, \text{Load}, \text{Unload}, \text{Log In}, \text{Log Out}, \\
  &\qquad \text{Passage}, \text{Rest}, \text{Sign Up}, \text{Wait}, \text{Peak RPM Limit Violation}, \text{Speed Limit Violation} \,\}; \\
S &= \{\, \text{Start of ``}a_k\text{''} \,\}; \\
F &= \{\, \text{End of ``}a_k\text{''},\ \text{Cancellation of ``}a_k\text{''} \,\}; \\
C &= \{\, \text{Activity Midnight ``}a_k\text{''} \,\};
\end{aligned}
\]

From those, only the speed limit violation and the peak RPM limit violation are fully recorded by the system itself. However, activities like log in and sign up, despite needing the interaction of the user, are not subject to human behaviour. Since those activities are performed on the logging system, the system is aware of when the user starts and ends such a process, keeping a record of the events at the correct time. Activities such as load and unload, on the other hand, are not tracked by the system. Their log is completely dependent on the user, and thus subject to mistakes. In this thesis, those are the activities forming the subset A* ⊆ A whose duration is going to be estimated:

\[
A^{*} = \{\, \text{Load}, \text{Unload} \,\};
\]

Load and unload activities represent the core of logistic processes. Therefore, to extract valuable information that can be used in scheduling and planning applications, it is mandatory to have full knowledge about such activities. There are several event types concerned with load and unload activities. Table 3.2 presents a description of such events.

Event Type | Description
Start of Load/Unload | Event recorded by the user when the loading/unloading process starts;
End of Load/Unload | Event recorded by the user when the loading/unloading process finishes;
Cancellation of Load/Unload | Cancellation of the loading/unloading process due to user mistake;
Report of Load/Unload | Unknown;
Join of Load | Merge of two or more truck loads into a single truck;
Leaving of Load | Trailer is left at the location without unloading the cargo;
Activity Midnight Load/Unload | Loading/unloading process occurs during the night;
Table 3.2: Load and Unload activities related events

Figure 3.3 shows a bar chart with the number of occurrences of each event.
As can be seen, all events other than "Start of Load/Unload" and "End of Load/Unload" show a low number of occurrences when compared with these two. However, "Cancellation of Load/Unload" and "Activity Midnight Load/Unload" events have an important role in the project. Since these are user-introduced events, the human factor has to be taken into account; hence, despite the low number of occurrences, these events cannot be dismissed. Also, the presence of "Activity Midnight Load/Unload" events might indicate a different pattern in terms of the duration of the activities, since they are being accomplished at a particular time of the day.

[Figure: bar charts of Event Occurrences per event type. (a) Load Events; (b) Unload Events.]
Figure 3.3: Load and Unload event count

The load and unload activity events are correlated with each other, in the sense that the number of occurrences of "Start of Load" events is almost equal to the number of occurrences of "End of Load" events plus "Cancellation of Load" events:

\[
\#\,\text{start of load} \simeq \#\,\text{end of load} + \#\,\text{cancellation of load} \tag{3.1}
\]

With this information, once the regions of interest are identified, load and unload activities can be extracted from trajectories as the set of data points between a "Start of Load/Unload" and an "End of Load/Unload" event. This process is further explained in section 4.2.

Apart from the load and unload activities, there are more activity-related events. For instance, after a certain period of driving, drivers must stop to rest or take breaks. In addition, trucks need to be refuelled, which implies a stop as well. Since the load and unload locations have schedules to be served at, there is the possibility for a truck to arrive earlier than it should.
When this occurs, the driver has to wait for the customer to be available to perform the transaction. Figure 3.4 shows the number of occurrences of these events.

[Figure: bar chart of Event Occurrences for the Start of, End of and Cancellation of events of the Rest, Break, Gas and Wait activities.]
Figure 3.4: Other activity related events count

[Figure: bar chart of Event Occurrences for the User login and User logout events and for the Start of, End of and Cancellation of events of the Sign Up, Log In and Log Out activities.]
Figure 3.5: Log In & Log Out event count

Once again, the previously described events are correlated. The small difference in the number of occurrences, which might be attributed to missing data, can be neglected, as in the load and unload cases.

\[
\#\,\text{start of ``activity''} \simeq \#\,\text{end of ``activity''} + \#\,\text{cancellation of ``activity''} \tag{3.2}
\]

It is also important to take into consideration the events shown in figure 3.6, as they are the most frequent events in this dataset. As described previously, the "Basic record" event is inherent to a data point acquisition without the occurrence of any specific event. The purpose of this acquisition is strictly geographical and temporal, having no significance as an event. The "Driving times state" event is believed to be a system-introduced event that is related to the restrictions on driving times. "Start of", "End of" and "Cancellation of" are "incomplete" events whose origin is unknown, and thus they will not be considered during the processing of the data in the following chapter. "Contact ON" and "Contact OFF" are the events corresponding to switching the truck on and off, respectively.
As the name states, "Start of Drive" and "End of Drive" are user-introduced events that indicate the beginning and ending of the driving process between two locations. In between such events no other activity occurs.

[Figure: bar chart of Event Occurrences for the events Basic record, Driving times state, Cancellation of, End of, Start of, Contact ON, Contact OFF, Start of Drive and End of Drive.]
Figure 3.6: Most frequent events in the database

It is of great interest to know a priori which activities have the biggest impact in terms of time. In a naive approach, once the activity-related events are identified, it is possible to measure their duration as the elapsed time between the "Start of -activity-" and "End of -activity-" events. However, this measurement should not be taken as a true estimate of the activity times. The assumption that the events are correctly introduced by the users is simply not correct, as shown in section 4.3; thus the following plot only serves as a term of comparison for future results.

[Figure: pie chart of the share of time per category: Rest Time, Waiting Time, Loading Time, Unloading Time and Other Events.]
Figure 3.7: Activity time distribution

This analysis takes into account the most important activities in terms of time: rest activities, load and unload activities and waiting activities. Break activities, passage activities, arriving activities and refuelling activities are also considered, but are represented as "others" due to their small portion of time.

3.2.1 Sequential Pattern Mining in Event Logs

As explained previously, to perform sequential pattern mining on event logs there is the need to define shorter sequences. It is not possible to use a complete sequence of events from a trajectory, directly from the event log. Event logs contain thousands of events, which makes the existence of patterns unlikely due to the small support.
In addition, event logs can come from different locations, so that their patterns are not shared with other trucks. The number of event types in the event log is also a problem that such algorithms have difficulty tackling. To demonstrate this, sequences were built from the event logs of each trajectory, where each event is handled as a set of items from a transaction and each customer ID is replaced by the trajectory ID (i.e. the truck ID). Using an open-source software package for pattern mining [34], the sequences were explored but, as expected, no results were found despite the minimum support being as small as 1%.

Chapter 4
Processing High Volume Logistics Data

In this chapter the two types of data are processed. Firstly, the spatio-temporal data is used to identify the regions of interest with their corresponding start and end dates. Secondly, the event log data is mined: activities are identified and extracted according to the type of situation (normal, cancellation or midnight activity). Then, using the knowledge extracted from both analyses, the data is merged in order to estimate the duration of each one of the load and unload activities. The acquired information about the activities is then clustered according to their location, so that it can be translated into probability density functions that express the duration of an activity at a given location. Such results are of great importance for planning and scheduling applications such as vehicle routing problems with stochastic time windows.

Another possible product of this analysis is the travel time between customers. Although the work of this thesis is focused on acquiring a proper duration estimate for load and unload activities, it is also possible to estimate drive activity durations (i.e. travel times) by focusing on the moving portions of the time-line rather than on the stopped ones.
This can be useful to obtain probability density functions of travel times for use in applications that deal with stochastic travel times.

4.1 Spatio-Temporal Data - Trajectories

In previous sections the definitions of trajectory, trajectory links and regions of interest were presented. Since ROIs are defined in terms of speed, there is the need to calculate the average speed at each link. Using the latitude and longitude of the two data points of link lj, it is possible to determine the travelled distance (i.e. the length of the link, xj). There are several approaches to this calculation, as presented in section 1.2. The haversine formula was implemented due to its precision and simplicity when compared with others (e.g. Pythagoras' theorem, spherical law of cosines). It is a special case of the general law of haversines, relating the sides and angles of spherical triangles. It gives the shortest distance between two points measured along the surface of a sphere, rather than along a straight line through the interior of the sphere. Despite its accuracy, the haversine formula takes into account neither the ellipsoidal geometry of the Earth nor the elevation due to mountains and other terrain. The explicit form of equation 1.2 is:

\[
x_j = 2r \arcsin\left( \sqrt{ \sin^2\left( \frac{\phi_i - \phi_{i-1}}{2} \right) + \cos(\phi_{i-1}) \cos(\phi_i) \sin^2\left( \frac{\lambda_i - \lambda_{i-1}}{2} \right) } \right)
\]

where \phi_i is the latitude of p_i, \lambda_i is the longitude of p_i and r is the Earth radius.

Once the distance and time between two data points are known, it is possible to calculate the average speed for each link:

\[
\bar{s}_j = \frac{x_j}{\Delta t_j} \quad \text{where} \quad \Delta t_j = t_{j+1} - t_j \tag{4.1}
\]

Due to the GPS accuracy and precision, the recorded latitude and longitude can differ slightly from the real ones. GPS error sources can be divided into three classes: satellite dependent errors, propagation errors and receiver errors [35], [36].
(i) Satellite Dependent Errors - Although satellites are launched into precise orbits, small changes might occur, caused by gravitational pulls from the moon and sun and by the pressure of solar radiation on the satellites. The atomic clocks of the satellites might also experience noise and clock drift errors.
(ii) Propagation Errors - Inconsistencies in atmospheric conditions affect the speed of the GPS signals as they pass through the Earth's atmosphere, causing delays in the signals. GPS signals can also be affected by the environment, where the radio signals reflect off surrounding terrain, buildings, canyon walls, hard ground, etc.
(iii) Receiver Errors - Depending on the quality of the GPS receiver, tiny discrepancies between the receiver clock and GPS time will influence the calculated distances.

According to the U.S. Government information about the Global Positioning System [37], the accuracy standard for positioning, given at a 95% confidence level, is 17 meters for horizontal errors and 37 meters for vertical errors. It is also important to take into account that higher accuracy is attainable by using GPS in combination with augmentation systems, like EGNOS within Europe.

Nevertheless, the simplest way to get the velocity from a GPS receiver is to differentiate the GPS-determined positions with respect to time, since velocity is the first time derivative of position. The problem is that errors in positions are amplified through differentiation. This becomes worse when a high output rate is used, since the positional error remains the same but the time interval is decreased. The outcome is a misleading calculation of speed that, in the case of high output rates, differs greatly from the real speed. Figure 4.1 illustrates an example of this. Although the truck is travelling at cruising speeds, between 80 and 90 km/h, when the time interval between two samples is extremely small the calculated speed is unrealistic.
In the first occurrence in the figure, the travelled distance between samples is 43 m with a time interval of 1 second, resulting in a speed of 155 km/h.

[Figure: Average Speed [Km/h] against Minutes, showing the original and the filtered average speed.]
Figure 4.1: Effect of the speed filter on an average speed profile of the international truck with a filter threshold of 6

In addition, there are also data points that are recorded simultaneously, making Δt = 0, and thus equation (4.1) becomes undefined. In order to overcome these problems a speed filter was created. The filter sets the average speed of link lj equal to the average speed of link lj−1 when the temporal distance between the data points is smaller than a user-defined amount of time, the filter threshold. The results can be seen in figure 4.1.

\[
\bar{s}_j =
\begin{cases}
\bar{s}_{j-1} & \text{if } \Delta t_j < \text{threshold} \\
\bar{s}_j & \text{else}
\end{cases}
\tag{4.2}
\]

Figure 4.2 shows the average speed profile obtained, after being filtered, for the international truck along a time span of 10 days. The speed filter proved to be effective, since there are no link speeds above what is expected for a truck. Despite the lack of precision in the speed values, since they are derived from data points with an associated error as seen previously, they are reliable enough to determine candidate regions of interest. By analysing the figure it is possible to see that eight significant trips were made, represented by the velocity peaks, interlaced with seven major stops. The gap at the end of the plot is due to a system shut-off, during which no data points were recorded.

[Figure: Speed [Km/h] against Minutes.]
Figure 4.2: Average speed profile of the international truck with a filter threshold of 6

4.1.1 Regions of Interest

As described in section 2.1, a region of interest is a selected subset of trajectory links whose average speed is below a certain value sbound.
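The steps introduced so far (haversine link length, average link speed, speed filter and the speed threshold of a region of interest) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names, the mean Earth radius of 6371 km, the use of seconds for the filter threshold and the value 0 assigned to a filtered first link are all choices made here, and the strict comparison with `s_bound` stands in for equation 2.1.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)


def haversine_km(lat1, lon1, lat2, lon2, r=EARTH_RADIUS_KM):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(min(1.0, h)))


def link_speeds_kmh(points, min_dt_s):
    """Filtered average speed of each link.

    points: list of (t_seconds, lat, lon) sorted by time.
    min_dt_s: filter threshold in seconds (assumed unit).
    """
    speeds = []
    for (t0, lat0, lon0), (t1, lat1, lon1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt < min_dt_s or dt <= 0:
            # Speed filter: reuse the speed of the previous link.
            # Using 0.0 when there is no previous link is an assumption.
            speeds.append(speeds[-1] if speeds else 0.0)
        else:
            speeds.append(haversine_km(lat0, lon0, lat1, lon1) / (dt / 3600.0))
    return speeds


def classify_links(speeds, s_bound):
    """Label each link as 'stopped' (speed below s_bound) or 'moving'."""
    return ["stopped" if s < s_bound else "moving" for s in speeds]


# Hypothetical trajectory: an hour standing still, two samples one second
# apart, then an hour of driving north.
trajectory = [(0, 52.0, 4.0), (3600, 52.0, 4.0), (3601, 52.0004, 4.0), (7201, 52.5, 4.0)]
speeds = link_speeds_kmh(trajectory, min_dt_s=6)
print(classify_links(speeds, s_bound=15))
# ['stopped', 'stopped', 'moving']
```

Without the filter, the second link (about 44 m in one second) would yield a speed of roughly 160 km/h, which is the kind of artefact described above.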
Once the average speeds of the trajectory links are known, they can be classified according to equation 2.1. Since the sbound value is the boundary that decides whether a trajectory link is considered part of a region of interest or not, the value chosen for it will strongly influence the results. To be able to define sbound, a frequency study was made. The previously calculated average link speeds were grouped into bins of 1 Km/h width. The height of each bin represents the number of occurrences of a certain speed. The following figures present a set of histograms for various truck IDs as well as for the whole database.

[Figure: four histograms of average link speed. (a) All trucks; (b) TruckID 1141; (c) TruckID 7234; (d) TruckID 1849.]
Figure 4.3: Average Speed Histograms with α = 0.5

Note: Trucks are stopped most of the time and thus the number of occurrences for s̄ = 0 is huge. Plotting them would result in less legible graphs due to scaling, hence they were not plotted.

As can be seen, the shape of the histogram is highly dependent on the truck ID. Since the trucks operate in different areas, with different topologies and speed restrictions, their average speed profiles are quite different. For instance, the histogram of figure 4.3 (c) belongs to a truck performing tasks in the Amsterdam Airport Schiphol and, due to that, it does not exceed the speed limit of 50 Km/h. On the other hand, figure 4.3 (b) plots the speeds of TruckID 1141, previously referred to as the international truck, and shows two main areas: one at lower speeds, when the truck is moving inside cities, and another at cruise speeds (80-90 Km/h), when the truck is on the highway. Figure 4.3 (d) contains data from a truck of the Port of Rotterdam and, given the length of the port, this truck makes use of both highway and national roads (A15 and N15, respectively).
Considering that the database is mostly formed by trucks of the Amsterdam Airport Schiphol, the fact that there are many more occurrences for low speeds than for high speeds was expected, and it can be seen in figure 4.3 (a), which shows the histogram of all available data.

[Figure: two ROI time-lines with the sbound level marked. Left: a single ROI containing an activity ak ∈ A*. Right: the same activity split between ROI1 and ROI2.]
Figure 4.4: Split regions of interest (right figure) due to a low sbound

Such diversity makes the choice of the sbound value non-trivial. If a high value is chosen, the duration of the ROIs will be larger than it is in reality. In particular, trucks whose average speed profile is low, like TruckID 7234, would show regions of interest with data belonging to moving segments of the trip, due to the high number of low speed links. On the other hand, small sbound values would lead to a falsely high number of ROIs with short durations. Owing to the GPS inaccuracies discussed previously, the acquired position is not always constant when trucks are stopped, leading to the existence of non-zero speeds. If such speeds are greater than sbound, regions of interest are truncated, leading to a loss of information that is crucial to estimate activity times. For instance, if an activity is being performed and there is a speed record greater than sbound, the activity will be split between the time-lines of two regions of interest, see figure 4.4. Such behaviour can be seen in figures 4.5a and 4.5b, where each line belongs to a specific truck and represents, respectively, the total number of regions of interest for that truck and the total time spent by the truck inside ROIs. In figure 4.5a, when the sbound value increases, the number of detected ROIs decreases, making their duration longer, as can be seen in figure 4.5b.
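The trade-off described above can be reproduced on a toy speed profile. The sketch below is an illustration only: the `find_rois` and `sweep_s_bound` helpers, the speed values and the fixed link duration of 10 minutes are hypothetical, a link is taken as stopped when its speed is strictly below `s_bound`, and no minimum ROI duration is applied.

```python
def find_rois(speeds, s_bound):
    """Return (first_link, last_link) index pairs of maximal runs of stopped links."""
    rois, start = [], None
    for k, speed in enumerate(speeds):
        if speed < s_bound:
            if start is None:
                start = k  # a new run of stopped links begins
        elif start is not None:
            rois.append((start, k - 1))  # a moving link closes the run
            start = None
    if start is not None:
        rois.append((start, len(speeds) - 1))
    return rois


def sweep_s_bound(speeds, durations_min, bounds):
    """For each candidate bound, return (bound, number of ROIs, total ROI time)."""
    rows = []
    for bound in bounds:
        rois = find_rois(speeds, bound)
        total = sum(sum(durations_min[i:j + 1]) for i, j in rois)
        rows.append((bound, len(rois), total))
    return rows


# Hypothetical link speeds in km/h and link durations in minutes.
speeds = [0, 3, 0, 40, 60, 0, 8, 0, 50, 2, 45]
durations = [10] * len(speeds)
for row in sweep_s_bound(speeds, durations, bounds=[2, 5, 10]):
    print(row)
# (2, 4, 40)
# (5, 4, 60)
# (10, 3, 70)
```

With the lowest bound, the small non-zero speeds recorded during a stop split it into several short ROIs; raising the bound merges them, so the number of ROIs drops while the total time inside ROIs grows.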
[Figure: two line plots against sbound, one line per truck. (a) Number of ROIs per truck against sbound; (b) Total time spent in ROIs [min] per truck against sbound.]
Figure 4.5: Choosing the correct value for the sbound parameter

Once the value for sbound is set and the link classification is done according to equation 2.1, regions of interest are built according to Definition 3. In order not to lose information about short truck stops, the minimum duration T was set to zero. This means that every sequence of trajectory links with stopped classification represents a region of interest, as figure 4.6 illustrates. In that trajectory, three ROIs were found: the first is formed by the trajectory links ⟨l1, l2, l3⟩, the second by ⟨l6, l7, l8⟩ and the third by ⟨l10⟩.

Link | Lat | Long | Time | Class.
l1 | φ1 | λ1 | t1 | stopped
l2 | φ2 | λ2 | t2 | stopped
l3 | φ3 | λ3 | t3 | stopped
l4 | φ4 | λ4 | t4 | moving
l5 | φ5 | λ5 | t5 | moving
l6 | φ6 | λ6 | t6 | stopped
l7 | φ7 | λ7 | t7 | stopped
l8 | φ8 | λ8 | t8 | stopped
l9 | φ9 | λ9 | t9 | moving
l10 | φ10 | λ10 | t10 | stopped
l11 | φ11 | λ11 | t11 | moving
(a) Regions of interest

[Figure: graph of the trajectory links l1 to l11.]
(b) Trajectory graph

Figure 4.6: An example of extracting regions of interest with T = 0

If the focus were on obtaining travel times between customers, the regions of interest would instead be the links with moving classification: ⟨l4, l5⟩, ⟨l9⟩ and ⟨l11⟩. The travel time would be given by the time difference between the last data point of the last link and the first data point of the first link.

Now that the regions of interest are known, their content can be plotted. By using the speed of the links and the corresponding recorded events, a sequence plot was made. For the sake of simplicity, only relevant events were considered. Figure 4.7a shows a region of interest with a duration of 1.44 hours in which an unload event takes place.
It is important to notice that, while the truck is unloading, recorded speeds reach up to 15 km/h. This is the reason why sbound has to be kept sufficiently high; otherwise links in the middle of the events would be classified as moving and regions of interest would not contain all the information necessary for time estimation. Figure 4.7c shows that problem: only the "end of load" event is contained inside the ROI, whose duration is half an hour. In reality, the truck stayed for longer in that region but, due to a recorded speed greater than sbound, that information was kept in a separate region of interest.

Figures 4.7a and 4.7b are both examples of how load and unload events should be introduced by the user. In the first case, a single unload took place, occupying almost all of the duration of stay in the region of interest. In the second case, the user also introduced all events in the expected sequence and at the expected time. The user finishes the drive, then starts the first load activity. After a certain amount of time, which is considered to be the duration of the load task, the user introduces the "end of load" event. The user repeats the process for the second load activity, then starts the driving process and leaves the region of interest.

However, in many situations this is not what happens. For some reason, the events "start of load/unload" and "end of load/unload" are often introduced only seconds apart from each other, making the logged duration of the activities incorrect. Figure 4.7d shows an example of this.
Figure 4.7: Examples of regions of interest for sbound = 15 km/h (velocity [km/h] against time [minutes], with the recorded events marked). (a) Single activity: truck 1141, stop duration 1.44 hours. (b) Multiple activities: truck 1141, stop duration 1.78 hours. (c) Split ROI due to low sbound: truck 1849, stop duration 0.50 hours. (d) Effect of human behaviour: truck 1157, stop duration 1.47 hours.

4.2 Event Logs

As explained in chapter 2, the STEL algorithm estimates activity durations based on a time-line created for each region of interest. Such a time-line is composed of the start and end dates of the region of interest and the start and end dates of the activities that were performed in that time-span. However, event logs do not provide explicit knowledge about activities. Instead, they represent a sequence of momentary events with no duration. Such events can be of any type, from warnings to text messages, as seen in section 3.2. There is therefore a need to identify the activities from the event log in order to build the region of interest time-line.

In the analysis made in chapter 3, thirteen types of activities were found in the database. They are listed below:

• Load
• Gas
• Log Out
• Unload
• Drive
• Log In
• Rest
• Sign Up
• Passage
• Arrive
• Garage
• Wait
• Costs

Events that are not activity related have no relevance for the estimation of activity durations.
This is because such events do not require the presence of a user, since they are recorded by the system itself. Activity related events can only be recorded by the user directly, as in a load activity, or through the interaction between the user and the system, as in a log in activity. Hence, warning events, and events that are not activity related in general, will not be considered.

Each of the previously listed activities is characterised by a set of events: "Start of activity", "End of activity", "Cancellation of activity" and "Activity Midnight". Hence, it is possible to classify the activities into three types of situations:

1. Normal situation - given by a pair of events "Start of activity" and "End of activity";
2. Cancellation situation - characterised by a "Start of activity" event followed by a "Cancellation of activity" event;
3. Midnight activity - formed by a set of three events: "Start of activity", "End of activity" and, in between those events, "Activity Midnight".

For each activity type, the trajectory is scanned in order to find "Start of activity" events. Once such an event is identified, a second scan is performed to identify the end of the activity. Depending on the type of situation, the end of an activity can be dictated by an "End of activity" or a "Cancellation of activity" event. Once the entries that delimit the activity are known, the events in between are scanned so that identifying events, such as "Activity Midnight", are found. An example can be seen in Table 4.1.

Each activity is then formed by a sequence of data points, each with an associated time-stamp, non-relevant event, and latitude and longitude coordinates. Activities are defined as the set of data-points between the "Start of activity" event and the "End of activity" or "Cancellation of activity" event, depending on the situation. Each of the data-points has the form of table 3.1.
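The two-pass scan described above can be sketched as follows. This is a minimal illustration and not the STEL implementation: the function name `identify_activities`, the exact event strings and the last two rows of the sample log are assumptions made for the example, while the first four rows follow Table 4.1.

```python
def identify_activities(events, activity):
    """Find occurrences of one activity type in an ordered event log.

    `events` is a list of (timestamp, event_type) pairs. Each result gives
    the indices of the delimiting events and the situation: 'normal',
    'cancellation' or 'midnight'. Event names are hypothetical.
    """
    start, end = f"Start of {activity}", f"End of {activity}"
    cancel, midnight = f"Cancellation of {activity}", f"{activity} Midnight"
    found = []
    i = 0
    while i < len(events):
        if events[i][1] != start:
            i += 1
            continue
        # Second scan: look for the event that closes the activity.
        j = next((k for k in range(i + 1, len(events))
                  if events[k][1] in (end, cancel)), None)
        if j is None:
            break  # the activity is never closed in this log
        if events[j][1] == cancel:
            situation = "cancellation"
        elif any(e[1] == midnight for e in events[i + 1:j]):
            situation = "midnight"
        else:
            situation = "normal"
        found.append({"start": i, "end": j, "situation": situation})
        i = j + 1
    return found


log = [
    ("2013-05-02 18:23:34", "Task Busy"),
    ("2013-05-02 18:24:56", "Start of Unload"),
    ("2013-05-02 18:26:29", "Contact OFF"),
    ("2013-05-02 18:30:46", "End of Unload"),
    ("2013-05-02 19:00:00", "Start of Load"),         # invented row
    ("2013-05-02 19:00:05", "Cancellation of Load"),  # invented row
]
unloads = identify_activities(log, "Unload")
loads = identify_activities(log, "Load")
```

Keeping the situation as a label, rather than dropping cancelled activities during the scan, leaves the decision to exclude them to the step that builds the time-line.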
Distinguishing the types of activity situations plays a big role in estimating the durations of activities. If cancellation situations are not distinguished from normal situations, the algorithm takes them as activities that occurred, marking the time-span between the "start of activity" and "cancellation of activity" events as occupied. Since it is unclear what is happening in a cancellation situation, those activities are not part of the time-line. The classifier "Midnight activity" is used to distinguish activities that were performed during the night. Their duration patterns are different from those performed during the day due to workforce differences. Thus, to obtain accurate predictions, they must not be mixed with the rest of the activities.

Truck ID | Date & Hour | Event type | Activity
1141 | 2013-05-02 12:57:52 | Navigation ETA update |
1141 | 2013-05-02 13:57:55 | Contact ON |
1141 | 2013-05-02 14:57:58 | Start of Break | Break
1141 | 2013-05-02 15:21:41 | Cancellation of | Break
1141 | 2013-05-02 15:21:45 | End of Break | Break
1141 | 2013-05-02 15:21:46 | Start of |
1141 | 2013-05-02 15:21:46 | Cancellation of |
1141 | 2013-05-02 15:22:21 | Start of Drive | Driving
1141 | 2013-05-02 15:22:45 | Driving times state event | Driving
1141 | 2013-05-02 15:22:45 | Basic record | Driving
1141 | 2013-05-02 15:23:15 | End of Drive | Driving
1141 | 2013-05-02 15:23:15 | Start of |
... | ... | ... | ...
1141 | 2013-05-02 18:23:34 | Task Busy |
1141 | 2013-05-02 18:23:55 | Cancellation of |
1141 | 2013-05-02 18:24:56 | Start of Unload | Unloading
1141 | 2013-05-02 18:26:29 | Contact OFF | Unloading
1141 | 2013-05-02 18:27:31 | Driving times state event | Unloading
1141 | 2013-05-02 18:27:46 | Basic record | Unloading
1141 | 2013-05-02 18:28:52 | Contact ON | Unloading
1141 | 2013-05-02 18:28:53 | Task Finished | Unloading
1141 | 2013-05-02 18:30:46 | End of Unload | Unloading

Table 4.1: Example of the identification of activities from the event log data

Having the activities identified, as in table 4.1, and classified according to the situation, an activity table is built for each trajectory. Previously identified activities, formed by a set of data-points, are transformed into an atomic representation composed of a single data-point. The following paragraph explains the process.

With the exception of the drive activity, all activities are supposed to be performed with the truck stopped, since the user needs to interact with the logging system. Consequently, the latitudes and longitudes of the data points of such activities are very similar and can be averaged to represent the actual activity location. Figure 4.8 is an example of the averaging process for a load activity: although the truck is stopped when loading, the points are not recorded at exactly the same location because of the limited precision of the GPS, so it is assumed that the activity takes place at the average location of the data-points that form the activity.

Figure 4.8: Averaging latitude and longitude coordinates to obtain a single point representation of the activity location (all data-points and their average; latitude [°] against longitude [°]).

The activity table is then built using the recording times of the events "Start of activity" and "End of activity", the averaged latitude and longitude and the activity type, as in table 4.2. The activity duration can be calculated as the difference between the end of activity date and the start of activity date. All cancellation situations are assumed to be real, and thus they are not part of the activity table.
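The atomic representation described above can be sketched as below. The function name `to_atomic` and the record layout are hypothetical. The time-stamps in the example are those of the unload activity in table 4.1, but the coordinates are invented for illustration.

```python
from datetime import datetime
from statistics import mean

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def to_atomic(points, activity):
    """Collapse the data-points of one activity into a single record.

    `points` is a list of (timestamp, latitude, longitude). The first point
    is the "Start of activity" event and the last one the "End of activity".
    """
    start = datetime.strptime(points[0][0], TIME_FORMAT)
    end = datetime.strptime(points[-1][0], TIME_FORMAT)
    return {
        "start": start,
        "end": end,
        # The mean position stands in for the activity location.
        "latitude": mean(p[1] for p in points),
        "longitude": mean(p[2] for p in points),
        "activity": activity,
        "duration_min": (end - start).total_seconds() / 60,
    }


# Time-stamps from table 4.1; coordinates are made up for this example.
unload_points = [
    ("2013-05-02 18:24:56", 50.0266, 8.5674),
    ("2013-05-02 18:26:29", 50.0267, 8.5675),
    ("2013-05-02 18:30:46", 50.0265, 8.5673),
]
row = to_atomic(unload_points, "Unload")
```

A plain arithmetic mean of the coordinates is adequate here only because the points of a stopped truck lie within a few metres of each other.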
Table 4.2: Example of activity table

Start Date & Hour | End Date & Hour | Latitude | Longitude | Activity
2013-05-02 14:57:58 | 2013-05-02 15:21:45 | 50.89785 | 7.15316 | 'Break'
2013-05-02 15:22:21 | 2013-05-02 15:23:15 | 50.38141 | 7.94855 | 'Drive'
... | ... | ... | ... | ...
2013-05-02 18:24:56 | 2013-05-02 18:30:46 | 50.02661 | 8.56742 | 'Unload'

Once the activity table for the trajectory is known, the activities are assigned to the corresponding regions of interest according to equation 2.4. This allows the creation of the so-called activity time-line for the regions of interest. By looking at such time-lines and analysing the activities performed in them, it is possible to clearly differentiate between rest stops, traffic stops and work stops. If no activities are performed and the duration of the region of interest is relatively small, it is likely to be a traffic stop. On the other hand, if there are activities such as rest or load/unload, the type of truck stop being dealt with is clear. The duration of such truck stops can also be of interest to the companies involved.

4.3 Activity Duration Estimation

Event log data is vulnerable to uncertainty. In fact, that is the biggest drawback when dealing with data sets that are human dependent. On the one hand, the introduced data is subject to human error, making the existence of the activities unclear. Although it is possible to cancel an activity after starting it wrongly, there is no guarantee that users adopt that behaviour. Rather, they can simply end the activity, and it becomes impossible to trace whether the activity actually existed or not. On the other hand, events should be introduced in coherence with reality: as soon as the activity starts and ends. However, in many situations that does not happen. It is common practice for users to introduce the events "Start of activity" and "End of activity" within a few seconds of each other, making the recorded duration of the activities unrealistic. An example is shown in figure 4.9a.
The black bar represents the region of interest: its length is the total duration of the truck "stop". Each additional bar represents an activity performed inside the region of interest. Load and unload activities are highlighted with a dotted line. By analysing figure 4.9a it is possible to tell that the region of interest (black bar) begins slightly before the end of the driving activity (blue bar). It proceeds with the sign up process, followed by another driving activity, possibly to the load location, and a load task is completed. The user drives again, completes another load task and starts driving away; the region of interest ends. In terms of the sequence in which the events were introduced, everything is ordinary. However, the duration of the load activities is illusory. Considering the empty time between the second driving activity and the first load activity, the most probable scenario is that the user, after arriving at the load site, performed the actual load activity and only then introduced the corresponding events. In contrast, figure 4.9b is a case of a truck stop whose events are more likely to have been introduced in coherence with reality. As can be seen, the whole time-line of the region of interest is filled with activities.

Figure 4.9: ROI activity time-line examples (duration in hours). (a) Wrongly introduced activities: stop, drive, sign up and load. (b) Correctly introduced activities: stop, drive, sign up, rest and unload.

For planning and scheduling purposes it is extremely important to have precise knowledge about load and unload durations at each location. Such knowledge can be achieved if the uncertainty present in the data is overcome. Nevertheless, the uncertainty about the existence of the activities will be disregarded: every pair of events "start of activity" and "end of activity" will be considered as a performed activity.
The uncertainty that is going to be dealt with is the one related to the lack of coherence between the dates at which events are recorded and reality.

Recalling the event analysis made in chapter 3, most of the activities are stop related tasks, since interaction between the driver and the system is needed. Hence, by joining the two types of data, spatio-temporal and event logs, it is possible to estimate the duration of load and unload activities. Nevertheless, some general assumptions need to be made. It is assumed that:

i) events are totally ordered (i.e., events are recorded sequentially in the log even though tasks may be executed in parallel);
ii) users did not record any event by mistake;
iii) the time-span of a cancelled activity is considered empty (i.e., if an activity is cancelled it did not exist);
iv) all activities apart from driving take place inside a region of interest;
v) all activities besides load and unload were introduced in coherence with reality.

The STEL algorithm uses the duration of stay within the ROI, from the spatio-temporal data, in conjunction with the activity start and end times, from the event log data, to estimate more reliable load/unload durations. The estimation is done by "stretching" the activity blocks based on the empty time available in the neighbourhood of such activities, see figure 2.5. By doing this, the resulting load/unload activity is formed of three parts: one reliable portion and two possibilistic portions. The reliable part is given by the original activity block, delimited by the "start of activity" and "end of activity" events. The possibilistic portions are represented by the stretched parts of the blocks, where there is uncertainty about the actual start or end of the activity. Figure 4.12 illustrates this statement.
Figure 4.10: Empty time: the time available in the neighbourhood of an activity in the ROI time-line. The activity whose time is to be estimated (ak ∈ A*) lies between activities whose time is not to be estimated (ak ∉ A*), with the times tj and tn marked on the ROI time-line.

At this point, two different situations arise: estimating the duration of a single load/unload activity, or estimating the durations of consecutive load/unload activities. It would not be reasonable to estimate the durations of multiple activities in the same way as the duration of a single activity. Since it is a sequential process, the estimation of the first activity of the group would constrain the estimation of the following activities. When both the start and end dates of the first activity are stretched, the previously existing gap between the activities disappears, constraining the start date of the following activity to its original position. Hence, single activity and multiple activity situations have to be identified.

Figure 4.11: Different types of estimation cases. (a) Single activity case: a LOAD between other activities on the ROI time-line. (b) Multiple activities case: LOAD, LOAD and UNLOAD between other activities on the ROI time-line. Legend: activity whose time is to be estimated; activity whose time is not to be estimated.

4.3.1 Single Activity

In the case of a single activity, the estimation of the duration is straightforward. If no other activities are present in the region of interest, the activity whose duration is to be estimated is assumed to have the same duration as the region of interest, see figure 4.12a. When the activity is the first activity of the region of interest, its duration is given by the time span from the beginning of the region of interest until the start of the following activity, see figure 4.12b. The same happens for activities placed at the end of a region of interest: the duration is given by the time difference between the end of the previous activity and the end of the region of interest, see figure 4.12c.
The estimated duration of activities that are in between other activities is given by the elapsed time between the end of the prior activity and the start of the following activity, see figure 4.12d. Only single load and unload activities whose duration is smaller than or equal to ε have their duration estimated; otherwise, the original duration of the activity is kept. ε is a user-defined constant that sets the boundary between the activity duration being estimated or not.

Figure 4.12: Single activity situations: different cases when estimating the duration of a single activity. (a) Activity is the only activity in the ROI. (b) Activity is the first activity in the ROI. (c) Activity is the last activity in the ROI. (d) Activity is in between other activities.

Figure 4.12 shows all the possible cases when estimating the duration of single activities. The right side of the figure presents the proposed solution for the cases on the left. The green rectangle without the gradient represents the original time-span in which the activity was registered. The gradient-filled rectangles are the possibilistic areas in which the start or end of the activity may have happened.

4.3.2 Multiple Activities

When facing a situation with multiple load/unload activities, as in figure 4.11b, a different approach has to be taken. In this case, rather than using only the empty time in the neighbourhood of such activities, the whole interval between two other activities, whose duration is not to be estimated, is used. Figure 4.13 illustrates such an interval.
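The four single-activity cases of section 4.3.1 reduce to one rule: stretch the activity back to the end of the closest preceding activity (or the ROI start) and forward to the start of the closest following activity (or the ROI end). The sketch below illustrates that rule under assumptions: the function name `stretch_single` is hypothetical, all times are minutes measured from a common origin, and `others` holds only activities whose duration is not to be estimated.

```python
def stretch_single(roi, activity, others, eps):
    """Estimate the (start, end) of a single load/unload activity.

    roi, activity: (start, end) tuples in minutes.
    others: (start, end) tuples of the remaining activities in the ROI.
    eps: threshold; activities longer than eps keep their original span.
    """
    a_start, a_end = activity
    if a_end - a_start > eps:
        return activity
    # Closest preceding end, never earlier than the ROI start.
    new_start = max([roi[0]] + [e for _, e in others if e <= a_start])
    # Closest following start, never later than the ROI end.
    new_end = min([roi[1]] + [s for s, _ in others if s >= a_end])
    return (new_start, new_end)


roi = (0, 100)
load = (40, 42)  # a 2-minute load, shorter than eps = 5
only = stretch_single(roi, load, [], eps=5)
first = stretch_single(roi, load, [(60, 80)], eps=5)
last = stretch_single(roi, load, [(10, 30)], eps=5)
between = stretch_single(roi, load, [(10, 30), (60, 80)], eps=5)
kept = stretch_single(roi, (40, 70), [(10, 30)], eps=5)
```

The original span (40, 42) corresponds to the reliable portion, and the difference between it and the returned span corresponds to the possibilistic portions.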
Figure 4.13: Multiple activities situation: defining the interval for estimation. The interval spans the LOAD, LOAD and UNLOAD activities between two other activities on the ROI time-line.

Activities whose original duration is equal to or smaller than ε are referred to as short activities. In contrast, activities whose original duration is bigger than ε are referred to as long activities.

Figure 4.14: Legend: short activities (activity duration ≤ ε), long activities (activity duration > ε) and other activities (activities whose time is not to be estimated).

Several hypotheses can be formulated to estimate the durations of multiple load and unload activities. Depending on the assumptions made, different results can be achieved.

When there are multiple load and/or unload activities without any other activity in between, it implies that such activities are being performed at the same location and hence at the same customer. Thus, the load and unload activities of a group, regardless of their original durations, can be assumed to have the same duration. Such a duration can be obtained by simply dividing the total time interval by the number of activities present in it. The result is shown in figure 4.15 and the duration is given by:

$$\text{new duration} = \frac{\text{interval}}{\#\text{ of activities}} \qquad (4.3)$$

Figure 4.15: Hypothesis 1

Another possible approach is to assume that the activities whose duration is bigger than ε, the long activities, are in coherence with reality. In this way, similarly to what is done for the single activity case, the duration is estimated only for the short activities, using equation 4.4. The duration of the long load and unload activities, in blue, is kept constant and the short activity, in green, is stretched to fill the empty space of the interval.
$$\text{short activities duration} = \frac{\text{interval} - \text{long activities durations}}{\#\text{ of short activities}} \qquad (4.4)$$

Figure 4.16: Hypothesis 2

In the previous approaches, all activities were considered and assumed to have happened. For situations with multiple load and/or unload activities, it was often noticed that a long activity is followed by one or two short activities, or the other way around. Such a pattern can be taken as an indicator of activities that were recorded by mistake by the user. The long activity recorded represents the load/unload activity actually performed.

To emulate such situations, two hypotheses are proposed: i) the short activities are simply dismissed and the long activities keep the same duration, see figure 4.17; or ii) the short activities are dismissed and the duration of the long activities is estimated based on the available time interval, see figure 4.18. In the latter case, the new duration for the long activities is given by:

$$\text{long activities duration} = \frac{\text{interval}}{\#\text{ of long activities}} \qquad (4.5)$$

Figure 4.17: Hypothesis 3

Figure 4.18: Hypothesis 4

4.4 Clustering Locations

At this point each load and unload activity has an estimated duration and it is known where the activity took place. Still, it is not possible to predict the duration of a load or unload activity for a specific customer. In order for the extracted knowledge to be used in software applications for transportation planning, the information must be categorized by location so that it is possible to obtain probability density functions of the estimated service times for specific customers.
This can be done using customer locations. Each customer has a fixed location where trucks perform load and unload activities; hence, estimated service times can be clustered according to their latitude and longitude.

In section 4.2, an activity was initially defined as the set of consecutive data points between a "Start of activity" event and an "End of activity" event. When passing it to its atomic representation, each activity is given a specific location by averaging the locations of the activity data-points. Figure 4.19 shows such locations for load activities performed at Schiphol Airport, where each circle represents a load activity.

Figure 4.19: Mean locations of load activities at Schiphol Airport (latitude [°] against longitude [°]).

Since there is no prior information about customer locations, load and unload activities cannot be linked to customers. However, it seems reasonable to assume that such load and unload activities are performed in the surroundings of a customer location. By using clustering algorithms, as explained previously in section 1.3, the activities can be grouped according to their latitude and longitude. Each group will then represent a customer.

Having no information about the number of customers present in the data set, it is not possible to pre-define a number of clusters. Hence, hierarchical clustering is used. This method is based on the idea that objects that are nearby are more related than those that are farther away.

After gathering the necessary information from each trajectory, load and unload activities are filtered from the activity table (table 4.2). Then, for each type of activity, in this case load and unload, the clustering is performed. Each load and unload activity is seen as an object with two coordinates: latitude and longitude. This clustering method consists of three main steps.
Firstly, the similarity between every pair of objects in the data set is computed by evaluating the Euclidean distance, which measures the ordinary straight-line distance between two objects. Since both variables, latitude and longitude, are measured in the same units, no normalization is used. Secondly, using the distance information acquired previously, clusters are grouped in a bottom-up fashion (agglomerative clustering) based on the cluster that contains the closest pair of objects (single linkage). Those newly formed clusters are then linked to each other, by using an average distance between clusters, until every object in the data set is linked. A cluster can be described largely by the maximum distance needed to connect parts of the cluster. At different distances, different clusters will be formed. This creates a hierarchical tree, known as a dendrogram, where the horizontal axis represents the indices of the objects and the vertical axis indicates the distance between the objects. The height of each "U" represents the distance between the two data points being connected. Figure 4.20 shows a hierarchical tree for the unload activities at Schiphol Airport using the Euclidean distance as the measure and single linkage.

Figure 4.20: Dendrogram for unload activities at Schiphol Airport (vertical axis: distance, scale ×10^-3; horizontal axis: object indices).

Finally, to partition the data into the desired clusters, the branches of the hierarchical tree are pruned at a specific value. This assigns all the objects below each cut to a single cluster. This value, in this case, is given by the maximal distance between two activities for them to be considered part of the same customer. Since the clustered variables, latitude and longitude, are in degrees, the cut-off value has to be set accordingly. For the assumption that each cluster represents a customer to be reasonable, the value at which the hierarchical tree is cut must be small.
By setting it to 0.0005 degrees, approximately 56 meters, the results obtained for the load activities at Schiphol Airport are shown in figure 4.21. In this way, it is assumed that customers are spaced at least 56 meters apart. However, it is possible to perform the following studies at different granularities. If one wished to obtain the distribution of activity times for a region, like Schiphol Airport, instead of a single customer, the cut-off value would simply have to be set higher. That is one of the biggest advantages of using hierarchical clustering: it makes it possible to analyse the data at various levels of granularity.

Figure 4.21: Load activities from Schiphol Airport clustered (number of clusters = 38; latitude [°] against longitude [°]). (a) Load clusters at Schiphol Airport. (b) Load clusters at Schiphol Airport, zoom 1. (c) Load clusters at Schiphol Airport, zoom 2.

Different clusters are shown in different colors and each circle represents a load activity. As can be seen in figure 4.21c, despite customers being very close to each other, the clustering method distinguishes them successfully. In this area of the airport it is possible to identify six areas where load activity is particularly dense. Such areas are possible customer locations.

Once the coordinates of the load and unload activities are clustered, the data is rearranged according to the cluster number. Load and unload activities are assigned to their corresponding cluster together with their information: the coordinates where the activity was performed, the identification of the truck, the type of activity and its duration.
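The effect of the cut-off value can be illustrated with a standard-library sketch. It assumes pure single linkage, for which cutting the dendrogram at a given height is equivalent to joining every pair of points that is connected by a chain of neighbours no farther apart than the cut-off. The function name `cluster_by_cutoff` and the coordinates are invented for the example, and it is not claimed that this reproduces the linkage settings used for figure 4.21.

```python
from math import hypot


def cluster_by_cutoff(points, cutoff):
    """Label (latitude, longitude) points by single-linkage cut at `cutoff`.

    Points share a label when a chain of neighbours, each at most `cutoff`
    degrees apart (Euclidean distance), connects them.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (lat1, lon1), (lat2, lon2) = points[i], points[j]
            if hypot(lat1 - lat2, lon1 - lon2) <= cutoff:
                parent[find(i)] = find(j)

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]


# Hypothetical mean activity locations (degrees)
activities = [
    (52.30300, 4.75100),
    (52.30320, 4.75110),
    (52.30360, 4.75130),
    (52.29500, 4.76500),
]
fine = cluster_by_cutoff(activities, 0.0005)
finer = cluster_by_cutoff(activities, 0.0003)
coarse = cluster_by_cutoff(activities, 0.05)
```

In the first call the first and third points are more than 0.0005 degrees apart but still share a cluster through the second point, which is the chaining behaviour typical of single linkage. Raising the cut-off merges everything into one region-level cluster.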
A table with the activities is then built for each cluster, and it represents the activity information of a hypothetical customer.

Latitude | Longitude | Truck ID | Activity Type | Activity Duration [min]
52.29663 | 4.74278 | 1141 | Unload | 61.73
52.29837 | 4.74696 | 1148 | Load | 121.03
52.29811 | 4.74633 | 1153 | Load | 80.88
... | ... | ... | ... | ...
52.29817 | 4.74666 | 1157 | Unload | 100.83

Table 4.3: Customer table: the table contains the data from activities performed in the customer area

Chapter 5

Real World Examples

This chapter shows the results of the STEL algorithm for the previously described data. To evaluate the results obtained, the analysis focuses on the load and unload activities performed at the KLM Cargo site, in the Amsterdam Airport Schiphol, and at the Port of Rotterdam. These locations were chosen to demonstrate the ability of the STEL algorithm to handle different types of logistics. Some results are also shown for a single truck case, in which the algorithm is able to produce a complete activity time-line for the whole trajectory.

For the estimated durations of load and unload activities, the results are shown in the form of histograms. Each bar has a width of five minutes and its height represents the percentage of load/unload activities whose duration falls in the given interval.

5.1 Amsterdam Airport Schiphol

Figure 5.1 shows the location of the analysed cluster representing the KLM Cargo site. As can be seen in figure 5.1b, most of the load activities are concentrated in the parking lot area where the trucks are stopped.
Figure 5.1: Load activities from Schiphol Airport (latitude [°] against longitude [°]). (a) Location of the KLM Cargo site cluster (all loads and KLM Cargo loads). (b) KLM Cargo site satellite view and load locations.

The original load and unload durations for this location are shown in figure 5.2. As can be seen, a large share of the load and unload activities, 45% and 28% respectively, originally falls in the first five-minute interval. This clearly illustrates the problem of human influence in this type of event log. Such behaviour has to be taken into account, since it is not reasonable to accept those load and unload durations as real. Instead of introducing the start and end of activity events when they occur, users often introduce those events before or after accomplishing the load or unload tasks. This results in a sequence of start and end of activity events that are only a few seconds apart. Hence, the duration cannot simply be calculated as the time difference between such events.

To overcome this problem, the STEL algorithm combines the event log data with the spatio-temporal data. From the spatial coordinates and time-stamps of each data point, it is possible to calculate the average speed of the truck between two data points. By evaluating the average speed, it is possible to identify areas in which the truck is stopped or moving at slow speed in the surroundings of the customer. Once such areas, the regions of interest, are known, a time-line can be built for each ROI. Truck regions of interest are then described by a start and end date, from the spatio-temporal data, and by a sequence of events from the event log.
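The average speed between two data points mentioned above can be sketched with the haversine formula of section 1.2.1. The function names, the mean Earth radius of 6371 km and the sample points are assumptions of this example. The comparison of the speed with sbound is shown only as an illustration, since equation 2.1 is not reproduced here.

```python
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def average_speed_kmh(p, q):
    """Average speed between data points p and q, each (datetime, lat, lon).

    The two time-stamps must differ, otherwise the speed is undefined.
    """
    hours = (q[0] - p[0]).total_seconds() / 3600
    return haversine_km(p[1], p[2], q[1], q[2]) / hours


# Two hypothetical data points one minute and 0.0005 degrees of latitude apart
p = (datetime(2013, 5, 2, 18, 24, 0), 52.3000, 4.7500)
q = (datetime(2013, 5, 2, 18, 25, 0), 52.3005, 4.7500)
distance_m = haversine_km(p[1], p[2], q[1], q[2]) * 1000
speed = average_speed_kmh(p, q)
slow_link = speed <= 15  # illustrative comparison with sbound = 15 km/h
```

The distance between the two sample points also gives a rough check of the cut-off used for clustering: 0.0005 degrees of latitude corresponds to about 56 meters.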
Activities that occurred inside the region of interest can be extracted from the event sequence, making it possible to estimate the duration of load and unload activities as described in section 4.3.

Figure 5.2: PDF of the original activity durations from the KLM Cargo site. (a) Original load durations. (b) Original unload durations.

The following figures present the estimated load and unload durations for the same cluster, using the hypotheses formulated in section 4.3.2 for the estimation of multiple activity durations. In all cases, the duration of single activities is estimated according to the process described in section 4.3.1.

Figure 5.3: PDF of the estimated activity durations from the KLM Cargo site using hypothesis 1. (a) Estimated load durations. (b) Estimated unload durations.

Figures 5.3a and 5.3b show the estimated load and unload durations, respectively, for hypothesis 1: assuming that the durations of the activities in a group are equal. As can be seen, the histogram obtained is quite different from the original one and much more realistic. 45% (28%) of the load (unload) activities performed at the KLM Cargo site had an original duration between 0 and 5 minutes. Using hypothesis 1, that percentage drops to about 4.5% (5.5%). For both load and unload activities, the overall shape of the histogram is kept, despite the increase in the percentages for durations above 5 minutes. This is because, for multiple load and/or unload activities, the durations of the activities in each group are estimated equally.

Figure 5.4: PDF of the estimated activity durations from the KLM Cargo site using hypothesis 2. (a) Estimated load durations. (b) Estimated unload durations.

Hypothesis 2, however, results in a histogram that is relatively skewed to the right.
As stated previously, it is common for multiple activity situations to be composed of one long activity and one or more short activities. This, together with the fact that in hypothesis 2 only the durations of short activities are estimated, results in a more constrained calculation. For instance, if a 1 hour interval contains one long activity of 30 minutes and two short activities of 5 minutes each, the resulting duration for each short activity is 15 minutes, since 30 minutes are already "taken" by the long activity. With hypothesis 1, in contrast, those same short activities would be estimated at 20 minutes each, since all activities are estimated equally. For this reason, although the percentages of load and unload activities in the 0 to 5 minutes range go down from their original values to about 10% and 8%, respectively, the drop is not as pronounced as with hypothesis 1.

Figure 5.5: PDF of the estimated activity durations from the KLM Cargo site using hypothesis 3. (a) Estimated load durations; (b) Estimated unload durations.

Figure 5.6: PDF of the estimated activity durations from the KLM Cargo site using hypothesis 4. (a) Estimated load durations; (b) Estimated unload durations.

In the two previous hypotheses, all the original load and unload activities are kept, regardless of their original duration, so the total number of activities is the same in figures 5.2, 5.3 and 5.4. The same cannot be said of hypotheses 3 and 4. Activities that are part of a group (i.e. a multiple activity situation) and whose duration is smaller than or equal to ε are considered non-existent, as explained in section 4.3.2. This results in a smaller total number of load and unload activities. When comparing hypothesis 3 with hypothesis 4, the percentage of load and unload activities with a duration in the 0 to 5 minutes range is exactly the same, since in both cases short activities within a group are excluded.
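The worked example given earlier for hypotheses 1 and 2 (a 1 hour interval with one 30 minute activity and two 5 minute activities) can be reproduced with a short sketch. It only covers those two hypotheses as they are described in this section; the function names and the choice of 5 minutes for the threshold ε are assumptions for illustration, and the full definitions belong to section 4.3.2.

```python
def estimate_equal(interval_min, logged_durations):
    """Hypothesis 1: every activity in the group gets an equal share of the interval."""
    n = len(logged_durations)
    return [interval_min / n] * n


def estimate_short_only(interval_min, logged_durations, epsilon_min):
    """Hypothesis 2: long activities keep their logged duration; the short ones
    (duration <= epsilon_min) share the remaining time equally."""
    long_total = sum(d for d in logged_durations if d > epsilon_min)
    n_short = sum(1 for d in logged_durations if d <= epsilon_min)
    if n_short == 0:
        return list(logged_durations)
    share = max(interval_min - long_total, 0.0) / n_short
    return [d if d > epsilon_min else share for d in logged_durations]


# Example from the text: 60 minute interval, logged durations of 30, 5 and 5 minutes.
logged = [30, 5, 5]
h1 = estimate_equal(60, logged)
h2 = estimate_short_only(60, logged, epsilon_min=5)  # epsilon value is hypothetical
print(h1)  # [20.0, 20.0, 20.0]
print(h2)  # [30, 15.0, 15.0]
```

In both cases the estimated durations add up to the full 60 minute interval; only the way the time is divided among the activities changes.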
Between hypotheses 3 and 4, the difference in percentages becomes noticeable for activities whose duration is longer than 50 minutes. The reason is that in hypothesis 4, after the short activities are excluded, the long activities have their durations extended to fill the empty space in the interval, resulting in higher means and medians (see figures 5.5 and 5.6).

The results presented so far refer to the durations of single load and unload activities at the KLM Cargo site. However, it is known that trucks often perform more than one load/unload activity at a customer. It would be useful to have the expected service time at each customer rather than the expected time of a single load/unload activity. Hence, by summing the durations of the load and unload activities performed during a visit of the truck to the customer, the service time at that customer is obtained. Using the original activity durations from the KLM Cargo site, the resulting service times are those in figure 5.7. Here, load and unload activities are not distinguished, since within one service the same truck can perform both. As expected, since the original durations are being used, the major part of the service times falls in the 0-5 minutes interval.

Figure 5.7: PDF of the original service times from the KLM Cargo site.

When the estimated load and unload durations from the various hypotheses are used, the service times get longer, since the estimated activity durations are longer as well. With hypotheses 1 and 2, the service time estimates are exactly the same. This was expected, because in both hypotheses the time interval available for estimation is completely filled, whether the estimation is applied equally to every activity (hypothesis 1) or only to the short activities (hypothesis 2). Hypotheses 3 and 4 show shorter service times, since the short activities are eliminated. Nevertheless, the results are quite similar whichever hypothesis is used.
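The service time calculation described above, together with the 5-minute binning used in the histograms, can be sketched as follows. The record layout `(visit_id, kind, duration_min)`, the function names and the sample values are assumptions for illustration, not the data structures used in the thesis.

```python
from collections import defaultdict


def service_times(activities):
    """Sum the load and unload durations of each visit.
    activities: iterable of (visit_id, kind, duration_min) records."""
    totals = defaultdict(float)
    for visit_id, _kind, duration_min in activities:
        totals[visit_id] += duration_min  # load and unload are not distinguished
    return dict(totals)


def binned_shares(values, bin_width_min=5):
    """Percentage of values falling in each [lower, lower + bin_width) interval."""
    counts = defaultdict(int)
    for v in values:
        counts[int(v // bin_width_min) * bin_width_min] += 1
    n = len(values)
    return {lower: 100.0 * c / n for lower, c in sorted(counts.items())}


# Made-up records for three visits to one customer.
activities = [
    ("v1", "load", 20.0),
    ("v1", "unload", 20.0),
    ("v2", "load", 3.0),
    ("v3", "unload", 12.0),
    ("v3", "load", 4.0),
]
totals = service_times(activities)
shares = binned_shares(list(totals.values()))
print(totals)
print(shares)
```

The keys of `shares` are the lower bounds of the bins in minutes, so the key 0 corresponds to the 0-5 minutes interval discussed in the text.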
Figure 5.8: PDF of the estimated service times for KLM Cargo. (a) Estimated service times using hypothesis 1; (b) using hypothesis 2; (c) using hypothesis 3; (d) using hypothesis 4.

5.2 Port of Rotterdam

In contrast to the Amsterdam Schiphol Airport, in the port of Rotterdam the amount of load and unload activities present in the data is very limited, as shown in figure 5.9. Apart from that, there are significant differences between airport and harbour logistics. In airports, load and unload tasks are carried out at specific sites, where the logistics warehouses are located. The activity locations are therefore highly concentrated in those areas, which makes it possible to identify the clients' warehouses by clustering the activity locations. In harbours, however, the cargo usually consists of containers that are spread over the port area. The containers arrive on cargo ships that are unloaded using ship-to-shore cranes. When a load task is to be performed, specialized vehicles transport the containers from their location to the logistics trucks. This transfer is the actual load activity from the truck's point of view.

Figure 5.9: Rotterdam Load & Unload Activities (legend: Load Activities, Unload Activities). Axes: Longitude [°], Latitude [°].

Hence, unlike at the airport warehouses, load and unload tasks are not performed at the same locations. In figure 5.9 it is possible to identify three main unload locations and two load locations. Due to the different topology of the port of Rotterdam in comparison with Schiphol Airport, load and unload activities are more spread out, hence the cutoff value has to be increased. The load and unload locations that are going to be characterized are shown in figures 5.10a and 5.10b, respectively.
Figure 5.10: Activity locations at the port of Rotterdam. (a) Load locations at the port of Rotterdam; (b) Unload locations at the port of Rotterdam. Axes: Longitude [°], Latitude [°].

Figure 5.11: PDF of the original activity durations at the port of Rotterdam. (a) Original load durations; (b) Original unload durations.

Comparing the original activity durations in figure 5.11 with the estimated ones in figure 5.12 shows that they are quite similar. In the load case, some of the activities are shifted from the 0-5 min interval to the 5-10 min interval, but the average stays similar. In the unload case, the histogram has exactly the same shape. This reveals that the users are recording the events in accordance with reality, leaving no "empty space" between activities.

Figure 5.12: PDF of the estimated activity durations using hypotheses 1, 2, 3 and 4. (a) Estimated load durations; (b) Estimated unload durations.

Another particularity is that, at the port of Rotterdam, trucks only perform a single load/unload activity at each load/unload location. Unlike at the airport warehouses, where the same truck can load/unload different types of cargo (e.g. packages) at the same location, in ports the trucks cannot transport more than one container at a time. As a result, the multiple activity situation does not occur. The formulated hypotheses only differ from each other in the estimation method for the multiple activity situation; hence, in the port of Rotterdam case, all hypotheses give the same results. In addition, since trucks only perform one load/unload task per service, for the reasons stated above, the estimated service times are the same as the estimated load and unload times.
5.3 International Truck

The prediction of activity durations with probability density functions is only possible where enough data from load and unload activities is available. However, the STEL algorithm can also be useful in single vehicle situations. Figure 5.13 is a complete activity time-line for the international truck over the time span in which the data was acquired.

Figure 5.13: International truck (TID 1141) activity time-line for the whole event log. Legend: Break, Load, Unload, Rest, Arrival, Wait, Drive, Refuelling. Axis: Time [Hours].

It is an efficient way to evaluate the overall productivity of vehicles, since information is available on all the performed activities and their durations. It is also possible to determine which users are handling the logging system incorrectly by evaluating the empty spaces between activities. By looking at the activity time-lines of regions of interest at specific locations, companies are able to identify scheduling problems from the logged wait activities. Evaluating the regions of interest also gives an increased awareness of the amount of non-productive and productive time while the truck is stopped. For the international truck, in a time span of 240 hours (10 days) the truck was stopped for approximately 155 hours. Of the total stopped time, 7 hours were related to non-productive activities such as waiting and signing up, 31 hours were productive time divided between load and unload activities, 90 hours were sleep logged as rest activities, and during 27 hours no activities were logged. Figure 5.14 shows these results.

Figure 5.14: Productivity of the international truck. Rest time 58%, productive time 20%, time with no activity logged 17%, non-productive time 4%.

Chapter 6 Discussion

This thesis presents a spatio-temporal event log mining algorithm called STEL.
STEL includes a novel framework that handles real-world logistics data to obtain probability density functions of activity and service durations based on the location where the activities were performed. Such results are extremely important for the development of algorithms that solve planning and scheduling problems such as the vehicle routing problem. Using probability density functions as input, these algorithms are capable of producing feasible planning solutions that take into account customers' time-windows and use stochastic service times derived from the probability density functions. In addition, if travel times were also estimated in this work, the results would give such algorithms the information needed to incorporate stochastic travel times.

The problem of estimating activity durations is not straightforward, because the event logs depend on humans. Activities are identified by specific pairs of events, and the log dates of those events are essential to estimate the durations. However, although most events are logged automatically by the system itself, the logging of activities external to the log system relies on the user, since the system is not aware of such tasks. The most significant issue to consider is the event log uncertainty induced by the human factor, that is, uncertainty in the time at which the events are logged. In a naive approach, where durations are estimated as the elapsed time between the logging of the first and second event of the pair, the estimates obtained would in general be very short. This shows the tendency of users to log both events before (or after) actually carrying out the activity, rather than logging a "start" event at the start of the activity and an "end" event at its end. The STEL algorithm tackles this problem by creating activity time-lines for the time-windows of the trajectory in which activities take place.
Such portions of the trajectory are referred to as regions of interest. Assuming the log system can log all types of activities performed by the user, activity time-lines should always be completely filled. The presence of empty time between activities is attributed to the human effect on the logging of events. By defining a subset of process-related activities, such as load and unload, their durations can be estimated from the amount of empty time in their neighbourhood. The time-windows defined by the regions of interest, and the activities that do not belong to the subset, serve as constraints on the duration estimation for the activity subset.

In order to identify regions of interest, the STEL algorithm makes use of the spatio-temporal data acquired by a global positioning system. Regions of interest are defined in terms of speed, which is obtained using the time-differential method. Since GPS technologies have an associated positioning error, the distances obtained between two data samples are not exact, and differentiation amplifies this error when the average speed is calculated. To avoid misleading calculations, the STEL algorithm filters the spatio-temporal data for samples recorded at a high output rate and uses the previously calculated speed, so that the errors are minimized. The speeds obtained with this method are therefore not exact, but they are reliable enough for this application.

6.1 Conclusions

The estimation process relies on assumptions that have to be made with the application in mind. For the multiple activity estimations, where there is a lot of uncertainty due to the number of activities, assumptions have to be made by looking at possible scenarios that lead to such activity time-lines. In this work, four hypotheses focusing on the short activities were formulated. They differ mainly in two properties: whether or not the existence of short activities is taken into account, and whether or not the duration of long activities is estimated.
The major difference in the results arises when short activities are considered and long activity durations are not estimated (hypothesis 2), which leads to higher probabilities in the short time ranges. The reason is a specific phenomenon of the event log used: users commonly log a long activity followed by one or two short activities at the end of the ROI, so the short activities have no room to "expand", which leads to a higher number of activities with estimated durations of up to 20 minutes. Apart from that, the results obtained were very consistent, revealing that the time constraints imposed by the time-windows and by the other activities (whose duration was not estimated) create a well-conditioned problem that leads to similar results regardless of the hypothesis used to estimate the durations in multiple activity situations.

The utility of the STEL algorithm was demonstrated in three real-world scenarios: a fleet of logistics trucks working in the Amsterdam Schiphol Airport area, container trucks in the port of Rotterdam, and a single international truck performing activities across Europe. The probability density functions obtained are real-world based approximations that can be used as input to software aids that solve planning and scheduling problems such as vehicle routing with stochastic time windows.

It was also shown that the algorithm can be used to estimate travel times between customer locations. Such results can be of interest for applications that deal with stochastic travel times. STEL also supports iterative and flexible exploration of probability density functions by making it possible to refine parameters of the clustering algorithm, such as the level of granularity in terms of location. This allows information to be grouped at different levels, providing a big-picture analysis of whole locations rather than a customer-by-customer analysis.
In addition, STEL helps to understand different types of load and unload activities by characterizing both activity and service durations. As in the case studies, it is possible to establish assumptions about the type of cargo being handled by comparing both analyses. Furthermore, it is possible to distinguish load and unload sites by looking at the distribution of the locations where activities were performed, as well as the number of activities performed per service. The activity time-lines created for regions of interest and for complete trajectories also allow a better understanding of how the trucks are working, giving awareness of problems related to the workflow.

6.2 Future Work

These use cases show that STEL can lead to insights; however, many topics remain for future work. While STEL supports interactive parametrization of the mining algorithm, choosing the right parameters might be a challenge. A next goal for STEL is to implement fuzzy logic for the classification of the trajectory links. Apart from the average speed, the classification could also be based on additional parameters such as the average link acceleration and the link length.

This work demonstrated the lack of effectiveness of sequential pattern mining algorithms in finding patterns directly from the event logs. Sequences are extremely large and complex due to the number of event types. The creation of an activity log makes it possible to apply such algorithms to a sequence of activities rather than events, producing higher-level patterns. For instance, if such sequences are defined as the list of activities prior to a specific activity, it might be possible to find a frequent activity sequence that usually leads to the occurrence of that specific activity. Such work would be relevant to the prediction of essential tasks such as sleeping.
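The idea of looking at the activities that precede a specific activity can be illustrated with a small sketch. It simply counts the contiguous runs of activities that occur immediately before a target activity; it is not one of the sequential pattern mining algorithms discussed in the thesis, and the function name, the window length and the sample activity log are assumptions for illustration.

```python
from collections import Counter


def preceding_sequences(activity_log, target, window=2):
    """Count the `window` activities that occur immediately before each
    occurrence of `target` in a time-ordered list of activity names."""
    prefixes = Counter()
    for i, activity in enumerate(activity_log):
        if activity == target and i >= window:
            prefixes[tuple(activity_log[i - window:i])] += 1
    return prefixes


# Made-up activity log for one truck, in chronological order.
activity_log = [
    "Drive", "Unload", "Wait", "Rest",
    "Drive", "Load", "Drive", "Unload", "Wait", "Rest",
]
before_rest = preceding_sequences(activity_log, target="Rest", window=2)
print(before_rest.most_common(1))
```

A pattern mining algorithm would additionally allow gaps between the activities and apply a minimum support threshold; the counter above only shows the shape of the input such an algorithm would work on.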
Employing visualization tools to analyse the generated results would be useful, since it provides a more user-friendly experience and may allow a better analysis of the clusters and their respective probability density functions.

It would also be of great interest to include more event types in the duration estimation of the activities, rather than looking only at a single class of events (activity-related events). Perhaps the presence of other events in the event sequence would help to reduce the uncertainty related to the log times of the events.
Appendix A Event List

Table A.1

Event Description | Activity ID | New Activity ID | No of Occurrences
Acceleration limit violation | 53 | 1 | 108
Activity Midnight | 9 UN | 2 | 92
Activity Midnight Arrive | 9 1 AANK | 3 | 2
Activity Midnight Break | 9 1 PA | 4 | 1
Activity Midnight Drive | 9 DR | 5 | 21
Activity Midnight Gas | 9 1 TA | 6 | 1
Activity Midnight Load | 9 1 LAD | 7 | 25
Activity Midnight Log Out | 9 1 LOZO | 8 | 4
Activity Midnight Rest | 9 1 RU | 9 | 128
Activity Midnight Sign Up | 9 1 MELD | 10 | 2
Activity Midnight Unload | 9 1 LOS | 11 | 23
Activity Midnight Wait | 9 1 WA | 12 | 1
Basic record | 0 | 13 | 32293
Cancellation of | 11 UN | 14 | 7354
Cancellation of Arrive | 11 1 AANK | 15 | 16
Cancellation of Break | 11 1 PA | 16 | 7
Cancellation of Costs | 11 1 KSTN | 17 | 1
Cancellation of Garage | 11 1 GAR | 18 | 1
Cancellation of Gas | 11 1 TA | 19 | 4
Cancellation of Load | 11 1 LAD | 20 | 75
Cancellation of Log In | 11 LI | 21 | 8
Cancellation of Log Out | 11 1 LOZO | 22 | 13
Cancellation of Passage | 11 1 OT | 23 | 1
Cancellation of Rest | 11 1 RU | 24 | 59
Cancellation of Sign Up | 11 1 MELD | 25 | 17
Cancellation of Unload | 11 1 LOS | 26 | 53
Cancellation of Wait | 11 1 WA | 27 | 18
Contact OFF | 72 | 28 | 8942
Contact ON | 71 | 29 | 8946
Crossed country border | 44 | 30 | 28
Deceleration limit violation | 54 | 31 | 69
Driver switch | 3 | 32 | 42
Driving times driving violation | 84 | 33 | 357
Driving times driving warning | 83 | 34 | 71
Driving times state event | 82 | 35 | 18146
Driving times total driving warning | 85 | 36 | 165
Driving without any driver logged in | 8 | 37 | 73
End of | 13 TJ | 38 | 2293
End of Arrive | 13 1 AANK | 39 | 427
End of Break | 13 1 PA | 40 | 103
End of Drive | 13 DR | 41 | 4012
End of Garage | 13 1 GAR | 42 | 36
End of Gas | 13 1 TA | 43 | 203
End of Load | 13 1 LAD | 44 | 1019
End of Log In | 13 LI | 45 | 522
End of Log Out | 13 LO | 46 | 491
End of Passage | 13 1 OT | 47 | 12
End of Rest | 13 1 RU | 48 | 369
End of Sign Up | 13 1 MELD | 49 | 284
End of Unload | 13 1 LOS | 50 | 869
End of Wait | 13 1 WA | 51 | 193
End of peak RPM limit violation | 61 | 52 | 782
End of speed limit violation | 60 | 53 | 315
Engine idle violation | 55 | 54 | 21
GPRS status info | 200 | 55 | 480
Invalid driver card | 6 | 56 | 275
Join of | 14 UN | 57 | 20
Join of Drive | 14 DR | 58 | 2
Join of Gas | 14 1 TA | 59 | 1
Join of Load | 14 1 LAD | 60 | 3
Join of Log In | 14 LI | 61 | 2
Join of Passage | 14 1 OT | 62 | 1
Join of Rest | 14 1 RU | 63 | 6
Leaving of | 15 UN | 64 | 13
Leaving of Gas | 15 1 TA | 65 | 1
Leaving of Load | 15 1 LAD | 66 | 4
Leaving of Rest | 15 1 RU | 67 | 7
Leaving of Sign Up | 15 1 MELD | 68 | 1
Mileage violation | 94 | 69 | 1
Navigation ETA update | 42 | 70 | 3519
Navigation cancelled | 41 | 71 | 199
Navigation destination reached | 43 | 72 | 188
Navigation started to given destination | 40 | 73 | 432
Report of | 12 1 LAZO | 74 | 28
Report of Arrive | 12 1 AANK | 75 | 8
Report of Break | 12 1 PA | 76 | 2
Report of Garage | 12 1 GAR | 77 | 5
Report of Gas | 12 1 TA | 78 | 2
Report of Load | 12 1 LAD | 79 | 27
Report of Log Out | 12 1 LOZO | 80 | 8
Report of Passage | 12 1 OT | 81 | 2
Report of Rest | 12 1 RU | 82 | 373
Report of Sign Up | 12 1 MELD | 83 | 6
Report of Unload | 12 1 LOS | 84 | 15
Report of Wait | 12 1 WA | 85 | 47
Start of | 10 TJ | 86 | 9645
Start of Arrive | 10 1 AANK | 87 | 443
Start of Break | 10 1 PA | 88 | 110
Start of Costs | 10 1 KSTN | 89 | 1
Start of Drive | 10 DR | 90 | 4004
Start of Garage | 10 1 GAR | 91 | 37
Start of Gas | 10 1 TA | 92 | 206
Start of Load | 10 1 LAD | 93 | 1095
Start of Log In | 10 LI | 94 | 530
Start of Log Out | 10 LO | 95 | 507
Start of Passage | 10 1 OT | 96 | 15
Start of Rest | 10 1 RU | 97 | 434
Start of Sign Up | 10 1 MELD | 98 | 302
Start of Unload | 10 1 LOS | 99 | 923
Start of Wait | 10 1 WA | 100 | 212
Start of peak RPM limit violation | 51 | 101 | 769
Start of speed limit violation | 50 | 102 | 333
System Shutdown | 73 | 103 | 223
Task Accepted | 21 | 104 | 3851
Task Busy | 23 | 105 | 3729
Task Cancelled | 24 | 106 | 186
Task Finished | 25 | 107 | 3450
Task Received from terminal | 20 | 108 | 1983
Text message | 300 | 109 | 1850
Text message read | 302 | 110 | 1905
Text message received | 301 | 111 | 1881
Trailer tethered | 98 | 112 | 668
Trailer untethered | 99 | 113 | 676
Update of peak RPM limit violation | 66 | 114 | 9
Update of speed limit violation | 65 | 115 | 76
User login | 1 | 116 | 372
User logout | 2 | 117 | 252
empty | 7 | 118 | 99

Figure A.1: Link Speeds and Classifications with α = 0.5. (a) TID 1141 Link Speeds; (b) TID 1141 Link Classification; (c) TID 1849 Link Speeds; (d) TID 1849 Link Classification; (e) TID 7234 Link Speeds; (f) TID 7234 Link Classification. Axes: Longitude [°], Latitude [°]; colour scale: Speed [Km/h]; classification legend: Moving Link, Stopped Link.
