Spatio-Temporal Data Mining with Event Logs from High
Volume Logistics Information
Rodrigo Miguel Tavares Gonçalves
Thesis to obtain the Master of Science Degree in
Mechanical Engineering
Supervisors: Prof. João Miguel da Costa Sousa
Prof. Rui Jorge de Almeida e Santos Nogueira
Examination Committee
Chairperson: Prof. João Rogério Caldas Pinto
Supervisor: Prof. João Miguel da Costa Sousa
Member of the Committee: Prof. Carlos Baptista Cardeira
September 2015
Acknowledgments
I would like to express my sincere gratitude to my advisor Prof. João Sousa for the continuous support
during my MSc study and related research. His guidance helped me throughout the research and writing of
this thesis.
Besides my advisor, I would like to thank Prof. Rui Jorge de Almeida for his insightful comments and
encouragement, but also for the hard questions which encouraged me to widen my research from various
perspectives.
This work was performed as part of the DAIPEX project grant funded by Dinalog.
Abstract
In logistics, software aids for transportation planning and scheduling are often based on approximations
and abstractions that do not take into account real-world data. The aim of this work is to provide an
analysis and methodology, based on real-world data, on how to obtain probability density functions for
the prediction of activity durations. Such information can be used in planning algorithms, such as those
for the vehicle routing problem, that are capable of dealing with stochastic time-windows.
Given a large spatio-temporal database of events, where each event consists of the fields event ID, time,
location, and event type, the aim is to extract valuable information about activity durations. The process
is not straightforward since the log is human-influenced, creating uncertainty about the time at
which the events are logged. In order to overcome this, a novel framework is proposed: it uses the
spatio-temporal trajectories to identify regions of interest (ROIs) based on speed, and builds an ROI
activity time-line using the activities extracted from event logs. The framework's ability to estimate
activity durations was tested in three different environments: the Amsterdam Airport Schiphol, the Port
of Rotterdam and a single vehicle scenario. Experimental results validate the usefulness of the approach
at finding probability density functions for the prediction of activity durations at specific locations.
Furthermore, it provides a detailed activity time-line for the trajectory, or for specific portions such
as regions of interest, that allows close monitoring of the vehicle's workflow.
Keywords: Spatio-Temporal, Event Logs, Logistics, Event Mining, Trajectories.
Contents
Acknowledgments
Abstract
List of Tables
List of Figures
Nomenclature
1 Introduction
1.1 Sequential Pattern Mining
1.1.1 AprioriAll Algorithm
1.1.2 Generalized Sequential Patterns Algorithm
1.1.3 SPADE Algorithm
1.1.4 PrefixSpan Algorithm
1.2 Geographical distance
1.2.1 Haversine formula
1.3 Clustering Methods
1.3.1 Hierarchical Clustering
1.4 Contributions
1.5 Outline
2 Spatio-Temporal Event Log Mining
2.1 Identify Regions of Interest
2.2 Identify Activities
2.3 Characterize Activities
3 Analysis of a Logistics Database
3.1 Spatio-Temporal Data Analysis
3.2 Event Log Analysis
3.2.1 Sequential Pattern Mining in Event Logs
4 Processing High Volume Logistics Data
4.1 Spatio-Temporal Data - Trajectories
4.1.1 Regions of Interest
4.2 Event Logs
4.3 Activity Duration Estimation
4.3.1 Single Activity
4.3.2 Multiple Activities
4.4 Clustering Locations
5 Real World Examples
5.1 Amsterdam Airport Schiphol
5.2 Port of Rotterdam
5.3 International Truck
6 Discussion
6.1 Conclusions
6.2 Future Work
References
A Event List
List of Tables
1.1 Difference between databases
3.1 Data point arrangement in the database
3.2 Load and Unload activities related events
4.1 Example of the identification of activities from the event log data
4.2 Example of activity table
4.3 Customer table: the table contains the data from activities performed in the customer area
List of Figures
1.1 Great Circle: a great circle of a sphere is the intersection of the sphere and a plane which passes through its center point
1.2 Spherical triangle: law of cosines
2.1 Activity duration estimation framework
2.2 Data acquired by a global positioning system
(a) Trajectory database
(b) Trajectory graph
2.3 Framework to extract regions of interest
2.4 Activity time-line of a region of interest
2.5 Empty time: the time available in the neighbourhood of an activity in the ROI time-line
3.1 Data Files
3.2 Data Structure
3.3 Load and Unload event count
(a) Load Events
(b) Unload Events
3.4 Other activity related events count
3.5 Log In & Log Out event count
3.6 Most frequent events in the database
3.7 Activity time distribution
4.1 Effect of the speed filter on an average speed profile of the international truck with = 6
4.2 Average speed profile of the international truck with = 6
4.3 Average Speed Histograms with α = 0.5
(a) All trucks
(b) TruckID 1141
(c) TruckID 7234
(d) TruckID 1849
4.4 Split regions of interest (right figure) due to a low sbound
4.5 Choosing the correct value for the sbound parameter
(a) Number of ROIs per truck against sbound
(b) Total time spent in ROIs per truck against sbound
4.6 An example of extracting regions of interest with T = 0
(a) Regions of interest
(b) Trajectory graph
4.7 Examples of regions of interest for sbound = 15 km/h
(a) Single activity
(b) Multiple activities
(c) Split ROI due to low sbound
(d) Effect of human behaviour
4.8 Averaging latitude and longitude coordinates to obtain a single point representation of the activity location
4.9 ROI activity time-lines examples
(a) Wrongly introduced activities
(b) Correctly introduced activities
4.10 Empty time: the time available in the neighbourhood of an activity in the ROI time-line
4.11 Different types of estimation cases
(a) Single activity case
(b) Multiple activities case
4.12 Single activity situations: different cases when estimating the duration of a single activity
(a) Activity is the only activity in the ROI
(b) Activity is the first activity in the ROI
(c) Activity is the last activity in the ROI
(d) Activity is in between other activities
4.13 Multiple activities situation: defining the interval for estimation
4.14 Legend: short activities, long activities and other activities
4.15 Hypothesis 1
4.16 Hypothesis 2
4.17 Hypothesis 3
4.18 Hypothesis 4
4.19 Mean locations of load activities in Schiphol Airport
4.20 Dendrogram for unload activities at Schiphol Airport
4.21 Load activities from Schiphol Airport clustered
(a) Load Clusters at Schiphol Airport
(b) Load Clusters at Schiphol Airport Zoom 1
(c) Load Clusters at Schiphol Airport Zoom 2
5.1 Load activities from Schiphol Airport
(a) Location of KML Cargo site cluster
(b) KML Cargo site satellite view and load locations
5.2 PDF of the original activity durations from KML Cargo site
(a) Original load durations of KML Cargo site
(b) Original unload durations of KML Cargo site
5.3 PDF of the estimated activity durations from KML Cargo site using hypothesis 1
(a) Estimated load durations
(b) Estimated unload durations
5.4 PDF of the estimated activity durations from KML Cargo site using hypothesis 2
(a) Estimated load durations
(b) Estimated unload durations
5.5 PDF of the estimated activity durations from KML Cargo site using hypothesis 3
(a) Estimated load durations
(b) Estimated unload durations
5.6 PDF of the estimated activity durations from KML Cargo site using hypothesis 4
(a) Estimated load durations
(b) Estimated unload durations
5.7 PDF for the original service times from KML Cargo site
5.8 PDF of the estimated service times for KML Cargo
(a) Estimated service times using hypothesis 1
(b) Estimated service times using hypothesis 2
(c) Estimated service times using hypothesis 3
(d) Estimated service times using hypothesis 4
5.9 Rotterdam Load & Unload Activities
5.10 Activities Locations at the port of Rotterdam
(a) Load locations at the port of Rotterdam
(b) Unload locations at the port of Rotterdam
5.11 PDF for the activities original durations at the port of Rotterdam
(a) Original load durations
(b) Original unload durations
5.12 PDF for the estimated activity durations using hypothesis 1, 2, 3 and 4
(a) Estimated load durations
(b) Estimated unload durations
5.13 International Truck activity time-line for all the event log
5.14 Productivity of the international truck
A.1 Link Speeds and Classifications with α = 0.5
(a) TID 1141 Link Speeds
(b) TID 1141 Link Classification
(c) TID 1849 Link Speeds
(d) TID 1849 Link Classification
(e) TID 7234 Link Speeds
(f) TID 7234 Link Classification
List of Equations
1.1 The Law of Haversines
1.2 Haversine Function
1.3 Haversine of a Central Angle
1.4 Haversine Formula
2.1 Link Classification
2.2 Set of Activities
2.3 Set of Events
2.4 Assigning activities to ROIs
3.1 Cardinality of Load Events
3.2 Cardinality of Activities
4.1 Average Speed
4.2 Speed Filter
4.3 Hypothesis 1
4.4 Hypothesis 2
4.5 Hypothesis 4
Nomenclature
s̄  Average speed in a trajectory link
x  Trajectory link length in km
User-defined parameter: speed filter in seconds
Longitude coordinate in degrees
Latitude coordinate in degrees
A  Set of activities present on the event log
a  An activity present on the event log
A*  Set of activities present on the event log whose duration is to be estimated
C  Set of events that identify a special occurrence of an activity
c  Event that identifies a special occurrence of an activity
e  An event from the event log
F  Set of events that indicate the end of an activity
f  Event that indicates the end of an activity
l  Trajectory link
m  Number of data-points in an activity
N  Number of data-points in a trajectory
p  Data-point from a trajectory
R  Candidate region of interest
r  Earth's radius in km
S  Set of events that indicate the beginning of an activity
s  Event that indicates the beginning of an activity
sbound  User-defined parameter for link classification
T  User-defined parameter: minimum length of stay inside an ROI in minutes
t  Log date of a data-point
Chapter 1
Introduction
Transportation companies often find that their day-to-day transportation execution does not conform to
the transportation plan that they made in advance. To a large extent this is because the software that
aids in the creation of transportation plans does not take into account the real-world complexity of
transportation and logistics [1]. Rather, it uses approximations and abstractions that do not
do justice to that complexity. As a consequence, the transportation plans that are generated by
transportation planning software often lead to violated time windows [2], unnecessary delays, underutilized
transportation capacity, etc. The real-world complexity of transportation planning is caused by the high
level of detail that is required to get executable plans, the size of the instances as found in reality, and
the large volumes of data that must be collected and processed to gather the information required to
create the planning [3].
The aim of this work is to provide an analysis and methodology on how to estimate the duration of
process-related activities, such as load and unload, based on spatio-temporal data and event logs with
uncertainty related to human behaviour. Further, the acquired information is categorized according
to location to obtain probability density functions of activity durations at specific locations. Such
information can be used in software applications for transportation planning, such as solvers for vehicle
routing problems that can handle stochastic time-windows.
Currently, mobile communications and positioning systems are well-established technologies. GPS-equipped
devices are able to provide valuable spatio-temporal data with increasingly finer spatial resolutions.
The use of GPS-enabled devices allows us to describe the movement of an object (i.e., its trajectory)
as a sequence of spatial locations sampled at consecutive time-stamps. Spatio-temporal patterns in
trajectories, which represent movement patterns of objects, can provide useful information for high quality
location based services, such as traffic flow control, location-aware advertising, etc. [4] [5] [6]. In
addition, many operating systems, software frameworks, and programs include a logging system. Event logs
are able to record events taking place in the execution of a system. This provides a chronological record
of a sequence of activities that is essential to gather information about complex systems. The problem
is that, in many cases, events are introduced by humans, the system users, leading to the existence of
uncertainty in the data. When system logs are human dependent, there is no guarantee that the event
records are consistent with reality. For example, if an activity is characterized by a start and an end
event, its duration can be calculated as the time difference between such events. However, if the events
are recorded before, or after, such occurrences, the previous statement no longer holds true.
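As a minimal sketch of the duration calculation described above, the following Python fragment pairs each start event with the next end event and takes the time difference. The log layout, the timestamps and the event type names `load_start` and `load_end` are illustrative assumptions, not the format of the database analysed in this work.

```python
from datetime import datetime

# Hypothetical event log: (event ID, timestamp, event type). The field layout
# and the type names "load_start"/"load_end" are illustrative assumptions.
EVENTS = [
    (1, "2015-03-02 08:00:00", "load_start"),
    (2, "2015-03-02 08:25:00", "load_end"),
    (3, "2015-03-02 09:10:00", "load_start"),
    (4, "2015-03-02 09:12:00", "load_end"),
]


def activity_durations(events, start_type, end_type):
    """Pair each start event with the next end event; return durations in minutes."""
    durations = []
    open_start = None
    # ISO-like timestamps sort chronologically when sorted as strings.
    for _, stamp, kind in sorted(events, key=lambda e: e[1]):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if kind == start_type:
            open_start = when
        elif kind == end_type and open_start is not None:
            durations.append((when - open_start).total_seconds() / 60.0)
            open_start = None
    return durations


DURATIONS = activity_durations(EVENTS, "load_start", "load_end")
```

If the second pair of events in this toy log was entered late by the user, the two-minute value it yields says little about the real activity, which is exactly the uncertainty discussed above.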
To overcome this problem, a new algorithm that combines spatio-temporal and event log
databases is proposed. The goal is to develop a processing mechanism that can efficiently aggregate
information from high-volume business data streams to provide up-to-date management information for
operational business decision makers, in the form of time distributions of expected durations for load and
unload activities at specific locations. The algorithm will be tested in harbour and airport services with
heavily fluctuating, unpredictable behaviour and in road transport processes with high volumes of real-time
data.
1.1 Sequential Pattern Mining
Data mining techniques, also known as knowledge discovery tools in databases, are used in order to
find valid, novel, potentially useful and understandable patterns in data [7]. Sequential pattern mining,
as a sub-field of data mining, is concerned with finding statistically relevant patterns, in the form of
frequent sub-sequences, between data examples where the values are delivered in a sequence. It is strongly
motivated by its utility as a tool to obtain knowledge from customer purchase databases [8], DNA sequences
[9], web logs [10], event logs and medical time series [11]. Sequences are common, occurring in any
metric space that facilitates either total or partial ordering [12]. Acquiring knowledge about them is an
important data mining research problem with broad applications, since the detection of frequent (totally
or partially ordered) sub-sequences might be extremely useful to support decision making problems.
Sequential pattern mining has arisen as a technology to discover such sub-sequences.
Over the last years there has been a substantial increase in temporal, spatial and spatio-temporal data
mining publications due to the continuous growth of this sub-field of data mining. The high volume
of available data, mainly through the internet, and the prominent advantages that data mining
analysis provides to the market are some of the principal reasons for its development. Since there are many
application domains that have a temporal or spatial context, time and space are components that must
be taken into account in data mining processes. In the following paragraph the bases of temporal, spatial
and spatio-temporal data mining are briefly explained.
Temporal data mining is about the analysis of events ordered by one (or more) dimensions of time and
it can be approached in two different ways. On the one hand, it is focused on the discovery of causal
relationships among temporally-oriented events, which corresponds to the discovery of “rules” [13]. On the
other hand, it is the discovery of similar patterns within the same time sequence, giving rise to the term
“sequence pattern mining”. Spatial data mining can be seen as the multi-dimensional equivalent of
temporal data mining; however, due to the complexity of spatial data types, spatial relationships, and
spatial autocorrelation, it turns out to be much more difficult to extract interesting and useful patterns
from this type of dataset. The same problem occurs with spatio-temporal data mining, which requires explicit
or implicit modelling of spatio-temporal autocorrelation and constraints [14].
Many algorithms have been proposed for mining sequential patterns. Broadly, these algorithms can
be divided into two classes according to their method: the Apriori-based candidate generation method [15], [8]
and the Pattern-Growth method [16],[17],[16]. In the following sections algorithms for both approaches are
presented.
1.1.1 AprioriAll Algorithm
Sequential pattern mining was first introduced by Agrawal and Srikant [8] in 1995, over transaction
databases (known as basket data). The aim was to find frequent item sets bought by customers in
order to obtain typical behaviours from the user's point of view. This would support the decision
making problem faced by most large retail organizations. The sequential pattern mining problem was
defined as follows:
“Given a database of sequences, where each sequence consists of a list of transactions ordered by transaction time and each transaction is a set of items, sequential pattern mining is
to discover all sequential patterns with a user-specified minimum support, where the support
of a sequential pattern is the percentage of data-sequences that contain the pattern.” [8]
To find all sequential patterns Agrawal and Srikant divided the mining problem into five phases:
(i) Sort phase - consists in creating customer-sequences, from the original database, by finding the
transactions with the same customer-id and ordering them according to the transaction time;
(ii) Large item set phase - the item sets with the user-specified minimum support, litemsets, are found.
The support for an item set i is defined as the fraction of customers who bought the items in i in a
single transaction;
(iii) Transformation phase - each transaction in the customer-sequences is replaced by the set of all
litemsets contained in that transaction;
(iv) Sequence phase - the sequence phase is the actual mining phase, where the set of litemsets is
used to find the desired sequences (the ones that satisfy the minimum support constraint - large
sequences);
(v) Maximal phase - is to find the maximal sequences among the set of large sequences. A sequence
s is said to be maximal if, in a set of sequences, s is not contained in any other sequence.
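The notions of support and maximality used in the phases above can be illustrated with a short Python sketch. It is not the AprioriAll implementation from [8]: it only checks containment, computes the support of a given pattern, and filters a given set of sequences down to the maximal ones. The function names and the toy customer-sequences are assumptions made for illustration.

```python
def is_contained(pattern, sequence):
    """True if every itemset of `pattern` is a subset of a distinct itemset of
    `sequence`, with the order of itemsets preserved."""
    pos = 0
    for itemset in sequence:
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support(pattern, database):
    """Fraction of customer-sequences that contain `pattern`."""
    return sum(is_contained(pattern, seq) for seq in database) / len(database)


def maximal(sequences):
    """Keep only the sequences not contained in any other sequence of the set."""
    return [s for s in sequences
            if not any(s != t and is_contained(s, t) for t in sequences)]


# Toy customer-sequences (illustrative data): one list of itemsets per customer.
DATABASE = [
    [{30}, {90}],
    [{10, 20}, {30}, {40, 60, 70}],
    [{30, 50, 70}],
    [{30}, {40, 70}, {90}],
    [{90}],
]

# Some sequences whose support in DATABASE is at least 0.4 (not an exhaustive list).
LARGE = [[{30}], [{90}], [{40, 70}], [{30}, {90}], [{30}, {40, 70}]]
MAXIMAL = maximal(LARGE)
```

In this toy example only the two 2-element sequences survive the maximal phase, since each of the shorter sequences is contained in one of them.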
For the sequence phase, two families of algorithms were presented: count-all, where all the large
sequences are taken into account, including the non-maximal sequences, and count-some. The AprioriAll
algorithm, based on the Apriori algorithm to mine association rules [18], was shown in [8] to perform better
than the other two approaches presented there (AprioriSome and DynamicSome). It uses a breadth-first search
strategy to count the support of itemsets and a candidate generation function which exploits the downward
closure property of support.
However, the problem definition presented above had some limitations. The absence of time constraints
and taxonomies, and the rigid definition of a transaction, make it impossible to use these algorithms on
broader problems.
1.1.2 Generalized Sequential Patterns Algorithm
In 1996, Srikant and Agrawal [15] generalized the problem of sequential pattern mining. Time constraints
that specify a minimum and/or maximum time period between adjacent elements in a pattern were added
and, to overcome the rigid definition of a transaction, the restriction that the items in an element of a
sequential pattern must come from the same transaction was relaxed by allowing the items to be present
in a set of transactions whose transaction-times are within a user-specified time window. Hierarchy was
also introduced: given a user-defined taxonomy on items, sequential patterns are allowed to include
items across all levels of the taxonomy. Since many applications require all patterns and their supports,
the count-some algorithms from [8], which find only maximal sequential patterns, were abandoned.
Although it was possible to extend the AprioriAll algorithm to handle time constraints and taxonomies,
incorporating sliding windows was not feasible. Apart from that, the performance of the algorithm was
poor since it had to perform the data transformation on-the-fly during each pass while finding sequential
patterns. The Generalized Sequential Patterns (GSP) algorithm was therefore presented and the problem
definition reformulated:
“Given a database D of data-sequences, a taxonomy T, a user-specified min-gap and max-gap time constraints and a user-specified sliding-window size, the problem of mining sequential patterns is to find all sequences whose support is greater than the user-specified
minimum support. Each sequence represents a sequential pattern, also called a frequent
sequence.” [15]
The GSP algorithm, like AprioriAll, assumes a horizontal database layout, which means that the
database is formed by a set of input-sequences. Each input-sequence has a set of events, along with
the items contained in the event. GSP works in a multiple pass fashion over the data. The first pass
determines the support of each item, that is, the number of data-sequences that include the item. The
items that satisfy the minimum support are the frequent items. Each such item yields a 1-element
frequent sequence, and on each subsequent pass the previous frequent sequences, the seed set, are used
to generate new potentially frequent sequences, called candidate sequences, with one more item than
the seed sequences. During the pass over the data the algorithm computes the support of each
candidate sequence and determines which of them are actually frequent. These frequent
candidate sequences become the seed set for the next pass. When there are no frequent sequences
or no candidate sequences generated, the algorithm terminates. GSP is a complete algorithm in that
it guarantees finding all sequences that have a user-specified minimum support. Furthermore, it is up to
twenty times faster than the previously presented AprioriAll algorithm.
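The multiple-pass scheme described above can be sketched in Python as follows. This is a deliberately simplified illustration and not the GSP algorithm of [15]: every element is assumed to hold a single item, and time constraints, sliding windows and taxonomies are left out. The function names and the toy database are hypothetical.

```python
def contains(sequence, pattern):
    """True if `pattern` is a (not necessarily contiguous) subsequence of `sequence`."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def count(pattern, database):
    """Support count: number of data-sequences that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in database)


def gsp_sketch(database, min_count):
    """Level-wise mining of frequent sequences of single items."""
    items = sorted({item for seq in database for item in seq})
    # First pass: frequent items give the 1-element frequent sequences.
    seeds = [(i,) for i in items if count((i,), database) >= min_count]
    frequent = {}
    while seeds:
        frequent.update({p: count(p, database) for p in seeds})
        seed_set = set(seeds)
        # Join: a and b combine when a without its first item equals b without its last.
        candidates = {a + b[-1:] for a in seeds for b in seeds if a[1:] == b[:-1]}
        # Prune: every subsequence with one item dropped must itself be frequent.
        candidates = [c for c in sorted(candidates)
                      if all(c[:i] + c[i + 1:] in seed_set for i in range(len(c)))]
        # Pass over the data: keep the candidates that are actually frequent.
        seeds = [c for c in candidates if count(c, database) >= min_count]
    return frequent


# Toy data-sequences (illustrative): each tuple is one ordered sequence of items.
DATABASE = [("a", "b", "c"), ("a", "c"), ("a", "b", "c", "d"), ("b", "d")]
FREQUENT = gsp_sketch(DATABASE, min_count=2)
```

The loop stops when a pass produces no frequent candidates, which mirrors the termination condition stated in the text.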
1.1.3 SPADE Algorithm
Using the GSP algorithm as a basis for comparison, Zaki developed SPADE, a new algorithm for fast mining
of sequential patterns in large databases [19]. Both the GSP and AprioriAll algorithms require as
many full database scans as the length of the longest frequent sequence, which is clearly a very expensive
process. SPADE decomposes the original problem into smaller sub-problems using equivalence classes
on frequent sequences, so that all sequences are discovered in only three database scans: one for
frequent 1-sequences, another for frequent 2-sequences, and one more for generating all other frequent
sequences.
In contrast to the previous algorithms, SPADE uses a vertical id-list database format. The sequence
enumeration is done with a lattice-based approach. Each input-sequence in the database has a unique
identifier called sid, and each event in a given input-sequence also has a unique identifier called eid. This
allows the creation of an id-list for each item, where each entry is a (sid,eid) pair in which the item occurs; support
can then be checked via simple id-list joins. Two different search strategies for enumerating the
frequent sequences were used: breadth-first and depth-first search.
Given the vertical id-list database, the 1-sequences can be computed in a single scan by incrementing
the support for each new sid encountered in the id-list. The second step of the algorithm, finding
the frequent 2-sequences, relies on a preprocessing step that gathers all 2-sequences above a user-specified lower
bound. Frequent sequences are then generated by joining the id-lists of all pairs of atoms
(including a self-join) and checking the number of distinct sid values of the resulting id-list against the
minimum support. The sequences found to be frequent at the current level form the atoms of classes for
the next level. This recursive process is repeated until all frequent sequences have been enumerated.
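A minimal sketch of the vertical format and of one kind of id-list join is given below. It only covers the temporal join ("left item followed later by right item"); the full SPADE algorithm of [19] also joins items that occur in the same event and organises the search through equivalence classes. The names `build_id_lists`, `temporal_join` and `support` are hypothetical.

```python
from collections import defaultdict


def build_id_lists(database):
    """Vertical format: map each item to the list of (sid, eid) pairs where it occurs."""
    id_lists = defaultdict(list)
    for sid, sequence in database.items():
        for eid, event in enumerate(sequence):
            for item in event:
                id_lists[item].append((sid, eid))
    return id_lists


def temporal_join(left, right):
    """Id-list of 'left followed by right': occurrences of right after the first left in the same sid."""
    first_left = {}
    for sid, eid in left:
        first_left[sid] = min(eid, first_left.get(sid, eid))
    return [(sid, eid) for sid, eid in right if sid in first_left and eid > first_left[sid]]


def support(id_list):
    """Support is the number of distinct sid values in the id-list."""
    return len({sid for sid, _ in id_list})
```

With the three customer sequences of Table 1.1 (a) stored as `{1: [("A", "C"), ("B",), ("D", "F")], 2: [("C",), ("F", "B"), ("A", "F")], 3: [("A", "B"), ("E",), ("C",)]}`, joining the id-lists of C and F gives the id-list of the 2-sequence "C followed by F", whose support is read directly from its distinct sid values.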
Other Apriori-based algorithms were developed, although they brought only minor improvements in aspects such as
performance and the handling of large search spaces. The SPAM algorithm [20], in the same fashion as SPADE, uses a
vertical bitmap representation of the database combined with efficient support counting and pruning
mechanisms. Yang introduced LAPIN-SPAM to overcome the ineffectiveness in handling
long patterns by using the last position of an item to judge whether a sequence can be extended.
1.1.4
PrefixSpan Algorithm
The previously presented sequential pattern mining methods follow a candidate generation-and-test
approach based on the Apriori heuristic proposed in association mining [18]: “any super-pattern of a
non-frequent pattern cannot be frequent”. Because of that, Apriori-based algorithms require multiple scans
of the database. They also generate huge sets of candidates, especially 2-item candidates, which reduces
the performance of the mining process and makes them unsuitable for mining long sequential patterns.
For frequent pattern mining, a frequent pattern growth method called FP-growth [22] has been developed
for efficient mining of frequent patterns without candidate generation. The method uses a data structure
called FP-tree to store compressed frequent patterns in a transaction database and recursively mines the
projected conditional FP-trees to achieve high performance.
In [17] a new approach to this problem was proposed. With the FreeSpan algorithm, the idea was to integrate the mining of frequent sequences with that of frequent patterns and use projected sequence
databases to confine the search and the growth of subsequence fragments. By scanning the
database once, FreeSpan creates a frequent item list, based on the support, which is also the set of length-1
sequential patterns. The sequence database is then recursively projected into a set of smaller projected databases based on the current sequential pattern(s), and sequential patterns are grown in each
projected database by exploring only locally frequent fragments.
In 2004, Pei et al. introduced an improved method, PrefixSpan (Prefix-projected Sequential pattern mining) [16]. Unlike FreeSpan, which creates projected databases based on the current set of frequent patterns without a particular ordering (i.e., growth direction), PrefixSpan projects databases by growing
frequent prefixes.
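The idea of growing frequent prefixes over projected databases can be sketched as follows. This is a simplified illustration, assuming that every event holds a single item, and it is not the implementation of [16]; the function name `prefixspan` and its arguments are hypothetical.

```python
def prefixspan(database, min_support, prefix=()):
    """Yield (pattern, support) pairs by growing frequent prefixes (single-item events)."""
    # Count, for each item, the number of sequences of the (projected) database containing it.
    counts = {}
    for seq in database:
        for item in set(seq):
            counts[item] = counts.get(item, 0) + 1
    for item in sorted(counts):
        if counts[item] < min_support:
            continue
        pattern = prefix + (item,)
        yield pattern, counts[item]
        # Projection: keep only the suffix that follows the first occurrence of the item.
        projected = [seq[seq.index(item) + 1:] for seq in database if item in seq]
        yield from prefixspan(projected, min_support, pattern)
```

No candidate sequences are generated: each recursive call only counts the items that are locally frequent in the projected database of the current prefix.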
Sequential pattern mining for logistic event logs
Despite the usefulness of the previously presented algorithms, their applicability to broader problems,
and to real-world data, is very limited. Firstly, all of them assume a priori knowledge of the sequence
itself. Since they are designed to mine transaction databases, the algorithms do not tackle the problem
of defining a proper sequence. In a transaction database, sequences are given by a list of transactions
ordered by transaction time, where each transaction is a set of items bought by a specific customer.
Each sequence is the purchase history of a customer.
(a) Transaction database

Customer ID   Customer Sequence
1             ⟨(AC), (B), (DF)⟩
2             ⟨(C), (FB), (AF)⟩
3             ⟨(AB), (E), (C)⟩

(b) Event log data

Trajectory ID   Trajectory Sequence
1               ⟨A, C, D, G, ..., F⟩
2               ⟨D, F, A, C, ..., B⟩
3               ⟨A, F, E, C, ..., G⟩

Table 1.1: Difference between databases
In logistics event logs, however, sequences are given by the full list of logged events from a trajectory.
There are no sets of items, since events are not concurrent and they are logged continuously, in contrast with
discrete transactions. This produces extremely long sequences, with thousands of events, making
it almost impossible to find patterns, since the support is always very small. Sequences have to be
narrowed to include only the portion of the trajectory that is intended to be studied. For instance, in a
load and unload process, a sequence could be given by the events that occurred from 30 minutes before to
30 minutes after the activity. But it could also be defined only with events prior or posterior to the process.
Defining a useful sequence is not trivial and the obtained results are highly dependent on that
choice. Hence, the work of this thesis diverges from sequential pattern mining in the strict sense.
Instead, from the event log data point of view, it introduces a methodology to process and
filter event logs to obtain sequences of a higher degree of granularity, with elements such as activities
rather than events.
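As an illustration of how a long trajectory log could be narrowed to a sequence around one activity, the sketch below keeps only the events logged within a window before and after it. The function name `window_sequence`, the `(timestamp, event type)` tuple layout and the 30-minute defaults are assumptions taken from the example in the text, not a prescribed procedure.

```python
from datetime import datetime, timedelta


def window_sequence(events, activity_start, activity_end,
                    before=timedelta(minutes=30), after=timedelta(minutes=30)):
    """Return the event types logged from `before` ahead of the activity to `after` behind it.

    events: iterable of (timestamp, event_type) tuples, in any order.
    """
    lower, upper = activity_start - before, activity_end + after
    return [event_type for stamp, event_type in sorted(events) if lower <= stamp <= upper]
```

Setting `before` or `after` to `timedelta(0)` gives the variants that use only the events posterior or prior to the process.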
1.2
Geographical distance
When dealing with GPS-generated databases, determining the distance between locations is a crucial
part of the analysis. By knowing the distance between two positions and the time taken to travel from
one to the other, the average speed over that segment of the trajectory can be obtained. Knowing
the average speed of a moving object makes it possible to determine whether the object is stopped or in
motion.
Global Positioning Systems are explicitly designed to store, handle, and retrieve spatially referenced
data. In addition to basic Euclidean or straight-line distances, there is a need for more complex forms
of analysis that incorporate a higher level of detail, such as accounting for the curvature of the
earth. Depending on the nature of the data, the type of coordinates, the application and the required accuracy,
there are several formulations for this calculation.
1.2.1
Haversine formula
Given the latitude and longitude of two points, it is possible to determine the shortest distance between
them by using the Haversine formula [23]. This formula is mathematically equivalent to the spherical Law of
Cosines, which uses spherical geometry to calculate the great circle distance between two points on the globe.
Nevertheless, the Haversine formula is often preferred since it is less sensitive to round-off errors that can occur when measuring distances between points that are located very close together [24]. Its round-off error instead becomes significant for
antipodal points (i.e. points that are on opposite sides of the earth). Although it is an accurate formulation,
it does not take into account the ellipsoidal shape of the earth, treating it as a sphere of radius r.
Figure 1.1: Great Circle: a great circle of a sphere
is the intersection of the sphere and a plane which
passes through its center point.
The Haversine formula is a particular case of the law of haversines: given a unit sphere, a triangle on
the surface of the sphere is defined by the great circles (see fig. 1.1) connecting three points u, v , and
w on the sphere.
Figure 1.2: Spherical triangle: law of cosines
If the lengths of the sides that connect those points are a, b and c, and the angle of the corner
opposite to c is C, the law of haversines states:
$$\operatorname{haversin}(c) = \operatorname{haversin}(a - b) + \sin(a)\sin(b)\,\operatorname{haversin}(C) \tag{1.1}$$
Where haversin is the haversine function given by:
$$\operatorname{haversin}(\theta) = \sin^2\!\left(\frac{\theta}{2}\right) = \frac{1 - \cos(\theta)}{2} \tag{1.2}$$
The lengths a, b and c are equal to the angles, in radians, subtended by those sides from the center of
the sphere. In order to obtain the Haversine formula, used to calculate the shortest distance between
two points, the point u is to be considered as the north pole, while v and w are the two points whose
distance x is to be determined. In such a case, having the latitudes $(\varphi_1, \varphi_2)$ and longitudes $(\lambda_1, \lambda_2)$
of the two points, a and b become $a = \frac{\pi}{2} - \varphi_1$ and $b = \frac{\pi}{2} - \varphi_2$, C is the longitude separation given by
$\Delta\lambda = \lambda_2 - \lambda_1$, and $c = \frac{x}{r}$. The law of haversines becomes:
$$\operatorname{haversin}\!\left(\frac{x}{r}\right) = \operatorname{haversin}(\varphi_2 - \varphi_1) + \cos(\varphi_1)\cos(\varphi_2)\,\operatorname{haversin}(\lambda_2 - \lambda_1) \tag{1.3}$$
Note that x is the shortest distance between two points along a great circle (fig. 1.1) and r is the radius of
the sphere. By applying the inverse haversin to equation 1.3, it is possible to solve the equation and
thus obtain the desired distance x:
$$x = 2r \arcsin\!\left(\sqrt{\sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos(\varphi_1)\cos(\varphi_2)\,\sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right)}\right) \tag{1.4}$$
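Equation 1.4 translates almost directly into code. The sketch below is a minimal example under stated assumptions: the coordinates are given in decimal degrees, and the radius r is taken as 6371 km, a commonly used mean Earth radius that is not specified in the text. The function name `haversine_km` is hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, used as r in equation 1.4


def haversine_km(lat1, lon1, lat2, lon2, r=EARTH_RADIUS_KM):
    """Great circle distance x between two points given in decimal degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    half_dphi = (phi2 - phi1) / 2
    half_dlam = radians(lon2 - lon1) / 2
    h = sin(half_dphi) ** 2 + cos(phi1) * cos(phi2) * sin(half_dlam) ** 2
    # Clamp to 1.0 so that round-off near antipodal points cannot break asin.
    return 2 * r * asin(sqrt(min(1.0, h)))
```

Two points on the equator separated by 90 degrees of longitude are a quarter of a great circle apart, which gives a simple sanity check of the implementation.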
1.3
Clustering Methods
Another important data mining technique is data clustering. Clustering algorithms aim at dividing a set of
objects into groups (clusters), where objects in each cluster are similar to each other (and as dissimilar
as possible to objects from other clusters) [25]. Clustering plays an outstanding role in data mining
applications such as scientific data exploration, information retrieval and text mining, spatial database
applications, Web analysis, marketing, medical diagnostics, computational biology, and many others
[26]. In this work, clustering methods are used to group the locations of activities, making it possible to
create probability density functions for activity durations according to their location.
The clustering problem is often defined as follows: given a set of points with n attributes in the data
space $\mathbb{R}^n$, find a partition of points into clusters so that points within each cluster are close (similar) to
each other. In order to determine how close (similar) two points x and y are to each other, a distance
function d(x, y) is employed.
There are several clustering methods and they can be broadly divided into two classes: hierarchical
clustering and objective function-based clustering. In the latter, as the name states, an objective function
is needed and the data is partitioned by optimizing it. Such an objective function usually minimizes the
distances to the cluster centers. One of the biggest drawbacks of this type of clustering is that the number
of clusters has to be specified in advance. Hierarchical clustering, on the other hand, is a connectivity-based
clustering method that does not need the number of clusters to be specified.
1.3.1
Hierarchical Clustering
Hierarchical clustering is based on the idea that objects that are nearby are more related than
those that are farther away. It builds a cluster hierarchy (i.e. a tree of clusters), also known as a
dendrogram. Every cluster node contains child clusters. Such an approach allows exploring data on
different levels of granularity. Hierarchical clustering methods are categorized into agglomerative, that
starts with one-point (singleton) clusters and recursively merges two or more most appropriate clusters,
and divisive, that starts with one cluster of all data points and recursively splits the most appropriate
cluster.
In order to decide which clusters should be merged (for agglomerative), or where a cluster should be
divided (for divisive), a measure of similarity between sets of objects is required. In most methods of
hierarchical clustering, this is achieved by using a metric (a measure of distance between pairs of objects)
such as the Euclidean distance, and a linkage criterion which specifies the similarity of sets as a function of
the pairwise distances of objects in the sets. The most common linkage criteria are: complete-linkage,
which uses the maximal distance; single-linkage, which uses the minimal distance; and average-linkage,
which uses the mean of the distances.
After linking all the objects in the data set into a hierarchical cluster tree, the tree is pruned at a user-specified
value. All branches at or below each cut are grouped into a single cluster.
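A minimal agglomerative sketch with single-linkage and a distance cut is shown below. It is only an illustration: the Euclidean metric, the single-linkage criterion, the function name `single_linkage` and the `cut` parameter are choices made for the example, and the quadratic search for the closest pair is not meant for large data sets.

```python
from math import dist


def single_linkage(points, cut):
    """Agglomerative clustering: merge the two closest clusters until their distance exceeds `cut`."""
    clusters = [[p] for p in points]  # start from one-point (singleton) clusters

    def linkage(a, b):
        # Single-linkage: the minimal pairwise distance between the two sets.
        return min(dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cut:
            break  # pruning the tree at the user-specified value
        clusters[i].extend(clusters.pop(j))
    return clusters
```

Replacing `min` by `max` inside `linkage` would give complete-linkage, and replacing `dist` by a great circle distance would adapt the sketch to latitude and longitude pairs.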
1.4
Contributions
In this thesis, an efficient data mining method has been developed for high volume spatio-temporal
data with event logs. The focus is on addressing the aforementioned challenges of dealing with the
uncertainty of human-dependent event logs. The thesis brings together the following contributions:
• The proposal of a framework to overcome the problem of estimating the duration of relevant activities from human-dependent event log data. The central piece of the framework is the use of a
spatio-temporal time-window, as a time constraint, to build an activity time-line that allows a proper
estimation of activity durations. The idea is based on the correlation between time and space;
• A novel algorithm called Spatio-Temporal Event Log Miner to build activity time-lines of regions of
interest (ROIs) from logistics spatio-temporal datasets with event logs. The proposed algorithm
tackles the problem of extracting regions of interest from trajectory datasets based on the average
speed of the object being tracked;
• An approach to avoid misleading calculations of the average speed of an object when it is obtained
with the time-differential method. Trajectory databases have an associated positioning error, inherent to global positioning systems, that is amplified through differentiation;
• Providing a higher level of detail for software aids in the creation of transportation plans by using
real-world data driven methodologies on how to filter event logs for activity extraction and which
assumptions to make in order to achieve accurate duration estimations based on the notion of
activity time-lines;
• A complete methodology on how to obtain probability density functions of activities durations based
on their location from real-world logistic data. Such information is extremely useful for planning
algorithms, like vehicle routing problem, that are capable of dealing with stochastic time-windows.
1.5
Outline
The remainder of the thesis is organized as follows: in chapter 2 the overall framework for mining spatio-temporal data
with associated event logs is described and formalized. The task of identifying regions of interest based
on speed is addressed, along with the definition of a set of activities from event logs. The concept of activity
time-lines is introduced and the activity duration estimation process is outlined. A spatio-temporal event
log (STEL) mining algorithm is presented.
The database used to test the STEL algorithm is analysed in chapter 3. In order to define the set
of activities, events from the event log are divided into classes and categorized according to the related
occurrences. The set of activities and the corresponding events are identified. Spatio-temporal data is
briefly introduced by presenting the locations that are going to be mined.
Chapter 4 describes the various steps carried out in developing the data model and its implementation
on a logistics database. In section 4.1 the average speed of trucks is calculated and regions of interest
are identified. Activities are extracted from the event logs and classified in section 4.2. In section 4.3 the
duration of the activities is estimated based on the assumptions and hypotheses formulated. Finally, in section
4.4 the information is clustered to enable a frequency analysis at each location.
The utility of the algorithm is demonstrated in chapter 5, where the results are presented. Probability density functions for the duration of load/unload activities and service times are obtained for two different logistic environments: Amsterdam Airport Schiphol and the Port of Rotterdam. Results for a single truck
analysis are also shown, together with the obtained activity time-lines.
Lastly, conclusions about the effectiveness of the algorithm in mining both fleets and single trucks are
discussed in chapter 6, along with possible future work to improve the STEL algorithm.
Chapter 2
Spatio-Temporal Event Log Mining
Classical data mining techniques often perform poorly when applied to spatio-temporal data sets [27].
Such data sets are embedded in continuous space, whereas classical datasets, like transactions, are
discrete. In addition, patterns are often local, whereas classical data mining techniques normally focus on
global patterns. The large number of events to mine is a further drawback of such approaches. For
these reasons, a new algorithm is proposed.
The spatio-temporal event log mining (STEL) algorithm is designed to mine spatio-temporal databases
in conjunction with event logs. As information systems are becoming more intertwined with the operational processes they support, multitudes of events are recorded by today’s information systems,
providing detailed information about the history of processes [28]. Despite the omnipresence of event
data, most organizations still diagnose problems based on fiction, through approximations and abstractions, rather than facts. Hence, the goal of event mining is to use such event data to extract process-related
information so that the task of planning and scheduling becomes more accurate and reliable [29].
More specifically, STEL is meant to estimate the duration of process activities that are logged using
human-based event logs. As stated previously, human behaviour is subject to mistakes. Such human
errors create uncertainty in the event log, since events are not guaranteed to be logged consistently with
reality [30]. In other words, users are able to log events before, or after, the corresponding occurrences
happen, making the estimated duration of the activities differ from reality. The spatio-temporal data, in
this work, comes in the form of a trajectory database and serves two main purposes: to deal
with the uncertainty related to the time at which the events are recorded in the event log, and to categorize
the extracted knowledge according to the location where the activities took place. This makes it possible to have a priori information about the expected duration of activities that are performed at
certain locations. Such information can be extremely useful in planning and scheduling applications for
logistics-related software. It provides estimations based on real-world data rather than on approximations
and abstractions that do not take into account the complexity of transportation problems.
The framework for the STEL algorithm is shown in figure 2.1 and can be described in three main steps:
identify regions of interest, identify performed activities, and merge the information acquired from both
analyses to create an activity time-line for each region of interest, so that it is possible to estimate activity
durations. The steps of the algorithm are further described in the following sections.
Identify regions of interest - the spatio-temporal data of the trajectory database provides sampled positions of the object being tracked. The distance between two consecutive positions is calculated
and, using the time difference between acquisitions, the average speed of the object between
those positions is obtained. Regions of interest can then be identified based on the average speed of
the object, creating a time-window that defines the boundary for the activity duration estimation;
Identify activities - the event log data provides a sequence of events that were logged during the trajectory. Such events can be of many types, such as text messages, warnings or activity-related
events. Hence, activities must be extracted from the event log by analysing the set of activity-related
events;
Create activity time-lines - once the activities and regions of interest of the trajectory are known, each
activity is assigned to the corresponding ROI using the log times. Once all activities are assigned,
regions of interest can be described by their activity time-line, containing all the activities that took
place. The duration estimation of the activities is based on the activity time-lines.
Figure 2.1: Activity duration estimation framework
2.1
Identify Regions of Interest
Using the spatio-temporal data, regions of interest are identified based on the average speed of moving
objects. Despite the term “region”, its significance is in terms of time rather than the location itself. Each
ROI is characterized by a start date, an end date and a corresponding duration. ROIs can be seen as time
portions of the trajectories where the object being tracked was standing still, or below a certain speed.
From the user's point of view, the concept of trajectory is based on the evolving position (perceived as a
point) of an object travelling, in some space, during a given time interval. Thus, a trajectory is by definition a spatio-temporal concept. A GPS trajectory can be formally defined as in [31], [32], [33]:
Definition 1. A trajectory is a finite sequence of space-time points $\langle p_0, p_1, \dots, p_N \rangle$, where $p_i = (\varphi_i, \lambda_i, t_i)$
and N is the total number of data-points in a trajectory. The $(\varphi_i, \lambda_i) \in \mathbb{R}^2$ are spatial coordinates, and the $t_i \in \mathbb{R}^+$
are timestamps, with $t_i < t_{i+1}$ for $i = 0, 1, \dots, N-1$.
Each $(\varphi_i, \lambda_i)$ pair represents the recorded position of a moving object at time $t_i$ from a GPS-enabled
device. Each trajectory is associated with a truck with a unique truck ID. A trajectory is then formed by a
sequence of segments called trajectory links.
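Definition 2. A trajectory link $l_j$ is a straight line between two consecutive points $p_i = (\varphi_i, \lambda_i, t_i)$ and
$p_{i+1} = (\varphi_{i+1}, \lambda_{i+1}, t_{i+1})$ of the same trajectory, where $i \in \mathbb{N}_0$ and $j = i$.

         Latitude      Longitude     Time
p1:      $\varphi_1$   $\lambda_1$   $t_1$
p2:      $\varphi_2$   $\lambda_2$   $t_2$
p3:      $\varphi_3$   $\lambda_3$   $t_3$
...      ...           ...           ...
pN:      $\varphi_N$   $\lambda_N$   $t_N$
(a) Trajectory data base

[Graph of the points p1, p2, p3, ..., pN joined in order by the links link1, link2, ..., link$_{N-1}$]
(b) Trajectory graph

Figure 2.2: Data acquired by a global positioning system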
If an object takes a time t to travel a distance x, it maintains an average speed of x/t over that
time. The range of speeds that the vehicle maintains while in a certain area is used, together with
a minimum duration of stay, to define a ROI. Since the position and
time of each data point are available, it is possible to estimate the average speed of a trajectory link. By using the
latitude and longitude of two consecutive data points, the distance x between them (i.e. the length of the
link) can be obtained. The average speed of each link can then be calculated using the timestamps of
each data point $p_i$. Once the average speed of each link is known, the speed ranges for the classification of
the links can be identified and the extraction of regions of interest becomes possible.
The classification of the links is done according to their average speed, into
two classes: stopped and moving. As presented in Definition 3, ROIs are defined in terms of
the average speed, and thus the choice of $s_{bound}$ is crucial, since it is the boundary that decides whether a
trajectory link is considered a candidate ROI or not.
$$\begin{aligned} \text{Stopped:} \quad & 0 \le \bar{s} \le s_{bound} \\ \text{Moving:} \quad & \bar{s} > s_{bound} \end{aligned} \tag{2.1}$$
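The computation of the average speed of each link and its classification according to equation 2.1 can be sketched as follows. The example assumes a trajectory stored as a list of `(latitude, longitude, time in seconds)` tuples with strictly increasing timestamps, an Earth radius of 6371 km, and speeds expressed in km/h; the function names and the value used for $s_{bound}$ in the usage note are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(p, q, r=6371.0):
    """Great circle distance between two (lat, lon, t) points; r is an assumed Earth radius."""
    phi1, phi2 = radians(p[0]), radians(q[0])
    h = sin((phi2 - phi1) / 2) ** 2 + cos(phi1) * cos(phi2) * sin(radians(q[1] - p[1]) / 2) ** 2
    return 2 * r * asin(sqrt(min(1.0, h)))


def link_speeds_kmh(trajectory):
    """Average speed of every trajectory link, from consecutive (lat, lon, t_seconds) points."""
    speeds = []
    for p, q in zip(trajectory, trajectory[1:]):
        hours = (q[2] - p[2]) / 3600.0  # timestamps are assumed strictly increasing
        speeds.append(haversine_km(p, q) / hours)
    return speeds


def classify(speeds, s_bound):
    """Label each link as stopped (0 <= s <= s_bound) or moving (s > s_bound)."""
    return ["stopped" if s <= s_bound else "moving" for s in speeds]
```

For instance, `classify(link_speeds_kmh(trajectory), 5.0)` labels as stopped every link whose average speed does not exceed 5 km/h.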
Conceptually, a region of interest is intended to be a region where moving objects pause or wait in order
to complete activities that are difficult or impossible to carry out while in motion. In this work, a region of
interest is formulated as follows:
Definition 3. A region R is a region of interest if at least one trajectory link $l \in R$ of the tracked object
has its average speed in $[0, s_{bound}]$, and the object remains in R for at least a time T before leaving
R. That is, $\sum_{i=j}^{n} (t_{i+1} - t_i) \ge T$ with $R = [l_j, l_n]$. The parameters $s_{bound}$ and T are user-defined.
The following figure presents the framework used to extract ROIs:
Figure 2.3: Framework to extract regions of interest
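One possible reading of Definition 3 is sketched below: a ROI is taken to be a maximal run of consecutive stopped links whose total duration is at least T. This interpretation, the function name `extract_rois` and the parameter name `min_stay` (standing for T) are assumptions made for the example. The list `times` holds the timestamps of the data points, so link k runs from `times[k]` to `times[k + 1]`.

```python
def extract_rois(times, labels, min_stay):
    """Return (start, end) time pairs of stopped runs that last at least `min_stay`.

    times:  timestamps of the N + 1 data points.
    labels: "stopped" / "moving" label of each of the N links.
    """
    rois, start = [], None
    for k, label in enumerate(labels):
        if label == "stopped" and start is None:
            start = times[k]  # the run starts at the first point of this link
        elif label != "stopped" and start is not None:
            if times[k] - start >= min_stay:  # the run ended at the start of link k
                rois.append((start, times[k]))
            start = None
    # A run that is still open at the end of the trajectory is closed at the last point.
    if start is not None and times[-1] - start >= min_stay:
        rois.append((start, times[-1]))
    return rois
```

Each returned pair corresponds to the start date $t_j$ and the end date $t_n$ of a region of interest.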
As stated previously, the aim of the thesis is to obtain probability distribution functions for the estimation
of activity durations. In this case, such activities are performed while the vehicle is stopped and
regions of interest are defined with that goal. However, the definition of region of interest is not confined
to areas where the object is stopped or below a certain speed. It is also possible to estimate the durations
of activities carried out while in movement by changing the speed interval of Definition 3. For instance, if the
aim was to estimate the duration of an activity such as driving, in order to obtain travel times between
locations, the interval could be defined as $[s_{bound}, \infty)$.
2.2
Identify Activities
The event log provides a record of specific events at specific timestamps. Such events can be seen as
atomic occurrences, with no duration. They do not provide explicit knowledge about activities;
instead they come in the form of an event sequence. Such events can be of any type, from warnings and
text messages to activities. Hence, activities must be extracted from the event logs in order to be
characterized and studied. Given the large number of event types present in event logs, it is necessary
to analyse the event log so that the events related to activities, and the activities themselves, are identified.
Such a study can be seen in section 3.2. By looking at the event log it is possible to identify keywords
related to activities, such as load, unload, wait, etc. From the keywords found in the event log, the set of
activities is defined:
$$A = \{a_1, a_2, \dots, a_k\} \tag{2.2}$$
Definition 4. An activity $a_k$, where k is the index that identifies an activity, is a finite sequence of data-points $\langle p_0, p_1, \dots, p_m \rangle$, where $p_i = (\varphi_i, \lambda_i, t_i, e_i)$, such that $e_0 = s_k$ and $e_m = f_k$. The $(\varphi_i, \lambda_i) \in \mathbb{R}^2$
are spatial coordinates, the $t_i \in \mathbb{R}^+$ are timestamps, with $t_i < t_{i+1}$ for $i = 0, 1, \dots, m-1$, and the $e_i$
are the recorded events. $s_k$ and $f_k$ denote the events that indicate the start and the end of an activity,
respectively.
In the same empirical manner, three sets of events are defined: S, representing the set of events $s_k$
that indicate the start of an activity $a_k$; F, representing the set of events $f_k$ that indicate the end of an
activity $a_k$; and C, the set of identifier events $c_k$ whose presence in the sequence $a_k$ indicates a special
occurrence.
$$S = \{s_1, s_2, \dots, s_k\}, \qquad F = \{f_1, f_2, \dots, f_k\}, \qquad C = \{c_1, c_2, \dots, c_k\} \tag{2.3}$$
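A minimal sketch of how activities could be extracted from an event log using the sets S and F is given below. The sets are represented as dictionaries that map a start or end event to an activity name; this representation, the function name `extract_activities`, and the choice of silently ignoring an end event with no matching start are assumptions made for the example. The identifier events in C are not handled.

```python
def extract_activities(log, start_events, end_events):
    """Pair start and end events into (activity, t_start, t_end) tuples.

    log:          time-ordered list of (timestamp, event description) tuples.
    start_events: dict mapping an s_k event description to its activity name.
    end_events:   dict mapping an f_k event description to its activity name.
    """
    open_since = {}  # activity name -> timestamp of its pending start event
    activities = []
    for stamp, event in log:
        if event in start_events:
            open_since[start_events[event]] = stamp
        elif event in end_events:
            name = end_events[event]
            if name in open_since:  # an end without a start is ignored (assumption)
                activities.append((name, open_since.pop(name), stamp))
    return activities
```

With `start_events = {"Start of Load": "load"}` and `end_events = {"End of Load": "load"}`, every "Start of Load" event followed by an "End of Load" event yields one load activity with its logged start and end times.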
This is the point where the human influence comes into play. The lack of correlation between the time at which
the events take place and the time at which the $s_k$ and $f_k$ events are recorded leads to a wrong estimation
of activity durations, as seen in section 4.2. To overcome this, the STEL algorithm estimates activity
durations based on the creation of a time-line for each region of interest.
Having a full list of activities for each trajectory, activities need to be associated with the corresponding,
previously found, regions of interest so that the time-line can be built. Time-lines are composed of the start
dates $t_j$ and end dates $t_n$ of the regions of interest and the start and end dates of the activities that
were performed in those time-spans. Those dates are given by the dates at which the $s_k$ and $f_k$ events
were recorded, respectively. An activity is considered to be part of a region of interest if at least one of
the following conditions is satisfied:
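$$\begin{aligned} \text{Start date of ROI, } t_j \;&\le\; \text{Start date of activity, } t_0 \;\le\; \text{End date of ROI, } t_n \\ \text{Start date of ROI, } t_j \;&\le\; \text{End date of activity, } t_m \;\le\; \text{End date of ROI, } t_n \end{aligned} \tag{2.4}$$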
Each region of interest will then have its own activities assigned to it, and the time-line can be built
for each one of them. Each time-line can be seen as a description of the ROI since it contains all the
activities performed in the ROI time-span. An example is shown in figure 2.4.
[Time-line from $t_j$ to $t_n$ containing Activity 1 to Activity 5; activity k starts at $t_{0,k}$ and ends at $t_{m,k}$]
Figure 2.4: Activity time-line of a region of interest
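The assignment rule of equation 2.4 can be sketched in a few lines. Activities are assumed to be `(name, t0, tm)` tuples and regions of interest `(tj, tn)` tuples; these layouts and the function names are hypothetical.

```python
def belongs_to_roi(activity, roi):
    """Equation 2.4: the activity starts or ends inside the ROI time-span."""
    _, t0, tm = activity
    tj, tn = roi
    return tj <= t0 <= tn or tj <= tm <= tn


def build_timelines(activities, rois):
    """Map each ROI to the time-ordered list of activities assigned to it."""
    return {
        roi: sorted((a for a in activities if belongs_to_roi(a, roi)), key=lambda a: a[1])
        for roi in rois
    }
```

An activity that starts before the ROI but ends inside it is still assigned, since only one of the two conditions needs to hold.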
2.3
Characterize Activities
Using the obtained activity time-lines for the regions of interest it is possible to estimate, under some
assumptions, the duration of the process-related activities. Log systems, in general, keep track of the
activities that occur during a certain process and, if there is no activity taking place, they are able
to change their status to idle, making the interpretation of the log file easier. However, log systems that
are human-dependent are not completely aware of the process. They lack sensors to log activity-related
events. The dependence on humans to log certain events leads to empty periods in the logs, in the
sense that the system has no knowledge of what is happening in reality. For instance, in a person's
day-to-day life there are no empty periods as far as activities are concerned: whether working,
sleeping or waiting, a person is always performing some activity. By analogy, if a system can keep track of
all the activity types, there should be no empty time between activities. The only reason for empty time to
appear in such event logs is the human dependence of the systems. Hence, assuming that
the system is capable of tracking every activity that occurs during a process, the time-lines of regions of
interest should be completely filled.
The STEL algorithm uses the duration of stay within the ROI, from the spatio-temporal data, in conjunction
with the activity start and end times, from the event log data, to estimate more reliable
durations for certain activities. This is done by “stretching” the activity blocks based on the empty time
available in the neighbourhood of such activities. However, not all activities should have their duration
estimated. Despite the human dependence of systems when logging certain activities, systems are fully
aware of activities that are performed on the system itself. For instance, when a log-in activity is performed, the system knows when the user started and when the user finished. Such activities, despite the
need for interaction between the user and the system, are logged accurately.
Thus, there is the need to define a subset of activities $A^*$ whose duration is going to be estimated. The
events related to the activities that do not belong to the subset $A^*$ are assumed to have been recorded
without any time difference from reality. Figure 2.5 shows an example.
[ROI time-line from $t_j$ to $t_n$ showing activities $a_k \notin A^*$ (time not to be estimated), an activity $a_k \in A^*$ (time to be estimated) and the empty time around it]
Figure 2.5: Empty time: the time available in the neighbourhood of an
activity in the ROI time-line
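The empty time of a ROI time-line can be located with a simple sweep over the activities, as in the sketch below. It only finds the gaps; how the activities in $A^*$ are stretched into those gaps depends on assumptions that are not reproduced here. The tuple layouts `(name, t0, tm)` and `(tj, tn)` and the function name `empty_intervals` are hypothetical.

```python
def empty_intervals(roi, activities):
    """Return the (start, end) gaps of the ROI time-span not covered by any activity."""
    tj, tn = roi
    gaps, cursor = [], tj
    for _, t0, tm in sorted(activities, key=lambda a: a[1]):
        # Clip the activity to the ROI, since it may start before or end after it.
        t0 = min(max(t0, tj), tn)
        tm = min(max(tm, tj), tn)
        if t0 > cursor:
            gaps.append((cursor, t0))
        cursor = max(cursor, tm)
    if cursor < tn:
        gaps.append((cursor, tn))
    return gaps
```

The gaps adjacent to an activity of $A^*$ are the empty time available in its neighbourhood.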
Characterizing the activities with respect to their duration requires assumptions to be made.
Hence, it is mandatory to have a deep knowledge of the activities that are part of the event log and,
more importantly, to be aware of how those activities are recorded in the event log. Such assumptions have to be made in accordance with the data that is being dealt with. In this case, the STEL
algorithm is tested on two different logistics environments: Amsterdam Airport Schiphol and the Port
of Rotterdam. The assumptions made to estimate activity durations are explained in section 4.3.
Algorithm 1: STEL Algorithm
Define: A, A*, S, F, C, s_bound, T;
for i = 1 : number of trucks do
    for k = 2 : N do
        Calculate link length x_k using equation 1.4;
        Calculate link average speed s̄_k;
    end
    for k = 1 : N do
        Classify links according to equation 2.1;
    end
    for k = 1 : length(A) do
        for j = 1 : N do
            Identify the indexes of S, F, C events;
            Extract and identify activities as in Definition 4;
        end
    end
    for k = 1 : N do
        Find Regions of Interest as in Definition 3;
    end
    for k = 1 : number of ROIs do
        Build time-line using equation 2.4;
        for j = 1 : number of activities do
            if a_j ∈ A* then
                Estimate activity duration;
            end
        end
    end
end
Chapter 3
Analysis of a Logistics Database
The logistics data used to test the STEL algorithm was collected by a TomTom global positioning system
(GPS) from a fleet of logistics trucks from DHL Global Forwarding - Schiphol and Jan de Rijk Logistics.
The system collects the trajectory of the vehicles by recording latitude and longitude coordinates with
a timestamp, and keeps a record of every event that occurs, such as “Start of Load”, “End of Load”, “Task
Received from terminal”, “Task Finished”, etc. In total, 159 event types were found in the database. A complete
list of events, with their corresponding codes and number of occurrences, is available in the appendix.
Each data entry is formed by the ID of the truck, the position (given by latitude and longitude coordinates)
and a timestamp. The events are recorded with a description and their corresponding activity ID. The
data is arranged as follows:
Truck ID | Latitude | Longitude | Activity ID | YYYY-MM-DD HH:MM:SS | Event Description
Table 3.1: Data point arrangement in the database
Since the positions and the corresponding timestamps are constantly recorded, it sometimes happens that
there is no event to assign to the data point. Thus, when the system records a data point and no specific
event has occurred, a “Basic record” is stored in the event description field and the activity
ID is set to zero. With this in mind, a data point is recorded in one of two situations:
1. An event is triggered, leading to a data point record. The event can be triggered by the system
user, in this case the driver, or by the system itself, depending on the type of event;
2. No event took place, but the position is still tracked. The event description field is recorded as a
“Basic record” event.
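As an illustration of the record layout described above, the following minimal sketch parses one row into a typed data point and flags position-only samples. The class name `DataPoint`, the function `parse_row`, the column order (taken from Table 3.1 as extracted) and the sample values are assumptions made for illustration; they are not part of the original database tooling.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class DataPoint:
    """One row of the database (hypothetical field names)."""
    truck_id: int
    latitude: float
    longitude: float
    activity_id: int
    timestamp: datetime
    description: str

    @property
    def is_basic_record(self) -> bool:
        # Position-only sample: no event occurred, activity ID is zero.
        return self.activity_id == 0 and self.description.lower() == "basic record"


def parse_row(row):
    """Build a DataPoint from six text fields in the assumed column order."""
    truck_id, lat, lon, activity_id, stamp, description = row
    return DataPoint(
        int(truck_id),
        float(lat),
        float(lon),
        int(activity_id),
        datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"),
        description.strip(),
    )


# Invented sample rows, for illustration only.
tracked = parse_row(["1141", "52.30", "4.76", "0", "2015-01-01 08:00:00", "Basic record"])
event = parse_row(["1141", "52.30", "4.76", "12", "2015-01-01 08:05:00", "Start of Load"])
print(tracked.is_basic_record, event.is_basic_record)
```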
3.1 Spatio-Temporal Data Analysis
As said previously, the spatio-temporal component of the data is given in the form of latitude and longitude
coordinates and the corresponding date at which the data point was recorded. The database contains trajectories from trucks performing activities in the Rotterdam and Schiphol areas. There is also a specific truck
performing activities across Germany, the Netherlands, Belgium and France, which is referred to as
the international truck. There are 42 different trucks in Rotterdam, accounting for a total of 1972 data points.
In contrast to the Rotterdam area, the Schiphol area contains a large amount of data concentrated on
Schiphol Airport: 135 thousand data points distributed among 276 trucks. The international
truck accounts for 2468 data points, all belonging to a single truck with TruckID 1141. The database was
collected over a time span of 10 days. Figure 3.1 shows the locations where the data points were
recorded. Each data point has the form of table 3.1.
[Figure 3.1: (a) Rotterdam; (b) Schiphol; (c) International Truck. Each panel plots the recorded data points as latitude [°] against longitude [°].]
Trajectories are sampled with a maximum period of 15 minutes. However, since data points are also
recorded when events are triggered, the period can be on the order of seconds. There
are also cases where the system records more than one event simultaneously. This causes
problems when estimating the average speed of the trucks: the position errors inherent to GPS are
amplified through differentiation, leading to incorrect speed calculations. This matter is further addressed
in section 4.1.
For better comprehension and visualization of the database, it is reconstructed while the data is
being processed. The acquired information is stored as a structure, indexed by the truck identification,
with several fields that can be accessed for future analysis. Figure 3.2 shows the set of fields.
[Figure 3.2: Data Structure. The DATA structure is indexed by TruckID and contains the fields Original Data, Distance & Time Intervals, Speed, Total Time, Spatio-Temporal Analysis, Event Log Analysis, Merged Data and Coordinates.]
With a total of 284 trucks, the dataset is divided according to the TruckID (TID). In this way, the data of each truck is
treated as a single trajectory and its related information is kept under the structure fields. The Original
Data field, as the name states, is where the original data of the trajectory is kept in the form of table 3.1.
Using the original trajectory data, the distance and time intervals between consecutive data points are calculated,
as described in section 4.1, and stored in the Distance & Time Intervals field. From these, a Speed vector is calculated,
containing the average speed of the truck between data points. The Total Time field corresponds to
a vector with the amount of time elapsed from the beginning of the trajectory until each data point. The
Spatio-Temporal Analysis field refers to the data processed in section 4.1.1, where regions of interest
are extracted based on the speed of the trucks. The Event Log Analysis field is where the information about
identified activities is kept. Each truck event log is scanned in order to find groups of data points that
represent an activity. Once activities are identified, they are merged with the previously found regions of
interest, creating the necessary tools to estimate activity durations. This process is described in section
4.3 and the resulting knowledge is kept under the Merged Data field. Once the duration of the activities has been
estimated, their locations are clustered so that it is possible to predict the duration of activities according to the
locations where they are performed. The clustered activities and their data are stored in the Coordinates
field.
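The first steps of this structure, splitting the data by TruckID and building the Total Time vector, can be sketched as follows. The function names and the sample timestamps are hypothetical, and the sketch assumes that each row has already been reduced to a (TruckID, timestamp) pair.

```python
from collections import defaultdict
from datetime import datetime


def split_by_truck(rows):
    """Group (truck_id, timestamp) rows into one time-ordered trajectory per truck."""
    trajectories = defaultdict(list)
    for truck_id, stamp in rows:
        trajectories[truck_id].append(stamp)
    return {tid: sorted(stamps) for tid, stamps in trajectories.items()}


def total_time_minutes(stamps):
    """Minutes elapsed from the first data point of a trajectory to each data point."""
    start = stamps[0]
    return [(t - start).total_seconds() / 60.0 for t in stamps]


# Invented sample rows, for illustration only.
rows = [
    (1141, datetime(2015, 1, 1, 8, 0)),
    (7234, datetime(2015, 1, 1, 9, 0)),
    (1141, datetime(2015, 1, 1, 8, 30)),
]
trajectories = split_by_truck(rows)
print(total_time_minutes(trajectories[1141]))  # [0.0, 30.0]
```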
3.2 Event Log Analysis
In this section an intensive study of the event log data is presented. It is important to state that
the following analysis, and the information extracted from it, is strictly based on an objective and
discerning analysis of the dataset, with no a priori knowledge given by the involved companies.
Event logs provide a record of ephemeral occurrences that can be related to several types of circumstances, from received messages to warnings or activities. However, in this work the estimation of
activity durations is based on time-lines that are described by the performed activities. Such activities are embedded in the event logs as sequences of specific events. Hence, it is necessary to
categorize events according to the occurrences that they are related to. Due to the huge number of
event types, a selective analysis has to be made. Keeping the goal of the project in focus, events that
are not related to any form of activity are not taken into account. Events can be divided into 9 main
categories:
“Start of/End of” - events characterized by these prefixes indicate that a specific task/occurrence has
started (ended). They can be user-introduced events, for activities such as load and unload, but
also automatically-introduced events, such as a speed limit violation;
“Cancellation of” - For the user-introduced events (e.g. activities), a cancellation option is available in
case of user mistake;
“Report of” - Users are able to report when tasks are completed. However, it is not a common practice
among most of the users, since the number of occurrences is not correlated with the number of
activities. For instance, the number of reported loads is 27 against 869 loads performed. This
class of events is not taken into account;
“Navigation” - Navigation events correspond to GPS actions such as introducing the destination
and arriving at the destination. These events are also dismissed;
“Driving” - Driving events are automatically recorded. They represent warnings related to driving-time
restrictions. “Driving times state” and “Driving times driving violation” are examples of such
events. This class does not represent any activity, so these events are not considered;
“Activity Midnight” - The presence of this event indicates that a certain activity is being carried out
during night time. The exact interval considered as a midnight activity is unknown. However,
this event is of great relevance since it can indicate a different pattern in the activity duration;
“Texts and Tasks” - Texts and tasks events can be seen as a form of communication between the office
and the terminal in the trucks. “Text message received”, “Text message read”, “Task received from
terminal” and “Task accepted” are examples of such events;
Others - All the other events that do not fall in any previous category and are not relevant in this analysis.
“Crossed country border” and “Deceleration limit violation” are some examples.
Of the event categories described above, “Start of/End of”, “Cancellation of” and “Activity Midnight”
are the only ones that are activity related. However, depending on the activity being performed, there
are automatically recorded events as well as user-introduced events. When the human factor is
involved, errors and mistakes are likely to happen, making the database noisy and introducing an uncertainty
related to the lack of time coordination between records and reality, as explained previously. This creates
the need to distinguish the activities that are logged by the system from the ones recorded by users.
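The selection of activity-related events by prefix can be sketched as below. The sketch assumes that the event description starts with the category prefix exactly as written in the list above; the function name and the short category labels are hypothetical. Descriptions consisting of a bare prefix with no activity name are rejected, since they do not identify an activity.

```python
# Prefixes of the activity-related categories, mapped to short labels (hypothetical).
PREFIX_CATEGORIES = [
    ("Start of", "start"),
    ("End of", "end"),
    ("Cancellation of", "cancellation"),
    ("Activity Midnight", "midnight"),
]


def activity_related(description):
    """Return (category, activity name) for activity-related events, else None."""
    for prefix, category in PREFIX_CATEGORIES:
        if description.startswith(prefix):
            name = description[len(prefix):].strip()
            # A bare prefix carries no activity name and is ignored.
            return (category, name) if name else None
    return None


print(activity_related("Start of Load"))
print(activity_related("Activity Midnight Unload"))
print(activity_related("Basic record"))
```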
Looking at the “Start of/End of” class of events, it is possible to identify the following activities: arrive,
break, costs, drive, garage, gas, load, unload, log in, log out, passage, rest, sign up, wait, peak RPM
limit violation and speed limit violation. Hence, the set of activities from equation 2.2 and the event sets
from equation 2.3 are defined as follows:
A = { Arrive, Break, Costs, Drive, Garage, Gas, Load, Unload, Log In, Log Out,
Passage, Rest, Sign Up, Wait, Peak RPM limit Violation, Speed Limit Violation };
S = { Start of “ak ” };
F = { End of “ak ”, Cancellation of “ak ” };
C = { Activity Midnight “ak ” };
Of those, only the speed limit violation and the peak RPM limit violation are fully recorded by the system
itself. However, activities like log in and sign up, despite needing the interaction of the user, are not
subject to human behaviour. Since those activities are performed on the logging system, the system
is aware of when the user starts and ends such a process, recording the events at the correct
time. Activities such as load and unload, on the other hand, are not tracked by the system. Their log is
completely dependent on the user and is thus subject to mistakes. In this thesis, those are the activities
that form the subset A* ⊆ A whose durations are going to be estimated:
A* = { Load, Unload };
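The sets defined above translate directly into code. The following sketch assumes that event descriptions are formed by concatenating the prefix and the activity name exactly as written here; the actual strings and capitalisation in the database may differ.

```python
# Activity set A and the subset A* whose durations are estimated.
A = {
    "Arrive", "Break", "Costs", "Drive", "Garage", "Gas", "Load", "Unload",
    "Log In", "Log Out", "Passage", "Rest", "Sign Up", "Wait",
    "Peak RPM limit Violation", "Speed Limit Violation",
}
A_STAR = {"Load", "Unload"}

# Event sets S (start), F (finish) and C (context), built from A.
S = {f"Start of {a}" for a in A}
F = {f"End of {a}" for a in A} | {f"Cancellation of {a}" for a in A}
C = {f"Activity Midnight {a}" for a in A}


def needs_estimation(activity):
    """True if the duration of this activity is to be estimated (activity in A*)."""
    return activity in A_STAR


print(A_STAR <= A, needs_estimation("Load"), needs_estimation("Log In"))
```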
Load and Unload activities represent the core of logistic processes. Therefore, to extract valuable information that can be used in scheduling and planning applications, it is mandatory to have full knowledge
about such activities. There are several event types concerned with load and unload activities. Table
3.2 presents a description of such events.
Event Type | Description
Start of Load/Unload | Event recorded by the user when the loading/unloading process starts
End of Load/Unload | Event recorded by the user when the loading/unloading process finishes
Cancellation of Load/Unload | Cancellation of the loading/unloading process due to user mistake
Report of Load/Unload | Unknown
Join of Load | Merge of two or more truck loads into a single truck
Leaving of Load | Trailer is left at the location without unloading the cargo
Activity Midnight Load/Unload | Loading/unloading process occurs during the night
Table 3.2: Load and Unload activities related events
Figure 3.3 shows a bar chart with the number of occurrences of each event. As can be seen, all
events other than “Start of Load/Unload” and “End of Load/Unload” show a low number of occurrences
by comparison. However, “Cancellation of Load/Unload” and “Activity Midnight
Load/Unload” events have an important role in the project. Since these are user-introduced events, the
human factor has to be taken into account; hence, despite their low number of occurrences, these events
cannot be dismissed. Also, the presence of “Activity Midnight Load/Unload” events might indicate a different
pattern in terms of the duration of the activities, since they are carried out at a particular time of
the day.
[Figure 3.3: Load and Unload event count. (a) Load Events: Start of Load, End of Load, Cancellation of Load, Join of Load, Report of Load, Leaving of Load, Midnight Load; (b) Unload Events: Start of Unload, End of Unload, Cancellation of Unload, Report of Unload, Midnight Unload. Vertical axis: event occurrences. Bar values consistent with the text: Start of Load 923, End of Load 869, Cancellation of Load 53, Report of Load 27; Start of Unload 1095, End of Unload 1019, Cancellation of Unload 75.]
The load and unload activity events are correlated with each other, in the sense that the number of
occurrences of “Start of Load” is approximately equal to the number of occurrences of “End of Load” plus
the number of occurrences of “Cancellation of Load”:
\#\text{start of load} \simeq \#\text{end of load} + \#\text{cancellation of load} \qquad (3.1)
Knowing this, once the regions of interest are identified, load and unload activities can
be extracted from trajectories as the set of data points between a “Start of Load/Unload” and an “End of
Load/Unload” event. This process is further explained in section 4.2.
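Equation 3.1 can be checked numerically. The sketch below uses the bar values read from Figure 3.3 (923 starts, 869 ends and 53 cancellations for load; 1095, 1019 and 75 for unload); the assignment of these values to the individual bars is inferred from the figure together with the 869 loads quoted in the text, and the helper name is hypothetical.

```python
def relative_gap(start, end, cancellation):
    """Relative mismatch between the starts and (ends + cancellations)."""
    return abs(start - (end + cancellation)) / start


load_gap = relative_gap(923, 869, 53)
unload_gap = relative_gap(1095, 1019, 75)
print(f"load: {load_gap:.2%}, unload: {unload_gap:.2%}")
```

In both cases the two sides of the relation differ by a single event, which supports treating the mismatch as negligible.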
Apart from the load and unload activities, there are more activity-related events. For instance, drivers
must stop to rest or take breaks after a certain period of driving. In addition, trucks need to be
refuelled, which also requires a stop. Since the load and unload locations have schedules at which they are
served, a truck may arrive earlier than it should. When this occurs, the driver
has to wait until the customer is available to perform the transaction. Figure 3.4 shows the number
of occurrences of these events.
[Figure 3.4: Other activity related events count. Bars: Start of, End of and Cancellation of Rest, Break, Gas and Wait. Vertical axis: event occurrences.]
[Figure 3.5: Log In & Log Out event count. Bars: User login, User logout; Start of, End of and Cancellation of Sign Up, Log In and Log Out. Vertical axis: event occurrences.]
Once again, the previously described events are correlated. The difference in the number
of occurrences, which might be attributed to missing data, is small and can be neglected,
as in the load and unload cases.
\#\text{start of ``activity''} \simeq \#\text{end of ``activity''} + \#\text{cancellation of ``activity''} \qquad (3.2)
It is also important to take into consideration the events shown in figure 3.6, as they are the most frequent
events in this data set. As described previously, the “Basic record” event corresponds to a data point
acquisition without the occurrence of any specific event. The purpose of this acquisition is strictly
geographical and temporal, having no significance as an event. The “Driving times state” event is
believed to be a system-introduced event related to the restrictions on driving times. “Start of”,
“End of” and “Cancellation of” are “incomplete” events whose origin is unknown, and thus they are not
considered during the processing of the data in the following chapter. “Contact ON” and “Contact OFF”
are the events corresponding to switching the truck on and off, respectively. As the names state, “Start
of Drive” and “End of Drive” are user-introduced events that indicate the beginning and end of the
driving process between two locations. No other activities occur between such events.
[Figure 3.6: Most frequent events in the database. Bars: Basic record, Driving times state, Cancellation of, End of, Start of, Contact ON, Contact OFF, Start of Drive, End of Drive. Vertical axis: event occurrences (×10^4).]
It is of great interest to know a priori which activities have the biggest impact in terms of time. In a naive
approach, once the activity-related events are identified, it is possible to measure their duration as the
elapsed time between the “Start of -activity-” and “End of -activity-” events. However, this measurement
should not be taken as a true estimate of activity times. The assumption that the events are correctly
introduced by the users does not hold, as shown in section 4.3; thus the following plot only
serves as a term of comparison for future results.
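The naive measurement described above can be sketched as follows. This is only an illustration of pairing each start event with the next end event; the function name and the sample log are invented, and treating a cancellation as discarding the pending start is an assumption made here, not a rule stated in the text.

```python
from datetime import datetime


def naive_durations(events, activity):
    """Minutes between each 'Start of <activity>' and the next 'End of <activity>'.

    events is a time-ordered list of (timestamp, description) pairs.
    Assumption: a 'Cancellation of <activity>' discards the pending start.
    """
    durations, started = [], None
    for stamp, description in events:
        if description == f"Start of {activity}":
            started = stamp
        elif description == f"Cancellation of {activity}":
            started = None
        elif description == f"End of {activity}" and started is not None:
            durations.append((stamp - started).total_seconds() / 60.0)
            started = None
    return durations


# Invented log, for illustration only.
log = [
    (datetime(2015, 1, 1, 8, 0), "Start of Load"),
    (datetime(2015, 1, 1, 8, 45), "End of Load"),
    (datetime(2015, 1, 1, 9, 0), "Start of Load"),
    (datetime(2015, 1, 1, 9, 1), "Cancellation of Load"),
    (datetime(2015, 1, 1, 10, 0, 0), "Start of Unload"),
    (datetime(2015, 1, 1, 10, 0, 5), "End of Unload"),
]
print(naive_durations(log, "Load"))    # one 45-minute load
print(naive_durations(log, "Unload"))  # start and end only seconds apart
```

The second result shows why the naive measure is unreliable: when the user introduces both events within seconds, the measured duration says nothing about the real activity.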
[Figure 3.7: Activity time distribution. Pie chart with slices Rest Time, Waiting Time, Loading Time, Unloading Time and Other Events.]
This analysis takes into account the most important activities in terms of time: rest activities,
load and unload activities and waiting activities. Break activities, passage activities, arriving activities
and refuelling activities are also considered but are represented as “others” due to their small share of
time.
3.2.1 Sequential Pattern Mining in Event Logs
As explained previously, to perform sequential pattern mining on event logs there is the need to define
shorter sequences. It is not possible to use the complete sequence of events of a trajectory directly
from the event log. Event logs contain thousands of events, which makes patterns hard to find
because of their small support. In addition, event logs can come from different locations, so that their patterns
are not shared by other trucks. The number of event types in the event log is also a problem that such
algorithms have difficulty tackling. To demonstrate this, sequences were built from the event logs of
each trajectory, where each event is handled as a set of items from a transaction and each customer ID
is replaced by the trajectory ID (i.e. truck ID). Using open-source software for pattern mining [34], the
sequences were explored but, as expected, no patterns were found even with a minimum support
as small as 1%.
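To make the notion of support concrete, the sketch below computes the fraction of sequences that contain a given pattern as an ordered subsequence. It illustrates the textbook definition only and is not the software referenced as [34]; the function names and the sample sequences are hypothetical.

```python
def contains_subsequence(sequence, pattern):
    """True if the items of pattern occur in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # 'item in remaining' advances the iterator, so the order is respected.
    return all(item in remaining for item in pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    hits = sum(contains_subsequence(s, pattern) for s in sequences)
    return hits / len(sequences)


# Invented sequences, one per truck, for illustration only.
sequences = [
    ["Start of Load", "Basic record", "End of Load"],
    ["Start of Drive", "End of Drive"],
    ["End of Load", "Start of Load"],
]
print(support(sequences, ["Start of Load", "End of Load"]))
```

With one long sequence per truck and many event types, most candidate patterns are shared by very few sequences, which is consistent with the empty result reported above.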
Chapter 4
Processing High Volume Logistics Data
In this chapter the two types of data are processed. Firstly, the spatio-temporal data is
used to identify the regions of interest with their corresponding start and end dates. Secondly,
the event log data is mined; activities are identified and extracted according to the type of situation:
normal, cancellation or midnight activity. Then, using the knowledge extracted from both analyses, the data
is merged in order to estimate the duration of each one of the load and unload activities. The acquired information about the activities
is then clustered according to their location so that it can be translated into probability density functions that
express the duration of an activity at a given location. Such results are of great importance for planning
and scheduling applications such as vehicle routing problems with stochastic time windows.
Another possible product of this analysis is the travel times between customers. Although the work of
this thesis is focused on acquiring a proper duration estimate for load and unload activities, it is also
possible to estimate drive activity durations (i.e. travel times) by focusing on the moving portions of the time
line rather than on the stopped ones. This can be useful to obtain probability density functions of travel
times for use in applications that deal with stochastic travel times.
4.1 Spatio-Temporal Data - Trajectories
In previous sections the definitions of trajectory, trajectory links and regions of interest were presented.
Since ROIs are defined in terms of speed there is the need to calculate the average speed at each link.
Using the latitude and longitude of the data points pi+1 and pi of link lj, it is possible to determine the
travelled distance (i.e. the length of the link, Δxj). There are several approaches to this calculation, as
presented in section 1.2. The haversine formula was implemented due to its precision and simplicity when
compared with others (e.g. Pythagoras' theorem, the spherical law of cosines). It is a special case of
the general law of haversines, relating the sides and angles of spherical triangles. It gives the shortest
distance between two points measured along the surface of a sphere, rather than along a straight line through
the sphere's interior. Despite its accuracy, the haversine formula takes into account neither the ellipsoidal
geometry of the Earth nor the elevation due to mountains and other terrain. The explicit
form of equation 1.2 is:
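\Delta x_j = 2r \arcsin\left( \sqrt{ \sin^2\left( \frac{\varphi_i - \varphi_{i-1}}{2} \right) + \cos(\varphi_{i-1}) \cos(\varphi_i) \sin^2\left( \frac{\lambda_i - \lambda_{i-1}}{2} \right) } \right)
\varphi_i = Latitude of p_i
\lambda_i = Longitude of p_i
r = Earth radius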
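A direct implementation of the haversine formula is sketched below. The function name is hypothetical and the mean Earth radius of 6371 km is an assumption, since the value of r used in this work is not stated. Coordinates are expected in degrees.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2, r=EARTH_RADIUS_M):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(h))


# One degree of longitude along the equator is roughly 111 km.
print(haversine_m(0.0, 0.0, 0.0, 1.0))
```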
Once the distance and time between two data points are known, it is possible to calculate the average
speed for each link:
\bar{s}_j = \frac{\Delta x_j}{\Delta t_j}, \quad \text{where } \Delta t_j = t_{j+1} - t_j \qquad (4.1)
Due to the limited accuracy and precision of GPS, the recorded latitude and longitude can differ slightly from the
real ones. GPS error sources can be divided into three classes: satellite dependent errors, propagation
errors and receiver errors [35], [36].
(i) Satellite Dependent Errors - Although satellites are launched into precise orbits, small changes
might occur, caused by gravitational pulls from the moon and sun and by the pressure of solar
radiation on the satellites. The satellites' atomic clocks might also experience noise and clock drift
errors.
(ii) Propagation Errors - Inconsistencies in atmospheric conditions affect the speed of the GPS signals
as they pass through the Earth's atmosphere, causing delays in the signals. GPS signals can also
be affected by the environment, where the radio signals reflect off surrounding terrain, buildings,
canyon walls, hard ground, etc.
(iii) Receiver Errors - Depending on the quality of the GPS receiver, tiny discrepancies between the
receiver clock and GPS time will influence the calculated distances.
According to the U.S. Government information about the Global Positioning System [37], the accuracy
standard for positioning, given at a 95% confidence level, is 17 meters for horizontal errors and 37 meters
for vertical errors. It is also important to take into account that higher accuracy is attainable by using
GPS in combination with augmentation systems, like EGNOS within Europe.
Nevertheless, the simplest way to get the velocity from a GPS receiver is to differentiate the GPS-determined positions with respect to time, since velocity is the first time derivative of position. The problem is that
errors in position are amplified through differentiation. This becomes worse when a high output rate
is used, since the positional error remains the same but the time interval decreases. The outcome is
misleading speed values that, in the case of high output rates, differ greatly from the real speeds.
Figure 4.1 illustrates an example of this. Although the truck is travelling at cruising speed, between
80 and 90 km/h, when the time interval between two samples is extremely small the calculated speed
is unrealistic. In the first occurrence in the figure, the travelled distance between samples is 43 m
with a time interval of 1 second, resulting in a speed of 155 km/h.
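The arithmetic behind this example, and the way a fixed position error weighs more heavily on short intervals, can be reproduced with a few lines. The helper name is hypothetical, and using a single 17 m displacement error (the horizontal accuracy figure quoted above) is a simplifying assumption rather than an error model.

```python
def speed_kmh(distance_m, dt_s):
    """Average speed in km/h for a displacement in metres over dt_s seconds."""
    return distance_m / dt_s * 3.6


# Case quoted in the text: 43 m travelled in 1 s.
print(round(speed_kmh(43, 1), 1))  # 154.8

# Apparent speed produced by a 17 m position error alone, for several intervals.
for dt_s in (1, 10, 60, 900):
    print(dt_s, round(speed_kmh(17, dt_s), 2))
```

Over the 15-minute maximum sampling period the same error is negligible, while over one second it is of the same order as the cruising speed itself.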
[Figure 4.1: Effect of the speed filter on an average speed profile of the international truck, with the filter threshold set to 6. Curves: Original Average Speed and Filtered Average Speed. Axes: average speed [km/h] against time [minutes].]
In addition, there are also data points that are recorded simultaneously, making Δt = 0, and thus
equation (4.1) becomes undefined. In order to overcome these problems a speed filter was created.
The filter sets the average speed of link lj equal to the average speed of link lj−1 when the temporal
distance between the data points is smaller than a user-defined amount of time, the filter threshold. The results can be
seen in figure 4.1.
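\bar{s}_j = \begin{cases} \bar{s}_j & \text{if } \Delta t_j \geq \text{filter threshold} \\ \bar{s}_{j-1} & \text{otherwise} \end{cases} \qquad (4.2)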
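A minimal sketch of this filter is given below. The function name, the use of seconds for the threshold, the fallback of 0 km/h for a first link with no predecessor, and the choice to inherit the already filtered speed of the previous link are all assumptions made for illustration.

```python
def filtered_speeds(distances_m, dts_s, threshold_s):
    """Average link speeds in km/h with the short-interval filter applied.

    A link whose time interval is shorter than threshold_s (or zero)
    inherits the speed of the previous link.
    """
    speeds = []
    for dx, dt in zip(distances_m, dts_s):
        if dt >= threshold_s and dt > 0:
            speeds.append(dx / dt * 3.6)
        else:
            # Assumption: the first link has no predecessor, so fall back to 0.
            speeds.append(speeds[-1] if speeds else 0.0)
    return speeds


# Invented link data: the second link covers 43 m in 1 s, the third has dt = 0.
print(filtered_speeds([2000, 43, 0, 2100], [90, 1, 0, 90], threshold_s=6))
```

Without the filter, the second link would yield about 155 km/h and the third would require a division by zero; with it, both inherit the speed of the first link.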
Figure 4.2 shows the average speed profile obtained, after filtering, for the international truck over
a time span of 10 days. The speed filter proved to be effective, since there are no link speeds above
what is expected for a truck. Despite the lack of precision in the speed values, since they
are derived from data points with an associated error as seen previously, they are reliable enough to
determine candidate regions of interest. By analysing the figure it is possible to see that eight significant
trips were made, represented by the velocity peaks, interleaved with seven major stops. The gap at the
end of the plot is due to a system shutdown during which no data points were recorded.
[Figure 4.2: Average speed profile of the international truck, with the filter threshold set to 6. Axes: speed [km/h] against time [minutes].]
4.1.1 Regions of Interest
As described in section 2.1, a region of interest is a selected subset of trajectory links whose average
speed is below a certain value sbound. Once the average speeds of the trajectory links are known, they
can be classified according to equation 2.1. Since sbound is the boundary that decides whether a trajectory
link is considered part of a region of interest, the value chosen for it strongly influences the
results obtained.
To be able to define sbound, a frequency study was made. The previously calculated average link
speeds were grouped into bins 1 km/h wide. The height of each bin represents the number of occurrences
of a certain speed. The following figures present a set of histograms for various truck IDs as well
as for the whole database.
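The binning step can be sketched as follows. The function name is hypothetical; links with zero speed are left out by default, which mirrors the choice made for the plotted histograms, and each bin is identified by its lower edge in units of the bin width.

```python
import math
from collections import Counter


def speed_histogram(speeds_kmh, bin_width=1.0, skip_zero=True):
    """Count link speeds per bin; keys are bin indices (lower edge / bin_width)."""
    counts = Counter()
    for s in speeds_kmh:
        if skip_zero and s == 0:
            continue  # stopped links would dominate the histogram
        counts[math.floor(s / bin_width)] += 1
    return dict(sorted(counts.items()))


# Invented speeds, for illustration only.
print(speed_histogram([0, 0.4, 0.9, 1.0, 49.5, 0]))  # {0: 2, 1: 1, 49: 1}
```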
[Figure 4.3: Average Speed Histograms with α = 0.5. (a) All trucks; (b) TruckID 1141; (c) TruckID 7234; (d) TruckID 1849.]
Note: Trucks are stopped most of the time and thus the number of occurrences
for s̄ = 0 is huge. Plotting them would make the graphs less readable due to scaling,
hence they were not plotted.
As can be seen, the shape of the histogram is highly dependent on the truck ID. Since the trucks
operate in different areas, with different topologies and speed restrictions, their average speed profiles are
quite different. For instance, the histogram of figure 4.3 (c) belongs to a truck performing tasks at
Amsterdam Airport Schiphol, and because of that it does not exceed the speed limit of 50 km/h. On the other
hand, figure 4.3 (b) plots the speeds of TruckID 1141, previously referred to as the international
truck, which shows two main areas: one at lower speeds, when the truck is moving inside cities, and
another at cruise speeds (80-90 km/h), when the truck is on the highway. Figure 4.3 (d) contains data from a
truck of the Port of Rotterdam and, given the length of the port, this truck makes use of both highway and
national roads (A15 and N15, respectively). Considering that the database is mostly formed by trucks from
Amsterdam Airport Schiphol, the fact that there are many more occurrences of low speeds than of high
speeds was expected; it can be seen in figure 4.3 (a), which shows the histogram of all available data.
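[Figure 4.4: Split regions of interest (right figure) due to a low sbound. Left: a single ROI containing the activity ak ∈ A*; right: the same activity split across ROI1 and ROI2. Horizontal axis: time; the sbound level is marked in both panels.]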
Such diversity makes the choice of the sbound value non-trivial. If a high value is chosen, the duration
of the ROIs will be longer than it is in reality. In particular, trucks whose average speed profile is low, like
TruckID 7234, would show regions of interest with data belonging to moving segments of the trip, due to
the high number of low speed links. On the other hand, small sbound values would lead to a falsely high
number of ROIs with short durations. Owing to the GPS inaccuracies discussed previously, the acquired
position is not always constant when trucks are stopped, leading to the existence of non-zero speeds.
If such speeds are greater than sbound, regions of interest are truncated, leading to a loss of information that is crucial
to estimate activity times. For instance, if an activity is being performed and there is a speed
record greater than sbound, the activity will be split across the time-lines of two regions of interest, see
figure 4.4.
Such behaviours can be seen in figures 4.5a and 4.5b, where each line represents, for a specific truck, the total number of
regions of interest and the total time spent inside ROIs, respectively.
In figure 4.5a, when the sbound value increases, the number of detected ROIs decreases, making their
duration longer, as can be seen in figure 4.5b.
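The trade-off described above can be reproduced on a toy trajectory. In the sketch below a link is taken as stopped when its speed does not exceed sbound; whether equation 2.1 uses a strict or non-strict inequality is not visible in this section, so that choice, the function name and the sample data are assumptions.

```python
from itertools import groupby


def roi_stats(speeds_kmh, dts_min, s_bound):
    """Number of ROIs and total minutes spent inside ROIs for one truck."""
    stopped = [s <= s_bound for s in speeds_kmh]
    n_rois, total = 0, 0.0
    # Each maximal run of consecutive stopped links is one ROI (no minimum duration).
    for is_stopped, run in groupby(zip(stopped, dts_min), key=lambda p: p[0]):
        if is_stopped:
            n_rois += 1
            total += sum(dt for _, dt in run)
    return n_rois, total


# Invented link speeds [km/h], each link lasting 10 minutes.
speeds = [0, 3, 0, 60, 80, 0, 12, 0]
dts = [10.0] * len(speeds)
for s_bound in (2, 5, 15):
    print(s_bound, roi_stats(speeds, dts, s_bound))
```

As sbound grows, short ROIs separated by low non-zero speeds merge: the count drops while the total time inside ROIs rises, which is the pattern described for figures 4.5a and 4.5b.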
[Figure 4.5: Choosing the correct value for the sbound parameter. (a) Number of ROIs per truck against sbound; (b) Total time spent in ROIs per truck [min] against sbound.]
Once the value for sbound is set and the link classification is done according to equation 2.1, regions of
interest are built according to Definition 3. In order not to lose information relative to short truck stops, the
minimum duration T was set to zero. This means that every sequence of trajectory links with stopped
classification represents a region of interest, as figure 4.6 illustrates. In that trajectory, three ROIs were
found: the first is formed by the trajectory links ⟨l1, l2, l3⟩, the second by ⟨l6, l7, l8⟩ and the third by ⟨l10⟩.
[Figure 4.6: An example of extracting regions of interest with T = 0. (a) Regions of interest: a table giving, for each link l1 to l11, its latitude, longitude, time (t1 to t11) and classification; (b) Trajectory graph of links l1 to l11.]
Link classification in panel (a): l1 stopped, l2 stopped, l3 stopped, l4 moving, l5 moving, l6 stopped, l7 stopped, l8 stopped, l9 moving, l10 stopped, l11 moving.
If the focus were on obtaining travel times between customers, the regions of interest would instead be the links
with moving classification: ⟨l4, l5⟩, ⟨l9⟩ and ⟨l11⟩. The travel time would be given by the time
difference between the last data point of the last link and the first data point of the first link.
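The extraction illustrated in figure 4.6 amounts to finding maximal runs of consecutive links with the same classification. The sketch below, with a hypothetical function name, applies this to the eleven links of the example with no minimum duration and recovers both the stopped and the moving groups.

```python
from itertools import groupby


def runs(classes, label):
    """Maximal runs of consecutive links with the given label, as 1-based link indices."""
    out = []
    for key, group in groupby(enumerate(classes, start=1), key=lambda p: p[1]):
        if key == label:
            out.append([index for index, _ in group])
    return out


# Classification of links l1 to l11 as in figure 4.6.
classes = ["stopped"] * 3 + ["moving"] * 2 + ["stopped"] * 3 + ["moving", "stopped", "moving"]
print(runs(classes, "stopped"))  # [[1, 2, 3], [6, 7, 8], [10]]
print(runs(classes, "moving"))   # [[4, 5], [9], [11]]
```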
Now that regions of interest are known, their content can be plotted. By using the speed of the links
and the corresponding recorded events, a sequence plot was made. For the sake of simplicity, only
relevant events were considered.
Figure 4.7a shows a region of interest with a duration of 1.44 hours in which an unload activity takes place. It is
important to notice that while the truck is unloading, recorded speeds reach up to 15 km/h. This is the
reason why sbound has to be kept sufficiently high; otherwise links in the middle of the activity would
be classified as moving and regions of interest would not contain all the information necessary for time
estimation. Figure 4.7c shows that problem: only the “end of load” event is contained inside the ROI,
whose duration is half an hour. In reality, the truck stayed longer in that region but, due to a recorded
speed greater than sbound, that information was kept in a separate region of interest.
Figures 4.7a and 4.7b are both examples of how load and unload events should be introduced by the
user. In the first case, a single unload took place, occupying almost all of the time spent in the
region of interest. In the second case, the user also introduced all events in the expected sequence
and at the expected time. The user finishes the drive, then starts the first load activity. After a certain amount
of time, which is considered to be the duration of the load task, the user introduces the “end of load” event.
The user repeats the process for the second load activity, then starts the driving process and leaves the
region of interest. However, in many situations this is not what happens. For some reason, the events “start
of load/unload” and “end of load/unload” are often introduced only seconds apart from each other,
making the duration of the activities incorrect. Figure 4.7d shows an example of this.
[Figure 4.7: Examples of regions of interest for s_bound = 15 km/h. Each panel plots velocity [km/h] against
time [minutes], annotated with the recorded events. (a) Single activity (TruckID 1141, stop duration 1.44 hours);
(b) Multiple activities (TruckID 1141, stop duration 1.78 hours); (c) Split ROI due to low s_bound (TruckID 1849,
stop duration 0.50 hours); (d) Effect of human behaviour (TruckID 1157, stop duration 1.47 hours).]
4.2 Event Logs
As explained in chapter 2, the STEL algorithm estimates activity durations based on a time-line created
for each region of interest. Such a time-line is composed of the start and end dates of the region of interest
and the start and end dates of the activities that were performed in that time-span. However, event logs
do not provide explicit knowledge about activities. Instead, they represent a sequence of momentary
events with no duration. Such events can be of any type, from warnings to text messages, as seen in
section 3.2. Hence, the activities need to be identified from the event log in order to build
the region of interest time-line. In the analysis made in chapter 3, thirteen types of activities were found
in the database. They are listed below:
• Load
• Gas
• Log Out
• Unload
• Drive
• Log In
• Rest
• Sign Up
• Passage
• Arrive
• Garage
• Wait
• Costs
Events that are not activity related have no relevance for the estimation of activity
durations. This is because such events do not require the presence of a user, since they are
recorded by the system itself. Activity related events can only be recorded by the user directly, as in
a load activity, or by the interaction between the user and the system, as in a log in activity. Hence,
warning events, and events that are not activity related in general, will not be considered.
Each one of the previously listed activities is characterised by a set of events: "Start of activity", "End of
activity", "Cancellation of activity" and "Activity Midnight". Hence, it is possible to classify the activities
into three types of situations:
1. Normal situation - given by a pair of events "Start of activity" and "End of
activity";
2. Cancellation situation - characterised by a "Start of activity" event followed by a "Cancellation of
activity" event;
3. Midnight activity - formed by a set of three events: "Start of activity", "End of activity" and, in
between those events, "Activity Midnight".
For each activity type, the trajectory is scanned in order to find "Start of activity" events.
Once such an event is identified, a second scan is performed to identify the end of the activity. Depending
on the type of situation, the end of an activity can be dictated by an "End of activity" or a "Cancellation of
activity" event. Once the entries that limit the activity are known, the events in between are scanned so that
identifying events, such as "Activity Midnight", are found. An example can be seen in Table 4.1. Each
activity is then formed by a sequence of data points with an associated time-stamp, non-relevant event,
and latitude and longitude coordinates. Activities are defined as the set of data-points between the "Start
of activity" event and the "End of activity" or "Cancellation of activity" event, depending on the situation.
Each one of the data-points has the form of table 3.1.
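The scanning procedure described above can be sketched in a few lines. This is an illustrative sketch under assumptions, not the thesis code: the function name `extract_activities`, the exact event strings (for example "Unload Midnight") and the choice to skip activities with no closing event are all hypothetical.

```python
from datetime import datetime


def extract_activities(events, activity):
    """Scan a time-ordered event list for one activity type.

    `events` is a list of (timestamp, event_type) tuples. Returns dicts with
    the start, the end and the situation: "normal", "cancellation" or
    "midnight". Event names are assumed to follow the patterns below.
    """
    start_name = f"Start of {activity}"
    closing = {f"End of {activity}": "normal",
               f"Cancellation of {activity}": "cancellation"}
    midnight_name = f"{activity} Midnight"

    found = []
    i = 0
    while i < len(events):
        if events[i][1] != start_name:
            i += 1
            continue
        # Second scan: look for the event that closes this activity.
        j = i + 1
        while j < len(events) and events[j][1] not in closing:
            j += 1
        if j == len(events):
            break  # assumption: an activity with no closing event is left out
        situation = closing[events[j][1]]
        # Third scan: look for identifying events between the two limits.
        between = [name for _, name in events[i + 1:j]]
        if situation == "normal" and midnight_name in between:
            situation = "midnight"
        found.append({"start": events[i][0], "end": events[j][0],
                      "situation": situation})
        i = j + 1
    return found


# Unload rows taken from Table 4.1 (intermediate rows shortened).
fmt = "%Y-%m-%d %H:%M:%S"
log = [
    ("2013-05-02 18:24:56", "Start of Unload"),
    ("2013-05-02 18:26:29", "Contact OFF"),
    ("2013-05-02 18:28:53", "Task Finished"),
    ("2013-05-02 18:30:46", "End of Unload"),
]
events = [(datetime.strptime(t, fmt), name) for t, name in log]
unloads = extract_activities(events, "Unload")
print(unloads[0]["situation"], unloads[0]["end"] - unloads[0]["start"])
```

Keeping the situation as a label on each activity lets later steps drop cancellations and treat midnight activities separately without rescanning the log.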
Distinguishing the types of activity situations plays a big role in estimating the
durations of activities. If cancellation situations are not distinguished from normal situations, the algorithm
takes them as activities that occurred, marking the time-span between the "start of activity" and "cancellation
of activity" events as occupied. Since it is unclear what is happening in a cancellation situation, those
activities are not part of the time-line. The classifier "Midnight activity" is used to distinguish activities that
were performed during the night. Their duration patterns differ from those of activities performed during
the day due to workforce differences. Thus, to obtain accurate predictions, they must not be mixed with
the rest of the activities.
Truck ID | Date & Hour | Event type | Activity
1141 | 2013-05-02 12:57:52 | Navigation ETA update |
1141 | 2013-05-02 13:57:55 | Contact ON |
1141 | 2013-05-02 14:57:58 | Start of Break | Break
1141 | 2013-05-02 15:21:41 | Cancellation of | Break
1141 | 2013-05-02 15:21:45 | End of Break | Break
1141 | 2013-05-02 15:21:46 | Start of |
1141 | 2013-05-02 15:21:46 | Cancellation of |
1141 | 2013-05-02 15:22:21 | Start of Drive | Driving
1141 | 2013-05-02 15:22:45 | Driving times state event | Driving
1141 | 2013-05-02 15:22:45 | Basic record | Driving
1141 | 2013-05-02 15:23:15 | End of Drive | Driving
1141 | 2013-05-02 15:23:15 | Start of |
... | ... | ... | ...
1141 | 2013-05-02 18:23:34 | Task Busy |
1141 | 2013-05-02 18:23:55 | Cancellation of |
1141 | 2013-05-02 18:24:56 | Start of Unload | Unloading
1141 | 2013-05-02 18:26:29 | Contact OFF | Unloading
1141 | 2013-05-02 18:27:31 | Driving times state event | Unloading
1141 | 2013-05-02 18:27:46 | Basic record | Unloading
1141 | 2013-05-02 18:28:52 | Contact ON | Unloading
1141 | 2013-05-02 18:28:53 | Task Finished | Unloading
1141 | 2013-05-02 18:30:46 | End of Unload | Unloading
Table 4.1: Example of the identification of activities from the event log data
Once the activities are identified, as in table 4.1, and classified according to the situation, an activity table is
built for each trajectory. Previously identified activities, formed by a set of data-points, are transformed
into an atomic representation composed of a single data-point. The following paragraph explains the
process.
[Figure 4.8: Averaging latitude and longitude coordinates to obtain a single point representation of the
activity location. Axes: longitude [°] vs. latitude [°]; legend: all data-points, average longitude and latitude.]
With the exception of the drive activity, all activities are supposed to be performed with the truck stopped,
since interaction between the user and the logging system is needed. Consequently, the
latitude and longitude of the data points of such activities are very similar and can be averaged to
represent the actual activity location. Figure 4.8 is an example of the averaging process for a load
activity: although the truck is stopped when loading, the limited precision of the GPS means that the points
are not recorded at exactly the same location, so it is assumed that the activity takes place at the
average location of the data-points that form the activity. The activity table is then built by using
the record time of the events "Start of activity" and "End of activity", the averaged latitude and longitude
and the activity type, as in table 4.2. The activity duration can be calculated as the difference between
the end of activity date and the start of activity date. All cancellation situations are assumed to be real,
and thus they are not part of the activity table.
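As a minimal sketch of this atomic representation, the data-points of one activity can be collapsed into a single row. The function name `to_activity_row`, the dictionary keys and the sample coordinates are hypothetical and only illustrate the averaging step.

```python
from datetime import datetime
from statistics import mean


def to_activity_row(points, activity):
    """Collapse the data-points of one activity into a single table row.

    `points` is a time-ordered list of (timestamp, latitude, longitude)
    tuples running from the start event to the end event.
    """
    start, end = points[0][0], points[-1][0]
    return {
        "start": start,
        "end": end,
        "latitude": mean(p[1] for p in points),    # averaged GPS latitude
        "longitude": mean(p[2] for p in points),   # averaged GPS longitude
        "activity": activity,
        "duration_min": (end - start).total_seconds() / 60.0,
    }


# Illustrative data-points: slightly scattered GPS fixes of a stopped truck.
points = [
    (datetime(2013, 5, 2, 18, 24, 56), 50.0266, 8.5674),
    (datetime(2013, 5, 2, 18, 27, 46), 50.0268, 8.5676),
    (datetime(2013, 5, 2, 18, 30, 46), 50.0264, 8.5672),
]
row = to_activity_row(points, "Unload")
print(row["activity"], round(row["duration_min"], 2))
```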
Table 4.2: Example of activity table
Start Date & Hour | End Date & Hour | Latitude | Longitude | Activity
2013-05-02 14:57:58 | 2013-05-02 15:21:45 | 50.89785 | 7.15316 | 'Break'
2013-05-02 15:22:21 | 2013-05-02 15:23:15 | 50.38141 | 7.94855 | 'Drive'
... | ... | ... | ... | ...
2013-05-02 18:24:56 | 2013-05-02 18:30:46 | 50.02661 | 8.56742 | 'Unload'
Once the activity table for the trajectory is known, the activities are assigned to the corresponding regions
of interest according to equation 2.4. This allows the creation of the so-called activity time-line for the
regions of interest. By looking at such time-lines and analysing the activities performed in them, it is possible to
clearly differentiate between rest stops, traffic stops and work stops. If no activities are performed, and
the duration of the region of interest is relatively small, it is likely to be a traffic stop. On the other hand,
if there are activities such as rest or load/unload, the type of truck stop being dealt with is clear. The
duration of such truck stops can also be of interest to the companies involved.
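The distinction between stop types can be illustrated with a small rule-based sketch. The text gives no numeric limit for a "relatively small" duration, so the 15 minute default below is a purely hypothetical parameter, as are the function name and the returned labels.

```python
def classify_stop(roi_duration_min, activities, traffic_max_min=15.0):
    """Label a region of interest from the activities on its time-line.

    `activities` lists the activity types found in the ROI.
    `traffic_max_min` is a hypothetical threshold, not a value from the thesis.
    """
    kinds = {a.lower() for a in activities}
    if kinds & {"load", "unload"}:
        return "work stop"
    if "rest" in kinds:
        return "rest stop"
    if not kinds and roi_duration_min <= traffic_max_min:
        return "traffic stop"
    return "undetermined"


print(classify_stop(4.0, []))                  # traffic stop
print(classify_stop(90.0, ["Drive", "Load"]))  # work stop
print(classify_stop(600.0, ["Rest"]))          # rest stop
```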
4.3 Activity Duration Estimation
Event log data is vulnerable to uncertainty. In fact, that is the biggest drawback when dealing with data
sets that are human dependent. On the one hand, the introduced data is subject to human error, making
the existence of the activities unclear. Although it is possible to cancel an activity after starting it
wrongly, there is no assurance that the users adopted that behaviour. Rather, they can simply end
the activity, and it becomes impossible to trace whether the activity actually existed or not. On the other hand,
events should be introduced in coherence with reality: as soon as the activity starts and ends. However,
in many situations that does not happen. It is common practice for users to introduce the events
"Start of activity" and "End of activity" within a few seconds of each other, making the recorded duration of the
activities unrealistic. An example is shown in figure 4.9a. The black bar represents the region of interest: its length
is the total duration of the truck "stop". Each additional bar represents an activity performed inside the
region of interest. Load and Unload activities are highlighted with a dotted line.
By analysing figure 4.9a it is possible to tell that the region of interest (black bar) begins slightly
before the end of the driving activity (blue bar). The sign up process follows, and then
another driving activity, possibly to the load location, after which a load task is completed. The user drives
again, completes another load task and starts driving away; the region of interest ends. In terms of the
sequence in which the events were introduced, everything is ordinary. However, the duration of the load
activities is illusory. Considering the empty time between the second driving activity and the
first load activity, the most probable scenario is that the user, after arriving at the load site, performed the
actual load activity and only then introduced the corresponding events. By contrast, figure 4.9b is a
case of a truck stop whose events are more likely to have been introduced in coherence with reality.
As can be seen, the whole time-line of the region of interest is filled with activities.
[Figure 4.9: ROI activity time-lines examples. Horizontal axis: duration [hours]. (a) Wrongly introduced
activities (bars: STOP, Drive, Sign Up, Load); (b) Correctly introduced activities (bars: STOP, Drive, Sign Up,
Rest, Unload).]
For planning and scheduling purposes it is extremely important to have precise knowledge about load
and unload durations at each location. Such knowledge can be achieved if the uncertainty present in the
data is overcome. Nevertheless, the uncertainty about the existence of the activities will be disregarded:
every pair of events "start of activity" and "end of activity" will be considered a performed activity. The
uncertainty that is going to be dealt with is the one related to the lack of coherence between the dates
at which events are recorded and reality.
Recalling the event analysis made in chapter 3, most of the activities are stop related tasks, since
interaction between the driver and the system is needed. Hence, by joining the two types of data,
spatio-temporal and event logs, it is possible to estimate the duration of load and unload activities.
Nevertheless, some general assumptions need to be made. It is assumed that:
i) events are totally ordered (i.e., in the log events are recorded sequentially even though tasks may
be executed in parallel);
ii) users did not record any event by mistake;
iii) the time-span of a cancelled activity is considered empty (i.e., if an activity is cancelled, it did not
exist);
iv) all activities apart from driving take place inside a region of interest;
v) all activities, besides load and unload, were introduced in coherence with reality.
The STEL algorithm uses the duration of stay within the ROI, from the spatio-temporal data, in conjunction
with the activity start and end times, from the event log data, to estimate more reliable
load/unload durations. The estimation is done by "stretching" the activity blocks, based on the empty time
available in the neighbourhood of such activities, see figure 2.5. By doing this, the obtained load/unload
activity is formed of three parts: one reliable portion and two possibilistic portions. The reliable part
is given by the original activity block, delimited by the "start of activity" and "end of activity" events.
The possibilistic portions are the stretched parts of the block, where there is uncertainty about
the actual start or end of the activity. Figure 4.12 illustrates this.
[Figure 4.10: Empty time: the time available in the neighbourhood of an activity in the ROI time-line.
Legend: activity whose time is to be estimated (a_k ∈ A*); activity whose time is not to be estimated (a_k ∉ A*).
The ROI time-line marks the instants t_j and t_n.]
At this point, two different situations arise: estimating the duration of a single load/unload activity, or
estimating the durations of consecutive load/unload activities. It would not be reasonable to estimate the
durations of multiple activities in the same way as the duration of a single activity. Since it is a sequential
process, the estimation of the first activity of the group would constrain the estimation of the following
activities. When both the start and end dates of the first activity are stretched, the gap that previously
existed between the activities disappears, constraining the start date of the following activity to its original
position. Hence, single activity and multiple activity situations have to be identified.
[Figure 4.11: Different types of estimation cases on the ROI time-line. (a) Single activity case: one LOAD
between other activities; (b) Multiple activities case: LOAD, LOAD and UNLOAD between other activities.
Legend: activity whose time is to be estimated; activity whose time is not to be estimated.]
4.3.1 Single Activity
In the case of a single activity, the estimation of the duration is straightforward. If no other
activities are present in the region of interest, the activity whose duration is to be estimated is assumed
to have the same duration as the region of interest, see figure 4.12a. When the activity is the first
activity of the region of interest, its duration is given by the time span from the beginning of the region
of interest until the start of the following activity, see figure 4.12b. The same applies to activities
placed at the end of a region of interest: the duration is given by the time difference between the
end of the previous activity and the end of the region of interest, see figure 4.12c. The estimated duration
of activities that lie between other activities is given by the time elapsed between the end of the prior
activity and the start of the following activity, see figure 4.12d.
Only single load and unload activities whose duration is smaller than or equal to ε have their duration
estimated; otherwise, the original duration of the activity is kept. ε is a user defined constant that sets the
boundary between the activity duration being estimated or not.
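The four single activity cases can be covered by one small routine, sketched below. It is an illustration under assumptions rather than the thesis implementation: times are plain numbers (for example minutes), activities are (start, end) pairs, and the name `estimate_single` is hypothetical.

```python
def estimate_single(roi, activity, others, eps):
    """Return the estimated (start, end) of a single load/unload activity.

    `roi` and `activity` are (start, end) pairs; `others` lists the
    (start, end) pairs of the remaining activities in the ROI.
    """
    start, end = activity
    if end - start > eps:
        return activity  # recorded duration above eps is kept as it is
    # Stretch back to the end of the previous activity, or to the ROI start.
    new_start = max([e for _, e in others if e <= start] + [roi[0]])
    # Stretch forward to the start of the next activity, or to the ROI end.
    new_end = min([s for s, _ in others if s >= end] + [roi[1]])
    return (new_start, new_end)


roi = (0, 100)
load = (40, 41)  # a one-minute load recorded inside a 100 minute stop
print(estimate_single(roi, load, [], eps=5))                     # (0, 100)
print(estimate_single(roi, load, [(50, 100)], eps=5))            # (0, 50)
print(estimate_single(roi, load, [(0, 30)], eps=5))              # (30, 100)
print(estimate_single(roi, load, [(0, 30), (60, 100)], eps=5))   # (30, 60)
```

The four calls correspond, in order, to the cases where the activity is the only one, the first, the last, and in between other activities.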
[Figure 4.12: Single activity situations: different cases when estimating the duration of a single activity.
Each panel shows the recorded ROI time-line (left) and the estimated one (right). (a) Activity is the only
activity in the ROI; (b) Activity is the first activity in the ROI; (c) Activity is the last activity in the ROI;
(d) Activity is in between other activities.]
Figure 4.12 shows all the possible cases when estimating the duration of single activities. The right
side of the figure presents the proposed solution for the cases on the left. The green rectangle
without the gradient represents the original timespan in which the activity was registered. The gradient
filled rectangles are the possibilistic areas in which the start or end of the activity may have
happened.
4.3.2 Multiple Activities
When facing a situation with multiple load/unload activities, as in figure 4.11b, a different approach has to be
taken. In this case, rather than using only the empty time in the neighbourhood of such activities, the
whole interval between two other activities, whose duration is not to be estimated, is used. Figure 4.13
illustrates such an interval.
[Figure 4.13: Multiple activities situation: defining the interval for estimation. The interval spans the
LOAD, LOAD and UNLOAD activities between two other activities on the ROI time-line.]
Activities whose original duration is equal to or smaller than ε are referred to as short activities.
Conversely, activities whose original duration is bigger than ε are referred to as long activities.
[Figure 4.14: Legend: short activities (activity duration ≤ ε), long activities (activity duration > ε)
and other activities (activities whose time is not to be estimated).]
Several hypotheses can be formulated to estimate the durations of multiple load and unload activities. Depending on the assumptions made, different results can be achieved.
When there are multiple load and/or unload activities without any other activity in between, it implies that
such activities are being performed at the same location, hence at the same customer. Thus, the load and
unload activities of a group, regardless of their original durations, can be assumed to have the same duration.
Such a duration can be obtained by simply dividing the total time interval by the number of activities present
in it. The obtained result is the one in figure 4.15 and the duration is given by:
\[ \text{new duration} = \frac{\text{interval}}{\#\,\text{of activities}} \tag{4.3} \]
[Figure 4.15: Hypothesis 1. ROI time-line with LOAD, LOAD and UNLOAD between other activities, before and
after the estimation.]
Another possible approach is to assume that the activities whose duration is bigger than ε, the long activities,
are in coherence with reality. In this way, similarly to what is done for the single activity case, the duration is
only estimated for the short activities, using equation 4.4. The duration of the long load
and unload activities, in blue, is kept constant and the short activities, in green, are stretched to fill the
empty space of the interval.
\[ \text{short activities duration} = \frac{\text{interval} - \sum \text{long activities durations}}{\#\,\text{of short activities}} \tag{4.4} \]
[Figure 4.16: Hypothesis 2. ROI time-line with LOAD, LOAD and UNLOAD between other activities, before and
after the estimation.]
In the previous approaches all activities were considered and assumed to have happened. In many cases
of multiple load and/or unload activities, it was noticed that a long
activity is commonly followed by one or two short activities, or the other way around. Such a pattern can be taken as
an indicator of activities that were recorded by mistake by the user. The long activity recorded represents
the load/unload activity actually performed.
To emulate such situations, two hypotheses are proposed: i) the short events are simply dismissed and
the long activities keep the same duration, see figure 4.17; or ii) the short events are dismissed and
the duration of the long activities is estimated based on the available time interval, see figure 4.18. In the latter
case, the new duration for the long activities is given by:
\[ \text{long activities duration} = \frac{\text{interval}}{\#\,\text{of long activities}} \tag{4.5} \]
[Figure 4.17: Hypothesis 3. ROI time-line before and after the short activities are dismissed.]
[Figure 4.18: Hypothesis 4. ROI time-line before and after the short activities are dismissed and the long
activities are stretched.]
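The four hypotheses can be compared side by side with a short sketch. It assumes that all durations share one unit, that hypothesis 2 subtracts the total duration of the long activities from the interval before dividing it among the short ones, and that dismissed activities are represented by `None`; the function name `estimate_group` is hypothetical.

```python
def estimate_group(interval, durations, eps, hypothesis):
    """Estimate durations for a group of consecutive load/unload activities.

    `interval` is the time between the two enclosing activities and
    `durations` the recorded durations. Dismissed activities map to None.
    """
    short = [d <= eps for d in durations]
    n_short = sum(short)
    n_long = len(durations) - n_short
    long_total = sum(d for d, s in zip(durations, short) if not s)

    if hypothesis == 1:    # equation 4.3: equal share for every activity
        return [interval / len(durations)] * len(durations)
    if hypothesis == 2:    # equation 4.4: only short activities are stretched
        share = (interval - long_total) / n_short if n_short else 0.0
        return [share if s else d for d, s in zip(durations, short)]
    if hypothesis == 3:    # short dismissed, long keep their recorded duration
        return [None if s else d for d, s in zip(durations, short)]
    if hypothesis == 4:    # equation 4.5: short dismissed, long stretched
        share = interval / n_long if n_long else None
        return [None if s else share for s in short]
    raise ValueError("hypothesis must be 1, 2, 3 or 4")


# One 30 minute activity and two 5 minute activities in a 60 minute interval.
for h in (1, 2, 3, 4):
    print(h, estimate_group(60, [30, 5, 5], eps=5, hypothesis=h))
```

With these inputs hypothesis 1 gives 20 minutes to every activity, whereas hypothesis 2 keeps the 30 minute activity and gives 15 minutes to each short one.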
4.4 Clustering Locations
At this point each load and unload activity has an estimated duration and it is known where the activity took place.
Still, it is not possible to predict a load or unload activity duration for a specific customer. In order for
the extracted knowledge to be used in software applications for transportation planning, the information must
be categorized by location so that it is possible to obtain probability density functions of the estimated
service times for specific customers. This can be done using customer locations. Each customer has
a fixed location where trucks perform load and unload activities; hence, estimated service times can be
clustered according to their latitude and longitude.
In section 4.2 an activity was initially defined as the set of consecutive data points between a
"Start of activity" event and an "End of activity" event. When passing to its atomic representation, each
activity is given a specific location by averaging the locations of the activity data-points. Figure 4.19 shows
such locations for load activities performed at Schiphol Airport, where each circle represents a load
activity.
[Figure 4.19: Mean locations of load activities in Schiphol Airport. Axes: longitude [°] vs. latitude [°];
legend: mean load locations.]
Since there is no a priori information about customer locations, load and unload activities cannot be
linked to customers. However, it seems reasonable to assume that such load and unload activities are
performed in the surroundings of a customer location. By using clustering algorithms, as explained
previously in section 1.3, the activities can be grouped according to their latitude and longitude. Each
group will then represent a customer. Having no information about the number of customers present
in the data set, it is not possible to pre-define a number of clusters. Hence, hierarchical clustering is
used. This method is based on the idea that objects that are nearby are more related than those that
are farther away.
After gathering the necessary information from each trajectory, load and unload activities are filtered
from the activity table 4.2. Then, for each type of activity, in this case load and unload, the clustering
is performed. Each load and unload activity is seen as an object with two coordinates: latitude and
longitude.
This clustering method consists of three main steps. Firstly, the similarity between every pair of objects
in the data set is computed by evaluating the Euclidean distance, which measures the ordinary straight line
distance between two objects. Since both variables, latitude and longitude, are measured in the same
units, no normalization is used. Secondly, using the distance information acquired previously, clusters
are grouped in a bottom-up fashion (agglomerative clustering) based on the cluster that contains the
closest pair of objects - single linkage. Those newly formed clusters are then linked to each other,
by using an average distance between clusters, until every object in the data set is linked. A cluster
can be described largely by the maximum distance needed to connect parts of the cluster. At different
distances, different clusters will be formed. This creates a hierarchical tree, known as a dendrogram,
where the horizontal axis represents the indices of the objects and the vertical axis indicates the distance
between the objects. The height of each "U" represents the distance between the two data points being
connected. Figure 4.20 shows a hierarchical tree for the unload activities at Schiphol Airport using the
Euclidean distance as measure and single linkage.
[Figure 4.20: Dendrogram for unload activities at Schiphol Airport. Horizontal axis: object indices (30 leaves);
vertical axis: linkage distance, ranging up to about 12 ×10^-3 degrees.]
To partition the data into the desired clusters, the branches of the hierarchical tree are pruned at a
specific value. This assigns all the objects below each cut to a single cluster. Such a value, in this case,
is given by the maximal distance between two activities for them to be considered part of the same customer.
Since the clustered variables, latitude and longitude, are in degrees, the cut-off value has to be set
accordingly. For the assumption that each cluster represents a customer to be reasonable, the value
at which the hierarchical tree is cut must be small. By setting it to 0.0005 degrees, approximately 56
meters, the results obtained for the load activities at Schiphol Airport are shown in figure 4.21. In this
way, it is assumed that customers are at least 56 meters apart. However, it is possible to perform
the following studies at different granularities. If one desired to obtain the distribution of activity times
for a region, like Schiphol Airport, instead of a single customer, the cut-off value would simply have
to be set higher. That is one of the biggest advantages of using hierarchical clustering: it makes it
possible to analyse the data at various levels of granularity.
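A minimal, dependency-free sketch of the cut is given below. It assumes pure single linkage with Euclidean distance on raw degrees: cutting a single-linkage tree at height h yields the connected components of the graph that joins every pair of points no more than h apart, so a union-find over those pairs gives the same partition. The function name `cluster_by_cutoff` and the sample coordinates are illustrative only.

```python
from math import hypot


def cluster_by_cutoff(points, cutoff):
    """Label (lat, lon) points by cutting a single-linkage tree at `cutoff`."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (lat_i, lon_i), (lat_j, lon_j) = points[i], points[j]
            if hypot(lat_i - lat_j, lon_i - lon_j) <= cutoff:
                parent[find(i)] = find(j)  # merge the two clusters

    roots = {}
    return [roots.setdefault(find(i), len(roots) + 1) for i in range(len(points))]


# Illustrative activity locations (not taken from the data set).
points = [(52.2950, 4.7650), (52.2952, 4.7652), (52.2955, 4.7654), (52.2990, 4.7700)]
print(cluster_by_cutoff(points, cutoff=0.0005))  # [1, 1, 1, 2]
print(cluster_by_cutoff(points, cutoff=0.01))    # [1, 1, 1, 1]
```

The first and third points are farther apart than the cut-off but still share a cluster because each is close to the second point, which is the chaining behaviour typical of single linkage. Raising the cut-off merges everything into one cluster, matching the coarser, region-level granularity mentioned above.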
[Figure 4.21: Load activities from Schiphol Airport clustered (number of clusters = 38). Axes: longitude [°]
vs. latitude [°]. (a) Load clusters at Schiphol Airport; (b) Load clusters at Schiphol Airport, zoom 1;
(c) Load clusters at Schiphol Airport, zoom 2.]
Different clusters are shown in different colours and each circle represents a load activity. As can be
seen in figure 4.21c, despite customers being very close to each other, the clustering method distinguishes
them successfully. In this area of the airport it is possible to identify six areas where load activity is
particularly dense. Such areas are possible customer locations.
Once the coordinates of the load and unload activities are clustered, the data is rearranged according to
the cluster number. Load and unload activities are assigned to their corresponding cluster together
with their information: the coordinates where the activity was performed, the identification of the truck, the type
of activity and its duration. A table with the activities is then built for each cluster, and it represents the
activity information of a hypothetical customer.
Latitude | Longitude | Truck ID | Activity Type | Activity Duration [min]
52.29663 | 4.74278 | 1141 | Unload | 61.73
52.29837 | 4.74696 | 1148 | Load | 121.03
52.29811 | 4.74633 | 1153 | Load | 80.88
... | ... | ... | ... | ...
52.29817 | 4.74666 | 1157 | Unload | 100.83
Table 4.3: Customer table: the table contains the data from activities performed in the customer area
Chapter 5
Real World Examples
This chapter shows the results of the STEL algorithm for the previously described data. To evaluate
the obtained results, the analysis is focused on the load and unload activities performed at the KLM
Cargo site, at Amsterdam Airport Schiphol, and at the port of Rotterdam. These locations were
chosen so that the ability of the STEL algorithm to handle different types of logistics is demonstrated.
Some results are also shown for a single truck case, where the algorithm is able to produce a complete
activity time-line for the whole trajectory.
For the estimated durations of load and unload activities, the results are shown in the form of histograms.
Each bar has a width of five minutes and its height represents the percentage of load/unload activities
whose duration falls in a given interval.
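The construction of such histograms can be sketched as follows. The function name `duration_histogram` and the sample durations are illustrative; the only property taken from the text is the five minute bin width with heights expressed as percentages.

```python
from collections import Counter


def duration_histogram(durations_min, bin_width=5.0):
    """Return {bin start in minutes: percentage of activities in that bin}."""
    counts = Counter(int(d // bin_width) for d in durations_min)
    total = len(durations_min)
    return {k * bin_width: 100.0 * n / total for k, n in sorted(counts.items())}


# Illustrative durations in minutes (not taken from the data set).
durations = [1, 3, 4, 12, 13, 27, 31, 33, 58, 62]
hist = duration_histogram(durations)
for bin_start, pct in hist.items():
    print(f"{bin_start:5.1f}-{bin_start + 5.0:5.1f} min: {pct:.0f}%")
```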
5.1 Amsterdam Airport Schiphol
Figure 5.1 shows the location of the analysed cluster representing the KLM Cargo site. As can be seen
in figure 5.1b, most of the load activities are concentrated in the parking lot area where the trucks are
stopped.
[Figure 5.1: Load activities from Schiphol Airport. Axes: longitude [°] vs. latitude [°]. (a) Location of the
KLM Cargo site cluster (legend: All Loads, KLM Cargo Loads); (b) KLM Cargo site satellite view and load
locations (legend: KLM Cargo Loads).]
The original load and unload durations for this location are shown in figure 5.2. As can be seen, a
large share of the load and unload activities, 45% and 28% respectively, originally falls in the first
five minute interval. This clearly illustrates the problem of human influence in this type of event log.
Such behaviour has to be taken into account, since it is not reasonable to accept those load and unload
durations as real. Instead of introducing the start and end of activity events when they actually
occur, users often introduce both events before or after accomplishing the load or unload task.
This results in a sequence of start and end of activity events that are only seconds apart.
Hence, the duration cannot be calculated simply as the time difference between such events.
To overcome this problem, the STEL algorithm combines the event log data with the spatio-temporal data.
From the spatial coordinates and time-stamps of each data point it is possible to calculate the average
speed of the truck between two data points. By evaluating the average speed it is possible to identify
areas in which the truck is stopped or moving at slow speed in the surroundings of a customer. Once such
areas, the regions of interest, are known, a time-line can be built for each ROI. Truck regions of interest
will then be described by a start and end date, from the spatio-temporal data, and by a sequence of events
from the event log. Activities that occurred inside the region of interest can be extracted from the event
sequence, making it possible to estimate the duration of load and unload activities as described in section
4.3.
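The speed computation can be sketched as follows. The text here does not state which distance formula is used, so the haversine great-circle distance below is an assumption, as are the function names, the comparison `<=` against the bound and the sample coordinates; the 15 km/h default mirrors the s_bound value quoted for Figure 4.7.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def average_speed_kmh(p, q):
    """p, q: (time in seconds, latitude, longitude), with q recorded after p."""
    hours = (q[0] - p[0]) / 3600.0
    if hours <= 0:
        raise ValueError("data points must be strictly time-ordered")
    return haversine_km(p[1], p[2], q[1], q[2]) / hours


def classify_link(p, q, s_bound=15.0):
    """Label the link between two data points as stopped or moving."""
    return "stopped" if average_speed_kmh(p, q) <= s_bound else "moving"


# Illustrative data points one minute apart.
p = (0, 52.3030, 4.7500)
q_near = (60, 52.3031, 4.7500)   # about 11 m further north
q_far = (60, 52.3130, 4.7500)    # about 1.1 km further north
print(classify_link(p, q_near), classify_link(p, q_far))  # stopped moving
```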
[Figure 5.2: PDF of the original activity durations from the KLM Cargo site. (a) Original load durations of
the KLM Cargo site; (b) Original unload durations of the KLM Cargo site.]
The following figures present the estimated load and unload durations for the same cluster,
using the hypotheses formulated in section 4.3.2 for the estimation of multiple activity durations.
In all cases, the duration of single activities is estimated according to the process described in section 4.3.1.
[Figure 5.3: PDF of the estimated activity durations from the KLM Cargo site using hypothesis 1.
(a) Estimated load durations; (b) Estimated unload durations.]
Figures 5.3a and 5.3b show the estimated load and unload durations, respectively, for the proposed
hypothesis 1: assuming that the durations of the activities in a group are equal. As can be seen,
the obtained histogram is quite different from the original one and more plausible. 45% (28%) of
the load (unload) activities performed at the KLM Cargo site had an original duration between 0 and
5 minutes. Using the proposed hypothesis 1, that percentage drops to about 4.5% (5.5%). For both
load and unload activities, the overall shape of the histogram is kept, despite the increase in the
percentages for durations above 5 minutes. This is because, for multiple load and/or unload
activities, the durations of the activities in each group are estimated equally.
(a) Estimated load durations
(b) Estimated unload durations
Figure 5.4: PDF of the estimated activity durations from KML Cargo site using
hypothesis 2
The proposed hypothesis 2, however, results in a histogram that is relatively skewed to the right. As
stated previously, it is common for multiple activity situations to be composed of a long activity
and one or more short activities. This, together with the fact that in hypothesis 2 only the durations of
short activities are estimated, results in a more constrained calculation. For instance, if a 1 hour interval
contains one long activity of 30 minutes and two short activities of 5 minutes each, the resulting
duration for each short activity would be 15 minutes, since 30 minutes are already "taken" by the long activity.
In contrast, with hypothesis 1 those same short activities would be estimated at 20 minutes each, since
all activities are estimated equally. For this reason, although the original percentages of load and unload
activities in the range of 0 to 5 minutes go down to about 10% and 8%, respectively, the drop is not
as pronounced as when using hypothesis 1.
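The worked example above can be reproduced with a short sketch. It encodes only a simplified reading of hypotheses 1 and 2 as they are described in this chapter (the full formulation is in section 4.3.2 and may include details omitted here); the function names and the `short_threshold_min` parameter are hypothetical.

```python
def estimate_h1(interval_min, logged_durations_min):
    """Hypothesis 1 (simplified): every activity in the group gets an equal share."""
    n = len(logged_durations_min)
    return [interval_min / n] * n


def estimate_h2(interval_min, logged_durations_min, short_threshold_min):
    """Hypothesis 2 (simplified): long activities keep their logged duration;
    short activities share the remaining time equally."""
    is_short = [d <= short_threshold_min for d in logged_durations_min]
    n_short = sum(is_short)
    if n_short == 0:
        return list(logged_durations_min)
    long_total = sum(d for d, s in zip(logged_durations_min, is_short) if not s)
    share = (interval_min - long_total) / n_short
    return [share if s else d for d, s in zip(logged_durations_min, is_short)]


# One 60 minute interval with a 30 minute activity and two 5 minute activities.
logged = [30, 5, 5]
h1 = estimate_h1(60, logged)
h2 = estimate_h2(60, logged, short_threshold_min=5)
```

Here `h1` assigns 20 minutes to every activity, while `h2` keeps the 30 minute activity and assigns 15 minutes to each short one, matching the figures quoted in the text.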
(a) Estimated load durations
(b) Estimated unload durations
Figure 5.5: PDF of the estimated activity durations from KML Cargo site using
hypothesis 3
(a) Estimated load durations
(b) Estimated unload durations
Figure 5.6: PDF of the estimated activity durations from KML Cargo site using
hypothesis 4
In both previous hypotheses, all the original load and unload activities are kept, regardless of their original
duration. So the total number of activities is the same in figures 5.2, 5.3 and 5.4. The same cannot
be said about hypotheses 3 and 4. Activities that are part of a group (i.e. a multiple activity situation)
and whose duration is smaller than or equal to ε are considered non-existent, as explained in section
4.3.2. This results in a smaller total number of load and unload activities.
When comparing hypothesis 3 with hypothesis 4, the percentage of load and unload activities with
duration in the 0 to 5 minutes range is exactly the same, since in both cases short activities within a
group are excluded. The difference in percentages becomes noticeable for activities whose duration
is longer than 50 minutes. The reason is that in hypothesis 4, after the short activities
are excluded, the long activities have their durations extended to fill the empty space in the interval,
resulting in higher means and medians. See figures 5.5 and 5.6.
The results presented so far refer to the durations of single load and unload activities at the
KML Cargo site. However, it is known that trucks often perform more than one load/unload activity at a
customer. It would be useful to have the expected service time at each customer rather than the expected
time for a single load/unload activity. Hence, by summing the durations of the load and unload activities that
are performed in one visit of the truck to the customer, the service time at that customer is obtained.
Using the original activity durations from the KML Cargo site, the obtained service times are the ones
in figure 5.7. Here, load and unload activities are not distinguished since, in a service, the same truck
can perform both activities. As expected, since the original durations are being used, most of
the service times fall in the 0-5 minutes interval.
Figure 5.7: PDF for the original service times from KML Cargo site
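The service time computation described above amounts to a grouped sum. The sketch below assumes each activity record carries a visit identifier; the `visit_id` field and the record layout are hypothetical and only serve the illustration.

```python
from collections import defaultdict


def service_times(activities):
    """Sum load and unload durations per visit.

    activities: iterable of (visit_id, activity_type, duration_min) tuples.
    Load and unload are deliberately not distinguished, as in the text.
    """
    totals = defaultdict(float)
    for visit_id, _activity_type, duration_min in activities:
        totals[visit_id] += duration_min
    return dict(totals)


example = [
    ("visit-1", "load", 20),
    ("visit-1", "unload", 15),
    ("visit-2", "load", 30),
]
per_visit = service_times(example)
```

The resulting per-visit totals are the values from which a service time histogram such as figure 5.7 would be built.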
When using the estimated load and unload durations from the various hypotheses, the service times get
longer since the estimated activity durations are longer as well. With hypotheses 1 and 2, the service time
estimations are exactly the same. This was expected, since in both hypotheses the
time interval available for estimation is completely filled, whether the estimation is done equally on every
activity (hypothesis 1) or only on the short activities (hypothesis 2). Hypotheses 3 and 4
show shorter service times since the short activities are eliminated. Nevertheless, the results are quite
similar no matter which hypothesis is used.
(a) Estimated service times using hypothesis 1
(b) Estimated service times using hypothesis 2
(c) Estimated service times using hypothesis 3
(d) Estimated service times using hypothesis 4
Figure 5.8: PDF of the estimated service times for KML Cargo
5.2 Port of Rotterdam
In contrast to the Amsterdam Schiphol Airport, in the port of Rotterdam the number of load and unload
activities present in the data is very limited, as shown in figure 5.9. Apart from that, there are significant
differences between airport and harbour logistics. In airports, load and unload tasks are carried out at
specific sites, where the logistic warehouses are located. This makes the activity locations highly
concentrated in such areas, making it possible to identify the clients' warehouses by clustering the
activity locations. In harbours, however, the cargo usually consists of containers that are spread over the port
area. The containers arrive in cargo ships that are unloaded using ship-to-shore cranes. When the
load tasks are to be performed, specialized vehicles transport the containers from their location to the
logistics trucks. This transfer process is the actual load activity from the truck's point of view.
Figure 5.9: Rotterdam Load & Unload Activities (axes: Longitude [°], Latitude [°]; legend: Load Activities, Unload Activities)
Hence, unlike in the airport warehouses, load and unload tasks are not performed at the same locations. In
figure 5.9 it is possible to identify three main unload locations and two load locations. Due to the different
topology of the port of Rotterdam, in comparison with the Schiphol Airport, load and unload activities
are more spread out, hence the cutoff value has to be increased. The load and unload locations that
are going to be characterized are shown in figures 5.10a and 5.10b respectively.
(a) Load locations at the port of Rotterdam
(b) Unload locations at the port of Rotterdam
Figure 5.10: Activities Locations at the port of Rotterdam (axes: Longitude [°], Latitude [°])
(a) Original load durations
(b) Original unload durations
Figure 5.11: PDF of the original activity durations at the port of Rotterdam
The original activity durations in figure 5.11 are quite similar to the estimated ones in figure 5.12.
For loads, some of the activities shift from the 0-5 min interval to the
5-10 min interval, but the average stays similar. For unloads, the histogram has exactly the same shape. This
reveals that the users are recording the events consistently with reality, leaving no "empty space"
between activities.
(a) Estimated load durations
(b) Estimated unload durations
Figure 5.12: PDF for the estimated activity durations using hypothesis 1,2,3 and 4
Another particularity is that, at the port of Rotterdam, trucks only perform a single load/unload activity
at each load/unload location. Unlike in the airport warehouses, where the same truck can load/unload
different types of cargo (e.g. packages) at the same location, in ports the trucks cannot transport
more than one container at a time. As a result, the multiple activity situation does not occur. The
formulated hypotheses only differ from each other in the method of estimation for the multiple activity
situation; hence, in the port of Rotterdam case, all hypotheses give the same results. In addition,
since trucks only perform one load/unload task per service, for the reasons stated above, the estimated
service times are the same as the estimated load and unload times.
5.3 International Truck
The prediction of activity durations, with the use of probability density functions, is only possible for
situations where there is enough available data from load and unload activities. However, the STEL
algorithm can also be useful in single vehicle situations. Figure 5.13 is a complete activity time-line for
the international truck in the time span where the data was acquired.
Figure 5.13: International Truck activity time-line for all the event log (TID 1141; horizontal axis: Time [Hours]; legend: Break, Load, Unload, Rest, Arrival, Wait, Drive, Refuelling)
It is an efficient way to evaluate the overall productivity of vehicles since information is available
on all the performed activities and their corresponding durations. It is also possible to determine which
users are dealing incorrectly with the logging system by evaluating the empty spaces between activities.
By looking at the activity time-lines of regions of interest at specific locations, companies are
able to identify scheduling problems from the logged wait activities.
An increased awareness of the amount of non-productive and productive time when the truck is
stopped can be achieved by evaluating the regions of interest. For the international truck case, in a
time-span of 240 hours (10 days) the truck was stopped for approximately 155 hours. Of the total stopped
time, 7 hours were related to non-productive activities such as waiting and signing up, 31 hours
were productive time divided between load and unload activities, 90 hours were sleep logged as rest activities
and 27 hours had no activities logged. Figure 5.14 shows these results.
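The breakdown of the stopped time can be checked with a few lines of arithmetic. The hours are the ones quoted above; the category labels and the function name are illustrative.

```python
def stopped_time_shares(hours_by_category):
    """Return each category's share of the total stopped time, in percent."""
    total = sum(hours_by_category.values())
    return {name: 100.0 * hours / total
            for name, hours in hours_by_category.items()}


stopped_hours = {
    "no activity logged": 27,
    "non-productive": 7,
    "productive": 31,
    "rest": 90,
}
total_stopped = sum(stopped_hours.values())  # 155 hours
shares = stopped_time_shares(stopped_hours)
```

The four categories add up to the 155 stopped hours, with shares of roughly 17.4%, 4.5%, 20% and 58.1% respectively.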
Figure 5.14: Productivity of the international truck (time with no activity logged: 17%; non-productive time: 4%; productive time: 20%; rest time: 58%)
Chapter 6
Discussion
This thesis presents a spatio-temporal event log mining algorithm called STEL. STEL includes a novel
framework that handles real-world logistic data to obtain probability density functions of activity and
service durations based on the location where the activities were performed. Such results are of
extreme importance for the development of algorithms to solve planning and scheduling problems such
as the vehicle routing problem. Using probability density functions as an input, these algorithms are
capable of producing feasible planning solutions that take into account customers' time-windows and use
stochastic service times derived from the probability density functions. In addition, if travel times were to
be estimated in this work, the results would provide information useful for such algorithms to incorporate
stochastic travel times.
The problem of estimating activity durations is not straightforward because the event logs depend on
human input. Activities are identified by specific pairs of events and their log date is essential to estimate
such durations. However, although most of the events are automatically logged by the system itself,
the log of external activities (external to the log system) relies on the user, since the system is not aware
of such tasks. The most significant issue to consider is the event log uncertainty, related to the time at
which the events are logged, that is induced by the human factor. In a naive approach, if durations were
estimated as the elapsed time between the log of the first and second event of the pair, the obtained
estimations would be, in general, very short. This shows the tendency that users have to log both
events before (or after) the actual accomplishment of the activity rather than during the process, by
logging a "start" event at the start of the activity and an "end" event at the end of the activity.
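The naive approach mentioned above can be sketched as follows, to make the problem concrete. The event descriptions follow the "Start of ..." / "End of ..." naming of the event list in Appendix A; the record layout and the function name are assumptions made for the illustration.

```python
from datetime import datetime


def naive_durations(events, activity):
    """Elapsed seconds between each 'Start of X' event and the next 'End of X'.

    events: list of (datetime, description) tuples sorted by time.
    """
    start_label = f"Start of {activity}"
    end_label = f"End of {activity}"
    durations = []
    open_start = None
    for timestamp, description in events:
        if description == start_label:
            open_start = timestamp
        elif description == end_label and open_start is not None:
            durations.append((timestamp - open_start).total_seconds())
            open_start = None
    return durations


# Both events logged back to back after the work was already done.
log = [
    (datetime(2015, 3, 2, 10, 0, 0), "Start of Load"),
    (datetime(2015, 3, 2, 10, 0, 12), "End of Load"),
]
naive = naive_durations(log, "Load")
```

In this made-up log the load is reported as lasting 12 seconds, which is the kind of implausibly short duration that motivates combining the log with the spatio-temporal data.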
The STEL algorithm tackles this problem by creating activity time-lines for the time-windows of the trajectory
in which activities take place. Such portions of trajectories are referred to as regions of interest.
Assuming the log system can log all types of activities performed by the user, activity time-lines should
always be completely filled. The presence of empty time spaces between activities is attributed to the human
effect in logging the events. By defining a sub-set of process-related activities, such as load and unload,
their durations can be estimated based on the amount of empty time in their neighbourhood. The
time-windows, defined by regions of interest, and the activities that do not belong to this sub-set serve as
constraints on the duration estimation of the activity sub-set.
In order to identify regions of interest, the STEL algorithm makes use of the spatio-temporal data
acquired by a global positioning system. Regions of interest are defined in terms of speed, which is obtained
using the time-differential method. Since GPS technologies have an associated positioning error, the
distances obtained between two data samples are not exact, and the error is amplified through
differentiation when calculating the average speed. To avoid misleading calculations, the STEL algorithm filters
the spatio-temporal data for high output rate data samples and uses the previously calculated speed so
that the errors are minimized. Thus, the speeds obtained using this method are not exact, but are
reliable enough for this application.
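The amplification of the positioning error can be illustrated with a simple worst-case bound. This is not taken from the thesis: it assumes every position is off by at most a fixed number of metres, so that the distance between two samples is off by at most twice that amount; the 5 m figure in the example is hypothetical.

```python
def worst_case_speed_error_kmh(position_error_m, dt_s):
    """Upper bound on the speed error for two samples dt_s seconds apart.

    Each position may be off by up to position_error_m, so the measured
    distance may be off by up to 2 * position_error_m (triangle inequality).
    """
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    error_m_per_s = 2.0 * position_error_m / dt_s
    return error_m_per_s * 3.6  # m/s to km/h


one_second_apart = worst_case_speed_error_kmh(5.0, 1.0)
one_minute_apart = worst_case_speed_error_kmh(5.0, 60.0)
```

Under these assumptions the bound is 36 km/h for samples one second apart but only 0.6 km/h for samples one minute apart, which shows why closely spaced samples need special care when speed is used to detect stops.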
6.1 Conclusions
The estimation process relies on some assumptions that have to be made with the application in mind.
For the multiple activity estimations, where there is a lot of uncertainty due to the number
of activities, assumptions have to be made by looking at possible scenarios that lead to such activity
time-lines. In this work, four hypotheses focusing on the short activities were formulated. They mainly differ in two
properties: whether or not the existence of short activities is taken into consideration, and whether or not the
duration of long activities is estimated. The major difference in results comes when short activities are considered
and long activity durations are not estimated (hypothesis 2), leading to higher probabilities in short time
ranges. The reason is a specific phenomenon of the event log used: users commonly
log a long activity followed by one or two short activities at the end of the ROI, leaving the short activities
no room to "expand" and thus leading to a higher number of activities with estimated duration up
to 20 minutes.
Apart from that, the obtained results were very consistent, revealing that the time constraints applied by
time-windows and the other activities (whose duration was not estimated) create a well-conditioned
problem, leading to similar output results regardless of the hypothesis used to estimate the duration
of multiple activity situations.
The utility of the STEL algorithm was demonstrated in three real-world scenarios: a fleet of logistic
trucks working in the Amsterdam Schiphol Airport area, container trucks from the port of Rotterdam
and a single international truck performing activities across Europe. The obtained
probability density functions are real-world based approximations that can be used as input in software
aids to solve planning and scheduling problems such as vehicle routing with stochastic time windows.
It was also shown that the algorithm can be used to estimate travel times between customer locations.
Such results can be of interest for applications that tackle the problem of dealing with stochastic travel
times.
STEL also supports iterative and flexible exploration of probability density functions by making it
possible to refine parameters of the clustering algorithm such as the level of granularity in terms of location.
This allows information to be grouped at different levels, providing a big picture analysis of whole
locations rather than a customer by customer analysis.
In addition, STEL helps to understand different types of load and unload activities by characterizing
both activity and service durations. As in the study cases, it is possible to establish assumptions
about the type of cargo being handled by comparing both analyses. Furthermore,
it is possible to distinguish load and unload sites by looking at the distribution of the locations where
activities were performed as well as the number of activities performed per service. The activity
time-lines created for regions of interest, and for complete trajectories, also allow a better understanding of how
the trucks are working, giving an awareness of problems related to the workflow.
6.2 Future Work
These use cases show that STEL can lead to insights; however, many topics remain for future work.
While STEL supports interactive parametrization of the mining algorithm, choosing the right parameters
might be a challenge. A next goal for STEL is to implement fuzzy logic for the classification of the
trajectory links. Apart from the average speed, the classification can also be based on additional parameters
such as the average link acceleration and link length.
This work demonstrated the lack of effectiveness of sequential pattern mining algorithms in
finding patterns directly from the event logs. Sequences are extremely large and complex due to the
number of event types. The creation of an activity log makes it possible to apply such algorithms to
a sequence of activities rather than events, producing higher-level patterns. For instance,
if such sequences are defined as a list of activities prior to a specific activity, it might be possible to find a
frequent activity sequence that usually leads to the occurrence of that specific activity. Such work would
be relevant for the prediction of essential tasks such as sleeping.
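The idea of looking at the activities that precede a specific activity can be sketched with plain counting. This is only an illustration of the proposal, not a sequential pattern mining algorithm such as PrefixSpan or SPADE; the activity names, the fixed window length and the function name are hypothetical.

```python
from collections import Counter


def preceding_sequences(activity_log, target, window=2):
    """Count the sequences of `window` activities that occur right before `target`.

    activity_log: list of activity names in chronological order.
    """
    counts = Counter()
    for i, activity in enumerate(activity_log):
        if activity == target and i >= window:
            counts[tuple(activity_log[i - window:i])] += 1
    return counts


activity_log = ["Drive", "Unload", "Rest", "Drive", "Unload", "Rest",
                "Load", "Drive", "Rest"]
before_rest = preceding_sequences(activity_log, "Rest", window=2)
most_frequent = before_rest.most_common(1)
```

In this made-up log the pair (Drive, Unload) precedes Rest twice and (Load, Drive) once; on a real activity log a proper mining algorithm with a support threshold would replace the fixed window.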
Employing visualization tools to analyse the generated results would be useful since it provides a more
user-friendly experience and may allow a better analysis of the clusters and the respective probability
density functions formed.
It would also be of great interest to include more event types in the duration estimation of the activities,
rather than only looking at a single class of events (activity related events). Perhaps the presence of
other events in the event sequence would help to reduce the uncertainty related to the log times of the
events.
66
References
[1] DINALOG. DAIPEX, 2015. URL http://www.dinalog.nl/en/projects/r_d_projects/daipex/.
[2] Martin Desrochers, Jacques Desrosiers, and Marius Solomon. A new optimization algorithm for the
vehicle routing problem with time windows. Operations research, 40(2):342–354, 1992.
[3] Zoltán Fazekas, Péter Gáspár, and Roland Kovács. Determining truck activity from recorded trajectory data. Procedia-Social and Behavioral Sciences, 20:796–805, 2011.
[4] T. Pedersen G. Gidofalvi. Mining long, sharable patterns in trajectories of moving objects. Proceedings of STDBM, pages 49–85, 2006.
[5] Juyoung Kang and Hwan-Seung Yong. Spatio-temporal discretization for sequential pattern mining.
In Proceedings of the 2nd international conference on Ubiquitous information management and
communication, pages 218–224. ACM, 2008.
[6] Juyoung Kang and Hwan-Seung Yong. Mining trajectory patterns by incorporating temporal properties. In Proceedings of the 1st International Conference on Emerging Databases, pages 63–68,
2009.
[7] P. Smyth U. M. Fayyad, G. Piatetsky-Shapiro and R.Uthurusamy. Advances in Knowledge Discovery
and Data Mining. AAAI/MIT Press, 1996.
[8] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the 11th International
Conference on Data Engineering, pages 3–14, Taipei, Taiwan, March 1995.
[9] Mohammed J Zaki. Mining data in bioinformatics. Handbook of Data Mining, pages 573–596, 2003.
[10] Adam Perer and Fei Wang. Frequence: Interactive mining and visualization of temporal frequent
event sequences. In Proceedings of the 19th international conference on Intelligent User Interfaces,
pages 153–162. ACM, 2014.
[11] Debprakash Patnaik, Patrick Butler, Naren Ramakrishnan, Laxmi Parida, Benjamin J Keller, and
David A Hanauer. Experiences with mining temporal event sequences from electronic medical
records: initial successes and some challenges. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 360–368. ACM, 2011.
67
[12] CARL H. MOONEY and JOHN F. RODDICK. Sequential pattern mining – approaches and algorithms. ACM Computing Surveys, 2013. doi:10.1145/2431211.2431218.
[13] K Venkateswara Rao, A Govardhan, and KV Chalapati Rao. Spatiotemporal data mining: Issues,
tasks and applications. International Journal of Computer Science & Engineering Survey (IJCSES)
Vol, 3, 2012.
[14] John F Roddick and Myra Spiliopoulou. A bibliography of temporal, spatial and spatio-temporal
data mining research. ACM SIGKDD Explorations Newsletter, 1(1):34–38, 1999.
[15] Ramakrishnan Srikant and Rakesh Agrawal. Mining sequential patterns: Generalizations and
performance improvements. In Proceedings of the 5th International Conference on Extending
Database Technology: Advances in Database Technology, pages 3–17, 1996. ISBN:3-540-61057X.
[16] Jian Pei, Ieee Computer Society, Jiawei Han, Senior Member, Behzad Mortazavi-asl, Jianyong
Wang, Helen Pinto, and Qiming Chen. Mining sequential patterns by pattern-growth - The prefixspan approach. IEEE Transactions on Knowlegde and Data Engineering, 16(10):1–17, 2004.
[17] Jiawei Han, Jian Pei, Behzad Mortazavi-Asl, Qiming Chen, Umeshwar Dayal, and Mei-Chun Hsu.
Freespan: Frequent pattern-projected sequential pattern mining. Proc. 2000 ACM SIGKDD Int’l
Conf. Knowledge Discovery in Databases, pages 355–359, August 2000.
[18] Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules in Large
Databases. Journal of Computer Science and Technology, 15(6):487–499, 1994. ISSN 1000-9000.
doi: 10.1007/BF02948845. URL http://portal.acm.org/citation.cfm?id=645920.672836.
[19] Mohammed J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1-2):31–60, 2001. ISSN 08856125. doi: 10.1023/A:1007652502315.
[20] Jay Ayres, Johannes Gehrke, Tomi Yiu, and Jason Flannick. Sequential pattern mining using
a bitmap representation. Proceedings of the eighth ACM SIGKDD international conference on
Knowledge discovery and data mining, pages 429–435, 2002. doi: 10.1145/775107.775109. URL
http://dl.acm.org/citation.cfm?id=775109.
[21] Zhenglu Yang. E ective Sequential Pattern Mining Algorithms by Last Position Induction 1 Introduction 2 LAPIN ( LAst Position IN- duction ) sequential pattern mining, 2005.
[22] Jiawei Han, Jian Pei, and Yiwen Yin. Frequent Pattern Tree : Design and Construction. Networks,
pages 1–12, 2000.
[23] Michael C Palmer. Calculation of distance traveled by fishing vessels using gps positional data: A
theoretical evaluation of the sources of error. Fisheries Research, 89(1):57–64, 2008.
[24] R. W. Sinnott. Virtues of the Haversine. Sky and Telescope, 68:158, December 1984.
[25] David J Hand, Heikki Mannila, and Padhraic Smyth. Principles of data mining. MIT press, 2001.
68
[26] Pavel Berkhin. A survey of clustering data mining techniques. In Grouping multidimensional data,
pages 25–71. Springer, 2006.
[27] Shashi Shekhar and Sanjay Chawla. Spatial databases: a tour, volume 2003. prentice hall Upper
Saddle River, NJ, 2003.
[28] Wil Van der Aalst, Ton Weijters, and Laura Maruster. Workflow mining: Discovering process models
from event logs. Knowledge and Data Engineering, IEEE Transactions on, 16(9):1128–1142, 2004.
[29] Boudewijn F van Dongen, Ana Karla A de Medeiros, HMW Verbeek, AJMM Weijters, and Wil MP
Van Der Aalst. The prom framework: A new era in process mining tool support. In Applications and
Theory of Petri Nets 2005, pages 444–454. Springer, 2005.
[30] Ming Hsu, Meghana Bhatt, Ralph Adolphs, Daniel Tranel, and Colin F Camerer. Neural systems
responding to degrees of uncertainty in human decision-making. Science, 310(5754):1680–1683,
2005.
[31] Leticia Gómez, Bart Kuijpers, and Alejandro Vaisman. Querying and mining trajectory databases
using places of interest. In New trends in data warehousing and data analysis, pages 1–26.
Springer, 2009.
[32] Yu Zheng, Lizhu Zhang, Xing Xie, and Wei-Ying Ma. Mining interesting locations and travel sequences from gps trajectories. In Proceedings of the 18th international conference on World wide
web, pages 791–800. ACM, 2009.
[33] Stefano Spaccapietra, Christine Parent, Maria Luisa Damiani, Jose Antonio de Macedo, Fabio
Porto, and Christelle Vangenot. A conceptual view on trajectories. Data & knowledge engineering,
65(1):126–146, 2008.
[34] P. Fournier-Viger, A. Gomariz, T. Gueniche, A. Soltani, C. Wu., and V. S. Tseng. SPMF: a Java
Open-Source Pattern Mining Library. Journal of Machine Learning Research (JMLR), 15:3389–
3393, 2014. URL http://www.philippe-fournier-viger.com/spmf/.
[35] Jianjun Zhang. Precise velocity and acceleration determination using a standalone gps receiver in
real time, 2007.
[36] Mohinder S Grewal, Lawrence R Weill, and Angus P Andrews. Global positioning systems, inertial
navigation, and integration. John Wiley & Sons, 2007.
[37] J William. Global positioning system (gps) standard positioning service (sps) performance analysis
report. Federal Aviation Administration, Washington, DC, 410, 2014.
69
Appendix A
Event List
Table A.1
Event Description | Activity ID | New Activity ID | No of Occurrences
Acceleration limit violation | 53 | 1 | 108
Activity Midnight | 9 UN | 2 | 92
Activity Midnight Arrive | 9 1 AANK | 3 | 2
Activity Midnight Break | 9 1 PA | 4 | 1
Activity Midnight Drive | 9 DR | 5 | 21
Activity Midnight Gas | 9 1 TA | 6 | 1
Activity Midnight Load | 9 1 LAD | 7 | 25
Activity Midnight Log Out | 9 1 LOZO | 8 | 4
Activity Midnight Rest | 9 1 RU | 9 | 128
Activity Midnight Sign Up | 9 1 MELD | 10 | 2
Activity Midnight Unload | 9 1 LOS | 11 | 23
Activity Midnight Wait | 9 1 WA | 12 | 1
Basic record | 0 | 13 | 32293
Cancellation of | 11 UN | 14 | 7354
Cancellation of Arrive | 11 1 AANK | 15 | 16
Cancellation of Break | 11 1 PA | 16 | 7
Cancellation of Costs | 11 1 KSTN | 17 | 1
Cancellation of Garage | 11 1 GAR | 18 | 1
Cancellation of Gas | 11 1 TA | 19 | 4
Cancellation of Load | 11 1 LAD | 20 | 75
Cancellation of Log In | 11 LI | 21 | 8
Cancellation of Log Out | 11 1 LOZO | 22 | 13
Cancellation of Passage | 11 1 OT | 23 | 1
Cancellation of Rest | 11 1 RU | 24 | 59
Cancellation of Sign Up | 11 1 MELD | 25 | 17
Cancellation of Unload | 11 1 LOS | 26 | 53
Cancellation of Wait | 11 1 WA | 27 | 18
Contact OFF | 72 | 28 | 8942
Contact ON | 71 | 29 | 8946
Crossed country border | 44 | 30 | 28
Deceleration limit violation | 54 | 31 | 69
Driver switch | 3 | 32 | 42
Driving times driving violation | 84 | 33 | 357
Driving times driving warning | 83 | 34 | 71
Driving times state event | 82 | 35 | 18146
Driving times total driving warning | 85 | 36 | 165
Driving without any driver logged in | 8 | 37 | 73
End of | 13 TJ | 38 | 2293
End of Arrive | 13 1 AANK | 39 | 427
End of Break | 13 1 PA | 40 | 103
End of Drive | 13 DR | 41 | 4012
End of Garage | 13 1 GAR | 42 | 36
End of Gas | 13 1 TA | 43 | 203
End of Load | 13 1 LAD | 44 | 1019
End of Log In | 13 LI | 45 | 522
End of Log Out | 13 LO | 46 | 491
End of Passage | 13 1 OT | 47 | 12
End of Rest | 13 1 RU | 48 | 369
End of Sign Up | 13 1 MELD | 49 | 284
End of Unload | 13 1 LOS | 50 | 869
End of Wait | 13 1 WA | 51 | 193
End of peak RPM limit violation | 61 | 52 | 782
End of speed limit violation | 60 | 53 | 315
Engine idle violation | 55 | 54 | 21
GPRS status info | 200 | 55 | 480
Invalid driver card | 6 | 56 | 275
Join of | 14 UN | 57 | 20
Join of Drive | 14 DR | 58 | 2
Join of Gas | 14 1 TA | 59 | 1
Join of Load | 14 1 LAD | 60 | 3
Join of Log In | 14 LI | 61 | 2
Join of Passage | 14 1 OT | 62 | 1
Join of Rest | 14 1 RU | 63 | 6
Leaving of | 15 UN | 64 | 13
Leaving of Gas | 15 1 TA | 65 | 1
Leaving of Load | 15 1 LAD | 66 | 4
Leaving of Rest | 15 1 RU | 67 | 7
Leaving of Sign Up | 15 1 MELD | 68 | 1
Mileage violation | 94 | 69 | 1
Navigation ETA update | 42 | 70 | 3519
Navigation cancelled | 41 | 71 | 199
Navigation destination reached | 43 | 72 | 188
Navigation started to given destination | 40 | 73 | 432
Report of | 12 1 LAZO | 74 | 28
Report of Arrive | 12 1 AANK | 75 | 8
Report of Break | 12 1 PA | 76 | 2
Report of Garage | 12 1 GAR | 77 | 5
Report of Gas | 12 1 TA | 78 | 2
Report of Load | 12 1 LAD | 79 | 27
Report of Log Out | 12 1 LOZO | 80 | 8
Report of Passage | 12 1 OT | 81 | 2
Report of Rest | 12 1 RU | 82 | 373
Report of Sign Up | 12 1 MELD | 83 | 6
Report of Unload | 12 1 LOS | 84 | 15
Report of Wait | 12 1 WA | 85 | 47
Start of | 10 TJ | 86 | 9645
Start of Arrive | 10 1 AANK | 87 | 443
Start of Break | 10 1 PA | 88 | 110
Start of Costs | 10 1 KSTN | 89 | 1
Start of Drive | 10 DR | 90 | 4004
Start of Garage | 10 1 GAR | 91 | 37
Start of Gas | 10 1 TA | 92 | 206
Start of Load | 10 1 LAD | 93 | 1095
Start of Log In | 10 LI | 94 | 530
Start of Log Out | 10 LO | 95 | 507
Start of Passage | 10 1 OT | 96 | 15
Start of Rest | 10 1 RU | 97 | 434
Start of Sign Up | 10 1 MELD | 98 | 302
Start of Unload | 10 1 LOS | 99 | 923
Start of Wait | 10 1 WA | 100 | 212
Start of peak RPM limit violation | 51 | 101 | 769
Start of speed limit violation | 50 | 102 | 333
System Shutdown | 73 | 103 | 223
Task Accepted | 21 | 104 | 3851
Task Busy | 23 | 105 | 3729
Task Cancelled | 24 | 106 | 186
Task Finished | 25 | 107 | 3450
Task Received from terminal | 20 | 108 | 1983
Text message | 300 | 109 | 1850
Text message read | 302 | 110 | 1905
Text message received | 301 | 111 | 1881
Trailer tethered | 98 | 112 | 668
Trailer untethered | 99 | 113 | 676
Update of peak RPM limit violation | 66 | 114 | 9
Update of speed limit violation | 65 | 115 | 76
User login | 1 | 116 | 372
User logout | 2 | 117 | 252
empty | 7 | 118 | 99
(a) TID 1141 Link Speeds
(b) TID 1141 Link Classification
(c) TID 1849 Link Speeds
(d) TID 1849 Link Classification
(e) TID 7234 Link Speeds
(f) TID 7234 Link Classification
Figure A.1: Link Speeds and Classifications with α = 0.5 (axes: Longitude [°], Latitude [°]; colour scale: Speed [Km/h]; legend: Moving Link, Stopped Link)