The SIGSPATIAL Special
Newsletter of the Association for Computing Machinery
Special Interest Group on Spatial Information
Volume 7
Number 1
March 2015
The SIGSPATIAL Special
The SIGSPATIAL Special is the newsletter of the Association for Computing Machinery (ACM) Special
Interest Group on Spatial Information (SIGSPATIAL).
ACM SIGSPATIAL addresses issues related to the acquisition, management, and processing of spatially-related information with a focus on algorithmic, geometric, and visual considerations. The scope includes,
but is not limited to, geographic information systems.
Current Elected ACM SIGSPATIAL officers are:
• Chair, Mohamed Mokbel, University of Minnesota
• Past Chair, Walid G. Aref, Purdue University
• Vice-Chair, Shawn Newsam, University of California at Merced
• Secretary, Roger Zimmermann, National University of Singapore
• Treasurer, Egemen Tanin, University of Melbourne
Current Appointed ACM SIGSPATIAL officers are:
• Newsletter Editor, Chi-Yin Chow (Ted), City University of Hong Kong
• Webmaster, Ibrahim Sabek, University of Minnesota
For more details and membership information for ACM SIGSPATIAL as well as for accessing the
newsletters please visit http://www.sigspatial.org.
The SIGSPATIAL Special serves the community by publishing short contributions such as SIGSPATIAL
conferences’ highlights, calls and announcements for conferences and journals that are of interest to the
community, as well as short technical notes on current topics. The newsletter has three issues every year,
i.e., March, July, and November. For more detailed information regarding the newsletter or suggestions
please contact the editor via email at [email protected].
Notice to contributing authors to The SIGSPATIAL Special: By submitting your article for distribution in
this publication, you hereby grant to ACM the following non-exclusive, perpetual, worldwide rights:
• to publish in print on condition of acceptance by the editor,
• to digitize and post your article in the electronic version of this publication,
• to include the article in the ACM Digital Library,
• to allow users to copy and distribute the article for noncommercial, educational or research purposes.
However, as a contributing author, you retain copyright to your article and ACM will make every effort to
refer requests for commercial use directly to you.
Notice to the readers: Opinions expressed in articles and letters are those of the author(s) and do not
necessarily express the opinions of the ACM, SIGSPATIAL or the newsletter.
The SIGSPATIAL Special (ISSN 1946-7729) Volume 7, Number 1, March 2015.
Table of Contents
Message from the Editor
Chi-Yin Chow
Section 1: Special Issue on Semantic and Symbolic Trajectories
Introduction to this Special Issue: Semantic and Symbolic Trajectories
Maria Luisa Damiani and Chiara Renso
The Low Hanging Fruit is Gone – Achievements and Challenges of Computational Movement Analysis
Patrick Laube
Semantic Enrichment and Analysis of Movement Data: Probably it is just Starting!
Renato Fileto, Vania Bogorny, Cleto May, and Douglas Klein
Hermoupolis: A Semantic Trajectory Generator in the Data Science era
Nikos Pelekis, Stylianos Sideridis, Panagiotis Tampakis, and Yannis Theodoridis
Constructing Semantic Interpretation of Routine and Anomalous Mobility Behaviors from Big Data
Georg Fuchs, Hendrik Stange, Dirk Hecker, Natalia Andrienko, and Gennady Andrienko
An Integrated Qualitative and Boundary-based Formal Model for a Semantic Representation of Trajectories
Jing Wu, Christophe Claramunt, and Min Deng
Trajectory Similarity Measures
Kevin Toohey and Matt Duckham
Symbolic Trajectories and Application Challenges
Maria Luisa Damiani, Hamza Issa, Ralf Hartmut Güting, and Fabio Valdes
Planning Sightseeing Tours Using Crowdsensed Trajectories
Igo Brilhante, Jose Antonio Macedo, Franco Maria Nardini, Raffaele Perego, and Chiara Renso
Section 2: Event Reports
Highlights from ACM SIGSPATIAL China Chapter in 2014
Guangzhong Sun, Yang Yue, and Xing Xie
ACM SIGSPATIAL LBSN 2014 Workshop Report
Alexei Pozdnoukhov and Sen Xu
ACM SIGSPATIAL IWGS 2014 Workshop Report
Chengyang Zhang, Anas Basalamah, and Abdeltawab Hendawi
ACM SIGSPATIAL PhD Symposium 2014 Report
Ugur Demiryurek and Mohamed Sarwat
Message from the Editor
Chi-Yin Chow
Department of Computer Science, City University of Hong Kong, Hong Kong
Email: [email protected]
The first section is a special issue on a topic of interest to the SIGSPATIAL community. The
topic of this issue is “Semantic and Symbolic Trajectories”, edited by our associate editors Prof. Maria
Luisa Damiani and Dr. Chiara Renso. Prof. Damiani is currently a professor in the Department of Computer
Science, University of Milan, Italy, and Dr. Renso is currently a researcher at ISTI-CNR, Italy.
The second section consists of four event reports:
1. Highlights from ACM SIGSPATIAL China Chapter in 2014
2. The 7th ACM SIGSPATIAL International Workshop on Location-Based Social Networks (ACM SIGSPATIAL LBSN 2014)
3. The 5th ACM SIGSPATIAL International Workshop on GeoStreaming (ACM SIGSPATIAL IWGS 2014)
4. The 1st ACM SIGSPATIAL PhD Symposium 2014
I would like to sincerely thank all the newsletter authors, Prof. Damiani, Dr. Renso, and event organizers
for their generous contributions of time and effort that made this issue possible. I hope that you will find the
newsletters interesting and informative and that you will enjoy this issue.
You can download all Special issues from:
http://www.sigspatial.org/sigspatial-special
Section 1: Special Issue on Semantic and Symbolic Trajectories
Introduction to this Special Issue: Semantic and Symbolic Trajectories
Maria Luisa Damiani¹, Chiara Renso²
¹ Department of Computer Science, University of Milan, Italy
² ISTI-CNR, Italy
The digital traces of moving objects, such as people, animals, and goods, increasingly contain
richer information than the mere location history. Traces can report, for example, the places visited by tourists
during a trip (e.g., hotels, entertainment spots), the activities performed during that trip
(e.g., shopping, driving), and the people encountered along the way. This supplementary information is often
referred to as semantic annotation (of trajectories). Semantic annotations can be acquired directly from sensors,
derived through some analytic process, or provided directly by users. In recent times,
following a few pioneering research projects, especially in Europe and Asia¹, the processing of large amounts
of semantics-enriched trajectories has become an exciting, cross-disciplinary research area, spanning spatial
computing, data analytics and geographical information science. Semantics-enriched trajectories have important
applications in a variety of domains, such as urban computing, social media analysis and animal ecology,
especially in connection with the analysis of moving objects’ behavior, both at individual and collective level.
This special issue consists of eight contributions related in different ways to semantics-enriched trajectories.
The first contribution is a position paper by Patrick Laube presenting a fresh viewpoint on achievements and
open challenges in the area of movement analysis, especially with respect to the problem of semantically
annotating patterns extracted from spatial trajectories. The next three papers are closely related to the notion
of semantic trajectories: Renato Fileto et al. present a conceptual vision of the semantic enrichment process
especially focusing on the challenging opportunities offered by Linked Open Data; Nikos Pelekis et al. present
Hermoupolis, a valuable tool for the synthetic generation of semantic trajectories, very often a necessary step to
cope with the lack of access to such data; Georg Fuchs et al. focus on the visual aspects of semantic trajectory
analysis, presenting visual analytics methods that can be used to interpret routine and anomalous patterns of
human mobility. The next group of two papers focuses more on computational aspects. In particular, Jing Wu
et al. present an approach to the definition of predicates modeling qualitative relationships between geometric
trajectories and regions, while Kevin Toohey and Matt Duckham introduce and compare four of the most
common measures of trajectory similarity. The last group of two papers presents a few application scenarios for
different types of semantics-enriched trajectories. In particular, Maria Luisa Damiani et al. focus on possible
applications of the symbolic trajectories data model, a novel trajectory data model for the representation and
pattern-based querying of sequences of timestamped labels, currently implemented in the Secondo database; Igo
Brilhante et al. present a comprehensive application showing how semantics-enriched trajectories built on
crowd-sensed tourism data extracted from social media provide valuable information for defining
tourist behavior models and recommending itineraries.
We hope that the readers will enjoy reading this issue and appreciate the multiplicity of views it offers.
¹ In particular, the project GeoPKDD (2006-2008), funded by the European Community, and the Microsoft GeoLife project.
The Low Hanging Fruit is Gone – Achievements and Challenges of Computational Movement Analysis
Patrick Laube
Institute of Natural Resource Sciences, Zurich University of Applied Sciences, Switzerland
[email protected]
Abstract
This position paper reviews the achievements and open challenges of movement analysis within Geographical Information Science. The paper argues that the simple problems of movement analysis have
mostly been addressed to a sufficient level (“the low hanging fruit”), leaving the research community
with the much more challenging problems for the years ahead (“the high hanging fruit”). Whereas the
community has made good progress in structuring trajectory data (segmentation, similarity, clustering)
and conceptualizing and detecting movement patterns, the much harder task of semantic annotation of
structures and patterns remains difficult. The position paper summarizes both achievements and challenges with two sets of assertions and calls for the establishment of a unifying theory of Computational
Movement Analysis.
1 Introduction
Movement analysis has become a constant topic on the programs of all important meetings and conferences in
Geographical Information Science (GIScience) and spatial computing. After more than a decade of enthusiasm
and rapid progress, I currently see a certain stagnation in the field. At meetings and conferences, and when
reviewing papers, I get the impression that the discussion often revolves around similar topics. In this position
paper I argue that the community has now reached a stage where most simple problems have been addressed, and
perhaps even solved. What remains on the table are the hard problems. For that reason I propose consolidating
what has been achieved so far in a unifying theory of Computational Movement Analysis (CMA) and facing the
hard problems, allowing the discipline to evolve. I have structured my position paper into achievements (“the
low hanging fruit”) and grand challenges for today and the years to come (“the higher hanging fruit”).
2 The Low Hanging Fruit
2.1 Data Gold-Rush
The emergence of movement research within GIScience was driven very much by technology and, even more so, by data.
With the rapid improvement of GPS-based tracking technology – receivers getting much smaller and batteries
lasting much longer – a sudden overabundance of movement data triggered a gold-rush-like enthusiasm amongst
theory and application researchers. This shift from a data-poor to a data-rich problem is perhaps best illustrated
by biologists’ efforts to track turtles. What started off as a thread-trailing exercise [10] has in the last two decades
embraced GPS technology [20]. There was a time when it almost appeared that “just putting movement data on a
map” got you into Nature [11]. Two facts underline that this statement is not meant in a derogatory way, but rather
the opposite. First, it was Graeme Hays himself who made that comment at a workshop [22], and his outstanding
publication record exposes the understatement in his quote. Second, looking ahead to the remaining higher
hanging fruit and the community’s challenges, it’s worth noting at this stage that it was biologists publishing in
Nature, not GIS or computer science researchers.
I argue that movement analysis emerged as a hot topic in recent years mainly because the seemingly simple
task of analyzing a set of points allowed GIScience, for the first time, to truly overcome the legacy of static
cartography. Movement data offered quasi-continuous sampling instead of sporadic snapshots. Everybody had
a colleague in biology, ecology or transportation research, and getting their data into a GIS was interesting and
relatively straightforward. Researchers with various interests in GIS and spatial computing – from cartography
to generalization, from time geography to 3D, or from spatial databases to data mining – got excited about all
that available movement data and fired up their engines, very much to the benefit of the emerging discipline.
2.2 Simple Solutions for Simple Problems
My second low hanging fruit is simplicity. In the beginning, the field of movement analysis was so wide open,
and there were so many open and interesting problems, that one could just pick the simple ones and even get
away with very simple solutions to the simple problems one had framed in the first place. The relative movement
framework (REMO), first presented at GIScience 2002, may serve as an illustration of the simplicity of some
initial concepts [15]. In essence, the REMO framework assembles a matrix that temporally aligns
color-coded sequences of movement parameters such as speed or movement azimuth, in the hope that
interesting and unexpected patterns emerge. Considering my own early biography, it is only fair to say that any
resemblance to colorful Lego bricks is not coincidental at all. However, just like Lego models, the REMO
framework is a crude, edgy and inflexible model defying the true complexity of the real world. Nevertheless,
its simplicity was appealing and at the time triggered important discussions about analyzing more than just the
mere spatial footprint of trajectories.
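To make the matrix idea concrete, the following is a minimal sketch and not the published REMO implementation. It assumes that all entities were sampled at the same time steps and that azimuth is discretized into eight compass sectors; the function names and the sample tracks are hypothetical.

```python
import math

def azimuth_class(p, q, n_classes=8):
    """Return the compass sector (0 = north, counted clockwise) of step p -> q.

    Note: a stationary step (p == q) has no direction and falls into class 0.
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    azimuth = math.degrees(math.atan2(dx, dy)) % 360.0  # 0 degrees = north
    width = 360.0 / n_classes
    return int(((azimuth + width / 2) % 360.0) // width)

def remo_matrix(tracks, n_classes=8):
    """One row per entity, one column per time step, cells hold azimuth classes."""
    return {
        name: [azimuth_class(p, q, n_classes) for p, q in zip(fixes, fixes[1:])]
        for name, fixes in tracks.items()
    }

# Hypothetical tracks: (x, y) fixes recorded at identical time steps.
tracks = {
    "a": [(0, 0), (0, 1), (1, 2), (2, 2)],
    "b": [(5, 5), (5, 6), (6, 7), (6, 6)],
}
matrix = remo_matrix(tracks)
for name, row in matrix.items():
    print(name, row)
```

Reading the rows side by side shows that both entities head north and then north-east before they diverge in the last step, which is the kind of pattern the aligned matrix is meant to expose.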
Whereas the REMO matrix aligned derived movement parameters and ignored the absolute positions of
the moving entities, subsequently proposed movement patterns such as “flock” or “trendsetter” additionally
considered the spatial arrangement of the entities. Hence, again adhering to a rather simplistic and deterministic
view of the world, Laube et al. [17, 16] defined a flock as a group of n moving entities that move in spatial
proximity for a defined time interval k. This spatial neighborhood was defined as a simple bounding box or
a disc of radius r. These mechanistic and almost naïve early movement patterns also triggered useful discussions
about dynamic collectives. However, they don’t consider membership issues or entities temporarily leaving
the group. Antony Galton’s beautiful string quartet metaphor illustrates, for only a small and simple group, the
potential semantic complexity of dynamic collectives and their collective dynamics [8].
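The flock definition above (n entities, disc of radius r, interval k) can be illustrated with a brute-force sketch. It is not the algorithm of [17, 16]: as a simplifying assumption, it only tests discs centred on one of the entities, and it expects every entity to be present at every time step. All names and sample positions are hypothetical.

```python
import math

def neighbours(positions, centre, r):
    """Entities whose position lies within distance r of the centre entity."""
    cx, cy = positions[centre]
    return {e for e, (x, y) in positions.items()
            if math.hypot(x - cx, y - cy) <= r}

def find_flocks(frames, n, r, k):
    """Find groups of at least n entities that stay within a disc of radius r
    around one of their members for k consecutive time steps.

    frames is a list of {entity: (x, y)} dicts, one per time step, and every
    entity is assumed to be present in every frame.
    """
    flocks = set()
    for start in range(len(frames) - k + 1):
        window = frames[start:start + k]
        for centre in window[0]:
            members = set(window[0])
            for positions in window:
                members &= neighbours(positions, centre, r)
            if len(members) >= n:
                flocks.add((start, frozenset(members)))
    return flocks

# Hypothetical positions: a, b and c travel together, d joins only at the end.
frames = [
    {"a": (0, 0), "b": (1, 0), "c": (0, 1), "d": (10, 10)},
    {"a": (1, 1), "b": (2, 1), "c": (1, 2), "d": (10, 0)},
    {"a": (2, 2), "b": (3, 2), "c": (2, 3), "d": (2, 2.5)},
]
for start, members in find_flocks(frames, n=3, r=1.5, k=3):
    print(start, sorted(members))
```

The sketch also shows the limitation discussed above: membership is all-or-nothing, so an entity that leaves the disc for a single time step drops out of the flock entirely.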
2.3 The Hammer and Nail Issue
Apart from addressing the simple problems first, another low hanging fruit came in the form of well-developed
and hence ready-to-use toolboxes for static spatial analysis. For instance, the use of a disc neighborhood for
conceptualizing flock patterns was not primarily problem-driven, but rather motivated by the availability of
efficient algorithms for finding sets of points that lie close together – in this case using higher-order Voronoi
diagrams. It’s fair to say that the solution proposed at the time for conceptualizing and then detecting flock patterns
clearly reflects the background of the involved researchers in geography and computational geometry [17].
Certainly, researchers can’t be blamed for first resorting to familiar tools when tackling new problems. The
following selection of studies illustrates the adaptation of established tools and concepts for the emerging field
of movement analysis. For example, Ross Purves’ and my “How fast is a cow” paper reflects our background
as GIS researchers with experience in sensitivity studies, scale issues, and uncertainty [14]. The visualization
research group around Natalia and Gennady Andrienko successfully incorporated various movement analysis
perspectives into their visualization environment [1]. Database research saw the emergence of novel Moving
Object Databases, specifically targeting data management and querying problems for moving objects, for example, for real-time fleet management applications [9]. Finally, computational geometers have successfully
proposed several algorithms around the Fréchet distance for the conceptualization and detection of a set of
movement patterns and similarity problems [5].
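As an illustration of the kind of measure mentioned last, the sketch below computes the discrete Fréchet distance between two trajectories with the standard dynamic program. It is a simplified stand-in: the work cited in [5] may rely on the continuous Fréchet distance and on more refined algorithms, and the function name and sample tracks are hypothetical.

```python
import math

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines given as point lists."""
    n, m = len(p), len(q)
    # ca[i][j] = coupling distance of the prefixes p[:i+1] and q[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_previous = min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1])
                ca[i][j] = max(best_previous, d)
    return ca[-1][-1]

# Two hypothetical parallel tracks, one unit apart.
track_p = [(0, 0), (1, 0), (2, 0)]
track_q = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(track_p, track_q))
```

Unlike a point-by-point comparison, the measure respects the order of the fixes along both trajectories, which is why it is popular for similarity problems.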
Benefitting from existing, ready-to-use concepts surely presents an efficient research avenue. Unfortunately, this strategy bears the danger of producing solution-dominated methodological research that misses the
actual problems of the targeted application areas. Or, “If all you have is a hammer, everything looks like a nail”.
The hammer and nail problem became apparent when the earlier presented flock definitions based on disc-shaped
neighborhoods were put to the test with a herd of actual cows moving across a paddock. It turned out that the
mechanistic flock definition missed the social behavior of the animals. Instead, the cows expressed group
cohesion in the form of pair-wise density-connections, with cows lined up like pearls on a string [14]. Such
a mismatch between solution and problem can reduce the impact of the methodological research,
as only those tools that help application scientists tackle their actual
research questions will be widely accepted.
2.4 Seduction of Syntactic Sugar
Peter Landin’s term syntactic sugar serves as an analogy for my fourth low hanging fruit. The term refers
to additions to the syntax of a programming language that do not affect its expressiveness but only make it
sweeter for humans to use [12]. Many early proposed movement patterns, whose prime function was to structure
trajectories of moving entities, were sugarcoated with intuitive labels referring to behavior patterns from the
targeted application domains. With my flocks, I am certainly “guilty as charged” in this respect (further examples
are trendsetter or leadership patterns), but others happily jumped on the bandwagon (herd, convoy, single file
patterns).
This strategy has since repeatedly been criticized, and for good reasons. First, because movement patterns
were proposed for different application domains, structurally very similar patterns were given different names (flocks,
herds, convoys). Second, the sugary names implied that the semantic gap between patterns
and behaviors had already been bridged, whereas in reality the patterns only captured structural features (see Section 4.2). However,
today I would argue that these sugary names helped establish the concept of movement patterns as a driving
force for movement analysis, and hence played an important role as a marketing vehicle for the emerging research
field.
2.5 Novelty Through Interdisciplinarity
My fifth low hanging fruit is interdisciplinarity. A wide range of research fields currently contribute to the rapid
development of movement analysis, including GIScience, computer science with computational geometry and
database research, as well as various application fields such as movement ecology, transportation research and
planning, and robotics. There is no doubt that interdisciplinary collaborations produce innovative movement
analysis concepts. Recent examples include work on stacked densities (infovis, computational geometry, and
ecology) [6], semantic enrichment of trajectories (GIS, ecology, engineering) [7], or Brownian bridges (computational geometry, ecology) [4].
One reason why interdisciplinary research is attractive lies in the fact that methods and concepts established
in one research area may be new to others. In the best case such “novel” methods address urgent challenges
in the neighboring field. Often this is a win-win situation. When methods researchers collaborate with domain
experts, the former get visibility in relevant application areas whereas the latter get new analytical perspectives
advancing their field. Amongst many other recent examples, the handshakes between data mining and restoration
ecology [2] and between machine learning and drug screening for biomedical research [23] illustrate this successful
publication strategy.
3 Computational Movement Analysis
Put together, the many individual contributions outlined above form a solid theoretical foundation for movement
analysis in a wider GIScience context. Summarizing the above overview, the following list captures the main
achievements of the discipline to date:
• Movement data is arguably the first continuously sampled form of spatio-temporal data that became
widely available for overcoming GIScience’s legacy of static cartography. Hence, movement analysis acted
as an important trailblazer, advancing GIS from static towards dynamic.
• The diversity of data capture methods used in the various contributing application contexts – ranging from
traffic gantries and video surveillance to mobile phones and GPS collars – led to a thorough understanding
of the interrelation between conceptual models of movement spaces and the therein possible movement
traces.
• The countless hours spent on getting raw trajectories ready for analysis led to a solid body of scientific
fundamentals related to preprocessing, cleaning, and filtering notoriously noisy and imperfect movement
data. The research field has also produced solutions for integrating, storing, managing, and querying the
rapidly growing data streams describing movement phenomena.
• In terms of analytical concepts the main achievement surely lies in the many concepts and related algorithms for structuring movement data. This includes segmentation procedures, similarity measures, and
movement patterns.
• Significant contributions were furthermore made for visualizing movement processes. What started off
with 3D space-time cubes and time geography soon led to powerful concepts for animation and visual
analytics.
In a recently published SpringerBriefs volume I outlined what I believe should be the core topics of an
underlying theoretical basis of Computational Movement Analysis (CMA) [13].¹
Definition. Computational Movement Analysis (CMA) is the interdisciplinary research field studying the development and application of computational techniques for capturing, processing, managing, structuring, and
ultimately analyzing data describing movement phenomena, both in geographic and abstract spaces, aiming for
a better understanding of the processes governing that movement [13, p. 4].
On top of the above listed core topics, CMA must also investigate the specific characteristics and peculiarities
of the geographic phenomenon movement and the spatio-temporal data describing it, including data quality
(uncertainty, accuracy), scale issues, and spatio-temporal autocorrelation. Clearly, CMA must operate in close
collaboration with its application domains studying the peculiarities of established and emerging integrated
spatial systems serving as direct or indirect tracking systems. Finally, CMA should address societal issues,
including ethics and privacy, as well as issues around user-generated and open data. Some of these issues are
hard and have hence not yet seen the required attention. They directly lead to the following list of higher hanging
fruit.
¹ P. Laube. Computational Movement Analysis. SpringerBriefs in Computer Science. Springer, Berlin Heidelberg, 2014, DOI 10.1007/978-3-319-10268-9, ISBN 978-3-319-10267-2.
4 The Higher Hanging Fruit
4.1 Move Beyond The Trajectory
In my opinion, trajectories have been the key data structures making CMA possible in the first place. Although
related to polylines, trajectories are different as they inherently capture the spatio-temporal nature of movement.
They not only align position fixes in a sequence but also timestamp those fixes, allowing stop-and-go
patterns and speed variations to be represented. However, trajectories still remain only a spatio-temporal footprint of the
actual behaviors we typically want to understand. So, it is now time to acknowledge that movement behaviors
can hardly be understood from studying their footprints alone. Reaching out for my first higher hanging
fruit means moving beyond the trajectory. This is meant in two ways.
First, studying shape and arrangement patterns of trajectories may reveal certain structural characteristics of
movement, but a true understanding of the respective behaviors requires the embedding of the trajectories in the
movement context enabling and constraining that movement. Pedestrians walking across a bridge do cluster and
do perform a uniform straight movement, but why and where they move the way they do is best
explained by linking their movement to the underlying geography – here the bridge enabling and constraining the
movement. A dedicated session on movement context at the 2014 GIScience conference in Vienna underlines
the relevance of this topic. One challenge for understanding movement in its context lies in the difficulty of
accessing the comprehensive data required for such studies. On top of fine-grained movement data one also
needs semantic information about the movement context, ideally with a similar spatial and temporal resolution.
One way of accessing such rich data sources comes in the form of multi-sensor systems. For example, Nathan et
al. combine GPS readings with additional sensors concurrently tracking biomechanical variables when aiming
at classifying behavioral modes of free-ranging animals [18].
Second, trajectories typically adhere to the Lagrangian perspective of movement, tracking the changes of
an entity’s location. This is the typical case when performing tracking experiments with small numbers of
GPS-tagged individuals. The antagonistic Eulerian perspective – observing the movement of entities relative
to fixed reference points – is much less investigated. Considering that many large-scale tracking systems, for
example urban transit ticketing or mobile phone infrastructure, adhere to the Eulerian perspective, it is obvious
that tackling the so far neglected Eulerian perspective promises insights into movement systems of great socio-economic relevance that cover much larger spaces and populations of moving entities.
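The difference between the two perspectives can be sketched by deriving an Eulerian view from Lagrangian data: instead of following each entity, one counts how many distinct entities are observed near fixed reference points per time bin. This is only an illustration under simple assumptions (circular detection zones, regular time bins); the station names, radius and sample fixes are hypothetical.

```python
import math
from collections import Counter

def eulerian_counts(trajectories, stations, radius, bin_size):
    """Count distinct entities observed near each fixed station per time bin.

    trajectories maps an entity to its (t, x, y) fixes (Lagrangian view);
    stations maps a station name to its fixed (x, y) location.
    """
    seen = set()
    for entity, fixes in trajectories.items():
        for t, x, y in fixes:
            for name, (sx, sy) in stations.items():
                if math.hypot(x - sx, y - sy) <= radius:
                    seen.add((name, int(t // bin_size), entity))
    return Counter((name, time_bin) for name, time_bin, _ in seen)

# Hypothetical data: two users passing two gates, times in seconds.
trajectories = {
    "u1": [(0, 0.0, 0.0), (30, 0.5, 0.0), (70, 5.0, 0.0)],
    "u2": [(10, 0.2, 0.1), (80, 5.1, 0.0)],
}
stations = {"gate_a": (0.0, 0.0), "gate_b": (5.0, 0.0)}
counts = eulerian_counts(trajectories, stations, radius=1.0, bin_size=60)
for key in sorted(counts):
    print(key, counts[key])
```

Real Eulerian systems such as ticketing gates deliver these counts directly and never record the underlying trajectories, so the conversion only works in this direction.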
4.2 Bridge The Semantic Gap
Stepping up from “just” structuring movement data means reaching out for a truly high hanging fruit – bridging
the semantic gap. This gap separates the low-level observational data from the high-level conceptual schemes
through which humans interpret, understand and use that data [8]. Segmenting a bird’s trajectory according to
its speed and sinuosity characteristics is one thing. But asserting that the wiggly slow segments really correspond
to what biologists would classify as “foraging” behavior is quite another. One strategy for narrowing the semantic gap comes in the form of thorough validation. To this end, studies where trajectory data is
complemented with semantic “ground truth” data in the form of direct observations of the studied behaviors are
especially valuable. Shamoun-Baranes et al. illustrate how this can be achieved [21]. They cross-validate their
machine-learning behavior classification with conventional behavior observations. Similarly, a workshop held
in Zurich in 2012 specifically aimed at bringing together theory researchers with application domain specialists.
The workshop and the subsequent special issue in Computers, Environment and Urban Systems (vol. 47, 2014)
only invited work on real data and real problems, carried out by teams that also included domain specialists
grounding the methodological work in the semantics of real problems [19].
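The structural half of the bird example above, segmenting by speed and sinuosity, is easy to sketch. The window size, the thresholds and the sample fixes below are hypothetical, and sinuosity is taken here as path length divided by straight-line distance, which is only one of several definitions in use.

```python
import math

def step_speeds(fixes):
    """Speed of each step between consecutive (t, x, y) fixes."""
    return [
        math.hypot(x2 - x1, y2 - y1) / (t2 - t1)
        for (t1, x1, y1), (t2, x2, y2) in zip(fixes, fixes[1:])
    ]

def sinuosity(fixes):
    """Path length divided by straight-line distance (1.0 = straight)."""
    path = sum(
        math.hypot(x2 - x1, y2 - y1)
        for (_, x1, y1), (_, x2, y2) in zip(fixes, fixes[1:])
    )
    (_, xa, ya), (_, xb, yb) = fixes[0], fixes[-1]
    straight = math.hypot(xb - xa, yb - ya)
    return math.inf if straight == 0 else path / straight

def label_windows(fixes, size, max_speed, min_sinuosity):
    """Give each window of `size` fixes a purely structural label."""
    labels = []
    for start in range(0, len(fixes) - size + 1, size - 1):
        window = fixes[start:start + size]
        speeds = step_speeds(window)
        slow = sum(speeds) / len(speeds) <= max_speed
        wiggly = sinuosity(window) >= min_sinuosity
        labels.append("slow-wiggly" if slow and wiggly else "other")
    return labels

# Hypothetical track: a fast straight leg followed by a slow, winding one.
fixes = [(0, 0, 0), (1, 10, 0), (2, 20, 0), (3, 30, 0),
         (4, 31, 1), (5, 30, 2), (6, 31, 3)]
labels = label_windows(fixes, size=4, max_speed=2.0, min_sinuosity=1.2)
print(labels)
```

The labels are deliberately structural: nothing in this code justifies renaming "slow-wiggly" to "foraging". That step requires the ground-truth validation described above.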
4.3 Reach Out for Application Domain Outlets
I have argued above that interdisciplinary collaborations help spread ideas and gain visibility in neighboring fields. However, it is worth having a look at the direction of this exchange for work on movement analysis.
In my opinion, this interdisciplinary exchange has so far been very one-sided. The community has been very
successful in publishing collaborations with domain specialists – but mainly in GIScience and computer science
outlets. This is not really surprising since, for example, a bird biologist gets little credit for having developed a
new method. What counts is contributing to a better understanding of avian navigation; the methods that helped
advance that understanding often literally disappear in the small print. Additionally, methodological practice
is very persistent in many research fields. So, introducing new methods in application fields that have established
standard procedures is hard. Nevertheless, I still consider movement analysis one of the branches of GIScience with
the biggest outreach potential. Hence, even if it is difficult, we should aspire to get our work published
in the application domains’ outlets. For the recognition and the further advancement of movement analysis as
a key GIScience contribution, the most valuable publications are those that appear in the application fields’ outlets,
e.g. [7].
4.4 Revisit Privacy
My fourth high hanging fruit is an old friend: Privacy. Clearly, animal movement is an excellent use case for
stimulating and interesting CMA problems [13, p. 85]. But as I argued above, the movement of people and
the related applications and services bear a much larger socio-economic potential. This is a huge opportunity
not to be missed. But people care about privacy. I see two main challenges with respect to privacy: First,
develop strategies for getting access to the really interesting large volumes of people movement data. Second,
develop analytical frameworks that can produce useful information but at the same time safeguard people’s
privacy. Clearly, the community has already produced concepts and algorithms that do just that, but mainly on a
theoretical level. Now comes the opportunity to put concepts to the test and apply them to the big data coming
our way.
4.5 Engage With Big Data
I would argue that most data sets worked on in movement analysis to date do not really qualify as big data. That
is per se not a problem, as conceptual contributions can easily be made with smaller and manageable data sets.
However, considering the huge potential and relevance of Eulerian movement systems, the big data streaming
out of ICT, public transport and traffic systems clearly challenge the discipline. Even more so, as processing
big data explicitly requires methods that can cope with messy, noisy, incomplete, uncertain, heterogeneous and
multi-source data. Since these are all known characteristics of spatio-temporal and geographic data in the first
place, surely GIScience has a contribution to make in coping with mobility-related big data pouring out of our
ICT infrastructure and smart cities.
4.6 Envision Mobile Everyware
My last high hanging fruit is arguably still ripening. It is probably fair to say that most movement analysis to date
still happens on centralized desktop computers. However, in the course of advancing ubiquitous computing, we
may expect mobile spatial “everyware” to increasingly become a normal part of our ICT infrastructure. It is easy
to picture a plethora of mobile and communicating computing nodes in smart cities requiring mobility-related
intelligence for autonomous transportation or aging populations. Big challenges here include the adaptation of
established movement analysis concepts for decentralized environments and the seizing of new opportunities
and application fields of movement analysis that are only starting to crystallize in today’s very dynamic and volatile
computing environments (see, for example, [3]).
5 Concluding Remarks
With Computational Movement Analysis, GIScience has arguably overcome cartography’s legacy of the static.
The discipline has built up a solid theoretical basis on capturing, preprocessing, structuring, and visualizing
movement data, exploiting a range of conceptual data models and data structures for movement spaces and movement traces adapted from the GIScience toolbox for handling static spatial data. CMA should, however, widen
its focus to include movement data emerging from systems other than GPS tracking and get ready for the expected big
data streams pouring out of today’s and tomorrow’s ICT infrastructure. Despite all progress in processing movement data, the biggest challenge remains the semantic annotation of found structures, as a true understanding of
the involved processes is impossible without understanding their context.
Acknowledgments
This position paper is the result of a keynote talk with the same title I gave at the GIScience 2014 workshop on
Analysis of Movement Data in September 2014 in Vienna. I’d like to thank the organizers for inviting me. I’d
finally also like to thank the Zurich University of Applied Sciences for supporting my research.
References
[1] G. Andrienko, N. Andrienko, P. Bak, D. Keim, and S. Wrobel. Visual Analytics of Movement. Springer,
Berlin Heidelberg, 2013.
[2] S. Bleisch, M. Duckham, A. Galton, P. Laube, and J. Lyon. Mining candidate causal relationships in
movement patterns. International Journal of Geographical Information Science, 28(2):363–382, Nov.
2013.
[3] A. Both, M. Duckham, P. Laube, T. Wark, and J. Yeoman. Decentralized Monitoring of Moving Objects in
a Transportation Network Augmented with Checkpoints. The Computer Journal, 56(12):1432–1449, Sept.
2012.
[4] K. Buchin, T. J. M. Arseneau, S. Sijben, and E. P. Willems. Detecting movement patterns using Brownian
bridges. In Proceedings of the 20th International Conference on Advances in Geographic Information
Systems - SIGSPATIAL ’12, page 119, New York, New York, USA, Nov. 2012. ACM Press.
[5] K. Buchin, M. Buchin, and J. Gudmundsson. Detecting single file movement. In Proceedings of the 16th
ACM SIGSPATIAL international conference on Advances in geographic information systems - GIS ’08,
page 1, New York, New York, USA, Nov. 2008. ACM Press.
[6] U. Demšar, K. Buchin, E. E. van Loon, and J. Shamoun-Baranes. Stacked space-time densities: a geovisualisation approach to explore dynamics of space use over time. GeoInformatica, 19(1):85–115, Apr.
2014.
[7] S. Dodge, G. Bohrer, R. Weinzierl, S. C. Davidson, R. Kays, D. Douglas, S. Cruz, J. Han, D. Brandes,
and M. Wikelski. The environmental-data automated track annotation (Env-DATA) system: linking animal
tracks with environmental data. Movement Ecology, 1(1):3, July 2013.
[8] A. Galton. Dynamic Collectives and Their Collective Dynamics. In A. Cohn and D. Mark, editors, Spatial
Information Theory SE - 19, volume 3693 of Lecture Notes in Computer Science, pages 300–315. Springer
Berlin Heidelberg, 2005.
[9] R. H. Güting and M. Schneider. Moving Objects Databases. Morgan Kaufmann, 2005.
[10] A. Hailey. How far do animals move? Routine movements in a tortoise. Canadian Journal of Zoology,
67(1):208–215, Jan. 1989.
[11] G. C. Hays, J. D. R. Houghton, and A. E. Myers. Endangered species: Pan-Atlantic leatherback turtle
movements. Nature, 429(6991):522, June 2004.
[12] P. J. Landin. The Mechanical Evaluation of Expressions. The Computer Journal, 6(4):308–320, Jan. 1964.
[13] P. Laube. Computational Movement Analysis. SpringerBriefs in Computer Science. Springer International
Publishing, Berlin Heidelberg, 2014.
[14] P. Laube, M. Duckham, and M. Palaniswami. Deferred decentralized movement pattern mining for geosensor networks. International Journal of Geographical Information Science, 25(2):273–292, Mar. 2011.
[15] P. Laube and S. Imfeld. Analyzing Relative Motion within Groups of Trackable Moving Point Objects. In
M. Egenhofer and D. Mark, editors, Geographic Information Science SE - 10, volume 2478 of Lecture
Notes in Computer Science, pages 132–144. Springer Berlin Heidelberg, 2002.
[16] P. Laube, S. Imfeld, and R. Weibel. Discovering relative motion patterns in groups of moving point objects.
International Journal of Geographical Information Science, 19(6):639–668, July 2005.
[17] P. Laube, M. van Kreveld, and S. Imfeld. Finding REMO - Detecting Relative Motion Patterns in Geospatial
Lifelines. In Developments in Spatial Data Handling SE - 16, pages 201–215. Springer Berlin Heidelberg,
2005.
[18] R. Nathan, O. Spiegel, S. Fortmann-Roe, R. Harel, M. Wikelski, and W. M. Getz. Using tri-axial acceleration data to identify behavioral modes of free-ranging animals: general concepts and tools illustrated for
griffon vultures. The Journal of experimental biology, 215(Pt 6):986–96, Mar. 2012.
[19] R. S. Purves, P. Laube, M. Buchin, and B. Speckmann. Moving beyond the point: An agenda for research
in movement analysis with real data. Computers, Environment and Urban Systems, 47:1–4, Sept. 2014.
[20] G. Schofield, V. J. Hobson, S. Fossette, M. K. S. Lilley, K. A. Katselidis, and G. C. Hays. BIODIVERSITY
RESEARCH: Fidelity to foraging sites, consistency of migration routes and habitat modulation of home
range by sea turtles. Diversity and Distributions, 16(5):840–853, Sept. 2010.
[21] J. Shamoun-Baranes, R. Bom, E. E. van Loon, B. J. Ens, K. Oosterbeek, and W. Bouten. From sensor data
to animal behaviour: an oystercatcher example. PloS one, 7(5):e37997, Jan. 2012.
[22] J. Shamoun-Baranes, E. E. van Loon, R. S. Purves, B. Speckmann, D. Weiskopf, and C. J. Camphuysen.
Analysis and visualization of animal movement. Biology letters, 8(1):6–9, Feb. 2012.
[23] A. Soleymani, J. Cachat, and K. Robinson. Integrating cross-scale analysis in the spatial and temporal
domains for classification of behavioral movement. Journal of Spatial Information Science, 2015.
Semantic Enrichment and Analysis of Movement Data:
Probably it is just Starting!
Renato Fileto¹,², Vania Bogorny¹,², Cleto May¹,², Douglas Klein²
¹ Post-Graduate Program in Computer Science, ² Department of Informatics and Statistics (INE),
Federal University of Santa Catarina (UFSC), Florianópolis-SC, Brazil
{r.fileto|vania.bogony}@ufsc.br, {cleto.may|douglas.klein}@grad.ufsc.br
Abstract
The widespread use of sensors and information systems, frequently via mobile devices, allows gathering
large amounts of movement data, such as trajectories of moving objects and sequences of users’ posts on
social media. These data can enable several applications, but some of them involve understanding what
is going on with moving objects (e.g., exact places and/or events of interest, activities performed, reasons
for stops and moves). Thus, there is a demand to enrich with well-defined semantics the potentially
imprecise spatiotemporal coordinates of movement data, which are sometimes tied together with text
(e.g., comments, tags). This paper provides an overview of proposals and possible developments in
semantic enrichment and analysis of movement data. It also presents some details of our current methods
to associate movement data with concepts and/or instances described in ontologies or Linked Open Data
(LOD). Our experiments with methods to associate tweets with places visited by the users who posted
them show that textual contents of some tweets can contribute to make correct associations. In addition,
the experience suggests that a variety of techniques can be helpful for semantically enriching movement
data in several analysis dimensions. It poses many research challenges, some of them multidisciplinary.
1 Introduction
The popularization of mobile devices, positioning and sensing technologies (e.g., GPS, GSM, RFID, cameras),
and information systems on the Web (e.g., social media, systems access logs) has created an abundance of data
that can be useful to analyze movements of objects and beings (vehicles, people, etc.). We call movement
data any collection of spatiotemporal positions of moving objects, which can be captured by sensors and/or
information systems. Each position can be represented by geographic coordinates and the instant when the
object occupied that position. Freely annotated movement data have text associated with some positions. This
definition encompasses, among other things, moving objects’ trajectories with freely annotated segments (e.g.,
sub-trajectories, stops, moves), sequences of users’ posts on social media, and their fusions [15].
A raw trajectory is a temporally ordered and fine grained sequence of spatiotemporal positions occupied
by a moving object. Nowadays, it is possible to get accurate trajectories by using state-of-the-art applications that
employ sensors to get positions of moving objects at fine sampling rates (e.g., every second, every 3 meters) [6].
However, it is hard to gather large volumes of annotated trajectories, because annotating is a laborious task [20].
We call a user trail a temporally ordered sequence of traces of a user in a particular system (e.g., Twitter,
Facebook). Unlike raw trajectories, user trails are usually imprecise and sparse in space and time, due
to limitations that can be imposed on access to accurate positions, and the asynchronous nature of the users’
activities. Nevertheless, trails composed of social media posts, for example, usually have plenty of content
(e.g., textual contents, hashtags, keywords, images). Some of these additional data may serve as annotations.
Several applications can benefit from movement data analysis, in such diverse areas as traffic, logistics, security and marketing. However, to realize potential applications it is necessary to develop appropriate methods
to extract useful information from usually large amounts of movement data. Recently, there has been significant progress in methods to handle such data [9, 10, 18, 24]. Notwithstanding, much of this progress refers to
spatiotemporal data management and analysis, while it is recognized by the scientific community that semantic
issues must be addressed to better understand and exploit movement data [2, 7, 17–19, 23, 25]. For example,
consider the following queries:
Q1: “Select the trajectories that have a stop to watch a sport event and another stop located up to a certain
distance of a touristic place called Corcovado in the city of Rio de Janeiro.”
Q2: “Select the trails of European people who visit Rio city and use public transportation to visit at least one
national park in Rio state, for at least 2 hours.”
Q3: “What is the percentage of the social media trails that are first at home, then at work, then at a place for
doing physical exercise (e.g. a gym), and finally at home again?”
Notice that these queries involve semantic issues, by referring to concepts (classes) such as sport event,
touristic place, city, European people, public transportation, national park, state, home, work, physical exercise, and gym. Some queries also refer to specific objects (instances) such as the touristic place called Corcovado
and the city or the state called Rio de Janeiro. Consequently, it is not possible to solve such queries confidently
just by looking at spatiotemporal coordinates. First, movement data to be queried must be semantically enriched
with annotations that precisely associate particular movement segments (i.e., spatiotemporal positions or temporally ordered subsequences of them) with resources (objects or concepts) that are relevant for the analysis.
The semantics of the annotations must be well-defined and processable by machines, to enable the automatic
identification of semantic relationships with things like places (e.g., a night club called Rio in the city having
the same name), phenomena (e.g., a traffic jam in Rio city caused by floods in the whole state also called Rio),
actions (e.g., buy or watch the movie Rio), and their respective classes (e.g., night club, city, state, movie).
This paper describes recent progress in the field of semantic enrichment of movement data, and delineates
possible research directions in this field. It reports some of our research results on automatically producing semantic annotations of movement data, i.e., associations of movement segments with resources described in
ontologies and Linked Open Data (LOD) collections. These annotations have better defined semantics than free
text. Our current methods employ spatiotemporal compatibility, several kinds of lexical similarity, and similarity
joins, for associating spatiotemporal positions or stops with visited Places of Interest (PoIs). However, a variety
of other techniques such as entity linking [3, 11, 22] can be useful for semantic enrichment in several analysis
dimensions that can help explain movements [7, 8]. Once movement data is semantically annotated, the well-defined semantics lent by concepts and/or objects of ontologies/LOD to the resulting annotations enable queries
such as Q1, Q2 and Q3 to be expressed and processed by using existing languages such as spatial extensions
of SQL and GeoSPARQL [1], among other possibilities. The experiments presented in this paper refer to the
automatic association of tweets with PoIs taken from LinkedGeoData¹, to indicate that the user who posted the
respective tweet visited that PoI. These experiments show that geographic proximity is crucial to make these
associations, but also that textual data can sometimes contribute to making correct associations.
The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 summarizes our research program in semantic enrichment, analysis, and mining of movement data. Section 4 briefly
describes some of our methods to semantically enrich movement data, and presents results of experiments that
associate tweets with visited PoIs. Finally, Section 5 delineates conclusions and research challenges.
¹ http://linkedgeodata.org
2 Related works
Lots of information can be extracted from freely annotated movement data such as trajectories and trails, and
combinations of them, with a myriad of useful applications [19]. However, freely annotated movement data lack
well-defined semantics to support information analysis. For instance, the tag "Rio" in a social media post may
refer to a city, a state, or even a restaurant, nightclub or a movie, among other possibilities. Therefore, to realize
potential applications, it is necessary to develop appropriate methods to turn raw spatiotemporal coordinates,
possibly connected with rough free text, into semantically rich movement data.
The fusion of trajectories with trails to produce trajectories annotated with the contents of social media posts
is addressed in [15]. In this proposal, trajectories are segmented and structured as temporally ordered sequences
of stops and moves, by using clustering-based techniques such as CB-SMoT [16] and DB-SMoT [21]. Then, the
spatiotemporal positions of both the structured trajectories and the social media trails are indexed to support
efficient algorithms that use proximity joins and ranking criteria to fuse the structured trajectories with trails.
However, just fusing movement data from different sources does not necessarily solve semantic issues such as
homonyms and ambiguities.
Several models, methods, and tools have been proposed for semantic enrichment, analysis, and mining of
movement data [2, 17, 18]. The use of ontologies and Knowledge Bases (KBs) for such purposes has been
addressed in a few works [12, 19]. The idea of semantically enriching movement data with resources of Linked
Open Data (LOD) collections was introduced in [7], which proposes ontological constructs and a general semi-automatic process for this enrichment. That work also illustrates how to accommodate the semantically enriched
movement data in the proposed ontological model, and the benefits of the well-defined semantics behind
LOD annotations of movement segments, for example for supporting the execution of GeoSPARQL queries that refer to semantic
aspects of the movement and the environment where it takes place.
Nevertheless, the development and the evaluation of efficient and effective methods for enriching movement
data with ontologies and KBs such as LOD collections remain an open research challenge. The work described
in [13] is an initial effort towards properly connecting freely annotated movement locations or stops with PoIs
taken from LOD collections. First, it uses a spatial access method to select the resources that are within a
given radius of each position or stop annotated with free text. Then, it chooses, among these resources in the
surroundings, those whose labels are most lexically similar to named entities found in the text associated with the
stop or movement location. It employs Soft-TF-IDF [14] to calculate the textual similarity between named entities in
the text associated with the movement data and labels of LOD resources, based on a similarity metric between
words, such as Levenshtein edit distance or Jaro-Winkler [4]. In fact, a variety of alternative lexical similarity
metrics and disambiguation techniques can be employed for making the associations. The experiments reported
in [7] and [13] enrich Flickr trails taken from CoPhIR² with DBPedia and LinkedGeoData resources. Despite
the simplicity of this method and the low probability that a Flickr post refers to an entity described in such a LOD collection,
the results include a considerable number of associations, some of them confirmed by photographic and textual
inspection. Several variations of this method have been investigated by our research group in experiments with
distinct movement data and LOD collections.
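The two-step association described above (a spatial filter followed by a lexical ranking) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation from [13]: it swaps Soft-TF-IDF for a plain normalized Levenshtein similarity over whole labels, scans the PoI list linearly instead of using a spatial access method, and all function names, PoI labels, and coordinates are hypothetical.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Case-insensitive similarity in [0, 1]; 1 means identical strings."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def associate(position, entity, pois, radius_m):
    """Step 1: keep PoIs within radius_m of the position.
    Step 2: return the remaining PoI whose label best matches the entity."""
    lat, lon = position
    nearby = [p for p in pois
              if haversine_m(lat, lon, p["lat"], p["lon"]) <= radius_m]
    if not nearby:
        return None
    return max(nearby, key=lambda p: similarity(entity, p["label"]))

# Hypothetical PoIs; the last one has a similar label but lies far away.
pois = [
    {"label": "Central Cafe", "lat": -27.5950, "lon": -48.5480},
    {"label": "City Museum", "lat": -27.5951, "lon": -48.5481},
    {"label": "Central Cafe Annex", "lat": -27.7000, "lon": -48.6000},
]
best = associate((-27.5950, -48.5482), "central caffe", pois, radius_m=100)
print(best["label"] if best else "no association")
```

In this sketch the radius bounds the candidate set before any text is compared, so a distant PoI with a matching label is never considered; a token-level measure such as Soft-TF-IDF would additionally tolerate reordered or partially matching words.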
3 Research Directions on Semantic Movement Data
Figure 1 provides an overview of the process for semantic enrichment, analysis, and mining of Movement
Data. Trajectories of moving objects (Trajs), trails of users on social media (or just Trails), and other kinds of
movement data must be selected for a particular application, cleansed to eliminate or correct erroneous or imprecise data, structured into movement segments (i.e., position subsequences that satisfy certain predicates, like
stops and moves [24, 25]), and sometimes fused before being semantically enriched and used. Analogously, the
enriching information, such as conventional databases (DBs), spatiotemporal databases (STDBs), knowledge
bases (KBs), or Linked Open Data (LOD) collections, also usually needs some pre-processing for data selection, cleansing and sometimes data integration, among other tasks, prior to being used for enriching particular
movement data.
² http://cophir.isti.cnr.it
Figure 1: Overview of the semantic enrichment, analyses and mining process
The Semantic Enrichment task aims to associate (annotate) particular positions or temporally ordered
subsequences of positions occupied by a moving object with data having well-defined semantics that help to
describe what is going on. The annotation of the movement segments can be done according to a variety
of dimensions such as those proposed in the CONSTAnT model [2], namely, Moving Object (which encompasses
the moving Entity (e.g., person) and the Device (e.g., cell phone) used to collect the movement data), Space (e.g.,
PoIs), Time (e.g., periods of time of interest, such as years, months, seasons, days of the week), Goal (e.g., eat,
shopping, study, leisure), Transportation Means, Environment Condition (e.g., windy), and Activity performed
by the moving entity (e.g., talking via the cell phone). All these dimensions can be organized in hierarchies of
concepts (e.g., kinds of PoIs, kinds of periods of time) and/or hierarchies of instances of these concepts (e.g.,
PoIs or periods of time organized in hierarchies in accordance with their containment relationships).
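To make the idea of a concept hierarchy concrete, the sketch below rolls annotated stops up to a coarser level of a small hierarchy for the Space dimension. The hierarchy, the concept names, and the functions are hypothetical and chosen only for illustration; they are not taken from the CONSTAnT model or from any LOD collection.

```python
from collections import Counter

# Hypothetical concept hierarchy for the Space dimension: child -> parent.
PARENT = {
    "Gym": "SportsFacility",
    "Stadium": "SportsFacility",
    "SportsFacility": "PoI",
    "Restaurant": "FoodPlace",
    "FoodPlace": "PoI",
}

def ancestors(concept):
    """Return the chain from the concept up to the root, both included."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def roll_up(concept, depth):
    """Generalize a concept to its ancestor at the given depth (0 = root).
    Concepts shallower than the requested depth are returned unchanged."""
    chain = ancestors(concept)[::-1]  # root first
    return chain[min(depth, len(chain) - 1)]

# Stops annotated with fine-grained concepts, counted at a coarser level.
stops = ["Gym", "Restaurant", "Stadium", "Gym"]
print(Counter(roll_up(s, 1) for s in stops))
```

Aggregating at depth 1 groups the gym and stadium stops under one category, which is the kind of refinement level a movement data warehouse could expose for analysis.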
Once a sufficient amount of movement data have been semantically enriched, they can also be accommodated
in semantic movement data warehouses to support Semantic Analysis, as proposed in [8]. That work proposes
a collection of constructs compatible with description logics and semantic Web standards to support flexible and
powerful information analyses of semantically enriched movement data. Those constructs include hierarchies
of semantic movement segments (semantic trajectories, stops, moves) with arbitrary refinement levels, analysis
dimensions of descriptive data with several hierarchies of categories and/or instances, and flexible conceptualizations of movement segments and movement patterns. It enables movement analysis based on movement
patterns defined by (i) spatiotemporal conformations of movement segments (e.g., moving clusters) [5]; and/or
(ii) semantic, ordering, and timing constraints on subsegments (e.g., semantic trajectories refined in one stop at
an Airport, followed by one stop at a Hotel, which is followed by a stop at a Stadium that lasts at least 4 hours).
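A pattern with semantic, ordering, and timing constraints, such as the Airport, Hotel, Stadium example above, can be checked with a simple scan over the stops of a semantic trajectory. The sketch below is an assumption-laden illustration rather than the constructs proposed in [8]: stops are plain tuples of (concept, start, end), "followed by" is read as "later in the trajectory, possibly with other stops in between", and all names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def matches_pattern(stops, pattern):
    """Return True if the pattern steps occur, in order, among the stops.

    stops:   list of (concept, start, end), temporally ordered
    pattern: list of (concept, minimum stop duration)
    """
    step = 0
    for concept, start, end in stops:
        if step == len(pattern):
            break
        wanted, min_duration = pattern[step]
        if concept == wanted and end - start >= min_duration:
            step += 1  # this stop satisfies the current pattern step
    return step == len(pattern)

# Airport, then Hotel, then a Stadium stop lasting at least 4 hours.
PATTERN = [
    ("Airport", timedelta(0)),
    ("Hotel", timedelta(0)),
    ("Stadium", timedelta(hours=4)),
]

day = datetime(2014, 6, 12)
trip = [
    ("Airport", day + timedelta(hours=8), day + timedelta(hours=9)),
    ("Hotel", day + timedelta(hours=10), day + timedelta(hours=12)),
    ("Restaurant", day + timedelta(hours=12, minutes=30), day + timedelta(hours=13, minutes=30)),
    ("Stadium", day + timedelta(hours=15), day + timedelta(hours=20)),
]
print(matches_pattern(trip, PATTERN))
```

Matching each step against the earliest stop that satisfies it is sufficient here, because taking an earlier qualifying stop never rules out a later match for the remaining steps.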
Semantically enriched movement data can also support what we call Semantic Mining, i.e., data mining
that exploits the semantic annotations. The extracted information and knowledge, besides having a variety of
applications, can also be used as feedback for more semantic enrichment, analysis and mining of movement data,
as illustrated in Figure 1.
4 Semantically Enriching Twitter Trails with PoIs of LOD Collections
Our group has been doing experiments for semantically enriching a variety of movement data: Flickr, Twitter,
and Facebook trails, some annotated trajectories, and some trajectories fused with social media trails of the same
users. The specific methods for associating trajectories with KB resources, such as LOD resources, vary with
the data involved and the semantic dimension (Space, Time, Goal, etc.). Since the beginning of this research
the focus has been on enriching movement segments (positions or sequences of positions such as stops)
with the visited PoIs. Although we plan to develop specific methods for semantic enrichment in different
movement analysis dimensions, so far our most reliable results are associations of positions with PoIs. Thus,
this paper keeps the focus on this kind of association.
Preliminary methods to semantically enrich movement data with resources taken from KBs such as LOD collections are presented in [7] and [13]. The algorithms proposed in [13] to associate movement segments
(usually individual positions or stops) with PoIs consider the PoIs’ proximity to the movement positions and the
textual similarity between PoI labels and named entities found in the free text associated with movement data
by system users (e.g., social media post contents, textually annotated trajectories). The basic idea behind these
algorithms is to select the visited PoI from the possibly multiple ones within a certain distance of each movement segment, by using the textual similarity between PoI labels and named entities associated with movement
segments to resolve these ambiguities. This may be necessary due to inaccuracies in the trajectories or in areas with
a high density of small PoIs or overlapping PoIs (e.g., in the same building with multiple stores).
4.1 Experimental Results
The experimental results presented in this paper refer to the enrichment of 3,642 tweets sent in the metropolitan
area of Florianópolis, Brazil in October 2015. These tweets in fact originated in FourSquare, when users
checked in at some place. Their contents follow the pattern “I am at place name”, which facilitates the extraction
of the place name and makes the generated results more reliable. Unfortunately, only a small percentage of
the tweets that we have collected from the Twitter API follow this pattern (between 2% and 14% in different
datasets used in our experiments so far). These tweets have been associated with 1,268 LinkedGeoData resources
(PoIs), whose geographic extensions are inside the Florianópolis metropolitan area.
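Extracting the place name from tweets that follow the “I am at place name” pattern can be done with a regular expression, as in the sketch below. The exact tweet layout is an assumption: the sketch also accepts the contraction "I'm at" and an optional trailing URL, neither of which is stated in the text, and the function name is hypothetical.

```python
import re

# Assumed layout: "I am at <place>" or "I'm at <place>", optionally
# followed by a link. Only the "I am at" prefix comes from the text.
CHECKIN = re.compile(
    r"^\s*I(?: am|'m) at (?P<place>.+?)\s*(?:https?://\S+)?\s*$",
    re.IGNORECASE,
)

def extract_place(text):
    """Return the place name of a check-in tweet, or None if the tweet
    does not follow the assumed pattern."""
    match = CHECKIN.match(text)
    return match.group("place") if match else None

for tweet in ["I am at Bobs", "I'm at Bobs http://t.co/abc", "Lovely day at the beach"]:
    print(repr(tweet), "->", extract_place(tweet))
```

Tweets that do not match return None and would need a different technique, such as named entity recognition, before they can be compared with PoI labels.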
Only a percentage of the tweets are associated with some PoI, because some named entities mentioned in
the tweets are not present in the LinkedGeoData collection. This percentage varies sharply in our experiments
(between 0.1% and 80%), according to the dataset and the parameters of the algorithms used for making
the associations. The parameters of our simplest algorithms are the spatial threshold (τs: maximum distance
between the movement segment and the PoI to consider the pair as a candidate association) and the textual
threshold (τt: the minimum textual similarity to consider a candidate association).
The number of associations per tweet can also be greater than one, when our methods are not able to eliminate
ambiguities. Figure 2 presents the average number of associations per associated tweet for two variations of our
method: the first returns all the associations that satisfy both parameters τs and τt, and the second takes only the
closest PoIs (i.e., the ones that are at the minimum distance found between any PoI and the respective
tweet). Notice that the second variation allows a sharp decrease in the number of PoIs associated with each tweet,
and in some cases eliminates ambiguities completely (the ones in bold). We have observed that the second
variation also runs considerably faster in most of our experiments.
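The two variations can be contrasted on a list of candidate PoIs for a single tweet, each with a precomputed distance and textual similarity. This is a sketch under assumptions, not our actual implementation: the names tau_s and tau_t stand for τs and τt, the candidate values are invented, and the second variation is assumed to take the minimum distance among the candidates that already satisfy both thresholds.

```python
def all_candidates(cands, tau_s, tau_t):
    """Variation 1: every PoI within distance tau_s whose label
    similarity is at least tau_t."""
    return [c for c in cands if c["dist"] <= tau_s and c["sim"] >= tau_t]

def closest_candidates(cands, tau_s, tau_t):
    """Variation 2: among the PoIs accepted by variation 1, keep only
    those at the minimum distance (ties are all kept)."""
    accepted = all_candidates(cands, tau_s, tau_t)
    if not accepted:
        return []
    dmin = min(c["dist"] for c in accepted)
    return [c for c in accepted if c["dist"] == dmin]

# Hypothetical candidates for one tweet: distance in meters, similarity in [0, 1].
cands = [
    {"poi": "A", "dist": 12.0, "sim": 0.90},
    {"poi": "B", "dist": 12.0, "sim": 0.70},
    {"poi": "C", "dist": 40.0, "sim": 0.95},
    {"poi": "D", "dist": 5.0, "sim": 0.10},   # closest, but text does not match
    {"poi": "E", "dist": 300.0, "sim": 1.00},  # perfect text match, too far
]
print([c["poi"] for c in all_candidates(cands, tau_s=100, tau_t=0.6)])
print([c["poi"] for c in closest_candidates(cands, tau_s=100, tau_t=0.6)])
```

Raising tau_t removes weaker textual matches before the distance comparison, which is one way the remaining ambiguity between equally distant PoIs can disappear.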
Some associations produced by our algorithms, mainly in areas of a high density of PoIs, were clearly
enabled by the textual similarity, because the spatial coordinates available in the data collections used in these
experiments are not accurate enough to enable the correct association. Figure 3 shows an example in which the
textual similarity is crucial to connect the tweet with the correct PoI, though this PoI is not the closest one to the
tweet geographic coordinates. The tweet is associated with Bobs instead of Posto Angeloni Beira Mar (the closest
resource) because the textual content of the tweet mentions that the user is at Bobs. Of course, the use of more
sophisticated techniques must be investigated to properly and efficiently connect movement data accompanied by
textual contents that do not follow the “I am at ” pattern with ontology concepts or LOD resources.
Figure 2: Average number of candidate resources per associated tweet
Figure 3: Association based on textual similarity between tweet and label of one of the PoIs in the surroundings
5 Conclusions and Future Work
We leave electronic traces of our movements as GPS trajectories, Wi-Fi and GSM networks access logs, Web
logs, credit card transactions recordings, etc. Lots of information about behaviors of moving objects can be
extracted from these traces. However, a key step for doing so is the semantic enrichment of these movement
data. It can be done by associating these data with well-defined semantics, to help explain what is going on.
In this paper, we discussed current research on semantic enrichment of movement data, and outlined some
possible research directions in this field. The methods for associating movement positions or stops with resources taken from existing information and KBs, such as extensive LOD collections available nowadays, are
just one example of what can be done. The results of experiments that we have done so far for associating
sequences of posts in social media with PoIs taken from LOD collections such as LinkedGeoData show that
our methods are very sensitive to the tuning of parameters, such as the geographical proximity and the textual
similarity that must be observed between a movement segment and a PoI to associate them. They also show that
though proximity is the basic criterion for making associations with visited PoIs, textual similarity is sometimes
crucial to make correct associations, especially when coordinates are not precise and in regions with a high density
of small or even overlapping PoIs.
Despite recent progress, we believe that research on data integration for semantic enrichment of movement
data is still in its infancy, and many methods are yet to be developed. Our experience suggests that a variety of
techniques can be helpful for semantically enriching movement data, posing many research challenges, some of them
multidisciplinary. Future work includes (i) devising methods to semantically enrich movement data according to
other analysis dimensions besides visited PoIs, such as transportation means and goals of the movements; (ii)
creating benchmarks to organize the investigation of methods for semantically enriching movement data and enable
a fair comparison of their performance, in terms of both execution time and result quality; (iii) conducting extensive
experiments with different datasets; and (iv) investigating the usefulness of the resulting semantic annotations to
support semantically-enabled analysis and mining of movement data in several application domains.
Acknowledgments
The authors were partially supported by EU Marie Curie IRSES-SEEK (grant 295179), CNPq (grant
478634/2011-0), CAPES, and FEESC.
References
[1] R. Battle and D. Kolas. Enabling the geospatial Semantic Web with Parliament and GeoSPARQL. Semantic
Web, 3(4):355–370, 2012.
[2] V. Bogorny, C. Renso, A. R. de Aquino, F. de Lucca Siqueira, and L. O. Alvares. CONSTAnT - A
Conceptual Data Model for Semantic Trajectories of Moving Objects. T. GIS, 18(1):66–88, 2014.
[3] D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, and S. Trani. Learning relatedness measures for entity
linking. In Q. He, A. Iyengar, W. Nejdl, J. Pei, and R. Rastogi, editors, CIKM, pages 139–148. ACM, 2013.
[4] W. W. Cohen, P. D. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for namematching tasks. In S. Kambhampati and C. A. Knoblock, editors, IIWeb, pages 73–78, 2003.
[5] S. Dodge, R. Weibel, and A.-K. Lautenschütz. Towards a Taxonomy of Movement Patterns. Information
Visualization, 7(3):240–252, June 2008.
[6] A. Doulamis, N. Pelekis, and Y. Theodoridis. Easytracker: An android application for capturing mobility
behavior. 2012 16th Panhellenic Conference on Informatics, 0:357–362, 2012.
[7] R. Fileto, M. Krüger, N. Pelekis, Y. Theodoridis, and C. Renso. Baquara: A Holistic Ontological Framework for Movement Analysis Using Linked Data. In W. Ng, V. C. Storey, and J. Trujillo, editors, ER,
volume 8217 of LNCS, pages 342–355. Springer, 2013.
[8] R. Fileto, A. Raffaetà, A. Roncato, J. A. P. Sacenti, C. May, and D. Klein. A semantic model for movement
data warehouses. In 16th DOLAP, Shanghai, China, November 7 (to appear), 2014.
[9] B. Furletti, L. Gabrielli, C. Renso, and S. Rinzivillo. Analysis of GSM calls data for understanding user
mobility behavior. In BigData Conference, pages 550–555. IEEE, 2013.
[10] F. Giannotti, M. Nanni, D. Pedreschi, F. Pinelli, C. Renso, S. Rinzivillo, and R. Trasarti. Unveiling the
complexity of human mobility by querying and mining massive trajectory data. VLDB J., 20(5):695–719,
2011.
[11] Z. Guo and D. Barbosa. Entity linking with a unified semantic representation. In 23rd Intl. Conference on
World Wide Web, WWW Companion, pages 1305–1310, 2014.
[12] G. Manco, M. Baglioni, F. Giannotti, B. Kuijpers, A. Raffaetà, and C. Renso. Querying and reasoning for
spatiotemporal data mining. In F. Giannotti and D. Pedreschi, editors, Mobility, Data Mining and Privacy,
pages 335–374. Springer, 2008.
[13] C. May and R. Fileto. Connecting Textually Annotated Movement Data with Linked Data. In IX Regional
School on Databases, ERBD, São Francisco do Sul, SC, Brazil (in Portuguese), 2014. SBC.
[14] E. Moreau, F. Yvon, and O. Cappé. Robust Similarity Measures for Named Entities Matching. In 22nd
Intl. Conf. on Computational Linguistics - Volume 1, COLING, pages 593–600, 2008.
[15] R. G. B. Nabo, R. Fileto, C. Renso, and M. Nanni. Annotating Trajectories by Fusing them with Social
Media Users’ Posts. In Brazilian Symposium on Geoinformatics, GeoInfo, Campos do Jordão, SP, Brazil
(to appear), 2014.
[16] A. T. Palma, V. Bogorny, B. Kuijpers, and L. O. Alvares. A clustering-based approach for discovering
interesting places in trajectories. In R. L. Wainwright and H. Haddad, editors, SAC, pages 863–868. ACM,
2008.
[17] C. Parent, S. Spaccapietra, C. Renso, G. L. Andrienko, N. V. Andrienko, V. Bogorny, M. L. Damiani,
A. Gkoulalas-Divanis, J. A. F. de Macêdo, N. Pelekis, Y. Theodoridis, and Z. Yan. Semantic trajectories
modeling and analysis. ACM Comput. Surv., 45(4), 2013. Article 42.
[18] N. Pelekis and Y. Theodoridis. Mobility Data Management and Exploration. Springer, 2014.
[19] C. Renso, M. Baglioni, J. A. F. de Macêdo, R. Trasarti, and M. Wachowicz. How you move reveals who
you are: understanding human behavior by analyzing trajectory data. Knowl. Inf. Syst., 37(2):331–362,
2013.
[20] S. Rinzivillo, F. de Lucca Siqueira, L. Gabrielli, C. Renso, and V. Bogorny. Where Have You Been Today?
Annotating Trajectories with DayTag. In SSTD, volume 8098 of LNCS, pages 467–471. Springer, 2013.
[21] J. A. M. R. Rocha, V. C. Times, G. Oliveira, L. O. Alvares, and V. Bogorny. DB-SMoT: A direction-based
spatio-temporal clustering method. In IEEE Conf. of Intelligent Systems, pages 114–119. IEEE, 2010.
[22] W. Shen, J. Wang, P. Luo, and M. Wang. Linking named entities in tweets with knowledge base via user
interest modeling. In ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, KDD, pages
68–76, 2013.
[23] S. Spaccapietra and C. Parent. Adding meaning to your steps (keynote paper). In M. A. Jeusfeld, L. M. L.
Delcambre, and T. W. Ling, editors, ER, volume 6998 of LNCS, pages 13–31. Springer, 2011.
[24] S. Spaccapietra, C. Parent, M. L. Damiani, J. A. F. de Macêdo, F. Porto, and C. Vangenot. A conceptual
view on trajectories. Data Knowl. Eng., 65(1):126–146, 2008.
[25] Z. Yan, D. Chakraborty, C. Parent, S. Spaccapietra, and K. Aberer. Semantic trajectories: Mobility data
computation and annotation. ACM TIST, 4(3), 2013.
Hermoupolis: A Semantic Trajectory Generator in the
Data Science era
Nikos Pelekis¹, Stylianos Sideridis², Panagiotis Tampakis², Yannis Theodoridis²
¹ Department of Statistics and Insurance Science, University of Piraeus, Greece
² Department of Informatics, University of Piraeus, Greece
{npelekis, siderste, ptampak, ytheod}@unipi.gr
Abstract
The domain of trajectory data management and mining contributes interesting research problems, and corresponding effective solutions, to what is called data science. A trend that poses new challenges in the field, and that has emerged especially with the advance of location-based social networks, is that the data involved can no longer be considered purely spatiotemporal: trajectories of
moving objects also carry additional semantic information that deserves to be further explored. On
the other hand, the real trajectory datasets available today are neither adequate nor appropriate for
a wide empirical evaluation of related research proposals. As in other domains, a practical approach
to overcoming this limitation is to develop efficient and functional synthetic trajectory generators. In this
line of research, we present Hermoupolis, a pattern- and semantic-aware synthetic trajectory generator that produces realistic semantic trajectory datasets (along with their synchronized raw
spatiotemporal counterparts) conforming to mobility profiles given as input by users.
1 Introduction and Motivation
The immense advances in location-aware mobile device technology (smart phones, GPS navigation
devices, tablets, etc.), along with the development of appropriate techniques for storing, processing, and analyzing spatio-temporal data, have led to the generation of huge amounts of GPS-like data. A challenge that has so far been met successfully
is the transformation of this kind of data into actionable knowledge about the mobility of the territories that we occupy. Examples of this success story are the several mobility
data mining techniques that have been developed, such as the identification of moving clusters [7], clusters of entire trajectories [12] or of
sub-trajectories [10, 15], flocks [4, 9], convoys [6], sequential trajectory patterns [2], swarms [11], and top-k
representative trajectory samples [13].
A vital prerequisite for the evaluation of these methods is experimentation with enormous amounts of real
mobility data. Unfortunately, such datasets cannot be easily acquired and, when they are, they are significantly
distorted due to privacy issues, which diminishes their value. For this reason, a number of synthetic generators of
moving objects have been developed so far, assisting researchers in evaluating the scalability of the aforementioned
techniques. However, evaluating scalability is not enough: we also need to evaluate the effectiveness of these
methods with respect to their prescribed specifications, i.e. whether they identify the patterns that they are supposed to identify.
To achieve this, one needs to guarantee the existence of specific patterns within a dataset, the so-called
ground truth. Usually, it is the lack of such ground truth that makes real or synthetically generated trajectory
datasets inappropriate for evaluating the effectiveness of the various approaches. So far, the effectiveness of
such methods has been evaluated on small “hand-made” datasets but, like efficiency and scalability,
effectiveness should also be tested on very large datasets. The converse also holds: efficiency and
scalability should be evaluated on datasets that contain patterns of known ground truth, at various
scales. Only in this way can experimental results be interpretable and useful.
In addition, a recent research trend in the mobility data management and mining community is to take
advantage of semantic information that annotates raw spatio-temporal data. Such semantically annotated trajectories [14, 16] provide a more valuable representation of the mobility behavior of a moving object.
Informally, a raw trajectory is a sequence of recorded positions along with the timestamps at which these positions
were recorded, while its semantic equivalent is a sequence of more abstract and insightful fragments of mobility
data, called episodes. Episodes can be either moves (driving, walking, riding a bicycle, etc.) or stops (e.g. at
work, at home, at the supermarket) that hold semantic annotations of type “what?”, “how?”, etc.
Preserving semantic information is a natural and valuable representation of mobility behavior, since it captures the purpose and the means of the movement. Hence, synchronized semantic and raw
spatio-temporal information is essential.
In this paper, we present Hermoupolis, a pattern- and semantic-aware synthetic trajectory simulator, which
produces annotated trajectories of moving objects following given mobility profiles, along with the respective
simulated GPS-like recordings, which are network-constrained. To the best of our knowledge, such a generator
has unique characteristics with respect to related work, which includes (raw) trajectory data generators, such as
GSTD [17], Brinkhoff [1], and SUMO [8], as well as generators that are activity-oriented but far from being semantic
trajectory data generators, such as BerlinMOD [15], ST-ACTS [3], and MWGen [18].
The paper is structured as follows: Section 2 presents the Hermoupolis workflow, Section 3 describes interesting use cases showing how users may handle Hermoupolis in order to simulate various mobility scenarios,
while Section 4 concludes and points out challenging future work in the field.
2 Hermoupolis Workflow
Hermoupolis input consists of (i) a road network, N = G(V, E), (ii) a set of points of interest on this network,
PoI, and (iii) a set of mobility profiles, MP, which we desire to simulate over this network. In detail, since
our goal is to produce network-constrained semantic trajectories, an important element of the generator is the
underlying road network and its properties. Hermoupolis follows the typical paradigm: a road network N
is represented by a graph $G = (V, E)$ consisting of a set of vertices $V = \{v_1, \ldots, v_n\}$ that correspond to
geographical locations $(x, y)$ and a set of edges $E = \{e_{ij} = (v_i, v_j) \mid v_i, v_j \in V, i \neq j\}$ connecting those
vertices. Edges are also annotated with information that describes the type of road (e.g. highway). An additional
input of the generator is a set of points of interest PoI of the area upon which the simulation will take place. Each
poi ∈ PoI is a tuple (poi-ID, poi-Loc, poi-Tags, poi-Cat), where poi-ID is a unique identifier of each poi, poi-Loc
is a spatial point (x, y) denoting its location, poi-Tags is a set of tags stating its underlying utility, and poi-Cat is
the corresponding category that the poi belongs to.
Furthermore, a mobility profile mp ∈ MP is a tuple (mp-ID, c, $Stop_0$, $Move_1$, $Stop_1$, $Move_2$, $Stop_2$, ...,
$Move_k$, $Stop_k$) that denotes the mobility profile identifier mp-ID, the cardinality c of the semantic trajectories
to be generated, along with a sequence of ‘abstract’ Stops and Moves (with the constraint that a movement starts
from and ends at a stop). In particular:
• An abstract Stop is a tuple (MBB, $\sigma^2_{space}$, $\sigma^2_{time}$, poi-Cat), where MBB is the (spatio-temporal) minimum
bounding box that defines the area inside which specific simulated Stop episodes take place, poi-Cat is
the category of PoIs that simulated Stop episodes must belong to, and $\sigma^2_{space}$ ($\sigma^2_{time}$) is the variance of the
spatial (temporal, respectively) range of the simulated Stop episodes (i.e. the variance of the spatial extent
around the selected PoIs and the variance of the duration of the simulated MBBs of the Stop episodes).
• An abstract Move is a tuple ($speed_{max}$, move-Tags), where $speed_{max}$ is the maximum allowed speed
of the movements to be simulated and move-Tags is a set of textual annotations that are attached to the
simulated Move episodes (examples include “driving”, “jogging”, etc.).
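As an illustration only, the mobility profile tuple defined above could be modelled with plain Python data classes. All class and field names below (AbstractStop, var_space, steps, and so on) are assumptions made for this sketch and do not reflect the actual input format of the Hermoupolis software; the only rule encoded is the stated constraint that a profile alternates Stop, Move, ..., Stop.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical names throughout; this is not the Hermoupolis input format.
# MBB layout assumed here: (x_min, y_min, t_min, x_max, y_max, t_max).
MBB = Tuple[float, float, float, float, float, float]


@dataclass(frozen=True)
class AbstractStop:
    mbb: MBB           # spatio-temporal box in which the Stop episodes fall
    var_space: float   # sigma^2_space
    var_time: float    # sigma^2_time
    poi_cat: str       # category of PoIs the Stop episodes must belong to


@dataclass(frozen=True)
class AbstractMove:
    speed_max: float            # maximum allowed speed
    move_tags: Tuple[str, ...]  # e.g. ("driving",)


@dataclass
class MobilityProfile:
    mp_id: str
    cardinality: int  # c: number of semantic trajectories to generate
    steps: List[Union[AbstractStop, AbstractMove]]  # Stop_0, Move_1, Stop_1, ...

    def validate(self) -> None:
        """Check that steps alternate Stop, Move, ..., Stop."""
        if len(self.steps) % 2 == 0:
            raise ValueError("expected k+1 Stops interleaved with k Moves")
        for i, step in enumerate(self.steps):
            expected = AbstractStop if i % 2 == 0 else AbstractMove
            if not isinstance(step, expected):
                raise ValueError(f"step {i} should be an {expected.__name__}")


# Made-up example values: a home Stop, a driving Move, a work Stop.
home = AbstractStop((0, 0, 0, 500, 500, 8 * 3600), 100.0, 600.0, "residence")
work = AbstractStop((4000, 4000, 9 * 3600, 4500, 4500, 17 * 3600), 100.0, 900.0, "office")
commute = AbstractMove(speed_max=50.0, move_tags=("driving",))
profile = MobilityProfile("mp-1", cardinality=1000, steps=[home, commute, work])
profile.validate()  # passes: Stop, Move, Stop
```

Keeping Stops and Moves in one ordered list mirrors the tuple notation of the text and makes the alternation constraint easy to check before any simulation starts.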
On the other hand, Hermoupolis output is a semantic mobility database consisting of a set of semantic
trajectories along with their raw counterparts that are compliant with the spatio-temporal-textual constraints
imposed by the MPs. In particular, a semantic trajectory is a tuple (T-ID, mp-ID, Episodes), where T-ID is a
unique identifier of the moving object trajectory, mp-ID is the identifier of the profile mp ∈ MP the trajectory
belongs to, and Episodes is a sequence of episodes. In turn, each episode e ∈ Episodes is a tuple (e-ID, e-flag,
e-MBB, e-tags, T-link), where e-ID is the episode identifier, e-flag is a flag in {‘Move’, ‘Stop’}, e-MBB is the 3D
spatio-temporal approximation of the counterpart raw sub-trajectory, e-tags is a set of keywords derived from the
corresponding abstract episode, and T-link is a link to the actual raw sub-trajectory, which in turn is a sequence
of map-matched spatiotemporal points $(x_i, y_i, t_i)$.
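The output model can be sketched in the same spirit. The snippet below is a minimal illustration, assuming that e-MBB is simply the axis-aligned box around the raw points of an episode and that the raw sub-trajectory is stored inline (the `raw` field stands in for T-link). The names are hypothetical and the actual database schema of Hermoupolis may differ.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

Point = Tuple[float, float, float]  # (x, y, t)
Box3D = Tuple[float, float, float, float, float, float]


def mbb_3d(points: List[Point]) -> Box3D:
    """Axis-aligned 3D box (x_min, y_min, t_min, x_max, y_max, t_max)."""
    xs, ys, ts = zip(*points)
    return (min(xs), min(ys), min(ts), max(xs), max(ys), max(ts))


@dataclass
class Episode:
    e_id: int
    e_flag: str             # 'Move' or 'Stop'
    e_tags: FrozenSet[str]  # keywords derived from the abstract episode
    raw: List[Point]        # stands in for T-link (assumption of this sketch)

    def __post_init__(self) -> None:
        if self.e_flag not in ("Move", "Stop"):
            raise ValueError("e_flag must be 'Move' or 'Stop'")

    @property
    def e_mbb(self) -> Box3D:
        # Derived from the raw points so both views stay synchronized.
        return mbb_3d(self.raw)


@dataclass
class SemanticTrajectory:
    t_id: int
    mp_id: str
    episodes: List[Episode]


ride = Episode(1, "Move", frozenset({"cycling"}),
               [(0.0, 0.0, 0.0), (2.0, 1.0, 10.0), (1.0, 3.0, 5.0)])
trajectory = SemanticTrajectory(t_id=7, mp_id="mp-1", episodes=[ride])
```

Deriving e-MBB from the raw points, rather than storing it separately, is one simple way to keep the semantic and raw representations consistent.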
Combining the above, Hermoupolis workflow is graphically illustrated in Figure 1.
Figure 1: Hermoupolis workflow (input – methodology – output).
Before we proceed with the presentation of case studies developed using Hermoupolis, it is important to note
that the starting point of the Hermoupolis software is the Brinkhoff generator [1], which has been radically extended in
order to provide the above functionality. Moreover, the Hermoupolis software, along with representative case studies,
is available at: http://infolab.cs.unipi.gr/hermoupolis/.
3 Hermoupolis in Action
In this section, we present three case studies that highlight Hermoupolis functionality and its usefulness in
simulating various mobility scenarios that are applicable to different domains and application areas. They can
be considered representative of the purposes the generator has been developed for:
• case study I, titled “a typical day in Athens”, demonstrates how researchers working on semantic trajectory
data management (reconstruction, processing, etc.) can find support in the empirical evaluation of their
proposals;
• case study II, titled “a big event in Athens”, aims at the transportation research field and makes use of the
expressive power of the generator in simulating real-world cases;
• case study III, titled “collective mobility behavior in Athens”, supports researchers in the mobility
data mining domain who seek datasets simulating various well-known mobility patterns.
Case Study I: “a typical day in Athens”. As already mentioned, one of the major contributions of the
Hermoupolis generator, in comparison to other generators, is its ability to produce not only “raw” trajectories
but also the corresponding “semantic” trajectories that are synchronized with the former. In this case study,
we present in detail such a scenario, where various mobility (population) profiles, each consisting of various
activities and transportation means, are simulated. The entire simulation scenario, consisting of six mobility
profiles, is illustrated in Figure 2. (Arrows in Figure 2 illustrate the direction of movement, while their thickness
is proportional to their cardinality.) More specifically, imagine a typical day of a mobility profile called “young-active-workers” (the orange colored mobility profile in Figure 2) as follows: starting from their home ($Stop_0$),
they take their bicycles to a bus station ($Move_1$) where they park their bicycles in a bicycle parking area
($Stop_1$) and catch their bus to work ($Move_2$). After 8 hours of work ($Stop_2$), they catch the bus back home
($Move_3$), arrive at the bus station ($Stop_3$) where they change back to their bicycles and ride home ($Move_4$).
As soon as they arrive there ($Stop_4$), they take their car to go grocery shopping ($Move_5$). After
shopping ($Stop_5$), they return home by car ($Move_6$) where they relax for a while ($Stop_6$). Then, they
walk to the gym ($Move_7$) to work out ($Stop_7$). Once they complete it, they return home on foot
($Move_8$) where they rest until the end of the day ($Stop_8$).
Similarly, the user has given as input another five mobility profiles (the mobility profiles in Figure 2
shown in colors other than orange). For the sake of presentation, we do not describe each
of those profiles in detail as we did for “young-active-workers” above. Instead, we provide a more abstract outline. In
particular, profile “school kids” (green) go from home to school in the morning and return in the afternoon,
all on foot. Profile “young students” (turquoise) start in the morning from home and head by bicycle to their
university, where they study until the afternoon; then, they ride to a nearby area to visit a café for
socialization and, finally, they return home, again by bicycle. Profile “middle-aged-workers” (purple)
simulates people following the typical home – work – home pattern, all by car. Similarly, profile “middle-aged-workers-and-shoppers” (red) simulates people moving home – work – shops – home, again all by car. Finally,
the sixth profile (light green), called “relaxers”, simulates those who use public transportation and whose everyday
movement routine is home – café – home.
It is straightforward to see that having such synchronized raw and semantic representations of mobility data is extremely useful for various lines of research. For instance, semantic trajectory reconstruction techniques
could take the produced raw data, apply the segmentation and semantic annotation approach under study,
and use the semantic counterparts to evaluate the proposals.
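One simple way to carry out such an evaluation is to measure how much of the ground-truth time is covered by a reconstructed episode with the same label. The sketch below is an assumption of this illustration, not a metric prescribed by Hermoupolis: episodes are reduced to (start, end, label) triples, and the predicted episodes are assumed not to overlap each other.

```python
from typing import List, Tuple

Interval = Tuple[float, float, str]  # (start, end, label)


def temporal_agreement(truth: List[Interval], predicted: List[Interval]) -> float:
    """Fraction of ground-truth time whose label matches the prediction.

    Assumes the predicted episodes do not overlap each other.
    """
    total = sum(end - start for start, end, _ in truth)
    if total == 0:
        return 0.0
    agree = 0.0
    for t_start, t_end, t_label in truth:
        for p_start, p_end, p_label in predicted:
            if t_label == p_label:
                # Length of the temporal overlap, or 0 if the intervals are disjoint.
                agree += max(0.0, min(t_end, p_end) - max(t_start, p_start))
    return agree / total


# Ground truth from the generator vs. the output of a segmentation method
# that detects the first Stop as ending two time units too late.
truth = [(0, 10, "Stop"), (10, 20, "Move"), (20, 30, "Stop")]
predicted = [(0, 12, "Stop"), (12, 20, "Move"), (20, 30, "Stop")]
score = temporal_agreement(truth, predicted)  # 28 of 30 time units agree
```

A time-weighted score of this kind penalizes boundary errors in proportion to their duration; other choices, such as episode-level precision and recall, are equally possible.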
Case Study II: “a big event in Athens”. Hermoupolis can be used to simulate the traffic flow of an entire
city over long periods of time, and to support the behavioural analysis of people living in an urban environment in which they
perform their daily activities. Under this setting, we present a scenario that simulates a big event (e.g. a concert
or a football game) taking place at the Athens Olympic stadium, which many people from the surrounding metropolitan area
rush to attend. This scenario is somewhat more intricate to simulate than the previous
one. For instance, people from different areas should have different starting times in order to
reach the place of the event on time (one who lives very close to the stadium could safely leave home a few
minutes before the starting time, as opposed to one who lives in a distant suburb and perhaps needs 1 hour or
even more to get there). This characteristic makes it mandatory to create several mobility profiles with different
starting times depending on their proximity to the place of the event. In our example, as illustrated in Figure 3,
we create 17 (= 8 + 7 + 1 + 1) profiles of people starting their way to the Athens Olympic stadium: 30 min (8
profiles; those living nearby), 60 min (7 profiles; those living in areas adjacent to the former), 75 min (1 profile;
those living east), and 90 min (1 profile; those living south) in advance.
Figure 2: A typical day in Athens, simulating various activities and means of transportation for various mobility
patterns during a day in Athens metropolitan area.
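The arithmetic behind the 17 profiles can be made concrete with a few lines of Python. The lead times and profile counts are taken from the text; the event start time used below is a made-up value for illustration, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

# Lead time in minutes -> number of profiles, as listed in the text.
LEAD_GROUPS = {30: 8, 60: 7, 75: 1, 90: 1}


def departure_times(event_start: datetime) -> dict:
    """Departure time of each lead-time group for a given event start."""
    return {lead: event_start - timedelta(minutes=lead) for lead in LEAD_GROUPS}


total_profiles = sum(LEAD_GROUPS.values())  # 8 + 7 + 1 + 1

# Hypothetical event start: 21:00 on an arbitrary day.
departures = departure_times(datetime(2015, 3, 1, 21, 0))
for lead in sorted(departures):
    print(f"{LEAD_GROUPS[lead]} profile(s) leave at {departures[lead]:%H:%M}")
```

With the assumed 21:00 start, the nearby profiles leave at 20:30 and the southern profile at 19:30; in a profile definition these times would bound the temporal extent of the first abstract Stop.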
The outcome of such an analysis could assist in “smart” and efficient urban planning and decision making,
thus having a great impact in the improvement of our everyday life.
Figure 3: A big event in Athens - simulating the traffic flow due to a scheduled event at Athens Olympic stadium.
Case Study III: “collective mobility behavior in Athens”. A unique feature of Hermoupolis is the ability
to produce moving objects that follow a variety of mobility patterns. Consider, for instance, Figure 4 (left) that
illustrates a mobility pattern consisting of 4 overlapping abstract Stops (depicted as rectangles) and 3 abstract
Moves (depicted as arrows). It is evident that such movement simulates a flock [4, 9] or convoy [6] mobility
pattern. Another example is demonstrated in Figure 4 (right), where there exist two mobility patterns; the one
(green) contains 6 abstract Stops and 5 abstract Moves whereas the other (turquoise) contains 5 abstract Stops
and 4 abstract Moves. Both profiles include Stops with varying spatial extent and varying speed and agility.
By forcing the two profiles to meet at two specific regions – see the first and the last but one abstract Stop in
Figure 4 (right) – we can simulate trajectories following a swarm pattern [11].
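The difference between the two families of patterns can be illustrated with a simplified check on a generated dataset. In the sketch below, the objects are assumed to have already been clustered at each (integer) timestamp; a group counts as convoy-like if it stays together for enough consecutive timestamps, and as swarm-like if it is together for enough timestamps overall, consecutive or not. This is a deliberately reduced reading of the definitions in [4, 6, 9, 11], with hypothetical function names, and not an implementation of those algorithms.

```python
from typing import Dict, Iterable, List, Set


def together_timestamps(clusters_by_time: Dict[int, List[Set[str]]],
                        group: Iterable[str]) -> List[int]:
    """Timestamps at which all objects of `group` share one cluster."""
    members = set(group)
    return sorted(t for t, clusters in clusters_by_time.items()
                  if any(members <= set(c) for c in clusters))


def longest_consecutive_run(timestamps: List[int]) -> int:
    """Length of the longest run of consecutive integer timestamps."""
    best = run = 0
    prev = None
    for t in timestamps:
        run = run + 1 if prev is not None and t == prev + 1 else 1
        best = max(best, run)
        prev = t
    return best


def is_swarm_like(clusters_by_time, group, min_t: int) -> bool:
    # Together for at least min_t timestamps, not necessarily consecutive.
    return len(together_timestamps(clusters_by_time, group)) >= min_t


def is_convoy_like(clusters_by_time, group, min_t: int) -> bool:
    # Together for at least min_t consecutive timestamps.
    return longest_consecutive_run(together_timestamps(clusters_by_time, group)) >= min_t


# Objects a and b meet at t=0, separate at t=1, and meet again at t=2 and t=3.
clusters = {0: [{"a", "b"}], 1: [{"a"}, {"b"}], 2: [{"a", "b"}], 3: [{"a", "b", "c"}]}
swarm = is_swarm_like(clusters, ["a", "b"], min_t=3)
convoy = is_convoy_like(clusters, ["a", "b"], min_t=3)
```

In this toy example the pair qualifies as swarm-like but not as convoy-like, which is the kind of ground truth that two profiles meeting only at selected Stops are meant to produce.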
Figure 4: Collective mobility behavior in Athens – simulating flocks / convoys (left) and swarm patterns (right).
4 Conclusions and Future Work
Concluding, Hermoupolis is a synthetic data generator that succeeds in generating synchronized raw spatiotemporal and semantically annotated trajectories, which simulate realistic mobility behaviours by following
user-defined mobility patterns, in terms of sequences of (abstract) Stops and Moves. As demonstrated through
the case studies that we presented, the application domain for such a generator could range from the empirical
evaluation of semantic trajectory data management techniques (that handle semantic trajectories as complex
spatio-temporal-textual sequences or behaviourally-rich semantic entities) and the effectiveness validation of
mobility pattern mining techniques to efficient urban planning through the simulation of the traffic flow of an
entire city, e.g. during a scheduled event.
In its current version, Hermoupolis requires that the mobility profiles to be simulated are given as input
by the user. Following the “by-example” paradigm, a quite challenging extension is to extract mobility profiles
automatically from a small real “semantically-aware” dataset that is given as input instead.
Under this setting, urban planners could provide, e.g., a small real dataset extracted from questionnaires or diaries
as input and get as output a large realistic synthetic dataset that keeps the semantics of the former: an extremely
useful tool in modern planning, e.g. in the emerging era of electric vehicles [5].
References
[1] T. Brinkhoff. A framework for generating network-based moving objects. Geoinformatica, 9(1):153–180,
2002.
[2] F. Giannotti, M. Nanni, D. Pedreschi, and F. Pinelli. Trajectory pattern mining. In Proc. of ACM SIGKDD,
2007.
[3] G. Gidofalvi and T. B. Pedersen. ST-ACTS: a spatio-temporal activity simulator. In Proc. of ACM GIS,
2006.
[4] J. Gudmundsson, M. J. Kreveld, and B. Speckmann. Efficient detection of patterns in 2d trajectories of
moving points. GeoInformatica, 11(2):195–215, 2007.
[5] D. Janssens, F. Giannotti, M. Nanni, D. Pedreschi, and S. Rinzivillo. Data science for simulating the era
of electric vehicles. Künstliche Intelligenz, 26(3):275–278, 2012.
[6] H. Jeung, M. L. Yiu, X. Zhou, C. S. Jensen, and H. T. Shen. Discovery of convoys in trajectory databases.
In Proc. of VLDB, 2008.
[7] P. Kalnis, N. Mamoulis, and S. Bakiras. On discovering moving clusters in spatio-temporal data. In Proc.
of SSTD, 2005.
[8] D. Krajzewicz, G. Hertkorn, C. Rössel, and P. Wagner. SUMO (Simulation of Urban MObility): an open-source traffic simulation. In Proc. of 4th Middle East Symposium on Simulation and Modelling, 2002.
[9] P. Laube, S. Imfeld, and R. Weibel. Discovering relative motion patterns in groups of moving point objects.
IJGIS, 19(6):639–668, 2005.
[10] J. G. Lee, J. Han, and K. Y. Whang. Trajectory clustering: A partition-and-group framework. In Proc. of
ACM SIGMOD, 2007.
[11] Z. Li, B. Ding, J. Han, and R. Kays. Swarm: mining relaxed temporal moving object clusters. Proc. of
PVLDB, 3(1-2):723–734, 2010.
[12] M. Nanni and D. Pedreschi. Time-focused clustering of trajectories of moving objects. JIIS, 27(3), 2006.
[13] C. Panagiotakis, N. Pelekis, I. Kopanakis, E. Ramasso, and Y. Theodoridis. Segmentation and sampling of
moving object trajectories based on representativeness. IEEE TKDE, 24(7):1328–1343, 2012.
[14] C. Parent, S. Spaccapietra, C. Renso, G. Andrienko, N. Andrienko, V. Bogorny, M. Damiani, A. Gkoulalas-Divanis, J. A. Macedo, N. Pelekis, Y. Theodoridis, and Z. Yan. Semantic trajectories modeling and analysis.
ACM Computing Surveys, 45(4), 2013.
[15] N. Pelekis, I. Kopanakis, E. Kotsifakos, E. Frentzos, and Y. Theodoridis. Clustering uncertain trajectories.
KAIS, 28(1):117–147, 2011.
[16] N. Pelekis, Y. Theodoridis, and D. Janssens. On the management and analysis of our lifesteps. ACM
SIGKDD Explorations Newsletter, 15(1):23–32, 2013.
[17] Y. Theodoridis, J. Silva, and M. Nascimento. On the generation of spatiotemporal datasets. In Proc. of
SSD, 1999.
[18] J. Xu and R. H. Güting. MWGen: A mini world generator. In Proc. of IEEE MDM, 2012.
Constructing Semantic Interpretation of Routine and
Anomalous Mobility Behaviors from Big Data
Georg Fuchs¹,³, Hendrik Stange¹, Dirk Hecker¹, Natalia Andrienko¹,²,³, Gennady Andrienko¹,²,³
¹ Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, St. Augustin, Germany
² City University London, London, UK
³ Friedrich-Wilhelms University, Bonn, Germany
Abstract
Annually organized VAST Challenges provide a unique opportunity to analyze complex data with available ground truth. In 2014, one of the tasks was to interpret routine and anomalous patterns of human
mobility based on big data: trajectories of cars and credit card transactions.
We describe a scalable visual analytics approach to solving this problem. Repeatedly visited personal and public places were extracted from trajectories by finding spatial clusters of stop points. Temporal patterns of people’s presence in the places resulted from spatio-temporal aggregation of the data by
the places and hourly intervals within the weekly cycle. Based on these patterns, we identified the meanings or purposes of the places: home, work, breakfast, lunch and dinner, etc. Meanings of some places
could be refined using the credit card transaction data. By representing the place meanings as points on
a 2D plane, we built an abstract semantic space and transformed the original trajectories to trajectories
in the semantic space, i.e., performed semantic abstraction of the data. Spatio-temporal aggregation of
the transformed trajectories into flows between the semantic places and subsequent clustering of time
intervals by the similarity of the flow situations allowed us to reveal and analyze the routine movement
behaviors. To detect anomalies, we (a) investigated the visits to the places with unknown meanings, and
(b) looked for unusual presence times or visit durations at different semantic places.
The analysis is scalable since all tools and methods can be applied to much larger data. Moreover,
the semantic data abstraction can serve as a tool for protecting the personal privacy.
1 Introduction
Huge amounts of data reflecting human mobility are constantly generated, including mobile phone use records,
geographically referenced posts in social media (Twitter, Foursquare, Flickr, etc.), and GPS tracks. These data
provide unprecedented opportunities for studying and understanding human mobility but require appropriate
analysis methods, in particular, methods for semantic analysis that could infer and exploit meanings of places
and purposes for attending places to enable understanding of people’s everyday behaviors and life styles.
Combining multiple sources of mobility data is challenging. Beyond the traditional many V’s of Big Data [10]
(Volume, Velocity, Variety, Veracity, to name just a few), there are specific complexities associated with the
peculiarities of human mobility and the corresponding data sets. In particular, different data sets have different structure,
different quality, and different spatial and temporal resolutions. Visual Analytics [12] creates opportunities for a
synergy between the human analyst and the computer by providing appropriate visual interfaces to all stages of computational analysis, from data pre-processing and exploration to pattern search and model building. In the context
of mobility analysis, visual analytics must address the specifics of space and time [3].
A common pattern of development in mobility analytics is the paradigm shift from syntactic [9] to semantic [11] analysis of movement data. Since mobility data by themselves are semantically poor, human interpretation, reasoning, and judgment are essential for giving sense and meaning to them. Purely computational methods
only produce elementary results, e.g., trajectories with labeled segments. Often, semantic interpretation is based
on a pre-defined set of places of interest (POI), also called areas of interest (AOI). This approach has limited
applicability if POIs are not available or outdated. Moreover, POIs may be useful for identifying public places
(visited by several people), but their applicability is limited for personal places (frequently visited by selected
individuals). Additionally, the same POI may have different meanings for different people. For example, an
apartment building may be a home place for its residents and a work place for service staff members.
In this paper we describe a visual analysis approach that facilitates human analyst-driven synthesis and
semantic interpretation of human mobility behavior at different levels of abstraction, as appropriate for the
analysis task at hand. It combines existing methods for the extraction, visual exploration and enrichment of
raw mobility data with a novel concept of semantic spaces that allow analysis of routine as well as abnormal
mobility behavior. We argue that the proposed transformation from geographic to semantic space creates new
opportunities for analysis of mobility data: unlike existing mobility analysis approaches, semantic spaces allow
comparing the behaviors of a few to many individuals across different geographies and time periods.
This paper extends our contribution that received an award for outstanding scalable analysis [6] at VAST
Challenge 2014 [8], mini-challenge 2. We demonstrate our approach to acquisition of semantically meaningful
locations by combining two simulated data sources of that challenge, namely, trajectories of cars and credit card
transactions. After extracting and interpreting personal and public places and assigning semantic labels
to them, we transform the original trajectories into sequences of place visit records, each record containing the
semantic label of the visited place and the start and end times of the visit. This transformation projects human
mobility from geographic to semantic space. We demonstrate that the proposed transformation creates new
opportunities for data analysis.
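The transformation into place visit records can be pictured with a short sketch. It assumes that stops have already been matched to extracted places and that a mapping from place identifiers to semantic labels is available; the names Visit and to_semantic_trajectory, the place identifiers, and the "unknown" label for places without an assigned meaning are all choices made for this illustration.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Visit(NamedTuple):
    label: str    # semantic label of the visited place
    start: float  # start time of the visit
    end: float    # end time of the visit


def to_semantic_trajectory(stops: Iterable[Tuple[str, float, float]],
                           place_labels: Dict[str, str],
                           unknown: str = "unknown") -> List[Visit]:
    """Replace each (place_id, start, end) stop by a labelled visit record."""
    ordered = sorted(stops, key=lambda stop: stop[1])  # order by start time
    return [Visit(place_labels.get(place_id, unknown), start, end)
            for place_id, start, end in ordered]


# Hypothetical place identifiers and labels; times are hours of one day.
labels = {"p17": "home", "p3": "work", "p42": "lunch"}
stops = [("p3", 8.5, 12.0), ("p17", 0.0, 8.0), ("p42", 12.1, 13.0), ("p99", 18.0, 18.5)]
visits = to_semantic_trajectory(stops, labels)
```

Because the resulting records carry labels and times but no coordinates, two people living in different cities can yield directly comparable sequences, which is the property the semantic space relies on.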
2 Problem Description and Example Data Sets
VAST challenges are open to participation by individuals and teams in industry, government, and academia. The
challenge setup typically comprises several interrelated, large data sets together with a set of complex analysis
tasks; the participants’ submissions should showcase their visual analytics approach and provide well-founded answers to these tasks. To accommodate the background story framing these analysis tasks, challenge data sets are
either derived from real data with alterations, or generated artificially but with realistic artifacts such as missing
values, precision limitations, and ambiguities just as could be expected from real use case data.
The 2014 IEEE VAST challenge’s background story can be found on the archived challenge website [8]. The
challenge consisted of three inter-related mini-challenges and an overall Grand Challenge. This paper presents
our visual analysis approach targeting mini-challenge 2 (MC2), which involved geospatial, temporal, and transaction data analysis. In a nutshell, a company called GAStech provided company cars to its employees. Both
personal and business uses were allowed. However, without the employees’ knowledge, GAStech had installed
trackers in the company vehicles. The devices periodically recorded the vehicles’ geospatial positions when they
were moving. The recorded tracks from a two-week period were provided for the challenge. Additionally, credit
and debit card transactions of the GAStech employees were available for the same period.
The GPS trajectories data set consists of 671,717 time-stamped positions of 40 distinct cars. The credit card
transactions data set consists of 1,087 records of 35 individuals. Each record includes a card owner name, named
location (but no geo-coordinates), date and time, and transaction amount. Both data sets include systematic and
arbitrary mistakes such as shifted positions, wrong times, missing records etc.
The overall task for participants of MC2 was, first, to describe the routine behaviors of the GAStech employees and, second, to identify suspicious patterns of behavior. The participants had to cope with uncertainties
that result from missing, conflicting, and imperfect data. The following sections briefly describe our approaches
to addressing these questions.
3 Extraction and Interpretation of Places
We used an automated tool [7] that extracts repeatedly visited personal and public places by spatial clustering
of points from trajectories. Relevant points can be selected beforehand by interactive filtering. We selected the
stop points lasting at least one minute. The tool finds groups of points that fit in circles
with a chosen maximal radius and unites close groups. We chose a sufficiently large maximal radius (100 m) to
account for the noise in the data. Personal places are extracted separately from the trajectory of each person.
Public places visited by at least a given minimal number of distinct persons (two in our analysis) are extracted
from all trajectories together. Figure 1 shows identified personal and public places.
Figure 1: Identified personal and public places.
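The place extraction step can be illustrated with a minimal sketch. It is not the tool from [7]: it is a simplified greedy grouping that assumes stop points are given as hypothetical (person, x, y) tuples in metres, assigns each point to the first place whose centre lies within the maximal radius, and then keeps as public those places visited by at least a minimal number of distinct persons.

```python
import math


def extract_places(stops, max_radius=100.0):
    """Greedily group stop points (person, x, y) into circular places.

    Simplified illustration only; names and data layout are assumptions.
    """
    places = []
    for person, x, y in stops:
        for place in places:
            if math.hypot(x - place["cx"], y - place["cy"]) <= max_radius:
                place["points"].append((x, y))
                place["visitors"].add(person)
                # Re-centre the place on the mean of its member points.
                n = len(place["points"])
                place["cx"] = sum(px for px, _ in place["points"]) / n
                place["cy"] = sum(py for _, py in place["points"]) / n
                break
        else:
            # No existing place is close enough: start a new one.
            places.append(
                {"cx": x, "cy": y, "points": [(x, y)], "visitors": {person}}
            )
    return places


def public_places(places, min_visitors=2):
    """Keep places visited by at least min_visitors distinct persons."""
    return [p for p in places if len(p["visitors"]) >= min_visitors]


stops = [("a", 0, 0), ("b", 10, 5), ("a", 500, 500), ("a", 505, 498)]
places = extract_places(stops)
shared = public_places(places)
```

In this toy input, the first two stops fall into one place shared by two persons, while the last two form a place visited by one person only.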
Through spatio-temporal aggregation of the trajectories [5], we obtained the visit counts for the extracted
places by hourly intervals within the weekly time cycle. For the personal places, only the visits of the place
owners were counted. We analyzed the temporal distributions of the place visits using 2D time histograms
(Figure 2), where the rows correspond to 7 days of the week, columns to 24 hours of the day, and marks in the cells
represent aggregated visit counts for the whole set or subsets of places. By clustering the places according to
similarity of their visit distributions, we found groups of places with prominent temporal patterns of visits, which
could be attributed to certain categories of places or activities: home, work, breakfast or coffee, lunch, lunch
and dinner. We combined card transaction data with the extracted stop points to (1) determine the geographic
locations of the businesses and (2) among the places visited in the evenings and on the weekend, distinguish
places for eating, shopping, sport, etc.
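As a rough illustration of the aggregation behind the 2D time histograms, the sketch below counts visits per cell of the weekly cycle. It assumes, for simplicity, that each visit is represented only by its start time as a Python datetime; the function name and the sample dates are hypothetical.

```python
from datetime import datetime


def weekly_histogram(visit_times):
    """Count visits per (weekday, hour) cell: 7 rows (Mon..Sun), 24 columns."""
    hist = [[0] * 24 for _ in range(7)]
    for t in visit_times:
        hist[t.weekday()][t.hour] += 1
    return hist


visits = [
    datetime(2014, 1, 6, 8, 15),   # a Monday
    datetime(2014, 1, 7, 8, 40),   # a Tuesday
    datetime(2014, 1, 13, 8, 5),   # the following Monday
]
hist = weekly_histogram(visits)
```

Places can then be compared by flattening each 7 x 24 table into a vector of 168 counts.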
We could not determine the meanings of five public places visited mostly in hour 11 on weekdays. The
temporal pattern did not hint at any common human activity, and no card transaction records could be matched
with the place visit times. We found that these places were visited by a particular group of people and gave them the
label “BFMO place”, where BFMO consists of the initials of the last names of these people. These places and
the corresponding mobility patterns require further investigation.
Figure 2: 2D time histograms represent the total counts of visits to different place categories by hourly intervals
in the weekly cycle.
4 Analysis of Routine Behaviors in Semantic Space
After the assignment of the semantic labels to the places, we transformed the original trajectories into sequences
of place visit records, each record containing the semantic label of the visited place and the start and end times
of the visit. The intermediate trajectory points between the place visits were omitted.
We created an abstract semantic space where the semantic categories of places are represented as points
on a 2D plane; we call them “semantic places”. Then, we transformed the sequences of place visit records
to trajectories in the semantic space. For this purpose, the place visit records were complemented with the
coordinates of the semantic places (see Figure 3).
Figure 3: Trajectories in the semantic space: map (left) and space-time cube (right).
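The transformation into the semantic space can be sketched as a simple lookup. The layout of the semantic places below is an arbitrary assumption made for illustration, as are the function name and the record format (label, start, end).

```python
# Hypothetical 2D layout of semantic places; the coordinates are arbitrary.
SEMANTIC_COORDS = {
    "home": (0.0, 0.0),
    "work": (1.0, 0.0),
    "lunch": (1.0, 1.0),
}


def to_semantic_trajectory(visits, coords=SEMANTIC_COORDS):
    """Replace each place label by the coordinates of its semantic place.

    visits: list of (label, start, end); returns list of (x, y, start, end).
    """
    return [(*coords[label], start, end) for label, start, end in visits]


trajectory = to_semantic_trajectory([("home", 0, 7), ("work", 8, 12)])
```

The output no longer contains any geographic coordinates, only positions in the abstract space and the visit times.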
The data transformation in which real geographic coordinates are replaced by “locations” in an abstract
semantic space is called semantic abstraction. Semantic abstraction is a tool for protecting personal location
privacy, since sensitive information about specific geographic locations visited by individuals is completely
removed from the data.
Figure 4: Summarized flows between semantic places. The widths of the flow symbols are proportional to the
total counts of the moves between the respective types of places.
Figure 5: Clustering of hourly intervals by the similarity of the flows in the semantic space reveals high regularity
of the movements.
Spatio-temporal aggregation can be applied to trajectories in abstract spaces in the same way as to trajectories
in geographic space [2]. We aggregated the transformed trajectories into flows (aggregate moves) between the
semantic places for the overall period and by hourly intervals. Figure 4 shows summarized flows between
semantic places.
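A minimal sketch of the aggregation into flows, assuming each transformed trajectory is available as an ordered list of semantic place labels (the function name is hypothetical):

```python
from collections import Counter


def aggregate_flows(sequences):
    """Count moves between consecutive, distinct semantic places."""
    flows = Counter()
    for seq in sequences:
        for origin, dest in zip(seq, seq[1:]):
            if origin != dest:
                flows[(origin, dest)] += 1
    return flows


sequences = [
    ["home", "work", "lunch", "work", "home"],
    ["home", "work", "home"],
]
flows = aggregate_flows(sequences)
```

Applying the same function to the visits falling into each hourly interval would give one set of flow magnitudes per interval.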
Then we clustered the intervals by similarity of the respective flow situations, i.e., vectors composed of
the magnitudes of the flows for all ordered pairs of semantic places. We applied the k-means clustering algorithm,
using the Manhattan distance between the vectors as the dissimilarity measure. In the calendar display (bottom center
of Figure 5), pixels representing the hourly intervals are colored by their cluster membership. Periodic patterns
with regard to the daily and weekly time cycles could be observed for different values of the parameter k (number
of clusters). In the small multiple maps in Figure 5, the average hourly flows for the time clusters are represented
by the widths of the flow symbols (curved lines with the curvature increasing towards the destination).
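The clustering of time intervals can be sketched as follows. This is a simplified, pure-Python illustration and not the implementation used in the analysis: it assumes that cluster centres are initialised with the first k vectors and updated as component-wise means, while the assignment step uses the Manhattan distance.

```python
def manhattan(u, v):
    """Manhattan (L1) distance between two equally long vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cluster_intervals(vectors, k, iterations=20):
    """k-means-style clustering with Manhattan distance for assignment.

    Assumes k <= len(vectors); initial centres are the first k vectors.
    """
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: manhattan(v, centers[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Each vector holds flow magnitudes for a fixed order of place pairs.
flow_vectors = [[10, 0, 0], [0, 0, 8], [9, 1, 0], [0, 1, 7]]
labels = cluster_intervals(flow_vectors, k=2)
```

Coloring each hourly interval by its label would reproduce the kind of calendar display described above.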
5 Detection and Analysis of Anomalies
By interacting with the map display of the semantic space and transformed trajectories, we selected the daily
trajectories visiting BFMO places (cf. Section 3) as shown in Figure 6, and observed that people went to these
places from the work place. From BFMO places, the visitors almost always moved to lunch/dinner places
and then returned to the work place. Hence, BFMO places were not visited for having lunch. By extracting
the corresponding place visit records, we examined who was in which place and when, and found five cases when two or
three people met in the same place. All four visitors of BFMO places were security employees.
For each semantic place, we analyzed the temporal distribution of the visits using a 2D time histogram
similar to those shown in Figure 2 but with 14 rows corresponding to the consecutive days of the two-week
period of the data (as in the calendar display in Figure 5). We paid attention to place visits at unusual times. For
each such case, we extracted (by spatio-temporal filtering) the place visit records including the visitors’ names,
exact visit times, and place names. In this way, we detected that four persons sometimes visited the homes of
their colleagues at night. We also detected night visits to work and some other anomalies.
Figure 6: By filtering through the semantic space map, we have selected only those daily trajectories that include
visits to BFMO places. There are 30 such trajectories, which are also shown in a summarized form as a set of
flows. The map shows that the visitors typically went to the BFMO places from work and after that went for
lunch, which means that the BFMO places are not lunch places.
6 Social Network Analysis
From the trajectories, we have extracted all meetings of the people and excluded the meetings that occurred
at work and the meetings of people living together at their homes. From the remaining meetings, we have
computed distances between individuals based on the relative frequencies of their meetings.
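The text does not give the exact distance formula, so the following is only one possible sketch. It assumes that each meeting is a set of person names and that the distance between two persons is one minus their meeting count divided by the largest pairwise count; both the formula and the names are assumptions.

```python
from collections import Counter
from itertools import combinations


def meeting_distances(meetings):
    """Derive pairwise distances from relative meeting frequencies.

    meetings: iterable of sets of person names present at one meeting.
    Assumed formula: distance = 1 - count / max_count.
    """
    counts = Counter()
    for group in meetings:
        for a, b in combinations(sorted(group), 2):
            counts[(a, b)] += 1
    top = max(counts.values())
    return {pair: 1.0 - n / top for pair, n in counts.items()}


distances = meeting_distances([{"A", "B"}, {"A", "B"}, {"B", "C"}])
```

Pairs that never met are absent from the result; a projection method would need a default (maximal) distance for them.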
Figure 7: The space of inter-personal relationships.
The map display (Figure 7) shows the space of inter-personal relationships. The 2D projection has been
obtained based on the pair-wise distances between the individuals. The dots represent the individuals and are
colored according to their employment types. The curved connecting lines represent the strengths of the relationships between the individuals (i.e., the relative meeting frequencies) by proportional widths and opacities.
We see a tight group of security employees (the group also includes two non-security persons). Two security
persons, Cocinaro and Osvaldo, bridge this group with another tight group made by engineering and information
technology employees. Another group of engineers is relatively separated from the latter group and from the
security group but has strong links to executive staff.
7 Conclusion
All methods and tools that we used for our analysis are scalable with regard to the number of individuals, number
of places, and length of the time period covered by the data. We mostly used aggregated views, which could also
be applied to much larger data. Detailed data (place visit records) were accessed only for analyzing anomalies.
The analysis greatly relied on computational data processing: stop and place extraction, data aggregation, and
clustering. These operations are also scalable. Apart from the examination of the anomalies, the analysis was
done in a way respecting personal privacy, without accessing personal data. Our approach demonstrates that
Visual Analytics not only can be harmful to personal privacy [4], but also has the potential to create opportunities
for privacy-preserving analysis of human mobility [1].
The page limit does not allow us to provide here a detailed bibliography on known methods and approaches
to place extraction from movement data and to describe in detail all algorithms used in our approach. Interested
readers are referred to our recently published paper [7], which includes a comprehensive bibliography and provides
all necessary algorithmic details. In that paper, the approach was used for reconstructing mobility patterns of
residents of San Diego based on geo-located Twitter messages and land use map layers.
References
[1] G. Andrienko and N. Andrienko. Privacy Issues in Geospatial Visual Analytics. In G. Gartner and F. Ortag,
editors, Advances in Location-Based Services, Lecture Notes in Geoinformation and Cartography, pages
239–246. Springer Berlin Heidelberg, 2012.
[2] G. Andrienko, N. Andrienko, P. Bak, D. Keim, and S. Wrobel. Visual Analytics of Movement. Springer
Verlag, 2013.
[3] G. Andrienko, N. Andrienko, U. Demsar, D. Dransch, J. Dykes, S. I. Fabrikant, M. Jern, M.-J. Kraak,
H. Schumann, and C. Tominski. Space, time and visual analytics. International Journal of Geographical
Information Science, 24(10):1577–1600, 2010.
[4] G. Andrienko, N. Andrienko, D. Keim, A. M. MacEachren, and S. Wrobel. Challenging problems of
geospatial visual analytics. Journal of Visual Languages & Computing, 22(4):251–256, 2011.
[5] N. Andrienko and G. Andrienko. Spatial Generalization and Aggregation of Massive Movement Data.
IEEE Transactions on Visualization and Computer Graphics, 17(2):205–19, 2011.
[6] N. Andrienko, G. Andrienko, and G. Fuchs. Analysis of Mobility Behaviors in Geographic and Semantic
Spaces. In VAST Challenge @ IEEE VAST 2014, 2014. Award for Outstanding Scalable Analysis.
[7] N. Andrienko, G. Andrienko, G. Fuchs, and P. Jankowski. Scalable and Privacy-respectful Interactive
Discovery of Place Semantics from Human Mobility Traces. Information Visualization, ??(?):???, July
2015. DOI: http://dx.doi.org/10.1177/1473871615581216.
[8] K. Cook, G. Grinstein, and M. Whiting. IEEE VAST Challenge 2014 – The Kronos Incident.
http://vacommunity.org/VAST+Challenge+2014, November 2014. Last accessed 2015-03-29.
[9] F. Giannotti and D. Pedreschi. Mobility, data mining and privacy: Geographic knowledge discovery.
Springer Science & Business Media, 2008.
[10] A. McAfee and E. Brynjolfsson. Big data: the management revolution. Harvard business review, 90:60–68,
October 2012.
[11] C. Parent, S. Spaccapietra, C. Renso, G. Andrienko, N. Andrienko, V. Bogorny, M. Damiani, A. Gkoulalas-Divanis, J. Macedo, N. Pelekis, Y. Theodoridis, and Z. Yan. Semantic trajectories modeling and analysis.
ACM Computing Surveys, 45(4):42, 2013.
[12] J. J. Thomas and K. A. Cook. Illuminating the Path: The Research and Development Agenda for Visual
Analytics. IEEE Computer Society, 2005.
An Integrated Qualitative and Boundary-based Formal
Model for a Semantic Representation of Trajectories
Jing Wu (1,2), Christophe Claramunt (2), Min Deng (1)
(1) Department of Geo-Informatics, Central South University, Changsha, China
(2) Naval Academy Research Institute, Lanveoc-Poulmic, BP 600, 29240 Brest Naval, France
{jing.wu,claramunt}@ecole-navale.fr, [email protected]
Abstract
Nowadays, the tracking, representation and analysis of moving objects and trajectories have attracted several research efforts at the formal level. The work presented in this paper introduces a spatial
qualitative approach for enriching semantic trajectories with movement predicates. The model developed integrates topological relations and qualitative distances between a trajectory and a region of interest. Such a spatio-temporal framework supports the derivation of the basic movement configurations
derived from moving and static entities. The approach is flexible enough to reconstruct the trajectory as
a sequence of highly-correlated episodes according to the underlying topological properties such as the
dimension and cardinality of the intersections that emerge between the trajectory and the given region.
1 Introduction
Thanks to the proliferation of GPS, WiFi, RFID and other sensor-based tracking techniques, the collection of
mobility data is becoming much more efficient thus offering many novel application perspectives. This also
favours the emergence of several research avenues such as spatial data mining [3] or statistical analysis [16]
where large trajectory data sets provide the information repository to manipulate. Although one might consider
a trajectory as a relatively straightforward modeling primitive, the information encapsulated is often rich when
considering the semantics and behaviour associated to the underlying movement.
A trajectory, modeled as a semantic abstraction, should be enriched by spatial, temporal and semantic
domain knowledge, a process known as semantic enrichment [1, 9]. Regarding the spatial dimension,
a trajectory can be modeled as a series of episodes as suggested in [6], an episode being defined as a maximal
homogeneous sub-sequence of a trajectory. This allows a given trajectory to be mapped to a series of spatial predicates
whose semantics can also be enriched by additional application-dependent criteria.
One can for example make a difference between stationary positions of a given entity (stops) and periods
in which the entity is moving (moves) as suggested by the Time Geography framework [4]. Specific episodes
can also be identified according to some speed variation [7], change of direction [10], route constraints [15] or
points of interest [13]. A distinction can also be made between the semantic and spatial dimensions in order to
provide a data model representation that supports different levels of abstraction as suggested in [14].
The research presented in this paper develops a spatial modeling approach whose objective is to enrich
trajectory data with movement predicates. The approach considers a trajectory behaviour with respect to a
given Region of Interest (ROI). The respective spatio-temporal configurations between a trajectory and a ROI
provide a sequence of episodes modeled as basic spatio-temporal predicates. The formal model developed
gives an intuitive set of movement predicates that can be used to model the behaviour of a given trajectory
with respect to some predefined ROIs. The whole approach is complemented by a systematic representation of
spatio-temporal configurations mapped to possible natural language expressions.
The remainder of this paper is organized as follows. Section 2 introduces the main principles behind the
formal model developed for a semantic enrichment of the concept of trajectory, while Section 3 presents a series
of movement predicates. Finally, Section 4 concludes the paper.
2 Modeling Principles
2.1 Qualitative topological relation
The trajectory of a moving entity with respect to a ROI can be modeled as a qualitative topological relation
between an oriented line and a region. The objective is to reflect how a given entity evolves outside, inside or
on the boundary of a region, the trajectory of this entity being represented as a directed line. Although several
models [2, 5] have been developed to describe line-region topological relations, whether they can cover all
possible configurations in a 2-dimensional space is still an open research question.
The Boundary-based Trajectory Model developed in a related work [11] provides the foundations of a topological relation model between a directed line and a region (DL-RE). The principles behind the Boundary-based
Trajectory Model are as follows. The directed line represents the trajectory of a moving entity in a 2-dimensional
space from a starting point ($L_{t_s}$) to a destination point ($L_{t_e}$). Several topological properties, such as the cardinality (m), dimension (d), orientation, and intersection types of the neighboring disc derived from the DL-RE
intersections (neigh(p)), are used to derive a set of primitive DL-RE topological relations. As shown in Figure
1, 30 DL-RE topological relations are derived from these properties. These relations
can then be composed to represent complex trajectories of a given spatial entity over one or more regions. They
also support the derivation of possible movements in case of incomplete knowledge.
So far this model focuses on the topological properties of the intersections on the boundary of the reference
entity without analysing the details when the entity is moving outside ($S_1$) or inside the reference entity ($S_2$),
this being the modeling objective of the extension presented in the following sections.
2.2 Qualitative distance relation
The Boundary-based Trajectory Model is oriented to the analysis of the trajectory configurations of a moving
entity, represented as a point, with respect to the boundary of a given region. However, little information is given
regarding the behaviour of such a trajectory away from the immediate proximity of this boundary, for instance far
from this region or inside this region. This is the reason behind the development of a complementary modeling
approach where a notion of qualitative distance is taken into account to distinguish whether the entity moves
toward or away from the ROI. The Qualitative Trajectory Model developed in [12] provides a set of modeling
primitives that support the qualitative representation of the movement between a moving entity and a ROI over
a given time interval T based on complementary qualitative RCC8 topological relations [8] and qualitative
distance relations D. The qualitative distance represents the monotonic and continuous variation of the minimum
distance d between the boundary of the moving entity and the boundary of the ROI: D is continuously increasing
outside ($d_{ext+}$), D is continuously decreasing outside ($d_{ext-}$), D is constant outside ($d_{ext=}$), D is null ($d_0$), D
is continuously increasing inside ($d_{int+}$), D is continuously decreasing inside ($d_{int-}$), and D is constant inside
($d_{int=}$).
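To make the seven qualitative distance values more concrete, the sketch below derives one of them from a sampled series of distances to the ROI boundary. It is an illustration under stated assumptions, not part of the model in [12]: the sampling, the tolerance eps, the boolean inside flag and the plain-text symbol names (such as "dext-") are all hypothetical.

```python
def qualitative_distance(distances, inside, eps=1e-9):
    """Map a sampled distance series to a qualitative distance symbol.

    distances: distances to the ROI boundary sampled over an interval.
    inside: True if the entity is inside the ROI during the interval.
    Returns None when the variation is not monotonic over the interval.
    """
    if all(abs(d) <= eps for d in distances):
        return "d0"
    deltas = [b - a for a, b in zip(distances, distances[1:])]
    side = "int" if inside else "ext"
    if all(abs(x) <= eps for x in deltas):
        return f"d{side}="
    if all(x > eps for x in deltas):
        return f"d{side}+"
    if all(x < -eps for x in deltas):
        return f"d{side}-"
    return None


approaching = qualitative_distance([30.0, 20.0, 10.0], inside=False)
```

A non-monotonic series would first have to be split into sub-intervals on which the distance varies monotonically.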
This approach is complemented by a tentative qualification of the possible natural language expressions
of the primitive movements identified. These movements are classified into three categories according to the
relative location of a moving entity with respect to a reference entity: movements outside, on the
boundary of, and inside the reference entity. Overall, this modeling approach is qualitative by nature and does
not support the identification of specific movements in the vicinity of the boundary of the reference region. This
motivates the search for an integrated modeling approach that combines the respective advantages of the
boundary-based and qualitative-based modeling approaches. This is the objective of the modeling framework
developed in the next section.
Figure 1: DL-RE topological relations
3 Towards an Integrated Qualitative and Boundary-based Trajectory Model
Let us first qualitatively model the trajectory of a moving point A with respect to a reference region B (ROI). The
primitive movement semantics (PriSem) of the trajectory is formally defined based on DL-RE topological relations ($\text{TOP}_{\text{DL-RE}}$) and qualitative distance D over a time interval T:
$$\text{PriSem}(A, B) \equiv \text{Holds}(\text{TOP}_{\text{DL-RE}}, D, T) \qquad (1)$$
where $\text{TOP}_{\text{DL-RE}} \in \{S_1, \dots, S_{30}\}$ and $D \in \{d_{ext+}, d_{ext-}, d_{ext=}, d_0, d_{int+}, d_{int-}, d_{int=}\}$.
The intersection type of the neighboring disc is used to derive $\text{TOP}_{\text{DL-RE}}$. If there is no intersection
between the trajectory and the boundary of the reference entity, the intersection type of $neigh(L_{t_e})$ should be
considered. Otherwise, the intersection type of neigh(I) is applied. The primitive movement semantics are
classified into three categories according to the number and dimension of the intersection as follows.
3.1 No intersection
Let us first consider the configuration where a moving point A is outside a reference entity B. Over a given
temporal interval T, three categories of primitive movement predicates can be distinguished: Approach (AP),
Leave (LV) and AroundOutside (AO). During that time interval T, the DL-RE relation is $S_1$ and the relative
distance can be either decreasing, increasing or constant ($d_{ext-}$, $d_{ext+}$ and $d_{ext=}$, respectively). More formally:
• Approach(A, B) denotes the case where a trajectory A approaches the reference entity B over a time interval
T, as shown in Figure 2a. More formally, for all $t \in T$, $S_1$ holds and the relative distance is decreasing
outside B:
$$\text{Approach}(A, B) \equiv \text{Holds}(S_1, d_{ext-}, T)$$
• Leave(A, B) denotes the case where a trajectory A leaves the reference entity B over a time interval T, as
shown in Figure 2b. More formally, for all $t \in T$, $S_1$ holds and the relative distance is increasing outside B:
$$\text{Leave}(A, B) \equiv \text{Holds}(S_1, d_{ext+}, T)$$
• AroundOutside(A, B) denotes the case where a moving entity A is either moving around or static outside the
reference entity B over a time interval T, as shown in Figure 2c. More formally, for all $t \in T$, $S_1$ holds and
the relative distance is constant outside B:
$$\text{AroundOutside}(A, B) \equiv \text{Holds}(S_1, d_{ext=}, T)$$
Figure 2: Movement outside a reference entity
When a moving point A is inside a reference entity B over a time interval T, there are three categories of
movements: MovetoBoundary (MB), MovetoInterior (MI) and AroundInside (AI). During that time interval T,
the DL-RE relation is $S_2$ and the relative distance can be either decreasing, increasing or constant ($d_{int-}$, $d_{int+}$
and $d_{int=}$, respectively). More formally:
• MovetoBoundary(A, B) denotes the case where a moving entity A is inside B and moves towards the boundary of B
over a time interval T, as shown in Figure 3a. More formally, for all $t \in T$, $S_2$ holds and the relative distance
between A and B is decreasing inside B:
$$\text{MovetoBoundary}(A, B) \equiv \text{Holds}(S_2, d_{int-}, T)$$
• MovetoInterior(A, B) denotes the case where a moving entity A is inside B and moves away from the boundary of B over
a time interval T, as shown in Figure 3b. More formally, for all $t \in T$, $S_2$ holds and the relative distance
between A and B is increasing inside B:
$$\text{MovetoInterior}(A, B) \equiv \text{Holds}(S_2, d_{int+}, T)$$
• AroundInside(A, B) denotes the case where a moving entity A is inside B and moves around the boundary of B
over a time interval T, as shown in Figure 3c. More formally, for all $t \in T$, $S_2$ holds and the relative distance
between A and B is constant inside B:
$$\text{AroundInside}(A, B) \equiv \text{Holds}(S_2, d_{int=}, T)$$
Figure 3: Movement inside a reference entity with no intersection
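The six primitive predicates of this subsection amount to a lookup from a (DL-RE relation, qualitative distance) pair to a predicate name. The sketch below encodes that table; the plain-text symbol names ("S1", "dext-", and so on) and the function name are hypothetical conveniences.

```python
# (DL-RE relation, qualitative distance) -> primitive movement predicate
PRIMITIVES = {
    ("S1", "dext-"): "Approach",
    ("S1", "dext+"): "Leave",
    ("S1", "dext="): "AroundOutside",
    ("S2", "dint-"): "MovetoBoundary",
    ("S2", "dint+"): "MovetoInterior",
    ("S2", "dint="): "AroundInside",
}


def primitive_predicate(top, dist):
    """Return the predicate holding over an interval, or None if undefined."""
    return PRIMITIVES.get((top, dist))


name = primitive_predicate("S1", "dext-")
```

Inconsistent pairs, such as an outside relation combined with an inside distance, yield None.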
3.2 One 0-dimensional intersection point
When there is one intersection point between a trajectory and the boundary of a reference region, the movement
predicate is composed of one or two movement states that hold over a temporal interval and one state in which the trajectory meets
the boundary of the region at a certain time point, namely the start time ($t_s$), the end time ($t_e$) or the time point
at which the trajectory intersects the boundary ($t_I$). More formally:
• Arrive(A, B) denotes the case where the end point of the trajectory A meets the boundary of B from outside over a time interval
($T + t_e$), as shown in Figure 4a. The trajectory A approaches B (Approach) before it meets the boundary of B. More
formally:
$$\text{Arrive}(A, B) \equiv \text{Holds}(S_1, d_{ext-}, T) \wedge (S_3, d_0, t_e)$$
• Depart(A, B) denotes the case where the start point of the trajectory A meets the boundary of B from outside over a time interval
($t_s + T$), as shown in Figure 4b. The trajectory A starts on the boundary of B and then leaves it (Leave). More formally:
$$\text{Depart}(A, B) \equiv (S_4, d_0, t_s) \wedge \text{Holds}(S_1, d_{ext+}, T)$$
• Exit(A, B) denotes the case where the end point of the trajectory A meets the boundary of B from inside over a time interval ($T + t_e$),
as shown in Figure 4c. The trajectory A moves to the boundary of B (MovetoBoundary) and then ends on it. More
formally:
$$\text{Exit}(A, B) \equiv \text{Holds}(S_2, d_{int-}, T) \wedge (S_5, d_0, t_e)$$
• Enter(A, B) denotes the case where the start point of the trajectory A meets the boundary of B from inside over a time interval ($t_s + T$),
as shown in Figure 4d. The trajectory A starts on the boundary of B and then moves to its interior (MovetoInterior). More
formally:
$$\text{Enter}(A, B) \equiv (S_6, d_0, t_s) \wedge \text{Holds}(S_2, d_{int+}, T)$$
• CrossIn(A, B) denotes the case where the trajectory A starts outside B, crosses the boundary of
B and ends inside B over a time interval ($T_1 + t_I + T_2$), as shown in Figure 4e:
$$\text{CrossIn}(A, B) \equiv \text{Holds}(S_1, d_{ext-}, T_1) \wedge (S_7, d_0, t_I) \wedge \text{Holds}(S_2, d_{int+}, T_2)$$
• CrossOut(A, B) denotes the case where the trajectory A starts inside B, crosses the boundary
of B and ends outside B over a time interval ($T_1 + t_I + T_2$), as shown in Figure 4f:
$$\text{CrossOut}(A, B) \equiv \text{Holds}(S_2, d_{int-}, T_1) \wedge (S_8, d_0, t_I) \wedge \text{Holds}(S_1, d_{ext+}, T_2)$$
• TouchOutside(A, B) denotes the case where the start and end points of the trajectory A both lie in the exterior
part of B and the interior of A meets the boundary of B with either clockwise or anticlockwise orientation
over a time interval ($T_1 + t_I + T_2$), as shown in Figure 4g:
$$\text{TouchOutside}(A, B) \equiv \text{Holds}(S_1, d_{ext-}, T_1) \wedge ((S_9 \vee S_{10}), d_0, t_I) \wedge \text{Holds}(S_1, d_{ext+}, T_2)$$
• TouchInside(A, B) denotes the case where the start and end points of the trajectory A both lie in the interior
part of B and the interior of A meets the boundary of B with either clockwise or anticlockwise orientation
over a time interval ($T_1 + t_I + T_2$), as shown in Figure 4h:
$$\text{TouchInside}(A, B) \equiv \text{Holds}(S_2, d_{int-}, T_1) \wedge ((S_{11} \vee S_{12}), d_0, t_I) \wedge \text{Holds}(S_2, d_{int+}, T_2)$$
Figure 4: Movement configurations with one 0-dimensional intersection point
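The eight predicates with one 0-dimensional intersection point can be read as fixed sequences of (DL-RE relation, qualitative distance) states. The sketch below encodes the definitions above as such sequences and looks up the predicate for an observed sequence; the time arguments are omitted, and the string symbols and function name are hypothetical.

```python
# Ordered state sequences -> movement predicate (time arguments omitted).
COMPOSITES = {
    (("S1", "dext-"), ("S3", "d0")): "Arrive",
    (("S4", "d0"), ("S1", "dext+")): "Depart",
    (("S2", "dint-"), ("S5", "d0")): "Exit",
    (("S6", "d0"), ("S2", "dint+")): "Enter",
    (("S1", "dext-"), ("S7", "d0"), ("S2", "dint+")): "CrossIn",
    (("S2", "dint-"), ("S8", "d0"), ("S1", "dext+")): "CrossOut",
}
# Touch predicates accept either orientation of the boundary contact.
for s in ("S9", "S10"):
    COMPOSITES[(("S1", "dext-"), (s, "d0"), ("S1", "dext+"))] = "TouchOutside"
for s in ("S11", "S12"):
    COMPOSITES[(("S2", "dint-"), (s, "d0"), ("S2", "dint+"))] = "TouchInside"


def classify(states):
    """Return the predicate matching an ordered list of states, or None."""
    return COMPOSITES.get(tuple(states))


result = classify([("S1", "dext-"), ("S7", "d0"), ("S2", "dint+")])
```

Longer trajectories would need to be segmented into such sequences before lookup, which this sketch does not attempt.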
3.3 One 1-dimensional intersection line
When part of the trajectory of a moving point A is on the boundary of a reference entity B during a time interval
T, the different configurations of the start and end points of the intersection, which is one 1-dimensional line,
should be considered.
• Along(A, B) denotes the case where the start point and end point of the trajectory both lie on the boundary
of B over a time interval T. More formally, for all $t \in T$, $S_{13}$ or $S_{14}$ holds according to the clockwise or
anticlockwise orientation of the trajectory and the relative distance is null:
$$\text{Along}(A, B) \equiv \text{Holds}((S_{13} \vee S_{14}), d_0, T)$$
• Arrive-Along(A, B) denotes the case where the trajectory A starts in the exterior part of B and
ends on its boundary with either clockwise or anticlockwise orientation over a time interval ($T_1 + t_I + T_2$):
$$\text{Arrive-Along}(A, B) \equiv \text{Holds}(S_1, d_{ext-}, T_1) \wedge \big(((S_{15}(I_1), t_I) \wedge \text{Holds}(S_{14}, d_0, T_2)) \vee ((S_{16}(I_1), t_I) \wedge \text{Holds}(S_{13}, d_0, T_2))\big)$$
• Exit-Along(A, B) denotes the case where the trajectory A starts in the interior part of B and ends
on its boundary with either clockwise or anticlockwise orientation over a time interval ($T_1 + t_I + T_2$):
$$\text{Exit-Along}(A, B) \equiv \text{Holds}(S_2, d_{int-}, T_1) \wedge \big(((S_{17}(I_1), t_I) \wedge \text{Holds}(S_{13}, d_0, T_2)) \vee ((S_{18}(I_1), t_I) \wedge \text{Holds}(S_{14}, d_0, T_2))\big)$$
• Along-Depart(A, B) denotes the case where the trajectory A starts on the boundary of B and ends
in its exterior part with either clockwise or anticlockwise orientation over a time interval ($T_1 + t_I + T_2$):
$$\text{Along-Depart}(A, B) \equiv \big((\text{Holds}(S_{14}, d_0, T_1) \wedge (S_{19}(I_2), t_I)) \vee (\text{Holds}(S_{13}, d_0, T_1) \wedge (S_{20}(I_2), t_I))\big) \wedge \text{Holds}(S_1, d_{ext+}, T_2)$$
• Along-Enter(A, B) denotes the case where the trajectory A starts on the boundary of B and ends
in its interior part with either clockwise or anticlockwise orientation over a time interval ($T_1 + t_I + T_2$):
$$\text{Along-Enter}(A, B) \equiv \big((\text{Holds}(S_{13}, d_0, T_1) \wedge (S_{21}(I_2), t_I)) \vee (\text{Holds}(S_{14}, d_0, T_1) \wedge (S_{22}(I_2), t_I))\big) \wedge \text{Holds}(S_2, d_{int+}, T_2)$$
As the number of possible DL-RE configurations in a 2-dimensional Euclidean space is infinite,
the set of movement semantics that emerge from those configurations can be relatively large. A composition of
the primitive movement predicates defined above can be applied to reconstruct the trajectory as a sequence of
highly-correlated episodes.
4 Conclusion
With rapid and continuous progress on the availability of mobility data and the large range of potential applications, there is still a call for the development of spatio-temporal data models that will encompass at the abstract
level the semantics that emerge from the underlying phenomena. The modeling approaches developed in this
paper introduce an integrated formal framework that represents the trajectory of a moving entity with respect to
a region of reference. Several spatial qualitative parameters are taken into account such as the boundary-based
relationship between the two entities considered as well as the evolution of the relative distance between them.
The framework developed also favours the identification of a series of movement predicates and natural language expressions that qualify the movements of a given entity with respect to a reference entity. The approach
is preliminary and can still be extended by the integration of additional spatial properties such as velocity or
more specific semantic information.
Acknowledgments
The research was funded by the Fundamental Research Funds for the Central Universities of Central South
University and Open Research Fund Program of Key Laboratory of Digital Mapping and Land Information
Application Engineering (GCWD201206), State Bureau of Surveying and Mapping.
References
[1] V. Bogorny, C. Renso, A. R. de Aquino, F. de Lucca Siqueira, and L. O. Alvares. Constant: A conceptual
data model for semantic trajectories of moving objects. Transactions in GIS, 18(1):66–88, January 2014.
[2] M. J. Egenhofer and R. D. Franzosa. On the equivalence of topological relations. International Journal of
Geographical Information Systems, 9(2):133–152, February 1995.
[3] F. Giannotti and D. Pedreschi. Mobility, Data Mining and Privacy - Geographic Knowledge Discovery.
Springer-Verlag, Berlin Heidelberg, 2008.
[4] T. Hägerstrand. Geography and the study of interaction between nature and society. Geoforum, 7(5-6):329–
334, May 1976.
[5] Y. Kurata and M. J. Egenhofer. The 9+ intersection for topological relations between a directed line
segment and a region. In Proceedings of the 1st Workshop on Behavioral Monitoring and Interpretation,
pages 62–76. IOS, September 2007.
[6] D. Mountain and J. Raper. Modelling human spatio-temporal behaviour: a challenge for location-based
services. In Proceedings of the 6th International Conference on GeoComputation, pages 24–26. GeoComputation, September 2001.
[7] A. T. Palma, V. Bogorny, B. Kuijpers, and L. O. Alvares. A clustering-based approach for discovering
interesting places in trajectories. In Proceedings of the 2008 ACM Symposium on Applied Computing,
pages 863–868. ACM, March 2008.
[8] D. A. Randell, Z. Cui, and A. G. Cohn. A spatial logic based on regions and connection. In Proceedings of
the 3rd International Conference on Knowledge Representation and Reasoning, pages 165–176, October
1992.
[9] C. Renso, S. Spaccapietra, and E. Zimányi. Mobility Data: Modeling, Management, and Understanding.
Cambridge University Press, Cambridge, 2013.
[10] J. A. M. R. Rocha, V. C. Times, G. Oliveira, L. O. Alvares, and V. Bogorny. Db-smot: A direction-based
spatio-temporal clustering method. In Proceedings of the 5th IEEE International Conference on Intelligent
Systems, pages 114–119. IEEE, July 2010.
[11] J. Wu, C. Claramunt, and M. Deng. Modelling movement patterns using topological relations between
a directed line and a region. In Proceedings of the 5th ACM SIGSPATIAL International Workshop on
GeoStreaming, pages 43–52. ACM, November 2014.
[12] J. Wu, C. Claramunt, and M. Deng. Towards a qualitative representation of movement. In Advances in
Conceptual Modeling, pages 191–200. ER 2014 Workshops, October 2014.
[13] K. Xie, K. Deng, and X. Zhou. From trajectories to activities: A spatio-temporal join approach. In
Proceedings of the 2009 International Workshop on Location Based Social Networks, pages 25–32. ACM,
November 2009.
[14] Z. Yan, C. Parent, S. Spaccapietra, and D. Chakraborty. A hybrid model and computing platform for
spatio-semantic trajectories. In The Semantic Web: Research and Applications, pages 60–75. 7th Extended
Semantic Web Conference, ESWC 2010, May 2010.
[15] Y. Zheng, L. Zhang, Z. Ma, X. Xie, and W.-Y. Ma. Recommending friends and locations based on individual location history. ACM Transaction on the Web, 5(1):1–44, February 2011.
[16] Y. Zheng and X. Zhou. Computing with Spatial Trajectories. Springer-Verlag, New York, 2011.
Trajectory Similarity Measures
Kevin Toohey, Matt Duckham
University of Melbourne, Australia
Abstract
Storing, querying, and analyzing trajectories is becoming increasingly important as the availability and volume of trajectory data increase. One important class of trajectory analysis is computing
trajectory similarity. This paper introduces and compares four of the most common measures of trajectory similarity: longest common subsequence (LCSS), Fréchet distance, dynamic time warping (DTW),
and edit distance. These four measures have been implemented in a new open source R package, freely
available on CRAN [19]. The paper highlights some of the differences between these four similarity measures, using real trajectory data, and indicates some of the important emerging applications
for measurement of trajectory similarity.
1 Introduction
As the technology for tracking moving objects becomes cheaper and more accurate, the amount and availability
of stored movement data is continuing to increase rapidly. Most movement data is captured and stored in the
form of trajectories, defined as “a sequence of time-stamped locations” [11]. However, the analysis of these
increasing volumes of trajectory data can be challenging, due in large part to the way the same continuous
movement can have innumerable different discretized trajectory representations.
One important class of trajectory analysis is the measurement of similarity between trajectories. Several
measures exist for calculating the similarity between two trajectories, each with its own strengths and weaknesses,
and several surveys of trajectory similarity measures have been performed [7, 11, 18]. This paper first outlines
some of the useful applications of trajectory similarity measures, and then discusses four of the most commonly
used similarity measures in detail: longest common subsequence (LCSS), Fréchet distance, dynamic time
warping (DTW), and edit distance. These four measures have been implemented within a new R package called
“SimilarityMeasures,” available on CRAN [19]. The four similarity measures are compared empirically using a
sample movement dataset, highlighting where differences in computed similarity value are expected to occur.
2 Applications of trajectory similarity measures
The most common use of trajectory similarity measures is for database indexing. For example, Vlachos et
al. apply the longest common subsequence similarity measure to index a set of marine animal trajectories [20].
The study shows considerable speed increases for nearest neighbor computations when using this index over
brute force linear scans. Other examples of indexing trajectories using similarity measures can be seen in
[6, 8, 14].
Movement patterns in vehicle and pedestrian traffic have also been analyzed using trajectory similarity
measures. Information about the similarities in movement patterns can enable traffic managers to adjust timings
on a road network, to find where problems are occurring, or to increase safety and security. Suspicious behavior,
for example, can be detected through its dissimilarity from predefined “normal” behavior [10].
Using Fréchet and discrete Fréchet similarity measures, Buchin et al. [3] explored the detection of commuting patterns in trajectories. Li et al. [16] used longest common subsequence similarity measures to compare
calculated paths with actual paths in an analysis on crowded scene movements. Use of tracking data in sports is
also becoming more common. Analyzing tracking data from sports can allow players to increase their efficiency
and effectiveness. Haase and Brefeld [12] used a dynamic time warping similarity measure to explore similar
movements in a soccer game, while Perše et al. [17] analyzed and segmented basketball games using an edit
distance similarity measure.
Similarity measures can be used on animal trajectories in behavioral science to explore information about
popular tracks, movements, and social interactions [10], such as to compare movements of albatross [4] or of
cattle [15].
3 Trajectory similarity computation
A trajectory $T_A$ contains a series of $m$ timestamped $n$-dimensional points $a_i = (a_{i,1}, \ldots, a_{i,n})$:
$$T_A = ((t_1, a_1), \ldots, (t_m, a_m))$$
where $t_i$ are discrete timestamps and $t_i < t_{i+1}$. The length of a trajectory is defined here as the number of
discrete timestamps (“fixes”). Spatial trajectory points are commonly recorded in 2 or 3 dimensions, although
higher dimensionality trajectories are of course possible.
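The definition above can be made concrete with a small data structure. The following is a minimal Python sketch, not part of the R package; the names `Fix`, `make_trajectory`, and `trajectory_length` are hypothetical and chosen only for illustration. It stores a trajectory as a list of (timestamp, point) pairs and checks the two conditions stated in the definition: strictly increasing timestamps and a common dimension.

```python
from typing import List, Sequence, Tuple

# A fix is (timestamp, point); a trajectory is an ordered list of fixes.
Fix = Tuple[float, Tuple[float, ...]]


def make_trajectory(times: Sequence[float],
                    points: Sequence[Sequence[float]]) -> List[Fix]:
    """Build a trajectory, checking t_i < t_{i+1} and a common dimension n."""
    if len(times) != len(points):
        raise ValueError("need exactly one timestamp per point")
    if any(t0 >= t1 for t0, t1 in zip(times, times[1:])):
        raise ValueError("timestamps must be strictly increasing")
    if len({len(p) for p in points}) > 1:
        raise ValueError("all points must have the same dimension")
    return [(float(t), tuple(float(x) for x in p))
            for t, p in zip(times, points)]


def trajectory_length(traj: List[Fix]) -> int:
    """Length as defined above: the number of fixes, not distance travelled."""
    return len(traj)
```

Note that "length" here counts fixes, so two recordings of the same physical path can have very different lengths if their sampling rates differ.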
The key challenge in identifying a satisfactory trajectory similarity measure is the arbitrary nature of the
discretization. For example, a naïve similarity measure is Euclidean distance, calculated as the sum of the
distances between ordered pairs of points in two trajectories. However, such a simple measure struggles with
different sampling rates and outliers, and requires trajectories of different lengths to be cut to equal size [20]. Thus,
many more sophisticated similarity measures have been proposed and implemented to overcome the challenges
resulting from discretization.
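To make the limitation concrete, the naïve measure can be sketched in a few lines of Python. This is an illustrative sketch only (the function name `euclidean_sum` is hypothetical); it pairs the i-th point of one trajectory with the i-th point of the other, which is why it cannot handle trajectories of different lengths without truncation.

```python
import math
from typing import Sequence

Point = Sequence[float]


def euclidean_sum(traj_a: Sequence[Point], traj_b: Sequence[Point]) -> float:
    """Sum of Euclidean distances between the i-th points of two trajectories."""
    if len(traj_a) != len(traj_b):
        # The naive measure is undefined here; callers would have to truncate.
        raise ValueError("trajectories must have the same number of points")
    return sum(math.dist(a, b) for a, b in zip(traj_a, traj_b))
```

For instance, `[(0, 0), (1, 0), (2, 0)]` and `[(0, 0), (0.5, 0), (2, 0)]` lie on the same straight path but are sampled differently, and the measure still reports a non-zero value for them.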
Four of the most commonly encountered advanced similarity measures are explored in the next sections.
Three of the discussed measures were included in a previous effectiveness study of six similarity measures
tested on a taxi dataset [21], while two were also contained in a comparison of another six similarity measures
in [23]. A fully documented R package, called “SimilarityMeasures,” is freely available online [19], and has
been written to enable easier access to these analyses. The functions in this package are able to compute each
of the following similarity measures on n-dimensional trajectories. Please see the package documentation for
further details, including how to use the various functions. More information on R, and R packages, can be
found at the R Project web page¹.
3.1 Fréchet metric
The Fréchet metric (or Fréchet distance) is amongst the most popular of similarity measures [11]. The metric
was first defined by Fréchet [9] and can be applied to both continuous directed curves as well as the discretized
trajectories considered here. The Fréchet metric is generally described in the following way: a person is walking
a dog on a leash. The person walks on one curve while the dog walks on the other [1]. The dog and the person are
able to vary their speeds, or even stop, but not go backwards. The Fréchet metric is the minimum leash length
required to complete the traversal of both curves. As with most similarity measures, the choice of distance
function can be adapted to suit the specific application and trajectories. Euclidean distance is used in this paper.
¹ http://www.r-project.org
Trajectory points are not matched together using the Fréchet metric. This allows the metric to perform well
with even the most widely varying sampling rates and trajectory lengths. Unfortunately, the Fréchet metric can
be greatly affected by outliers if they are not removed before performing the calculation. This is caused by the
fact that every point of the two trajectories is used in the calculation.
The Fréchet distance computation contained in our “SimilarityMeasures” R package is implemented using an
algorithm discussed by Alt and Godau [1]. This algorithm allows computation of the Fréchet distance between
two trajectories of length $m$ and $k$, with a worst case complexity of $O((m^2 k + k^2 m) \log mk)$. Alt and Godau use
the idea of free space diagrams to allow the efficient calculation of this similarity measure. For more information
on the calculations used in this algorithm see Alt and Godau [1].
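The leash intuition can be illustrated with a simpler, related quantity. The Python sketch below computes the discrete Fréchet distance, in which the person and the dog may only stand on recorded points; it is not the Alt and Godau free space algorithm used in the R package, and the function name `discrete_frechet` is hypothetical. The discrete value is an upper bound on the continuous Fréchet distance and can be found by dynamic programming in O(mk) time.

```python
import math
from typing import Sequence

Point = Sequence[float]


def discrete_frechet(traj_a: Sequence[Point], traj_b: Sequence[Point]) -> float:
    """Discrete Fréchet distance between two point sequences."""
    m, k = len(traj_a), len(traj_b)
    if m == 0 or k == 0:
        raise ValueError("both trajectories need at least one point")
    # ca[i][j]: shortest leash that lets both walkers reach a_i and b_j
    # without either of them ever stepping backwards.
    ca = [[0.0] * k for _ in range(m)]
    for i in range(m):
        for j in range(k):
            d = math.dist(traj_a[i], traj_b[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[m - 1][k - 1]
```

Because the result is a maximum over the traversal rather than a sum, a single outlying point determines the value, which matches the sensitivity to outliers described above.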
3.2 Dynamic time warping (DTW)
Unlike Fréchet distance, dynamic time warping (DTW) is a similarity measure that relies on matching points in
trajectories. Using DTW, the trajectories are “warped” in a non-linear way to measure similarity while allowing
for varying sampling rates [22]. The calculation is again performed using a chosen distance function (Euclidean
distance in our examples).
For two trajectories $T_A$ and $T_B$, with lengths $m$ and $k$, an $m \times k$ grid can be created where each grid point
$(i, j)$ represents the distance between points $a_i$ and $b_j$ [2]. A warping path $W$ is created by starting at grid point
$(1, 1)$, and incrementing either $i$ or $j$ or both by 1 each step until reaching point $(m, k)$. For example, a path
beginning at grid point $(1, 1)$ could move to one of grid points $(1, 2)$, $(2, 1)$ or $(2, 2)$.
Definition 1: If $w_l$ represents a grid point $(i, j)_l$, then a warping path $W$ can be represented as the following
sequence of grid points:
$$W = w_1, \ldots, w_p$$
Exponentially many paths satisfy the conditions above [14]. A warping cost is calculated from a warping
path in various ways. A common warping cost for calculating DTW is the total of all of the distances calculated
along the warping path. Finally, the DTW similarity value is the minimum of all possible warping costs [2].
Using DTW, a single point on one trajectory can be matched to multiple points on the other. This allows
DTW to perform well with trajectories of different lengths and even widely varying sampling rates. However,
outliers can again greatly affect this method because every point of both trajectories must have at least one
match. The choice of a distance function is also clearly important to DTW, and the warping cost calculation can
be changed to suit different needs (cf. Keogh and Ratanamahatana [14] and Keogh and Pazzani [13] for more
ways to compute warping cost).
Our R package DTW calculation was implemented using the warping cost algorithm discussed in Berndt
and Clifford [2]. This implementation allows the DTW calculation to be performed between two trajectories
of length m and k, with complexity of O(mk). This DTW algorithm calculates and returns the total of the
distances between each pair of points on the optimal warping path using Euclidean distance.
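The grid and warping path described in this section translate directly into a dynamic program. The following Python sketch is an illustration under the same choices stated above (Euclidean point distance, warping cost equal to the total distance along the path); it is not the code of the R package, and the function name `dtw` is hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def dtw(traj_a: Sequence[Point], traj_b: Sequence[Point]) -> float:
    """Minimum total point distance over all warping paths, in O(mk) time."""
    m, k = len(traj_a), len(traj_b)
    if m == 0 or k == 0:
        raise ValueError("both trajectories need at least one point")
    # cost[i][j]: cheapest warping path from grid point (1, 1) to (i, j).
    # Row 0 and column 0 are padding so that (1, 1) has a zero-cost predecessor.
    cost = [[math.inf] * (k + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            d = math.dist(traj_a[i - 1], traj_b[j - 1])
            # A path reaches (i, j) by incrementing i, j, or both by 1.
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[m][k]
```

Every point must appear on the path at least once, so an outlier always contributes its full distance to the total.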
3.3 Longest common subsequence (LCSS)
Longest common subsequence (LCSS) is a similarity measure where trajectories can be stretched, while some
points are able to remain unmatched in an attempt to provide an accurate similarity analysis [20]. The LCSS
value represents a count of the maximum number of points which can be considered equivalent, while the
trajectories are traversed monotonically from start to end.
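Definition 2: Using trajectories $T_A$ and $T_B$, with lengths $m$ and $k$, an integer $\delta \geq 0$ and a matching threshold
$\varepsilon \geq 0$, the $LCSS_{\delta,\varepsilon}$ definition from Vlachos et al. [20] is adapted to the following:
$$
LCSS_{\delta,\varepsilon}(T_A, T_B) =
\begin{cases}
0, & \text{if } T_A \text{ or } T_B \text{ is empty} \\
1 + LCSS_{\delta,\varepsilon}(Head(T_A), Head(T_B)), & \text{if } |m - k| \leq \delta \text{ and } |a_{m,1} - b_{k,1}| \leq \varepsilon \\
 & \text{and } \ldots \text{ and } |a_{m,n} - b_{k,n}| \leq \varepsilon \\
\max(LCSS_{\delta,\varepsilon}(Head(T_A), T_B),\ LCSS_{\delta,\varepsilon}(T_A, Head(T_B))), & \text{otherwise}
\end{cases}
$$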
In this definition, the constant $\delta$ provides a maximum index difference when comparing points from the
two trajectories. The constant $\varepsilon$ defines the maximum distance in each dimension allowed for two points to
be considered equivalent. Finally, $Head(T_A)$ represents $T_A$ with the last point removed. With careful use of
the two constants, $\delta$ and $\varepsilon$, this method is highly robust to outliers, while performing well with trajectories of
different lengths. The LCSS measure also generally functions well with different sampling rates. However,
widely varying sampling rates may cause issues when many points must be left unmatched in the calculation.
The length of the shorter trajectory can be used to normalize this method as an LCSS ratio, allowing for
comparisons on the same scale. The LCSS computation implemented in the R package uses the algorithm
discussed by Vlachos et al. [20]. Using dynamic programming, the LCSS value for two trajectories with lengths
$m$ and $k$ can be found with a complexity of $O((m + k)\delta)$.
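Definition 2 can be evaluated bottom-up over prefixes of the two trajectories. The Python sketch below is a plain O(mk) table fill, not the banded O((m + k)δ) algorithm of Vlachos et al. used in the R package; the names `lcss` and `lcss_ratio` are hypothetical, and passing `delta=None` to mean an unlimited index difference is an assumption made for this illustration.

```python
from typing import Optional, Sequence

Point = Sequence[float]


def lcss(traj_a: Sequence[Point], traj_b: Sequence[Point],
         epsilon: float, delta: Optional[int] = None) -> int:
    """Count of matched points following the recursive LCSS definition."""
    m, k = len(traj_a), len(traj_b)
    # table[i][j]: LCSS of the first i points of A and the first j points of B.
    table = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            close_in_index = delta is None or abs(i - j) <= delta
            # Points match only if they are within epsilon in every dimension.
            close_in_space = all(abs(x - y) <= epsilon
                                 for x, y in zip(traj_a[i - 1], traj_b[j - 1]))
            if close_in_index and close_in_space:
                table[i][j] = 1 + table[i - 1][j - 1]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][k]


def lcss_ratio(traj_a: Sequence[Point], traj_b: Sequence[Point],
               epsilon: float, delta: Optional[int] = None) -> float:
    """LCSS normalized by the length of the shorter trajectory."""
    shorter = min(len(traj_a), len(traj_b))
    if shorter == 0:
        raise ValueError("both trajectories need at least one point")
    return lcss(traj_a, traj_b, epsilon, delta) / shorter
```

An outlier simply stays unmatched and lowers the count by at most one, which is the source of the robustness noted above.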
3.4 Edit distance
The fundamental idea of edit distance is to count the minimum number of edits required to make two trajectories
equivalent. Several variations of edit distance exist including edit distance with real penalty (ERP) [5] and edit
distance on real sequence (EDR) [6]. The discussion below concerns edit distance on real sequence (EDR) as
described by Chen et al. [6].
Definition 3: Using a matching threshold $\varepsilon \geq 0$, and trajectories $T_A$ and $T_B$ with lengths $m$ and $k$, the $EDR_\varepsilon$
value (edit distance) defined in Chen et al. [6] is adapted to the following:
$$
EDR_\varepsilon(T_A, T_B) =
\begin{cases}
k, & \text{if } m = 0 \\
m, & \text{if } k = 0 \\
\min(EDR_\varepsilon(Rest(T_A), Rest(T_B)) + subcost, & \\
\quad\ \ EDR_\varepsilon(Rest(T_A), T_B) + 1, & \text{otherwise} \\
\quad\ \ EDR_\varepsilon(T_A, Rest(T_B)) + 1), &
\end{cases}
$$
In this definition, $subcost = 0$ if the first point of $T_A$ lies within the matching threshold of the first point
of $T_B$ in every dimension, and $subcost = 1$ otherwise. Finally, $Rest(T_A)$ represents trajectory $T_A$ with its first
point removed (the trajectory now starts from the second point if one exists, otherwise it now has length 0).
EDR is relatively unaffected by outliers because the matching threshold reduces the increments to values of
0 and 1 only [6]. Therefore, even though outliers must still be processed using this method, each outlier can
potentially only increase the EDR value by 1, and not some arbitrarily large value as in DTW or Fréchet. EDR
also performs well with trajectories that have varying sampling rates. The method does not require trajectories of
equal length. However, different length trajectories will automatically inflate the edit distance. This is because
every extra point is required to be edited out (or in) for the trajectories to be considered equivalent. This fact
needs to be considered when choosing this method in practical applications.
Edit distance on real sequence (EDR), as discussed in Chen et al. [6], was implemented as the edit distance
function in our R package. Dynamic programming was used to obtain an efficient calculation of EDR. With two
trajectories $T_A$ and $T_B$, of length $m$ and $k$, this implementation has a complexity of $O(mk)$.
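Definition 3 also maps onto a standard dynamic programming table. The Python sketch below is an illustration rather than the package's code, and the function name `edr` is hypothetical. It fills the table over prefixes instead of peeling points off the front with Rest, which yields the same value because an edit script can be read in either direction.

```python
from typing import Sequence

Point = Sequence[float]


def edr(traj_a: Sequence[Point], traj_b: Sequence[Point],
        epsilon: float) -> int:
    """Edit distance on real sequence (EDR) in O(mk) time."""
    m, k = len(traj_a), len(traj_b)
    # table[i][j]: EDR between the first i points of A and the first j of B.
    table = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        table[i][0] = i  # delete all i points of A
    for j in range(k + 1):
        table[0][j] = j  # insert all j points of B
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            matched = all(abs(x - y) <= epsilon
                          for x, y in zip(traj_a[i - 1], traj_b[j - 1]))
            subcost = 0 if matched else 1
            table[i][j] = min(table[i - 1][j - 1] + subcost,  # match/replace
                              table[i - 1][j] + 1,            # drop from A
                              table[i][j - 1] + 1)            # drop from B
    return table[m][k]
```

An outlier costs exactly one edit regardless of how far away it lies, and every surplus point in the longer trajectory also costs one edit, which is the inflation effect described above.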
4 Comparison and evaluation
A sample dataset, containing trajectory data of delivery drivers in the UK, was used to help evaluate the similarity
measures discussed in this paper. Out of a total of 23,400 segmented trajectories in our dataset, a small sample
of 50 randomly chosen pairs of trajectories was used for the analysis. All calculations were performed using the
R package discussed earlier.
4.1 Normalization
The trajectories in the dataset vary significantly in terms of location and scale. This limits the number of useful
comparisons and analyses that can be made between raw, untransformed trajectories. Therefore, each pair of trajectories was
normalized to approximately match, to enable meaningful comparisons across the four similarity measures. The
trajectories were rotated, scaled, and translated to align their start and end points.
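The text does not spell out how the alignment was computed, so the following Python sketch shows only one possible way to do it, assuming 2-dimensional points. The function name `align_endpoints` is hypothetical. It treats each point as a complex number, so that a single complex multiplication performs the rotation and the scaling, and an addition performs the translation.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]


def align_endpoints(moving: Sequence[Point2D],
                    fixed: Sequence[Point2D]) -> List[Point2D]:
    """Rotate, scale, and translate `moving` so that its first and last
    points coincide with the first and last points of `fixed`."""
    a0, a1 = complex(*fixed[0]), complex(*fixed[-1])
    b0, b1 = complex(*moving[0]), complex(*moving[-1])
    if b1 == b0:
        raise ValueError("start and end of the moving trajectory coincide")
    # Multiplying by one complex factor rotates and scales in a single step.
    factor = (a1 - a0) / (b1 - b0)
    aligned = []
    for x, y in moving:
        z = a0 + (complex(x, y) - b0) * factor
        aligned.append((z.real, z.imag))
    return aligned
```

After such a transform the two start points and the two end points are at zero distance from each other, which is the source of the bias discussed in the next paragraph.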
This normalization, however, does lead to some bias in the results, which must be taken into account. As a
result of the normalization, the LCSS calculation is guaranteed to contain a minimum of two points (start and
end) if the index spacing distance allows it, while the edit distance calculation is guaranteed to have two points
which don’t require edits. This problem can also be seen in DTW where the start and end points will both add
zero distance to the final value. Although this fact changes all of the absolute values in a constant way, the ratio
values are altered in a non-linear manner, and this was taken into consideration for the analysis.
4.2 The similarity measures
Each of the four similarity measures was computed for the 50 pairs of trajectories. The allowed point index
spacing (for LCSS and DTW) was set to unlimited to allow for the large variance in trajectory length. The
distance for points to be considered equivalent (for LCSS and edit distance) was set to 100 m. This value was set
using the knowledge that most trajectories range from hundreds of meters to several kilometers, and allows for
a wide range of values to be obtained in the analysis.
The LCSS, DTW, and edit distance values were converted to ratios. This was done to ensure that the large
variances in trajectory length did not dominate the results. The Fréchet distance was left unchanged because it
is mainly unaffected by the length of the trajectories. The end points were left out of the ratio calculations to
account for the normalization performed earlier. The DTW and edit distances used the larger trajectory length
(minus 2 for the end points) to calculate their ratios (e.g. $DTW(T_A, T_B) / (\max(|T_A|, |T_B|) - 2)$). The LCSS
ratio used the minimum trajectory length (minus 2 for the end points) as discussed earlier.
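The ratio calculations can be written as two small helper functions. This Python sketch follows the denominators given in the paragraph above; the function names are hypothetical. The text does not state whether the two guaranteed end point matches are also subtracted from the LCSS count itself, so the sketch takes the measure value as an input and leaves that choice to the caller.

```python
def dissimilarity_ratio(value: float, len_a: int, len_b: int) -> float:
    """DTW or edit distance divided by the longer length minus 2 end points."""
    denominator = max(len_a, len_b) - 2
    if denominator <= 0:
        raise ValueError("the longer trajectory needs more than 2 points")
    return value / denominator


def lcss_length_ratio(value: float, len_a: int, len_b: int) -> float:
    """LCSS value divided by the shorter length minus 2 end points."""
    denominator = min(len_a, len_b) - 2
    if denominator <= 0:
        raise ValueError("the shorter trajectory needs more than 2 points")
    return value / denominator
```

For example, a DTW value of 12.0 between trajectories of 5 and 8 points gives a ratio of 12.0 / (8 - 2) = 2.0.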
Table 1 shows the correlations between the different similarity values computed across the four similarity
measures². There is a strong (positive) correlation between Fréchet distance and DTW ratio similarity values,
while a strong (negative) correlation can be seen between LCSS ratio and edit distance ratio. Correlations in all
other pairings of similarity measures are much weaker (Table 1).
Table 1: The correlation coefficients between each of the similarity measures.
                      LCSS Ratio   Fréchet Distance   DTW Ratio
Fréchet Distance      -0.3707      –
DTW Ratio             -0.4391      0.9587             –
Edit Distance Ratio   -0.8340      0.3202             0.3866
Figure 1 shows scatterplots comparing the two pairs of most highly correlated similarity measures. Although
some correlation was expected between the Fréchet distance and DTW ratio (both use absolute distances in their
calculation), the correlation is remarkably strong (correlation coefficient of 0.9587), particularly
given that the underlying calculations are rather different (cf. Sections 3.1 and 3.2).
² Strictly speaking, Fréchet, DTW, and edit distance are dissimilarity measures (higher values equal greater dissimilarity) while LCSS
is a similarity measure (higher values equal greater similarity).
Figure 1: Scatter plots comparing the most strongly correlated similarity measures, DTW ratio against Fréchet
distance (left) and LCSS ratio against edit distance ratio (right).
LCSS and edit distance also exhibit strong (negative) correlations to one another. As discussed above, the
LCSS and edit distances were converted to ratios, and so their values range from 0 to 1 (although 1 is most
highly dissimilar in the case of edit distance, and most highly similar in the case of LCSS). Thus, the strong
negative correlation coefficient (−0.8340) between the two is in line with the expectation that these methods
would yield similar results. The slightly lower correlation seen between the LCSS ratio and edit distance, when
compared to Fréchet distance and DTW ratio, is likely caused by the large variations in trajectory length.
Despite the agreement, occasional differences in DTW ratio and Fréchet distance are visible in Figure 1.
Large discrepancies are always possible when comparing DTW and Fréchet distances, because they present no
bounds on how large the (dis)similarity can be. This is unlike LCSS and edit distance ratios, which have a
maximum ratio of 1. The fact that Fréchet distance and DTW values can grow so large makes them sensitive to
outliers as discussed earlier. However, this feature can also help to emphasize extreme cases of trajectory
difference. At the other extreme, comparing two equal trajectories using any of the above similarity measures
will always yield a value of 0 (1 for LCSS ratio).
These two pairs of measures, Fréchet and DTW, and LCSS and edit distance, appear well correlated in
practice for this dataset. However, it is possible to find instances of trajectories that have dramatically different
similarity values across these pairs. The left image in Figure 2, for example, shows two trajectories that might be
considered similar, although one has a large displacement “peak” near its center. When considering the number
of distances combined, this peak will not influence the DTW value greatly. However, the Fréchet distance will
reflect considerable dissimilarity in this pair of trajectories, due to the outlying peak.
The right image in Figure 2 presents another pair of trajectories to be considered. Again, these trajectories are
relatively similar, although one contains more points clustered in the middle, while the other only contains two
points. With a reasonable matching threshold, LCSS will consider these trajectories perfectly equivalent because
all of one trajectory’s points are matched. However, edit distance highlights that the five middle points have
no match and therefore the trajectories are quite dissimilar. The DTW value will also be relatively large because
it requires each point to be matched to another, and therefore the trajectories are considered more dissimilar.
Finally, the Fréchet distance value will be very low, representing a view of very similar trajectories because it
does not require point matching.
Figure 2: Two pairs of example trajectories. All four trajectories begin in the lower left corner.
5 Concluding remarks
Similarity measures are a useful, and often underused, tool for the analysis of trajectories across a wide range
of applications. This paper highlights some of the commonalities and differences amongst four of the most
frequently encountered similarity measures, both in theory and in practice. While these measures are today
most often used for indexing of spatial databases, they have important applications to the interpretation of
pedestrian, vehicle, sport, and animal movement. By implementing an R package capable of easily computing
these measures in a single, integrated environment, our aim has been to give users easier access to
these measures for use in practical applications.
Acknowledgments
The authors would like to acknowledge collaborators at the 2014 Lorentz Workshop for “Geometric Algorithms
in the Field” for providing the initial idea of comparing these four similarity metrics: Kevin Buchin, Patrick
Laube, Dongliang Peng, Ross Purves, Stef Sijben, Rodrigo Silveira.
References
[1] H. Alt and M. Godau. Computing the Fréchet distance between two polygonal curves. International
Journal of Computational Geometry and Applications, 5(01n02):75–91, 1995.
[2] D. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In KDD workshop,
volume 10, pages 359–370. Seattle, WA, 1994.
[3] K. Buchin, M. Buchin, J. Gudmundsson, M. Löffler, and J. Luo. Detecting commuting patterns by clustering subtrajectories. In S.-H. Hong, H. Nagamochi, and T. Fukunaga, editors, Algorithms and Computation,
volume 5369 of Lecture Notes in Computer Science, pages 644–655. Springer, 2008.
[4] M. Buchin, S. Dodge, and B. Speckmann. Similarity of trajectories taking into account geographic context.
Journal of Spatial Information Science, 9(1):101–124, 2014.
[5] L. Chen and R. Ng. On the marriage of Lp-norms and edit distance. In Proc. 30th International Conference
on Very Large Data Bases, pages 792–803, 2004.
[6] L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In
Proc. ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, pages 491–502,
New York, NY, USA, 2005. ACM.
[7] S. Dodge. Exploring Movement Using Similarity Analysis. PhD thesis, Universität Zürich, 2011.
[8] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases.
SIGMOD Rec., 23(2):419–429, May 1994.
[9] M. M. Fréchet. Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo,
22(1):1–72, 1906.
[10] J. Gudmundsson, P. Laube, and T. Wolle. Movement patterns in spatiotemporal data. In Encyclopedia of
GIS, pages 726–732. Springer US, 2008.
[11] J. Gudmundsson, P. Laube, and T. Wolle. Computational movement analysis. In Springer handbook of
geographic information, pages 423–438. Springer, 2012.
[12] J. Haase and U. Brefeld. Finding similar movements in positional data streams. In Proc. ECML/PKDD
Workshop on Machine Learning and Data Mining for Sports Analytics, 2013.
[13] E. Keogh and M. Pazzani. Scaling up dynamic time warping for datamining applications. In Proc. 6th
ACM International Conference on Knowledge Discovery and Data Mining, pages 285–289, 2000.
[14] E. Keogh and C. A. Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3):358–386, 2005.
[15] P. Laube and R. S. Purves. How fast is a cow? Cross-scale analysis of movement data. Transactions in
GIS, 15(3):401–418, 2011.
[16] T. Li, H. Chang, M. Wang, B. Ni, R. Hong, and S. Yan. Crowded scene analysis: A survey. IEEE
Transactions on Circuits and Systems for Video Technology, (99):1–20, 2014.
[17] M. Perše, M. Kristan, S. Kovačič, G. Vučkovič, and J. Perš. A trajectory-based analysis of coordinated
team activity in a basketball game. Computer Vision and Image Understanding, 113(5):612–621, 2009.
Computer Vision Based Analysis in Sport Environments.
[18] P. Ranacher and K. Tzavella. How to compare movement? A review of physical movement similarity measures in geographic information science and beyond. Cartography and Geographic Information Science,
41(3):286–307, 2014.
[19] K. Toohey. SimilarityMeasures: Trajectory Similarity Measures, 2015. R package version 1.4,
http://CRAN.R-project.org/package=SimilarityMeasures.
[20] M. Vlachos, G. Kollios, and D. Gunopulos. Discovering similar multidimensional trajectories.
In Proc. 18th International Conference on Data Engineering (ICDE), pages 673–684, 2002.
[21] H. Wang, H. Su, K. Zheng, S. Sadiq, and X. Zhou. An effectiveness study on trajectory similarity measures.
In Proc. 24th Australasian Database Conference, volume 137, pages 13–22, 2013.
[22] Y. Yuan. Image-based gesture recognition with support vector machines. ProQuest, 2008.
[23] Z. Zhang, K. Huang, and T. Tan. Comparison of similarity measures for trajectory clustering in outdoor
surveillance scenes. In Proc. 18th International Conference on Pattern Recognition (ICPR), volume 3,
pages 1135–1138, 2006.
Symbolic trajectories and application challenges
Maria Luisa Damiani¹, Hamza Issa¹, Ralf Hartmut Güting², Fabio Valdes²
¹ Department of Computer Science, University of Milan, Italy
E-mail: {maria.damiani, hamza.issa}@unimi.it
² FernUniversität Hagen, Germany
E-mail: {fabio.valdes,rhg}@fernuni-hagen.de
Abstract
Describing the location history of moving objects exclusively in geometric terms is no longer sufficient;
more expressive data models capturing the complexity and heterogeneity of movement data are
needed. Following this trend, the data model of symbolic trajectories has recently been proposed for
the representation of content-rich trajectories in databases. The model provides a simple notation and
a powerful and fully operational pattern-based query language for trajectory matching and rewriting.
In this paper, we give an overview of the key features of the model and sketch two application cases, the former
regarding the integration of heterogeneous mobility data (GPS and transportation modes), the latter the
representation of migration patterns in animal ecology. The goal is to show the flexibility of the model
and, at the same time, to outline possible directions of research.
1 Introduction
Recent years have witnessed the proliferation of applications which collect and record the location history of
large numbers of moving objects with high accuracy. For example, Google can capture and record the location
history of all the devices with a Google account (if location tracking is explicitly granted by users¹); in the field
of animal ecology, modern animal telemetry and sensor networks (e.g. GPS receivers and other sensors mounted
on devices deployed on animals, such as collars) enable the collection of fine-grained animal trajectories [3].
Trajectory data is an invaluable source of behavioral information on moving objects. Behavioral information
is important for many different purposes. For example, it facilitates the development of advanced on-line information services,
and it opens up new research opportunities in diverse scientific disciplines, such as
sociology, biology, and animal ecology. Dealing with rich trajectory data stores, however, raises important technical
challenges calling for advanced knowledge discovery, data protection, data visualization and data management
solutions [10]. Our research focuses on the last of these. Trajectory databases have been around since the early 2000s
[6]. Existing databases, however, present important limitations, especially concerning the expressiveness
of the data models, which describe an object’s trajectory exclusively in terms of timestamped points sampling
continuous movement. In reality, mobility data is not only ‘big’ in size and thus computationally demanding,
but also exhibits great variety. We highlight four dimensions of this variety, which we label location, context,
time evolution, and movement granularity, and briefly discuss each
below:
(i) Location. It is often the case that locations are not directly expressed in terms of geographical coordinates
but rather in symbolic form. The spatial reference system is thus indirect. Examples of symbolic locations
are the cell identifiers reporting the position of GSM network users, rooms and floors in indoor spaces,
the places that the LBSN users (location-based social networks) share with their friends when they report
a check-in.
¹ http://www.androidcentral.com/understanding-googles-android-location-tracking
(ii) Context. Applications may require supplementary time-varying information beyond location, typically
the context in which the movement takes place. Contextual information includes, for example, the transportation means used by individuals to move in a city, the land type traversed by animals during their daily
activities, the people met during a journey. Therefore the movement can be seen as consisting of multiple
dimensions, where each dimension is a time-varying function. Location is just one of these dimensions.
(iii) Time evolution. The contextual data of concern can vary either discretely or continuously in time. For
instance, the movement of a vehicle can be seen as continuous for what concerns the location but discrete
for what concerns the roads traversed by the vehicle. These dimensions are obviously interrelated.
(iv) Granularity. Trajectories can be described at different levels of abstraction or granularity. For example,
certain mobility patterns extracted from geometric trajectories can be represented themselves as trajectories, though at coarser granularity. Moreover, the spatial dimension of movement may become irrelevant
at a certain level of abstraction. Applications may require rolling up and drilling down through these
abstraction levels to perform multi-scale analysis.
Unconventional trajectories include the sequences of users check-ins in LBSNs; the sequences of activities,
e.g. shopping, working, inferred from mobility data; the movement of individuals in an urban setting described
by both a GPS trajectory (continuous) and by a discrete sequence of transportation modes. The challenge is to
capture this multiplicity of views in a unifying and flexible trajectory data model.
1.1 Brief overview of the research context and paper contribution
Semantic trajectories are a first step in that direction [1, 11, 12, 10]. In reality, the semantic enrichment of spatial
and spatio-temporal objects is not a novelty. For example, coordinate locations can be enriched (or annotated)
with places in e.g. [8]; another popular form of annotation is the transportation mode, e.g. [19]; sequences
of activities are represented in e.g. [13]. In general, these annotations can be extracted from geometric data
using analytical techniques, or be directly specified by the user or acquired from sensors. What is new in the
notion of semantic trajectory with respect to the existing work is the idea of defining a framework enabling the
construction, representation, and analysis of annotated trajectories [12, 16, 10]. While such research has led to
diverse techniques and applications, it is worth observing that there is still no univocal definition of semantic
trajectory. In the database realm, recent research focuses on the development of access mechanisms for the fast
processing of specific classes of queries, e.g. k-nn queries [18, 2], and data mining tasks, e.g. frequent sequential
pattern mining [17], typically over trajectories representing sequences of places. Pattern matching techniques
for querying trajectories annotated with symbolic locations are proposed in [5, 15]. More recent is the concern
for the development of database models [9], which is where the research on symbolic trajectories fits in.
The goal of this short paper is to offer a few hints on the application potential of symbolic trajectories,
while also highlighting some open issues. The rest of the paper is organized as follows: Section 2 briefly presents
the key features of the symbolic trajectory data model, and Section 3 sketches two application cases. The paper
ends with some final considerations.
2 Symbolic trajectories
Symbolic trajectories is basically a data model for the representation of discrete trajectories in databases [14, 7].
Abstractly, a symbolic trajectory is a time-dependent function which takes values in a categorical domain. The
domain consists of a set of strings called labels. Labels can represent, for example, activities, e.g. shopping,
road names, e.g. Fifth Avenue, weather conditions, e.g. raining. A symbolic trajectory has a simple structure
which consists, in its basic form, of a sequence of pairs (units) $\langle (i_1, l_1), \ldots, (i_n, l_n) \rangle$ where $i_j = [t_{j1}, t_{j2}]$ is
a time interval and $l_j$ the label. The time intervals are disjoint and temporally ordered. For example, a simple
symbolic trajectory is:
< ([8:45 - 17:00] working)([17:00 - 18:30] shopping) (..)>
If we think of the label as an event, the interval specifies when such an event takes place. Multiple labels can
be specified for the same interval to denote for example a set of events which occur at the same time. Moreover,
a label can be associated with a spatial object with a reference to a precise geometric location and extent. The
association of a label and a spatial object is called a place. Places can be used, for example, to denote symbolic
locations, such as the check-in venues in LBSNs.
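As an illustration only, the basic structure described above can be sketched in Python. This is not the SECONDO implementation: the names `Unit`, `make_symbolic_trajectory` and `label_at` are hypothetical, the calendar date is arbitrary, and treating intervals as half-open (so that two units may share an endpoint such as 17:00) is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Unit:
    """One unit of a symbolic trajectory: a time interval plus a label."""
    start: datetime
    end: datetime
    label: str


def make_symbolic_trajectory(units):
    """Return the units in temporal order, checking that intervals are disjoint."""
    ordered = sorted(units, key=lambda u: u.start)
    for u in ordered:
        if u.end < u.start:
            raise ValueError(f"interval of {u.label!r} ends before it starts")
    for prev, nxt in zip(ordered, ordered[1:]):
        # Assumption: a shared endpoint (17:00 / 17:00) does not count as overlap.
        if nxt.start < prev.end:
            raise ValueError(f"{prev.label!r} and {nxt.label!r} overlap")
    return ordered


def label_at(trajectory, instant):
    """Evaluate the time-dependent function: the label valid at `instant`, or None."""
    for u in trajectory:
        if u.start <= instant < u.end:
            return u.label
    return None


day = datetime(2015, 3, 2)  # arbitrary date, for illustration
traj = make_symbolic_trajectory([
    Unit(day.replace(hour=17), day.replace(hour=18, minute=30), "shopping"),
    Unit(day.replace(hour=8, minute=45), day.replace(hour=17), "working"),
])
```

With this sketch, `label_at(traj, day.replace(hour=12))` evaluates the trajectory at noon, and instants outside every interval yield `None`.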
Symbolic trajectories are accessed using a language for pattern matching and rewriting. Matching is used to
retrieve the trajectories satisfying the pattern. Rewriting is to extract or redefine parts of trajectories matching
the given pattern. Patterns can be defined using regular expressions, variables, and a variety of conditions. A
simple pattern is for example:
* (_ working) (_ shopping) *
This pattern matches symbolic trajectories where working is followed by shopping. The symbol * matches
any sequence of units, the symbol _ any unit component. A more complex pattern using variables and conditions
on variables is:
* (morning working) X(_ shopping)* //duration X.time> 2 * hour
The matching trajectories are those in which the working activity takes place in the morning and is followed
by a shopping activity taking more than 2 hours. This pattern contains the variable X, denoting the following
unit, and a temporal condition (the symbol // separates the pattern from the condition); X.time denotes the time
interval in the unit denoted by X. Patterns can be extended to rewriting rules. For example:
* X(morning working) Y(_ shopping)* //duration Y.time> 2 * hour => X Y
This rule returns for each input trajectory all the parts that match the two adjacent units denoted by X and Y.
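To make the matching semantics more concrete, the following Python sketch implements a tiny backtracking matcher for patterns made of the wildcard * and of unit tests. It is a simplification written for this note, not the pattern language of [14]: units are plain tuples `(start_hour, end_hour, label)`, the helpers `unit`, `match` and `morning` are hypothetical names, the meaning given to `morning` (the unit ends by noon) is an assumption, and conditions are attached directly to a unit test instead of being written after `//`.

```python
def unit(time="_", label="_", cond=None):
    """Build a test for one unit (start_hour, end_hour, label); "_" accepts anything."""
    def test(u):
        start, end, unit_label = u
        return ((time == "_" or time(start, end))
                and label in ("_", unit_label)
                and (cond is None or cond(u)))
    return test


def match(units, pattern):
    """Return the units bound to the non-wildcard pattern items, or None if no match."""
    def walk(i, j, bound):
        if j == len(pattern):
            return bound if i == len(units) else None
        if pattern[j] == "*":
            # Let the wildcard consume 0, 1, 2, ... units and backtrack on failure.
            for skip in range(i, len(units) + 1):
                found = walk(skip, j + 1, bound)
                if found is not None:
                    return found
            return None
        if i < len(units) and pattern[j](units[i]):
            return walk(i + 1, j + 1, bound + [units[i]])
        return None

    return walk(0, 0, [])


def morning(start, end):
    """Assumed meaning of 'morning': the unit ends no later than noon."""
    return end <= 12.0


day = [(8.0, 12.0, "working"), (12.5, 15.5, "shopping"), (16.0, 17.0, "walking")]
simple = ["*", unit(label="working"), unit(label="shopping"), "*"]
timed = ["*", unit(time=morning, label="working"),
         unit(label="shopping", cond=lambda u: u[1] - u[0] > 2), "*"]
```

Here `match(day, simple)` and `match(day, timed)` both return the working and shopping units, which is also what a rewriting rule of the form `=> X Y` would extract in this simplified setting.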
At system level, following the framework in [6], symbolic trajectories are introduced through the definition
of new data types. In the simplest case the data type is moving(label) (or mlabel). The pattern language and the
type system are seamlessly integrated into the relational model. For example, we can construct a relation describing the trips of a person with attributes describing the symbolic information on road names and the symbolic
information on transportation modes:
Trips (Id: int, RoadName: mlabel, TransportMode: mlabel)
The query: retrieve the trips that start from a given road in the morning and end at the same road in the
evening can be simply formulated as follows:
SELECT *
FROM Trips
WHERE RoadName matches ' X(morning ) +
Y (evening ) //X.label=Y.label'
The language is fully operational, running on a platform (SECONDO) offering a rich and extensible set of data
types that can be used for formulating a variety of conditions on pattern variables.
3 Application challenges
In what follows, we overview two application cases where symbolic trajectories are used for very different
purposes. The former case regards the integration of heterogeneous mobility data (GPS and transportation
modes), the latter the representation of specific mobility patterns (migrations) in animal ecology.
3.1 Querying GPS trajectories and transportation modes
A prominent example of a dataset containing both continuous and discrete movement data is GeoLife [20]. GeoLife is a well-known dataset reporting the traces of a group of individuals monitored in Beijing for a few years.
Interestingly, GeoLife consists not only of the GPS tracks in the form of timestamped point sequences, i.e.
$\{(t_i, p_i)\}_{i \in [1,n]}$, but also, for a subset of users only, of the temporally annotated sequences of the transportation modes. A sequence of transportation modes takes the form $\{(I_i, l_i)\}_{i \in [1,m]}$ where $I_i$ is a time interval
and $l_i$ a string in the set {walk, bike, car, bus, airplane, other}. Notably, this dataset exemplifies a situation increasingly common in applications, namely the coexistence of heterogeneous trajectory data
describing different aspects of the movement. Querying heterogeneous trajectory data is a challenging issue.
In what follows, we show a first approach which leverages the querying capabilities of symbolic trajectories.
Preliminarily, we describe how to store the GeoLife dataset.
The data is simply stored in a table consisting of three attributes: the user id, the geometric trajectory GPStrack of type mpoint and the symbolic trajectory Transport of type mlabel.
geoLife(UserId: integer, GPStrack: mpoint, Transport: mlabel)
A first simple query, which only uses the matches operator, is the following:
Query 1: retrieve the individuals that use buses or trains in the morning and
again in the evening of the same day
SELECT UserId
FROM geoLife
WHERE Transport matches
' * X[(morning bus) | (morning train)]+
Y [(evening bus)| (evening train)]+ *//
(Y.end - X.start ) < 1 * day '
The specified rule contains the pattern and the temporal condition. The next example shows how to replace a
set of labels with a more general term using the rewrite operation. This mechanism can be used for example to
perform roll-up operations over a hierarchy of terms and thus perform multi-scale analysis.
Query 2: retrieve the trajectories of the individuals that use trains
or buses in the morning and in the evening, replacing the sequence of transportation
means with a more general term (i.e. 'public transportation').
SELECT
rewrite(Transport,'* X [(morning bus) |(morning train)]+
Y[(evening bus) |(evening train)]* // (Y.end - X.start) < 1 * day
=> T // T.label := publicTransport, T.start := X.start, T.end :=Y.end')
AS generalTransport
FROM geoLife
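The roll-up idea behind Query 2 can be illustrated with a short Python sketch. It is a simplified stand-in for the rewrite operator, not its actual semantics: each label is replaced by its parent in a term hierarchy and temporally contiguous units with the same generalized label are merged. The hierarchy `HIERARCHY`, the function `roll_up` and the tuple layout `(start_hour, end_hour, label)` are assumptions made for the example.

```python
# Hypothetical one-level hierarchy of transportation terms.
HIERARCHY = {"bus": "publicTransport", "train": "publicTransport"}


def roll_up(units, hierarchy):
    """Replace each label by its parent term and merge contiguous equal labels."""
    result = []
    for start, end, label in units:
        general = hierarchy.get(label, label)  # labels without a parent are kept
        if result and result[-1][2] == general and result[-1][1] == start:
            # Same generalized label and no temporal gap: extend the last unit.
            result[-1] = (result[-1][0], end, general)
        else:
            result.append((start, end, general))
    return result


trip = [(7.0, 7.5, "walk"), (7.5, 8.0, "bus"), (8.0, 8.5, "train"), (18.0, 18.5, "bus")]
general_trip = roll_up(trip, HIERARCHY)
```

In this example the morning bus and train units collapse into a single `publicTransport` unit, while the evening bus ride stays separate because it is not contiguous in time.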
A slightly different but more challenging query is to retrieve the individuals who not only use public transportation in the morning and in the evening, possibly after walking for a while, but who also return near the
point from where they left in the morning. This query specifies two conditions: one is on the symbolic trajectory; the other is a spatial condition (i.e. the initial and final points are to be close to each other). This query
cannot be answered by evaluating the symbolic and spatial components separately. Rather, a tighter integration of the
trajectory representations, both at language and system level, is needed. A first approach to the problem has
been shown in [4]. The idea is to leverage the temporal correlation between the two trajectories by retrieving the
sub-trajectories which match the symbolic pattern and, based on them, temporally restrict the spatial trajectories.
At language level, this connection between the symbolic and the spatial component in a comprehensive pattern
is realized through the use of global variables consisting of both a symbolic and a spatial component. At database
system level, a new data type hybrid is defined for handling these composite trajectories while the operators
matches and rewrite are extended accordingly. The query is formulated as follows. Let C_s and C_p be the rewrite
rule and the spatial condition, respectively:
let C_s =
'* X_s (morning walk)+ * Y_s[(morning bus)|(morning train)]+
* W_s[(evening bus)| (evening train)]+ Z_s(evening walk) *//
(Z_s.end - X_s.start ) < 1 * day
=> X_s Y_s W_s Z_s'
let C_p = 'distance(val(final(X_p)), val(initial(Z_p))) < 40'
The global variables are X, Y, W, Z where the variable subscript (s and p) indicates the dimension (symbolic or
spatial). The extended rewrite operation over the hybrid trajectory HTraj is called as follows:
SELECT h_rewrite(HTraj, C_p, C_s)
FROM geoLife
where h_rewrite() is the extended rewrite operation. One trajectory resulting from the query is shown in
Figure 1.(a). A similar example is shown in Figure 1.(b) displaying the result of a query involving the spatial
containment predicate. Specifically the query is to retrieve the trajectories that start from Hong Kong by boat to
arrive at Macao, where a bus is taken before returning to Hong Kong by boat. For brevity, only the symbolic and
spatial conditions are reported below:
let C_s = ' *
Y_s[(_ boat)]+ K_s[(_ bus)]+ Z_s[(_ boat)]+ * =>
Y_s K_s Z_s'
let C_p =
'val(initial(Y_p)) inside Hong Kong,
val(final(Y_p)) inside Macao'
The query language can thus be used to support exploratory searching across heterogeneous mobility data.
It can be noted that these queries are quite compact. If they were expressed using an existing (spatial or moving object) database, the queries would be extremely long and complex or would require ad-hoc
programs. This results in a more usable data analysis platform.
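The temporal restriction idea behind hybrid queries (match the symbolic pattern first, then restrict the spatial trajectory to the matched time span and evaluate the spatial condition) can be sketched in Python as follows. This is only a rough illustration and not the hybrid data type of [4]: the track is a list of `(time, x, y)` samples in planar coordinates (metres assumed), the symbolic match is given as a list of `(start, end, label)` units, and the spatial check follows the prose description (first and last point close to each other). The names `restrict` and `returns_near_start` are hypothetical.

```python
import math


def restrict(track, start, end):
    """Keep the timestamped points (t, x, y) whose time lies in [start, end]."""
    return [p for p in track if start <= p[0] <= end]


def returns_near_start(track, matched_units, max_dist):
    """Spatial check on the part of the track covered by the symbolic match."""
    start = matched_units[0][0]
    end = matched_units[-1][1]
    part = restrict(track, start, end)
    if len(part) < 2:
        return False
    (_, x0, y0), (_, x1, y1) = part[0], part[-1]
    return math.hypot(x1 - x0, y1 - y0) < max_dist


track = [(7.0, 0, 0), (8.0, 500, 0), (18.0, 500, 10), (19.0, 10, 5), (22.0, 900, 900)]
matched = [(7.0, 7.5, "walk"), (7.5, 8.0, "bus"), (18.0, 18.5, "bus"), (18.5, 19.0, "walk")]
```

Calling `returns_near_start(track, matched, 40)` only looks at the samples between 7.0 and 19.0, so the later sample at 22.0 does not influence the result.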
3.2 Representing the migratory behavior
In this second case, we use symbolic trajectories to represent the behavioral information extracted from geometric trajectories whenever such knowledge takes the form of a trajectory, and thus time still plays a key role. Whilst
in the previous case we have emphasized the analytical capabilities of the symbolic query language, now we
focus more on the effectiveness of the data model in describing the individual patterns extracted from geometric
trajectories.
The case study regards the representation of the migratory behavior. In particular, the study has been conducted for a group of wild animals (roe deer), equipped with a low sampling rate GPS collar and tracked for a
period covering a few seasons [3]. The animals of this species can either migrate or be stationary, a behavior
known as partial migration in animal ecology. Moreover, whenever an animal migrates, the migration takes
place with modalities and times that, although respecting certain general patterns, e.g. seasonality, can vary
from animal to animal. Therefore every animal has its own migratory behavior. Because of that, the pattern is
defined as an individual pattern.
Figure 1: (a) home-work trajectory; (b) trajectory from Hong Kong to Macao, followed by a travel by bus in
Macao before the return by boat
At first sight, this migratory behavior can be considered a stop-and-move pattern [11], i.e. the moving object
stays in a region for some time and then moves to some other stay region. In reality the migratory behavior is significantly different. In fact, the object residing in a given stay region can experience brief excursions after which
the object returns to the stay region. Therefore the residence consists of periods of presence interleaved with
periods of absence while a migration is the definitive transition from a stay region to another stay region. Extracting this kind of pattern raises interesting challenges because no assumption can be made on speed, direction
and other movement characteristics, as well as on the distribution of points.
Recently, a time-aware, density-based clustering algorithm has been proposed to extract the stay regions
where the animals reside for most of their time and the transitions from one stay region to another
[3]. While the details of the algorithm are not relevant for this discussion, more interesting is the question
of how to represent the migratory behavior resulting from the clustering. One could opt, for example, for a
graphical representation relying on modern visual analytics techniques. Though important, visualization is
not sufficient, however, because the behavioral information cannot be easily processed. Another simple approach is to
represent the cluster through a set of labeled points, i.e. a point $(p_i, t_i)$ is labeled with the identifier of the cluster
the point belongs to. Accordingly a cluster, say c, can be represented using a point-based representation, i.e.
through a temporally ordered sequence of timestamped points. This representation is however unsatisfactory.
For example, given two points $p_i$, $p_{i+1}$ in c, the point-based representation does not specify where the object
is in the time interval $[t_i, t_{i+1}]$ (this independently of the inherent location uncertainty). We recall, in fact,
that the object can experience periods of absence, therefore subsequent points in the cluster sequence are not
necessarily consecutive in the original trajectory.
The information on presence/absence can be naturally expressed using symbolic trajectories. In fact, the
algorithm keeps track of the periods of presence and absence. This information on the periods of presence can
be translated into units reporting the time interval and the cluster id for each of these periods. This results in
a fine-grained representation of the movement. Figure 2 displays the trajectory of an animal and the resulting
clusters and migration path. The symbolic trajectory (a fragment) detailing when the animal is inside/outside
the cluster labeled H1 is reported below.
Figure 2: (a) The space-time cube representing the geometric trajectory of the animal; (b) the trajectory at cluster
level [3].
...
["2006-06-17 12:02:00.016" "2006-06-18 04:01:00.024"] "H1"
["2006-06-18 20:00:00.053" "2006-06-19 00:00:00.054"] "H1"
["2006-06-21 20:00:00.054" "2006-06-24 04:01:00.055"] "H1"
["2006-06-24 20:02:00.014" "2006-06-24 20:02:00.014"] "H1"
["2006-06-25 04:01:00.012" "2006-06-30 04:00:00.053"] "H1"
["2006-06-30 20:00:00.055" "2006-07-02 08:00:00.053"] "H1"
["2006-07-03 04:01:00.049" "2006-07-23 08:01:00.024"] "H1"
["2006-07-23 20:00:00.053" "2006-08-14 16:01:00.042"] "H1"
["2006-08-18 12:01:00.012" "2006-09-13 00:01:00.012"] "H1"
.....
The symbolic trajectory offers an unprecedented level of detail on the animal’s movement inside the home range,
which is of great ecological interest. Moreover, the movement can be observed at different levels of temporal
and thematic detail. For example, rewriting rules or classification rules can be applied to automatically generate
more synthetic descriptions facilitating multi-scale analysis.
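The translation of presence periods into units can be illustrated with a small Python sketch. It rests on a simplifying assumption that is not taken from [3]: every sample of the original trajectory carries either a cluster identifier or `None` (absence), and a presence period is a maximal run of consecutive samples with the same identifier. The function name `presence_units` and the integer timestamps are hypothetical.

```python
from itertools import groupby


def presence_units(labeled_points):
    """Turn time-ordered (timestamp, cluster_id) samples into units
    (first_timestamp, last_timestamp, cluster_id); None marks absence."""
    units = []
    for cluster, run in groupby(labeled_points, key=lambda p: p[1]):
        run = list(run)
        if cluster is not None:
            units.append((run[0][0], run[-1][0], cluster))
    return units


samples = [(1, "H1"), (2, "H1"), (3, None), (4, "H1"), (5, "H2"), (6, "H2")]
units = presence_units(samples)
```

A run made of a single sample yields a unit whose interval degenerates to one instant, as in the fourth unit of the fragment shown above.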
4 Conclusion
Providing tools enabling the flexible representation of mobility behavior is a prominent and promising research
direction. Symbolic trajectories is the first running solution proposed to solve the problem from a database
perspective. The research is, however, only at the beginning. Extending the concept of symbolic trajectory to
that of multi-dimensional trajectory, mediating between expressiveness and efficiency and, at the same time,
experimenting with the solutions on real applications are major challenges for future work.
References
[1] L. Alvares, V. Bogorny, B. Kuijpers, J. de Macedo, B. Moelans, and A. Vaisman. A model for enriching
trajectories with semantic geographical information. In Proc. ACM GIS, page 22, 2007.
[2] G. Cong, H. Lu, B. Chin-Ooi, D. Zhang, and M. Zhang. Efficient spatial keyword search in trajectory
databases. CoRR, abs/1205.2880, 2012.
[3] M. L. Damiani, H. Issa, and F. Cagnacci. Extracting stay regions with uncertain boundaries from gps
trajectories: A case study in animal ecology. In Proc. SIGSPATIAL, 2014.
[4] M. L. Damiani, H. Issa, R. H. Güting, and F. Valdes. Hybrid queries over symbolic and spatial trajectories:
A usage scenario. In Proc. MDM, 2014.
[5] C. du Mouza and P. Rigaux. Mobility patterns. Geoinformatica, 9:297–319, 2005.
[6] R. H. Güting, M. H. Böhlen, M. Erwig, C. S. Jensen, N. A. Lorentzos, M. Schneider, and M. Vazirgiannis.
A foundation for representing and querying moving objects. ACM Trans. Database Syst., 25(1):1–42,
2000.
[7] R. H. Güting, F. Valdés, and M. L. Damiani. Symbolic trajectories. Technical report, FernUniversität in
Hagen, Informatik-Report 369, 12/2013, 2013.
[8] J. Liu, O. Wolfson, and H. Yin. Extracting semantic location from outdoor positioning systems. In Proceedings of the 7th International Conference on Mobile Data Management, page 73, 2006.
[9] N. Pelekis and Y. Theodoridis. Mobility Data Management and Exploration. Springer, 2014.
[10] C. Renso, S. Spaccapietra, and E. Zimányi. Mobility Data – Modeling, Management, and Understanding.
Cambridge University Press, 2013.
[11] S. Spaccapietra, C. Parent, M. L. Damiani, J. de Macedo, F. Porto, and C. Vangenot. A conceptual view on
trajectories. Data Knowl. Eng., 65:126–146, 2008.
[12] S. Spaccapietra, C. Parent, C. Renso, G. Andrienko, N. Andrienko, V. Bogorny, M. Damiani, A. Gkoulalas-Divanis, J. Macedo, N. Pelekis, Y. Theodoridis, and Z. Yan. Semantic Trajectories Modeling and Analysis.
ACM Computing Surveys, 45(4):42, 2013.
[13] E. Tapia, S. Intille, and K. Larson. Activity recognition in the home using simple and ubiquitous sensors.
Springer, 2004.
[14] F. Valdés, M. L. Damiani, and R. H. Güting. Symbolic trajectories in SECONDO: pattern matching and
rewriting. In Proc. DASFAA, 2013.
[15] M. R. Vieira, P. Bakalov, and V. J. Tsotras. Querying trajectories using flexible patterns. In Proc. of the
13th Int. Conf. on Extending Database Technology, EDBT ’10, pages 406–417, 2010.
[16] Z. Yan and D. Chakraborty. Semantics in mobile sensing. Morgan & Claypool, 2014.
[17] C. Zhang, J. Han, L. Shou, J. Lu, and T. F. L. Porta. Splitter: Mining fine-grained sequential patterns in
semantic trajectories. PVLDB, 7(9):769–780, 2014.
[18] K. Zheng, S. Shang, N. Yuan, and Y. Yang. Towards efficient search for activity trajectories. In Proceedings
of the 2013 IEEE International Conference on Data Engineering (ICDE 2013), 2013.
[19] Y. Zheng, L. Liu, L. Wang, and X. Xie. Learning transportation mode from raw gps data for geographic
applications on the web. In Proceedings of the 17th International Conference on World Wide Web, WWW
’08, pages 247–256, 2008.
[20] Y. Zheng, X. Xie, and W.-Y. Ma. GeoLife: A Collaborative Social Networking Service among User,
Location and Trajectory. IEEE Data Eng. Bull., 33(2):32–39, 2010.
Planning Sightseeing Tours
using Crowdsensed Trajectories
Igo Brilhante¹, Jose Antonio Macedo¹,
Franco Maria Nardini², Raffaele Perego², Chiara Renso²
¹ Department of Computer Science, Federal University of Ceará, Brazil
{igobrilhante,jose.macedo}@lia.ufc.br
² ISTI-CNR, Italy
{nardini,perego,renso}@isti.cnr.it
Abstract
We present an application where semantically enriched trajectories obtained from crowdsensed data are
used to build an advanced system for planning personalized sightseeing tours, called TripBuilder.
The interesting feature of TripBuilder is that it uses Wikipedia content and trajectories of previous
tourists collected from georeferenced Flickr photos in a complex spatio-temporal framework. The objective
is to address, in an unsupervised way, the problem of suggesting a budgeted sightseeing tour based on
the preferences of the tourist and the time available for the visit. We present a few highlights of how
TripBuilder works along with a research agenda where we discuss the role of semantically enriched
trajectories and crowdsourced location data in planning itineraries.
1 Introduction
Tourists approaching their destination for the first time deal with the problem of planning a sightseeing itinerary
that covers the most subjectively interesting attractions and fits the time available for their visit. Precious
information can nowadays be gathered from many digital sources, e.g., travel guides, maps, institutional sites,
travel blogs. Nevertheless, the tourists still need to choose the preferred Points of Interest (PoIs), to guess how
much time is needed to visit them and to move from one attraction to the next one. In this paper we discuss
TripBuilder, an unsupervised system helping tourists to build their own personalized sightseeing tour. Given
the target destination, the time available for the visit, and the tourist’s profile, TripBuilder recommends a
time-budgeted tour that maximizes the tourist’s interests and takes into account both the time needed to enjoy the
attractions and the time needed to move from one PoI to the next one. Moreover, the knowledge base feeding the recommendation
model is entirely and automatically extracted from publicly available Web services, namely, Wikipedia, Flickr
and Google Maps.
TripBuilder exploits the publicly available content shared on Flickr by tourists. Unofficial statistics claim
that about 1.4 million public photos are uploaded every day, and that upload peaks occur just after holiday
periods¹. Each photo comes with very useful information such as: tags, comments and likes from the Flickr social
network, number of views, information about the user, timestamp, GPS coordinates of the place where the photo
was taken. This allows us to reconstruct the movements of users and their interests by analyzing the time-ordered
sequence of their photos.
1
http://www.flickr.com/photos/franckmichel/6855169886/
The process of recognizing relevant PoIs given such a set of photos is however not trivial in the absence of a
common database of geo-referenced PoIs. Fortunately, a Wikipedia² page is associated with most entities of
interest for tourism. From this Wikipedia page we can thus easily extract: the (multilingual) name of the PoI, its
geographic coordinates, and the categories which the PoI belongs to according to a weak but precise ontology (i.e.,
the PoI is a church, a square, a museum, a historical building, a bridge, etc.).
By clustering and spatially matching tourists’ photo albums from Flickr on the relevant PoIs extracted from
Wikipedia pages, we can thus derive a knowledge base that represents the behavior of people visiting a given
city. In this knowledge base the popularity of a PoI is estimated from the number of photos available or from
the number of different visitors that shot photos there. Furthermore, from the distribution of the timestamps of
the first and last photos taken in a given PoI, we can roughly estimate the average time needed for visiting it.
The time needed to move from a PoI to the next one in the sightseeing itinerary is instead computed by querying
Google Maps. Finally, the Wikipedia categories of the PoIs visited by a given tourist are used to build her profile
and to characterize the trajectories across the PoIs. For example, if a tourist takes many pictures of churches
and museums, we can infer a preference for cultural/historical attractions. Analogously, we can aggregate this
information at the level of the trajectories mined from Flickr photo albums to estimate the relevance of the given
trajectory for the tourist profile.
In [1, 3], we discussed the TripBuilder methodology for generating personalized sightseeing tours while
in [2] we described the Web application implementing the TripBuilder system (see Figure 4). This paper is
organized as follows. We give a short introduction to the TripBuilder solution and the related problems in
Section 2. Then, in Section 3 we show the current state of its distributed architecture that, by exploiting the
unsupervised approach of TripBuilder, aims at automatically building the knowledge base for a large number
of different cities. Finally, we discuss new directions of research in Section 4.
2 Building the TripBuilder knowledge base
The generation of the knowledge base used by TripBuilder is a multi-step unsupervised process that we
detail in the following.
PoIs. The first step is to identify the set of PoIs in the target geographical region. Given the bounding box
$BB_{city}$ containing the city of interest, we download all the geo-referenced Wikipedia pages falling within this
region. We assume each geo-referenced Wikipedia named entity whose geographical coordinates fall into
$BB_{city}$ to be a fine-grained Point of Interest. For each PoI, we retrieve its descriptive label, its geographic
coordinates as reported in the Wikipedia page, and the set of categories the PoI belongs to. Categories are
reported at the bottom of the Wikipedia page, and are used to link articles under a common topic. They form
a hierarchy, although sub-categories may be a member of more than one category. By considering the set C
of categories associated with all the PoIs, we generate the normalized relevance vector of each PoI. We then
perform a density-based clustering to group into a single PoI the sightseeing entities which are very close to each
other. Clustering very close PoIs is important since a tourist in a given place can enjoy all the attractions in the
surroundings even if she does not take photos of all of them. Moreover, it aims at reducing the sparsity that might
affect trajectory data. To cluster the PoIs we use DBSCAN [5]. To build our dataset, we set 1 as the minimum
number of points and 200 meters as ε. At the end of this step each PoI $p \in P$ is univocally identified by its
geographic coordinates, a name, and a relevance vector $\vec{v}_p \in [0, 1]^{|C|}$ measuring the normalized relevance of p
w.r.t. the categories C. For the clustered PoIs, the relevance vector $\vec{v}_p$ is obtained by considering the occurrences
of each category in the members of the clusters and by normalizing the resulting vector.
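The clustering step can be illustrated with a self-contained Python sketch. It does not call a DBSCAN library: when the minimum number of points is 1 (and a point counts as its own neighbour) every point is a core point, so the density-based clusters coincide with the connected components of the graph linking points at most ε apart, which is what the code computes with a union-find structure. The function names, the use of the haversine distance and the sample coordinates are assumptions made for the example.

```python
import math


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(h))


def cluster_pois(coords, eps_m=200.0):
    """Group points into connected components of the eps-neighbourhood graph."""
    parent = list(range(len(coords)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if haversine_m(coords[i], coords[j]) <= eps_m:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(coords)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two entities roughly 60 m apart and a third one several hundred metres away.
pois = [(43.7230, 10.3966), (43.7232, 10.3959), (43.7160, 10.4010)]
clusters = cluster_pois(pois)
```

The quadratic pairwise loop is adequate for the few hundred PoIs of a single city; a spatial index would be preferable for larger inputs.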
2
http://www.wikipedia.org
Users and PoI histories. As a second step we need a method for collecting tourists’ information and their long-term itineraries crossing the discovered PoIs. We query Flickr to retrieve the metadata (user id, timestamp, tags,
geographic coordinates, etc.) of the photos taken in the given area $BB_{city}$. The assumption we are making is
that photo albums made by Flickr users implicitly represent sightseeing itineraries within the city. To strengthen
the accuracy of our method, we retrieve only the photos having the highest geo-referencing accuracy provided by
Flickr³. This process thus collects a large set of geo-tagged photo albums taken by different users within $BB_{city}$.
We preliminarily discard photo albums containing only one photo. Then, we spatially match the remaining photos
against the set of PoIs previously collected. We associate a photo with a PoI when it has been taken within a circular
buffer of a given radius having the PoI as its center. Note that in order to deal with clustered PoIs, we consider
the distance of the photo from all constituent members: if the photo falls within the circular region of
at least one of the members, it is assigned to the clustered PoI. Moreover, since several photos by the same user
are usually taken close to the same PoI, we consider the timestamps associated with the first and last of these
photos as the starting and ending time of the user visit to the PoI. The PoI visiting time ρ(p) is then estimated
by computing for each PoI the average duration of these visits. Moreover, the popularity of each PoI is computed as
the number of distinct users that take at least one photo in its circular region. The above process allows us to
generate the set of users, their PoI history (the temporally ordered sequence of PoIs visited by a user u), and
estimates for the popularity and visiting time of each PoI. Finally, a preference vector $\vec{v}_u \in [0, 1]^{|C|}$ stating the
normalized interest of u for the categories in C is built by summing up and normalizing the relevance vectors of
all the PoIs occurring in the PoI history of u.
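The estimation of visiting time and popularity can be sketched in Python as follows. The sketch assumes that photos have already been matched to PoIs and, as a simplification not stated in the text, that all photos of a user at a PoI belong to a single visit. The function name `poi_statistics`, the identifiers and the timestamps (in seconds) are hypothetical.

```python
from collections import defaultdict


def poi_statistics(photos):
    """`photos` holds (user_id, poi_id, timestamp_in_seconds) triples already
    matched to a PoI. Returns {poi: (average_visit_seconds, popularity)}."""
    span = {}
    for user, poi, ts in photos:
        first, last = span.get((user, poi), (ts, ts))
        span[(user, poi)] = (min(first, ts), max(last, ts))

    durations = defaultdict(list)
    for (user, poi), (first, last) in span.items():
        # Visit duration: time between the first and the last photo of the user.
        durations[poi].append(last - first)

    # Popularity: number of distinct users with at least one photo at the PoI.
    return {poi: (sum(d) / len(d), len(d)) for poi, d in durations.items()}


photos = [
    ("u1", "tower", 0), ("u1", "tower", 1800),
    ("u2", "tower", 100), ("u2", "tower", 700),
    ("u2", "dome", 900),
]
stats = poi_statistics(photos)
```

A user with a single photo at a PoI contributes a zero-length visit, which pulls the average down; whether such visits should be kept is a design choice left open here.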
Trajectories. In order to build the set S of trajectories used by TripBuilder we split users’ PoI histories. In
particular, given a PoI history $H_u$ of m visits, where each p of $H_u$ is annotated with the two timestamps $[t_1, t_2]$ indicating
the start time and the end time of the visit, and a time threshold δ, we define a trajectory $T_u$ as any subsequence of
$H_u$
$\langle (p_k, [t_{1k}, t_{2k}]), \ldots, (p_{k+i}, [t_{1(k+i)}, t_{2(k+i)}]) \rangle$
such that:
$i \geq 1$
$t_{1k} - t_{2(k-1)} > \delta$, if $k > 1$
$t_{1(k+i+1)} - t_{2(k+i)} > \delta$, if $(k + i) < m$
$t_{1(k+j)} - t_{2(k+j-1)} \leq \delta$, $\forall j$ s.t. $1 \leq j \leq i$.
The intuition is that trajectories are sequences of PoIs visited consecutively during the same “visit”. They are
obtained by cutting the user PoI history where the time interval between the visits to two subsequent PoIs is
greater than a given threshold δ. To choose the splitting threshold δ, we derive the users’ wisdom-of-crowds
behavior by analyzing the inter-arrival time of each pair of consecutive photos taken in different PoIs. Therefore,
we compute the probability distribution of the inter-arrival time $P(x \leq \delta)$ of pairs of consecutive photos.
Then, we choose the time threshold δ such that $P(x \leq \delta) = 0.9$.
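The splitting rule and the choice of δ can be sketched in Python as follows. The sketch reads the definition above literally: a history is cut wherever the gap between the end of one visit and the start of the next exceeds δ, and pieces with fewer than two PoIs are dropped (the condition i ≥ 1). Taking δ as an empirical quantile of the observed gaps is one possible reading of P(x ≤ δ) = 0.9; the function names and the sample values (in minutes) are assumptions.

```python
import math


def choose_delta(gaps, coverage=0.9):
    """Smallest observed gap d such that at least `coverage` of the gaps are <= d."""
    ordered = sorted(gaps)
    rank = math.ceil(coverage * len(ordered))
    return ordered[max(rank, 1) - 1]


def split_history(history, delta):
    """`history` is a time-ordered list of (poi, t_start, t_end) visits."""
    trajectories, current = [], []
    for visit in history:
        if current and visit[1] - current[-1][2] > delta:
            trajectories.append(current)
            current = []
        current.append(visit)
    if current:
        trajectories.append(current)
    # A trajectory needs at least two PoIs (i >= 1 in the definition above).
    return [t for t in trajectories if len(t) >= 2]


gaps = [5, 10, 15, 20, 25, 30, 35, 40, 45, 600]
delta = choose_delta(gaps)
history = [("A", 0, 10), ("B", 20, 30), ("C", 500, 510), ("D", 520, 530), ("E", 2000, 2010)]
trajectories = split_history(history, delta)
```

In this example the isolated visit to E forms a piece of length one and is therefore not returned as a trajectory.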
Traveling time estimation. An important aspect of TripBuilder is that we recommend sightseeing tours
fitting the available time budget and not just the set of PoIs to be visited. The sightseeing tour building step
should therefore consider not only the PoI visiting time ρ(p) but also the time τ(·, ·) needed to move between
consecutive PoIs in the itinerary. Since measuring the inter-PoI moving time from the photo albums turned out to be
inaccurate for unpopular PoIs, we resort to an external service. Given a pair $(p_i, p_j)$ of PoIs in a trajectory,
we estimate $\tau(p_i, p_j)$ by querying Google Maps for the walking time between the PoIs. Naturally, this is an
approximation since several variations may happen: the user having a car, using public transportation, taking a
taxi. However, our method is parametric with respect to these aspects, and the system can be easily adapted to consider the
different choices. Moreover, most PoIs in our sightseeing cities are actually within walking distance of each other.
3 http://www.flickr.com/services/api/flickr.photos.search.html
User-PoI Interest. Given a PoI p, its relevance vector $\vec{v}_p$, a user u, and the associated preference vector $\vec{v}_u$, we
define the User-PoI Interest function $\Gamma(p, u) : P \times U \to [0, 1]$ as
$$\Gamma(p, u) = \alpha \cdot \mathrm{sim}(\vec{v}_p, \vec{v}_u) + (1 - \alpha) \cdot \mathrm{pop}(p)$$
where $\mathrm{sim}(\vec{v}_p, \vec{v}_u) = \frac{\vec{v}_p \cdot \vec{v}_u}{\|\vec{v}_p\| \, \|\vec{v}_u\|}$ is the cosine similarity between the user preference and the PoI relevance
vectors, pop(p) is a function, ranging from 0 to 1, measuring the popularity of p, and α ∈ [0, 1] is a parameter
controlling the relative weight given to user preference and to PoI popularity.
TripBuilder addresses the problem of planning the visit to the city as a two-step process: TripCover and TrajSP.
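As a worked illustration of the User-PoI Interest function, the following Python sketch computes Γ for plain lists of numbers. The function names are hypothetical, and the vectors are assumed to have non-negative components so that the cosine similarity stays in [0, 1].

```python
from math import sqrt

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def user_poi_interest(v_p, v_u, popularity, alpha):
    """Gamma(p, u) = alpha * sim(v_p, v_u) + (1 - alpha) * pop(p)."""
    return alpha * cosine_similarity(v_p, v_u) + (1 - alpha) * popularity

# Hypothetical two-category example: the PoI and the user agree perfectly,
# the PoI has popularity 0.5, and both terms are weighted equally.
score = user_poi_interest([1, 0], [1, 0], popularity=0.5, alpha=0.5)
```

Setting α = 0 ignores the user profile and ranks PoIs by popularity alone, while α = 1 uses only the match between the user preference and PoI relevance vectors.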
TripCover. First, given the profile of the tourist and the amount of time available for the visit, we address
the problem of choosing the set of trajectories S∗ ⊆ S that best fits the tourist's interests and respects the given time
constraint [1]. The association between users and interests is represented by the User-PoI Interest function Γ, while
the time needed to visit PoI p is represented by the cost function ρ(p). The output is a set of trajectories maximizing
the user’s interest, that is, the trajectories crossing the PoIs most relevant to the user, subject to the given time budget.
The TripCover problem is an instance of the Generalized Maximum Coverage (GMC) problem, which is proven
to be NP-hard [4]. An efficient greedy approximation algorithm for the GMC problem is known that achieves
an approximation ratio of e/(e − 1) + ε, ∀ε > 0 [4]. We used this approximation algorithm (whose source code
was kindly provided to us by its authors) after slightly modifying it to take into account TripCover-specific
constraints [1].
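To give a feel for budgeted trajectory selection, here is a deliberately simplified greedy sketch in Python. It is not the GMC approximation algorithm of [4] and carries no approximation guarantee: it assumes each trajectory has a single positive cost and repeatedly picks the affordable trajectory with the best ratio of newly covered interest to cost. All names and numbers are hypothetical.

```python
def greedy_cover(trajectories, interest, cost, budget):
    """Greedy budgeted selection of trajectories (illustrative only).

    trajectories: dict name -> set of PoIs crossed by the trajectory
    interest:     dict PoI -> interest score (e.g. a Gamma value)
    cost:         dict name -> positive time cost of the trajectory
    budget:       total time available
    """
    chosen, covered, remaining = [], set(), budget
    candidates = set(trajectories)
    while True:
        best, best_ratio = None, 0.0
        for name in sorted(candidates):  # sorted for deterministic ties
            if cost[name] > remaining:
                continue
            # Only PoIs not yet covered contribute to the gain.
            gain = sum(interest[p] for p in trajectories[name] - covered)
            ratio = gain / cost[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            return chosen, covered
        chosen.append(best)
        covered |= trajectories[best]
        remaining -= cost[best]
        candidates.remove(best)

trajectories = {"t1": {"A", "B"}, "t2": {"B", "C"}, "t3": {"D"}}
interest = {"A": 0.9, "B": 0.5, "C": 0.4, "D": 0.3}
cost = {"t1": 2, "t2": 2, "t3": 1}
chosen, covered = greedy_cover(trajectories, interest, cost, budget=3)
```

In this toy instance t1 is picked first, t2 no longer fits in the remaining budget, and t3 fills the last hour.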
TrajSP. In a second step, the selected trajectories that best fit the tourist's interests and respect her time budget are
joined into a sightseeing itinerary by means of a heuristic algorithm based on local search operations (2-OPT and
3-OPT). We model this second problem, called TrajSP, as a particular instance of the Traveling Salesman Problem
(TSP). In [3], TrajSP is addressed with a local search heuristic that starts from a (given or random)
tour P̂ connecting all trajectories in S∗ and then applies local changes to P̂ by means of 2-OPT or 3-OPT
strategies [6].
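The 2-OPT move can be sketched in a few lines of Python. This is a generic textbook 2-opt on points in the plane with Euclidean distances, offered as an assumption-laden illustration: TrajSP connects whole trajectories rather than single points, and its actual cost model is the one described in [3].

```python
from math import dist

def tour_length(tour, points):
    """Length of the closed tour visiting points in the given order."""
    return sum(dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def two_opt(tour, points):
    """Reverse tour segments as long as doing so shortens the tour."""
    best = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                # 2-opt move: reverse the segment best[i..j].
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_length(candidate, points) < tour_length(best, points) - 1e-12:
                    best, improved = candidate, True
    return best

# Corners of a unit square, visited in an order that crosses itself.
points = [(0, 0), (1, 1), (1, 0), (0, 1)]
result = two_opt([0, 1, 2, 3], points)
```

Starting from the self-crossing tour, a single segment reversal removes the crossing and yields the perimeter of the square.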
3 The TripBuilder System
The architecture of TripBuilder for generating time-budgeted sightseeing tours involves four layers:
i) Stream Layer, ii) Batch Layer, iii) Distributed Data Storage, and iv) TripBuilder Engine. In the following,
we detail the functionality of each of them. Figure 1 presents the architecture of the TripBuilder system,
which enables the computation of personalized budgeted sightseeing tours.
Stream Layer. This layer is composed of two different modules that retrieve the relevant information from
Flickr and Wikipedia by receiving city bounding boxes as a stream. In particular, each item of the stream is used
by Photo Discovery to query Flickr to retrieve the metadata (user id, timestamp, tags, geographic coordinates,
etc.) of photo albums, i.e., sequences of photos taken in the given geographic area. This process thus collects a
large set of geo-tagged photo albums taken by different users in the given geographic area. The second module,
Wikipedia PoI Discovery, collects PoIs from Wikipedia. In particular, we assume each geo-referenced Wikipedia
named entity whose geographical coordinates fall into a given area to be a Point of Interest. For each PoI,
we retrieve its descriptive label, its geographic coordinates as reported in the Wikipedia page, and the set of
categories the PoI belongs to, which are reported at the bottom of the Wikipedia page. Then, photos from Flickr
and PoIs from Wikipedia are matched by spatial proximity according to their coordinates. Figure 2 highlights
the components of the stream layer.
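Matching photos to PoIs by spatial proximity can be sketched as follows in Python. The haversine formula is a standard great-circle distance; the 100 m matching radius, the function names and the sample coordinates are assumptions for illustration, not values taken from TripBuilder.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def nearest_poi(photo, pois, max_distance_m=100.0):
    """Name of the closest PoI within max_distance_m of the photo, else None.

    photo: (lat, lon); pois: dict name -> (lat, lon).
    The 100 m default radius is an assumed value.
    """
    best_name, best_d = None, max_distance_m
    for name, (lat, lon) in pois.items():
        d = haversine_m(photo[0], photo[1], lat, lon)
        if d <= best_d:
            best_name, best_d = name, d
    return best_name

# Approximate coordinates, for illustration only.
pois = {"Colosseum": (41.8902, 12.4922), "Trevi Fountain": (41.9009, 12.4833)}
match = nearest_poi((41.8903, 12.4923), pois)
no_match = nearest_poi((41.95, 12.60), pois)
```

A linear scan is enough for a sketch; at city scale a spatial index would normally replace the loop over all PoIs.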
The stream layer is built by means of Apache Storm4, a free and open source distributed real-time computation system. Apache Storm makes it possible to reliably process unbounded streams of data. Storm organizes the
computation in a graph, called a “topology”, where data flows through nodes, called “bolts”. Our stream layer
is thus able to crawl Flickr and Wikipedia in real time by receiving from an input queue a geographic bounding box representing the target geographic area. The results of the real-time computation are
stored on a distributed data storage.
4 https://storm.apache.org/
Figure 1: Architecture of TripBuilder. We outline the four layers of the system, i.e., Stream Layer, Batch
Layer, Distributed Data Storage and TripBuilder Engine. (Diagram labels: City, Budget, Categories, Personalization; TripBuilder; Tour; Trajectories Creation; PoI Visiting Time Estimation; Users' Photos; Wikipedia PoI Discovery; Photo Discovery; City Bounding Box; HDFS.)
Figure 2: Layers of the TripBuilder architecture: the Stream layer processes incoming data; the Batch layer is responsible for processing and transforming data for the TripBuilder Engine. (Diagram labels: Streams; City; Wikipedia PoI Discovery; Photo Discovery; Users' Photos; Trajectories Creation; Trajectory Split Estimation; PoI Visiting Time Estimation; HDFS.)
Batch Layer. This layer is made up of different components, each one manipulating the data previously collected. It is in charge of cleaning and transforming the data by means of distributed computing frameworks such as
Apache Hadoop5 and Spark6, which speed up the data processing step. In particular, the modules here transform
sequences of photos from Flickr into sequences of visited Wikipedia PoIs, i.e., trajectories, to be used in the TripBuilder Engine. Moreover, this step is in charge of computing popularity and other important characteristics
of PoIs by considering metadata and information extracted from both Flickr and Wikipedia. The data obtained
(see Figure 3) are then stored by means of a distributed data storage layer. This design makes TripBuilder
flexible: different sources of information for trajectories and PoIs can be
easily integrated into the system by modifying only the two lowest layers. Moreover, the approach
scales to large geographic areas, as the two layers effectively exploit modern state-of-the-art technologies for
distributed and parallel computation.
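The transformation from a photo sequence to a sequence of visited PoIs can be sketched in Python as follows. This is a minimal illustration under assumptions: photos are already matched to PoIs, timestamps are minutes since midnight, and consecutive photos at the same PoI are collapsed into one visit spanning the first and last photo. The sample data mirrors the example of Figure 3; the timestamp of the middle Colosseum photo is invented.

```python
def photos_to_visits(photos):
    """Collapse a time-ordered photo sequence into PoI visits.

    photos: list of (timestamp, poi) sorted by timestamp; poi is None when
            the photo could not be matched to any PoI.
    Returns a list of (poi, t_first, t_last) tuples.
    """
    visits = []
    for timestamp, poi in photos:
        if poi is None:
            continue  # unmatched photos are ignored in this sketch
        if visits and visits[-1][0] == poi:
            # Same PoI as the previous photo: extend the current visit.
            visits[-1] = (poi, visits[-1][1], timestamp)
        else:
            visits.append((poi, timestamp, timestamp))
    return visits

photos = [
    (540, "Colosseum"), (600, "Colosseum"), (720, "Colosseum"),
    (810, "Ruins"), (900, "Ruins"),
    (942, "Trevi Fountain"), (960, "Trevi Fountain"),
]
visits = photos_to_visits(photos)
# Observed visiting time per visit, in minutes (last photo minus first photo).
durations = {poi: t_last - t_first for poi, t_first, t_last in visits}
```

The span between the first and last photo gives one observation of the time a user spent at a PoI; how such observations are aggregated into ρ(p) is not specified by this sketch.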
Distributed Data Storage. This component is responsible for storing, querying and indexing trajectory and
PoI data. It is composed of a database management system and a distributed filesystem; together they efficiently
provide information to the TripBuilder Engine component and distributed storage to support the Stream and
Batch layers. The database component contains a well-defined schema to enable flexibility in integrating other
data sources. Geo-spatial indexes are used for searching spatial objects, such as PoIs and tourist traces, within
a given region (e.g., a polygon). The system also takes advantage of indexes over PoI categories and tourist
traces, both represented as arrays, to efficiently retrieve the PoIs relevant to the user preferences. Moreover, the
distributed filesystem is built by using the Apache Hadoop Distributed Filesystem (HDFS). We chose
HDFS because it is a mature solution for storing data in distributed environments. As an example,
it provides effective and efficient mechanisms to deal with faults, thus preventing data loss in case of
hardware problems.
5 http://hadoop.apache.org
6 http://spark.apache.org
Figure 3: Application of the Stream and the Batch Layer to raw data from Flickr and Wikipedia. The result of
these two steps is a set of crowdsensed trajectories describing past behavior of tourists in a geographic area.
(Example trajectory shown in the figure: Colosseum, 3 photos, 01/07/2013 9:00-12:00; Ruins, 2 photos, 01/07/2013 13:30-15:00; Trevi Fountain, 2 photos, 01/07/2013 15:42-16:00; ...)
TripBuilder Engine. This is the core of the architecture, responsible for computing personalized budgeted
sightseeing tours. Given as input a set of trajectories crossing a set of PoIs, a time budget, the user preferences and the
personalization factor used to tune the level of personalization, it generates the personalized sightseeing
tour.
We evaluated TripBuilder on data collected for three Italian cities that differ in size and in the amount of
user-generated content available for download, namely Pisa, Florence, and Rome [3]. We collected
crowdsourced data from Flickr, Wikipedia and Google Maps. We evaluated our framework by considering the
performance on both the TripCover and the TrajSP problems [3]. In selecting a set of trajectories of interest
for a given user, TripBuilder shows remarkable improvements over two competitive
baselines in terms of all the metrics adopted for assessment. Our solution suggests itineraries that better match
user preferences. Moreover, such itineraries present higher visiting time and, consequently, lower inter-PoI
movement time than the baselines. Furthermore, we showed that our TSP-based local search heuristic for scheduling a set of trajectories into the user agenda outperforms two baselines. Finally, the tests conducted to assess the
efficiency of TripBuilder show that it computes a four-day personalized sightseeing tour of Rome in about
3 seconds, thus confirming that our approach can be fruitfully deployed in online applications.
Figure 4: A screenshot of the Web interface that lets users interact with TripBuilder. The targeted city is
initially selected (a). The drop-down menu on the left lets users specify their preferences, the time budget and the
personalization factor (b). On the right, a summary of the tour is proposed. Each PoI in the summary comes
with a photo and a set of useful information (e.g., visiting time and categories).
4 Crowdsensed Trajectories for Tourism: Which are the Challenges?
In this paper we discussed how semantically enriched trajectories derived from user-generated content from web
services like Flickr and Wikipedia offer a solid background for planning personalized sightseeing tours. However,
we believe that this approach opens the way to many research challenges on recommendation tasks based on crowd-generated content. Here we highlight some ideas that we believe could be the next steps for sightseeing tour
recommendation and planning.
• Linked data. The increasing availability of Linked Open Data (LOD) sources has brought new opportunities to integrate different data sources to semantically enrich trajectories and points of interest. LOD
may fill the information gaps we typically experience when managing trajectories from user-generated data,
and this additional information should lead to better recommendations, combined with the capability of
explaining the recommendation itself.
• Smart Cities. Nowadays, we see the opportunity of integrating crowd-generated trajectories with
smart city environments, such as the physical sensors used for collecting data representing several aspects of the city (pollution, traffic, etc.). As in the LOD case, this integration has
great potential to enrich the recommendations offered to tourists.
• Real-time Services. It is crucial to keep the tourist tour up to date with the most recent events in the city,
like special discounts for museums, restaurants, events, etc. How to deal with this, and how to collect the required amount
of users’ data to provide real-time support to tourists, will be an important challenge.
• Group Recommendation. The fact that people usually do not travel alone highlights the importance of
recommending tours for groups of people instead of single individuals. The task here is to balance the
recommendation to satisfy the distinct preferences inherent to each user in the group. This may be a
hard task since other issues may come up, such as the influence among people and the distinct roles of people inside the
group, like the leader.
• Hierarchical Sightseeing Tours. So far we have considered the recommendation of sightseeing tours for
a single city. If the tourist is willing to travel across different cities, she would need to use the system
to generate the tour for each city separately. To overcome this limitation, a hierarchical
approach could be designed to help users travel across many cities.
• Time-aware PoIs and Trajectories. Another important challenge is related to the temporal dimension
of PoIs and trajectories. Some PoIs and trajectories might be relevant only in a given period of the day:
most people visit beaches during the day, and some monuments (e.g., the Colosseum or the Eiffel Tower) have a
different appearance in daylight and at night. Therefore, considering the temporal importance and
relevance of the PoIs and trajectories may lead to better personalized tours for the users.
• Tour-based Hotel Selection. When scheduling a trip, tourists choose the city, pick a hotel and
write down the PoIs to visit. Although the PoI tour can be generated by TripBuilder, support for choosing
the hotel is still missing. The selection of the hotel is usually driven by traditional constraints
like price, ratings, etc., but the planned sightseeing tour could be used to favor the hotel
that minimizes the distance to the planned tour attractions.
• Personalized Visiting Time. TripBuilder uses the crowd-generated content to infer an approximate
visiting time for each PoI, and this information is used for all tourists. However, tourists usually have
preferences and they might wish to spend more time at some preferred places compared to other less
interesting attractions. Personalizing the visiting time would let tourists split their time budget among specific attractions based on their preferences.
Acknowledgments
This work was partially supported by EU FP7 Marie Curie project SEEK (no. 295179), CNPQ Scholarship (no.
306806/2012-6), CNPQ Casadinho / PROCAD Project (no. 552578/2011-8), CAPES Scholarship.
References
[1] I. R. Brilhante, J. A. Macedo, F. M. Nardini, R. Perego, and C. Renso. Where shall we go today?: Planning
touristic tours with TripBuilder. In Proceedings of the 22nd ACM International Conference on
Information and Knowledge Management, CIKM 2013, pages 757–762, New York, NY, USA, 2013. ACM.
[2] I. R. Brilhante, J. A. Macedo, F. M. Nardini, R. Perego, and C. Renso. TripBuilder: A tool for recommending
sightseeing tours. In M. de Rijke, T. Kenter, A. P. de Vries, C. X. Zhai, F. de Jong, K. Radinsky, and K. Hofmann,
editors, Advances in Information Retrieval, volume 8416 of Lecture Notes in Computer Science, pages
771–774. Springer International Publishing, 2014.
[3] I. R. Brilhante, J. A. Macedo, F. M. Nardini, R. Perego, and C. Renso. On planning sightseeing tours with
TripBuilder. Information Processing & Management, 51(2):1–15, 2015.
[4] R. Cohen and L. Katzir. The generalized maximum coverage problem. Information Processing Letters,
108(1):15–22, 2008.
[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large
spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pages 226–231. AAAI Press, 1996.
[6] R. Matai, S. Singh, and M. L. Mittal. Traveling salesman problem: an overview of applications, formulations, and solution approaches. In Traveling Salesman
Problem, Theory and Applications. InTech, 2010.
Section 2: Event Reports
Highlights from ACM SIGSPATIAL China Chapter in 2014
Guangzhong Sun1, Yang Yue2, Xing Xie3
1 University of Science and Technology of China, Hefei, China
2 Shenzhen University, Shenzhen, China
3 Microsoft Research, Beijing, China
Abstract
In order to promote ACM SIGSPATIAL and the corresponding research areas in China, and to encourage collaboration between SIGSPATIAL researchers in China and researchers worldwide, the ACM SIGSPATIAL
China Chapter was established in October 2009 with the strong support of the SIGSPATIAL executive committee. This article describes the current status of the ACM SIGSPATIAL China Chapter and some of its activities in
2014.
1 Current status of ACM SIGSPATIAL China
ACM SIGSPATIAL China has gathered 35 professional members since the chapter was formed. They come
from Chinese universities and research institutions such as Shenzhen University, Renmin University, the Chinese Academy of Sciences, and the
University of Science and Technology of China, and from industry labs such as Microsoft Research Asia.
The current chapter officers are as follows:
• Honorary chair: Prof. Qingquan Li (Shenzhen University) and Prof. Xiaofeng Meng (Renmin University).
• Chair: Dr. Xing Xie (Microsoft Research Asia).
• Vice chair: Dr. Feng Lu (Chinese Academy of Sciences) and Dr. Zhiming Ding (Beijing University of
Technology).
• Secretary: Dr. Yang Yue (Shenzhen University) and Dr. Guangzhong Sun (University of Science and
Technology of China).
More information about the chapter and its activity reports can be found on the ACM SIGSPATIAL China website
(http://sigspatial.ustc.edu.cn), which is in Chinese.
2 Activities in 2014
2.1 Workshop on Spatial Big Data Mining and Visualization (SBDMV 2014)
Existing concepts, theories, and methods for spatial big data face many challenges, for instance in spatial
big data storage, processing, analysis, mining and visualization. The value of spatial big data cannot be fully
realized without data mining. It has also been recognized that visualization is an effective means not only of
presenting essential information in vast amounts of data but also of driving big data analytics.
With ISPRS WG II/7, ACM SIGSPATIAL China held a workshop on Spatial Big Data Mining and Visualization (SBDMV 2014), in conjunction with ICDM 2014, on December 14, 2014. This half-day workshop aimed
to serve as a platform to discuss the recent trends in spatial big data mining and visualization, for the purpose
of intelligent spatial decision support. More than 40 people joined this workshop, coming from the Chinese
Academy of Sciences, Wuhan University, Shenzhen University, Peking University, the University of Science and
Technology of China, the University of Tennessee, the University of Melbourne, Microsoft Research Asia and Tencent
Research. Figure 1 shows two photos of this workshop.
Figure 1: Workshop on Spatial Big Data Mining and Visualization (SBDMV 2014) in Shenzhen: (a) in the meeting room; (b) with invited guests.
More information about this workshop can be found at its website (http://spatial.szu.edu.cn/SBDMV2014.htm).
2.2 CCCF Special issue on user understanding from big data
Communication of China Computer Federation (CCCF) is one of the most popular computer magazines in China.
This Chinese-language magazine has more than 10,000 subscribers and a great impact on the Chinese computer research
community.
With the Technical Committee on Pervasive Computing of the China Computer Federation (CCF), ACM
SIGSPATIAL China organized a special issue on user understanding from big data for CCCF in May 2014.
Invited professors and researchers from Tsinghua University, Nankai University, Zhejiang University, the Chinese
Academy of Sciences and Microsoft Research Asia wrote five articles giving a broad and in-depth introduction
to big data and user understanding. Some of the latest progress was also presented in these articles.
More information about this special issue can be found at this page (http://www.ccf.org.cn/sites/ccf/jsjtbbd.jsp?contentId=2799676848706), which is in Chinese.
LBSN 2014 Workshop Report
The Seventh ACM SIGSPATIAL International Workshop on Location-Based Social Networks
Dallas, Texas, USA - November 4, 2014
Alexei Pozdnoukhov1, Sen Xu2
1 UC Berkeley, USA
2 Twitter Inc., USA
[email protected], [email protected]
(Workshop Co-chairs)
The ACM SIGSPATIAL Workshop on Location-Based Social Networks 2014
(http://faculty.ce.berkeley.edu/pozdnukhov/lbsn14/index.html) was held in conjunction with the 22nd ACM
SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL 2014)
on November 4, 2014 in Dallas, Texas, USA. The objective of this workshop was to provide professionals,
researchers, and technologists with a single forum where they can discuss and share the state of the art of
LBSN development and applications, present their ideas and contributions, and set future directions in emerging
innovative research for location-based social networks. This year's program was composed of three sessions
covering all aspects of LBSNs, with opening invited talks given by Dr. A. Haro (HERE/Nokia), Prof. M.
Duckham (University of Melbourne), and Dr. Sen Xu (Twitter).
Best Paper and Best Student Paper nominations were made based on the reviews and evaluations received
from the Program Committee. Each award carried a prize sponsored by Twitter Inc.
• Best Paper. Sophy: a Morphological Framework for Structuring Geo-referenced Social Media, by
Kyoung-Sook Kim, Hirotaka Ogawa, Akihito Nakamura and Isao Kojima.
• Best Student Paper. Moving on Twitter: Using Episodic Hotspot and Drift Analysis to Detect and Characterise Spatial Trajectories, by Hansi Senaratne, Arne Broering, Tobias Schreck and Dominic Lehle.
Social networks have become prevalent on the Internet. The data produced by their users has ignited multiple
research topics, attracting many professionals from a variety of fields. The advances in location-acquisition and
mobile communication technologies empower people to use location data with existing online social networks.
As location is one of the most important components of user context, extensive knowledge about an individual's
interests, behaviors, and relationships with others can be learned from locations. Furthermore, people expand
their social connections with new inter-dependencies conditioned on their locations and mobility habits,
shaping the social and spatial structure of the cities of the future. These topics will remain key in fundamental
academic research and will continue attracting significant interest from industry.
We would like to thank the authors for publishing and presenting their papers at LBSN 2014, and the program committee members and external reviewers for their professional evaluation and help in the paper review
process. We hope that this work will inspire new research and make an impact on commercial applications in the
exciting area of Location-Based Social Networks.
IWGS 2014 Workshop Report
The 5th ACM SIGSPATIAL International Workshop on GeoStreaming
Dallas, Texas, USA - November 4, 2014
Chengyang Zhang1, Anas Basalamah2, Abdeltawab Hendawi3
1 Teradata Inc.
2 Umm Al Qura University
3 University of Washington Tacoma
[email protected], [email protected], [email protected]
(Workshop Co-chairs)
The ACM SIGSPATIAL International Workshop on Geostreaming (IWGS) was held for the fifth time in
conjunction with the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACMGIS 2014). The workshop was a successful event that attracted participants from both
academia and industry. The workshop addressed topics at the intersection of data streaming and geospatial systems, and it fostered an environment where geospatial researchers can benefit from the advances
in geosensing technologies and data streaming systems.
We are entering the era of big data thanks to the exponential growth and availability of structured and
unstructured data, a large share of which is real-time streaming data emitted from sensors, imagery and
mobile devices. In addition to the temporal nature of stream data, various sources provide stream data that has
geographical locations and/or spatial extents, such as geotagged Twitter streams, mobile GPS location streams,
spatio-temporal image streams, and so on. On one hand, this amount of streamed data has been a major driver
of advances in the state of the art in geographic information systems. On the other hand, the difficulty of processing, mining,
and analyzing that massive amount of data in a timely manner has prevented researchers from making full use of the
incoming stream data. The term geostreaming refers to the ongoing effort in academia and industry to process,
mine and analyze stream data with geographic and spatial information.
This workshop addresses the research communities in both stream processing and geographic information
systems. It brings together experts in the field from academia, industry and research labs to discuss the lessons
they have learned over the years, to demonstrate what they have achieved so far, and to plan for the future of
geostreaming.
The workshop featured a keynote by Dennis Luxen from Mapbox Inc., providing an introduction to the
technical infrastructure needed to scale globally, the concepts of processing streams of data to provide real-time updates
to custom-styled online maps, as well as non-trivial processing methods that provide even more sophisticated
products like a world-wide satellite mosaic that is seamless and also cloudless. On one side, the keynote provided
a good opportunity for researchers to better understand the business scenarios. On the other side, the workshop
was useful for the industry representative to learn about the ambitions and directions of researchers in order to
better shape the future of geostreaming.
The call for papers resulted in 16 submissions of research papers. A program committee of 9 members
reviewed the submissions and, as a result, the 12 highest-quality papers were accepted. On average, over 20 attendees
were present at every session of the workshop. The topics presented in the workshop include, but are not limited
to, Geostream Query Processing, Geostream Theory and Applications in Transportation, Streaming Trajectories
and Moving Regions, and Geostreaming Systems.
The First ACM SIGSPATIAL PhD Symposium 2014
Ugur Demiryurek1, Mohamed Sarwat2
1 Department of Computer Science, University of Southern California, USA
2 School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, USA
[email protected], [email protected]
1 Summary
The ACM SIGSPATIAL Ph.D. Symposium is a forum where Ph.D. students present, discuss, and receive feedback on their research in a constructive atmosphere. The symposium is attended by professors, researchers
and practitioners in the ACM SIGSPATIAL community, who participate actively and contribute to the discussions. The symposium was co-located with ACM SIGSPATIAL GIS 2014. The ACM SIGSPATIAL 2014 PhD
Symposium provided an opportunity for doctoral students to explore and develop their research interests in the
broad areas addressed by the ACM SIGSPATIAL community. We invited PhD students to submit a summary of
their dissertation work and to share it with students in a similar situation as well as with senior researchers in
the field. There were two tracks for submission. The Junior PhD Track was for students in the early stages of
their doctoral studies. Submissions to this track were asked to provide a clear problem definition, explain why it is important,
survey related work, and summarize the new solutions being pursued. The Senior PhD Track was for students
close to completion (expected to graduate by 2014/2015). Submissions to this track focused on describing the
contributions made in the doctoral dissertation. The strongest candidates are those who have a clear topic
and research approach, and have made some progress, but who are not so far along that they can no longer make
changes.
2 Program
This year, we accepted five papers to the PhD Symposium. The list of papers is as follows:
• Partitions to Improve Spatial Reasoning (Author: Matthew P. Dube – Supervised by: Max J. Egenhofer)
• Novel Clustering and Analysis Techniques for Mining Spatio-temporal Data (Author: Yongli Zhang –
Supervised by: Christoph F. Eick)
• Spatial Sensor Data Processing and Analysis for Mobile Media Applications (Author: Guanfeng Wang –
Supervised by: Roger Zimmermann)
• Towards Resource Route Queries with Reappearance (Author: Gregor Jossé – Supervised by: Matthias
Schubert)
• SimMatching - Adaptable Road Network Matching for Efficient and Scalable Spatial Data Integration
(Author: Michael Schäfers – Supervised by: Udo W. Lipeck)
Authors had the chance to present their papers and get feedback on their dissertation topics from experienced
researchers (from both academia and industry) in Geographic Information Systems and Spatial Data Analytics.
3 Keynote
The PhD symposium featured a keynote speech by Professor Ouri Wolfson (Richard and Loan Hill Professor
of Computer Science at the University of Illinois at Chicago) on “What to Research in Spatial Information and
How to Do So”. The talk was divided into two parts: the What and the How. The What part described the research
directions that, from Dr. Wolfson’s perspective, are most promising in the area of spatial information. These
involve abstractions of dynamic data about space and time to guide users in conducting everyday activities. In the
How part, Dr. Wolfson described the character and attitude traits that he views as essential to conducting world-class
research. These involve problem selection, collaboration and inspiration.
join today!
SIGSPATIAL & ACM
www.sigspatial.org
www.acm.org
The ACM Special Interest Group on Spatial Information (SIGSPATIAL) addresses issues related to the acquisition, management, and processing
of spatially-related information with a focus on algorithmic, geometric, and visual considerations. The scope includes, but is not limited to, geographic information systems (GIS).
The Association for Computing Machinery (ACM) is an educational and scientific computing society which works to advance computing as a
science and a profession. Benefits include subscriptions to Communications of the ACM, MemberNet, TechNews and CareerNews, full and unlimited
access to online courses and books, discounts on conferences and the option to subscribe to the ACM Digital Library.
❑ SIGSPATIAL (ACM Member): $15
❑ SIGSPATIAL (ACM Student Member & Non-ACM Student Member): $6
❑ SIGSPATIAL (Non-ACM Member): $15
❑ ACM Professional Membership ($99) & SIGSPATIAL ($15): $114
❑ ACM Professional Membership ($99) & SIGSPATIAL ($15) & ACM Digital Library ($99): $213
❑ ACM Student Membership ($19) & SIGSPATIAL ($6): $25