
The SIGSPATIAL Special

Newsletter of the Association for Computing Machinery Special Interest Group on Spatial Information

Volume 7, Number 1, March 2015

The SIGSPATIAL Special is the newsletter of the Association for Computing Machinery (ACM) Special Interest Group on Spatial Information (SIGSPATIAL). ACM SIGSPATIAL addresses issues related to the acquisition, management, and processing of spatially related information with a focus on algorithmic, geometric, and visual considerations. The scope includes, but is not limited to, geographic information systems.

Current Elected ACM SIGSPATIAL officers are:
• Chair, Mohamed Mokbel, University of Minnesota
• Past Chair, Walid G. Aref, Purdue University
• Vice-Chair, Shawn Newsam, University of California at Merced
• Secretary, Roger Zimmermann, National University of Singapore
• Treasurer, Egemen Tanin, University of Melbourne

Current Appointed ACM SIGSPATIAL officers are:
• Newsletter Editor, Chi-Yin Chow (Ted), City University of Hong Kong
• Webmaster, Ibrahim Sabek, University of Minnesota

For more details and membership information for ACM SIGSPATIAL, as well as for accessing the newsletters, please visit http://www.sigspatial.org.

The SIGSPATIAL Special serves the community by publishing short contributions such as SIGSPATIAL conferences’ highlights, calls and announcements for conferences and journals that are of interest to the community, as well as short technical notes on current topics. The newsletter has three issues every year, i.e., March, July, and November. For more detailed information regarding the newsletter or suggestions, please contact the editor via email at [email protected].
Notice to contributing authors to The SIGSPATIAL Special: By submitting your article for distribution in this publication, you hereby grant to ACM the following non-exclusive, perpetual, worldwide rights:
• to publish in print on condition of acceptance by the editor,
• to digitize and post your article in the electronic version of this publication,
• to include the article in the ACM Digital Library,
• to allow users to copy and distribute the article for noncommercial, educational or research purposes.
However, as a contributing author, you retain copyright to your article and ACM will make every effort to refer requests for commercial use directly to you.

Notice to the readers: Opinions expressed in articles and letters are those of the author(s) and do not necessarily express the opinions of the ACM, SIGSPATIAL or the newsletter.

The SIGSPATIAL Special (ISSN 1946-7729) Volume 7, Number 1, March 2015.

Table of Contents

Message from the Editor
  Chi-Yin Chow

Section 1: Special Issue on Semantic and Symbolic Trajectories
• Introduction to this Special Issue: Semantic and Symbolic Trajectories
  Maria Luisa Damiani and Chiara Renso
• The Low Hanging Fruit is Gone – Achievements and Challenges of Computational Movement Analysis
  Patrick Laube
• Semantic Enrichment and Analysis of Movement Data: Probably it is just Starting!
  Renato Fileto, Vania Bogorny, Cleto May, and Douglas Klein
• Hermoupolis: A Semantic Trajectory Generator in the Data Science era
  Nikos Pelekis, Stylianos Sideridis, Panagiotis Tampakis, and Yannis Theodoridis
• Constructing Semantic Interpretation of Routine and Anomalous Mobility Behaviors from Big Data
  Georg Fuchs, Hendrik Stange, Dirk Hecker, Natalia Andrienko, and Gennady Andrienko
• An Integrated Qualitative and Boundary-based Formal Model for a Semantic Representation of Trajectories
  Jing Wu, Christophe Claramunt, and Min Deng
• Trajectory Similarity Measures
  Kevin Toohey and Matt Duckham
• Symbolic Trajectories and Application Challenges
  Maria Luisa Damiani, Hamza Issa, Ralf Hartmut Güting, and Fabio Valdes
• Planning Sightseeing Tours Using Crowdsensed Trajectories
  Igo Brilhante, Jose Antonio Macedo, Franco Maria Nardini, Raffaele Perego, and Chiara Renso

Section 2: Event Reports
• Highlights from ACM SIGSPATIAL China Chapter in 2014
  Guangzhong Sun, Yang Yue, and Xing Xie
• ACM SIGSPATIAL LBSN 2014 Workshop Report
  Alexei Pozdnoukhov and Sen Xu
• ACM SIGSPATIAL IWGS 2014 Workshop Report
  Chengyang Zhang, Anas Basalamah, and Abdeltawab Hendawi
• ACM SIGSPATIAL PhD Symposium 2014 Report
  Ugur Demiryurek and Mohamed Sarwat

Message from the Editor

Chi-Yin Chow
Department of Computer Science, City University of Hong Kong, Hong Kong
Email: [email protected]

The first section of this issue is a special issue on a topic of interest to the SIGSPATIAL community. The topic of this issue is “Semantic and Symbolic Trajectories”, and it is edited by our associate editors: Prof. Maria Luisa Damiani and Dr. Chiara Renso. Prof.
Damiani is currently a professor in the Department of Computer Science, University of Milan, Italy, and Dr. Renso is currently a researcher at ISTI-CNR, Italy.

The second section consists of four event reports:
1. Highlights from ACM SIGSPATIAL China Chapter in 2014
2. The 7th ACM SIGSPATIAL International Workshop on Location-Based Social Networks (ACM SIGSPATIAL LBSN 2014)
3. The 5th ACM SIGSPATIAL International Workshop on GeoStreaming (ACM SIGSPATIAL IWGS 2014)
4. The 1st ACM SIGSPATIAL PhD Symposium 2014

I would like to sincerely thank all the newsletter authors, Prof. Damiani, Dr. Renso, and event organizers for their generous contributions of time and effort that made this issue possible. I hope that you will find the newsletter interesting and informative and that you will enjoy this issue.

You can download all Special issues from: http://www.sigspatial.org/sigspatial-special .

Section 1: Special Issue on Semantic and Symbolic Trajectories

Introduction to this Special Issue: Semantic and Symbolic Trajectories

Maria Luisa Damiani (1), Chiara Renso (2)
(1) Department of Computer Science, University of Milan, Italy
(2) ISTI-CNR, Italy

Nowadays, the digital traces of moving objects, such as people, animals, and goods, increasingly contain richer information than the mere location history. Traces can report, for example, the places visited by tourists during a journey (e.g., hotels, entertainment spots), the activities performed during such a journey (e.g., shopping, driving), and the people encountered during the trip. This supplementary information is often referred to as semantic annotation (of trajectories). Semantic annotations can be directly acquired from sensors or, alternatively, be the result of some analytic process or be directly provided by users.
In recent times, following a few pioneering research projects, especially in Europe and Asia (see footnote 1), the processing of large amounts of semantics-enriched trajectories has become an exciting, cross-disciplinary research area, spanning spatial computing, data analytics and geographical information science. Semantics-enriched trajectories have important applications in a variety of domains, such as urban computing, social media analysis and animal ecology, especially in connection with the analysis of moving objects’ behavior, both at the individual and the collective level.

This special issue consists of eight contributions related in different ways to semantics-enriched trajectories. The first contribution is a position paper by Patrick Laube presenting a fresh viewpoint on achievements and open challenges in the area of movement analysis, especially with respect to the problem of semantically annotating patterns extracted from spatial trajectories.

The next three papers are closely related to the notion of semantic trajectories: Renato Fileto et al. present a conceptual vision of the semantic enrichment process, focusing especially on the challenging opportunities offered by Linked Open Data; Nikos Pelekis et al. present Hermoupolis, a valuable tool for the synthetic generation of semantic trajectories, very often a necessary step to cope with the lack of access to such data; Georg Fuchs et al. focus on the visual aspects of semantic trajectory analysis, presenting visual analytics methods that can be used to interpret routine and anomalous patterns of human mobility.

The next group of two papers focuses more on computational aspects. In particular, Jing Wu et al. present an approach to the definition of predicates modeling qualitative relationships between geometric trajectories and regions, while Kevin Toohey and Matt Duckham introduce and compare four of the most common measures of trajectory similarity.
The last group of two papers presents a few application scenarios for different types of semantics-enriched trajectories. In particular, Maria Luisa Damiani et al. focus on possible applications of the symbolic trajectories data model, a novel trajectory data model for the representation and pattern-based querying of timestamped label sequences, currently implemented in the Secondo database; Igo Brilhante et al. present a comprehensive application showing how semantics-enriched trajectories built on crowd-sensed tourism data extracted from social media provide valuable information for the definition of tourist behavior models and itinerary recommendation.

We hope that the readers will enjoy reading this issue and appreciate the multiplicity of views it offers.

Footnote 1: In particular, the project GeoPKDD (2006-2008) funded by the European Community and the Microsoft GeoLife project.

The Low Hanging Fruit is Gone – Achievements and Challenges of Computational Movement Analysis

Patrick Laube (1)
(1) Institute of Natural Resource Sciences, Zurich University of Applied Sciences, Switzerland, [email protected]

Abstract

This position paper reviews the achievements and open challenges of movement analysis within Geographical Information Science. The paper argues that the simple problems of movement analysis have mostly been addressed to a sufficient level (“the low hanging fruit”), leaving the research community with the much more challenging problems for the years ahead (“the high hanging fruit”). Whereas the community has made good progress in structuring trajectory data (segmentation, similarity, clustering) and conceptualizing and detecting movement patterns, the much harder task of semantic annotation of structures and patterns remains difficult. The position paper summarizes both achievements and challenges with two sets of assertions and calls for the establishment of a unifying theory of Computational Movement Analysis.
1 Introduction

Movement analysis has become a constant topic on the programs of all important meetings and conferences in Geographical Information Science (GIScience) and spatial computing. After more than a decade of enthusiasm and rapid progress, I currently see a certain stagnation in the field. At meetings and conferences, and when reviewing papers, I get the impression that the discussion often revolves around similar topics. In this position paper I argue that the community has now reached a stage where most simple problems have been addressed, and perhaps even solved. What remains on the table are the hard problems. For that reason I propose consolidating what has been achieved so far in a unifying theory of Computational Movement Analysis (CMA) and facing the hard problems, allowing the discipline to evolve. I have structured my position paper into achievements (“the low hanging fruit”) and grand challenges for today and the years to come (“the higher hanging fruit”).

2 The Low Hanging Fruit

2.1 Data Gold-Rush

The emergence of movement research within GIScience was very much technology-driven or, even more so, data-driven. With the rapid improvement of GPS-based tracking technology – receivers getting much smaller and batteries lasting much longer – a sudden overabundance of movement data triggered a gold-rush-like enthusiasm amongst theory and application researchers. This shift from a data-poor to a data-rich problem is perhaps best illustrated by biologists’ efforts to track turtles. What started off as a thread trailing exercise [10] has in the last two decades embraced GPS technology [20]. There was a time when it almost appeared that “just putting movement data on a map” got you into Nature [11]. Two facts underline that this statement is not meant in a derogatory way, but rather the opposite. First, it was Graeme Hays himself who made that comment at a workshop [22], and his outstanding publication record exposes the understatement of his quote.
Second, looking ahead to the remaining higher hanging fruit and the community’s challenges, it’s worth noting at this stage that it was biologists publishing in Nature, not GIS or computer science researchers.

I argue that movement analysis emerged as a hot topic in recent years mainly because the seemingly simple task of analyzing a set of points allowed GIScience, for the first time, to truly overcome the legacy of static cartography. Movement data offered quasi-continuous sampling instead of sporadic snapshots. Everybody had a colleague in biology, ecology or transportation research, and getting their data into a GIS was interesting and relatively straightforward. Researchers with various interests in GIS and spatial computing – from cartography to generalization, from time geography to 3D, or from spatial databases to data mining – got excited about all that available movement data and fired up their engines, very much to the benefit of the emerging discipline.

2.2 Simple Solutions for Simple Problems

My second low hanging fruit is simplicity. In the beginning, the field of movement analysis was so wide open, and there were so many open and interesting problems, that one could just pick the simple ones, and then even get away with very simple solutions to the simple problems one had framed in the first place. The relative movement framework (REMO), first presented at GIScience 2002, may serve as an illustration of the simplicity of some initial concepts [15]. The REMO framework is essentially based on assembling a matrix that temporally aligns color-coded sequences of movement parameters such as speed or movement azimuth, hoping for the emergence of interesting and unexpected patterns. Considering my own early biography, it is only fair to say that any resemblance to colorful Lego bricks is not coincidental at all. However, just like Lego models, the REMO framework is a crude, edgy and inflexible model defying the true complexity of the real world.
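The matrix idea described above can be illustrated with a minimal sketch. It is not the original REMO implementation: the function names, the choice of eight compass sectors for the azimuth, and the column-wise grouping helper are assumptions made for illustration only. Rows are entities, columns are time steps, and each cell holds a discretized movement azimuth (the "color").

```python
import math


def azimuth_class(p, q, n_classes=8):
    """Classify the heading from fix p to fix q into one of n_classes sectors.

    Class 0 is centered on north; classes increase clockwise.
    (Hypothetical helper; the sector count is an assumption.)
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    azimuth = math.degrees(math.atan2(dx, dy)) % 360.0  # 0 = north, clockwise
    width = 360.0 / n_classes
    return int(((azimuth + width / 2) % 360.0) // width)


def remo_matrix(tracks, n_classes=8):
    """Build a REMO-style matrix: entity -> list of azimuth classes per step.

    tracks maps an entity name to its list of (x, y) fixes, all sampled
    at the same time steps.
    """
    return {
        name: [azimuth_class(a, b, n_classes) for a, b in zip(fixes, fixes[1:])]
        for name, fixes in tracks.items()
    }


def same_class_groups(matrix, t):
    """Group entities by their class in column t (one time step)."""
    groups = {}
    for name, row in matrix.items():
        groups.setdefault(row[t], []).append(name)
    return groups


tracks = {
    "a": [(0, 0), (0, 1), (1, 1)],  # north, then east
    "b": [(5, 5), (5, 6), (5, 5)],  # north, then south
}
matrix = remo_matrix(tracks)
```

Reading the matrix row-wise shows how one entity's heading changes over time; reading it column-wise, as `same_class_groups` does, shows which entities share a heading at a given step. The absolute positions of the entities are deliberately discarded.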
Nevertheless, its simplicity was appealing and at the time triggered important discussions about analyzing more than just the mere spatial footprint of trajectories.

Whereas the REMO matrix aligned derived movement parameters and ignored the absolute positions of the moving entities, subsequently proposed movement patterns such as “flock” or “trendsetter” also considered the spatial arrangements of the entities. Hence, again adhering to a rather simplistic and deterministic view of the world, Laube et al. [17, 16] defined a flock as a group of n moving entities that move in spatial proximity for a defined time interval k. This spatial neighborhood was defined as a simple bounding box or a disc of radius r. These mechanistic and almost naïve early movement patterns also triggered useful discussions about dynamic collectives. However, they don’t consider membership issues or entities temporarily leaving the group. Antony Galton’s beautiful string quartet metaphor illustrates, for only a small and simple group, the potential semantic complexity of dynamic collectives and their collective dynamics [8].

2.3 The Hammer and Nail Issue

Apart from addressing the simple problems first, another low hanging fruit came in the form of well-developed and hence ready-to-use toolboxes for static spatial analysis. For instance, the use of a disc neighborhood for conceptualizing flock patterns was not primarily problem-driven, but rather motivated by the availability of efficient algorithms for finding sets of points that lie close together – in this case using higher-order Voronoi diagrams. It’s fair to say that the solution proposed at the time for conceptualizing and then detecting flock patterns clearly reflects the background of the involved researchers in geography and computational geometry [17]. Certainly, researchers can’t be blamed for first resorting to familiar tools when tackling new problems.
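To make the mechanistic flock definition above concrete, here is a minimal brute-force sketch. It is a simplification, not the algorithm of [17]: it only tests discs of radius r centered on an entity rather than searching for an arbitrarily placed disc, it does not use higher-order Voronoi diagrams, and the function and parameter names are hypothetical. It reports every group of at least n entities that stays within r of the same center entity for k consecutive time steps.

```python
import math


def neighbours(positions, center, r):
    """Entities within distance r of the center entity (center included)."""
    cx, cy = positions[center]
    return frozenset(
        e for e, (x, y) in positions.items() if math.hypot(x - cx, y - cy) <= r
    )


def find_flocks(frames, n, r, k):
    """Return a set of (start_step, members) pairs.

    frames is a list of dicts mapping entity -> (x, y), one dict per time
    step. Simplifying assumption: the disc is centered on an entity.
    """
    flocks = set()
    for start in range(len(frames) - k + 1):
        window = frames[start:start + k]
        for center in window[0]:
            # Skip centers that are not observed throughout the window.
            if any(center not in frame for frame in window):
                continue
            # Members must be inside the disc at every step of the window.
            members = frozenset.intersection(
                *(neighbours(frame, center, r) for frame in window)
            )
            if len(members) >= n:
                flocks.add((start, members))
    return flocks


frames = [
    {"a": (0, 0), "b": (1, 0), "c": (10, 10)},
    {"a": (1, 1), "b": (2, 1), "c": (20, 20)},
    {"a": (2, 2), "b": (3, 2), "c": (2.5, 2)},
]
flocks = find_flocks(frames, n=2, r=1.5, k=2)
```

In this toy data, entity "c" joins "a" and "b" only at the last step, so it never satisfies the k = 2 persistence requirement. The sketch also shares the limitation noted above: an entity that leaves the disc for a single step drops out of the flock entirely.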
The following selection of studies illustrates the adaptation of established tools and concepts for the emerging field of movement analysis. For example, Ross Purves’ and my “How fast is a cow” paper reflects our background as GIS researchers with experience in sensitivity studies, scale issues, and uncertainty [14]. The visualization research group around Natalia and Gennady Andrienko successfully incorporated various movement analysis perspectives into their visualization environment [1]. Database research saw the emergence of novel Moving Object Databases, specifically targeting data management and querying problems for moving objects, for example, for real-time fleet management applications [9]. Finally, computational geometers have successfully proposed several algorithms around the Fréchet distance for the conceptualization and detection of a set of movement patterns and similarity problems [5].

Benefitting from existing and ready-to-use concepts surely presents an efficient research avenue. Unfortunately, this strategy bears the danger of producing solution-dominated methodological research that misses the actual problems of the targeted application areas. Or, “If all you have is a hammer, everything looks like a nail”. The hammer and nail problem became apparent when the flock definitions presented earlier, based on disc-shaped neighborhoods, were put to the test with a herd of actual cows moving across a paddock. It turned out that the mechanistic flock definition missed the social behavior of the animals. The cows expressed group cohesion much more in the form of pair-wise density-connections, with cows lined up like pearls on a string [14]. Such a mismatch between solution and problem can result in a decreased impact of the methodological research, as only those suggested tools will be widely accepted that help the application scientists tackle their actual research questions.
2.4 Seduction of Syntactic Sugar

Peter Landin’s term syntactic sugar serves as an analogy for my fourth low hanging fruit. The term refers to additions to the syntax of a programming language that do not affect its expressiveness but only make it sweeter for humans to use [12]. Many early proposed movement patterns, whose prime function was to structure trajectories of moving entities, were sugarcoated with intuitive labels referring to behavior patterns from the targeted application domains. With my flocks, I am certainly “guilty as charged” in this respect (further examples are trendsetter or leadership patterns), but others happily jumped on the bandwagon (herd, convoy, single file patterns). This strategy has since repeatedly been criticized, and for good reasons. First, as movement patterns were proposed for different application domains, structurally very similar patterns were given different names (flocks, herds, convoys). Second, the sugary names implied that the semantic gap between patterns and behaviors had already been bridged, whereas in reality the patterns only captured structural features (see section 4.2). However, today I would argue that these sugary names helped establish the concept of movement patterns as a driving force for movement analysis, and hence played an important role as a marketing vehicle for the emerging research field.

2.5 Novelty Through Interdisciplinarity

My fifth low hanging fruit is interdisciplinarity. A wide range of research fields currently contribute to the rapid development of movement analysis, including GIScience, computer science with computational geometry and database research, as well as various application fields such as movement ecology, transportation research and planning, and robotics. There is no doubt that interdisciplinary collaborations do produce innovative movement analysis concepts.
Recent examples include work on stacked densities (infovis, computational geometry, and ecology) [6], semantic enrichment of trajectories (GIS, ecology, engineering) [7], or Brownian bridges (computational geometry, ecology) [4].

One reason why interdisciplinary research is attractive lies in the fact that methods and concepts established in one research area may be new to others. In the best case such “novel” methods address urgent challenges in the neighboring field. Often this is a win-win situation. When methods researchers collaborate with domain experts, the former get visibility in relevant application areas whereas the latter get new analytical perspectives advancing their field. Amongst many other recent examples, the handshake between data mining and restoration ecology [2] and between machine learning and drug screening for biomedical research [23] illustrates this successful publication strategy.

3 Computational Movement Analysis

Put together, the many individual contributions outlined above form a solid theoretical foundation of movement analysis in a wider GIScience context. Summarizing the above overview, the following list captures the main achievements of the discipline to date:

• Movement data is arguably the first continuously sampled form of spatio-temporal data that became widely available for overcoming GIScience’s legacy of static cartography. Hence, movement analysis acted as an important trailblazer advancing GIS from being static towards dynamic.

• The diversity of data capture methods used in the various contributing application contexts – ranging from traffic gantries and video surveillance to mobile phones and GPS collars – led to a thorough understanding of the interrelation between conceptual models of movement spaces and the movement traces possible therein.
• The countless hours spent on getting raw trajectories ready for analysis led to a solid body of scientific fundamentals related to preprocessing, cleaning, and filtering notoriously noisy and imperfect movement data. The research field has also produced solutions for integrating, storing, managing, and querying the rapidly growing data streams describing movement phenomena.

• In terms of analytical concepts the main achievement surely lies in the many concepts and related algorithms for structuring movement data. This includes segmentation procedures, similarity measures, and movement patterns.

• Significant contributions were furthermore made for visualizing movement processes. What started off with 3D space-time cubes and time geography soon led to powerful concepts for animation and visual analytics.

In a recently published SpringerBrief volume I outlined what I believe should be the core topics of an underlying theoretical basis of Computational Movement Analysis (CMA) [13] (see footnote 1).

Definition. Computational Movement Analysis (CMA) is the interdisciplinary research field studying the development and application of computational techniques for capturing, processing, managing, structuring, and ultimately analyzing data describing movement phenomena, both in geographic and abstract spaces, aiming for a better understanding of the processes governing that movement [13, p. 4].

On top of the above listed core topics, CMA must also investigate the specific characteristics and peculiarities of the geographic phenomenon movement and the spatio-temporal data describing it, including data quality (uncertainty, accuracy), scale issues, and spatio-temporal autocorrelation. Clearly, CMA must operate in close collaboration with its application domains, studying the peculiarities of established and emerging integrated spatial systems serving as direct or indirect tracking systems.
Finally, CMA should address societal issues, including ethics and privacy, as well as issues around user-generated and open data. Some of these issues are hard and have hence not yet seen the required attention. They directly lead to the following list of higher hanging fruit.

Footnote 1: P. Laube. Computational Movement Analysis. SpringerBriefs in Computer Science. Springer, Berlin Heidelberg, 2014, DOI 10.1007/978-3-319-10268-9, ISBN 978-3-319-10267-2.

4 The Higher Hanging Fruit

4.1 Move Beyond The Trajectory

In my opinion, trajectories have been the key data structures making CMA possible in the first place. Although related to polylines, trajectories are different as they inherently capture the spatio-temporal nature of movement. They not only align position fixes in a sequence but also timestamp those fixes, allowing for representing stop and go patterns and speed variations. However, trajectories still remain only a spatio-temporal footprint of the actual behaviors we typically want to understand. So, it is now time to acknowledge that movement behaviors can hardly be understood from just studying their footprints alone. Reaching out for my first higher hanging fruit means moving beyond the trajectory. This is meant in two ways.

First, studying shape and arrangement patterns of trajectories may reveal certain structural characteristics of movement, but a true understanding of the respective behaviors requires the embedding of the trajectories in the movement context enabling and constraining that movement. Pedestrians walking across a bridge do cluster and do perform a uniform straight movement, but understanding why they move the way they move and where is best explained by linking their movement to the underlying geography – here the bridge enabling and constraining the movement. A dedicated session on movement context at the 2014 GIScience conference in Vienna underlines the relevance of this topic.
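The difference between a polyline and a trajectory drawn at the start of this section can be shown with a minimal sketch. Everything here is an assumption made for illustration: the `Fix` type, the units (seconds and metres), and the speed threshold separating "stop" from "go". Because each fix carries a timestamp, per-segment speeds and a crude stop-and-go labelling can be derived, which the bare sequence of coordinates alone would not allow.

```python
import math
from typing import NamedTuple


class Fix(NamedTuple):
    """One timestamped position fix (hypothetical type for illustration)."""
    t: float  # seconds since start
    x: float  # metres
    y: float  # metres


def segment_speeds(fixes):
    """Speed in m/s for each pair of consecutive fixes."""
    speeds = []
    for a, b in zip(fixes, fixes[1:]):
        dt = b.t - a.t
        if dt <= 0:
            raise ValueError("fixes must be strictly ordered in time")
        speeds.append(math.hypot(b.x - a.x, b.y - a.y) / dt)
    return speeds


def stop_go_labels(fixes, stop_speed=0.5):
    """Label each segment 'stop' if slower than stop_speed (assumed threshold)."""
    return ["stop" if v < stop_speed else "go" for v in segment_speeds(fixes)]


track = [Fix(0, 0, 0), Fix(10, 30, 40), Fix(20, 30, 40), Fix(30, 33, 44)]
speeds = segment_speeds(track)
labels = stop_go_labels(track)
```

Note that the second and third fixes share the same coordinates: as a polyline this would collapse to a repeated vertex, whereas the timestamps reveal a ten-second stop. Such labels are still purely structural; they say nothing about why the entity stopped.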
One challenge for understanding movement in its context lies in the difficulty of accessing the comprehensive data required for such studies. On top of fine-grained movement data one also needs semantic information about the movement context, ideally with a similar spatial and temporal resolution. One way of accessing such rich data sources comes in the form of multi-sensor systems. For example, Nathan et al. combine GPS readings with additional sensors concurrently tracking biomechanical variables when aiming at classifying behavioral modes of free-ranging animals [18].

Second, trajectories typically adhere to the Lagrangian perspective of movement, tracking the changes of an entity’s location. This is the typical case when performing tracking experiments with small numbers of GPS-tagged individuals. The antagonistic Eulerian perspective – observing the movement of entities relative to fixed reference points – is much less investigated. Considering that many large-scale tracking systems, for example urban transit ticketing or mobile phone infrastructure, adhere to the Eulerian perspective, it is obvious that tackling the so far neglected Eulerian perspective promises insights into movement systems of great socioeconomic relevance that cover much larger spaces and populations of moving entities.

4.2 Bridge The Semantic Gap

Stepping up from “just” structuring movement data means reaching out for a truly high hanging fruit – bridging the semantic gap. This gap separates the low-level observational data from the high-level conceptual schemes through which humans interpret, understand and use that data [8]. Segmenting a bird’s trajectory according to its speed and sinuosity characteristics is one thing. But asserting that the wiggly slow segments really correspond to what biologists would classify as “foraging” behavior is a rather different thing. One strategy for narrowing the semantic gap comes in the form of thorough validation.
To this end, studies where trajectory data is complemented with semantic “ground truth” data in the form of direct observations of the studied behaviors are especially valuable. Shamoun-Baranes et al. illustrate how this can be achieved [21]. They cross-validate their machine-learning behavior classification with conventional behavior observations. Similarly, a workshop held in Zurich in 2012 specifically aimed at bringing together theory researchers with application domain specialists. The workshop and the subsequent special issue in Computers, Environment and Urban Systems (vol. 47, 2014) only invited work on real data and real problems, carried out by teams that also included domain specialists grounding the methodological work in the semantics of real problems [19].

4.3 Reach Out for Application Domain Outlets

I have argued above that interdisciplinary collaborations help spread ideas and gain visibility in neighboring fields. However, it is worth having a look at the directions of this exchange for work on movement analysis. In my opinion, this interdisciplinary exchange has so far been very one-sided. The community has been very successful publishing collaborations with domain specialists – but mainly in GIScience and computer science outlets. This is not really surprising since, for example, a bird biologist gets little credit for having developed a new method. What counts is contributing to a better understanding of avian navigation; the methods that helped advance that understanding often literally disappear in the small print. Additionally, methodological practice is very persistent in many research fields. So, introducing new methods in application fields that have established standard procedures is hard. Nevertheless, I still consider movement analysis to be one of the branches of GIScience with the biggest outreach potential. Hence, even if it is difficult, we should aspire to get our work published in the application domains’ outlets.
For the recognition and the further advancement of movement analysis as a key GIScience contribution, those publications are most valuable that appear in the application fields’ outlets, e.g. [7].

4.4 Revisit Privacy

My fourth high hanging fruit is an old friend: Privacy. Clearly, animal movement is an excellent use case for stimulating and interesting CMA problems [13, p. 85]. But as I argued above, the movement of people and the related applications and services bear a much larger socio-economic potential. This is a huge opportunity not to be missed. But people care about privacy. I see two main challenges with respect to privacy: First, develop strategies for getting access to the really interesting large volumes of people movement data. Second, develop analytical frameworks that can produce useful information but at the same time safeguard people’s privacy. Clearly, the community has already produced concepts and algorithms that do just that, but mainly on a theoretical level. Now comes the opportunity to put concepts to the test and apply them to the big data coming our way.

4.5 Engage With Big Data

I would argue that most data sets worked on in movement analysis to date do not really qualify as big data. That is per se not a problem, as conceptual contributions can easily be made with smaller and manageable data sets. However, considering the huge potential and relevance of Eulerian movement systems, the big data streaming out of ICT, public transport and traffic systems clearly challenge the discipline. Even more so, as processing big data explicitly requires methods that can cope with messy, noisy, incomplete, uncertain, heterogeneous and multi-source data. Since these are all known characteristics of spatio-temporal and geographic data in the first place, surely GIScience has a contribution to make for coping with mobility related big data pouring out of our ICT infrastructure and smart cities.
4.6 Envision Mobile Everyware

My last high hanging fruit is arguably still ripening. It is probably fair to say that most movement analysis to date still happens on centralized desktop computers. However, in the course of advancing ubiquitous computing, we may expect mobile spatial “everyware” to increasingly become a normal part of our ICT infrastructure. It is easy to picture a plethora of mobile and communicating computing nodes in smart cities requiring mobility-related intelligence for autonomous transportation or aging populations. Big challenges here include the adaptation of established movement analysis concepts for decentralized environments and the seizing of new opportunities and application fields of movement analysis that are only starting to crystallize in today’s very dynamic and volatile computing environments (see, for example, [3]).

5 Concluding Remarks

With Computational Movement Analysis, GIScience has arguably overcome cartography’s legacy of the static. The discipline has built up a solid theoretical basis on capturing, preprocessing, structuring, and visualizing movement data, exploiting a range of conceptual data models and data structures for movement spaces and movement traces adapted from the GIScience toolbox for handling static spatial data. CMA should, however, widen its focus to include movement data emerging from systems other than GPS tracking and get ready for the expected big data streams pouring out of today’s and tomorrow’s ICT infrastructure. Despite all progress in processing movement data, the biggest challenge remains the semantic annotation of found structures, as a true understanding of the involved processes is impossible without understanding their context.

Acknowledgments

This position paper is the result of a keynote talk with the same title I gave at the GIScience 2014 workshop on Analysis of Movement Data in September 2014 in Vienna. I’d like to thank the organizers for inviting me.
Finally, I’d also like to thank the Zurich University of Applied Sciences for supporting my research.

References

[1] G. Andrienko, N. Andrienko, P. Bak, D. Keim, and S. Wrobel. Visual Analytics of Movement. Springer, Berlin Heidelberg, 2013.
[2] S. Bleisch, M. Duckham, A. Galton, P. Laube, and J. Lyon. Mining candidate causal relationships in movement patterns. International Journal of Geographical Information Science, 28(2):363–382, Nov. 2013.
[3] A. Both, M. Duckham, P. Laube, T. Wark, and J. Yeoman. Decentralized Monitoring of Moving Objects in a Transportation Network Augmented with Checkpoints. The Computer Journal, 56(12):1432–1449, Sept. 2012.
[4] K. Buchin, T. J. M. Arseneau, S. Sijben, and E. P. Willems. Detecting movement patterns using Brownian bridges. In Proceedings of the 20th International Conference on Advances in Geographic Information Systems - SIGSPATIAL ’12, page 119, New York, New York, USA, Nov. 2012. ACM Press.
[5] K. Buchin, M. Buchin, and J. Gudmundsson. Detecting single file movement. In Proceedings of the 16th ACM SIGSPATIAL international conference on Advances in geographic information systems - GIS ’08, page 1, New York, New York, USA, Nov. 2008. ACM Press.
[6] U. Demšar, K. Buchin, E. E. van Loon, and J. Shamoun-Baranes. Stacked space-time densities: a geovisualisation approach to explore dynamics of space use over time. GeoInformatica, 19(1):85–115, Apr. 2014.
[7] S. Dodge, G. Bohrer, R. Weinzierl, S. C. Davidson, R. Kays, D. Douglas, S. Cruz, J. Han, D. Brandes, and M. Wikelski. The environmental-data automated track annotation (Env-DATA) system: linking animal tracks with environmental data. Movement Ecology, 1(1):3, July 2013.
[8] A. Galton. Dynamic Collectives and Their Collective Dynamics. In A. Cohn and D. Mark, editors, Spatial Information Theory SE - 19, volume 3693 of Lecture Notes in Computer Science, pages 300–315. Springer Berlin Heidelberg, 2005.
[9] R. H. Güting and M. Schneider. Moving Objects Databases. Morgan Kaufmann, 2005.
[10] A. Hailey. How far do animals move? Routine movements in a tortoise. Canadian Journal of Zoology, 67(1):208–215, Jan. 1989.
[11] G. C. Hays, J. D. R. Houghton, and A. E. Myers. Endangered species: Pan-Atlantic leatherback turtle movements. Nature, 429(6991):522, June 2004.
[12] P. J. Landin. The Mechanical Evaluation of Expressions. The Computer Journal, 6(4):308–320, Jan. 1964.
[13] P. Laube. Computational Movement Analysis. SpringerBriefs in Computer Science. Springer International Publishing, Berlin Heidelberg, 2014.
[14] P. Laube, M. Duckham, and M. Palaniswami. Deferred decentralized movement pattern mining for geosensor networks. International Journal of Geographical Information Science, 25(2):273–292, Mar. 2011.
[15] P. Laube and S. Imfeld. Analyzing Relative Motion within Groups of Trackable Moving Point Objects. In M. Egenhofer and D. Mark, editors, Geographic Information Science SE - 10, volume 2478 of Lecture Notes in Computer Science, pages 132–144. Springer Berlin Heidelberg, 2002.
[16] P. Laube, S. Imfeld, and R. Weibel. Discovering relative motion patterns in groups of moving point objects. International Journal of Geographical Information Science, 19(6):639–668, July 2005.
[17] P. Laube, M. van Kreveld, and S. Imfeld. Finding REMO Detecting Relative Motion Patterns in Geospatial Lifelines. In Developments in Spatial Data Handling SE - 16, pages 201–215. Springer Berlin Heidelberg, 2005.
[18] R. Nathan, O. Spiegel, S. Fortmann-Roe, R. Harel, M. Wikelski, and W. M. Getz. Using tri-axial acceleration data to identify behavioral modes of free-ranging animals: general concepts and tools illustrated for griffon vultures. The Journal of experimental biology, 215(Pt 6):986–96, Mar. 2012.
[19] R. S. Purves, P. Laube, M. Buchin, and B. Speckmann. Moving beyond the point: An agenda for research in movement analysis with real data. Computers, Environment and Urban Systems, 47:1–4, Sept. 2014.
[20] G. Schofield, V. J. Hobson, S. Fossette, M. K. S. Lilley, K. A. Katselidis, and G. C. Hays. BIODIVERSITY RESEARCH: Fidelity to foraging sites, consistency of migration routes and habitat modulation of home range by sea turtles. Diversity and Distributions, 16(5):840–853, Sept. 2010.
[21] J. Shamoun-Baranes, R. Bom, E. E. van Loon, B. J. Ens, K. Oosterbeek, and W. Bouten. From sensor data to animal behaviour: an oystercatcher example. PloS one, 7(5):e37997, Jan. 2012.
[22] J. Shamoun-Baranes, E. E. van Loon, R. S. Purves, B. Speckmann, D. Weiskopf, and C. J. Camphuysen. Analysis and visualization of animal movement. Biology letters, 8(1):6–9, Feb. 2012.
[23] A. Soleymani, J. Cachat, and K. Robinson. Integrating cross-scale analysis in the spatial and temporal domains for classification of behavioral movement. Journal of Spatial Information Science, 2015.

Semantic Enrichment and Analysis of Movement Data: Probably it is just Starting!

Renato Fileto1,2, Vania Bogorny1,2, Cleto May1,2, Douglas Klein2
1 Post-Graduate Program in Computer Science, 2 Department of Informatics and Statistics (INE),
Federal University of Santa Catarina (UFSC), Florianópolis-SC, Brazil
{r.fileto|vania.bogony}@ufsc.br, {cleto.may|douglas.klein}@grad.ufsc.br

Abstract

The widespread use of sensors and information systems, frequently via mobile devices, allows gathering large amounts of movement data, such as trajectories of moving objects and sequences of users’ posts on social media. These data can enable several applications, but some of them involve understanding what is going on with moving objects (e.g., exact places and/or events of interest, activities performed, reasons for stops and moves). Thus, there is a demand to enrich with well-defined semantics the potentially imprecise spatiotemporal coordinates of movement data, which are sometimes tied together with text (e.g., comments, tags). This paper provides an overview of proposals and possible developments in semantic enrichment and analysis of movement data.
It also presents some details of our current methods to associate movement data with concepts and/or instances described in ontologies or Linked Open Data (LOD). Our experiments with methods to associate tweets with places visited by the users who posted them show that textual contents of some tweets can contribute to making correct associations. In addition, the experience suggests that a variety of techniques can be helpful for semantically enriching movement data in several analysis dimensions. This poses many research challenges, some of them multidisciplinary.

1 Introduction

The popularization of mobile devices, positioning and sensing technologies (e.g., GPS, GSM, RFID, cameras), and information systems on the Web (e.g., social media, systems access logs) has created an abundance of data that can be useful to analyze movements of objects and beings (vehicles, people, etc.). We call movement data any collection of spatiotemporal positions of moving objects, which can be captured by sensors and/or information systems. Each position can be represented by geographic coordinates and the instant when the object occupied that position. Freely annotated movement data have text associated with some positions. This definition encompasses, among other things, trajectories of moving objects with freely annotated segments (e.g., sub-trajectories, stops, moves), sequences of users’ posts on social media, and their fusions [15]. A raw trajectory is a temporally ordered and fine-grained sequence of spatiotemporal positions occupied by a moving object. Nowadays, it is possible to get accurate trajectories by using state-of-the-art applications that employ sensors to get positions of moving objects at fine sampling rates (e.g., every second, every 3 meters) [6]. However, it is hard to gather large volumes of annotated trajectories, because annotating is a laborious task [20].
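The definitions above can be made concrete with a small data-structure sketch. It is only an illustration: the names Position, raw_trajectory and is_freely_annotated are assumptions introduced here, not part of any system described in this paper.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Position:
    """One spatiotemporal position (hypothetical structure for illustration)."""
    lat: float                  # latitude in decimal degrees
    lon: float                  # longitude in decimal degrees
    t: datetime                 # instant when the object occupied the position
    text: Optional[str] = None  # free-text annotation, if any


def raw_trajectory(positions: List[Position]) -> List[Position]:
    """Return the positions as a temporally ordered sequence."""
    return sorted(positions, key=lambda p: p.t)


def is_freely_annotated(positions: List[Position]) -> bool:
    """True if at least one position carries free text."""
    return any(p.text for p in positions)
```

A trail of social media posts fits the same structure; it would simply be sparser in time and carry text on most positions.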
We call a user trail a temporally ordered sequence of traces of a user in a particular system (e.g., Twitter, Facebook). Unlike raw trajectories, user trails are usually imprecise and sparse in space and time, due to limitations that can be imposed on the access to accurate positions, and the asynchronous nature of the users’ activities. Nevertheless, trails composed of social media posts, for example, usually have plenty of contents (e.g., textual contents, hash tags, keywords, images). Some of these additional data may serve as annotations.

Several applications can benefit from movement data analysis, in such diverse areas as traffic, logistics, security and marketing. However, to realize potential applications it is necessary to develop appropriate methods to extract useful information from usually large amounts of movement data. Recently, there has been significant progress in methods to handle such data [9, 10, 18, 24]. Notwithstanding, much of this progress refers to spatiotemporal data management and analysis, while it is recognized by the scientific community that semantic issues must be addressed to better understand and exploit movement data [2, 7, 17–19, 23, 25]. For example, consider the following queries:

Q1: “Select the trajectories that have a stop to watch a sport event and another stop located up to a certain distance of a touristic place called Corcovado in the city of Rio de Janeiro.”
Q2: “Select the trails of European people who visit Rio city and use public transportation to visit at least one national park in Rio state, for at least 2 hours.”
Q3: “What is the percentage of the social media trails that are first at home, then at work, then at a place for doing physical exercise (e.g., a gym), and finally at home again?”

Notice that these queries involve semantic issues, by referring to concepts (classes) such as sport event, touristic place, city, European people, public transportation, national park, state, home, work, physical exercise, and gym. Some queries also refer to specific objects (instances) such as the touristic place called Corcovado and the city or the state called Rio de Janeiro. Consequently, it is not possible to solve such queries confidently just by looking at spatiotemporal coordinates. First, movement data to be queried must be semantically enriched with annotations that precisely associate particular movement segments (i.e., spatiotemporal positions or temporally ordered subsequences of them) with resources (objects or concepts) that are relevant for the analysis. The semantics of the annotations must be well-defined and processable by machines, to enable the automatic identification of semantic relationships with things like places (e.g., a night club called Rio in the city having the same name), phenomena (e.g., a traffic jam in Rio city caused by floods in the whole state also called Rio), actions (e.g., buy or watch the movie Rio), and their respective classes (e.g., night club, city, state, movie).

This paper describes recent progress in the field of semantic enrichment of movement data, and delineates possible research directions in this field. It reports some of our research results to automatically produce semantic annotations of movement data, i.e., associations of movement segments with resources described in ontologies and Linked Open Data (LOD) collections. These annotations have better defined semantics than free text. Our current methods employ spatiotemporal compatibility, several kinds of lexical similarity, and similarity joins, for associating spatiotemporal positions or stops with visited Places of Interest (PoIs).
However, a variety of other techniques such as entity linking [3, 11, 22] can be useful for semantic enrichment in several analysis dimensions that can help explain movements [7, 8]. Once movement data is semantically annotated, the well-defined semantics lent by concepts and/or objects of ontologies/LOD to the resulting annotations enable queries such as Q1, Q2 and Q3 to be expressed and processed by using existing languages such as spatial extensions of SQL and GeoSPARQL [1], among other possibilities. The experiments presented in this paper refer to the automatic association of tweets with PoIs taken from LinkedGeoData (http://linkedgeodata.org), to indicate that the user who posted the respective tweet visited that PoI. These experiments show that geographic proximity is crucial to make these associations, but also that textual data can sometimes contribute to making correct associations.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 summarizes our research program in semantic enrichment, analysis, and mining of movement data. Section 4 briefly describes some of our methods to semantically enrich movement data, and presents results of experiments that associate tweets with visited PoIs. Finally, Section 5 delineates conclusions and research challenges.

2 Related Work

Lots of information can be extracted from freely annotated movement data such as trajectories and trails, and combinations of them, with a myriad of useful applications [19]. However, freely annotated movement data lack well-defined semantics to support information analysis. For instance, the tag "Rio" in a social media post may refer to a city, a state, or even a restaurant, nightclub or a movie, among other possibilities. Therefore, to realize potential applications, it is necessary to develop appropriate methods to turn raw spatiotemporal coordinates, possibly connected with rough free text, into semantically rich movement data.
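To illustrate what well-defined semantics buy, query Q3 from the introduction can be sketched over trails whose positions have already been annotated with concept labels. This is a minimal sketch under stated assumptions: the concept hierarchy SUBCLASS_OF, the label names, and the reading of Q3 as an ordered (not necessarily contiguous) sequence of visits are all hypothetical choices made for illustration, and plain Python stands in for a query language such as GeoSPARQL.

```python
from typing import Dict, Sequence

# Hypothetical, acyclic concept hierarchy: child concept -> parent concept.
SUBCLASS_OF: Dict[str, str] = {
    "gym": "physical_exercise_place",
    "swimming_pool": "physical_exercise_place",
}

# Q3 read as an ordered sequence of concepts to be visited.
Q3_PATTERN = ("home", "work", "physical_exercise_place", "home")


def is_a(concept: str, target: str) -> bool:
    """True if concept equals target or is a (transitive) subclass of it."""
    while concept != target:
        if concept not in SUBCLASS_OF:
            return False
        concept = SUBCLASS_OF[concept]
    return True


def visits_in_order(trail: Sequence[str], pattern: Sequence[str]) -> bool:
    """True if the pattern's concepts occur in the trail in the given order."""
    remaining = iter(trail)  # shared iterator, so matches never move backwards
    return all(any(is_a(label, wanted) for label in remaining) for wanted in pattern)


def q3_percentage(trails: Sequence[Sequence[str]]) -> float:
    """Percentage of annotated trails that satisfy the Q3 pattern."""
    if not trails:
        return 0.0
    hits = sum(visits_in_order(trail, Q3_PATTERN) for trail in trails)
    return 100.0 * hits / len(trails)
```

The point of the hierarchy is that a trail annotated with "gym" or "swimming_pool" answers a query phrased in terms of the more general concept, which free-text tags alone would not allow.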
The fusion of trajectories with trails to produce trajectories annotated with the contents of social media posts is addressed in [15]. In this proposal, trajectories are segmented and structured as temporally ordered sequences of stops and moves, by using clustering-based techniques such as CB-SMoT [16] and DB-SMoT [21]. Then, the spatiotemporal positions of both the structured trajectories and the social media trails are indexed to support efficient algorithms that use proximity joins and ranking criteria to fuse the structured trajectories with trails. However, just fusing movement data from different sources does not necessarily solve semantic issues such as homonyms and ambiguities. Several models, methods, and tools have been proposed for semantic enrichment, analysis, and mining of movement data [2, 17, 18]. The use of ontologies and Knowledge Bases (KBs) for such purposes has been addressed in a few works [12, 19]. The idea of semantically enriching movement data with resources of Linked Open Data (LOD) collections was introduced in [7], which proposes ontological constructs and a general semiautomatic process for this enrichment. That work also illustrates how to accommodate the semantically enriched movement data in the proposed ontological model, and shows the benefits of the well-defined semantics of LOD annotations on movement segments, for example for supporting the execution of GeoSPARQL queries that refer to semantic aspects of the movement and the environment where it takes place. Nevertheless, the development and the evaluation of efficient and effective methods for enriching movement data with ontologies and KBs such as LOD collections are still an open research challenge. The work described in [13] is an initial effort towards properly connecting freely annotated movement locations or stops with PoIs taken from LOD collections.
First, it uses a spatial access method to select the resources that are within a given radius of each position or stop annotated with free text. Then, it chooses, among these resources in the surroundings, those having labels most lexically similar to named entities found in the text associated with the stop or movement location. It employs Soft-TF-IDF [14] to calculate the textual similarity of named entities of the text associated with the movement data with labels of LOD resources, based on a similarity metric between words, such as Levenshtein edit distance or Jaro-Winkler [4]. In fact, a variety of alternative lexical similarity metrics and disambiguation techniques can be employed for making the associations. The experiments reported in [7] and [13] enrich Flickr trails taken from CoPhIR (http://cophir.isti.cnr.it) with DBpedia and LinkedGeoData resources. Despite the simplicity of this method and the low probability of a Flickr post referring to an entity described in such a LOD collection, the results include a considerable number of associations, some of them confirmed by photographic and textual inspection. Several variations of this method have been investigated by our research group in experiments with distinct movement data and LOD collections.

3 Research Directions on Semantic Movement Data

Figure 1 provides an overview of the process for semantic enrichment, analysis, and mining of movement data. Trajectories of moving objects (Trajs), trails of users on social media (or just Trails), and other kinds of movement data must be selected for a particular application, cleansed to eliminate or correct erroneous or imprecise data, structured into movement segments (i.e., subsequences of positions that satisfy certain predicates, like stops and moves [24, 25]), and sometimes fused before being semantically enriched and used. Analogously, the enriching information, such as conventional databases (DBs), spatiotemporal databases (STDBs), knowledge bases (KBs), or Linked Open Data (LOD) collections, also usually needs some pre-processing for data selection, cleansing and sometimes data integration, among other tasks, prior to being used for enriching particular movement data.

Figure 1: Overview of the semantic enrichment, analysis, and mining process

The Semantic Enrichment task aims to associate (annotate) particular positions or temporally ordered subsequences of positions occupied by a moving object with data having well-defined semantics that help to describe what is going on. The annotation of the movement segments can be done according to a variety of dimensions such as those proposed in the CONSTAnT model [2], namely, Moving Object (which encompasses the moving Entity (e.g., person) and the Device (e.g., cell phone) used to collect the movement data), Space (e.g., PoIs), Time (e.g., periods of time of interest, such as years, months, seasons, days of the week), Goal (e.g., eat, shopping, study, leisure), Transportation Means, Environment Condition (e.g., windy), and Activity performed by the moving entity (e.g., talking via the cell phone). All these dimensions can be organized in hierarchies of concepts (e.g., kinds of PoIs, kinds of periods of time) and/or hierarchies of instances of these concepts (e.g., PoIs or periods of time organized in hierarchies in accordance with their containment relationships). Once a sufficient amount of movement data has been semantically enriched, it can also be accommodated in semantic movement data warehouses to support Semantic Analysis, as proposed in [8]. That work proposes a collection of constructs compatible with description logics and semantic Web standards to support flexible and powerful information analyses of semantically enriched movement data.
Those constructs include hierarchies of semantic movement segments (semantic trajectories, stops, moves) with arbitrary refinement levels, analysis dimensions of descriptive data with several hierarchies of categories and/or instances, and flexible conceptualizations of movement segments and movement patterns. This enables movement analysis based on movement patterns defined by (i) spatiotemporal conformations of movement segments (e.g., moving clusters) [5]; and/or (ii) semantic, ordering, and timing constraints on subsegments (e.g., semantic trajectories refined in one stop at an Airport, followed by one stop at a Hotel, which is followed by a stop at a Stadium that lasts at least 4 hours). Semantically enriched movement data can also support what we call Semantic Mining, i.e., data mining that exploits the semantic annotations. The extracted information and knowledge, besides having a variety of applications, can also be used as feedback for more semantic enrichment, analysis and mining of movement data, as illustrated in Figure 1.

4 Semantically Enriching Twitter Trails with PoIs of LOD Collections

Our group has been doing experiments for semantically enriching a variety of movement data: Flickr, Twitter, and Facebook trails, some annotated trajectories, and some trajectories fused with social media trails of the same users. The specific methods for associating trajectories with KB resources, such as LOD resources, vary with the data involved and the semantic dimension (Space, Time, Goal, etc.). Since the beginning of this research the focus has been first on enriching movement segments (positions or sequences of positions such as stops) with the visited PoIs. Although we have plans to develop specific methods for semantic enrichment in different movement analysis dimensions, so far our most reliable results are associations of positions with PoIs. Thus, this paper keeps the focus on this kind of association.
Preliminary methods to semantically enrich movement data with resources taken from KBs such as LOD collections are presented in [7] and [13]. The algorithms proposed in [13] to associate movement segments (usually individual positions or stops) with PoIs consider the proximity of the PoIs to the movement positions and the textual similarity between PoI labels and named entities found in the free text associated with movement data by system users (e.g., social media post contents, textually annotated trajectories). The basic idea behind these algorithms is to select the visited PoI from the possibly multiple ones within a certain distance of each movement segment, by using the textual similarity between PoI labels and named entities associated with movement segments to resolve these ambiguities. This may be necessary due to inaccuracies in the trajectories, or in areas with a high density of small PoIs or overlapping PoIs (e.g., in the same building with multiple stores).

4.1 Experimental Results

The experimental results presented in this paper refer to the enrichment of 3,642 tweets sent in the metropolitan area of Florianópolis, Brazil, in October 2015. These tweets in fact originated in FourSquare, when users check in at some place. Their contents follow the pattern “I am at place name”, which facilitates the extraction of the place name and makes the generated results more reliable. Unfortunately, only a small percentage of the tweets that we have collected from the Twitter API follow this pattern (between 2% and 14% in different datasets used in our experiments so far). These tweets have been associated with 1,268 LinkedGeoData resources (PoIs), whose geographic extensions are inside the Florianópolis metropolitan area. Only a percentage of the tweets are associated with some PoI, because some named entities mentioned in the tweets are not present in the LinkedGeoData collection.
This percentage varies sharply in our experiments (between 0.1% and 80%), according to the dataset and the parameters of the algorithms used for making the associations. The parameters of our simplest algorithms are the spatial threshold (τs: the maximum distance between the movement segment and the PoI to consider the pair as a candidate association) and the textual threshold (τt: the minimum textual similarity to consider a candidate association). The number of associations per tweet can also be greater than one, when our methods are not able to eliminate ambiguities. Figure 2 presents the average number of associations per associated tweet for two variations of our method: the first returns all the associations that satisfy both parameters τs and τt, and the second takes only the closest PoIs (i.e., the ones that are within the same minimum distance found between any PoI and the respective tweet). Notice that the second variation allows a sharp decrease in the number of PoIs associated with each tweet, and in some cases eliminates ambiguities completely (the ones in bold). We have observed that the second variation also runs considerably faster in most of our experiments. Some associations produced by our algorithms, mainly in areas with a high density of PoIs, were clearly enabled by the textual similarity, because the spatial coordinates available in the data collections used in these experiments are not accurate enough to enable the correct association. Figure 3 shows an example in which the textual similarity is crucial to connect the tweet with the correct PoI, though this PoI is not the closest one to the tweet’s geographic coordinates. The tweet is associated with Bobs instead of Posto Angeloni Beira Mar (the closest resource) because the textual contents of the tweet mention that the user is at Bobs.
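The role of the two thresholds and of the closest-only variation can be sketched as follows. This is not the implementation used in the experiments: the function and field names are hypothetical, the coordinates are made up, difflib's SequenceMatcher ratio stands in for Soft-TF-IDF, a linear scan stands in for a spatial access method, and applying the closest-only filter after both thresholds is an assumption about how the second variation is composed.

```python
import math
from difflib import SequenceMatcher


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in meters."""
    r = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def text_similarity(a, b):
    """Case-insensitive similarity in [0, 1]; a stand-in for Soft-TF-IDF."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def associate(tweet, pois, tau_s, tau_t, closest_only=False):
    """Return labels of PoIs within tau_s meters whose label similarity is >= tau_t."""
    candidates = []
    for poi in pois:
        d = distance_m(tweet["lat"], tweet["lon"], poi["lat"], poi["lon"])
        if d <= tau_s and text_similarity(tweet["place"], poi["label"]) >= tau_t:
            candidates.append((d, poi["label"]))
    if closest_only and candidates:
        # Second variation: keep only candidates at the minimum distance found.
        d_min = min(d for d, _ in candidates)
        candidates = [(d, label) for d, label in candidates if d == d_min]
    return [label for _, label in candidates]


# Made-up coordinates: the fuel station is about 11 m away, Bobs about 56 m.
tweet = {"lat": -27.5850, "lon": -48.5450, "place": "Bobs"}
pois = [
    {"label": "Posto Angeloni Beira Mar", "lat": -27.5851, "lon": -48.5450},
    {"label": "Bobs", "lat": -27.5855, "lon": -48.5450},
]
print(associate(tweet, pois, tau_s=200, tau_t=0.8))                     # ['Bobs']
print(associate(tweet, pois, tau_s=200, tau_t=0.0, closest_only=True))  # ['Posto Angeloni Beira Mar']
```

With these illustrative values, the textual threshold selects Bobs even though it is not the nearest PoI, while relying on proximity alone picks the fuel station, which mirrors the situation described for Figure 3.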
Of course, the use of more sophisticated techniques must be investigated to properly and efficiently connect movement data accompanied by textual contents that do not follow the “I am at” pattern with ontology concepts or LOD resources.

Figure 2: Average number of candidate resources per associated tweet

Figure 3: Association based on textual similarity between tweet and label of one of the PoIs in the surroundings

5 Conclusions and Future Work

We leave electronic traces of our movements as GPS trajectories, Wi-Fi and GSM network access logs, Web logs, credit card transaction records, etc. Lots of information about behaviors of moving objects can be extracted from these traces. However, a key step for doing so is the semantic enrichment of these movement data. It can be done by associating these data with well-defined semantics, to help explain what is going on. In this paper, we discussed current research on semantic enrichment of movement data, and outlined some possible research directions in this field. The methods for associating movement positions or stops with resources taken from existing information and KBs, such as the extensive LOD collections available nowadays, are just one example of what can be done. The results of experiments that we have done so far for associating sequences of posts in social media with PoIs taken from LOD collections such as LinkedGeoData show that our methods are very sensitive to the tuning of parameters, such as the geographical proximity and the textual similarity that must be observed between a movement segment and a PoI to associate them. They also show that though proximity is the basic criterion for making associations with visited PoIs, textual similarity is sometimes crucial to make correct associations, especially when coordinates are not precise and in regions with a high density of small or even overlapping PoIs.
16 Despite recent progresses, we believe that research on data integration for semantic enrichment of movement data is still in its infancy, and several methods are to be developed yet. Our experience suggests that a variety of techniques can be helpful for semantic enriching movement data, posing many research challenges, some of them multidisciplinary. Future work includes (i) devise methods to semantically enrich movement data according with other analysis dimensions besides visited PoIs, such as transportation means, and goals of the movements; (ii) create benchmarks to organize the investigation of methods for semantically enriching movement data and enable a fair comparison of their performance, in terms of both execution time and results quality; (iii) make extensive experiments with different datasets; and (iv) investigate the usefulness of the resulted semantic annotations to support semantically-enabled analysis and mining of movement data in several application domains. Acknowledgments The authors were partially supported by EU Marie Courie IRSES-SEEK (grant 295179), CNPq (grant 478634/2011-0), CAPES, and FEESC. References [1] R. Battle and D. Kolas. Enabling the geospatial Semantic Web with Parliament and GeoSPARQL. Semantic Web, 3(4):355–370, 2012. [2] V. Bogorny, C. Renso, A. R. de Aquino, F. de Lucca Siqueira, and L. O. Alvares. CONSTAnT - A Conceptual Data Model for Semantic Trajectories of Moving Objects. T. GIS, 18(1):66–88, 2014. [3] D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, and S. Trani. Learning relatedness measures for entity linking. In Q. He, A. Iyengar, W. Nejdl, J. Pei, and R. Rastogi, editors, CIKM, pages 139–148. ACM, 2013. [4] W. W. Cohen, P. D. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for namematching tasks. In S. Kambhampati and C. A. Knoblock, editors, IIWeb, pages 73–78, 2003. [5] S. Dodge, R. Weibel, and A.-K. Lautenschütz. Towards a Taxonomy of Movement Patterns. 
Information Visualization, 7(3):240–252, June 2008. [6] A. Doulamis, N. Pelekis, and Y. Theodoridis. Easytracker: An android application for capturing mobility behavior. 2012 16th Panhellenic Conference on Informatics, 0:357–362, 2012. [7] R. Fileto, M. Krüger, N. Pelekis, Y. Theodoridis, and C. Renso. Baquara: A Holistic Ontological Framework for Movement Analysis Using Linked Data. In W. Ng, V. C. Storey, and J. Trujillo, editors, ER, volume 8217 of LNCS, pages 342–355. Springer, 2013. [8] R. Fileto, A. Raffaetà, A. Roncato, J. A. P. Sacenti, C. May, and D. Klein. A semantic model for movement data warehouses. In 16th DOLAP, Shanghai, China, November 7 (to appear), 2014. [9] B. Furletti, L. Gabrielli, C. Renso, and S. Rinzivillo. Analysis of GSM calls data for understanding user mobility behavior. In BigData Conference, pages 550–555. IEEE, 2013. [10] F. Giannotti, M. Nanni, D. Pedreschi, F. Pinelli, C. Renso, S. Rinzivillo, and R. Trasarti. Unveiling the complexity of human mobility by querying and mining massive trajectory data. VLDB J., 20(5):695–719, 2011. [11] Z. Guo and D. Barbosa. Entity linking with a unified semantic representation. In 23rd Intl. Conference on World Wide Web, WWW Companion, pages 1305–1310, 2014. 17 [12] G. Manco, M. Baglioni, F. Giannotti, B. Kuijpers, A. Raffaetà, and C. Renso. Querying and reasoning for spatiotemporal data mining. In F. Giannotti and D. Pedreschi, editors, Mobility, Data Mining and Privacy, pages 335–374. Springer, 2008. [13] C. May and R. Fileto. Connecting Textually Annotated Movement Data with Linked Data. In IX Regional School on Databases, ERBD, São Francisco do Sul, SC, Brazil (in Portuguese), 2014. SBC. [14] E. Moreau, F. Yvon, and O. Cappé. Robust Similarity Measures for Named Entities Matching. In 22nd Intl. Conf. on Computational Linguistics - Volume 1, COLING, pages 593–600, 2008. [15] R. G. B. Nabo, R. Fileto, C. Renso, and M. Nanni. 
Annotating Trajectories by Fusing them with Social Media Users’ Posts. In Brazilian Symposium on Geoinformatics, GeoInfo, Campos do Jordão, SP, Brazil (to appear), 2014. [16] A. T. Palma, V. Bogorny, B. Kuijpers, and L. O. Alvares. A clustering-based approach for discovering interesting places in trajectories. In R. L. Wainwright and H. Haddad, editors, SAC, pages 863–868. ACM, 2008. [17] C. Parent, S. Spaccapietra, C. Renso, G. L. Andrienko, N. V. Andrienko, V. Bogorny, M. L. Damiani, A. Gkoulalas-Divanis, J. A. F. de Macêdo, N. Pelekis, Y. Theodoridis, and Z. Yan. Semantic trajectories modeling and analysis. ACM Comput. Surv., 45(4), 2013. Article 42. [18] N. Pelekis and Y. Theodoridis. Mobility Data Management and Exploration. Springer, 2014. [19] C. Renso, M. Baglioni, J. A. F. de Macêdo, R. Trasarti, and M. Wachowicz. How you move reveals who you are: understanding human behavior by analyzing trajectory data. Knowl. Inf. Syst., 37(2):331–362, 2013. [20] S. Rinzivillo, F. de Lucca Siqueira, L. Gabrielli, C. Renso, and V. Bogorny. Where Have You Been Today? Annotating Trajectories with DayTag. In SSTD, volume 8098 of LNCS, pages 467–471. Springer, 2013. [21] J. A. M. R. Rocha, V. C. Times, G. Oliveira, L. O. Alvares, and V. Bogorny. DB-SMoT: A direction-based spatio-temporal clustering method. In IEEE Conf. of Intelligent Systems, pages 114–119. IEEE, 2010. [22] W. Shen, J. Wang, P. Luo, and M. Wang. Linking named entities in tweets with knowledge base via user interest modeling. In ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, KDD, pages 68–76, 2013. [23] S. Spaccapietra and C. Parent. Adding meaning to your steps (keynote paper). In M. A. Jeusfeld, L. M. L. Delcambre, and T. W. Ling, editors, ER, volume 6998 of LNCS, pages 13–31. Springer, 2011. [24] S. Spaccapietra, C. Parent, M. L. Damiani, J. A. F. de Macêdo, F. Porto, and C. Vangenot. A conceptual view on trajectories. Data Knowl. Eng., 65(1):126–146, 2008. [25] Z. Yan, D. 
Chakraborty, C. Parent, S. Spaccapietra, and K. Aberer. Semantic trajectories: Mobility data computation and annotation. ACM TIST, 4(3), 2013.

Hermoupolis: A Semantic Trajectory Generator in the Data Science era

Nikos Pelekis¹, Stylianos Sideridis², Panagiotis Tampakis², Yannis Theodoridis²
¹ Department of Statistics and Insurance Science, University of Piraeus, Greece
² Department of Informatics, University of Piraeus, Greece
{npelekis, siderste, ptampak, ytheod}@unipi.gr

Abstract

The domain of trajectory data management and mining undoubtedly contributes interesting research problems and corresponding effective solutions to what is called data science. An interesting trend that poses new challenges in the field, and has emerged especially due to the advance of location-based social networks, is that the data involved cannot be considered purely spatiotemporal; trajectories of moving objects also contain additional semantic information that deserves to be further explored. On the other hand, the recently available real trajectory datasets are neither adequate nor appropriate for a wide empirical evaluation of related research proposals. As in other domains, a practical approach to overcome this limitation is developing efficient and functional synthetic trajectory generators. In this line of research, we present Hermoupolis, a pattern- and semantic-aware synthetic trajectory generator, which is able to produce realistic semantic trajectory datasets (along with their synchronized raw spatiotemporal counterparts), conforming to mobility profiles given as input by users.

1 Introduction and Motivation

The immense advances in location-aware mobile device technology, such as smartphones, GPS navigation devices, and tablets, along with the development of appropriate techniques for storing, processing and analyzing spatio-temporal data, have led to the generation of huge amounts of GPS-like data.
So far, one challenge that has been successfully addressed is the transformation of this kind of data into actionable knowledge about the mobility of the territories that we occupy. Examples of this success story are the several mobility data mining techniques that have been developed, such as identification of moving clusters [7], clusters of entire trajectories [12] or of sub-trajectories [10, 15], flocks [4, 9], convoys [6], sequential trajectory patterns [2], swarms [11], and top-k representative trajectory samples [13]. A vital prerequisite for the evaluation of these methods is experimentation with enormous amounts of real mobility data. Unfortunately, such datasets cannot be easily acquired and, when they are, they are significantly distorted due to privacy issues, which diminishes their value. For this reason, a number of synthetic generators of moving objects have been developed so far, assisting researchers in evaluating the scalability of the aforementioned techniques. However, evaluating scalability is not enough, as we also need to evaluate the effectiveness of these methods with respect to their prescribed specifications, i.e. whether they identify the patterns that they are supposed to identify. In order to achieve this, one needs to guarantee the existence of specific patterns within a dataset, the so-called ground truth. Usually, it is the lack of such ground truth that makes real or synthetically generated trajectory datasets inappropriate for evaluating the effectiveness of the various approaches. So far, the effectiveness of such methods has been evaluated with the use of small “hand-made” datasets but, like efficiency and scalability, effectiveness should also be tested on very large datasets. The converse also holds: efficiency and scalability should be evaluated on datasets that contain patterns of known ground truth, at various scales. Only in this way can experimental results be interpretable and useful.
In addition, a recent research trend in the mobility data management and mining community is to take advantage of semantic information that annotates raw spatio-temporal data. Such semantically annotated trajectories [14, 16] provide us with a more valuable representation of the mobility behavior of a moving object. Informally, a raw trajectory is a sequence of recorded positions along with the timestamps at which these positions were recorded, while its semantic equivalent is a sequence of more abstract and insightful fragments of mobility data, called episodes. Episodes can be either moves (driving, walking, riding a bicycle, etc.) or stops (e.g. at work, at home, at the supermarket) that hold semantic annotations of type “what?”, “how?”, etc. Preserving semantic information is a rather natural and valuable representation of mobility behavior, since it captures the purpose and the means of the movement. Hence, synchronized semantic and raw spatio-temporal information is essential. In this paper, we present Hermoupolis, a pattern- and semantic-aware synthetic trajectory simulator, which produces annotated trajectories of moving objects following given mobility profiles, along with the respective simulated GPS-like recordings that are network-constrained. To the best of our knowledge, such a generator has unique characteristics with respect to related work, which includes (raw) trajectory data generators, such as GSTD [17], Brinkhoff [1], and SUMO [8], as well as generators that are activity-oriented but far from being considered semantic trajectory data generators, such as BerlinMOD [15], ST-ACTS [3], and MWGen [18]. The paper is structured as follows: Section 2 presents the Hermoupolis workflow, Section 3 describes interesting use cases showing how users may handle Hermoupolis in order to simulate various mobility scenarios, while Section 4 concludes and points out challenging future work in the field.
2 Hermoupolis Workflow

Hermoupolis input consists of (i) a road network, $N = G(V, E)$, (ii) a set of points of interest on this network, PoI, and (iii) a set of mobility profiles, MP, which we desire to simulate over this network. In detail, since our goal is to produce network-constrained semantic trajectories, an important element of the generator is the underlying road network and its properties. Hermoupolis follows the typical paradigm: a road network $N$ is represented by a graph $G = (V, E)$ consisting of a set of vertices $V = \{v_1, \ldots, v_n\}$ that correspond to geographical locations $(x, y)$ and a set of edges $E = \{e_{ij} = (v_i, v_j) \mid v_i, v_j \in V, i \neq j\}$ connecting those vertices. Edges are also annotated with information that describes the type of road (e.g. highway). An additional input of the generator is a set of points of interest PoI of the area upon which the simulation will take place. Each poi ∈ PoI is a tuple (poi-ID, poi-Loc, poi-Tags, poi-Cat), where poi-ID is a unique identifier of each poi, poi-Loc is a spatial point $(x, y)$ denoting its location, poi-Tags is a set of tags stating its underlying utility, and poi-Cat is the corresponding category that the poi belongs to. Furthermore, a mobility profile mp ∈ MP is a tuple (mp-ID, c, $Stop_0$, $Move_1$, $Stop_1$, $Move_2$, $Stop_2$, ..., $Move_k$, $Stop_k$) that denotes the mobility profile identifier mp-ID, the cardinality c of the semantic trajectories to be generated, along with a sequence of ‘abstract’ Stops and Moves (with the constraint that a movement starts from and ends at a stop). In particular:

• An abstract Stop is a tuple (MBB, $\sigma^2_{space}$, $\sigma^2_{time}$, poi-Cat), where MBB is the (spatio-temporal) minimum bounding box that defines the area inside which specific simulated Stop episodes take place, poi-Cat is the category of PoIs that simulated Stop episodes must belong to, and $\sigma^2_{space}$ ($\sigma^2_{time}$) is the variance of the spatial (temporal, respectively) range of the simulated Stop episodes (i.e.
the variance of the spatial extent around the selected PoIs and the variance of the duration of the simulated MBBs of the Stop episodes).
• An abstract Move is a tuple ($speed_{max}$, move-Tags), where $speed_{max}$ is the maximum allowed speed of the movements to be simulated and move-Tags is a set of textual annotations that are attached to the simulated Move episodes (examples include “driving”, “jogging”, etc.).

On the other hand, Hermoupolis output is a semantic mobility database consisting of a set of semantic trajectories along with their raw counterparts that are compliant with the spatio-temporal-textual constraints imposed by the MPs. In particular, a semantic trajectory is a tuple (T-ID, mp-ID, Episodes), where T-ID is a unique identifier of the moving object trajectory, mp-ID is the identifier of the profile mp ∈ MP the trajectory belongs to, and Episodes is a sequence of episodes. In turn, each episode e ∈ Episodes is a tuple (e-ID, e-flag, e-MBB, e-tags, T-link), where e-ID is the episode identifier, e-flag is a flag in {‘Move’, ‘Stop’}, e-MBB is the 3D spatio-temporal approximation of the counterpart raw sub-trajectory, e-tags is a set of keywords derived from the corresponding abstract episode, and T-link is a link to the actual raw sub-trajectory, which in turn is a sequence of map-matched spatiotemporal points $(x_i, y_i, t_i)$. Combining the above, the Hermoupolis workflow is graphically illustrated in Figure 1.

Figure 1: Hermoupolis workflow (input – methodology – output).

Before we proceed with the presentation of case studies developed using Hermoupolis, it is important to note that the starting point of the Hermoupolis software is the Brinkhoff generator [1], which has been radically extended in order to provide the above functionality. Moreover, the Hermoupolis software along with representative case studies is available at: http://infolab.cs.unipi.gr/hermoupolis/.
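The input tuples defined in this section can be sketched as plain data structures. The following is only an illustrative sketch, not the Hermoupolis implementation: all class and field names, the flat layout of the MBB, and the `is_well_formed` helper are assumptions introduced here to mirror the tuple definitions and the constraint that a profile starts and ends with a Stop.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union

# Assumed layout of a spatio-temporal MBB: (x_min, y_min, t_min, x_max, y_max, t_max)
MBB = Tuple[float, float, float, float, float, float]


@dataclass
class PoI:
    """Mirrors the tuple (poi-ID, poi-Loc, poi-Tags, poi-Cat)."""
    poi_id: int
    loc: Tuple[float, float]
    tags: Set[str]
    cat: str


@dataclass
class AbstractStop:
    """Mirrors the tuple (MBB, sigma^2_space, sigma^2_time, poi-Cat)."""
    mbb: MBB
    var_space: float
    var_time: float
    poi_cat: str


@dataclass
class AbstractMove:
    """Mirrors the tuple (speed_max, move-Tags)."""
    speed_max: float
    move_tags: Set[str]


@dataclass
class MobilityProfile:
    """Mirrors (mp-ID, c, Stop_0, Move_1, Stop_1, ..., Move_k, Stop_k)."""
    mp_id: int
    cardinality: int
    steps: List[Union[AbstractStop, AbstractMove]]

    def is_well_formed(self) -> bool:
        # A valid sequence alternates Stop, Move, Stop, ... and ends with a Stop,
        # so it has odd length with Stops at even positions.
        if len(self.steps) % 2 == 0:
            return False
        for i, step in enumerate(self.steps):
            expected = AbstractStop if i % 2 == 0 else AbstractMove
            if not isinstance(step, expected):
                return False
        return True


# Hypothetical home -> work profile with made-up coordinates and variances.
home = AbstractStop((0.0, 0.0, 0.0, 1.0, 1.0, 8.0), 0.1, 0.5, "home")
work = AbstractStop((5.0, 5.0, 9.0, 6.0, 6.0, 17.0), 0.2, 1.0, "work")
drive = AbstractMove(50.0, {"driving"})
profile = MobilityProfile(mp_id=1, cardinality=100, steps=[home, drive, work])
print(profile.is_well_formed())
```

Keeping Stops and Moves in a single alternating list makes the "a movement starts from and ends at a stop" constraint a simple positional check.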
3 Hermoupolis in Action

In this section, we present three case studies that highlight Hermoupolis functionality and its usefulness in simulating various mobility scenarios that are applicable to different domains and application areas. They can be considered representative of the purposes the generator has been developed for:

• case study I, titled “a typical day in Athens”, demonstrates how researchers working on semantic trajectory data management (reconstruction, processing, etc.) can find support in the empirical evaluation of their proposals;
• case study II, titled “a big event in Athens”, aims at the transportation research field and makes use of the expressive power of the generator in simulating real-world cases;
• case study III, titled “collective mobility behavior in Athens”, is in support of researchers in the mobility data mining domain seeking datasets that simulate various well-known mobility patterns.

Case Study I: “a typical day in Athens”. As already mentioned, one of the major contributions of the Hermoupolis generator, in comparison to other generators, is its ability to produce not only “raw” trajectories but also the corresponding “semantic” trajectories that are synchronized with the former. In this case study, we present in detail such a scenario, where various mobility (population) profiles, each consisting of various activities and transportation means, are simulated. The entire simulation scenario, consisting of six mobility profiles, is illustrated in Figure 2. (Arrows in Figure 2 illustrate the direction of movement, while their thickness is proportional to their cardinality.) More specifically, imagine a typical day of a mobility profile called “young-active-workers” (the orange colored mobility profile in Figure 2) as follows: starting from their home ($Stop_0$), they take their bicycles to a bus station ($Move_1$) where they park their bicycles in a bicycle parking area ($Stop_1$) and catch their bus to work ($Move_2$).
After 8 hours of work ($Stop_2$), they catch the bus back home ($Move_3$), arrive at the bus station ($Stop_3$) where they change back to their bicycles and ride home ($Move_4$). As soon as they arrive there ($Stop_4$), they take their car in order to go grocery shopping ($Move_5$). After shopping ($Stop_5$), they return home by car ($Move_6$) where they relax for a while ($Stop_6$). Then, they walk to the gym ($Move_7$) in order to work out ($Stop_7$). Once they complete it, they return home on foot ($Move_8$) where they rest until the end of the day ($Stop_8$). Similarly, the user has given as input another five mobility profiles (those in Figure 2 colored other than orange). For the sake of presentation, we do not describe each of those profiles in as much detail as “young-active-workers” above. Instead, we provide a more abstract outline. In particular, profile “school kids” (green) go from home to school in the morning and return in the afternoon, all on foot. Profile “young students” (turquoise) start in the morning from home and head by bicycle to their university, where they study until the afternoon; then, they ride to a nearby area in order to visit a café for socialization and, finally, they return home, again by bicycle. Profile “middle-aged-workers” (purple) simulates people following the typical home – work – home pattern, all by car. Similarly, profile “middle-aged-workers-and-shoppers” (red) simulates people moving home – work – shops – home, again all by car. Finally, the sixth profile (light green), called “relaxers”, simulates those who use public transportation and whose everyday movement routine is home – café – home. It is easy to see that having such synchronized raw vs. semantic representations of mobility data is extremely useful for various lines of research.
For instance, semantic trajectory reconstruction techniques could make use of the produced raw data, apply the segmentation and semantic annotation approach under study, and make use of the semantic counterparts to evaluate the proposals.

Figure 2: A typical day in Athens – simulating various activities and means of transportation for various mobility patterns during a day in Athens metropolitan area.

Case Study II: “a big event in Athens”. Hermoupolis can be utilized to simulate the traffic flow of an entire city for large periods of time and to support the behavioural analysis of people living in an urban environment wherein they perform their daily activities. Under this setting, we present a scenario that simulates a big event (e.g. a concert or a football game) that takes place at Athens Olympic stadium, which many people from the surrounding metropolitan area rush to attend. This scenario is a bit more intricate to simulate in comparison with the previous one. For instance, in this scenario people from different areas should have different starting times in order to reach the place of the event on time (one who lives very close to the stadium could safely leave home a few minutes before the starting time, as opposed to one who lives in a distant suburb and perhaps needs 1 hour or even more to get there). This characteristic makes it necessary to create several mobility profiles with different starting times depending on their proximity to the place of the event. In our example, as illustrated in Figure 3, we create 17 (= 8 + 7 + 1 + 1) profiles of people starting their way to Athens Olympic stadium: 30 min (8 profiles; those living nearby), 60 min (7 profiles; those living in areas adjacent to the former), 75 min (1 profile; those living east), and 90 min (1 profile; those living south) in advance.
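The staggered starting times of this scenario amount to simple date arithmetic, sketched below. The lead times and profile counts are those listed above; the event start time, the function name, and the dictionary layout are hypothetical choices made only for illustration.

```python
from datetime import datetime, timedelta

# Lead time in minutes -> number of profiles, as listed in the scenario.
LEAD_TIMES = {30: 8, 60: 7, 75: 1, 90: 1}


def departure_times(event_start, lead_times):
    """Return the departure time for every lead-time group."""
    return {lead: event_start - timedelta(minutes=lead) for lead in lead_times}


# Hypothetical event start; the source does not give a date or time.
event_start = datetime(2015, 3, 1, 21, 0)
departures = departure_times(event_start, LEAD_TIMES)
total_profiles = sum(LEAD_TIMES.values())

for lead in sorted(departures):
    print(lead, "min in advance:", departures[lead].strftime("%H:%M"),
          f"({LEAD_TIMES[lead]} profiles)")
```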
The outcome of such an analysis could assist in “smart” and efficient urban planning and decision making, thus having a great impact on the improvement of our everyday life.

Figure 3: A big event in Athens – simulating the traffic flow due to a scheduled event at Athens Olympic stadium.

Case Study III: “collective mobility behavior in Athens”. A unique feature of Hermoupolis is the ability to produce moving objects that follow a variety of mobility patterns. Consider, for instance, Figure 4 (left), which illustrates a mobility pattern consisting of 4 overlapping abstract Stops (depicted as rectangles) and 3 abstract Moves (depicted as arrows). It is evident that such movement simulates a flock [4, 9] or convoy [6] mobility pattern. Another example is demonstrated in Figure 4 (right), where there exist two mobility patterns; one (green) contains 6 abstract Stops and 5 abstract Moves whereas the other (turquoise) contains 5 abstract Stops and 4 abstract Moves. Both profiles include Stops with varying spatial extent and varying speed and agility. By forcing the two profiles to meet at two specific regions (see the first and the last but one abstract Stop in Figure 4 (right)), we can simulate trajectories following a swarm pattern [11].

Figure 4: Collective mobility behavior in Athens – simulating flocks / convoys (left) and swarm patterns (right).

4 Conclusions and Future Work

Concluding, Hermoupolis is a synthetic data generator that succeeds in generating synchronized raw spatiotemporal and semantically annotated trajectories, which simulate realistic mobility behaviours by following user-defined mobility patterns, in terms of sequences of (abstract) Stops and Moves.
As demonstrated through the case studies that we presented, the application domain for such a generator could range from the empirical evaluation of semantic trajectory data management techniques (that handle semantic trajectories as complex spatio-temporal-textual sequences or behaviourally-rich semantic entities) and the effectiveness validation of mobility pattern mining techniques to efficient urban planning through the simulation of the traffic flow of an entire city, e.g. during a scheduled event. In its current version, Hermoupolis requires that the mobility profiles to be simulated are given as input by the user. Following the “by-example” paradigm, a quite challenging extension is to have mobility profiles ‘automatically’ extracted from a small real “semantically-aware” dataset that is given as input instead. Under this setting, urban planners could provide, e.g., a small real dataset extracted from questionnaires or diaries as input and get as output a large realistic synthetic dataset that keeps the semantics of the former, an extremely useful tool in modern planning, e.g. in the era of electric vehicles that is emerging [5].

References

[1] T. Brinkhoff. A framework for generating network-based moving objects. Geoinformatica, 9(1):153–180, 2002. [2] F. Giannotti, M. Nanni, D. Pedreschi, and F. Pinelli. Trajectory pattern mining. In Proc. of ACM SIGKDD, 2007. [3] G. Gidofalvi and T. B. Pedersen. ST-ACTS: a spatio-temporal activity simulator. In Proc. of ACM GIS, 2006. [4] J. Gudmundsson, M. J. Kreveld, and B. Speckmann. Efficient detection of patterns in 2d trajectories of moving points. GeoInformatica, 11(2):195–215, 2007. [5] D. Janssens, F. Giannotti, M. Nanni, D. Pedreschi, and S. Rinzivillo. Data science for simulating the era of electric vehicles. Künstliche Intelligenz, 26(3):275–278, 2012. [6] H. Jeung, M. L. Yiu, X. Zhou, C. S. Jensen, and H. T. Shen. Discovery of convoys in trajectory databases. In Proc. of VLDB, 2008. [7] P. Kalnis, N.
Mamoulis, and S. Bakiras. On discovering moving clusters in spatio-temporal data. In Proc. of SSTD, 2005. [8] D. Krajzewicz, G. Hertkorn, C. Rössel, and P. Wagner. SUMO (Simulation of Urban MObility): an open-source traffic simulation. In Proc. of 4th Middle East Symposium on Simulation and Modelling, 2002. [9] P. Laube, S. Imfeld, and R. Weibel. Discovering relative motion patterns in groups of moving point objects. IJGIS, 19(6):639–668, 2005. [10] J. G. Lee, J. Han, and K. Y. Whang. Trajectory clustering: A partition-and-group framework. In Proc. of ACM SIGMOD, 2007. [11] Z. Li, B. Ding, J. Han, and R. Kays. Swarm: mining relaxed temporal moving object clusters. Proc. of PVLDB, 3(1-2):723–734, 2010. [12] M. Nanni and D. Pedreschi. Time-focused clustering of trajectories of moving objects. JIIS, 27(3), 2006. [13] C. Panagiotakis, N. Pelekis, I. Kopanakis, E. Ramasso, and Y. Theodoridis. Segmentation and sampling of moving object trajectories based on representativeness. IEEE TKDE, 24(7):1328–1343, 2012. [14] C. Parent, S. Spaccapietra, C. Renso, G. Andrienko, N. Andrienko, V. Bogorny, M. Damiani, A. Gkoulalas-Divanis, J. A. Macedo, N. Pelekis, Y. Theodoridis, and Z. Yan. Semantic trajectories modeling and analysis. ACM Computing Surveys, 45(4), 2013. [15] N. Pelekis, I. Kopanakis, E. Kotsifakos, E. Frentzos, and Y. Theodoridis. Clustering uncertain trajectories. KAIS, 28(1):117–147, 2011. [16] N. Pelekis, Y. Theodoridis, and D. Janssens. On the management and analysis of our lifesteps. ACM SIGKDD Explorations Newsletter, 15(1):23–32, 2013. [17] Y. Theodoridis, J. Silva, and M. Nascimento. On the generation of spatiotemporal datasets. In Proc. of SSD, 1999. [18] J. Xu and R. H. Güting. MWGen: A mini world generator. In Proc. of IEEE MDM, 2012.
Constructing Semantic Interpretation of Routine and Anomalous Mobility Behaviors from Big Data

Georg Fuchs¹˒³, Hendrik Stange¹, Dirk Hecker¹, Natalia Andrienko¹˒²˒³, Gennady Andrienko¹˒²˒³
¹ Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, St. Augustin, Germany
² City University London, London, UK
³ Friedrich-Wilhelms University, Bonn, Germany

Abstract

Annually organized VAST Challenges provide a unique opportunity to analyze complex data with available ground truth. In 2014, one of the tasks was to interpret routine and anomalous patterns of human mobility based on big data: trajectories of cars and credit card transactions. We describe a scalable visual analytics approach to solving this problem. Repeatedly visited personal and public places were extracted from trajectories by finding spatial clusters of stop points. Temporal patterns of people’s presence in the places resulted from spatio-temporal aggregation of the data by the places and hourly intervals within the weekly cycle. Based on these patterns, we identified the meanings or purposes of the places: home, work, breakfast, lunch and dinner, etc. Meanings of some places could be refined using the credit card transaction data. By representing the place meanings as points on a 2D plane, we built an abstract semantic space and transformed the original trajectories to trajectories in the semantic space, i.e., performed semantic abstraction of the data. Spatio-temporal aggregation of the transformed trajectories into flows between the semantic places and subsequent clustering of time intervals by the similarity of the flow situations allowed us to reveal and analyze the routine movement behaviors. To detect anomalies, we (a) investigated the visits to the places with unknown meanings, and (b) looked for unusual presence times or visit durations at different semantic places. The analysis is scalable since all tools and methods can be applied to much larger data.
Moreover, the semantic data abstraction can serve as a tool for protecting personal privacy.

1 Introduction

Huge amounts of data reflecting human mobility are constantly generated, including mobile phone use records, geographically referenced posts in social media (Twitter, Foursquare, Flickr, etc.), and GPS tracks. These data provide unprecedented opportunities for studying and understanding human mobility but require appropriate analysis methods, in particular, methods for semantic analysis that could infer and exploit meanings of places and purposes for attending places to enable understanding of people’s everyday behaviors and life styles. Combining multiple sources of mobility data is challenging. Beyond the traditional many V’s of Big Data [10] (Volume, Velocity, Variety, Veracity, just to name a few), there exist specific complexities associated with the peculiarities of human mobility and the corresponding data sets. For instance, different data sets have different structures, different quality, and different spatial and temporal resolutions. Visual Analytics [12] creates opportunities for a synergy between human analyst and computer by providing appropriate visual interfaces to all stages of computational analysis, from data pre-processing and exploration to pattern search and model building. In the context of mobility analysis, visual analytics must address the specifics of space and time [3].

A common pattern of development in mobility analytics is the paradigm shift from syntactic [9] to semantic [11] analysis of movement data. Since mobility data by themselves are semantically poor, human interpretation, reasoning, and judgment are essential for giving sense and meaning to them. Purely computational methods only produce elementary results, e.g., trajectories with labeled segments. Often, semantic interpretation is based on a pre-defined set of places of interest (POI), also called areas of interest (AOI).
This approach has limited applicability if POIs are not available or are outdated. Moreover, POIs may be useful for identifying public places (visited by several people), but their applicability is limited for personal places (frequently visited by selected individuals). Additionally, the same POI may have different meanings for different people. For example, an apartment building may be a home place for its residents and a work place for service staff members. In this paper we describe a visual analysis approach that facilitates human analyst-driven synthesis and semantic interpretation of human mobility behavior at different levels of abstraction, as appropriate for the analysis task at hand. It combines existing methods for the extraction, visual exploration and enrichment of raw mobility data with a novel concept of semantic spaces that allow analysis of routine as well as abnormal mobility behavior. We argue that the proposed transformation from geographic to semantic space creates new opportunities for analysis of mobility data – unlike existing mobility analysis approaches, semantic spaces allow comparing the behaviors of a few to many individuals across different geographies and time periods. This paper extends our contribution that received an award for outstanding scalable analysis [6] at VAST Challenge 2014 [8], mini-challenge 2. We demonstrate our approach to the acquisition of semantically meaningful locations by combining two simulated data sources of that challenge, namely, trajectories of cars and credit card transactions. After extracting and interpreting personal and public places and assigning semantic labels to them, we transform the original trajectories into sequences of place visit records, each record containing the semantic label of the visited place and the start and end times of the visit. This transformation projects human mobility from geographic to semantic space.
We demonstrate that the proposed transformation creates new opportunities for data analysis.

2 Problem Description and Example Data Sets

VAST challenges are open to participation by individuals and teams in industry, government, and academia. The challenge setup typically comprises several interrelated, large data sets together with a set of complex analysis tasks, for which the participants’ submissions should showcase their visual analytics approach and provide well-founded answers. To accommodate the background story framing these analysis tasks, challenge data sets are either derived from real data with alterations, or generated artificially but with realistic artifacts such as missing values, precision limitations, and ambiguities, just as could be expected from real use case data. The 2014 IEEE VAST challenge’s background story can be found on the archived challenge website [8]. The challenge consisted of three inter-related mini-challenges and an overall Grand Challenge. This paper presents our visual analysis approach targeting mini-challenge 2 (MC2), which involved geospatial, temporal, and transaction data analysis. In a nutshell, a company called GAStech provided company cars to its employees. Both personal and business uses were allowed. However, without the employees’ knowledge, GAStech had installed trackers in the company vehicles. The devices periodically recorded the vehicles’ geospatial positions when they were moving. The recorded tracks from a two-week period were provided for the challenge. Additionally, credit and debit card transactions of the GAStech employees were available for the same period. The GPS trajectories data set consists of 671,717 time-stamped positions of 40 distinct cars. The credit card transactions data set consists of 1,087 records of 35 individuals. Each record includes a card owner name, named location (but no geo-coordinates), date and time, and transaction amount.
Both data sets include systematic and arbitrary mistakes such as shifted positions, wrong times, missing records, etc. The overall tasks for participants of MC2 were, first, to describe the routine behaviors of the GAStech employees, and, second, to identify suspicious patterns of behavior. The participants had to cope with uncertainties that result from missing, conflicting, and imperfect data. The following sections briefly describe our approaches to addressing these questions.

3 Extraction and Interpretation of Places

We used an automated tool [7] that extracts repeatedly visited personal and public places by spatial clustering of points from trajectories. Relevant points can be selected beforehand by interactive filtering. We selected the stop points with a duration of at least one minute. The tool works by finding groups of points that fit in circles with a chosen maximal radius and uniting close groups. We chose a sufficiently large maximal radius (100 m) to account for the noise in the data. Personal places are extracted separately from the trajectory of each person. Public places visited by at least a given minimal number of distinct persons (two in our analysis) are extracted from all trajectories together. Figure 1 shows the identified personal and public places.

Figure 1: Identified personal and public places.

Through spatio-temporal aggregation of the trajectories [5], we obtained the visit counts for the extracted places by hourly intervals within the weekly time cycle. For the personal places, only the visits of the place owners were counted. We analyzed the temporal distributions of the place visits using 2D time histograms (Figure 2), where the rows correspond to the 7 days of the week, the columns to the 24 hours of the day, and marks in the cells represent aggregated visit counts for the whole set or subsets of places.
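The aggregation of visits by hourly intervals within the weekly cycle can be sketched as follows. This is a minimal illustration, not the tool used in the analysis: the function name and the input format (a list of start/end timestamps) are assumptions, as is the choice to count a visit once in every hourly interval it overlaps.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def weekly_histogram(visits: Iterable[Tuple[datetime, datetime]]) -> List[List[int]]:
    """Count visits per (weekday, hour) cell: 7 rows (Monday first), 24 columns."""
    hist = [[0] * 24 for _ in range(7)]
    for start, end in visits:
        # Walk over every hourly interval that the visit overlaps.
        slot = start.replace(minute=0, second=0, microsecond=0)
        while slot < end:
            hist[slot.weekday()][slot.hour] += 1
            slot += timedelta(hours=1)
    return hist


# Hypothetical visits to one place: a Monday morning and a Tuesday lunch.
visits = [
    (datetime(2014, 1, 6, 8, 30), datetime(2014, 1, 6, 10, 15)),
    (datetime(2014, 1, 7, 12, 0), datetime(2014, 1, 7, 12, 45)),
]
hist = weekly_histogram(visits)
for day, row in enumerate(hist):
    print(day, row)
```

Each row of the result is one day of the week, so places with similar rows and columns (for example, weekday presence around noon) can then be compared or clustered by their visit distributions.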
By clustering the places according to the similarity of their visit distributions, we found groups of places with prominent temporal patterns of visits, which could be attributed to certain categories of places or activities: home, work, breakfast or coffee, lunch, lunch and dinner. We combined card transaction data with the extracted stop points to (1) determine the geographic locations of the businesses and (2) among the places visited in the evenings and on the weekend, distinguish places for eating, shopping, sport, etc. We could not determine the meanings of five public places visited mostly in hour 11 of the week days. The temporal pattern did not hint at any common human activity, and no card transaction records could be matched with the place visit times. We found that these places were attended by particular people and labeled them “BFMO places”, where BFMO consists of the initials of the last names of these people. These places and the corresponding mobility patterns require further investigation.

Figure 2: 2D time histograms represent the total counts of visits to different place categories by hourly intervals in the weekly cycle.

4 Analysis of Routine Behaviors in Semantic Space

After the assignment of the semantic labels to the places, we transformed the original trajectories into sequences of place visit records, each record containing the semantic label of the visited place and the start and end times of the visit. The intermediate trajectory points between the place visits were omitted. We created an abstract semantic space where the semantic categories of places are represented as points on a 2D plane; we call them “semantic places”. Then, we transformed the sequences of place visit records to trajectories in the semantic space. For this purpose, the place visit records were complemented with the coordinates of the semantic places (see Figure 3).

Figure 3: Trajectories in the semantic space: map (left) and space-time cube (right).
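The transformation of place visit records into a trajectory in the semantic space can be illustrated with a short sketch. It is written in Python under stated assumptions: the `Visit` record, the function `to_semantic_trajectory`, and in particular the coordinates in `SEMANTIC_COORDS` are hypothetical, since the layout of the semantic places used in the analysis is not given here.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Visit:
    label: str        # semantic label of the visited place, e.g. "home"
    start: datetime   # start time of the visit
    end: datetime     # end time of the visit

# Hypothetical layout of semantic places on a 2D plane (an assumption).
SEMANTIC_COORDS = {
    "home": (0.0, 0.0),
    "work": (1.0, 0.0),
    "lunch": (1.0, 1.0),
}

def to_semantic_trajectory(visits, coords=SEMANTIC_COORDS):
    """Complement each visit record with the coordinates of its semantic place.

    Returns a list of (x, y, start, end) tuples ordered by start time.
    Geographic coordinates never enter the result.
    """
    ordered = sorted(visits, key=lambda v: v.start)
    return [(*coords[v.label], v.start, v.end) for v in ordered]

# Hypothetical day of one person, given in arbitrary order.
day = [
    Visit("work", datetime(2014, 1, 6, 8, 0), datetime(2014, 1, 6, 12, 0)),
    Visit("home", datetime(2014, 1, 6, 0, 0), datetime(2014, 1, 6, 7, 30)),
    Visit("lunch", datetime(2014, 1, 6, 12, 10), datetime(2014, 1, 6, 13, 0)),
]
trajectory = to_semantic_trajectory(day)
```

Because every person's "home" maps to the same point, trajectories of different people become directly comparable in this space.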
The data transformation in which real geographic coordinates are replaced by “locations” in an abstract semantic space is called semantic abstraction. Semantic abstraction is a tool for protecting personal location privacy, since sensitive information about specific geographic locations visited by individuals is completely removed from the data.

Figure 4: Summarized flows between semantic places. The widths of the flow symbols are proportional to the total counts of the moves between the respective types of places.

Figure 5: Clustering of hourly intervals by the similarity of the flows in the semantic space reveals high regularity of the movements.

Spatio-temporal aggregation can be applied to trajectories in abstract spaces in the same way as to trajectories in geographic space [2]. We aggregated the transformed trajectories into flows (aggregate moves) between the semantic places for the overall period and by hourly intervals. Figure 4 shows summarized flows between semantic places. Then we clustered the intervals by similarity of the respective flow situations, i.e., vectors composed of the magnitudes of the flows for all ordered pairs of semantic places. We applied the k-means clustering algorithm using the Manhattan distance between the vectors as the similarity measure. In the calendar display (bottom center of Figure 5), pixels representing the hourly intervals are colored by their cluster membership. Periodic patterns with regard to the daily and weekly time cycles could be observed for different values of the parameter k (number of clusters). In the small multiple maps in Figure 5, the average hourly flows for the time clusters are represented by the widths of the flow symbols (curved lines with the curvature increasing towards the destination).

5 Detection and Analysis of Anomalies

By interacting with the map display of the semantic space and transformed trajectories, we selected the daily trajectories visiting BFMO places (cf.
Section 3) as shown in Figure 6, and observed that people went to these places from the workplace. From BFMO places, the visitors almost always moved to lunch/dinner places and then returned to the workplace. Hence, BFMO places were not visited for having lunch. By extracting the corresponding place visit records we examined who was in which place and when, and found five cases in which two or three people met in the same place. All four visitors of BFMO places were security employees. For each semantic place, we analyzed the temporal distribution of the visits using a 2D time histogram similar to those shown in Figure 2 but with 14 rows corresponding to the consecutive days of the two-week period of the data (as in the calendar display in Figure 5). We paid attention to place visits at unusual times. For each such case, we extracted (by spatio-temporal filtering) the place visit records including the visitors’ names, exact visit times, and place names. In this way, we detected that four persons sometimes visited the homes of their colleagues at night. We also detected night visits to work and some other anomalies.

Figure 6: By filtering through the semantic space map, we have selected only those daily trajectories that include visits to BFMO places. There are 30 such trajectories, which are also shown in a summarized form as a set of flows. The map shows that the visitors typically went to the BFMO places from work and after that went for lunch, which means that the BFMO places are not lunch places.

6 Social Network Analysis

From the trajectories, we have extracted all meetings of the people and excluded the meetings that occurred at work and the meetings of people living together at their homes. From the remaining meetings, we have computed distances between individuals based on the relative frequencies of their meetings.

Figure 7: The space of inter-personal relationships.

The map display (Figure 7) shows the space of inter-personal relationships.
The 2D projection has been obtained based on the pair-wise distances between the individuals. The dots represent the individuals and are colored according to their employment types. The curved connecting lines represent the strengths of the relationships between the individuals (i.e., the relative meeting frequencies) by proportional widths and opacities. We see a tight group of security employees (the group also includes two non-security persons). Two security persons, Cocinaro and Osvaldo, bridge this group with another tight group made up of engineering and information technology employees. Another group of engineers is relatively separated from the latter group and from the security group but has strong links to executive staff.

7 Conclusion

All methods and tools that we used for our analysis are scalable with regard to the number of individuals, number of places, and length of the time period covered by the data. We mostly used aggregated views, which could also be applied to much larger data. Detailed data (place visit records) were accessed only for analyzing anomalies. The analysis greatly relied on computational data processing: stop and place extraction, data aggregation, and clustering. These operations are also scalable. Apart from the examination of the anomalies, the analysis was done in a way that respects personal privacy, without accessing personal data. Our approach demonstrates that Visual Analytics may not only be harmful to personal privacy [4], but also has the potential to create opportunities for privacy-preserving analysis of human mobility [1].

The page limit does not allow us to provide here a detailed bibliography on known methods and approaches to place extraction from movement data or to describe in detail all algorithms used in our approach. Interested readers are referred to our recently published paper [7], which includes a comprehensive bibliography and provides all necessary algorithmic details.
In that paper, the approach was used for reconstructing mobility patterns of residents of San-Diego based on geo-located twitter messages and land use map layers. References [1] G. Andrienko and N. Andrienko. Privacy Issues in Geospatial Visual Analytics. In G. Gartner and F. Ortag, editors, Advances in Location-Based Services, Lecture Notes in Geoinformation and Cartography, pages 239–246. Springer Berlin Heidelberg, 2012. [2] G. Andrienko, N. Andrienko, P. Bak, D. Keim, and S. Wrobel. Visual Analytics of Movement. Springer Verlag, 2013. [3] G. Andrienko, N. Andrienko, U. Demsar, D. Dransch, J. Dykes, S. I. Fabrikant, M. Jern, M.-J. Kraak, H. Schumann, and C. Tominski. Space, time and visual analytics. International Journal of Geographical Information Science, 24(10):1577–1600, 2010. [4] G. Andrienko, N. Andrienko, D. Keim, A. M. MacEachren, and S. Wrobel. Challenging problems of geospatial visual analytics. Journal of Visual Languages & Computing, 22(4):251–256, 2011. [5] N. Andrienko and G. Andrienko. Spatial Generalization and Aggregation of Massive Movement Data. IEEE Transactions on Visualization and Computer Graphics, 17(2):205–19, 2011. [6] N. Andrienko, G. Andrienko, and G. Fuchs. Analysis of Mobility Behaviors in Geographic and Semantic Spaces. In VAST Challenge @ IEEE VAST 2014, 2014. Award for Outstanding Scalable Analysis. [7] N. Andrienko, G. Andrienko, G. Fuchs, and P. Jankowski. Scalable and Privacy-respectful Interactive Discovery of Place Semantics from Human Mobility Traces. Information Visualization, ??(?):???, July 2015. DOI: http://dx.doi.org/10.1177/1473871615581216. [8] K. Cook, G. Grinstein, and M. Whiting. IEEE VAST Challenge 2014 – The Kronos Incident. http://vacommunity.org/VAST+Challenge+2014, November 2014. Last accessed 2015-03-29. [9] F. Giannotti and D. Pedreschi. Mobility, data mining and privacy: Geographic knowledge discovery. Springer Science & Business Media, 2008. [10] A. McAfee and E. Brynjolfsson. 
Big data: the management revolution. Harvard Business Review, 90:60–68, October 2012.
[11] C. Parent, S. Spaccapietra, C. Renso, G. Andrienko, N. Andrienko, V. Bogorny, M. Damiani, A. Gkoulalas-Divanis, J. Macedo, N. Pelekis, Y. Theodoridis, and Z. Yan. Semantic trajectories modeling and analysis. ACM Computing Surveys, 45(4):42, 2013.
[12] J. J. Thomas and K. A. Cook. Illuminating the path: the research and development agenda for visual analytics. IEEE Computer Society, 2005.

An Integrated Qualitative and Boundary-based Formal Model for a Semantic Representation of Trajectories

Jing Wu (1, 2), Christophe Claramunt (2), Min Deng (1)
(1) Department of Geo-Informatics, Central South University, Changsha, China
(2) Naval Academy Research Institute, Lanveoc-Poulmic, BP 600, 29240 Brest Naval, France
{jing.wu,claramunt}@ecole-navale.fr, [email protected]

Abstract

Nowadays, the tracking, representation and analysis of moving objects and trajectories have attracted several research efforts at the formal level. The work presented in this paper introduces a spatial qualitative approach for enriching semantic trajectories with movement predicates. The model developed integrates topological relations and qualitative distances between a trajectory and a region of interest. Such a spatio-temporal framework supports the derivation of the basic movement configurations derived from moving and static entities. The approach is flexible enough to reconstruct the trajectory as a sequence of highly-correlated episodes according to the underlying topological properties such as the dimension and cardinality of the intersections that emerge between the trajectory and the given region.

1 Introduction

Thanks to the proliferation of GPS, WiFi, RFID and other sensor-based tracking techniques, the collection of mobility data is becoming much more efficient, thus offering many novel application perspectives.
This also favours the emergence of several research avenues such as spatial data mining [3] or statistical analysis [16], where large trajectory data sets provide the information repository to manipulate. Although one might consider a trajectory a relatively straightforward modeling primitive, the information it encapsulates is often rich when considering the semantics and behaviour associated with the underlying movement. A trajectory, modeled as a semantic abstraction, should be enriched by spatial, temporal and semantic domain knowledge, a process referred to as semantic enrichment [1, 9]. Regarding the spatial dimension, a trajectory can be modeled as a series of episodes as suggested in [6], an episode being defined as a maximal homogeneous sub-sequence of a trajectory. This allows a given trajectory to be mapped to a series of spatial predicates whose semantics can also be enriched by additional application-dependent criteria. One can, for example, distinguish between stationary positions of a given entity (stops) and periods in which the entity is moving (moves), as suggested by the Time Geography framework [4]. Specific episodes can also be identified according to speed variation [7], change of direction [10], route constraints [15] or points of interest [13]. A distinction can also be made between the semantic and spatial dimensions in order to provide a data model representation that supports different levels of abstraction, as suggested in [14].

The research presented in this paper develops a spatial modeling approach whose objective is to enrich trajectory data with movement predicates. The approach considers a trajectory behaviour with respect to a given Region of Interest (ROI). The respective spatio-temporal configurations between a trajectory and a ROI provide a sequence of episodes modeled as basic spatio-temporal predicates.
The formal model developed gives an intuitive set of movement predicates that can be used to model the behaviour of a given trajectory with respect to some predefined ROIs. The whole approach is complemented by a systematic representation of spatio-temporal configurations mapped to possible natural language expressions. The remainder of this paper is organized as follows. Section 2 introduces the main principles behind the formal model developed for a semantic enrichment of the concept of trajectory, while Section 3 presents a series of movement predicates. Finally, Section 4 concludes the paper.

2 Modeling Principles

2.1 Qualitative topological relation

The trajectory of a moving entity with respect to a ROI can be modeled as a qualitative topological relation between an oriented line and a region. The objective is to reflect how a given entity evolves outside, inside or on the boundary of a region, the trajectory of this entity being represented as a directed line. Although several models [2, 5] have been developed to describe line-region topological relations, whether they can cover all possible configurations in a 2-dimensional space is still an open research question. The Boundary-based Trajectory Model developed in a related work [11] provides the foundations of a topological relation model between a directed line and a region (DL-RE). The principles behind the Boundary-based Trajectory Model are as follows. The directed line represents the trajectory of a moving entity in a 2-dimensional space from a starting point ($L_{t_s}$) to a destination point ($L_{t_e}$). Several topological properties, such as the cardinality ($m$), dimensions ($d$), orientation, and intersection types of the neighboring disc derived from the DL-RE intersections ($neigh(p)$), are used to derive a set of primitive DL-RE topological relations. As shown in Figure 1, 30 primitive DL-RE topological relations are derived from these properties.
These relations can then be composed to represent complex trajectories of a given spatial entity over one or many regions. They also support the derivation of possible movements in case of incomplete knowledge. So far this model focuses on the topological properties of the intersections on the boundary of the reference entity without analysing the details when the entity is moving outside ($S_1$) or inside ($S_2$) the reference entity, this being the modeling objective of the extension presented in the following sections.

2.2 Qualitative distance relation

The Boundary-based Trajectory Model is oriented to the analysis of the trajectory configurations of a moving entity represented as a point with respect to the boundary of a given region. However, little information is given regarding the behaviour of such a trajectory outside the immediate proximity of this boundary, for instance either far from this region or inside this region. This is the reason behind the development of a complementary modeling approach where a notion of qualitative distance is taken into account to distinguish whether the entity moves toward or away from the ROI. The Qualitative Trajectory Model developed in [12] provides a set of modeling primitives that support the qualitative representation of the movement between a moving entity and a ROI over a given time interval $T$ based on complementary qualitative RCC8 topological relations [8] and qualitative distance relations $D$. The qualitative distance represents the monotonic and continuous variation of the minimum distance $d$ between the boundary of the moving entity and the boundary of the ROI: $D$ is continuously increasing outside ($d_{ext+}$), $D$ is continuously decreasing outside ($d_{ext-}$), $D$ is constant outside ($d_{ext=}$), $D$ is null ($d_0$), $D$ is continuously increasing inside ($d_{int+}$), $D$ is continuously decreasing inside ($d_{int-}$), and $D$ is constant inside ($d_{int=}$).
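To make the seven qualitative distance values concrete, the following Python sketch classifies a sampled distance profile. It is only an illustration under stated assumptions: the model itself is defined over continuous variation, whereas the hypothetical function `qualitative_distance` works on discrete samples, uses a tolerance `eps`, and is told by the caller whether the entity is inside the ROI.

```python
def qualitative_distance(distances, inside, eps=1e-9):
    """Classify a sampled distance profile into one of the seven symbols.

    distances: minimum distances to the ROI boundary at successive times
               (non-negative; assumed to be sampled over one interval T).
    inside:    True if the entity stays inside the ROI during T.
    Returns "d0", "dext+", "dext-", "dext=", "dint+", "dint-" or "dint=",
    or None when the profile is not monotonic over T.
    """
    if len(distances) < 2:
        raise ValueError("need at least two samples")
    # Null distance: the entity stays on the boundary.
    if all(abs(d) <= eps for d in distances):
        return "d0"
    diffs = [b - a for a, b in zip(distances, distances[1:])]
    prefix = "dint" if inside else "dext"
    if all(abs(x) <= eps for x in diffs):
        return prefix + "="
    if all(x > eps for x in diffs):
        return prefix + "+"
    if all(x < -eps for x in diffs):
        return prefix + "-"
    # Mixed behaviour: T would have to be split into several intervals.
    return None
```

A return value of `None` signals that the interval mixes increasing and decreasing phases, so a single symbol does not describe it.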
This approach is complemented by a tentative qualification of the possible natural language expressions of the primitive movements identified. These movements are classified into three categories according to the relative location of a moving entity with respect to a reference entity: movements outside, on the boundary of, and inside the reference entity. Overall, this modeling approach is qualitative by nature and does not support the identification of specific movements in the vicinity of the boundary of the reference region, this motivating the search for an integrated modeling approach that will combine the respective advantages of the boundary-based and qualitative-based modeling approaches. This is the objective of the modeling framework developed in the next section.

Figure 1: DL-RE topological relations

3 Towards an Integrated Qualitative and Boundary-based Trajectory Model

Let us first model qualitatively the trajectory of a moving point with respect to a reference region (ROI). The primitive movement semantic (PriSem) of the trajectory is formally defined based on DL-RE topological relations ($\mathit{TOP}_{DL\text{-}RE}$) and qualitative distance $D$ over a time interval $T$:

$\mathit{PriSem}(A, B) \equiv \mathit{Holds}(\mathit{TOP}_{DL\text{-}RE}, D, T)$    (1)

where $\mathit{TOP}_{DL\text{-}RE} \in \{S_1, \cdots, S_{30}\}$ and $D \in \{d_{ext+}, d_{ext-}, d_{ext=}, d_0, d_{int+}, d_{int-}, d_{int=}\}$. The intersection type of the neighboring disc is used to derive $\mathit{TOP}_{DL\text{-}RE}$. If there is no intersection between the trajectory and the boundary of the reference entity, the intersection type of $neigh(L_{t_e})$ should be considered. Otherwise, the intersection type of $neigh(I)$ is applied. The primitive movement semantics are classified into three categories according to the number and dimension of the intersection as follows.

3.1 No intersection

Let us first consider the configuration where a moving point A is outside a reference entity B.
Over a given temporal interval $T$, three categories of primitive movement predicates can be distinguished: Approach (AP), Leave (LV) and AroundOutside (AO). During that time interval $T$, the DL-RE relation is $S_1$ and the relative distance can be either decreasing, increasing or constant ($d_{ext-}$, $d_{ext+}$ and $d_{ext=}$, respectively). More formally:

• Approach(A, B) denotes the case where a trajectory A is approaching the reference entity B over a time interval $T$, as shown in Figure 2a. More formally, for all $t \in T$, $S_1$ holds and the relative distance is decreasing outside B:
$\mathit{Approach}(A, B) \equiv \mathit{Holds}(S_1, d_{ext-}, T)$

• Leave(A, B) denotes the case where a trajectory A is leaving the reference entity B over a time interval $T$, as shown in Figure 2b. More formally, for all $t \in T$, $S_1$ holds and the relative distance is increasing outside B:
$\mathit{Leave}(A, B) \equiv \mathit{Holds}(S_1, d_{ext+}, T)$

• AroundOutside(A, B) denotes the case where a moving entity A is either moving around or static outside the reference entity B over a time interval $T$, as shown in Figure 2c. More formally, for all $t \in T$, $S_1$ holds and the relative distance is constant outside B:
$\mathit{AroundOutside}(A, B) \equiv \mathit{Holds}(S_1, d_{ext=}, T)$

Figure 2: Movement outside a reference entity

When a moving point A is inside a reference entity B over a time interval $T$, there are three categories of movements: MovetoBoundary (MB), MovetoInterior (MI) and AroundInside (AI). During that time interval $T$, the DL-RE relation is $S_2$ and the relative distance can be either decreasing, increasing or constant ($d_{int-}$, $d_{int+}$ and $d_{int=}$, respectively). More formally:

• MovetoBoundary(A, B) denotes the case where a moving entity A is inside B and moving to the boundary of B over a time interval $T$, as shown in Figure 3a.
More formally, for all $t \in T$, $S_2$ holds and the relative distance between A and B is decreasing inside B:
$\mathit{MovetoBoundary}(A, B) \equiv \mathit{Holds}(S_2, d_{int-}, T)$

• MovetoInterior(A, B) denotes the case where a moving entity A is inside B and leaving the boundary of B over a time interval $T$, as shown in Figure 3b. More formally, for all $t \in T$, $S_2$ holds and the relative distance between A and B is increasing inside B:
$\mathit{MovetoInterior}(A, B) \equiv \mathit{Holds}(S_2, d_{int+}, T)$

• AroundInside(A, B) denotes the case where a moving entity A is inside B and moving around the boundary of B over a time interval $T$, as shown in Figure 3c. More formally, for all $t \in T$, $S_2$ holds and the relative distance between A and B is constant inside B:
$\mathit{AroundInside}(A, B) \equiv \mathit{Holds}(S_2, d_{int=}, T)$

Figure 3: Movement inside a reference entity with no intersection

3.2 One 0-dimensional intersection point

When there is one intersection point between a trajectory and the boundary of a reference region, the movement predicate is composed of one or two movement states that hold over a temporal interval and another state that meets the boundary of the region at a certain time point: the start time ($t_s$), the end time ($t_e$), or the time point at which the trajectory intersects the boundary ($t_I$). More formally:

• Arrive(A, B) denotes the case where the end point of the trajectory A meets the boundary of B from the outside over a time interval ($T + t_e$), as shown in Figure 4a. The trajectory A approaches B (Approach) before it meets the boundary of B. More formally:
$\mathit{Arrive}(A, B) \equiv \mathit{Holds}(S_1, d_{ext-}, T) \wedge (S_3, d_0, t_e)$

• Depart(A, B) denotes the case where the start point of the trajectory A meets the boundary of B from the outside over a time interval ($t_s + T$), as shown in Figure 4b. The trajectory A starts on the boundary of B and then leaves it (Leave). More formally:
$\mathit{Depart}(A, B) \equiv (S_4, d_0, t_s) \wedge \mathit{Holds}(S_1, d_{ext+}, T)$

• Exit(A, B) denotes the case where the end point of the trajectory A meets the boundary of B from the inside over a time interval ($T + t_e$), as shown in Figure 4c.
The trajectory A moves to the boundary of B (MovetoBoundary) and then ends on it. More formally:
$\mathit{Exit}(A, B) \equiv \mathit{Holds}(S_2, d_{int-}, T) \wedge (S_5, d_0, t_e)$

• Enter(A, B) denotes the case where the start point of the trajectory A meets the boundary of B from the inside over a time interval ($t_s + T$), as shown in Figure 4d. The trajectory A starts on the boundary of B and then moves to its interior (MovetoInterior). More formally:
$\mathit{Enter}(A, B) \equiv (S_6, d_0, t_s) \wedge \mathit{Holds}(S_2, d_{int+}, T)$

• CrossIn(A, B) denotes the case where the trajectory A starts outside B, crosses the boundary of B and ends inside B over a time interval ($T_1 + t_I + T_2$), as shown in Figure 4e:
$\mathit{CrossIn}(A, B) \equiv \mathit{Holds}(S_1, d_{ext-}, T_1) \wedge (S_7, d_0, t_I) \wedge \mathit{Holds}(S_2, d_{int+}, T_2)$

• CrossOut(A, B) denotes the case where the trajectory A starts inside B, crosses the boundary of B and ends outside B over a time interval ($T_1 + t_I + T_2$), as shown in Figure 4f:
$\mathit{CrossOut}(A, B) \equiv \mathit{Holds}(S_2, d_{int-}, T_1) \wedge (S_8, d_0, t_I) \wedge \mathit{Holds}(S_1, d_{ext+}, T_2)$

• TouchOutside(A, B) denotes the case where the start and end points of the trajectory A both lie in the exterior of B and the interior of A meets the boundary of B, with either clockwise or anticlockwise orientation, over a time interval ($T_1 + t_I + T_2$), as shown in Figure 4g:
$\mathit{TouchOutside}(A, B) \equiv \mathit{Holds}(S_1, d_{ext-}, T_1) \wedge ((S_9 \vee S_{10}), d_0, t_I) \wedge \mathit{Holds}(S_1, d_{ext+}, T_2)$

• TouchInside(A, B) denotes the case where the start and end points of the trajectory A both lie in the interior of B and the interior of A meets the boundary of B, with either clockwise or anticlockwise orientation, over a time interval ($T_1 + t_I + T_2$), as shown in Figure 4h:
$\mathit{TouchInside}(A, B) \equiv \mathit{Holds}(S_2, d_{int-}, T_1) \wedge ((S_{11} \vee S_{12}), d_0, t_I) \wedge \mathit{Holds}(S_2, d_{int+}, T_2)$

Figure 4: Movement configurations with one 0-dimensional intersection point

3.3 One 1-dimensional intersection line

When part of the trajectory of a moving point A is on the boundary of a reference entity B during a time interval $T$, the different configurations of the start and
end points of the intersection, which is one 1-dimensional line, should be considered.

• Along(A, B) denotes the case where the start and end points of the trajectory both lie on the boundary of B over a time interval $T$. More formally, for all $t \in T$, $S_{13}$ or $S_{14}$ holds according to the clockwise or anticlockwise orientation of the trajectory and the relative distance is null:
$\mathit{Along}(A, B) \equiv \mathit{Holds}((S_{13} \vee S_{14}), d_0, T)$

• Arrive-Along(A, B) denotes the case where the trajectory A starts in the exterior of B and ends on its boundary, with either clockwise or anticlockwise orientation, over a time interval ($T_1 + t_I + T_2$):
$\mathit{Arrive\text{-}Along}(A, B) \equiv \mathit{Holds}(S_1, d_{ext-}, T_1) \wedge (((S_{15}(I_1), t_I) \wedge \mathit{Holds}(S_{14}, d_0, T_2)) \vee ((S_{16}(I_1), t_I) \wedge \mathit{Holds}(S_{13}, d_0, T_2)))$

• Exit-Along(A, B) denotes the case where the trajectory A starts in the interior of B and ends on its boundary, with either clockwise or anticlockwise orientation, over a time interval ($T_1 + t_I + T_2$):
$\mathit{Exit\text{-}Along}(A, B) \equiv \mathit{Holds}(S_2, d_{int-}, T_1) \wedge (((S_{17}(I_1), t_I) \wedge \mathit{Holds}(S_{13}, d_0, T_2)) \vee ((S_{18}(I_1), t_I) \wedge \mathit{Holds}(S_{14}, d_0, T_2)))$

• Along-Depart(A, B) denotes the case where the trajectory A starts on the boundary of B and ends in its exterior, with either clockwise or anticlockwise orientation, over a time interval ($T_1 + t_I + T_2$):
$\mathit{Along\text{-}Depart}(A, B) \equiv ((\mathit{Holds}(S_{14}, d_0, T_1) \wedge (S_{19}(I_2), t_I)) \vee (\mathit{Holds}(S_{13}, d_0, T_1) \wedge (S_{20}(I_2), t_I))) \wedge \mathit{Holds}(S_1, d_{ext+}, T_2)$

• Along-Enter(A, B) denotes the case where the trajectory A starts on the boundary of B and ends in its interior, with either clockwise or anticlockwise orientation, over a time interval ($T_1 + t_I + T_2$):
$\mathit{Along\text{-}Enter}(A, B) \equiv ((\mathit{Holds}(S_{13}, d_0, T_1) \wedge (S_{21}(I_2), t_I)) \vee (\mathit{Holds}(S_{14}, d_0, T_1) \wedge (S_{22}(I_2), t_I))) \wedge \mathit{Holds}(S_2, d_{int+}, T_2)$

As there should be an infinite number of possible DL-RE configurations in a 2-dimensional Euclidean space, the set of possible movement semantics that emerge from those configurations can be relatively large. A composition of the primitive movement predicates defined above can be applied to reconstruct the trajectory as a sequence of highly-correlated episodes.

4 Conclusion

With rapid and continuous progress in the availability of mobility data and the large range of potential applications, there is still a need for the development of spatio-temporal data models that encompass, at the abstract level, the semantics that emerge from the underlying phenomena. The modeling approaches developed in this paper introduce an integrated formal framework that represents the trajectory of a moving entity with respect to a region of reference. Several spatial qualitative parameters are taken into account, such as the boundary-based relationship between the two entities considered as well as the evolution of the relative distance between them. The framework developed also favours the identification of a series of movement predicates and natural language expressions that qualify the movements of a given entity with respect to a reference entity. The approach is preliminary and can still be extended by the integration of additional spatial properties such as velocity or more specific semantic information.

Acknowledgments

The research was funded by the Fundamental Research Funds for the Central Universities of Central South University and the Open Research Fund Program of the Key Laboratory of Digital Mapping and Land Information Application Engineering (GCWD201206), State Bureau of Surveying and Mapping.

References

[1] V. Bogorny, C. Renso, A. R. de Aquino, F. de Lucca Siqueira, and L. O. Alvares.
Constant: A conceptual data model for semantic trajectories of moving objects. Transactions in GIS, 18(1):66–88, January 2014.
[2] M. J. Egenhofer and R. D. Franzosa. On the equivalence of topological relations. International Journal of Geographical Information Systems, 9(2):133–152, February 1995.
[3] F. Giannotti and D. Pedreschi. Mobility, Data Mining and Privacy - Geographic Knowledge Discovery. Springer-Verlag, Berlin Heidelberg, 2008.
[4] T. Hägerstrand. Geography and the study of interaction between nature and society. Geoforum, 7(5-6):329–334, May 1976.
[5] Y. Kurata and M. J. Egenhofer. The 9+ intersection for topological relations between a directed line segment and a region. In Proceedings of the 1st Workshop on Behavioral Monitoring and Interpretation, pages 62–76. IOS, September 2007.
[6] D. Mountain and J. Raper. Modelling human spatio-temporal behaviour: a challenge for location-based services. In Proceedings of the 6th International Conference on GeoComputation, pages 24–26. GeoComputation, September 2001.
[7] A. T. Palma, V. Bogorny, B. Kuijpers, and L. O. Alvares. A clustering-based approach for discovering interesting places in trajectories. In Proceedings of the 2008 ACM Symposium on Applied Computing, pages 863–868. ACM, March 2008.
[8] D. A. Randell, Z. Cui, and A. G. Cohn. A spatial logic based on regions and connection. In Proceedings of the 3rd International Conference on Knowledge Representation and Reasoning, pages 165–176, October 1992.
[9] C. Renso, S. Spaccapietra, and E. Zimányi. Mobility Data: Modeling, Management, and Understanding. Cambridge University Press, Cambridge, 2013.
[10] J. A. M. R. Rocha, V. C. Times, G. Oliveira, L. O. Alvares, and V. Bogorny. Db-smot: A direction-based spatio-temporal clustering method. In Proceedings of the 5th IEEE International Conference on Intelligent Systems, pages 114–119. IEEE, July 2010.
[11] J. Wu, C. Claramunt, and M. Deng.
Modelling movement patterns using topological relations between a directed line and a region. In Proceedings of the 5th ACM SIGSPATIAL International Workshop on GeoStreaming, pages 43–52. ACM, November 2014.
[12] J. Wu, C. Claramunt, and M. Deng. Towards a qualitative representation of movement. In Advances in Conceptual Modeling, pages 191–200. ER 2014 Workshops, October 2014.
[13] K. Xie, K. Deng, and X. Zhou. From trajectories to activities: A spatio-temporal join approach. In Proceedings of the 2009 International Workshop on Location Based Social Networks, pages 25–32. ACM, November 2009.
[14] Z. Yan, C. Parent, S. Spaccapietra, and D. Chakraborty. A hybrid model and computing platform for spatio-semantic trajectories. In The Semantic Web: Research and Applications, pages 60–75. 7th Extended Semantic Web Conference, ESWC 2010, May 2010.
[15] Y. Zheng, L. Zhang, Z. Ma, X. Xie, and W.-Y. Ma. Recommending friends and locations based on individual location history. ACM Transactions on the Web, 5(1):1–44, February 2011.
[16] Y. Zheng and X. Zhou. Computing with Spatial Trajectories. Springer-Verlag, New York, 2011.

Trajectory Similarity Measures

Kevin Toohey, Matt Duckham
University of Melbourne, Australia

Abstract

Storing, querying, and analyzing trajectories is becoming increasingly important as the availability and volume of trajectory data increase. One important class of trajectory analysis is computing trajectory similarity. This paper introduces and compares four of the most common measures of trajectory similarity: longest common subsequence (LCSS), Fréchet distance, dynamic time warping (DTW), and edit distance. These four measures have been implemented in a new open source R package, freely available on CRAN [19]. The paper highlights some of the differences between these four similarity measures, using real trajectory data, in addition to indicating some of the important emerging applications for measurement of trajectory similarity.
1 Introduction

As the technology for tracking moving objects becomes cheaper and more accurate, the amount and availability of stored movement data continue to increase rapidly. Most movement data is captured and stored in the form of trajectories, defined as “a sequence of time-stamped locations” [11]. However, the analysis of these increasing volumes of trajectory data can be challenging, due in large part to the way the same continuous movement can have innumerable different discretized trajectory representations.

One important class of trajectory analysis is the measurement of similarity between trajectories. Several measures exist for calculating the similarity between two trajectories, each with its own strengths and weaknesses. Several surveys of trajectory similarity measures have been performed [7, 11, 18].

After first outlining some of the useful applications of trajectory similarity measures, this paper discusses four of the most commonly used similarity measures in detail: longest common subsequence (LCSS), Fréchet distance, dynamic time warping (DTW), and edit distance. These four measures have been implemented within a new R package called “SimilarityMeasures,” available on CRAN [19]. The four similarity measures are compared empirically using a sample movement dataset, highlighting where differences in computed similarity value are expected to occur.

2 Applications of trajectory similarity measures

The most common use of trajectory similarity measures is for database indexing. For example, Vlachos et al. apply the longest common subsequence similarity measure to index a set of marine animal trajectories [20]. The study shows considerable speed increases for nearest neighbor computations when using this index over brute force linear scans. Other examples of indexing trajectories using similarity measures can be seen in [6, 8, 14].
Movement patterns in vehicle and pedestrian traffic have of course been analyzed using trajectory similarity measures. Information about the similarities in movement patterns can enable traffic managers to adjust timings on a road network, to find where problems are occurring, or to increase safety and security. Suspicious behavior, for example, can be detected from dissimilarity from predefined “normal” behavior [10]. Using Fréchet and discrete Fréchet similarity measures, Buchin et al. [3] explored the detection of commuting patterns in trajectories. Li et al. [16] used longest common subsequence similarity measures to compare calculated paths with actual paths in an analysis on crowded scene movements.

Use of tracking data in sports is also becoming more common. Analyzing tracking data from sports can allow players to increase their efficiency and effectiveness. Haase and Brefeld [12] used a dynamic time warping similarity measure to explore similar movements in a soccer game, while Perše et al. [17] analyzed and segmented basketball games using an edit distance similarity measure.

Similarity measures can be used on animal trajectories in behavioral science to explore information about popular tracks, movements, and social interactions [10], such as to compare movements of albatross [4] or of cattle [15].

3 Trajectory similarity computation

A trajectory $T_A$ contains a series of $m$ timestamped $n$-dimensional points $a_i = (a_{i,1}, \ldots, a_{i,n})$:

$$T_A = ((t_1, a_1), \ldots, (t_m, a_m))$$

where $t_i$ are discrete timestamps and $t_i < t_{i+1}$. The length of a trajectory is defined here as the number of discrete timestamps (“fixes”). Spatial trajectory points are commonly recorded in 2 or 3 dimensions, although higher dimensionality trajectories are of course possible.

The key challenge in identifying a satisfactory trajectory similarity measure is the arbitrary nature of the discretization.
For example, a naïve similarity measure is Euclidean distance, calculated as the sum of the distances between ordered pairs of points in two trajectories. However, such a simple measure struggles with different sampling rates and outliers, and requires trajectories of different lengths to be cut to equal size [20]. Thus, many more sophisticated similarity measures have been proposed and implemented to overcome the challenges resulting from discretization.

Four of the most commonly encountered advanced similarity measures are explored in the next sections. Three of the discussed measures were included in a previous effectiveness study of six similarity measures tested on a taxi dataset [21], while two were also contained in a comparison of another six similarity measures in [23].

A fully documented R package, called “SimilarityMeasures,” is freely available online [19], and has been written to enable easier access to these analyses. The functions in this package are able to compute each of the following similarity measures on n-dimensional trajectories. Please see the package documentation for further details, including how to use the various functions. More information on R, and R packages, can be found at the R Project web page (footnote 1).

3.1 Fréchet metric

The Fréchet metric (or Fréchet distance) is amongst the most popular of similarity measures [11]. The metric was first defined by Fréchet [9] and can be applied both to continuous directed curves and to the discretized trajectories considered here.

The Fréchet metric is generally described in the following way: a person is walking a dog on a leash. The person walks on one curve while the dog walks on the other [1]. The dog and the person are able to vary their speeds, or even stop, but not go backwards. The Fréchet metric is the minimum leash length required to complete the traversal of both curves.
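As a rough illustration of the leash intuition, the following Python sketch computes the discrete Fréchet distance, in which the person and the dog may only stand on recorded fixes. This is a simplification chosen for brevity and is not the Alt and Godau algorithm used in the R package; because positions between fixes are ignored, the value is an upper bound on the continuous Fréchet distance. The function name discrete_frechet is hypothetical.

```python
import math


def discrete_frechet(traj_a, traj_b, dist=math.dist):
    """Discrete Frechet distance between two non-empty point sequences.

    Assumption: only the recorded fixes are considered, not the line
    segments between them (unlike the continuous Frechet metric).
    """
    m, k = len(traj_a), len(traj_b)
    if m == 0 or k == 0:
        raise ValueError("both trajectories need at least one point")

    # leash[i][j] = shortest leash that lets the walkers reach
    # traj_a[i] and traj_b[j] without either one going backwards.
    leash = [[0.0] * k for _ in range(m)]
    for i in range(m):
        for j in range(k):
            d = dist(traj_a[i], traj_b[j])
            if i == 0 and j == 0:
                leash[i][j] = d
            elif i == 0:
                leash[i][j] = max(leash[0][j - 1], d)
            elif j == 0:
                leash[i][j] = max(leash[i - 1][0], d)
            else:
                # Either walker, or both, advanced by one fix.
                best_previous = min(
                    leash[i - 1][j], leash[i][j - 1], leash[i - 1][j - 1]
                )
                leash[i][j] = max(best_previous, d)
    return leash[m - 1][k - 1]


straight = [(0, 0), (1, 0), (2, 0)]
shifted = [(0, 1), (1, 1), (2, 1)]
with_peak = [(0, 0), (1, 5), (2, 0)]

print(discrete_frechet(straight, shifted))    # 1.0
print(discrete_frechet(straight, with_peak))  # 5.0
```

In the second call a single displaced fix determines the whole result, since the leash must be long enough for the worst position reached during the walk.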
As with most similarity measures, the choice of distance function can be adapted to suit the specific application and trajectories. Euclidean distance is used in this paper.

Footnote 1: http://www.r-project.org

Trajectory points are not matched together using the Fréchet metric. This allows the metric to perform well with even the most widely varying sampling rates and trajectory lengths. Unfortunately, the Fréchet metric can be greatly affected by outliers if they are not removed before performing the calculation. This is caused by the fact that every point of the two trajectories is used in the calculation.

The Fréchet distance computation contained in our “SimilarityMeasures” R package is implemented using an algorithm discussed by Alt and Godau [1]. This algorithm allows computation of the Fréchet distance between two trajectories of length $m$ and $k$, with a worst case complexity of $O((m^2 k + k^2 m) \log mk)$. Alt and Godau use the idea of free space diagrams to allow the efficient calculation of this similarity measure. For more information on the calculations used in this algorithm see Alt and Godau [1].

3.2 Dynamic time warping (DTW)

Unlike Fréchet distance, dynamic time warping (DTW) is a similarity measure that relies on matching points in trajectories. Using DTW, the trajectories are “warped” in a non-linear way to measure similarity while allowing for varying sampling rates [22]. The calculation is again performed using a chosen distance function (Euclidean distance in our examples).

For two trajectories $T_A$ and $T_B$, with lengths $m$ and $k$, an $m \times k$ grid can be created where each grid point $(i, j)$ represents the distance between points $a_i$ and $b_j$ [2]. A warping path $W$ is created by starting at grid point $(1, 1)$, and incrementing either $i$ or $j$ or both by 1 each step until reaching point $(m, k)$. For example, a path beginning at grid point $(1, 1)$ could move to one of grid points $(1, 2)$, $(2, 1)$ or $(2, 2)$.
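The grid and the three permitted moves can be sketched in Python as a small dynamic program. The sketch assumes that the cost of a warping path is the sum of the point distances along it, which is one common choice, and returns the cheapest such cost. It is an illustration only, not the code of the R package, and the function name dtw is hypothetical.

```python
import math


def dtw(traj_a, traj_b, dist=math.dist):
    """Minimum summed distance over all warping paths from (1, 1) to (m, k).

    Assumption: the warping cost is the plain sum of distances along
    the path; other cost definitions are possible.
    """
    m, k = len(traj_a), len(traj_b)
    if m == 0 or k == 0:
        raise ValueError("both trajectories need at least one point")

    # cost[i][j] = cheapest path cost ending at grid point (i, j), 1-based.
    # Row 0 and column 0 are padding so that the first grid point has a
    # zero-cost predecessor and no path can leave the grid.
    cost = [[math.inf] * (k + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            d = dist(traj_a[i - 1], traj_b[j - 1])
            # A path reaches (i, j) by incrementing i, j, or both.
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[m][k]


dense = [(0, 0), (1, 0), (2, 0)]
sparse = [(0, 0), (2, 0)]
print(dtw(dense, sparse))  # 1.0
print(dtw(dense, dense))   # 0.0
```

The table has one entry per grid point and each entry inspects three neighbours, so the running time is proportional to the product of the two trajectory lengths.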
Definition 1: If $w_l$ represents a grid point $(i, j)_l$, then a warping path $W$ can be represented as the following sequence of grid points:

$$W = w_1, \ldots, w_p$$

Exponentially many paths satisfy the conditions above [14]. A warping cost is calculated from a warping path in various ways. A common warping cost for calculating DTW is the total of all of the distances calculated along the warping path. Finally, the DTW similarity value is the minimum of all possible warping costs [2].

Using DTW, a single point on one trajectory can be matched to multiple points on the other. This allows DTW to perform well with trajectories of different lengths and even widely varying sampling rates. However, outliers can again greatly affect this method because every point of both trajectories must have at least one match. The choice of a distance function is also clearly important to DTW, and the warping cost calculation can be changed to suit different needs (cf. Keogh and Ratanamahatana [14] and Keogh and Pazzani [13] for more ways to compute warping cost).

Our R package DTW calculation was implemented using the warping cost algorithm discussed in Berndt and Clifford [2]. This implementation allows the DTW calculation to be performed between two trajectories of length $m$ and $k$, with complexity of $O(mk)$. This DTW algorithm calculates and returns the total of the distances between each pair of points on the optimal warping path using Euclidean distance.

3.3 Longest common subsequence (LCSS)

Longest common subsequence (LCSS) is a similarity measure where trajectories can be stretched, while some points are able to remain unmatched in an attempt to provide an accurate similarity analysis [20]. The LCSS value represents a count of the maximum number of points which can be considered equivalent, while the trajectories are traversed monotonically from start to end.
Definition 2: Using trajectories $T_A$ and $T_B$, with lengths $m$ and $k$, an integer $\delta \ge 0$ and a matching threshold $\varepsilon \ge 0$, the $\mathrm{LCSS}_{\delta,\varepsilon}$ definition from Vlachos et al. [20] is adapted to the following:

$$
\mathrm{LCSS}_{\delta,\varepsilon}(T_A, T_B) =
\begin{cases}
0, & \text{if } T_A \text{ or } T_B \text{ is empty} \\
1 + \mathrm{LCSS}_{\delta,\varepsilon}(\mathrm{Head}(T_A), \mathrm{Head}(T_B)), & \text{if } |m - k| \le \delta \text{ and } |a_{m,1} - b_{k,1}| \le \varepsilon \text{ and } \ldots \text{ and } |a_{m,n} - b_{k,n}| \le \varepsilon \\
\max(\mathrm{LCSS}_{\delta,\varepsilon}(\mathrm{Head}(T_A), T_B),\ \mathrm{LCSS}_{\delta,\varepsilon}(T_A, \mathrm{Head}(T_B))), & \text{otherwise}
\end{cases}
$$

In this definition, the constant $\delta$ provides a maximum index difference when comparing points from the two trajectories. The constant $\varepsilon$ defines the maximum distance in each dimension allowed for two points to be considered equivalent. Finally, $\mathrm{Head}(T_A)$ represents $T_A$ with the last point removed.

With careful use of the two constants, $\delta$ and $\varepsilon$, this method is highly robust to outliers, while performing well with trajectories of different lengths. The LCSS measure also generally functions well with different sampling rates. However, widely varying sampling rates may cause issues when many points must be left unmatched in the calculation. The length of the shorter trajectory can be used to normalize this method as an LCSS ratio, allowing for comparisons in the same scale.

The LCSS computation implemented in the R package uses the algorithm discussed by Vlachos et al. [20]. Using dynamic programming, the LCSS value for two trajectories with lengths $m$ and $k$ can be found with a complexity of $O((m + k)\delta)$.

3.4 Edit distance

The fundamental idea of edit distance is to count the minimum number of edits required to make two trajectories equivalent. Several variations of edit distance exist including edit distance with real penalty (ERP) [5] and edit distance on real sequence (EDR) [6]. The discussion below concerns edit distance on real sequence (EDR) as described by Chen et al. [6].

Definition 3: Using a matching threshold $\varepsilon \ge 0$, and trajectories $T_A$ and $T_B$ with lengths $m$ and $k$, the $\mathrm{EDR}_{\varepsilon}$ value (edit distance) defined in Chen et al.
[6] is adapted to the following:

$$
\mathrm{EDR}_{\varepsilon}(T_A, T_B) =
\begin{cases}
k, & \text{if } m = 0 \\
m, & \text{if } k = 0 \\
\min(\mathrm{EDR}_{\varepsilon}(\mathrm{Rest}(T_A), \mathrm{Rest}(T_B)) + \mathit{subcost},\ \mathrm{EDR}_{\varepsilon}(\mathrm{Rest}(T_A), T_B) + 1,\ \mathrm{EDR}_{\varepsilon}(T_A, \mathrm{Rest}(T_B)) + 1), & \text{otherwise}
\end{cases}
$$

In this definition, $\mathit{subcost} = 0$ if the first point of $T_A$ lies within the matching threshold of the first point of $T_B$ in every dimension, and $\mathit{subcost} = 1$ otherwise. Finally, $\mathrm{Rest}(T_A)$ represents trajectory $T_A$ with its first point removed (the trajectory now starts from the second point if one exists, otherwise it now has length 0).

EDR is relatively unaffected by outliers because the matching threshold reduces the increments to values of 0 and 1 only [6]. Therefore, even though outliers must still be processed using this method, each outlier can potentially only increase the EDR value by 1, and not some arbitrarily large value as in DTW or Fréchet.

EDR also performs well with trajectories that have varying sampling rates. The method does not require trajectories of equal length. However, different length trajectories will automatically inflate the edit distance. This is because every extra point is required to be edited out (or in) for the trajectories to be considered equivalent. This fact needs to be considered when choosing this method in practical applications.

Edit distance on real sequence (EDR), as discussed in Chen et al. [6], was implemented as the edit distance function in our R package. Dynamic programming was used to obtain an efficient calculation of EDR. With two trajectories $T_A$ and $T_B$, of length $m$ and $k$, this implementation has a complexity of $O(mk)$.

4 Comparison and evaluation

A sample dataset, containing trajectory data of delivery drivers in the UK, was used to help evaluate the similarity measures discussed in this paper. Out of a total of 23,400 segmented trajectories in our data set, a small sample of 50 randomly chosen pairs of trajectories was used for the analysis.
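For readers who wish to experiment outside R, the two threshold-based recurrences of Definitions 2 and 3 can be sketched in Python as follows. This is a minimal illustration and not the package code: the function names are hypothetical, both functions fill a full table (so the LCSS sketch runs in time proportional to the product of the lengths rather than using a banded table), and passing delta=None is this sketch's way of expressing unlimited index spacing. The tables are filled over prefixes; for EDR this gives the same value as the Rest-based recurrence because an edit distance is unchanged when both sequences are reversed.

```python
def within_threshold(p, q, eps):
    """True if p and q differ by at most eps in every dimension."""
    return all(abs(x - y) <= eps for x, y in zip(p, q))


def lcss(traj_a, traj_b, eps, delta=None):
    """Count of matched points under Definition 2 (full-table version)."""
    m, k = len(traj_a), len(traj_b)
    table = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            index_ok = delta is None or abs(i - j) <= delta
            if index_ok and within_threshold(traj_a[i - 1], traj_b[j - 1], eps):
                table[i][j] = 1 + table[i - 1][j - 1]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][k]


def edr(traj_a, traj_b, eps):
    """Minimum number of edits under Definition 3."""
    m, k = len(traj_a), len(traj_b)
    table = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        table[i][0] = i  # delete every point of the prefix of traj_a
    for j in range(k + 1):
        table[0][j] = j  # insert every point of the prefix of traj_b
    for i in range(1, m + 1):
        for j in range(1, k + 1):
            matched = within_threshold(traj_a[i - 1], traj_b[j - 1], eps)
            subcost = 0 if matched else 1
            table[i][j] = min(
                table[i - 1][j - 1] + subcost,
                table[i - 1][j] + 1,
                table[i][j - 1] + 1,
            )
    return table[m][k]


base = [(0, 0), (1, 0), (2, 0)]
extra_fix = [(0, 0), (1, 0), (1.05, 0), (2, 0)]
outlier = [(0, 0), (1, 50), (2, 0)]

print(lcss(base, extra_fix, eps=0.1), edr(base, extra_fix, eps=0.1))  # 3 1
print(lcss(base, outlier, eps=0.1), edr(base, outlier, eps=0.1))      # 2 1
```

In the second pair the outlier changes each value by exactly one, however far it lies from the other trajectory.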
All calculations were performed using the R package discussed earlier.

4.1 Normalization

The trajectories in the dataset vary significantly in terms of location and scale. This limits the number of useful comparisons and analyses between raw, untransformed trajectories. Therefore, each pair of trajectories was normalized to approximately match, to enable meaningful comparisons across the four similarity measures. The trajectories were rotated, scaled, and translated to align their start and end points.

This normalization, however, does lead to some bias in the results, which must be taken into account. As a result of the normalization, the LCSS calculation is guaranteed to contain a minimum of two points (start and end) if the index spacing distance allows it, while the edit distance calculation is guaranteed to have two points which do not require edits. This problem can also be seen in DTW where the start and end points will both add zero distance to the final value. Although this fact changes all of the absolute values in a constant way, the ratio values are altered in a non-linear manner, and this was taken into consideration for the analysis.

4.2 The similarity measures

Each of the four similarity measures was computed for the 50 pairs of trajectories. The allowed point index spacing (for LCSS and DTW) was set to unlimited to allow for the large variance in trajectory length. The distance for points to be considered equivalent (for LCSS and edit distance) was set to 100 m. This value was set using the knowledge that most trajectories range from hundreds of meters to several kilometers, and allows for a wide range of values to be obtained in the analysis.

The LCSS, DTW, and edit distance values were converted to ratios. This was done to ensure that the large variances in trajectory length did not dominate the results. The Fréchet distance was left unchanged because it is mainly unaffected by the length of the trajectories.
The end points were left out of the ratio calculations to account for the normalization performed earlier. The DTW and edit distances used the larger trajectory length (minus 2 for the end points) to calculate their ratios (e.g. $\mathrm{DTW}(T_A, T_B)/(\max(|T_A|, |T_B|) - 2)$). The LCSS ratio used the minimum trajectory length (minus 2 for the end points) as discussed earlier.

Table 1 shows the correlations between the different similarity values computed across the four similarity measures (footnote 2). There is a strong (positive) correlation between Fréchet distance and DTW ratio similarity values, while a strong (negative) correlation can be seen between LCSS ratio and edit distance ratio. Correlations in all other pairings of similarity measures are much weaker (Table 1).

Table 1: The correlation coefficients between each of the similarity measures.

                       LCSS Ratio   Fréchet Distance   DTW Ratio
Fréchet Distance       -0.3707      –
DTW Ratio              -0.4391      0.9587             –
Edit Distance Ratio    -0.8340      0.3202             0.3866

Footnote 2: Strictly speaking, Fréchet, DTW, and edit distance are dissimilarity measures (higher values equal greater dissimilarity) while LCSS is a similarity measure (higher values equal greater similarity).

Figure 1 shows scatterplots comparing the two pairs of most highly correlated similarity measures. Although some correlation was expected between the Fréchet distance and DTW ratio (both use absolute distances in their calculation), the correlation is remarkably strong (correlation coefficient of 0.9587), particularly given that the underlying calculations are rather different (cf. Sections 3.1 and 3.2).

Figure 1: Scatter plots comparing the most highly correlated similarity measures, DTW ratio against Fréchet distance (left) and LCSS ratio against edit distance ratio (right).

LCSS and edit distance also exhibit strong (negative) correlations to one another.
As discussed above, the LCSS and edit distances were converted to ratios, and so their values range from 0 to 1 (although 1 is most highly dissimilar in the case of edit distance, and most highly similar in the case of LCSS). Thus, the strong negative correlation coefficient (−0.8340) between the two is in line with the expectation that these methods would yield similar results. The slightly lower correlation seen between the LCSS ratio and edit distance, when compared to Fréchet distance and DTW ratio, is likely caused by the large variations in trajectory length.

Despite the agreement, occasional differences in DTW ratio and Fréchet distance are visible in Figure 1. Large discrepancies are always possible when comparing DTW and Fréchet distances, because they present no bounds on how large the (dis)similarity can be. This is unlike LCSS and edit distance ratios, which have a maximum ratio of 1. The fact that Fréchet distance and DTW values can grow so large makes them sensitive to outliers as discussed earlier. However, this feature can also help to emphasize extreme cases of trajectory difference. At the other extreme, comparing two equal trajectories using any of the above similarity measures will always yield a value of 0 (1 for LCSS ratio).

These two pairs of measures, Fréchet and DTW, and LCSS and edit distance, appear well correlated in practice for this dataset. However, it is possible to find instances of trajectories that have dramatically different similarity values across these pairs. The left image in Figure 2, for example, shows two trajectories that might be considered similar, although one has a large displacement “peak” near its center. Because the DTW value combines many distances, this peak will not influence it greatly. However, the Fréchet distance will reflect considerable dissimilarity in this pair of trajectories, due to the outlying peak.

The right image in Figure 2 presents another pair of trajectories to be considered.
Again, these trajectories are relatively similar, although one contains more points clustered in the middle, while the other only contains two points. With a reasonable matching threshold, LCSS will consider these trajectories perfectly equivalent because all of one trajectory's points are matched up. However, edit distance highlights that the five middle points have no match and therefore the trajectories are quite dissimilar. The DTW value will also be relatively large because it requires each point to be matched to another, and therefore the trajectories are considered more dissimilar. Finally, the Fréchet distance value will be very low, representing a view of very similar trajectories because it does not require point matching.

Figure 2: Two pairs of example trajectories. All four trajectories begin in the lower left corner.

5 Concluding remarks

Similarity measures are a useful, and often underused, tool for the analysis of trajectories across a wide range of applications. This paper highlights some of the commonalities and differences amongst four of the most frequently encountered similarity measures, both in theory and in practice. While these measures are today most often used for indexing of spatial databases, they have important applications to the interpretation of pedestrian, vehicle, sport, and animal movement. By implementing an R package capable of easily computing these measures in a single, integrated environment, our aim has been to provide users with easier access to exploring the use of these measures in practical applications.

Acknowledgments

The authors would like to acknowledge collaborators at the 2014 Lorentz Workshop for “Geometric Algorithms in the Field” for providing the initial idea of comparing these four similarity metrics: Kevin Buchin, Patrick Laube, Dongliang Peng, Ross Purves, Stef Sijben, Rodrigo Silveira.

References

[1] H. Alt and M. Godau. Computing the Fréchet distance between two polygonal curves.
International Journal of Computational Geometry and Applications, 5(01n02):75–91, 1995.
[2] D. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In KDD workshop, volume 10, pages 359–370. Seattle, WA, 1994.
[3] K. Buchin, M. Buchin, J. Gudmundsson, M. Löffler, and J. Luo. Detecting commuting patterns by clustering subtrajectories. In S.-H. Hong, H. Nagamochi, and T. Fukunaga, editors, Algorithms and Computation, volume 5369 of Lecture Notes in Computer Science, pages 644–655. Springer, 2008.
[4] M. Buchin, S. Dodge, and B. Speckmann. Similarity of trajectories taking into account geographic context. Journal of Spatial Information Science, 9(1):101–124, 2014.
[5] L. Chen and R. Ng. On the marriage of Lp-norms and edit distance. In Proc. 30th International Conference on Very Large Data Bases, pages 792–803, 2004.
[6] L. Chen, M. T. Özsu, and V. Oria. Robust and fast similarity search for moving object trajectories. In Proc. ACM SIGMOD International Conference on Management of Data, SIGMOD ’05, pages 491–502, New York, NY, USA, 2005. ACM.
[7] S. Dodge. Exploring Movement Using Similarity Analysis. PhD thesis, Universität Zürich, 2011.
[8] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. SIGMOD Rec., 23(2):419–429, May 1994.
[9] M. M. Fréchet. Sur quelques points du calcul fonctionnel. Rendiconti del Circolo Matematico di Palermo, 22(1):1–72, 1906.
[10] J. Gudmundsson, P. Laube, and T. Wolle. Movement patterns in spatiotemporal data. In Encyclopedia of GIS, pages 726–732. Springer US, 2008.
[11] J. Gudmundsson, P. Laube, and T. Wolle. Computational movement analysis. In Springer handbook of geographic information, pages 423–438. Springer, 2012.
[12] J. Haase and U. Brefeld. Finding similar movements in positional data streams. In Proc. ECML/PKDD Workshop on Machine Learning and Data Mining for Sports Analytics, 2013.
[13] E. Keogh and M. Pazzani.
Scaling up dynamic time warping for datamining applications. In Proc. 6th ACM International Conference on Knowledge Discovery and Data Mining, pages 285–289, 2000.
[14] E. Keogh and C. A. Ratanamahatana. Exact indexing of dynamic time warping. Knowledge and Information Systems, 7(3):358–386, 2005.
[15] P. Laube and R. S. Purves. How fast is a cow? Cross-scale analysis of movement data. Transactions in GIS, 15(3):401–418, 2011.
[16] T. Li, H. Chang, M. Wang, B. Ni, R. Hong, and S. Yan. Crowded scene analysis: A survey. IEEE Transactions on Circuits and Systems for Video Technology, (99):1–20, 2014.
[17] M. Perše, M. Kristan, S. Kovačič, G. Vučkovič, and J. Perš. A trajectory-based analysis of coordinated team activity in a basketball game. Computer Vision and Image Understanding, 113(5):612–621, 2009. Computer Vision Based Analysis in Sport Environments.
[18] P. Ranacher and K. Tzavella. How to compare movement? A review of physical movement similarity measures in geographic information science and beyond. Cartography and Geographic Information Science, 41(3):286–307, 2014.
[19] K. Toohey. SimilarityMeasures: Trajectory Similarity Measures, 2015. R package version 1.4, http://CRAN.R-project.org/package=SimilarityMeasures.
[20] M. Vlachos, G. Kollios, and D. Gunopulos. Discovering similar multidimensional trajectories. In Proc. 18th International Conference on Data Engineering (ICDE), pages 673–684, 2002.
[21] H. Wang, H. Su, K. Zheng, S. Sadiq, and X. Zhou. An effectiveness study on trajectory similarity measures. In Proc. 24th Australasian Database Conference, volume 137, pages 13–22, 2013.
[22] Y. Yuan. Image-based gesture recognition with support vector machines. ProQuest, 2008.
[23] Z. Zhang, K. Huang, and T. Tan. Comparison of similarity measures for trajectory clustering in outdoor surveillance scenes. In Proc. 18th International Conference on Pattern Recognition (ICPR), volume 3, pages 1135–1138, 2006.
Symbolic trajectories and application challenges

Maria Luisa Damiani (1), Hamza Issa (1), Ralf Hartmut Güting (2), Fabio Valdes (2)
(1) Department of Computer Science, University of Milan, Italy. E-mail: {maria.damiani, hamza.issa}@unimi.it
(2) FernUniversität Hagen, Germany. E-mail: {fabio.valdes,rhg}@fernuni-hagen.de

Abstract

Describing the location history of moving objects exclusively in geometric terms is no longer sufficient; more expressive data models that capture the complexity and heterogeneity of movement data are needed. Following this trend, the data model of symbolic trajectories has been recently proposed for the representation of content-rich trajectories in databases. The model provides a simple notation and a powerful and fully operational pattern-based query language for trajectory matching and rewriting. In this paper, we overview the key features of the model and sketch two application cases, the former regarding the integration of heterogeneous mobility data (GPS and transportation modes), the latter the representation of migration patterns in animal ecology. The goal is to show the flexibility of the model and, at the same time, to outline possible directions of research.

1 Introduction

Recent years have witnessed the proliferation of applications which collect and record the location history of large numbers of moving objects with high accuracy. For example, Google can capture and record the location history of all the devices with a Google account (if location tracking is explicitly granted by users; see footnote 1); in the field of animal ecology, modern animal telemetry and sensor networks (e.g. GPS receivers and other sensors mounted on devices deployed on animals, such as collars) enable the collection of fine-grained animal trajectories [3]. Trajectory data is an invaluable source of behavioral information on moving objects. Behavioral information is important for many different purposes.
For example, it facilitates the development of advanced on-line information services; moreover, it opens up new research opportunities in diverse scientific disciplines, such as sociology, biology, and animal ecology. Dealing with rich trajectory data stores, however, raises important technical challenges calling for advanced knowledge discovery, data protection, data visualization and data management solutions [10]. Our research focuses on the latter aspect.

Trajectory databases have been around since the early 2000s [6]. Existing databases, however, present important limitations, especially concerning the expressiveness of the data models, which describe the objects’ trajectories exclusively in terms of timestamped points sampling continuous movement. In reality, mobility data is not only ’big’ in size and thus computationally demanding, but also exhibits great variety. Variety concerns several aspects; in particular, we highlight four dimensions, labeled location, context, time evolution, and movement granularity, which are briefly discussed below:

(i) Location. It is often the case that locations are not directly expressed in terms of geographical coordinates but rather in symbolic form. The spatial reference system is thus indirect. Examples of symbolic locations are the cell identifiers reporting the position of GSM network users, rooms and floors in indoor spaces, and the places that the users of location-based social networks (LBSNs) share with their friends when they report a check-in.

(ii) Context. Applications may require supplementary time-varying information beyond location, typically the context in which the movement takes place. Contextual information includes, for example, the transportation means used by individuals to move in a city, the land type traversed by animals during their daily activities, the people met during a journey.

Footnote 1: http://www.androidcentral.com/understanding-googles-android-location-tracking
Therefore the movement can be seen as consisting of multiple dimensions, where each dimension is a time-varying function. Location is just one of these dimensions.

(iii) Time evolution. The contextual data of concern can vary either discretely or continuously in time. For instance, the movement of a vehicle can be seen as continuous with respect to location but discrete with respect to the roads traversed by the vehicle. These dimensions are obviously interrelated.

(iv) Granularity. Trajectories can be described at different levels of abstraction or granularity. For example, certain mobility patterns extracted from geometric trajectories can be represented themselves as trajectories, though at coarser granularity. Moreover, the spatial dimension of movement may become irrelevant at a certain level of abstraction. Applications may require rolling up and drilling down through these abstraction levels to perform multi-scale analysis.

Unconventional trajectories include the sequences of user check-ins in LBSNs; the sequences of activities, e.g. shopping, working, inferred from mobility data; the movement of individuals in an urban setting described by both a GPS trajectory (continuous) and by a discrete sequence of transportation modes. The challenge is to capture this multiplicity of views in a unifying and flexible trajectory data model.

1.1 Brief overview of the research context and paper contribution

Semantic trajectories are a first step in that direction [1, 11, 12, 10]. In reality, the semantic enrichment of spatial and spatio-temporal objects is not a novelty. For example, coordinate locations can be enriched (or annotated) with places in e.g. [8]; another popular form of annotation is the transportation mode, e.g. [19]; sequences of activities are represented in e.g. [13]. In general, these annotations can be extracted from geometric data using analytical techniques, or be directly specified by the user or acquired from sensors.
What is new in the notion of semantic trajectory with respect to the existing work is the idea of defining a framework enabling the construction, representation, and analysis of annotated trajectories [12, 16, 10]. While such research has led to diverse techniques and applications, it is worth observing that there is still no univocal definition of semantic trajectory.

In the database realm, recent research focuses on the development of access mechanisms for the fast processing of specific classes of queries, e.g. k-nn queries [18, 2], and data mining tasks, e.g. frequent sequential pattern mining [17], typically over trajectories representing sequences of places. Pattern matching techniques for querying trajectories annotated with symbolic locations are proposed in [5, 15]. More recent is the concern for the development of database models [9], which is where the research on symbolic trajectories fits in.

The goal of this short paper is to offer a few hints on the application potential of symbolic trajectories, highlighting as well some open issues. The rest of the paper is organized as follows: Section 2 briefly presents the key features of the symbolic trajectory data model, Section 3 sketches the two application cases. The paper ends with some final considerations.

2 Symbolic trajectories

Symbolic trajectories is basically a data model for the representation of discrete trajectories in databases [14, 7]. Abstractly, a symbolic trajectory is a time-dependent function which takes values in a categorical domain. The domain consists of a set of strings called labels. Labels can represent, for example, activities, e.g. shopping, road names, e.g. Fifth Avenue, weather conditions, e.g. raining. A symbolic trajectory has a simple structure which consists, in its basic form, of a sequence of pairs (units) $\langle (i_1, l_1), \ldots, (i_n, l_n) \rangle$ where $i_j = [t_{j1}\ t_{j2}]$ is a time interval and $l_j$ the label. The time intervals are disjoint and temporally ordered.
For example, a simple symbolic trajectory is:

    < ([8:45 - 17:00] working)([17:00 - 18:30] shopping) (..)>

If we think of the label as an event, the interval specifies when such an event takes place. Multiple labels can be specified for the same interval to denote, for example, a set of events which occur at the same time. Moreover, a label can be associated with a spatial object with a reference to a precise geometric location and extent. The association of a label and a spatial object is called a place. Places can be used, for example, to denote symbolic locations, such as the check-in venues in LBSNs.

Symbolic trajectories are accessed using a language for pattern matching and rewriting. Matching is used to retrieve the trajectories satisfying the pattern. Rewriting is used to extract or redefine the parts of trajectories matching the given pattern. Patterns can be defined using regular expressions, variables, and a variety of conditions. A simple pattern is for example:

    * (_ working) (_ shopping) *

This pattern matches symbolic trajectories where working is followed by shopping. The symbol * matches any sequence of units, the symbol _ any unit component. A more complex pattern using variables and conditions on variables is:

    * (morning working) X(_ shopping)* //duration X.time > 2 * hour

The matching trajectories are those in which the working activity takes place in the morning and is followed by a shopping activity taking more than 2 hours. This pattern contains the variable X, which is bound to the unit that follows it, and a temporal condition (the symbol // separates the pattern from the condition); X.time denotes the time interval in the unit denoted by X. Patterns can be extended to rewriting rules. For example:

    * X(morning working) Y(_ shopping)* //duration Y.time > 2 * hour => X Y

This rule returns, for each input trajectory, all the parts that match the two adjacent units denoted by X and Y.
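To make the matching semantics more concrete, here is a small Python sketch of a matcher for patterns such as `* (_ working) (_ shopping) *`. It is only an illustration under simplifying assumptions, not the SECONDO pattern language: units are `(start_hour, end_hour, label)` tuples, `None` stands for the wildcard `_`, the `morning` predicate is a hypothetical definition, and variables, conditions and rewriting are not covered.

```python
def matches(units, pattern):
    """Return True if the whole sequence of units matches the pattern.

    units   : list of (start_hour, end_hour, label) tuples
    pattern : list whose items are either "*" (any sequence of units,
              possibly empty) or a (time_test, label) pair in which
              None plays the role of the wildcard "_"
    """
    if not pattern:
        return not units
    head, rest = pattern[0], pattern[1:]
    if head == "*":
        # Try every possible number of units absorbed by the star.
        return any(matches(units[k:], rest) for k in range(len(units) + 1))
    if not units:
        return False
    time_test, label = head
    start, end, unit_label = units[0]
    if label is not None and label != unit_label:
        return False
    if time_test is not None and not time_test(start, end):
        return False
    return matches(units[1:], rest)


def morning(start, end):
    """Hypothetical meaning of 'morning': the unit lies within 06:00-12:00."""
    return 6 <= start and end <= 12


day = [(8.75, 17.0, "working"), (17.0, 18.5, "shopping"), (18.5, 23.0, "home")]
work_then_shop = ["*", (None, "working"), (None, "shopping"), "*"]
morning_work_then_shop = ["*", (morning, "working"), (None, "shopping"), "*"]
```

With these definitions, `matches(day, work_then_shop)` holds, while `matches(day, morning_work_then_shop)` does not, because the working unit extends past noon. The backtracking over `*` is exponential in the worst case, which is acceptable for a sketch but not for a database operator.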
At system level, following the framework in [6], symbolic trajectories are introduced through the definition of new data types. In the simplest case the data type is moving(label) (or mlabel). The pattern language and the type system are seamlessly integrated into the relational model. For example, we can construct a relation describing the trips of a person, with attributes describing the symbolic information on road names and on transportation modes:

    Trips (Id: int, RoadName: mlabel, TransportMode: mlabel)

The query "retrieve the trips that start from a given road in the morning and end at the same road in the evening" can be formulated simply as follows:

    SELECT *
    FROM Trips
    WHERE RoadName matches 'X(morning _) + Y(evening _) // X.label = Y.label'

The language is fully operational, running on a platform (SECONDO) offering a rich and extensible set of data types that can be used for formulating a variety of conditions on pattern variables.

3 Application challenges

In what follows, we overview two application cases where symbolic trajectories are used for very different purposes. The former case regards the integration of heterogeneous mobility data (GPS and transportation modes), the latter the representation of specific mobility patterns (migrations) in animal ecology.

3.1 Querying GPS trajectories and transportation modes

A prominent example of a dataset containing both continuous and discrete movement data is GeoLife [20]. GeoLife is a well-known dataset reporting the traces of a group of individuals monitored in Beijing for a few years. Interestingly, GeoLife consists not only of the GPS tracks in the form of timestamped point sequences, i.e. $\{(t_i, p_i)\}_{i \in [1,n]}$, but also, for a subset of users only, of the temporally annotated sequences of the transportation modes. A sequence of transportation modes takes the form $\{(I_i, l_i)\}_{i \in [1,m]}$ where $I_i$ is a time interval and $l_i$ a string in the set {walk, bike, car, bus, airplane, other}.
Notably, this dataset exemplifies a situation increasingly common in applications, namely the coexistence of heterogeneous trajectory data describing different aspects of the movement. Querying heterogeneous trajectory data is a challenging issue. In what follows, we show a first approach which leverages the querying capabilities of symbolic trajectories. We first describe how to store the GeoLife dataset. The data is simply stored in a table consisting of three attributes: the user id, the geometric trajectory GPStrack of type mpoint, and the symbolic trajectory Transport of type mlabel.

    geoLife(UserId: integer, GPStrack: mpoint, Transport: mlabel)

A first simple query, which only uses the matches operator, is the following:

Query 1: retrieve the individuals that use buses or trains in the morning and again in the evening of the same day

    SELECT UserId
    FROM geoLife
    WHERE Transport matches '* X[(morning bus) | (morning train)]+ Y[(evening bus) | (evening train)]+ * // (Y.end - X.start) < 1 * day'

The specified rule contains the pattern and the temporal condition. The next example shows how to replace a set of labels with a more general term using the rewrite operation. This mechanism can be used, for example, to perform roll-up operations over a hierarchy of terms and thus perform multi-scale analysis.

Query 2: retrieve the trajectories of the individuals that use trains and buses in the morning and in the evening, replacing the sequence of transportation means with a more general term (i.e. 'public transportation').
    SELECT rewrite(Transport, '* X[(morning bus) | (morning train)]+ Y[(evening bus) | (evening train)]* // (Y.end - X.start) < 1 * day => T // T.label := publicTransport, T.start := X.start, T.end := Y.end') AS generalTransport
    FROM geoLife

A slightly different but more challenging query is to retrieve the individuals who not only use public transportation in the morning and in the evening, possibly after walking for a while, but who also return near the point from which they left in the morning. This query specifies two conditions: one is on the symbolic trajectory; the other is a spatial condition (i.e. the initial and final points are to be close to each other). This query cannot be answered by evaluating the symbolic and spatial components separately. Rather, a tighter integration of the trajectory representations, both at language and system level, is needed. A first approach to the problem has been shown in [4]. The idea is to leverage the temporal correlation between the two trajectories by retrieving the sub-trajectories which match the symbolic pattern and, based on them, temporally restricting the spatial trajectories. At language level, this connection between the symbolic and the spatial component in a comprehensive pattern is realized through the use of global variables consisting of both a symbolic and a spatial component. At database system level, a new data type hybrid is defined for handling these composite trajectories, while the operators matches and rewrite are extended accordingly. The query is formulated as follows. Let C_s and C_p be the rewrite rule and the spatial condition, respectively:

    let C_s = '* X_s(morning walk)+ * Y_s[(morning bus) | (morning train)]+ * W_s[(evening bus) | (evening train)]+ Z_s(evening walk) * // (Z_s.end - X_s.start) < 1 * day => X_s Y_s W_s Z_s'
    let C_p = 'distance(val(final(X_p)), val(initial(Z_p))) < 40'

The global variables are X, Y, W, Z, where the variable subscript (s or p) indicates the dimension (symbolic or spatial).
The extended rewrite operation over the hybrid trajectory HTraj is called as follows:

    SELECT h_rewrite(HTraj, C_p, C_s)
    FROM geoLife

where h_rewrite() is the extended rewrite operation. One trajectory resulting from the query is shown in Figure 1.(a). A similar example is shown in Figure 1.(b), displaying the result of a query involving the spatial containment predicate. Specifically, the query is to retrieve the trajectories that start from Hong Kong by boat to arrive at Macao, where a bus is taken before returning to Hong Kong by boat. For brevity, only the symbolic and spatial conditions are reported below:

    let C_s = '* Y_s[(_ boat)]+ K_s[(_ bus)]+ Z_s[(_ boat)]+ * => Y_s K_s Z_s'
    let C_p = 'val(initial(Y_p)) inside Hong Kong, val(final(Y_p)) inside Macao'

The query language can thus be used to support exploratory searching across heterogeneous mobility data. It can be noted that these queries are quite compact. If they were expressed using an existing database (spatial, moving object), the queries would be extremely long and complex or would require ad-hoc programs. This results in a more usable data analysis platform.

3.2 Representing the migratory behavior

In this second case, we use symbolic trajectories to represent the behavioral information extracted from geometric trajectories whenever such knowledge takes the form of a trajectory, and thus time still plays a key role. Whilst in the previous case we have emphasized the analytical capabilities of the symbolic query language, we now focus more on the effectiveness of the data model in describing the individual patterns extracted from geometric trajectories. The case study regards the representation of the migratory behavior.
In particular, the study has been conducted for a group of wild animals (roe deer), equipped with a low sampling rate GPS collar and tracked for a period covering a few seasons [3].

Figure 1: (a) home-work trajectory; (b) trajectory from Hong Kong to Macao, followed by a travel by bus in Macao before the return by boat

The animals of this species can either migrate or be stationary, a behavior known as partial migration in animal ecology. Moreover, whenever an animal migrates, the migration takes place with modalities and times that, although respecting certain general patterns such as seasonality, can vary from animal to animal. Therefore every animal has its own migratory behavior. Because of that, the pattern is defined as an individual pattern. At first sight, this migratory behavior can be considered a stop-and-move pattern [11], i.e. the moving object stays in a region for some time and then moves to some other stay region. In reality the migratory behavior is significantly different. In fact, the object residing in a given stay region can experience brief excursions after which the object returns to the stay region. Therefore the residence consists of periods of presence interleaved with periods of absence, while a migration is the definitive transition from one stay region to another. Extracting this kind of pattern raises interesting challenges because no assumption can be made on speed, direction and other movement characteristics, or on the distribution of points. Recently, a time-aware, density-based clustering algorithm has been proposed to extract the stay regions where the animals reside for most of their time and the transitions from one stay region to another [3]. While the details of the algorithm are not relevant for this discussion, more interesting is the question of how to represent the migratory behavior resulting from the clustering.
One could opt, for example, for a graphical representation relying on modern visual analytics techniques. Though important, visualization is not sufficient, because the behavioral information cannot be easily processed. Another simple approach is to represent the cluster through a set of labeled points, i.e. a point $(p_i, t_i)$ is labeled with the identifier of the cluster the point belongs to. Accordingly a cluster, say c, can be represented using a point-based representation, i.e. through a temporally ordered sequence of timestamped points. This representation is however unsatisfactory. For example, given two points $p_i$, $p_{i+1}$ in c, the point-based representation does not specify where the object is in the time interval $[t_i, t_{i+1}]$ (independently of the inherent location uncertainty). We recall, in fact, that the object can experience periods of absence, therefore subsequent points in the cluster sequence are not necessarily consecutive in the original trajectory. The information on presence/absence can be naturally expressed using symbolic trajectories. In fact, the algorithm keeps track of the periods of presence and absence. The information on the periods of presence can be translated into units reporting the time interval and the cluster id for each of these periods. This results in a fine-grained representation of the movement. Figure 2 displays the trajectory of an animal and the resulting clusters and migration path.

Figure 2: (a) The space-time cube representing the geometric trajectory of the animal; (b) the trajectory at cluster level [3].

The symbolic trajectory (a fragment) detailing when the animal is inside/outside the cluster labeled H1 is reported below.
    ["2006-06-17 12:02:00.016" "2006-06-18 04:01:00.024"] "H1"
    ["2006-06-18 20:00:00.053" "2006-06-19 00:00:00.054"] "H1"
    ["2006-06-21 20:00:00.054" "2006-06-24 04:01:00.055"] "H1"
    ["2006-06-24 20:02:00.014" "2006-06-24 20:02:00.014"] "H1"
    ["2006-06-25 04:01:00.012" "2006-06-30 04:00:00.053"] "H1"
    ["2006-06-30 20:00:00.055" "2006-07-02 08:00:00.053"] "H1"
    ["2006-07-03 04:01:00.049" "2006-07-23 08:01:00.024"] "H1"
    ["2006-07-23 20:00:00.053" "2006-08-14 16:01:00.042"] "H1"
    ["2006-08-18 12:01:00.012" "2006-09-13 00:01:00.012"] "H1"
    .....

The symbolic trajectory offers an unprecedented level of detail on the animal's movement inside the home range, which is of great ecological interest. Moreover, the movement can be observed at different levels of temporal and thematic detail. For example, rewriting rules or classification rules can be applied to automatically generate more synthetic descriptions facilitating multi-scale analysis.

4 Conclusion

Providing tools enabling the flexible representation of mobility behavior is a prominent and promising research direction. Symbolic trajectories are the first running solution proposed to solve the problem from a database perspective. The research is however only at the beginning. Extending the concept of symbolic trajectory to that of multi-dimensional trajectory, mediating between expressiveness and efficiency and, at the same time, experimenting with the solutions on real applications, are major challenges for future work.

References

[1] L. Alvares, V. Bogorny, B. Kuijpers, J. de Macedo, B. Moelans, and A. Vaisman. A model for enriching trajectories with semantic geographical information. In Proc. ACM GIS, page 22, 2007.
[2] G. Cong, H. Lu, B. Chin-Ooi, D. Zhang, and M. Zhang. Efficient spatial keyword search in trajectory databases. CoRR, abs/1205.2880, 2012.
[3] M. L. Damiani, H. Issa, and F. Cagnacci. Extracting stay regions with uncertain boundaries from GPS trajectories: A case study in animal ecology. In Proc.
SIGSPATIAL, 2014.
[4] M. L. Damiani, H. Issa, R. H. Güting, and F. Valdes. Hybrid queries over symbolic and spatial trajectories: A usage scenario. In Proc. MDM, 2014.
[5] C. du Mouza and P. Rigaux. Mobility patterns. Geoinformatica, 9:297–319, 2005.
[6] R. H. Güting, M. H. Böhlen, M. Erwig, C. S. Jensen, N. A. Lorentzos, M. Schneider, and M. Vazirgiannis. A foundation for representing and querying moving objects. ACM Trans. Database Syst., 25(1):1–42, 2000.
[7] R. H. Güting, F. Valdes, and M. L. Damiani. Symbolic trajectories. Technical report, Fernuniversität in Hagen, Informatik-Report 369, 12/2013, 2013.
[8] J. Liu, O. Wolfson, and H. Yin. Extracting semantic location from outdoor positioning systems. In Proceedings of the 7th International Conference on Mobile Data Management, page 73, 2006.
[9] N. Pelekis and Y. Theodoridis. Mobility Data Management and Exploration. Springer, 2014.
[10] C. Renso, S. Spaccapietra, and E. Zimányi. Mobility Data – Modeling, Management, and Understanding. Cambridge University Press, 2013.
[11] S. Spaccapietra, C. Parent, M. L. Damiani, J. de Macedo, F. Porto, and C. Vangenot. A conceptual view on trajectories. Data Knowl. Eng., 65:126–146, 2008.
[12] S. Spaccapietra, C. Parent, C. Renso, G. Andrienko, N. Andrienko, V. Bogorny, M. Damiani, A. Gkoulalas-Divanis, J. Macedo, N. Pelekis, Y. Theodoridis, and Z. Yan. Semantic Trajectories Modeling and Analysis. ACM Computing Surveys, 45(4):42, 2013.
[13] E. Tapia, S. Intille, and K. Larson. Activity recognition in the home using simple and ubiquitous sensors. Springer, 2004.
[14] F. Valdés, M. L. Damiani, and R. H. Güting. Symbolic trajectories in SECONDO: pattern matching and rewriting. In Proc. DASFAA, 2013.
[15] M. R. Vieira, P. Bakalov, and V. J. Tsotras. Querying trajectories using flexible patterns. In Proc. of the 13th Int. Conf. on Extending Database Technology, EDBT '10, pages 406–417, 2010.
[16] Z. Yan and D. Chakraborty. Semantics in mobile sensing. Morgan & Claypool, 2014.
[17] C. Zhang, J. Han, L. Shou, J. Lu, and T. F. L. Porta. Splitter: Mining fine-grained sequential patterns in semantic trajectories. PVLDB, 7(9):769–780, 2014.
[18] K. Zheng, S. Shang, N. Yuan, and Y. Yang. Towards efficient search for activity trajectories. In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013), 2013.
[19] Y. Zheng, L. Liu, L. Wang, and X. Xie. Learning transportation mode from raw GPS data for geographic applications on the web. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, pages 247–256, 2008.
[20] Y. Zheng, X. Xie, and W.-Y. Ma. GeoLife: A Collaborative Social Networking Service among User, Location and Trajectory. IEEE Data Eng. Bull., 33(2):32–39, 2010.

Planning Sightseeing Tours using Crowdsensed Trajectories

Igo Brilhante¹, Jose Antonio Macedo¹, Franco Maria Nardini², Raffaele Perego², Chiara Renso²
¹ Department of Computer Science, Federal University of Ceará, Brazil {igobrilhante,jose.macedo}@lia.ufc.br
² ISTI-CNR, Italy {nardini,perego,renso}@isti.cnr.it

Abstract

We present an application where semantically enriched trajectories obtained from crowdsensed data are used to build an advanced system for planning personalized sightseeing tours, called TripBuilder. The interesting feature of TripBuilder is that it uses Wikipedia content and trajectories of previous tourists collected from georeferenced Flickr photos in a complex spatio-temporal framework. The objective is to address, in an unsupervised way, the problem of suggesting a budgeted sightseeing tour based on the preferences of the tourist and the time available for the visit. We present a few highlights of how TripBuilder works, along with a research agenda where we discuss the role of semantically enriched trajectories and crowdsourced location data in planning itineraries.
1 Introduction

Tourists approaching their destination for the first time face the problem of planning a sightseeing itinerary that covers the most subjectively interesting attractions and fits the time available for their visit. Valuable information can nowadays be gathered from many digital sources, e.g., travel guides, maps, institutional sites, travel blogs. Nevertheless, tourists still need to choose their preferred Points of Interest (PoIs), and to guess how much time is needed to visit them and to move from one attraction to the next. In this paper we discuss TripBuilder, an unsupervised system helping tourists to build their own personalized sightseeing tour. Given the target destination, the time available for the visit, and the tourist's profile, TripBuilder recommends a time-budgeted tour that maximizes the tourist's interests and takes into account both the time needed to enjoy the attractions and the time needed to move from one PoI to the next. Moreover, the knowledge base feeding the recommendation model is entirely and automatically extracted from publicly available Web services, namely Wikipedia, Flickr and Google Maps.

TripBuilder exploits the publicly available content shared on Flickr by tourists. Unofficial statistics claim that about 1.4 million public photos are uploaded every day, and that upload peaks occur just after holiday periods¹. Each photo comes with very useful information such as tags, comments and likes from the Flickr social network, number of views, information about the user, timestamp, and GPS coordinates of the place where the photo was taken. This allows us to reconstruct the movements of users and their interests by analyzing the time-ordered sequence of their photos.

¹ http://www.flickr.com/photos/franckmichel/6855169886/

The process of recognizing relevant PoIs given such a set of photos is, however, not trivial in the absence of a common database of geo-referenced PoIs.
Fortunately, a Wikipedia² page is associated with most entities of interest for tourism. From this Wikipedia page we can thus easily extract the (multilingual) name of the PoI, its geographic coordinates, and the categories to which the PoI belongs according to a weak but precise ontology (i.e., the PoI is a church, a square, a museum, a historical building, a bridge, etc.). By clustering and spatially matching tourists' photo albums from Flickr against the relevant PoIs extracted from Wikipedia pages, we can thus derive a knowledge base that represents the behavior of people visiting a given city. In this knowledge base the popularity of a PoI is estimated from the number of photos available or from the number of different visitors that shot photos there. Furthermore, from the distribution of the timestamps of the first and last photos taken at a given PoI, we can roughly estimate the average time needed for visiting it. The time needed to move from a PoI to the next one in the sightseeing itinerary is instead computed by querying Google Maps. Finally, the Wikipedia categories of the PoIs visited by a given tourist are used to build her profile and to characterize the trajectories across the PoIs. For example, if a tourist takes many pictures of churches and museums, we can infer a preference for cultural/historical attractions. Analogously, we can aggregate this information at the level of the trajectories mined from Flickr photo albums to estimate the relevance of a given trajectory for the tourist profile.

In [1, 3], we discussed the TripBuilder methodology for generating personalized sightseeing tours, while in [2] we described the Web application implementing the TripBuilder system (see Figure 4). This paper is organized as follows. We give a short introduction to the TripBuilder solution and the related problems in Section 2.
Then, in Section 3 we show the current state of its distributed architecture that, by exploiting the unsupervised approach of TripBuilder, aims at automatically building the knowledge base for a large number of different cities. Finally, we discuss new directions of research in Section 4.

2 Building the TripBuilder knowledge base

The generation of the knowledge base used by TripBuilder is a multi-step unsupervised process that we detail in the following.

PoIs. The first step is to identify the set of PoIs in the target geographical region. Given the bounding box $BB_{city}$ containing the city of interest, we download all the geo-referenced Wikipedia pages falling within this region. We assume each geo-referenced Wikipedia named entity whose geographical coordinates fall into $BB_{city}$ to be a fine-grained Point of Interest. For each PoI, we retrieve its descriptive label, its geographic coordinates as reported in the Wikipedia page, and the set of categories the PoI belongs to. Categories are reported at the bottom of the Wikipedia page, and are used to link articles under a common topic. They form a hierarchy, although sub-categories may be members of more than one category. By considering the set C of categories associated with all the PoIs, we generate the normalized relevance vector of each PoI. We then perform a density-based clustering to group into a single PoI the sightseeing entities which are very close to each other. Clustering very close PoIs is important since a tourist in a given place can enjoy all the attractions in the surroundings even if she does not take photos of all of them. Moreover, it aims at reducing the sparsity that might affect trajectory data. To cluster the PoIs we use DBSCAN [5]. To build our dataset, we set 1 as the minimum number of points and 200 meters as ε.
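The clustering step above can be sketched in a few lines of Python. When the minimum number of points is 1, every point is a core point, so the DBSCAN clusters coincide with the connected components of the graph linking points that lie within ε of each other; the sketch below exploits this special case rather than reproducing the authors' code. The function names, the use of the haversine distance, and the sample coordinates are assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(a, b):
    """Great-circle distance in meters between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def cluster_pois(coords, eps_m=200.0):
    """Label each PoI with a cluster id (DBSCAN special case with min points = 1).

    Clusters are grown by repeatedly adding any unlabeled point that lies
    within eps_m of a point already in the cluster.
    """
    labels = [None] * len(coords)
    cluster = 0
    for seed in range(len(coords)):
        if labels[seed] is not None:
            continue
        labels[seed] = cluster
        frontier = [seed]
        while frontier:
            i = frontier.pop()
            for j in range(len(coords)):
                if labels[j] is None and haversine_m(coords[i], coords[j]) <= eps_m:
                    labels[j] = cluster
                    frontier.append(j)
        cluster += 1
    return labels


# Hypothetical coordinates: three entities chained within 200 m, one far away.
pois = [(41.8902, 12.4922), (41.8910, 12.4922), (41.8920, 12.4922), (41.9009, 12.4833)]
labels = cluster_pois(pois)
```

Note that the first and third points are slightly more than 200 m apart, yet they end up in the same cluster because the second point links them; this chaining is inherent to density-based clustering with these parameters. The quadratic neighbor search is fine for a sketch, whereas a real pipeline would use a spatial index.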
At the end of this step each PoI p ∈ P is uniquely identified by its geographic coordinates, a name, and a relevance vector $\vec{v}_p \in [0, 1]^{|C|}$ measuring the normalized relevance of p w.r.t. the categories C. For the clustered PoIs, the relevance vector $\vec{v}_p$ is obtained by considering the occurrences of each category in the members of the clusters and by normalizing the resulting vector.

Users and PoI histories. As a second step we need a method for collecting tourists' information and their long-term itineraries crossing the discovered PoIs. We query Flickr to retrieve the metadata (user id, timestamp, tags, geographic coordinates, etc.) of the photos taken in the given area $BB_{city}$. The assumption we are making is that photo albums made by Flickr users implicitly represent sightseeing itineraries within the city. To strengthen the accuracy of our method, we retrieve only the photos having the highest geo-referencing accuracy given by Flickr³. This process thus collects a large set of geo-tagged photo albums taken by different users within $BB_{city}$. We preliminarily discard photo albums containing only one photo. Then, we spatially match the remaining photos against the set of PoIs previously collected. We associate a photo with a PoI when it has been taken within a circular buffer of a given radius having the PoI as its center. Note that in order to deal with clustered PoIs, we consider the distance of the photo from all constituent members: if the photo falls within the circular region of at least one of the members, it is assigned to the clustered PoI. Moreover, since several photos by the same user are usually taken close to the same PoI, we consider the timestamps associated with the first and last of these photos as the starting and ending time of the user's visit to the PoI. The PoI visiting time ρ(p) is then estimated by computing for each PoI the average of these times.

² http://www.wikipedia.org
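The visiting-time estimate described above can be illustrated with a short Python sketch. It assumes that the photos have already been matched to PoIs and, as a simplification not stated in the text, that all photos taken by one user at one PoI belong to a single visit; the function name and the sample records are hypothetical.

```python
from collections import defaultdict


def estimate_visiting_times(photos):
    """Estimate rho(p) for every PoI.

    photos: iterable of (user_id, poi_id, timestamp_in_seconds) records,
            already matched to PoIs.
    Returns a dict mapping poi_id to the average visit duration in seconds.
    """
    # First and last photo timestamp for each (user, PoI) pair.
    first_last = {}
    for user, poi, ts in photos:
        lo, hi = first_last.get((user, poi), (ts, ts))
        first_last[(user, poi)] = (min(lo, ts), max(hi, ts))

    # Collect one duration per user and PoI, then average per PoI.
    durations = defaultdict(list)
    for (_user, poi), (lo, hi) in first_last.items():
        durations[poi].append(hi - lo)
    return {poi: sum(d) / len(d) for poi, d in durations.items()}


sample_photos = [
    ("u1", "colosseum", 0),
    ("u1", "colosseum", 3600),
    ("u2", "colosseum", 100),
    ("u2", "colosseum", 1900),
    ("u1", "trevi", 5000),
]
visiting_times = estimate_visiting_times(sample_photos)
```

A user with a single photo at a PoI contributes a duration of zero, which pulls the average down; whether such visits should be filtered out is a design decision this sketch leaves open.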
Moreover, the popularity of each PoI is computed as the number of distinct users that take at least one photo in its circular region. The above process allows us to generate the set of users, their PoI histories (the temporally ordered sequence of PoIs visited by a user u), and estimates for the popularity and visiting time of each PoI. Finally, a preference vector $\vec{v}_u \in [0, 1]^{|C|}$ stating the normalized interest of u for the categories in C is built by summing up and normalizing the relevance vectors of all the PoIs occurring in u's PoI history.

Trajectories. In order to build the set S of trajectories used by TripBuilder we split users' PoI histories. In particular, given a PoI history $H_u$ where each p of $H_u$ is annotated with the two timestamps $[t_1, t_2]$ indicating the start time and the end time of the visit, and a time threshold δ, we define a trajectory $T_u$ as any subsequence of $H_u$

$$\langle (p_k, [t_{1k}, t_{2k}]), \ldots, (p_{k+i}, [t_{1(k+i)}, t_{2(k+i)}]) \rangle$$

such that:

$$i \geq 1$$
$$t_{1k} - t_{2(k-1)} > \delta, \quad \text{if } k > 1$$
$$t_{1(k+i+1)} - t_{2(k+i)} > \delta, \quad \text{if } (k+i) < m$$
$$t_{1(k+j)} - t_{2(k+j-1)} \leq \delta, \quad \forall j \text{ s.t. } 1 \leq j \leq i$$

where m is understood to be the number of PoIs in $H_u$. The intuition is that trajectories are sequences of PoIs visited consecutively during the same "visit". They are obtained by cutting the user's PoI history wherever the time interval between the visits to two subsequent PoIs is greater than a given threshold δ. To choose the splitting threshold δ, we derive the users' wisdom-of-crowds behavior by analyzing the inter-arrival time of each pair of consecutive photos taken in different PoIs. Specifically, we compute the probability distribution P(x ≤ δ) of the inter-arrival time of pairs of consecutive photos, and then choose the time threshold δ such that P(x ≤ δ) = 0.9.

Traveling time estimation. An important aspect of TripBuilder is that we recommend sightseeing tours fitting the available time budget and not just the set of PoIs to be visited.
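The trajectory-splitting rule defined earlier in this section can be sketched as follows. This is an illustrative Python reading of the definition, not the authors' implementation: `split_history` cuts a PoI history wherever the gap between consecutive visits exceeds δ and keeps only pieces with at least two PoIs (the condition i ≥ 1), while `choose_delta` picks δ as the smallest observed gap whose empirical cumulative probability reaches 0.9. Function names and sample values are assumptions.

```python
def split_history(history, delta):
    """Split a PoI history into trajectories.

    history: temporally ordered list of (poi, t_start, t_end) visits.
    delta:   maximum allowed gap between the end of one visit and the
             start of the next within the same trajectory.
    """
    trajectories, current = [], []
    for visit in history:
        if current and visit[1] - current[-1][2] > delta:
            if len(current) >= 2:      # i >= 1: at least two PoIs
                trajectories.append(current)
            current = []
        current.append(visit)
    if len(current) >= 2:
        trajectories.append(current)
    return trajectories


def choose_delta(gaps, p=0.9):
    """Smallest observed gap d such that the fraction of gaps <= d is at least p."""
    ordered = sorted(gaps)
    for k, gap in enumerate(ordered, start=1):
        if k / len(ordered) >= p:
            return gap
    raise ValueError("gaps must be non-empty")


# Hypothetical history with times in minutes.
history = [("A", 0, 10), ("B", 15, 20), ("C", 100, 110), ("D", 112, 120), ("E", 500, 510)]
trajectories = split_history(history, delta=30)
```

In this example the history is cut after B and after D, and the isolated visit to E is discarded because a single PoI does not form a trajectory.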
The sightseeing tour building step should therefore consider not only the PoI visiting time ρ(p) but also the time τ(·, ·) needed to move between consecutive PoIs in the itinerary. Since measuring the inter-PoI moving time from the photo albums proved to be inaccurate for unpopular PoIs, we resort to an external service. Given a pair $(p_i, p_j)$ of PoIs in a trajectory, we estimate $\tau(p_i, p_j)$ by querying Google Maps for the walking time between the PoIs. Naturally, this is an approximation since several variations may occur: the user may have a car, use public transportation, or take a taxi. However, our method is parametric with respect to these aspects, and the system can be easily adapted to consider the different choices. Moreover, most PoIs in our sightseeing cities are actually within walking distance of each other.

User-PoI Interest. Given a PoI p, its relevance vector $\vec{v}_p$, a user u, and the associated preference vector $\vec{v}_u$, we define the User-PoI Interest function as the following function $\Gamma(p, u) : P \times U \to [0, 1]$:

$$\Gamma(p, u) = \alpha \cdot \mathrm{sim}(\vec{v}_p, \vec{v}_u) + (1 - \alpha) \cdot \mathrm{pop}(p)$$

where $\mathrm{sim}(\vec{v}_p, \vec{v}_u) = \frac{\vec{v}_p \cdot \vec{v}_u}{\|\vec{v}_p\| \, \|\vec{v}_u\|}$ is the cosine similarity between the user preference and the PoI relevance vectors, pop(p) is a function, ranging from 0 to 1, measuring the popularity of p, and α ∈ [0, 1] is a parameter controlling how much user preference and popularity of PoIs have to be taken into account.

³ http://www.flickr.com/services/api/flickr.photos.search.html

TripBuilder addresses the problem of planning the visit to the city as a two-step process: TripCover and TrajSP.

TripCover. First, given the profile of the tourist and the amount of time available for the visit, we address the problem of choosing the set of trajectories S* ⊆ S that best fits the tourist's interest and respects the given time constraint [1]. The association between users and interests is represented by the User-PoI interest function Γ, while the time of the visit to PoI p is represented by the cost function ρ(p).
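The User-PoI Interest function Γ defined above translates directly into code. The following Python sketch is illustrative only: the function names, the default value α = 0.5, and the convention of returning a similarity of 0 when one of the vectors is all zeros are assumptions, not details taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def user_poi_interest(v_p, v_u, popularity, alpha=0.5):
    """Gamma(p, u) = alpha * sim(v_p, v_u) + (1 - alpha) * pop(p).

    v_p:        PoI relevance vector over the categories C
    v_u:        user preference vector over the same categories
    popularity: pop(p), assumed to be already scaled to [0, 1]
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cosine_similarity(v_p, v_u) + (1 - alpha) * popularity


# Hypothetical two-category example: (church, museum).
score = user_poi_interest([1.0, 0.0], [1.0, 0.0], popularity=0.5, alpha=0.5)
```

Because the relevance and preference vectors have non-negative components, the cosine similarity lies in [0, 1], so the convex combination with pop(p) stays in [0, 1] as the definition requires. Setting α = 0 ranks PoIs by popularity alone, and α = 1 by profile match alone.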
The output is a set of trajectories maximizing the user's interest, that is, trajectories crossing the PoIs most relevant to the user, constrained by the given time budget. The TripCover problem is an instance of the Generalized Maximum Coverage (GMC) problem, which is proven to be NP-hard [4]. An efficient greedy approximation algorithm for the GMC problem is known that achieves an approximation ratio of e/(e − 1) + ε, ∀ε > 0 [4]. We used this approximation algorithm (whose source was kindly provided to us by the authors) after slightly modifying it to take into account TripCover-specific constraints [1].

TrajSP. In a second step, the selected trajectories that best fit the tourist's interest and respect her time budget are joined into a sightseeing itinerary by means of a heuristic algorithm based on local search operations (2-OPT and 3-OPT). We model this second problem, called TrajSP, as a particular instance of the Traveling Salesman Problem (TSP). In [3], TrajSP is addressed by proposing a Local Search heuristic that starts from a (given or random) tour P̂ connecting all trajectories in S*, and then applies local changes to P̂ by means of 2-OPT or 3-OPT strategies [6].

3 The TripBuilder System

The architecture of TripBuilder for generating time-budgeted sightseeing tours involves four different layers: i) Stream Layer, ii) Batch Layer, iii) Distributed Data Storage, and iv) TripBuilder Engine. In the following, we detail the functionalities of the four modules. Figure 1 presents the architecture of the TripBuilder system enabling the computation of personalized budgeted sightseeing tours.

Stream Layer. This layer is composed of two different modules that retrieve the relevant information from Flickr and Wikipedia by receiving city bounding boxes as a stream. In particular, each item of the stream is used by Photo Discovery to query Flickr to retrieve the metadata (user id, timestamp, tags, geographic coordinates, etc.)
of photo albums, i.e., sequences of photos taken in the given geographic area. This process thus collects a large set of geo-tagged photo albums taken by different users in the given geographic area. The second module, Wikipedia PoI Discovery, collects PoIs from Wikipedia. In particular, we assume each geo-referenced Wikipedia named entity, whose geographical coordinates falls into a given area, to be a Point of Interest. For each PoI, we retrieve its descriptive label, its geographic coordinates as reported in the Wikipedia page, and the set of categories the PoI belongs to, which are reported at the bottom of the Wikipedia page. Then, photos from Flickr and PoIs from Wikipedia are matched by spatial proximity according to their coordinates. Figure 2 highlights the components on the stream layer. The stream layer is built by means of Apache Storm4 , a free and open source distributed realtime computation system. Apache Storm allows to reliably process unbounded streams of data. Storm organizes the computation in a graph, called “topology”, where data flows through nodes, called “bolts”. Our stream layer is thus able to crawl Flickr and Wikipedia in a real-time fashion by receiving from an input queue a given geographic bounding box representing the target geographic area. The results of the real-time computation are stored on a distributed data storage. 4 https://storm.apache.org/ 62 Tour City Budget Categories Personalization TripBuilder Batch Layer Trajectories Creation Poi Visiting Time Estimation Users' Photos HDFS HDFS Wikipedia PoI Discovery City Bounding Box Photo Discovery Stream Layer HDFS Distributed Data Storage Figure 1: Architecture of T RIP B UILDER. We outline the four layers of the system, i.e. Stream Layer, Batch Layer, Distributed Data Storage and TripBuilder Engine. 
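The spatial-proximity matching between Flickr photos and Wikipedia PoIs described for the Stream Layer can be sketched as follows. This is a hedged illustration rather than the system's actual code: the haversine distance, the 100 m threshold, the function names, and the brute-force scan over all PoIs are assumptions made here (a production system would more likely rely on a geo-spatial index).

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def match_photo_to_poi(photo, pois, max_distance_m=100.0):
    """Return the name of the nearest PoI within max_distance_m, or None.

    photo is a (lat, lon) pair; pois maps a PoI name to its (lat, lon).
    The threshold is a hypothetical value chosen for this sketch.
    """
    best_name, best_dist = None, max_distance_m
    for name, (lat, lon) in pois.items():
        dist = haversine_m(photo[0], photo[1], lat, lon)
        if dist <= best_dist:
            best_name, best_dist = name, dist
    return best_name


# Approximate coordinates, used only as example data.
pois = {
    "Colosseum": (41.8902, 12.4922),
    "Trevi Fountain": (41.9009, 12.4833),
}
near_colosseum = match_photo_to_poi((41.8903, 12.4923), pois)
far_away = match_photo_to_poi((41.9500, 12.6000), pois)
```

Photos that fall outside the threshold of every PoI are left unmatched, so they do not contribute to any trajectory.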
Figure 2: Layers of the TripBuilder architecture: the Stream layer processes incoming data; the Batch layer is responsible for processing and transforming data for the TripBuilder Engine.

Batch Layer. This layer is made up of different components, each one manipulating the data previously collected. It is in charge of cleaning and transforming the data by means of distributed computing frameworks such as Apache Hadoop⁵ and Spark⁶, which speed up the data processing step. In particular, the modules here transform sequences of photos from Flickr into sequences of visited Wikipedia PoIs, i.e., trajectories, to be used in the TripBuilder Engine. Moreover, this step is in charge of computing popularity and other important characteristics of PoIs by considering metadata and information extracted from both Flickr and Wikipedia. The data obtained (see Figure 3) are then stored by means of a distributed data storage layer. This is an important point in favour of the flexibility of TripBuilder: different sources of information for trajectories and PoIs can easily be integrated into the system by modifying only the two lowest layers. Moreover, this approach scales to large geographic areas, as the two layers effectively exploit modern state-of-the-art technologies for distributed and parallel computation.

⁵ http://hadoop.apache.org
⁶ http://spark.apache.org

[Figure 3 example trajectory: Colosseum, 3 photos, 01/07/2013 9:00-12:00; Ruins, 2 photos, 01/07/2013 13:30-15:00; Trevi Fountain, 2 photos, 01/07/2013 15:42-16:00; ...]

Figure 3: Application of the Stream and the Batch Layer to raw data from Flickr and Wikipedia. The result of these two steps is a set of crowdsensed trajectories describing past behavior of tourists in a geographic area.

Distributed Data Storage. This component is responsible for storing, querying and indexing trajectory and PoI data. It is composed of a database management system and a distributed filesystem, which together efficiently provide information to the TripBuilder Engine component and distributed storage to support the Stream and Batch layers. The database component has a well-defined schema that enables flexibility in integrating other data sources. Geo-spatial indexes are used for searching spatial objects, such as PoIs and tourist traces, within a given region (e.g., a polygon). The system also takes advantage of indexes over PoI categories and tourist traces, both represented as arrays, to efficiently retrieve the PoIs relevant to the user preferences. The distributed filesystem is built on the Apache Hadoop Distributed Filesystem (HDFS). We chose HDFS because it is a mature solution for storing data in distributed environments. For example, it provides effective and efficient mechanisms to deal with faults, thus preventing data loss in case of hardware problems.

TripBuilder Engine. This is the core of the architecture, responsible for computing personalized budgeted sightseeing tours. Given as input a set of trajectories crossing a set of PoIs, a time budget, the user preferences, and the personalization factor used to tune the level of personalization, it generates the personalized sightseeing tour.

We evaluated TripBuilder on data collected for three Italian cities that differ in size and in the amount of user-generated content available for download, namely Pisa, Florence, and Rome [3]. We collected crowdsourced data from Flickr, Wikipedia and Google Maps. We evaluated our framework by considering the performance on both the TripCover and the TrajSP problems [3].
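To make the local search used for TrajSP more concrete, the following Python sketch shows a plain 2-OPT improvement loop. It is a deliberately simplified illustration under stated assumptions: it orders individual stops on an open path using a symmetric travel-time matrix, whereas TrajSP as described in [3] joins whole trajectories; the function names and the data layout are hypothetical.

```python
def path_cost(path, travel_time):
    """Total travel time along an open path of stop indices."""
    return sum(travel_time[a][b] for a, b in zip(path, path[1:]))


def two_opt(path, travel_time):
    """Reverse sub-paths for as long as a reversal strictly shortens the path.

    travel_time[i][j] is the (assumed symmetric) time between stops i and j.
    Returns the improved path and its cost; the result is a local optimum,
    not necessarily the global one.
    """
    best = list(path)
    best_cost = path_cost(best, travel_time)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                # 2-OPT move: reverse the segment between positions i and j.
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                cost = path_cost(candidate, travel_time)
                if cost < best_cost - 1e-12:
                    best, best_cost = candidate, cost
                    improved = True
    return best, best_cost


# Four stops on a line at positions 0, 1, 2, 3; travel time is the distance.
travel_time = [[abs(i - j) for j in range(4)] for i in range(4)]
tour, cost = two_opt([0, 2, 1, 3], travel_time)
```

Each accepted move strictly lowers the cost, so the loop terminates. A 3-OPT variant would consider reconnecting three edges at a time at a higher cost per iteration.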
The results show that TripBuilder's selection of trajectories of interest for a given user improves markedly over two competitive baselines on all the metrics adopted for assessment. Our solution suggests itineraries that better match user preferences. Moreover, such itineraries devote more time to visiting PoIs and, consequently, less time to moving between them than the baselines. Furthermore, we showed that our TSP-based local search heuristic for scheduling a set of trajectories into the user agenda outperforms two baselines. Finally, the tests conducted to assess the efficiency of TripBuilder show that it computes a four-day personalized sightseeing tour of Rome in about 3 seconds, thus confirming that our approach can be fruitfully deployed in online applications.

Figure 4: A screenshot of the Web interface that lets users interact with TripBuilder. The targeted city is initially selected (a). The drop-down menu on the left helps users specify their preferences, the time budget and the personalization factor (b). On the right, a summary of the tour is proposed. Each PoI in the summary comes with a photo and a set of useful information (i.e., visiting time, categories, etc.).

4 Crowdsensed Trajectories for Tourism: Which are the Challenges?

In this paper we discussed how semantically enriched trajectories derived from user-generated content from web services like Flickr and Wikipedia offer a solid background for planning personalized sightseeing tours. However, we believe that this approach opens the way to many research challenges on recommendation tasks based on crowd-generated content. Here we highlight some ideas that we believe could be the next steps for sightseeing tour recommendation and planning.

• Linked data. The increasing availability of Linked Open Data (LOD) sources has brought new opportunities to integrate different data sources to semantically enrich trajectories and points of interest. LOD may fill the information gaps we typically experience when managing trajectories from user-generated data, and this additional information will lead to better recommendations, combined with the capability of explaining the recommendation itself.

• Smart Cities. Nowadays, we see the opportunity of integrating crowd-generated trajectories with smart city environments, such as the physical sensors used for collecting data representing several aspects of the city (pollution, traffic, etc.). As in the LOD case, this creates a huge new potential to enrich the recommendations offered to tourists.

• Real-time Services. It is crucial to keep the tourist's tour up to date with the most recent events in the city, like special discounts for museums, restaurants, events, etc. How to deal with this, and how to collect the amount of user data needed to provide real-time support to tourists, will be an important challenge.

• Group Recommendation. The fact that people usually do not travel alone highlights the importance of recommending tours for groups of people instead of single individuals. The task here is to balance the recommendation to satisfy the distinct preferences inherent to each user in the group. This may be a hard task since other issues may come up: influence among people, distinct roles of people inside the group (such as the leader), etc.

• Hierarchical Sightseeing Tours. So far we have considered the recommendation of sightseeing tours for a single city. If the tourist is willing to travel across different cities, she would need to use the system to generate the tour for each city separately. To overcome this limitation, a hierarchical approach could help users plan travel across many cities.

• Time-aware PoIs and Trajectories. Another important challenge is related to the temporal dimension of PoIs and trajectories. Some PoIs and trajectories might be relevant only during a certain period of the day: most people visit beaches during the day, and some monuments (e.g., the Colosseum, the Eiffel Tower) have a different appearance in daylight and at night. Therefore, considering the temporal importance and relevance of the PoIs and trajectories may yield better personalized tours for the users.

• Tour-based Hotel Selection. When scheduling a trip, tourists choose the city, pick a hotel and write down the PoIs to visit. Although the PoI tour can be generated by TripBuilder, support for choosing the hotel is still missing. The selection of the hotel is usually driven by traditional constraints like price, ratings, etc., but the planned sightseeing tour could also be used to favor the hotel that minimizes the distance to the planned tour attractions.

• Personalized Visiting Time. TripBuilder uses the crowd-generated content to infer an approximate visiting time for each PoI, and this information is used for all tourists. However, tourists usually have preferences and might wish to spend more time at some preferred places than at other, less interesting attractions. Personalizing the visiting time might be very relevant for tourists, letting them split their time budget among specific attractions based on their preferences.

Acknowledgments

This work was partially supported by EU FP7 Marie Curie project SEEK (no. 295179), CNPQ Scholarship (no. 306806/2012-6), CNPQ Casadinho / PROCAD Project (no. 552578/2011-8), CAPES Scholarship.

References

[1] I. R. Brilhante, J. A. Macedo, F. M. Nardini, R. Perego, and C. Renso. Where shall we go today?: Planning touristic tours with TripBuilder. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM 2013, pages 757–762, New York, NY, USA, 2013. ACM.

[2] I. R. Brilhante, J. A. Macedo, F. M. Nardini, R. Perego, and C. Renso. TripBuilder: A tool for recommending sightseeing tours. In M. Rijke, T. Kenter, A. P. Vries, C. X. Zhai, F. Jong, K. Radinsky, and K. Hofmann, editors, Advances in Information Retrieval, volume 8416 of Lecture Notes in Computer Science, pages 771–774. Springer International Publishing, 2014.

[3] I. R. Brilhante, J. A. Macedo, F. M. Nardini, R. Perego, and C. Renso. On planning sightseeing tours with TripBuilder. Information Processing & Management, 51(2):1–15, 2015.

[4] R. Cohen and L. Katzir. The generalized maximum coverage problem. Information Processing Letters, 108(1):15–22, 2008.

[5] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. Pages 226–231. AAAI Press, 1996.

[6] R. Matai, S. P. Singh, and M. L. Mittal. Traveling salesman problem: an overview of applications, formulations, and solution approaches. In Traveling Salesman Problem, Theory and Applications. InTech, 2010.

Section 2: Event Reports

Highlights from ACM SIGSPATIAL China Chapter in 2014

Guangzhong Sun¹, Yang Yue², Xing Xie³
¹ University of Science and Technology of China, Hefei, China
² Shenzhen University, Shenzhen, China
³ Microsoft Research, Beijing, China

Abstract

In order to promote ACM SIGSPATIAL and its research area in China, and to encourage collaboration between SIGSPATIAL researchers in China and researchers worldwide, the ACM SIGSPATIAL China chapter was established in October 2009 with the strong support of the SIGSPATIAL executive committee. This article describes the current status of the ACM SIGSPATIAL China Chapter and some of its activities in 2014.

1 Current status of ACM SIGSPATIAL China

ACM SIGSPATIAL China has had 35 professional members since the chapter was formed.
They come from Chinese universities and research institutions such as Shenzhen University, Renmin University, Chinese Academy of Sciences, and University of Science and Technology of China, and from industry labs such as Microsoft Research Asia. The current chapter officers are as follows:

• Honorary chair: Prof. Qingquan Li (Shenzhen University) and Prof. Xiaofeng Meng (Renmin University).
• Chair: Dr. Xing Xie (Microsoft Research Asia).
• Vice chair: Dr. Feng Lu (Chinese Academy of Sciences) and Dr. Zhiming Ding (Beijing University of Technology).
• Secretary: Dr. Yang Yue (Shenzhen University) and Dr. Guangzhong Sun (University of Science and Technology of China).

More information about the chapter and its activity reports can be found on the ACM SIGSPATIAL China website (http://sigspatial.ustc.edu.cn), which is in Chinese.

2 Activities in 2014

2.1 Workshop on Spatial Big Data Mining and Visualization (SBDMV 2014)

Existing concepts, theories, and methods for spatial big data face many challenges, for instance in storage, processing, analysis, mining and visualization. The value of spatial big data cannot be fully realized without data mining. It has also been recognized that visualization is an effective means not only of presenting essential information in vast amounts of data but also of driving big data analytics.

Together with ISPRS WG II/7, ACM SIGSPATIAL China held a workshop on Spatial Big Data Mining and Visualization (SBDMV 2014), in conjunction with ICDM 2014, on December 14, 2014. This half-day workshop aimed to serve as a platform for discussing recent trends in spatial big data mining and visualization for the purpose of intelligent spatial decision support. More than 40 people joined the workshop, coming from Chinese Academy of Sciences, Wuhan University, Shenzhen University, Peking University, University of Science and Technology of China, University of Tennessee, University of Melbourne, Microsoft Research Asia and Tencent Research. Figure 1 shows two photos of this workshop.

Figure 1: Workshop on Spatial Big Data Mining and Visualization (SBDMV 2014) in Shenzhen: (a) in the meeting room; (b) with invited guests.

More information about this workshop can be found at its website (http://spatial.szu.edu.cn/SBDMV2014.htm).

2.2 CCCF Special issue on user understanding from big data

Communications of the China Computer Federation (CCCF) is one of the most popular computer magazines in China. This Chinese-language magazine has more than 10,000 subscribers and great impact in the Chinese computer research community. Together with the Technical Committee on Pervasive Computing of the China Computer Federation (CCF), ACM SIGSPATIAL China organized a special issue of CCCF on user understanding from big data in May 2014. Invited professors and researchers from Tsinghua University, Nankai University, Zhejiang University, Chinese Academy of Sciences and Microsoft Research Asia wrote five articles giving a broad and in-depth introduction to big data and user understanding. Some of the latest progress was also presented in these articles.

More information about this special issue can be found at this page (http://www.ccf.org.cn/sites/ccf/jsjtbbd.jsp?contentId=2799676848706), which is in Chinese.

LBSN 2014 Workshop Report
The Seventh ACM SIGSPATIAL International Workshop on Location-Based Social Networks
Dallas, Texas, USA - November 4, 2014

Alexei Pozdnoukhov¹, Sen Xu² (Workshop Co-chairs)
¹ UC Berkeley, USA
² Twitter Inc., USA
[email protected] [email protected]

The ACM SIGSPATIAL Workshop on Location-Based Social Networks 2014 (http://faculty.ce.berkeley.edu/pozdnukhov/lbsn14/index.html) was held in conjunction with the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL 2014) on November 4, 2014 in Dallas, Texas, USA.
The objective of this workshop was to provide professionals, researchers, and technologists with a single forum where they can discuss and share the state of the art of LBSN development and applications, present their ideas and contributions, and set future directions in emerging innovative research for location-based social networks.

This year's program was composed of three sessions covering all aspects of LBSNs, with opening invited talks given by Dr. A. Haro (HERE/Nokia), Prof. M. Duckham (University of Melbourne), and Dr. Sen Xu (Twitter). Best Paper and Best Student Paper nominations were made based on the reviews and evaluations received from the Program Committee. Each award carried a prize sponsored by Twitter Inc.

• Best Paper. Sophy: a Morphological Framework for Structuring Geo-referenced Social Media, by Kyoung-Sook Kim, Hirotaka Ogawa, Akihito Nakamura and Isao Kojima.
• Best Student Paper. Moving on Twitter: Using Episodic Hotspot and Drift Analysis to Detect and Characterise Spatial Trajectories, by Hansi Senaratne, Arne Broering, Tobias Schreck and Dominic Lehle.

Social networks have become prevalent on the Internet. The data produced by their users has ignited multiple research topics, attracting many professionals from a variety of fields. Advances in location-acquisition and mobile communication technologies empower people to use location data with existing online social networks. As location is one of the most important components of user context, extensive knowledge about an individual's interests, behaviors, and relationships with others can be learned from locations. Furthermore, people expand their social connections with new inter-dependencies conditioned on their locations and mobility habits, shaping the social and spatial structure of the cities of the future. These topics will remain key in fundamental academic research and will continue attracting significant interest from industry.
We would like to thank the authors for publishing and presenting their papers at LBSN 2014, and the program committee members and external reviewers for their professional evaluation and help in the paper review process. We hope that this work will inspire new research and make an impact on commercial applications in the exciting area of Location-Based Social Networks.

IWGS 2014 Workshop Report
The 5th ACM SIGSPATIAL International Workshop on GeoStreaming
Dallas, Texas, USA - November 4, 2014

Chengyang Zhang¹, Anas Basalamah², Abdeltawab Hendawi³ (Workshop Co-chairs)
¹ Teradata Inc.
² Umm Al Qura University
³ University of Washington Tacoma
[email protected] [email protected] [email protected]

The ACM SIGSPATIAL International Workshop on Geostreaming (IWGS) was held for the fifth time in conjunction with the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACMGIS 2014). The workshop was a successful event that attracted participants from both academia and industry. It addressed topics at the intersection of data streaming and geospatial systems, and fostered an environment where geospatial researchers can benefit from advances in geosensing technologies and data streaming systems.

We are entering the era of big data thanks to the exponential growth and availability of structured and unstructured data, a large amount of which is real-time streaming data emitted from sensors, imagery and mobile devices. In addition to the temporal nature of stream data, various sources provide stream data that has geographical locations and/or spatial extents, such as geotagged Twitter streams, mobile GPS location streams, spatio-temporal image streams, and so on. On the one hand, this amount of streamed data has been a major driver in advancing the state of the art in geographic information systems. On the other hand, the difficulty of processing, mining, and analyzing that massive amount of data in a timely manner has prevented researchers from making full use of the incoming stream data. The term geostreaming refers to the ongoing effort in academia and industry to process, mine and analyze stream data with geographic and spatial information.

This workshop addresses the research communities in both stream processing and geographic information systems. It brings together experts in the field from academia, industry and research labs to discuss the lessons they have learned over the years, to demonstrate what they have achieved so far, and to plan for the future of geostreaming.

The workshop featured a keynote by Dennis Luxen from Mapbox Inc., who gave an introduction to the technical infrastructure needed to scale globally, the concepts of processing streams of data to provide real-time updates to custom-styled online maps, and non-trivial processing methods that yield even more sophisticated products, such as a world-wide satellite mosaic that is both seamless and cloudless. On the one hand, the keynote gave researchers a good opportunity to better understand the business scenarios. On the other hand, the workshop was useful for the industry representative to learn about the ambitions and directions of researchers in order to better shape the future of geostreaming.

The call for papers resulted in 16 submissions of research papers. A program committee of 9 members reviewed the submissions and, as a result, the 12 highest-quality papers were accepted. On average, over 20 attendees were present at every session of the workshop. The topics presented in the workshop include, but are not limited to: Geostream Query Processing, Geostream Theory and Applications in Transportation, Streaming Trajectories and Moving Regions, and Geostreaming Systems.
The First ACM SIGSPATIAL PhD Symposium 2014

Ugur Demiryurek¹, Mohamed Sarwat²
¹ Department of Computer Science, University of Southern California, USA
² School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, USA
[email protected], [email protected]

1 Summary

The ACM SIGSPATIAL Ph.D. Symposium is a forum where Ph.D. students present, discuss, and receive feedback on their research in a constructive atmosphere. The symposium is attended by professors, researchers and practitioners in the ACM SIGSPATIAL community, who participate actively and contribute to the discussions. The symposium was co-located with ACM SIGSPATIAL GIS 2014.

The ACM SIGSPATIAL 2014 PhD Symposium provides an opportunity for doctoral students to explore and develop their research interests in the broad areas addressed by the ACM SIGSPATIAL community. We invited PhD students to submit a summary of their dissertation work and to share it with students in a similar situation as well as with senior researchers in the field. There were two tracks for submission. The Junior PhD Track is for students who are in the early stages of their doctoral studies. A submission to this track should provide a clear problem definition, explain why the problem is important, survey related work, and summarize the new solutions being pursued. The Senior PhD Track is for students who are close to completion (expected to graduate by 2014/2015). Submissions to this track should focus on describing the contributions made in the doctoral dissertation. The strongest candidates are those who have a clear topic and research approach, and have made some progress, but who are not so far along that they can no longer make changes.

2 Program

This year, we accepted five papers to the PhD Symposium. The list of papers is as follows:

• Partitions to Improve Spatial Reasoning (Author: Matthew P. Dube – Supervised by: Max J. Egenhofer)
• Novel Clustering and Analysis Techniques for Mining Spatio-temporal Data (Author: Yongli Zhang – Supervised by: Christoph F. Eick)
• Spatial Sensor Data Processing and Analysis for Mobile Media Applications (Author: Guanfeng Wang – Supervised by: Roger Zimmermann)
• Towards Resource Route Queries with Reappearance (Author: Gregor Jossé – Supervised by: Matthias Schubert)
• SimMatching - Adaptable Road Network Matching for Efficient and Scalable Spatial Data Integration (Author: Michael Schäfers – Supervised by: Udo W. Lipeck)

Authors had the chance to present their papers and get feedback on their dissertation topics from experienced researchers (from both academia and industry) in Geographic Information Systems and Spatial Data Analytics.

3 Keynote

The PhD symposium featured a keynote speech by Professor Ouri Wolfson (Richard and Loan Hill Professor of Computer Science at the University of Illinois at Chicago) on "What to Research in Spatial Information and How to Do So". The talk was divided into two parts: the What and the How. The What part described the research directions that, from Dr. Wolfson's perspective, are most promising in the area of spatial information. These involve abstractions of dynamic data about space and time to guide users in conducting everyday activities. In the How part, Dr. Wolfson described the character and attitude traits that he views as essential to conducting world-class research. These involve problem selection, collaboration and inspiration.

Join today! SIGSPATIAL & ACM
www.sigspatial.org www.acm.org

The ACM Special Interest Group on Spatial Information (SIGSPATIAL) addresses issues related to the acquisition, management, and processing of spatially-related information with a focus on algorithmic, geometric, and visual considerations. The scope includes, but is not limited to, geographic information systems (GIS).
The Association for Computing Machinery (ACM) is an educational and scientific computing society which works to advance computing as a science and a profession. Benefits include subscriptions to Communications of the ACM, MemberNet, TechNews and CareerNews, full and unlimited access to online courses and books, discounts on conferences and the option to subscribe to the ACM Digital Library.

❑ SIGSPATIAL (ACM Member): $15
❑ SIGSPATIAL (ACM Student Member & Non-ACM Student Member): $6
❑ SIGSPATIAL (Non-ACM Member): $15
❑ ACM Professional Membership ($99) & SIGSPATIAL ($15): $114
❑ ACM Professional Membership ($99) & SIGSPATIAL ($15) & ACM Digital Library ($99): $213
❑ ACM Student Membership ($19) & SIGSPATIAL ($6): $25

Payment information (form fields): Name; ACM Member #; Mailing Address; City/State/Province; ZIP/Postal Code/Country; Email; Mobile Phone; Fax; Credit Card Type (❏ AMEX ❏ VISA ❏ MC); Credit Card #; Exp. Date; Signature.

Make check or money order payable to ACM, Inc. ACM accepts U.S. dollars or equivalent in foreign currency. Prices include surface delivery charge. Expedited Air Service, which is a partial air freight delivery service, is available outside North America. Contact ACM for more information.

Mailing List Restriction: ACM occasionally makes its mailing list available to computer-related organizations, educational institutions and sister societies. All email addresses remain strictly confidential. Check one of the following if you wish to restrict the use of your name:
❏ ACM announcements only
❏ ACM and other sister society announcements
❏ ACM subscription and renewal notices only

Questions? Contact: ACM Headquarters, 2 Penn Plaza, Suite 701, New York, NY 10121-0701; voice: 212-626-0500; fax: 212-944-1318; email: [email protected]

Remit to: ACM, General Post Office, P.O. Box 30777, New York, NY 10087-0777

www.acm.org/joinsigs
Advancing Computing as a Science & Profession
