EBSCAN: An Entanglement-based Algorithm for Discovering Dense Regions in Large Geo-social Data Streams with Noise

Shohei Yokoyama, Hiroshi Ishikawa, Ágnes Bogárdi-Mészöly

Faculty of Informatics, Shizuoka University, Hamamatsu, Japan
Department of Automation and Applied Informatics, Budapest University of Technology and Economics, Budapest, Hungary
Faculty of Informatics, Shizuoka University, Hamamatsu, Japan
Faculty of System Design, Tokyo Metropolitan University, Tokyo, Japan

ABSTRACT

The remarkable growth of social networking services on global positioning system (GPS)-enabled handheld devices has produced enormous amounts of georeferenced big data. Given a large spatial dataset, the challenge is to effectively discover dense regions from the dataset. Dense regions might be the most attractive area in a city or the most dangerous zone of a town. A solution to this problem can be useful in many applications, including marketing, tourism, and social research. Density-based clustering methods, such as DBSCAN, are often used for this purpose. Nevertheless, current spatial clustering methods emphasize density while neglecting human behavior derived from geographical features. In this paper, we propose EBSCAN, which is based on the novel idea of an entanglement-based approach. Our method considers not only spatial information but also human behavior derived from geographical features. Another problem is that competing methods such as DBSCAN have two input parameters, which makes it difficult to determine optimal values. EBSCAN requires only a single intuitive parameter, tooFar, to discover dense regions. Finally, we evaluate the effectiveness of the proposed method using both toy examples and real datasets. Our experimentally obtained results reveal the properties of EBSCAN and show that it is more than 10 times faster than the competitor.

General Terms
Algorithms, Experimentation, Theory

Categories and Subject Descriptors

I.5.3 [PATTERN RECOGNITION]: Clustering: Algorithms; H.2.8 [DATABASE MANAGEMENT]: Database Applications: Spatial databases and GIS

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
LBSN'15, November 03, 2015, Bellevue, WA, USA
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3975-9/15/11...$15.00
DOI: http://dx.doi.org/10.1145/2830657.2830661

Keywords

Geo-social data, Density-based clustering, Spatial index

1. INTRODUCTION

In accordance with the recent remarkable growth of smart devices [e.g., smartphones and tablet personal computers (PCs)], social networking services (SNSs) are shifting rapidly to a mobile-first strategy [2]. Gartner has reported that users can choose a tablet as their first computing device instead of a PC [1]. These facts indicate that the user's location should be considered in data mining for SNSs. Therefore, spatio-temporal analysis is a key tool for the analysis of the behavior of SNS users.

Location information of SNS users is often given as check-in and geotagged information. Check-in is an innovation used by geo-social platforms, such as Swarm. Commonly used SNSs, such as Facebook and Google+, now enable users to report their location as a check-in. A geotag is metadata consisting of at least the latitude and longitude of the user's location.
It is stored with various media, such as photographs, videos, and Tweets. Hereafter, the term "geo-social data" is used to refer to such georeferenced information related to SNSs.

We can readily access the huge amount of geo-social data shared on various social media via Web APIs developed during the Web 2.0 movement. For example, it is possible to collect more than 160,000 geotagged photos¹ that were taken around the Colosseum in Rome and uploaded to Flickr. Considering that Flickr photographs taken around the Arc de Triomphe amount to 60,000², some might say that the Colosseum is more attractive than the Arc de Triomphe. This is just one example, but it appears reasonable to infer that the density of geotagged photographs is an important index that can measure the attention level of a place.

¹ See http://bit.ly/1tLCHew.
² See http://bit.ly/1sX4fd0.

Applying density-based data mining techniques to detect attractive places is a major direction for social network analysis. For example, DBSCAN [10], which is a typical density-based clustering method, is used in various research areas, including landmark detection, travel pattern mining, and event detection.

DBSCAN is a clustering algorithm designed for a large spatial database. The procedure requires two parameters: Epsilon (Eps) and minimum points (MinPts). It starts with an arbitrary object. If at least the minimum number of objects (MinPts) exists around the selected object within a given radius (Eps), then a cluster is created. Otherwise, the object is labeled as noise.

[Figure 1 panels: (a) Eps=4, MinPts=15; (b) Eps=4, MinPts=30; (c) Eps=8, MinPts=15; (d) Eps=8, MinPts=30]

DBSCAN is used in social network analysis for the following reasons:

• DBSCAN does not require the number of clusters as its input parameter. Generally, users do not know, but want to know, the number of dense regions in a database.
• DBSCAN can find non-linearly separable clusters. Actual geographical clusters often have non-linear shapes.

• DBSCAN is robust for discovering outliers. This feature is suitable for geo-social data, because the data include various errors caused by global positioning system (GPS) accuracy, human factors, and intentional spam.

Figure 1: DBSCAN results in case of different input parameter settings.

Despite its benefits, DBSCAN also has some shortcomings. A major disadvantage of DBSCAN is that it requires two input parameters, which makes it difficult to use.

In this paper, we propose a novel entanglement-based clustering algorithm, EBSCAN, that is designed for geo-social data. EBSCAN requires only one explicit input parameter. It is able to discover dense regions in a spatial database, similar to DBSCAN. Therefore, EBSCAN can be an alternative to the density-based DBSCAN and its followers.

Figure 2: Difference between data and actual geographical features. Panels: (a) Geospatial data, (b) Geographical features (labels: River, Bridge, Wall).

1.1 Problem Statement

In this section, we address the problems associated with density-based spatial clustering.

Geographical Features. Figure 2 presents a toy example of geo-social data and the geographical features of the place. Despite the wall that divides the field into north and south, a density-based approach might output two clusters: east (blue) and west (red). In the case shown in Figure 2, the east and west sectors are connected by bridges, whereas the north and the south are divided completely by the wall. Therefore, the blue and red clusters must be divided into two clusters. However, a density-based approach might not only overlook such geographical features, it might also be confused by a GPS error, human error, or rounding of values.

Input Parameter. As described above, DBSCAN has two parameters: Eps and MinPts. Figure 1 presents the results of using DBSCAN with various parameter values.
Result (a) is the best result of the four, but all points in the second and third chunks in the leftmost column are labeled as noise, because the chunk density fell short of the level required by the input parameters. Result (d) shows that the second chunk from the bottom in the leftmost column was detectable as a cluster, but the four chunks in the second column from the left formed one cluster. In Result (b), many outer rim points are labeled as noise. In contrast, no point is detected as noise in Result (c), but all chunks are gathered into one cluster. Consequently, DBSCAN is sensitive to its input parameters. Furthermore, it is difficult to ascertain the appropriate values of Eps and MinPts. In fact, knowing the appropriate number of points (MinPts) within an Eps radius from any point is difficult.

1.2 Contributions

It is noteworthy that the input data of EBSCAN are a set of trajectories comprising multiple points. This is a limitation of EBSCAN in contrast with DBSCAN, which can be applied to a set of points. However, in almost all cases, geo-social data are a set of trajectories. For example, when a user uploads geotagged photographs to Flickr, the user's photostream indicates the user's trajectory. DBSCAN might be more widely applicable than EBSCAN; however, EBSCAN presents advantages when used with a large geo-social dataset.

To avoid the weakness of density-based clustering, we propose a novel entanglement-based algorithm for clustering that uses geo-social trajectories to discover high-density areas. We focus on GPS errors, round-off errors, and human errors in geo-social data. Figure 3 presents an example of the trajectories of (a) actual behavior and (b) a recorded GPS log. As the figure shows, recorded trajectories get entangled spontaneously even when they travel alongside each other. We specifically examine the entanglement of geo-social trajectories.

Figure 3: Difference between actual and recorded trajectories. Panels: (a) actual trajectory, (b) recorded trajectory.

Input Data. We designed EBSCAN to cluster large collections of geotagged content, e.g., photographs and Tweets. Therefore, EBSCAN requires a set of sequences of georeferenced documents. In the case of Flickr, a georeferenced document is a geotagged photo, the sequence is a user's photostream, and the input data is a set of photostreams.

Output Cluster. After the clustering phase, EBSCAN discovers clusters and outliers in the dataset. A cluster contains georeferenced documents that exist in the same dense area. DBSCAN can also discover dense regions, but the clusters from EBSCAN are more useful for social data mining.

The main contributions of our geospatial clustering algorithm EBSCAN are summarized as follows.

Execution Time. As described in Section 4.2, the clustering phase of EBSCAN is more than 10 times faster than that of DBSCAN. EBSCAN takes longer to build a database during the initialization process; however, our experimental results showed that its overall execution was twice as fast as that of DBSCAN using a k-d tree index.

Input Parameter. EBSCAN requires only a single parameter, tooFar, which represents a radius from each point and which is used to find near neighbors. An important problem with DBSCAN is that it requires two input parameters. In contrast with the input parameters of DBSCAN, tooFar is clear and intuitive. For example, if a distance of 500 m is sufficient to divide points into different clusters, then tooFar takes a value of 500 m. In addition, tooFar is not only easy to set; its optimal value can also be estimated readily. Later in this paper, we also discuss the optimal estimation of tooFar.

Geographical Features. EBSCAN uses a set of trajectories to find dense regions. Therefore, clusters that consider geographical features are created. For example, if trajectories a and d have no intersection, it might indicate that a barrier exists between a and d, such as the wall shown in Figure 2. EBSCAN perceives such geographical features.

1.3 Outline

The remainder of this paper is organized as follows. Section 2 discusses the related work. Section 3 presents the proposed EBSCAN algorithm. Section 4 shows the results of our experimental evaluation. Section 5 includes our conclusions.

2. RELATED WORKS

The following survey of related works is categorized into two types: spatial clustering and geo-social data analysis.

Spatial Clustering. Spatial clustering traditionally relies on partitioning methods, which include k-means and k-medoids. Partitioning methods are simple. In addition, they work well with a spatial database. However, a predefined parameter k that specifies the number of clusters in the output must be decided before clustering. This is not reasonable in almost all cases of our geo-social clustering problem, because the number of clusters is unknown beforehand.

Another approach to spatial clustering includes hierarchical techniques such as BIRCH [34]. Hierarchical clustering methods need no predefined number of clusters, but they lack termination criteria. Therefore, hierarchical clustering cannot be a solution to our geo-social clustering problem.

Partitioning and hierarchical clustering are suitable for finding spherical clusters in a spatial database. However, a geographical cluster takes various arbitrary shapes. In contrast with partitioning and hierarchical clustering, density-based approaches are suitable for application to the geo-social clustering problem. A representative density-based clustering method is DBSCAN [10], which can yield non-linearly separable clusters. DBSCAN is extremely influential. It has triggered many updates and extensions. OPTICS [3], an extension of DBSCAN, can obtain various sizes of clusters at different granularities.
DENCLUE [14], a density-based algorithm, uses the density distribution function on a computational mesh.

Density-based algorithms are also suitable for outlier detection. Moreover, DBSCAN can identify outlier points as a by-product of clustering a large spatial database. LOF [7], a density-based algorithm, was initially developed to optimize outlier detection. The entanglement-based EBSCAN is also able to identify outliers.

Many modified algorithms of DBSCAN have been proposed for specific purposes. ST-DBSCAN [5] extends DBSCAN for use with spatial databases that contain non-spatial and temporal values. Tamura et al. proposed density-based clustering to discover local topics and events from a huge number of georeferenced documents on social media sites [28]. PDBSCAN [18] is a DBSCAN-based clustering method that specializes in working with geotagged photographs. DCPGS [26] is a noteworthy DBSCAN-based clustering algorithm for geo-social network analysis, because it can consider both spatial information and the social distance between users who visit the clustered places. However, DCPGS demands two additional input parameters, τ and maxD, aside from Eps and MinPts. Therefore, the setting of input parameters for DCPGS is more difficult than it is for DBSCAN.

DBSCAN-based algorithms extend DBSCAN but also inherit its weaknesses. Our aim is to overcome the weaknesses of DBSCAN. Some studies have specifically examined the difficulty of setting DBSCAN's input parameters. BDE-DBSCAN [16] is a wrapper of DBSCAN used for a repetitious parameter survey. Li et al. tackled this problem using semi-supervised machine learning [22]. However, running a parameter survey with DBSCAN is not efficient.

Grid-based clustering techniques, e.g., STING [32] and WaveCluster [25], have been proposed as high-efficiency algorithms.
The salient advantage of using grid-based clustering is its rapid processing, which depends on the number of cells in each axis; however, the clustering result has low accuracy.

As described above, EBSCAN uses trajectories on SNSs. Lastly, we explain trajectory clustering. Gaffney et al. [11] proposed model-based trajectory clustering. Lee et al. [20] proposed density-based trajectory clustering. Vieira et al. [29] proposed flock pattern mining. Trajectory clustering is a technique used for line-segment clustering aimed at discovering a common sub-trajectory and classifying major routes from a large real trajectory database. In contrast with trajectory clustering, EBSCAN is a technique that uses trajectories to perform point clustering to discover high-density areas as clusters.

DBSCAN is used for geo-social data analysis. In particular, DBSCAN is often applied to discover points of interest (PoIs) in geo-social data analyses. Related works that use DBSCAN are of approximately two types: (1) photograph-based analysis [8, 15, 17, 19, 27, 35] and (2) Tweet-based analysis [21, 31]. In addition, DBSCAN is widely applicable and is used not only for PoIs [19], but also for travel pattern mining [35], landmark shape detection [27], urban cluster generation [31], and event detection [8]. EBSCAN presents some possibilities for replacing DBSCAN in such research efforts.

Figure 4: An outline of clustering procedure using EBSCAN. Stages shown: crawling; (1) building database (set of line segments L, intersection DB X); (2) clustering (output clusters C, used for RoI/PoI analysis).
The most closely related work in trajectory mining is research concerning region of interest (RoI) detection using a large trajectory dataset. Gianotti et al. proposed an efficient density-based approach [12].

Geo-social data analysis. In recent years, an explosion of interest has arisen in geo-social data analysis. Sakaki et al. [24] proposed keyword-based models using Twitter to automatically identify where and when earthquakes occur. Yamaguchi et al. [33] used a large Twitter dataset to infer users' home locations. Crandall et al. [9] proposed a mapping system using a combination of textual and visual features from a large photographic database that was "crawled" from Flickr. Ruocco et al. [23] used Suffix Tree Clustering to discover events and their occurrence time and position from Flickr photographs.

Figure 5: Key idea of entanglement-based approach. Labels shown: trajectories a, b, c, d; points Pa1, Pa2, Pb1, Pb2, Pc1, Pd1; lines La1→2, Lb1→2; intersections X(a1→2, b1→2) and X(b2→3, c2→3); radius tooFar.

3. ENTANGLEMENT-BASED CLUSTERING

Next, we propose an entanglement-based clustering algorithm, EBSCAN, which is able to discover dense regions from spatial databases using entanglements of trajectories. The procedure of EBSCAN has two steps: (1) building the intersection database and (2) clustering. An outline of the EBSCAN algorithm is illustrated in Figure 4.

Key Idea. Consider a set of georeferenced documents to be clustered. EBSCAN classifies the documents as Near Neighbor points and outliers.

Definition 1. (Near Neighbor) Given a line Lxn→n+1 between points Pxn and Pxn+1 on trajectory x, the neighbor points P are the points at both ends of the lines that intersect Lxn→n+1. When the distance between a point P′ ∈ P and Pxn is less than the value of the input parameter tooFar, then P′ is called a Near Neighbor of Pxn.

Definition 2. (Outlier) All points P that do not have at least one Near Neighbor are considered Outliers.

Finally, clusters are composed on the basis of a simple definition as follows:

Definition 3. (Cluster) Any point P in the dataset must belong to the same cluster as all of its Near Neighbor points.

This is the key idea of the proposed EBSCAN. Figure 5 shows an example representing the key idea of EBSCAN.

Example 1. This example uses six points {Pa1, Pa2, Pb1, Pb2, Pc1, Pd1} on four trajectories {a, b, c, d}, and two intersections {X(a1→2, b1→2), X(b2→3, c2→3)}. First, the intersections are analyzed. For the intersection X(a1→2, b1→2), the near neighbors are {Pa2, Pb2}. Therefore, both Pa2 and Pb2 belong to the same cluster C. Next, the near neighbors related to X(b2→3, c2→3) are {Pb2, Pc1}. However, Pb2 already belongs to cluster C. In this case, cluster C includes Pc1. For Pa1 and Pd1, the distance between the two points is less than the value of the input parameter tooFar. However, the lines connected to Pd1 and Pa1 do not mutually intersect. Using DBSCAN, these might be classified as part of the same cluster. However, EBSCAN can detect that trajectories a and d do not intersect near points Pd1 and Pa1. Finally, cluster C is obtained, which contains {Pa2, Pb2, Pc1}, together with the outliers {Pa1, Pb1, Pd1}.

3.1 Preprocessing Algorithm

The input of EBSCAN is a target dataset T, which is a set of trajectories, and the parameter tooFar.

Algorithm 1 BuildDatabase(T)
 1: Lines L = ∅
 2: for each Trajectory T ∈ T do
 3:   Pprev = null
 4:   for each Point P ∈ T do
 5:     if Pprev ≠ null then
 6:       L.push(new Line(P, Pprev))
 7:     end if
 8:     Pprev = P
 9:   end for
10: end for
11: X = GetAllIntersections(L)
12: return X
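As a minimal sketch of the segment-building loop of Algorithm 1 (Lines 1-10), the following Python fragment splits each trajectory into consecutive line segments. The names `Point`, `Segment`, and `build_segments` are hypothetical, points are assumed to be plain coordinate pairs, and each segment is stored as (previous point, current point), which is an illustrative choice rather than the authors' implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]      # assumed (x, y) or (lon, lat) pair
Segment = Tuple[Point, Point]    # (start, end) of one line segment


def build_segments(trajectories: List[List[Point]]) -> List[Segment]:
    """Split every trajectory into its consecutive line segments."""
    segments: List[Segment] = []
    for trajectory in trajectories:
        prev = None
        for point in trajectory:
            if prev is not None:
                # One segment per pair of consecutive points.
                segments.append((prev, point))
            prev = point
    return segments


# Toy input: three hypothetical user trajectories.
trajectories = [
    [(0.0, 0.0), (2.0, 2.0), (4.0, 0.0)],  # three points, two segments
    [(0.0, 2.0), (2.0, 0.0)],              # two points, one segment
    [(5.0, 5.0)],                          # a single point yields no segment
]
segments = build_segments(trajectories)
```

A trajectory with k points contributes k - 1 segments, so the loop visits each point once, in line with the O(n) preprocessing cost stated in the text.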
First, the given trajectories T are divided into a set of line segments L. The time complexity of the preprocessing (Lines 1-10 of Algorithm 1) is O(n), where n is the total number of points. The function GetAllIntersections, which is described in the next section, is the most complicated part of EBSCAN and takes L as an argument to return a set of intersections X. Finally, BuildDatabase returns X for the next clustering phase.

Algorithm 2 GetAllIntersectionsBF(L)
 1: X = ∅
 2: for each Line L ∈ L do
 3:   for each Line L′ ∈ L do
 4:     if L and L′ are intersected then
 5:       X.push(new Intersection(L and L′))
 6:     end if
 7:   end for
 8: end for
 9: return X

Algorithm 3 GetAllIntersectionsRT(L)
 1: X = ∅
 2: R = RTree(∅)
 3: for each Line L ∈ L do
 4:   L′ = R.search(bbox(L))
 5:   for each Line L′ ∈ L′ do
 6:     if L and L′ are intersected then
 7:       X.push(new Intersection(L and L′))
 8:     end if
 9:   end for
10:   R.insert(L)
11: end for
12: return X

3.2 Algorithms for Finding Intersections

The most complicated part of DBSCAN, which is a competitor of EBSCAN, is finding the k-nearest neighbor points. The time complexity of DBSCAN is O(n^2). However, if a spatial access method such as a k-d tree is used for the nearest-neighbor search, then the time complexity is O(n log n). In comparison with DBSCAN, the most complicated part of EBSCAN is finding the intersections between line segments (Line 11 of Algorithm 1). In this paper, we describe three methods for finding intersections.

Brute-force Search. The first is a brute-force search algorithm (Algorithm 2). This extremely simple algorithm includes two nested loops. The time complexity of the brute-force search is O(n^2). This is not an efficient algorithm; however, we implemented it as a baseline for EBSCAN.

Bentley-Ottmann. A sweep-line algorithm is a technique to list intersections from a large line dataset. The Bentley-Ottmann algorithm [4] is known as an efficient algorithm based on the sweep line.
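The brute-force search of Algorithm 2 rests on a pairwise segment intersection test. The sketch below is one possible reading, not the authors' code: it uses the standard orientation (cross product) test, and it assumes that only proper crossings count, so that consecutive segments of one trajectory, which merely share an endpoint, are not reported. It also visits each unordered pair once instead of running the full double loop. The function names are hypothetical.

```python
from itertools import combinations


def orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left turn, -1 right turn, 0 collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def properly_intersect(s, t):
    """True when segments s and t cross at a point interior to both."""
    (a, b), (c, d) = s, t
    return (orient(a, b, c) * orient(a, b, d) < 0
            and orient(c, d, a) * orient(c, d, b) < 0)


def get_all_intersections_bf(segments):
    """O(n^2) baseline: test every unordered pair of segments."""
    return [(s, t) for s, t in combinations(segments, 2)
            if properly_intersect(s, t)]


s1 = ((0.0, 0.0), (2.0, 2.0))
s2 = ((2.0, 2.0), (4.0, 0.0))   # shares an endpoint with s1, no proper crossing
s3 = ((0.0, 2.0), (2.0, 0.0))   # crosses s1 at (1, 1)
found = get_all_intersections_bf([s1, s2, s3])
```

Because the test only compares signs, it never computes the coordinates of the crossing point. Whether touching or collinear segments should count as intersections is a design choice that the text leaves open.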
The Bentley-Ottmann algorithm uses indexes of two types to find intersections efficiently: a heap and a balanced binary tree. However, when the Bentley-Ottmann algorithm is applied to EBSCAN, the precision of the coordinates of intersections calculated from two intersected lines causes trouble. The algorithm is not difficult to understand; however, efficient implementation is difficult.

R-tree Search. Algorithm 3 uses the R-tree [13] index to find candidates that might intersect. A search by bounding box (Line 4 of Algorithm 3) might extract many non-intersecting lines; therefore, each line of L′ must be tested to determine whether it intersects or not. It is not an efficient process, but it is simpler than using Bentley-Ottmann.

3.3 Clustering Algorithm

Algorithm 4 shows the main procedure of clustering. The first argument of clustering is the intersection database X, which is mentioned in the previous sections. EBSCAN uses tooFar as an input parameter. At each step, the algorithm first obtains an intersection X from X. X has two intersected line segments, La and Lb. Both lines have a starting point and an ending point, (Pastart, Paend) and (Pbstart, Pbend), respectively. Next, four pairs of points are created. The pairs consist of (Pastart, Pbstart), (Pbstart, Paend), (Paend, Pbend), and (Pbend, Pastart). Subsequently, the distance between the points of each pair is calculated. If the distance is larger than the parameter tooFar, then the pair is ignored.
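The pairing and distance filter described above, combined with the cluster merging of Algorithm 4, might be sketched as follows. This is an illustration under stated assumptions: clusters are tracked with a union-find structure instead of the explicit cluster array of Algorithm 4, distances are Euclidean (`math.dist`) rather than geodesic, and the function name `cluster` is hypothetical.

```python
import math


def cluster(intersections, too_far):
    """Group segment endpoints that are near neighbors across an intersection."""
    parent = {}

    def find(p):
        # Register the point on first use, then follow parents with path halving.
        parent.setdefault(p, p)
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a_start, a_end), (b_start, b_end) in intersections:
        # Ring of four endpoints gives the four pairs named in the text.
        ring = [a_start, b_start, a_end, b_end]
        for i in range(4):
            p, q = ring[i], ring[(i + 1) % 4]
            if math.dist(p, q) >= too_far:
                continue  # too far apart: ignore this pair
            parent[find(p)] = find(q)  # same cluster, merging if necessary

    groups = {}
    for p in parent:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())


# Two crossings that share the segment ((0, 1), (10, -5)).
crossings = [
    (((0.0, 0.0), (10.0, 10.0)), ((0.0, 1.0), (10.0, -5.0))),
    (((0.0, 1.0), (10.0, -5.0)), ((1.0, 1.0), (3.0, -8.0))),
]
clusters = cluster(crossings, too_far=2.0)
```

Endpoints that never pass the distance filter are never registered, so they remain outliers. In this toy input the far ends (10, 10), (10, -5), and (3, -8) are left out, while the three nearby endpoints form a single cluster through the shared point (0, 1).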
Algorithm 4 Clustering(X, tooFar)
 1: Array C = ∅
 2: for each Intersection X ∈ X do
 3:   list(La, Lb) = X.getIntersectedLines()
 4:   list(Pastart, Paend) = La.getBothPoints()
 5:   list(Pbstart, Pbend) = Lb.getBothPoints()
 6:   Array P = [Pastart, Pbstart, Paend, Pbend]
 7:   for i = 1 to 4 do
 8:     Px = P[i − 1]
 9:     Py = P[i mod |P|]
10:     if distance(Px, Py) ≥ tooFar then
11:       continue
12:     end if
13:     if Px does not belong to any cluster then
14:       if Py does not belong to any cluster then
15:         C = newCluster()
16:         Px.belongTo(C)
17:         Py.belongTo(C)
18:         C.push(C)
19:       else
20:         C = Py.belongedCluster()
21:         Px.belongTo(C)
22:       end if
23:     else
24:       if Py does not belong to any cluster then
25:         C = Px.belongedCluster()
26:         Py.belongTo(C)
27:       else
28:         Cx = Px.belongedCluster()
29:         Cy = Py.belongedCluster()
30:         C′ = merge(Cx, Cy)
31:         C.remove(Cx, Cy)
32:         C.push(C′)
33:       end if
34:     end if
35:   end for
36: end for
37: return C

In other cases, both points are labeled as belonging to the same cluster. If either one already belongs to cluster Cx, then the other point is added to cluster Cx. If neither point belongs to any cluster, then a new cluster is created and both points are added to the cluster. If both points already belong to clusters, the two clusters are merged, as in Algorithm 4. The time complexity of the clustering is O(m), where m is the total number of intersections.

Figure 6: Datasets (toy example and real data). Panels: (a) Crossroad, (b) Bridge, (c) Noisy, (d) Tetris, (e) Midtown Manhattan, (f) Mt. Fuji and Fuji Five Lakes. The toy-example panels are plotted on X and Y axes.

Table 1: Number of Points of Each Toy Dataset

# of Traj.      80     160     320     640    1260
Crossroad     1218    2456    5081    9610   19297
Bridge         810    1678    3259    6358   12920
Noisy         1145    1968    4164    8269   16921
Tetris        2128    3944    8420   17470   33475

4. EXPERIMENTAL RESULTS

In this section, we demonstrate the effectiveness of EBSCAN with various datasets.
These experiments were designed for the following purposes:

1. To compare three intersection-finding algorithms.
2. To compare EBSCAN to a competitor.
3. To determine if tooFar is sufficiently simple and effective.
4. To evaluate if EBSCAN is effective with a real dataset.

Implementation. We implemented EBSCAN, which supports three intersection-finding algorithms. It is written entirely in JavaScript for the node.js environment³.

³ The implementation of three intersection-finding algorithms is shared on GitHub. https://github.com/abarth500/line-segment-intersection.

Figure 7: Execution time of preprocessing. Axes: Number of Points (0 to 20000) vs. Execution Time (s). Series: Brute Force (Toy Example), R-tree (Toy Example), Brute Force (Real Dataset), R-tree (Real Dataset), Bentley-Ottmann (Toy Example).

Figure 8: Execution time of clustering. Axes: Number of points (0 to 20000) vs. Execution Time (s), logarithmic scale from 0.001 to 1000. Series: DBSCAN and EBSCAN on the datasets Crossroad, Bridge, Noisy, and Tetris.

Figure 9: Clustering result in case of dataset Bridge. Panels: (a) DBSCAN [Eps=6/MinPts=40], (b) EBSCAN [tooFar=4], (c) DBSCAN [Eps=8/MinPts=40], (d) EBSCAN [tooFar=6], (e) DBSCAN [Eps=10/MinPts=40], (f) EBSCAN [tooFar=200].

Environment. Our experiments were conducted on an Intel Next Unit of Computing (NUC) with an Intel Core i7-5557U processor, 16 GB of memory, and a 256 GB SSD.

Dataset. We used four toy examples with five size factors and two real datasets to achieve the purposes described above. Figure 6 presents the datasets. We created a data generator⁴ for this experiment and generated four datasets (a) to (d). We generated five sizes of each dataset. The sizes of the data were 80, 160, 320, 640, and 1260 trajectories. In total, we used 20 datasets for the toy example. Figure 6 shows the points in the experiment that used 160 trajectories. It is difficult to draw all the trajectory lines in the figure.
Therefore, we have drawn only the points, which are shown as red crosses. The approximate trend of the trajectories is presented as a blue arrow. The number of points contained in each dataset is given in Table 1.

We also used real data consisting of geotagged photographs from Flickr. Midtown Manhattan (e) and Mount Fuji (f) in Figure 6 are datasets that have different numbers of photographs on different scales of regions. Midtown Manhattan has 13,802 geotags on 320 trajectories; Mount Fuji has 23,332 geotags on 640 trajectories. We used the real datasets to estimate the practical efficiency of the proposed EBSCAN.

⁴ To ensure repeatability, we publish it on Gist/GitHub. See https://gist.github.com/anonymous/01a23f7bd8ac0b7a0c25.

Competitor. We used DBSCAN for comparison with EBSCAN. DBSCAN was implemented using JavaScript and was executed on node.js for fairness of the experiments. We used a k-d tree enabled O(n log n) DBSCAN.

4.1 Execution Time of Preprocessing

First, we investigated the efficiency and scalability of three implementations of the intersection-finding algorithms described above: brute force, Bentley-Ottmann, and R-tree.

Figure 7 shows the execution time needed to build the intersection database. The slowest algorithm was Bentley-Ottmann. We did not expect this, because the Bentley-Ottmann algorithm is known as the most efficient algorithm for finding intersections; however, it is also known to be sensitive to floating-point rounding errors [6]. A discussion of these concerns is beyond the scope of this paper. We used the R-tree algorithm to build the intersection database for the following evaluations.

4.2 Execution Time of Clustering

Figure 8 illustrates the execution time of EBSCAN and of DBSCAN using a k-d tree. The experiment used the four toy example datasets with five size factors. This result shows that EBSCAN is more than 10 times faster than DBSCAN in almost all cases of our datasets.
4.3 Awareness of Geographical Features
We described in Section 1.1 that EBSCAN can obtain clusters that consider geographical features. The Bridge is a toy example that includes geographical barriers, such as those shown in Figure 2. Figure 9 shows the clusters obtained from the Bridge using both EBSCAN and DBSCAN. The east and west chunks are connected, but the north and south chunks are not. If the dataset is divided into two clusters, then one should be composed of the two northern chunks, and the other should be composed of the two southern chunks. However, DBSCAN often commits an error.

Figure 9(a) shows the result of using DBSCAN when Eps = 4 and MinPts = 40. Figure 9(b) presents the result of using EBSCAN when tooFar = 4. The two results show five and three clusters, respectively. These results were insufficient, because we wanted to find two clusters from four dense regions. However, the results in Figure 9(c) and 9(d) show two different clusters. Figure 9(d) shows that the result of using EBSCAN with parameter tooFar = 6 is divided at the gap between north and south. Conversely, Figure 9(c) shows that the result of using DBSCAN with Eps = 8 and MinPts = 40 is divided at a gap between east and west despite the connection. These results indicate an advantage of the proposed entanglement-based clustering. In contrast with DBSCAN, which finds dense regions directly in the dataset, EBSCAN focuses on entanglements of trajectories; thereby, EBSCAN perceives trivial but important connections between chunks in the dataset.

Figure 9 also shows another advantage of using EBSCAN. If the value of Eps in DBSCAN is slightly too high, then the clusters merge into one cluster, as shown in Figure 9(e). In contrast, even if the value of tooFar is extremely high, EBSCAN still perceives the gap between the north and south and yields a suitable number of clusters, as shown in Figure 9(f).

4.4 Qualitative Assessment
In the next two sections, we evaluate EBSCAN using real datasets. First, we discuss the quality of the output clusters. EBSCAN places two points in the same cluster if the distance between them is less than the input parameter tooFar. Consequently, it could be said that the results obtained with the EBSCAN parameter tooFar are equivalent to those obtained with the DBSCAN parameter Eps when MinPts = 1. However, EBSCAN, which is based on an entanglement approach, extracts more suitable clusters than the density-based approach.

Figure 10: Clustering result for the Midtown Manhattan dataset. (a) EBSCAN (tooFar = 100 m); (b) DBSCAN (Eps = 100 m, MinPts = 1); (c) DBSCAN (Eps = 100 m, MinPts = 20).

Figure 10(a) and (b) show the differences between DBSCAN and EBSCAN when the values of tooFar and Eps are the same. We used Vincenty's formula [30] to calculate the distance for clustering; therefore, tooFar and Eps are specified in meters. Both tooFar and Eps were 100 m, and MinPts was 1. The outlines of region α in Figure 10(a) and (b) are similar, but EBSCAN can find some dense regions within the chunk, whereas DBSCAN shows only one big cluster. The results from DBSCAN also show small clusters at spots β2. These clusters consist of two or three geotagged photographs and should be labeled as noise. EBSCAN detected these noise regions correctly. To avoid this problem with DBSCAN, MinPts can be set to a high value; however, a high MinPts value causes a different problem. Figure 10(c) shows the result of using DBSCAN when MinPts = 20. In this case, DBSCAN recognized the noise at spots β3 and divided region α3 into several clusters. However, each cluster appears thin, because the number of points labeled as noise scales linearly with MinPts. Our entanglement-based approach solved the problem caused by the input parameters of DBSCAN.

4.5 Quantitative Assessment
Lastly, we discuss the overall execution time, which includes the database-building phase and the clustering phase. Figure 11 shows a dataset of photographs taken around Mount Fuji. First, we searched for suitable input parameters that classify the Fuji Five Lakes as five different clusters, because the number and sizes of the clusters affect the execution time needed for clustering. We specified 0.013 for tooFar, 0.012 for Eps, and 30 for MinPts.

Figure 11: Clustering result for the Mt. Fuji dataset. (a) Result of EBSCAN (tooFar = 0.013); (b) Result of DBSCAN (Eps = 0.012, MinPts = 30).

The execution times of EBSCAN and DBSCAN are shown in Figure 12. Finding intersections is a time-consuming task compared with k-d tree generation; however, these results show that EBSCAN is twice as fast as DBSCAN in overall execution time. In addition, the clustering phase is typically repeated many times to find good parameters and clusters in large geo-social datasets. In that case, the execution speed of the clustering phase is the most important part of the whole process. The result therefore also indicates that EBSCAN is suitable for use in a parameter survey for practical clustering purposes.

Figure 12: Execution time.
Phase                           EBSCAN     DBSCAN
Building database / k-d tree    5.341 s    0.090 s
Clustering                      1.639 s    15.739 s

5. CONCLUSIONS
In this study, we specifically examined the problem of developing a method for finding dense areas in a large spatial database, one that is particularly suitable for large datasets of geotags from SNSs. The entanglement-based approach proposed in this paper, EBSCAN, might bring about a new paradigm for improving density-based clustering.

• High quality. EBSCAN uses a set of trajectories instead of a set of points. Therefore, EBSCAN perceives trivial but important connections between adjacent dense areas in a dataset and unifies them into one cluster (Section 4.3). We regard this quality as extremely important when using a real dataset. Our experimentally obtained results also showed that EBSCAN works well on a real dataset of geotagged photographs obtained from Flickr (Sections 4.4 and 4.5).

• Fast. Our experiments show that EBSCAN executes faster than DBSCAN. On average, it is more than 10 times faster in the clustering phase and twice as fast in overall execution time (Sections 4.2 and 4.5).

• Simple. EBSCAN requires only a single parameter, tooFar, which represents a radius from each point and is used to find near neighbors. A larger number of input parameters makes it harder to find optimal values. EBSCAN's tooFar parameter is intuitive, and our experimentally obtained results showed that the correct number of clusters is obtained over a wide range of tooFar values (Section 4.3).

Future directions. This proposal is the first effort to use an entanglement-based approach to find dense regions in a large spatial database. Therefore, some issues remain. Intersection finding with an R-tree did not improve the execution speed compared with Brute-force, because the R-tree index was searched using a bounding box and returned many non-intersecting lines. We would like to improve the intersection-finding algorithm. We will also work to implement parallel processing during the clustering phase of EBSCAN.

6. ACKNOWLEDGMENTS
This research was partly supported by JSPS KAKENHI Grant Number 15K00421 and Grant-in-Aid for Research on Priority Areas, Tokyo Metropolitan University, "Research on social big data."

7. REFERENCES
[1] Gartner says worldwide pc, tablet and mobile phone combined shipments to reach 2.4 billion units in 2013. http://www.gartner.com/newsroom/id/2408515.
[2] Most social networks are now mobile first. http://www.statista.com/chart/2109/.
[3] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure.
In ACM Sigmod Record, volume 28, pages 49–60. ACM, 1999.
[4] J. L. Bentley and T. A. Ottmann. Algorithms for reporting and counting geometric intersections. IEEE Transactions on Computers, 100(9):643–647, 1979.
[5] D. Birant and A. Kut. St-dbscan: An algorithm for clustering spatial–temporal data. Data & Knowledge Engineering, 60(1):208–221, 2007.
[6] J.-D. Boissonnat and F. P. Preparata. Robust plane sweep for intersecting segments. SIAM Journal on Computing, 29(5):1401–1421, 2000.
[7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. Lof: identifying density-based local outliers. In ACM Sigmod Record, volume 29, pages 93–104, 2000.
[8] L. Chen and A. Roy. Event detection from flickr data through wavelet-based spatial analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 523–532, 2009.
[9] D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. Mapping the world's photos. In Proceedings of the 18th International Conference on World Wide Web, pages 761–770, 2009.
[10] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231, 1996.
[11] S. Gaffney and P. Smyth. Trajectory clustering with mixtures of regression models. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 63–72, 1999.
[12] F. Giannotti, M. Nanni, F. Pinelli, and D. Pedreschi. Trajectory pattern mining. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 330–339, 2007.
[13] A. Guttman. R-trees: A dynamic index structure for spatial searching. SIGMOD Record, 14(2):47–57, June 1984.
[14] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In KDD, volume 98, pages 58–65, 1998.
[15] S. Hussain and G. Hazarika. Mining volunteered geographic information datasets with heterogeneous spatial reference. International Journal of Advanced Computer Science and Applications Special Issue on Artificial Intelligence, 2011.
[16] A. Karami and R. Johansson. Choosing dbscan parameters automatically using differential evolution. International Journal of Computer Applications, 91(7):1–11, 2014.
[17] S. Kisilevich, M. Krstajic, D. Keim, N. Andrienko, and G. Andrienko. Event-based analysis of people's activities and behavior using flickr and panoramio geotagged photo collections. In 14th International Conference on Information Visualisation (IV), pages 289–296, 2010.
[18] S. Kisilevich, F. Mansmann, and D. Keim. P-dbscan: a density based clustering algorithm for exploration and analysis of attractive areas using collections of geo-tagged photos. In Proceedings of the 1st International Conference and Exhibition on Computing for Geospatial Research & Application, page 38. ACM, 2010.
[19] I. Lee, G. Cai, and K. Lee. Mining points-of-interest association rules from geo-tagged photos. In 46th Hawaii International Conference on System Sciences (HICSS), pages 1580–1588, 2013.
[20] J.-G. Lee, J. Han, and K.-Y. Whang. Trajectory clustering: a partition-and-group framework. In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 593–604, 2007.
[21] R. Lee, S. Wakamiya, and K. Sumiya. Exploring geospatial cognition based on location-based social network sites. World Wide Web, pages 1–26, 2014.
[22] J. Li, J. Sander, R. Campello, and A. Zimek. Active learning strategies for semi-supervised dbscan. pages 179–190. Springer International Publishing, 2014.
[23] M. Ruocco and H. Ramampiaro. Event clusters detection on flickr images using a suffix-tree structure. In IEEE International Symposium on Multimedia (ISM), pages 41–48, 2010.
[24] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pages 851–860, 2010.
[25] G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster: A multi-resolution clustering approach for very large spatial databases. In VLDB, volume 98, pages 428–439, 1998.
[26] J. Shi, N. Mamoulis, D. Wu, and D. W. Cheung. Density-based place clustering in geo-social networks. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD 2014), number EPFL-CONF-198425, 2014.
[27] M. Shirai, M. Hirota, S. Yokoyama, N. Fukuta, and H. Ishikawa. Discovering multiple hotspots using geo-tagged photographs. In Proceedings of the 20th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 490–493, 2012.
[28] K. Tamura and T. Ichimura. Density-based spatiotemporal clustering algorithm for extracting bursty areas from georeferenced documents. In IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2079–2084. IEEE, 2013.
[29] M. R. Vieira, P. Bakalov, and V. J. Tsotras. On-line discovery of flock patterns in spatio-temporal data. In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pages 286–295, 2009.
[30] T. Vincenty. Direct and inverse solutions of geodesics on the ellipsoid with application of nested equations. Survey Review, 23(176):88–93, 1975.
[31] S. Wakamiya, R. Lee, and K. Sumiya. Measuring crowd-sourced cognitive distance between urban clusters with twitter for socio-cognitive map generation. In International Conference on Emerging Databases, 2012.
[32] W. Wang, J. Yang, R. Muntz, et al. Sting: A statistical information grid approach to spatial data mining. In VLDB, volume 97, pages 186–195, 1997.
[33] Y. Yamaguchi, T. Amagasa, and H. Kitagawa. Landmark-based user location inference in social media. In Proceedings of the 1st ACM Conference on Online Social Networks, pages 223–234, 2013.
[34] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an efficient data clustering method for very large databases. In SIGMOD Record, volume 25, pages 103–114, 1996.
[35] Y.-T. Zheng, Z.-J. Zha, and T.-S. Chua. Mining travel patterns from geotagged photos. ACM Transactions on Intelligent Systems and Technology (TIST), 3(3):56, 2012.