EBSCAN: An Entanglement-based Algorithm for
Discovering Dense Regions in Large Geo-social Data
Streams with Noise
Shohei Yokoyama
Hiroshi Ishikawa
Ágnes Bogárdi-Mészöly
Faculty of Informatics,
Shizuoka University
Hamamatsu, Japan
Department of Automation and
Applied Informatics, Budapest
University of Technology and
Economics
[email protected]
Budapest, Hungary
Faculty of Informatics,
Shizuoka University
Hamamatsu, Japan
Faculty of System Design,
Tokyo Metropolitan University
Tokyo, Japan
[email protected]
[email protected]
ABSTRACT
The remarkable growth of social networking services on
global positioning system (GPS)-enabled handheld devices
has produced enormous amounts of georeferenced big data.
Given a large spatial dataset, the challenge is to effectively
discover dense regions from the dataset. Dense regions might
be the most attractive area in a city or the most dangerous
zone of a town. A solution to this problem can be useful
in many applications, including marketing, tourism, and
social research. Density-based clustering methods, such as
DBSCAN, are often used for this purpose. Nevertheless,
current spatial clustering methods emphasize density while
neglecting human behavior derived from geographical features. In this paper, we propose EBSCAN, which is based
on the novel idea of an entanglement-based approach. Our
method considers not only spatial information but also human behavior derived from geographical features. Another
problem is that competing methods such as DBSCAN have
two input parameters. Thus, it is difficult to determine
optimal values. EBSCAN requires only a single intuitive
parameter, tooFar, to discover dense regions. Finally, we
evaluate the effectiveness of the proposed method using
both toy examples and real datasets. Our experimentally
obtained results reveal the properties of EBSCAN and show
that it is >10 times faster than the competitor.
General Terms
Algorithms, Experimentation, Theory
Categories and Subject Descriptors
I.5.3 [PATTERN RECOGNITION]: Clustering—Algorithms; H.2.8 [DATABASE MANAGEMENT]:
Database Applications—Spatial databases and GIS
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission
and/or a fee. Request permissions from [email protected].
LBSN’15, November 03, 2015, Bellevue, WA, USA
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3975-9/15/11...$15.00
DOI: http://dx.doi.org/10.1145/2830657.2830661
Keywords
Geo-social data, Density-based clustering, Spatial index
1. INTRODUCTION
In accordance with the recent remarkable growth of smart
devices [e.g., smartphones and tablet personal computers
(PCs)], social networking services (SNSs) are shifting rapidly
to a mobile-first strategy[2]. Gartner has reported that users
can choose a tablet as their first computing device instead
of a PC[1]. These facts indicate that the user’s location
should be considered in data mining for SNSs. Therefore,
spatio-temporal analysis is a key tool for the analysis of the
behavior of SNS users.
Location information of SNS users is often given as check-in
and geotagged information. Check-in is an innovation
used by geo-social platforms, such as Swarm. Commonly
used SNSs, such as Facebook and Google+, now enable
users to report their location as a check-in. A geotag is
metadata consisting of at least the latitude and longitude
of the user’s location. It is stored with various media,
such as photographs, videos, and Tweets. Hereafter, the
term “geo-social data” is used to refer to such georeferenced
information related to SNSs.
We can readily access the huge volume of geo-social data shared on various
social media via Web APIs developed during the Web 2.0
movement. For example, it is possible to collect >160,000
geotagged photos¹ that were taken around the Colosseum in Rome and
uploaded to Flickr. Considering that Flickr photographs
taken around the Arc de Triomphe amount to 60,000²,
some might say that the Colosseum is more attractive than
the Arc de Triomphe. This is just one example, but it
appears reasonable to infer that the density of geotagged
photographs is an important index that can measure the
attention level of a place.
¹ See http://bit.ly/1tLCHew.
² See http://bit.ly/1sX4fd0.
Applying density-based data mining techniques to detect
attractive places is a major direction for social network
analysis. For example, DBSCAN[10], which is a typical
density-based clustering method, is used in various research
areas, including landmark detection, travel pattern mining,
and event detection. DBSCAN is a clustering algorithm
designed for a large spatial database. The procedure
requires two parameters: Epsilon (Eps) and minimum
points (MinPts). It starts with an arbitrary object. If
at least the minimum number of objects (MinPts) exists
around the selected object within a given radius (Eps), then
a cluster is created. Otherwise, the object is labeled as
noise. DBSCAN is used in social network analysis due to
the following reasons:
• DBSCAN does not require the number of clusters as
its input parameter. Generally, users do not know,
but want to know, the number of dense regions in a
database.
• DBSCAN can find non-linearly separable clusters. Actual geographical clusters often have non-linear shapes.
• DBSCAN is robust for discovering outliers. This
feature is suitable for geo-social data, because the data
includes various errors caused by global positioning
system (GPS) accuracy, human factors, and intentional spam.
Despite its benefits, DBSCAN also has some shortcomings.
A major disadvantage of DBSCAN is that it requires two
input parameters, which makes it difficult to use.
Figure 1: DBSCAN results in case of different input
parameter settings. Panels: (a) Eps=4, MinPts=15; (b) Eps=4, MinPts=30; (c) Eps=8, MinPts=15; (d) Eps=8, MinPts=30.
Figure 2: Difference between data and actual geographical features. Panels: (a) Geospatial data; (b) Geographical features (labels: River, Wall, Bridge).
In this paper, we propose a novel entanglement-based clustering algorithm, EBSCAN, that is designed for geo-social
data. EBSCAN requires only one explicit input parameter.
It is able to discover dense regions in a spatial database, similar to DBSCAN. Therefore, EBSCAN can be an alternative
to the density-based DBSCAN and its followers.
1.1 Problem Statement
In this section, we address the problems associated with
density-based spatial clustering.
Input Parameter. As described above, DBSCAN has two
parameters: Eps and MinPts. Figure 1 presents the results
of using DBSCAN with various parameter values. Result
(a) is the best result of the four, but all points in the second
and third chunks in the leftmost column are labeled as noise,
because the chunk density fell short of the level required by
the input parameters. However, Result (d) shows that the
second chunk from the bottom in the leftmost column was
detectable as a cluster, but it formed one cluster consisting
of four chunks in the second column from the left. In Result
(b), many outer rim points are labeled as noise. In contrast,
no point is detected as noise in Result (c), but all chunks
are gathered into one cluster. Consequently, DBSCAN is
sensitive to its input parameters. Furthermore, it is difficult
to ascertain the appropriate parameter values of Eps and
MinPts. In fact, knowing the appropriate number of points
(MinPts) within an Eps radius from any point is difficult.
Geographical Features. Figure 2 presents a toy example
of geo-social data and geographical features of the place.
Despite the wall that divides the field into north and
south, a density-based approach might output two clusters:
east (blue) and west (red). In the case shown in Figure 2,
the north and the south sectors are connected by bridges,
but are divided completely by the wall. Therefore, blue and
red clusters must be divided into two clusters. However,
a density-based approach might not only overlook such
geographical features but might also be confused by a GPS
error, human error, or rounding values.
1.2 Contributions
To avoid the weakness of density-based clustering, we
propose a novel entanglement-based algorithm for clustering
that uses a geo-social trajectory to discover high-density
areas. We focus on GPS errors, round-off errors, and human
errors in geo-social data. Figure 3 presents an example of the
trajectories of (a) actual behavior and (b) a recorded GPS
log. As the figure shows, recorded trajectories get entangled
spontaneously even when they travel alongside each other.
We specifically examine the entanglement of the geo-social
trajectory.
Figure 3: Difference between actual and recorded
trajectories. Panels: (a) actual trajectory; (b) recorded trajectory (axes x and y).
Input Data. We designed EBSCAN to cluster large collections of geotagged content, e.g., photographs and Tweets. Therefore,
EBSCAN requires a set of sequences of georeferenced
documents. In the case of Flickr, a georeferenced document
is a geotagged photo, the sequence is a user’s photostream,
and the input data is a set of photostreams.
Output Cluster. After the clustering phase, EBSCAN
discovers clusters and outliers in the dataset. The cluster
contains georeferenced documents that exist in the same
dense area. DBSCAN can also discover dense regions, but
the clusters from EBSCAN are more useful for social data
mining.
The main contributions of our geospatial clustering algorithm EBSCAN are summarized as follows.
Execution Time. As described in Section 4.2, the
clustering of EBSCAN is >10 times faster than
that of DBSCAN. EBSCAN takes longer to build a database
during the initialization process; however, our experimental
results showed that the overall execution time was twice as
fast as that of DBSCAN using a k-d tree index.
Input Parameter. EBSCAN requires only a single parameter, tooFar, which represents a radius from each point and
which is used to find near neighbors. An important problem
with DBSCAN is that it requires two input parameters.
In contrast with the input parameters of DBSCAN, the meaning of
tooFar is clear and intuitive. For example, if a given
sufficient distance to divide points into different clusters is
500 m, then tooFar takes a value of 500 m. In addition,
tooFar is not only easy to set; its optimal value can also be
readily estimated. Later in this paper, we also discuss the
optimal estimation of tooFar.
Geographical Features. EBSCAN uses a set of trajectories to find dense regions. Therefore, clusters that
consider geographical features are created. For example, if
trajectories a and d have no intersection, it might indicate
that a barrier exists between a and d, such as the wall shown
in Figure 2. EBSCAN perceives such geographical features.
It is noteworthy that the input data of EBSCAN are a
set of trajectories comprising multiple points. This is a
limitation of EBSCAN in contrast with DBSCAN, which
can be applied to a set of points. However, in almost all
cases, geo-social data are a set of trajectories. For example,
when a user uploads geotagged photographs to Flickr,
then the user’s photostream indicates the user trajectory.
DBSCAN might be more widely applicable than EBSCAN;
however, EBSCAN presents advantages when used with a
large geo-social dataset.
1.3 Outline
The remainder of this paper is organized as follows. Section
2 discusses the related work. Section 3 presents the proposed
EBSCAN algorithm. Section 4 shows the results of our
experimental evaluation. Section 5 includes our conclusions.
2. RELATED WORKS
The following survey of related works is categorized into two
types: spatial clustering and geodata network analysis.
Spatial Clustering. Spatial clustering is a powerful and
traditional partitioning method that includes k-means and
k-medoids. Partitioning methods are simple. In addition,
they work well with a spatial database. A predefined
parameter k that specifies the number of clusters in the
output must be decided before clustering. However, its use
is not reasonable in almost all cases for our geo-social clustering problem because the number of clusters is unknown
beforehand.
Another approach of spatial clustering includes hierarchical
techniques such as BIRCH[34]. Hierarchical clustering
methods need no predefined number of clusters, but they
lack termination criteria. Therefore, hierarchical clustering
cannot be a solution to our geo-social clustering problem.
Partitioning and hierarchical clustering are suitable for finding spherical clusters in a spatial database. However, a geographical cluster takes various arbitrary shapes. In contrast
with partitioning and hierarchical clustering, density-based
approaches are suitable for application to the geo-social
clustering problem. A representative density-based clustering method is DBSCAN[10], which can yield non-linearly
separable clusters. DBSCAN is extremely influential. It
has triggered many updates and extensions. OPTICS[3], an
extension of DBSCAN, can obtain various sizes of clusters
at different granularities. DENCLUE[14], a density-based
algorithm, uses the density distribution function on a computational mesh.
Density-based algorithms are also suitable for outlier detection. Moreover, DBSCAN can identify outlier points as a by-product of clustering a large spatial database.
Actually, LOF[7], a density-based algorithm, was initially
developed to optimize outlier detection. Entanglement-based EBSCAN is also able to identify outliers.
Many modified algorithms of DBSCAN have been proposed
for specific purposes. ST-DBSCAN [5] extends DBSCAN
for use with spatial databases that contain non-spatial and
temporal values. Tamura et al. proposed density-based clustering to discover local topics and events from a huge number of georeferenced documents on social media sites[28].
PDBSCAN [18] is a DBSCAN-based clustering method that
specializes in working with geotagged photographs.
DCPGS [26] is a noteworthy DBSCAN-based
clustering algorithm for geo-social network analysis, because it can
consider both spatial information and the social distance
between users who visit the clustered places. However,
DCPGS demands two additional input parameters, τ
and maxD, aside from Eps and MinPts. Therefore, the
setting of input parameters for DCPGS is more difficult than
it is for DBSCAN.
DBSCAN-based algorithms inherit the weaknesses
of DBSCAN and extend them. Our aim is to overcome the
weaknesses of DBSCAN.
Some studies have specifically examined the difficulty of
setting DBSCAN’s input parameters. BDE-DBSCAN[16] is
a wrapper of DBSCAN used for a repetitious parameter
survey. Li et al. tackled this problem using semi-supervised
machine learning [22]. However, a parameter survey for
DBSCAN does not indicate its efficiency.
Grid-based clustering techniques, e.g., STING[32] and
WaveCluster[25], have been proposed as high-efficiency algorithms. The salient advantage of using grid-based clustering
is its rapid processing, which depends on the number of cells
in each axis; however, the clustering result has low accuracy.
As described above, EBSCAN uses trajectories on SNSs.
Lastly, we explain trajectory clustering. Gaffney et al.[11]
proposed model-based trajectory clustering. Lee et al.[20]
proposed density-based trajectory clustering. Vieira et al.
[29] proposed flock pattern mining. Trajectory clustering
is a technique used for line-segment clustering aimed at
discovering a common sub-trajectory and classifying major
routes from a large real trajectory database. In contrast
with trajectory clustering, EBSCAN is a technique that
uses trajectories to perform point clustering to discover
high-density areas as clusters.
The most closely related work in trajectory mining is
research concerning region of interest (RoI) detection using a
large trajectory dataset. Gianotti et al. proposed an efficient
density-based approach[12].
Geo-social data analysis. In recent years, an explosion
of interest has arisen in geo-social data analysis. Sakaki
et al.[24] proposed keyword-based models using Twitter to
automatically identify where and when earthquakes occur.
Yamaguchi et al.[33] used a large Twitter dataset to infer
users’ home locations. Crandall et al.[9] proposed a mapping
system using a combination of textual and visual features
from a large photographic database that was “crawled” from
Flickr. Ruocco et al.[23] used Suffix Tree Clustering to
discover events and their occurrence time and position from
Flickr photographs.
DBSCAN is used for geo-social data analysis. Particularly,
DBSCAN is often applied to discover points of interest
(PoIs) in geo-social data analyses. Related works that use
DBSCAN are approximately of two types: (1) photograph-based analysis [8, 15, 17, 19, 27, 35] and (2) Tweet-based
analysis[21, 31].
In addition, DBSCAN is widely applicable and is used not
only for PoIs[19], but also for travel pattern mining[35],
landmark shape detection [27], urban cluster generation [31],
and event detection [8]. EBSCAN presents some possibilities
for replacing DBSCAN in such research efforts.
Figure 4: An outline of clustering procedure using
EBSCAN. Diagram labels: Flickr, Twitter, Crawling; set of geo-social datastreams T, in which each trajectory (e.g., T1 of User foo) is a sequence of georeferenced documents {lat,lon}; (1) Building Database, with the set of line segments L (each line segment has Start and End points) and the intersection DB X (each intersection refers to its intersecting lines); (2) Clustering, with the output clusters C (C1, C2, C3) used for RoI/PoI analysis.
Figure 5: Key idea of entanglement-based approach. Diagram labels: trajectories a, b, c, and d; points Pa1, Pa2, Pb1, Pb2, Pc1, and Pd1; line segments La1→2 and Lb1→2; intersections X(a1→2,b1→2) and X(b2→3,c2→3); radius tooFar.
3. ENTANGLEMENT-BASED CLUSTERING
Next, we propose an entanglement-based clustering algorithm, EBSCAN, which is able to discover dense regions
from spatial databases using entanglements of trajectories.
The procedure of EBSCAN has two steps: (1) building an intersection database and (2) clustering. An outline of the EBSCAN
algorithm is illustrated in Figure 4.
Key Idea. Consider a set of georeferenced documents to
be clustered. EBSCAN classifies the documents as Near
Neighbor points and outliers.
Definition 1. (Near Neighbor) Given a line Lxn→n+1
between points Pxn and Pxn+1 on trajectory x, the neighbor
points P are the points at both ends of the lines that intersect Lxn→n+1.
When the distance between a point P′ ∈ P and Pxn is less
than the value of input parameter tooFar, then P′ is called
a Near Neighbor of Pxn.
Algorithm 1 BuildDatabase(T)
 1: Lines L = ∅
 2: for each Trajectory T ∈ T do
 3:   Pprev = null
 4:   for each Point P ∈ T do
 5:     if Pprev ≠ null then
 6:       L.push(new Line(P, Pprev))
 7:     end if
 8:     Pprev = P
 9:   end for
10: end for
11: X = GetAllIntersections(L)
12: return X
Definition 2. (Outlier) All points P that do not have
more than one Near Neighbor are considered Outliers.
Finally, clusters are composed on the basis of a simple
definition as follows:
Definition 3. (Cluster) Any point P in the dataset must
belong to the same cluster as all of its Near Neighbor points.
This is the key idea of the proposed EBSCAN. Figure 5 shows
an example representing the key idea of EBSCAN.
Example 1. This example uses six points {Pa1, Pa2, Pb1,
Pb2, Pc1, Pd1} on four trajectories {a, b, c, d}, and two
intersections {X(a1→2, b1→2), X(b2→3, c2→3)}.
First, the intersections are analyzed. For the intersection
X(a1→2, b1→2), the near neighbor is {Pa2, Pb2}. Therefore,
both Pa2 and Pb2 belong to the same cluster C. Next,
the near neighbor related to X(b2→3, c2→3) is {Pb2, Pc1}.
However, Pb2 already belongs to cluster C. In this case,
cluster C includes Pc1.
For Pa1 and Pd1, the distance between the two points is less
than the value of the input parameter tooFar. However, the
lines connected to Pd1 and Pa1 do not mutually intersect.
Using DBSCAN, these might be classified as part of the same
cluster. However, EBSCAN can detect that trajectories a
and d do not intersect near points Pd1 and Pa1.
Finally, cluster C is obtained, which contains {Pa2, Pb2, Pc1},
together with the outliers {Pa1, Pb1, Pd1}.
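As an illustration of Definition 1, the following Python sketch lists the Near Neighbor pairs produced by a single intersection. It is a minimal sketch under stated assumptions and not the authors' implementation: the function name near_neighbor_pairs, the planar toy coordinates, and the use of Euclidean distance are hypothetical choices, and the two segments are assumed to have already been found to intersect.

```python
import math

def near_neighbor_pairs(seg_a, seg_b, too_far):
    """Return the endpoint pairs of two intersecting segments that are closer than too_far.

    seg_a and seg_b are ((x, y), (x, y)) tuples (hypothetical planar coordinates).
    The four pairs are formed across the two segments, in the order
    (a_start, b_start), (b_start, a_end), (a_end, b_end), (b_end, a_start).
    """
    a_start, a_end = seg_a
    b_start, b_end = seg_b
    ring = [a_start, b_start, a_end, b_end]
    pairs = []
    for i in range(4):
        p, q = ring[i], ring[(i + 1) % 4]
        if math.dist(p, q) < too_far:      # Euclidean distance is an assumption
            pairs.append((p, q))
    return pairs

# Segment a runs from Pa1 to Pa2, segment b from Pb1 to Pb2; they cross each other.
seg_a = ((0, 0), (10, 10))
seg_b = ((0, 8), (10, 9))
pairs = near_neighbor_pairs(seg_a, seg_b, too_far=2)
print(pairs)
```

With too_far = 2, only the end points of the two segments are close enough to each other, which mirrors the first step of Example 1, where Pa2 and Pb2 become Near Neighbors.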
3.1 Preprocessing Algorithm
The input of EBSCAN is a target dataset T, which is a set
of trajectories, and the parameter tooFar.
First, the given trajectories T are divided into a set of line
segments L. The time complexity of the preprocessing (Lines
1-10 of Algorithm 1) is O(n), where n is the total number of
points. The function GetAllIntersections, which is described
in the next section, is the most complicated part of EBSCAN
and takes L as an argument to return a set of intersections
X. Finally, BuildDatabase returns X for the next clustering
phase.
Algorithm 2 GetAllIntersectionsBF(L)
 1: X = ∅
 2: for each Line L ∈ L do
 3:   for each Line L′ ∈ L do
 4:     if L and L′ are intersected then
 5:       X.push(new Intersection(L, L′))
 6:     end if
 7:   end for
 8: end for
 9: return X
Algorithm 3 GetAllIntersectionsRT(L)
 1: X = ∅
 2: R = RTree(∅)
 3: for each Line L ∈ L do
 4:   L′ = R.search(bbox(L))
 5:   for each Line L′ ∈ L′ do
 6:     if L and L′ are intersected then
 7:       X.push(new Intersection(L, L′))
 8:     end if
 9:   end for
10:   R.insert(L)
11: end for
12: return X
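The segment-building step (Lines 1-10 of Algorithm 1) can be sketched in Python as follows. This is an illustrative sketch only: the name build_segments and the representation of a trajectory as a list of coordinate tuples are assumptions, not part of the published implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]      # one georeferenced document, e.g. (lat, lon)
Segment = Tuple[Point, Point]    # (previous point, current point)

def build_segments(trajectories: List[List[Point]]) -> List[Segment]:
    """Split every trajectory into its consecutive line segments."""
    segments: List[Segment] = []
    for trajectory in trajectories:
        prev = None
        for point in trajectory:
            if prev is not None:
                segments.append((prev, point))
            prev = point
    return segments

trajectories = [
    [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)],   # three points give two segments
    [(0.0, 1.0), (2.0, 1.0)],               # two points give one segment
    [(5.0, 5.0)],                           # a single point gives no segment
]
segments = build_segments(trajectories)
print(len(segments))
```

Each point is visited exactly once, which is consistent with the O(n) bound stated for the preprocessing.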
3.2 Algorithms for Finding Intersections
The most complicated part of DBSCAN, which is a competitor of EBSCAN, is finding the k-nearest neighbor points. The
time complexity of DBSCAN is O(n^2). However, if a spatial
access method such as a k-d tree is used for the nearest-neighbor search, then the time complexity is O(n log n). In
comparison with DBSCAN, the most complicated part of
EBSCAN is finding the intersections between line segments
(Line 11 of Algorithm 1). In this paper, we describe three
methods for finding intersections.
Brute-force Search. The first is a brute-force search
algorithm (Algorithm 2). This extremely simple algorithm
includes two nested loops. The time complexity of the brute-force search is O(n^2). This is not an efficient algorithm;
however, we implemented it as a baseline for EBSCAN.
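A brute-force search in the spirit of Algorithm 2 can be sketched as follows. The sketch rests on assumptions that the paper does not state: the intersection test is the standard orientation (cross-product sign) test, only proper crossings are reported (segments that merely touch at an endpoint or overlap collinearly are ignored), and each unordered pair is tested once rather than in both orders. All function names are hypothetical.

```python
from itertools import combinations

def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)

def segments_cross(s1, s2):
    """True if the two segments properly cross, i.e. each one separates
    the endpoints of the other."""
    (p1, q1), (p2, q2) = s1, s2
    return (orientation(p1, q1, p2) * orientation(p1, q1, q2) < 0
            and orientation(p2, q2, p1) * orientation(p2, q2, q1) < 0)

def all_intersections_bf(segments):
    """Test every unordered pair of segments: O(n^2) intersection tests."""
    return [(a, b) for a, b in combinations(segments, 2) if segments_cross(a, b)]

segments = [
    ((0, 0), (2, 2)),
    ((0, 2), (2, 0)),   # crosses the first segment at (1, 1)
    ((3, 3), (4, 4)),   # collinear with the first segment, crosses nothing
]
found = all_intersections_bf(segments)
print(found)
```

Whether touching or collinear segments should count as intersections is a design decision; for example, consecutive segments of one trajectory always share an endpoint, so counting touches would report every such pair.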
Bentley-Ottmann. A sweep-line algorithm is a technique
to list intersections from a large line dataset. The Bentley-Ottmann algorithm[4] is known as an efficient algorithm
based on the sweep line. The Bentley-Ottmann algorithm
uses indexes of two types to find intersections efficiently:
a heap and a balanced binary tree. However, when the Bentley-Ottmann algorithm is applied to
EBSCAN, the limited precision of the coordinates of intersections
calculated from two intersected lines causes trouble. The
algorithm is not difficult to understand; however, efficient
implementation is difficult.
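The precision issue can be illustrated with a small Python sketch that computes the intersection point of two lines once with floating-point numbers and once with exact rational numbers. This is not part of the Bentley-Ottmann algorithm itself and is not taken from the paper; the helper names and the use of Python's fractions module are assumptions made for illustration.

```python
from fractions import Fraction

def cross(ax, ay, bx, by):
    """2D cross product of the vectors (ax, ay) and (bx, by)."""
    return ax * by - ay * bx

def intersection_point(s1, s2):
    """Intersection of the lines through s1 and s2 (assumed not parallel)."""
    (x1, y1), (x2, y2) = s1
    (x3, y3), (x4, y4) = s2
    d1x, d1y = x2 - x1, y2 - y1
    d2x, d2y = x4 - x3, y4 - y3
    # Solve p1 + t * d1 = p3 + u * d2 for t by crossing both sides with d2.
    t = cross(x3 - x1, y3 - y1, d2x, d2y) / cross(d1x, d1y, d2x, d2y)
    return (x1 + t * d1x, y1 + t * d1y)

def exact(segment):
    """Convert the coordinates of a segment to exact rational numbers."""
    return tuple((Fraction(x), Fraction(y)) for x, y in segment)

def offset_from_line(segment, point):
    """Cross product that is zero exactly when point lies on the line through segment."""
    (x1, y1), (x2, y2) = segment
    return cross(x2 - x1, y2 - y1, point[0] - x1, point[1] - y1)

s1 = ((0, 0), (1, 3))
s2 = ((0, 1), (3, 0))
float_point = intersection_point(s1, s2)                  # floating-point arithmetic
exact_point = intersection_point(exact(s1), exact(s2))    # rational arithmetic
print(float_point, exact_point)
```

With rational arithmetic the computed point lies exactly on both lines, whereas a floating-point result is only guaranteed to be close. A sweep-line algorithm orders its events by such computed coordinates, so small rounding differences can change the event order.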
R-tree Search. Algorithm 3 uses the R-tree[13]
index to find candidates that might intersect. A search
by bounding box (Line 4 of Algorithm 3) might extract
many non-intersecting lines; therefore, each line of L′ must
be tested to determine whether it intersects or not. It
is not an efficient process, but it is simpler than using
Bentley-Ottmann.
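The filter-and-refine structure of Algorithm 3 can be sketched as follows. To keep the example self-contained, a plain list that is scanned linearly stands in for the R-tree, so the sketch shows only the bounding-box filtering and the subsequent exact test, not the logarithmic search of a real R-tree. The function names and the proper-crossing test are assumptions.

```python
def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)

def segments_cross(s1, s2):
    """True if the two segments properly cross each other."""
    (p1, q1), (p2, q2) = s1, s2
    return (orientation(p1, q1, p2) * orientation(p1, q1, q2) < 0
            and orientation(p2, q2, p1) * orientation(p2, q2, q1) < 0)

def bbox(segment):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a segment."""
    (x1, y1), (x2, y2) = segment
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

def bboxes_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def all_intersections_filtered(segments):
    """Return (intersections, number of exact tests performed)."""
    index = []      # stand-in for the R-tree: list of (bbox, segment)
    found = []
    tested = 0
    for segment in segments:
        box = bbox(segment)
        # Filter step: keep only previously inserted segments whose boxes overlap.
        candidates = [other for other_box, other in index
                      if bboxes_overlap(box, other_box)]
        # Refine step: candidates may still be non-intersecting.
        for other in candidates:
            tested += 1
            if segments_cross(segment, other):
                found.append((other, segment))
        index.append((box, segment))
    return found, tested

s1 = ((0, 0), (2, 2))
s2 = ((0, 2), (2, 0))           # crosses s1
s3 = ((1.0, 0.2), (1.5, 0.3))   # box overlaps s1 and s2, but crosses neither
s4 = ((10, 10), (12, 12))       # far away, never becomes a candidate
found, tested = all_intersections_filtered([s1, s2, s3, s4])
print(found, tested)
```

In this toy input the filter reduces the six possible pairs to three exact tests, two of which are false candidates, which is the situation described for the bounding-box search.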
3.3 Clustering Algorithm
Algorithm 4 shows the main procedure of clustering. The
first argument of clustering is the intersection database X,
which is mentioned in previous sections. EBSCAN uses
tooFar as an input parameter.
At each step, the algorithm first obtains an intersection
X from X. X has two intersected line segments, La and
Lb. Both lines have a starting point and an ending point,
(Pastart, Paend) and (Pbstart, Pbend), respectively.
Next, four pairs of points are created. The pairs consist
of (Pastart, Pbstart), (Pbstart, Paend), (Paend, Pbend), and
(Pbend, Pastart). Subsequently, the distance between the
points of each pair is calculated. If the distance is greater
than or equal to the parameter tooFar, then the pair is ignored.
[Figure 6, panels (a) Crossroad, (b) Bridge, (c) Noisy, and (d) Tetris: point plots with axes X and Y.]
Algorithm 4 Clustering(X, tooFar)
 1: Array C = ∅
 2: for each Intersection X ∈ X do
 3:   list(La, Lb) = X.getIntersectedLines()
 4:   list(Pastart, Paend) = La.getBothPoints()
 5:   list(Pbstart, Pbend) = Lb.getBothPoints()
 6:   Array P = [Pastart, Pbstart, Paend, Pbend]
 7:   for i = 1 to 4 do
 8:     Px = P[i − 1]
 9:     Py = P[i mod |P|]
10:     if distance(Px, Py) ≥ tooFar then
11:       continue
12:     end if
13:     if Px does not belong to any cluster then
14:       if Py does not belong to any cluster then
15:         C = newCluster()
16:         Px.belongTo(C)
17:         Py.belongTo(C)
18:         C.push(C)
19:       else
20:         C = Py.belongedCluster()
21:         Px.belongTo(C)
22:       end if
23:     else
24:       if Py does not belong to any cluster then
25:         C = Px.belongedCluster()
26:         Py.belongTo(C)
27:       else
28:         Cx = Px.belongedCluster()
29:         Cy = Py.belongedCluster()
30:         C′ = merge(Cx, Cy)
31:         C.remove(Cx, Cy)
32:         C.push(C′)
33:       end if
34:     end if
35:   end for
36: end for
37: return C
[Figure 6, panels (e) Midtown Manhattan and (f) Mt. Fuji and Fuji Five Lakes: point plots with axes X and Y.]
Figure 6: Datasets (toy example and real data).
Table 1: Number of Points of Each Toy Dataset
# of Traj.    80      160     320     640     1260
Crossroad     1218    2456    5081    9610    19297
Bridge        810     1678    3259    6358    12920
Noisy         1145    1968    4164    8269    16921
Tetris        2128    3944    8420    17470   33475
Otherwise, both points are labeled as belonging to the
same cluster. If exactly one of them already belongs to a cluster Cx,
then the other point is added to cluster Cx. If neither point
belongs to any cluster, then a new cluster is created
and both points are added to it. If the points already belong to
clusters Cx and Cy, then the two clusters are merged.
The time complexity of the clustering is O(m), where m is
the total number of intersections.
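The clustering phase can be sketched in Python as follows. The sketch deliberately swaps in a union-find (disjoint-set) structure for the explicit merge and remove bookkeeping of Algorithm 4; this is a substitution made for brevity, not the authors' data structure. The function name cluster, the planar toy coordinates, and the Euclidean distance are likewise assumptions.

```python
import math

def cluster(intersections, too_far):
    """Group endpoints of intersecting segments that are closer than too_far.

    intersections is a list of (segment_a, segment_b) pairs, where each segment
    is ((x, y), (x, y)). Points that never join a cluster are simply absent
    from the result and can be treated as outliers.
    """
    parent = {}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    def union(p, q):
        for point in (p, q):
            parent.setdefault(point, point)
        root_p, root_q = find(p), find(q)
        if root_p != root_q:
            parent[root_q] = root_p          # merge the two clusters

    for (a_start, a_end), (b_start, b_end) in intersections:
        ring = [a_start, b_start, a_end, b_end]
        for i in range(4):
            p, q = ring[i], ring[(i + 1) % 4]
            if math.dist(p, q) < too_far:    # Euclidean distance is an assumption
                union(p, q)

    clusters = {}
    for point in parent:
        clusters.setdefault(find(point), set()).add(point)
    return list(clusters.values())

# Hypothetical toy input: segment a1->2 crosses b1->2, and b2->3 crosses c1->2.
intersections = [
    (((0, 0), (10, 10)), ((0, 8), (10, 9))),
    (((10, 9), (12, 12)), ((11, 9), (10, 13))),
]
clusters = cluster(intersections, too_far=2)
print(clusters)
```

As in Example 1, the point shared by both intersections links the two Near Neighbor pairs into a single cluster, and the remaining endpoints stay unclustered.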
4. EXPERIMENTAL RESULTS
In this section, we demonstrate the effectiveness of EBSCAN
with various datasets. These experiments were designed for
the following purposes:
1. To compare three intersection-finding algorithms.
2. To compare EBSCAN to a competitor.
3. To determine if tooFar is sufficiently simple and
effective.
4. To evaluate if EBSCAN is effective with a real dataset.
Implementation. We implemented EBSCAN, which supports three intersection-finding algorithms. It is written
entirely in JavaScript for the node.js environment³.
³ The implementation of three intersection-finding algorithms is shared on GitHub: https://github.com/abarth500/line-segment-intersection.
Figure 7: Execution time of preprocessing. Axes: Number of Points (x) and Execution Time (s) (y). Series: Brute-force (Toy Example), R-tree (Toy Example), Brute-force (Real Dataset), R-tree (Real Dataset), Bentley-Ottmann (Toy Example).
Figure 8: Execution time of clustering. Axes: Number of points (x) and Execution Time (s) (y, ticks from 0.001 to 1000). Series: datasets Crossroad, Bridge, Noisy, and Tetris; methods DBSCAN and EBSCAN.
Figure 9: Clustering result in case of dataset Bridge. Panels: (a) DBSCAN [Eps=6/MinPts=40]; (b) EBSCAN [tooFar=4]; (c) DBSCAN [Eps=8/MinPts=40]; (d) EBSCAN [tooFar=6]; (e) DBSCAN [Eps=10/MinPts=40]; (f) EBSCAN [tooFar=200].
Environment. Our experiments were conducted on an Intel
Next Unit of Computing (NUC) with an Intel
Core i7-5557U processor, 16 GB of memory, and a
256 GB SSD.
Dataset. We used four toy examples with five size factors
and two real datasets to achieve the purposes described
above. Figure 6 presents the datasets. We created a data
generator⁴ for this experiment and generated four datasets
(a)–(d). We generated five sizes of each dataset. The sizes of
the data were 80, 160, 320, 640, and 1260 trajectories. In total, we
used 20 datasets for the toy examples.
⁴ To ensure repeatability, we published it on Gist/GitHub. See https://gist.github.com/anonymous/01a23f7bd8ac0b7a0c25.
Figure 6 shows the points in the experiment that used 160
trajectories. It is difficult to draw all the trajectory lines in
the figure. Therefore, we have drawn only the points, which
are shown as red crosses. The approximate trend of the
trajectories is presented as a blue arrow. The number of
points contained in each dataset is described in Table 1.
We also used real data consisting of geotagged photographs
from Flickr. Midtown Manhattan (e) and Mount Fuji (f)
in Figure 6 are datasets that have different numbers
of photographs on different scales of regions. Midtown
Manhattan has 13,802 geotags on 320 trajectories; Mount
Fuji has 23,332 geotags on 640 trajectories. We used the real
datasets to estimate the practical efficiency of the proposed
EBSCAN.
Competitor. We used DBSCAN for comparison with
EBSCAN. DBSCAN was implemented using JavaScript and
was executed on node.js for fairness of the experiments. We
used a k-d-tree-enabled O(n log n) DBSCAN.
4.1 Execution Time of Preprocessing
First, we investigated the efficiency and scalability of three
implementations of the intersection-finding
algorithms. We implemented three different algorithms: brute-force, Bentley-Ottmann, and R-tree. Figure 7 shows the
execution time needed to build the intersection database.
The slowest algorithm was Bentley-Ottmann. We did
not expect this, because the Bentley-Ottmann
algorithm is known as the most effective algorithm to find
intersections; however, it is also known to be sensitive to
floating-point rounding errors [6]. A discussion of these
concerns is beyond the scope of this paper. We used the
R-tree algorithm to build an intersection database for the
following evaluations.
4.2 Execution Time of Clustering
Figure 8 illustrates the execution time of EBSCAN and
DBSCAN using a k-d tree. The experiment used the four toy example
datasets with five size factors. This result shows that EBSCAN
is >10 times faster than DBSCAN in almost all cases of our
datasets.
4.3 Awareness of Geographical Features
We described in Section 1.1 that EBSCAN can obtain clusters that consider geographical features. The Bridge is a toy
example that includes geographical barriers, such as those
shown in Figure 2. Figure 9 shows the clusters obtained
from the Bridge using both EBSCAN and DBSCAN. The
east and the west chunks are connected, but the north and
south chunks are not. If the dataset is divided into two
clusters, then one should be composed of the northern two
chunks, and the other should be composed of the southern
two chunks. However, DBSCAN often commits an error.
Figure 9(a) shows the result of using DBSCAN when Eps =
4 and MinPts = 40. Figure 9(b) presents the result of
using EBSCAN when tooFar = 4. The results show five
and three clusters, respectively. These results were insufficient, because
we wanted to find two clusters from four dense regions.
However, the results in Figure 9(c) and 9(d) show two
different clusters. Figure 9(d) shows that the result of using
EBSCAN with parameter tooFar = 6 is divided at the gap
between the north and the south. Conversely, Figure 9(c)
shows that the result of using DBSCAN with Eps = 8
and MinPts = 40 is divided with a gap between the east
and west despite the connection. These results indicate
an advantage when using the proposed entanglement-based
clustering. In contrast with DBSCAN, which finds dense
regions directly in the dataset, EBSCAN focuses on entanglements of trajectories; thereby, EBSCAN perceives trivial
but important connections between chunks in the dataset.
Figure 9 also shows another advantage of using EBSCAN. If the
value of Eps in DBSCAN is slightly high, then the clusters
will become one cluster, as shown in Figure 9(e). In contrast,
even if the value of tooFar is extremely high, EBSCAN still perceives
the gap between the north and south and shows a suitable
number of clusters, as shown in Figure 9(f).
Figure 10: Clustering result in case of dataset Midtown Manhattan. Panels: (a) EBSCAN (tooFar=100m); (b) DBSCAN (Eps=100m, MinPts=1); (c) DBSCAN (Eps=100m, MinPts=20).
Figure 11: Clustering result in case of dataset Mt.
Fuji. Panels: (a) Result of EBSCAN (tooFar=0.013); (b) Result of DBSCAN (Eps=0.012/MinPts=30).
4.4 Qualitative Assessment
In the next two sections, we evaluate EBSCAN using real
datasets. First, we discuss the quality of the output clusters.
EBSCAN places two points in the same cluster if the distance
between them is less than the input parameter tooFar.
Consequently, it could be said that the results of using the
EBSCAN parameter tooFar are equivalent to the results
of using the DBSCAN parameter Eps when MinPts = 1.
However, EBSCAN, which is based on an entanglement
approach, extracts more suitable clusters than the density-based approach.
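The comparison with DBSCAN at MinPts = 1 can be made concrete with a small sketch. When MinPts = 1, every point is a core point, so the DBSCAN clusters are the connected components of the graph that links points within the distance threshold. The code below illustrates only that distance-threshold grouping; it is not EBSCAN, which operates on trajectories. The function name `threshold_clusters`, the union-find helper, and the use of planar Euclidean distance with a strict comparison are assumptions made for the example.

```python
import math


def threshold_clusters(points, too_far):
    """Group point indices that are linked, directly or transitively,
    by a pairwise distance smaller than too_far."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) < too_far:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical toy input: a chain of three points, a pair, and an isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0), (5.4, 5.0), (9.0, 0.0)]
clusters = threshold_clusters(points, too_far=0.6)
```

Under this grouping the isolated point forms a cluster of its own rather than being labeled as noise, which is the behavior of DBSCAN at MinPts = 1.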
Figure 10(a) and (b) show the differences between DBSCAN
and EBSCAN when the values of tooFar and Eps
are the same. We used Vincenty's formula [30] to calculate
the distance for clustering; therefore, tooFar and Eps are
specified in meters. Both tooFar and Eps were 100 m, and
MinPts was 1. The outline of region α in Figure 10(a) and
(b) is similar, but EBSCAN can find some dense regions in the
chunk, whereas DBSCAN shows only one big cluster. The
results from DBSCAN also show small clusters in spots β2.
These clusters consist of two or three geotagged photographs
and should be labeled as noise. EBSCAN correctly labeled
these regions as noise.
To avoid this problem of DBSCAN, MinPts can be set to
a high value; however, a different problem appears because
of the high MinPts value. Figure 10(c) shows the result
of using DBSCAN when MinPts = 20. In this case,
DBSCAN recognized the noise in spots β3 and divided region
α3 into several clusters. However, each cluster appears thin,
because the number of points labeled as noise
scales linearly with MinPts. Our entanglement-based
approach solved the problem caused by the input parameters
of DBSCAN.
4.5 Quantitative Assessment
Lastly, we discuss the overall execution time, which includes
the database-building phase and the clustering phase. Figure 11
shows the clustering results for a dataset of photos taken around
Mount Fuji. First, we searched for suitable input parameters that
classify the Fuji Five Lakes as five different clusters, because the
number and sizes of the clusters affect the execution
time needed for clustering. We specified 0.013 for tooFar,
0.012 for Eps, and 30 for MinPts.
The execution times of EBSCAN and DBSCAN are shown
in Figure 12. Finding intersections is a time-consuming task
compared with k-d tree generation; however, these results
showed that EBSCAN is about twice as fast as DBSCAN in the
overall execution time.
Figure 12: Execution time.
                              EBSCAN     DBSCAN
Building database / k-d tree  5.341 s    0.090 s
Clustering                    1.639 s    15.739 s
In addition, the clustering phase is usually iterated
to find good parameters and clusters in large geo-social
datasets. In that case, the execution speed of the clustering
phase is the most important part of the whole process. The results
also indicate that EBSCAN is suitable for use with a
parameter survey for practical clustering purposes.
5. CONCLUSIONS
In this study, we specifically examined the problem of developing
a method for finding dense areas in a large spatial
database, one that is particularly suitable for large
datasets of geotagged posts on SNSs. The entanglement-based
approach proposed in this paper, EBSCAN, might
bring about a new paradigm for improving density-based
clustering.
• High quality. EBSCAN uses a set of trajectories
instead of a set of points. Therefore, EBSCAN
perceives trivial but important connections between
adjacent dense areas in a dataset and unifies them
into one cluster (Section 4.3). We regard this quality
as extremely important when using a real dataset.
Our experimentally obtained results also showed that
EBSCAN works well on a real dataset of geotagged
photographs obtained from Flickr (Sections 4.4 and
4.5).
• Fast. Our experiments show that EBSCAN executes
faster than DBSCAN. On average, it is more than
10 times faster during the clustering
phase and twice as fast for the entire execution
(Sections 4.2 and 4.5).
• Simple. EBSCAN requires only a single parameter,
tooFar, which represents a radius from each point
and is used to find near neighbors. A larger number
of input parameters makes it harder to find
optimal values. EBSCAN's tooFar parameter
is intuitive, and our experimentally obtained results
showed that the correct number of clusters is obtained
with a wide range of tooFar values (Section 4.3).
Future directions. This proposal is the first effort to use
an entanglement-based approach to find dense regions in
a large spatial database. Therefore, some issues remain.
The intersection-finding algorithm with the R-tree did not improve
the execution speed compared with the
Brute-force algorithm, because the R-tree index was searched using a
bounding box, which returned many non-intersecting lines.
We would like to improve the intersection-finding algorithm.
We will also work to implement parallel processing during
the clustering phase of EBSCAN.
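The bounding-box issue mentioned above can be illustrated with a small sketch. It does not use an R-tree; it only applies the same kind of bounding-box overlap test to every pair, which is enough to show how a long diagonal segment produces candidates that do not intersect it. The function names and the toy segments are assumptions made for the example, and `properly_intersect` deliberately ignores touching and collinear cases to stay short.

```python
from itertools import combinations


def bbox(seg):
    """Axis-aligned bounding box of a segment as (min_x, min_y, max_x, max_y)."""
    (x1, y1), (x2, y2) = seg
    return min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)


def boxes_overlap(a, b):
    """True if two bounding boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cross(o, p, q):
    """Cross product of the vectors (p - o) and (q - o)."""
    return (p[0] - o[0]) * (q[1] - o[1]) - (p[1] - o[1]) * (q[0] - o[0])


def properly_intersect(s, t):
    """True if s and t cross at a single interior point (touching is ignored)."""
    (a, b), (c, d) = s, t
    return (cross(a, b, c) * cross(a, b, d) < 0
            and cross(c, d, a) * cross(c, d, b) < 0)


# Hypothetical toy input: one long diagonal segment and three short ones.
segments = [((0, 0), (10, 10)), ((1, 8), (2, 9)), ((8, 1), (9, 2)), ((4, 6), (6, 4))]
boxes = [bbox(s) for s in segments]

# Candidate pairs: bounding boxes overlap (what a box-based index search returns).
candidates = [(i, j) for i, j in combinations(range(len(segments)), 2)
              if boxes_overlap(boxes[i], boxes[j])]
# True hits: candidates that actually cross.
hits = [(i, j) for i, j in candidates
        if properly_intersect(segments[i], segments[j])]
```

In this toy case the box filter returns three candidates, of which only one is a real intersection; the other two still require a full segment test.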
6. ACKNOWLEDGMENTS
This research was partly supported by JSPS KAKENHI
Grant Number 15K00421 and Grant-in-Aid for Research on
Priority Areas, Tokyo Metropolitan University, “Research
on social big data.”
7. REFERENCES
[1] Gartner says worldwide pc, tablet and mobile phone
combined shipments to reach 2.4 billion units in 2013.
http://www.gartner.com/newsroom/id/2408515.
[2] Most social networks are now mobile first.
http://www.statista.com/chart/2109/.
[3] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and
J. Sander. Optics: Ordering points to identify the
clustering structure. In ACM Sigmod Record,
volume 28, pages 49–60. ACM, 1999.
[4] J. L. Bentley and T. A. Ottmann. Algorithms for
reporting and counting geometric intersections. IEEE
Transactions on Computers, 100(9):643–647, 1979.
[5] D. Birant and A. Kut. St-dbscan: An algorithm for
clustering spatial–temporal data. Data & Knowledge
Engineering, 60(1):208–221, 2007.
[6] J.-D. Boissonnat and F. P. Preparata. Robust plane
sweep for intersecting segments. SIAM Journal on
Computing, 29(5):1401–1421, 2000.
[7] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander.
Lof: identifying density-based local outliers. In ACM
Sigmod Record, volume 29, pages 93–104, 2000.
[8] L. Chen and A. Roy. Event detection from flickr data
through wavelet-based spatial analysis. In Proceedings
of the 18th ACM Conference on Information and
Knowledge Management, pages 523–532, 2009.
[9] D. J. Crandall, L. Backstrom, D. Huttenlocher, and
J. Kleinberg. Mapping the world’s photos. In
Proceedings of the 18th International Conference on
World Wide Web, pages 761–770, 2009.
[10] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A
density-based algorithm for discovering clusters in
large spatial databases with noise. In KDD, volume 96,
pages 226–231, 1996.
[11] S. Gaffney and P. Smyth. Trajectory clustering with
mixtures of regression models. In Proceedings of the
5th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 63–72,
1999.
[12] F. Giannotti, M. Nanni, F. Pinelli, and D. Pedreschi.
Trajectory pattern mining. In Proceedings of the 13th
ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages
330–339, 2007.
[13] A. Guttman. R-trees: A dynamic index structure for
spatial searching. SIGMOD Record, 14(2):47–57, June
1984.
[14] A. Hinneburg and D. A. Keim. An efficient approach
to clustering in large multimedia databases with noise.
In KDD, volume 98, pages 58–65, 1998.
[15] S. Hussain and G. Hazarika. Mining volunteered
geographic information datasets with heterogeneous
spatial reference. International Journal of Advanced
Computer Science and Applications Special Issue on
Artificial Intelligence, 2011.
[16] A. Karami and R. Johansson. Choosing dbscan
parameters automatically using differential evolution.
International Journal of Computer Applications,
91(7):1–11, 2014.
[17] S. Kisilevich, M. Krstajic, D. Keim, N. Andrienko,
and G. Andrienko. Event-based analysis of people’s
activities and behavior using flickr and panoramio
geotagged photo collections. In 14th International
Conference on Information Visualisation (IV), pages
289–296, 2010.
[18] S. Kisilevich, F. Mansmann, and D. Keim. P-dbscan:
a density based clustering algorithm for exploration
and analysis of attractive areas using collections of
geo-tagged photos. In Proceedings of the 1st
International Conference and Exhibition on
Computing for Geospatial Research & Application,
page 38. ACM, 2010.
[19] I. Lee, G. Cai, and K. Lee. Mining points-of-interest
association rules from geo-tagged photos. In 46th
Hawaii International Conference on System Sciences
(HICSS), pages 1580–1588, 2013.
[20] J.-G. Lee, J. Han, and K.-Y. Whang. Trajectory
clustering: a partition-and-group framework. In
Proceedings of the 2007 ACM SIGMOD International
Conference on Management of Data, pages 593–604,
2007.
[21] R. Lee, S. Wakamiya, and K. Sumiya. Exploring
geospatial cognition based on location-based social
network sites. World Wide Web, pages 1–26, 2014.
[22] J. Li, J. Sander, R. Campello, and A. Zimek. Active
learning strategies for semi-supervised dbscan. pages
179–190. Springer International Publishing, 2014.
[23] M. Ruocco and H. Ramampiaro. Event clusters
detection on flickr images using a suffix-tree structure.
In IEEE International Symposium on Multimedia
(ISM), pages 41–48, 2010.
[24] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake
shakes twitter users: real-time event detection by
social sensors. In Proceedings of the 19th International
Conference on World Wide Web, pages 851–860, 2010.
[25] G. Sheikholeslami, S. Chatterjee, and A. Zhang.
Wavecluster: A multi-resolution clustering approach
for very large spatial databases. In VLDB, volume 98,
pages 428–439, 1998.
[26] J. Shi, N. Mamoulis, D. Wu, and D. W. Cheung.
Density-based place clustering in geo-social networks.
In Proceedings of the 2014 ACM SIGMOD
International Conference on Management of Data
(SIGMOD 2014), number EPFL-CONF-198425, 2014.
[27] M. Shirai, M. Hirota, S. Yokoyama, N. Fukuta, and
H. Ishikawa. Discovering multiple hotspots using
geo-tagged photographs. In Proceedings of the 20th
ACM SIGSPATIAL International Conference on
Advances in Geographic Information Systems, pages
490–493, 2012.
[28] K. Tamura and T. Ichimura. Density-based
spatiotemporal clustering algorithm for extracting
bursty areas from georeferenced documents. In IEEE
International Conference on Systems, Man, and
Cybernetics (SMC), pages 2079–2084. IEEE, 2013.
[29] M. R. Vieira, P. Bakalov, and V. J. Tsotras. On-line
discovery of flock patterns in spatio-temporal data. In
Proceedings of the 17th ACM SIGSPATIAL
International Conference on Advances in Geographic
Information Systems, pages 286–295, 2009.
[30] T. Vincenty. Direct and inverse solutions of geodesics
on the ellipsoid with application of nested equations.
Survey review, 23(176):88–93, 1975.
[31] S. Wakamiya, R. Lee, and K. Sumiya. Measuring
crowd-sourced cognitive distance between urban
clusters with twitter for socio-cognitive map
generation. In International Conference on Emerging
Databases, 2012.
[32] W. Wang, J. Yang, R. Muntz, et al. Sting: A
statistical information grid approach to spatial data
mining. In VLDB, volume 97, pages 186–195, 1997.
[33] Y. Yamaguchi, T. Amagasa, and H. Kitagawa.
Landmark-based user location inference in social
media. In Proceedings of the 1st ACM conference on
Online Social Networks, pages 223–234, 2013.
[34] T. Zhang, R. Ramakrishnan, and M. Livny. Birch: an
efficient data clustering method for very large
databases. In SIGMOD Record, volume 25, pages
103–114, 1996.
[35] Y.-T. Zheng, Z.-J. Zha, and T.-S. Chua. Mining travel
patterns from geotagged photos. ACM Transactions
on Intelligent Systems and Technology (TIST),
3(3):56, 2012.